I'm trying to implement the Rank-Order Clustering here is a link to the paper (which is a kind of agglomerative clustering) algorithm from scratch. I have read through the paper (many times) and I have an implementation that is working although it is a lot slower than I expect.
Here is a link to my Github which has instructions to download and run the Jupyter Notebook.
The algorithm:
Algorithm 1 Rank-Order distance based clustering
Input:
N faces, Rank-Order distance threshold t .
Output:
A cluster set C and an “un-grouped” cluster Cun.
1: Initialize clusters C = { C1, C2, … CN }
by letting each face be a single-element cluster.
2: repeat
3: for all pair Cj and Ci in C do
4: Compute distances DR( Ci, Cj) by (4) and DN(Ci, Cj) by (5).
5: if DR(Ci, Cj) < t and DN(Ci, Cj) < 1 then
6: Denote ⟨Ci, Cj⟩ as a candidate merging pair.
7: end if
8: end for
9: Do “transitive” merge on all candidate merging pairs.
(For example, Ci, Cj, Ck are merged
if ⟨Ci, Cj⟩ and ⟨Cj, Ck⟩ are candidate merging pairs.)
10: Update C and absolute distances between clusters by (3).
11: until No merge happens
12: Move all single-element clusters in C into an “un-grouped” face cluster Cun.
13: return C and Cun.
My implementation:
I have defined a Cluster
class like so:
class Cluster: def __init__(self): self.faces = list() self.absolute_distance_neighbours = None
A Face
class like so:
class Face: def __init__(self, embedding): self.embedding = embedding # a point in 128 dimensional space self.absolute_distance_neighbours = None
I have also implemented finding the rank-order distance (D^R(C_i, C_j))
and the normalized distance (D^N(C_i, C_j))
used in step 4
so these can be taken for granted.
Here is my implementation for finding the closest absolute distance between two clusters:
def find_nearest_distance_between_clusters(cluster1, cluster2): nearest_distance = sys.float_info.max for face1 in cluster1.faces: for face2 in cluster2.faces: distance = np.linalg.norm(face1.embedding - face2.embedding, ord = 1) if distance < nearest_distance: nearest_distance = distance # If there is a distance of 0 then there is no need to continue if distance == 0: return(0) return(nearest_distance) def assign_absolute_distance_neighbours_for_clusters(clusters, N = 20): for i, cluster1 in enumerate(clusters): nearest_neighbours = [] for j, cluster2 in enumerate(clusters): distance = find_nearest_distance_between_clusters(cluster1, cluster2) neighbour = Neighbour(cluster2, distance) nearest_neighbours.append(neighbour) nearest_neighbours.sort(key = lambda x: x.distance) # take only the top N neighbours cluster1.absolute_distance_neighbours = nearest_neighbours[0:N]
Here is my implementation of the rank-order clustering algorithm (assume that the implementation of find_normalized_distance_between_clusters
and find_rank_order_distance_between_clusters
is correct):
import networkx as nx def find_clusters(faces): clusters = initial_cluster_creation(faces) # makes each face a cluster assign_absolute_distance_neighbours_for_clusters(clusters) t = 14 # threshold number for rank-order clustering prev_cluster_number = len(clusters) num_created_clusters = prev_cluster_number is_initialized = False while (not is_initialized) or (num_created_clusters): print("Number of clusters in this iteration: {}".format(len(clusters))) G = nx.Graph() for cluster in clusters: G.add_node(cluster) processed_pairs = 0 # Find the candidate merging pairs for i, cluster1 in enumerate(clusters): # Only get the top 20 nearest neighbours for each cluster for j, cluster2 in enumerate([neighbour.entity for neighbour in \ cluster1.absolute_distance_neighbours]): processed_pairs += 1 print("Processed {}/{} pairs".format(processed_pairs, len(clusters) * 20), end="\r") # No need to merge with yourself if cluster1 is cluster2: continue else: normalized_distance = find_normalized_distance_between_clusters(cluster1, cluster2) if (normalized_distance >= 1): continue rank_order_distance = find_rank_order_distance_between_clusters(cluster1, cluster2) if (rank_order_distance >= t): continue G.add_edge(cluster1, cluster2) # add an edge to denote that these two clusters are to be merged # Create the new clusters clusters = [] # Note here that nx.connected_components(G) are # the clusters that are connected for _clusters in nx.connected_components(G): new_cluster = Cluster() for cluster in _clusters: for face in cluster.faces: new_cluster.faces.append(face) clusters.append(new_cluster) current_cluster_number = len(clusters) num_created_clusters = prev_cluster_number - current_cluster_number prev_cluster_number = current_cluster_number # Recalculate the distance between clusters (this is what is taking a long time) assign_absolute_distance_neighbours_for_clusters(clusters) is_initialized = True # Now that the clusters have been created, separate them into clusters that have one face # and clusters that have more than one face unmatched_clusters = [] matched_clusters = [] for cluster in clusters: if len(cluster.faces) == 1: unmatched_clusters.append(cluster) else: matched_clusters.append(cluster) matched_clusters.sort(key = lambda x: len(x.faces), reverse = True) return(matched_clusters, unmatched_clusters)
The problem:
The reason for the slow performance is due to step 10: Update C and absolute distance between clusters by (3)
where (3)
is:
This is the smallest L1-norm distance between all the faces in C_i (cluster i)
and C_j (cluster j)
After merging the clusters
Since I have to recalculate the absolute distances between the newly created clusters every time I finish finding the candidate merging pairs in steps 3 - 8
. I'm basically having to do a nested for loop for all the created cluster and then having ANOTHER nested for loop to find the two faces that have the closest distance. Afterwards, I still have to sort the neighbours by nearest distance!
I believe that this is the wrong approach as I have looked at OpenBR which has also implemented the same Rank-Order Clustering algorithm that I want it is under the method name:
Clusters br::ClusterGraph(Neighborhood neighborhood, float aggressiveness, const QString &csv)
Although I'm not that familiar with C++ I'm pretty sure that they are not recalculating the absolute distances between the clusters which leads me to believe that this is the part of the algorithm that I am implementing wrongly.
Moreover, at the top of their method declaration the comments say that they have pre-computed a kNN graph which makes sense as when I recalculate the absolute distances between clusters I am doing a lot of computation that I have previously done. I believe that the key is to precompute a kNN graph for the clusters although this is the part that I'm stuck at. I can't think of how to implement the data structure so that the absolute distances of the clusters would not have to be recalculated every time they are merged.
2 Answers
Answers 1
At a high level, and this is what OpenBR seems to do as well, what is needed is a lookup table for cluster ID -> cluster object from which a new cluster list is generated from without re-calculation.
Can see where the new cluster is generated from an ID lookup table at this section on OpenBR.
For your code, will need to add an ID to each Cluster
object, integers will probably be best for memory usage. Then update the merge code to create a list of to-be-merged indices at findClusters
and create a new cluster list from the existing cluster indices. Then merge and update neighbours from their indices.
The last step, neighbour indice merging can be seen here on OpenBR.
The key part is that no new clusters are created on merge and distance for them is not re-calculated. Only indices are updated from existing cluster objects and their neighbouring distances merged.
Answers 2
You could try to store distance values between faces in dictionary ex.
class Face: def __init__(self, embedding, id): self.embedding = embedding # a point in 128 dimensional space self.absolute_distance_neighbours = None self.id = id #Add face unique id distances = {} def find_nearest_distance_between_clusters(cluster1, cluster2): nearest_distance = sys.float_info.max for face1 in cluster1.faces: for face2 in cluster2.faces: if not distances.has_key( (face1.id, face2.id) ): distances[(face1.id, face2.id)] = np.linalg.norm(face1.embedding - face2.embedding, ord = 1) #calc distance only once distance = distances[(face1.id, face2.id)] #use precalc distances if distance < nearest_distance: nearest_distance = distance # If there is a distance of 0 then there is no need to continue if distance == 0: return(0) return(nearest_distance)
0 comments:
Post a Comment