Machine Learning for Data Mining
Hierarchical Clustering
Andres Mendez-Vazquez
July 27, 2015
Outline
1 Hierarchical Clustering
   Definition
   Basic Ideas
2 Agglomerative Algorithms
   Introduction
   Problems with Agglomerative Algorithms
   Two Categories of Agglomerative Algorithms
      Matrix Based Algorithms
      Graph Based Algorithms
3 Divisive Algorithms
   Introduction
4 Algorithms for Large Data Sets
   Introduction
   Clustering Using REpresentatives (CURE)
Concepts
Hierarchical Clustering Algorithms
They are quite different from the clustering algorithms seen so far:
rather than a single partition, they produce a hierarchy of clusterings.
Dendrogram: Hierarchical Clustering
Hierarchical Clustering
A clustering is obtained by cutting the dendrogram at a desired level:
each connected component below the cut forms a cluster.
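As a quick illustration (not part of the original slides), the cut can be
done with SciPy; the toy data and the threshold 2.0 are placeholders:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(10, 2))  # toy data (placeholder)
Z = linkage(X, method="single")                    # build the dendrogram
labels = fcluster(Z, t=2.0, criterion="distance")  # cut at level t = 2.0
# Points sharing a label form one connected component below the cut.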
Example
Dendrogram (figure omitted in this extraction).
Basic Ideas
At each step t
A new clustering is obtained from the clustering produced at the
previous step t − 1.
Two Main Types
1 Agglomerative Algorithms
   1 Start with each item as a single cluster.
   2 Eventually all items belong to the same cluster.
2 Divisive Algorithms
   1 Start with all items in a single cluster.
   2 Eventually each item forms a cluster on its own.
Therefore
Given the previous ideas
It is necessary to define the concept of nesting!
After all, both the agglomerative and the divisive procedures build each
clustering from the previous one.
Nested Clustering
Definition
1 A clustering R_i containing k clusters is said to be nested in the
clustering R_{i+1}, which contains r < k clusters, if each cluster in
R_i is a subset of a set in R_{i+1}.
2 In addition, at least one cluster of R_i is a proper subset of a set
in R_{i+1}.
This is written as

R_i ⊏ R_{i+1}   (1)
Example
We have
The following set {x1, x2, x3, x4, x5}.
With the following structures
R_1 = {{x1, x3}, {x4}, {x2, x5}}
R_2 = {{x1, x3, x4}, {x2, x5}}
so that R_1 ⊏ R_2.
Again
Hierarchical Clustering produces a hierarchy of clusterings!
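A minimal sketch (not from the slides) that checks the nesting relation of
the definition, with clusters represented as Python sets:

def is_nested(R1, R2):
    # R1 ⊏ R2: every cluster of R1 is a subset of some cluster of R2,
    # and at least one containment is proper.
    covered = all(any(c1 <= c2 for c2 in R2) for c1 in R1)
    proper = any(any(c1 < c2 for c2 in R2) for c1 in R1)
    return covered and proper

R1 = [{"x1", "x3"}, {"x4"}, {"x2", "x5"}]
R2 = [{"x1", "x3", "x4"}, {"x2", "x5"}]
print(is_nested(R1, R2))  # True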
Agglomerative Algorithms
Initial State
You have N clusters, each containing one element of the data set X.
At each step t, you have a clustering R_t with N − t clusters.
Then, a new clustering structure R_{t+1} is generated.
In that way...
We have
At each step, each cluster of R_t is a subset of a cluster in R_{t+1},
and at least one is a proper subset, i.e.

R_t ⊏ R_{t+1}   (2)
The Basic Algorithm for Agglomerative Clustering
For this
We have a function g(Ci, Cj), defined on all pairs of clusters, that
measures similarity or dissimilarity.
t denotes the current level of the hierarchy.
Algorithm
Initialization
   Choose R_0 = {Ci = {xi}, i = 1, ..., N}
   t = 0
Repeat
   t = t + 1
   Find the pair of clusters (Ci, Cj) in R_{t−1} such that g(Ci, Cj) is
      the maximum of a similarity function (or the minimum of a
      dissimilarity function) over all pairs
   Define Cq = Ci ∪ Cj and R_t = (R_{t−1} − {Ci, Cj}) ∪ {Cq}
Until all vectors lie in a single cluster
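A minimal Python sketch of this generic scheme (my own illustration: it
uses single-link dissimilarity as g and a naive search over all pairs):

import numpy as np

def single_link(C1, C2, D):
    # g(Ci, Cj): smallest pairwise distance between the two clusters.
    return min(D[i, j] for i in C1 for j in C2)

def agglomerative(X, g=single_link):
    N = len(X)
    # Pairwise Euclidean distance matrix of the data.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    R = [frozenset([i]) for i in range(N)]  # R_0: one cluster per point
    hierarchy = [list(R)]
    while len(R) > 1:
        # Find the pair (Ci, Cj) minimizing the dissimilarity g.
        a, b = min(((a, b) for a in range(len(R))
                    for b in range(a + 1, len(R))),
                   key=lambda p: g(R[p[0]], R[p[1]], D))
        Cq = R[a] | R[b]  # Cq = Ci ∪ Cj
        R = [C for k, C in enumerate(R) if k not in (a, b)] + [Cq]
        hierarchy.append(list(R))  # R_t = (R_{t-1} − {Ci, Cj}) ∪ {Cq}
    return hierarchy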
Enforcing Nesting
Note the following
“If two vectors come together into a single cluster at level t of the
hierarchy, they will remain in the same cluster for all subsequent
clusterings.”
Thus

R_0 ⊏ R_1 ⊏ R_2 ⊏ ... ⊏ R_{N−1}   (3)

Hurrah!
The nesting property is enforced!
Problems with Agglomerative Algorithms
First - Related to the Nesting Property
There is no way to recover from a “poor” clustering that may have
occurred at an earlier level of the hierarchy.
Second
At each level t, there are N − t clusters.
Thus, at level t + 1 the number of pairs compared is

C(N − t, 2) = (N − t)(N − t − 1) / 2   (4)

The total number of pairs compared over all levels is

Σ_{t=0}^{N−1} C(N − t, 2)   (5)
Thus
We have that

Σ_{t=0}^{N−1} C(N − t, 2) = Σ_{k=1}^{N} C(k, 2) = (N − 1) N (N + 1) / 6   (6)

Thus
The complexity of this scheme is O(N³).
However
You still depend on the nature of g.
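A two-line numeric check of identity (6), in case the closed form looks
surprising:

from math import comb

N = 100
total = sum(comb(N - t, 2) for t in range(N))  # levels t = 0, ..., N-1
assert total == (N - 1) * N * (N + 1) // 6     # matches equation (6)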
Two Categories of Agglomerative Algorithms
There are two
1 Based on matrix theory.
2 Based on graph theory concepts.
Matrix Theory Based
As the name says, they are based on the N × N dissimilarity matrix
P_0 = P(X).
At each merging step the matrix is reduced by one ⇒ P_t becomes an
(N − t) × (N − t) matrix.
Matrix Based Algorithm
Matrix Updating Algorithmic Scheme (MUAS)
Initialization
   Choose R_0 = {Ci = {xi}, i = 1, ..., N}
   P_0 = P(X)
   t = 0
Repeat
   t = t + 1
   Find the pair of clusters (Ci, Cj) in R_{t−1} such that
      d(Ci, Cj) = min over r, s = 1, ..., N with r ≠ s of d(Cr, Cs)
   Define Cq = Ci ∪ Cj and R_t = (R_{t−1} − {Ci, Cj}) ∪ {Cq}
   Create P_t by the strategy below
Until all vectors lie in a single cluster
Matrix Based Algorithm
Strategy for creating P_t
1 Delete the two rows and columns that correspond to the merged
clusters.
2 Add a new row and a new column containing the distances between the
newly formed cluster and the old (unaffected at this level) clusters.
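A sketch of this update in NumPy (my own illustration; new_dists is a
hypothetical vector holding d(Cq, Ck) for every surviving cluster Ck,
computed for instance with the general formula on the next slide):

import numpy as np

def update_matrix(P, i, j, new_dists):
    # Build P_t from P_{t-1} after merging clusters i and j into Cq.
    keep = [k for k in range(len(P)) if k not in (i, j)]
    Q = P[np.ix_(keep, keep)]        # delete the two rows and columns
    Q = np.pad(Q, ((0, 1), (0, 1)))  # add a new row and a new column
    Q[-1, :-1] = new_dists           # new row: distances from Cq
    Q[:-1, -1] = new_dists           # new column (symmetric)
    return Q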
Distance Used in These Schemes
It has been pointed out that there is a single general distance update
for these algorithms

d(Cq, Cs) = a_i d(Ci, Cs) + a_j d(Cj, Cs) + b d(Ci, Cj)
               + c |d(Ci, Cs) − d(Cj, Cs)|

where different values of a_i, a_j, b and c correspond to different
choices of the dissimilarity measure.
Using this distance it is possible to generate several algorithms
1 The single link algorithm.
2 The complete link algorithm.
3 The weighted pair group method average.
4 The unweighted pair group method centroid.
5 Etc.
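In code the update is a one-liner (a sketch; the parameter values in the
comments are the standard choices for single and complete link):

def general_update(d_is, d_js, d_ij, a_i, a_j, b, c):
    # d(Cq, Cs) for the merged cluster Cq = Ci ∪ Cj.
    return a_i * d_is + a_j * d_js + b * d_ij + c * abs(d_is - d_js)

# Single link:   a_i = a_j = 1/2, b = 0, c = -1/2  ->  min{d_is, d_js}
# Complete link: a_i = a_j = 1/2, b = 0, c = +1/2  ->  max{d_is, d_js}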
For example
The single link algorithm
This is obtained if we set a_i = 1/2, a_j = 1/2, b = 0, c = −1/2.
Thus, we have

d(Cq, Cs) = min {d(Ci, Cs), d(Cj, Cs)}   (7)

Please look at the example in the Dropbox
It is an interesting example.
Agglomerative Algorithms Based on Graph Theory
Consider the following
1 Each node in the graph G corresponds to a vector.
2 Clusters are formed by connecting nodes.
3 A certain property, h(k), needs to be respected.
Common Properties: Node Connectivity
The node connectivity of a connected subgraph is the largest integer k
such that all pairs of nodes are joined by at least k paths having no
nodes in common.
Agglomerative Algorithms Based on Graph Theory
Common Properties: Edge Connectivity
The edge connectivity of a connected subgraph is the largest integer k
such that all pairs of nodes are joined by at least k paths having no edges
in common.
Common Properties: Node Degree
The degree of a connected subgraph is the largest integer k such that
each node has at least k incident edges.
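These three properties are easy to test with NetworkX (a sketch, not part
of the slides; G is assumed to be a connected nx.Graph):

import networkx as nx

def satisfies_h(G, k, prop="node"):
    # Check property h(k) on a connected subgraph G.
    if prop == "node":
        return nx.node_connectivity(G) >= k
    if prop == "edge":
        return nx.edge_connectivity(G) >= k
    if prop == "degree":
        return min(d for _, d in G.degree()) >= k
    raise ValueError(prop)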
Basically, We Use the Same Scheme, But...
The function

g_{h(k)}(Cr, Cs) = min_{x ∈ Cr, y ∈ Cs} {d(x, y) : the property below holds}   (8)

Property
The subgraph of G defined by Cr ∪ Cs
1 is connected, and either
   1 it has the property h(k), or
   2 it is complete.
Examples
1 Single Link Algorithm.
2 Complete Link Algorithm.
There is another style of clustering
Clustering Algorithms Based on the Minimum Spanning Tree.
Divisive Algorithms
Reverse Strategy
Start with a single cluster containing all the data and split it
iteratively.
Generalized Divisive Scheme
Algorithm (PROBLEM: what is wrong with it?)
Initialization
   Choose R_0 = {X}
   P_0 = P(X)
   t = 0
Repeat
   t = t + 1
   For i = 1 to t
      Given the cluster C_{t−1,i}, generate all possible pairs of
         sub-clusters
   next i
   Find the pair (C¹_{t−1,j}, C²_{t−1,j}) that maximizes g
   Create R_t = (R_{t−1} − {C_{t−1,j}}) ∪ {C¹_{t−1,j}, C²_{t−1,j}}
Until all vectors lie in a single cluster
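One concrete issue (my own note, not an answer given in the slides) is the
cost of “generate all possible pairs of sub-clusters”: a cluster with n
points admits 2^(n−1) − 1 two-set splits, so the inner step is exponential
in the cluster size.

def num_bipartitions(n):
    # Ways to split an n-point cluster into two non-empty sub-clusters.
    return 2 ** (n - 1) - 1

print(num_bipartitions(20))  # 524287 candidate splits for only 20 points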
Algorithms for Large Data Sets
There are several
1 The CURE Algorithm
2 The ROCK Algorithm
3 The Chameleon Algorithm
4 The BIRCH Algorithm
Clustering Using REpresentatives (CURE)
Basic Idea
Each cluster Ci has a set of representatives

R_{Ci} = {x_1^(i), x_2^(i), ..., x_K^(i)} with K > 1.

What is happening
By using multiple representatives for each cluster, the CURE algorithm
tries to “capture” the shape of each one.
However
In order to avoid taking into account irregularities (for example,
outliers) at the border of the cluster,
the initially chosen representatives are “pushed” toward the mean of the
cluster.
Therefore
This action is known as “shrinking”
In the sense that the volume of space “defined” by the representatives
is shrunk toward the mean of the cluster.
Shrinking Process
Given a cluster C
Select the point x ∈ C with the maximum distance from the mean of C and
set R_C = {x} (the set of representatives).
Then
1 For i = 2 to min {K, n_C}
2    Determine y ∈ C − R_C that lies farthest from the points in R_C
3    R_C = R_C ∪ {y}
Shrinking Process
Do the Shrinking
Shrink the points x ∈ R_C toward the mean m_C of C by a factor α.
Actually

x = (1 − α) x + α m_C, ∀x ∈ R_C   (9)
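Both steps fit in a short NumPy sketch (my own illustration; C is an
(n_C, d) array and K and alpha are the user-chosen parameters):

import numpy as np

def cure_representatives(C, K=10, alpha=0.2):
    m = C.mean(axis=0)  # cluster mean m_C
    # Start with the point farthest from the mean.
    R = [C[np.argmax(np.linalg.norm(C - m, axis=1))]]
    for _ in range(1, min(K, len(C))):
        # Farthest-point step: maximize the distance to the chosen set R_C.
        dists = np.min(np.linalg.norm(C[:, None, :] - np.array(R)[None, :, :],
                                      axis=-1), axis=1)
        R.append(C[np.argmax(dists)])
    # Shrink toward the mean: x <- (1 - alpha) x + alpha m_C.
    return (1 - alpha) * np.array(R) + alpha * m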
Resulting Set R_C
Thus
The resulting set R_C is the set of representatives of C.
The distance between two clusters is then defined as

d(Ci, Cj) = min_{x ∈ R_{Ci}, y ∈ R_{Cj}} d(x, y)   (10)
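Equation (10) as code (a sketch; each argument is an iterable of NumPy
points):

import numpy as np

def cure_distance(R_Ci, R_Cj):
    # Smallest distance between any two representatives of the clusters.
    return min(np.linalg.norm(x - y) for x in R_Ci for y in R_Cj)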
Clustering Using REpresentatives (CURE)
Basic Algorithm
Input: a set of points X = {x1, x2, ..., xN}
Output: C clusters
1 For every cluster Ci = {xi}, store Ci.mC = xi and Ci.RC = {xi}.
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a k-d tree T.
4 Insert each cluster into the heap Q (clusters are arranged in
increasing order of the distance between Ci and Ci.closest).
5 While size(Q) > C:
6    Remove the top element Ci of Q and merge it with Cj = Ci.closest.
7    Then compute the new representative points for the merged cluster
     Ck = Ci ∪ Cj.
8    Also remove Ci and Cj from T and Q.
9    Also, for all clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10   Insert Ck into Q.
Complexity of CURE
Too prohibitive

O(N² log₂ N)   (11)
Possible Solution
CURE does the following
The technique adopted by the CURE algorithm, in order to reduce the
computational complexity, is random sampling.
Actually
That is, a sample set X′ is created from X by randomly choosing N′ out
of the N points of X.
However, one has to ensure that the probability of missing a cluster of
X due to this sampling is low
This can be guaranteed if the number of sample points N′ is sufficiently
large.
Then
Having estimated N′
CURE forms p = N/N′ sample data sets by successive random samples.
In other words
X is partitioned randomly into p subsets.
For this a parameter q is selected
The points in each partition are clustered until N′/q clusters are
formed, or until the distance between the closest pair of clusters to be
merged in the next iteration step exceeds a user-defined threshold.
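The random partitioning step is straightforward (a sketch with a fixed
seed for reproducibility; X is a NumPy array):

import numpy as np

def random_partitions(X, p, seed=0):
    # Randomly split the data set X into p sample subsets.
    idx = np.random.default_rng(seed).permutation(len(X))
    return [X[chunk] for chunk in np.array_split(idx, p)]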
Once This Has Been Finished
A second clustering pass is done
On the at most p · N′/q = N/q clusters obtained from all the subsets.
The Goal
To apply the merging procedure described previously to the (at most)
N/q clusters, so that we end up with the required final number, m, of
clusters.
Finally
Each point x in the data set X that is not used as a representative in
any one of the m clusters is assigned to one of them according to the
following strategy.
Finally
First
A random sample of representative points from each of the m clusters is
chosen.
Then
Based on these representatives, the point x is assigned to the cluster
that contains the representative closest to it.
Experiments reported by Guha et al. show that CURE
is sensitive to parameter selection:
K must be large enough to capture the geometry of each cluster.
In addition, N′ must be larger than a certain percentage of N
(approximately 2.5%).
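The assignment strategy in code (a sketch; reps_per_cluster[m] is assumed
to hold the sampled representatives of cluster m):

import numpy as np

def assign_points(points, reps_per_cluster):
    labels = []
    for x in points:
        # Pick the cluster whose closest sampled representative is nearest.
        best = min(range(len(reps_per_cluster)),
                   key=lambda m: min(np.linalg.norm(x - r)
                                     for r in reps_per_cluster[m]))
        labels.append(best)
    return labels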
Not Only That
The value of α also affects CURE
For small values of α, CURE behaves like an MST-based clustering
algorithm.
For large values of α, CURE resembles an algorithm with a single
representative per cluster.
Worst Case Complexity

O(N′² log₂ N′)   (12)
More Related Content

PDF
24 Machine Learning Combining Models - Ada Boost
PDF
31 Machine Learning Unsupervised Cluster Validity
PDF
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
PDF
27 Machine Learning Unsupervised Measure Properties
PDF
23 Machine Learning Feature Generation
PDF
11 Machine Learning Important Issues in Machine Learning
PDF
Introduction to logistic regression
PDF
17 Machine Learning Radial Basis Functions
24 Machine Learning Combining Models - Ada Boost
31 Machine Learning Unsupervised Cluster Validity
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
27 Machine Learning Unsupervised Measure Properties
23 Machine Learning Feature Generation
11 Machine Learning Important Issues in Machine Learning
Introduction to logistic regression
17 Machine Learning Radial Basis Functions

What's hot (20)

PDF
18.1 combining models
PDF
18 Machine Learning Radial Basis Function Networks Forward Heuristics
PDF
Machine learning in science and industry — day 1
PDF
Machine learning in science and industry — day 4
PDF
Machine learning in science and industry — day 2
PDF
Machine learning in science and industry — day 3
PDF
06 Machine Learning - Naive Bayes
PDF
Iclr2016 vaeまとめ
PDF
Vc dimension in Machine Learning
PDF
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
PDF
The Kernel Trick
PDF
Tree models with Scikit-Learn: Great models with little assumptions
PDF
Neural Networks: Support Vector machines
PDF
"Deep Learning" Chap.6 Convolutional Neural Net
PDF
(DL hacks輪読) Variational Inference with Rényi Divergence
PDF
Understanding Random Forests: From Theory to Practice
PDF
Probabilistic PCA, EM, and more
PDF
Ridge regression, lasso and elastic net
PDF
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
PDF
Scaling Multinomial Logistic Regression via Hybrid Parallelism
18.1 combining models
18 Machine Learning Radial Basis Function Networks Forward Heuristics
Machine learning in science and industry — day 1
Machine learning in science and industry — day 4
Machine learning in science and industry — day 2
Machine learning in science and industry — day 3
06 Machine Learning - Naive Bayes
Iclr2016 vaeまとめ
Vc dimension in Machine Learning
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
The Kernel Trick
Tree models with Scikit-Learn: Great models with little assumptions
Neural Networks: Support Vector machines
"Deep Learning" Chap.6 Convolutional Neural Net
(DL hacks輪読) Variational Inference with Rényi Divergence
Understanding Random Forests: From Theory to Practice
Probabilistic PCA, EM, and more
Ridge regression, lasso and elastic net
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
Scaling Multinomial Logistic Regression via Hybrid Parallelism
Ad

Viewers also liked (15)

PPT
3.3 hierarchical methods
PPTX
Cluster analysis
PDF
Cluster Analysis for Dummies
PPTX
Introduction to Machine Learning
PPTX
Db scan multiview
PDF
Machine Learning and Data Mining: 05 Advanced Association Rule Mining
PDF
Machine Learning and Data Mining: 08 Clustering: Hierarchical
PPTX
Hierarchical clustering in Python and beyond
PPTX
An Introduction to Agglomeration
PDF
Clustering: A Survey
PDF
K means Clustering
PPT
Basics of Machine Learning
PPTX
Introduction to Machine Learning
3.3 hierarchical methods
Cluster analysis
Cluster Analysis for Dummies
Introduction to Machine Learning
Db scan multiview
Machine Learning and Data Mining: 05 Advanced Association Rule Mining
Machine Learning and Data Mining: 08 Clustering: Hierarchical
Hierarchical clustering in Python and beyond
An Introduction to Agglomeration
Clustering: A Survey
K means Clustering
Basics of Machine Learning
Introduction to Machine Learning
Ad

Similar to 28 Machine Learning Unsupervised Hierarchical Clustering (20)

PDF
Clustering Algorithms.pdf
PDF
PPT s10-machine vision-s2
PPT
Hierarchical (2)l ppt for data and analytics
PPTX
Data mining and warehousing
PDF
An Analysis On Clustering Algorithms In Data Mining
PDF
Clustering Approach Recommendation System using Agglomerative Algorithm
PPTX
05 Clustering in Data Mining
PPTX
Algorithms used in AIML and the need for aiml basic use cases
PDF
3MLChapter3ClusteringSlides23EN UC Coimbra PT
PDF
Enhanced Clustering Algorithm for Processing Online Data
PDF
A0310112
PDF
Multilevel techniques for the clustering problem
PPTX
Unsupervised Learning-Clustering Algorithms.pptx
PDF
Similarity distance measures
PPTX
log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...
PDF
Paper id 26201478
PDF
4.Unit 4 ML Q&A.pdf machine learning qb
PPT
multiarmed bandit.ppt
DOCX
Agglomerative Clustering Onvertically Partitioned Data–Distributed Database M...
Clustering Algorithms.pdf
PPT s10-machine vision-s2
Hierarchical (2)l ppt for data and analytics
Data mining and warehousing
An Analysis On Clustering Algorithms In Data Mining
Clustering Approach Recommendation System using Agglomerative Algorithm
05 Clustering in Data Mining
28 Machine Learning Unsupervised Hierarchical Clustering

  • 1. Machine Learning for Data Mining Hierarchical Clustering Andres Mendez-Vazquez July 27, 2015 1 / 46
  • 2. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 2 / 46
  • 3. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 3 / 46
  • 4. Images/cinvestav- Concepts Hierarchical Clustering Algorithms They are quite different from the previous clustering algorithms. Actually They produce a hierarchy of clusterings. 4 / 46
  • 6. Images/cinvestav- Dendrogram: Hierarchical Clustering Hierarchical Clustering The clustering is obtained by cutting the dendrogram at a desired level: Each connected component forms a cluster. 5 / 46
  • 8. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 7 / 46
  • 9. Images/cinvestav- Basic Ideas At each step t A new clustering is obtained based on the clustering produced at the previous step t − 1 Two Main Types 1 Agglomerative Algorithms. 1 Start with each item being a single cluster. 2 Eventually all items belong to the same cluster. 2 Divisive Algorithms 1 Start with all items belong to the same cluster. 2 Eventually each item forms a cluster on its own. 8 / 46
  • 16. Images/cinvestav- Therefore Given the previous ideas, it is necessary to define the concept of nesting!!! After all, both the divisive and the agglomerative procedures generate a sequence of clusterings. 9 / 46
  • 18. Images/cinvestav- Nested Clustering Definition 1 A clustering ℜi containing k clusters is said to be nested in the clustering ℜi+1, which contains r < k clusters, if each cluster in ℜi is a subset of a cluster in ℜi+1. 2 At least one cluster of ℜi is a proper subset of a cluster in ℜi+1. This is written as ℜi ⊏ ℜi+1 (1) 10 / 46
  • 21. Images/cinvestav- Example We have The following set {x1, x2, x3, x4, x5}. With the following structures ℜ1 = {{x1, x3} , {x4} , {x2, x5}} ℜ2 = {{x1, x3, x4} , {x2, x5}} so that ℜ1 ⊏ ℜ2. Again Hierarchical Clustering produces a hierarchy of clusterings!!! 11 / 46
  • 25. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 12 / 46
  • 26. Images/cinvestav- Agglomerative Algorithms. Initial State You have N clusters, each containing one element of the data X. At each step t, you have a clustering ℜt with N − t clusters. Then, a new clustering structure ℜt+1 is generated by merging two of them. 13 / 46
  • 30. Images/cinvestav- In that way... We have At each step, each cluster in ℜt is a subset of a cluster in ℜt+1, and at least one is a proper subset, i.e. ℜt ⊏ ℜt+1 (2) 14 / 46
  • 31. Images/cinvestav- The Basic Algorithm for Agglomerative For this We have a function g (Ci, Cj), defined for all pairs of clusters, that measures similarity or dissimilarity. t denotes the current level of the hierarchy. Algorithm
Initialization
    Choose ℜ0 = {Ci = {xi} , i = 1, ..., N}
    t = 0
Repeat
    t = t + 1
    Find the pair of clusters (Ci, Cj) in ℜt−1 such that g(Ci, Cj) is
        the maximum of a similarity (or the minimum of a dissimilarity)
        function over all pairs
    Define Cq = Ci ∪ Cj and ℜt = (ℜt−1 − {Ci, Cj}) ∪ {Cq}
Until all vectors are in a single cluster
15 / 46
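A minimal sketch of this generic agglomerative scheme, assuming Euclidean vectors and a single-link g; the slides do not prescribe an implementation, so all names here are illustrative:

```python
import numpy as np

def single_link(Ci, Cj, X):
    """g(Ci, Cj): the smallest pairwise distance between two clusters."""
    return min(np.linalg.norm(X[a] - X[b]) for a in Ci for b in Cj)

def agglomerate(X, g=single_link):
    """Return the hierarchy R0, R1, ..., R(N-1) as lists of index sets."""
    clusters = [{i} for i in range(len(X))]        # R0: one cluster per point
    hierarchy = [list(clusters)]
    while len(clusters) > 1:
        # Find the pair (Ci, Cj) minimizing the dissimilarity g
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: g(clusters[p[0]], clusters[p[1]], X))
        Cq = clusters[i] | clusters[j]             # Cq = Ci ∪ Cj
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(Cq)                        # Rt = (Rt-1 − {Ci, Cj}) ∪ {Cq}
        hierarchy.append(list(clusters))
    return hierarchy

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
for t, R in enumerate(agglomerate(X)):
    print(t, R)
```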
  • 34. Images/cinvestav- Enforcing Nesting Note the following “We can say that if two vectors come together into a single cluster at level t of the hierarchy, they will remain in the same cluster for all subsequent clusterings.” Thus ℜ0 ⊏ ℜ1 ⊏ ℜ2 ⊏ ... ⊏ ℜN−1 (3) Hurrah!!! Enforcing the nesting property!!! 16 / 46
  • 37. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 17 / 46
  • 38. Images/cinvestav- Problems with Agglomerative Algorithms First - Related to the Nesting Property There is no way to recover from a “poor” clustering that may have occurred at an earlier level of the hierarchy. Second At each level t, there are N − t clusters. Thus, at level t + 1 the number of pairs compared is $\binom{N-t}{2} = \frac{(N-t)(N-t-1)}{2}$ (4) The total number of pairs compared is $\sum_{t=0}^{N-1} \binom{N-t}{2}$ (5) 18 / 46
  • 43. Images/cinvestav- Thus We have that $\sum_{t=0}^{N-1} \binom{N-t}{2} = \sum_{k=1}^{N} \binom{k}{2} = \frac{(N-1)N(N+1)}{6}$ (6) Thus The complexity of this schema is $O(N^3)$ However You still depend on the nature of g. 19 / 46
  • 46. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 20 / 46
  • 47. Images/cinvestav- Two Categories of Agglomerative Algorithms There are two 1 Matrix Theory Based. 2 Graph Theory Based. Matrix Theory Based As the name says, they are based on the N × N dissimilarity matrix P0 = P (X). At each merging step the matrix is reduced by one ⇒ Pt becomes an (N − t) × (N − t) matrix. 21 / 46
  • 51. Images/cinvestav- Matrix Based Algorithm Matrix Updating Algorithmic Scheme (MUAS)
Initialization
    Choose ℜ0 = {Ci = {xi} , i = 1, ..., N}
    P0 = P (X)
    t = 0
Repeat
    t = t + 1
    Find the pair of clusters (Ci, Cj) in ℜt−1 such that
        d(Ci, Cj) = min over all pairs (Cr, Cs), r ≠ s, of d(Cr, Cs)
    Define Cq = Ci ∪ Cj and ℜt = (ℜt−1 − {Ci, Cj}) ∪ {Cq}
    Create Pt by the strategy below
Until all vectors are in a single cluster
22 / 46
  • 52. Images/cinvestav- Matrix Based Algorithm Strategy 1 Delete the two rows and columns that correspond to the merged clusters. 2 Add a new row and a new column that contain the distances between the newly formed cluster and the old (unaffected at this level) clusters. 23 / 46
  • 54. Images/cinvestav- Distance Used in These Schemes It has been pointed out that there is a single general distance-update formula for these algorithms d (Cq, Cs) = ai d (Ci, Cs) + aj d (Cj, Cs) + b d (Ci, Cj) + c |d (Ci, Cs) − d (Cj, Cs)| Where different values of ai, aj, b and c correspond to different choices of the dissimilarity measure. Using this formula it is possible to generate several algorithms 1 The single link algorithm. 2 The complete link algorithm. 3 The weighted pair group method average. 4 The unweighted pair group method centroid. 5 Etc... 24 / 46
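As a concrete sketch, the function below applies the row/column strategy and the general update formula above (widely known as the Lance–Williams recurrence) in one step on a small dissimilarity matrix; the default coefficients (single link, so the new row is the elementwise minimum of the two old ones) are standard choices assumed here, not taken verbatim from the slides:

```python
import numpy as np

def merge_step(P, i, j, ai=0.5, aj=0.5, b=0.0, c=-0.5):
    """One MUAS step on dissimilarity matrix P: merge clusters i and j."""
    # d(Cq, Cs) for every cluster Cs, via the general update formula
    d_new = ai * P[i] + aj * P[j] + b * P[i, j] + c * np.abs(P[i] - P[j])
    d_new = np.delete(d_new, [i, j])
    # Strategy: delete the two merged rows/columns, append one new row/column
    P = np.delete(np.delete(P, [i, j], axis=0), [i, j], axis=1)
    P = np.vstack([P, d_new])
    return np.hstack([P, np.append(d_new, 0.0)[:, None]])

P0 = np.array([[0., 1., 4., 5.],
               [1., 0., 3., 6.],
               [4., 3., 0., 2.],
               [5., 6., 2., 0.]])
print(merge_step(P0, 0, 1))   # with c = -1/2 the new row is the min of rows 0, 1
```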
  • 60. Images/cinvestav- For example The single link algorithm This is obtained if we set ai = 1/2, aj = 1/2, b = 0, c = −1/2. Thus, we have d (Cq, Cs) = min {d (Ci, Cs) , d (Cj, Cs)} (7) Please look at the example in the Dropbox It is an interesting example. 25 / 46
  • 63. Images/cinvestav- Agglomerative Algorithms Based on Graph Theory Consider the following 1 Each node in the graph G corresponds to a vector. 2 Clusters are formed by connecting nodes. 3 A certain property, h (k), needs to be respected. Common Properties: Node Connectivity The node connectivity of a connected subgraph is the largest integer k such that all pairs of nodes are joined by at least k paths having no nodes in common. 26 / 46
  • 67. Images/cinvestav- Agglomerative Algorithms Based on Graph Theory Common Properties: Edge Connectivity The edge connectivity of a connected subgraph is the largest integer k such that all pairs of nodes are joined by at least k paths having no edges in common. Common Properties: Node Degree The degree of a connected subgraph is the largest integer k such that each node has at least k incident edges. 27 / 46
  • 69. Images/cinvestav- Basically, We use the Same Scheme, But... The function $g_{h(k)}(C_r, C_s) = \min_{x \in C_r,\, y \in C_s} d(x, y)$ (8) subject to the following Property The subgraph of G defined by Cr ∪ Cs is connected and either 1 it has the property h(k) or 2 it is complete 28 / 46
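A hedged sketch of the property test behind g_h(k): check that the subgraph induced by Cr ∪ Cs is connected and either complete or satisfies h(k). It uses networkx, whose node_connectivity and edge_connectivity helpers match the definitions above; how G itself is built (e.g., by thresholding a dissimilarity matrix) is left out as an assumption:

```python
import networkx as nx

def satisfies_property(G, Cr, Cs, k, h="node"):
    """True if the subgraph on Cr ∪ Cs is connected and complete or has h(k)."""
    H = G.subgraph(Cr | Cs)
    n = H.number_of_nodes()
    if not nx.is_connected(H):
        return False
    if H.number_of_edges() == n * (n - 1) // 2:     # complete subgraph
        return True
    if h == "node":
        return nx.node_connectivity(H) >= k         # k node-disjoint paths
    if h == "edge":
        return nx.edge_connectivity(H) >= k         # k edge-disjoint paths
    return min(d for _, d in H.degree()) >= k       # node-degree property

# Toy graph: two triangles joined by a single bridge edge
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(satisfies_property(G, {0, 1, 2}, {3, 4, 5}, k=2))   # False: bridge (2, 3)
```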
  • 74. Images/cinvestav- Examples Examples 1 Single Link Algorithm 2 Complete Link Algorithm There is another style of clustering Clustering Algorithms Based on the Minimum Spanning Tree 29 / 46
  • 77. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 30 / 46
  • 78. Images/cinvestav- Divisive Algorithms Reverse Strategy Start with a single cluster and split it iteratively. 31 / 46
  • 79. Images/cinvestav- Generalized Divisive Scheme Algorithm (Problem: what is wrong with it?)
Initialization
    Choose ℜ0 = {X}, P0 = P (X)
    t = 0
Repeat
    t = t + 1
    For i = 1 to t
        Given the cluster Ct−1,i of ℜt−1,
        generate all possible pairs of subclusters
    Next i
    Find the pair (C¹t−1,j , C²t−1,j) that maximizes g
    Create ℜt = (ℜt−1 − {Ct−1,j}) ∪ {C¹t−1,j , C²t−1,j}
Until each vector lies in a single (distinct) cluster
32 / 46
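The catch the slide hints at: a cluster of n points admits 2^(n−1) − 1 distinct two-way splits, so generating "all possible" subcluster pairs explodes combinatorially. The toy enumeration below (an illustration, not from the slides) makes the count concrete:

```python
from itertools import combinations

def all_two_way_splits(cluster):
    """Yield every unordered split of a set into two nonempty parts."""
    items = sorted(cluster)
    first, rest = items[0], items[1:]   # pin one element to avoid duplicates
    for r in range(len(rest) + 1):
        for picked in combinations(rest, r):
            A = {first, *picked}
            B = set(items) - A
            if B:                       # skip the trivial split A = cluster
                yield A, B

print(len(list(all_two_way_splits({1, 2, 3, 4, 5}))))   # 2**(5-1) - 1 = 15
```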
  • 80. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 33 / 46
  • 81. Images/cinvestav- Algorithms for Large Data Sets There are several 1 The CURE Algorithm 2 The ROCK Algorithm 3 The Chameleon Algorithm 4 The BIRCH Algorithm 34 / 46
  • 85. Images/cinvestav- Clustering Using REpresentatives (CURE) Basic Idea Each cluster Ci has a set of representatives $R_{C_i} = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_K^{(i)}\}$ with K > 1. What is happening By using multiple representatives for each cluster, the CURE algorithm tries to “capture” the shape of each one. However In order to avoid taking into account irregularities (for example, outliers) on the border of the cluster, the initially chosen representatives are “pushed” toward the mean of the cluster. 35 / 46
  • 89. Images/cinvestav- Therefore This action is known As “shrinking”, in the sense that the volume of space “defined” by the representatives is shrunk toward the mean of the cluster. 36 / 46
  • 90. Images/cinvestav- Shrinking Process Given a cluster C Select the point x ∈ C with the maximum distance from the mean of C and set RC = {x} (the set of representatives). Then 1 For i = 2 to min {K, nC } 2 Determine y ∈ C − RC that lies farthest from the points in RC 3 RC = RC ∪ {y} 37 / 46
  • 95. Images/cinvestav- Shrinking Process Do the Shrinking Shrink the points x ∈ RC toward the mean mC of C by a factor α. Actually x = (1 − α) x + α mC , ∀x ∈ RC (9) 38 / 46
  • 97. Images/cinvestav- Resulting set RC Thus The resulting set RC is the set of representatives of C. Thus the distance between two clusters is defined as $d(C_i, C_j) = \min_{x \in R_{C_i},\, y \in R_{C_j}} d(x, y)$ (10) 39 / 46
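A sketch of the whole representative machinery: farthest-point selection, the shrinking step (9), and the representative-based cluster distance (10). The selection order and the Euclidean distances follow the slides; the function names and the defaults K = 4, α = 0.3 are assumptions:

```python
import numpy as np

def cure_representatives(C, K=4, alpha=0.3):
    """C: (n, d) array of points in one cluster -> shrunk representatives."""
    m = C.mean(axis=0)                                   # cluster mean mC
    # Start with the point farthest from the mean
    reps = [C[np.argmax(np.linalg.norm(C - m, axis=1))]]
    while len(reps) < min(K, len(C)):
        # Pick the point farthest from the representatives chosen so far
        d = np.min([np.linalg.norm(C - r, axis=1) for r in reps], axis=0)
        reps.append(C[np.argmax(d)])
    return (1 - alpha) * np.array(reps) + alpha * m      # shrink toward mC

def cure_distance(Ri, Rj):
    """d(Ci, Cj): minimum distance between the two representative sets."""
    return min(np.linalg.norm(x - y) for x in Ri for y in Rj)

rng = np.random.default_rng(0)
A = rng.normal(0, 1, (50, 2)); B = rng.normal(6, 1, (40, 2))
print(cure_distance(cure_representatives(A), cure_representatives(B)))
```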
  • 99. Images/cinvestav- Clustering Using REpresentatives (CURE) Basic Algorithm
Input: A set of points X = {x1, x2, ..., xN }
Output: C clusters
1 For every cluster Ci = {xi} store Ci.mC = {xi} and Ci.RC = {xi}.
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a k-d tree T.
4 Insert each cluster into the heap Q (clusters are arranged in increasing order of the distance between Ci and Ci.closest).
5 While size(Q) > C
6     Remove the top element Ci of Q and merge it with Cj = Ci.closest.
7     Then compute the new representative points for the merged cluster Ck = Ci ∪ Cj.
8     Also remove Ci and Cj from T and Q.
9     Also, for all clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10    Insert Ck into Q.
40 / 46
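A deliberately simplified sketch of the merge loop above: it keeps the multiple-representatives idea but replaces the heap Q and the k-d tree T with a brute-force closest-pair search, so it shows the logic rather than the efficiency (it reuses cure_representatives and cure_distance from the earlier sketch):

```python
import numpy as np

def cure_merge(X, n_clusters, K=4, alpha=0.3):
    clusters = [X[i:i + 1] for i in range(len(X))]   # one point per cluster
    reps = [c.copy() for c in clusters]              # RC = {xi} initially
    while len(clusters) > n_clusters:
        # Brute-force stand-in for the heap: closest representative sets
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: cure_distance(reps[p[0]], reps[p[1]]))
        merged = np.vstack([clusters[i], clusters[j]])   # Ck = Ci ∪ Cj
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        reps = [r for t, r in enumerate(reps) if t not in (i, j)]
        clusters.append(merged)
        reps.append(cure_representatives(merged, K, alpha))
    return clusters
```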
  • 112. Images/cinvestav- Complexity of CURE Too Prohibitive $O(N^2 \log_2 N)$ (11) 41 / 46
  • 113. Images/cinvestav- Possible Solution CURE does the following The technique adopted by the CURE algorithm, in order to reduce the computational complexity, is random sampling. Actually That is, a sample set X′ is created from X by choosing randomly N′ out of the N points of X. However, one has to ensure that the probability of missing a cluster of X due to this sampling is small. This can be guaranteed if the number of points N′ is sufficiently large. 42 / 46
  • 116. Images/cinvestav- Then Having estimated N′ CURE forms a number of p = N/N′ sample data sets by successive random samples. In other words, X is partitioned randomly into p subsets. For this a parameter q > 1 is selected Then, the points in each partition are clustered until N′/q clusters are formed, or until the distance between the closest pair of clusters to be merged in the next iteration step exceeds a user-defined threshold. 43 / 46
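A sketch of this partition-and-precluster bookkeeping: X is split into p = N/N′ random subsets of roughly N′ points each, and each is clustered down to about N′/q clusters. Here cluster_partition is a hypothetical stand-in for the merge loop sketched earlier:

```python
import numpy as np

def cure_precluster(X, n_prime, q, cluster_partition):
    """Return the at most p * (N'/q) = N/q pre-clusters for the second pass."""
    rng = np.random.default_rng(0)
    shuffled = X[rng.permutation(len(X))]       # a random partition of X
    p = max(1, len(X) // n_prime)               # p = N / N'
    partitions = np.array_split(shuffled, p)    # p subsets of ~N' points
    target = max(1, n_prime // q)               # stop at N'/q clusters each
    return [c for part in partitions
            for c in cluster_partition(part, target)]
```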
  • 119. Images/cinvestav- Once this has been finished A second clustering pass is done On the at most p (N′/q) = N/q clusters from all the subsets. The Goal To apply the merging procedure described previously to the (at most) N/q clusters so that we end up with the required final number, m, of clusters. Finally Each point x in the data set X that is not used as a representative in any one of the m clusters is assigned to one of them according to the following strategy. 44 / 46
  • 122. Images/cinvestav- Finally First A random sample of representative points from each of the m clusters is chosen. Then Then, based on the previous representatives, the point x is assigned to the cluster that contains the representative closest to it. Experiments reported by Guha et al. show that CURE It is sensitive to parameter selection. Specifically K must be large enough to capture the geometry of each cluster. In addition, N′ must be higher than a certain percentage of N (≈ 2.5%). 45 / 46
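A small sketch of this final assignment strategy: each remaining point goes to the cluster owning the closest sampled representative (the names and the toy data are illustrative):

```python
import numpy as np

def assign_points(X, cluster_reps):
    """cluster_reps: one (ni, d) array of sampled representatives per cluster."""
    labels = np.empty(len(X), dtype=int)
    for idx, x in enumerate(X):
        # Distance from x to the nearest representative of each cluster
        d = [np.min(np.linalg.norm(R - x, axis=1)) for R in cluster_reps]
        labels[idx] = int(np.argmin(d))
    return labels

reps = [np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([[6.0, 6.0]])]
X = np.array([[0.5, 0.2], [5.5, 6.1], [2.0, 0.0]])
print(assign_points(X, reps))   # [0 1 0]
```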
  • 127. Images/cinvestav- Not only that The value of α also affects CURE For small values, CURE behaves like an MST-based clustering. For large values, CURE resembles an algorithm with a single representative per cluster. Worst Case Complexity $O(N'^2 \log_2 N')$ (12) 46 / 46