ACEEE Int. J. on Communication, Vol. 01, No. 03, Dec 2010




                       Gravitational Based Hierarchical
                            Clustering Algorithm
                                             P S Bishnu1 and V Bhattacherjee2
                                     Department of Computer Science and Engineering
                                         Birla Institute of Technology, Ranchi, India
                                Email: 1psbishnu@gmail.com, 2vbhattacharya@bitmesra.ac.in


Abstract—We propose a new gravitational-based hierarchical clustering algorithm using a kd-tree. The kd-tree generates densely populated packets, and the clusters are found using the gravitational force between the packets. The resulting clusters are of high quality, and the method is effective and robust. The proposed algorithm has been tested on a synthetic dataset and the results are presented.

Index Terms—Data mining, hierarchical clustering, kd-tree
                        I. INTRODUCTION

   Hierarchical clustering generates a hierarchical series of nested clusters which can be represented graphically by a tree called a dendrogram. By cutting the dendrogram at some level, we can obtain a specified number of clusters [3]. Due to its nested structure, hierarchical clustering is effective and gives better structural information. This paper presents a new gravity-based hierarchical technique using a kd-tree. We call the new algorithm GLHL (an anagram of the bold letters in GravitationaL Based Hierarchical ALgorithm). The paper is organized as follows: Section II surveys related work. Section III gives an overview of hierarchical clustering and kd-trees. Section IV presents the proposed algorithm. Results are discussed in Section V and conclusions in Section VI.
                        II. RELATED WORK

   Here we review some of the pioneering methods in hierarchical clustering. Zhang et al. proposed the BIRCH technique [6]. It overcomes two limitations of agglomerative clustering: poor scalability and the inability to undo what was done in a previous step [1][6]. BIRCH is designed for clustering large datasets, is incremental and hierarchical, and can handle outliers. It introduces the clustering feature (CF), which summarizes information about a cluster; a CF tree is built from these features and holds CF information about its subclusters [2][6]. BIRCH applies only to numeric data [2], and it does not perform well when the clusters are not spherical [1]. Jiang et al. proposed DHC (density-based hierarchical clustering) [7]. It uses two data structures: a density tree, which uncovers the embedded clusters, and an attraction tree, which explores the inner structure of the clusters, the cluster boundaries, and the outliers.

        III. OVERVIEW OF AGGLOMERATIVE CLUSTERING ALGORITHM AND KD-TREE

A. Agglomerative hierarchical clustering algorithm
   Agglomerative hierarchical algorithms begin with every data object as an individual cluster. At each step the two most similar clusters are merged, so after each merge the total number of clusters decreases by one. These steps are repeated until the desired number of clusters is obtained or the distance between the two closest clusters exceeds a threshold [5].
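   To make this bottom-up process concrete, the following minimal C sketch merges the two closest clusters at each step until a target number of clusters remains. It illustrates only the generic agglomerative scheme of this subsection, not GLHL itself; the six 2-D points, the single-linkage distance, and the stopping value K = 2 are assumptions introduced for the example.

/* Minimal single-linkage agglomerative clustering sketch.
   The six 2-D points and the target of K = 2 clusters are
   illustrative assumptions, not data from the paper. */
#include <stdio.h>
#include <math.h>

#define N 6
#define K 2          /* stop when K clusters remain */

int main(void) {
    double pt[N][2] = { {1,1}, {1.2,1.1}, {0.9,1.3},
                        {8,8}, {8.1,7.9}, {7.8,8.2} };
    int label[N];                 /* cluster id of each point */
    int clusters = N;
    for (int i = 0; i < N; i++) label[i] = i;   /* every point starts as a cluster */

    while (clusters > K) {
        int a = -1, b = -1;
        double best = 1e300;
        /* single linkage: closest pair of points lying in different clusters */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (label[i] != label[j]) {
                    double dx = pt[i][0] - pt[j][0];
                    double dy = pt[i][1] - pt[j][1];
                    double d  = sqrt(dx*dx + dy*dy);
                    if (d < best) { best = d; a = label[i]; b = label[j]; }
                }
        /* merge cluster b into cluster a; cluster count drops by one */
        for (int i = 0; i < N; i++)
            if (label[i] == b) label[i] = a;
        clusters--;
    }
    for (int i = 0; i < N; i++)
        printf("point %d -> cluster %d\n", i, label[i]);
    return 0;
}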



B. kd-Tree
   The kd-tree is a geometric, top-down hierarchical tree data structure. At the root, the whole data space is divided by a vertical line into two subsets of roughly equal size. The splitting line is stored at the root; the data points to its left become the content of the left subtree and the data points to its right become the content of the right subtree. At the next level, each node is partitioned again along the alternate direction (e.g. if the previous level was partitioned by a vertical line, this level is partitioned by a horizontal line). At each partitioning, data points to the left of or on a vertical line are assigned to the left subtree and the rest to the right subtree. The process continues until the criterion function has converged [8].
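   The following minimal C sketch illustrates this alternating median split for 2-D points, recursing until each leaf bucket holds a single point (the way the leaf buckets used by GLHL are produced in Section IV.A). The sample points and the simple insertion-sort median selection are assumptions made for brevity, not part of the paper.

/* Minimal 2-D kd-tree bucket split sketch: recursively split the
   point set at the median of the current axis (alternating x / y)
   until each leaf bucket holds a single point.  The sample points
   are an illustrative assumption. */
#include <stdio.h>

#define DIM 2

static double pts[][DIM] = { {2,3}, {5,4}, {9,6}, {4,7}, {8,1}, {7,2} };
static const int NPTS = 6;

/* sort pts[lo..hi-1] by the given axis (small n: insertion sort) */
static void sort_axis(double p[][DIM], int lo, int hi, int axis) {
    for (int i = lo + 1; i < hi; i++)
        for (int j = i; j > lo && p[j][axis] < p[j-1][axis]; j--) {
            double tx = p[j][0], ty = p[j][1];
            p[j][0] = p[j-1][0]; p[j][1] = p[j-1][1];
            p[j-1][0] = tx;      p[j-1][1] = ty;
        }
}

static void split(double p[][DIM], int lo, int hi, int depth) {
    if (hi - lo <= 1) {                       /* leaf bucket: one point */
        printf("leaf bucket: (%g, %g)\n", p[lo][0], p[lo][1]);
        return;
    }
    int axis = depth % DIM;                   /* alternate x and y */
    sort_axis(p, lo, hi, axis);
    int mid = (lo + hi) / 2;                  /* median split */
    split(p, lo, mid, depth + 1);             /* left / lower sub-bucket */
    split(p, mid, hi, depth + 1);             /* right / upper sub-bucket */
}

int main(void) {
    split(pts, 0, NPTS, 0);
    return 0;
}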
     IV. THE PROPOSED GRAVITATIONAL BASED HIERARCHICAL ALGORITHM

   The proposed algorithm uses a kd-tree to divide the data space into regions called leaf buckets and calculates the density of each leaf bucket. The mean of each leaf bucket is then computed and treated as its centre of gravity. An object (leaf bucket) with high density attracts other objects (leaf buckets) with lower density [7]. Using a gravity function we can calculate the attraction force between two objects (leaf buckets): the degree of gravity is in direct ratio to the product of the two objects' densities and in inverse ratio to the square of their distance [9].

A. Calculation of density of a leaf bucket
   Consider a set of n data points (o_1, o_2, ..., o_n) occupying a t-dimensional space. Each point o_i has associated with it the coordinates (o_i1, o_i2, ..., o_it). There exists a bounding box (bucket) which contains all data points and whose extrema are defined by the maximum and minimum coordinate values of the data points in each dimension (Figure 1a). The data is then divided into two sub-buckets by splitting along the median value of the coordinates of the parent bucket, so that the number of points in each sub-bucket remains roughly the same. The division process is repeated recursively on each sub-bucket until leaf buckets are created, each containing a single data point (Figures 1b-1c). The density of each leaf bucket is calculated from the bucket size and the number of data points present in that bucket (here each leaf bucket contains a single data point, but the bucket sizes differ); the more data points a bucket of a given size contains, the higher its density [11].

   Figure 1 (a-c): Implementation of the kd-tree [figure images not reproduced].
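   A small C sketch of one possible reading of this density computation follows: the density of a leaf bucket is taken as the number of points it holds divided by its area (its "bucket size" in two dimensions). The interpretation of bucket size as area, and the bucket extents and counts used, are assumptions for illustration only.

/* Leaf bucket density sketch: density = (point count) / (bucket area).
   The bucket extents and counts below are illustrative assumptions,
   not data from the paper. */
#include <stdio.h>

struct bucket {
    double xmin, ymin, xmax, ymax;  /* bounding box of the leaf bucket */
    int    count;                   /* data points inside (1 per leaf here) */
};

static double density(const struct bucket *b) {
    double area = (b->xmax - b->xmin) * (b->ymax - b->ymin);
    return (double)b->count / area; /* smaller bucket => higher density */
}

int main(void) {
    struct bucket leaves[] = {
        {0.0, 0.0, 1.0, 1.0, 1},    /* tight bucket: high density */
        {0.0, 0.0, 4.0, 4.0, 1},    /* large bucket: low density  */
    };
    for (int i = 0; i < 2; i++)
        printf("leaf %d density = %g\n", i, density(&leaves[i]));
    return 0;
}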
B. Calculation of force of attraction
   The force of gravity is given by F = G m_1 m_2 / r^2, where m_1 and m_2 are the masses of the two objects, r is the distance between them, and G is the universal gravitational constant. The formula used in this paper has been adapted from [9], with the density value calculated in Section IV.A taking the place of mass. The force of attraction between two buckets b_i and b_j is

      f_ij = (d_i * d_j) / dis_ij^2

where d_i and d_j are the densities of buckets b_i and b_j, and dis_ij is the distance between them.
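   The C sketch below evaluates this adapted gravity formula directly; only the example densities and bucket centres are assumptions.

/* Force of attraction between two leaf buckets, following the
   adapted gravity formula f_ij = (d_i * d_j) / dis_ij^2.  The bucket
   centres and densities below are illustrative assumptions. */
#include <stdio.h>

static double attraction(double di, double dj,
                         double xi, double yi, double xj, double yj) {
    double dx = xi - xj, dy = yi - yj;
    double dist2 = dx * dx + dy * dy;   /* squared distance between centres */
    return (di * dj) / dist2;
}

int main(void) {
    /* bucket i: density 2.0 at (1,1); bucket j: density 0.5 at (3,2) */
    printf("f_ij = %g\n", attraction(2.0, 0.5, 1.0, 1.0, 3.0, 2.0));
    return 0;
}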
                                                                               recalculated at each iteration because of merger of buckets.
C. Calculation of gain
   Jung et al. proposed clustering gain as a measure of clustering optimality. It is based on the squared-error sum as a clustering algorithm proceeds, and it is applicable to both hierarchical and partitional clustering methods for estimating the desired number of clusters. The authors showed that the clustering gain has a maximum value at the optimal number of clusters. In addition, clustering gain is cheap to compute, so it can be evaluated at each step of the clustering process to determine the optimal number of clusters without increasing the computational complexity [10]. The clustering gain is computed as

      \Delta = \sum_{j=1}^{K} (s_j - 1) \, \| o_0 - o_0^{(j)} \|_2^2

where s_j is the number of data points in cluster C_j, K denotes the number of clusters, o_0 is the global centroid defined as o_0 = (1/n) \sum_{i=1}^{n} o_i (n being the total number of data points and o_i the data points), and o_0^{(j)} denotes the centroid of cluster j, defined as o_0^{(j)} = (1/s_j) \sum_{i=1}^{s_j} o_i^{(j)}, where o_i^{(j)} denotes the data points belonging to the j-th cluster.
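   The following C sketch evaluates this gain formula for a fixed clustering. The six points and the two-cluster labelling are assumptions introduced to give the formula something to work on.

/* Clustering gain sketch after Jung et al. [10]:
   gain = sum_j (s_j - 1) * ||o0 - o0_j||^2, where o0 is the global
   centroid and o0_j the centroid of cluster j.  The points and the
   2-cluster labelling are illustrative assumptions. */
#include <stdio.h>

#define N 6
#define K 2

int main(void) {
    double pt[N][2] = { {1,1}, {1.2,1.1}, {0.9,1.3},
                        {8,8}, {8.1,7.9}, {7.8,8.2} };
    int label[N]    = { 0, 0, 0, 1, 1, 1 };   /* assumed cluster assignment */

    /* global centroid o0 */
    double gx = 0, gy = 0;
    for (int i = 0; i < N; i++) { gx += pt[i][0]; gy += pt[i][1]; }
    gx /= N; gy /= N;

    double gain = 0.0;
    for (int j = 0; j < K; j++) {
        double cx = 0, cy = 0;
        int sj = 0;
        for (int i = 0; i < N; i++)
            if (label[i] == j) { cx += pt[i][0]; cy += pt[i][1]; sj++; }
        cx /= sj; cy /= sj;                   /* cluster centroid o0_j */
        double dx = gx - cx, dy = gy - cy;
        gain += (sj - 1) * (dx * dx + dy * dy);
    }
    printf("clustering gain = %g\n", gain);
    return 0;
}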




D. The GLHL Algorithm
   The technique is based on picking the highest-density leaf bucket and calculating its gravitational attraction force with the lower-density leaf buckets. The pair of buckets with the maximum force of attraction is merged during each iteration, and the iterations continue until a single bucket remains in the bucket set B. The algorithm uses the following data structures:

   B                 : set of buckets
   D[1..n]           : density vector, where n is the number of data points
   g[0..n]           : gain array
   dis[1..n][1..n]   : distance array
   F1[1..n][1..n]    : force array
   c[1..n]           : number of clusters

   All these data structures (B, D, g, dis and F1) need to be recalculated at each iteration because of the merging of buckets. The pseudocode for the GLHL algorithm is given in Algorithm 1.

   Algorithm 1: Gravitational based hierarchical clustering algorithm (GLHL)

   B   = {b_i}    : b_i    = kd_tree_partitioning(dataset);
   D   = [d_i]    : d_i    = calculate_density(b_i);
   dis = [dis_ij] : dis_ij = calculate_distance(b_i, b_j)  for all buckets;
   F1  = [f_ij]   : f_ij   = calculate_force(b_i, b_j)     for all buckets;
   g[0] = calculate_gain();   // each data point is a cluster; g stores the gain values
   q = 1;                     // counter for iterations
   Repeat
       N = |B|;                               // number of buckets
       If N = 1 then break;                   // only one bucket remaining
       merge(b_i, b_j) : f_ij = find_max(F1);
       Update B, D, dis, F1;
       g[q] = calculate_gain();
       c[q] = N;                              // c stores the number of clusters (buckets)
       q = q + 1;
   Until true;
   Return c[k] as the optimal number of clusters, where g[k] = find_max(g);
                                                                              [6] T Zhang, R Ramakrishnan, M Linvy, “BIRCH: an
                                                                                   efficient data clustering method for very large
   The GLHL algorithm functions in four phases. Phase 1 initializes the buckets by kd-tree partitioning, along with the data structures: the density vector D, the distance matrix dis, the gravitational force matrix F1, and the initial gain g[0] for all n buckets, where n is the total number of data points. Phase 2 checks the exit criterion that only one bucket remains. Phase 3 finds the maximum entry f_ij in the F1 matrix and merges the buckets b_i and b_j corresponding to the (i, j) pair. Phase 4 updates B, D, dis, F1 and g, and stores the current number of clusters in c; finally the algorithm returns the number of clusters k corresponding to the maximum gain.
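   A compact, self-contained C sketch of this merge loop is given below. It is not the authors' implementation: treating every point as a unit-density bucket, merging buckets by taking a weighted centroid and summing densities, and using Euclidean centre distances are all simplifying assumptions made to keep the example short. The loop merges the bucket pair with maximum attraction, recomputes the gain after each merge, and reports the bucket count at which the gain was largest.

/* Self-contained sketch of the GLHL merge loop (phases 1-4).  Not the
   authors' implementation: every point starts as a unit-density bucket,
   merged buckets get a weighted centroid and summed density, and the
   gain is recomputed after each merge.  The data set is an
   illustrative assumption. */
#include <stdio.h>

#define N 6

static double px[N] = {1, 1.2, 0.9, 8, 8.1, 7.8};
static double py[N] = {1, 1.1, 1.3, 8, 7.9, 8.2};

struct bucket { double x, y, dens; int size; };

static double gain(struct bucket *b, int nb, double gx, double gy) {
    double g = 0;
    for (int j = 0; j < nb; j++) {
        double dx = gx - b[j].x, dy = gy - b[j].y;
        g += (b[j].size - 1) * (dx * dx + dy * dy);
    }
    return g;
}

int main(void) {
    struct bucket b[N];
    double gx = 0, gy = 0;
    for (int i = 0; i < N; i++) {            /* Phase 1: initial buckets */
        b[i].x = px[i]; b[i].y = py[i]; b[i].dens = 1.0; b[i].size = 1;
        gx += px[i]; gy += py[i];
    }
    gx /= N; gy /= N;                        /* global centroid */

    int nb = N, bestk = N;
    double bestgain = gain(b, nb, gx, gy);   /* g[0]: every point its own cluster */

    while (nb > 1) {                         /* Phase 2: exit when one bucket remains */
        int a = 0, c = 1;
        double fmax = -1;
        for (int i = 0; i < nb; i++)         /* Phase 3: pair with maximum attraction */
            for (int j = i + 1; j < nb; j++) {
                double dx = b[i].x - b[j].x, dy = b[i].y - b[j].y;
                double f = (b[i].dens * b[j].dens) / (dx * dx + dy * dy);
                if (f > fmax) { fmax = f; a = i; c = j; }
            }
        /* Phase 4: merge bucket c into bucket a and update the structures */
        int sz = b[a].size + b[c].size;
        b[a].x = (b[a].x * b[a].size + b[c].x * b[c].size) / sz;
        b[a].y = (b[a].y * b[a].size + b[c].y * b[c].size) / sz;
        b[a].dens += b[c].dens;
        b[a].size  = sz;
        b[c] = b[--nb];                      /* remove bucket c */

        double g = gain(b, nb, gx, gy);
        if (g > bestgain) { bestgain = g; bestk = nb; }
    }
    printf("optimal number of clusters (max gain) = %d\n", bestk);
    return 0;
}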
                                                                                   criteria for the optimal number of clusters in
                          V. RESULTS

   The GLHL algorithm was implemented in the C language and validation was performed. The data set used consists of synthetic two-dimensional data points generated by mouse clicks. The results are presented in Table 5.1; the rows in bold font give the optimal number of clusters in each case, as indicated by the maximum gain value.

   Table 5.1: Dataset [table not reproduced].

                          VI. CONCLUSION

   In this paper a new gravitational-based hierarchical clustering algorithm using a kd-tree has been proposed. The kd-tree structure is used to generate densely populated packets, and the clusters are formed by calculating the gravitational force between the packets. To validate the performance of the algorithm we have used the clustering gain of Section IV.C; the validation results have been presented in Section V.

                          REFERENCES

[1]  J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
[2]  M. H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education.
[3]  Y. Liu and Z. Liu, "An Improved Hierarchical K-Means Algorithm for Web Document Clustering," Int. Conf. on Computer Science and Information Technology, IEEE, 2008.
[4]  S. Lee, W. Lee, S. Chung, D. An, I. Bok, and H. Ryu, "Selection of Cluster Hierarchy Depth in Hierarchical Clustering using K-Means Algorithm," Int. Symp. on Information Technology Convergence, IEEE, 2007.
[5]  G. Karypis, E.-H. Han, and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling," Computer, 32: 68-75, IEEE, 1999.
[6]  T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 103-114, Canada, 1996.
[7]  D. Jiang, J. Pei, and A. Zhang, "DHC: A Density-based Hierarchical Clustering Method for Time Series Gene Expression Data," IEEE.
[8]  M. de Berg et al., Computational Geometry: Algorithms and Applications, 2nd Edition, Springer.
[9]  M. Jian, L. Cheng, and W. Xiang, "Research of Gravity-based Outliers Detection," Int. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, IEEE, 2008.
[10] Y. Jung, H. Park, D. Z. Du, and B. L. Drake, "A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering," Journal of Global Optimization, 25, 91-111, 2003.
[11] S. J. Redmond and C. Heneghan, "A Method for Initialising the K-Means Clustering Algorithm Using kd-trees," Pattern Recognition Letters, 28 (2007), 965-973.
