ACEEE Int. J. on Communication, Vol. 01, No. 03, Dec 2010




                       Gravitational Based Hierarchical
                            Clustering Algorithm
                                             P S Bishnu1 and V Bhattacherjee2
                                     Department of Computer Science and Engineering
                                         Birla Institute of Technology, Ranchi, India
                                Email: 1psbishnu@gmail.com, 2vbhattacharya@bitmesra.ac.in


Abstract—We propose a new gravitational-based hierarchical clustering algorithm using a kd-tree. The kd-tree generates densely populated packets, and the clusters are found using the gravitational force between the packets. The resulting clusters are of high quality, and the method is effective and robust. The proposed algorithm has been tested on a synthetic dataset and the results are presented.

Index Terms—Data mining, hierarchical clustering, kd-tree
                        I. INTRODUCTION

   Hierarchical clustering generates a hierarchical series of nested clusters which can be represented graphically by a tree called a dendrogram. By cutting the dendrogram at some level, we can obtain a specified number of clusters [3]. Due to its nested structure, hierarchical clustering is effective and gives better structural information. This paper presents a new gravity-based hierarchical technique using a kd-tree. We call the new algorithm GLHL (an anagram of the bold letters in GravitationaL Based Hierarchical ALgorithm). The paper is organized as follows: Section II surveys related work. Section III gives an overview of hierarchical clustering and kd-trees. Section IV presents the proposed algorithm. Results are discussed in Section V and conclusions in Section VI.
                        II. RELATED WORK

   Here we review some of the pioneering methods in hierarchical clustering. Zhang et al. proposed the BIRCH technique [6]. It overcomes two limitations of agglomerative clustering: poor scalability and the inability to undo what was done in a previous step [1][6]. BIRCH is designed for clustering large datasets, is incremental and hierarchical, and can handle outliers. It introduces the clustering feature (CF), which summarizes information about a cluster; a CF tree is built from these features and holds CF information about its subclusters [2][6]. BIRCH applies only to numeric data [2], and it does not perform well when the clusters are not spherical [1]. Jiang et al. proposed DHC (density-based hierarchical clustering) [7]. It uses two data structures: a density tree, which uncovers the embedded clusters, and an attraction tree, which explores the inner structure of the clusters, the cluster boundaries, and the outliers.

        III. OVERVIEW OF AGGLOMERATIVE CLUSTERING ALGORITHM AND KD-TREE

A. Agglomerative hierarchical clustering algorithm
   Agglomerative hierarchical algorithms begin with every data object as an individual cluster. At each step the two most similar clusters are merged, so after each merge the total number of clusters decreases by one. These steps are repeated until the desired number of clusters is obtained or the distance between the two closest clusters exceeds a threshold [5].
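   To make this bottom-up process concrete, the following minimal C sketch merges the two closest clusters at each step until a target number of clusters remains. It illustrates only the generic agglomerative scheme of this subsection, not GLHL itself; the six 2-D points, the single-linkage distance, and the stopping value K = 2 are assumptions introduced for the example.

/* Minimal single-linkage agglomerative clustering sketch.
   The six 2-D points and the target of K = 2 clusters are
   illustrative assumptions, not data from the paper. */
#include <stdio.h>
#include <math.h>

#define N 6
#define K 2          /* stop when K clusters remain */

int main(void) {
    double pt[N][2] = { {1,1}, {1.2,1.1}, {0.9,1.3},
                        {8,8}, {8.1,7.9}, {7.8,8.2} };
    int label[N];                 /* cluster id of each point */
    int clusters = N;
    for (int i = 0; i < N; i++) label[i] = i;   /* every point starts as a cluster */

    while (clusters > K) {
        int a = -1, b = -1;
        double best = 1e300;
        /* single linkage: closest pair of points lying in different clusters */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (label[i] != label[j]) {
                    double dx = pt[i][0] - pt[j][0];
                    double dy = pt[i][1] - pt[j][1];
                    double d  = sqrt(dx*dx + dy*dy);
                    if (d < best) { best = d; a = label[i]; b = label[j]; }
                }
        /* merge cluster b into cluster a; cluster count drops by one */
        for (int i = 0; i < N; i++)
            if (label[i] == b) label[i] = a;
        clusters--;
    }
    for (int i = 0; i < N; i++)
        printf("point %d -> cluster %d\n", i, label[i]);
    return 0;
}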



B. kd-Tree
   The kd-tree is a geometric, top-down hierarchical tree data structure. At the root, the whole data space is divided by a vertical line into two subsets of roughly equal size. The splitting line is stored at the root; the data points to its left become the content of the left subtree and the data points to its right become the content of the right subtree. At the next level, each node is partitioned again along the alternate direction (e.g. if the previous level was partitioned by a vertical line, this level is partitioned by a horizontal line). At each partitioning, data points to the left of or on a vertical line are assigned to the left subtree and the rest to the right subtree. The process continues until the criterion function has converged [8].
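   The following minimal C sketch illustrates this alternating median split for 2-D points, recursing until each leaf bucket holds a single point (the way the leaf buckets used by GLHL are produced in Section IV.A). The sample points and the simple insertion-sort median selection are assumptions made for brevity, not part of the paper.

/* Minimal 2-D kd-tree bucket split sketch: recursively split the
   point set at the median of the current axis (alternating x / y)
   until each leaf bucket holds a single point.  The sample points
   are an illustrative assumption. */
#include <stdio.h>

#define DIM 2

static double pts[][DIM] = { {2,3}, {5,4}, {9,6}, {4,7}, {8,1}, {7,2} };
static const int NPTS = 6;

/* sort pts[lo..hi-1] by the given axis (small n: insertion sort) */
static void sort_axis(double p[][DIM], int lo, int hi, int axis) {
    for (int i = lo + 1; i < hi; i++)
        for (int j = i; j > lo && p[j][axis] < p[j-1][axis]; j--) {
            double tx = p[j][0], ty = p[j][1];
            p[j][0] = p[j-1][0]; p[j][1] = p[j-1][1];
            p[j-1][0] = tx;      p[j-1][1] = ty;
        }
}

static void split(double p[][DIM], int lo, int hi, int depth) {
    if (hi - lo <= 1) {                       /* leaf bucket: one point */
        printf("leaf bucket: (%g, %g)\n", p[lo][0], p[lo][1]);
        return;
    }
    int axis = depth % DIM;                   /* alternate x and y */
    sort_axis(p, lo, hi, axis);
    int mid = (lo + hi) / 2;                  /* median split */
    split(p, lo, mid, depth + 1);             /* left / lower sub-bucket */
    split(p, mid, hi, depth + 1);             /* right / upper sub-bucket */
}

int main(void) {
    split(pts, 0, NPTS, 0);
    return 0;
}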
     IV. THE PROPOSED GRAVITATIONAL BASED HIERARCHICAL ALGORITHM

   The proposed algorithm uses a kd-tree to divide the data space into regions called leaf buckets and calculates the density of each leaf bucket. The mean of each leaf bucket is then computed and treated as its centre of gravity. An object (leaf bucket) with high density attracts other objects (leaf buckets) with lower density [7]. Using a gravity function we can calculate the attraction force between two objects (leaf buckets): the degree of gravity is in direct ratio to the product of the two objects' densities and in inverse ratio to the square of their distance [9].

A. Calculation of density of a leaf bucket
   Consider a set of n data points (o_1, o_2, ..., o_n) occupying a t-dimensional space. Each point o_i has associated with it the coordinates (o_i1, o_i2, ..., o_it). There exists a bounding box (bucket) which contains all data points and whose extrema are defined by the maximum and minimum coordinate values of the data points in each dimension (Figure 1a). The data is then divided into two sub-buckets by splitting along the median value of the coordinates of the parent bucket, so that the number of points in each sub-bucket remains roughly the same. The division process is repeated recursively on each sub-bucket until leaf buckets are created, each containing a single data point (Figures 1b-1c). The density of each leaf bucket is calculated from the bucket size and the number of data points present in that bucket (here each leaf bucket contains a single data point, but the bucket sizes differ); the more data points a bucket of a given size contains, the higher its density [11].

   Figure 1 (a-c): Implementation of the kd-tree [figure images not reproduced].
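   A small C sketch of one possible reading of this density computation follows: the density of a leaf bucket is taken as the number of points it holds divided by its area (its "bucket size" in two dimensions). The interpretation of bucket size as area, and the bucket extents and counts used, are assumptions for illustration only.

/* Leaf bucket density sketch: density = (point count) / (bucket area).
   The bucket extents and counts below are illustrative assumptions,
   not data from the paper. */
#include <stdio.h>

struct bucket {
    double xmin, ymin, xmax, ymax;  /* bounding box of the leaf bucket */
    int    count;                   /* data points inside (1 per leaf here) */
};

static double density(const struct bucket *b) {
    double area = (b->xmax - b->xmin) * (b->ymax - b->ymin);
    return (double)b->count / area; /* smaller bucket => higher density */
}

int main(void) {
    struct bucket leaves[] = {
        {0.0, 0.0, 1.0, 1.0, 1},    /* tight bucket: high density */
        {0.0, 0.0, 4.0, 4.0, 1},    /* large bucket: low density  */
    };
    for (int i = 0; i < 2; i++)
        printf("leaf %d density = %g\n", i, density(&leaves[i]));
    return 0;
}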
B. Calculation of force of attraction
   The force of gravity is given by F = G m_1 m_2 / r^2, where m_1 and m_2 are the masses of the two objects, r is the distance between them, and G is the universal gravitational constant. The formula used in this paper has been adapted from [9], with the density value calculated in Section IV.A taking the place of mass. The force of attraction between two buckets b_i and b_j is

      f_ij = (d_i * d_j) / dis_ij^2

where d_i and d_j are the densities of buckets b_i and b_j, and dis_ij is the distance between them.
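   The C sketch below evaluates this adapted gravity formula directly; only the example densities and bucket centres are assumptions.

/* Force of attraction between two leaf buckets, following the
   adapted gravity formula f_ij = (d_i * d_j) / dis_ij^2.  The bucket
   centres and densities below are illustrative assumptions. */
#include <stdio.h>

static double attraction(double di, double dj,
                         double xi, double yi, double xj, double yj) {
    double dx = xi - xj, dy = yi - yj;
    double dist2 = dx * dx + dy * dy;   /* squared distance between centres */
    return (di * dj) / dist2;
}

int main(void) {
    /* bucket i: density 2.0 at (1,1); bucket j: density 0.5 at (3,2) */
    printf("f_ij = %g\n", attraction(2.0, 0.5, 1.0, 1.0, 3.0, 2.0));
    return 0;
}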
                                                                               recalculated at each iteration because of merger of buckets.
C. Calculation of gain
   Jung et al. proposed clustering gain as a measure of clustering optimality. It is based on the squared-error sum as a clustering algorithm proceeds, and it is applicable to both hierarchical and partitional clustering methods for estimating the desired number of clusters. The authors showed that the clustering gain has a maximum value at the optimal number of clusters. In addition, clustering gain is cheap to compute, so it can be evaluated at each step of the clustering process to determine the optimal number of clusters without increasing the computational complexity [10]. The clustering gain is computed as

      \Delta = \sum_{j=1}^{K} (s_j - 1) \, \| o_0 - o_0^{(j)} \|_2^2

where s_j is the number of data points in cluster C_j, K denotes the number of clusters, o_0 is the global centroid defined as o_0 = (1/n) \sum_{i=1}^{n} o_i (n being the total number of data points and o_i the data points), and o_0^{(j)} denotes the centroid of cluster j, defined as o_0^{(j)} = (1/s_j) \sum_{i=1}^{s_j} o_i^{(j)}, where o_i^{(j)} denotes the data points belonging to the j-th cluster.
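   The following C sketch evaluates this gain formula for a fixed clustering. The six points and the two-cluster labelling are assumptions introduced to give the formula something to work on.

/* Clustering gain sketch after Jung et al. [10]:
   gain = sum_j (s_j - 1) * ||o0 - o0_j||^2, where o0 is the global
   centroid and o0_j the centroid of cluster j.  The points and the
   2-cluster labelling are illustrative assumptions. */
#include <stdio.h>

#define N 6
#define K 2

int main(void) {
    double pt[N][2] = { {1,1}, {1.2,1.1}, {0.9,1.3},
                        {8,8}, {8.1,7.9}, {7.8,8.2} };
    int label[N]    = { 0, 0, 0, 1, 1, 1 };   /* assumed cluster assignment */

    /* global centroid o0 */
    double gx = 0, gy = 0;
    for (int i = 0; i < N; i++) { gx += pt[i][0]; gy += pt[i][1]; }
    gx /= N; gy /= N;

    double gain = 0.0;
    for (int j = 0; j < K; j++) {
        double cx = 0, cy = 0;
        int sj = 0;
        for (int i = 0; i < N; i++)
            if (label[i] == j) { cx += pt[i][0]; cy += pt[i][1]; sj++; }
        cx /= sj; cy /= sj;                   /* cluster centroid o0_j */
        double dx = gx - cx, dy = gy - cy;
        gain += (sj - 1) * (dx * dx + dy * dy);
    }
    printf("clustering gain = %g\n", gain);
    return 0;
}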




D. The GLHL Algorithm
   The technique is based on picking the highest-density leaf bucket and calculating its gravitational attraction force with the lower-density leaf buckets. The pair of buckets with the maximum force of attraction is merged during each iteration, and the iterations continue until a single bucket remains in the bucket set B. The algorithm uses the following data structures:

   B                 : set of buckets
   D[1..n]           : density vector, where n is the number of data points
   g[0..n]           : gain array
   dis[1..n][1..n]   : distance array
   F1[1..n][1..n]    : force array
   c[1..n]           : number of clusters

   All these data structures (B, D, g, dis and F1) need to be recalculated at each iteration because of the merging of buckets. The pseudocode for the GLHL algorithm is given in Algorithm 1.

   Algorithm 1: Gravitational based hierarchical clustering algorithm (GLHL)

   B   = {b_i}    : b_i    = kd_tree_partitioning(dataset);
   D   = [d_i]    : d_i    = calculate_density(b_i);
   dis = [dis_ij] : dis_ij = calculate_distance(b_i, b_j)  for all buckets;
   F1  = [f_ij]   : f_ij   = calculate_force(b_i, b_j)     for all buckets;
   g[0] = calculate_gain();   // each data point is a cluster; g stores the gain values
   q = 1;                     // counter for iterations
   Repeat
       N = |B|;                               // number of buckets
       If N = 1 then break;                   // only one bucket remaining
       merge(b_i, b_j) : f_ij = find_max(F1);
       Update B, D, dis, F1;
       g[q] = calculate_gain();
       c[q] = N;                              // c stores the number of clusters (buckets)
       q = q + 1;
   Until true;
   Return c[k] as the optimal number of clusters, where g[k] = find_max(g);
                                                                              [6] T Zhang, R Ramakrishnan, M Linvy, “BIRCH: an
                                                                                   efficient data clustering method for very large
   The GLHL algorithm functions in four phases. Phase 1 initializes the buckets by kd-tree partitioning, along with the data structures: the density vector D, the distance matrix dis, the gravitational force matrix F1, and the initial gain g[0] for all n buckets, where n is the total number of data points. Phase 2 checks the exit criterion that only one bucket remains. Phase 3 finds the maximum entry f_ij in the F1 matrix and merges the buckets b_i and b_j corresponding to the (i, j) pair. Phase 4 updates B, D, dis, F1 and g, and stores the current number of clusters in c; finally the algorithm returns the number of clusters k corresponding to the maximum gain.
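   A compact, self-contained C sketch of this merge loop is given below. It is not the authors' implementation: treating every point as a unit-density bucket, merging buckets by taking a weighted centroid and summing densities, and using Euclidean centre distances are all simplifying assumptions made to keep the example short. The loop merges the bucket pair with maximum attraction, recomputes the gain after each merge, and reports the bucket count at which the gain was largest.

/* Self-contained sketch of the GLHL merge loop (phases 1-4).  Not the
   authors' implementation: every point starts as a unit-density bucket,
   merged buckets get a weighted centroid and summed density, and the
   gain is recomputed after each merge.  The data set is an
   illustrative assumption. */
#include <stdio.h>

#define N 6

static double px[N] = {1, 1.2, 0.9, 8, 8.1, 7.8};
static double py[N] = {1, 1.1, 1.3, 8, 7.9, 8.2};

struct bucket { double x, y, dens; int size; };

static double gain(struct bucket *b, int nb, double gx, double gy) {
    double g = 0;
    for (int j = 0; j < nb; j++) {
        double dx = gx - b[j].x, dy = gy - b[j].y;
        g += (b[j].size - 1) * (dx * dx + dy * dy);
    }
    return g;
}

int main(void) {
    struct bucket b[N];
    double gx = 0, gy = 0;
    for (int i = 0; i < N; i++) {            /* Phase 1: initial buckets */
        b[i].x = px[i]; b[i].y = py[i]; b[i].dens = 1.0; b[i].size = 1;
        gx += px[i]; gy += py[i];
    }
    gx /= N; gy /= N;                        /* global centroid */

    int nb = N, bestk = N;
    double bestgain = gain(b, nb, gx, gy);   /* g[0]: every point its own cluster */

    while (nb > 1) {                         /* Phase 2: exit when one bucket remains */
        int a = 0, c = 1;
        double fmax = -1;
        for (int i = 0; i < nb; i++)         /* Phase 3: pair with maximum attraction */
            for (int j = i + 1; j < nb; j++) {
                double dx = b[i].x - b[j].x, dy = b[i].y - b[j].y;
                double f = (b[i].dens * b[j].dens) / (dx * dx + dy * dy);
                if (f > fmax) { fmax = f; a = i; c = j; }
            }
        /* Phase 4: merge bucket c into bucket a and update the structures */
        int sz = b[a].size + b[c].size;
        b[a].x = (b[a].x * b[a].size + b[c].x * b[c].size) / sz;
        b[a].y = (b[a].y * b[a].size + b[c].y * b[c].size) / sz;
        b[a].dens += b[c].dens;
        b[a].size  = sz;
        b[c] = b[--nb];                      /* remove bucket c */

        double g = gain(b, nb, gx, gy);
        if (g > bestgain) { bestgain = g; bestk = nb; }
    }
    printf("optimal number of clusters (max gain) = %d\n", bestk);
    return 0;
}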
                                                                                   criteria for the optimal number of clusters in
                          V. RESULTS

   The GLHL algorithm was implemented in the C language and validation was performed. The data set used consists of synthetic two-dimensional data points generated by mouse clicks. The results are presented in Table 5.1; the rows in bold font give the optimal number of clusters in each case, as indicated by the maximum gain value.

   Table 5.1: Dataset [table not reproduced].

                          VI. CONCLUSION

   In this paper a new gravitational-based hierarchical clustering algorithm using a kd-tree has been proposed. The kd-tree structure is used to generate densely populated packets, and the clusters are formed by calculating the gravitational force between the packets. To validate the performance of the algorithm we have used the clustering gain of Section IV.C; the validation results have been presented in Section V.

                          REFERENCES

[1]  J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
[2]  M. H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education.
[3]  Y. Liu and Z. Liu, "An Improved Hierarchical K-Means Algorithm for Web Document Clustering," Int. Conf. on Computer Science and Information Technology, IEEE, 2008.
[4]  S. Lee, W. Lee, S. Chung, D. An, I. Bok, and H. Ryu, "Selection of Cluster Hierarchy Depth in Hierarchical Clustering using K-Means Algorithm," Int. Symp. on Information Technology Convergence, IEEE, 2007.
[5]  G. Karypis, E.-H. Han, and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling," Computer, 32: 68-75, IEEE, 1999.
[6]  T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 103-114, Canada, 1996.
[7]  D. Jiang, J. Pei, and A. Zhang, "DHC: A Density-based Hierarchical Clustering Method for Time Series Gene Expression Data," IEEE.
[8]  M. de Berg et al., Computational Geometry: Algorithms and Applications, 2nd Edition, Springer.
[9]  M. Jian, L. Cheng, and W. Xiang, "Research of Gravity-based Outliers Detection," Int. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, IEEE, 2008.
[10] Y. Jung, H. Park, D. Z. Du, and B. L. Drake, "A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering," Journal of Global Optimization, 25, 91-111, 2003.
[11] S. J. Redmond and C. Heneghan, "A Method for Initialising the K-Means Clustering Algorithm Using kd-trees," Pattern Recognition Letters, 28 (2007), 965-973.
