Parallel K-means Clustering in
Erlang
Guide: Dr. Govindarajulu R.
Chinmay Patel - 201405627
Dharak Kharod - 201405583
Pawan Kumar - 201405637
The Clustering Problem
➢ Cluster analysis or clustering is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar (in
some sense or another) to each other than to those in other groups (clusters).
➢ Unsupervised learning
➢ NP-hard problem
➢ Organizing data into clusters such that there is:
○ High intra-cluster similarity
○ Low inter-cluster similarity
○ Informally, finding natural grouping among objects
K-means Clustering Algorithm
➢ In k-means clustering, a set of n data points in d-dimensional space R^d and an
integer k are given, and the problem is to determine a set of k points in R^d,
called centers (means), so as to minimize the mean squared distance from
each data point to its nearest center.
➢ Fast, robust, easy to understand; a partitional, non-hierarchical clustering
method.
➢ The k-means algorithm does not necessarily find the optimal configuration
corresponding to the global minimum of the objective function. The
algorithm is also significantly sensitive to the initially (randomly) selected
cluster centres.
Erlang Language
➢ Lightweight Concurrency
➢ Built-in fault tolerance and asynchronous message passing
➢ No shared state
➢ Pattern matching
➢ Used at: Facebook chat service backend, Amazon SimpleDB
➢ Our experiments: Leader Election Algorithm, Chat server
Standard Algorithm
➢ Given an initial set of k means m_1^(1), …, m_k^(1), the algorithm proceeds by
alternating between two steps:
➢ Assignment step: Assign each observation to the cluster whose mean yields
the least within-cluster sum of squares (WCSS). Since the sum of squares is
the squared Euclidean distance, this is intuitively the "nearest" mean:
   S_i^(t) = { x_p : ‖x_p − m_i^(t)‖² ≤ ‖x_p − m_j^(t)‖² for all 1 ≤ j ≤ k },
where each x_p is assigned to exactly one S_i^(t).
Standard Algorithm
➢ Update step: Calculate the new means to be the centroids of the observations
in the new clusters:
   m_i^(t+1) = (1 / |S_i^(t)|) · Σ_{x_j ∈ S_i^(t)} x_j
➢ Since the arithmetic mean is a least-squares estimator, this also minimizes
the within-cluster sum of squares (WCSS) objective.
➢ The algorithm has converged when the assignments no longer change. Since
both steps optimize the WCSS objective, and there are only finitely many
such partitionings, the algorithm must converge to a (local) optimum.
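➢ For illustration, a minimal sequential sketch of one such iteration in Erlang
(our own module and function names, not the project code; points and means are
equal-length lists of numbers):

-module(kmeans_step).
-export([step/2, nearest/2, dist2/2, vadd/2]).

%% One Lloyd iteration: assign every point to its nearest mean (assignment
%% step), then recompute each mean as the centroid of its points (update step).
step(Points, Means) ->
    Assigned = [{nearest(P, Means), P} || P <- Points],
    [centroid([P || {M, P} <- Assigned, M =:= Mean], Mean) || Mean <- Means].

%% The mean whose squared Euclidean distance to P is smallest.
nearest(P, Means) ->
    {_, Best} = lists:min([{dist2(P, M), M} || M <- Means]),
    Best.

dist2(P, Q) ->
    lists:sum([(X - Y) * (X - Y) || {X, Y} <- lists:zip(P, Q)]).

%% Centroid of a non-empty cluster; an empty cluster keeps its old mean.
centroid([], OldMean) -> OldMean;
centroid(Ps, _OldMean) ->
    N = length(Ps),
    [S / N || S <- lists:foldl(fun vadd/2, hd(Ps), tl(Ps))].

vadd(P, Q) -> [X + Y || {X, Y} <- lists:zip(P, Q)].

Iterating step/2 until the means stop changing gives the sequential algorithm.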
Naive Parallel K-means Clustering Algorithm
➢ In the parallel version of the K-means algorithm, the work of calculating means
and grouping data into clusters is divided among several nodes.
➢ Suppose there are N worker nodes; we then divide our data set into N
approximately equal parts, and each part is sent to one worker node.
The server node sends the initial set of K means to each worker node.
➢ Each worker node divides its own sublist into K clusters, based on the K
means sent from the server node.
Naive Parallel K-means Clustering Algorithm
➢ After computing its K sub-clusters, each worker node, instead of sending the
whole sub-clusters back, sends the sum of the points in each of its K
sub-clusters together with the count of points in each sub-cluster. The server
node then calculates the actual mean of each cluster across all workers.
➢ This yields a set of new means, which is again sent to each worker node,
and the process repeats until the means no longer change (a worker-side
sketch follows below).
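➢ A sketch of the worker side of this exchange in Erlang (illustrative process
and message names; assumes the nearest/2 and vadd/2 helpers from the previous
sketch):

%% Each worker keeps its slice of the data and, for every set of means it
%% receives, reports per-cluster {VectorSum, Count} pairs instead of the
%% raw points, so the master can compute exact global means.
worker(Master, LocalPoints) ->
    receive
        {means, Means} ->
            Report = [cluster_sum(Mean, Means, LocalPoints) || Mean <- Means],
            Master ! {partial, self(), Report},
            worker(Master, LocalPoints);
        stop ->
            ok
    end.

cluster_sum(Mean, Means, Points) ->
    Mine = [P || P <- Points, nearest(P, Means) =:= Mean],
    case Mine of
        []        -> {none, 0};
        [P0 | Ps] -> {lists:foldl(fun vadd/2, P0, Ps), length(Mine)}
    end.

The master adds the reported sums and counts cluster-wise and divides to obtain
the new means; a combine sketch is shown with the filtering-based variant below.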
Improvised Parallel K-means Clustering
Algorithm
➢ The algorithm has to calculate the distance from each data object to every cluster
mean in each iteration. However, it is not necessary to calculate that distance
each time.
➢ The main idea of the algorithm is to keep two data structures that retain, for
every data object, its cluster label and its distance to the nearest cluster
from the previous iteration.
➢ In the next iteration we first calculate the distance between each data object
and the new mean of its own cluster; if that distance is smaller than or equal
to the stored distance to the old mean, the data object stays in the cluster it
was assigned to in the previous iteration, and no comparison against the other
means is needed (see the sketch below).
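➢ A sketch of that check in Erlang (hypothetical function names; dist2/2 as in
the first sketch; labels are 1-based indices into the list of means):

%% Each point carries {Label, Dist}: the index of its cluster and its distance
%% to that cluster's mean from the previous iteration. If the point is still
%% within Dist of its (moved) mean, the full k-way search is skipped;
%% otherwise we fall back to comparing against all k means.
reassign(P, {Label, Dist}, NewMeans) ->
    D = math:sqrt(dist2(P, lists:nth(Label, NewMeans))),
    case D =< Dist of
        true  -> {Label, D};
        false -> full_search(P, NewMeans)
    end.

full_search(P, Means) ->
    Indexed = lists:zip(lists:seq(1, length(Means)), Means),
    {D, Label} = lists:min([{math:sqrt(dist2(P, M)), I} || {I, M} <- Indexed]),
    {Label, D}.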
Introduction to KD Tree
➢ In each iteration the algorithm boils down to calculating the nearest centroid
for every data point.
➢ If we take the geometric arrangement of the data into consideration, we can
reduce the number of comparisons. This can be done using a KD tree.
➢ A KD tree is used to store spatial data and to answer nearest-neighbour queries.
➢ A KD tree is a binary tree where each node is associated with a disjoint
subset of the data.
➢ KD trees built by median splits are guaranteed a depth of about log₂ n, where
n is the number of points in the set.
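➢ One possible Erlang representation of such a tree (our own layout, shown only
to fix notation for the sketches below):

%% A kd-tree as tagged tuples: a leaf holds one point, an inner node records
%% the splitting axis, the split value, and the two subtrees.
-type point()  :: [number()].
-type kdtree() :: {leaf, point()}
                | {node, Axis :: pos_integer(), Split :: number(),
                   Left :: kdtree(), Right :: kdtree()}.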
KD Tree Construction
➢ If there is just one point, form a leaf with that point.
➢ Otherwise, cycle through the data dimensions to select the splitting plane.
➢ Split at median (to have a balanced tree)
➢ Continue recursively until both sides of the splitting plane are empty
➢ Each non-leaf node represents a hierarchical subdivision of the data into two
half-spaces using a hyperplane (the splitting plane). The hyperplane is
orthogonal to one of the coordinate axes.
➢ Constructing the k-d tree takes O(dn log n) time and O(n) storage, where d is
the dimension of the data.
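➢ A build sketch following these rules (illustrative Erlang over the layout
above; it re-sorts at every level, so it runs in O(n log² n) rather than the
optimal O(dn log n)):

%% Build a balanced kd-tree: cycle through the dimensions by depth and split
%% the points at the median coordinate along the chosen axis.
build(Points) when Points =/= [] ->
    build(Points, 0, length(hd(Points))).

build([P], _Depth, _Dims) ->
    {leaf, P};
build(Points, Depth, Dims) ->
    Axis   = (Depth rem Dims) + 1,
    Sorted = lists:sort(fun(A, B) -> lists:nth(Axis, A) =< lists:nth(Axis, B) end,
                        Points),
    {Lo, Hi} = lists:split(length(Sorted) div 2, Sorted),
    Split  = lists:nth(Axis, hd(Hi)),
    {node, Axis, Split, build(Lo, Depth + 1, Dims), build(Hi, Depth + 1, Dims)}.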
Example for 2D
Nearest Neighbour in KD Tree
➢ Query: given a kd-tree and a point in space, which point in the kd-tree is
closest to the test point?
➢ Given our current guess of the nearest neighbour, we can observe that if there
is a point in the data set that is closer to the test point than our current
guess, it must lie inside the circle centered at the test point that passes
through the current guess (see figure on next slide).
➢ This lets us prune the parts of the tree that cannot hold the true nearest
neighbour.
➢ The runtime depends on the data distribution, but the search has been shown to
take O(log n) average time per query in a reasonable model (treating d as a
constant).
Example for 2D
Nearest Neighbour in KD Tree
NNS(q: point, n: node, p: point, w: distance) {    // initial call: NNS(q, root, p, infinity)
    if n.left = null then                          // leaf case
        if distance(q, n.point) < w then return n.point else return p;
    else
        if w = infinity then                       // no best found yet
            if q(n.axis) < n.value then            // descend the side containing q first
                p := NNS(q, n.left, p, w);
                w := distance(p, q);
                if q(n.axis) + w > n.value then p := NNS(q, n.right, p, w);
            else
                p := NNS(q, n.right, p, w);
                w := distance(p, q);
                if q(n.axis) - w < n.value then p := NNS(q, n.left, p, w);
        else                                       // w is finite: visit only the sides the ball of radius w reaches
            if q(n.axis) - w < n.value then
                p := NNS(q, n.left, p, w);
                w := distance(p, q);
            if q(n.axis) + w > n.value then p := NNS(q, n.right, p, w);
    return p
}
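➢ The same search, rendered compactly in Erlang over the kd-tree layout sketched
earlier (our own code, restructured so that it returns the best point together
with its distance; the atom infinity works as the initial bound because in
Erlang's term order any number compares smaller than an atom):

%% nns(Query, Tree, BestPoint, BestDist) -> {BestPoint, BestDist}
%% Initial call: nns(Q, Root, undefined, infinity).
nns(Q, {leaf, P}, Best, W) ->
    D = dist(Q, P),
    case D < W of
        true  -> {P, D};
        false -> {Best, W}
    end;
nns(Q, {node, Axis, Split, Left, Right}, Best0, W0) ->
    %% Descend the side of the splitting plane that contains the query first.
    {Near, Far} = case lists:nth(Axis, Q) < Split of
                      true  -> {Left, Right};
                      false -> {Right, Left}
                  end,
    {Best1, W1} = nns(Q, Near, Best0, W0),
    %% Only cross the plane if the ball of radius W1 around Q reaches it.
    case abs(lists:nth(Axis, Q) - Split) < W1 of
        true  -> nns(Q, Far, Best1, W1);
        false -> {Best1, W1}
    end.

dist(P, Q) ->
    math:sqrt(lists:sum([(X - Y) * (X - Y) || {X, Y} <- lists:zip(P, Q)])).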
Parallel NN based K-means Clustering
Algorithm
➢ The master divides the data points randomly and sends them to the worker nodes
along with the initial centroids.
➢ At each worker node:
○ Create a KD tree out of the received centroids.
○ For each data point, find the nearest centroid using the NNS algorithm and
assign the point to the corresponding cluster.
○ Send the per-cluster results back to the master.
➢ Master receives results for each cluster and calculates new means.
➢ Repeat until convergence.
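➢ The worker-side assignment, sketched in Erlang on top of the build/1, nns/4 and
vadd/2 functions above (again with our own names); it returns a map from
centroid to {VectorSum, Count}:

%% Assign every local point to its nearest centroid via the kd-tree of
%% centroids, accumulating a vector sum and a point count per centroid.
assign(Points, Centroids) ->
    Tree = build(Centroids),
    lists:foldl(
        fun(P, Acc) ->
            {C, _D} = nns(P, Tree, undefined, infinity),
            maps:update_with(C,
                             fun({Sum, N}) -> {vadd(P, Sum), N + 1} end,
                             {P, 1},
                             Acc)
        end,
        #{},
        Points).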
Drawbacks of NNS based Algorithm
➢ We reduce the nearest-neighbour search cost per point from k comparisons to
about log k, so if k is small there is not much difference; and since k usually
is small, the speed-up we get is low.
➢ Because the KD tree is built over the centroids, which change every iteration,
the tree has to be rebuilt in every iteration.
➢ A method that instead builds the KD tree over the data points, which never
move, can save a lot of time. This is the basis of the filtering algorithm.
The Filtering Algorithm
➢ In a kd tree, the bounding box B of a node refers to the smallest box (an
axis-aligned hyper-rectangle) that contains all of the node's points.
➢ Each node in the kd-tree has an associated set of potential candidate centers.
All k centers are candidate centers for the root.
➢ For each node, the candidate center z0 that is closest to the midpoint of B is
computed first and added to the set Z; then every impossible candidate
center z is filtered out.
➢ That is, if no part of B is closer to a candidate center z than to z0, z is
not added to that node's candidate center set Z.
The Filtering Algorithm
➢ To compare the distance from z to any part of B against the distance from z0,
it is enough to compare distances from a vertex of B: the vertex chosen is
the one that is most favourable to z among all points in B (a sketch of this
test follows the pseudocode).
➢ If more than one candidate center remains at a leaf node, the usual distance
computations have to be done among the remaining candidates.
➢ Pre-processed data (the weighted centroid and the count of the points in each
cell) is stored in the kd-tree nodes.
➢ The algorithm's performance is data-sensitive.
The filtering algorithm
Filter(kdNode u, CandidateSet Z) {
    C <- u.cell;                                    // bounding box of u
    If ( u is a leaf ) {
        z* <- the closest point in Z to u.point;
        z*.wgtCent <- z*.wgtCent + u.point;
        z*.count   <- z*.count + 1;
    }
    Else {
        z* <- the closest point in Z to C's midpoint;
        For each ( z in Z \ {z*} )                  // prune candidates that cannot win anywhere in C
            If ( z.isFarther(z*, C) ) Z <- Z \ {z};
        If ( |Z| = 1 ) {                            // only z* survives: assign the whole cell at once
            z*.wgtCent <- z*.wgtCent + u.wgtCent;
            z*.count   <- z*.count + u.count;
        }
        Else {
            Filter(u.left, Z);
            Filter(u.right, Z);
        }
    }
}
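➢ The z.isFarther(z*, C) test above can be sketched as follows (our Erlang
rendering, with the cell C given by its min and max corner vectors and dist2/2
as in the first sketch; it checks the single vertex of C that is most
favourable to z):

%% Candidate Z is "farther" (and can be pruned) when even the vertex of the
%% cell that lies as far as possible in the direction of Z - Zstar is still
%% no closer to Z than to Zstar.
is_farther(Z, Zstar, {Cmin, Cmax}) ->
    U = [Zi - Si || {Zi, Si} <- lists:zip(Z, Zstar)],
    V = [case Ui > 0 of true -> Hi; false -> Lo end
         || {Ui, Lo, Hi} <- lists:zip3(U, Cmin, Cmax)],
    dist2(Z, V) >= dist2(Zstar, V).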
Parallel filtering based K-means algorithm
➢ The master builds a kd tree of height log n, where n is the number of worker
nodes, and sends the leaf nodes (the data partitions) to the worker nodes along
with the initial centroids.
➢ At worker node:
○ Create KD tree out of the received data points. (Done only once)
○ Run filtering algorithm.
○ Send the per-cluster results (weighted centroids and counts) to the master.
➢ Master receives results for each cluster and calculates new means.
➢ Repeat until convergence.
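➢ A sketch of the master's combine step in Erlang (our naming; each worker's
report is assumed to be a list of K {WgtCent, Count} pairs in the same cluster
order, and vadd/2 is the vector addition from the first sketch):

%% Sum the weighted centroids and counts cluster-wise over all worker reports,
%% then divide to obtain the new means. A cluster that received no points
%% anywhere keeps its old mean.
combine(Reports, OldMeans) ->
    Zero   = [{[0.0 || _ <- hd(OldMeans)], 0} || _ <- OldMeans],
    Totals = lists:foldl(
                 fun(Report, Acc) ->
                     [case N of
                          0 -> {AccS, AccN};
                          _ -> {vadd(S, AccS), N + AccN}
                      end
                      || {{S, N}, {AccS, AccN}} <- lists:zip(Report, Acc)]
                 end,
                 Zero,
                 Reports),
    [case N of
         0 -> Old;
         _ -> [X / N || X <- Sum]
     end || {{Sum, N}, Old} <- lists:zip(Totals, OldMeans)].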
Results and Conclusion
➢ The clustering problem is NP-hard. Approximations such as Lloyd's algorithm
are much more computationally attractive and converge to a local optimum.
➢ Compared with the sequential version, the parallel algorithm improves
performance by far.
➢ On datasets with relatively few data points but a higher number of clusters,
the nearest-neighbour-based algorithm performs much better.
➢ If the data is well separated, the filtering algorithm gives the best overall
performance; otherwise its performance is about the same as the normal parallel
k-means.
Thank You.