Parallel K-means Clustering in
Erlang
Guide: Dr. Govindarajulu R.
Chinmay Patel - 201405627
Dharak Kharod - 201405583
Pawan Kumar - 201405637
The Clustering Problem
➢ Cluster analysis or clustering is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar (in
some sense or another) to each other than to those in other groups (clusters).
➢ Unsupervised learning
➢ NP-hard problem
➢ Organizing data into clusters such that there is:
○ High intra-cluster similarity
○ Low inter-cluster similarity
○ Informally, finding natural grouping among objects
K-means Clustering Algorithm
➢ In k-means clustering, a set of n data points in d-dimensional space R^d and an
integer k are given, and the problem is to determine a set of k points in R^d,
called centers (means), so as to minimize the mean squared distance from
each data point to its nearest center.
➢ Fast, robust, easy to understand; a partitional, non-hierarchical clustering
method.
➢ The k-means algorithm does not necessarily find the optimal configuration
corresponding to the global minimum of the objective function. The
algorithm is also significantly sensitive to the initially (randomly) selected
cluster centres.
Erlang Language
➢ Lightweight Concurrency
➢ Built-in fault tolerance and asynchronous message passing
➢ No shared state
➢ Pattern matching
➢ Used at: Facebook chat service backend, Amazon SimpleDB
➢ Our experiments: Leader Election Algorithm, Chat server
Standard Algorithm
➢ Given an initial set of k means m_1^(1), …, m_k^(1), the algorithm proceeds by
alternating between two steps:
➢ Assignment step: Assign each observation to the cluster whose mean yields
the least within-cluster sum of squares (WCSS). Since the sum of squares is
the squared Euclidean distance, this is intuitively the "nearest" mean:
   S_i^(t) = { x_p : ‖x_p − m_i^(t)‖² ≤ ‖x_p − m_j^(t)‖² for all 1 ≤ j ≤ k },
where each x_p is assigned to exactly one S_i^(t).
Standard Algorithm
➢ Update step: Calculate the new means to be the centroids of the observations
in the new clusters:
   m_i^(t+1) = (1 / |S_i^(t)|) · Σ_{x_j ∈ S_i^(t)} x_j
➢ Since the arithmetic mean is a least-squares estimator, this also minimizes
the within-cluster sum of squares (WCSS) objective.
➢ The algorithm has converged when the assignments no longer change. Since
both steps optimize the WCSS objective, and there are only finitely many
such partitionings, the algorithm must converge to a (local) optimum.
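➢ For illustration, a minimal sequential sketch of one such iteration in Erlang
(our own module and function names, not the project code; points and means are
equal-length lists of numbers):

-module(kmeans_step).
-export([step/2, nearest/2, dist2/2, vadd/2]).

%% One Lloyd iteration: assign every point to its nearest mean (assignment
%% step), then recompute each mean as the centroid of its points (update step).
step(Points, Means) ->
    Assigned = [{nearest(P, Means), P} || P <- Points],
    [centroid([P || {M, P} <- Assigned, M =:= Mean], Mean) || Mean <- Means].

%% The mean whose squared Euclidean distance to P is smallest.
nearest(P, Means) ->
    {_, Best} = lists:min([{dist2(P, M), M} || M <- Means]),
    Best.

dist2(P, Q) ->
    lists:sum([(X - Y) * (X - Y) || {X, Y} <- lists:zip(P, Q)]).

%% Centroid of a non-empty cluster; an empty cluster keeps its old mean.
centroid([], OldMean) -> OldMean;
centroid(Ps, _OldMean) ->
    N = length(Ps),
    [S / N || S <- lists:foldl(fun vadd/2, hd(Ps), tl(Ps))].

vadd(P, Q) -> [X + Y || {X, Y} <- lists:zip(P, Q)].

Iterating step/2 until the means stop changing gives the sequential algorithm.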
Naive Parallel K-means Clustering Algorithm
➢ In the parallel version of the K-means algorithm, the work of calculating means
and grouping data into clusters is divided among several nodes.
➢ Suppose there are N worker nodes; we then divide our data set into N
approximately equal parts, and each part is sent to one worker node.
The server node sends the initial set of K means to each worker node.
➢ Each worker node divides its own sublist into K clusters, based on the K
means sent from the server node.
Naive Parallel K-means Clustering Algorithm
➢ After computing its K sub-clusters, each worker node, instead of sending the
whole sub-clusters back, sends the sum of the points in each of its K
sub-clusters together with the count of points in each sub-cluster. The server
node then calculates the actual mean of each cluster across all workers.
➢ This yields a set of new means, which is again sent to each worker node,
and the process repeats until the means no longer change (a worker-side
sketch follows below).
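➢ A sketch of the worker side of this exchange in Erlang (illustrative process
and message names; assumes the nearest/2 and vadd/2 helpers from the previous
sketch):

%% Each worker keeps its slice of the data and, for every set of means it
%% receives, reports per-cluster {VectorSum, Count} pairs instead of the
%% raw points, so the master can compute exact global means.
worker(Master, LocalPoints) ->
    receive
        {means, Means} ->
            Report = [cluster_sum(Mean, Means, LocalPoints) || Mean <- Means],
            Master ! {partial, self(), Report},
            worker(Master, LocalPoints);
        stop ->
            ok
    end.

cluster_sum(Mean, Means, Points) ->
    Mine = [P || P <- Points, nearest(P, Means) =:= Mean],
    case Mine of
        []        -> {none, 0};
        [P0 | Ps] -> {lists:foldl(fun vadd/2, P0, Ps), length(Mine)}
    end.

The master adds the reported sums and counts cluster-wise and divides to obtain
the new means; a combine sketch is shown with the filtering-based variant below.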
Improvised Parallel K-means Clustering
Algorithm
➢ The algorithm has to calculate the distance from each data object to every cluster
mean in each iteration. However, it is not necessary to calculate that distance
each time.
➢ The main idea of the algorithm is to keep two data structures that retain, for
every data object, its cluster label and its distance to the nearest cluster
from the previous iteration.
➢ In the next iteration we first calculate the distance between each data object
and the new mean of its own cluster; if that distance is smaller than or equal
to the stored distance to the old mean, the data object stays in the cluster it
was assigned to in the previous iteration, and no comparison against the other
means is needed (see the sketch below).
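➢ A sketch of that check in Erlang (hypothetical function names; dist2/2 as in
the first sketch; labels are 1-based indices into the list of means):

%% Each point carries {Label, Dist}: the index of its cluster and its distance
%% to that cluster's mean from the previous iteration. If the point is still
%% within Dist of its (moved) mean, the full k-way search is skipped;
%% otherwise we fall back to comparing against all k means.
reassign(P, {Label, Dist}, NewMeans) ->
    D = math:sqrt(dist2(P, lists:nth(Label, NewMeans))),
    case D =< Dist of
        true  -> {Label, D};
        false -> full_search(P, NewMeans)
    end.

full_search(P, Means) ->
    Indexed = lists:zip(lists:seq(1, length(Means)), Means),
    {D, Label} = lists:min([{math:sqrt(dist2(P, M)), I} || {I, M} <- Indexed]),
    {Label, D}.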
Introduction to KD Tree
➢ In each iteration the algorithm boils down to calculating the nearest centroid
for every data point.
➢ If we take the geometric arrangement of the data into consideration, we can
reduce the number of comparisons. This can be done using a KD tree.
➢ A KD tree is used to store spatial data and to answer nearest-neighbour queries.
➢ A KD tree is a binary tree where each node is associated with a disjoint
subset of the data.
➢ KD trees built by median splits are guaranteed a depth of about log₂ n, where
n is the number of points in the set.
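➢ One possible Erlang representation of such a tree (our own layout, shown only
to fix notation for the sketches below):

%% A kd-tree as tagged tuples: a leaf holds one point, an inner node records
%% the splitting axis, the split value, and the two subtrees.
-type point()  :: [number()].
-type kdtree() :: {leaf, point()}
                | {node, Axis :: pos_integer(), Split :: number(),
                   Left :: kdtree(), Right :: kdtree()}.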
KD Tree Construction
➢ If there is just one point, form a leaf with that point.
➢ Otherwise, cycle through the data dimensions to select the splitting plane.
➢ Split at median (to have a balanced tree)
➢ Continue recursively until both sides of the splitting plane are empty
➢ Each non-leaf node represents a hierarchical subdivision of the data into two
half-spaces using a hyperplane (the splitting plane). The hyperplane is
orthogonal to one of the coordinate axes.
➢ Constructing the k-d tree takes O(dn log n) time and O(n) storage, where d is
the dimension of the data.
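➢ A build sketch following these rules (illustrative Erlang over the layout
above; it re-sorts at every level, so it runs in O(n log² n) rather than the
optimal O(dn log n)):

%% Build a balanced kd-tree: cycle through the dimensions by depth and split
%% the points at the median coordinate along the chosen axis.
build(Points) when Points =/= [] ->
    build(Points, 0, length(hd(Points))).

build([P], _Depth, _Dims) ->
    {leaf, P};
build(Points, Depth, Dims) ->
    Axis   = (Depth rem Dims) + 1,
    Sorted = lists:sort(fun(A, B) -> lists:nth(Axis, A) =< lists:nth(Axis, B) end,
                        Points),
    {Lo, Hi} = lists:split(length(Sorted) div 2, Sorted),
    Split  = lists:nth(Axis, hd(Hi)),
    {node, Axis, Split, build(Lo, Depth + 1, Dims), build(Hi, Depth + 1, Dims)}.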
Example for 2D
Nearest Neighbour in KD Tree
➢ Query: given a kd-tree and a point in space, which point in the kd-tree is
closest to the test point?
➢ Given our current guess of the nearest neighbour, we can observe that if there
is a point in the data set that is closer to the test point than our current
guess, it must lie inside the circle centered at the test point that passes
through the current guess (see figure on next slide).
➢ This lets us prune the parts of the tree that cannot hold the true nearest
neighbour.
➢ The runtime depends on the data distribution, but the search has been shown to
take O(log n) average time per query in a reasonable model (treating d as a
constant).
Example for 2D
Nearest Neighbour in KD Tree
NNS(q: point, n: node, p: point, w: distance) {    // initial call: NNS(q, root, p, infinity)
    if n.left = null then                          // leaf case
        if distance(q, n.point) < w then return n.point else return p;
    else
        if w = infinity then                       // no best found yet
            if q(n.axis) < n.value then            // descend the side containing q first
                p := NNS(q, n.left, p, w);
                w := distance(p, q);
                if q(n.axis) + w > n.value then p := NNS(q, n.right, p, w);
            else
                p := NNS(q, n.right, p, w);
                w := distance(p, q);
                if q(n.axis) - w < n.value then p := NNS(q, n.left, p, w);
        else                                       // w is finite: visit only the sides the ball of radius w reaches
            if q(n.axis) - w < n.value then
                p := NNS(q, n.left, p, w);
                w := distance(p, q);
            if q(n.axis) + w > n.value then p := NNS(q, n.right, p, w);
    return p
}
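➢ The same search, rendered compactly in Erlang over the kd-tree layout sketched
earlier (our own code, restructured so that it returns the best point together
with its distance; the atom infinity works as the initial bound because in
Erlang's term order any number compares smaller than an atom):

%% nns(Query, Tree, BestPoint, BestDist) -> {BestPoint, BestDist}
%% Initial call: nns(Q, Root, undefined, infinity).
nns(Q, {leaf, P}, Best, W) ->
    D = dist(Q, P),
    case D < W of
        true  -> {P, D};
        false -> {Best, W}
    end;
nns(Q, {node, Axis, Split, Left, Right}, Best0, W0) ->
    %% Descend the side of the splitting plane that contains the query first.
    {Near, Far} = case lists:nth(Axis, Q) < Split of
                      true  -> {Left, Right};
                      false -> {Right, Left}
                  end,
    {Best1, W1} = nns(Q, Near, Best0, W0),
    %% Only cross the plane if the ball of radius W1 around Q reaches it.
    case abs(lists:nth(Axis, Q) - Split) < W1 of
        true  -> nns(Q, Far, Best1, W1);
        false -> {Best1, W1}
    end.

dist(P, Q) ->
    math:sqrt(lists:sum([(X - Y) * (X - Y) || {X, Y} <- lists:zip(P, Q)])).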
Parallel NN based K-means Clustering
Algorithm
➢ The master divides the data points randomly and sends them to the worker nodes
along with the initial centroids.
➢ At each worker node:
○ Create a KD tree out of the received centroids.
○ For each data point, find the nearest centroid using the NNS algorithm and
assign the point to the corresponding cluster.
○ Send the per-cluster results back to the master.
➢ Master receives results for each cluster and calculates new means.
➢ Repeat until convergence.
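➢ The worker-side assignment, sketched in Erlang on top of the build/1, nns/4 and
vadd/2 functions above (again with our own names); it returns a map from
centroid to {VectorSum, Count}:

%% Assign every local point to its nearest centroid via the kd-tree of
%% centroids, accumulating a vector sum and a point count per centroid.
assign(Points, Centroids) ->
    Tree = build(Centroids),
    lists:foldl(
        fun(P, Acc) ->
            {C, _D} = nns(P, Tree, undefined, infinity),
            maps:update_with(C,
                             fun({Sum, N}) -> {vadd(P, Sum), N + 1} end,
                             {P, 1},
                             Acc)
        end,
        #{},
        Points).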
Drawbacks of NNS based Algorithm
➢ We reduce the nearest-neighbour search cost per point from k comparisons to
about log k, so if k is small there is not much difference; and since k usually
is small, the speed-up we get is low.
➢ Because the KD tree is built over the centroids, which change every iteration,
the tree has to be rebuilt in every iteration.
➢ A method that instead builds the KD tree over the data points, which never
move, can save a lot of time. This is the basis of the filtering algorithm.
The Filtering Algorithm
➢ In a kd tree, the bounding box B of a node refers to the smallest box (an
axis-aligned hyper-rectangle) that contains all of the node's points.
➢ Each node in the kd-tree has an associated set of potential candidate centers.
All k centers are candidate centers for the root.
➢ For each node, the candidate center z0 that is closest to the midpoint of B is
computed first and added to the set Z; then every impossible candidate
center z is filtered out.
➢ That is, if no part of B is closer to a candidate center z than to z0, z is
not added to that node's candidate center set Z.
The Filtering Algorithm
➢ To compare the distance from z to any part of B against the distance from z0,
it is enough to compare distances from a vertex of B: the vertex chosen is
the one that is most favourable to z among all points in B (a sketch of this
test follows the pseudocode).
➢ If more than one candidate center remains at a leaf node, the usual distance
computations have to be done among the remaining candidates.
➢ Pre-processed data (the weighted centroid and the count of the points in each
cell) is stored in the kd-tree nodes.
➢ The algorithm's performance is data-sensitive.
The filtering algorithm
Filter(kdNode u, CandidateSet Z) {
    C <- u.cell;                                    // bounding box of u
    If ( u is a leaf ) {
        z* <- the closest point in Z to u.point;
        z*.wgtCent <- z*.wgtCent + u.point;
        z*.count   <- z*.count + 1;
    }
    Else {
        z* <- the closest point in Z to C's midpoint;
        For each ( z in Z \ {z*} )                  // prune candidates that cannot win anywhere in C
            If ( z.isFarther(z*, C) ) Z <- Z \ {z};
        If ( |Z| = 1 ) {                            // only z* survives: assign the whole cell at once
            z*.wgtCent <- z*.wgtCent + u.wgtCent;
            z*.count   <- z*.count + u.count;
        }
        Else {
            Filter(u.left, Z);
            Filter(u.right, Z);
        }
    }
}
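➢ The z.isFarther(z*, C) test above can be sketched as follows (our Erlang
rendering, with the cell C given by its min and max corner vectors and dist2/2
as in the first sketch; it checks the single vertex of C that is most
favourable to z):

%% Candidate Z is "farther" (and can be pruned) when even the vertex of the
%% cell that lies as far as possible in the direction of Z - Zstar is still
%% no closer to Z than to Zstar.
is_farther(Z, Zstar, {Cmin, Cmax}) ->
    U = [Zi - Si || {Zi, Si} <- lists:zip(Z, Zstar)],
    V = [case Ui > 0 of true -> Hi; false -> Lo end
         || {Ui, Lo, Hi} <- lists:zip3(U, Cmin, Cmax)],
    dist2(Z, V) >= dist2(Zstar, V).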
Parallel filtering based K-means algorithm
➢ The master builds a kd tree of height log n, where n is the number of worker
nodes, and sends the leaf nodes (the data partitions) to the worker nodes along
with the initial centroids.
➢ At worker node:
○ Create KD tree out of the received data points. (Done only once)
○ Run filtering algorithm.
○ Send the per-cluster results (weighted centroids and counts) to the master.
➢ Master receives results for each cluster and calculates new means.
➢ Repeat until convergence.
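➢ A sketch of the master's combine step in Erlang (our naming; each worker's
report is assumed to be a list of K {WgtCent, Count} pairs in the same cluster
order, and vadd/2 is the vector addition from the first sketch):

%% Sum the weighted centroids and counts cluster-wise over all worker reports,
%% then divide to obtain the new means. A cluster that received no points
%% anywhere keeps its old mean.
combine(Reports, OldMeans) ->
    Zero   = [{[0.0 || _ <- hd(OldMeans)], 0} || _ <- OldMeans],
    Totals = lists:foldl(
                 fun(Report, Acc) ->
                     [case N of
                          0 -> {AccS, AccN};
                          _ -> {vadd(S, AccS), N + AccN}
                      end
                      || {{S, N}, {AccS, AccN}} <- lists:zip(Report, Acc)]
                 end,
                 Zero,
                 Reports),
    [case N of
         0 -> Old;
         _ -> [X / N || X <- Sum]
     end || {{Sum, N}, Old} <- lists:zip(Totals, OldMeans)].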
Results and Conclusion
➢ The clustering problem is NP-hard. Approximations such as Lloyd's algorithm
are much more computationally attractive and converge to a local optimum.
➢ Compared with the sequential version, the parallel algorithm improves
performance by far.
➢ On datasets with relatively few data points but a higher number of clusters,
the nearest-neighbour-based algorithm performs much better.
➢ If the data is well separated, the filtering algorithm gives the best overall
performance; otherwise its performance is about the same as the normal parallel
k-means.
Thank You.