International Journal of Advanced Research in Technology, Engineering and Science (A Bimonthly Open Access Online Journal), Volume 1, Issue 2, Sept-Oct 2014. ISSN: 2349-7173 (Online)
Comparison of Different Clustering Algorithms 
using WEKA Tool 
Priya Kakkar1, Anshu Parashar2 
______________________________________________ 
Abstract: 
Data mining is the process of extracting useful information from a large dataset, and clustering is one of the important techniques in the data mining process; its main purpose is to group data of similar types into clusters and to find structure among unlabelled data. In this paper we take four different clustering algorithms, namely the k-means algorithm, the hierarchical algorithm, the density-based algorithm and the EM algorithm. All these algorithms are applied to data from the egit software repositories, i.e. classes and their dependencies. We compare and analyze these four algorithms with respect to the time taken to build a model, clustered instances, squared error and log likelihood, using the Weka tool.
_________________________________________________ 
Keywords: Data Mining, Clustering, K-mean, Weka tool, 
DBSCAN 
__________________________________________________ 
I. INTRODUCTION
Data mining is a field used to find the information hidden in massive sets of data. It is an important tool for converting data into information and is used in many fields of practice, such as marketing, fraud detection and scientific discovery. Data mining is also used for extracting patterns from data. It can uncover patterns in data, but it is often carried out only on a sample of the data, and the mining process will be ineffective if the samples are not a good representation of the larger body of data. The discovery of a particular pattern in a particular set of data does not necessarily mean that the pattern is found elsewhere in the larger data from which that sample was drawn. An important part of the method is therefore the verification and validation of patterns on other samples of data. A primary reason for using data mining is to assist in the analysis of collections of observations of behavior. Data mining is the analysis step of the "Knowledge Discovery in Databases" process and attempts to discover patterns in large data sets. The main aim of the data mining process is to extract information from a data set and transform it into an understandable format for further use.
________________________________________________ 
First Author’s Name: Priya Kakkar, Department of Computer Science & 
Engineering, HCTM Technical Campus, Kaithal, India. 
Second Author’s Name: Anshu Parashar, Department of Computer Science 
& Engineering, HCTM Technical Campus, Kaithal, India. 
__________________________________________________________ 
Clustering is the task of assigning a set of objects to groups (called clusters) so that the objects in the same cluster are more similar to each other than to those in other clusters. A cluster is therefore a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters. Clustering is a common technique for statistical data analysis in many fields, such as machine learning, pattern recognition, image analysis, information retrieval and bioinformatics.
II. CLUSTERING METHODS
The goal of clustering is to organize objects that are related to each other or share similar characteristics, grouping similar objects (items) into the same cluster. Several different clustering methods are in common use; a minimal Weka sketch mapping them onto concrete clusterer classes follows at the end of this section.
· Partitioning clustering
The partitioning method divides the data into a set of M clusters, and each object belongs to exactly one cluster. Each cluster can be represented by a centroid or a cluster representative, that is, a description of all the objects contained in the cluster. This description depends on the type of object being clustered. For real-valued data the arithmetic mean of the attribute vectors of all objects within a cluster provides an appropriate representative, while other kinds of centroid may be required in other cases. If the number of clusters is large, the centroids can themselves be further clustered, which produces a hierarchy within the dataset.
· Hierarchical clustering
Flat clustering is efficient and conceptually simple, but it has a number of drawbacks: the algorithms require a pre-specified number of clusters as input and are non-deterministic. Hierarchical clustering outputs a hierarchical structure that is more informative than the unstructured set of clusters produced by flat clustering, and it does not require the number of clusters to be specified in advance. In hierarchical clustering, clusters are created in either a top-down or a bottom-up fashion by recursive partitioning. Hierarchical clustering methods are of two types: hierarchical agglomerative methods and hierarchical divisive methods.
· Density based clustering
Density-based clustering algorithms try to find clusters based on the density of data points in a region. The key idea behind density-based clustering is that, for each instance of a cluster, the neighborhood of a given radius (Eps) has to contain at least a minimum number of instances (MinPts). Density-based clustering can also be based on probability distributions, where points drawn from
one distribution are assumed to belong to the same cluster. This method identifies the clusters and their parameters.
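To make these three families concrete, the following minimal Java sketch (our illustration, not part of the original experiments) instantiates the Weka clusterer classes that correspond to them: SimpleKMeans for partitioning, HierarchicalClusterer for hierarchical clustering, and MakeDensityBasedClusterer, which wraps a base clusterer in per-cluster probability distributions, for density/distribution-based clustering. It assumes Weka 3.x on the classpath; the file name egit.arff is a placeholder.

```java
import weka.clusterers.Clusterer;
import weka.clusterers.HierarchicalClusterer;
import weka.clusterers.MakeDensityBasedClusterer;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusteringMethodsDemo {
    public static void main(String[] args) throws Exception {
        // Load the dataset (placeholder file name).
        Instances data = new DataSource("egit.arff").getDataSet();

        // Partitioning clustering: k-means with a fixed number of clusters.
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(4);

        // Hierarchical clustering: builds a cluster hierarchy by recursive merging.
        HierarchicalClusterer hierarchical = new HierarchicalClusterer();
        hierarchical.setNumClusters(4);

        // Distribution/density-based clustering: wraps a base clusterer and
        // fits a probability distribution for each of its clusters.
        MakeDensityBasedClusterer density = new MakeDensityBasedClusterer();
        density.setClusterer(new SimpleKMeans());  // wrapped clusterer keeps its defaults

        for (Clusterer c : new Clusterer[] { kMeans, hierarchical, density }) {
            c.buildClusterer(new Instances(data));  // fresh copy of the data for each run
            System.out.println(c.getClass().getSimpleName() + ": "
                    + c.numberOfClusters() + " clusters");
        }
    }
}
```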
IV. VARIOUS CLUSTERING ALGORITHMS
· k-means clustering
K-means is a widely used partition-based clustering method because it is easy to implement and is among the most efficient in terms of execution time. K-means clustering groups items into k groups; the grouping is done by minimizing the sum of squared distances between the items and the corresponding centroids. A centroid is the "center of mass of a geometric object of uniform density".
K-Means Algorithm: In the k-means algorithm each cluster's center is represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters.
D: a data set containing n objects.
Output: a set of k clusters.
Method:
1. Arbitrarily choose k objects from D as the initial cluster centers.
2. Repeat:
3. Reassign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster.
4. Update the cluster means.
5. Until no change.
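Weka's SimpleKMeans clusterer follows this procedure (arbitrary initial centers, reassignment, mean update until convergence). A minimal sketch of running it, assuming Weka 3.x and a placeholder egit.arff file:

```java
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("egit.arff").getDataSet();  // placeholder file

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(4);     // k, the number of clusters
        kMeans.setSeed(10);           // controls step 1, the arbitrary initial centers
        kMeans.buildClusterer(data);  // steps 2-5: reassign and update until no change

        // Within-cluster sum of squared errors, the quantity reported for k-means in Table 4.1.
        System.out.println("Squared error: " + kMeans.getSquaredError());

        // Cluster membership of each instance.
        for (int i = 0; i < data.numInstances(); i++) {
            System.out.println("Instance " + i + " -> cluster "
                    + kMeans.clusterInstance(data.instance(i)));
        }
    }
}
```

Because step 1 is an arbitrary choice, runs with different seeds can converge to different clusterings, which is why the seed is fixed here.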
· EM algorithm
In cases where the likelihood equations cannot be solved directly, we use the EM (Expectation-Maximization) algorithm, which is included in many data mining tools. The EM algorithm is used to find maximum-likelihood parameters of a statistical model. Such models contain latent variables in addition to the unknown parameters and the known data observations: either there are missing values among the data, or the model can be simplified by assuming the existence of additional unobserved data points. Finding a solution requires taking derivatives of the likelihood function with respect to all unknown values. The result is typically a set of interlocking equations in which the solution for the parameters requires the values of the latent variables and vice versa, and substituting one set of equations into the other produces an equation that cannot be solved directly. The EM algorithm instead picks arbitrary values for one of the sets, uses them to estimate the second set, then uses the new values to re-estimate the first set, and keeps alternating until the resulting values converge to fixed points.
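In Weka this alternation is implemented by the EM clusterer, and the log likelihood it converges to is the quantity reported for EM in Table 4.1. A minimal sketch, assuming Weka 3.x and a placeholder egit.arff file:

```java
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EMDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("egit.arff").getDataSet();  // placeholder file

        EM em = new EM();
        em.setNumClusters(4);      // fix k; with -1 EM selects k by cross-validation
        em.setMaxIterations(100);  // stop if the estimates have not converged by then
        em.buildClusterer(data);   // alternate E and M steps until convergence

        // Evaluate on the training data to obtain the log likelihood.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(data);
        System.out.println("Log likelihood: " + eval.getLogLikelihood());
        System.out.println(eval.clusterResultsToString());
    }
}
```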
· Density-based spatial clustering of applications with noise (DBSCAN) algorithm
DBSCAN is a density-based algorithm. It separates data points into three kinds: core points (points in the interior of a cluster), border points (points that fall within the neighborhood of a core point) and noise points (points that are neither core nor border points). DBSCAN starts with an arbitrary instance p of the data set D and retrieves all instances of D within distance Eps of p, checking whether there are at least MinPts of them. The algorithm uses a spatial data structure to locate points within Eps of the core points of the clusters. It begins with an arbitrary point that has not been visited; the point's Eps-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Such a point might later be found within the Eps-neighborhood of a different point and hence be made part of a cluster. If a point is found to lie in a dense part of a cluster, its Eps-neighborhood is also part of that cluster; hence, all points found within that Eps-neighborhood are added, along with their own Eps-neighborhoods when they are dense. This process continues until the density-connected cluster is completely found. Then a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.
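The expansion just described can be written down compactly. The following plain-Java sketch is our own illustration of the DBSCAN procedure for real-valued points with Euclidean distance; it is not the Weka implementation used in the experiments, and the Eps/MinPts values in the example are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal DBSCAN over real-valued points with Euclidean distance (illustrative only). */
public class SimpleDBSCAN {
    static final int NOISE = -1, UNVISITED = 0;  // cluster ids start at 1

    public static int[] cluster(double[][] points, double eps, int minPts) {
        int[] label = new int[points.length];            // 0 = unvisited
        int clusterId = 0;
        for (int p = 0; p < points.length; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(points, p, eps);
            if (neighbors.size() < minPts) { label[p] = NOISE; continue; }  // not a core point
            clusterId++;                                  // start a new cluster at this core point
            label[p] = clusterId;
            // Expand the cluster: every dense neighbor contributes its own Eps-neighborhood.
            for (int i = 0; i < neighbors.size(); i++) {
                int q = neighbors.get(i);
                if (label[q] == NOISE) label[q] = clusterId;  // former noise becomes a border point
                if (label[q] != UNVISITED) continue;
                label[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(points, q, eps);
                if (qNeighbors.size() >= minPts) neighbors.addAll(qNeighbors);  // q is also a core point
            }
        }
        return label;
    }

    /** Indices of all points within distance eps of point p (including p itself). */
    static List<Integer> regionQuery(double[][] pts, int p, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int q = 0; q < pts.length; q++) {
            double d2 = 0;
            for (int k = 0; k < pts[p].length; k++) {
                double diff = pts[p][k] - pts[q][k];
                d2 += diff * diff;
            }
            if (Math.sqrt(d2) <= eps) result.add(q);
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] pts = { {1,1}, {1.2,1.1}, {0.9,1.0}, {8,8}, {8.1,8.2}, {7.9,8.0}, {50,50} };
        int[] labels = cluster(pts, 1.0, 3);
        System.out.println(java.util.Arrays.toString(labels));  // two clusters and one noise point
    }
}
```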
V. EXPERIMENTAL SETUP
For the comparison of the various clustering algorithms we used the Weka tool. Weka is a data mining tool that contains a collection of machine learning algorithms, with tools for pre-processing, classification, regression, clustering, association rules and visualization of data. In our work we built a dataset of the egit software from the pfCDA tool and the svnsearch.org site. The dataset consists of three attributes: class, depends and change. Classes with similar characteristics are grouped. We created the dataset as an Excel worksheet saved in .CSV format and then converted the .CSV file into an .arff file. We compared four clustering algorithms (k-means, hierarchical, EM and density based) on the basis of the number of clusters, clustered instances, squared error, time taken to build the model and log likelihood. We used the training set, classes-to-clusters evaluation and cluster visualization in our work. We applied the algorithms one by one in the Weka tool, collected their results and built a comparison table.
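The same steps can be scripted against the Weka Java API instead of the Explorer GUI. The sketch below is our illustration under two assumptions: the file names egit.csv and egit.arff are placeholders, and the density-based algorithm is taken to be Weka's MakeDensityBasedClusterer (the clusterer in Weka that reports a log likelihood). It converts the Excel-exported .CSV to .arff, then builds each of the four clusterers on the same data, timing the build and printing the clustered-instance breakdown.

```java
import java.io.File;
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.Clusterer;
import weka.clusterers.EM;
import weka.clusterers.HierarchicalClusterer;
import weka.clusterers.MakeDensityBasedClusterer;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class EgitClusterComparison {
    public static void main(String[] args) throws Exception {
        // 1. Convert the Excel-exported CSV (class, depends, change) to ARFF.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("egit.csv"));   // placeholder file name
        Instances data = loader.getDataSet();

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("egit.arff"));
        saver.writeBatch();

        // 2. Configure the four clusterers compared in this paper.
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(4);
        HierarchicalClusterer hierarchical = new HierarchicalClusterer();
        hierarchical.setNumClusters(4);
        EM em = new EM();
        em.setNumClusters(4);
        SimpleKMeans base = new SimpleKMeans();
        base.setNumClusters(4);
        MakeDensityBasedClusterer density = new MakeDensityBasedClusterer();
        density.setClusterer(base);

        // 3. Build each clusterer on the same data, time the build, and print
        //    the clustered-instance breakdown (plus log likelihood where available).
        for (Clusterer c : new Clusterer[] { kMeans, hierarchical, em, density }) {
            long start = System.currentTimeMillis();
            c.buildClusterer(new Instances(data));       // fresh copy per run
            long elapsed = System.currentTimeMillis() - start;

            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(c);
            eval.evaluateClusterer(new Instances(data));
            System.out.println(c.getClass().getSimpleName() + " built in " + elapsed + " ms");
            System.out.println(eval.clusterResultsToString());
        }
    }
}
```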
VI. RESULTS ANALYSIS
The results obtained from the Weka tool for all four algorithms are shown in Table 4.1. The comparison table shows that, for the same data, these algorithms give different results. From the table we find that the k-means algorithm provides better results than the hierarchical and EM algorithms: it builds a model faster than both, although it takes more time than the density-based algorithm. We also find that the log-likelihood value of the density-based algorithm is higher. From these results we find that k-means is faster and more reliable than the other algorithms we used.
Name of algorithm | Number of clusters | Clustered instances | Squared error | Time taken to build model | Log likelihood
k-means           | 4                  | 30%, 28%, 22%, 20%  | 602           | 0.03 seconds              | -
Hierarchical      | 4                  | 52%, 1%, 27%, 20%   | -             | 0.19 seconds              | -
EM                | 4                  | 30%, 20%, 22%, 28%  | -             | 2.68 seconds              | -11.9224
Density based     | 4                  | 30%, 28%, 22%, 20%  | -             | 0.02 seconds              | -11.8997

Table 4.1: Results of the comparison of the four clustering algorithms
VII. CONCLUSION
The k-means, EM and density-based clustering algorithms produce the same clustered instances, but the EM algorithm takes more time to build the clusters, so the k-means and density-based algorithms are preferable to EM. The density-based algorithm takes less time to build the clusters, but it is not better than k-means because it has a higher log-likelihood value, and a high log-likelihood value does not yield good clusters. Hence k-means is the best of the compared algorithms, since it takes very little time to build a model. The hierarchical algorithm takes more time than k-means, and its clustered instances are also poorer.