SlideShare a Scribd company logo
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3637
Optimal Number of Cluster Identification using Robust K-means for
Sequences in Categorical Sequences
S.U. Patil1, U.A. Nuli2
1,2Computer Science and Engineering department, M. Tech, Textile and Engineering Institute, Ichalkaranji,
Maharashtra, India
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - This paper presents a modified k-means
algorithm for clustering. In traditional method of
clustering number of clusterstobeformedwillbegivenat
the start of the algorithm, which affects performance and
efficiency of the algorithm. In Robust K-means for
sequences optimal number of cluster will be predicted by
removing noise cluster. Cluster validation, which is the
process of evaluating the quality of clustering results,
plays an important role for practical machine learning
systems. Categorical sequences, such as biological
sequences in computational biology, have become
common in real-world applications. Different from
previous studies, whichmainly focusedon attribute-value
data, in this paper, we work on the cluster validation
problem for categorical sequences. Clustering is defined
as an unsupervised learning where the objects are
grouped on the basis of some similarity inherent among
them. The intension of this paper is to describe the
clustering method which will give the optimal number of
clusters in categorical sequences.
Key Words: Clusterign ,K-means,clustervalidationindex,
categorical sequences, centroid.
1. INTRODUCTION
Data mining is a process of deriving required data from a
collection of large dataset and making analysis on collected
data. The data mining technique is to mine informationfrom
a bulky data set and make over it into a reasonable form for
supplementary purpose. In generic term this is known as
classification. In classification when the classes of an object
is given in advance is termed as supervised classification
where as the other case when the class label is not tagged to
an object in advance is termed as unsupervised
classification. The unsupervised classification is commonly
known as clustering. Clustering is important analysis
techniques that is employed to large datasets and finds its
application in the fields like search engines,
recommendation systems, data mining, knowledge
discovery, bioinformaticsanddocumentation.Nowadays, the
data being generated is not only huge in volume, but is also
stored across various machines all around the world. The
main purpose behind the study of classificationistodevelop
a tool or an algorithm, which can be used to predict the class
of an unknown object, which is not labeled.
Clustering problem cannot be solvedbyonespecific
algorithm but it requires various algorithms that differ
significantly in their notion of what makes a clusterandhow
to efficiently find them. Generally clusters include groups
with small distances among the cluster members, dense
areas of the data space, intervals or particular statistical
distributions. Clustering can therefore be formulated as a
multi-objective optimization problem. The appropriate
clustering algorithm and parameter settings depend on the
individual data set and intended useoftheresults.Clustering
is considered to be more difficult than supervised
classification as there is no label attached to the patterns in
clustering. The given label in the case of supervised
classification becomes a clue to grouping data objects as a
whole. Whereas in the case of clustering, it becomes difficult
to decide, to which group a pattern will belong to, in the
absence of a label.
The problem of clustering becomes more
challenging when the data is categorical, that is, when there
is no inherent distance measure between data values.Thisis
often the case in many domains, wheredata isdescribedbya
set of descriptive attributes, many of which are neither
numerical nor inherently ordered in any way. Moreover
clustering categorical sequences is a challenging problem
due to one more reason the difficulties in defining an
inherently meaningful measure of similarity between
sequences. Cluster validation, which is the process of
evaluating the quality of clustering results, plays an
important role for practical machine learning systems.
Categorical sequences, such as biological sequences in
computational biology, have become common in real-world
applications. Without a measure of distance between data
values, it is unclear how to define a quality measure for
categorical clustering. To do this, we employ mutual
information, a measure from information theory. A good
clustering is one where the clusters are informative about
the data objects they contain. Since data objects are
expressed in terms of attribute values, we require that the
clusters convey information about the attributevaluesof the
objects in the cluster. The evaluation of sequences
clustering is currently difficult due to the lack of an internal
validation criterion defined with regard to the structural
features hidden in sequences. To solve this problem, a novel
cluster validity index (CVI) is proposed as a function of
clustering, with the intra-clusterstructural compactness and
inter-cluster structural separation linearly combined to
measure the quality of sequence clusters. Cluster validation,
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3638
which is the process of evaluating the quality of clustering
results, plays an important role in many issues of cluster
analysis A partition-based algorithm for robust clustering of
categorical sequences andtheCVIarethenassembled within
the common model selection procedure to determine the
number of clusters in categorical sequence sets. Currently,
cluster validation remains an open problem due to the
unsupervised nature of clustering tasks, where no external
validation criterion is available to evaluate the result.
Despite different aspects of cluster validation, we are
interested in the problem of determining the optimal
number of clusters in a data set, as most basic clustering
algorithms assume that the number of clusters is a user-
defined parameter, which, however, is difficult to set in
practice.
2. RELATED WORK
K. A. Abdul Nazeer et al. [12] discuss in this paper
about the one major drawback of the k-means algorithm. K-
means algorithm, for different sets of values of initial
centroids, produces different clusters.Final clusterqualityin
algorithm depends on the selection of initial centroids. Two
phases includes in original k means algorithm: first for
determining initial centroids and second for assigning data
points to the nearest clusters and then recalculating the
clustering mean. An enhanced clustering method is
discussed in this paper, in which both the phases of the
original k-means algorithm are modified to improve the
accuracy and efficiency. This algorithm combines a
systematic method for finding initial centroids and an
efficient way for assigning data points to clusters. But still
there is a limitation in this enhanced algorithm that is the
value of k, the number of desired clusters, is still required to
be given as an input, regardless of the distribution of the
data points.
G. Dong and J. Pei, [10] says algorithm is expected to
produce high-quality results associated with the underlying
cluster structures in the data set given K. This problem
becomes difficult for categorical sequences,because,usually,
they are infected with significant quantities of noise, which
confuses the identification of cluster structures.Theclusters
containing less numberofobjects afterapplyinga algorithms
are considered as K-dependentnoises.Suchcluster-number-
dependent noises would mislead the validation criterion to
evaluate the clustering quality. The analysisofshortcomings
of the standard k-means algorithm. As k-means algorithm
has to calculate the distance betweeneachdata objectandall
cluster centers in each iteration. This repetitive process
affects the efficiency of clustering algorithm.
R. Xu and D. Wunsch [11], in the existing partition-
based methods, the performance of k-means algorithm
which is evaluated with various databasessuchasIris,Wine,
Vowel, Ionosphere and Crude oil data Set and various
distance metrics. proposed that due to the increment in the
amount of data across the world, analysis of the data turns
out to be very difficult task. To understand and learn the
data, classify those data into remarkable collection.So,there
is a need of data mining techniques. It is concluded that
performance of k-means clustering is depend on the data
base used as well as distance metrics. the comparative
analysis of one partition clustering algorithm (k means) and
one hierarchical clusteringalgorithm(agglomerative).On the
basis of accuracy and running time the performance of k-
means and hierarchical clustering algorithm is calculated
using WEKA tools. This work results that accuracy of k-
means is higher than the hierarchical clustering for iris
dataset which have real attributesandaccuracyofhierarchal
clustering is higher than the k-means for diabetes dataset
which have integer, real attribute. So for large datasets k
means algorithm is good.
R. Amutha et al. [13] proposed that when two or more
algorithms of same category of clustering technique is used
then best results will be acquired. Two k-means algorithms:
Parallel k/h-Means Clustering for Large Data Sets and A
Novel K-Means Based Clustering Algorithm for High
Dimensional Data Sets. Parallel k/h-Means algorithm is
designed to deal with very large data sets. Novel K-Means
Based Clustering provides the advantages of using both HC
and K-Means. Using these two algorithms, space and
similarity between the data sets present each nodes is
extended, however, there is a general lack of adaptive
mechanisms to deduce the reasonable number of clusters
with the possible K-dependent noises identified, given that
the user defined K likely deviates from the true number.
Solution to this problem a novel cluster validation method
which will provide robust clustering is used.
Navjot Kaur, Navneet Kaur[4]enhanced the
traditional k-means by introducing Ranking method.Author
introduces Ranking Method to overcome the deficiency of
more execution time taken by traditional k-means. The
Ranking Method is a way to find the occurrence of similar
data and to improve search effectiveness. The tool used to
implemented the improved algorithm is Visual Studio 2008
using C#. The advantages of k-means are also analyzed in
this paper. The author finds k-means as fast, robustandeasy
understandable algorithm. He also discuss that the clusters
are non-hierarchical in nature and are not overlapping in
nature. The process used in the algorithm takes student
marks as data set and then initial centroid is selected.
Euclidean distance is then calculated from centroid for each
data object. Then the threshold value is set for each data set.
Ranking Method is applied next and finally the clusters are
created based on minimum distance between the data point
and the centroid. The future scope of this paper is use of
Query Redirection can be used to cluster huge amount of
data from various databases.
3. METHODOLOGY
In this partition-based robust K-means for sequences
(RKMS) for robust clustering of categorical sequences, the
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3639
data which is given as input is in character form. There are
difficulties in defining an inherently meaningful measure of
similarity between sequences. To apply the clustering
algorithms and existing indices to sequences, one has to
resort to the vectorizationmethod. Toconvertthesequences
in vector form it is encoded into Latin-1, also called ISO-
8859-1, is an 8-bit character set endorsed by the
International Organization for Standardization (ISO) and
represents the alphabets of Western European languages.
Once the character file is converted to binary format we are
going to use Principal Component Analysis (PCA) algorithm.
The main idea of principal component analysis (PCA) is to
reduce the dimensionality of a data set consisting of many
variables correlated witheachother, eitherheavilyorlightly,
while retaining the variation present in thedataset,uptothe
maximum extent. Using PCA is very important because the
dataset which we have it is in very diverse format.Analyzing
this type of data becomes very difficult without using PCA.
After all this preprocessing we are ready to apply RKMS
algorithm on the data.
The number of clusters, K, produced by RKMS is not
necessarily equal to the given number ˆK (we will call ˆK the
expected number of clusters).The number of clusters is
adaptively determined by the elimination of K-dependent
noise during the RKMS clustering. Two key issues for a
robust K-means type algorithm are: the selection of the
initial cluster centroids and the identification of noise
clusters. In general, those groups that contain a small
number of objects can be viewed as noise. To identify the
noise cluster on threshold will be given and iftheobjectsina
cluster are less than thresholdthenthatclusterisconsidered
as noise.
After creating a cluster quality of the cluster is
measure by using Cluster Validity Index. For that structural
dissimilarity between the cluster object is considered. This
new index is linear combination of the within-cluster how
objects are scattered and between- cluster how objects are
separated and these both measures are computed on the
structural dissimilarity of the sequences which is
probabilistic approach.
Fig. 1. System Architecture of RKMS method
If we are sign high dimension data then robustness
of K-means algorithm may became a problem. There will be
two main issues we might face in this algorithm one is
selecting the initial centroied value. And another issue will
be finding and removing the noise cluster. The first issue
raises, because it is known that the performance of the K-
means algorithm, as an instance of the EM method [13], is
sensitive to the initial conditions (here, the initial cluster
centroids). Since the number of clusters is fixed at ˆK given
by the users, the traditional algorithm likely produces
clusters consisting of ˆK -dependent noise, especially when
the expected number deviates largely from thetruenumber.
RKMS aims at building a robust condition for the
coming iteration, by choosing a set of well-scattered objects
(sequences) as the initial cluster centroids. The selection
method is based on the greedy approach.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3640
The (k + 1)th initial centroid based on the maximum–
minimum principle, given by
In the traditional greedy approach, however, the first
centroid is chosen randomly or according to the length of
object vector. Since each sequence has beenrepresentedina
unit vector as Definition 1 shows,
we choose the first two centroids I1 and I2 by
That is, two sequences havingthemaximal pairwisedistance
are chosen as the initial centroids for C1 and C2; the
remaining Kˆ − 2 centroids are then determined by
repeatedly using (2) with k = 2,..., Kˆ − 1.
RKMS is to delete the possible K-dependent noise in each
iteration, to further enhance therobustnessofthealgorithm.
The details will be given in Section III-C. Note that the
number of resulting clustersproducedby RKMScouldbeless
than the given ˆK, due to the inclusion of this step in the
algorithm. This feature is somewhat similar to that of [10]
and [12], where the number canbereducedbyfadingout the
redundant clusters;however,RKMStargetstheidentification
of K-dependent noise.
Given a clustering of the sequence set, C, we now aim at
establishing a judgement criterion to examine whethereach
group Ck ∈ C is a noise cluster or not. In general, those
groups that contain a small number of objects can beviewed
as noise . According to this view, one can derive a necessary
condition for the judgement, i.e., nk = |Ck| ≤τ,wherenkisthe
number of objects in Ck and τ is a threshold defining the
minimal number in a normal cluster. If the condition holds
with τ = 1, then the cluster is undoubtedly a noise as it
consists of one single object. However,itisnotthecasewhen
1 < nk ≤ τ , because the assertion would be concerned with
the distribution of the objects in the cluster. To address the
problem, we propose a model-selection based method, in
view of the observation that if each object in a cluster is
indeed K-dependent noise, then the preferred clustering
model should be the one with that cluster removed. With
regard to Ck ∈ C, we denote the new clustering after the
removal of Ck by C_, created by reassigning each sequence
originally belonging to Ck to its closest cluster except Ck.
Thus, the number of clusters satisfies |C_| = |C| − 1 and,
consequently, the conclusion that Ck is a noise clustercanbe
drawn if the model of C_ is preferred to that of C.
Structure-based cluster validity index have two main
components first one is structural dissimilarity (SD) of
sequences and new cluster validity index(CVI). To measure
SD for sequences by Markovian modeling, which is based on
the hypothesis that occurrence of each symbol xl in the
sequence s = x1 . . . xl . . . xL is closely related to its preceding
subsequence x0x1 . . . xl−1. For a Markov model of order2 ꝭ ,
the preceding subsequence of xl is truncated to δl = xl−ꝭ . . .
xl−1, excluding xl− ꝭ for l < ꝭ. By such modeling, the
sequential dependence of each symbol xl on its preceding
subsequence δl can be characterized using the conditional
probability ps(xl |δl ) with regard to the sequence s.
where ps(δ) is the probability of δ with regard to s.
The SD between two sequences s and s_ can then be
numerically computed, by measuring the dissimilarity of
their probability distributions ps(X) and ps'(X).
Then next part is CVI VSD where it will validate the
quality of a series of clustering results,eachgenerated bythe
clustering algorithm on thesamesequenceset Swithvarious
numbers of sequence clusters. To distinguish the results
with different cluster numbers, we now use CK to denote C
consisting of K sequence clusters, i.e., CK = {C1, . . . ,Ck, . . . ,CK
}. Denote the minimal and the maximal numberofclustersin
the series by Kmin and Kmax, respectively. The new index is
a linear combination of the within-cluster scatter and
between-cluster separation, where both measures are
computed on the structural dissimilarity of sequences using
SD(·, ·) of (5). We define the total within-cluster scatter and
between-cluster separation of CK, denoted by Scat(CK ) and
Sep(CK ), respectively, as
Since, in practice, the number of clusters is ranged in [Kmin,
Kmax], we define the CVI as a linear combination of the
normalized measures, that is
where maxScat = maxK_∈[Kmin,Kmax] Scat(CK_ ) and
maxSep = maxK_∈[Kmin,Kmax] Sep(CK_ ). The index offersa
tradeoff between thewithin-clusterscatterandthebetween-
cluster separation. The minimal VSD is considered to be
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3641
associated with the clustering results produced by the
algorithm using the optimal number of clusters.
4. RESULT AND DISSCUSSION
The main purpose is detecting optimal cluster
numbers on "USDA Plants Dataset" with using K-means
clustering algorithm. Database contains all plants (species
and genera) in the database and the states of USA and
Canada where they occur.
The dataset used for this research contains 34781 records
with 70 features
DataSet Characteristics: Multivariate
Attribute Characteristics: Categorical
Associated Tasks: Clustering
Number of Instances: 22632
Number of Attributes: 70
Missing Values? Yes
Table -1: Dataset features
To implement this wehave Silhouetteanalysis.Itcanbeused
to study the separation distance between the resulting
clusters. The silhouette plot displays a measure of how close
each point in one cluster is to points in the neighboring
clusters and thus provides a way to assess parameters like
number of clusters visually. This measure has a range of [-1,
1]. Silhouette coefficients (as these values are referred to as)
near +1 indicate that the sample is far away from the
neighboring clusters. A value of 0 indicates that the sampleis
on or very close to the decision boundary between two
neighboring clusters and negative values indicate that those
samples might have been assigned to the wrong cluster.
BelowwehaveSilhouettecoefficientsfortheclusters
K- meansalgorithm on 1000attribute data for the respective
values of K.
K- Value Passed Silhouette coefficients
10 0.161727
20 0.214786
30 0.218644
40 0.255462
Table-2: K values with respective silhouette score
The performance of RKMS is evaluated in terms of clustering
accuracy (CA), which is computed as
where K is the true number and ak is the number of
sequences in the majority group corresponding to Ck. To
examine the strength of the robust method used to choose
the initial cluster centroids in RKMS, we also performed
clustering using K-means with random initialization [1], [4].
The data set was clustered by K-meansfor20 executionsand
both the average and best accuracy are reported After
applying RKMS algorithm on the same data set the optimal
number of cluster we got is as below.
Estimated number of cluster Silhouette coefficients
23 0.361
Fig-2. Cluster image with 500 data element
3. CONCLUSION
A robust algorithm named RKMS for partitionbased
sequence clustering. In the RKMS algorithm, unlike
traditional partitioning methods, the number of clusters (K)
can be automatically deducedbasedonthe expectednumber
( ˆK given by the users), by the eliminationofcluster-number
dependent noise (called K-dependent noise).RKMSandVSD
to the common model selection procedure, to estimate the
optimal number of clusters in a sequence set. A cluster
validation method for categorical sequence clustering is
proposed, which is different from the existing methods
mainly designed for numeric data. A probabilistic approach
to measure the structural dissimilarity of categorical
sequences, based on which the within-cluster compactness
and between-cluster separation are formulated, in order to
evaluate the structural similarity of sequences in each
cluster and the structural dissimilarity between sequence
clusters. To evaluate the quality of sequence clusters, a new
CVI named VSD by linearly combining the within-cluster
structural compactness and the between-cluster structural
separation.
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3642
REFERENCES
1] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Clustering
validity checking methods: Part II,” ACM SIGMOD Rec. Arch.,
vol. 31, no. 3, pp. 19–27, 2002.
2] C. C. Aggarwal, Data Mining: The textbook. New York, NY,
USA: Springer, 2015.
3] J. C. Dunn, “A fuzzy relative of the ISODATAprocessand its
use in detecting compact well-separated clusters,” J. Cybern.,
vol. 3, no. 3, pp. 32–57, 1973.
4] T. Cali´nski and J. Harabasz,“Adendritemethodforcluster
analysis,” Commun. Stat., vol. 3, no. 1, pp. 1–27, 1974.
5] H. Sun, S. Wang, and Q. Jiang, “FCM-based model selection
algorithms for determining the number of clusters,” Pattern
Recognit., vol. 37, no. 10, pp. 2027–2037, Oct. 2004.
6] L. Xu, T. W. S. Chow, and E. W. M. Ma, “Topology-based
clustering using polar self-organizing map,” IEEE Trans.
Neural Netw. Learn. Syst., vol. 24, no. 4, pp. 798–808,
7] D. Pelleg and A. W. Moore, “X-means: Extending k-means
with efficient estimation of the number of clusters,” in Proc.
17th ICML, 2000, pp. 727–734.
8] J. Yang and W. Wang, “CLUSEQ: Efficient and effective
sequence clustering,” in Proc. IEEE ICDE,Mar.2003,pp.101–
112.
9] M. M.-T. Chiang and B. Mirkin, “Intelligent choice of the
number of clusters in k-means clustering: An experimental
study with different cluster spreads,” J. Classif., vol. 27, no. 1,
pp. 3–40, 2010.
10] G. Dong and J. Pei, “Classification,clustering, features and
distances of sequence data,” Seq. Data Min., vol. 33, pp. 47–
65, 2007.
[11] R. Xu and D. Wunsch, “Survey of clustering algorithms,”
IEEE Trans Neural Netw., vol. 16, no. 3, pp. 645–678, May
2005.
12] K. A. Abdul Nazeer, M. P. Sebastian,.Improving the
Accuracy and Efficiency of thek-means Clustering
Algorithm.,Proceedings of the World Congress on
Engineering 2009 Vol I WCE 2009, July 1 - 3, 2009, London,
U.K.

More Related Content

PDF
Ensemble based Distributed K-Modes Clustering
PDF
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
PDF
Cancer data partitioning with data structure and difficulty independent clust...
PDF
Du35687693
PDF
Clustering Approach Recommendation System using Agglomerative Algorithm
PDF
Certain Investigation on Dynamic Clustering in Dynamic Datamining
PDF
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
PDF
11.software modules clustering an effective approach for reusability
Ensemble based Distributed K-Modes Clustering
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
Cancer data partitioning with data structure and difficulty independent clust...
Du35687693
Clustering Approach Recommendation System using Agglomerative Algorithm
Certain Investigation on Dynamic Clustering in Dynamic Datamining
INTEGRATED ASSOCIATIVE CLASSIFICATION AND NEURAL NETWORK MODEL ENHANCED BY US...
11.software modules clustering an effective approach for reusability

What's hot (20)

PDF
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
PDF
IRJET - Survey on Clustering based Categorical Data Protection
PDF
A statistical data fusion technique in virtual data integration environment
PDF
Review of Existing Methods in K-means Clustering Algorithm
PDF
G0354451
PDF
Effective data mining for proper
PDF
Building a Classifier Employing Prism Algorithm with Fuzzy Logic
PDF
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...
PDF
Recommendation system using bloom filter in mapreduce
PDF
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
PDF
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
PDF
Assessment of Cluster Tree Analysis based on Data Linkages
PDF
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
PDF
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
PDF
Ijmet 10 01_141
PDF
Similarity distance measures
PPTX
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
PDF
Classification By Clustering Based On Adjusted Cluster
PDF
Study of the Class and Structural Changes Caused By Incorporating the Target ...
PDF
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
IRJET - Survey on Clustering based Categorical Data Protection
A statistical data fusion technique in virtual data integration environment
Review of Existing Methods in K-means Clustering Algorithm
G0354451
Effective data mining for proper
Building a Classifier Employing Prism Algorithm with Fuzzy Logic
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...
Recommendation system using bloom filter in mapreduce
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Assessment of Cluster Tree Analysis based on Data Linkages
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
Ijmet 10 01_141
Similarity distance measures
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
Classification By Clustering Based On Adjusted Cluster
Study of the Class and Structural Changes Caused By Incorporating the Target ...
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Ad

Similar to IRJET- Optimal Number of Cluster Identification using Robust K-Means for Sequences in Categorical Sequences (20)

PDF
A survey on Efficient Enhanced K-Means Clustering Algorithm
PDF
84cc04ff77007e457df6aa2b814d2346bf1b
PDF
Chapter#04[Part#01]K-Means Clusterig.pdf
PDF
Extended pso algorithm for improvement problems k means clustering algorithm
PDF
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
PDF
Extended pso algorithm for improvement problems k means clustering algorithm
PDF
Experimental study of Data clustering using k- Means and modified algorithms
PDF
A Study of Efficiency Improvements Technique for K-Means Algorithm
PPTX
Unsupervised%20Learninffffg (2).pptx. application
PDF
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
PDF
47 292-298
DOCX
8.clustering algorithm.k means.em algorithm
PPTX
Large Scale Data Clustering: an overview
PDF
Bb25322324
PDF
An Iterative Improved k-means Clustering
PDF
A SURVEY OF CLUSTERING ALGORITHMS IN ASSOCIATION RULES MINING
PDF
A SURVEY OF CLUSTERING ALGORITHMS IN ASSOCIATION RULES MINING
PDF
A SURVEY OF CLUSTERING ALGORITHMS IN ASSOCIATION RULES MINING
PPT
26-Clustering MTech-2017.ppt
A survey on Efficient Enhanced K-Means Clustering Algorithm
84cc04ff77007e457df6aa2b814d2346bf1b
Chapter#04[Part#01]K-Means Clusterig.pdf
Extended pso algorithm for improvement problems k means clustering algorithm
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
Extended pso algorithm for improvement problems k means clustering algorithm
Experimental study of Data clustering using k- Means and modified algorithms
A Study of Efficiency Improvements Technique for K-Means Algorithm
Unsupervised%20Learninffffg (2).pptx. application
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
47 292-298
8.clustering algorithm.k means.em algorithm
Large Scale Data Clustering: an overview
Bb25322324
An Iterative Improved k-means Clustering
A SURVEY OF CLUSTERING ALGORITHMS IN ASSOCIATION RULES MINING
A SURVEY OF CLUSTERING ALGORITHMS IN ASSOCIATION RULES MINING
A SURVEY OF CLUSTERING ALGORITHMS IN ASSOCIATION RULES MINING
26-Clustering MTech-2017.ppt
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PPTX
Lecture Notes Electrical Wiring System Components
PPTX
additive manufacturing of ss316l using mig welding
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
Arduino robotics embedded978-1-4302-3184-4.pdf
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
web development for engineering and engineering
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPT
Project quality management in manufacturing
PPTX
Geodesy 1.pptx...............................................
PDF
PPT on Performance Review to get promotions
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
Welding lecture in detail for understanding
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
Sustainable Sites - Green Building Construction
Lecture Notes Electrical Wiring System Components
additive manufacturing of ss316l using mig welding
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
OOP with Java - Java Introduction (Basics)
bas. eng. economics group 4 presentation 1.pptx
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Arduino robotics embedded978-1-4302-3184-4.pdf
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
web development for engineering and engineering
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Project quality management in manufacturing
Geodesy 1.pptx...............................................
PPT on Performance Review to get promotions
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Foundation to blockchain - A guide to Blockchain Tech
Welding lecture in detail for understanding
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Sustainable Sites - Green Building Construction

IRJET- Optimal Number of Cluster Identification using Robust K-Means for Sequences in Categorical Sequences

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3637 Optimal Number of Cluster Identification using Robust K-means for Sequences in Categorical Sequences S.U. Patil1, U.A. Nuli2 1,2Computer Science and Engineering department, M. Tech, Textile and Engineering Institute, Ichalkaranji, Maharashtra, India ---------------------------------------------------------------------***---------------------------------------------------------------------- Abstract - This paper presents a modified k-means algorithm for clustering. In traditional method of clustering number of clusterstobeformedwillbegivenat the start of the algorithm, which affects performance and efficiency of the algorithm. In Robust K-means for sequences optimal number of cluster will be predicted by removing noise cluster. Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role for practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Different from previous studies, whichmainly focusedon attribute-value data, in this paper, we work on the cluster validation problem for categorical sequences. Clustering is defined as an unsupervised learning where the objects are grouped on the basis of some similarity inherent among them. The intension of this paper is to describe the clustering method which will give the optimal number of clusters in categorical sequences. Key Words: Clusterign ,K-means,clustervalidationindex, categorical sequences, centroid. 1. INTRODUCTION Data mining is a process of deriving required data from a collection of large dataset and making analysis on collected data. The data mining technique is to mine informationfrom a bulky data set and make over it into a reasonable form for supplementary purpose. In generic term this is known as classification. In classification when the classes of an object is given in advance is termed as supervised classification where as the other case when the class label is not tagged to an object in advance is termed as unsupervised classification. The unsupervised classification is commonly known as clustering. Clustering is important analysis techniques that is employed to large datasets and finds its application in the fields like search engines, recommendation systems, data mining, knowledge discovery, bioinformaticsanddocumentation.Nowadays, the data being generated is not only huge in volume, but is also stored across various machines all around the world. The main purpose behind the study of classificationistodevelop a tool or an algorithm, which can be used to predict the class of an unknown object, which is not labeled. Clustering problem cannot be solvedbyonespecific algorithm but it requires various algorithms that differ significantly in their notion of what makes a clusterandhow to efficiently find them. Generally clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set and intended useoftheresults.Clustering is considered to be more difficult than supervised classification as there is no label attached to the patterns in clustering. The given label in the case of supervised classification becomes a clue to grouping data objects as a whole. Whereas in the case of clustering, it becomes difficult to decide, to which group a pattern will belong to, in the absence of a label. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values.Thisis often the case in many domains, wheredata isdescribedbya set of descriptive attributes, many of which are neither numerical nor inherently ordered in any way. Moreover clustering categorical sequences is a challenging problem due to one more reason the difficulties in defining an inherently meaningful measure of similarity between sequences. Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role for practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Without a measure of distance between data values, it is unclear how to define a quality measure for categorical clustering. To do this, we employ mutual information, a measure from information theory. A good clustering is one where the clusters are informative about the data objects they contain. Since data objects are expressed in terms of attribute values, we require that the clusters convey information about the attributevaluesof the objects in the cluster. The evaluation of sequences clustering is currently difficult due to the lack of an internal validation criterion defined with regard to the structural features hidden in sequences. To solve this problem, a novel cluster validity index (CVI) is proposed as a function of clustering, with the intra-clusterstructural compactness and inter-cluster structural separation linearly combined to measure the quality of sequence clusters. Cluster validation,
  • 2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3638 which is the process of evaluating the quality of clustering results, plays an important role in many issues of cluster analysis A partition-based algorithm for robust clustering of categorical sequences andtheCVIarethenassembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. Currently, cluster validation remains an open problem due to the unsupervised nature of clustering tasks, where no external validation criterion is available to evaluate the result. Despite different aspects of cluster validation, we are interested in the problem of determining the optimal number of clusters in a data set, as most basic clustering algorithms assume that the number of clusters is a user- defined parameter, which, however, is difficult to set in practice. 2. RELATED WORK K. A. Abdul Nazeer et al. [12] discuss in this paper about the one major drawback of the k-means algorithm. K- means algorithm, for different sets of values of initial centroids, produces different clusters.Final clusterqualityin algorithm depends on the selection of initial centroids. Two phases includes in original k means algorithm: first for determining initial centroids and second for assigning data points to the nearest clusters and then recalculating the clustering mean. An enhanced clustering method is discussed in this paper, in which both the phases of the original k-means algorithm are modified to improve the accuracy and efficiency. This algorithm combines a systematic method for finding initial centroids and an efficient way for assigning data points to clusters. But still there is a limitation in this enhanced algorithm that is the value of k, the number of desired clusters, is still required to be given as an input, regardless of the distribution of the data points. G. Dong and J. Pei, [10] says algorithm is expected to produce high-quality results associated with the underlying cluster structures in the data set given K. This problem becomes difficult for categorical sequences,because,usually, they are infected with significant quantities of noise, which confuses the identification of cluster structures.Theclusters containing less numberofobjects afterapplyinga algorithms are considered as K-dependentnoises.Suchcluster-number- dependent noises would mislead the validation criterion to evaluate the clustering quality. The analysisofshortcomings of the standard k-means algorithm. As k-means algorithm has to calculate the distance betweeneachdata objectandall cluster centers in each iteration. This repetitive process affects the efficiency of clustering algorithm. R. Xu and D. Wunsch [11], in the existing partition- based methods, the performance of k-means algorithm which is evaluated with various databasessuchasIris,Wine, Vowel, Ionosphere and Crude oil data Set and various distance metrics. proposed that due to the increment in the amount of data across the world, analysis of the data turns out to be very difficult task. To understand and learn the data, classify those data into remarkable collection.So,there is a need of data mining techniques. It is concluded that performance of k-means clustering is depend on the data base used as well as distance metrics. the comparative analysis of one partition clustering algorithm (k means) and one hierarchical clusteringalgorithm(agglomerative).On the basis of accuracy and running time the performance of k- means and hierarchical clustering algorithm is calculated using WEKA tools. This work results that accuracy of k- means is higher than the hierarchical clustering for iris dataset which have real attributesandaccuracyofhierarchal clustering is higher than the k-means for diabetes dataset which have integer, real attribute. So for large datasets k means algorithm is good. R. Amutha et al. [13] proposed that when two or more algorithms of same category of clustering technique is used then best results will be acquired. Two k-means algorithms: Parallel k/h-Means Clustering for Large Data Sets and A Novel K-Means Based Clustering Algorithm for High Dimensional Data Sets. Parallel k/h-Means algorithm is designed to deal with very large data sets. Novel K-Means Based Clustering provides the advantages of using both HC and K-Means. Using these two algorithms, space and similarity between the data sets present each nodes is extended, however, there is a general lack of adaptive mechanisms to deduce the reasonable number of clusters with the possible K-dependent noises identified, given that the user defined K likely deviates from the true number. Solution to this problem a novel cluster validation method which will provide robust clustering is used. Navjot Kaur, Navneet Kaur[4]enhanced the traditional k-means by introducing Ranking method.Author introduces Ranking Method to overcome the deficiency of more execution time taken by traditional k-means. The Ranking Method is a way to find the occurrence of similar data and to improve search effectiveness. The tool used to implemented the improved algorithm is Visual Studio 2008 using C#. The advantages of k-means are also analyzed in this paper. The author finds k-means as fast, robustandeasy understandable algorithm. He also discuss that the clusters are non-hierarchical in nature and are not overlapping in nature. The process used in the algorithm takes student marks as data set and then initial centroid is selected. Euclidean distance is then calculated from centroid for each data object. Then the threshold value is set for each data set. Ranking Method is applied next and finally the clusters are created based on minimum distance between the data point and the centroid. The future scope of this paper is use of Query Redirection can be used to cluster huge amount of data from various databases. 3. METHODOLOGY In this partition-based robust K-means for sequences (RKMS) for robust clustering of categorical sequences, the
  • 3. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3639 data which is given as input is in character form. There are difficulties in defining an inherently meaningful measure of similarity between sequences. To apply the clustering algorithms and existing indices to sequences, one has to resort to the vectorizationmethod. Toconvertthesequences in vector form it is encoded into Latin-1, also called ISO- 8859-1, is an 8-bit character set endorsed by the International Organization for Standardization (ISO) and represents the alphabets of Western European languages. Once the character file is converted to binary format we are going to use Principal Component Analysis (PCA) algorithm. The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of many variables correlated witheachother, eitherheavilyorlightly, while retaining the variation present in thedataset,uptothe maximum extent. Using PCA is very important because the dataset which we have it is in very diverse format.Analyzing this type of data becomes very difficult without using PCA. After all this preprocessing we are ready to apply RKMS algorithm on the data. The number of clusters, K, produced by RKMS is not necessarily equal to the given number ˆK (we will call ˆK the expected number of clusters).The number of clusters is adaptively determined by the elimination of K-dependent noise during the RKMS clustering. Two key issues for a robust K-means type algorithm are: the selection of the initial cluster centroids and the identification of noise clusters. In general, those groups that contain a small number of objects can be viewed as noise. To identify the noise cluster on threshold will be given and iftheobjectsina cluster are less than thresholdthenthatclusterisconsidered as noise. After creating a cluster quality of the cluster is measure by using Cluster Validity Index. For that structural dissimilarity between the cluster object is considered. This new index is linear combination of the within-cluster how objects are scattered and between- cluster how objects are separated and these both measures are computed on the structural dissimilarity of the sequences which is probabilistic approach. Fig. 1. System Architecture of RKMS method If we are sign high dimension data then robustness of K-means algorithm may became a problem. There will be two main issues we might face in this algorithm one is selecting the initial centroied value. And another issue will be finding and removing the noise cluster. The first issue raises, because it is known that the performance of the K- means algorithm, as an instance of the EM method [13], is sensitive to the initial conditions (here, the initial cluster centroids). Since the number of clusters is fixed at ˆK given by the users, the traditional algorithm likely produces clusters consisting of ˆK -dependent noise, especially when the expected number deviates largely from thetruenumber. RKMS aims at building a robust condition for the coming iteration, by choosing a set of well-scattered objects (sequences) as the initial cluster centroids. The selection method is based on the greedy approach.
  • 4. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3640 The (k + 1)th initial centroid based on the maximum– minimum principle, given by In the traditional greedy approach, however, the first centroid is chosen randomly or according to the length of object vector. Since each sequence has beenrepresentedina unit vector as Definition 1 shows, we choose the first two centroids I1 and I2 by That is, two sequences havingthemaximal pairwisedistance are chosen as the initial centroids for C1 and C2; the remaining Kˆ − 2 centroids are then determined by repeatedly using (2) with k = 2,..., Kˆ − 1. RKMS is to delete the possible K-dependent noise in each iteration, to further enhance therobustnessofthealgorithm. The details will be given in Section III-C. Note that the number of resulting clustersproducedby RKMScouldbeless than the given ˆK, due to the inclusion of this step in the algorithm. This feature is somewhat similar to that of [10] and [12], where the number canbereducedbyfadingout the redundant clusters;however,RKMStargetstheidentification of K-dependent noise. Given a clustering of the sequence set, C, we now aim at establishing a judgement criterion to examine whethereach group Ck ∈ C is a noise cluster or not. In general, those groups that contain a small number of objects can beviewed as noise . According to this view, one can derive a necessary condition for the judgement, i.e., nk = |Ck| ≤τ,wherenkisthe number of objects in Ck and τ is a threshold defining the minimal number in a normal cluster. If the condition holds with τ = 1, then the cluster is undoubtedly a noise as it consists of one single object. However,itisnotthecasewhen 1 < nk ≤ τ , because the assertion would be concerned with the distribution of the objects in the cluster. To address the problem, we propose a model-selection based method, in view of the observation that if each object in a cluster is indeed K-dependent noise, then the preferred clustering model should be the one with that cluster removed. With regard to Ck ∈ C, we denote the new clustering after the removal of Ck by C_, created by reassigning each sequence originally belonging to Ck to its closest cluster except Ck. Thus, the number of clusters satisfies |C_| = |C| − 1 and, consequently, the conclusion that Ck is a noise clustercanbe drawn if the model of C_ is preferred to that of C. Structure-based cluster validity index have two main components first one is structural dissimilarity (SD) of sequences and new cluster validity index(CVI). To measure SD for sequences by Markovian modeling, which is based on the hypothesis that occurrence of each symbol xl in the sequence s = x1 . . . xl . . . xL is closely related to its preceding subsequence x0x1 . . . xl−1. For a Markov model of order2 ꝭ , the preceding subsequence of xl is truncated to δl = xl−ꝭ . . . xl−1, excluding xl− ꝭ for l < ꝭ. By such modeling, the sequential dependence of each symbol xl on its preceding subsequence δl can be characterized using the conditional probability ps(xl |δl ) with regard to the sequence s. where ps(δ) is the probability of δ with regard to s. The SD between two sequences s and s_ can then be numerically computed, by measuring the dissimilarity of their probability distributions ps(X) and ps'(X). Then next part is CVI VSD where it will validate the quality of a series of clustering results,eachgenerated bythe clustering algorithm on thesamesequenceset Swithvarious numbers of sequence clusters. To distinguish the results with different cluster numbers, we now use CK to denote C consisting of K sequence clusters, i.e., CK = {C1, . . . ,Ck, . . . ,CK }. Denote the minimal and the maximal numberofclustersin the series by Kmin and Kmax, respectively. The new index is a linear combination of the within-cluster scatter and between-cluster separation, where both measures are computed on the structural dissimilarity of sequences using SD(·, ·) of (5). We define the total within-cluster scatter and between-cluster separation of CK, denoted by Scat(CK ) and Sep(CK ), respectively, as Since, in practice, the number of clusters is ranged in [Kmin, Kmax], we define the CVI as a linear combination of the normalized measures, that is where maxScat = maxK_∈[Kmin,Kmax] Scat(CK_ ) and maxSep = maxK_∈[Kmin,Kmax] Sep(CK_ ). The index offersa tradeoff between thewithin-clusterscatterandthebetween- cluster separation. The minimal VSD is considered to be
  • 5. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3641 associated with the clustering results produced by the algorithm using the optimal number of clusters. 4. RESULT AND DISSCUSSION The main purpose is detecting optimal cluster numbers on "USDA Plants Dataset" with using K-means clustering algorithm. Database contains all plants (species and genera) in the database and the states of USA and Canada where they occur. The dataset used for this research contains 34781 records with 70 features DataSet Characteristics: Multivariate Attribute Characteristics: Categorical Associated Tasks: Clustering Number of Instances: 22632 Number of Attributes: 70 Missing Values? Yes Table -1: Dataset features To implement this wehave Silhouetteanalysis.Itcanbeused to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1]. Silhouette coefficients (as these values are referred to as) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sampleis on or very close to the decision boundary between two neighboring clusters and negative values indicate that those samples might have been assigned to the wrong cluster. BelowwehaveSilhouettecoefficientsfortheclusters K- meansalgorithm on 1000attribute data for the respective values of K. K- Value Passed Silhouette coefficients 10 0.161727 20 0.214786 30 0.218644 40 0.255462 Table-2: K values with respective silhouette score The performance of RKMS is evaluated in terms of clustering accuracy (CA), which is computed as where K is the true number and ak is the number of sequences in the majority group corresponding to Ck. To examine the strength of the robust method used to choose the initial cluster centroids in RKMS, we also performed clustering using K-means with random initialization [1], [4]. The data set was clustered by K-meansfor20 executionsand both the average and best accuracy are reported After applying RKMS algorithm on the same data set the optimal number of cluster we got is as below. Estimated number of cluster Silhouette coefficients 23 0.361 Fig-2. Cluster image with 500 data element 3. CONCLUSION A robust algorithm named RKMS for partitionbased sequence clustering. In the RKMS algorithm, unlike traditional partitioning methods, the number of clusters (K) can be automatically deducedbasedonthe expectednumber ( ˆK given by the users), by the eliminationofcluster-number dependent noise (called K-dependent noise).RKMSandVSD to the common model selection procedure, to estimate the optimal number of clusters in a sequence set. A cluster validation method for categorical sequence clustering is proposed, which is different from the existing methods mainly designed for numeric data. A probabilistic approach to measure the structural dissimilarity of categorical sequences, based on which the within-cluster compactness and between-cluster separation are formulated, in order to evaluate the structural similarity of sequences in each cluster and the structural dissimilarity between sequence clusters. To evaluate the quality of sequence clusters, a new CVI named VSD by linearly combining the within-cluster structural compactness and the between-cluster structural separation.
  • 6. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3642 REFERENCES 1] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Clustering validity checking methods: Part II,” ACM SIGMOD Rec. Arch., vol. 31, no. 3, pp. 19–27, 2002. 2] C. C. Aggarwal, Data Mining: The textbook. New York, NY, USA: Springer, 2015. 3] J. C. Dunn, “A fuzzy relative of the ISODATAprocessand its use in detecting compact well-separated clusters,” J. Cybern., vol. 3, no. 3, pp. 32–57, 1973. 4] T. Cali´nski and J. Harabasz,“Adendritemethodforcluster analysis,” Commun. Stat., vol. 3, no. 1, pp. 1–27, 1974. 5] H. Sun, S. Wang, and Q. Jiang, “FCM-based model selection algorithms for determining the number of clusters,” Pattern Recognit., vol. 37, no. 10, pp. 2027–2037, Oct. 2004. 6] L. Xu, T. W. S. Chow, and E. W. M. Ma, “Topology-based clustering using polar self-organizing map,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 4, pp. 798–808, 7] D. Pelleg and A. W. Moore, “X-means: Extending k-means with efficient estimation of the number of clusters,” in Proc. 17th ICML, 2000, pp. 727–734. 8] J. Yang and W. Wang, “CLUSEQ: Efficient and effective sequence clustering,” in Proc. IEEE ICDE,Mar.2003,pp.101– 112. 9] M. M.-T. Chiang and B. Mirkin, “Intelligent choice of the number of clusters in k-means clustering: An experimental study with different cluster spreads,” J. Classif., vol. 27, no. 1, pp. 3–40, 2010. 10] G. Dong and J. Pei, “Classification,clustering, features and distances of sequence data,” Seq. Data Min., vol. 33, pp. 47– 65, 2007. [11] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Trans Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005. 12] K. A. Abdul Nazeer, M. P. Sebastian,.Improving the Accuracy and Efficiency of thek-means Clustering Algorithm.,Proceedings of the World Congress on Engineering 2009 Vol I WCE 2009, July 1 - 3, 2009, London, U.K.