International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 01 | Jan -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 440
Survey Paper on Clustering Data Streams Based on Shared Density
between Micro-Clusters
Dure Supriya Suresh ; Prof. Wadne Vinod
ME (Student), ICOER, Wagholi, Pune, Maharashtra, India
Asst. Professor, ICOER, Wagholi, Pune, Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract—As more and more applications
produce streaming data, clustering data streams
has become an important technique for data and
knowledge engineering. A typical approach is to
summarize the data stream in real-time with an
online process into a large number of so called
micro-clusters. Micro-clusters represent local
density estimates by aggregating the information
of many data points in a defined area. On demand,
a (modified) conventional clustering algorithm is
used in a second offline step to recluster the
micro-clusters into larger final clusters. For
reclustering, the centers of the micro-clusters are
used as pseudo points with the density estimates
used as their weights. However, information about
density in the area between micro-clusters is not
preserved in the online process and reclustering is
based on possibly inaccurate assumptions about
the distribution of data within and between micro-
clusters (e.g., uniform or Gaussian). This paper
describes DBSTREAM, the first micro-cluster-
based online clustering component that explicitly
captures the density between micro-clusters via a
shared density graph. The density information in
this graph is then exploited for reclustering based
on actual density between adjacent micro-clusters.
Index Terms—Data mining, data stream
clustering, density-based clustering
1] INTRODUCTION
CLUSTERING data streams has become an
important technique for data and knowledge
engineering. A data stream is an ordered and
potentially unbounded sequence of data points.
Such streams of constantly arriving data are
generated for many types of applications and
include GPS data from smart phones, web click-
stream data, computer network monitoring data,
telecommunication connection data, readings
from sensor nets, stock quotes, etc.
Data stream clustering is typically done as a
two-stage process with an online part which
summarizes the data into many micro-clusters or
grid cells and then, in an offline process, these
micro-clusters (cells) are reclustered/merged into
a smaller number of final clusters. Since the
reclustering is an offline process and thus not time
critical, it is typically not discussed in detail in
papers about new data stream clustering
algorithms. Most papers suggest using an existing
(possibly modified) conventional clustering
algorithm, where
the micro-clusters are used as pseudo points.
Another approach used in DenStream is to use
reachability where all micro-clusters which are
less than a given distance from each other are
linked together to form clusters. Grid-based
algorithms typically merge adjacent dense grid
cells to form larger clusters. Current reclustering
approaches completely ignore the data density in
the area between the micro-clusters (grid cells)
and thus might join micro-clusters (cells) which
are close together but at the same time separated
by a small area of low density. To address this
problem, Tu and Chen introduced an extension to
the grid-based D-Stream algorithm based on the
concept of attraction between adjacent grid cells
and showed its effectiveness.
In this paper, we develop and evaluate a
new method to address this problem for micro-
cluster-based algorithms. We introduce the
concept of a shared density graph which explicitly
captures the density of the original data between
micro-clusters during clustering and then show
how the graph can be used for reclustering micro-
clusters. This is a novel approach since instead of
relying on assumptions about the distribution of
data points assigned to a micro-cluster (MC) (often
a Gaussian distribution around a center), it
estimates the density in the shared region
between micro-clusters directly from the data. To
the best of our knowledge, this paper is the first to
propose and investigate using a shared-density-
based reclustering approach for data stream
clustering.
2] RELATED WORK
Density-based clustering is a well-researched area
and we can only give a very brief overview here.
DBSCAN [10] and several of its improvements can
be seen as the prototypical density-based
clustering approach. DBSCAN estimates the
density around each data point by counting the
number of points in a user-specified eps-
neighborhood and applies user-specified
thresholds to identify core, border and noise
points. In a second step, core points are joined into
a cluster if they are density-reachable (i.e., there is
a chain of core points where one falls inside the
eps-neighborhood of the next). Finally, border
points are assigned to clusters. Other approaches
are based on kernel density estimation (e.g.,
DENCLUE [11]) or use shared nearest neighbors.
However, these algorithms were not developed
with data streams in mind. A data stream is an
ordered and potentially unbounded sequence of
data points $X = \langle x_1, x_2, x_3, \ldots \rangle$. It is not possible
to permanently store all the data in the stream
which implies that repeated random access to the
data is infeasible. Also, data streams exhibit
concept drift over time where the position and/or
shape of clusters changes, and new clusters may
appear or existing clusters disappear. This makes
the application of existing clustering algorithms
difficult. Data stream clustering algorithms limit
data access to a single pass over the data and
adapt to concept drift. Over the last 10 years many
algorithms for clustering data streams have been
proposed. Most data stream clustering algorithms
use a two-stage online/offline approach.
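The DBSCAN point classification summarized earlier in this section can be illustrated with a short sketch. It is a minimal, quadratic-time illustration and not a full DBSCAN implementation; the function name `classify_points` and the parameter names `eps` and `min_pts` are chosen here for illustration, and the neighborhood is assumed to include the point itself.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise' (DBSCAN-style)."""
    # eps-neighborhood of every point (assumption: a point counts itself).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_pts points in their eps-neighborhood.
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            # Not dense itself, but inside the neighborhood of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.8, 0.0), (10.0, 0.0)]
print(classify_points(points, eps=1.0, min_pts=3))
```

The sketch needs repeated access to all points, which is exactly what a data stream does not allow; this is why the stream algorithms discussed next work on summaries instead.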
Fig. 1. Problem with reclustering when dense
areas are separated by small areas of low density
with (a) micro clusters and (b) grid cells.
Reclustering methods based solely on micro-
clusters only take closeness of the micro-clusters
into account. This makes it likely that two micro-
clusters which are close to each other, but
separated by an area of low density still will be
merged into a cluster. Information about the
density between micro-clusters is not available
since the information does not get recorded in the
online step and the original data points are no
longer available. Fig. 1a illustrates the problem
where the micro-clusters MC1 and MC2 will be
merged as long as their distance d is low. This is
even true when density-based clustering methods
(e.g., DBSCAN) are used in the offline reclustering
step, since the reclustering is still exclusively
based on the micro-cluster centers and weights.
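To make the problem concrete, the following sketch reclusters micro-clusters purely by the distance between their centers, in the spirit of the reachability approach mentioned in the introduction. It is an illustration under assumptions and not the code of any particular algorithm; `recluster_by_distance` and `max_dist` are hypothetical names.

```python
import math

def recluster_by_distance(centers, max_dist):
    """Link micro-clusters whose centers are closer than max_dist.

    Returns one cluster label per micro-cluster. Only center positions
    are used, so a low-density gap between two close micro-clusters
    cannot prevent them from being merged.
    """
    parent = list(range(len(centers)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < max_dist:
                parent[find(j)] = find(i)

    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(centers))]

# MC1 and MC2 are close and get merged regardless of the density between them.
print(recluster_by_distance([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], max_dist=1.5))
```

Nothing in the input tells the function whether the space between the first two centers is densely populated or empty, which is the information a shared density graph is meant to supply.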
Fig. 2. MC1 is a single MC. MC2 and MC3 are close to each other but
the density between them is low relative to the two MCs' densities,
while MC3 and MC4 are connected by a high density area.
3] PROPOSED SYSTEM
In this paper, we develop and evaluate a
new method to address this problem for micro-
cluster-based algorithms. We introduce the
concept of a shared density graph which explicitly
captures the density of the original data between
micro-clusters during clustering and then show
how the graph can be used for reclustering micro-
clusters. This is a novel approach since instead of
relying on assumptions about the distribution of
data points assigned to a micro-cluster (often a
Gaussian distribution around a center), it
estimates the density in the shared region
between micro-clusters directly from the data.
4] THE DBSTREAM ONLINE COMPONENT
Typical micro-cluster-based data stream
clustering algorithms retain the density within
each micro-cluster as some form of weight (e.g.,
the number of points assigned to the MC). Some
algorithms also capture the dispersion of the
points by recording variance. For reclustering,
however, only the distances between the MCs and
their weights are used. In this setting, MCs which
are closer to each other are more likely to end up
in the same cluster. This is even true if a density-
based algorithm like DBSCAN [10] is used for
reclustering since here only the position of the MC
centers and their weights are used. The density in
the area between MCs is not available since it is
not retained during the online stage.
The basic idea of this work is that if we can
capture not only the distance between two
adjacent MCs but also the connectivity using the
density of the original data in the area between the
MCs, then the reclustering results may be
improved. In the following we develop DBSTREAM
which stands for density-based stream clustering.
4.1 Leader-Based Clustering
Leader-based clustering was introduced by
Hartigan [21] as a conventional clustering
algorithm. It is straight-forward to apply the idea
to data streams (see, e.g., [20]).
DBSTREAM represents each MC by a leader (a
data point defining the MC’s center) and the
density in an area of a user-specified radius r
(threshold) around the center. This is similar to
DBSCAN's concept of counting the points in an eps-
neighborhood; however, here the density is not
estimated for each point, but only for each MC,
which can easily be achieved for streaming data. A
new data point is assigned to an existing MC
(leader) if it is within a fixed radius of its center.
The assigned point increases the density estimate
of the chosen cluster and the MC’s center is
updated to move towards the new data point. If
the data point falls in the assignment area of
several MCs then all of them are updated. If a data
point cannot be assigned to any existing MC, a new
MC (leader) is created for the point. Finding the
potential clusters for a new data point is a fixed-
radius nearest-neighbor problem [22] which can
be efficiently dealt with for data of moderate
dimensionality using spatial indexing data
structures like a k-d tree [23]. Variations of this
simple algorithm were suggested in [24] for
outlier detection and in [25] for sequence
modeling.
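The assignment rule described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: it uses a linear scan instead of a k-d tree, it leaves out the movement of the center, and the names `assign_point`, `centers` and `weights` are hypothetical.

```python
import math

def assign_point(x, centers, weights, r):
    """Assign x to every micro-cluster within radius r of its center.

    centers: list of center coordinates (one list per micro-cluster)
    weights: list of density estimates, parallel to centers
    Returns the indices of the micro-clusters that were updated or created.
    """
    # Fixed-radius nearest-neighbor search (assumption: plain linear scan).
    hits = [i for i, c in enumerate(centers) if math.dist(x, c) < r]
    if not hits:
        # No existing leader is close enough: x becomes a new leader.
        centers.append(list(x))
        weights.append(1.0)
        return [len(centers) - 1]
    for i in hits:
        # The point increases the density estimate of every MC it falls into.
        weights[i] += 1.0
    return hits

centers, weights = [], []
for point in [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0)]:
    assign_point(point, centers, weights, r=1.0)
print(centers, weights)
```

Returning all hit indices, rather than only the nearest one, matters for the later step of counting points that are shared between micro-clusters.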
4.2 Competitive Learning
New leaders are chosen as points which cannot be
assigned to an existing MC. The positions of these
newly formed MCs are most likely not ideal for the
clustering. To remedy this problem, we use a
competitive learning strategy introduced in [26]
to move the MC centers towards each newly
assigned point. To control the magnitude of the
movement, we use a neighborhood function $h()$
similar to self-organizing maps.
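A possible form of this update is sketched below. The text does not spell out the neighborhood function, so the Gaussian shape and the choice of its width as one third of the radius are assumptions made for this illustration; `move_center` is a hypothetical name.

```python
import math

def move_center(c, x, r):
    """Move center c towards the newly assigned point x.

    Assumption: Gaussian neighborhood function h with sigma = r / 3,
    so points near the border of the assignment area barely move c.
    """
    sigma = r / 3.0
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    h = math.exp(-d2 / (2.0 * sigma ** 2))
    return [ci + h * (xi - ci) for xi, ci in zip(x, c)]

print(move_center([0.0, 0.0], [0.3, 0.0], r=0.9))
```

Because the neighborhood value never exceeds one, the center moves at most onto the new point and never beyond it.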
4.3 Capturing Shared Density
Capturing shared density directly in the online
component is a new concept introduced in this
paper. The fact, that in dense areas MCs will have
an overlapping assignment area, can be used to
measure density between MCs by counting the
points which are assigned to two or more MCs.
The idea is that high density in the intersection
area relative to the rest of the MCs’ area means
that the two MCs share an area of high density and
should be part of the same macro-cluster. In the
example in Fig. 2 we see that MC2 and MC3 are
close to each other and overlap. However, the
shared weight $s_{2,3}$ is small compared to the
weight of each of the two involved MCs, indicating
that the two MCs do not form a single area of high
density. On the other hand, MC3 and MC4 are
more distant, but their shared weight $s_{3,4}$ is large,
indicating that both MCs form an area of high
density and thus should form a single macro-
cluster.
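The counting idea can be sketched as a sparse graph keyed by pairs of MC indices. The sketch is an illustration under assumptions: the names `update_shared` and `connectivity` are hypothetical, and relating the shared weight to the average weight of the two MCs is one plausible way to express "relative to the weight of the involved MCs".

```python
from collections import defaultdict
from itertools import combinations

def update_shared(shared, hits):
    """Add one unit of shared weight for every pair of MCs a point fell into."""
    for i, j in combinations(sorted(hits), 2):
        shared[(i, j)] += 1.0

def connectivity(shared, weights, i, j):
    """Shared weight of MCs i and j relative to their average weight (assumed)."""
    s = shared.get((min(i, j), max(i, j)), 0.0)
    return s / ((weights[i] + weights[j]) / 2.0)

# Toy numbers in the spirit of Fig. 2: few shared points for (2, 3), many for (3, 4).
weights = {2: 10.0, 3: 10.0, 4: 10.0}
shared = defaultdict(float)
update_shared(shared, [2, 3])
for _ in range(6):
    update_shared(shared, [3, 4])
print(connectivity(shared, weights, 2, 3), connectivity(shared, weights, 3, 4))
```

With these made-up counts the pair (3, 4) gets a much larger value than (2, 3), matching the reading of Fig. 2 given above.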
4.4 Fading and Forgetting Data
To adapt to evolving data streams we use the
exponential fading strategy introduced in
DenStream [6] and used in many other algorithms.
Cluster weights are faded in every time step by a
factor of $2^{-\lambda}$, where $\lambda > 0$ is a user-specified fading
factor. We implement fading in a similar way as in
D-Stream [9], where fading is only applied when a
value changes (e.g., the weight of an MC is updated).
For example, if the current time step is $t = 10$ and
the weight $w$ was last updated at $t_w = 5$, then we
apply for fading the factor $2^{-\lambda (t - t_w)}$, resulting in
the correct fading for five time steps. In order for
this approach to work we have to keep a
timestamp with the time when fading was applied last
for each value that is subject to fading.
4.5 The Complete Online Algorithm
Algorithm 1 shows our approach and the used
clustering data structures and user-specified
parameters in detail. Micro-clusters are stored as a
set MC. Each micro-cluster is represented by the
tuple $(c, w, t)$ representing the cluster center, the
cluster weight and the last time it was updated,
respectively. The weighted adjacency list S
represents the sparse shared density graph which
captures the weight of the data points shared by
MCs. Since shared density estimates are also
subject to fading, we also store a timestamp with
each entry. Fading shared density estimates is also
important since MCs are allowed to move, which
over time would lead to estimates of intersection
areas the MC is not covering anymore.
5] COMPUTATIONAL COMPLEXITY
Space complexity of the clustering depends on the
number of MCs that need to be stored in MC. In the
worst case, the maximum number of strong MCs
at any time is $t_{gap}$ MCs and is reached when every
MC receives exactly a weight of one during each
interval of $t_{gap}$ time steps. Given the cleanup
strategy in Algorithm 2, where we remove weak
MCs every $t_{gap}$ time steps, the algorithm never
stores more than $k' = 2t_{gap}$ MCs.
The space complexity of MC is linear in the
maximal number of MCs $k'$. The worst case size of
the adjacency list of the shared density graph S
depends on $k'$ and the dimensionality of the data.
In the 2D case each MC can have a maximum of
$|N| = 6$ neighbors (at optimal packing). Therefore,
each of the $k'$ MCs has in the adjacency list S at
most six entries, resulting in a space complexity of
storing MC and S of $O(t_{gap})$. For higher-
dimensional data streams, the maximal number of
possible adjacent hyper spheres is given by
Newton’s number also referred to as kissing
number [29]. Newton’s number defines the
maximal number of hyper spheres which can
touch a hyper sphere of the same size without
intersecting any other hyper sphere. If we double
the radius of all hyper spheres in this
configuration then we get our scenario with
sphere centers touching the surface of the center
sphere. We use $K_d$ to denote Newton's number in
$d$ dimensions. Newton's exact number is known
only for some small dimensionality values $d$, and
for many other dimensions only lower and upper
bounds are known. Note that Newton's number
grows fast, reaches 196,560 for $d = 24$ and is
unknown for most larger $d$. This growth would
make storing the shared weights for high-
dimensional data in a densely packed area very
expensive. However, we also know that the
maximal neighborhood size $|N_{\max}| \le \min(k' - 1, K_d)$,
since we cannot have more neighbors than
we have MCs. Therefore, the space complexity of
maintaining S is bounded by $O(k' |N_{\max}|)$.
To analyze the algorithm's time complexity, we need
to consider all parts of the clustering function. The
fixed-radius nearest neighbor search can be done
using linear search in $O(d n k')$, where $d$ is the
data dimensionality, $n$ is the number of data points
clustered and $k'$ is the number of MCs. The time
complexity can be improved to $O(d n \log(k'))$
using a spatial indexing data structure like a k-d
tree [23]. Adding or updating a single MC is done
in time linear in n.
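The space bound discussed in this section relies on lazily faded weights (Section 4.4) and on the periodic removal of weak MCs. The sketch below illustrates both under assumptions: it is not the authors' Algorithm 2, the class `MicroCluster` only mirrors the tuple of center, weight and timestamp from Section 4.5, and the threshold `w_min` as well as all function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MicroCluster:
    c: list    # cluster center
    w: float   # weight as of time step t
    t: int     # time step of the last update

def faded(value, last_update, now, lam):
    """Apply the per-step factor 2^(-lam) for all elapsed steps at once."""
    return value * 2.0 ** (-lam * (now - last_update))

def add_point_weight(mc, now, lam):
    """Fade the stored weight up to 'now', then count the new point."""
    mc.w = faded(mc.w, mc.t, now, lam) + 1.0
    mc.t = now

def remove_weak(mcs, now, lam, w_min):
    """Keep only MCs whose faded weight is still at least w_min (assumed rule)."""
    return [mc for mc in mcs if faded(mc.w, mc.t, now, lam) >= w_min]

# Example from the text: last update at t_w = 5, current time step t = 10.
print(faded(8.0, last_update=5, now=10, lam=0.2))  # five steps of fading at once
```

Storing the timestamp next to each weight is what allows the weight to be left untouched between updates; the same pattern would apply to the entries of the shared density graph.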
6] EXPERIMENTS
To perform our experiments and make them
reproducible, we have implemented/interfaced all
algorithms in a publicly available R-extension
called stream [30]. Stream provides an intuitive
interface for experimenting with data streams and
data stream algorithms. It includes generators for
all the synthetic data used in this paper as well as a
growing number of data stream mining algorithms
including clustering algorithms available in the
MOA (Massive Online Analysis) framework [31]
and the algorithm discussed in this paper. In this
paper we use four synthetic data streams called
Cassini, Noisy Mixture of Gaussians, and DS3 and
DS4 used to evaluate CHAMELEON [13]. These
data sets do not exhibit concept drift. For data
with concept drift we use MOA's Random RBF
Generator with Events. In addition we use several
real data sets called Sensor, Forest Cover Type
and the KDD CUP'99 data which are often used
for comparing data stream clustering algorithms.
Kremer et al. [32] discuss internal and external
evaluation measures for the quality of data stream
clustering. We conducted experiments with a large
set of evaluation measures (purity, precision,
recall, F-measure, sum of squared distances,
silhouette coefficient, mutual information,
adjusted Rand index). In this study we mainly
report the adjusted Rand index to evaluate the
average agreement of the known cluster structure
(ground truth) of the data stream with the found
structure. The adjusted Rand index (adjusted for
expected random agreements) is widely accepted
as the appropriate measure to compare the quality
of different partitions given the ground truth [33].
Zero indicates that the found agreements can be
entirely explained by chance and the closer the
index is to one, the better the agreement. For
clustering with concept drift, we also report
average purity and average within cluster sum of
squares (WSS). However, like most other
measures, these make comparison difficult. For
example, average purity (equivalent to precision
and part of the F-measure) depends on the
number of clusters and thus makes comparison of
clusterings with a different number of clusters
invalid. The within cluster sum of squares favors
algorithms which produce spherical clusters (e.g.,
k-means-type algorithms). A smaller WSS
represents tighter clusters and thus a better
clustering. However, WSS always will get smaller
with an increasing number of clusters. We report
these measures here for comparison since they are
used in many data stream clustering papers.
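The behavior of two of these measures can be checked on a toy example. The sketch below computes the adjusted Rand index from its standard pair-counting definition and purity as the share of points that belong to the majority class of their cluster; it is a plain illustration with hypothetical function names, not the evaluation code used in the experiments, and it assumes at least two data points.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, found):
    """Adjusted Rand index of two labelings of the same points (n >= 2)."""
    n = len(truth)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, found)).values())
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(found).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form one cluster
    return (sum_ij - expected) / (max_index - expected)

def purity(truth, found):
    """Share of points that carry the majority true label of their cluster."""
    best = Counter()
    for (cluster, _label), count in Counter(zip(found, truth)).items():
        best[cluster] = max(best[cluster], count)
    return sum(best.values()) / len(truth)

truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]), purity(truth, [1, 1, 0, 0]))
print(adjusted_rand_index(truth, [0, 1, 2, 3]), purity(truth, [0, 1, 2, 3]))
```

In the second call every point forms its own cluster: purity is still perfect, while the adjusted Rand index drops to zero, which illustrates why purity alone is hard to compare across clusterings with different numbers of clusters.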
7] CONCLUSION
In this paper, we have developed the first data
stream clustering algorithm which explicitly
records the density in the area shared by micro-
clusters and uses this information for reclustering.
We have introduced the shared density graph
together with the algorithms needed to maintain
the graph in the online component of a data
stream mining algorithm. Although we showed
that the worst-case memory requirements of the
shared density graph grow extremely fast with
data dimensionality, complexity analysis and
experiments reveal that the procedure can be
effectively applied to data sets of moderate
dimensionality.
Experiments also show that shared-density
reclustering already performs extremely well
when the online data stream clustering component
is set to produce a small number of large MCs.
Other popular reclustering strategies can only
slightly improve over the results of shared-density
reclustering and need significantly more MCs to
achieve comparable results. This is an important
advantage since it implies that we can tune the
online component to produce fewer micro-clusters
for shared-density reclustering. This improves
performance and, in many cases, the saved
memory more than offsets the memory
requirement for the shared density graph.
9] REFERENCES
[1] S. Guha, N. Mishra, R. Motwani, and L.
O’Callaghan, “Clustering data streams,” in Proc.
ACM Symp. Found. Comput. Sci., 12–14 Nov. 2000,
pp. 359–366.
[2] C. Aggarwal, Data Streams: Models and
Algorithms, (series Advances in Database
Systems). New York, NY, USA: Springer-Verlag,
2007.
[3] J. Gama, Knowledge Discovery from Data
Streams, 1st ed. London, U.K.: Chapman & Hall,
2010.
[4] J. A. Silva, E. R. Faria, R. C. Barros, E. R.
Hruschka, A. C. P. L. F. d. Carvalho, and J. A. Gama,
“Data stream clustering: A survey,” ACM Comput.
Surveys, vol. 46, no. 1, pp. 13:1–13:31, Jul. 2013.
[5] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A
framework for clustering evolving data streams,”
in Proc. Int. Conf. Very Large Data Bases, 2003, pp.
81–92.
[6] F. Cao, M. Ester, W. Qian, and A. Zhou, “Density-
based clustering over an evolving data stream
with noise,” in Proc. SIAM Int. Conf. Data Mining,
2006, pp. 328–339.
[7] Y. Chen and L. Tu, “Density-based clustering
for real-time stream data,” in Proc. 13th ACM
SIGKDD Int. Conf. Knowl. Discovery Data Mining,
2007, pp. 133–142.
[8] L. Wan, W. K. Ng, X. H. Dang, P. S. Yu, and K.
Zhang, “Density-based clustering of data streams
at multiple resolutions,” ACM Trans. Knowl.
Discovery from Data, vol. 3, no. 3, pp. 1–28, 2009.
More Related Content

PDF
A0360109
PDF
B0330811
PDF
C0312023
PDF
NODE FAILURE TIME AND COVERAGE LOSS TIME ANALYSIS FOR MAXIMUM STABILITY VS MI...
PDF
7. 10083 12464-1-pb
DOC
Benefit based data caching in ad hoc networks (synopsis)
PDF
CLUSTERING DATA STREAMS BASED ON SHARED DENSITY BETWEEN MICRO-CLUSTERS
PDF
Clustering data streams based on shared density between micro clusters
A0360109
B0330811
C0312023
NODE FAILURE TIME AND COVERAGE LOSS TIME ANALYSIS FOR MAXIMUM STABILITY VS MI...
7. 10083 12464-1-pb
Benefit based data caching in ad hoc networks (synopsis)
CLUSTERING DATA STREAMS BASED ON SHARED DENSITY BETWEEN MICRO-CLUSTERS
Clustering data streams based on shared density between micro clusters

What's hot (20)

PDF
50120130406035
PDF
An Overview of Information Extraction from Mobile Wireless Sensor Networks
 
PDF
A fuzzy clustering algorithm for high dimensional streaming data
PDF
Algorithmic Construction of Optimal and Load Balanced Clusters in Wireless Se...
 
PDF
Az36311316
PDF
Clustering and data aggregation scheme in underwater wireless acoustic sensor...
PDF
Interpolation Techniques for Building a Continuous Map from Discrete Wireless...
 
PDF
Vol 8 No 1 - December 2013
PDF
Information extraction from sensor networks using the Watershed transform alg...
 
PDF
Mobile ad hoc networks and its clustering scheme
PDF
Adaptive Routing in Wireless Sensor Networks: QoS Optimisation for Enhanced A...
 
PDF
Clustering Algorithms for Data Stream
PDF
Paper id 21201414
PDF
ENERGY-EFFICIENT MULTI-HOP ROUTING WITH UNEQUAL CLUSTERING APPROACH FOR WIREL...
PDF
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
PDF
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
DOCX
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
PDF
A survey on weighted clustering techniques in manets
PDF
Highly Scalable Energy Efficient Distributed Clustering Mechanism in Wireless...
PDF
Data Dissemination in Wireless Sensor Networks: A State-of-the Art Survey
50120130406035
An Overview of Information Extraction from Mobile Wireless Sensor Networks
 
A fuzzy clustering algorithm for high dimensional streaming data
Algorithmic Construction of Optimal and Load Balanced Clusters in Wireless Se...
 
Az36311316
Clustering and data aggregation scheme in underwater wireless acoustic sensor...
Interpolation Techniques for Building a Continuous Map from Discrete Wireless...
 
Vol 8 No 1 - December 2013
Information extraction from sensor networks using the Watershed transform alg...
 
Mobile ad hoc networks and its clustering scheme
Adaptive Routing in Wireless Sensor Networks: QoS Optimisation for Enhanced A...
 
Clustering Algorithms for Data Stream
Paper id 21201414
ENERGY-EFFICIENT MULTI-HOP ROUTING WITH UNEQUAL CLUSTERING APPROACH FOR WIREL...
O N T HE D ISTRIBUTION OF T HE M AXIMAL C LIQUE S IZE F OR T HE V ERTICES IN ...
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
ON DISTRIBUTED FUZZY DECISION TREES FOR BIG DATA
A survey on weighted clustering techniques in manets
Highly Scalable Energy Efficient Distributed Clustering Mechanism in Wireless...
Data Dissemination in Wireless Sensor Networks: A State-of-the Art Survey
Ad

Similar to Survey Paper on Clustering Data Streams Based on Shared Density between Micro-Clusters (20)

PPTX
Graph and Density Based Clustering
PDF
Study of Density Based Clustering Techniques on Data Streams
PDF
E502024047
PDF
E502024047
PDF
IRJET- Enhanced Density Based Method for Clustering Data Stream
PDF
A Density Based Clustering Technique For Large Spatial Data Using Polygon App...
PDF
Feature Subset Selection for High Dimensional Data Using Clustering Techniques
PDF
Paper id 26201478
PPTX
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
PDF
Target Response Electrical usage Profile Clustering using Big Data
PDF
Ir3116271633
PDF
T24144148
PDF
Clustering Algorithm by Vishal.pdf
PDF
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
PDF
clustering density technidques in machine learning
PDF
50120140501016
PPT
dm_clustering2.ppt
PDF
DBSCAN
PDF
Analysis of mass based and density based clustering techniques on numerical d...
Graph and Density Based Clustering
Study of Density Based Clustering Techniques on Data Streams
E502024047
E502024047
IRJET- Enhanced Density Based Method for Clustering Data Stream
A Density Based Clustering Technique For Large Spatial Data Using Polygon App...
Feature Subset Selection for High Dimensional Data Using Clustering Techniques
Paper id 26201478
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
Target Response Electrical usage Profile Clustering using Big Data
Ir3116271633
T24144148
Clustering Algorithm by Vishal.pdf
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
clustering density technidques in machine learning
50120140501016
dm_clustering2.ppt
DBSCAN
Analysis of mass based and density based clustering techniques on numerical d...
Ad

More from IRJET Journal (20)

PDF
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
PDF
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
PDF
Kiona – A Smart Society Automation Project
PDF
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
PDF
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
PDF
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
PDF
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
PDF
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
PDF
BRAIN TUMOUR DETECTION AND CLASSIFICATION
PDF
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
PDF
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
PDF
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
PDF
Breast Cancer Detection using Computer Vision
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
PDF
Auto-Charging E-Vehicle with its battery Management.
PDF
Analysis of high energy charge particle in the Heliosphere
PDF
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Enhanced heart disease prediction using SKNDGR ensemble Machine Learning Model
Utilizing Biomedical Waste for Sustainable Brick Manufacturing: A Novel Appro...
Kiona – A Smart Society Automation Project
DESIGN AND DEVELOPMENT OF BATTERY THERMAL MANAGEMENT SYSTEM USING PHASE CHANG...
Invest in Innovation: Empowering Ideas through Blockchain Based Crowdfunding
SPACE WATCH YOUR REAL-TIME SPACE INFORMATION HUB
A Review on Influence of Fluid Viscous Damper on The Behaviour of Multi-store...
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...
Explainable AI(XAI) using LIME and Disease Detection in Mango Leaf by Transfe...
BRAIN TUMOUR DETECTION AND CLASSIFICATION
The Project Manager as an ambassador of the contract. The case of NEC4 ECC co...
"Enhanced Heat Transfer Performance in Shell and Tube Heat Exchangers: A CFD ...
Advancements in CFD Analysis of Shell and Tube Heat Exchangers with Nanofluid...
Breast Cancer Detection using Computer Vision
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
A Novel System for Recommending Agricultural Crops Using Machine Learning App...
Auto-Charging E-Vehicle with its battery Management.
Analysis of high energy charge particle in the Heliosphere
Wireless Arduino Control via Mobile: Eliminating the Need for a Dedicated Wir...

Recently uploaded (20)

PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PPTX
Welding lecture in detail for understanding
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PPTX

Survey Paper on Clustering Data Streams Based on Shared Density between Micro-Clusters

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 01 | Jan-2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 440
Dure Supriya Suresh; Prof. Wadne Vinod
ME (Student), ICOER, Wagholi, Pune, Maharashtra, India
Asst. Professor, ICOER, Wagholi, Pune, Maharashtra, India

Abstract—As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo points with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process, and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on actual density between adjacent micro-clusters.
Index Terms—Data mining, data stream clustering, density-based clustering

1] INTRODUCTION
Clustering data streams has become an important technique for data and knowledge engineering. A data stream is an ordered and potentially unbounded sequence of data points. Such streams of constantly arriving data are generated by many types of applications and include GPS data from smartphones, web click-stream data, computer network monitoring data, telecommunication connection data, readings from sensor nets, stock quotes, etc.

Data stream clustering is typically done as a two-stage process with an online part which summarizes the data into many micro-clusters or grid cells and then, in an offline process, these micro-clusters (cells) are reclustered/merged into a smaller number of final clusters. Since the reclustering is an offline process and thus not time critical, it is typically not discussed in detail in papers about new data stream clustering algorithms. Most papers suggest using a (possibly modified) conventional clustering algorithm where the micro-clusters are used as pseudo points. Another approach, used in DenStream, is reachability, where all micro-clusters which are less than a given distance from each other are linked together to form clusters. Grid-based algorithms typically merge adjacent dense grid cells to form larger clusters.

Current reclustering approaches completely ignore the data density in the area between the micro-clusters (grid cells) and thus might join micro-clusters (cells) which are close together but at the same time separated by a small area of low density. To address this problem, Tu and Chen introduced an extension to the grid-based D-Stream algorithm based on the concept of attraction between adjacent grid cells and showed its effectiveness.

In this paper, we develop and evaluate a new method to address this problem for micro-cluster-based algorithms. We introduce the concept of a shared density graph which explicitly captures the density of the original data between micro-clusters during clustering and then show how the graph can be used for reclustering micro-clusters. This is a novel approach since, instead of relying on assumptions about the distribution of data points assigned to a micro-cluster (MC) (often a Gaussian distribution around a center), it estimates the density in the shared region between micro-clusters directly from the data. To the best of our knowledge, this paper is the first to propose and investigate using a shared-density-based reclustering approach for data stream clustering.

2] RELATED WORK
Density-based clustering is a well-researched area and we can only give a very brief overview here. DBSCAN [10] and several of its improvements can be seen as the prototypical density-based clustering approach. DBSCAN estimates the density around each data point by counting the number of points in a user-specified eps-neighborhood and applies user-specified thresholds to identify core, border and noise points. In a second step, core points are joined into a cluster if they are density-reachable (i.e., there is a chain of core points where one falls inside the eps-neighborhood of the next). Finally, border points are assigned to clusters. Other approaches are based on kernel density estimation (e.g., DENCLUE [11]) or use shared nearest neighbors. However, these algorithms were not developed with data streams in mind.

A data stream is an ordered and potentially unbounded sequence of data points $X = \langle x_1, x_2, x_3, \ldots \rangle$.
It is not possible to permanently store all the data in the stream, which implies that repeated random access to the data is infeasible. Also, data streams exhibit concept drift over time where the position and/or shape of clusters changes, and new clusters may appear or existing clusters disappear. This makes the application of existing clustering algorithms difficult. Data stream clustering algorithms limit data access to a single pass over the data and adapt to concept drift. Over the last 10 years many algorithms for clustering data streams have been proposed. Most data stream clustering algorithms use a two-stage online/offline approach.

Fig. 1. Problem with reclustering when dense areas are separated by small areas of low density with (a) micro-clusters and (b) grid cells.

Reclustering methods based solely on micro-clusters only take closeness of the micro-clusters into account. This makes it likely that two micro-clusters which are close to each other, but separated by an area of low density, will still be merged into a cluster. Information about the density between micro-clusters is not available since the information does not get recorded in the online step and the original data points are no longer available. Fig. 1a illustrates the problem where the micro-clusters MC1 and MC2 will be merged as long as their distance d is low. This is even true when density-based clustering methods (e.g., DBSCAN) are used in the offline reclustering step, since the reclustering is still exclusively based on the micro-cluster centers and weights.

Fig. 2. MC1 is a single MC. MC2 and MC3 are close to each other but the density between them is low relative to the two MCs' densities, while MC3 and MC4 are connected by a high-density area.
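The problem illustrated in Fig. 1a can be reproduced with a small sketch of distance-only (reachability-style) reclustering. This is a simplified illustration, not DenStream's actual implementation: the function name `recluster_by_distance`, the threshold `eps`, and the example coordinates are assumptions made for this example. Because only the centers are consulted, two micro-clusters closer than `eps` are merged no matter how sparse the region between them is.

```python
import math


def recluster_by_distance(centers, eps):
    """Link all micro-cluster centers closer than eps (hypothetical helper).

    Returns one cluster label per micro-cluster. Only center positions are
    used, so the density between micro-clusters cannot influence the result.
    """
    parent = list(range(len(centers)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < eps:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(centers))]


# MC1 and MC2 are close, MC3 is far away (illustrative coordinates).
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
labels = recluster_by_distance(centers, eps=1.5)
print(labels[0] == labels[1], labels[0] == labels[2])
```

In this example the first two micro-clusters end up in the same final cluster purely because their distance is below `eps`; nothing in the input could express that a low-density gap separates them.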
3] PROPOSED SYSTEM
In this paper, we develop and evaluate a new method to address this problem for micro-cluster-based algorithms. We introduce the concept of a shared density graph which explicitly captures the density of the original data between micro-clusters during clustering and then show how the graph can be used for reclustering micro-clusters. This is a novel approach since, instead of relying on assumptions about the distribution of data points assigned to a micro-cluster (often a Gaussian distribution around a center), it estimates the density in the shared region between micro-clusters directly from the data.

4] THE DBSTREAM ONLINE COMPONENT
Typical micro-cluster-based data stream clustering algorithms retain the density within each micro-cluster as some form of weight (e.g., the number of points assigned to the MC). Some algorithms also capture the dispersion of the points by recording variance. For reclustering, however, only the distances between the MCs and their weights are used. In this setting, MCs which are closer to each other are more likely to end up in the same cluster. This is even true if a density-based algorithm like DBSCAN [10] is used for reclustering since here only the position of the MC centers and their weights are used. The density in the area between MCs is not available since it is not retained during the online stage. The basic idea of this work is that if we can capture not only the distance between two adjacent MCs but also the connectivity using the density of the original data in the area between the MCs, then the reclustering results may be improved. In the following we develop DBSTREAM, which stands for density-based stream clustering.
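As a small illustration of the per-micro-cluster summary mentioned at the start of this section (a weight plus, optionally, a dispersion estimate), the sketch below maintains a point count, mean, and variance incrementally for one-dimensional data. The class name `MicroClusterSummary` and the use of Welford's running-variance update are assumptions of this example; the text does not say how any particular algorithm records variance.

```python
class MicroClusterSummary:
    """Hypothetical one-dimensional micro-cluster summary (count, mean, variance)."""

    def __init__(self):
        self.weight = 0      # number of points assigned so far
        self.mean = 0.0      # running center
        self._m2 = 0.0       # running sum of squared deviations from the mean

    def add(self, x):
        # Welford's update: a single pass, no stored points.
        self.weight += 1
        delta = x - self.mean
        self.mean += delta / self.weight
        self._m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance of the points seen so far.
        return self._m2 / self.weight if self.weight else 0.0


summary = MicroClusterSummary()
for value in [1.0, 2.0, 3.0, 4.0]:
    summary.add(value)
print(summary.weight, summary.mean, summary.variance())  # 4 2.5 1.25
```

Note that nothing in such a summary describes the region between two micro-clusters, which is the gap the shared density graph is meant to fill.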

4.1 Leader-Based Clustering
Leader-based clustering was introduced by Hartigan [21] as a conventional clustering algorithm. It is straightforward to apply the idea to data streams (see, e.g., [20]). DBSTREAM represents each MC by a leader (a data point defining the MC's center) and the density in an area of a user-specified radius r (threshold) around the center. This is similar to DBSCAN's concept of counting the points in an eps-neighborhood; however, here the density is not estimated for each point, but only for each MC, which can easily be achieved for streaming data. A new data point is assigned to an existing MC (leader) if it is within a fixed radius of its center. The assigned point increases the density estimate of the chosen cluster and the MC's center is updated to move towards the new data point. If the data point falls in the assignment area of several MCs then all of them are updated. If a data point cannot be assigned to any existing MC, a new MC (leader) is created for the point. Finding the potential clusters for a new data point is a fixed-radius nearest-neighbor problem [22] which can be efficiently dealt with for data of moderate dimensionality using spatial indexing data structures like a k-d tree [23]. Variations of this simple algorithm were suggested in [24] for outlier detection and in [25] for sequence modeling.

4.2 Competitive Learning
New leaders are chosen as points which cannot be assigned to an existing MC. The positions of these newly formed MCs are most likely not ideal for the clustering. To remedy this problem, we use a competitive learning strategy introduced in [26] to move the MC centers towards each newly
assigned point. To control the magnitude of the movement, we use a neighborhood function $h()$ similar to self-organizing maps.

4.3 Capturing Shared Density
Capturing shared density directly in the online component is a new concept introduced in this paper. The fact that in dense areas MCs will have an overlapping assignment area can be used to measure density between MCs by counting the points which are assigned to two or more MCs. The idea is that high density in the intersection area relative to the rest of the MCs' area means that the two MCs share an area of high density and should be part of the same macro-cluster. In the example in Fig. 2 we see that MC2 and MC3 are close to each other and overlap. However, the shared weight $s_{2,3}$ is small compared to the weight of each of the two involved MCs, indicating that the two MCs do not form a single area of high density. On the other hand, MC3 and MC4 are more distant, but their shared weight $s_{3,4}$ is large, indicating that both MCs form an area of high density and thus should form a single macro-cluster.

4.4 Fading and Forgetting Data
To adapt to evolving data streams we use the exponential fading strategy introduced in DenStream [6] and used in many other algorithms. Cluster weights are faded in every time step by a factor of $2^{-\lambda}$, where $\lambda > 0$ is a user-specified fading factor. We implement fading in a similar way as in D-Stream [9], where fading is only applied when a value changes (e.g., the weight of an MC is updated). For example, if the current time step is $t = 10$ and the weight $w$ was last updated at $t_w = 5$, then we apply for fading the factor $2^{-\lambda (t - t_w)}$, resulting in the correct fading for five time steps. In order for this approach to work we have to keep a timestamp with the time when fading was applied last for each value that is subject to fading.

4.5 The Complete Online Algorithm
Algorithm 1 shows our approach and the used clustering data structures and user-specified parameters in detail. Micro-clusters are stored as a set MC. Each micro-cluster is represented by the tuple $(c, w, t)$ representing the cluster center, the cluster weight and the last time it was updated, respectively. The weighted adjacency list S represents the sparse shared density graph which captures the weight of the data points shared by MCs. Since shared density estimates are also subject to fading, we also store a timestamp with each entry. Fading the shared density estimates as well is important since MCs are allowed to move, which over time would lead to estimates of intersection areas the MC is not covering anymore.

5] COMPUTATIONAL COMPLEXITY
Space complexity of the clustering depends on the number of MCs that need to be stored in MC. In the worst case, the maximum number of strong MCs at any time is $t_{gap}$ MCs and is reached when every MC receives exactly a weight of one during each interval of $t_{gap}$ time steps. Given the cleanup strategy in Algorithm 2, where we remove weak MCs every $t_{gap}$ time steps, the algorithm never stores more than $k' = 2 t_{gap}$ MCs. The space complexity of MC is linear in the maximal number of MCs $k'$. The worst-case size of the adjacency list of the shared density graph S depends on $k'$ and the dimensionality of the data. In the 2D case each MC can have a maximum of $|N| = 6$ neighbors (at optimal packing). Therefore, each of the $k'$ MCs has at most six entries in the adjacency list S, resulting in a space complexity of storing MC and S of $O(t_{gap})$.

For higher-dimensional data streams, the maximal number of possible adjacent hyperspheres is given by Newton's number, also referred to as the kissing number [29]. Newton's number defines the maximal number of hyperspheres which can touch a hypersphere of the same size without intersecting any other hypersphere. If we double the radius of all hyperspheres in this configuration then we get our scenario with sphere centers touching the surface of the center sphere. We use $K_d$ to denote Newton's number in d dimensions. Newton's exact number is known only for some small dimensionality values d, and for many other dimensions only lower and upper bounds are known. Note that Newton's number grows fast, reaches 196,560 for $d = 24$, and is unknown for most larger d. This growth would make storing the shared weights for high-dimensional data in a densely packed area very expensive. However, we also know that the maximal neighborhood size $|N_{max}| \le \min(k' - 1, K_d)$, since we cannot have more neighbors than we have MCs. Therefore, the space complexity of maintaining S is bounded by $O(k' |N_{max}|)$.

To analyze the algorithm's time complexity, we need to consider all parts of the clustering function. The fixed-radius nearest-neighbor search can be done using linear search in $O(d n k')$, where d is the data dimensionality, n is the number of data points clustered and $k'$ is the number of MCs. The time complexity can be improved to $O(d n \log(k'))$ using a special indexing data structure like a k-d tree [23]. Adding or updating a single MC is done in time linear in n.
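Pulling together the online steps from Sections 4.1 to 4.5 (leader-based assignment, moving centers towards new points, counting shared points, and fading applied only when a value changes), a minimal sketch might look as follows. It is not Algorithm 1: the class name `OnlineSketch`, the Gaussian neighborhood function with width `r / 3`, and the linear neighbor search are assumptions of this example, and the cleanup of weak MCs (Algorithm 2) is left out.

```python
import math
from itertools import combinations


class OnlineSketch:
    """Simplified online component: micro-clusters plus a shared density graph."""

    def __init__(self, r, lam):
        self.r = r        # assignment radius around each MC center
        self.lam = lam    # fading factor lambda > 0
        self.t = 0        # current time step
        self.mcs = []     # each MC: {"c": center, "w": weight, "t": last update}
        self.shared = {}  # (i, j) with i < j -> (shared weight, last update)

    def _fade(self, value, last_t):
        # Lazy fading: apply 2^(-lambda * (t - last_t)) in one step.
        return value * 2.0 ** (-self.lam * (self.t - last_t))

    def update(self, x):
        self.t += 1
        # Fixed-radius neighbor search (linear scan; a k-d tree could replace it).
        hits = [i for i, mc in enumerate(self.mcs)
                if math.dist(x, mc["c"]) <= self.r]
        if not hits:
            # No MC is close enough: the point becomes a new leader.
            self.mcs.append({"c": list(x), "w": 1.0, "t": self.t})
            return
        sigma = self.r / 3.0  # assumed width of the neighborhood function
        for i in hits:
            mc = self.mcs[i]
            mc["w"] = self._fade(mc["w"], mc["t"]) + 1.0
            d = math.dist(x, mc["c"])
            h = math.exp(-d * d / (2.0 * sigma * sigma))
            # Competitive learning: move the center towards the new point.
            mc["c"] = [c + h * (xi - c) for c, xi in zip(mc["c"], x)]
            mc["t"] = self.t
        # Every pair of MCs that received this point shares it.
        for i, j in combinations(hits, 2):
            s, ts = self.shared.get((i, j), (0.0, self.t))
            self.shared[(i, j)] = (self._fade(s, ts) + 1.0, self.t)


sketch = OnlineSketch(r=1.0, lam=1.0)
for point in [(0.0, 0.0), (1.5, 0.0), (0.75, 0.0)]:
    sketch.update(point)
print(len(sketch.mcs), sketch.shared)
```

In this run the third point falls inside the assignment area of both existing MCs, so both weights are faded and incremented and the pair receives a shared weight of one. A reclustering step could then compare each shared weight with the weights of the two MCs involved, as described for Fig. 2.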

6] EXPERIMENTS
To perform our experiments and make them reproducible, we have implemented/interfaced all algorithms in a publicly available R extension called stream [30]. Stream provides an intuitive interface for experimenting with data streams and data stream algorithms. It includes generators for all the synthetic data used in this paper as well as a growing number of data stream mining algorithms, including clustering algorithms available in the MOA (Massive Online Analysis) framework [31] and the algorithm discussed in this paper.

In this paper we use four synthetic data streams called Cassini, Noisy Mixture of Gaussians, and DS3 and DS4 used to evaluate CHAMELEON [13]. These data sets do not exhibit concept drift. For data with concept drift we use MOA's Random RBF Generator with Events. In addition we use several real data sets called Sensor, Forest Cover Type and the KDD CUP'99 data, which are often used for comparing data stream clustering algorithms.

Kremer et al. [32] discuss internal and external evaluation measures for the quality of data stream clustering. We conducted experiments with a large set of evaluation measures (purity, precision, recall, F-measure, sum of squared distances, silhouette coefficient, mutual information, adjusted Rand index). In this study we mainly report the adjusted Rand index to evaluate the average agreement of the known cluster structure (ground truth) of the data stream with the found structure. The adjusted Rand index (adjusted for expected random agreements) is widely accepted as the appropriate measure to compare the quality of different partitions given the ground truth [33]. Zero indicates that the found agreements can be entirely explained by chance, and the closer the index is to one, the better the agreement. For clustering with concept drift, we also report average purity and average within-cluster sum of squares (WSS). However, like most other measures, these make comparison difficult.
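To make the difference between these measures concrete, the sketch below computes purity and the adjusted Rand index from two label vectors using only the standard library. The function names and the toy labels are assumptions of this example, not part of the stream package; the sketch also assumes at least two data points. The example splits every point into its own cluster, which maximizes purity while the adjusted Rand index reports chance-level agreement.

```python
from collections import Counter
from math import comb


def purity(truth, pred):
    """Share of points that belong to the majority true class of their cluster."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)


def adjusted_rand_index(truth, pred):
    """Rand index corrected for the agreement expected by chance."""
    n = len(truth)
    pairs_both = sum(comb(v, 2) for v in Counter(zip(truth, pred)).values())
    pairs_truth = sum(comb(v, 2) for v in Counter(truth).values())
    pairs_pred = sum(comb(v, 2) for v in Counter(pred).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    max_index = (pairs_truth + pairs_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (pairs_both - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
singletons = [0, 1, 2, 3, 4, 5]  # every point in its own cluster
print(purity(truth, singletons), adjusted_rand_index(truth, singletons))  # 1.0 0.0
print(purity(truth, truth), adjusted_rand_index(truth, truth))            # 1.0 1.0
```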
For example, average purity (equivalent to precision and part of the F-measure) depends on the number of clusters and thus makes comparison of clusterings with a different number of clusters invalid. The within-cluster sum of squares favors algorithms which produce spherical clusters (e.g., k-means-type algorithms). A smaller WSS represents tighter clusters and thus a better clustering. However, WSS will always get smaller with an increasing number of clusters. We report these measures here for comparison since they are used in many data stream clustering papers.

7] CONCLUSION
In this paper, we have developed the first data stream clustering algorithm which explicitly records the density in the area shared by micro-clusters and uses this information for reclustering. We have introduced the shared density graph together with the algorithms needed to maintain the graph in the online component of a data stream mining algorithm. Although we showed that the worst-case memory requirements of the shared density graph grow extremely fast with data dimensionality, complexity analysis and experiments reveal that the procedure can be effectively applied to data sets of moderate dimensionality. Experiments also show that shared-density reclustering already performs extremely well when the online data stream clustering component is set to produce a small number of large MCs. Other popular reclustering strategies can only slightly improve over the results of shared-density reclustering and need significantly more MCs to achieve comparable results. This is an important advantage since it implies that we can tune the online component to produce fewer micro-clusters for shared-density reclustering. This improves performance and, in many cases, the saved memory more than offsets the memory requirement for the shared density graph.

9] REFERENCES
[1] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams," in Proc. ACM Symp. Found. Comput. Sci., 12–14 Nov. 2000, pp. 359–366.
[2] C. Aggarwal, Data Streams: Models and Algorithms (series Advances in Database Systems). New York, NY, USA: Springer-Verlag, 2007.
[3] J. Gama, Knowledge Discovery from Data Streams, 1st ed. London, U.K.: Chapman & Hall, 2010.
[4] J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. d. Carvalho, and J. A. Gama, "Data stream clustering: A survey," ACM Comput. Surveys, vol. 46, no. 1, pp. 13:1–13:31, Jul. 2013.
[5] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for clustering evolving data streams," in Proc. Int. Conf. Very Large Data Bases, 2003, pp. 81–92.
[6] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-based clustering over an evolving data stream with noise," in Proc. SIAM Int. Conf. Data Mining, 2006, pp. 328–339.
[7] Y. Chen and L. Tu, "Density-based clustering for real-time stream data," in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 133–142.
[8] L. Wan, W. K. Ng, X. H. Dang, P. S. Yu, and K. Zhang, "Density-based clustering of data streams at multiple resolutions," ACM Trans. Knowl. Discovery from Data, vol. 3, no. 3, pp. 1–28, 2009.

Dure Supriya Suresh, ME (Student), ICOER, Wagholi, Pune, Maharashtra, India