Clustering: Grouping all Data for Insights

CLUSTERING TECHNIQUES: OVERVIEW AND APPLICATIONS

INDEX
Introduction
Clustering
Clustering Techniques
Pros and Cons of Clustering Techniques
Applications of Clustering Techniques
Conclusion
Future Research

INTRODUCTION
• Clustering was used in biology as early as the 1960s to classify species.
• In this data-driven era, effective data organization and analysis methods play a
major role in gaining insights from data.
• From marketing to social network analysis, clustering has evolved into an
essential tool for sorting and categorizing data, supporting pattern detection,
data analysis, and interpretation.
• Clustering is an unsupervised data analysis technique that groups a set of objects
such that objects in the same group (cluster) are more similar to each other than to
those in other groups.

Example: Clustering Grocery Items
[Figure: grocery items (eggs, bananas, milk, bread)]

TECHNIQUES
Partitional Clustering – K-Means
Hierarchical Clustering – BIRCH
Density-Based Clustering – DBSCAN
Grid-Based Clustering – STING
Model-Based Clustering – Gaussian Mixture Model

Partitional Clustering – K-Means
• Partitional clustering divides a dataset into non-overlapping partitions or clusters, where
each data point belongs to exactly one cluster.
• K-means clustering groups an unlabelled dataset into a predefined number of clusters (k), where
similar data points are grouped together to discover underlying patterns.
Phases:
• Initialization: choose k initial centroids
• Categorize and update: assign each point to its nearest centroid, then recompute the centroids
• Repeat until the centroids stop changing
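
The phases above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the random choice of starting centroids, the `max_iters` cap, and the sample points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch for points given as tuples of floats."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Categorize: assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        # Repeat until the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Two visually separate groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Which group receives label 0 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.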

Hierarchical Clustering – BIRCH (Unsupervised)
• Hierarchical clustering organizes elements in a hierarchical, tree-like structure.
• BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clusters a large
dataset with a single scan and improves the quality of the clustering with a few
additional scans.
BIRCH consists of two main stages, optionally followed by refinement:
• Building the CF (Clustering Feature) tree
• Global clustering
• Cluster refinement for accuracy
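
The CF tree is built from clustering features, each a triple (N, LS, SS) holding the number of points, their linear sum, and their squared sum. The sketch below shows only that building block and why it suits a single scan: two sub-clusters can be merged by adding their triples, without revisiting the points. The class name `ClusteringFeature` and its method names are assumptions for illustration; the tree itself (branching factor, threshold, node splits) is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    """CF triple (N, LS, SS) summarising a sub-cluster without storing its points."""
    n: int
    linear_sum: tuple
    square_sum: float

    @classmethod
    def from_point(cls, point):
        # A single point is a sub-cluster of size one.
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        # CF additivity: merging two sub-clusters just adds their triples.
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.linear_sum, other.linear_sum)),
            self.square_sum + other.square_sum,
        )

    def centroid(self):
        return tuple(s / self.n for s in self.linear_sum)

    def radius(self):
        # Root-mean-square distance of the member points from the centroid.
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.square_sum / self.n - c2, 0.0))


# Summarise three made-up points one at a time, as a single scan would.
cf = ClusteringFeature.from_point((1.0, 2.0))
for p in [(3.0, 4.0), (5.0, 6.0)]:
    cf = cf.merge(ClusteringFeature.from_point(p))
print(cf.n, cf.centroid(), cf.radius())
```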

Density-Based Clustering – DBSCAN
• Density-based clustering methods form clusters from dense regions of data points
in the feature space.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters
around core points, which are points that have at least a minimum number of data points
within a specified radius.
Steps in the DBSCAN algorithm:
• Classify the points as core, border, or noise, and discard the noise.
• Assign a cluster to each core point.
• Label all density-connected points and border points with the cluster of the nearest core point.
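
A minimal sketch of these steps follows, assuming Euclidean distance and a brute-force neighbour search. The names `dbscan`, `eps` (the radius), and `min_pts` (the minimum number of points) are illustrative choices. One simplification to note: a border point is given to the first cluster that reaches it, rather than to the cluster of its nearest core point.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch. Returns one label per point; -1 marks noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is also a core point: keep expanding
    return labels


# Two dense groups and one isolated point (made-up data).
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.0),
          (5.0, 5.0), (5.1, 5.2), (4.9, 5.0),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```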

Grid-Based Clustering – STING
• Grid-based clustering partitions the data space into a grid structure,
organizing data points into cells for efficient clustering based on spatial
proximity.
• STING (Statistical Information Grid) partitions the data into a hierarchical
grid, stores summary statistics for each cell, and investigates the clusters at
different levels of detail.
Phases of STING:
• Grid construction and cell assignment
• Density calculation and cluster identification
• Border point assignment and noise identification
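
The sketch below illustrates the grid idea on a single level only. It is not a full STING implementation: there is no hierarchy of cells and no per-cell statistics beyond a point count, and border point assignment is omitted. The function name `grid_cluster` and the parameters `cell_size` and `min_count` are assumptions for the example.

```python
from collections import Counter, deque


def grid_cluster(points, cell_size, min_count):
    """Single-level grid clustering sketch for 2-D points; -1 marks noise."""
    # Grid construction and cell assignment.
    cell_of = [(int(x // cell_size), int(y // cell_size)) for x, y in points]
    # Density calculation: number of points per cell.
    counts = Counter(cell_of)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # Cluster identification: flood-fill over adjacent dense cells.
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = next_label
                        queue.append(nb)
        next_label += 1

    # Noise identification: points in sparse cells get label -1.
    return [cell_label.get(cell, -1) for cell in cell_of]


# Made-up data: two neighbouring dense cells, one separate dense cell, one stray point.
points = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.5), (1.4, 0.1),
          (5.5, 5.5), (5.6, 5.2), (9.0, 0.5)]
labels = grid_cluster(points, cell_size=1.0, min_count=2)
print(labels)  # [0, 0, 0, 0, 1, 1, -1]
```

Because the work is done per cell rather than per pair of points, the cost of the clustering step depends on the number of occupied cells, which is what makes grid-based methods efficient on large datasets.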

Model-Based Clustering – Gaussian Mixture
• Model-based clustering assigns data points to clusters based on
probabilistic models representing the data distribution.
• "Gaussian Mixture is a statistical model that identifies subgroups within
a population using a combination of Gaussian distributions."
• It iteratively optimizes its parameters with the expectation-maximization
(EM) algorithm, which estimates cluster means, covariances, and mixture
weights.
Steps of the Gaussian mixture algorithm:
• Initialization
• Expectation step (E-step) and maximization step (M-step)
• Convergence check
• Iteration until convergence
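
To make the steps concrete, here is a minimal sketch of expectation-maximization for a one-dimensional Gaussian mixture, so each covariance reduces to a single variance. The function names, the caller-supplied initial means, the unit starting variances, the variance floor, and the sample data are assumptions chosen for the example.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, means, max_iters=200, tol=1e-8):
    """EM for a 1-D Gaussian mixture; `means` are the initial component means."""
    k = len(means)
    means = list(means)
    # Initialization: unit variances and equal mixture weights.
    variances = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(max_iters):
        # E-step: responsibility of each component for each point.
        resp = []
        log_likelihood = 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            log_likelihood += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsing component
        # Convergence check on the log-likelihood; otherwise iterate again.
        if abs(log_likelihood - prev_ll) < tol:
            break
        prev_ll = log_likelihood
    return weights, means, variances


# Made-up data drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_gmm_1d(data, means=[0.0, 6.0])
print(weights, means, variances)
```

Unlike K-means, each point receives a soft responsibility for every component instead of a single hard label, which is what lets the model express uncertainty about points lying between clusters.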

PROS AND CONS OF CLUSTERING TECHNIQUES
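Cons:
• Parameter subjectivity (for example, choosing the number of clusters)
• Difficulty with high-dimensional data
• Evaluation difficulty without ground-truth labels
• Cluster shape assumptions
• Sensitivity to noise in some methods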
Pros:
• Pattern finding
• Exploration
• Feature Discovery
• Data compression
• Scalability

Applications of Clustering Techniques
• Customer Segmentation: Grouping customers into distinct segments based on attitudes
and behavior for targeted marketing strategies.
• Anomaly Detection: Identifying unusual patterns or outliers in datasets that deviate
significantly from normal behavior.
• Image Segmentation: Partitioning an image into regions with similar attributes, for object
recognition and image analysis tasks.
• Recommendation Systems: Grouping users or items into clusters based on preferences
or similarities to provide personalized recommendations in e-commerce or content
platforms.
• Document Clustering: Automatically grouping similar documents for efficient
information retrieval, text summarization, and content-based recommendation systems.

CONCLUSION
Clustering techniques offer a flexible approach to unsupervised learning, applicable
across diverse datasets and domains. By grouping similar data points, clustering
facilitates exploration and recognition of underlying patterns, leading to valuable
insights.
Clustering algorithms automate data grouping tasks, saving time and enabling
efficient analysis of large datasets. Clustering finds use in marketing, healthcare,
finance, and more, for tasks like customer segmentation and anomaly detection.

FUTURE RESEARCH
• Adaptability to diverse data types, including text, image, and graph data.
• Improving visualization of clustering results.
• Integration with machine learning for predictive modeling.
• Addressing privacy concerns with privacy-preserving techniques.
• Tailoring clustering methods for domain-specific applications.

