Clustering Algorithms: An Introduction
Classification
  • A method of supervised learning.
  • Learns a method for predicting the class of an instance from pre-labeled (classified) instances.
Clustering
  • A method of unsupervised learning.
  • Finds a “natural” grouping of instances given unlabeled data.
Clustering Methods
Many different methods and algorithms:
  • For numeric and/or symbolic data
  • Deterministic vs. probabilistic
  • Exclusive vs. overlapping
  • Hierarchical vs. flat
  • Top-down vs. bottom-up
Clusters: exclusive vs. overlapping
[Figure: instances labeled a through k grouped into clusters, contrasting exclusive and overlapping cluster membership]
Example of Outlier
[Figure: scatter plot of points marked x, with one point labeled "Outlier"]
Methods of Clustering
  • Hierarchical (agglomerative): initially, each point is in a cluster by itself. Repeatedly combine the two “nearest” clusters into one.
  • Point assignment: maintain a set of clusters. Place each point into its “nearest” cluster.
Hierarchical Clustering
  • Bottom-up:
      - Start with single-instance clusters.
      - At each step, join the two closest clusters.
      - Design decision: how to measure the distance between clusters, e.g. the two closest instances in the clusters vs. the distance between the cluster means.
  • Top-down:
      - Start with one universal cluster.
      - Split it into two clusters.
      - Proceed recursively on each subset.
      - Can be very fast.
  • Both methods produce a dendrogram.
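The bottom-up procedure can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names (`single_link`, `centroid_link`, `agglomerate`) are made up for this example, points are assumed to be tuples of numbers, and Euclidean distance is assumed. The two distance functions correspond to the design decision mentioned above (closest instances vs. distance between means), and the returned merge history is the information a dendrogram would draw.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Distance between the two closest instances of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def centroid_link(a, b):
    """Distance between the means of clusters a and b."""
    def mean(cluster):
        return [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return dist(mean(a), mean(b))


def agglomerate(points, cluster_distance=single_link):
    """Merge the two closest clusters until one remains.

    Returns the list of merges in the order they happened.
    """
    clusters = [[p] for p in points]  # start with single-instance clusters
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
history = agglomerate(points)
```

This naive version recomputes every pairwise cluster distance at each step, so it is only suitable for small inputs.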
Incremental Clustering
  • Heuristic approach (COBWEB/CLASSIT).
  • Forms a hierarchy of clusters incrementally.
  • Start: the tree consists of an empty root node.
  • Then: add instances one by one, updating the tree appropriately at each stage.
  • To update, find the right leaf for the instance; this may involve restructuring the tree.
  • Update decisions are based on category utility.
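The slides do not define category utility. One commonly cited formulation for nominal attributes rewards a partition when knowing an instance's cluster makes its attribute values more predictable than they are in the data as a whole. The sketch below assumes that formulation; the function name `category_utility` and the representation of instances as tuples of attribute values are choices made for this example, and it does not implement the COBWEB tree operations themselves.

```python
from collections import Counter


def category_utility(clusters):
    """Category utility of a partition of nominal instances.

    clusters: list of clusters, each a non-empty list of equal-length
    tuples of attribute values.
    """
    everything = [x for c in clusters for x in c]
    n, n_attrs = len(everything), len(everything[0])

    def sum_sq_probs(instances):
        # Sum over attributes i and values v of P(a_i = v)^2 within `instances`.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(x[i] for x in instances)
            total += sum((c / len(instances)) ** 2 for c in counts.values())
        return total

    baseline = sum_sq_probs(everything)
    # Weight each cluster's gain in predictability by the cluster's probability.
    gain = sum(len(c) / n * (sum_sq_probs(c) - baseline) for c in clusters)
    return gain / len(clusters)  # divide by the number of clusters


pure = [[("a",), ("a",)], [("b",), ("b",)]]
mixed = [[("a",), ("b",)], [("a",), ("b",)]]
```

Under this definition the `pure` partition scores higher than the `mixed` one, which gains nothing over the unpartitioned data.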
And in the Non-Euclidean Case?
  • The only “locations” we can talk about are the points themselves; i.e., there is no “average” of two points.
  • Approach 1: the clustroid is the point “closest” to the other points.
  • Treat the clustroid as if it were the centroid when computing intercluster distances.
“Closest” Point?
Possible meanings:
  • Smallest maximum distance to the other points.
  • Smallest average distance to the other points.
  • Smallest sum of squares of distances to the other points.
  • Etc.
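To make the three meanings of "closest" concrete, here is a small sketch that picks a clustroid under each criterion. The names `clustroid` and `hamming` are illustrative, and Hamming distance on equal-length strings is used only as an example of a non-Euclidean distance. Each point's zero distance to itself is included in the scores, which does not change which point wins.

```python
def clustroid(points, distance, criterion="sum_of_squares"):
    """Return the point of `points` that is 'closest' to the other points."""
    score = {
        "max": lambda ds: max(ds),
        "average": lambda ds: sum(ds) / len(ds),
        "sum_of_squares": lambda ds: sum(d * d for d in ds),
    }[criterion]

    def cost(p):
        # Distances from p to every point in the cluster (including itself).
        return score([distance(p, q) for q in points])

    return min(points, key=cost)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


words = ["karolin", "kathrin", "kerstin", "karotin"]
by_squares = clustroid(words, hamming, "sum_of_squares")
by_average = clustroid(words, hamming, "average")
by_max = clustroid(words, hamming, "max")
```

The criteria need not agree: here the sum-of-squares and average criteria both select "karotin", while the maximum-distance criterion ties between "karolin" and "karotin".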
k-Means Algorithm(s)
  • Assumes a Euclidean space.
  • Start by picking k, the number of clusters.
  • Initialize clusters by picking one point per cluster.
  • Example: pick one point at random, then k − 1 other points, each as far away as possible from the previous points.
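The initialization example can be read in more than one way. The sketch below assumes "as far away as possible from the previous points" means maximizing the distance to the nearest already-chosen point. The function name `farthest_first` and the `seed` parameter are assumptions made for this illustration.

```python
import random
from math import dist


def farthest_first(points, k, seed=0):
    """Pick k initial centers: one at random, then repeatedly the point
    whose distance to its nearest already-chosen center is largest."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    return centers


points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
initial_centers = farthest_first(points, 3)
```

Because every chosen point lies far from the others, this spreads the initial centers across the data, but it also tends to select outliers.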
Populating Clusters
  • For each point, place it in the cluster whose current centroid is nearest to it.
  • After all points are assigned, fix the centroids of the k clusters.
  • Optional: reassign all points to their closest centroid. This sometimes moves points between clusters.
Simple Clustering: K-means
Works with numeric data only.
  1. Pick a number (K) of cluster centers (at random).
  2. Assign every item to its nearest cluster center (e.g. using Euclidean distance).
  3. Move each cluster center to the mean of its assigned items.
  4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold).
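The steps above can be sketched as a short Python function. This is a minimal illustration under a few assumptions: items are tuples of numbers, the initial centers are K items sampled at random, and convergence is taken to mean that no item changed cluster (a threshold of zero). The names `k_means`, `max_iter`, and `seed` are choices made for this example.

```python
import random
from math import dist


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means on tuples of numbers; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: K random cluster centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:  # no item changed cluster
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned items.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(
                    sum(xs) / len(members) for xs in zip(*members)
                )
    return centers, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignment = k_means(points, 2)
```

On this toy data the two well-separated groups end up in different clusters. In general the result depends on the random initial centers, so different seeds can give different clusterings.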
K-means Clustering Summary
Advantages:
  • Simple and understandable.
  • Items are automatically assigned to clusters.
Disadvantages:
  • Must pick the number of clusters beforehand.
  • All items are forced into a cluster.
  • Too sensitive to outliers.
K-means Variations
  • K-medoids: instead of the mean, use the medoid of each cluster, a representative member that plays the role of a median.
      - Mean of 1, 3, 5, 7, 9 is 5.
      - Mean of 1, 3, 5, 7, 1009 is 205.
      - Median of 1, 3, 5, 7, 1009 is 5.
      - Median advantage: not affected by extreme values.
  • For large databases, use sampling.
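The arithmetic on this slide is easy to check with Python's standard `statistics` module. The `medoid` helper is an illustrative addition, assuming the medoid is defined as the data point with the smallest total absolute distance to the other points. For these one-dimensional values it coincides with the median.

```python
from statistics import mean, median

clean = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 1009]


def medoid(values):
    """Data point with the smallest total absolute distance to the others."""
    return min(values, key=lambda v: sum(abs(v - w) for w in values))


print(mean(clean))           # 5
print(mean(with_outlier))    # 205
print(median(with_outlier))  # 5
print(medoid(with_outlier))  # 5
```

A single extreme value moves the mean from 5 to 205, while the median and the medoid stay at 5.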
Examples of Clustering Applications
  • Marketing: discover customer groups and use them for targeted marketing and reorganization.
  • Astronomy: find groups of similar stars and galaxies.
  • Earthquake studies: observed earthquake epicenters should be clustered along continental faults.
  • Genomics: find groups of genes with similar expression.
  • And many more.
Clustering Summary
  • Unsupervised.
  • Many approaches:
      - K-means: simple, sometimes useful.
      - K-medoids: less sensitive to outliers.
      - Hierarchical clustering: works for symbolic attributes.
References
This presentation is compiled from: Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6.
