Machine Learning with
Python
Clustering
Clustering is an unsupervised method. It is used to segment the data into groups of similar
observations rather than to predict a target.
It does not predict anything on its own, but it can be used to improve the accuracy of predictive methods.
Distances for Clustering
Euclidean Distance (as-the-crow-flies distance): the length of the straight line between two points.
Distances between Clusters:
Minimum Distance: distance between the two closest points, one from each cluster
Maximum Distance: distance between the two farthest points, one from each cluster
Centroid Distance: distance between the centroids of the clusters
(Diagrams on the slide illustrate the minimum, maximum and centroid distances.)
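For concreteness, here is a minimal NumPy sketch of these distances. The two small 2-D clusters are made-up points, not taken from the slides.

```python
import numpy as np

cluster_a = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 1.0]])
cluster_b = np.array([[6.0, 5.0], [7.0, 6.5], [6.5, 7.0]])

def euclidean(p, q):
    """Straight-line (as-the-crow-flies) distance between two points."""
    return np.sqrt(np.sum((p - q) ** 2))

# Pairwise distances between every point in cluster A and every point in cluster B
pairwise = np.array([[euclidean(p, q) for q in cluster_b] for p in cluster_a])

min_dist = pairwise.min()        # minimum distance: closest pair of points
max_dist = pairwise.max()        # maximum distance: farthest pair of points
centroid_dist = euclidean(cluster_a.mean(axis=0), cluster_b.mean(axis=0))  # centroid distance

print(min_dist, max_dist, centroid_dist)
```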
K Means Clustering
 It aims at partitioning the data into k clusters such that each data point belongs to the
cluster whose mean is nearest to it.
Let's say you have some data points and you want to group them into 3 clusters, so k = 3.
1. Start by randomly assigning the data points to 3 clusters.
2. Calculate the centroid (mean) of each cluster.
3. Reassign each data point to the cluster with the closest mean.
4. Recompute the cluster centroids.
5. Repeat steps 3 and 4 until the assignments no longer change (a sketch of these steps follows below).
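Below is a minimal from-scratch sketch of these five steps using only NumPy. The function name, the default of k = 3, the random seed and the iteration limit are illustrative assumptions, not part of the slides; in practice scikit-learn's KMeans would normally be used.

```python
import numpy as np

def kmeans(X, k=3, max_iter=100, seed=0):
    """Toy K-Means following the five steps above (empty clusters are not handled)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each point to one of k clusters
    labels = rng.integers(0, k, size=len(X))
    centroids = None
    for _ in range(max_iter):
        # Steps 2/4: compute (or recompute) the centroid of each cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: reassign each point to the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 5: stop once no assignment changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```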
K Means Clustering
Take a group of 12 football players who have each scored a certain number of goals this season
(say in the range 3–30). Let’s divide them into separate clusters—say three.
Step 1 requires us to randomly split the players into three groups and calculate the means of
each.
K Means Clustering
Step 2: For each player, reassign them to the group with the closest mean. E.g., Player A (5 goals)
is assigned to Group 2 (mean = 9). Then recalculate the group means.
K Means Clustering
Repeat Step 2 over and over until the group means no longer change.
With this example, the clusters could correspond to the players’ positions on the
field — such as defenders, midfielders and attackers.
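A short sketch of this example with scikit-learn's KMeans (assumed installed). The goal tallies below are made up for illustration, since the slides do not list the actual numbers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical goals scored by 12 players this season (roughly in the range 3-30)
goals = np.array([3, 4, 5, 7, 9, 11, 14, 16, 19, 24, 27, 30]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(goals)
for label in range(3):
    members = goals[km.labels_ == label].ravel()
    print(f"Cluster {label}: goals {members.tolist()}, mean = {members.mean():.1f}")
```

With 1-D data like this, the three clusters roughly separate low, medium and high scorers, mirroring the defender/midfielder/attacker intuition above.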
How to find k
In K-Means clustering, the value of K has to be specified beforehand. It can be determined using
the method below:
Elbow Method: Clustering is run on the dataset for varying values of K, and the SSE (sum of squared
errors) is calculated for each value of K. A graph of SSE against K is then plotted. There is a
point on the graph beyond which the SSE no longer decreases significantly with increasing K. This
point resembles the elbow of an arm and is chosen as the value of K.
The SSE is defined as the sum of the squared distances between each member of a cluster and
its centroid.
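A sketch of the elbow method using scikit-learn and matplotlib (both assumed available). KMeans.inertia_ is exactly the SSE defined above; make_blobs only provides toy data for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method: choose K where the curve bends")
plt.show()
```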
Hierarchical Clustering (Agglomerative, e.g. Single Linkage)
 Start with each data point in its own cluster.
 Combine the two nearest clusters (measured by, e.g., Euclidean or centroid distance).
 Repeat the process until all data points belong to a single cluster (see the sketch below).
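A minimal sketch of agglomerative clustering with SciPy (assumed installed). The toy 2-D data and the choice of cutting the tree into three clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))     # toy 2-D data

# 'single' linkage merges by the minimum (closest-pair) distance;
# method="centroid" would use the centroid distance instead.
Z = linkage(X, method="single", metric="euclidean")   # full merge history
labels = fcluster(Z, t=3, criterion="maxclust")       # cut the tree into 3 clusters
print(labels)
```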
Hierarchical Clustering
How many clusters to form?
 Visualising the dendrogram: the best choice for the number of clusters is the number of vertical lines
that can be cut by a horizontal line which can traverse the maximum vertical distance without intersecting
another cluster (a plotting sketch follows this list).
For example, in the dendrogram shown on the original slide, the best choice for the number of clusters is 4.
 Intuition and prior knowledge of the data set.
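A sketch of the dendrogram view with SciPy and matplotlib (both assumed available), again on made-up toy data rather than the data from the slide.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(0).normal(size=(20, 2))  # toy 2-D data
Z = linkage(X, method="single")

dendrogram(Z)
plt.ylabel("merge distance")
plt.title("Number of clusters = vertical lines crossed by the horizontal cut")
plt.show()
```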
K-Means vs Hierarchical
 For big data, K-Means is better!
The time complexity of K-Means is linear in the number of data points, while that of hierarchical clustering is quadratic.
 Results are reproducible in hierarchical clustering but not in K-Means, whose results depend on the initialization
of the centroids.
 K-Means requires prior and proper knowledge of the data set in order to specify K.
In hierarchical clustering, we can choose the number of clusters by interpreting the dendrogram.
Applications
Clustering algorithms can be applied in many fields, for instance:
 Marketing: finding groups of customers with similar behavior given a large database of
customer data containing their properties and past buying records;
 Biology: classification of plants and animals given their features
 Insurance: identifying groups of motor insurance policy holders with a high average claim
cost; identifying frauds
 Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones
 World Wide Web: document classification; clustering weblog data to discover groups of
similar access patterns.