Mauritius JEDI
Machine Learning
&
Big Data
Clustering Algorithms
Nadeem Oozeer
Machine learning:
• Supervised vs unsupervised.
– Supervised learning - an outcome variable is
available to guide the learning process.
• there must be a training data set in which the solution
is already known.
– Unsupervised learning - the outcomes are
unknown.
• the data are clustered to reveal meaningful partitions
and hierarchies
Clustering:
• Clustering is the task of gathering samples into groups of similar
samples according to some predefined similarity or dissimilarity measure.
• Here, clustering is carried out using the Euclidean distance as the
measure.
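As a minimal illustration (not part of the original slides), the Euclidean distance used as the measure can be computed in plain Python:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points of the same dimension
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```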
Clustering:
• What is clustering good for?
– Market segmentation - group customers into
different market segments
– Social network analysis - Facebook "smartlists"
– Organizing computer clusters and data centers for
network layout and location
– Astronomical data analysis - Understanding
galaxy formation
Galaxy Clustering:
• Multi-wavelength data obtained for galaxy clusters
– Aim: determine robust criteria for the inclusion of a galaxy into
a galaxy cluster
– Note: physical parameters of the galaxy cluster can be heavily
influenced by wrong candidates
Credit:
HST
Clustering Algorithms :
• Hierarchical methods
– statistical methods that build clusters by
arranging elements at various levels
Dendrogram:
• Each level will then represent a possible
cluster.
• The height at which any two clusters are joined in
the dendrogram shows their level of similarity
• The closer to the bottom they join, the more
similar the clusters are
• Finding groups from a dendrogram is not
simple and is often subjective
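The bottom-up merging that a dendrogram records can be sketched as follows (a pure-Python, single-linkage sketch on 1-D points; the function name and data are illustrative, not from the slides):

```python
def single_linkage(points, k):
    # Start with every point in its own cluster, then repeatedly merge
    # the two clusters whose closest members are nearest (single linkage)
    # until only k clusters remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

print(single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], 3))
# → [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Cutting the dendrogram at a different height corresponds to stopping the merging earlier or later, i.e. choosing a different k.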
• Partitioning methods
– make an initial division of the database and then use an
iterative strategy to further divide it into sections
– here each object belongs to exactly one cluster
Credit:
Legodi,
2014
K-means:
K-means algorithm:
1. Given n objects, initialize k cluster centers
2. Assign each object to its closest cluster center
3. Update the center of each cluster
4. Repeat 2 and 3 until no cluster center changes
• Experiment: Pack of cards, dominoes
• Apply the K-means algorithm to the Shapley data
– Change the number of potential clusters and see how the
clustering differs
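The four steps above can be sketched in plain Python (1-D data; initializing the centers to the first k points is one simple, assumed choice, not prescribed by the slides):

```python
def kmeans(points, k, max_iter=100):
    # 1. initialize k cluster centers (here: the first k points)
    centers = [points[i] for i in range(k)]
    for _ in range(max_iter):
        # 2. assign each object to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[idx].append(p)
        # 3. update each center to the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # 4. stop when no center changes
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([1, 2, 3, 10, 11, 12], 2)
print(sorted(centers))  # [2.0, 11.0]
```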
K Nearest Neighbors (k-NN):
• One of the simplest of all machine learning
classifiers
• Differs from other machine learning techniques
in that it doesn't produce a model.
• It does, however, require a distance measure and
the selection of k.
• First, the k nearest training data points to the new
observation are found.
• These k points determine the class of the new
observation.
1-NN
• Simple idea: label a new point the same as
the closest known point
Label it red.
1-NN Aspects of an
Instance-Based Learner
1. A distance metric
– Euclidean
2. How many nearby neighbors to look at?
– One
3. A weighting function (optional)
– Unused
4. How to fit with the local points?
– Just predict the same output as the nearest
neighbor.
k-NN
• Generalizes 1-NN to smooth away noise in the labels
• A new point is now assigned the most frequent label of its k
nearest neighbors
Label it red, when k = 3
Label it blue, when k = 7
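A minimal k-NN classifier matching the description above (pure Python; the toy points and labels are invented for illustration and chosen so that k = 3 and k = 7 disagree, as on the slide):

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    # train: list of ((features...), label) pairs; classify query by the
    # majority label among its k nearest training points (Euclidean distance)
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "red"), ((1, 0), "red"),
         ((2, 0), "blue"), ((2, 1), "blue"),
         ((3, 0), "blue"), ((3, 1), "blue"), ((4, 0), "blue")]
print(knn_predict(train, (0.5, 0), 3))  # red
print(knn_predict(train, (0.5, 0), 7))  # blue
```

Note that no model is built: the training data itself is consulted at prediction time, which is why k-NN is called an instance-based learner.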

Editor's Notes

  • #7: In order to make use of all the multi-wavelength data obtained for galaxy clusters we need to determine robust criteria for the inclusion of a galaxy into a galaxy cluster. The physical parameters can be heavily influenced by the inclusion of galaxies which do not belong and this may lead to false conclusions. Clustering algorithms can be divided into two main groups – hierarchy methods and partitioning methods.
  • #9: Dendrogram: .... We choose a set level of similarity at about 50% of the height; all lines that cross this level indicate a cluster. This method is combined with the partitioning methods to get starting points for the mixture-modeling algorithms.
  • #11: It is a clustering algorithm that tries to partition a set of points into K sets (clusters) such that the points in each cluster tend to be near each other. It is unsupervised because the points have no external classification.