Buenos Aires, June 2016
Eduardo Poggi
Clustering
 Supervised vs. Unsupervised Learning
 Clustering Concepts
 Non-Hierarchical Clustering
 K-means
 EM-Algorithm
 Hierarchical Clustering
 Hierarchical Agglomerative Clustering (HAC)
Supervised vs. Unsupervised Learning
 Supervised Learning
 Classification: partition examples into groups according to pre-defined
categories
 Regression: assign a value to each feature vector
 Requires labeled data for training
 Unsupervised Learning
 Clustering: partition examples into groups when no pre-defined
categories/classes are available
 Novelty detection: find changes in data
 Outlier detection: find unusual events (e.g. hackers)
 Only instances required, but no labels
Clustering Concepts
 The basic goal of cluster analysis is to discover groups in the
data, such that objects in the same group are similar, while
objects in different groups are as dissimilar as possible.
 Partition unlabeled examples into disjoint subsets of clusters, such
that:
 Examples within a cluster are similar
 Examples in different clusters are different
 Discover new categories in an unsupervised manner (no sample
category labels provided).
Clustering Concepts (2)
 Applications are numerous: classifying plants and animals in biology,
grouping people in the social sciences by their habits and preferences,
identifying groups of consumers with similar needs in marketing, etc.
 Cluster retrieved documents (e.g. Teoma)
 to present more organized and understandable results to user
 Detecting near duplicates
 Entity resolution
 E.g. “Thorsten Joachims” == “Thorsten B Joachims”
 Cheating detection
 Exploratory data analysis
 Automated (or semi-automated) creation of taxonomies
 e.g. Yahoo-style
Clustering Concepts (3)
 We will consider two types of clustering algorithms:
 Partitioning methods: classify the data into k groups that must satisfy the
requirements of a partition
 Each group must contain at least one object
 Each object must belong to exactly one group
 Hierarchical methods:
 Agglomerative: start with n clusters of one observation each; at every step two
groups are merged, ending with a single cluster of n observations.
 Divisive: start with a single cluster of n observations; at every step one group is
split in two, ending with n clusters of one observation each.
K-Means Clustering Method
1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. For each datapoint find out which Center it’s closest to. (Thus each
Center “owns” a set of datapoints)
4. For each Center find the centroid of the points it owns
5. …and move each Center there
6. …Repeat until terminated!
(Are we sure it will terminate?)
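A minimal NumPy sketch of these steps (assuming X is an n-by-d array of real-valued feature vectors; the function and variable names are illustrative, not from the slides):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2: random guess
    history = []
    for _ in range(iters):
        # Step 3: each datapoint is "owned" by its closest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        owner = d.argmin(axis=1)
        history.append((d.min(axis=1) ** 2).sum())  # distortion at this step
        # Steps 4-5: each center jumps to the centroid of the points it owns
        # (an empty center keeps its old location).
        new = np.array([X[owner == j].mean(axis=0) if (owner == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # step 6: stop when nothing moves
            break
        centers = new
    return centers, owner, history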
K-Means Step by step (1 & 2)
1. Ask user how many
clusters they’d like.
(e.g. k=5)
2. Randomly guess k
cluster Center
locations
K-Means Step by step (3)
1. Ask…
2. Randomly guess k
cluster Center
locations
3. For each datapoint
find out which Center
it’s closest to. (Thus
each Center “owns” a
set of datapoints)
K-Means Step by step (4)
1. Ask…
2. Randomly guess…
3. For each datapoint
find out which Center
it’s closest to. (Thus
each Center “owns” a
set of datapoints)
4. For each Center find
the centroid of the
points it owns
K-Means Step by step (5 & 6)
1. Ask…
2. Randomly guess…
3. For each datapoint …
4. For each Center find
the centroid of the
points it owns
5. …and move each Center there
6. …Repeat until
terminated!
K-Means Q&A
 What is it trying to optimize?
 Are we sure it will terminate?
 Are we sure it will find an optimal clustering?
 How should we start it?
 How could we automatically choose the number
of centers?
K-Means Q&A (2)
 This clustering method is simple and reasonably effective.
 The final cluster centers do not represent a global
minimum but only a local one.
 Completely different final clusters can arise from
differences in the initial randomly chosen cluster
centers.
K-Means Q&A (3)
Are we sure it will terminate?
 There are only a finite number of ways of partitioning R records into k
groups.
 So there are only a finite number of possible configurations in which all
Centers are the centroids of the points they own.
 If the configuration changes on an iteration, it must have improved the
distortion.
 So each time the configuration changes it must go to a configuration it’s
never been to before.
 So if it tried to go on forever, it would eventually run out of configurations.
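Continuing from the kmeans sketch above: the distortion history it records lets you check this argument empirically, since distortion never increases between iterations the loop cannot cycle forever.

X = np.random.default_rng(1).normal(size=(200, 2))
centers, owner, history = kmeans(X, k=5)
# Distortion is non-increasing, so the algorithm must terminate.
assert all(later <= earlier + 1e-9
           for earlier, later in zip(history, history[1:]))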
K-Means Q&A (4)
 Will we find the optimal configuration?
 Can you invent a configuration that has converged, but does not
have the minimum distortion?
K-Means Q&A (5)
Trying to find good optima
 Idea 1: Be careful about where you start
 Neat trick:
 Place first center on top of randomly chosen datapoint.
 Place second center on the datapoint that’s as far away as possible from the first center
 Place the j’th center on the datapoint that’s as far away as possible from the closest of
Centers 1 through j-1
 Idea 2: Do many runs of k-means, each from a different random start
configuration
 Many other ideas floating around.
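A sketch of the "neat trick" (farthest-first initialization), under the same assumptions as the kmeans sketch above. Idea 2 then amounts to calling kmeans with several different seeds and keeping the run with the lowest final distortion.

def farthest_first_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: a random datapoint
    for _ in range(1, k):
        # Distance from every point to its closest already-placed center.
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                           axis=-1).min(axis=1)
        centers.append(X[d.argmax()])    # next center: the farthest such point
    return np.array(centers)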
K-Means Q&A (6)
Choosing the number of Centers
 A difficult problem
 Most common approach is to try to find the solution that minimizes
the Schwarz Criterion
 Trying every k from 2 to n!!
 Incrementally (k=2, then do 2-Means for each cluster, and so on…)
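A rough sketch of scoring candidate values of k with a BIC-style (Schwarz) penalty on the k-means distortion. This particular penalty formula is an assumption, one of several approximations in use, not a prescription from the slides; it reuses the kmeans sketch above.

def schwarz_score(X, k):
    n, d = X.shape
    _, _, history = kmeans(X, k)
    distortion = history[-1]
    # Fit term plus a complexity penalty that grows with k.
    return n * np.log(distortion / n) + k * d * np.log(n)

best_k = min(range(2, 10), key=lambda k: schwarz_score(X, k))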
Common uses of K-means
 Often used as an exploratory data analysis tool
 In one dimension, a good way to quantize real-valued variables into k
non-uniform buckets
 Used on acoustic data in speech understanding to convert waveforms
into one of k categories (known as Vector Quantization)
 Also used for choosing color palettes on old-fashioned graphical
display devices!
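Palette selection is vector quantization in miniature: one k-means run over pixel vectors. A sketch, where img is a hypothetical H-by-W-by-3 uint8 image array and scikit-learn is used for brevity:

import numpy as np
from sklearn.cluster import KMeans

pixels = img.reshape(-1, 3).astype(float)       # one RGB vector per pixel
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)
palette = km.cluster_centers_.astype(np.uint8)  # the 16 representative colors
quantized = palette[km.labels_].reshape(img.shape)  # each pixel -> nearest color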
Single Linkage Hierarchical Clustering
1. Say “Every point is its
own cluster”
Single Linkage Hierarchical Clustering (2)
1. Say “Every point is its
own cluster”
2. Find “Most similar” pair of
clusters
Single Linkage Hierarchical Clustering (3)
1. Say “Every point is its
own cluster”
2. Find “Most similar” pair of
clusters
3. Merge them into a parent
cluster
Single Linkage Hierarchical Clustering (4)
1. Say “Every point is its
own cluster”
2. Find “Most similar” pair of
clusters
3. Merge them into a parent
cluster
4. Repeat... until you’ve
merged the whole dataset
into one cluster
Single Linkage Hierarchical Clustering (5)
1. Say “Every point is its
own cluster”
2. Find “Most similar” pair of
clusters
3. Merge them into a parent
cluster
4. Repeat... until you’ve
merged the whole dataset
into one cluster
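A deliberately naive sketch of these four steps, stopping once k clusters remain (fine for small n; real implementations are far more efficient):

import numpy as np

def single_linkage(X, k):
    clusters = [[i] for i in range(len(X))]      # step 1: every point alone
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > k:                     # step 4: repeat until done
        # Step 2: find the most similar pair of clusters -- under single
        # linkage, the pair with the closest pair of points across them.
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dab = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if dab < best_d:
                    best_d, best = dab, (a, b)
        a, b = best
        clusters[a] += clusters.pop(b)           # step 3: merge into a parent
    return clusters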
Hierarchical Clustering Q&A
 How do we define similarity between clusters?
 Minimum distance between points in clusters (in which case we’re
simply building Euclidean Minimum Spanning Trees)
 Maximum distance between points in clusters
 Average distance between points in clusters
 And more…
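In practice, switching between these criteria is a one-word change, e.g. in SciPy ('single' is minimum distance, 'complete' maximum, 'average' mean; a sketch, with X as before):

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

Z = linkage(pdist(X), method='single')           # or 'complete', 'average'
labels = fcluster(Z, t=4, criterion='maxclust')  # cut the tree into 4 groups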
Hierarchical Clustering Q&A (2)
 Single Linkage Comments
 Also known in the trade as Hierarchical Agglomerative Clustering (note
the acronym)
 It’s nice that you get a hierarchy instead of an amorphous collection of
groups
 If you want k groups, just cut the (k-1) longest links
 There’s no real statistical or information-theoretic foundation to this.
Makes your lecturer feel a bit queasy.
Cluster Silhouettes
 For each example i, define a(i) as the average dissimilarity between i and the
other objects of A, the cluster assigned to i
 Then compute d(i, C) for each cluster C other than A
 Keep b(i), the smallest of these distances to another cluster. The cluster B attaining
this minimum, i.e. d(i,B) = b(i), is called the neighbor of object i. (Its second-choice
membership)
Cluster Silhouettes (2)
 Now define s(i) as: s(i) = (b(i) - a(i)) / max{a(i), b(i)}
 To understand the meaning of s(i), consider the extreme situations:
 When s(i) is close to 1, a(i), the average dissimilarity between i and the objects of its own cluster,
is much smaller than b(i), the dissimilarity between i and its neighbor cluster. We can therefore say
that i is well classified.
 When s(i) is close to 0, b(i) and a(i) are roughly equal, and it is not clear whether i should be assigned
to A or to the neighbor cluster. Object i is about as far from one as from the other.
 The worst situation occurs when s(i) is close to -1: a(i) is much larger than b(i), so on
average i is closer to the neighbor cluster than to A.
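A direct NumPy sketch of these definitions, using pairwise Euclidean distance as the dissimilarity and assuming at least two clusters (names are illustrative):

import numpy as np

def silhouettes(X, labels):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude i itself from a(i)
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == c].mean()     # b(i): the neighbor cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s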
Cluster Silhouettes (3)
[Silhouette plot: silhouette width of each object, grouped by clusters C1 and C2;
average silhouette width: 0.8]
SC           Interpretation
0.71 - 1.00  Strong structure
0.51 - 0.70  Reasonable structure
0.26 - 0.50  Weak structure, could be artificial
< 0.25       No structure found
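Tying the table back to model selection: the average silhouette width gives a simple way to compare candidate values of k. A sketch using scikit-learn, with X as before; pick the k whose score lands highest in the table:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))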
eduardopoggi@yahoo.com.ar
eduardo-poggi
http://ar.linkedin.com/in/eduardoapoggi
https://www.facebook.com/eduardo.poggi
@eduardoapoggi