CPSC 340:
Machine Learning and Data Mining
K-Means Clustering
Andreas Lehrmann and Mark Schmidt
University of British Columbia, Fall 2022
https://www.students.cs.ubc.ca/~cs-340
Last Time: Ensemble Methods
• Ensemble methods are models that use other models as input.
– The ensemble can often achieve higher accuracy than individual models.
• One of the simplest ensemble methods is voting:
– Take the mode of the predictions across the classifiers.
– Higher accuracy than individual classifiers if errors are independent.
• Random forests:
– Ensemble method based on deep decision trees, incorporating two forms of randomness.
– Each tree is trained on a bootstrap sample of the data (‘n’ examples sampled with replacement).
– We use random trees (covered today) to further encourage errors to be independent.
Assignment 2:
• [Code Update]
example_Kmeans.jl in
a2.zip now loads
clusterData.jld instead
of clusterData2.jld (l.2)
• [Due Date] Monday,
October 3rd instead of
Friday, September 30th
Random Forest Ingredient 1: Bootstrap/Bagging
• Bootstrap sample of a list of ‘n’ training examples:
– A new set of size ‘n’ chosen independently with replacement.
– Gives new dataset of ‘n’ examples, with some duplicated and some missing.
• For large ‘n’, approximately 63% of original examples are included at least once in bootstrap.
• Bagging: an ensemble method where you apply the same classifier to different bootstrap samples.
1. Generate several bootstrap samples of the dataset.
2. Fit the classifier to each bootstrap sample.
– To make predictions, take a vote based on their predictions (see the sketch below).
• Random forests are a special case of bagging, using random trees as the classifier.
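As a concrete illustration, here is a minimal bagging sketch in Julia; the base-learner functions `fit_model(X, y)` and `predict_model(model, X)` are hypothetical placeholders for whatever classifier is being bagged, not course code.

```julia
using StatsBase  # for mode()

# Minimal bagging sketch: train each model on a bootstrap sample, predict by voting.
function bagging_fit(fit_model, X, y, n_models)
    n = size(X, 1)
    models = Any[]
    for _ in 1:n_models
        idx = rand(1:n, n)                      # bootstrap sample: n draws with replacement
        push!(models, fit_model(X[idx, :], y[idx]))
    end
    return models
end

function bagging_predict(predict_model, models, Xtest)
    preds = hcat([predict_model(m, Xtest) for m in models]...)  # one column per model
    return [mode(preds[i, :]) for i in 1:size(preds, 1)]        # majority vote per example
end
```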
Random Forest Ingredient 2: Random Trees
• For each split in a random tree model:
– Randomly sample a small number of possible features (typically √𝑑).
– Only consider these random features when searching for the optimal rule.
• So splits will tend to use different features in different trees.
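A small sketch of the per-split feature sampling in Julia (the value of ‘d’ is just an example value, not from the course):

```julia
using Random

# At each split, only a random subset of roughly √d features is eligible.
d = 100                                    # total number of features (example value)
n_sub = max(1, round(Int, sqrt(d)))        # typical choice: √d features per split
candidate_features = randperm(d)[1:n_sub]  # features considered when searching for this split's rule
```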
Random Forests: Putting it all Together
• Training:
Random Forests: Putting it all Together
• Prediction:
Random Forests: Discussion
• Random forest implementations use deep random trees.
– Often splitting until all leaves have only one label.
• So the individual trees tend to overfit.
– But bootstrapping and random trees make errors more independent.
• So the vote tends to have a much lower test error than individual trees.
• Empirically, random forests are often one of the “best” classifiers.
– Fernandez-Delgado et al. [2014]:
• Compared 179 classifiers on 121 datasets.
• Random forests were most likely to be the best classifier.
– Grinsztajn et al. [2022]:
• “Why do tree-based models still outperform deep learning on tabular data?”
Beyond Voting: Model Averaging
• Voting is a special case of “averaging” ensemble methods.
– Where we somehow “average” the predictions of different models.
• Other averaging methods:
– For “regression” (where yi is continuous), take average yi predictions:
– With probabilistic classifiers, take the average probabilities:
– And there are variations where some classifiers get more weight (see bonus):
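A minimal Julia sketch of these averaging variants (the prediction layouts below are assumptions for illustration, not course code):

```julia
using Statistics

# Regression: average the real-valued predictions.
# `preds` is an n-by-m matrix with one column per model (assumed layout).
average_regression(preds) = vec(mean(preds, dims=2))

# Probabilistic classifiers: average the per-class probabilities.
# `probs` is a vector of n-by-c probability matrices, one per model.
average_probabilities(probs) = reduce(+, probs) ./ length(probs)

# Weighted variant: some classifiers get more weight (weights `w` sum to 1).
weighted_average(probs, w) = reduce(+, w .* probs)
```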
Types and Goals of Ensemble Methods
• Remember the fundamental trade-off:
1. Etrain: How small you can make the training error.
vs.
2. Eapprox: How well training error approximates the test error.
• Goal of ensemble methods is that meta-classifier:
– Does much better on one of these than individual classifiers.
– Does not do too much worse on the other.
• This suggests two types of ensemble methods:
1. Averaging: improves approximation error of classifiers with high Eapprox.
• This is the point of “voting”.
2. Boosting: improves training error of classifiers with high Etrain.
• Covered later in course.
End of Part 1: Key Concepts
• Fundamental ideas:
– Training vs. test error (memorization vs. learning).
– IID assumption (examples come independently from same distribution).
– Golden rule of ML (test set should not influence training).
– Fundamental trade-off (between training error vs. approximation error).
– Validation sets and cross-validation (can approximate test error)
– Optimization bias (we can overfit the training set and the validation set).
– Decision theory (we should consider costs of predictions).
– Parametric vs. non-parametric (whether model size depends on ‘n’).
– No free lunch theorem (there is no universally “best” model).
End of Part 1: Key Concepts
• We saw 3 ways of “learning”:
– Searching for rules.
• Decision trees (greedy recursive splitting using decision stumps).
– Counting frequencies.
• Naïve Bayes (probabilistic classifier based on conditional independence).
– Measuring distances.
• K-nearest neighbours (non-parametric classifier with universal consistency).
• We saw 2 generic ways of improving performance:
– Encouraging invariances with data augmentation.
– Ensemble methods (combine predictions of several models).
• Random forests (averaging plus randomization to reduce overfitting).
Next Topic: Unsupervised Learning (Part 2)
Application: Classifying Cancer Types
• “I collected gene expression data for 1000 different types of cancer
cells, can you tell me the different classes of cancer?”
• We are not given the class labels y, but want meaningful labels.
• An example of unsupervised learning.
X =
https://corelifesciences.com/human-long-non-coding-rna-expression-microarray-service.html
Unsupervised Learning
• Supervised learning:
– We have features xi and class labels yi.
– Write a program that produces yi from xi.
• Unsupervised learning:
– We only have xi values, but no explicit target labels.
– You want to do “something” with them.
• Some unsupervised learning tasks:
– Outlier detection: Is this a ‘normal’ xi?
– Similarity search: Which examples look like this xi?
– Association rules: Which xj occur together?
– Latent-factors: What ‘parts’ are the xi made from?
– Data visualization: What does the high-dimensional X look like?
– Ranking: Which are the most important xi?
– Clustering: What types of xi are there?
Clustering Example
Input: data matrix ‘X’.
• In clustering we want to assign examples to “groups”:
Clustering Example
Input: data matrix ‘X’.
Output: clusters ŷ.
• In clustering we want to assign examples to “groups”:
Clustering
• Clustering:
– Input: set of examples described by features xi.
– Output: an assignment of examples to ‘groups’.
• Unlike classification, we are not given the ‘groups’.
– Algorithm must discover groups.
• Example of groups we might discover in e-mail spam:
– ‘Lucky winner’ group.
– ‘Weight loss’ group.
– ‘I need your help’ group.
– ‘Mail-order bride’ group.
Data Clustering
• General goal of clustering algorithms:
– Examples in the same group should be ‘similar’.
– Examples in different groups should be ‘different’.
• But the ‘best’ clustering is hard to define:
– We don’t have a test error.
– Generally, there is no ‘best’ method in unsupervised learning.
• So there are lots of methods: we’ll focus on important/representative ones.
• Why cluster?
– You could want to know what the groups are.
– You could want to find the group for a new example xi.
– You could want to find examples related to a new example xi.
– You could want a ‘prototype’ example for each group.
• For example, what does a typical breakfast look like?
Clustering of Epstein-Barr Virus
http://jvi.asm.org/content/86/20/11096.abstract
Other Clustering Applications
• NASA: what types of stars are there?
• Biology: are there sub-species?
• Documents: what kinds of documents are on my HD?
• Commercial: what kinds of customers do I have?
http://www.eecs.wsu.edu/~cook/dm/lectures/l9/index.html
http://www.biology-online.org/articles/canine_genomics_genetics_running/figures.html
K-Means
• Most popular clustering method is k-means.
• Input:
– The number of clusters ‘k’ (hyper-parameter).
– Initial guess of the center (the “mean”) of each cluster.
• K-Means Algorithm for Finding Means:
– Assign each xi to its closest mean.
– Update the means based on the assignment.
– Repeat until convergence (a sketch follows below).
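A minimal Julia sketch of these alternating steps (an illustration only, not the course's example_Kmeans.jl):

```julia
using Random, Statistics, LinearAlgebra

# Minimal k-means sketch: alternate between assigning each example to its closest
# mean and updating each mean, until no assignment changes.
function kmeans(X, k; maxiter=100)
    n, d = size(X)
    W = Float64.(X[randperm(n)[1:k], :])        # initial means: k random examples
    yhat = zeros(Int, n)
    for _ in 1:maxiter
        changed = false
        for i in 1:n                            # assignment step
            c = argmin([norm(X[i, :] - W[j, :]) for j in 1:k])
            changed |= (c != yhat[i])
            yhat[i] = c
        end
        changed || break                        # stop if no examples change groups
        for c in 1:k                            # update step: mean of each group
            members = findall(yhat .== c)
            isempty(members) || (W[c, :] = vec(mean(X[members, :], dims=1)))
        end
    end
    return W, yhat
end

# Cluster a new test example by assigning it to the nearest mean.
kmeans_predict(W, xtilde) = argmin([norm(xtilde - W[c, :]) for c in 1:size(W, 1)])
```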
K-Means Example
Start with ‘k’ initial ‘means’
(usually, random data points)
Input: data matrix ‘X’.
K-Means Example
Assign each example to
the closest mean.
Input: data matrix ‘X’.
K-Means Example
Update the mean
of each group.
Input: data matrix ‘X’.
K-Means Example
Assign each example to
the closest mean.
Input: data matrix ‘X’.
K-Means Example
Update the mean
of each group.
Input: data matrix ‘X’.
K-Means Example
Assign each example to
the closest mean.
Input: data matrix ‘X’.
K-Means Example
Update the mean
of each group.
Input: data matrix ‘X’.
K-Means Example
Assign each example to
the closest mean.
Input: data matrix ‘X’.
K-Means Example
Stop if no examples
change groups.
Input: data matrix ‘X’.
K-Means Example
Interactive demo:
https://www.naftaliharris.com/blog/visualizing-k-means-clustering
Input: data matrix ‘X’.
Output:
- Clusters ‘ŷ’.
- Means ‘W’.
K-Means Issues
• Guaranteed to converge when using Euclidean distance.
• Given a new test example:
– Assign it to the nearest mean to cluster it.
• Assumes you know number of clusters ‘k’.
– Lots of heuristics to pick ‘k’, none satisfying:
• https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
• Each example is assigned to one (and only one) cluster:
– No possibility for overlapping clusters or leaving examples unassigned.
• It may converge to a sub-optimal solution…
K-Means Clustering with Different Initialization
• Classic approach to dealing with sensitivity to initialization: random restarts.
– Try several different random starting points and choose the “best” (see the sketch below).
• See bonus slides for a more clever approach called k-means++.
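A sketch of random restarts in Julia, assuming the `kmeans` function from the earlier sketch and scoring each run by the total squared distance of examples to their assigned means (the objective ‘f’ discussed on the following slides):

```julia
using LinearAlgebra

# Sum of squared distances from each example to its assigned mean (the objective 'f').
kmeans_objective(X, W, yhat) = sum(norm(X[i, :] - W[yhat[i], :])^2 for i in 1:size(X, 1))

# Random restarts: run k-means from several initializations, keep the lowest-objective run.
function kmeans_restarts(X, k; n_restarts=10)
    best, best_f = nothing, Inf
    for _ in 1:n_restarts
        W, yhat = kmeans(X, k)                  # `kmeans` from the earlier sketch
        f = kmeans_objective(X, W, yhat)
        if f < best_f
            best_f, best = f, (W, yhat)
        end
    end
    return best
end
```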
KNN vs. K-Means
• Don’t confuse KNN classification and k-means clustering:
Property | KNN Classification | K-Means Clustering
Task | Supervised learning (given yi). | Unsupervised learning (no given yi).
Meaning of ‘k’ | Number of neighbours to consider (not number of classes). | Number of clusters (always consider single nearest mean).
Initialization | No training phase. | Training that is sensitive to initialization.
Model complexity | Model is complicated for small ‘k’, simple for large ‘k’. | Model is simple for small ‘k’, complicated for large ‘k’.
Parametric? | Non-parametric (stores data ‘X’). | Parametric for ‘k’ not depending on ‘n’ (stores means ‘W’).
What is K-Means Doing?
• We can interpret K-means steps as minimizing an objective:
– Total sum of squared distances from each example xi to its assigned mean wŷi:
f(w1, …, wk, ŷ1, …, ŷn) = Σi ‖xi − wŷi‖²
• The k-means steps:
– Minimize ‘f’ in terms of the ŷi (update cluster assignments).
– Minimize ‘f’ in terms of the wc (update means).
• Termination of the algorithm follows because:
– Each step does not increase the objective.
– There are a finite number of assignments to k clusters.
What is K-Means Doing?
• We can interpret K-means steps as minimizing an objective:
– Total sum of squared distances from each example xi to its assigned mean wŷi:
f(w1, …, wk, ŷ1, …, ŷn) = Σi ‖xi − wŷi‖²
• The k-means steps:
– Minimize ‘f’ in terms of the ŷi (update cluster assignments).
– Minimize ‘f’ in terms of the wc (update means).
• Use ‘f’ to choose between initializations (fixed ‘k’).
• Need to change wc update under other distances:
– L1-norm: set wc to median (“k-medians”, see bonus).
Cost of K-means
• Bottleneck is calculating distance from each xi to each mean wc:
Cost of K-means
• Bottleneck is calculating distance from each xi to each mean wc:
– Each time we do this costs O(d).
• We need to compute distance from ‘n’ examples to ‘k’ clusters.
• Total cost of assigning examples to clusters is O(ndk).
– Fast if k is not too large.
• Updating means is cheaper: O(nd).
Vector Quantization
• K-means originally comes from signal processing.
• Designed for vector quantization:
– Replace examples with the mean of their cluster (“prototype”).
• Example:
– Facebook places: 1 location summarizes many.
– What sizes of clothing should I make?
http://wannabite.com/wp-content/uploads/2014/10/ragu-pasta-sauce-printable-coupon.jpg
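Given the output of the k-means sketch above, vector quantization itself is one line: replace each example by its cluster's mean ("prototype").

```julia
# Replace each example with the mean of its cluster ("prototype").
# W and yhat are the means and assignments returned by the k-means sketch above.
quantize(W, yhat) = W[yhat, :]
```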
Vector Quantization for Basketball Players
• Clustering NBA basketball players based on shot type/percentage:
• The “prototypes” (means) give offensive styles (like “catch and shoot”).
https://fansided.com/2018/08/23/nylon-calculus-shooting-volume-versus-efficiency/
Vector Quantization Example
(Bad) Vector Quantization in Practice
• Political parties can be thought of as a form of vector quantization:
– Hope is that parties represent what a cluster of voters want.
• With larger ‘k’ more voters have a party that closely reflects them.
• With smaller ‘k’, parties are less accurate reflections of people’s views.
https://globalnews.ca/news/5191123/federal-election-seat-projection-trudeau-liberals-minority/
Summary
• Bagging: ensemble method where we apply the same classifier to bootstrap samples.
• Random forests: bagging of deep randomized decision trees.
– One of the best “out of the box” classifiers.
• Types of ensemble methods:
– “Boosting” reduces Etrain and “averaging” reduces Eapprox.
• Unsupervised learning: fitting data without explicit labels.
• Clustering: finding ‘groups’ of related examples.
• K-means: simple iterative clustering strategy.
– Fast but sensitive to initialization.
• Vector quantization:
– Compressing examples by replacing them with the mean of their cluster.
• Next time:
– John Snow and non-parametric clustering.
Extremely-Randomized Trees
• Extremely-randomized trees add an extra level of randomization:
1. Each tree is fit to a bootstrap sample.
2. Each split only considers a random subset of the features.
3. Each split only considers a random subset of the possible thresholds.
• So instead of considering up to ‘n’ thresholds,
only consider 10 or something small.
– Leads to different partitions so potentially more independence.
Bayesian Model Averaging
• Recall the key observation regarding ensemble methods:
– If models overfit in “different” ways, averaging gives better performance.
• But should all models get equal weight?
– E.g., decision trees of different depths, when lower depths have low
training error.
– E.g., a random forest where one tree does very well (on validation error)
and others do horribly.
– In science, research may be fraudulent or not based on evidence.
• In these cases, naïve averaging may do worse.
Bayesian Model Averaging
• Suppose we have a set of ‘m’ probabilistic binary classifiers wj.
• If each one gets equal weight, then we predict using:
• Bayesian model averaging treats model ‘wj’ as a random variable:
• So we should weight by probability that wj is the correct model:
– Equal weights assume all models are equally probable.
Bayesian Model Averaging
• Can get better weights by conditioning on training set:
• The ‘likelihood’ p(y | wj, X) makes sense:
– We should give more weight to models that predict ‘y’ well.
– Note that hidden denominator penalizes complex models.
• The ‘prior’ p(wj) is our ‘belief’ that wj is the correct model.
• This is how rules of probability say we should weigh models.
– The ‘correct’ way to predict given what we know.
– But it makes some people unhappy because it is subjective.
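In symbols, a hedged reconstruction of the averaging rules described above (the original formulas were slide images, so the notation here is assumed): with ‘m’ models w1, …, wm, test input x̃ and prediction ỹ,

```latex
% Equal weights: average the models' predicted probabilities.
p(\tilde{y} \mid \tilde{x}) \approx \frac{1}{m} \sum_{j=1}^{m} p(\tilde{y} \mid \tilde{x}, w_j)

% Bayesian model averaging: weight each model by its posterior probability,
% where p(w_j \mid X, y) \propto p(y \mid w_j, X)\, p(w_j).
p(\tilde{y} \mid \tilde{x}, X, y) = \sum_{j=1}^{m} p(\tilde{y} \mid \tilde{x}, w_j)\, p(w_j \mid X, y)
```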
What is K-Means Doing?
• How are the k-means steps decreasing this objective?
• If we just write ‘f’ as a function of a particular ŷi, we get:
– The “constant” includes all other terms, and doesn’t affect the location of the min.
– We can minimize in terms of ŷi by setting it to the ‘c’ with wc closest to xi.
What is K-Means Doing?
• How are the k-means steps decreasing this objective?
• If we just write ‘f’ as a function of a particular wcj, we get:
• Derivative is given by:
• Setting equal to 0 and solving for wcj gives:
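A hedged reconstruction of the missing algebra (the original formulas were slide images): writing f as a function of a single wcj, with the sum running over examples currently assigned to cluster ‘c’,

```latex
f(w_{cj}) = \sum_{i : \hat{y}_i = c} (w_{cj} - x_{ij})^2 + \text{const.},
\qquad
\frac{\partial f}{\partial w_{cj}} = \sum_{i : \hat{y}_i = c} 2\,(w_{cj} - x_{ij}).

% Setting the derivative to 0 and solving gives the mean of the cluster:
w_{cj} = \frac{1}{n_c} \sum_{i : \hat{y}_i = c} x_{ij},
\quad \text{where } n_c = |\{\, i : \hat{y}_i = c \,\}|.
```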
K-Medians Clustering
• With other distances k-means may not converge.
– But we can make it converge by changing the updates so that they are
minimizing an objective function.
• E.g., we can use the L1-norm objective:
• Minimizing the L1-norm objective gives the ‘k-medians’ algorithm:
– Assign points to clusters by finding “mean” with smallest L1-norm
distance.
– Update ‘means’ as median value (dimension-wise) of each cluster.
• This minimizes the L1-norm distance to all the points in the cluster.
• This approach is more robust to outliers.
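A sketch of the k-medians update step in Julia, reusing the assignment structure from the k-means sketch (the assignment step would use the L1-norm distance, norm(X[i,:] - W[c,:], 1), instead of the L2-norm):

```julia
using Statistics

# K-medians update: set each "mean" to the dimension-wise median of its cluster.
function kmedians_update!(W, X, yhat)
    for c in 1:size(W, 1)
        members = findall(yhat .== c)
        isempty(members) || (W[c, :] = vec(median(X[members, :], dims=1)))
    end
    return W
end
```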
What is the “L1-norm and median” connection?
• Point that minimizes the sum of squared L2-norms to all points:
– Is given by the mean (just take derivative and set to 0):
• Point that minimizes the sum of L1-norms to all points:
– Is given by the median (derivative of absolute value is +1 if positive and -1 if
negative, so any point with half of points larger and half of points smaller is a
solution).
K-Medoids Clustering
• A disadvantage of k-means in some applications:
– The means might not be valid data points.
– May be important for vector quantization.
• E.g., consider bag of words features like [0,0,1,1,0].
– We have words 3 and 4 in the document.
• A mean from k-means might look like [0.1 0.3 0.8 0.2 0.3].
– What does it mean to have 0.3 of word 2 in a document?
• Alternative to k-means is k-medoids:
– Same algorithm as k-means, except the means must be data points.
– Update the means by finding example in cluster minimizing squared L2-
norm distance to all points in the cluster.
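A sketch of the k-medoids update in Julia, restricting each ‘mean’ to be an actual example of its cluster:

```julia
using LinearAlgebra

# K-medoids update: the prototype of each cluster is the member example that
# minimizes the sum of squared L2-norm distances to the other members.
function kmedoids_update(X, yhat, k)
    W = similar(X, k, size(X, 2))
    for c in 1:k
        members = findall(yhat .== c)
        isempty(members) && continue
        costs = [sum(norm(X[i, :] - X[j, :])^2 for j in members) for i in members]
        W[c, :] = X[members[argmin(costs)], :]
    end
    return W
end
```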
K-Means Initialization
• K-means is fast but sensitive to initialization.
• Classic approach to initialization: random restarts.
– Run to convergence using different random initializations.
– Choose the one that minimizes average squared distance of data to means.
• Newer approach: k-means++
– Random initialization that prefers means that are far apart.
– Yields provable bounds on expected approximation ratio.
K-Means++
• Steps of k-means++:
1. Select initial mean w1 as a random xi.
2. Compute distance dic of each example xi to each mean wc.
3. For each example ‘i’ set di to the distance to the closest mean.
4. Choose next mean by sampling an example ‘i’ proportional to (di)2.
5. Keep returning to step 2 until we have ‘k’ means.
• Expected approximation ratio is O(log(k)).
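A minimal Julia sketch of these initialization steps (the subsequent k-means iterations then proceed as before):

```julia
using LinearAlgebra, StatsBase

# k-means++ initialization: the first mean is a random example; each subsequent mean
# is an example sampled with probability proportional to its squared distance
# to the closest mean chosen so far.
function kmeanspp_init(X, k)
    n, d = size(X)
    W = zeros(k, d)
    W[1, :] = X[rand(1:n), :]
    for c in 2:k
        dmin = [minimum(norm(X[i, :] - W[j, :])^2 for j in 1:c-1) for i in 1:n]
        i = sample(1:n, Weights(dmin))      # sample proportional to squared distance
        W[c, :] = X[i, :]
    end
    return W
end
```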
K-Means++
K-Means++
First mean is a
random example.
K-Means++
Weight examples by
distance to mean squared.
K-Means++
Sample mean proportional
to distances squared.
K-Means++
Weight examples by squared
distance to nearest mean.
K-Means++
Sample mean proportional
to minimum distances squared.
K-Means++
Weight examples by squared
distance to mean.
K-Means++
Sample mean proportional
to distances squared.
(Now hit chosen target k=4.)
K-Means++
Start k-means: assign
examples to the closest mean.
K-Means++
Update the mean
of each cluster.
K-Means++
In this case: just 2 iterations!
Update the mean
of each cluster.
Discussion of K-Means++
• Recall the objective function k-means tries to minimize:
• The initialization of ‘W’ and ‘c’ given by k-means++ satisfies:
• Get good clustering with high probability by re-running.
• However, there is no guarantee that c* is a good clustering.
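Stated loosely, the missing bound above can be reconstructed from the O(log(k)) approximation ratio quoted earlier (the notation is assumed, since the original formula was an image): if (W, ŷ) is the k-means++ initialization and (W*, ŷ*) is an optimal clustering, then

```latex
\mathbb{E}\big[\, f(W, \hat{y}) \,\big] \;\le\; O(\log k)\, f(W^{*}, \hat{y}^{*}).
```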
Uniform Sampling
• Standard approach to generating a random number from {1,2,…,n}:
1. Generate a uniform random number ‘u’ in the interval [0,1].
2. Return the smallest index ‘i’ such that u ≤ i/n.
• Conceptually, this divides interval [0,1] into ‘n’ equal-size pieces:
• This assumes pi = 1/n for all ‘i’.
Non-Uniform Sampling
• Standard approach to generating a random number for general pi.
1. Generate a uniform random number ‘u’ in the interval [0,1].
2. Return the smallest index ‘i’ such that u ≤ p1 + p2 + ⋯ + pi.
• Conceptually, this divides interval [0,1] into non-equal-size pieces:
• Can sample from a generic discrete probability distribution in O(n).
• If you need to generate ‘m’ samples:
– Cost is O(n + m log (n)) with binary search and storing cumulative sums.
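A sketch of this procedure in Julia, using cumulative sums and binary search as suggested above:

```julia
# Sample m indices from a discrete distribution p (assumed to sum to 1) using
# cumulative sums and binary search: O(n + m log n) total.
function sample_discrete(p, m)
    cumsums = cumsum(p)                     # O(n) preprocessing
    # Smallest index whose cumulative sum is ≥ u; min() guards against rounding error.
    draw() = min(searchsortedfirst(cumsums, rand()), length(p))
    return [draw() for _ in 1:m]            # O(log n) per sample
end
```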
How many iterations does k-means take?
• Each update of the ‘ŷi’ or ‘wc’ does not increase the objective ‘f’.
• And there are k^n possible assignments of the ŷi to ‘k’ clusters.
• So within k^n iterations you cannot improve the objective by changing the ŷi, and the algorithm stops.
• Tighter-but-more-complicated “smoothed” analysis:
– https://arxiv.org/pdf/0904.1113.pdf
Vector Quantization: Image Colors
• Usual RGB representation of a pixel’s color: three 8-bit numbers.
– For example, [241 13 50] = a shade of red.
– Can apply k-means to find set of prototype colours.
Original:
(24-bits/pixel)
K-means predictions:
(6-bits/pixel)
Run k-means with
2^6 clusters:
Vector Quantization: Image Colors
• Usual RGB representation of a pixel’s color: three 8-bit numbers.
– For example, [241 13 50] = a shade of red.
– Can apply k-means to find set of prototype colours.
Original:
(24-bits/pixel)
K-means predictions:
(6-bits/pixel)
Run k-means with
2^6 clusters:
Vector Quantization: Image Colors
• Usual RGB representation of a pixel’s color: three 8-bit numbers.
– For example, [241 13 50] = a shade of red.
– Can apply k-means to find set of prototype colours.
Original:
(24-bits/pixel)
K-means predictions:
(3-bits/pixel)
Run k-means with
2^3 clusters:
Vector Quantization: Image Colors
• Usual RGB representation of a pixel’s color: three 8-bit numbers.
– For example, [241 13 50] = a shade of red.
– Can apply k-means to find set of prototype colours.
Original:
(24-bits/pixel)
K-means predictions:
(2-bits/pixel)
Run k-means with
2^2 clusters:
Vector Quantization: Image Colors
• Usual RGB representation of a pixel’s color: three 8-bit numbers.
– For example, [241 13 50] = a shade of red.
– Can apply k-means to find set of prototype colours.
Original:
(24-bits/pixel)
K-means predictions:
(1-bit/pixel)
Run k-means with
2^1 clusters: