CSCI 567: Machine Learning
Vatsal Sharan
Fall 2022
Lecture 9, Nov 3
Administrivia
• Project details are out
• Make groups (of 4) by Nov 11, minimum team size is 3.
• Q5 on HW4 will help you get started on it.
• We’ll give an overview of the project and general tips in today’s discussion.
• HW4 is due in about two weeks (Nov 16 at 2pm).
• We’ll release another question on PCA tomorrow.
• Today’s plan:
• Finish ensemble methods
• Unsupervised learning:
• PCA
• Clustering
Ensemble methods:
Recap
Ensemble methods
• Bagging
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Bagging
Collect T subsets each of some fixed size (say m) by sampling with replacement from
training data.
Let $f_t(x)$ be the classifier (such as a decision tree) obtained by training on the subset $t \in \{1, \ldots, T\}$. Then the aggregated classifier $f_{\text{agg}}(x)$ is given by:
$$f_{\text{agg}}(x) = \begin{cases} \dfrac{1}{T} \sum_{t=1}^{T} f_t(x) & \text{for regression}, \\[6pt] \operatorname{sign}\left(\dfrac{1}{T} \sum_{t=1}^{T} f_t(x)\right) = \text{Majority Vote}\{f_t(x)\}_{t=1}^{T} & \text{for classification}. \end{cases}$$
• Reduces overfitting (i.e., variance)
• Can work with any type of classifier (here focus on trees)
• Easy to parallelize (can train multiple trees in parallel)
• But loses interpretability compared to a single decision tree (true for all ensemble methods)
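To make the aggregation concrete, here is a minimal sketch of bagging for classification (labels assumed to be ±1), using scikit-learn decision trees as the base classifier; the function names, the dataset X, y, and the parameters T and m are placeholders for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T, m, seed=0):
    """Train T trees, each on m examples sampled with replacement."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=m)          # bootstrap sample of size m
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Majority vote over the T trees (labels assumed to be +1/-1)."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return np.sign(votes)
```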
Ensemble methods
• Bagging
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Random forests: When growing a tree on a bootstrapped (i.e. subsampled) dataset,
before each split select $m \le d$ of the $d$ input variables at random as candidates for
splitting.
When $m = d$ → same as Bagging
When $m < d$ → Random forests
$m$ is a hyperparameter, tuned via cross-validation
Random forests
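As an illustration (not from the slides): in scikit-learn the number of candidate features per split corresponds to the max_features parameter of RandomForestClassifier, and it can be tuned by cross-validation; X_train, y_train below are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# max_features = None (all d features) recovers bagging; smaller values give a random forest.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200),
    param_grid={"max_features": [1, 2, "sqrt", None]},
    cv=5,
)
# search.fit(X_train, y_train)   # X_train, y_train are placeholder data
```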
Ensemble methods
• Bagging
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Boosting: Idea
The boosted predictor is of the form fboost(x) = sign(h(x)), where,
$$h(x) = \sum_{t=1}^{T} \beta_t f_t(x) \quad \text{for } \beta_t \ge 0 \text{ and } f_t \in \mathcal{F}.$$
The goal is to minimize $\ell(h(x), y)$ for some loss function $\ell$.
Q: We know how to find the best predictor in F on some data, but how do we find the best weighted
combination h(x)?
A: Minimize the loss by a greedy approach, i.e. find βt, ft(x) one by one for t = 1, . . . , T.
Specifically, let $h_t(x) = \sum_{\tau=1}^{t} \beta_\tau f_\tau(x)$. Suppose we have found $h_{t-1}(x)$; how do we find $\beta_t, f_t(x)$?
Find the $\beta_t, f_t(x)$ which minimize the loss $\ell(h_t(x), y)$.
Different loss functions $\ell$ give different boosting algorithms:
$$\ell(h(x), y) = \begin{cases} (h(x) - y)^2 & \to \text{Least squares boosting}, \\ \exp(-h(x)\,y) & \to \text{AdaBoost}. \end{cases}$$
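As a bridge to the next slide, here is a brief sketch of the greedy step for the exponential loss, which is where the AdaBoost weight formula comes from: write $w_i = e^{-y_i h_{t-1}(x_i)}$ for the (unnormalized) weight of example $i$ and let $\epsilon_t$ be the weighted error of $f_t$; minimizing over $\beta_t$ gives
$$\sum_i w_i\, e^{-\beta_t y_i f_t(x_i)} = e^{-\beta_t}\!\!\sum_{i:\, f_t(x_i)=y_i}\!\! w_i \;+\; e^{\beta_t}\!\!\sum_{i:\, f_t(x_i)\neq y_i}\!\! w_i, \qquad \frac{d}{d\beta_t}(\cdot)=0 \;\Longrightarrow\; \beta_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.$$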
Ensemble methods
• Bagging
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
AdaBoost: Full algorithm

Given a training set $S$ and a base algorithm $\mathcal{A}$, initialize $D_1$ to be uniform.
For $t = 1, \ldots, T$:
• obtain a weak classifier $f_t(x) \leftarrow \mathcal{A}(S, D_t)$
• calculate the weight $\beta_t$ of $f_t(x)$ as
$$\beta_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \qquad (\beta_t > 0 \Leftrightarrow \epsilon_t < 0.5)$$
where $\epsilon_t = \sum_{i:\, f_t(x_i) \ne y_i} D_t(i)$ is the weighted error of $f_t(x)$.
• update distributions
$$D_{t+1}(i) \propto D_t(i)\, e^{-\beta_t y_i f_t(x_i)} = \begin{cases} D_t(i)\, e^{-\beta_t} & \text{if } f_t(x_i) = y_i \\ D_t(i)\, e^{\beta_t} & \text{otherwise} \end{cases}$$
Output the final classifier $f_{\text{boost}} = \operatorname{sgn}\left(\sum_{t=1}^{T} \beta_t f_t(x)\right)$.
Put more weight on instances that are difficult to classify and less on those already handled well.
New weak learners are added sequentially, each focusing its training on the more difficult patterns.
Adaboost: Example
10 data points in R2
The size of + or - indicates the weight, which starts from uniform D1
Base algorithm is a decision stump: weak classifiers are vertical or horizontal half-planes.
Go through the calculations in the example to make sure you understand the algorithm
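To help check those calculations, here is a minimal NumPy sketch of the algorithm above with decision stumps (depth-1 trees) as the base algorithm; labels are assumed to be in {−1, +1}, and the function names and the data X, y are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T):
    """AdaBoost with decision stumps; y must be +1/-1."""
    n = len(X)
    D = np.full(n, 1.0 / n)                       # D_1: uniform distribution
    stumps, betas = [], []
    for t in range(T):
        # Weak learner trained on the weighted data, i.e. A(S, D_t).
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = f.predict(X)
        eps = D[pred != y].sum()                  # weighted error eps_t
        beta = 0.5 * np.log((1 - eps) / eps)      # beta_t
        D = D * np.exp(-beta * y * pred)          # reweight examples
        D /= D.sum()                              # renormalize to a distribution
        stumps.append(f)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(stumps, betas, X):
    scores = sum(b * f.predict(X) for f, b in zip(stumps, betas))
    return np.sign(scores)
```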
Ensemble methods:
Gradient boosting
Ensemble methods
• Bagging
• Random forests
• Boosting: Basics
• Adaboost
• Gradient boosting
Gradient Boosting
Recall $h_t(x) = \sum_{\tau=1}^{t} \beta_\tau f_\tau(x)$. For AdaBoost (exponential loss), given $h_{t-1}(x)$, we
found what $f_t(x)$ should be.
Gradient boosting provides an iterative approach for a general (any) loss function $\ell(h(x), y)$:
• For all training datapoints $(x_i, y_i)$, compute the negative gradient (pseudo-residual)
$$r_i = -\left[\frac{\partial \ell(h(x_i), y_i)}{\partial h(x_i)}\right]_{h(x_i) = h_{t-1}(x_i)}$$
• Use the weak learner to find $f_t$ which fits $(x_i, r_i)$ as well as possible:
$$f_t = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} (r_i - f(x_i))^2.$$
• Update $h_t(x) = h_{t-1}(x) + \eta f_t(x)$, for some step size $\eta$.
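A minimal sketch of these three steps for the squared loss $\ell(h, y) = \tfrac{1}{2}(h - y)^2$, where the negative gradient is simply the residual $y_i - h_{t-1}(x_i)$; regression trees stand in for the weak learner class, and the function names and data X, y are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T, eta=0.1, max_depth=3):
    """Gradient boosting of regression trees for squared loss."""
    h = np.zeros(len(y))                      # h_0 = 0
    trees = []
    for t in range(T):
        r = y - h                             # negative gradient = residuals
        f = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        h = h + eta * f.predict(X)            # h_t = h_{t-1} + eta * f_t
        trees.append(f)
    return trees

def gb_predict(trees, X, eta=0.1):
    return eta * sum(f.predict(X) for f in trees)
```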
Gradient Boosting
Usually we add some regularization term to prevent overfitting (penalize the size of the tree etc.)
Gradient boosting is extremely successful!!
A variant, XGBoost, is one of the most popular algorithms for structured data (tables with
numerical and categorical features, where each feature typically has some meaning, unlike images or text).
(E.g., in Kaggle competitions back in 2015, 17 out of 29 winning solutions used XGBoost.)
Unsupervised learning: PCA
A simplistic taxonomy of ML
Supervised learning:
Aim to predict
outputs of future
datapoints
Unsupervised
learning:
Aim to discover
hidden patterns and
explore data
Reinforcement
learning:
Aim to make
sequential decisions
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
Acknowledgement & further reading
Our presentation is closely based on Gregory Valiant’s
notes for CS168 at Stanford.
https://web.stanford.edu/class/cs168/l/l7.pdf
https://web.stanford.edu/class/cs168/l/l8.pdf
You can refer to these notes for further reading.
Also review our Linear algebra Colab notebooks:
Part 1
Part 2
We’ll start with a simple and fundamental unsupervised learning problem: dimensionality reduction.
Goal: reduce the dimensionality of a dataset so that
• it is easier to visualize and discover patterns
• it takes less time and space to process for any downstream application (classification, regression, etc)
• noise is reduced
• · · ·
There are many approaches; we focus on a linear method: Principal Component Analysis (PCA).
Dimensionality reduction & PCA
Consider the following dataset:
• 17 features, each represents the average consumption of some food
• 4 data points, each represents some country.
What can you tell?
Hard to say anything looking at all these 17 features.
Picture from here
See this for more details
PCA: Motivation
PCA: Motivation
PCA can help us! The projection of the data onto its first principal component:
i.e. we reduce the dimensionality from 17 to just 1.
Now one data point is clearly different from the rest!
PCA: Motivation
i.e. we reduce the dimensionality from 17 to just 1.
Now one data point is clearly different from the rest!
That turns out to be data from Northern Ireland,
the only country not on the island of Great Britain out of the 4 samples.
Can also interpret components: PC1 tells us that the Northern Irish eat more grams of fresh
potatoes and fewer of fresh fruits and alcoholic drinks.
PCA can help us! The projection of the data onto its first principal component (PC1):
PCA: Motivation
We can find the second (and more) principal components of the data too:
PCA: Motivation
And the components themselves are interpretable too:
See this for more details
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
High-level goal

Suppose we have a dataset of $n$ datapoints $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$.
The high-level goal of PCA is to find a set of $k$ principal components (PCs) / principal vectors $v_1, \ldots, v_k \in \mathbb{R}^d$ such that for each $x_i$,
$$x_i \approx \sum_{j=1}^{k} \alpha_{ij} v_j$$
for some coefficients $\alpha_{ij} \in \mathbb{R}$.
Preprocessing the data
• Before we apply PCA, we usually preprocess the data to center it
• In many applications, it is also important to scale each coordinate properly. This is especially true if
the coordinates are in different units or scales.
Figure from CS168 notes
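A minimal preprocessing sketch: center each coordinate, and optionally divide by its standard deviation when the coordinates are on different scales; the function name and the n × d data matrix X are placeholders.

```python
import numpy as np

def preprocess(X, scale=False):
    """Center the columns of X; optionally rescale them to unit variance."""
    Xc = X - X.mean(axis=0)            # center each coordinate
    if scale:
        Xc = Xc / Xc.std(axis=0)       # rescale coordinates in different units
    return Xc
```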
Objective function for PCA
Key difference from supervised learning problems:
No labels given, which means no ground-truth to measure the quality of the answer!
However, we can still write an optimization problem based on our high-level goal.
For clarity, we first discuss the special case of $k = 1$.
Optimization problem for finding the 1st principal component $v_1$:
Figure from CS168 notes
An example:
Objective function for larger values of k
The generalization of the original formulation for general k is to find a k-dimensional subspace S
such that the points are as close to it as possible:
$$S = \operatorname*{argmin}_{k\text{-dim subspaces } S} \sum_{i=1}^{n} (\text{distance between } x_i \text{ and subspace } S)^2$$
By the same reasoning as for k = 1, this is equivalent to,
$$S = \operatorname*{argmax}_{k\text{-dim subspaces } S} \sum_{i=1}^{n} (\text{length of } x_i\text{'s projection on } S)^2$$
It is useful to think of the subspace $S$ as the span of $k$ orthonormal vectors $v_1, \ldots, v_k \in \mathbb{R}^d$.
For example:
• $k = 1$: the span is a line through the origin.
• $k = 2$: if $v_1, v_2$ are linearly independent, the span is a plane through the origin, and so on.
Formal problem solved by PCA:
Given $x_1, \ldots, x_n \in \mathbb{R}^d$ and a parameter $k \ge 1$, compute orthonormal vectors $v_1, \ldots, v_k \in \mathbb{R}^d$ to maximize
$$\sum_{i=1}^{n} \sum_{j=1}^{k} \langle x_i, v_j \rangle^2.$$
Equivalent view:
• Pick v1 to be the variance maximizing direction.
• Pick v2 to be the variance maximizing direction, orthogonal to v1.
• Pick v3 to be the variance maximizing direction, orthogonal to v1 and v2, and so on.
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
Input: $n$ datapoints $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, #components $k$ we want.
Step 1: Perform PCA to get the top $k$ principal components $v_1, \ldots, v_k \in \mathbb{R}^d$.
Step 2: For each datapoint $x_i$, define its "$v_1$-coordinate" as $\langle x_i, v_1 \rangle$, its "$v_2$-coordinate" as $\langle x_i, v_2 \rangle$, and so on.
Therefore we associate $k$ coordinates to each datapoint $x_i$, where the $j$-th coordinate denotes the
extent to which $x_i$ points in the direction of $v_j$.
Step 3: We now have a new "compressed" dataset where each datapoint is $k$-dimensional. For visualization, we can plot the point $x_i$ in $\mathbb{R}^k$ as the point $(\langle x_i, v_1 \rangle, \langle x_i, v_2 \rangle, \ldots, \langle x_i, v_k \rangle)$.
Using PCA for data compression and visualization
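Putting Steps 1–3 together, here is a minimal NumPy sketch that computes the top-$k$ principal components via an eigendecomposition of $X^T X$ (the computation justified later in the lecture) and returns the $k$ new coordinates of each datapoint; the function name is a placeholder and X is assumed to be an already-centered n × d matrix.

```python
import numpy as np

def pca_coordinates(X, k):
    """Return the top-k PCs and the k-dimensional coordinates of each row of X.

    X is assumed to be centered (n x d).
    """
    cov = X.T @ X                               # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    V = eigvecs[:, order]                       # columns = top-k PCs v_1, ..., v_k
    coords = X @ V                              # <x_i, v_j> for each i, j
    return V, coords
```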
Visualization example: Do Genomes Encode Geography?
Dataset: genomes of 1,387 Europeans (each individual’s genotype at 200,000 locations in the genome)
n = 1,387, d ≈ 200,000
Project the datapoints onto top 2 PCs
Plot shown below; looks remarkably like the map of Europe!
From paper: ``Genes mirror geography within Europe” Novembre et al., Nature’08
Compression example: Eigenfaces
Dataset: 256x256 (≈65K pixels) dimensional images
of about 2500 faces, all framed similarly
n = 2,500, d ≈ 65,000
We can represent each image with high accuracy
using only 100-150 principal components!
The principal components (called eigenfaces here)
are themselves interpretable too!
From paper: ``Eigenfaces for recognition” Turk & Pentland, Journal of Cognitive Neuroscience’91
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
How to solve the PCA optimization problem?
The diagonal case
Let’s solve $\operatorname*{argmax}_{v:\|v\|_2=1} v^T A v$ for the special case where $A$ is a diagonal matrix.
Any $d \times d$ matrix $A$ can be thought of as a function that maps points in $\mathbb{R}^d$ back to points in $\mathbb{R}^d$: $v \mapsto Av$.
The matrix $\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$ maps $(x, y)$ to $(2x, y)$:
Figure from CS168 notes
Points on the circle $\{(x, y) : x^2 + y^2 = 1\}$ are mapped to the ellipse $\{(x, y) : \left(\tfrac{x}{2}\right)^2 + y^2 = 1\}$.
So what direction $v$ should maximize $v^T A v$ for diagonal $A$?
It should be the direction of maximum stretch:
Diagonals in disguise
The previous figure, rotated 45 degrees. Consider
$$A = \begin{pmatrix} \tfrac{3}{2} & \tfrac{1}{2} \\[2pt] \tfrac{1}{2} & \tfrac{3}{2} \end{pmatrix} = \begin{pmatrix} \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \\[2pt] \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{pmatrix} \cdot \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\[2pt] -\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{pmatrix}$$
$A$ still does nothing other than stretch out different orthogonal axes, possibly with these axes being a "rotated version" of the original ones.
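A quick numeric check of this example (an illustration, not from the slides): the eigendecomposition of $A$ recovers the stretch factors 2 and 1, and the direction of maximum stretch is along $(1, 1)/\sqrt{2}$, i.e. the 45-degree line.

```python
import numpy as np

A = np.array([[1.5, 0.5],
              [0.5, 1.5]])
eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
print(eigvals)                         # [1. 2.]
print(eigvecs[:, -1])                  # top eigenvector, ±[0.7071, 0.7071]
# v^T A v over unit vectors v is maximized by this direction, with value 2.
```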
How do we formalize the concept of a rotation in high dimensions as a matrix operation?
Answer: Orthogonal matrix (also called orthonormal matrix).
Recall that we want to find $v_1 = \operatorname*{argmax}_{v:\|v\|_2=1} v^T A v$.
Now consider $A$ that can be written as $A = QDQ^T$ for an orthogonal matrix $Q$ and a diagonal
matrix $D$ with diagonal entries $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \ldots \ge \lambda_d \ge 0$.
What is the direction which gets stretched the maximum?
(Informal answer) The maximum possible stretch by $D$ is $\lambda_1$. The direction of maximum stretch
under $D$ is $e_1$. Therefore, the direction of maximum stretch under $DQ^T$ is $v$ s.t. $Q^T v = e_1 \implies v = (Q^T)^{-1} e_1 = Qe_1$.
General covariance matrices
When $k = 1$, the solution to $\operatorname*{argmax}_{v:\|v\|_2=1} v^T A v$ is the first column of $Q$, where $A = X^T X = QDQ^T$ with $Q$ orthogonal and $D$ diagonal with sorted entries.
General values of k
What is the solution to the PCA objective for general values of $k$?
$$\sum_{i=1}^{n} \sum_{j=1}^{k} \langle x_i, v_j \rangle^2$$
Solution: Pick the first $k$ columns of $Q$, where the covariance $X^T X = QDQ^T$ with $Q$ orthogonal
and $D$ diagonal with sorted entries.
Since $Q$ is orthogonal, the first $k$ columns of $Q$ are orthogonal vectors. These are called the top $k$
principal components (PCs).
Eigenvalues & eigenvectors
How to compute the top $k$ columns of $Q$ in the decomposition $X^T X = QDQ^T$?
Solution: Eigenvalue decomposition!
Eigenvectors: axes of stretch in the geometric intuition
Eigenvalues: stretch factors
PCA boils down to computing the $k$ eigenvectors of the covariance matrix $X^T X$
that have the largest eigenvalues.
Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
How many PCs to use?
For visualization, we usually choose $k$ to be small and just pick the first few principal components.
In other applications such as compression, it is a good idea to plot the eigenvalues and see. A lot of
data is close to being low rank, so the eigenvalues may decay and become small.
We can also choose the threshold based on how much variance we want to capture. Suppose we
want to capture 90% of the variance in the data. Then we can pick $k$ such that
$$\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j} \ge 90\%$$
where $\lambda_1 \ge \cdots \ge \lambda_d$ are the sorted eigenvalues.
Note: $\sum_{j=1}^{d} \lambda_j = \operatorname{trace}(X^T X)$, so there is no need to actually find all the eigenvalues.
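A short sketch of this rule (for simplicity it computes all eigenvalues rather than using the trace shortcut): find the smallest $k$ whose cumulative share of the eigenvalue sum reaches the target; the function name is a placeholder and X is assumed to be a centered data matrix.

```python
import numpy as np

def choose_k(X, target=0.90):
    """Smallest k capturing at least `target` fraction of the variance."""
    eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]      # eigenvalues sorted descending
    ratio = np.cumsum(eigvals) / eigvals.sum()       # cumulative variance share
    return int(np.searchsorted(ratio, target) + 1)   # first k with ratio >= target
```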
When and why does PCA fail?
1. Data is not properly
scaled/normalized.
2. Non-orthogonal structure in data: PCs
are forced to be orthogonal, and
there may not be too many
orthogonal components in the data
which are all interpretable.
3. Non-linear structure in data.