2. Administrivia
• Project details are out
• Form groups (of 4) by Nov 11; the minimum team size is 3.
• Q5 on HW4 will help you get started on it.
• We’ll give an overview of the project and general tips in today’s discussion.
• HW4 is due in about two weeks (Nov 16 at 2pm).
• We’ll release another question on PCA tomorrow.
• Today’s plan:
• Finish ensemble methods
• Unsupervised learning:
• PCA
• Clustering
5. Bagging
Collect T subsets each of some fixed size (say m) by sampling with replacement from
training data.
Let ft(x) be the classifier (such as a decision tree) obtained by training on the subset
t ∈ {1, . . . , T}. Then the aggregated classifier fagg(x) is given by:
fagg(x) = (1/T) Σ_{t=1}^T ft(x) for regression,
fagg(x) = sign( (1/T) Σ_{t=1}^T ft(x) ) = Majority Vote{ft(x)}_{t=1}^T for classification.
• Reduces overfitting (i.e., variance)
• Can work with any type of classifier (here focus on trees)
• Easy to parallelize (can train multiple trees in parallel)
• But loses interpretability compared to a single decision tree (true for all ensemble methods)
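A minimal sketch of the aggregation above, assuming a generic labeled dataset (X, y) with labels in {−1, +1} and scikit-learn decision trees as the base classifiers (function names like `bagging_fit` are ours, not from the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, m=None, seed=0):
    """Train T trees, each on a bootstrap sample of size m (default: n)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = n if m is None else m
    trees = []
    for _ in range(T):
        idx = rng.integers(0, n, size=m)          # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Aggregate: sign of the average prediction = majority vote for ±1 labels."""
    avg = np.mean([t.predict(X) for t in trees], axis=0)
    return np.sign(avg)                           # ties map to 0; break arbitrarily if needed
```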
7. Random forests: When growing a tree on a bootstrapped (i.e. subsampled) dataset, before each split select q ≤ d of the d input variables at random as candidates for splitting.
When q = d → same as bagging
When q < d → random forests
q is a hyperparameter, tuned via cross-validation
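As a sketch (not the lecture's reference code): in scikit-learn the per-split candidate count q corresponds to the `max_features` argument, and it can be tuned by cross-validation. `X_train`, `y_train` below are placeholders for your data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# max_features=None uses all d features at each split (i.e. bagging);
# smaller values give a random forest.
param_grid = {"max_features": [1, 2, "sqrt", None]}
search = GridSearchCV(RandomForestClassifier(n_estimators=200, random_state=0),
                      param_grid, cv=5)
# search.fit(X_train, y_train)
# print(search.best_params_)
```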
9. Boosting: Idea
The boosted predictor is of the form fboost(x) = sign(h(x)), where
h(x) = Σ_{t=1}^T βt ft(x) for βt ≥ 0 and ft ∈ F.
The goal is to minimize ℓ(h(x), y) for some loss function ℓ.
Q: We know how to find the best predictor in F on some data, but how do we find the best weighted
combination h(x)?
A: Minimize the loss by a greedy approach, i.e. find βt, ft(x) one by one for t = 1, . . . , T.
Specifically, let ht(x) = Σ_{τ=1}^t βτ fτ(x). Suppose we have found ht−1(x); how do we find βt, ft(x)?
Find the βt, ft(x) which minimize the loss ℓ(ht(x), y).
Different loss functions ℓ give different boosting algorithms:
ℓ(h(x), y) = (h(x) − y)^2 → Least squares boosting,
ℓ(h(x), y) = exp(−h(x) y) → AdaBoost.
11. Given a training set S and a base algorithm A, initialize D1 to be uniform.
For t = 1, . . . , T:
• obtain a weak classifier ft(x) ← A(S, Dt)
• calculate the weight βt of ft(x) as
βt = (1/2) ln((1 − εt)/εt)    (βt > 0 ⇔ εt < 0.5)
where εt = Σ_{i: ft(xi) ≠ yi} Dt(i) is the weighted error of ft(x).
• update the distribution
Dt+1(i) ∝ Dt(i) exp(−βt yi ft(xi)) = Dt(i) e^{−βt} if ft(xi) = yi, and Dt(i) e^{βt} otherwise.
Output the final classifier fboost = sgn(Σ_{t=1}^T βt ft(x)).
AdaBoost: Full algorithm
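A compact sketch of the algorithm above, with the base algorithm A taken to be a depth-1 scikit-learn tree (a decision stump) trained with sample weights Dt; the function names are ours, not from the lecture:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=20):
    """AdaBoost with decision stumps; y must have labels in {-1, +1}."""
    n = X.shape[0]
    D = np.full(n, 1.0 / n)                          # D_1: uniform distribution
    betas, stumps = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error eps_t
        beta = 0.5 * np.log((1 - eps) / eps)         # beta_t > 0 iff eps_t < 0.5
        D = D * np.exp(-beta * y * pred)             # up-weight mistakes, down-weight correct
        D = D / D.sum()                              # renormalize to a distribution
        betas.append(beta)
        stumps.append(stump)
    return betas, stumps

def adaboost_predict(betas, stumps, X):
    """Final classifier: sign of the weighted vote."""
    return np.sign(sum(b * s.predict(X) for b, s in zip(betas, stumps)))
```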
12. Put more weight on difficult-to-classify instances and less on those already handled well.
New weak learners are added sequentially, focusing their training on the more difficult patterns.
AdaBoost: Example
10 data points in R2
The size of + or - indicates the weight, which starts from uniform D1
Base algorithm is a decision stump: weak classifiers are vertical or horizontal half-planes.
[Figure: the toy example over several rounds, showing how the weights Dt and the chosen stumps evolve]
Go through the calculations in the example to make sure you understand the algorithm
15. Gradient Boosting
Recall ht(x) = Σ_{τ=1}^t βτ fτ(x). For AdaBoost (exponential loss), given ht−1(x), we found what ft(x) should be.
Gradient boosting provides an iterative approach for a general (any) loss function ℓ(h(x), y):
• For all training datapoints (xi, yi), compute the negative gradient (the pseudo-residual)
ri = − [∂ℓ(h(xi), yi) / ∂h(xi)] evaluated at h(xi) = ht−1(xi)
• Use the weak learner to find ft which fits (xi, ri) as well as possible:
ft = argmin_{f∈F} Σ_{i=1}^n (ri − f(xi))^2.
• Update ht(x) = ht−1(x) + ηft(x), for some step size η.
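A minimal sketch for squared loss ℓ(h, y) = ½(h − y)^2, where the negative gradient is just the residual y − h and the weak learners are small regression trees (our naming, not the lecture's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, eta=0.1, max_depth=3):
    """Gradient boosting for squared loss: fit each tree to the residuals."""
    h = np.zeros(len(y))                             # h_0 = 0
    trees = []
    for _ in range(T):
        r = y - h                                    # negative gradient = residual
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        h = h + eta * tree.predict(X)                # h_t = h_{t-1} + eta * f_t
        trees.append(tree)
    return trees

def gb_predict(trees, X, eta=0.1):
    return eta * sum(t.predict(X) for t in trees)
```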
16. Gradient Boosting
Usually we add some regularization term to prevent overfitting (penalize the size of the tree etc.)
Gradient boosting is extremely successful!!
A variant, XGBoost, is one of the most popular algorithms for structured data (tables etc. with numbers and categories, where each feature typically has some meaning, unlike images or text).
(E.g., in Kaggle competitions back in 2015, 17 out of 29 winning solutions used XGBoost.)
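Typical usage looks roughly like the sketch below, via XGBoost's scikit-learn-style wrapper; the hyperparameter values are placeholders, not recommendations, and `X_train`, `y_train`, `X_test` stand in for your tabular data.

```python
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.1,
                     reg_lambda=1.0)   # L2 regularization on leaf weights to curb overfitting
# model.fit(X_train, y_train)
# preds = model.predict(X_test)
```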
18. A simplistic taxonomy of ML
• Supervised learning: aim to predict outputs of future datapoints
• Unsupervised learning: aim to discover hidden patterns and explore data
• Reinforcement learning: aim to make sequential decisions
19. Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
20. Acknowledgement & further reading
Our presentation is closely based on Gregory Valiant’s
notes for CS168 at Stanford.
https://web.stanford.edu/class/cs168/l/l7.pdf
https://web.stanford.edu/class/cs168/l/l8.pdf
You can refer to these notes for further reading.
Also review our Linear algebra Colab notebooks:
Part 1
Part 2
21. We’ll start with a simple and fundamental unsupervised learning problem: dimensionality reduction.
Goal: reduce the dimensionality of a dataset so that
• it is easier to visualize and discover patterns
• it takes less time and space to process for any downstream application (classification, regression, etc)
• noise is reduced
• · · ·
There are many approaches; we focus on a linear method: Principal Component Analysis (PCA).
Dimensionality reduction & PCA
22. Consider the following dataset:
• 17 features, each represents the average consumption of some food
• 4 data points, each represents some country.
What can you tell?
Hard to say anything looking at all these 17 features.
Picture from here
See this for more details
PCA: Motivation
23. PCA: Motivation
PCA can help us! The projection of the data onto its first principal component:
i.e. we reduce the dimensionality from 17 to just 1.
Now one data point is clearly different from the rest!
24. PCA: Motivation
PCA can help us! The projection of the data onto its first principal component (PC1):
i.e. we reduce the dimensionality from 17 to just 1.
Now one data point is clearly different from the rest!
That turns out to be the data from Northern Ireland, the only country not on the island of Great Britain out of the 4 samples.
We can also interpret the components: PC1 tells us that the Northern Irish eat more grams of fresh potatoes and fewer of fresh fruits and alcoholic drinks.
26. PCA: Motivation
And the components themselves are interpretable too:
See this for more details
27. Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
28. Suppose we have a dataset of n datapoints x1, x2, . . . , xn ∈ R^d.
The high-level goal of PCA is to find a set of k principal components (PCs) / principal vectors v1, . . . , vk ∈ R^d such that for each xi,
xi ≈ Σ_{j=1}^k αij vj
for some coefficients αij ∈ R.
High-level goal
29. Preprocessing the data
• Before we apply PCA, we usually preprocess the data to center it
• In many applications, it is also important to scale each coordinate properly. This is especially true if
the coordinates are in different units or scales.
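For instance, a small sketch of this preprocessing with numpy (X is a placeholder n × d data matrix):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 17))   # placeholder n x d data
X_centered = X - X.mean(axis=0)                        # subtract the per-feature mean
X_scaled = X_centered / X_centered.std(axis=0)         # optional: unit variance per coordinate
```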
31. Objective function for PCA
Key difference from supervised learning problems:
No labels given, which means no ground-truth to measure the quality of the answer!
However, we can still write an optimization problem based on our high-level goal.
For clarity, we first discuss the special case of k = 1.
Optimization problem for finding the 1st principal component v1:
v1 = argmin_{v: ‖v‖2 = 1} Σ_{i=1}^n (distance between xi and the line spanned by v)^2
   = argmax_{v: ‖v‖2 = 1} Σ_{i=1}^n ⟨xi, v⟩^2
34. Objective function for larger values of k
The generalization of the original formulation for general k is to find a k-dimensional subspace S such that the points are as close to it as possible:
S = argmin_{k-dim subspaces S} Σ_{i=1}^n (distance between xi and subspace S)^2
By the same reasoning as for k = 1, this is equivalent to
S = argmax_{k-dim subspaces S} Σ_{i=1}^n (length of xi's projection on S)^2
It is useful to think of the subspace S as the span of k orthonormal vectors v1, . . . , vk ∈ R^d.
35. Example,
• k = 1, span is line through the origin.
• k = 2, if v1, v2 are linearly independent, the span is a plane through the origin, and so on.
36. Formal problem solved by PCA:
Given x1, . . . , xn ∈ R^d and a parameter k ≥ 1, compute orthonormal vectors v1, . . . , vk ∈ R^d to maximize
Σ_{i=1}^n Σ_{j=1}^k ⟨xi, vj⟩^2.
Equivalent view:
• Pick v1 to be the variance maximizing direction.
• Pick v2 to be the variance maximizing direction, orthogonal to v1.
• Pick v3 to be the variance maximizing direction, orthogonal to v1 and v2, and so on.
37. Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
38. Input: n datapoints x1, x2, . . . , xn ∈ R^d, and the number of components k we want.
Step 1 Perform PCA to get the top k principal components v1, . . . , vk ∈ R^d.
Step 2 For each datapoint xi, define its “v1-coordinate” as ⟨xi, v1⟩, its “v2-coordinate” as ⟨xi, v2⟩, and so on. Therefore we associate k coordinates to each datapoint xi, where the j-th coordinate denotes the extent to which xi points in the direction of vj.
Step 3 We now have a new “compressed” dataset where each datapoint is k-dimensional. For visualization, we can plot the point xi in R^k as the point (⟨xi, v1⟩, ⟨xi, v2⟩, . . . , ⟨xi, vk⟩).
Using PCA for data compression and visualization
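In practice Steps 1–3 are a few lines, e.g. with scikit-learn's PCA (which centers the data internally); X below is a placeholder n × d array and the plotting lines assume matplotlib:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 17))   # placeholder data
Z = PCA(n_components=2).fit_transform(X)              # n x 2 coordinates <x_i, v_j>
# import matplotlib.pyplot as plt
# plt.scatter(Z[:, 0], Z[:, 1]); plt.show()           # 2-D visualization
```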
39. Visualization example: Do Genomes Encode Geography?
Dataset: genomes of 1,387 Europeans (each individual’s genotype at 200,000 locations in the genome)
n = 1,387, d ≈ 200,000
Project the datapoints onto the top 2 PCs.
Plot shown below; it looks remarkably like the map of Europe!
From the paper “Genes mirror geography within Europe”, Novembre et al., Nature ’08
40. Compression example: Eigenfaces
Dataset: 256×256 (≈65K pixels) dimensional images of about 2,500 faces, all framed similarly
n ≈ 2,500, d ≈ 65,000
We can represent each image with high accuracy using only 100-150 principal components!
The principal components (called eigenfaces here) are themselves interpretable too!
From the paper “Eigenfaces for recognition”, Turk & Pentland, Journal of Cognitive Neuroscience ’91
41. Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
44. The diagonal case
Let’s solve argmax_{v: ‖v‖2 = 1} v^T A v for the special case where A is a diagonal matrix.
Any d × d matrix A can be thought of as a function that maps points in R^d back to points in R^d: v ↦ Av.
45. The matrix
[ 2  0 ]
[ 0  1 ]
maps (x, y) to (2x, y).
Figure from CS168 notes
Points on the circle {(x, y) : x^2 + y^2 = 1} are mapped to the ellipse {(x, y) : (x/2)^2 + y^2 = 1}.
46. So what direction v should maximize v^T A v for diagonal A?
It should be the direction of maximum stretch:
47. Diagonals in disguise
The previous figure, rotated 45 degrees.
Consider
A = [ 3/2  1/2 ] = [ 1/√2  −1/√2 ] · [ 2  0 ] · [  1/√2  1/√2 ]
    [ 1/2  3/2 ]   [ 1/√2   1/√2 ]   [ 0  1 ]   [ −1/√2  1/√2 ]
This matrix still does nothing other than stretch out different orthogonal axes, possibly with these axes being a “rotated version” of the original ones.
48. How do we formalize the concept of a rotation in high dimensions as a matrix operation?
Answer: Orthogonal matrix (also called orthonormal matrix).
49. Recall that we want to find v1 = argmax_{v: ‖v‖2 = 1} v^T A v.
Now consider A that can be written as A = QDQ^T for an orthogonal matrix Q and a diagonal matrix D with diagonal entries λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λd ≥ 0.
What is the direction which gets stretched the maximum?
(Informal answer) The maximum possible stretch by D is λ1. The direction of maximum stretch under D is e1. Therefore, the direction of maximum stretch under DQ^T is the v such that Q^T v = e1 ⟹ v = (Q^T)^{−1} e1 = Qe1.
51. General covariance matrices
When k = 1, the solution to argmax_{v: ‖v‖2 = 1} v^T A v is the first column of Q, where A = X^T X = QDQ^T with Q orthogonal and D diagonal with sorted entries.
52. General values of k
What is the solution to the PCA objective for general values of k?
Σ_{i=1}^n Σ_{j=1}^k ⟨xi, vj⟩^2
Solution: Pick the first k columns of Q, where the covariance X^T X = QDQ^T with Q orthogonal and D diagonal with sorted entries.
Since Q is orthogonal, the first k columns of Q are orthonormal vectors. These are called the top k principal components (PCs).
53. Eigenvalues & eigenvectors
How to compute the top k columns of Q in the decomposition X^T X = QDQ^T?
Solution: Eigenvalue decomposition!
Eigenvectors: axes of stretch in geometric intuition
Eigenvalues: stretch factors
54. PCA boils down to computing the k eigenvectors of the covariance matrix X^T X that have the largest eigenvalues.
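Concretely, a sketch of this computation with numpy's symmetric eigensolver (the helper name is ours; X is a generic n × d data matrix):

```python
import numpy as np

def top_k_pcs(X, k):
    """Return the top-k principal components and eigenvalues of X^T X."""
    Xc = X - X.mean(axis=0)                  # center the data first
    cov = Xc.T @ Xc                          # d x d (unnormalized) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending
    return eigvecs[:, order[:k]], eigvals[order[:k]]
```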
55. Principal Component Analysis (PCA)
• Introduction
• Formalizing the problem
• How to use PCA, and examples
• Solving the PCA optimization problem
• Conclusion
56. For visualization, we usually choose k to be small and just pick the first few principal components.
In other applications such as compression, it is a good idea to plot the eigenvalues and see how quickly they decay. A lot of data is close to being low rank, so the eigenvalues may decay rapidly and become small.
We can also choose k based on how much variance we want to capture. Suppose we want to capture 90% of the variance in the data. Then we can pick the smallest k such that
(Σ_{j=1}^k λj) / (Σ_{j=1}^d λj) ≥ 90%
where λ1 ≥ · · · ≥ λd are the sorted eigenvalues.
Note: Σ_{j=1}^d λj = trace(X^T X), so there is no need to actually compute all the eigenvalues.
How many PCs to use?
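A small sketch of this rule, assuming the eigenvalues have already been computed (e.g. by the helper above):

```python
import numpy as np

def choose_k(eigvals, threshold=0.90):
    """Smallest k whose top eigenvalues capture at least `threshold` of the variance."""
    lam = np.sort(eigvals)[::-1]              # lambda_1 >= ... >= lambda_d
    frac = np.cumsum(lam) / lam.sum()         # cumulative fraction of variance captured
    return int(np.searchsorted(frac, threshold) + 1)
```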
57. When and why does PCA fail?
1. Data is not properly scaled/normalized.
2. Non-orthogonal structure in data: PCs are forced to be orthogonal, and there may not be too many orthogonal components in the data which are all interpretable.
3. Non-linear structure in data.