Supervised machine
learning
Part 1: k-NN and Naive Bayes
All classifiers use training data to carve up the space and group similar examples
together. They just differ in the geometry they use to do it.
Classifiers just differ by how they carve the space
Classifier → Borders defined by
● k Nearest Neighbors → distances to the nearest neighbor(s)
● Decision Trees → vertical or horizontal boundaries only
● Perceptron or Logistic regression → lines/planes/hyperplanes (can be diagonal)
● Naive Bayes → conic sections (various curves), since groups are captured by gaussian ellipses
[Figure: example decision borders for kNN, Decision Trees, Naive Bayes, and the Perceptron]
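The deck shows these borders only as pictures, but the contrast is easy to reproduce. Below is a minimal sketch, assuming scikit-learn and NumPy are available; the blob data and model settings are invented for illustration, not taken from the course.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy 2-D, 3-class data standing in for the slide's point clouds
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)

models = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "logistic regression": LogisticRegression(),
    "Gaussian naive Bayes": GaussianNB(),
}

# Label every point of a dense grid with each model; the models agree on the
# training clusters but carve the space between them quite differently.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
grid = np.c_[xx.ravel(), yy.ravel()]
regions = {name: m.fit(X, y).predict(grid) for name, m in models.items()}

base = regions["kNN (k=5)"]
for name, region in regions.items():
    print(f"{name}: disagrees with kNN on {np.mean(region != base):.1%} of the plane")
```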
A review of k-nearest neighbors for context
Review of kNN
The hat made these decisions on the current class of students.
Each dot represents a student in one of the 3 houses.
kNN with Harry Potter - the training data
[Scatter plot: X = Trustworthiness, Y = Courage; points colored by house - Gryffindor, Hufflepuff, Slytherin]
If I give you a new person (with an X and Y coordinate), what would be the likely
house? - Simple, pick the closest!
A new person? Pick the nearest (kNN with k=1)
Example: X = 0.5, Y = 0.35 → Hufflepuff!
[Figure: the classes assigned to new samples using this strategy (kNN with k = 1), for every possibility in the X-Y space]
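In code, the k = 1 rule is a single classifier call. A hedged sketch assuming scikit-learn; the house coordinates are invented stand-ins for the slide's scatter plot, and only the query point (X = 0.5, Y = 0.35) and its answer come from the deck.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical (trustworthiness, courage) pairs for a few sorted students
X_train = [[0.2, 0.9], [0.3, 0.8],   # Gryffindor: high courage
           [0.6, 0.3], [0.5, 0.4],   # Hufflepuff: high trustworthiness
           [0.1, 0.2], [0.2, 0.1]]   # Slytherin: low on both axes
y_train = ["Gryffindor", "Gryffindor",
           "Hufflepuff", "Hufflepuff",
           "Slytherin", "Slytherin"]

knn = KNeighborsClassifier(n_neighbors=1)   # k = 1: copy the single closest student
knn.fit(X_train, y_train)
print(knn.predict([[0.5, 0.35]]))           # -> ['Hufflepuff']
```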
If I just gave you an X and Y coordinate near these circled points, what should you
pick as the right class?
Graphically, avoiding outliers (k=1 vs k=5 for kNN)
[Figure: kNN decision regions with k = 1 vs k = 5]
These points are likely mistakes or unexplained outliers to disregard. Polling your nearest neighbor (k = 1) will lead to errors in these places, but polling your 5 nearest neighbors (k = 5) avoids this problem.
k too low:
You may not want every single example to impact your
decision making
● When k = 1 in kNN, one mistake affects anything similar
in the future
k too high:
But you don’t want to “average out” too much
● High k values may wipe out meaningful variations/
“peninsulas and islands” in our space of classification
● What would happen if a group only had 5 members with
k=20? It would never be chosen!
Finding the right k for kNN
[Figure: decision regions for low k vs high k]
The “k” of k Nearest Neighbors is a “hyperparameter”.
Almost all machine learning algorithms have
hyperparameters.
Many have a similar “goldilocks” nature to them:
● k too low → “Overfitting” →
○ Decisions based on noise
● k too high → “Underfitting” →
○ Decisions not sensitive to important subgroups in
the data set
Finding the right k for kNN
[Figure: k too low → “Overfit”; k too high → “Overgeneralize”]
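The slides do not prescribe how to find the "goldilocks" k; cross-validated grid search is one standard recipe. A sketch assuming scikit-learn, run on synthetic data rather than the course data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D data with some class overlap, so noise matters
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=2, random_state=0)

# Try a range of k values and keep the one with the best cross-validated accuracy
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 9, 15, 25, 51]},
                      cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])

# k = 1 typically scores perfectly on the training set (overfit) yet worse in
# cross-validation; very large k averages away small subgroups (underfit).
```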
Hyperparameters balance between making the model fit too closely to outliers/exceptions and making the model too insensitive to important subgroups in the data.
Protip: ML hyperparameters trade off in complexity
● kNN (k nearest neighbors)
○ hyperparameter: k, the number of neighbors to poll
○ underfitting (“too simple”): k too high
○ overfitting (“too complex”): k too low (e.g. k = 1)
● polynomial regression (e.g. linear, quadratic, cubic...)
○ hyperparameter: n, the degree of the polynomial: y = a_n x^n + … + a_1 x + a_0
○ underfitting: degree too low, e.g. a line when a curve is better
○ overfitting: degree too high, e.g. a degree-20 polynomial to fit 20 points
● regularized linear (and logistic) regression (e.g. LASSO, ridge, elastic net...)
○ hyperparameter: λ, the regularization strength (varies the # of features to include)
○ underfitting: λ too high - might pick only one or two features, the rest are 0’s
○ overfitting: λ too low (e.g. λ = 0) - just plain linear regression, using all features
● SVM (support vector machines)
○ hyperparameter: C, the slack-variable penalty (lower allows more “mistakes”)
○ underfitting: C too low - disregards too many points as “outliers”
○ overfitting: C too high - awkwardly adjusts the border to correctly classify every single example
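As a concrete look at the regularization row, the sketch below (scikit-learn assumed, synthetic data; scikit-learn calls λ “alpha”) shows LASSO driving coefficients to exactly 0 as λ grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 candidate features, only 5 of which actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    n_used = int(np.sum(model.coef_ != 0))
    print(f"lambda = {alpha:>7}: {n_used} of 20 features kept")

# lambda too low  -> keeps (almost) every feature, behaves like plain linear regression
# lambda too high -> keeps only one or two features, or none at all
```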
kNN Challenges - efficient lookup
A naive implementation takes O(k · n_training) time to classify one data point!
This is unacceptable as the number of training
points may be in the millions.
Solution: use a data structure that partitions the
space of points so you only have to calculate
distances for likely nearby data points.
k-D trees allow for average-case O(log n) lookup time rather than O(n). We spend O(n) space on the tree to escape O(n) query time - a fair trade.
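The slides do not name a library, but SciPy's cKDTree is one off-the-shelf k-D tree. A sketch on random data, showing the tree returns the same neighbors as brute force.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train = rng.random((1_000_000, 2))      # a million 2-D training points
query = rng.random(2)

# Brute force: compute all n distances, keep the 5 smallest -> O(n) per query
brute = np.argsort(np.linalg.norm(train - query, axis=1))[:5]

# k-D tree: build once (O(n) extra space), then query in ~O(log n) on average
tree = cKDTree(train)
_, tree_idx = tree.query(query, k=5)

print(np.array_equal(np.sort(brute), np.sort(tree_idx)))   # True: same neighbors
```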
kNN Challenges - normalization
Without normalization, features with larger variances have excessive influence
[Figure: the same data without normalization vs. with normalization]
Normalization approaches
A similar problem arises in other classifiers. Because of this, many methods may normalize automatically and fix the normalization parameters from the training set.
Examples of typical normalizations:
● For transformed features with mean 0 and variance 1 (for the training set):
x_normalized = (x - μ_training) / σ_training
● For transformed features between 0 and 1 (based on the training set):
x_normalized = (x - min_training) / (max_training - min_training)
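The two formulas written out with NumPy on made-up numbers. The point to notice is that μ, σ, min, and max come from the training set only and are then reused unchanged for any new point.

```python
import numpy as np

# Toy training set: one feature on a ~150-180 scale, one on a ~1-4 scale
X_train = np.array([[150.0, 1.2], [160.0, 3.4], [170.0, 2.1], [180.0, 4.0]])
x_new = np.array([165.0, 2.5])          # a new point to classify later

# Mean 0, variance 1 (z-score), using training-set statistics
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
print((x_new - mu) / sigma)

# Scaled to [0, 1], using training-set min and max
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
print((x_new - lo) / (hi - lo))

# scikit-learn's StandardScaler / MinMaxScaler (if available) do the same thing:
# fit() on the training set, then transform() everything else with those parameters.
```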
kNN Challenges - features treated equally
The largest drawback of kNN is its inability to adjust the contribution of different features.
Being able to value some features more, and disregard other
features is a key ability for almost all machine learning
models.
Now we will talk about a technique which can use the quality
of features for classification to improve performance.
Naive Bayes
Visually, Naive Bayes fits multidimensional gaussians to clouds of points to define
a class.
Conditional probability and Bayes’ Rule
Recall, p(A|B) = p(A, B) / p(B)
Bayes’ rule:
p(A|B) = p(B|A) p(A) / p(B) … but more meaningfully:
p(hypothesis | data) ∝ p(data | hypothesis) * p(hypothesis)
posterior ∝ likelihood * prior … more on this later
Conditional probability intuition
● P(A|B) intuition - vs. P(B|A) and P(A and B)
○ P(it’s raining) = 0.15
○ P(I’m carrying an umbrella) = 0.12
○ P(it’s raining AND I’m carrying an umbrella) = 0.10
○ P(I’m carrying an umbrella | it’s raining) = 0.666...
○ P(it’s raining | I’m carrying an umbrella) = 0.833...
● Last two are from P(A|B) = P(A and B) / P(B)
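A quick check of that arithmetic in plain Python:

```python
# P(A|B) = P(A and B) / P(B), applied to the umbrella numbers above
p_rain, p_umbrella, p_both = 0.15, 0.12, 0.10
print(p_both / p_rain)      # P(umbrella | rain) = 0.666...
print(p_both / p_umbrella)  # P(rain | umbrella) = 0.833...
```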
Conditional probability rules
● Use a venn diagram if necessary
● P(A and A’) = 0 also written as P(A,A’)
● P(A or B) = P(A) + P(B) - P(A and B)
● P(A or A’) = P(A) + P(A’) = 1
● P(A) = P(A,B) + P(A,B’)
● P(A,B) = P(A|B) P(B) = also P(B|A) P(A)
○ Which leads to...
Bayes rule
P(A | B) = P(B | A) * P(A) / P(B)
And another formulation we will find useful...
P(A | B) = P(B | A) * P(A) / [ P(B|A) P(A) + P(B|A’) P(A’) ]
(because P(B) = P(B,A) + P(B,A’) = P(B|A) P(A) + P(B|A’) P(A’))
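For the worked examples that follow, the expanded form is convenient to wrap in a small helper. This is just a restatement of the formula above in plain Python, not code provided with the course.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(A|B) from P(A), P(B|A) and P(B|A'), via P(B) = P(B|A)P(A) + P(B|A')P(A')."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Sanity check: if the evidence is equally likely either way, the prior is unchanged
print(posterior(0.5, 0.7, 0.7))   # 0.5
```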
Bayes cancer test example
There is a test for cancer with a 1% error rate (both false positives and false
negatives). 1 in 10,000 people have this form of cancer.
Given a positive test result, what is the probability the person developed the
disease?
Cancer test example worked
P(D+|T+) = ?
= P(T+|D+) P(D+) / P(T+)
P(T+|D+) = 0.99 : 1 - false negative rate
P(D+) = 0.0001
P(T+) = P(T+,D+) + P(T+,D-)
= P(T+|D+) P(D+) + P(T+|D-) P(D-)
= 0.99 * 0.0001 + 0.01 * 0.9999
~ 0.01
P(D+|T+) = 0.0099 → < 1% chance!
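The same calculation in plain Python. The exact answer is about 0.0098; the slide's 0.0099 comes from rounding the denominator to 0.01, and either way it is under 1%.

```python
p_disease = 0.0001                # 1 in 10,000 people have the disease
p_pos_given_disease = 0.99        # 1 - false negative rate
p_pos_given_healthy = 0.01        # false positive rate

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))        # ≈ 0.0101
print(p_pos_given_disease * p_disease / p_pos)           # ≈ 0.0098 -> still < 1%
```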
Another example, trick coins
Assuming you encounter trick coins at a rate of 0.1% in the world when someone is flipping for you, is it more or less likely that you are seeing a trick (vs. fair) coin after 10 consecutive heads?
Trick coin example worked out
P(fair | 10 heads) = ?
= P(10 heads | fair) P(fair) / P(10 heads)
P(fair) = 0.999 : 1 - P(trick)
P(10 heads | fair) = (1/2)^10
P(10 heads) = P(10H|fair) P(fair) + P(10H|trick) P(trick)
= (1/2)^10 * 0.999 + 1 * 0.001
~ 0.002
P(fair | 10 heads) = (1/2)^10 * 0.999 / 0.002 ≈ 49% chance!
So after 10 heads in a row, a trick coin has become slightly more likely than a fair one.
Try with a different level of trust in the world (change P(trick)) or different
number of heads in a row.
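The update written as a function of the prior and the number of heads, so the suggested variations are one call away. Plain Python; like the slide, it assumes a trick coin always lands heads.

```python
def p_fair_given_heads(n_heads, p_trick=0.001):
    p_fair = 1 - p_trick
    like_fair = 0.5 ** n_heads      # a fair coin gives n heads with prob (1/2)^n
    like_trick = 1.0                # an always-heads trick coin gives n heads for sure
    evidence = like_fair * p_fair + like_trick * p_trick
    return like_fair * p_fair / evidence

print(p_fair_given_heads(10))                 # ≈ 0.49: fair vs trick is now about even
print(p_fair_given_heads(10, p_trick=0.01))   # more distrust -> trick wins sooner
print(p_fair_given_heads(20))                 # 20 heads: almost certainly a trick coin
```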
Naive Bayes
● for classification: p(class | data) ∝ p(data | class) * p(class)
○ pick the most probable class
● gaussian naive bayes - use gaussian distributions for probabilities
○ create gaussian distributions using available training data
○ each gaussian’s mean and variance are the mean and variance of that feature in the training data (per class)
● Multiple features? just treat their contributions to the overall probability independently
● Advantages of GNB
○ exceptionally fast training
○ relative to other approaches, it works well when there is little data
● Disadvantages of GNB
○ independence assumption
○ gaussians may not be appropriate to model the data distribution
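A minimal run with scikit-learn's GaussianNB, which follows the recipe in the bullets above (per-class means and variances plus class priors). The iris data set here is just a convenient stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gnb = GaussianNB().fit(X_tr, y_tr)      # "training" is just means, variances, priors
print("test accuracy:", gnb.score(X_te, y_te))
print("per-class feature means:\n", gnb.theta_)   # the fitted gaussian means
```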
Digit recognition example discussion
x represents the image vector (x_1, x_2, x_3, … x_64)
c_k represents class k - that is, one of the 10 digits to recognize
Recall, we’re looking for the highest p(c_k|x) by using this fact:
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
Let’s step through parts of this equation one-by-one
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
● The main assumption of naive Bayes is that the features should be
treated independently (which is why it’s “naive”). This means
○ p(x|c_k) = p(x_1|c_k) * p(x_2|c_k) * … * p(x_64|c_k)
● For each class, k, in the training data:
○ Calculate the mean and variance of each feature for that class
■ Variances of 0 are too extreme; in practice they are softened with something slightly greater than 0
○ Use that and the formula for a gaussian probability to calculate p(x_i|c_k)
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
p(ck) is simply the proportion of that class in the training data.
e.g. if there are 20 fives out of 200 digits in the training sample, p(five) = 20/200 = 0.1
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
● p(x) is the normalization term. You don’t necessarily need to calculate this, since you just want to pick the largest p(c_k|x), and p(x) is the same denominator in calculating p(c_k|x) for every class.
● However, if you want p(c_k|x) to provide a true estimate of the probability, you can use the following formula to calculate p(x):
○ p(x) = Σ_k p(x, c_k) = Σ_k p(x|c_k) p(c_k)
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
The predicted class is the one with the largest p(c_k|x) for each image.
For the assignment:
1. Report the overall accuracy of your prediction.
2. Show the classification (confusion) matrix.
3. Note which errors are more common. In what way does that match your
intuitions?
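A from-scratch sketch of the procedure just described, using NumPy and scikit-learn's 8x8 digits set as a stand-in for the course data. Log-probabilities avoid multiplying 64 tiny numbers together, and a small epsilon softens zero variances as the slide suggests; treat it as an illustration, not a reference solution.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                 # each x = (x_1, ..., x_64)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classes = np.unique(y_tr)
eps = 1e-2                                          # softens zero variances
means = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
varis = np.array([X_tr[y_tr == c].var(axis=0) + eps for c in classes])
priors = np.array([(y_tr == c).mean() for c in classes])      # p(c_k)

def log_posterior(x):
    # log p(c_k) + sum_i log N(x_i | mean_ki, var_ki), unnormalized, one value per class
    log_like = -0.5 * (np.log(2 * np.pi * varis) + (x - means) ** 2 / varis).sum(axis=1)
    return np.log(priors) + log_like

preds = np.array([classes[np.argmax(log_posterior(x))] for x in X_te])
print("accuracy:", (preds == y_te).mean())

# Classification matrix: rows = true digit, columns = predicted digit
conf = np.zeros((10, 10), dtype=int)
for true, pred in zip(y_te, preds):
    conf[true, pred] += 1
print(conf)
```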
Let’s apply this by performing a classification in class!
Numeric:
How many different sports do you watch regularly over a year?
How many hours a week do you watch TV/Movies/Netflix?
How many hours per week do you read a book or magazine offline?
How long is your commute in minutes?
How many years of postgraduate education do you plan on completing?
How many hours per week do you exercise?
On a scale from 1 to 10, how much do you like to go hiking?
On a scale from extrovert (0) to introvert (5) where are you?
How many countries have you travelled to?
What is the percent chance you are home on a Friday night?
How many times a week do you drink coffee?
How much sleep do you get per night in hours?
How many minutes do you spend in the morning getting ready?
How many people do you text in a day?
Categorical:
Cat person or dog person?
Group projects or individual work preferred?
...
Additional material with time permitting
Easy to accumulate information (an optional aside)
● Let’s assume you just figured out P(belief | data1) ∝
P(data1 | belief) P(belief)
● How do you integrate new data2?
○ Recalculate with all the data together?
■ NO! (you can, but it’s such a waste)
○ Just use the old posterior as the new prior
■ which is really where priors come from anyway -
previous data or experience
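A tiny numeric check of this point, reusing the earlier trick-coin setup (plain Python, same always-heads assumption): updating in two batches with the old posterior as the new prior gives the same answer as updating with all the data at once.

```python
def update_p_trick(p_trick, n_heads):
    # posterior P(trick | n heads in a row), trick coin assumed to always land heads
    like_trick, like_fair = 1.0, 0.5 ** n_heads
    evidence = like_trick * p_trick + like_fair * (1 - p_trick)
    return like_trick * p_trick / evidence

all_at_once = update_p_trick(0.001, 10)   # all 10 heads in one update
step1 = update_p_trick(0.001, 5)          # posterior after the first 5 heads...
step2 = update_p_trick(step1, 5)          # ...used as the prior for the next 5
print(all_at_once, step2)                 # identical (up to floating point)
```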
  • 32. Easy to accumulate information (an optional aside) ● Let’s assume you just figured out P(belief | data1) ∝ P(data1 | belief) P(belief) ● How do you integrate new data2? ○ Recalculate with all the data together? ■ NO! (you can, but it’s such a waste) ○ Just use the old posterior as the new prior ■ which is really where priors come from anyway - previous data or experience 32