Supervised machine
learning
Part 1: k-NN and Naive Bayes
All classifiers use training data to carve up the space and group similar examples
together. They just differ in the geometry they use to do it.
Classifiers just differ by how they carve the space
Classifier → Borders defined by
● k Nearest Neighbors → distances to the nearest neighbor(s)
● Decision Trees → vertical or horizontal boundaries only
● Perceptron or Logistic regression → lines/planes/hyperplanes (can be diagonal)
● Naive Bayes → conic sections (various curves), since groups are captured by gaussian ellipses
[Figure: example decision borders for kNN, Decision Trees, Naive Bayes, and the Perceptron]
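The deck shows these borders only as pictures, but the contrast is easy to reproduce. Below is a minimal sketch, assuming scikit-learn and NumPy are available; the blob data and model settings are invented for illustration, not taken from the course.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy 2-D, 3-class data standing in for the slide's point clouds
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)

models = {
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "logistic regression": LogisticRegression(),
    "Gaussian naive Bayes": GaussianNB(),
}

# Label every point of a dense grid with each model; the models agree on the
# training clusters but carve the space between them quite differently.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 200),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 200))
grid = np.c_[xx.ravel(), yy.ravel()]
regions = {name: m.fit(X, y).predict(grid) for name, m in models.items()}

base = regions["kNN (k=5)"]
for name, region in regions.items():
    print(f"{name}: disagrees with kNN on {np.mean(region != base):.1%} of the plane")
```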
A review of k-nearest neighbors for context
Review of kNN
The hat made these decisions on the current class of students.
Each dot represents a student in one of the 3 houses.
kNN with Harry Potter - the training data
[Scatter plot: X = Trustworthiness, Y = Courage; points colored by house - Gryffindor, Hufflepuff, Slytherin]
If I give you a new person (with an X and Y coordinate), what would be the likely
house? - Simple, pick the closest!
A new person? Pick the nearest (kNN with k=1)
Example: X = 0.5, Y = 0.35 → Hufflepuff!
[Figure: the classes assigned to new samples using this strategy (kNN with k = 1), for every possibility in the X-Y space]
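In code, the k = 1 rule is a single classifier call. A hedged sketch assuming scikit-learn; the house coordinates are invented stand-ins for the slide's scatter plot, and only the query point (X = 0.5, Y = 0.35) and its answer come from the deck.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical (trustworthiness, courage) pairs for a few sorted students
X_train = [[0.2, 0.9], [0.3, 0.8],   # Gryffindor: high courage
           [0.6, 0.3], [0.5, 0.4],   # Hufflepuff: high trustworthiness
           [0.1, 0.2], [0.2, 0.1]]   # Slytherin: low on both axes
y_train = ["Gryffindor", "Gryffindor",
           "Hufflepuff", "Hufflepuff",
           "Slytherin", "Slytherin"]

knn = KNeighborsClassifier(n_neighbors=1)   # k = 1: copy the single closest student
knn.fit(X_train, y_train)
print(knn.predict([[0.5, 0.35]]))           # -> ['Hufflepuff']
```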
If I just gave you an X and Y coordinate near these circled points, what should you
pick as the right class?
Graphically, avoiding outliers (k=1 vs k=5 for kNN)
[Figure: kNN decision regions with k = 1 vs k = 5]
These points are likely mistakes or unexplained outliers to disregard. Polling your nearest neighbor (k = 1) will lead to errors in these places, but polling your 5 nearest neighbors (k = 5) avoids this problem.
k too low:
You may not want every single example to impact your
decision making
● When k = 1 in kNN, one mistake affects anything similar
in the future
k too high:
But you don’t want to “average out” too much
● High k values may wipe out meaningful variations/
“peninsulas and islands” in our space of classification
● What would happen if a group only had 5 members with
k=20? It would never be chosen!
Finding the right k for kNN
[Figure: decision regions for low k vs high k]
The “k” of k Nearest Neighbors is a “hyperparameter”.
Almost all machine learning algorithms have
hyperparameters.
Many have a similar “goldilocks” nature to them:
● k too low → “Overfitting” →
○ Decisions based on noise
● k too high → “Underfitting” →
○ Decisions not sensitive to important subgroups in
the data set
Finding the right k for kNN
[Figure: k too low → “Overfit”; k too high → “Overgeneralize”]
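The slides do not prescribe how to find the "goldilocks" k; cross-validated grid search is one standard recipe. A sketch assuming scikit-learn, run on synthetic data rather than the course data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D data with some class overlap, so noise matters
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=2, random_state=0)

# Try a range of k values and keep the one with the best cross-validated accuracy
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 9, 15, 25, 51]},
                      cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])

# k = 1 typically scores perfectly on the training set (overfit) yet worse in
# cross-validation; very large k averages away small subgroups (underfit).
```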
Hyperparameters balance between making the model fit too closely to outliers/exceptions and making the model too insensitive to important subgroups in the data.
Protip: ML hyperparameters trade off in complexity
● kNN (k nearest neighbors)
○ hyperparameter: k, the number of neighbors to poll
○ underfitting (“too simple”): k too high
○ overfitting (“too complex”): k too low (e.g. k = 1)
● polynomial regression (e.g. linear, quadratic, cubic...)
○ hyperparameter: n, the degree of the polynomial: y = a_n x^n + … + a_1 x + a_0
○ underfitting: degree too low, e.g. a line when a curve is better
○ overfitting: degree too high, e.g. a degree-20 polynomial to fit 20 points
● regularized linear (and logistic) regression (e.g. LASSO, ridge, elastic net...)
○ hyperparameter: λ, the regularization strength (varies the # of features to include)
○ underfitting: λ too high - might pick only one or two features, the rest are 0’s
○ overfitting: λ too low (e.g. λ = 0) - just plain linear regression, using all features
● SVM (support vector machines)
○ hyperparameter: C, the slack-variable penalty (lower allows more “mistakes”)
○ underfitting: C too low - disregards too many points as “outliers”
○ overfitting: C too high - awkwardly adjusts the border to correctly classify every single example
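As a concrete look at the regularization row, the sketch below (scikit-learn assumed, synthetic data; scikit-learn calls λ “alpha”) shows LASSO driving coefficients to exactly 0 as λ grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 candidate features, only 5 of which actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    n_used = int(np.sum(model.coef_ != 0))
    print(f"lambda = {alpha:>7}: {n_used} of 20 features kept")

# lambda too low  -> keeps (almost) every feature, behaves like plain linear regression
# lambda too high -> keeps only one or two features, or none at all
```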
kNN Challenges - efficient lookup
A naive implementation takes O(k · n_training) time to classify one data point!
This is unacceptable as the number of training
points may be in the millions.
Solution: use a data structure that partitions the
space of points so you only have to calculate
distances for likely nearby data points.
k-D trees allow for average-case O(log n) lookup time rather than O(n). We spend O(n) space on the tree to escape O(n) query time - a fair trade.
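The slides do not name a library, but SciPy's cKDTree is one off-the-shelf k-D tree. A sketch on random data, showing the tree returns the same neighbors as brute force.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train = rng.random((1_000_000, 2))      # a million 2-D training points
query = rng.random(2)

# Brute force: compute all n distances, keep the 5 smallest -> O(n) per query
brute = np.argsort(np.linalg.norm(train - query, axis=1))[:5]

# k-D tree: build once (O(n) extra space), then query in ~O(log n) on average
tree = cKDTree(train)
_, tree_idx = tree.query(query, k=5)

print(np.array_equal(np.sort(brute), np.sort(tree_idx)))   # True: same neighbors
```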
kNN Challenges - normalization
Without normalization, features with larger variances have excessive influence
[Figure: the same data without normalization vs. with normalization]
Normalization approaches
A similar problem arises in other classifiers. Because of this, many methods may normalize automatically and fix the normalization parameters from the training set.
Examples of typical normalizations:
● For transformed features with mean 0 and variance 1 (for the training set):
x_normalized = (x - μ_training) / σ_training
● For transformed features between 0 and 1 (based on the training set):
x_normalized = (x - min_training) / (max_training - min_training)
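The two formulas written out with NumPy on made-up numbers. The point to notice is that μ, σ, min, and max come from the training set only and are then reused unchanged for any new point.

```python
import numpy as np

# Toy training set: one feature on a ~150-180 scale, one on a ~1-4 scale
X_train = np.array([[150.0, 1.2], [160.0, 3.4], [170.0, 2.1], [180.0, 4.0]])
x_new = np.array([165.0, 2.5])          # a new point to classify later

# Mean 0, variance 1 (z-score), using training-set statistics
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
print((x_new - mu) / sigma)

# Scaled to [0, 1], using training-set min and max
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
print((x_new - lo) / (hi - lo))

# scikit-learn's StandardScaler / MinMaxScaler (if available) do the same thing:
# fit() on the training set, then transform() everything else with those parameters.
```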
kNN Challenges - features treated equally
The largest drawback of kNN is its inability to adjust the contribution of different features.
Being able to value some features more, and disregard other
features is a key ability for almost all machine learning
models.
Now we will talk about a technique which can use the quality
of features for classification to improve performance.
Naive Bayes
Visually, Naive Bayes fits multidimensional gaussians to clouds of points to define
a class.
Conditional probability and Bayes’ Rule
Recall, p(A|B) = p(A, B) / p(B)
Bayes’ rule:
p(A|B) = p(B|A) p(A) / p(B) … but more meaningfully:
p(hypothesis | data) ∝ p(data | hypothesis) * p(hypothesis)
posterior ∝ likelihood * prior … more on this later
Conditional probability intuition
● P(A|B) intuition - vs. P(B|A) and P(A and B)
○ P(it’s raining) = 0.15
○ P(I’m carrying an umbrella) = 0.12
○ P(it’s raining AND I’m carrying an umbrella) = 0.10
○ P(I’m carrying an umbrella | it’s raining) = 0.666...
○ P(it’s raining | I’m carrying an umbrella) = 0.833...
● Last two are from P(A|B) = P(A and B) / P(B)
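A quick check of that arithmetic in plain Python:

```python
# P(A|B) = P(A and B) / P(B), applied to the umbrella numbers above
p_rain, p_umbrella, p_both = 0.15, 0.12, 0.10
print(p_both / p_rain)      # P(umbrella | rain) = 0.666...
print(p_both / p_umbrella)  # P(rain | umbrella) = 0.833...
```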
Conditional probability rules
● Use a venn diagram if necessary
● P(A and A’) = 0 also written as P(A,A’)
● P(A or B) = P(A) + P(B) - P(A and B)
● P(A or A’) = P(A) + P(A’) = 1
● P(A) = P(A,B) + P(A,B’)
● P(A,B) = P(A|B) P(B) = also P(B|A) P(A)
○ Which leads to...
Bayes rule
P(A | B) = P(B | A) * P(A) / P(B)
And another formulation we will find useful...
P(A | B) = P(B | A) * P(A) / [ P(B|A) P(A) + P(B|A’) P(A’) ]
(because P(B) = P(B,A) + P(B,A’) = P(B|A) P(A) + P(B|A’) P(A’))
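For the worked examples that follow, the expanded form is convenient to wrap in a small helper. This is just a restatement of the formula above in plain Python, not code provided with the course.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(A|B) from P(A), P(B|A) and P(B|A'), via P(B) = P(B|A)P(A) + P(B|A')P(A')."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Sanity check: if the evidence is equally likely either way, the prior is unchanged
print(posterior(0.5, 0.7, 0.7))   # 0.5
```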
Bayes cancer test example
There is a test for cancer with a 1% error rate (both false positives and false
negatives). 1 in 10,000 people have this form of cancer.
Given a positive test result, what is the probability the person developed the
disease?
Cancer test example worked
P(D+|T+) = ?
= P(T+|D+) P(D+) / P(T+)
P(T+|D+) = 0.99 : 1 - false negative rate
P(D+) = 0.0001
P(T+) = P(T+,D+) + P(T+,D-)
= P(T+|D+) P(D+) + P(T+|D-) P(D-)
= 0.99 * 0.0001 + 0.01 * 0.9999
~ 0.01
P(D+|T+) = 0.0099 → < 1% chance!
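The same calculation in plain Python. The exact answer is about 0.0098; the slide's 0.0099 comes from rounding the denominator to 0.01, and either way it is under 1%.

```python
p_disease = 0.0001                # 1 in 10,000 people have the disease
p_pos_given_disease = 0.99        # 1 - false negative rate
p_pos_given_healthy = 0.01        # false positive rate

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))        # ≈ 0.0101
print(p_pos_given_disease * p_disease / p_pos)           # ≈ 0.0098 -> still < 1%
```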
Another example, trick coins
Assuming you encounter trick coins at a rate of 0.1% in the world when someone is flipping for you, is it more or less likely that you are seeing a trick (vs. fair) coin after 10 consecutive heads?
Trick coin example worked out
P(fair | 10 heads) = ?
= P(10 heads | fair) P(fair) / P(10 heads)
P(fair) = 0.999 : 1 - P(trick)
P(10 heads | fair) = (1/2)^10
P(10 heads) = P(10H|fair) P(fair) + P(10H|trick) P(trick)
= (1/2)^10 * 0.999 + 1 * 0.001
~ 0.002
P(fair | 10 heads) = (1/2)^10 * 0.999 / 0.002 ≈ 49% chance!
So after 10 heads in a row, a trick coin has become slightly more likely than a fair one.
Try with a different level of trust in the world (change P(trick)) or different
number of heads in a row.
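The update written as a function of the prior and the number of heads, so the suggested variations are one call away. Plain Python; like the slide, it assumes a trick coin always lands heads.

```python
def p_fair_given_heads(n_heads, p_trick=0.001):
    p_fair = 1 - p_trick
    like_fair = 0.5 ** n_heads      # a fair coin gives n heads with prob (1/2)^n
    like_trick = 1.0                # an always-heads trick coin gives n heads for sure
    evidence = like_fair * p_fair + like_trick * p_trick
    return like_fair * p_fair / evidence

print(p_fair_given_heads(10))                 # ≈ 0.49: fair vs trick is now about even
print(p_fair_given_heads(10, p_trick=0.01))   # more distrust -> trick wins sooner
print(p_fair_given_heads(20))                 # 20 heads: almost certainly a trick coin
```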
Naive Bayes
● for classification: p(class | data) ∝ p(data | class) * p(class)
○ pick the most probable class
● gaussian naive bayes - use gaussian distributions for probabilities
○ create gaussian distributions using available training data
○ each gaussian’s mean and variance are the mean and variance of that feature in the training data (per class)
● Multiple features? just treat their contributions to the overall probability independently
● Advantages of GNB
○ exceptionally fast training
○ relative to other approaches, it works well when there is little data
● Disadvantages of GNB
○ independence assumption
○ gaussians may not be appropriate to model the data distribution
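A minimal run with scikit-learn's GaussianNB, which follows the recipe in the bullets above (per-class means and variances plus class priors). The iris data set here is just a convenient stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gnb = GaussianNB().fit(X_tr, y_tr)      # "training" is just means, variances, priors
print("test accuracy:", gnb.score(X_te, y_te))
print("per-class feature means:\n", gnb.theta_)   # the fitted gaussian means
```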
Digit recognition example discussion
x represents the image vector (x_1, x_2, x_3, … x_64)
c_k represents class k - that is, one of the 10 digits to recognize
Recall, we’re looking for the highest p(c_k|x) by using this fact:
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
Let’s step through parts of this equation one-by-one
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
● The main assumption of naive Bayes is that the features should be
treated independently (which is why it’s “naive”). This means
○ p(x|c_k) = p(x_1|c_k) * p(x_2|c_k) * … * p(x_64|c_k)
● For each class, k, in the training data:
○ Calculate the mean and variance of each feature for that class
■ Variances of 0 are too extreme; in practice they are softened with something slightly greater than 0
○ Use that and the formula for a gaussian probability to calculate p(x_i|c_k)
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
p(ck) is simply the proportion of that class in the training data.
e.g. if there are 20 fives out of 200 digits in the training sample, p(five) = 20/200 = 0.1
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
● p(x) is the normalization term. You don’t necessarily need to calculate this, since you just want to pick the largest p(c_k|x), and p(x) is the same denominator in calculating p(c_k|x) for every class.
● However, if you want p(c_k|x) to provide a true estimate of the probability, you can use the following formula to calculate p(x):
○ p(x) = Σ_k p(x, c_k) = Σ_k p(x|c_k) p(c_k)
p(c_k|x) = p(x|c_k) p(c_k) / p(x)
The predicted class is the one with the largest p(c_k|x) for each image.
For the assignment:
1. Report the overall accuracy of your prediction.
2. Show the classification (confusion) matrix.
3. Note which errors are more common. In what way does that match your
intuitions?
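A from-scratch sketch of the procedure just described, using NumPy and scikit-learn's 8x8 digits set as a stand-in for the course data. Log-probabilities avoid multiplying 64 tiny numbers together, and a small epsilon softens zero variances as the slide suggests; treat it as an illustration, not a reference solution.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                 # each x = (x_1, ..., x_64)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classes = np.unique(y_tr)
eps = 1e-2                                          # softens zero variances
means = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
varis = np.array([X_tr[y_tr == c].var(axis=0) + eps for c in classes])
priors = np.array([(y_tr == c).mean() for c in classes])      # p(c_k)

def log_posterior(x):
    # log p(c_k) + sum_i log N(x_i | mean_ki, var_ki), unnormalized, one value per class
    log_like = -0.5 * (np.log(2 * np.pi * varis) + (x - means) ** 2 / varis).sum(axis=1)
    return np.log(priors) + log_like

preds = np.array([classes[np.argmax(log_posterior(x))] for x in X_te])
print("accuracy:", (preds == y_te).mean())

# Classification matrix: rows = true digit, columns = predicted digit
conf = np.zeros((10, 10), dtype=int)
for true, pred in zip(y_te, preds):
    conf[true, pred] += 1
print(conf)
```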
Let’s apply this by performing a classification in class!
Numeric:
How many different sports do you watch regularly over a year?
How many hours a week do you watch TV/Movies/Netflix?
How many hours per week do you read a book or magazine offline?
How long is your commute in minutes?
How many years of postgraduate education do you plan on completing?
How many hours per week do you exercise?
On a scale from 1 to 10, how much do you like to go hiking?
On a scale from extrovert (0) to introvert (5) where are you?
How many countries have you travelled to?
What is the percent chance you are home on a Friday night?
How many times a week do you drink coffee?
How much sleep do you get per night in hours?
How many minutes do you spend in the morning getting ready?
How many people do you text in a day?
Categorical:
Cat person or dog person?
Group projects or individual work preferred?
...
Additional material with time permitting
Easy to accumulate information (an optional aside)
● Let’s assume you just figured out P(belief | data1) ∝
P(data1 | belief) P(belief)
● How do you integrate new data2?
○ Recalculate with all the data together?
■ NO! (you can, but it’s such a waste)
○ Just use the old posterior as the new prior
■ which is really where priors come from anyway -
previous data or experience
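A tiny numeric check of this point, reusing the earlier trick-coin setup (plain Python, same always-heads assumption): updating in two batches with the old posterior as the new prior gives the same answer as updating with all the data at once.

```python
def update_p_trick(p_trick, n_heads):
    # posterior P(trick | n heads in a row), trick coin assumed to always land heads
    like_trick, like_fair = 1.0, 0.5 ** n_heads
    evidence = like_trick * p_trick + like_fair * (1 - p_trick)
    return like_trick * p_trick / evidence

all_at_once = update_p_trick(0.001, 10)   # all 10 heads in one update
step1 = update_p_trick(0.001, 5)          # posterior after the first 5 heads...
step2 = update_p_trick(step1, 5)          # ...used as the prior for the next 5
print(all_at_once, step2)                 # identical (up to floating point)
```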
  • 32. Easy to accumulate information (an optional aside) ● Let’s assume you just figured out P(belief | data1) ∝ P(data1 | belief) P(belief) ● How do you integrate new data2? ○ Recalculate with all the data together? ■ NO! (you can, but it’s such a waste) ○ Just use the old posterior as the new prior ■ which is really where priors come from anyway - previous data or experience 32