Machine Learning Primer
a.k.a.
Learning from Data
Medical screening and diagnosis:
Increasing cases of machines
providing better predictions than
individual doctors
(implications for increased
accountability and financial
justification for a wide variety of
medical interventions)
Why?... Machine Learning is really useful
2
Object identification in autonomous driving
cars
(implications for automating the use of
pervasive video feeds)
Speech recognition: Siri, Google home,
Alexa, Cortana…
(implications for using audio and voice for
more natural application interfaces)
The overarching goal for today is to
provide enough scaffolding in sophisticated
data-driven approaches to:
● Separate hype from well-earned buzz on
new technologies
● Recognize opportunities where machines can
learn from data and perform well
● Better understand the limitations of
predictive models and how to overcome
them
● Be able to follow available tutorial code at
the end that implements these techniques
Today’s overall goal: Scaffolding
3
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Objectives: concepts in machine learning
4
Explicit learning
“Conscious learning” - memorizing and applying rules and
facts
Easier to program: old-school “expert systems”
Implicit learning ← what we focus on today
“Subconscious learning” - inferring patterns about the
world
Knowledge and decisions are probabilistic based on data
Explicit vs Implicit learning
6
Implicit learning is done in a data-driven way:
● Learning how to walk as a baby (try, try, try…)
● Learning a friend is trustworthy based on
experiences
● Your food, music, movie, book, etc. tastes
This usually means there are no hard and fast rules
like in explicit learning.
Conclusions are more or less certain depending on the
amount or quality of the data, and your ability to
make inferences on those data.
Implicit learning is data-driven
7
Machine learning is not complicated: almost all
concepts in machine learning can be mapped onto
issues in human learning (but with a need for some
programming)
Implicit human vs machine learning similarities
8
Human learning → Machine learning
● uncertainty → probability
● exceptions to the rule → noise, outliers
● naive → underfitting (model too simple)
● superstitious → overfitting to outliers (model too complex)
● TMI (too much information) → TMF (too many features)
1. On Monday, you put your foot in
Lake Michigan ---> Wow, it’s cold!
2. On Tuesday, you put your foot in
Lake Michigan ---> Wow, it’s still
cold!
…
n. So on Friday, how will it feel if
you put your foot in Lake Michigan?
Clearly, you learn that similar
situations produce similar
outcomes.
How do you learn from data?
9
● This is already the basis of one machine learning
algorithm called k-nearest neighbors (when k = 1).
● You just look for the example in your past (a.k.a.
your “training data”) that best matches the example
you have in front of you, and do the same.
Guess by most similar situation (kNN with k=1)
10
Question | In your head | Response
Would you like to eat this cookie? | Hmm, looks like a chocolate chip cookie I ate last week and liked... | Oh, yeah!
Hey, want to see Spiderman 2? | Hmm, I didn’t like Spiderman 1. | Nah, I’m not into it.
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Objectives: concepts in machine learning
11
For this example, we will build a classifier to determine which
house a new student in Hogwarts should belong to, like the hat in
Harry Potter (youtube link)
The hat scene from Harry Potter
12
Gryffindor, Slytherin, Hufflepuff...
Say the hat made decisions based on two characteristics -
trustworthiness (X) and courage (Y)
Given its past decisions, let’s build our own Hogwarts house classifier.
kNN with k=1, Harry Potter hat example
13
Person (“sample”) | Characteristics (“features”) | House (“class” or “target”)
Harry | X = 0.9, Y = 0.7 | Gryffindor
Hermione | X = 0.9, Y = 0.8 | Hufflepuff
Draco | X = 0.3, Y = 0.7 | Slytherin
The hat made these decisions on the current class of students.
Each dot represents a student in each of the 3 houses.
kNN with Harry Potter - the training data
14
[Scatter plot: X = trustworthiness, Y = courage; each dot is a student, colored by house (Gryffindor, Hufflepuff, Slytherin)]
If I give you a new person (with an X and Y coordinate),
what would be the likely house? - Simple, pick the
closest!
A new person? Pick the nearest (kNN with k=1)
15
A new person at X = 0.5, Y = 0.35 → Hufflepuff!
[Plot: the class assigned to every possible (X, Y) point using this strategy (kNN with k = 1)]
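As a minimal sketch, the hat-as-classifier idea is a few lines with scikit-learn’s KNeighborsClassifier (the library the tutorials at the end use). The three training rows below are just the slide’s illustrative values; the plot above has many more students, which is what actually drives its prediction:

```python
from sklearn.neighbors import KNeighborsClassifier

# The slide's three labeled examples: [trustworthiness, courage]
X_train = [[0.9, 0.7],   # Harry
           [0.9, 0.8],   # Hermione
           [0.3, 0.7]]   # Draco
y_train = ["Gryffindor", "Hufflepuff", "Slytherin"]

hat = KNeighborsClassifier(n_neighbors=1)  # k = 1: copy the closest example
hat.fit(X_train, y_train)

# Classify the new student; with only three training points, the answer is
# simply the house of whichever of the three examples is nearest.
print(hat.predict([[0.5, 0.35]]))
```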
But you don’t always want to
rely on single examples
● It may have been a mistake,
or a random occurrence
● It may have been correct,
but for unknown reasons,
and you don’t want to use it
to make a judgement
But the “nearest” example is not always best
16
Some poor reasoning, like superstitions and really bad
generalizations come from not looking at more, similar
past examples:
● A black cat crossed my path and I broke my toe that
day! Black cats are bad luck!
○ Counter: How many times did you see a black cat
and something bad didn’t happen?
● Vaccines cause autism because my kid was diagnosed
with autism after a shot!
○ Counter: Take all the kids at that age? Is it
discovered more in kids who got the shot than
didn’t? (answer: no)
It’s important to consider many similar
examples to avoid fitting your belief to a
single outlier or exception
But the “nearest” example is not always best
17
If I just gave you an X and Y coordinate near these
circled points, what should you pick as the right class?
Graphically, avoiding outliers (k=1 vs k=5 for kNN)
18
[Two plots, k = 1 vs k = 5. The circled points are likely mistakes or unexplained outliers to disregard. Polling your nearest neighbor (k = 1) will lead to errors in those places, but polling your 5 nearest neighbors (k = 5) avoids this problem.]
k too low:
You may not want every single example to impact
your decision making
● When k = 1 in kNN, one mistake affects
anything similar in the future
k too high:
But you don’t want to “average out” too much
● High k values may wipe out meaningful variations (“peninsulas and islands”) in our classification space
● What would happen if a group only had 5
members with k=20? It would never be chosen!
Finding the right k for kNN
19
[Plots: decision borders with a low k vs a high k]
The “k” of k Nearest Neighbors is a “hyperparameter”.
Almost all machine learning algorithms have
hyperparameters.
Many have a similar “goldilocks” nature to them:
● k too low → “overfitting”
○ Decisions based on noise
● k too high → “underfitting”
○ Decisions not sensitive to important subgroups in
the data set
We’ll cover exactly how to pick the right k at the end, but
let’s understand how this concept applies across many
machine learning algorithms.
Finding the right k for kNN
20
[Plots: k too low → “overfit”; k too high → “overgeneralize”]
Which border below makes more intuitive sense?
More importantly, your intuition matches what will
perform better on new data (as opposed to the data you
see here).
A visual example of the hyperparameter issue
21
[Two borders: underfitting (“too simple”) vs overfitting (“too complex”)]
Hyperparameters balance between fitting the model
too closely to outliers/exceptions and making the
model too insensitive to important subgroups in the
data
Protip: ML hyperparameters trade off in
complexity
22
kNN (k nearest neighbors)
● Hyperparameter: k, the number of neighbors to poll
● Underfitting (“too simple”): k too high
● Overfitting (“too complex”): k too low (e.g. k = 1)

Polynomial regression (e.g. linear, quadratic, cubic...)
● Hyperparameter: n, the degree of the polynomial y = a_n·x^n + … + a_1·x + a_0
● Underfitting: degree too low, e.g. a line when a curve is better
● Overfitting: degree too high, e.g. a degree-20 polynomial to fit 20 points

Regularized linear (and logistic) regression (e.g. LASSO, ridge, elastic net...)
● Hyperparameter: λ, the regularization strength (varies how many features are kept)
● Underfitting: λ too high; might keep only one or two features, zeroing out the rest
● Overfitting: λ too low (e.g. λ = 0); just plain linear regression using all features

SVM (support vector machines)
● Hyperparameter: C, the slack-variable penalty (lower allows more “mistakes”)
● Underfitting: C too low; disregards too many points as “outliers”
● Overfitting: C too high; awkwardly adjusts the border to correctly classify every single example
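To make the polynomial-degree row concrete, here is a rough sketch on made-up noisy data (the sine curve and the degrees 1/4/15 are assumptions chosen for illustration). Training fit keeps improving with degree, which is exactly why the test-set check later in the deck matters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 30))[:, None]                         # one feature
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=30)  # noisy curve

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    # Training R^2 rises with degree even as generalization gets worse
    print(f"degree {degree}: training R^2 = {model.score(x, y):.3f}")
```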
How they are (almost) all the same
Almost all machine learning models have these
hyperparameters trading off model complexity
How they are different
Now, let’s step through the different types of models
Machine Learning Models
23
What types of models are there to learn with?
24
Supervised Learning: predicting a label (classification) or predicting a number (regression)
Unsupervised Learning
Philosophical note: all learning problems fit into these general classes
● As you’ve seen with kNN, classification is just making a
system that can predict which group a new example belongs
to
● Most interesting problems rely on more than two numbers
(“features”) to make a decision, so we can’t simply draw them
like we did before, but we use the same idea
Classification - just defining borders
25
Problem: our Harry Potter hat classifier for students
● Input: two personality characteristics
● Example features: trustworthiness & courage
● Number of features: 2 (drawable)
● Output: Harry Potter house

Problem: a Spice Girls internet survey
● Input: eight random questions
● Example features: “Are you a dog or cat person?”...
● Number of features: 8 (too complicated to draw the borders)
● Output: which Spice Girl you are

Problem: speech recognition
● Input: spoken words
● Example features: pitch, volume, Fourier components...
● Number of features: hundreds or thousands
● Output: the written word (e.g., “bat” vs “cat”)

Problem: face recognition
● Input: pictures
● Example features: relative size of nose and eyes, eye spacing...
● Number of features: hundreds or thousands
● Output: who the person is (e.g. “Sally”)
All classifiers use training data to carve up the space
and group similar examples together.
They just differ in the geometry they use to do it.
Classifiers just differ by how they carve the
space
26
Classifier | Borders defined by
● k Nearest Neighbors: distances to the nearest neighbor(s)
● Decision Trees: vertical or horizontal boundaries only
● Perceptron or logistic regression: lines/planes/hyperplanes (can be diagonal)
● Naive Bayes: conic sections (various curves), since groups are captured by Gaussian ellipses

[Plots: example decision boundaries for kNN, Decision Trees, Naive Bayes, and the Perceptron]
● Most learning problems are about eventually predicting a class
(classification) or a number (regression).
○ Will it rain tomorrow? (classification) vs
○ How much will it rain? (regression)
● There are very similar techniques between classification and
regression
○ Often using a variation of the same name
Regression - predicting numbers
27
[Plot: k Nearest Neighbors regression, predicting y for a given x from the neighbors’ values]
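As a hedged sketch of that variation, scikit-learn’s KNeighborsRegressor predicts a number by averaging the targets of the nearest neighbors (the sine-curve data here is assumed, purely for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

x = np.linspace(0, 10, 40)[:, None]   # one feature
y = np.sin(x).ravel()                 # the number to predict

reg = KNeighborsRegressor(n_neighbors=5).fit(x, y)
print(reg.predict([[2.5]]))  # the mean of the 5 nearest training targets
```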
Medical screening and diagnosis:
Increasing cases of machines
providing better predictions than
individual doctors
(implications for increased
accountability and financial
justification for a wide variety of
medical interventions)
Step back a moment… Recall what it’s all about!
28
Object identification in autonomous driving
cars
(implications for automating the use of
pervasive video feeds)
Speech recognition: Siri, Google home,
Alexa, Cortana…
(implications for using audio and voice to make
application interfaces easier and more natural)
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Objectives: concepts in machine learning
29
● Finding “similar” previous cases is too vague for tough
problems
● How do I know which examples from the past are similar
enough (and in what way?) to help me predict an answer?
● In other words, what features should I pay attention to?
○ And how much?
First, how do we know what information to
include?
30
Let’s consider an example:
Do I want to watch this?
Which features are important for finding and
properly comparing to past experiences?
1. I don’t like Thor in the previous movies
2. But I heard good things about this director’s
movies
3. There’s a cameo by Matt Damon!
4. I tend to like movies rated high on Rotten
Tomatoes
So which features should I include?
31
You have intuitions for which are important here, but computers don’t.
Dilbert’s boss actually demonstrates an important concept here:
● Intuitions are not explicit knowledge (“feels right” above), but come
from probabilistic inferences from data over a lifetime of experiences.
○ So really, when you hand-pick the right features, it is data-driven
but in a much larger sense.
● But computers don’t have a lifetime of experiences, so they (and
Dilbert here) need data to decide which features matter.
○ And even more data to figure out exactly how to weigh them
correctly!
○ With enough quality data, the conclusions may be more reliable
than your intuitions.
You have intuitions, but computers need data
32
● Let’s go over an example where we are trying to
predict things like basketball ability, physics ability,
and running ability.
● If I know these 5 features for a large group of people:
○ GPA
○ IQ
○ weight
○ height
○ sex
● We can build a model if we know the value of this
“target” information for each person:
○ basketball skill rating
○ physics test score
○ running pace
Let’s do an example prediction on people
33
There’s a standard form for learning to predict from data.
● Rows - samples of your data (e.g. different people)
● Columns - features of your data (IQ, GPA, height…)
● Target - what you are trying to get your system to predict (physics
score)
The setup: Machine Learning standard form
34
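Here is a minimal sketch of that standard form as arrays, using the first three people from the table a couple of slides ahead (heights converted to inches and sex encoded as a number are our choices, since everything must be numeric):

```python
import numpy as np

# Rows = samples (people); columns = features; values from the deck's table,
# with height converted to inches and sex encoded as M=1 / F=0.
#                GPA   IQ  weight  height  sex
X = np.array([[ 2.7,  90,   220,     77,    1],   # James,  6'5"
              [ 3.4, 120,   160,     70,    0],   # Mary,   5'10"
              [ 3.6, 115,   240,     82,    1]])  # John,   6'10"

y = np.array([55, 95, 80])  # target: physics test score (%)
```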
Tips:
● Express all information as numbers or consistent labels
● Ideally, features should have a reasonable chance of being useful
○ otherwise they are just distracting, and decrease performance
● A very rough rule of thumb on number of features and samples
○ number of features < sqrt(number of samples)
○ (Why? Many models implicitly use linear combinations of the cross
products of features, which are as numerous as the number of
features squared - and this shouldn’t be more than the number of
samples)
So, more data means it’s okay to use more features!
Some tips for putting problems into standard
form
35
Predicting basketball ability
36
Name | GPA | IQ | Weight (lbs) | Height (ft’in”) | Sex | B-ball rating | Physics test score | Running pace (min/mile)
James | 2.7 | 90 | 220 | 6’5” | M | 4 | 55% | 8
Mary | 3.4 | 120 | 160 | 5’10” | F | 4 | 95% | 7
John | 3.6 | 115 | 240 | 6’10” | M | 5 | 80% | 10
Patricia | 3.8 | 135 | 120 | 5’1” | F | 3 | 100% | 9
Robert | 4.0 | 130 | 130 | 5’3” | M | 3 | 95% | 9:30
Linda | 2.1 | 95 | 160 | 5’3” | F | 3 | 70% | 10
Michael | 3.8 | 120 | 160 | 6’6” | M | 4 | 85% | 6
Barbara | 3.2 | 100 | 180 | 6’2” | F | 3 | 80% | 9
William | 3.2 | 110 | 210 | 5’1” | M | 1 | 90% | 12
Elizabeth | 3.9 | 130 | 150 | 5’6” | F | 3 | 95% | 10
David | 2.5 | 110 | 220 | 5’8” | M | 2 | 70% | 11
Target column: B-ball rating. Data matrix: a few hand-picked feature columns if you have only a few samples, or all of the remaining columns if you have lots of samples.
Predicting physics ability
37
Same table as above, with the physics test score column as the target.
Predicting running pace
38
Same table as above, with the running pace column as the target.
Notice how the “lots of samples” situation
allows us to just include all the data? Nice.
With more samples, you don’t have to be as
picky about which features to use in your
learning.
This allows the system (rather than you) to
determine which features are useless, which
are good, and have the system try to
combine the good ones in useful ways.
More samples → more robust, automated learning
39
Why is “Big Data” such a hot topic now?
Because more samples mean more
robust, automated learning
This is why massive data sets are helping in
complex problems
(since people can’t possibly hand-pick the
right features in all these problems)
● Autonomous cars
● Speech recognition
● Face recognition
More samples → more robust, automated learning
40
Aside from just
1. picking good features by hand and
2. finding more data
How else can we help a system learn from data?
Helping the learner
41
You can help learning/decision making by extracting
meaningful relationships in your data, prior to
applying your machine learning model
● If you try to keep similar examples near each
other while using fewer features (e.g. reducing nD
to 2D for visualization), it’s dimensionality reduction
● If you try to group similar examples together, it’s
called clustering
○ Naive clustering: based on one feature, e.g.
■ marketing to men vs women
■ grouping people by tall vs short
○ Better clustering: conjunctions of many
features that group similar examples, e.g.
■ low GPA and high IQ: (“the lazy geniuses”)
■ tall and moderate weight: (“likely good in
basketball”)
42
Helping the learner: unsupervised learning
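A hedged sketch of both ideas with scikit-learn (the random “people” matrix is an assumption, standing in for real samples-by-features data): PCA squeezes the features down to 2D for plotting, and KMeans groups similar rows together.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
people = rng.normal(size=(100, 5))  # 100 samples x 5 features (GPA, IQ, ...)

coords_2d = PCA(n_components=2).fit_transform(people)  # dimensionality reduction
groups = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit_predict(people)    # clustering

print(coords_2d.shape)      # (100, 2): ready to scatter-plot
print(np.bincount(groups))  # how many people landed in each of the 3 groups
```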
Often you can create features from your data
to improve learning.
For a runner, BMI makes more sense than
height or weight alone:
BMI = weight in kg / (height in m)²
This is called feature engineering, and there
are many, many ways to do this. Sometimes
the features can be very complex ones.
Helping the learner: feature engineering
43
Weight (lbs) | Height (ft’in”) | BMI | Running pace (min/mile)
220 | 6’5” | 26.1 | 8
160 | 5’10” | 22.3 | 7
240 | 6’10” | 25.1 | 10
120 | 5’1” | 22.7 | 9
130 | 5’3” | 23 | 9
160 | 5’3” | 28.3 | 10
160 | 6’6” | 18.5 | 6
180 | 6’2” | 23.1 | 9
210 | 5’1” | 39.7 | 12
150 | 5’6” | 24.2 | 10
220 | 5’8” | 33.5 | 11
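As a small sketch of this kind of engineered feature, here is the conversion from the table’s pounds and feet-and-inches into metric before applying the formula (the helper name bmi is ours, just for illustration):

```python
def bmi(weight_lb, height_ft, height_in):
    """Engineered feature: BMI = kg / m^2, from US-customary inputs."""
    kg = weight_lb * 0.4536
    m = (height_ft * 12 + height_in) * 0.0254
    return kg / m ** 2

print(round(bmi(220, 6, 5), 1))  # James: 26.1, matching the table's first row
```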
You can use smaller, simpler classifiers
to help the system make a more
complicated decision
Want to build a face detector? Make an eye detector, a nose detector, and a mouth detector, and use their outputs as inputs to your system.
Want to predict shopping habits? Use an earlier-built classifier that infers demographic or socioeconomic status, and use its output to help predict shopping behavior.
Help the learner: low-level classifiers as new
features
44
Feature extraction is difficult and expensive in terms of time and
expertise.
In some situations, it’s almost impossible for a person to extract all
the useful low-level features (e.g. face recognition, speech
recognition).
Good systems for these problems don’t use features programmed by a person;
they automatically extract useful features directly from the data.
Traditional Machine Learning
45
This is done in deep learning (“deep” = multilevel neural network)
Note: this requires MASSIVE AMOUNTS OF DATA
Deep Learning (automatic features!)
46
Machine Learning vs Deep Learning
47
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Objectives: concepts in machine learning
48
So many options to try in machine learning!
1. Pick the right features
2. Create new features (feature engineering)
3. Simplify your feature set (clustering, dimensionality reduction)
4. Try different learning models (kNN, SVM, …)
5. Change hyperparameters of those models (like k in kNN, or
degree in polynomial regression)
6. Combine the output of different models using ensemble
methods
7. ...
A crazy number of options for learning models
49
● How do we pick the best options for our
learner?
● Simple: you test them out on more data
● The best combination of options after
testing is what you use
● Note: many good machine learning practitioners
don’t know the details of how their models
work; they just try a number of options and
leverage the data they have well
● The more data you have, the more options
you can try without running into problems
● Again, Big Data to the rescue!
So many modeling options...
50
Key: separate your test data from training data
51
1. Get the data in standard form (with all the features for every sample, and
known targets)
2. Separate it into training sets and test sets
○ (see “cross-validation” for a more standard way to do this)
3. Build each different model from the same training set
○ Pick different modeling options for each one
4. Apply the built models to predict the targets of the test set
5. Then see how well the prediction matches the right answers on the test set.
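The five-step recipe above, as a minimal scikit-learn sketch (the bundled iris data and the kNN model are stand-ins; any standard-form data set and classifier fit the same pattern):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                 # step 1: standard form
X_tr, X_te, y_tr, y_te = train_test_split(        # step 2: train/test split
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)       # step 3: build from training set
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))                    # steps 4-5: predict on the test
                                                  # set and compare to the answers
```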
● If a model is too simple,
it will clearly not
perform well, even on
training data
● Overly “complicated”
models will fit your
training data (too) well
○ But they do worse
on new data!
● You’ll often see that the
accuracy in the test set
peaks (a.k.a. errors are
minimal) at a certain
middle point
Why have a separate test set?
52
[Plot: test-set accuracy peaks at a middle complexity, between underfitting (“too simple”) and overfitting (“too complex”)]
● Pick the simplest model that still explains what you have seen
○ a.k.a. your training data!
● Why? That’s likely to be a better model than an overly
complicated one. It’s more likely to be right in newer situations.
○ a.k.a. your future test data!
But nowadays we don’t have to rely on philosophy or feelings or
arguments or… to pick our best model! We just need more data...
You’ve actually seen this before: Occam’s razor
53
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Summary (1 of 2): concepts in machine learning
54
But step away from the details for a
moment...
Predicting numbers or labels from sets
of data sounds technical and limited
But the applications are powerful
and limitless…
● Speech recognition: Siri, Google
home, Alexa, Cortana…
● Natural language processing
● Medical diagnosis
● Face and expression recognition
● Autonomous driving cars
● ...
Summary (2 of 2): It’s much more than “just
data”!
55
All the following examples are in Python and using the
sklearn machine learning library
(I recommend downloading the
Anaconda Python distribution and running code in a Jupyter
notebook)
● Digit recognition classifier
○ uses SVMs, but you can try whatever you want by
changing one line in it with any of these classifiers
● Underfitting vs Overfitting regression example with
polynomial degree
● How different classifiers carve decision boundaries
● k-nearest neighbors classifier using the iris data set
○ a classic data set to test classifiers (with features like
petal length) to determine the type of iris flower
based on the features
For more information: try some tutorials
56
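The digit-recognition tutorial in miniature, as a hedged sketch (sklearn’s bundled 8x8 digit images; gamma=0.001 is the value commonly used in the scikit-learn docs). Swapping SVC for another classifier really is a one-line change:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # 1797 8x8 images, flattened to 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(gamma=0.001)                # the one line to swap for other classifiers
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))          # accuracy on held-out digits
```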
Supplementary slides
Metrics, normalization, and cross-validation
57
Numeric error measures (much more than just MSE)
Mean Square Error
● Larger errors are penalized more
Mean Absolute Error
● All errors are penalized equally
RMSLE: Root Mean Square Log Error
● Similar percent errors are penalized
equally
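A small sketch of the three measures side by side on made-up predictions (values assumed purely to show the contrast; RMSLE is computed by hand since it is just RMSE on log(1 + value)):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([12.0, 120.0, 1200.0])    # every prediction is 20% high

print(mean_squared_error(y_true, y_pred))   # MSE: the 200-unit miss dominates
print(mean_absolute_error(y_true, y_pred))  # MAE: each unit of error counts once
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
print(rmsle)                                # RMSLE: the three 20% misses look alike
```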
Classification errors
[Image-only slides: classification error measures shown on the digit recognition example]
kNN Challenges - normalization
Without normalization, features with larger variances have excessive influence
Normalization
Remember from kNN? Without normalization, some features would dominate the
distance metric
Similar problem in other classifiers. Because of this, many methods may normalize
automatically and fix the normalization parameters from the training set
Examples of typical normalizations:
● standardized = (raw − μ_training) / σ_training
● min-max scaled = (raw − min_training) / (max_training − min_training)
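A minimal sketch of fixing the normalization parameters from the training set with scikit-learn (the two-feature arrays are assumed toy values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[160.0, 90], [170.0, 120], [180.0, 110]])  # e.g. weight, IQ
X_test = np.array([[175.0, 100]])

scaler = StandardScaler().fit(X_train)  # mu and sigma come from training data only
print(scaler.transform(X_test))         # the test set reuses those same parameters
```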
Cross-validation
Hold-out test sets are a challenge
● Too small, and the test is unstable
● Too large, and you lose data for
training
A more systematic way is to use cross-validation to rotate which section is used for testing.
Cross-validation
● K-fold
● Stratified K-fold
● Leave One Out: (k-fold where k=n)
● Leave One Label Out
○ e.g. Subject-wise cross-validation
● Questions
○ Shuffle or not?
○ Put all groups together, or use separate models for each
group?
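As a minimal sketch, K-fold cross-validation in scikit-learn (iris and kNN are stand-ins; cross_val_score rotates which fold is held out and returns one score per rotation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores)                          # one accuracy per held-out fold
print(scores.mean(), scores.std())     # average performance and its stability
```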
Hyperparameter selection
Use cross-validation on one set of data to determine the hyperparameter
To report the accuracy of the model correctly, you need to use a separate test set
from that used in fitting the hyperparameter
This is a common error in hyperparameter selection and accuracy reporting, but
after this class you should be aware!
Properly performing cross-validation
Nested cross-validation - properly without a holdout test set
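A hedged sketch of that nested setup (again with iris and kNN as stand-ins): the inner GridSearchCV picks k on its own folds, and the outer cross_val_score rates the whole selection procedure on data the inner search never saw, so the reported accuracy is honest.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Inner loop: pick the hyperparameter k by cross-validation
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 5, 15]}, cv=5)

# Outer loop: score the tuned model on folds never used for tuning
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())  # report this, not the inner search's best score
```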