Machine Learning Primer
a.k.a.
Learning from Data
Medical screening and diagnosis:
Increasing cases of machines
providing better predictions than
individual doctors
(implications for increased
accountability and financial
justification for a wide variety of
medical interventions)
Why?... Machine Learning is really useful
2
Object identification in autonomous driving
cars
(implications for automating the use of
pervasive video feeds)
Speech recognition: Siri, Google home,
Alexa, Cortana…
(implications for using audio and voice for
more natural application interfaces)
The overarching goal for today is to
provide enough scaffolding in sophisticated
data-driven approaches to:
● Separate hype from well-earned buzz on
new technologies
● Recognize opportunities where machines can
learn from data and perform well
● Better understand the limitations of
predictive models and how to overcome
them
● Be able to follow available tutorial code at
the end that implements these techniques
Today’s overall goal: Scaffolding
3
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Objectives: concepts in machine learning
4
Explicit learning
“Conscious learning” - memorizing and applying rules and
facts
Easier to program: old-school “expert systems”
Implicit learning ← what we focus on today
“Subconscious learning” - inferring patterns about the
world
Knowledge and decisions are probabilistic based on data
Explicit vs Implicit learning
6
Implicit learning is done in a data-driven way:
● Learning how to walk as a baby (try, try, try…)
● Learning a friend is trustworthy based on
experiences
● Your food, music, movie, book, etc. tastes
This usually means there are no hard and fast rules
like in explicit learning.
Conclusions are more or less certain depending on the
amount or quality of the data, and your ability to
make inferences on those data.
Implicit learning is data-driven
7
Machine learning is not complicated: almost all
concepts in machine learning can be mapped onto
issues in human learning (but with a need for some
programming)
Implicit human vs machine learning similarities
8
Human learning → Machine learning
● uncertainty → probability
● exceptions to the rule → noise, outliers
● naive → underfitting (model too simple)
● superstitious → overfitting to outliers (model too complex)
● TMI (too much information) → TMF (too many features)
1. On Monday, you put your foot in
Lake Michigan ---> Wow, it’s cold!
2. On Tuesday, you put your foot in
Lake Michigan ---> Wow, it’s still
cold!
…
n. So on Friday, how will it feel if
you put your foot in Lake Michigan?
Clearly, you learn that similar
situations produce similar
outcomes.
How do you learn from data?
9
● This is already the basis of one machine learning
algorithm called k-nearest neighbors (when k = 1).
● You just look for the example in your past (a.k.a.
your “training data”) that best matches the example
you have in front of you, and do the same.
Guess by most similar situation (kNN with k=1)
10
Question | In your head | Response
Would you like to eat this cookie? | Hmm, looks like a chocolate chip cookie I ate last week and liked... | Oh, yeah!
Hey, want to see Spiderman 2? | Hmm, I didn’t like Spiderman 1. | Nah, I’m not into it.
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Objectives: concepts in machine learning
11
For this example, we will build a classifier to determine which
house a new student in Hogwarts should belong to, like the hat in
Harry Potter (youtube link)
The hat scene from Harry Potter
12
Gryffindor, Slytherin, Hufflepuff...
Say the hat made decisions based on two characteristics -
trustworthiness (X) and courage (Y)
Given its past decisions, let’s build our own Hogwarts house classifier.
kNN with k=1, Harry Potter hat example
13
Person (“sample”) | Characteristics (“features”) | House (“class” or “target”)
Harry | X = 0.9, Y = 0.7 | Gryffindor
Hermione | X = 0.9, Y = 0.8 | Hufflepuff
Draco | X = 0.3, Y = 0.7 | Slytherin
The hat made these decisions on the current class of students.
Each dot represents a student in each of the 3 houses.
kNN with Harry Potter - the training data
14
[Scatter plot: X = trustworthiness, Y = courage; each dot is a student, colored by house (Gryffindor, Hufflepuff, Slytherin)]
If I give you a new person (with an X and Y coordinate),
what would be the likely house? - Simple, pick the
closest!
A new person? Pick the nearest (kNN with k=1)
15
A new person at X = 0.5, Y = 0.35 → Hufflepuff!
[Plot: the class assigned to every possible (X, Y) point using this strategy (kNN with k = 1)]
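As a minimal sketch, the hat-as-classifier idea is a few lines with scikit-learn’s KNeighborsClassifier (the library the tutorials at the end use). The three training rows below are just the slide’s illustrative values; the plot above has many more students, which is what actually drives its prediction:

```python
from sklearn.neighbors import KNeighborsClassifier

# The slide's three labeled examples: [trustworthiness, courage]
X_train = [[0.9, 0.7],   # Harry
           [0.9, 0.8],   # Hermione
           [0.3, 0.7]]   # Draco
y_train = ["Gryffindor", "Hufflepuff", "Slytherin"]

hat = KNeighborsClassifier(n_neighbors=1)  # k = 1: copy the closest example
hat.fit(X_train, y_train)

# Classify the new student; with only three training points, the answer is
# simply the house of whichever of the three examples is nearest.
print(hat.predict([[0.5, 0.35]]))
```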
But you don’t always want to
rely on single examples
● It may have been a mistake,
or a random occurrence
● It may have been correct,
but for unknown reasons,
and you don’t want to use it
to make a judgement
But the “nearest” example is not always best
16
Some poor reasoning, like superstitions and really bad
generalizations come from not looking at more, similar
past examples:
● A black cat crossed my path and I broke my toe that
day! Black cats are bad luck!
○ Counter: How many times did you see a black cat
and something bad didn’t happen?
● Vaccines cause autism because my kid was diagnosed
with autism after a shot!
○ Counter: Take all the kids at that age? Is it
discovered more in kids who got the shot than
didn’t? (answer: no)
It’s important to consider many similar
examples to avoid fitting your belief to a
single outlier or exception
But the “nearest” example is not always best
17
If I just gave you an X and Y coordinate near these
circled points, what should you pick as the right class?
Graphically, avoiding outliers (k=1 vs k=5 for kNN)
18
[Two plots, k = 1 vs k = 5. The circled points are likely mistakes or unexplained outliers to disregard. Polling your nearest neighbor (k = 1) will lead to errors in those places, but polling your 5 nearest neighbors (k = 5) avoids this problem.]
k too low:
You may not want every single example to impact
your decision making
● When k = 1 in kNN, one mistake affects
anything similar in the future
k too high:
But you don’t want to “average out” too much
● High k values may wipe out meaningful variations (“peninsulas and islands”) in our classification space
● What would happen if a group only had 5
members with k=20? It would never be chosen!
Finding the right k for kNN
19
[Plots: decision borders with a low k vs a high k]
The “k” of k Nearest Neighbors is a “hyperparameter”.
Almost all machine learning algorithms have
hyperparameters.
Many have a similar “goldilocks” nature to them:
● k too low → “overfitting”
○ Decisions based on noise
● k too high → “underfitting”
○ Decisions not sensitive to important subgroups in
the data set
We’ll cover exactly how to pick the right k at the end, but
let’s understand how this concept applies across many
machine learning algorithms.
Finding the right k for kNN
20
[Plots: k too low → “overfit”; k too high → “overgeneralize”]
Which border below makes more intuitive sense?
More importantly, your intuition matches what will
perform better on new data (as opposed to the data you
see here).
A visual example of the hyperparameter issue
21
[Two borders: underfitting (“too simple”) vs overfitting (“too complex”)]
Hyperparameters balance between fitting the model
too closely to outliers/exceptions and making the
model too insensitive to important subgroups in the
data
Protip: ML hyperparameters trade off in
complexity
22
kNN (k nearest neighbors)
● Hyperparameter: k, the number of neighbors to poll
● Underfitting (“too simple”): k too high
● Overfitting (“too complex”): k too low (e.g. k = 1)

Polynomial regression (e.g. linear, quadratic, cubic...)
● Hyperparameter: n, the degree of the polynomial y = a_n·x^n + … + a_1·x + a_0
● Underfitting: degree too low, e.g. a line when a curve is better
● Overfitting: degree too high, e.g. a degree-20 polynomial to fit 20 points

Regularized linear (and logistic) regression (e.g. LASSO, ridge, elastic net...)
● Hyperparameter: λ, the regularization strength (varies how many features are kept)
● Underfitting: λ too high; might keep only one or two features, zeroing out the rest
● Overfitting: λ too low (e.g. λ = 0); just plain linear regression using all features

SVM (support vector machines)
● Hyperparameter: C, the slack-variable penalty (lower allows more “mistakes”)
● Underfitting: C too low; disregards too many points as “outliers”
● Overfitting: C too high; awkwardly adjusts the border to correctly classify every single example
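To make the polynomial-degree row concrete, here is a rough sketch on made-up noisy data (the sine curve and the degrees 1/4/15 are assumptions chosen for illustration). Training fit keeps improving with degree, which is exactly why the test-set check later in the deck matters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 30))[:, None]                         # one feature
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=30)  # noisy curve

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    # Training R^2 rises with degree even as generalization gets worse
    print(f"degree {degree}: training R^2 = {model.score(x, y):.3f}")
```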
How they are (almost) all the same
Almost all machine learning models have these
hyperparameters trading off model complexity
How they are different
Now, let’s step through the different types of models
Machine Learning Models
23
What types of models are there to learn with?
24
Supervised Learning: predicting a label (classification) or predicting a number (regression)
Unsupervised Learning
Philosophical note: all learning problems fit into these general classes
● As you’ve seen with kNN, classification is just making a
system that can predict which group a new example belongs
to
● Most interesting problems rely on more than two numbers
(“features”) to make a decision, so we can’t simply draw them
like we did before, but we use the same idea
Classification - just defining borders
25
Problem: our Harry Potter hat classifier for students
● Input: two personality characteristics
● Example features: trustworthiness & courage
● Number of features: 2 (drawable)
● Output: Harry Potter house

Problem: a Spice Girls internet survey
● Input: eight random questions
● Example features: “Are you a dog or cat person?”...
● Number of features: 8 (too complicated to draw the borders)
● Output: which Spice Girl you are

Problem: speech recognition
● Input: spoken words
● Example features: pitch, volume, Fourier components...
● Number of features: hundreds or thousands
● Output: the written word (e.g., “bat” vs “cat”)

Problem: face recognition
● Input: pictures
● Example features: relative size of nose and eyes, eye spacing...
● Number of features: hundreds or thousands
● Output: who the person is (e.g. “Sally”)
All classifiers use training data to carve up the space
and group similar examples together.
They just differ in the geometry they use to do it.
Classifiers just differ by how they carve the
space
26
Classifier | Borders defined by
● k Nearest Neighbors: distances to the nearest neighbor(s)
● Decision Trees: vertical or horizontal boundaries only
● Perceptron or logistic regression: lines/planes/hyperplanes (can be diagonal)
● Naive Bayes: conic sections (various curves), since groups are captured by Gaussian ellipses

[Plots: example decision boundaries for kNN, Decision Trees, Naive Bayes, and the Perceptron]
● Most learning problems are about eventually predicting a class
(classification) or a number (regression).
○ Will it rain tomorrow? (classification) vs
○ How much will it rain? (regression)
● There are very similar techniques between classification and
regression
○ Often using a variation of the same name
Regression - predicting numbers
27
[Plot: k Nearest Neighbors regression, predicting y for a given x from the neighbors’ values]
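As a hedged sketch of that variation, scikit-learn’s KNeighborsRegressor predicts a number by averaging the targets of the nearest neighbors (the sine-curve data here is assumed, purely for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

x = np.linspace(0, 10, 40)[:, None]   # one feature
y = np.sin(x).ravel()                 # the number to predict

reg = KNeighborsRegressor(n_neighbors=5).fit(x, y)
print(reg.predict([[2.5]]))  # the mean of the 5 nearest training targets
```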
Medical screening and diagnosis:
Increasing cases of machines
providing better predictions than
individual doctors
(implications for increased
accountability and financial
justification for a wide variety of
medical interventions)
Step back a moment… Recall what it’s all about!
28
Object identification in autonomous driving
cars
(implications for automating the use of
pervasive video feeds)
Speech recognition: Siri, Google home,
Alexa, Cortana…
(implications for using audio and voice to make
application interfaces easier and more natural)
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Objectives: concepts in machine learning
29
● Finding “similar” previous cases is too vague for tough
problems
● How do I know which examples from the past are similar
enough (and in what way?) to help me predict an answer?
● In other words, what features should I pay attention to?
○ And how much?
First, how do we know what information to
include?
30
Let’s consider an example:
Do I want to watch this?
Which features are important for finding and
properly comparing to past experiences?
1. I don’t like Thor in the previous movies
2. But I heard good things about this director’s
movies
3. There’s a cameo by Matt Damon!
4. I tend to like movies rated high on Rotten
Tomatoes
So which features should I include?
31
You have intuitions for which are important here, but computers don’t.
Dilbert’s boss actually demonstrates an important concept here:
● Intuitions are not explicit knowledge (“feels right” above), but come
from probabilistic inferences from data over a lifetime of experiences.
○ So really, when you hand-pick the right features, it is data-driven
but in a much larger sense.
● But computers don’t have a lifetime of experiences, so they (and
Dilbert here) need data to decide which features matter.
○ And even more data to figure out exactly how to weigh them
correctly!
○ With enough quality data, the conclusions may be more reliable
than your intuitions.
You have intuitions, but computers need data
32
● Let’s go over an example where we are trying to
predict things like basketball ability, physics ability,
and running ability.
● If I know these 5 features for a large group of people:
○ GPA
○ IQ
○ weight
○ height
○ sex
● We can build a model if we know the value of this
“target” information for each person:
○ basketball skill rating
○ physics test score
○ running pace
Let’s do an example prediction on people
33
There’s a standard form for learning to predict from data.
● Rows - samples of your data (e.g. different people)
● Columns - features of your data (IQ, GPA, height…)
● Target - what you are trying to get your system to predict (physics
score)
The setup: Machine Learning standard form
34
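Here is a minimal sketch of that standard form as arrays, using the first three people from the table a couple of slides ahead (heights converted to inches and sex encoded as a number are our choices, since everything must be numeric):

```python
import numpy as np

# Rows = samples (people); columns = features; values from the deck's table,
# with height converted to inches and sex encoded as M=1 / F=0.
#                GPA   IQ  weight  height  sex
X = np.array([[ 2.7,  90,   220,     77,    1],   # James,  6'5"
              [ 3.4, 120,   160,     70,    0],   # Mary,   5'10"
              [ 3.6, 115,   240,     82,    1]])  # John,   6'10"

y = np.array([55, 95, 80])  # target: physics test score (%)
```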
Tips:
● Express all information as numbers or consistent labels
● Ideally, features should have a reasonable chance of being useful
○ otherwise they are just distracting, and decrease performance
● A very rough rule of thumb on number of features and samples
○ number of features < sqrt(number of samples)
○ (Why? Many models implicitly use linear combinations of the cross
products of features, which are as numerous as the number of
features squared - and this shouldn’t be more than the number of
samples)
So, more data means it’s okay to use more features!
Some tips for putting problems into standard
form
35
Predicting basketball ability
36
Name | GPA | IQ | Weight (lbs) | Height (ft’in”) | Sex | B-ball rating | Physics test score | Running pace (min/mile)
James | 2.7 | 90 | 220 | 6’5” | M | 4 | 55% | 8
Mary | 3.4 | 120 | 160 | 5’10” | F | 4 | 95% | 7
John | 3.6 | 115 | 240 | 6’10” | M | 5 | 80% | 10
Patricia | 3.8 | 135 | 120 | 5’1” | F | 3 | 100% | 9
Robert | 4.0 | 130 | 130 | 5’3” | M | 3 | 95% | 9:30
Linda | 2.1 | 95 | 160 | 5’3” | F | 3 | 70% | 10
Michael | 3.8 | 120 | 160 | 6’6” | M | 4 | 85% | 6
Barbara | 3.2 | 100 | 180 | 6’2” | F | 3 | 80% | 9
William | 3.2 | 110 | 210 | 5’1” | M | 1 | 90% | 12
Elizabeth | 3.9 | 130 | 150 | 5’6” | F | 3 | 95% | 10
David | 2.5 | 110 | 220 | 5’8” | M | 2 | 70% | 11
Target column: B-ball rating. Data matrix: a few hand-picked feature columns if you have only a few samples, or all of the remaining columns if you have lots of samples.
Predicting physics ability
37
Same table as above, with the physics test score column as the target.
Predicting running pace
38
Same table as above, with the running pace column as the target.
Notice how the “lots of samples” situation
allows us to just include all the data? Nice.
With more samples, you don’t have to be as
picky about which features to use in your
learning.
This allows the system (rather than you) to
determine which features are useless, which
are good, and have the system try to
combine the good ones in useful ways.
More samples → more robust, automated learning
39
Why is “Big Data” such a hot topic now?
Because more samples mean more
robust, automated learning
This is why massive data sets are helping in
complex problems
(since people can’t possibly hand-pick the
right features in all these problems)
● Autonomous cars
● Speech recognition
● Face recognition
More samples → more robust, automated learning
40
Aside from just
1. picking good features by hand and
2. finding more data
How else can we help a system learn from data?
Helping the learner
41
You can help learning/decision making by extracting
meaningful relationships in your data, prior to
applying your machine learning model
● If you try to keep similar examples near each
other while using fewer features (e.g. reducing nD
to 2D for visualization), it’s dimensionality reduction
● If you try to group similar examples together, it’s
called clustering
○ Naive clustering: based on one feature, e.g.
■ marketing to men vs women
■ grouping people by tall vs short
○ Better clustering: conjunctions of many
features that group similar examples, e.g.
■ low GPA and high IQ: (“the lazy geniuses”)
■ tall and moderate weight: (“likely good in
basketball”)
42
Helping the learner: unsupervised learning
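A hedged sketch of both ideas with scikit-learn (the random “people” matrix is an assumption, standing in for real samples-by-features data): PCA squeezes the features down to 2D for plotting, and KMeans groups similar rows together.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
people = rng.normal(size=(100, 5))  # 100 samples x 5 features (GPA, IQ, ...)

coords_2d = PCA(n_components=2).fit_transform(people)  # dimensionality reduction
groups = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit_predict(people)    # clustering

print(coords_2d.shape)      # (100, 2): ready to scatter-plot
print(np.bincount(groups))  # how many people landed in each of the 3 groups
```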
Often you can create features from your data
to improve learning.
For a runner, BMI makes more sense than
height or weight alone:
BMI = weight in kg / (height in m)²
This is called feature engineering, and there
are many, many ways to do this. Sometimes
the features can be very complex ones.
Helping the learner: feature engineering
43
Weight (lbs) | Height (ft’in”) | BMI | Running pace (min/mile)
220 | 6’5” | 26.1 | 8
160 | 5’10” | 22.3 | 7
240 | 6’10” | 25.1 | 10
120 | 5’1” | 22.7 | 9
130 | 5’3” | 23 | 9
160 | 5’3” | 28.3 | 10
160 | 6’6” | 18.5 | 6
180 | 6’2” | 23.1 | 9
210 | 5’1” | 39.7 | 12
150 | 5’6” | 24.2 | 10
220 | 5’8” | 33.5 | 11
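As a small sketch of this kind of engineered feature, here is the conversion from the table’s pounds and feet-and-inches into metric before applying the formula (the helper name bmi is ours, just for illustration):

```python
def bmi(weight_lb, height_ft, height_in):
    """Engineered feature: BMI = kg / m^2, from US-customary inputs."""
    kg = weight_lb * 0.4536
    m = (height_ft * 12 + height_in) * 0.0254
    return kg / m ** 2

print(round(bmi(220, 6, 5), 1))  # James: 26.1, matching the table's first row
```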
You can use smaller, simpler classifiers
to help the system make a more
complicated decision
Want to build a face detector? Make an eye detector, a nose detector, and a mouth detector, and use their outputs as inputs to your system.
Want to predict shopping habits? Use an earlier-built classifier that infers demographic or socioeconomic status, and use its output to help predict shopping behavior.
Help the learner: low-level classifiers as new
features
44
Feature extraction is difficult and expensive in terms of time and
expertise.
In some situations, it’s almost impossible for a person to extract all
the useful low-level features (e.g. face recognition, speech
recognition).
Good systems for these problems don’t use features programmed by a person;
they automatically extract useful features directly from the data.
Traditional Machine Learning
45
This is done in deep learning (“deep” = multilevel neural network)
Note: this requires MASSIVE AMOUNTS OF DATA
Deep Learning (automatic features!)
46
Machine Learning vs Deep Learning
47
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Objectives: concepts in machine learning
48
So many options to try in machine learning!
1. Pick the right features
2. Create new features (feature engineering)
3. Simplify your feature set (clustering, dimensionality reduction)
4. Try different learning models (kNN, SVM, …)
5. Change hyperparameters of those models (like k in kNN, or
degree in polynomial regression)
6. Combine the output of different models using ensemble
methods
7. ...
A crazy number of options for learning models
49
● How do we pick the best options for our
learner?
● Simple: you test them out on more data
● The best combination of options after
testing is what you use
● Note: many good machine learning practitioners
don’t know the details of how their models
work; they just try a number of options and
leverage the data they have well
● The more data you have, the more options
you can try without running into problems
● Again, Big Data to the rescue!
So many modeling options...
50
Key: separate your test data from training data
51
1. Get the data in standard form (with all the features for every sample, and
known targets)
2. Separate it into training sets and test sets
○ (see “cross-validation” for a more standard way to do this)
3. Build each different model from the same training set
○ Pick different modeling options for each one
4. Apply the built models to predict the targets of the test set
5. Then see how well the prediction matches the right answers on the test set.
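The five-step recipe above, as a minimal scikit-learn sketch (the bundled iris data and the kNN model are stand-ins; any standard-form data set and classifier fit the same pattern):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                 # step 1: standard form
X_tr, X_te, y_tr, y_te = train_test_split(        # step 2: train/test split
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)       # step 3: build from training set
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))                    # steps 4-5: predict on the test
                                                  # set and compare to the answers
```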
● If a model is too simple,
it will clearly not
perform well, even on
training data
● Overly “complicated”
models will fit your
training data (too) well
○ But they do worse
on new data!
● You’ll often see that the
accuracy in the test set
peaks (a.k.a. errors are
minimal) at a certain
middle point
Why have a separate test set?
52
[Plot: test-set accuracy peaks at a middle complexity, between underfitting (“too simple”) and overfitting (“too complex”)]
● Pick the simplest model that still explains what you have seen
○ a.k.a. your training data!
● Why? That’s likely to be a better model than an overly
complicated one. It’s more likely to be right in newer situations.
○ a.k.a. your future test data!
But nowadays we don’t have to rely on philosophy or feelings or
arguments or… to pick our best model! We just need more data...
You’ve actually seen this before: Occam’s razor
53
● Understand concepts in human learning and
how they impact machine learning
○ How systems make predictions
● Visualize how learning algorithms work on
simple examples
○ k-nearest neighbors and others
● See the general structure for standard
approaches to machine learning
○ samples x features matrix, and targets
● Evaluating and then picking the right model
○ The importance of separating training data
from test data
Summary (1 of 2): concepts in machine learning
54
But step away from the details for a
moment...
Predicting numbers or labels from sets
of data sounds technical and limited
But the applications are powerful
and limitless…
● Speech recognition: Siri, Google
home, Alexa, Cortana…
● Natural language processing
● Medical diagnosis
● Face and expression recognition
● Autonomous driving cars
● ...
Summary (2 of 2): It’s much more than “just
data”!
55
All the following examples are in Python and using the
sklearn machine learning library
(I recommend downloading the
Anaconda Python distribution and running code in a Jupyter
notebook)
● Digit recognition classifier
○ uses SVMs, but you can try whatever you want by
changing one line in it with any of these classifiers
● Underfitting vs Overfitting regression example with
polynomial degree
● How different classifiers carve decision boundaries
● k-nearest neighbors classifier using the iris data set
○ a classic data set to test classifiers (with features like
petal length) to determine the type of iris flower
based on the features
For more information: try some tutorials
56
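The digit-recognition tutorial in miniature, as a hedged sketch (sklearn’s bundled 8x8 digit images; gamma=0.001 is the value commonly used in the scikit-learn docs). Swapping SVC for another classifier really is a one-line change:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # 1797 8x8 images, flattened to 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(gamma=0.001)                # the one line to swap for other classifiers
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))          # accuracy on held-out digits
```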
Supplementary slides
Metrics, normalization, and cross-validation
57
Numeric error measures (much more than just MSE)
Mean Square Error
● Larger errors are penalized more
Mean Absolute Error
● All errors are penalized equally
RMSLE: Root Mean Square Log Error
● Similar percent errors are penalized
equally
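A small sketch of the three measures side by side on made-up predictions (values assumed purely to show the contrast; RMSLE is computed by hand since it is just RMSE on log(1 + value)):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 100.0, 1000.0])
y_pred = np.array([12.0, 120.0, 1200.0])    # every prediction is 20% high

print(mean_squared_error(y_true, y_pred))   # MSE: the 200-unit miss dominates
print(mean_absolute_error(y_true, y_pred))  # MAE: each unit of error counts once
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
print(rmsle)                                # RMSLE: the three 20% misses look alike
```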
Classification errors
[Image-only slides: classification error measures shown on the digit recognition example]
kNN Challenges - normalization
Without normalization, features with larger variances have excessive influence
Normalization
Remember from kNN? Without normalization, some features would dominate the
distance metric
Similar problem in other classifiers. Because of this, many methods may normalize
automatically and fix the normalization parameters from the training set
Examples of typical normalizations:
● standardized = (raw − μ_training) / σ_training
● min-max scaled = (raw − min_training) / (max_training − min_training)
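A minimal sketch of fixing the normalization parameters from the training set with scikit-learn (the two-feature arrays are assumed toy values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[160.0, 90], [170.0, 120], [180.0, 110]])  # e.g. weight, IQ
X_test = np.array([[175.0, 100]])

scaler = StandardScaler().fit(X_train)  # mu and sigma come from training data only
print(scaler.transform(X_test))         # the test set reuses those same parameters
```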
Cross-validation
Hold-out test sets are a challenge
● Too small, and the test is unstable
● Too large, and you lose data for
training
A more systematic way is to use cross-validation to rotate which section is used for testing.
Cross-validation
● K-fold
● Stratified K-fold
● Leave One Out: (k-fold where k=n)
● Leave One Label Out
○ e.g. Subject-wise cross-validation
● Questions
○ Shuffle or not?
○ Put all groups together, or use separate models for each
group?
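As a minimal sketch, K-fold cross-validation in scikit-learn (iris and kNN are stand-ins; cross_val_score rotates which fold is held out and returns one score per rotation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores)                          # one accuracy per held-out fold
print(scores.mean(), scores.std())     # average performance and its stability
```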
Hyperparameter selection
Use cross-validation on one set of data to determine the hyperparameter
To report the accuracy of the model correctly, you need to use a separate test set
from that used in fitting the hyperparameter
This is a common error in hyperparameter selection and accuracy reporting, but
after this class you should be aware!
Properly performing cross-validation
Nested cross-validation - properly without a holdout test set
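A hedged sketch of that nested setup (again with iris and kNN as stand-ins): the inner GridSearchCV picks k on its own folds, and the outer cross_val_score rates the whole selection procedure on data the inner search never saw, so the reported accuracy is honest.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Inner loop: pick the hyperparameter k by cross-validation
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 5, 15]}, cv=5)

# Outer loop: score the tuned model on folds never used for tuning
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())  # report this, not the inner search's best score
```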