Introduction to
Machine Learning
Hany SalahEldeen
Introductions
> Email: hanys@uw.edu
> Response time: within 24hrs
> Office Hours: Upon Request
> Email: jav7@uw.edu
Administrivia
Prerequisites:
– Calculus
– Linear Algebra
– Programming (preferably Python)
To successfully complete this course, you must:
– Answer all quiz questions
– Submit all lab assignments
– Obtain average score of 80% or more
– Attend at least 80% of the lectures
– Participate in posts/board discussions
Administrivia
– All assignments and their due dates have been posted.
– Late submissions earn only half credit.
– No submissions are accepted more than one week after the due date.
– Assignments, Quizzes, discussion, and slides:
> https://canvas.uw.edu/courses/1243196
– Syllabus:
> https://canvas.uw.edu/courses/1243196/files/50920192/download?wrap=1
Administrivia
– James, Witten, Hastie, and Tibshirani. An Introduction to Statistical Learning (with Applications in R). http://www.statlearning.com [Required]
– Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, by Aurélien Géron [Recommended but not necessary]
– Vocareum:
> https://canvas.uw.edu/courses/1243196/files/50824726/download?wrap=1
Our Contract
We will:
– Teach you how to analyze data, preprocess it, and look for patterns
– Help you understand core machine learning algorithms, concepts, and ideas
– Cover the regression and classification classes of problems
– Show you how to train and tune models to make predictions from data
You will:
– Attend the classes
– Listen and focus
– Do the reading before class
– Do the work
– Ask questions and engage in discussions
Course Outline
1. Introduction to Statistical Learning
2. Linear Regression
3. Classification
4. Model Selection, Part 1
5. Model Selection, Part 2
6. Resampling Methods
7. Linear Model Selection and
Regularization
8. Moving Beyond Linearity
9. Bayesian Analysis
10. Dimensionality Reduction
What is
Machine Learning?
Definitions
“Learning is any process by which a system improves performance from experience.” ~ Herbert Simon
The goal: machine learning methods that are
- general purpose
- fully automatic
- “off-the-shelf”
– However, in practice, incorporating prior human knowledge is crucial
Machine Learning Models
Improving performance on a task through experience.
Simple example: classification
We have a dataset of profile pictures.
As humans, how would we classify them?
Feature Extraction
Strong features: Beard (male), Lipstick (female)
Moderate features: Jaw line (male), Eye lashes (female)
Weak features: Skin color (male), Smiling (female)
Classification
Profile 1 features: Beard 10 pt., Lipstick 0 pt., Long hair 0 pt., Short hair 2 pt., Breast 0 pt., Jawline 3 pt.
→ Probability: Male 93%, Female 7%
Profile 2 features: Beard -2 pt., Lipstick 8 pt., Long hair 7 pt., Short hair 1 pt., Breast 9 pt., Jawline -2 pt.
→ Probability: Male 11%, Female 89%
Example: Email Classification
Users receive spam emails in their inbox; we need to reduce that.
- Classify emails: detect spam and less important emails
- Reduce the % of spam emails
- Reduce the % of emails deleted without being opened
- We have a dataset of emails labelled by users
Learning, in machine learning
Real world → User interactions → Telemetry, logs, and usage → Preprocessing → Feature extraction → Model learning → Testing → Encapsulation → Analysis
It’s a multi-stage process.
Machine learning in a real-life product
Real world → User interactions → Telemetry, logs, and usage → Preprocessing → Feature extraction → Model learning → Testing → Encapsulation → Analysis
It’s a multi-stage process.
So how do we evaluate a machine learning model?
Testing/Evaluation
Validation and Tuning
Cross validation
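A minimal sketch of k-fold cross validation with scikit-learn (the dataset and model here are stand-ins, not the course’s lab setup):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in dataset and model, for illustration only
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, score on the held-out fold, rotate, average
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())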
Confusion Matrix
Precision, Recall and Confusion Matrix
Precision, Recall and F-measure
Example: In document retrieval:
• Precision:
– of the documents the model returns, how many are actually relevant
• Recall:
– of all the relevant documents, how many does the model return
• F-measure:
– the harmonic mean of precision and recall
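A minimal sketch of these metrics and the confusion matrix in scikit-learn (the label vectors are made up for illustration):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = relevant, 0 = not relevant
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model's predictions

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of the two = 0.75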
So now we know how to evaluate a machine learning model… let’s train one
Cont. Example: Email Classification
- Classify emails: detect spam and less important emails
- Reduce the % of spam emails
- Reduce the % of emails deleted without being opened
- We have a dataset of emails labelled by users
Feature Extraction
- A dataset of emails labelled by users
Classification Visualization
Red is spam
Blue is good mail
Classification
• Extract the features
• Place it where it belongs feature-wise
• Predict the class
Non-linearly Separable
Decision tree
Email Example: Decision tree
Minimize overall error
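A minimal sketch of a decision tree fit to minimize classification error on toy email features (the feature names and data are invented for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features per email: [num_links, has_attachment, sender_known]
X = np.array([[9, 1, 0], [0, 0, 1], [7, 0, 0], [1, 1, 1],
              [8, 1, 0], [0, 0, 1], [6, 1, 0], [2, 0, 1]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # 1 = spam, 0 = good mail

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["num_links", "has_attachment", "sender_known"]))
print(tree.predict([[5, 1, 0]]))          # classify a new, unseen email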
Now that we have a basic overview and understanding of Machine Learning, let’s start from the beginning and dig deep
Datasets
> Wage Data
> Standard & Poor’s Index
> Gene Expression data
> …etc
Wage Data
Standard & Poor’s Index
Gene Expression Data
Mathematics in ML
Matrix Notation
Output Vector
An output vector is used for
supervised learning
• Numeric output values for
regression
• Nominal (categorical) output
values for classification
• Rank for ranking problems
Alternative Names
Counts
‘n’ is the number of observations in a data set
(rows of the matrix)
‘p’ is the number of predictors in a data set
(columns of the matrix)
Matrix transposition
Just swap the row and column indices:
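For example, in standard notation:

$$(\mathbf{A}^T)_{ij} = \mathbf{A}_{ji}, \qquad
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}^{T}
= \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$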
Alternative Matrix Notation
Matrix expressed as a set of
column vectors, where each
column is a variable
Matrix expressed as a set of row vectors,
where each row is an observation
[the authors treat an observation vector as a column vector]
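In ISLR-style notation, the same n × p data matrix X can be written both ways:

$$\mathbf{X} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{pmatrix}
\ \text{(columns: variables)}, \qquad
\mathbf{X} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}
\ \text{(rows: observations)}$$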
Vector Multiplication
[sometimes called a dot product]
Matrix Multiplication
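The formulas these two slides illustrated, written out (here A is n × d and B is d × p):

$$\mathbf{a} \cdot \mathbf{b} = \mathbf{a}^T \mathbf{b} = \sum_{i=1}^{n} a_i b_i, \qquad
(\mathbf{A}\mathbf{B})_{ij} = \sum_{k=1}^{d} a_{ik}\, b_{kj}$$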
Terminology
• Scalar: a single numeric value
• Vector: a 1-dimensional array of values
• Matrix: a 2-dimensional array of values
• Tensor: an array of values with 3 or more dimensions
[e.g. an array of images]
Organization of The Book
1. Introduction to Statistical Learning
2. Linear Regression
3. Classification
4. Model Selection, Part 1
5. Model Selection, Part 2
6. Resampling Methods
7. Linear Model Selection and
Regularization
8. Moving Beyond Linearity
9. Bayesian Analysis
10. Dimensionality Reduction
Organization of The Book
• Statistical Learning: Terminology and Concepts, plus ‘k’ nearest
neighbor
• Regression, Part 1: Linear Regression
• Classification: Logistic Regression and Linear Discriminant Analysis
• Resampling: Cross Validation and the Bootstrap
• Regression, Part 2: Stepwise Selection, Ridge Regression, Principal
Components Regression, Partial Least Squares, and the LASSO
• Non-Linear Regression: Polynomial Regression, Splines, General
Additive Models
• Tree-Based Classification: Bagging, Boosting, and Random Forests
• Support Vector Machines
• Unsupervised Learning: Principal Component Analysis, k-Means
Clustering, and Hierarchical Clustering
Datasets referenced in the Textbook
Advertising Data: Closer look
[Scatter plot: TV advertising budget (thousands of $) vs. product sales (thousands of units)]
Advertising Data: Descriptive Statistics
[Scatter plot: TV advertising budget (thousands of $) vs. product sales (thousands of units)]
• Variable type
• (binary, categorical, integer, real)
• Distribution of variables:
• Graphing the data (histograms, density plot)
• Distribution shape
(normal, log-normal, binomial, etc.)
• Central tendency measures (mean, median, mode)
• Outlier measures (percentiles, min, max)
• Associations between variables:
• Pearson’s correlation coefficient
• Spearman’s correlation coefficient
• Mutual information
• Maximal information coefficient
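A minimal sketch of two of these association measures with SciPy (the arrays are placeholders, not the Advertising data):

import numpy as np
from scipy import stats

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # e.g. ad budget
y = np.array([7.5, 12.0, 13.5, 19.0, 21.0])    # e.g. sales

print(stats.pearsonr(x, y))    # linear association: (correlation, p-value)
print(stats.spearmanr(x, y))   # rank-based (monotonic) association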
Advertising Data: First Model!
[Scatter plot: TV advertising budget (thousands of $) vs. product sales (thousands of units)]
• 𝑌 = 𝑓(𝑋) + 𝜖
• 𝑌 is an output Sales value
• 𝑓(𝑋) is a function of TV Ad Budget
Advertising Data: First Model!
[Scatter plot: TV advertising budget (thousands of $) vs. product sales (thousands of units)]
• 𝑌 = 𝑓(𝑋) + 𝜖
• 𝑌 is an output Sales value
• 𝑓(𝑋) is a function of TV Ad Budget
➢ f(X) = 0.05 * X + 7
➢ Slope: (22 - 7) / (300 - 0) = 0.05
➢ Intercept: 22 - 0.05 * 300 = 7
➢ f(0) = 0.05 * 0 + 7 = 7
➢ f(100) = 0.05 * 100 + 7 = 12
➢ f(200) = 0.05 * 200 + 7 = 17
➢ f(300) = 0.05 * 300 + 7 = 22
• 𝜖 is a residual “error” term
(Greek letter “epsilon”)
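The hand-fit line above is easy to check in code (a sketch of this first model, not a fitted regression):

def f(x):
    # hand-fit line: slope 0.05, intercept 7
    return 0.05 * x + 7

for budget in (0, 100, 200, 300):   # TV ad budget, thousands of $
    print(budget, f(budget))        # predicted sales: 7, 12, 17, 22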
Income as a Function of Education
Income as a function of Education and
Seniority
Why Estimate f(x)?
• The hats (circumflex characters: ‘^’) indicate we’re talking about estimates rather than some notion of absolute truth
• f̂ is the function we learned from data: our function is a model that maps an input to an output
• ŷ = f̂(x) is our prediction
• Reasons:
• To predict an outcome
• To understand the influence of the predictors on the outcome
(NOTE: inferential rather than causal influence)
Prediction
• A loss function measures how well a model is able to map inputs to outputs
• For squared-error loss, the expected error decomposes as:
E(Y − Ŷ)² = [f(X) − f̂(X)]² + Var(ε)
• [f(X) − f̂(X)]² is referred to as reducible error: we could reduce the error if we had better features
• Var(ε) is referred to as irreducible error, because we believe the process is stochastic rather than deterministic
• E indicates we’re talking about an expected value (average value)
• Var indicates we’re talking about variance, the expected squared deviation from the mean
• Since we believe our residual error has a mean of zero, E(ε²) = Var(ε)
Inference [Understanding]
• Which predictors are associated with the response?
• What is the relationship between the response and each predictor?
• Can the relationship between the inputs and outputs be summarized
adequately using a linear model, or is the relationship more complex?
• Examples:
• Which media contribute to sales?
• Which media generate the biggest boost in sales?
• How much increase in sales is associated with a given increase in TV advertising?
How do we estimate f?
Parametric methods (make strong assumptions about the data and fix the number of parameters):
• linear regression
• polynomial regression
• logistic regression
• neural network
• support vector machines (linear)
Non-Parametric methods:
• nearest neighbor
• random forests
• gradient boosting
• support vector machines (RBF)
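A minimal sketch contrasting one method from each family on synthetic data (the models and data are illustrative choices, not the course’s):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=200)   # non-linear truth plus noise

# Parametric: assumes y = b0 + b1*x, so exactly two parameters are learned
linear = LinearRegression().fit(X, y)
# Non-parametric: prediction is the average y of the 5 nearest training points
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

X_new = np.array([[2.0], [5.0]])
print(linear.predict(X_new), knn.predict(X_new))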
Parametric Linear Model for Income
Non-Parametric Model for Income
Trade off Between Prediction Accuracy
and Model Interpretability
Supervised Vs. Unsupervised Learning
Supervised Learning
• The learning algorithm is given
a target output variable
• Classification: the output
variable is nominal (categorical,
qualitative)
• Regression: the output variable
is numeric (quantitative)
[Scatter plot of three labelled classes: Class 1, Class 2, Class 3]
Unsupervised Learning
• The learning algorithm is *not* given a target output variable
• Clustering
• Principal Component
Analysis
Unsupervised Learning and Class
Overlap
Measuring the Quality of the Model
Common Loss functions
• Regression
• Gaussian loss (mean squared error)
• Laplacian loss (mean absolute error)
• Classification
• Log loss
• Hinge loss
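Minimal hand-rolled sketches of these four losses (the vectors are made up for illustration):

import numpy as np

y = np.array([3.0, -0.5, 2.0])       # regression targets
y_hat = np.array([2.5, 0.0, 2.0])    # regression predictions
mse = np.mean((y - y_hat) ** 2)      # Gaussian loss (mean squared error)
mae = np.mean(np.abs(y - y_hat))     # Laplacian loss (mean absolute error)

t = np.array([1, 0, 1])              # class labels
p = np.array([0.9, 0.2, 0.6])        # predicted P(class = 1)
logloss = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

s = np.array([1, -1, 1])             # labels coded as {-1, +1}
m = np.array([0.8, -0.3, 1.5])       # raw classifier scores
hinge = np.mean(np.maximum(0.0, 1.0 - s * m))

print(mse, mae, logloss, hinge)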
Example: High Bias (Underfitting) Vs.
High Variance (Overfitting)
Bias Vs. Variance Trade-off
Bias Variance Decomposition
E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)
Sketch of the derivation: we’re using y = f(x) + ε; we’re adding and subtracting E[f̂(x₀)] (i.e., adding zero); we’re grouping pairs of terms and expanding the square, and the cross terms vanish in expectation.
Optimal Flexibility Varies by Problem
Classification Error
Let’s assume that one is trying to estimate f based on training observations:
{(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}
The most common approach is to estimate the ‘error rate’:
(1/n) Σᵢ I(yᵢ ≠ ŷᵢ)
where I(yᵢ ≠ ŷᵢ) is an indicator variable:
• 1 if the prediction is wrong (yᵢ ≠ ŷᵢ)
• 0 if the prediction is right
Accuracy = 1 − error rate
The error can be estimated during training (training error) or testing (test error)
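A quick numeric check of the error-rate formula (made-up label vectors):

import numpy as np

y = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0])
error_rate = np.mean(y != y_hat)     # (1/n) * sum of I(y_i != yhat_i) = 0.4
print(error_rate, 1 - error_rate)    # error rate and accuracy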
Bayes Classifier
• The Bayes classifier picks the class ‘j’ that maximizes the conditional probability:
Pr(Y = j | X = x₀)
• You can interpret it as:
• ‘the probability that Y is equal to j given that X is equal to x₀’
• The Bayes error rate is:
1 − E[maxⱼ Pr(Y = j | X)]
Bayes Classifier for Simulated Problem
K Nearest Neighbors
Pr(Y = j | X = x₀) = (1/K) Σ_{i ∈ 𝒩₀} I(yᵢ = j)
where 𝒩₀ is the set of indices for the ‘K’ nearest neighbors of x₀.
For classification using K nearest neighbors, we’re estimating the proportion of nearest neighbors that belong to class ‘j’.
K-Nearest Neighbor Classifier
Example (k=3)
KNN with K=10 vs. Bayes Decision
Boundary
KNN with K=1 vs. K=10
Error vs. Complexity for KNN
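A minimal sketch of the K=1 vs. K=10 comparison with scikit-learn (synthetic two-class data stands in for the book’s simulated problem):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # K=1: jagged low-bias/high-variance boundary (training error is zero);
    # K=10: smoother boundary, typically lower test error on noisy data
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))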
Course Outline
1. Introduction to Statistical Learning
2. Linear Regression
3. Classification
4. Model Selection, Part 1
5. Model Selection, Part 2
6. Resampling Methods
7. Linear Model Selection and
Regularization
8. Moving Beyond Linearity
9. Unsupervised Learning
10. Dimensionality Reduction
References
– http://www.slideshare.net/liorrokach/introduction-to-machine-learning-13809045
– http://dimacs.rutgers.edu/Workshops/MachineLearning/slides/schapire.pdf
– http://www.cs.odu.edu/~hany/teaching/cs495-f12/lectures/lecture_8/lecture_8.pdf
– http://alex.smola.org/teaching/cmu2013-10-701/slides/1_Intro.pdf
– http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/materials/Schamoni_boosteddecisiontrees.pdf
– https://www.cs.utexas.edu/~mooney/cs343/slide-handouts/learning.pdf
– http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf