Introduction to
Machine Learning
Hany SalahEldeen
Introductions
> Email: hanys@uw.edu
> Response time: within 24hrs
> Office Hours: Upon Request
> Email: jav7@uw.edu
Administrivia
Prerequisites:
– Calculus
– Linear Algebra
– Programming (preferably Python)
To successfully complete this course, you must:
– Answer all quiz questions
– Submit all lab assignments
– Obtain average score of 80% or more
– Attend at least 80% of the lectures
– Participate in posts/board discussions
Administrivia
– All assignments and their due dates have been posted.
– Late submissions earn only half credit.
– No submissions are accepted more than one week after the due date.
– Assignments, Quizzes, discussion, and slides:
> https://canvas.uw.edu/courses/1243196
– Syllabus:
> https://canvas.uw.edu/courses/1243196/files/50920192/download?wrap=1
Administrivia
– James, Witten, Hastie, and Tibshirani. An Introduction to Statistical Learning (with Applications in R). http://www.statlearning.com [Required]
– Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, by Aurélien Géron [Recommended but not necessary]
– Vocareum:
> https://canvas.uw.edu/courses/1243196/files/50824726/download?wrap=1
Our Contract
We will:
– Teach you how to analyze data, preprocess it, and look for patterns
– Help you understand core machine learning algorithms, concepts, and ideas
– Cover the regression and classification classes of problems
– Show you how to train and tune models to make predictions from data
You will:
– Attend the classes
– Listen and focus
– Do the reading before class
– Do the work
– Ask questions and engage in discussions
Course Outline
1. Introduction to Statistical Learning
2. Linear Regression
3. Classification
4. Model Selection, Part 1
5. Model Selection, Part 2
6. Resampling Methods
7. Linear Model Selection and
Regularization
8. Moving Beyond Linearity
9. Bayesian Analysis
10. Dimensionality Reduction
What is
Machine Learning?
Definitions
“Learning is any process by which a system improves performance from experience.” ~ Herbert Simon
The goal: machine learning methods that are
- general purpose
- fully automatic
- “off-the-shelf”
– However, in practice, incorporating prior human knowledge is crucial
Machine Learning Models
Improving performance on a task through experience.
Simple example: classification
We have a dataset of profile pictures.
As humans, how would we classify them?
Feature Extraction
Strong features: Beard (male), Lipstick (female)
Moderate features: Jaw line (male), Eye lashes (female)
Weak features: Skin color (male), Smiling (female)
Classification
Profile 1 features: Beard 10 pt., Lipstick 0 pt., Long hair 0 pt., Short hair 2 pt., Breast 0 pt., Jawline 3 pt.
→ Probability: Male 93%, Female 7%
Profile 2 features: Beard -2 pt., Lipstick 8 pt., Long hair 7 pt., Short hair 1 pt., Breast 9 pt., Jawline -2 pt.
→ Probability: Male 11%, Female 89%
Example: Email Classification
Users receive spam emails in their inbox; we need to reduce that.
- Classify emails: detect spam and less important emails
- Reduce the % of spam emails
- Reduce the % of emails deleted without being opened
- We have a dataset of emails labelled by users
Learning, in machine learning
Real world → User interactions → Telemetry, logs, and usage → Preprocessing → Feature extraction → Model learning → Testing → Encapsulation → Analysis
It’s a multi-stage process.
Machine learning in a real-life product
Real world → User interactions → Telemetry, logs, and usage → Preprocessing → Feature extraction → Model learning → Testing → Encapsulation → Analysis
It’s a multi-stage process.
So how do we evaluate a machine learning model?
Testing/Evaluation
Validation and Tuning
Cross validation
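A minimal sketch of k-fold cross validation with scikit-learn (the dataset and model here are stand-ins, not the course’s lab setup):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in dataset and model, for illustration only
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, score on the held-out fold, rotate, average
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())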
Confusion Matrix
Precision, Recall and Confusion Matrix
Precision, Recall and F-measure
Example: In document retrieval:
• Precision:
– of the documents the model returns, how many are actually relevant
• Recall:
– of all the relevant documents, how many does the model return
• F-measure:
– the harmonic mean of precision and recall
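A minimal sketch of these metrics and the confusion matrix in scikit-learn (the label vectors are made up for illustration):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = relevant, 0 = not relevant
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model's predictions

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of the two = 0.75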
So now we know how to evaluate a machine learning model… let’s train one
Cont. Example: Email Classification
- Classify emails: detect spam and less important emails
- Reduce the % of spam emails
- Reduce the % of emails deleted without being opened
- We have a dataset of emails labelled by users
Feature Extraction
- A dataset of emails labelled by users
Classification Visualization
Red is spam
Blue is good mail
Classification
• Extract the features
• Place it where it belongs feature-wise
• Predict the class
Non-linearly Separable
Decision tree
Email Example: Decision tree
Minimize overall error
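A minimal sketch of a decision tree fit to minimize classification error on toy email features (the feature names and data are invented for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features per email: [num_links, has_attachment, sender_known]
X = np.array([[9, 1, 0], [0, 0, 1], [7, 0, 0], [1, 1, 1],
              [8, 1, 0], [0, 0, 1], [6, 1, 0], [2, 0, 1]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # 1 = spam, 0 = good mail

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["num_links", "has_attachment", "sender_known"]))
print(tree.predict([[5, 1, 0]]))          # classify a new, unseen email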
Now that we have a basic overview and understanding of Machine Learning, let’s start from the beginning and dig deep
Datasets
> Wage Data
> Standard & Poor’s Index
> Gene Expression data
> …etc
Wage Data
Standard & Poor’s Index
Gene Expression Data
Mathematics in ML
Matrix Notation
Output Vector
An output vector is used for
supervised learning
• Numeric output values for
regression
• Nominal (categorical) output
values for classification
• Rank for ranking problems
Alternative Names
Counts
‘n’ is the number of observations in a data set
(rows of the matrix)
‘p’ is the number of predictors in a data set
(columns of the matrix)
Matrix transposition
Just swap the row and column indices:
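For example, in standard notation:

$$(\mathbf{A}^T)_{ij} = \mathbf{A}_{ji}, \qquad
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}^{T}
= \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$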
Alternative Matrix Notation
Matrix expressed as a set of
column vectors, where each
column is a variable
Matrix expressed as a set of row vectors,
where each row is an observation
[the authors treat an observation vector as a column vector]
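In ISLR-style notation, the same n × p data matrix X can be written both ways:

$$\mathbf{X} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{pmatrix}
\ \text{(columns: variables)}, \qquad
\mathbf{X} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}
\ \text{(rows: observations)}$$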
Vector Multiplication
[sometimes called a dot product]
Matrix Multiplication
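The formulas these two slides illustrated, written out (here A is n × d and B is d × p):

$$\mathbf{a} \cdot \mathbf{b} = \mathbf{a}^T \mathbf{b} = \sum_{i=1}^{n} a_i b_i, \qquad
(\mathbf{A}\mathbf{B})_{ij} = \sum_{k=1}^{d} a_{ik}\, b_{kj}$$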
Terminology
• Scalar: a single numeric value
• Vector: a 1-dimensional array of values
• Matrix: a 2-dimensional array of values
• Tensor: an array of values with 3 or more dimensions
[e.g. an array of images]
Organization of The Book
1. Introduction to Statistical Learning
2. Linear Regression
3. Classification
4. Model Selection, Part 1
5. Model Selection, Part 2
6. Resampling Methods
7. Linear Model Selection and
Regularization
8. Moving Beyond Linearity
9. Bayesian Analysis
10. Dimensionality Reduction
Organization of The Book
• Statistical Learning: Terminology and Concepts, plus ‘k’ nearest
neighbor
• Regression, Part 1: Linear Regression
• Classification: Logistic Regression and Linear Discriminant Analysis
• Resampling: Cross Validation and the Bootstrap
• Regression, Part 2: Stepwise Selection, Ridge Regression, Principal
Components Regression, Partial Least Squares, and the LASSO
• Non-Linear Regression: Polynomial Regression, Splines, General
Additive Models
• Tree-Based Classification: Bagging, Boosting, and Random Forests
• Support Vector Machines
• Unsupervised Learning: Principal Component Analysis, k-Means
Clustering, and Hierarchical Clustering
Datasets referenced in the Textbook
Advertising Data: Closer look
[Scatter plot: TV advertising budget (thousands of $) vs. product sales (thousands of units)]
Advertising Data: Descriptive Statistics
[Scatter plot: TV advertising budget (thousands of $) vs. product sales (thousands of units)]
• Variable type
• (binary, categorical, integer, real)
• Distribution of variables:
• Graphing the data (histograms, density plot)
• Distribution shape
(normal, log-normal, binomial, etc.)
• Central tendency measures (mean, median, mode)
• Outlier measures (percentiles, min, max)
• Associations between variables:
• Pearson’s correlation coefficient
• Spearman’s correlation coefficient
• Mutual information
• Maximal information coefficient
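A minimal sketch of two of these association measures with SciPy (the arrays are placeholders, not the Advertising data):

import numpy as np
from scipy import stats

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # e.g. ad budget
y = np.array([7.5, 12.0, 13.5, 19.0, 21.0])    # e.g. sales

print(stats.pearsonr(x, y))    # linear association: (correlation, p-value)
print(stats.spearmanr(x, y))   # rank-based (monotonic) association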
Advertising Data: First Model!
[Scatter plot: TV advertising budget (thousands of $) vs. product sales (thousands of units)]
• 𝑌 = 𝑓(𝑋) + 𝜖
• 𝑌 is an output Sales value
• 𝑓(𝑋) is a function of TV Ad Budget
Advertising Data: First Model!
[Scatter plot: TV advertising budget (thousands of $) vs. product sales (thousands of units)]
• 𝑌 = 𝑓(𝑋) + 𝜖
• 𝑌 is an output Sales value
• 𝑓(𝑋) is a function of TV Ad Budget
➢ f(X) = 0.05 * X + 7
➢ Slope: (22 - 7) / (300 - 0) = 0.05
➢ Intercept: 22 - 0.05 * 300 = 7
➢ f(0) = 0.05 * 0 + 7 = 7
➢ f(100) = 0.05 * 100 + 7 = 12
➢ f(200) = 0.05 * 200 + 7 = 17
➢ f(300) = 0.05 * 300 + 7 = 22
• 𝜖 is a residual “error” term
(Greek letter “epsilon”)
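The hand-fit line above is easy to check in code (a sketch of this first model, not a fitted regression):

def f(x):
    # hand-fit line: slope 0.05, intercept 7
    return 0.05 * x + 7

for budget in (0, 100, 200, 300):   # TV ad budget, thousands of $
    print(budget, f(budget))        # predicted sales: 7, 12, 17, 22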
Income as a Function of Education
Income as a function of Education and
Seniority
Why Estimate f(x)?
• The hats (circumflex characters: ‘^’) indicate we’re talking about estimates rather than some notion of absolute truth
• f̂ is the function we learned from data: our function is a model that maps an input to an output
• ŷ = f̂(x) is our prediction
• Reasons:
• To predict an outcome
• To understand the influence of the predictors on the outcome
(NOTE: inferential rather than causal influence)
Prediction
• A loss function measures how well a model is able to map inputs to outputs
• For squared-error loss, the expected error decomposes as:
E(Y − Ŷ)² = [f(X) − f̂(X)]² + Var(ε)
• [f(X) − f̂(X)]² is referred to as reducible error: we could reduce the error if we had better features
• Var(ε) is referred to as irreducible error, because we believe the process is stochastic rather than deterministic
• E indicates we’re talking about an expected value (average value)
• Var indicates we’re talking about variance, the expected squared deviation from the mean
• Since we believe our residual error has a mean of zero, E(ε²) = Var(ε)
Inference [Understanding]
• Which predictors are associated with the response?
• What is the relationship between the response and each predictor?
• Can the relationship between the inputs and outputs be summarized
adequately using a linear model, or is the relationship more complex?
• Examples:
• Which media contribute to sales?
• Which media generate the biggest boost in sales?
• How much increase in sales is associated with a given increase in TV advertising?
How do we estimate f?
Parametric methods (make strong assumptions about the data and fix the number of parameters):
• linear regression
• polynomial regression
• logistic regression
• neural network
• support vector machines (linear)
Non-Parametric methods:
• nearest neighbor
• random forests
• gradient boosting
• support vector machines (RBF)
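A minimal sketch contrasting one method from each family on synthetic data (the models and data are illustrative choices, not the course’s):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=200)   # non-linear truth plus noise

# Parametric: assumes y = b0 + b1*x, so exactly two parameters are learned
linear = LinearRegression().fit(X, y)
# Non-parametric: prediction is the average y of the 5 nearest training points
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

X_new = np.array([[2.0], [5.0]])
print(linear.predict(X_new), knn.predict(X_new))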
Parametric Linear Model for Income
Non-Parametric Model for Income
Trade off Between Prediction Accuracy
and Model Interpretability
Supervised Vs. Unsupervised Learning
Supervised Learning
• The learning algorithm is given
a target output variable
• Classification: the output
variable is nominal (categorical,
qualitative)
• Regression: the output variable
is numeric (quantitative)
[Scatter plot of three labelled classes: Class 1, Class 2, Class 3]
Unsupervised Learning
• The learning algorithm is *not* given a target output variable
• Clustering
• Principal Component
Analysis
Unsupervised Learning and Class
Overlap
Measuring the Quality of the Model
Common Loss functions
• Regression
• Gaussian loss (mean squared error)
• Laplacian loss (mean absolute error)
• Classification
• Log loss
• Hinge loss
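Minimal hand-rolled sketches of these four losses (the vectors are made up for illustration):

import numpy as np

y = np.array([3.0, -0.5, 2.0])       # regression targets
y_hat = np.array([2.5, 0.0, 2.0])    # regression predictions
mse = np.mean((y - y_hat) ** 2)      # Gaussian loss (mean squared error)
mae = np.mean(np.abs(y - y_hat))     # Laplacian loss (mean absolute error)

t = np.array([1, 0, 1])              # class labels
p = np.array([0.9, 0.2, 0.6])        # predicted P(class = 1)
logloss = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

s = np.array([1, -1, 1])             # labels coded as {-1, +1}
m = np.array([0.8, -0.3, 1.5])       # raw classifier scores
hinge = np.mean(np.maximum(0.0, 1.0 - s * m))

print(mse, mae, logloss, hinge)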
Example: High Bias (Underfitting) Vs.
High Variance (Overfitting)
Bias Vs. Variance Trade-off
Bias Variance Decomposition
E[(y₀ − f̂(x₀))²] = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)
Sketch of the derivation: we’re using y = f(x) + ε; we’re adding and subtracting E[f̂(x₀)] (i.e., adding zero); we’re grouping pairs of terms and expanding the square, and the cross terms vanish in expectation.
Optimal Flexibility Varies by Problem
Classification Error
Let’s assume that one is trying to estimate f based on training observations:
{(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}
The most common approach is to estimate the ‘error rate’:
(1/n) Σᵢ I(yᵢ ≠ ŷᵢ)
where I(yᵢ ≠ ŷᵢ) is an indicator variable:
• 1 if the prediction is wrong (yᵢ ≠ ŷᵢ)
• 0 if the prediction is right
Accuracy = 1 − error rate
The error can be estimated during training (training error) or testing (test error)
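A quick numeric check of the error-rate formula (made-up label vectors):

import numpy as np

y = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 1, 1, 0, 0])
error_rate = np.mean(y != y_hat)     # (1/n) * sum of I(y_i != yhat_i) = 0.4
print(error_rate, 1 - error_rate)    # error rate and accuracy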
Bayes Classifier
• The Bayes classifier picks the class ‘j’ that maximizes the conditional probability:
Pr(Y = j | X = x₀)
• You can interpret it as:
• ‘the probability that Y is equal to j given that X is equal to x₀’
• The Bayes error rate is:
1 − E[maxⱼ Pr(Y = j | X)]
Bayes Classifier for Simulated Problem
K Nearest Neighbors
Pr(Y = j | X = x₀) = (1/K) Σ_{i ∈ 𝒩₀} I(yᵢ = j)
where 𝒩₀ is the set of indices for the ‘K’ nearest neighbors of x₀.
For classification using K nearest neighbors, we’re estimating the proportion of nearest neighbors that belong to class ‘j’.
K-Nearest Neighbor Classifier
Example (k=3)
KNN with K=10 vs. Bayes Decision
Boundary
KNN with K=1 vs. K=10
Error vs. Complexity for KNN
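A minimal sketch of the K=1 vs. K=10 comparison with scikit-learn (synthetic two-class data stands in for the book’s simulated problem):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # K=1: jagged low-bias/high-variance boundary (training error is zero);
    # K=10: smoother boundary, typically lower test error on noisy data
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))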
Course Outline
1. Introduction to Statistical Learning
2. Linear Regression
3. Classification
4. Model Selection, Part 1
5. Model Selection, Part 2
6. Resampling Methods
7. Linear Model Selection and
Regularization
8. Moving Beyond Linearity
9. Unsupervised Learning
10. Dimensionality Reduction
References
– http://www.slideshare.net/liorrokach/introduction-to-machine-learning-13809045
– http://dimacs.rutgers.edu/Workshops/MachineLearning/slides/schapire.pdf
– http://www.cs.odu.edu/~hany/teaching/cs495-f12/lectures/lecture_8/lecture_8.pdf
– http://alex.smola.org/teaching/cmu2013-10-701/slides/1_Intro.pdf
– http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/materials/Schamoni_boosteddecisiontrees.pdf
– https://www.cs.utexas.edu/~mooney/cs343/slide-handouts/learning.pdf
– http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf