Prof. Pier Luca Lanzi
Regression
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Simple Linear Regression
It all starts from some data …
The training data points
Can we predict the value of y from x?
Training data points with a simple linear regression model
But the model we built will not be
used on the same data …
The model learned from the training data, applied to new data
regression = model building + model usage
How Do We Evaluate
a Regression Model?
• Given N examples, i.e., pairs (xi, yi), linear regression computes a model
ŷ = w0 + w1 x
• So that, for each point, yi = w0 + w1 xi + εi, where εi is the residual error
• We evaluate the model by computing the Residual Sum of
Squares (RSS), computed as
RSS(w0, w1) = Σi (yi − (w0 + w1 xi))²
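To make RSS concrete, here is a minimal NumPy sketch (not part of the original slides; the data and weights are toy placeholders):

```python
import numpy as np

def rss(x, y, w0, w1):
    """Residual Sum of Squares for the simple linear model y ≈ w0 + w1*x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
print(rss(x, y, 0.0, 1.0))
```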
The goal of linear regression is thus to
find the weights that minimize RSS
How Do We Compute
the Best Weights?
• Approach 1
§Set the gradient of RSS(w0,w1) to zero and solve for the
weights; the closed-form solution can be impractical for
problems with many features
• Approach 2
§Apply gradient descent: repeatedly update the weights in the
direction that decreases RSS, w ← w − η ∇RSS(w)
§If η is large, we make large steps but might not converge;
if η is small, convergence might be very slow. Typically, η adapts
over time, e.g., η(t) = α/t or α/sqrt(t)
The RSS Gradient
• For simple linear regression, the gradient of RSS has only two
components, one for w0 and one for w1:
∂RSS/∂w0 = −2 Σi (yi − (w0 + w1 xi))
∂RSS/∂w1 = −2 Σi (yi − (w0 + w1 xi)) xi
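Putting the two components together, a sketch of gradient descent for simple linear regression might look as follows (the decaying schedule η(t) = α/t and the fixed step count are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def fit_simple(x, y, alpha=0.01, n_steps=1000):
    """Fit y ≈ w0 + w1*x by gradient descent on RSS."""
    w0, w1 = 0.0, 0.0
    for t in range(1, n_steps + 1):
        eta = alpha / t                         # eta(t) = alpha/t
        residuals = y - (w0 + w1 * x)
        grad_w0 = -2.0 * np.sum(residuals)      # dRSS/dw0
        grad_w1 = -2.0 * np.sum(residuals * x)  # dRSS/dw1
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1
    return w0, w1
```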
Multiple Linear Regression
Input variable: LSTAT - % lower status of the population
Output variable: MEDV - Median value of owner-occupied homes in $1000's
Can we predict the property value
using other variables?
Multiple Linear Regression
• Given a set of examples associating LSTATi values to MEDVi
values, simple linear regression finds a function f(.) such that
MEDVi = f(LSTATi) + εi
• where εi is the error to be minimized
• Typically, we assume a model and fit the model to the data
• With a linear model, we assume that f(.) is computed as
f(x) = w0 + w1 x
• A polynomial model would fit the data points with a function
f(x) = w0 + w1 x + w2 x² + … + wp x^p
Multiple Linear Regression
• Given D input variables, it assumes that the output y can be
computed as
y = w0 + w1 x1 + … + wD xD + ε
• The model cost is computed using the residual sum of squares,
RSS(w), as
RSS(w) = Σi (yi − (w0 + w1 xi1 + … + wD xiD))²
Coefficient of Determination R²
• Total sum of squares: TSS = Σi (yi − ȳ)²
• Coefficient of determination: R² = 1 − RSS/TSS
• R² measures how well the regression line approximates the
real data points. When R² is 1, the regression line perfectly fits
the data.
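As a minimal sketch of the computation (an illustrative helper, not from the slides):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - RSS/TSS."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - rss / tss
```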
Multiple Linear Regression:
General Formulation
• In general, given a set of input variables x, a set of N examples
(xi, yi), and a set of D features hj computed from the input
variables xi, multiple linear regression assumes the model
yi = Σj wj hj(xi) + εi
• hj(.) identifies a variable derived from the original inputs
• hj(.) could be the squared value of an existing variable, a
trigonometric function, the age given the date of birth, etc.
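Derived features of this kind can be generated mechanically. As an illustrative sketch (assuming scikit-learn; the data and degree are placeholders), polynomial features hj(x) = x^j can be built with PolynomialFeatures:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 1, 20).reshape(-1, 1)   # toy input
y = np.sin(2 * np.pi * x).ravel()          # toy target

features = PolynomialFeatures(degree=5)    # h_j(x) = x^j, for j = 0..5
H = features.fit_transform(x)
model = LinearRegression().fit(H, y)
```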
Multiple Linear Regression:
General Formulation
• Multiple linear regression aims at minimizing
RSS(w) = Σi (yi − Σj wj hj(xi))²
• For this purpose, it can apply gradient descent to update the
weights as
wj ← wj + 2η Σi hj(xi)(yi − ŷi(w))
Multiple Linear Regression with
Gradient Descent
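The notebook demo referenced here is not reproduced; a compact sketch of batch gradient descent for the general formulation might look like this (the fixed step size and step count are illustrative choices):

```python
import numpy as np

def fit_multiple(H, y, eta=1e-3, n_steps=5000):
    """Batch gradient descent on RSS(w) for the model y ≈ H @ w,
    where column j of H holds feature h_j evaluated on all examples."""
    w = np.zeros(H.shape[1])
    for _ in range(n_steps):
        residuals = y - H @ w
        grad = -2.0 * H.T @ residuals   # dRSS/dw, one component per weight
        w -= eta * grad
    return w
```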
Model Evaluation
Model Evaluation
• Models should be evaluated using data that have not been used
to build the model itself
• For example, would it be feasible to evaluate students using
exactly the same problems they solved in class?
• The available data must be split between training and test
• Training data will be used to build the model
• Test data will be used to evaluate the model performance
Holdout Evaluation
• Reserves a certain amount for testing and uses the remainder for
training
§Too small training sets might result in poor weight estimation
§Too small test sets might result in a poor estimation of future
performance
• Typically,
§Reserve ½ for training and ½ for testing
§Reserve 2/3 for training and 1/3 for testing
• For small or “unbalanced” datasets, samples might not be
representative
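A sketch of holdout evaluation (assuming scikit-learn; the toy data stands in for the housing dataset, and the split follows the 2/3–1/3 rule above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))                  # toy data
y = 2.5 * X.ravel() + rng.normal(scale=1.0, size=300)

# 2/3 for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```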
Given the original dataset, we split the data into 2/3 train and 1/3 test and then apply multiple linear
regression using polynomials of increasing degree. The plots show how RSS and R² vary.
Holdout Evaluation using
the Housing Data
• Given the original dataset, we split the data into 2/3 train and 1/3
test and then apply multiple linear regression using polynomials of
increasing degree.
• RSS initially decreases as the polynomials better approximate the
data, but then higher-degree polynomials overfit
• The same pattern is shown by the R² statistic
Cross-Validation
• First step
§Data is split into k subsets of equal size
• Second step
§Each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation and avoids overlapping test sets
• Often the subsets are stratified before cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
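A sketch of k-fold cross-validation with k = 10 (assuming scikit-learn's KFold; toy data again):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.5 * X.ravel() + rng.normal(scale=1.0, size=300)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # R^2 on the held-out fold
print("average R^2 over the 10 folds:", np.mean(scores))
```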
Ten-fold Cross-Validation
[Diagram: the data is divided into ten folds numbered 1–10; in each of ten rounds, one fold is used for testing and the remaining nine for training, yielding performances p1, p2, …, p10.]
The final performance is computed as the average of the pi
Cross-Validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten? Extensive experiments have shown that this is the best
choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g., ten-fold cross-validation is repeated ten times and the results
are averaged (this reduces the variance)
• Other approaches also appear to be robust, e.g., 5×2 cross-validation
As we increase the degree of the fitting polynomial, the cross-validation error starts to increase
because the model starts to overfit! Best performance is obtained with the 5th-degree polynomial.
Fitting using the 5th degree polynomial
Overfitting
What is Overfitting?
Very good performance
on the training set
Terrible performance
on the test set
In regression, overfitting is often associated
with large weight estimates
Add to the usual cost (RSS) a term
that penalizes large weights, to avoid overfitting
Total cost = Measure of Fit + Magnitude of Coefficients
Ridge Regression (L2 Regularization)
• Minimizes the cost function
Cost(w) = RSS(w) + α Σj wj²
• If α is zero, the cost is exactly the same as before; if α is infinite,
then the only solution corresponds to having all the weights equal to 0
• In the gradient descent algorithm, the update for weight j
becomes
wj ← (1 − 2ηα) wj + 2η Σi hj(xi)(yi − ŷi(w))
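An illustrative scikit-learn sketch (the α value and polynomial degree are placeholders; how to choose α is discussed later):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + np.random.default_rng(0).normal(scale=0.1, size=30)

# degree-10 polynomial with an L2 penalty that shrinks the weights
model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=0.01))
model.fit(x, y)
```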
Lasso Regression (L1 Regularization)
• Minimizes the cost function
Cost(w) = RSS(w) + α Σj |wj|
• If α is zero, the cost is exactly the same as before; if α is infinite,
then the only solution corresponds to having all the weights equal to 0
• In gradient descent, weight j is modified as (using the subgradient of |wj|)
wj ← wj + 2η Σi hj(xi)(yi − ŷi(w)) − ηα sign(wj)
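The analogous sketch for Lasso (again with placeholder α and degree); inspecting the coefficients shows the sparsity discussed a few slides below:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + np.random.default_rng(0).normal(scale=0.1, size=30)

# degree-10 polynomial with an L1 penalty; many weights end up exactly zero
model = make_pipeline(PolynomialFeatures(degree=10), Lasso(alpha=0.01, max_iter=100_000))
model.fit(x, y)
print(model.named_steps["lasso"].coef_)
```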
Example
Simple example data generated using a trigonometric function.
Applying multiple linear regression
Absolute value of the largest weight computed using multiple linear
regression with polynomials of increasing degree
Ridge Regression
Absolute value of the largest weight computed using
ridge regression with polynomials of increasing degree
Applying ridge regression with polynomials of increasing degrees
Computed weight values when applying ridge regression with
a polynomial of degree 10 and different values of α
Lasso
Absolute value of the largest weight computed using
Lasso with polynomials of increasing degree
Computed weight values when applying Lasso with
a polynomial of degree 10 and different values of α
Lasso tends to zero out less important
features and produces sparser solutions
Basically, by penalizing large weights it
also performs feature selection
Choosing α
Available Data
• Training (model building) → Testing (model evaluation)
• Training (model building) → Validation (select α) → Testing
(model evaluation)
• Training & α selection (select the α with the smallest
cross-validation error, then train) → Testing (model evaluation)
Selecting the Best α
• To select the best value of α we cannot use the test set since it is
going to be used for evaluating the final model (which uses α)
• We need to reserve part of the training data to evaluate possible
candidate values of α and to select the best one
• If we have enough data, we can extract a validation set from the
training data which will be used to select α
• If we don’t have enough data, we should select α by applying k-
fold cross-validation over the training data, choosing the α
corresponding to the lowest average cost over the k folds
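A sketch of this procedure (the candidate grid for α is arbitrary; scikit-learn's LassoCV automates the same search):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = np.linspace(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + np.random.default_rng(0).normal(scale=0.1, size=50)

best_alpha, best_score = None, -np.inf
for alpha in [1.0, 0.1, 0.01, 0.001]:
    model = make_pipeline(PolynomialFeatures(degree=10),
                          Lasso(alpha=alpha, max_iter=100_000))
    score = cross_val_score(model, x, y, cv=10).mean()   # mean R^2 over the folds
    if score > best_score:
        best_alpha, best_score = alpha, score
print("best alpha:", best_alpha)
```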
Applying Lasso with an α of 0.01 and polynomials of different degrees
Applying Lasso with different values of α – the best α is 0.01
Summary
Linear Regression Algorithms
• The goal is to minimize the residual sum of squares (RSS)
• Exact methods
§Compute the set of weights that minimizes RSS
• Gradient descent (batch and stochastic)
§Start with a random set of weights and update them based on
the direction that minimizes RSS
• Ridge regression/Lasso
§Compute the cost also using the magnitude of the coefficients
§The larger the coefficients, the more likely the model is overfitting
Assignments
• Check the Python notebooks discussing simple and multiple linear
regression, Lasso and Ridge Regression