Prof. Pier Luca Lanzi
Regression
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Simple Linear Regression
It all starts from some data …
The training data points
Can we predict the value of y from x?
Training data points with a simple linear regression model
But the model we built will not be
used on the same data …
The model learned from the training data, applied to new data
regression = model building + model usage
How Do We Evaluate
a Regression Model?
• Given N examples, i.e., pairs (xi, yi), linear regression computes a model
ŷ = w0 + w1 x
• So that, for each point, yi = w0 + w1 xi + εi, where εi is the residual error
• We evaluate the model by computing the Residual Sum of
Squares (RSS), computed as
RSS(w0, w1) = Σi (yi − (w0 + w1 xi))²
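To make RSS concrete, here is a minimal NumPy sketch (not part of the original slides; the data and weights are toy placeholders):

```python
import numpy as np

def rss(x, y, w0, w1):
    """Residual Sum of Squares for the simple linear model y ≈ w0 + w1*x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
print(rss(x, y, 0.0, 1.0))
```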
The goal of linear regression is thus to
find the weights that minimize RSS
How Do We Compute
the Best Weights?
• Approach 1
§Set the gradient of RSS(w0,w1) to zero and solve for the
weights; the closed-form solution can be impractical for
problems with many features
• Approach 2
§Apply gradient descent: repeatedly update the weights in the
direction that decreases RSS, w ← w − η ∇RSS(w)
§If η is large, we make large steps but might not converge;
if η is small, convergence might be very slow. Typically, η adapts
over time, e.g., η(t) = α/t or α/sqrt(t)
The RSS Gradient
• For simple linear regression, the gradient of RSS has only two
components, one for w0 and one for w1:
∂RSS/∂w0 = −2 Σi (yi − (w0 + w1 xi))
∂RSS/∂w1 = −2 Σi (yi − (w0 + w1 xi)) xi
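Putting the two components together, a sketch of gradient descent for simple linear regression might look as follows (the decaying schedule η(t) = α/t and the fixed step count are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def fit_simple(x, y, alpha=0.01, n_steps=1000):
    """Fit y ≈ w0 + w1*x by gradient descent on RSS."""
    w0, w1 = 0.0, 0.0
    for t in range(1, n_steps + 1):
        eta = alpha / t                         # eta(t) = alpha/t
        residuals = y - (w0 + w1 * x)
        grad_w0 = -2.0 * np.sum(residuals)      # dRSS/dw0
        grad_w1 = -2.0 * np.sum(residuals * x)  # dRSS/dw1
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1
    return w0, w1
```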
Multiple Linear Regression
Input variable: LSTAT - % lower status of the population
Output variable: MEDV - Median value of owner-occupied homes in $1000's
Can we predict the property value
using other variables?
Multiple Linear Regression
• Given a set of examples associating LSTATi values to MEDVi
values, simple linear regression finds a function f(.) such that
MEDVi = f(LSTATi) + εi
• where εi is the error to be minimized
• Typically, we assume a model and fit the model to the data
• With a linear model, we assume that f(.) is computed as
f(x) = w0 + w1 x
• A polynomial model would fit the data points with a function
f(x) = w0 + w1 x + w2 x² + … + wp x^p
Multiple Linear Regression
• Given D input variables, it assumes that the output y can be
computed as
y = w0 + w1 x1 + … + wD xD + ε
• The model cost is computed using the residual sum of squares,
RSS(w), as
RSS(w) = Σi (yi − (w0 + w1 xi1 + … + wD xiD))²
Coefficient of Determination R²
• Total sum of squares: TSS = Σi (yi − ȳ)²
• Coefficient of determination: R² = 1 − RSS/TSS
• R² measures how well the regression line approximates the
real data points. When R² is 1, the regression line perfectly fits
the data.
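As a minimal sketch of the computation (an illustrative helper, not from the slides):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - RSS/TSS."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - rss / tss
```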
Multiple Linear Regression:
General Formulation
• In general, given a set of input variables x, a set of N examples
(xi, yi), and a set of D features hj computed from the input
variables xi, multiple linear regression assumes the model
yi = Σj wj hj(xi) + εi
• hj(.) identifies a variable derived from the original inputs
• hj(.) could be the squared value of an existing variable, a
trigonometric function, the age given the date of birth, etc.
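Derived features of this kind can be generated mechanically. As an illustrative sketch (assuming scikit-learn; the data and degree are placeholders), polynomial features hj(x) = x^j can be built with PolynomialFeatures:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 1, 20).reshape(-1, 1)   # toy input
y = np.sin(2 * np.pi * x).ravel()          # toy target

features = PolynomialFeatures(degree=5)    # h_j(x) = x^j, for j = 0..5
H = features.fit_transform(x)
model = LinearRegression().fit(H, y)
```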
Multiple Linear Regression:
General Formulation
• Multiple linear regression aims at minimizing
RSS(w) = Σi (yi − Σj wj hj(xi))²
• For this purpose, it can apply gradient descent to update the
weights as
wj ← wj + 2η Σi hj(xi)(yi − ŷi(w))
Multiple Linear Regression with
Gradient Descent
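The notebook demo referenced here is not reproduced; a compact sketch of batch gradient descent for the general formulation might look like this (the fixed step size and step count are illustrative choices):

```python
import numpy as np

def fit_multiple(H, y, eta=1e-3, n_steps=5000):
    """Batch gradient descent on RSS(w) for the model y ≈ H @ w,
    where column j of H holds feature h_j evaluated on all examples."""
    w = np.zeros(H.shape[1])
    for _ in range(n_steps):
        residuals = y - H @ w
        grad = -2.0 * H.T @ residuals   # dRSS/dw, one component per weight
        w -= eta * grad
    return w
```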
Model Evaluation
Model Evaluation
• Models should be evaluated using data that have not been used
to build the model itself
• For example, would it be feasible to evaluate students using
exactly the same problems they solved in class?
• The available data must be split between training and test
• Training data will be used to build the model
• Test data will be used to evaluate the model performance
Holdout Evaluation
• Reserves a certain amount for testing and uses the remainder for
training
§Too small training sets might result in poor weight estimation
§Too small test sets might result in a poor estimation of future
performance
• Typically,
§Reserve ½ for training and ½ for testing
§Reserve 2/3 for training and 1/3 for testing
• For small or “unbalanced” datasets, samples might not be
representative
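A sketch of holdout evaluation (assuming scikit-learn; the toy data stands in for the housing dataset, and the split follows the 2/3–1/3 rule above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))                  # toy data
y = 2.5 * X.ravel() + rng.normal(scale=1.0, size=300)

# 2/3 for training, 1/3 for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```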
Given the original dataset, we split the data into 2/3 train and 1/3 test and then apply multiple linear
regression using polynomials of increasing degree. The plots show how RSS and R² vary.
Holdout Evaluation using
the Housing Data
• Given the original dataset, we split the data into 2/3 train and 1/3
test and then apply multiple linear regression using polynomials of
increasing degree.
• RSS initially decreases as the polynomials better approximate the
data, but then higher-degree polynomials overfit
• The same pattern is shown by the R² statistic
Cross-Validation
• First step
§Data is split into k subsets of equal size
• Second step
§Each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation and avoids overlapping test sets
• Often the subsets are stratified before cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
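A sketch of k-fold cross-validation with k = 10 (assuming scikit-learn's KFold; toy data again):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.5 * X.ravel() + rng.normal(scale=1.0, size=300)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # R^2 on the held-out fold
print("average R^2 over the 10 folds:", np.mean(scores))
```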
Ten-fold Cross-Validation
[Diagram: the data is divided into ten folds numbered 1–10; in each of ten rounds, one fold is used for testing and the remaining nine for training, yielding performances p1, p2, …, p10.]
The final performance is computed as the average of the pi
Cross-Validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten? Extensive experiments have shown that this is the best
choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g., ten-fold cross-validation is repeated ten times and the results
are averaged (this reduces the variance)
• Other approaches also appear to be robust, e.g., 5×2 cross-validation
As we increase the degree of the fitting polynomial, the cross-validation error starts to increase
because the model starts to overfit! Best performance is obtained with the 5th-degree polynomial.
Fitting using the 5th degree polynomial
Overfitting
What is Overfitting?
Very good performance
on the training set
Terrible performance
on the test set
In regression, overfitting is often associated
with large weight estimates
Add to the usual cost (RSS) a term
that penalizes large weights, to avoid overfitting
Total cost = Measure of Fit + Magnitude of Coefficients
Ridge Regression (L2 Regularization)
• Minimizes the cost function
Cost(w) = RSS(w) + α Σj wj²
• If α is zero, the cost is exactly the same as before; if α is infinite,
then the only solution corresponds to having all the weights equal to 0
• In the gradient descent algorithm, the update for weight j
becomes
wj ← (1 − 2ηα) wj + 2η Σi hj(xi)(yi − ŷi(w))
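An illustrative scikit-learn sketch (the α value and polynomial degree are placeholders; how to choose α is discussed later):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + np.random.default_rng(0).normal(scale=0.1, size=30)

# degree-10 polynomial with an L2 penalty that shrinks the weights
model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=0.01))
model.fit(x, y)
```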
Lasso Regression (L1 Regularization)
• Minimizes the cost function
Cost(w) = RSS(w) + α Σj |wj|
• If α is zero, the cost is exactly the same as before; if α is infinite,
then the only solution corresponds to having all the weights equal to 0
• In gradient descent, weight j is modified as (using the subgradient of |wj|)
wj ← wj + 2η Σi hj(xi)(yi − ŷi(w)) − ηα sign(wj)
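The analogous sketch for Lasso (again with placeholder α and degree); inspecting the coefficients shows the sparsity discussed a few slides below:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + np.random.default_rng(0).normal(scale=0.1, size=30)

# degree-10 polynomial with an L1 penalty; many weights end up exactly zero
model = make_pipeline(PolynomialFeatures(degree=10), Lasso(alpha=0.01, max_iter=100_000))
model.fit(x, y)
print(model.named_steps["lasso"].coef_)
```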
Example
Simple example data generated using a trigonometric function.
Applying multiple linear regression
Absolute value of the largest weight computed using multiple linear
regression with polynomials of increasing degree
Ridge Regression
Absolute value of the largest weight computed using
ridge regression with polynomials of increasing degree
Applying ridge regression with polynomials of increasing degrees
Computed weight values when applying ridge regression with
a polynomial of degree 10 and different values of α
Lasso
Absolute value of the largest weight computed using
Lasso with polynomials of increasing degree
Computed weight values when applying Lasso with
a polynomial of degree 10 and different values of α
Lasso tends to zero out less important
features and produces sparser solutions
Basically, by penalizing large weights it
also performs feature selection
Choosing α
Available Data
• Training (model building) → Testing (model evaluation)
• Training (model building) → Validation (select α) → Testing
(model evaluation)
• Training & α selection (select the α with the smallest
cross-validation error, then train) → Testing (model evaluation)
Selecting the Best α
• To select the best value of α we cannot use the test set since it is
going to be used for evaluating the final model (which uses α)
• We need to reserve part of the training data to evaluate possible
candidate values of α and to select the best one
• If we have enough data, we can extract a validation set from the
training data which will be used to select α
• If we don’t have enough data, we should select α by applying k-
fold cross-validation over the training data, choosing the α
corresponding to the lowest average cost over the k folds
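A sketch of this procedure (the candidate grid for α is arbitrary; scikit-learn's LassoCV automates the same search):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

x = np.linspace(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + np.random.default_rng(0).normal(scale=0.1, size=50)

best_alpha, best_score = None, -np.inf
for alpha in [1.0, 0.1, 0.01, 0.001]:
    model = make_pipeline(PolynomialFeatures(degree=10),
                          Lasso(alpha=alpha, max_iter=100_000))
    score = cross_val_score(model, x, y, cv=10).mean()   # mean R^2 over the folds
    if score > best_score:
        best_alpha, best_score = alpha, score
print("best alpha:", best_alpha)
```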
Applying Lasso with an α of 0.01 and polynomials of different degrees
Applying Lasso with different values of α – the best α is 0.01
Summary
Linear Regression Algorithms
• The goal is to minimize the residual sum of squares (RSS)
• Exact methods
§Compute the set of weights that minimizes RSS
• Gradient descent (batch and stochastic)
§Start with a random set of weights and update them based on
the direction that minimizes RSS
• Ridge regression/Lasso
§Compute the cost also using the magnitude of the coefficients
§The larger the coefficients, the more likely the model is overfitting
Assignments
• Check the Python notebooks discussing simple and multiple linear
regression, Lasso and Ridge Regression