The fundamentals of regression
@theStephLocke
Steph Locke
• CEO @ Nightingale HQ
• Data Scientist
• Author
• Microsoft Data Platform
& Artificial Intelligence
MVP
• T: @theStephLocke
• Li: /stephanielocke
The fundamentals of regression
Machine Learning algorithms
• Fitting
• Supervised
• Loss function
• Error
Fitting a model
• The process of iteratively applying an algorithm to generate a
model
• Optimises model based on the loss function
• Can rely on hyperparameters to control how the algorithm proceeds (a sketch follows below)
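As a rough sketch of that loop, here is a minimal gradient-descent fit of y = mx + c in R, using the example data from the slides that follow; the learning rate lr and iteration count n_iter are the hyperparameters, and both values are arbitrary choices for illustration:

x <- c(1, 1, 2, 2, 3, 3, 4, 4)   # example data from the slides below
y <- c(3, 4, 4, 5, 5, 6, 6, 7)
m <- 0; c0 <- 0                  # initial guesses for slope and intercept
lr <- 0.05; n_iter <- 2000       # hyperparameters: learning rate, iterations
for (i in seq_len(n_iter)) {
  pred  <- m * x + c0
  resid <- pred - y              # error: predicted minus actual
  # step both parameters downhill on the mean squared error loss
  m  <- m  - lr * mean(2 * resid * x)
  c0 <- c0 - lr * mean(2 * resid)
}
c(slope = m, intercept = c0)     # converges towards y = x + 2.5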
The fundamentals of regression
Supervised fitting
• Produce a model based on a label (aka outcome, dependent variable)
• Expresses some combination of features (aka fields, independent variables, columns)
• Loss function typically minimises error: the difference between the predicted and actual value of a label
Example
X | Y | Y=1+2X | Y=2+1X | y=2.5+1X
1 | 3 | 3 | 3 | 3.5
1 | 4 | 3 | 3 | 3.5
2 | 4 | 5 | 4 | 4.5
2 | 5 | 5 | 4 | 4.5
3 | 5 | 7 | 5 | 5.5
3 | 6 | 7 | 5 | 5.5
4 | 6 | 9 | 6 | 6.5
4 | 7 | 9 | 6 | 6.5
[Chart: the candidate lines Y=1+2x, Y=2+x and y=2.5+x plotted against the actual Y values]
The fundamentals of regression
Loss function
• A loss function is a calculation used to determine the difference
between what did happen and what the algorithm has produced
• Algorithms typically minimise the output of this function
• Selecting the right loss function is important to fitting models
appropriately
Example
X | Y | Y=1+2x | Y=2+1x | y=2.5+1x
1 | 3 | 3 | 3 | 3.5
1 | 4 | 3 | 3 | 3.5
2 | 4 | 5 | 4 | 4.5
2 | 5 | 5 | 4 | 4.5
3 | 5 | 7 | 5 | 5.5
3 | 6 | 7 | 5 | 5.5
4 | 6 | 9 | 6 | 6.5
4 | 7 | 9 | 6 | 6.5
Error (P-A) | | 8 | -4 | 0
[Chart: the candidate lines Y=1+2x, Y=2+x and y=2.5+x plotted against the actual Y values]
Example
X | Y | y=2.5+1x | Y=0+2x
1 | 3 | 3.5 | 2
1 | 4 | 3.5 | 2
2 | 4 | 4.5 | 4
2 | 5 | 4.5 | 4
3 | 5 | 5.5 | 6
3 | 6 | 5.5 | 6
4 | 6 | 6.5 | 8
4 | 7 | 6.5 | 8
Error (P-A) | | 0 | 0
[Chart: the lines y=2.5+x and y=2x plotted against the actual Y values]
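These two tables show why the choice of loss function matters: summing raw errors lets over- and under-predictions cancel out. A quick check in R on the example data (a sketch, not part of the original deck):

x <- c(1, 1, 2, 2, 3, 3, 4, 4)
y <- c(3, 4, 4, 5, 5, 6, 6, 7)
err_a <- (2.5 + 1 * x) - y    # errors (P-A) for y = 2.5 + x
err_b <- (0   + 2 * x) - y    # errors (P-A) for y = 2x
sum(err_a); sum(err_b)        # both 0: raw errors cancel out
mean(err_a^2); mean(err_b^2)  # 0.25 vs 1.5: squaring exposes the worse fit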
The fundamentals of regression
Error
• Error due to simplification is called bias
• Error due to complexity is called variance
• Error can come from how data is collected / measured
• There is always some irreducible error
The fundamentals of regression
Regression algorithms
• Features
• Assumptions
• Link functions
• Loss functions
Regression
A numeric combination of variables used to predict another variable
Regression
• At its simplest: y = mx + c
• y is a combination of:
• m units of x
• c (represents a bunch of other
stuff that can’t be explained)
[Chart: the line y = x + 2.5]
Features
• All variables must be represented numerically
• Categorical and text values can be represented in a variety of ways (one approach is sketched below)
• The handling of missing values impacts the final model
• Variables can be processed to meet assumptions
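One common way to make a categorical variable numeric is sketched below with R's model.matrix(); the colour column is an invented example:

d <- data.frame(colour = c("red", "blue", "green", "blue"),  # invented data
                y      = c(1, 2, 3, 2))
# Dummy coding (R's default): one 0/1 column per level except a baseline,
# which gets absorbed into the intercept
model.matrix(y ~ colour, data = d)
# One-hot encoding proper: one 0/1 column per level, no intercept column
model.matrix(y ~ colour - 1, data = d)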
Assumptions
❑The sample represents the population
❑All features are represented numerically
❑Features are independent and uncorrelated
❑The outcome is dependent on a combination of the features
❑The relationship is consistent across observations
❑The linear combination of variables should be normally
distributed
Multivariate
normal
The combination of variables is
normally distributed
By Bscan - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=25235145
Link function
A function to express a relationship between the linear combination
of features and the outcome
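In R this is what glm()'s family argument controls; a minimal sketch with invented count data, contrasting the identity link of plain linear regression with a log link:

d <- data.frame(x = 1:10,
                count = c(1, 1, 2, 3, 4, 6, 9, 13, 19, 28))   # invented counts
glm(count ~ x, data = d, family = gaussian(link = "identity"))  # identity link
glm(count ~ x, data = d, family = poisson(link = "log"))        # log link:
# the linear combination predicts log(count), so effects multiply the outcome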
The fundamentals of regression
Loss function
A function to express the performance of a model on the training
data
Ordinary Least Squares
• Square the residuals [Chart: residuals around y = 2x + 1]
• Sum the squares [Chart: y = 2x + 1]
• Divide by the number of observations [Chart: y = 2x + 1]
• Repeat with a new line [Chart: y = x + 2.5]
• Calculate [Chart: y = x + 2.5]
• Compare: y = x + 2.5 has the smaller loss, so it is better than y = 2x + 1
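The whole sequence above collapses to a few lines of R on the example data: square the residuals, sum them, divide by the number of observations, then compare the two candidate lines (a sketch):

x <- c(1, 1, 2, 2, 3, 3, 4, 4)
y <- c(3, 4, 4, 5, 5, 6, 6, 7)
mse <- function(pred, actual) mean((pred - actual)^2)  # square, sum, divide by n
mse(2 * x + 1, y)    # y = 2x + 1  -> 2.5
mse(x + 2.5, y)      # y = x + 2.5 -> 0.25: the smaller loss wins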
The fundamentals of regression
Linear regression
Features
Interpretation
Evaluation
Linear regression
• y = mx + c
• y is a numeric variable
• y is a linear combination of:
• m units of x
• c (represents a bunch of other
stuff that can’t be explained)
[Chart: the line y = x + 2.5]
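Fitting the same example data with R's built-in lm() recovers exactly that line (a minimal sketch):

x <- c(1, 1, 2, 2, 3, 3, 4, 4)
y <- c(3, 4, 4, 5, 5, 6, 6, 7)
fit <- lm(y ~ x)   # least-squares estimates of the intercept c and slope m
coef(fit)          # (Intercept) 2.5, x 1: i.e. y = x + 2.5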
Features
• OLS based models are sensitive to outliers
• Categorical variables are commonly included via one-hot encoding
• If two variables impact each other, you can include their
interaction as a feature
• Features on disparate numeric scales can be normalised to reduce
potential distortion
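The interaction and normalisation points look like this in R formula syntax; the data frame and effect sizes are invented for illustration:

set.seed(42)                                    # invented example data
d <- data.frame(x1 = runif(20, 0, 10),          # small-scale feature
                x2 = runif(20, 0, 1000))        # large-scale feature
d$y <- 2 * d$x1 + 0.01 * d$x2 + rnorm(20)
lm(y ~ x1 * x2, data = d)                # x1 * x2 = both main effects plus their interaction
lm(y ~ scale(x1) + scale(x2), data = d)  # scale() centres and standardises,
                                         # putting coefficients on comparable scales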
The fundamentals of regression
Interpretation
• Categoricals encoded via one-hot add to the intercept
• The coefficient for a feature represents its contribution to y for one unit of change in its value (rescaled features need translating back to their original units)
• Sign indicates correlation between variable and outcome
• P-value asterisks indicate confidence that there is a correlation
between the feature and the outcome
• Standard error indicates how precise the coefficient estimate is
An example
## Call:
## lm(formula = dist ~ speed.c, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.9800 2.1750 19.761 < 2e-16 ***
## speed.c 3.9324 0.4155 9.464 1.49e-12 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
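Reading that output against the bullets above: the speed.c coefficient says each extra unit of (mean-centred) speed adds roughly 3.9 to the predicted stopping distance, the positive sign indicates a positive correlation, the standard error of 0.4155 suggests that estimate is fairly precise, and the *** flags high confidence that the relationship is not chance.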
Evaluation
• R2 describes performance over just guessing the average
• <0 Worse!
• 0 Same
• >0 Better
• 1 No error in model (you’ve probably done something wrong!)
• Various measures take the square or the absolute of errors
• Relative to dataset
• Smaller is usually better
• Options include:
• Root Mean Squared Error
• Mean Squared Error
• Mean Absolute Error
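All of these can be computed directly from a fitted model's residuals; a sketch reproducing the cars model from the example output (assuming speed.c is mean-centred speed):

cars$speed.c <- cars$speed - mean(cars$speed)  # centre speed, as in the output
fit <- lm(dist ~ speed.c, data = cars)
res <- residuals(fit)
sqrt(mean(res^2))        # Root Mean Squared Error
mean(res^2)              # Mean Squared Error
mean(abs(res))           # Mean Absolute Error
summary(fit)$r.squared   # 0.6511, matching the Multiple R-squared below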
An example
## Call:
## lm(formula = dist ~ speed.c, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.9800 2.1750 19.761 < 2e-16 ***
## speed.c 3.9324 0.4155 9.464 1.49e-12 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Evaluation
• The distribution of residuals (errors) helps indicate problems
• They should be normally distributed
• They should be distributed across the range of fitted values with a
similar range
• Some observations can have high influence on the model (usually
outliers)
• Compare the model against other versions (fewer features, more features, etc.)
• Beware the curse of dimensionality
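R's built-in diagnostics cover most of these checks in one call; a minimal sketch on the cars model (refitted with uncentred speed, which gives the same fit):

fit <- lm(dist ~ speed, data = cars)
par(mfrow = c(2, 2))  # four diagnostic plots on one page
plot(fit)             # residuals vs fitted, normal Q-Q, scale-location, leverage
sort(cooks.distance(fit), decreasing = TRUE)[1:3]  # most influential observations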
Residuals vs actuals
[Chart: residuals vs actuals for Y=1+2x, Y=2+x, y=2.5+x and y=2x]
The fundamentals of regression
Next steps
• Continue attending
• Learn R or Python
• Check out the resources
Resources
• Making Friends with Machine Learning – YouTube
• Machine Learning Flashcards (chrisalbon.com)
• Setosa data visualization and visual explanations
• Feature Engineering and Selection: A Practical Approach for Predictive Models
• RPubs - Residual Analysis in Linear Regression
• Regression Models for Data Science in R