GlobalLogic Machine Learning Webinar “Advanced Statistical Methods for Linear Regression”

Advanced statistical methods for linear regression

Agenda
1. Reminder
2. Batch learning
3. Fit
4. Residuals
5. Estimator properties
6. Visualizations
7. Fisher tests
8. Regularization
9. Multiple output

Linear regression | Definition

Data model
Definition
$Y = b^{T} X + \epsilon$
X - features (regressors) | Matrix (d, n)
b - unknown parameters | Vector (d, 1)
Y - target (answer) | Vector (1, n)
epsilon - error | Vector (1, n)

Definition
Normal equation for linear regression (MSE case)
sklearn.linear_model.LinearRegression
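
A minimal sketch of the normal equation next to sklearn.linear_model.LinearRegression. It assumes the sklearn layout, where X has one row per sample, shape (n, d), so the estimate solves (X^T X) b = X^T y; with X stored as (d, n), as in the data model, the same estimate reads b = (X X^T)^{-1} X Y^T. The synthetic data and the names `b_true` and `b_normal` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 200, 3                      # sklearn convention: one row per sample
X = rng.normal(size=(n, d))
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) b = X^T y instead of inverting X^T X.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same model without an intercept, so the two fits are comparable.
model = LinearRegression(fit_intercept=False).fit(X, y)

print(b_normal)
print(model.coef_)
```

Both printed vectors should agree up to floating-point error, since both minimize the same MSE.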

Linear Regression | Batch learning

GD
Batches
Possible upgrades:
- use subsamples
- use second-order optimization procedures
Criticism:
- The normal equation already exists!
- Useless and naive
- A good idea for non-linear regression
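
The "use subsamples" upgrade can be sketched as mini-batch gradient descent on the MSE loss. This is an illustration under assumed settings (learning rate, batch size, and number of epochs are arbitrary choices, and `minibatch_gd` is a hypothetical helper), not the code from the webinar.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Mini-batch gradient descent for the MSE loss of a linear model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    b = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ b - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)   # gradient of the batch MSE
            b -= lr * grad
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
b_true = np.arange(1.0, 6.0)
y = X @ b_true + 0.01 * rng.normal(size=1000)

b_gd = minibatch_gd(X, y)
b_exact = np.linalg.solve(X.T @ X, X.T @ y)   # normal equation for comparison
print(np.abs(b_gd - b_exact).max())
```

On a small problem like this the normal equation is simply better, which is the point of the criticism; the iterative version pays off only when the closed form becomes too expensive.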

GD (d ~ 10 000)
Batches
What if:
dim(x) = (n, d),
n = 100 000,
d = 10 000
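
A back-of-the-envelope check of what these sizes mean in memory, assuming dense float64 storage (the storage type is an assumption; the slide does not state it).

```python
n, d = 100_000, 10_000
bytes_per_float = 8                      # float64

x_bytes = n * d * bytes_per_float        # design matrix X, shape (n, d)
gram_bytes = d * d * bytes_per_float     # Gram matrix X^T X, shape (d, d)

print(f"X:     {x_bytes / 1e9:.1f} GB")     # 8.0 GB
print(f"X^T X: {gram_bytes / 1e9:.1f} GB")  # 0.8 GB
```

Forming X^T X also costs on the order of n * d^2 operations, which is what makes the closed-form solution slow at this scale.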

Big scale (d ~ 10 000, n ~ 100 000, X and y are rvs)
Batches
Classic: fit time 140 s - 170 s, loss 0.074. Slow, but train quality is the best.
Iterative: fit time 49 s - 60 s, loss 0.085. Faster, but less accurate on train data.

Iterative fit with keras
Batches
classic way
iterative way
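
One possible reading of the two ways, sketched with Keras. It assumes that "classic" means handing the whole dataset to `fit()` and "iterative" means feeding batches manually with `train_on_batch()`; the data, optimizer settings, and the `make_model` helper are illustrative assumptions.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.normal(size=(n, d)).astype("float32")
y = (X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)).astype("float32").reshape(-1, 1)

def make_model():
    # A single Dense unit without activation is a linear regression.
    model = keras.Sequential([keras.Input(shape=(d,)), keras.layers.Dense(1)])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.05), loss="mse")
    return model

# Classic way: hand the whole dataset to fit().
classic = make_model()
classic.fit(X, y, epochs=5, batch_size=64, verbose=0)

# Iterative way: feed the batches manually, e.g. when X does not fit in memory.
iterative = make_model()
for epoch in range(5):
    for start in range(0, n, 64):
        iterative.train_on_batch(X[start:start + 64], y[start:start + 64])

classic_loss = classic.evaluate(X, y, verbose=0)
iterative_loss = iterative.evaluate(X, y, verbose=0)
print(classic_loss, iterative_loss)
```

In the manual loop the batches could just as well come from disk or a generator, which is what makes this style usable at large scale.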

Linear regression | Cascade of models

Cascade
Cars’ rotation angles (POC)
Problem:
- small client’s dataset (~1200 train, 400 test)
- one car per image
Model pipeline:
- Encoder
- SVD
- Ridge
Requirements:
- Near real time
- Model must be portable to C++

Cascade
Cars’ rotation angles (encoder)
Variants:
- Encoders from TensorFlow Hub
- Custom autoencoder
- Variational autoencoder
Custom autoencoder:
- 600k parameters
- 20 min fit on CPU

Cascade
Cars’ rotation angles (SVD + Ridge)
SVD:
- Decrease dimension
Ridge:
- Regularize
- Output is an angle
Pipeline: image (64, 64, 3) -> autoencoder (_, 64*3) -> SVD (_, 64) -> Ridge (_, 1)
Frameworks: tensorflow (autoencoder), sklearn (SVD, Ridge)
Characteristics
- Quick to train and to tune hyperparameters
- Cross-validation is quick
- Unstable without SVD
- TF-sklearn bottleneck: the autoencoder uses the GPU, while SVD and Ridge use the CPU
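
The sklearn part of the cascade can be sketched as a pipeline. Assumptions: random vectors stand in for the encoder outputs, `TruncatedSVD` is used for the SVD step, the angle is treated as a plain scalar target, and `alpha=1.0` is an arbitrary choice; none of these come from the slides.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Stand-in for encoder outputs: 1200 training images encoded to 64*3 features.
codes = rng.normal(size=(1200, 64 * 3))
angles = codes[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=1200)

cascade_tail = make_pipeline(
    TruncatedSVD(n_components=64, random_state=0),  # (_, 192) -> (_, 64)
    Ridge(alpha=1.0),                               # (_, 64)  -> (_, 1)
)
scores = cross_val_score(cascade_tail, codes, angles, cv=5,
                         scoring="neg_mean_squared_error")
print(scores.mean())
```

Because both steps are cheap linear operations, a full cross-validation run takes a fraction of the time needed to fit the encoder.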

Cascade
Inference optimization
SVD and Ridge are linear models:
- SVD is defined by a projection matrix, so dimension reduction is a matrix operation
- The Ridge model is described by a matrix multiplication
We can add them to the encoder as layers
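
A sketch of why the two steps collapse into one affine map. It assumes sklearn's `TruncatedSVD` (whose `transform` is a plain multiplication by `components_.T`) and `Ridge`; the random data only stands in for encoder outputs.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
codes = rng.normal(size=(500, 192))       # stand-in for encoder outputs
angles = rng.normal(size=500)

svd = TruncatedSVD(n_components=64, random_state=0).fit(codes)
ridge = Ridge(alpha=1.0).fit(svd.transform(codes), angles)

# TruncatedSVD.transform(X) is X @ components_.T, and Ridge.predict(Z) is
# Z @ coef_ + intercept_, so the two steps collapse into one affine map.
W = svd.components_.T @ ridge.coef_        # shape (192,)
bias = ridge.intercept_

two_step = ridge.predict(svd.transform(codes))
one_step = codes @ W + bias
print(np.abs(two_step - one_step).max())
```

`W` and `bias` are exactly the kernel and bias of a single dense layer with one output, so they could be loaded into such a layer appended to the encoder. With a centering projection such as PCA, the subtracted mean would fold into the bias in the same way.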

Cascade
Bigger encoder
Characteristics
- Quick to train and to tune hyperparameters
- Cross-validation is quick
- Unstable without SVD
- TF-sklearn bottleneck: the autoencoder uses the GPU, while SVD and Ridge use the CPU
1. Now the whole cascade runs on the GPU
2. We do not need to write additional C++ code for SVD and Ridge
3. Future work will involve only TF and C++ :)

Linear regression | Bias-Variance decomposition

Bias Variance tradeoff
Bias-Variance decomposition
link
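
A common way to write the decomposition for squared loss, given here for reference. The notation is an assumption: $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$, $\hat f$ is the model fitted on a random training sample, and the expectation is over both the training sample and the noise.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```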

Bias Variance tradeoff
BV for linear function
link
[Equations: two decompositions, each into Bias^2, Variance, and Irreducible error; the second is for the case where f is linear]
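
For a linear f fitted by least squares, the bias term vanishes and the variance at a point x0 has the closed form sigma^2 * x0^T (X^T X)^{-1} x0 (X with one row per sample). The Monte Carlo sketch below checks both facts; the fixed design, the noise level, and the test point `x0` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.5
X = rng.normal(size=(n, d))          # fixed design, one row per sample
b_true = np.array([1.0, -2.0, 0.5])
x0 = np.array([0.3, -0.1, 0.8])      # point where the prediction is studied

preds = []
for _ in range(20000):
    y = X @ b_true + sigma * rng.normal(size=n)      # fresh noise, same X
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    preds.append(x0 @ b_hat)
preds = np.array(preds)

bias_sq = (preds.mean() - x0 @ b_true) ** 2          # close to zero
variance = preds.var()
theory_var = sigma**2 * x0 @ np.linalg.solve(X.T @ X, x0)
print(bias_sq, variance, theory_var)
```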

Linear regression | Mixture model
(Mixture of experts, MoE, Bias minimization)

Mixture model
Mixture or various concentrations
x_i are observed, k are hidden

Mixture model
Gaussian mixture model
EM algorithm
sklearn.mixture.GaussianMixture
Bilmes
[Figure: axes x1 and x2]
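
A minimal usage sketch of sklearn.mixture.GaussianMixture, which fits the mixture with the EM algorithm. The two synthetic clusters are an assumption made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters in the (x1, x2) plane.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(300, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # fitted with EM
labels = gmm.predict(X)              # most likely hidden component k per point
resp = gmm.predict_proba(X)          # posterior P(k | x_i)

print(gmm.weights_)
print(gmm.means_)
```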

Mixture model
What if we have the following data?
[Figure: two plots, one with axes X and Y, one with axes X1, X2 and y]

Mixture model
How to fit this (GMM + linreg)
1. Fit GMM on X 2. Fit regression
1. Use the GMM to obtain clusters for X
2. Fit a stratified linear regression model using the clusters and X
In terms of MLE (maximum likelihood estimation, i.e. cross-entropy), this approach is not the best one. Its consistency properties are unknown.
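
The two-step recipe can be sketched as follows. The synthetic data (two groups of X, each with its own linear law) and the helper `predict` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two groups of X with a different linear law for y in each group.
X = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])[:, None]
slope = np.where(X[:, 0] < 0, 2.0, -1.0)
y = slope * X[:, 0] + 0.1 * rng.normal(size=600)

# Step 1: cluster X only (y is not used here).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
clusters = gmm.predict(X)

# Step 2: one linear regression per cluster.
models = {k: LinearRegression().fit(X[clusters == k], y[clusters == k])
          for k in np.unique(clusters)}

def predict(X_new):
    """Route each point to its cluster's regression."""
    k_new = gmm.predict(X_new)
    out = np.empty(len(X_new))
    for k, m in models.items():
        mask = k_new == k
        if mask.any():
            out[mask] = m.predict(X_new[mask])
    return out

print(predict(np.array([[-3.0], [3.0]])))
```

The clustering never sees y, which is why this is not a maximum likelihood fit of the joint model: two regimes that overlap in X but differ in y cannot be separated this way.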

Mixture model
Regression mixture
[Figure: axes X1, X2 and y]
All these parameters are unknown :)
sklearn doesn’t have this model

Mixture model
Regression mixture (solution)
[Figure: axes X1, X2 and y]
Faria (EM), link (SGD, NR)
Implement yourself
- use EM algorithm
- or optimize likelihood with tf
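
A minimal EM sketch for a regression mixture, assuming the model p(y | x) = sum_k pi_k * N(y | x @ b_k, sigma_k^2) with mixing weights that do not depend on x. The starting point (the ordinary least-squares fit shifted up and down) is a simple heuristic chosen for this example, not a method taken from the cited sources, and the function name is hypothetical. Column 0 of X must be the intercept column.

```python
import numpy as np

def em_regression_mixture(X, y, n_components=2, n_iter=100):
    """EM for p(y | x) = sum_k pi_k * N(y | x @ b_k, sigma_k^2)."""
    n, d = X.shape
    # Heuristic start: the least-squares fit with shifted intercepts.
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    scale = np.std(y - X @ b_ols)
    B = np.tile(b_ols, (n_components, 1))
    B[:, 0] += np.linspace(-1.0, 1.0, n_components) * scale
    var = np.full(n_components, scale**2)
    pi = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        res = y[:, None] - X @ B.T                            # shape (n, K)
        log_p = np.log(pi) - 0.5 * np.log(2 * np.pi * var) - 0.5 * res**2 / var
        log_p -= log_p.max(axis=1, keepdims=True)             # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted least squares and weighted variance per component.
        for k in range(n_components):
            w = resp[:, k]
            Xw = X * w[:, None]
            B[k] = np.linalg.solve(Xw.T @ X, Xw.T @ y)
            var[k] = max(np.sum(w * (y - X @ B[k]) ** 2) / w.sum(), 1e-8)
        pi = resp.mean(axis=0)
    return pi, B, var, resp

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(1.0, 3.0, size=n)
z = rng.integers(0, 2, size=n)                    # hidden component label
y = np.where(z == 0, 2.0 * x, -2.0 * x) + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])              # column 0 is the intercept

pi, B, var, resp = em_regression_mixture(X, y)
print(pi)
print(B)
```

EM only finds a local optimum, so on harder data several starting points are usually needed; the sketch also does not guard against a component whose total responsibility drops to zero.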

Mixture model
Regression mixture (solution)
[Figure: axes X and Y]

Linear regression | Ensemble (variance minimization)

Ensemble
Linear model ensemble (bagging)

Ensemble
An ensemble of linear models is a linear model too
optimization!
If the models' outputs are independent :)
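
A short check that averaging the predictions of bagged linear models is the same as using one linear model with averaged parameters. The data and the number of bootstrap resamples (25) are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 0.0, -2.0, 3.0]) + rng.normal(size=300)

# Bagging: fit one linear model per bootstrap resample of the training set.
models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(LinearRegression().fit(X[idx], y[idx]))

X_new = rng.normal(size=(10, 4))
avg_prediction = np.mean([m.predict(X_new) for m in models], axis=0)

# The averaged ensemble equals a single linear model with averaged parameters.
coef = np.mean([m.coef_ for m in models], axis=0)
intercept = np.mean([m.intercept_ for m in models])
single_model_prediction = X_new @ coef + intercept
print(np.abs(avg_prediction - single_model_prediction).max())
```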

Ensemble
If the models' outputs are dependent:
- Models in the ensemble should come from different paradigms (parametric/nonparametric)
- Features should be completely different for different models
for one model
The best output:

Ensemble
Linear model ensemble (stacking)
w minimizes error term variance (on train), but
This means that, for a linear model, the stack is well validated but not the best in the train sense (it might prevent overfitting for a linear model…)
Leave-one-out
test_loss estimator train_loss
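
A sketch of the leave-one-out loss for a single least-squares model, compared with its train loss. The brute-force loop refits the model n times; for least squares the same residuals follow from the known shortcut e_i / (1 - h_ii), where h_ii is the leverage of the i-th object. The data are synthetic, and stacking several models is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ b
train_loss = np.mean(residual**2)

# Leave-one-out residuals by brute force: refit without the i-th object.
loo_residual = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    loo_residual[i] = y[i] - X[i] @ b_i
loo_loss = np.mean(loo_residual**2)

# Shortcut for least squares: e_i / (1 - h_ii), h_ii = leverage of object i.
leverage = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)
shortcut = residual / (1.0 - leverage)

print(train_loss, loo_loss)
```

Since every leverage lies between 0 and 1, each leave-one-out residual is at least as large in magnitude as the train residual, so the leave-one-out loss is never below the train loss.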

Linear regression | Jackknife estimators
(why stacking works)

Jackknife
Jackknife
Then
link

Jackknife
Jackknife application
V_n is a consistent estimator of the asymptotic covariance matrix of the MSE estimator
J. Shao, Mathematical Statistics, ch 5.
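
A sketch of a jackknife covariance estimate for the least-squares coefficients, compared with the textbook formula sigma^2 (X^T X)^{-1}. It uses the usual jackknife form (n - 1) / n * sum_i (b_(-i) - mean)(b_(-i) - mean)^T; the normalization of V_n in the cited book may differ, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 3, 0.7
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 2.0]) + sigma * rng.normal(size=n)

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Leave-one-out estimates b_(-i), one row per removed object.
loo = np.empty((n, d))
for i in range(n):
    keep = np.arange(n) != i
    loo[i] = ols(X[keep], y[keep])

# Jackknife covariance estimate of the coefficient vector.
centered = loo - loo.mean(axis=0)
V_jack = (n - 1) / n * centered.T @ centered

# Textbook covariance of the least-squares estimator under homoscedastic noise.
V_classic = sigma**2 * np.linalg.inv(X.T @ X)

print(np.diag(V_jack))
print(np.diag(V_classic))
```

The jackknife version needs no knowledge of sigma, which is what makes it usable on real data.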

Jackknife
Jackknife
By the theorem, the stacking loss is a consistent test-loss estimator.
The i-th error term is estimated by a model fitted on a sample without the i-th object.
Well, there should be a 10-page article with theorems here …, link