MODULE 3
Supervised ML – Regression
Shiwani Gupta
Use case
Simple Linear
Gradient Descent
Evaluation Metric
Multi Linear, Polynomial
Regularization
USE CASES
 A hospital may be interested in finding how total cost of a patient varies with severity of disease.
 Insurance companies would like to understand association between healthcare cost and ageing.
 An organization may be interested in finding relationship between revenue generated from a product and
features such as price, promotional amount spent, competitor’s price of similar product, etc.
 Restaurants would like to know relationship between customer waiting time after placing the order and the
revenue generated.
 E-commerce companies like Amazon, BigBasket, Flipkart, etc. would like to understand the relationship between revenue generated and features like no. of customer visits to the portal, no. of clicks on products, no. of items on sale, av. discount percentage, etc.
 Banks and other financial institutions would like to understand the impact of variables such as unemployment rate, marital status, bank balance, etc. on the percentage of Non Performing Assets.
2
ISSUES
Outlier
Multicollinearity
Underfitting, Overfitting
3
LINEAR REGRESSION
 Linear Regression is a Supervised Machine Learning algorithm for
predictive modelling.
 It tries to find out the best linear relationship that describes the data you have
(Scatter Plot).
 It assumes that there exists a linear relationship between a dependent
variable (usually called y) and independent variable(s) (usually called X).
 The value of the dependent / response / outcome variable of a Linear
Regression model is a continuous value / quantitative in nature i.e. real
numbers.
 Linear Regression model represents linear relationship between a dependent
variable and independent / predictor / explanatory variable(s) via a sloped
straight line.
 The sloped straight line representing the linear relationship that fits the given
data best is called a Regression Line / Best Fit Line.
 Based on the number of independent variables, there are two types of Linear Regression: Simple Linear Regression and Multiple Linear Regression.
4
INFERENCE ABOUT THE REGRESSION MODEL
 When a scatter plot shows a linear relationship between a quantitative explanatory variable x and a
quantitative response variable y, we can use the least square line fitted to the data to predict y for a given
value of x.
 We think of the least square line we calculated from the sample as an estimate of a regression line for the
population.
 Just as the sample mean is an estimate of the population mean µ.
 We will write the population regression line as μy = β0 + β1x.
 The numbers β0 and β1 are parameters that describe the population.
 We will write the least-squares line fitted to sample data as ŷ = b0 + b1x.
 This notation reminds us that the intercept b0 of the fitted line estimates the intercept β0 of the population line, and the slope b1 estimates the slope β1, respectively.
5
SIMPLE LINEAR REGRESSION
 A statistical method to summarize and study the functional relationship b/w 2 cont. variables.
 May be linear or nonlinear (eg. Population growth over time)
 The dependent variable depends only on a single independent variable.
 The form of the model is: y = β0 + β1X eg. V=I*R, Circumference = 2*pi*r, C=(F-32)*5/9, etc.
 y is a dependent variable.
 X is an independent variable.
 β0 and β1 are the regression coefficients.
 β0 is the intercept or the bias that fixes the offset to a line. It is the average y value when X = 0.
 β1 is the slope or weight that specifies the factor by which X has an impact on y.
The values of the regression parameters β0 and β1 are not known.
We estimate them from data.
6
Deterministic Relationship
Stochastic Relationship
7
REGRESSION LINE
 We will write an estimated regression line based on sample data as ŷ = b0 + b1x.
 The Least Squares Method gives us the “best” estimated line for our set of sample data.
 The method of least squares chooses the values for b0 and b1 to minimize the Sum of Squared Errors:
SSE = ∑i=1..n (yi − ŷi)² = ∑i=1..n (yi − b0 − b1xi)²
 Using calculus, we obtain the estimating formulas:
b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² = [n ∑xiyi − (∑xi)(∑yi)] / [n ∑xi² − (∑xi)²]
b0 = ȳ − b1x̄
The fitted regression line can be used to estimate y for a given value of x.
MSE = (1/n) ∑i=1..n (yi − ŷi)²   MAE = (1/n) ∑i=1..n |yi − ŷi|
8
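As a minimal illustration of these closed-form estimates (a sketch, not part of the original slides; it reuses the height/weight sample from the next slide), the following NumPy code computes b0 and b1 and then the MSE and MAE:

```python
import numpy as np

def fit_simple_linear(x, y):
    """Closed-form least-squares estimates b0, b1 for y = b0 + b1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Height (cm) and weight (kg) sample from the next slide.
x = np.array([151, 174, 138, 186, 128, 136, 179, 163, 152, 131], float)
y = np.array([63, 81, 56, 91, 47, 57, 76, 72, 62, 48], float)

b0, b1 = fit_simple_linear(x, y)   # b0 ≈ -38.455, b1 ≈ 0.6746 (cf. slide 10)
y_hat = b0 + b1 * x
mse = np.mean((y - y_hat) ** 2)    # mean squared error
mae = np.mean(np.abs(y - y_hat))   # mean absolute error
print(b0, b1, mse, mae)
```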
STEPS TO ESTABLISH A LINEAR RELATION
 Gather sample of observed height and corresponding weight.
 Create relationship model.
 Find coefficients from model and establish mathematical equation.
 Get summary of model to compute av. prediction error … Residual.
 Predict weight
height weight
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
Q = ∑i=1..n (yi − b0 − b1xi)²
We want to penalize the points which are
farther from the regression line much more
than the points which lie close to the line.
9
STRENGTH OF LINEAR ASSOCIATION: PEARSON COEFFICIENT
height (x)  weight (y)  x−xmean  y−ymean  (x−xmean)*(y−ymean)  (x−xmean)*(x−xmean)  xy  x²  y²
151 63 -2.8 -2.3 6.44 7.84 9513 22801 3969
174 81 20.2 15.7 317.14 408.04 14094 30276 6561
138 56 -15.8 -9.3 146.94 249.64 7728 19044 3136
186 91 32.2 25.7 827.54 1036.84 16926 34596 8281
128 47 -25.8 -18.3 472.14 665.64 6016 16384 2209
136 57 -17.8 -8.3 147.74 316.84 7752 18496 3249
179 76 25.2 10.7 269.64 635.04 13604 32041 5776
163 72 9.2 6.7 61.64 84.64 11736 26569 5184
152 62 -1.8 -3.3 5.94 3.24 9424 23104 3844
131 48 -22.8 -17.3 394.44 519.84 6288 17161 2304
xmean  ymean  mean((x−xmean)*(y−ymean))  mean((x−xmean)²)  sum(xy)  sum(x²)  sum(y²)
153.8  65.3  264.96  392.76  103081  240472  44513
sum(x)  sum(y)  sum(x)*sum(y)  (sum(x))²  (sum(y))²  n*sum(xy)  n*sum(x²)  n*sum(y²)
1538  653  1004314  2365444  426409  1030810  2404720  445130
b1 = 0.67461, b0 = −38.455; prediction at x = 70: ŷ = b1x + b0 = 8.767644363; r = 0.97713
r ∈ [−1, 1]: it captures magnitude as well as direction of the linear association.
10
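A short sketch (not from the slides; the helper name is illustrative) that reproduces the Pearson coefficient for the same height/weight data using the n·Σxy form shown in the table:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation via the computational formula used on this slide."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = n * (x * y).sum() - x.sum() * y.sum()
    den = np.sqrt((n * (x ** 2).sum() - x.sum() ** 2) *
                  (n * (y ** 2).sum() - y.sum() ** 2))
    return num / den

heights = [151, 174, 138, 186, 128, 136, 179, 163, 152, 131]
weights = [63, 81, 56, 91, 47, 57, 76, 72, 62, 48]
print(round(pearson_r(heights, weights), 5))  # ≈ 0.97713, matching the slide
```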
GRADIENT DESCENT
 used to minimize the cost function Q
 STEPS:
1. Random initialization for θ1 and θ0.
2. Measure how the cost function changes with change in its parameters by computing the partial derivatives of the cost function w.r.t. the parameters θ₀, θ₁, …, θₙ.
3. After computing the derivatives, update the parameters θj := θj − α ∂/∂θj Q(θ0,θ1) for j = 0, 1, where α, the learning rate, is a positive number that controls the size of each update step.
4. Repeat the process of simultaneously updating θ1 and θ0 until convergence.
 If α is too small, convergence takes too much time; if α is too large, gradient descent may fail to converge.
11
GRADIENT DESCENT FOR UNIVARIATE LINEAR REGRESSION
 Hypothesis hθ(x)=θ0+θ1x
 Cost Function J(θ0,θ1) = (1/2m) ∑i=1..m (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
 Gradient Descent to minimize cost function for Linear Regression model
 Compute derivative for j=0, j=1
12
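A minimal gradient-descent sketch for this hypothesis and cost function; the dataset (the food-intake/milk-yield pairs from the summative assessment), learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=50000):
    """Minimize J(t0, t1) = (1/2m) * sum((t0 + t1*x - y)^2) by simultaneous updates."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    t0 = t1 = 0.0                          # simple (zero) initialization
    for _ in range(iters):
        err = t0 + t1 * x - y              # h_theta(x) - y
        grad0 = err.sum() / m              # dJ/d(theta0)
        grad1 = (err * x).sum() / m        # dJ/d(theta1)
        t0, t1 = t0 - alpha * grad0, t1 - alpha * grad1  # simultaneous update
    return t0, t1

# Food intake (lb) vs milk yield (lb); the closed-form answer is b0 = 0.8, b1 = 0.65.
x = [4, 6, 10, 12]
y = [3.0, 5.5, 6.5, 9.0]
print(gradient_descent(x, y))              # approaches (0.8, 0.65)
```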
MODEL EVALUATION
The total variability of the data (SST, the dispersion of the observed variable around its mean) is equal to the variability explained by the regression line (SSR, how well our line fits the data) plus the unexplained variability (SSE), known as error: SST = SSR + SSE.
13
COEFFICIENT OF DETERMINATION
 Recall that SST measures the total variations in yi when no account of the independent
variable x is taken.
 SSE measures the variation in the yi when a regression model with the independent variable x
is used.
 A natural measure of the effect of x in reducing the variation in y can be defined as:
R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST
R² is called the coefficient of determination / goodness of fit.
 Since 0 ≤ SSE ≤ SST, it follows that 0 ≤ R² ≤ 1.
 We may interpret R² as the proportionate reduction of total variability in y associated with the use of the independent variable x.
 The larger R² is, the more the total variation of y is reduced by including the variable x in the model.
14
COEFFICIENT OF DETERMINATION
If all the observations fall on the fitted regression line, SSE = 0 and R2 = 1.
If the slope of the fitted regression line is b1 = 0, so that ŷi = ȳ, then SSE = SST and R² = 0.
The closer R² is to 1, the greater is said to be the degree of linear association between x and y.
The square root of R² is called the coefficient of correlation (r):
r = √R²
r = ∑(x − x̄)(y − ȳ) / √[∑(x − x̄)² · ∑(y − ȳ)²] = [n∑xy − (∑x)(∑y)] / √[(n∑x² − (∑x)²)(n∑y² − (∑y)²)]
15
MODEL EVALUATION : R-SQUARED
height (cm) weight (kg) ypredicted SSE = (y-ypred)2 SST = (y-ymean)2 SSR = (ypred-ymean)2
151 63 63.4111 0.16901143 5.29 3.56790543
174 81 78.9271 4.29674858 246.49 185.698945
138 56 54.6412 1.84639179 86.49 113.610444
186 91 87.0225 15.8208245 660.49 471.865268
128 47 47.8951 0.80116821 334.89 302.93124
136 57 53.292 13.7495606 68.89 144.193025
179 76 82.3002 39.692394 114.49 289.00646
163 72 71.5064 0.24361134 44.89 38.5197733
152 62 64.0857 4.35022792 10.89 1.47447592
131 48 49.9189 3.68221559 299.29 236.57793
ymean sum sum sum
65.3 82.652154 1872.1 1787.44547
R2 = SSR/SST = 0.95478
R² measures the proportion of the variation in your dependent variable explained by all your independent variables in the model; R² ∈ [0, 1].
16
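A small sketch (assumed, not from the slides) that reproduces the SSE / SST / SSR sums and R² in this table, using the coefficients from the Pearson-coefficient slide:

```python
import numpy as np

heights = np.array([151, 174, 138, 186, 128, 136, 179, 163, 152, 131], float)
weights = np.array([63, 81, 56, 91, 47, 57, 76, 72, 62, 48], float)

b1 = 0.67461          # slope from the Pearson-coefficient slide
b0 = -38.455          # intercept from the Pearson-coefficient slide
y_pred = b0 + b1 * heights

sse = ((weights - y_pred) ** 2).sum()          # unexplained variation
sst = ((weights - weights.mean()) ** 2).sum()  # total variation
ssr = ((y_pred - weights.mean()) ** 2).sum()   # variation explained by the line
print(f"SSE={sse:.2f}, SST={sst:.2f}, SSR={ssr:.2f}, R2={ssr / sst:.5f}")  # R2 ≈ 0.955
```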
ESTIMATION OF MEAN RESPONSE
 The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table:
 From the table, the least squares estimates of the regression coefficients are:
y x
1250 41
1380 54
1425 63
1425 54
1450 48
1300 46
1400 62
1510 61
1575 64
1650 71
n = 10, ∑x = 564, ∑x² = 32604, ∑y = 14365, ∑xy = 818755
b1 = [n ∑xy − (∑x)(∑y)] / [n ∑x² − (∑x)²] = [10(818755) − (564)(14365)] / [10(32604) − (564)²] = 10.8
b0 = ȳ − b1x̄ = 1436.5 − 10.8(56.4) = 828
The estimated regression function is:
ŷ = 828 + 10.8x, i.e. Sales = 828 + 10.8 · Expenditure
This means that if weekly advertising expenditure is increased by $1, we would expect the weekly sales to increase by $10.8.
Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.
For example, if the advertising expenditure is $50, then the estimated Sales is:
Sales = 828 + 10.8(50) = 1368
This is called the point estimate (forecast) of the mean response (sales).
17
EXAMPLE: SOLVE
• The primary goal of Quantitative Analysis is to use current information about a
phenomenon to predict its future behavior.
• Current information is usually in the form of data.
• In a simple case, when the data forms a set of pairs of numbers, we may interpret
them as representing the observed values of an independent (or predictor) variable
X and a dependent (or response) variable y.
• The goal of the analyst who studies the data is to find a functional relation
between the response variable y and the predictor variable x.
lot size Man-hours
30 73
20 50
60 128
80 170
40 87
50 108
60 135
30 69
70 148
60 132
[Scatter plot: "Statistical relation between Lot size and Man-Hour" — Lot size on the x-axis, Man-Hours on the y-axis, illustrating the statistical relation y = f(x).]
18
EXAMPLE: RETAIL SALES AND FLOOR SPACE
 It is customary in retail operations to assess the performance of stores partly in terms of their annual
sales relative to their floor area (square feet).
 We might expect sales to increase linearly as stores get larger, with of course individual variation among
the stores of same size.
 The regression model for a population of stores says that SALES = β0 + β1·AREA + ε
 The slope β1 is a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space.
 The intercept β0 is needed to describe the line but has no statistical importance because no stores have area close to zero.
 Floor space does not completely determine sales. The term ε in the model accounts for differences among individual stores with the same floor space. A store’s location, for example, is important.
 Residual: The difference between the observed value yi and the corresponding fitted value ŷi, i.e. ei = yi − ŷi.
 Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.
19
ANALYSIS OF RESIDUAL
 To examine whether the regression model is appropriate for the data being analyzed, we can check residual plots.
 Residual plots are:
 A scatterplot of the residuals
 Plot residuals against the fitted values.
 Plot residuals against the independent variable.
 Plot residuals over time if the data are chronological.
 The residuals should have no systematic pattern. Eg. The residual plot below shows a scatter of the points with no
individual observations or systematic change as x increases.
[Residual plot: "Degree Days Residual Plot" — Degree Days on the x-axis, Residuals on the y-axis, showing points scattered around zero with no systematic pattern.]
20
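A hedged matplotlib sketch of the residual plots described above (the lot-size/man-hours data from the earlier example is reused here purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: lot size vs man-hours from the earlier example slide.
x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], float)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
residuals = y - fitted

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, residuals)        # residuals vs fitted values
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="Fitted values", ylabel="Residuals")
axes[1].scatter(x, residuals)             # residuals vs the independent variable
axes[1].axhline(0, color="grey")
axes[1].set(xlabel="Lot size", ylabel="Residuals")
plt.tight_layout()
plt.show()                                # look for the absence of any systematic pattern
```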
RESIDUAL PLOTS
The points in this residual plot have a curve pattern, so a straight line fits
poorly
The points in this plot show more spread for larger values of the
explanatory variable x, so prediction will be less accurate when x is
large.
21
EXAMPLE: DO WAGES RISE WITH EXPERIENCE?
 Many factors affect the wages of
workers: the industry they work in,
their type of job, their education,
their experience, and changes in
general levels of wages. We will
look at a sample of 59 married
women who hold customer service
jobs in Indiana banks. The table
gives their weekly wages at a specific point in time, along with their length of service with their
employer, in months. The size of
the place of work is recorded
simply as “large” (100 or more
workers) or “small.” Because
industry, job type, and the time of
measurement are the same for all 59
subjects, we expect to see a clear
relationship between wages and
length of service.
22
EXAMPLE: DO WAGES RISE WITH EXPERIENCE?
From the previous table we have:
n = 59, ∑x = 4159, ∑x² = 451031, ∑y = 23069, ∑y² = 9460467, ∑xy = 1719376
The least squares estimates of the regression coefficients are:
b1 = [n ∑xy − (∑x)(∑y)] / [n ∑x² − (∑x)²]
b0 = ȳ − b1x̄
SSE = ∑(yi − ŷi)²   SST = ∑(yi − ȳ)²   SSR = ∑(ŷi − ȳ)²
23
USING THE REGRESSION LINE
 One of the most common reasons to fit a line to data is to predict the response to a particular value of
the explanatory variable.
 In our example, the least squares line for predicting the weekly earnings of female bank customer service workers from their length of service is ŷ = 349.4 + 0.5905x.
 For a length of service of 125 months, our least-squares regression equation gives ŷ = 349.4 + (0.5905)(125) ≈ $423 per week.
The measure of variation in the data around the fitted regression line: SSE = 36124.76, SST = 128552.5.
If SST = 0, all observations are the same (no variability). The greater SST is, the greater the variation among the y values.
SSR = SST − SSE = 128552.5 − 36124.76 = 92427.74
SSR is the variation among the predicted responses. The predicted responses lie on the least-squares line; they show how y moves in response to x.
The larger SSR is relative to SST, the greater the role of the regression line in explaining the total variability in the y observations.
R² = SSR/SST = 0.719
This indicates that most of the variability in weekly wages can be explained by the relation between length of service and weekly wages.
24
MULTIPLE LINEAR REGRESSION
 The dependent variable depends on more than one independent variables.
 The form of the model is: y = b0 + b1x1 + b2x2 + b3x3 + …… + bnxn
 May be Linear or nonlinear.
 Here,
 y is a dependent variable.
 x1, x2, …., xn are independent variables.
 b0, b1,…, bn are the regression coefficients.
 bj (1<=j<=n) is the slope or weight that specifies the factor by which Xj has an impact on Y.
25
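A minimal scikit-learn sketch of fitting such a model; the feature matrix, the response values, and the new observation are made-up placeholders, not data from the slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: each row is [x1, x2, x3]; y is the response.
X = np.array([[1200, 2, 10], [1500, 3, 5], [900, 2, 15],
              [2000, 4, 2], [1100, 3, 8], [1700, 3, 4]], dtype=float)
y = np.array([220, 290, 180, 390, 230, 330], dtype=float)

model = LinearRegression().fit(X, y)
print("b0:", model.intercept_)           # intercept b0
print("b1..bn:", model.coef_)            # one coefficient per feature
print(model.predict([[1300, 3, 7]]))     # prediction for a new observation
```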
POLYNOMIAL REGRESSION
 y = b0 + b1x1 + b2x1² + b3x1³ + … + bnx1ⁿ
 Special case of Multiple Linear Regression.
 We add some polynomial terms to the Multiple Linear regression equation to convert it into Polynomial Regression.
 Linear model with some modification in order to increase the accuracy.
 Training data is of non-linear nature.
 In Polynomial regression, the original features are converted into Polynomial features of required degree (2,3,..,n) and
then modeled using a Linear model.
 If we apply a linear model to a linear dataset, it gives good results; but if we apply the same model without any modification to a non-linear dataset, it produces poor results: the loss function increases, the error rate is high, and the accuracy decreases.
 A Polynomial Regression algorithm is also called Polynomial Linear Regression because the model remains linear in the coefficients, even though it is non-linear in the variables.
26
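A short sketch of the approach described above, converting the original feature into polynomial features of a chosen degree and then fitting an ordinary linear model (the degree and the toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Illustrative non-linear data: y roughly follows a quadratic in x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + x.ravel() + rng.normal(scale=0.3, size=30)

# Degree-2 polynomial features feeding an ordinary linear model.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict([[1.5]]))  # prediction at x = 1.5
```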
REGULARIZATION
 To avoid overfitting of training data and hence enhance generalization performance.
 Since model tries to capture noise, that doesn’t represent true properties of data.
 Regularization is a form of regression that constrains / regularizes / shrinks the coefficient estimates towards zero.
 Y represents the learned relation, β represents the coefficient estimates for
different variables or predictors (X).
 Coefficients are chosen so as to minimize loss function
27
MULTICOLLINEARITY: predictors that are highly correlated with one another
• Eg. a person’s height and weight, age and sale price of a car, or years of education and annual income
• Doesn’t affect Decision Trees (DT)
• kNN is affected
• Causes
• Insufficient data
• Dummy variables
• Including a variable in the regression that is actually a combination of two other variables
• Identify: correlation > 0.4, or a Variance Inflation Factor (VIF) score > 5, indicates high correlation
• Solutions
• Feature selection
• PCA
• More data
• Ridge regression reduces the magnitude of model coefficients
28
RIDGE REGULARIZATION (L2 NORM)
 Used when data suffers from multicollinearity
 RSS is modified by adding a shrinkage quantity governed by λ (the tuning parameter), which decides how much we want to penalize the flexibility of our model: ridge regression minimizes RSS + λ ∑j=1..p βj². The intercept β0 is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.
 If we want to minimize the above function, the coefficients need to be small.
 When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be equal to least squares.
 However, as λ→∞, the impact of the shrinkage penalty grows, and ridge regression coefficient estimates will approach
zero.
 Note: we need to standardize the predictors or bring the predictors to the same scale before performing ridge regression.
 Disadvantage: model interpretability
29
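A hedged scikit-learn sketch of ridge regression with standardized predictors; here alpha plays the role of λ, and the correlated toy data is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Illustrative data with correlated (multicollinear) predictors.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=100)])  # x2 ≈ x1
y = 3 * x1 + rng.normal(scale=0.5, size=100)

# Standardize first, then penalize the squared coefficients (L2 norm).
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_)  # shrunk, but generally non-zero, coefficients
```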
LASSO REGULARIZATION (L1 NORM)
 Least Absolute Shrinkage and Selection Operator
 This variation differs from ridge regression only in its penalty term: it uses |βj| (the modulus) instead of the square of βj, i.e. it minimizes RSS + λ ∑j |βj|.
 Lasso method also performs variable selection and is said to yield sparse models.
30
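The same setup with Lasso in place of Ridge (again a sketch on illustrative data), showing how the L1 penalty can drive a coefficient exactly to zero and thereby perform variable selection:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=100)])  # two nearly identical predictors
y = 3 * x1 + rng.normal(scale=0.5, size=100)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso.fit(X, y)
print(lasso.named_steps["lasso"].coef_)  # expect one coefficient at (or near) exactly 0.0
```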
RIDGE LASSO COMPARISON
 Ridge Regression can be thought of as solving an equation, where summation of squares of coefficients is less than or equal to
s. And Lasso can be thought of as an equation where summation of modulus of coefficients is less than or equal to s. Here, s is a
constant that exists for each value of shrinkage factor λ. These equations are also referred to as constraint functions.
 Consider there are 2 parameters in a given problem. Then according to above formulation:
 Ridge regression is expressed by β1² + β2² ≤ s. This implies that ridge regression coefficients have the smallest RSS (loss
function) for all points that lie within the circle given by β1² + β2² ≤ s.
 for Lasso, the equation becomes, |β1|+|β2| ≤ s. This implies that lasso coefficients have the smallest RSS (loss function) for all
points that lie within the diamond given by |β1|+|β2|≤ s.
The image shows the constraint regions (green areas) for Lasso (left) and Ridge regression (right), along with contours of the RSS (red ellipses). The black point marks where the least squares error is minimized; the RSS increases quadratically as we move away from that point, while the regularization term is minimized at the origin, where all the parameters are zero.
Since Ridge Regression has a circular constraint with no sharp points,
this intersection will not generally occur on an axis, and so ridge
regression coefficient estimates will be exclusively non-zero.
However, Lasso constraint has corners at each of the axes, and so the
ellipse will often intersect the constraint region at an axis. When this
occurs, one of the coefficients will equal zero.
31
BENEFIT
 Regularization significantly reduces the variance of model, without substantial increase in its bias.
 The tuning parameter λ, controls the impact on bias and variance.
 As the value of λ rises, it reduces the value of coefficients and thus reducing the variance.
 Up to a point, this increase in λ is beneficial as it is only reducing the variance (hence avoiding overfitting), without losing any important properties in the data.
 But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the value of λ should be carefully selected.
32
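Because λ must be chosen carefully, a common practice is to pick it by cross-validation over a grid of candidate values; a minimal sketch using scikit-learn's RidgeCV (the grid and the toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)

# RidgeCV evaluates each candidate lambda (alpha) by cross-validation.
alphas = np.logspace(-3, 3, 13)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X, y)
print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
```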
SUMMATIVE ASSESSMENT
3 Consider the following dataset showing relationship
between food intake (lb) of cows and milk yield (lb).
Estimate the parameters for the linear regression model
for the dataset:
Food (lb) Milk Yield (lb)
4 3.0
6 5.5
10 6.5
12 9.0
4 Fit a Linear Regression model for following relation
between mother’s Estirol level and birth weight of child
for following data:
Estirol (mg/24 hr) Birth weight (g/100)
1 1
2 1
3 2
4 2
5 4
5 Create a relationship model for given data to find
relationship b/w height and weight of students. Compute
Karl Pearson coefficient and Coefficient of determination.
REFER SLIDE 9
6 State benefits of regularization for avoiding overfitting
in Linear Regression. State mathematical formulation of
Regularization.
7 Explain steps of Gradient Descent Algorithm.
33
1. The rent of a property is related to its area. Given the area in square feet and the rent, find the relationship between area and rent using the concept of linear regression. Also predict the rent for a property of 790 square feet.
2. The marks obtained by a student depend on his/her study time. Given the study time in minutes and marks out of 2000, find the relationship between study time and marks using the concept of Linear Regression. Also predict the marks for a student who studied for 790 minutes.
Area (ft2) Rent (inr)
360 520
1070 1600
630 1000
890 850
940 1350
500 490
Study Time (min.) Marks obtained
350 520
1070 1600
630 1000
890 850
940 1350
500 490
SUMMATIVE ASSESSMENT
34
8. Use the method of Least Squares with Regression to predict the final exam grade of a student who received 86 on the midterm exam.
x (midterm) y (final exam)
65 175
67 133
71 185
71 163
66 126
75 198
67 153
70 163
71 159
69 151
9. Create a relationship model for given data to find the relationship between height and weight of students.
Height (inches) Weight (pounds)
72 200
68 165
69 160
71 163
66 126
RESOURCES
 https://www.youtube.com/watch?v=Rb8MnMEJTI4&list=PLIeGtxpvyG-KE0M1r5cjbC_7Q_dVlKVq4&index=1
 https://www.youtube.com/watch?v=ls3XKoGntXg&list=PLIeGtxpvyG-KE0M1r5cjbC_7Q_dVlKVq4&index=3
 https://www.youtube.com/watch?v=E5RjzSK0fvY
 https://www.youtube.com/watch?v=NF5_btOaCig&list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU&index=5
 https://www.youtube.com/watch?v=5Z9OIYA8He8&list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU&index=9
 https://www.youtube.com/watch?v=Xm2C_gTAl8c
 https://www.geeksforgeeks.org/mathematical-explanation-for-linear-regression-working/
 https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/
 https://www.youtube.com/playlist?list=PLIeGtxpvyG-IqjoU8IiF0Yu1WtxNq_4z-
 https://365datascience.com/r-squared/
35