MODULE 3
Supervised ML – Regression
Shiwani Gupta
Use case
Simple Linear
Gradient Descent
Evaluation Metric
Multi Linear, Polynomial
Regularization
USE CASES
 A hospital may be interested in finding how total cost of a patient varies with severity of disease.
 Insurance companies would like to understand association between healthcare cost and ageing.
 An organization may be interested in finding relationship between revenue generated from a product and
features such as price, promotional amount spent, competitor’s price of similar product, etc.
 Restaurants would like to know relationship between customer waiting time after placing the order and the
revenue generated.
 E-commerce companies like Amazon, BigBasket, Flipkart, etc. would like to understand the relationship between revenue generated and features like no. of customer visits to the portal, no. of clicks on products, no. of items on sale, av. discount percentage, etc.
 Banks and other financial institutions would like to understand the impact of variables such as unemployment rate, marital status, bank balance, etc. on the percentage of Non Performing Assets.
2
ISSUES
Outlier
Multicollinearity
Underfitting, Overfitting
3
LINEAR REGRESSION
 Linear Regression is a Supervised Machine Learning algorithm for
predictive modelling.
 It tries to find out the best linear relationship that describes the data you have
(Scatter Plot).
 It assumes that there exists a linear relationship between a dependent
variable (usually called y) and independent variable(s) (usually called X).
 The value of the dependent / response / outcome variable of a Linear
Regression model is a continuous value / quantitative in nature i.e. real
numbers.
 Linear Regression model represents linear relationship between a dependent
variable and independent / predictor / explanatory variable(s) via a sloped
straight line.
 The sloped straight line representing the linear relationship that fits the given
data best is called a Regression Line / Best Fit Line.
 Based on the number of independent variables, there are two types of Linear Regression: Simple Linear Regression and Multiple Linear Regression.
4
INFERENCE ABOUT THE REGRESSION MODEL
 When a scatter plot shows a linear relationship between a quantitative explanatory variable x and a
quantitative response variable y, we can use the least square line fitted to the data to predict y for a given
value of x.
 We think of the least square line we calculated from the sample as an estimate of a regression line for the
population.
 Just as the sample mean is an estimate of the population mean µ.
 We will write the population regression line as μy = β0 + β1x.
 The numbers β0 and β1 are parameters that describe the population.
 We will write the least-squares line fitted to sample data as ŷ = b0 + b1x.
 This notation reminds us that the intercept b0 of the fitted line estimates the intercept β0 of the population line, and the slope b1 estimates the slope β1, respectively.
5
SIMPLE LINEAR REGRESSION
 A statistical method to summarize and study the functional relationship b/w 2 cont. variables.
 May be linear or nonlinear (eg. Population growth over time)
 The dependent variable depends only on a single independent variable.
 The form of the model is: y = β0 + β1X eg. V=I*R, Circumference = 2*pi*r, C=(F-32)*5/9, etc.
 y is a dependent variable.
 X is an independent variable.
 β0 and β1 are the regression coefficients.
 β0 is the intercept or the bias that fixes the offset to a line. It is the average y value when X = 0.
 β1 is the slope or weight that specifies the factor by which X has an impact on y.
The values of the regression parameters β0 and β1 are not known.
We estimate them from data.
6
Deterministic Relationship
Stochastic Relationship
7
REGRESSION LINE
 We will write an estimated regression line based on sample data as ŷ = b0 + b1x.
 The Least Squares Method gives us the “best” estimated line for our set of sample data.
 The method of least squares chooses the values for b0 and b1 to minimize the Sum of Squared Errors:
SSE = ∑i=1..n (yi − ŷi)² = ∑i=1..n (yi − b0 − b1xi)²
 Using calculus, we obtain the estimating formulas:
b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² = [n ∑xiyi − (∑xi)(∑yi)] / [n ∑xi² − (∑xi)²]
b0 = ȳ − b1x̄
The fitted regression line can be used to estimate y for a given value of x.
MSE = (1/n) ∑i=1..n (yi − ŷi)²   MAE = (1/n) ∑i=1..n |yi − ŷi|
8
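As a minimal illustration of these closed-form estimates (a sketch, not part of the original slides; it reuses the height/weight sample from the next slide), the following NumPy code computes b0 and b1 and then the MSE and MAE:

```python
import numpy as np

def fit_simple_linear(x, y):
    """Closed-form least-squares estimates b0, b1 for y = b0 + b1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Height (cm) and weight (kg) sample from the next slide.
x = np.array([151, 174, 138, 186, 128, 136, 179, 163, 152, 131], float)
y = np.array([63, 81, 56, 91, 47, 57, 76, 72, 62, 48], float)

b0, b1 = fit_simple_linear(x, y)   # b0 ≈ -38.455, b1 ≈ 0.6746 (cf. slide 10)
y_hat = b0 + b1 * x
mse = np.mean((y - y_hat) ** 2)    # mean squared error
mae = np.mean(np.abs(y - y_hat))   # mean absolute error
print(b0, b1, mse, mae)
```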
STEPS TO ESTABLISH A LINEAR RELATION
 Gather sample of observed height and corresponding weight.
 Create relationship model.
 Find coefficients from model and establish mathematical equation.
 Get summary of model to compute av. prediction error … Residual.
 Predict weight
height weight
151 63
174 81
138 56
186 91
128 47
136 57
179 76
163 72
152 62
131 48
Q = ∑i=1..n (yi − b0 − b1xi)²
We want to penalize the points which are
farther from the regression line much more
than the points which lie close to the line.
9
STRENGTH OF LINEAR ASSOCIATION: PEARSON COEFFICIENT
height (x)  weight (y)  x−xmean  y−ymean  (x−xmean)*(y−ymean)  (x−xmean)*(x−xmean)  xy  x²  y²
151 63 -2.8 -2.3 6.44 7.84 9513 22801 3969
174 81 20.2 15.7 317.14 408.04 14094 30276 6561
138 56 -15.8 -9.3 146.94 249.64 7728 19044 3136
186 91 32.2 25.7 827.54 1036.84 16926 34596 8281
128 47 -25.8 -18.3 472.14 665.64 6016 16384 2209
136 57 -17.8 -8.3 147.74 316.84 7752 18496 3249
179 76 25.2 10.7 269.64 635.04 13604 32041 5776
163 72 9.2 6.7 61.64 84.64 11736 26569 5184
152 62 -1.8 -3.3 5.94 3.24 9424 23104 3844
131 48 -22.8 -17.3 394.44 519.84 6288 17161 2304
xmean  ymean  mean((x−xmean)*(y−ymean))  mean((x−xmean)²)  sum(xy)  sum(x²)  sum(y²)
153.8  65.3  264.96  392.76  103081  240472  44513
sum(x)  sum(y)  sum(x)*sum(y)  (sum(x))²  (sum(y))²  n*sum(xy)  n*sum(x²)  n*sum(y²)
1538  653  1004314  2365444  426409  1030810  2404720  445130
b1 = 0.67461, b0 = −38.455; prediction at x = 70: ŷ = b1x + b0 = 8.767644363; r = 0.97713
r ∈ [−1, 1]: it captures magnitude as well as direction of the linear association.
10
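A short sketch (not from the slides; the helper name is illustrative) that reproduces the Pearson coefficient for the same height/weight data using the n·Σxy form shown in the table:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation via the computational formula used on this slide."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = n * (x * y).sum() - x.sum() * y.sum()
    den = np.sqrt((n * (x ** 2).sum() - x.sum() ** 2) *
                  (n * (y ** 2).sum() - y.sum() ** 2))
    return num / den

heights = [151, 174, 138, 186, 128, 136, 179, 163, 152, 131]
weights = [63, 81, 56, 91, 47, 57, 76, 72, 62, 48]
print(round(pearson_r(heights, weights), 5))  # ≈ 0.97713, matching the slide
```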
GRADIENT DESCENT
 used to minimize the cost function Q
 STEPS:
1. Random initialization for θ1 and θ0.
2. Measure how the cost function changes with change in its parameters by computing the partial derivatives of the cost function w.r.t. the parameters θ₀, θ₁, …, θₙ.
3. After computing the derivatives, update the parameters θj := θj − α ∂/∂θj Q(θ0,θ1) for j = 0, 1, where α, the learning rate, is a positive number that controls the size of each update step.
4. Repeat the process of simultaneously updating θ1 and θ0 until convergence.
 If α is too small, convergence takes too much time; if α is too large, gradient descent may fail to converge.
11
GRADIENT DESCENT FOR UNIVARIATE LINEAR REGRESSION
 Hypothesis hθ(x)=θ0+θ1x
 Cost Function J(θ0,θ1) = (1/2m) ∑i=1..m (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
 Gradient Descent to minimize cost function for Linear Regression model
 Compute derivative for j=0, j=1
12
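A minimal gradient-descent sketch for this hypothesis and cost function; the dataset (the food-intake/milk-yield pairs from the summative assessment), learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, iters=50000):
    """Minimize J(t0, t1) = (1/2m) * sum((t0 + t1*x - y)^2) by simultaneous updates."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    t0 = t1 = 0.0                          # simple (zero) initialization
    for _ in range(iters):
        err = t0 + t1 * x - y              # h_theta(x) - y
        grad0 = err.sum() / m              # dJ/d(theta0)
        grad1 = (err * x).sum() / m        # dJ/d(theta1)
        t0, t1 = t0 - alpha * grad0, t1 - alpha * grad1  # simultaneous update
    return t0, t1

# Food intake (lb) vs milk yield (lb); the closed-form answer is b0 = 0.8, b1 = 0.65.
x = [4, 6, 10, 12]
y = [3.0, 5.5, 6.5, 9.0]
print(gradient_descent(x, y))              # approaches (0.8, 0.65)
```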
MODEL EVALUATION
The total variability of the data (SST, the dispersion of the observed variable around its mean) is equal to the variability explained by the regression line (SSR, how well our line fits the data) plus the unexplained variability (SSE), known as error: SST = SSR + SSE.
13
COEFFICIENT OF DETERMINATION
 Recall that SST measures the total variations in yi when no account of the independent
variable x is taken.
 SSE measures the variation in the yi when a regression model with the independent variable x
is used.
 A natural measure of the effect of x in reducing the variation in y can be defined as:
R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST
R² is called the coefficient of determination / goodness of fit.
 Since 0 ≤ SSE ≤ SST, it follows that 0 ≤ R² ≤ 1.
 We may interpret R² as the proportionate reduction of total variability in y associated with the use of the independent variable x.
 The larger R² is, the more the total variation of y is reduced by including the variable x in the model.
14
COEFFICIENT OF DETERMINATION
If all the observations fall on the fitted regression line, SSE = 0 and R2 = 1.
If the slope of the fitted regression line is b1 = 0, so that ŷi = ȳ, then SSE = SST and R² = 0.
The closer R² is to 1, the greater is said to be the degree of linear association between x and y.
The square root of R² is called the coefficient of correlation (r):
r = √R²
r = ∑(x − x̄)(y − ȳ) / √[∑(x − x̄)² · ∑(y − ȳ)²] = [n∑xy − (∑x)(∑y)] / √[(n∑x² − (∑x)²)(n∑y² − (∑y)²)]
15
MODEL EVALUATION : R-SQUARED
height (cm) weight (kg) ypredicted SSE = (y-ypred)2 SST = (y-ymean)2 SSR = (ypred-ymean)2
151 63 63.4111 0.16901143 5.29 3.56790543
174 81 78.9271 4.29674858 246.49 185.698945
138 56 54.6412 1.84639179 86.49 113.610444
186 91 87.0225 15.8208245 660.49 471.865268
128 47 47.8951 0.80116821 334.89 302.93124
136 57 53.292 13.7495606 68.89 144.193025
179 76 82.3002 39.692394 114.49 289.00646
163 72 71.5064 0.24361134 44.89 38.5197733
152 62 64.0857 4.35022792 10.89 1.47447592
131 48 49.9189 3.68221559 299.29 236.57793
ymean sum sum sum
65.3 82.652154 1872.1 1787.44547
R2 = SSR/SST = 0.95478
R² measures the proportion of the variation in your dependent variable explained by all your independent variables in the model; R² ∈ [0, 1].
16
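A small sketch (assumed, not from the slides) that reproduces the SSE / SST / SSR sums and R² in this table, using the coefficients from the Pearson-coefficient slide:

```python
import numpy as np

heights = np.array([151, 174, 138, 186, 128, 136, 179, 163, 152, 131], float)
weights = np.array([63, 81, 56, 91, 47, 57, 76, 72, 62, 48], float)

b1 = 0.67461          # slope from the Pearson-coefficient slide
b0 = -38.455          # intercept from the Pearson-coefficient slide
y_pred = b0 + b1 * heights

sse = ((weights - y_pred) ** 2).sum()          # unexplained variation
sst = ((weights - weights.mean()) ** 2).sum()  # total variation
ssr = ((y_pred - weights.mean()) ** 2).sum()   # variation explained by the line
print(f"SSE={sse:.2f}, SST={sst:.2f}, SSR={ssr:.2f}, R2={ssr / sst:.5f}")  # R2 ≈ 0.955
```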
ESTIMATION OF MEAN RESPONSE
 The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table:
 From the table, the least squares estimates of the regression coefficients are:
y x
1250 41
1380 54
1425 63
1425 54
1450 48
1300 46
1400 62
1510 61
1575 64
1650 71
n = 10, ∑x = 564, ∑x² = 32604, ∑y = 14365, ∑xy = 818755
b1 = [n ∑xy − (∑x)(∑y)] / [n ∑x² − (∑x)²] = [10(818755) − (564)(14365)] / [10(32604) − (564)²] = 10.8
b0 = ȳ − b1x̄ = 1436.5 − 10.8(56.4) = 828
The estimated regression function is:
ŷ = 828 + 10.8x, i.e. Sales = 828 + 10.8 · Expenditure
This means that if weekly advertising expenditure is increased by $1, we would expect the weekly sales to increase by $10.8.
Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.
For example, if the advertising expenditure is $50, then the estimated Sales is:
Sales = 828 + 10.8(50) = 1368
This is called the point estimate (forecast) of the mean response (sales).
17
EXAMPLE: SOLVE
• The primary goal of Quantitative Analysis is to use current information about a
phenomenon to predict its future behavior.
• Current information is usually in the form of data.
• In a simple case, when the data forms a set of pairs of numbers, we may interpret
them as representing the observed values of an independent (or predictor) variable
X and a dependent (or response) variable y.
• The goal of the analyst who studies the data is to find a functional relation
between the response variable y and the predictor variable x.
lot size Man-hours
30 73
20 50
60 128
80 170
40 87
50 108
60 135
30 69
70 148
60 132
[Scatter plot: "Statistical relation between Lot size and Man-Hour" — Lot size on the x-axis, Man-Hours on the y-axis, illustrating the statistical relation y = f(x).]
18
EXAMPLE: RETAIL SALES AND FLOOR SPACE
 It is customary in retail operations to assess the performance of stores partly in terms of their annual
sales relative to their floor area (square feet).
 We might expect sales to increase linearly as stores get larger, with of course individual variation among
the stores of same size.
 The regression model for a population of stores says that SALES = β0 + β1·AREA + ε
 The slope β1 is a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space.
 The intercept β0 is needed to describe the line but has no statistical importance because no stores have area close to zero.
 Floor space does not completely determine sales. The term ε in the model accounts for differences among individual stores with the same floor space. A store’s location, for example, is important.
 Residual: The difference between the observed value yi and the corresponding fitted value ŷi, i.e. ei = yi − ŷi.
 Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.
19
ANALYSIS OF RESIDUAL
 To examine whether the regression model is appropriate for the data being analyzed, we can check residual plots.
 Residual plots are:
 A scatterplot of the residuals
 Plot residuals against the fitted values.
 Plot residuals against the independent variable.
 Plot residuals over time if the data are chronological.
 The residuals should have no systematic pattern. Eg. The residual plot below shows a scatter of the points with no
individual observations or systematic change as x increases.
[Residual plot: "Degree Days Residual Plot" — Degree Days on the x-axis, Residuals on the y-axis, showing points scattered around zero with no systematic pattern.]
20
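A hedged matplotlib sketch of the residual plots described above (the lot-size/man-hours data from the earlier example is reused here purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: lot size vs man-hours from the earlier example slide.
x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], float)
y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], float)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
residuals = y - fitted

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, residuals)        # residuals vs fitted values
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="Fitted values", ylabel="Residuals")
axes[1].scatter(x, residuals)             # residuals vs the independent variable
axes[1].axhline(0, color="grey")
axes[1].set(xlabel="Lot size", ylabel="Residuals")
plt.tight_layout()
plt.show()                                # look for the absence of any systematic pattern
```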
RESIDUAL PLOTS
The points in this residual plot have a curve pattern, so a straight line fits
poorly
The points in this plot show more spread for larger values of the
explanatory variable x, so prediction will be less accurate when x is
large.
21
EXAMPLE: DO WAGES RISE WITH EXPERIENCE?
 Many factors affect the wages of
workers: the industry they work in,
their type of job, their education,
their experience, and changes in
general levels of wages. We will
look at a sample of 59 married
women who hold customer service
jobs in Indiana banks. The table
gives their weekly wages at a specific point in time, along with their length of service with their
employer, in months. The size of
the place of work is recorded
simply as “large” (100 or more
workers) or “small.” Because
industry, job type, and the time of
measurement are the same for all 59
subjects, we expect to see a clear
relationship between wages and
length of service.
22
EXAMPLE: DO WAGES RISE WITH EXPERIENCE?
From the previous table we have:
n = 59, ∑x = 4159, ∑x² = 451031, ∑y = 23069, ∑y² = 9460467, ∑xy = 1719376
The least squares estimates of the regression coefficients are:
b1 = [n ∑xy − (∑x)(∑y)] / [n ∑x² − (∑x)²]
b0 = ȳ − b1x̄
SSE = ∑(yi − ŷi)²   SST = ∑(yi − ȳ)²   SSR = ∑(ŷi − ȳ)²
23
USING THE REGRESSION LINE
 One of the most common reasons to fit a line to data is to predict the response to a particular value of
the explanatory variable.
 In our example, the least squares line for predicting the weekly earnings of female bank customer service workers from their length of service is ŷ = 349.4 + 0.5905x.
 For a length of service of 125 months, our least-squares regression equation gives ŷ = 349.4 + (0.5905)(125) ≈ $423 per week.
The measure of variation in the data around the fitted regression line: SSE = 36124.76, SST = 128552.5.
If SST = 0, all observations are the same (no variability). The greater SST is, the greater the variation among the y values.
SSR = SST − SSE = 128552.5 − 36124.76 = 92427.74
SSR is the variation among the predicted responses. The predicted responses lie on the least-squares line; they show how y moves in response to x.
The larger SSR is relative to SST, the greater the role of the regression line in explaining the total variability in the y observations.
R² = SSR/SST = 0.719
This indicates that most of the variability in weekly wages can be explained by the relation between length of service and weekly wages.
24
MULTIPLE LINEAR REGRESSION
 The dependent variable depends on more than one independent variables.
 The form of the model is: y = b0 + b1x1 + b2x2 + b3x3 + …… + bnxn
 May be Linear or nonlinear.
 Here,
 y is a dependent variable.
 x1, x2, …., xn are independent variables.
 b0, b1,…, bn are the regression coefficients.
 bj (1<=j<=n) is the slope or weight that specifies the factor by which Xj has an impact on Y.
25
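A minimal scikit-learn sketch of fitting such a model; the feature matrix, the response values, and the new observation are made-up placeholders, not data from the slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: each row is [x1, x2, x3]; y is the response.
X = np.array([[1200, 2, 10], [1500, 3, 5], [900, 2, 15],
              [2000, 4, 2], [1100, 3, 8], [1700, 3, 4]], dtype=float)
y = np.array([220, 290, 180, 390, 230, 330], dtype=float)

model = LinearRegression().fit(X, y)
print("b0:", model.intercept_)           # intercept b0
print("b1..bn:", model.coef_)            # one coefficient per feature
print(model.predict([[1300, 3, 7]]))     # prediction for a new observation
```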
POLYNOMIAL REGRESSION
 y = b0 + b1x1 + b2x1² + b3x1³ + … + bnx1ⁿ
 Special case of Multiple Linear Regression.
 We add some polynomial terms to the Multiple Linear regression equation to convert it into Polynomial Regression.
 Linear model with some modification in order to increase the accuracy.
 Training data is of non-linear nature.
 In Polynomial regression, the original features are converted into Polynomial features of required degree (2,3,..,n) and
then modeled using a Linear model.
 If we apply a linear model to a linear dataset, it gives good results; but if we apply the same model without any modification to a non-linear dataset, it produces poor results: the loss function increases, the error rate is high, and the accuracy decreases.
 A Polynomial Regression algorithm is also called Polynomial Linear Regression because the model remains linear in the coefficients, even though it is non-linear in the variables.
26
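A short sketch of the approach described above, converting the original feature into polynomial features of a chosen degree and then fitting an ordinary linear model (the degree and the toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Illustrative non-linear data: y roughly follows a quadratic in x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + x.ravel() + rng.normal(scale=0.3, size=30)

# Degree-2 polynomial features feeding an ordinary linear model.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict([[1.5]]))  # prediction at x = 1.5
```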
REGULARIZATION
 To avoid overfitting of training data and hence enhance generalization performance.
 Since model tries to capture noise, that doesn’t represent true properties of data.
 Regularization is a form of regression that constrains / regularizes / shrinks the coefficient estimates towards zero.
 Y represents the learned relation, β represents the coefficient estimates for
different variables or predictors (X).
 Coefficients are chosen so as to minimize loss function
27
MULTICOLLINEARITY: predictors that are highly correlated with one another
• Eg. a person’s height and weight, age and sale price of a car, or years of education and annual income
• Doesn’t affect Decision Trees (DT)
• kNN is affected
• Causes
• Insufficient data
• Dummy variables
• Including a variable in the regression that is actually a combination of two other variables
• Identify: correlation > 0.4, or a Variance Inflation Factor (VIF) score > 5, indicates high correlation
• Solutions
• Feature selection
• PCA
• More data
• Ridge regression reduces the magnitude of model coefficients
28
RIDGE REGULARIZATION (L2 NORM)
 Used when data suffers from multicollinearity
 RSS is modified by adding a shrinkage quantity governed by λ (the tuning parameter), which decides how much we want to penalize the flexibility of our model: ridge regression minimizes RSS + λ ∑j=1..p βj². The intercept β0 is a measure of the mean value of the response when xi1 = xi2 = … = xip = 0.
 If we want to minimize the above function, the coefficients need to be small.
 When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be equal to least squares.
 However, as λ→∞, the impact of the shrinkage penalty grows, and ridge regression coefficient estimates will approach
zero.
 Note: we need to standardize the predictors or bring the predictors to the same scale before performing ridge regression.
 Disadvantage: model interpretability
29
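A hedged scikit-learn sketch of ridge regression with standardized predictors; here alpha plays the role of λ, and the correlated toy data is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Illustrative data with correlated (multicollinear) predictors.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=100)])  # x2 ≈ x1
y = 3 * x1 + rng.normal(scale=0.5, size=100)

# Standardize first, then penalize the squared coefficients (L2 norm).
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)
print(ridge.named_steps["ridge"].coef_)  # shrunk, but generally non-zero, coefficients
```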
LASSO REGULARIZATION (L1 NORM)
 Least Absolute Shrinkage and Selection Operator
 This variation differs from ridge regression only in its penalty term: it uses |βj| (the modulus) instead of the square of βj, i.e. it minimizes RSS + λ ∑j |βj|.
 Lasso method also performs variable selection and is said to yield sparse models.
30
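The same setup with Lasso in place of Ridge (again a sketch on illustrative data), showing how the L1 penalty can drive a coefficient exactly to zero and thereby perform variable selection:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.05, size=100)])  # two nearly identical predictors
y = 3 * x1 + rng.normal(scale=0.5, size=100)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
lasso.fit(X, y)
print(lasso.named_steps["lasso"].coef_)  # expect one coefficient at (or near) exactly 0.0
```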
RIDGE LASSO COMPARISON
 Ridge Regression can be thought of as solving an equation, where summation of squares of coefficients is less than or equal to
s. And Lasso can be thought of as an equation where summation of modulus of coefficients is less than or equal to s. Here, s is a
constant that exists for each value of shrinkage factor λ. These equations are also referred to as constraint functions.
 Consider there are 2 parameters in a given problem. Then according to above formulation:
 Ridge regression is expressed by β1² + β2² ≤ s. This implies that ridge regression coefficients have the smallest RSS (loss
function) for all points that lie within the circle given by β1² + β2² ≤ s.
 for Lasso, the equation becomes, |β1|+|β2| ≤ s. This implies that lasso coefficients have the smallest RSS (loss function) for all
points that lie within the diamond given by |β1|+|β2|≤ s.
The image shows the constraint regions (green areas) for Lasso (left) and Ridge regression (right), along with contours of the RSS (red ellipses). The black point marks where the least squares error is minimized; the RSS increases quadratically as we move away from that point, while the regularization term is minimized at the origin, where all the parameters are zero.
Since Ridge Regression has a circular constraint with no sharp points,
this intersection will not generally occur on an axis, and so ridge
regression coefficient estimates will be exclusively non-zero.
However, Lasso constraint has corners at each of the axes, and so the
ellipse will often intersect the constraint region at an axis. When this
occurs, one of the coefficients will equal zero.
31
BENEFIT
 Regularization significantly reduces the variance of model, without substantial increase in its bias.
 The tuning parameter λ, controls the impact on bias and variance.
 As the value of λ rises, it reduces the value of coefficients and thus reducing the variance.
 Up to a point, this increase in λ is beneficial as it is only reducing the variance (hence avoiding overfitting), without losing any important properties in the data.
 But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the value of λ should be carefully selected.
32
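Because λ must be chosen carefully, a common practice is to pick it by cross-validation over a grid of candidate values; a minimal sketch using scikit-learn's RidgeCV (the grid and the toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=100)

# RidgeCV evaluates each candidate lambda (alpha) by cross-validation.
alphas = np.logspace(-3, 3, 13)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X, y)
print("chosen lambda:", model.named_steps["ridgecv"].alpha_)
```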
SUMMATIVE ASSESSMENT
3 Consider the following dataset showing relationship
between food intake (lb) of cows and milk yield (lb).
Estimate the parameters for the linear regression model
for the dataset:
Food (lb) Milk Yield (lb)
4 3.0
6 5.5
10 6.5
12 9.0
4 Fit a Linear Regression model for following relation
between mother’s Estirol level and birth weight of child
for following data:
Estirol (mg/24 hr) Birth weight (g/100)
1 1
2 1
3 2
4 2
5 4
5 Create a relationship model for given data to find
relationship b/w height and weight of students. Compute
Karl Pearson coefficient and Coefficient of determination.
REFER SLIDE 9
6 State benefits of regularization for avoiding overfitting
in Linear Regression. State mathematical formulation of
Regularization.
7 Explain steps of Gradient Descent Algorithm.
33
1. The rent of a property is related to its area. Given the area in square feet and the rent, find the relationship between area and rent using the concept of linear regression. Also predict the rent for a property of 790 square feet.
2. The marks obtained by a student depend on his/her study time. Given the study time in minutes and marks out of 2000, find the relationship between study time and marks using the concept of Linear Regression. Also predict the marks for a student who studied for 790 minutes.
Area (ft2) Rent (inr)
360 520
1070 1600
630 1000
890 850
940 1350
500 490
Study Time (min.) Marks obtained
350 520
1070 1600
630 1000
890 850
940 1350
500 490
SUMMATIVE ASSESSMENT
34
8. Use the method of Least Squares with Regression to predict the final exam grade of a student who received 86 on the midterm exam.
x (midterm) y (final exam)
65 175
67 133
71 185
71 163
66 126
75 198
67 153
70 163
71 159
69 151
9. Create a relationship model for given data to find the relationship between height and weight of students.
Height (inches) Weight (pounds)
72 200
68 165
69 160
71 163
66 126
RESOURCES
 https://www.youtube.com/watch?v=Rb8MnMEJTI4&list=PLIeGtxpvyG-KE0M1r5cjbC_7Q_dVlKVq4&index=1
 https://www.youtube.com/watch?v=ls3XKoGntXg&list=PLIeGtxpvyG-KE0M1r5cjbC_7Q_dVlKVq4&index=3
 https://www.youtube.com/watch?v=E5RjzSK0fvY
 https://www.youtube.com/watch?v=NF5_btOaCig&list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU&index=5
 https://www.youtube.com/watch?v=5Z9OIYA8He8&list=PLblh5JKOoLUIzaEkCLIUxQFjPIlapw8nU&index=9
 https://www.youtube.com/watch?v=Xm2C_gTAl8c
 https://www.geeksforgeeks.org/mathematical-explanation-for-linear-regression-working/
 https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/
 https://www.youtube.com/playlist?list=PLIeGtxpvyG-IqjoU8IiF0Yu1WtxNq_4z-
 https://365datascience.com/r-squared/
35