UNIT – IV
Supervised Learning
1
 Introduction
 Example of Regression
 Common Regression Algorithms-Simple linear regression,
 Multiple linear regression,
 Assumptions in Regression Analysis,
 Main Problems in Regression Analysis,
 Improving Accuracy of the Linear Regression Model,
 Polynomial Regression Model,
 Logistic Regression,
 Maximum Likelihood Estimation
SYLLABUS
2
Regression
3
 Prediction of a numerical value can be solved using the regression model.
 In the context of regression, the dependent variable (Y) is the one whose value is to be predicted.
 EX: Predicting the price of the land.
 This variable is presumed to be functionally related to one (say, X) or more
independent variables called predictors.
 In other words, the dependent variable depends on independent variable(s) or
predictor(s).
 Regression is essentially finding a relationship (or) association between the
dependent variable (Y) and the independent variable(s) (X), i.e. to find the
function ‘f ’ for the association Y = f (X)
4 COMMON REGRESSION ALGORITHMS
The most common regression algorithms are
 Simple linear regression
 Multiple linear regression
 Polynomial regression
 Multivariate adaptive regression splines
 Logistic regression
 Maximum likelihood estimation (least squares)
Simple Linear Regression
 It is the simplest regression model which involves only one predictor.
 This model assumes a linear relationship between the dependent variable and
the predictor variable as shown in Figure.
5
Real-life example of simple linear
regression
Suppose we have a dataset with one independent feature, work experience (years), and salary as the dependent feature; that is, the salary prediction depends on work experience. So we can use simple linear regression here.
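A minimal sketch of this setup, assuming scikit-learn and a few made-up experience/salary pairs (not data from these slides):

```python
# Minimal sketch: simple linear regression of salary on years of experience.
# The numbers are illustrative placeholders, not real data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [5], [8]])              # work experience in years (one predictor)
y = np.array([30000, 35000, 42000, 55000, 72000])    # salary (dependent variable)

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)              # value of Y when X = 0
print("slope b:", model.coef_[0])                    # change in salary per extra year
print("salary predicted for 4 years:", model.predict([[4]])[0])
```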
6
FIG. Simple linear regression
7
 We know that straight lines can be defined in slope–intercept form
 Y = (a + bX),
where a = intercept and
b = slope of the straight line.
 The value of intercept indicates the value of Y when X = 0.
 It is known as ‘the intercept or Y intercept’ because it
specifies where the straight line crosses the vertical or Y-
axis refer to Fig.
8
Slope of the simple linear regression model
Slope of a straight line represents how much the line in a
graph changes in the vertical direction (Y-axis) over a
change in the horizontal direction (X-axis) as shown in
Figure.
Slope = Change in Y/Change in X
Rise is the change in Y-axis (Y2 − Y1) and Run is the
change in X-axis (X2 − X1). So, slope is represented as
given below:
FIG. Rise and run representation
9
Example of slope
Let us find the slope of the graph where
 the lower point on the line is represented as (−3, −2)
and
 the higher point on the line is represented as (2, 2).
(X1, Y1) = (−3, −2) and
(X2, Y2) = (2, 2)
Rise=(Y2 −Y1)=(2−(−2))=2+2=4
Run=(X2 −X1)=(2−(−3))=2+3=5
Slope = Rise/Run = 4/5 = 0.8
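The same arithmetic as a tiny Python sketch, using the two points from the example above:

```python
def slope(p1, p2):
    """Slope = rise / run between two points (x1, y1) and (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

print(slope((-3, -2), (2, 2)))  # (2 - (-2)) / (2 - (-3)) = 4 / 5 = 0.8
```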
10
 There can be two types of slopes in a linear regression
model:
positive slope and negative slope.
 Different types of regression lines based on the type of
slope include :
 Linear positive slope
 Curve linear positive slope
 Curve linear negative slope
 Linear negative slope
11
Linear positive slope
FIG. Linear positive slope
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta (Y) / Delta(X)
Scenario 1 for positive slope:
Delta (Y) is positive and Delta (X) is positive
Scenario 2 for positive slope:
Delta (Y) is negative and Delta (X) is negative
A positive slope always moves upward on a graph from left to right refer to Fig.
12
Curve linear positive slope
FIG. Curve linear positive slope
Curves in these graphs (refer to Fig. ) slope upward from left to right.
Slope = (Y2 − Y1) / (X2 − X1) = Delta (Y) / Delta(X)
Slope for a variable (X) may vary between two graphs, but it will always be
positive; hence, the above graphs are called as graphs with curve linear
positive slope.
13
Linear negative slope
A negative slope always moves downward on
a graph from left to right. As X value (on X-axis)
increases, Y value decreases.
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta (Y) /
Delta(X)
Scenario 1 for negative slope: Delta (Y) is positive and
Delta (X) is negative
Scenario 2 for negative slope: Delta (Y) is negative and
Delta (X) is positive
Fig. Linear negative slope
14
Curve linear negative slope
Curves in these graphs (refer to Fig) slope downward from left to right.
Slope for a variable (X) may vary between two graphs, but it will always be negative;
hence, the above graphs are called as graphs with curve linear negative slope.
15 No relationship graph
Scatter graph shown in Figure indicates ‘no relationship’ curve
as it is very difficult to conclude whether the relationship between
X and Y is positive or negative
FIG. No relationship graph
16 Error in simple regression
 The regression equation model in machine learning uses the slope–
intercept format in algorithms.
 X and Y values are provided to the machine.
 The machine identifies the values of a (intercept) and b (slope)
by relating the values of X and Y.
 However, identifying the exact match of values for a and b is not
always possible.
 There will be some error value (ɛ) associated with it.
 This error is called marginal or residual error.
Y = (a + bX) + ε
17
Example of simple regression
A college professor believes that if the grade for internal examination is
high in a class, the grade for external examination will also be high. A
random sample of 15 students in that class was selected, and the data is
given below:
18
A scatter plot was drawn to explore the relationship
between the independent variable (internal marks)
mapped to X-axis and dependent variable (external
marks) mapped to Y-axis as depicted in Figure
FIG. Scatter plot and regression line
 As we can observe from the graph, the line (i.e. the regression line) does not predict the data exactly (refer to Fig.).
 Instead, it just cuts through the data.
 Some predictions are lower than expected,
 while some others are higher than expected.
19
FIG. Residual error
20
 Ordinary Least Squares (OLS) is the technique used to estimate the line that minimizes the error (ε).
 This means summing the error of each prediction or, more appropriately, the sum of the squares of the errors.
The Sum of the Squares of the Errors (SSE) is given by:
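In the slope–intercept notation used above, this is

SSE = Σ εi² = Σ (Yi − (a + bXi))², summed over all n observations.

A minimal sketch of OLS for simple linear regression follows, using the standard closed-form estimates of a and b that minimize the SSE (the data values are illustrative placeholders):

```python
# Minimal sketch: closed-form OLS estimates for Y = a + bX (illustrative data).
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)  # slope
a = Y.mean() - b * X.mean()                                                # intercept

residuals = Y - (a + b * X)      # error term ε for each observation
sse = np.sum(residuals ** 2)     # Sum of the Squares of the Errors
print("a =", a, "b =", b, "SSE =", sse)
```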
21
Multiple Linear Regression
 In a multiple regression model, two or more independent variables, i.e.
predictors are involved in the model.
 It is an extension of linear regression to the case where we have more than one independent feature in our dataset. For example, for a dataset with several independent features:
Question : can we use a simple linear regression model for this dataset?
Answer : NO
22 The equation to implement multiple linear regression is as follows:
y = A + B1X1 + B2X2 + B3X3 + B4X4 + … + BnXn
 y = the predicted value of the dependent variable
 A = the intercept
 B1 = the regression coefficient of the first independent variable (X1)
and so on, up to Bn, the coefficient of the nth independent variable (Xn).
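A minimal sketch, assuming scikit-learn and two illustrative independent features (X1, X2):

```python
# Minimal sketch: multiple linear regression y = A + B1*X1 + B2*X2 (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 20], [2, 25], [3, 22], [4, 30], [5, 28]])   # two independent features
y = np.array([30, 38, 41, 52, 55])                            # dependent variable

model = LinearRegression().fit(X, y)
print("A (intercept):", model.intercept_)
print("B1, B2 (coefficients):", model.coef_)
```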
23
Assumptions in Regression Analysis
1. The dependent variable (Y) can be calculated/predicted based on
independent variables (X’s) plus an error term (ε).
2. The number of observations (n) is greater than the number of parameters (k) to
be estimated, i.e. n > k.
3. Relationships determined by regression are only relationships of association
based on the data set.
4. The regression line is valid only over a limited range of data. If the line is extended outside that range (extrapolation), it may lead to wrong predictions.
5. The error term (ε) is normally distributed. This also means that the mean of
the error (ε) has an expected value of 0.
Given the above assumptions, the OLS estimator, which is used to minimize the error, is the Best Linear Unbiased Estimator (BLUE); this result is known as the Gauss–Markov theorem.
24
Main Problems in Regression Analysis
In multiple regressions, there are two primary problems:
1. Multicollinearity : It is a situation in which the degree of correlation is not
only between the dependent variable and the independent variable, but there is
also a strong correlation within (among) the independent variables
themselves.
Advantage : A multiple regression equation can still make good predictions when there is multicollinearity.
Disadvantage : It is difficult to determine how the dependent variable will change if each independent variable is changed one at a time. When multicollinearity is present, it also increases the standard errors of the coefficients.
25
2. Heteroskedasticity : It refers to the changing variance of the error term.
 If the variance of the error term is not constant across data sets, there will be erroneous predictions.
 In general, for a regression equation to make accurate predictions, the error term should be independent, identically distributed.
Mathematically, this assumption is written as
var(ui | X) = σ² and cov(ui, uj | X) = 0 for i ≠ j,
where ‘var’ represents the variance, ‘cov’ represents the covariance, ‘u’ represents the error terms, and ‘X’ represents the independent variables. This assumption is more commonly written as
u | X ~ iid(0, σ²).
26
Improving Accuracy of the Linear Regression Model
 To improve the accuracy of the model, we need to understand the concepts of bias and variance, which are analogous to accuracy and prediction.
Accuracy refers to how close the estimation is to the actual value.
Prediction refers to continuous estimation of the value.
High bias = Low accuracy (not close to the real value)
Low bias = High accuracy (close to the real value)
High variance = Low prediction (values are scattered)
Low variance = High prediction (values are not scattered)
27
Accuracy of linear regression can be improved using the following three methods:
1. Shrinkage Approach
2. Subset Selection
3. Dimensionality (Variable) Reduction
Shrinkage (Regularization) approach
Regularization
When it comes to training models, there are two major problems one can
encounter: overfitting and underfitting.
Overfitting happens when the model performs well on the training set but not so
well on unseen (test) data.
Underfitting happens when it neither performs well on the train set nor on the test
set.
In particular, regularization is implemented to avoid overfitting, especially when there is a large gap between train and test set performance.
28
There are different ways of reducing model complexity and preventing overfitting in
linear models. This includes ridge and lasso regression models. The key difference is in
how they assign penalties to the coefficients:
Ridge Regression:
Performs L2 regularization, i.e., adds penalty equivalent to the square of the
magnitude of coefficients
Minimization objective = LS Obj + α * (sum of square of coefficients)
Lasso Regression:
Performs L1 regularization, i.e., adds penalty equivalent to the absolute value of
the magnitude of coefficients
Minimization objective = LS Obj + α * (sum of the absolute value of coefficients)
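A minimal sketch comparing the two penalties, assuming scikit-learn's Ridge and Lasso estimators (illustrative data; alpha plays the role of α above):

```python
# Minimal sketch: ridge (L2) vs. lasso (L1) regression on the same illustrative data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                          # 5 predictors, 50 observations
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)   # only two predictors really matter

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```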
29
Subset selection
Identify a subset of the predictors that is assumed to be related to the response
and then fit a model using OLS on the selected reduced subset of variables.
There are two methods in which subset of the regression can be selected:
1. Best subset selection (considers all the possible 2^k models)
2.Stepwise subset selection
1. Forward stepwise selection (0 to k)
2. Backward stepwise selection (k to 0)
30
In best subset selection, we fit a separate least squares regression for each
possible subset of the k predictors.
For computational reasons, best subset selection cannot be applied with a very large number of predictors (k).
The best subset selection procedure considers all the possible 2^k models containing subsets of the k predictors.
The stepwise subset selection method can be applied to choose the best subset.
There are two stepwise subset selection:
1. Forward stepwise selection (0 to k)
2. Backward stepwise selection (k to 0)
31
Forward stepwise selection
It is a computationally efficient alternative to best subset selection.
It considers a much smaller set of models, built step by step, compared to best subset selection.
It begins with a model containing no predictors, and then, predictors are added
one by one to the model, until all the k predictors are included in the model.
In particular, at each step, the variable (X) that gives the highest additional
improvement to the fit is added.
Backward stepwise selection
It begins with the least squares model which contains all k predictors and then
iteratively removes the least useful predictor one by one.
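A minimal sketch of forward stepwise selection, assuming scikit-learn's SequentialFeatureSelector (which scores candidate predictors by cross-validation rather than raw RSS; the data are illustrative):

```python
# Minimal sketch: forward stepwise selection of 2 out of 5 predictors (illustrative data).
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 4 * X[:, 2] + 2 * X[:, 4] + rng.normal(size=60)   # features 2 and 4 drive the response

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=2,
                                direction="forward")   # "backward" gives backward stepwise
sfs.fit(X, y)
print("selected predictors:", sfs.get_support(indices=True))
```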
32 Dimensionality Reduction (Variable reduction)
In dimensionality reduction, predictors (X) are transformed,
and
the model is set up using the transformed variables after
dimensionality reduction.
The number of variables is reduced using the dimensionality
reduction method.
Principal component analysis is one of the most important
dimensionality (variable) reduction techniques.
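A minimal sketch of this idea (principal component regression), assuming scikit-learn's PCA in a pipeline with OLS and illustrative data:

```python
# Minimal sketch: reduce 10 predictors to 3 principal components, then regress on them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))                         # 10 original predictors
y = X @ rng.normal(size=10) + rng.normal(size=80)     # illustrative response

pcr = make_pipeline(PCA(n_components=3), LinearRegression())
pcr.fit(X, y)                                         # model is set up on the transformed variables
print("R^2 on the training data:", pcr.score(X, y))
```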
33
Polynomial Regression Model
 It is an extension of the simple linear model.
It is generated by adding extra predictors, obtained by raising each of the original predictors to a power (squaring, cubing, and so on).
For example, a cubic regression uses three variables, X, X², and X³, as predictors. This approach provides a simple way to yield a non-linear fit to data.
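A minimal sketch, assuming scikit-learn: PolynomialFeatures generates the extra predictors X, X², X³, and an ordinary linear model is then fitted on them (illustrative data).

```python
# Minimal sketch: degree-3 polynomial regression on one predictor (illustrative data).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 15).reshape(-1, 1)                    # 15 data points
y = 0.5 * X.ravel() ** 3 - X.ravel() + rng.normal(size=15)   # cubic trend plus noise

poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[1.5]]))   # prediction from the fitted degree-3 curve
```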
34 If we observe, the regression line in the figure is slightly curved for polynomial degree 3 with the above 15 data points. The regression line will curve further if we increase the polynomial degree (refer to Fig.). At extreme values of the degree, as shown below, the regression line will overfit, passing through all the original values of X.
FIG. Polynomial regression degree 3
Why Not Linear Regression?
Suppose that we are trying to predict the medical condition of a patient in the emergency room
on the basis of her symptoms. In this simplified example, there are three possible diagnoses:
stroke, drug overdose, and epileptic seizure (abnormal brain activity). We could consider
encoding these values as a quantitative response variable, Y , as follows:
35
For instance, one coding might set stroke = 1, drug overdose = 2, and epileptic seizure = 3, while an equally reasonable reordering, such as epileptic seizure = 1, stroke = 2, and drug overdose = 3, would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.
If the response variable’s values did take on a natural ordering, such as mild,
moderate, and severe, and we felt the gap between mild and moderate was similar
to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable.
Unfortunately, in general there is no natural way to convert a qualitative response
variable with more than two levels into a quantitative response that is ready for
linear regression.
For a binary (two level) qualitative response, the situation is better. For instance,
perhaps there are only two possibilities for the patient’s medical condition: stroke
and drug overdose. We could then potentially use the dummy variable approach to
code the response as follows:
36
Y = 0 if stroke;
1 if drug overdose.
We could then fit a linear regression to this binary response, and predict drug
overdose if Ŷ > 0.5 and stroke otherwise. In the binary case it is not hard to
show that even if we flip the above coding, linear regression will produce the
same final predictions.
For a binary response with a 0/1 coding as above, regression by least squares
does make sense;
However, the dummy variable approach cannot be easily extended to accommodate
qualitative responses with more than two levels. For these reasons, it is
preferable to use a classification method that is truly suited for qualitative
response values, such as LOGISTIC REGRESSION
37
Logistic Regression
38
Consider again the Default data set, where the response default falls into one of two
categories, Yes or No. Rather than modeling this response Y directly, logistic
regression models the probability that Y belongs to a particular category.
For the Default data, logistic regression models the probability of default.
For example, the probability of default given balance can be written as
Pr(default = Yes | balance). The values of Pr(default = Yes | balance), which we
abbreviate p(balance), will range between 0 and 1. Then for any given value of
balance, a prediction can be made for default.
For example, one might predict default = Yes for any individual for whom p(balance)
> 0.5. Alternatively, if a company wishes to be conservative in predicting individuals
who are at risk for default, then they may choose to use a lower threshold, such as
p(balance) > 0.1.
39
If we are modeling people’s gender as male or female from their height,
then the first class could be male and the logistic regression model could
be written as the probability of male given a person’s height, or more
formally:
P(gender = male | height)
Written another way, we are modeling the probability that an input (X) belongs to the default class (Y = 1); we can write this formally as:
P(X) = P(Y = 1 | X)
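A minimal sketch of fitting and using such a model, assuming scikit-learn and a few illustrative height/label pairs (the Default data set from the text is not reproduced here):

```python
# Minimal sketch: logistic regression of a binary class on one predictor (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

heights = np.array([150, 155, 160, 165, 170, 175, 180, 185, 190]).reshape(-1, 1)
is_male = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1])      # Y = 1 for the class of interest

clf = LogisticRegression().fit(heights, is_male)
p = clf.predict_proba([[172]])[0, 1]                  # estimate of P(Y = 1 | height = 172)
print("P(male | height = 172) =", p)
print("predicted class:", "male" if p > 0.5 else "not male")   # 0.5 threshold, as above
```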
40 Assumptions in logistic regression
The following assumptions must hold when building a logistic regression model:
1. There exists a linear relationship between logit function and independent
variables.
2. The dependent variable Y must be categorical (1/0) and take binary value,
e.g. if pass then Y = 1; else Y = 0.
3. The data meets the ‘iid’ criterion, i.e. the error terms, ε, are independent from
one another and identically distributed
4. The error term follows a binomial distribution [n, p]
 n = # of records in the data
 p = probability of success (pass, responder)
41
Maximum Likelihood Estimation
 The coefficients in a logistic regression are estimated using a process called
Maximum Likelihood Estimation (MLE).
 A fair coin lands heads and tails equally often. If we toss the coin 10 times, it is expected that we get Head five times and Tail five times.
 Let us now discuss the probability of getting Head as an outcome; it is 5/10 = 0.5 in the above case.
 Whenever this number (P) is greater than 0.5, it is said to be in favor of Head.
 Whenever P is less than 0.5, it is said to be against the outcome of getting Head.
 Let us represent ‘n’ flips of coin as X1 , X2 , X3 ,…, Xn .
 Now X can take the value of 1 or 0.
 Xi = 1 if Head is the outcome
 Xi = 0 if Tail is the outcome
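A minimal sketch of the idea with illustrative flips: the likelihood of the observed X1, …, Xn is evaluated over candidate values of p, and the maximizer coincides with the sample proportion of Heads.

```python
# Minimal sketch: maximum likelihood estimate of P(Head) from a sequence of coin flips.
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])    # Xi = 1 for Head, 0 for Tail (illustrative)

def log_likelihood(p, x):
    """Log-likelihood of Bernoulli(p) for the observed flips x."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

candidates = np.linspace(0.01, 0.99, 99)             # grid of candidate values of p
scores = [log_likelihood(p, flips) for p in candidates]
p_mle = candidates[int(np.argmax(scores))]

print("MLE of P(Head):", p_mle)                      # about 0.6 for these flips
print("sample proportion of Heads:", flips.mean())   # the closed-form MLE
```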
42
Thank You……