UNIT – IV
Supervised Learning
1
 Introduction
 Example of Regression
 Common Regression Algorithms-Simple linear regression,
 Multiple linear regression,
 Assumptions in Regression Analysis,
 Main Problems in Regression Analysis,
 Improving Accuracy of the Linear Regression Model,
 Polynomial Regression Model,
 Logistic Regression,
 Maximum Likelihood Estimation
SYLLABUS
2
Regression
3
 Prediction of a numerical value can be solved using the regression model.
 In the context of regression, the dependent variable (Y) is the one whose value is to be predicted.
 EX: Predicting the price of the land.
 This variable is presumed to be functionally related to one (say, X) or more
independent variables called predictors.
 In other words, the dependent variable depends on independent variable(s) or
predictor(s).
 Regression is essentially finding a relationship (or) association between the
dependent variable (Y) and the independent variable(s) (X), i.e. to find the
function ‘f ’ for the association Y = f (X)
4 COMMON REGRESSION ALGORITHMS
The most common regression algorithms are
 Simple linear regression
 Multiple linear regression
 Polynomial regression
 Multivariate adaptive regression splines
 Logistic regression
 Maximum likelihood estimation (least squares)
Simple Linear Regression
 It is the simplest regression model which involves only one predictor.
 This model assumes a linear relationship between the dependent variable and
the predictor variable as shown in Figure.
5
Real-life example of simple linear
regression
Suppose we have a dataset with one independent feature, work experience (years), and salary as the dependent feature; that is, the salary prediction depends on work experience. So we can use simple linear regression here.
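A minimal sketch of this setup, assuming scikit-learn and a few made-up experience/salary pairs (not data from these slides):

```python
# Minimal sketch: simple linear regression of salary on years of experience.
# The numbers are illustrative placeholders, not real data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [5], [8]])              # work experience in years (one predictor)
y = np.array([30000, 35000, 42000, 55000, 72000])    # salary (dependent variable)

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)              # value of Y when X = 0
print("slope b:", model.coef_[0])                    # change in salary per extra year
print("salary predicted for 4 years:", model.predict([[4]])[0])
```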
6
FIG. Simple linear regression
7
 We know that straight lines can be defined in slope–intercept form
 Y = (a + bX),
where a = intercept and
b = slope of the straight line.
 The value of intercept indicates the value of Y when X = 0.
 It is known as ‘the intercept or Y intercept’ because it
specifies where the straight line crosses the vertical or Y-
axis refer to Fig.
8
Slope of the simple linear regression model
Slope of a straight line represents how much the line in a
graph changes in the vertical direction (Y-axis) over a
change in the horizontal direction (X-axis) as shown in
Figure.
Slope = Change in Y/Change in X
Rise is the change in Y-axis (Y2 − Y1) and Run is the
change in X-axis (X2 − X1). So, slope is represented as
given below:
FIG. Rise and run representation
9
Example of slope
Let us find the slope of the graph where
 the lower point on the line is represented as (−3, −2)
and
 the higher point on the line is represented as (2, 2).
(X1, Y1) = (−3, −2) and
(X2, Y2) = (2, 2)
Rise=(Y2 −Y1)=(2−(−2))=2+2=4
Run=(X2 −X1)=(2−(−3))=2+3=5
Slope = Rise/Run = 4/5 = 0.8
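The same arithmetic as a tiny Python sketch, using the two points from the example above:

```python
def slope(p1, p2):
    """Slope = rise / run between two points (x1, y1) and (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

print(slope((-3, -2), (2, 2)))  # (2 - (-2)) / (2 - (-3)) = 4 / 5 = 0.8
```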
10
 There can be two types of slopes in a linear regression
model:
positive slope and negative slope.
 Different types of regression lines based on the type of
slope include :
 Linear positive slope
 Curve linear positive slope
 Curve linear negative slope
 Linear negative slope
11
Linear positive slope
FIG. Linear positive slope
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta (Y) / Delta(X)
Scenario 1 for positive slope:
Delta (Y) is positive and Delta (X) is positive
Scenario 2 for positive slope:
Delta (Y) is negative and Delta (X) is negative
A positive slope always moves upward on a graph from left to right refer to Fig.
12
Curve linear positive slope
FIG. Curve linear positive slope
Curves in these graphs (refer to Fig. ) slope upward from left to right.
Slope = (Y2 − Y1) / (X2 − X1) = Delta (Y) / Delta(X)
Slope for a variable (X) may vary between two graphs, but it will always be
positive; hence, the above graphs are called as graphs with curve linear
positive slope.
13
Linear negative slope
A negative slope always moves downward on
a graph from left to right. As X value (on X-axis)
increases, Y value decreases.
Slope = Rise/Run = (Y2 − Y1) / (X2 − X1) = Delta (Y) /
Delta(X)
Scenario 1 for negative slope: Delta (Y) is positive and
Delta (X) is negative
Scenario 2 for negative slope: Delta (Y) is negative and
Delta (X) is positive
Fig. Linear negative slope
14
Curve linear negative slope
Curves in these graphs (refer to Fig) slope downward from left to right.
Slope for a variable (X) may vary between two graphs, but it will always be negative;
hence, the above graphs are called as graphs with curve linear negative slope.
15 No relationship graph
Scatter graph shown in Figure indicates ‘no relationship’ curve
as it is very difficult to conclude whether the relationship between
X and Y is positive or negative
FIG. No relationship graph
16 Error in simple regression
 The regression equation model in machine learning uses the slope–
intercept format in algorithms.
 X and Y values are provided to the machine.
 The machine identifies the values of a (intercept) and b (slope)
by relating the values of X and Y.
 However, identifying the exact match of values for a and b is not
always possible.
 There will be some error value (ɛ) associated with it.
 This error is called marginal or residual error.
Y = (a + bX) + ε
17
Example of simple regression
A college professor believes that if the grade for internal examination is
high in a class, the grade for external examination will also be high. A
random sample of 15 students in that class was selected, and the data is
given below:
18
A scatter plot was drawn to explore the relationship
between the independent variable (internal marks)
mapped to X-axis and dependent variable (external
marks) mapped to Y-axis as depicted in Figure
FIG. Scatter plot and regression line
 As we can observe from the graph, the line (i.e. the regression line) does not predict the data exactly (refer to Fig.).
 Instead, it just cuts through the data.
 Some predictions are lower than expected,
 while some others are higher than expected.
19
FIG. Residual error
20
 Ordinary Least Squares (OLS) is the technique used to estimate the line that minimizes the error (ε).
 This means summing the error of each prediction or, more appropriately, the sum of the squares of the errors.
The Sum of the Squares of the Errors (SSE) is given by:
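In the slope–intercept notation used above, this is

SSE = Σ εi² = Σ (Yi − (a + bXi))², summed over all n observations.

A minimal sketch of OLS for simple linear regression follows, using the standard closed-form estimates of a and b that minimize the SSE (the data values are illustrative placeholders):

```python
# Minimal sketch: closed-form OLS estimates for Y = a + bX (illustrative data).
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)  # slope
a = Y.mean() - b * X.mean()                                                # intercept

residuals = Y - (a + b * X)      # error term ε for each observation
sse = np.sum(residuals ** 2)     # Sum of the Squares of the Errors
print("a =", a, "b =", b, "SSE =", sse)
```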
21
Multiple Linear Regression
 In a multiple regression model, two or more independent variables, i.e.
predictors are involved in the model.
 It is an extension of linear regression to the case where we have more than one independent feature in our dataset. For example, for a dataset with several independent features:
Question : can we use a simple linear regression model for this dataset?
Answer : NO
22 The equation to implement multiple linear regression is as follows:
y = A + B1X1 + B2X2 + B3X3 + B4X4 + … + BnXn
 y = the predicted value of the dependent variable
 A = the intercept
 B1 = the regression coefficient of the first independent variable (X1)
and so on, up to Bn, the coefficient of the nth independent variable (Xn).
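A minimal sketch, assuming scikit-learn and two illustrative independent features (X1, X2):

```python
# Minimal sketch: multiple linear regression y = A + B1*X1 + B2*X2 (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 20], [2, 25], [3, 22], [4, 30], [5, 28]])   # two independent features
y = np.array([30, 38, 41, 52, 55])                            # dependent variable

model = LinearRegression().fit(X, y)
print("A (intercept):", model.intercept_)
print("B1, B2 (coefficients):", model.coef_)
```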
23
Assumptions in Regression Analysis
1. The dependent variable (Y) can be calculated/predicted based on
independent variables (X’s) plus an error term (ε).
2. The number of observations (n) is greater than the number of parameters (k) to
be estimated, i.e. n > k.
3. Relationships determined by regression are only relationships of association
based on the data set.
4. The regression line is valid only over a limited range of data. If the line is extended outside that range (extrapolation), it may lead to wrong predictions.
5. The error term (ε) is normally distributed. This also means that the mean of
the error (ε) has an expected value of 0.
Given the above assumptions, the OLS estimator, which is used to minimize the error, is the Best Linear Unbiased Estimator (BLUE); this result is known as the Gauss–Markov theorem.
24
Main Problems in Regression Analysis
In multiple regressions, there are two primary problems:
1. Multicollinearity : It is a situation in which the degree of correlation is not
only between the dependent variable and the independent variable, but there is
also a strong correlation within (among) the independent variables
themselves.
Advantage : A multiple regression equation can still make good predictions when there is multicollinearity.
Disadvantage : It is difficult to determine how the dependent variable will change if each independent variable is changed one at a time. When multicollinearity is present, it also increases the standard errors of the coefficients.
25
2. Heteroskedasticity : It refers to the changing variance of the error term.
 If the variance of the error term is not constant across data sets, there will be erroneous predictions.
 In general, for a regression equation to make accurate predictions, the error term should be independent, identically distributed.
Mathematically, this assumption is written as
var(ui | X) = σ² and cov(ui, uj | X) = 0 for i ≠ j,
where ‘var’ represents the variance, ‘cov’ represents the covariance, ‘u’ represents the error terms, and ‘X’ represents the independent variables. This assumption is more commonly written as
u | X ~ iid(0, σ²).
26
Improving Accuracy of the Linear Regression Model
 To improve the accuracy of the model, we need to understand the concepts of bias and variance, which are analogous to accuracy and prediction.
Accuracy refers to how close the estimation is to the actual value.
Prediction refers to continuous estimation of the value.
High bias = Low accuracy (not close to the real value)
Low bias = High accuracy (close to the real value)
High variance = Low prediction (values are scattered)
Low variance = High prediction (values are not scattered)
27
Accuracy of linear regression can be improved using the following three methods:
1. Shrinkage Approach
2. Subset Selection
3. Dimensionality (Variable) Reduction
Shrinkage (Regularization) approach
Regularization
When it comes to training models, there are two major problems one can
encounter: overfitting and underfitting.
Overfitting happens when the model performs well on the training set but not so
well on unseen (test) data.
Underfitting happens when it neither performs well on the train set nor on the test
set.
In particular, regularization is implemented to avoid overfitting, especially when there is a large gap between train and test set performance.
28
There are different ways of reducing model complexity and preventing overfitting in
linear models. This includes ridge and lasso regression models. The key difference is in
how they assign penalties to the coefficients:
Ridge Regression:
Performs L2 regularization, i.e., adds penalty equivalent to the square of the
magnitude of coefficients
Minimization objective = LS Obj + α * (sum of square of coefficients)
Lasso Regression:
Performs L1 regularization, i.e., adds penalty equivalent to the absolute value of
the magnitude of coefficients
Minimization objective = LS Obj + α * (sum of the absolute value of coefficients)
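A minimal sketch comparing the two penalties, assuming scikit-learn's Ridge and Lasso estimators (illustrative data; alpha plays the role of α above):

```python
# Minimal sketch: ridge (L2) vs. lasso (L1) regression on the same illustrative data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                          # 5 predictors, 50 observations
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)   # only two predictors really matter

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```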
29
Subset selection
Identify a subset of the predictors that is assumed to be related to the response
and then fit a model using OLS on the selected reduced subset of variables.
There are two methods in which subset of the regression can be selected:
1. Best subset selection (considers all the possible 2^k models)
2.Stepwise subset selection
1. Forward stepwise selection (0 to k)
2. Backward stepwise selection (k to 0)
30
In best subset selection, we fit a separate least squares regression for each
possible subset of the k predictors.
For computational reasons, best subset selection cannot be applied with a very large number of predictors (k).
The best subset selection procedure considers all the possible 2^k models containing subsets of the k predictors.
The stepwise subset selection method can be applied to choose the best subset.
There are two stepwise subset selection:
1. Forward stepwise selection (0 to k)
2. Backward stepwise selection (k to 0)
31
Forward stepwise selection
It is a computationally efficient alternative to best subset selection.
It considers a much smaller set of models, built step by step, compared to best subset selection.
It begins with a model containing no predictors, and then, predictors are added
one by one to the model, until all the k predictors are included in the model.
In particular, at each step, the variable (X) that gives the highest additional
improvement to the fit is added.
Backward stepwise selection
It begins with the least squares model which contains all k predictors and then
iteratively removes the least useful predictor one by one.
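A minimal sketch of forward stepwise selection, assuming scikit-learn's SequentialFeatureSelector (which scores candidate predictors by cross-validation rather than raw RSS; the data are illustrative):

```python
# Minimal sketch: forward stepwise selection of 2 out of 5 predictors (illustrative data).
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 4 * X[:, 2] + 2 * X[:, 4] + rng.normal(size=60)   # features 2 and 4 drive the response

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=2,
                                direction="forward")   # "backward" gives backward stepwise
sfs.fit(X, y)
print("selected predictors:", sfs.get_support(indices=True))
```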
32 Dimensionality Reduction (Variable reduction)
In dimensionality reduction, predictors (X) are transformed,
and
the model is set up using the transformed variables after
dimensionality reduction.
The number of variables is reduced using the dimensionality
reduction method.
Principal component analysis is one of the most important
dimensionality (variable) reduction techniques.
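A minimal sketch of this idea (principal component regression), assuming scikit-learn's PCA in a pipeline with OLS and illustrative data:

```python
# Minimal sketch: reduce 10 predictors to 3 principal components, then regress on them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))                         # 10 original predictors
y = X @ rng.normal(size=10) + rng.normal(size=80)     # illustrative response

pcr = make_pipeline(PCA(n_components=3), LinearRegression())
pcr.fit(X, y)                                         # model is set up on the transformed variables
print("R^2 on the training data:", pcr.score(X, y))
```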
33
Polynomial Regression Model
 It is an extension of the simple linear model.
It is generated by adding extra predictors, obtained by raising each of the original predictors to a power (squaring, cubing, and so on).
For example, a cubic regression uses three variables, X, X², and X³, as predictors. This approach provides a simple way to yield a non-linear fit to data.
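A minimal sketch, assuming scikit-learn: PolynomialFeatures generates the extra predictors X, X², X³, and an ordinary linear model is then fitted on them (illustrative data).

```python
# Minimal sketch: degree-3 polynomial regression on one predictor (illustrative data).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 15).reshape(-1, 1)                    # 15 data points
y = 0.5 * X.ravel() ** 3 - X.ravel() + rng.normal(size=15)   # cubic trend plus noise

poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[1.5]]))   # prediction from the fitted degree-3 curve
```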
34 If we observe, the regression line in the figure is slightly curved for polynomial degree 3 with the above 15 data points. The regression line will curve further if we increase the polynomial degree (refer to Fig.). At extreme values of the degree, as shown below, the regression line will overfit, passing through all the original values of X.
FIG. Polynomial regression degree 3
Why Not Linear Regression?
Suppose that we are trying to predict the medical condition of a patient in the emergency room
on the basis of her symptoms. In this simplified example, there are three possible diagnoses:
stroke, drug overdose, and epileptic seizure (abnormal brain activity). We could consider
encoding these values as a quantitative response variable, Y , as follows:
35
For instance, one coding might set stroke = 1, drug overdose = 2, and epileptic seizure = 3, while an equally reasonable reordering, such as epileptic seizure = 1, stroke = 2, and drug overdose = 3, would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.
If the response variable’s values did take on a natural ordering, such as mild,
moderate, and severe, and we felt the gap between mild and moderate was similar
to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable.
Unfortunately, in general there is no natural way to convert a qualitative response
variable with more than two levels into a quantitative response that is ready for
linear regression.
For a binary (two level) qualitative response, the situation is better. For instance,
perhaps there are only two possibilities for the patient’s medical condition: stroke
and drug overdose. We could then potentially use the dummy variable approach to
code the response as follows:
36
Y = 0 if stroke;
1 if drug overdose.
We could then fit a linear regression to this binary response, and predict drug
overdose if Ŷ > 0.5 and stroke otherwise. In the binary case it is not hard to
show that even if we flip the above coding, linear regression will produce the
same final predictions.
For a binary response with a 0/1 coding as above, regression by least squares
does make sense;
However, the dummy variable approach cannot be easily extended to accommodate
qualitative responses with more than two levels. For these reasons, it is
preferable to use a classification method that is truly suited for qualitative
response values, such as LOGISTIC REGRESSION
37
Logistic Regression
38
Consider again the Default data set, where the response default falls into one of two
categories, Yes or No. Rather than modeling this response Y directly, logistic
regression models the probability that Y belongs to a particular category.
For the Default data, logistic regression models the probability of default.
For example, the probability of default given balance can be written as
Pr(default = Yes | balance). The values of Pr(default = Yes | balance), which we
abbreviate p(balance), will range between 0 and 1. Then for any given value of
balance, a prediction can be made for default.
For example, one might predict default = Yes for any individual for whom p(balance)
> 0.5. Alternatively, if a company wishes to be conservative in predicting individuals
who are at risk for default, then they may choose to use a lower threshold, such as
p(balance) > 0.1.
39
If we are modeling people’s gender as male or female from their height,
then the first class could be male and the logistic regression model could
be written as the probability of male given a person’s height, or more
formally:
P(gender = male | height)
Written another way, we are modeling the probability that an input (X) belongs to the default class (Y = 1); we can write this formally as:
P(X) = P(Y = 1 | X)
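A minimal sketch of fitting and using such a model, assuming scikit-learn and a few illustrative height/label pairs (the Default data set from the text is not reproduced here):

```python
# Minimal sketch: logistic regression of a binary class on one predictor (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

heights = np.array([150, 155, 160, 165, 170, 175, 180, 185, 190]).reshape(-1, 1)
is_male = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1])      # Y = 1 for the class of interest

clf = LogisticRegression().fit(heights, is_male)
p = clf.predict_proba([[172]])[0, 1]                  # estimate of P(Y = 1 | height = 172)
print("P(male | height = 172) =", p)
print("predicted class:", "male" if p > 0.5 else "not male")   # 0.5 threshold, as above
```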
40 Assumptions in logistic regression
The following assumptions must hold when building a logistic regression model:
1. There exists a linear relationship between logit function and independent
variables.
2. The dependent variable Y must be categorical (1/0) and take binary value,
e.g. if pass then Y = 1; else Y = 0.
3. The data meets the ‘iid’ criterion, i.e. the error terms, ε, are independent from
one another and identically distributed
4. The error term follows a binomial distribution [n, p]
 n = # of records in the data
 p = probability of success (pass, responder)
41
Maximum Likelihood Estimation
 The coefficients in a logistic regression are estimated using a process called
Maximum Likelihood Estimation (MLE).
 A fair coin lands heads and tails equally often. If we toss the coin 10 times, it is expected that we get Head five times and Tail five times.
 Let us now discuss the probability of getting Head as an outcome; it is 5/10 = 0.5 in the above case.
 Whenever this number (P) is greater than 0.5, it is said to be in favor of Head.
 Whenever P is less than 0.5, it is said to be against the outcome of getting Head.
 Let us represent ‘n’ flips of coin as X1 , X2 , X3 ,…, Xn .
 Now X can take the value of 1 or 0.
 Xi = 1 if Head is the outcome
 Xi = 0 if Tail is the outcome
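A minimal sketch of the idea with illustrative flips: the likelihood of the observed X1, …, Xn is evaluated over candidate values of p, and the maximizer coincides with the sample proportion of Heads.

```python
# Minimal sketch: maximum likelihood estimate of P(Head) from a sequence of coin flips.
import numpy as np

flips = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])    # Xi = 1 for Head, 0 for Tail (illustrative)

def log_likelihood(p, x):
    """Log-likelihood of Bernoulli(p) for the observed flips x."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

candidates = np.linspace(0.01, 0.99, 99)             # grid of candidate values of p
scores = [log_likelihood(p, flips) for p in candidates]
p_mle = candidates[int(np.argmax(scores))]

print("MLE of P(Head):", p_mle)                      # about 0.6 for these flips
print("sample proportion of Heads:", flips.mean())   # the closed-form MLE
```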
42
Thank You……