Course – Big Data Analytics (Professional
Elective-II)
Course code-IT314B
Unit-II- ADVANCED ANALYTICAL THEORY AND
METHODS USING PYTHON
Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423603
(An Autonomous Institute Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Information Technology
(NBA Accredited)
Mr. Rajendra N Kankrale
Asst. Prof.
BDA- Unit-II Regression Department of IT
Unit-II- ADVANCED ANALYTICAL THEORY AND METHODS USING
PYTHON
• Syllabus
Introduction to Scikit-learn,
Installations, Dataset, matplotlib, filling missing values,
Regression and Classification using Scikit-learn
Association Rules: FP growth,
Regression: Linear Regression, Logistic Regression,
Classification: Naïve Bayes classifier
Unit-II- Regression
• Motivation
• Regression estimates the relationship between a dependent (target)
variable and one or more independent variables.
• It is used to find trends in data.
• It helps to predict real (continuous) values.
• By performing regression, we can determine the most important factors,
the least important factors, and how the factors affect one another.
Linear Regression
• Linear Regression is a supervised machine learning algorithm.
• It tries to find the best linear relationship that describes the given data.
• It assumes that a linear relationship exists between the dependent variable and the
independent variable(s).
• The value of the dependent variable in a linear regression model is continuous,
i.e. a real number.
Representing Linear Regression Model
• A linear regression model represents the linear relationship between a dependent
variable and independent variable(s) via a sloped straight line.
• The sloped straight line that fits the given data best is called the
regression line.
• It is also called the best-fit line.
Types of Linear Regression:
1. Simple Linear Regression
2. Multiple Linear Regression
Simple Linear Regression
For simple linear regression, the form of the model is:
Y = β0 + β1X
Here,
Y is the dependent variable.
X is the independent variable.
β0 and β1 are the regression coefficients.
β0 is the intercept, or bias, which fixes the offset of the line.
β1 is the slope, or weight, which specifies the factor by which X impacts Y.
Three cases are possible:
Case-01: β1 < 0
It indicates that variable X has a negative impact on Y.
If X increases, Y will decrease, and vice-versa.
Case-02: β1 = 0
• It indicates that variable X has no impact on Y.
• If X changes, there will be no change in Y.
Case-03: β1 > 0
It indicates that variable X has a positive impact on Y.
If X increases, Y will increase, and vice-versa.
Multiple Linear Regression
In multiple linear regression, the dependent variable depends on more
than one independent variable.
For multiple linear regression, the form of the model is:
Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn
Here,
Y is the dependent variable.
X1, X2, …, Xn are the independent variables.
β0, β1, …, βn are the regression coefficients.
βj (1 ≤ j ≤ n) is the slope, or weight, which specifies the factor
by which Xj impacts Y.
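As a brief sketch, a multiple linear regression with two independent variables can be fitted with scikit-learn in the same way as the simple case covered later in this unit. The data here is hypothetical, purely for illustration:

```python
# Multiple linear regression sketch with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one observation [X1, X2]; y is the dependent variable.
X = np.array([[1, 4], [2, 3], [3, 7], [4, 6], [5, 9]])
y = np.array([10, 12, 20, 21, 29])

model = LinearRegression().fit(X, y)
# model.intercept_ holds b0; model.coef_ holds the weights [b1, b2].
```

After fitting, `model.predict()` evaluates Y = β0 + β1X1 + β2X2 for new rows of inputs.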
Evaluation metrics for a linear regression model
Mean Squared Error (MSE)
MSE is the most common metric for regression tasks. It is the average of the
squared differences between the predicted and actual values.
Since it is differentiable and has a convex shape, it is easy to optimize.
MSE penalizes large errors.
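In symbols, with m observations, actual values yᵢ, and predictions ŷᵢ, MSE is commonly written as:

```latex
\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2
```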
Mean Absolute Error (MAE)
This is simply the average of the absolute differences between the target values
and the values predicted by the model.
MAE does not penalize large errors as heavily as MSE, which makes it more
robust when outliers are prominent.
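In symbols, MAE is commonly written as:

```latex
\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|
```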
Root Mean Squared Error (RMSE)
As the name suggests, RMSE is simply the square root of the mean squared
error.
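In symbols:

```latex
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}
```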
R-squared (R²)
R-squared measures the proportion of the variance of the dependent variable
that is explained by the independent variable(s).
It is a popular metric for assessing how well a regression model fits: it
tells how close the data points are to the fitted line generated by the
regression algorithm. A larger R-squared value indicates a better fit.
• SSE is the sum of the squared differences between the actual values
and the predicted values.
• SST is the total sum of the squared differences between the actual
values and the mean of the actual values.
• yi is the observed target value, ŷi is the predicted value, ȳ (y-bar)
is the mean of the observed values, and m is the total number of
observations.
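Putting these together, R² is commonly defined as:

```latex
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
    = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}
```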
• An R² score typically ranges from 0 to 1; the closer R² is to 1, the better
the regression model. If R² equals 0, the model performs no better than
always predicting the mean. R² can also be negative, which indicates the
model fits worse than that baseline.
• A small MAE suggests the model predicts well, while a large MAE suggests
that the model may have trouble in certain areas. An MAE of 0 means that
the model is a perfect predictor of the outputs.
• If there are outliers in the dataset, MSE penalizes them the most, so the
calculated MSE is larger. In short, MSE is not robust to outliers,
whereas MAE is.
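As a sketch, all of these metrics can be computed with scikit-learn's metrics module. The actual and predicted values below are hypothetical, chosen only to illustrate the calls:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual and predicted values, for illustration only.
y_true = np.array([5, 20, 14, 32, 22, 38])
y_pred = np.array([8.33, 13.73, 19.13, 24.53, 29.93, 35.33])

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)  # robust to outliers
rmse = np.sqrt(mse)                        # square root of MSE
r2 = r2_score(y_true, y_pred)              # proportion of variance explained
```

Note that RMSE is simply computed as the square root of MSE, matching its definition above.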
Simple Linear Regression With scikit-learn
• You’ll start with the simplest case, which is simple linear regression. There
are five basic steps when you’re implementing linear regression:
1. Import the packages and classes that you need.
2. Provide data to work with, and eventually do appropriate transformations.
3. Create a regression model and fit it with existing data.
4. Check the results of model fitting to know whether the model is
satisfactory.
5. Apply the model for predictions.
• Step 1: Import packages and classes
• The first step is to import the package numpy and the class
LinearRegression from sklearn.linear_model:
• >>> import numpy as np
• >>> from sklearn.linear_model import LinearRegression
• Step 2: Provide data
• The second step is defining data to work with. The inputs (regressors, 𝑥)
and output (response, 𝑦) should be arrays or similar objects. This is the
simplest way of providing data for regression:
• >>> x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
• >>> y = np.array([5, 20, 14, 32, 22, 38])
• Now, you have two arrays: the input, x, and the output, y. You should call
.reshape() on x because this array must be two-dimensional, or more
precisely, it must have one column and as many rows as necessary. That’s
exactly what the argument (-1, 1) of .reshape() specifies.
• This is how x and y look now:
• >>> x
• array([[ 5],
• [15],
• [25],
• [35],
• [45],
• [55]])
• >>> y
• array([ 5, 20, 14, 32, 22, 38])
• As you can see, x has two dimensions, and x.shape is (6, 1), while y has a
single dimension, and y.shape is (6,).
• Step 3: Create a model and fit it
• The next step is to create a linear regression model and fit it using the
existing data.
• Create an instance of the class LinearRegression, which will represent the
regression model:
• >>> model = LinearRegression()
• This statement creates the variable model as an instance of
LinearRegression. You can provide several optional parameters to
LinearRegression:
• fit_intercept is a Boolean that, if True, decides to calculate the intercept 𝑏₀
or, if False, considers it equal to zero. It defaults to True.
• normalize is a Boolean that, if True, decides to normalize the input
variables. It defaults to False, in which case it doesn’t normalize the input
variables. (Note: this parameter was deprecated in scikit-learn 1.0 and
removed in 1.2; on recent versions, scale the inputs yourself, for example
with StandardScaler.)
• copy_X is a Boolean that decides whether to copy (True) or overwrite the
input variables (False). It’s True by default.
• n_jobs is either an integer or None. It represents the number of jobs used
in parallel computation. It defaults to None, which usually means one job.
-1 means to use all available processors.
• Your model as defined above uses the default values of all parameters.
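As a sketch, the defaults can also be spelled out explicitly; `normalize` is omitted here because recent scikit-learn releases no longer accept it:

```python
from sklearn.linear_model import LinearRegression

# Equivalent to LinearRegression() with all defaults.
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
```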
• It’s time to start using the model. First, you need to call .fit() on model:
• >>> model.fit(x, y)
• LinearRegression()
• With .fit(), you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using
the existing input and output, x and y, as the arguments. In other words,
.fit() fits the model. It returns self, which is the variable model itself. That’s
why you can replace the last two statements with this one:
• >>> model = LinearRegression().fit(x, y)
• This statement does the same thing as the previous two. It’s just shorter.
• Step 4: Get results
• Once you have your model fitted, you can get the results to check whether
the model works satisfactorily and to interpret it.
• You can obtain the coefficient of determination, 𝑅², with .score() called on
model:
• >>> r_sq = model.score(x, y)
• >>> print(f"coefficient of determination: {r_sq}")
• coefficient of determination: 0.7158756137479542
• When you’re applying .score(), the arguments are also the predictor x and
response y, and the return value is 𝑅².
• The attributes of model are .intercept_, which represents the coefficient 𝑏₀,
and .coef_, which represents 𝑏₁:
• >>> print(f"intercept: {model.intercept_}")
• intercept: 5.633333333333329
• >>> print(f"slope: {model.coef_}")
• slope: [0.54]
• The code above illustrates how to get 𝑏₀ and 𝑏₁. You can notice that
.intercept_ is a scalar, while .coef_ is an array.
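Step 5, applying the model for predictions, can then be sketched as follows, reusing x and y from above; .predict() simply evaluates 𝑏₀ + 𝑏₁𝑥 for each input row:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

# Step 5: apply the model for predictions.
y_pred = model.predict(x)
# .predict() evaluates b0 + b1*x for every row of x:
manual = model.intercept_ + model.coef_[0] * x.flatten()
```

Both y_pred and manual give the same values, confirming that the fitted model is just the line Y = 𝑏₀ + 𝑏₁X with the intercept and slope shown above.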
