Course – Big Data Analytics (Professional
Elective-II)
Course code-IT314B
Unit-II- ADVANCED ANALYTICAL THEORY AND
METHODS USING PYTHON
Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423603
(An Autonomous Institute Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Information Technology
(NBA Accredited)
Mr. Rajendra N Kankrale
Asst. Prof.
BDA- Unit-II Regression Department of IT
Unit-II- ADVANCED ANALYTICAL THEORY AND METHODS USING
PYTHON
• Syllabus
Introduction to Scikit-learn,
Installations, Dataset, matplotlib, filling missing values,
Regression and Classification using Scikit-learn
Association Rules: FP growth,
Regression: Linear Regression, Logistic Regression,
Classification: Naïve Bayes classifier
Unit-II- Regression
• Motivation
• Regression estimates the relationship between a dependent (target)
variable and one or more independent variables.
• It is used to find trends in data.
• It helps to predict real (continuous) values.
• By performing regression, we can determine the most important factors,
the least important factors, and how the factors affect one another.
Linear Regression
• Linear Regression is a supervised machine learning algorithm.
• It tries to find the best linear relationship that describes the given data.
• It assumes that a linear relationship exists between the dependent variable and the
independent variable(s).
• The value of the dependent variable in a linear regression model is continuous,
i.e. a real number.
Representing Linear Regression Model
• A linear regression model represents the linear relationship between a dependent
variable and independent variable(s) via a sloped straight line.
• The sloped straight line that fits the given data best is called the
regression line.
• It is also called the best-fit line.
Types of Linear Regression:
1. Simple Linear Regression
2. Multiple Linear Regression
Simple Linear Regression
For simple linear regression, the form of the model is:
Y = β0 + β1X
Here,
Y is the dependent variable.
X is the independent variable.
β0 and β1 are the regression coefficients.
β0 is the intercept, or bias, which fixes the offset of the line.
β1 is the slope, or weight, which specifies the factor by which X impacts Y.
Three cases are possible:
Case-01: β1 < 0
It indicates that variable X has a negative impact on Y.
If X increases, Y will decrease, and vice-versa.
Case-02: β1 = 0
• It indicates that variable X has no impact on Y.
• If X changes, there will be no change in Y.
Case-03: β1 > 0
It indicates that variable X has a positive impact on Y.
If X increases, Y will increase, and vice-versa.
Multiple Linear Regression
In multiple linear regression, the dependent variable depends on more
than one independent variable.
For multiple linear regression, the form of the model is:
Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn
Here,
Y is the dependent variable.
X1, X2, …, Xn are the independent variables.
β0, β1, …, βn are the regression coefficients.
βj (1 ≤ j ≤ n) is the slope, or weight, which specifies the factor
by which Xj impacts Y.
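As a brief sketch, a multiple linear regression with two independent variables can be fitted with scikit-learn in the same way as the simple case covered later in this unit. The data here is hypothetical, purely for illustration:

```python
# Multiple linear regression sketch with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one observation [X1, X2]; y is the dependent variable.
X = np.array([[1, 4], [2, 3], [3, 7], [4, 6], [5, 9]])
y = np.array([10, 12, 20, 21, 29])

model = LinearRegression().fit(X, y)
# model.intercept_ holds b0; model.coef_ holds the weights [b1, b2].
```

After fitting, `model.predict()` evaluates Y = β0 + β1X1 + β2X2 for new rows of inputs.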
Evaluation metrics for a linear regression model
Mean Squared Error (MSE)
MSE is the most common metric for regression tasks. It is the average of the
squared differences between the predicted and actual values.
Since it is differentiable and has a convex shape, it is easy to optimize.
MSE penalizes large errors.
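In symbols, with m observations, actual values yᵢ, and predictions ŷᵢ, MSE is commonly written as:

```latex
\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2
```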
Mean Absolute Error (MAE)
This is simply the average of the absolute differences between the target values
and the values predicted by the model.
MAE does not penalize large errors as heavily as MSE, which makes it more
robust when outliers are prominent.
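In symbols, MAE is commonly written as:

```latex
\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|
```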
Root Mean Squared Error (RMSE)
As the name suggests, RMSE is simply the square root of the mean squared
error.
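In symbols:

```latex
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}
```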
R-squared (R²)
R-squared measures the proportion of the variance of the dependent variable
that is explained by the independent variable(s).
It is a popular metric for assessing how well a regression model fits: it
tells how close the data points are to the fitted line generated by the
regression algorithm. A larger R-squared value indicates a better fit.
• SSE is the sum of the squared differences between the actual values
and the predicted values.
• SST is the total sum of the squared differences between the actual
values and the mean of the actual values.
• yi is the observed target value, ŷi is the predicted value, ȳ (y-bar)
is the mean of the observed values, and m is the total number of
observations.
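Putting these together, R² is commonly defined as:

```latex
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
    = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}
```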
• An R² score typically ranges from 0 to 1; the closer R² is to 1, the better
the regression model. If R² equals 0, the model performs no better than
always predicting the mean. R² can also be negative, which indicates the
model fits worse than that baseline.
• A small MAE suggests the model predicts well, while a large MAE suggests
that the model may have trouble in certain areas. An MAE of 0 means that
the model is a perfect predictor of the outputs.
• If there are outliers in the dataset, MSE penalizes them the most, so the
calculated MSE is larger. In short, MSE is not robust to outliers,
whereas MAE is.
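As a sketch, all of these metrics can be computed with scikit-learn's metrics module. The actual and predicted values below are hypothetical, chosen only to illustrate the calls:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual and predicted values, for illustration only.
y_true = np.array([5, 20, 14, 32, 22, 38])
y_pred = np.array([8.33, 13.73, 19.13, 24.53, 29.93, 35.33])

mse = mean_squared_error(y_true, y_pred)   # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)  # robust to outliers
rmse = np.sqrt(mse)                        # square root of MSE
r2 = r2_score(y_true, y_pred)              # proportion of variance explained
```

Note that RMSE is simply computed as the square root of MSE, matching its definition above.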
Simple Linear Regression With scikit-learn
• You’ll start with the simplest case, which is simple linear regression. There
are five basic steps when you’re implementing linear regression:
1. Import the packages and classes that you need.
2. Provide data to work with, and eventually do appropriate transformations.
3. Create a regression model and fit it with existing data.
4. Check the results of model fitting to know whether the model is
satisfactory.
5. Apply the model for predictions.
• Step 1: Import packages and classes
• The first step is to import the package numpy and the class
LinearRegression from sklearn.linear_model:
• >>> import numpy as np
• >>> from sklearn.linear_model import LinearRegression
• Step 2: Provide data
• The second step is defining data to work with. The inputs (regressors, 𝑥)
and output (response, 𝑦) should be arrays or similar objects. This is the
simplest way of providing data for regression:
• >>> x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
• >>> y = np.array([5, 20, 14, 32, 22, 38])
• Now, you have two arrays: the input, x, and the output, y. You should call
.reshape() on x because this array must be two-dimensional, or more
precisely, it must have one column and as many rows as necessary. That’s
exactly what the argument (-1, 1) of .reshape() specifies.
• This is how x and y look now:
• >>> x
• array([[ 5],
• [15],
• [25],
• [35],
• [45],
• [55]])
• >>> y
• array([ 5, 20, 14, 32, 22, 38])
• As you can see, x has two dimensions, and x.shape is (6, 1), while y has a
single dimension, and y.shape is (6,).
• Step 3: Create a model and fit it
• The next step is to create a linear regression model and fit it using the
existing data.
• Create an instance of the class LinearRegression, which will represent the
regression model:
• >>> model = LinearRegression()
• This statement creates the variable model as an instance of
LinearRegression. You can provide several optional parameters to
LinearRegression:
• fit_intercept is a Boolean that, if True, decides to calculate the intercept 𝑏₀
or, if False, considers it equal to zero. It defaults to True.
• normalize is a Boolean that, if True, decides to normalize the input
variables. It defaults to False, in which case it doesn’t normalize the input
variables. (Note: this parameter was deprecated in scikit-learn 1.0 and
removed in 1.2; on recent versions, scale the inputs yourself, for example
with StandardScaler.)
• copy_X is a Boolean that decides whether to copy (True) or overwrite the
input variables (False). It’s True by default.
• n_jobs is either an integer or None. It represents the number of jobs used
in parallel computation. It defaults to None, which usually means one job.
-1 means to use all available processors.
• Your model as defined above uses the default values of all parameters.
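As a sketch, the defaults can also be spelled out explicitly; `normalize` is omitted here because recent scikit-learn releases no longer accept it:

```python
from sklearn.linear_model import LinearRegression

# Equivalent to LinearRegression() with all defaults.
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None)
```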
• It’s time to start using the model. First, you need to call .fit() on model:
• >>> model.fit(x, y)
• LinearRegression()
• With .fit(), you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using
the existing input and output, x and y, as the arguments. In other words,
.fit() fits the model. It returns self, which is the variable model itself. That’s
why you can replace the last two statements with this one:
• >>> model = LinearRegression().fit(x, y)
• This statement does the same thing as the previous two. It’s just shorter.
• Step 4: Get results
• Once you have your model fitted, you can get the results to check whether
the model works satisfactorily and to interpret it.
• You can obtain the coefficient of determination, 𝑅², with .score() called on
model:
• >>> r_sq = model.score(x, y)
• >>> print(f"coefficient of determination: {r_sq}")
• coefficient of determination: 0.7158756137479542
• When you’re applying .score(), the arguments are also the predictor x and
response y, and the return value is 𝑅².
• The attributes of model are .intercept_, which represents the coefficient 𝑏₀,
and .coef_, which represents 𝑏₁:
• >>> print(f"intercept: {model.intercept_}")
• intercept: 5.633333333333329
• >>> print(f"slope: {model.coef_}")
• slope: [0.54]
• The code above illustrates how to get 𝑏₀ and 𝑏₁. You can notice that
.intercept_ is a scalar, while .coef_ is an array.
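Step 5, applying the model for predictions, can then be sketched as follows, reusing x and y from above; .predict() simply evaluates 𝑏₀ + 𝑏₁𝑥 for each input row:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

# Step 5: apply the model for predictions.
y_pred = model.predict(x)
# .predict() evaluates b0 + b1*x for every row of x:
manual = model.intercept_ + model.coef_[0] * x.flatten()
```

Both y_pred and manual give the same values, confirming that the fitted model is just the line Y = 𝑏₀ + 𝑏₁X with the intercept and slope shown above.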
