LINEAR REGRESSION
STATISTICS

LINEAR REGRESSION
Linear regression attempts to model
the relationship between two variables
by fitting a linear equation to observed
data. One variable is considered to be
an explanatory variable, and the other
is considered to be a dependent
variable. For example, a modeler might
want to relate the weights of
individuals to their heights using a
linear regression model.
Before attempting to fit a linear model to
observed data, a modeler should first
determine whether or not there is a
relationship between the variables of
interest. This does not necessarily imply
that one variable causes the other (for
example, higher SAT scores do
not cause higher college grades), but that
there is some significant association
between the two variables.
A scatterplot can be a helpful tool in
determining the strength of the
relationship between two variables. If
there appears to be no association
between the proposed explanatory and
dependent variables (i.e., the scatterplot
does not indicate any increasing or
decreasing trends), then fitting a linear
regression model to the data probably
will not provide a useful model.
A valuable numerical measure of
association between two variables is the 
correlation coefficient, which is a value
between -1 and 1 indicating the strength
of the association of the observed data for
the two variables.
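As a small illustration (not part of the original slides), the correlation coefficient can be computed directly from its definition, the covariance of the two variables divided by the product of their spreads:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: always between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: co-movement of the two variables about their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominators: spread of each variable about its own mean
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear data gives r of (essentially) 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

A value near +1 or -1 indicates a strong linear association; a value near 0 indicates that a straight-line fit is unlikely to be useful.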
A linear regression line has an equation
of the form Y = a + bX, where X is the
explanatory variable and Y is the
dependent variable. The slope of the line
is b, and a is the intercept (the value
of y when x = 0).
LEAST-SQUARES REGRESSION
The most common method for fitting a
regression line is the method of least-
squares. This method calculates the
best-fitting line for the observed data
by minimizing the sum of the squares
of the vertical deviations from each
data point to the line (if a point lies on
the fitted line exactly, then its vertical
deviation is 0). Because the deviations
are first squared, then summed, there
are no cancellations between positive
and negative values.
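The least-squares line has a closed-form solution, which the following sketch (not part of the original slides) implements from the standard formulas:

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the
    sum of squared vertical deviations from the data points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # The least-squares line always passes through (mean_x, mean_y)
    a = mean_y - b * mean_x
    return a, b

# Data lying exactly on y = 1 + 2x is recovered exactly
a, b = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```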
Example
The dataset "Televisions, Physicians, and Life
Expectancy" contains, among other variables, the
number of people per television set and the
number of people per physician for 40 countries.
Since both variables probably reflect the level of
wealth in each country, it is reasonable to assume
that there is some positive association between
them. After removing 8 countries with missing
values from the dataset, the remaining 32
countries have a correlation coefficient of 0.852
for number of people per television set and
number of people per physician.
The r² value is 0.726 (the square of the
correlation coefficient), indicating that
72.6% of the variation in one variable may
be explained by the other. Suppose we
choose to consider number of people per
television set as the explanatory variable,
and number of people per physician as the
dependent variable. Using the MINITAB
"REGRESS" command gives the following
results:
THE REGRESSION EQUATION IS PEOPLE.PHYS. = 1019
+ 56.2 PEOPLE.TEL
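The fitted equation can be expressed as a plain Python function; the coefficients 1019 and 56.2 come from the MINITAB output above, while the underlying dataset is not reproduced here:

```python
def predict_people_per_physician(people_per_tv):
    """Fitted line from the slide: PEOPLE.PHYS = 1019 + 56.2 * PEOPLE.TEL."""
    return 1019 + 56.2 * people_per_tv

# For a country with 10 people per television set, the model
# predicts roughly 1581 people per physician.
print(predict_people_per_physician(10))
```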
 To view the fit of the
model to the observed
data, one may plot the
computed regression line
over the actual data
points to evaluate the
results. For this example,
the plot appears to the
right, with number of
individuals per television
set (the explanatory
variable) on the x-axis
and number of individuals
per physician (the
dependent variable) on
the y-axis.
While most of the data points are
clustered towards the lower left corner of
the plot (indicating relatively few
individuals per television set and per
physician), there are a few points which
lie far away from the main cluster of the
data. These points are known
as outliers, and depending on their
location may have a major impact on the
regression line. 
OUTLIERS AND INFLUENTIAL
OBSERVATIONS
 After a regression line has been computed for a group
of data, a point which lies far from the line (and thus
has a large residual value) is known as an outlier.
Such points may represent erroneous data, or may
indicate a poorly fitting regression line. If a point lies
far from the other data in the horizontal direction, it is
known as an influential observation. The reason for
this distinction is that these points may have a
significant impact on the slope of the regression line.
Notice, in the above example, the effect of removing
the observation in the upper right corner of the plot:
WITH THIS INFLUENTIAL OBSERVATION REMOVED, THE
REGRESSION EQUATION IS NOW PEOPLE.PHYS = 1650 + 21.3
PEOPLE.TEL.
 The correlation between the
two variables has dropped to
0.427, which reduces
the r² value to 0.182. With
this influential observation
removed, less than 20% of
the variation in number of
people per physician may be
explained by the number of
people per television.
Influential observations are
also visible in the new
model, and their impact
should also be investigated.
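The effect described above can be demonstrated with made-up numbers (not the slide's dataset): a single point far out in the x-direction can dominate the fitted slope, and removing it changes the line dramatically.

```python
def least_squares_fit(xs, ys):
    """Closed-form least-squares fit for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# A small cluster plus one influential point far out in x
xs = [1, 2, 3, 4, 50]
ys = [2, 3, 2, 4, 60]

a_full, b_full = least_squares_fit(xs, ys)          # with the influential point
a_trim, b_trim = least_squares_fit(xs[:-1], ys[:-1])  # without it

# The slope drops sharply once the influential observation is removed
print(round(b_full, 2), round(b_trim, 2))
```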
RESIDUALS
 Once a regression model
has been fit to a group of
data, examination of the
residuals (the deviations
from the fitted line to the
observed values) allows
the modeler to
investigate the validity of
his or her assumption
that a linear relationship
exists.
Plotting the residuals on the y-axis
against the explanatory variable on the x-
axis reveals any possible non-linear
relationship among the variables, or
might alert the modeler to
investigate lurking variables.
In our example, the residual plot
highlights the presence of the outliers.
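Residuals are simple to compute once a line has been fitted; this sketch (with invented data, not the slide's) shows how an outlier stands out in the residuals even when the line fits the remaining points exactly:

```python
def residuals(xs, ys, a, b):
    """Vertical deviations: observed y minus fitted value a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Data exactly on y = 1 + 2x, except one outlying observation
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 20]

res = residuals(xs, ys, a=1.0, b=2.0)
print(res)  # [0.0, 0.0, 0.0, 11.0] -- the outlier stands out
```

A residual plot of these values against x would show three points on the zero line and one far above it.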
LURKING VARIABLES
 If non-linear trends are visible in the relationship
between an explanatory and dependent variable,
there may be other influential variables to
consider. A lurking variable exists when the
relationship between two variables is
significantly affected by the presence of a third
variable which has not been included in the
modeling effort. Since such a variable might be a
factor of time (for example, the effect of political
or economic cycles), a time series plot of the
data is often a useful tool in identifying the
presence of lurking variables.
EXTRAPOLATION
 Whenever a linear regression model is fit to a group of
data, the range of the data should be carefully
observed. Attempting to use a regression equation to
predict values outside of this range is often
inappropriate, and may yield implausible answers. This
practice is known as extrapolation. Consider, for
example, a linear model which relates weight gain to
age for young children. Applying such a model to
adults, or even teenagers, would be absurd, since the
relationship between age and weight gain is not
consistent for all age groups.
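One simple safeguard, sketched below with made-up coefficients (the slide gives no numbers for the weight-gain model), is to record the range of x used in fitting and flag any prediction outside it:

```python
def predict_with_range_check(x, a, b, x_min, x_max):
    """Evaluate a + b*x, flagging extrapolation beyond the fitted x-range."""
    extrapolating = not (x_min <= x <= x_max)
    return a + b * x, extrapolating

# Hypothetical weight-gain model fitted on ages 1-5 (a and b invented)
y, warn = predict_with_range_check(30, a=4.0, b=2.5, x_min=1, x_max=5)
print(y, warn)  # 79.0 True -- age 30 is far outside the fitted range
```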
