FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx

FSE 200
Adkins Page 1 of 10
Simple Linear Regression
Correlation only measures the strength and direction of the
linear relationship between two quantitative variables. If the
relationship is linear, then we would like to try to model that
relationship with the equation of a line. We will use a
regression line to describe the relationship between an
explanatory variable and a response variable.
A regression line is a straight line that describes how a response
variable y changes as an explanatory variable x changes. We
often use a regression line to predict the value of y for a given
value of x.
Ex. It has been suggested that there is a relationship between
sleep deprivation of employees and the ability to complete
simple tasks. To evaluate this hypothesis, 12 people were asked
to solve simple tasks after having been without sleep for 15, 18,
21, and 24 hours. The sample data are shown below.
Subject | Hours without sleep, x | Tasks completed, y
1 | 15 | 13
2 | 15 | 9
3 | 15 | 15
4 | 18 | 8
5 | 18 | 12
6 | 18 | 10
7 | 21 | 5
8 | 21 | 8
9 | 21 | 7
10 | 24 | 3
11 | 24 | 5
12 | 24 | 4
Draw a scatterplot and describe the relationship. Lay a straight-
edge on top of the plot and move it around until you find what
you think might be a “line of best fit.” Then try to predict the
number of tasks completed for someone having been without
sleep 16 hours.

Was your line the same as that of the classmate sitting next to
you? Probably not. We need a method that we can use to find
the “best” regression line to use for prediction. The method we
will use is called least-squares. No line will pass exactly
through all the points in the scatterplot. When we use the line to
predict a y for a given x value, if there is a data point with that
same x value, we can compute the error (residual):
residual = observed y - predicted y = $y - \hat{y}$
Our goal is going to be to make the vertical distances from the
line as small as possible. The most commonly used method for
doing this is the least-squares method.
The least-squares regression line of y on x is the line that makes
the sum of the squares of the vertical distances of the data
points from the line as small as possible.
Equation of the Least-Squares Regression Line
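· Least-Squares Regression Line: $\hat{y} = a + bx$
· Slope of the Regression Line: $b = r \dfrac{s_y}{s_x}$
· Intercept of the Regression Line: $a = \bar{y} - b\bar{x}$
Here $\hat{y}$ is the predicted value of y, r is the correlation, $s_x$ and $s_y$ are the standard deviations of x and y, and $\bar{x}$ and $\bar{y}$ are their means.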
Generally, regression is performed using statistical software.
Clearly, given the appropriate information, the above formulas
are simple to use.
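As a rough illustration of how the slope and intercept are obtained by hand, the sketch below applies the standard formulas b = r(s_y/s_x) and a = ȳ - b·x̄ to the sleep deprivation data from the example above. The variable and function names are illustrative choices, not part of the course materials, and statistical software would normally do this work.

```python
from math import sqrt
from statistics import mean, stdev

# Sleep deprivation data from the example (12 subjects).
hours = [15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24]
tasks = [13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4]


def correlation(x, y):
    """Pearson correlation coefficient r of two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


r = correlation(hours, tasks)
b = r * stdev(tasks) / stdev(hours)   # slope: b = r * s_y / s_x
a = mean(tasks) - b * mean(hours)     # intercept: a = y_bar - b * x_bar

print(f"r = {r:.4f}, slope b = {b:.4f}, intercept a = {a:.4f}")
```

The sample standard deviation (`stdev`) is used for both variables; because the same divisor appears in s_x and s_y, it cancels in the ratio.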
Once we have the regression line, how do we interpret it, and
what can we do with it?
The slope of a regression line is the rate of change, that is, the
amount of change in the predicted value $\hat{y}$ when x increases by 1.
The intercept of the regression line is the value of $\hat{y}$ when x = 0.
It is statistically meaningful only when x can take on values
that are close to zero.
To make a prediction, just substitute an x-value into the
equation and find $\hat{y}$.
To plot the line on a scatterplot, just find a couple of points on
the regression line, one near each end of the range of x in the
data. Plot the points and connect them with a line. Again, this is
something that can be done using statistical software.
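A minimal sketch of prediction and plotting, assuming the sleep deprivation data from the earlier example: it fits the least-squares line, predicts the number of tasks at 16 hours, and computes one point near each end of the x range that could be joined to draw the line. The helper names `fit_line` and `predict` are made up for this illustration.

```python
from statistics import mean

hours = [15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24]
tasks = [13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


a, b = fit_line(hours, tasks)


def predict(x):
    """Predicted number of tasks completed after x hours without sleep."""
    return a + b * x


pred_16 = predict(16)
# Two points on the line, one at each end of the observed x range.
endpoints = [(x, predict(x)) for x in (min(hours), max(hours))]

print(f"Predicted tasks at 16 hours: {pred_16:.2f}")
print("Points for drawing the line:", endpoints)
```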

Ex. Use Excel to find the equation of the least-squares
regression line for the sleep deprivation data in the previous
example.
· Click Data -> Data Analysis -> Regression -> OK
· Input the cells of the response variable y in the Input Y Range
box.
· Input the cells of the explanatory variable x in the Input X
Range box.
· If you included variable names in the Input X and Y Range
boxes, check the Labels box.
· Input the cells you would like to display the output in the
Output Range box.
· Click OK.
a. State the equation of the least-squares regression line.
b. Identify and interpret the slope.
c. Identify and interpret the intercept.
d. Use the least-squares regression equation to predict the
number of tasks completed for an employee that has been
without sleep for 16 hours.
Facts About Least-Squares Regression
· In regression, the distinction between explanatory and

response variables is very important. When we computed the
correlation coefficient r, it did not matter which variable was x
and which was y, r would be the same. However, if you perform
a regression analysis on a data set and then swap x and y and
perform another regression analysis, the results will not be the
same.
· There is a connection between the correlation coefficient r and
the slope of the least-squares line: $b = r \dfrac{s_y}{s_x}$
A change of one standard deviation in x corresponds to a change
of r standard deviations in y.
· The least squares regression line always passes through the
point $(\bar{x}, \bar{y})$ on the graph of y versus x.
· The square of the correlation, $r^2$, known as the coefficient of
determination, is the proportion of variation in y that can be
explained by the least-squares regression of y on x.
Note that $0 \le r^2 \le 1$. The closer $r^2$ is to 1, the better your
regression line is at modeling the relationship between x and y.
We usually state $r^2$ as a percentage.
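The facts above can be checked numerically. The sketch below, which again uses the sleep deprivation data and illustrative helper names, fits y on x and then x on y to show that the two regressions differ, confirms that the line passes through the point of means, and computes the coefficient of determination.

```python
from statistics import mean

hours = [15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24]
tasks = [13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


a, b = fit_line(hours, tasks)                   # regression of y on x
a_swapped, b_swapped = fit_line(tasks, hours)   # regression of x on y

# Fact: the line passes through (x_bar, y_bar).
value_at_x_bar = a + b * mean(hours)

# The product of the two slopes equals r squared.
r_squared = b * b_swapped

print(f"y on x: slope {b:.4f}; x on y: slope {b_swapped:.4f}")
print(f"Line at x_bar: {value_at_x_bar:.4f}, y_bar: {mean(tasks):.4f}")
print(f"r^2 = {r_squared:.4f}")
```

Note that the slope of the swapped regression is not simply 1/b, which is one way to see that the two analyses give different lines.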
Note:
· A statistical analysis package can find $r^2$ for you; if only r is
given on the output, square it.
· If $r^2$ is given on the output and you want to find r, take the
square root of $r^2$ and look at the slope of the regression line to
determine the sign; r and b will have the same sign.
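A small sketch of the second note: recovering r from the coefficient of determination and the sign of the slope. The function name and the example numbers (roughly what a fit of the sleep deprivation data gives) are assumptions for illustration only.

```python
from math import copysign, sqrt


def r_from_r_squared(r_squared, slope):
    """Return r: the square root of r^2, carrying the sign of the slope b."""
    return copysign(sqrt(r_squared), slope)


# Illustrative values: r^2 of about 0.7807 with a negative slope.
r = r_from_r_squared(0.7807, -0.9444)
print(f"r = {r:.4f}")
```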
Ex. Refer to the Excel output from the sleep deprivation data.
Find $r^2$ and interpret it. Then find r.
We know that in practical applications, we are not going to be
so lucky as to have all of our data points falling exactly on a
line.
A residual is the difference between an observed value of the
response variable and the value predicted by the regression line.
That is, residual = observed y - predicted y = $y - \hat{y}$.
Ex. Find the residual for Subject 1 in our sleep deprivation data.
We could compute a residual for each observation in the data.
Note that the mean of the least-squares residuals is always zero.
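To make the definition concrete, the following sketch computes every residual (observed y minus predicted y) for the sleep deprivation data and checks that they average to zero. The helper names are illustrative; a spreadsheet or statistics package would report the same quantities.

```python
from statistics import mean

hours = [15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24]
tasks = [13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


a, b = fit_line(hours, tasks)

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(hours, tasks)]

for subject, res in enumerate(residuals, start=1):
    print(f"Subject {subject:2d}: residual {res:+.3f}")
print(f"Mean of residuals: {mean(residuals):.6f}")
```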
It is a good idea to examine the residuals because they can tell
us something about how appropriate our linear model is.
A residual plot is a scatterplot of the regression residuals versus
the explanatory variable, x. Residual plots help us to assess the
fit of a regression line.
· If the regression line does a good job describing the overall
relationship between x and y, the residuals should have no
systematic pattern.
When you examine a residual plot, here are some things you
should consider:
· Generally, a horizontal line is drawn at zero.
· A curved pattern tells you that the relationship is not linear;
therefore, linear regression is not an appropriate method of
analysis.
· Increasing or decreasing spread about the line (at zero) as x
increases may indicate that the prediction of y for certain values
of x will be less accurate.
· Individual points with large residuals or outliers in the y
direction can greatly affect your analysis. (Check data entry,
etc.)
· Individual points that are extreme in the x direction may not
have large residuals, but they may still have quite an impact on
the analysis.
The last two points above lead us to a discussion of outliers and
influential points.
An outlier is an observation that lies outside the overall pattern

of the other observations.
An observation is influential for a statistical calculation if
removing it would significantly change the result of the
calculation. Points that are outliers in the x direction are often
influential for the least-squares regression line.
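One way to see what "influential" means is to refit the line with and without a suspect point. The sketch below adds a made-up observation (40 hours, 12 tasks) to the sleep deprivation data; this point is hypothetical and is not part of the original data set. It is extreme in the x direction, and the slope changes substantially when it is included.

```python
from statistics import mean

hours = [15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24]
tasks = [13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical extra observation, far outside the observed x range.
extra_hours, extra_tasks = 40, 12

_, slope_without = fit_line(hours, tasks)
_, slope_with = fit_line(hours + [extra_hours], tasks + [extra_tasks])

print(f"Slope without the extra point: {slope_without:.4f}")
print(f"Slope with the extra point:    {slope_with:.4f}")
```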
Ex. Use Excel to obtain the residual plot for the sleep
deprivation data. Analyze the output.
· In the Regression dialog box, check the Residual Plots box.
Cautions about Correlation and Regression
Some things to remember:
· Always plot your data first!
· The correlation and regression we have been studying should
be used only to describe linear relationships.
· The correlation coefficient r and least-squares regression are
not resistant; just one influential observation can greatly affect
your analysis.
Let’s discuss a few more things of which you should be aware
with regard to correlation and regression.
Extrapolation is the use of a regression line for prediction far
outside the range of values of the explanatory variable x that
you used to obtain the line. Such predictions may not be
accurate!
Extrapolation can be dangerous!

Ex. Refer to the sleep-deprivation example. Do you think it
would be appropriate to use our least-squares regression
equation to make predictions for a person who has gone without
sleep for 40 hours?
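A quick numerical check can help with this question. The sketch below fits the line to the sleep deprivation data (observed x values run from 15 to 24 hours) and evaluates it at 40 hours; the helper names are illustrative.

```python
from statistics import mean

hours = [15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24]
tasks = [13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


a, b = fit_line(hours, tasks)
x_new = 40
pred_40 = a + b * x_new
inside_range = min(hours) <= x_new <= max(hours)

print(f"Observed x range: {min(hours)} to {max(hours)} hours")
print(f"Prediction at {x_new} hours: {pred_40:.2f} tasks "
      f"(inside observed range: {inside_range})")
```

The line predicts a negative number of completed tasks, which is impossible; the linear pattern seen between 15 and 24 hours cannot be assumed to continue that far.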
Generally two variables won’t exist by themselves in a vacuum,
so to speak. Often we may be interested in more than two
variables. Sometimes there are variables floating around in the
background that are influencing the variables of interest, but we
may not even have considered these background variables.
A lurking variable (or extraneous variable) is a variable that has
an important effect on the relationship among the variables in a
study but is not included among the variables studied.
A lurking variable could make it falsely appear that two other
variables have a strong relationship. A lurking variable could
also mask or hide a relationship that is really there.
Ex. Suppose that someone notices that as the number of
churches in town increases, the liquor sales also go up. Is there
a lurking variable that might explain this relationship?
ASSOCIATION DOES NOT IMPLY CAUSATION!
An association between an explanatory variable x and a
response variable y, even if it is very strong, is not by itself
good evidence that changes in x actually cause changes in y.
While our goal may often be to show that changes in the
explanatory variable cause changes in the response variable,
sometimes an observed association really is due to cause and
effect, but many times it is not. There may be a lurking variable
that is causing a common response in x and y, or maybe both the
lurking variable and x are causing changes in y so that their

effects are confounded.
Ex. Does having more cars make you live longer? A serious
study once found that people with two cars live longer than
people who own only one car. Owning three cars is even better,
and so on. There is a substantial positive correlation between
number of cars x and length of life y.
A basic meaning of causation is that by changing x we can bring
about a change in y. Could we lengthen our lives by buying
more cars?
How can we tell, then, if we have a cause-and-effect
relationship? A well-designed experiment is the best way to
determine causation; we will discuss experiments later. Many
times it is not possible to do an experiment. In the absence of an
experiment, what should we examine to determine causation?
· The association is strong.
· The association is consistent.
· Higher doses are associated with stronger responses.
· The alleged cause precedes the effect in time.
· The alleged cause is plausible.
Ex. Does smoking cause lung cancer? Despite the difficulties, it
is sometimes possible to build a strong case for causation in the
absence of experiments. The evidence that smoking causes lung
cancer is about as strong as nonexperimental evidence can be.
Doctors have long observed that most lung cancer patients were
smokers. Comparison of smokers and “similar” nonsmokers
showed a very strong association between smoking and death
from lung cancer. Could the association be explained by lurking
variables? Might there be, for example, a genetic factor that
predisposes people to both nicotine addiction and to lung
cancer? Smoking and lung cancer would then be positively
associated even if smoking had no direct effect on the lungs.
How do we overcome these objections?

· The association is strong. The association between smoking
and lung cancer is very strong.
· The association is consistent. Many studies of different kinds
of people in many countries link smoking to lung cancer. That
reduces the chances that a lurking variable specific to one group
or one study explains the association.
· Higher doses are associated with stronger responses. People
who smoke more cigarettes per day or who smoke over a longer
period of time get lung cancer more often. People who stop
smoking reduce their risk.
· The alleged cause precedes the effect in time. Lung cancer
develops after years of smoking. The number of men dying of
lung cancer rose as smoking became more common, with a lag
of about 30 years. Lung cancer kills more men than any other
form of cancer. Lung cancer was rare among women until
women began to smoke. Lung cancer in women rose along with
smoking again with a lag of about 30 years.
· The alleged cause is plausible. Experiments with animals show
that tars from cigarette smoke do cause cancer.
The evidence for causation is overwhelming, but it is still not as
strong as the evidence provided by well-designed experiments.
[Figure: Residual plot for the sleep deprivation data. Horizontal axis: Hours without sleep, x. Vertical axis: Residuals.]
CJ102: Criminology
Unit 8 Worksheet

Student Name:
_____________________________________________________
__
After completing the readings, answer the following questions:
PART I
1. What is a turning point?
2. What are the characteristics of low self-control or
impulsivity?
3. Define and differentiate adolescent limited and life course
persistent criminals.
Part II: Sex Crimes
1. What are the 7 goals of a primary interview with the rape
victim?
2. What method does the FBI use to determine the profile of the
offender in a sex crime?
3. What is the importance of the profile in helping solve the
crime?
Part III: Burglary
1. What are the common methods in which burglars gain entry
into a residence or building?
2. Describe the primary characteristics of suspect(s) in burglary
cases.
3. How are burglaries and sex crimes related?
© Kaplan University
FSE 200
Adkins Page 1 of 8
Scatterplots and Correlation
So far we have been examining one variable at a time. In
practice, we often want to look at several variables at once. In
this chapter, we will specifically consider how to analyze two

quantitative variables.
A response variable measures an outcome of a study.
An explanatory variable may explain or influence changes in a
response variable.
Ex. Suppose that individuals are given different amounts of
alcohol, and then reaction times for a particular activity are
measured.
Often explanatory variables are called independent variables,
and response variables are called dependent variables.
Note that a cause-and-effect relationship may or may not exist,
but we cannot determine causality.
Two variables measured on the same individual are associated if
some values of one variable tend to occur with some values of
the second variable more than with other values of that variable.
Ex.
Displaying Relationships: Scatterplots
A scatterplot shows the relationship between two quantitative
variables measured on the same individuals. The values of one
variable appear on the horizontal or x axis, and the values of the
other variable appear on the vertical or y axis. Each individual
in the data appears as a point in the plot.
If there is an explanatory variable and a response variable, the
explanatory variable goes on the horizontal axis and the
response variable on the vertical axis. If such a distinction
cannot be made, then either variable can go on either axis.
To interpret a scatterplot, look for the overall pattern and for
striking deviations from that pattern. To describe the overall
pattern, look at the (1) form, (2) direction, and (3) strength of
the relationship. Also look for any outliers.
Two variables are positively associated when above-average
values of one tend to accompany above-average values of the
other, and below-average values also tend to occur together.

Ex. In a large group of people, there will be a positive
association between height and weight.
Two variables are negatively associated when above-average
values of one tend to accompany below-average values of the
other, and vice versa.
Ex. In a large group of people, there will be a negative
association between packs of cigarettes smoked and length of
life.
Ex. Create a scatterplot to show the relationship between yearly
average temperature and number of fires and yearly average
temperature and area burned.
Year | Average temperature (°F) | Number of fires | Acres burned (in millions)
2000 | 54.52 | 92250 | 7.39
2001 | 52.19 | 84079 | 3.57
2002 | 53.74 | 73457 | 7.18
2003 | 53.1 | 63629 | 3.96
2004 | 53.6 | 65461 | 8.1
2005 | 53.08 | 66753 | 8.69
2006 | 54.38 | 96385 | 9.87
2007 | 53.43 | 85705 | 9.33
2008 | 53.04 | 78979 | 5.29
2009 | 52.83 | 78792 | 5.92
2010 | 52.06 | 71971 | 3.42
2011 | 52.82 | 74126 | 8.71

Source: fire data from http://wildland-fires.sciencedaily.com/#
temperature data from
http://www.ncdc.noaa.gov/temp-and-precip/time-series/index.php?parameter=tmp&month=5&year=2000&filter=12&state=110&div=0
In Excel, highlight the two variables of interest. Click Insert ->
Scatter and select the appropriate chart type.
Ex. Fuel used vs. Speed
How does the fuel consumption of a car change as its speed
increases?
Speed vs. fuel consumption per 100 km travelled for British
Ford Escort
Describe the form of the relationship. Explain why the form
makes sense.
Does it make sense to describe the variables as either positively
or negatively associated? Why?
Measuring Linear Association: Correlation
We will look at one numerical measure of association, the
correlation coefficient. Technically, correlation only makes
sense when both variables are quantitative.
The correlation describes the direction and strength of a linear
relationship between two quantitative variables. The correlation
coefficient is usually written as r, the Pearson product-moment
correlation coefficient.
Now let’s learn how to calculate r. We will compute r based
upon n observations on variables x and y: $x_1, x_2, \ldots, x_n$ and
$y_1, y_2, \ldots, y_n$. We denote this rXY, the correlation between X and Y.
Each observation is an ordered pair $(x_i, y_i)$. For example, $x_1$ and $y_1$
might be my age and my number of college hours earned.
Calculating the correlation coefficient
1. List the two values for each individual.
2. Compute the sum of X values, and compute the sum of Y
values.
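3. Square each X value, and compute the sum of the squared X values.
4. Square each Y value, and compute the sum of the squared Y values.
5. Multiply each X by its Y, and find the sum of the XY products.
6. Plug these values into the formula:
$r_{XY} = \dfrac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$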
Ex. Calculate rXY by hand.
X | Y | X² | Y² | XY
4 | 6 | | |
7 | 4 | | |
10 | 2 | | |
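The calculation steps listed earlier can be sketched in code. The example below follows them with the common computational formula for r and uses the pairs (4, 6), (7, 4), (10, 2), which assumes the example table is read row by row. The function name is an illustrative choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation via sums: follows the hand-calculation steps."""
    n = len(xs)
    sum_x = sum(xs)                                # step 2: sum of X
    sum_y = sum(ys)                                # step 2: sum of Y
    sum_x2 = sum(x * x for x in xs)                # step 3: sum of X squared
    sum_y2 = sum(y * y for y in ys)                # step 4: sum of Y squared
    sum_xy = sum(x * y for x, y in zip(xs, ys))    # step 5: sum of XY
    # step 6: plug the sums into the formula
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


X = [4, 7, 10]
Y = [6, 4, 2]
r_xy = pearson_r(X, Y)
print(f"r_XY = {r_xy:.4f}")
```

These three points fall exactly on a downward-sloping line, so the result is a perfect negative correlation.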

Ex. Using Excel, find r for yearly average temperature and
number of fires and yearly average temperature and area burned.
Note: The columns for the variables of interest must be next to
each other.
· Using the CORREL function: In the cell you want to display
the correlation coefficient, type = CORREL(array1, array2).
· array1 contains data for the X variable
· array2 contains data for the Y variable
· Using the Analysis ToolPak:
· click Data -> Data Analysis -> Correlation
· In the Input Range box, input the cells that contain data for
both variables.
· Make sure the Grouped By: Columns option is selected if your
data are grouped in columns.
· If you include the variable names in the first row, check the
box next to Labels in first row.
· In the Output Range box, input the cell you wish to display the
output.
· Click OK.
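For readers working outside Excel, a rough Python analogue of the CORREL calculation is sketched below for the temperature and fire data in the table above. The function `correl` is a hand-written stand-in named after the Excel function, not a library call.

```python
from math import sqrt

temperature = [54.52, 52.19, 53.74, 53.10, 53.60, 53.08,
               54.38, 53.43, 53.04, 52.83, 52.06, 52.82]
fires = [92250, 84079, 73457, 63629, 65461, 66753,
         96385, 85705, 78979, 78792, 71971, 74126]
acres_millions = [7.39, 3.57, 7.18, 3.96, 8.10, 8.69,
                  9.87, 9.33, 5.29, 5.92, 3.42, 8.71]


def correl(array1, array2):
    """Pearson correlation of two equal-length lists."""
    n = len(array1)
    mean1 = sum(array1) / n
    mean2 = sum(array2) / n
    s12 = sum((a - mean1) * (b - mean2) for a, b in zip(array1, array2))
    s11 = sum((a - mean1) ** 2 for a in array1)
    s22 = sum((b - mean2) ** 2 for b in array2)
    return s12 / sqrt(s11 * s22)


r_fires = correl(temperature, fires)
r_acres = correl(temperature, acres_millions)
print(f"temperature vs. number of fires: r = {r_fires:.3f}")
print(f"temperature vs. acres burned:    r = {r_acres:.3f}")
```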
Facts about r:

1. Positive r indicates positive correlation between the
variables, and negative r indicates negative correlation.
2. The correlation coefficient r always falls between -1 and 1,
that is, $-1 \le r \le 1$.
3. The extreme values r = -1 and r = 1 indicate perfect straight-
line (linear) association.
4. The correlation between x and y does not change when we
change the units of measurement of x, y, or both; r has no units.
5. Correlation ignores the distinction between explanatory and
response variables.
6. Correlation measures the strength of only linear association
between two variables.
7. Like the mean and standard deviation, r is strongly affected
by a few outliers; in other words, r is not resistant.
8. Correlation only makes sense for quantitative variables. We
can talk about the relationship or association between gender of
voters and political party, but not of the correlation between
these variables.
9. Note that correlation is not a complete description of
bivariate (two-variable) data. State the means and standard
deviations of both x and y along with the correlation.
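Facts 4 and 5 are easy to verify numerically. The sketch below uses the sleep deprivation data from the regression notes, converts hours to minutes, and swaps the roles of the two variables; in each case r should be unchanged up to rounding. The helper name is illustrative.

```python
from math import sqrt

hours = [15, 15, 15, 18, 18, 18, 21, 21, 21, 24, 24, 24]
tasks = [13, 9, 15, 8, 12, 10, 5, 8, 7, 3, 5, 4]


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_hours = pearson_r(hours, tasks)

# Fact 4: changing units (hours -> minutes) does not change r.
minutes = [h * 60 for h in hours]
r_minutes = pearson_r(minutes, tasks)

# Fact 5: swapping explanatory and response variables does not change r.
r_swapped = pearson_r(tasks, hours)

print(f"r (hours) = {r_hours:.4f}")
print(f"r (minutes) = {r_minutes:.4f}")
print(f"r (swapped) = {r_swapped:.4f}")
```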
Interpreting Correlation Coefficients
Size of the Correlation Coefficient | Interpretation
.8 to 1.0 | Very strong relationship
.6 to .8 | Strong relationship
.4 to .6 | Moderate relationship
.2 to .4 | Weak relationship
.0 to .2 | Weak or no relationship

Types of Correlation and Relationships
What Happens to Variable X | What Happens to Variable Y | Type of Correlation | Value | Example
X increases in value | Y increases in value | Direct or positive | Positive, ranging from 0 to +1 | The more time you spend studying, the higher your test score will be.
X decreases in value | Y decreases in value | Direct or positive | Positive, ranging from 0 to +1 | The less money you put in the bank, the less interest you will earn.
X increases in value | Y decreases in value | Indirect or negative | Negative, ranging from -1 to 0 | The more you exercise, the less you will weigh.
X decreases in value | Y increases in value | Indirect or negative | Negative, ranging from -1 to 0 | The less time you take to complete a test, the more you’ll get wrong.
Types of Measurement of Correlation
Variable X | Variable Y | Type of correlation coefficient | Correlation being computed
Nominal (voting preference, such as Democrat or Republican) | Nominal (sex, such as male or female) | Phi coefficient | The correlation between voting preference and sex.
Nominal (social class, such as high, medium, or low) | Ordinal (rank in high school graduating class) | Rank biserial coefficient | The correlation between social class and rank in high school.
Nominal (family configuration, such as intact or single parent) | Interval (grade point average) | Point biserial | The correlation between family configuration and grade point average.
Ordinal (height converted to rank) | Ordinal (weight converted to rank) | Spearman rank coefficient | The correlation between height and weight.
Interval (number of problems solved) | Interval (age in years) | Pearson product-moment correlation coefficient | The correlation between number of problems solved and age in years.
Remember:
· Plot your data first.
· Look at each variable separately first, then study relationships
between variables.
FSE 200
Homework Assignment 4
25 Points
Directions: Complete all questions below. Print out and submit this assignment in HARD COPY on the due date listed in the syllabus.
Using the following data, determine whether or not the square footage of a particular fire station has an effect on the turnout time for the firefighters. (This data is recreated from an EFO paper by Michael E. Dell'Orfano.)
Completion of Chart (4 points)
Station | Square Footage | Turnout Time (in minutes) | Area Squared | Time Squared | Area * Time
31 | 3842 | 1.82 | | |
32 | 11572 | 1.86 | | |
33 | 6802 | 2.13 | | |
34 | 18500 | 2.37 | | |
35 | 19232 | 1.92 | | |
36 | 4500 | 2.33 | | |
37 | 6700 | 2 | | |
38 | 3093 | 2.02 | | |
39 | 9229 | 2.22 | | |
40 | 6094 | 2.42 | | |
41 | 15130 | 2.44 | | |
42 | 7523 | 2.22 | | |
43 | 15000 | 3.18 | | |
44 | 9000 | 2.18 | | |
45 | 9647 | 2.35 | | |
46 | 15130 | 2.5 | | |
Sum
Mean (1 point)
Correlation Coefficient (1 point)
Coefficient of Determination (1 point)
SD (1 point)
What is the Dependent Variable (1 point)?
What is the Independent Variable (1 point)?
Place a scatterplot of the data below (2 points). Show the trendline as well. Remember to include titles and labels.
What does this data tell you about the relationship (2 points)?
State the equation of the least-squares regression line (1 point).
Identify and interpret the slope (2 points).
Identify and interpret the intercept (2 points).
If appropriate, use the least-squares regression equation to predict the turnout time for a 12,000 square foot fire station (2 points).
If appropriate, use the least-squares regression equation to predict the turnout time for a 25,000 square foot fire station (2 points).
Obtain the residual plot and analyze the output (2 points).
FSE 200
Homework Assignment 4
25 Points
Directions: Complete all questions below. Save the file containing your solutions and submit electronically under Assignments -> Assignment 4 in Blackboard.
Using the following data, determine whether or not the square footage of a particular fire station has an effect on the turnout time for the firefighters. (This data is recreated from an EFO paper by Michael E. Dell'Orfano.)
Completion of Chart (4 points)
Station | Square Footage | Turnout Time (in minutes) | Area Squared | Time Squared | Area * Time
31 | 3842 | 1.82 | 14760964 | 3.3124 | 6992.44
32 | 11572 | 1.86 | 133911184 | 3.4596 | 21523.92
33 | 6802 | 2.13 | 46267204 | 4.5369 | 14488.26
34 | 18500 | 2.37 | 342250000 | 5.6169 | 43845
35 | 19232 | 1.92 | 369869824 | 3.6864 | 36925.44
36 | 4500 | 2.33 | 20250000 | 5.4289 | 10485
37 | 6700 | 2 | 44890000 | 4 | 13400
38 | 3093 | 2.02 | 9566649 | 4.0804 | 6247.86
39 | 9229 | 2.22 | 85174441 | 4.9284 | 20488.38
40 | 6094 | 2.42 | 37136836 | 5.8564 | 14747.48
41 | 15130 | 2.44 | 228916900 | 5.9536 | 36917.2
42 | 7523 | 2.22 | 56595529 | 4.9284 | 16701.06
43 | 15000 | 3.18 | 225000000 | 10.1124 | 47700
44 | 9000 | 2.18 | 81000000 | 4.7524 | 19620
45 | 9647 | 2.35 | 93064609 | 5.5225 | 22670.45
46 | 15130 | 2.5 | 228916900 | 6.25 | 37825
Sum | 160994 | 35.96 | 2017571040 | 82.4256 | 370577.49
Mean (1 point) | 10062.13 | 2.25
SD (1 point) | 5148.65 | 0.33
Correlation Coefficient (1 point): 0.346
Coefficient of Determination (1 point): 0.120
What is the Dependent Variable (1 point)?
What is the Independent Variable (1 point)?
Place a scatterplot of the relationship of turnout time and square footage below (2 points). Show the trendline as well. Remember to include titles and labels.
What does this data tell you about the relationship (2 points)?
State the equation of the least-squares regression line (1 point).
Identify and interpret the slope (2 points).
Identify and interpret the intercept (2 points).
If appropriate, use the least-squares regression equation to predict the turnout time for a 12,000 square foot fire station (2 points).
If appropriate, use the least-squares regression equation to predict the turnout time for a 25,000 square foot fire station (2 points).
Obtain the residual plot and analyze the output (2 points).
RESIDUAL OUTPUT
Observation | Predicted Turnout Time (in minutes) | Residuals
1 | 2.1107255995 | -0.2907255995
2 | 2.2807006588 | -0.4207006588
3 | 2.1758130737 | -0.0458130737
4 | 2.4330405308 | -0.0630405308
5 | 2.4491364873 | -0.5291364873
6 | 2.1251943691 | 0.2048056309
7 | 2.1735701946 | -0.1735701946
8 | 2.0942558299 | -0.0742558299
9 | 2.2291804048 | -0.0091804048
10 | 2.1602448536 | 0.2597551464
11 | 2.3589375619 | 0.0810624381
12 | 2.1916671511 | 0.0283328489
13 | 2.3560789904 | 0.8239210096
14 | 2.2241449211 | -0.0441449211
15 | 2.2383718116 | 0.1116281884
16 | 2.3589375619 | 0.1410624381
Turnout time (in minutes) for the firefighters is the dependent
variable.

[Figure: Scatter Plot. Horizontal axis: Area (Square footage). Vertical axis: Turnout time.]
[Figure: Square Footage Residual Plot. Horizontal axis: Square Footage. Vertical axis: Residuals.]
Area (square footage) of a fire station is the independent variable.
This data tells us that there is a weak positive relationship
between square footage and turnout time for firefighters.
Hence as square footage increases, the turnout time for
firefighters also tends to increase, but only slightly.
The equation of the least-squares regression line is given below.
Turnout time = 2.026 + 0.000022*Square footage
The slope of the regression line is 0.000022.
The slope indicates that for each one-square-foot increase
in the area of a fire station, the predicted turnout time for firefighters
increases by 0.000022 minutes.
The intercept of the regression line is 2.026.
The intercept indicates that when the area of a fire station is 0 square
feet, the predicted turnout time for firefighters is approximately
2.026 minutes. Since no station has an area close to 0 square feet,
the intercept has no practical meaning here.
Using the least-squares regression equation:
Turnout time = 2.026 + 0.000022*Square footage
Turnout time = 2.026 + 0.000022*12,000 = 2.29 minutes
Hence the predicted turnout time for a 12,000 square foot fire
station is 2.29 minutes.
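The regression results for the fire station data can be reproduced with a short script. This is a minimal sketch using the square footage and turnout times from the homework chart; the function and variable names are illustrative, and Excel's Regression tool should give matching values.

```python
from math import sqrt

square_footage = [3842, 11572, 6802, 18500, 19232, 4500, 6700, 3093,
                  9229, 6094, 15130, 7523, 15000, 9000, 9647, 15130]
turnout_time = [1.82, 1.86, 2.13, 2.37, 1.92, 2.33, 2.0, 2.02,
                2.22, 2.42, 2.44, 2.22, 3.18, 2.18, 2.35, 2.5]


def regression_summary(x, y):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope, sxy / sqrt(sxx * syy)


a, b, r = regression_summary(square_footage, turnout_time)
pred_12000 = a + b * 12000

print(f"Turnout time = {a:.3f} + {b:.6f} * Square footage")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"Predicted turnout time at 12,000 sq ft: {pred_12000:.2f} minutes")
```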
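The least-squares regression equation should not be used to predict the
turnout time for a 25,000 square foot fire station, because x = 25,000
lies outside the range of areas in the data (3,093 to 19,232 square
feet); such a prediction would be an extrapolation.
The residual plot shows the residuals scattered randomly about zero
with no systematic pattern, so a linear model is reasonable for these
data. One station (15,000 square feet, turnout time 3.18 minutes) has a
noticeably large residual of about 0.82 and is worth checking.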
