Exploring bivariate data

Bivariate Data
 Analyzing patterns in scatterplots
 Correlation and linearity
 Least-squares regression line
 Residual plots, outliers, and influential points
 Transformations to achieve linearity:
logarithmic and power transformations

Scatterplots
 The most effective way to display the
relationship between two quantitative
variables.
 The values of one variable appear on the
horizontal axis, and the values of the
other variable appear on the vertical axis.
Each individual in the data appears as
the point in the plot fixed by the values of
both variables for that individual.

Scatterplot Variables
 Response Variable - measures the outcome
of a study. (The dependent variable, plotted
on the y-axis).
 Explanatory or Predictor Variable – helps
explain or predict changes in a response
variable. (The independent variable, plotted
on the x-axis.)

Example:
 If you think that alcohol causes body
temperature to drop, you might do a study
giving certain amounts of alcohol to mice and
measuring the resulting temperature drops.
 In this case the explanatory variable is the
amount of alcohol and the response variable
is the measured temperature drop.

There are two ways of determining
whether two variables are related:
1) By looking at a scatter plot (graphical
approach)
2) By calculating a “correlation coefficient”
(mathematical approach)

How to Make a Scatterplot
1. Decide which
variable should
go on each axis.
2. Label and scale
your axes.
3. Plot individual
data values.

Interpreting a Scatterplot
 In any graph of data, look for the overall
pattern and for striking deviations from that
pattern.
 You can describe the overall pattern of a
scatterplot by the form, direction and strength
of the relationship.
 An important kind of deviation is an outlier, an
individual value that falls outside the overall
pattern of the relationship.

Clusters
Clusters of points
within the plot
can indicate the
presence of
another variable.
The scatterplot
on the right
shows two clear
clusters—one
near 2 minutes;
the other
between 4 – 5
minutes.

Gaps
Gaps are regions
(values) of the
explanatory variable
that have no
associated response
measurements.
The scatterplot on the
right shows a gap
between 60,000 and
80,000 white blood
cells (and probably
another between
80,000 and 100,000).

Correlation Coefficient (r)
 The correlation coefficient (r ) measures the
strength of the linear relationship between two
quantitative variables.
 Gives a numerical description of the strength
and direction of the linear association between
two variables.
$$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$

Properties of r
 r is always a number
between -1 and 1
 r > 0 indicates a positive
association.
 r < 0 indicates a negative
association.
 Values of r near 0 indicate
a very weak linear
relationship.
 The strength of the linear
relationship increases as r
moves away from 0
towards -1 or 1.
 The extreme values r = -1
and r = 1 occur only in the
case of a perfect linear
relationship.
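
As a rough illustration of the definition of r, the sketch below computes the correlation as the sum of products of standardized x and y values, divided by n − 1. The function name `correlation` and the data are made up for this example; only the Python standard library is used.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 denominator)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(xs, ys)  # strong positive linear association, close to 1
perfect_neg = correlation(xs, [10 - 2 * x for x in xs])  # points exactly on a falling line
print(round(r, 4), round(perfect_neg, 4))
```

The second call shows the extreme case: points lying exactly on a line with negative slope give r = -1 (up to floating-point rounding).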

Correlation ≠ Causation
 Whenever we have a strong correlation, it is tempting to
explain it by imagining that the explanatory variable has
caused the response to change.
 A variable that is not explicitly part of a study but affects
the way the variables in the study appear to be related is
called a lurking variable.
 Because we can never be certain that observational data
are not hiding a lurking variable, it is never safe to
conclude that a scatterplot demonstrates a cause-and-
effect relationship, no matter how strong the correlation.
 Scatterplots and correlation coefficients never prove
causation.

Least-Squares Regression Line (LSRL)
Least Squares Regression (linear regression) allows
you to fit a line to a scatter diagram in order to be
able to predict what the value of one variable will be
based on the value of another variable.
$$\hat{y} = a + bx$$
a: y-intercept
b: slope of the line

Regression Line
• A regression line is a
straight line that
describes how a
response variable y
changes as an
explanatory variable
x changes.
• We often use the
regression line for
predicting the value
of y for a given value
of x.

Interpreting a Regression Line
 The line is fitted to the data through a
process called the method of least squares. The
main idea behind this method is that the sum of the
squared vertical distances between the data points
and the line is minimized.
 The least squares regression line is a mathematical
model for the data that helps us predict values of
the response (dependent) variable from the
explanatory (independent) variable. Therefore,
with regression, unlike with correlation, we must
specify which is the response and which is the
explanatory variable.
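
The least-squares idea can be checked numerically. The sketch below uses made-up data and a helper named `sum_squared_errors` (a name chosen for this example). It finds the least-squares line with the standard closed-form solution, then shows that nudging the intercept or the slope in either direction makes the total squared vertical distance larger.

```python
def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances between each point and the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares slope and intercept (standard closed-form solution)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

best = sum_squared_errors(a, b, xs, ys)

# Any small change to the intercept or slope increases the total
nudged = [
    sum_squared_errors(a + 0.1, b, xs, ys),
    sum_squared_errors(a - 0.1, b, xs, ys),
    sum_squared_errors(a, b + 0.1, xs, ys),
    sum_squared_errors(a, b - 0.1, xs, ys),
]
print(round(best, 4), [round(v, 4) for v in nudged])
```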

Formulas for finding the slope and
y-intercept in a linear regression line:
slope: $b = r \frac{s_y}{s_x}$
y-intercept: $a = \bar{y} - b\bar{x}$
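
A minimal sketch of these two formulas in code, assuming made-up data that lies exactly on the line y = 1 + 2x so the result is easy to verify. The function name `lsrl` is chosen for this example only.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x, using b = r * s_y / s_x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation coefficient r
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b

# Made-up data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = lsrl(xs, ys)
print(round(a, 6), round(b, 6))
```

Because the points are perfectly linear, r is 1 and the recovered intercept and slope match the line used to generate the data.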

When will we ever need this?
 We use regression lines to make predictions.
 Interpolation – making predictions within
known data values.
 Extrapolation – making predictions beyond
known data values.
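
The distinction can be made concrete with a small helper. In the sketch below, the function `predict`, the coefficients, and the data are all hypothetical. It labels a prediction as interpolation when x falls inside the range of the observed x values, and as extrapolation otherwise.

```python
def predict(a, b, x, observed_xs):
    """Predict y-hat = a + b*x and report whether x lies inside the observed x range."""
    y_hat = a + b * x
    inside = min(observed_xs) <= x <= max(observed_xs)
    return y_hat, "interpolation" if inside else "extrapolation"

# Hypothetical fitted line and the x values it was fitted on
a, b = 2.2, 0.6
observed_xs = [1, 2, 3, 4, 5]

inside_result = predict(a, b, 3, observed_xs)    # x = 3 is within 1..5
outside_result = predict(a, b, 10, observed_xs)  # x = 10 is beyond the data
print(inside_result, outside_result)
```

Extrapolated predictions deserve more caution, since nothing in the data shows that the linear pattern continues beyond the observed range.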

How good is our prediction?
The strength of a prediction which uses the LSRL
depends on how close the data points are to the
regression line. The mathematical approach to
describing this strength is via the coefficient of
determination. The coefficient of determination
gives us the proportion of variation in the values
of y that is explained by least-squares regression
of y on x. The coefficient of determination turns
out to be the correlation coefficient squared (r²).
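
As a numerical check, the sketch below uses made-up data to compute the proportion of variation explained by the line, written as 1 − SSE/SST (one common way to express it, not a formula taken from the slides), and compares it with the square of r.

```python
from statistics import mean

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y (SST)

b = s_xy / s_xx        # least-squares slope
a = y_bar - b * x_bar  # least-squares intercept
r = s_xy / (s_xx * s_yy) ** 0.5

# Variation left unexplained by the line (SSE)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / s_yy

print(round(r_squared, 4), round(r ** 2, 4))
```

For this data, both values come out to about 0.6, meaning the line accounts for roughly 60% of the variation in y.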

Residuals
 Since the LSRL minimizes the sum of the squared vertical distances
between the data values and the fitted line, we have a special name
for these vertical distances. They are called residuals.
 A residual is simply the difference between the
observed y and the predicted y.
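
A short sketch of the calculation, using made-up data. The line ŷ = 2.2 + 0.6x is the least-squares line for this particular data set and is hardcoded here for brevity.

```python
# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line for this data (hardcoded for the example)
a, b = 2.2, 0.6

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(x, round(e, 2))

# Residuals from a least-squares line sum to zero (up to rounding)
total = sum(residuals)
```

Points above the line have positive residuals and points below it have negative residuals.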

Residual Plots
 Residuals help us
determine how well
our data can be
modeled by a straight
line, by enabling us to
construct a residual
plot.
 A residual plot is a
scatter diagram that
plots the residuals on
the y-axis and their
corresponding x
values on the x-axis.

INTERPRETING RESIDUAL PLOTS:
The following residual plot is in a curved
pattern and shows that the relationship is not
linear. A straight line is not a good summary
for such data.
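
To see how such a curved pattern arises, the sketch below fits a straight line to made-up data that follows y = x² and lists the residuals. The helper `fit_line` is a name chosen for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Made-up curved data: y = x squared
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print([round(e, 2) for e in residuals])
```

The residuals are positive at both ends and negative in the middle. That U shape is the signature of a relationship a straight line cannot capture.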

Increasing spread about the line as x increases
indicates that prediction of y will be less accurate
for larger x, as shown in this residual plot.
(Decreasing spread means prediction is less accurate
for smaller x.)

The following shows a residual plot that has a
uniform scatter of points about the fitted line
with no unusual observations. This tells us that
our linear model (regression line) will give us a
good prediction of the data.

Unusual and Influential Data
Outliers
Outlier: A value in a set of data that does not fit with the rest of
the data
Leverage
- An observation with an extreme value on a predictor variable.
• Leverage is a measure of how far an independent variable
deviates from its mean.
• These leverage points can have an effect on the estimate of
regression coefficients.
Influence
- Influence can be thought of as the product of leverage and
outlierness.
• Removing the observation substantially changes the
estimate of coefficients.

Outliers
 Data points more than 2
standard deviations away
from the mean of the data
set
 Data points that do not fit
the pattern governed by
the rest of the data
 In regression, any data
point that has an unusually
large residual
How can I tell if a point
in my data set is an
outlier?
• Multiply the IQR (interquartile
range) of your data set by 1.5.
Subtract that number from
Quartile 1 and add it to
Quartile 3. Any value lying
outside these two limits can be
considered an outlier.
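
The 1.5 × IQR rule can be sketched as follows, using made-up data. Quartiles are computed with `statistics.quantiles` at its default settings. Textbooks and calculators use slightly different quartile conventions, so the fences may differ a little from a hand calculation.

```python
from statistics import quantiles

# Made-up data for illustration only
data = [2, 4, 5, 6, 7, 8, 9, 10, 12, 40]

q1, _median, q3 = quantiles(data, n=4)  # three cut points: Q1, median, Q3
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # below Quartile 1
upper_fence = q3 + 1.5 * iqr  # above Quartile 3

outliers = [v for v in data if v < lower_fence or v > upper_fence]
print(lower_fence, upper_fence, outliers)
```

Here only the value 40 falls outside the fences, so it is the only point flagged.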

Influential Points
 Influential points are normally outliers in the X
direction, but are not always outliers in terms of
regression
 A point is said to be influential if removing it
substantially changes the LSRL.
 A point with high leverage is potentially influential,
but it is not influential if it follows the pattern of
the rest of the data.
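
One way to check a suspected point is to fit the line with and without it and compare the slopes. The sketch below does this on made-up data, where the last point has an x value far from the others. The helper name `fit_line` is chosen for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Made-up data; the last point (20, 2) is extreme in the x direction
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 5, 4, 5, 2]

_, slope_with = fit_line(xs, ys)
_, slope_without = fit_line(xs[:-1], ys[:-1])  # drop the suspect point and refit

print(round(slope_with, 3), round(slope_without, 3))
```

With the point included, the slope is slightly negative. Without it, the slope is about 0.6. A change this large marks the point as influential for this data set.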
