Exploring Bivariate Data
Bivariate Data
 Analyzing patterns in scatterplots
 Correlation and linearity
 Least-squares regression line
 Residual plots, outliers, and influential points
 Transformations to achieve linearity:
logarithmic and power transformations
Scatterplots
 The most effective way to display the
relationship between two quantitative
variables.
 The values of one variable appear on the
horizontal axis, and the values of the
other variable appear on the vertical axis.
Each individual in the data appears as
the point in the plot fixed by the values of
both variables for that individual.
Scatterplot Variables
 Response Variable - measures the outcome
of a study. (The dependent variable, plotted
on the y-axis).
 Explanatory or Predictor Variable – helps
explain or predict changes in a response
variable. (The independent variable, plotted
on the x-axis.)
Example:
 If you think that alcohol causes body
temperature to drop, you might do a study
giving certain amounts of alcohol to mice, and
measuring the temperature drops.
 In this case the explanatory variable is the
amount of alcohol and the response variable
is the measured temperature drop.
There are two ways of determining
whether two variables are related:
1) By looking at a scatter plot (graphical
approach)
2) By calculating a “correlation coefficient”
(mathematical approach)
How to Make a Scatterplot
1. Decide which
variable should
go on each axis.
2. Label and scale
your axes.
3. Plot individual
data values.
Interpreting a Scatterplot
 In any graph of data, look for the overall
pattern and for striking deviations from that
pattern.
 You can describe the overall pattern of a
scatterplot by the form, direction and strength
of the relationship.
 An important kind of deviation is an outlier, an
individual value that falls outside the overall
pattern of the relationship.
Positive Linear Association
No Association
Clusters
Clusters of points
within the plot
can indicate the
presence of
another variable.
The scatterplot
on the right
shows two clear
clusters—one
near 2 minutes;
the other
between 4 – 5
minutes.
Gaps
Gaps are regions
(values) of the
explanatory variable
that have no
associated response
measurements.
The scatterplot on the
right shows a gap
between 60,000 and
80,000 white blood
cells (and probably
another between
80,000 and 100,000).
Correlation Coefficient (r)
 The correlation coefficient (r ) measures the
strength of the linear relationship between two
quantitative variables.
 Gives a numerical description of the strength
and direction of the linear association between
two variables.
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
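The formula for r can be checked by hand on a small data set. The sketch below standardizes each x and y value, multiplies the pairs, adds them up, and divides by n - 1. The function name `correlation` and the `hours`/`score` data are made up for illustration and are not from the slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized x and y values, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 in the denominator)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data, invented for this example
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = correlation(hours, score)
print(round(r, 3))  # 0.775
```

Swapping the two lists gives the same r, because the formula treats x and y symmetrically.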
Properties of r
 r is always a number
between -1 and 1
 r > 0 indicates a positive
association.
 r < 0 indicates a negative
association.
 Values of r near 0 indicate
a very weak linear
relationship.
 The strength of the linear
relationship increases as r
moves away from 0
towards -1 or 1.
 The extreme values r = -1
and r = 1 occur only in the
case of a perfect linear
relationship.
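A quick numerical check of these properties, as a sketch on made-up data: a perfect increasing line should give r = 1, a perfect decreasing line r = -1, and a symmetric curve can give r = 0 even though y is completely determined by x. The helper `correlation` is a hypothetical name, not something from the slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / ((len(xs) - 1) * s_x * s_y)

xs = [-2, -1, 0, 1, 2]
r_up = correlation(xs, [3 + 2 * x for x in xs])    # perfect increasing line
r_down = correlation(xs, [3 - 2 * x for x in xs])  # perfect decreasing line
r_curve = correlation(xs, [x * x for x in xs])     # perfect curve, but not a line

print(round(r_up, 6), round(r_down, 6), round(r_curve, 6))
```

The last case is a reminder that r near 0 only rules out a linear relationship, not a relationship of some other form.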
Correlation ≠ Causation
 Whenever we have a strong correlation, it is tempting to
explain it by imagining that the explanatory variable has
caused the response to change.
 A variable that is not explicitly part of a study but affects
the way the variables in the study appear to be related is
called a lurking variable.
 Because we can never be certain that observational data
are not hiding a lurking variable, it is never safe to
conclude that a scatterplot demonstrates a cause-and-
effect relationship, no matter how strong the correlation.
 Scatterplots and correlation coefficients never prove
causation.
Least-Squares Regression
(LSRL)
Least Squares Regression (linear regression) allows
you to fit a line to a scatter diagram in order to be
able to predict what the value of one variable will be
based on the value of another variable.
$$ \hat{y} = a + bx $$
a: y-intercept
b: slope of the line
Regression Line
• A regression line is a
straight line that
describes how a
response variable y
changes as an
explanatory variable
x changes.
• We often use the
regression line for
predicting the value
of y for a given value
of x.
Interpreting a Regression Line
 The way the line is fitted to the data is through a
process called the method of least squares. The
main idea behind this method is that the sum of the
squared vertical distances between the data points
and the line is minimized.
 The least squares regression line is a mathematical
model for the data that helps us predict values of
the response (dependent) variable from the
explanatory (independent) variable. Therefore,
with regression, unlike with correlation, we must
specify which is the response and which is the
explanatory variable.
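Written as a formula, the quantity that least squares makes as small as possible is the sum of squared vertical distances between the observed values and the line. Here a and b are the intercept and slope of the line; the name SSE (sum of squared errors) is a common label and is an assumption, since the slides do not name it.

```latex
\mathrm{SSE}(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
```

The least-squares regression line is the pair (a, b) for which this sum is smallest. Swapping the roles of x and y changes the distances being squared, which is why the choice of response variable matters.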
Formulas for finding the slope and
y-intercept in a linear regression line:
slope: $$ b = r \frac{s_y}{s_x} $$
y-intercept: $$ a = \bar{y} - b\bar{x} $$
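These two formulas are enough to fit the line by hand. The sketch below computes r, then the slope b = r * s_y / s_x, then the intercept a = y-bar - b * x-bar. The function name `least_squares_line` and the data are hypothetical.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r * s_y / s_x and a = y-bar - b * x-bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # y-hat = 2.20 + 0.60x
```

Because a = y-bar - b * x-bar, the fitted line always passes through the point (x-bar, y-bar).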
When will we ever need this?
 We use regression lines to make predictions.
 Interpolation – making predictions within
known data values.
 Extrapolation – making predictions beyond
known data values.
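One simple way to tell the two apart is to compare the new x value with the range of observed x values. The sketch below assumes a line with intercept a and slope b has already been fitted; the function name `predict` and the numbers are hypothetical.

```python
def predict(a, b, x_new, xs):
    """Predict y-hat = a + b * x_new and label the prediction by where x_new falls."""
    y_hat = a + b * x_new
    inside = min(xs) <= x_new <= max(xs)
    kind = "interpolation" if inside else "extrapolation"
    return y_hat, kind

xs = [1, 2, 3, 4, 5]   # observed x values (hypothetical)
a, b = 2.2, 0.6        # line assumed to have been fitted to those observations

y_in, kind_in = predict(a, b, 3.5, xs)
y_out, kind_out = predict(a, b, 10, xs)
print(kind_in, kind_out)  # interpolation extrapolation
```

Extrapolated predictions deserve extra caution, because nothing in the data shows that the linear pattern continues outside the observed range.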
How good is our prediction?
The strength of a prediction which uses the LSRL
depends on how close the data points are to the
regression line. The mathematical approach to
describing this strength is via the coefficient of
determination. The coefficient of determination
gives us the proportion of variation in the values
of y that is explained by least-squares regression
of y on x. The coefficient of determination turns
out to be the correlation coefficient squared (r²).
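The claim that the coefficient of determination equals r squared can be checked numerically. The sketch below, on made-up data, computes the proportion of variation explained as 1 - SSE/SST and compares it with r squared. The slope is computed as S_xy / S_xx, which is algebraically the same as r * s_y / s_x; all variable names are hypothetical.

```python
from statistics import mean

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
x_bar, y_bar = mean(xs), mean(ys)

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

b = s_xy / s_xx          # slope of the least-squares line
a = y_bar - b * x_bar    # y-intercept

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # variation left unexplained
sst = s_yy                                                  # total variation in y
r_squared_from_fit = 1 - sse / sst
r_squared_from_r = s_xy ** 2 / (s_xx * s_yy)

print(round(r_squared_from_fit, 3), round(r_squared_from_r, 3))  # 0.6 0.6
```

Here 60% of the variation in y is explained by the regression of y on x, and the remaining 40% is left in the residuals.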
Residuals
 Since the LSRL minimizes the sum of the squared vertical
distances between the data values and the line, we have a
special name for these vertical distances: residuals.
 A residual is simply the difference between the
observed y and the predicted y: residual = y - ŷ.
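As a small worked example, the sketch below computes residual = observed y - predicted y for each point. The data are made up, and the values of a and b are the least-squares intercept and slope for this particular data set.

```python
# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6   # least-squares intercept and slope for this data

# residual = observed y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
rounded = [round(e, 1) for e in residuals]
print(rounded)  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

A positive residual means the point lies above the line and a negative residual means it lies below. For a least-squares line the residuals always add up to zero, apart from rounding.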
Residual Plots
 Residuals help us
determine how well
our data can be
modeled by a straight
line, by enabling us to
construct a residual
plot.
 A residual plot is a
scatter diagram that
plots the residuals on
the y-axis and their
corresponding x
values on the x-axis.
INTERPRETING RESIDUAL PLOTS:
The following residual plot is in a curved
pattern and shows that the relationship is not
linear. A straight line is not a good summary
for such data.
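A curved residual pattern can be reproduced with a small experiment. The sketch below fits a straight line to data that actually follow y = x squared and then prints the sign of each residual. The data and variable names are hypothetical.

```python
from statistics import mean

xs = [1, 2, 3, 4, 5, 6]
ys = [x * x for x in xs]   # curved (quadratic) relationship, made-up data

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)  # +----+
```

The residuals are positive at both ends and negative in the middle. That systematic run of signs is the curved pattern a residual plot would show, and it signals that a straight line is the wrong model for these data.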
INTERPRETING RESIDUAL PLOTS:
A spread about the line that changes as x increases
indicates that prediction of y will be less accurate
where the spread is larger. In this residual plot the
spread grows, so prediction is less accurate for larger x.
INTERPRETING RESIDUAL PLOTS:
The following shows a residual plot that has a
uniform scatter of points about the fitted line
with no unusual observations. This tells us that
our linear model (regression line) will give us a
good prediction of the data.
Unusual and Influential Data
Outliers
Outlier: A value in a set of data that does not fit with the rest of
the data
Leverage
- An observation with an extreme value on a predictor variable.
• Leverage is a measure of how far an independent variable
deviates from its mean.
• These leverage points can have an effect on the estimate of
regression coefficients.
Influence
- Influence can be thought of as the product of leverage and
outlierness.
• Removing the observation substantially changes the
estimate of coefficients.
Outliers
 Data points more than 2
standard deviations away
from the mean of the data
set
 Data points that do not fit
the pattern governed by
the rest of the data
 In regression, any data
point that has an unusually
large residual
How can I tell if a point
in my data set is an
outlier?
• Take the IQR (interquartile
range) of your data set and
multiply it by 1.5. Subtract
that number from Quartile
1 and add it to Quartile 3.
Any number lying outside
these two values can be
considered an outlier.
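The 1.5 x IQR rule can be sketched in a few lines: the lower fence is Q1 - 1.5 x IQR and the upper fence is Q3 + 1.5 x IQR. Textbooks and software compute quartiles in slightly different ways; the `method="inclusive"` setting below is one such convention and is an assumption here. The function name `iqr_outliers` and the data are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(data):
    """Return the values outside the 1.5 * IQR fences, along with the fences themselves."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")  # quartile convention is an assumption
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # lower fence
    upper = q3 + 1.5 * iqr   # upper fence
    flagged = [v for v in data if v < lower or v > upper]
    return flagged, (lower, upper)

# Hypothetical data, invented for this example
data = [2, 4, 5, 5, 6, 7, 8, 9, 30]
outliers, fences = iqr_outliers(data)
print(outliers, fences)  # [30] (0.5, 12.5)
```

Here Q1 = 5 and Q3 = 8, so the IQR is 3 and the fences are 0.5 and 12.5. Only the value 30 falls outside them.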
Influential Points
 Influential points are normally outliers in the X
direction, but are not always outliers in terms of
regression
 A point is said to be influential if removing it
substantially changes the LSR line.
 A point with high leverage has the potential to be
influential, but it is influential only if the line changes
noticeably when the point is removed.
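One way to judge influence is to fit the line with and without the point in question and compare the slopes. The sketch below adds a point with an extreme x value to a small made-up data set twice: once on the existing trend and once far from it. The function name `fit` and all data values are hypothetical.

```python
from statistics import mean

def fit(points):
    """Least-squares intercept and slope (a, b) for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

base = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]   # made-up data
on_trend = base + [(15, 11.2)]   # extreme x, but exactly on the line fitted to base
off_trend = base + [(15, 2)]     # extreme x and far from that line

slope_base = fit(base)[1]
slope_on = fit(on_trend)[1]
slope_off = fit(off_trend)[1]
print(round(slope_base, 3), round(slope_on, 3), round(slope_off, 3))
```

In this sketch the point that sits on the trend leaves the slope where it was, while the point that sits off the trend flips the slope from positive to negative. Both added points have the same extreme x value, so the difference comes from how far each one falls from the pattern of the other points.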
