Chapter-1-section 2.1 Exploring data-Edition-5.pptx

Chapter 1 & Section 2.1
Exploring Data

Introduction
• Statistics: the science of data. We begin our study of statistics by mastering the art of examining data. Any set of data contains information about some group of individuals. The information is organized in variables.
• Individuals: the objects described by a set of data. Individuals may be people, but they may also be other things.
• Variable: any characteristic of an individual. A variable can take different values for different individuals.

Variable Types
• Categorical variable: places an individual into one of several groups or categories.
• Quantitative variable: takes numerical values for which arithmetic operations such as adding and averaging make sense.
• Distribution: the pattern of variation of a variable. It tells what values the variable takes and how often it takes these values.
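A distribution can be tabulated directly as "value, count, share". The sketch below is only an illustration: the list of transmission types is hypothetical data (not taken from the slides), and `distribution` is a helper name introduced here.

```python
from collections import Counter

# Hypothetical transmission types for eight cars (illustrative data only).
transmissions = ["automatic", "manual", "automatic", "automatic",
                 "manual", "automatic", "manual", "automatic"]

def distribution(values):
    """Return {value: (count, relative frequency)} for a list of observations."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, count / total) for value, count in counts.items()}

dist = distribution(transmissions)
for value, (count, share) in dist.items():
    # One line per category: what value the variable takes and how often.
    print(f"{value}: {count} ({share:.1%})")
```

The same tabulation works for a quantitative variable, although grouping the values into classes first is usually more readable.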

• A. The individuals are the BMW 318I, the Buick Century, and the Chevrolet Blazer.
• B. The variables given are:
  • Vehicle type (categorical)
  • Transmission type (categorical)
  • Number of cylinders (quantitative)
  • City MPG (quantitative)
  • Highway MPG (quantitative)

1.1 & 1.2: Displaying Distributions with graphs
• Graphs used to display data:
• bar graphs, pie charts, dot plots, stem plots, histograms, and
time plots
• Purpose of a graph:
• Helps to understand the data.
• Allows overall patterns and striking deviations from that pattern
to be seen.
• Describing the overall pattern:
• Three biggest descriptors:
• shape, center and spread.
• Next look for outliers and clusters.

Shape
• Concentrate on main features.
  • Major peaks, outliers (not just the smallest and largest observations), rough symmetry or clear skewness.
• Types of shapes: symmetric, skewed right, skewed left.

1.5 How to make a bar graph.
Percent of females among people earning doctorates in 1994.

Field              | Percent
Computer science   | 15.4%
Education          | 60.8%
Engineering        | 11.1%
Life sciences      | 40.7%
Physical sciences  | 21.7%
Psychology         | 62.2%

[Bar graph: one bar per field; vertical axis "Percent", scaled from 10 to 70.]

No. A pie chart is used to display one variable with all of its categories totaling 100%.

How to make a dotplot
Highway mpg for some 2000 midsize cars
[Dotplot: horizontal axis "MPG", 21 to 32; vertical axis "Frequency or Count", 2 to 10.]

How to make and read a stemplot
• A stemplot is similar to a dotplot, but there are some format differences. Instead of dots, actual numbers are used. Instead of a horizontal axis, a vertical one is used.

Stems | Leaves
   52 | 3 6

Leaves are single digits only. This arrangement would be read as the numbers 523 and 526.

• With the following data, make a stemplot.

Stems | Leaves
    1 | 4 9
    2 | 2 5 6 6
    3 | 3 3 4 5 5 5 5 9
    4 | 0 2 7 7 8
    5 | 2

• Let's use the same stemplot, but now split the stems.

Stems | Leaves
    1 | 4 9
    2 | 2 5 6 6
    3 | 3 3 4 5 5 5 5 9
    4 | 0 2 7 7 8
    5 | 2

Split stems | Leaves
          1 | 4
          1 | 9
          2 | 2
          2 | 5 6 6
          3 | 3 3 4
          3 | 5 5 5 5 9
          4 | 0 2
          4 | 7 7 8
          5 | 2

Leaves: the first stem of each pair uses the digits 0-4, the second uses the digits 5-9.
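The regular and split-stem layouts can be sketched in code. This is a minimal illustration, not a general plotting tool: `data` holds the 20 two-digit values read back from the stems and leaves in this section, `stemplot` is a helper name introduced here, and stems that have no leaves are simply skipped.

```python
from collections import defaultdict

# The 20 two-digit values read back from the stemplot in this section.
data = [14, 19, 22, 25, 26, 26, 33, 33, 34, 35, 35, 35, 35, 39,
        40, 42, 47, 47, 48, 52]

def stemplot(values, split=False):
    """Return stemplot rows as strings; stem = tens digit, leaf = ones digit."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        # With split stems, leaves 0-4 go on the first row and 5-9 on the second.
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(leaf)
    # Rows with no leaves never get created, so empty stems are omitted.
    return [f"{stem} | {' '.join(map(str, leaves))}"
            for (stem, _), leaves in sorted(rows.items())]

print("\n".join(stemplot(data)))
print()
print("\n".join(stemplot(data, split=True)))
```

Splitting stems is mainly useful when most of the data would otherwise pile up on only a few stems.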

How to construct a histogram
• The most common graph of the distribution of one quantitative variable is a histogram.
• To make a histogram:
1. Divide the range into classes of equal width. Then count the number of observations that fall in each class.
2. Label and scale your axes and title your graph.
3. Draw bars that represent each count, with no space between bars.
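Step 1 (equal-width classes and counts) can be sketched as follows. The salary list is hypothetical, `histogram_counts` is a name introduced for this example, and the choice that each class includes its left endpoint but not its right one is an assumption; the slides do not say which endpoint belongs to a class.

```python
def histogram_counts(values, start, width, n_classes):
    """Count observations per class.

    Class k covers [start + k*width, start + (k+1)*width); the left-closed
    convention is an assumption made for this sketch.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical salaries in thousands of dollars (illustrative only).
salaries = [21, 150, 180, 220, 250, 290, 310, 340, 410, 862]
counts = histogram_counts(salaries, start=0, width=100, n_classes=9)

# A text version of the bars: adjacent classes, one '#' per observation.
for k, c in enumerate(counts):
    print(f"{k * 100:>3}-{(k + 1) * 100:<3} | {'#' * c}")
```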

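Divide range into equal widths and count

CEO salary (thousands of dollars) | Count
0 to 100    | 1
100 to 200  | 3
200 to 300  | 11
300 to 400  | 10
400 to 500  | 1
500 to 600  | 1
600 to 700  | 2
700 to 800  | 1
800 to 900  | 1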

Draw and label axis, then make bars
CEO Salary in thousands of dollars
[Histogram: horizontal axis "Thousand dollars", 100 to 900; vertical axis "Count", 1 to 11.]
• Shape: the graph is skewed right.
• Center: the median is the first value in the $300,000 to $400,000 range.
• Spread: the salaries range from $21,000 to $862,000.
• Outliers: there do not appear to be any outliers, but I would have to calculate to make sure.

New terms used when graphing data.
• Relative frequency:
  • Category count divided by the total count
  • Gives a percentage
• Cumulative frequency:
  • Sum of category counts up to and including the current category
• Ogives (pronounced O-Jive)
  • Cumulative frequencies divided by the total count
  • Relative cumulative frequency graph
• Percentile:
  • The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
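These terms can be computed from a list of class counts. The sketch below uses the nine CEO salary counts listed earlier in this section; pairing them, in order, with classes ending at 100, 200, ..., 900 thousand dollars is an assumption, and `percentile_class` is a helper introduced here that only locates the class in which a percentile falls (reading a value off an ogive would interpolate within that class).

```python
from itertools import accumulate

upper_edges = [100, 200, 300, 400, 500, 600, 700, 800, 900]  # thousands of dollars
counts = [1, 3, 11, 10, 1, 1, 2, 1, 1]                        # assumed class order

total = sum(counts)
relative = [c / total for c in counts]                  # relative frequencies
cumulative = list(accumulate(counts))                   # cumulative frequencies
relative_cumulative = [c / total for c in cumulative]   # heights of the ogive

def percentile_class(p):
    """Upper edge of the first class whose relative cumulative frequency reaches p percent."""
    for edge, rc in zip(upper_edges, relative_cumulative):
        if rc * 100 >= p:
            return edge
    return upper_edges[-1]

for edge, c, rc in zip(upper_edges, cumulative, relative_cumulative):
    print(f"up to {edge}: cumulative {c}, relative cumulative {rc:.1%}")
print("class containing the median ends at", percentile_class(50))
```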

Let's look at a table to see what an ogive would refer to.

The graph of an ogive for this data would
look like this.

Find the age at the 10th percentile, the median, and the 85th percentile.
• 10th percentile: 47
• Median: 55.5
• 85th percentile: 62.5

Last graph of this section
• Time plots:
  • Graph of each observation against the time at which it was measured.
  • Time is always on the x-axis.
  • Use time plots to analyze what is occurring over time.

Deaths from cancer per 100,000
[Time plot: horizontal axis "Year", 45 to 95; vertical axis "Deaths", 134 to 204.]

Section 1.1 & 1.2
• Homework: #’s Section 1-1: 2, 4, 6, 9, 38a, 48a&b; Section 1-2: 52, 56 (use a scale starting at 7 with a width of 0.5, make an ogive, and use it to estimate the value of the center and also the 90th percentile), 58; Section 2-1: 9, R1.1 & 6, R2.2
• Complete additional notes packet pg. 1-4.

Section 1.3: Describing Quantitative Data with Numbers.
• Center:
  • Mean
  • Median
  • Mode (only a measure of center for categorical data)
• Spread:
  • Range
  • Interquartile Range (IQR)
  • Variance
  • Standard Deviation

Measuring center:
• Mean:
  • Most common measure of center.
  • Is the arithmetic average.
  • Formula:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad \text{or} \quad \bar{x} = \frac{1}{n} \sum x_i \]
  • Not resistant to the influence of extreme observations.

Measuring center:
• Median
  • The midpoint of a distribution.
  • The number such that half the observations are smaller and the other half are larger.
  • If the number of observations n is odd, the median is the center of the ordered list.
  • If the number of observations n is even, the median M is the mean of the two center observations in the ordered list.
  • Is resistant to the influence of extreme observations.

Quick summary of measures of center.

Measure | Definition | Example using 1,2,3,3,4,5,5,9
Mean    | (sum of the data values) / (number of data values) | (1 + 2 + 3 + 3 + 4 + 5 + 5 + 9) / 8 = 4
Median  | Middle value for an odd # of data values; mean of the 2 middle values for an even # of data values | For 1,2,3,3,4,5,5,9, the middle values are 3 and 4. The median is (3 + 4) / 2 = 3.5
Mode    | The most frequently occurring value (categorical data only) | Two modes: 3 and 5. Set is bimodal.
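The three measures for 1,2,3,3,4,5,5,9 can be checked with Python's standard `statistics` module. This is only a quick verification sketch; `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

data = [1, 2, 3, 3, 4, 5, 5, 9]

center = {
    "mean": mean(data),        # sum of the values divided by how many there are
    "median": median(data),    # mean of the two middle values, since n is even
    "modes": multimode(data),  # every value tied for most frequent
}
for name, value in center.items():
    print(name, value)
```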

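Comparing the Mean and Median.
• The locations of the mean and median of a distribution are affected by the distribution’s shape.
  • Symmetric: the median and mean fall together at the center.
  • Skewed right: the median lies to the left of the mean (the mean is pulled toward the long right tail).
  • Skewed left: the mean lies to the left of the median (the mean is pulled toward the long left tail).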

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{86 + 84 + \cdots + 93}{14} = \frac{1190}{14} = 85 \]

\[ \bar{x}_{\text{new}} = 79.3 \qquad \bar{x}_{\text{old}} = 85 \]
Since zero is an outlier, it affects the mean, because the mean is not a resistant measure of the center of the data.
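The drop from 85 to 79.3 can be reproduced from the totals alone. The sketch assumes the new mean comes from adding a single score of 0 to the 14 scores that sum to 1190; the individual scores are not all listed in the slides, so only the sum and the count are used.

```python
n_old = 14
total_old = 1190                 # sum of the 14 scores in the worked example

mean_old = total_old / n_old     # mean before the extra score

# Assumption: one additional score of 0 joins the data set.
n_new = n_old + 1
total_new = total_old + 0
mean_new = total_new / n_new     # the sum is unchanged, but n grows by one

print(f"old mean = {mean_old:.1f}, new mean = {mean_new:.1f}")
```

One extreme value moves the mean by almost six points, which is what "not resistant" means in practice.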

\[ \bar{x} = \frac{1}{n} \sum x_i \]
\[ \$1{,}200{,}000 = \frac{1}{25} \, \text{SUM} \]
\[ \$1{,}200{,}000 \times 25 = \text{SUM} \]
\[ \$30 \text{ million} = \text{SUM} \]

Measuring spread or variability:
• Range
  • Difference between largest and smallest points.
  • Not resistant to the influence of extreme observations.
• Interquartile Range (IQR)
  • Measures the spread of the middle half of the data.
  • Is resistant to the influence of extreme observations.
  • Quartile 3 minus Quartile 1.

To calculate quartiles:
1. Arrange the observations in increasing order and locate
the median M.
2. The first quartile Q1 is the median of the observations
whose position in the ordered list is to the left of the
overall median.
3. The third quartile Q3 is the median of the observations
whose position in the ordered list is to the right of the
overall median.
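The three steps above can be sketched as a small function. `quartiles` is a name introduced here, and the rule that the overall median is left out of both halves when n is odd is the usual reading of "to the left of" and "to the right of", stated here as an assumption. Note that software packages often use other quartile rules and may give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule.

    Assumes at least two observations. When n is odd, the overall median
    is excluded from both halves.
    """
    xs = sorted(values)                            # step 1: order the data
    n = len(xs)
    half = n // 2
    lower = xs[:half]                              # positions left of the median
    upper = xs[half + 1:] if n % 2 else xs[half:]  # positions right of the median
    return median(lower), median(xs), median(upper)

q1, m, q3 = quartiles([5, 6, 7, 8, 10, 12])
print(q1, m, q3, "IQR =", q3 - q1)
```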

The five number summary and box plots.
• The five number summary
  • Consists of the min, Q1, median, Q3, max
  • Offers a reasonably complete description of center and spread.
  • Used to create a boxplot.
• Boxplot
  • Shows less detail than histograms or stemplots.
  • Best used for side-by-side comparison of more than one distribution.
  • Gives a good indication of symmetry or skewness of a distribution.
  • Regular boxplots conceal outliers.
  • Modified boxplots plot outliers as isolated points.

• Start by finding the 5-number summary for each of the groups.
• Use your calculator: put the two lists into their own columns, then use the 1-Var Stats function.

       | Min | Q1  | M     | Q3  | Max
Women: | 101 | 126 | 138.5 | 154 | 200
Men:   | 70  | 98  | 114.5 | 143 | 187

How to construct a side-by-side boxplot
[Side-by-side boxplots: SSHA scores for first-year college students, one box each for Women and Men; horizontal axis "Scores", 70 to 200.]

Calculating outliers
• Outlier
  • An observation that falls outside the overall pattern of the data.
  • Calculated by using the IQR.
  • Anything smaller than \( Q_1 - 1.5 \times IQR \) or larger than \( Q_3 + 1.5 \times IQR \) is an outlier.
[Diagram: boxplot marked Min, Q1, Median, Q3, Max, with the cutoffs \( Q_1 - 1.5 \times IQR \) and \( Q_3 + 1.5 \times IQR \) on either side.]

Constructing a modified boxplot

       | Min | Q1  | M     | Q3  | Max
Women: | 101 | 126 | 138.5 | 154 | 200

\[ IQR = 28 \]
\[ Q_1 - 1.5 \times IQR = 126 - 1.5 \times 28 = 84 \]
\[ Q_3 + 1.5 \times IQR = 154 + 1.5 \times 28 = 196 \]
• Lower bound for outliers: 84
• Upper bound for outliers: 196
[Modified boxplot: SSHA scores for first-year college students, Women; horizontal axis "Scores", 70 to 200, with the cutoffs \( Q_1 - 1.5 \times IQR \) and \( Q_3 + 1.5 \times IQR \) marked.]
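The 1.5 × IQR rule can be applied to both five-number summaries from this section. `outlier_fences` is a helper name introduced for this sketch; with only the summaries available, it can check the minimum and maximum but cannot list every outlier in the raw data.

```python
def outlier_fences(q1, q3):
    """Return (lower, upper) cutoffs: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

summaries = {
    "Women": {"min": 101, "q1": 126, "median": 138.5, "q3": 154, "max": 200},
    "Men": {"min": 70, "q1": 98, "median": 114.5, "q3": 143, "max": 187},
}

for group, s in summaries.items():
    low, high = outlier_fences(s["q1"], s["q3"])
    flagged = [v for v in (s["min"], s["max"]) if v < low or v > high]
    print(f"{group}: cutoffs {low} and {high}, flagged extremes: {flagged}")
```

For the women, the maximum of 200 lies above the upper cutoff of 196, so a modified boxplot would draw it as an isolated point.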

Section 1.3 Day 1
• Homework: #’s 84, 86, 88, 91, 92

Measuring Spread:
• Variance (\( s^2 \))
  • The average of the squares of the deviations of the observations from their mean.
  • In symbols, the variance of n observations \( x_1, x_2, \ldots, x_n \) is
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \]
or
\[ s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2 \]
• Standard deviation (s)
  • The square root of variance.
\[ s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2} \]



How to find the mean and standard deviation from their definitions.
• With the list of numbers below, calculate the standard deviation.
  • 5, 6, 7, 8, 10, 12
\[ \bar{x} = \frac{5 + 6 + 7 + 8 + 10 + 12}{6} = 8 \]
\[ s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2} \]
\[ s = \sqrt{\frac{(5 - 8)^2 + (6 - 8)^2 + (7 - 8)^2 + (8 - 8)^2 + (10 - 8)^2 + (12 - 8)^2}{6 - 1}} \]
\[ s = \sqrt{\frac{(-3)^2 + (-2)^2 + (-1)^2 + 0^2 + 2^2 + 4^2}{5}} \]
\[ s = \sqrt{\frac{9 + 4 + 1 + 0 + 4 + 16}{5}} = \sqrt{\frac{34}{5}} = \sqrt{6.8} \approx 2.61 \]



Properties of Variance:
• Uses squared deviations from the mean because the sum of all the deviations not squared is always zero.
• Has square units.
• Found by taking an average but dividing by n-1.
  • The sum of the deviations is always zero, so the last deviation can be found once the other n-1 deviations are known.
  • This means only n-1 of the squared deviations can vary freely, so the average is found by dividing by n-1.
  • n-1 is called the degrees of freedom.
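Both the definition of s and the "deviations sum to zero" property can be checked on the list 5, 6, 7, 8, 10, 12 from the worked example. This is a verification sketch; the variable names are introduced here, and the comparison with `statistics.stdev` relies on that function also dividing by n-1.

```python
from math import sqrt, isclose
from statistics import stdev, variance

data = [5, 6, 7, 8, 10, 12]
n = len(data)

x_bar = sum(data) / n                                   # mean of the data
deviations = [x - x_bar for x in data]                  # x_i minus the mean
s_squared = sum(d ** 2 for d in deviations) / (n - 1)   # divide by n-1, not n
s = sqrt(s_squared)

# The deviations always sum to zero, so the last one is fixed by the others.
last_from_others = -sum(deviations[:-1])

print("deviations:", deviations, "sum:", sum(deviations))
print("last deviation recovered from the first n-1:", last_from_others)
print(f"variance = {s_squared:.1f}, standard deviation = {s:.2f}")
print("matches statistics.stdev:", isclose(s, stdev(data)))
```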

Properties of Standard Deviation
• Measures the spread about the mean and should be used only when the mean is chosen as the measure of center.
• Equals zero when there is no spread, which happens when all observations are the same value. Otherwise it is always positive.
• Not resistant to the influence of extreme observations or strong skewness.

Mean & Standard Deviation Vs. Median & the 5-Number Summary
• Mean & Standard Deviation
  • Most common numerical description of a distribution.
  • Used for reasonably symmetric distributions that are free from outliers.
• Five-Number Summary
  • Offers a reasonably complete description of center and spread.
  • Used for describing skewed distributions or a distribution with strong outliers.

Always plot your data.
• Graphs
  • Give the best overall picture of a distribution.
• Numerical measures of center and spread
  • Only give specific facts about a distribution.
  • Do not describe its entire shape.
  • Can give a misleading picture of a distribution or the comparison of two or more distributions.

Changing the unit of measurement.
• Linear Transformations
  • Change the original variable x into the new variable xnew.
  • xnew = a + bx
  • Do not change the shape of a distribution.
  • Can change the center, the spread, or both.
  • The effects of the changes follow a simple pattern.
    • Adding the constant (a) shifts all values of x upward or downward by the same amount.
      • Adds (a) to the measures of center and to the quartiles but does not change measures of spread.
    • Multiplying by the positive constant (b) changes the size of the unit of measurement.
      • Multiplies both the measures of center (mean and median) and the measures of spread (standard deviation and IQR) by (b).

The table shows an original data set and two different linear transformations for that set.

Original (x) | x + 12 | 3(x) - 7
5            | 17     | 8
6            | 18     | 11
7            | 19     | 14
8            | 20     | 17
10           | 22     | 23
12           | 24     | 29

What are the original and transformed mean, median, range, quartiles, IQR, variance and standard deviation?

Measure  | Original Data | x + 12 | 3(x) - 7
Mean     | 8             | 20     | 17
Median   | 7.5           | 19.5   | 15.5
Q1       | 6             | 18     | 11
Q3       | 10            | 22     | 23
IQR      | 4             | 4      | 12
Range    | 7             | 7      | 21
Variance | 6.8           | 6.8    | 61.2
St Dev   | 2.61          | 2.61   | 7.82
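The pattern for xnew = a + bx can be checked on the same data. This is a verification sketch with helper names (`transform`, `summary`) introduced here: adding a shifts the measures of center and leaves spread alone, while multiplying by a positive b scales center and spread by b and scales the variance by b squared.

```python
from math import isclose
from statistics import mean, median, stdev, variance

x = [5, 6, 7, 8, 10, 12]

def transform(values, a, b):
    """Apply the linear transformation xnew = a + b*x to every value."""
    return [a + b * v for v in values]

def summary(values):
    """Collect a few measures of center and spread for one data set."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "variance": variance(values),
        "st_dev": stdev(values),
    }

original = summary(x)
shifted = summary(transform(x, a=12, b=1))   # x + 12
scaled = summary(transform(x, a=-7, b=3))    # 3x - 7

for name, s in [("x", original), ("x + 12", shifted), ("3x - 7", scaled)]:
    print(name, {k: round(v, 2) for k, v in s.items()})
```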

Section 1.3 & 2.1 Day 2
• Homework: #’s Section 1-3: 97, 98, 103; Section 2-1: 19, 20, 22, R.2.3
