2. Introduction
Statistics:
the science of data. We begin our study of statistics by
mastering the art of examining data. Any set of data contains
information about some group of individuals. The
information is organized in variables.
Individuals:
The objects described by a set of data. Individuals may be
people, but they may also be other things.
Variable:
Any characteristic of an individual.
Can take different values for different individuals.
3. Variable Types
Categorical variable:
places an individual into one of several groups or categories.
Quantitative variable:
takes numerical values for which arithmetic operations such as
adding and averaging make sense.
Distribution:
pattern of variation of a variable
tells what values the variable takes and how often it takes these
values.
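As a minimal sketch of what a distribution records, the snippet below tallies how often each value of a categorical variable occurs. The transmission data and the helper name `distribution` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical transmission types for six cars (illustrative data only).
transmission = ["manual", "automatic", "automatic",
                "manual", "automatic", "automatic"]

def distribution(values):
    """Return {value: (count, share of all observations)}."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, count / total) for value, count in counts.items()}

dist = distribution(transmission)
for value, (count, share) in sorted(dist.items()):
    print(f"{value}: count={count}, share={share:.2f}")
```

The counts say which values the variable takes; the shares say how often it takes them.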
5. 5
A. The individuals are the BMW 318I, the Buick
Century, and the Chevrolet Blazer.
B. The variables given are
Vehicle type (categorical)
Transmission type (categorical)
Number of cylinders (quantitative)
City MPG (quantitative)
Highway MPG (quantitative)
6. 1.1 & 1.2: Displaying Distributions with graphs
• Graphs used to display data:
• bar graphs, pie charts, dot plots, stem plots, histograms, and
time plots
• Purpose of a graph:
• Helps to understand the data.
• Allows overall patterns and striking deviations from that pattern
to be seen.
• Describing the overall pattern:
• Three biggest descriptors:
• shape, center and spread.
• Next look for outliers and clusters.
7. Shape
Concentrate on main features.
Major peaks, outliers (not just the smallest and largest
observations), rough symmetry or clear skewness.
Types of shapes: symmetric, skewed right, skewed left.
9. 1.5 How to make a bar graph.
Percent of females among people earning doctorates in 1994
(bar graph; vertical axis: percent, scaled from 10 to 70)
Computer science     15.4%
Education            60.8%
Engineering          11.1%
Life sciences        40.7%
Physical sciences    21.7%
Psychology           62.2%
10. 10
No. A pie chart is used to display one variable,
with all of its categories totaling 100%.
11. How to make a dotplot
[Dotplot: Highway mpg for some 2000 midsize cars. Horizontal axis: MPG, 21 to 32; vertical axis: frequency or count, 2 to 10.]
12. How to make and read a stemplot
A stemplot is similar to a dotplot but there are some format
differences. Instead of dots actual numbers are used.
Instead of a horizontal axis, a vertical one is used.
Stems | Leaves
   52 | 3 6
Leaves are single digits only. This arrangement would be read as the
numbers 523 and 526.
13. How to make and read a stemplot
With the following data, make a stemplot.
Stems | Leaves
    1 | 4 9
    2 | 2 5 6 6
    3 | 3 3 4 5 5 5 5 9
    4 | 0 2 7 7 8
    5 | 2
14. How to make and read a stemplot
Let's use the same stemplot but now split the stems.
Original:
Stems | Leaves
    1 | 4 9
    2 | 2 5 6 6
    3 | 3 3 4 5 5 5 5 9
    4 | 0 2 7 7 8
    5 | 2
Split stems:
Stems | Leaves
    1 | 4
    1 | 9
    2 | 2
    2 | 5 6 6
    3 | 3 3 4
    3 | 5 5 5 5 9
    4 | 0 2
    4 | 7 7 8
    5 | 2
With split stems, the first copy of each stem holds leaves 0-4 and the
second holds leaves 5-9.
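The construction can be sketched in a few lines of Python. This is an illustrative helper, not part of the original slides: the function name `stemplot` is made up, it assumes non-negative integers whose last digit is the leaf, and the data values are read back off the stemplot shown on the slide.

```python
def stemplot(data, split=False):
    """Return stemplot rows; the stem is the value without its last digit."""
    rows = {}
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        # With split stems, leaves 0-4 go on the first row, 5-9 on the second.
        half = 1 if (split and leaf >= 5) else 0
        rows.setdefault((stem, half), []).append(str(leaf))
    halves = (0, 1) if split else (0,)
    first, last = min(rows), max(rows)
    # List every row between the first and last non-empty ones, so gaps show.
    keys = [(stem, half)
            for stem in range(first[0], last[0] + 1)
            for half in halves
            if first <= (stem, half) <= last]
    return [f"{stem} | {' '.join(rows.get((stem, half), []))}".rstrip()
            for stem, half in keys]

# Values recovered from the stemplot on the slide.
scores = [14, 19, 22, 25, 26, 26, 33, 33, 34, 35, 35, 35, 35, 39,
          40, 42, 47, 47, 48, 52]

print("\n".join(stemplot(scores)))
print()
print("\n".join(stemplot(scores, split=True)))
```

Empty stems inside the range are kept on purpose, since a gap in a stemplot is part of the shape of the distribution.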
15. How to construct a histogram
The most common graph of the distribution of one
quantitative variable is a histogram.
To make a histogram:
1. Divide the range into classes of equal width. Then count the number
of observations that fall in each class.
2. Label and scale your axes and title your graph.
3. Draw bars that represent each count, no space between bars.
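Step 1 can be sketched as follows. The helper `class_counts` and the data are hypothetical, and the sketch assumes each class includes its lower boundary but not its upper one; other boundary conventions are equally valid as long as they are applied consistently.

```python
def class_counts(data, start, width, n_classes):
    """Count observations in classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical observations, for illustration only.
values = [21, 35, 48, 52, 55, 61, 67, 70, 74, 88]
counts = class_counts(values, start=20, width=20, n_classes=4)

# A rough text histogram: one '#' per observation in the class.
for k, count in enumerate(counts):
    low = 20 + 20 * k
    print(f"{low:>3} to under {low + 20:<3} {'#' * count}")
```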
17. Divide range into equal widths and count
Scale (CEO salary, thousands of dollars)    Counts
  0 < CEO Salary < 100                      1
100 < CEO Salary < 200                      3
200 < CEO Salary < 300                      11
300 < CEO Salary < 400                      10
400 < CEO Salary < 500                      1
500 < CEO Salary < 600                      1
600 < CEO Salary < 700                      2
700 < CEO Salary < 800                      1
800 < CEO Salary < 900                      1
18. Draw and label axis, then make bars
[Histogram: CEO Salary in thousands of dollars. Horizontal axis: thousand dollars, 100 to 900; vertical axis: count, 1 to 11.]
Shape: the graph is skewed right.
Center: the median is the first value in the $300,000 to $400,000 range.
Spread: the salaries range from $21,000 to $862,000.
Outliers: there do not appear to be any outliers, but I would have to
calculate to make sure.
19. New terms used when graphing data.
Relative frequency:
Category count divided by the total count
Gives a percentage
Cumulative frequency:
Sum of category counts up to and including the current category
Ogives (pronounced O-Jive)
Cumulative frequencies divided by the total count
Relative cumulative frequency graph
Percentile:
The pth percentile of a distribution is the value such that p
percent of the observations fall at or below it.
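These definitions can be tried on the class counts from the CEO salary example. The sketch below is only illustrative: the variable names and the helper `percentile_class` are made up, and because only grouped counts are available it reports the class that contains a percentile rather than an exact value.

```python
from itertools import accumulate

# Class counts from the CEO salary example (classes 100 wide, thousands of dollars).
upper_bounds = [100, 200, 300, 400, 500, 600, 700, 800, 900]
counts = [1, 3, 11, 10, 1, 1, 2, 1, 1]

total = sum(counts)
relative = [c / total for c in counts]                  # relative frequency
cumulative = list(accumulate(counts))                   # cumulative frequency
relative_cumulative = [c / total for c in cumulative]   # the heights of an ogive

def percentile_class(p):
    """Upper bound of the first class whose relative cumulative frequency reaches p percent."""
    for bound, share in zip(upper_bounds, relative_cumulative):
        if share >= p / 100:
            return bound
    return upper_bounds[-1]

for bound, c, cum, rel_cum in zip(upper_bounds, counts, cumulative, relative_cumulative):
    print(f"up to {bound}: count={c}, cumulative={cum}, relative cumulative={rel_cum:.3f}")
print("class containing the median ends at", percentile_class(50))
```

With 31 salaries, the relative cumulative frequency first passes 50% in the class ending at 400, which agrees with the median falling in the $300,000 to $400,000 range.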
20. Let's look at a table to see what an ogive
would refer to.
21. The graph of an ogive for this data would
look like this.
22. 22
Find the age at the 10th percentile, the median, and the 85th percentile.
10th percentile: 47
Median: 55.5
85th percentile: 62.5
23. Last graph of this section
Time plots:
Graph of each observation against the time at which it was
measured.
Time is always on the x-axis.
Use time plots to analyze what is occurring over time.
24. 24
[Time plot: Deaths from cancer per 100,000. Horizontal axis: year, 45 to 95; vertical axis: deaths, 134 to 204.]
25. Section 1.1 & 1.2
Homework:
Section 1-1: #2, 4, 6, 9, 38a, 48a&b
Section 1-2: #52, 56 (use a scale starting at 7 with a width of 0.5, make an
ogive, and use it to estimate the value of the center and also the 90th
percentile), 58
Section 2-1: #9, R1.1 & 6, R2.2
Complete additional notes packet pg. 1-4.
26. Section 1.3: Describing Quantitative
Data with Numbers.
Center:
Mean
Median
Mode – (only a measure of center for categorical data)
Spread:
Range
Interquartile Range (IQR)
Variance
Standard Deviation
27. Measuring center:
Mean:
Most common measure of center.
Is the arithmetic average.
Formula:
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
or
$\bar{x} = \dfrac{1}{n}\sum x_i$
Not resistant to the influence of extreme observations.
28. Measuring center:
Median
The midpoint of a distribution
The number such that half the observations are smaller and
the other half are larger.
If the number of observations n is odd, the median is the
center of the ordered list.
If the number of observations n is even, the median M is
the mean of the two center observations in the ordered
list.
Is resistant to the influence of extreme observations.
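A short sketch of the odd/even rule, and of what "resistant" means in practice. The two small data sets are hypothetical; the second differs from the first only in its largest value.

```python
def mean(values):
    """Arithmetic average."""
    return sum(values) / len(values)

def median(values):
    """Center of the ordered list (odd n) or mean of the two center values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [1, 2, 3, 4, 5]            # hypothetical observations
with_extreme = [1, 2, 3, 4, 50]   # same list, largest value made extreme

print(mean(data), median(data))                   # both measures agree here
print(mean(with_extreme), median(with_extreme))   # only the mean moves
```

Replacing one observation with an extreme value drags the mean from 3 to 12, while the median stays at 3.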
29. Quick summary of measures of center.
Measure | Definition | Example using 1, 2, 3, 3, 4, 5, 5, 9
Mean | (sum of the data values) / (number of data values) | (1 + 2 + 3 + 3 + 4 + 5 + 5 + 9) / 8 = 4
Median | Middle value for an odd # of data values; mean of the 2 middle values for an even # of data values | The middle values are 3 and 4, so the median is (3 + 4) / 2 = 3.5
Mode | The most frequently occurring value (categorical data only) | Two modes: 3 and 5. The set is bimodal.
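The three values for the example data set 1, 2, 3, 3, 4, 5, 5, 9 can be checked with Python's standard `statistics` module. This is just one convenient way to verify the arithmetic; `multimode` needs Python 3.8 or later.

```python
import statistics

data = [1, 2, 3, 3, 4, 5, 5, 9]

mean_value = statistics.mean(data)      # sum of the values divided by their number
median_value = statistics.median(data)  # mean of the two middle values, since n is even
modes = statistics.multimode(data)      # every value tied for most frequent

print("mean:", mean_value)
print("median:", median_value)
print("modes:", modes)
```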
30. Comparing the Mean and Median.
The locations of the mean and median of a distribution are
affected by the distribution's shape.
Symmetric: the median and mean are at the same place.
Skewed right: the median lies to the left of the mean.
Skewed left: the mean lies to the left of the median.
31. 31
$\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{86 + 84 + \cdots + 93}{14} = \dfrac{1190}{14} = 85$
36. Measuring spread or variability:
Range
Difference between largest and smallest points.
Not resistant to the influence of extreme observations.
Interquartile Range (IQR)
Measures the spread of the middle half of the data.
Is resistant to the influence of extreme observations.
Quartile 3 minus Quartile 1.
37. To calculate quartiles:
1. Arrange the observations in increasing order and locate
the median M.
2. The first quartile Q1 is the median of the observations
whose position in the ordered list is to the left of the
overall median.
3. The third quartile Q3 is the median of the observations
whose position in the ordered list is to the right of the
overall median.
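The three steps above can be sketched as follows. The function names are illustrative, and the sketch follows this slide's rule of leaving the overall median out of both halves; calculators and software packages sometimes use other quartile conventions and can give slightly different answers.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the halves on either side of the overall median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # observations left of the overall median
    upper = ordered[n - half:]    # observations right of the overall median
    return median(lower), median(ordered), median(upper)

data = [5, 6, 7, 8, 10, 12]
q1, m, q3 = quartiles(data)
print("Q1 =", q1, "M =", m, "Q3 =", q3, "IQR =", q3 - q1)
```

For 5, 6, 7, 8, 10, 12 the lower half is 5, 6, 7 and the upper half is 8, 10, 12, so Q1 = 6, Q3 = 10, and the IQR is 4.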
38. The five number summary and box plots.
The five number summary
Consists of the
min, Q1, median, Q3, max
Offers a reasonably complete description of center and spread.
Used to create a boxplot.
Boxplot
Shows less detail than histograms or stemplots.
Best used for side-by-side comparison of more than one
distribution.
Gives a good indication of symmetry or skewness of a
distribution.
Regular boxplots conceal outliers.
Modified boxplots put outliers as isolated points.
39. 39
• Start by finding the 5 number summary for each of the groups.
• Use your calculator and put the two lists into their own column,
then use the 1-var Stats function.
Min Q1 M Q3 Max
Women: 101 126 138.5 154 200
Men: 70 98 114.5 143 187
40. How to construct a side-by-side boxplot
[Side-by-side boxplots: SSHA Scores for first year college students, one boxplot for Women and one for Men. Axis: Scores, 70 to 200.]
41. Calculating outliers
Outlier
An observation that falls outside the overall pattern of the data.
Calculated by using the IQR.
Anything smaller than $Q_1 - 1.5 \times IQR$ or larger than $Q_3 + 1.5 \times IQR$
is an outlier.
[Boxplot diagram: Min, Q1, Median, Q3, Max, with the cutoffs $Q_1 - 1.5 \times IQR$ and $Q_3 + 1.5 \times IQR$ marked.]
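As a hedged illustration, the 1.5 × IQR rule can be applied to the five-number summaries given earlier for the SSHA scores. Only the minimum and maximum can be tested from a summary, so this sketch shows whether at least one outlier exists on each side, not how many there are. The helper name `outlier_cutoffs` is made up.

```python
def outlier_cutoffs(q1, q3):
    """Return (lower cutoff, upper cutoff) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Five-number summaries (Min, Q1, M, Q3, Max) from the SSHA example.
summaries = {
    "Women": (101, 126, 138.5, 154, 200),
    "Men": (70, 98, 114.5, 143, 187),
}

flags = {}
for group, (minimum, q1, med, q3, maximum) in summaries.items():
    low, high = outlier_cutoffs(q1, q3)
    flags[group] = (minimum < low, maximum > high)
    print(f"{group}: cutoffs {low} and {high}, "
          f"low outlier: {minimum < low}, high outlier: {maximum > high}")
```

For the women, the IQR is 28, so the upper cutoff is 154 + 42 = 196 and the maximum of 200 is an outlier. For the men, the cutoffs are 30.5 and 210.5, so neither extreme is an outlier.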
45. Measuring Spread:
Variance ($s^2$)
The average of the squares of the deviations of the observations
from their mean.
In symbols, the variance of n observations x1, x2, …, xn is
$s^2 = \dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
or
$s^2 = \dfrac{1}{n - 1}\sum (x_i - \bar{x})^2$
Standard deviation (s)
The square root of variance.
$s = \sqrt{\dfrac{1}{n - 1}\sum (x_i - \bar{x})^2}$
46. How to find the mean and standard
deviation from their definitions.
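With the list of numbers below, calculate the standard
deviation.
5, 6, 7, 8, 10, 12
$\bar{x} = \dfrac{5 + 6 + 7 + 8 + 10 + 12}{6} = 8$
$s = \sqrt{\dfrac{1}{n - 1}\sum (x_i - \bar{x})^2} = \sqrt{\dfrac{(5-8)^2 + (6-8)^2 + (7-8)^2 + (8-8)^2 + (10-8)^2 + (12-8)^2}{6 - 1}}$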
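The same calculation can be carried through step by step in Python. This is a sketch that follows the definitions directly and then compares the result with `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

data = [5, 6, 7, 8, 10, 12]
n = len(data)

x_bar = sum(data) / n                                   # mean
squared_deviations = [(x - x_bar) ** 2 for x in data]   # (x_i - x_bar)^2 for each value
variance = sum(squared_deviations) / (n - 1)            # divide by n - 1, not n
s = math.sqrt(variance)                                 # standard deviation

print("mean:", x_bar)
print("squared deviations:", squared_deviations)
print("variance:", variance)
print("standard deviation:", round(s, 3))
print("statistics.stdev:", round(statistics.stdev(data), 3))
```

The squared deviations are 9, 4, 1, 0, 4, and 16, which sum to 34, so the variance is 34 / 5 = 6.8 and the standard deviation is about 2.61.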
48. Properties of Variance:
Uses squared deviations from the mean because the sum
of all the deviations not squared is always zero.
Has square units.
Found by taking an average but dividing by n-1.
The sum of the deviations is always zero, so the last
deviation can be found once the other n-1 deviations are
known.
This means only n-1 of the squared deviations can vary freely, so
the average is found by dividing by n-1.
n-1 is called the degrees of freedom.
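A short derivation shows why the deviations always sum to zero. It uses only the definition $\bar{x} = \frac{1}{n}\sum x_i$, which gives $\sum x_i = n\bar{x}$:

$$\sum (x_i - \bar{x}) = \sum x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$

For example, for the data 5, 6, 7, 8, 10, 12 with mean 8, the deviations are -3, -2, -1, 0, 2, and 4, which add to 0. Once the first five are known, the sixth must be 4.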
49. Properties of Standard Deviation
Measures the spread about the mean and should be used
only when the mean is chosen as the measure of center.
Equals zero when there is no spread, happens when all
observations are the same value. Otherwise it is always
positive.
Not resistant to the influence of extreme observations
or strong skewness.
50. Mean & Standard Deviation
Vs.
Median & the 5-Number Summary
Mean & Standard Deviation
Most common numerical description of a distribution.
Used for reasonably symmetric distributions that are free from
outliers.
Five-Number Summary
Offers a reasonably complete description of center and spread.
Used for describing skewed distributions or a distribution with
strong outliers.
51. Always plot your data.
Graphs
Give the best overall picture of a distribution.
Numerical measures of center and spread
Only give specific facts about a distribution.
Do not describe its entire shape.
Can give a misleading picture of a distribution or the
comparison of two or more distributions.
52. Changing the unit of measurement.
Linear Transformations
Changes the original variable x into the new variable xnew.
xnew = a + bx
Do not change the shape of a distribution.
Can change the center, the spread, or both.
The effects of the changes follow a simple pattern.
Adding the constant (a) shifts all values of x upward or downward by
the same amount.
Adds (a) to the measures of center and to the quartiles but does not change
measures of spread.
Multiplying by the positive constant (b) changes the size of the unit of
measurement.
Multiplies both the measures of center (mean and median) and the measures of
spread (standard deviation and IQR) by (b).
53. The table shows an original data set and two different
linear transformations for that set.
Original (x) x + 12 3(x) - 7
5 17 8
6 18 11
7 19 14
8 20 17
10 22 23
12 24 29
What are the original and transformed mean, median,
range, quartiles, IQR, variance and standard deviation?
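One way to check answers to this question is to compute every summary for all three columns. The sketch below is illustrative only: the helper names are made up, and the quartiles use the halves on either side of the overall median, as described earlier in these notes.

```python
import statistics

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (overall median left out)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[len(ordered) - half:]
    return statistics.median(lower), statistics.median(upper)

def summarize(values):
    """Collect the summaries asked for in the question."""
    q1, q3 = quartiles(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "Q1": q1,
        "Q3": q3,
        "IQR": q3 - q1,
        "variance": statistics.variance(values),  # divides by n - 1
        "sd": statistics.stdev(values),
    }

original = [5, 6, 7, 8, 10, 12]
columns = {
    "x": original,
    "x + 12": [x + 12 for x in original],
    "3x - 7": [3 * x - 7 for x in original],
}

summaries = {name: summarize(values) for name, values in columns.items()}
for name, stats in summaries.items():
    print(name, stats)
```

Adding 12 moves the mean, median, and quartiles up by 12 and leaves the range, IQR, variance, and standard deviation unchanged. Multiplying by 3 triples the range, IQR, and standard deviation; the variance, being in squared units, is multiplied by 9.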