LEARNING OBJECTIVES
When you have completed this chapter, you will be able to:
LO4-1 Construct and interpret a dot plot.
LO4-2 Construct and describe a stem-and-leaf display.
LO4-3 Identify and compute measures of position.
LO4-4 Construct and analyze a box plot.
LO4-5 Compute and interpret the coefficient of skewness.
LO4-6 Create and interpret a scatter diagram.
LO4-7 Develop and explain a contingency table.
MCGIVERN JEWELERS recently posted an advertisement on a
social media site reporting
the shape, size, price, and cut grade for 33 of its diamonds in
stock. Develop a box plot of
the variable price and comment on the result. (See Exercise 37
and LO4-4.)
CHAPTER 4
Describing Data:
DISPLAYING AND EXPLORING DATA
INTRODUCTION
Chapter 2 began our study of descriptive statistics. To transform raw or ungrouped
data into a meaningful form, we organize the data into a frequency distribution.
We present the frequency distribution in graphic form as a histogram or a frequency
polygon. This allows us to visualize where the data tend to cluster, the largest and
the smallest values, and the general shape of the data.
In Chapter 3, we first computed several measures of location, such as the mean,
median, and mode. These measures of location allow us to report a typical value in
the set of observations. We also computed several measures of dispersion, such as the
range, variance, and standard deviation. These measures of dispersion allow us to
describe the variation or the spread in a set of observations.
We continue our study of descriptive statistics in this chapter. We study (1) dot plots,
(2) stem-and-leaf displays, (3) percentiles, and (4) box plots. These charts and
statistics give us additional insight into where the values are concentrated as well as
the general shape of the data. Then we consider bivariate data. In bivariate data, we
observe two variables for each individual or observation. Examples include the number
of hours a student studied and the points earned on an examination; whether a sampled
product meets quality specifications and the shift on which it is manufactured; or the
amount of electricity used in a month by a homeowner and the mean daily high
temperature in the region for the month. These charts and graphs provide useful
insights as we use business analytics to enhance our understanding of data.
DOT PLOTS
Recall that for the Applewood Auto Group data, we summarized the profit earned on the
180 vehicles sold with a frequency distribution using eight classes. When we organized
the data into the eight classes, we lost the exact value of the observations. A dot
plot, on the other hand, groups the data as little as possible, and we do not lose the
identity of an individual observation. To develop a dot plot, we display a dot for each
observation along a horizontal number line indicating the possible values of the data.
If there are identical observations or the observations are too close to be shown
individually, the dots are “piled” on top of each other. This allows us to see the
shape of the distribution, the value about which the data tend to cluster, and the
largest and smallest observations. Dot plots are most useful for smaller data sets,
whereas histograms tend to be most useful for large data sets. An example will show
how to construct and interpret dot plots.
LO4-1
Construct and interpret a
dot plot.
E X A M P L E
The service departments at Tionesta Ford Lincoln and Sheffield
Motors Inc., two
of the four Applewood Auto Group dealerships, were both open
24 days last
month. Listed below is the number of vehicles serviced last
month at the two
dealerships. Construct dot plots and report summary statistics to
compare the
two dealerships.
Tionesta Ford Lincoln
Monday Tuesday Wednesday Thursday Friday Saturday
23 33 27 28 39 26
30 32 28 33 35 32
29 25 36 31 32 27
35 32 35 37 36 30
Sheffield Motors Inc.
Monday Tuesday Wednesday Thursday Friday Saturday
31 35 44 36 34 37
30 37 43 31 40 31
32 44 36 34 43 36
26 38 37 30 42 33
S O L U T I O N
The Minitab system provides a dot plot and outputs the mean,
median, maximum,
and minimum values, and the standard deviation for the number
of cars serviced
at each dealership over the last 24 working days.
The dot plots, shown in the center of the output, graphically illustrate the
distributions for each dealership. The plots show the difference in the location and
dispersion of the observations. By looking at the dot plots, we can see that the
number of vehicles serviced at the Sheffield dealership is more widely dispersed and
has a larger mean than at the Tionesta dealership. Several other features of the
number of vehicles serviced are:
• Tionesta serviced the fewest cars in any day, 23.
• Sheffield serviced 26 cars during its slowest day, which is 4 fewer cars than the
next lowest day.
• Tionesta serviced exactly 32 cars on four different days.
• The numbers of cars serviced cluster around 36 for Sheffield and 32 for Tionesta.
From the descriptive statistics, we see Sheffield serviced a mean of 35.83 vehicles
per day. Tionesta serviced a mean of 31.292 vehicles per day during the same period.
So Sheffield typically services 4.54 more vehicles per day. There is also more
dispersion, or variation, in the daily number of vehicles serviced at Sheffield than
at Tionesta. How do we know this? The standard deviation is larger at Sheffield
(4.96 vehicles per day) than at Tionesta (4.112 vehicles per day).
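The summary statistics quoted above can be reproduced without Minitab. The following is a minimal sketch in Python using only the standard library; the helper names `text_dot_plot` and `summarize` are our own, and the text-based plot is only a rough stand-in for the Minitab chart.

```python
from collections import Counter
from statistics import mean, median, stdev

# Vehicles serviced per day, copied from the two tables in the example.
tionesta = [23, 33, 27, 28, 39, 26, 30, 32, 28, 33, 35, 32,
            29, 25, 36, 31, 32, 27, 35, 32, 35, 37, 36, 30]
sheffield = [31, 35, 44, 36, 34, 37, 30, 37, 43, 31, 40, 31,
             32, 44, 36, 34, 43, 36, 26, 38, 37, 30, 42, 33]


def text_dot_plot(values, low, high):
    """Return a text dot plot: one '.' per observation, piled above its value."""
    counts = Counter(values)
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        rows.append("".join(" . " if counts[v] >= level else "   "
                            for v in range(low, high + 1)))
    # Number line underneath the dots.
    rows.append("".join(f"{v:^3d}" for v in range(low, high + 1)))
    return "\n".join(rows)


def summarize(values):
    """Mean, median, extremes, and sample standard deviation."""
    return {"mean": mean(values), "median": median(values),
            "min": min(values), "max": max(values), "stdev": stdev(values)}


for name, data in (("Tionesta", tionesta), ("Sheffield", sheffield)):
    print(name)
    print(text_dot_plot(data, 23, 44))   # same scale for both dealerships
    print(summarize(data))
```

Plotting both dealerships on the same scale (23 to 44) is what makes the difference in location and spread visible at a glance.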
STEM-AND-LEAF DISPLAYS
LO4-2
Construct and describe a
stem-and-leaf display.
In Chapter 2, we showed how to organize data into a frequency distribution so we could
summarize the raw data into a meaningful form. The major advantage to organizing the
data into a frequency distribution is we get a quick visual picture of the shape of the
distribution without doing any further calculation. To put it another way, we can see
where the data are concentrated and also determine whether there are any extremely
large or small values. There are two disadvantages, however, to organizing the data
into a frequency distribution: (1) we lose the exact identity of each value and (2) we
are not sure how the values within each class are distributed. To explain, the Theater
of the Republic in Erie, Pennsylvania, books live theater and musical performances.
The theater’s capacity is 160 seats. Last year, among the 45 performances, there were
eight different plays and twelve different bands. The following frequency distribution
shows that 80 up to 90 people attended 2 of the 45 performances; there were
7 performances where 90 up to 100 people attended. However, is the attendance within
this class clustered about 90, spread evenly throughout the class, or clustered near
99? We cannot tell.
Attendance Frequency
80 up to 90 2
90 up to 100 7
100 up to 110 6
110 up to 120 9
120 up to 130 8
130 up to 140 7
140 up to 150 3
150 up to 160 3
Total 45
One technique used to display quantitative information in a condensed form and provide
more information than the frequency distribution is the stem-and-leaf display. An
advantage of the stem-and-leaf display over a frequency distribution is we do not lose
the identity of each observation. In the above example, we would not know the identity
of the values in the 90 up to 100 class. To illustrate the construction of a
stem-and-leaf display using the number of people attending each performance, suppose
the seven observations in the 90 up to 100 class are 96, 94, 93, 94, 95, 96, and 97.
The stem value is the leading digit or digits, in this case 9. The leaves are the
trailing digits. The stem is placed to the left of a vertical line and the leaf values
to the right.
The values in the 90 up to 100 class would appear as follows:
9 ∣ 6 4 3 4 5 6 7
It is also customary to sort the values within each stem from
smallest to largest. Thus,
the second row of the stem-and-leaf display would appear as
follows:
9 ∣ 3 4 4 5 6 6 7
With the stem-and-leaf display, we can quickly observe that 94
people attended two
performances and the number attending ranged from 93 to 97. A
stem-and-leaf display
is similar to a frequency distribution with more information,
that is, the identity of the
observations is preserved.
STEM-AND-LEAF DISPLAY A statistical technique to present
a set of data. Each
numerical value is divided into two parts. The leading digit(s)
becomes the stem
and the trailing digit the leaf. The stems are located along the
vertical axis, and the
leaf values are stacked against each other along the horizontal
axis.
The following example explains the details of developing a
stem-and-leaf display.
E X A M P L E
Listed in Table 4–1 is the number of people attending each of
the 45 performances
at the Theater of the Republic last year. Organize the data into a
stem-and-leaf
display. Around what values does attendance tend to cluster?
What is the smallest
attendance? The largest attendance?
S O L U T I O N
From the data in Table 4–1, we note that the smallest attendance
is 88. So we will
make the first stem value 8. The largest attendance is 156, so
we will have the
stem values begin at 8 and continue to 15. The first number in
Table 4–1 is 96,
which has a stem value of 9 and a leaf value of 6. Moving
across the top row, the
second value is 93 and the third is 88. After the first 3 data
values are considered,
the chart is as follows.
Stem Leaf
8 8
9 6 3
10
11
12
13
14
15
Organizing all the data, the stem-and-leaf chart looks as
follows.
Stem Leaf
8 8 9
9 6 3 5 6 4 4 7
10 8 7 3 4 6 3
11 7 3 2 7 2 1 9 8 3
12 7 5 7 0 5 5 0 4
13 9 5 2 9 4 6 8
14 8 2 3
15 6 5 5
The usual procedure is to sort the leaf values from the smallest
to largest. The last
line, the row referring to the values in the 150s, would appear
as:
15 ∣ 5 5 6
TABLE 4–1 Number of People Attending Each of the 45
Performances at the Theater of
the Republic
96 93 88 117 127 95 113 96 108 94 148 156
139 142 94 107 125 155 155 103 112 127 117 120
112 135 132 111 125 104 106 139 134 119 97 89
118 136 125 143 120 103 113 124 138
The final table would appear as follows, where we have sorted
all of the leaf values.
Stem Leaf
8 8 9
9 3 4 4 5 6 6 7
10 3 3 4 6 7 8
11 1 2 2 3 3 7 7 8 9
12 0 0 4 5 5 5 7 7
13 2 4 5 6 8 9 9
14 2 3 8
15 5 5 6
You can draw several conclusions from the stem-and-leaf display. First, the minimum
number of people attending is 88 and the maximum is 156. There were two performances
with fewer than 90 people attending, and three performances with 150 or more. You can
observe, for example, that for the three performances with more than 150 people
attending, the actual attendances were 155, 155, and 156. The concentration of
attendance is between 110 and 130. There were nine performances with attendance
between 110 and 119 and eight performances between 120 and 129. We can also tell that
within the 120 to 129 group the actual attendances were spread evenly throughout the
class. That is, 120 people attended two performances, 124 people attended one
performance, 125 people attended three performances, and 127 people attended two
performances.
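The construction steps above (split each value into a stem and a leaf, then sort the leaves within each stem) can be sketched in a few lines of Python. This is an illustrative sketch, not the Minitab procedure; the function name `stem_and_leaf` is our own, and it assumes whole-number data with a one-digit leaf.

```python
from collections import defaultdict

# Table 4-1: attendance at each of the 45 performances.
attendance = [
    96, 93, 88, 117, 127, 95, 113, 96, 108, 94, 148, 156,
    139, 142, 94, 107, 125, 155, 155, 103, 112, 127, 117, 120,
    112, 135, 132, 111, 125, 104, 106, 139, 134, 119, 97, 89,
    118, 136, 125, 143, 120, 103, 113, 124, 138,
]


def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (trailing digit)."""
    rows = defaultdict(list)
    for value in sorted(values):          # sorting first keeps leaves ordered
        stem, leaf = divmod(value, 10)
        rows[stem].append(leaf)
    # Include empty stems so that gaps in the data remain visible.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}


display = stem_and_leaf(attendance)
for stem, leaves in display.items():
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Counting the leaves in each row recovers the class frequencies, for example nine leaves on stem 11 and eight on stem 12.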
We also can generate this information on the Minitab software
system. We have
named the variable Attendance. The Minitab output is below.
You can find the Minitab
commands that will produce this output in Appendix C.
The Minitab solution provides some additional information regarding cumulative totals.
In the column to the left of the stem values are numbers such as 2, 9, 15, and so on.
The number 9 indicates there are 9 observations that have occurred before the value of
100. The number 15 indicates that 15 observations have occurred prior to 110. About
halfway down the column the number 9 appears in parentheses. The parentheses indicate
that the middle value or median appears in that row and there are nine values in this
group. In this case, we describe the middle value as the value below which half of the
observations occur. There are a total of 45 observations, so the middle value, if the
data were arranged from smallest to largest, would be the 23rd observation; its value
is 118. After the median row, the cumulative totals begin to decline. These values
represent the “more than” cumulative totals. There are 21 observations of 120 or more,
13 of 130 or more, and so on.
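The cumulative column described above can be derived from the row counts alone. The sketch below is our own reconstruction of that logic from the description in the text, not Minitab's code; the function name `depth_column` is hypothetical, and the handling of an even number of observations is an assumption that the text does not cover.

```python
def depth_column(row_counts):
    """Cumulative totals from each end; the median row shows its own count in ()."""
    total = sum(row_counts)
    middle = (total + 1) / 2      # position of the median in the ordered data
    depths, running = [], 0
    for count in row_counts:
        below = running           # observations in earlier rows
        running += count
        if below < middle <= running:
            depths.append(f"({count})")          # row containing the median
        elif running < middle:
            depths.append(str(running))          # "less than" cumulative total
        else:
            depths.append(str(total - below))    # "more than" cumulative total
    return depths


# Leaves per stem (8 through 15) in the Theater of the Republic display.
row_counts = [2, 7, 6, 9, 8, 7, 3, 3]
print(depth_column(row_counts))
```

For the attendance data this yields 2, 9, 15, (9), 21, 13, 6, 3, which agrees with the totals quoted in the paragraph above.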
Which is the better choice, a dot plot or a stem-and-leaf chart? This is really a
matter of personal choice and convenience. For presenting data, especially with a
large number of observations, you will find dot plots are more frequently used. You
will see dot plots in analytical literature, marketing reports, and occasionally in
annual reports. If you are doing a quick analysis for yourself, stem-and-leaf tallies
are handy and easy, particularly on a smaller set of data.
S E L F - R E V I E W 4–1
1. The number of employees at each of the 142 Home Depot stores in the Southeast
region is shown in the following dot plot.
[Dot plot: Number of employees, horizontal axis from 80 to 104]
(a) What are the maximum and minimum numbers of employees per store?
(b) How many stores employ 91 people?
(c) Around what values does the number of employees per store tend to cluster?
2. The rate of return for 21 stocks is:
8.3 9.6 9.5 9.1 8.8 11.2 7.7 10.1 9.9 10.8
10.2 8.0 8.4 8.1 11.6 9.6 8.8 8.0 10.4 9.8 9.2
Organize this information into a stem-and-leaf display.
(a) How many rates are less than 9.0?
(b) List the rates in the 10.0 up to 11.0 category.
(c) What is the median?
(d) What are the maximum and the minimum rates of return?
E X E R C I S E S
1. Describe the differences between a histogram and a dot plot. When might a dot
plot be better than a histogram?
2. Describe the differences between a histogram and a stem-and-leaf display.
3. Consider the following chart.
[Dot plot: horizontal axis from 1 to 7]
a. What is this chart called?
b. How many observations are in the study?
c. What are the maximum and the minimum values?
d. Around what values do the observations tend to cluster?
4. The following chart reports the number of cell phones sold at a big-box retail store
for the last 26 days.
[Dot plot of the number of cell phones sold per day]
a. What are the maximum and the minimum numbers of cell phones sold in a day?
b. What is a typical number of cell phones sold?
5. The first row of a stem-and-leaf chart appears as follows: 62 | 1 3 3 7 9. Assume
whole number values.
a. What is the “possible range” of the values in this row?
b. How many data values are in this row?
c. List the actual values in this row of data.
6. The third row of a stem-and-leaf chart appears as follows: 21 | 0 1 3 5 7 9. Assume
whole number values.
a. What is the “possible range” of the values in this row?
b. How many data values are in this row?
c. List the actual values in this row of data.
7. The following stem-and-leaf chart shows the number of units produced per day in a
factory.
Stem Leaf
3 8
4
5 6
6 0133559
7 0236778
8 59
9 00156
10 36
a. How many days were studied?
b. How many observations are in the first class?
c. What are the minimum value and the maximum value?
d. List the actual values in the fourth row.
e. List the actual values in the second row.
f. How many values are less than 70?
g. How many values are 80 or more?
h. What is the median?
i. How many values are between 60 and 89, inclusive?
8. The following stem-and-leaf chart reports the number of
prescriptions filled per day
at the pharmacy on the corner of Fourth and Main Streets.
Stem Leaf
12 689
13 123
14 6889
15 589
16 35
17 24568
18 268
19 13456
20 034679
21 2239
22 789
23 00179
24 8
25 13
26
27 0
a. How many days were studied?
b. How many observations are in the last class?
c. What are the maximum and the minimum values in the entire
set of data?
d. List the actual values in the fourth row.
e. List the actual values in the next to the last row.
f. On how many days were less than 160 prescriptions filled?
g. On how many days were 220 or more prescriptions filled?
h. What is the middle value?
i. How many days did the number of filled prescriptions range
between 170 and 210?
9. A survey of the number of phone calls made by a sample of 16 Verizon subscribers
last week revealed the following information. Develop a stem-and-leaf chart. How many
calls did a typical subscriber make? What were the maximum and the minimum number of
calls made?
52 43 30 38 30 42 12 46 39
37 34 46 32 18 41 5
10. Aloha Banking Co. is studying ATM use in suburban Honolulu. Yesterday, for a
sample of 30 ATMs, the bank counted the number of times each machine was used. The
data are presented in the table. Develop a stem-and-leaf chart to summarize the data.
What were the typical, minimum, and maximum number of times each ATM was used?
83 64 84 76 84 54 75 59 70 61
63 80 84 73 68 52 65 90 52 77
95 36 78 61 59 84 95 47 87 60
MEASURES OF POSITION
The standard deviation is the most widely used measure of
dispersion. However, there
are other ways of describing the variation or spread in a set of
data. One method is to
determine the location of values that divide a set of
observations into equal parts. These
measures include quartiles, deciles, and percentiles.
Quartiles divide a set of observations into four equal parts. To explain further,
think of any set of values arranged from the minimum to the maximum. In Chapter 3, we
called the middle value of a set of data arranged from the minimum to the maximum the
median. That is, 50% of the observations are larger than the median and 50% are
smaller. The median is a measure of location because it pinpoints the center of the
data. In a similar fashion, quartiles divide a set of observations into four equal
parts. The first quartile, usually labeled Q1, is the value below which 25% of the
observations occur, and the third quartile, usually labeled Q3, is the value below
which 75% of the observations occur.
Similarly, deciles divide a set of observations into 10 equal parts and percentiles
into 100 equal parts. So if you found that your GPA was in the 8th decile at your
university, you could conclude that 80% of the students had a GPA lower than yours and
20% had a higher GPA. If your GPA was in the 92nd percentile, then 92% of students had
a GPA less than your GPA and only 8% of students had a GPA greater than your GPA.
Percentile scores are frequently used to report results on such national standardized
tests as the SAT, ACT, GMAT (used to judge entry into many master of business
administration programs), and LSAT (used to judge entry into law school).
Quartiles, Deciles, and Percentiles
LO4-3
Identify and compute
measures of position.
To formalize the computational procedure, let Lp refer to the location of a desired
percentile. So if we want to find the 92nd percentile we would use L92, and if we
wanted the median, the 50th percentile, then L50. For a number of observations, n, the
location of the Pth percentile can be found using the formula:
LOCATION OF A PERCENTILE   $L_p = (n + 1)\dfrac{P}{100}$   [4–1]
An example will help to explain further.
E X A M P L E
Morgan Stanley is an investment company with offices located
throughout the
United States. Listed below are the commissions earned last
month by a sample of
15 brokers at the Morgan Stanley office in Oakland, California.
$2,038 $1,758 $1,721 $1,637 $2,097 $2,047 $2,205 $1,787 $2,287
 1,940  2,311  2,054  2,406  1,471  1,460
Locate the median, the first quartile, and the third quartile for
the commissions
earned.
S O L U T I O N
The first step is to sort the data from the smallest commission to the largest.
$1,460 $1,471 $1,637 $1,721 $1,758 $1,787 $1,940 $2,038
 2,047  2,054  2,097  2,205  2,287  2,311  2,406
The median value is the observation in the center and is the same as the 50th
percentile, so P equals 50. So the median or L50 is located at (n + 1)(50/100), where
n is the number of observations. In this case, that is position number 8, found by
(15 + 1)(50/100). The eighth value in the ordered array is $2,038. So we conclude this
is the median and that half the brokers earned commissions more than $2,038 and half
earned less than $2,038. The result using formula (4–1) to find the median is the same
as the method presented in Chapter 3.
Recall the definition of a quartile. Quartiles divide a set of observations into four
equal parts. Hence 25% of the observations will be less than the first quartile.
Seventy-five percent of the observations will be less than the third quartile. To
locate the first quartile, we use formula (4–1), where n = 15 and P = 25:
$$L_{25} = (n + 1)\frac{P}{100} = (15 + 1)\frac{25}{100} = 4$$
and to locate the third quartile, n = 15 and P = 75:
$$L_{75} = (n + 1)\frac{P}{100} = (15 + 1)\frac{75}{100} = 12$$
Therefore, the first and third quartile values are located at positions 4 and 12,
respectively. The fourth value in the ordered array is $1,721 and the twelfth is
$2,205. These are the first and third quartiles.
In the above example, the location formula yielded a whole number. That is, we wanted
to find the first quartile and there were 15 observations, so the location formula
indicated we should find the fourth ordered value. What if there were 20 observations
in the sample, that is n = 20, and we wanted to locate the first quartile? From the
location formula (4–1):
$$L_{25} = (n + 1)\frac{P}{100} = (20 + 1)\frac{25}{100} = 5.25$$
We would locate the fifth value in the ordered array and then move .25 of the distance
between the fifth and sixth values and report that as the first quartile. Like the
median, the quartile does not need to be one of the actual values in the data set.
To explain further, suppose a data set contained the six values 91, 75, 61, 101, 43,
and 104. We want to locate the first quartile. We order the values from the minimum to
the maximum: 43, 61, 75, 91, 101, and 104. The first quartile is located at
$$L_{25} = (n + 1)\frac{P}{100} = (6 + 1)\frac{25}{100} = 1.75$$
The position formula tells us that the first quartile is located between the first and
the second values and it is .75 of the distance between the first and the second
values. The first value is 43 and the second is 61. So the distance between these two
values is 18. To locate the first quartile, we need to move .75 of the distance
between the first and second values, so .75(18) = 13.5. To complete the procedure, we
add 13.5 to the first value, 43, and report that the first quartile is 56.5.
We can extend the idea to include both deciles and percentiles. To locate the 23rd
percentile in a sample of 80 observations, we would look for the 18.63 position.
$$L_{23} = (n + 1)\frac{P}{100} = (80 + 1)\frac{23}{100} = 18.63$$
To find the value corresponding to the 23rd percentile, we would locate the 18th value
and the 19th value and determine the distance between the two values. Next, we would
multiply this difference by 0.63 and add the result to the smaller value. The result
would be the 23rd percentile.
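The locate-then-interpolate procedure of formula (4–1) can be written as a short Python sketch. The function names `percentile_location` and `percentile` are our own, and the clamping at the two ends of the data is an assumption added so the sketch never reads past the smallest or largest value; the text does not discuss that case.

```python
def percentile_location(n, p):
    """Formula (4-1): location of the p-th percentile among n ordered values."""
    return (n + 1) * p / 100


def percentile(values, p):
    """Find the location, then interpolate between the two neighboring values."""
    ordered = sorted(values)
    location = percentile_location(len(ordered), p)
    whole = int(location)          # 1-based rank of the lower neighbor
    fraction = location - whole    # share of the gap to move upward
    if whole < 1:                  # assumption: clamp below the first value
        return ordered[0]
    if whole >= len(ordered):      # assumption: clamp above the last value
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)


commissions = [2038, 1758, 1721, 1637, 2097, 2047, 2205, 1787, 2287,
               1940, 2311, 2054, 2406, 1471, 1460]

print(percentile(commissions, 25))                  # first quartile
print(percentile(commissions, 50))                  # median
print(percentile(commissions, 75))                  # third quartile
print(percentile([91, 75, 61, 101, 43, 104], 25))   # six-value example
print(percentile_location(80, 23))                  # location only
```

When the location is a whole number, as with the 15 commissions, the fraction is zero and the function simply returns the value at that rank.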
Statistical software is very helpful when describing and summarizing data. Excel,
Minitab, and MegaStat, a statistical analysis Excel add-in, all provide summary
statistics that include quartiles. For example, the Minitab summary of the Morgan
Stanley commission data, shown below, includes the first and third quartiles, and
other statistics. Based on the reported quartiles, 25% of the commissions earned were
less than $1,721 and 75% were less than $2,205. These are the same values we
calculated using formula (4–1).
STATISTICS IN ACTION
John W. Tukey (1915–2000) received a PhD in mathematics from Princeton in 1939.
However, when he joined the Fire Control Research Office during World War II, his
interest in abstract mathematics shifted to applied statistics. He developed effective
numerical and graphical methods for studying patterns in data. Among the graphics he
developed are the stem-and-leaf diagram and the box-and-whisker plot or box plot. From
1960 to 1980, Tukey headed the statistical division of NBC’s election night vote
projection team. He became renowned in 1960 for preventing an early call of victory
for Richard Nixon in the presidential election won by John F. Kennedy.
There are ways other than formula (4–1) to locate quartile values. For example,
another method uses 0.25n + 0.75 to locate the position of the first quartile and
0.75n + 0.25 to locate the position of the third quartile. We will call this the Excel
Method. In the Morgan Stanley data, this method would place the first quartile at
position 4.5 (.25 × 15 + .75) and the third quartile at position 11.5 (.75 × 15 +
.25). The first quartile would be interpolated at 0.5, or one-half, of the distance
between the fourth- and the fifth-ranked values. Based on this method, the first
quartile is $1,739.50, found by ($1,721 + 0.5[$1,758 − $1,721]). The third quartile,
at position 11.5, would be $2,151, or one-half the distance between the eleventh- and
the twelfth-ranked values, found by ($2,097 + 0.5[$2,205 − $2,097]). Excel, as shown
in the Morgan Stanley and Applewood examples, can compute quartiles using either of
the two methods. Please note the text uses formula (4–1) to calculate quartiles.
Is the difference between the two methods important? No. Usually it is just a
nuisance. In general, both methods calculate values that will support the statement
that approximately 25% of the values are less than the value of the first quartile,
and approximately 75% of the data values are less than the value of the third
quartile. When the sample is large, the difference in the results from the two methods
is small. For example, in the Applewood Auto Group data there are 180 vehicles. The
quartiles computed using both methods are shown below. Based on the variable profit,
45 of the 180 values (25%) are less than both values of the first quartile, and 135 of
the 180 values (75%) are less than both values of the third quartile.
Morgan Stanley Commissions
Method            Quartile 1  Quartile 3
Equation 4-1      1721        2205
Alternate Method  1739.5      2151
Applewood Profit
Method            Quartile 1  Quartile 3
Equation 4-1      1415.5      2275.5
Alternate Method  1422.5      2268.5
When using Excel, be careful to understand the method used to calculate quartiles.
Excel 2013 and Excel 2016 offer both methods. The Excel function, Quartile.exc, will
result in the same answer as Equation 4–1. The Excel function, Quartile.inc, will
result in the Excel Method answers.
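The two location rules can be compared outside Excel as well. As a hedged illustration, Python's standard-library function `statistics.quantiles` offers an `"exclusive"` method, which locates quartile i at i(n + 1)/4 as in formula (4–1), and an `"inclusive"` method, which locates it at i(n − 1)/4 + 1, that is, 0.25n + 0.75 and 0.75n + 0.25 for the first and third quartiles. The link to the Excel function names rests on the description in the text; this sketch does not call Excel.

```python
from statistics import quantiles

commissions = [2038, 1758, 1721, 1637, 2097, 2047, 2205, 1787, 2287,
               1940, 2311, 2054, 2406, 1471, 1460]

# "exclusive": quartile i sits at position i(n + 1)/4, as in formula (4-1).
q1_exc, median_exc, q3_exc = quantiles(commissions, n=4, method="exclusive")

# "inclusive": quartile i sits at position i(n - 1)/4 + 1,
# i.e. 0.25n + 0.75 for Q1 and 0.75n + 0.25 for Q3 (the "Excel Method").
q1_inc, median_inc, q3_inc = quantiles(commissions, n=4, method="inclusive")

print("Formula (4-1):", q1_exc, q3_exc)
print("Excel Method: ", q1_inc, q3_inc)
```

Both methods agree on the median here; only the quartile positions, and hence the interpolated quartile values, differ.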
S E L F - R E V I E W 4–2
The Quality Control department of Plainsville Peanut Company is responsible for
checking the weight of the 8-ounce jar of peanut butter. The weights of a sample of
nine jars produced last hour are:
7.69 7.72 7.80 7.86 7.90 7.94 7.97 8.06 8.09
(a) What is the median weight?
(b) Determine the weights corresponding to the first and third quartiles.
E X E R C I S E S
11. Determine the median and the first and third quartiles in the following data.
46 47 49 49 51 53 54 54 55 55 59
12. Determine the median and the first and third quartiles in the following data.
5.24 6.02 6.67 7.30 7.59 7.99 8.03 8.35 8.81 9.45
9.61 10.37 10.39 11.86 12.22 12.71 13.07 13.59 13.89 15.42
13. The Thomas Supply Company Inc. is a distributor of gas-powered generators.
As with any business, the length of time customers take to pay their invoices is
important. Listed below, arranged from smallest to largest, is the time, in days, for
a sample of The Thomas Supply Company Inc. invoices.
13 13 13 20 26 27 31 34 34 34 35 35 36 37 38
41 41 41 45 47 47 47 50 51 53 54 56 62 67 82
a. Determine the first and third quartiles.
b. Determine the second decile and the eighth decile.
c. Determine the 67th percentile.
14. Kevin Horn is the national sales manager for National Textbooks Inc. He has a
sales staff of 40 who visit college professors all over the United States. Each
Saturday morning he requires his sales staff to send him a report. This report
includes, among other things, the number of professors visited during the previous
week. Listed below, ordered from smallest to largest, are the number of visits last
week.
38 40 41 45 48 48 50 50 51 51 52 52 53 54 55 55 55 56 56 57
59 59 59 62 62 62 63 64 65 66 66 67 67 69 69 71 77 78 79 79
a. Determine the median number of calls.
b. Determine the first and third quartiles.
c. Determine the first decile and the ninth decile.
d. Determine the 33rd percentile.
BOX PLOTS
A box plot is a graphical display, based on quartiles, that helps
us picture a set of data.
To construct a box plot, we need only five statistics: the
minimum value, Q1 (the first
quartile), the median, Q3 (the third quartile), and the maximum
value. An example will
help to explain.
LO4-4
Construct and analyze a
box plot.
E X A M P L E
Alexander’s Pizza offers free delivery of its pizza within 15
miles. Alex, the owner,
wants some information on the time it takes for delivery. How
long does a typical
delivery take? Within what range of times will most deliveries
be completed? For a
sample of 20 deliveries, he determined the following
information:
Minimum value = 13 minutes
Q1 = 15 minutes
Median = 18 minutes
Q3 = 22 minutes
Maximum value = 30 minutes
Develop a box plot for the delivery times. What conclusions can
you make about
the delivery times?
S O L U T I O N
The first step in drawing a box plot is to create an appropriate
scale along the
horizontal axis. Next, we draw a box that starts at Q1 (15
minutes) and ends at Q3
(22 minutes). Inside the box we place a vertical line to represent
the median (18
minutes). Finally, we extend horizontal lines from the box out
to the minimum
value (13 minutes) and the maximum value (30 minutes). These
horizontal lines
outside of the box are sometimes called “whiskers” because
they look a bit like a
cat’s whiskers.
[Box plot of delivery times: horizontal axis in minutes from 12 to 32, with the
minimum value, Q1, median, Q3, and maximum value labeled]
The box plot also shows the interquartile range of delivery
times between
Q1 and Q3. The interquartile range is 7 minutes and indicates
that 50% of the
deliveries are between 15 and 22 minutes.
The box plot also reveals that the distribution of delivery times is positively
skewed. In Chapter 3, we defined skewness as the lack of symmetry in a set of data.
How do we know this distribution is positively skewed? In this case, there are
actually two pieces of information that suggest this. First, the dashed line to the
right of the box from 22 minutes (Q3) to the maximum time of 30 minutes is longer than
the dashed line from the left of 15 minutes (Q1) to the minimum value of 13 minutes.
To put it another way, the 25% of the data larger than the third quartile is more
spread out than the 25% less than the first quartile. A second indication of positive
skewness is that the median is not in the center of the box. The distance from the
first quartile to the median is smaller than the distance from the median to the third
quartile. We know that the number of delivery times between 15 minutes and 18 minutes
is the same as the number of delivery times between 18 minutes and 22 minutes, so the
longer upper half of the box means those delivery times are more spread out.
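The two skewness checks just described use nothing more than the five statistics behind the box plot. The sketch below records them in a small Python class; the class name `FiveNumberSummary` and the method `skew_hints` are our own illustrative choices, and the checks are only the informal comparisons from the text, not a formal measure of skewness.

```python
from dataclasses import dataclass


@dataclass
class FiveNumberSummary:
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

    @property
    def iqr(self):
        """Interquartile range: the width of the box."""
        return self.q3 - self.q1

    def skew_hints(self):
        """Compare the two whiskers and the two halves of the box."""
        return {
            "right_whisker_longer":
                (self.maximum - self.q3) > (self.q1 - self.minimum),
            "upper_half_of_box_longer":
                (self.q3 - self.median) > (self.median - self.q1),
        }


# Alexander's Pizza delivery times, in minutes.
delivery = FiveNumberSummary(minimum=13, q1=15, median=18, q3=22, maximum=30)
print(delivery.iqr)
print(delivery.skew_hints())
```

Both comparisons come out true for the delivery times, which is the pattern the text associates with positive skewness.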
E X A M P L E
Refer to the Applewood Auto Group data. Develop a box plot
for the variable age of
the buyer. What can we conclude about the distribution of the
age of the buyer?
S O L U T I O N
Minitab was used to develop the following chart and summary
statistics.
The median age of the purchaser is 46 years, 25% of the purchasers are less than
40 years of age, and 25% are more than 52.75 years of age. Based on the summary
information and the box plot, we conclude:
• Fifty percent of the purchasers are between the ages of 40 and 52.75 years.
• The distribution of ages is fairly symmetric. There are two reasons for this
conclusion. The length of the whisker above 52.75 years (Q3) is about the same
length as the whisker below 40 years (Q1). Also, the area in the box between
40 years and the median of 46 years is about the same as the area between
the median and 52.75.
There are three asterisks (*) above 70 years. What do they indicate? In a box plot, an
asterisk identifies an outlier. An outlier is a value that is inconsistent with the
rest of the data. It is defined as a value that is more than 1.5 times the
interquartile range smaller than Q1 or larger than Q3. In this example, an outlier
would be a value larger than 71.875 years, found by:
Outlier > Q3 + 1.5(Q3 − Q1) = 52.75 + 1.5(52.75 − 40) = 71.875
An outlier would also be a value less than 20.875 years.
Outlier < Q1 − 1.5(Q3 − Q1) = 40 − 1.5(52.75 − 40) = 20.875
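As a minimal sketch, the two outlier limits can be computed with a small helper. The function name `outlier_limits` and the short list of ages are assumptions made for illustration; only the quartiles and the three ages of 72, 72, and 73 come from the example.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a value is flagged as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of buyer age from the Applewood Auto Group example.
lower, upper = outlier_limits(40, 52.75)
print(lower, upper)

# A few ages for illustration: 21 and 46 are hypothetical; 72, 72, and 73
# are the three oldest purchasers reported in the example.
ages = [21, 46, 72, 72, 73]
outliers = [age for age in ages if age < lower or age > upper]
print(outliers)
```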
From the box plot, we conclude there are three purchasers 72 years of age or
older and none less than 21 years of age. Technical note: In some cases, a single
asterisk may represent more than one observation because of the limitations of the
software and space available. It is a good idea to check the actual data. In this
instance, there are three purchasers 72 years old or older; two are 72 and one is 73.
S E L F - R E V I E W 4–3
The following box plot shows the assets in millions of dollars for credit unions in
Seattle, Washington.
[Box plot of credit union assets. Horizontal axis: 0 to 100.]
What are the smallest and largest values, the first and third quartiles, and the median?
Would you agree that the distribution is symmetrical? Are there any outliers?
15. The box plot below shows the amount spent for books and
supplies per year by
students at four-year public colleges.
[Box plot. Horizontal axis: $0 to $1,750.]
a. Estimate the median amount spent.
b. Estimate the first and third quartiles for the amount spent.
c. Estimate the interquartile range for the amount spent.
d. Beyond what point is a value considered an outlier?
e. Identify any outliers and estimate their value.
f. Is the distribution symmetrical or positively or negatively
skewed?
16. The box plot shows the undergraduate in-state tuition per
credit hour at four-year
public colleges.
[Box plot. Horizontal axis: $0 to $1,500; one asterisk marks an outlier.]
a. Estimate the median.
b. Estimate the first and third quartiles.
c. Determine the interquartile range.
d. Beyond what point is a value considered an outlier?
e. Identify any outliers and estimate their value.
f. Is the distribution symmetrical or positively or negatively
skewed?
17. In a study of the gasoline mileage of model year 2016
automobiles, the mean miles
per gallon was 27.5 and the median was 26.8. The smallest
value in the study was
12.70 miles per gallon, and the largest was 50.20. The first and
third quartiles were
17.95 and 35.45 miles per gallon, respectively. Develop a box
plot and comment
on the distribution. Is it a symmetric distribution?
E X E R C I S E S
SKEWNESS
In Chapter 3, we described measures of central location for a
distribution of data by re-
porting the mean, median, and mode. We also described
measures that show the amount
of spread or variation in a distribution, such as the range and
the standard deviation.
Another characteristic of a distribution is the shape. There are
four shapes com-
monly observed: symmetric, positively skewed, negatively
skewed, and bimodal. In a
symmetric distribution the mean and median are equal and the
data values are evenly
spread around these values. The shape of the distribution below
the mean and median
is a mirror image of the distribution above the mean and median. A
distribution of values is
skewed to the right or positively skewed if there is a single
peak, but the values extend
much farther to the right of the peak than to the left of the peak.
In this case, the mean
is larger than the median. In a negatively skewed distribution
there is a single peak, but
the observations extend farther to the left, in the negative
direction, than to the right. In
a negatively skewed distribution, the mean is smaller than the
median. Positively
skewed distributions are more common. Salaries often follow
this pattern. Think of the
salaries of those employed in a small company of about 100
people. The president and
a few top executives would have very large salaries relative to
the other workers and
hence the distribution of salaries would exhibit positive
skewness. A bimodal distribu-
tion will have two or more peaks. This is often the case when
the values are from two or
more populations. This information is summarized in Chart 4–1.
LO4-5
Compute and interpret
the coefficient of
skewness.
[Chart 4–1 has four panels, each with Frequency on the vertical axis and the mean and median marked: Symmetric (Ages, in years; axis label 45), Positively Skewed (Monthly Salaries; axis labels $3,000 and $4,000), Negatively Skewed (Test Scores; axis labels 75 and 80), and Bimodal (Outside Diameter, in inches; axis labels .98 and 1.04).]
CHART 4–1 Shapes of Frequency Polygons
There are several formulas in the statistical literature used to
calculate skewness.
The simplest, developed by Professor Karl Pearson (1857–
1936), is based on the differ-
ence between the mean and the median.
18. A sample of 28 time shares in the Orlando, Florida, area
revealed the follow-
ing daily charges for a one-bedroom suite. For convenience, the
data are ordered
from smallest to largest. Construct a box plot to represent the
data. Comment on
the distribution. Be sure to identify the first and third quartiles
and the median.
$116 $121 $157 $192 $207 $209 $209
229 232 236 236 239 243 246
260 264 276 281 283 289 296
307 309 312 317 324 341 353
Using Pearson’s relationship, formula (4–2), the coefficient of skewness can range
from −3 up to 3. A value
near −3, such as −2.57, indicates considerable negative
skewness. A value such as
1.63 indicates moderate positive skewness. A value of 0, which
will occur when the
mean and median are equal, indicates the distribution is
symmetrical and there is no
skewness present.
In this text, we present output from Minitab and Excel. Both of
these software pack-
ages compute a value for the coefficient of skewness based on
the cubed deviations
from the mean. The formula is:
SOFTWARE COEFFICIENT OF SKEWNESS
\[ sk = \frac{n}{(n - 1)(n - 2)} \left[ \sum \left( \frac{x - \bar{x}}{s} \right)^{3} \right] \]   [4–3]
Formula (4–3) offers an insight into skewness. Inside the summation on the
right-hand side is the difference between each value and the mean, divided by the
standard deviation. That is the portion (x − x̄)/s of the formula. This idea is called
standardizing. We will discuss the idea of standardizing a value in more detail in
Chapter 7 when we describe the normal probability distribution. At this point,
observe that the result is to report the difference between each value and the
mean in units of the standard deviation. If this difference is positive, the
particular value is larger than the mean; if it is negative, the value is smaller
than the mean. When
we cube these values,
we retain the information on the direction of the difference.
Recall that in the formula for
the standard deviation [see formula (3–10)] we squared the
difference between each
value and the mean, so that the result was all nonnegative
values.
If the set of data values under consideration is symmetric, when
we cube the stan-
dardized values and sum over all the values, the result would be
near zero. If there are
several large values, clearly separate from the others, the sum
of the cubed differences
would be a large positive value. If there are several small values
clearly separate from
the others, the sum of the cubed differences will be negative.
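The behavior of the summed cubes can be checked with a short Python sketch. The three small data sets below are hypothetical, invented only to show the sign of the sum; `sum_cubed_z` is our own helper name, and it computes the bracketed term of formula (4–3).

```python
from statistics import mean, stdev

def sum_cubed_z(values):
    """Sum of the cubed standardized values, the bracketed term in formula (4-3)."""
    x_bar, s = mean(values), stdev(values)   # stdev is the sample standard deviation
    return sum(((x - x_bar) / s) ** 3 for x in values)

# Hypothetical data sets, not taken from the text.
symmetric = [1, 2, 3, 4, 5]      # evenly spread around the mean
right_tail = [1, 2, 3, 4, 15]    # one large value separate from the others
left_tail = [-9, 2, 3, 4, 5]     # one small value separate from the others

for name, data in [("symmetric", symmetric),
                   ("right tail", right_tail),
                   ("left tail", left_tail)]:
    print(f"{name:>10}: {sum_cubed_z(data):8.4f}")
```

Cubing keeps the sign of each standardized value, so the isolated large value dominates with a positive cube and the isolated small value with a negative one.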
An example will illustrate the idea of skewness.
PEARSON’S COEFFICIENT OF SKEWNESS
\[ sk = \frac{3(\bar{x} - \text{Median})}{s} \]   [4–2]
STATISTICS IN ACTION
The late Stephen Jay Gould (1941–2002) was a professor of zoology and
professor of geology at Harvard University. In 1982, he was diagnosed
with cancer and had an expected survival time of 8 months. However,
never one to be discouraged, he found through his research that the
distribution of survival time is dramatically skewed to the right. Not
only do 50% of similar cancer patients survive more than 8 months, but
the survival time could be years rather than months! In fact, Dr. Gould
lived another 20 years. Based on his experience, he wrote a widely
published essay titled “The Median Is Not the Message.”
E X A M P L E
Following are the earnings per share for a sample of 15 software companies for the
year 2016. The earnings per share are arranged from smallest to largest.
$0.09  $0.13  $0.41  $0.51  $1.12  $1.20  $1.49  $3.18
 3.50   6.36   7.83   8.92  10.13  12.99  16.40
Compute the mean, median, and standard deviation. Find the coefficient of
skewness using Pearson’s estimate and the software methods. What is your
conclusion regarding the shape of the distribution?
S O L U T I O N
These are sample data, so we use formula (3–2) to determine the mean.
\[ \bar{x} = \frac{\sum x}{n} = \frac{\$74.26}{15} = \$4.95 \]
The median is the middle value in a set of data, arranged from
smallest to largest.
In this case, there is an odd number of observations, so the
middle value is the
median. It is $3.18.
We use formula (3–10) on page 78 to determine the sample
standard deviation.
\[ s = \sqrt{\frac{\sum (x - \bar{x})^{2}}{n - 1}} = \sqrt{\frac{(\$0.09 - \$4.95)^{2} + \cdots + (\$16.40 - \$4.95)^{2}}{15 - 1}} = \$5.22 \]
Pearson’s coefficient of skewness is 1.017, found by
\[ sk = \frac{3(\bar{x} - \text{Median})}{s} = \frac{3(\$4.95 - \$3.18)}{\$5.22} = 1.017 \]
This indicates there is moderate positive skewness in the
earnings per share data.
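The Pearson calculation can be reproduced with Python's standard `statistics` module, as a sketch rather than the method used to prepare the text. The function name `pearson_skewness` is ours. Because the code carries unrounded values for the mean and standard deviation, its result differs slightly in the later decimal places from the hand calculation, which uses $4.95 and $5.22.

```python
from statistics import mean, median, stdev

# Earnings per share for the 15 software companies in the example.
eps = [0.09, 0.13, 0.41, 0.51, 1.12, 1.20, 1.49, 3.18,
       3.50, 6.36, 7.83, 8.92, 10.13, 12.99, 16.40]

def pearson_skewness(values):
    """Pearson's coefficient of skewness, formula (4-2), using the sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)

sk_pearson = pearson_skewness(eps)
print(f"mean = {mean(eps):.2f}, median = {median(eps):.2f}, s = {stdev(eps):.2f}")
print(f"Pearson sk = {sk_pearson:.3f}")
```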
We obtain a similar, but not exactly the same, value from the
software method.
The details of the calculations are shown in Table 4–2. To
begin, we find the differ-
ence between each earnings per share value and the mean and
divide this result
by the standard deviation. We have referred to this as
standardizing. Next, we cube,
that is, raise to the third power, the result of the first step.
Finally, we sum the cubed
values. The details for the first company, that is, the company
with an earnings per
share of $0.09, are:
\[ \left( \frac{x - \bar{x}}{s} \right)^{3} = \left( \frac{0.09 - 4.95}{5.22} \right)^{3} = (-0.9310)^{3} = -0.8070 \]
When we sum the 15 cubed values, the result is 11.8274. That is, the term
Σ[(x − x̄)/s]³ = 11.8274. To find the coefficient of skewness, we use formula (4–3),
with n = 15.
\[ sk = \frac{n}{(n - 1)(n - 2)} \sum \left( \frac{x - \bar{x}}{s} \right)^{3} = \frac{15}{(15 - 1)(15 - 2)} (11.8274) = 0.975 \]
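A minimal sketch of formula (4–3) in Python follows. The function name `software_skewness` is an assumption of ours, and the code is not claimed to be how Minitab or Excel implement the calculation internally. It keeps full precision for the mean and standard deviation, so the result agrees with 0.975 only to about two decimal places.

```python
from statistics import mean, stdev

# Earnings per share for the 15 software companies in the example.
eps = [0.09, 0.13, 0.41, 0.51, 1.12, 1.20, 1.49, 3.18,
       3.50, 6.36, 7.83, 8.92, 10.13, 12.99, 16.40]

def software_skewness(values):
    """Coefficient of skewness from cubed standardized values, formula (4-3)."""
    n = len(values)
    x_bar, s = mean(values), stdev(values)
    cubed_sum = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_sum

sk_software = software_skewness(eps)
print(f"Software sk = {sk_software:.3f}")
```

The formula needs at least three observations, since the factor (n − 2) appears in the denominator.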
TABLE 4–2 Calculation of the Coefficient of Skewness
Earnings per Share    (x − x̄)/s    [(x − x̄)/s]³
       0.09            −0.9310       −0.8070
       0.13            −0.9234       −0.7873
       0.41            −0.8697       −0.6579
       0.51            −0.8506       −0.6154
       1.12            −0.7337       −0.3950
       1.20            −0.7184       −0.3708
       1.49            −0.6628       −0.2912
       3.18            −0.3391       −0.0390
       3.50            −0.2778       −0.0214
       6.36             0.2701        0.0197
       7.83             0.5517        0.1679
       8.92             0.7605        0.4399
      10.13             0.9923        0.9772
      12.99             1.5402        3.6539
      16.40             2.1935       10.5537
                                     11.8274
We conclude that the earnings per share values are somewhat positively
skewed. The following Minitab summary reports the descriptive measures, such as
the mean, median, and standard deviation of the earnings per share data. Also
included are the coefficient of skewness and a histogram with a bell-shaped curve
superimposed.
S E L F - R E V I E W 4–4
A sample of five data entry clerks employed in the Horry County Tax Office revised the
following number of tax records last hour: 73, 98, 60, 92, and 84.
(a) Find the mean, median, and the standard deviation.
(b) Compute the coefficient of skewness using Pearson’s method.
(c) Calculate the coefficient of skewness using the software method.
(d) What is your conclusion regarding the skewness of the data?
E X E R C I S E S
For Exercises 19–22:
a. Determine the mean, median, and the standard deviation.
b. Determine the coefficient of skewness using Pearson’s method.
c. Determine the coefficient of skewness using the software method.
19. The following values are the starting salaries, in $000, for a sample of five
accounting graduates who accepted positions in public accounting last year.
36.0 26.0 33.0 28.0 31.0
20. Listed below are the salaries, in $000, for a sample of 15 chief financial
officers in the electronics industry.
$516.0 $548.0 $566.0 $534.0 $586.0 $529.0
546.0 523.0 538.0 523.0 551.0 552.0
486.0 558.0 574.0
DESCRIBING THE RELATIONSHIP BETWEEN
TWO VARIABLES
In Chapter 2 and the first section of this chapter, we presented
graphical techniques
to summarize the distribution of a single variable. We used a
histogram in Chapter 2
to summarize the profit on vehicles sold by the Applewood Auto
Group. Earlier in
this chapter, we used dot plots and stem-and-leaf
displays to visually summarize a set of data. Because
we are studying a single variable, we refer to this as
univariate data.
There are situations where we wish to study and
visually portray the relationship between two vari-
ables. When we study the relationship between two
variables, we refer to the data as bivariate. Data ana-
lysts frequently wish to understand the relationship
between two variables. Here are some examples:
• Tybo and Associates is a law firm that advertises ex-
tensively on local TV. The partners are considering
increasing their advertising budget. Before doing
so, they would like to know the relationship be-
tween the amount spent per month on advertising
and the total amount of billings for that month. To
put it another way, will increasing the amount spent
on advertising result in an increase in billings?
LO4-6
Create and interpret a
scatter diagram.
21. Listed below are the commissions earned ($000) last year
by the 15 sales
representatives at Furniture Patch Inc.
$ 3.9 $ 5.7 $ 7.3 $10.6 $13.0 $13.6 $15.1 $15.8 $17.1
17.4 17.6 22.3 38.6 43.2 87.7
22. Listed below are the salaries for the 2016 New York
Yankees Major League
Baseball team.
Player Salary Player Salary
CC Sabathia $25,000,000 Dustin Ackley $3,200,000
Mark Teixeira 23,125,000 Martin Prado 3,000,000
Masahiro Tanaka 22,000,000 Didi Gregorius 2,425,000
Jacoby Ellsbury 21,142,857 Aaron Hicks 574,000
Alex Rodriguez 21,000,000 Austin Romine 556,000
Brian McCann 17,000,000 Chasen Shreve 533,400
Carlos Beltran 15,000,000 Greg Bird 525,300
Brett Gardner 13,500,000 Luis Severino 521,300
Chase Headley 13,000,000 Bryan Mitchell 516,650
Aroldis Chapman 11,325,000 Kirby Yates 511,900
Andrew Miller 9,000,000 Mason Williams 509,700
Starlin Castro 7,857,143 Ronald Torreyes 508,600
Nathan Eovaldi 5,600,000 John Barbato 507,500
Michael Pineda 4,300,000 Dellin Betances 507,500
Ivan Nova 4,100,000 Luis Cessa 507,500
• Coastal Realty is studying the selling prices of homes. What
variables seem to be
related to the selling price of homes? For example, do larger
homes sell for more
than smaller ones? Probably. So Coastal might study the
relationship between the
area in square feet and the selling price.
• Dr. Stephen Givens is an expert in human development. He is
studying the relation-
ship between the height of fathers and the height of their sons.
That is, do tall fathers
tend to have tall children? Would you expect LeBron James, the
6′8″, 250 pound
professional basketball player, to have relatively tall sons?
One graphical technique we use to show the relationship
between variables is called a
scatter diagram.
To draw a scatter diagram, we need two variables. We scale one
variable
along the horizontal axis (X-axis) of a graph and the other
variable along the vertical
axis (Y-axis). Usually one variable depends to some degree on
the other. In the
third example above, the height of the son depends on the
height of the father. So
we scale the height of the father on the horizontal axis and that
of the son on the
vertical axis.
We can use statistical software, such as Excel, to perform the
plotting function for
us. Caution: You should always be careful of the scale. By
changing the scale of either
the vertical or the horizontal axis, you can affect the apparent
visual strength of the
relationship.
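To make the axis scaling concrete, here is a rough text-only sketch in Python rather than the Excel procedure the text refers to. The function `text_scatter` and the father and son heights are hypothetical, invented purely for illustration. Each pair is scaled to a column along the horizontal axis and a row along the vertical axis.

```python
def text_scatter(xs, ys, width=30, height=10):
    """Return a list of text rows with '*' marking each (x, y) pair."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column (X-axis) or a row (Y-axis) index.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"   # row 0 is the top of the plot
    return ["".join(cells) for cells in grid]

# Hypothetical heights in inches (not data from the text).
father = [66, 68, 70, 72, 74]   # horizontal axis
son = [67, 69, 70, 73, 75]      # vertical axis

rows = text_scatter(father, son)
print("\n".join(rows))
```

Calling the function with a different `width` or `height` stretches or flattens the same points, which is the scale effect the caution above describes.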
Following are three scatter diagrams (Chart 4–2). The one on
the left shows a
rather strong positive relationship between the age in years and
the maintenance
cost last year for a sample of 10 buses owned by the city of
Cleveland, Ohio. Note
that as the age of the bus increases, the yearly maintenance cost
also increases. The
example in the center, for a sample of 20 vehicles, shows a
rather strong inverse relationship between the odometer reading and the auction price.
That is, as the number
of miles driven increases, the auction price decreases. The
example on the right de-
picts the relationship between the height and yearly salary for a
sample of 15 shift
supervisors. This graph indicates there is little relationship
between their height and
yearly salary.
[Chart 4–2 has three panels. Left: Age of Buses and Maintenance Cost (horizontal axis: Age (years), 0 to 6; vertical axis: Cost (annual), $0 to $10,000). Center: Auction Price versus Odometer (horizontal axis: Odometer, 10,000 to 50,000; vertical axis: Auction price, $12,000 to $24,000). Right: Height versus Salary (horizontal axis: Height (inches), 54 to 63; vertical axis: Salary ($000), 90 to 125).]
CHART 4–2 Three Examples of Scatter Diagrams.
E X A M P L E
In the introduction to Chapter 2, we presented data from the
Applewood Auto
Group. We gathered information concerning several variables,
including the profit
earned from the sale of 180 vehicles sold last month. In addition
to the amount of
profit on each sale, one of the other variables is the age of the
purchaser. Is there a
relationship between the profit earned on a vehicle sale and the
age of the pur-
chaser? Would it be reasonable to conclude that more profit is
made on vehicles
purchased by older buyers?
In the preceding example, there is a weak positive, or direct,
relationship between the
variables. There are, however, many instances where there is a
relationship between
the variables, but that relationship is inverse or negative. For
example:
• The value of a vehicle and the number of miles driven. As the
number of miles in-
creases, the value of the vehicle decreases.
• The premium for auto insurance and the age of the driver.
Auto rates tend to be the
highest for younger drivers and less for older drivers.
• For many law enforcement personnel, as the number of years
on the job increases,
the number of traffic citations decreases. This may be because
personnel become
more liberal in their interpretations or they may be in supervisor
positions and not
in a position to issue as many citations. But in any event, as years on the job
increase, the number of citations decreases.
CONTINGENCY TABLES
A scatter diagram requires that both of the variables be at least
interval scale. In the
Applewood Auto Group example, both age and vehicle profit
are ratio scale variables.
Height is also ratio scale as used in the discussion of the
relationship between the
height of fathers and the height of their sons. What if we wish
to study the relationship
between two variables when one or both are nominal or ordinal
scale? In this case, we
tally the results in a contingency table.
LO4-7
Develop and explain a
contingency table.
S O L U T I O N
We can investigate the relationship between vehicle profit and
the age of the buyer
with a scatter diagram. We scale age on the horizontal, or X-
axis, and the profit on
the vertical, or Y-axis. We assume profit depends on the age of
the purchaser. As
people age, they earn more income and purchase more
expensive cars which, in
turn, produce higher profits. We use Excel to develop the
scatter diagram. The
Excel commands are in Appendix C.
The scatter diagram shows a rather weak positive relationship
between the two
variables. It does not appear there is much relationship between
the vehicle profit
and the age of the buyer. In Chapter 13, we will study the
relationship between
variables more extensively, even calculating several numerical
measures to ex-
press the relationship between variables.
[Scatter diagram: Profit and Age of Buyer at Applewood Auto Group. Horizontal axis: Age (Years), 0 to 80; vertical axis: Profit per Vehicle ($), $0 to $3,500.]
A contingency table is a cross-tabulation that simultaneously
summarizes two variables
of interest. For example:
• Students at a university are classified by gender and class
(freshman, sophomore,
junior, or senior).
• A product is classified as acceptable or unacceptable and by
the shift (day, after-
noon, or night) on which it is manufactured.
• A voter in a school bond referendum is classified as to party
affiliation (Democrat,
Republican, other) and the number of children that voter has
attending school in the
district (0, 1, 2, etc.).
CONTINGENCY TABLE A table used to classify observations
according to two
identifiable characteristics.
E X A M P L E
There are four dealerships in the Applewood Auto Group.
Suppose we want to com-
pare the profit earned on each vehicle sold by the particular
dealership. To put it
another way, is there a relationship between the amount of
profit earned and the
dealership?
S O L U T I O N
In a contingency table, both variables only need to be nominal
or ordinal. In this
example, the variable dealership is a nominal variable and the
variable profit is a
ratio variable. To convert profit to an ordinal variable, we
classify the variable profit
into two categories, those cases where the profit earned is more
than the median
and those cases where it is less. On page 64, we calculated the
median profit for all
sales last month at Applewood Auto Group to be $1,882.50.
Contingency Table Showing the Relationship between Profit and Dealership
Above/Below Median Profit    Kane   Olean   Sheffield   Tionesta   Total
Above                          25      20          19         26      90
Below                          27      20          26         17      90
Total                          52      40          45         43     180
By organizing the information into a contingency table, we can
compare the profit
at the four dealerships. We observe the following:
• From the Total column on the right, 90 of the 180 cars sold had a profit above
the median and the other 90 had a profit below it. From the definition of the
median, this is expected.
• For the Kane dealership, 25 of the 52 cars sold, or 48%, earned a profit
above the median.
• The percentages of cars sold at a profit above the median at the other
dealerships are 50% for Olean, 42% for Sheffield, and 60% for Tionesta.
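The percentages quoted in these observations can be verified with a few lines of Python. The counts are those in the contingency table; the variable names are ours.

```python
# Counts from the contingency table of profit (above/below median) by dealership.
dealerships = ["Kane", "Olean", "Sheffield", "Tionesta"]
above = [25, 20, 19, 26]
below = [27, 20, 26, 17]

# Column totals: number of vehicles sold at each dealership.
column_totals = [a + b for a, b in zip(above, below)]

# Percentage of each dealership's sales with a profit above the median.
pct_above = {d: round(100 * a / t)
             for d, a, t in zip(dealerships, above, column_totals)}

for d in dealerships:
    print(f"{d:<10} {pct_above[d]}% above the median")
print("Row totals:", sum(above), sum(below))
```

Comparing percentages within each dealership, rather than raw counts, adjusts for the different number of vehicles each location sold.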
We will return to the study of contingency tables in Chapter 5
during the study of
probability and in Chapter 15 during the study of nonparametric
methods of analysis.
The rock group Blue String Beans is touring the United States.
The following chart shows
the relationship between concert seating capacity and revenue in
$000 for a sample of
concerts.
[Scatter diagram. Horizontal axis: Seating Capacity, 5800 to 7300; vertical axis: Amount ($000), 2 to 8.]
(a) What is the diagram called?
(b) How many concerts were studied?
(c) Estimate the revenue for the concert with the largest seating
capacity.
(d) How would you characterize the relationship between
revenue and seating capacity?
Is it strong or weak, direct or inverse?
S E L F - R E V I E W 4–5
23. Develop a scatter diagram for the following sample data.
How would you
describe the relationship between the values?
x-Value y-Value x-Value y-Value
10 6 11 6
8 2 10 5
9 6 7 2
11 5 7 3
13 7 11 7
24. Silver Springs Moving and Storage Inc. is studying the
relationship between the
number of rooms in a move and the number of labor hours
required for the move.
As part of the analysis, the CFO of Silver Springs developed the
following scatter
diagram.
[Scatter diagram. Horizontal axis: Rooms, 1 to 5; vertical axis: Hours, 0 to 40.]
E X E R C I S E S
a. How many moves are in the sample?
b. Does it appear that more labor hours are required as the
number of rooms
increases, or do labor hours decrease as the number of rooms
increases?
25. The Director of Planning for Devine Dining Inc. wishes to
study the relationship be-
tween the gender of a guest and whether the guest orders
dessert. To investigate the
relationship, the manager collected the following information
on 200 recent customers.
Gender
Dessert Ordered Male Female Total
Yes 32 15 47
No 68 85 153
Total 100 100 200
a. What is the level of measurement of the two variables?
b. What is the above table called?
c. Does the evidence in the table suggest men are more likely
to order dessert
than women? Explain why.
26. Ski Resorts of Vermont Inc. is considering a merger with
Gulf Shores Beach Resorts
Inc. of Alabama. The board of directors surveyed 50
stockholders concerning their
position on the merger. The results are reported below.
Opinion
Number of Shares Held Favor Oppose Undecided Total
Under 200 8 6 2 16
200 up to 1,000 6 8 1 15
Over 1,000 6 12 1 19
Total 20 26 4 50
a. What level of measurement is used in this table?
b. What is this table called?
c. What group seems most strongly opposed to the merger?
C H A P T E R S U M M A R Y
I. A dot plot shows the range of values on the horizontal axis
and the number of observa-
tions for each value on the vertical axis.
A. Dot plots report the details of each observation.
B. They are useful for comparing two or more data sets.
II. A stem-and-leaf display is an alternative to a histogram.
A. The leading digit is the stem and the trailing digit the leaf.
B. The advantages of a stem-and-leaf display over a histogram
include:
1. The identity of each observation is not lost.
2. The digits themselves give a picture of the distribution.
3. The cumulative frequencies are also shown.
III. Measures of location also describe the shape of a set of
observations.
A. Quartiles divide a set of observations into four equal parts.
1. Twenty-five percent of the observations are less than the first
quartile, 50% are
less than the second quartile, and 75% are less than the third
quartile.
2. The interquartile range is the difference between the third
quartile and the first
quartile.
B. Deciles divide a set of observations into 10 equal parts and
percentiles into 100
equal parts.
IV. A box plot is a graphic display of a set of data.
A. A box is drawn enclosing the regions between the first
quartile and the third quartile.
1. A line is drawn inside the box at the median value.
2. Dotted line segments are drawn from the third quartile to the
largest value to
show the highest 25% of the values and from the first quartile to
the smallest
value to show the lowest 25% of the values.
B. A box plot is based on five statistics: the maximum and
minimum values, the first and
third quartiles, and the median.
V. The coefficient of skewness is a measure of the symmetry of
a distribution.
A. There are two formulas for the coefficient of skewness.
1. The formula developed by Pearson is:
\[ sk = \frac{3(\bar{x} - \text{Median})}{s} \]   [4–2]
2. The coefficient of skewness computed by statistical software
is:
\[ sk = \frac{n}{(n - 1)(n - 2)} \left[ \sum \left( \frac{x - \bar{x}}{s} \right)^{3} \right] \]   [4–3]
VI. A scatter diagram is a graphic tool to portray the
relationship between two variables.
A. Both variables are measured with interval or ratio scales.
B. If the scatter of points moves from the lower left to the upper
right, the variables un-
der consideration are directly or positively related.
C. If the scatter of points moves from the upper left to the lower
right, the variables are
inversely or negatively related.
VII. A contingency table is used to classify nominal-scale
observations according to two
characteristics.
P R O N U N C I A T I O N K E Y
SYMBOL MEANING PRONUNCIATION
Lp Location of percentile L sub p
Q1 First quartile Q sub 1
Q3 Third quartile Q sub 3
C H A P T E R E X E R C I S E S
27. A sample of students attending Southeast Florida
University is asked the number of so-
cial activities in which they participated last week. The chart
below was prepared from
the sample data.
[Dot plot. Horizontal axis: Activities, 0 to 4.]
a. What is the name given to this chart?
b. How many students were in the study?
c. How many students reported attending no social activities?
28. Doctor’s Care is a walk-in clinic, with locations in
Georgetown, Moncks Corner, and
Aynor, at which patients may receive treatment for minor
injuries, colds, and flu, as well
as physical examinations. The following charts report the
number of patients treated in
each of the three locations last month.
[Dot plots of the number of patients, one for each location: Georgetown, Moncks Corner, and Aynor. Horizontal axis: Patients, 10 to 50.]
Describe the number of patients served at the three locations
each day. What are the
maximum and minimum numbers of patients served at each of
the locations?
29. Below is the number of customers who visited Smith’s
True-Value hardware store
in Bellville, Ohio, over the last twenty-three days. Make a stem-
and-leaf display of this
variable.
46 52 46 40 42 46 40 37 46 40 52 32 37 32 52
40 32 52 40 52 46 46 52
30. The top 25 companies (by market capitalization) operating
in the Washington, DC,
area along with the year they were founded and the number of
employees are given
below. Make a stem-and-leaf display of each of these variables
and write a short de-
scription of your findings.
Company Name Year Founded Employees
AES Corp. 1981 30,000
American Capital Ltd. 1986 484
AvalonBay Communities Inc. 1978 1,767
Capital One Financial Corp. 1995 31,800
Constellation Energy Group Inc. 1816 9,736
Coventry Health Care Inc. 1986 10,250
Danaher Corp. 1984 45,000
Dominion Resources Inc. 1909 17,500
Fannie Mae 1938 6,450
Freddie Mac 1970 5,533
Gannett Co. 1906 49,675
General Dynamics Corp. 1952 81,000
Genworth Financial Inc. 2004 7,200
Harman International Industries Inc. 1980 11,246
Host Hotels & Resorts Inc. 1927 229
Legg Mason 1899 3,800
Lockheed Martin Corp. 1995 140,000
Marriott International Inc. 1927 151,000
MedImmune LLC 1988 2,516
NII Holdings Inc. 1996 7,748
Norfolk Southern Corp. 1982 30,594
Pepco Holdings Inc. 1896 5,057
Sallie Mae 1972 11,456
T. Rowe Price Group Inc. 1937 4,605
The Washington Post Co. 1877 17,100
31. In recent years, due to low interest rates, many
homeowners refinanced their
home mortgages. Linda Lahey is a mortgage officer at Down
River Federal Savings
and Loan. Below is the amount refinanced for 20 loans she
processed last week.
The data are reported in thousands of dollars and arranged from
smallest to
largest.
59.2 59.5 61.6 65.5 66.6 72.9 74.8 77.3 79.2
83.7 85.6 85.8 86.6 87.0 87.1 90.2 93.3 98.6
100.2 100.7
a. Find the median, first quartile, and third quartile.
b. Find the 26th and 83rd percentiles.
c. Draw a box plot of the data.
32. A study is made by the recording industry in the United
States of the number
of music CDs owned by 25 senior citizens and 30 young adults.
The information is
reported below.
Seniors
28 35 41 48 52 81 97 98 98 99
118 132 133 140 145 147 153 158 162 174
177 180 180 187 188
Young Adults
81 107 113 147 147 175 183 192 202 209
233 251 254 266 283 284 284 316 372 401
417 423 490 500 507 518 550 557 590 594
a. Find the median and the first and third quartiles for the
number of CDs owned by
senior citizens. Develop a box plot for the information.
b. Find the median and the first and third quartiles for the
number of CDs owned by
young adults. Develop a box plot for the information.
c. Compare the number of CDs owned by the two groups.
33. The corporate headquarters of Bank.com, an on-line
banking company, is located
in downtown Philadelphia. The director of human resources is
making a study of the
time it takes employees to get to work. The city is planning to
offer incentives to each
downtown employer if they will encourage their employees to
use public transportation.
Below is a listing of the time to get to work this morning
according to whether the em-
ployee used public transportation or drove a car.
Public Transportation
23 25 25 30 31 31 32 33 35 36
37 42
Private
32 32 33 34 37 37 38 38 38 39
40 44
a. Find the median and the first and third quartiles for the time
it took employees using
public transportation. Develop a box plot for the information.
b. Find the median and the first and third quartiles for the time
it took employees who
drove their own vehicle. Develop a box plot for the
information.
c. Compare the times of the two groups.
34. The following box plot shows the number of daily
newspapers published in each
state and the District of Columbia. Write a brief report
summarizing the number pub-
lished. Be sure to include information on the values of the first
and third quartiles,
the median, and whether there is any skewness. If there are any
outliers, estimate
their value.
[Box plot: Number of Newspapers; horizontal axis scaled from 0 to 100; asterisks (****) mark outliers]
35. Walter Gogel Company is an industrial supplier of
fasteners, tools, and springs. The
amounts of its invoices vary widely, from less than $20.00 to
more than $400.00. During
the month of January the company sent out 80 invoices. Here is
a box plot of these in-
voices. Write a brief report summarizing the invoice amounts.
Be sure to include infor-
mation on the values of the first and third quartiles, the median,
and whether there is any
skewness. If there are any outliers, approximate the value of
these invoices.
[Box plot: Invoice Amount; horizontal axis scaled from 0 to 250; an asterisk (*) marks an outlier]
36. The American Society of PeriAnesthesia Nurses (ASPAN;
www.aspan.org) is a
national organization serving nurses practicing in ambulatory
surgery, preanesthesia, and
postanesthesia care. The organization consists of the 40
components listed below.
State/Region Membership
Alabama 95
Arizona 399
Maryland, Delaware, DC 531
Connecticut 239
Florida 631
Georgia 384
Hawaii 73
Illinois 562
Indiana 270
Iowa 117
Kentucky 197
Louisiana 258
Michigan 411
Massachusetts 480
Maine 97
Minnesota, Dakotas 289
Missouri, Kansas 282
Mississippi 90
Nebraska 115
North Carolina 542
Nevada 106
New Jersey, Bermuda 517
Alaska, Idaho, Montana, Oregon, Washington 708
New York 891
Ohio 708
Oklahoma 171
Arkansas 68
California 1,165
New Mexico 79
Pennsylvania 575
Rhode Island 53
Colorado 409
South Carolina 237
Texas 1,026
Tennessee 167
Utah 67
Virginia 414
Vermont, New Hampshire 144
Wisconsin 311
West Virginia 62
Use statistical software to answer the following questions.
a. Find the mean, median, and standard deviation of the number
of members per
component.
b. Find the coefficient of skewness, using the software. What do
you conclude about
the shape of the distribution of component size?
c. Compute the first and third quartiles using formula (4–1).
d. Develop a box plot. Are there any outliers? Which
components are outliers? What are
the limits for outliers?
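Parts c and d depend on locating percentiles and on setting limits for outliers. The sketch below illustrates both steps on a small hypothetical data set, not the ASPAN data. It assumes that formula (4–1) is the percentile location formula L_p = (n + 1)P/100 and that outliers are values more than 1.5 times the interquartile range below the first quartile or above the third quartile. The function name percentile_value and the data values are our own choices for illustration.

```python
def percentile_value(sorted_values, p):
    """Return the p-th percentile using the location L_p = (n + 1) * p / 100."""
    n = len(sorted_values)
    location = (n + 1) * p / 100       # 1-based position; may be fractional
    whole = int(location)              # observation just below the location
    fraction = location - whole        # distance toward the next observation
    if whole < 1:                      # location falls before the first value
        return sorted_values[0]
    if whole >= n:                     # location falls at or past the last value
        return sorted_values[-1]
    below = sorted_values[whole - 1]
    above = sorted_values[whole]
    return below + fraction * (above - below)


# Hypothetical data, already sorted from smallest to largest.
values = [12, 15, 18, 20, 22, 25, 31, 60]

q1 = percentile_value(values, 25)
median = percentile_value(values, 50)
q3 = percentile_value(values, 75)

iqr = q3 - q1                          # interquartile range
lower_limit = q1 - 1.5 * iqr           # assumed outlier convention
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(q1, median, q3)                  # 15.75 21.0 29.5
print(lower_limit, upper_limit)        # -4.875 50.125
print(outliers)                        # [60]
```

Statistical software packages often use a different interpolation rule for quartiles, so their output may differ slightly from values found with the location formula.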
37. McGivern Jewelers is located in the Levis Square Mall just
south of Toledo, Ohio.
Recently it posted an advertisement on a social media site
reporting the shape, size,
price, and cut grade for 33 of its diamonds currently in stock.
The information is re-
ported below.
Shape Size (carats) Price Cut Grade
Princess 5.03 $44,312 Ideal cut
Round 2.35 20,413 Premium cut
Round 2.03 13,080 Ideal cut
Round 1.56 13,925 Ideal cut
Round 1.21 7,382 Ultra ideal cut
Round 1.21 5,154 Average cut
Round 1.19 5,339 Premium cut
Emerald 1.16 5,161 Ideal cut
Round 1.08 8,775 Ultra ideal cut
Round 1.02 4,282 Premium cut
Round 1.02 6,943 Ideal cut
Marquise 1.01 7,038 Good cut
Princess 1.00 4,868 Premium cut
Round 0.91 5,106 Premium cut
Round 0.90 3,921 Good cut
Round 0.90 3,733 Premium cut
Round 0.84 2,621 Premium cut
Round 0.77 2,828 Ultra ideal cut
Oval 0.76 3,808 Premium cut
Princess 0.71 2,327 Premium cut
Marquise 0.71 2,732 Good cut
Round 0.70 1,915 Premium cut
Round 0.66 1,885 Premium cut
Round 0.62 1,397 Good cut
Round 0.52 2,555 Premium cut
Princess 0.51 1,337 Ideal cut
Round 0.51 1,558 Premium cut
Round 0.45 1,191 Premium cut
Princess 0.44 1,319 Average cut
Marquise 0.44 1,319 Premium cut
Round 0.40 1,133 Premium cut
Round 0.35 1,354 Good cut
Round 0.32 896 Premium cut
a. Develop a box plot of the variable price and comment on the
result. Are there any
outliers? What is the median price? What are the values of the
first and the third
quartiles?
b. Develop a box plot of the variable size and comment on the
result. Are there any
outliers? What is the median size? What are the values of the
first and the third
quartiles?
c. Develop a scatter diagram between the variables price and
size. Be sure to put price
on the vertical axis and size on the horizontal axis. Does there
seem to be an associ-
ation between the two variables? Is the association direct or
indirect? Does any point
seem to be different from the others?
d. Develop a contingency table for the variables shape and cut
grade. What is the most
common cut grade? What is the most common shape? What is
the most common
combination of cut grade and shape?
38. Listed below is the amount of commissions earned last
month for the eight mem-
bers of the sales staff at Best Electronics. Calculate the
coefficient of skewness using
both methods. Hint: Use of a spreadsheet will expedite the
calculations.
980.9 1,036.5 1,099.5 1,153.9 1,409.0 1,456.4 1,718.4 1,721.2
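The hint suggests a spreadsheet; the same arithmetic can be sketched in a few lines of code. The example below assumes that "both methods" refers to Pearson's coefficient, sk = 3(mean − median)/s, and the software coefficient, sk = [n/((n − 1)(n − 2))] Σ((x − mean)/s)³, where s is the sample standard deviation. The function names and the five data values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from statistics import mean, median, stdev


def pearson_skewness(values):
    """Pearson's coefficient: 3 * (mean - median) / sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)


def software_skewness(values):
    """Software coefficient based on the cubed standardized deviations."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)                  # sample standard deviation (n - 1 divisor)
    cubed_sum = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_sum


# Hypothetical sample: mean 5, median 4, sample standard deviation 3.
sample = [2, 4, 4, 5, 10]

sk_pearson = pearson_skewness(sample)
sk_software = software_skewness(sample)

print(round(sk_pearson, 3))            # 1.0
print(round(sk_software, 3))           # 1.481
```

Both coefficients are positive here because the single large value pulls the mean above the median. The two methods generally give different numbers, so a report should state which one was used.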
39. Listed below is the number of car thefts in a large city
over the last week. Calculate
the coefficient of skewness using both methods. Hint: Use of a
spreadsheet will expe-
dite the calculations.
3 12 13 7 8 3 8
40. The manager of Information Services at Wilkin
Investigations, a private investigation firm,
is studying the relationship between the age (in months) of a
combination printer, copier,
and fax machine and its monthly maintenance cost. For a sample
of 15 machines, the
manager developed the following chart. What can the manager
conclude about the re-
lationship between the variables?
[Scatter diagram: Monthly Maintenance Cost (vertical axis, $80 to $130) versus Months (horizontal axis, 34 to 49)]
41. An auto insurance company reported the following
information regarding the age
of a driver and the number of accidents reported last year.
Develop a scatter diagram for
the data and write a brief summary.
Age Accidents Age Accidents
16 4 23 0
24 2 27 1
18 5 32 1
17 4 22 3
42. Wendy’s offers eight different condiments (mustard,
catsup, onion, mayonnaise, pickle,
lettuce, tomato, and relish) on hamburgers. A store manager
collected the following in-
formation on the number of condiments ordered and the age
group of the customer.
What can you conclude regarding the information? Who tends to
order the most or least
number of condiments?
                      Age
Number of Condiments  Under 18  18 up to 40  40 up to 60  60 or older
0                     12        18           24           52
1                     21        76           50           30
2                     39        52           40           12
3 or more             71        87           47           28
43. Here is a table showing the number of employed and
unemployed workers 20 years or
older by gender in the United States.
Number of Workers (000)
Gender Employed Unemployed
Men 70,415 4,209
Women 61,402 3,314
a. How many workers were studied?
b. What percent of the workers were unemployed?
c. Compare the percent unemployed for the men and the
women.
DATA ANALYTICS
44. Refer to the North Valley real estate data recorded on
homes sold during the last
year. Prepare a report on the selling prices of the homes based
on the answers to the
following questions.
a. Compute the minimum, maximum, median, and the first and
the third quartiles of
price. Create a box plot. Comment on the distribution of home
prices.
b. Develop a scatter diagram with price on the vertical axis and
the size of the home on
the horizontal. Is there a relationship between these variables?
Is the relationship
direct or indirect?
c. For homes without a pool, develop a scatter diagram with
price on the vertical axis
and the size of the home on the horizontal. Do the same for
homes with a pool. How
do the relationships between price and size for homes without a
pool and homes
with a pool compare?
45. Refer to the Baseball 2016 data that report information on
the 30 Major League
Baseball teams for the 2016 season.
a. In the data set, the variable year opened is the first year of operation
for that stadium. For each team, use this variable to create a new variable,
stadium age, by subtracting the value of year opened from the current year.
Develop a box plot with the new variable, stadium age. Are there any outliers?
If so, which of the stadiums are outliers?
b. Using the variable, salary, create a box plot. Are there any
outliers? Compute the
quartiles using formula (4–1). Write a brief summary of your
analysis.
c. Draw a scatter diagram with the variable, wins, on the
vertical axis and salary on the
horizontal axis. What are your conclusions?
d. Using the variable, wins, draw a dot plot. What can you
conclude from this plot?
46. Refer to the Lincolnville School District bus data.
a. Referring to the maintenance cost variable, develop a box
plot. What are the mini-
mum, first quartile, median, third quartile, and maximum
values? Are there any
outliers?
b. Using the median maintenance cost, develop a contingency
table with bus manufac-
turer as one variable and whether the maintenance cost was
above or below the
median as the other variable. What are your conclusions?
A REVIEW OF CHAPTERS 1–4
This section is a review of the major concepts and terms
introduced in Chapters 1–4. Chapter 1 began by describing the
meaning and purpose of statistics. Next we described the
different types of variables and the four levels of measurement.
Chapter 2 was concerned with describing a set of observations
by organizing it into a frequency distribution and then
portraying the frequency distribution as a histogram or a
frequency polygon. Chapter 3 began by describing measures of
location, such as the mean, weighted mean, median, geometric
mean, and mode. This chapter also included measures of
dispersion, or spread. Discussed in this section were the range,
variance, and standard deviation. Chapter 4 included
several graphing techniques such as dot plots, box plots, and
scatter diagrams. We also discussed the coefficient of skew-
ness, which reports the lack of symmetry in a set of data.
Throughout this section we stressed the importance of statistical
software, such as Excel and Minitab. Many computer
outputs in these chapters demonstrated how quickly and
effectively a large data set can be organized into a frequency
distribution, several of the measures of location or measures of
variation calculated, and the information presented in
graphical form.
PROBLEMS
1. The duration in minutes of a sample of 50 power
outages last year in the state of
South Carolina is listed below.
124 14 150 289 52 156 203 82 27 248
39 52 103 58 136 249 110 298 251 157
186 107 142 185 75 202 119 219 156 78
116 152 206 117 52 299 58 153 219 148
145 187 165 147 158 146 185 186 149 140
Use a statistical software package such as Excel or Minitab to
help answer the following
questions.
a. Determine the mean, median, and standard deviation.
b. Determine the first and third quartiles.
c. Develop a box plot. Are there any outliers? Do the durations
follow a symmetric distribution or are they skewed? Justify your answer.
d. Organize the durations into a frequency
distribution.
e. Write a brief summary of the results in parts a to d.
2. Listed below are the 45 U.S. presidents and their age as
they began their terms in
office.
Number Name Age
1 Washington 57
2 J. Adams 61
3 Jefferson 57
4 Madison 57
5 Monroe 58
6 J. Q. Adams 57
7 Jackson 61
8 Van Buren 54
9 W. H. Harrison 68
10 Tyler 51
11 Polk 49
12 Taylor 64
13 Fillmore 50
14 Pierce 48
15 Buchanan 65
16 Lincoln 52
17 A. Johnson 56
18 Grant 46
19 Hayes 54
20 Garfield 49
21 Arthur 50
22 Cleveland 47
23 B. Harrison 55
24 Cleveland 55
25 McKinley 54
26 T. Roosevelt 42
27 Taft 51
28 Wilson 56
29 Harding 55
30 Coolidge 51
31 Hoover 54
32 F. D. Roosevelt 51
33 Truman 60
34 Eisenhower 62
35 Kennedy 43
36 L. B. Johnson 55
37 Nixon 56
38 Ford 61
39 Carter 52
40 Reagan 69
41 G. H. W. Bush 64
42 Clinton 46
43 G. W. Bush 54
44 Obama 47
45 Trump 70
Use a statistical software package such as Excel or Minitab to
help answer the following
questions.
a. Determine the mean, median, and standard deviation.
b. Determine the first and third quartiles.
c. Develop a box plot. Are there any outliers? Do the ages
follow a symmetric distribution or are they skewed? Justify your answer.
d. Organize the distribution of ages into a frequency
distribution.
e. Write a brief summary of the results in parts a to d.
3. Listed below is the 2014 median household income for the
50 states and the
District of Columbia.
https://www.census.gov/hhes/www/income/data/historical/household/
State Amount
Alabama 42,278
Alaska 67,629
Arizona 49,254
Arkansas 44,922
California 60,487
Colorado 60,940
Connecticut 70,161
Delaware 57,522
D.C. 68,277
Florida 46,140
Georgia 49,555
Hawaii 71,223
Idaho 53,438
Illinois 54,916
Indiana 48,060
Iowa 57,810
Kansas 53,444
Kentucky 42,786
Louisiana 42,406
Maine 51,710
Maryland 76,165
Massachusetts 63,151
Michigan 52,005
Minnesota 67,244
Mississippi 35,521
Missouri 56,630
Montana 51,102
Nebraska 56,870
Nevada 49,875
New Hampshire 73,397
New Jersey 65,243
New Mexico 46,686
New York 54,310
North Carolina 46,784
North Dakota 60,730
Ohio 49,644
Oklahoma 47,199
Oregon 58,875
Pennsylvania 55,173
Rhode Island 58,633
South Carolina 44,929
South Dakota 53,053
Tennessee 43,716
Texas 53,875
Utah 63,383
Vermont 60,708
Virginia 66,155
Washington 59,068
West Virginia 39,552
Wisconsin 58,080
Wyoming 55,690
Use a statistical software package such as Excel or Minitab to
help answer the following
questions.
a. Determine the mean, median, and standard deviation.
b. Determine the first and third quartiles.
c. Develop a box plot. Are there any outliers? Do the amounts
follow a symmetric distri-
bution or are they skewed? Justify your answer.
d. Organize the distribution of incomes into a frequency
distribution.
e. Write a brief summary of the results in parts a to d.
4. A sample of 12 homes sold last week in St. Paul, Minnesota,
revealed the following
information. Draw a scatter diagram. Can we conclude that, as
the size of the home
(reported below in thousands of square feet) increases, the
selling price (reported in
$ thousands) also increases?
Home Size (thousands of square feet)  Selling Price ($ thousands)
1.4                                   100
1.3                                   110
1.2                                   105
1.1                                   120
1.4                                   80
1.0                                   105
1.3                                   110
0.8                                   85
1.2                                   105
0.9                                   75
1.1                                   70
1.1                                   95
5. Refer to the following diagram.
[Box plot: horizontal axis scaled from 0 to 200; two asterisks (* *) mark outliers]
a. What is the graph called?
b. What are the median, and first and third quartile values?
c. Is the distribution positively skewed? Tell how you know.
d. Are there any outliers? If yes, estimate these values.
e. Can you determine the number of observations in the study?
CASES
A. Century National Bank
The following case will appear in subsequent review sec-
tions. Assume that you work in the Planning Department of
the Century National Bank and report to Ms. Lamberg. You
will need to do some data analysis and prepare a short writ-
ten report. Remember, Mr. Selig is the president of the bank,
so you will want to ensure that your report is complete and
accurate. A copy of the data appears in Appendix A.6.
Century National Bank has offices in several cities in
the Midwest and the southeastern part of the United
States. Mr. Dan Selig, president and CEO, would like to
know the characteristics of his checking account custom-
ers. What is the balance of a typical customer?
How many other bank services do the checking ac-
count customers use? Do the customers use the ATM ser-
vice and, if so, how often? What about debit cards? Who
uses them, and how often are they used?
To better understand the customers, Mr. Selig asked
Ms. Wendy Lamberg, director of planning, to select a sam-
ple of customers and prepare a report. To begin, she has
appointed a team from her staff. You are the head of the
team and responsible for preparing the report. You select a
random sample of 60 customers. In addition to the balance
in each account at the end of last month, you determine
(1) the number of ATM (automatic teller machine) transac-
tions in the last month; (2) the number of other bank ser-
vices (a savings account, a certificate of deposit, etc.) the
customer uses; (3) whether the customer has a debit card
(this is a bank service in which charges are made directly to
the customer’s account); and (4) whether or not interest is
paid on the checking account. The sample includes cus-
tomers from the branches in Cincinnati, Ohio; Atlanta,
Georgia; Louisville, Kentucky; and Erie, Pennsylvania.
1. Develop a graph or table that portrays the checking
balances. What is the balance of a typical customer?
Do many customers have more than $2,000 in their
accounts? Does it appear that there is a difference in
the distribution of the accounts among the four
branches? Around what value do the account bal-
ances tend to cluster?
2. Determine the mean and median of the checking ac-
count balances. Compare the mean and the median
balances for the four branches. Is there a difference
among the branches? Be sure to explain the difference
between the mean and the median in your report.
3. Determine the range and the standard deviation of
the checking account balances. What do the first and
third quartiles show? Determine the coefficient of
skewness and indicate what it shows. Because
Mr. Selig does not deal with statistics daily, include a
brief description and interpretation of the standard
deviation and other measures.
B. Wildcat Plumbing Supply Inc.:
Do We Have Gender Differences?
Wildcat Plumbing Supply has served the plumbing
needs of Southwest Arizona for more than 40 years.
The company was founded by Mr. Terrence St. Julian
and is run today by his son Cory. The company has
grown from a handful of employees to more than 500
today. Cory is concerned about several positions within
the company where he has men and women doing es-
sentially the same job but at different pay. To investi-
gate, he collected the information below. Suppose you
are a student intern in the Accounting Department and
have been given the task to write a report summarizing
the situation.
Yearly Salary ($000) Women Men
Less than 30 2 0
30 up to 40 3 1
40 up to 50 17 4
50 up to 60 17 24
60 up to 70 8 21
70 up to 80 3 7
80 or more 0 3
To kick off the project, Mr. Cory St. Julian held a meeting
with his staff and you were invited. At this meeting, it was
suggested that you calculate several measures of
location, create charts or draw graphs such as a cumula-
tive frequency distribution, and determine the quartiles
for both men and women. Develop the charts and write
the report summarizing the yearly salaries of employees
at Wildcat Plumbing Supply. Does it appear that there are
pay differences based on gender?
C. Kimble Products: Is There a Difference
In the Commissions?
At the January national sales meeting, the CEO of Kimble
Products was questioned extensively regarding the com-
pany policy for paying commissions to its sales represen-
tatives. The company sells sporting goods to two major
markets. There are 40 sales representatives who call di-
rectly on large-volume customers, such as the athletic de-
partments at major colleges and universities and
professional sports franchises. There are 30 sales repre-
sentatives who represent the company to retail stores lo-
cated in shopping malls and large discounters such as
Kmart and Target.
Upon his return to corporate headquarters, the CEO
asked the sales manager for a report comparing the com-
missions earned last year by the two parts of the sales
team. The information is reported below. Write a brief re-
port. Would you conclude that there is a difference? Be
sure to include information in the report on both the cen-
tral tendency and dispersion of the two groups.
Commissions Earned by Sales Representatives
Calling on Large Retailers ($)
1,116 681 1,294 12 754 1,206 1,448 870 944 1,255
1,213 1,291 719 934 1,313 1,083 899 850 886 1,556
886 1,315 1,858 1,262 1,338 1,066 807 1,244 758 918
Commissions Earned by Sales Representatives
Calling on Athletic Departments ($)
354 87 1,676 1,187 69 3,202 680 39 1,683 1,106
883 3,140 299 2,197 175 159 1,105 434 615 149
1,168 278 579 7 357 252 1,602 2,321 4 392
416 427 1,738 526 13 1,604 249 557 635 527
PRACTICE TEST
There is a practice test at the end of each review section. The
tests are in two parts. The first part contains several objec-
tive questions, usually in a fill-in-the-blank format. The second
part is problems. In most cases, it should take 30 to 45
minutes to complete the test. The problems require a calculator.
Check the answers in the Answer Section in the back of
the book.
Part 1—Objective
1. The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in
making effective decisions is called ________.
2. Methods of organizing, summarizing, and presenting data in an informative way are
called ________.
3. The entire set of individuals or objects of interest or the measurements obtained from all
individuals or objects of interest are called the ________.
4. List the two types of variables. ________
5. The number of bedrooms in a house is an example of a ________.
(pick one: discrete variable, continuous variable, qualitative variable)
6. The jersey numbers of Major League Baseball players are an example of what level of
measurement?
7. The classification of students by eye color is an example of what level of measurement?
8. The sum of the differences between each value and the mean is always equal to what value?
9. A set of data contained 70 observations. How many classes would the 2^k method suggest to
construct a frequency distribution?
10. What percent of the values in a data set are always larger than the median?
11. The square of the standard deviation is the ________.
12. The standard deviation assumes a negative value when ________.
(pick one: all the values are negative, at least half the values are negative, or never)
13. Which of the following is least affected by an outlier? (pick one: mean, median, or range)
Part 2—Problems
1. The Russell 2000 index of stock prices increased by the
following amounts over the last 3 years.
18% 4% 2%
What is the geometric mean increase for the 3 years?
2. The information below refers to the selling prices ($000) of
homes sold in Warren, Pennsylvania, during 2016.
Selling Price ($000) Frequency
120.0 up to 150.0 4
150.0 up to 180.0 18
180.0 up to 210.0 30
210.0 up to 240.0 20
240.0 up to 270.0 17
270.0 up to 300.0 10
300.0 up to 330.0 6
a. What is the class interval?
b. How many homes were sold in 2016?
c. How many homes sold for less than $210,000?
d. What is the relative frequency of the 210 up to 240 class?
e. What is the midpoint of the 150 up to 180 class?
f. The selling prices range between what two amounts?
3. A sample of eight college students revealed they owned the
following number of CDs.
52 76 64 79 80 74 66 69
a. What is the mean number of CDs owned?
b. What is the median number of CDs owned?
c. What is the 40th percentile?
d. What is the range of the number of CDs owned?
e. What is the standard deviation of the number of CDs owned?
4. An investor purchased 200 shares of the Blair Company for
$36 each in July of 2013, 300 shares
at $40 each in September 2015, and 500 shares at $50 each in
January 2016. What is the
investor’s weighted mean price per share?
5. During the 50th Super Bowl, 30 million pounds of snack
food were eaten. The chart below depicts
this information.
[Pie chart with the following slices]
Snack Nuts 8%
Potato Chips 37%
Tortilla Chips 28%
Pretzels 14%
Popcorn 13%
a. What is the name given to this graph?
b. Estimate, in millions of pounds, the amount of potato chips
eaten during the game.
c. Estimate the relationship of potato chips to popcorn. (twice
as much, half as much, three
times, none of these—pick one)
d. What percent of the total do potato chips and tortilla chips
comprise?
LEARNING OBJECTIVES
When you have completed this chapter, you will be able to:
LO2-1 Summarize qualitative variables with frequency and
relative frequency tables.
LO2-2 Display a frequency table using a bar or pie chart.
LO2-3 Summarize quantitative variables with frequency and
relative frequency distributions.
LO2-4 Display a frequency distribution using a histogram or
frequency polygon.
MERRILL LYNCH recently completed a study of online
investment portfolios for a sample
of clients. For the 70 participants in the study, organize these
data into a frequency
distribution. (See Exercise 43 and LO2-3.)
CHAPTER 2
Describing Data:
FREQUENCY TABLES, FREQUENCY DISTRIBUTIONS,
AND GRAPHIC PRESENTATION
INTRODUCTION
The United States automobile retailing industry is highly
competitive. It is dominated by
megadealerships that own and operate 50 or more franchises,
employ over 10,000
people, and generate several billion dollars in annual sales.
Many of the top dealerships
are publicly owned with shares traded on the New York Stock
Exchange
or NASDAQ. In 2014, the largest megadealership was
AutoNation (ticker
symbol AN), followed by Penske Auto Group (PAG), Group 1
Automotive,
Inc. (ticker symbol GPI), and the privately owned Van Tuyl
Group.
These large corporations use statistics and analytics to
summarize
and analyze data and information to support their decisions. As
an example, we will look at the Applewood Auto Group. It owns four
dealerships and sells a wide range of vehicles. These include the
popular
Korean brands Kia and Hyundai, BMW and Volvo sedans and
luxury
SUVs, and a full line of Ford and Chevrolet cars and trucks.
Ms. Kathryn Ball is a member of the senior management team at
Applewood Auto Group, which has its corporate offices
adjacent to Kane
Motors. She is responsible for tracking and analyzing vehicle
sales and
the profitability of those vehicles. Kathryn would like to
summarize the profit earned on
the vehicles sold with tables, charts, and graphs that she would
review monthly. She
wants to know the profit per vehicle sold, as well as the lowest
and highest amount of
profit. She is also interested in describing the demographics of
the buyers. What are
their ages? How many vehicles have they previously purchased
from one of the Apple-
wood dealerships? What type of vehicle did they purchase?
The Applewood Auto Group operates four dealerships:
• Tionesta Ford Lincoln sells Ford and Lincoln cars and trucks.
• Olean Automotive Inc. has the Nissan franchise as well as the
General Motors
brands of Chevrolet, Cadillac, and GMC Trucks.
• Sheffield Motors Inc. sells Buick, GMC trucks, Hyundai, and
Kia.
• Kane Motors offers the Chrysler, Dodge, and Jeep line as well
as BMW and Volvo.
Every month, Ms. Ball collects data from each of the four
dealerships
and enters them into an Excel spreadsheet. Last month the
Applewood
Auto Group sold 180 vehicles at the four dealerships. A copy of
the first
few observations appears to the left. The variables collected
include:
• Age—the age of the buyer at the time of the purchase.
• Profit—the amount earned by the dealership on the sale of
each
vehicle.
• Location—the dealership where the vehicle was purchased.
• Vehicle type—SUV, sedan, compact, hybrid, or truck.
• Previous—the number of vehicles previously purchased at any
of the
four Applewood dealerships by the consumer.
The entire data set is available at the McGraw-Hill website
(www.mhhe.com/lind17e) and in Appendix A.4 at the end of the text.
CONSTRUCTING FREQUENCY TABLES
Recall from Chapter 1 that techniques used to describe a set of
data are called descrip-
tive statistics. Descriptive statistics organize data to show the
general pattern of the
data, to identify where values tend to concentrate, and to expose
extreme or unusual
data values. The first technique we discuss is a frequency table.
LO2-1 Summarize qualitative variables with frequency and relative frequency tables.
FREQUENCY TABLE A grouping of qualitative data into
mutually exclusive and
collectively exhaustive classes showing the number of
observations in each class.
In Chapter 1, we distinguished between qualitative and
quantitative variables. To
review, a qualitative variable is nonnumeric, that is, it can only
be classified into distinct
categories. Examples of qualitative data include political
affiliation (Republican, Demo-
crat, Independent, or other), state of birth (Alabama, . . . ,
Wyoming), and method of
payment for a purchase at Barnes & Noble (cash, digital wallet,
debit, or credit). On the
other hand, quantitative variables are numerical in nature.
Examples of quantitative data
relating to college students include the price of their textbooks,
their age, and the num-
ber of credit hours they are registered for this semester.
In the Applewood Auto Group data set, there are five variables
for each vehicle
sale: age of the buyer, amount of profit, dealer that made the
sale, type of vehicle sold,
and number of previous purchases by the buyer. The dealer and
the type of vehicle are
qualitative variables. The amount of profit, the age of the buyer,
and the number of pre-
vious purchases are quantitative variables.
Suppose Ms. Ball wants to summarize last month’s sales by location. The
first step is to sort the vehicles sold last month according to their location and
then tally, or count, the number sold at each of the four locations:
Tionesta, Olean, Sheffield, or Kane. The four locations are used to develop a
frequency table with four mutually exclusive (distinctive) classes. Mutually
exclusive classes means that a particular vehicle can be assigned to only one class. In
addition, the frequency table must be collectively exhaustive. That is, every
vehicle sold last month is accounted for in the table. If every vehicle is included in the
frequency table, the table will be collectively exhaustive and the total number of
vehicles will be 180. How do we obtain these counts? Excel provides a tool
called a Pivot Table that will quickly and accurately establish the four classes and
do the counting. The Excel results follow in Table 2–1. The table shows a total of
180 vehicles and, of the 180 vehicles, 52 were sold at Kane Motors.
TABLE 2–1 Frequency Table for Vehicles Sold Last Month at
Applewood Auto Group by Location
Location Number of Cars
Kane 52
Olean 40
Sheffield 45
Tionesta 43
Total 180
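The tallying step can also be sketched outside Excel. The short Python example below is only an illustration, not the Pivot Table procedure. Because the raw 180-row data set is not reproduced here, the list sales_locations is rebuilt from the counts in Table 2–1; the variable names are our own.

```python
from collections import Counter

# One location label per vehicle sold, rebuilt from the counts in Table 2-1.
# (In practice this list would come from the Location column of the data set.)
sales_locations = (
    ["Kane"] * 52 + ["Olean"] * 40 + ["Sheffield"] * 45 + ["Tionesta"] * 43
)

# Each vehicle falls in exactly one class (mutually exclusive), and every
# vehicle is counted (collectively exhaustive).
frequency_table = Counter(sales_locations)
total = sum(frequency_table.values())

print("Location  Number of Cars")
for location in sorted(frequency_table):
    print(f"{location:<10}{frequency_table[location]:>4}")
print(f"{'Total':<10}{total:>4}")
```

If the total printed at the bottom did not equal the number of vehicles sold, some sale would be missing from the table or counted twice.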
Relative Class Frequencies
You can convert class frequencies to relative class frequencies
to show the fraction of the
total number of observations in each class. A relative frequency
captures the relationship
between a class frequency and the total number of observations.
In the vehicle sales ex-
ample, we may want to know the percentage of total cars sold at
each of the four locations.
To convert a frequency table to a relative frequency table, each
of the class frequencies is
divided by the total number of observations. Again, this is
easily accomplished using Excel.
The fraction of vehicles sold last month at the Kane location is
0.289, found by 52 divided
by 180. The relative frequency for each location is shown in
Table 2–2.
TABLE 2–2 Relative Frequency Table of Vehicles Sold by
Location Last Month at Applewood Auto Group
Location Number of Cars Relative Frequency Found by
Kane 52 .289 52/180
Olean 40 .222 40/180
Sheffield 45 .250 45/180
Tionesta 43 .239 43/180
Total 180 1.000
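The conversion from class frequencies to relative frequencies is a single division per class. A minimal Python sketch, using the counts from Table 2–1 (the variable names are our own), is shown below.

```python
# Class frequencies from Table 2-1.
class_frequencies = {"Kane": 52, "Olean": 40, "Sheffield": 45, "Tionesta": 43}

total = sum(class_frequencies.values())          # 180 vehicles

# Relative frequency = class frequency / total number of observations.
relative_frequencies = {
    location: count / total for location, count in class_frequencies.items()
}

for location, fraction in relative_frequencies.items():
    print(f"{location:<10}{class_frequencies[location]:>4}  {fraction:.3f}")
print(f"{'Total':<10}{total:>4}  {sum(relative_frequencies.values()):.3f}")
```

The relative frequencies must sum to 1, which is a quick check that no class was left out.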
GRAPHIC PRESENTATION
OF QUALITATIVE DATA
The most common graphic form to present a qualitative variable
is a bar chart. In most
cases, the horizontal axis shows the variable of interest. The
vertical axis shows the
frequency or fraction of each of the possible outcomes. A distinguishing feature of a bar
chart is that there is a distance, or gap, between the bars. That is, because the variable of
interest is qualitative, the bars are not adjacent to each other.
Thus, a bar chart graphically
describes a frequency table using a series of uniformly wide
rectangles, where the
height of each rectangle is the class frequency.
LO2-2
Display a frequency table
using a bar or pie chart.
BAR CHART A graph that shows qualitative classes on the
horizontal axis and the
class frequencies on the vertical axis. The class frequencies are
proportional to the
heights of the bars.
PIE CHART A chart that shows the proportion or percentage
that each class
represents of the total number of frequencies.
We use the Applewood Auto Group data as an example (Chart
2–1). The variables
of interest are the location where the vehicle was sold and the
number of vehicles sold
at each location. We label the horizontal axis with the four
locations and scale the verti-
cal axis with the number sold. The variable location is of
nominal scale, so the order of
the locations on the horizontal axis does not matter. In Chart 2–1,
the locations are
listed alphabetically. The locations could also be in order of
decreasing or increasing
frequencies.
The height of the bars, or rectangles, corresponds to the number
of vehicles at
each location. There were 52 vehicles sold last month at the
Kane location, so the
height of the Kane bar is 52; the height of the bar for the Olean
location is 40.
[Bar chart: vertical axis, Number of Vehicles Sold (0 to 50); horizontal axis,
Location (Kane, Olean, Sheffield, Tionesta)]
CHART 2–1 Number of Vehicles Sold by Location
Another useful type of chart for depicting qualitative
information is a pie chart.
We explain the details of constructing a pie chart using the
information in Table 2–3,
which shows the frequency and percent of cars sold by the
Applewood Auto Group for
each vehicle type.
The first step to develop a pie chart is to mark the percentages
0, 5, 10, 15, and so
on evenly around the circumference of a circle (see Chart 2–2).
To plot the 40% of total
sales represented by sedans, draw a line from the center of the
circle to 0 and another
line from the center of the circle to 40%. The area in this
“slice” represents the number
of sedans sold as a percentage of the total sales. Next, add the
SUV’s percentage of
total sales, 30%, to the sedan’s percentage of total sales, 40%.
The result is 70%. Draw
a line from the center of the circle to 70%, so the area between
40 and 70 shows the
sales of SUVs as a percentage of total sales. Continuing, add the
15% of total sales for
compact vehicles, which gives us a total of 85%. Draw a line
from the center of the circle
to 85, so the “slice” between 70% and 85% represents the
number of compact vehicles
sold as a percentage of the total sales. The remaining 10% for
truck sales and 5% for
hybrid sales are added to the chart using the same method.
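The running totals described above (40%, 70%, 85%, and so on) are cumulative percentages. As a rough sketch, the Python below computes where each slice starts and ends using the counts from Table 2–3. The function name pie_slices and the central angle column are our additions, on the assumption that a full circle of 360 degrees corresponds to 100%.

```python
# Number sold by vehicle type, from Table 2–3.
sold = {"Sedan": 72, "SUV": 54, "Compact": 27, "Truck": 18, "Hybrid": 9}


def pie_slices(counts):
    """Return (label, start %, end %, central angle in degrees) per slice."""
    total = sum(counts.values())
    slices = []
    start = 0.0
    for label, count in counts.items():
        percent = 100 * count / total
        end = start + percent  # running (cumulative) percentage
        slices.append((label, start, end, percent * 360 / 100))
        start = end
    return slices


slices = pie_slices(sold)
for label, start, end, angle in slices:
    print(f"{label:<8} from {start:>5.1f}% to {end:>5.1f}%  ({angle:.0f} degrees)")
```

The last slice should end at exactly 100%, which confirms that the classes account for every vehicle sold.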
TABLE 2–3 Vehicle Sales by Type at Applewood Auto Group
Vehicle Type Number Sold Percent Sold
Sedan 72 40
SUV 54 30
Compact 27 15
Truck 18 10
Hybrid 9 5
Total 180 100
[Pie chart: slices for Sedan (0% to 40%), SUV (40% to 70%), Compact (70% to 85%),
Truck (85% to 95%), and Hybrid (95% to 100%)]
CHART 2–2 Pie Chart of Vehicles by Type
Because each slice of the pie represents the relative frequency
of each vehicle
type as a percentage of the total sales, we can easily compare
them:
• The largest percentage of sales is for sedans.
• Sedans and SUVs together account for 70% of vehicle sales.
• Hybrids account for 5% of vehicle sales, in spite of being on
the market for only a
few years.
We can use Excel software to quickly count the number of cars
for each vehicle
type and create the frequency table, bar chart, and pie chart
shown in the following
summary. The Excel tool is called a Pivot Table. The
instructions to produce these de-
scriptive statistics and charts are given in Appendix C.
Pie and bar charts both serve to illustrate frequency and relative
frequency ta-
bles. When is a pie chart preferred to a bar chart? In most cases,
pie charts are used
to show and compare the relative differences in the percentage
of observations for
each value or class of a qualitative variable. Bar charts are
preferred when the goal is
to compare the number or frequency of observations for each
value or class of a
qualitative variable. The following Example/Solution shows another application of bar
and pie charts.
E X A M P L E
SkiLodges.com is test marketing its new website and is
interested in how easy its
website design is to navigate. It randomly selected 200 regular
Internet users and
asked them to perform a search task on the website. Each person
was asked to
rate the relative ease of navigation as poor, good, excellent, or
awesome. The re-
sults are shown in the following table:
Awesome 102
Excellent 58
Good 30
Poor 10
1. What type of measurement scale is used for ease of
navigation?
2. Draw a bar chart for the survey results.
3. Draw a pie chart for the survey results.
S O L U T I O N
The data are measured on an ordinal scale. That is, the scale is
ranked in relative
ease of navigation when moving from “awesome” to “poor.”
The interval between
each rating is unknown so it is impossible, for example, to
conclude that a rating of
good is twice the value of a poor rating.
We can use a bar chart to graph the data. The vertical scale
shows the
relative frequency and the horizontal scale shows the values of
the ease-of-navigation variable.
[Bar chart: Ease of Navigation of SkiLodges.com website; vertical axis, Relative
Frequency % (0 to 60); horizontal axis, Ease of Navigation (Poor, Good, Excellent, Awesome)]
A pie chart can also be used to graph these data. The pie chart
emphasizes that more
than half of the respondents rate the relative ease of using the
website awesome.
[Pie chart: Ease of Navigation of SkiLodges.com website; Awesome 51%, Excellent 29%,
Good 15%, Poor 5%]
S E L F - R E V I E W 2–1
The answers are in Appendix E.
DeCenzo Specialty Food and Beverage Company has been
serving a cola drink with
an additional flavoring, Cola-Plus, that is very popular among
its customers. The company
is interested in customer preferences for Cola-Plus versus Coca-Cola,
Pepsi, and a lemon-lime
beverage. They ask 100 randomly sampled customers to take a
taste test and select the
beverage they prefer most. The results are shown in the
following table:
Beverage Number
Cola-Plus 40
Coca-Cola 25
Pepsi 20
Lemon-Lime 15
Total 100
(a) Is the data qualitative or quantitative? Why?
(b) What is the table called? What does it show?
(c) Develop a bar chart to depict the information.
(d) Develop a pie chart using the relative frequencies.
E X E R C I S E S
The answers to the odd-numbered exercises are at the end of the
book in Appendix D.
1. A pie chart shows the relative market share of cola products.
The “slice” for Pepsi-Cola
has a central angle of 90 degrees. What is its market
share?
2. In a marketing study, 100 consumers were asked to select the
best digital music
player from the iPod, the iRiver, and the Magic Star MP3. To
summarize the consumer
responses with a frequency table, how many classes
would the frequency
table have?
3. A total of 1,000 residents in Minnesota were asked which
season they preferred.
One hundred liked winter best, 300 liked spring, 400 liked
summer, and 200 liked
fall. Develop a frequency table and a relative frequency table to
summarize this
information.
4. Two thousand frequent business travelers are asked which
midwestern city they
prefer: Indianapolis, Saint Louis, Chicago, or Milwaukee. One
hundred liked Indianapolis
best, 450 liked Saint Louis, 1,300 liked Chicago, and
the remainder preferred
Milwaukee. Develop a frequency table and a relative
frequency table to
summarize this information.
5. Wellstone Inc. produces and markets replacement covers for
cell phones in five
different colors: bright white, metallic black, magnetic lime,
tangerine orange, and
fusion red. To estimate the demand for each color, the company
set up a kiosk in
the Mall of America for several hours and asked randomly
selected people which
cover color was their favorite. The results follow:
Bright white 130
Metallic black 104
Magnetic lime 325
Tangerine orange 455
Fusion red 286
a. What is the table called?
b. Draw a bar chart for the table.
c. Draw a pie chart.
d. If Wellstone Inc. plans to produce 1 million cell phone
covers, how many of
each color should it produce?
6. A small business consultant is investigating the performance
of several companies.
The fourth-quarter sales for last year (in thousands of dollars)
for the selected com-
panies were:
Fourth-Quarter Sales
Company ($ thousands)
Hoden Building Products $ 1,645.2
J & R Printing Inc. 4,757.0
Long Bay Concrete Construction 8,913.0
Mancell Electric and Plumbing 627.1
Maxwell Heating and Air Conditioning 24,612.0
Mizelle Roofing & Sheet Metals 191.9
The consultant wants to include a chart in his report comparing
the sales of the six
companies. Use a bar chart to compare the fourth-quarter sales
of these corpora-
tions and write a brief report summarizing the bar chart.
CONSTRUCTING FREQUENCY DISTRIBUTIONS
In Chapter 1 and earlier in this chapter, we distinguished
between qualitative and quantitative
data. In the previous section, using the Applewood Auto
Group data, we summarized
two qualitative variables: the location of the sale and the type
of vehicle sold. We created
frequency and relative frequency tables and depicted the results
in bar and pie charts.
The Applewood Auto Group data also includes several
quantitative variables: the
age of the buyer, the profit earned on the sale of the vehicle,
and the number of previ-
ous purchases. Suppose Ms. Ball wants to summarize last
month’s sales by profit earned
for each vehicle. We can describe profit using a frequency
distribution.
LO2-3
Summarize quantitative
variables with frequency
and relative frequency
distributions.
FREQUENCY DISTRIBUTION A grouping of quantitative data
into mutually exclusive
and collectively exhaustive classes showing the number of
observations in each class.
How do we develop a frequency distribution? The following
example shows the steps to
construct a frequency distribution. Remember, our goal is to
construct tables, charts,
and graphs that will quickly summarize the data by showing the
location, extreme
values, and shape of the data’s distribution.
TABLE 2–4 Profit on Vehicles Sold Last Month by the
Applewood Auto Group
$1,387 $2,148 $2,201 $ 963 $ 820 $2,230 $3,043 $2,584 $2,370
1,754 2,207 996 1,298 1,266 2,341 1,059 2,666 2,637
1,817 2,252 2,813 1,410 1,741 3,292 1,674 2,991 1,426
1,040 1,428 323 1,553 1,772 1,108 1,807 934 2,944
1,273 1,889 352 1,648 1,932 1,295 2,056 2,063 2,147
1,529 1,166 482 2,071 2,350 1,344 2,236 2,083 1,973
3,082 1,320 1,144 2,116 2,422 1,906 2,928 2,856 2,502
1,951 2,265 1,485 1,500 2,446 1,952 1,269 2,989 783
2,692 1,323 1,509 1,549 369 2,070 1,717 910 1,538
1,206 1,760 1,638 2,348 978 2,454 1,797 1,536 2,339
1,342 1,919 1,961 2,498 1,238 1,606 1,955 1,957 2,700
443 2,357 2,127 294 1,818 1,680 2,199 2,240 2,222
754 2,866 2,430 1,115 1,824 1,827 2,482 2,695 2,597
1,621 732 1,704 1,124 1,907 1,915 2,701 1,325 2,742
870 1,464 1,876 1,532 1,938 2,084 3,210 2,250 1,837
1,174 1,626 2,010 1,688 1,940 2,639 377 2,279 2,842
1,412 1,762 2,165 1,822 2,197 842 1,220 2,626 2,434
1,809 1,915 2,231 1,897 2,646 1,963 1,401 1,501 1,640
2,415 2,119 2,389 2,445 1,461 2,059 2,175 1,752 1,821
1,546 1,766 335 2,886 1,731 2,338 1,118 2,058 2,487
E X A M P L E
Ms. Kathryn Ball of the Applewood Auto Group wants to
summarize the quantitative
variable profit with a frequency distribution and display the
distribution with charts
and graphs. With this information, Ms. Ball can easily answer
the following questions:
What is the typical profit on each sale? What is the
largest or maximum profit
on any sale? What is the smallest or minimum profit on any
sale? Around what value
do the profits tend to cluster?
S O L U T I O N
To begin, we need the profits for each of the 180 vehicle sales
listed in Table 2–4.
This information is called raw or ungrouped data because it is
simply a listing
of the individual, observed profits. It is possible to search the
list and find the
smallest or minimum profit ($294) and the largest or maximum
profit ($3,292), but
that is about all. It is difficult to determine a typical profit or to
visualize where the
profits tend to cluster. The raw data are more easily interpreted
if we summarize
the data with a frequency distribution. The steps to create this
frequency distribu-
tion follow.
Step 1: Decide on the number of classes. A useful recipe to determine the
number of classes (k) is the “2 to the k rule.” This guide suggests you
select the smallest number (k) for the number of classes such that 2^k
(in words, 2 raised to the power of k) is greater than the number of
observations (n). In the Applewood Auto Group example, there were
180 vehicles sold. So n = 180. If we try k = 7, which means we would
use 7 classes, 2^7 = 128, which is less than 180. Hence, 7 is too few
classes. If we let k = 8, then 2^8 = 256, which is greater than 180. So the
recommended number of classes is 8.
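The trial-and-error search in Step 1 is easy to automate. The following Python sketch is one possible way to apply the rule; the function name number_of_classes is our own and does not come from the text or from Excel.

```python
def number_of_classes(n):
    """Smallest k such that 2**k is greater than n (the '2 to the k rule')."""
    k = 1
    while 2 ** k <= n:  # 2**k must be strictly greater than n
        k += 1
    return k


k = number_of_classes(180)
print(k, 2 ** k)  # for the 180 Applewood vehicles
```

Because the rule requires 2^k to be greater than n, a data set of exactly 128 observations would also call for 8 classes, not 7.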
Step 2: Determine the class interval. Generally, the class
interval is the
same for all classes. The classes all taken together must cover at
least the distance from the minimum value in the data up to the
maximum value. Expressing these words in a formula:
i ≥ (Maximum Value − Minimum Value) / k
where i is the class interval, and k is the number of classes.
For the Applewood Auto Group, the minimum value is $294 and
the maximum value is $3,292. If we need 8 classes, the interval
should be:
i ≥ (Maximum Value − Minimum Value) / k = ($3,292 − $294) / 8 = $374.75
In practice, this interval size is usually rounded up to some
conve-
nient number, such as a multiple of 10 or 100. The value of
$400 is a
reasonable choice.
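Step 2 can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the function name class_interval and the round_to parameter (the "convenient number" to round up to) are hypothetical, and choosing that number remains a matter of judgment.

```python
import math


def class_interval(minimum, maximum, k, round_to=100):
    """Return the lower bound (max - min) / k and that bound rounded up
    to the next multiple of round_to."""
    raw = (maximum - minimum) / k
    return raw, math.ceil(raw / round_to) * round_to


raw, interval = class_interval(294, 3292, 8)
print(raw, interval)  # lower bound and the rounded-up interval
```

Rounding up, never down, guarantees that k classes of the chosen width still span the distance from the minimum to the maximum value.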
Step 3: Set the individual class limits. State clear class limits so
you can
put each observation into only one category. This means you
must
avoid overlapping or unclear class limits. For example, classes
such
as “$1,300–$1,400” and “$1,400–$1,500” should not be used
because it is not clear whether the value $1,400 is in the first
or
second class. In this text, we will generally use the format
$1,300
up to $1,400 and $1,400 up to $1,500 and so on. With this
format,
it is clear that $1,399 goes into the first class and $1,400 in the
second.
Because we always round the class interval up to get a conve-
nient class size, we cover a larger than necessary range. For ex-
ample, using 8 classes with an interval of $400 in the
Applewood
Auto Group example results in a range of 8($400) = $3,200. The
actual range is $2,998, found by ($3,292 − $294). Comparing
that
value to $3,200, we have an excess of $202. Because we need to
cover only the range (Maximum − Minimum), it is natural to put
ap-
proximately equal amounts of the excess in each of the two
tails.
Of course, we also should select convenient class limits. A
guide-
line is to make the lower limit of the first class a multiple of the
class interval. Sometimes this is not possible, but the lower
limit
should at least be rounded. So here are the classes we could use
for these data.
Classes
$ 200 up to $ 600
600 up to 1,000
1,000 up to 1,400
1,400 up to 1,800
1,800 up to 2,200
2,200 up to 2,600
2,600 up to 3,000
3,000 up to 3,400
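As a check on Step 3, the Python sketch below generates the eight classes from a lower limit of $200 and an interval of $400, then confirms that they cover the data and that the excess is $202. The function name class_limits is our own; each pair is read as "lower up to upper," so the upper limit itself belongs to the next class.

```python
def class_limits(lower, interval, k):
    """Return k (lower, upper) pairs; each class holds lower <= x < upper."""
    return [(lower + j * interval, lower + (j + 1) * interval) for j in range(k)]


classes = class_limits(200, 400, 8)
minimum, maximum = 294, 3292  # smallest and largest profit in Table 2–4

# The classes must reach from at or below the minimum to above the maximum.
covered = classes[0][0] <= minimum and maximum < classes[-1][1]
# Excess = range covered by the classes minus the actual range of the data.
excess = (classes[-1][1] - classes[0][0]) - (maximum - minimum)

for lo, hi in classes:
    print(f"{lo:>5,} up to {hi:>5,}")
print("covers the data:", covered, "| excess:", excess)
```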
Step 4: Tally the vehicle profit into the classes and determine
the number of
observations in each class. To begin, the profit from the sale of
the first
vehicle in Table 2–4 is $1,387. It is tallied in the $1,000 up to
$1,400
class. The second profit in the first row of Table 2–4 is $2,148.
It is tallied
in the $1,800 up to $2,200 class. The other profits are tallied in
a similar
manner. When all the profits are tallied, the table would appear
as:
Profit Frequency
$ 200 up to $ 600 |||| |||
600 up to 1,000 |||| |||| |
1,000 up to 1,400 |||| |||| |||| |||| |||
1,400 up to 1,800 |||| |||| |||| |||| |||| |||| |||| |||
1,800 up to 2,200 |||| |||| |||| |||| |||| |||| |||| |||| ||||
2,200 up to 2,600 |||| |||| |||| |||| |||| ||
2,600 up to 3,000 |||| |||| |||| ||||
3,000 up to 3,400 ||||
The number of observations in each class is called the class
frequency. In the $200 up to $600 class there are 8
observations,
and in the $600 up to $1,000 class there are 11 observations.
There-
fore, the class frequency in the first class is 8 and the class
frequency
in the second class is 11. There are a total of 180 observations
in the
entire set of data. So the sum of all the frequencies should be equal
to 180. The results of the frequency distribution are in Table 2–5.
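The tallying in Step 4 can be sketched in Python as follows. To keep the example short, it tallies only the nine profits in the first row of Table 2–4 rather than all 180; the function name tally is our own, and bisect_right from the standard library is simply one convenient way to find the class whose lower limit is at or below a value.

```python
from bisect import bisect_right

# Lower class limits from Step 3; the last class ends at $3,400.
lower_limits = [200, 600, 1000, 1400, 1800, 2200, 2600, 3000]
upper_limit = 3400


def tally(values):
    """Count how many values fall in each 'lower up to upper' class."""
    counts = [0] * len(lower_limits)
    for v in values:
        if not lower_limits[0] <= v < upper_limit:
            raise ValueError(f"{v} is outside the classes")
        # Index of the last lower limit that is <= v.
        counts[bisect_right(lower_limits, v) - 1] += 1
    return counts


# First row of Table 2–4 only (an illustrative subset, not the full data).
first_row = [1387, 2148, 2201, 963, 820, 2230, 3043, 2584, 2370]
counts = tally(first_row)
print(counts)
```

A value equal to a class limit, such as $1,400, lands in the class that starts at that limit, which matches the "up to" convention used in the text.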
TABLE 2–5 Frequency Distribution of Profit for Vehicles Sold
Last Month at Applewood Auto Group
Profit Frequency
$ 200 up to $ 600 8
600 up to 1,000 11
1,000 up to 1,400 23
1,400 up to 1,800 38
1,800 up to 2,200 45
2,200 up to 2,600 32
2,600 up to 3,000 19
3,000 up to 3,400 4
Total 180
Now that we have organized the data into a frequency
distribution (see Table 2–5),
we can summarize the profits of the vehicles for the Applewood
Auto Group.
Observe the following:
1. The profits from vehicle sales range between $200 and
$3,400.
2. The vehicle profits are classified using a class interval of
$400. The class interval
is determined by subtracting consecutive lower or upper
class limits. For
example, the lower limit of the first class is $200, and the lower
limit of the
second class is $600. The difference is the class interval of
$400.
3. The profits are concentrated between $1,000 and $3,000. The
profit on 157
vehicles, or 87%, was within this range.
4. For each class, we can determine the typical profit or class
midpoint. It is half-
way between the lower or upper limits of two consecutive
classes. It is com-
puted by adding the lower or upper limits of consecutive classes
and dividing
by 2. Referring to Table 2–5, the lower class limit of the first
class is $200, and
the next class limit is $600. The class midpoint is $400, found
by ($600 +
$200)/2. The midpoint best represents, or is typical of, the
profits of the vehi-
cles in that class. Applewood sold 8 vehicles with a typical
profit of $400.
5. The largest concentration, or highest frequency, of vehicles
sold is in the $1,800 up
to $2,200 class. There are 45 vehicles in this class. The class
midpoint is $2,000.
So we say that the typical profit in the class with the highest
frequency is $2,000.
By presenting this information to Ms. Ball, we give her a clear
picture of the distribu-
tion of the vehicle profits for last month.
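Several of the observations above can be recomputed directly from Table 2–5. The Python sketch below is illustrative only and the variable names are our own; it finds the class midpoints, the class with the highest frequency, and the share of vehicles with a profit of $1,000 up to $3,000.

```python
# Classes and frequencies from Table 2–5.
classes = [(200, 600), (600, 1000), (1000, 1400), (1400, 1800),
           (1800, 2200), (2200, 2600), (2600, 3000), (3000, 3400)]
freqs = [8, 11, 23, 38, 45, 32, 19, 4]
total = sum(freqs)

# Class midpoint: halfway between consecutive class limits.
midpoints = [(lo + hi) / 2 for lo, hi in classes]

# Class with the highest frequency.
modal_index = freqs.index(max(freqs))

# Vehicles in the classes from $1,000 up to $3,000.
in_range = sum(f for (lo, hi), f in zip(classes, freqs) if lo >= 1000 and hi <= 3000)
share = in_range / total

print("midpoint of first class:", midpoints[0])
print("highest-frequency class:", classes[modal_index], "midpoint", midpoints[modal_index])
print(f"{in_range} of {total} vehicles ({share:.0%}) between $1,000 and $3,000")
```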
We admit that arranging the information on profits into a
frequency distribution
does result in the loss of some detailed information. That is, by
organizing the data
into a frequency distribution, we cannot pinpoint the exact
profit on any vehicle,
such as $1,387, $2,148, or $2,201. Further, we cannot tell that
the actual minimum
profit for any vehicle sold is $294 or that the maximum profit
was $3,292. However,
the lower limit of the first class and the upper limit of the last
class convey essen-
tially the same meaning. Likely, Ms. Ball will make the same
judgment if she knows
the smallest profit is about $200 that she will if she knows the
exact profit is $294.
The advantages of summarizing the 180 profits into a more
understandable and
organized form more than offset this disadvantage.
TABLE 2–6 Adjusted Gross Income for Individuals Filing
Income Tax Returns
Adjusted Gross Income    Number of Returns (in thousands)
No adjusted gross income 178.2
$ 1 up to 5,000 1,204.6
5,000 up to 10,000 2,595.5
10,000 up to 15,000 3,142.0
15,000 up to 20,000 3,191.7
20,000 up to 25,000 2,501.4
25,000 up to 30,000 1,901.6
30,000 up to 40,000 2,502.3
40,000 up to 50,000 1,426.8
50,000 up to 75,000 1,476.3
75,000 up to 100,000 338.8
100,000 up to 200,000 223.3
200,000 up to 500,000 55.2
500,000 up to 1,000,000 12.0
1,000,000 up to 2,000,000 5.1
2,000,000 up to 10,000,000 3.4
10,000,000 or more 0.6
When we summarize raw data with frequency distributions,
equal class intervals are pre-
ferred. However, in certain situations unequal class intervals
may be necessary to avoid a
large number of classes with very small frequencies. Such is the
case in Table 2–6. The
U.S. Internal Revenue Service uses unequal-sized class intervals
for adjusted gross
income on individual tax returns to summarize the number of
individual tax returns. If
we use our method to find equal class intervals, the 2^k rule
results in 25 classes, and
a class interval of $400,000, assuming $0 and $10,000,000 as
the minimum and maximum
values for adjusted gross income. Using equal class intervals,
the first 13 classes in Table 2–6
would be combined into one class of about 99.9% of all tax
returns and 24 classes for the
0.1% of the returns with an adjusted gross income above
$400,000. Using equal class intervals
does not provide a good understanding of the raw data. In
this case, good judgment in
the use of unequal class intervals, as demonstrated in Table 2–6,
is required to show the
distribution of the number of tax returns filed, especially for
incomes under $500,000.
STATISTICS IN ACTION
In 1788, James Madison,
John Jay, and Alexander
Hamilton anonymously
published a series of essays
entitled The Federalist.
These Federalist papers
were an attempt to convince
the people of New York
that they should ratify the
Constitution. In the course
of history, the authorship
of most of these papers
became known, but 12 remained
contested. Through
the use of statistical analysis,
and particularly studying
the frequency distributions
of various words, we can
now conclude that James
Madison is the likely author
of the 12 papers. In fact,
the statistical evidence that
Madison is the author is
overwhelming.
S E L F - R E V I E W 2–2
In the first quarter of last year, the 11 members of the sales
staff at Master Chemical Company
earned the following commissions:
$1,650 $1,475 $1,510 $1,670 $1,595 $1,760 $1,540 $1,495
$1,590 $1,625 $1,510
(a) What are the values such as $1,650 and $1,475 called?
(b) Using $1,400 up to $1,500 as the first class, $1,500 up to
$1,600 as the second class,
and so forth, organize the quarterly commissions into a
frequency distribution.
(c) What are the numbers in the right column of your frequency
distribution called?
(d) Describe the distribution of quarterly commissions, based on
the frequency distribution.
What is the largest concentration of commissions earned?
What is the smallest,
and the largest? What is the typical amount earned?
Relative Frequency Distribution
It may be desirable, as we did earlier with qualitative data, to
convert class frequencies
to relative class frequencies to show the proportion of the total
number of observations
in each class. In our vehicle profits, we may want to know what
percentage of the vehicle
profits are in the $1,000 up to $1,400 class. To convert a
frequency distribution to a
relative frequency distribution, each of the class frequencies is
divided by the total number
of observations. From the distribution of vehicle profits,
Table 2–5, the relative frequency
for the $1,000 up to $1,400 class is 0.128, found by
dividing 23 by 180. That
is, profit on 12.8% of the vehicles sold is between $1,000 and
$1,400. The relative frequencies
for the remaining classes are shown in Table 2–7.
TABLE 2–7 Relative Frequency Distribution of Profit for
Vehicles Sold Last Month at Applewood Auto Group
Profit Frequency Relative Frequency Found by
$ 200 up to $ 600 8 .044 8/180
600 up to 1,000 11 .061 11/180
1,000 up to 1,400 23 .128 23/180
1,400 up to 1,800 38 .211 38/180
1,800 up to 2,200 45 .250 45/180
2,200 up to 2,600 32 .178 32/180
2,600 up to 3,000 19 .106 19/180
3,000 up to 3,400 4 .022 4/180
Total 180 1.000
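The relative frequencies in Table 2–7 can be reproduced with a short calculation. The Python sketch below is an illustration, not the Excel procedure referred to in the text; the variable names are our own, and values are rounded to three decimal places as in the table.

```python
# Class labels and frequencies from Table 2–5.
labels = ["200 up to 600", "600 up to 1,000", "1,000 up to 1,400",
          "1,400 up to 1,800", "1,800 up to 2,200", "2,200 up to 2,600",
          "2,600 up to 3,000", "3,000 up to 3,400"]
freqs = [8, 11, 23, 38, 45, 32, 19, 4]
total = sum(freqs)

# Relative frequency = class frequency / total number of observations.
rel_freqs = [round(f / total, 3) for f in freqs]

for label, f, r in zip(labels, freqs, rel_freqs):
    print(f"{label:<20}{f:>5}{r:>8.3f}   found by {f}/{total}")
print(f"{'Total':<20}{total:>5}{sum(rel_freqs):>8.3f}")
```

Multiplying a relative frequency by 100 gives the percentage of vehicles in that class, for example 12.8% for the $1,000 up to $1,400 class.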
There are many software packages that perform
statistical calculations. Throughout this text, we will
show the output from Microsoft Excel, MegaStat (a
Microsoft Excel add-in), and Minitab (a statistical
software package). Because Excel is most readily
available, it is used most frequently.
Within the earlier Graphic Presentation of
Qualitative Data section, we used the Pivot Table
tool in Excel to create a frequency table. To create
the table to the left, we use the same Excel tool to
compute frequency and relative frequency distributions for the
profit variable in the
Applewood Auto Group data. The necessary steps are given in
the Software Commands
section in Appendix C.
S E L F - R E V I E W 2–3
Barry Bonds of the San Francisco Giants established a new
single-season Major League
Baseball home run record by hitting 73 home runs during the
2001 season. Listed below is
the sorted distance of each of the 73 home runs.
320 320 347 350 360 360 360 361 365 370
370 375 375 375 375 380 380 380 380 380
380 390 390 391 394 396 400 400 400 400
405 410 410 410 410 410 410 410 410 410
410 410 411 415 415 416 417 417 420 420
420 420 420 420 420 420 429 430 430 430
430 430 435 435 436 440 440 440 440 440
450 480 488
(a) For this data, show that seven classes would be used to
create a frequency distribution
using the 2^k rule.
(b) Show that a class interval of 30 would summarize the data in
seven classes.
(c) Construct frequency and relative frequency distributions for
the data with
seven classes and a class interval of 30. Start the first class with
a lower limit
of 300.
(d) How many home runs traveled a distance of 360 up to 390
feet?
(e) What percentage of the home runs traveled a distance of 360
up to 390 feet?
(f) What percentage of the home runs traveled a distance of 390
feet or more?
E X E R C I S E S
This icon indicates that
the data are available at the text
website: www.mhhe.com/Lind17e.
You will be able to
download the data directly into
Excel or Minitab from this site.
7. A set of data consists of 38 observations. How many classes
would you recommend
for the frequency distribution?
8. A set of data consists of 45 observations between $0 and
$29. What size would
you recommend for the class interval?
9. A set of data consists of 230 observations between $235 and
$567. What class
interval would you recommend?
10. A set of data contains 53 observations. The minimum value
is 42 and the maximum
value is 129. The data are to be organized into a frequency
distribution.
a. How many classes would you suggest?
b. What would you suggest as the lower limit of the first class?
11. Wachesaw Manufacturing Inc. produced the following
number of units in the
last 16 days.
27 27 27 28 27 25 25 28
26 28 26 28 31 30 26 26
The information is to be organized into a frequency distribution.
a. How many classes would you recommend?
b. What class interval would you suggest?
c. What lower limit would you recommend for the first class?
d. Organize the information into a frequency distribution and
determine the relative
frequency distribution.
e. Comment on the shape of the distribution.
12. The Quick Change Oil Company has a number of outlets in
the metropolitan Seattle
area. The daily number of oil changes at the Oak Street
outlet in the past 20 days are:
65 98 55 62 79 59 51 90 72 56
70 62 66 80 94 79 63 73 71 85
The data are to be organized into a frequency distribution.
a. How many classes would you recommend?
b. What class interval would you suggest?
c. What lower limit would you recommend for the first class?
d. Organize the number of oil changes into a frequency
distribution.
e. Comment on the shape of the frequency distribution. Also
determine the relative
frequency distribution.
13. The manager of the BiLo Supermarket in Mt. Pleasant,
Rhode Island, gathered
the following information on the number of times a customer
visits the store during
a month. The responses of 51 customers were:
5 3 3 1 4 4 5 6 4 2 6 6 6 7 1
1 14 1 2 4 4 4 5 6 3 5 3 4 5 6
8 4 7 6 5 9 11 3 12 4 7 6 5 15 1
1 10 8 9 2 12
a. Starting with 0 as the lower limit of the first class and using
a class interval of 3,
organize the data into a frequency distribution.
b. Describe the distribution. Where do the data tend to cluster?
c. Convert the distribution to a relative frequency distribution.
14. The food services division of Cedar River Amusement Park
Inc. is studying the
amount of money spent per day on food and drink by families
who visit the amuse-
ment park. A sample of 40 families who visited the park
yesterday revealed they
spent the following amounts:
$77 $18 $63 $84 $38 $54 $50 $59 $54 $56 $36 $26 $50 $34 $44
41 58 58 53 51 62 43 52 53 63 62 62 65 61 52
60 60 45 66 83 71 63 58 61 71
a. Organize the data into a frequency distribution, using seven
classes and 15 as
the lower limit of the first class. What class interval did you
select?
b. Where do the data tend to cluster?
c. Describe the distribution.
d. Determine the relative frequency distribution.
GRAPHIC PRESENTATION OF A DISTRIBUTION
Sales managers, stock analysts, hospital administrators, and
other busy executives of-
ten need a quick picture of the distributions of sales, stock
prices, or hospital costs.
These distributions can often be depicted by the use of charts
and graphs. Three charts
that will help portray a frequency distribution graphically are
the histogram, the fre-
quency polygon, and the cumulative frequency polygon.
Histogram
LO2-4
Display a distribution
using a histogram or
frequency polygon.
A histogram for a frequency distribution based on quantitative
data is similar to the
bar chart showing the distribution of qualitative data. The
classes are marked on the
horizontal axis and the class frequencies on the vertical axis.
The class frequencies
are represented by the heights of the bars. However, there is one
important differ-
ence based on the nature of the data. Quantitative data are
usually measured using
scales that are continuous, not discrete. Therefore, the
horizontal axis represents all
possible values, and the bars are drawn adjacent to each other to
show the continu-
ous nature of the data.
HISTOGRAM A graph in which the classes are marked on the
horizontal axis and
the class frequencies on the vertical axis. The class frequencies
are represented by
the heights of the bars, and the bars are drawn adjacent to each
other.
E X A M P L E
Below is the frequency distribution of the profits on vehicle
sales last month at the
Applewood Auto Group.
Profit Frequency
$ 200 up to $ 600 8
600 up to 1,000 11
1,000 up to 1,400 23
1,400 up to 1,800 38
1,800 up to 2,200 45
2,200 up to 2,600 32
2,600 up to 3,000 19
3,000 up to 3,400 4
Total 180
Construct a histogram. What observations can you reach based
on the information
presented in the histogram?
S O L U T I O N
The class frequencies are scaled along the vertical axis (Y-axis)
and either the class
limits or the class midpoints along the horizontal axis. To
illustrate the construction
of the histogram, the first three classes are shown in Chart 2–3.
[Histogram, first three classes: vertical axis, Number of Vehicles (class frequency);
horizontal axis, Profit $ (200, 600, 1,000, 1,400); bar heights 8, 11, and 23]
CHART 2–3 Construction of a Histogram
From Chart 2–3 we note the profit on eight vehicles was $200
up to $600. There-
fore, the height of the column for that class is 8. There are 11
vehicle sales where
the profit was $600 up to $1,000. So, logically, the height of
that column is 11. The
height of the bar represents the number of observations in the
class.
This procedure is continued for all classes. The complete
histogram is shown in
Chart 2–4. Note that there is no space between the bars. This is
a feature of the
histogram. Why is this so? Because the variable profit, plotted
on the horizontal
axis, is a continuous variable. In a bar chart, the scale of
measurement is usually
nominal and the vertical bars are separated. This is an important
distinction be-
tween the histogram and the bar chart.
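The construction can also be sketched in a few lines of Python. This is a rough text rendering only, assuming the Applewood frequency distribution from the example; the function name `text_histogram` is hypothetical, and the bars are drawn horizontally rather than vertically as in Chart 2–4.

```python
# Applewood Auto Group profit classes (lower limit, upper limit, frequency),
# copied from the frequency distribution in the example.
classes = [
    (200, 600, 8), (600, 1000, 11), (1000, 1400, 23), (1400, 1800, 38),
    (1800, 2200, 45), (2200, 2600, 32), (2600, 3000, 19), (3000, 3400, 4),
]

def text_histogram(rows, symbol="#"):
    """Return one line per class; the bar length equals the class frequency."""
    lines = []
    for lower, upper, freq in rows:
        label = f"{lower:>5,} up to {upper:>5,}"
        lines.append(f"{label} | {symbol * freq} {freq}")
    return lines

# Consecutive classes share a limit, so the bars sit next to each other
# with no gap between them.
for line in text_histogram(classes):
    print(line)
```

Each bar is exactly as long as its class frequency, so the longest bar belongs to the $1,800 up to $2,200 class.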
We can make the following statements using Chart 2–4. They
are the same as the
observations based on Table 2–5.
1. The profits from vehicle sales range between $200 and
$3,400.
2. The vehicle profits are classified using a class interval of
$400. The class inter-
val is determined by subtracting consecutive lower or upper
class limits. For
example, the lower limit of the first class is $200, and the lower
limit of the
second class is $600. The difference is the class interval or
$400.
3. The profits are concentrated between $1,000 and $3,000. The
profit on 157
vehicles, or 87%, was within this range.
4. For each class, we can determine the typical profit or class
midpoint. It is halfway
between the lower or upper limits of two consecutive classes. It
is computed by
adding the lower or upper limits of consecutive classes and
dividing by 2. Refer-
ring to Chart 2–4, the lower class limit of the first class is $200,
and the next class
limit is $600. The class midpoint is $400, found by ($600 +
$200)/2. The mid-
point best represents, or is typical of, the profits of the vehicles
in that class.
Applewood sold 8 vehicles with a typical profit of $400.
5. The largest concentration, or highest frequency of vehicles
sold, is in the $1,800 up
to $2,200 class. There are 45 vehicles in this class. The class
midpoint is $2,000.
So we say that the typical profit in the class with the highest
frequency is $2,000.
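The five statements above can be checked with a short calculation. The sketch below assumes only the Applewood frequency distribution; the variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each profit class.
classes = [
    (200, 600, 8), (600, 1000, 11), (1000, 1400, 23), (1400, 1800, 38),
    (1800, 2200, 45), (2200, 2600, 32), (2600, 3000, 19), (3000, 3400, 4),
]

total = sum(freq for _, _, freq in classes)

# Statement 1: range covered by the classes.
lowest, highest = classes[0][0], classes[-1][1]

# Statement 2: class interval = difference of consecutive lower limits.
class_interval = classes[1][0] - classes[0][0]

# Statement 3: vehicles with a profit of $1,000 up to $3,000.
in_range = sum(f for lo, up, f in classes if lo >= 1000 and up <= 3000)
share = in_range / total

# Statement 4: class midpoints.
midpoints = [(lo + up) / 2 for lo, up, _ in classes]

# Statement 5: class with the highest frequency and its midpoint.
modal = max(classes, key=lambda row: row[2])
modal_midpoint = (modal[0] + modal[1]) / 2

print(lowest, highest, class_interval)
print(in_range, f"{share:.0%}")
print(midpoints[0], modal, modal_midpoint)
```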
Thus, the histogram provides an easily interpreted visual
representation of a
frequency distribution. We should also point out that we would
have made the
same observations and the shape of the histogram would have
been the same had
we used a relative frequency distribution instead of the actual
frequencies. That is,
if we use the relative frequencies of Table 2–7, the result is a
histogram of the same
shape as Chart 2–4. The only difference is that the vertical axis
would have been
reported in percentage of vehicles instead of the number of
vehicles. The Excel
commands to create Chart 2–4 are given in Appendix C.
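A brief sketch shows why the shape does not change. It assumes the Applewood class frequencies and computes the relative frequencies directly rather than reading them from Table 2–7. Dividing every frequency by the same total, 180, rescales all bars by the same factor.

```python
frequencies = [8, 11, 23, 38, 45, 32, 19, 4]
total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

# Height of each bar as a fraction of the tallest bar, on both scales.
shape_counts = [f / max(frequencies) for f in frequencies]
shape_relative = [r / max(relative) for r in relative]

for f, r in zip(frequencies, relative):
    print(f"{f:>3}  {r:6.1%}")
```

The two `shape_` lists agree, which is another way of saying that the two histograms differ only in the scale of the vertical axis.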
CHART 2–4 Histogram of the Profit on 180 Vehicles Sold at the
Applewood Auto Group
[Chart: horizontal axis Profit, with classes 200–600 through 3,000–3,400; vertical axis Frequency; bar heights 8, 11, 23, 38, 45, 32, 19, and 4]
Frequency Polygon
A frequency polygon also shows the shape of a distribution and
is similar to a histo-
gram. It consists of line segments connecting the points formed
by the intersections of
the class midpoints and the class frequencies. The construction
of a frequency polygon
is illustrated in Chart 2–5. We use the profits from the cars sold
last month at the Apple-
wood Auto Group. The midpoint of each class is scaled on the
X-axis and the class
frequencies on the Y-axis. Recall that the class midpoint is the
value at the center of a
class and represents the typical values in that class. The class
frequency is the number
of observations in a particular class. The profit earned on the
vehicles sold last month
by the Applewood Auto Group is repeated below.
STATISTICS IN ACTION
Florence Nightingale is known as the founder of the nursing
profession. However, she also saved many lives by using
statistical analysis. When she encountered an unsanitary
condition or an undersupplied hospital, she improved the
conditions and then used statistical data to document the
improvement. Thus, she was able to convince others of the need
for medical reform, particularly in the area of sanitation. She
developed original graphs to demonstrate that, during the
Crimean War, more soldiers died from unsanitary conditions than
were killed in combat.
CHART 2–5 Frequency Polygon of Profit on 180 Vehicles Sold
at Applewood Auto Group
[Chart: horizontal axis Profit $ (0 to 3,600, marked at the class midpoints); vertical axis Frequency (0 to 48)]
As noted previously, the $200 up to $600 class is represented by
the midpoint
$400. To construct a frequency polygon, move horizontally on
the graph to the mid-
point, $400, and then vertically to 8, the class frequency, and
place a dot. The x and
the y values of this point are called the coordinates. The
coordinates of the next point
are x = 800 and y = 11. The process is continued for all classes.
Then the points are
connected in order. That is, the point representing the lowest
class is joined to the
one representing the second class and so on. Note in Chart 2–5
that, to complete
the frequency polygon, midpoints of $0 and $3,600 are added
to the X-axis to “anchor”
the polygon at zero frequencies. These two values, $0 and
$3,600, were derived by
subtracting the class interval of $400 from the lowest midpoint
($400) and by adding
$400 to the highest midpoint ($3,200) in the frequency
distribution.
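The points of the polygon, including the two anchor points, can be listed with a short sketch. It assumes the Applewood class midpoints and frequencies; the variable names are illustrative.

```python
midpoints = [400, 800, 1200, 1600, 2000, 2400, 2800, 3200]
frequencies = [8, 11, 23, 38, 45, 32, 19, 4]
class_interval = 400

# One (x, y) coordinate per class: x = class midpoint, y = class frequency.
points = list(zip(midpoints, frequencies))

# Anchor the polygon at zero frequency one class interval below the
# lowest midpoint and one class interval above the highest midpoint.
points = (
    [(midpoints[0] - class_interval, 0)]
    + points
    + [(midpoints[-1] + class_interval, 0)]
)

for x, y in points:
    print(f"x = {x:>5,}  y = {y}")
```

Connecting these ten points in order reproduces the outline described for Chart 2–5.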
Profit                Midpoint   Frequency
$  200 up to $  600    $  400        8
   600 up to  1,000       800       11
 1,000 up to  1,400     1,200       23
 1,400 up to  1,800     1,600       38
 1,800 up to  2,200     2,000       45
 2,200 up to  2,600     2,400       32
 2,600 up to  3,000     2,800       19
 3,000 up to  3,400     3,200        4
Total                              180
Both the histogram and the frequency polygon allow us to get a
quick picture of the main characteristics of the data (highs,
lows, points of concentration, etc.). Although the two
representations are similar in purpose, the histogram has the
advantage of depicting each class as a rectangle, with the height
of the rectangular bar representing the number in each class. The
frequency polygon, in turn, has an advantage over the histogram.
It allows us to compare directly two or more frequency
distributions. Suppose Ms. Ball wants to compare the profit per
vehicle sold at Applewood Auto Group with a similar auto group,
Fowler Motors in Grayling, Michigan. To do this, two frequency
polygons are constructed, one on top of the other, as in
Chart 2–6.
CHART 2–6 Distribution of Profit at Applewood Auto Group
and Fowler Motors
[Chart: two frequency polygons, one for Applewood and one for Fowler Motors; horizontal axis Profit $ (0 to 3,600); vertical axis Frequency (0 to 56)]
Two things are clear from the chart:
• The typical vehicle profit is larger at Fowler Motors—about
$2,000 for Applewood
and about $2,400 for Fowler.
• There is less variation or dispersion in the profits at Fowler
Motors than at Apple-
wood. The lower limit of the first class for Applewood is $0 and
the upper limit is
$3,600. For Fowler Motors, the lower limit is $800 and the
upper limit is the
same: $3,600.
The total number of cars sold at the two dealerships is about the
same, so a direct
comparison is possible. If the difference in the total number of
cars sold is large, then
converting the frequencies to relative frequencies and then
plotting the two distribu-
tions would allow a clearer comparison.
S E L F - R E V I E W 2–4
The annual imports of a selected group of electronic suppliers
are shown in the following frequency distribution.
Imports ($ millions)   Number of Suppliers
 2 up to  5                  6
 5 up to  8                 13
 8 up to 11                 20
11 up to 14                 10
14 up to 17                  1
(a) Portray the imports as a histogram.
(b) Portray the imports as a relative frequency polygon.
(c) Summarize the important facets of the distribution (such as
classes with the highest and lowest frequencies).
E X E R C I S E S
15. Molly’s Candle Shop has several retail stores in the coastal
areas of North and South Carolina. Many of Molly’s customers ask
her to ship their purchases. The following chart shows the number
of packages shipped per day for the last 100 days. For example,
the first class shows that there were 5 days when the number of
packages shipped was 0 up to 5.
[Chart: horizontal axis Number of Packages (0 to 35); vertical axis Frequency (0 to 30); bar labels legible in the source: 13, 28, 23, 18, 10]
a. What is this chart called?
b. What is the total number of packages shipped?
c. What is the class interval?
d. What is the number of packages shipped in the 10 up to 15
class?
e. What is the relative frequency of packages shipped in the 10
up to 15 class?
f. What is the midpoint of the 10 up to 15 class?
g. On how many days were there 25 or more packages shipped?
16. The following chart shows the number of patients admitted
daily to Memorial Hospital through the emergency room.
[Chart: horizontal axis Number of Patients (2 to 12); vertical axis Frequency (0 to 30)]
a. What is the midpoint of the 2 up to 4 class?
b. How many days were 2 up to 4 patients admitted?
c. What is the class interval?
d. What is this chart called?
17. The following frequency distribution reports the number of
frequent flier miles, reported in thousands, for employees of
Brumley Statistical Consulting Inc. during the most recent
quarter.
Frequent Flier Miles (000)   Number of Employees
 0 up to  3                        5
 3 up to  6                       12
 6 up to  9                       23
 9 up to 12                        8
12 up to 15                        2
Total                             50
a. How many employees were studied?
b. What is the midpoint of the first class?
c. Construct a histogram.
d. A frequency polygon is to be drawn. What are the
coordinates of the plot for the first class?
e. Construct a frequency polygon.
f. Interpret the frequent flier miles accumulated using the two
charts.
18. A large Internet retailer is studying the lead time (elapsed
time between when an order is placed and when it is filled) for a
sample of recent orders. The lead times are reported in days.
Lead Time (days)   Frequency
 0 up to  5             6
 5 up to 10             7
10 up to 15            12
15 up to 20             8
20 up to 25             7
Total                  40
a. How many orders were studied?
b. What is the midpoint of the first class?
c. What are the coordinates of the first class for a frequency
polygon?
d. Draw a histogram.
e. Draw a frequency polygon.
f. Interpret the lead times using the two charts.
Cumulative Distributions
Consider once again the distribution of the profits on vehicles
sold by the Applewood Auto Group. Suppose we were interested in
the number of vehicles that sold for a profit of less than
$1,400. These values can be approximated by developing a
cumulative frequency distribution and portraying it graphically
in a cumulative frequency polygon. Or, suppose we were interested
in the profit earned on the lowest-selling 40% of the vehicles.
These values can be approximated by developing a cumulative
relative frequency distribution and portraying it graphically in
a cumulative relative frequency polygon.
E X A M P L E
The frequency distribution of the profits earned at Applewood
Auto Group is
repeated from Table 2–5.
Profit Frequency
$ 200 up to $ 600 8
600 up to 1,000 11
1,000 up to 1,400 23
1,400 up to 1,800 38
1,800 up to 2,200 45
2,200 up to 2,600 32
2,600 up to 3,000 19
3,000 up to 3,400 4
Total 180
Construct a cumulative frequency polygon to answer the
following question: sixty
of the vehicles earned a profit of less than what amount?
Construct a cumulative
relative frequency polygon to answer this question: seventy-five
percent of the
vehicles sold earned a profit of less than what amount?
S O L U T I O N
As the names imply, a cumulative frequency distribution and a
cumulative fre-
quency polygon require cumulative frequencies. To construct a
cumulative fre-
quency distribution, refer to the preceding table and note that
there were eight
vehicles in which the profit earned was less than $600. Those 8
vehicles, plus
the 11 in the next higher class, for a total of 19, earned a profit
of less than $1,000.
The cumulative frequency for the next higher class is 42, found
by 8 + 11 + 23.
This process is continued for all the classes. All the vehicles
earned a profit of less
than $3,400. (See Table 2–8.)
TABLE 2–8 Cumulative Frequency Distribution for Profit on
Vehicles Sold Last Month at Applewood Auto Group
Profit              Cumulative Frequency   Found by
Less than $  600            8              8
Less than  1,000           19              8 + 11
Less than  1,400           42              8 + 11 + 23
Less than  1,800           80              8 + 11 + 23 + 38
Less than  2,200          125              8 + 11 + 23 + 38 + 45
Less than  2,600          157              8 + 11 + 23 + 38 + 45 + 32
Less than  3,000          176              8 + 11 + 23 + 38 + 45 + 32 + 19
Less than  3,400          180              8 + 11 + 23 + 38 + 45 + 32 + 19 + 4
TABLE 2–9 Cumulative Relative Frequency Distribution for
Profit on Vehicles Sold Last Month at Applewood Auto Group
Profit              Cumulative Frequency   Cumulative Relative Frequency
Less than $  600            8                8/180 = 0.044 =   4.4%
Less than $1,000           19               19/180 = 0.106 =  10.6%
Less than $1,400           42               42/180 = 0.233 =  23.3%
Less than $1,800           80               80/180 = 0.444 =  44.4%
Less than $2,200          125              125/180 = 0.694 =  69.4%
Less than $2,600          157              157/180 = 0.872 =  87.2%
Less than $3,000          176              176/180 = 0.978 =  97.8%
Less than $3,400          180              180/180 = 1.000 = 100.0%
To construct a cumulative relative frequency distribution, we
divide the cumulative frequencies by the total number of
observations, 180. As shown in Table 2–9, the cumulative relative
frequency of the fourth class is 80/180 = 0.444, or about 44%.
This means that about 44% of the vehicles sold for a profit of
less than $1,800.
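Both tables amount to a running total followed by a division. The sketch below assumes the Applewood class frequencies and uses the standard-library function `itertools.accumulate` for the running total. The variable names are illustrative.

```python
from itertools import accumulate

upper_limits = [600, 1000, 1400, 1800, 2200, 2600, 3000, 3400]
frequencies = [8, 11, 23, 38, 45, 32, 19, 4]

# Cumulative frequency: running total of the class frequencies (Table 2–8).
cumulative = list(accumulate(frequencies))
total = cumulative[-1]

# Cumulative relative frequency: cumulative frequency / total (Table 2–9).
cumulative_relative = [c / total for c in cumulative]

for limit, c, r in zip(upper_limits, cumulative, cumulative_relative):
    print(f"Less than ${limit:>5,}  {c:>3}  {r:6.1%}")
```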
To plot a cumulative frequency distribution, scale the upper
limit of each
class along the X-axis and the corresponding cumulative
frequencies along the
Y-axis. To provide additional information, you can label the
vertical axis on the
right in terms of cumulative relative frequencies. In the
Applewood Auto Group,
the vertical axis on the left is labeled from 0 to 180 and on the
right from 0 to
100%. Note, as an example, that 50% on the right axis should be
opposite 90
vehicles on the left axis.
To begin, the first plot is at x = 200 and y = 0. None of the
vehicles sold for a
profit of less than $200. The profit on 8 vehicles was less than
$600, so the next
plot is at x = 600 and y = 8. Continuing, the next plot is x =
1,000 and y = 19. There
were 19 vehicles that sold for a profit of less than $1,000. The
rest of the points are
plotted and then the dots connected to form Chart 2–7.
We should point out that the shape of the distribution is the
same if we use
cumulative relative frequencies instead of the cumulative
frequencies. The only
difference is that the vertical axis is scaled in percentages. In
the following charts,
a percentage scale is added to the right side of the graphs to
help answer ques-
tions about cumulative relative frequencies.
CHART 2–7 Cumulative Frequency Polygon for Profit on Vehicles
Sold Last Month at Applewood Auto Group
[Chart: horizontal axis Profit $ (200 to 3,400, marked at the class limits); left vertical axis Number of Vehicles Sold (0 to 180); right vertical axis Percent of Vehicles Sold (0 to 100)]
To use Chart 2–7 to find the profit below which 75% of the
vehicles fall, draw a horizontal line from the 75% mark on the
right-hand vertical axis over to the polygon, then drop down to
the X-axis and read the amount of profit. The value on the X-axis
is about $2,300, so we estimate that 75% of the vehicles sold
earned a profit of $2,300 or less for the Applewood group.
To find the highest profit earned on 60 of the 180 vehicles, we
use Chart 2–7
to locate the value of 60 on the left-hand vertical axis. Next, we
draw a horizontal
line from the value of 60 to the polygon and then drop down to
the X-axis and read
the profit. It is about $1,600, so we estimate that 60 of the
vehicles sold for a profit
of less than $1,600. We can also make estimates of the
percentage of vehicles that
sold for less than a particular amount. To explain, suppose we
want to estimate the
percentage of vehicles that sold for a profit of less than $2,000.
We begin by locat-
ing the value of $2,000 on the X-axis, move vertically to the
polygon, and then
horizontally to the vertical axis on the right. The value is about
56%, so we conclude
56% of the vehicles sold for a profit of less than $2,000.
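Reading a value off the cumulative frequency polygon is linear interpolation between two neighboring plotted points. The sketch below assumes the plotted points described for Chart 2–7. The function names `profit_at_count` and `count_below` are hypothetical, and the results are estimates of the same kind as the graphical readings, so they agree only approximately with values read by eye.

```python
# Plotted points of the cumulative frequency polygon:
# (upper class limit, cumulative frequency), starting at (200, 0).
points = [
    (200, 0), (600, 8), (1000, 19), (1400, 42), (1800, 80),
    (2200, 125), (2600, 157), (3000, 176), (3400, 180),
]
TOTAL = 180

def profit_at_count(count):
    """Estimate the profit below which `count` vehicles fall."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= count <= y1:
            return x0 + (count - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("count must be between 0 and 180")

def count_below(profit):
    """Estimate the number of vehicles with a profit below `profit`."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= profit <= x1:
            return y0 + (profit - x0) / (x1 - x0) * (y1 - y0)
    raise ValueError("profit must be between 200 and 3,400")

print(profit_at_count(0.75 * TOTAL))   # profit below which 75% of vehicles fall
print(profit_at_count(60))             # profit below which 60 vehicles fall
print(count_below(2000) / TOTAL)       # share of vehicles below $2,000
```

The three estimates are close to the graphical readings of about $2,300, about $1,600, and about 56%.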
S E L F - R E V I E W 2–5
A sample of the hourly wages of 15 employees at Home Depot
in Brunswick, Georgia, was organized into the following table.
Hourly Wages     Number of Employees
$ 8 up to $10           3
 10 up to  12           7
 12 up to  14           4
 14 up to  16           1
(a) What is the table called?
(b) Develop a cumulative frequency distribution and portray the
distribution in a cumulative frequency polygon.
(c) On the basis of the cumulative frequency polygon, how
many employees earn less than $11 per hour?
E X E R C I S E S
19. The cumulative frequency and the cumulative relative
frequency polygon for the distribution of hourly wages of a
sample of certified welders in the Atlanta, Georgia, area is
shown in the graph.
[Chart: horizontal axis Hourly Wage (0 to 30); vertical axes Frequency (10 to 40) and Percent (25 to 100)]
a. How many welders were studied?
b. What is the class interval?
c. About how many welders earn less than $10.00 per hour?
d. About 75% of the welders make less than what amount?
e. Ten of the welders studied made less than what amount?
f. What percent of the welders make less than $20.00 per hour?
20. The cumulative frequency and the cumulative relative
frequency polygon for a distribution of selling prices ($000) of
houses sold in the Billings, Montana, area is shown in the graph.
[Chart: horizontal axis Selling Price ($000) (0 to 350); vertical axes Frequency (50 to 200) and Percent (25 to 100)]
a. How many homes were studied?
b. What is the class interval?
c. One hundred homes sold for less than what amount?
d. About 75% of the homes sold for less than what amount?
e. Estimate the number of homes in the $150,000 up to
$200,000 class.
f. About how many homes sold for less than $225,000?
21. The frequency distribution representing the number of
frequent flier miles accumulated
by employees at Brumley Statistical Consulting Inc. is repeated
from Exercise 17.
Frequent Flier Miles
(000) Frequency
0 up to 3 5
3 up to 6 12
6 up to 9 23
9 up to 12 8
12 up to 15 2
Total 50
a. How many employees accumulated less than 3,000 miles?
b. Convert the frequency distribution to a cumulative frequency
distribution.
c. Portray the cumulative distribution in the form of a
cumulative frequency polygon.
d. Based on the cumulative relative frequencies, about 75% of
the employees
accumulated how many miles or less?
22. The frequency distribution of order lead time of the retailer
from Exercise 18 is
repeated below.
Lead Time (days) Frequency
0 up to 5 6
5 up to 10 7
10 up to 15 12
15 up to 20 8
20 up to 25 7
Total 40
a. How many orders were filled in less than 10 days? In less
than 15 days?
b. Convert the frequency distribution to cumulative frequency
and cumulative rela-
tive frequency distributions.
c. Develop a cumulative frequency polygon.
d. About 60% of the orders were filled in less than how many
days?
C H A P T E R S U M M A R Y
I. A frequency table is a grouping of qualitative data into
mutually exclusive and collectively
exhaustive classes showing the number of observations in each
class.
II. A relative frequency table shows the fraction of the
observations in each class.
III. A bar chart is a graphic representation of a frequency table.
IV. A pie chart shows the proportion each distinct class
represents of the total number of
observations.
V. A frequency distribution is a grouping of data into mutually
exclusive and collectively ex-
haustive classes showing the number of observations in each
class.
A. The steps in constructing a frequency distribution are
1. Decide on the number of classes.
2. Determine the class interval.
3. Set the individual class limits.
4. Tally the raw data into classes and determine the frequency
in each class.
B. The class frequency is the number of observations in each
class.
C. The class interval is the difference between the limits of two
consecutive classes.
D. The class midpoint is halfway between the limits of
consecutive classes.
VI. A relative frequency distribution shows the percent of
observations in each class.
VII. There are several methods for graphically portraying a
frequency distribution.
A. A histogram portrays the frequencies in the form of a
rectangle or bar for each class.
The height of the rectangles is proportional to the class
frequencies.
B. A frequency polygon consists of line segments connecting
the points formed by the
intersection of the class midpoint and the class frequency.
C. A graph of a cumulative frequency distribution shows the
number of observations less
than a given value.
D. A graph of a cumulative relative frequency distribution
shows the percent of observa-
tions less than a given value.
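The four steps in item V.A can be sketched as follows. Two assumptions are ours: the "2 to the k" guideline for choosing the number of classes, which this summary does not restate, and the data values, which are hypothetical and invented only for illustration. The function names `suggest_classes` and `tally` are also hypothetical.

```python
def suggest_classes(n):
    """Step 1: smallest k with 2**k > n (the '2 to the k' guideline)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def tally(data, lower_limit, interval, k):
    """Step 4: count the observations in k classes 'lower up to upper'."""
    counts = [0] * k
    for value in data:
        index = int((value - lower_limit) // interval)
        if not 0 <= index < k:
            raise ValueError(f"{value} falls outside the classes")
        counts[index] += 1
    return [
        (lower_limit + i * interval, lower_limit + (i + 1) * interval, c)
        for i, c in enumerate(counts)
    ]

# Hypothetical observations, invented for illustration only.
data = [12, 15, 21, 22, 24, 27, 28, 30, 31, 33,
        34, 35, 37, 38, 41, 42, 45, 48, 52, 58]

k = suggest_classes(len(data))                   # step 1
minimum_interval = (max(data) - min(data)) / k   # step 2: at least this wide
interval = 10                                    # rounded up to a convenient value
lower_limit = 10                                 # step 3: first lower class limit
table = tally(data, lower_limit, interval, k)    # step 4

for lower, upper, freq in table:
    print(f"{lower} up to {upper}: {freq}")
```

Under the same guideline, `suggest_classes(180)` gives 8, the number of classes used for the Applewood profits.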
C H A P T E R E X E R C I S E S
23. Describe the similarities and differences of qualitative and
quantitative variables. Be
sure to include the following:
a. What level of measurement is required for each variable
type?
b. Can both types be used to describe both samples and
populations?
24. Describe the similarities and differences between a
frequency table and a frequency
distribution. Be sure to include which requires qualitative data
and which requires quan-
titative data.
25. Alexandra Damonte will be building a new resort in Myrtle
Beach, South Carolina. She
must decide how to design the resort based on the type of
activities that the resort will
offer to its customers. A recent poll of 300 potential customers
showed the following
results about customers’ preferences for planned resort
activities:
Like planned activities 63
Do not like planned activities 135
Not sure 78
No answer 24
a. What is the table called?
b. Draw a bar chart to portray the survey results.
c. Draw a pie chart for the survey results.
d. If you are preparing to present the results to Ms. Damonte as
part of a report, which
graph would you prefer to show? Why?
26. Speedy Swift is a package delivery service that serves the
greater Atlanta, Georgia,
metropolitan area. To maintain customer loyalty, one of Speedy
Swift’s performance
objectives is on-time delivery. To monitor its performance,
each delivery is measured on
the following scale: early (package delivered before the
promised time), on-time (pack-
age delivered within 5 minutes of the promised time), late
(package delivered more than
5 minutes past the promised time), or lost (package never
delivered). Speedy Swift’s
objective is to deliver 99% of all packages either early or on-
time. Speedy collected the
following data for last month’s performance:
On-time  On-time  Early    Late     On-time  On-time  On-time  On-time  Late     On-time
Early    On-time  On-time  Early    On-time  On-time  On-time  On-time  On-time  On-time
Early    On-time  Early    On-time  On-time  On-time  Early    On-time  On-time  On-time
Early    On-time  On-time  Late     Early    Early    On-time  On-time  On-time  Early
On-time  Late     Late     On-time  On-time  On-time  On-time  On-time  On-time  On-time
On-time  Late     Early    On-time  Early    On-time  Lost     On-time  On-time  On-time
Early    Early    On-time  On-time  Late     Early    Lost     On-time  On-time  On-time
On-time  On-time  Early    On-time  Early    On-time  Early    On-time  Late     On-time
On-time  Early    On-time  On-time  On-time  Late     On-time  Early    On-time  On-time
On-time  On-time  On-time  On-time  On-time  Early    Early    On-time  On-time  On-time
a. What kind of variable is delivery performance? What scale is
used to measure delivery
performance?
b. Construct a frequency table for delivery performance for last
month.
c. Construct a relative frequency table for delivery performance
last month.
d. Construct a bar chart of the frequency table for delivery
performance for last month.
e. Construct a pie chart of on-time delivery performance for
last month.
f. Write a memo reporting the results of the analyses. Include
your tables and graphs with
written descriptions of what they show. Conclude with a general
statement of last
month’s delivery performance as it relates to Speedy Swift’s
performance objectives.
27. A data set consists of 83 observations. How many classes
would you recommend for a
frequency distribution?
28. A data set consists of 145 observations that range from 56
to 490. What size class inter-
val would you recommend?
29. The following is the number of minutes to commute from
home to work for a group
of 25 automobile executives.
28 25 48 37 41 19 32 26 16 23 23 29 36
31 26 21 32 25 31 43 35 42 38 33 28
a. How many classes would you recommend?
b. What class interval would you suggest?
c. What would you recommend as the lower limit of the first
class?
d. Organize the data into a frequency distribution.
e. Comment on the shape of the frequency distribution.
30. The following data give the weekly amounts spent on
groceries for a sample of 45
households.
$271 $363 $159 $ 76 $227 $337 $295 $319 $250
279 205 279 266 199 177 162 232 303
192 181 321 309 246 278 50 41 335
116 100 151 240 474 297 170 188 320
429 294 570 342 279 235 434 123 325
a. How many classes would you recommend?
b. What class interval would you suggest?
c. What would you recommend as the lower limit of the first
class?
d. Organize the data into a frequency distribution.
31. A social scientist is studying the use of iPods by college
students. A sample of 45
students revealed they played the following number of songs
yesterday.
4 6 8 7 9 6 3 7 7 6 7 1 4 7 7
4 6 4 10 2 4 6 3 4 6 8 4 3 3 6
8 8 4 6 4 6 5 5 9 6 8 8 6 5 10
Organize the information into a frequency distribution.
a. How many classes would you suggest?
b. What is the most suitable class interval?
c. What is the lower limit of the initial class?
d. Create the frequency distribution.
e. Describe the shape of the distribution.
32. David Wise handles his own investment portfolio, and has
done so for many years.
Listed below is the holding time (recorded to the nearest whole
year) between purchase
and sale for his collection of 36 stocks.
8 8 6 11 11 9 8 5 11 4 8 5 14 7 12 8 6 11 9 7
9 15 8 8 12 5 9 8 5 9 10 11 3 9 8 6
a. How many classes would you propose?
b. What class interval would you suggest?
c. What quantity would you use for the lower limit of the initial
class?
d. Using your responses to parts (a), (b), and (c), create a
frequency distribution.
e. Describe the shape of the frequency distribution.
33. You are exploring the music in your iTunes library. The
total play counts over the past
year for the 27 songs on your “smart playlist” are shown below.
Make a frequency distribu-
tion of the counts and describe its shape. It is often claimed that
a small fraction of a person’s
songs will account for most of their total plays. Does this seem
to be the case here?
128 56 54 91 190 23 160 298 445 50
578 494 37 677 18 74 70 868 108 71
466 23 84 38 26 814 17
34. The monthly issues of the Journal of Finance are available
on the Internet. The
table below shows the number of times an issue was
downloaded over the last
33 months. Suppose that you wish to summarize the number of
downloads with a
frequency distribution.
  312  2,753  2,595  6,057  7,624  6,624  6,362  6,575  7,760  7,085  7,272
5,967  5,256  6,160  6,238  6,709  7,193  5,631  6,490  6,682  7,829  7,091
6,871  6,230  7,253  5,507  5,676  6,974  6,915  4,999  5,689  6,143  7,086
a. How many classes would you propose?
b. What class interval would you suggest?
c. What quantity would you use for the lower limit of the initial
class?
d. Using your responses to parts (a), (b), and (c), create a
frequency distribution.
e. Describe the shape of the frequency distribution.
35. The following histogram shows the scores on the first exam
for a statistics class.
[Histogram: horizontal axis Score (50, 60, 70, 80, 90, 100); vertical axis Frequency (0 to 25); bar labels in the order extracted: 3, 14, 21, 12, 6]
a. How many students took the exam?
b. What is the class interval?
c. What is the class midpoint for the first class?
d. How many students earned a score of less than 70?
36. The following chart summarizes the selling price of homes
sold last month in the
Sarasota, Florida, area.
[Chart: horizontal axis Selling Price ($000) (0 to 350); vertical axes Frequency (50 to 250) and Percent (25 to 100)]
a. What is the chart called?
b. How many homes were sold during the last month?
c. What is the class interval?
d. About 75% of the houses sold for less than what amount?
e. One hundred seventy-five of the homes sold for less than
what amount?
37. A chain of sport shops catering to beginning skiers,
headquartered in Aspen,
Colorado, plans to conduct a study of how much a beginning
skier spends on his or her
initial purchase of equipment and supplies. Based on these
figures, it wants to explore
the possibility of offering combinations, such as a pair of boots
and a pair of skis, to
induce customers to buy more. A sample of 44 cash register
receipts revealed these
initial purchases:
$140 $ 82 $265 $168 $ 90 $114 $172 $230 $142
86 125 235 212 171 149 156 162 118
139 149 132 105 162 126 216 195 127
161 135 172 220 229 129 87 128 126
175 127 149 126 121 118 172 126
a. Arrive at a suggested class interval.
b. Organize the data into a frequency distribution using a lower
limit of $70.
c. Interpret your findings.
38. The numbers of outstanding shares for 24 publicly traded
companies are listed in
the following table.
Company                          Number of Outstanding Shares (millions)
Southwest Airlines                   738
FirstEnergy                          418
Harley Davidson                      226
Entergy                              178
Chevron                            1,957
Pacific Gas and Electric             430
DuPont                               932
Westinghouse                          22
Eversource                           314
Facebook                           1,067
Google, Inc.                          64
Apple                                941
Costco                               436
Home Depot                         1,495
DTE Energy                           172
Dow Chemical                       1,199
Eastman Kodak                        272
American Electric Power              485
ITT Corporation                       93
Ameren                               243
Virginia Electric and Power          575
Public Service Electric & Gas        506
Consumers Energy                     265
Starbucks                            744
a. Using the number of outstanding shares, summarize the
companies with a frequency
distribution.
b. Display the frequency distribution with a frequency polygon.
c. Create a cumulative frequency distribution of the
outstanding shares.
d. Display the cumulative frequency distribution with a
cumulative frequency polygon.
e. Based on the cumulative relative frequency distribution, 75%
of the companies have
less than “what number” of outstanding shares?
f. Write a brief analysis of this group of companies based on
your statistical summaries
of “number of outstanding shares.”
39. A recent survey showed that the typical American car
owner spends $2,950 per year on
operating expenses. Below is a breakdown of the various
expenditure items. Draw an
appropriate chart to portray the data and summarize your
findings in a brief report.
Expenditure Item Amount
Fuel $ 603
Interest on car loan 279
Repairs 930
Insurance and license 646
Depreciation 492
Total $2,950
40. Midland National Bank selected a sample of 40 student
checking accounts. Below
are their end-of-the-month balances.
$404 $ 74 $234 $149 $279 $215 $123 $ 55 $ 43 $321
87 234 68 489 57 185 141 758 72 863
703 125 350 440 37 252 27 521 302 127
968 712 503 489 327 608 358 425 303 203
a. Tally the data into a frequency distribution using $100 as a
class interval and $0 as
the starting point.
b. Draw a cumulative frequency polygon.
c. The bank considers any student with an ending balance of
$400 or more a “pre-
ferred customer.” Estimate the percentage of preferred
customers.
d. The bank is also considering a service charge to the lowest
10% of the ending bal-
ances. What would you recommend as the cutoff point between
those who have to
pay a service charge and those who do not?
41. Residents of the state of South Carolina earned a total of
$69.5 billion in adjusted gross
income. Seventy-three percent of the total was in wages and
salaries; 11% in dividends,
interest, and capital gains; 8% in IRAs and taxable pensions;
3% in business income
pensions; 2% in Social Security; and the remaining 3% from
other sources. Develop a
pie chart depicting the breakdown of adjusted gross income.
Write a paragraph summa-
rizing the information.
42. A recent study of home technologies reported the number
of hours of personal
computer usage per week for a sample of 60 persons. Excluded
from the study were
people who worked out of their home and used the computer as
a part of their work.
9.3 5.3 6.3 8.8 6.5 0.6 5.2 6.6 9.3 4.3
6.3 2.1 2.7 0.4 3.7 3.3 1.1 2.7 6.7 6.5
4.3 9.7 7.7 5.2 1.7 8.5 4.2 5.5 5.1 5.6
5.4 4.8 2.1 10.1 1.3 5.6 2.4 2.4 4.7 1.7
2.0 6.7 1.1 6.7 2.2 2.6 9.8 6.4 4.9 5.2
4.5 9.3 7.9 4.6 4.3 4.5 9.2 8.5 6.0 8.1
a. Organize the data into a frequency distribution. How many
classes would you sug-
gest? What value would you suggest for a class interval?
b. Draw a histogram. Describe your result.
43. Merrill Lynch recently completed a study regarding the
size of online investment
portfolios (stocks, bonds, mutual funds, and certificates of
deposit) for a sample of cli-
ents in the 40 up to 50 years old age group. Listed following is
the value of all the in-
vestments in thousands of dollars for the 70 participants in the
study.
$669.9 $ 7.5 $ 77.2 $ 7.5 $125.7 $516.9 $ 219.9 $645.2
301.9 235.4 716.4 145.3 26.6 187.2 315.5 89.2
136.4 616.9 440.6 408.2 34.4 296.1 185.4 526.3
380.7 3.3 363.2 51.9 52.2 107.5 82.9 63.0
228.6 308.7 126.7 430.3 82.0 227.0 321.1 403.4
39.5 124.3 118.1 23.9 352.8 156.7 276.3 23.5
31.3 301.2 35.7 154.9 174.3 100.6 236.7 171.9
221.1 43.4 212.3 243.3 315.4 5.9 1,002.2 171.7
295.7 437.0 87.8 302.1 268.1 899.5
a. Organize the data into a frequency distribution. How many
classes would you sug-
gest? What value would you suggest for a class interval?
b. Draw a histogram. Financial experts suggest that this age
group of people have at
least five times their salary saved. As a benchmark, assume an
investment portfolio
of $500,000 would support retirement in 10–15 years. In
writing, summarize your
results.
48 CHAPTER 2
44. A total of 5.9% of the prime-time viewing audience
watched shows on ABC, 7.6%
watched shows on CBS, 5.5% on Fox, 6.0% on NBC, 2.0% on
Warner Brothers, and
2.2% on UPN. A total of 70.8% of the audience watched shows
on other cable net-
works, such as CNN and ESPN. You can find the latest
information on TV viewing from
the following website:
http://www.nielsen.com/us/en/top10s.html/. Develop a pie
chart or a bar chart to depict this information. Write a paragraph
summarizing your
findings.
45. Refer to the following chart:
[Chart: Contact for Job Placement at Wake Forest University.
Networking and Connections, 70%; Job Posting Websites, 20%;
On-Campus Recruiting, 10%.]
a. What is the name given to this type of chart?
b. Suppose that 1,000 graduates will start a new job shortly
after graduation. Estimate
the number of graduates whose first contact for employment
occurred through net-
working and other connections.
c. Would it be reasonable to conclude that about 90% of job
placements were made
through networking, connections, and job posting websites?
Cite evidence.
46. The following chart depicts the annual revenues, by type of
tax, for the state of Georgia.
[Chart: Annual Revenue State of Georgia. Sales, 44.54%; Income,
43.34%; Corporate, 8.31%; License, 2.9%; Other, 0.9%.]
a. What percentage of the state revenue is accounted for by
sales tax and individual
income tax?
b. Which category will generate more revenue: corporate taxes
or license fees?
c. The total annual revenue for the state of Georgia is $6.3
billion. Estimate the amount
of revenue in billions of dollars for sales taxes and for
individual taxes.
47. In 2014, the United States exported a total of $376 billion
worth of products to Canada.
The five largest categories were:
Product Amount
Vehicles $63.3
Machinery 59.7
Electrical machinery 36.6
Mineral fuel and oil 24.8
Plastic 17.0
a. Use a software package to develop a bar chart.
b. What percentage of the United States’ total exports to
Canada is represented by the
two categories “Machinery” and “Electrical Machinery”?
c. What percentage of the top five exported products do
“Machinery” and “Electrical
Machinery” represent?
48. In the United States, the industrial revolution of the early
20th century changed
farming by making it more efficient. For example, in 1910 U.S.
farms used 24.2 million
horses and mules and only about 1,000 tractors. By 1960, 4.6
million tractors were
used and only 3.2 million horses and mules. An outcome of
making farming more
efficient is the reduction of the number of farms from over 6
million in 1920 to about
2.2 million farms today. Listed below is the number of farms, in
thousands, for each of
the 50 states. Summarize the data and write a paragraph that
describes your findings.
50 12 5 28 59 19 35 22 80 5
8 48 3 75 25 77 46 68 10 69
77 25 13 20 35 6 52 61 36 38
88 1 75 246 59 50 44 98 74 2
32 42 7 31 28 9 8 44 25 37
49. One of the most popular candies in the United States is
M&M’s produced by the Mars
Company. In the beginning M&M’s were all brown. Now they
are produced in red, green,
blue, orange, brown, and yellow. Recently, the purchase of a
14-ounce bag of M&M’s
Plain had 444 candies with the following breakdown by color:
130 brown, 98 yellow,
96 red, 35 orange, 52 blue, and 33 green. Develop a chart
depicting this information
and write a paragraph summarizing the results.
50. The number of families who used the Minneapolis YWCA
day care service was
recorded during a 30-day period. The results are as follows:
31 49 19 62 24 45 23 51 55 60
40 35 54 26 57 37 43 65 18 41
50 56 4 54 39 52 35 51 63 42
a. Construct a cumulative frequency distribution.
b. Sketch a graph of the cumulative frequency polygon.
c. How many days saw fewer than 30 families utilize the day
care center?
d. Based on cumulative relative frequencies, how busy were the
highest 80% of the days?
DATA ANALYTICS
51. Refer to the North Valley Real Estate data that reports
information on homes sold
during the last year. For the variable price, select an appropriate
class interval and orga-
nize the selling prices into a frequency distribution. Write a
brief report summarizing
your findings. Be sure to answer the following questions in your
report.
a. Around what values of price do the data tend to cluster?
b. Based on the frequency distribution, what is the typical
selling price in the first class?
What is the typical selling price in the last class?
c. Draw a cumulative relative frequency distribution. Using this
distribution, fifty
percent of the homes sold for what price or less? Estimate the
lower price of the
top ten percent of homes sold. About what percent of the homes
sold for less than
$300,000?
d. Refer to the variable bedrooms. Draw a bar chart showing
the number of homes sold
with 2, 3, 4 or more bedrooms. Write a description of the
distribution.
52. Refer to the Baseball 2016 data that report information on
the 30 Major League
Baseball teams for the 2016 season. Create a frequency
distribution for the Team Salary
variable and answer the following questions.
a. What is the typical salary for a team? What is the range of
the salaries?
b. Comment on the shape of the distribution. Does it appear
that any of the teams have
a salary that is out of line with the others?
c. Draw a cumulative relative frequency distribution of team
salary. Using this distribu-
tion, forty percent of the teams have a salary of less than what
amount? About how
many teams have a total salary of more than $220 million?
53. Refer to the Lincolnville School District bus data. Select
the variable referring to
the number of miles traveled since the last maintenance, and
then organize these data
into a frequency distribution.
a. What is a typical amount of miles traveled? What is the
range?
b. Comment on the shape of the distribution. Are there any
outliers in terms of miles
driven?
c. Draw a cumulative relative frequency distribution. Forty
percent of the buses
were driven fewer than how many miles? How many buses were
driven less than
10,500 miles?
d. Refer to the variables regarding the bus manufacturer and
the bus capacity. Draw a
pie chart of each variable and write a description of your
results.
Week 2 Lecture
Last week we looked at describing data sets. We looked at
summary statistics for location, variation/consistency, position,
and likelihood. While discussing consistency and variability
within the data, the need often arises to examine distribution
patterns. Distributions are a critical element of statistical
analysis. As we will see starting next week, a lot of our ability
to make inferences about populations based on sample results
depends on assumptions about data distribution patterns.
We start our discussions about data patterns and distributions by
examining some graphical analysis techniques; describing and
organizing the data visually to see what insights might be
gained. Tables and graphs are some of the best techniques to
display the characteristics of the data – clustering, dispersion,
center, outliers, even shape are all important elements in
understanding what the data is telling us.
Visual conclusions fall into the realm of qualitative findings.
While many feel comfortable making claims based on these
observations, others feel that as useful as these initial
observations may be, claims must be tested and verified with
quantitative approaches such as experimentation, additional
sampling, and inferential statistical tests.
The ultimate goal of graphical displays is to illuminate
relationships in the data; make things clearer.
Graphs and Tables
Tables
Tables or frequency tables show numerical counts and
percentages. Single-variable tables generally show frequencies
and relative frequencies. Multi-variable tables, also known as
cross-tabulation tables, show counts between and among the
variables. The Excel PivotTable tool will create these kinds of
tables.
Graphs
It has often been said that “a picture is worth a thousand words,”
and graphs earn the saying through their ability to display
relationships and detail that are often missed or hard to describe
otherwise. This is both the strength and the weakness of graphs.
Done well, they illuminate patterns and relationships; done
poorly, whether intentionally or through design errors, they can
distort and hide key data issues.
Types of graphs
While there are literally dozens of graph types, we will look at
only a few of the most commonly used. These include bar
graphs, column and histogram graphs, line graphs, scatter
diagrams, and pie charts. The general purpose of each of these
is to provide a visual representation of the variation within data
sets.
Bar and Column Graphs. These graphs are very similar as both
display frequency counts for unique groups or attributes. Bars
are shown horizontally, while columns are vertical.
Dot Plots. These graphs use dots to represent data points along
a single numerical axis. Multiple values result in vertical
columns of dots. The data points may be individual values or
ranges grouped into “bins.”
Histograms. These graphs share some characteristics with
both dot plots and column graphs. They use columns that
touch each other to show how many values of a
continuous measurement fall within each bin or range.
Generally, they have between 5 and 7 bins, depending upon the
number of data points.
Line Graphs. These graphs show trends over time or groups.
They are used in quality control as statistical process control
charts.
Scatter Diagrams. These graphs use dots to show the
relationship between pairs of measurements. Often, a
regression line will be added to show the linear relationship.
Pie Charts. These circular charts show the percent or
proportion each group is of the whole.
Excel Tool. The Insert tab on Excel’s main ribbon allows for
the creation of tables, charts, and graphs.
Interpretation Issues – What to Look For
We examine graphs and tables both for what they show and for
what they don’t. Obviously, look for what the graphs show:
· Trends
· Changes in trends, means, variation/spread
· Patterns and cycles
· Data clustering
· Outliers
· Data gaps or missing data
· Relationships and changes in relationships
· Randomness or non-randomness
In the Sherlock Holmes story “Silver Blaze,” Holmes points to the
curious incident of the dog in the night-time. When the inspector
objects that the dog did nothing in the night-time, Holmes replies
that this was the curious incident. The point was, the dog
should have barked if a stranger was present, but it
did not. That missing data point suggested something. The
same is true with graphs. In addition to looking for what is
there, look for what isn’t:
· Missing data, particularly with sharp drops at one end or the
other indicating missing or not reported results
· Randomness - data that is “too” neat or perfect might have
been manipulated
· Identical base comparison years or units, for example one
measure based on hundreds and another on thousands will
distort the relationship between them
How to Lie with Graphs
Graphs are wonderful at displaying information. However, as
much of their impact is visual, they can easily be distorted.
Here are a couple of tricks to watch out for.
One simple trick is to not start the y-axis with the value of 0.
This has the effect of stretching out vertical differences – a line
that might look fairly flat if graphed with values starting at 0,
could show a sharp increase with a restricted range in the y-
axis.
Another common distortion occurs with column graphs. Even
though the difference between columns should be judged solely
on height, making one column much narrower and another much
wider distorts the apparent area of the columns, and people form
judgements more on area comparisons than strictly on height, so
the “fatter” column will seem more significant.
Probability distributions
Statistical inference, making judgements about a population
based on the results of samples, relies on two critical
elements. The first is having a random sample, one that
represents the population as fairly as possible. The other is an
analysis based on the proper probability distribution. Statistical
inference is based on probability: the likelihood of getting the
results we did given the population we assume we are dealing
with. For example, when we toss a pair of fair dice, we expect
that the long-term average sum of the showing faces will be 7.
If we toss a pair of dice 100 times and get an average of 3, we
rightly assume something is wrong, as the probability of getting
3 with a pair of fair dice is quite low even for a single toss,
much less for the average.
Reading and Interpreting Distributions
Let’s use the example of tossing a pair of dice to build and
interpret a probability distribution. When we toss a pair of
dice, we have 36 possible outcomes resulting in values from 2
to 12 showing on the top faces.
In theory, we have a 1/36 probability of getting a 2 (1 showing
on each face), a 2/36 probability of getting a 3 (1,2 or 2,1), a
3/36 probability of getting a 4 (1,3; 2,2; or 3,1), etc. The complete
theoretical probability distribution for these values is shown in
the histogram below. The graph shows the value of the sum of
the top faces of the two dice on the x (horizontal) axis and the
number of ways that the value can be formed on the y (vertical)
axis. As noted, we can form the value 3 in only 2 ways, and this
gives us a probability of 2/36 = 0.056, or a 5.6% chance of getting
a 3 when we toss a pair of dice.
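The counting argument above can be reproduced by enumerating all 36 outcomes. The following is a minimal Python sketch, not part of the original lecture; the names `dice_sum_distribution`, `ways`, and `dist` are illustrative. It tallies how many ways each sum can be formed, converts the counts to exact probabilities, and prints a rough text histogram.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def dice_sum_distribution(sides=6):
    """Return {sum: probability} for the sum of two fair dice."""
    # Enumerate every ordered pair of faces: 6 x 6 = 36 outcomes.
    ways = Counter(a + b for a, b in product(range(1, sides + 1), repeat=2))
    total = sides * sides
    return {s: Fraction(count, total) for s, count in sorted(ways.items())}


dist = dice_sum_distribution()
for s, p in dist.items():
    # Recover the number of ways out of 36 to draw one '#' per way.
    count = p.numerator * 36 // p.denominator
    print(f"{s:>2}  {count}/36  {float(p):.3f}  {'#' * count}")

# Long-run average (expected value) of the sum of the two faces.
expected = sum(s * p for s, p in dist.items())
print("expected sum:", expected)
```

Using `Fraction` keeps the probabilities exact, so the bars sum to exactly 1 and the expected sum comes out as exactly 7, matching the long-term average mentioned earlier.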
Let’s use this histogram to learn about probability
distributions. Some basics:
· The area under the entire curve (or sum of the bar areas in this
case) equals 1.00, meaning that one of the outcomes must occur.
· The probability of a single outcome (for example, rolling a 9)
equals the area for the outcome listed on the x-axis.
· The probability for multiple outcomes, such as getting an 11
or 12, is the sum of the probabilities for each value (since the
outcomes are mutually exclusive: one toss cannot produce both).
· We define the term “p-value” as the probability of getting a
value equal to or more extreme than a specific value. For
example, the p-value for an outcome of 10 would be the
probability of getting a 10, 11, or 12, just as the p-value for a
value of 5 would be the probability of getting a 2, 3, 4, or 5.
So, with these “ground rules,” let’s explore how to use this
probability distribution to understand the outcomes.
Example 1. What is the probability of getting a 7 on any given
toss?
Since we can get a 7 in any of 6 ways, the probability is 6/36,
or .17.
Example 2. What is the probability of getting a 6, 7, or 8 on
any given toss?
We can get a 6 in 5 different ways, a 7 in 6 ways, and an 8 in 5
ways, so in total we have a probability of (5 + 6 + 5)/36 which
equals 16/36 = .44. We simply add up the probabilities of each
separate outcome, which is the same as adding the area for these
outcomes.
Example 3. What is the probability of getting any value larger
than 4?
This asks about getting the values of 5, 6, 7, 8, 9, 10, 11, or 12.
We could, of course, simply add the areas for each to get the
answer, but a simpler way exists. We know that the probability
of getting 2 through 4 [P(2, 3, or 4)] plus the probability of
getting 5 through 12 [P(5 thru 12)] must equal 1, as these two
probabilities encompass the entire range of possible outcomes.
So, if P(2, 3, or 4) + P(5 thru 12) = 1, then P(5 thru
12) = 1 - P(2, 3, or 4). This is called the complement rule. It is
often easier to find the probability of the opposite of an event
and use the complement rule to find the desired probability. In
this case, P(2, 3, or 4) = (1 + 2 + 3)/36 = 6/36. So P(5 thru 12)
= 1 - P(2, 3, or 4) = 1 - 6/36 = 30/36, or .83.
Example 4. What is the p-value of getting a 4 or less? 10 or
more?
Recall that a p-value is the probability of getting a specific
result or a more extreme result. Looking outward from the center
of the distribution, the results more extreme than 4 are
a 3 or a 2. So, the p-value is P(2, 3, or 4), which we
calculated above as 6/36, or .17.
The same thinking applies to getting a 10 or more; the related
more extreme outcomes are 11 and 12. Since we have a
symmetrical distribution, the probability of 10, 11, or 12 is the
same as that of 2, 3, or 4: .17.
Example 5. What is the probability of getting between 5 and 9
on a single toss?
This would equal the P(5 thru 12) minus P(10, 11 or 12). Since
we know both of these values from examples 3 and 4, we get
30/36 – 6/36 = 24/36 = .67.
These 5 examples cover the most common situations
encountered with a probability distribution. In the case of
discrete outcomes, we could also find something like the
probability of getting an even or odd outcome; this would simply
equal the sum of the column areas for each of the appropriate values.
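The five dice examples can be checked by direct counting. The sketch below is illustrative only; the helper `prob` and the variable names are assumptions introduced here, not part of the lecture. It lists the 36 equally likely sums and counts how many satisfy each condition.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely sums from tossing two fair dice.
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]


def prob(event):
    """Probability that the sum satisfies the predicate `event`."""
    return Fraction(sum(1 for s in outcomes if event(s)), len(outcomes))


p_seven = prob(lambda s: s == 7)               # Example 1
p_six_to_eight = prob(lambda s: 6 <= s <= 8)   # Example 2
p_over_four = 1 - prob(lambda s: s <= 4)       # Example 3, complement rule
p_low_tail = prob(lambda s: s <= 4)            # Example 4, "4 or less"
p_high_tail = prob(lambda s: s >= 10)          # Example 4, "10 or more"
p_five_to_nine = p_over_four - p_high_tail     # Example 5

for label, p in [
    ("P(7)", p_seven),
    ("P(6, 7, or 8)", p_six_to_eight),
    ("P(more than 4)", p_over_four),
    ("P(4 or less)", p_low_tail),
    ("P(10 or more)", p_high_tail),
    ("P(5 thru 9)", p_five_to_nine),
]:
    print(f"{label}: {p} = {float(p):.2f}")
```

The fractions print in lowest terms (for instance 1/6 rather than 6/36), but the decimal values agree with the worked examples.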
Normal Curve. One of the most commonly used probability
distributions in statistics is the normal curve, also known as
the bell-shaped curve. The normal curve looks much like the
histogram we used above with the bars shrunk down to almost no
width. The normal curve values run from minus infinity to plus
infinity, but the practical range is much smaller. The mean,
median, and mode of the curve are equal, and the curve is
symmetrical about that point. As with our
histogram above, the area under the normal curve equals 1.0.
A specialized case of the normal curve, called the standard
normal curve, has a mean of 0 and a standard deviation of 1.0.
We get the standard normal curve from any normal curve by
subtracting the mean from each value, and then dividing the
result by the original standard deviation; the resulting value
is called a z-score. This allows us to
determine probabilities of any outcome using one curve rather
than needing to calculate values from different curves all the
time. And, as we might hope, Excel will do all the math
involved for us. (Actually, Excel will do the math for any
normal curve as well.)
Some key functions, found in the Fx and Formulas lists, include
the following. Note that functions having “.S.” in the middle
are for the standard normal curve; those without the “.S.” are
for any normal curve.
· NORM.DIST(value, mean, standard deviation,
cumulative) gives the total area/probability to the left of
the stated value for a normal curve with the specified mean and
standard deviation when cumulative = TRUE or 1. (Note that if
cumulative is FALSE or 0, we get the height of the curve for
graphing purposes.) Example: =NORM.DIST(10, 8, 2, TRUE) =
0.8413 (rounded).
· NORM.INV(probability, mean, standard deviation)
returns the numerical value for the given probability. Example:
=NORM.INV(0.8413, 8, 2) = 10 (rounded).
· NORM.S.DIST(value, cumulative) gives the area/probability of the
given z-score value or less with cumulative set to TRUE or 1.
Example: =NORM.S.DIST(1.96, TRUE) = 0.975 (rounded).
· NORM.S.INV(probability) returns the z-score associated with the
given probability. Example: =NORM.S.INV(0.975) = 1.96 (rounded).
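For readers working outside Excel, the same four calculations can be sketched with the `NormalDist` class from Python's standard `statistics` module. This is offered only as a rough parallel, not as the lecture's method; the variable names are illustrative, and the comments indicate which Excel call each line is meant to mirror.

```python
from statistics import NormalDist

normal = NormalDist(mu=8, sigma=2)   # any normal curve: mean 8, sd 2
standard = NormalDist()              # standard normal: mean 0, sd 1

area_left = normal.cdf(10)           # mirrors =NORM.DIST(10, 8, 2, TRUE)
height = normal.pdf(10)              # mirrors =NORM.DIST(10, 8, 2, FALSE)
value = normal.inv_cdf(0.8413)       # mirrors =NORM.INV(0.8413, 8, 2)
z_area = standard.cdf(1.96)          # mirrors =NORM.S.DIST(1.96, TRUE)
z_score = standard.inv_cdf(0.975)    # mirrors =NORM.S.INV(0.975)

# Standardizing by hand: subtract the mean, divide by the standard deviation.
z = (10 - normal.mean) / normal.stdev
same_area = standard.cdf(z)          # same area as area_left

print(f"{area_left:.4f} {value:.2f} {z_area:.3f} {z_score:.2f} {same_area:.4f}")
```

The last two lines illustrate why one standard curve is enough: converting 10 to a z-score of 1.0 and looking it up on the standard normal curve gives the same area as using the original curve directly.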
With these functions, we can do the same kinds of probability
calculations we did with our dice and the histogram. Some
examples follow.
Example 1. What is the probability of getting a result exactly
in the middle of the distribution, a z-score of 0.00? Note that,
since the normal curve is continuous, the probability of any
single exact value is technically 0 (a single point has no width,
so there is no area above it).
However, since specific events and values do occur, we create a
range by making an adjustment. Since z-scores are typically
reported to two decimal places, we add +/-0.005 to the score for
our range. So, the area or probability for a z-score of 0 would
be the area under the curve between -0.005 and +0.005. We find
the area to the left of the larger value and subtract the area
to the left of the smaller value from it. For our example, this
equals =NORM.S.DIST(0.005,1) - NORM.S.DIST(-0.005,1) = 0.003989,
or 0.004 (rounded).
Example 2. What is the p-value of exceeding a z-score of
1.96? Excel's normal curve functions do not directly calculate
probabilities of exceeding a value, so we need to use the
complement rule whenever we are asked for a probability of
exceeding a value.
Since we are again working with a z-score, we use the standard
normal curve functions: =1-NORM.S.DIST(1.96,1) = 1-0.975 =
0.025.
Example 3. What is the p-value of getting less than a z-score of
-1.96? The probability of getting a score up to any value is
directly found from Norm.s.dist, so this question is answered by
=NORM.S.DIST(-1.96,1) = 0.025 (rounded).
With these three approaches, you can find any normal curve
probability based on a z-score. If you have means and standard
deviations, the same logic applies but you would use the normal
curve functions that do not contain the “.S.” term.
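The three approaches (a narrow range around a value, the area above a value, and the area below a value) can be sketched in a few lines. This example again assumes Python's standard `statistics.NormalDist` as a stand-in for the Excel functions; the variable names are illustrative.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal curve: mean 0, sd 1

# Example 1: area in the narrow band from -0.005 to +0.005 around z = 0.
p_near_zero = std.cdf(0.005) - std.cdf(-0.005)

# Example 2: area above z = 1.96, found with the complement rule.
p_above = 1 - std.cdf(1.96)

# Example 3: area below z = -1.96, read directly from the cumulative area.
p_below = std.cdf(-1.96)

print(f"near zero: {p_near_zero:.6f}")
print(f"above 1.96: {p_above:.3f}")
print(f"below -1.96: {p_below:.3f}")
```

Because the curve is symmetrical, the areas from Examples 2 and 3 match, which is a quick sanity check on any tail calculation.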
T Curve. A family of bell-shaped curves closely related to the
normal curve, the t distributions, is used when
we estimate the mean and standard deviation from sample
values. These curves are somewhat flatter than the standard
normal curve, with more area in the tails. Additionally, a
separate curve exists for each sample size that we might use.
The good news is
that Excel does all the work for these curves as well. And, as
we will see, these curves are used more often than the standard
normal curve in statistical analysis.
The key difference with the t curves is the idea of degrees of
freedom (df). For the t distribution, df = the sample size - 1
(n - 1), and this value is used in the Excel functions. The
t-related Excel functions, also found in Fx and Formulas, are:
· T.DIST(t-value, df, cumulative): the p-value (probability) for
this value or less; for example, =T.DIST(2.228,10,1) = 0.975
(rounded). 1-T.DIST(t-value, df, 1) would be the p-value for a
positive t-value. =T.DIST(-2.228,10,1) = 0.025 (rounded); this
would be the p-value for a negative t-value. As with NORM.DIST,
using FALSE or 0 for the cumulative value gives us the height of
the curve for graphing the t distribution.
· T.DIST.2T(t-value, df): the two-tailed probability of getting a
value below (minus t-value) or above (plus t-value); for example,
=T.DIST.2T(2.228,10) = 0.05 (rounded). The probability of
falling between the two values is 1 minus this result.
· T.DIST.RT(t-value, df): the p-value (probability) of getting this
value or more; for example, =T.DIST.RT(2.228,10) = 0.025
(rounded), the p-value for a positive t-value.
· T.INV(probability, df): the t-value that has the given
probability of a result being this large or smaller; for
example, =T.INV(0.95,10) = 1.812 (rounded).
· T.INV.2T(probability, df): the t-value that cuts off
probability/2 in each tail; for example, =T.INV.2T(0.05,10)
= 2.228 (rounded). The probability of equaling or exceeding
+2.228 is 0.025, while the probability of equaling or being less
than -2.228 is 0.025.
Using these functions to find values and probabilities for ranges
is done in a similar fashion as with the normal curve examples
shown above.
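To see where these t-curve areas come from, the sketch below approximates them numerically in Python. It is an illustration under stated assumptions, not how Excel computes them: the helpers `t_pdf` and `t_cdf` are hypothetical names introduced here, and the area is found by applying Simpson's rule to the standard formula for the height of the t curve.

```python
import math


def t_pdf(t, df):
    """Height of the Student t curve with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """Approximate area to the left of t (Simpson's rule; steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        # Simpson weights alternate 4, 2, 4, ... across interior points.
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3  # area between 0 and |t|
    # The curve is symmetrical, so half of the total area lies left of 0.
    return 0.5 + half_area if t >= 0 else 0.5 - half_area


df = 10                           # sample size 11, so df = n - 1 = 10
left = t_cdf(2.228, df)           # compare with =T.DIST(2.228, 10, TRUE)
right = 1 - left                  # compare with =T.DIST.RT(2.228, 10)
two_tail = 2 * right              # compare with =T.DIST.2T(2.228, 10)
neg_left = t_cdf(-2.228, df)      # compare with =T.DIST(-2.228, 10, TRUE)

print(f"{left:.3f} {right:.3f} {two_tail:.3f} {neg_left:.3f}")
```

In practice a statistics package would be used instead of hand-rolled integration; the point of the sketch is that the two-tailed value is simply twice the right-tail area, and the left-tail area for a negative t-value equals the right-tail area for the matching positive one.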
Week 2 Guidance
This week, we look at graphical analysis. We learn how to
select a graph to best display a certain type of data including
two-dimensional scatter plots for paired or bivariate data. The
shape of a scatter plot tells us if the data are correlated with one
another. If data are highly correlated, then the value of one
variable may be used to make a prediction about the value of
the other. This prediction process involves regression analysis
and the construction of a regression equation.
As in week one, we will employ the eight elements of thought to
critically think about these topics. As you think this week, try
to discern the purpose for correlation and regression (Paul and
Elder 2006). What questions might we be able to answer? What
assumptions must we make? What data do we need? How does
our point of view impact our ability to predict? What are the
critical ideas or concepts? What conclusions can we draw and
what are the consequences or implications?
Bivariate Data in Context
Bivariate data are paired data. The pairing of data does not
combine them, but rather associates them according to
collection. For example, suppose you collect the height and
weight of a high school basketball team. Each player has two
unique measurements that describe different traits. Suppose, for
example, that there are only five players:
Height (in inches) Weight (in pounds)
67 155
72 220
77 240
74 195
69 175
If we look at just height or just weight, we might display the
data as a bar graph or (for more players) a histogram. If we
sorted one column and didn’t sort the other we would unpair the
data – the 67 inch tall person would be adjacent to the score for
the 240 pound person, for example, even though they represent
different people. Bivariate data are coupled. In fact, we could
also represent the data as a single list of ordered pairs: (67,
155), (72, 220), (77, 240), (74, 195), and (69,175). The first
number in each ordered pair represents height and the second
number represents weight.
Bivariate data allow us to look at trends in one variable and
determine if there is any relationship with trends in the other
variable. Do you think that taller people in general will weigh
more? If so, then you are suggesting that there is a positive
correlation between height and weight. A small business owner
might collect bivariate data for the price of a certain product
and the number of units sold on a monthly basis. If price
increases, we might expect sales to decrease. When one variable
increasing is associated with another paired variable decreasing,
we refer to the relationship as a negative correlation.
Scatter Diagrams and the Correlation Coefficient
Six Sigma is a set of tools designed to improve business
processes by minimizing defects, errors, and variability through
the use of statistical tools. On its website, Six Sigma. defines
scatter plots as follows:
Scatter plots are used with variable data to study possible
relationships between two different variables. Even though a
scatter plot depicts a relationship between variables, it does not
indicate a cause and effect relationship. Use Scatter plots to
determine what happens to one variable when another variable
changes value. It is a tool used to visually determine whether a
potential relationship exists between an input and an outcome.
So a scatter plot or scatter diagram is just a two-dimensional
plot, as you may have done in middle school, where we use one
variable as the horizontal axis (x-coordinate), and one variable
as the vertical axis (y-coordinate). Our Basketball data above
would be plotted as
The correlation coefficient, or Pearson’s r-value is a measure of
how closely the scatter plot diagram is modeled by a straight
line. The correlation coefficient for any bivariate data will be a
number between -1 and +1. Data with an r near -1 are highly
correlated in the negative direction, which means there is the
inverse relationship discussed in the price and sales example.
These data will display as a negatively sloped line in the scatter
diagram with a pattern that descends from left to right. Data
with a correlation coefficient near +1 are highly correlated in
the positive direction and resemble a positively sloped line in
the scatter plot. Data with a correlation value near 0 (on either
side) show little or no linear correlation. No straight line fits
the data well, and there is practically no linear association
between the values. Non-correlated bivariate data appear like a
round cloud of dots with no discernible direction or pattern.
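As a concrete illustration, the correlation coefficient for the five basketball players listed earlier can be computed directly from its definition. The following Python sketch is an addition for illustration; the function name `pearson_r` and the variable names are assumptions, and the data are the height and weight pairs from the table above.

```python
import math

heights = [67, 72, 77, 74, 69]        # inches
weights = [155, 220, 240, 195, 175]   # pounds


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(heights, weights)
print(f"r = {r:.3f}")
```

The result is roughly 0.90, a strong positive correlation: in this small sample, taller players tend to weigh more. Note that `zip` keeps each height paired with its own weight, which is exactly the pairing that sorting one column alone would destroy.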
Predictions with Linear Regression
If data are highly correlated, in either the positive or negative
direction, then we are able to use information about one value
to make predictions about the potential value of the correlated
variable. Since we use a straight line approximation for the
data, we call this process linear regression. The better our data
fit to a straight line, the better our predictions using this
method. Another way of stating the same principle is that
correlations with a coefficient near +/- 1 carry the most
reliability as predictive linear models.
The general process for linear regression is as follows:
1. Check the strength of the correlation. As a rule of thumb,
regression usually requires an r-value above 0.4 or below -0.4.
2. Use the least squares method to find the equation for the
line of best fit. Often this step is completed using a software
package such as Minitab, SPSS, a TI calculator, or even Excel.
The resulting equation will have the form $\hat{y} = a + bx$,
where x is the variable depicted on the horizontal axis (input),
a is the intercept, b is the slope, and $\hat{y}$ is the
output or predicted value for the variable on the vertical axis.
3. Substitute hypothesized values in for x to predict values
for y.
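The three steps can be sketched with the basketball data from earlier in this guidance. This is a minimal Python illustration under stated assumptions: the function name `least_squares`, the intercept and slope names `a` and `b`, and the hypothesized height of 70 inches are all introduced here for the example, and a sample of five players is far too small for a dependable prediction.

```python
heights = [67, 72, 77, 74, 69]        # x: input, inches
weights = [155, 220, 240, 195, 175]   # y: output, pounds


def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


a, b = least_squares(heights, weights)
predicted = a + b * 70   # step 3: substitute a hypothesized height

print(f"weight = {a:.2f} + {b:.2f} * height")
print(f"predicted weight at 70 inches: {predicted:.1f} pounds")
```

The slope says that, within this sample, each additional inch of height is associated with roughly 7.75 more pounds. Predictions are most trustworthy for heights inside the observed range of 67 to 77 inches.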
Students should be able to:
1. Examine the value of presenting data graphically.
2. Describe guidelines for effectively using graphical tools to
present numerical information.
References:
Lind, D. A., Marchal, W. G., & Wathen, S. A. (2017).
Statistical techniques in business and economics. (17th ed.).
Paul, R., & Elder, L. (2006). The Miniature Guide to Critical
Thinking: Concepts and Tools. Berkeley, CA: The Foundation
for Critical Thinking.
Passy. (2012, March 13). Misleading graphs. Retrieved from
http://passyworldofmathematics.com/misleading-graphs/
Pearson, K. (1924). The Life, Letters, and Labours of Francis
Galton. London: Cambridge University Press.
Week 2 Discussions and Required Resources
Part 1 and Part 2 must be at least 150 - 200 words unless
otherwise
Part 1: Graphical Analysis Techniques
There are strengths and weaknesses to graphical analysis
research techniques. For this discussion, begin by reviewing the
technique of graphical analysis in your textbook. Then, keeping
this technique in mind, read the following quotes:
· “Errors using inadequate data are much less than those using
no data at all.”—Charles Babbage
· “Statistics is the science of variation.”—Douglas M. Bates
(1985)
· “All models are wrong, but some models are useful.”—George
E. P. Box (1979)
· The greatest moments are those when you see the result pop up
in a graph or in your statistics analysis - that moment you
realize you know something no one else does and you get the
pleasure of thinking about how to tell them.—Emily Oster
https://www.goodreads.com/quotes/search?utf8=%E2%9C%93&
q=statistics&commit=Search
Also consider the following ways to make a graph misleading
from Misleading Graphs - (Passy, 2012):
· “Vertical scale is too big or too small.
· Vertical axis skips numbers, or does not start at zero.
· Graph is not labeled properly.
· Graph does not have a title to explain what it is about.
· Data is left out.
· Scale not starting at zero.
· Scale made in very small units to make graph look very big.
· Scale values or labels missing from the graph.
· Incorrect scale placed on the graph.
· Pieces of a pie chart are not the correct sizes.
· Oversized volumes of objects that are too big for the vertical
scale differences they represent.
· Size of images used in pictographs being different for the
different categories being graphed.
· Graph being a non-standard size or shape.”
Based on the above quotes, along with this week’s assigned
readings and Instructor Guidance, compare graphical analysis
with quantitative analysis (a technique you explored last week),
and discuss why graphical analysis is important in research.
Finally, describe guidelines for using graphical tools to present
information clearly and effectively.
Part 2: Examples of Graphical Analysis Techniques in Research
Locate an example of a research study that uses graphs and/or
tables in its analysis. Explain what this statistical technique
allows the researchers to accomplish and/or conclude in the
study. Note: Graphic presentations are most often found in the
Results section of a study.
Required Resource
Text
Lind, D. A., Marchal, W. G., & Wathen, S. A.
(2017). Statistical techniques in business and economics. (17th
ed.). Retrieved from http://connect.mheducation.com/class/
The textbook is attached.
· Chapter 2: Describing Data: Frequency Tables, Frequency
Distributions, and Graphic Presentation
· Chapter 4: Describing Data: Displaying and Exploring Data
Article
Passy. (2012, March 13). Misleading graphs. Retrieved from
http://passyworldofmathematics.com/misleading-graphs/
· This article provides information about graph techniques often
used by both advertisers and the media to mislead viewers. It
will assist you in your Graphical Analysis Techniques
discussion this week.

LEARNING OBJECTIVESWhen you have completed this chapter, you.docx

  • 1. LEARNING OBJECTIVES When you have completed this chapter, you will be able to: LO4-1 Construct and interpret a dot plot. LO4-2 Construct and describe a stem-and-leaf display. LO4-3 Identify and compute measures of position. LO4-4 Construct and analyze a box plot. LO4-5 Compute and interpret the coefficient of skewness. LO4-6 Create and interpret a scatter diagram. LO4-7 Develop and explain a contingency table. MCGIVERN JEWELERS recently posted an advertisement on a social media site reporting the shape, size, price, and cut grade for 33 of its diamonds in stock. Develop a box plot of the variable price and comment on the result. (See Exercise 37 and LO4-4.) Describing Data: DISPLAYING AND EXPLORING DATA4 © Denis Vrublevski/Shutterstock.com DESCRIBING DATA: DISPLAYING AND EXPLORING DATA
  • 2. 95 INTRODUCTION Chapter 2 began our study of descriptive statistics. In order to transform raw or un- grouped data into a meaningful form, we organize the data into a frequency distribution. We present the frequency distribution in graphic form as a histogram or a frequency polygon. This allows us to visualize where the data tend to cluster, the largest and the smallest values, and the general shape of the data. In Chapter 3, we first computed several measures of location, such as the mean, median, and mode. These measures of location allow us to report a typical value in the set of observations. We also computed several measures of dispersion, such as the range, variance, and standard deviation. These measures of dispersion allow us to de- scribe the variation or the spread in a set of observations. We continue our study of descriptive statistics in this chapter. We study (1) dot plots, (2) stem-and-leaf displays, (3) percentiles, and (4) box plots. These charts and statistics give us additional insight into where the values are concentrated as well as the general shape of the data. Then we consider bivariate data. In bivariate data, we observe two variables for each individual or observation. Examples include the number of hours a student studied and the points earned on an examination; if a sampled product meets
  • 3. quality specifications and the shift on which it is manufactured; or the amount of electric- ity used in a month by a homeowner and the mean daily high temperature in the region for the month. These charts and graphs provide useful insights as we use business analytics to enhance our understanding of data. DOT PLOTS Recall for the Applewood Auto Group data, we summarized the profit earned on the 180 vehicles sold with a frequency distribution using eight classes. When we orga- nized the data into the eight classes, we lost the exact value of the observations. A dot plot, on the other hand, groups the data as little as possible, and we do not lose the identity of an individual observation. To develop a dot plot, we display a dot for each observation along a horizontal number line indicating the possible values of the data. If there are identical observations or the observations are too close to be shown individually, the dots are “piled” on top of each other. This allows us to see the shape of the distribution, the value about which the data tend to cluster, and the largest and smallest observations. Dot plots are most useful for smaller data sets, whereas histo- grams tend to be most useful for large data sets. An example will show how to con- struct and interpret dot plots. LO4-1 Construct and interpret a dot plot.
  • 4. E X A M P L E The service departments at Tionesta Ford Lincoln and Sheffield Motors Inc., two of the four Applewood Auto Group dealerships, were both open 24 days last month. Listed below is the number of vehicles serviced last month at the two dealerships. Construct dot plots and report summary statistics to compare the two dealerships. Tionesta Ford Lincoln Monday Tuesday Wednesday Thursday Friday Saturday 23 33 27 28 39 26 30 32 28 33 35 32 29 25 36 31 32 27 35 32 35 37 36 30 96 CHAPTER 4 Sheffield Motors Inc. Monday Tuesday Wednesday Thursday Friday Saturday 31 35 44 36 34 37 30 37 43 31 40 31 32 44 36 34 43 36 26 38 37 30 42 33
  • 5. S O L U T I O N The Minitab system provides a dot plot and outputs the mean, median, maximum, and minimum values, and the standard deviation for the number of cars serviced at each dealership over the last 24 working days. The dot plots, shown in the center of the output, graphically illustrate the distribu- tions for each dealership. The plots show the difference in the location and dis- persion of the observations. By looking at the dot plots, we can see that the number of vehicles serviced at the Sheffield dealership is more widely dispersed and has a larger mean than at the Tionesta dealership. Several other features of the number of vehicles serviced are: • Tionesta serviced the fewest cars in any day, 23. • Sheffield serviced 26 cars during their slowest day, which is 4 cars less than the next lowest day. • Tionesta serviced exactly 32 cars on four different days. • The numbers of cars serviced cluster around 36 for Sheffield and 32 for Tionesta. From the descriptive statistics, we see Sheffield serviced a mean of 35.83 vehicles per day. Tionesta serviced a mean of 31.292 vehicles per day during the same period. So Sheffield typically services 4.54 more vehicles per day. There is also more dispersion, or variation, in the daily number of vehicles
  • 6. serviced at Sheffield than at Tionesta. How do we know this? The standard deviation is larger at Shef- field (4.96 vehicles per day) than at Tionesta (4.112 cars per day). STEM-AND-LEAF DISPLAYS In Chapter 2, we showed how to organize data into a frequency distribution so we could summarize the raw data into a meaningful form. The major advantage to organizing the data into a frequency distribution is we get a quick visual picture of the shape of the LO4-2 Construct and describe a stem-and-leaf display. DESCRIBING DATA: DISPLAYING AND EXPLORING DATA 97 distribution without doing any further calculation. To put it another way, we can see where the data are concentrated and also determine whether there are any extremely large or small values. There are two disadvantages, however, to organizing the data into a frequency distribution: (1) we lose the exact identity of each value and (2) we are not sure how the values within each class are distributed. To explain, the Theater of the Republic in Erie, Pennsylvania, books live theater and musical performances. The the-
  • 7. ater’s capacity is 160 seats. Last year, among the forty-five performances, there were eight different plays and twelve different bands. The following frequency distribution shows that between eighty up to ninety people attended two of the forty-five perfor- mances; there were seven performances where ninety up to one hundred people at- tended. However, is the attendance within this class clustered about 90, spread evenly throughout the class, or clustered near 99? We cannot tell. Attendance Frequency 80 up to 90 2 90 up to 100 7 100 up to 110 6 110 up to 120 9 120 up to 130 8 130 up to 140 7 140 up to 150 3 150 up to 160 3 Total 45 One technique used to display quantitative information in a condensed form and provide more information than the frequency distribution is the stem-and-leaf display. An advantage of the stem-and-leaf display over a frequency distribution is we do not lose the identity of each observation. In the above example, we would not know the identity of the values in the 90 up to 100 class. To illustrate the construc- tion of a stem-and-leaf display using the number people
  • 8. attending each perfor- mance, suppose the seven observations in the 90 up to 100 class are 96, 94, 93, 94, 95, 96, and 97. The stem value is the leading digit or digits, in this case 9. The leaves are the trailing digits. The stem is placed to the left of a vertical line and the leaf values to the right. The values in the 90 up to 100 class would appear as follows: 9 ∣ 6 4 3 4 5 6 7 It is also customary to sort the values within each stem from smallest to largest. Thus, the second row of the stem-and-leaf display would appear as follows: 9 ∣ 3 4 4 5 6 6 7 With the stem-and-leaf display, we can quickly observe that 94 people attended two performances and the number attending ranged from 93 to 97. A stem-and-leaf display is similar to a frequency distribution with more information, that is, the identity of the observations is preserved. STEM-AND-LEAF DISPLAY A statistical technique to present a set of data. Each numerical value is divided into two parts. The leading digit(s) becomes the stem and the trailing digit the leaf. The stems are located along the vertical axis, and the leaf values are stacked against each other along the horizontal axis.
  • 9. 98 CHAPTER 4 The following example explains the details of developing a stem-and-leaf display. E X A M P L E Listed in Table 4–1 is the number of people attending each of the 45 performances at the Theater of the Republic last year. Organize the data into a stem-and-leaf display. Around what values does attendance tend to cluster? What is the smallest attendance? The largest attendance? S O L U T I O N From the data in Table 4–1, we note that the smallest attendance is 88. So we will make the first stem value 8. The largest attendance is 156, so we will have the stem values begin at 8 and continue to 15. The first number in Table 4–1 is 96, which has a stem value of 9 and a leaf value of 6. Moving across the top row, the second value is 93 and the third is 88. After the first 3 data values are considered, the chart is as follows. Stem Leaf 8 8
  • 10. 9 6 3 10 11 12 13 14 15 Organizing all the data, the stem-and-leaf chart looks as follows. Stem Leaf 8 8 9 9 6 3 5 6 4 4 7 10 8 7 3 4 6 3 11 7 3 2 7 2 1 9 8 3 12 7 5 7 0 5 5 0 4 13 9 5 2 9 4 6 8 14 8 2 3 15 6 5 5 The usual procedure is to sort the leaf values from the smallest to largest. The last line, the row referring to the values in the 150s, would appear as: 15 ∣ 5 5 6 TABLE 4–1 Number of People Attending Each of the 45 Performances at the Theater of the Republic 96 93 88 117 127 95 113 96 108 94 148 156 139 142 94 107 125 155 155 103 112 127 117 120 112 135 132 111 125 104 106 139 134 119 97 89
  • 11. 118 136 125 143 120 103 113 124 138 DESCRIBING DATA: DISPLAYING AND EXPLORING DATA 99 The final table would appear as follows, where we have sorted all of the leaf values. Stem Leaf 8 8 9 9 3 4 4 5 6 6 7 10 3 3 4 6 7 8 11 1 2 2 3 3 7 7 8 9 12 0 0 4 5 5 5 7 7 13 2 4 5 6 8 9 9 14 2 3 8 15 5 5 6 You can draw several conclusions from the stem-and-leaf display. First, the mini- mum number of people attending is 88 and the maximum is 156. There were two per- formances with less than 90 people attending, and three performances with 150 or more. You can observe, for example, that for the three performances with more than 150 people attending, the actual attendances were 155, 155, and 156. The concentra- tion of attendance is between 110 and 130. There were fifteen performances with at- tendance between 110 and 119 and eight performances between 120 and 129. We
  • 12. can also tell that within the 120 to 129 group the actual attendances were spread evenly throughout the class. That is, 120 people attended two performances, 124 peo- ple attended one performance, 125 people attended three performances, and 127 peo- ple attended two performances. We also can generate this information on the Minitab software system. We have named the variable Attendance. The Minitab output is below. You can find the Minitab commands that will produce this output in Appendix C. The Minitab solution provides some additional information regarding cumulative totals. In the column to the left of the stem values are numbers such as 2, 9, 15, and so on. The number 9 indicates there are 9 observations that have occurred before the value of 100. The number 15 indicates that 15 observations have occurred prior to 110. About halfway down the column the number 9 appears in parentheses. The parentheses indicate that the middle value or median appears in that row and there are nine values in this group. In this case, we describe the middle value as the value below which half of the observations oc- cur. There are a total of 45 observations, so the middle value, if the data were arranged from smallest to largest, would be the 23rd observation; its value is 118. After the median, the values begin to decline. These values represent the “more than” cumulative totals. There are 21 observations of 120 or more, 13 of 130 or more, and so on.
  • 13. 100 CHAPTER 4 Which is the better choice, a dot plot or a stem-and-leaf chart? This is really a matter of personal choice and convenience. For presenting data, especially with a large num- ber of observations, you will find dot plots are more frequently used. You will see dot plots in analytical literature, marketing reports, and occasionally in annual reports. If you are doing a quick analysis for yourself, stem-and-leaf tallies are handy and easy, partic- ularly on a smaller set of data. © Somos/Veer/Getty Images RF 1. The number of employees at each of the 142 Home Depot stores in the Southeast region is shown in the following dot plot. 100 10484 88 92 Number of employees 9680 (a) What are the maximum and minimum numbers of employees per store? (b) How many stores employ 91 people? (c) Around what values does the number of employees per store tend to cluster? 2. The rate of return for 21 stocks is:
  • 14. 8.3 9.6 9.5 9.1 8.8 11.2 7.7 10.1 9.9 10.8 10.2 8.0 8.4 8.1 11.6 9.6 8.8 8.0 10.4 9.8 9.2 S E L F - R E V I E W 4–1 DESCRIBING DATA: DISPLAYING AND EXPLORING DATA 101 Organize this information into a stem-and-leaf display. (a) How many rates are less than 9.0? (b) List the rates in the 10.0 up to 11.0 category. (c) What is the median? (d) What are the maximum and the minimum rates of return? 1. Describe the differences between a histogram and a dot plot. When might a dot plot be better than a histogram? 2. Describe the differences between a histogram and a stem- and-leaf display. 3. Consider the following chart. 6 72 3 4 51 a. What is this chart called? b. How many observations are in the study? c. What are the maximum and the minimum values? d. Around what values do the observations tend to cluster? 4. The following chart reports the number of cell phones sold at a big-box retail store for the last 26 days.
  • 15. 199 144 a. What are the maximum and the minimum numbers of cell phones sold in a day? b. What is a typical number of cell phones sold? 5. The first row of a stem-and-leaf chart appears as follows: 62 | 1 3 3 7 9. Assume whole number values. a. What is the “possible range” of the values in this row? b. How many data values are in this row? c. List the actual values in this row of data. 6. The third row of a stem-and-leaf chart appears as follows: 21 | 0 1 3 5 7 9. Assume whole number values. a. What is the “possible range” of the values in this row? b. How many data values are in this row? c. List the actual values in this row of data. 7. The following stem-and-leaf chart shows the number of units produced per day in a factory. Stem Leaf 3 8 4 5 6 6 0133559 7 0236778 8 59 9 00156 10 36
  • 16. a. How many days were studied? b. How many observations are in the first class? E X E R C I S E S 102 CHAPTER 4 c. What are the minimum value and the maximum value? d. List the actual values in the fourth row. e. List the actual values in the second row. f. How many values are less than 70? g. How many values are 80 or more? h. What is the median? i. How many values are between 60 and 89, inclusive? 8. The following stem-and-leaf chart reports the number of prescriptions filled per day at the pharmacy on the corner of Fourth and Main Streets. Stem Leaf 12 689 13 123 14 6889 15 589 16 35 17 24568 18 268 19 13456 20 034679 21 2239 22 789 23 00179 24 8
  • 17. 25 13 26 27 0 a. How many days were studied? b. How many observations are in the last class? c. What are the maximum and the minimum values in the entire set of data? d. List the actual values in the fourth row. e. List the actual values in the next to the last row. f. On how many days were less than 160 prescriptions filled? g. On how many days were 220 or more prescriptions filled? h. What is the middle value? i. How many days did the number of filled prescriptions range between 170 and 210? 9. A survey of the number of phone calls made by a sample of 16 Verizon sub- scribers last week revealed the following information. Develop a stem-and-leaf chart. How many calls did a typical subscriber make? What were the maximum and the minimum number of calls made? 52 43 30 38 30 42 12 46 39 37 34 46 32 18 41 5 10. Aloha Banking Co. is studying ATM use in suburban Honolulu. Yesterday, for a sample of 30 ATM's, the bank counted the number of times each machine was used. The data is presented in the table. Develop a stem-and- leaf chart to summa- rize the data. What were the typical, minimum, and maximum number of times each ATM was used?
  • 18. 83 64 84 76 84 54 75 59 70 61 63 80 84 73 68 52 65 90 52 77 95 36 78 61 59 84 95 47 87 60 DESCRIBING DATA: DISPLAYING AND EXPLORING DATA 103 MEASURES OF POSITION The standard deviation is the most widely used measure of dispersion. However, there are other ways of describing the variation or spread in a set of data. One method is to determine the location of values that divide a set of observations into equal parts. These measures include quartiles, deciles, and percentiles. Quartiles divide a set of observations into four equal parts. To explain further, think of any set of values arranged from the minimum to the maximum. In Chapter 3, we called the middle value of a set of data arranged from the minimum to the maximum the median. That is, 50% of the observations are larger than the median and 50% are smaller. The median is a measure of location because it pinpoints the center of the data. In a similar fashion, quartiles divide a set of observations into four equal parts. The first quartile, usu- ally labeled Q1, is the value below which 25% of the observations occur, and the third quartile, usually labeled Q3, is the value below which 75% of the observations occur.
  • 19. Similarly, deciles divide a set of observations into 10 equal parts and percentiles into 100 equal parts. So if you found that your GPA was in the 8th decile at your univer- sity, you could conclude that 80% of the students had a GPA lower than yours and 20% had a higher GPA. If your GPA was in the 92nd percentile, then 92% of students had a GPA less than your GPA and only 8% of students had a GPA greater than your GPA. Per- centile scores are frequently used to report results on such national standardized tests as the SAT, ACT, GMAT (used to judge entry into many master of business administration programs), and LSAT (used to judge entry into law school). Quartiles, Deciles, and Percentiles To formalize the computational procedure, let Lp refer to the location of a desired percen- tile. So if we want to find the 92nd percentile we would use L92, and if we wanted the median, the 50th percentile, then L50. For a number of observations, n, the location of the Pth percentile, can be found using the formula: LO4-3 Identify and compute measures of position. LOCATION OF A PERCENTILE Lp = (n + 1) P 100 [4–1]
  • 20. An example will help to explain further. E X A M P L E Morgan Stanley is an investment company with offices located throughout the United States. Listed below are the commissions earned last month by a sample of 15 brokers at the Morgan Stanley office in Oakland, California. $2,038 $1,758 $1,721 $1,637 $2,097 $2,047 $2,205 $1,787 $2,287 1,940 2,311 2,054 2,406 1,471 1,460 Locate the median, the first quartile, and the third quartile for the commissions earned. S O L U T I O N The first step is to sort the data from the smallest commission to the largest. $1,460 $1,471 $1,637 $1,721 $1,758 $1,787 $1,940 $2,038 2,047 2,054 2,097 2,205 2,287 2,311 2,406 104 CHAPTER 4 In the above example, the location formula yielded a whole number. That is, we wanted to find the first quartile and there were 15 observations, so the location formula
  • 21. indicated we should find the fourth ordered value. What if there were 20 observations in the sample, that is n = 20, and we wanted to locate the first quartile? From the loca- tion formula (4–1): L25 = (n + 1) P 100 = (20 + 1) 25 100 = 5.25 We would locate the fifth value in the ordered array and then move .25 of the distance between the fifth and sixth values and report that as the first quartile. Like the median, the quartile does not need to be one of the actual values in the data set. To explain further, suppose a data set contained the six values 91, 75, 61, 101, 43, and 104. We want to locate the first quartile. We order the values from the minimum to the maximum: 43, 61, 75, 91, 101, and 104. The first quartile is located at L25 = (n + 1) P 100 = (6 + 1)
  • 22. 25 100 = 1.75 The position formula tells us that the first quartile is located between the first and the second values and it is .75 of the distance between the first and the second values. The first value is 43 and the second is 61. So the distance between these two values is 18. To locate the first quartile, we need to move .75 of the distance between the first and second values, so .75(18) = 13.5. To complete the procedure, we add 13.5 to the first value, 43, and report that the first quartile is 56.5. We can extend the idea to include both deciles and percentiles. To locate the 23rd percentile in a sample of 80 observations, we would look for the 18.63 position. L23 = (n + 1) P 100 = (80 + 1) 23 100 = 18.63 The median value is the observation in the center and is the same as the 50th percen-
  • 23. tile, so P equals 50. So the median or L50 is located at (n + 1)(50/100), where n is the number of observations. In this case, that is position number 8, found by (15 + 1) (50/100). The eighth-largest commission is $2,038. So we conclude this is the median and that half the brokers earned com- missions more than $2,038 and half earned less than $2,038. The result using formula (4–1) to find the median is the same as the method presented in Chapter 3. Recall the definition of a quartile. Quartiles divide a set of observations into four equal parts. Hence 25% of the observations will be less than the first quartile. Seventy-five percent of the observations will be less than the third quartile. To locate the first quartile, we use formula (4–1), where n = 15 and P = 25: L25 = (n + 1) P 100 = (15 + 1) 25 100 = 4 and to locate the third quartile, n = 15 and P = 75:
  • 24. L75 = (n + 1) P 100 = (15 + 1) 75 100 = 12 Therefore, the first and third quartile values are located at positions 4 and 12, respectively. The fourth value in the ordered array is $1,721 and the twelfth is $2,205. These are the first and third quartiles. © Ramin Talaie/Getty Images DESCRIBING DATA: DISPLAYING AND EXPLORING DATA 105 To find the value corresponding to the 23rd percentile, we would locate the 18th value and the 19th value and determine the distance between the two values. Next, we would multiply this difference by 0.63 and add the result to the smaller value. The result would be the 23rd percentile. Statistical software is very helpful when describing and summarizing data. Excel, Minitab, and MegaStat, a statistical analysis Excel add-in, all
provide summary statistics that include quartiles. For example, the Minitab summary of the Morgan Stanley commission data, shown below, includes the first and third quartiles, and other statistics. Based on the reported quartiles, 25% of the commissions earned were less than $1,721 and 75% were less than $2,205. These are the same values we calculated using formula (4–1).

There are ways other than formula (4–1) to locate quartile values. For example, another method uses 0.25n + 0.75 to locate the position of the first quartile and 0.75n + 0.25 to locate the position of the third quartile. We will call this the Excel Method. In the Morgan Stanley data, this method would place the first quartile at position 4.5 (.25 × 15 + .75) and the third quartile at position 11.5 (.75 × 15 + .25). The first quartile would be interpolated at one-half the distance between the fourth- and the fifth-ranked values. Based on this method, the first quartile is $1,739.50, found by ($1,721 + 0.5[$1,758 − $1,721]). The third quartile, at position 11.5, would be $2,151, or one-half the distance between the eleventh- and the twelfth-ranked values, found by ($2,097 + 0.5[$2,205 − $2,097]).

Excel, as shown in the Morgan Stanley and Applewood examples, can compute quartiles using either of the two methods. Please note the text uses formula (4–1) to calculate quartiles.

Morgan Stanley Commissions (ordered)
  1460  1471  1637  1721  1758  1787  1940  2038
  2047  2054  2097  2205  2287  2311  2406

  Method              Quartile 1   Quartile 3
  Equation 4-1          1721         2205
  Alternate Method      1739.5       2151

Is the difference between the two methods important? No. Usually it is just a nuisance. In general, both methods calculate values that will support the statement that approximately 25% of the values are less than the value of the first quartile, and approximately 75% of the data values are less than the value of the third quartile. When the sample is large, the difference in the results from the two methods is small. For example, in the Applewood Auto Group data there are 180 vehicles. The quartiles computed using both methods are shown below. Based on the variable profit, 45 of the 180 values (25%) are less than both values of the first quartile, and 135 of the 180 values (75%) are less than both values of the third quartile.

Applewood
  Method              Quartile 1   Quartile 3
  Equation 4-1          1415.5       2275.5
  Alternate Method      1422.5       2268.5

  Worksheet excerpt:
  Profit    Age
  $1,387    21
  $1,754    23
  $1,817    24
  $1,040    25
  $1,273    26
  $1,529    27
  $3,082    27
  $1,951    28
  $2,692    28
  $1,342    29

When using Excel, be careful to understand the method used to calculate quartiles. Excel 2013 and Excel 2016 offer both methods. The Excel function, Quartile.exc, will result in the same answer as Equation 4–1. The Excel function, Quartile.inc, will result in the Excel Method answers.

STATISTICS IN ACTION
John W. Tukey (1915–2000) received a PhD in mathematics from Princeton in 1939. However, when he joined the Fire Control Research Office during World War II, his interest in abstract mathematics shifted to applied statistics. He developed effective numerical and graphical methods for studying patterns in data. Among the graphics he developed are the stem-and-leaf diagram and the box-and-whisker plot or box plot. From 1960 to 1980, Tukey headed the statistical division of NBC’s election night vote projection team. He became renowned in 1960 for preventing an early call of victory for Richard Nixon in the presidential election won by John F. Kennedy.

SELF-REVIEW 4–2
The Quality Control department of Plainsville Peanut Company is responsible for checking the weight of the 8-ounce jar of peanut butter. The weights of a sample of nine jars produced last hour are:

  7.69  7.72  7.80  7.86  7.90  7.94  7.97  8.06  8.09

(a) What is the median weight?
(b) Determine the weights corresponding to the first and third quartiles.

EXERCISES
11. Determine the median and the first and third quartiles in the following data.

  46  47  49  49  51  53  54  54  55  55  59

12. Determine the median and the first and third quartiles in the following data.

   5.24   6.02   6.67   7.30   7.59   7.99   8.03   8.35   8.81   9.45
   9.61  10.37  10.39  11.86  12.22  12.71  13.07  13.59  13.89  15.42

13. The Thomas Supply Company Inc. is a distributor of gas-powered generators. As with any business, the length of time customers take to pay their invoices is important. Listed below, arranged from smallest to largest, is the time, in days, for a sample of The Thomas Supply Company Inc. invoices.

  13  13  13  20  26  27  31  34  34  34
  35  35  36  37  38  41  41  41  45  47
  47  47  50  51  53  54  56  62  67  82

  a. Determine the first and third quartiles.
  b. Determine the second decile and the eighth decile.
  c. Determine the 67th percentile.

14. Kevin Horn is the national sales manager for National Textbooks Inc. He has a sales staff of 40 who visit college professors all over the United States. Each Saturday morning he requires his sales staff to send him a report. This report includes, among other things, the number of professors visited during the previous week. Listed below, ordered from smallest to largest, are the number of visits last week.

  38  40  41  45  48  48  50  50  51  51
  52  52  53  54  55  55  55  56  56  57
  59  59  59  62  62  62  63  64  65  66
  66  67  67  69  69  71  77  78  79  79

  a. Determine the median number of calls.
  b. Determine the first and third quartiles.
  c. Determine the first decile and the ninth decile.
  d. Determine the 33rd percentile.

BOX PLOTS
LO4-4 Construct and analyze a box plot.

A box plot is a graphical display, based on quartiles, that helps us picture a set of data. To construct a box plot, we need only five statistics: the minimum value, Q1 (the first quartile), the median, Q3 (the third quartile), and the maximum value. An example will help to explain.
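The five statistics a box plot needs can be located with formula (4–1), and the quartiles can be compared with the Excel Method described earlier. The following is a minimal sketch in Python, which is not a tool used in this text. The function names are our own, and the general Excel Method position, (n − 1)(P/100) + 1, is an assumption chosen because it reduces to 0.25n + 0.75 for P = 25 and to 0.75n + 0.25 for P = 75. The data are the 15 Morgan Stanley commissions.

```python
# Sketch only: helper names are hypothetical, not from the text or any library.

def value_at_position(ordered, position):
    """Return the value at a 1-based position in an ordered list,
    interpolating between neighbors when the position is fractional."""
    position = min(max(position, 1), len(ordered))  # keep the position inside the data
    lower = int(position)            # rank of the value at or just below the position
    fraction = position - lower      # how far to move toward the next value
    if fraction == 0 or lower == len(ordered):
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def percentile_formula_4_1(values, p):
    """Formula (4-1): position L_P = (n + 1) * P / 100."""
    ordered = sorted(values)
    return value_at_position(ordered, (len(ordered) + 1) * p / 100)

def percentile_excel_method(values, p):
    """Excel Method (assumed general form): position = (n - 1) * P / 100 + 1."""
    ordered = sorted(values)
    return value_at_position(ordered, (len(ordered) - 1) * p / 100 + 1)

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum, with quartiles from formula (4-1)."""
    return (min(values),
            percentile_formula_4_1(values, 25),
            percentile_formula_4_1(values, 50),
            percentile_formula_4_1(values, 75),
            max(values))

commissions = [1460, 1471, 1637, 1721, 1758, 1787, 1940, 2038,
               2047, 2054, 2097, 2205, 2287, 2311, 2406]

print(five_number_summary(commissions))          # (1460, 1721, 2038, 2205, 2406)
print(percentile_excel_method(commissions, 25))  # 1739.5
print(percentile_excel_method(commissions, 75))  # 2151.0
```

Both methods interpolate in the same way. They differ only in the position they assign to a percentile, which is why the results converge as the sample grows.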
EXAMPLE
Alexander’s Pizza offers free delivery of its pizza within 15 miles. Alex, the owner, wants some information on the time it takes for delivery. How long does a typical delivery take? Within what range of times will most deliveries be completed? For a sample of 20 deliveries, he determined the following information:

  Minimum value = 13 minutes
  Q1 = 15 minutes
  Median = 18 minutes
  Q3 = 22 minutes
  Maximum value = 30 minutes

Develop a box plot for the delivery times. What conclusions can you make about the delivery times?

SOLUTION
The first step in drawing a box plot is to create an appropriate scale along the horizontal axis. Next, we draw a box that starts at Q1 (15 minutes) and ends at Q3 (22 minutes). Inside the box we place a vertical line to represent the median (18 minutes). Finally, we extend horizontal lines from the box out to the minimum value (13 minutes) and the maximum value (30 minutes). These horizontal lines outside of the box are sometimes called “whiskers” because they look a bit like a cat’s whiskers.

[Box plot of delivery times. Horizontal axis: Minutes, scaled from 12 to 32. Labeled points: Minimum value, Q1, Median, Q3, Maximum value.]

The box plot also shows the interquartile range of delivery times between Q1 and Q3. The interquartile range is 7 minutes and indicates that 50% of the deliveries are between 15 and 22 minutes.

The box plot also reveals that the distribution of delivery times is positively skewed. In Chapter 3, we defined skewness as the lack of symmetry in a set of data. How do we know this distribution is positively skewed? In this case, there are actually two pieces of information that suggest this. First, the dashed line to the right of the box from 22 minutes (Q3) to the maximum time of 30 minutes is longer than the dashed line from the left of 15 minutes (Q1) to the minimum value of 13 minutes. To put it another way, the 25% of the data larger than the third quartile is more spread out than the 25% less than the first quartile. A second indication of positive skewness is that the median is not in the center of the box. The distance from the first quartile to the median is smaller than the distance from the median to the third quartile. We know that the number of delivery times between 15 minutes and 18 minutes is the same as the number of delivery times between 18 minutes and 22 minutes.

EXAMPLE
Refer to the Applewood Auto Group data. Develop a box plot for the variable age of the buyer. What can we conclude about the distribution of the age of the buyer?

SOLUTION
Minitab was used to develop the following chart and summary statistics. The median age of the purchaser is 46 years, 25% of the purchasers are less than 40 years of age, and 25% are more than 52.75 years of age.
Based on the summary information and the box plot, we conclude:

• Fifty percent of the purchasers are between the ages of 40 and 52.75 years.
• The distribution of ages is fairly symmetric. There are two reasons for this conclusion. The length of the whisker above 52.75 years (Q3) is about the same length as the whisker below 40 years (Q1). Also, the area in the box between 40 years and the median of 46 years is about the same as the area between the median and 52.75.

There are three asterisks (*) above 70 years. What do they indicate? In a box plot, an asterisk identifies an outlier. An outlier is a value that is inconsistent with the rest of the data. It is defined as a value that is more than 1.5 times the interquartile range smaller than Q1 or larger than Q3. In this example, an outlier would be a value larger than 71.875 years, found by:

  Outlier > Q3 + 1.5(Q3 − Q1) = 52.75 + 1.5(52.75 − 40) = 71.875

An outlier would also be a value less than 20.875 years.

  Outlier < Q1 − 1.5(Q3 − Q1) = 40 − 1.5(52.75 − 40) = 20.875
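The two outlier limits above follow the same rule, so they can be computed together. The following is a minimal sketch in Python, which is not a tool used in this text. The function names are our own, and the short list of ages is illustrative only: it contains a few ages mentioned in this chapter, not the full Applewood data set.

```python
# Sketch only: function names are hypothetical.

def outlier_limits(q1, q3, multiplier=1.5):
    """Return the (lower, upper) limits: Q1 - 1.5(Q3 - Q1) and Q3 + 1.5(Q3 - Q1)."""
    iqr = q3 - q1  # interquartile range
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall below the lower limit or above the upper limit."""
    lower, upper = outlier_limits(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Quartiles of buyer age reported in the Applewood example.
lower, upper = outlier_limits(40, 52.75)
print(lower, upper)  # 20.875 71.875

# Illustrative ages only, not the full data set of 180 buyers.
sample_ages = [21, 46, 72, 72, 73]
print(find_outliers(sample_ages, 40, 52.75))  # [72, 72, 73]
```

The multiplier of 1.5 is kept as a parameter only to make the rule explicit; the text uses 1.5 throughout.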
From the box plot, we conclude there are three purchasers 72 years of age or older and none less than 21 years of age. Technical note: In some cases, a single asterisk may represent more than one observation because of the limitations of the software and space available. It is a good idea to check the actual data. In this instance, there are three purchasers 72 years old or older; two are 72 and one is 73.

SELF-REVIEW 4–3
The following box plot shows the assets in millions of dollars for credit unions in Seattle, Washington.

[Box plot. Horizontal axis scaled from 0 to 100 in steps of 10.]

What are the smallest and largest values, the first and third quartiles, and the median? Would you agree that the distribution is symmetrical? Are there any outliers?

EXERCISES
15. The box plot below shows the amount spent for books and supplies per year by students at four-year public colleges.

[Box plot. Horizontal axis marked at $0, 350, 700, 1,050, 1,400, and $1,750.]

  a. Estimate the median amount spent.
  b. Estimate the first and third quartiles for the amount spent.
  c. Estimate the interquartile range for the amount spent.
  d. Beyond what point is a value considered an outlier?
  e. Identify any outliers and estimate their value.
  f. Is the distribution symmetrical or positively or negatively skewed?

16. The box plot shows the undergraduate in-state tuition per credit hour at four-year public colleges.

[Box plot with one asterisk (*). Horizontal axis marked at $0, 300, 600, 900, 1,200, and $1,500.]

  a. Estimate the median.
  b. Estimate the first and third quartiles.
  c. Determine the interquartile range.
  d. Beyond what point is a value considered an outlier?
  e. Identify any outliers and estimate their value.
  f. Is the distribution symmetrical or positively or negatively skewed?

17. In a study of the gasoline mileage of model year 2016 automobiles, the mean miles per gallon was 27.5 and the median was 26.8. The smallest value in the study was 12.70 miles per gallon, and the largest was 50.20. The first and third quartiles were 17.95 and 35.45 miles per gallon, respectively. Develop a box plot and comment on the distribution. Is it a symmetric distribution?
18. A sample of 28 time shares in the Orlando, Florida, area revealed the following daily charges for a one-bedroom suite. For convenience, the data are ordered from smallest to largest. Construct a box plot to represent the data. Comment on the distribution. Be sure to identify the first and third quartiles and the median.

  $116  $121  $157  $192  $207  $209  $209
   229   232   236   236   239   243   246
   260   264   276   281   283   289   296
   307   309   312   317   324   341   353

SKEWNESS
LO4-5 Compute and interpret the coefficient of skewness.

In Chapter 3, we described measures of central location for a distribution of data by reporting the mean, median, and mode. We also described measures that show the amount of spread or variation in a distribution, such as the range and the standard deviation.

Another characteristic of a distribution is the shape. There are four shapes commonly observed: symmetric, positively skewed, negatively skewed, and bimodal. In a symmetric distribution the mean and median are equal and the data values are evenly spread around these values. The shape of the distribution below the mean and median is a mirror image of the distribution above the mean and median. A distribution of values is skewed to the right or positively skewed if there is a single peak, but the values extend much farther to the right of the peak than to the left of the peak. In this case, the mean is larger than the median. In a negatively skewed distribution there is a single peak, but the observations extend farther to the left, in the negative direction, than to the right. In a negatively skewed distribution, the mean is smaller than the median. Positively skewed distributions are more common. Salaries often follow this pattern. Think of the salaries of those employed in a small company of about 100 people. The president and a few top executives would have very large salaries relative to the other workers and hence the distribution of salaries would exhibit positive skewness. A bimodal distribution will have two or more peaks. This is often the case when the values are from two or more populations. This information is summarized in Chart 4–1.

CHART 4–1 Shapes of Frequency Polygons
[Figure not reproduced. Recoverable panel titles: Test Scores, Negatively Skewed (horizontal axis: Score) and Outside Diameter, Bimodal (horizontal axis: Inches). The vertical axis of each panel is Frequency, and each panel marks the mean and median.]

There are several formulas in the statistical literature used to calculate skewness. The simplest, developed by Professor Karl Pearson (1857–1936), is based on the difference between the mean and the median.

PEARSON’S COEFFICIENT OF SKEWNESS
\[ sk = \frac{3(\bar{x} - \text{Median})}{s} \]   [4–2]

Using this relationship, the coefficient of skewness can range from −3 up to 3. A value near −3, such as −2.57, indicates considerable negative skewness. A value such as 1.63 indicates moderate positive skewness. A value of 0, which will occur when the mean and median are equal, indicates the distribution is symmetrical and there is no skewness present.

In this text, we present output from Minitab and Excel. Both of these software packages compute a value for the coefficient of skewness based on the cubed deviations from the mean. The formula is:

SOFTWARE COEFFICIENT OF SKEWNESS
\[ sk = \frac{n}{(n - 1)(n - 2)}\left[\sum\left(\frac{x - \bar{x}}{s}\right)^{3}\right] \]   [4–3]

Formula (4–3) offers an insight into skewness. The right-hand side of the formula is the difference between each value and the mean, divided by the standard deviation. That is the portion \((x - \bar{x})/s\) of the formula. This idea is called standardizing. We will discuss the idea of standardizing a value in more detail in Chapter 7 when we describe the normal probability distribution. At this point, observe that the result is to report the difference between each value and the mean in units of the standard deviation. If this difference is positive, the particular value is larger than the mean; if the difference is negative, the particular value is smaller than the mean. When we cube these values, we retain the information on the direction of the difference. Recall that in the formula for the standard deviation [see formula (3–10)] we squared the difference between each value and the mean, so that the result was all nonnegative values.

If the set of data values under consideration is symmetric, when we cube the standardized values and sum over all the values, the result would be near zero. If there are several large values, clearly separate from the others, the sum of the cubed differences would be a large positive value. If there are several small values clearly separate from the others, the sum of the cubed differences will be negative.

An example will illustrate the idea of skewness.

STATISTICS IN ACTION
The late Stephen Jay Gould (1941–2002) was a professor of zoology and professor of geology at Harvard University. In 1982, he was diagnosed with cancer and had an expected survival time of 8 months. However, never one to be discouraged, he found in his research that the distribution of survival time is dramatically skewed to the right, so that not only do 50% of similar cancer patients survive more than 8 months, but the survival time could be years rather than months!
In fact, Dr. Gould lived another 20 years. Based on his experience, he wrote a widely published essay titled “The Median Is Not the Message.”

EXAMPLE
Following are the earnings per share for a sample of 15 software companies for the year 2016. The earnings per share are arranged from smallest to largest.

  $0.09  $0.13  $0.41  $0.51  $1.12  $1.20  $1.49  $3.18
   3.50   6.36   7.83   8.92  10.13  12.99  16.40

Compute the mean, median, and standard deviation. Find the coefficient of skewness using Pearson’s estimate and the software methods. What is your conclusion regarding the shape of the distribution?

SOLUTION
These are sample data, so we use formula (3–2) to determine the mean

\[ \bar{x} = \frac{\Sigma x}{n} = \frac{\$74.26}{15} = \$4.95 \]

The median is the middle value in a set of data, arranged from smallest to largest. In this case, there is an odd number of observations, so the middle value is the median. It is $3.18.

We use formula (3–10) on page 78 to determine the sample standard deviation.

\[ s = \sqrt{\frac{\Sigma(x - \bar{x})^{2}}{n - 1}} = \sqrt{\frac{(\$0.09 - \$4.95)^{2} + \cdots + (\$16.40 - \$4.95)^{2}}{15 - 1}} = \$5.22 \]

Pearson’s coefficient of skewness is 1.017, found by

\[ sk = \frac{3(\bar{x} - \text{Median})}{s} = \frac{3(\$4.95 - \$3.18)}{\$5.22} = 1.017 \]

This indicates there is moderate positive skewness in the earnings per share data.

We obtain a similar, but not exactly the same, value from the software method. The details of the calculations are shown in Table 4–2. To begin, we find the difference between each earnings per share value and the mean and divide this result by the standard deviation. We have referred to this as standardizing. Next, we cube, that is, raise to the third power, the result of the first step. Finally, we sum the cubed values. The details for the first company, that is, the company with an earnings per share of $0.09, are:

\[ \left(\frac{x - \bar{x}}{s}\right)^{3} = \left(\frac{0.09 - 4.95}{5.22}\right)^{3} = (-0.9310)^{3} = -0.8070 \]

TABLE 4–2 Calculation of the Coefficient of Skewness

  Earnings per Share   (x − x̄)/s   ((x − x̄)/s)³
        0.09            −0.9310      −0.8070
        0.13            −0.9234      −0.7873
        0.41            −0.8697      −0.6579
        0.51            −0.8506      −0.6154
        1.12            −0.7337      −0.3950
        1.20            −0.7184      −0.3708
        1.49            −0.6628      −0.2912
        3.18            −0.3391      −0.0390
        3.50            −0.2778      −0.0214
        6.36             0.2701       0.0197
        7.83             0.5517       0.1679
        8.92             0.7605       0.4399
       10.13             0.9923       0.9772
       12.99             1.5402       3.6539
       16.40             2.1935      10.5537
       Total                         11.8274

When we sum the 15 cubed values, the result is 11.8274. That is, the term \(\Sigma[(x - \bar{x})/s]^{3} = 11.8274\). To find the coefficient of skewness, we use formula (4–3), with n = 15.

\[ sk = \frac{n}{(n - 1)(n - 2)}\sum\left(\frac{x - \bar{x}}{s}\right)^{3} = \frac{15}{(15 - 1)(15 - 2)}(11.8274) = 0.975 \]

We conclude that the earnings per share values are somewhat positively skewed. The following Minitab summary reports the descriptive measures, such as the mean, median, and standard deviation of the earnings per share data. Also included are the coefficient of skewness and a histogram with a bell-shaped curve superimposed.

SELF-REVIEW 4–4
A sample of five data entry clerks employed in the Horry County Tax Office revised the following number of tax records last hour: 73, 98, 60, 92, and 84.
(a) Find the mean, median, and the standard deviation.
(b) Compute the coefficient of skewness using Pearson’s method.
(c) Calculate the coefficient of skewness using the software method.
(d) What is your conclusion regarding the skewness of the data?

EXERCISES
For Exercises 19–22:
  a. Determine the mean, median, and the standard deviation.
  b. Determine the coefficient of skewness using Pearson’s method.
  c. Determine the coefficient of skewness using the software method.

19. The following values are the starting salaries, in $000, for a sample of five accounting graduates who accepted positions in public accounting last year.

  36.0  26.0  33.0  28.0  31.0

20. Listed below are the salaries, in $000, for a sample of 15 chief financial officers in the electronics industry.

  $516.0  $548.0  $566.0  $534.0  $586.0  $529.0
   546.0   523.0   538.0   523.0   551.0   552.0
   486.0   558.0   574.0
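The two skewness coefficients computed in the earnings per share example can be checked with a short script. The following is a minimal sketch in Python, which is not a tool used in this text. It relies on the standard library `statistics` module, and the function names are our own. Because the script carries the mean and standard deviation at full precision, while the hand calculation rounds them to $4.95 and $5.22 before cubing, the software coefficient it reports is close to, but not exactly, 0.975.

```python
import statistics

# Earnings per share for the 15 software companies in the example.
eps = [0.09, 0.13, 0.41, 0.51, 1.12, 1.20, 1.49, 3.18,
       3.50, 6.36, 7.83, 8.92, 10.13, 12.99, 16.40]

def pearson_skewness(values):
    """Formula (4-2): sk = 3(mean - median) / s."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return 3 * (mean - median) / s

def software_skewness(values):
    """Formula (4-3): sk = n / ((n - 1)(n - 2)) * sum of cubed standardized values."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    cubed_sum = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_sum

print(round(statistics.mean(eps), 2))   # 4.95
print(round(statistics.stdev(eps), 2))  # 5.22
print(pearson_skewness(eps))            # close to the text's 1.017
print(software_skewness(eps))           # close to the text's 0.975
```

Both coefficients are positive, which agrees with the conclusion that the earnings per share values are positively skewed.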
21. Listed below are the commissions earned ($000) last year by the 15 sales representatives at Furniture Patch Inc.

  $ 3.9  $ 5.7  $ 7.3  $10.6  $13.0  $13.6  $15.1  $15.8  $17.1
   17.4   17.6   22.3   38.6   43.2   87.7

22. Listed below are the salaries for the 2016 New York Yankees Major League Baseball team.

  Player             Salary         Player             Salary
  CC Sabathia        $25,000,000    Dustin Ackley      $3,200,000
  Mark Teixeira       23,125,000    Martin Prado        3,000,000
  Masahiro Tanaka     22,000,000    Didi Gregorius      2,425,000
  Jacoby Ellsbury     21,142,857    Aaron Hicks           574,000
  Alex Rodriguez      21,000,000    Austin Romine         556,000
  Brian McCann        17,000,000    Chasen Shreve         533,400
  Carlos Beltran      15,000,000    Greg Bird             525,300
  Brett Gardner       13,500,000    Luis Severino         521,300
  Chase Headley       13,000,000    Bryan Mitchell        516,650
  Aroldis Chapman     11,325,000    Kirby Yates           511,900
  Andrew Miller        9,000,000    Mason Williams        509,700
  Starlin Castro       7,857,143    Ronald Torreyes       508,600
  Nathan Eovaldi       5,600,000    John Barbato          507,500
  Michael Pineda       4,300,000    Dellin Betances       507,500
  Ivan Nova            4,100,000    Luis Cessa            507,500

DESCRIBING THE RELATIONSHIP BETWEEN TWO VARIABLES
LO4-6 Create and interpret a scatter diagram.

In Chapter 2 and the first section of this chapter, we presented graphical techniques to summarize the distribution of a single variable. We used a histogram in Chapter 2 to summarize the profit on vehicles sold by the Applewood Auto Group. Earlier in this chapter, we used dot plots and stem-and-leaf displays to visually summarize a set of data. Because we are studying a single variable, we refer to this as univariate data.

There are situations where we wish to study and visually portray the relationship between two variables. When we study the relationship between two variables, we refer to the data as bivariate. Data analysts frequently wish to understand the relationship between two variables. Here are some examples:

• Tybo and Associates is a law firm that advertises extensively on local TV. The partners are considering increasing their advertising budget. Before doing so, they would like to know the relationship between the amount spent per month on advertising and the total amount of billings for that month. To put it another way, will increasing the amount spent on advertising result in an increase in billings?
• Coastal Realty is studying the selling prices of homes. What variables seem to be related to the selling price of homes? For example, do larger homes sell for more than smaller ones? Probably. So Coastal might study the relationship between the area in square feet and the selling price.
• Dr. Stephen Givens is an expert in human development. He is studying the relationship between the height of fathers and the height of their sons. That is, do tall fathers tend to have tall children? Would you expect LeBron James, the 6′8″, 250 pound professional basketball player, to have relatively tall sons?

© Steve Mason/Getty Images RF

One graphical technique we use to show the relationship between variables is called a scatter diagram.

To draw a scatter diagram, we need two variables. We scale one variable along the horizontal axis (X-axis) of a graph and the other variable along the vertical axis (Y-axis). Usually one variable depends to some degree on the other. In the third example above, the height of the son depends on the height of the father. So we scale the height of the father on the horizontal axis and that of the son on the vertical axis.
We can use statistical software, such as Excel, to perform the plotting function for us. Caution: You should always be careful of the scale. By changing the scale of either the vertical or the horizontal axis, you can affect the apparent visual strength of the relationship.

Following are three scatter diagrams (Chart 4–2). The one on the left shows a rather strong positive relationship between the age in years and the maintenance cost last year for a sample of 10 buses owned by the city of Cleveland, Ohio. Note that as the age of the bus increases, the yearly maintenance cost also increases. The example in the center, for a sample of 20 vehicles, shows a rather strong inverse relationship between the odometer reading and the auction price. That is, as the number of miles driven increases, the auction price decreases. The example on the right depicts the relationship between the height and yearly salary for a sample of 15 shift supervisors. This graph indicates there is little relationship between their height and yearly salary.

CHART 4–2 Three Examples of Scatter Diagrams.
[Figure not reproduced. Panels: Age of Buses and Maintenance Cost (X: Age (years), 0 to 6; Y: Cost (annual), $0 to $10,000); Auction Price versus Odometer (X: Odometer, 10,000 to 50,000; Y: Auction price, $12,000 to $24,000); Height versus Salary (X: Height (inches), 54 to 63; Y: Salary ($000), 90 to 125).]

EXAMPLE
In the introduction to Chapter 2, we presented data from the Applewood Auto Group. We gathered information concerning several variables, including the profit earned from the sale of 180 vehicles sold last month. In addition to the amount of profit on each sale, one of the other variables is the age of the purchaser. Is there a relationship between the profit earned on a vehicle sale and the age of the purchaser? Would it be reasonable to conclude that more profit is made on vehicles purchased by older buyers?

SOLUTION
We can investigate the relationship between vehicle profit and the age of the buyer with a scatter diagram. We scale age on the horizontal, or X-axis, and the profit on the vertical, or Y-axis. We assume profit depends on the age of the purchaser. As people age, they earn more income and purchase more expensive cars which, in turn, produce higher profits. We use Excel to develop the scatter diagram. The Excel commands are in Appendix C.

[Scatter diagram: Profit and Age of Buyer at Applewood Auto Group. X: Age (Years), 0 to 80; Y: Profit per Vehicle ($), $0 to $3,500.]

The scatter diagram shows a rather weak positive relationship between the two variables. It does not appear there is much relationship between the vehicle profit and the age of the buyer. In Chapter 13, we will study the relationship between variables more extensively, even calculating several numerical measures to express the relationship between variables.

In the preceding example, there is a weak positive, or direct, relationship between the variables. There are, however, many instances where there is a relationship between the variables, but that relationship is inverse or negative. For example:

• The value of a vehicle and the number of miles driven. As the number of miles increases, the value of the vehicle decreases.
• The premium for auto insurance and the age of the driver. Auto rates tend to be the highest for younger drivers and less for older drivers.
• For many law enforcement personnel, as the number of years on the job increases, the number of traffic citations decreases. This may be because personnel become more liberal in their interpretations or they may be in supervisor positions and not in a position to issue as many citations. But in any event, as age increases, the number of citations decreases.

CONTINGENCY TABLES
LO4-7 Develop and explain a contingency table.

A scatter diagram requires that both of the variables be at least interval scale. In the Applewood Auto Group example, both age and vehicle profit are ratio scale variables. Height is also ratio scale as used in the discussion of the relationship between the height of fathers and the height of their sons. What if we wish to study the relationship between two variables when one or both are nominal or ordinal scale? In this case, we tally the results in a contingency table.

CONTINGENCY TABLE A table used to classify observations according to two identifiable characteristics.

A contingency table is a cross-tabulation that simultaneously summarizes two variables of interest. For example:

• Students at a university are classified by gender and class (freshman, sophomore, junior, or senior).
• A product is classified as acceptable or unacceptable and by the shift (day, afternoon, or night) on which it is manufactured.
• A voter in a school bond referendum is classified as to party affiliation (Democrat, Republican, other) and the number of children that voter has attending school in the district (0, 1, 2, etc.).

EXAMPLE
There are four dealerships in the Applewood Auto Group. Suppose we want to compare the profit earned on each vehicle sold by the particular dealership. To put it another way, is there a relationship between the amount of profit earned and the dealership?

SOLUTION
In a contingency table, both variables only need to be nominal or ordinal. In this example, the variable dealership is a nominal variable and the variable profit is a ratio variable. To convert profit to an ordinal variable, we classify the variable profit into two categories, those cases where the profit earned is more than the median and those cases where it is less. On page 64, we calculated the median profit for all sales last month at Applewood Auto Group to be $1,882.50.

Contingency Table Showing the Relationship between Profit and Dealership

  Above/Below
  Median Profit   Kane   Olean   Sheffield   Tionesta   Total
  Above            25     20        19          26        90
  Below            27     20        26          17        90
  Total            52     40        45          43       180

By organizing the information into a contingency table, we can compare the profit at the four dealerships. We observe the following:

• From the Total column on the right, 90 of the 180 cars sold
  • 62. had a profit above the median and half below. From the definition of the median, this is expected. • For the Kane dealership, 25 out of the 52, or 48%, of the cars sold were sold for a profit more than the median. • The percentage of profits above the median for the other dealerships are 50% for Olean, 42% for Sheffield, and 60% for Tionesta. We will return to the study of contingency tables in Chapter 5 during the study of probability and in Chapter 15 during the study of nonparametric methods of analysis. 118 CHAPTER 4 The rock group Blue String Beans is touring the United States. The following chart shows the relationship between concert seating capacity and revenue in $000 for a sample of concerts. 5800 6300 6800 Seating Capacity 8 7
  • 63. 6 5 4 3 2 Am ou nt ($ 00 0) 7300 (a) What is the diagram called? (b) How many concerts were studied? (c) Estimate the revenue for the concert with the largest seating capacity. (d) How would you characterize the relationship between revenue and seating capacity? Is it strong or weak, direct or inverse? S E L F - R E V I E W 4–5 23. Develop a scatter diagram for the following sample data. How would you describe the relationship between the values?
  • 64. x-Value y-Value x-Value y-Value 10 6 11 6 8 2 10 5 9 6 7 2 11 5 7 3 13 7 11 7 24. Silver Springs Moving and Storage Inc. is studying the relationship between the number of rooms in a move and the number of labor hours required for the move. As part of the analysis, the CFO of Silver Springs developed the following scatter diagram. 1 2 3 Rooms 40 30 20 10 0 Ho ur s 54
  • 65. E X E R C I S E S DESCRIBING DATA: DISPLAYING AND EXPLORING DATA 119 a. How many moves are in the sample? b. Does it appear that more labor hours are required as the number of rooms increases, or do labor hours decrease as the number of rooms increases? 25. The Director of Planning for Devine Dining Inc. wishes to study the relationship be- tween the gender of a guest and whether the guest orders dessert. To investigate the relationship, the manager collected the following information on 200 recent customers. Gender Dessert Ordered Male Female Total Yes 32 15 47 No 68 85 153 Total 100 100 200 a. What is the level of measurement of the two variables? b. What is the above table called? c. Does the evidence in the table suggest men are more likely to order dessert
than women? Explain why.

26. Ski Resorts of Vermont Inc. is considering a merger with Gulf Shores Beach Resorts Inc. of Alabama. The board of directors surveyed 50 stockholders concerning their position on the merger. The results are reported below.

                                    Opinion
Number of Shares Held    Favor    Oppose    Undecided    Total
Under 200                    8         6            2       16
200 up to 1,000              6         8            1       15
Over 1,000                   6        12            1       19
Total                       20        26            4       50

a. What level of measurement is used in this table?
b. What is this table called?
c. What group seems most strongly opposed to the merger?

CHAPTER SUMMARY

I. A dot plot shows the range of values on the horizontal axis and the number of observations for each value on the vertical axis.
   A. Dot plots report the details of each observation.
   B. They are useful for comparing two or more data sets.
II. A stem-and-leaf display is an alternative to a histogram.
   A. The leading digit is the stem and the trailing digit the leaf.
   B. The advantages of a stem-and-leaf display over a histogram include:
      1. The identity of each observation is not lost.
      2. The digits themselves give a picture of the distribution.
      3. The cumulative frequencies are also shown.
III. Measures of location also describe the shape of a set of observations.
   A. Quartiles divide a set of observations into four equal parts.
      1. Twenty-five percent of the observations are less than the first quartile, 50% are less than the second quartile, and 75% are less than the third quartile.
      2. The interquartile range is the difference between the third quartile and the first quartile.
   B. Deciles divide a set of observations into 10 equal parts and percentiles into 100 equal parts.
IV. A box plot is a graphic display of a set of data.
   A. A box is drawn enclosing the regions between the first quartile and the third quartile.
      1. A line is drawn inside the box at the median value.
      2. Dotted line segments are drawn from the third quartile to the largest value to show the highest 25% of the values and from the first quartile to the smallest value to show the lowest 25% of the values.
   B. A box plot is based on five statistics: the maximum and minimum values, the first and third quartiles, and the median.
V. The coefficient of skewness is a measure of the symmetry of a distribution.
   A. There are two formulas for the coefficient of skewness.
      1. The formula developed by Pearson is:

         $$ sk = \frac{3(\bar{x} - \text{Median})}{s} $$   [4–2]

      2. The coefficient of skewness computed by statistical software is:

         $$ sk = \frac{n}{(n-1)(n-2)} \left[ \sum \left( \frac{x - \bar{x}}{s} \right)^{3} \right] $$   [4–3]

VI. A scatter diagram is a graphic tool to portray the relationship between two variables.
   A. Both variables are measured with interval or ratio scales.
   B. If the scatter of points moves from the lower left to the upper right, the variables under consideration are directly or positively related.
   C. If the scatter of points moves from the upper left to the lower right, the variables are inversely or negatively related.
VII. A contingency table is used to classify nominal-scale observations according to two characteristics.

PRONUNCIATION KEY

SYMBOL    MEANING                   PRONUNCIATION
Lp        Location of percentile    L sub p
Q1        First quartile            Q sub 1
Q3        Third quartile            Q sub 3

CHAPTER EXERCISES

27. A sample of students attending Southeast Florida University is asked the number of social activities in which they participated last week. The chart below was prepared from the sample data.

[Chart: Activities, 0 to 4, on the horizontal axis.]
a. What is the name given to this chart?
b. How many students were in the study?
c. How many students reported attending no social activities?

28. Doctor’s Care is a walk-in clinic, with locations in Georgetown, Moncks Corner, and Aynor, at which patients may receive treatment for minor injuries, colds, and flu, as well as physical examinations. The following charts report the number of patients treated in each of the three locations last month.

[Charts: Patients, 10 to 50, on the horizontal axis, one chart for each Location: Georgetown, Moncks Corner, and Aynor.]

Describe the number of patients served at the three locations each day. What are the maximum and minimum numbers of patients served at each of the locations?

29. Below is the number of customers who visited Smith’s True-Value hardware store in Bellville, Ohio, over the last twenty-three days. Make a stem-and-leaf display of this variable.

46 52 46 40 42 46 40 37 46 40 52 32
37 32 52 40 32 52 40 52 46 46 52

30. The top 25 companies (by market capitalization) operating in the Washington, DC, area along with the year they were founded and the number of employees are given below. Make a stem-and-leaf display of each of these variables and write a short description of your findings.

Company Name                                Year Founded    Employees
AES Corp.                                       1981           30,000
American Capital Ltd.                           1986              484
AvalonBay Communities Inc.                      1978            1,767
Capital One Financial Corp.                     1995           31,800
Constellation Energy Group Inc.                 1816            9,736
Coventry Health Care Inc.                       1986           10,250
Danaher Corp.                                   1984           45,000
Dominion Resources Inc.                         1909           17,500
Fannie Mae                                      1938            6,450
Freddie Mac                                     1970            5,533
Gannett Co.                                     1906           49,675
General Dynamics Corp.                          1952           81,000
Genworth Financial Inc.                         2004            7,200
Harman International Industries Inc.            1980           11,246
Host Hotels & Resorts Inc.                      1927              229
Legg Mason                                      1899            3,800
Lockheed Martin Corp.                           1995          140,000
Marriott International Inc.                     1927          151,000
MedImmune LLC                                   1988            2,516
NII Holdings Inc.                               1996            7,748
Norfolk Southern Corp.                          1982           30,594
Pepco Holdings Inc.                             1896            5,057
Sallie Mae                                      1972           11,456
T. Rowe Price Group Inc.                        1937            4,605
The Washington Post Co.                         1877           17,100

31. In recent years, due to low interest rates, many homeowners refinanced their home mortgages. Linda Lahey is a mortgage officer at Down River Federal Savings and Loan. Below is the amount refinanced for 20 loans she processed last week. The data are reported in thousands of dollars and arranged from smallest to largest.

59.2 59.5 61.6 65.5 66.6 72.9 74.8 77.3 79.2 83.7
85.6 85.8 86.6 87.0 87.1 90.2 93.3 98.6 100.2 100.7

a. Find the median, first quartile, and third quartile.
b. Find the 26th and 83rd percentiles.
c. Draw a box plot of the data.

32. A study is made by the recording industry in the United States of the number
of music CDs owned by 25 senior citizens and 30 young adults. The information is reported below.

Seniors
 28  35  41  48  52  81  97  98  98  99
118 132 133 140 145 147 153 158 162 174
177 180 180 187 188

Young Adults
 81 107 113 147 147 175 183 192 202 209
233 251 254 266 283 284 284 316 372 401
417 423 490 500 507 518 550 557 590 594

a. Find the median and the first and third quartiles for the number of CDs owned by senior citizens. Develop a box plot for the information.
b. Find the median and the first and third quartiles for the number of CDs owned by young adults. Develop a box plot for the information.
c. Compare the number of CDs owned by the two groups.

33. The corporate headquarters of Bank.com, an on-line banking company, is located in downtown Philadelphia. The director of human resources is making a study of the time it takes employees to get to work. The city is planning to offer incentives to each downtown employer if they will encourage their employees to use public transportation. Below is a listing of the time to get to work this morning according to whether the employee used public transportation or drove a car.

Public Transportation    23 25 25 30 31 31 32 33 35 36 37 42
Private                  32 32 33 34 37 37 38 38 38 39 40 44

a. Find the median and the first and third quartiles for the time it took employees using public transportation. Develop a box plot for the information.
b. Find the median and the first and third quartiles for the time it took employees who drove their own vehicle. Develop a box plot for the information.
c. Compare the times of the two groups.

34. The following box plot shows the number of daily newspapers published in each state and the District of Columbia. Write a brief report summarizing the number published. Be sure to include information on the values of the first and third quartiles, the median, and whether there is any skewness. If there are any outliers, estimate their value.

[Box plot: Number of Newspapers, axis from 0 to 100, with outliers marked by asterisks.]

35. Walter Gogel Company is an industrial supplier of fasteners, tools, and springs. The amounts of its invoices vary widely, from less than $20.00 to more than $400.00. During the month of January the company sent out 80 invoices. Here is a box plot of these invoices. Write a brief report summarizing the invoice amounts. Be sure to include information on the values of the first and third quartiles, the median, and whether there is any skewness. If there are any outliers, approximate the value of these invoices.

[Box plot: Invoice Amount, axis from 0 to 250, with an outlier marked by an asterisk.]

36. The American Society of PeriAnesthesia Nurses (ASPAN; www.aspan.org) is a national organization serving nurses practicing in ambulatory surgery, preanesthesia, and postanesthesia care. The organization consists of the 40 components listed below.
State/Region                                    Membership
Alabama                                                 95
Arizona                                                399
Maryland, Delaware, DC                                 531
Connecticut                                            239
Florida                                                631
Georgia                                                384
Hawaii                                                  73
Illinois                                               562
Indiana                                                270
Iowa                                                   117
Kentucky                                               197
Louisiana                                              258
Michigan                                               411
Massachusetts                                          480
Maine                                                   97
Minnesota, Dakotas                                     289
Missouri, Kansas                                       282
Mississippi                                             90
Nebraska                                               115
North Carolina                                         542
Nevada                                                 106
New Jersey, Bermuda                                    517
Alaska, Idaho, Montana, Oregon, Washington             708
New York                                               891
Ohio                                                   708
Oklahoma                                               171
Arkansas                                                68
California                                           1,165
New Mexico                                              79
Pennsylvania                                           575
Rhode Island                                            53
Colorado                                               409
South Carolina                                         237
Texas                                                1,026
Tennessee                                              167
Utah                                                    67
Virginia                                               414
Vermont, New Hampshire                                 144
Wisconsin                                              311
West Virginia                                           62

Use statistical software to answer the following questions.
a. Find the mean, median, and standard deviation of the number of members per component.
b. Find the coefficient of skewness, using the software. What do you conclude about the shape of the distribution of component size?
c. Compute the first and third quartiles using formula (4–1).
d. Develop a box plot. Are there any outliers? Which components are outliers? What are the limits for outliers?

37. McGivern Jewelers is located in the Levis Square Mall just south of Toledo, Ohio. Recently it posted an advertisement on a social media site
reporting the shape, size, price, and cut grade for 33 of its diamonds currently in stock. The information is reported below.

Shape       Size (carats)    Price      Cut Grade
Princess        5.03         $44,312    Ideal cut
Round           2.35          20,413    Premium cut
Round           2.03          13,080    Ideal cut
Round           1.56          13,925    Ideal cut
Round           1.21           7,382    Ultra ideal cut
Round           1.21           5,154    Average cut
Round           1.19           5,339    Premium cut
Emerald         1.16           5,161    Ideal cut
Round           1.08           8,775    Ultra ideal cut
Round           1.02           4,282    Premium cut
Round           1.02           6,943    Ideal cut
Marquise        1.01           7,038    Good cut
Princess        1.00           4,868    Premium cut
Round           0.91           5,106    Premium cut
Round           0.90           3,921    Good cut
Round           0.90           3,733    Premium cut
Round           0.84           2,621    Premium cut
Round           0.77           2,828    Ultra ideal cut
Oval            0.76           3,808    Premium cut
Princess        0.71           2,327    Premium cut
Marquise        0.71           2,732    Good cut
Round           0.70           1,915    Premium cut
Round           0.66           1,885    Premium cut
Round           0.62           1,397    Good cut
Round           0.52           2,555    Premium cut
Princess        0.51           1,337    Ideal cut
Round           0.51           1,558    Premium cut
Round           0.45           1,191    Premium cut
Princess        0.44           1,319    Average cut
Marquise        0.44           1,319    Premium cut
Round           0.40           1,133    Premium cut
Round           0.35           1,354    Good cut
Round           0.32             896    Premium cut

a. Develop a box plot of the variable price and comment on the result. Are there any outliers? What is the median price? What are the values of the first and the third quartiles?
b. Develop a box plot of the variable size and comment on the result. Are there any outliers? What is the median size? What are the values of the first and the third quartiles?
c. Develop a scatter diagram between the variables price and size. Be sure to put price on the vertical axis and size on the horizontal axis. Does there seem to be an association between the two variables? Is the association direct or indirect? Does any point seem to be different from the others?
d. Develop a contingency table for the variables shape and cut grade. What is the most common cut grade? What is the most common shape? What is the most common combination of cut grade and shape?

38. Listed below is the amount of commissions earned last month for the eight members of the sales staff at Best Electronics. Calculate the coefficient of skewness using both methods. Hint: Use of a spreadsheet will expedite the calculations.

980.9 1,036.5 1,099.5 1,153.9 1,409.0 1,456.4 1,718.4 1,721.2

39. Listed below is the number of car thefts in a large city over the last week. Calculate the coefficient of skewness using both methods. Hint: Use of a spreadsheet will expedite the calculations.

3 12 13 7 8 3 8
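As an alternative to the spreadsheet suggested in the hints for Exercises 38 and 39, the two formulas for the coefficient of skewness, (4–2) and (4–3), might be sketched in Python as follows. This is only an illustration: the function names are our own, and the data are the car thefts from Exercise 39. The sample standard deviation s uses the n − 1 divisor.

```python
from statistics import mean, median, stdev


def pearson_skewness(values):
    """Formula (4-2): sk = 3 * (mean - median) / s."""
    return 3 * (mean(values) - median(values)) / stdev(values)


def software_skewness(values):
    """Formula (4-3): sk = n / ((n - 1)(n - 2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    cubed_sum = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_sum


# Number of car thefts over the last week (Exercise 39).
thefts = [3, 12, 13, 7, 8, 3, 8]

print(f"Pearson's coefficient:  {pearson_skewness(thefts):.4f}")
print(f"Software coefficient:   {software_skewness(thefts):.4f}")
```

The same two functions can be applied to the commissions in Exercise 38 by replacing the list. Formula (4–3) needs at least three observations, because it divides by (n − 2).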
40. The manager of Information Services at Wilkin Investigations, a private investigation firm, is studying the relationship between the age (in months) of a combination printer, copier, and fax machine and its monthly maintenance cost. For a sample of 15 machines, the manager developed the following chart. What can the manager conclude about the relationship between the variables?

[Chart: Monthly Maintenance Cost, $80 to $130, on the vertical axis; Months, 34 to 49, on the horizontal axis.]

41. An auto insurance company reported the following information regarding the age of a driver and the number of accidents reported last year. Develop a scatter diagram for the data and write a brief summary.

Age    Accidents    Age    Accidents
16         4        23         0
24         2        27         1
18         5        32         1
17         4        22         3

42. Wendy’s offers eight different condiments (mustard, catsup, onion, mayonnaise, pickle, lettuce, tomato, and relish) on hamburgers. A store manager collected the following information on the number of condiments ordered and the age group of the customer. What can you conclude regarding the information? Who tends to order the most or least number of condiments?

                                             Age
Number of Condiments    Under 18    18 up to 40    40 up to 60    60 or older
0                             12             18             24             52
1                             21             76             50             30
2                             39             52             40             12
3 or more                     71             87             47             28

43. Here is a table showing the number of employed and unemployed workers 20 years or older by gender in the United States.

            Number of Workers (000)
Gender      Employed    Unemployed
Men           70,415         4,209
Women         61,402         3,314

a. How many workers were studied?
b. What percent of the workers were unemployed?
c. Compare the percent unemployed for the men and the women.

DATA ANALYTICS

44. Refer to the North Valley real estate data recorded on
homes sold during the last year. Prepare a report on the selling prices of the homes based on the answers to the following questions.
a. Compute the minimum, maximum, median, and the first and the third quartiles of price. Create a box plot. Comment on the distribution of home prices.
b. Develop a scatter diagram with price on the vertical axis and the size of the home on the horizontal. Is there a relationship between these variables? Is the relationship direct or indirect?
c. For homes without a pool, develop a scatter diagram with price on the vertical axis and the size of the home on the horizontal. Do the same for homes with a pool. How do the relationships between price and size for homes without a pool and homes with a pool compare?

45. Refer to the Baseball 2016 data that report information on the 30 Major League Baseball teams for the 2016 season.
a. In the data set, the variable year opened is the first year of operation for that stadium. For each team, use this variable to create a new variable, stadium age, by subtracting the value of year opened from the current year. Develop a box plot with the new variable, stadium age. Are there any outliers? If so, which of the stadiums are outliers?
b. Using the variable salary, create a box plot. Are there any outliers? Compute the quartiles using formula (4–1). Write a brief summary of your analysis.
c. Draw a scatter diagram with the variable wins on the vertical axis and salary on the horizontal axis. What are your conclusions?
d. Using the variable wins, draw a dot plot. What can you conclude from this plot?

46. Refer to the Lincolnville School District bus data.
a. Referring to the maintenance cost variable, develop a box plot. What are the minimum, first quartile, median, third quartile, and maximum values? Are there any outliers?
b. Using the median maintenance cost, develop a contingency table with bus manufacturer as one variable and whether the maintenance cost was above or below the median as the other variable. What are your conclusions?

A REVIEW OF CHAPTERS 1–4

This section is a review of the major concepts and terms introduced in Chapters 1–4. Chapter 1 began by describing the meaning and purpose of statistics. Next we described the different types of variables and the four levels of measurement. Chapter 2 was concerned with describing a set of observations by organizing it into a frequency distribution and then portraying the frequency distribution as a histogram or a frequency polygon. Chapter 3 began by describing measures of
location, such as the mean, weighted mean, median, geometric mean, and mode. This chapter also included measures of dispersion, or spread. Discussed in this section were the range, variance, and standard deviation. Chapter 4 included several graphing techniques such as dot plots, box plots, and scatter diagrams. We also discussed the coefficient of skewness, which reports the lack of symmetry in a set of data.

Throughout this section we stressed the importance of statistical software, such as Excel and Minitab. Many computer outputs in these chapters demonstrated how quickly and effectively a large data set can be organized into a frequency distribution, several of the measures of location or measures of variation calculated, and the information presented in graphical form.

PROBLEMS

1. The duration in minutes of a sample of 50 power outages last year in the state of South Carolina is listed below.

124  14 150 289  52 156 203  82  27 248
 39  52 103  58 136 249 110 298 251 157
186 107 142 185  75 202 119 219 156  78
116 152 206 117  52 299  58 153 219 148
145 187 165 147 158 146 185 186 149 140

Use a statistical software package such as Excel or Minitab to help answer the following questions.
a. Determine the mean, median, and standard deviation.
b. Determine the first and third quartiles.
c. Develop a box plot. Are there any outliers? Do the amounts follow a symmetric distribution or are they skewed? Justify your answer.
d. Organize the durations into a frequency distribution.
e. Write a brief summary of the results in parts a to d.

2. Listed below are the 45 U.S. presidents and their age as they began their terms in office.

Number    Name                Age
 1        Washington          57
 2        J. Adams            61
 3        Jefferson           57
 4        Madison             57
 5        Monroe              58
 6        J. Q. Adams         57
 7        Jackson             61
 8        Van Buren           54
 9        W. H. Harrison      68
10        Tyler               51
11        Polk                49
12        Taylor              64
13        Fillmore            50
14        Pierce              48
15        Buchanan            65
16        Lincoln             52
17        A. Johnson          56
18        Grant               46
19        Hayes               54
20        Garfield            49
21        Arthur              50
22        Cleveland           47
23        B. Harrison         55
24        Cleveland           55
25        McKinley            54
26        T. Roosevelt        42
27        Taft                51
28        Wilson              56
29        Harding             55
30        Coolidge            51
31        Hoover              54
32        F. D. Roosevelt     51
33        Truman              60
34        Eisenhower          62
35        Kennedy             43
36        L. B. Johnson       55
37        Nixon               56
38        Ford                61
39        Carter              52
40        Reagan              69
41        G. H. W. Bush       64
42        Clinton             46
43        G. W. Bush          54
44        Obama               47
45        Trump               70

Use a statistical software package such as Excel or Minitab to help answer the following questions.
a. Determine the mean, median, and standard deviation.
b. Determine the first and third quartiles.
c. Develop a box plot. Are there any outliers? Do the amounts follow a symmetric distribution or are they skewed? Justify your answer.
d. Organize the distribution of ages into a frequency distribution.
e. Write a brief summary of the results in parts a to d.

3. Listed below is the 2014 median household income for the 50 states and the District of Columbia. https://www.census.gov/hhes/www/income/data/historical/household/

State              Amount
Alabama            42,278
Alaska             67,629
Arizona            49,254
Arkansas           44,922
California         60,487
Colorado           60,940
Connecticut        70,161
Delaware           57,522
D.C.               68,277
Florida            46,140
Georgia            49,555
Hawaii             71,223
Idaho              53,438
Illinois           54,916
Indiana            48,060
Iowa               57,810
Kansas             53,444
Kentucky           42,786
Louisiana          42,406
Maine              51,710
Maryland           76,165
Massachusetts      63,151
Michigan           52,005
Minnesota          67,244
Mississippi        35,521
Missouri           56,630
Montana            51,102
Nebraska           56,870
Nevada             49,875
New Hampshire      73,397
New Jersey         65,243
New Mexico         46,686
New York           54,310
North Carolina     46,784
North Dakota       60,730
Ohio               49,644
Oklahoma           47,199
Oregon             58,875
Pennsylvania       55,173
Rhode Island       58,633
South Carolina     44,929
South Dakota       53,053
Tennessee          43,716
Texas              53,875
Utah               63,383
Vermont            60,708
Virginia           66,155
Washington         59,068
West Virginia      39,552
Wisconsin          58,080
Wyoming            55,690

Use a statistical software package such as Excel or Minitab to help answer the following questions.
a. Determine the mean, median, and standard deviation.
b. Determine the first and third quartiles.
c. Develop a box plot. Are there any outliers? Do the amounts follow a symmetric distribution or are they skewed? Justify your answer.
d. Organize the incomes into a frequency distribution.
e. Write a brief summary of the results in parts a to d.

4. A sample of 12 homes sold last week in St. Paul, Minnesota, revealed the following information. Draw a scatter diagram. Can we conclude that, as the size of the home (reported below in thousands of square feet) increases, the selling price (reported in $ thousands) also increases?

Home Size                                 Home Size
(thousands of       Selling Price         (thousands of       Selling Price
square feet)        ($ thousands)         square feet)        ($ thousands)
    1.4                  100                  1.3                  110
    1.3                  110                  0.8                   85
    1.2                  105                  1.2                  105
    1.1                  120                  0.9                   75
    1.4                   80                  1.1                   70
    1.0                  105                  1.1                   95
5. Refer to the following diagram.

[Diagram: axis from 0 to 200, with two values marked by asterisks.]

a. What is the graph called?
b. What are the median, and first and third quartile values?
c. Is the distribution positively skewed? Tell how you know.
d. Are there any outliers? If yes, estimate these values.
e. Can you determine the number of observations in the study?

CASES

A. Century National Bank

The following case will appear in subsequent review sections. Assume that you work in the Planning Department of the Century National Bank and report to Ms. Lamberg. You will need to do some data analysis and prepare a short written report. Remember, Mr. Selig is the president of the bank, so you will want to ensure that your report is complete and accurate. A copy of the data appears in Appendix A.6.

Century National Bank has offices in several cities in the Midwest and the southeastern part of the United States. Mr. Dan Selig, president and CEO, would like to know the characteristics of his checking account customers. What is the balance of a typical customer? How many other bank services do the checking account customers use? Do the customers use the ATM service and, if so, how often? What about debit cards? Who uses them, and how often are they used?

To better understand the customers, Mr. Selig asked Ms. Wendy Lamberg, director of planning, to select a sample of customers and prepare a report. To begin, she has appointed a team from her staff. You are the head of the team and responsible for preparing the report. You select a random sample of 60 customers. In addition to the balance in each account at the end of last month, you determine (1) the number of ATM (automatic teller machine) transactions in the last month; (2) the number of other bank services (a savings account, a certificate of deposit, etc.) the customer uses; (3) whether the customer has a debit card (this is a bank service in which charges are made directly to the customer’s account); and (4) whether or not interest is paid on the checking account. The sample includes customers from the branches in Cincinnati, Ohio; Atlanta, Georgia; Louisville, Kentucky; and Erie, Pennsylvania.

1. Develop a graph or table that portrays the checking balances. What is the balance of a typical customer? Do many customers have more than $2,000 in their accounts? Does it appear that there is a difference in the distribution of the accounts among the four branches? Around what value do the account balances tend to cluster?
2. Determine the mean and median of the checking account balances. Compare the mean and the median balances for the four branches. Is there a difference among the branches? Be sure to explain the difference between the mean and the median in your report.
3. Determine the range and the standard deviation of the checking account balances. What do the first and third quartiles show? Determine the coefficient of skewness and indicate what it shows. Because Mr. Selig does not deal with statistics daily, include a brief description and interpretation of the standard deviation and other measures.
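One way the quartiles and outlier limits called for in item 3 might be located is sketched below in Python. The sketch assumes the percentile location formula (4–1), L_p = (n + 1)P/100, and the usual box plot convention that a value lying more than 1.5 interquartile ranges beyond the first or third quartile is an outlier. The balances are hypothetical, not the Appendix A.6 data, and the function name is our own.

```python
def percentile(values, p):
    """Locate the p-th percentile using L_p = (n + 1) * p / 100 (formula 4-1)."""
    data = sorted(values)
    n = len(data)
    location = (n + 1) * p / 100          # 1-based position, may be fractional
    location = min(max(location, 1), n)   # keep the position inside the data
    whole = int(location)                 # position of the lower neighbor
    fraction = location - whole           # distance toward the upper neighbor
    if whole == n:
        return float(data[-1])
    return data[whole - 1] + fraction * (data[whole] - data[whole - 1])


# Hypothetical checking account balances in dollars (NOT the Appendix A.6 data).
balances = [740, 1120, 1340, 1500, 1590, 1620, 1760, 1890, 2080, 2410]

q1 = percentile(balances, 25)
med = percentile(balances, 50)
q3 = percentile(balances, 75)
iqr = q3 - q1

# Assumed convention: outliers lie more than 1.5 * IQR beyond Q1 or Q3.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [b for b in balances if b < lower_limit or b > upper_limit]

print(f"Q1 = {q1:.2f}, median = {med:.2f}, Q3 = {q3:.2f}")
print(f"Outlier limits: {lower_limit:.2f} to {upper_limit:.2f}")
print("Outliers:", outliers)
```

With 10 observations the first quartile sits at position 2.75, so the function moves three-quarters of the way from the second to the third ordered value. Statistical software may use a different interpolation rule and report slightly different quartiles.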
B. Wildcat Plumbing Supply Inc.: Do We Have Gender Differences?

Wildcat Plumbing Supply has served the plumbing needs of Southwest Arizona for more than 40 years. The company was founded by Mr. Terrence St. Julian and is run today by his son Cory. The company has grown from a handful of employees to more than 500 today. Cory is concerned about several positions within the company where he has men and women doing essentially the same job but at different pay. To investigate, he collected the information below. Suppose you are a student intern in the Accounting Department and have been given the task to write a report summarizing the situation.

Yearly Salary ($000)    Women    Men
Less than 30                2      0
30 up to 40                 3      1
40 up to 50                17      4
50 up to 60                17     24
60 up to 70                 8     21
70 up to 80                 3      7
80 or more                  0      3

To kick off the project, Mr. Cory St. Julian held a meeting with his staff and you were invited. At this meeting, it was suggested that you calculate several measures of location, create charts or draw graphs such as a cumulative frequency distribution, and determine the quartiles for both men and women. Develop the charts and write the report summarizing the yearly salaries of employees at Wildcat Plumbing Supply. Does it appear that there are pay differences based on gender?

C. Kimble Products: Is There a Difference In the Commissions?

At the January national sales meeting, the CEO of Kimble Products was questioned extensively regarding the company policy for paying commissions to its sales representatives. The company sells sporting goods to two major markets. There are 40 sales representatives who call directly on large-volume customers, such as the athletic departments at major colleges and universities and professional sports franchises. There are 30 sales representatives who represent the company to retail stores located in shopping malls and large discounters such as Kmart and Target.

Upon his return to corporate headquarters, the CEO asked the sales manager for a report comparing the commissions earned last year by the two parts of the sales team. The information is reported below. Write a brief report. Would you conclude that there is a difference? Be sure to include information in the report on both the central tendency and dispersion of the two groups.

Commissions Earned by Sales Representatives Calling on Large Retailers ($)

1,116   681 1,294    12   754 1,206 1,448   870   944 1,255
1,213 1,291   719   934 1,313 1,083   899   850   886 1,556
  886 1,315 1,858 1,262 1,338 1,066   807 1,244   758   918
Commissions Earned by Sales Representatives Calling on Athletic Departments ($)

  354    87 1,676 1,187    69 3,202   680    39 1,683 1,106
  883 3,140   299 2,197   175   159 1,105   434   615   149
1,168   278   579     7   357   252 1,602 2,321     4   392
  416   427 1,738   526    13 1,604   249   557   635   527

PRACTICE TEST

There is a practice test at the end of each review section. The tests are in two parts. The first part contains several objective questions, usually in a fill-in-the-blank format. The second part is problems. In most cases, it should take 30 to 45 minutes to complete the test. The problems require a calculator. Check the answers in the Answer Section in the back of the book.

Part 1: Objective

1. The science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions is called ______.
2. Methods of organizing, summarizing, and presenting data in an informative way are called ______.
3. The entire set of individuals or objects of interest or the measurements obtained from all individuals or objects of interest are called the ______.
4. List the two types of variables.
5. The number of bedrooms in a house is an example of a ______. (discrete variable, continuous variable, qualitative variable; pick one)
6. The jersey numbers of Major League Baseball players are an example of what level of measurement?
7. The classification of students by eye color is an example of what level of measurement?
8. The sum of the differences between each value and the mean is always equal to what value?
9. A set of data contained 70 observations. How many classes would the 2^k method suggest to construct a frequency distribution?
10. What percent of the values in a data set are always larger than the median?
11. The square of the standard deviation is the ______.
12. The standard deviation assumes a negative value when ______. (all the values are negative, at least half the values are negative, or never; pick one)
13. Which of the following is least affected by an outlier? (mean, median, or range; pick one)

Part 2: Problems

1. The Russell 2000 index of stock prices increased by the following amounts over the last 3 years.

18%   4%   2%

What is the geometric mean increase for the 3 years?

2. The information below refers to the selling prices ($000) of homes sold in Warren, Pennsylvania, during 2016.

Selling Price ($000)     Frequency
120.0 up to 150.0             4
150.0 up to 180.0            18
180.0 up to 210.0            30
210.0 up to 240.0            20
240.0 up to 270.0            17
270.0 up to 300.0            10
300.0 up to 330.0             6

a. What is the class interval?
b. How many homes were sold in 2016?
c. How many homes sold for less than $210,000?
d. What is the relative frequency of the 210 up to 240 class?
e. What is the midpoint of the 150 up to 180 class?
f. The selling prices range between what two amounts?

3. A sample of eight college students revealed they owned the following number of CDs.

52 76 64 79 80 74 66 69

a. What is the mean number of CDs owned?
b. What is the median number of CDs owned?
c. What is the 40th percentile?
d. What is the range of the number of CDs owned?
e. What is the standard deviation of the number of CDs owned?

4. An investor purchased 200 shares of the Blair Company for $36 each in July of 2013, 300 shares at $40 each in September 2015, and 500 shares at $50 each in January 2016. What is the investor’s weighted mean price per share?

5. During the 50th Super Bowl, 30 million pounds of snack food were eaten. The chart below depicts this information.

[Chart of snack food shares]
Snack Nuts        8%
Potato Chips     37%
Tortilla Chips   28%
Pretzels         14%
Popcorn          13%

a. What is the name given to this graph?
b. Estimate, in millions of pounds, the amount of potato chips eaten during the game.
c. Estimate the relationship of potato chips to popcorn. (twice as much, half as much, three times, none of these; pick one)
d. What percent of the total do potato chips and tortilla chips comprise?

LEARNING OBJECTIVES
When you have completed this chapter, you will be able to:
  • 99. LO2-1 Summarize qualitative variables with frequency and relative frequency tables. LO2-2 Display a frequency table using a bar or pie chart. LO2-3 Summarize quantitative variables with frequency and relative frequency distributions. LO2-4 Display a frequency distribution using a histogram or frequency polygon. MERRILL LYNCH recently completed a study of online investment portfolios for a sample of clients. For the 70 participants in the study, organize these data into a frequency distribution. (See Exercise 43 and LO2-3.) Describing Data: FREQUENCY TABLES, FREQUENCY DISTRIBUTIONS, AND GRAPHIC PRESENTATION2 © rido/123RF DESCRIBING DATA: FREQUENCY TABLES, FREQUENCY DISTRIBUTIONS, AND GRAPHIC PRESENTATION 19 INTRODUCTION The United States automobile retailing industry is highly competitive. It is dominated by megadealerships that own and operate 50 or more franchises, employ over 10,000 people, and generate several billion dollars in annual sales.
  • 100. Many of the top dealerships are publicly owned with shares traded on the New York Stock Exchange or NASDAQ. In 2014, the largest megadealership was AutoNation (ticker symbol AN), followed by Penske Auto Group (PAG), Group 1 Automotive, Inc. (ticker symbol GPI), and the privately owned Van Tuyl Group. These large corporations use statistics and analytics to summarize and analyze data and information to support their decisions.
As an example, we will look at the Applewood Auto Group. It owns four dealerships and sells a wide range of vehicles. These include the popular Korean brands Kia and Hyundai, BMW and Volvo sedans and luxury SUVs, and a full line of Ford and Chevrolet cars and trucks.
Ms. Kathryn Ball is a member of the senior management team at Applewood Auto Group, which has its corporate offices adjacent to Kane Motors. She is responsible for tracking and analyzing vehicle sales and the profitability of those vehicles. Kathryn would like to summarize the profit earned on the vehicles sold with tables, charts, and graphs that she would review monthly. She wants to know the profit per vehicle sold, as well as the lowest and highest amount of profit. She is also interested in describing the demographics of the buyers. What are their ages? How many vehicles have they previously purchased from one of the Applewood dealerships? What type of vehicle did they purchase?
  • 101. The Applewood Auto Group operates four dealerships:
• Tionesta Ford Lincoln sells Ford and Lincoln cars and trucks.
• Olean Automotive Inc. has the Nissan franchise as well as the General Motors brands of Chevrolet, Cadillac, and GMC Trucks.
• Sheffield Motors Inc. sells Buick, GMC trucks, Hyundai, and Kia.
• Kane Motors offers the Chrysler, Dodge, and Jeep line as well as BMW and Volvo.
Every month, Ms. Ball collects data from each of the four dealerships and enters them into an Excel spreadsheet. Last month the Applewood Auto Group sold 180 vehicles at the four dealerships. A copy of the first few observations appears to the left. The variables collected include:
• Age: the age of the buyer at the time of the purchase.
• Profit: the amount earned by the dealership on the sale of each vehicle.
• Location: the dealership where the vehicle was purchased.
• Vehicle type: SUV, sedan, compact, hybrid, or truck.
• Previous: the number of vehicles previously purchased at any of the four Applewood dealerships by the consumer.
  • 102. The entire data set is available at the McGraw-Hill website (www.mhhe.com/lind17e) and in Appendix A.4 at the end of the text.
© Justin Sullivan/Getty Images
CONSTRUCTING FREQUENCY TABLES
Recall from Chapter 1 that techniques used to describe a set of data are called descriptive statistics. Descriptive statistics organize data to show the general pattern of the data, to identify where values tend to concentrate, and to expose extreme or unusual data values. The first technique we discuss is a frequency table.
LO2-1 Summarize qualitative variables with frequency and relative frequency tables.
FREQUENCY TABLE A grouping of qualitative data into mutually exclusive and collectively exhaustive classes showing the number of observations in each class.
20 CHAPTER 2
In Chapter 1, we distinguished between qualitative and quantitative variables. To review, a qualitative variable is nonnumeric, that is, it can only be classified into distinct categories. Examples of qualitative data include political affiliation (Republican, Democrat, Independent, or other), state of birth (Alabama, . . . , Wyoming), and method of payment for a purchase at Barnes & Noble (cash, digital wallet, debit, or credit).
  • 103. On the other hand, quantitative variables are numerical in nature. Examples of quantitative data relating to college students include the price of their textbooks, their age, and the number of credit hours they are registered for this semester.
In the Applewood Auto Group data set, there are five variables for each vehicle sale: age of the buyer, amount of profit, dealer that made the sale, type of vehicle sold, and number of previous purchases by the buyer. The dealer and the type of vehicle are qualitative variables. The amount of profit, the age of the buyer, and the number of previous purchases are quantitative variables.
Suppose Ms. Ball wants to summarize last month’s sales by location. The first step is to sort the vehicles sold last month according to their location and then tally, or count, the number sold at each of the four locations: Tionesta, Olean, Sheffield, or Kane. The four locations are used to develop a frequency table with four mutually exclusive (distinctive) classes. Mutually exclusive classes means that a particular vehicle can be assigned to only one class. In addition, the frequency table must be collectively exhaustive.
  • 104. That is, every vehicle sold last month is accounted for in the table. If every vehicle is included in the frequency table, the table will be collectively exhaustive and the total number of vehicles will be 180. How do we obtain these counts? Excel provides a tool called a Pivot Table that will quickly and accurately establish the four classes and do the counting. The Excel results follow in Table 2–1. The table shows a total of 180 vehicles and, of the 180 vehicles, 52 were sold at Kane Motors.
© Steve Cole/Getty Images RF
TABLE 2–1 Frequency Table for Vehicles Sold Last Month at Applewood Auto Group by Location
Location      Number of Cars
Kane                52
Olean               40
Sheffield           45
Tionesta            43
Total              180
Relative Class Frequencies
You can convert class frequencies to relative class frequencies to show the fraction of the total number of observations in each class. A relative frequency captures the relationship between a class frequency and the total number of observations. In the vehicle sales example, we may want to know the percentage of total cars sold at each of the four locations.
  • 105. To convert a frequency table to a relative frequency table, each of the class frequencies is divided by the total number of observations. Again, this is easily accomplished using Excel. The fraction of vehicles sold last month at the Kane location is 0.289, found by 52 divided by 180. The relative frequency for each location is shown in Table 2–2.
TABLE 2–2 Relative Frequency Table of Vehicles Sold by Location Last Month at Applewood Auto Group
Location      Number of Cars   Relative Frequency   Found by
Kane                52               .289            52/180
Olean               40               .222            40/180
Sheffield           45               .250            45/180
Tionesta            43               .239            43/180
Total              180              1.000
GRAPHIC PRESENTATION OF QUALITATIVE DATA
The most common graphic form to present a qualitative variable is a bar chart. In most cases, the horizontal axis shows the variable of interest. The vertical axis shows the frequency or fraction of each of the possible outcomes. A distinguishing feature of a bar chart is there is distance or a gap between the bars. That is, because the variable of interest is qualitative, the bars are not adjacent to each other. Thus, a bar chart graphically describes a frequency table using a series of uniformly wide rectangles, where the height of each rectangle is the class frequency.
  • 106. LO2-2 Display a frequency table using a bar or pie chart.
BAR CHART A graph that shows qualitative classes on the horizontal axis and the class frequencies on the vertical axis. The class frequencies are proportional to the heights of the bars.
PIE CHART A chart that shows the proportion or percentage that each class represents of the total number of frequencies.
We use the Applewood Auto Group data as an example (Chart 2–1). The variables of interest are the location where the vehicle was sold and the number of vehicles sold at each location. We label the horizontal axis with the four locations and scale the vertical axis with the number sold. The variable location is of nominal scale, so the order of the locations on the horizontal axis does not matter. In Chart 2–1, the locations are listed alphabetically. The locations could also be in order of decreasing or increasing frequencies. The height of the bars, or rectangles, corresponds to the number of vehicles at
  • 107. each location. There were 52 vehicles sold last month at the Kane location, so the height of the Kane bar is 52; the height of the bar for the Olean location is 40.
[CHART 2–1 Number of Vehicles Sold by Location: bar chart with Location (Kane, Olean, Sheffield, Tionesta) on the horizontal axis and Number of Vehicles Sold (0 to 50) on the vertical axis]
  • 108. Another useful type of chart for depicting qualitative information is a pie chart. We explain the details of constructing a pie chart using the information in Table 2–3, which shows the frequency and percent of cars sold by the Applewood Auto Group for each vehicle type.
The first step to develop a pie chart is to mark the percentages 0, 5, 10, 15, and so on evenly around the circumference of a circle (see Chart 2–2). To plot the 40% of total sales represented by sedans, draw a line from the center of the circle to 0 and another line from the center of the circle to 40%. The area in this “slice” represents the number of sedans sold as a percentage of the total sales. Next, add the SUV’s percentage of total sales, 30%, to the sedan’s percentage of total sales, 40%. The result is 70%. Draw a line from the center of the circle to 70%, so the area between 40 and 70 shows the sales of SUVs as a percentage of total sales. Continuing, add the 15% of total sales for compact vehicles, which gives us a total of 85%. Draw a line
  • 109. from the center of the circle to 85%, so the “slice” between 70% and 85% represents the number of compact vehicles sold as a percentage of the total sales. The remaining 10% for truck sales and 5% for hybrid sales are added to the chart using the same method.
TABLE 2–3 Vehicle Sales by Type at Applewood Auto Group
Vehicle Type   Number Sold   Percent Sold
Sedan               72            40
SUV                 54            30
Compact             27            15
Truck               18            10
Hybrid               9             5
Total              180           100
[CHART 2–2 Pie Chart of Vehicles by Type: slices for Sedan, SUV, Compact, Truck, and Hybrid, with percentages marked around the circumference]
  • 110. Because each slice of the pie represents the relative frequency of each vehicle type as a percentage of the total sales, we can easily compare them:
• The largest percentage of sales is for sedans.
• Sedans and SUVs together account for 70% of vehicle sales.
• Hybrids account for 5% of vehicle sales, in spite of being on the market for only a few years.
We can use Excel software to quickly count the number of cars for each vehicle type and create the frequency table, bar chart, and pie chart shown in the following summary. The Excel tool is called a Pivot Table. The instructions to produce these descriptive statistics and charts are given in Appendix C.
Pie and bar charts both serve to illustrate frequency and relative
  • 111. frequency tables. When is a pie chart preferred to a bar chart? In most cases, pie charts are used to show and compare the relative differences in the percentage of observations for each value or class of a qualitative variable. Bar charts are preferred when the goal is to compare the number or frequency of observations for each value or class of a qualitative variable. The following Example/Solution shows another application of bar and pie charts.
EXAMPLE
SkiLodges.com is test marketing its new website and is interested in how easy its website design is to navigate. It randomly selected 200 regular Internet users and asked them to perform a search task on the website. Each person was asked to rate the relative ease of navigation as poor, good, excellent, or awesome. The results are shown in the following table:
  • 112.
Awesome      102
Excellent     58
Good          30
Poor          10
1. What type of measurement scale is used for ease of navigation?
2. Draw a bar chart for the survey results.
3. Draw a pie chart for the survey results.
SOLUTION
The data are measured on an ordinal scale. That is, the scale is ranked in relative ease of navigation when moving from “awesome” to “poor.” The interval between each rating is unknown so it is impossible, for example, to conclude that a rating of good is twice the value of a poor rating.
We can use a bar chart to graph the data. The vertical scale shows the relative frequency and the horizontal scale shows the values of the ease-of-navigation variable.
  • 113. [Bar chart: Ease of Navigation of SkiLodges.com website, with Ease of Navigation (Awesome, Excellent, Good, Poor) on the horizontal axis and Relative Frequency % (0 to 60) on the vertical axis]
A pie chart can also be used to graph these data. The pie chart emphasizes that more than half of the respondents rate the relative ease of using the website awesome.
  • 114. [Pie chart: Ease of Navigation of SkiLodges.com website, with Awesome 51%, Excellent 29%, Good 15%, Poor 5%]
  • 115. SELF-REVIEW 2–1
The answers are in Appendix E.
DeCenzo Specialty Food and Beverage Company has been serving a cola drink with an additional flavoring, Cola-Plus, that is very popular among its customers. The company is interested in customer preferences for Cola-Plus versus Coca-Cola, Pepsi, and a lemon-lime beverage. They ask 100 randomly sampled customers to take a taste test and select the beverage they prefer most. The results are shown in the following table:
Beverage       Number
Cola-Plus        40
Coca-Cola        25
Pepsi            20
Lemon-Lime       15
Total           100
  • 116. (a) Is the data qualitative or quantitative? Why?
(b) What is the table called? What does it show?
(c) Develop a bar chart to depict the information.
(d) Develop a pie chart using the relative frequencies.
The answers to the odd-numbered exercises are at the end of the book in Appendix D.
1. A pie chart shows the relative market share of cola products. The “slice” for Pepsi-Cola has a central angle of 90 degrees. What is its market
  • 117. share? 2. In a marketing study, 100 consumers were asked to select the best digital music player from the iPod, the iRiver, and the Magic Star MP3. To summarize the con- sumer responses with a frequency table, how many classes would the frequency table have? 3. A total of 1,000 residents in Minnesota were asked which season they preferred. One hundred liked winter best, 300 liked spring, 400 liked summer, and 200 liked fall. Develop a frequency table and a relative frequency table to summarize this information. 4. Two thousand frequent business travelers are asked which midwestern city they prefer: Indianapolis, Saint Louis, Chicago, or Milwaukee. One hundred liked India- napolis best, 450 liked Saint Louis, 1,300 liked Chicago, and the remainder pre- ferred Milwaukee. Develop a frequency table and a relative
  • 118. frequency table to summarize this information. 5. Wellstone Inc. produces and markets replacement covers for cell phones in five different colors: bright white, metallic black, magnetic lime, tangerine orange, and fusion red. To estimate the demand for each color, the company set up a kiosk in the Mall of America for several hours and asked randomly selected people which cover color was their favorite. The results follow: E X E R C I S E S Bright white 130 Metallic black 104 Magnetic lime 325 Tangerine orange 455 Fusion red 286 a. What is the table called? b. Draw a bar chart for the table. c. Draw a pie chart. d. If Wellstone Inc. plans to produce 1 million cell phone
  • 119. covers, how many of each color should it produce? 6. A small business consultant is investigating the performance of several companies. The fourth-quarter sales for last year (in thousands of dollars) for the selected com- panies were: Fourth-Quarter Sales Company ($ thousands) Hoden Building Products $ 1,645.2 J & R Printing Inc. 4,757.0 Long Bay Concrete Construction 8,913.0 Mancell Electric and Plumbing 627.1 Maxwell Heating and Air Conditioning 24,612.0 Mizelle Roofing & Sheet Metals 191.9 The consultant wants to include a chart in his report comparing the sales of the six companies. Use a bar chart to compare the fourth-quarter sales of these corpora- tions and write a brief report summarizing the bar chart.
  • 120. CONSTRUCTING FREQUENCY DISTRIBUTIONS
In Chapter 1 and earlier in this chapter, we distinguished between qualitative and quantitative data. In the previous section, using the Applewood Auto Group data, we summarized two qualitative variables: the location of the sale and the type of vehicle sold. We created frequency and relative frequency tables and depicted the results in bar and pie charts.
The Applewood Auto Group data also include several quantitative variables: the age of the buyer, the profit earned on the sale of the vehicle, and the number of previous purchases. Suppose Ms. Ball wants to summarize last month’s sales by profit earned for each vehicle. We can describe profit using a frequency distribution.
  • 121. LO2-3 Summarize quantitative variables with frequency and relative frequency distributions.
FREQUENCY DISTRIBUTION A grouping of quantitative data into mutually exclusive and collectively exhaustive classes showing the number of observations in each class.
How do we develop a frequency distribution? The following example shows the steps to construct a frequency distribution. Remember, our goal is to construct tables, charts, and graphs that will quickly summarize the data by showing the location, extreme values, and shape of the data’s distribution.
EXAMPLE
Ms. Kathryn Ball of the Applewood Auto Group wants to summarize the quantitative variable profit with a frequency distribution and display the distribution with charts and graphs. With this information, Ms. Ball can easily answer the following questions: What is the typical profit on each sale? What is the largest or maximum profit on any sale? What is the smallest or minimum profit on any sale? Around what value do the profits tend to cluster?
  • 122. TABLE 2–4 Profit on Vehicles Sold Last Month by the Applewood Auto Group
$1,387  $2,148  $2,201  $  963  $  820  $2,230  $3,043  $2,584  $2,370
 1,754   2,207     996   1,298   1,266   2,341   1,059   2,666   2,637
 1,817   2,252   2,813   1,410   1,741   3,292   1,674   2,991   1,426
 1,040   1,428     323   1,553   1,772   1,108   1,807     934   2,944
 1,273   1,889     352   1,648   1,932   1,295   2,056   2,063   2,147
 1,529   1,166     482   2,071   2,350   1,344   2,236   2,083   1,973
 3,082   1,320   1,144   2,116   2,422   1,906   2,928   2,856   2,502
 1,951   2,265   1,485   1,500   2,446   1,952   1,269   2,989     783
 2,692   1,323   1,509   1,549     369   2,070   1,717     910   1,538
 1,206   1,760   1,638   2,348     978   2,454   1,797   1,536   2,339
 1,342   1,919   1,961   2,498   1,238   1,606   1,955   1,957   2,700
   443   2,357   2,127     294   1,818   1,680   2,199   2,240   2,222
   754   2,866   2,430   1,115   1,824   1,827   2,482   2,695   2,597
 1,621     732   1,704   1,124   1,907   1,915   2,701   1,325   2,742
   870   1,464   1,876   1,532   1,938   2,084   3,210   2,250   1,837
 1,174   1,626   2,010   1,688   1,940   2,639     377   2,279   2,842
 1,412   1,762   2,165   1,822   2,197     842   1,220   2,626   2,434
 1,809   1,915   2,231   1,897   2,646   1,963   1,401   1,501   1,640
 2,415   2,119   2,389   2,445   1,461   2,059   2,175   1,752   1,821
 1,546   1,766     335   2,886   1,731   2,338   1,118   2,058   2,487
  • 123. SOLUTION
To begin, we need the profits for each of the 180 vehicle sales listed in Table 2–4. This information is called raw or ungrouped data because it is simply a listing of the individual, observed profits. It is possible to search the list and find the smallest or minimum profit ($294) and the largest or maximum
  • 124. profit ($3,292), but that is about all. It is difficult to determine a typical profit or to visualize where the profits tend to cluster. The raw data are more easily interpreted if we summarize the data with a frequency distribution. The steps to create this frequency distribution follow.
Step 1: Decide on the number of classes. A useful recipe to determine the number of classes (k) is the “2 to the k rule.” This guide suggests you select the smallest number (k) for the number of classes such that 2^k (in words, 2 raised to the power of k) is greater than the number of observations (n). In the Applewood Auto Group example, there were 180 vehicles sold. So n = 180. If we try k = 7, which means we would use 7 classes, 2^7 = 128, which is less than 180. Hence, 7 is too few classes. If we let k = 8, then 2^8 = 256, which is greater than 180. So the recommended number of classes is 8.
  • 125. Step 2: Determine the class interval. Generally, the class interval is the same for all classes. The classes all taken together must cover at least the distance from the minimum value in the data up to the maximum value. Expressing these words in a formula:
\[ i \ge \frac{\text{Maximum Value} - \text{Minimum Value}}{k} \]
where i is the class interval, and k is the number of classes.
For the Applewood Auto Group, the minimum value is $294 and the maximum value is $3,292. If we need 8 classes, the interval should be:
\[ i \ge \frac{\text{Maximum Value} - \text{Minimum Value}}{k} = \frac{\$3{,}292 - \$294}{8} = \$374.75 \]
In practice, this interval size is usually rounded up to some convenient number, such as a multiple of 10 or 100. The value of $400 is a reasonable choice.
  • 126. Step 3: Set the individual class limits. State clear class limits so you can put each observation into only one category. This means you must avoid overlapping or unclear class limits. For example, classes such as “$1,300–$1,400” and “$1,400–$1,500” should not be used because it is not clear whether the value $1,400 is in the first or second class. In this text, we will generally use the format $1,300 up to $1,400 and $1,400 up to $1,500 and so on. With this format, it is clear that $1,399 goes into the first class and $1,400 in the second.
  • 127. Because we always round the class interval up to get a convenient class size, we cover a larger than necessary range. For example, using 8 classes with an interval of $400 in the Applewood Auto Group example results in a range of 8($400) = $3,200. The actual range is $2,998, found by ($3,292 − $294). Comparing that value to $3,200, we have an excess of $202. Because we need to cover only the range (Maximum − Minimum), it is natural to put approximately equal amounts of the excess in each of the two tails. Of course, we also should select convenient class limits. A guideline is to make the lower limit of the first class a multiple of the class interval. Sometimes this is not possible, but the lower limit should at least be rounded. So here are the classes we could use for these data.
  • 128.
Classes
$  200 up to $  600
   600 up to  1,000
 1,000 up to  1,400
 1,400 up to  1,800
 1,800 up to  2,200
 2,200 up to  2,600
 2,600 up to  3,000
 3,000 up to  3,400
Step 4: Tally the vehicle profit into the classes and determine the number of observations in each class. To begin, the profit from the sale of the first vehicle in Table 2–4 is $1,387. It is tallied in the $1,000 up to $1,400 class. The second profit in the first row of Table 2–4 is $2,148. It is tallied in the $1,800 up to $2,200 class. The other profits are tallied in a similar manner. When all the profits are tallied, the table would appear as:
Profit                   Frequency
$  200 up to $  600      |||| |||
   600 up to  1,000      |||| |||| |
 1,000 up to  1,400      |||| |||| |||| |||| |||
 1,400 up to  1,800      |||| |||| |||| |||| |||| |||| |||| |||
 1,800 up to  2,200      |||| |||| |||| |||| |||| |||| |||| |||| ||||
 2,200 up to  2,600      |||| |||| |||| |||| |||| |||| ||
 2,600 up to  3,000      |||| |||| |||| ||||
 3,000 up to  3,400      ||||
  • 129. The number of observations in each class is called the class frequency. In the $200 up to $600 class there are 8 observations, and in the $600 up to $1,000 class there are 11 observations. Therefore, the class frequency in the first class is 8 and the class frequency in the second class is 11. There are a total of 180 observations in the entire set of data. So the sum of all the frequencies should be equal to 180. The results of the frequency distribution are in Table 2–5.
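Steps 1 through 4 can be sketched in a few lines of Python. This is a minimal illustration, not the Excel Pivot Table procedure the text relies on. The function names are hypothetical, the class limits (a $200 lower limit and a $400 interval) are the ones chosen in Steps 2 and 3, and only the nine profits in the first row of Table 2–4 are tallied to keep the example short.

```python
def classes_by_2k_rule(n):
    """Return the smallest k such that 2**k is greater than n observations."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


def minimum_class_interval(minimum, maximum, k):
    """Smallest interval that lets k equal classes span minimum to maximum."""
    return (maximum - minimum) / k


def tally(values, lower_limit, interval, k):
    """Count values per class; each class runs 'lower up to lower + interval'."""
    counts = [0] * k
    for value in values:
        index = int((value - lower_limit) // interval)
        if not 0 <= index < k:
            raise ValueError(f"{value} falls outside the classes")
        counts[index] += 1
    return counts


k = classes_by_2k_rule(180)                    # 8, because 2**8 = 256 > 180
i_min = minimum_class_interval(294, 3292, k)   # 374.75, rounded up to 400 in the text

# First row of Table 2-4 only (nine of the 180 profits).
first_row = [1387, 2148, 2201, 963, 820, 2230, 3043, 2584, 2370]
counts = tally(first_row, lower_limit=200, interval=400, k=k)

for j, count in enumerate(counts):
    lower = 200 + 400 * j
    print(f"${lower:,} up to ${lower + 400:,}: {count}")
```

Because the lower limit of each class is inclusive and the upper limit is exclusive, a profit of exactly $1,400 lands in the $1,400 up to $1,800 class, which matches the "up to" convention from Step 3.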
  • 130.
TABLE 2–5 Frequency Distribution of Profit for Vehicles Sold Last Month at Applewood Auto Group
Profit                   Frequency
$  200 up to $  600          8
   600 up to  1,000         11
 1,000 up to  1,400         23
 1,400 up to  1,800         38
 1,800 up to  2,200         45
 2,200 up to  2,600         32
 2,600 up to  3,000         19
 3,000 up to  3,400          4
Total                      180
Now that we have organized the data into a frequency distribution (see Table 2–5), we can summarize the profits of the vehicles for the Applewood Auto Group. Observe the following:
1. The profits from vehicle sales range between $200 and $3,400.
2. The vehicle profits are classified using a class interval of $400. The class interval is determined by subtracting consecutive lower or upper class limits. For example, the lower limit of the first class is $200, and the lower limit of the second class is $600. The difference is the class interval of $400.
  • 131. 3. The profits are concentrated between $1,000 and $3,000. The profit on 157 vehicles, or 87%, was within this range.
4. For each class, we can determine the typical profit or class midpoint. It is halfway between the lower or upper limits of two consecutive classes. It is computed by adding the lower or upper limits of consecutive classes and dividing by 2. Referring to Table 2–5, the lower class limit of the first class is $200, and the next class limit is $600. The class midpoint is $400, found by ($600 + $200)/2. The midpoint best represents, or is typical of, the profits of the vehicles in that class. Applewood sold 8 vehicles with a typical profit of $400.
  • 132. 5. The largest concentration, or highest frequency, of vehicles sold is in the $1,800 up to $2,200 class. There are 45 vehicles in this class. The class midpoint is $2,000. So we say that the typical profit in the class with the highest frequency is $2,000.
By presenting this information to Ms. Ball, we give her a clear picture of the distribution of the vehicle profits for last month.
We admit that arranging the information on profits into a frequency distribution does result in the loss of some detailed information. That is, by
  • 133. organizing the data into a frequency distribution, we cannot pinpoint the exact profit on any vehicle, such as $1,387, $2,148, or $2,201. Further, we cannot tell that the actual minimum profit for any vehicle sold is $294 or that the maximum profit was $3,292. However, the lower limit of the first class and the upper limit of the last class convey essentially the same meaning. Likely, Ms. Ball will make the same judgment if she knows the smallest profit is about $200 that she will if she knows the exact profit is $294. The advantages of summarizing the 180 profits into a more understandable and organized form more than offset this disadvantage.
TABLE 2–6 Adjusted Gross Income for Individuals Filing Income Tax Returns
Adjusted Gross Income             Number of Returns (in thousands)
No adjusted gross income                     178.2
$        1 up to      5,000                1,204.6
     5,000 up to     10,000                2,595.5
    10,000 up to     15,000                3,142.0
    15,000 up to     20,000                3,191.7
    20,000 up to     25,000                2,501.4
    25,000 up to     30,000                1,901.6
    30,000 up to     40,000                2,502.3
    40,000 up to     50,000                1,426.8
    50,000 up to     75,000                1,476.3
    75,000 up to    100,000                  338.8
   100,000 up to    200,000                  223.3
   200,000 up to    500,000                   55.2
   500,000 up to  1,000,000                   12.0
 1,000,000 up to  2,000,000                    5.1
 2,000,000 up to 10,000,000                    3.4
10,000,000 or more                             0.6
  • 134. When we summarize raw data with frequency distributions, equal class intervals are preferred. However, in certain situations unequal class intervals may be necessary to avoid a large number of classes with very small frequencies. Such is the case in Table 2–6. The U.S. Internal Revenue Service uses unequal-sized class intervals for adjusted gross income on individual tax returns to summarize the number of
  • 135. individual tax returns.
STATISTICS IN ACTION
In 1788, James Madison, John Jay, and Alexander Hamilton anonymously published a series of essays entitled The Federalist. These Federalist papers were an attempt to convince the people of New York that they should ratify the Constitution. In the course of history, the authorship of most of these papers became known, but 12 remained contested. Through the use of statistical analysis, and particularly studying the frequency distributions of various words, we can now conclude that James Madison is the likely author of the 12 papers. In fact, the statistical evidence that Madison is the author is overwhelming.
  • 136. If we use our method to find equal class intervals, the 2^k rule results in 25 classes, and a class interval of $400,000, assuming $0 and $10,000,000 as the minimum and maximum values for adjusted gross income. Using equal class intervals, the first 13 classes in Table 2–6 would be combined into one class of about 99.9% of all tax returns and 24 classes for the 0.1% of the returns with an adjusted gross income above $400,000. Using equal class intervals does not provide a good understanding of the raw data. In this case, good judgment in the use of unequal class intervals, as demonstrated in Table 2–6, is required to show the distribution of the number of tax returns filed, especially for incomes under $500,000.
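The arithmetic behind this argument can be checked with a short Python sketch. It assumes the return counts listed in Table 2–6 (in thousands) and the $0 to $10,000,000 range stated above. The variable and function names are illustrative only.

```python
# Number of returns (in thousands) for each class of Table 2-6, in table order.
returns_thousands = [
    178.2, 1204.6, 2595.5, 3142.0, 3191.7, 2501.4, 1901.6, 2502.3, 1426.8,
    1476.3, 338.8, 223.3, 55.2, 12.0, 5.1, 3.4, 0.6,
]


def classes_by_2k_rule(n):
    """Return the smallest k such that 2**k is greater than n observations."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


total_returns = round(sum(returns_thousands) * 1000)   # about 20.8 million returns
k = classes_by_2k_rule(total_returns)                   # 25, because 2**25 > 20.8 million
interval = (10_000_000 - 0) / k                         # 400,000 dollars per class

# The first 13 classes of Table 2-6 end at $500,000.
share_first_13 = sum(returns_thousands[:13]) / sum(returns_thousands)

print(k, interval, round(100 * share_first_13, 1))
```

The share comes out near 99.9%, so almost every return would fall into the single lowest equal-width class, which is the reason the text gives for preferring unequal intervals here.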
  • 137. SELF-REVIEW 2–2
In the first quarter of last year, the 11 members of the sales staff at Master Chemical Company earned the following commissions:
$1,650 $1,475 $1,510 $1,670 $1,595 $1,760 $1,540 $1,495 $1,590 $1,625 $1,510
(a) What are the values such as $1,650 and $1,475 called?
(b) Using $1,400 up to $1,500 as the first class, $1,500 up to $1,600 as the second class, and so forth, organize the quarterly commissions into a frequency distribution.
(c) What are the numbers in the right column of your frequency distribution called?
(d) Describe the distribution of quarterly commissions, based on the frequency distribution. What is the largest concentration of commissions earned? What is the smallest, and the largest? What is the typical amount earned?
Relative Frequency Distribution
  • 138. It may be desirable, as we did earlier with qualitative data, to convert class frequencies to relative class frequencies to show the proportion of the total number of observations in each class. In our vehicle profits, we may want to know what percentage of the vehicle profits are in the $1,000 up to $1,400 class. To convert a frequency distribution to a relative frequency distribution, each of the class frequencies is divided by the total number of observations. From the distribution of vehicle profits, Table 2–5, the relative frequency for the $1,000 up to $1,400 class is 0.128, found by dividing 23 by 180. That is, profit on 12.8% of the vehicles sold is between $1,000 and $1,400. The relative frequencies for the remaining classes are shown in Table 2–7.
TABLE 2–7 Relative Frequency Distribution of Profit for Vehicles Sold Last Month at Applewood Auto Group
Profit                   Frequency   Relative Frequency   Found by
$  200 up to $  600          8             .044             8/180
   600 up to  1,000         11             .061            11/180
 1,000 up to  1,400         23             .128            23/180
 1,400 up to  1,800         38             .211            38/180
 1,800 up to  2,200         45             .250            45/180
 2,200 up to  2,600         32             .178            32/180
 2,600 up to  3,000         19             .106            19/180
 3,000 up to  3,400          4             .022             4/180
Total                      180            1.000
  • 139. There are many software packages that perform statistical calculations. Throughout this text, we will show the output from Microsoft Excel, MegaStat (a Microsoft Excel add-in), and Minitab (a statistical software package). Because Excel is most readily available, it is used most frequently.
Within the earlier Graphic Presentation of Qualitative Data section, we used the Pivot Table tool in Excel to create a frequency table. To create the table to the left, we use the same Excel tool to compute frequency and relative frequency distributions for the profit variable in the Applewood Auto Group data. The necessary steps are given in the Software Commands section in Appendix C.
  • 140. SELF-REVIEW 2–3
Barry Bonds of the San Francisco Giants established a new single-season Major League Baseball home run record by hitting 73 home runs during the 2001 season. Listed below is the sorted distance of each of the 73 home runs.
(a) For this data, show that seven classes would be used to create a frequency distribution using the 2^k rule.
(b) Show that a class interval of 30 would summarize the data in seven classes.
(c) Construct frequency and relative frequency distributions for the data with
  • 141. seven classes and a class interval of 30. Start the first class with a lower limit of 300. (d) How many home runs traveled a distance of 360 up to 390 feet? (e) What percentage of the home runs traveled a distance of 360 up to 390 feet? (f) What percentage of the home runs traveled a distance of 390 feet or more? 7. A set of data consists of 38 observations. How many classes would you recom- mend for the frequency distribution? 8. A set of data consists of 45 observations between $0 and $29. What size would you recommend for the class interval? 9. A set of data consists of 230 observations between $235 and $567. What class interval would you recommend? 10. A set of data contains 53 observations. The minimum value
  • 142. is 42 and the maximum value is 129. The data are to be organized into a frequency distribution. a. How many classes would you suggest? b. What would you suggest as the lower limit of the first class? 11. Wachesaw Manufacturing Inc. produced the following number of units in the last 16 days. The information is to be organized into a frequency distribution. a. How many classes would you recommend? b. What class interval would you suggest? c. What lower limit would you recommend for the first class? d. Organize the information into a frequency distribution and determine the relative frequency distribution. e. Comment on the shape of the distribution. E X E R C I S E S This icon indicates that the data are available at the text website: www.mhhe.com/
  • 143. Lind17e. You will be able to download the data directly into Excel or Minitab from this site. 27 27 27 28 27 25 25 28 26 28 26 28 31 30 26 26 320 320 347 350 360 360 360 361 365 370 370 375 375 375 375 380 380 380 380 380 380 390 390 391 394 396 400 400 400 400 405 410 410 410 410 410 410 410 410 410 410 410 411 415 415 416 417 417 420 420 420 420 420 420 420 420 429 430 430 430 430 430 435 435 436 440 440 440 440 440 450 480 488 32 CHAPTER 2 The data are to be organized into a frequency distribution. a. How many classes would you recommend? b. What class interval would you suggest? c. What lower limit would you recommend for the first class?
  • 144. d. Organize the number of oil changes into a frequency distribution. e. Comment on the shape of the frequency distribution. Also determine the relative frequency distribution. 13. The manager of the BiLo Supermarket in Mt. Pleasant, Rhode Island, gathered the following information on the number of times a customer visits the store during a month. The responses of 51 customers were: 65 98 55 62 79 59 51 90 72 56 70 62 66 80 94 79 63 73 71 85 12. The Quick Change Oil Company has a number of outlets in the metropolitan Seat- tle area. The daily number of oil changes at the Oak Street outlet in the past 20 days are: 5 3 3 1 4 4 5 6 4 2 6 6 6 7 1 1 14 1 2 4 4 4 5 6 3 5 3 4 5 6 8 4 7 6 5 9 11 3 12 4 7 6 5 15 1 1 10 8 9 2 12
  • 145. a. Starting with 0 as the lower limit of the first class and using a class interval of 3, organize the data into a frequency distribution. b. Describe the distribution. Where do the data tend to cluster? c. Convert the distribution to a relative frequency distribution. 14. The food services division of Cedar River Amusement Park Inc. is studying the amount of money spent per day on food and drink by families who visit the amuse- ment park. A sample of 40 families who visited the park yesterday revealed they spent the following amounts: $77 $18 $63 $84 $38 $54 $50 $59 $54 $56 $36 $26 $50 $34 $44 41 58 58 53 51 62 43 52 53 63 62 62 65 61 52 60 60 45 66 83 71 63 58 61 71 a. Organize the data into a frequency distribution, using seven classes and 15 as the lower limit of the first class. What class interval did you select?
  • 146. b. Where do the data tend to cluster?
    c. Describe the distribution.
    d. Determine the relative frequency distribution.

    GRAPHIC PRESENTATION OF A DISTRIBUTION
    LO2-4 Display a distribution using a histogram or frequency polygon.
    Sales managers, stock analysts, hospital administrators, and other busy executives often need a quick picture of the distributions of sales, stock prices, or hospital costs. These distributions can often be depicted by the use of charts and graphs. Three charts that will help portray a frequency distribution graphically are the histogram, the frequency polygon, and the cumulative frequency polygon.

    Histogram
    A histogram for a frequency distribution based on quantitative data is similar to the bar chart showing the distribution of qualitative data. The classes are marked on the horizontal axis and the class frequencies on the vertical axis. The class frequencies are represented by the heights of the bars. However, there is one important difference based on the nature of the data. Quantitative data are usually measured using scales that are continuous, not discrete. Therefore, the horizontal axis represents all possible values, and the bars are drawn adjacent to each other to show the continuous nature of the data.

    HISTOGRAM A graph in which the classes are marked on the horizontal axis and the class frequencies on the vertical axis. The class frequencies are represented by the heights of the bars, and the bars are drawn adjacent to each other.
  • 148. EXAMPLE
    Below is the frequency distribution of the profits on vehicle sales last month at the Applewood Auto Group. Construct a histogram. What observations can you reach based on the information presented in the histogram?

    Profit                  Frequency
    $  200 up to $  600          8
       600 up to  1,000         11
     1,000 up to  1,400         23
     1,400 up to  1,800         38
     1,800 up to  2,200         45
     2,200 up to  2,600         32
     2,600 up to  3,000         19
     3,000 up to  3,400          4
    Total                      180

    SOLUTION
    The class frequencies are scaled along the vertical axis (Y-axis) and either the class limits or the class midpoints along the horizontal axis. To illustrate the construction of the histogram, the first three classes (with frequencies 8, 11, and 23) are shown in Chart 2–3.
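The construction rule (one bar per class, with the bar's height equal to the class frequency) can also be sketched in code. The following is a minimal illustration, not the textbook's Excel procedure: it prints a rough text histogram turned on its side, with one `#` per vehicle. The names `classes` and `text_histogram` are hypothetical.

```python
# Frequency distribution of vehicle profits (from the table above).
# Each entry: (lower limit, upper limit, class frequency).
classes = [
    (200, 600, 8),
    (600, 1000, 11),
    (1000, 1400, 23),
    (1400, 1800, 38),
    (1800, 2200, 45),
    (2200, 2600, 32),
    (2600, 3000, 19),
    (3000, 3400, 4),
]

def text_histogram(classes):
    """Return one line per class; the bar length equals the class frequency."""
    lines = []
    for lower, upper, freq in classes:
        label = f"{lower:>5,} up to {upper:>5,}"
        lines.append(f"{label} | {'#' * freq} {freq}")
    return lines

for line in text_histogram(classes):
    print(line)

total = sum(freq for _, _, freq in classes)
print("Total:", total)  # Total: 180
```

Because consecutive classes share a limit (600, 1,000, and so on), the rows leave no gap in the profit scale, which is the text analogue of drawing the bars adjacent to each other.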
  • 151. CHART 2–3 Construction of a Histogram
    From Chart 2–3 we note the profit on eight vehicles was $200 up to $600. Therefore, the height of the column for that class is 8. There are 11 vehicle sales where the profit was $600 up to $1,000. So, logically, the height of that column is 11. The height of the bar represents the number of observations in the class.
    This procedure is continued for all classes. The complete histogram is shown in Chart 2–4. Note that there is no space between the bars. This is a feature of the histogram. Why is this so? Because the variable profit, plotted on the horizontal axis, is a continuous variable. In a bar chart, the scale of measurement is usually nominal and the vertical bars are separated. This is an important distinction between the histogram and the bar chart.
    We can make the following statements using Chart 2–4. They are the same as the observations based on Table 2–5.
    1. The profits from vehicle sales range between $200 and $3,400.
    2. The vehicle profits are classified using a class interval of $400. The class interval is determined by subtracting consecutive lower or upper class limits. For example, the lower limit of the first class is $200, and the lower limit of the second class is $600. The difference is the class interval or $400.
    3. The profits are concentrated between $1,000 and $3,000. The profit on 157 vehicles, or 87%, was within this range.
    4. For each class, we can determine the typical profit or class midpoint. It is halfway between the lower or upper limits of two consecutive classes. It is computed by adding the lower or upper limits of consecutive classes and dividing by 2. Referring to Chart 2–4, the lower class limit of the first class is $200, and the next class limit is $600. The class midpoint is $400, found by ($600 + $200)/2. The midpoint best represents, or is typical of, the profits of the vehicles in that class. Applewood sold 8 vehicles with a typical profit of $400.
    5. The largest concentration, or highest frequency of vehicles sold, is in the $1,800 up to $2,200 class. There are 45 vehicles in this class. The class midpoint is $2,000. So we say that the typical profit in the class with the highest frequency is $2,000.
    Thus, the histogram provides an easily interpreted visual representation of a frequency distribution. We should also point out that we would have made the same observations and the shape of the histogram would have been the same had
  • 154. we used a relative frequency distribution instead of the actual frequencies. That is, if we use the relative frequencies of Table 2–7, the result is a histogram of the same shape as Chart 2–4. The only difference is that the vertical axis would have been reported in percentage of vehicles instead of the number of vehicles. The Excel commands to create Chart 2–4 are given in Appendix C.
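The statement that a relative frequency histogram keeps the same shape can be checked numerically. The sketch below uses the class frequencies from Table 2–5 with hypothetical variable names; it divides each frequency by the total and confirms that the tallest class is unchanged, since every bar is scaled by the same factor, 1/180.

```python
# Class frequencies for the eight profit classes, $200 up to $3,400.
frequencies = [8, 11, 23, 38, 45, 32, 19, 4]
total = sum(frequencies)  # 180 vehicles

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

for f, r in zip(frequencies, relative):
    print(f"{f:>3}  {r:.3f}")

# Every bar shrinks by the same factor, so the tallest class stays the tallest
# and the ordering of bar heights is unchanged.
tallest_by_count = frequencies.index(max(frequencies))
tallest_by_share = relative.index(max(relative))
print(tallest_by_count == tallest_by_share)  # True
print(round(sum(relative), 3))               # 1.0
```

Only the scale of the vertical axis differs: 45 vehicles in the $1,800 up to $2,200 class becomes 0.250, or 25% of the vehicles.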
  • 158. CHART 2–4 Histogram of the Profit on 180 Vehicles Sold at the Applewood Auto Group

    Frequency Polygon
    A frequency polygon also shows the shape of a distribution and is similar to a histogram. It consists of line segments connecting the points formed by the intersections of the class midpoints and the class frequencies. The construction of a frequency polygon is illustrated in Chart 2–5. We use the profits from the cars sold last month at the Applewood Auto Group. The midpoint of each class is scaled on the X-axis and the class frequencies on the Y-axis. Recall that the class midpoint is the value at the center of a class and represents the typical values in that class. The class frequency is the number of observations in a particular class. The profit earned on the vehicles sold last month by the Applewood Auto Group is repeated below.

    STATISTICS IN ACTION
    Florence Nightingale is known as the founder of the nursing profession. However, she also saved many lives by using statistical analysis. When she encountered an unsanitary condition or an undersupplied hospital, she improved the conditions and then used statistical data to document the improvement. Thus, she was able
  • 160. to convince others of the need for medical reform, particularly in the area of sanitation. She developed original graphs to demonstrate that, during the Crimean War, more soldiers died from unsanitary conditions than were killed in combat.

    Profit                  Midpoint   Frequency
    $  200 up to $  600      $  400         8
       600 up to  1,000         800        11
     1,000 up to  1,400       1,200        23
     1,400 up to  1,800       1,600        38
     1,800 up to  2,200       2,000        45
     2,200 up to  2,600       2,400        32
     2,600 up to  3,000       2,800        19
     3,000 up to  3,400       3,200         4
    Total                                 180

    CHART 2–5 Frequency Polygon of Profit on 180 Vehicles Sold at Applewood Auto Group
    (Horizontal axis: Profit $, midpoints from 0 to 3,600. Vertical axis: Frequency.)

    As noted previously, the $200 up to $600 class is represented by the midpoint $400. To construct a frequency polygon, move horizontally on the graph to the midpoint, $400, and then vertically to 8, the class frequency, and place a dot. The x and the y values of this point are called the coordinates. The coordinates of the next point are x = 800 and y = 11. The process is continued for all classes. Then the points are connected in order. That is, the point representing the lowest class is joined to the one representing the second class and so on. Note in Chart 2–5 that, to complete the frequency polygon, midpoints of $0 and $3,600 are added to the X-axis to "anchor" the polygon at zero frequencies. These two values, $0 and $3,600, were derived by subtracting the class interval of $400 from the lowest midpoint ($400) and by adding $400 to the highest midpoint ($3,200) in the frequency distribution.

    CHART 2–6 Distribution of Profit at Applewood Auto Group and Fowler Motors
    (Two frequency polygons, Fowler Motors and Applewood, on the same axes. Horizontal axis: Profit $, 0 to 3,600. Vertical axis: Frequency.)

    Both the histogram and the frequency polygon allow us to get a quick picture of the main characteristics of the data (highs, lows, points of concentration, etc.). Although the two representations are similar in purpose, the histogram has the advantage of depicting each class as a rectangle, with the height of the rectangular bar representing the number in each class. The frequency polygon, in turn, has
  • 165. an advantage over the histogram. It allows us to compare directly two or more frequency distributions. Suppose Ms. Ball wants to compare the profit per vehicle sold at Applewood Auto Group with a similar auto group, Fowler Auto in Grayling, Michigan. To do this, two frequency polygons are constructed, one on top of the other, as in Chart 2–6. Two things are clear from the chart:
    • The typical vehicle profit is larger at Fowler Motors: about $2,000 for Applewood and about $2,400 for Fowler.
    • There is less variation or dispersion in the profits at Fowler Motors than at Applewood. The lower limit of the first class for Applewood is $0 and the upper limit is $3,600. For Fowler Motors, the lower limit is $800 and the upper limit is the same: $3,600.
    The total number of cars sold at the two dealerships is about the same, so a direct
  • 166. comparison is possible. If the difference in the total number of cars sold is large, then converting the frequencies to relative frequencies and then plotting the two distribu- tions would allow a clearer comparison. The annual imports of a selected group of electronic suppliers are shown in the following frequency distribution. S E L F - R E V I E W 2–4 Imports ($ millions) Number of Suppliers 2 up to 5 6 5 up to 8 13 8 up to 11 20 11 up to 14 10 14 up to 17 1 (a) Portray the imports as a histogram. (b) Portray the imports as a relative frequency polygon. (c) Summarize the important facets of the distribution (such as classes with the highest
  • 167. and lowest frequencies). DESCRIBING DATA: FREQUENCY TABLES, FREQUENCY DISTRIBUTIONS, AND GRAPHIC PRESENTATION 37 15. Molly’s Candle Shop has several retail stores in the coastal areas of North and South Carolina. Many of Molly’s customers ask her to ship their purchases. The fol- lowing chart shows the number of packages shipped per day for the last 100 days. For example, the first class shows that there were 5 days when the number of pack- ages shipped was 0 up to 5. Fr eq ue nc y
  • 168. Number of Packages 10 0 5 10 15 20 25 30 35 20 30 13 28 23 18 10 35 a. What is this chart called? b. What is the total number of packages shipped? c. What is the class interval? d. What is the number of packages shipped in the 10 up to 15
  • 169. class? e. What is the relative frequency of packages shipped in the 10 up to 15 class? f. What is the midpoint of the 10 up to 15 class? g. On how many days were there 25 or more packages shipped? 16. The following chart shows the number of patients admitted daily to Memorial Hospital through the emergency room. 0 10 20 30 2 4 6 8 10 12 Fr eq ue nc
  • 170. y Number of Patients a. What is the midpoint of the 2 up to 4 class? b. How many days were 2 up to 4 patients admitted? c. What is the class interval? d. What is this chart called? 17. The following frequency distribution reports the number of frequent flier miles, reported in thousands, for employees of Brumley Statistical Consulting Inc. during the most recent quarter. E X E R C I S E S Frequent Flier Miles Number of (000) Employees 0 up to 3 5 3 up to 6 12 6 up to 9 23 9 up to 12 8
  • 171. 12 up to 15 2
    Total 50

    Cumulative Distributions
    Consider once again the distribution of the profits on vehicles sold by the Applewood Auto Group. Suppose we were interested in the number of vehicles that sold for a profit of less than $1,400. These values can be approximated by developing a cumulative frequency distribution and portraying it graphically in a cumulative frequency polygon. Or, suppose we were interested in the profit earned on the lowest-selling 40% of the vehicles. These values can be approximated by developing a cumulative relative frequency distribution and portraying it graphically in a cumulative relative frequency polygon.

    a. How many employees were studied?
  • 172. b. What is the midpoint of the first class? c. Construct a histogram. d. A frequency polygon is to be drawn. What are the coordinates of the plot for the first class? e. Construct a frequency polygon. f. Interpret the frequent flier miles accumulated using the two charts. 18. A large Internet retailer is studying the lead time (elapsed time between when an order is placed and when it is filled) for a sample of recent orders. The lead times are reported in days. a. How many orders were studied? b. What is the midpoint of the first class? c. What are the coordinates of the first class for a frequency polygon? d. Draw a histogram. e. Draw a frequency polygon. f. Interpret the lead times using the two charts. Lead Time (days) Frequency
  • 173.  0 up to  5     6
     5 up to 10     7
    10 up to 15    12
    15 up to 20     8
    20 up to 25     7
    Total          40

    EXAMPLE
    The frequency distribution of the profits earned at Applewood Auto Group is repeated from Table 2–5.

    Profit                  Frequency
    $  200 up to $  600          8
       600 up to  1,000         11
     1,000 up to  1,400         23
     1,400 up to  1,800         38
     1,800 up to  2,200         45
     2,200 up to  2,600         32
     2,600 up to  3,000         19
     3,000 up to  3,400          4
    Total                      180

    Construct a cumulative frequency polygon to answer the following question: sixty of the vehicles earned a profit of less than what amount? Construct a cumulative relative frequency polygon to answer this question: seventy-five percent of the vehicles sold earned a profit of less than what amount?

    SOLUTION
    As the names imply, a cumulative frequency distribution and a cumulative frequency polygon require cumulative frequencies. To construct a cumulative frequency distribution, refer to the preceding table and note that there were eight vehicles in which the profit earned was less than $600. Those 8 vehicles, plus the 11 in the next higher class, for a total of 19, earned a profit of less than $1,000. The cumulative frequency for the next higher class is 42, found by 8 + 11 + 23. This process is continued for all the classes. All the vehicles earned a profit of less than $3,400. (See Table 2–8.)

    TABLE 2–8 Cumulative Frequency Distribution for Profit on Vehicles Sold Last Month at Applewood Auto Group
    Profit               Cumulative Frequency   Found by
    Less than $  600              8             8
    Less than  1,000             19             8 + 11
    Less than  1,400             42             8 + 11 + 23
    Less than  1,800             80             8 + 11 + 23 + 38
    Less than  2,200            125             8 + 11 + 23 + 38 + 45
    Less than  2,600            157             8 + 11 + 23 + 38 + 45 + 32
    Less than  3,000            176             8 + 11 + 23 + 38 + 45 + 32 + 19
    Less than  3,400            180             8 + 11 + 23 + 38 + 45 + 32 + 19 + 4

    TABLE 2–9 Cumulative Relative Frequency Distribution for Profit on Vehicles Sold Last Month at Applewood Auto Group
    Profit               Cumulative Frequency   Cumulative Relative Frequency
    Less than $  600              8               8/180 = 0.044 =  4.4%
    Less than $1,000             19              19/180 = 0.106 = 10.6%
    Less than $1,400             42              42/180 = 0.233 = 23.3%
    Less than $1,800             80              80/180 = 0.444 = 44.4%
    Less than $2,200            125             125/180 = 0.694 = 69.4%
    Less than $2,600            157             157/180 = 0.872 = 87.2%
    Less than $3,000            176             176/180 = 0.978 = 97.8%
    Less than $3,400            180             180/180 = 1.000 = 100%

    To construct a cumulative relative frequency distribution, we divide the cumulative frequencies by the total number of observations, 180. As shown in Table 2–9, the cumulative relative frequency of the fourth class is 80/180 = 0.444, or about 44%. This means that 44% of the vehicles sold for less than $1,800.
    To plot a cumulative frequency distribution, scale the upper limit of each class along the X-axis and the corresponding cumulative frequencies along the Y-axis. To provide additional information, you can label the vertical axis on the right in terms of cumulative relative frequencies. For the Applewood Auto Group, the vertical axis on the left is labeled from 0 to 180 and on the right from 0 to 100%. Note, as an example, that 50% on the right axis should be opposite 90 vehicles on the left axis.
    To begin, the first plot is at x = 200 and y = 0. None of the vehicles sold for a profit of less than $200. The profit on 8 vehicles was less than $600, so the next plot is at x = 600 and y = 8. Continuing, the next plot is x = 1,000 and y = 19. There were 19 vehicles that sold for a profit of less than $1,000. The rest of the points are plotted and then the dots connected to form Chart 2–7.
    We should point out that the shape of the distribution is the same if we use cumulative relative frequencies instead of the cumulative frequencies. The only difference is that the vertical axis is scaled in percentages. In the following charts, a percentage scale is added to the right side of the graphs to help answer questions about cumulative relative frequencies.
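Reading values off a cumulative frequency polygon amounts to linear interpolation between the plotted points. The sketch below is one way to express that, assuming the points of Table 2–8 plus the starting point (200, 0). The function names `profit_at_count` and `count_below_profit` are hypothetical, and the results are estimates in the same sense as readings taken from the chart by eye.

```python
from itertools import accumulate

upper_limits = [600, 1000, 1400, 1800, 2200, 2600, 3000, 3400]
frequencies = [8, 11, 23, 38, 45, 32, 19, 4]

# Polygon points: (profit, cumulative frequency), starting at (200, 0).
cumulative = list(accumulate(frequencies))
points = [(200, 0)] + list(zip(upper_limits, cumulative))
total = cumulative[-1]  # 180

def profit_at_count(count):
    """Profit below which roughly `count` vehicles fall (Y-axis to X-axis)."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= count <= y1:
            return x0 + (count - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("count is outside the range of the data")

def count_below_profit(profit):
    """Estimated number of vehicles with profit below `profit` (X-axis to Y-axis)."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= profit <= x1:
            return y0 + (profit - x0) / (x1 - x0) * (y1 - y0)
    raise ValueError("profit is outside the range of the data")

print(cumulative)                                         # [8, 19, 42, 80, 125, 157, 176, 180]
print(round(profit_at_count(0.75 * total)))               # 2325
print(round(profit_at_count(60)))                         # 1589
print(round(100 * count_below_profit(2000) / total, 1))   # 56.9
```

These interpolated values are close to what the text reads from Chart 2–7: about $2,300 for 75% of the vehicles, about $1,600 for 60 vehicles, and about 56% of vehicles below $2,000.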
  • 181. CHART 2–7 Cumulative Frequency Polygon for Profit on Vehicles Sold Last Month at Applewood Auto Group
    Using Chart 2–7 to find the amount of profit on 75% of the cars sold, draw a horizontal line from the 75% mark on the right-hand vertical axis over to the polygon, then drop down to the X-axis and read the amount of profit. The value on the X-axis is about $2,300, so we estimate that 75% of the vehicles sold earned a profit of $2,300 or less for the Applewood group.
    To find the highest profit earned on 60 of the 180 vehicles, we use Chart 2–7 to locate the value of 60 on the left-hand vertical axis. Next, we draw a horizontal line from the value of 60 to the polygon and then drop down to the X-axis and read the profit. It is about $1,600, so we estimate that 60 of the vehicles sold for a profit of less than $1,600. We can also make estimates of the
  • 182. percentage of vehicles that sold for less than a particular amount. To explain, suppose we want to estimate the percentage of vehicles that sold for a profit of less than $2,000. We begin by locating the value of $2,000 on the X-axis, move vertically to the polygon, and then horizontally to the vertical axis on the right. The value is about 56%, so we conclude 56% of the vehicles sold for a profit of less than $2,000.

    A sample of the hourly wages of 15 employees at Home Depot in Brunswick, Georgia, was organized into the following table.
    Hourly Wages Number of Employees
    $ 8 up to $10 3
    10 up to 12 7
  • 183. 12 up to 14 4 14 up to 16 1 (a) What is the table called? (b) Develop a cumulative frequency distribution and portray the distribution in a cumula- tive frequency polygon. (c) On the basis of the cumulative frequency polygon, how many employees earn less than $11 per hour? S E L F - R E V I E W 2–5 19. The following cumulative frequency and the cumulative relative frequency polygon for the distribution of hourly wages of a sample of certified welders in the Atlanta, Georgia, area is shown in the graph. Fr eq ue
  • 184. nc y Hourly Wage Pe rc en t 0 5 10 15 20 25 30 100 75 50 25 40 30
  • 185. 20 10 a. How many welders were studied? b. What is the class interval? c. About how many welders earn less than $10.00 per hour? d. About 75% of the welders make less than what amount? e. Ten of the welders studied made less than what amount? f. What percent of the welders make less than $20.00 per hour? 20. The cumulative frequency and the cumulative relative frequency polygon for a dis- tribution of selling prices ($000) of houses sold in the Billings, Montana, area is shown in the graph. Fr eq ue nc y
  • 187. 500 100 150 200 250 350300 E X E R C I S E S 42 CHAPTER 2 a. How many homes were studied? b. What is the class interval? c. One hundred homes sold for less than what amount? d. About 75% of the homes sold for less than what amount? e. Estimate the number of homes in the $150,000 up to $200,000 class. f. About how many homes sold for less than $225,000? 21. The frequency distribution representing the number of frequent flier miles accumulated by employees at Brumley Statistical Consulting Inc. is repeated from Exercise 17. Frequent Flier Miles (000) Frequency
  • 188. 0 up to 3 5 3 up to 6 12 6 up to 9 23 9 up to 12 8 12 up to 15 2 Total 50 a. How many employees accumulated less than 3,000 miles? b. Convert the frequency distribution to a cumulative frequency distribution. c. Portray the cumulative distribution in the form of a cumulative frequency polygon. d. Based on the cumulative relative frequencies, about 75% of the employees accumulated how many miles or less? 22. The frequency distribution of order lead time of the retailer from Exercise 18 is repeated below. Lead Time (days) Frequency 0 up to 5 6
  • 189.  5 up to 10     7
    10 up to 15    12
    15 up to 20     8
    20 up to 25     7
    Total          40
    a. How many orders were filled in less than 10 days? In less than 15 days?
    b. Convert the frequency distribution to cumulative frequency and cumulative relative frequency distributions.
    c. Develop a cumulative frequency polygon.
    d. About 60% of the orders were filled in less than how many days?

    CHAPTER SUMMARY
    I. A frequency table is a grouping of qualitative data into mutually exclusive and collectively exhaustive classes showing the number of observations in each class.
    II. A relative frequency table shows the fraction of the observations in each class.
    III. A bar chart is a graphic representation of a frequency table.
    IV. A pie chart shows the proportion each distinct class represents of the total number of observations.
    V. A frequency distribution is a grouping of data into mutually exclusive and collectively exhaustive classes showing the number of observations in each class.
       A. The steps in constructing a frequency distribution are
          1. Decide on the number of classes.
          2. Determine the class interval.
          3. Set the individual class limits.
          4. Tally the raw data into classes and determine the frequency in each class.
       B. The class frequency is the number of observations in each class.
       C. The class interval is the difference between the limits of two consecutive classes.
       D. The class midpoint is halfway between the limits of consecutive classes.
    VI. A relative frequency distribution shows the percent of observations in each class.
    VII. There are several methods for graphically portraying a frequency distribution.
       A. A histogram portrays the frequencies in the form of a rectangle or bar for each class. The height of the rectangles is proportional to the class frequencies.
       B. A frequency polygon consists of line segments connecting the points formed by the intersection of the class midpoint and the class frequency.
       C. A graph of a cumulative frequency distribution shows the number of observations less than a given value.
       D. A graph of a cumulative relative frequency distribution shows the percent of observations less than a given value.

    CHAPTER EXERCISES
    23. Describe the similarities and differences of qualitative and quantitative variables. Be sure to include the following:
       a. What level of measurement is required for each variable type?
       b. Can both types be used to describe both samples and populations?
    24. Describe the similarities and differences between a frequency table and a frequency distribution. Be sure to include which requires qualitative data and which requires quantitative data.
    25. Alexandra Damonte will be building a new resort in Myrtle Beach, South Carolina. She must decide how to design the resort based on the type of activities that the resort will offer to its customers. A recent poll of 300 potential customers
  • 193. showed the following results about customers’ preferences for planned resort activities: Like planned activities 63 Do not like planned activities 135 Not sure 78 No answer 24 a. What is the table called? b. Draw a bar chart to portray the survey results. c. Draw a pie chart for the survey results. d. If you are preparing to present the results to Ms. Damonte as part of a report, which graph would you prefer to show? Why? 26. Speedy Swift is a package delivery service that serves the greater Atlanta, Georgia, metropolitan area. To maintain customer loyalty, one of Speedy Swift’s performance objectives is on-time delivery. To monitor its performance, each delivery is measured on the following scale: early (package delivered before the promised time), on-time (pack-
  • 194. age delivered within 5 minutes of the promised time), late (package delivered more than 5 minutes past the promised time), or lost (package never delivered). Speedy Swift’s objective is to deliver 99% of all packages either early or on- time. Speedy collected the following data for last month’s performance: On-time On-time Early Late On-time On-time On-time On-time Late On-time Early On-time On-time Early On-time On-time On-time On-time On-time On-time Early On-time Early On-time On-time On-time Early On-time On-time On-time Early On-time On-time Late Early Early On-time On-time On- time Early On-time Late Late On-time On-time On-time On-time On-time On-time On-time On-time Late Early On-time Early On-time Lost On-time On- time On-time Early Early On-time On-time Late Early Lost On-time On-time On-time On-time On-time Early On-time Early On-time Early On-time Late On-time On-time Early On-time On-time On-time Late On-time Early
  • 195. On-time On-time On-time On-time On-time On-time On-time Early Early On-time On-time On-time 44 CHAPTER 2 a. What kind of variable is delivery performance? What scale is used to measure delivery performance? b. Construct a frequency table for delivery performance for last month. c. Construct a relative frequency table for delivery performance last month. d. Construct a bar chart of the frequency table for delivery performance for last month. e. Construct a pie chart of on-time delivery performance for last month. f. Write a memo reporting the results of the analyses. Include your tables and graphs with written descriptions of what they show. Conclude with a general
  • 196. statement of last month’s delivery performance as it relates to Speedy Swift’s performance objectives. 27. A data set consists of 83 observations. How many classes would you recommend for a frequency distribution? 28. A data set consists of 145 observations that range from 56 to 490. What size class interval would you recommend? 29. The following is the number of minutes to commute from home to work for a group of 25 automobile executives. 28 25 48 37 41 19 32 26 16 23 23 29 36 31 26 21 32 25 31 43 35 42 38 33 28 a. How many classes would you recommend? b. What class interval would you suggest? c. What would you recommend as the lower limit of the first class? d. Organize the data into a frequency distribution. e. Comment on the shape of the frequency distribution.
  • 197. 30. The following data give the weekly amounts spent on groceries for a sample of 45 households. $271 $363 $159 $ 76 $227 $337 $295 $319 $250 279 205 279 266 199 177 162 232 303 192 181 321 309 246 278 50 41 335 116 100 151 240 474 297 170 188 320 429 294 570 342 279 235 434 123 325 a. How many classes would you recommend? b. What class interval would you suggest? c. What would you recommend as the lower limit of the first class? d. Organize the data into a frequency distribution. 31. A social scientist is studying the use of iPods by college students. A sample of 45 students revealed they played the following number of songs yesterday. 4 6 8 7 9 6 3 7 7 6 7 1 4 7 7 4 6 4 10 2 4 6 3 4 6 8 4 3 3 6 8 8 4 6 4 6 5 5 9 6 8 8 6 5 10
  • 198. Organize the information into a frequency distribution. a. How many classes would you suggest? b. What is the most suitable class interval? c. What is the lower limit of the initial class? d. Create the frequency distribution. e. Describe the shape of the distribution. 32. David Wise handles his own investment portfolio, and has done so for many years. Listed below is the holding time (recorded to the nearest whole year) between purchase and sale for his collection of 36 stocks. 8 8 6 11 11 9 8 5 11 4 8 5 14 7 12 8 6 11 9 7 9 15 8 8 12 5 9 8 5 9 10 11 3 9 8 6 a. How many classes would you propose? b. What class interval would you suggest? c. What quantity would you use for the lower limit of the initial class?
  • 199. DESCRIBING DATA: FREQUENCY TABLES, FREQUENCY DISTRIBUTIONS, AND GRAPHIC PRESENTATION 45 d. Using your responses to parts (a), (b), and (c), create a frequency distribution. e. Describe the shape of the frequency distribution. 33. You are exploring the music in your iTunes library. The total play counts over the past year for the 27 songs on your “smart playlist” are shown below. Make a frequency distribution of the counts and describe its shape. It is often claimed that a small fraction of a person’s songs will account for most of their total plays. Does this seem to be the case here? 128 56 54 91 190 23 160 298 445 50 578 494 37 677 18 74 70 868 108 71 466 23 84 38 26 814 17 34. The monthly issues of the Journal of Finance are available on the Internet. The table below shows the number of times an issue was downloaded over the last 33 months. Suppose that you wish to summarize the number of
  • 200. downloads with a frequency distribution. 312 2,753 2,595 6,057 7,624 6,624 6,362 6,575 7,760 7,085 7,272 5,967 5,256 6,160 6,238 6,709 7,193 5,631 6,490 6,682 7,829 7,091 6,871 6,230 7,253 5,507 5,676 6,974 6,915 4,999 5,689 6,143 7,086 a. How many classes would you propose? b. What class interval would you suggest? c. What quantity would you use for the lower limit of the initial class? d. Using your responses to parts (a), (b), and (c), create a frequency distribution. e. Describe the shape of the frequency distribution. 35. The following histogram shows the scores on the first exam for a statistics class. [Figure: histogram of exam scores; horizontal axis runs from 50 to 100, vertical axis shows frequency]
  • 202. a. How many students took the exam? b. What is the class interval? c. What is the class midpoint for the first class? d. How many students earned a score of less than 70? 36. The following chart summarizes the selling price of homes sold last month in the Sarasota, Florida, area. [Figure: chart with Selling Price ($000) on the horizontal axis, 0 to 350, and Frequency (ticks from 50 to 250) and Percent (ticks from 25 to 100) on the vertical axes]
  • 203. a. What is the chart called? b. How many homes were sold during the last month? c. What is the class interval? d. About 75% of the houses sold for less than what amount? e. One hundred seventy-five of the homes sold for less than what amount?
  • 204. 37. A chain of sport shops catering to beginning skiers, headquartered in Aspen, Colorado, plans to conduct a study of how much a beginning skier spends on his or her initial purchase of equipment and supplies. Based on these figures, it wants to explore the possibility of offering combinations, such as a pair of boots and a pair of skis, to induce customers to buy more. A sample of 44 cash register receipts revealed these initial purchases: $140 $ 82 $265 $168 $ 90 $114 $172 $230 $142 86 125 235 212 171 149 156 162 118 139 149 132 105 162 126 216 195 127 161 135 172 220 229 129 87 128 126 175 127 149 126 121 118 172 126 a. Arrive at a suggested class interval. b. Organize the data into a frequency distribution using a lower
  • 205. limit of $70. c. Interpret your findings. 38. The numbers of outstanding shares for 24 publicly traded companies are listed in the following table. Company and number of outstanding shares (millions): Southwest Airlines 738; FirstEnergy 418; Harley Davidson 226; Entergy 178; Chevron 1,957; Pacific Gas and Electric 430; DuPont 932; Westinghouse 22; Eversource 314; Facebook 1,067; Google, Inc. 64; Apple 941;
  • 206. Costco 436; Home Depot 1,495; DTE Energy 172; Dow Chemical 1,199; Eastman Kodak 272; American Electric Power 485; ITT Corporation 93; Ameren 243; Virginia Electric and Power 575; Public Service Electric & Gas 506; Consumers Energy 265; Starbucks 744. a. Using the number of outstanding shares, summarize the companies with a frequency distribution. b. Display the frequency distribution with a frequency polygon. c. Create a cumulative frequency distribution of the outstanding shares.
  • 207. d. Display the cumulative frequency distribution with a cumulative frequency polygon. e. Based on the cumulative relative frequency distribution, 75% of the companies have less than “what number” of outstanding shares? f. Write a brief analysis of this group of companies based on your statistical summaries of “number of outstanding shares.” 39. A recent survey showed that the typical American car owner spends $2,950 per year on operating expenses. Below is a breakdown of the various expenditure items. Draw an appropriate chart to portray the data and summarize your findings in a brief report. Expenditure Item Amount Fuel $ 603 Interest on car loan 279 Repairs 930 Insurance and license 646 Depreciation 492
  • 208. Total $2,950 40. Midland National Bank selected a sample of 40 student checking accounts. Below are their end-of-the-month balances. $404 $ 74 $234 $149 $279 $215 $123 $ 55 $ 43 $321 87 234 68 489 57 185 141 758 72 863 703 125 350 440 37 252 27 521 302 127 968 712 503 489 327 608 358 425 303 203 a. Tally the data into a frequency distribution using $100 as a class interval and $0 as the starting point. b. Draw a cumulative frequency polygon. c. The bank considers any student with an ending balance of
  • 209. $400 or more a “preferred customer.” Estimate the percentage of preferred customers. d. The bank is also considering a service charge to the lowest 10% of the ending balances. What would you recommend as the cutoff point between those who have to pay a service charge and those who do not? 41. Residents of the state of South Carolina earned a total of $69.5 billion in adjusted gross income. Seventy-three percent of the total was in wages and salaries; 11% in dividends, interest, and capital gains; 8% in IRAs and taxable pensions; 3% in business income pensions; 2% in Social Security; and the remaining 3% from other sources. Develop a pie chart depicting the breakdown of adjusted gross income. Write a paragraph summarizing the information. 42. A recent study of home technologies reported the number of hours of personal
  • 210. computer usage per week for a sample of 60 persons. Excluded from the study were people who worked out of their home and used the computer as a part of their work. 9.3 5.3 6.3 8.8 6.5 0.6 5.2 6.6 9.3 4.3 6.3 2.1 2.7 0.4 3.7 3.3 1.1 2.7 6.7 6.5 4.3 9.7 7.7 5.2 1.7 8.5 4.2 5.5 5.1 5.6 5.4 4.8 2.1 10.1 1.3 5.6 2.4 2.4 4.7 1.7 2.0 6.7 1.1 6.7 2.2 2.6 9.8 6.4 4.9 5.2 4.5 9.3 7.9 4.6 4.3 4.5 9.2 8.5 6.0 8.1 a. Organize the data into a frequency distribution. How many classes would you suggest? What value would you suggest for a class interval? b. Draw a histogram. Describe your result. 43. Merrill Lynch recently completed a study regarding the size of online investment portfolios (stocks, bonds, mutual funds, and certificates of deposit) for a sample of clients in the 40 up to 50 years old age group. Listed following is the value of all the investments in thousands of dollars for the 70 participants in the
  • 211. study. $669.9 $ 7.5 $ 77.2 $ 7.5 $125.7 $516.9 $ 219.9 $645.2 301.9 235.4 716.4 145.3 26.6 187.2 315.5 89.2 136.4 616.9 440.6 408.2 34.4 296.1 185.4 526.3 380.7 3.3 363.2 51.9 52.2 107.5 82.9 63.0 228.6 308.7 126.7 430.3 82.0 227.0 321.1 403.4 39.5 124.3 118.1 23.9 352.8 156.7 276.3 23.5 31.3 301.2 35.7 154.9 174.3 100.6 236.7 171.9 221.1 43.4 212.3 243.3 315.4 5.9 1,002.2 171.7 295.7 437.0 87.8 302.1 268.1 899.5 a. Organize the data into a frequency distribution. How many classes would you suggest? What value would you suggest for a class interval? b. Draw a histogram. Financial experts suggest that this age group of people have at least five times their salary saved. As a benchmark, assume an investment portfolio of $500,000 would support retirement in 10–15 years. In writing, summarize your results.
  • 212. 44. A total of 5.9% of the prime-time viewing audience watched shows on ABC, 7.6% watched shows on CBS, 5.5% on Fox, 6.0% on NBC, 2.0% on Warner Brothers, and 2.2% on UPN. A total of 70.8% of the audience watched shows on other cable networks, such as CNN and ESPN. You can find the latest information on TV viewing from the following website: http://www.nielsen.com/us/en/top10s.html/. Develop a pie chart or a bar chart to depict this information. Write a paragraph summarizing your findings. 45. Refer to the following chart: [Pie chart: Contact for Job Placement at Wake Forest University. Networking and Connections 70%; Job Posting Websites 20%; On-Campus Recruiting 10%]
  • 213. a. What is the name given to this type of chart? b. Suppose that 1,000 graduates will start a new job shortly after graduation. Estimate the number of graduates whose first contact for employment occurred through networking and other connections. c. Would it be reasonable to conclude that about 90% of job placements were made through networking, connections, and job posting websites? Cite evidence.
  • 214. 46. The following chart depicts the annual revenues, by type of tax, for the state of Georgia. [Pie chart: Annual Revenue, State of Georgia. Sales 44.54%; Income 43.34%; Corporate 8.31%; License 2.9%; Other 0.9%] a. What percentage of the state revenue is accounted for by sales tax and individual income tax? b. Which category will generate more revenue: corporate taxes
  • 215. or license fees? c. The total annual revenue for the state of Georgia is $6.3 billion. Estimate the amount of revenue in billions of dollars for sales taxes and for individual taxes. 47. In 2014, the United States exported a total of $376 billion worth of products to Canada. The five largest categories were (product and amount): Vehicles $63.3; Machinery 59.7; Electrical machinery 36.6; Mineral fuel and oil 24.8; Plastic 17.0.
  • 216. a. Use a software package to develop a bar chart. b. What percentage of the United States’ total exports to Canada is represented by the two categories “Machinery” and “Electrical Machinery”? c. What percentage of the top five exported products do “Machinery” and “Electrical Machinery” represent? 48. In the United States, the industrial revolution of the early 20th century changed farming by making it more efficient. For example, in 1910 U.S. farms used 24.2 million horses and mules and only about 1,000 tractors. By 1960, 4.6 million tractors were used and only 3.2 million horses and mules. An outcome of making farming more efficient is the reduction of the number of farms from over 6 million in 1920 to about 2.2 million farms today. Listed below is the number of farms, in thousands, for each of the 50 states. Summarize the data and write a paragraph that describes your findings.
  • 217. 50 12 5 28 59 19 35 22 80 5 8 48 3 75 25 77 46 68 10 69 77 25 13 20 35 6 52 61 36 38 88 1 75 246 59 50 44 98 74 2 32 42 7 31 28 9 8 44 25 37 49. One of the most popular candies in the United States is M&M’s produced by the Mars Company. In the beginning M&M’s were all brown. Now they are produced in red, green, blue, orange, brown, and yellow. Recently, the purchase of a 14-ounce bag of M&M’s Plain had 444 candies with the following breakdown by color: 130 brown, 98 yellow, 96 red, 35 orange, 52 blue, and 33 green. Develop a chart depicting this information and write a paragraph summarizing the results. 50. The number of families who used the Minneapolis YWCA day care service was recorded during a 30-day period. The results are as follows: 31 49 19 62 24 45 23 51 55 60 40 35 54 26 57 37 43 65 18 41 50 56 4 54 39 52 35 51 63 42
  • 218. a. Construct a cumulative frequency distribution. b. Sketch a graph of the cumulative frequency polygon. c. How many days saw fewer than 30 families utilize the day care center? d. Based on cumulative relative frequencies, how busy were the highest 80% of the days? DATA ANALYTICS 51. Refer to the North Valley Real Estate data that reports information on homes sold during the last year. For the variable price, select an appropriate class interval and organize the selling prices into a frequency distribution. Write a brief report summarizing your findings. Be sure to answer the following questions in your report. a. Around what values of price do the data tend to cluster? b. Based on the frequency distribution, what is the typical selling price in the first class? What is the typical selling price in the last class?
  • 219. c. Draw a cumulative relative frequency distribution. Using this distribution, fifty percent of the homes sold for what price or less? Estimate the lower price of the top ten percent of homes sold. About what percent of the homes sold for less than $300,000? d. Refer to the variable bedrooms. Draw a bar chart showing the number of homes sold with 2, 3, 4 or more bedrooms. Write a description of the distribution. 52. Refer to the Baseball 2016 data that report information on the 30 Major League Baseball teams for the 2016 season. Create a frequency distribution for the Team Salary variable and answer the following questions. a. What is the typical salary for a team? What is the range of the salaries? b. Comment on the shape of the distribution. Does it appear
  • 220. that any of the teams have a salary that is out of line with the others? c. Draw a cumulative relative frequency distribution of team salary. Using this distribution, forty percent of the teams have a salary of less than what amount? About how many teams have a total salary of more than $220 million? 53. Refer to the Lincolnville School District bus data. Select the variable referring to the number of miles traveled since the last maintenance, and then organize these data into a frequency distribution. a. What is a typical amount of miles traveled? What is the range? b. Comment on the shape of the distribution. Are there any outliers in terms of miles driven? c. Draw a cumulative relative frequency distribution. Forty percent of the buses were driven fewer than how many miles? How many buses were
  • 221. driven less than 10,500 miles? d. Refer to the variables regarding the bus manufacturer and the bus capacity. Draw a pie chart of each variable and write a description of your results. Week 2 Lecture Last week we looked at describing data sets. We looked at summary statistics for location, variation/consistency, position, and likelihood. While discussing consistency and variability within the data, the need often arises to examine distribution patterns. Distributions are a critical element of statistical analysis. As we will see starting next week, a lot of our ability to make inferences about populations based on sample results depends on assumptions about data distribution patterns. We start our discussions about data patterns and distributions by examining some graphical analysis techniques; describing and organizing the data visually to see what insights might be gained. Tables and graphs are some of the best techniques to display the characteristics of the data – clustering, dispersion, center, outliers, even shape are all important elements in understanding what the data is telling us.
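As a small illustration of how organizing data can reveal clustering, center, and shape, the following minimal Python sketch tallies the 25 commute times from Exercise 29 into classes and prints a text chart. The class interval of 5 minutes and the lower limit of 15 are assumptions chosen only for illustration (they are not the textbook's answer), and the function name frequency_table is hypothetical.

```python
from collections import Counter

# Commute times (minutes) for the 25 executives in Exercise 29.
commute_minutes = [
    28, 25, 48, 37, 41, 19, 32, 26, 16, 23, 23, 29, 36,
    31, 26, 21, 32, 25, 31, 43, 35, 42, 38, 33, 28,
]

CLASS_WIDTH = 5   # assumed class interval, chosen for illustration
LOWER_LIMIT = 15  # assumed lower limit of the first class


def frequency_table(values, lower, width):
    """Return (class_start, class_end, count) for every class up to the last occupied one.

    Assumes every value is at least `lower`. Each class runs from
    class_start up to, but not including, class_end.
    """
    counts = Counter((v - lower) // width for v in values)
    last_class = max(counts)
    return [
        (lower + i * width, lower + (i + 1) * width, counts.get(i, 0))
        for i in range(last_class + 1)
    ]


table = frequency_table(commute_minutes, LOWER_LIMIT, CLASS_WIDTH)
for start, end, count in table:
    # One asterisk per observation, so the rows read like a sideways histogram.
    print(f"{start:>2} up to {end:<2} | {count:>2} | {'*' * count}")
```

With these assumed classes, the tallest row is the 25 up to 30 class, and the counts taper off on either side of it, which is the kind of clustering and shape a histogram of the same data would show.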
  • 222. Visual conclusions fall into the realm of qualitative findings. While many feel comfortable making claims based on these observations, others feel that, as useful as these initial observations may be, claims must be tested and verified with quantitative approaches such as experimentation, additional sampling, and inferential statistical tests. The ultimate goal of graphical displays is to illuminate relationships in the data and make things clearer. Graphs and Tables. Tables. Tables, or frequency tables, show numerical counts and percentages. Single-variable tables generally show frequencies and relative frequencies. Multi-variable tables, also known as crosstabulation tables, show counts between and among the variables. The Excel PivotTable tool will create these kinds of tables. Graphs. It has often been said that “a picture is worth a thousand words,” because graphs can display relationships and detail that are often missed or hard to describe otherwise. This is the strength – and weakness – of graphs. Done well, they illuminate patterns and relationships; done poorly – either intentionally or through design errors – they can distort and hide key data issues. Types of graphs. While there are literally dozens of graph types, we will look at
  • 223. only a few of the most commonly used. These include bar graphs, column graphs and histograms, line graphs, scatter diagrams, and pie charts. The general purpose of each of these is to provide a visual representation of the variation within data sets. Bar and Column Graphs. These graphs are very similar, as both display frequency counts for unique groups or attributes. Bars are shown horizontally, while columns are vertical. Dot Plots. These graphs use dots to represent data points along a single numerical axis. Repeated values result in vertical columns of dots. The data points may be individual values or ranges grouped into “bins.” Histograms. These graphs have some characteristics similar to both dot plots and column graphs. They are columns that touch each other and show counts for how many values of a continuous measurement are within each bin or range. Generally, they have between 5 and 7 bins, depending upon the number of data points. Line Graphs. These graphs show trends over time or across groups. They are used in quality control as statistical process control charts. Scatter Diagrams. These graphs use dots to show the relationship between pairs of measurements. Often, a regression line will be added to show the linear relationship. Pie Charts. These circular charts show the percent or
  • 224. proportion each group is of the whole. Excel Tool. The Insert tab on Excel’s main ribbon allows for the creation of tables, charts, and graphs. Interpretation Issues – What to Look For. We examine graphs and tables both for what they show and for what they don’t. Obviously, look for what the graphs show: · Trends · Changes in trends, means, variation/spread · Patterns and cycles · Data clustering · Outliers · Data gaps or missing data · Relationships and changes in relationships · Randomness or non-randomness. In the Sherlock Holmes story “Silver Blaze,” Holmes draws attention to the curious incident of the dog in the night-time. When the inspector objects that the dog did nothing in the night-time, Holmes replies that this was the curious incident. The point was, the dog should have barked if a stranger was present, but it did not. That missing data point suggested something. The same is true with graphs. In addition to looking for what is there, look for what isn’t: · Missing data, particularly with sharp drops at one end or the other indicating missing or unreported results · Randomness - data that is “too” neat or perfect might have
  • 225. been manipulated · Identical base comparison years or units; for example, one measure based on hundreds and another on thousands will distort the relationship between them. How to Lie with Graphs. Graphs are wonderful at displaying information. However, as much of their impact is visual, they can easily be distorted. Here are a couple of tricks to watch out for. One simple trick is to not start the y-axis at 0. This has the effect of stretching out vertical differences – a line that might look fairly flat if graphed with values starting at 0 could show a sharp increase with a restricted y-axis range. Another common distortion occurs with column graphs. Even though the difference between bars should be judged solely on height, making one base much narrower and another much wider distorts the area of the bars, and people form judgements more on area comparisons than strictly on height – so the “fatter” bar will seem more significant. Probability distributions. Statistical inference – making judgements about a population based on the results of samples – relies on two critical elements. The first is having a random sample, one that represents the population as fairly as possible. The other is an analysis based on the proper probability distribution. Statistical
  • 226. inference is based on probability – the likelihood of getting the results we did given the population we assume we are dealing with. For example, when we toss a pair of fair dice, we expect that the long-term average sum of the showing faces will be 7. If we toss a pair of dice 100 times and get an average of 3, we rightly assume something is wrong, as the probability of getting 3 with a pair of fair dice is quite low even for a single toss, much less for the average. Reading and Interpreting Distributions. Let’s use the example of tossing a pair of dice to build and interpret a probability distribution. When we toss a pair of dice, we have 36 possible outcomes resulting in sums from 2 to 12 showing on the top faces. In theory, we have a 1/36 probability of getting a 2 (1 showing on each face), a 2/36 probability of getting a 3 (1,2 or 2,1), a 3/36 probability of getting a 4 (1,3; 2,2; or 3,1), etc. The complete theoretical probability distribution for these values is shown in the histogram below. The graph shows the value of the sum of the top faces of the two dice on the x (horizontal) axis and the number of ways that the value can be formed on the y (vertical) axis. As noted, we can form the value 3 in only 2 ways, and this gives us a probability of 2/36 = 0.056, or a 5.6% chance, of getting a 3 when we toss a pair of dice. Let’s use this histogram to learn about probability
  • 227. distributions. Some basics: · The area under the entire curve (or the sum of the bar areas in this case) equals 1.00, meaning that one of the outcomes must occur. · The probability of a single outcome (for example, rolling a 9) equals the area for the outcome listed on the x-axis. · The probability for multiple outcomes, such as getting an 11 or 12, is the sum of the probabilities for each value (since the outcomes are mutually exclusive). · We define the term “p-value” as the probability of getting a value equal to or more extreme than any specific value. For example, the p-value for an outcome of 10 would be the probability of getting a 10, 11, or 12, just as the p-value for a value of 5 would be the probability of getting a 2, 3, 4, or 5. So, with these “ground rules,” let’s explore how to use this probability distribution to understand the outcomes. Example 1. What is the probability of getting a 7 on any given toss? Since we can get a 7 in any of 6 ways, the probability is 6/36 or .17. Example 2. What is the probability of getting a 6, 7, or 8 on any given toss? We can get a 6 in 5 different ways, a 7 in 6 ways, and an 8 in 5 ways, so in total we have a probability of (5 + 6 + 5)/36, which equals 16/36 = .44. We simply add up the probabilities of each separate outcome, which is the same as adding the area for these
  • 228. outcomes. Example 3. What is the probability of getting any value larger than 4? This asks about getting the values of 5, 6, 7, 8, 9, 10, 11, or 12. We could, of course, simply add the areas for each to get the answer, but a simpler way exists. We know that the probability of getting 2 – 4 [P(2, 3, or 4)] plus the probability of getting 5 – 12 [P(5 through 12)] must equal 1, as these two probabilities encompass the entire range of possible outcomes. So, if P(2, 3, or 4) + P(5 through 12) = 1, then P(5 through 12) = 1 - P(2, 3, or 4). This is called the complement rule. It is often easier to find the probability of the opposite of an event and use the complement rule to find the desired probability. In this case, P(2, 3, or 4) = (1 + 2 + 3)/36 = 6/36. So P(5 through 12) = 1 - P(2, 3, or 4) = 1 - 6/36 = 30/36 or .83. Example 4. What is the p-value of getting a 4 or less? 10 or more? Recall that a p-value is the probability of getting a specific result or a more extreme result. Looking outward from the center of the distribution, the results more extreme than 4 are 3 and 2. So, the p-value would be P(2, 3, or 4), which we calculated above as 6/36 or .17. The same thinking applies to getting a 10 or more; the related more extreme outcomes would be 11 or 12. Since we have a
  • 229. symmetrical distribution, the probability of 10, 11, or 12 is the same as that of 2, 3, or 4, namely .17. Example 5. What is the probability of getting between 5 and 9 (inclusive) on a single toss? This would equal P(5 through 12) minus P(10, 11, or 12). Since we know both of these values from examples 3 and 4, we get 30/36 – 6/36 = 24/36 = .67. These 5 examples cover the most common situations encountered with a probability distribution. In the case of discrete outcomes, we could also find something like the probability of getting an even or odd outcome; this would simply equal the sum of the column areas for each of the appropriate values. Normal Curve. One of the most commonly used probability distributions in statistics is the normal curve, also known as the bell-shaped curve. The normal curve looks much like the histogram we used above with the bars shrunk down to almost no width. The normal curve values run from minus infinity to plus infinity, but the practical range is much smaller. The mean = median = mode for the curve, and the curve is symmetrical about that point. As with our histogram above, the area under the normal curve equals 1.0. A specialized case of the normal curve, called the standard normal curve, has a mean of 0 and a standard deviation of 1.0. We get the standard normal curve from any normal curve by subtracting the mean from each value, and then dividing the result by the original standard deviation. This allows us to
  • 230. determine probabilities of any outcome using one curve rather than needing to calculate values from different curves all the time. And, as we might hope, Excel will do all the math involved for us. (Actually, Excel will do the math for any normal curve as well.) Some key functions, found in the Fx and Formulas lists, include the following. Note that functions having “.S.” in the middle are for the standard normal curve; those without the “.S.” are for any normal curve distribution. · NORM.DIST(value, mean, standard deviation, cumulative) gives the total area/probability to the left of the stated value for a normal curve with the specified mean and standard deviation when cumulative = TRUE or 1. (Note that if cumulative is FALSE or 0, we get the height of the curve for graphing purposes.) Example: =NORM.DIST(10, 8, 2, TRUE) = 0.8413 (rounded). · NORM.INV(probability, mean, standard deviation) returns the numerical value for the given probability. Example: =NORM.INV(0.8413, 8, 2) = 10 (rounded). · NORM.S.DIST(value, cumulative) gives the area/probability of the given z-score value or less with cumulative set to TRUE or 1. Example: =NORM.S.DIST(1.96, TRUE) = 0.975. · NORM.S.INV(probability) returns the z-score associated with the given probability. Example: =NORM.S.INV(0.975) = 1.96. With these functions, we can do the same kinds of probability
  • 231. calculations we did with our dice and the histogram. Some examples follow. Example 1. What is the probability of getting a result exactly in the middle of the distribution, a z-score of 0.00? Note that since the normal curve is continuous, the probability of any single exact value is technically 0 (a single point has no area under the curve). However, since specific events and values do occur, we create a range by making an adjustment. Since z-scores are typically reported to two decimal places, we add +/-0.005 to the score for our range. So, the area or probability for a z-score of 0 would be the area under the curve between -0.005 and +0.005. We find the area to the left of the larger value and subtract the area to the left of the smaller value from it. For our example, this equals =NORM.S.DIST(0.005,1) – NORM.S.DIST(-0.005,1) = 0.003989 or 0.004 (rounded). Example 2. What is the p-value of exceeding a z-score of 1.96? NORM.S.DIST does not directly calculate the probability of exceeding a value, so we use the complement rule whenever we are asked for a probability exceeding a value. Since we are again working with a z-score, we use the standard normal curve functions: =1-NORM.S.DIST(1.96,1) = 1-0.975 = 0.025. Example 3. What is the p-value of getting less than a z-score of -1.96? The probability of getting a score up to any value is found directly from NORM.S.DIST, so this question is answered by
  • 232. =NORM.S.DIST(-1.96,1) = 0.025 (rounded). With these three approaches, you can find any normal curve probability based on a z-score. If you have means and standard deviations, the same logic applies, but you would use the normal curve functions that do not contain the “.S.” term. T Curve. A special family of curves, similar in shape to the normal curve, is used when we estimate the mean and standard deviation from sample values. These curves are somewhat flatter, with heavier tails, than the standard normal curve. Additionally, a separate curve exists for each sample size that we might use. The good news is that Excel does all the work for these curves as well. And, as we will see, these curves are used more often than the standard normal curve in statistical analysis. The key difference with the t curves is the idea of degrees of freedom (df). For the t distribution, df = the sample size - 1 (n - 1), and this value is used in the Excel functions. The t-related Excel functions, also found in Fx and Formulas, are: · T.DIST(t-value, df, cumulative) – the p-value (probability) for this value or less; for example, =T.DIST(2.228,10,1) = 0.975 (rounded). 1-T.DIST(t-value, df, 1) would be the p-value for a positive t-value. T.DIST(-2.228,10,1) = 0.025; this would be the p-value for a negative t-value. As with NORM.DIST, using FALSE or 0 for the cumulative value gives us the height of the curve for graphing the t-distribution. · T.DIST.2T(t-value, df) – the probability of getting a value outside the range from
  • 233. (minus t-value) to (plus t-value), that is, the combined area in both tails; for example, =T.DIST.2T(2.228,10) = 0.05 (rounded). · T.DIST.RT(t-value, df) – the p-value (probability) of getting this value or more; for example, =T.DIST.RT(2.228,10) = 0.025 (rounded), the p-value for a positive t-value. · T.INV(probability, df) – the t-value that has the given probability of being this large or smaller; for example, =T.INV(.95,10) = 1.812. · T.INV.2T(probability, df) – the t-value that cuts off probability/2 in each tail; for example, =T.INV.2T(0.05, 10) = 2.228. The probability of equaling or exceeding +2.228 is 0.025, while the probability of equaling or being less than -2.228 is 0.025. Using these functions to find values and probabilities for ranges is done in a similar fashion as with the normal curve examples shown above. Week 2 Guidance. This week, we look at graphical analysis. We learn how to select a graph to best display a certain type of data, including two-dimensional scatter plots for paired or bivariate data. The shape of a scatter plot tells us if the data are correlated with one another. If data are highly correlated, then the value of one variable may be used to make a prediction about the value of
  • 234. the other. This prediction process involves regression analysis and the construction of a regression equation. As in week one, we will employ the eight elements of thought to think critically about these topics. As you think this week, try to discern the purpose of correlation and regression (Paul and Elder 2006). What questions might we be able to answer? What assumptions must we make? What data do we need? How does our point of view impact our ability to predict? What are the critical ideas or concepts? What conclusions can we draw and what are the consequences or implications?
    Bivariate Data in Context
    Bivariate data are paired data. The pairing of data does not combine them, but rather associates them according to collection. For example, suppose you collect the height and weight of each player on a high school basketball team. Each player has two measurements that describe different traits. Suppose, for example, that there are only five players:
    Height (in inches)    Weight (in pounds)
    67                    155
    72                    220
    77                    240
    74                    195
    69                    175
  • 235. If we look at just height or just weight, we might display the data as a bar graph or (for more players) a histogram. If we sorted one column and didn’t sort the other, we would unpair the data: the 67-inch-tall person would be adjacent to the score for the 240-pound person, for example, even though they represent different people. Bivariate data are coupled. In fact, we could also represent the data as a single list of ordered pairs: (67, 155), (72, 220), (77, 240), (74, 195), and (69, 175). The first number in each ordered pair represents height and the second number represents weight. Bivariate data allow us to look at trends in one variable and determine if there is any relationship with trends in the other variable. Do you think that taller people in general will weigh more? If so, then you are suggesting that there is a positive correlation between height and weight. A small business owner might collect bivariate data for the price of a certain product and the number of units sold on a monthly basis. If price increases, we might expect sales to decrease. When an increase in one variable is associated with a decrease in the paired variable, we refer to the relationship as a negative correlation.
    Scatter Diagrams and the Correlation Coefficient
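    To make the pairing idea concrete, here is a minimal Python sketch. It is not part of the original guidance, and the names pairs, pearson_r, by_height, and unpaired are illustrative assumptions. It stores the five basketball players as ordered pairs and computes Pearson's correlation coefficient, the measure named in the heading above, first with the pairing intact and then after sorting only the weight column.

```python
import math

# Basketball data from the text, kept as (height_in, weight_lb) ordered pairs.
pairs = [(67, 155), (72, 220), (77, 240), (74, 195), (69, 175)]


def pearson_r(data):
    """Return Pearson's correlation coefficient for a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    syy = sum((y - mean_y) ** 2 for _, y in data)
    return sxy / math.sqrt(sxx * syy)


# Sorting whole pairs keeps each player's two measurements together.
by_height = sorted(pairs)

# Sorting only the weight column breaks the pairing:
# each weight may now sit next to a different player's height.
unpaired = list(zip((h for h, _ in pairs), sorted(w for _, w in pairs)))

r = pearson_r(pairs)
print(f"r with pairing intact:      {r:.3f}")
print(f"r after sorting whole pairs: {pearson_r(by_height):.3f}")
print(f"r after unpairing:           {pearson_r(unpaired):.3f}")
```

    With the pairing intact, r comes out at roughly 0.90, a strong positive correlation. Sorting whole pairs leaves r unchanged, while sorting one column on its own changes it, which is why bivariate data must stay coupled.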
  • 236. Six Sigma is a methodology for improving business processes by minimizing defects, errors, and variability through the use of statistical tools. On its website, Six Sigma. defines scatter plots as follows:
    Scatter plots are used with variable data to study possible relationships between two different variables. Even though a scatter plot depicts a relationship between variables, it does not indicate a cause and effect relationship. Use Scatter plots to determine what happens to one variable when another variable changes value. It is a tool used to visually determine whether a potential relationship exists between an input and an outcome.
    So a scatter plot or scatter diagram is just a two-dimensional plot, as you may have done in middle school, where we use one variable as the horizontal axis (x-coordinate) and one variable as the vertical axis (y-coordinate). Our basketball data above would be plotted as a scatter diagram (figure not reproduced here).
    The correlation coefficient, or Pearson’s r-value, is a measure of how closely the points in the scatter diagram are modeled by a straight line. The correlation coefficient for any bivariate data will be a number from -1 to +1. Data with an r near -1 are highly correlated in the negative direction, which means there is the inverse relationship discussed in the price and sales example. These data will resemble a negatively sloped line in the scatter diagram, with a pattern that descends from left to right. Data
  • 237. with a correlation coefficient near +1 are highly correlated in the positive direction and resemble a positively sloped line in the scatter plot. Data with a correlation value near 0 (on either side) show little or no linear correlation. No line fits better than any other line and there is practically no linear association between the values. Non-correlated bivariate data appear like a round cloud of dots with no discernible direction or pattern.
    Predictions with Linear Regression
    If data are highly correlated, in either the positive or negative direction, then we are able to use information about one value to make predictions about the potential value of the correlated variable. Since we use a straight-line approximation for the data, we call this process linear regression. The better our data fit a straight line, the better our predictions using this method. Another way of stating the same principle is that correlations with a coefficient near +1 or -1 are the most reliable as predictive linear models. The general process for linear regression is as follows:
    1. Check the strength of the correlation. As a rule of thumb, regression is usually attempted only when the r-value is above 0.4 or below -0.4.
    2. Use the least squares method to find the equation for the line of best fit. Often this step is completed using a software package such as Minitab, SPSS, a TI Calculator, or even Excel.
  • 238. The resulting equation will have the form ŷ = a + bx, where x is the variable depicted on the horizontal axis (input), ŷ is the output or predicted value for the variable on the vertical axis, a is the intercept, and b is the slope of the line.
    3. Substitute hypothesized values for x to predict values for y.
    Students should be able to:
    1. Examine the value of presenting data graphically.
    2. Describe guidelines for effectively using graphical tools to present numerical information.
    References:
    Lind, D. A., Marchal, W. G., & Wathen, S. A. (2017). Statistical techniques in business and economics (17th ed.).
    Paul, R. and Elder, L. (2006). The Miniature Guide to Critical Thinking: Concepts and Tools. Berkeley, CA: The Foundation for Critical Thinking.
    Passy. (2012, March 13). Misleading graphs. Retrieved from http://passyworldofmathematics.com/misleading-graphs/
    Pearson, Karl (1924). The Life, Letters, and Labours of Francis Galton. London: Cambridge University Press.
    Week 2 Discussions and Required Resources
    Part 1 and Part 2 must be at least 150-200 words unless otherwise
    Part 1: Graphical Analysis Techniques
  • 239. There are strengths and weaknesses to graphical analysis research techniques. For this discussion, begin by reviewing the technique of graphical analysis in your textbook. Then, keeping this technique in mind, read the following quotes:
    · “Errors using inadequate data are much less than those using no data at all.” - Charles Babbage
    · “Statistics is the science of variation.” - Douglas M. Bates (1985)
    · “All models are wrong, but some models are useful.” - George E. P. Box (1979)
    · “The greatest moments are those when you see the result pop up in a graph or in your statistics analysis - that moment you realize you know something no one else does and you get the pleasure of thinking about how to tell them.” - Emily Oster
    https://www.goodreads.com/quotes/search?utf8=%E2%9C%93& q=statistics&commit=Search
    Also consider the following ways to make a graph misleading, from Misleading Graphs (Passy, 2012):
    · “Vertical scale is too big or too small.
    · Vertical axis skips numbers, or does not start at zero.
    · Graph is not labeled properly.
    · Graph does not have a title to explain what it is about.
    · Data is left out.
    · Scale not starting at zero.
    · Scale made in very small units to make graph look very big.
  • 240. · Scale values or labels missing from the graph.
    · Incorrect scale placed on the graph.
    · Pieces of a pie chart are not the correct sizes.
    · Oversized volumes of objects that are too big for the vertical scale differences they represent.
    · Size of images used in pictographs being different for the different categories being graphed.
    · Graph being a non-standard size or shape.”
    Based on the above quotes, along with this week’s assigned readings and Instructor Guidance, compare graphical analysis with quantitative analysis (a technique you explored last week), and discuss why graphical analysis is important in research. Finally, describe guidelines for using graphical tools to present information clearly and effectively.
    Part 2: Examples of Graphical Analysis Techniques in Research
    Locate an example of a research study that uses graphs and/or tables in its analysis. Explain what this statistical technique allows the researchers to accomplish and/or conclude in the study. Note: Graphic presentations are most often found in the Results section of a study.
    Required Resource
  • 241. Text
    Lind, D. A., Marchal, W. G., & Wathen, S. A. (2017). Statistical techniques in business and economics (17th ed.). Retrieved from http://connect.mheducation.com/class/ The textbook is attached.
    · Chapter 2: Describing Data: Frequency Tables, Frequency Distributions, and Graphic Presentation
    · Chapter 4: Describing Data: Displaying and Exploring Data
    Article
    Passy. (2012, March 13). Misleading graphs. Retrieved from http://passyworldofmathematics.com/misleading-graphs/
    · This article provides information about graph techniques often used by both advertisers and the media to mislead viewers. It will assist you in your Graphical Analysis Techniques discussion this week.