Lecture NOTE ON Basic Statistics Prem
Mann, Introductory Statistics
Introductory Statistics 7th ed (Jaamacada Hargeysa)
CHAPTER ONE
INTRODUCTION TO STATISTICS
Objectives:
At the end of this chapter, students should be able to:
• Define statistics.
• Differentiate between the two branches of statistics.
• Classify types of data.
• Identify the measurement level for each variable.
• Explain the difference between cross-section and time-series data.
Introduction
The study of statistics has become more popular than ever over the past four decades or so. The
increasing availability of computers and statistical software packages has enlarged the role of statistics
as a tool for empirical research. As a result, statistics is used for research in almost all professions,
from medicine to sports. Today, college students in almost all disciplines are required to take at least
one statistics course. Almost all newspapers and magazines these days contain graphs and stories on
statistical studies. After you finish reading this book, it should be much easier to understand these
graphs and stories.
Every field of study has its own terminology. Statistics is no exception. This introductory chapter
explains the basic terms of statistics. These terms will bridge our understanding of the concepts and
techniques presented in subsequent chapters.
Definition
Statistics is a group of methods used to collect, analyze, present, and interpret data and to make
decisions.
Every day, we make decisions that may be personal, business related, or of some other kind.
Usually these decisions are made under conditions of uncertainty. Many times, the situations or
problems we face in the real world have no precise or definite solution. Statistical methods help us
make scientific and intelligent decisions in such situations. Decisions made by using statistical
methods are called educated guesses. Decisions made without using statistical (or scientific) methods
are pure guesses and, hence, may prove to be unreliable. For example, opening a large store in an area
with or without assessing the need for it may affect its success.
TYPES OF STATISTICS:
Broadly speaking, applied statistics can be divided into two areas:
1) Descriptive statistics and
2) Inferential statistics.
Descriptive Statistics
Suppose we have information on the test scores of students enrolled in a statistics class. In statistical
terminology, the whole set of numbers that represents the scores of students is called a data set, the
name of each student is called an element, and the score of each student is called an observation.
A data set in its original form is usually very large. Consequently, such a data set is not very helpful in
drawing conclusions or making decisions. It is easier to draw conclusions from summary tables and
diagrams than from the original version of a data set. So, we reduce data to a manageable size by
constructing tables, drawing graphs, or calculating summary measures such as averages. The portion
of statistics that helps us do this type of statistical analysis is called descriptive statistics.
Definition
Descriptive statistics consists of methods for organizing, displaying, and describing data by using
tables, graphs, and summary measures.
Inferential Statistics
In statistics, the collection of all elements of interest is called a population. The selection of a few
elements from this population is called a sample.
A major portion of statistics deals with making decisions, inferences, predictions, and forecasts about
populations based on results obtained from samples. For example, we may want to find the starting
salary of a typical university graduate. To do so, we may select 2000 recent college graduates, find
their starting salaries, and make a decision based on this information. The area of statistics that deals
with such decision-making procedures is referred to as inferential statistics. This branch of statistics is
also called inductive reasoning or inductive statistics.
Definition
Inferential Statistics consists of methods that use sample results to help make decisions or
predictions about a population.
EXERCISES
1. Briefly describe the two meanings of the word statistics.
2. Briefly explain the types of statistics.
Population versus Sample
We will encounter the terms population and sample on almost every page of this text. Consequently,
understanding the meaning of each of these two terms and the difference between them is crucial.
Suppose a statistician is interested in knowing
1. The percentage of all voters in a city who will vote for a particular candidate in an election
2. The 2009 gross sales of all companies in New York City
3. The prices of all houses in California
In these examples, the statistician is interested in all voters, all companies, and all houses.
Each of these groups is called the population for the respective example. In statistics, a population
does not necessarily mean a collection of people. It can, in fact, be a collection of people or of any
kind of item such as houses, books, television sets, or cars. The population of interest is usually called
the target population.
Definition
A population consists of all subjects (human or otherwise) that are being studied.
Most of the time, due to the expense, time, size of population, medical concerns, etc., it is not
possible to use the entire population for a statistical study; therefore, researchers use samples.
Definition
A sample is a group of subjects selected from a population.
If the subjects of a sample are properly selected, most of the time they should possess the same or
similar characteristics as the subjects in the population.
The collection of information from the elements of a population or a sample is called a survey. A
survey that includes every element of the target population is called a census. Often the target
population is very large. Hence, in practice, a census is rarely taken because it is expensive and time-
consuming.
Definition
Census and Sample Survey: A survey that includes every member of the population is called a
census. The technique of collecting information from a portion of the population is called a
sample survey.
The purpose of conducting a sample survey is to make decisions about the corresponding population.
It is important that the results obtained from a sample survey closely match the results that we would
obtain by conducting a census. Otherwise, any decision based on a sample survey will not apply to the
corresponding population.
Definition
Representative Sample: A sample that represents the characteristics of the population as closely
as possible is called a representative sample.
Exercise
1. Briefly explain the terms population, sample, representative sample, random sample, sampling
with replacement, and sampling without replacement.
2. Give one example each of sampling with and sampling without replacement.
3. Briefly explain the difference between a census and a sample survey. Why is conducting a
sample survey preferable to conducting a census?
4. Explain whether each of the following constitutes a population or a sample.
a. Pounds of bass caught by all participants in a bass fishing derby
b. Credit card debts of 100 families selected from a city
c. Amount spent on prescription drugs by 200 senior citizens in a large city
d. Weekly salaries of all employees of a company
e. Number of computers sold during the past week at all computer stores in Hargeisa
Basic terms
It is very important to understand the meaning of some basic terms that will be used frequently in this
text.
Definition
Element or Member: An element or member of a sample or population is a specific subject or
object (for example, a person, firm, item, state, or country) about which the information is collected.
Example
The following table gives information on the 2007 charitable givings (in millions of U.S. dollars) by
six retail companies. We can call this group of companies a sample of six companies. Each company
listed in this table is called an element or a member of the sample. The table contains information on
six elements. Note that elements are also called observational units.
The 2007 charitable givings in our example is called a variable. It is a characteristic of the
companies that we are investigating or studying.
Definition
Variable: A variable is a characteristic under study that assumes different values for different
elements. In contrast to a variable, the value of a constant is fixed.
The value of a variable for an element is called an observation or measurement.
A data set is a collection of observations on one or more variables.
Types of Variables
Statisticians gain information about a particular situation by collecting data for random variables.
This section will explore in greater detail the nature of variables and types of data.
Examples of variables are:
a) The incomes of households,
b) The number of houses built in a city per month during the past year,
c) The gross profits of companies
d) Colors of cars
e) Marital status of people
A variable is a characteristic or attribute that can assume different values.
Variables can be classified as:
A) Qualitative variables or
B) Quantitative variables
Qualitative variables are variables that can be placed into distinct categories, according to some
characteristic or attribute. For example, if subjects are classified according to gender (male or
female), then the variable gender is qualitative. Other examples of qualitative variables are
religious preference and geographic locations.
Definition
A) Qualitative or Categorical Variable: A variable that cannot assume a numerical value
but can be classified into two or more nonnumeric categories is called a qualitative or
categorical variable. The data collected on such a variable are called qualitative data.
Examples:
1) The status of an undergraduate college student (freshman, sophomore, junior, or
senior.)
2) The gender of a person
3) The brand of a computer
4) The opinions of people.
Definition:
B) Quantitative Variable: A variable that can be measured numerically is called a
quantitative variable. The data collected on a quantitative variable are called quantitative
data.
Incomes, heights, gross sales, prices of homes, number of cars owned, and number of accidents
are examples of quantitative variables because each of them can be expressed numerically.
For instance, the income of a family may be $81,520.75 per year, the gross sales for a company
may be $567 million for the past year, and so forth.
Quantitative variables can be further classified into two groups:
a) Discrete variables
b) Continuous variables
Discrete Variables
The values that a certain quantitative variable can assume may be countable or noncountable.
For example, we can count the number of cars owned by a family, but we cannot count the
height of a family member. A variable that assumes countable values is called a discrete
variable. Note that there are no possible intermediate values between consecutive values of a
discrete variable.
Definition:
Discrete Variable: A variable whose values are countable is called a discrete variable. In other
words, a discrete variable can assume only certain values with no intermediate values.
Examples:
1) The number of cars sold on any day at a car dealership
2) The number of people visiting a bank on any day
3) The number of cars in a parking lot
4) The number of students in a class.
Continuous Variables
Some variables cannot be counted, and they can assume any numerical value between two
numbers. Such variables are called continuous variables.
Definition:
Continuous Variable: A variable that can assume any numerical value over a certain interval or
intervals is called a continuous variable.
Examples:
1) The time taken to complete an examination
2) The temperature of a room
3) The height of a person
EXERCISE:
1) Briefly explain types of statistics
2) Differentiate population and sample
3) Briefly discuss the variables and their types
4) Explain whether each of the following constitutes a population or a sample.
a) Pounds of bass caught by all participants in a bass fishing derby
b) Credit card debts of 100 families selected from a city
c) Number of home runs hit by all Major League baseball players in the 2014 season
d) Weekly salaries of all employees of a company
e) Number of computers sold during the past week at all computer stores in Los Angeles
5) Indicate which of the following variables are quantitative and which are
qualitative.
a. Number of persons in a family
b. Colors of cars
c. Marital status of people
Cross-Section versus Time-Series Data
Based on the time over which they are collected, data can be classified as either cross-section or time-
series data.
Cross-Section Data
Cross-section data contain information on different elements of a population or sample for the same
period of time. The information on incomes of 100 families for 2009 is an example of cross-section
data. All examples of data already presented in this chapter have been cross-section data.
Definition
Cross-Section Data: Data collected on different elements at the same point in time or for the
same period of time are called cross-section data.
The table that shows the 2007 charitable givings of six retail companies is an example of cross-
section data.
Time-Series Data
Time-series data contain information on the same element for different periods of time. Information
on U.S. exports for the years 1983 to 2009 is an example of time-series data.
Definition
Time-Series Data: Data collected on the same element for the same variable at different points
in time or for different periods of time are called time-series data.
CHAPTER TWO
ORGANIZING AND GRAPHING DATA
Objectives:
• Organize data using a frequency distribution.
• Represent data in frequency distributions graphically using histograms, frequency
polygons, and ogives.
• Represent data using bar graphs, Pareto charts, and pie graphs.
• Draw and interpret a stem and leaf plot.
Introduction
When conducting a statistical study, the researcher must gather data for the particular variable
under study. For example, if a researcher wishes to study the number of people who were bitten
by poisonous snakes in a specific geographic area over the past several years, he or she has to
gather the data from various doctors, hospitals, or health departments.
To describe situations, draw conclusions, or make inferences about events, the researcher must
organize the data in some meaningful way. The most convenient method of organizing data is to
construct a frequency distribution.
After organizing the data, the researcher must present them so they can be understood by those
who will benefit from reading the study. The most useful method of presenting the data is by
constructing statistical charts and graphs. There are many different types of charts and graphs,
and each one has a specific purpose.
This chapter explains how to organize data by constructing frequency distributions and how to
present the data by constructing charts and graphs. The charts and graphs illustrated here are
histograms, frequency polygons, ogives, pie graphs, Pareto charts, and time series graphs. A
graph that combines the characteristics of a frequency distribution and a histogram, called a stem
and leaf plot, is also explained.
Organizing Data
Suppose a researcher wished to do a study on the ages of the top 50 wealthiest people in the
world. The researcher first would have to get the data on the ages of the people. In this case,
these ages are listed in Forbes Magazine. When the data are in original form, they are called raw
data and are listed next.
Definition
Raw Data: Data recorded in the sequence in which they are collected and before they are
processed or ranked are called raw data.
Since little information can be obtained from looking at raw data, the researcher organizes the
data into what is called a frequency distribution. A frequency distribution consists of classes and
their corresponding frequencies. Each raw data value is placed into a quantitative or qualitative
category called a class. The frequency of a class is then the number of data values contained in
that class.
A frequency distribution is shown for the preceding data set.
Class limits    Frequency
35-41            3
42-48            3
49-55            4
56-62           10
63-69           10
70-76            5
77-83           10
84-90            5
Now some general observations can be made from looking at the frequency distribution.
For example, it can be stated that the majority of the wealthy people in the study are over 55
years old.
A frequency distribution: The organization of raw data in table form, using classes and
frequencies.
Two types of frequency distributions that are most often used are the categorical frequency
distribution and the grouped frequency distribution. The procedures for constructing these
distributions are shown now.
Categorical Frequency Distributions
The categorical frequency distribution is used for data that can be placed in specific categories.
For example, data such as political affiliation, religious affiliation, or major field of study would
use categorical frequency distributions.
Example
A sample of 30 employees from large companies was selected, and these employees were
asked how stressful their jobs were. The responses of these employees are recorded below,
where very represents very stressful, somewhat means somewhat stressful, and
none stands for not stressful at all.
Construct a frequency distribution table for these data.
Solution
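Relative frequency of a category:
\[ \text{relative frequency} = \frac{f}{n} \]
where f is the frequency of the category and n is the sum of all frequencies.
\[ \text{percentage} = (\text{relative frequency}) \times 100 \]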
Definition
Frequency Distribution for Qualitative Data: A frequency distribution for qualitative data lists
all categories and the number of elements that belong to each of the categories.
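The frequency, relative frequency, and percentage columns of such a table can be sketched in a few lines of Python. This is only an illustration: the list `responses` is hypothetical (the 30 recorded responses of the example are not reproduced in these notes), and the function name `categorical_distribution` is an assumption made for this sketch.

```python
from collections import Counter

# Hypothetical responses, used only for illustration.
responses = ["very", "somewhat", "none", "somewhat", "very",
             "somewhat", "none", "very", "somewhat", "somewhat"]

def categorical_distribution(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)
    n = sum(counts.values())  # n = sum of all frequencies
    return {cat: (f, f / n, 100 * f / n) for cat, f in counts.items()}

table = categorical_distribution(responses)
print(f"{'Category':<10}{'f':>4}{'Rel.':>8}{'%':>8}")
for cat, (f, rel, pct) in table.items():
    print(f"{cat:<10}{f:>4}{rel:>8.2f}{pct:>8.1f}")
```

The relative frequencies always add up to 1 and the percentages to 100, which is a quick check on a hand-built table.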
lOMoARcPSD|7438752

15
Graphical Presentation of Qualitative Data
The graphs most commonly used to display qualitative data are:
1) Bar graph
2) Pareto graph
3) Pie graph
1) Bar graphs:
Definition
Bar Graph: A graph made of bars whose heights represent the frequencies of respective categories
is called a bar graph.
2) Pareto Charts
When the variable displayed on the horizontal axis is qualitative or categorical, a Pareto chart can
also be used to represent the data.
A Pareto chart is used to represent a frequency distribution for a categorical variable, and the
frequencies are displayed by the heights of vertical bars, which are arranged in order from
highest to lowest.
3) The Pie Graph
Pie graphs are used extensively in statistics. The purpose of the pie graph is to show the
relationship of the parts to the whole by visually comparing the sizes of the sections. Percentages
or proportions can be used. The variable is nominal or categorical.
A pie graph is a circle that is divided into sections or wedges according to the percentage of
frequencies in each category of the distribution.
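As a rough sketch of the arithmetic behind these two graphs, the snippet below orders categories from highest to lowest frequency (the order of the bars in a Pareto chart) and converts each frequency into the central angle of its pie wedge, using the fact that a full circle is 360 degrees. The counts in `frequencies` are hypothetical, and the function names are assumptions made for this illustration.

```python
# Hypothetical category frequencies, used only for illustration.
frequencies = {"very": 10, "somewhat": 14, "none": 6}

def pareto_order(freqs):
    """Categories sorted from highest to lowest frequency (Pareto chart order)."""
    return sorted(freqs.items(), key=lambda item: item[1], reverse=True)

def pie_angles(freqs):
    """Central angle in degrees of each wedge: (f / n) * 360."""
    n = sum(freqs.values())
    return {cat: 360 * f / n for cat, f in freqs.items()}

print(pareto_order(frequencies))
print(pie_angles(frequencies))
```

The wedge angles add up to 360 degrees, just as the percentages add up to 100.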
Example
Draw a bar graph and a pie chart for the above table.
Solution
Grouped Frequency Distributions
In the previous section we learned how to group and display qualitative data. This section
explains how to group and display quantitative data.
Definition
Frequency Distribution for Quantitative Data: A frequency distribution for quantitative data lists
all the classes and the number of values that belong to each class. Data presented in the form of a
frequency distribution are called grouped data.
Class Boundary: The class boundary is given by the midpoint of the upper limit of one class
and the lower limit of the next class.
The difference between the two boundaries of a class gives the class width. The class width is
also called the class size.
The class midpoint or mark is obtained by dividing the sum of the two limits (or the two
boundaries) of a class by 2.
\[ \text{class mark} = \frac{\text{lower limit} + \text{upper limit}}{2} \]
Constructing Frequency Distribution Tables
When you are constructing a frequency distribution, the following guidelines should be followed.
1) Determine the classes.
✓ Find the highest and lowest values.
✓ Find the range.
✓ Select the number of classes desired.
✓ Find the width by dividing the range by the number of classes and
rounding up.
✓ Select a starting point (usually the lowest value or any convenient
number less than the lowest value); add the width to get the lower
limits.
✓ Find the upper class limits.
✓ Find the boundaries.
2) Tally the data.
3) Find the numerical frequencies from the tallies, and find the cumulative frequencies.
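The guidelines above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data are whole numbers, the starting point is the lowest value, and boundaries lie half a unit beyond the class limits. The list `daily_sales` is hypothetical and the function name `grouped_distribution` is chosen only for this sketch.

```python
import math

# Hypothetical whole-number data, used only for illustration.
daily_sales = [12, 7, 19, 23, 15, 9, 30, 18, 21, 14,
               26, 11, 17, 22, 28, 16, 13, 24, 20, 10]

def grouped_distribution(data, num_classes):
    """Build a grouped frequency table for integer data."""
    low, high = min(data), max(data)
    data_range = high - low
    # Round range / classes up to the next whole number; a whole quotient
    # is bumped by 1 so that the highest value still fits in the last class.
    width = math.floor(data_range / num_classes) + 1
    rows, cumulative = [], 0
    for i in range(num_classes):
        lower = low + i * width          # lower class limit
        upper = lower + width - 1        # upper class limit
        freq = sum(lower <= x <= upper for x in data)
        cumulative += freq
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "midpoint": (lower + upper) / 2,
            "frequency": freq,
            "cumulative": cumulative,
        })
    return rows

rows = grouped_distribution(daily_sales, 5)
for row in rows:
    print(row["limits"], row["boundaries"], row["midpoint"],
          row["frequency"], row["cumulative"])
```

Here the range is 30 - 7 = 23, and 23 / 5 = 4.6 rounds up to a class width of 5, giving the classes 7-11, 12-16, 17-21, 22-26, and 27-31.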
Example 1
The following data give the total number of iPods sold by a mail order company on each of
30 days.
a) Construct a frequency distribution table.
b) Find the class boundaries, relative frequencies, and percentages.
Solution
Graphing Grouped Data
Grouped (quantitative) data can be displayed in many forms. Four graphs commonly used in
research are:
1) The histogram
2) The polygon
3) The ogive
4) The stem and leaf plot
1) Histograms
A histogram is a graph in which classes are marked on the horizontal axis and the frequencies
are marked on the vertical axis. The frequencies are represented by the heights of the bars. In a
histogram, the bars are drawn adjacent to each other.
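A plotting library is not needed to see the idea. The sketch below prints a rough text histogram for the frequency distribution of ages given earlier in this chapter. Note that, unlike the definition above, the bars here run sideways: each class gets one row, and the length of the row of asterisks equals the class frequency. The function name `text_histogram` is an assumption made for this illustration.

```python
# Class limits and frequencies from the ages table earlier in this chapter.
classes = ["35-41", "42-48", "49-55", "56-62", "63-69", "70-76", "77-83", "84-90"]
frequencies = [3, 3, 4, 10, 10, 5, 10, 5]

def text_histogram(labels, freqs, symbol="*"):
    """Return one line per class; the bar length equals the class frequency."""
    return [f"{label:>5} | {symbol * f} ({f})" for label, f in zip(labels, freqs)]

lines = text_histogram(classes, frequencies)
for line in lines:
    print(line)
```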
2) Polygon
A graph formed by joining the midpoints of the tops of successive bars in a histogram with straight
lines is called a polygon.
Example
a) Draw a histogram and polygon of data in example 1
[Figures: histogram and polygon for the data in Example 1]
Cumulative Frequency Distributions
Consider again the previous example about the total number of iPods sold by a company.
Suppose we want to know on how many days the company sold 19 or fewer iPods. Such a question
can be answered by using a cumulative frequency distribution. Each class in a cumulative frequency
distribution table gives the total number of values that fall below a certain
value. A cumulative frequency distribution is constructed for quantitative data only.
\[ \text{cumulative relative frequency} = \frac{\text{cumulative frequency}}{\text{total number of observations}} \]
\[ \text{cumulative percentage} = (\text{cumulative relative frequency}) \times 100 \]
Definition
A cumulative frequency distribution gives the total number of values that fall below the upper
boundary of each class. In a cumulative frequency distribution table, each class has the same lower
limit but a different upper limit.
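As an illustration, the cumulative columns can be computed for the frequency distribution of ages given earlier in this chapter. The sketch assumes upper boundaries half a unit above the upper class limits (for example, 41.5 for the class 35-41); the variable names are chosen only for this example.

```python
from itertools import accumulate

# Upper class boundaries and frequencies for the ages table in this chapter.
upper_boundaries = [41.5, 48.5, 55.5, 62.5, 69.5, 76.5, 83.5, 90.5]
frequencies = [3, 3, 4, 10, 10, 5, 10, 5]

cumulative = list(accumulate(frequencies))      # running totals
total = cumulative[-1]                          # total number of observations
cum_relative = [c / total for c in cumulative]
cum_percent = [100 * c / total for c in cumulative]

for ub, c, rel, pct in zip(upper_boundaries, cumulative, cum_relative, cum_percent):
    print(f"below {ub:>5}: {c:>3} {rel:>6.2f} {pct:>6.1f}%")
```

Each pair (upper boundary, cumulative frequency) is also one of the dots that are joined when the cumulative distribution is graphed.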
3) Ogive
An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines the
dots marked above the upper boundaries of classes at heights equal to the cumulative frequencies of
respective classes.
Example
Using the following frequency distribution
a) Prepare a cumulative frequency
b) Draw an Ogive
Solution
[Figure: ogive]
4) Stem and Leaf Plots
The stem and leaf plot is a method of organizing data and is a combination of sorting and
graphing. It has the advantage over a grouped frequency distribution of retaining the actual data
while showing them in graphical form.
In a stem-and-leaf display of quantitative data, each value is divided into two portions—a stem
and a leaf. The leaves for each stem are shown separately in a display.
EXAMPLE:
The following are the scores of 30 college students on a statistics test.
75 52 80 96 65 79 71 87 93 95 69 72 81 61 76 86 79 68 50 92 83 84 77 64
71 87 72 92 57 98
Construct a stem-and-leaf display
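One way to build the display is sketched below in Python, taking the tens digit of each score as the stem and the units digit as the leaf. The function name `stem_and_leaf` is an assumption for this illustration; the leaves are ranked in increasing order within each stem.

```python
from collections import defaultdict

# The 30 test scores from the example above.
scores = [75, 52, 80, 96, 65, 79, 71, 87, 93, 95, 69, 72, 81, 61, 76,
          86, 79, 68, 50, 92, 83, 84, 77, 64, 71, 87, 72, 92, 57, 98]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its ranked leaves (units digits)."""
    display = defaultdict(list)
    for v in sorted(values):
        display[v // 10].append(v % 10)
    return dict(sorted(display.items()))

plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Reading a row back gives the original values: stem 5 with leaves 0, 2, and 7 stands for the scores 50, 52, and 57.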
Exercise:
1. The following data give the results of a sample survey. The letters A, B, and C represent the
three categories.
a. Prepare a frequency distribution table.
b. Calculate the relative frequencies and percentages for all categories.
c. What percentage of the elements in this sample belong to category B?
d. What percentage of the elements in this sample belong to category A or C?
e. Draw a bar graph for the frequency distribution
2. The following data represent the status of 50 students
a. Prepare a frequency distribution table.
b. Calculate the relative frequencies and percentages for all categories.
c. What percentage of these students are juniors or seniors?
d. Draw a bar graph and a pie chart for the frequency distribution.
3. The following sample data set lists the prices (in dollars) of 30 portable global positioning
system (GPS) navigators.
a. Construct a frequency distribution that has seven classes.
b. Compute the class midpoints, class boundaries, class width, and relative frequencies.
c. Draw a histogram, a pie chart, and an ogive.
4. Construct a frequency distribution and a relative frequency histogram for the data set using
indicated classes. Which class has the greatest frequency and which has the least frequency?
a. Data set: Highway fuel consumptions (in miles per gallon) for a sample (using five classes)
b. Data set: A sample of ATM withdrawals (in dollars) (using five classes)
c. Data set: Retirement ages for a sample of doctors (using six classes)
d. Data set: Exam scores for all students in a statistics class (using five classes)
e. Data set: Number of children of the U.S. presidents (using six classes)
CHAPTER THREE
NUMERICAL DESCRIPTIVE TECHNIQUES
Objectives
➢ Summarize data, using measures of central tendency, such as the mean, median, mode,
and midrange.
➢ Describe data, using measures of variation, such as the range, variance, and standard
deviation.
➢ Identify the position of a data value in a data set, using various measures of position,
such as percentiles, deciles, and quartiles.
➢ Use the techniques of exploratory data analysis, including box plots and five-
number summaries, to discover various aspects of data.
Introduction
In Chapter 2 we discussed how to summarize data using different methods and to display data using
graphs. Graphs are one important component of statistics; however, it is also important to numerically
describe the main characteristics of a data set. The numerical summary measures, such as the ones that
identify the center and spread of a distribution, identify many important features of a distribution.
The measures that we discuss in this chapter include measures of (1) central tendency, (2) dispersion
(or spread), and (3) position.
Measures of Central Tendency for ungrouped data
We often represent a data set by numerical summary measures, usually called the typical values.
A measure of central tendency gives the center of a histogram or a frequency distribution curve. This
section discusses three different measures of central tendency: the mean, the median, and the mode;
however, a few other measures of central tendency, such as the trimmed mean, the weighted mean,
and the geometric mean, are explained in exercises following this section. We will learn how to
calculate each of these measures for ungrouped data. Recall from Chapter 2 that the data that give
information on each member of the population or sample individually are called ungrouped data,
whereas grouped data are presented in the form of a frequency distribution table.
Measures found by using all the data values in the population are called parameters.
Measures obtained by using the data values from samples are called statistics.
The Mean
The mean, also called the arithmetic mean, is the most frequently used measure of central
tendency. This book will use the words mean and average synonymously. For ungrouped data,
the mean is obtained by dividing the sum of all values by the number of values in the data set:
\[ \text{mean} = \frac{\text{sum of all values}}{\text{number of values}} \]
✓ Population mean: \( \mu = \frac{\sum x}{N} \)
✓ Sample mean: \( \bar{x} = \frac{\sum x}{n} \)
Example
The following are the prices (in dollars) charged by a sample of nine cars for transportation from
Hargeisa to Borama. What is the mean transportation cost?
5, 6, 7, 5, 6, 8, 4, 5, 8.
Solution
\[ \bar{x} = \frac{\sum x}{n} = \frac{5 + 6 + 7 + 5 + 6 + 8 + 4 + 5 + 8}{9} = \frac{54}{9} = 6 \]
Properties of the mean
1. The mean is found by using all the values of the data.
2. The mean is used in computing other statistics, such as the variance.
3. The mean is affected by extremely high or low values, called outliers, and may not be
the appropriate average to use in these situations.
The median:
The median is the value of the middle term in a data set that has been ranked in increasing order.
As is obvious from the definition of the median, it divides a ranked data set into two equal parts.
The calculation of the median consists of the following two steps:
1) Rank the data set in increasing order.
2) Find the middle term. The value of this term is the median.
Note that if the number of observations in a data set is odd, then the median is given by the value of
the middle term in the ranked data. However, if the number of observations is even, then the
median is given by the average of the values of the two middle terms.
Example
The following data give the prices (in thousands of dollars) of seven houses selected from all houses
sold last month in Hargeisa. 32, 25, 42, 80, 31, 60, 45.
Find the median.
Solution
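Ranked in increasing order, the data are: 25, 31, 32, 42, 45, 60, 80
There are seven values, so the middle term is the fourth one: Median = 42 (that is, $42,000).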
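The two-step procedure, including the rule for an even number of observations, can be sketched in Python. The function name `median_of` is an assumption for this illustration; the data are the seven house prices from the example.

```python
def median_of(values):
    """Rank the data, then return the middle term (or the average of the two middle terms)."""
    ranked = sorted(values)              # step 1: rank in increasing order
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                       # odd count: single middle term
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2   # even count: average of two middle terms

house_prices = [32, 25, 42, 80, 31, 60, 45]      # in thousands of dollars
print(median_of(house_prices))
```

If the most expensive house (80) were left out, six values would remain and the median would be the average of the two middle terms, 32 and 42, which is 37.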
Properties of the median
1) The median is used to find the center or middle value of a dataset.
2) The median is used when it is necessary to find out whether the data values fall into the
upper half or lower half of the distribution.
3) The median is affected less than the mean by extremely high or extremely low values.
The mode:
The mode is the value that occurs with the highest frequency in a data set.
A data set that has only one value that occurs with the greatest frequency is said to be unimodal. If
a data set has two values that occur with the same greatest frequency, both values are considered to
be the mode and the data set is said to be bimodal. If a data set has more than two values that
occur with the same greatest frequency, each value is used as the mode, and the data set is said to
be multimodal. When no data value occurs more than once, the data set is said to have no mode. A
data set can have more than one mode or no mode at all.
Example
The ages of 10 randomly selected students from a class are 21, 19, 27, 22, 29, 19, 25, 21, 22, and 30
years, respectively. Find the mode.
Solution
This data set has three modes: 19, 21, and 22. Each of these three values occurs with a (highest)
frequency of 2.
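The counting behind this answer can be sketched in Python. The function name `modes` is an assumption for this illustration; it follows the convention stated above that a data set in which no value occurs more than once has no mode.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency (empty list if no mode)."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                     # no value repeats: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

ages = [21, 19, 27, 22, 29, 19, 25, 21, 22, 30]
print(modes(ages))
```

Because the function only counts occurrences, it also works for qualitative data such as a list of car colors.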
Properties of the Mode
1. The mode is used when the most typical case is desired.
2. The mode is the easiest average to compute.
3. The mode can be used when the data are qualitative, such as gender, or political
affiliation.
4. The mode is not always unique.
Although the mean, the median, and the mode each describe a typical entry of a data set, there are
advantages and disadvantages of using each. The mean is a reliable measure because it takes into
account every entry of a data set. However, the mean can be greatly affected when the data set
contains outliers.
Definition
An outlier is a data entry that is far removed from the other entries in the data set.
Example
Find the mean, the median, and the mode of the sample ages of students in a class. Which measure of
central tendency best describes a typical entry of this data set? Are there any outliers?
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 23, 23, 23, 24, 24, 65
Solution
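Mean = 23.75 ≈ 23.8 years
Median = 21.5 years
Mode = 20 years
The entry 65 is an outlier. It pulls the mean above 17 of the 20 entries, so the median describes a
typical entry better than the mean does. The mode, 20, is the lowest entry, so it is not typical either.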
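The effect of the outlier can be checked with Python's standard `statistics` module. This is only a sketch; the helper name `summarize` is an assumption made for this illustration.

```python
import statistics

# Sample ages from the example above.
ages = [20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
        22, 22, 22, 23, 23, 23, 23, 24, 24, 65]

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))

with_outlier = summarize(ages)
without_outlier = summarize([a for a in ages if a != 65])

print("with 65:   ", with_outlier)
print("without 65:", without_outlier)
```

Removing the single entry 65 lowers the mean from 23.75 to about 21.6, while the median only moves from 21.5 to 21 and the mode stays at 20.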
Exercise
1. The following data give the numbers of car thefts that occurred in a city during the past 12
days. 6 3 7 11 4 3 8 7 2 6 9 15. Find
a. The mean,
b. The median, and
c. The mode
2. The following data set belongs to a sample: 14 18 -10 8 8 -16. Calculate
a. The mean, median, and mode.
3. The cholesterol levels of a sample of 10 female employees: 154 240 171 188 235 203 184
173 181 275.
a) Calculate the mean, median, and mode.
4. The following data give the 2015 profits (in thousands of dollars) of six companies (Horn
Africa newspaper, May 5, 2015). The data represent the following companies,
respectively: Telesom, Ominco, Somtel, Somcable, SBI, and fly Dubai: 270, 390, 150, 120,
210, 90.
a. Find the mean and median for these data.
5. The status of five students who are members of the student senate at a college are senior,
sophomore, senior, junior, and senior, respectively. Find the mode.
Weighted mean
Sometimes data sets contain entries that have a greater effect on the mean than do other entries. To
find the mean of such a data set, you must find the weighted mean.
Definition
A weighted mean is the mean of a data set whose entries have varying weights.
A weighted mean is given by
\[ \text{weighted mean} = \frac{\sum (x \cdot w)}{\sum w}, \]
where w is the weight of each entry x.
Example
You are taking a class in which your grade is determined from five sources: 50% from your test mean,
15% from your midterm, 20% from your final exam, 10% from your computer lab work, and 5% from
your homework. Your scores are 86 (test mean), 96 (midterm), 82 (final exam), 98 (computer lab),
and 100 (homework). What is the weighted mean of your scores? If the minimum average for an A is
90, did you get an A?
Solution
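Begin by organizing the scores and the weights in a table.
Source          Score, x   Weight, w   x • w
Test mean       86         0.50        43.0
Midterm         96         0.15        14.4
Final exam      82         0.20        16.4
Computer lab    98         0.10         9.8
Homework        100        0.05         5.0
Total                      1.00        88.6
\[ \text{weighted mean} = \frac{\sum (x \cdot w)}{\sum w} = \frac{88.6}{1} = 88.6 \]
Your weighted mean for the course is 88.6. Because 88.6 is less than 90, you did not get an A.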
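The weighted-mean calculation in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `weighted_mean` is an assumption, and the scores and weights are the ones stated in the example.

```python
def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# Test mean, midterm, final exam, computer lab, homework.
scores = [86, 96, 82, 98, 100]
weights = [0.50, 0.15, 0.20, 0.10, 0.05]

grade = weighted_mean(scores, weights)
print(round(grade, 1))  # 88.6
print(grade >= 90)      # False
```

Dividing by the sum of the weights means the weights do not have to add up to 1; whole-number percentages such as 50, 15, 20, 10, and 5 give the same result.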
Exercise
The scores and their percents of the final grade for a statistics student are given. What is the student’s
weighted mean score?
Source        Score   Percentage
Homework      85      5%
Quizzes       80      35%
Project       100     20%
Speech        90      15%
Final exam    93      25%
Mean for Grouped Data
We learned that the mean is obtained by dividing the sum of all values by the number of values
in a data set. However, if the data are given in the form of a frequency table, we no longer know
the values of individual observations. Consequently, in such cases, we cannot obtain the sum of
individual values. We find an approximation for the sum of these values using the procedure
explained in the next paragraph and example. The formula used to calculate the mean for
grouped data is:
\[ \text{mean} = \frac{\sum mf}{\sum f}, \]
where m is the midpoint and f is the frequency of each class.
Example
The following table gives the frequency distribution of the number of orders received each day during
the past 50 days at the office of a mail-order company
a. Calculate the mean
Solution
Because the data set includes only 50 days, it represents a sample. The value of \( \sum mf \)
is calculated in this table.
\[ \bar{x} = \frac{\sum mf}{\sum f} = \frac{832}{50} = 16.64 \text{ orders} \]
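The grouped-data formula can be sketched in Python as follows. The frequency table for this example is not reproduced in these notes, so the class limits and frequencies below are hypothetical values, chosen only because they are consistent with the totals in the solution (50 days and a midpoint-times-frequency sum of 832). The function name `grouped_mean` is also an assumption.

```python
def grouped_mean(classes):
    """classes: list of (lower_limit, upper_limit, frequency) tuples."""
    total_f = sum(f for _, _, f in classes)
    # Each class is represented by its midpoint m = (lower + upper) / 2.
    total_mf = sum((lo + hi) / 2 * f for lo, hi, f in classes)
    return total_mf / total_f


# Hypothetical table (an assumption): it reproduces sum(f) = 50, sum(mf) = 832.
orders = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]

print(grouped_mean(orders))  # 16.64
```

The result is an approximation of the true mean, because every observation in a class is treated as if it were equal to the class midpoint.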
Exercise
1) For 108 randomly selected college students, this exam score frequency distribution was
obtained.
Class limits    Frequency
90-98           6
99-107          22
108-116         43
117-125         28
126-134         9
Find the mean of the grouped data.
Measures of Dispersion for Ungrouped Data
The measures of central tendency, such as the mean, median, and mode, do not reveal the whole
picture of the distribution of a data set. Two data sets with the same mean may have completely
different spreads. The variation among the values of observations for one data set may be much
larger or smaller than for the other data set. (Note that the words dispersion, spread, and variation
have the same meaning.) Consider the following two data sets on the ages (in years) of all workers
working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the same, 40 years. If we do not know the
ages of individual workers at these two companies and are told only that the mean age of the
workers at both companies is the same, we may deduce that the workers at these two companies
have a similar age distribution. As we can observe, however, the variation in the
workers’ ages for each of these two companies is very different. The ages of the workers at the
second company (18 to 70 years) have a much larger variation than the ages of the workers at the
first company (35 to 47 years).
Thus, the mean, median, or mode by itself is usually not a sufficient measure to reveal the shape of
the distribution of a data set. We also need a measure that can provide some information about the
variation among data values. The measures that help us learn about the spread of a data set are
called the measures of dispersion. The measures of central tendency and dispersion taken together
give a better picture of a data set than the measures of central tendency alone.
This section discusses three measures of dispersion:
1) Range
2) Variance and
3) Standard deviation.
Range:
The range is the simplest measure of dispersion to calculate. It is obtained by taking the
difference between the largest and the smallest values in a data set.
Variance and Standard Deviation
The standard deviation is the most-used measure of dispersion. The value of the standard
deviation tells how closely the values of a data set are clustered around the mean. In general, a
lower value of the standard deviation for a data set indicates that the values of that data set are
spread over a relatively smaller range around the mean. In contrast, a larger value of the standard
deviation for a data set indicates that the values of that data set are spread over a relatively
larger range around the mean.
The population and sample variances are given by:
Population variance:
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
Sample variance:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
The standard deviation is derived from the variance in the following way.
Population standard deviation: \( \sigma = \sqrt{\sigma^2} \)
Sample standard deviation: \( s = \sqrt{s^2} \)
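To see these measures at work, the sketch below applies them to the two company age data sets given earlier in this section, using Python's standard `statistics` module. Because those data cover all workers at each company, the population formulas (`pvariance`, `pstdev`) are used; the sample formulas (`variance`, `stdev`) are shown once for comparison.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    data_range = max(ages) - min(ages)
    # Population formulas divide by N, the number of values.
    print(name, mean(ages), data_range,
          round(pvariance(ages), 2), round(pstdev(ages), 2))

# Sample formulas divide by n - 1, so they give slightly larger values.
print(variance(company_1), round(stdev(company_1), 2))
```

Both companies have a mean age of 40 years, but the range, variance, and standard deviation are all much larger for the second company, which is what the measures of dispersion are meant to reveal.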
Example
One corporation hired 10 graduates. The starting salaries for each graduate are shown.
41 38 39 45 47
41 44 41 37 42
a. Find the range of the starting salaries for the corporation.
b. Compute the variance and standard deviation of the data.
Solution
a. Range = 47 − 37 = 10
b. The sample mean is \( \bar{x} = 415/10 = 41.5 \), so
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(41 - 41.5)^2 + \cdots + (42 - 41.5)^2}{10 - 1} = 9.8 \]
c. \[ s = \sqrt{s^2} = \sqrt{9.8} = 3.1 \]
Exercise
1. Najax Construction Company hired 10 new university graduates. The starting salaries for each
graduate are shown.
400 230 410 500 490
320 410 290 520 580
a. Find the range of the starting salaries for the company.
b. Compute the variance and standard deviation of the data.
2. Find the range, mean, variance, and standard deviation of the population data set.
a. 9 5 9 10 11 12 7 7 8 12
b. 18 20 19 21 19 17 15 17 25 22 19 20 16 18
3. Find the range, mean, variance, and standard deviation of the sample data set.
a. 4 15 9 12 16 8 11 19 14
b. 28 25 21 15 7 14 9 27 21 24 14 17 16
4. The following data give the prices of seven textbooks randomly selected from a university
bookstore. $89 $170 $104 $113 $56 $161 $147
a. Find the mean for these data. Calculate the deviations of the data values from the mean.
Is the sum of these deviations zero?
b. Calculate the range, variance, and standard deviation.
5. The following data are the ages (in years) of six students. 19 19 19 19 19 19
a. Calculate the standard deviation. Is its value zero? If yes, why?
6. The hourly wages for a sample of part-time employees at U-fresh company are: $ 3, 5, 8, 6,
and 4.
a. What are the range, sample variance, and standard deviation?
7. The data represent the number of days off per year for a sample of individuals selected
from nine different countries: 20, 26, 40, 36, 23, 42, 35, 24, 30
a. Find the mean and median of these data.
8. The following data give the numbers of car thefts that occurred in a city in the past 12
days. 6 3 7 1 14 3 8 7 2 6 9 15
a. Calculate the range, variance, and standard deviation
Variance and Standard Deviation for Grouped Data
Following are what we will call the basic formulas used to calculate the population and sample
variances for grouped data:
Population variance:
\[ \sigma^2 = \frac{\sum m^2 f - \dfrac{\left(\sum mf\right)^2}{N}}{N} \]
Sample variance:
\[ s^2 = \frac{\sum m^2 f - \dfrac{\left(\sum mf\right)^2}{n}}{n - 1} \]
Example
The following data give the frequency distribution of the number of orders received each day during
the past 50 days at the office of a mail-order company.
a. Calculate the sample variance and standard deviation of the data.
Solution
With \( \sum mf = 832 \), \( \sum m^2 f = 14216 \), and \( n = 50 \),
\[ s^2 = \frac{\sum m^2 f - \dfrac{\left(\sum mf\right)^2}{n}}{n - 1} = \frac{14216 - \dfrac{(832)^2}{50}}{50 - 1} = 7.5820 \]
\[ s = \sqrt{s^2} = \sqrt{7.5820} = 2.75 \text{ orders} \]
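The sample formula for grouped data can be sketched in Python as shown below. The frequency table for this example is not reproduced in these notes, so the class limits and frequencies are hypothetical values, chosen only because they reproduce the totals used in the solution (n = 50, a midpoint-times-frequency sum of 832, and a squared-midpoint-times-frequency sum of 14216). The function name is also an assumption.

```python
from math import sqrt


def grouped_sample_variance(classes):
    """classes: list of (lower_limit, upper_limit, frequency) tuples."""
    n = sum(f for _, _, f in classes)
    sum_mf = sum((lo + hi) / 2 * f for lo, hi, f in classes)
    sum_m2f = sum(((lo + hi) / 2) ** 2 * f for lo, hi, f in classes)
    # Sample variance for grouped data: (sum m^2 f - (sum mf)^2 / n) / (n - 1).
    return (sum_m2f - sum_mf ** 2 / n) / (n - 1)


# Hypothetical table (an assumption) consistent with the totals in the solution.
orders = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]

s2 = grouped_sample_variance(orders)
print(round(s2, 4), round(sqrt(s2), 2))  # 7.582 2.75
```

For population data, the same sums would be used with N in place of n in both the numerator and the denominator.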
Exercise
1) These data represent the net worth (in millions of dollars) of 45 national
corporations.
Class limits    Frequency
10-20           2
21-31           8
32-42           15
43-53           7
54-64           10
65-75           3
a. Find the mean and the modal class of the data.
b. Compute the sample variance and standard deviation of the data.
2) For 108 randomly selected college students, this exam score frequency distribution was
obtained.
Class limits    Frequency
90-98           6
99-107          22
108-116         43
117-125         28
126-134         9
a. Find the sample variance and standard deviation of the grouped data.
Uses of Variance and Standard Deviations
1) As previously stated, variances and standard deviations can be used to determine the
spread of the data. If the variance or standard deviation is large, the data are more
dispersed. This information is useful in comparing two (or more) data sets to determine
which is more (most) variable.
2) The variance and standard deviation are used to determine the number of data values
that fall within a specified interval in a distribution. For example, Chebyshev’s theorem
(explained later) shows that, for any distribution, at least 75% of the data values will fall
within 2 standard deviations of the mean.
3) Finally, the variance and standard deviation are used quite often in inferential statistics.
Distribution Shapes
Frequency distributions can assume many shapes. The three most important shapes are positively
skewed, symmetric, and negatively skewed. Figure 3–1 shows histograms of each.
In a positively skewed or right-skewed distribution, the majority of the data values fall to the left
of the mean and cluster at the lower end of the distribution; the “tail” is to the right. Also, the mean
is to the right of the median, and the mode is to the left of the median. For example, if an instructor
gave an examination and most of the students did poorly, their scores would tend to cluster on the
left side of the distribution. A few high scores would constitute the tail of the distribution, which
would be on the right side. Another example of a positively skewed distribution is the incomes of
the population of the United States. Most of the incomes cluster about the low end of the
distribution; those with high incomes are in the minority and are in the tail at the right of the
distribution.
In a symmetric distribution, the data values are evenly distributed on both sides of the mean. In
addition, when the distribution is unimodal, the mean, median, and mode are the same and are at
the center of the distribution. Examples of symmetric distributions are IQ scores and heights of
adult males. When the majority of the data values fall to the right of the mean and cluster at the
upper end of the distribution, with the tail to the left, the distribution is said to be negatively
skewed or left-skewed. Also, the mean is to the left of the median, and the mode is to the right
of the median. As an example, a negatively skewed distribution results if the majority of
students score very high on an instructor’s examination. These scores will tend to cluster to the
right of the distribution.
Uses of Standard Deviation
By using the mean and standard deviation, we can find the proportion or percentage of the total observations
that fall within a given interval about the mean. This section briefly discusses Chebyshev’s theorem and the
empirical rule, both of which demonstrate this use of the standard deviation.
Chebyshev’s Theorem
As stated previously, the variance and standard deviation of a variable can be used to determine the
spread, or dispersion, of a variable. That is, the larger the variance or standard deviation, the more
the data values are dispersed. For example, if two variables measured in the same units have the
same mean, say, 70, and the first variable has a standard deviation of 1.5 while the second variable
has a standard deviation of 10, then the data for the second variable will be more spread out than
the data for the first variable.
Chebyshev’s theorem, developed by the Russian mathematician Chebyshev (1821–1894),
specifies the proportions of the spread in terms of the standard deviation.
Chebyshev’s theorem: The proportion of values from a data set that will fall within k standard
deviations of the mean will be at least \( 1 - 1/k^2 \), where k is a number greater than 1 (k is not
necessarily an integer).
Example
1. The 2015 gross sales of all companies in a large city have a mean of $2.3 million and a
standard deviation of $0.6 million. Using Chebyshev’s theorem, find at least what percentage
of companies in this city had 2015 gross sales of
a. $1.1 to $3.5 million b. $0.8 to $3.8 million
Solution
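Here \( \mu = 2.3 \) and \( \sigma = 0.6 \) (in millions of dollars).
a) \[ k = \frac{\text{distance}}{\text{standard deviation}} = \frac{x - \mu}{\sigma} = \frac{3.5 - 2.3}{0.6} = 2 \]
\[ \text{percentage} = \left(1 - \frac{1}{k^2}\right) \times 100\% = \left(1 - \frac{1}{2^2}\right) \times 100\% = 75\% \]
b) \[ k = \frac{\text{distance}}{\text{standard deviation}} = \frac{x - \mu}{\sigma} = \frac{3.8 - 2.3}{0.6} = 2.5 \]
\[ \text{percentage} = \left(1 - \frac{1}{k^2}\right) \times 100\% = \left(1 - \frac{1}{2.5^2}\right) \times 100\% = 84\% \]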
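The two steps of this solution (find k, then apply the bound) can be sketched in Python. The function name `chebyshev_min_percent` is an assumption, and the sketch only handles intervals that are symmetric about the mean, as in the example.

```python
def chebyshev_min_percent(lower, upper, mean, sd):
    """Minimum percentage of values in [lower, upper] by Chebyshev's theorem."""
    k = (upper - mean) / sd
    # The theorem is stated for an interval of k standard deviations
    # on each side of the mean, so both distances must agree.
    if abs((mean - lower) / sd - k) > 1e-9:
        raise ValueError("interval must be symmetric about the mean")
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k ** 2) * 100


print(round(chebyshev_min_percent(1.1, 3.5, 2.3, 0.6)))  # 75
print(round(chebyshev_min_percent(0.8, 3.8, 2.3, 0.6)))  # 84
```

The returned value is a lower bound that holds for any distribution; the actual percentage for a particular data set may be higher.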
Empirical Rule
Whereas Chebyshev’s theorem is applicable to any kind of distribution, the empirical rule
applies only to a specific type of distribution called a bell-shaped distribution (or a normal
curve).
When a distribution is bell-shaped (or what is called normal), the following statements, which
make up the empirical rule, are true.
1) Approximately 68% of the data values will fall within 1 standard deviation of the mean.
2) Approximately 95% of the data values will fall within 2 standard deviations of the mean.
3) Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.
Example
The age distribution of a sample of 5000 persons is bell shaped with a mean of 40 years and a standard
deviation of 12 years. Determine the approximate percentage of people who are 16 to 64 years old.
Solution
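Because 16 = 40 − 2(12) and 64 = 40 + 2(12), the interval from 16 to 64 years lies within 2 standard
deviations of the mean. By the empirical rule, approximately 95% of these persons are 16 to 64 years old.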
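A small Python sketch of this lookup is shown below. The function name and the dictionary are assumptions made for illustration; the sketch covers only intervals of exactly 1, 2, or 3 standard deviations on each side of the mean, which are the cases listed in the empirical rule.

```python
# Approximate percentages from the empirical rule, keyed by the number
# of standard deviations on each side of the mean.
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_percent(lower, upper, mean, sd):
    """Approximate percentage of a bell-shaped data set in [lower, upper]."""
    k_low = round((mean - lower) / sd, 9)
    k_high = round((upper - mean) / sd, 9)
    if k_low != k_high or k_high not in EMPIRICAL_RULE:
        raise ValueError("interval must be the mean plus or minus 1, 2, or 3 "
                         "standard deviations")
    return EMPIRICAL_RULE[k_high]


print(empirical_percent(16, 64, 40, 12))  # 95.0
```

For the sample of 5000 persons in the example, 95% corresponds to roughly 4750 people.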
Exercise
1) The mean time taken by all participants to run a road race was found to be 220 minutes with a
standard deviation of 20 minutes. Using Chebyshev’s theorem, find the percentage of runners
who ran this road race in
a. 180 to 260 minutes b. 160 to 280 minutes c. 170 to 270 minutes
2) The mean price of houses in a certain neighborhood is $50,000, and the standard
deviation is $10,000. Find the price range for which at least 75% of the houses will sell.
3) A survey of local companies found that the mean amount of travel allowance for
executives was $0.25 per mile. The standard deviation was $0.02. Using Chebyshev’s
theorem, find the minimum percentage of the data values that will fall between $0.20 and
$0.30.
4) The mean life of a certain brand of auto batteries is 44 months with a standard
deviation of 3 months. Assume that the lives of all auto batteries of this brand have a bell-
shaped distribution. Using the empirical rule, find the percentage of auto batteries of this
brand that have a life of
a. 41 to 47 months b. 38 to 50 months c. 35 to 53 months
5) The prices of all college textbooks follow a bell-shaped distribution with a mean of $105
and a standard deviation of $20.
a) Using the empirical rule, find the percentage of all college textbooks with their prices
between i. $85 and $125 ii. $65 and $145
b) Using the empirical rule, find the interval that contains the prices of 99.7% of college
textbooks.
6) The mean price of new homes from a sample of houses is $155,000 with a standard deviation
of $15,000. The data set has a bell-shaped distribution. Between what two prices do 95% of the
houses fall?
Measures of Position
A measure of position determines the position of a single value in relation to other values in a sample
or a population data set. There are many measures of position; however, only quartiles, percentiles,
and percentile rank are discussed in this section.
Quartiles and Interquartile Range
Quartiles are the summary measures that divide a ranked data set into four equal parts. Three measures
will divide any data set into four equal parts. These three measures are the first quartile (denoted by
Q1), the second quartile (denoted by Q2), and the third quartile (denoted by Q3). The data should be
ranked in increasing order before the quartiles are determined. The quartiles are defined as follows.
Definition
Quartiles are three summary measures that divide a ranked data set into four equal parts. The second
quartile is the same as the median of a data set. The first quartile is the value of the middle term
among the observations that are less than the median, and the third quartile is the value of the middle
term among the observations that are greater than the median.
Approximately 25% of the values in a ranked data set are less than Q1 and about 75% are greater than
Q1. The second quartile, Q2, divides a ranked data set into two equal parts; hence, the second quartile
and the median are the same. Approximately 75% of the data values are less than Q3 and about 25%
are greater than Q3.
The difference between the third quartile and the first quartile for a data set is called the
interquartile range (IQR).
Example
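The following data give the 2008 profits (rounded to billions of dollars) of 12 companies selected
from all over the world.
8 12 7 17 14 45 10 13 17 13 9 11
a. Find the values of the three quartiles.
b. Find the interquartile range.
Solution
Ranked data: 7 8 9 10 11 12 13 13 14 17 17 45
The median is the average of the 6th and 7th terms, (12 + 13)/2 = 12.5. The first quartile is the
median of the six values below it, (9 + 10)/2 = 9.5, and the third quartile is the median of the six
values above it, (14 + 17)/2 = 15.5.
\[ Q_1 = 9.5, \quad Q_2 = 12.5, \quad Q_3 = 15.5 \]
\[ IQR = Q_3 - Q_1 = 15.5 - 9.5 = 6 \]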
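The quartile definition used here (the median of the lower half and the median of the upper half) can be sketched in Python as follows. The function name `quartiles` is an assumption, and the halves are taken by position in the ranked data, leaving out the middle term when the number of values is odd. The data are the twelve profits as they appear in the ranked list of the solution.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves definition."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]          # terms below the median position
    upper = ranked[half + n % 2:]  # terms above the median position
    return median(lower), median(ranked), median(upper)


profits = [8, 12, 7, 17, 14, 45, 10, 13, 17, 13, 9, 11]
q1, q2, q3 = quartiles(profits)
print(q1, q2, q3, q3 - q1)  # 9.5 12.5 15.5 6.0
```

Statistical software often uses other interpolation conventions for quartiles, so library functions such as `statistics.quantiles` may return slightly different values for the same data.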
Exercise
1. The following are the ages (in years) of nine employees of an insurance company:
47 28 39 51 33 37 59 24 33
a. Find the values of the three quartiles. Where does the age of 28 years fall in relation to the
ages of these employees?
b. Find the interquartile range.
2. The numbers of nuclear power plants in the top 15 nuclear power-producing countries in the
world are listed. (Source: International Atomic Energy Agency)
7 18 11 6 59 17 18 54 104 20 31 8 10 15 19
a. Find the first, second, and third quartiles of the data set.
b. What can you conclude?
Percentiles
Percentiles are the summary measures that divide a ranked data set into 100 equal parts. Each (ranked)
data set has 99 percentiles that divide it into 100 equal parts. The data should be ranked in increasing
order to compute percentiles. The kth percentile is denoted by Pk, where k is an integer in the range 1
to 99. For instance, the 25th percentile is denoted by P25.
Calculating percentiles
The (approximate) value of the kth percentile, denoted by Pk, is
\[ P_k = \text{value of the } \left(\frac{kn}{100}\right)\text{th term in a ranked data set}, \]
where k denotes the number of the percentile and n represents the sample size.
Example
The following data give the 2008 profits (rounded to billions of dollars) of 12 companies selected
from all over the world.
8 12 7 17 14 45 10 13 17 13 9 11
a. Find the value of the 42nd percentile.
b. Find the value of the 75th percentile.
Solution
Ranked data: 7 8 9 10 11 12 13 13 14 17 17 45
a) \[ \frac{kn}{100} = \frac{(42)(12)}{100} = 5.04\text{th term}, \quad P_{42} = \$11 \text{ billion} \]
The 5.04th term is approximated by the 5th term of the ranked data.
b) \[ \frac{kn}{100} = \frac{(75)(12)}{100} = 9\text{th term}, \quad P_{75} = \$14 \text{ billion} \]
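The percentile formula can be sketched in Python as shown below. The notes do not state how a fractional position such as 5.04 should be handled in general, so rounding to the nearest whole term is an assumption made here (it agrees with the worked example). The function name is also illustrative, and the data are the twelve profits from the ranked list in the solution.

```python
def approx_percentile(values, k):
    """Approximate kth percentile: value of the (k * n / 100)th ranked term."""
    ranked = sorted(values)
    n = len(ranked)
    position = k * n / 100
    # Assumption: round the position to the nearest whole term and keep
    # it inside the valid range 1..n.
    term = min(n, max(1, int(position + 0.5)))
    return ranked[term - 1]


profits = [8, 12, 7, 17, 14, 45, 10, 13, 17, 13, 9, 11]
print(approx_percentile(profits, 42))  # 11
print(approx_percentile(profits, 75))  # 14
```

Other textbooks and software packages interpolate between neighboring terms instead, so their percentile values can differ from this approximation.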
Exercise
1. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25.
a. Compute the 20th, 25th, 65th, and 75th percentiles.
2. The following data give the speeds of 13 cars (in mph) measured by radar,
73 75 69 68 78 69 74 76 72 79 68 77 71
a. Find the values of the three quartiles and the interquartile range.
b. Calculate the (approximate) value of the 35th percentile.
c. Calculate the (approximate) value of the 65th percentile.
d. Calculate the (approximate) value of the 25th percentile.
Box-and-Whisker Plot
A box-and-whisker plot gives a graphic presentation of data using five measures: the median, the first
quartile, the third quartile, and the smallest and the largest values in the data.
Another important application of quartiles is to represent data sets using box-and-whisker plots. A
box-and-whisker plot (or boxplot) is an exploratory data analysis tool that highlights the important
features of a data set. To graph a box-and-whisker plot, you must know the following values
1. The minimum entry
2. The first quartile
3. The median
4. The third quartile
5. The maximum entry
These five numbers are called the five-number summary of the data set.
Drawing a Box-and-Whisker Plot
1. Find the five-number summary of the data set.
2. Construct a horizontal scale that spans the range of the data.
3. Plot the five numbers above the horizontal scale.
4. Draw a box above the horizontal scale from Q1 to Q3 and draw a vertical line in the box
at Q2.
5. Draw whiskers from the box to the minimum and maximum entries.
Example
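For the following data 7 18 11 6 59 17 18 54 104 20 31 8 10 15 19
a. Find the five-number summary.
b. Draw a box-and-whisker plot that represents the data set.
c. Determine whether the data set has an outlier or not.
Solution
Ranked data: 6 7 8 10 11 15 17 18 18 19 20 31 54 59 104
The five-number summary is: minimum = 6, Q1 = 10, median = 18, Q3 = 31, maximum = 104.
To determine the outlier:
• \( Q_1 = 10, \quad Q_3 = 31, \quad IQR = Q_3 - Q_1 = 21 \)
• \( 1.5 \times IQR = 31.5 \)
• \( Q_1 - 1.5 \times IQR = -21.5, \quad Q_3 + 1.5 \times IQR = 62.5 \)
Because 104 is greater than 62.5, 104 is an outlier.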
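The five-number summary and the outlier check from this example can be sketched in Python. The function names are assumptions; the quartiles are computed as medians of the lower and upper halves of the ranked data, and a value is flagged when it lies more than 1.5 times the IQR below Q1 or above Q3, as in the solution.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    q1 = median(ranked[:half])           # median of the lower half
    q3 = median(ranked[half + n % 2:])   # median of the upper half
    return ranked[0], q1, median(ranked), q3, ranked[-1]


def find_outliers(values):
    """Return (outliers, lower_limit, upper_limit) using 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    step = 1.5 * (q3 - q1)
    low, high = q1 - step, q3 + step
    return [v for v in values if v < low or v > high], low, high


plants = [7, 18, 11, 6, 59, 17, 18, 54, 104, 20, 31, 8, 10, 15, 19]
print(five_number_summary(plants))  # (6, 10, 18, 31, 104)
print(find_outliers(plants))        # ([104], -21.5, 62.5)
```

The five numbers returned by `five_number_summary` are exactly the values needed to draw the box and the whiskers by hand or with a plotting library.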
Exercise
1. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25.
a. Provide the five number summaries for the data.
b. Show the box plot for the data
c. Determine whether the data has an outlier or not
2. Show the five-number summary and the box plot for the following data: 5, 15, 18, 10, 8, 12,
16, 10, 6.
3. A data set has a first quartile of 42 and a third quartile of 50. Compute the lower and upper
limits for the corresponding box plot. Should a data value of 65 be considered an outlier?
4. (a) Find the five-number summary, and (b) draw a box-and-whisker plot of the data.
a. 39 36 30 27 26 24 28 35 39 60 50 41 35 32 51
b. 171 176 182 150 178 180 173 170 174 178 181 180
CHAPTER FOUR
DATA COLLECTION AND SAMPLING
Objectives
 Identify different types of data collection
 Differentiate sampling and non-sampling techniques
 Identify the basic sampling techniques.
 Differentiate primary and secondary data
 Explain the difference between simple random sampling and systematic sampling
Introduction
In Chapter 1, we briefly introduced the concept of statistical inference, the process of inferring
information about a population from a sample. Because information about populations can usually be
described by parameters, the statistical technique used generally deals with drawing inferences about
population parameters from sample statistics. (Recall that a parameter is a measurement about a
population, and a statistic is a measurement about a sample.)
Working within the covers of a statistics textbook, we can assume that population parameters are
known. In real life, however, calculating parameters is virtually impossible because populations tend
to be very large. As a result, most population parameters are not only unknown but also unknowable.
The problem that motivates the subject of statistical inference is that we often need information about
the value of parameters in order to make decisions. For example, to make decisions about whether to
expand a line of clothing, we may need to know the mean annual expenditure on clothing by North
American adults. Because the size of this population is approximately 200 million, determining the
mean is prohibitive. However, if we are willing to accept less than 100% accuracy, we can use
statistical inference to obtain an estimate. Rather than investigating the entire population, we select a
sample of people, determine the annual expenditures on clothing in this group, and calculate the
sample mean. Although the probability that the sample mean will equal the population mean is very
small, we would expect them to be close. For many decisions, we need to know how close. We will
discuss the basic concepts and techniques of sampling itself. But first we take a look at various sources
for collecting data.
Sources of data
The availability of accurate and appropriate data is essential for deriving reliable results. Data may be
obtained from internal or external sources (secondary data), or surveys and experiments (primary
data). Many times data come from internal sources, such as a company’s personnel files or accounting
records. For example, a company that wants to forecast the future sales of its product may use the data
of past periods from its records. For most studies, however, all the data that are needed are not usually
available from internal sources. In such cases, one may have to depend on outside sources to obtain
data. These sources are called external sources. For instance, the Statistical Abstract of the United
States (published annually), which contains various kinds of data on the United States, is an external
source of data.
A large number of government and private publications can be used as external sources of data. The
following is a list of some government publications.
1. Statistical Abstract of the United States
2. Employment and Earnings
3. Handbook of Labor Statistics
4. Source Book of Criminal Justice Statistics
5. Economic Report of the President
Secondary Data: Data collected by someone else for some other purpose (but being utilized by the
investigator for another purpose)
Advantages of Secondary Data
• Cost and time economies
• Help to better state the research problem.
• Suggest improved methods or data for better understanding of the problem.
• Provide comparative data by which primary data can be more insightfully interpreted
Disadvantages of Secondary Data
• Problems of fit: Since secondary data are collected for other purposes, it will be rare that
they fit the problem as defined perfectly
• Problems of Accuracy
Primary Data: Data collected by the investigator himself/ herself for a specific purpose.
Examples: Data collected by a student for his/her thesis or research project.
There are various types of primary data:
i. Demographic/socio-economic characteristics:
✓ Examples: age, education, occupation, marital status, gender, income.
ii. Psychological/life-style characteristics
✓Examples: personality traits, activities, interests, and values.
Differences Between Primary and Secondary Data
The fundamental differences between primary and secondary data are discussed in the following
points:
1. The term primary data refers to the data originated by the researcher for the first time.
Secondary data is the already existing data, collected earlier by investigating agencies and
organisations.
2. Primary data is real-time data, whereas secondary data relates to the past.
3. Primary data is collected for addressing the problem at hand while secondary data is collected
for purposes other than the problem at hand.
4. Primary data collection is a very involved process. On the other hand, secondary data
collection process is rapid and easy.
5. Primary data collection sources include surveys, observations, experiments, questionnaire,
personal interview, etc. On the contrary, secondary data collection sources are government
publications, websites, books, journal articles, internal records etc.
6. Primary data collection requires a large amount of resources like time, cost and manpower.
Conversely, secondary data is relatively inexpensive and quickly available.
7. Primary data is always specific to the researcher’s needs, and the researcher controls the quality
of the research. In contrast, secondary data is not specific to the researcher’s needs, nor does the
researcher have control over the data quality.
8. Primary data is available in the raw form whereas secondary data is the refined form of
primary data. It can also be said that secondary data is obtained when statistical methods are
applied to the primary data.
9. Data collected through primary sources are more reliable and accurate as compared to the
secondary sources.
Primary versus secondary data
Comparison          Primary data                       Secondary data
Meaning             Primary data refers to the         Secondary data means data
                    first-hand data gathered by        collected by someone else
                    the researcher himself             earlier
Data                Real-time data                     Past data
Source              Surveys, observations,             Government publications,
                    experiments, questionnaire,        websites, books, journal
                    personal interview, etc.           articles, internal records, etc.
Cost effectiveness  Expensive                          Economical
Collection time     Long                               Short
Specific            Always specific to the             May or may not be specific
                    researcher’s needs                 to the researcher’s needs
Available in        Crude form                         Refined form
Accuracy            More                               Relatively less
METHODS OF COLLECTING DATA
Most of this book addresses the problem of converting data into information. The question arises,
where do data come from? The answer is that a large number of methods produce data. Before we
proceed however, we’ll remind you of the definition of data introduced in Section 2.1. Data are the
observed values of a variable; that is, we define a variable or variables that are of interest to us and
then proceed to collect observations of those variables.
There are many methods used to collect or obtain data for statistical analysis. Three of the most
popular methods are:
1. Direct Observation
2. Experiments, and
3. Surveys.
Direct Observation
The simplest method of obtaining data is by direct observation. When data are gathered in this way,
they are said to be observational. For example, suppose that a researcher for a pharmaceutical
company wants to determine whether aspirin actually reduces the incidence of heart attacks.
Observational data may be gathered by selecting a sample of men and women and asking each
whether he or she has taken aspirin regularly over the past 2 years. Each person would be asked
whether he or she had suffered a heart attack over the same period.
There are many drawbacks to this method. One of the most critical is that it is difficult to produce
useful information in this way. For example, if the statistics practitioner concludes that people who
take aspirin suffer fewer heart attacks, can we conclude that aspirin is effective? It may be that people
who take aspirin tend to be more health conscious, and health-conscious people tend to have fewer
heart attacks. The one advantage to direct observation is that it is relatively inexpensive.
Experiments
A more expensive but better way to produce data is through experiments. Data produced in this
manner are called experimental. In the aspirin illustration, a statistics practitioner can randomly select
men and women. The sample would be divided into two groups. One group would take aspirin
regularly, and the other would not. After 2 years, the statistics practitioner would determine the
proportion of people in each group who had suffered heart attacks, and statistical methods again would
be used to determine whether aspirin works. If we find that the aspirin group suffered fewer heart
attacks, then we may more confidently conclude that taking aspirin regularly is a healthy decision.
Surveys
One of the most familiar methods of collecting data is the survey, which solicits information from
people concerning such things as their income, family size, and opinions on various issues. We’re all
familiar, for example, with opinion polls that accompany each political election. The Gallup Poll and
the Harris Survey are two well-known surveys of public opinion whose results are often reported by
the media. But the majority of surveys are conducted for private use. Private surveys are used
extensively by market researchers to determine the preferences and attitudes of consumers and voters.
The results can be used for a variety of purposes, from helping to determine the target market for an
advertising campaign to modifying a candidate’s platform in an election campaign. As an illustration,
consider a television network that has hired a market research firm to provide the network with a
profile of owners of luxury automobiles, including what they watch on television and at what times.
The network could then use this information to develop a package of recommended time slots for
Cadillac commercials, including costs, which it would present to General Motors. It is quite likely that
many students reading this book will one day be marketing executives who will “live and die” by such
market research data.
An important aspect of surveys is the response rate. The response rate is the proportion of all
selected people who complete the survey. As we discuss in the next section, a low response rate
can destroy the validity of any conclusion resulting from the statistical analysis. Statistics practitioners
need to ensure that data are reliable.
Surveys may be administered in a variety of ways, e.g.
✓ Personal Interview,
✓ Telephone Interview,
✓ Self-Administered Questionnaire.
Personal Interview Many researchers feel that the best way to survey people is by means of a
personal interview, which involves an interviewer soliciting information from a respondent by asking
prepared questions. A personal interview has the advantage of having a higher expected response rate
than other methods of data collection. In addition, there will probably be fewer incorrect responses
resulting from respondents misunderstanding some questions because the interviewer can clarify
misunderstandings when asked to. But the interviewer must also be careful not to say too much for
fear of biasing the response. To avoid introducing such biases, as well as to reap the potential benefits
of a personal interview, the interviewer must be well trained in proper interviewing techniques and
well informed on the purpose of the study. The main disadvantage of personal interviews is that they
are expensive, especially when travel is involved.
Telephone Interview A telephone interview is usually less expensive, but it is also less personal and
has a lower expected response rate. Unless the issue is of interest, many people will refuse to respond
to telephone surveys. This problem is exacerbated by telemarketers trying to sell something.
Self-Administered Survey A third popular method of data collection is the self-administered
questionnaire, which is usually mailed to a sample of people. This is an inexpensive method of
conducting a survey and is therefore attractive when the number of people to be surveyed is large. But
self-administered questionnaires usually have a low response rate and may have a relatively high
number of incorrect responses due to respondents misunderstanding some questions.
Questionnaire Design Whether a questionnaire is self-administered or completed by an interviewer, it
must be well designed. Proper questionnaire design takes knowledge, experience, time, and money.
Some basic points to consider regarding questionnaire design follow.
1. First and foremost, the questionnaire should be kept as short as possible to encourage
respondents to complete it. Most people are unwilling to spend much time filling out a
questionnaire.
2. The questions themselves should also be short, as well as simply and clearly worded, to enable
respondents to answer quickly, correctly, and without ambiguity. Even familiar terms such as
“unemployed” and “family” must be defined carefully because several interpretations are
possible.
3. Questionnaires often begin with simple demographic questions to help respondents get started
and become comfortable quickly.
4. Dichotomous questions (questions with only two possible responses, such as “yes” and “no”) and
multiple-choice questions are useful and popular because of their simplicity, but they also have
possible shortcomings. For example, a respondent’s choice of yes or no to a question may
depend on certain assumptions not stated in the question. In the case of a multiple-choice
question, a respondent may feel that none of the choices offered is suitable.
5. Open-ended questions provide an opportunity for respondents to express opinions more fully,
but they are time consuming and more difficult to tabulate and analyze.
6. Avoid using leading questions, such as “Wouldn’t you agree that the statistics exam was too
difficult?” These types of questions tend to lead the respondent to a particular answer.
7. Time permitting, it is useful to pretest a questionnaire on a small number of people in order to
uncover potential problems such as ambiguous wording.
8. Finally, when preparing the questions, think about how you intend to tabulate and analyze the
responses. First, determine whether you are soliciting values (i.e., responses) for an interval
variable or a nominal variable. Then consider which type of statistical techniques—descriptive
or inferential—you intend to apply to the data to be collected, and note the requirements of the
specific techniques to be used. Thinking about these questions will help ensure that the
questionnaire is designed to collect the data you need.
Whatever method is used to collect primary data, we need to know something about sampling, the
subject of the next section.
SAMPLING AND SAMPLING PLANS
The chief motive for examining a sample rather than a population is cost. Statistical inference permits
us to draw conclusions about a population parameter based on a sample that is quite small in
comparison to the size of the population. For example, television executives want to know the
proportion of television viewers who watch a network’s programs. Because 100 million people may
be watching television in the United States on a given evening, determining the actual proportion of
the population that is watching certain programs is impractical and prohibitively expensive. The
Nielsen ratings provide approximations of the desired information by observing what is watched by a
sample of 5,000 television viewers. The proportion of households watching a particular program can
be calculated for the households in the Nielsen sample. This sample proportion is then used as an
estimate of the proportion of all households (the population proportion) that watched the program.
Another illustration of sampling can be taken from the field of quality management. To ensure that a
production process is operating properly, the operations manager needs to know what proportion of
items being produced is defective. If the quality technician must destroy the item to determine whether
it is defective, then there is no alternative to sampling: A complete inspection of the product
population would destroy the entire output of the production process.
Why Sample?
As mentioned in the previous section, most of the time surveys are conducted by using samples and
not a census of the population. Three of the main reasons for conducting a sample survey instead of a
census are listed next.
Time
In most cases, the size of the population is quite large. Consequently, conducting a census takes a long
time, whereas a sample survey can be conducted very quickly. It is time-consuming to interview or
contact hundreds of thousands or even millions of members of a population. On the other hand, a
survey of a sample of a few hundred elements may be completed in little time. In fact, because of the
amount of time needed to conduct a census, by the time the census is completed, the results may be
obsolete.
Cost
The cost of collecting information from all members of a population may easily fall outside the limited
budget of most, if not all, surveys. Consequently, to stay within the available resources, conducting a
sample survey may be the best approach.
Impossibility of Conducting a Census
Sometimes it is impossible to conduct a census. First, it may not be possible to identify and access
each member of the population. For example, if a researcher wants to conduct a survey about
homeless people, it is not possible to locate each member of the population and include him or her in
the survey. Second, sometimes conducting a survey means destroying the items included in the
survey. For example, to estimate the mean life of light bulbs would necessitate burning out all the
bulbs included in the survey. The same is true about finding the average life of batteries. In such cases,
only a portion of the population can be selected for the survey.
Types of sampling
Two broad types of sampling techniques are widely used:
1. Probability Sampling
In probability sampling, the sample is selected by random methods. The techniques in this
category are simple random sampling, systematic sampling, stratified sampling, and cluster
sampling.
2. Nonprobability Sampling
In nonprobability sampling, selection is not based on random methods. Some examples are quota
sampling, purposive sampling, and convenience sampling.
Probability Sampling
The techniques in probability sampling are as follows:
1) Simple Random Sampling
A simple random sample is a sample selected in such a way that every possible sample with the same
number of observations is equally likely to be chosen. For example, if we need to select 5 students
from a class of 50, we write each of the 50 names on a separate piece of paper. Then, we place all 50
names in a hat and mix them thoroughly. Next, we draw 1 name randomly from the hat, without
returning it. We repeat this step four more times. The 5 drawn names make up a simple random
sample. A second procedure for selecting a simple random sample is to use a table of random
numbers, although this procedure has become outdated.
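In practice the hat is usually replaced by software. The following sketch uses Python's standard `random` module to draw 5 students from a class of 50; the student labels and the fixed seed are assumptions made only so that the example is self-contained and reproducible.

```python
import random

# Hypothetical class list of 50 students.
students = [f"student_{i}" for i in range(1, 51)]

# A fixed seed is used only to make the draw reproducible.
rng = random.Random(2024)

# sample() draws without replacement, so every group of 5 distinct
# students is equally likely to be chosen.
sample = rng.sample(students, k=5)
print(sample)
```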
2) Systematic Sampling
In systematic random sampling, we first randomly select one member from the first k units. Then
every kth member, starting with the first selected member, is included in the sample.
For example, suppose there were 2000 subjects in the population and a sample of 50 subjects was
needed. Since 2000/50 = 40, k = 40 and every 40th subject would be selected; however, the first
subject (numbered between 1 and 40) would be selected at random. Suppose subject 12 were the first
subject selected; then the sample would consist of the subjects whose numbers were 12, 52, 92, etc.,
until 50 subjects were obtained. When using systematic sampling, you must be careful about how the
subjects in the population are numbered.
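The example above can be reproduced with a short Python sketch. The function name `systematic_sample` and its parameters are hypothetical; subjects are assumed to be numbered 1 through the population size.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the subject numbers chosen by systematic sampling."""
    k = population_size // sample_size  # sampling interval
    if start is None:
        # Random starting point among the first k subjects.
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(sample_size)]

# Population of 2000, sample of 50, first selected subject is number 12.
sample = systematic_sample(2000, 50, start=12)
print(sample[:3], sample[-1])  # [12, 52, 92] 1972
```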
3) Stratified Sampling
In a stratified random sample, we first divide the population into sub-populations, which are called
strata. Then, one sample is selected from each of these strata. The collection of all samples from all
strata gives the stratified random sample.
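A minimal Python sketch of this procedure follows. The strata (three income groups), their sizes, the 10% sampling fraction, and the function name `stratified_sample` are all assumptions chosen for illustration; drawing the same fraction from each stratum is one common choice, not the only one.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the given fraction from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        n = max(1, round(len(members) * fraction))  # at least one per stratum
        sample[name] = rng.sample(members, n)
    return sample

# Hypothetical population divided into three strata by income level.
strata = {
    "low income": [f"L{i}" for i in range(300)],
    "medium income": [f"M{i}" for i in range(500)],
    "high income": [f"H{i}" for i in range(200)],
}
sample = stratified_sample(strata, fraction=0.10, seed=7)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)
```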
Examples of criteria for separating a population into strata (and of the strata themselves) follow.
SAMPLING AND NONSAMPLING ERRORS
Two major types of error can arise when a sample of observations is taken from a population:
sampling error and non-sampling error. Anyone reviewing the results of sample surveys and studies,
as well as statistics practitioners conducting surveys and applying statistical techniques, should
understand the sources of these errors.
Sampling Error
Sampling error refers to differences between the sample and the population that exist only because of
the observations that happened to be selected for the sample. Sampling error is an error that we expect
to occur when we make a statement about a population that is based only on the observations
contained in a sample taken from the population. To illustrate, suppose that we wish to determine the
mean annual income of North American blue-collar workers. To determine this parameter we would
have to ask each North American blue-collar worker what his or her income is and then calculate the
mean of all the responses. Because the size of this population is several million, the task is both
expensive and impractical. We can use statistical inference to estimate the mean income of the
population if we are willing to accept less than 100% accuracy. We record the incomes of a sample of
the workers and find the mean of this sample of incomes. This sample mean is an estimate of the
desired population mean. But the value of the sample mean will deviate from the population mean
simply by chance because the value of the sample mean depends on which incomes just happened to
be selected for the sample. The difference between the true (unknown) value of the population mean
and its estimate, the sample mean, is the sampling error. The size of this deviation may be large
simply because of bad luck—bad luck that a particularly unrepresentative sample happened to be
selected. The only way we can reduce the expected size of this error is to take a larger sample.
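The effect of sample size on sampling error can be illustrated with a small simulation. Everything in this sketch is an assumption made for the illustration: the population is a synthetic list of 20,000 incomes generated from a normal distribution, and the sample sizes (25 and 2500) and number of repetitions are arbitrary.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic population of incomes (illustrative values, not real data).
population = [rng.gauss(40000, 12000) for _ in range(20_000)]
population_mean = statistics.fmean(population)

def average_sampling_error(sample_size, repetitions=200):
    """Average absolute gap between the sample mean and the population mean."""
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - population_mean))
    return statistics.fmean(errors)

small = average_sampling_error(25)
large = average_sampling_error(2500)
print(f"n=25: {small:.0f}   n=2500: {large:.0f}")
```

Individual samples still differ by chance, but the typical size of the error is expected to be clearly smaller for the larger sample.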
Non-sampling Error
It is important to remember that a sampling error occurs because of chance. The errors that occur for
other reasons, such as errors made during collection, recording, and tabulation of data, are called non-
sampling errors. These errors occur because of human mistakes, and not chance. Note that there is
only one kind of sampling error—the error that occurs due to chance. However, there is not just one
non-sampling error, but there are many non-sampling errors that may occur for different reasons.
Non-sampling Errors The errors that occur in the collection, recording, and tabulation of data are
called non-sampling errors.
CHAPTER FIVE
PROBABILITY
Objectives
➢ Define probability, sample space, and event.
➢ Classify the three approaches to probability.
➢ Apply the addition and multiplication rules.
➢ Determine sample spaces and find the probability of an event, using classical probability or
empirical probability.
➢ Calculate conditional probability.
Introduction
We often make statements about probability. For example, a weather forecaster may predict that
there is an 80% chance of rain tomorrow. A health news reporter may state that a smoker has a
much greater chance of getting cancer than does a nonsmoker. A college student may ask an
instructor about the chances of passing a course or getting an A if he or she did not do well on the
midterm examination.
Probability, which measures the likelihood that an event will occur, is an important part of
statistics. It is the basis of inferential statistics, which will be introduced in later chapters. In
inferential statistics, we make decisions under conditions of uncertainty. Probability theory is used
to evaluate the uncertainty involved in those decisions. For example, estimating next year’s sales
for a company is based on many assumptions, some of which may happen to be true and others
may not. Probability theory will help us make decisions under such conditions of imperfect
information and uncertainty.
Combining probability and probability distributions with descriptive statistics will help us make
decisions about populations based on information obtained from samples. This chapter presents
the basic concepts of probability and the rules for computing probability.
Sample Spaces and Probability
The theory of probability grew out of the study of various games of chance using coins, dice, and
cards. Since these devices lend themselves well to the application of concepts of probability, they
will be used in this chapter as examples. This section begins by explaining some basic concepts of
probability. Then the types of probability and probability rules are discussed.
Experiment, Outcomes, and Sample Space
Quality control inspector Jack Cook of Tennis Products Company picks up a tennis ball from the
production line to check whether it is good or defective. Cook’s act of inspecting a tennis ball is an
example of a statistical experiment. The result of his inspection will be that the ball is either “good” or
“defective.” Each of these two observations is called an outcome (also called a basic or final outcome)
of the experiment, and these outcomes taken together constitute the sample space for this experiment.
Definition
A probability experiment is a chance process that leads to well-defined results called outcomes. An
outcome is the result of a single trial of a probability experiment.
A sample space is the set of all possible outcomes of a probability experiment.
Some sample spaces for various probability experiments are shown here.
Experiment                      Sample space
Toss one coin                   Head, tail
Roll a die                      1, 2, 3, 4, 5, 6
Answer a true/false question    True, false
Toss two coins                  HH, HT, TH, TT
Take a test                     Pass, fail
Select a worker                 Female, male
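For experiments made up of several steps, the sample space can be listed mechanically. The sketch below uses `itertools.product` from the Python standard library to enumerate two of the sample spaces mentioned in this chapter; the variable names are illustrative.

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Tossing two coins: every ordered pair of coin faces.
two_coins = ["".join(outcome) for outcome in product(coin, repeat=2)]
print(two_coins)  # ['HH', 'HT', 'TH', 'TT']

# Tossing a coin and then rolling a die: 2 x 6 = 12 outcomes.
coin_then_die = list(product(coin, die))
print(len(coin_then_die))  # 12
```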
Simple and Compound Events
An event consists of one or more of the outcomes of an experiment.
Definition
Event An event is a collection of one or more of the outcomes of an experiment.
Simple Event An event that includes one and only one of the (final) outcomes for an experiment
is called a simple event and is usually denoted by $E_i$.
Compound Event A compound event is a collection of more than one outcome for an
experiment.
There are three basic interpretations of probability:
1) Classical probability
2) Empirical or relative frequency probability
3) Subjective probability
Classical Probability
Classical probability assumes that all outcomes in the sample space are equally likely to occur.
For example, when a single die is rolled, each outcome has the same probability of occurring.
Since there are six outcomes, each outcome has a probability of 1/6 of occurring.
This probability is called classical probability, and it uses the sample space S.
EXAMPLE:
1) Find the probability of obtaining a head and the probability of obtaining a tail for one
toss of a coin.
Solution
$S = \{H, T\}$; $P(H) = P(T) = \frac{1}{2}$
2) Find the probability of obtaining an even number in one roll of a die.
Solution
$S = \{1, 2, 3, 4, 5, 6\}$
$E = \{2, 4, 6\}$, $P(E) = \frac{3}{6} = \frac{1}{2}$
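Under the equally-likely assumption, the two solutions above amount to counting the outcomes in the event and dividing by the number of outcomes in the sample space. A minimal Python sketch follows; the function name `classical_probability` is hypothetical, and `Fraction` is used so that results stay exact.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(E) = (outcomes in E) / (outcomes in S), assuming equally likely outcomes."""
    sample_space = set(sample_space)
    event = set(event) & sample_space  # ignore anything outside S
    return Fraction(len(event), len(sample_space))

p_tail = classical_probability({"T"}, {"H", "T"})
p_even = classical_probability({2, 4, 6}, {1, 2, 3, 4, 5, 6})
print(p_tail, p_even)  # 1/2 1/2
```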
THE PROBABILITY RULES:
1) The probability of an event always lies in the range 0 to 1.
Definition
Equally Likely Outcomes Two or more outcomes (or events) that have the same probability of
occurrence are said to be equally likely outcomes (or events).
2) The sum of the probabilities of all simple events (or final outcomes) for
an experiment, denoted by $\sum P(E_i)$, is always 1.
Complementary Events
Another important concept in probability theory is that of complementary events. When a die is
rolled, for instance, the sample space consists of the outcomes 1, 2, 3, 4, 5, and 6. The event E of
getting odd numbers consists of the outcomes 1, 3, and 5. The event of not getting an odd
number is called the complement of event E, and it consists of the outcomes 2, 4, and 6.
The complement of an event E is the set of outcomes in the sample space that are not included
in the outcomes of event E. The complement of E is denoted by $E'$.
Example
Finding Complements
1) Find the complement of each event.
a) Rolling a die and getting a 4
b) Selecting a month and getting a month that begins with a J
Solution
a. $E' = \{1, 2, 3, 5, 6\}$
b. $E' = \{\text{months that do not begin with J}\}$
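Because an event and its complement together make up the whole sample space, their probabilities add to 1. The sketch below checks this for the two complements just found, assuming equally likely outcomes (for part b, that each of the 12 months is equally likely to be selected, which is an assumption of this illustration).

```python
from fractions import Fraction

# a) Rolling a die and getting a 4.
sample_space = {1, 2, 3, 4, 5, 6}
event = {4}
complement = sample_space - event
p_event = Fraction(len(event), len(sample_space))
p_complement = 1 - p_event
print(sorted(complement), p_complement)  # [1, 2, 3, 5, 6] 5/6

# b) Selecting a month that begins with J.
months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
not_j = [m for m in months if not m.startswith("J")]
p_not_j = Fraction(len(not_j), len(months))
print(len(not_j), p_not_j)  # 9 3/4
```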
Empirical Probability
The difference between classical and empirical probability is that classical probability assumes
that certain outcomes are equally likely (such as the outcomes when a die is rolled), while
empirical probability relies on actual experience to determine the likelihood of outcomes. In
empirical probability, one might actually roll a given die 6000 times, observe the various
frequencies, and use these frequencies to determine the probability of an outcome. Suppose, for
example, that a researcher for the American Automobile Association (AAA) asked 50 people who
plan to travel over the Thanksgiving holiday how they will get to their destination.
The results can be categorized in a frequency distribution as shown.
Method          Frequency
Drive           41
Fly             6
Train or bus    3
Total           50
Now probabilities can be computed for various categories. For example, the probability of
selecting a person who is driving is 0.82, since 41 out of the 50 people said that they were
driving.
Formula for Empirical probability
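Given a frequency distribution, the probability of an event being in a given class is:
$P(E) = \frac{f}{n}$
where f is the frequency of the class and n is the total of all frequencies in the distribution.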
This probability is called empirical probability and is based on observation.
Example
1) In the travel survey just described, find the probability that a person will travel by
airplane over the Thanksgiving holiday.
Solution
$P(\text{fly}) = \frac{6}{50} = \frac{3}{25} = 0.12$
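The same calculation can be applied to every class of the travel survey at once. This sketch simply divides each frequency by the total; the dictionary layout and variable names are choices made for the illustration.

```python
from fractions import Fraction

# Frequencies from the travel survey of 50 people.
frequencies = {"Drive": 41, "Fly": 6, "Train or bus": 3}
n = sum(frequencies.values())

# Empirical probability of each class: f / n.
empirical = {method: Fraction(f, n) for method, f in frequencies.items()}

for method, p in empirical.items():
    print(f"{method}: {p} = {float(p):.2f}")
```

The printed values are 0.82 for driving, 0.12 for flying, and 0.06 for train or bus, and they sum to 1.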
Subjective Probability
The third type of probability is called subjective probability. Subjective probability uses a
probability value based on an educated guess or estimate, employing opinions and inexact
information. In subjective probability, a person or group makes an educated guess at the chance
that an event will occur. This guess is based on the person’s experience and evaluation of a
situation.
EXERCISE:
1) If a die is rolled one time, find these probabilities.
a) Of getting a 4
b) Of getting an even number
c) Of getting a number greater than 4
d) Of getting a number less than 7
e) Of getting a number greater than 0
f) Of getting a number greater than 3 or an odd number
g) Of getting a number greater than 3 and an odd number
2) A hat contains 40 marbles. Of them, 18 are red and 22 are green. If one marble is randomly
selected out of this hat, what is the probability that this marble is a. red? b. green?
3) A die is rolled once. What is the probability that
a) A number less than 5 is obtained?
b) A number 3 to 6 is obtained?
4) In a statistics class of 42 students, 28 have volunteered for community service in the past.
Find the probability that a randomly selected student from this class has volunteered for
community service in the past
5) There are 1265 eligible voters in a town, and 972 of them are registered to vote. If one
eligible voter is selected at random, what is the probability that this voter is
a. Registered
b. Not registered?
6) In a large city, 15,000 workers lost their jobs last year. Of them, 7400 lost their jobs
because their companies closed down or moved, 4600 lost their jobs due to insufficient
work, and the remainder lost their jobs because their positions were abolished. If one of
these 15,000 workers is selected at random, find the probability that this worker lost his or
her job
a) because the company closed down or moved
b) due to insufficient work
c) because the position was abolished
7) If the probability that a person lives in an industrialized country of the world is 0.20,
find the probability that a person does not live in an industrialized country.
The Addition Rules for Probability
Definition
Mutually Exclusive Events: Events that cannot occur together are said to be mutually exclusive
events.
Example
1) Determine which events are mutually exclusive and which are not, when a single die is
rolled.
a) Getting an odd number and getting an even number
b) Getting a 3 and getting an odd number
c) Getting an odd number and getting a number less than 4
Solution
a. Mutually exclusive events
b. Not mutually exclusive events
c. Not mutually exclusive events
➢ The probability of two or more events can be determined by the addition rules. The first
addition rule is used when the events are mutually exclusive.
Example
1) A box contains 3 glazed doughnuts, 4 jelly doughnuts, and 5 chocolate doughnuts. If
a person selects a doughnut at random,
a) Find the probability that it is either a glazed doughnut or a chocolate doughnut.
b) Find the probability that it is either a jelly doughnut or a chocolate doughnut.
Solution
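a. $P(\text{glazed or chocolate}) = P(\text{Gl}) + P(\text{Ch}) = \frac{3}{12} + \frac{5}{12} = \frac{8}{12} = \frac{2}{3}$
b. $P(\text{jelly or chocolate}) = P(\text{Je}) + P(\text{Ch}) = \frac{4}{12} + \frac{5}{12} = \frac{9}{12} = \frac{3}{4}$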
Addition rule one:
When two events A and B are mutually exclusive, the probability that A or B will occur is
P(A or B) = P(A) + P(B)
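Addition rule one can be checked against the doughnut example. The sketch below is illustrative only; the dictionary and the helper `p` are names introduced here. A single doughnut cannot be of two kinds at once, so the events are mutually exclusive and the probabilities are simply added.

```python
from fractions import Fraction

counts = {"glazed": 3, "jelly": 4, "chocolate": 5}
total = sum(counts.values())  # 12 doughnuts in the box

def p(kind):
    """Probability of selecting one doughnut of the given kind."""
    return Fraction(counts[kind], total)

p_glazed_or_chocolate = p("glazed") + p("chocolate")
p_jelly_or_chocolate = p("jelly") + p("chocolate")
print(p_glazed_or_chocolate, p_jelly_or_chocolate)  # 2/3 3/4
```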
➢ When two events are not mutually exclusive, we must subtract one of the two
probabilities of the outcomes that are common to both events, since they have been
counted twice.
Example
1) In a statistics class there are 18 juniors and 10 seniors; 6 of the seniors are females,
and 12 of the juniors are males. If a student is selected at random, find the probability
of selecting the following.
a) A junior or a female
b) A senior or a female
Solution
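Of the 18 juniors, 12 are males, so 6 are females; with the 6 female seniors, the class has 12 females in all.
a. $P(\text{junior or female}) = P(\text{jun}) + P(\text{fem}) - P(\text{jun and fem}) = \frac{18}{28} + \frac{12}{28} - \frac{6}{28} = \frac{24}{28} = \frac{6}{7}$
b. $P(\text{senior or female}) = P(\text{sen}) + P(\text{fem}) - P(\text{sen and fem}) = \frac{10}{28} + \frac{12}{28} - \frac{6}{28} = \frac{16}{28} = \frac{4}{7}$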
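The subtraction of the overlap can be verified by listing the class and counting directly. In this sketch the class of 28 students is rebuilt from the counts in the example (12 male juniors, 6 female juniors, 6 female seniors, 4 male seniors); the tuple layout and the helper `prob` are assumptions of the illustration.

```python
from fractions import Fraction

# One (level, sex) tuple per student, built from the counts in the example.
students = ([("junior", "male")] * 12 + [("junior", "female")] * 6 +
            [("senior", "female")] * 6 + [("senior", "male")] * 4)

def prob(predicate):
    """Probability that a randomly selected student satisfies the predicate."""
    return Fraction(sum(1 for s in students if predicate(s)), len(students))

p_junior = prob(lambda s: s[0] == "junior")
p_female = prob(lambda s: s[1] == "female")
p_both = prob(lambda s: s == ("junior", "female"))

# Addition rule for events that are not mutually exclusive.
by_rule = p_junior + p_female - p_both
# Direct count of students who are juniors, females, or both.
by_count = prob(lambda s: s[0] == "junior" or s[1] == "female")
print(by_rule, by_count)  # 6/7 6/7
```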
Exercise
1) In a statistics class there are 28 juniors and 15 seniors; 10 of the seniors are males,
and 12 of the juniors are females. If a student is selected at random, find the
probability of selecting the following.
a) A junior or a female
b) A senior or a male
c) A male or a female
d) A male or a junior
2) A university president proposed that all students must take a course in ethics as a
requirement for graduation. Three hundred faculty members and students from this
university were asked about their opinions on this issue. Table 4.9 gives a two-way
classification of the responses of these faculty members and students.
Addition rule two
If A and B are not mutually exclusive, then P(A or B) = P(A) + P(B) - P(A and B)
Group      Favor   Oppose   Neutral
Faculty    45      15       10
Student    90      110      30
Find the probability that one person selected at random from these 300 persons is:
a) A faculty member or in favor of this proposal.
b) A student or opposed to this proposal.
c) A faculty member or neutral about this proposal.
The Multiplication Rules and Conditional Probability
The Multiplication Rules
The multiplication rules can be used to find the probability of two or more events that occur in
sequence. For example, if you toss a coin and then roll a die, you can find the probability of
getting a head on the coin and a 4 on the die. These two events are said to be independent since
the outcome of the first event (tossing a coin) does not affect the probability of the outcome of the
second event (rolling a die).
Example
1) A coin is flipped and a die is rolled. Find the probability of getting a head on the coin and
a 4 on the die.
Solution
1. $P(\text{head and 4}) = P(\text{head}) \cdot P(4) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}$
Two events A and B are independent events if the fact that A occurs does not affect the
probability of B occurring.
When two events are independent, the probability of both occurring is P(A and B) = P(A) · P(B)
When the outcome or occurrence of the first event affects the outcome or occurrence of the
second event in such a way that the probability is changed, the events are said to be dependent
events.
When two events are dependent, the probability of both occurring is P(A and B) = P(A) · P(B/A)
2) An urn contains 3 red balls, 2 blue balls, and 5 white balls. A ball is selected and its color
noted. Then it is replaced. A second ball is selected and its color noted. Find the probability
of each of these.
a) Selecting 2 blue balls (with replacement)
b) Selecting 1 blue ball and then 1 white ball (with replacement)
c) Selecting 1 red ball and then 1 blue ball (without replacement)
Solution
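a. $P(B_1 \text{ and } B_2) = P(B_1) \cdot P(B_2) = \frac{2}{10} \cdot \frac{2}{10} = \frac{4}{100} = \frac{1}{25}$
b. $P(B \text{ and } W) = P(B) \cdot P(W) = \frac{2}{10} \cdot \frac{5}{10} = \frac{10}{100} = \frac{1}{10}$
c. $P(R \text{ and } B) = P(R) \cdot P(B/R) = \frac{3}{10} \cdot \frac{2}{9} = \frac{6}{90} = \frac{1}{15}$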
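The difference between drawing with and without replacement can be made explicit in code. In this sketch the function name `p_sequence` and its `replace` flag are hypothetical; when the first ball is not returned, the urn used for the second draw has one ball fewer, which is what makes the events dependent.

```python
from fractions import Fraction

urn = {"red": 3, "blue": 2, "white": 5}

def p_sequence(first, second, replace):
    """Probability of drawing `first` and then `second` from the urn."""
    p_first = Fraction(urn[first], sum(urn.values()))
    remaining = dict(urn)
    if not replace:
        remaining[first] -= 1  # the first ball is not returned
    p_second = Fraction(remaining[second], sum(remaining.values()))
    return p_first * p_second

print(p_sequence("blue", "blue", replace=True))    # 1/25
print(p_sequence("blue", "white", replace=True))   # 1/10
print(p_sequence("red", "blue", replace=False))    # 1/15
```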
Exercise
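1) A university president proposed that all students must take a course in ethics as a
requirement for graduation. Three hundred faculty members and students from this
university were asked about their opinions on this issue. The following table gives a two-way
classification of the responses of these faculty members and students.
Group      Favor   Oppose   Neutral
Faculty    45      15       10
Student    90      110      30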
Find the probability that one person selected at random from these 300 persons is :
a. A faculty member and in favor of this proposal.
b. A student and opposed to this proposal.
c. A faculty member and neutral about this proposal.
Marginal and Conditional Probabilities
Definition
Marginal Probability Marginal probability is the probability of a single event without
consideration of any other event. Marginal probability is also called simple probability.
Conditional Probability Conditional probability is the probability that an event will occur given
that another event has already occurred. If A and B are two events, then the conditional
probability of A given B is written as P(A/B) and read as “the probability of A given that B has
already occurred.”
$P(A/B) = \frac{P(A \text{ and } B)}{P(B)}$, provided that $P(B) \neq 0$.
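As an illustration of the definition, the sketch below computes two conditional probabilities for the statistics class used earlier in this chapter (18 juniors and 10 seniors, with 12 male juniors and 6 female seniors). The dictionary layout and the helpers `p` and `conditional` are assumptions of the example. Note that the conditioning event supplies the denominator.

```python
from fractions import Fraction

# Counts by (level, sex) for the class of 28 students.
counts = {("junior", "male"): 12, ("junior", "female"): 6,
          ("senior", "male"): 4, ("senior", "female"): 6}
total = sum(counts.values())

def p(level=None, sex=None):
    """Marginal or joint probability; None means 'any value'."""
    matching = sum(c for (lv, sx), c in counts.items()
                   if level in (None, lv) and sex in (None, sx))
    return Fraction(matching, total)

def conditional(p_a_and_b, p_b):
    """P(A given B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b

p_female_given_senior = conditional(p("senior", "female"), p(level="senior"))
p_senior_given_female = conditional(p("senior", "female"), p(sex="female"))
print(p_female_given_senior, p_senior_given_female)  # 3/5 1/2
```

The two results differ, which shows that the probability of A given B is in general not the same as the probability of B given A.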
Exercise
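1. A university president proposed that all students must take a course in ethics as a
requirement for graduation. Three hundred faculty members and students from this
university were asked about their opinions on this issue. The following table gives a two-way
classification of the responses of these faculty members and students.
Group      Favor   Oppose   Neutral
Faculty    45      15       10
Student    90      110      30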
Find the probability that one person selected at random from these 300 persons is :
a) A faculty member
b) A student
c) Neutral about this proposal
d) In favor of this proposal
e) Opposed to this proposal
f) P(faculty/favor)
g) P(faculty/oppose)
h) P(student/favor)
i) P(student/neutral)
2) A recent survey asked 100 people if they thought women in the armed forces should be
permitted to participate in combat. The results of the survey are shown.
Gender    Yes    No
Male      32     18
Female    8      42
Find these probabilities.
a) The respondent answered yes, given that the respondent was a female.
b) The respondent was a male, given that the respondent answered no.
3) Suppose that we have two events, A and B, with P(A) =0.50, P(B) =0.60, and P(A and
B) =0.40.
a. Find P (A/B).
b. Find P (B/ A).
CHAPTER SIX
DISCRETE PROBABILITY DISTRIBUTION
Objectives
➢ Construct a probability distribution for a random variable.
➢ Find the mean, variance, standard deviation, and expected value for a discrete
random variable.
➢ Find the exact probability for X successes in n trials of a binomial experiment.
➢ Find the mean, variance, and standard deviation for the variable of a binomial
distribution.
Introduction
Many decisions in business, insurance, and other real-life situations are made by assigning
probabilities to all possible outcomes pertaining to the situation and then evaluating the results.
For example, a saleswoman can compute the probability that she will make 0, 1, 2, or 3 or more
sales in a single day. An insurance company might be able to assign probabilities to the number
of vehicles a family owns. A self-employed speaker might be able to compute the probabilities
for giving 0, 1, 2, 3, or 4 or more speeches each week. Once these probabilities are assigned,
statistics such as the mean, variance, and standard deviation can be computed for these events.
With these statistics, various decisions can be made. The saleswoman will be able to compute the
average number of sales she makes per week, and if she is working on commission, she will be
able to approximate her weekly income over a period of time, say, monthly. The public speaker
will be able to plan ahead and approximate his average income and expenses. The insurance
company can use its information to design special computer forms and programs to
accommodate its customers’ future needs.
This chapter explains the concepts and applications of what is called a probability
distribution. In addition, special probability distributions, such as the binomial and Poisson
distributions, are explained.
Probability Distributions
Before probability distribution is defined formally, the definition of a variable is reviewed.
Example:
The procedure shown here for constructing a probability distribution for a discrete random
variable uses the probability experiment of tossing three coins.
Number of heads x     0      1      2      3
Probability p(x)      1/8    3/8    3/8    1/8
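The table can be reproduced by listing the eight equally likely outcomes of tossing three coins and counting the heads in each. The following Python sketch does this with the standard library; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2 x 2 x 2 = 8 equally likely outcomes, e.g. ('H', 'T', 'H').
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give 0, 1, 2, or 3 heads.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Divide each count by the number of outcomes to get p(x).
distribution = {x: Fraction(head_counts[x], len(outcomes))
                for x in sorted(head_counts)}
for x, p in distribution.items():
    print(x, p)
```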
Two requirements for probability distributions
1) The sum of the probabilities of all the events in the sample space must equal 1; that is,
$\sum P(x) = 1$.
2) The probability of each event in the sample space must be between 0 and 1, inclusive; that
is, $0 \le P(x) \le 1$.
EXAMPLE:
1) Determine whether each distribution is a probability distribution.
a. X     2     3     7
   P(x)  0.5   0.3   0.4
b. X     0     5     10    15
   P(x)  0.2   0.2   0.2   0.2
c. X     0     2     4     6
   P(x)  -1.0  1.5   0.3   0.2
d. X     1     2      3       4
   P(x)  0.25  0.125  0.0625  0.5625
Solution
a. Does not represent a probability distribution: the probabilities sum to 1.2, not 1.
b. Does not represent a probability distribution: the probabilities sum to 0.8, not 1.
c. Does not represent a probability distribution: the probabilities -1.0 and 1.5 lie
outside the range from 0 to 1.
d. Yes, it is a probability distribution: each probability is between 0 and 1, and the
probabilities sum to 1.
A random variable is a variable whose values are determined by chance.
A discrete probability distribution consists of the values a random variable can assume and the
corresponding probabilities of the values. The probabilities are determined theoretically or by
observation.
Mean, Variance, Standard Deviation, and Expectation
The mean, variance, and standard deviation for a probability distribution are computed
differently from the mean, variance, and standard deviation for samples. This section explains
how these measures—as well as a new measure called the expectation—are calculated for
probability distributions.
a) Mean (expectation): $\mu = E(x) = \sum x \cdot p(x)$
b) Variance: $\sigma^2 = \sum x^2 \, p(x) - \mu^2$
Example
The following table gives the number of televisions per household in a small town. Find the mean,
variance, and standard deviation of the probability distribution.
x 0 1 2 3
P(x) 0.01 0.17 0.28 0.54
Solution
x          0      1      2      3      Total
P(x)       0.01   0.17   0.28   0.54
x p(x)     0      0.17   0.56   1.62   2.35
x^2 p(x)   0      0.17   1.12   4.86   6.15
a) $\mu = E(x) = \sum x \cdot p(x) = 2.35$
b) $\sigma^2 = \sum x^2 \, p(x) - \mu^2 = 6.15 - (2.35)^2 = 0.6275$
c) $\sigma = \sqrt{\sum x^2 \, p(x) - \mu^2} = \sqrt{0.6275} \approx 0.792$
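The same calculation can be reproduced in a few lines of Python. This is only a sketch that mirrors the table above; the variable names are illustrative.

```python
import math

# Number of televisions per household and the corresponding probabilities.
x_values = [0, 1, 2, 3]
probabilities = [0.01, 0.17, 0.28, 0.54]

# Mean: sum of x * p(x).
mean = sum(x * p for x, p in zip(x_values, probabilities))

# Variance: sum of x^2 * p(x), minus the square of the mean.
variance = sum(x**2 * p for x, p in zip(x_values, probabilities)) - mean**2

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 4), round(std_dev, 3))
```

Rounded as shown, the three values agree with the hand calculation: 2.35, 0.6275, and 0.792.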

Exercise
1) Find the mean and the standard deviation of the number of spots that appear when a
die is tossed.
Outcome X 1 2 3 4 5 6
P(x) 1/6 1/6 1/6 1/6 1/6 1/6
For each of the frequency distributions in exercises 2 and 3:
a) Use the frequency distribution to construct a probability distribution.
b) Find the mean, variance, and standard deviation of the probability distribution.
c) Interpret the results in the context of the real-life situation.
2) The number of defects per batch of camping chairs inspected
Defects 0 1 2 3 4 5
Batches 95 113 87 64 13 8
3) The number of overtime hours worked in one week per employee
Overtime hours 0 1 2 3 4 5 6
Employees 6 12 29 57 42 30 16
4) A box contains 5 balls. Two are numbered 3, one is numbered 4, and two are
numbered 5. The balls are mixed and one is selected at random. After a ball is
selected, its number is recorded. Then it is replaced. If the experiment is repeated
many times, find the variance and standard deviation of the numbers on the balls.
Number on ball x 3 4 5
P(x) 2/5 1/5 2/5
5) Baier’s Electronics manufactures computer parts that are supplied to many computer
companies. Despite the fact that two quality control inspectors at Baier’s Electronics
check every part for defects before it is shipped to another company, a few defective
parts do pass through these inspections undetected. Let x denote the number of defective
computer parts in a shipment of 400. The following table gives the probability
distribution of x.
x 0 1 2 3 4 5
P(x) 0.02 0.20 0.30 0.30 0.10 0.08
Compute the mean and standard deviation of x.
The Binomial Distribution
Many types of probability problems have only two outcomes or can be reduced to two outcomes.
For example, when a coin is tossed, it can land heads or tails. When a baby is born, it will be
either male or female. In a basketball game, a team either wins or loses. A true/false item can be
answered in only two ways, true or false. Situations like these are called binomial experiments.
A binomial experiment and its results give rise to a special probability distribution called the
binomial distribution.
Example
1. A coin is tossed 3 times. Using the binomial formula, find the probability of getting
a) Exactly two heads
b) No heads
Solution
a) $P(x = 2) = {}_{n}C_{x} \, p^{x} q^{n-x} = {}_{3}C_{2} \, (0.5)^{2} (0.5)^{1} = 0.375$
b) $P(x = 0) = {}_{n}C_{x} \, p^{x} q^{n-x} = {}_{3}C_{0} \, (0.5)^{0} (0.5)^{3} = 0.125$
A binomial experiment is a probability experiment that satisfies the following four
requirements:
1. There must be a fixed number of trials.
2. Each trial can have only two outcomes, which can be considered as either
success or failure.
3. The outcomes of each trial must be independent of one another.
4. The probability of a success must remain the same for each trial.
The outcomes of a binomial experiment and the corresponding probabilities of these outcomes
are called a binomial distribution.
Binomial formula:
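In a binomial experiment, the probability of exactly x successes in n trials is
$P(x) = {}_{n}C_{x} \, p^{x} q^{n-x}$
where p is the probability of success on a single trial and q = 1 - p is the probability of
failure.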
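The binomial formula translates directly into code. The sketch below uses `math.comb` from the Python standard library (available from Python 3.8) for the number of combinations; the function name is hypothetical. It reproduces the two coin-toss results from the example.

```python
from math import comb


def binomial_probability(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p  # probability of failure on a single trial
    return comb(n, x) * p**x * q**(n - x)


# A coin tossed 3 times, with p = 0.5 for heads.
print(binomial_probability(2, 3, 0.5))  # exactly two heads: 0.375
print(binomial_probability(0, 3, 0.5))  # no heads: 0.125
```

Summing the function over several values of x gives cumulative probabilities such as "at least 3" or "at most 2".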
Exercise
1) A survey found that one out of five Americans says he or she has visited a doctor in
any given month. If 10 people are selected at random, find the probability that
exactly 3 will have visited a doctor last month.
2) A survey from Teenage Research found that 20% of teenage consumers receive
their spending money from part-time jobs. If 5 teenagers are selected at random,
find the probability that at least 3 of them will have part-time jobs.
3) About fifty-five percent of all small businesses in Somaliland have a website. If
you randomly select 10 small businesses, what is the probability that exactly four
of them have a website?
4) A random variable follows a binomial distribution with a probability of success
equal to 0.45. For a sample size of n=11, find
a) The probability of exactly one success
b) The probability of exactly 7 successes
c) The probability of 3 or fewer successes
d) The probability of at least 10 successes
e) The expected value of the random variable
5) A random variable follows a binomial distribution with a probability of success
equal to 0.65. For a sample size of n=7, find
a. The probability of exactly 3 successes
b. The probability of exactly 7 successes
c. The probability of 4 or more successes
d. The probability of exactly 3 successes
e. The expected value of the random variable
6) Public Opinion reported that 4% of Somalis are afraid of being alone in a house at
night. If a random sample of 20 Somalis is selected, find these probabilities by
using the binomial formula
a) There are exactly 5 people in the sample who are afraid of being alone at night.
b) There are at most 2 people in the sample who are afraid of being alone at night.
c) There are at least 3 people in the sample who are afraid of being alone at night.
The Poisson Distribution
A discrete probability distribution that is useful when n is large and p is small and when the
independent variables occur over a period of time is called the Poisson distribution. In addition to
being used for the stated conditions (i.e., n is large, p is small, and the variables occur over a period
of time), the Poisson distribution can be used when a density of items is distributed over a given
area or volume, such as the number of plants growing per acre or the number of defects in
a given length of videotape.
Conditions to Apply the Poisson Probability Distribution
The following three conditions must be satisfied to apply the Poisson probability distribution.
I. X is a discrete random variable.
II. The occurrences are random.
III. The occurrences are independent.
Examples of the Poisson distribution
The following examples qualify for the application of the Poisson probability distribution.
1) The number of accidents that occur on a given highway during a 1-week period
2) The number of customers entering a grocery store during a 1-hour interval
3) The number of television sets sold at a department store during a given week
According to the Poisson probability distribution, the probability of x occurrences in an interval is
$P(x) = \dfrac{\lambda^{x} e^{-\lambda}}{x!}$
where $\lambda$ is the mean number of occurrences in that interval and the value of e is
approximately 2.71828.
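The description at the start of this section says the Poisson distribution is useful when n is large and p is small. One way to see this is to compare binomial probabilities with Poisson probabilities. The sketch below assumes the usual choice of $\lambda = np$, which is not stated in these notes, and the values of n and p are hypothetical.

```python
from math import comb, exp, factorial

n, p = 1000, 0.003   # hypothetical: many trials, small success probability
lam = n * p          # assumed Poisson mean, lambda = n * p

rows = []
for x in range(6):
    # Binomial probability of exactly x successes in n trials.
    binomial = comb(n, x) * p**x * (1 - p)**(n - x)
    # Poisson probability of exactly x occurrences with mean lam.
    poisson = lam**x * exp(-lam) / factorial(x)
    rows.append((x, binomial, poisson))
    print(x, round(binomial, 4), round(poisson, 4))
```

For these values the two columns should be close to each other, which is why the simpler Poisson formula is often used in place of the binomial formula in this setting.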
Example:
A study conducted at Afyare store shows that the average number of arrivals at the checkout section of
the store per hour is 16. Further, the number of arrivals is assumed to follow a Poisson
distribution. Find the probability that
a. 12 customers arrive at the checkout section in one hour
b. 10 customers arrive at the checkout section in one hour
Solution
a. $P(x = 12) = \dfrac{16^{12} e^{-16}}{12!} = 0.0661$
b. $P(x = 10) = \dfrac{16^{10} e^{-16}}{10!} = 0.0341$
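These two values can be checked with a short Python sketch of the Poisson formula. The function name is hypothetical; `exp` and `factorial` come from the standard `math` module.

```python
from math import exp, factorial


def poisson_probability(x, lam):
    """Probability of exactly x occurrences when the mean is lam."""
    return lam**x * exp(-lam) / factorial(x)


# Checkout arrivals with a mean of 16 per hour.
print(round(poisson_probability(12, 16), 4))  # 0.0661
print(round(poisson_probability(10, 16), 4))  # 0.0341
```

Probabilities such as "at most 3" or "fewer than four" are obtained by adding the function's values for each x in the range.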
Exercise
1) Arrivals at a Darussalam bank automated teller machine (ATM) are distributed
according to a Poisson distribution with a mean equal to three per 15 minutes.
a. Determine the probability that in a given 15-minute segment no customers will
arrive at the ATM.
b. What is the probability that fewer than four customers will arrive in a
30-minute segment?
2) A sales firm receives, on average, 3 calls per hour on its toll-free number. For any given
hour, find the probability that it will receive the following.
a) At most 3 calls
b) At least 3 calls
c) 5 or more calls
3) Using the Poisson formula, find the following probabilities.
a) P (x2), =3
b) P (x=8), =5.5