1. Descriptive statistics.pptx engineering

1
Statistics and Research Methods

2
Topics: Statistics
• Descriptive Statistics
• Probability Theory and Probability
Distributions
• Hypothesis Testing
• Confidence Interval
• Analysis Of Variance (ANOVA)
• Regression and Correlation
• Chi-Squared

3
Topics: Research Methods
• Research Design
• Literature Review
• Sampling
• Data Collection Methods
• Ethical Issues in Research Resource
• IT role in research & Formatting

4
Assessment
• One test – 15%
• One Individual Assignment – 10%
• One Group Assignment – 15%
• Final Exam – 60%
• Note: Pass Mark is B (50%)

5
References
• Probability and Statistics for Engineering and the Sciences, by
Jay L. Devore, Monterey, California.
• Basic Business Statistics Berenson M.L, Levine D.M, Krehbiel,
T.C
• Research Methodology, Methods and Techniques, by C.R.
Kothari
• Research Methods for Business students, by Mark Saunders,
Philip Lewis and Adrian Thornhill
• Plenty of Websites

7
Introduction and Descriptive Statistics

8
The Science of Statistics
• Statistics is the science of data. This involves
collecting, classifying, summarizing, organizing,
analyzing and interpreting numerical information.
Statistics divides into two branches: Descriptive Statistics and Inferential Statistics.

9
Types of Statistical Applications
• Descriptive statistics utilizes numerical and
graphical methods to look for patterns in a data
set, to summarize the information revealed in a
data set and to present that information in a
convenient form.
• Inferential statistics utilizes sample data to make
estimates, decisions, predictions or other
generalizations about a larger set of data.

10
Descriptive Statistics
• Collect data
– e.g. Survey
• Present data
– e.g. Tables and graphs
• Characterize data
– e.g. Sample mean = $\bar{X} = \frac{\sum X_i}{n}$

Descriptive statistics utilizes numerical and graphical
methods to look for patterns in a data set, to
summarize the information revealed in a data set and
to present that information in a convenient form.

11
Inferential Statistics
• Estimation
– e.g.: Estimate the population mean weight
using the sample mean weight
• Hypothesis testing
– e.g.: Test the claim that the population mean
weight is 120 pounds
Drawing conclusions and/or making decisions
concerning a population based on sample results.
Inferential statistics utilizes sample data to make estimates,
decisions, predictions or other generalizations about a larger
set of data.

12
Fundamental Elements of Statistics
• An experimental unit is an object about which
we collect data.
– Person
– Place
– Thing
– Event

13
• A population is a set of units in which we are
interested.
– Typically, there are too many experimental units in
a population to consider every one.
• If we can examine every single one, we conduct a
census.

14
• A sample is a subset of the population.
• A variable is a characteristic or property of an
individual unit.
– The values of these characteristics will, not
surprisingly, vary.
– A measure of reliability is a statement about the
degree of uncertainty associated with a statistical
inference. (Based on our analysis, we think 56%
of soda drinkers prefer Pepsi to Coke, ± 5%.)

Descriptive Statistics
• The population or sample of
interest
• One or more variables to be
investigated
• Tables, graphs or numerical
summary tools
• Identification of patterns in
the data
Inferential Statistics
• Population of interest
• One or more variables to be
investigated
• The sample of population
units
• The inference about the
population based on the
sample data
• A measure of reliability of
the inference
15

Types of Data
• Quantitative Data are measurements that are
recorded on a naturally occurring numerical
scale. e.g. Age, GPA, Salary, Cost of books this
semester
• Categorical (Qualitative) Data are measurements
that cannot be recorded on a natural numerical
scale, but are recorded in categories e.g. Live
on/off campus, Major, Gender
16

17
Methods for Describing Sets of Data

18
Data Presentation
Numerical Data
• Ordered Array
• Stem-and-Leaf Display
• Frequency Distributions
– Histogram
– Polygon
– Ogive

19
Ordered Array
• 1. Organizes Data to Focus on Major
Features
• 2. Data Placed in Rank Order
– Smallest to Largest
• 3. Data in Raw Form (as Collected)
– 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
• 4. Data in Ordered Array
– 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

20
Stem-and-Leaf Display
• A Stem-and-Leaf Display
shows the number of
observations that share a
common value (the stem)
and the precise value of
each observation (the leaf)
Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem  Leaves
2     144677
3     028
4     1
(e.g. the observation 26 has stem 2 and leaf 6)
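As an illustration, the stem-and-leaf display above can be reproduced with a few lines of Python. This is a minimal sketch that assumes two-digit integer data, with the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is a hypothetical helper, not part of the slides.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by tens digit (stem) and units digit (leaf)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 26 -> stem 2, leaf 6
        groups[stem].append(leaf)
    return dict(groups)

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    # Print each stem followed by its leaves in ascending order
    print(stem, "".join(str(leaf) for leaf in leaves))
```

Each printed row corresponds to one row of the display on the slide.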

21
Frequency Distribution Table
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Class Frequency
15 but < 25 3
25 but < 35 5
35 but < 45 2

22
Frequency Distribution Table Steps
• 1. Determine Range
• 2. Select Number of Classes
– Usually Between 5 & 15 Inclusive
• 3. Compute Class Intervals (Width)
• 4. Determine Class Boundaries (Limits)
• 5. Compute Class Midpoints
• 6. Count Observations & Assign to Classes

23
Frequency Distribution Table Example
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Class        Midpoint  Frequency
15 but < 25  20        3
25 but < 35  30        5
35 but < 45  40        2
Midpoint = (Upper + Lower Boundaries) / 2
Width = Upper Boundary – Lower Boundary = 10
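The counting step can be sketched in Python as follows. This is only an illustration: the helper name `frequency_table` and its parameters are assumptions, and the class boundaries (starting at 15, width 10, three classes) are taken from the slide's example.

```python
def frequency_table(values, lower, width, n_classes):
    """Return (lower, upper, midpoint, count) rows for classes lower <= x < upper."""
    rows = []
    for k in range(n_classes):
        lo = lower + k * width
        hi = lo + width
        count = sum(1 for v in values if lo <= v < hi)
        rows.append((lo, hi, (lo + hi) / 2, count))
    return rows

raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
table = frequency_table(raw, lower=15, width=10, n_classes=3)
for lo, hi, mid, count in table:
    print(f"{lo} but < {hi}  midpoint {mid}  frequency {count}")
```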

24
Relative Frequency &
% Distribution Tables
Relative Frequency Distribution
Class        Prop.
15 but < 25  .3
25 but < 35  .5
35 but < 45  .2
Percentage Distribution
Class        %
15 but < 25  30.0
25 but < 35  50.0
35 but < 45  20.0
class relative frequency = (class frequency) / n
class percentage = (class relative frequency) × 100


25
Cumulative Percentage Distribution
Table
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
The cumulative percentage of a class is the percentage of observations less than its lower class boundary.
Class        Cumulative Percentage
15 but < 25  0.0
25 but < 35  30.0
35 but < 45  80.0 (30% + 50%)
45 but < 55  100.0 (80% + 20%)
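A short Python sketch of how the relative, percentage and cumulative percentage columns follow from the class frequencies. The variable names are illustrative; the frequencies 3, 5 and 2 are those of the slide's example.

```python
from itertools import accumulate

counts = [3, 5, 2]                       # class frequencies from the example
n = sum(counts)

relative = [c / n for c in counts]       # class frequency / n
percent = [100 * c / n for c in counts]  # relative frequency x 100

# Percentage below each lower class boundary: 0 for the first class,
# then the running total of the class percentages.
cumulative = [0.0] + list(accumulate(percent))

print(relative)
print(percent)
print(cumulative)
```

The cumulative list has one more entry than there are classes because the last value belongs to the lower boundary of the class after the final one (45 but < 55).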

26
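Histogram
[Figure: histogram of the class counts. The vertical axis (0 to 5) may show frequency, relative frequency or percent; the horizontal axis shows the lower class boundaries 15, 25, 35, 45, 55; the bars touch.]
Class        Freq.
15 but < 25  3
25 but < 35  5
35 but < 45  2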
• Histograms are graphs of the frequency or relative
frequency of a variable.
– Class intervals make up the horizontal axis
– The frequencies or relative frequencies are displayed on the
vertical axis.

27
Polygon
[Figure: frequency polygon. Counts (frequency, relative frequency or percent) are plotted at the class midpoints on a horizontal axis from 0 to 60, with a fictitious class at each end.]
Class        Freq.
15 but < 25  3
25 but < 35  5
35 but < 45  2

28
Cumulative % Polygon (Ogive)
[Figure: ogive. Cumulative % (0% to 100%) is plotted at the lower class boundaries 15, 25, 35, 45, 55, starting from a fictitious class.]
Class        Cum. %
15 but < 25  0%
25 but < 35  30%
35 but < 45  80%
45 but < 55  100%

29
Categorical Data Presentation
Categorical Data
• Summary Table
• Bar Chart
• Pie Chart
• Pareto Diagram

30
Summary Table
• 1. Lists Categories & No. Elements in Category
• 2. Obtained by Tallying Responses in Category
• 3. May Show Frequencies (Counts), % or Both
Each row is a category; the counts come from tallying the responses.
Major       Count
Accounting  130
Economics   20
Management  50
Total       200

31
Bar Chart
[Figure: horizontal bar chart of Frequency (axis 0 to 150, starting at the zero point) by Major: Acct., Econ., Mgmt.]
• Horizontal bars for categorical variables
• Bar length shows frequency or %
• Equal bar widths
• 1/2 to 1 bar width
• Zero point
• Percent used also

32
Pie Chart
[Figure: pie chart of Majors: Acct. 65%, Mgmt. 25%, Econ. 10%.]
• 1. Shows Breakdown of Total Quantity into Categories
• 2. Useful for Showing Relative Differences
• 3. Angle Size
– (360°)(Percent)
– e.g. Econ.: (360°)(10%) = 36°
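The angle rule can be checked for all three majors with a small Python sketch. The counts are those of the Summary Table slide; the variable names are illustrative.

```python
counts = {"Accounting": 130, "Economics": 20, "Management": 50}
total = sum(counts.values())

# Angle of each slice = 360 degrees x the category's share of the total
angles = {major: 360 * c / total for major, c in counts.items()}

for major, angle in angles.items():
    print(f"{major}: {angle} degrees")
```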

33
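Pareto Diagram
[Figure: vertical bar chart of Percent (0% to 100%) by Major in descending order (Acct., Mgmt., Econ.), with a cumulative polygon (ogive) plotted at the bar midpoints.]
• Vertical bar chart with equal bar widths
• Categories in descending order
• Cumulative polygon (ogive) plotted at the bar midpoints
• Always %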

34
Numerical Descriptive
Measures

35
Summary Measures
• Central Tendency
– Mean
– Median
– Mode
– Geometric Mean
• Quartile
• Variation
– Range
– Variance
– Standard Deviation
– Coefficient of Variation

36
Measures of Central Tendency
• Various ways to describe the central,
most common or middle value in a
distribution or set of data
– The Mean (Arithmetic Mean)
– The Median
– The Mode
– The Geometric Mean

37
Numerical Measures of
Central Tendency
• Summarizing data sets numerically
– Are there certain values that seem more
typical for the data?
– How typical are they?

38
Central Tendency
• Central tendency is the value or values around
which the data tend to cluster
• Variability shows how strongly the data
cluster around that (those) value(s)

39
Central Tendency
• The mean of a set of quantitative data is the sum
of the observed values divided by the number of
values
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

40
Central Tendency
• The mean of a sample is typically denoted by x-bar,
but the population mean is denoted by the Greek
symbol μ.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

41
Central Tendency
• If x1 = 1, x2 = 2, x3 = 3 and x4 = 4,
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5

42
Central Tendency
• The median of a set of quantitative data is the
value which is located in the middle of the data,
arranged from lowest to highest values (or vice
versa), with 50% of the observations above and
50% below.

43
Central Tendency
[Figure: data ordered from lowest value to highest value, with the median at the point that leaves 50% of the observations on each side.]

44
Central Tendency
• Finding the Median, M:
– Arrange the n measurements from smallest to
largest
• If n is odd, M is the middle number
• If n is even, M is the average of the middle two
numbers

45
Central Tendency
• The mode is the most frequently observed
value.
• The modal class is the class with the highest relative
frequency; for grouped data its midpoint is commonly taken as the mode.

46
Geometric Mean
• Equals the nth root of the product of all
observations or values
• For a set of values: x1, x2, x3, ..., xn
• Geometric mean = $\sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$

47
Example
Jim has 20 problems to do for homework. Some are harder than others and take more time to solve. We take a random sample of 9 problems. Find the mean (arithmetic and geometric), median and mode for the number of minutes Jim spends on his homework.
Problem #  Time Spent (Minutes)
1          12
2          4
3          3
4          8
5          7
6          5
7          4
8          9
9          11

48
Solution: Mean
Problem #  Time Spent (Minutes)
1          12
2          4
3          3
4          8
5          7
6          5
7          4
8          9
9          11
Sample size (n) = 9
Problems 1 through 9 = x1, x2, x3 … x9, respectively.
Σx = (12 + 4 + 3 + 8 + 7 + 5 + 4 + 9 + 11) = 63 minutes
Σx/n = 63/9 = 7 minutes

49
Solution: Geometric Mean
Problem #  Time Spent (Minutes)
1          12
2          4
3          3
4          8
5          7
6          5
7          4
8          9
9          11
Sample size (n) = 9
Problems 1 through 9 = x1, x2, x3 … x9, respectively.
GM = (12 × 4 × 3 × 8 × 7 × 5 × 4 × 9 × 11)^(1/9) = (15,966,720)^(1/9) ≈ 6.31 minutes

50
Solution: Median
Place the data in ascending order: 3, 4, 4, 5, 7, 8, 9, 11, 12
Position of the median = (n+1)/2 = (9+1)/2 = 5
The 5th ordered observation is 7, so the Median is 7.

51
Solution: Mode
Keep the data in order from smallest to largest: 3, 4, 4, 5, 7, 8, 9, 11, 12
Only the value 4 occurs more than once.
The Mode is 4.
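The four answers for Jim's homework sample can be checked with Python's standard `statistics` module. This is a sketch for verification only, not the method used on the slides; `geometric_mean` requires Python 3.8 or later.

```python
import statistics

times = [12, 4, 3, 8, 7, 5, 4, 9, 11]    # minutes per problem, from the example

mean = statistics.mean(times)             # arithmetic mean: sum / n
gm = statistics.geometric_mean(times)     # nth root of the product
median = statistics.median(times)         # middle value of the ordered data
mode = statistics.mode(times)             # most frequently observed value

print(mean, round(gm, 2), median, mode)
```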

52
Approximating the Mean from a Frequency
Distribution
• Used when the only source of data is a
frequency distribution
$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n}$
where
n = sample size
c = number of classes in the frequency distribution
$m_j$ = midpoint of the jth class
$f_j$ = frequency of the jth class

53
$\bar{X} = \frac{\sum_{j=1}^{c} m_j f_j}{n}$
Example
Class MP Freq.
10 but < 20 15 3
20 but < 30 25 6
30 but < 40 35 5
40 but < 50 45 4
50 but < 60 55 2
Total 20
X = ((15*3) + (25*6) + (35*5) + (45*4) + (55*2))/20
= (45 + 150 + 175 + 180 + 110)/20
= 660/20
= 33
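The same calculation as a Python sketch, using the midpoints and frequencies from the example table. The variable names are illustrative.

```python
midpoints = [15, 25, 35, 45, 55]   # class midpoints m_j
freqs = [3, 6, 5, 4, 2]            # class frequencies f_j

n = sum(freqs)                     # sample size
# Approximate mean: sum of midpoint x frequency, divided by n
approx_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

print(n, approx_mean)
```

The result is an approximation because every observation in a class is treated as if it sat at the class midpoint.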

54
Central Tendency
• Perfectly symmetric data set:
– Mean = Median = Mode
• Extremely high value in the data set:
– Mean > Median > Mode
(Rightward skewness)
• Extremely low value in the data set:
– Mean < Median < Mode
(Leftward skewness)

55
Central Tendency
• A data set is skewed if one tail of the
distribution has more extreme observations
than the other tail.

56
Variability
• The mean, median and mode give us an idea
of the central tendency, or where the
“middle” of the data is
• Variability gives us an idea of how spread out
the data are around that middle

57
Measures of Variation
Variation
• Range
• Interquartile Range
• Variance
– Population Variance
– Sample Variance
• Standard Deviation
– Population Standard Deviation
– Sample Standard Deviation
• Coefficient of Variation

58
Variability
• The range is equal to the largest measurement
minus the smallest measurement
– Easy to compute, but not very informative
– Considers only two observations (the smallest and
largest)

59
Quartiles
• Quartiles Split Ordered Data into 4 equal
portions
• Q1 and Q3 are Measures of Non-central Location
– Q2 = the Median
[Figure: ordered data split into four portions of 25% each by Q1, Q2 and Q3.]

60
Quartiles
• Each Quartile has position and value
– With the data in an ordered array, the position of Qi is: $Q_i = \frac{i(n+1)}{4}$
– The value of Qi is the value associated with that position in the ordered array
• Example:
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
Position of $Q_1 = \frac{1(9+1)}{4} = 2.5$, so $Q_1 = \frac{12 + 13}{2} = 12.5$

61
Quartiles Example
Find the 1st and 3rd Quartiles in the ordered observations: 3, 4, 4, 5, 7, 8, 9, 11, 12
Position of Q1 = 1(9+1)/4 = 2.5
The 2.5th observation = (4+4)/2 = 4, so Q1 = 4
Position of Q3 = 3(9+1)/4 = 7.5 (three times the position of Q1)
The 7.5th observation = (9+11)/2 = 10, so Q3 = 10

62
Interquartile Range (IQR)
• The difference between Q3 and Q1: IQR = Q3 – Q1
– The middle 50% of the values
– Also known as Midspread
– Resistant to extreme values
• Example:
– Data in Ordered Array: 11 12 13 16 16 17 17 18 21
– Q1 = 12.5, Q3 = 17.5
– IQR = 17.5 – 12.5 = 5

63
Range and IQR Example
Find the Range and the Interquartile Range in this distribution: 3, 4, 4, 5, 7, 8, 9, 11, 12
Range = largest – smallest = 12 – 3 = 9.
Position of Q1 = 1(9+1)/4 = 2.5
The 2.5th observation = (4+4)/2 = 4
Position of Q3 = 3(9+1)/4 = 7.5
The 7.5th observation = (9+11)/2 = 10
IQR = 10 – 4 = 6
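The position rule i(n+1)/4 can be sketched in Python as below. The function name `quartile` is hypothetical, and one assumption goes beyond the slides: fractional positions are handled by linear interpolation between the neighbouring observations, which reduces to the simple average used above when the position ends in .5. The sketch also assumes at least three observations.

```python
def quartile(ordered, i):
    """Value at 1-based position i(n+1)/4 in an ordered list, i = 1, 2 or 3."""
    n = len(ordered)
    pos = i * (n + 1) / 4
    lower = int(pos)                 # observation at or just below the position
    frac = pos - lower               # how far to move towards the next observation
    lower = min(max(lower, 1), n)    # keep indices inside the data
    upper = min(lower + 1, n)
    return ordered[lower - 1] + frac * (ordered[upper - 1] - ordered[lower - 1])

data = sorted([3, 4, 4, 5, 7, 8, 9, 11, 12])
q1 = quartile(data, 1)
q3 = quartile(data, 3)
iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, q3, iqr, data_range)
```

Statistical packages use several different quartile conventions, so other tools may return slightly different values for the same data.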

64
Variability
• The sample variance, $s^2$, for a sample of n
measurements is equal to the sum of the
squared distances from the mean, divided
by (n – 1).
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

65
Variability
• The sample standard deviation, s, for a
sample of n measurements is equal to the
square root of the sample variance.
$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

66
Numerical Measures of Variability
• Say a small data set consists of the measurements 1, 2
and 3.
$\bar{x} = 2$
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(3-2)^2 + (2-2)^2 + (1-2)^2}{3 - 1} = \frac{1 + 0 + 1}{2} = \frac{2}{2} = 1$
$s = \sqrt{s^2} = \sqrt{1} = 1$
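The small example with measurements 1, 2 and 3 can be reproduced in Python. This is a minimal sketch of the sample variance and standard deviation with the (n – 1) divisor; the function names are illustrative.

```python
import math

def sample_variance(xs):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_std_dev(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

measurements = [1, 2, 3]
s2 = sample_variance(measurements)
s = sample_std_dev(measurements)
print(s2, s)
```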

67
Numerical Measures of Variability
• As before, Greek letters are used for
populations and Roman letters for samples
$s^2$ = sample variance
s = sample standard deviation
$\sigma^2$ = population variance
σ = population standard deviation

68
Comparing Standard Deviations
[Figure: dot plots of three data sets, each on an axis from 11 to 21.]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57
• Greater S (or σ) = more dispersion of data

69
Interpreting the Standard Deviation
• Chebyshev’s Rule
• The Empirical Rule
Both tell us something about where
the data will be relative to the mean.

70
• Chebyshev’s Rule
– Valid for any data set
– For any number k > 1, at least $(1 - 1/k^2) \times 100\%$ of the
observations will lie within k standard deviations of the mean
k  k^2  1/k^2  (1 – 1/k^2) × 100%
2  4    .25    75%
3  9    .11    89%
4  16   .0625  93.75%
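The table values can be regenerated with a small Python function. This is a sketch; the name `chebyshev_bound` is hypothetical.

```python
def chebyshev_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    # Print the bound as a percentage, e.g. k = 2 gives 75.00%
    print(k, f"{100 * chebyshev_bound(k):.2f}%")
```

The bound holds for any data set, which is why it is much weaker than the percentages given by the Empirical Rule for mound-shaped data.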

71
The Bienayme-Chebyshev Rule
• At least (≥) 75% of the observations must be
contained within distances of 2 SD around the
mean
• At least (≥) 88.89% of the observations must be
contained within distances of 3 SD around the
mean
• At least (≥) 93.75% of the observations must be
contained within distances of 4 SD around the
mean

72
The Bienayme-Chebyshev Rule
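[Figure: axis marked µ – 4sd, µ – 3sd, µ – 2sd, µ, µ + 2sd, µ + 3sd, µ + 4sd, showing ≥ 75% of observations within ±2sd, ≥ 88.89% within ±3sd and ≥ 93.75% within ±4sd.]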

73
• The Empirical Rule
– Useful for mound-
shaped, symmetrical
distributions
– If not perfectly mounded
and symmetrical, the
values are
approximations
• For a perfectly symmetrical and mound-shaped distribution,
– ~68% will be within the range $(\bar{x} - s, \bar{x} + s)$
– ~95% will be within the range $(\bar{x} - 2s, \bar{x} + 2s)$
– ~99.7% will be within the range $(\bar{x} - 3s, \bar{x} + 3s)$

74
• Hummingbirds beat their
wings in flight an average of 55
times per second.
• Assume the standard deviation
is 10, and that the distribution
is symmetrical and mounded.
– Approximately what percentage
of hummingbirds beat their
wings between 45 and 65 times
per second?
– Between 55 and 65?
– Less than 45?

75
– Approximately what percentage of hummingbirds beat their
wings between 45 and 65 times per second?
Since 45 and 65 are exactly one standard deviation below
and above the mean, the empirical rule says that about
68% of the hummingbirds will be in this range.

76
– Between 55 and 65?
This range of numbers is from the mean to one standard
deviation above it, or one-half of the range in the previous
question. So, about one-half of 68%, or 34%, of the
hummingbirds will be in this range.

77
– Less than 45?
Half of the entire data set lies above the mean, and ~34% lie
between 45 and 55 (between one standard deviation below
the mean and the mean), so ~84% (~34% + 50%) are above
45, which means ~16% are below 45.
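The three hummingbird answers can be checked numerically. The sketch below makes an assumption that goes beyond the slides: it models "symmetrical and mounded" as an exact normal distribution with mean 55 and standard deviation 10, using `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

# Assumption: wingbeat rates follow a normal distribution (mean 55, sd 10)
wingbeats = NormalDist(mu=55, sigma=10)

between_45_65 = wingbeats.cdf(65) - wingbeats.cdf(45)   # within 1 sd of the mean
between_55_65 = wingbeats.cdf(65) - wingbeats.cdf(55)   # mean to 1 sd above
below_45 = wingbeats.cdf(45)                            # more than 1 sd below

print(f"{between_45_65:.1%} {between_55_65:.1%} {below_45:.1%}")
```

The computed values agree with the empirical-rule answers of about 68%, 34% and 16%.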

78
Exercise
A manufacturer of automobile batteries claims that the average length of
life of its grade A battery is 60 months. However, the guarantee on this
brand is for just 36 months. Suppose the standard deviation of the life
length is known to be 10 months and the frequency distribution of the life-
length data is known to be mound shaped.
• Approximately what percentage of the manufacturer’s grade A batteries
will last more than 50 months, assuming that the manufacturer’s claim is
true?
• Approximately what percentage of the manufacturer’s batteries will last
less than 40 months, assuming that the manufacturer’s claim is true?
• Suppose your battery lasts 37 months. What could you infer about the
manufacturer’s claim?

79
Coefficient of Variation
• Measure of Relative Variation
• Shows Variation Relative to the Mean
• Used to Compare Two or More Sets of Data Measured in
Different Units
$CV = \left( \frac{S}{\bar{X}} \right) \times 100\%$
S = Sample Standard Deviation
$\bar{X}$ = Sample Mean

80
Comparing Coefficient
of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5
– $CV_A = \left( \frac{S}{\bar{X}} \right) \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$
• Stock B:
– Average price last year = $100
– Standard deviation = $5
– $CV_B = \left( \frac{S}{\bar{X}} \right) \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price.
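The stock comparison can be reproduced with a one-line function. This is a sketch; the name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B
print(cv_a, cv_b)
```

Because the result is a percentage with no units, it can also compare data sets measured in different units.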













81
Numerical Measures of Relative Standing
• The z-score tells us
how many standard
deviations above or
below the mean a
particular
measurement is.
• Sample z-score: $z = \frac{x - \bar{x}}{s}$
• Population z-score: $z = \frac{x - \mu}{\sigma}$

82
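Hummingbirds beat their wings in flight an average of 55 times per second, with a standard deviation of 10.
An individual hummingbird is measured with 75 beats per
second. What is this bird’s z-score?
$z = \frac{x - \bar{x}}{s} = \frac{75 - 55}{10} = 2.0$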

83
Z Scores
Example:
• If the mean is 14.0 and the standard deviation is 3.0, what is
the Z score for the value 18.5?
$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$
• The value 18.5 is 1.5 standard deviations above the mean
• (A negative Z-score would mean that a value is less than the
mean)
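Both z-score examples can be verified with a short Python sketch. The function name `z_score` is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_value = z_score(18.5, mean=14.0, std_dev=3.0)   # the Z Scores example
z_bird = z_score(75, mean=55, std_dev=10)         # the hummingbird example
print(z_value, z_bird)
```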





84
• Since ~95% of all the
measurements will be within 2
standard deviations of the
mean, only ~5% will be more
than 2 standard deviations
from the mean.
• About half of this 5% will be
far below the mean, leaving
only about 2.5% of the
measurements at least 2
standard deviations above the
mean.

85
Numerical Measures of Relative Standing
• Z scores are related to the empirical rule:
For a perfectly symmetrical and mound-
shaped distribution,
– ~68 % will have z-scores between -1 and 1
– ~95 % will have z-scores between -2 and 2
– ~99.7% will have z-scores between -3 and 3

86
Methods for Determining Outliers
• An outlier is a measurement that is unusually
large or small relative to the other values.
• Three possible causes:
– Observation, recording or data entry error
– Item is from a different population
– A rare, chance event

87
The Box Plot (“Box-and-Whisker”)
• The box plot is a graph representing
information about certain percentiles for a
data set and can be used to identify outliers
• 5 number summary
– Median, Q1, Q3, Xsmallest, Xlargest
• Box Plot
– Graphical display of data using 5-number summary
[Figure: box plot on an axis from 4 to 12, marking Xsmallest, Q1, Q2 (Median), Q3 and Xlargest.]

88
Distribution Shape & Box Plot
[Figure: three box plots, labelled Left-Skewed, Symmetric and Right-Skewed, each marking Q1, Q2 and Q3.]

89
Methods for Determining Outliers
• Outliers and z-scores
– The chance that a z-score is between -3 and +3 is
over 99%.
– Any measurement with |z| > 3 is considered an
outlier.
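The |z| > 3 rule can be sketched in Python as follows. The data set is made up purely for illustration and the function name `z_outliers` is hypothetical; the sample mean and sample standard deviation are used to compute each z-score.

```python
import statistics

def z_outliers(values, threshold=3.0):
    """Return the measurements whose sample z-score exceeds the threshold in size."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation (n - 1)
    return [v for v in values if abs((v - mean) / s) > threshold]

# Hypothetical data: 19 values near 10 and one unusually large value
sample = [9, 10, 11] * 6 + [10, 40]
outliers = z_outliers(sample)
print(outliers)
```

In very small samples no observation can reach |z| > 3, because the extreme value itself inflates the standard deviation, so the rule is only useful with a reasonable number of measurements.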

90
Correlation Coefficient
• Correlation Coefficient = r
– Unit Free
– Measures the strength of the linear relationship
between 2 quantitative variables
• Ranges between –1 and 1
– The Closer to –1, the stronger the negative linear
relationship becomes
– The Closer to 1, the stronger the positive linear
relationship becomes
– The Closer to 0, the weaker any linear relationship
becomes
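The slides do not state the formula for r. The sketch below assumes the usual Pearson correlation coefficient, the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations; the function name `pearson_r` and the data are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # perfect positive linear relationship
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # perfect negative linear relationship
print(r_positive, r_negative)
```

Because the deviations are divided by their own scale, r is unit free, as noted above.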

91
Scatter Plots of Data with Various Correlation Coefficients
[Figure: five scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = .6 and r = 1.]
• Scattergram (or scatterplot) shows the relationship between two
quantitative variables

92
Distorting the Truth with Deceptive Statistics
• Distortions
– Stretching the axis (and the truth)
– Is average relevant?
• Mean, median or mode?
– Is average relevant?
• What about the spread?
