Advanced Statistics And Probability (MSC 615

Copyright © 2014 John Wiley & Sons, Inc. All rights reserved.
Advanced Statistics and Probability
(MSC 615)
Yerkin G. Abdildin
Nazarbayev University

Descriptive Statistics
• Numerical Summaries of Data
• Stem-and-Leaf Diagrams
• Frequency Distributions and Histograms
• Box Plots
• Time Sequence Plots
• Probability Plots

Terminology
• Population: The set of all possible observations that
could be made from a random experiment
– May be very large (e.g., the set of all KZ residents)
– Hypothetical or conceptual populations do not physically exist
(e.g., set of all products that will be produced in a factory)
• Sample: An observed subset of data from a population
– Example: A set of 300 randomly-chosen KZ residents
– Example: A set of 14 randomly-chosen parts produced in a
factory during one day’s operation

Numerical Summaries of Data
• Data: the numeric observations of a phenomenon of
interest.
– The totality of all observations is a population.
– A portion used for analysis is a random sample.
• We gain an understanding of this collection by describing it
numerically and graphically, usually with the sample data.
• We describe the collection in terms of shape, outliers, center, and
spread (SOCS).
– The center can be measured by the mean.
– The spread can be measured by the variance.

What is the field of statistics?
We encounter it regularly:
– Opinion polls
– Sports performance
– Effectiveness of medical treatments
But what is it, really?
– Statistics is the “science of data”
– Qualitative and quantitative tools for analyzing data
Why do we need it?
– Many outcomes are produced by “random” processes
– Want to understand these underlying processes more clearly

Probability and Statistics
• Probability: Knowing the structure of the process, what can
we say about the outcomes?
• Statistics: Having observed some outcomes, what can we
say about the process?
[Diagram: Process, experiment → Observed outcomes (Probability); Observed outcomes → Process, experiment (Statistics)]

EXAMPLE: Suppose that a factory manager is very
dedicated to product quality, and is concerned that too
many products produced at the factory may be defective.
What would your recommended plan of action be?

IDEA 1: Take the factory offline, disassemble and thoroughly
inspect every machine involved in the production process, and put
every employee in the factory through a rigorous two-week
training program.
CONCERN: Very expensive!

IDEA 2: Take an entire day’s worth of products and inspect them all
thoroughly to estimate the rate at which defective products are
being produced.
CONCERN: What if the factory produces thousands of units per day?
And what if testing requires destroying the products (e.g.,
yield strength of steel beams)?

Probability and Statistical inference
• A statistic is any quantity whose value can be calculated
from sample data (e.g., mean, variance, proportion).
• Reasoning from a sample to a population is referred to as
statistical inference, or inferential statistics.
[Diagram: Population → Sample (probability); Sample → Population (inferential statistics, or statistical inference)]

Sample Mean
For n observations in a random sample, $x_1, x_2, \ldots, x_n$, the sample mean is
\[
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}.
\]
For N observations in a population, $x_1, x_2, \ldots, x_N$, the population mean is
\[
\mu = \sum_{i=1}^{N} x_i f(x_i) = \frac{\sum_{i=1}^{N} x_i}{N}.
\]

Example 1: Sample Mean
Consider 8 observations (xi) of pull-off force from
engine connectors as shown in the table.
i    x_i (pounds)
1    12.6
2    12.9
3    13.4
4    12.3
5    13.6
6    13.5
7    12.6
8    13.1

\[
\bar{x} = \frac{\sum_{i=1}^{8} x_i}{8} = \frac{12.6 + 12.9 + \cdots + 13.1}{8} = \frac{104}{8} = 13.00 \text{ pounds}
\]
In Excel: =AVERAGE($B2:$B9) returns 13.00.
The sample mean is the balance point of the data.
Variance Defined
For n observations in a random sample, $x_1, x_2, \ldots, x_n$, the sample variance is
\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.
\]
For N observations in a population, $x_1, x_2, \ldots, x_N$, the population variance is
\[
\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 f(x_i) = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}.
\]

Some other concepts
• The standard deviation is the square root of the
variance.
– s is the sample standard deviation symbol.
– σ is the population standard deviation symbol.
• A sample statistic can be a reasonable estimate of a
population parameter, e.g.
– x̄ is an estimate of μ,
– s² is an estimate of σ².

Example 2: Sample Variance
Table 1 displays the quantities needed to calculate the
sample variance, s², and sample standard deviation, s.

Table 1
i       x_i      x_i - x̄    (x_i - x̄)²
1       12.6     -0.4       0.16
2       12.9     -0.1       0.01
3       13.4      0.4       0.16
4       12.3     -0.7       0.49
5       13.6      0.6       0.36
6       13.5      0.5       0.25
7       12.6     -0.4       0.16
8       13.1      0.1       0.01
sums    104.00    0.0       1.60

x̄ = 104.00 / 8 = 13.00 (divide by n = 8)
s² = 1.60 / 7 = 0.2286 (divide by n - 1 = 7)
s = sqrt(0.2286) = 0.48

Units:
– xi is in pounds.
– The mean is in pounds.
– The variance is in pounds².
– The standard deviation is in pounds.
Desired accuracy is generally accepted to be one more decimal place
than the data.

Summary: x̄ = 13.00 pounds, s² = 0.2286 pounds², s = 0.48 pounds.
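The calculations in Examples 1 and 2 can be reproduced with a few lines of Python. This is only a minimal sketch: the variable names are chosen here for illustration, and the standard-library `statistics.variance` call is used purely as a cross-check (it also divides by n - 1).

```python
import statistics

# Pull-off force observations (pounds) from Example 1.
x = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

n = len(x)
xbar = sum(x) / n                          # sample mean
ss = sum((xi - xbar) ** 2 for xi in x)     # sum of squared deviations from the mean
s2 = ss / (n - 1)                          # sample variance (divide by n - 1)
s = s2 ** 0.5                              # sample standard deviation

print(f"mean = {xbar:.2f} pounds")
print(f"variance = {s2:.4f} pounds^2")
print(f"std dev = {s:.2f} pounds")

# Cross-check against the standard library, which also divides by n - 1.
assert abs(s2 - statistics.variance(x)) < 1e-9
```

The printed values should agree with Table 1 after rounding.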

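Computation of s²
We can derive the computational formula.
It involves just 2 sums: the sum of the xi and the sum of the xi².
\[
\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
     = \frac{\sum_{i=1}^{n} (x_i^2 + \bar{x}^2 - 2\bar{x}x_i)}{n - 1} \\
    &= \frac{\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x}\sum_{i=1}^{n} x_i}{n - 1}
     = \frac{\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x} \cdot n\bar{x}}{n - 1} \\
    &= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}
     = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}
\end{aligned}
\]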
Example 3: Variance by Shortcut
\[
s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}
    = \frac{1{,}353.60 - (104.0)^2 / 8}{7} = \frac{1.60}{7} = 0.2286 \text{ pounds}^2
\]
\[
s = \sqrt{0.2286} = 0.48 \text{ pounds}
\]

i       x_i      x_i²
1       12.6     158.76
2       12.9     166.41
3       13.4     179.56
4       12.3     151.29
5       13.6     184.96
6       13.5     182.25
7       12.6     158.76
8       13.1     171.61
sums    104.0    1,353.60
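As a quick check that the shortcut and the definition agree, the sketch below computes s² both ways for the Example 3 data. The variable names are illustrative only. One caveat that is not from the slides: in floating-point arithmetic the shortcut subtracts two nearly equal numbers, so it can lose precision when the mean is large relative to the spread.

```python
# Pull-off force observations (pounds) from Example 3.
x = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]
n = len(x)

# The two sums required by the computational (shortcut) formula.
sum_x = sum(x)
sum_x2 = sum(xi ** 2 for xi in x)

# Shortcut: (sum of squares - (sum)^2 / n) / (n - 1)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Definition: sum of squared deviations from the mean / (n - 1)
xbar = sum_x / n
s2_definition = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

print(f"sum x = {sum_x:.1f}, sum x^2 = {sum_x2:.2f}")
print(f"shortcut   s^2 = {s2_shortcut:.4f}")
print(f"definition s^2 = {s2_definition:.4f}")
```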

Why does sample variance divide by (n –1)?
The population variance (σ²) is calculated with N, the
population size. Why isn’t the sample variance (s²)
calculated with n, the sample size?
1) s² measures squared deviations from x̄, not μ
• The true variance (σ²) is based on data deviations from
the true mean, μ.
• The sample variance (s²) is based on the data deviations
from x̄ (x-bar), not μ.
• The sum of squared deviations about x̄ is never larger than the
sum about μ, so dividing by n would tend to underestimate σ².
Dividing by n – 1 rather than n adjusts for this underestimation.

2) n – 1 is the number of “degrees of freedom” we have
in our computation
• s² is calculated with the quantity n – 1, which is called
the “degrees of freedom”.
• Origin of the term:
– There are n deviations from x̄ in the sample, xi – x̄
– The sum of the deviations is zero (see Table 1)
– If we know the values of n – 1 of these terms, the
value of the last is known
– Hence, only n – 1 of these terms can be chosen freely
(i.e., independently of the others)

3) Dividing by n – 1 makes s² an unbiased estimator of σ²
• x̄ is an estimator of μ; close but not the same.
• Remember that s² is a random variable; what is its
expectation?
• Being an “unbiased estimator” means that E(s²) = σ²
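The unbiasedness claim can be checked exactly on a toy case. The sketch below is an illustration, not part of the lecture: it assumes a hypothetical three-value population and enumerates every equally likely sample of size n = 2 drawn with replacement (independent draws), using exact fractions so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population, chosen only for illustration.
population = [2, 4, 9]
n = 2  # sample size

N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(v) - mu) ** 2 for v in population) / N  # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations of a sample about its own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((Fraction(v) - xbar) ** 2 for v in sample)

# All N**n equally likely ordered samples (independent draws with replacement).
samples = list(product(population, repeat=n))

# Average of the variance estimate over all samples, for each divisor.
mean_s2_n_minus_1 = sum(sum_sq_dev(smp) / (n - 1) for smp in samples) / len(samples)
mean_s2_n = sum(sum_sq_dev(smp) / n for smp in samples) / len(samples)

print("sigma^2                =", sigma2)
print("E[s^2], divisor n - 1  =", mean_s2_n_minus_1)
print("E[s^2], divisor n      =", mean_s2_n)
```

With the divisor n - 1 the average equals σ² exactly; with the divisor n it equals σ²(n - 1)/n, which is the underestimation described in point 1.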

Sample Range
If the n observations in a sample are denoted
by x1, x2, …, xn, the sample range is:
r = max(xi) – min(xi)
It is the largest observation in the sample minus
the smallest observation.
From Example 3:
r = 13.6 – 12.3 = 1.30 pounds
Note: population range ≥ sample range

Stem-and-Leaf Diagrams
• Dot diagrams (dotplots) are useful for small
data sets. Stem & leaf diagrams are better
for large sets.
• Steps to construct a stem-and-leaf diagram:
1) Divide each number (xi) into two parts:
– stem, all but the last significant digit
– leaf, the last significant digit.
2) List (smallest to largest) the stem values in a
vertical column.
3) To the right of each stem, list its leaves

Example 4: Alloy Strength
To illustrate the construction of a stem-and-leaf diagram,
consider the alloy compressive strength data in Table 2.

105 221 183 186 121 181 180 143
97 154 153 174 120 168 167 141
245 228 174 199 181 158 176 110
163 131 154 115 160 208 158 133
207 180 190 193 194 133 156 123
134 178 76 167 184 135 229 146
218 157 101 171 165 172 158 169
199 151 142 163 145 171 148 158
160 175 149 87 160 237 150 135
196 201 200 176 150 170 118 149
Table 2 Compressive Strength (psi) of
Aluminum-Lithium Specimens (80 values)
psi = pounds per square inch; 1 psi ≈ 6.894757 kPa

[Figure: Stem-and-leaf diagram for Table 2 data.] The center is
about 155 and most of the data are between 110 and 200.
Leaves are unordered.
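The three construction steps can be sketched in Python for the Table 2 data. This is a minimal illustration: the function name `stem_and_leaf` is hypothetical, and the leaves are kept in data order (unordered), as on the slide.

```python
from collections import defaultdict

# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

def stem_and_leaf(values):
    """Step 1: split each integer into a stem (all but the last digit) and a leaf (last digit)."""
    stems = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return stems

diagram = stem_and_leaf(strength)

# Steps 2 and 3: list stems from smallest to largest, with leaves to the right.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = "".join(str(d) for d in diagram.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

Sorting each list of leaves before printing would give an ordered stem-and-leaf diagram instead.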

Quartiles
• The 3 quartiles partition the data into 4 equally sized counts or
segments.
– q1, 1st or lower quartile: 25% of the data is ≤ q1.
– q2, 2nd quartile (median): 50% of the data is ≤ q2.
– q3, 3rd or upper quartile : 75% of the data is ≤ q3.
• For the Table 2 data (n = 80):

f       Index f(n + 1)   x(i)   x(i+1)   Quartile
0.25    20.25            143    145      143.50
0.50    40.50            160    163      161.50
0.75    60.75            181    181      181.00

Here x(i) and x(i+1) are the i-th and (i+1)-th values of the sorted data,
where i is the integer part of the index.
• How do we find these quartiles?
• First we need to sort the data.

Quartile q2
• Second quartile (sample median), q2: The “middle value” of the
observations, such that 50% of the remaining observations are
below the median and 50% of the remaining observations are above
the median
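With x(j) denoting the j-th smallest observation:
\[
q_2 =
\begin{cases}
x_{((n+1)/2)} & \text{if } n \text{ is odd} \\[4pt]
\dfrac{x_{(n/2)}}{2} + \dfrac{x_{(n/2+1)}}{2} & \text{if } n \text{ is even}
\end{cases}
\]
If n = 80:
\[
q_2 = \frac{x_{(40)}}{2} + \frac{x_{(41)}}{2} = \frac{160}{2} + \frac{163}{2} = 161.5
\]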

Quartile q1
• First quartile, q1: Value such that approximately 25% of the
observations are ≤ q1
\[
q_1 =
\begin{cases}
x_{((n+1)/4)} & \text{if } (n+1)/4 \text{ is an integer} \\[4pt]
(1 - f)\, x_{(w)} + f\, x_{(w+1)} & \text{if } (n+1)/4 \text{ is not an integer}
\end{cases}
\]
where w and f are the integer and fractional parts of (n + 1)/4.
If n = 80, (n + 1)/4 = 20.25, so w = 20 and f = 0.25:
\[
q_1 = (1 - 0.25)\, x_{(20)} + 0.25\, x_{(21)} = 0.75(143) + 0.25(145) = 143.5
\]

Quartile q3 and IQR
• Third quartile, q3: Value, such that approximately 75% of the
observations are ≤ q3
• Interquartile range, IQR: The difference between the third and first
quartiles (another measure of the spread of the data, less sensitive
to extreme outliers than sample variance)
\[
q_3 =
\begin{cases}
x_{(3(n+1)/4)} & \text{if } 3(n+1)/4 \text{ is an integer} \\[4pt]
(1 - f)\, x_{(w)} + f\, x_{(w+1)} & \text{if } 3(n+1)/4 \text{ is not an integer}
\end{cases}
\]
where w and f are the integer and fractional parts of 3(n + 1)/4.
\[
\mathrm{IQR} = q_3 - q_1
\]

Percentiles
• Percentiles:
– The idea of quartiles can be extended to consider percentiles.
– Percentiles partition the data into 100 segments.
– The 100k-th percentile (0 < k < 1) is a value such that
approximately 100k% of the observations are ≤ it.
• A WARNING!
– If you use software, it may not use the same method as we use
here.
– For all coursework, you must use the method given in the lecture.
• From the Quartiles example:
IQR = q3 – q1 = 181.00 – 143.50 = 37.5 psi
• Impact of outlier data:
– IQR is largely unaffected.
– Range is directly affected.
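Because software packages may use other conventions, it can help to code the lecture's rule directly. The sketch below is one possible reading of that rule for a general fraction p (index (n + 1)p, then linear interpolation between x(w) and x(w+1)); the function name `lecture_quantile` and the error handling for indices outside the sample are assumptions made here, not part of the lecture.

```python
# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

def lecture_quantile(values, p):
    """Quantile at fraction p using the index (n + 1) * p with interpolation."""
    xs = sorted(values)            # first, sort the data
    n = len(xs)
    pos = (n + 1) * p              # e.g. 20.25 for n = 80, p = 0.25
    w = int(pos)                   # integer part
    f = pos - w                    # fractional part
    # Assumption: the rule is treated as undefined outside x(1)..x(n).
    if w < 1 or w > n or (w == n and f > 0):
        raise ValueError("index (n + 1) * p falls outside the sample")
    if f == 0:
        return float(xs[w - 1])    # x(w), using 1-based order statistics
    return (1 - f) * xs[w - 1] + f * xs[w]

q1 = lecture_quantile(strength, 0.25)
q2 = lecture_quantile(strength, 0.50)
q3 = lecture_quantile(strength, 0.75)
iqr = q3 - q1
sample_range = max(strength) - min(strength)

print(f"q1 = {q1}, q2 = {q2}, q3 = {q3}")
print(f"IQR = {iqr}, range = {sample_range}")
```

For p = 0.5 this rule reproduces the median definition: with n odd the index is an integer, and with n even it averages the two middle values.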

Frequency Distributions
• A frequency distribution is a compact summary of data,
expressed as a table, graph, or function.
• The data is gathered into bins or cells, defined by class
intervals.
• The number of classes, multiplied by the class interval,
should exceed the range of the data.
• A guide for the number of classes is the square root of the sample size.
• The boundaries of the class intervals should be
convenient values, as should the class width.

Frequency Distribution Table
Frequency distribution for the data in Table 2
From the given n = 80 observations of xi:
max(xi) = 245
min(xi) = 76
Considerations:
Range = 245 – 76 = 169
Number of classes ≈ sqrt(80) = 8.9
Trial class width = 169/8.9 = 18.99
Decisions:
Number of classes = 9
Class width = 20
Range of classes = 20 * 9 = 180
Starting point = 70

Class            Frequency   Relative Frequency   Cumulative Relative Frequency
70 ≤ x < 90      2           0.0250               0.0250
90 ≤ x < 110     3           0.0375               0.0625
110 ≤ x < 130    6           0.0750               0.1375
130 ≤ x < 150    14          0.1750               0.3125
150 ≤ x < 170    22          0.2750               0.5875
170 ≤ x < 190    17          0.2125               0.8000
190 ≤ x < 210    10          0.1250               0.9250
210 ≤ x < 230    4           0.0500               0.9750
230 ≤ x < 250    2           0.0250               1.0000
Total            80          1.0000
Table 4 Frequency Distribution of Table 2 Data
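The binning behind Table 4 can be sketched in Python as follows. The starting point, class width, and number of classes are the decisions listed on the slide; the variable names are illustrative.

```python
import math

# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

n = len(strength)
data_range = max(strength) - min(strength)   # 245 - 76
guide = math.sqrt(n)                         # square-root guide for the number of classes

# Decisions from the slide: convenient boundaries covering the whole range.
start, width, num_classes = 70, 20, 9

# Class k covers start + k*width <= x < start + (k + 1)*width.
freq = [0] * num_classes
for v in strength:
    freq[(v - start) // width] += 1

print(f"range = {data_range}, sqrt(n) = {guide:.1f}")
cumulative = 0
for k, count in enumerate(freq):
    lo = start + k * width
    cumulative += count
    print(f"{lo:>3} <= x < {lo + width:<3}  {count:>3}  {count / n:.4f}  {cumulative / n:.4f}")
```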

Histograms
• A histogram is a visual display of a frequency
distribution.
• Steps to construct a histogram with equal bin widths:
1) Label the bin boundaries on the horizontal scale.
2) Mark & label the vertical scale with the frequencies
or relative frequencies.
3) Above each bin, draw a rectangle whose height is
equal to the frequency corresponding to that bin.

Histogram of the Table 2 Data
Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Note
these features – (1) horizontal scale bin boundaries & labels with units, (2) vertical
scale measurements and labels, (3) histogram title at top or in legend.

Poor Choices in Drawing Histograms
Histogram of compressive strength of 80 aluminum-lithium
alloy specimens. Errors: too many bins (17) create jagged
shape, horizontal scale not at class boundaries, horizontal axis
label does not include units.

Cumulative Frequency Plot
Cumulative histogram of compressive strength of 80 aluminum-lithium
alloy specimens. Comment: Easy to see cumulative probabilities,
hard to see distribution shape.

Shape of a Frequency Distribution
Histograms of symmetric and skewed distributions.
(b) A symmetric (unimodal) distribution has identical mean, median, and mode.
(a & c) Skewed distributions are positive or negative, depending on the
direction of the long tail. Their measures occur in alphabetical order
(mean, median, mode) as the distribution is approached from the long tail.
Mode: the value of x that occurs most often, i.e., has the greatest probability
of occurring.
Median: the value x for which P(X < x) ≤ 0.5 and P(X > x) ≤ 0.5; it is the
"middle" of a sorted list of numbers.

Histograms for Categorical Data
• Categorical data is of two types:
– Ordinal: categories have a natural order, e.g., year in
college, military rank.
– Nominal: Categories are simply different, e.g.,
gender, colors.
• Histogram bars are for each category, are of equal width,
and have a height equal to the category’s frequency or
relative frequency.
• A Pareto chart is a histogram in which the categories are
sequenced in decreasing order of frequency. This approach
emphasizes the most and least important categories.

Example 6: Categorical Data Histogram
Airplane production in 1985. (Source: Boeing Company)
Comments: the chart illustrates nominal data in spite of the numerical
names; categories are shown at each bin’s midpoint; it is a Pareto chart
since the categories are in decreasing order.

Box Plot or Box-and-Whisker Chart
• A box plot is a graphical display showing center,
spread, shape, and outliers (SOCS).
• It displays the 5-number summary: min, q1,
median, q3, and max.
Description of a box plot.

Visual Summary: Box Plot
Constructing a box plot (vertical axis measures data values):
1) DRAW THE BOX:
• Draw a box that extends vertically from q1 to q3
• Draw a horizontal line through the box at q2
2) DRAW THE WHISKERS:
• Lower whisker: Draw a line extending from the bottom box edge to
the smallest observation within 1.5 IQR of q1
• Upper whisker: Draw a line extending from the top box edge to the
largest observation within 1.5 IQR of q3
3) OUTLIERS (label each with an asterisk):
• Observations beyond the whiskers, but within 3 IQR of a box edge
4) EXTREME OUTLIERS (label each with an asterisk or other symbol):
• Observations more than 3 IQR away from a box edge

Box Plot of Table 2 Data
Box plot of compressive strength of 80 aluminum-lithium alloy
specimens. Comments: a box plot may be shown vertically or
horizontally. The data reveal three outliers and no extreme outliers.
The lower outlier limit is 143.5 – 1.5*(181.0 – 143.5) = 87.25;
observations below it are outliers.
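The box plot quantities for Table 2 can be computed with a short Python sketch. It takes q1 = 143.5 and q3 = 181.0 from the Quartiles slides rather than recomputing them, and the variable names (such as `lower_fence`) are illustrative, not the lecture's terminology.

```python
# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

# Quartiles as computed on the Quartiles slides (lecture method).
q1, q3 = 143.5, 181.0
iqr = q3 - q1

# Limits at 1.5 IQR (outliers) and 3 IQR (extreme outliers) from the box edges.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_extreme, upper_extreme = q1 - 3 * iqr, q3 + 3 * iqr

# Whiskers end at the most extreme observations within 1.5 IQR of the box.
inside = [v for v in strength if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Outliers: beyond the whisker limits but within 3 IQR of a box edge.
outliers = sorted(
    v for v in strength
    if lower_extreme <= v < lower_fence or upper_fence < v <= upper_extreme
)
# Extreme outliers: more than 3 IQR away from a box edge.
extreme = sorted(v for v in strength if v < lower_extreme or v > upper_extreme)

print(f"limits: {lower_fence} and {upper_fence}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}, extreme outliers: {extreme}")
```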

Time Sequence Plots
• A time series plot shows the data value, or statistic, on
the vertical axis with time on the horizontal axis.
• A time series plot reveals trends, cycles, or other time-oriented
behavior that could not otherwise be seen in the data.
Company sales by year (a) and by quarter (b).

Digidot Plot
It is sometimes very helpful to combine a time series plot with some of the
other graphical displays that we have considered previously.
The stem-and-leaf plot combined with a time series plot forms a digidot plot.
A digidot plot of the compressive strength data in Table 2.

Multivariate Data
Multivariate data – each observation consists of measurements of
several variables.
Table 5 Quality Data for Young Red Wine.

Scatterplot (Scatter Diagram)
Scatterplot – graphically displays the potential relationship between two
variables.
Scatterplot of wine Quality and Color from data in Table 5.

Matrix of Scatter Diagrams
Matrix of scatter diagrams – graphically displays the pairwise
relationships between the variables in the sample.
Matrix of scatter diagrams for the data in Table 5.
Notice a strong positive correlation between Color Density and Color of wine.

Pearson Correlation Coefficient
The sample (or Pearson) correlation coefficient is a dimensionless, quantitative
measure of the strength of the linear relationship between two random variables;
it takes a value between +1 and −1 inclusive.
(after Karl Pearson (1857-1936), British statistician)
\[
r_{xy} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}
              {\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}}
\]
Correlations with |r| above 0.8 are strong, with |r| below 0.5 are weak,
and r = 0 indicates no linear correlation.
All pairwise correlations between the 5 random variables in Table 5:
            Quality   pH       Total SO2   C.Density   Color
Quality     1
pH          0.349     1
Total SO2   -0.445    -0.679   1
C.Density   0.702     0.482    -0.492      1
Color       0.712     0.430    -0.480      0.996       1
In Excel: Correlation between Quality and Color of wine in Table 5.
Pearson correlation coefficient = 43.103/SQRT(75.852*48.2995) = 0.712
i Xi (Quality) Yi (Color) Xi-Xbar Yi-Ybar (Xi-Xbar)^2 (Yi-Ybar)^2 Yi*(Xi-Xbar)
1 19.2 5.65 3.98 1.195 15.8404 1.428025 22.487
2 18.3 6.95 3.08 2.495 9.4864 6.225025 21.406
3 17.1 5.75 1.88 1.295 3.5344 1.677025 10.81
4 15.2 4 -0.02 -0.455 0.0004 0.207025 -0.08
5 14 2.25 -1.22 -2.205 1.4884 4.862025 -2.745
6 13.8 3.2 -1.42 -1.255 2.0164 1.575025 -4.544
7 12.8 2.7 -2.42 -1.755 5.8564 3.080025 -6.534
8 17.3 6.1 2.08 1.645 4.3264 2.706025 12.688
9 16.3 5 1.08 0.545 1.1664 0.297025 5.4
10 16 6 0.78 1.545 0.6084 2.387025 4.68
11 15.7 5.5 0.48 1.045 0.2304 1.092025 2.64
12 15.3 3.35 0.08 -1.105 0.0064 1.221025 0.268
13 14.3 3.25 -0.92 -1.205 0.8464 1.452025 -2.99
14 14 5.1 -1.22 0.645 1.4884 0.416025 -6.222
15 13.8 4.4 -1.42 -0.055 2.0164 0.003025 -6.248
16 12.5 3.15 -2.72 -1.305 7.3984 1.703025 -8.568
17 11.5 3.9 -3.72 -0.555 13.8384 0.308025 -14.508
18 14.2 2.4 -1.02 -2.055 1.0404 4.223025 -2.448
19 17.3 7.7 2.08 3.245 4.3264 10.530025 16.016
20 15.8 2.75 0.58 -1.705 0.3364 2.907025 1.595
SUM 304.4 89.1 75.852 48.2995 43.103
Mean 15.22 4.455
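The same calculation can be sketched in Python using the Quality and Color columns from the table above. The function name `pearson_r` is illustrative; the numerator follows the lecture's form, the sum of yi(xi - x̄).

```python
import math

# Quality (x) and Color (y) columns for the 20 wines in the table above.
quality = [19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0,
           15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8]
color = [5.65, 6.95, 5.75, 4.00, 2.25, 3.20, 2.70, 6.10, 5.00, 6.00,
         5.50, 3.35, 3.25, 5.10, 4.40, 3.15, 3.90, 2.40, 7.70, 2.75]

def pearson_r(x, y):
    """Sample correlation: sum y_i (x_i - xbar) / sqrt(Sxx * Syy)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(quality, color)
print(f"r = {r:.3f}")
```

Using yi in place of (yi - ȳ) in the numerator gives the same sum, because the deviations (xi - x̄) add up to zero.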

Potential relationships between random variables.

Probability Plot
• Is a particular distribution a reasonable model
for data? We may want to verify assumptions.
• If time-to-failure data follow an exponential distribution,
then the failure rate is constant with respect to time.
• A probability plot is a graphical method for
determining whether sample data conform to a
hypothesized distribution based on a subjective
visual examination of the data.
• Histograms require a very large sample size to indicate the
distribution form reliably.

Constructing a Probability Plot
• To construct a probability plot:
– Sort the data observations in ascending order: x(1),
x(2),…, x(n).
– The observed value x(j) is plotted against the
observed cumulative frequency (j – 0.5)/n.
– The paired numbers are plotted on the probability
paper of the proposed distribution.
• If the paired numbers fall approximately along a straight line, then
the hypothesized distribution adequately
describes the data.

Example 7: Battery Life
j x(j) (j-0.5)/10 100(j-0.5)/10
1 176 0.05 5
2 183 0.15 15
3 185 0.25 25
4 190 0.35 35
5 191 0.45 45
6 192 0.55 55
7 201 0.65 65
8 205 0.75 75
9 214 0.85 85
10 220 0.95 95
Table 6 Calculations for Constructing a
Normal Probability Plot
Normal probability plot for battery life.
The effective service life (Xj in minutes) of batteries used in a laptop is given in
the table. We hypothesize that battery life is adequately modeled by a normal
distribution. To test this hypothesis, first arrange the observations in ascending order,
then calculate their cumulative frequencies and plot them.

Probability Plot on Standardized Normal Scores
j x(j) (j-0.5)/10 zj
1 176 0.05 -1.64
2 183 0.15 -1.04
3 185 0.25 -0.67
4 190 0.35 -0.39
5 191 0.45 -0.13
6 192 0.55 0.13
7 201 0.65 0.39
8 205 0.75 0.67
9 214 0.85 1.04
10 220 0.95 1.64
Table 6 Calculations for Constructing a Normal Probability Plot
Normal probability plot obtained from standardized normal scores. This is equivalent
to the figure on the previous slide.
A normal probability plot can be plotted on ordinary axes using z-values. The
normal probability scale is not used.
(j – 0.5)/n = P(Z ≤ zj) = Φ(zj)
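The zj column of Table 6 can be reproduced by inverting Φ at each cumulative frequency. The sketch below uses `statistics.NormalDist().inv_cdf` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Battery life observations (minutes) from Example 7.
battery_life = [176, 183, 185, 190, 191, 192, 201, 205, 214, 220]

x_sorted = sorted(battery_life)                      # x(1) <= x(2) <= ... <= x(n)
n = len(x_sorted)
cum_freq = [(j - 0.5) / n for j in range(1, n + 1)]  # observed cumulative frequencies

# Standardized normal scores: z_j solves Phi(z_j) = (j - 0.5) / n.
z = [NormalDist().inv_cdf(p) for p in cum_freq]

for j, (xj, p, zj) in enumerate(zip(x_sorted, cum_freq, z), start=1):
    print(f"{j:>2}  {xj}  {p:.2f}  {zj:+.2f}")
```

Plotting the pairs (x(j), zj) on ordinary axes and checking whether they fall roughly along a straight line is the visual test described above.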

Probability Plot Variations
Normal probability plots indicating a non-normal distribution.
(a) Light tailed distribution (e.g. uniform distribution)
(b) Heavy tailed distribution (has larger variance, has heavier tails than the
normal distribution)
(c) Right skewed distribution
