Copyright © 2014 John Wiley & Sons, Inc. All rights reserved.
Advanced Statistics and Probability
(MSC 615)
Yerkin G. Abdildin
Nazarbayev University
Descriptive Statistics
• Numerical Summaries of
Data
• Stem-and-Leaf Diagrams
• Frequency Distributions
and Histograms
• Box Plots
• Time Sequence Plots
• Probability Plots
Terminology
• Population: The set of all possible observations that
could be made from a random experiment
– May be very large (e.g., the set of all KZ residents)
– Hypothetical or conceptual populations do not physically exist
(e.g., set of all products that will be produced in a factory)
• Sample: An observed subset of data from a population
– Example: A set of 300 randomly-chosen KZ residents
– Example: A set of 14 randomly-chosen parts produced in a
factory during one day’s operation
Numerical Summaries of Data
• Data: the numeric observations of a phenomenon of
interest.
– The totality of all observations is a population.
– A portion used for analysis is a random sample.
• We gain an understanding of this collection by describing it
numerically and graphically, usually with the sample data.
• We describe the collection in terms of shape, outliers, center, and
spread (SOCS).
– The center can be measured by the mean.
– The spread can be measured by the variance.
What is the field of statistics?
We encounter it regularly:
– Opinion polls
– Sports performance
– Effectiveness of medical treatments
But what is it, really?
– Statistics is the “science of data”
– Qualitative and quantitative tools for analyzing data
Why do we need it?
– Many outcomes are produced by “random” processes
– Want to understand these underlying processes more clearly
Probability and Statistics
• Probability: Knowing the structure of the process, what can
we say about the outcomes?
• Statistics: Having observed some outcomes, what can we
say about the process?
(Diagram: probability reasons from the process or experiment to the
observed outcomes; statistics reasons from the observed outcomes
back to the process.)
EXAMPLE: Suppose that a factory manager is very
dedicated to product quality, and is concerned that too
many products produced at the factory may be defective.
What would your recommended plan of action be?
IDEA 1: Take the factory offline, disassemble and thoroughly
inspect every machine involved in the production process, and put
every employee in the factory through a rigorous two-week
training program.
CONCERN: Very expensive!
IDEA 2: Take an entire day’s worth of products and inspect them all
thoroughly to estimate the rate at which defective products are
being produced.
CONCERN: What if the factory produces thousands of units per day?
And what if testing requires destroying the products (e.g.,
yield strength of steel beams)?
Probability and Statistical Inference
• A statistic is any quantity whose value can be calculated
from sample data (e.g., mean, variance, proportion).
• Reasoning from a sample to a population is
referred to as statistical inference, or inferential statistics.
(Diagram: probability reasons from the population to the sample;
inferential statistics, or statistical inference, reasons from the
sample to the population.)
Sample Mean
For n observations in a random sample, x1, x2, ..., xn,
the sample mean is
\[
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} .
\]
For N observations in a population, x1, x2, ..., xN,
the population mean is
\[
\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \sum_{i=1}^{N} x_i f(x_i) .
\]
Example 1: Sample Mean
Consider 8 observations (xi) of pull-off force from
engine connectors as shown in the table.
i      xi
1      12.6
2      12.9
3      13.4
4      12.3
5      13.6
6      13.5
7      12.6
8      13.1
Mean   13.00   (Excel: =AVERAGE($B2:$B9))
The sample mean is the balance point.
\[
\bar{x} = \text{average} = \frac{\sum_{i=1}^{8} x_i}{8} = \frac{12.6 + 12.9 + \dots + 13.1}{8} = \frac{104}{8} = 13.00 \text{ pounds}
\]
Variance Defined
For n observations in a random sample, x1, x2, ..., xn,
the sample variance is
\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} .
\]
For N observations in a population, x1, x2, ..., xN,
the population variance is
\[
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \sum_{i=1}^{N} (x_i - \mu)^2 f(x_i) .
\]
Some other concepts
• The standard deviation is the square root of the
variance.
– s is the sample standard deviation symbol.
– σ is the population standard deviation symbol.
• A sample statistic can serve as a reasonable estimate of
the corresponding population parameter, e.g.,
– x̄ is an estimate of μ,
– s² is an estimate of σ².
Example 2: Sample Variance
Table 1 displays the quantities needed to calculate the
sample variance, s², and sample standard deviation, s.

Table 1
i      xi       xi - x̄    (xi - x̄)²
1      12.6     -0.4      0.16
2      12.9     -0.1      0.01
3      13.4      0.4      0.16
4      12.3     -0.7      0.49
5      13.6      0.6      0.36
6      13.5      0.5      0.25
7      12.6     -0.4      0.16
8      13.1      0.1      0.01
sums   104.00    0.0      1.60

x̄ = 104.00 / 8 = 13.00 pounds
s² = 1.60 / 7 = 0.2286 pounds²
s = √0.2286 = 0.48 pounds

Dimensions:
– xi is in pounds.
– The mean is in pounds.
– The variance is in pounds².
– The standard deviation is in pounds.
Desired accuracy is generally
accepted to be one more place
than the data.
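Computation of s²
We can derive the computational formula.
It involves just 2 sums.
\[
\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
     = \frac{\sum_{i=1}^{n} \left(x_i^2 + \bar{x}^2 - 2\bar{x}x_i\right)}{n-1} \\
    &= \frac{\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x}\sum_{i=1}^{n} x_i}{n-1}
     = \frac{\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x}\cdot n\bar{x}}{n-1} \\
    &= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}
     = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}
\end{aligned}
\]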
Example 3: Variance by Shortcut
\[
s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}
    = \frac{1{,}353.60 - (104.0)^2 / 8}{7}
    = \frac{1.60}{7} = 0.2286 \text{ pounds}^2
\]
\[
s = \sqrt{0.2286} = 0.48 \text{ pounds}
\]
i      xi       xi²
1      12.6     158.76
2      12.9     166.41
3      13.4     179.56
4      12.3     151.29
5      13.6     184.96
6      13.5     182.25
7      12.6     158.76
8      13.1     171.61
sums   104.0    1,353.60
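The definition of the sample variance and the shortcut formula can be checked against each other numerically. The following Python sketch uses the eight pull-off force observations from Example 1; the function names are illustrative and not part of the lecture.

```python
import math

# Pull-off force observations (pounds) from Example 1.
x = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]


def sample_mean(data):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(data) / len(data)


def sample_variance(data):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    xbar = sample_mean(data)
    return sum((xi - xbar) ** 2 for xi in data) / (len(data) - 1)


def sample_variance_shortcut(data):
    """Shortcut: uses only the sum of x and the sum of x squared."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(xi ** 2 for xi in data)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


xbar = sample_mean(x)
s2 = sample_variance(x)
s2_short = sample_variance_shortcut(x)
s = math.sqrt(s2)

print(f"mean = {xbar:.2f} pounds")
print(f"s^2 (definition) = {s2:.4f} pounds^2")
print(f"s^2 (shortcut)   = {s2_short:.4f} pounds^2")
print(f"s = {s:.2f} pounds")
```

Both routes should reproduce the values in Table 1. In floating-point arithmetic the shortcut subtracts two nearly equal numbers, so the definition is usually the safer choice when the mean is large relative to the spread.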
Why does sample variance divide by (n – 1)?
The population variance (σ²) is calculated with N, the
population size. Why isn’t the sample variance (s²)
calculated with n, the sample size?
1) s² measures squared deviation from x̄, not μ
• The true variance (σ²) is based on data deviations from
the true mean, μ.
• The sample variance (s²) is based on the data deviations
from x̄ (x-bar), not μ.
• The sum of squared deviations is smallest when taken about x̄,
so it underestimates the sum of squared deviations about the true
mean. Dividing by n – 1 rather than n adjusts for this
underestimation.
2) n – 1 is the number of “degrees of freedom” we have
in our computation
• s² is calculated with the quantity n – 1, which is called
the “degrees of freedom”.
• Origin of the term:
– There are n deviations from x̄ in the sample, xi – x̄
– The sum of the deviations is zero (see Table 1)
– If we know the values of n – 1 of these terms, the
value of the last is known
– Hence, only n – 1 of these terms can be chosen freely
(i.e., independently of the others)
3) Dividing by n – 1 makes s² an unbiased estimator of σ²
• x̄ is an estimator of μ; close but not the same.
• Remember that s² is a random variable; what is its
expectation?
• Being an “unbiased estimator” means that E(s²) = σ²
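The unbiasedness claim can be checked exactly on a toy case. The sketch below assumes a hypothetical population {1, 2, 3} (not from the lecture) and enumerates every equally likely sample of size 2 drawn with replacement. It then averages the variance estimate over all samples, once with divisor n – 1 and once with divisor n.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy population, chosen only to keep the enumeration small.
population = [1, 2, 3]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(v) - mu) ** 2 for v in population) / N


def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((Fraction(v) - xbar) ** 2 for v in sample)


n = 2
# Every equally likely ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

mean_s2_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
mean_s2_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print("sigma^2                 =", sigma2)
print("E[s^2] with divisor n-1 =", mean_s2_n_minus_1)
print("E[s^2] with divisor n   =", mean_s2_n)
```

Exact fractions are used so the comparison does not depend on rounding. With divisor n – 1 the average equals σ², while with divisor n it falls short.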
Sample Range
If the n observations in a sample are denoted
by x1, x2, …, xn, the sample range is:
r = max(xi) – min(xi)
It is the largest observation in the sample minus
the smallest observation.
From Example 3:
r = 13.6 – 12.3 = 1.30 pounds
Note: population range ≥ sample range
Stem-and-Leaf Diagrams
• Dot diagrams (dotplots) are useful for small
data sets. Stem & leaf diagrams are better
for large sets.
• Steps to construct a stem-and-leaf diagram:
1) Divide each number (xi) into two parts:
– stem, all but the last significant digit
– leaf, the last significant digit.
2) List (smallest to largest) the stem values in a
vertical column.
3) To the right of each stem, list its leaves
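The three construction steps above can be sketched in Python. This is a minimal illustration that assumes non-negative integer observations; the function name `stem_and_leaf` is hypothetical, and the sample values are the first row of the alloy strength data in Table 2.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its list of leaves (last digit).

    Stems are listed from smallest to largest, including stems with no leaves.
    Leaves are sorted here for readability, which the steps do not require.
    """
    groups = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # step 1: split into stem and leaf
        groups[stem].append(leaf)
    stems = range(min(groups), max(groups) + 1)  # step 2: ordered stem column
    return {stem: sorted(groups.get(stem, [])) for stem in stems}  # step 3


# First row of Table 2 (compressive strength, psi).
row = [105, 221, 183, 186, 121, 181, 180, 143]
diagram = stem_and_leaf(row)

for stem, leaves in diagram.items():
    print(f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}")
```

Listing empty stems keeps the vertical scale uniform, so gaps in the data remain visible.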
Example 4: Alloy Strength
To illustrate the construction of
a stem-and-leaf diagram,
consider the alloy compressive
strength data in Table 2.
Table 2 Compressive Strength (psi) of
Aluminum-Lithium Specimens (80 values)
105 221 183 186 121 181 180 143
97 154 153 174 120 168 167 141
245 228 174 199 181 158 176 110
163 131 154 115 160 208 158 133
207 180 190 193 194 133 156 123
134 178 76 167 184 135 229 146
218 157 101 171 165 172 158 169
199 151 142 163 145 171 148 158
160 175 149 87 160 237 150 135
196 201 200 176 150 170 118 149
psi = pounds per square inch; 1 psi ≈ 6.894757 kPa
Stem-and-leaf diagram for Table 2 data. Center is
about 155 and most data is between 110 and 200.
Leaves are unordered.
Quartiles
• The 3 quartiles partition the data into 4 equally sized counts or
segments.
– q1, 1st or lower quartile: 25% of the data is ≤ q1.
– q2, 2nd quartile (median): 50% of the data is ≤ q2.
– q3, 3rd or upper quartile : 75% of the data is ≤ q3.
• For the Table 2 data:
f      Index   Value of i-th item   Value of (i+1)-th item   Quartile
0.25   20.25   143                  145                      143.50
0.50   40.50   160                  163                      161.50
0.75   60.75   181                  181                      181.00
• How do we find these quartiles?
• First, we need to sort the data.
Quartile q2
• Second quartile (sample median), q2: The “middle value” of the
observations, such that 50% of the remaining observations are
below the median and 50% of the remaining observations are above
the median
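\[
q_2 =
\begin{cases}
x_{((n+1)/2)} & \text{if } n \text{ is odd} \\[4pt]
\dfrac{x_{(n/2)}}{2} + \dfrac{x_{(n/2+1)}}{2} & \text{if } n \text{ is even}
\end{cases}
\]
Here x(j) denotes the j-th observation in the sorted data.
\[
\text{If } n = 80:\quad q_2 = \frac{x_{(40)}}{2} + \frac{x_{(41)}}{2} = \frac{160}{2} + \frac{163}{2} = 161.5
\]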
Quartile q1
• First quartile, q1: Value such that approximately 25% of the observations are ≤ q1
\[
q_1 =
\begin{cases}
x_{((n+1)/4)} & \text{if } (n+1)/4 \text{ is an integer} \\[4pt]
(1-f)\,x_{(w)} + f\,x_{(w+1)} & \text{if } (n+1)/4 \text{ is not an integer}
\end{cases}
\]
where w and f are the integer and fractional parts of (n + 1)/4.
If n = 80, then (n + 1)/4 = 20.25, so w = 20 and f = 0.25, and
\[
q_1 = (1 - 0.25)\,x_{(20)} + 0.25\,x_{(21)} = 0.75(143) + 0.25(145) = 143.5
\]
Quartile q3 and IQR
• Third quartile, q3: Value, such that approximately 75% of the
observations are ≤ q3
• Interquartile range, IQR: The difference between the third and first
quartiles (another measure of the spread of the data, less sensitive
to extreme outliers than sample variance)
\[
q_3 =
\begin{cases}
x_{(3(n+1)/4)} & \text{if } 3(n+1)/4 \text{ is an integer} \\[4pt]
(1-f)\,x_{(w)} + f\,x_{(w+1)} & \text{if } 3(n+1)/4 \text{ is not an integer}
\end{cases}
\]
where w and f are the integer and fractional parts of 3(n + 1)/4.
\[
\text{IQR} = q_3 - q_1
\]
Percentiles
• Percentiles:
– The idea of quartiles can be extended to consider percentiles.
– Percentiles partition the data into 100 segments.
– The 100k-th percentile (0 < k < 1) is a value such that
approximately 100k% of the observations are ≤ it.
• A WARNING!
– If you use software, it may not use the same method as we use
here.
– For all coursework, you must use the method given in the lecture.
• From the Quartiles example:
IQR = q3 – q1 = 181.00 – 143.50 = 37.50 psi
• Impact of outlier data:
– The IQR is largely unaffected by a few extreme values.
– The range is directly affected.
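The quartile rule from the preceding slides (locate position f(n + 1) in the sorted data, then interpolate between neighbors) can be sketched as follows. The function name `lecture_quantile` is hypothetical, and the data are the 80 values of Table 2.

```python
import math

# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]


def lecture_quantile(data, p):
    """Value at position p(n + 1) in the sorted data, with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    pos = p * (n + 1)
    if not 1 <= pos <= n:
        raise ValueError("p(n + 1) must lie between 1 and n")
    w = math.floor(pos)  # integer part
    f = pos - w          # fractional part
    if f == 0:
        return float(xs[w - 1])  # x_(w); Python lists are 0-based
    return (1 - f) * xs[w - 1] + f * xs[w]


q1 = lecture_quantile(strength, 0.25)
q2 = lecture_quantile(strength, 0.50)
q3 = lecture_quantile(strength, 0.75)
iqr = q3 - q1

print(f"q1 = {q1}, q2 = {q2}, q3 = {q3}, IQR = {iqr}")
```

For p = 0.5 the same rule reproduces the median definition, since it averages x(n/2) and x(n/2+1) when n is even. Library routines may interpolate differently, which is the point of the warning above.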
Frequency Distributions
• A frequency distribution is a compact summary of data,
expressed as a table, graph, or function.
• The data is gathered into bins or cells, defined by class
intervals.
• The number of classes, multiplied by the class width,
should exceed the range of the data.
• As a guide, choose the number of classes approximately equal to
the square root of the sample size.
• The boundaries of the class intervals should be
convenient values, as should the class width.
Frequency Distribution Table
Frequency Distribution for
the data in Table 2
From the given n = 80 observations of xi:
max(xi) = 245
min(xi) = 76
Considerations:
Range = 245 – 76 = 169
Number of classes ≈ sqrt(80) = 8.9
Trial class width = 169/8.9 = 18.99
Decisions:
Number of classes = 9
Class width = 20
Range of classes = 20 * 9 = 180
Starting point = 70

Table 4 Frequency Distribution of Table 2 Data
Class           Frequency   Relative Frequency   Cumulative Relative Frequency
70 ≤ x < 90     2           0.0250               0.0250
90 ≤ x < 110    3           0.0375               0.0625
110 ≤ x < 130   6           0.0750               0.1375
130 ≤ x < 150   14          0.1750               0.3125
150 ≤ x < 170   22          0.2750               0.5875
170 ≤ x < 190   17          0.2125               0.8000
190 ≤ x < 210   10          0.1250               0.9250
210 ≤ x < 230   4           0.0500               0.9750
230 ≤ x < 250   2           0.0250               1.0000
Total           80          1.0000
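The frequency distribution for the Table 2 data can be rebuilt with a short Python sketch. It takes the class decisions stated on this slide as given (starting point 70, class width 20, 9 classes); the function name `frequency_table` is illustrative.

```python
# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]


def frequency_table(data, start, width, n_classes):
    """Rows of (lower, upper, frequency, relative, cumulative relative).

    Each class is the half-open interval lower <= x < upper.
    """
    counts = [0] * n_classes
    for v in data:
        k = int((v - start) // width)
        if not 0 <= k < n_classes:
            raise ValueError(f"{v} falls outside the class intervals")
        counts[k] += 1
    rows, cumulative = [], 0
    for k, count in enumerate(counts):
        cumulative += count
        lower = start + k * width
        rows.append((lower, lower + width, count,
                     count / len(data), cumulative / len(data)))
    return rows


table = frequency_table(strength, start=70, width=20, n_classes=9)
for lower, upper, count, rel, cum in table:
    print(f"{lower:>3} <= x < {upper:<3}  {count:>3}  {rel:.4f}  {cum:.4f}")
```

The half-open intervals mirror the class definitions in Table 4, so a value on a boundary (such as 110) is counted in the class that starts there.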
Histograms
• A histogram is a visual display of a frequency
distribution.
• Steps to construct a histogram with equal bin widths:
1) Label the bin boundaries on the horizontal scale.
2) Mark & label the vertical scale with the frequencies
or relative frequencies.
3) Above each bin, draw a rectangle whose height is
equal to the frequency corresponding to that bin.
Histogram of the Table 2 Data
Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Note
these features – (1) horizontal scale bin boundaries & labels with units, (2) vertical
scale measurements and labels, (3) histogram title at top or in legend.
Poor Choices in Drawing Histograms
Histogram of compressive strength of 80 aluminum-lithium
alloy specimens. Errors: too many bins (17) create jagged
shape, horizontal scale not at class boundaries, horizontal axis
label does not include units.
Cumulative Frequency Plot
Cumulative histogram of compressive strength of 80 aluminum-lithium
alloy specimens. Comment: Easy to see cumulative probabilities,
hard to see distribution shape.
Shape of a Frequency Distribution
Histograms of symmetric and skewed distributions.
(b) A symmetric (unimodal) distribution has equal mean, median, and mode.
(a & c) Skewed distributions are positive or negative, depending on the
direction of the long tail. Approaching the distribution from the long tail,
the measures occur in alphabetical order: mean, median, mode.
Mode – value of x which occurs most often, i.e. has the greatest probability
of occurring.
Median is that value x for which P(X<x) ≤ 0.5 and P(X>x) ≤ 0.5.
Median is the "middle" of a sorted list of numbers.
Histograms for Categorical Data
• Categorical data is of two types:
– Ordinal: categories have a natural order, e.g., year in
college, military rank.
– Nominal: Categories are simply different, e.g.,
gender, colors.
• Each category gets one bar; the bars are of equal width,
and each bar's height equals the category’s frequency or
relative frequency.
• A Pareto chart is a histogram in which the categories are
sequenced in decreasing order. This approach
emphasizes the most and least important categories.
Example 6: Categorical Data Histogram
Airplane production in 1985. (Source: Boeing Company)
Comment: The chart illustrates nominal data, even though the category
names are numerical. Categories are shown at each bin’s midpoint. It is
a Pareto chart, since the categories are in decreasing order.
Box Plot or Box-and-Whisker Chart
• A box plot is a graphical display showing center,
spread, shape, and outliers (SOCS).
• It displays the 5-number summary: min, q1,
median, q3, and max.
Description of a box plot.
Visual Summary: Box Plot
Constructing a box plot (vertical axis measures data values):
1) DRAW THE BOX:
• Draw a box that extends vertically from q1 to q3
• Draw a horizontal line through the box at q2
2) DRAW THE WHISKERS:
• Lower whisker: Draw a line extending from the bottom box edge to
the smallest observation within 1.5 IQR of q1
• Upper whisker: Draw a line extending from the top box edge to the
largest observation within 1.5 IQR of q3
3) OUTLIERS (label each with an asterisk):
• Observations beyond the whiskers, but within 3 IQR of a box edge
4) EXTREME OUTLIERS (label each with an asterisk or other symbol):
• Observations more than 3 IQR away from a box edge
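The whisker and outlier rules above can be sketched in Python. The quartiles q1 = 143.5 and q3 = 181.0 are taken from the quartile slides for the Table 2 data; the function name `box_plot_summary` and the dictionary keys are illustrative.

```python
# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]


def box_plot_summary(data, q1, q3):
    """Whisker ends and outlier lists under the 1.5 IQR and 3 IQR rules."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    # Whiskers end at the most extreme observations within 1.5 IQR of the box.
    inside = [v for v in data if inner_low <= v <= inner_high]
    return {
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        # Beyond the whiskers but within 3 IQR of a box edge.
        "outliers": sorted(
            v for v in data
            if outer_low <= v < inner_low or inner_high < v <= outer_high
        ),
        # More than 3 IQR away from a box edge.
        "extreme_outliers": sorted(
            v for v in data if v < outer_low or v > outer_high
        ),
    }


summary = box_plot_summary(strength, q1=143.5, q3=181.0)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Passing the quartiles in as arguments keeps the sketch independent of any particular quartile convention, which matters because software packages differ on that point.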
Box Plot of Table 2 Data
Box plot of compressive strength of 80 aluminum-lithium alloy
specimens. Comment: A box plot may be shown vertically or
horizontally. The data reveal three outliers and no extreme outliers.
The lower threshold for outliers is 143.5 – 1.5*(181.0 – 143.5) = 87.25;
observations below it are outliers.
Time Sequence Plots
• A time series plot shows the data value, or statistic, on
the vertical axis with time on the horizontal axis.
• A time series plot reveals trends, cycles, or other
time-oriented behavior that could not otherwise be seen in the data.
Company sales by year (a). By quarter (b).
Digidot Plot
It is sometimes very helpful to combine a time series plot with some of the
other graphical displays that we have considered previously.
The stem-and-leaf plot combined with a time series plot forms a digidot plot.
A digidot plot of the compressive strength data in Table 2.
Multivariate Data
Multivariate data – each observation consists of measurements of
several variables.
Table 5 Quality Data for Young Red Wine.
Scatterplot (Scatter Diagram)
Scatterplot – graphically displays the potential relationship between two
variables.
Scatterplot of wine Quality and Color from data in Table 5.
Matrix of Scatter Diagrams
Matrix of scatter diagrams – graphically displays the pairwise
relationships between the variables in the sample.
Matrix of scatter diagrams for the data in Table 5.
Notice a strong positive correlation between Color Density and
Color of wine.
Pearson Correlation Coefficient
The sample (or Pearson) correlation coefficient is a dimensionless, quantitative
measure of the strength of the linear relationship between two random variables.
It takes a value between −1 and +1 inclusive.
(Named after Karl Pearson (1857-1936), British statistician.)
\[
r_{xy} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \, \sum_{i=1}^{n} (x_i - \bar{x})^2}}
\]
Correlations with |r| above 0.8 are strong,
those with |r| below 0.5 are weak,
and r = 0 indicates no correlation.
All pairwise correlations between the 5 random variables in Table 5:
            Quality   pH       Total SO2   C.Density   Color
Quality     1
pH          0.349     1
Total SO2   -0.445    -0.679   1
C.Density   0.702     0.482    -0.492      1
Color       0.712     0.430    -0.480      0.996       1
In Excel: Correlation between Quality and Color of wine in Table 5.
Pearson correlation coefficient = 43.103/SQRT(75.852*48.2995) = 0.712
i Xi (Quality) Yi (Color) Xi-Xbar Yi-Ybar (Xi-Xbar)^2 (Yi-Ybar)^2 Yi*(Xi-Xbar)
1 19.2 5.65 3.98 1.195 15.8404 1.428025 22.487
2 18.3 6.95 3.08 2.495 9.4864 6.225025 21.406
3 17.1 5.75 1.88 1.295 3.5344 1.677025 10.81
4 15.2 4 -0.02 -0.455 0.0004 0.207025 -0.08
5 14 2.25 -1.22 -2.205 1.4884 4.862025 -2.745
6 13.8 3.2 -1.42 -1.255 2.0164 1.575025 -4.544
7 12.8 2.7 -2.42 -1.755 5.8564 3.080025 -6.534
8 17.3 6.1 2.08 1.645 4.3264 2.706025 12.688
9 16.3 5 1.08 0.545 1.1664 0.297025 5.4
10 16 6 0.78 1.545 0.6084 2.387025 4.68
11 15.7 5.5 0.48 1.045 0.2304 1.092025 2.64
12 15.3 3.35 0.08 -1.105 0.0064 1.221025 0.268
13 14.3 3.25 -0.92 -1.205 0.8464 1.452025 -2.99
14 14 5.1 -1.22 0.645 1.4884 0.416025 -6.222
15 13.8 4.4 -1.42 -0.055 2.0164 0.003025 -6.248
16 12.5 3.15 -2.72 -1.305 7.3984 1.703025 -8.568
17 11.5 3.9 -3.72 -0.555 13.8384 0.308025 -14.508
18 14.2 2.4 -1.02 -2.055 1.0404 4.223025 -2.448
19 17.3 7.7 2.08 3.245 4.3264 10.530025 16.016
20 15.8 2.75 0.58 -1.705 0.3364 2.907025 1.595
SUM 304.4 89.1 75.852 48.2995 43.103
Mean 15.22 4.455
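The spreadsheet calculation above can be reproduced in a few lines of Python. This sketch uses the 20 (Quality, Color) pairs from the table and the same three sums; the function name `pearson_r` is illustrative.

```python
import math

# Quality (x) and Color (y) for the 20 wines listed in the table above.
quality = [19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0,
           15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8]
color = [5.65, 6.95, 5.75, 4.00, 2.25, 3.20, 2.70, 6.10, 5.00, 6.00,
         5.50, 3.35, 3.25, 5.10, 4.40, 3.15, 3.90, 2.40, 7.70, 2.75]


def pearson_r(x, y):
    """Sample correlation coefficient built from the three sums in the table."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))  # column Yi*(Xi-Xbar)
    sxx = sum((xi - xbar) ** 2 for xi in x)              # column (Xi-Xbar)^2
    syy = sum((yi - ybar) ** 2 for yi in y)              # column (Yi-Ybar)^2
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(quality, color)
print(f"r = {r:.3f}")
```

The numerator uses yi rather than (yi – ȳ). The two forms give the same sum because the deviations (xi – x̄) add up to zero.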
(Figure: potential relationships between random variables.)
Probability Plot
• Is a particular distribution a reasonable model
for data? We may want to verify assumptions.
• If time-to-failure data follow an exponential distribution,
then the failure rate is constant with respect to time.
• A probability plot is a graphical method for
determining whether sample data conform to a
hypothesized distribution based on a subjective
visual examination of the data.
• Histograms require a very large sample size to reveal the shape
of the distribution reliably.
Constructing a Probability Plot
• To construct a probability plot:
– Sort the data observations in ascending order: x(1),
x(2),…, x(n).
– The observed value x(j) is plotted against the
observed cumulative frequency (j – 0.5)/n.
– The paired numbers are plotted on the probability
paper of the proposed distribution.
• If the paired numbers form a straight line, then
the hypothesized distribution adequately
describes the data.
Example 7: Battery Life
The effective service life (xj, in minutes) of batteries used in a laptop is given in
the table. We hypothesize that battery life is adequately modeled by a normal
distribution. To test this hypothesis, first arrange the observations in ascending
order, then calculate their cumulative frequencies and plot them.
Table 6 Calculations for Constructing a
Normal Probability Plot
j     x(j)   (j-0.5)/10   100(j-0.5)/10
1     176    0.05         5
2     183    0.15         15
3     185    0.25         25
4     190    0.35         35
5     191    0.45         45
6     192    0.55         55
7     201    0.65         65
8     205    0.75         75
9     214    0.85         85
10    220    0.95         95
Normal probability plot for battery life.
Probability Plot on Standardized Normal Scores
j x (j ) (j-0.5)/10 z j
1 176 0.05 -1.64
2 183 0.15 -1.04
3 185 0.25 -0.67
4 190 0.35 -0.39
5 191 0.45 -0.13
6 192 0.55 0.13
7 201 0.65 0.39
8 205 0.75 0.67
9 214 0.85 1.04
10 220 0.95 1.64
Table 6 Calculations for
Constructing a Normal Probability
Plot (with standardized normal scores zj)
Normal probability plot obtained from
standardized normal scores. This is equivalent
to the figure on the previous slide.
A normal probability plot can also be drawn on ordinary axes by plotting the
standardized normal scores zj; the normal probability scale is then not needed.
(j – 0.5)/n = P(Z ≤ zj) = Φ(zj)
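The plotting positions and standardized normal scores in Table 6 can be computed with Python's standard library. This sketch solves (j – 0.5)/n = Φ(zj) for zj using `statistics.NormalDist().inv_cdf`; the function name `normal_plot_points` is illustrative.

```python
from statistics import NormalDist

# Battery life observations (minutes) from Example 7.
battery_life = [176, 183, 185, 190, 191, 192, 201, 205, 214, 220]


def normal_plot_points(data):
    """Rows of (j, x_(j), (j - 0.5)/n, z_j) for a normal probability plot."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    rows = []
    for j, x in enumerate(xs, start=1):
        p = (j - 0.5) / n            # observed cumulative frequency
        z = std_normal.inv_cdf(p)    # z_j such that Phi(z_j) = p
        rows.append((j, x, p, z))
    return rows


points = normal_plot_points(battery_life)
for j, x, p, z in points:
    print(f"{j:>2}  {x}  {p:.2f}  {z:+.2f}")
```

Plotting x(j) against zj on ordinary axes gives the same picture as normal probability paper. Points falling close to a straight line support the normal model.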
Probability Plot Variations
Normal probability plots indicating a non-normal distribution.
(a) Light tailed distribution (e.g. uniform distribution)
(b) Heavy tailed distribution (has larger variance, has heavier tails than the
normal distribution)
(c) Right skewed distribution

More Related Content

PPTX
Linear functions
PPTX
Ellipse
PPT
Hyperbolas
DOCX
Daily Lesson Plan-W1-STATistic and probability
PPTX
A course on integral calculus
PPTX
Binary Operations.pptx
PPTX
Presentation and-analysis-of-business-data
Linear functions
Ellipse
Hyperbolas
Daily Lesson Plan-W1-STATistic and probability
A course on integral calculus
Binary Operations.pptx
Presentation and-analysis-of-business-data

What's hot (20)

PPTX
Alternating Group presentation
PDF
Introduction to Groups and Permutation Groups
PDF
Grade 8-slope-of-a-line
PPTX
CENTRAL LIMIT THEOREM- STATISTICS AND PROBABILITY
PDF
Probability(mutually exclusive events)
PDF
Discrete probability distribution (complete)
PPTX
1 6 construction
PPTX
Rational Expressions
PPTX
Different Types of Variations in Mathematics
PPTX
Nature of the roots of a quadratic equation
PDF
Isomorphism
PPTX
Sequences and series
PPT
Inverse functions and relations
PPTX
Lesson no. 5 (Reference Angle)
PPTX
Graphing rational functions
PPT
Long and synthetic division
PDF
Probability Distribution (Discrete Random Variable)
PPTX
Practical research-1
PPTX
Sets
PPTX
Transforming Quadratic functions from General Form to Standard Form
Alternating Group presentation
Introduction to Groups and Permutation Groups
Grade 8-slope-of-a-line
CENTRAL LIMIT THEOREM- STATISTICS AND PROBABILITY
Probability(mutually exclusive events)
Discrete probability distribution (complete)
1 6 construction
Rational Expressions
Different Types of Variations in Mathematics
Nature of the roots of a quadratic equation
Isomorphism
Sequences and series
Inverse functions and relations
Lesson no. 5 (Reference Angle)
Graphing rational functions
Long and synthetic division
Probability Distribution (Discrete Random Variable)
Practical research-1
Sets
Transforming Quadratic functions from General Form to Standard Form
Ad

Similar to Advanced Statistics And Probability (MSC 615 (20)

PPT
Statistics by DURGESH JHARIYA OF jnv,bn,jbp
PPT
1608 probability and statistics in engineering
PPTX
Engineering Data Analysis-ProfCharlton
PPTX
The treatment of data in engineering investigation is the most important thing.
PPTX
Transportation and logistics modeling 2
PPTX
Basic Statistical Descriptions of Data.pptx
PPT
Lect 2 basic ppt
PPTX
Lesson3 lpart one - Measures mean [Autosaved].pptx
PDF
Lesson2 - lecture two Measures mean.pdf
PPTX
Basics of Stats (2).pptx
PDF
Excel Basic Statistics for beginners.pdf
DOCX
SAMPLING MEAN DEFINITION The term sampling mean is.docx
PDF
Engineering Statistics
DOCX
SAMPLING MEANDEFINITIONThe term sampling mean is a stati.docx
DOCX
SAMPLING MEANDEFINITIONThe term sampling mean is a stati.docx
PPTX
Lesson2 lecture two in Measures mean.pptx
PPT
Statistics and variabilities Lecture03.ppt
PDF
1.0 Descriptive statistics.pdf
PPTX
Statistical process control spc enginering
PDF
Statistical thinking
Statistics by DURGESH JHARIYA OF jnv,bn,jbp
1608 probability and statistics in engineering
Engineering Data Analysis-ProfCharlton
The treatment of data in engineering investigation is the most important thing.
Transportation and logistics modeling 2
Basic Statistical Descriptions of Data.pptx
Lect 2 basic ppt
Lesson3 lpart one - Measures mean [Autosaved].pptx
Lesson2 - lecture two Measures mean.pdf
Basics of Stats (2).pptx
Excel Basic Statistics for beginners.pdf
SAMPLING MEAN DEFINITION The term sampling mean is.docx
Engineering Statistics
SAMPLING MEANDEFINITIONThe term sampling mean is a stati.docx
SAMPLING MEANDEFINITIONThe term sampling mean is a stati.docx
Lesson2 lecture two in Measures mean.pptx
Statistics and variabilities Lecture03.ppt
1.0 Descriptive statistics.pdf
Statistical process control spc enginering
Statistical thinking
Ad

More from Maria Perkins (20)

PDF
Hypothesis In A Research Pape. Online assignment writing service.
PDF
Afresheet An Incredibly Easy Method That Works For All GeelongHEART
PDF
How To Write An Introductory Paragraph For A Synthesis Essay AP Lang
PDF
How The Grinch Stole Christmas Lesson Plans A
PDF
How To Review A Research Paper. Online assignment writing service.
PDF
Tok Essay Expert Knowledge. Online assignment writing service.
PDF
Pin On English Vocabulary. Online assignment writing service.
PDF
10 Foolproof Tips How To Structure A 500 Word Essa
PDF
Online Paper Writing Service Reviews By. Online assignment writing service.
PDF
Synthesis Essay Introduction Example. Synthesis Essay Introduction
PDF
Colored Milky Gel Pens That Write On Black Paper - Bu
PDF
Writing Paper Free Printable YouLl Have Everything Yo
PDF
Purchase Essays. Personal Essay Purchase
PDF
Finished Custom Writing Paper B. Online assignment writing service.
PDF
My Best Friend Essay Essay On My Best Friend
PDF
Introduction - How To Write An Essay - LibGuides At Univer
PDF
PrintableWritingPaperBy. Online assignment writing service.
PDF
Pin On Notebook Paper. Online assignment writing service.
PDF
Help Write A Research Paper - The Oscillation Band
PDF
015 Essay Example Sample1A 8Th Grade Tha
Hypothesis In A Research Pape. Online assignment writing service.
Afresheet An Incredibly Easy Method That Works For All GeelongHEART
How To Write An Introductory Paragraph For A Synthesis Essay AP Lang
How The Grinch Stole Christmas Lesson Plans A
How To Review A Research Paper. Online assignment writing service.
Tok Essay Expert Knowledge. Online assignment writing service.
Pin On English Vocabulary. Online assignment writing service.
10 Foolproof Tips How To Structure A 500 Word Essa
Online Paper Writing Service Reviews By. Online assignment writing service.
Synthesis Essay Introduction Example. Synthesis Essay Introduction
Colored Milky Gel Pens That Write On Black Paper - Bu
Writing Paper Free Printable YouLl Have Everything Yo
Purchase Essays. Personal Essay Purchase
Finished Custom Writing Paper B. Online assignment writing service.
My Best Friend Essay Essay On My Best Friend
Introduction - How To Write An Essay - LibGuides At Univer
PrintableWritingPaperBy. Online assignment writing service.
Pin On Notebook Paper. Online assignment writing service.
Help Write A Research Paper - The Oscillation Band
015 Essay Example Sample1A 8Th Grade Tha

Recently uploaded (20)

PPTX
Institutional Correction lecture only . . .
PDF
TR - Agricultural Crops Production NC III.pdf
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
Sports Quiz easy sports quiz sports quiz
PPTX
Lesson notes of climatology university.
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Insiders guide to clinical Medicine.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
Basic Mud Logging Guide for educational purpose
PPTX
Pharma ospi slides which help in ospi learning
PPTX
GDM (1) (1).pptx small presentation for students
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Institutional Correction lecture only . . .
TR - Agricultural Crops Production NC III.pdf
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Sports Quiz easy sports quiz sports quiz
Lesson notes of climatology university.
Final Presentation General Medicine 03-08-2024.pptx
2.FourierTransform-ShortQuestionswithAnswers.pdf
Insiders guide to clinical Medicine.pdf
O7-L3 Supply Chain Operations - ICLT Program
Microbial disease of the cardiovascular and lymphatic systems
VCE English Exam - Section C Student Revision Booklet
Basic Mud Logging Guide for educational purpose
Pharma ospi slides which help in ospi learning
GDM (1) (1).pptx small presentation for students
Supply Chain Operations Speaking Notes -ICLT Program
O5-L3 Freight Transport Ops (International) V1.pdf
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
Pharmacology of Heart Failure /Pharmacotherapy of CHF

Advanced Statistics And Probability (MSC 615

  • 1. Copyright © 2014 John Wiley & Sons, Inc. All rights reserved. Advanced Statistics and Probability (MSC 615) Yerkin G. Abdildin Nazarbayev University
  • 2. 2 Descriptive Statistics • Numerical Summaries of Data • Stem-and-Leaf Diagrams • Frequency Distributions and Histograms • Box Plots • Time Sequence Plots • Probability Plots
  • 3. Terminology • Population: The set of all possible observations that could be made from a random experiment – May be very large (e.g., the set of all KZ residents) – Hypothetical or conceptual populations do not physically exist (e.g., set of all products that will be produced in a factory) • Sample: An observed subset of data from a population – Example: A set of 300 randomly-chosen KZ residents – Example: A set of 14 randomly-chosen parts produced in a factory during one day’s operation
  • 4. Numerical Summaries of Data • Data: the numeric observations of a phenomenon of interest. – The totality of all observations is a population. – A portion used for analysis is a random sample. • We gain an understanding of this collection by describing it numerically and graphically, usually with the sample data. • We describe the collection in terms of shape, outliers, center, and spread (SOCS). – The center can be measured by the mean. – The spread can be measured by the variance.
  • 5. What is the field of statistics? We encounter it regularly: – Opinion polls – Sports performance – Effectiveness of medical treatments But what is it, really? – Statistics is the “science of data” – Qualitative and quantitative tools for analyzing data Why do we need it? – Many outcomes are produced by “random” processes – We want to understand these underlying processes more clearly
  • 6. Probability and Statistics • Knowing the structure of the process, what can we say about the outcomes? (Diagram labels: Process, experiment; Observed outcomes; Probability.)
  • 7. Probability and Statistics • Knowing the structure of the process, what can we say about the outcomes? • Having observed some outcomes, what can we say about the process? (Diagram labels: Process, experiment; Observed outcomes; Probability; Statistics.)
  • 8. Probability and Statistics EXAMPLE: Suppose that a factory manager is very dedicated to product quality, and is concerned that too many products produced at the factory may be defective. What would your recommended plan of action be?
  • 9. Probability and Statistics (example, continued) IDEA 1: Take the factory offline, disassemble and thoroughly inspect every machine involved in the production process, and put every employee in the factory through a rigorous two-week training program.
  • 10. Probability and Statistics (example, continued) CONCERN about IDEA 1: Very expensive!
  • 11. Probability and Statistics (example, continued) IDEA 2: Take an entire day’s worth of products and inspect them all thoroughly to estimate the rate at which defective products are being produced.
  • 12. Probability and Statistics (example, continued) CONCERN about IDEA 2: What if the factory produces thousands of units per day? And what if testing requires destroying the products (e.g., yield strength of steel beams)?
  • 13. Probability and Statistical Inference • A statistic is any quantity whose value can be calculated from sample data (e.g., mean, variance, proportion). • Reasoning from a sample to a population is referred to as statistical inference, or inferential statistics. (Diagram: probability reasons from the Population to the Sample; inferential statistics, or statistical inference, reasons from the Sample to the Population.)
  • 14. Sample Mean For $n$ observations in a random sample, $x_1, x_2, \ldots, x_n$, the sample mean is $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$. For $N$ observations in a population, $x_1, x_2, \ldots, x_N$, the population mean is $\mu = \sum_{i=1}^{N} x_i f(x_i) = \frac{1}{N}\sum_{i=1}^{N} x_i$.
  • 15. Example 1: Sample Mean Consider 8 observations ($x_i$) of pull-off force from engine connectors, for $i = 1, \ldots, 8$: 12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1. The sample mean is $\bar{x} = \frac{1}{8}\sum_{i=1}^{8} x_i = \frac{12.6 + 12.9 + \cdots + 13.1}{8} = \frac{104}{8} = 13.00$ pounds. In Excel: = AVERAGE($B2:$B9) returns 13.00. The sample mean is the balance point.
  • 16. Variance Defined For $n$ observations in a random sample, $x_1, x_2, \ldots, x_n$, the sample variance is $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$. For $N$ observations in a population, $x_1, x_2, \ldots, x_N$, the population variance is $\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 f(x_i) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$.
  • 17. Some other concepts • The standard deviation is the square root of the variance. – $s$ is the sample standard deviation symbol. – $\sigma$ is the population standard deviation symbol. • A sample statistic can be a reasonable estimate of a population parameter, e.g., $\bar{x}$ is an estimate of $\mu$, and $s^2$ is an estimate of $\sigma^2$.
  • 18. Example 2: Sample Variance Table 1 displays the quantities needed to calculate the sample variance, $s^2$, and sample standard deviation, $s$.
      Table 1
      i      x_i      x_i - xbar   (x_i - xbar)^2
      1      12.6     -0.4         0.16
      2      12.9     -0.1         0.01
      3      13.4      0.4         0.16
      4      12.3     -0.7         0.49
      5      13.6      0.6         0.36
      6      13.5      0.5         0.25
      7      12.6     -0.4         0.16
      8      13.1      0.1         0.01
      sums   104.00    0.0         1.60
      xbar = 104.00 / 8 = 13.00; sample variance = 1.60 / 7 = 0.2286; sample standard deviation = 0.48
    Dimensions: $x_i$ is in pounds, the mean is in pounds, the variance is in pounds², and the standard deviation is in pounds. Desired accuracy is generally accepted to be one more place than the data. Result: $\bar{x} = 13.00$ pounds, $s^2 = 0.2286$ pounds², $s = 0.48$ pounds.
  • 19. Computation of $s^2$ We can derive the computational formula. It involves just 2 sums. Using $\sum_{i=1}^{n} x_i = n\bar{x}$: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i^2 + \bar{x}^2 - 2\bar{x}x_i)}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x}\sum_{i=1}^{n} x_i}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2n\bar{x}^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$.
  • 20. Example 3: Variance by Shortcut $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1} = \frac{1{,}353.60 - (104.0)^2 / 8}{7} = \frac{1.60}{7} = 0.2286$ pounds², and $s = \sqrt{0.2286} = 0.48$ pounds.
      i      x_i      x_i^2
      1      12.6     158.76
      2      12.9     166.41
      3      13.4     179.56
      4      12.3     151.29
      5      13.6     184.96
      6      13.5     182.25
      7      12.6     158.76
      8      13.1     171.61
      sums   104.0    1,353.60
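The two routes to the sample variance in Examples 2 and 3 can be checked against each other. The sketch below uses the eight pull-off force values from Example 1; the variable names are illustrative and not taken from the slides.

```python
import math

# Pull-off force observations (pounds) from Example 1.
x = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]
n = len(x)

# Sample mean: sum of the observations divided by n.
mean = sum(x) / n

# Definition: squared deviations from the sample mean, divided by n - 1.
s2_definition = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Shortcut: only the sum of x and the sum of x squared are needed.
sum_x = sum(x)
sum_x2 = sum(xi ** 2 for xi in x)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Sample standard deviation: square root of the sample variance.
s = math.sqrt(s2_definition)

print(round(mean, 2), round(s2_definition, 4), round(s2_shortcut, 4), round(s, 2))
```

Both formulas agree here. In floating-point arithmetic the shortcut subtracts two large, nearly equal numbers, so the definitional form is usually the safer choice when the mean is large relative to the spread.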
  • 21. Why does sample variance divide by (n – 1)? The population variance (σ²) is calculated with N, the population size. Why isn’t the sample variance (s²) calculated with n, the sample size? 1) s² measures squared deviation from x̄, not μ • The true variance (σ²) is based on data deviations from the true mean, μ. • The sample variance (s²) is based on the data deviations from x̄ (x-bar), not μ. • The squared deviations about x̄ are never larger in total than those about μ, because x̄ is the value that minimizes the sum of squared deviations for the sample. Dividing by n – 1 rather than n adjusts for this underestimation of the squared deviation about the true mean.
  • 22. Why does sample variance divide by (n – 1)? 2) n – 1 is the number of “degrees of freedom” we have in our computation • s² is calculated with the quantity n – 1, which is called the “degrees of freedom”. • Origin of the term: – There are n deviations from x̄ in the sample, x_i – x̄ – The sum of the deviations is zero (see Table 1) – If we know the values of n – 1 of these terms, the value of the last is known – Hence, only n – 1 of these terms can be chosen freely (i.e., independently of the others)
  • 23. Why does sample variance divide by (n – 1)? 3) Dividing by n – 1 makes s² an unbiased estimator of σ² • x̄ is an estimator of μ; close but not the same. • Remember that s² is a random variable; what is its expectation? • Being an “unbiased estimator” means that E(s²) = σ²
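The unbiasedness claim can be verified exactly on a toy case. The sketch below assumes a hypothetical four-value population and independent draws with replacement (so every ordered sample of size n is equally likely); neither the population nor the helper name `spread` comes from the slides. It averages the two candidate estimators over all possible samples using exact fractions.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy population (an assumption for illustration only).
population = [Fraction(v) for v in (2, 4, 6, 12)]
N = len(population)
mu = sum(population) / N
sigma2 = sum((v - mu) ** 2 for v in population) / N  # population variance

def spread(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    xbar = sum(sample) / len(sample)
    return sum((v - xbar) ** 2 for v in sample) / divisor

n = 3
# All N**n ordered samples drawn with replacement; each is equally likely.
samples = list(product(population, repeat=n))

# Expected value of each estimator = plain average over all samples.
mean_s2_n_minus_1 = sum(spread(s, n - 1) for s in samples) / len(samples)
mean_s2_n = sum(spread(s, n) for s in samples) / len(samples)

print(sigma2, mean_s2_n_minus_1, mean_s2_n)
```

Under these assumptions, dividing by n – 1 recovers σ² exactly on average, while dividing by n gives (n – 1)/n times σ², which is the underestimation described on the slides.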
  • 24. Sample Range If the n observations in a sample are denoted by x_1, x_2, …, x_n, the sample range is: r = max(x_i) – min(x_i). It is the largest observation in the sample minus the smallest observation. From Example 3: r = 13.6 – 12.3 = 1.30 pounds. Note: population range ≥ sample range
  • 25. Stem-and-Leaf Diagrams • Dot diagrams (dotplots) are useful for small data sets. Stem-and-leaf diagrams are better for large sets. • Steps to construct a stem-and-leaf diagram: 1) Divide each number (x_i) into two parts: – stem, all but the last significant digit – leaf, the last significant digit. 2) List (smallest to largest) the stem values in a vertical column. 3) To the right of each stem, list its leaves
  • 26. Example 4: Alloy Strength To illustrate the construction of a stem-and-leaf diagram, consider the alloy compressive strength data in Table 2.
      Table 2 Compressive Strength (psi) of Aluminum-Lithium Specimens (80 values)
      105  221  183  186  121  181  180  143
       97  154  153  174  120  168  167  141
      245  228  174  199  181  158  176  110
      163  131  154  115  160  208  158  133
      207  180  190  193  194  133  156  123
      134  178   76  167  184  135  229  146
      218  157  101  171  165  172  158  169
      199  151  142  163  145  171  148  158
      160  175  149   87  160  237  150  135
      196  201  200  176  150  170  118  149
    Figure: Stem-and-leaf diagram for Table 2 data. Center is about 155 and most data is between 110 and 200. Leaves are unordered. Units: psi = pounds per square inch; 1 psi ≈ 6.894757 kPa.
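The construction steps can be sketched in a few lines for the Table 2 data. This is an illustrative sketch, not the plotting routine behind the slide's figure: the function name `stem_and_leaf` is hypothetical, and unlike the figure described on the slide, it sorts the leaves within each stem.

```python
from collections import defaultdict

# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its sorted leaves (last digit)."""
    groups = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # e.g. 143 -> stem 14, leaf 3
        groups[stem].append(leaf)
    # Stems listed from smallest to largest, leaves sorted within each stem.
    return {stem: sorted(groups[stem]) for stem in sorted(groups)}

diagram = stem_and_leaf(strength)
for stem, leaves in diagram.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

The longest rows in the printed diagram are stems 15 through 17, which is consistent with the slide's remark that the center is about 155.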
  • 27. Quartiles • The 3 quartiles partition the data into 4 equally sized counts or segments. – q1, 1st or lower quartile: 25% of the data is ≤ q1. – q2, 2nd quartile (median): 50% of the data is ≤ q2. – q3, 3rd or upper quartile: 75% of the data is ≤ q3. • How do we find these quartiles? • First we need to sort the data. • For the Table 2 data (Compressive Strength (psi) of Aluminum-Lithium Specimens, n = 80):
      f      Index f(n+1)   i-th value   (i+1)-th value   Quartile
      0.25   20.25          143          145              143.50
      0.50   40.50          160          163              161.50
      0.75   60.75          181          181              181.00
  • 28. Quartile q2 • Second quartile (sample median), q2: The “middle value” of the observations, such that 50% of the remaining observations are below the median and 50% of the remaining observations are above the median. With $x_{(k)}$ denoting the k-th smallest observation: $q_2 = x_{((n+1)/2)}$ if $n$ is odd, and $q_2 = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$ if $n$ is even. If $n = 80$: $q_2 = \frac{x_{(40)} + x_{(41)}}{2} = \frac{160 + 163}{2} = 161.5$.
  • 29. Quartile q1 • First quartile, q1: Value such that approximately 25% of the observations are ≤ q1. If $(n+1)/4$ is an integer, $q_1 = x_{((n+1)/4)}$. If $(n+1)/4$ is not an integer, let $w$ and $f$ be the integer and fractional parts of $(n+1)/4$; then $q_1 = (1 - f)\,x_{(w)} + f\,x_{(w+1)}$. If $n = 80$: $(n+1)/4 = 20.25$, so $w = 20$ and $f = 0.25$, and $q_1 = (1 - 0.25)\,x_{(20)} + 0.25\,x_{(21)} = 0.75(143) + 0.25(145) = 143.5$.
  • 30. Quartile q3 and IQR • Third quartile, q3: Value such that approximately 75% of the observations are ≤ q3. If $3(n+1)/4$ is an integer, $q_3 = x_{(3(n+1)/4)}$. If $3(n+1)/4$ is not an integer, let $w$ and $f$ be the integer and fractional parts of $3(n+1)/4$; then $q_3 = (1 - f)\,x_{(w)} + f\,x_{(w+1)}$. • Interquartile range, IQR: The difference between the third and first quartiles, $\mathrm{IQR} = q_3 - q_1$ (another measure of the spread of the data, less sensitive to extreme outliers than the sample variance)
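The quartile rule on slides 28 to 30 amounts to interpolating at position p(n + 1) in the sorted data. A minimal sketch for the Table 2 data follows; the function name `quantile_n_plus_1` is hypothetical, and the function is only meant for fractions p whose position lands between 1 and n (true for the three quartiles here).

```python
# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

def quantile_n_plus_1(sorted_values, p):
    """Interpolate at position p * (n + 1), counting positions from 1."""
    n = len(sorted_values)
    position = p * (n + 1)
    w = int(position)   # integer part
    f = position - w    # fractional part
    if f == 0:
        return sorted_values[w - 1]
    # (1 - f) * x_(w) + f * x_(w+1); list indices are zero-based.
    return (1 - f) * sorted_values[w - 1] + f * sorted_values[w]

ordered = sorted(strength)
q1 = quantile_n_plus_1(ordered, 0.25)
q2 = quantile_n_plus_1(ordered, 0.50)
q3 = quantile_n_plus_1(ordered, 0.75)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

For even n, the p = 0.5 case reduces to the average of the two middle values, so the same function reproduces the median formula. Other software may use a different interpolation rule and return slightly different quartiles.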
  • 31. Percentiles • Percentiles: – The idea of quartiles can be extended to consider percentiles. – Percentiles partition the data into 100 segments. – The 100k-th percentile (0 < k < 1) is a value such that approximately 100k% of the observations are ≤ it. • A WARNING! – If you use software, it may not use the same method as we use here. – For all coursework, you must use the method given in the lecture. • From the Quartiles example: IQR = q3 – q1 = 181.00 – 143.50 = 37.5 • Impact of outlier data: – IQR is largely unaffected. – Range is directly affected.
  • 32. Frequency Distributions • A frequency distribution is a compact summary of data, expressed as a table, graph, or function. • The data is gathered into bins or cells, defined by class intervals. • The number of classes, multiplied by the class width, should exceed the range of the data. • The square root of the sample size is a guide to the number of classes. • The boundaries of the class intervals should be convenient values, as should the class width.
  • 33. Frequency Distribution Table Frequency distribution for the data in Table 2. From the given n = 80 observations of x_i: max(x_i) = 245, min(x_i) = 76. Considerations: Range = 245 – 76 = 169; Number of classes = Sqrt(80) = 8.9; Trial class width = 169/8.9 = 18.99. Decisions: Number of classes = 9; Class width = 20; Range of classes = 20 * 9 = 180; Starting point = 70.
      Table 4 Frequency Distribution of Table 2 Data
      Class             Frequency   Relative Frequency   Cumulative Relative Frequency
       70 ≤ x <  90      2          0.0250               0.0250
       90 ≤ x < 110      3          0.0375               0.0625
      110 ≤ x < 130      6          0.0750               0.1375
      130 ≤ x < 150     14          0.1750               0.3125
      150 ≤ x < 170     22          0.2750               0.5875
      170 ≤ x < 190     17          0.2125               0.8000
      190 ≤ x < 210     10          0.1250               0.9250
      210 ≤ x < 230      4          0.0500               0.9750
      230 ≤ x < 250      2          0.0250               1.0000
      Total             80          1.0000
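Table 4 can be reproduced by binning the Table 2 data with the slide's decisions (start at 70, width 20, 9 classes). The sketch below is illustrative; the variable names are assumptions, and the class rule is lower bound inclusive, upper bound exclusive, as in the table.

```python
import math

# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]
n = len(strength)

# Considerations: the range and the square-root guide for the class count.
data_range = max(strength) - min(strength)
trial_classes = math.sqrt(n)
trial_width = data_range / trial_classes

# Decisions taken on the slide.
start, width, classes = 70, 20, 9

# Count observations per class: lower <= x < lower + width.
counts = [0] * classes
for v in strength:
    counts[(v - start) // width] += 1

running = 0
for k, count in enumerate(counts):
    lower = start + k * width
    running += count
    print(f"{lower:3d} <= x < {lower + width:3d}  {count:2d}  "
          f"{count / n:.4f}  {running / n:.4f}")
```

The trial width computed here from the unrounded square root (about 18.9) differs slightly from the slide's 18.99, which divides by the rounded value 8.9; either way, 20 is the convenient width.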
  • 34. Histograms • A histogram is a visual display of a frequency distribution. • Steps to construct a histogram with equal bin widths: 1) Label the bin boundaries on the horizontal scale. 2) Mark & label the vertical scale with the frequencies or relative frequencies. 3) Above each bin, draw a rectangle whose height is equal to the frequency corresponding to that bin.
  • 35. Histogram of the Table 2 Data Figure: Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Note these features: (1) horizontal scale bin boundaries & labels with units, (2) vertical scale measurements and labels, (3) histogram title at top or in legend.
  • 36. Poor Choices in Drawing Histograms Figure: Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Errors: too many bins (17) create a jagged shape, the horizontal scale is not at class boundaries, and the horizontal axis label does not include units.
  • 37. Cumulative Frequency Plot Figure: Cumulative histogram of compressive strength of 80 aluminum-lithium alloy specimens. Comment: Easy to see cumulative probabilities, hard to see distribution shape.
  • 38. Shape of a Frequency Distribution Figure: Histograms of symmetric and skewed distributions. (b) A symmetric distribution has identical mean, median and mode. (a & c) Skewed distributions are positive or negative, depending on the direction of the long tail. Their measures occur in alphabetical order (mean, median, mode) as the distribution is approached from the long tail. Mode: the value of x which occurs most often, i.e., has the greatest probability of occurring. Median: the value x for which P(X<x) ≤ 0.5 and P(X>x) ≤ 0.5; it is the "middle" of a sorted list of numbers.
  • 39. Histograms for Categorical Data • Categorical data is of two types: – Ordinal: categories have a natural order, e.g., year in college, military rank. – Nominal: categories are simply different, e.g., gender, colors. • There is one histogram bar per category; the bars are of equal width and have a height equal to the category’s frequency or relative frequency. • A Pareto chart is a histogram in which the categories are sequenced in decreasing order of frequency. This approach emphasizes the most and least important categories.
  • 40. Example 6: Categorical Data Histogram Figure: Airplane production in 1985. (Source: Boeing Company) Comment: Illustrates nominal data in spite of the numerical names; categories are shown at the bin’s midpoint; it is a Pareto chart since the categories are in decreasing order.
  • 41. Box Plot or Box-and-Whisker Chart • A box plot is a graphical display showing center, spread, shape, and outliers (SOCS). • It displays the 5-number summary: min, q1, median, q3, and max. Figure: Description of a box plot.
  • 42. Visual Summary: Box Plot Constructing a box plot (vertical axis measures data values): 1) DRAW THE BOX: • Draw a box that extends vertically from q1 to q3 • Draw a horizontal line through the box at q2 2) DRAW THE WHISKERS: • Lower whisker: Draw a line extending from the bottom box edge to the smallest observation within 1.5 IQR of q1 • Upper whisker: Draw a line extending from the top box edge to the largest observation within 1.5 IQR of q3 3) OUTLIERS (label each with an asterisk): • Observations beyond the whiskers, but within 3 IQR of a box edge 4) EXTREME OUTLIERS (label each with an asterisk or other symbol): • Observations more than 3 IQR away from a box edge
  • 43. Box Plot of Table 2 Data Figure: Box plot of compressive strength of 80 aluminum-lithium alloy specimens. Comment: A box plot may be shown vertically or horizontally; the data reveal three outliers and no extreme outliers. The lower limit below which an observation counts as an outlier is q1 – 1.5 IQR: 143.5 – 1.5*(181.0 – 143.5) = 87.25.
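The whisker and outlier rules can be applied to the Table 2 data directly. The sketch below takes q1 = 143.5 and q3 = 181.0 from the quartiles slide as given; the names `inner_low`, `outer_low`, and so on are illustrative labels, not terminology from the slides.

```python
# Table 2: compressive strength (psi) of 80 aluminum-lithium specimens.
strength = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

# Quartiles as computed on the quartiles slide.
q1, q3 = 143.5, 181.0
iqr = q3 - q1

# Limits at 1.5 IQR (outliers) and 3 IQR (extreme outliers) from the box edges.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

# Whiskers end at the most extreme observations still inside the 1.5 IQR limits.
whisker_low = min(v for v in strength if v >= inner_low)
whisker_high = max(v for v in strength if v <= inner_high)

# Outliers: beyond the whiskers but within 3 IQR of a box edge.
outliers = sorted(
    v for v in strength
    if outer_low <= v < inner_low or inner_high < v <= outer_high
)
# Extreme outliers: more than 3 IQR away from a box edge.
extreme_outliers = sorted(v for v in strength if v < outer_low or v > outer_high)

print(inner_low, inner_high, whisker_low, whisker_high)
print(outliers, extreme_outliers)
```

This reproduces the slide's comment: three outliers (two low, one high) and no extreme outliers.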
  • 44. Time Sequence Plots • A time series plot shows the data value, or statistic, on the vertical axis with time on the horizontal axis. • A time series plot reveals trends, cycles or other time-oriented behavior that may not be visible in the raw data. Figure: Company sales by year (a) and by quarter (b).
  • 45. Digidot Plot Combining a time series plot with some of the other graphical displays considered previously is sometimes very helpful. The stem-and-leaf plot combined with a time series plot forms a digidot plot. Figure: A digidot plot of the compressive strength data in Table 2.
  • 46. Multivariate Data In multivariate data, each observation consists of measurements of several variables. Table 5 Quality Data for Young Red Wine.
  • 47. Scatterplot (Scatter Diagram) A scatterplot graphically displays the potential relationship between two variables. Figure: Scatterplot of wine Quality and Color from data in Table 5.
  • 48. Matrix of Scatter Diagrams A matrix of scatter diagrams graphically displays the pairwise relationships between the variables in the sample. Figure: Matrix of scatter diagrams for the data in Table 5. Notice a strong positive correlation between Color Density and Color of wine.
  • 49. Pearson Correlation Coefficient The sample (or Pearson) correlation coefficient is a dimensionless, quantitative measure of the strength of the linear relationship between 2 random variables; it takes a value between +1 and −1 inclusive. (Named after Karl Pearson (1857-1936), British statistician.) $r_{xy} = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}}$. Correlations with |r| above 0.8 are strong, those with |r| below 0.5 are weak, and r = 0 indicates no linear correlation. All correlations between the 5 random variables in Table 5:
                  Quality   pH       Total SO2   C.Density   Color
      Quality     1
      pH          0.349     1
      Total SO2   -0.445    -0.679   1
      C.Density   0.702     0.482    -0.492      1
      Color       0.712     0.430    -0.480      0.996       1
  • 50. Pearson Correlation Coefficient In Excel: correlation between Quality and Color of wine in Table 5. Pearson correlation coefficient = 43.103/SQRT(75.852*48.2995) = 0.712
      i      Xi (Quality)   Yi (Color)   Xi-Xbar   Yi-Ybar   (Xi-Xbar)^2   (Yi-Ybar)^2   Yi*(Xi-Xbar)
      1      19.2           5.65          3.98      1.195    15.8404        1.428025      22.487
      2      18.3           6.95          3.08      2.495     9.4864        6.225025      21.406
      3      17.1           5.75          1.88      1.295     3.5344        1.677025      10.81
      4      15.2           4            -0.02     -0.455     0.0004        0.207025      -0.08
      5      14             2.25         -1.22     -2.205     1.4884        4.862025      -2.745
      6      13.8           3.2          -1.42     -1.255     2.0164        1.575025      -4.544
      7      12.8           2.7          -2.42     -1.755     5.8564        3.080025      -6.534
      8      17.3           6.1           2.08      1.645     4.3264        2.706025      12.688
      9      16.3           5             1.08      0.545     1.1664        0.297025       5.4
      10     16             6             0.78      1.545     0.6084        2.387025       4.68
      11     15.7           5.5           0.48      1.045     0.2304        1.092025       2.64
      12     15.3           3.35          0.08     -1.105     0.0064        1.221025       0.268
      13     14.3           3.25         -0.92     -1.205     0.8464        1.452025      -2.99
      14     14             5.1          -1.22      0.645     1.4884        0.416025      -6.222
      15     13.8           4.4          -1.42     -0.055     2.0164        0.003025      -6.248
      16     12.5           3.15         -2.72     -1.305     7.3984        1.703025      -8.568
      17     11.5           3.9          -3.72     -0.555    13.8384        0.308025     -14.508
      18     14.2           2.4          -1.02     -2.055     1.0404        4.223025      -2.448
      19     17.3           7.7           2.08      3.245     4.3264       10.530025      16.016
      20     15.8           2.75          0.58     -1.705     0.3364        2.907025       1.595
      SUM    304.4          89.1                             75.852        48.2995        43.103
      Mean   15.22          4.455
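The Excel calculation can be repeated in plain Python from the Quality and Color columns. This sketch follows the slide's formula, with the numerator written as the sum of y_i (x_i − x̄); since the deviations x_i − x̄ sum to zero, this equals the more familiar sum of (x_i − x̄)(y_i − ȳ). The variable names are illustrative.

```python
import math

# Quality (X) and Color (Y) columns from the table on slide 50.
quality = [19.2, 18.3, 17.1, 15.2, 14.0, 13.8, 12.8, 17.3, 16.3, 16.0,
           15.7, 15.3, 14.3, 14.0, 13.8, 12.5, 11.5, 14.2, 17.3, 15.8]
color = [5.65, 6.95, 5.75, 4.00, 2.25, 3.20, 2.70, 6.10, 5.00, 6.00,
         5.50, 3.35, 3.25, 5.10, 4.40, 3.15, 3.90, 2.40, 7.70, 2.75]

n = len(quality)
x_bar = sum(quality) / n
y_bar = sum(color) / n

# The three sums in the last row of the table.
sxx = sum((x - x_bar) ** 2 for x in quality)
syy = sum((y - y_bar) ** 2 for y in color)
sxy = sum(y * (x - x_bar) for x, y in zip(quality, color))

# Pearson correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

print(round(sxx, 3), round(syy, 4), round(sxy, 3), round(r, 3))
```

The result matches the Quality/Color entry of the correlation table on the previous slide.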
  • 52. Probability Plot • Is a particular distribution a reasonable model for data? We may want to verify assumptions. • If time-to-failure data follow an exponential distribution, then the failure rate is constant w.r.t. time. • A probability plot is a graphical method for determining whether sample data conform to a hypothesized distribution based on a subjective visual examination of the data. • Histograms are reliable indicators of the distribution's form only when the sample size is very large.
  • 53. Constructing a Probability Plot • To construct a probability plot: – Sort the data observations in ascending order: x(1), x(2),…, x(n). – The observed value x(j) is plotted against the observed cumulative frequency (j – 0.5)/n. – The paired numbers are plotted on the probability paper of the proposed distribution. • If the paired numbers form approximately a straight line, then the hypothesized distribution adequately describes the data.
  • 54. Example 7: Battery Life The effective service life (x_j, in minutes) of batteries used in a laptop is given in the table. We hypothesize that battery life is adequately modeled by a normal distribution. To test this hypothesis, first arrange the observations in ascending order, calculate their cumulative frequencies, and plot them. Figure: Normal probability plot for battery life.
      Table 6 Calculations for Constructing a Normal Probability Plot
      j     x(j)   (j-0.5)/10   100(j-0.5)/10
      1     176    0.05          5
      2     183    0.15         15
      3     185    0.25         25
      4     190    0.35         35
      5     191    0.45         45
      6     192    0.55         55
      7     201    0.65         65
      8     205    0.75         75
      9     214    0.85         85
      10    220    0.95         95
  • 55. Probability Plot on Standardized Normal Scores A normal probability plot can be plotted on ordinary axes using z-values, so the normal probability scale is not needed. The z-values satisfy (j – 0.5)/n = P(Z ≤ z_j) = Φ(z_j). Figure: Normal probability plot obtained from standardized normal scores; this is equivalent to the figure on the previous slide.
      Table 6 Calculations for Constructing a Normal Probability Plot
      j     x(j)   (j-0.5)/10   z_j
      1     176    0.05         -1.64
      2     183    0.15         -1.04
      3     185    0.25         -0.67
      4     190    0.35         -0.39
      5     191    0.45         -0.13
      6     192    0.55          0.13
      7     201    0.65          0.39
      8     205    0.75          0.67
      9     214    0.85          1.04
      10    220    0.95          1.64
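The z_j column of Table 6 can be computed by inverting the standard normal CDF at (j − 0.5)/n. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are illustrative.

```python
from statistics import NormalDist

# Battery life observations (minutes) from Example 7.
battery_life = [176, 183, 185, 190, 191, 192, 201, 205, 214, 220]
n = len(battery_life)

ordered = sorted(battery_life)
std_normal = NormalDist()  # mean 0, standard deviation 1

rows = []
for j, x_j in enumerate(ordered, start=1):
    p = (j - 0.5) / n            # observed cumulative frequency
    z = std_normal.inv_cdf(p)    # z_j such that Phi(z_j) = p
    rows.append((j, x_j, p, z))

for j, x_j, p, z in rows:
    print(f"{j:2d}  {x_j}  {p:.2f}  {z:6.2f}")
```

Plotting z_j against x(j) on ordinary axes gives the normal probability plot; points lying close to a straight line support the normal model.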
  • 56. Probability Plot Variations Figure: Normal probability plots indicating a non-normal distribution. (a) Light-tailed distribution (e.g., uniform distribution) (b) Heavy-tailed distribution (has larger variance and heavier tails than the normal distribution) (c) Right-skewed distribution