GC-S005-DataAnalysis

Henry R. Kang (1/2010)
General Chemistry
Lecture 5
Statistical Data Analysis

Outline
• Fundamental Statistics
• Accuracy and Precision
• Data Rejection

Accuracy & Precision
• Accuracy
 Accuracy is a measure of the closeness of a
measured quantity to the true value.
• Precision
 Precision describes how closely two or more measurements of the
same quantity agree with one another.
 Precision is a measure of the agreement of
replicate measurements.

Fundamental Statistics

Errors
• All Measurements Contain Errors.
• Types of Errors
 Systematic errors
 One-sided errors (either positive or negative)
• Usually from a single source
• Resulting data are consistently high or low
 Results may be precise but inaccurate
• Examples: the balance is incorrectly zeroed; an incorrect constant is used in
calculations.
 Random errors
 Occur randomly
 Positive and negative deviations occur with equal frequency and size.
• The deviations follow a bell-shaped curve (Gaussian or normal distribution)
 The source of the error is usually not known
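The difference between the two error types can be illustrated with a small simulation. This is only a sketch: the true value, the size of the scatter, and the offset (`TRUE_VALUE`, `SCATTER`, `OFFSET`) are made-up numbers, not data from the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 10.0  # hypothetical true value
SCATTER = 0.1      # hypothetical size of the random error
OFFSET = 0.5       # hypothetical systematic error (e.g. a mis-zeroed balance)

# Random error only: readings scatter symmetrically around the true value.
random_only = [random.gauss(TRUE_VALUE, SCATTER) for _ in range(10_000)]

# Random plus systematic error: every reading is shifted in the same direction.
with_offset = [x + OFFSET for x in random_only]

mean_random = statistics.mean(random_only)
mean_offset = statistics.mean(with_offset)
spread_random = statistics.stdev(random_only)
spread_offset = statistics.stdev(with_offset)

print(f"random only: mean = {mean_random:.3f}, spread = {spread_random:.3f}")
print(f"with offset: mean = {mean_offset:.3f}, spread = {spread_offset:.3f}")
```

The offset moves the mean away from the true value but leaves the spread unchanged, which is why systematic errors can give results that are precise but inaccurate.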

Gaussian Distribution
• Gaussian distribution gives the distribution of data points with respect to the
true value. It gives a bell-shaped curve as shown in the figure.
 The closer to the true value, the higher the probability.
[Figure: Gaussian (normal) distribution curve; x-axis: Standard Deviation (-3 to 3); y-axis: Probability (0 to 0.4)]

Measuring Precision
• Mean (or Average)
• Deviation and Absolute Deviation
• Absolute Average Deviation
• Relative Deviation
• Relative Average Deviation (RAD)
• Standard Deviation
• Relative Standard Deviation

Mean (Average)
• For multiple measurements of a given quantity,
we have numerical values x1, x2, x3, ..., xn, where
n is the number of measurements.
• Sum is defined as
$\mathrm{Sum} = x_1 + x_2 + x_3 + \cdots + x_n = \sum_{i=1}^{n} x_i$
• Mean xavg is defined as
$x_{avg} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{\mathrm{Sum}}{n}$

Deviation & Absolute Deviations
• Deviation is the difference (or variation) of a single measurement,
xi, away from the mean value, xavg.
 d1 = x1 – xavg
 d2 = x2 – xavg
 d3 = x3 – xavg
 ...
 dn = xn – xavg
• Absolute deviation is the magnitude of the deviation, so it is never negative.
 d1 = | x1 – xavg |
 d2 = | x2 – xavg |
 d3 = | x3 – xavg |
 ...
 dn = | xn – xavg |

Relative Deviation
• Relative deviation, Di, is the ratio of
individual absolute deviations, di, to the
mean value, xavg.
D1 = d1 / xavg = | x1 – xavg | / xavg
...
Di = di / xavg = | xi – xavg | / xavg
...

Relative Average Deviation
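• Relative average deviation (RAD) is the
absolute average deviation, davg, relative to
the mean, xavg, expressed in parts per thousand (ppt).
$d_{avg} = \dfrac{\sum_{i=1}^{n} |x_i - x_{avg}|}{n}$
$\mathrm{RAD\ (ppt)} = \dfrac{d_{avg}}{x_{avg}} \times 1000$
 A precision of 3 ppt or less is considered
very good.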
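The quantities defined so far (mean, deviations, absolute average deviation, and RAD) can be computed in a few lines. The following is a minimal sketch; the three readings are illustrative values, and the variable names are chosen here rather than taken from the lecture.

```python
# Illustrative replicate measurements (hypothetical readings).
values = [35.00, 35.05, 35.10]
n = len(values)

# Mean: xavg = Sum / n
x_avg = sum(values) / n

# Deviations di = xi - xavg and their absolute values
deviations = [x - x_avg for x in values]
abs_deviations = [abs(d) for d in deviations]

# Relative deviations Di = |xi - xavg| / xavg
relative_deviations = [d / x_avg for d in abs_deviations]

# Absolute average deviation and RAD in parts per thousand
d_avg = sum(abs_deviations) / n
rad_ppt = d_avg / x_avg * 1000

print(f"mean = {x_avg:.2f}")
print(f"absolute average deviation = {d_avg:.4f}")
print(f"RAD = {rad_ppt:.2f} ppt")
```

For these readings the RAD comes out below 3 ppt, so by the rule of thumb above the precision would count as very good.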

Standard Deviation
• Standard deviation (σ) is useful in estimating data points
distribution in the form of the Gaussian distribution (a
bell-shaped curve).
 (xavg ± σ) incorporates 68.3% of the data points.
 (xavg ± 3σ) incorporates 99.7% of the data points.
 The smaller the σ, the narrower the spread of the data points.
 d1 = x1 – xavg
 d2 = x2 – xavg
 d3 = x3 – xavg
 ...
 dn = xn – xavg
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{n} d_i^2}{n - 1}} = \sqrt{\dfrac{d_1^2 + d_2^2 + d_3^2 + \cdots + d_n^2}{n - 1}}$
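As a rough check of the formula, the sketch below computes σ with the n – 1 denominator and compares it with `statistics.stdev` from the Python standard library, which uses the same definition. The function name `sample_std` and the readings are assumptions made for illustration.

```python
import math
import statistics


def sample_std(values):
    """Standard deviation with the n - 1 denominator, following the slide."""
    n = len(values)
    x_avg = sum(values) / n
    squared_deviations = [(x - x_avg) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


values = [35.00, 35.05, 35.10]  # hypothetical readings
sigma = sample_std(values)

print(f"sigma (formula)    = {sigma:.4f}")
print(f"sigma (statistics) = {statistics.stdev(values):.4f}")
```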

Relative Standard Deviation
• Relative standard deviation (σr) is the standard
deviation relative to the mean value.
 d1 = x1 – xavg
 d2 = x2 – xavg
 d3 = x3 – xavg
 ...
 dn = xn – xavg
where n is the number of measurements
$\sigma_r = \sqrt{\dfrac{\sum_{i=1}^{n} (d_i / x_{avg})^2}{n - 1}} = \sqrt{\dfrac{D_1^2 + D_2^2 + D_3^2 + \cdots + D_n^2}{n - 1}}$
or σr (ppt) = (σ / xavg ) × 1000
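The two forms of σr give the same number, because dividing every deviation by xavg before squaring is the same as dividing σ by xavg afterwards. A short sketch with hypothetical readings shows this numerically.

```python
import math

values = [35.00, 35.05, 35.10]  # hypothetical readings
n = len(values)
x_avg = sum(values) / n

# Form 1: square the relative deviations di / xavg directly.
rel_squares = [((x - x_avg) / x_avg) ** 2 for x in values]
sigma_r_direct = math.sqrt(sum(rel_squares) / (n - 1))

# Form 2: compute sigma first, then divide by the mean.
sigma = math.sqrt(sum((x - x_avg) ** 2 for x in values) / (n - 1))
sigma_r_from_sigma = sigma / x_avg

# Expressed in parts per thousand.
sigma_r_ppt = sigma_r_from_sigma * 1000

print(f"sigma_r (direct)     = {sigma_r_direct:.6f}")
print(f"sigma_r (sigma/xavg) = {sigma_r_from_sigma:.6f}")
print(f"sigma_r              = {sigma_r_ppt:.2f} ppt")
```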

Gaussian Distribution
• Gaussian distribution gives the
distribution of data points with
respect to the true value. It gives a
bell-shaped curve as shown in the
figure.
[Figure: Gaussian (normal) distribution curve; x-axis: Standard Deviation (-3 to 3); y-axis: Probability (0 to 0.4)]
• The Gaussian equation is
$P(x) = \left[(2\pi)^{1/2}\,\sigma\right]^{-1} \exp\left[-\dfrac{(x - X)^2}{2\sigma^2}\right]$
where σ is the standard deviation and X is the true value.
 The closer to the true value, the higher the probability.
 The area under the curve (the integral of the Gaussian function) gives the fraction of data points within a given interval:
 (xtrue ± σ) incorporates 68.3% of the data points.
 (xtrue ± 3σ) incorporates 99.7% of the data points.
 (xtrue ± 3.8906σ) incorporates 99.99% of the data points.
 (xtrue ± 4.4172σ) incorporates 99.999% of the data points.
 (xtrue ± 6σ) incorporates nearly 100% of the data points.
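The coverage figures can be reproduced from the error function, since the area of a Gaussian within ±kσ of its center equals erf(k/√2). The helper name `coverage` below is an assumption made for this sketch.

```python
import math


def coverage(k):
    """Fraction of a Gaussian distribution lying within ±k·σ of its center."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3, 4.4172, 6):
    print(f"±{k}σ covers {coverage(k) * 100:.5f}% of the data points")
```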

Standard Deviation & Data Distribution
• The smaller the σ, the narrower the spread of the data points.
[Figure: Gaussian curves for σ = 0.5, σ = 1.0, and σ = 2.0; x-axis: Standard Deviation (-4 to 4); y-axis: Probability (0 to 0.8)]
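Evaluating the Gaussian equation at its center shows how σ controls the shape: the peak height is 1/(σ√(2π)), so halving σ doubles the peak and narrows the curve. The function name `gaussian` and its argument names are chosen here for illustration.

```python
import math


def gaussian(x, true_value, sigma):
    """Gaussian probability density P(x) centered on true_value."""
    exponent = -((x - true_value) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)


for sigma in (0.5, 1.0, 2.0):
    peak = gaussian(0.0, 0.0, sigma)
    print(f"sigma = {sigma}: peak height = {peak:.3f}")
```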

Approximation of Standard Deviation
• The computational cost of the formal standard deviation is relatively
high; a good approximation exists that requires much less
computation.
• š = Ř/√N
 Ř is the range of data points from the lowest value to the
highest value
Ř = xmax – xmin
 N is the number of data points.
• For a small number of measurements the approximation
is accurate enough to replace the formal standard
deviation.
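A quick comparison of the range-based estimate with the formal standard deviation, sketched for the three %S readings used in Example 1 of this lecture. The variable names are chosen here for illustration.

```python
import math
import statistics

values = [28.72, 28.40, 28.57]  # %S readings from Example 1

# Range-based approximation: š = Ř / √N
data_range = max(values) - min(values)
s_approx = data_range / math.sqrt(len(values))

# Formal standard deviation (n - 1 denominator) for comparison
s_formal = statistics.stdev(values)

print(f"range-based estimate = {s_approx:.3f}")
print(f"formal std deviation = {s_formal:.3f}")
```

For this data set the two values differ in the second decimal place, which gives a feel for how rough the shortcut is.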

Accuracy and Precision

Accuracy & Precision of Measurements
• Accuracy is a measure of the closeness of a measured quantity to
the true value.
• Precision is a measure of the agreement of replicate
measurements.
• Measurements can be precise but not accurate or accurate but not
precise or neither. The best result is, of course, accurate and
precise.
[Figure: four target diagrams labeled "accurate & precise", "precise but not accurate", "not accurate & not precise", and "accurate but not precise"]

Example 1 of Accuracy and Precision
• Measured %S values in H2SO4 are 28.72%, 28.40%, and 28.57%,
where the true value is 32.69%. Determine the accuracy and
precision.
• Answer:
 Mean = (28.72% + 28.40% + 28.57%) / 3 = 28.56%
 Estimated precision by using the approximation: š = Ř / √N
 š = (28.72 – 28.40)% / √3
= 0.32% / 1.732 = 0.18%
 Relative standard deviation: sr = š / xM, where xM is the mean
 sr = 0.18% / 28.56% = 0.0063
 Accuracy = |X − xM| = | 32.69% − 28.56% | = 4.13%
 Relative accuracy = Accuracy / True value
= 4.13% / 32.69% = 0.126
• These results indicate that the data are precise but inaccurate.

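Example 2 of Accuracy and Precision
• Measured %S values in H2SO4 are 28.89%, 32.56%, and 36.64%,
where the true value is 32.69%. Determine the accuracy and
precision.
• Answer:
 Mean = (28.89% + 32.56% + 36.64%) / 3 = 32.70%
 š = (36.64 – 28.89)% / √3
= 7.75% / 1.732 = 4.47%
 sr = 4.47% / 32.70% = 0.137
 Accuracy = |X − xM| = | 32.69% − 32.70% | = 0.01%
 Relative accuracy = Accuracy / True value
= 0.01% / 32.69% = 0.0003
• These results indicate that the data are imprecise but accurate.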

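Example 3 of Accuracy and Precision
• Measured %S values in H2SO4 are 25.62%, 33.56%, and 27.93%,
where the true value is 32.69%. Determine the accuracy and
precision.
• Answer:
 Mean = (25.62% + 33.56% + 27.93%) / 3 = 29.04%
 š = (33.56 – 25.62)% / √3
= 7.94% / 1.732 = 4.58%
 sr = 4.58% / 29.04% = 0.158
 Accuracy = |X − xM| = | 32.69% − 29.04% | = 3.65%
 Relative accuracy = Accuracy / True value
= 3.65% / 32.69% = 0.112
• These results indicate that the data are imprecise and inaccurate.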
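The three worked answers follow the same recipe, so they can be checked with one small function. This is a sketch only: the function name `assess`, the constant `TRUE_S`, and the dictionary keys are chosen here, and the readings are the ones that appear in the Mean line of each answer.

```python
import math


def assess(values, true_value):
    """Return the mean, range-based š, relative š, accuracy, and relative accuracy."""
    n = len(values)
    mean = sum(values) / n
    s_approx = (max(values) - min(values)) / math.sqrt(n)  # š = Ř / √N
    accuracy = abs(true_value - mean)
    return {
        "mean": mean,
        "s_approx": s_approx,
        "rel_std": s_approx / mean,
        "accuracy": accuracy,
        "rel_accuracy": accuracy / true_value,
    }


TRUE_S = 32.69  # true %S in H2SO4, as given in the examples

examples = {
    "Example 1": [28.72, 28.40, 28.57],
    "Example 2": [28.89, 32.56, 36.64],
    "Example 3": [25.62, 33.56, 27.93],
}

results = {name: assess(vals, TRUE_S) for name, vals in examples.items()}

for name, r in results.items():
    print(
        f"{name}: mean = {r['mean']:.2f}%, s = {r['s_approx']:.2f}%, "
        f"sr = {r['rel_std']:.4f}, accuracy = {r['accuracy']:.2f}%, "
        f"relative accuracy = {r['rel_accuracy']:.4f}"
    )
```

Small differences in the last digit relative to the hand calculations are expected, because the code carries the unrounded mean through every step.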

Data Rejection

Data Rejection
• Replicate measurements of a given quantity are usually
scattered.
 Some values are closer than others.
• Which values to keep (or which values to discard)
 If a single result differs greatly from the others because of
an identifiable error by the experimenter, then this result
should be discarded.
 If a result is significantly “off”, but no error can be identified in the
experiment, then the result should, in general, be kept.
• If in doubt, use the rejection coefficient Q test.
• Do not discard any result just to get “good precision”.

Q Test
• Q test is used to test the extreme values (the highest and lowest
values)
• Procedure
 Calculate the range
 Range = xmax – xmin
 Calculate the difference between each extreme value and its nearest
neighbor
 dhi = xmax – xnbor,hi; dlo = | xmin – xnbor,lo |
 Calculate the ratio (Q value) between the difference and the range
 Qhi = dhi / Range ;Qlo = dlo / Range
• Compare the resulting Q value with the rejection table at 90%
confidence level (or other selected confidence level)
 If the calculated Q value is greater than the Q value given in the table, then
reject the value.

Rejection Q Tables
Number of Data   Q90    Q96    Q99
3                0.94   0.98   0.99
4                0.76   0.85   0.93
5                0.64   0.73   0.82
6                0.56   0.64   0.74
7                0.51   0.59   0.68
8                0.47   0.54   0.63
9                0.44   0.51   0.60
10               0.41   0.48   0.57

Q Test - Example
• Data: 35.00, 35.05, 35.10, 35.80
• Calculate the range
 Range = xmax – xmin= 35.80 – 35.00 = 0.80
• Calculate the difference between each extreme value and its
nearest neighbor.
 dhi = xmax – xnbor,hi = 35.80 – 35.10 = 0.70
 dlo = | xmin – xnbor,lo | = | 35.00 – 35.05 | = 0.05
• Calculate Q values between the difference and the range.
 Qhi = dhi / Range = 0.70 / 0.80 = 0.88
 Qlo = dlo / Range = 0.05 / 0.80 = 0.063
• Compare the resulting Q value with the rejection table at 90%
confidence level.
 For 4 samples, the Q value in the table is 0.76
 Qhi > 0.76; therefore, the highest value 35.80 can be dropped
 Once the value is dropped, it is no longer in the data set and should not
be used for the calculations of mean and various deviations.
Number of Data   Q90
3                0.94
4                0.76
5                0.64
6                0.56
7                0.51
8                0.47
9                0.44
10               0.41
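The Q test procedure can be sketched as a short function. The name `q_test`, the returned dictionary keys, and the `Q90` lookup are chosen here for illustration; the critical values are the Q90 column of the rejection table, and the data are those of the example above.

```python
# Critical Q values at the 90% confidence level (from the rejection table).
Q90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56, 7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}


def q_test(values, table=Q90):
    """Apply the Q test to the highest and lowest values of a data set."""
    data = sorted(values)
    n = len(data)
    if n not in table:
        raise ValueError("the table only covers 3 to 10 data points")
    data_range = data[-1] - data[0]
    if data_range == 0:
        raise ValueError("all values are identical; nothing to test")
    q_hi = (data[-1] - data[-2]) / data_range  # highest value vs. its neighbor
    q_lo = (data[1] - data[0]) / data_range    # lowest value vs. its neighbor
    q_crit = table[n]
    return {
        "q_hi": q_hi,
        "q_lo": q_lo,
        "q_crit": q_crit,
        "reject_hi": q_hi > q_crit,
        "reject_lo": q_lo > q_crit,
    }


data = [35.00, 35.05, 35.10, 35.80]
result = q_test(data)
print(f"Q_hi = {result['q_hi']:.3f}, Q_lo = {result['q_lo']:.3f}, "
      f"critical Q = {result['q_crit']}")

# Drop the highest value only if the test rejects it, then recompute the mean.
kept = sorted(data)[:-1] if result["reject_hi"] else list(data)
mean_kept = sum(kept) / len(kept)
print(f"kept values = {kept}, mean = {mean_kept:.2f}")
```

Only the value flagged by the test is removed, and the mean is then recomputed from the remaining data, in line with the last bullet of the example.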
