Normal curve in Biostatistics data inference and applications

NORMAL CURVE
By
Dr.M.S.Bala Vidyadhar

 Introduction
 Basis for statistical analysis
 Probability Distributions
 Normal Distribution/Curve
 History
 Description
 Standard normal Variate
 Variations

 Normal interpretation
 Comparisons
 Normality tests
 Conclusion
 Previous Year Questions
 References

 The word statistics comes from the Italian word ‘statista’,
meaning statesman, or the German word ‘statistik’, which
means political state.
 The science of statistics has existed since the time of early
Egypt and the Roman Empire, where it was used to count families.
 John Graunt (1620-1674) is known as the Father of
Health Statistics.

 Statistics is the science of compiling, classifying and
tabulating numerical data and expressing the results in
a mathematical or graphical form.
 Biostatistics is that branch of statistics concerned with
mathematical facts and data related to Biological
events.

 Statistical analyses are based on three primary entities:
1. (U) Population of interest,
2. (V) set of characteristics or variables of the units of
this population,
3. (P) Probability Distribution of the variables in the
given population.

 The probability distribution is the most crucial link between
the population and its variables, which allows us to draw
inferences on the population based on the sample observations.
 It is a way to enumerate the different values the
variable can have, and how frequently each value
appears in the population.

 The three probability distributions useful in
medicine/health care are:
1. Normal distribution
2. Binomial distribution
3. Poisson distribution.

 Binomial distribution:
Useful where an event or variable has only binary
outcomes (e.g. yes/no; positive/negative).
 Poisson distribution:
Useful where the outcome is the number of times an
event occurs in the population, hence very helpful in
determining the probability of rare events/diseases.
 Both these distributions are applicable to discrete data
only.

 When data is collected from a very large population
and a frequency distribution is made with narrow class
intervals, the resulting curve is smooth, symmetrical
and is called a normal distribution curve.
 It is also called the Gaussian distribution.
 The normal distribution is continuous, so it can take on
any value.

 The normal distribution was first discovered by Abraham de Moivre
and published in 1733.

Two mathematician-astronomers, Pierre-Simon Laplace (France)
and Carl Friedrich Gauss (Germany), established the scientific
principles of the normal distribution.
Gauss’ name was given to the distribution because he applied it to
the theory of the motions of heavenly bodies.

 The normal distribution curve is a smooth, bell shaped
curve and is symmetric about the mean of the
distribution, symbolized by the letter μ(mu).
 The standard deviation is denoted by the Greek letter
sigma(σ).
 Sigma is the horizontal distance between mean and the
point of inflection on the curve.

 The mean and the standard deviation are the two
parameters that completely determine the location on
the number line and the shape of the normal curve.
 Thus many normal curves are possible, one for
every value of the mean and standard deviation, but
every probability distribution curve has an area under
the curve equal to 1.

 The curve will be normal for values like height, weight,
hemoglobin, PCV, BP, etc.
 For all normal curves the mean, median and mode are
equal and coincide on the graph.
 The points 1 SD to the left and to the right of the
mean are marked as confidence limits.

 In a normal curve, the individual subjects are
distributed symmetrically about the mean, in units
of SD.
 The area between -1 SD and +1 SD contains 68.27% of
the observations.
 The area between -2 SD and +2 SD contains 95.45% of
the observations.
 The area between -3 SD and +3 SD contains 99.73% of
the observations of the population.
Confidence Limits

 These limits are called the confidence limits, and the range
between the two is called the confidence interval.
 Observations lying within -2 SD to +2 SD (about 95% of the
population) are not regarded as significantly different from
the population mean at the 5% level of significance.
 Data lying outside this range are said to be significantly
different from the population mean value.
 Such extreme values will occur only about 5 times in 100
observations.
 As the normal curve is symmetrical, the coefficient of skewness
is equal to 0.

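 The central limit theorem states that under certain
(fairly common) conditions, the sum of many random
variables will have an approximately normal
distribution.
 More specifically, where X1, …, Xn are independent
and identically distributed random variables with the
same arbitrary distribution, zero mean, and variance
σ², and Z is their mean scaled by √n,
$$Z = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right),$$
then, as n increases, the probability distribution of Z
tends to the normal distribution with zero mean and
variance σ².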
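The theorem can be illustrated with a small simulation. This is only a sketch under stated assumptions: the underlying distribution is taken to be Uniform(-1, 1) (zero mean, variance 1/3), the scaling is taken to be √n as in the standard statement of the theorem, and the sample sizes and seed are arbitrary choices, not values from the slides.

```python
import random
import statistics

def scaled_mean(n, rng):
    """Return Z = sqrt(n) * (mean of n draws from Uniform(-1, 1))."""
    draws = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return (n ** 0.5) * statistics.fmean(draws)

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
z_values = [scaled_mean(50, rng) for _ in range(5000)]

clt_mean = statistics.fmean(z_values)          # should be close to 0
clt_variance = statistics.pvariance(z_values)  # should be close to 1/3

# Share of simulated Z values within one theoretical SD of zero;
# for a normal distribution this share is about 68%.
sigma = (1.0 / 3.0) ** 0.5
within_one_sd = sum(abs(z) <= sigma for z in z_values) / len(z_values)

print(clt_mean, clt_variance, within_one_sd)
```

Although each individual draw is uniform rather than bell shaped, the scaled means behave approximately like a normal variable with the same variance.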

 A normal distribution with parameters μ and σ has the
following properties.
 The curve is bell-shaped.
a. It is symmetrical (non-skew).
b. The mean, median and mode are equal.
 The curve is asymptotic to the X-axis. That is, the
curve touches the X-axis only at -∞ and +∞.
 The curve has points of inflexion at μ - σ and μ + σ.

 For the distribution:
a. Standard deviation = σ
b. Quartile deviation = 2/3 σ (approximately)
c. Mean deviation = 4/5 σ (approximately)
 For the central moments of the distribution:
a. The odd order moments are equal to zero.
b. The even order moments are given by
$$\mu_{2n} = 1 \cdot 3 \cdot 5 \cdots (2n-1)\,\sigma^{2n}$$
Thus, μ2 = σ² and μ4 = 3σ⁴.
 The distribution is mesokurtic. That is, β2 = 3.
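The approximate factors 2/3 and 4/5 can be checked numerically. The sketch below assumes σ = 1 for convenience and uses the standard results that the quartile deviation is half the interquartile range and that the mean absolute deviation of a normal variable is σ√(2/π).

```python
import math
from statistics import NormalDist

sigma = 1.0  # assumed value; both quantities scale linearly with sigma
dist = NormalDist(mu=0.0, sigma=sigma)

# Quartile deviation = (Q3 - Q1) / 2
q1 = dist.inv_cdf(0.25)
q3 = dist.inv_cdf(0.75)
quartile_deviation = (q3 - q1) / 2

# Mean (absolute) deviation of a normal variable = sigma * sqrt(2 / pi)
mean_deviation = sigma * math.sqrt(2 / math.pi)

print(round(quartile_deviation, 4))  # about 0.6745, close to 2/3
print(round(mean_deviation, 4))      # about 0.7979, close to 4/5
```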

 Total area under the curve is unity.
 P[a < X ≤ b] = area bounded by the curve
and the ordinates at a and b.
a. P[μ - σ < X ≤ μ + σ] = 0.6826 = 68.26%
b. P[μ - 2σ < X ≤ μ + 2σ] = 0.9544 = 95.44%
c. P[μ - 3σ < X ≤ μ + 3σ] = 0.9974 = 99.74%
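These three areas can be reproduced without a table. A minimal sketch, using the identity that the area within k standard deviations of the mean equals erf(k/√2); the function name `area_within` is only illustrative.

```python
import math

def area_within(k):
    """Area under any normal curve between mu - k*sigma and mu + k*sigma."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The computed values (about 0.6827, 0.9545 and 0.9973) differ from table-based figures only in the fourth decimal place, because the table entries are themselves rounded.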

 Deviation of an individual observation from the mean
in a normal distribution or curve is called standard
normal variate and is given the symbol Z.
 It is measured in terms of standard deviations (SDs)
and indicates how such an observation is bigger or
smaller than mean in units of standard deviation.

 So Z will be a ratio, calculated as
Z = (X - μ)/σ
 where X stands for the individual observation, whereas μ
and σ stand for mean and SD as usual.
 Z is also called the standard normal deviate or relative
normal deviate.

 The normal curve is completely determined by two
parameters mean(µ) and SD(σ).
 So, a different normal distribution is specified for each
different value of µ and σ.
 Variations of mean and SD values affect the normal
curve in different ways.

The effects of µ and SD
[Figure: one panel shows how the standard deviation affects the shape of f(x), for σ = 2, 3 and 4; the other shows how the mean value affects the location of f(x), for µ = 10, 11 and 12.]

 Different values of μ shift the graph of the distribution
along the x-axis.

 Different values of σ (SD) determine the degree of
flatness or peakedness of the graph of the distribution.

 Details of area under cumulative normal distribution
can also be plotted.
 It shows the cumulative probability by levels of
mean ± Standard error.

 A curve can depart from the normal curve in two
ways:
1. Skewness
2. Kurtosis
 The normal curve is symmetric. Frequently, however,
our data distributions, especially with small sample
sizes, will show some degree of asymmetry, or
departure from symmetry.

 Skewness is a statistic to measure the degree of
asymmetry.
 If the distribution has a longer "tail" to the right of the
peak than to the left, the distribution is skewed to the
right or has positive skewness.
 If the reverse is true the distribution is said to be
skewed to the left or to have a negative skewness.

 The value of skewness can be computed by
Where X is each individual score.
 The value of skewness is zero when the distribution is a
completely symmetric bell shaped curve.
 A positive value indicates that the distribution is skewed to
the right (i.e. positive skewness) and a negative value
indicates that the distribution is skewed to the left
(i.e. negative skewness).

 While skewness describes the degree of symmetry of a
distribution, kurtosis measures the height of a
distribution curve.
 To compute kurtosis, we use the formula

 A positive kurtosis indicates that the distribution has a
relatively high peak; this is called leptokurtic.
 A negative kurtosis indicates that the distribution is
relatively flat topped; this is called platykurtic.
 A normal distribution has a kurtosis of zero on this
scale (excess kurtosis, i.e. β2 - 3); this is called mesokurtic.
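Skewness and kurtosis can be computed directly from the scores. The sketch below assumes the population-moment definitions (third and fourth central moments divided by the cube and fourth power of the SD, with 3 subtracted so that a normal distribution scores zero). This is one common convention and not necessarily the formula used on the slides; the two data sets are made up for illustration.

```python
import statistics

def skewness(data):
    """Third central moment divided by SD cubed (population form)."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    n = len(data)
    return sum((x - mean) ** 3 for x in data) / (n * sd ** 3)

def excess_kurtosis(data):
    """Fourth central moment divided by SD^4, minus 3 (normal = 0)."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    n = len(data)
    return sum((x - mean) ** 4 for x in data) / (n * sd ** 4) - 3

symmetric = [1, 2, 3, 4, 5]           # hypothetical scores
right_tailed = [1, 1, 2, 2, 3, 10]    # hypothetical scores with a long right tail

print(skewness(symmetric), skewness(right_tailed))
print(excess_kurtosis(symmetric))
```

The symmetric set has zero skewness and a negative (platykurtic) kurtosis, while the set with the long right tail has positive skewness.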

 Skewness and Kurtosis provide distributional
information about the data.
 In statistical tests that assume the data are normally
distributed, skewness and kurtosis can be used to examine this
assumption, called normality.

 With measurements whose distributions are not normal, a simple
transformation of the scale of the measurement may induce
approximate normality.
 The square root √x, and the logarithm, log x, are often used as
transformations in this way.
 These transformations allow more flexible use of some
tests of significance, such as Student's t-test.
Non Normal Distributions

 Even if the distribution in the original population is far from
normal, the distribution of sample averages tends to become
normal, under a wide variety of conditions, as the size of the
sample increases.
 This is the single most important reason for the use of the
normal distribution.

 Also, many results that are useful in statistical work,
although strictly true only when the population is
normal, hold well enough for rough and ready use
when samples come from non-normal populations.
 When presenting such results, we can indicate how
well they stand up under non-normality.

A random variable X with mean µ and standard
deviation σ is normally distributed if its probability
density function is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty,$$
where π = 3.14159... and e = 2.71828...

Normal distributions are bell shaped, and
symmetrical around µ.

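Why symmetrical? Let µ = 100. Suppose x = 110.
$$f(110) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{110-100}{\sigma}\right)^{2}} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{10}{\sigma}\right)^{2}}$$
Now suppose x = 90.
$$f(90) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{90-100}{\sigma}\right)^{2}} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{-10}{\sigma}\right)^{2}}$$
Since (-10/σ)² = (10/σ)², f(110) = f(90).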
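The symmetry argument can be checked numerically. This sketch evaluates the density function at 90 and 110 with µ = 100; the value σ = 15 is an assumption chosen only for illustration, since the argument holds for any σ.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 100.0, 15.0  # sigma is a hypothetical value
left = normal_pdf(90.0, mu, sigma)
right = normal_pdf(110.0, mu, sigma)

print(left, right)  # equal heights at equal distances from the mean
```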

 The expected value (also called the mean) E(X) (or
µ) can be any number
 The standard deviation σ can be any positive
number
 The total area under every normal curve is 1
 There are infinitely many normal distributions


A family of bell-shaped curves that differ
only in their means and standard
deviations.
µ = the mean of the distribution
σ = the standard deviation
[Figures: normal curves on an X axis from 0 to 12. First, µ = 3 and σ = 1; next, µ = 3 and σ = 1 compared with µ = 6 and σ = 1; last, µ = 6 and σ = 2 compared with µ = 6 and σ = 1.]
Probability = area under the density curve
P(6 < X < 8) = area under the density curve
between 6 and 8 (here µ = 6 and σ = 2).
[Figure: normal curve with µ = 6 and σ = 2 on an X axis from 0 to 12; the area between 6 and 8 is shaded.]

Probabilities: area under the graph of f(x)
P(a < X < b) = area under the density curve
between a and b.
P(X = a) = 0
P(a < X < b) = P(a ≤ X ≤ b)
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
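The statement that probability is the area under the density curve can be made concrete for P(6 < X < 8) with µ = 6 and σ = 2. The sketch below compares a simple trapezoid-rule integration of f(x) with the difference of two cumulative probabilities; the step count is an arbitrary choice.

```python
from statistics import NormalDist

def trapezoid_area(dist, a, b, steps=10_000):
    """Approximate the integral of dist.pdf from a to b by the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (dist.pdf(a) + dist.pdf(b))
    for i in range(1, steps):
        total += dist.pdf(a + i * h)
    return total * h

x_dist = NormalDist(mu=6, sigma=2)
area_numeric = trapezoid_area(x_dist, 6, 8)
area_cdf = x_dist.cdf(8) - x_dist.cdf(6)

print(round(area_numeric, 4), round(area_cdf, 4))
```

Both routes give about 0.3413, the same value as P(0 < Z < 1), because 8 lies exactly one standard deviation above the mean.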

 Suppose X ~ N(µ, σ).
 Form a new random variable by subtracting the mean μ
from X and dividing by the standard deviation σ:
(X - µ)/σ
 This process is called standardizing the random
variable X.

 (X - µ)/σ is also a normal random variable; we will
denote it by Z:
Z = (X-µ)/σ
 Z has mean 0 and standard deviation 1:
E(Z) = 0; SD(Z) = 1.
 The probability distribution of Z is called the standard
normal distribution.

 If X has mean µ and standard deviation σ, standardizing a particular
value of x tells how many standard deviations x is above or
below the mean µ.
 Exam 1: µ = 80, σ = 10; exam 1 score: 92
Exam 2: µ = 80, σ = 8; exam 2 score: 90
Which score is better?
z1 = (92 - 80)/10 = 12/10 = 1.2
z2 = (90 - 80)/8 = 10/8 = 1.25
90 on exam 2 is better than 92 on exam 1.
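The exam comparison is a direct application of standardizing. A minimal sketch, with `z_score` as an illustrative helper name:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_exam1 = z_score(92, mean=80, sd=10)
z_exam2 = z_score(90, mean=80, sd=8)

print(z_exam1, z_exam2)  # 1.2 1.25
better = "exam 2" if z_exam2 > z_exam1 else "exam 1"
print(better)
```

Although 92 is the higher raw score, 90 on exam 2 lies further above its own mean in SD units, so it is the better relative performance.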
[Figure: the curve for X with µ = 6 and σ = 2 is mapped by (X-6)/2 onto the curve for Z with µ = 0 and σ = 1; the area on each side of the centre is .5.]
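 A normal random variable x has the following pdf:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \qquad -\infty < x < \infty$$
For Z ~ N(0, 1), substitute 0 for µ and 1 for σ; the pdf for the standard normal rv becomes
$$\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^{2}}, \qquad -\infty < z < \infty$$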
Z = standard normal random variable
µ = 0 and σ = 1
[Figure: standard normal curve on a Z axis from -3 to 3; the area on each side of 0 is .5.]

 Table Z is the standard Normal table. We have to convert our data
to z-scores before using the table.
 The figure shows us how to find the area to the left when we
have a z-score of 1.80:

P(0 < Z < 1) = .8413 - .5 = .3413
[Figure: standard normal curve; area .50 to the left of 0, .3413 between 0 and 1, and .1587 to the right of 1.]

Standard normal probabilities have been
calculated and are provided in table Z.
The tabulated probabilities correspond
to the area between Z = -∞ and some z0:
P(-∞ < Z < z0)
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
…
…
…
…
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
…
…
…
…
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
…
…
…
…

 Example (continued): X ~ N(60, 8)
P(X < 70) = P((X - 60)/8 < (70 - 60)/8)
= P(z < 1.25)
= 0.8944
In this example z0 = 1.25; the table entry in row 1.2, column 0.05 is 0.8944.
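The table look-up can be cross-checked in software. The sketch below uses Python's `statistics.NormalDist` and reads N(60, 8) as mean 60 and standard deviation 8, as the slide's arithmetic implies.

```python
from statistics import NormalDist

x_dist = NormalDist(mu=60, sigma=8)

# Direct route: cumulative probability of X at 70
p_direct = x_dist.cdf(70)

# Table route: standardize first, then use the standard normal curve
z0 = (70 - 60) / 8
p_via_z = NormalDist().cdf(z0)

print(z0, round(p_direct, 4), round(p_via_z, 4))
```

Both routes agree with the tabulated value 0.8944.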

 P(0 ≤ z ≤ 1.27) = .8980 - .5 = .3980
[Figure: standard normal curve with the area .3980 between 0 and 1.27 shaded.]

P(Z ≥ .55) = A1
= 1 - A2
= 1 - .7088
= .2912
[Figure: A2 is the area to the left of .55 and A1 the area to its right.]

 P(-2.24 ≤ z ≤ 0) = .5 - .0125 = .4875
[Figure: area .0125 to the left of -2.24; area .4875 between -2.24 and 0.]

 P(-1.18 ≤ z ≤ 2.73) = A - A1
= .9968 - .1190
= .8778
[Figure: A = .9968 is the area to the left of 2.73; A1 = .1190 is the area to the left of -1.18.]

vi) P(-1 ≤ Z ≤ 1) = 0.8413 - 0.1587 = 0.6826
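Each of the standard normal areas worked out above is a difference of cumulative probabilities, so they can all be verified in a few lines. A sketch using `statistics.NormalDist`; the helper name `area_between` is illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, SD 1

def area_between(lo, hi):
    """P(lo <= Z <= hi) as a difference of two cumulative areas."""
    return Z.cdf(hi) - Z.cdf(lo)

results = {
    "P(0 <= z <= 1.27)": area_between(0, 1.27),
    "P(Z >= 0.55)": 1 - Z.cdf(0.55),
    "P(-2.24 <= z <= 0)": area_between(-2.24, 0),
    "P(-1.18 <= z <= 2.73)": area_between(-1.18, 2.73),
    "P(-1 <= Z <= 1)": area_between(-1, 1),
}

for label, value in results.items():
    print(f"{label} = {value:.4f}")
```

Small differences in the fourth decimal place from the table-based answers come from rounding in the printed table.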

6. P(z < k) = .2514
Is k positive or negative? Consider the direction of the inequality and the
magnitude of the probability: the area to the left of k is less than .5, so k
is negative.
Look up .2514 in the body of the table; the corresponding entry is -.67, so k = -.67.
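Finding k from a given area is the reverse of a table look-up. In software this is the inverse cumulative distribution function; the sketch below uses `NormalDist.inv_cdf` from the Python standard library.

```python
from statistics import NormalDist

# Solve P(z < k) = 0.2514 for k
k = NormalDist().inv_cdf(0.2514)

print(round(k, 2))
```

Because the area to the left of k is below 0.5, the result is negative, in agreement with the table value of -.67.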

(In the following examples X has mean 275 and standard deviation 43.)
P(X > 250) = P(Z > (250 - 275)/43)
= P(Z > -25/43)
= P(Z > -.58) = 1 - .2810 = .7190

ix) P(225 < x < 375)
= P((225 - 275)/43 < (x - 275)/43 < (375 - 275)/43)
= P(-1.16 < z < 2.33) = .9901 - .1230 = .8671

.9846 = P(x < k) = P((x - 275)/43 < (k - 275)/43)
= P(z < (k - 275)/43)
(k - 275)/43 = 2.16 (from standard normal table)
k = 2.16(43) + 275 = 367.88

P(Z < 2.16) = .5 + .4846 = .9846
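The three calculations for a variable with mean 275 and standard deviation 43 can be repeated without rounding z to two decimals. This is a sketch using `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

x_dist = NormalDist(mu=275, sigma=43)

p_above_250 = 1 - x_dist.cdf(250)                 # P(X > 250)
p_225_to_375 = x_dist.cdf(375) - x_dist.cdf(225)  # P(225 < X < 375)

# Value k with P(X < k) = 0.9846: convert the z value back to the X scale
z_k = NormalDist().inv_cdf(0.9846)
k = 275 + 43 * z_k

print(round(p_above_250, 4), round(p_225_to_375, 4), round(k, 2))
```

The results differ slightly from the table-based answers (.7190, .8671 and 367.88) because the table route rounds each z value to two decimal places before the look-up.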

 Regulate blue dye for mixing paint; machine can be set to
discharge an average of μ ml./can of paint.
 Amount discharged: N(µ, .4 ml). If more than 6 ml.
discharged into paint can, shade of blue is unacceptable.
 Determine the setting μ so that only 1% of the cans of
paint will be unacceptable

X = amount of dye discharged into can
X ~ N(µ, .4); determine µ so that
P(X > 6) = .01

.01 = P(X > 6) = P((X - µ)/.4 > (6 - µ)/.4) = P(z > (6 - µ)/.4)
(6 - µ)/.4 = 2.33 (from standard normal table)
µ = 6 - 2.33(.4) = 5.068
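The machine setting can also be solved with the inverse cumulative distribution function instead of the table. A sketch under the problem's stated values (SD 0.4 ml, limit 6 ml, 1% unacceptable); the variable names are illustrative.

```python
from statistics import NormalDist

sd = 0.4               # ml, standard deviation of the amount discharged
limit = 6.0            # ml, above this the shade is unacceptable
max_bad_share = 0.01   # at most 1% of cans may exceed the limit

# z value with 99% of the area to its left (about 2.33)
z_cut = NormalDist().inv_cdf(1 - max_bad_share)

# Place the mean z_cut standard deviations below the limit
mu_setting = limit - z_cut * sd

# Check: share of cans that exceed the limit at this setting
bad_share = 1 - NormalDist(mu=mu_setting, sigma=sd).cdf(limit)

print(round(z_cut, 2), round(mu_setting, 3), round(bad_share, 4))
```

The setting comes out marginally above 5.068 ml because the unrounded z value is slightly smaller than the table value 2.33.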

 In statistics, normality tests are used to determine if a data set
is well-modelled by a normal distribution and to compute how
likely it is for a random variable underlying the data set to
be normally distributed.
 More precisely, the tests are a form of model selection, and
can be interpreted several ways, depending on
one's interpretations of probability.

Graphical methods
 An informal approach to testing normality is to compare a histogram
of the sample data to a normal probability curve.
 The empirical distribution of the data (the histogram) should be
bell-shaped and resemble the normal distribution. This might be
difficult to see if the sample is small.
 In this case one might proceed by regressing the data against the
quantiles of a normal distribution with the same mean and variance
as the sample.
 Lack of fit to the regression line suggests a departure from
normality.

 A graphical tool for assessing normality is the normal
probability plot, a quantile-quantile plot (QQ plot) of the
standardized data against the standard normal distribution.
 Here the correlation between the sample data and normal
quantiles (a measure of the goodness of fit) measures how well
the data is modeled by a normal distribution.
 For normal data the points plotted in the QQ plot should fall
approximately on a straight line, indicating high positive
correlation.
 These plots are easy to interpret and also have the benefit that
outliers are easily identified.
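The correlation behind a QQ plot can be computed without drawing the plot. The sketch below makes several assumptions that are not in the slides: plotting positions (i - 0.5)/n for the theoretical quantiles, simulated data from Python's `random` module, and an arbitrary seed and sample size.

```python
import random
from statistics import NormalDist, fmean

def qq_correlation(sample):
    """Correlation between sorted data and standard normal quantiles."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()
    # Theoretical quantiles at plotting positions (i - 0.5) / n (an assumed convention)
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mean_x = fmean(ordered)
    mean_t = fmean(theoretical)
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(ordered, theoretical))
    var_x = sum((x - mean_x) ** 2 for x in ordered)
    var_t = sum((t - mean_t) ** 2 for t in theoretical)
    return cov / (var_x * var_t) ** 0.5

rng = random.Random(1)  # arbitrary seed for repeatability
normal_like = [rng.gauss(50, 10) for _ in range(200)]
skewed = [rng.expovariate(1.0) for _ in range(200)]

r_normal = qq_correlation(normal_like)
r_skewed = qq_correlation(skewed)
print(r_normal, r_skewed)
```

Data drawn from a normal distribution give a correlation very close to 1, while the strongly skewed exponential sample gives a visibly lower value.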

 A simple back-of-the-envelope test takes the sample maximum
and minimum and computes their z-score, or more properly t-
statistic (number of sample standard deviations that a sample
is above or below the sample mean), and compares it to the
68–95–99.7 rule.
 This test is useful in cases where one faces kurtosis risk
– where large deviations matter – and has the benefits that it is
very easy to compute and to communicate: non-statisticians
can easily grasp that “6σ events don’t happen in normal
distributions”.
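The back-of-the-envelope check amounts to standardizing the sample extremes with the sample mean and sample SD. A minimal sketch; the data values are hypothetical, with one deliberately extreme reading.

```python
import statistics

def extreme_z_scores(sample):
    """Return (z of minimum, z of maximum) using the sample mean and sample SD."""
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return (min(sample) - mean) / sd, (max(sample) - mean) / sd

# Hypothetical measurements; 9.0 is an intentionally extreme value
data = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 9.0]
z_min, z_max = extreme_z_scores(data)

print(round(z_min, 2), round(z_max, 2))
```

In a sample of only ten observations, a maximum nearly three sample SDs above the mean is far more extreme than the 68–95–99.7 rule would lead one to expect, which flags the reading for closer inspection.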

Tests of univariate normality include
 D'Agostino’s K-squared test
 Jarque–Bera test
 Anderson–Darling test
 Cramér–von Mises criterion
 Lilliefors test for normality (itself an adaptation of the
Kolmogorov–Smirnov test)
 Shapiro–Wilk test
 Pearson’s chi-squared test
 Shapiro–Francia test.

 A 2011 paper from The Journal of Statistical Modeling and
Analytics, comparing the Shapiro-Wilk, Kolmogorov-Smirnov,
Lilliefors, and Anderson-Darling tests, concludes that
Shapiro-Wilk has the best power for a given significance,
followed closely by Anderson-Darling.
 Ralph B. D'Agostino (1986). “Tests for the Normal
Distribution”. In D'Agostino, R.B. and Stephens, M.A.
Goodness-of-Fit Techniques. New York: Marcel Dekker. ISBN
0-8247-7487-6.

 More recent tests of normality include the energy test (Székely and
Rizzo) and the tests based on the empirical characteristic function
(ecf) (e.g. Epps and Pulley, Henze–Zirkler, BHEP (Baringhaus–
Henze–Epps–Pulley multivariate normality test)).
 The energy and the ecf tests are powerful tests that apply for testing
univariate or multivariate normality and are statistically consistent
against general alternatives.

 Kullback–Leibler divergences between the whole posterior distributions
of the slope and variance do not indicate non-normality.
 However, the ratio of expectations of these posteriors and the
expectation of the ratios give similar results to the Shapiro–Wilk statistic
except for very small samples, when non-informative priors are used.
 Spiegelhalter suggests using a Bayes factor to compare normality with a
different class of distributional alternatives. This approach has been
extended by Farrell and Rogers-Stewart.

 One application of normality tests is to the residuals from a
linear regression model. If they are not normally
distributed, the residuals should not be used in Z tests or in
any other tests derived from the normal distribution, such as
t tests, F tests and chi-squared tests.
 If the residuals are not normally distributed, then the
dependent variable or at least one explanatory variable may
have the wrong functional form, or important variables may
be missing, etc.
 Correcting one or more of these systematic errors may
produce residuals that are normally distributed.
Results of Normality tests

 Most of the statistical analyses presented are based on
the bell-shaped or normal distribution.
 The major importance of the normal distribution is the
statistical inference of how often an observation can
occur normally in a population.
 The normal distribution is the most important and most
widely used distribution in statistics.
 The normal distribution is very useful in practice and
makes statistical analysis easy.

1. Essentials of community dentistry by Soben Peter.
2. Basic and Clinical Biostatistics by Dawson and Trapp.
3. Biostatistics by Dr.Vishweswara Rao.
4. Health Research Methodology by Okolo.
5. Biostatistics by Sarmakaddam
6. Biostatistics by Kim and Dialey
7. http://en.wikipedia.org/w/index.php?title=File:Normal_Dis
tribution_PDF
8. Introduction to Normal Distributions by David M. Lane
9. Introduction to Statistics Online Edition by David M.
Lane. Other authors: David Scott, Mikki Hebl, Rudy
Guerra, Dan Osherson, and Heidi Ziemer

 Spiegelhalter, D.J. (1980). An omnibus test for
normality for small samples. Biometrika, 67, 493–496.
doi:10.1093/biomet/67.2.493
 Farrell, P.J., Rogers-Stewart, K. (2006)
“Comprehensive study of tests for normality and
symmetry: extending the Spiegelhalter test”. Journal of
Statistical Computation and Simulation, 76(9), 803 –
816. doi:10.1080/10629360500109023

 RGUHS -April 2000, Sept.2007, normal curve(10 mks).
 Sumandeep university- April 2012, normal curve (10mks).
 Manipal university -April 2007, normal curve (10mks).
 Mangalore university-July 1993, December 1997(10mks).
