Descriptive Statistics:
Numerical Summary Measures
– Single numbers which quantify the
characteristics of a distribution of values
• Measures of central tendency (location)
• Measures of dispersion
• A frequency distribution is a general
picture of the distribution of a variable
• On its own, however, it cannot indicate the average
value or the spread of the values
Measures of Central Tendency (MCT)
• On the scale of values of a variable there is
a certain point around which the largest number
of items tend to cluster.
• Since this point is usually in the centre of the
distribution, the tendency of the statistical
data to concentrate at a certain value
is called “central tendency”.
• The various methods of determining the
point about which the observations tend to
concentrate are called MCT.
• The objective of calculating MCT is to
determine a single figure which may be
used to represent the whole data set.
• In that sense it is an even more compact
description of the statistical data than the
frequency distribution.
• Since an MCT represents the entire data set, it
facilitates comparison within one group or
between groups of data.
[Figure: histogram of frequency (vertical axis, 0 to 20) by class interval (0-9 through 90-99), illustrating position]
Characteristics of a good MCT
An MCT is good or satisfactory if it possesses
the following characteristics.
1. It should be based on all the observations
2. It should not be affected by the extreme values
3. It should be as close to the maximum number of
values as possible
4. It should have a definite value
5. It should not require complicated and tedious
calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling
• The most common measures of central
tendency include:
– Arithmetic Mean
– Median
– Mode
– Others
1. Arithmetic Mean
A. Ungrouped Data
• The arithmetic mean is the "average" of the data
set and by far the most widely used measure of
central location
• It is the sum of all the observations divided by the
total number of observations.
The Summation Notation
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
3 Descriptive Numerical Summary Measures.ppt
The heart rates for n=10 patients were as follows (beats per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of these patients?
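The question above can be worked directly from the definition (sum of the observations divided by their number). A minimal sketch in Python; the variable names are chosen here for illustration only:

```python
# Heart rates (beats per minute) for the n = 10 patients on the slide.
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

# Arithmetic mean = sum of all observations / number of observations.
mean_hr = sum(heart_rates) / len(heart_rates)

print(f"sum = {sum(heart_rates)}, n = {len(heart_rates)}")
print(f"mean heart rate = {mean_hr:.1f} beats per minute")
```

Note how the single low value of 40 pulls the mean below most of the other observations.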
b) Grouped data
In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:
$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$
where,
k = the number of class intervals
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
Example. Compute the mean age of 169 subjects from the
grouped data.
Class interval   Mid-point (mi)   Frequency (fi)   mifi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
Total            __               169              5810.5
Mean = 5810.5/169 = 34.38 years
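The grouped-data calculation in the table can be checked with a few lines of Python. This is a sketch under the slide's own assumption that every value in a class sits at the class mid-point; the function name `grouped_mean` is introduced here for illustration.

```python
# Mid-points and frequencies of the six age classes from the table.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    weighted_total = sum(m * f for m, f in zip(midpoints, frequencies))
    return weighted_total / n


weighted_total = sum(m * f for m, f in zip(midpoints, frequencies))
print(f"sum of m_i * f_i = {weighted_total}")   # column total of the table
print(f"n = {sum(frequencies)}")
print(f"grouped mean = {grouped_mean(midpoints, frequencies):.2f} years")
```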
The mean can be thought of as a “balancing
point”, “center of gravity”
When the data are skewed, the mean is
“dragged” in the direction of the skewness
• In extreme cases, all but one of the sample points can lie
on one side of the arithmetic mean. In this case the mean is
a poor measure of central location and does not reflect the center of
the sample.
Properties of the Arithmetic Mean.
• For a given set of data there is one and only
one arithmetic mean (uniqueness).
• Easy to calculate and understand (simple).
• Influenced by each and every value in a data
set
• Greatly affected by the extreme values.
• In the case of grouped data, if any class interval
is open-ended, the arithmetic mean cannot be
calculated.
2. Median
a) Ungrouped data
• The median is the value which divides the data set
into two equal parts.
• If the number of values is odd, the median will be the
middle value when all values are arranged in order of
magnitude.
• When the number of observations is even, there is no
single middle value but two middle observations.
• In this case the median is the mean of these two
middle observations, when all observations have
been arranged in the order of their magnitude.
• The median is a better description (than the mean) of
the majority when the distribution is skewed
• Example
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93
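The contrast between the two measures in this example can be reproduced with Python's standard `statistics` module. A minimal sketch:

```python
import statistics

# Skewed sample from the slide: one outlying low value (14).
data = [14, 89, 93, 95, 96]

sample_mean = statistics.mean(data)      # pulled toward the outlier
sample_median = statistics.median(data)  # middle value of the ranked data

print(f"mean   = {sample_mean}")
print(f"median = {sample_median}")
```

Four of the five values lie above the mean, which is why the median describes the bulk of this sample better.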
b) Grouped data
• In calculating the median from grouped data, we
assume that the values within a class-interval
are evenly distributed through the interval.
• The first step is to locate the class interval in
which the median is located, using the following
procedure.
• Find n/2 and locate the first class interval whose
cumulative frequency is at least n/2; this is the
median class.
• Then, use the following formula.
$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$
where,
$L_m$ = lower true class boundary of the interval containing the median
$F_c$ = cumulative frequency of the interval immediately preceding the median class interval
$f_m$ = frequency of the interval containing the median
W = class interval width
n = total number of observations
Example. Compute the median age of 169
subjects from the grouped data.
n/2 = 169/2 = 84.5
Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169
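• n/2 = 84.5 falls in the 3rd class interval (30-39), the first
whose cumulative frequency (117) reaches 84.5
• Lower boundary = 29.5, Upper boundary = 39.5, so W = 10
• Frequency of the class (fm) = 47
• (n/2 – Fc) = 84.5 – 70 = 14.5
• Median = 29.5 + (14.5/47) × 10 = 32.59 ≈ 33 years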
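The steps above (find n/2, locate the median class, interpolate within it) can be sketched in Python. The function name `grouped_median` and its argument layout are assumptions made for this illustration; the data are the class boundaries and frequencies of the example.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data, assuming values are spread evenly in each class.

    Implements L_m + ((n/2 - F_c) / f_m) * W, where F_c is the cumulative
    frequency of the classes before the median class.
    """
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= half:  # first class whose cum. freq. reaches n/2
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


# True lower class boundaries for 10-19, 20-29, ..., 60-69.
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
frequencies = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(lower_boundaries, frequencies, width=10)
print(f"grouped median = {median_age:.2f} years")
```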
Properties of the median
• There is only one median for a given set of data
(uniqueness)
• The median is easy to calculate
• Median is a positional average and hence it is
insensitive to very large or very small values
• Median can be calculated even in the case of
open end intervals
• It is determined mainly by the middle points and
less sensitive to the remaining data points
(weakness).
Quartiles
• Just as the median is the value above and
below which half of the data lie, one
can define measures above or below
which other fractions of the data lie.
• The median divides the data into two
equal parts
• If the data are divided into four equal
parts, we speak of quartiles.
a) The first quartile (Q1): 25% of all the
ranked observations are less than Q1.
b) The second quartile (Q2): 50% of all the
ranked observations are less than Q2. The
second quartile is the median.
c) The third quartile (Q3): 75% of all the
ranked observations are less than Q3.
Percentiles
• Percentiles divide the ranked data into 100 equal parts.
• Percentiles are less sensitive to outliers
and not greatly affected by the sample
size (n).
3. Mode
• The mode is the most frequently occurring
value among all the observations in a set
of data.
• It is not influenced by extreme values.
• It is possible to have more than one mode
or no mode.
• It is not a good summary of the majority of
the data.
Mode
[Figure: frequency distribution with the mode marked; vertical axis N, 0 to 20. Credit: T. Ancelle, D. Coulombie]
Mode
• It is a value which occurs most
frequently in a set of values.
• If all the values are different there is no
mode, on the other hand, a set of
values may have more than one mode.
a) Ungrouped data
• Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
• Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
• Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
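The three cases above (one mode, two modes, no mode) can be reproduced with a small helper. This is a sketch; the function name `modes` is hypothetical, and it follows the slide's convention that a data set whose values are all different has no mode.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes; empty if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:  # every value occurs once: no mode by the slide's convention
        return []
    return sorted(v for v, c in counts.items() if c == top)


unimodal = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])

print(unimodal, bimodal, no_mode)
```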
b) Grouped data
• To find the mode of grouped data, we
usually refer to the modal class, where
the modal class is the class interval with
the highest frequency.
• If a single value for the mode of
grouped data must be specified, it is
taken as the mid-point of the modal
class interval.
Properties of mode
• It is not affected by extreme values
• It can be calculated for distributions with
open end classes
• Often its value is not unique
• The main drawback of mode is that
often it does not exist
4. Geometric mean (GM)
• Mainly used in many types of laboratory data,
specifically data in the form of concentrations of
one substance in another
• Example: the minimum inhibitory concentration of
penicillin in urine for N. gonorrhoeae in 74 patients
Concentration (µg/ml)   Frequency
0.03125                 21
0.0625                  6
0.1250                  8
0.250                   19
0.50                    17
1.0                     3
If $x_1, x_2, \ldots, x_n$ are n positive observed values, then
$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$
and
$\log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}$.
The geometric mean is generally used with data measured on a logarithmic scale, such
as titers of anti-neutrophil immunoglobulin G.
Example:
logGM = [21log(0.03125) + 6log(0.0625) +
8log(0.125) + 19log(0.25) + 17log(0.5)
+ 3log(1.0)]/74 = -0.846
The GM = the antilogarithm of -0.846 = 0.143
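The log-scale calculation above can be verified in Python. A minimal sketch using base-10 logarithms, as the antilogarithm step implies; the variable names are chosen here for illustration.

```python
import math

# Minimum inhibitory concentrations (µg/ml) and their frequencies.
concentrations = [0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0]
frequencies = [21, 6, 8, 19, 17, 3]

n = sum(frequencies)

# log GM = frequency-weighted arithmetic mean of the log10 values.
log_gm = sum(f * math.log10(x) for x, f in zip(concentrations, frequencies)) / n

# GM = antilogarithm (base 10) of log GM.
gm = 10 ** log_gm

print(f"n = {n}")
print(f"log GM = {log_gm:.3f}")
print(f"GM = {gm:.3f} µg/ml")
```

The frequencies sum to 74, which matches the divisor used in the worked example.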
5. Harmonic mean (HM)
• Just as the geometric mean is based on
an arithmetic mean of logarithms, so is
the harmonic mean based on arithmetic
mean of the reciprocals.
• Pertains to rates and time
• We define it as the reciprocal of the
arithmetic mean of the reciprocal of the
given numbers.
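The definition just given (reciprocal of the arithmetic mean of the reciprocals) can be sketched as follows. The slides give no data for this measure, so the two speeds below are hypothetical values invented for this illustration, and `harmonic_mean` is a name introduced here.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and positive")
    mean_of_reciprocals = sum(1 / v for v in values) / len(values)
    return 1 / mean_of_reciprocals


# Hypothetical example: the same distance covered at 40 km/h and at 60 km/h.
speeds = [40, 60]
hm = harmonic_mean(speeds)
print(f"harmonic mean speed = {hm:.1f} km/h")
```

For rates over equal distances the harmonic mean gives the overall average rate, which here is lower than the arithmetic mean of 50 km/h.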
If the given numbers are $x_1, x_2, \ldots, x_n$, then
$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}}$
6. Weighted mean (WM)
• In a weighted mean, separate outcomes
have separate influences.
• The influence attached to an outcome is
the weight.
• A familiar case is the calculation of a course
grade as a weighted average of scores on
separate components.
Example:
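As a hypothetical illustration of the course-grade case, the sketch below computes a weighted mean as the sum of weight times score divided by the sum of the weights. The scores and weights are invented for this example and do not come from the slides; `weighted_mean` is a name introduced here.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical course: assignment, final exam, and quiz scores (out of 100)
# with weights of 30%, 50%, and 20%.
scores = [80, 90, 70]
weights = [0.3, 0.5, 0.2]

grade = weighted_mean(scores, weights)
print(f"course grade = {grade:.1f}")
```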
Which measure of central tendency is best with a
given set of data?
• Two factors are important in making this
decision:
– The scale of measurement (type of data)
– The shape of the distribution of the
observations
• The mean can be used for discrete and
continuous data
• The median is appropriate for discrete and
continuous data as well, but can also be
used for ordinal data
• The mode can be used for all types of
data, but may be especially useful for
nominal and ordinal measurements
• For discrete or continuous data, the
“modal class” can be used
• The geometric mean is used primarily for
observations measured on a logarithmic
scale.
• Harmonic mean is a suitable MCT when
the data pertains to rates and time.
• Weighted mean is commonly used in the
calculation of mean for different
outcomes.
(a) Symmetric and unimodal distribution: mean, median, and mode should all be
approximately the same
[Figure: symmetric curve with mean, median, and mode coinciding]
(b) Bimodal: mean and median should be
about the same, but may take a value that
is unlikely to occur; reporting the two modes might be
best
(c) Skewed to the right (positively skewed):
the mean is sensitive to extreme values, so the
median might be more appropriate
[Figure: right-skewed curve with mode, median, and mean marked]
(d) Skewed to the left (negatively skewed):
same as (c)
[Figure: left-skewed curve with mode, median, and mean marked]
Measures of Dispersion
Consider the following two sets of data:
A: 177 193 195 209 226   Mean = 200
B: 192 197 200 202 209   Mean = 200
Two or more sets may have the same mean and/or median but they
may be quite different.
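A short Python check makes the point concrete: sets A and B share a mean but differ in spread. This sketch uses the standard `statistics` module, and the sample standard deviation (divisor n − 1) is used as the measure of spread.

```python
import statistics

set_a = [177, 193, 195, 209, 226]
set_b = [192, 197, 200, 202, 209]

for name, data in (("A", set_a), ("B", set_b)):
    mean = statistics.mean(data)
    data_range = max(data) - min(data)
    sd = statistics.stdev(data)  # sample SD, divisor n - 1
    print(f"{name}: mean = {mean}, range = {data_range}, SD = {sd:.2f}")
```

Both means are 200, but A is far more spread out than B on either measure.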
These two distributions have the same mean,
median, and mode
• MCT are not enough to give a clear
understanding about the distribution of
the data.
• We need to know something about the
variability or spread of the values —
whether they tend to be clustered close
together, or spread out over a broad
range
Measures of Dispersion
• Measures that quantify the variation or
dispersion of a set of data from its central
location
• Dispersion refers to the variety exhibited by
the values of the data.
• The amount of dispersion is small when the values are
close together.
• If all the values are the same, there is no dispersion
Measures of Dispersion
Other synonymous terms:
– “Measure of Variation”
– “Measure of Spread”
– “Measures of Scatter”
• Measures of dispersion include:
– Range
– Inter-quartile range
– Variance
– Standard deviation
– Coefficient of variation
– Standard error
– Others
1. Range (R)
• The difference between the largest and
smallest observations in a sample.
• Range = Maximum value – Minimum value
• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
• A data set with a higher range exhibits more
variability
Properties of range
• It is the simplest crude measure and can be
easily understood
• It takes into account only two values, which
makes it a poor measure of dispersion
• Very sensitive to extreme observations
• The larger the sample size, the larger the
range tends to be
2. Interquartile range (IQR)
• Indicates the spread of the middle 50% of
the observations, and used with median
IQR = Q3 - Q1
• Example: Suppose the first and third quartiles for
weights of girls 12 months of age are 8.8 kg and
10.2 kg, respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
i.e., the middle 50% of the infant girls weigh between 8.8 and
10.2 kg.
The two quartiles (Q3 & Q1) form the basis of the box-and-whiskers plot.
[Figure: box-and-whiskers plots of Variables A, B, and C; vertical axis from 0 to 10]
Properties of IQR:
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on
two specific values
• It is important in selecting cut-off points in the
formulation of clinical standards
• Since it excludes the lowest and highest 25%
values, it is not affected by extreme values
• Less sensitive to the size of the sample
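The range and IQR described above can be computed for the eight data values of the range example. This is a sketch: textbooks and software use several quartile conventions, and the one used here (the default "exclusive" method of `statistics.quantiles` in Python) is an assumption, so other tools may report slightly different quartiles.

```python
import statistics

# Data values from the range example.
data = [5, 9, 12, 16, 23, 34, 37, 42]

data_range = max(data) - min(data)

# Quartiles under the "exclusive" convention (an assumption of this sketch).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}")
```

The IQR ignores the lowest and highest quarters of the data, so replacing 42 with a much larger value would change the range but not the IQR.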
3. Quartile deviation (QD)
$QD = \frac{Q_3 - Q_1}{2}$
4. Coefficient of quartile deviation
(CQD)
• $CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
• CQD is a relative, unitless quantity
and is useful to compare the variability
among the middle 50% observations.
5. Mean deviation (MD)
• Mean deviation is the average of the
absolute deviations taken from a central
value, generally the mean or median.
• Consider a set of n observations $x_1, x_2, \ldots, x_n$. Then:
$MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - A|$
• ‘A’ is a central value (arithmetic mean or
median).
Properties of mean deviation:
• MD removes one main objection to the earlier
measures, in that it involves every value
• It is not affected much by extreme values
• Its main drawback is that the algebraic negative
signs of the deviations are ignored, which makes it
awkward for further mathematical treatment
6. Variance (σ², s²)
• The main objection to the mean deviation, that
the negative signs are ignored, is removed
by taking the squares of the deviations from
the mean.
• The variance is the average of the squares
of the deviations taken from the mean.
• The deviations are squared because the sum of the
deviations of the individual observations of
a sample about the sample mean is
always 0:
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
• The variance can be thought of as an
average of squared deviations
• Variance is used to measure the
dispersion of values relative to the mean.
• When values are close to their mean
(narrow range) the dispersion is less than
when there is scattering over a wide
range.
– Population variance = σ²
– Sample variance = S²
a) Ungrouped data
• Let $X_1, X_2, \ldots, X_N$ be the measurements on N
population units, then:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
where $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ is the population mean.
A sample variance is calculated for a sample of
individual values ($X_1, X_2, \ldots, X_n$) and uses the sample
mean ($\bar{x}$) rather than the population mean µ:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$
Degrees of freedom
• In computing the variance there are (n-1)
degrees of freedom because only (n-1) of the
deviations are independent from each other
• The last one can always be calculated from
the others automatically.
• This is because the sum of the deviations
from their mean (Xi-Mean) must add to zero.
b) Grouped data
$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$
where
$m_i$ = the mid-point of the ith class interval
$f_i$ = the frequency of the ith class interval
$\bar{x}$ = the sample mean
k = the number of class intervals
Properties of Variance:
• The main disadvantage of variance is
that its unit is the square of the unit of
the original measurement values
• The variance gives more weight to the
extreme values than to those
which are near the mean value, because
the difference is squared in variance.
• The first of these drawbacks is
overcome by the standard deviation.
7. Standard deviation (σ, s)
• It is the square root of the variance.
• This produces a measure having the
same scale as that of the individual
values.
$\sigma = \sqrt{\sigma^2}$ and $S = \sqrt{S^2}$
• Following are the survival times of n=11
patients after heart transplant surgery.
• The survival time for the “ith” patient is
represented as Xi for i= 1, …, 11.
• Calculate the sample variance and SD.
Example. Compute the variance and SD of the age of 169
subjects from the grouped data.
Mean = 5810.5/169 = 34.38 years
Class interval   (mi)   (fi)   (mi-Mean)   (mi-Mean)²   (mi-Mean)² fi
10-19            14.5   4      -19.88      395.2144     1580.8576
20-29            24.5   66     -9.88       97.6144      6442.5504
30-39            34.5   47     0.12        0.0144       0.6768
40-49            44.5   36     10.12       102.4144     3686.9184
50-59            54.5   12     20.12       404.8144     4857.7728
60-69            64.5   4      30.12       907.2144     3628.8576
Total                   169                1907.2864    20197.6336
S² = 20197.63/(169 – 1) = 120.22
SD = √S² = √120.22 = 10.96
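The grouped-data variance and SD of this example can be recomputed from the mid-points and frequencies alone. This is a sketch using the grouped formula with divisor (Σf − 1); the function name `grouped_variance` is introduced here for illustration, and the mean is carried at full precision rather than rounded.

```python
import math

midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]


def grouped_variance(midpoints, frequencies):
    """Sample variance of grouped data: sum((m_i - mean)^2 * f_i) / (n - 1)."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    sum_sq = sum((m - mean) ** 2 * f for m, f in zip(midpoints, frequencies))
    return sum_sq / (n - 1)


variance = grouped_variance(midpoints, frequencies)
sd = math.sqrt(variance)

print(f"S^2 = {variance:.2f}")
print(f"SD  = {sd:.2f} years")
```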
Properties of SD
• The SD has the advantage of being expressed in
the same units of measurement as the mean
• SD is considered to be the best measure of
dispersion and is used widely because of the
properties of the theoretical normal curve.
• However, if the units of measurement of the variables
in two data sets are not the same, then their
variability can’t be compared by comparing the
values of SD.
SD Vs Standard Error (SE)
• SD describes the variability among individual
values in a given data set
• SE is used to describe the variability among
separate sample means obtained from one
sample to another
• We interpret the SE of the mean as follows:
another similarly conducted study may give a
mean that lies within ± SE of the observed mean.
Standard Error
• SD is about the variability of individuals
• SE is used to describe the variability in
the means of repeated samples taken
from the same population.
• For example, imagine 5,000 samples, each of the same size n=11.
This would produce 5,000 sample means. This new collection has
its own pattern of variability. We describe this new pattern of
variability using the SE, not the SD.
Example: The heart transplant surgery
n=11, SD=168.89, Mean=161 days
SE = SD/√n = 168.89/√11 ≈ 50.9 days
• What happens if we repeat the study? What will our next mean be? Will
it be close? How different will it be? Focus here is on the
generalizability of the study findings.
• The behavior of mean from one replication of the study to the next
replication is referred to as the sampling distribution of mean.
• We can also have sampling distribution of the median or the SD
• We interpret this to mean that a similarly conducted study might
produce an average survival time that is near 161 days, ±50.9 days.
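The ±50.9 days quoted above is the standard error of the mean, SD/√n. A minimal sketch of that calculation in Python, using the summary figures from the example; the function name `standard_error` is introduced here for illustration.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: SD / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sd / math.sqrt(n)


# Summary figures from the heart transplant example.
n = 11
sd = 168.89
mean = 161

se = standard_error(sd, n)
print(f"SE = {se:.1f} days")
print(f"mean ± SE = {mean - se:.1f} to {mean + se:.1f} days")
```

Because n appears under a square root, quadrupling the sample size only halves the SE.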
8. Coefficient of variation (CV)
• When two data sets have different units
of measurements, or their means differ
sufficiently in size, the CV should be
used as a measure of dispersion.
• It is the best measure for comparing the
variability of two or more sets of
observations.
• A data set with a smaller coefficient of variation is
considered more consistent.
$CV = \frac{S}{\bar{x}} \times 100$
• CV is the ratio of the SD to the mean multiplied by 100.
• “Cholesterol is more variable than systolic blood
pressure”
              SD          Mean         CV (%)
SBP           15 mm Hg    130 mm Hg    11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0
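The two CV values in the comparison can be reproduced with a one-line function. A sketch; the function name `coefficient_of_variation` is introduced here for illustration.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = SD / mean * 100; unit-free, so different scales can be compared."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return sd / mean * 100


cv_sbp = coefficient_of_variation(15, 130)          # systolic blood pressure
cv_cholesterol = coefficient_of_variation(40, 200)  # cholesterol

print(f"CV (SBP)         = {cv_sbp:.1f}%")
print(f"CV (cholesterol) = {cv_cholesterol:.1f}%")
```

The SDs themselves (15 versus 40) are in different units and cannot be compared directly; the CVs can.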
NOTE:
• The range often appears with the median
as a numerical summary measure
• The IQR is used with the median as well
• The SD is used with the mean
• For nominal and ordinal data, a table or
graph is often more effective than any
numerical summary measure
More Related Content

PPTX
03. Summarizing data biostatic - Copy.pptx
PPTX
Measures of central tendancy
PPTX
measures of central tendency in statistics which is essential for business ma...
PPT
2. Descriptive Numerical Summary Measures-2023(2).ppt
PPT
Central Tendency for bio statistics and data analysis
PPTX
Introduction to Measurement CHAPTER 2 (2) (1).pptx
PPTX
Central tendency
PPTX
#3Measures of central tendency
03. Summarizing data biostatic - Copy.pptx
Measures of central tendancy
measures of central tendency in statistics which is essential for business ma...
2. Descriptive Numerical Summary Measures-2023(2).ppt
Central Tendency for bio statistics and data analysis
Introduction to Measurement CHAPTER 2 (2) (1).pptx
Central tendency
#3Measures of central tendency

Similar to 3 Descriptive Numerical Summary Measures.ppt (20)

PDF
Upload 140103034715-phpapp01 (1)
PPTX
Measures of Central tendency
PPTX
Biostatistics cource for clinical pharmacy
PDF
3. Descriptive statistics.pdf
PPTX
Unit 3_1.pptx
PPTX
Measure of Central Tendency (Mean, Median, Mode and Quantiles)
PPT
Dr digs central tendency
PPTX
Measure OF Central Tendency
PPTX
Stat Chapter 3.pptx, proved detail statistical issues
PPTX
Measure of central tendency grouped data.pptx
PPTX
Measures of central tendency
PPTX
ANA 809 - Measures of Central Tendency - Emmanuel Uchenna.pptx
PDF
CHAPTER 2.pdfProbability and Statistics for Engineers
PPT
Statistics.ppt
PPTX
Biostatistics Measures of central tendency
PPTX
computation of measures of central tendency
PPTX
Slideshare notes about measures of central tendancy(mean,median and mode)
PPT
Biostatistics chapter two measure of the central
PPTX
Measures of central tendency median mode
PPTX
Measures of Central Tendency.pptx for UG
Upload 140103034715-phpapp01 (1)
Measures of Central tendency
Biostatistics cource for clinical pharmacy
3. Descriptive statistics.pdf
Unit 3_1.pptx
Measure of Central Tendency (Mean, Median, Mode and Quantiles)
Dr digs central tendency
Measure OF Central Tendency
Stat Chapter 3.pptx, proved detail statistical issues
Measure of central tendency grouped data.pptx
Measures of central tendency
ANA 809 - Measures of Central Tendency - Emmanuel Uchenna.pptx
CHAPTER 2.pdfProbability and Statistics for Engineers
Statistics.ppt
Biostatistics Measures of central tendency
computation of measures of central tendency
Slideshare notes about measures of central tendancy(mean,median and mode)
Biostatistics chapter two measure of the central
Measures of central tendency median mode
Measures of Central Tendency.pptx for UG
Ad

More from MuazbashaAlii (20)

PPTX
Neonatal and Child Health 2.pptx Reproductive
PPTX
GBV and Clinical management of rape 4.pptx
PPTX
cardiacdiseaseduringpregnancyfin-230511221602-d720e371.pptx for Nursing Student
PPTX
cardiacdiseaseduringpregnancyfin.pptx for midwifery
PPT
Hema_I_Chapter_3_phlebotomy.ppt for Medical laboratory
PPT
Hema I Chapter 4_Anticoag.ppt for Medical laboratory
PPT
Hema I Chapter 2_composition, formation & function.ppt
PDF
Adolescent and youth reproductive health.pdf
PDF
Pharmacology_of_drugs_used_for_treatment_of_gout_and_hyperlipidemia.pdf
PPTX
pregnancy related terminology.pptx for midwifery
PPTX
MALARIA DURING PREGNANT .pptx for midwifery
PDF
Reproductive Health.pdf for midwifery students
PPTX
Antenatal care for second year Midwifery.pptx
PPT
4 Probability and probability distn.ppt biostatistics
PDF
Estimation and hypothesis testing (2).pdf
PDF
CVD for midwifery.pdf for the second year
PPT
6. Benign & malignant disorders of the ovary.ppt
PDF
macronutrient and micronutrient.pdf midwifery
PPTX
Nutrition for Midwifery 2024.pptxsecond year
PPT
URINARY TRACT INFECTION.ppt pathological
Neonatal and Child Health 2.pptx Reproductive
GBV and Clinical management of rape 4.pptx
cardiacdiseaseduringpregnancyfin-230511221602-d720e371.pptx for Nursing Student
cardiacdiseaseduringpregnancyfin.pptx for midwifery
Hema_I_Chapter_3_phlebotomy.ppt for Medical laboratory
Hema I Chapter 4_Anticoag.ppt for Medical laboratory
Hema I Chapter 2_composition, formation & function.ppt
Adolescent and youth reproductive health.pdf
Pharmacology_of_drugs_used_for_treatment_of_gout_and_hyperlipidemia.pdf
pregnancy related terminology.pptx for midwifery
MALARIA DURING PREGNANT .pptx for midwifery
Reproductive Health.pdf for midwifery students
Antenatal care for second year Midwifery.pptx
4 Probability and probability distn.ppt biostatistics
Estimation and hypothesis testing (2).pdf
CVD for midwifery.pdf for the second year
6. Benign & malignant disorders of the ovary.ppt
macronutrient and micronutrient.pdf midwifery
Nutrition for Midwifery 2024.pptxsecond year
URINARY TRACT INFECTION.ppt pathological
Ad

Recently uploaded (20)

PPTX
Institutional Correction lecture only . . .
PDF
Computing-Curriculum for Schools in Ghana
PPTX
Cell Types and Its function , kingdom of life
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
Complications of Minimal Access Surgery at WLH
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
RMMM.pdf make it easy to upload and study
PPTX
Pharma ospi slides which help in ospi learning
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PPTX
Lesson notes of climatology university.
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
Institutional Correction lecture only . . .
Computing-Curriculum for Schools in Ghana
Cell Types and Its function , kingdom of life
Microbial disease of the cardiovascular and lymphatic systems
Complications of Minimal Access Surgery at WLH
Abdominal Access Techniques with Prof. Dr. R K Mishra
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
STATICS OF THE RIGID BODIES Hibbelers.pdf
RMMM.pdf make it easy to upload and study
Pharma ospi slides which help in ospi learning
Microbial diseases, their pathogenesis and prophylaxis
Lesson notes of climatology university.
Final Presentation General Medicine 03-08-2024.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
202450812 BayCHI UCSC-SV 20250812 v17.pptx
human mycosis Human fungal infections are called human mycosis..pptx
Chinmaya Tiranga quiz Grand Finale.pdf
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3

3 Descriptive Numerical Summary Measures.ppt

  • 1. Descriptive Statistics: Numerical Summary Measures – Single numbers which quantify the characteristics of a distribution of values  Measures of central tendency (location)  Measures of dispersion
  • 2. • A frequency distribution is a general picture of the distribution of a variable • But, can’t indicate the average value and the spread of the values
  • 3. Measures of Central Tendency (MCT) • On the scale of values of a variable there is a certain stage at which the largest number of items tend to cluster. • Since this stage is usually in the centre of distribution, the tendency of the statistical data to get concentrated at a certain value is called “central tendency” • The various methods of determining the point about which the observations tend to concentrate are called MCT.
  • 4. • The objective of calculating MCT is to determine a single figure which may be used to represent the whole data set. • In that sense it is an even more compact description of the statistical data than the frequency distribution. • Since a MCT represents the entire data, it facilitates comparison within one group or between groups of data.
  • 5. 0 5 10 15 20 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99 Position
  • 6. Characteristics of a good MCT A MCT is good or satisfactory if it possesses the following characteristics. 1. It should be based on all the observations 2. It should not be affected by the extreme values 3. It should be as close to the maximum number of values as possible 4. It should have a definite value 5. It should not be subjected to complicated and tedious calculations 6. It should be capable of further algebraic treatment 7. It should be stable with regard to sampling
  • 7. • The most common measures of central tendency include: – Arithmetic Mean – Median – Mode – Others
  • 8. 1. Arithmetic Mean A. Ungrouped Data • The arithmetic mean is the "average" of the data set and by far the most widely used measure of central location • Is the sum of all the observations divided by the total number of observations.
  • 11. The heart rates for n=10 patients were as follows (beats per minute): 167, 120, 150, 125, 150, 140, 40, 136, 120, 150 What is the arithmetic mean for the heart rate of these patients?
  • 13. Example. Compute the mean age of 169 subjects from the grouped data. Mean = 5810.5/169 = 34.48 years Class interval Mid-point (mi) Frequency (fi) mifi 10-19 20-29 30-39 40-49 50-59 60-69 14.5 24.5 34.5 44.5 54.5 64.5 4 66 47 36 12 4 58.0 1617.0 1621.5 1602.0 654.0 258.0 Total __ 169 5810.5
  • 14. The mean can be thought of as a “balancing point”, “center of gravity”
  • 15. When the data are skewed, the mean is “dragged” in the direction of the skewness • It is possible in extreme cases for all but one of the sample points to be on one side of the arithmetic mean & in this case, the mean is a poor measure of central location or does not reflect the center of the sample.
  • 16. Properties of the Arithmetic Mean. • For a given set of data there is one and only one arithmetic mean (uniqueness). • Easy to calculate and understand (simple). • Influenced by each and every value in a data set • Greatly affected by the extreme values. • In case of grouped data if any class interval is open, arithmetic mean can not be calculated.
  • 17. 2. Median a) Ungrouped data • The median is the value which divides the data set into two equal parts. • If the number of values is odd, the median will be the middle value when all values are arranged in order of magnitude. • When the number of observations is even, there is no single middle value but two middle observations. • In this case the median is the mean of these two middle observations, when all observations have been arranged in the order of their magnitude.
  • 21. • The median is a better description (than the mean) of the majority when the distribution is skewed • Example – Data: 14, 89, 93, 95, 96 – Skewness is reflected in the outlying low value of 14 – The sample mean is 77.4 – The median is 93
  • 22. b) Grouped data • In calculating the median from grouped data, we assume that the values within a class-interval are evenly distributed through the interval. • The first step is to locate the class interval in which the median is located, using the following procedure. • Find n/2 and see a class interval with a minimum cumulative frequency which contains n/2. • Then, use the following formal.
  • 23. ~ x = L n 2 F f W m c m             where, Lm = lower true class boundary of the interval containing the median Fc = cumulative frequency of the interval just above the median class interval fm = frequency of the interval containing the median W= class interval width n = total number of observations
  • 24. Example. Compute the median age of 169 subjects from the grouped data. n/2 = 169/2 = 84.5 Class interval Mid-point (mi) Frequency (fi) Cum. freq 10-19 20-29 30-39 40-49 50-59 60-69 14.5 24.5 34.5 44.5 54.5 64.5 4 66 47 36 12 4 4 70 117 153 165 169 Total 169
  • 25. • n/2 = 84.5 = in the 3rd class interval • Lower limit = 29.5, Upper limit = 39.5 • Frequency of the class = 47 • (n/2 – fc) = 84.5-70 = 14.5 • Median = 29.5 + (14.5/47)10 = 32.58 ≈ 33
  • 26. Properties of the median • There is only one median for a given set of data (uniqueness) • The median is easy to calculate • Median is a positional average and hence it is insensitive to very large or very small values • Median can be calculated even in the case of open end intervals • It is determined mainly by the middle points and less sensitive to the remaining data points (weakness).
  • 27. Quartiles • Just as the median is the value above and below which half of the data lie, one can define measures below which other fractional parts of the data lie. • The median divides the data into two equal parts • If the data are divided into four equal parts, we speak of quartiles.
  • 28. a) The first quartile (Q1 ): 25% of all the ranked observations are less than Q1. b) The second quartile (Q2 ): 50% of all the ranked observations are less than Q2 . The second quartile is the median. c) The third quartile (Q3 ): 75% of all the ranked observations are less than Q3.
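As a rough illustration, the three quartiles can be obtained with `statistics.quantiles` (Python 3.8+). The data are the eight values used later in this deck's range example. Note that several conventions exist for interpolating quartiles; the sketch below uses the module's default ("exclusive") method, so other software may report slightly different Q1 and Q3.

```python
from statistics import median, quantiles

data = [5, 9, 12, 16, 23, 34, 37, 42]

# n=4 requests the three cut points that split the ranked data into four parts.
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)

# The second quartile coincides with the median.
print(q2 == median(data))
```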
  • 29. Percentiles • Percentiles divide the ranked data into 100 equal parts. • Percentiles are less sensitive to outliers and not greatly affected by the sample size (n).
  • 30. 3. Mode • The mode is the most frequently occurring value among all the observations in a set of data. • It is not influenced by extreme values. • It is possible to have more than one mode or no mode. • It is not a good summary of the majority of the data.
  • 32. a) Ungrouped data • It is a value which occurs most frequently in a set of values. • If all the values are different there is no mode; on the other hand, a set of values may have more than one mode.
  • 33. • Example • Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6 • Mode is 4 “Unimodal” • Example • Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8 • There are two modes – 2 & 5 • This distribution is said to be “bi-modal” • Example • Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12 • No mode, since all the values are different
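The three cases above (one mode, two modes, no mode) can be reproduced with a short helper. The function name `modes` is a hypothetical choice, and returning an empty list when every value occurs once is a convention adopted here to match the slides.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; empty if every value occurs exactly once."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []  # all values differ, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

unimodal = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])
bimodal = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])
no_mode = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])
print(unimodal, bimodal, no_mode)
```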
  • 34. b) Grouped data • To find the mode of grouped data, we usually refer to the modal class, where the modal class is the class interval with the highest frequency. • If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
  • 36. Properties of mode • It is not affected by extreme values • It can be calculated for distributions with open end classes • Often its value is not unique • The main drawback of mode is that often it does not exist
  • 37. 4. Geometric mean (GM) • Mainly used for many types of laboratory data, specifically data in the form of concentrations of one substance in another • Example: the minimum inhibitory concentration of penicillin in urine for N. gonorrhoeae in 74 patients
      Concentration (µg/ml)   Frequency
      0.03125                 21
      0.0625                  6
      0.1250                  8
      0.250                   19
      0.50                    17
      1.0                     3
  • 38. If $x_1, x_2, \ldots, x_n$ are n positive observed values, then $GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$ and $\log GM = \dfrac{\sum_{i=1}^{n} \log x_i}{n}$. The geometric mean is generally used with data measured on a logarithmic scale, such as titers of anti-neutrophil immunoglobulin G.
  • 39. Example: logGM = [21log(0.03125) + 6log(0.0625) + 8log(0.125) + 19log(0.25) + 17log(0.5) + 3log(1.0)]/74 = -0.846 The GM = the antilogarithm of -0.846 = 0.143
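The calculation can be reproduced directly from the frequency table. This sketch assumes base-10 logarithms, which is consistent with the value of -0.846 quoted on the slide.

```python
import math

# Concentrations (µg/ml) and their frequencies from the example
mic = [0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0]
freq = [21, 6, 8, 19, 17, 3]

n = sum(freq)  # total number of observations
# Frequency-weighted mean of the log10 concentrations
log_gm = sum(f * math.log10(x) for x, f in zip(mic, freq)) / n
gm = 10 ** log_gm  # antilogarithm gives the geometric mean

print(n, round(log_gm, 3), round(gm, 3))
```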
  • 40. 5. Harmonic mean (HM) • Just as the geometric mean is based on an arithmetic mean of logarithms, the harmonic mean is based on an arithmetic mean of reciprocals. • Pertains to rates and time • We define it as the reciprocal of the arithmetic mean of the reciprocals of the given numbers.
  • 41. If the given numbers are $x_1, x_2, \ldots, x_n$, then $HM = \dfrac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}}$
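A small sketch of the definition, compared against the standard library's `statistics.harmonic_mean`. The speeds are hypothetical values (not from the slides): a trip covered at 40 km/h one way and 60 km/h back over the same distance.

```python
from statistics import harmonic_mean

speeds = [40, 60]  # hypothetical rates in km/h over equal distances

# Reciprocal of the arithmetic mean of the reciprocals
hm_manual = 1 / (sum(1 / x for x in speeds) / len(speeds))
hm_library = harmonic_mean(speeds)

print(hm_manual, hm_library)
```

For equal distances the harmonic mean gives the true average speed, which is lower than the arithmetic mean of 50 km/h.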
  • 42. 6. Weighted mean (WM) • In a weighted mean, separate outcomes have separate influences. • The influence attached to an outcome is the weight. • A familiar example is the calculation of a course grade as a weighted average of scores on separate components.
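The course-grade idea can be sketched as follows. The scores and weights are hypothetical numbers chosen for illustration only.

```python
# Hypothetical course components: (score out of 100, weight)
components = {
    "assignments": (80, 0.2),
    "midterm": (90, 0.3),
    "final exam": (70, 0.5),
}

total_weight = sum(w for _, w in components.values())
# Weighted mean = sum(weight * score) / sum(weights)
weighted_mean = sum(s * w for s, w in components.values()) / total_weight

print(round(weighted_mean, 1))
```

Dividing by the sum of the weights keeps the result correct even when the weights do not add up to 1.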
  • 44. Which measure of central tendency is best with a given set of data? • Two factors are important in making this decision: – The scale of measurement (type of data) – The shape of the distribution of the observations
  • 45. • The mean can be used for discrete and continuous data • The median is appropriate for discrete and continuous data as well, but can also be used for ordinal data • The mode can be used for all types of data, but may be especially useful for nominal and ordinal measurements • For discrete or continuous data, the “modal class” can be used
  • 46. • The geometric mean is used primarily for observations measured on a logarithmic scale. • Harmonic mean is a suitable MCT when the data pertain to rates and time. • Weighted mean is commonly used when separate outcomes carry different weights.
  • 47. (a) Symmetric and unimodal distribution: mean, median, and mode should all be approximately the same
  • 48. (b) Bimodal: mean and median should be about the same, but may take a value that is unlikely to occur; two modes might be best
  • 49. (c) Skewed to the right (positively skewed): mean is sensitive to extreme values, so median might be more appropriate. Typical ordering: mode < median < mean
  • 50. (d) Skewed to the left (negatively skewed): same as (c), with the mean pulled toward the low extreme values. Typical ordering: mean < median < mode
  • 51. Measures of Dispersion Consider the following two sets of data: A: 177 193 195 209 226 (Mean = 200); B: 192 197 200 202 209 (Mean = 200). Two or more sets may have the same mean and/or median yet differ greatly in how widely their values are spread.
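A quick check of the two data sets shows identical means but clearly different spread. The sketch uses the range and the sample standard deviation, both of which are defined later in this deck.

```python
from statistics import mean, stdev

a = [177, 193, 195, 209, 226]
b = [192, 197, 200, 202, 209]

for name, values in (("A", a), ("B", b)):
    value_range = max(values) - min(values)  # largest minus smallest value
    print(name, mean(values), value_range, round(stdev(values), 2))
```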
  • 52. These two distributions have the same mean, median, and mode
  • 53. • MCT are not enough to give a clear understanding about the distribution of the data. • We need to know something about the variability or spread of the values — whether they tend to be clustered close together, or spread out over a broad range
  • 54. Measures of Dispersion • Measures that quantify the variation or dispersion of a set of data from its central location • Dispersion refers to the variety exhibited by the values of the data. • The amount of dispersion is small when the values are close together. • If all the values are the same, there is no dispersion
  • 55. Measures of Dispersion Synonymous terms: – “Measures of Variation” – “Measures of Spread” – “Measures of Scatter”
  • 56. • Measures of dispersion include: – Range – Inter-quartile range – Variance – Standard deviation – Coefficient of variation – Standard error – Others
  • 57. 1. Range (R) • The difference between the largest and smallest observations in a sample. • Range = Maximum value – Minimum value • Example – Data values: 5, 9, 12, 16, 23, 34, 37, 42 – Range = 42 – 5 = 37 • A data set with a higher range exhibits more variability
  • 58. Properties of range • It is the simplest, crudest measure and is easily understood • It takes into account only two values, which makes it a poor measure of dispersion • Very sensitive to extreme observations • The larger the sample size, the larger the range tends to be
  • 59. 2. Interquartile range (IQR) • Indicates the spread of the middle 50% of the observations, and is used with the median: IQR = Q3 - Q1 • Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 Kg and 10.2 Kg, respectively. IQR = 10.2 Kg – 8.8 Kg = 1.4 Kg, i.e., 50% of the infant girls weigh between 8.8 and 10.2 Kg.
  • 60. The two quartiles (Q3 & Q1) form the basis of the Box-and-Whiskers Plots. [Figure: box-and-whiskers plots for Variables A, B, and C on a scale from 0 to 10]
  • 61. Properties of IQR: • It is a simple and versatile measure • It encloses the central 50% of the observations • It is not based on all observations but only on two specific values • It is important in selecting cut-off points in the formulation of clinical standards • Since it excludes the lowest and highest 25% values, it is not affected by extreme values • Less sensitive to the size of the sample
  • 62. 3. Quartile deviation (QD) $QD = \dfrac{Q_3 - Q_1}{2}$
  • 63. 4. Coefficient of quartile deviation (CQD) • $CQD = \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$ • CQD is a unitless quantity and is useful for comparing the variability among the middle 50% of observations.
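The three quartile-based measures can be computed together. This sketch reuses the quartiles from the deck's infant-weight example (Q1 = 8.8 Kg, Q3 = 10.2 Kg).

```python
q1, q3 = 8.8, 10.2  # first and third quartiles of weight (Kg)

iqr = q3 - q1               # interquartile range, in Kg
qd = (q3 - q1) / 2          # quartile deviation, in Kg
cqd = (q3 - q1) / (q3 + q1) # coefficient of quartile deviation, unitless

print(round(iqr, 2), round(qd, 2), round(cqd, 4))
```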
  • 64. 5. Mean deviation (MD) • Mean deviation is the average of the absolute deviations taken from a central value, generally the mean or median. • Consider a set of n observations $x_1, x_2, \ldots, x_n$. Then: $MD = \dfrac{1}{n}\sum_{i=1}^{n} |x_i - A|$ • where A is a central value (arithmetic mean or median).
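A minimal sketch of the mean deviation, applied to data sets A and B from the start of the dispersion section. The function name `mean_deviation` is an illustrative assumption.

```python
from statistics import mean

def mean_deviation(values, center):
    """Average absolute deviation of the values from a chosen central value."""
    return sum(abs(x - center) for x in values) / len(values)

a = [177, 193, 195, 209, 226]
b = [192, 197, 200, 202, 209]

md_a = mean_deviation(a, mean(a))  # deviations taken from the mean
md_b = mean_deviation(b, mean(b))
print(md_a, md_b)
```

Passing `statistics.median(values)` as `center` gives the mean deviation about the median instead.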
  • 65. Properties of mean deviation: • MD removes one main objection to the earlier measures, in that it involves every value • It is not affected much by extreme values • Its main drawback is that the algebraic negative signs of the deviations are ignored, which is mathematically unsound
  • 66. 6. Variance (σ², s²) • The main objection to mean deviation, that the negative signs are ignored, is removed by taking the square of the deviations from the mean. • The variance is the average of the squares of the deviations taken from the mean.
  • 67. • The deviations are squared because the sum of the deviations of the individual observations of a sample about the sample mean is always 0: $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ • The variance can be thought of as an average of squared deviations
  • 68. • Variance is used to measure the dispersion of values relative to the mean. • When values are close to their mean (narrow range) the dispersion is less than when there is scattering over a wide range. – Population variance = σ² – Sample variance = S²
  • 69. a) Ungrouped data • Let $X_1, X_2, \ldots, X_N$ be the measurements on N population units, then: $\sigma^2 = \dfrac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}$, where $\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$ is the population mean.
  • 70. A sample variance is calculated for a sample of individual values ($X_1, X_2, \ldots, X_n$) and uses the sample mean ($\bar{x}$) rather than the population mean µ: $S^2 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{x})^2}{n - 1}$
  • 71. Degrees of freedom • In computing the variance there are (n-1) degrees of freedom because only (n-1) of the deviations are independent from each other • The last one can always be calculated from the others automatically. • This is because the sum of the deviations from their mean (Xi-Mean) must add to zero.
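The points above can be illustrated with data set A from the dispersion introduction: the deviations sum to zero, and dividing the sum of squares by N or by n-1 gives the population or sample variance. `statistics.pvariance` and `statistics.variance` are the standard-library equivalents.

```python
from statistics import mean, pvariance, variance

a = [177, 193, 195, 209, 226]
xbar = mean(a)

deviations = [x - xbar for x in a]
print(sum(deviations))  # deviations from the mean always add to zero

ss = sum(d ** 2 for d in deviations)  # sum of squared deviations
pop_var = ss / len(a)                 # divide by N (population formula)
samp_var = ss / (len(a) - 1)          # divide by n-1 degrees of freedom

print(pop_var, samp_var)
print(pvariance(a), variance(a))      # library results for comparison
```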
  • 72. b) Grouped data $S^2 = \dfrac{\sum_{i=1}^{k}(m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$ where $m_i$ = the mid-point of the ith class interval, $f_i$ = the frequency of the ith class interval, $\bar{x}$ = the sample mean, k = the number of class intervals
  • 73. Properties of Variance: • The main disadvantage of variance is that its unit is the square of the unit of the original measurement values • The variance gives more weight to the extreme values than to those near the mean, because the differences are squared. • The drawbacks of variance are overcome by the standard deviation.
  • 74. 7. Standard deviation (σ, s) • It is the square root of the variance. • This produces a measure having the same scale as that of the individual values. $\sigma = \sqrt{\sigma^2}$ and $S = \sqrt{S^2}$
  • 75. • Following are the survival times of n=11 patients after heart transplant surgery. • The survival time for the “ith” patient is represented as Xi for i= 1, …, 11. • Calculate the sample variance and SD.
  • 77. Example. Compute the variance and SD of the age of 169 subjects from the grouped data. Mean = 5810.5/169 = 34.38 years; S² = 20197.63/(169 – 1) = 120.22; SD = √S² = √120.22 = 10.96 years
      Class interval   (mi)   (fi)   (mi-Mean)   (mi-Mean)²   (mi-Mean)² fi
      10-19            14.5   4      -19.88      395.21       1580.86
      20-29            24.5   66     -9.88       97.61        6442.55
      30-39            34.5   47     0.12        0.0144       0.68
      40-49            44.5   36     10.12       102.41       3686.92
      50-59            54.5   12     20.12       404.81       4857.77
      60-69            64.5   4      30.12       907.21       3628.86
      Total                   169                1907.29      20197.63
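The grouped-data computation can be reproduced from the class mid-points and frequencies alone. This is a sketch of the grouped formula with n-1 in the denominator; the variable names are illustrative.

```python
import math

mid = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]  # class mid-points (mi)
freq = [4, 66, 47, 36, 12, 4]               # class frequencies (fi)

n = sum(freq)
grouped_mean = sum(m * f for m, f in zip(mid, freq)) / n

# Sum of squared deviations of the mid-points, weighted by frequency
ss = sum(f * (m - grouped_mean) ** 2 for m, f in zip(mid, freq))
s2 = ss / (n - 1)   # sample variance
sd = math.sqrt(s2)  # standard deviation

print(n, round(grouped_mean, 2), round(s2, 2), round(sd, 2))
```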
  • 78. Properties of SD • The SD has the advantage of being expressed in the same units of measurement as the mean • SD is considered to be the best measure of dispersion and is used widely because of the properties of the theoretical normal curve. • However, if the units of measurement of the variables in two data sets are not the same, then their variability can’t be compared by comparing the values of SD.
  • 79. SD Vs Standard Error (SE) • SD describes the variability among individual values in a given data set • SE is used to describe the variability among separate sample means obtained from one sample to another • We interpret the SE of the mean to mean that another similarly conducted study may give a mean that lies within ± SE of the observed mean.
  • 80. Standard Error • SD is about the variability of individuals • SE is used to describe the variability in the means of repeated samples taken from the same population. • For example, imagine 5,000 samples, each of the same size n=11. This would produce 5,000 sample means. This new collection has its own pattern of variability. We describe this new pattern of variability using the SE, not the SD.
  • 81. Example: The heart transplant surgery n=11, SD=168.89, Mean=161 days, so SE = SD/√n = 168.89/√11 ≈ 50.9 days • What happens if we repeat the study? What will our next mean be? Will it be close? How different will it be? Focus here is on the generalizability of the study findings. • The behavior of the mean from one replication of the study to the next replication is referred to as the sampling distribution of the mean. • We can also have a sampling distribution of the median or the SD • We interpret this to mean that a similarly conducted study might produce an average survival time that is near 161 days, ±50.9 days.
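The ±50.9 days quoted above is consistent with the usual formula for the standard error of the mean, SE = SD/√n. The sketch below assumes that formula and uses only the summary figures given on the slide.

```python
import math

n = 11       # number of patients
sd = 168.89  # sample standard deviation of survival time (days)

se = sd / math.sqrt(n)  # standard error of the mean
print(round(se, 1))
```

Because n appears under a square root, quadrupling the sample size only halves the SE.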
  • 82. 8. Coefficient of variation (CV) • When two data sets have different units of measurement, or their means differ sufficiently in size, the CV should be used as a measure of dispersion. • It is the best measure for comparing the variability of two sets of observations. • Data with a smaller coefficient of variation are considered more consistent.
  • 83. $CV = \dfrac{S}{\bar{x}} \times 100$ • CV is the ratio of the SD to the mean multiplied by 100. • “Cholesterol is more variable than systolic blood pressure”
                    SD          Mean         CV (%)
      SBP           15 mmHg     130 mmHg     11.5
      Cholesterol   40 mg/dl    200 mg/dl    20.0
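The comparison in the table can be reproduced with a one-line function. The name `coefficient_of_variation` is an illustrative assumption; the SD and mean values are those shown on the slide.

```python
def coefficient_of_variation(sd, mean):
    """CV in percent: the SD expressed relative to the mean."""
    return sd / mean * 100

cv_sbp = coefficient_of_variation(15, 130)   # systolic blood pressure
cv_chol = coefficient_of_variation(40, 200)  # cholesterol (mg/dl)

print(round(cv_sbp, 1), round(cv_chol, 1))
```

Because the units cancel, the two CVs can be compared directly even though the variables are measured on different scales.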
  • 84. NOTE: • The range often appears with the median as a numerical summary measure • The IQR is used with the median as well • The SD is used with the mean • For nominal and ordinal data, a table or graph is often more effective than any numerical summary measure

Editor's Notes

  • #51: They appear to have about the same center. The difference lies in the greater variability or spread.
  • #56: To quantify the spread or variation in the data we use measures of spread. So measures of spread are measures that quantify the variation or dispersion of a set of data from its central location. They are also known as “measures of dispersion” or “measures of variation”. Some common measures of spread are: range, interquartile range, variance / standard deviation, standard error, and 95% confidence interval.
  • #57: The range is the simplest measure of dispersion. It is, simply, the difference between the largest and smallest values.
  • #58: The larger n is, the larger the ranges tend to be. This complication makes it difficult to compare ranges from different-size data sets.
  • #60: [Consider skipping or deleting the Interquartile Range slides - rcd] Here are the box-and-whiskers diagrams for Variables A, B, and C. You can see that Variable B had the narrowest interquartile range, because the numbers tended to be bunched in the middle. Variables A and C were more spread out.
  • #64: The mean deviation is a reasonable measure of spread, but does not characterize the spread as well as the standard deviation if the underlying distribution is bell-shaped.
  • #67: The sum of the deviations from the mean is always zero, so it does not summarize the difference between the individual sample points and the arithmetic mean.
  • #71: In computing the variance, we say that we have n-1 degrees of freedom, because the sum of the deviations of the values from their mean is equal to zero. If we know the values of n-1 of the deviations from the mean, we know the nth one, since it is determined by the requirement that all n deviations add up to zero. Only (n-1) of the deviations (Xi-Mean) are independent of each other.
  • #78: The mean and the SD are the most widely used measures of location and spread in the literature. The normal distribution is defined explicitly in terms of these two parameters.