Lecture-2 (discriptive statistics).ppt

NURSING Dream ● Discover ● Deliver
Lemma Derseh (BSc., MPH)
University of Gondar
College of medicine and health science
Department of Epidemiology and
Biostatistics
Descriptive statistics

Statistical Methods (branches of statistics)
Biostatistics has two branches:
Descriptive Statistics: collection, organizing, summarizing, and presenting of data
Inferential Statistics: making inferences, hypothesis testing, determining relationships, and making predictions

Descriptive Statistics
1. Involves
– Collecting data
– Presenting data
– Characterizing data
2. Purpose
– Describe data (e.g. x̄ = 74.5, S² = 213)
[Figure: bar chart of class size (axis 0 to 100) for the 1st to 4th batches of one department]

Descriptive statistics cont…
Types of descriptive statistics
 Pictorial measures: tables/charts/graphs
 Numerical summary measures:
– Measures of central tendency
– Measures of variability

Tables/charts/graphs
 Tables are used for categorical variables or
categorized numerical data
 Tables:
 Frequency (for nominal and ordinal data)
 Relative frequency (for nominal and ordinal data)
 Cumulative frequencies (for ordinal data)
The methods of describing data differ depending on the
type of the data itself (i.e. Numerical or Categorical).

Describing categorical variables … cont
 Frequency is the number of observations in each category
 The relative frequency of a class is the portion or
percentage of the data that falls in that class
E.g. 1: The blood types of 30 patients are given as follows:
A AB B B A O O AB AB B O A A B B A AB A O AB
B AB AB O A AB AB O A O
Construct a frequency table for these data.
Type Frequency Relative frequency
A 8 0.267
B 6 0.20
AB 9 0.30
O 7 0.233
Total 30 1.00
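The frequency table above can be reproduced with a short script. This is a minimal sketch in Python (the slides themselves do not use any software); the variable names are illustrative only.

```python
from collections import Counter

# Blood types of the 30 patients, copied from the example above.
blood_types = (
    "A AB B B A O O AB AB B O A A B B A AB A O AB "
    "B AB AB O A AB AB O A O"
).split()

counts = Counter(blood_types)  # frequency of each category
n = len(blood_types)           # total number of observations

print(f"{'Type':<6}{'Frequency':>10}{'Relative frequency':>20}")
for blood_type in ["A", "B", "AB", "O"]:
    freq = counts[blood_type]
    # Relative frequency = frequency / total number of observations
    print(f"{blood_type:<6}{freq:>10}{freq / n:>20.3f}")
print(f"{'Total':<6}{n:>10}{1.0:>20.3f}")
```

The relative frequencies always sum to 1, which is a quick check that no observation was missed.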

Distribution of birth weight of newborns between 1976-1996 at TAH.
BWT        Freq.   Rel. freq. (%)   Cum. freq.   Cum. rel. freq. (%)
Very low     43        0.4              43             0.4
Low         793        8.0             836             8.4
Normal     8870       88.9            9706            97.3
Big         268        2.7            9974           100
Total      9974      100
Cumulative relative frequency is relevant for ordinal data.
Consider, for example, the variable birth weight with levels
‘Very low’, ‘Low’, ‘Normal’ and ‘Big’.
The cumulative frequency of a class is the sum of the
frequency for that class and all the previous classes.
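As a rough illustration of the definition above, the cumulative columns of the birth weight table can be computed from the frequencies alone. This is a sketch in Python; the variable names are assumptions made for the example.

```python
from itertools import accumulate

# Ordered categories and their frequencies, taken from the birth weight table.
categories = ["Very low", "Low", "Normal", "Big"]
freqs = [43, 793, 8870, 268]

total = sum(freqs)
# Cumulative frequency: frequency of a class plus all previous classes.
cum_freqs = list(accumulate(freqs))
rel_freqs = [100 * f / total for f in freqs]       # in percent
cum_rel_freqs = [100 * c / total for c in cum_freqs]  # in percent

for name, f, rf, cf, crf in zip(categories, freqs, rel_freqs, cum_freqs, cum_rel_freqs):
    print(f"{name:<10}{f:>6}{rf:>8.1f}{cf:>8}{crf:>8.1f}")
```

The ordering of the categories matters here, which is why cumulative frequencies are meaningful for ordinal but not for nominal data.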

Charts
 Charts are used only for categorical variables
 Bar charts
The successive bars are separated (not continuous)
 Pie charts
Each sector of a circle indicates a category of data

Charts cont…
Bar Chart
 Bar charts: display the frequency distribution for
nominal or ordinal data.
 The various categories into which the observations fall
are represented along the horizontal axis and the frequencies along the vertical axis.

Fig. 1 Bar chart for blood type of 30 patients

Pie chart
 Pie chart displays the frequency of nominal or ordinal
variables.
 The various categories of the variable will be represented
by the sector of the circle.
 The area of each sector is proportional to the frequency
of the corresponding category of the variable

Fig. 3. Pie chart showing the frequency distribution of the
variable blood group

Categorizing Numeric data
 In order to present and organize numeric type of data using tables or
graphs, we need to group the dataset as follows:
 Number of class: the number of categories the table will have
 Class limit: The range for each class
 Lower class limit
 Upper class limit
 Class boundary: the continuous version of the class limits, obtained by
subtracting 0.5 from the lower class limit and adding 0.5 to the upper class limit
(for whole-number data; use 0.05 for data recorded to one decimal place)
 Lower class boundary
 Upper class boundary
 Class mark: The average of lower and upper class limit.

Sturges' rule
 Select a set of continuous, non-overlapping intervals such
that each value in the set of observations can be placed in
one, and only one, of the intervals.
$$K = 1 + 3.322\,\log_{10} n, \qquad W = \frac{L - S}{K}$$
– Where K = number of class intervals
– n = number of observations
– W = width of the class interval
– L = the largest value
– S = the smallest value



Sturges' rule cont…
 For datasets with integer values, subtract 0.5 from the lower class limits
and add 0.5 to the upper class limits to find the class boundaries
 The answer obtained by applying Sturges' rule should not be
regarded as final, but should be considered as a guide only.
 The number of class intervals specified by the rule should be
increased or decreased for convenience and clear presentation

Example 1
 The blood lead levels, measured in μg/dl, of a sample of 88
individuals living in a region are given as follows (numbers
in blue are for females and those in black are for males):
20,21, 22,22,23,23,23,24,24,24,24,25,25,25,25,25,26,26,26,26,26,27,
27,27,27,27,27,28,28,28,28,28,28,28,28,29,29,29,29,29,30,30,30,30,
30,30,30,30,30,31,31,31,31,31,31,31,32,32,32,32,32,33,33,33,33,33,
33,33,34,34,34,34,35,35,35,35,36,36,36,36,36,37,37,37,37,38,38,39
 Construct frequency distribution for the data.
Solution:
$$K = 1 + 3.322\,\log_{10} n = 1 + 3.322\,\log_{10}(88) = 7.46 \approx 7$$
$$W = \frac{L - S}{K} = \frac{39 - 20}{7} = \frac{19}{7} = 2.7 \approx 3$$
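The calculation of K and W can be checked with a few lines of Python. This is only a sketch: rounding K to the nearest integer and rounding W up to the next integer are assumptions made here so that the classes cover the whole range; the rule itself is a guide, not a fixed recipe.

```python
import math

n = 88          # number of observations
largest = 39    # L, the largest value
smallest = 20   # S, the smallest value

# Sturges' rule: K = 1 + 3.322 * log10(n)
k_raw = 1 + 3.322 * math.log10(n)
k = round(k_raw)                  # assumption: round to the nearest integer

# Class width: W = (L - S) / K
w_raw = (largest - smallest) / k
w = math.ceil(w_raw)              # assumption: round up so all values are covered

print(f"K = {k_raw:.2f}, rounded to {k}")
print(f"W = {w_raw:.2f}, rounded up to {w}")
```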

Solution
Blood lead level
Class limit   Class boundaries   Mi   Frequency   RF      CF   RCF
20-22         19.5-22.5          21   4           4/88    4    4/88
23-25         22.5-25.5          24   12          12/88   16   16/88
26-28         25.5-28.5          27   19          19/88   35   35/88
29-31         28.5-31.5          30   21          21/88   56   56/88
32-34         31.5-34.5          33   16          16/88   72   72/88
35-37         34.5-37.5          36   13          13/88   85   85/88
38-40         37.5-40.5          39   3           3/88    88   88/88
Where:
 RF = relative frequency
 Mi = class mark
 CF = cumulative frequency
 RCF = relative cumulative frequency
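The frequency distribution in the solution can be rebuilt from the raw data. The following is a minimal Python sketch, assuming the first class starts at 20 and using the class width of 3 and the 7 classes obtained from Sturges' rule; the dictionary keys are illustrative names.

```python
# Blood lead levels (μg/dl) of the 88 individuals, copied from Example 1.
raw = (
    "20,21,22,22,23,23,23,24,24,24,24,25,25,25,25,25,26,26,26,26,26,27,"
    "27,27,27,27,27,28,28,28,28,28,28,28,28,29,29,29,29,29,30,30,30,30,"
    "30,30,30,30,30,31,31,31,31,31,31,31,32,32,32,32,32,33,33,33,33,33,"
    "33,33,34,34,34,34,35,35,35,35,36,36,36,36,36,37,37,37,37,38,38,39"
)
data = [int(v) for v in raw.split(",")]

FIRST_LOWER_LIMIT = 20  # lower class limit of the first class (assumption)
WIDTH = 3               # class width W
NUM_CLASSES = 7         # number of classes K

n = len(data)
rows = []
cumulative = 0
for i in range(NUM_CLASSES):
    lower = FIRST_LOWER_LIMIT + i * WIDTH
    upper = lower + WIDTH - 1
    frequency = sum(lower <= x <= upper for x in data)
    cumulative += frequency
    rows.append({
        "limits": (lower, upper),
        "boundaries": (lower - 0.5, upper + 0.5),  # whole-number data: +/- 0.5
        "mark": (lower + upper) / 2,               # class mark Mi
        "frequency": frequency,
        "rf": frequency / n,                       # relative frequency
        "cf": cumulative,                          # cumulative frequency
        "rcf": cumulative / n,                     # relative cumulative frequency
    })

for row in rows:
    print(row["limits"], row["boundaries"], row["mark"],
          row["frequency"], f"{row['rf']:.3f}", row["cf"], f"{row['rcf']:.3f}")
```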

Graphs
 Some examples are:
 Histogram,
 Frequency polygon,
 Cumulative Relative Frequency Curve etc

Histograms
 Histograms are frequency distributions with continuous class
interval that have been turned into graphs.
 The area of each column is proportional to the number of
observations in that interval

Example
The distribution of the blood lead level of 88 individuals
Blood LL No. of Individuals
19.5-22.5 4
22.5-25.5 12
25.5-28.5 19
28.5-31.5 21
31.5-34.5 16
34.5-37.5 13
37.5-40.5 3
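[Figure: histogram of the blood lead level data; horizontal axis: blood lead level, marked at the class boundaries 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5 and 40.5]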

Frequency polygons
 Instead of drawing bars for each class interval, sometimes
a single point is drawn at the mid point of each class
interval and consecutive points joined by straight line.
 Graphs drawn in this way are called frequency polygons
(line graphs).

Frequency polygons cont…
Frequency polygon for the blood lead level of study
participants

Frequency polygon of blood lead level for
males and females
Frequency polygons are superior to histograms for
comparing two or more sets of data.

Cumulative frequency curve (ogive)
 The horizontal axis displays the different categories/intervals
 The vertical axis displays cumulative (relative) frequency.
 A point is placed at the true upper limit of each interval; the
height represents the cumulative relative frequency
associated with that interval. The points are then connected
by straight lines.
 Like frequency polygons, cumulative frequency curve may be
used to compare sets of data.
 Cumulative frequency curve can also be used to obtain
percentiles of a set of data.
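Reading a percentile off the ogive amounts to linear interpolation between the plotted points. The sketch below, in Python, uses the class boundaries and cumulative frequencies of the blood lead level example; the function name `percentile_from_ogive` is hypothetical, and straight-line interpolation is assumed because the points of the curve are joined by straight lines.

```python
# Points of the ogive for the blood lead level data:
# class boundaries (x-axis) and cumulative frequencies (y-axis).
boundaries = [19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5, 40.5]
cum_freq = [0, 4, 16, 35, 56, 72, 85, 88]


def percentile_from_ogive(p):
    """Return the p-th percentile by linear interpolation on the ogive."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    target = p / 100 * cum_freq[-1]  # cumulative frequency to look up
    for i in range(1, len(boundaries)):
        if cum_freq[i] >= target and cum_freq[i] > cum_freq[i - 1]:
            lo_x, hi_x = boundaries[i - 1], boundaries[i]
            lo_c, hi_c = cum_freq[i - 1], cum_freq[i]
            # Position of the target along the line segment of this class
            return lo_x + (target - lo_c) / (hi_c - lo_c) * (hi_x - lo_x)
    raise ValueError("cumulative frequencies must increase")


print(f"50th percentile (median): {percentile_from_ogive(50):.2f}")
print(f"25th percentile: {percentile_from_ogive(25):.2f}")
print(f"75th percentile: {percentile_from_ogive(75):.2f}")
```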

Cumulative frequency curve cont…
 Cumulative relative frequency curve for the blood lead
level of study participants
[Figure: cumulative relative frequency curve; vertical axis: cumulative frequency (proportion of individuals). The graph begins at the lower boundary of the first class and ends at the upper boundary of the last class.]

Box plots
 A visual picture called a box (box-and-whisker) plot can be
used to convey a fair amount of information about the
distribution of a set of data.
 It is used as an exploratory data analysis tool.
 The box shows the distance between the first and the
third quartiles.
 The median is marked as a line within the box.
 The end lines show the minimum and maximum values
respectively.

Box plots cont…
The box plot is based on the five-number summary:
– The minimum entry
– Q1
– Q2 (median)
– Q3
– The maximum entry
The quartiles are values which divide the distribution
into four parts such that there are an equal number of
observations in each part.
Q1 = [(n+1)/4]th observation
Q2 = [2(n+1)/4]th observation
Q3 = [3(n+1)/4]th observation

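Box plots cont…
Example: Use the following age data of 15 patients to draw
a box-and-whisker plot.
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
With n = 15, Q1 is the 4th value (37), Q2 is the 8th value (43) and Q3 is the 12th value (48); Min = 35 and Max = 55.
[Figure: Illustration of box plot using the age of 15 patients, with Min, Q1, Q2, Q3 and Max marked. Notice the distribution of data in each quarter (distance between quartiles).]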
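The five-number summary for the age data can be computed with the (n+1)/4 position rule. This is a sketch in Python; the function name `quartile` is hypothetical, and interpolating between neighbours when the position is not a whole number is an assumption, since the slides only show a case where the positions are whole numbers.

```python
ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]


def quartile(sorted_values, q):
    """Return quartile q (1, 2 or 3) using the q(n + 1)/4 position rule."""
    n = len(sorted_values)
    position = q * (n + 1) / 4        # 1-based position in the ordered data
    lower_index = int(position)       # whole part of the position
    fraction = position - lower_index
    if lower_index < 1:
        return sorted_values[0]
    if lower_index >= n:
        return sorted_values[-1]
    lower = sorted_values[lower_index - 1]
    upper = sorted_values[lower_index]
    # Assumption: interpolate linearly when the position is fractional.
    return lower + fraction * (upper - lower)


ordered = sorted(ages)
five_number_summary = (
    ordered[0],
    quartile(ordered, 1),
    quartile(ordered, 2),
    quartile(ordered, 3),
    ordered[-1],
)
print("Min, Q1, Q2, Q3, Max:", five_number_summary)
print("IQR:", five_number_summary[3] - five_number_summary[1])
```

Statistical packages use several slightly different quartile conventions, so results for other data sets may differ a little from this rule.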

A box-plot indicating the distribution of blood
lead level of individuals by sex

Measures of central tendency
 It is often useful to summarize, in a single number or statistic,
the general location of the data or the point at which the data
tend to cluster.
 Such statistics are called measures of location or measures of
central tendency.
 We describe three of them: the mean, the median and the mode.
Arithmetic mean
 The arithmetic mean, usually abbreviated to ‘mean’, is the sum of
the observations divided by the number of observations.

Arithmetic Mean
a) Ungrouped data
If $x_1, x_2, \ldots, x_n$ are n observed values of a sample, then the sample mean is
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population mean, if the x's are population observations:
$$\mu = \frac{\sum x}{N}$$
Example: Blood lead level for 88 sample individuals
$$\bar{x} = \frac{\sum_{i=1}^{88} x_i}{88} = \frac{20 + 21 + 22 + \cdots + 39}{88} = 29.92$$

Arithmetic Mean cont…
 b) Grouped data
 In calculating the mean from grouped data, we assume that
all values falling into a particular class interval are located
at the mid-point of the interval. It is calculated as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
 where,
k = the number of class intervals
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval

Arithmetic Mean cont…
Example: Arithmetic mean for grouped data of blood lead level
Blood lead level (CB)   Class mark (Mi)   Frequency
19.5-22.5               21                4
22.5-25.5               24                12
25.5-28.5               27                19
28.5-31.5               30                21
31.5-34.5               33                16
34.5-37.5               36                13
37.5-40.5               39                3
$$\bar{x} = \frac{\sum_{i=1}^{7} m_i f_i}{\sum_{i=1}^{7} f_i} = \frac{(21 \times 4) + (24 \times 12) + \cdots + (39 \times 3)}{4 + 12 + \cdots + 3} = \frac{2628}{88} = 29.86$$
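The grouped mean is a frequency-weighted average of the class marks, which a few lines of Python make explicit. This is a sketch with illustrative variable names, using the class marks and frequencies of the blood lead level table.

```python
# Class marks (mid-points) and frequencies from the blood lead level table.
class_marks = [21, 24, 27, 30, 33, 36, 39]
frequencies = [4, 12, 19, 21, 16, 13, 3]

# Numerator: sum of (class mark x frequency); denominator: total frequency.
weighted_sum = sum(m * f for m, f in zip(class_marks, frequencies))
total_frequency = sum(frequencies)
grouped_mean = weighted_sum / total_frequency

print(f"sum of m_i * f_i = {weighted_sum}")
print(f"sum of f_i = {total_frequency}")
print(f"grouped mean = {grouped_mean:.2f}")
```

The grouped mean (29.86) differs slightly from the ungrouped mean (29.92) because every value in a class is treated as if it were equal to the class mark.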

Properties of the arithmetic mean
 The mean can be used as a summary measure for both discrete
and continuous data. In general, however, it is not appropriate
for either nominal or ordinal data.
 For a given set of data there is one and only one arithmetic
mean.
 The algebraic sum of the deviations of the given values from their
arithmetic mean is always zero.
 The arithmetic mean is greatly affected by extreme values.
 In grouped data, if any class interval is open-ended, the arithmetic mean
cannot be calculated.

Median
 With the observations arranged in an increasing or decreasing order,
the median is defined as the middle observation.
Ungrouped data
 If the number of observations is odd, the median is defined as the
[(n+1)/2]th observation.
 If the number of observations is even, the median is the average of
the two middle values, i.e. the (n/2)th and [(n/2)+1]th observations.
 Example , where n is even: 19, 20, 20, 21, 22, 24, 27, 27, 27, 34
 Then, the median = (22 + 24)/2 = 23
 The ungrouped median for the blood lead level data is the average
of the 44th & 45th observation; which is (30+30)/2 =30
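The two rules for the ungrouped median (odd and even n) can be written as one small function. This is a minimal Python sketch; the function name `median` is chosen for the example and is applied to the ten numbers above and to the age data of 15 patients used elsewhere in the lecture.

```python
def median(values):
    """Return the middle observation of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the [(n + 1)/2]th observation (index mid, counting from 0)
        return ordered[mid]
    # Even n: average of the (n/2)th and [(n/2) + 1]th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


ten_numbers = [19, 20, 20, 21, 22, 24, 27, 27, 27, 34]
ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]

print("Median of the ten numbers:", median(ten_numbers))
print("Median age of the 15 patients:", median(ages))
```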

Median Cont…
Grouped data
 In calculating the median from grouped data, we assume that
the values within a class-interval are evenly distributed
through the interval.
– The first step is to locate the class interval that contains the median.
– Find n/2; the median class is the first class interval whose
cumulative frequency is greater than or equal to n/2.

Median for Grouped data …cont
To find a unique median value, use the following interpolation formula:
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
 where,
 Lm = lower true class boundary of the interval containing the median
 Fc = cumulative frequency of the interval just below the median class
interval
 fm = frequency of the interval containing the median
 W = class interval width
 n = total number of observations

Median for grouped data cont…
Example
Using the data on the blood lead level of 88 individuals, the
grouped median is:
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W = 28.5 + \left(\frac{44 - 35}{21}\right) \times 3 = 29.79$$
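The interpolation formula for the grouped median can be sketched in Python as follows. The function name `grouped_median` and its arguments are hypothetical; the median class is taken to be the first class whose cumulative frequency reaches n/2.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency below the current class
    for lower, freq in zip(lower_boundaries, frequencies):
        if freq > 0 and cumulative + freq >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("frequencies must contain at least one observation")


# Lower class boundaries and frequencies from the blood lead level table.
lower_boundaries = [19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5]
frequencies = [4, 12, 19, 21, 16, 13, 3]

result = grouped_median(lower_boundaries, frequencies, width=3)
print(f"Grouped median = {result:.2f}")
```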





 

















Properties of median
 The median can be used as a summary measure for
ordinal, discrete and continuous data, in general
however, it is not appropriate for nominal data.
 There is only one median for a given set of data
 Median is a positional average and hence it is not
drastically affected by extreme values (It is robust or
resistant to extreme values)
 Median can be calculated even in the case of open end
intervals
 It is not a good representative of data if the number of
items is small

Mode
 Any observation of a variable at which the distribution reaches a
peak is called a mode.
 Most distributions encountered in practice have one peak and
are described as uni-modal.
 E.g. Consider the example of ten numbers
19 21 20 20 34 22 24 27 27 27
In the above data set, the mode is 27
 The mode of grouped data, usually refers to the modal class,
(the class interval with the highest frequency)
 If a single value for the mode of grouped data must be
specified, it is taken as the mid point of the modal class interval
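Both kinds of mode described above can be found with a short Python sketch. It uses `statistics.multimode` from the standard library (Python 3.8 or later), which returns every value that reaches the highest count; the grouped data are those of the blood lead level example.

```python
from statistics import multimode

# Ungrouped data: the ten numbers from the example above.
ten_numbers = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]
modes = multimode(ten_numbers)  # list, because a mode need not be unique
print("Mode(s):", modes)

# Grouped data: the modal class is the class with the highest frequency.
class_boundaries = [(19.5, 22.5), (22.5, 25.5), (25.5, 28.5), (28.5, 31.5),
                    (31.5, 34.5), (34.5, 37.5), (37.5, 40.5)]
frequencies = [4, 12, 19, 21, 16, 13, 3]

modal_index = frequencies.index(max(frequencies))
modal_class = class_boundaries[modal_index]
# If a single value is required, take the mid-point of the modal class.
modal_value = sum(modal_class) / 2

print("Modal class:", modal_class)
print("Mid-point of the modal class:", modal_value)
```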

Properties of mode
 The mode can be used as a summary measure for
nominal, ordinal, discrete and continuous data, in general
however, it is more appropriate for nominal and ordinal
data.
 It is not affected by extreme values
 It can be calculated for distributions with open end classes
 Sometimes its value is not unique
 The main drawback of mode is that it may not exist

Measures of variability (Dispersion)
 In order to fully understand the nature of the distribution of data set,
both measures of location and dispersion are important
 Some measures of variability are: range, inter-quartile range,
variance, standard deviation and the coefficient of variation.
Range:
 The range is the difference between the largest and the smallest
observations in the data set.
 Being determined by only the two extreme observations, use of the
range is limited because it tells us nothing about how the data
between the extremes are spread.
Example1 : We use the data set of 10 numbers:
19 , 21,20, 20, 34, 22, 24, 27, 27, 27
The range = 34 – 19 = 15


Quartiles and Inter-quartile Range, Percentiles
• The inter-quartile range (IQR) is the difference between the
third and the first quartiles: IQR = Q3 – Q1
• Example: Consider the age data of 15 patients to find the IQR
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
Q1 = 37, Q2 = 43, Q3 = 48
• IQR = 48 – 37 = 11

Quartiles and Inter-quartile Range, Percentiles
 Percentiles divide the data into 100 parts with an equal number of
observations in each part.
 It follows that the 25th percentile is the first quartile, the 50th
percentile is the median and the 75th percentile is the third
quartile.

Variance
 A good measure of dispersion should make use of all the data.
 Intuitively, a good measure could be derived by combining, in
some way, the deviations of each observation from the mean.
 The variance achieves this by averaging the sum of the squares
of the deviations from the mean.

Variance cont…
 The population variance of a population data set of N entries is
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
 The sample variance of the set x1, x2, ..., xn of n
observations with mean x̄ is
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
 Note: The sum of the deviations from the mean is always zero, so the
deviations are squared before they are added and averaged
to get the variance.

Standard Deviation
 Because it is built from squared deviations, the variance is limited as
a descriptive statistic: it is not in the same units as
the observations.
 By taking the square root of the variance, we obtain a
measure of dispersion in the original units.
 It is usually denoted by s.d. or simply S, and the formula is
given by:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$


Examples
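Example 1: Let us use the age data of 15 individuals. Find its variance.
35 35 36 37 37 38 42 43 43 44 45 48 48 51 55
Solution
$$S^2 = \frac{\sum_{i=1}^{15} (x_i - \bar{x})^2}{15 - 1} = 38.12, \quad \text{where } \bar{x} = 42.47$$
Example 2: Consider the blood lead level of the 88
individuals given before. Find its variance.
Solution
From the 88 individual observations:
$$\bar{x} = \frac{\sum_{i=1}^{88} x_i}{88} = \frac{20 + 21 + 22 + \cdots + 39}{88} = 29.92$$
$$S^2 = \frac{\sum_{i=1}^{88} (x_i - \bar{x})^2}{88 - 1} = 19.94$$
Working instead from the class marks and frequencies of the grouped table gives x̄ = 29.86 and S² = 20.46.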
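The sample variance and standard deviation of the age data can be checked both from the definition and with Python's standard `statistics` module. This is a sketch; `statistics.variance` and `statistics.stdev` use the n - 1 divisor, matching the sample formulas above.

```python
import math
import statistics

ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]

n = len(ages)
mean = sum(ages) / n

# Sample variance from the definition: squared deviations divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in ages]
variance_by_hand = sum(squared_deviations) / (n - 1)
sd_by_hand = math.sqrt(variance_by_hand)

# The same quantities from the standard library.
variance_lib = statistics.variance(ages)
sd_lib = statistics.stdev(ages)

print(f"mean = {mean:.2f}")
print(f"variance = {variance_by_hand:.2f} (library: {variance_lib:.2f})")
print(f"standard deviation = {sd_by_hand:.2f} (library: {sd_lib:.2f})")
```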

Coefficient of variation
 When we want to compare the variability in two sets of data, the
standard deviation, which measures absolute variation, may
mislead us, especially if the two data sets:
have different units of measurement, or
have widely different means
 The coefficient of variation (CV) gives relative variation and is the
best measure used to compare the variability in two sets of data.
 CV is the ratio of the standard deviation to the mean, often presented multiplied by 100%:
$$CV = \frac{S}{\bar{x}} \times 100\%$$
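As an illustration only, the coefficient of variation can be computed for two of the data sets used earlier in this lecture (the age data and the ten numbers). The function name `coefficient_of_variation` is hypothetical, and pairing these two data sets is an assumption made purely to show the comparison.

```python
import statistics


def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) x 100%."""
    return statistics.stdev(values) / statistics.mean(values) * 100


ages = [35, 35, 36, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55]
ten_numbers = [19, 21, 20, 20, 34, 22, 24, 27, 27, 27]

cv_ages = coefficient_of_variation(ages)
cv_ten = coefficient_of_variation(ten_numbers)

print(f"SD of ages = {statistics.stdev(ages):.2f}, CV = {cv_ages:.1f}%")
print(f"SD of ten numbers = {statistics.stdev(ten_numbers):.2f}, CV = {cv_ten:.1f}%")
```

Because the CV has no units, it can be compared across variables measured on different scales, which the standard deviation alone cannot.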

Mean, standard deviation and the
normal distribution
 For unimodal, moderately symmetrical sets of data
(i.e. approximately normally distributed data), approximately:
 68% of observations lie within 1 standard deviation of
the mean.
 95% of observations lie within 2 standard deviations of
the mean.

[Figure: The Empirical Rule. Bell-shaped curve centred at x̄, with the horizontal axis marked at x̄ − 3s, x̄ − 2s, x̄ − s, x̄, x̄ + s, x̄ + 2s and x̄ + 3s. 68% of data are within 1 standard deviation of the mean (34% on each side), 95% are within 2 standard deviations (a further 13.5% on each side) and 99.7% are within 3 standard deviations (a further 2.4% on each side), leaving 0.1% in each tail.]
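One way to see how well the empirical rule fits a real data set is to count the share of observations within 1, 2 and 3 standard deviations of the mean. The Python sketch below does this for the blood lead level data; the function name `share_within` is hypothetical, and the result is only expected to be close to 68%, 95% and 99.7% when the data are roughly normal.

```python
import statistics

# Blood lead levels (μg/dl) of the 88 individuals, copied from Example 1.
raw = (
    "20,21,22,22,23,23,23,24,24,24,24,25,25,25,25,25,26,26,26,26,26,27,"
    "27,27,27,27,27,28,28,28,28,28,28,28,28,29,29,29,29,29,30,30,30,30,"
    "30,30,30,30,30,31,31,31,31,31,31,31,32,32,32,32,32,33,33,33,33,33,"
    "33,33,34,34,34,34,35,35,35,35,36,36,36,36,36,37,37,37,37,38,38,39"
)
data = [int(v) for v in raw.split(",")]

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)


def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    lower, upper = mean - k * sd, mean + k * sd
    return sum(lower <= x <= upper for x in data) / len(data)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")
```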

Choosing Appropriate measures
 If data are symmetric, with no serious outliers, use mean
and standard deviation.
 If data are skewed, and/or have serious outliers, use IQR
and median.
 If comparing variation across two variables, use coefficient
of variation if the variables are in different units and/or
scales or the means are significantly different.
 If the scales/units and mean are roughly the same direct
comparison of the standard deviation is fine.

Fig. 2(a). Symmetric distribution: Mean = Median = Mode
Fig. 2(b). Distribution skewed to the right: Mean > Median > Mode
Fig. 2(c). Distribution skewed to the left: Mean < Median < Mode
