descriptive data analysis

DESCRIPTIVE DATA ANALYSIS
Presented by
Dr. P. Gnana Sarita Kumari
I MDS
Department of Public Health Dentistry

CONTENTS
• INTRODUCTION
• TYPES OF VARIABLES AND LEVELS OF MEASUREMENT
• MEASURES OF CENTRAL TENDENCY
• MEASURES OF DISPERSION
• NORMAL DISTRIBUTION
• MEASURES OF ASYMMETRY
• MEASURES OF RELATIONSHIP
• CONCLUSION
• REFERENCES

INTRODUCTION
DESCRIPTIVE ANALYSIS:
• The data describe one group and that group only.
• Descriptive data analysis limits generalization to the particular group of individuals observed.
• No conclusions are extended beyond this group.
• It provides valuable information about the nature of a particular group of individuals.

CLASSIFICATION OF VARIABLES
• QUALITATIVE: NOMINAL, ORDINAL
• QUANTITATIVE: DISCRETE, CONTINUOUS

LEVELS OF MEASUREMENT
• Introduced by STEVENS

NOMINAL MEASUREMENT SCALE
• Represents the simplest form of data
• Values fall into unordered categories
• No quantitative relationship between categories
• Numbers are used only for the sake of convenience

ORDINAL MEASUREMENT SCALE
• Categories can be ordered or ranked
• Though ordered, the ranking is not quantified
• The number or label assigned indicates relative magnitude (rank)
• Precise measurement of differences between ranks does not exist

INTERVAL MEASUREMENT SCALE
• Observations can be ordered
• Precise differences between units of measure exist
• No meaningful absolute zero

RATIO MEASUREMENT SCALE
• Possesses the same properties as the interval scale
• Highest level of measurement
• A true zero exists

MEASURES OF CENTRAL TENDENCY
• Mean
• Median
• Mode

TYPES OF MEAN
• Sample mean
• Weighted mean
• Geometric mean
• Harmonic mean
• Mean of two or more means

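SAMPLE MEAN
Mean = (Total or sum of observations) / (Number of observations)
For an ungrouped series it is calculated by:
1. DIRECT METHOD
$\bar{X} = \frac{\sum X}{n}$
2. ASSUMED MEAN METHOD
$\bar{X} = A + \frac{\sum (X - A)}{n}$
Where $X$ is an individual observation, $n$ is the number of observations and $A$ is the assumed mean.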
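The direct and assumed mean methods can be sketched in Python as follows. The observations and the assumed mean of 10 are hypothetical values chosen only for illustration, and the function names are not from the slides. Both methods should give the same result.

```python
def mean_direct(values):
    """Direct method: sum of the observations divided by their number."""
    return sum(values) / len(values)


def mean_assumed(values, assumed):
    """Assumed mean method: assumed mean plus the average deviation from it."""
    deviations = [x - assumed for x in values]
    return assumed + sum(deviations) / len(values)


data = [8, 10, 12, 14, 16]        # hypothetical observations
print(mean_direct(data))          # 12.0
print(mean_assumed(data, 10))     # 12.0
```

The assumed mean method is mainly a hand-calculation shortcut: the deviations from a convenient round number are smaller and easier to add than the raw values.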

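WEIGHTED MEAN
• Used for grouped data with a range of values
• Also called GRAND MEAN
Let $X_1, X_2, \ldots, X_n$ be $n$ measurements, and let their relative importance be
expressed by a corresponding set of numbers $w_1, w_2, \ldots, w_n$.
Calculation:
$\bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$
When the weights sum to 1, the denominator drops out and $\bar{X}_w = \sum_{i=1}^{n} w_i X_i$.
o By middle point method
o By alternative method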
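A minimal Python sketch of the weighted mean, assuming the weights are group sizes. The group means and sizes below are hypothetical, and the function name is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of w_i * X_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


group_means = [10.0, 20.0, 30.0]   # hypothetical group means
group_sizes = [1, 2, 2]            # hypothetical weights (group sizes)
print(weighted_mean(group_means, group_sizes))   # 22.0
```

With equal weights the result reduces to the ordinary sample mean.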

GEOMETRIC MEAN
• The sample geometric mean of $n$ non-negative observations $X_1, X_2, \ldots, X_n$ in a sample is defined as the $n$th root of their product:
$\bar{X}_G = \sqrt[n]{X_1 \cdot X_2 \cdots X_n} = (X_1 \cdot X_2 \cdots X_n)^{1/n}$
• If there are any negative measurements in a data set, the geometric
mean cannot be used.
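As a sketch, the geometric mean can be computed in Python through logarithms, which avoids overflow when many values are multiplied. The use of logarithms is an implementation choice made here, not something stated on the slide, and the input values are hypothetical.

```python
import math


def geometric_mean(values):
    """n-th root of the product of the observations, computed via logarithms."""
    if any(x < 0 for x in values):
        raise ValueError("geometric mean cannot be used with negative values")
    if any(x == 0 for x in values):
        return 0.0                      # a zero makes the whole product zero
    return math.exp(sum(math.log(x) for x in values) / len(values))


print(round(geometric_mean([2, 8]), 6))   # 4.0, the square root of 2 * 8
```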

HARMONIC MEAN
• The harmonic mean is defined as the reciprocal of the
average of the reciprocals of the values of the items in a series.
• Harmonic mean: $H = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$

MEDIAN
• The median is the value that divides the distribution of data
points into two equal parts, that is, the value at which 50% of
the data points lie above it and 50% lie below it.
• The median is the middle of the quartiles (the values that
divide the series into quarters) and the middle of the
percentiles (the values that divide the series into defined
percentages).

Calculation :
Median for ungrouped series :
a) In a series with an odd number of untied values, the values in the series are
arranged from lowest to highest, and the value that divides the series in half is the
median.
b) In a series with an even number of untied values, the two values that divide the
series in half are determined, and the arithmetic mean of these values is the median.
c) An alternative method for calculating the median is to determine the 50% value
on a cumulative frequency curve.

d) If the data include tied scores at the median point, interpolation
within the tied scores is necessary.
• Let us consider the series 70, 73, 74, 75, 75, 75, 75, 80, in which the midpoint
observations are tied.
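One common way to interpolate within tied scores can be sketched as follows. The slide's own worked solution is not included here, so this is an assumed convention: each score is treated as occupying an interval of width 1 centred on it, so the four scores of 75 are spread evenly across 74.5 to 75.5.

```python
def interpolated_median(values, interval=1.0):
    """Median with linear interpolation inside the tied scores at the midpoint."""
    data = sorted(values)
    n = len(data)
    mid_value = data[n // 2]                       # score at the median position
    lower_limit = mid_value - interval / 2         # lower boundary of that score
    below = sum(1 for x in data if x < mid_value)  # observations below the ties
    tied = data.count(mid_value)                   # number of tied observations
    return lower_limit + interval * (n / 2 - below) / tied


series = [70, 73, 74, 75, 75, 75, 75, 80]
print(interpolated_median(series))   # 74.75
```

Three scores lie below 74.5, so the median needs one more of the four tied scores, which places it one quarter of the way into the interval.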

MODE
• The mode of a data set is that value that occurs with the greatest frequency.
• Whenever there are two non-adjacent scores with the same frequency and
they are the highest in the distribution, each score may be referred to as the
‘mode’ and the distribution is ‘bimodal’.
• In truly bimodal distribution, the population contains two sub-groups, each
of which has a different distribution that peaks at a different point.
• Calculation (an empirical relation that holds approximately for moderately skewed, unimodal distributions):
Mode = Mean – 3 [ Mean – Median ]
     = 3 Median – 2 Mean
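A short Python sketch of both ideas: finding the mode by counting, and estimating it from the mean and median with the empirical relation. The function names and the mean and median values passed to the second function are hypothetical.

```python
from collections import Counter


def modes(values):
    """All values that occur with the greatest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


def empirical_mode(mean, median):
    """Approximate mode from the relation Mode = 3 Median - 2 Mean."""
    return 3 * median - 2 * mean


print(modes([70, 73, 74, 75, 75, 75, 75, 80]))   # [75]
print(modes([1, 1, 2, 5, 5]))                    # [1, 5], a bimodal series
print(empirical_mode(mean=50.0, median=52.0))    # 56.0
```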

MEASURES OF DISPERSION
• To understand the data more completely, it is necessary to know how the members
of the data set arrange themselves about the central or typical value.
• The following questions must be answered:
1. How spread out are the data points?
2. How stable are the values in the group?
Based on percentiles:
• Percentile
• Range
• Inter-quartile range
Based on mean:
• Mean deviation
• Standard deviation
• Variance
• Coefficient of variation

RANGE
• The range is the difference between the highest and
lowest values in a series.
Range = Maximum – Minimum.
• For example, in the following series:
8, 8, 10, 10, 10, 12, 13, 14, 15, 16, 58
Range = 58 – 8 = 50 min

PERCENTILE
• A percentile is the point below which a stated percentage of the observations
lie when all of the observations are ranked in ascending order.
• The median is the 50th percentile.
• The 75th percentile is the point below which 75% of the
observations lie, while the 25th percentile is the point below which
25% of the observations lie.

INTER-QUARTILE RANGE
• The range of a variable between first quartile and the third
quartile is called inter-quartile range.
• Interquartile range = Q3 – Q1
• Median is the second quartile.
• Half of the interquartile range, (Q3 – Q1) / 2, is called the semi-interquartile range or
sometimes the quartile deviation, which is a measure of
dispersion around the median.
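The quartiles, interquartile range and semi-interquartile range can be sketched with Python's standard `statistics` module. The data are hypothetical. Several conventions exist for locating quartiles, and the "inclusive" method used here is only one of them, so other software may give slightly different cut points.

```python
import statistics

data = [2, 4, 6, 8, 10, 12, 14, 16, 18]   # hypothetical observations

# n=4 asks for the three cut points that divide the series into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1          # interquartile range
semi_iqr = iqr / 2     # semi-interquartile range (quartile deviation)

print(q1, q2, q3)      # 6.0 10.0 14.0
print(iqr, semi_iqr)   # 8.0 4.0
```

The second quartile equals the median of the series, as stated above.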

MEAN DEVIATION
• Because the mean has several advantages, it might seem logical to
measure dispersion by taking the “average deviation” from the mean.
That proves to be useless, because the sum of the deviations from the
mean is 0.
• However, this inconvenience can easily be solved by computing the
mean deviation, which is the average of the absolute value of the
deviations from the mean, as shown in the following formula:
Mean deviation = $\frac{\sum |X - \bar{X}|}{n}$
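A minimal Python sketch, using hypothetical data, shows both points: the plain deviations from the mean sum to zero, while their absolute values give a usable measure.

```python
def mean_deviation(values):
    """Average of the absolute deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


data = [2, 4, 6, 8, 10]                    # hypothetical observations
mean = sum(data) / len(data)
print(sum(x - mean for x in data))         # 0.0, signed deviations cancel out
print(mean_deviation(data))                # 2.4
```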

VARIANCE
• The variance is the sum of the squared deviations from the mean divided by the
number of values in the series minus 1.
• Variance is symbolized by $S^2$ or V.
$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$, where $\sum (X - \bar{X})^2$ is called the sum of squares.
• Dividing by $n - 1$ (called the degrees of freedom), instead of dividing by $n$, is necessary
for the sample variance to be an unbiased estimator of the population variance.
• The numerator of the variance (i.e., the sum of the squared deviations of the
observations from the mean) is an extremely important entity in statistics. It is usually
called either the sum of squares (abbreviated SS) or the total sum of squares.
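The sum of squares and the sample variance can be sketched in Python as follows. The data are hypothetical and the function names are illustrative.

```python
def sum_of_squares(values):
    """Sum of the squared deviations of the observations from their mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_variance(values):
    """Sum of squares divided by the degrees of freedom, n - 1."""
    return sum_of_squares(values) / (len(values) - 1)


data = [2, 4, 6, 8, 10]          # hypothetical observations
print(sum_of_squares(data))      # 40.0
print(sample_variance(data))     # 10.0
```

Python's built-in `statistics.variance` uses the same n - 1 divisor, so it can serve as a cross-check.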

STANDARD DEVIATION
• The standard deviation is a measure of the variability among the
individual values within a group.
• Loosely defined, it is a description of the average distance of
individual observations from the group mean.
• From one point of view, however, s is similar to the mean; that is,
it is the square root of the mean of the squared deviations: $s = \sqrt{S^2}$.

• Taking the mean and the standard deviation together, a sample can be described
in terms of its average score and in terms of its average variation.
• If more samples were taken from the same population it would be possible to
predict with some accuracy the average score of these samples and also the
amount of variation.
• The mathematical derivation of the standard deviation is presented here in some
detail because the intermediate steps in its calculation (1) create a theme (called
“sum of squares”) that is repeated over and over in statistical arithmetic and
(2) create the quantity known as the sample variance.

• The standard deviation is reported along with the sample mean, usually
in the following format: mean ± SD.
• This format serves as a pertinent reminder that the SD measures the
variability of values surrounding the middle of the data set.
• It also leads us to the practical application of the concepts of mean and
standard deviation shown in the following rules of thumb:
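$\bar{X}$ ± 1 SD encompasses approximately 68% of the values in a group.
$\bar{X}$ ± 2 SD encompasses approximately 95% of the values in a group.
$\bar{X}$ ± 3 SD encompasses approximately 99.7% of the values in a group.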

• These rules of thumb are useful when deciding whether to report
the mean ± SD or the median and range as the appropriate
descriptive statistics for a group of data points.
• If roughly 95% of the values in a group are contained in the
interval $\bar{X}$ ± 2 SD, researchers tend to use mean ± SD. Otherwise,
the median and the range are perhaps more appropriate.
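The check described above can be sketched in Python. The helper name `share_within` and the data are hypothetical. With very small samples the shares cannot match 68% and 95% closely, so the sketch only illustrates the mechanics.

```python
import statistics


def share_within(values, k):
    """Fraction of the values lying inside mean ± k sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample SD, divides by n - 1
    low, high = mean - k * sd, mean + k * sd
    return sum(low <= x <= high for x in values) / len(values)


data = [2, 4, 6, 8, 10]                    # hypothetical observations
print(share_within(data, 1))               # 0.6
print(share_within(data, 2))               # 1.0
```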

Applications and characteristics
1. The standard deviation is extremely important in sampling theory, in correlational
analysis, in estimating reliability of measures, and in determining relative position of an
individual within a distribution of scores and between distributions of scores.
2. The standard deviation is the most widely used estimate of variation because of its
known algebraic properties and its amenability to use with other statistics.
3. It also provides a better estimate of variation in the population than the other indexes.

4. When the standard deviation of a sample is small, the individual values lie
close to the sample mean.
5. When the standard deviation of a random sample is small, the sample mean is
likely to be close to the mean of all the data in the population.
6. The standard deviation does not systematically decrease when the sample size
increases; it is the standard error of the mean that decreases.

COEFFICIENT OF VARIATION
• The coefficient of variation is the ratio of the standard deviation of a series to
the arithmetic mean of the series.
• The coefficient of variation is unitless and is expressed as a percentage.
Application and characteristics
The coefficient of variation is used to compare the relative variation, or spread,
of the distributions of different series, samples, or populations or of the
distributions of different characteristics of a single series.

Calculation:
• The coefficient of variation (CV) is calculated as CV (%) = (SD / $\bar{X}$) × 100
• For example,
In a typical medical school, the mean weight of 100 fourth-year medical
students is 140 lb, with a standard deviation of 28 lb.
CV (%) = (28 / 140) × 100 = 20%
The coefficient of variation for weight is 28 lb divided by 140 lb, or 20%.
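The worked example can be reproduced with a short Python sketch. The function name is illustrative; the numbers are those of the medical student example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * sd / mean


print(coefficient_of_variation(sd=28, mean=140))   # 20.0
```

Because the units cancel, the same function can compare the spread of variables measured on different scales, such as weight in pounds and height in inches.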

NORMAL DISTRIBUTION
• The normal distribution, also called the Gaussian distribution, is a continuous,
symmetric, bell-shaped distribution and can be defined by a number of
measures.
• The majority of measurements of continuous data in medicine and
biology tend to approximate this theoretical distribution, which is
named after Carl Friedrich Gauss, the person who best described it.

• The normal distribution is one of the most frequently used distributions in biomedical and dental
research.
• The normal distribution is a population frequency distribution.
• It is characterized by a bell-shaped curve that is unimodal and is symmetric around the mean of the
distribution.
• The normal curve depends on two parameters: the population mean and the population standard
deviation.
• In order to discuss the area under the normal curve in terms of easily seen percentages of the
population distribution, the normal distribution has been standardized to the standard normal distribution, in
which the population mean is 0 and the population standard deviation is 1.
• The area under the normal curve can be segmented starting with the mean in the center (on the x
axis) and moving by increments of 1 SD above and below the mean.

Figure shows a standard normal distribution (mean = 0; SD= 1) and the
percentages of area under the curve at each increment of SD.

• The total area beneath the normal curve is 1, or 100% of the observations in the
population represented by the curve.
• As indicated in the figure, the portion of the area under the curve between the
mean and 1 SD above the mean is 34.13% of the total area.
• The same area is found between the mean and 1 SD below the mean.
• Moving to 2 SD above the mean cuts off an additional 13.59% of the area,
and moving to a total of 3 SD above the mean cuts off another 2.14%.

• The theory of the standard normal distribution leads us, therefore, to the following
property of a normally distributed variable:
Approximately 68.26% of the observations lie within 1 SD of the mean.
Approximately 95.44% of the observations lie within 2 SD of the mean.
• Virtually all of the observations are contained within 3 SD of the mean. This is the
justification used by those who label values outside of the interval $\bar{X}$ ± 3 SD as
“outliers” or unlikely values.
• Incidentally, the number of standard deviations that a value lies away from the mean is called its z
score.
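The areas under the standard normal curve and the z score can be sketched in Python with the error function from the standard `math` module. The function names and the weight of 168 lb are hypothetical. The computed percentages may differ in the last digit from figures obtained by adding rounded table entries.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def z_score(x, mean, sd):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mean) / sd


for k in (1, 2, 3):
    within = normal_cdf(k) - normal_cdf(-k)
    print(k, round(100 * within, 2))
# 1 68.27
# 2 95.45
# 3 99.73

print(z_score(168, mean=140, sd=28))   # 1.0
```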

MEASURES OF ASYMMETRY
• Skewness
• Kurtosis

SKEWNESS
A horizontal stretching of a frequency distribution to one side or
the other, so that one tail of observations is longer and has more
observations than the other tail, is called skewness.

• If a distribution is skewed, the mean moves farther in the direction of the
long tail than does the median, because the mean is more heavily
influenced by extreme values.

KURTOSIS
• It is characterized by a vertical stretching of the frequency distribution.
• It is the measure of the peakedness of a probability distribution.
• As shown in the figure, a kurtotic distribution could look more peaked or
could look more flattened than the bell-shaped normal distribution.
• A normal distribution has zero excess kurtosis (a kurtosis of 3).

• Any distribution with kurtosis = 3 (zero excess kurtosis) is called mesokurtic.
• In a leptokurtic distribution, the central peak is higher & sharper, and the tails are longer & fatter.
• In a platykurtic distribution, the central peak is lower & broader, and the tails are shorter & thinner.
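Skewness and kurtosis can be sketched in Python from the central moments of the data. The moment-based formulas below (third and fourth central moments scaled by powers of the second) are one common convention and are an assumption here, since the slides give no formula. The data sets are hypothetical.

```python
def central_moment(values, k):
    """Average of the k-th powers of the deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Third central moment divided by the 1.5 power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment divided by the square of the second (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]             # hypothetical, no long tail
right_skewed = [1, 1, 2, 2, 3, 10]      # hypothetical, long tail to the right

print(skewness(symmetric))              # 0.0
print(skewness(right_skewed) > 0)       # True
print(round(kurtosis(symmetric), 2))    # 1.7, below 3, so platykurtic
```

Subtracting 3 from the kurtosis gives the excess kurtosis, which is zero for a normal distribution.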

MEASURES OF RELATIONSHIP
Correlation :
• This is used to assess the relationship between two continuous
variables within a group of subjects.
• This is used for quantifying any association between two
continuous variables. But it does not prove that one particular
variable alone causes the change in the other.

Correlation coefficient :
• This is a measure of the degree of straight-line association
between two continuous variables.
• It is denoted by ‘r’, which may vary from -1 to +1.
• This can be of 5 types:
r = +1 [ perfect positive correlation ]
r = -1 [ perfect negative correlation ]
r = 0 [ no correlation ]
0 < r < 1 [ partially positive correlation ]
-1 < r < 0 [ partially negative correlation ]
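A minimal Python sketch of the correlation coefficient, assuming the slides refer to Pearson's product-moment coefficient (the formula itself is not given on the slide). The paired values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # 1.0, perfect positive correlation
print(pearson_r(x, [10, 8, 6, 4, 2]))   # -1.0, perfect negative correlation
print(pearson_r(x, [2, 1, 4, 3, 5]))    # 0.8, partially positive correlation
```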

CONCLUSION
• In conclusion, the best research studies are
initiated with a statistical plan already created.
• This plan may or may not have been developed with the assistance of a
statistician.
• The first step of data analysis is usually to describe the sample and then
the subgroups within the sample. Frequency distribution, mean, median,
mode, range and standard deviation are the most commonly used
statistics for accomplishing this task.
• This information can also be used as a background for the discussion
regarding inferential statistics.

REFERENCES:
• SANJEEV B. SARMUKADDAM, FUNDAMENTALS OF BIOSTATISTICS, 1st EDITION, NEW DELHI, JITENDRA.P, 2006
• JOHN W. BEST AND JAMES V. KAHN, RESEARCH IN EDUCATION, 9th EDITION, NEW DELHI, ASOKE K. GHOSH, 2006
• JAY S. KIM AND RONALD J. DAILEY, BIOSTATISTICS FOR ORAL HEALTH CARE, 1st EDITION, NEW DELHI, BLACKWELL, 2008
• C. R. KOTHARI, RESEARCH METHODOLOGY, 2nd EDITION, NEW DELHI, NEW AGE INTERNATIONAL LIMITED, 2004
• RONALD N. FORTHOFER, INTRODUCTION TO BIOSTATISTICS, LONDON, ACADEMIC PRESS, 1995
• BRATATI BANERJEE, MAHAJAN’S METHODS IN BIOSTATISTICS, 9th EDITION, NEW DELHI, JAYPEE BROTHERS, 2018
• F GAO SMITH AND J E SMITH, CLINICAL RESEARCH, 2nd EDITION, UK, BIOS SCIENTIFIC PUBLISHERS LIMITED, 2005
• JAMES F. JEKEL, EPIDEMIOLOGY, BIOSTATISTICS AND PREVENTIVE MEDICINE, 3rd EDITION, SAUNDERS, ELSEVIER PUBLICATIONS, 2007
• CHERYL BAGLEY THOMPSON, ‘DESCRIPTIVE DATA ANALYSIS’, AIR MEDICAL JOURNAL, 2009, VOLUME 28 [ 2 ] : 56 - 59
