BIOSTATISTICS

Murtaza Kamal
Dec 22, 2018
murtaza.vmmc@gmail.com


Points to be covered…
• Definition of biostatistics
• Data and its types: qualitative/quantitative
• Variable and its types
• Mean/median/mode
• Normal curve
• Sensitivity (Sn)/specificity (Sp)/predictive value (PV)
• Sample and its types, and calculation of sample size

Statistics
Science dealing with methods of data collection, compilation, tabulation and analysis to provide meaningful and valid interpretation

Bio-statistics
Application of statistical methods in the fields of biology, public health and medicine

Remember this…
• Quality of clinical and health-planning decisions depends on the quality of the information on which they are based
• Medicine: a science in which chance plays a very significant role
• Statistics helps to quantify the contribution of chance and helps the individual clinician make valid diagnostic, prognostic or therapeutic decisions
• Helps programme managers and policy planners to plan, monitor and evaluate public health initiatives

Datum: Latin "fact"
Collection of processed information
Sources of data:
Primary data: Collected and recorded by the investigators themselves through observation, interviews or measuring instruments, usually systematically and for defined purposes
Secondary data: Collected by somebody else or for other purposes, e.g. information derived from hospital data

Variable
• The choice of statistical test depends on the kind of variable studied
• An attribute, quality, characteristic or property of the persons or things being studied that can be quantitatively measured or enumerated
• Varies from person to person, or from time to time in the same person
• E.g. height, weight, age, gender, blood pressure, pulse rate, smoking status

Independent (stimulus/explanatory) variable
• A variable that is manipulated or applied by the investigator, or that explains the outcome
• E.g. maternal age, age at marriage, spacing between successive pregnancies, pre-pregnancy weight, weight gain during pregnancy

Dependent (outcome/response) variable
• The resulting response or behaviour that is observed on exposure to the independent variable, e.g.

Independent      Dependent
Maternal age     Birth weight
Birth spacing    Birth weight
PIH              Perinatal mortality

Qualitative data
Nominal: mutually exclusive, unordered categories
E.g. Blood group: A, B, AB, O
Marital status: Unmarried, Married, Divorced, Widowed
Ordinal: mutually exclusive, ordered categories
E.g. Disease severity: Mild, Moderate, Severe

Quantitative data
Discrete: Often represents counts
e.g. Number of children
Number of times admitted to hospital in the
last 5 years
Continuous: Can take any value within a range of
values
e.g. Height in cm
Weight in kg
Distance from home to work in km

Importance of data type
Type of data: Critically important in determining which
methods of analysis will be appropriate and valid

Types of Statistics
• Descriptive statistics
  • Describe the basic features of the data in a study
  • Provide summaries about the sample
• Inferential statistics
  • Investigate questions, models, and hypotheses
  • Infer population characteristics based on a sample
  • Make judgments about what we observe

Descriptive Statistics
Univariate analysis (one variable at a time)
• Qualitative data:
  • Proportions or percentages
• Quantitative data:
  • Central tendencies (Mean, Median, Mode)
  • Measures of dispersion (Range, Standard deviation, Coefficient of variation, Percentiles, Interquartile range)
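The summary measures listed above can be computed with Python's standard library. This is a minimal sketch; the blood pressure readings are hypothetical values invented for illustration, not data from this presentation.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg), invented for illustration.
readings = [110, 118, 120, 120, 124, 130, 138, 160]

# Central tendencies
mean = statistics.mean(readings)
median = statistics.median(readings)
mode = statistics.mode(readings)

# Measures of dispersion
value_range = max(readings) - min(readings)
sd = statistics.stdev(readings)                    # sample SD (n - 1 denominator)
cv = 100 * sd / mean                               # coefficient of variation, in percent
q1, q2, q3 = statistics.quantiles(readings, n=4)   # quartile cut points
iqr = q3 - q1                                      # interquartile range

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, SD={sd:.2f}, CV={cv:.1f}%, IQR={iqr}")
```

Quartile values depend on the interpolation convention used; `statistics.quantiles` defaults to the "exclusive" method, so other software may report slightly different percentiles for the same data.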

Inferential Statistics
Two approaches
1. Estimating parameters: the process of using sample information to draw conclusions about the value of a population parameter, e.g. proportion, mean, SD, correlation
2. Testing hypotheses

Estimating Parameters
1. Point estimates: proportion, mean, SD, correlation
2. Interval estimates:
• Confidence intervals: define an upper limit and a lower limit with an associated probability
• The ends of a confidence interval are called confidence limits
• 95% confidence interval: computed by a method that captures the population mean in 95% of repeated samples
• 99% confidence interval: computed by a method that captures the population mean in 99% of repeated samples
• A wider range of values must be included for greater confidence
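To see why greater confidence needs a wider interval, here is a minimal sketch of a confidence interval for a mean using the normal approximation (mean ± z × SD/√n). The function name and the haemoglobin values are hypothetical, chosen only for illustration; for small samples a t-based multiplier would normally replace z.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation CI for the mean: mean +/- z * SD / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
    # z multiplier for a two-sided interval (about 1.96 for 95%, 2.58 for 99%)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem

# Hypothetical haemoglobin values (g/dL), invented for illustration.
sample = [11.2, 12.5, 13.1, 12.8, 11.9, 13.4, 12.2, 12.7, 13.0, 12.1]

for level in (0.95, 0.99):
    lower, upper = mean_confidence_interval(sample, level)
    print(f"{level:.0%} CI: {lower:.2f} to {upper:.2f}")
```

The 99% interval always contains the 95% interval computed from the same sample, because only the multiplier z changes.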

Hypothesis Testing
• Permits generalizations from a sample to the population from which it came
• Steps in hypothesis testing
1. State the research question in terms of a statistical hypothesis
2. Decide on an appropriate test statistic
3. Select the level of significance
4. Determine the value the test statistic must attain to be declared significant
5. Perform the calculations
6. Draw and state conclusions

Step 1: State the research question in terms of a statistical hypothesis
Null hypothesis, Ho: statement that there is no difference or relationship
If the data show a difference or relationship, Ho is rejected
If they do not, Ho is retained (not accepted!!)
Alternative hypothesis, Ha: disagrees with Ho

Step 2: Decide on appropriate test statistic
• Statistics whose primary use is in testing hypotheses are called test statistics
• Parametric and non-parametric tests

Parametric Statistics
• Some assumptions must be met before a particular test of significance can be applied to a set of data
• Sample measurements are drawn in a random manner from a normally distributed population of measurements
• Parametric tests include Student's t test (paired and unpaired), the F test for analysis of variance, and correlation and regression analyses

Normal distribution
• Many naturally occurring events follow a pattern with:
  • Many observations clustered around the mean
  • Few observations with values away from the mean
• This bell-shaped curve is called the normal distribution, also known as the Gaussian distribution after the mathematician Gauss

Normal distribution
• Normal distribution: the symmetrical clustering of values around a central location
• Normal curve: the bell-shaped curve that results when a normal distribution is graphed

Normal Distribution
• Developed by Carl Friedrich Gauss (1777-1855); hence also called the ‘Gaussian distribution’
• Called ‘normal’ because many continuous variables in biology and other sciences follow this particular distribution

The Normal Distribution: Properties
[Figure: normal curve, f(x) plotted against x from -3 to 3]
1- Symmetric about the mean
2- Mean, median and mode coincide (Mean = Median = Mode)
3- Area property
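The area property is usually stated as the share of observations lying within 1, 2 and 3 standard deviations of the mean (roughly 68%, 95% and 99.7%). The slide does not give these figures, so the sketch below computes them from the standard normal distribution; the helper name `area_within` is an assumption made for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the normal curve between -k and +k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.1%}")
```

Because the curve is symmetric about the mean, the area outside ±k SD is split equally between the two tails.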

Skewed (or Asymmetric) Data
When the left and right sides of a frequency distribution do not approximate mirror images, the data are said to be skewed or asymmetrical
Curve A: negative skew (Mean < Median < Mode)
Curve B: positive skew (Mean > Median > Mode)
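A small numerical check of how skew orders the three averages: with a long right tail (positive skew) the mean is pulled above the median and mode, and mirroring the data reverses the order. The length-of-stay values are hypothetical, invented for illustration.

```python
import statistics

# Hypothetical hospital length-of-stay values (days) with a long right tail.
positive_skew = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

# Mirror the values around 16 to obtain a long left tail instead.
negative_skew = [16 - x for x in positive_skew]

def averages(values):
    """Return (mean, median, mode) for a list of numbers."""
    return statistics.mean(values), statistics.median(values), statistics.mode(values)

pos_mean, pos_median, pos_mode = averages(positive_skew)
neg_mean, neg_median, neg_mode = averages(negative_skew)

print("positive skew:", pos_mean, pos_median, pos_mode)   # mean > median > mode
print("negative skew:", neg_mean, neg_median, neg_mode)   # mean < median < mode
```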

Non-Parametric Statistics or Distribution-Free Methods
• Suitable alternative, particularly when data are in the form of ranks or counts
• Chi-squared test: the most commonly employed non-parametric test
• Others: Wilcoxon rank sum, Mann-Whitney U or median test, Kruskal-Wallis one-way and Friedman two-way analysis of variance

Step 3: Select the level of significance for the statistical test
• p value: related to the hypothesis test
• Probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true, i.e. by chance alone
• Calculated after the test has been performed
• A small p value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis
• p = 5% is not a rule written in stone
  • More generous (p = 0.1)
  • More strict (p = 0.01)

Example
Study of the effects of anticonvulsant therapy on serum calcium concentration in the elderly
• Group of treated patients
• Group of untreated patients
Outcome variable: serum calcium concentration
Independent variable: anticonvulsant therapy
Null hypothesis: both groups (treated and untreated) have the same mean serum calcium concentration
Test of significance: t test
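As a rough sketch of the unpaired Student t test named in this example, the code below computes the pooled-variance t statistic for two independent groups. The serum calcium values are hypothetical numbers invented for illustration, not results from the study, and the function name `student_t` is likewise an assumption.

```python
import math
import statistics

def student_t(group_a, group_b):
    """Unpaired Student t statistic (pooled variance) and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variances (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    se = math.sqrt(pooled * (1 / n_a + 1 / n_b))   # standard error of the difference
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, n_a + n_b - 2

# Hypothetical serum calcium concentrations (mmol/L), invented for illustration.
treated = [2.1, 2.2, 2.3, 2.2, 2.0, 2.2]
untreated = [2.4, 2.5, 2.3, 2.4, 2.6, 2.4]

t, df = student_t(treated, untreated)
print(f"t = {t:.3f} with {df} degrees of freedom")
# For df = 10 the two-sided 5% critical value from t tables is 2.228,
# so |t| above that would lead to rejecting the null hypothesis.
```

The pooled form assumes both groups have similar variances, which is one of the parametric assumptions mentioned earlier.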

Case-control study (retrospective)
[Figure: timeline showing the onset of the study and the direction of inquiry; cases and controls are each classified as exposed or unexposed]

Case-control study: Outbreak of Diarrheal Disease at a Resort Club

Ate raw hamburger   Cases     Controls   Total
Yes                 17 (a)    7 (c)      24
No                  20 (b)    26 (d)     46
Total               37        33         70

Cross product ratio:
OR = (odds of exposure in cases) / (odds of exposure in controls) = (a × d) / (b × c) = (17 × 26) / (20 × 7) = 3.2
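The cross product ratio can be checked with a few lines of code. This sketch uses the slide's own cell labels (a = exposed cases, b = unexposed cases, c = exposed controls, d = unexposed controls); the function name is an assumption made for illustration.

```python
def odds_ratio(a, b, c, d):
    """Cross product ratio (a * d) / (b * c).

    a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    """
    return (a * d) / (b * c)

# Counts from the resort club outbreak table.
odds_cases = 17 / 20       # odds of exposure among cases
odds_controls = 7 / 26     # odds of exposure among controls
estimate = odds_ratio(a=17, b=20, c=7, d=26)

print(f"odds in cases = {odds_cases:.2f}, odds in controls = {odds_controls:.2f}")
print(f"odds ratio = {estimate:.1f}")
```

Dividing the two odds gives the same value as the cross product, which is why the measure is also called the cross product ratio.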


ODDS RATIO & RR
• Odds ratio is a ratio of two odds
• Relative risk is a ratio of two probabilities

THREE KEY MEASURES OF
VALIDITY
1. SENSITIVITY
2. SPECIFICITY
3. PREDICTIVE VALUE

Outcomes of a Screening/Diagnostic Test

                            True Disease Status
Screening/Diagnostic Test   Positive               Negative               Total
Positive                    True Positives (TP)    False Positives (FP)   TP+FP
Negative                    False Negatives (FN)   True Negatives (TN)    FN+TN
Total                       TP+FN                  FP+TN                  TP+FP+FN+TN

What is used as a “gold standard”?
1. Most definitive diagnostic procedure
e.g. microscopic examination of a tissue
specimen
2. Best available laboratory test
e.g. polymerase chain reaction (PCR)
for HIV virus
3. Comprehensive clinical evaluation
e.g. clinical assessment of arthritis

                   True Disease Status
Screening test     Cases                 Non-cases             Total
Positive           a (true positive)     b (false positive)    a + b
Negative           c (false negative)    d (true negative)     c + d
Total              a + c                 b + d

Sensitivity = True positives / All cases = a / (a + c)
Specificity = True negatives / All non-cases = d / (b + d)

                   True Disease Status
Screening test     Cases     Non-cases    Total
Positive           140       1,000        1,140
Negative           60        19,000       19,060
Total              200       20,000

Sensitivity = True positives / All cases = 140 / 200 = 70%
Specificity = True negatives / All non-cases = 19,000 / 20,000 = 95%

Interpreting test results: predictive value
Probability (proportion) of those tested who are correctly classified
PPV = Cases identified / all positive tests
NPV = Non-cases identified / all negative tests


• Positive predictive value: probability that subjects with a positive screening test truly have the disease
• Negative predictive value: probability that subjects with a negative screening test truly do not have the disease

                   True Disease Status
Screening test     Cases                 Non-cases             Total
Positive           a (true positive)     b (false positive)    a + b
Negative           c (false negative)    d (true negative)     c + d
Total              a + c                 b + d

PPV = True positives / All positives = a / (a + b)
NPV = True negatives / All negatives = d / (c + d)

                   True Disease Status
Screening test     Cases     Non-cases    Total
Positive           140       1,000        1,140
Negative           60        19,000       19,060
Total              200       20,000

PPV = True positives / All positives = 140 / 1,140 = 12.3%
NPV = True negatives / All negatives = 19,000 / 19,060 = 99.7%
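All four validity measures come from the same 2 × 2 table, so they can be computed together. In this sketch a = true positives, b = false positives, c = false negatives and d = true negatives, following the cell labels used on the slides; the function name is an assumption made for illustration.

```python
def screening_metrics(a, b, c, d):
    """Validity measures from a 2 x 2 screening table.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "sensitivity": a / (a + c),   # true positives / all cases
        "specificity": d / (b + d),   # true negatives / all non-cases
        "ppv": a / (a + b),           # true positives / all positive tests
        "npv": d / (c + d),           # true negatives / all negative tests
    }

# Counts from the worked example on the slides.
metrics = screening_metrics(a=140, b=1000, c=60, d=19000)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity are read down the columns of the table (true disease status), while the predictive values are read across the rows (test result).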

Positive predictive value, sensitivity, specificity, and prevalence

Prevalence (%)   PV+ (%)   Se (%)   Sp (%)
0.1              1.4       70       95
1.0              12.3      70       95
5.0              42.4      70       95
50.0             93.3      70       95
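The dependence of positive predictive value on prevalence follows from Bayes' theorem: PPV = Se × prevalence / (Se × prevalence + (1 - Sp) × (1 - prevalence)). The slides do not state this formula explicitly, so the sketch below is offered as one way to reproduce the table; the function name is an assumption.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """PPV from prevalence, sensitivity and specificity (all as proportions)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Same test (Se = 70%, Sp = 95%) applied at different prevalences.
for prevalence in (0.001, 0.01, 0.05, 0.5):
    ppv = positive_predictive_value(prevalence, sensitivity=0.70, specificity=0.95)
    print(f"prevalence {prevalence:.1%}: PPV {ppv:.1%}")
```

At exactly 1% prevalence the formula gives about 12.4%; the 12.3% in the table matches the worked example, where prevalence is 200/20,200 (about 0.99%).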

Why Sampling?
• Study populations are large: it is not possible to reach every member (practically infeasible and costly)
• Ideally: study everyone, so that findings can be generalized to the study population
• We cannot study everyone because of limited time and resources
• If we cannot study everyone, we sample the population, but in a manner that lets us generalize the findings

Sampling Process
1. Defining the population
2. Developing a sampling frame
3. Determining sample size
4. Specifying sample method
5. Selecting the sample

Sample Size Determination
• In any scientific study, the most commonly asked question is: what should the minimum sample size be?
• If the sample is too small: fails to detect a true difference
• If the sample is exceedingly large:
  • Wastes time and money
  • Will report the tiniest relation/difference as significant
• Sample size should be calculated at the planning stage
• Neither too small nor too large

Sample size calculation
Need to know the following:
• Estimated prevalence/SD
• Confidence level (95%)
• Power: ability to find significance when two groups are really different (80%)
• Allowable error or precision (5-10%)

Sample Size for Qualitative Outcome Variable
n = 4PQ / L^2
• P = Estimated prevalence (percentage)
• Q = 1 - P (100 - P when P is expressed as a percentage)
• L = Allowable error

Definition
• P = Estimated prevalence (percentage)
  • From pilot study, published papers, experience
• Q = 1 - P (100 - P when P is expressed as a percentage)
• L, Q and P are in the same unit

L: Allowable Error
• Suppose a survey aims to estimate the true prevalence of a disease in a population
• The estimate we get from the survey will be within +/- L% of the true prevalence
[Figure: interval from -L to +L around the true prevalence]

Example
• A survey is to estimate the prevalence of influenza virus infection in school kids
• Suppose the available evidence suggests that approximately 20% (P = 20) of the children will have antibodies to the virus
• Assume the investigator wants to estimate the prevalence within 6% of the true value (6% is called the allowable error, L)

Example
• The required sample size is n = 4PQ / L^2 = (4 x 20 x 80) / (6 x 6) = 177.78
• Thus approximately 180 kids would be needed for the survey
Note: population size is not involved in the formula
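The calculation above is easy to script. In this sketch P and L are percentages, as in the example, and the factor 4 is the usual rounding of z² (1.96² ≈ 3.84) for 95% confidence; the function name is an assumption made for illustration.

```python
import math

def sample_size_proportion(p_percent, allowable_error):
    """n = 4PQ / L^2, with P, Q and L all expressed as percentages."""
    q_percent = 100 - p_percent
    return 4 * p_percent * q_percent / allowable_error ** 2

# Influenza survey example: P = 20%, L = 6%.
n = sample_size_proportion(p_percent=20, allowable_error=6)
print(round(n, 2))    # 177.78
print(math.ceil(n))   # 178, the smallest whole number of children meeting the target
```

Rounding up gives 178; the slide's figure of approximately 180 simply rounds to a convenient number.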

Sample Size for Estimation of the Mean (Quantitative Outcome Variable)
n = 4S^2 / L^2
• S = Standard deviation of the parameter
• S and L are in the same unit
• The average we find in the survey will be within +/- L of the true average
[Figure: interval from -L to +L around the true average]

Example
• Suppose an investigator has some evidence suggesting that the standard deviation of rat weight is about 455 g
• He wishes to provide an estimate within 80 g of the true average (80 g is the allowable error, L)

Example
• The required sample size is n = 4S^2 / L^2 = 4 x (455)^2 / (80)^2 = 129.39
• Thus approximately 130 rats would be needed.
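The same approach works for a quantitative outcome. This sketch applies n = 4S²/L² to the rat weight example, with S and L in the same unit (grams); the function name is an assumption made for illustration.

```python
import math

def sample_size_mean(sd, allowable_error):
    """n = 4 * S^2 / L^2, with S and L in the same unit."""
    return 4 * sd ** 2 / allowable_error ** 2

# Rat weight example: S = 455 g, L = 80 g.
n = sample_size_mean(sd=455, allowable_error=80)
print(round(n, 2))    # 129.39
print(math.ceil(n))   # 130 rats
```

Halving the allowable error quadruples the required sample size, because L enters the formula squared.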
