Complete Biostatistics (Descriptive and Inferential analysis)

Abdiweli Mohamed Abdi
Biostatistics

Table of contents
Section I
Chapter 1: Introduction to Biostatistics
Chapter 2: Measures of location
Chapter 3: Measures of dispersion
Chapter 4: Collection and organization of data
Chapter 5: Visualization and presentation of data
Chapter 6: Probability and Normal distribution of data
Section II
Chapter 7: Hypothesis and significance testing
Chapter 8: Comparing the significance of two-sample and three-sample means (z-test, t-test and ANOVA)
Chapter 9: Association, correlation and regression
Chapter 10: Estimation

Sunday, September 29, 2024 3
Chapter One: Introduction to Biostatistics
Statistics is a branch of mathematics used for the collection, analysis and interpretation of data.
Biostatistics is a branch of statistics used for the collection, analysis and interpretation of biological data.

Statistics
Use in Health Issues: Biostatistics
Use in Agricultural Sector: Agri-statistics
Use in Business Admin: Business Statistics
Use in Industrial Sector: Industrial Statistics
Use in Insurance: Actuarial Statistics
Use in Economic Sector: Economic Statistics

Biostatistics Types
Descriptive
• Measures of location
  - measures of central tendency
  - measures of other location
• Measures of dispersion
  - range
  - variance
  - standard deviation
  - coefficient of variation
Inferential
• Estimation
  - point estimation
  - interval estimation
• Hypothesis testing
  - z-test, t-test
  - ANOVA, χ2 (chi-square) test
• Correlation and regression

Descriptive Statistics
Describes characteristics of data from a sample.
Ex: Mean, standard deviation, frequency, and percentage.

Inferential Statistics
Ex: the prevalence of malaria among a sample of 150 pregnant women is 40%.
Can we estimate the prevalence of malaria in the population?

Why do medicine and health science students need to learn Biostatistics?
To be able to:
• Conduct research
• Identify health problems
• Monitor and evaluate health programs

Variables
• A variable is any characteristic that differs between individuals, times or places.
• Examples of variables:
  (1) No. of patients (2) Height
  (3) Sex (4) Educational level

Types of Statistical variables
1. Quantitative/Numerical Variable: a characteristic that can be measured in numbers.
Examples:
(i) Family size (ii) No. of patients
(iii) Weight (iv) Height (v) Age

Types of Quantitative Variable
(a) Discrete Variables: quantitative variables that take whole-number values only (no decimals), so there are gaps between possible values.
Examples: family size, number of patients, number of students, parity, gravidity.
(b) Continuous Variables: quantitative variables that can take decimal values, so there are no gaps between possible values.
Examples: height, weight, income, blood sugar level, creatinine level.

2. Qualitative/Categorical Variable: a characteristic whose values can be divided into categories rather than measured in numbers.
Examples: blood type, nationality, students' grades, educational level, etc.

Variable
Type: Qualitative, Quantitative
Nature (quantitative only): Discrete (whole number), Continuous (decimal)
Scales of Measurements
• Nominal Scale: implies name only; no order or rank is involved. E.g. sex, blood type, institutional departments, nationality.
• Ordinal Scale: implies name and order or rank. E.g. educational level, military rank, students' grades.
• Interval Scale: differences between values are meaningful, but zero is arbitrary and does not imply absence of the characteristic. E.g. temperature and pH.
• Ratio Scale: zero implies absence of the characteristic, so both intervals and ratios between two values are meaningful. E.g. height, weight, age, income.

Variables: Types and Scales of measurement
Type: Qualitative (Scale: Nominal, Ordinal)
Type: Quantitative (Scale: Interval, Ratio)

Population
A population is the largest collection of objects (elements or individuals) about which we want to draw conclusions.
Populations may be finite or infinite.

Example: If we are interested in studying the socio-demographic characteristics of students in a class, then our population consists of all the students in that class.

• Population Size (N): The number of
elements in the population is called the
population size and is denoted by N.
• Ex: 100 students in a class.
• Sample: A sample is a part of a population from which we collect the data.
• Ex: 30 students out of 100 students in a class.

Population → Parameter (a value that describes the population)
Sample → Statistic (a value that describes the sample)

Common statistical symbols
Title                            Symbol
Sample mean                      x̄
Population mean                  μ
Sample standard deviation        s
Population standard deviation    σ
Sample variance                  s²
Population variance              σ²
Summation                        Σ
Correlation coefficient          r
Coefficient of determination     r²
Degree of freedom                df
Chi-square value                 χ²
Sample proportion                p
Population proportion            ∏
Null hypothesis                  H0
Alternative hypothesis           H1 or HA
Sample size                      n
Type I error                     α error
Type II error                    β error
Power of the test                1 − β

Chapter Two: Measures of Central tendency and Measures of Other Location
A measure of central tendency is a single value around the center of the data that is used to represent the entire data set.
In short, a measure of central tendency conveys a single piece of information about the entire data set.

• The mean and median are not calculated from qualitative/categorical data; of the three measures, only the mode can be found for categories.
• Measures of central tendency include
I. Mean (average)
II. Median
III. Mode

Mean: the average
Median: the number, or the average of the numbers, in the middle
Mode: the number that occurs most often
Mean
Mean is the average of the data set.
There are four types of mean:
a. Arithmetic mean
b. Harmonic mean
c. Geometric mean
d. Weighted mean

Arithmetic mean
The arithmetic mean is the most familiar measure of central tendency; it is what is usually meant by "average" or "mean".
The arithmetic mean uses the symbol $\bar{x}$ (read as "x-bar").

Arithmetic mean formula:
The sum of all observations divided by the total number of observations.
$\bar{x} = \frac{\sum x}{n}$
where $\sum x$ = sum of all observations, n = total number of observations

Example-1
Suppose the pulse rates of 10 individuals were recorded as:
69, 70, 71, 71, 72, 72, 72, 75, 76, 74
Find the mean.
Solution
$\bar{x} = \frac{\sum x}{n} = \frac{722}{10} = 72.2$
$\bar{x}$ = 72.2 beats/minute

Example-2
The ages of 12 selected school and university students were
19, 18, 14, 13, 22, 25, 13, 22, 12, 18, 14, 16
What is the mean age of the selected students?
$\bar{x} = \frac{\sum x}{n} = \frac{206}{12} = 17.17$
$\bar{x}$ = 17.17 years
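
The two examples above can be checked with a short Python sketch. The function name `arithmetic_mean` is an illustrative choice, not something from the slides; it applies the definition directly (sum of all observations divided by the number of observations). For the age data the twelve values sum to 206.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


# Data from Example-1 (pulse rates) and Example-2 (ages)
pulse_rates = [69, 70, 71, 71, 72, 72, 72, 75, 76, 74]
ages = [19, 18, 14, 13, 22, 25, 13, 22, 12, 18, 14, 16]

print(sum(pulse_rates), arithmetic_mean(pulse_rates))   # sum and mean of pulse rates
print(sum(ages), round(arithmetic_mean(ages), 2))       # sum and mean of ages
```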

Advantages of mean
a) Easy to compute
b) Takes all data values into account
c) Reliable
d) It can be calculated if any value is zero or
negative.
e) Arranging of data is not necessary.
Disadvantages of mean
a) Highly affected by extreme values.
b) Cannot be calculated for qualitative/categorical data.

Median
In an ordered array, the median is the
“middle” number.
If n is odd, the median is the middle
number.
If n is even, the median is the average
of the 2 middle numbers
Not Affected by Extreme Values

Procedure to find Median for Raw data
i. Arrange the data in order
ii. Find the middle value
• for an odd number of values: position = (n+1)/2
• for an even number of values:
  1st middle value: position n/2
  2nd middle value: position (n/2) + 1
  Median = average of the 1st and 2nd middle values

Example:
Data: 4 3 7 4 6
1. Arranged in ascending order: 3 4 4 6 7
2. Since n = 5 is odd, the middle position = (n+1)/2 = (5+1)/2 = 3rd item
The value of the 3rd item = 4
Median = 4.

Example:
x: 4 3 7 4 6 9
Arranged in ascending order:
x: 3 4 4 6 7 9
1st middle item = n/2 = 6/2 = 3rd item
2nd middle item = (n/2) + 1 = 3 + 1 = 4th item
The values of the 3rd and 4th items are 4 and 6
Median = average of 4 and 6 = (4+6)/2 = 5.
Median = 5.
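
The procedure for raw data can be sketched in Python as follows. The function name `median_raw` is hypothetical; the code simply mirrors the odd and even cases described above and assumes a non-empty list.

```python
def median_raw(values):
    """Median of raw data, following the odd/even procedure."""
    ordered = sorted(values)              # step i: arrange in order
    n = len(ordered)
    if n % 2 == 1:
        # odd n: the item at position (n+1)/2 (positions start at 1)
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the items at positions n/2 and n/2 + 1
    first_middle = ordered[n // 2 - 1]
    second_middle = ordered[n // 2]
    return (first_middle + second_middle) / 2


print(median_raw([4, 3, 7, 4, 6]))      # odd number of values
print(median_raw([4, 3, 7, 4, 6, 9]))   # even number of values
```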

Advantages of median
a) Easy to compute.
b) Not influenced by extreme values.
Disadvantages of median
a) Difficult to rank a large number of data values.

Mode
• A Measure of Central Tendency
• Value that Occurs Most Often
• Not Affected by Extreme Values
• There May Not be a Mode
• There May be Several Modes

Example: calculate the mode for this data set
2, 3, 4, 3, 4, 5, 4
Solution
The value 4 occurs three times, more often than any other value, so the mode is 4.
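
A small Python sketch of finding the mode, including the cases where there is no mode or several modes. The function name `modes` is illustrative, and treating "every value occurs once" as "no mode" is an assumption made here for the sketch.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often (empty list if no value repeats)."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 3, 4, 3, 4, 5, 4]))    # one mode
print(modes([1, 1, 2, 2, 3]))          # several modes
print(modes([1, 2, 3]))                # no mode
```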

Advantages of Mode
A) Easy to locate and understand.
B) Not influenced by extreme values.
C) Is an actual value of the data.
Disadvantages of Mode
a) Can’t always locate just one mode.
b) It does not depend on all
observations of the data set.

Measures of Other Location

Percentiles
Percentiles are positional measures that indicate what percentage of the data set has a value less than a specified value when the ordered data is divided into one hundred parts.
Percentiles are not the same as percentages.
Position of $P_r = \frac{r(n+1)}{100}$
where r represents the given percentile and n the sample size

Deciles
Deciles are another positional measure; they indicate how much of the data set has a value less than a specified value when the ordered data is divided into ten parts.
Position of $D_r = \frac{r(n+1)}{10}$
where r represents the given decile and n the sample size

Quartiles
Quartiles are another positional measure; they indicate how much of the data set has a value less than a specified value when the ordered data is divided into four parts.
Position of $Q_r = \frac{r(n+1)}{4}$
• where r represents the given quartile (r=1 for Q1, r=2 for Q2 and r=3 for Q3) and n the sample size

Example
Calculate the 70th percentile, 6th decile and Q3 of the following age data: 28, 17, 12, 25, 26, 19, 13, 27, 21, 16
Percentiles
n = 10
r = 70 (70th percentile)
First, order the data in ascending order:
12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Position of $P_{70} = \frac{r(n+1)}{100} = \frac{70(10+1)}{100} = 7.7$

Position 7.7 lies between the 7th value (25) and the 8th value (26).
To find the exact value we use this formula for fractional positions:
P70 = decimal part × (upper value − lower value) + lower value
= 0.7 × (26 − 25) + 25 = 25.7
P70 = 25.7; this means that 70% of the values lie below 25.7 and 30% of the values lie above 25.7.

Deciles
Data ordered: 12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Question: Find the 6th decile.
Given
n = 10
r = 6
Solution
Position of $D_6 = \frac{r(n+1)}{10} = \frac{6(10+1)}{10} = 6.6$

Position 6.6 lies between the 6th value (21) and the 7th value (25).
Formula for fractional positions:
D6 = decimal part × (upper value − lower value) + lower value
= 0.6 × (25 − 21) + 21 = 23.4
Thus the 6th decile is 23.4.
This means that 60% of the data lie below 23.4.

Quartiles
Data ordered: 12, 13, 16, 17, 19, 21, 25, 26, 27, 28
Question: Find the 3rd quartile.
Given
n = 10
r = 3
Solution
• Position of $Q_3 = \frac{r(n+1)}{4} = \frac{3(10+1)}{4} = 8.25$

Position 8.25 lies between the 8th value (26) and the 9th value (27).
Formula for fractional positions:
Q3 = decimal part × (upper value − lower value) + lower value
= 0.25 × (27 − 26) + 26 = 26.25
Thus Q3 = 26.25.
This means that three quarters (75%) of the data lie below 26.25.
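
The three worked examples share one pattern: compute the position r(n+1)/k, then interpolate between the two neighbouring ordered values. A minimal Python sketch of that pattern follows; the function name `positional_measure` and its `parts` parameter (100 for percentiles, 10 for deciles, 4 for quartiles) are illustrative assumptions, and positions falling outside the data are simply clamped to the smallest or largest value.

```python
def positional_measure(values, r, parts):
    """Value at position r*(n+1)/parts, interpolating between neighbours."""
    ordered = sorted(values)
    n = len(ordered)
    position = r * (n + 1) / parts        # e.g. 70 * 11 / 100 = 7.7
    lower_index = int(position)           # whole part of the position
    fraction = position - lower_index     # decimal part of the position
    if lower_index < 1:
        return ordered[0]                 # position before the first value
    if lower_index >= n:
        return ordered[-1]                # position after the last value
    lower = ordered[lower_index - 1]      # positions start at 1
    upper = ordered[lower_index]
    return fraction * (upper - lower) + lower


ages = [28, 17, 12, 25, 26, 19, 13, 27, 21, 16]
p70 = positional_measure(ages, 70, 100)   # 70th percentile
d6 = positional_measure(ages, 6, 10)      # 6th decile
q3 = positional_measure(ages, 3, 4)       # 3rd quartile
print(round(p70, 2), round(d6, 2), round(q3, 2))
```

Other textbooks and software use slightly different position formulas, so results from statistical packages may differ a little from this (n+1) method.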

Chapter Three
Measures of dispersion

Measures of dispersion (or measures of variation) measure the variability that a set of observations exhibits.
They measure how far the values spread out from each other.
The variation is small when the values are close together.
There is no dispersion (variation) if the values are all the same.

There are several measures of
dispersion, some of which are
1. Range
2.Variance
3.Standard deviation
4.Coefficient of variation

The range
Range is the difference between the largest
value (maximum) and smallest value
(minimum).
Range (R) = Max − Min
Example
Find the range for the sample values:
26,25,35,27,29

Solution
Max=35
Min=25
Range=35-25=10
Notes:
I. The unit of the range is the same as the unit of the data.
II. The range is a poor measure of dispersion because it takes into account only two values (Max and Min).

The Variance
• The variance is one of the most important measures of dispersion.
• The variance is a measure that uses the mean as the point of reference.
• The sample variance is denoted by the symbol $S^2$:
$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

• The population variance is denoted by the symbol $\sigma^2$:
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

Example
We want to compute the sample variance of the following weekly income values (in USD) of sampled health care workers: 10, 21, 33, 53, 54
Solution
n = 5
$\bar{x} = \frac{\sum x}{n} = \frac{10+21+33+53+54}{5} = \frac{171}{5} = 34.2$
Thus $\bar{x}$ = 34.2 USD/week

x          x − x̄                   (x − x̄)²
10         10 − 34.2 = −24.2       585.64
21         21 − 34.2 = −13.2       174.24
33         33 − 34.2 = −1.2        1.44
53         53 − 34.2 = 18.8        353.44
54         54 − 34.2 = 19.8        392.04
Σx = 171   Σ(x − x̄) = 0            Σ(x − x̄)² = 1506.8

$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{1506.8}{5 - 1} = 376.7$

Standard Deviation
• The standard deviation is another measure of dispersion.
• It is the square root of the variance.
• Population standard deviation (σ) = √σ2
• Sample standard deviation (S) = √S2

Example
We want to compute the sample standard deviation of the following weekly income values (in USD) of sampled health care workers: 10, 21, 33, 53, 54
Solution
n = 5
S2 = 376.7
S = √S2 = √376.7 = 19.41
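
The variance and standard deviation examples can be reproduced with a brief Python sketch. The function names are illustrative; the code follows the sample formula with n − 1 in the denominator, as in the worked example.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance."""
    return sqrt(sample_variance(values))


incomes = [10, 21, 33, 53, 54]             # weekly income in USD
print(round(sample_variance(incomes), 1))  # sample variance
print(round(sample_sd(incomes), 2))        # sample standard deviation
```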

Coefficient of variation
The variance and standard deviation are useful
as measure of variation of the values of a single
variable for a single population.
If we want to compare the variation of two
variables we cannot use the variance or the
standard deviation because:
I. The variables might have different units.
II.The variables might have different means.

• We need a measure of relative variation that does not depend on either the units or on how large the values are.
• This measure is the coefficient of variation (C.V.).
• C.V = (SD / Mean) × 100

Example
Compare the variability of the weights of two groups:
Groups      Mean    SD       C.V
1st group   66 kg   4.5 kg   6.8 %
2nd group   36 kg   4.5 kg   12.5 %
C.V1 = (4.5 / 66) × 100 = 6.8%
C.V2 = (4.5 / 36) × 100 = 12.5%
Since C.V2 > C.V1, the relative variability of the 2nd group is larger than the relative variability of the 1st group.
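
A minimal Python sketch of the comparison, assuming both group means are in kilograms (the second mean is taken as 36 kg, which is what the stated 12.5% implies). The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100


cv1 = coefficient_of_variation(4.5, 66)   # 1st group
cv2 = coefficient_of_variation(4.5, 36)   # 2nd group
print(round(cv1, 1), round(cv2, 1))
print("2nd group is relatively more variable" if cv2 > cv1
      else "1st group is relatively more variable")
```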

Exercise 1
A student was asked to report the results of the 5 subjects he/she took last semester, and the data were presented as follows: 80, 71, 63, 53, 54
Now calculate:
1] Range
2] Variance
3] Standard deviation

Exercise 2
Let us compare the exam results of 2 groups.
The 1st group:
Mean exam result = 75
Standard deviation = 7.5
The 2nd group:
Mean exam result = 80
Standard deviation = 9
Calculate and compare the relative variability (C.V.) of the results of the 2 groups.

Chapter 4: Collection and Organization of data
Data: raw, unorganized facts that need to be processed.
When data is processed to make it useful, it is called information.

Primary Data:
• Definition: data collected firsthand by the researcher.

Primary data collection methods
• Interviews
• Observations
• Focus group discussions
• Laboratory samples (blood, body fluid, urine, feces)
• Imaging (X-ray, US, CT, MRI)

Common primary data collection tools
1. Questionnaires 2. Google Forms 3. KoboToolbox

Secondary Data:
• Definition: data that has been collected by someone else or by an institution.

SECONDARY DATA SOURCES
• Journals
• Books
• Magazines
• Newspapers
• Libraries
• Websites
• Medical records

Organizing data in Array (Ordered
Array)
• A first step in organizing data is the
preparation of an ordered array.
• An ordered array is a listing of the
values of data in order of magnitude
from the smallest value to the largest
value

Ex: the following ages of 6 individuals are arranged in an ordered array
Raw data: 55 46 58 54 52 69
Ascending form: 46 52 54 55 58 69
Descending form: 69 58 55 54 52 46

Frequency Distribution
• The most convenient method of
organizing data is to construct a
frequency distribution.
• A frequency distribution is the
organization of raw data in a table
form, using classes and frequencies.

Grouped Frequency Distributions
When the range of the data is large, the data must be grouped into classes.
Class Boundary
Definition: class boundaries are obtained from the class limits using an adjustment equal to (lower limit of a class – upper limit of the previous class) / 2. Subtract this adjustment from each lower limit and add it to each upper limit.
The difference between the two boundaries of a class gives the class width. The class width is also called the class size.

Finding Class Width
Class width = Upper boundary - Lower boundary
Calculating Class Midpoint or Mark
Class midpoint or mark = (Lower limit + Upper limit) / 2

Example: The following table gives the weekly earnings of 100 employees of a large company.
The first column lists the classes, which represent the (quantitative) variable: weekly INCOME.

Weekly Income in USD   Number of employees (Freq)
801-1000               9
1001-1200              22
1201-1400              39
1401-1600              15
1601-1800              9
1801-2000              6

Calculate Class Boundaries, Class Widths, and Class Midpoints for the above data
Solution:
Adjustment = (lower limit of a class – upper limit of the previous class) / 2 = (1001 – 1000) / 2 = 1 / 2 = 0.5
Lower boundary of the first class = 801 – 0.5 = 800.5
Upper boundary of the first class = 1000 + 0.5 = 1000.5
Width of the first class = 1000.5 - 800.5 = 200
Midpoint of the first class = (801 + 1000) / 2 = 900.5
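
The same calculation can be repeated for every class with a short Python sketch. The function name `class_summary` and the dictionary keys are illustrative; the sketch assumes the gap between consecutive classes is the same throughout the table (here 1 USD).

```python
def class_summary(limits):
    """Boundaries, width and midpoint for each (lower, upper) class limit pair."""
    # Half the gap between one class's upper limit and the next class's lower limit
    adjustment = (limits[1][0] - limits[0][1]) / 2
    rows = []
    for lower, upper in limits:
        lower_boundary = lower - adjustment
        upper_boundary = upper + adjustment
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower_boundary, upper_boundary),
            "width": upper_boundary - lower_boundary,
            "midpoint": (lower + upper) / 2,
        })
    return rows


income_classes = [(801, 1000), (1001, 1200), (1201, 1400),
                  (1401, 1600), (1601, 1800), (1801, 2000)]
summary = class_summary(income_classes)
for row in summary:
    print(row)
```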

Constructing Frequency Distribution Tables
Important steps for constructing a frequency distribution table for continuous data:
1. The number of classes depends on the range of the data.
Range = largest value – smallest value

2. Number of classes:
The number of classes should not be too large or too small.
As a general rule, the number of classes should be around $\sqrt{n}$, where n is the number of data values observed.

4. Number of columns: usually there will
be two columns in a frequency table: class
intervals and frequency.

Example: the following data represents
the number of patients admitted by a
hospital in 30 days.
Construct a frequency distribution table.

Solution:
In this data, the minimum value is 5, and the maximum value is 29.
Number of classes = $\sqrt{30} \approx 5$
Range = largest value – smallest value = 29 – 5 = 24
Class width = Range / Number of classes = 24 / 5 = 4.8 ≈ 5

Patients admitted Frequency
5-9 3
10-14 6
15-19 8
20-24 8
25-29 5
Total frequency: 30

Example: Calculate the relative frequencies and percentages for the table in the previous example

Patients admitted   Frequency   Relative frequency   Percentage (%)
5-9                 3           3/30 = 0.1           0.1 x 100 = 10
10-14               6           6/30 = 0.2           0.2 x 100 = 20
15-19               8           8/30 = 0.267         0.267 x 100 = 26.7
20-24               8           8/30 = 0.267         0.267 x 100 = 26.7
25-29               5           5/30 = 0.167         0.167 x 100 = 16.7
Total               30          1                    100

Cumulative Frequency Distribution
A cumulative frequency distribution gives the
total number of values that fall below the upper
boundary of each class.

Example: Calculate cumulative frequency and
cumulative percentages for the table in the
previous example

Patients admitted   Frequency   Cumulative relative frequency   Percentage (%)       Cumulative Percentage
5-9                 3           3/30 = 0.100                    0.1 x 100 = 10       10
10-14               6           9/30 = 0.300                    0.2 x 100 = 20       30
15-19               8           17/30 = 0.567                   0.267 x 100 = 26.7   56.7
20-24               8           25/30 = 0.833                   0.267 x 100 = 26.7   83.3
25-29               5           30/30 = 1                       0.167 x 100 = 16.7   100
Total               30                                          100
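
The relative, percentage and cumulative columns can all be derived from the class frequencies. A minimal Python sketch, with the hypothetical function name `frequency_table` and illustrative dictionary keys, using the patients-admitted frequencies from the example:

```python
def frequency_table(labels, frequencies):
    """Relative, percentage and cumulative columns for a frequency table."""
    total = sum(frequencies)
    running = 0                                   # cumulative frequency so far
    rows = []
    for label, freq in zip(labels, frequencies):
        running += freq
        rows.append({
            "class": label,
            "frequency": freq,
            "relative": freq / total,             # frequency / total frequency
            "percent": freq / total * 100,
            "cumulative": running,
            "cumulative_relative": running / total,
            "cumulative_percent": running / total * 100,
        })
    return rows


classes = ["5-9", "10-14", "15-19", "20-24", "25-29"]
frequencies = [3, 6, 8, 8, 5]
table = frequency_table(classes, frequencies)
for row in table:
    print(row["class"], row["frequency"], round(row["relative"], 3),
          round(row["percent"], 1), row["cumulative"],
          round(row["cumulative_percent"], 1))
```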

Ungrouped frequency distribution of
numerical data
Data that has not been organized into groups.
Also called raw data.
Ungrouped data can be either numerical or
categorical.

Creating a Numerical Ungrouped
Frequency Distribution table
Step 1- arrange the data in an ascending
array.
Step 2- count the frequency of each value.
Step 3- create a table
Step 4- insert the data values in the table

Example: Blood Pressure Readings of 8
individuals.
120, 130, 130, 125, 140, 140, 140, 122.
create a frequency distribution table for
this data.

Step 1- arrange the data in an ascending
array.
120, 122, 125, 130, 130, 140, 140, 140.
Step 2- count the frequency of each
value.
120 (1), 122 (1), 125 (1), 130 (2), 140 (3).

Step 3- create a table
Step 4- insert the data values in the table
Blood pressure reading   Frequency
120                      1
122                      1
125                      1
130                      2
140                      3
Total frequency          8

Creating a Categorical Frequency
Distribution table
Step 1-count the frequency of each value.
Step 2-create a table
Step 3-insert the data values in the table

Example of ungrouped categorical data
related to the blood types of 20
individuals:
• Blood Types:
A, B, O, AB, O, A, B, A, O, B, AB, A, O,
B, B, A, O, AB, B, A

Step 1- count the frequency of each category.
A = 6 individuals
B = 6 individuals
O = 5 individuals
AB = 3 individuals

Step 2- create a table
Step 3- insert the data in the table
Blood Type        Frequency
A                 6
B                 6
O                 5
AB                3
Total frequency   20
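
Counting categories by hand is error-prone, so a tally can be checked with a short Python sketch. It uses `collections.Counter` from the standard library on the 20 blood types listed in the example; the variable names are illustrative.

```python
from collections import Counter

blood_types = ["A", "B", "O", "AB", "O", "A", "B", "A", "O", "B",
               "AB", "A", "O", "B", "B", "A", "O", "AB", "B", "A"]

counts = Counter(blood_types)          # frequency of each category
total = sum(counts.values())

print("Blood Type  Frequency")
for blood_type in ["A", "B", "O", "AB"]:
    print(f"{blood_type:<11} {counts[blood_type]}")
print(f"{'Total':<11} {total}")
```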

Relative Frequency and Percentage
Distributions
Shows what fractional part of the total
frequency belongs to the corresponding
category.
The relative frequency of a category is
obtained by dividing the frequency of that
category by the sum of all frequencies.

The percentage for a category is obtained by
multiplying the relative frequency of that
category by 100.
A percentage distribution lists the
percentages for all categories.
Calculating Percentage
• Percentage = (Relative frequency) × 100

Example: Determine the relative frequency
and percentage distributions for this data.

Chapter Five
Visualization and presentation of data

Techniques of Data presentation
Data can be presented in:
• Tabular form
• Graphical form

Tabular data presentation
A table contains data in rows
and columns.
Types of Tables
1. Univariate table
2.Bivariate table
3.Multivariate table

Univariate Table-2: Age
Age     Frequency   Percentage
21-26   6           30
27-32   6           30
33-38   2           10
39-44   3           15
45-50   3           15
Total   20          100

Bivariate Table-1: Sex and Age
Age     Male   Female   Total
21-26   1      5        6
27-32   3      3        6
33-38   0      2        2
39-44   3      0        3
45-50   1      2        3

Multivariate Table-3: Age, sex and residence
Age     Male            Female          Total
        Urban   Rural   Urban   Rural
21-26   1       2       5       1       9
27-32   3       2       3       2       10
33-38   0       1       2       1       4
39-44   3       2       0       2       7
45-50   1       3       2       1       7
Total   8       10      12      7       37

Graphical presentation of data
Tabulation is an important systematic way of presenting data, but patterns in the data are often revealed more easily by diagrams or graphs.

Types of graphical presentation
Data Type      Type of Graph
Qualitative    Univariate: simple bar chart, component bar chart, pie chart
               Bivariate: multiple bar chart
Quantitative   Histogram, line graph/chart

Simple bar
A simple bar chart is used for presenting univariate qualitative data.
• Bar charts have a horizontal axis called the X-axis and a vertical axis called the Y-axis
• Categories are placed on the X-axis and percentage or frequency on the Y-axis

[Figure: simple bar chart of sex (Male, Female); vertical axis from 0 to 60]

Component Bar
• To draw component bar, divide
100% into components equal to
the number of categories of
the variable you want to draw.

[Figure: component bar chart of Sex, showing Male and Female as shares of 100%]

Pie chart
A pie chart is a circular statistical graph that divides the data into slices to illustrate the numerical proportion of each category.

[Figure: pie chart of Sex (Male, Female) with slices of 40% and 60%]

Multiple bar chart
• A multiple bar chart is a type of bar chart that is used for bivariate qualitative data.
• Using this data, construct a multiple bar chart.
Sex      Diabetes   No diabetes
Male     3          5
Female   8          4
Total    11         9

[Figure: multiple bar chart of diabetes status by sex; vertical axis from 0 to 80%. Diabetes: Male 27.3%, Female 72.7%. No diabetes: Male 55.6%, Female 44.4%]

Graph for Quantitative variables
Graphs used to present quantitative
univariate variables include:-
• Histogram,
• Line graph/Line chart

Histogram
• Histogram is the common graph for
quantitative variables.
• It is similar to a bar chart except that there are no gaps between its bars

[Figure: Histogram; horizontal axis from 50 to 200, vertical axis from 0 to 9]

Chapter Six: Probability and Normal distribution of data
Probability is the likelihood of occurrence of an event and is measured by the proportion of times the event occurs. The event is denoted by "E", the number of times the event occurs by "n", and the number of all possible outcomes by "N".
P(E) = (number of times the event occurs) / (number of all possible outcomes)
or P(E) = n/N

EXAMPLE: 1
A coin is tossed. What is the probability of getting a head?
A coin has two outcomes, head and tail, so the total number of outcomes (N) is 2.
There is only one head, so the event (head) = 1.
P(Head) = n/N = 1/2 = 0.5
The probability of getting a head when a coin is tossed is 0.5 or 1/2.

EXAMPLE: 2
The OPD attendance of a hospital is shown here.
Diseases       Frequency
Diabetes       80
Hypertension   40
Total          120
What is the probability that a randomly selected individual has diabetes?
What is the probability that a randomly selected individual has hypertension?

Solution
• P(Diabetes) = n/N = 80/120 = 0.67
• P(Hypertension) = n/N = 40/120 = 0.33

Characteristics of Events
Events possess certain characteristics,
which are:-
a. Mutually exclusive events
b.Mutually non-exclusive events
c. Independent events
d.Dependent events

Mutually exclusive Events
• Events of a trial are called mutually exclusive if only one of them can occur in each single trial. This means that the events cannot occur simultaneously: if one event occurs, the other cannot.
• Example: if a coin is tossed, for any toss (trial) there is only one event (either head or tail).

Mutually non-exclusive Events
Events that can occur simultaneously are called mutually non-exclusive. For example, an individual can have only diabetes, only hypertension, or both diabetes and hypertension at the same time.

Example: Suppose that in OPD attendance there are two categories, people with diabetes and people with hypertension.
However, there are some people who have both diabetes and hypertension.
Thus events like diabetes and hypertension are considered mutually non-exclusive events.

Independent Events
If A and B are two events of a particular trial and the outcome of event A does not affect, and is not affected by, the outcome of event B, then A and B are called independent events.
For example: if you toss two coins, the outcome of the first toss (head or tail) will not affect, and is not affected by, the outcome of the second toss.

Dependent Events:
• If the outcome of event A influences the outcome of event B, or B affects A, event A and event B are considered dependent events.
Examples:
• Smoking and developing lung cancer
• Driving a car and getting into a traffic accident
• Robbing a bank and going to jail

Properties of probability
Probability is expressed as a proportion, so it takes any value between 0 and 1. It can also be shown as a percentage, in which case it takes values from 0 to 100%.
A probability of 1 means that the event is certain to occur (e.g. the probability of eventually dying).
A probability of 0 means that the event is certain not to occur (e.g. the probability of never dying).

A probability of 0.5 means that the event is as likely to occur as not to occur.
The higher the probability value, the higher the chance of occurrence; the smaller the probability value, the lower the chance of occurrence.
The sum of the probabilities of all possible events must be equal to 1 or 100%.

Types of probability
According to the time of occurrence of the events, probability is categorized as:
A priori probability: calculated before the event occurs, by logically examining existing knowledge.
It usually deals with independent events.
For example, the probability of getting a head or a tail when a coin is tossed is 1/2 or 0.5.

A posteriori probability: calculated after the event has occurred; that is, it is based on the observed frequency of occurrence.
For example: the proportion of hypertensive patients in a sample of 100 patients.

Rules of probability
There are two basic rules in probability
i. Addition Rule
ii. Multiplication Rule

Addition Rule
This rule applies to both mutually exclusive and mutually non-exclusive events of a single random variable. It is characterized by the term "or" (sometimes written ∪, meaning union) between the two events, e.g. P(A or B), also shown as P(A ∪ B).
For mutually exclusive events:
P(A or B) = P(A) + P(B)
For mutually non-exclusive events:
P(A or B) = P(A) + P(B) - P(A and B)

Example 1 (mutually exclusive events)
A single 6-sided die is rolled. What is the probability of rolling a 2 or a 5?
Solution
Since 2 and 5 are mutually exclusive, P(2 and 5) = 0
P(2) = 1/6, P(5) = 1/6
P(2 or 5) = P(2) + P(5) = 1/6 + 1/6 = 2/6 = 1/3 = 0.333

Example 2 (mutually exclusive and mutually non-exclusive events)
Suppose patients attending a hospital OPD are categorized as in the following table.
Disease                           No. of patients
Eye disease                       5
Respiratory disease               15
Only Diabetes                     90
Only Heart disease                30
Both Diabetes and Heart disease   10
Total                             150

If a person is drawn at random:
a. What is the probability that he/she will have eye disease or respiratory disease?
b. What is the probability that he/she will have diabetes or heart disease?

Solution
a. Eye disease or respiratory disease (mutually exclusive here)
• Patients with eye disease = 5
• Patients with respiratory disease = 15
• Total patients = 150
P(eye disease or respiratory disease) = 5/150 + 15/150 = 20/150 = 0.13

b. Diabetes or heart disease (mutually non-exclusive here)
• Patients with diabetes = 90 + 10 = 100
• Patients with heart disease = 30 + 10 = 40
• Total patients = 150
P(Diabetes or Heart disease) = P(Diabetes) + P(Heart disease) - P(Diabetes and Heart disease)
P(Diabetes or Heart disease) = 100/150 + 40/150 - 10/150 = 130/150 = 0.87
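
Both parts of the OPD example can be checked with a minimal Python sketch. It uses exact fractions from the standard library so that no rounding happens before the final step; the dictionary keys and variable names are illustrative.

```python
from fractions import Fraction

# Counts from the OPD table in the example
patients = {
    "eye": 5,
    "respiratory": 15,
    "diabetes_only": 90,
    "heart_only": 30,
    "diabetes_and_heart": 10,
}
total = sum(patients.values())


def prob(count):
    """P(E) = n / N, kept as an exact fraction."""
    return Fraction(count, total)


# a. Mutually exclusive events: P(A or B) = P(A) + P(B)
p_eye_or_resp = prob(patients["eye"]) + prob(patients["respiratory"])

# b. Mutually non-exclusive events: P(A or B) = P(A) + P(B) - P(A and B)
p_diabetes = prob(patients["diabetes_only"] + patients["diabetes_and_heart"])
p_heart = prob(patients["heart_only"] + patients["diabetes_and_heart"])
p_both = prob(patients["diabetes_and_heart"])
p_diabetes_or_heart = p_diabetes + p_heart - p_both

print(p_eye_or_resp, round(float(p_eye_or_resp), 2))
print(p_diabetes_or_heart, round(float(p_diabetes_or_heart), 2))
```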

Normal Distribution of data
In the normal distribution, observations are clustered around the mean.
Half of the observations lie above the mean and half below it, and the observations are symmetrically distributed on each side of the mean.

Characteristics of Normal Curve/Distribution
a) The normal curve is symmetrical and bell shaped
b) The curve is highest at the centre and decreases towards zero symmetrically on each side
c) Mean, median and mode are all equal
• Mean ± 1SD limits include about 68.3% of all observations
• Mean ± 2SD limits include about 95% of all observations
• Mean ± 3SD limits include about 99.7% of all observations
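
The shares of observations within 1, 2 and 3 standard deviations of the mean can be computed from the standard normal curve. A minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within mean ± k SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Mean ± {k}SD: {share_within(k) * 100:.1f}%")
```

The exact values are slightly different from the rounded figures usually quoted, which is why the list above uses "about".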

Normal Curve

Skewed Distributions
Distributions that are not symmetric and have a long tail in one direction are called skewed distributions.
In a skewed distribution, most values are closer to one end and relatively few values lie in the other direction.

Positively Skewed Distributions
If the tail of the distribution extends to the right (positive side), the distribution is called a positively skewed or right-skewed distribution.
In right-skewed distributions, the majority of the values lie in the left part of the distribution.

Negatively Skewed Distributions
If the tail of the distribution extends to the left (negative side), the distribution is called a negatively skewed or left-skewed distribution.
In left-skewed distributions, the majority of the values lie in the right part of the distribution.

Left and Right Skewed Examples

Section II
Inferential Biostatistics

Inferential Biostatistics
Descriptive statistics remains local to the
sample, describing its central tendency and
variability while inferential statistics focuses
on making statements about the population.

Statistic vs. Parameter
Statistic (sample value)
• Mean (x̄)
• Variance (s²)
• Standard deviation (s)
• Proportion (p)
Parameter (population value)
• Mean (μ)
• Variance (σ²)
• Standard deviation (σ)
• Proportion (∏)

Chapter Seven
Hypothesis and significance testing

Test of significance
A test of significance is the determination of whether a result is statistically significant or whether it could have occurred by chance.

Hypothesis
A hypothesis is the researcher's assumed answer about the relationship between two variables or the significance of a test result.
There are two statistical hypotheses:
a. Null hypothesis
b. Alternative hypothesis

Null Hypothesis
It states that there is no real difference between the statistic and the parameter, e.g. sample mean = population mean.
Any observed difference is due to chance alone.
The null hypothesis is denoted by the symbol H0.

Alternative hypothesis
It states that there is a real difference between the statistic and the parameter, e.g. sample mean ≠ population mean. The alternative hypothesis is denoted by the symbol H1 or Ha.
H0: µ1 = µ2    Ha: µ1 ≠ µ2
• When the null hypothesis is rejected, the alternative hypothesis is accepted.

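P-Value
The p-value indicates the amount of support the data give to the null hypothesis: it is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true.
As the p-value (which lies between 0% and 100%) approaches 0, the support for H0 becomes weaker and weaker, while as it approaches 100% the support becomes stronger and stronger.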

Level of significance
In order to decide whether the support is
strong or weak we need some cut-off value or
level.
This cut-off value or level is known as level of
significance denoted by α.

Internationally accepted levels of
Significance
•10% (or 0.1)
•5% (or 0.05)
•1% (or 0.01)
The most commonly used is 5% (or 0.05)

The zone of the null hypothesis acceptance
1] If the calculated value is less than the tabulated value, the null hypothesis is accepted and the alternative hypothesis is rejected. (Calculation based)
2] If the support for the null hypothesis (p-value) is greater than or equal to the most commonly used significance level (p-value ≥ 0.05), the null hypothesis is accepted and the alternative hypothesis is rejected. (Computer based)

The zone of the null hypothesis rejection
1] If the calculated value is greater than the tabulated value, the null hypothesis is rejected and the alternative hypothesis is accepted. (Calculation based)
2] If the support for the null hypothesis (p-value) is less than the most commonly used significance level (p-value < 0.05), the null hypothesis is rejected and the alternative hypothesis is accepted. (Computer based)

One-Tailed and Two-Tailed Tests
The null hypothesis can be tested using either a one-tailed or a two-tailed test.
One-Tailed Test
A test in which the alternative hypothesis favors only one direction is called a one-tailed test.
Example: suppose a study compares two drugs, Drug A and Drug B.

Null hypothesis (H0): Drug A is not more effective than Drug B.
Alternative hypothesis (Ha): Drug A is more effective than Drug B.
H0: Drug A ≤ Drug B
Ha: Drug A > Drug B

Two-tailed Test
In a two-tailed test, deviations in both directions are
considered when testing.
For example, in the previous example comparing the
effectiveness of Drug A and Drug B, the two-tailed null
and alternative hypotheses are: H0 = Drug A and Drug B
have the same effect; Ha = Drug A and Drug B do not
have the same effect. In short:
H0: Drug A = Drug B
Ha: Drug A ≠ Drug B

Steps for Hypothesis Testing
a) Describe the given data
b) State the assumptions (an assumption is an
unexamined belief)
c) State the null and alternative hypotheses
d) State the level of significance
e) Choose the test statistic (z-test, t-test,
ANOVA, χ2)
f) Compute the test statistic

g) Look up the tabulated test statistic for the
chosen significance level and degrees of freedom
and compare it with the calculated test statistic
(or compare the p-value with the significance level).
If the calculated test statistic > the tabulated
test statistic, we reject the null hypothesis;
otherwise we do not reject (accept) it.
h) Decision: reject or accept the null
hypothesis.
i) Conclusion: state the conclusion in the
language of the accepted hypothesis.

Chapter Eight
Testing the significance of the difference
between two and three sample means

Testing the significance of the difference between two sample means
When we want to determine whether the
difference between two group means is
significant (large enough) or insignificant
(only due to chance), we do a Z-test or a t-test.
Here are the decision criteria for using the
Z-test or the t-test:
• Z-test: the population SD (σ) is known, or the
sample is large (n > 30)
• t-test: the population SD is not known and the
sample is small

Z-test (normal test)
Z or z = (x̄ − μ) / (σ/√n)
Tabulated z values
Significance level (α)   Two-tailed     One-tailed, >   One-tailed, <
                         1-(alpha/2)    1-alpha         1-alpha
10% (or 0.1)             1.64           1.28            -1.28
5% (or 0.05)             1.96           1.64            -1.64
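The tabulated z values can be reproduced from the standard normal distribution. The sketch below uses Python's standard-library statistics.NormalDist; the helper name z_critical and its tail argument are illustrative assumptions, not something defined in the slides.

```python
from statistics import NormalDist


def z_critical(alpha: float, tail: str = "two") -> float:
    """Critical z value for a given significance level and tail type."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        return std_normal.inv_cdf(1 - alpha / 2)
    if tail == "greater":
        return std_normal.inv_cdf(1 - alpha)
    if tail == "less":
        return std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'greater' or 'less'")


for alpha in (0.10, 0.05):
    print(alpha,
          round(z_critical(alpha, "two"), 3),
          round(z_critical(alpha, "greater"), 3),
          round(z_critical(alpha, "less"), 3))
```

The computed values carry one more decimal than the table (about 1.645 and 1.282), which is why some tables print 1.64 or 1.65 for the same entry.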

Example
The mean birth weight of babies born in a large
community over several years was 2470 gram, with a
standard deviation of 230 gram.
Following implementation of an ANC program, the
mean birth weight obtained from a sample of 40
babies was 2560 gram, with a standard deviation of
250 gram.
Does the ANC program have any impact on the birth
weight of newborn babies?

Solution
Data: Given μ = 2470 gm, x̄ = 2560 gm, σ = 230 gm, s = 250 gm, n = 40
Assumption: a) birth weight of the baby population is
normally distributed
b) Sample was selected at random
Hypothesis: H0: μ = 2470 gm (mean birth weight of the
population will not change even after ANC). Ha: μ ≠ 2470 gm
(mean birth weight of the population will change after ANC).
Level of significance (α): 5% (0.05)
Choose Test statistic: since σ is known, we do Z-test

Compute the test statistic
z = (x̄ − μ)/(σ/√n) = (2560 − 2470)/(230/√40) = 90/36.37 = 2.47
Compare the calculated z to the tabulated z:
the tabulated z at the 5% level of significance is 1.96
Decision: we reject the null hypothesis since the
calculated z (2.47) > the tabulated z (1.96)
Conclusion: the mean birth weight of babies born
has increased after ANC program implementation.
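The birth-weight calculation can be checked with a short Python sketch. It assumes the usual one-sample statistic z = (x̄ − μ)/(σ/√n); the function name one_sample_z is hypothetical, and the two-tailed p-value is an extra that the slides do not compute.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean: float, pop_mean: float, pop_sd: float, n: int) -> float:
    """z statistic for a sample mean when the population SD is known."""
    standard_error = pop_sd / sqrt(n)
    return (sample_mean - pop_mean) / standard_error


z = one_sample_z(sample_mean=2560, pop_mean=2470, pop_sd=230, n=40)
# Two-tailed p-value: probability of a |z| at least this large under H0
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), round(p_two_tailed, 4))
print("reject H0" if abs(z) > 1.96 else "accept H0")
```

Both routes agree here: the calculated z exceeds 1.96 and the p-value is below 0.05.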

Example-2
The hemoglobin (Hb) level of children was measured in 143
girls and 127 boys with known population SD. Here are
the results.
Here girls have a higher Hb level than boys on average, so
the question is whether the observed difference is
significant or not.
        Girls   Boys
Mean    11.2    11.0
SD      1.4     1.3
n       143     127

Solution
• Data: Given x̄1 = 11.2, x̄2 = 11.0, s1 = 1.4, s2 = 1.3, n1 = 143, n2 = 127
• Assumption: a) Hb level of the population is normally
distributed
• b) Sample was selected at random
• Hypothesis: H0: μ1 = μ2 (any observed difference is due to
chance alone).
Ha: μ1 ≠ μ2 (the mean Hb levels of girls and boys
differ significantly)
• Level of significance (α): 5% (0.05)
• Choose Test statistic: since n > 30, we do Z-test

Compute the test statistic
z = (x̄1 − x̄2)/√(s1²/n1 + s2²/n2) = (11.2 − 11.0)/√(1.4²/143 + 1.3²/127) = 0.2/0.1644 = 1.22
Compare the calculated z to the tabulated z at the 5%
level of significance: the tabulated z is 1.96
Decision: we accept the null hypothesis since the
calculated z (1.22) is < the tabulated z (1.96)
Conclusion: the mean Hb levels of girls and boys are not
significantly different.
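As a cross-check of the Hb example, the sketch below computes the two-sample z statistic with the standard error taken as √(s1²/n1 + s2²/n2), that is, using the variances (SD squared). With the figures from the table this gives a z of about 1.22, which is still below 1.96, so the decision is the same. The function name two_sample_z is an assumption made for illustration.

```python
from math import sqrt


def two_sample_z(mean1: float, sd1: float, n1: int,
                 mean2: float, sd2: float, n2: int) -> float:
    """z statistic for the difference between two independent sample means."""
    # Standard error of the difference uses variances, i.e. SD squared
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


z = two_sample_z(11.2, 1.4, 143, 11.0, 1.3, 127)
print(round(z, 3))
print("reject H0" if abs(z) > 1.96 else "accept H0")
```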

The t test is a test for comparing means in
one-sample as well as two-sample
situations.
Types of t test
a) One sample t test b) Independent sample
t test c) Paired sample t test

One sample t test
• One sample t test is used to test whether a
population mean is significantly different
from some hypothesized value.
• t = (x̄ − m) / (s/√n)
• x̄ is the sample mean, m is the hypothesized
value, s is the sample SD and n is the sample size

Example: A professor of statistics wants to
know whether his introductory statistics class
has a good grasp of basic math. Six students were
chosen at random from the class and given a math
proficiency test.
The professor wants the class to be able to score
above 70 on the test. The six students get scores
of 63, 93, 75, 68, 83, and 92, with an SD of 12.44.
Can the professor have 95% confidence that the
mean score for the class on the test would be
above 70?

Since the population standard deviation is not known, we use the t
test.
Solution
H0: μ ≤ 70    Ha: μ > 70
x̄ = (63+93+75+68+83+92)/6 = 474/6 = 79
m = 70
s = √(∑(x − x̄)²/(n − 1)) = √(774/5) = 12.44
t = (x̄ − m)/(s/√n) = (79 − 70)/(12.44/√6)

Solution
t = 9/5.08 = 1.77
df = n-1 = 6-1 = 5
Note that we are testing only
whether the mean score of the
students is greater than 70, so
we are dealing with a one-tailed
t-test.

The tabulated t with 5%
significance level and df of 5 is
2.015.
Thus the calculated t (1.77) is
less than the tabulated t with
df = 5 at the 5% level, which is 2.015
(calculated t < tabulated t0.05,5), so the
null hypothesis is accepted: the professor
cannot be 95% confident that the class
mean is above 70.
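The one-sample t calculation can be reproduced from the six listed scores. The sketch below is a minimal illustration using only the standard library; it computes the sample SD directly from the scores (about 12.44 with the n − 1 denominator) and takes the tabulated value 2.015 from the text, because the standard library has no t distribution. Variable names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

scores = [63, 93, 75, 68, 83, 92]
hypothesized_mean = 70
T_TABULATED = 2.015  # one-tailed, alpha = 0.05, df = 5 (value quoted in the text)

n = len(scores)
x_bar = mean(scores)
s = stdev(scores)  # sample SD, n - 1 in the denominator
t = (x_bar - hypothesized_mean) / (s / sqrt(n))
df = n - 1

print(x_bar, round(s, 2), round(t, 2), df)
print("reject H0" if t > T_TABULATED else "accept H0")
```

Because the test is one-tailed in the "greater than" direction, the comparison uses t itself rather than its absolute value.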

Independent sample t-test
Independent sample t-test is used to test the
means of two independent groups. Usually there is a
qualitative independent variable with two categories
(the groups) and a quantitative continuous dependent
variable, such as the height of males and females, or
the blood pressure of two groups. Example: to test
whether male income and female income are different
or not.
t = (x̄1 − x̄2) / √(sp² × (1/n1 + 1/n2)), where
sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) is the pooled variance

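Ex: Here is the blood pressure of male and female
patients. The question is whether the mean blood
pressure of the two groups differs.
        Male   Female
n       25     25
Mean    155    160
S       10     8
Solution
H0: μ1 = μ2    Ha: μ1 ≠ μ2
sp² = ((25 − 1) × 10² + (25 − 1) × 8²)/(25 + 25 − 2) = 3936/48 = 82
t = (155 − 160)/√(82 × (1/25 + 1/25)) = −5/2.56 = −1.95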

Df = n1+n2-2 = 25+25-2 = 48. At the 5%
significance level, the tabulated t is about 2.01
(2.021 at df = 40, the nearest row in many t tables).
Thus, ignoring the sign, t calculated < t tabulated, so
the null hypothesis is accepted.
We can conclude that the two means (the
mean male blood pressure and the mean
female blood pressure) are not
significantly different.
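The blood-pressure comparison can be reproduced from the summary table alone. The sketch below assumes the pooled-variance form of the independent t test, which matches the degrees of freedom n1 + n2 − 2 used above; the function name pooled_t is hypothetical.

```python
from math import sqrt


def pooled_t(mean1: float, s1: float, n1: int,
             mean2: float, s2: float, n2: int) -> tuple[float, int]:
    """Independent-samples t statistic with pooled variance, plus its df."""
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    standard_error = sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


t, df = pooled_t(155, 10, 25, 160, 8, 25)  # male vs female blood pressure
print(round(t, 2), df)
print("reject H0" if abs(t) > 2.021 else "accept H0")  # tabulated t from the text
```

The sign of t only reflects which group is listed first; the two-tailed decision uses its absolute value.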

Paired sample t test
Paired sample t test is used to test the mean
difference of two dependent observations, such as
blood pressure before exercise and blood pressure
after exercise for a group of individuals. In the
independent t test we were interested in between-group
differences, but in the paired t test we are interested
in the within-group difference.
t = d̄ / (sd/√n), where d̄ = ∑d/n is the mean difference
of the pairs (e.g. before and after), sd is the SD of the
differences and n is the number of pairs

Example
Here is the temperature of 8 individuals before and after
the treatment
Patient   Before (X)   After (Y)
1         25.8         24.7
2         26.7         25.8
3         27.3         26.3
4         26.1         25.2
5         26.4         25.5
6         27.4         26.6
7         27.1         26.0

Solution
Let's first calculate d and d²
Patient   Before (X)   After (Y)   d = x − y   d²
1         25.8         24.7        1.1         1.21
2         26.7         25.8        0.9         0.81
3         27.3         26.3        1.0         1.00
4         26.1         25.2        0.9         0.81
5         26.4         25.5        0.9         0.81
6         27.4         26.6        0.8         0.64
7         27.1         26.0        1.1         1.21

• d̄ = ∑d/n = 7.9/8 = 0.99
• sd = √(variance of d)
• sd² (variance of d) = (∑d² − (∑d)²/n) / (n − 1)
• sd = 0.1
• t = d̄ / (sd/√n)

The tabulated t value with df 8-1=7 at 5%
significance level is 2.365, so the calculated
t>tabulated t with 7df at 5% significance
level.
Decision: Null hypothesis is rejected and
alternative hypothesis is accepted. We
conclude that the temperature of the
individuals before and after treatment is
not the same.
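The paired t procedure can be sketched in Python. This illustration uses only the seven before/after pairs that are listed in the table, so here n = 7 and df = 6 (tabulated two-tailed t at the 5% level: 2.447), and the numbers differ from a calculation based on eight patients. Variable names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

# The seven pairs shown in the table (before and after treatment)
before = [25.8, 26.7, 27.3, 26.1, 26.4, 27.4, 27.1]
after = [24.7, 25.8, 26.3, 25.2, 25.5, 26.6, 26.0]
T_TABULATED = 2.447  # two-tailed, alpha = 0.05, df = 6

d = [x - y for x, y in zip(before, after)]  # within-pair differences
n = len(d)
d_bar = mean(d)   # mean difference
s_d = stdev(d)    # SD of the differences (n - 1 denominator)
t = d_bar / (s_d / sqrt(n))
df = n - 1

print(round(d_bar, 3), round(s_d, 3), round(t, 2), df)
print("reject H0" if abs(t) > T_TABULATED else "accept H0")
```

The differences are very consistent from patient to patient, so their SD is small and the t statistic is large even with few pairs.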

Analysis of Variance (ANOVA or F test)
Analysis of variance is a statistical method of
analyzing data with the objective of comparing three
or more group means.
It replaces the t-test, which compares two group
means only.
Analysis of variance is sometimes called the F test,
after R. A. Fisher (the British statistician who
developed this test).

One-way ANOVA: used when we have
one continuous dependent variable and
one categorical independent variable with
more than two categories, to compare the
means of these groups.
Example: if we want to know whether people
residing in three different areas (Rural, Urban
and Semi-urban) earn different incomes

How to calculate One- Way ANOVA
1) F = MSSBG / MSSWG
2) SST = ∑x² − (∑x)²/n, or SSBG + SSWG
3) SSBG = ∑[(∑xi)²/ni] − (∑x)²/n
4) SSWG = SST − SSBG

5) MSSBG = SSBG / (K − 1)
6) MSSWG = SSWG / (n − K)
7) F test = MSSBG / MSSWG

Example
Three different treatments are given to 3
groups of patients with anemia. The increase in
HB% level was noted after one month and is
given in Table 2.0. We are interested to find
whether the difference in improvement in the 3
groups is significant or not.

Group A   Group B   Group C
x1        x2        x3
3         3         3
1         2         4
2         2         5
0         3         4
1         1         2
2         3         2
2         2         4

Solution
Group A   Group B   Group C   Group A   Group B   Group C
x1        x2        x3        x1²       x2²       x3²
3         3         3         9         9         9
1         2         4         1         4         16
2         2         5         4         4         25
0         3         4         0         9         16
1         1         2         1         1         4
2         3         2         4         9         4
2         2         4         4         4         16
∑x1=11    ∑x2=16    ∑x3=24    ∑x1²=23   ∑x2²=40   ∑x3²=90
∑x² = 23+40+90 = 153
∑x = 11+16+24 = 51

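2) SST = ∑x² − (∑x)²/n = 153 − 51²/21 = 153 − 123.86 = 29.14
3) SSBG = ∑[(∑xi)²/ni] − (∑x)²/n = (11² + 16² + 24²)/7 − 51²/21
= 136.14 − 123.86 = 12.28
4) SSWG = SST − SSBG = 29.14 − 12.28 = 16.86
5) MSSBG = SSBG/(K − 1) = 12.28/2 = 6.14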

Source of variation   Degree of freedom    Sum of Squares   Mean of Squares   F
Between Groups        K-1 = 3-1 = 2        12.28            6.14              6.53
Within Groups         n-K = 21-3 = 18      16.86            0.94
6) MSSWG = SSWG/(n − K) = 16.86/18 = 0.94
7) F = MSSBG/MSSWG = 6.14/0.94 = 6.53

Interpretation
The tabulated F value at df 2,18 is 3.55 at the
5% level of significance. Our calculated F
value is 6.53, that is, our calculated F value
is greater than the tabulated F value (F
calculated > F tabulated = 6.53 > 3.55).
Thus the null hypothesis is rejected. Hence
we conclude that the mean increase in HB%
differs significantly for at least one of the groups.
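The sums of squares in the ANOVA example can be recomputed directly from the three groups of seven observations. The sketch below follows the same computational formulas (SST, SSBG, SSWG, then the mean squares); variable names are illustrative assumptions, and the tabulated F of 3.55 is taken from the text. Without rounding the intermediate values, F comes out near 6.56 rather than 6.53, which does not change the decision.

```python
groups = {
    "A": [3, 1, 2, 0, 1, 2, 2],
    "B": [3, 2, 2, 3, 1, 3, 2],
    "C": [3, 4, 5, 4, 2, 2, 4],
}
F_TABULATED = 3.55  # df = (2, 18), alpha = 0.05

all_values = [x for values in groups.values() for x in values]
n = len(all_values)   # total number of observations
k = len(groups)       # number of groups
correction = sum(all_values) ** 2 / n  # (sum of x)^2 / n

sst = sum(x * x for x in all_values) - correction
ssbg = sum(sum(v) ** 2 / len(v) for v in groups.values()) - correction
sswg = sst - ssbg

mssbg = ssbg / (k - 1)   # between-groups mean square
msswg = sswg / (n - k)   # within-groups mean square
f_value = mssbg / msswg

print(round(sst, 2), round(ssbg, 2), round(sswg, 2))
print(round(mssbg, 2), round(msswg, 2), round(f_value, 2))
print("reject H0" if f_value > F_TABULATED else "accept H0")
```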

Chapter Nine
Association, Correlation and Prediction

Chi-square Test

A chi-square (χ2) test is used to test the
statistical association between two
categorical variables in which the
categories are two and above
(but usually two).

df= (r-1) (c-1), r=number of rows, c=number
of columns
Example
Suppose a researcher wants to test whether the
knowledge of people is associated with
service utilization. He conducted a sample
survey of 100 individuals, of which 78 had a
high level of knowledge.

Of these 78 who had high knowledge, 50
were service users, whereas of the 22 who had
a low knowledge level, 10 used the
service.
Do these data provide evidence of an
association between knowledge level and
service utilization?

2. Assumption: the sample was drawn
randomly, the observations are independent,
and the expected count in each cell is large
enough (commonly at least 5).
3. Hypothesis:
Ho. There is no association between
“knowledge level” and “service utilization”
Ha. There is association between
“knowledge level” and “service utilization”
4. Level of significance: α=5% (0.05)

7. Compute the degree of freedom (df)
df = (r-1)(c-1) = (2-1)(2-1) = 1 df
8. Tabulated value of χ2: with df = 1 and 5%
level of significance = 3.84
9. Compare the computed value with the tabulated
value: calculated χ2 (2.486) < tabulated χ2 (3.84)
10. Decision: H0 is accepted
11. Conclusion: the data do not provide
evidence of association between
knowledge level and service utilization
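The chi-square value for this example can be computed from the 2×2 table implied by the counts in the text (high knowledge: 50 users and 28 non-users; low knowledge: 10 users and 12 non-users). The sketch below assumes the usual statistic χ2 = ∑(O − E)²/E with expected counts E = row total × column total / grand total; variable names are illustrative, and no continuity correction is applied.

```python
# Rows: high knowledge, low knowledge; columns: service users, non-users
observed = [
    [50, 28],
    [10, 12],
]
CHI2_TABULATED = 3.84  # df = 1, alpha = 0.05

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 3), df)
print("reject H0" if chi2 > CHI2_TABULATED else "accept H0")
```

The smallest expected count here is 8.8 (low knowledge, non-users), so all cells are above the common minimum of 5.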

When one quantitative variable changes with
the change of other quantitative variable
they are said to be correlated.
The variable that changes the other variable
is called Independent variable (IV) and the
variable that is changed is called Dependent
(DV).
The DV is represented by Y and IV is
represented by X.

Example: Income and Age are both quantitative.
They are correlated because when age
changes the income changes as well.
Therefore Age is (X=IV) while income is
(Y=DV).
When the change occurs at a fixed rate it is called
linear correlation.
The correlation between one DV and one IV is
called simple correlation. E.g. correlation
between Income and Age

The correlation between one DV and two or more
IVs is called multiple correlation. E.g.
correlation between Income and both Age and
family size.
Correlation Coefficient (r)
To calculate the correlation between
variables, we use a measure called
correlation coefficient (r)

Characteristics of relationship
The correlation coefficient (r) indicates both
the strength and direction of relationship.
Strength (Magnitude) of the relationship:
When correlation coefficient is zero it
indicates no correlation.
<=0.3= weak correlation.
0.4-0.6= Moderate correlation.
0.7-1= Strong correlation

When the correlation coefficient is one
(either + or -) it indicates a perfect
correlation. As r approaches 1 (either
+ or -), the strength of the relationship
increases.

Direction of relationship: the relationship can
be positive, negative or absent (no correlation).
Positive correlation is when the two variables
move in the same direction (increase or decrease
together). E.g. gestational period and birth
weight. This is when r = +ve
Negative correlation is when the two variables
move in different directions (when one
increases the other decreases). E.g. age and
eyesight. This is when r = -ve

No correlation is when a change in one variable
is not accompanied by a change in the other variable.
E.g. Age and Sex.
This is when r = 0
Example:
Suppose 4 persons were selected as a sample to
determine the correlation between weight
and height

Weight in Pound (Y)   Height in inches (X)   Y²            X²           XY
240                   73                     57600         5329         17520
210                   70                     44100         4900         14700
180                   69                     32400         4761         12420
160                   68                     25600         4624         10880
∑Y: 790               ∑X: 280                ∑Y²: 159700   ∑X²: 19614   ∑XY: 55520

Interpretation
There is a very strong positive
correlation between the weight and
height of the respondents.
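The correlation coefficient for this table can be obtained from the column sums with the computational formula r = [∑XY − (∑X)(∑Y)/n] / √([∑X² − (∑X)²/n] × [∑Y² − (∑Y)²/n]). The Python sketch below is one way to carry out that arithmetic; variable names are assumptions, and it also squares r to give the proportion of shared variability.

```python
from math import sqrt

weight = [240, 210, 180, 160]  # Y, in pounds
height = [73, 70, 69, 68]      # X, in inches
n = len(weight)

sum_x, sum_y = sum(height), sum(weight)
sum_xy = sum(x * y for x, y in zip(height, weight))
sum_x2 = sum(x * x for x in height)
sum_y2 = sum(y * y for y in weight)

numerator = sum_xy - sum_x * sum_y / n
denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator
r_squared = r ** 2

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 2), round(r_squared, 2))
```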

Coefficient of Determination (r2)
The square value of r is called coefficient of
determination.
The coefficient of determination (r2)
measures the amount of variability in Y (DV)
that is explained by X (IV).
The coefficient of determination (r2) is shown
as a percentage.

Example: for the above example the correlation
coefficient (r) is 0.97, thus the coefficient of
determination (r2) is 0.97 x 0.97 = 0.94, or
94%
Interpretation
94% of the variability in the weight (DV) is
explained by the height (IV).
This means the remaining 6% of the variability
in weight is attributable to other variables,
not to height.

Correlation Significant Test
To test the significance of the correlation
value we use the following formula to find the
calculated t-value:
t = r × √((n − 2)/(1 − r²))
t = 0.97 × √((4 − 2)/(1 − 0.94)) = 0.97 × 5.77 = 5.6 (calculated t-value)

Then, assuming a significance level of 0.05, we
find the degrees of freedom, which here are
calculated as n − 2 = 4 − 2 = 2. We go to the
t-table and look for the junction between the
significance level and the degrees of freedom
to find the tabulated t-value.
The tabulated t-value for a two-tailed test
at the 0.05 significance level with 2 degrees
of freedom is 4.303.

Since the calculated t-value of 5.6 is > the
tabulated t-value of 4.303, the null
hypothesis is rejected.
Therefore we can conclude that there is a
significant, very strong positive correlation
between the height and weight of our
participants.
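The significance test for r can be sketched as follows. The t statistic for a correlation coefficient, t = r × √((n − 2)/(1 − r²)), has n − 2 degrees of freedom; with n = 4 that is df = 2, for which the tabulated two-tailed value at the 5% level is 4.303. The function name correlation_t is an assumption made for illustration.

```python
from math import sqrt


def correlation_t(r: float, n: int) -> tuple[float, int]:
    """t statistic and degrees of freedom for testing H0: no correlation."""
    df = n - 2
    t = r * sqrt(df / (1 - r ** 2))
    return t, df


T_TABULATED = 4.303  # two-tailed, alpha = 0.05, df = 2

t, df = correlation_t(r=0.97, n=4)
print(round(t, 2), df)
print("reject H0" if abs(t) > T_TABULATED else "accept H0")
```

With only four observations the test has very few degrees of freedom, so even a correlation of 0.97 clears the tabulated value by a modest margin.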

Regression Analysis:
A statistical procedure used to find
relationships among a set of variables
In regression analysis, there is a dependent
variable, which is the one you are trying to
explain, and one or more independent
variables that are related to it.

REGRESSION TYPES
1) Linear regression = quantitative DV
   A) Simple (1 DV and 1 IV)
   B) Multiple (multiple IVs and 1 DV)
2) Logistic regression = qualitative DV
   A) Binary = DV with 2 categories
      Simple (1 DV and 1 IV)
      Multiple (multiple IVs and 1 DV)
   B) Multinomial = DV with > 2 categories
   C) Ordinal = DV which is ordinal

Linear Regression:
Linear regression is used when the dependent
variable is continuous and assumes a linear
relationship with the independent variables.
It aims to find the best-fitting line that
represents the relationship between the
dependent variable and one or more
independent variables.

For example, a study might use linear
regression to determine the relationship
between smoking behavior (independent
variable) and lung function (dependent
variable) among a sample of individuals.

Logistic Regression:
Logistic regression is used when the dependent
variable is categorical or binary. It models the
probability of an event occurring or the likelihood
of an outcome belonging to a particular category.
The dependent variable is usually binary (e.g.,
yes/no, success/failure), but it can also be
multinomial (more than two categories) or ordinal
(ordered categories).

Why is regression analysis superior
compared with chi-square and
correlation?
1. Prediction capability:
Regression analysis allows for prediction:
it can estimate the value of the
dependent variable based on the values of
the independent variables.

2. Handling both categorical and numerical
variables
3. Control of confounding variables:
Regression analysis enables researchers to
control for the effects of confounding
variables by including them as independent
variables in the model.

Confounding variables: are factors that are
associated with both the independent variable(s)
and the dependent variable in a study. Age is
frequently a confounding variable in health
studies.
Ex: if studying the association between a specific
medication and heart disease risk, age must be
considered as a confounding variable because
older individuals are more likely to have both
higher heart disease risk and higher medication
usage

Regression equation: Y = Beta0 + Beta1*X
Y = Dependent variable, X = Independent variable
Beta 0 (CONSTANT/INTERCEPT) = the value of Y
when X is zero.
It shows how much the DV is if the IV is 0.
Beta 0 formula = Y-bar – Beta 1 * X-bar

Beta 1 (Regression coefficient/SLOPE)
It measures the amount of change in the DV (Y)
for a one-unit change in the IV (X).
It represents the relationship between IV and
DV.
Beta1 = [∑xy – (∑x * ∑y)/n] / [∑x² – (∑x)²/n]

Example 1. The height and weight of 4
individuals were given as presented in
the following table.
Let us predict how much the weight (DV)
of an individual could be if his height
(IV) is 80 inches.

Weight in pound (y)   Height in inch (x)   y²             x²            xy
240                   73                   57600          5329          17520
210                   70                   44100          4900          14700
180                   69                   32400          4761          12420
160                   68                   25600          4624          10880
∑y = 790              ∑x = 280             ∑y² = 159700   ∑x² = 19614   ∑xy = 55520

Beta 0 formula = Y-bar – Beta 1 * X-bar =
197.5 – 15.7 * 70 = –901.5.
Interpretation of Beta 0: if height is 0
the weight would be –901.5 pounds, a value
that cannot exist. The constant has no
practical meaning here because a height of 0
lies far outside the observed data.

Beta1 = [∑xy – (∑x * ∑y)/n] / [∑x² – (∑x)²/n]
= [55520 – (280 * 790)/4] / [19614 – 280²/4]
= 220 / 14 = 15.7
Beta1 = 15.7
Interpretation of Beta 1: for every one-unit (inch)
increase in height there is a 15.7-unit (pound)
increase in weight.

Regression equation: Y = Beta0 + Beta1*X
Y = -901.5 + 15.7*80 = 354.5
Interpretation of regression result:
based on the distribution of this data, if
height is 80 inches the predicted weight is
about 354.5 pounds.
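The slope, constant and prediction can be recomputed from the four observations. The sketch below is a minimal illustration of the same formulas; names such as predict_weight are assumptions. It keeps the slope unrounded (220/14 ≈ 15.714), so the constant comes out as −902.5 and the prediction at 80 inches as about 354.6, very close to the values obtained with the slope rounded to 15.7.

```python
weight = [240, 210, 180, 160]  # y, in pounds
height = [73, 70, 69, 68]      # x, in inches
n = len(weight)

sum_x, sum_y = sum(height), sum(weight)
sum_xy = sum(x * y for x, y in zip(height, weight))
sum_x2 = sum(x * x for x in height)

# Slope: change in weight per one-inch change in height
beta1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
# Constant: Y-bar minus slope times X-bar
beta0 = sum_y / n - beta1 * (sum_x / n)


def predict_weight(height_in_inches: float) -> float:
    """Predicted weight from the fitted line Y = beta0 + beta1 * X."""
    return beta0 + beta1 * height_in_inches


print(round(beta1, 3), round(beta0, 1), round(predict_weight(80), 1))
```

A height of 80 inches lies well above the tallest person in the sample (73 inches), so this prediction is an extrapolation and should be read with caution.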

Chapter TEN
Estimation
Estimation is a procedure to find the value of a
parameter based on the value of a statistic.
There are various techniques available for
different situations. We shall, however, limit our
discussion to two types of estimation:
–Point Estimation
–Interval estimation

Point Estimation
Point Estimation occurs when we estimate that
the unknown parameter is equal to the calculated
statistic, e.g. μ = x̄, π = p, or σ = s.
Remember that a statistic is a sample-based
summary measure (e.g. x̄, p, s) and a parameter is
a population-based summary measure (e.g. μ, π, σ).

Interval estimation
Interval estimation occurs when we estimate that
the parameter will be included in an interval.
This interval is called the confidence interval.
The likelihood that the parameter will be included in
the confidence interval is called the confidence level.
For example, a 95% confidence level means there is
a 95% likelihood (chance) that the specified interval
will include the parameter.

Estimation of a single population mean (μ)
Example-1: The mean reading speed of a random
sample of 81 Modern University students is 325
words per minute.
Find the mean reading speed of all Modern
University students (μ) if it is known that the
standard deviation for all Modern University
students is 45 words per minute.

Solution
Point Estimation: x̄ = μ. As the mean reading speed of the sample
is 325 words per minute, the estimated mean reading speed of all
Modern University students is also 325 words per minute.
Interval Estimation for μ
μ = x̄ ± Z*SE(x̄), Z = 1.96, SE(x̄) = σ/√n = 45/√81 = 5, so
1.96*5 = 9.8
325 ± 9.8 = 315.2 to 334.8 words/minute
This means that if 100 samples were selected from the
university students, the intervals from about 95 of them
would include the population mean.
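The interval for the mean reading speed can be sketched in Python. The function name mean_ci and its confidence argument are illustrative assumptions; the z multiplier is taken from the standard normal distribution instead of being hard-coded as 1.96.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(x_bar: float, sigma: float, n: int,
            confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for a population mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)                        # z times the standard error
    return x_bar - margin, x_bar + margin


low, high = mean_ci(x_bar=325, sigma=45, n=81)
print(round(low, 1), round(high, 1))
```

Raising the confidence level (for example to 0.99) widens the interval, while a larger sample narrows it.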

Estimation of population mean differences(μ1-μ2)
Example-2: If a random sample of 50 non-smokers has
a mean life of 76 years with a standard deviation of 8
years, and a random sample of 65 smokers has a mean
life of 68 years with a standard deviation of 9 years,
A) What is the point estimate for the difference of the
population means?
B) Find a 95% C.I. for the difference of mean lifetime
of non-smokers and smokers.

solution
Point Estimation of μ1-μ2
μ1-μ2 = x̄1 - x̄2. As the mean difference of life in the sample is
76-68 = 8 years, the estimated mean difference in the population
is also 8 years.
Interval Estimation of μ1-μ2
μ1-μ2 = (x̄1 - x̄2) ± 1.96*SE(x̄1 - x̄2),
SE(x̄1 - x̄2) = √(s1²/n1 + s2²/n2) = √(8²/50 + 9²/65) = 1.59, so 1.96*1.59 = 3.1
= 8 ± 3.1 = 4.9 to 11.1 years
So the population mean life difference b/w the two groups will
lie in the range from 4.9 to 11.1 years.
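The interval for the difference in mean lifetime can be checked with the sketch below, which assumes the standard error √(s1²/n1 + s2²/n2). With the figures from the example the standard error is about 1.59 and the 95% interval runs from roughly 4.9 to 11.1 years. The function name mean_difference_ci is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(mean1: float, s1: float, n1: int,
                       mean2: float, s2: float, n2: int,
                       confidence: float = 0.95) -> tuple[float, float]:
    """Large-sample confidence interval for the difference of two means."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    difference = mean1 - mean2
    return difference - z * standard_error, difference + z * standard_error


# Non-smokers (n = 50) versus smokers (n = 65)
low, high = mean_difference_ci(76, 8, 50, 68, 9, 65)
print(round(low, 1), round(high, 1))
```

Because the whole interval lies above zero, the data are consistent with non-smokers living longer on average.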

Estimation of a single population proportion (π)
Example: An epidemiologist is worried about the ever
increasing trend of malaria in a certain locality and
wants to estimate the proportion of persons infected in
the peak malaria transmission period.
If he takes a random sample of 150 persons in that
locality during the peak transmission period and finds
that 60 of them are positive for malaria, find
a) The point estimate for π
b) The 95% CI for π

Solution
Point Estimation of π
π = p = 60/150 = 40%. That is, the estimated proportion of malaria
positive people in the population is 40%.
Interval Estimation of π
π = p ± 1.96*SE(p), SE(p) = √(p(1 − p)/n) = √(0.4*0.6/150) = 0.04,
so 1.96*0.04 = 0.078, or 7.8%
40% ± 7.8% = 32.2% to 47.8%
So the proportion of malaria positive individuals in the
population will lie between 32.2% and 47.8%
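The malaria interval can be reproduced with a short sketch that assumes the large-sample standard error √(p(1 − p)/n). The function name proportion_ci is an illustrative assumption.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes: int, n: int,
                  confidence: float = 0.95) -> tuple[float, float, float]:
    """Point estimate and large-sample confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p, p - margin, p + margin


p, low, high = proportion_ci(successes=60, n=150)
print(f"{p:.1%}  ({low:.1%} to {high:.1%})")
```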

Estimation of population proportion differences (π1-π2)
Example: Two groups each consist of 100 patients who
have leukemia.
A new drug is given to the first group but not to the
second (the control group). It is found that in the first
group 75 people have remission for 2 years, but only 60
in the second group.
Find 95% confidence limits for the difference in the
proportion of all patients with leukemia who have
remission for 2 years.

Solution
Point Estimation of π1-π2
π1-π2 = p1-p2 = 75%-60% = 15%. That is, the estimated proportion
difference between the two groups is 15%
Interval Estimation of π1-π2
π1-π2 = (p1-p2) ± 1.96*SE(p1-p2),
SE(p1-p2) = √(p1(1 − p1)/n1 + p2(1 − p2)/n2) = √(0.75*0.25/100 + 0.60*0.40/100)
= 0.065, or 6.5%, so 1.96*6.5% = 12.7%
So 15% ± 12.7% = 2.3% to 27.7%
So the population proportion difference will lie somewhere
between 2.3% and 27.7%
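The leukemia example can be sketched in the same way, assuming the standard error √(p1(1 − p1)/n1 + p2(1 − p2)/n2). The function name proportion_difference_ci is hypothetical. Keeping the standard error unrounded gives a margin of about 12.8 percentage points, so the limits come out near 2.2% and 27.8%, a hair wider than with the standard error rounded to 6.5%.

```python
from math import sqrt
from statistics import NormalDist


def proportion_difference_ci(x1: int, n1: int, x2: int, n2: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Large-sample confidence interval for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    difference = p1 - p2
    return difference - z * standard_error, difference + z * standard_error


# New drug group: 75 of 100 in remission; control group: 60 of 100
low, high = proportion_difference_ci(75, 100, 60, 100)
print(f"{low:.1%} to {high:.1%}")
```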
