Ct lecture 5. descriptive analysis of categorical variables

Workshop on Analysis of Clinical Studies – Can Tho University of Medicine and Pharmacy – April 2012
Descriptive analysis
of categorical variables
Tuan V. Nguyen
Professor and NHMRC Senior Research Fellow
Garvan Institute of Medical Research
University of New South Wales
Sydney, Australia

What we are going to learn
•  Categorical data
•  Probability
•  Statistical description of
–  Prevalence
–  Incidence
–  Rate

Measurement and comparison
To find out whether a community is healthy or
unhealthy:
•  first measure one or more indicators of health
(deaths, new cases of disease, etc)
•  compare the results with another community or
group.

Measures of Disease Occurrence
•  Incidence proportion (risk)
•  Incidence rate (density)
•  Prevalence
All three are loosely called “rates” (but only the
second is a true rate)

Types of populations
We measure disease occurrence in two types of
populations:
•  Closed populations → “cohorts”
•  Open populations

Closed population = cohort
Cohort word origin: (Latin cohors) basic tactical unit of a Roman legion
Epi cohort ≡ a group of individuals followed over time

Open population
•  Inflow (immigration, births)
•  Outflow (emigration, death)
•  An open population in “steady state” (constant size) is said to be stationary

Numerators and denominators
•  “Rates” are composed of numerators and denominators
•  Numerator → case count
–  Incidence count → onsets
–  Prevalence count → old + new cases
•  Denominators → reflection of population size

Denominators
Denominators: reflection of population size

Incidence proportion
•  Synonyms: risk, cumulative incidence, attack rate
•  Interpretation: average risk
$\mathrm{IP} = \dfrac{\text{no. of onsets over time}}{\text{no. at risk at beginning of study}}$
Can be calculated only in cohorts

Example of IP
•  Objective: estimate risk of uterine cancer
•  Recruit cohort of 1000 women
•  100 had hysterectomies, leaving 900 at risk
•  Follow at risk individuals for 10 years
•  Observe 10 onsets of uterine cancer
$\mathrm{IP} = \dfrac{\text{no. of onsets}}{\text{no. at risk}} = \dfrac{10\ \text{women}}{900\ \text{women}} = 0.0111$
10-year average risk is .011 or 1.1%.
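The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function name `incidence_proportion` and its argument names are illustrative and do not come from the lecture.

```python
def incidence_proportion(onsets, at_risk_at_start):
    """Incidence proportion (average risk) in a closed cohort."""
    if at_risk_at_start <= 0:
        raise ValueError("at_risk_at_start must be positive")
    return onsets / at_risk_at_start


# Uterine cancer example: 1000 women recruited, 100 already had a
# hysterectomy and are therefore not at risk.
recruited = 1000
not_at_risk = 100
ip = incidence_proportion(onsets=10, at_risk_at_start=recruited - not_at_risk)

print(f"10-year average risk = {ip:.4f} ({ip:.1%})")  # 0.0111 (1.1%)
```

Excluding the 100 women who are not at risk matters: dividing by all 1000 recruits would understate the risk.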

Incidence rate
•  Synonyms: incidence density, person-time rate
•  Interpretation A: “Speed” at which events occur
•  Interpretation B: When disease is rare: rate per person-year ≈ one-year risk
•  Calculated differently in closed and open populations
$\mathrm{IR} = \dfrac{\text{no. of onsets}}{\text{sum of person-time at risk}}$

Example of IR
•  Objective: estimate rate of uterine cancer
•  Recruit cohort of 1000 women
•  100 had hysterectomies, leaving 900 at risk
•  Follow at risk individuals for 10 years
•  Observe 10 onsets of uterine cancer
$\mathrm{IR} = \dfrac{\text{no. of onsets}}{\text{person-time}} = \dfrac{10\ \text{women}}{900\ \text{women} \times 10\ \text{years}} = \dfrac{10}{9000\ \text{years}} = \dfrac{0.00111}{\text{year}}$
Rate is .00111 per year or 11.1 per 10,000 person-years
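A minimal Python sketch of the same calculation follows. It assumes, as the slide does, that every woman at risk contributes the full 10 years of follow-up; the function name `incidence_rate` is illustrative.

```python
def incidence_rate(onsets, person_time):
    """Incidence rate: events per unit of person-time at risk."""
    if person_time <= 0:
        raise ValueError("person_time must be positive")
    return onsets / person_time


# 900 women at risk, each assumed to be followed for the full 10 years.
person_years = 900 * 10
ir = incidence_rate(onsets=10, person_time=person_years)

print(f"{ir:.5f} per person-year")                   # 0.00111
print(f"{ir * 10_000:.1f} per 10,000 person-years")  # 11.1
```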

Individual follow-up over time
$\mathrm{IR} = \dfrac{\text{onsets}}{\sum \text{person-time}} = \dfrac{2\ \text{onsets}}{25\ \text{years} + 50\ \text{years}} = \dfrac{2\ \text{onsets}}{75\ \text{years}}$
$= 0.0267$ per person-year $= 2.67$ per 100 person-years

Mortality and life expectancy
$\text{Life expectancy} = \dfrac{1}{\text{Mortality rate}}$
In stationary populations, and in cohorts with complete follow-up, the mortality rate is the reciprocal of life expectancy (and vice versa).
Example: for a mortality rate of .0267 per year
$\text{Life expectancy} = \dfrac{1}{.0267 / \text{year}} = 37.5\ \text{years}$

This cohort has life expectancy $= \dfrac{(25 + 50)\ \text{years}}{2} = 37.5\ \text{years}$
This cohort has a mortality rate of $\dfrac{2\ \text{deaths}}{(25 + 50)\ \text{years}} = 0.0267\ \text{year}^{-1}$
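The reciprocal relationship can be verified numerically for this two-person cohort. The sketch below assumes complete follow-up (both people are followed until death); the variable names are illustrative.

```python
# Two people followed until death, after 25 and 50 years respectively.
follow_up_years = [25, 50]

deaths = len(follow_up_years)       # complete follow-up: everyone dies
person_time = sum(follow_up_years)  # 75 person-years

mortality_rate = deaths / person_time   # deaths per person-year
life_expectancy = person_time / deaths  # mean years lived

print(round(mortality_rate, 4))  # 0.0267
print(life_expectancy)           # 37.5

# With complete follow-up, each quantity is the reciprocal of the other.
reciprocal_of_rate = 1 / mortality_rate
```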

Incidence rate in open population
$\mathrm{IR} = \dfrac{\text{onsets}}{\text{Avg population size} \times \text{duration of observation}}$
Example: 2,391,630 deaths in 1999 (one year)
Population size = 272,705,815
$\mathrm{IR} = \dfrac{2{,}391{,}630\ \text{deaths}}{272{,}705{,}815\ \text{persons} \times 1\ \text{year}} = 0.008770\ \text{deaths}\cdot\text{year}^{-1}$
$= 877$ per 100,000 person-years
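As a quick check of the open-population example, the following Python sketch reproduces the figures on the slide. Variable names are illustrative.

```python
# Deaths in 1999 and average population size, as given on the slide.
deaths = 2_391_630
avg_population = 272_705_815
years_observed = 1

# Person-time is approximated by average population size x duration.
person_years = avg_population * years_observed
ir = deaths / person_years
per_100k = ir * 100_000

print(f"{ir:.6f} deaths per person-year")           # 0.008770
print(f"{per_100k:.0f} per 100,000 person-years")   # 877
```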

Prevalence
•  Point prevalence ≡ prevalence at a particular point in time
•  Period prevalence ≡ prevalence over a period of time
•  Interpretation A: proportion with condition
•  Interpretation B: probability a person selected at random will have the condition
$\text{Prevalence} = \dfrac{\text{no. of old and new cases}}{\text{no. of people}}$

Example of prevalence
•  Recruit 1000 women
•  Ascertain: 100 with hysterectomies
$\text{Prevalence} = \dfrac{\text{no. of cases}}{\text{no. of people}} = \dfrac{100\ \text{people}}{1000\ \text{people}} = 0.10$
Prevalence in sample is 10%

Dynamic prevalence
Ways to increase prevalence
•  Increase incidence → increase inflow
•  Increase average duration of disease → decreased outflow

Prevalence and incidence
When disease rare & population stationary
$\text{prevalence} \approx (\text{incidence rate}) \times (\text{average duration})$
Example:
•  Incidence rate = 0.01 / year
•  Average duration of the illness = 2 years.
•  Prevalence ≈ 0.01 / year × 2 years = 0.02
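The two prevalence calculations above (the direct sample proportion and the approximation from incidence and duration) can be sketched in Python. The function names are illustrative, and the approximation only holds under the stated conditions: a rare disease in a stationary population.

```python
def sample_prevalence(cases, people):
    """Prevalence as the proportion of people who have the condition."""
    return cases / people


def approx_prevalence(incidence_rate, mean_duration):
    """Prevalence ~ incidence rate x average duration.

    Valid only when the disease is rare and the population is stationary.
    """
    return incidence_rate * mean_duration


# 100 hysterectomies among 1000 women.
prev_sample = sample_prevalence(100, 1000)

# Incidence rate 0.01 per year, average duration 2 years.
prev_approx = approx_prevalence(incidence_rate=0.01, mean_duration=2)

print(f"{prev_sample:.2f}")  # 0.10
print(f"{prev_approx:.2f}")  # 0.02
```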

Estimation of 95% confidence interval

Proportions
•  Proportion of event in the sample, denoted “p hat”:
$\hat{p} = \dfrac{x}{n}$
where x = no. of events and n = sample size

Proportion, cont
Two of 10 individuals in the sample have a risk factor for disease X
The prevalence of this risk factor in the sample is:
$\hat{p} = \dfrac{x}{n} = \dfrac{2}{10} = 0.2$ (or 20%)

Inference about a Proportion
How good is the sample proportion p̂ at estimating the population proportion p?
Consider what would happen if we took repeated samples, each of size n, from the population. How would the sample proportions be distributed?

Normal Approximation for Proportions
$\hat{p} \sim N\!\left(p,\ \sqrt{\dfrac{pq}{n}}\right)$ where $q = 1 - p$
(mean p, standard deviation √(pq/n))

Normal approximation
H0: p = p0 vs. Ha: p ≠ p0 where p0 represents the proportion specified by the null hypothesis
Test statistic
$z_{\text{stat}} = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}}$

Example
A sample of n = 57 finds 17 smokers (p-hat = 17 / 57 = 0.2982).
The national average for smoking prevalence is 0.25.
Is the proportion in the sample significantly different from the national average?
H0: p = 0.25 vs. Ha: p ≠ 0.25
$z_{\text{stat}} = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \dfrac{.2982 - .25}{\sqrt{\dfrac{.25 \cdot .75}{57}}} = 0.84$
Because |z| = 0.84 is below 1.96, the sample proportion is not significantly different from the national average at α = 0.05.
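The test in this example can be reproduced with the Python standard library. This is a sketch, not part of the lecture: the function name `one_sample_proportion_z` is illustrative, and the two-sided P-value is an addition that the slide does not report.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(x, n, p0):
    """z test of H0: p = p0 using the Normal approximation.

    The standard error uses p0 (the null value), not the sample proportion.
    Returns the z statistic and the two-sided P-value.
    """
    p_hat = x / n
    se0 = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# 17 smokers among 57 people, national prevalence 0.25.
z, p_value = one_sample_proportion_z(x=17, n=57, p0=0.25)
print(f"z = {z:.2f}")  # z = 0.84
print(f"two-sided P = {p_value:.2f}")
```

Since the P-value is well above 0.05, the code agrees with the slide's conclusion.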

Confidence Interval for Proportion
A (1 − α)100% confidence interval for p is:
$\tilde{p} \pm z_{1-\alpha/2} \cdot \sqrt{\dfrac{\tilde{p}\,\tilde{q}}{\tilde{n}}}$
where $\tilde{x} = x + 2$, $\tilde{n} = n + 4$, $\tilde{p} = \dfrac{\tilde{x}}{\tilde{n}}$, and $\tilde{q} = 1 - \tilde{p}$
This method is called the “plus four method” because it adds four imaginary points during calculations. It is much more accurate than the traditional Normal method.

Confidence Interval, example
Based on n = 57 and x = 17, the 95% CI for the prevalence of smoking in the population is:
$\tilde{x} = x + 2 = 17 + 2 = 19;\quad \tilde{n} = n + 4 = 57 + 4 = 61$
$\tilde{p} = \dfrac{\tilde{x}}{\tilde{n}} = \dfrac{19}{61} = .3115;\quad \tilde{q} = 1 - \tilde{p} = 1 - .3115 = .6885$
$SE_{\tilde{p}} = \sqrt{\dfrac{\tilde{p}\,\tilde{q}}{\tilde{n}}} = \sqrt{\dfrac{(.3115)(.6885)}{61}} = .0593$
$z = 1.96$ for 95% confidence
95% CI for $p = \tilde{p} \pm z \cdot SE_{\tilde{p}} = .3115 \pm (1.96)(.0593) = .3115 \pm .1162 = (.1953,\ .4277)$
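The plus-four interval is easy to compute directly. The following Python sketch follows the formula above; the function name `plus_four_ci` is illustrative, and the z quantile is taken from the standard library rather than rounded to 1.96, so results can differ from the slide in the last decimal place.

```python
from math import sqrt
from statistics import NormalDist


def plus_four_ci(x, n, conf_level=0.95):
    """Plus-four confidence interval for a proportion.

    Adds 2 imaginary successes and 2 imaginary failures before applying
    the usual Normal-approximation formula.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    x_tilde = x + 2
    n_tilde = n + 4
    p_tilde = x_tilde / n_tilde
    se = sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - z * se, p_tilde + z * se


# 17 smokers among 57 people.
low, high = plus_four_ci(x=17, n=57)
print(f"95% CI: ({low:.4f}, {high:.4f})")  # (0.1953, 0.4277)
```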

Sample Size and Power
Three approaches:
•  n needed to estimate p with margin of error m (for
confidence interval)
•  n needed to test H0 at given α level and power
•  The power of testing H0 under stated conditions

n needed to achieve margin of error m
$n = \dfrac{z_{1-\alpha/2}^{2}\, p^{*} q^{*}}{m^{2}}$
•  where p* represents an educated guess for the population proportion p and q* = 1 − p* (when no educated guess for p is available, let p* = .5)
•  Round up to next integer to ensure stated precision

n needed to achieve m, example
Suppose our educated guess for the proportion is p* = 0.30
For margin of error of .03, use:
$n = \dfrac{(1.96)^2 (.30)(.70)}{.03^2} = 896.4 \Rightarrow 897$
For margin of error of .05, use:
$n = \dfrac{(1.96)^2 (.30)(.70)}{.05^2} = 322.7 \Rightarrow 323$
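A small Python helper makes it easy to try other margins of error or guesses for p*. This is a sketch under the same Normal-approximation formula; the function name `n_for_margin` and its defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_for_margin(m, p_guess=0.5, conf_level=0.95):
    """Sample size needed to estimate p with margin of error m.

    p_guess is the educated guess p*; 0.5 is the conservative default.
    The result is rounded up to guarantee the stated precision.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil(z**2 * p_guess * (1 - p_guess) / m**2)


n_03 = n_for_margin(m=0.03, p_guess=0.30)
n_05 = n_for_margin(m=0.05, p_guess=0.30)
print(n_03)  # 897
print(n_05)  # 323
```

Note how quickly n grows as m shrinks: because m is squared in the denominator, halving the margin of error roughly quadruples the required sample size.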

n to test H0: p = p0
$n = \left( \dfrac{z_{1-\alpha/2}\sqrt{p_0 q_0} + z_{1-\beta}\sqrt{p_1 q_1}}{p_1 - p_0} \right)^{2}$
where
•  α ≡ alpha level of the test (two-sided)
•  1 – β ≡ power of the test
•  p0 ≡ proportion under the null hypothesis
•  p1 ≡ proportion under the alternative hypothesis

n to test H0: p = p0, example
How large a sample is needed to test H0: p = 0.21 against Ha: p = 0.31 at α = 0.05 (two-sided) with 90% power?
$n = \left( \dfrac{1.96\sqrt{(0.21)(0.79)} + 1.28\sqrt{(0.31)(0.69)}}{0.31 - 0.21} \right)^{2} = 193.3 \Rightarrow 194$
⇒ means round up to ensure stated power
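The sample-size formula for the test can be sketched in Python as follows. The function name `n_for_test` is illustrative; exact Normal quantiles are used in place of the rounded 1.96 and 1.28, so the unrounded value differs slightly from 193.3 but rounds up to the same n.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_for_test(p0, p1, alpha=0.05, power=0.90):
    """Sample size to test H0: p = p0 against p = p1 (two-sided alpha)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    # Round up so the achieved power is at least the stated power.
    return ceil((numerator / (p1 - p0)) ** 2)


n_required = n_for_test(p0=0.21, p1=0.31, alpha=0.05, power=0.90)
print(n_required)  # 194
```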

Conditions for Inference
•  Sampling independence
•  Valid information
•  The plus-four confidence interval requires at least 10 observations
•  The z test of H0: p = p0 requires np0q0 ≥ 5
I'd rather have a sound
judgment than a talent.
Mark Twain

Bayesian analysis of proportion

Review
•  When X ∼ Binomial(n, p) we know that
•  p̂ = X/n is the MLE for p
•  Var(p̂) = p(1 − p)/n
•  Wald interval for p
$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

Problems of Wald CI
•  The Wald interval performs terribly
•  Coverage probability varies wildly, sometimes being
quite low for certain values of n even when p is not
near the boundaries
–  Example, when p = .5 and n = 40 the actual coverage of a
95% interval is only 92%
•  When p is small or large, coverage can be quite poor
even for extremely large values of n
–  Example, when p = .005 and n = 1,876 the actual coverage rate of a 95% interval is only 90%
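The coverage figures quoted above can be checked by exact enumeration: sum the binomial probabilities of every outcome x whose Wald interval contains the true p. The Python sketch below does this; the function names are illustrative, and the binomial probabilities are computed on the log scale so that large n does not overflow.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """Binomial probability, computed on the log scale to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log(1 - p))
    return exp(log_pmf)


def wald_coverage(n, p, conf_level=0.95):
    """Exact coverage probability of the Wald interval for given n and p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    coverage = 0.0
    for x in range(n + 1):
        p_hat = x / n
        half_width = z * sqrt(p_hat * (1 - p_hat) / n)
        # Add this outcome's probability only if its interval covers p.
        if p_hat - half_width <= p <= p_hat + half_width:
            coverage += binom_pmf(x, n, p)
    return coverage


cov_40 = wald_coverage(n=40, p=0.5)
print(f"{cov_40:.3f}")  # 0.919, i.e. about 92% rather than the nominal 95%
print(f"{wald_coverage(n=1876, p=0.005):.3f}")
```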

Simple adjustment
•  A simple fix for the problem is to add 2 successes and 2 failures
•  That is, let p̃ = (X + 2) / (n + 4) and ñ = n + 4
•  This leads to the Agresti-Coull interval
$\tilde{p} \pm Z_{1-\alpha/2}\sqrt{\dfrac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}$

Bayesian analysis
•  Bayesian statistics posits a prior on the parameter of
interest
•  All inferences are then performed on the distribution
of the parameter given the data, called the posterior
•  In general
Posterior ∝ Likelihood × Prior
•  The likelihood is the factor by which our prior beliefs
are updated to produce conclusions in the light of the
data

Beta priors
•  The beta distribution is the default prior for parameters between 0 and 1
•  The beta density depends on two parameters α and β
$\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$ for $0 \le p \le 1$
•  The mean of the beta density is α/(α + β)
•  The variance of the beta density is
$\dfrac{\alpha\beta}{(\alpha+\beta)^{2}(\alpha+\beta+1)}$
•  The uniform density is the special case where α = β = 1
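The density, mean, and variance formulas above translate directly into Python. This is a minimal sketch using only the standard library; the function names are illustrative. When α or β is below 1 the density is unbounded at the corresponding endpoint, so the sketch should only be evaluated at 0 < p < 1 in that case.

```python
from math import gamma


def beta_pdf(p, a, b):
    """Beta(a, b) density at p (use 0 < p < 1 when a or b is below 1)."""
    norm_const = gamma(a + b) / (gamma(a) * gamma(b))
    return norm_const * p ** (a - 1) * (1 - p) ** (b - 1)


def beta_mean(a, b):
    return a / (a + b)


def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))


# Uniform special case: a = b = 1 gives a flat density equal to 1.
print(beta_pdf(0.3, 1, 1))  # 1.0
print(beta_mean(1, 1))      # 0.5
print(beta_var(1, 1))       # 1/12, about 0.0833

# A = B = 0.5 piles mass near 0 and 1, so the density is lowest at p = 0.5.
print(beta_pdf(0.5, 0.5, 0.5) < beta_pdf(0.1, 0.5, 0.5))  # True
```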

Some beta distributions
[Figure: nine panels of beta density plotted against p, for alpha = 0.5, 1, 2 crossed with beta = 0.5, 1, 2]

Posterior
•  Suppose that we chose values of α and β so that the beta prior is indicative of our degree of belief regarding p in the absence of data
•  Then using the rule that
Posterior ∝ Likelihood × Prior
and throwing out anything that doesn’t depend on p, we have that
$\text{Posterior} \propto p^{x}(1-p)^{n-x} \times p^{\alpha-1}(1-p)^{\beta-1} = p^{x+\alpha-1}(1-p)^{n-x+\beta-1}$

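Posterior mean
•  This density is just another beta density with parameters α̃ = x + α and β̃ = n − x + β
•  Posterior mean
$E[p \mid X] = \dfrac{\tilde{\alpha}}{\tilde{\alpha}+\tilde{\beta}} = \dfrac{x+\alpha}{x+\alpha+n-x+\beta} = \dfrac{x+\alpha}{n+\alpha+\beta}$
$= \dfrac{x}{n} \times \dfrac{n}{n+\alpha+\beta} + \dfrac{\alpha}{\alpha+\beta} \times \dfrac{\alpha+\beta}{n+\alpha+\beta}$
$= \text{MLE} \times \pi + \text{Prior Mean} \times (1-\pi)$, where $\pi = n/(n+\alpha+\beta)$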
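The weighted-average form of the posterior mean can be confirmed numerically. The sketch below uses x = 13 successes in n = 20 trials (the numbers from the R example at the end of this lecture) with a Beta(0.5, 0.5) prior; the function names are illustrative.

```python
def beta_binomial_update(x, n, a, b):
    """Posterior Beta parameters after x successes in n trials."""
    return x + a, n - x + b


def posterior_mean_as_mixture(x, n, a, b):
    """Posterior mean written as MLE * pi + prior mean * (1 - pi)."""
    weight = n / (n + a + b)  # pi: weight given to the data
    mle = x / n
    prior_mean = a / (a + b)
    return mle * weight + prior_mean * (1 - weight)


x, n = 13, 20
a, b = 0.5, 0.5

a_post, b_post = beta_binomial_update(x, n, a, b)
direct = a_post / (a_post + b_post)
mixture = posterior_mean_as_mixture(x, n, a, b)

print(a_post, b_post)     # 13.5 7.5
print(round(direct, 4))   # 0.6429
print(round(mixture, 4))  # 0.6429
```

As n grows the weight π approaches 1, so the posterior mean approaches the MLE and the prior matters less.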

Posterior variance
•  The posterior variance is
$\mathrm{Var}(p \mid x) = \dfrac{\tilde{\alpha}\tilde{\beta}}{(\tilde{\alpha}+\tilde{\beta})^{2}(\tilde{\alpha}+\tilde{\beta}+1)} = \dfrac{(x+\alpha)(n-x+\beta)}{(n+\alpha+\beta)^{2}(n+\alpha+\beta+1)}$
•  Let p̃ = (x + α)/(n + α + β) and ñ = n + α + β; then we have
$\mathrm{Var}(p \mid x) = \dfrac{\tilde{p}(1-\tilde{p})}{\tilde{n}+1}$
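The two expressions for the posterior variance are algebraically identical, which a short Python check makes concrete. The example values (x = 13, n = 20, α = β = 0.5) and the function names are illustrative.

```python
def posterior_variance_long(x, n, a, b):
    """Variance of the Beta(x + a, n - x + b) posterior, long form."""
    a_post = x + a
    b_post = n - x + b
    total = a_post + b_post
    return a_post * b_post / (total ** 2 * (total + 1))


def posterior_variance_short(x, n, a, b):
    """Same variance written as p~(1 - p~) / (n~ + 1)."""
    n_tilde = n + a + b
    p_tilde = (x + a) / n_tilde
    return p_tilde * (1 - p_tilde) / (n_tilde + 1)


v_long = posterior_variance_long(13, 20, 0.5, 0.5)
v_short = posterior_variance_short(13, 20, 0.5, 0.5)
print(round(v_long, 5), round(v_short, 5))  # 0.01044 0.01044
```

The short form resembles the familiar p(1 − p)/n, with the prior acting like α + β extra observations.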

Jeffreys prior
•  The Jeffreys prior, which has some theoretical benefits, puts α = β = 0.5
[Figure: prior, likelihood, and posterior plotted against p, for alpha = 0.5, beta = 0.5]

[Figure: prior, likelihood, and posterior plotted against p, for alpha = 1, beta = 1 and for alpha = 2, beta = 2]

R code
•  Install the binom package (install.packages("binom")), then the commands
library(binom)
binom.bayes(13, 20, type = "highest")
give the HPD (highest posterior density) interval for x = 13 successes in n = 20 trials. The default credible level is 95% and the default prior is the Jeffreys prior.
