Probability.pdf.pdf and Statistics for R

PROBABILITY
GR Miyambu
School of Science & Technology
Department of Statistical Sciences

Types of Probability
• Frequency: What is the probability of a randomly chosen person
dying in the next year?
• Model-based: What is the probability of a child being affected by
cystic fibrosis given one of the parents is a carrier of the disease?
• Subjective: What is the probability that a particular patient has
heart disease given they have chest pain?

Properties of Probability
The three types of probability all have the following
properties.
1. All probabilities lie between 0 and 1.
2. When the outcome can never happen the probability is 0.
3. When the outcome will definitely happen the probability
is 1.

Diagnostic Test
• A diagnostic test is any approach used to gather clinical
information for the purpose of making a clinical decision
(i.e., diagnosis).
• Some examples of diagnostic tests include X-rays, biopsies,
pregnancy tests, medical histories, and results from
physical examinations.
• From a statistical point of view there are two points to keep
in mind:
1. the clinical decision-making process is based on
probability;
2. the goal of a diagnostic test is to move the estimated
probability of disease toward either end of the
probability scale (i.e., 0 rules out disease, 1 confirms the
disease).

Uses of Diagnostic Test
• In making a diagnosis, a clinician first establishes a possible set of
diagnostic alternatives and then attempts to reduce these by
progressively ruling out specific diseases or conditions.
• Alternatively, the clinician may have a strong hunch that the patient
has one particular disease and he then sets about confirming it.
• Given a particular diagnosis, a good diagnostic test should indicate
either that the disease is very unlikely or that it is very probable.
• In a practical sense it is important to realize that a diagnostic test is
useful only if the result influences patient management since, if the
management is the same for two different conditions, there is little
point in trying strenuously to distinguish between them.

Analysis of Diagnostic Test
                 Disease                No Disease
Test Positive    a (true positives)     b (false positives)
Test Negative    c (false negatives)    d (true negatives)

Gold Standard
• The "Gold Standard" is the method used to obtain a definitive diagnosis for a
particular disease; it may be biopsy, surgery, autopsy or an acknowledged standard.
• Gold Standards are used to define true disease status against which the results of a
new diagnostic test are compared.
• Here are a number of definitive diagnostic tests that will confirm whether or not
you have the disease.
• Some of these are quite invasive and this is a major reason why new diagnostic
procedures are being developed.
Target Disorder          Gold Standard
Breast cancer            Excisional biopsy
Prostate cancer          Transrectal biopsy
Coronary stenosis        Coronary angiography
Myocardial infarction    Catheterization
Strep throat             Throat culture

Sensitivity and specificity
• Many diagnostic test results are given in the form of a
continuous variable (that is one that can take any value
within a given range), such as diastolic blood pressure or
haemoglobin level.
• However, for ease of discussion we will first assume that
these have been divided into positive or negative results.
• For example, a positive diagnostic result of ‘hypertension’ is
a diastolic blood pressure greater than 90 mmHg;
• whereas for ‘anaemia’, a haemoglobin level less than 10 g/dl is
required.
• For every diagnostic procedure (which may involve a
laboratory test of a sample taken) there is a set of
fundamental questions that should be asked.

Sensitivity and specificity
• First, if the disease is present, what is the probability that the test
result will be positive?
• This leads to the notion of the sensitivity of the test.
• Second, if the disease is absent, what is the probability that the test
result will be negative?
• This question refers to the specificity of the test.
• These questions can be answered only if it is known what the ‘true’
diagnosis is.
• In the case of organic disease this can be determined by biopsy or, for
example, an expensive and risky procedure such as angiography for
heart disease.
• In other situations it may be by ‘expert’ opinion. Such tests provide
the so-called ‘gold standard’.

Example
Diagnosis of heart
disease
• Consider the results of an assay of N-terminal pro-brain
natriuretic peptide (NT-proBNP) for diagnosis of heart
failure in a general population survey in those over 45 years
of age and in patients with existing diagnosis of heart
failure obtained by Hobbs et al (2002) and summarised in
the table.
• Heart failure was identified when NT-proBNP > 36 pmol/l.

                        Confirmed Diagnosis of Heart Failure
NT-proBNP (pmol/l)      Present (D+)    Absent (D-)    Total
Positive (T+), > 36     35 (a)          7 (b)          42
Negative (T-)           68 (c)          300 (d)        368
Total                   103             307            410

Sensitivity and Specificity
• The prevalence of heart failure in these subjects is (a + c)/(a + b + c + d)
• P(D+)=??
• The sensitivity of a test is the proportion of those with the disease who also have a
positive test result.
• The sensitivity is a/(a + c)=??
• Now sensitivity is the probability of a positive test result (event T+) given that the disease is
present (event D+) and can be written as p(T+|D+)=??, where the ‘|’ is read as ‘given’.
• The specificity of the test is the proportion of those without disease who give a
negative test result.
• Thus the specificity is d/(b + d)=??
• Now specificity is the probability of a negative test result (event T-) given that the disease is absent
(event D-) and can be written as p(T-|D-)=??
• Since sensitivity is conditional on the disease being present, and specificity on the
disease being absent, in theory, they are unaffected by disease prevalence.
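The quantities left as ‘??’ on this slide can be worked out directly from the NT-proBNP table. The following is a minimal sketch in Python, using the cell counts a = 35, b = 7, c = 68 and d = 300 from that table; the variable names are illustrative and not part of the original slides.

```python
# Cell counts from the NT-proBNP table: a = TP, b = FP, c = FN, d = TN.
a, b, c, d = 35, 7, 68, 300

prevalence = (a + c) / (a + b + c + d)  # P(D+)
sensitivity = a / (a + c)               # p(T+|D+)
specificity = d / (b + d)               # p(T-|D-)
false_negative_rate = 1 - sensitivity   # equals c / (a + c)
false_positive_rate = 1 - specificity   # equals b / (b + d)

results = [
    ("Prevalence P(D+)", prevalence),
    ("Sensitivity p(T+|D+)", sensitivity),
    ("Specificity p(T-|D-)", specificity),
    ("False negative rate", false_negative_rate),
    ("False positive rate", false_positive_rate),
]
for name, value in results:
    print(f"{name:<22} {value:.3f}")
```

With these counts the test has low sensitivity (about one third of those with heart failure test positive) but high specificity.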

Sensitivity and Specificity
• Sensitivity and specificity are useful statistics because they will
yield consistent results for the diagnostic test in a variety of
patient groups with different disease prevalences.
• This is an important point; sensitivity and specificity are
characteristics of the test, not the population to which the test
is applied.
• Although indeed they are independent of disease prevalence, in
practice if the disease is very rare, the accuracy with which one
can estimate the sensitivity will be limited.
• Two other terms in common use are: the false negative rate (or
probability of a false negative) which is given by c/(a + c) =
1 - Sensitivity, and the false positive rate (or probability of
a false positive) or b/(b + d) = 1 - Specificity.
• Since sensitivity = 1 - Probability(false negative) and specificity =
1 - Probability(false positive), a possibly useful mnemonic to
recall this is that ‘sensitivity’ and ‘negative’ have ‘n’s in them
and ‘specificity’ and ‘positive’ have ‘p’s in them.

Summary of Definitions of Sensitivity and Specificity
                 True diagnosis
Test result      Disease present                    Disease absent
Positive         Sensitivity                        Probability of a false positive
Negative         Probability of a false negative    Specificity

Rates Assuming a Predicted Condition
• The predictive value refers to the likelihood for determining an outbreak or
non-outbreak of an infectious disease based on early warning results.
• Predictive values can be classified into the positive predictive value (PPV) and
the negative predictive value (NPV).
• Positive predictive value is the proportion of individuals with positive test
results that are correctly diagnosed and actually have the disease.
$$\mathrm{PPV} = p(D+ \mid T+) = \frac{a}{a + b}$$
• Negative predictive value is the proportion of individuals with negative test
results that are correctly diagnosed and do not have the disease.
$$\mathrm{NPV} = p(D- \mid T-) = \frac{d}{c + d}$$
• The false omission rate (FOR) is the proportion of the individuals with a negative test
result for which the true condition is positive.
$$\mathrm{FOR} = \frac{c}{c + d}$$
• The false discovery rate (FDR) is the proportion of the individuals with a positive
test result for which the true condition is negative.
$$\mathrm{FDR} = \frac{b}{a + b}$$
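As an illustration, the four rates on this slide can be evaluated for the NT-proBNP example. This is a sketch only: it reuses the cell counts a = 35, b = 7, c = 68, d = 300 from the earlier table, and the variable names are assumptions made for the example.

```python
# Cell counts from the NT-proBNP table: a = TP, b = FP, c = FN, d = TN.
a, b, c, d = 35, 7, 68, 300

ppv = a / (a + b)                  # p(D+|T+): positive predictive value
npv = d / (c + d)                  # p(D-|T-): negative predictive value
false_omission_rate = c / (c + d)  # complement of NPV
false_discovery_rate = b / (a + b) # complement of PPV

print(f"PPV = {ppv:.3f}")
print(f"NPV = {npv:.3f}")
print(f"FOR = {false_omission_rate:.3f}")
print(f"FDR = {false_discovery_rate:.3f}")
```

Note that PPV + FDR = 1 and NPV + FOR = 1, since each pair shares a denominator (the row total of the table).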

Whole Table Rates
• Prevalence is the proportion of a population who have a
specific characteristic in a given time period.
• The prevalence may be estimated from the table if all the
individuals are randomly sampled from the population.
$$\mathrm{Prevalence} = \frac{a + c}{a + b + c + d}$$
• The Accuracy or Proportion Correctly Classified (PCC) reflects
the total proportion of individuals that are correctly
classified.
$$\mathrm{Accuracy\ (PCC)} = \frac{a + d}{a + b + c + d}$$
• The proportion incorrectly classified (PIC) reflects the total
proportion of individuals that are incorrectly classified.
$$\mathrm{PIC} = \frac{b + c}{a + b + c + d}$$
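A short sketch of the whole-table rates for the NT-proBNP example follows. It assumes each rate is taken over the table total a + b + c + d, as the definitions on this slide imply, and again uses the counts a = 35, b = 7, c = 68, d = 300; the names are illustrative.

```python
# Cell counts from the NT-proBNP table: a = TP, b = FP, c = FN, d = TN.
a, b, c, d = 35, 7, 68, 300
total = a + b + c + d

prevalence = (a + c) / total              # proportion with the disease
accuracy = (a + d) / total                # proportion correctly classified
proportion_incorrect = (b + c) / total    # proportion incorrectly classified

print(f"Prevalence = {prevalence:.3f}")
print(f"Accuracy   = {accuracy:.3f}")
print(f"PIC        = {proportion_incorrect:.3f}")
```

Accuracy and the proportion incorrectly classified always sum to 1.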

Likelihood Ratio
• The clear simplicity of diagnostic test data, particularly when presented as a 2 x 2
table, is confounded by many ways of reporting the results.
• The likelihood ratio (LR) is a simple measure combining sensitivity and specificity.
• We have the positive likelihood ratio (LR+) defined as
$$\mathrm{LR}^{+} = \frac{\mathrm{Sensitivity}}{1 - \mathrm{Specificity}} = \frac{\mathrm{TPR}}{\mathrm{FPR}}$$
• This is the ratio of the probability of a positive test in patients with disease to that in those without
disease. A good test has LR+ much greater than 1.
• And the negative likelihood ratio (LR-) defined as
$$\mathrm{LR}^{-} = \frac{1 - \mathrm{Sensitivity}}{\mathrm{Specificity}} = \frac{\mathrm{FNR}}{\mathrm{TNR}}$$
• This is the ratio of the probability of a negative test in patients with disease to that in those without
disease. A good test has LR- considerably less than 1.
• The likelihood odds ratio (LOR) is the ratio of the positive likelihood ratio to the negative
likelihood ratio.
• In some calculation methods, ½ is added to all counts before the calculation of LOR, to avoid
dividing by 0.
$$\mathrm{LOR} = \frac{\mathrm{LR}^{+}}{\mathrm{LR}^{-}}$$
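The likelihood ratios can be sketched in Python as below. The function name `likelihood_ratios` and its `add_half` option are hypothetical, introduced only for this illustration; `add_half=True` applies the ½ adjustment mentioned above to every cell. The counts are those of the NT-proBNP table.

```python
def likelihood_ratios(a, b, c, d, add_half=False):
    """Return (LR+, LR-, LOR) for a 2 x 2 diagnostic table.

    a = TP, b = FP, c = FN, d = TN. With add_half=True, 0.5 is added
    to every count first, which avoids division by zero.
    """
    if add_half:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg, lr_pos / lr_neg


lr_pos, lr_neg, lor = likelihood_ratios(35, 7, 68, 300)
print(f"LR+ = {lr_pos:.2f}")
print(f"LR- = {lr_neg:.3f}")
print(f"LOR = {lor:.2f}")
```

Algebraically the LOR reduces to ad/(bc), which gives a quick check on the result.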

Distributions
Types of Distributions
• Binomial
• Poisson
• Normal

Properties of Normal Distribution

How do we use the Normal distribution?
• The Normal probability distribution can be used to calculate
the probability of different values occurring.
• We could be interested in: what is the probability of being
within 1 standard deviation of the mean (or outside it)?
• We can use a Normal distribution table which tells us the
probability of being outside this value.
• The Normal distribution also has other uses in statistics and
is often used as an approximation to the Binomial and
Poisson distributions.
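Instead of a printed table, the same probabilities can be computed from the standard Normal cumulative distribution function. The sketch below uses only the Python standard library; the helper name `normal_cdf` is an assumption made for this example.

```python
import math


def normal_cdf(z):
    """Cumulative probability P(Z <= z) for a standard Normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Probability of lying within 1 standard deviation of the mean, and outside it.
within_1sd = normal_cdf(1) - normal_cdf(-1)
outside_1sd = 1 - within_1sd

# The same calculation for 1.96 standard deviations.
within_196sd = normal_cdf(1.96) - normal_cdf(-1.96)

print(f"P(within 1 SD)     = {within_1sd:.4f}")
print(f"P(outside 1 SD)    = {outside_1sd:.4f}")
print(f"P(within 1.96 SD)  = {within_196sd:.4f}")
```

About 68% of values lie within 1 standard deviation of the mean and about 95% within 1.96 standard deviations.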

Populations and Samples
• In the statistical sense a population is a theoretical concept used
to describe an entire group of individuals in whom we are
interested.
• Examples are the population of all patients with diabetes
mellitus, or the population of all middle-aged men.
• Parameters are quantities used to describe characteristics of
such populations.
• Thus the proportion of diabetic patients with nephropathy, or
the mean blood pressure of middle-aged men, are characteristics
describing the two populations.
• Generally, it is costly and labour intensive to study the entire
population.
• Therefore we collect data on a sample of individuals from the
population who we believe are representative of that
population, that is, they have similar characteristics to the
individuals in the population.
• We then use them to draw conclusions, technically make
inferences, about the population as a whole.

Populations and Samples
• The process is represented schematically in the figure.
• So, samples are taken from populations to provide estimates
of population parameters.
• It is important to note that although the study populations
are unique, samples are not, as we could take more than one
sample from the target population if we wished.
• Thus for middle-aged men there is only one normal range
for blood pressure.
• However, one investigator taking a random sample from a
population of middle-aged men and measuring their blood
pressure may obtain a different normal range from another
investigator who takes a different random sample from the
same population of such men.
• By studying only some of the population we have
introduced a sampling error.

Sample
• In some circumstances the sample may consist of all the members of a specifically
defined population.
• For practical reasons, this is only likely to be the case if the population of interest is
not too large.
• If all members of the population can be assessed, then the estimate of the
parameter concerned is derived from information obtained on all members and so
its value will be the population parameter itself.
• In this idealised situation we know all about the population as we have examined all
its members and the parameter is estimated with no bias.
• The dotted arrow in Figure 6.1 connecting the population ellipse to population
parameter box illustrates this.
• However, this situation will rarely be the case so, in practice, we take a sample which
is often much smaller in size than the population under study.

Sample
• Ideally we should aim for a random sample.
• A list of all individuals from the population is drawn up (the
sampling frame), and individuals are selected randomly
from this list, that is, every possible sample of a given size in
the population has an equal chance of being chosen.
• Sometimes, there may be difficulty in constructing this list
or we may have to ‘make-do’ with those subjects who
happen to be available, or what is termed a convenience
sample.
• Essentially if we take a random sample then we obtain an
unbiased estimate of the corresponding population
parameter, whereas a convenience sample may provide a
biased estimate, but by how much we will not know.

Properties of the distribution of sample
means
• The mean of all the sample means will be the same as the
population mean.
• The standard deviation of all the sample means is known as
the standard error (SE) of the mean or SEM.
• Given a large enough sample size, the distribution of sample
means will be roughly Normal regardless of the distribution
of the variable.
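These properties can be checked by simulation. The sketch below is illustrative only: the choice of a skewed exponential population (mean 1, standard deviation 1), the sample size of 30 and the 2000 repeated samples are all assumptions made for the example, not values from the slides.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

n = 30            # size of each sample (assumed)
n_samples = 2000  # number of repeated samples (assumed)

# Draw repeated samples from a skewed population: exponential, mean 1, SD 1.
sample_means = []
for _ in range(n_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)  # should be close to 1
sd_of_means = statistics.stdev(sample_means)   # the empirical standard error
theoretical_se = 1 / math.sqrt(n)              # population SD / sqrt(n)

print(f"Mean of sample means = {mean_of_means:.3f}")
print(f"SD of sample means   = {sd_of_means:.3f}")
print(f"Theoretical SE       = {theoretical_se:.3f}")
```

The mean of the sample means sits near the population mean, and their standard deviation is close to the population SD divided by the square root of n, even though the underlying variable is far from Normal.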

Properties of standard errors
• The standard error (SE) is a measure of the precision
of a sample estimate.
• It provides a measure of how far from the true value
in the population the sample estimate is likely to be.
• All standard errors have the following interpretation:
• A large standard error indicates that the estimate is
imprecise.
• A small standard error indicates that the estimate is
precise.
• The standard error is reduced, that is, we obtain a more
precise estimate, if the size of the sample is increased.

Worked example: Standard error of a mean
– birthweight of preterm infants
• Simpson (2004) reported the birthweights of 98 infants who
were born prematurely, for which n = 98, $\bar{x}$ = 1.31 kg,
s = 0.42 kg and $\mathrm{SE}(\bar{x})$ = ??
• The standard error provides a measure of the precision of
our sample estimate of the population mean birthweight.
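The missing value can be obtained from the usual formula for the standard error of a mean, SE = s/√n. The slides do not reproduce this formula, so the sketch below states it as an assumption; the variable names are illustrative.

```python
import math

n = 98        # number of preterm infants
mean = 1.31   # sample mean birthweight (kg)
s = 0.42      # sample standard deviation (kg)

# Standard error of the mean: s / sqrt(n).
se_mean = s / math.sqrt(n)

print(f"SE(mean) = {se_mean:.4f} kg")
```

The standard error is roughly 0.04 kg, small relative to the mean of 1.31 kg.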

Worked example: Standard error of a
proportion – acupuncture and headache
• Melchart et al (2005) give the proportion who responded to
acupuncture treatment in 124 patients with tension type
headache as p = 0.46.
• We assume the numbers who respond have a Binomial
distribution and from Table 6.3 we find the standard error is
$\mathrm{SE}(p)$ = ??
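Table 6.3 is not reproduced in these slides, so the sketch below assumes the standard Binomial formula SE(p) = √(p(1 − p)/n); the variable names are illustrative.

```python
import math

n = 124   # patients with tension type headache
p = 0.46  # proportion who responded to acupuncture

# Standard error of a proportion under the Binomial model.
se_p = math.sqrt(p * (1 - p) / n)

print(f"SE(p) = {se_p:.4f}")
```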

Worked example: Standard error of a rate –
cadaveric heart donors
• The study of Wight et al (2004) gave the number of organ
donations calculated over a two-year period as r = 1.82 per
day.
• We assume the number of donations follows a Poisson
distribution and from Table 6.3 we find the standard error is
$\mathrm{SE}(r)$ = ??
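Table 6.3 is not shown here, so this sketch rests on two assumptions: that the standard error of a Poisson rate is SE(r) = √(r/n), where n is the number of time units observed, and that the two-year period corresponds to n = 2 × 365 = 730 days.

```python
import math

r = 1.82          # organ donations per day
n_days = 2 * 365  # ASSUMPTION: two-year period taken as 730 days

# Standard error of a rate under the Poisson model: sqrt(r / n).
se_r = math.sqrt(r / n_days)

print(f"SE(r) = {se_r:.4f} donations per day")
```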

Standard Errors of Differences

Example: Difference in means – physiotherapy for patients with lung
disease
• Griffiths et al (2000) report the results of a randomised
controlled trial to compare a pulmonary rehabilitation
programme (Intervention) with standard medical management
(Control) for the treatment of chronic obstructive pulmonary
disease.
• One outcome measure was the walking capacity (distance
walked in metres from a standardised test) of the patient
assessed 6 weeks after randomisation.
• Further suppose such measurements can be assumed to follow a
Normal distribution.
• The results from the 184 patients are expressed using the group
means and standard deviations (SD) as follows:
$n_{Int} = 93$, $\bar{x}_{Int} = 211$, $SD(x_{Int}) = s_{Int} = 118$
$n_{Con} = 91$, $\bar{x}_{Con} = 123$, $SD(x_{Con}) = s_{Con} = 99$.
• From these data $d = \bar{x}_{Int} - \bar{x}_{Con} = 211 - 123 = 88$ m and the
corresponding standard error is $\mathrm{SE}(d)$ = ??
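One common way to obtain this standard error is the unpooled form SE(d) = √(s₁²/n₁ + s₂²/n₂). The slides do not show which expression the source uses (a pooled-variance version would give a slightly different value), so the choice below is an assumption, as are the variable names.

```python
import math

# Walking distance (metres) 6 weeks after randomisation.
n_int, mean_int, sd_int = 93, 211, 118  # Intervention group
n_con, mean_con, sd_con = 91, 123, 99   # Control group

d = mean_int - mean_con  # difference in means

# ASSUMPTION: unpooled standard error of a difference in means.
se_d = math.sqrt(sd_int ** 2 / n_int + sd_con ** 2 / n_con)

print(f"d     = {d} m")
print(f"SE(d) = {se_d:.2f} m")
```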

Worked example: Difference in proportions – post-natal
urinary incontinence
• The results of a randomised controlled trial conducted by
Glazener et al (2001) to assess the effect of nurse
assessment with reinforcement of pelvic floor muscle
training exercises and bladder training (Intervention)
compared with standard management (Control) among
women with persistent incontinence three months
postnatally are summarised in Table 6.2.

Confidence intervals for an estimate
• A confidence interval defines a range of values within which
our population parameter is likely to lie.
• Such an interval for the population mean $\mu$ is defined by
$$\bar{x} - 1.96 \times \mathrm{SE}(\bar{x}) \quad \text{to} \quad \bar{x} + 1.96 \times \mathrm{SE}(\bar{x})$$
and, in this case, is termed a 95% confidence interval as it
includes the multiplier 1.96.
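As a worked illustration, the interval can be computed for the preterm birthweight example (n = 98, mean 1.31 kg, s = 0.42 kg). The sketch assumes SE = s/√n for the standard error of the mean; the variable names are illustrative.

```python
import math

n, mean, s = 98, 1.31, 0.42  # birthweight example (kg)
z = 1.96                     # multiplier for a 95% confidence interval

se_mean = s / math.sqrt(n)   # ASSUMPTION: SE of a mean is s / sqrt(n)
lower = mean - z * se_mean
upper = mean + z * se_mean

print(f"95% CI: {lower:.2f} to {upper:.2f} kg")
```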

Confidence intervals for a mean

Confidence Intervals for Differences
• To calculate a confidence interval for a difference in
means, for example $d = \mu_A - \mu_B$, the same structure for
the confidence interval of a single mean is used but
with $\bar{x}$ replaced by $\bar{x}_1 - \bar{x}_2$ and $\mathrm{SE}(\bar{x})$ replaced by
$\mathrm{SE}(\bar{x}_1 - \bar{x}_2)$.
• Algebraic expressions for these standard errors are
given in Table 6.4 (Section 6.10).
• Thus the 95% CI is given by
$$(\bar{x}_1 - \bar{x}_2) - 1.96 \times \mathrm{SE}(\bar{x}_1 - \bar{x}_2) \quad \text{to} \quad (\bar{x}_1 - \bar{x}_2) + 1.96 \times \mathrm{SE}(\bar{x}_1 - \bar{x}_2)$$
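For illustration, the interval can be evaluated for the pulmonary rehabilitation trial (Intervention: n = 93, mean 211 m, SD 118; Control: n = 91, mean 123 m, SD 99). Table 6.4 is not reproduced in these slides, so the unpooled standard error used below is an assumption, and the function name `ci_difference_in_means` is hypothetical.

```python
import math


def ci_difference_in_means(n1, mean1, sd1, n2, mean2, sd2, z=1.96):
    """Return (difference, lower, upper) for a difference in two means.

    ASSUMPTION: uses the unpooled standard error
    sqrt(sd1^2 / n1 + sd2^2 / n2); a pooled form would differ slightly.
    """
    diff = mean1 - mean2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return diff, diff - z * se, diff + z * se


diff, lower, upper = ci_difference_in_means(93, 211, 118, 91, 123, 99)
print(f"Difference = {diff} m, 95% CI: {lower:.1f} to {upper:.1f} m")
```

Because the interval excludes zero, the data are compatible with a genuine improvement in walking distance in the Intervention group.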

More Accurate Confidence Intervals for a Proportion
• To use this method, where r is the number of events observed among n subjects
so that p = r/n, we first need to calculate three
quantities: $A = 2r + z^2$; $B = z\sqrt{z^2 + 4r(1 - p)}$; and $C = 2(n + z^2)$,
where $z$ is from the standard normal table.
• The recommended confidence interval is given by
$$\frac{A - B}{C} \quad \text{to} \quad \frac{A + B}{C}$$
• When there are no observed events, r = 0 and hence $p = 0/n = 0$, the recommended CI simplifies to
$$0 \quad \text{to} \quad \frac{z^2}{n + z^2}$$
• While when r = n so that p = 1, the CI becomes
$$\frac{n}{n + z^2} \quad \text{to} \quad 1$$
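The three quantities A, B and C translate directly into code. The sketch below is illustrative: the function name `recommended_ci` is hypothetical, and the example counts (r = 57 responders out of n = 124) are an assumption chosen to be roughly consistent with p = 0.46 in the acupuncture example, not figures quoted in the slides.

```python
import math


def recommended_ci(r, n, z=1.96):
    """Confidence interval for a proportion from r events in n subjects.

    Uses A = 2r + z^2, B = z * sqrt(z^2 + 4r(1 - p)), C = 2(n + z^2),
    and returns ((A - B) / C, (A + B) / C).
    """
    p = r / n
    a_term = 2 * r + z ** 2
    b_term = z * math.sqrt(z ** 2 + 4 * r * (1 - p))
    c_term = 2 * (n + z ** 2)
    return (a_term - b_term) / c_term, (a_term + b_term) / c_term


# ASSUMPTION: 57 responders out of 124 (p is about 0.46).
lower, upper = recommended_ci(57, 124)
print(f"95% CI: {lower:.3f} to {upper:.3f}")

# Edge cases: no events, and all events.
print(recommended_ci(0, 20))
print(recommended_ci(20, 20))
```

Unlike the simple interval p ± 1.96 × SE(p), this interval never extends below 0 or above 1, which is why it behaves sensibly when r = 0 or r = n.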
