1. Concepts in Biostatistics
Anne Eaton
Department of Epidemiology and Biostatistics
Memorial Sloan-Kettering Cancer Center
February 14, 2012
2. Outline of Talk
– Basic statistics concepts
– Types of variables
– Descriptive statistics
• Measures of location and dispersion
– Two variables
• Correlation between two variables
• Bivariate analysis (two-sample vs. paired)
– Multivariate analysis (multivariate normal regression, logistic regression)
– Survival analysis
– Clinical trial design
– Sample size
– Intent-to-treat analysis
– Missing data
3. Referenced Datasets
• I will be anchoring most of the concepts to two datasets
throughout the lectures
Dataset 1: Multiple myeloma patients (Krall, J. M. et al,
Biometrics, 31, 49–57; 1975.)
65 patients treated with alkylating agents
variables: BUN, HGB, platelets, age, WBC, fractures,
plasma cells in bone marrow, proteinuria, serum calcium, death
status
Dataset 2: Metastatic renal cancer patients
789 first-line mRCC clinical trial patients at MSKCC
selected variables: treatment, corrected calcium, HGB,
year of treatment, LDH, KPS, death status
5. What is Statistics?
Descriptive Statistics: summarizing and presenting data using
numerical or graphical methods.
- What are the clinical characteristics of the 65 multiple
myeloma patients?
Inferential Statistics: making estimates, predictions or other
generalizations about the population.
- What can we say about the clinical characteristics of the
general population of multiple myeloma patients treated with
alkylating agents?
6. Statistical Inference and
Hypothesis Testing
Population, N: multiple myeloma patients treated with alkylating agents
  µ = avg. platelet count
  θ = prop. died w/in 1 yr
Sample, n: 65 patients
  x̄ = avg. platelet count
  ȳ = prop. died w/in 1 yr
We use the sample of n patients to
make inference about the
population by:
- estimating parameters
- testing hypotheses.
7. Variable Types (Distributions)
• Continuous (always numeric)
– Age, Tumor size
• Count
– # of lesions, # prior therapies, # of surgeries
• Categorical
– Nominal (special case is binary)
• responder vs. nonresponder, gender, treatment
• Special case: PFS, death
– Ordinal
• Age categories (20-30 yrs, 31-40 yrs, 41-50 yrs)
• Tumor size categories (small, medium, large)
• Comorbidity score (none, mild, moderate, severe)
Statistical method depends on distribution of the outcome as well as
the hypothesis of research interest and other methodological
issues.
8. Summarizing Data: Univariate Analysis
•Continuous variables
-Location parameters identify the location where most of
the datapoints lie
-Mean: average, affected by outliers
-Median: value at which 50% of data points are higher
and 50% are lower, not so affected by outliers
-Mode: value with the most datapoints
-Dispersion parameters measure the variability, spread,
dispersion, variation of the data.
-Variance: approximately, the average squared distance
of the data points from the mean; a measure of how closely
the values cluster together. Standard deviation is the
square root of the variance.
- Range: distance from lowest to highest value
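The location and dispersion measures above can be computed with Python's standard library. This is a minimal sketch; the values are invented for illustration and are not taken from either dataset.

```python
import statistics

# Hypothetical hemoglobin-like values, invented for illustration only
values = [9.8, 10.2, 10.2, 11.5, 12.1, 8.7, 14.6]

summary = {
    "mean": statistics.mean(values),          # pulled toward outliers
    "median": statistics.median(values),      # middle value, robust to outliers
    "mode": statistics.mode(values),          # most frequent value
    "variance": statistics.variance(values),  # sample variance (divides by n - 1)
    "std_dev": statistics.stdev(values),      # square root of the variance
    "range": max(values) - min(values),       # highest minus lowest value
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that `statistics.variance` divides by n - 1 rather than n, which is why the slide describes the variance as only approximately an average.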
9. Example: Descriptive statistics for HGB and BUN
Variable  N   Mean        Median      Mode        Minimum    Maximum     Variance   Std Dev
HGB       65  10.2015385  10.2000000  10.2000000  4.9000000  14.6000000  6.5410913  2.5575557
bun2      65  4.2432506   3.7516660   3.7516660   2.1775491  9.3511563   2.3514505  1.5334440
10. Summarizing Data: Univariate Analysis
•Count variables
- Mean, Median, range
-E.g. Number of metastatic sites
1 site: 98 , 2 sites: 129, 3 sites: 104, >=4 sites: 442
Mean: 5.2 Median: 6 Range: 1 to 9
•Categorical
-Total count of patients by each category, Proportion
-E.g. Fractures at baseline (yes=1 or no=0)
Frac  Frequency  Percent  Cumulative Frequency  Cumulative Percent
0     16         24.62    16                    24.62
1     49         75.38    65                    100.00
12. Are two variables correlated/associated?
• Basic idea in many correlative studies is to examine whether
there is a relationship between two variables
-Are baseline values of HGB and LDH correlated in the mRCC
dataset?
-Is presence of a fracture associated with abnormal platelets at
diagnosis?
• Caution with terminology: predictor implies causation
while correlate implies association. Whether one can
determine causation depends on the experimental design.
• Identifying correlations may be the primary purpose, but it also
helps in multivariate modeling: avoid putting two
highly correlated variables in the same multivariate analysis.
13. Experimental Design
• Randomized study: patients are randomly assigned to a
treatment arm. Can infer causation.
• Non-randomized study: a patient’s treatment may be
influenced by any number of factors including their health status
at baseline, their personal preference, their doctor’s preference,
what treatment was available at the time and location they were
being treated. Cannot infer causation.
• Randomization balances the groups, as long as the number of
patients is large enough.
• In a non-randomized study, we can use multivariate models to
adjust for outside factors and measure the effect of treatment
“corrected for” the other differences between the groups
• However, the groups may still be unbalanced in ways we
can’t measure
• Randomization is crucial for inference!
14. From “Randomized Trial of Estrogen Plus Progestin for Secondary Prevention of Coronary Heart Disease in Postmenopausal Women”
JAMA 1998: 280(7)
15. Two Variables: Correlations and Associations
Example 1: HGB by LDH
[Figure panels: HGB by LDH before transformation; HGB by LDH after transformation]
Higher LDH values are associated with lower HGB values.
16. Two Variables: Correlations and Associations
Example 2: Fracture by Platelets
Presence of fractures (1=yes, 0=no) by platelet count (1=normal, 0=abnormal)
Table of Frac by Platelet (each cell: Frequency / Percent / Row Pct / Col Pct)
Frac      Platelet=0 (abnormal)  Platelet=1 (normal)  Total
0 (none)   1                     15                    16
           1.54                  23.08                 24.62
           6.25                  93.75
          11.11                  26.79
1 (yes)    8                     41                    49
          12.31                  63.08                 75.38
          16.33                  83.67
          88.89                  73.21
Total      9                     56                    65
          13.85                  86.15                100.00
Among those who had a fracture, 8/49 (16.33%) had abnormal platelets, while among
those who did not have a fracture, 1/16 (6.25%) had abnormal platelets (row percents).
17. Quantifying Association or Correlation
Bivariate Analysis
• Purpose is to examine the relationship between two variables,
(two covariates, an outcome with a covariate)
- Are two variables associated or independent?
• Concepts important in quantifying this relationship:
- Distributional assumptions
- Null hypothesis, Alternative hypothesis
- Test statistic
- P-value / confidence interval
- Type I error
- One sided vs two-sided tests
18. T-test: Comparing Two Means
- Is age associated with whether a patient presented with fractures
at diagnosis in the myeloma dataset?
- A two-sided hypothesis test is given by:
H0: µf = µnf (null hypothesis)
H1: µf ≠ µnf (alternative hypothesis)
- Calculate the test statistic:
$t = \dfrac{\bar{y}_f - \bar{y}_{nf}}{s/\sqrt{n}}$
where the denominator is the standard error of the difference in means (for the pooled two-sample test, $s_p\sqrt{1/n_f + 1/n_{nf}}$).
- Reject the null hypothesis if t < –tα/2 or t > tα/2
- tα/2 is the critical value and α (the Type I error rate) is usually set at .05
- The two-sided p-value is P(|T| ≥ |t|)
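As a rough check on the test statistic, the pooled two-sample t can be computed from the group summaries reported in the SAS output for age by fracture status (means 62.313 and 59.449, standard deviations 11.265 and 10.033, n = 16 and 49). This is a sketch only: the function name `pooled_t` is made up for illustration, and the critical value of 2.0 is the approximate value quoted in the lecture rather than an exact quantile.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic using a pooled (equal-variance) standard deviation."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    # Standard error of the difference in means
    std_err = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / std_err, df

# Age by fracture status: no fracture (n = 16) vs. fracture (n = 49)
t_stat, df = pooled_t(62.313, 11.265, 16, 59.449, 10.033, 49)

# Approximate two-sided critical value for alpha = .05 (an assumption, not an exact quantile)
reject = abs(t_stat) > 2.0
print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject}")
```

The statistic lands well inside the interval from -2 to 2, so the null hypothesis of equal mean age is not rejected.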
19. Type I error
• Alpha: the probability of detecting a difference when a difference does
not actually exist.
– Also called the Type I error rate
– Usually set at 5 or 10%
– ‘Detecting a difference under the null hypothesis’
20. T-test: Output from SAS
Statistics
Variable  Frac        N   Lower CL Mean  Mean    Upper CL Mean  Lower CL Std Dev  Std Dev  Upper CL Std Dev  Std Err
Age       0           16  56.31          62.313  68.315         8.3214            11.265   17.434            2.8162
Age       1           49  56.567         59.449  62.331         8.3671            10.033   12.535            1.4333
Age       Diff (1-2)      -3.086         2.8635  8.8131         8.8075            10.34    12.523            2.9772
T-Tests
Variable  Method         Variances  DF    t Value  Pr > |t|
Age       Pooled         Equal      63    0.96     0.3398
Age       Satterthwaite  Unequal    23.3  0.91     0.3741
Equality of Variances
Variable  Method    Num DF  Den DF  F Value  Pr > F
Age       Folded F  15      48      1.26     0.5270
No evidence that the variances differ (p = .53), so use the pooled t-test.
Conclusion: p-value = .34. Cannot reject the null; there is no evidence of a difference in mean age.
21. [Figure: T = .96, df = 63, falling in the ‘do not reject’ region]
Critical values are at approximately -2 and 2.
Since .96 is in the ‘do not reject’ region, we cannot
conclude there is a difference in age by presence of
fractures at diagnosis.
22. Interpreting p-values
P-value: the probability of observing a result at least as extreme as the one
observed, by chance alone, if the null hypothesis is true.
• If p-value is less than the α-level (typically 0.05) chosen prior to
the study, then the null hypothesis is rejected.
• Commonly misinterpreted as the probability that the null
hypothesis is true.
23. Chi-Square test: Output from SAS
H0: Fractures and platelets are independent.
Ha: Fractures and platelets are associated.
Table of Frac by Platelet (each cell: Frequency / Percent / Row Pct / Col Pct)
Frac      Platelet=0 (abnormal)  Platelet=1 (normal)  Total
0 (none)   1                     15                    16
           1.54                  23.08                 24.62
           6.25                  93.75
          11.11                  26.79
1 (yes)    8                     41                    49
          12.31                  63.08                 75.38
          16.33                  83.67
          88.89                  73.21
Total      9                     56                    65
          13.85                  86.15                100.00
Statistic                    DF  Value    Prob
Chi-Square                   1   1.0266   0.3109
Likelihood Ratio Chi-Square  1   1.1852   0.2763
Continuity Adj. Chi-Square   1   0.3557   0.5509
Mantel-Haenszel Chi-Square   1   1.0109   0.3147
Phi Coefficient                  -0.1257
Contingency Coefficient          0.1247
Cramer's V                       -0.1257
WARNING: 25% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
Fisher's Exact Test
Cell (1,1) Frequency (F)  1
Left-sided Pr <= F        0.2900
Right-sided Pr >= F       0.9357
Table Probability (P)     0.2257
Two-sided Pr <= P         0.4326
Because of the small expected counts, use Fisher's exact test: P=.43, cannot reject the
null. There is no evidence of an association between the presence
of fractures and platelet count status.
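The Pearson chi-square statistic and the two-sided Fisher's exact p-value can be reproduced from the four cell counts alone (1, 15, 8, 41). The sketch below uses only the standard library; the function names are illustrative, and the Fisher p-value follows the "sum of all table probabilities no larger than the observed one" definition, which is assumed here to match the SAS "Two-sided Pr <= P" line.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals, col_totals = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(((a, b), (c, d))):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value: total probability of tables no more likely than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability that cell (1,1) equals k with margins fixed
        return math.comb(col1, k) * math.comb(n - col1, row1 - k) / math.comb(n, row1)

    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# Fracture (rows: none, yes) by platelets (columns: abnormal, normal)
chi_sq, p_chi = chi_square_2x2(1, 15, 8, 41)
p_fisher = fisher_exact_two_sided(1, 15, 8, 41)
print(f"chi-square = {chi_sq:.4f}, p = {p_chi:.4f}, Fisher two-sided p = {p_fisher:.4f}")
```

With an expected count below 5 in one cell, the exact test is the safer of the two, which is why the conclusion rests on the Fisher p-value.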
24. Note on p-values: Multiple testing
There is often a search for a ‘significant finding’, a p-value less
than .05. This search comes at a cost.
Since each test you do has a 5% chance of a “significant”
(p<0.05) finding by chance alone when there is no true association,
the more tests you do, the
more likely you are to find a spurious association.
So instead of comparing each p-value to 0.05, we use a
stricter cutoff. This ensures that the family-wise error rate (the
probability of any significant finding given there are no true
associations) is at most alpha=0.05.
The Bonferroni adjustment is the most common method. You
compare each p-value to (alpha)/K where K is the number of
tests you are doing.
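To see how quickly the family-wise error rate grows, the sketch below computes it for K tests with and without the Bonferroni cutoff. It assumes the K tests are independent, which the slide does not state; K = 10 is an arbitrary illustrative choice.

```python
def family_wise_error_rate(alpha, k):
    """Chance of at least one false positive in k independent tests when every null is true."""
    return 1 - (1 - alpha) ** k

def bonferroni_cutoff(alpha, k):
    """Per-test significance threshold under the Bonferroni adjustment."""
    return alpha / k

alpha, k = 0.05, 10  # k = 10 is a hypothetical number of tests
unadjusted = family_wise_error_rate(alpha, k)
cutoff = bonferroni_cutoff(alpha, k)
adjusted = family_wise_error_rate(cutoff, k)

print(f"Unadjusted family-wise error rate: {unadjusted:.3f}")
print(f"Bonferroni per-test cutoff:        {cutoff:.4f}")
print(f"Adjusted family-wise error rate:   {adjusted:.3f}")
```

With ten tests at 0.05 each, the chance of at least one spurious finding is roughly 40%; comparing each p-value to 0.005 brings it back under 5%.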
25. Summary of Commonly Used Tests
Data type                    Difference, two independent   Difference, paired data      Difference between three or more
                             samples (e.g. two arms of     (e.g. before and after on    independent samples (e.g. three
                             a trial)                      same patient)                arm trial)
Binary or nominal            Pearson's Chi-Square,         McNemar's test               Pearson's Chi-Square
variables                    Fisher's Exact test
Quantitative,                Two sample T-test             Paired T-test                ANOVA (Analysis of Variance)
normality assumed
Non-normal data,             Mann-Whitney                  Wilcoxon signed rank         Kruskal-Wallis
non-parametric tests
Important Notes:
• This is not an exhaustive list, many variations and areas beyond scope of talk
- Depends on your research question and data
• There will be times where your research question will require analysis that is
not listed above (e.g. Survival analysis, repeated measures, longitudinal data,
cluster analysis, inter-rater agreement, factor analysis, ROCs)
27. Multivariate Analysis
• Interested in more than one covariate
– Simultaneous effect of 2 or 3 covariates on the
outcome
– Effect of one covariate, adjusted for others (e.g.
confounding variables)
– Want to include interaction
• Continuous outcome: multivariate normal
regression
• Binary outcome: logistic regression
28. Multivariate Normal Analysis
• Outcome is normal (continuous)
• Covariates can be normal or categorical
• Simple linear regression models a linear relationship
(association) between the outcome and a single covariate.
• Multivariate normal regression models the relationship
between the outcome and several covariates.
• A sample interpretation might be, after adjusting for saturated
fat in diet, a one-year increase in age was associated with a
0.1-mg/dL increase in cholesterol
29. Logistic Regression
• Outcome is binary (0 vs 1)
• Covariates can be normal or categorical
• Parameter coefficients have a useful interpretation:
log odds ratios
• A sample interpretation might be, after adjusting for
age, patients with a stage 2 tumor had twice the odds
of being treated with chemotherapy compared to
patients with a stage 1 tumor
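Because the coefficients are log odds ratios, exponentiating a coefficient gives the odds ratio used in the interpretation. A minimal sketch follows; the coefficient value is hypothetical, chosen so that it matches the "twice the odds" wording above.

```python
import math

def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (a log odds ratio) to an odds ratio."""
    return math.exp(coefficient)

# Hypothetical coefficient for stage 2 vs. stage 1, adjusted for age
beta_stage2 = math.log(2)  # about 0.693
print(f"coefficient = {beta_stage2:.3f}, odds ratio = {odds_ratio(beta_stage2):.2f}")
```

A coefficient of 0 corresponds to an odds ratio of 1 (no association), and negative coefficients correspond to odds ratios below 1.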
30. Interpretation of a multivariate model
If a covariate is significant in a multivariate model we
can say, “After adjusting for X, Y and Z, A has a
significant effect on B” or, “A is independently
associated with B.”
The number of variables you can correct for is limited
by your sample size. A common rule of thumb for linear regression is
10-15 patients per variable.
32. Survival Analysis
- Survival analysis is a group of statistical methods
designed to analyze time to an event.
- Examples of events could be:
- Recurrence or progression
- Death or death due to disease
- Disease onset (AIDS in HIV patients)
33. Two Common Goals of Survival Analysis
1) Evaluate time to event (descriptive)
-What is the median survival time from diagnosis
among patients in the multiple myeloma dataset?
2) Examine effect of certain factors (e.g.
clinicopathologic variables, biomarkers) on the time
to event.
- What are important prognostic factors for
survival in the multiple myeloma patient dataset?
34. Why we need survival analysis methods
• Able to account for censoring
– Subject does not experience event of interest
– Incomplete follow-up
• Lost to followup
• Withdrawal
• Death
Example of Right Censoring
35. Data
• When the clock starts
– E.g. Diagnosis date, end of therapy
• Did the patient experience the event? (binary)
– E.g. Death, death due to disease, progression, infection
• Last date of follow-up, Date of event
• Covariates
– Assessed at or before the clock starts
– Assessed after the clock starts (adds complexity to
analysis)
36. Kaplan Meier Estimates with 95% CI
Overall Survival, Kidney Cancer example
• Number at risk decreases over
time
• Tick marks represent when a
patient was censored
• Drops in the curve represent
when a patient experiences the
event.
• CI gets wider at the end of the
curve (number at risk is small)
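The behavior described above (drops at events, no drop at censoring, shrinking number at risk) follows directly from how the Kaplan-Meier estimate is built. The sketch below is a bare-bones version without confidence intervals, run on six invented follow-up times; it is not the kidney cancer data.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i] and 0 if the
    patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            censored += 1 - data[i][1]
            i += 1
        if deaths:
            # The curve drops only when an event occurs
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Events and censored patients both leave the risk set
        at_risk -= deaths + censored
    return curve

# Hypothetical follow-up in months; 1 = died, 0 = censored
months = [6, 7, 10, 15, 19, 25]
died = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(months, died)
for t, s in curve:
    print(f"month {t}: S(t) = {s:.3f}")
```

Censored patients contribute to the denominator up to their last follow-up and then drop out of the risk set, which is what lets the method use incomplete follow-up instead of discarding it.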
37. Log-rank test to compare survival curves for
2 or more groups
[Figure: two survival-curve panels, x-axis in months]
Log-rank p-value=.07
Platelets: Red = Normal, Blue = Abnormal
Fractures: Red = yes, Blue = none
Log-rank p-value=.33
38. Clinical trial design
• As opposed to observational studies, clinical
trials involve an intervention that’s assigned
by the investigator
• Clinical trials are highly regulated to make
sure approved drugs are safe and effective
• Endpoint and alpha must be specified
beforehand
39. Phase I trial
• First-in-humans trial
• Goal is to determine the maximum tolerated dose (MTD)
• You have to define beforehand what is
considered a dose-limiting toxicity
• Standard design is called 3+3; patients are
enrolled in cohorts of size 3
41. Phase I trial
True risk of toxicity      .10  .20  .30  .40  .50
Probability of escalation  .91  .71  .49  .31  .17
This design has the property that the more toxic a drug is,
the less likely the dose will be escalated for the next
cohort of patients.
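The escalation probabilities in the table can be reproduced under the usual 3+3 rule, which is an assumption here since the rule itself is not spelled out in the slides: escalate if 0 of 3 patients have a dose-limiting toxicity, or if 1 of 3 do and none of 3 additional patients at the same dose do.

```python
def prob_escalate(p):
    """Probability of escalating past a dose whose true toxicity risk is p.

    Assumed 3+3 rule: escalate on 0/3 toxicities, or on 1/3 followed by
    0/3 in an expansion cohort at the same dose.
    """
    none_of_three = (1 - p) ** 3
    one_of_three = 3 * p * (1 - p) ** 2
    return none_of_three + one_of_three * none_of_three

risks = (0.10, 0.20, 0.30, 0.40, 0.50)
escalation = [round(prob_escalate(p), 2) for p in risks]
for p, e in zip(risks, escalation):
    print(f"true risk {p:.2f} -> P(escalate) = {e:.2f}")
```

Under this rule the computed values agree with the table, which suggests the slide used the same binomial calculation.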
42. Phase II trial
• This trial looks at a drug’s efficacy
• Endpoint is often response rate
• Other possible endpoints are survival or
progression-free survival
• Trial may be randomized if no good historical
data is available for comparison
• Simon’s two stage design is common
– Endpoint is response rate
– Allows for early stopping if drug isn’t promising
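44. Sample size justification
The sample size needed for a T-test is:
$n = \dfrac{(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{(\mu_1 - \mu_0)^2}$
(normal approximation; when comparing two groups of equal size, the number needed per group is twice this)
$z_{1-\alpha/2}$ is a function of α that gets bigger as α gets smaller
$z_{1-\beta}$ is a function of β that gets bigger as β gets smaller
$\mu_1 - \mu_0$ is the difference between the group means
$\sigma^2$ is the variance of the variable you are measuring (dispersion)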
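A quick way to explore how these quantities trade off is to code the calculation directly. The sketch below assumes the normal-approximation form n = (z_{1-α/2} + z_{1-β})² σ² / (µ1 − µ0)², which may differ in detail from the formula on the original slide, and the inputs are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(alpha, power, sigma, difference):
    """Approximate n from the normal-approximation formula (assumed form).

    alpha: two-sided type I error rate; power: 1 - beta;
    sigma: standard deviation; difference: mu1 - mu0.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / difference) ** 2)

# Hypothetical inputs: detect a difference of half a standard deviation
n_80 = sample_size(alpha=0.05, power=0.80, sigma=1.0, difference=0.5)
n_90 = sample_size(alpha=0.05, power=0.90, sigma=1.0, difference=0.5)
n_small_diff = sample_size(alpha=0.05, power=0.80, sigma=1.0, difference=0.25)
print(n_80, n_90, n_small_diff)
```

Raising the power or halving the detectable difference both increase n, and halving the difference roughly quadruples it because the difference enters the formula squared.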
45. Power and Alpha
• Power: the probability of detecting a difference, given
that the difference actually exists (usually set at 80-90%).
– Type II error rate (β) = 1 − Power
• Alpha: the probability of detecting a difference when a
difference does not actually exist (usually set at 5-10%).
– Also called the Type I error rate
46. Power, alpha and sample size are
all related
• There aren’t simple formulas for other tests but the
general patterns are the same
• Lower error rate -> larger sample size
• Smaller detectable difference -> larger sample size
• More variability -> larger sample size
• A statistically significant finding may not be scientifically
or clinically significant. Large sample
sizes have power to detect even small differences,
differences that may not be useful clinically.
48. Power tables
For a proportion, the variability gets bigger as
the true proportion increases from 0.1 to 0.5.
From MSK protocol 10-115, Association of smoking, lung
inflammation and lung metastases from breast cancer
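The statement about variability can be checked with the standard error of a sample proportion, sqrt(p(1 − p)/n). The sample size of 100 below is hypothetical and is not taken from the protocol.

```python
import math

def proportion_std_error(p, n):
    """Standard error of a sample proportion when the true proportion is p."""
    return math.sqrt(p * (1 - p) / n)

n = 100  # hypothetical sample size
proportions = (0.1, 0.2, 0.3, 0.4, 0.5)
std_errors = [proportion_std_error(p, n) for p in proportions]
for p, se in zip(proportions, std_errors):
    print(f"p = {p:.1f}: standard error = {se:.4f}")
```

The standard error peaks at p = 0.5, so power calculations for a proportion are most demanding when the true proportion is near one half.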
49. Phase III trial
• This is the definitive trial that shows a drug is
superior to an older drug or to the current
standard of care
• Large sample size (thousands) and
randomized
• Often blinded
50. Intent-to-treat analysis
• Randomization only works if you analyze the data “as
randomized” (also known as intent-to-treat)
• If analysis is not done in this way p-value can’t be
trusted
• The patients who deviate from the protocol may be
different from those who remain on protocol
• It’s good to randomize as late as possible so you
minimize the number of patients who are randomized
but don’t complete therapy or assessments
51. Evaluable patients
• For non-randomized studies, the protocol
should specify at what point a patient will be
considered “evaluable”
• If we can’t ascertain the outcome on an evaluable patient, we
have to assume the worst in order to be conservative and
control type I error
52. Missing data
• Complete case analysis looks at just the
patients whose data is complete.
• Are the data missing at random?
• The less the better, but if more than 10% of
data is missing for a certain covariate,
reviewers may be skeptical.