Two-sample tests
Binary or categorical outcomes (proportions)

Outcome variable: binary or categorical (e.g., fracture, yes/no)

Are the observations correlated?

Independent:
Chi-square test: compares proportions between two or more groups
Relative risks: odds ratios or risk ratios
Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)
Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
GEE modeling: multivariate regression technique for a binary outcome when groups are correlated

Alternative to the chi-square test if sparse cells:
Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5).
McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5).
Recall: The odds ratio (two samples = cases and controls)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$ OR = \frac{ad}{bc} = \frac{15 \times 42}{35 \times 8} = 2.25 $$

Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.
Inferences about the odds ratio…
• Does the sampling distribution follow a normal distribution?
• What is the standard error?
Simulation…
1. In SAS, assume an infinite population of cases and controls with equal proportions of smokers (exposure), p=.23 (UNDER THE NULL!)
2. Use the random binomial function to randomly select n=50 cases and n=50 controls, each with a p=.23 chance of being a smoker.
3. Calculate the observed odds ratio for the resulting 2x2 table.
4. Repeat this 1000 times (or some large number of times).
5. Observe the distribution of odds ratios under the null hypothesis.
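The five simulation steps can be sketched in Python instead of SAS. This is only an illustrative sketch: the function name, the seed, and the choice to skip tables with an empty cell (where the OR is undefined) are assumptions, not part of the original slides.

```python
import math
import random
import statistics

def simulate_null_odds_ratios(n_cases=50, n_controls=50, p_exposed=0.23,
                              n_reps=1000, seed=1):
    """Simulate sample odds ratios when cases and controls share one exposure rate."""
    rng = random.Random(seed)
    odds_ratios = []
    for _ in range(n_reps):
        # Binomial draws: number of smokers among the cases and among the controls.
        a = sum(rng.random() < p_exposed for _ in range(n_cases))     # exposed cases
        c = sum(rng.random() < p_exposed for _ in range(n_controls))  # exposed controls
        b = n_cases - a       # unexposed cases
        d = n_controls - c    # unexposed controls
        if min(a, b, c, d) == 0:
            continue  # OR is undefined with an empty cell; skipped in this sketch
        odds_ratios.append((a * d) / (b * c))
    return odds_ratios

odds_ratios = simulate_null_odds_ratios()
log_ors = [math.log(value) for value in odds_ratios]

# The OR itself is right-skewed (mean above median); ln(OR) is roughly symmetric.
print("mean OR:", round(statistics.mean(odds_ratios), 2))
print("median OR:", round(statistics.median(odds_ratios), 2))
print("mean ln(OR):", round(statistics.mean(log_ors), 2))
print("SD of ln(OR):", round(statistics.stdev(log_ors), 2))
```

The standard deviation of the simulated ln(OR) values plays the role of the empirical standard error.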
Properties of the OR (simulation)
(50 cases/50 controls/23% exposed)
Under the null, this is the expected variability of the sample OR; note the right skew.
Properties of the lnOR
Normal!
From the simulation, can get the empirical standard error (~0.5) and p-value (~.10).
Or, in general, the standard error of lnOR =

$$ \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} $$
Inferences about the ln(OR)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$ OR = 2.25, \qquad \ln(OR) = 0.81 $$

$$ Z = \frac{\ln(2.25) - 0}{\sqrt{\frac{1}{8} + \frac{1}{15} + \frac{1}{35} + \frac{1}{42}}} = \frac{0.81}{0.494} = 1.64, \qquad p = .10 $$
Confidence interval…

$$ 95\%\ \text{CI for } \ln(OR) = 0.81 \pm 1.96 \times 0.494 = (-0.16,\ 1.78) $$

$$ 95\%\ \text{CI for } OR = (e^{-0.16},\ e^{1.78}) = (0.85,\ 5.92) $$
Final answer: 2.25 (0.85,5.92)
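The odds ratio, its standard error on the log scale, the Z statistic, and the 95% confidence interval can be checked with a few lines of Python. This is a minimal sketch; the function name is hypothetical, and the two-sided p-value uses the normal approximation from the slides.

```python
import math

def odds_ratio_inference(a, b, c, d, z_crit=1.96):
    """Return OR, SE of ln(OR), Z, two-sided p-value, and the 95% CI for the OR."""
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = (log_or - 0) / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    ci = (math.exp(log_or - z_crit * se), math.exp(log_or + z_crit * se))
    return odds_ratio, se, z, p_value, ci

# Stroke example: a=15, b=35 (cases); c=8, d=42 (controls).
odds_ratio, se, z, p_value, ci = odds_ratio_inference(15, 35, 8, 42)
print(round(odds_ratio, 2), round(se, 3), round(z, 2), round(p_value, 2))
print(tuple(round(limit, 2) for limit in ci))
```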
Practice problem:
Suppose the following data were collected in a case-control study of brain tumor and
cell phone usage:
                          Brain tumor   No brain tumor
Own a cell phone               20             60
Don’t own a cell phone         10             40
Is there sufficient evidence for an association between cell phones and brain tumor?
Answer
1. What is your null hypothesis?
Null hypothesis: OR = 1.0; lnOR = 0
Alternative hypothesis: OR ≠ 1.0; lnOR ≠ 0 (TWO-SIDED TEST)
2. What is your null distribution?
lnOR ~ N(0, σ), where

$$ \sigma = SD(\ln OR) = \sqrt{\frac{1}{10} + \frac{1}{20} + \frac{1}{60} + \frac{1}{40}} = .44 $$

3. Empirical evidence: OR = (20*40)/(60*10) = 800/600 = 1.33
∴ lnOR = .288
4. Z = (.288 − 0)/.44 = .65
p-value = P(Z > .65 or Z < −.65) = .26*2 = .52
TWO-SIDED TEST: it would be just as extreme if the sample lnOR were .65 standard deviations or more below the null mean.
5. Not enough evidence to reject the null hypothesis of no association.
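Key measures of relative risk: 95% CIs for OR and RR

For an odds ratio, 95% confidence limits:

$$ \left( OR \times \exp\left[-1.96\sqrt{\tfrac{1}{a}+\tfrac{1}{b}+\tfrac{1}{c}+\tfrac{1}{d}}\right],\ \ OR \times \exp\left[+1.96\sqrt{\tfrac{1}{a}+\tfrac{1}{b}+\tfrac{1}{c}+\tfrac{1}{d}}\right] \right) $$

For a risk ratio, 95% confidence limits:

$$ \left( RR \times \exp\left[-1.96\sqrt{\tfrac{1 - a/(a+b)}{a}+\tfrac{1 - c/(c+d)}{c}}\right],\ \ RR \times \exp\left[+1.96\sqrt{\tfrac{1 - a/(a+b)}{a}+\tfrac{1 - c/(c+d)}{c}}\right] \right) $$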
Continuous outcome (means)

Outcome variable: continuous (e.g., pain scale, cognitive function)

Are the observations independent or correlated?

Independent:
Ttest: compares means between two independent groups
ANOVA: compares means between more than two independent groups
Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
Linear regression:

Correlated:
Paired ttest: compares means between two related groups (e.g., the same subjects before and after)
Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups

Alternatives if the normality assumption is violated (and small sample size):
Non-parametric statistics
Wilcoxon sign-rank test: non-parametric alternative to the paired ttest
Wilcoxon sum-rank test (= Mann-Whitney U test): non-parametric alternative to the ttest
Kruskal-Wallis test: non-parametric alternative to ANOVA
Spearman rank correlation coefficient:
The two-sample t-test
• Is the difference in means that we observe between two groups more than we’d expect to see based on chance alone?
The standard error of the difference of two means

$$ \sigma_{\bar{x}-\bar{y}} = \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}} $$

**First add the variances and then take the square root of the sum to get the standard error.
Recall, Var(A−B) = Var(A) + Var(B) if A and B are independent!
Shown by simulation:
One sample of 30 (with SD=5): $SE = \frac{5}{\sqrt{30}} = .91$
One sample of 30 (with SD=5): $SE = \frac{5}{\sqrt{30}} = .91$
Difference of the two samples: $SE(\text{diff}) = \sqrt{\frac{25}{30} + \frac{25}{30}} = 1.29$
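A simulation like the one summarized above can be sketched as follows. The normal population with mean 0, the number of repetitions, and the seed are assumptions made for illustration; only n=30 and SD=5 come from the slides.

```python
import math
import random
import statistics

def simulate_mean_differences(n=30, m=30, sd=5.0, n_reps=2000, seed=1):
    """Draw two independent samples repeatedly and record the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        x_bar = statistics.fmean(rng.gauss(0.0, sd) for _ in range(n))
        y_bar = statistics.fmean(rng.gauss(0.0, sd) for _ in range(m))
        diffs.append(x_bar - y_bar)
    return diffs

diffs = simulate_mean_differences()
empirical_se = statistics.stdev(diffs)                  # spread of simulated differences
theoretical_se = math.sqrt(5.0**2 / 30 + 5.0**2 / 30)   # add variances, then square root

print("empirical SE of difference:", round(empirical_se, 2))
print("theoretical SE of difference:", round(theoretical_se, 2))
```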
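Distribution of differences
If $\bar{X}_n$ and $\bar{Y}_m$ are the averages of n and m subjects, respectively:

$$ \bar{X}_n - \bar{Y}_m \sim N\left(\mu_x - \mu_y,\ \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}\right) $$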
But…
• As before, you usually have to use the sample SD, since you won’t know the true SD ahead of time…
• So, again, the sampling distribution becomes a T-distribution...
Estimated standard error of the difference….

$$ \sigma_{\bar{x}-\bar{y}} \approx \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}} $$

Just plug in the sample standard deviations for each group.
Case 1: ttest, unpooled variances
Question: What are your degrees of freedom here?
Answer: Not obvious!

$$ T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}} \sim t_\nu $$

It is complicated to figure out the degrees of freedom here! A good approximation is given as df ≈ harmonic mean of n and m (or SAS will tell you!):

$$ \nu \approx \frac{2}{\frac{1}{n} + \frac{1}{m}} $$
Case 2: pooled variance
If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).

Pooling variances:

$$ s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}{n-1} \quad\text{and}\quad (n-1)s_x^2 = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2 $$

$$ s_y^2 = \frac{\sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{m-1} \quad\text{and}\quad (m-1)s_y^2 = \sum_{i=1}^{m}(y_i - \bar{y}_m)^2 $$

$$ \therefore\ s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2} $$

$$ s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2} $$

The denominator, n+m−2, is the degrees of freedom!
Estimated standard error (using pooled variance estimate)

$$ \sigma_{\bar{x}-\bar{y}} \approx \sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}} $$

where:

$$ s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2} $$

The degrees of freedom are n+m−2.
Case 2: ttest, pooled variances

$$ T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}} \sim t_{n+m-2} $$

$$ s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2} $$
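Alternate calculation formula: ttest, pooled variance

$$ T = \frac{\bar{X}_n - \bar{Y}_m}{s_p\sqrt{\frac{n+m}{nm}}} \sim t_{n+m-2} $$

because

$$ \frac{s_p^2}{n} + \frac{s_p^2}{m} = s_p^2\left(\frac{1}{n} + \frac{1}{m}\right) = s_p^2\left(\frac{m}{mn} + \frac{n}{mn}\right) = s_p^2\left(\frac{n+m}{mn}\right) $$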
Pooled vs. unpooled variance
Rule of Thumb: Use pooled unless you have a
reason not to.
Pooled gives you more degrees of freedom.
Pooled has extra assumption: variances are
equal between the two groups.
SAS automatically tests this assumption for you
(“Equality of Variances” test). If p<.05, this
suggests unequal variances, and better to
use unpooled ttest.
Example: two-sample t-test
• In 1980, some researchers reported that “men have more mathematical ability than women” as evidenced by the 1979 SAT’s, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436±77 and 30 random female adolescents scored lower: 416±81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors’ conclusions?
Data Summary
                   n    Sample Mean   Sample Standard Deviation
Group 1: women     30       416                 81
Group 2: men       30       436                 77
Two-sample t-test
1. Define your hypotheses (null, alternative)
H0: ♂−♀ math SAT = 0
Ha: ♂−♀ math SAT ≠ 0 [two-sided]
2. Specify your null distribution:
F and M have similar standard deviations/variances, so make a “pooled” estimate of variance.

$$ s_p^2 = \frac{(n-1)s_m^2 + (m-1)s_f^2}{n+m-2} = \frac{(29)77^2 + (29)81^2}{58} = 6245 $$

$$ \bar{M}_{30} - \bar{F}_{30} \sim T_{58}\left(0,\ \sqrt{\frac{6245}{30} + \frac{6245}{30}}\right), \qquad \sqrt{\frac{6245}{30} + \frac{6245}{30}} = 20.4 $$

3. Observed difference in our experiment = 20 points
4. Calculate the p-value of what you observed

$$ T_{58} = \frac{20 - 0}{20.4} = .98 $$

data _null_;
pval=(1-probt(.98, 58))*2; /* two-sided p-value */
put pval;
run;
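For readers without SAS, the same pooled-variance calculation can be sketched in standard-library Python. The function names are hypothetical, and the t-distribution tail area is obtained here by numerically integrating the t density with Simpson's rule, which is an assumption of this sketch rather than how SAS computes `probt`.

```python
import math

def pooled_t_test(mean_x, sd_x, n, mean_y, sd_y, m):
    """Pooled-variance two-sample t statistic from summary statistics."""
    sp2 = ((n - 1) * sd_x**2 + (m - 1) * sd_y**2) / (n + m - 2)
    se = math.sqrt(sp2 / n + sp2 / m)
    t = (mean_x - mean_y) / se
    return sp2, se, t, n + m - 2

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the t density over [0, |t|]."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3          # P(0 < T < |t|)
    return 1.0 - 2.0 * area       # both tails beyond |t|

# SAT example: men 436 (SD 77, n=30) vs. women 416 (SD 81, n=30).
sp2, se, t, df = pooled_t_test(436, 77, 30, 416, 81, 30)
p_value = t_two_sided_p(t, df)
print(sp2, round(se, 1), round(t, 2), df, round(p_value, 2))
```

A two-sided p-value this large gives little reason to reject the null hypothesis of no difference.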
Example 2: Difference in means
• Example: Rosenthal, R. and Jacobson, L. (1966) Teachers’ expectancies: Determinants of pupils’ I.Q. gains. Psychological Reports, 19, 115-118.
The Experiment
(note: exact numbers have been altered)
• Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).
• Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as “academic bloomers” (n=18).
• BUT: the children on the teachers’ lists had actually been randomly assigned to the list.
• At the end of the year, the same I.Q. test was re-administered.
Example 2
• Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group?
What will we actually compare?
• One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.

Results:
                                  “Academic bloomers” (n=18)   Controls (n=72)
Change in IQ score, mean (SD)            12.2 (2.0)               8.2 (2.0)

Difference = 12.2 points − 8.2 points = 4 points
The standard deviation of change scores was 2.0 in both groups. This affects statistical significance…
What does a 4-point difference mean?
• Before we perform any formal statistical analysis on these data, we already have a lot of information.
• Look at the basic numbers first; THEN consider statistical significance as a secondary guide.
Is the association statistically significant?
• This 4-point difference could reflect a true effect or it could be a fluke.
• The question: is a 4-point difference bigger or smaller than the expected sampling variability?
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between “academic bloomers” and normal students (= the difference is 0 points)
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true
• These predictions can be made by mathematical theory or by computer simulation.
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true (math theory):

$$ s_p^2 = 4.0 $$

$$ \bar{x}_{\text{“gifted”}} - \bar{x}_{\text{control}} \sim T_{88}\left(0,\ \sqrt{\frac{4}{18} + \frac{4}{72}} = 0.52\right) $$
Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true (computer simulation):
• In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
• I used computer simulation to take 1000 samples of 18 treated and 72 controls.
Computer Simulation Results
Standard error is about 0.52
3. Empirical data
Observed difference in our experiment = 12.2 − 8.2 = 4.0
4. P-value

$$ t_{88} = \frac{12.2 - 8.2}{.52} = \frac{4}{.52} \approx 8 $$

p-value <.0001
A t-curve with 88 df’s has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).
If we ran this study 1000 times, we wouldn’t expect to get 1 result as big as a difference of 4 (under the null hypothesis).
Visually…
5. Reject null!
• Conclusion: I.Q. scores can bias expectancies in the teachers’ minds and cause them to unintentionally treat “bright” students differently from those seen as less bright.
Confidence interval (more information!!)
95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)
A t-curve with 88 df’s has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).
What if our standard deviation had been higher?
• The standard deviation for change scores in treatment and control were each 2.0. What if change scores had been much more variable, say a standard deviation of 10.0 (for both)?
Std. dev in change scores = 2.0: standard error is 0.52
Std. dev in change scores = 10.0: standard error is 2.58
With a std. dev. of 10.0…
LESS STATISTICAL POWER!
Standard error is 2.58
If we ran this study 1000 times, we would expect to get ≥+4.0 or ≤−4.0 12% of the time.
P-value=.12
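The "if we ran this study many times" reasoning can be sketched as a simulation under the null hypothesis. The normal population, the 2000 repetitions, and the seed are assumptions of this sketch; the group sizes (18 and 72), the standard deviations (2.0 and 10.0), and the observed 4-point difference come from the example.

```python
import random
import statistics

def simulate_null_differences(n_treated=18, n_control=72, sd=10.0,
                              n_reps=2000, seed=1):
    """Differences in mean change score when both groups share one population."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        treated = [rng.gauss(0.0, sd) for _ in range(n_treated)]
        control = [rng.gauss(0.0, sd) for _ in range(n_control)]
        diffs.append(statistics.fmean(treated) - statistics.fmean(control))
    return diffs

def empirical_two_sided_p(diffs, observed):
    """Share of simulated differences at least as extreme as the observed one."""
    extreme = sum(abs(value) >= abs(observed) for value in diffs)
    return extreme / len(diffs)

for sd in (2.0, 10.0):
    diffs = simulate_null_differences(sd=sd)
    print("SD", sd,
          "| empirical SE:", round(statistics.stdev(diffs), 2),
          "| two-sided p for a 4-point difference:", empirical_two_sided_p(diffs, 4.0))
```

With the larger standard deviation, the same 4-point difference falls well inside the range of simulated null results, which is the loss of power described above.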
Don’t forget: The paired T-test
• Did the control group in the previous experiment improve at all during the year?
• Do not apply a two-sample ttest to answer this question!
• After-Before yields a single sample of differences…
• “within-group” rather than “between-group” comparison…
Did the control group in the previous experiment improve at all during the year?

Data Summary
                   n    Sample Mean   Sample Standard Deviation
Group 1: Change    72      +8.2                2.0

$$ t_{71} = \frac{8.2 - 0}{2/\sqrt{72}} = \frac{8.2}{.24} \approx 35 $$

p-value <.0001
Normality assumption of ttest
• If the distribution of the trait is normal, fine to use a t-test.
• But if the underlying distribution is not normal and the sample size is small, the Central Limit Theorem takes some time to kick in, and you cannot use the ttest. (Rule of thumb for a large enough sample: n>30 per group if not too skewed; n>100 if the distribution is really skewed.)
• Note: ttest is very robust against violations of the normality assumption!
Alternative tests when normality
is violated: Non-parametric tests
Non-parametric tests
• t-tests require your outcome variable to be normally distributed (or close enough), for small samples.
• Non-parametric tests are based on RANKS instead of means and standard deviations (=“population parameters”).
Example: non-parametric tests
10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig
Hypothetical RESULTS:
Atkins group loses an average of 34.5 lbs.
J. Craig group loses an average of 18.5 lbs.
Conclusion: Atkins is better?
BUT, take a closer look at the individual data…
Atkins, change in weight (lbs):
+4, +3, 0, -3, -4, -5, -11, -14, -15, -300
J. Craig, change in weight (lbs):
-8, -10, -12, -16, -18, -20, -21, -24, -26, -30
[Figure: histograms of weight change (lbs) for the Jenny Craig group (x-axis from -30 to 20) and the Atkins group (x-axis from -300 to 20); y-axis: percent]
t-test inappropriate…
• Comparing the mean weight loss of the two groups is not appropriate here.
• The distributions do not appear to be normally distributed.
• Moreover, there is an extreme outlier (this outlier influences the mean a great deal).
Wilcoxon rank-sum test
• RANK the values, 1 being the least weight loss and 20 being the most weight loss.
• Atkins
  Change (lbs): +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
  Rank: 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
• J. Craig
  Change (lbs): -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
  Rank: 7, 8, 10, 13, 14, 15, 16, 17, 18, 19
• Sum of Atkins ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
• Sum of Jenny Craig’s ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
• Jenny Craig clearly ranked higher!
• P-value* (from computer) = .018
*For details of the statistical test, see appendix of these slides…
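The ranking and rank sums for the diet example can be reproduced with a short Python sketch. The function names are hypothetical. The p-value here uses the large-sample normal approximation to the U statistic without a continuity correction, so it is expected to land near, but not exactly on, the computer-generated value quoted on the slide; the sketch also assumes there are no tied values.

```python
import math

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

def rank_sums(group_a, group_b):
    """Rank pooled values from least weight loss (rank 1) to most; assumes no ties."""
    pooled = sorted(group_a + group_b, reverse=True)  # largest change = least loss
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    return (sum(rank_of[v] for v in group_a), sum(rank_of[v] for v in group_b))

def rank_sum_z_test(t_a, n_a, t_b, n_b):
    """Smaller U statistic with its normal-approximation Z and two-sided p-value."""
    u_a = n_a * n_b + n_a * (n_a + 1) / 2 - t_a
    u_b = n_a * n_b + n_b * (n_b + 1) / 2 - t_b
    u0 = min(u_a, u_b)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u0 - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return u0, z, p_value

t_atkins, t_jenny = rank_sums(atkins, jenny_craig)
u0, z, p_value = rank_sum_z_test(t_atkins, len(atkins), t_jenny, len(jenny_craig))
print("rank sums:", t_atkins, t_jenny)
print("U0:", u0, "Z:", round(z, 2), "approximate two-sided p:", round(p_value, 3))
```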
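Difference in proportions (special case of chi-square test)
Standard error of a proportion =

$$ \sqrt{\frac{p(1-p)}{n}} $$

Standard error of the difference of two proportions =

$$ \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \quad\text{or}\quad \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}, \quad\text{where } p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} $$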
Null distribution of a difference in proportions
Standard error of a proportion can be estimated by (still normally distributed):

$$ \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $$

The variance of a difference is the sum of variances (as with difference in means). Using the same average proportion p in both terms is analogous to pooled variance in the ttest.

$$ \text{Difference of proportions} \sim N\left(p_1 - p_2,\ \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}\right) $$
Difference in proportions test
Null hypothesis: The difference in proportions is 0.

$$ Z = \frac{p_1 - p_2}{\sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}} $$

$$ p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} \quad \text{(just average proportion)} $$

p1 = proportion in group 1
p2 = proportion in group 2
n1 = number in group 1
n2 = number in group 2

Recall, variance of a proportion is p(1-p)/n.
Use average (or pooled) proportion in standard error formula, because under the null hypothesis, groups have equal proportions.
Z follows a normal because the binomial can be approximated with a normal.
Recall case-control example:

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

Absolute risk: Difference in proportions exposed

$$ P(E \mid D) - P(E \mid \sim D) = 15/50 - 8/50 = 30\% - 16\% = 14\% $$

Difference in proportions exposed

$$ Z = \frac{14\% - 0\%}{\sqrt{\frac{.23 \times .77}{50} + \frac{.23 \times .77}{50}}} = \frac{.14}{.084} = 1.67 $$

$$ 95\%\ \text{CI}: 0.14 \pm 1.96 \times .084 = -.03 \text{ to } .31 $$
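The difference-in-proportions Z test can be written as a small Python function. The function name is hypothetical; it uses the pooled (average) proportion in the standard error, as described above, and a two-sided normal p-value.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Z test for a difference in proportions using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # same as (n1*p1 + n2*p2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) / n1 + pooled * (1 - pooled) / n2)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p1 - p2, se, z, p_value

# Stroke example: 15 of 50 cases and 8 of 50 controls were smokers.
diff, se, z, p_value = two_proportion_z_test(15, 50, 8, 50)
print(round(diff, 2), round(se, 3), round(z, 2), round(p_value, 2))
```

Small differences from the hand calculation come from rounding the standard error to .084 before dividing.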
Example 2: Difference in proportions
• Research Question: Are antidepressants a risk factor for suicide attempts in children and adolescents?
Example modified from: “Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults”; Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.
Example 2: Difference in Proportions
• Design: Case-control study
• Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression).
• Statistical question: Is a history of use of antidepressants more common among cases than controls?
What will we actually compare?
• Proportion of cases who used antidepressants in the past vs. proportion of controls who did

Results
                              No. (%) of cases (n=263)   No. (%) of controls (n=1241)
Any antidepressant drug ever        120 (46%)                   448 (36%)

Difference = 46% − 36% = 10%
Is the association statistically significant?
• This 10% difference could reflect a true association or it could be a fluke in this particular sample.
• The question: is 10% bigger or smaller than the expected sampling variability?
Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (= the difference is 0%)
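Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true

$$ \hat{p}_{cases} - \hat{p}_{controls} \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{568}{1504}\left(1 - \frac{568}{1504}\right)}{263} + \frac{\frac{568}{1504}\left(1 - \frac{568}{1504}\right)}{1241}} = .033\right) $$

Also: Computer Simulation Results
Standard error is about 3.3%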
Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 10% between cases and controls.
Hypothesis Testing
Step 4: Calculate a p-value

$$ Z = \frac{.10}{.033} = 3.0; \qquad p = .003 $$

P-value from our simulation…
When we ran this study 1000 times, we got 1 result as big or bigger than 10%.
We also got 3 results as small or smaller than −10%.
From our simulation, we estimate the p-value to be: 4/1000 or .004
Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null.
Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.
What would a lack of statistical significance mean?
• If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation…
263 cases and 1241 controls: standard error is about 3.3%
50 cases and 50 controls: standard error is about 10%
With only 50 cases and 50 controls…
If we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time).
Two-tailed p-value = 17% x 2 = 34%
Practice problem…
An August 2003 research article in
Developmental and Behavioral Pediatrics
reported the following about a sample of UK
kids: when given a choice of a non-branded
chocolate cereal vs. CoCo Pops, 97% (36) of
37 girls and 71% (27) of 38 boys preferred
the CoCo Pops. Is this evidence that girls are
more likely to choose brand-named products?
Answer
1. Hypotheses:
H0: p♀ − p♂ = 0
Ha: p♀ − p♂ ≠ 0 [two-sided]
2. Null distribution of difference of two proportions:

$$ \hat{p}_f - \hat{p}_m \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{63}{75}\left(1-\frac{63}{75}\right)}{37} + \frac{\frac{63}{75}\left(1-\frac{63}{75}\right)}{38}} = \sqrt{\frac{.84(.16)}{37} + \frac{.84(.16)}{38}} = .085\right) $$

Null says p’s are equal so estimate standard error using overall observed p
3. Observed difference in our experiment = .97 − .71 = .26
4. Calculate the p-value of what you observed:

$$ Z = \frac{.26 - 0}{.085} = 3.06 $$

data _null_;
pval=(1-probnorm(3.06))*2; /* two-sided p-value */
put pval;
run;
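Key two-sample Hypothesis Tests…
Test for Ho: μx − μy = 0 (σ² unknown, but roughly equal), where n = n_x + n_y:

$$ t_{n-2} = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}}; \qquad s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n-2} $$

Test for Ho: p1 − p2 = 0:

$$ Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{(p)(1-p)}{n_1} + \frac{(p)(1-p)}{n_2}}}; \qquad p = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} $$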
Corresponding confidence intervals…
For a difference in means, 2 independent samples (σ²’s unknown but roughly equal):

$$ (\bar{x} - \bar{y}) \pm t_{n-2,\ \alpha/2} \times \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}} $$

For a difference in proportions, 2 independent samples:

$$ (\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2} \times \sqrt{\frac{(p)(1-p)}{n_1} + \frac{(p)(1-p)}{n_2}} $$
Appendix: details of rank-sum test…
Wilcoxon Rank-sum test
Rank all of the observations in order from 1 to n.
T1 is the sum of the ranks from the smaller population (n1)
T2 is the sum of the ranks from the larger population (n2)

$$ U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1 $$

$$ U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2 $$

$$ U_0 = \min(U_1, U_2) $$

For n1 > 10, n2 > 10:

$$ Z = \frac{U_0 - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} $$

Find P(U ≤ U0) in Mann-Whitney U tables, with n2 = the bigger of the 2 populations.
Example
• For example, if team 1 and team 2 (two gymnastic teams) are competing, and the judges rank all the individuals in the competition, how can you tell if team 1 has done significantly better than team 2 or vice versa?
Answer
• Intuition: under the null hypothesis of no difference between the two groups…
  • If n1=n2, the sums T1 and T2 should be equal.
  • But if n1≠n2, then T2 (n2=bigger group) should automatically be bigger. But how much bigger under the null?
• For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team 1 (X) and team 2 don’t differ in talent, the ranks ought to be spread evenly among the two groups, e.g.…
• 1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team 1 ranks 3rd, 7th, and 11th)
T1 = sum of ranks of group 1 (smaller)
T2 = sum of ranks of group 2 (larger)

$$ T_1 + T_2 = \sum_{i=1}^{n_1+n_2} i = \frac{(n_1+n_2)(n_1+n_2+1)}{2} = \frac{n_1^2 + 2n_1n_2 + n_2^2 + n_1 + n_2}{2} = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2 $$

Remember this?

$$ \sum_{i=1}^{n_1} i = \frac{n_1(n_1+1)}{2} \quad \text{(sum of within-group ranks for smaller group)} $$

$$ \sum_{i=1}^{n_2} i = \frac{n_2(n_2+1)}{2} \quad \text{(sum of within-group ranks for larger group)} $$

e.g., here:

$$ T_1 + T_2 = \sum_{i=1}^{13} i = \frac{(13)(14)}{2} = 91 = 55 + 6 + 30 $$

Take-home point:

$$ T_1 + T_2 = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2 $$
$$ \sum_{i=1}^{10} i = \frac{10(11)}{2} = 55, \qquad \sum_{i=1}^{3} i = \frac{3(4)}{2} = 6, \qquad 55 - 6 = 49 $$

T1 = 3 + 7 + 11 = 21
T2 = 1 + 2 + 4 + 5 + 6 + 8 + 9 + 10 + 12 + 13 = 70
70 − 21 = 49 Magic!
The difference between the sums of the within-group ranks of the two groups is 49.
The difference between the sums of the ranks of the two groups is also equal to 49 if ranks are evenly interspersed (null is true).
It turns out that, if the null hypothesis is true, the difference between the larger-group and smaller-group sums of within-group ranks is expected to equal the difference between T2 and T1.
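Under the null (from slide 24):

$$ T_2 - T_1 = \frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2} $$

and (from slide 23):

$$ T_2 + T_1 = \frac{n_2(n_2+1)}{2} + \frac{n_1(n_1+1)}{2} + n_1 n_2 $$

Adding and subtracting these two equations:

$$ T_2 = \frac{n_2(n_2+1)}{2} + \frac{n_1 n_2}{2}, \qquad T_1 = \frac{n_1(n_1+1)}{2} + \frac{n_1 n_2}{2} $$

Define new statistics:

$$ \text{define } U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2 $$

$$ \text{define } U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1 $$

Their sum should equal n1n2.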
Here, under null:
U2 = 55 + 30 − 70 = 15
U1 = 6 + 30 − 21 = 15
U2 + U1 = 30
• ∴ under null hypothesis, U1 should equal U2:

$$ E(U_2 - U_1) = E\left[\left(\frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}\right) - (T_2 - T_1)\right] = 0 $$

The U’s should be equal to each other and will equal n1n2/2:
U1 + U2 = n1n2
Under null hypothesis, U1 = U2 = U0
∴ E(U1 + U2) = 2E(U0) = n1n2
E(U1) = E(U2) = E(U0) = n1n2/2
So, the test statistic here is not quite the difference in the sum-of-ranks of the 2 groups.
It’s the smaller observed U value: U0
For small n’s, take U0, and get p-value directly from a U table.
For large enough n’s (>10 per group)…

$$ Z = \frac{U_0 - E(U_0)}{\sqrt{Var(U_0)}} = \frac{U_0 - \frac{n_1 n_2}{2}}{\sqrt{Var(U_0)}} $$

$$ E(U_0) = \frac{n_1 n_2}{2} $$

$$ Var(U_0) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} $$
Add observed data to the example…
Example: If the girls on the two gymnastics teams were ranked as follows:
Team 1: 1, 5, 7 (observed T1 = 13)
Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13 (observed T2 = 78)
Are the teams significantly different?
Total sum of ranks = 13*14/2 = 91; n1n2 = 3*10 = 30
Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null) and U0 = 15
U1 = 30 + 6 − 13 = 23
U2 = 30 + 55 − 78 = 7
∴ U0 = 7
Not quite statistically significant in U table… p = .1084 (see attached); x2 for two-tailed test
Example problem 2
A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig (low-cal, low-fat). The following weight changes were obtained; note they are very skewed because someone lost 100 pounds; the mean loss for Atkins is going to look higher because of the bozo, but does that mean the diet is better overall? Conduct a Mann-Whitney U test to compare ranks.

Weight change (lbs):
Atkins   Jenny Craig
-100       -11
-8         -15
-4         -5
+5         +6
+8         -20
+2

Answer
Ranks:
Atkins   Jenny Craig
1          4
5          3
7          6
9          10
11         2
8

Sum of ranks for JC = 25 (n=5)
Sum of ranks for Atkins = 41 (n=6)
n1n2 = 5*6 = 30
Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 and U0 = 15
U1 = 30 + 15 − 25 = 20
U2 = 30 + 21 − 41 = 10
U0 = 10; n1 = 5, n2 = 6
Go to Mann-Whitney chart… p = .2143 x 2 = .43
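For small groups, the table look-up can be replaced by brute-force enumeration, sketched below with hypothetical function names. Under the null hypothesis every assignment of ranks to the smaller group is equally likely, so P(U ≤ U0) is the share of assignments with a U at least as small. The sketch assumes no tied ranks.

```python
from itertools import combinations

def u_statistics(t1, n1, t2, n2):
    """U1, U2 and the smaller of the two (U0), from the rank sums T1 and T2."""
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) // 2 - t2
    return u1, u2, min(u1, u2)

def exact_lower_tail(u0, n1, n2):
    """One-tailed P(U <= u0) under the null, by enumerating rank assignments."""
    hits = count = 0
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        # U for this assignment: rank sum minus its smallest possible value.
        # The null distribution of U is symmetric, so either group's U works.
        u = sum(ranks) - n1 * (n1 + 1) // 2
        count += 1
        hits += (u <= u0)
    return hits / count

# Gymnastics example: T1 = 13 (n1 = 3), T2 = 78 (n2 = 10).
u1, u2, u0 = u_statistics(13, 3, 78, 10)
print(u1, u2, u0, round(exact_lower_tail(u0, 3, 10), 4))

# Diet example: Jenny Craig T = 25 (n = 5), Atkins T = 41 (n = 6).
u1, u2, u0 = u_statistics(25, 5, 41, 6)
print(u1, u2, u0, round(exact_lower_tail(u0, 5, 6), 4))
```

Doubling the one-tailed probability gives the two-tailed p-value, as in the worked answers.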

More Related Content

PPTX
PPT
Two sample t-test
PPTX
Estimation and confidence interval
PDF
Posthoc
PPT
Estimation and hypothesis testing 1 (graduate statistics2)
PPT
Inferential Statistics
PPTX
Das20502 chapter 1 descriptive statistics
PPT
T test statistics
Two sample t-test
Estimation and confidence interval
Posthoc
Estimation and hypothesis testing 1 (graduate statistics2)
Inferential Statistics
Das20502 chapter 1 descriptive statistics
T test statistics

What's hot (20)

PPTX
Analysis of variance (ANOVA) everything you need to know
PPT
In Anova
PPTX
Lecture 6. univariate and bivariate analysis
PPTX
P-Value.pptx
PDF
Power Analysis and Sample Size Determination
PPTX
Analysis of variance (ANOVA)
PPT
Chi square mahmoud
PDF
Kruskal Wallis test, Friedman test, Spearman Correlation
PPTX
Sign Test
PPT
Anova single factor
PPT
Analysis of variance
PPTX
Sampling Distributions
PPTX
T distribution | Statistics
PPTX
Chi square test final
DOCX
Spss paired samples t test Reporting
PPTX
What is a Single Sample Z Test?
ODP
Correlation
PPTX
biostatistics basic
PPTX
How to determine sample size
Analysis of variance (ANOVA) everything you need to know
In Anova
Lecture 6. univariate and bivariate analysis
P-Value.pptx
Power Analysis and Sample Size Determination
Analysis of variance (ANOVA)
Chi square mahmoud
Kruskal Wallis test, Friedman test, Spearman Correlation
Sign Test
Anova single factor
Analysis of variance
Sampling Distributions
T distribution | Statistics
Chi square test final
Spss paired samples t test Reporting
What is a Single Sample Z Test?
Correlation
biostatistics basic
How to determine sample size
Ad

Viewers also liked (20)

PPT
Introduction to t-tests (statistics)
PPT
PPTX
What is a paired samples t test
PPTX
The t Test for Two Independent Samples
PPT
T Test For Two Independent Samples
PPT
香港六合彩
DOCX
T test for two independent samples
PPT
Unit 5 lesson 2
PPTX
Statistics
PPTX
Spss2 comparing means_two_groups
PPT
Aron chpt 9 ed t test independent samples
PDF
(마더세이프라운드)임상연구에 필요한 통계 분석
PPT
통계적방법론발표Ppt Kmlikejy
PDF
(마더세이프라운드) 임상연구에 필요한 기초 통계
PDF
12.세표본 이상의 평균비교
PDF
11.두표본의 평균비교
PDF
Stat 130 chi-square goodnes-of-fit test
PPT
Factorial design
PDF
R 기초 : R Basics
Introduction to t-tests (statistics)
What is a paired samples t test
The t Test for Two Independent Samples
T Test For Two Independent Samples
香港六合彩
T test for two independent samples
Unit 5 lesson 2
Statistics
Spss2 comparing means_two_groups
Aron chpt 9 ed t test independent samples
(마더세이프라운드)임상연구에 필요한 통계 분석
통계적방법론발표Ppt Kmlikejy
(마더세이프라운드) 임상연구에 필요한 기초 통계
12.세표본 이상의 평균비교
11.두표본의 평균비교
Stat 130 chi-square goodnes-of-fit test
Factorial design
R 기초 : R Basics
Ad

The two sample t-test

  • 2. Binary or categorical outcomes (proportions)
    Outcome variable: binary or categorical (e.g., fracture, yes/no)
    Are the observations correlated?
    Independent:
      • Chi-square test: compares proportions between two or more groups
      • Relative risks: odds ratios or risk ratios
      • Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios
    Correlated:
      • McNemar’s chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
      • Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
      • GEE modeling: multivariate regression technique for a binary outcome when groups are correlated
    Alternatives to the chi-square test if cells are sparse:
      • Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5)
      • McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5)
  • 3. Recall: The odds ratio (two samples = cases and controls)

                        Smoker (E)   Non-smoker (~E)   Total
    Stroke (D)              15             35            50
    No stroke (~D)           8             42            50

    $OR = \frac{ad}{bc} = \frac{15 \times 42}{35 \times 8} = 2.25$
    Interpretation: the odds of stroke are 2.25-fold higher in smokers than in non-smokers.
  • 4. Inferences about the odds ratio…
      • Does the sampling distribution follow a normal distribution?
      • What is the standard error?
  • 5. Simulation…
      1. In SAS, assume an infinite population of cases and controls with an equal proportion of smokers (exposure), p=.23 (UNDER THE NULL!)
      2. Use the random binomial function to randomly select n=50 cases and n=50 controls, each with a p=.23 chance of being a smoker.
      3. Calculate the observed odds ratio for the resulting 2x2 table.
      4. Repeat this 1000 times (or some large number of times).
      5. Observe the distribution of odds ratios under the null hypothesis.
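    The five steps above can be sketched in Python rather than SAS. This is a minimal illustration, not the deck's own program: the function name, the seed, and the choice to skip tables containing a zero cell (where the odds ratio is undefined) are all assumptions of the sketch.

```python
import math
import random

def simulate_null_odds_ratios(n_cases=50, n_controls=50, p_exposed=0.23,
                              n_reps=1000, seed=1):
    """Simulate sample odds ratios when exposure is equally common in both groups."""
    rng = random.Random(seed)
    odds_ratios = []
    for _ in range(n_reps):
        # Binomial draws: number of exposed (smokers) among cases and controls
        a = sum(rng.random() < p_exposed for _ in range(n_cases))     # exposed cases
        c = sum(rng.random() < p_exposed for _ in range(n_controls))  # exposed controls
        b = n_cases - a       # unexposed cases
        d = n_controls - c    # unexposed controls
        if 0 in (a, b, c, d):
            continue  # OR undefined; skipping such tables is an assumption of this sketch
        odds_ratios.append((a * d) / (b * c))
    return odds_ratios

ors = simulate_null_odds_ratios()
log_ors = [math.log(x) for x in ors]
mean_log = sum(log_ors) / len(log_ors)
# Empirical standard error of ln(OR) = SD of the simulated ln(OR) values
sd_log = math.sqrt(sum((x - mean_log) ** 2 for x in log_ors) / (len(log_ors) - 1))
print(f"median OR = {sorted(ors)[len(ors) // 2]:.2f}, SD of ln(OR) = {sd_log:.2f}")
```

    The simulated odds ratios are right-skewed, while their logs are roughly symmetric around 0, which is why inference is carried out on ln(OR).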
  • 8. Properties of the lnOR: from the simulation, we can get the empirical standard error (~0.5) and p-value (~.10).
  • 10. Inferences about the ln(OR)

                        Smoker (E)   Non-smoker (~E)   Total
    Stroke (D)              15             35            50
    No stroke (~D)           8             42            50

    $OR = 2.25$, $\ln(OR) = 0.81$
    $Z = \frac{\ln(2.25) - 0}{\sqrt{\frac{1}{8} + \frac{1}{15} + \frac{1}{35} + \frac{1}{42}}} = \frac{0.81}{0.494} = 1.64$; p=.10
  • 11. Confidence interval…

                        Smoker (E)   Non-smoker (~E)   Total
    Stroke (D)              15             35            50
    No stroke (~D)           8             42            50

    95% CI for $\ln(OR)$: $0.81 \pm 1.96 \times 0.494 = (-0.16, 1.78)$
    95% CI for $OR$: $(e^{-0.16}, e^{1.78}) = (0.85, 5.92)$
    Final answer: 2.25 (0.85, 5.92)
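    The test and interval for the stroke/smoking table can be checked with a short Python sketch. The function name and argument order (a, b = exposed and unexposed cases; c, d = exposed and unexposed controls) are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def odds_ratio_inference(a, b, c, d, conf=0.95):
    """Wald test and confidence interval for an odds ratio, computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)    # standard error of ln(OR)
    z = (log_or - 0) / se                            # null value of ln(OR) is 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    ci = (math.exp(log_or - z_crit * se), math.exp(log_or + z_crit * se))
    return odds_ratio, se, z, p_value, ci

# Stroke cases: 15 smokers, 35 non-smokers; controls: 8 smokers, 42 non-smokers
odds_ratio, se, z, p_value, ci = odds_ratio_inference(15, 35, 8, 42)
print(f"OR={odds_ratio:.2f} SE={se:.3f} Z={z:.2f} p={p_value:.2f} "
      f"CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

    Because the interval is built on the log scale and then exponentiated, it is not symmetric around 2.25.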
  • 12. Practice problem: Suppose the following data were collected in a case-control study of brain tumor and cell phone usage:

                                Brain tumor   No brain tumor
    Own a cell phone                 20             60
    Don’t own a cell phone           10             40

    Is there sufficient evidence for an association between cell phones and brain tumor?
  • 13. Answer
      1. What is your null hypothesis? Null hypothesis: OR = 1.0; lnOR = 0. Alternative hypothesis: OR ≠ 1.0; lnOR ≠ 0.
      2. What is your null distribution? $\ln OR \sim N\left(0, \sqrt{\frac{1}{10} + \frac{1}{20} + \frac{1}{60} + \frac{1}{40}}\right)$; $SD(\ln OR) = \sqrt{\frac{1}{10} + \frac{1}{20} + \frac{1}{60} + \frac{1}{40}} = .44$
      3. Empirical evidence: OR = (20*40)/(60*10) = 800/600 = 1.33 ∴ lnOR = .288
      4. Z = (.288-0)/.44 = .65; p-value = P(Z>.65 or Z<-.65) = .26*2 = .52
      5. Not enough evidence to reject the null hypothesis of no association.
    TWO-SIDED TEST: it would be just as extreme if the sample lnOR were .65 standard deviations or more below the null mean.
  • 14. Key measures of relative risk: 95% CIs for OR and RR
    For an odds ratio, 95% confidence limits:
    $\left( OR \cdot \exp\left[-1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right],\ OR \cdot \exp\left[+1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right] \right)$
    For a risk ratio, 95% confidence limits:
    $\left( RR \cdot \exp\left[-1.96\sqrt{\frac{1-a/(a+b)}{a}+\frac{1-c/(c+d)}{c}}\right],\ RR \cdot \exp\left[+1.96\sqrt{\frac{1-a/(a+b)}{a}+\frac{1-c/(c+d)}{c}}\right] \right)$
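    The risk-ratio limits can be sketched in Python as follows. The deck gives no risk-ratio data, so the counts below are hypothetical, and the layout (a, b = diseased and non-diseased among the exposed; c, d = the same among the unexposed) is an assumption chosen to match the a/(a+b) and c/(c+d) terms in the formula.

```python
import math

def risk_ratio_ci(a, b, c, d, z_crit=1.96):
    """Risk ratio with 95% confidence limits computed on the log scale."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR), as in the formula above
    se = math.sqrt((1 - risk_exposed) / a + (1 - risk_unexposed) / c)
    lower = rr * math.exp(-z_crit * se)
    upper = rr * math.exp(+z_crit * se)
    return rr, se, lower, upper

# Hypothetical cohort: 30 of 100 exposed and 15 of 100 unexposed develop disease
rr, se, lower, upper = risk_ratio_ci(30, 70, 15, 85)
print(f"RR={rr:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```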
  • 15. Continuous outcome (means)
    Outcome variable: continuous (e.g., pain scale, cognitive function)
    Are the observations independent or correlated?
    Independent:
      • T-test: compares means between two independent groups
      • ANOVA: compares means between more than two independent groups
      • Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
      • Linear regression
    Correlated:
      • Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
      • Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
      • Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups
    Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
      • Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
      • Wilcoxon rank-sum test (=Mann-Whitney U test): non-parametric alternative to the t-test
      • Kruskal-Wallis test: non-parametric alternative to ANOVA
      • Spearman rank correlation coefficient
  • 17. The two-sample T-test
      • Is the difference in means that we observe between two groups more than we’d expect to see based on chance alone?
  • 18. The standard error of the difference of two means
    $\sigma_{\bar{x}-\bar{y}} = \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}$
    **First add the variances and then take the square root of the sum to get the standard error.
    Recall, Var(A−B) = Var(A) + Var(B) if A and B are independent!
  • 19. Shown by simulation:
    One sample of 30 (with SD=5): $SE = \frac{5}{\sqrt{30}} = .91$
    One sample of 30 (with SD=5): $SE = \frac{5}{\sqrt{30}} = .91$
    Difference of the two samples: $SE(\text{diff}) = \sqrt{\frac{25}{30} + \frac{25}{30}} = 1.29$
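    A simulation of this kind might look like the Python sketch below. The population mean of 0, the number of repetitions, and the seed are assumptions; only n=30 and SD=5 come from the slide.

```python
import math
import random
import statistics

def simulate_standard_errors(n=30, m=30, sd=5.0, n_reps=2000, seed=7):
    """Empirical SE of one sample mean and of the difference of two sample means."""
    rng = random.Random(seed)
    means_x, diffs = [], []
    for _ in range(n_reps):
        x_bar = sum(rng.gauss(0, sd) for _ in range(n)) / n
        y_bar = sum(rng.gauss(0, sd) for _ in range(m)) / m
        means_x.append(x_bar)
        diffs.append(x_bar - y_bar)
    # The SD across repetitions estimates the standard error
    return statistics.stdev(means_x), statistics.stdev(diffs)

se_mean, se_diff = simulate_standard_errors()
theory_mean = 5 / math.sqrt(30)                 # 0.91
theory_diff = math.sqrt(25 / 30 + 25 / 30)      # 1.29
print(f"one mean: simulated {se_mean:.2f} vs theory {theory_mean:.2f}")
print(f"difference: simulated {se_diff:.2f} vs theory {theory_diff:.2f}")
```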
  • 20. Distribution of differences
    If $\bar{X}_n$ and $\bar{Y}_m$ are the averages of n and m subjects, respectively:
    $\bar{X}_n - \bar{Y}_m \sim N\left(\mu_x - \mu_y,\ \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}\right)$
  • 21. But…
      • As before, you usually have to use the sample SD, since you won’t know the true SD ahead of time…
      • So, again, the distribution becomes a T-distribution...
  • 22. Estimated standard error of the difference….
    $\sigma_{\bar{x}-\bar{y}} \approx \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}$
    Just plug in the sample standard deviations for each group.
  • 23. Case 1: un-pooled variance Question: What are your degrees of freedom here? Answer: Not obvious!
  • 24. Case 1: ttest, unpooled variances
    $T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}} \sim t_\nu$
    It is complicated to figure out the degrees of freedom here! A good approximation is given as df ≈ harmonic mean, $\frac{2}{\frac{1}{n} + \frac{1}{m}}$ (or SAS will tell you!).
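    As a rough illustration in Python, the sketch below computes the unpooled t statistic together with two degrees-of-freedom approximations: the harmonic mean given on the slide, and the Welch-Satterthwaite formula. The latter does not appear in the deck; it is swapped in here as the approximation statistical software commonly reports for the unequal-variance test. The inputs are the summary statistics from the deck's SAT example, and the function name is hypothetical.

```python
import math

def unpooled_t(mean_x, sd_x, n, mean_y, sd_y, m):
    """Unpooled two-sample t statistic with two df approximations."""
    var_x, var_y = sd_x ** 2 / n, sd_y ** 2 / m    # squared SE of each mean
    t_stat = (mean_x - mean_y) / math.sqrt(var_x + var_y)
    df_harmonic = 2 / (1 / n + 1 / m)              # approximation from the slide
    # Welch-Satterthwaite approximation (not from the deck)
    df_welch = (var_x + var_y) ** 2 / (var_x ** 2 / (n - 1) + var_y ** 2 / (m - 1))
    return t_stat, df_harmonic, df_welch

# Men: 436 (SD 77), n=30; women: 416 (SD 81), n=30
t_stat, df_harmonic, df_welch = unpooled_t(436, 77, 30, 416, 81, 30)
print(f"T={t_stat:.2f}, harmonic-mean df={df_harmonic:.0f}, Welch df={df_welch:.1f}")
```

    With equal group sizes and similar SDs, the Welch value lands just under n+m−2, so the unpooled and pooled tests give nearly the same answer here.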
  • 25. Case 2: pooled variance
    If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).
    Pooling variances:
    $s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}{n-1}$ and $(n-1)s_x^2 = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2$
    $s_y^2 = \frac{\sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{m-1}$ and $(m-1)s_y^2 = \sum_{i=1}^{m}(y_i - \bar{y}_m)^2$
    $\therefore s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2} = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$
    The denominator, n+m−2, is the degrees of freedom!
  • 26. Estimated standard error (using pooled variance estimate)
    $\sigma_{\bar{x}-\bar{y}} \approx \sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}$, where $s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$
    The degrees of freedom are n+m−2.
  • 27. Case 2: ttest, pooled variances
    $T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}} \sim t_{n+m-2}$
    $s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$
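  • 28. Alternate calculation formula: ttest, pooled variance
    $T = \frac{\bar{X}_n - \bar{Y}_m}{s_p\sqrt{\frac{m+n}{mn}}} \sim t_{n+m-2}$
    because $\frac{s_p^2}{m} + \frac{s_p^2}{n} = s_p^2\left(\frac{1}{m} + \frac{1}{n}\right) = s_p^2\left(\frac{n}{mn} + \frac{m}{mn}\right) = s_p^2\left(\frac{m+n}{mn}\right)$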
  • 29. Pooled vs. unpooled variance
    Rule of thumb: use pooled unless you have a reason not to.
    Pooled gives you more degrees of freedom.
    Pooled has an extra assumption: variances are equal between the two groups.
    SAS automatically tests this assumption for you (“Equality of Variances” test). If p<.05, this suggests unequal variances, and it is better to use the unpooled ttest.
  • 30. Example: two-sample t-test
      • In 1980, some researchers reported that “men have more mathematical ability than women” as evidenced by the 1979 SATs, where a random sample of 30 male adolescents had a mean score ± 1 standard deviation of 436±77 and a random sample of 30 female adolescents scored lower: 416±81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors’ conclusions?
  • 31. Data Summary

                        n    Sample mean   Sample standard deviation
    Group 1: women      30       416                  81
    Group 2: men        30       436                  77
  • 32. Two-sample t-test
    1. Define your hypotheses (null, alternative)
    H0: ♂−♀ math SAT = 0
    Ha: ♂−♀ math SAT ≠ 0 [two-sided]
  • 33. Two-sample t-test
    2. Specify your null distribution: F and M have similar standard deviations/variances, so make a “pooled” estimate of variance.
    $s_p^2 = \frac{(n-1)s_m^2 + (m-1)s_f^2}{n+m-2} = \frac{(29)77^2 + (29)81^2}{58} = 6245$
    $\bar{M}_{30} - \bar{F}_{30} \sim T_{58}\left(0, \sqrt{\frac{6245}{30} + \frac{6245}{30}}\right)$, where $\sqrt{\frac{6245}{30} + \frac{6245}{30}} = 20.4$
  • 34. Two-sample t-test 3. Observed difference in our experiment = 20 points
  • 35. Two-sample t-test
    4. Calculate the p-value of what you observed
    $T_{58} = \frac{20 - 0}{20.4} = .98$
    data _null_; pval=(1-probt(.98, 58))*2; put pval; run;
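    For readers without SAS, the whole SAT calculation can be sketched in Python using only the standard library. Since the standard library has no t-distribution CDF, the p-value is obtained here by integrating the t density numerically with Simpson's rule; that choice, and all function names, are assumptions of this sketch.

```python
import math

def pooled_t(mean_x, sd_x, n, mean_y, sd_y, m):
    """Pooled-variance two-sample t statistic; returns (sp2, se, t, df)."""
    sp2 = ((n - 1) * sd_x ** 2 + (m - 1) * sd_y ** 2) / (n + m - 2)
    se = math.sqrt(sp2 / n + sp2 / m)
    return sp2, se, (mean_x - mean_y) / se, n + m - 2

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)

# Men: 436 (SD 77), n=30; women: 416 (SD 81), n=30
sp2, se, t_stat, df = pooled_t(436, 77, 30, 416, 81, 30)
p_value = t_two_sided_p(t_stat, df)
print(f"sp2={sp2:.0f} SE={se:.1f} T={t_stat:.2f} df={df} p={p_value:.2f}")
```

    A two-sided p-value of roughly one in three means a 20-point difference is well within what chance alone produces with samples of this size.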
  • 36. Example 2: Difference in means
      • Example: Rosenthal, R. and Jacobson, L. (1966) Teachers’ expectancies: Determinants of pupils’ I.Q. gains. Psychological Reports, 19, 115-118.
  • 37. The Experiment (note: exact numbers have been altered)
      • Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90).
      • Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as “academic bloomers” (n=18).
      • BUT: the children on the teachers’ lists had actually been randomly assigned to the list.
      • At the end of the year, the same I.Q. test was re-administered.
  • 38. Example 2
      • Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group?
    What will we actually compare?
      • One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.
  • 39. Results: change in IQ score, mean (SD)
    “Academic bloomers” (n=18): 12.2 (2.0) points
    Controls (n=72): 8.2 (2.0) points
    Difference = 4 points
    The standard deviation of change scores was 2.0 in both groups. This affects statistical significance…
  • 40. What does a 4-point difference mean?
      • Before we perform any formal statistical analysis on these data, we already have a lot of information.
      • Look at the basic numbers first; THEN consider statistical significance as a secondary guide.
  • 41. Is the association statistically significant?
      • This 4-point difference could reflect a true effect or it could be a fluke.
      • The question: is a 4-point difference bigger or smaller than the expected sampling variability?
  • 42. Hypothesis testing
    Step 1: Assume the null hypothesis.
    Null hypothesis: There is no difference between “academic bloomers” and normal students (= the difference is 0).
  • 43. Hypothesis Testing
    Step 2: Predict the sampling variability assuming the null hypothesis is true.
      • These predictions can be made by mathematical theory or by computer simulation.
  • 44. Hypothesis Testing
    Step 2: Predict the sampling variability assuming the null hypothesis is true (math theory):
    $s_p^2 = 4.0$
    $\hat{\mu}_{\text{“gifted”}} - \hat{\mu}_{\text{control}} \sim T_{88}\left(0, \sqrt{\frac{4}{18} + \frac{4}{72}} = 0.52\right)$
  • 45. Hypothesis Testing
    Step 2: Predict the sampling variability assuming the null hypothesis is true (computer simulation):
      • In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
      • I used computer simulation to take 1000 samples of 18 treated and 72 controls.
  • 47. 3. Empirical data Observed difference in our experiment = 12.2-8.2 = 4.0
  • 48. 4. P-value
    $t_{88} = \frac{12.2 - 8.2}{.52} = \frac{4}{.52} \approx 7.7$; p-value <.0001
    A t-curve with 88 df has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).
  • 49. Visually… If we ran this study 1000 times, we wouldn’t expect to get even 1 result as big as a difference of 4 (under the null hypothesis).
  • 50. 5. Reject null!
      • Conclusion: I.Q. scores can bias expectancies in the teachers’ minds and cause them to unintentionally treat “bright” students differently from those seen as less bright.
  • 51. Confidence interval (more information!!)
    95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)
    A t-curve with 88 df has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).
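    The standard error and interval for the IQ example can be reproduced with a few lines of Python. This is a sketch: the helper name is hypothetical, and the critical value 1.99 is taken from the slide rather than computed.

```python
import math

def pooled_se(sd_x, n, sd_y, m):
    """Standard error of a difference in means using the pooled variance."""
    sp2 = ((n - 1) * sd_x ** 2 + (m - 1) * sd_y ** 2) / (n + m - 2)
    return math.sqrt(sp2 / n + sp2 / m)

diff = 12.2 - 8.2                      # observed difference in mean IQ change
se = pooled_se(2.0, 18, 2.0, 72)       # 18 "bloomers", 72 controls, SD 2.0 each
t_crit = 1.99                          # 95% cut-off for 88 df, as quoted on the slide
ci = (diff - t_crit * se, diff + t_crit * se)
print(f"SE={se:.3f}, t={diff / se:.1f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

    Carrying the unrounded standard error through gives limits that agree with the slide's (3.0, 5.0) after rounding to one decimal place.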
  • 52. What if our standard deviation had been higher?
      • The standard deviations for change scores in treatment and control were each 2.0. What if change scores had been much more variable, say a standard deviation of 10.0 (for both)?
  • 53. Std. dev. in change scores = 2.0: standard error is 0.52. Std. dev. in change scores = 10.0: standard error is 2.58.
  • 54. With a std. dev. of 10.0… LESS STATISTICAL POWER! Standard error is 2.58. If we ran this study 1000 times, we would expect to get ≥+4.0 or ≤–4.0 12% of the time. P-value=.12
  • 55. Don’t forget: The paired T-test
      • Did the control group in the previous experiment improve at all during the year?
      • Do not apply a two-sample ttest to answer this question!
      • After-Before yields a single sample of differences…
      • “within-group” rather than “between-group” comparison…
  • 56. Continuous outcome (means): same table of tests as slide 15.
  • 57. Data Summary

                        n    Sample mean   Sample standard deviation
    Group 1: Change     72      +8.2                 2.0
  • 58. Did the control group in the previous experiment improve at all during the year?
    $t_{71} = \frac{8.2 - 0}{2/\sqrt{72}} = \frac{8.2}{.24} \approx 35$; p-value <.0001
  • 59. Normality assumption of the ttest
      • If the distribution of the trait is normal, it is fine to use a t-test.
      • But if the underlying distribution is not normal and the sample size is small, the Central Limit Theorem has not yet kicked in, and you cannot use a ttest. (Rule of thumb for a large enough sample: n>30 per group if not too skewed; n>100 if the distribution is really skewed.)
      • Note: the ttest is very robust to violations of the normality assumption!
  • 60. Alternative tests when normality is violated: Non-parametric tests
  • 61. Continuous outcome (means): same table of tests as slide 15.
  • 62. Non-parametric tests
      • t-tests require your outcome variable to be normally distributed (or close enough), for small samples.
      • Non-parametric tests are based on RANKS instead of means and standard deviations (=“population parameters”).
  • 63. Example: non-parametric tests
    10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig
    Hypothetical RESULTS: the Atkins group loses an average of 34.5 lbs. The J. Craig group loses an average of 18.5 lbs.
    Conclusion: Atkins is better?
  • 64. Example: non-parametric tests
    BUT, take a closer look at the individual data…
    Atkins, change in weight (lbs): +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
    J. Craig, change in weight (lbs): -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
  • 65. Histogram, Jenny Craig: weight change (x-axis, -30 to +20 lbs) vs. percent (y-axis).
  • 66. Histogram, Atkins: weight change (x-axis, -300 to +20 lbs) vs. percent (y-axis).
  • 67. t-test inappropriate…
      • Comparing the mean weight loss of the two groups is not appropriate here.
      • The distributions do not appear to be normally distributed.
      • Moreover, there is an extreme outlier (this outlier influences the mean a great deal).
  • 68. Wilcoxon rank-sum test
      • RANK the values, 1 being the least weight loss and 20 being the most weight loss.
      • Atkins: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
        Ranks: 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
      • J. Craig: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
        Ranks: 7, 8, 10, 13, 14, 15, 16, 17, 18, 19
  • 69. Wilcoxon rank-sum test
      • Sum of Atkins ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
      • Sum of Jenny Craig’s ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
      • Jenny Craig clearly ranked higher!
      • P-value* (from computer) = .018
    *For details of the statistical test, see appendix of these slides…
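    The ranking and rank sums can be reproduced with the Python sketch below, which also applies the large-sample normal approximation described in the appendix of these slides. The variable names are illustrative, the code assumes no tied values (true for these data), and the approximate p-value is not expected to match the computer-reported .018 exactly, since software may use an exact or continuity-corrected method.

```python
import math
from statistics import NormalDist

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

# Rank 1 = least weight loss (largest change value); assumes no ties
pooled = sorted(atkins + jenny_craig, reverse=True)
rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}

t_atkins = sum(rank_of[v] for v in atkins)        # rank sum, Atkins
t_jc = sum(rank_of[v] for v in jenny_craig)       # rank sum, Jenny Craig

n1, n2 = len(atkins), len(jenny_craig)
u_atkins = n1 * n2 + n1 * (n1 + 1) / 2 - t_atkins
u_jc = n1 * n2 + n2 * (n2 + 1) / 2 - t_jc
u0 = min(u_atkins, u_jc)                          # test statistic: the smaller U

# Normal approximation: E(U0) = n1*n2/2, Var(U0) = n1*n2*(n1+n2+1)/12
z = (u0 - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
p_approx = 2 * NormalDist().cdf(z)                # z <= 0 because u0 is the smaller U
print(f"T(Atkins)={t_atkins}, T(JC)={t_jc}, U0={u0:.0f}, Z={z:.2f}, p~{p_approx:.3f}")
```

    Note how the −300 outlier contributes only one rank (20) to the Atkins sum, which is why the rank-based test is not dominated by it.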
  • 70. Binary or categorical outcomes (proportions): same table of tests as slide 2.
  • 71. Difference in proportions (special case of chi-square test)
  • 72. Null distribution of a difference in proportions
    Standard error of a proportion = $\sqrt{\frac{p(1-p)}{n}}$
    Standard error can be estimated by $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ (still normally distributed).
    Standard error of the difference of two proportions = $\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ or $\sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}$, where $p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
    The pooled form is analogous to pooled variance in the ttest.
    The variance of a difference is the sum of variances (as with difference in means).
  • 73. Null distribution of a difference in proportions
    Difference of proportions $\sim N\left(p_1 - p_2,\ \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}\right)$
  • 74. Difference in proportions test
    Null hypothesis: The difference in proportions is 0.
    $Z = \frac{p_1 - p_2}{\sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}}}$
    $p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$ (just the average proportion)
    $p_1$ = proportion in group 1; $p_2$ = proportion in group 2; $n_1$ = number in group 1; $n_2$ = number in group 2
    Recall, the variance of a proportion is p(1-p)/n.
    Use the average (or pooled) proportion in the standard error formula, because under the null hypothesis, groups have equal proportions.
    Z follows a normal distribution because the binomial can be approximated with a normal.
  • 75. Recall case-control example:

                        Smoker (E)   Non-smoker (~E)   Total
    Stroke (D)              15             35            50
    No stroke (~D)           8             42            50

  • 76. Absolute risk: Difference in proportions exposed
    $P(E \mid D) - P(E \mid {\sim}D) = 15/50 - 8/50 = 30\% - 16\% = 14\%$

                        Smoker (E)   Non-smoker (~E)   Total
    Stroke (D)              15             35            50
    No stroke (~D)           8             42            50
  • 78. Example 2: Difference in proportions
      • Research Question: Are antidepressants a risk factor for suicide attempts in children and adolescents?
    Example modified from: “Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults”; Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.
  • 79. Example 2: Difference in Proportions
      • Design: Case-control study
      • Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression).
      • Statistical question: Is a history of use of antidepressants more common among cases than controls?
  • 80. Example 2
      • Statistical question: Is a history of use of antidepressants more common among suicide-attempt cases than controls?
    What will we actually compare?
      • Proportion of cases who used antidepressants in the past vs. proportion of controls who did
  • 81. Results

                                     No. (%) of cases (n=263)   No. (%) of controls (n=1241)
    Any antidepressant drug ever           120 (46%)                    448 (36%)

    46% vs. 36%: difference = 10%
  • 82. Is the association statistically significant?
      • This 10% difference could reflect a true association or it could be a fluke in this particular sample.
      • The question: is 10% bigger or smaller than the expected sampling variability?
  • 83. Hypothesis testing
    Step 1: Assume the null hypothesis.
    Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (= the difference is 0%).
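  • 84. Hypothesis Testing
    Step 2: Predict the sampling variability assuming the null hypothesis is true
    $\hat{p}_{cases} - \hat{p}_{controls} \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{568}{1504}\left(1 - \frac{568}{1504}\right)}{263} + \frac{\frac{568}{1504}\left(1 - \frac{568}{1504}\right)}{1241}} = .033\right)$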
  • 85. Also: computer simulation results. Standard error is about 3.3%.
  • 86. Hypothesis Testing
    Step 3: Do an experiment
    We observed a difference of 10% between cases and controls.
  • 87. Hypothesis Testing
    Step 4: Calculate a p-value
    $Z = \frac{.10}{.033} = 3.0$; p = .003
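    Steps 2 through 4 can be combined into one small Python function. This is a sketch with hypothetical names; it works from the raw counts (120/263 and 448/1241) instead of the rounded 46% and 36%, so Z comes out slightly below the slide's 3.0 and the p-value slightly above .003.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Z test for a difference in proportions using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)         # overall proportion under the null
    se = math.sqrt(p_pooled * (1 - p_pooled) / n1 + p_pooled * (1 - p_pooled) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return p1 - p2, se, z, p_value

# 120 of 263 cases and 448 of 1241 controls had ever used an antidepressant
diff, se, z, p_value = two_proportion_z_test(120, 263, 448, 1241)
print(f"difference={diff:.3f}, SE={se:.3f}, Z={z:.2f}, p={p_value:.4f}")
```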
  • 88. P-value from our simulation… When we ran this study 1000 times, we got 1 result as big or bigger than 10%. We also got 3 results as small or smaller than –10%.
  • 89. P-value: from our simulation, we estimate the p-value to be 4/1000 or .004.
  • 90. Hypothesis Testing
    Step 5: Reject or do not reject the null hypothesis.
    Here we reject the null.
    Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.
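    A simulation-based p-value of this kind could be produced along the following lines in Python. The seed and function name are assumptions, and a different random stream will give a slightly different count than the 4 in 1000 reported above.

```python
import math
import random

def simulate_null_differences(n_cases=263, n_controls=1241, p_null=568 / 1504,
                              n_reps=1000, seed=3):
    """Simulate differences in sample proportions when both groups share p_null."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        cases = sum(rng.random() < p_null for _ in range(n_cases))
        controls = sum(rng.random() < p_null for _ in range(n_controls))
        diffs.append(cases / n_cases - controls / n_controls)
    return diffs

diffs = simulate_null_differences()
mean_diff = sum(diffs) / len(diffs)
# Empirical standard error = SD of the simulated differences
se_sim = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (len(diffs) - 1))
# Two-sided simulated p-value: share of results at least as extreme as +/-10%
p_sim = sum(abs(d) >= 0.10 for d in diffs) / len(diffs)
print(f"simulated SE={se_sim:.3f}, simulated p-value={p_sim:.3f}")
```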
  • 91. What would a lack of statistical significance mean?
      • If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation…
  • 92. 50 cases and 50 controls: standard error is about 10%. 263 cases and 1241 controls: standard error is about 3.3%.
  • 93. With only 50 cases and 50 controls… Standard error is about 10%. If we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time).
  • 95. Practice problem… An August 2003 research article in Developmental and Behavioral Pediatrics reported the following about a sample of UK kids: when given a choice of a non-branded chocolate cereal vs. CoCo Pops, 97% (36) of 37 girls and 71% (27) of 38 boys preferred the CoCo Pops. Is this evidence that girls are more likely to choose brand-named products?
  • 96. Answer
      1. Hypotheses: H0: p♀ − p♂ = 0; Ha: p♀ − p♂ ≠ 0 [two-sided]
      2. Null distribution of difference of two proportions:
         $\hat{p}_f - \hat{p}_m \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{63}{75}\left(1 - \frac{63}{75}\right)}{37} + \frac{\frac{63}{75}\left(1 - \frac{63}{75}\right)}{38}}\right)$; $\sigma = \sqrt{\frac{.84(.16)}{37} + \frac{.84(.16)}{38}} = .085$
         Null says p’s are equal, so estimate the standard error using the overall observed p.
      3. Observed difference in our experiment = .97 − .71 = .26
      4. Calculate the p-value of what you observed: $Z = \frac{.26 - 0}{.085} = 3.06$
         data _null_; pval=(1-probnorm(3.06))*2; put pval; run;
  • 97. Key two-sample Hypothesis Tests…
    Test for H0: μx − μy = 0 (σ² unknown, but roughly equal):
    $t_{n-2} = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}}$; $s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n - 2}$, where $n = n_x + n_y$
    Test for H0: p1 − p2 = 0:
    $Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{(p)(1-p)}{n_1} + \frac{(p)(1-p)}{n_2}}}$; $p = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$
  • 98. Corresponding confidence intervals…
    For a difference in means, 2 independent samples (σ²’s unknown but roughly equal):
    $(\bar{x} - \bar{y}) \pm t_{n-2,\,\alpha/2} \times \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}$
    For a difference in proportions, 2 independent samples:
    $(\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2} \times \sqrt{\frac{(p)(1-p)}{n_1} + \frac{(p)(1-p)}{n_2}}$
  • 99. Appendix: details of rank-sum test…
  • 101. Example
      • For example, if team 1 and team 2 (two gymnastic teams) are competing, and the judges rank all the individuals in the competition, how can you tell if team 1 has done significantly better than team 2 or vice versa?
  • 102. Answer
      • Intuition: under the null hypothesis of no difference between the two groups…
      • If n1=n2, the rank sums T1 and T2 should be equal.
      • But if n1≠n2, then T2 (n2=bigger group) should automatically be bigger. But how much bigger under the null?
      • For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team1 (X) and team2 don’t differ in talent, the ranks ought to be spread evenly among the two groups, e.g.…
      • 1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team1 ranks 3rd, 7th, and 11th)
    T1 = sum of ranks of group 1 (smaller)
    T2 = sum of ranks of group 2 (larger)
  • 103. Remember this?
    $T_1 + T_2 = \sum_{i=1}^{n_1+n_2} i = \frac{(n_1+n_2)(n_1+n_2+1)}{2} = \frac{n_1^2 + 2n_1n_2 + n_2^2 + n_1 + n_2}{2} = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1n_2$
    Sum of within-group ranks for the smaller group: $\sum_{i=1}^{n_1} i = \frac{n_1(n_1+1)}{2}$
    Sum of within-group ranks for the larger group: $\sum_{i=1}^{n_2} i = \frac{n_2(n_2+1)}{2}$
    e.g., here: $T_1 + T_2 = \sum_{i=1}^{13} i = \frac{(13)(14)}{2} = 91 = 55 + 6 + 30$
    Take-home point: $T_1 + T_2 = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1n_2$
  • 104. $\sum_{i=1}^{10} i = \frac{10(11)}{2} = 55$; $\sum_{i=1}^{3} i = \frac{3(4)}{2} = 6$; $55 - 6 = 49$
    T1 = 3 + 7 + 11 = 21
    T2 = 1 + 2 + 4 + 5 + 6 + 8 + 9 + 10 + 12 + 13 = 70
    70 − 21 = 49. Magic!
    The difference between the sums of the within-group ranks of the two groups is 49. The difference between the rank sums of the two groups is also equal to 49 if ranks are evenly interspersed (null is true).
    It turns out that, if the null hypothesis is true, the difference between the larger-group sum of within-group ranks and the smaller-group sum of within-group ranks is exactly equal to the difference between T2 and T1:
    Under the null, $T_2 - T_1 = \frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}$
  • 106. ∴ Under the null hypothesis, U1 should equal U2:
    $E(U_2 - U_1) = E\left[\left(\frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}\right) - (T_2 - T_1)\right] = 0$
    The U’s should be equal to each other and will equal n1n2/2:
    U1 + U2 = n1n2
    Under the null hypothesis, U1 = U2 = U0
    ∴ E(U1 + U2) = 2E(U0) = n1n2
    E(U1) = E(U2) = E(U0) = n1n2/2
    So, the test statistic here is not quite the difference in the sum-of-ranks of the 2 groups. It’s the smaller observed U value: U0.
    For small n’s, take U0 and get the p-value directly from a U table.
  • 107. For large enough n’s (>10 per group)…
    $Z = \frac{U_0 - E(U_0)}{\sqrt{Var(U_0)}} = \frac{U_0 - \frac{n_1n_2}{2}}{\sqrt{Var(U_0)}}$
    $E(U_0) = \frac{n_1n_2}{2}$
    $Var(U_0) = \frac{n_1n_2(n_1+n_2+1)}{12}$
  • 108. Add observed data to the example…
    Example: If the girls on the two gymnastics teams were ranked as follows:
    Team 1: 1, 5, 7; observed T1 = 13
    Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13; observed T2 = 78
    Are the teams significantly different?
    Total sum of ranks = 13*14/2 = 91
    n1n2 = 3*10 = 30
    Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null) and U0 = 15
    U1 = 30 + 6 − 13 = 23
    U2 = 30 + 55 − 78 = 7
    ∴ U0 = 7
    Not quite statistically significant in U table… p=.1084 (see attached), ×2 for a two-tailed test
  • 109. Example problem 2
    A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig (low-cal, low-fat). The following weight changes were obtained; note they are very skewed because someone lost 100 pounds; the mean loss for Atkins is going to look higher because of the bozo, but does that mean the diet is better overall? Conduct a Mann-Whitney U test to compare ranks.

    Atkins   Jenny Craig
     -100       -11
       -8       -15
       -4        -5
       +5        +6
       +8       -20
       +2

  • 110. Answer (ranks; 1 = most weight lost)

    Atkins   Jenny Craig
       1         4
       5         3
       7         6
       9        10
      11         2
       8

    Sum of ranks for JC = 25 (n=5)
    Sum of ranks for Atkins = 41 (n=6)
    n1n2 = 5*6 = 30
    Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 and U0 = 15
    U1 = 30 + 15 − 25 = 20
    U2 = 30 + 21 − 41 = 10
    U0 = 10; n1 = 5, n2 = 6
    Go to Mann-Whitney chart… p = .2143 × 2 = .42
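    The U statistics in both appendix examples follow directly from the rank lists, as this short Python sketch shows. The function name is hypothetical, and integer ranks (no ties) are assumed; the p-values still come from a Mann-Whitney table, as on the slides.

```python
def mann_whitney_u(ranks_1, ranks_2):
    """U statistics from two lists of ranks: U = n1*n2 + n(n+1)/2 - T for each group."""
    n1, n2 = len(ranks_1), len(ranks_2)
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(ranks_1)
    u2 = n1 * n2 + n2 * (n2 + 1) // 2 - sum(ranks_2)
    return u1, u2, min(u1, u2)   # the test statistic U0 is the smaller U

# Gymnastics example: team 1 (n=3) vs. team 2 (n=10)
gym = mann_whitney_u([1, 5, 7], [2, 3, 4, 6, 8, 9, 10, 11, 12, 13])

# Diet example: Jenny Craig (n=5) vs. Atkins (n=6)
diet = mann_whitney_u([4, 3, 6, 10, 2], [1, 5, 7, 9, 11, 8])

print("gymnastics U1, U2, U0:", gym)
print("diet U1, U2, U0:", diet)
```

    In both cases U1 + U2 equals n1n2 = 30, which is a quick check that the ranks were summed correctly.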