The two sample t-test

Binary or categorical outcomes (proportions)

Outcome variable: binary or categorical (e.g. fracture, yes/no)

Are the observations correlated?

Independent:
 Chi-square test: compares proportions between two or more groups
 Relative risks: odds ratios or risk ratios
 Logistic regression: multivariate technique used when outcome is binary; gives multivariate-adjusted odds ratios

Correlated:
 McNemar’s chi-square test: compares binary outcome between correlated groups (e.g., before and after)
 Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
 GEE modeling: multivariate…

Alternative to the chi-square test if sparse cells:
 Fisher’s exact test: for independent groups when there are sparse data (some cells <5).
 McNemar’s exact test: for correlated groups when there are sparse data (some cells <5).

Recall: The odds ratio (two samples = cases and controls)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

With a = 15, b = 35, c = 8, d = 42:

$$OR = \frac{ad}{bc} = \frac{15 \times 42}{35 \times 8} = 2.25$$

Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.

Inferences about the odds
ratio…
 Does the sampling distribution follow a
normal distribution?
 What is the standard error?

Simulation…
 1. In SAS, assume infinite population of cases
and controls with equal proportion of smokers
(exposure), p=.23 (UNDER THE NULL!)
 2. Use the random binomial function to randomly
select n=50 cases and n=50 controls each with
p=.23 chance of being a smoker.
 3. Calculate the observed odds ratio for the
resulting 2x2 table.
 4. Repeat this 1000 times (or some large number
of times).
 5. Observe the distribution of odds ratios under
the null hypothesis.
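
The five steps above can be sketched in code. The slides run this simulation in SAS; the version below is a minimal Python sketch using only the standard library. The function name, the seed, and the choice to skip tables with an empty cell (where the OR is undefined) are assumptions made for illustration.

```python
import math
import random
import statistics

def simulate_null_log_odds_ratios(n_cases=50, n_controls=50, p_exposed=0.23,
                                  n_sims=1000, seed=1):
    """Return simulated ln(OR) values when cases and controls share one exposure rate."""
    rng = random.Random(seed)
    log_ors = []
    for _ in range(n_sims):
        # Binomial draws: number of smokers among the cases and among the controls.
        a = sum(rng.random() < p_exposed for _ in range(n_cases))     # exposed cases
        c = sum(rng.random() < p_exposed for _ in range(n_controls))  # exposed controls
        b, d = n_cases - a, n_controls - c                             # unexposed
        if min(a, b, c, d) == 0:
            continue  # assumption: skip tables with an empty cell (OR undefined)
        log_ors.append(math.log((a * d) / (b * c)))
    return log_ors

log_ors = simulate_null_log_odds_ratios()
print("mean ln(OR):", round(statistics.mean(log_ors), 3))
print("empirical SE of ln(OR):", round(statistics.stdev(log_ors), 3))
```

Under these assumptions the mean of ln(OR) should sit near 0 and its standard deviation near 0.5; exponentiating the values gives the right-skewed distribution of the OR itself.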

Properties of the OR (simulation)
(50 cases/50 controls/23% exposed)
Under the null, this is the expected variability of the sample OR; note the right skew.

Properties of the lnOR
Normal!

From the simulation, can get the empirical standard error (~0.5) and p-value (~.10).

Or, in general, standard error of ln(OR) =

$$\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

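Inferences about the ln(OR)

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$OR = 2.25, \qquad \ln(OR) = 0.81$$

$$Z = \frac{\ln(2.25) - 0}{\sqrt{\frac{1}{8} + \frac{1}{15} + \frac{1}{35} + \frac{1}{42}}} = \frac{0.81}{0.494} = 1.64, \qquad p = .10$$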

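Confidence interval…

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

$$95\%\ \text{CI for } \ln(OR) = 0.81 \pm 1.96 \times 0.494 = (-0.16,\ 1.78)$$

$$95\%\ \text{CI for } OR = (e^{-0.16},\ e^{1.78}) = (0.85,\ 5.92)$$

Final answer: 2.25 (0.85, 5.92)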
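
The Z test and confidence interval for the odds ratio can be reproduced with a few lines of code. This is a hedged Python sketch (the slides use SAS); the function name `odds_ratio_inference` is hypothetical, and the cell labels a, b, c, d follow the 2x2 table above.

```python
import math
from statistics import NormalDist

def odds_ratio_inference(a, b, c, d, conf=0.95):
    """Wald test and confidence interval for an odds ratio from a 2x2 table."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = math.log(odds_ratio) / se_log_or               # null value of ln(OR) is 0
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    lower = math.exp(math.log(odds_ratio) - z_crit * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z_crit * se_log_or)
    return odds_ratio, se_log_or, z, p_two_sided, (lower, upper)

# Stroke example: a=15 smokers with stroke, b=35 non-smokers with stroke,
# c=8 smokers without stroke, d=42 non-smokers without stroke.
or_, se, z, p, ci = odds_ratio_inference(15, 35, 8, 42)
print(round(or_, 2), round(se, 3), round(z, 2), round(p, 2))
print("95% CI:", ci)
```

The results should agree with the hand calculation up to rounding of the intermediate values (the slides round ln(OR) and the standard error before exponentiating).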

Practice problem:
Suppose the following data were collected in a case-control study of brain tumor and
cell phone usage:
                          Brain tumor   No brain tumor
Own a cell phone              20              60
Don’t own a cell phone        10              40
Is there sufficient evidence for an association between cell phones and brain tumor?

Answer
1. What is your null hypothesis?
Null hypothesis: OR = 1.0; lnOR = 0
Alternative hypothesis: OR ≠ 1.0; lnOR ≠ 0
2. What is your null distribution?
$$\ln OR \sim N(0, \sigma); \qquad \sigma = SD(\ln OR) = \sqrt{\frac{1}{10} + \frac{1}{20} + \frac{1}{60} + \frac{1}{40}} = .44$$
3. Empirical evidence: OR = (20 × 40)/(60 × 10) = 800/600 = 1.33
∴ lnOR = .288
4. Z = (.288 − 0)/.44 = .65
p-value = P(Z > .65 or Z < −.65) = .26 × 2 = .52
TWO-SIDED TEST: it would be just as extreme if the sample lnOR were .65 standard deviations or more below the null mean.
5. Not enough evidence to reject the null hypothesis of no association

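Key measures of relative risk:
95% CIs OR and RR:

For an odds ratio, 95% confidence limits:

$$\left( OR \times \exp\left[-1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\,\right],\ \ OR \times \exp\left[+1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\,\right] \right)$$

For a risk ratio, 95% confidence limits:

$$\left( RR \times \exp\left[-1.96\sqrt{\frac{1 - a/(a+b)}{a} + \frac{1 - c/(c+d)}{c}}\,\right],\ \ RR \times \exp\left[+1.96\sqrt{\frac{1 - a/(a+b)}{a} + \frac{1 - c/(c+d)}{c}}\,\right] \right)$$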

Continuous outcome (means)

Outcome variable: continuous (e.g. pain scale, cognitive function)

Are the observations independent or correlated?

Independent:
 T-test: compares means between two independent groups
 ANOVA: compares means between more than two independent groups
 Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
 Linear regression: …

Correlated:
 Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
 Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
 Mixed models/GEE modeling: multivariate regression techniques to compare changes over time between two or more groups

Alternatives if the normality assumption is violated (and small sample size):
Non-parametric statistics
 Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
 Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
 Kruskal-Wallis test: non-parametric alternative to ANOVA
 Spearman rank correlation coefficient: non-parametric alternative to Pearson’s correlation coefficient

The two-sample T-test
 Is the difference in means that we
observe between two groups more than
we’d expect to see based on chance
alone?

The standard error of the
difference of two means

$$\sigma_{\bar{x}-\bar{y}} = \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}$$

**First add the variances and then take the square root of the sum to get the standard error.

Recall, Var(A − B) = Var(A) + Var(B) if A and B are independent!

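Shown by simulation:

[Figure: simulated sampling distributions of two sample means and of their difference]

One sample of 30 (with SD=5): $SE = \frac{5}{\sqrt{30}} = .91$
One sample of 30 (with SD=5): $SE = \frac{5}{\sqrt{30}} = .91$
Difference of the two samples: $SE(\text{diff}) = \sqrt{\frac{25}{30} + \frac{25}{30}} = 1.29$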
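
A simulation along these lines can be sketched as follows. This is an illustrative Python version, not the code behind the slides' figures; the normal population with mean 0, the number of repetitions, and the seed are all assumptions.

```python
import random
import statistics

def simulate_standard_errors(n=30, m=30, sd=5.0, n_sims=5000, seed=2):
    """Empirical SE of each sample mean and of the difference of the two means."""
    rng = random.Random(seed)
    means_x, means_y, diffs = [], [], []
    for _ in range(n_sims):
        x_bar = statistics.fmean([rng.gauss(0.0, sd) for _ in range(n)])
        y_bar = statistics.fmean([rng.gauss(0.0, sd) for _ in range(m)])
        means_x.append(x_bar)
        means_y.append(y_bar)
        diffs.append(x_bar - y_bar)  # independent samples, so variances add
    return (statistics.stdev(means_x),
            statistics.stdev(means_y),
            statistics.stdev(diffs))

se_x, se_y, se_diff = simulate_standard_errors()
print(round(se_x, 2), round(se_y, 2), round(se_diff, 2))
```

The first two values should land close to 5/√30 ≈ .91 and the third close to √(25/30 + 25/30) ≈ 1.29, which is the "add the variances, then take the square root" rule in action.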

Distribution of differences
If $\bar{X}_n$ and $\bar{Y}_m$ are the averages of n and m subjects, respectively:

$$\bar{X}_n - \bar{Y}_m \sim N\left(\mu_x - \mu_y,\ \sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}\right)$$

But…
 As before, you usually have to use the
sample SD, since you won’t know the
true SD ahead of time…
 So, again becomes a T-distribution...

Estimated standard error of the difference….

$$\sigma_{\bar{x}-\bar{y}} \approx \sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}$$

Just plug in the sample standard deviations for each group.

Case 1: un-pooled variance
Question: What are your degrees of freedom here?
Answer: Not obvious!

Case 1: ttest, unpooled
variances
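It is complicated to figure out the degrees of freedom here! A good approximation is given as df ≈ harmonic mean of n and m (or SAS will tell you!):

$$T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_x^2}{n} + \frac{s_y^2}{m}}} \sim t_\nu, \qquad \nu \approx \frac{2}{\frac{1}{n} + \frac{1}{m}}$$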

Case 2: pooled variance
If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).

Pooling variances:

$$s_x^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}{n-1} \quad\text{and}\quad (n-1)s_x^2 = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2$$

$$s_y^2 = \frac{\sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{m-1} \quad\text{and}\quad (m-1)s_y^2 = \sum_{i=1}^{m}(y_i - \bar{y}_m)^2$$

$$\therefore\ s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$$

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$$

The denominator, n+m−2, is the degrees of freedom!

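Estimated standard error
(using pooled variance estimate)

$$\sigma_{\bar{x}-\bar{y}} \approx \sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}$$

$$\text{where:}\quad s_p^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 + \sum_{i=1}^{m}(y_i - \bar{y}_m)^2}{n+m-2}$$

The degrees of freedom are n+m−2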

Case 2: ttest, pooled variances

$$T = \frac{\bar{X}_n - \bar{Y}_m}{\sqrt{\frac{s_p^2}{n} + \frac{s_p^2}{m}}} \sim t_{n+m-2}$$

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$$

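Alternate calculation formula: ttest, pooled variance

$$T = \frac{\bar{X}_n - \bar{Y}_m}{s_p\sqrt{\frac{n+m}{mn}}} \sim t_{n+m-2}$$

$$\frac{s_p^2}{m} + \frac{s_p^2}{n} = s_p^2\left(\frac{1}{m} + \frac{1}{n}\right) = s_p^2\left(\frac{n}{mn} + \frac{m}{mn}\right) = s_p^2\left(\frac{n+m}{mn}\right)$$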

Pooled vs. unpooled variance
Rule of Thumb: Use pooled unless you have a
reason not to.
Pooled gives you more degrees of freedom.
Pooled has extra assumption: variances are
equal between the two groups.
SAS automatically tests this assumption for you
(“Equality of Variances” test). If p<.05, this
suggests unequal variances, and better to
use unpooled ttest.

Example: two-sample t-test
 In 1980, some researchers reported that
“men have more mathematical ability than
women” as evidenced by the 1979 SAT’s,
where a sample of 30 random male
adolescents had a mean score ± 1 standard
deviation of 436±77 and 30 random female
adolescents scored lower: 416±81 (genders
were similar in educational backgrounds,
socio-economic status, and age). Do you
agree with the authors’ conclusions?

Data Summary

                   n    Sample Mean   Sample Standard Deviation
Group 1: women     30       416                  81
Group 2: men       30       436                  77

Two-sample t-test
1. Define your hypotheses (null, alternative)
H0: ♂ − ♀ math SAT = 0
Ha: ♂ − ♀ math SAT ≠ 0 [two-sided]

Two-sample t-test
2. Specify your null distribution:
F and M have similar standard deviations/variances, so make a “pooled” estimate of variance.

$$s_p^2 = \frac{(n-1)s_m^2 + (m-1)s_f^2}{n+m-2} = \frac{(29)77^2 + (29)81^2}{58} = 6245$$

$$\bar{M}_{30} - \bar{F}_{30} \sim T_{58}\left(0,\ \sqrt{\frac{6245}{30} + \frac{6245}{30}}\right); \qquad \sqrt{\frac{6245}{30} + \frac{6245}{30}} = 20.4$$

Two-sample t-test
3. Observed difference in our experiment = 20
points

Two-sample t-test
4. Calculate the p-value of what you observed

$$T_{58} = \frac{20 - 0}{20.4} = .98$$

data _null_;
  pval = (1 - probt(.98, 58)) * 2;  /* two-sided p-value, roughly .33 */
  put pval;
run;
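
For readers without SAS, the whole pooled-variance test can be sketched from the summary statistics alone. The Python below is an illustration under stated assumptions: the function names are hypothetical, and because the standard library has no t-distribution, the t CDF is approximated by numerically integrating the density with Simpson's rule (a stand-in for SAS's `probt`, not a replacement for a statistics package).

```python
import math

def t_cdf(t, df, steps=2000):
    """Approximate P(T <= t) for Student's t by Simpson's rule on the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3                   # integral of the density from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def pooled_t_test(mean_x, sd_x, n, mean_y, sd_y, m):
    """Two-sample t-test with pooled variance, from summary statistics."""
    df = n + m - 2
    pooled_var = ((n - 1) * sd_x ** 2 + (m - 1) * sd_y ** 2) / df
    se = math.sqrt(pooled_var / n + pooled_var / m)
    t = (mean_x - mean_y) / se
    p_two_sided = 2 * (1 - t_cdf(abs(t), df))
    return pooled_var, se, t, df, p_two_sided

# SAT example: men 436 ± 77 (n=30), women 416 ± 81 (n=30).
pooled_var, se, t, df, p = pooled_t_test(436, 77, 30, 416, 81, 30)
print(pooled_var, round(se, 1), round(t, 2), df, round(p, 2))
```

With a t statistic just under 1, the two-sided p-value is far above .05, so the 20-point difference is well within what sampling variability alone would produce.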

Example 2: Difference in means
 Example: Rosenthal, R. and Jacobson, L. (1966) Teachers’ expectancies: Determinants of pupils’ I.Q. gains. Psychological Reports, 19, 115-118.

The Experiment
(note: exact numbers have been altered)
 Grade 3 at Oak School were given an IQ test at
the beginning of the academic year (n=90).
 Classroom teachers were given a list of names of
students in their classes who had supposedly
scored in the top 20 percent; these students were
identified as “academic bloomers” (n=18).
 BUT: the children on the teachers lists had
actually been randomly assigned to the list.
 At the end of the year, the same I.Q. test was re-
administered.

Example 2
 Statistical question: Do students in the
treatment group have more improvement
in IQ than students in the control group?
What will we actually compare?
 One-year change in IQ score in the treatment
group vs. one-year change in IQ score in the
control group.

Results:

                       “Academic bloomers” (n=18)   Controls (n=72)
Change in IQ score:           12.2 (2.0)                8.2 (2.0)

12.2 points vs. 8.2 points: Difference = 4 points

The standard deviation of change scores was 2.0 in both groups. This affects statistical significance…

What does a 4-point
difference mean?
 Before we perform any formal statistical
analysis on these data, we already
have a lot of information.
 Look at the basic numbers first; THEN
consider statistical significance as a
secondary guide.

Is the association statistically
significant?
 This 4-point difference could reflect a
true effect or it could be a fluke.
 The question: is a 4-point difference
bigger or smaller than the expected
sampling variability?

Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between “academic bloomers” and normal students (= the difference is 0 points)

Hypothesis Testing
 These predictions can be made by
mathematical theory or by computer
simulation.
Step 2: Predict the sampling variability assuming the null
hypothesis is true

Hypothesis Testing
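Step 2: Predict the sampling variability assuming the null hypothesis is true (math theory):

$$s_p^2 = 4.0$$

$$\hat{\mu}_{\text{``gifted''}} - \hat{\mu}_{\text{control}} \sim T_{88}\left(0,\ \sqrt{\frac{4}{18} + \frac{4}{72}} \approx 0.52\right)$$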

Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true (computer simulation):
 In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
 I used computer simulation to take 1000 samples of 18 treated and 72 controls.

Computer Simulation Results
Standard error is
about 0.52

3. Empirical data
Observed difference in our experiment =
12.2-8.2 = 4.0

4. P-value
$$t_{88} = \frac{12.2 - 8.2}{.52} = \frac{4}{.52} \approx 7.7$$

p-value <.0001

A t-curve with 88 df’s has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96).

Visually…
If we ran this study 1000 times we wouldn’t expect to get 1 result as big as a difference of 4 (under the null hypothesis).

5. Reject null!
 Conclusion: I.Q. scores can bias
expectancies in the teachers’ minds
and cause them to unintentionally treat
“bright” students differently from those
seen as less bright.

Confidence interval (more
information!!)
95% CI for the difference: 4.0±1.99(.52) =
(3.0 – 5.0)
t-curve with 88 df’s
has slightly wider cut-
off’s for 95% area
(t=1.99) than a normal
curve (Z=1.96)

What if our standard deviation
had been higher?
 The standard deviation for change
scores in treatment and control were
each 2.0. What if change scores had
been much more variable—say a
standard deviation of 10.0 (for both)?

Std. dev in change scores = 2.0: Standard error is 0.52
Std. dev in change scores = 10.0: Standard error is 2.58

With a std. dev. of 10.0…
LESS STATISTICAL POWER!
Standard error is 2.58
If we ran this study 1000 times, we would expect to get ≥+4.0 or ≤–4.0 12% of the time.
P-value=.12

Don’t forget: The paired T-test
 Did the control group in the previous
experiment improve
at all during the year?
 Do not apply a two-sample ttest to answer
this question!
 After-Before yields a single sample of
differences…
 “within-group” rather than “between-group”
comparison…

Continuous outcome (means)

Data Summary

                    n    Sample Mean   Sample Standard Deviation
Group 1: Change     72      +8.2                 2.0

Did the control group in the previous experiment improve at all during the year?

$$t_{71} = \frac{8.2 - 0}{2/\sqrt{72}} = \frac{8.2}{.236} \approx 34.8$$

p-value <.0001

Normality assumption of ttest
 If the distribution of the trait is normal, fine to use a t-test.
 But if the underlying distribution is not normal and the sample size is small, the Central Limit Theorem has not yet kicked in, and you cannot use a ttest. Rule of thumb for a large enough sample: n>30 per group if not too skewed; n>100 if the distribution is really skewed.
 Note: ttest is very robust against the normality assumption!

Alternative tests when normality
is violated: Non-parametric tests

Non-parametric tests
 t-tests require your outcome variable
to be normally distributed (or close
enough), for small samples.
 Non-parametric tests are based on
RANKS instead of means and
standard deviations (=“population
parameters”).

Example: non-parametric tests
10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig
Hypothetical RESULTS:
Atkins group loses an average of 34.5 lbs.
J. Craig group loses an average of 18.5 lbs.
Conclusion: Atkins is better?

Example: non-parametric tests
BUT, take a closer look at the individual data…
Atkins, change in weight (lbs):
+4, +3, 0, -3, -4, -5, -11, -14, -15, -300
J. Craig, change in weight (lbs):
-8, -10, -12, -16, -18, -20, -21, -24, -26, -30

[Figure: histogram of weight change for the Jenny Craig group; x-axis: Weight Change (-30 to 20), y-axis: Percent (0 to 30)]

[Figure: histogram of weight change for the Atkins group; x-axis: Weight Change (-300 to 20), y-axis: Percent (0 to 30)]

t-test inappropriate…
 Comparing the mean weight loss of the
two groups is not appropriate here.
 The distributions do not appear to be
normally distributed.
 Moreover, there is an extreme outlier
(this outlier influences the mean a great
deal).

Wilcoxon rank-sum test
 RANK the values, 1 being the least weight loss and 20 being the most weight loss.
 Atkins
 Change: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300
 Rank: 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
 J. Craig
 Change: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30
 Rank: 7, 8, 10, 13, 14, 15, 16, 17, 18, 19

Wilcoxon rank-sum test
 Sum of Atkins ranks:
 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
 Sum of Jenny Craig’s ranks:
 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
 Jenny Craig clearly ranked higher!
 P-value* (from computer) = .018
*For details of the statistical test, see appendix of these slides…
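
The ranking and rank sums can be checked with a short script. This Python sketch assumes there are no tied values (true for this data set) and, for the p-value, uses a normal approximation with a continuity correction; that is one common large-sample choice and is an assumption here, since the slides do not say which method the computer used.

```python
import math
from statistics import NormalDist

atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jenny_craig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

def rank_sums(group_a, group_b):
    """Rank the pooled values (rank 1 = least weight loss) and sum ranks per group.

    Assumes no tied values.
    """
    pooled = sorted(group_a + group_b, reverse=True)  # least weight loss first
    rank_of = {value: i + 1 for i, value in enumerate(pooled)}
    return sum(rank_of[v] for v in group_a), sum(rank_of[v] for v in group_b)

def rank_sum_p_value(sum_a, n_a, n_b):
    """Two-sided p-value: normal approximation with continuity correction."""
    total = n_a + n_b
    expected = n_a * (total + 1) / 2
    sd = math.sqrt(n_a * n_b * (total + 1) / 12)
    z = (abs(sum_a - expected) - 0.5) / sd
    return 2 * (1 - NormalDist().cdf(z))

t_atkins, t_jc = rank_sums(atkins, jenny_craig)
print(t_atkins, t_jc)  # 73 137
print(rank_sum_p_value(t_atkins, len(atkins), len(jenny_craig)))
```

Because the test works on ranks, the -300 outlier counts only as "rank 20" and no longer dominates the comparison the way it dominates the mean.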

Binary or categorical
outcomes (proportions)

Difference in proportions (special
case of chi-square test)

Standard error of a proportion =

$$\sqrt{\frac{p(1-p)}{n}}$$

Standard error can be estimated by (still normally distributed):

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Standard error of the difference of two proportions =

$$\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \quad\text{or}\quad \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}, \quad\text{where } \bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$

The variance of a difference is the sum of variances (as with difference in means).
The second form, using the averaged proportion, is analogous to pooled variance in the ttest.

Null distribution of a difference in proportions

$$\text{Difference of proportions} \sim N\left(p_1 - p_2,\ \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}\right)$$

Difference in proportions test
Null hypothesis: The difference in proportions is 0.

$$Z = \frac{p_1 - p_2}{\sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}}$$

$$\bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} \quad\text{(just average proportion)}$$

p1 = proportion in group 1
p2 = proportion in group 2
n1 = number in group 1
n2 = number in group 2

Recall, variance of a proportion is p(1-p)/n.
Use average (or pooled) proportion in standard error formula, because under the null hypothesis, groups have equal proportions.
Z follows a normal because binomial can be approximated with normal.

Recall case-control example:

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

Absolute risk: Difference in proportions exposed

$$P(E \mid D) - P(E \mid {\sim}D) = 15/50 - 8/50 = 30\% - 16\% = 14\%$$

                  Smoker (E)   Non-smoker (~E)   Total
Stroke (D)            15             35            50
No Stroke (~D)         8             42            50

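Difference in proportions exposed

$$Z = \frac{14\% - 0\%}{\sqrt{\frac{.23 \times .77}{50} + \frac{.23 \times .77}{50}}} = \frac{.14}{.084} = 1.67$$

$$95\%\ \text{CI}: 0.14 \pm 1.96 \times .084 = -0.025 \text{ to } 0.305$$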
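
The same calculation can be wrapped in a small function. This is a hedged Python sketch using the pooled-proportion standard error described above; the function name and argument order (count and sample size for each group) are choices made for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Z test for a difference in proportions, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)          # pooled (average) proportion
    se = math.sqrt(p_bar * (1 - p_bar) / n1 + p_bar * (1 - p_bar) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p1 - p2, se, z, p_value

# Stroke example: 15 of 50 cases and 8 of 50 controls were smokers.
diff, se, z, p_value = two_proportion_z_test(15, 50, 8, 50)
print(round(diff, 2), round(se, 3), round(z, 2), round(p_value, 2))
print("95% CI:", diff - 1.96 * se, "to", diff + 1.96 * se)
```

The output should match the hand calculation up to rounding; the interval includes 0, consistent with the two-sided p-value being above .05.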

Example 2: Difference in
proportions
 Research Question: Are antidepressants a risk factor for suicide attempts in children and adolescents?
Example modified from: “Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults”; Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.

Example 2: Difference in
Proportions
 Design: Case-control study
 Methods: Researchers used Medicaid
records to compare prescription histories
between 263 children and teenagers (6-18
years) who had attempted suicide and 1241
controls who had never attempted suicide (all
subjects suffered from depression).
 Statistical question: Is a history of use of
antidepressants more common among cases
than controls?

Example 2
 Statistical question: Is a history of use of antidepressants more common among cases than controls?
What will we actually compare?
 Proportion of cases who used
antidepressants in the past vs. proportion of
controls who did

Results

                               No. (%) of cases (n=263)   No. (%) of controls (n=1241)
Any antidepressant drug ever          120 (46%)                   448 (36%)

46% vs. 36%: Difference = 10%

Is the association statistically
significant?
 This 10% difference could reflect a true
association or it could be a fluke in this
particular sample.
 The question: is 10% bigger or smaller
than the expected sampling variability?

Hypothesis testing
Null hypothesis: There is no association
between antidepressant use and suicide
attempts in the target population (= the
difference is 0%)
Step 1: Assume the null hypothesis.

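Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true

$$\hat{p}_{cases} - \hat{p}_{controls} \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{568}{1504}\left(1 - \frac{568}{1504}\right)}{263} + \frac{\frac{568}{1504}\left(1 - \frac{568}{1504}\right)}{1241}} = .033\right)$$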

Also: Computer Simulation Results
Standard error is
about 3.3%
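
A simulation of this kind can be sketched as below. This is illustrative Python, not the code behind the slide; it assumes both groups share the pooled proportion 568/1504 under the null, and the seed and function name are arbitrary.

```python
import random
import statistics

def simulate_null_difference(n_cases=263, n_controls=1241, p_null=568 / 1504,
                             n_sims=1000, seed=3):
    """Simulated differences in sample proportions when both groups share p_null."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        cases = sum(rng.random() < p_null for _ in range(n_cases))
        controls = sum(rng.random() < p_null for _ in range(n_controls))
        diffs.append(cases / n_cases - controls / n_controls)
    return diffs

diffs = simulate_null_difference()
print("empirical SE:", round(statistics.stdev(diffs), 3))
# Share of simulated studies at least as extreme as the observed 10% difference.
print("two-sided p estimate:", sum(abs(d) >= 0.10 for d in diffs) / len(diffs))
```

The empirical standard error should come out near .033, and only a handful of the 1000 simulated studies should show a difference of 10% or more in either direction.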

Hypothesis Testing
Step 3: Do an experiment
We observed a difference of 10% between
cases and controls.

Hypothesis Testing
Step 4: Calculate a p-value
$$Z = \frac{.10}{.033} = 3.0; \qquad p = .003$$

P-value from our simulation…
When we ran this study 1000 times, we got 1 result as big or bigger than 10%.
We also got 3 results as small or smaller than –10%.

P-value
From our simulation, we
estimate the p-value to be:
4/1000 or .004

Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null.
Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.

What would a lack of
statistical significance mean?
 If this study had sampled only 50 cases
and 50 controls, the sampling variability
would have been much higher—as
shown in this computer simulation…

50 cases and 50 controls: Standard error is about 10%
263 cases and 1241 controls: Standard error is about 3.3%

With only 50 cases and 50 controls…
Standard
error is
about 10%
If we ran this
study 1000 times,
we would expect to
get values of 10%
or higher 170
times (or 17% of
the time).

Two-tailed p-value
Two-tailed
p-value =
17%x2=34%

Practice problem…
An August 2003 research article in
Developmental and Behavioral Pediatrics
reported the following about a sample of UK
kids: when given a choice of a non-branded
chocolate cereal vs. CoCo Pops, 97% (36) of
37 girls and 71% (27) of 38 boys preferred
the CoCo Pops. Is this evidence that girls are
more likely to choose brand-named products?

Answer
1. Hypotheses:
H0: p♀ − p♂ = 0
Ha: p♀ − p♂ ≠ 0 [two-sided]
2. Null distribution of difference of two proportions:

$$\hat{p}_f - \hat{p}_m \sim N\left(0,\ \sigma = \sqrt{\frac{\frac{63}{75}\left(1 - \frac{63}{75}\right)}{37} + \frac{\frac{63}{75}\left(1 - \frac{63}{75}\right)}{38}} = \sqrt{\frac{.84(.16)}{37} + \frac{.84(.16)}{38}} = .085\right)$$

Null says p’s are equal so estimate standard error using overall observed p
3. Observed difference in our experiment = .97 − .71 = .26
4. Calculate the p-value of what you observed:

$$Z = \frac{.26 - 0}{.085} = 3.06$$

data _null_;
  pval = (1 - probnorm(3.06)) * 2;  /* two-sided p-value, roughly .002 */
  put pval;
run;

Key two-sample Hypothesis
Tests…
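Test for Ho: μx − μy = 0 (σ² unknown, but roughly equal):

$$t_{n-2} = \frac{\bar{x} - \bar{y}}{\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}}; \qquad s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n - 2}, \quad n = n_x + n_y$$

Test for Ho: p1 − p2 = 0:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}}; \qquad \bar{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$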

Corresponding confidence
intervals…
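For a difference in means, 2 independent samples (σ²’s unknown but roughly equal):

$$(\bar{x} - \bar{y}) \pm t_{n-2,\,\alpha/2} \times \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}$$

For a difference in proportions, 2 independent samples:

$$(\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2} \times \sqrt{\frac{\bar{p}(1-\bar{p})}{n_1} + \frac{\bar{p}(1-\bar{p})}{n_2}}$$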

Appendix: details of rank-sum
test…

Wilcoxon Rank-sum test

Rank all of the observations in order from 1 to n.
T1 is the sum of the ranks from the smaller population (n1)
T2 is the sum of the ranks from the larger population (n2)

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1$$

$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2$$

$$U_0 = \min(U_1, U_2)$$

$$\text{for } n_1 > 10,\ n_2 > 10: \qquad Z = \frac{U_0 - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$

Find P(U ≤ U0) in Mann-Whitney U tables
With n2 = the bigger of the 2 populations

Example

For example, if team 1 and team 2 (two gymnastic
teams) are competing, and the judges rank all the
individuals in the competition, how can you tell if
team 1 has done significantly better than team 2 or
vice versa?

Answer

Intuition: under the null hypothesis of no difference between the
two groups…
 If n1=n2, the sums of T1 and T2 should be equal.
 But if n1≠n2, then T2 (n2=bigger group) should automatically be
bigger. But how much bigger under the null?

For example, if team 1 has 3 people and team 2 has 10, we could
rank all 13 participants from 1 to 13 on individual performance. If
team1 (X) and team2 don’t differ in talent, the ranks ought to be
spread evenly among the two groups, e.g.…

1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team1 ranks 3rd, 7th, and 11th)

T1 = sum of ranks of group 1 (smaller)
T2 = sum of ranks of group 2 (larger)

$$T_1 + T_2 = \sum_{i=1}^{n_1+n_2} i = \frac{(n_1+n_2)(n_1+n_2+1)}{2} = \frac{n_1^2 + 2n_1n_2 + n_2^2 + n_1 + n_2}{2} = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2$$

Remember this?

$$\sum_{i=1}^{n_1} i = \frac{n_1(n_1+1)}{2} \quad \text{(sum of within-group ranks for smaller group)}$$

$$\sum_{i=1}^{n_2} i = \frac{n_2(n_2+1)}{2} \quad \text{(sum of within-group ranks for larger group)}$$

e.g., here:

$$T_1 + T_2 = \sum_{i=1}^{13} i = \frac{(13)(14)}{2} = 91 = 55 + 6 + 30$$

Take-home point:

$$T_1 + T_2 = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2$$

$$\sum_{i=1}^{10} i = \frac{10(11)}{2} = 55, \qquad \sum_{i=1}^{3} i = \frac{3(4)}{2} = 6, \qquad 55 - 6 = 49$$

T1 = 3 + 7 + 11 = 21
T2 = 1 + 2 + 4 + 5 + 6 + 8 + 9 + 10 + 12 + 13 = 70
70 − 21 = 49 Magic!

The difference between the sums of the within-group ranks (1 to 10 and 1 to 3) is 49.
The difference between the sum of the ranks of the two groups is also equal to 49 if ranks are evenly interspersed (null is true).

It turns out that, if the null hypothesis is true, the expected difference between T2 and T1 equals the difference between the larger-group and smaller-group sums of within-group ranks:

$$\text{Under the null,}\quad T_2 - T_1 = \frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}$$

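Total of all ranks:

$$T_1 + T_2 = \frac{n_1(n_1+1)}{2} + \frac{n_2(n_2+1)}{2} + n_1 n_2$$

Under the null:

$$T_2 - T_1 = \frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}$$

$$\therefore\ T_2 = \frac{n_2(n_2+1)}{2} + \frac{n_1 n_2}{2}, \qquad T_1 = \frac{n_1(n_1+1)}{2} + \frac{n_1 n_2}{2}$$

Define new statistics:

$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - T_2, \qquad U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - T_1$$

Their sum should equal n1n2.

Here, under null:
U2 = 55 + 30 − 70 = 15
U1 = 6 + 30 − 21 = 15
U2 + U1 = 30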

 ∴ under null hypothesis, U1 should equal U2:

$$E(U_2 - U_1) = E\left[\left(\frac{n_2(n_2+1)}{2} - \frac{n_1(n_1+1)}{2}\right) - (T_2 - T_1)\right] = 0$$

The U’s should be equal to each other and will equal n1n2/2:
U1 + U2 = n1n2
Under null hypothesis, U1 = U2 = U0
∴ E(U1 + U2) = 2E(U0) = n1n2
E(U1) = E(U2) = E(U0) = n1n2/2

So, the test statistic here is not quite the difference in the sum-of-ranks of the 2 groups.
It’s the smaller observed U value: U0
For small n’s, take U0, and get p-value directly from a U table.

For large enough n’s (>10 per group)…

$$Z = \frac{U_0 - E(U_0)}{\sqrt{Var(U_0)}} = \frac{U_0 - \frac{n_1 n_2}{2}}{\sqrt{Var(U_0)}}$$

$$E(U_0) = \frac{n_1 n_2}{2}$$

$$Var(U_0) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}$$

Add observed data to the
example…
Example: If the girls on the two gymnastics teams were ranked as follows:
Team 1: 1, 5, 7   Observed T1 = 13
Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13   Observed T2 = 78
Are the teams significantly different?
Total sum of ranks = 13*14/2 = 91; n1n2 = 3*10 = 30
Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null) and U0 = 15
U1 = 30 + 6 – 13 = 23
U2 = 30 + 55 – 78 = 7
∴ U0 = 7
Not quite statistically significant in U table… p = .1084 (see attached); ×2 for two-tailed test

Example problem 2
A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig
(low-cal, low-fat). The following weight changes were obtained; note
they are very skewed because someone lost 100 pounds; the mean loss
for Atkins is going to look higher because of the bozo, but does that
mean the diet is better overall? Conduct a Mann-Whitney U test to
compare ranks.
Atkins   Jenny Craig
 -100       -11
   -8       -15
   -4        -5
   +5        +6
   +8       -20
   +2

Answer

Ranks (1 = most weight lost):

Atkins   Jenny Craig
   1         4
   5         3
   7         6
   9        10
  11         2
   8

Sum of ranks for JC = 25 (n=5)
Sum of ranks for Atkins = 41 (n=6)
n1n2 = 5*6 = 30
Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 and U0 = 15
U1 = 30 + 15 – 25 = 20
U2 = 30 + 21 – 41 = 10
U0 = 10; n1 = 5, n2 = 6
Go to Mann-Whitney chart… p = .2143 × 2 ≈ .43
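
Both worked examples in this appendix can be checked without a printed U table. The Python sketch below computes U1, U2 and U0 from the ranks and then gets the exact one-sided p-value by brute force: under the null, every way of handing n1 of the n1+n2 ranks to the smaller group is equally likely. The function names are hypothetical, and the enumeration is only practical for small samples like these.

```python
from itertools import combinations

def mann_whitney_u(ranks_1, ranks_2):
    """Return (U1, U2, U0) from the ranks held by each group."""
    n1, n2 = len(ranks_1), len(ranks_2)
    u1 = n1 * n2 + n1 * (n1 + 1) // 2 - sum(ranks_1)
    u2 = n1 * n2 + n2 * (n2 + 1) // 2 - sum(ranks_2)
    return u1, u2, min(u1, u2)

def exact_one_sided_p(u0, n1, n2):
    """P(U <= u0) under the null, by enumerating all equally likely rank sets."""
    min_sum = n1 * (n1 + 1) // 2            # smallest possible rank sum for n1 items
    all_sets = list(combinations(range(1, n1 + n2 + 1), n1))
    hits = sum(1 for ranks in all_sets if sum(ranks) - min_sum <= u0)
    return hits / len(all_sets)

# Gymnastics example: team 1 (n1=3) vs. team 2 (n2=10).
u1, u2, u0 = mann_whitney_u([1, 5, 7], [2, 3, 4, 6, 8, 9, 10, 11, 12, 13])
print(u1, u2, u0, round(exact_one_sided_p(u0, 3, 10), 4))   # 23 7 7 0.1084

# Diet example: Jenny Craig (n1=5) vs. Atkins (n2=6).
u1, u2, u0 = mann_whitney_u([4, 3, 6, 10, 2], [1, 5, 7, 9, 11, 8])
print(u1, u2, u0, round(exact_one_sided_p(u0, 5, 6), 4))    # 20 10 10 0.2143
```

Doubling each one-sided value gives the two-tailed p-value, as in the answers above.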
