Robert V. Hogg
Allen T. Craig
THE UNIVERSITY OF IOWA
Introduction to
Mathematical
Statistics
Fourth Edition
Macmillan Publishing Co., Inc.
NEW YORK
Collier Macmillan Publishers
LONDON
Copyright © 1978, Macmillan Publishing Co., Inc.
Printed in the United States of America
Earlier editions © 1958 and 1959 and copyright © 1965 and 1970 by Macmillan Publishing Co., Inc.
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the Publisher.
Macmillan Publishing Co., Inc.
866 Third Avenue, New York, New York 10022
Collier Macmillan Canada, Ltd.
Library of Congress Cataloging in Publication Data
Hogg, Robert V.
Introduction to mathematical statistics.
Bibliography: p.
Includes index.
1. Mathematical statistics. I. Craig, Allen Thornton, (date) joint author. II. Title.
QA276.H59 1978 519 77-2884
ISBN 0-02-355710-9 (Hardbound)
ISBN 0-02-978990-7 (International Edition)

Preface
We are much indebted to our colleagues throughout the country who
have so generously provided us with suggestions on both the order of
presentation and the kind of material to be included in this edition of
Introduction to Mathematical Statistics. We believe that you will find
the book much more adaptable for classroom use than the previous
edition. Again, essentially all the distribution theory that is needed is
found in the first five chapters. Estimation and tests of statistical
hypotheses, including nonparametric methods, follow in Chapters 6, 7,
8, and 9, respectively. However, sufficient statistics can be introduced
earlier by considering Chapter 10 immediately after Chapter 6 on
estimation. Many of the topics of Chapter 11 are such that they may
also be introduced sooner: the Rao-Cramér inequality (11.1) and
robust estimation (11.7) after measures of the quality of estimators
(6.2), sequential analysis (11.2) after best tests (7.2), multiple com-
parisons (11.3) after the analysis of variance (8.5), and classification
(11.4) after material on the sample correlation coefficient (8.7). With this
flexibility the first eight chapters can easily be covered in courses of
either six semester hours or eight quarter hours, supplementing with
the various topics from Chapters 9 through 11 as the teacher chooses
and as the time permits. In a longer course, we hope many teachers and
students will be interested in the topics of stochastic independence
(11.5), robustness (11.6 and 11.7), multivariate normal distributions
(12.1), and quadratic forms (12.2 and 12.3).
We are obligated to Catherine M. Thompson and Maxine Merrington
and to Professor E. S. Pearson for permission to include Tables II and
V, which are abridgments and adaptations of tables published in
Biometrika. We wish to thank Oliver & Boyd Ltd., Edinburgh, for
permission to include Table IV, which is an abridgment and adaptation
of Table III from the book Statistical Tables for Biological, Agricultural,
and Medical Research by the late Professor Sir Ronald A. Fisher,
Cambridge, and Dr. Frank Yates, Rothamsted. Finally, we wish to
thank Mrs. Karen Horner for her first-class help in the preparation of
the manuscript.
R. V. H.
A. T. C.
Contents
Chapter 1
Distributions of Random Variables 1
1.1 Introduction 1
1.2 Algebra of Sets 4
1.3 Set Functions 8
1.4 The Probability Set Function 12
1.5 Random Variables 16
1.6 The Probability Density Function 23
1.7 The Distribution Function 31
1.8 Certain Probability Models 38
1.9 Mathematical Expectation 44
1.10 Some Special Mathematical Expectations 48
1.11 Chebyshev's Inequality 58
Chapter 2
Conditional Probability and Stochastic Independence 61
2.1 Conditional Probability 61
2.2 Marginal and Conditional Distributions 65
2.3 The Correlation Coefficient 73
2.4 Stochastic Independence 80
Chapter 3
Some Special Distributions 90
3.1 The Binomial, Trinomial, and Multinomial Distributions 90
3.2 The Poisson Distribution 99
3.3 The Gamma and Chi-Square Distributions 103
3.4 The Normal Distribution 109
3.5 The Bivariate Normal Distribution 117
Chapter 4
Distributions of Functions of Random Variables 122
4.1 Sampling Theory 122
4.2 Transformations of Variables of the Discrete Type 128
4.3 Transformations of Variables of the Continuous Type 132
4.4 The t and F Distributions 143
4.5 Extensions of the Change-of-Variable Technique 147
4.6 Distributions of Order Statistics 154
4.7 The Moment-Generating-Function Technique 164
4.8 The Distributions of X̄ and nS²/σ² 172
4.9 Expectations of Functions of Random Variables 176

Chapter 5
Limiting Distributions 181
5.1 Limiting Distributions 181
5.2 Stochastic Convergence 186
5.3 Limiting Moment-Generating Functions 188
5.4 The Central Limit Theorem 192
5.5 Some Theorems on Limiting Distributions 196
Chapter 6
Estimation 200
6.1 Point Estimation 200
6.2 Measures of Quality of Estimators 207
6.3 Confidence Intervals for Means 212
6.4 Confidence Intervals for Differences of Means 219
6.5 Confidence Intervals for Variances 222
6.6 Bayesian Estimates 227

Chapter 7
Statistical Hypotheses 235
7.1 Some Examples and Definitions 235
7.2 Certain Best Tests 242
7.3 Uniformly Most Powerful Tests 251
7.4 Likelihood Ratio Tests 257

Chapter 8
Other Statistical Tests 269
8.1 Chi-Square Tests 269
8.2 The Distributions of Certain Quadratic Forms 278
8.3 A Test of the Equality of Several Means 283
8.4 Noncentral χ² and Noncentral F 288
8.5 The Analysis of Variance 291
8.6 A Regression Problem 296
8.7 A Test of Stochastic Independence 300

Chapter 9
Nonparametric Methods 304
9.1 Confidence Intervals for Distribution Quantiles 304
9.2 Tolerance Limits for Distributions 307
9.3 The Sign Test 312
9.4 A Test of Wilcoxon 314
9.5 The Equality of Two Distributions 320
9.6 The Mann-Whitney-Wilcoxon Test 326
9.7 Distributions Under Alternative Hypotheses 331
9.8 Linear Rank Statistics 334

Chapter 10
Sufficient Statistics 341
10.1 A Sufficient Statistic for a Parameter 341
10.2 The Rao-Blackwell Theorem 349
10.3 Completeness and Uniqueness 353
10.4 The Exponential Class of Probability Density Functions 357
10.5 Functions of a Parameter 361
10.6 The Case of Several Parameters 364

Chapter 11
Further Topics in Statistical Inference 370
11.1 The Rao-Cramér Inequality 370
11.2 The Sequential Probability Ratio Test 374
11.3 Multiple Comparisons 380
11.4 Classification 385
11.5 Sufficiency, Completeness, and Stochastic Independence 389
11.6 Robust Nonparametric Methods 396
11.7 Robust Estimation 400
Chapter 12
Further Normal Distribution Theory 405
12.1 The Multivariate Normal Distribution 405
12.2 The Distributions of Certain Quadratic Forms 410
12.3 The Independence of Certain Quadratic Forms 414

Appendix A
References 421

Appendix B
Tables 423

Appendix C
Answers to Selected Exercises 429

Index 435
Chapter 1
Distributions of
Random Variables
1.1 Introduction
Many kinds of investigations may be characterized in part by the
fact that repeated experimentation, under essentially the same con-
ditions, is more or less standard procedure. For instance, in medical
research, interest may center on the effect of a drug that is to be
administered; or an economist may be concerned with the prices of
three specified commodities at various time intervals; or the agronomist
may wish to study the effect that a chemical fertilizer has on the yield
of a cereal grain. The only way in which an investigator can elicit
information about any such phenomenon is to perform his experiment.
Each experiment terminates with an outcome. But it is characteristic of
these experiments that the outcome cannot be predicted with certainty
prior to the performance of the experiment.
Suppose that we have such an experiment, the outcome of which
cannot be predicted with certainty, but the experiment is of such a
nature that the collection of every possible outcome can be described
prior to its performance. If this kind of experiment can be repeated
under the same conditions, it is called a random experiment, and the
collection of every possible outcome is called the experimental space or
the sample space.
Example 1. In the toss of a coin, let the outcome tails be denoted by
T and let the outcome heads be denoted by H. If we assume that the coin
may be repeatedly tossed under the same conditions, then the toss of this
coin is an example of a random experiment in which the outcome is one of
the two symbols T and H; that is, the sample space is the collection of these
two symbols.
Example 2. In the cast of one red die and one white die, let the outcome
be the ordered pair (number of spots up on the red die, number of spots up
on the white die). If we assume that these two dice may be repeatedly cast
under the same conditions, then the cast of this pair of dice is a random
experiment and the sample space consists of the 36 ordered pairs (1, 1), ..., (1, 6), (2, 1), ..., (2, 6), ..., (6, 6).
Let 𝒞 denote a sample space, and let C represent a part of 𝒞. If, upon the performance of the experiment, the outcome is in C, we shall say that the event C has occurred. Now conceive of our having made N repeated performances of the random experiment. Then we can count the number f of times (the frequency) that the event C actually occurred throughout the N performances. The ratio f/N is called the relative frequency of the event C in these N experiments. A relative frequency is
frequency of the event C in these N experiments. A relative frequency is
usually quite erratic for small values of N, as you can discover by
tossing a coin. But as N increases, experience indicates that relative
frequencies tend to stabilize. This suggests that we associate with the
event C a number, say p, that is equal or approximately equal to that
number about which the relative frequency seems to stabilize. If we do
this, then the number p can be interpreted as that number which, in
future performances of the experiment, the relative frequency of the
event C will either equal or approximate. Thus, although we cannot
predict the outcome of a random experiment, we can, for a large value
of N, predict approximately the relative frequency with which the
outcome will be in C. The number p associated with the event C is given
various names. Sometimes it is called the probability that the outcome
of the random experiment is in C; sometimes it is called the probability
of the event C; and sometimes it is called the probability measure of C.
The context usually suggests an appropriate choice of terminology.
Example 3. Let 𝒞 denote the sample space of Example 2 and let C be the collection of every ordered pair of 𝒞 for which the sum of the pair is equal to seven. Thus C is the collection (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), and (6, 1). Suppose that the dice are cast N = 400 times and let f, the frequency of a sum of seven, be f = 60. Then the relative frequency with which the outcome was in C is f/N = 60/400 = 0.15. Thus we might associate with C a number p that is close to 0.15, and p would be called the probability of the event C.
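The stabilization of relative frequencies is easy to observe empirically. The short sketch below is not part of the original text; it is a minimal Python simulation, assuming only the standard library, that casts a pair of dice N = 400 times and prints the relative frequency f/N of the event C (a sum of seven), which is typically near 1/6 ≈ 0.167.

    import random

    random.seed(7)                      # fixed seed so the run is reproducible
    N = 400                             # number of repetitions of the random experiment
    f = 0                               # frequency of the event C: the sum of the spots is seven
    for _ in range(N):
        red, white = random.randint(1, 6), random.randint(1, 6)
        if red + white == 7:
            f += 1
    print(f / N)                        # the relative frequency f/N; near 1/6 for large N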
Remark. The preceding interpretation of probability is sometimes re-
ferred to as the relative frequency approach, and it obviously depends upon
the fact that an experiment can be repeated under essentially identical con-
ditions. However, many persons extend probability to other situations by treating it as a rational measure of belief. For example, the statement p = 2/5 would mean to them that their personal or subjective probability of the event C is equal to 2/5. Hence, if they are not opposed to gambling, this could be interpreted as a willingness on their part to bet on the outcome of C so that the two possible payoffs are in the ratio p/(1 − p) = (2/5)/(3/5) = 2/3. Moreover, if they truly believe that p = 2/5 is correct, they would be willing to accept either side of the bet: (a) win 3 units if C occurs and lose 2 if it does not occur, or (b) win 2 units if C does not occur and lose 3 if it does. However, since the
mathematical properties of probability given in Section 1.4 are consistent
with either of these interpretations, the subsequent mathematical develop-
ment does not depend upon which approach is used.
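As a quick arithmetic check of the betting interpretation in the Remark (a sketch, not part of the original text, and taking the subjective probability to be p = 2/5 as reconstructed above), both sides of the stated bet have expected payoff zero, which is why a person holding that belief would accept either side.

    from fractions import Fraction

    p = Fraction(2, 5)                  # subjective probability of the event C
    side_a = p * 3 + (1 - p) * (-2)     # win 3 units if C occurs, lose 2 if it does not
    side_b = p * (-3) + (1 - p) * 2     # lose 3 units if C occurs, win 2 if it does not
    print(side_a, side_b)               # both expected payoffs are 0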
The primary purpose of having a mathematical theory of statistics
is to provide mathematical models for random experiments. Once a
model for such an experiment has been provided and the theory
worked out in detail, the statistician may, within this framework, make
inferences (that is, draw conclusions) about the random experiment.
The construction of such a model requires a theory of probability. One
of the more logically satisfying theories of probability is that based on
the concepts of sets and functions of sets. These concepts are introduced
in Sections 1.2 and 1.3.
EXERCISES
1.1. In each of the following random experiments, describe the sample space 𝒞. Use any experience that you may have had (or use your intuition) to assign a value to the probability p of the event C in each of the following instances:
(a) The toss of an unbiased coin where the event C is tails.
(b) The cast of an honest die where the event C is a five or a six.
(c) The draw of a card from an ordinary deck of playing cards where the
event C occurs if the card is a spade.
(d) The choice of a number on the interval zero to 1 where the event C
occurs if the number is less than t.
(e) The choice of a point from the interior of a square with opposite
vertices (-1, -1) and (1, 1) where the event C occurs if the sum of the
coordinates of the point is less than 1-.
1.2. A point is to be chosen in a haphazard fashion from the interior of a
fixed circle. Assign a probability p that the point will be inside another circle,
which has a radius of one-half the first circle and which lies entirely within
the first circle.
1.3. An unbiased coin is to be tossed twice. Assign a probability p1 to the event that the first toss will be a head and that the second toss will be a tail. Assign a probability p2 to the event that there will be one head and one tail in the two tosses.
1.2 Algebra of Sets
The concept of a set or a collection of objects is usually left undefined.
However, a particular set can be described so that there is no misunder-
standing as to what collection of objects is under consideration. For
example, the set of the first 10 positive integers is sufficiently well
described to make clear that the numbers 3/4 and 14 are not in the set, while the number 3 is in the set. If an object belongs to a set, it is said to be an element of the set. For example, if A denotes the set of real numbers x for which 0 ≤ x ≤ 1, then 3/4 is an element of the set A. The fact that 3/4 is an element of the set A is indicated by writing 3/4 ∈ A. More generally, a ∈ A means that a is an element of the set A.
The sets that concern us will frequently be sets of numbers. However, the language of sets of points proves somewhat more convenient than that of sets of numbers. Accordingly, we briefly indicate how we use this terminology. In analytic geometry considerable emphasis is placed on the fact that to each point on a line (on which an origin and a unit point have been selected) there corresponds one and only one number, say x; and that to each number x there corresponds one and only one point on the line. This one-to-one correspondence between the numbers and points on a line enables us to speak, without misunderstanding, of the "point x" instead of the "number x." Furthermore, with a plane rectangular coordinate system and with x and y numbers, to each symbol (x, y) there corresponds one and only one point in the plane; and to each point in the plane there corresponds but one such symbol. Here again, we may speak of the "point (x, y)," meaning the "ordered number pair x and y." This convenient language can be used when we have a rectangular coordinate system in a space of three or more dimensions. Thus the "point (x1, x2, ..., xn)" means the numbers x1, x2, ..., xn in the order stated. Accordingly, in describing our sets, we frequently speak of a set of points (a set whose elements are points), being careful, of course, to describe the set so as to avoid any ambiguity. The notation A = {x; 0 ≤ x ≤ 1} is read "A is the one-dimensional set of points x for which 0 ≤ x ≤ 1." Similarly, A = {(x, y); 0 ≤ x ≤ 1, 0 ≤ y ≤ 1} can be read "A is the two-dimensional set of points (x, y) that are interior to, or on the boundary of, a square with opposite vertices at (0, 0) and (1, 1)." We now give some definitions (together with illustrative examples) that lead to an elementary algebra of sets adequate for our purposes.
Definition 1. If each element of a set A1 is also an element of set A2, the set A1 is called a subset of the set A2. This is indicated by writing A1 ⊂ A2. If A1 ⊂ A2 and also A2 ⊂ A1, the two sets have the same elements, and this is indicated by writing A1 = A2.
Example 1. Let A1 = {x; 0 ≤ x ≤ 1} and A2 = {x; −1 ≤ x ≤ 2}. Here the one-dimensional set A1 is seen to be a subset of the one-dimensional set A2; that is, A1 ⊂ A2. Subsequently, when the dimensionality of the set is clear, we shall not make specific reference to it.
Example 2. Let A1 = {(x, y); 0 ≤ x = y ≤ 1} and A2 = {(x, y); 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}. Since the elements of A1 are the points on one diagonal of the square, then A1 ⊂ A2.
Definition 2. If a set A has no elements, A is called the null set. This is indicated by writing A = ∅.
Definition 3. The set of all elements that belong to at least one of the sets A1 and A2 is called the union of A1 and A2. The union of A1 and A2 is indicated by writing A1 ∪ A2. The union of several sets A1, A2, A3, ... is the set of all elements that belong to at least one of the several sets. This union is denoted by A1 ∪ A2 ∪ A3 ∪ ... or by A1 ∪ A2 ∪ ... ∪ Ak if a finite number k of sets is involved.
Example 3. Let A1 = {x; x = 0, 1, ..., 10} and A2 = {x; x = 8, 9, 10, 11, or 11 < x ≤ 12}. Then A1 ∪ A2 = {x; x = 0, 1, ..., 8, 9, 10, 11, or 11 < x ≤ 12} = {x; x = 0, 1, ..., 8, 9, 10, or 11 ≤ x ≤ 12}.
Example 4. Let A1 and A2 be defined as in Example 1. Then A1 ∪ A2 = A2.
Example 5. Let A2 = ∅. Then A1 ∪ A2 = A1 for every set A1.
Example 6. For every set A, A ∪ A = A.
Example 7. Let Ak = {x; 1/(k + 1) ≤ x ≤ 1}, k = 1, 2, 3, .... Then A1 ∪ A2 ∪ A3 ∪ ... = {x; 0 < x ≤ 1}. Note that the number zero is not in this set, since it is not in one of the sets A1, A2, A3, ....
Definition 4. The set of all elements that belong to each of the sets A1 and A2 is called the intersection of A1 and A2. The intersection of A1 and A2 is indicated by writing A1 ∩ A2. The intersection of several sets A1, A2, A3, ... is the set of all elements that belong to each of the sets A1, A2, A3, .... This intersection is denoted by A1 ∩ A2 ∩ A3 ∩ ... or by A1 ∩ A2 ∩ ... ∩ Ak if a finite number k of sets is involved.
Example 8. Let A1 = {(x, y); (x, y) = (0, 0), (0, 1), (1, 1)} and A2 = {(x, y); (x, y) = (1, 1), (1, 2), (2, 1)}. Then A1 ∩ A2 = {(x, y); (x, y) = (1, 1)}.
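For finite sets of points, the union and intersection of Definitions 3 and 4 (and the complement introduced later in Definition 6) behave exactly like Python's built-in set operations. The sketch below is not part of the original text; it uses the point sets of Example 8, together with a small hypothetical space, for illustration.

    # A1 and A2 are the two-dimensional point sets of Example 8.
    A1 = {(0, 0), (0, 1), (1, 1)}
    A2 = {(1, 1), (1, 2), (2, 1)}

    print(A1 | A2)        # the union A1 ∪ A2: points belonging to at least one of the sets
    print(A1 & A2)        # the intersection A1 ∩ A2: {(1, 1)}

    # Complement taken with respect to a small hypothetical finite space of points.
    space = {(x, y) for x in range(3) for y in range(3)}
    print(space - A1)     # A1*, the elements of the space that are not elements of A1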
Example 9. Let A1 = {(x, y); 0 ≤ x + y ≤ 1} and A2 = {(x, y); 1 < x + y}. Then A1 and A2 have no points in common and A1 ∩ A2 = ∅.
Example 10. For every set A, A ∩ A = A and A ∩ ∅ = ∅.
Example 11. Let Ak = {x; 0 < x < 1/k}, k = 1, 2, 3, .... Then A1 ∩ A2 ∩ A3 ∩ ... is the null set, since there is no point that belongs to each of the sets A1, A2, A3, ....
Example 12. Let A1 and A2 represent the sets of points enclosed, respectively, by two intersecting circles. Then the sets A1 ∪ A2 and A1 ∩ A2 are represented, respectively, by the shaded regions in the Venn diagrams in Figure 1.1.
[Figure 1.1. Venn diagrams of A1 ∪ A2 and A1 ∩ A2.]
Example 13. Let A1, A2, and A3 represent the sets of points enclosed, respectively, by three intersecting circles. Then the sets (A1 ∪ A2) ∩ A3 and (A1 ∩ A2) ∪ A3 are depicted in Figure 1.2.
[Figure 1.2. Venn diagrams of (A1 ∪ A2) ∩ A3 and (A1 ∩ A2) ∪ A3.]
Definition 5. In certain discussions or considerations the totality of all elements that pertain to the discussion can be described. This set of all elements under consideration is given a special name. It is called the space. We shall often denote spaces by capital script letters such as 𝒜, ℬ, and 𝒞.
Example 14. Let the number of heads, in tossing a coin four times, be denoted by x. Of necessity, the number of heads will be one of the numbers 0, 1, 2, 3, 4. Here, then, the space is the set 𝒜 = {x; x = 0, 1, 2, 3, 4}.
Example 15. Consider all nondegenerate rectangles of base x and height y. To be meaningful, both x and y must be positive. Thus the space is the set 𝒜 = {(x, y); x > 0, y > 0}.
Definition 6. Let 𝒜 denote a space and let A be a subset of the set 𝒜. The set that consists of all elements of 𝒜 that are not elements of A is called the complement of A (actually, with respect to 𝒜). The complement of A is denoted by A*. In particular, 𝒜* = ∅.
Example 16. Let 𝒜 be defined as in Example 14, and let the set A = {x; x = 0, 1}. The complement of A (with respect to 𝒜) is A* = {x; x = 2, 3, 4}.
Example 17. Given A ⊂ 𝒜. Then A ∪ A* = 𝒜, A ∩ A* = ∅, A ∪ 𝒜 = 𝒜, A ∩ 𝒜 = A, and (A*)* = A.

EXERCISES
1.4. Find the union A1 ∪ A2 and the intersection A1 ∩ A2 of the two sets A1 and A2, where:
(a) A1 = {x; x = 0, 1, 2}, A2 = {x; x = 2, 3, 4}.
(b) A1 = {x; 0 < x < 2}, A2 = {x; 1 ≤ x < 3}.
(c) A1 = {(x, y); 0 < x < 2, 0 < y < 2}, A2 = {(x, y); 1 < x < 3, 1 < y < 3}.
1.5. Find the complement A *of the set A with respect to the space d if:
(a) d = {x; 0 < x < I}, A = {x; i :::::; x < I}.
(b) d = {(x, y, z); x2 + y2 + Z2 :::::; I}, A = {(x, y, z); x2 + y2 + Z2 = I}.
(c) d = {(x, y); Ixl + Iyl :::::; 2}, A = {(x, y); x2 + y2 < 2}.
1.6. List all possible arrangements of the four letters m, a, r, and y. Let
A l be the collection of the arrangements in which y is in the last position.
Let A 2 be the collection of the arrangements in which m is in the first position.
Find the union and the intersection of A l and A 2 •
1.7. By use of Venn diagrams, in which the space 𝒜 is the set of points enclosed by a rectangle containing the circles, compare the following sets:
(a) A1 ∩ (A2 ∪ A3) and (A1 ∩ A2) ∪ (A1 ∩ A3).
(b) A1 ∪ (A2 ∩ A3) and (A1 ∪ A2) ∩ (A1 ∪ A3).
(c) (A1 ∪ A2)* and A1* ∩ A2*.
(d) (A1 ∩ A2)* and A1* ∪ A2*.
1.8. If a sequence of sets A1, A2, A3, ... is such that Ak ⊂ Ak+1, k = 1, 2, 3, ..., the sequence is said to be a nondecreasing sequence. Give an example of this kind of sequence of sets.
1.9. If a sequence of sets A1, A2, A3, ... is such that Ak ⊃ Ak+1, k = 1, 2, 3, ..., the sequence is said to be a nonincreasing sequence. Give an example of this kind of sequence of sets.
1.10. If A1, A2, A3, ... are sets such that Ak ⊂ Ak+1, k = 1, 2, 3, ..., lim (k→∞) Ak is defined as the union A1 ∪ A2 ∪ A3 ∪ .... Find lim (k→∞) Ak if:
(a) Ak = {x; 1/k ≤ x ≤ 3 − 1/k}, k = 1, 2, 3, ...;
(b) Ak = {(x, y); 1/k ≤ x² + y² ≤ 4 − 1/k}, k = 1, 2, 3, ....
1.11. If A1, A2, A3, ... are sets such that Ak ⊃ Ak+1, k = 1, 2, 3, ..., lim (k→∞) Ak is defined as the intersection A1 ∩ A2 ∩ A3 ∩ .... Find lim (k→∞) Ak if:
(a) Ak = {x; 2 − 1/k < x ≤ 2}, k = 1, 2, 3, ...;
(b) Ak = {x; 2 < x ≤ 2 + 1/k}, k = 1, 2, 3, ...;
(c) Ak = {(x, y); 0 ≤ x² + y² ≤ 1/k}, k = 1, 2, 3, ....
1.3 Set Functions

In the calculus, functions such as

f(x) = 2x, −∞ < x < ∞,

or

g(x, y) = e^(−x−y), 0 < x < ∞, 0 < y < ∞,
        = 0 elsewhere,

or possibly

h(x1, x2, ..., xn) = 3x1x2...xn, 0 ≤ xi ≤ 1, i = 1, 2, ..., n,
                   = 0 elsewhere,

were of common occurrence. The value of f(x) at the "point x = 1" is f(1) = 2; the value of g(x, y) at the "point (−1, 3)" is g(−1, 3) = 0; the value of h(x1, x2, ..., xn) at the "point (1, 1, ..., 1)" is 3. Functions such as these are called functions of a point or, more simply, point functions because they are evaluated (if they have a value) at a point in a space of indicated dimension.
There is no reason why, if they prove useful, we should not have functions that can be evaluated, not necessarily at a point, but for an entire set of points. Such functions are naturally called functions of a set or, more simply, set functions. We shall give some examples of set functions and evaluate them for certain simple sets.

Example 1. Let A be a set in one-dimensional space and let Q(A) be equal to the number of points in A which correspond to positive integers. Then Q(A) is a function of the set A. Thus, if A = {x; 0 < x < 5}, then Q(A) = 4; if A = {x; x = −2, −1}, then Q(A) = 0; if A = {x; −∞ < x < 6}, then Q(A) = 5.

Example 2. Let A be a set in two-dimensional space and let Q(A) be the area of A, if A has a finite area; otherwise, let Q(A) be undefined. Thus, if A = {(x, y); x² + y² ≤ 1}, then Q(A) = π; if A = {(x, y); (x, y) = (0, 0), (1, 1), (0, 1)}, then Q(A) = 0; if A = {(x, y); 0 ≤ x, 0 ≤ y, x + y ≤ 1}, then Q(A) = 1/2.

Example 3. Let A be a set in three-dimensional space and let Q(A) be the volume of A, if A has a finite volume; otherwise, let Q(A) be undefined. Thus, if A = {(x, y, z); 0 ≤ x ≤ 2, 0 ≤ y ≤ 1, 0 ≤ z ≤ 3}, then Q(A) = 6; if A = {(x, y, z); x² + y² + z² ≥ 1}, then Q(A) is undefined.

At this point we introduce the following notations. The symbol

∫_A f(x) dx

will mean the ordinary (Riemann) integral of f(x) over a prescribed one-dimensional set A; the symbol

∫∫_A g(x, y) dx dy

will mean the Riemann integral of g(x, y) over a prescribed two-dimensional set A; and so on. To be sure, unless these sets A and these functions f(x) and g(x, y) are chosen with care, the integrals will frequently fail to exist. Similarly, the symbol

Σ_A f(x)

will mean the sum extended over all x ∈ A; the symbol

Σ Σ_A g(x, y)

will mean the sum extended over all (x, y) ∈ A; and so on.

Example 4. Let A be a set in one-dimensional space and let Q(A) = Σ_A f(x), where

f(x) = (1/2)^x, x = 1, 2, 3, ...,
     = 0 elsewhere.

If A = {x; 0 ≤ x ≤ 3}, then

Q(A) = 1/2 + (1/2)² + (1/2)³ = 7/8.

Example 5. Let Q(A) = Σ_A f(x), where

f(x) = p^x (1 − p)^(1−x), x = 0, 1,
     = 0 elsewhere.

If A = {x; x = 0}, then

Q(A) = Σ (x=0 to 0) p^x (1 − p)^(1−x) = 1 − p;

if A = {x; 1 ≤ x ≤ 2}, then Q(A) = f(1) = p.

Example 6. Let A be a one-dimensional set and let

Q(A) = ∫_A e^(−x) dx.

Thus, if A = {x; 0 ≤ x < ∞}, then

Q(A) = ∫_0^∞ e^(−x) dx = 1;

if A = {x; 1 ≤ x ≤ 2}, then

Q(A) = ∫_1^2 e^(−x) dx = e^(−1) − e^(−2);

if A1 = {x; 0 ≤ x ≤ 1} and A2 = {x; 1 < x ≤ 3}, then

Q(A1 ∪ A2) = ∫_0^3 e^(−x) dx = ∫_0^1 e^(−x) dx + ∫_1^3 e^(−x) dx = Q(A1) + Q(A2);

if A = A1 ∪ A2, where A1 = {x; 0 ≤ x ≤ 2} and A2 = {x; 1 ≤ x ≤ 3}, then

Q(A) = Q(A1 ∪ A2) = ∫_0^3 e^(−x) dx
     = ∫_0^2 e^(−x) dx + ∫_1^3 e^(−x) dx − ∫_1^2 e^(−x) dx
     = Q(A1) + Q(A2) − Q(A1 ∩ A2).

Example 7. Let A be a set in n-dimensional space and let

Q(A) = ∫...∫_A dx1 dx2 ... dxn.

If A = {(x1, x2, ..., xn); 0 ≤ x1 ≤ x2 ≤ ... ≤ xn ≤ 1}, then

Q(A) = ∫_0^1 ∫_0^xn ... ∫_0^x3 ∫_0^x2 dx1 dx2 ... dx(n−1) dxn = 1/n!,

where n! = n(n − 1)...3·2·1.
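The additive behavior of the set function in Example 6 is easy to verify numerically. The following sketch is not part of the original text; it evaluates Q over intervals by the antiderivative of e^(−x) and checks that Q(A1 ∪ A2) = Q(A1) + Q(A2) − Q(A1 ∩ A2) for A1 = {x; 0 ≤ x ≤ 2} and A2 = {x; 1 ≤ x ≤ 3}.

    from math import exp

    def Q(a, b):
        # Q of the interval {x; a <= x <= b}, that is, the integral of e^(-x) from a to b
        return exp(-a) - exp(-b)

    lhs = Q(0, 3)                       # Q(A1 ∪ A2), since A1 ∪ A2 = {x; 0 <= x <= 3}
    rhs = Q(0, 2) + Q(1, 3) - Q(1, 2)   # Q(A1) + Q(A2) - Q(A1 ∩ A2)
    print(lhs, rhs)                     # the two values agree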
EXERCISES
1.14. For every one-dimensional set A, let Q(A) be equal to the number of points in A that correspond to positive integers. If A1 = {x; x a multiple of 3, less than or equal to 50} and A2 = {x; x a multiple of 7, less than or equal to 50}, find Q(A1), Q(A2), Q(A1 ∪ A2), and Q(A1 ∩ A2). Show that Q(A1 ∪ A2) = Q(A1) + Q(A2) − Q(A1 ∩ A2).
a + ar + .. , + arn- l = a(1 - rn)j(1 - r) and lim Sn = aj(1 - r) provided
n-e cc
that Irl < 1.
1.13. For everyone-dimensional set A for which the integral exists, let
Q(A) = fAf(x) da; where f(x) = 6x(1 - x), 0 < x < 1, zero elsewhere;
otherwise, let Q(A) be undefined. If Al = {x; i- < x < i}, A 2 = {x; X = !},
and Aa = {x; 0 < x < 10}, find Q(AI ) , Q(A2), and Q(Aa).
1.18. Let 𝒜 be the set of points interior to or on the boundary of a cube with edge 1. Moreover, say that the cube is in the first octant with one vertex at the point (0, 0, 0) and an opposite vertex at the point (1, 1, 1). Let Q(A) = ∫∫∫_A dx dy dz. (a) If A ⊂ 𝒜 is the set {(x, y, z); 0 < x < y < z < 1}, compute Q(A). (b) If A is the subset {(x, y, z); 0 < x = y = z < 1}, compute Q(A).
1.15. For every two-dimensional set A, let Q(A) be equal to the number of points (x, y) in A for which both x and y are positive integers. Find Q(A1) and Q(A2), where A1 = {(x, y); x² + y² ≤ 4} and A2 = {(x, y); x² + y² ≤ 9}. Note that A1 ⊂ A2 and that Q(A1) ≤ Q(A2).
1.16. Let Q(A) = ∫∫_A (x² + y²) dx dy for every two-dimensional set A for which the integral exists; otherwise, let Q(A) be undefined. If A1 = {(x, y); −1 ≤ x ≤ 1, −1 ≤ y ≤ 1}, A2 = {(x, y); −1 ≤ x = y ≤ 1}, and A3 = {(x, y); x² + y² ≤ 1}, find Q(A1), Q(A2), and Q(A3). Hint. In evaluating Q(A2), recall the definition of the double integral (or consider the volume under the surface z = x² + y² above the line segment −1 ≤ x = y ≤ 1 in the xy-plane). Use polar coordinates in the calculation of Q(A3).
1.17. Let 𝒜 denote the set of points that are interior to or on the boundary of a square with opposite vertices at the point (0, 0) and at the point (1, 1). Let Q(A) = ∫∫_A dy dx. (a) If A ⊂ 𝒜 is the set {(x, y); 0 < x < y < 1}, compute Q(A). (b) If A ⊂ 𝒜 is the set {(x, y); 0 < x = y < 1}, compute Q(A). (c) If A ⊂ 𝒜 is the set {(x, y); 0 < x/2 ≤ y ≤ 3x/2 < 1}, compute Q(A).
1.12. For everyone-dimensional set A, let Q(A) = Lf(x), wheref(x) =
A
(!)mX
, x = 0, 1, 2, .. " zero elsewhere. If Al = {x; X = 0, 1,2, 3} and
A 2 = {x; X = 0, 1,2, ...}, find Q(AI ) and Q(A2). Hint. Recall that Sn =
1.19. Let A denote the set {(x, y, z); x² + y² + z² ≤ 1}. Evaluate Q(A) = ∫∫∫_A √(x² + y² + z²) dx dy dz. Hint. Change variables to spherical coordinates.
1.20. To join a certain club, a person must be either a statistician or a mathematician or both. Of the 25 members in this club, 19 are statisticians and 16 are mathematicians. How many persons in the club are both a statistician and a mathematician?
1.21. After a hard-fought football game, it was reported that, of the 11 starting players, 8 hurt a hip, 6 hurt an arm, 5 hurt a knee, 3 hurt both a hip and an arm, 2 hurt both a hip and a knee, 1 hurt both an arm and a knee, and no one hurt all three. Comment on the accuracy of the report.
1.4 The Probability Set Function
Let 𝒞 denote the set of every possible outcome of a random experiment; that is, 𝒞 is the sample space. It is our purpose to define a set function P(C) such that if C is a subset of 𝒞, then P(C) is the probability that the outcome of the random experiment is an element of C. Henceforth it will be tacitly assumed that the structure of each set C is sufficiently simple to allow the computation. We have already seen that advantages accrue if we take P(C) to be that number about which the relative frequency f/N of the event C tends to stabilize after a long series of experiments. This important fact suggests some of the properties that we would surely want the set function P(C) to possess. For example, no relative frequency is ever negative; accordingly, we would want P(C) to be a nonnegative set function. Again, the relative frequency of the whole sample space 𝒞 is always 1. Thus we would want P(𝒞) = 1. Finally, if C1, C2, C3, ... are subsets of 𝒞 such that no two of these subsets have a point in common, the relative frequency of the union of these sets is the sum of the relative frequencies of the sets, and we would want the set function P(C) to reflect this additive property. We now formally define a probability set function.
Definition 7. If P(C) is defined for a type of subset of the space 𝒞, and if
(a) P(C) ≥ 0,
(b) P(C1 ∪ C2 ∪ C3 ∪ ...) = P(C1) + P(C2) + P(C3) + ..., where the sets Ci, i = 1, 2, 3, ..., are such that no two have a point in common (that is, where Ci ∩ Cj = ∅, i ≠ j),
(c) P(𝒞) = 1,
then P(C) is called the probability set function of the outcome of the random experiment. For each subset C of 𝒞, the number P(C) is called the probability that the outcome of the random experiment is an element of the set C, or the probability of the event C, or the probability measure of the set C.
A probability set function tells us how the probability is distributed over various subsets C of a sample space 𝒞. In this sense we speak of a distribution of probability.
Remark. In the definition, the phrase "a type of subset of the space 𝒞" would be explained more fully in a more advanced course. Nevertheless, a few observations can be made about the collection of subsets that are of the type. From condition (c) of the definition, we see that the space 𝒞 must be in the collection. Condition (b) implies that if the sets C1, C2, C3, ... are in the collection, their union is also one of that type. Finally, we observe from the following theorems and their proofs that if the set C is in the collection, its complement must be one of those subsets. In particular, the null set, which is the complement of 𝒞, must be in the collection.
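For a finite sample space, a probability set function can be represented by a table of probabilities on the individual outcomes, with P(C) obtained by summation. The sketch below is not part of the original text; it assumes, for illustration, the cast of a fair die with probability 1/6 on each outcome and checks conditions (a), (b), and (c) of Definition 7 for a pair of disjoint subsets.

    from fractions import Fraction

    prob = {c: Fraction(1, 6) for c in range(1, 7)}   # probability assigned to each outcome

    def P(C):
        # the probability set function: sum the probabilities of the outcomes in C
        return sum(prob[c] for c in C)

    C1, C2 = {1, 2}, {5, 6}                    # two subsets with no element in common
    print(P(C1) >= 0 and P(C2) >= 0)           # condition (a): nonnegativity
    print(P(C1 | C2) == P(C1) + P(C2))         # condition (b) for two disjoint sets
    print(P(set(prob)) == 1)                   # condition (c): P of the whole space is 1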
The following theorems give us some other properties of a probability set function. In the statement of each of these theorems, P(C) is taken, tacitly, to be a probability set function defined for a certain type of subset of the sample space 𝒞.
Theorem 1. For each C ⊂ 𝒞, P(C) = 1 − P(C*).
Proof. We have 𝒞 = C ∪ C* and C ∩ C* = ∅. Thus, from (c) and (b) of Definition 7, it follows that
1 = P(C) + P(C*),
which is the desired result.
Theorem 2. The probability of the null set is zero; that is, P(∅) = 0.
Proof. In Theorem 1, take C = ∅ so that C* = 𝒞. Accordingly, we have
P(∅) = 1 − P(𝒞) = 1 − 1 = 0,
and the theorem is proved.
Theorem 3. If C1 and C2 are subsets of 𝒞 such that C1 ⊂ C2, then P(C1) ≤ P(C2).
Proof. Now C2 = C1 ∪ (C1* ∩ C2) and C1 ∩ (C1* ∩ C2) = ∅. Hence, from (b) of Definition 7,
P(C2) = P(C1) + P(C1* ∩ C2).
However, from (a) of Definition 7, P(C1* ∩ C2) ≥ 0; accordingly, P(C2) ≥ P(C1).
Theorem 4. For each C ⊂ 𝒞, 0 ≤ P(C) ≤ 1.
Proof. Since ∅ ⊂ C ⊂ 𝒞, we have by Theorem 3 that
P(∅) ≤ P(C) ≤ P(𝒞) or 0 ≤ P(C) ≤ 1,
the desired result.
Theorem 5. If C1 and C2 are subsets of 𝒞, then
P(C1 ∪ C2) = P(C1) + P(C2) − P(C1 ∩ C2).
Proof. Each of the sets C1 ∪ C2 and C2 can be represented, respectively, as a union of nonintersecting sets as follows:
C1 ∪ C2 = C1 ∪ (C1* ∩ C2) and C2 = (C1 ∩ C2) ∪ (C1* ∩ C2).
Thus, from (b) of Definition 7,
P(C1 ∪ C2) = P(C1) + P(C1* ∩ C2)
and
P(C2) = P(C1 ∩ C2) + P(C1* ∩ C2).
If the second of these equations is solved for P(C1* ∩ C2) and this result substituted in the first equation, we obtain
P(C1 ∪ C2) = P(C1) + P(C2) − P(C1 ∩ C2).
This completes the proof.
Example 1. Let 𝒞 denote the sample space of Example 2 of Section 1.1. Let the probability set function assign a probability of 1/36 to each of the 36 points in 𝒞. If C1 = {c; c = (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)} and C2 = {c; c = (1, 2), (2, 2), (3, 2)}, then P(C1) = 5/36, P(C2) = 3/36, P(C1 ∪ C2) = 8/36, and P(C1 ∩ C2) = 0.
Example 2. Two coins are to be tossed and the outcome is the ordered pair (face on the first coin, face on the second coin). Thus the sample space may be represented as 𝒞 = {c; c = (H, H), (H, T), (T, H), (T, T)}. Let the probability set function assign a probability of 1/4 to each element of 𝒞. Let C1 = {c; c = (H, H), (H, T)} and C2 = {c; c = (H, H), (T, H)}. Then P(C1) = P(C2) = 1/2, P(C1 ∩ C2) = 1/4, and, in accordance with Theorem 5,
P(C1 ∪ C2) = 1/2 + 1/2 − 1/4 = 3/4.
Let 𝒞 denote a sample space and let C1, C2, C3, ... denote subsets of 𝒞. If these subsets are such that no two have an element in common, they are called mutually disjoint sets and the corresponding events C1, C2, C3, ... are said to be mutually exclusive events. Then, for example, P(C1 ∪ C2 ∪ C3 ∪ ...) = P(C1) + P(C2) + P(C3) + ..., in accordance with (b) of Definition 7. Moreover, if 𝒞 = C1 ∪ C2 ∪ C3 ∪ ..., the mutually exclusive events are further characterized as being exhaustive and the probability of their union is obviously equal to 1.
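Theorem 5 can be checked directly on Example 1 by enumerating the 36 equally likely ordered pairs. The sketch below is not part of the original text.

    from fractions import Fraction
    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))      # the 36 ordered pairs of Example 2, Section 1.1

    def P(C):
        return Fraction(sum(1 for c in outcomes if c in C), 36)

    C1 = {(x, 1) for x in range(1, 6)}                   # (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)
    C2 = {(x, 2) for x in range(1, 4)}                   # (1, 2), (2, 2), (3, 2)
    print(P(C1), P(C2))                                  # 5/36 and 3/36 = 1/12
    print(P(C1 | C2) == P(C1) + P(C2) - P(C1 & C2))      # Theorem 5 holds: prints True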
EXERCISES
1.22. A positive integer from one to six is to be chosen by casting a die. Thus the elements c of the sample space 𝒞 are 1, 2, 3, 4, 5, 6. Let C1 = {c; c = 1, 2, 3, 4}, C2 = {c; c = 3, 4, 5, 6}. If the probability set function P assigns a probability of 1/6 to each of the elements of 𝒞, compute P(C1), P(C2), P(C1 ∩ C2), and P(C1 ∪ C2).
1.23. A random experiment consists in drawing a card from an ordinary deck of 52 playing cards. Let the probability set function P assign a probability of 1/52 to each of the 52 possible outcomes. Let C1 denote the collection of the 13 hearts and let C2 denote the collection of the 4 kings. Compute P(C1), P(C2), P(C1 ∩ C2), and P(C1 ∪ C2).
1.24. A coin is to be tossed as many times as is necessary to turn up one head. Thus the elements c of the sample space 𝒞 are H, TH, TTH, TTTH, and so forth. Let the probability set function P assign to these elements the respective probabilities 1/2, 1/4, 1/8, 1/16, and so forth. Show that P(𝒞) = 1. Let C1 = {c; c is H, TH, TTH, TTTH, or TTTTH}. Compute P(C1). Let C2 = {c; c is TTTTH or TTTTTH}. Compute P(C2), P(C1 ∩ C2), and P(C1 ∪ C2).
1.25. If the sample space is 𝒞 = C1 ∪ C2 and if P(C1) = 0.8 and P(C2) = 0.5, find P(C1 ∩ C2).
1.26. Let the sample space be 𝒞 = {c; 0 < c < ∞}. Let C ⊂ 𝒞 be defined by C = {c; 4 < c < ∞} and take P(C) = ∫_C e^(−x) dx. Evaluate P(C), P(C*), and P(C ∪ C*).
1.27. If the sample space is 𝒞 = {c; −∞ < c < ∞} and if C ⊂ 𝒞 is a set for which the integral ∫_C e^(−|x|) dx exists, show that this set function is not a probability set function. What constant could we multiply the integral by to make it a probability set function?
1.28. If C1 and C2 are subsets of the sample space 𝒞, show that
P(C1 ∩ C2) ≤ P(C1) ≤ P(C1 ∪ C2) ≤ P(C1) + P(C2).
1.29. Let C1, C2, and C3 be three mutually disjoint subsets of the sample space 𝒞. Find P[(C1 ∪ C2) ∩ C3] and P(C1* ∪ C2*).
1.30. If C1, C2, and C3 are subsets of 𝒞, show that
P(C1 ∪ C2 ∪ C3) = P(C1) + P(C2) + P(C3) − P(C1 ∩ C2) − P(C1 ∩ C3) − P(C2 ∩ C3) + P(C1 ∩ C2 ∩ C3).
What is the generalization of this result to four or more subsets of 𝒞? Hint. Write P(C1 ∪ C2 ∪ C3) = P[C1 ∪ (C2 ∪ C3)] and use Theorem 5.
1.5 Random Variables
The reader will perceive that a sample space 𝒞 may be tedious to describe if the elements of 𝒞 are not numbers. We shall now discuss how we may formulate a rule, or a set of rules, by which the elements c of 𝒞 may be represented by numbers x or ordered pairs of numbers (x1, x2) or, more generally, ordered n-tuplets of numbers (x1, ..., xn). We begin the discussion with a very simple example. Let the random experiment be the toss of a coin and let the sample space associated with the experiment be 𝒞 = {c; where c is T or c is H}, and T and H represent, respectively, tails and heads. Let X be a function such that X(c) = 0 if c is T and let X(c) = 1 if c is H. Thus X is a real-valued function defined on the sample space 𝒞 which takes us from the sample space 𝒞 to a space of real numbers 𝒜 = {x; x = 0, 1}. We call X a random variable and, in this example, the space associated with X is 𝒜 = {x; x = 0, 1}. We now formulate the definition of a random variable and its space.

Definition 8. Given a random experiment with a sample space 𝒞. A function X, which assigns to each element c ∈ 𝒞 one and only one real number X(c) = x, is called a random variable. The space of X is the set of real numbers 𝒜 = {x; x = X(c), c ∈ 𝒞}.

It may be that the set 𝒞 has elements which are themselves real numbers. In such an instance we could write X(c) = c so that 𝒜 = 𝒞.
Let X be a random variable that is defined on a sample space 𝒞, and let 𝒜 be the space of X. Further, let A be a subset of 𝒜. Just as we used the terminology "the event C," with C ⊂ 𝒞, we shall now speak of "the event A." The probability P(C) of the event C has been defined. We wish now to define the probability of the event A. This probability will be denoted by Pr (X ∈ A), where Pr is an abbreviation of "the probability that." With A a subset of 𝒜, let C be that subset of 𝒞 such that C = {c; c ∈ 𝒞 and X(c) ∈ A}. Thus C has as its elements all outcomes in 𝒞 for which the random variable X has a value that is in A. This prompts us to define, as we now do, Pr (X ∈ A) to be equal to P(C), where C = {c; c ∈ 𝒞 and X(c) ∈ A}. Thus Pr (X ∈ A) is an assignment of probability to a set A, which is a subset of the space 𝒜 associated with the random variable X. This assignment is determined by the probability set function P and the random variable X and is sometimes denoted by P_X(A). That is,

Pr (X ∈ A) = P_X(A) = P(C),

where C = {c; c ∈ 𝒞 and X(c) ∈ A}. Thus a random variable X is a function that carries the probability from a sample space 𝒞 to a space 𝒜 of real numbers. In this sense, with A ⊂ 𝒜, the probability P_X(A) is often called an induced probability.
The function P_X(A) satisfies the conditions (a), (b), and (c) of the definition of a probability set function (Section 1.4). That is, P_X(A) is also a probability set function. Conditions (a) and (c) are easily verified by observing, for an appropriate C, that

P_X(A) = P(C) ≥ 0,

and that 𝒞 = {c; c ∈ 𝒞 and X(c) ∈ 𝒜} requires

P_X(𝒜) = P(𝒞) = 1.

In discussing condition (b), let us restrict our attention to two mutually exclusive events A1 and A2. Here P_X(A1 ∪ A2) = P(C), where C = {c; c ∈ 𝒞 and X(c) ∈ A1 ∪ A2}. However,

C = {c; c ∈ 𝒞 and X(c) ∈ A1} ∪ {c; c ∈ 𝒞 and X(c) ∈ A2},

or, for brevity, C = C1 ∪ C2. But C1 and C2 are disjoint sets. This must be so, for if some c were common, say c*, then X(c*) ∈ A1 and X(c*) ∈ A2. That is, the same number X(c*) belongs to both A1 and A2. This is a contradiction because A1 and A2 are disjoint sets. Accordingly,

P(C) = P(C1) + P(C2).

However, by definition, P(C1) is P_X(A1) and P(C2) is P_X(A2) and thus

P_X(A1 ∪ A2) = P_X(A1) + P_X(A2).

This is condition (b) for two disjoint sets.
Thus each of P_X(A) and P(C) is a probability set function. But the reader should fully recognize that the probability set function P is defined for subsets C of 𝒞, whereas P_X is defined for subsets A of 𝒜, and, in general, they are not the same set function. Nevertheless, they are closely related and some authors even drop the index X and write P(A) for P_X(A). They think it is quite clear that P(A) means the probability of A, a subset of 𝒜, and P(C) means the probability of C, a subset of 𝒞. From this point on, we shall adopt this convention and simply write P(A).
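The way a random variable carries probability from 𝒞 to 𝒜 can be made concrete in a few lines. The sketch below is not part of the original text; it uses the coin-toss random variable defined earlier in this section, with X(T) = 0 and X(H) = 1, and assumes, for illustration, that each of the two outcomes has probability 1/2. It computes the induced probability P_X(A) = P({c; X(c) ∈ A}).

    from fractions import Fraction

    P = {'T': Fraction(1, 2), 'H': Fraction(1, 2)}   # an assumed probability set function on the sample space
    X = {'T': 0, 'H': 1}                             # the random variable X(c)

    def P_X(A):
        # the induced probability: P of the subset C = {c; X(c) is in A}
        return sum(P[c] for c in P if X[c] in A)

    print(P_X({1}))        # Pr (X = 1) = 1/2
    print(P_X({0, 1}))     # Pr (X ∈ 𝒜) = 1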
Perhaps an additional example will be helpful. Let a coin be tossed twice and let our interest be in the number of heads to be observed. Thus the sample space is 𝒞 = {c; where c is TT or TH or HT or HH}. Let X(c) = 0 if c is TT; let X(c) = 1 if c is either TH or HT; and let X(c) = 2 if c is HH. Thus the space of the random variable X is 𝒜 = {x; x = 0, 1, 2}. Consider the subset A of the space 𝒜, where A = {x; x = 1}. How is the probability of the event A defined? We take the subset C of 𝒞 to have as its elements all outcomes in 𝒞 for which the random variable X has a value that is an element of A. Because X(c) = 1 if c is either TH or HT, then C = {c; where c is TH or HT}. Thus P(A) = Pr (X ∈ A) = P(C). Since A = {x; x = 1}, then P(A) = Pr (X ∈ A) can be written more simply as Pr (X = 1). Let C1 = {c; c is TT}, C2 = {c; c is TH}, C3 = {c; c is HT}, and C4 = {c; c is HH} denote subsets of 𝒞. Suppose that our probability set function P(C) assigns a probability of 1/4 to each of the sets Ci, i = 1, 2, 3, 4. Then P(C1) = 1/4, P(C2 ∪ C3) = 1/4 + 1/4 = 1/2, and P(C4) = 1/4. Let us now point out how much simpler it is to couch these statements in a language that involves the random variable X. Because X is the number of heads to be observed in tossing a coin two times, we have

Pr (X = 0) = 1/4, since P(C1) = 1/4;
Pr (X = 1) = 1/2, since P(C2 ∪ C3) = 1/2;
and
Pr (X = 2) = 1/4, since P(C4) = 1/4.

This may be further condensed in the following table:

x             0     1     2
Pr (X = x)   1/4   1/2   1/4

This table depicts the distribution of probability over the elements of 𝒜, the space of the random variable X.
We shall now discuss two random variables. Again, we start with an example. A coin is to be tossed three times and our interest is in the ordered number pair (number of H's on first two tosses, number of H's on all three tosses). Thus the sample space is 𝒞 = {c; c = ci, i = 1, 2, ..., 8}, where c1 is TTT, c2 is TTH, c3 is THT, c4 is HTT, c5 is THH, c6 is HTH, c7 is HHT, and c8 is HHH. Let X1 and X2 be two functions such that X1(c1) = X1(c2) = 0, X1(c3) = X1(c4) = X1(c5) = X1(c6) = 1, X1(c7) = X1(c8) = 2; and X2(c1) = 0, X2(c2) = X2(c3) = X2(c4) = 1, X2(c5) = X2(c6) = X2(c7) = 2, X2(c8) = 3. Thus X1 and X2 are real-valued functions defined on the sample space 𝒞, which take us from that sample space to the space of ordered number pairs

𝒜 = {(x1, x2); (x1, x2) = (0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)}.

Thus X1 and X2 are two random variables defined on the space 𝒞, and, in this example, the space of these random variables is the two-dimensional set 𝒜 given immediately above. We now formulate the definition of the space of two random variables.

Definition 9. Given a random experiment with a sample space 𝒞. Consider two random variables X1 and X2, which assign to each element c of 𝒞 one and only one ordered pair of numbers X1(c) = x1, X2(c) = x2. The space of X1 and X2 is the set of ordered pairs 𝒜 = {(x1, x2); x1 = X1(c), x2 = X2(c), c ∈ 𝒞}.

Let 𝒜 be the space associated with the two random variables X1 and X2 and let A be a subset of 𝒜. As in the case of one random variable, we shall speak of the event A. We wish to define the probability of the event A, which we denote by Pr [(X1, X2) ∈ A]. Take C = {c; c ∈ 𝒞 and [X1(c), X2(c)] ∈ A}, where 𝒞 is the sample space. We then define Pr [(X1, X2) ∈ A] = P(C), where P is the probability set function defined for subsets C of 𝒞. Here again we could denote Pr [(X1, X2) ∈ A] by the probability set function P_{X1,X2}(A); but, with our previous convention, we simply write

P(A) = Pr [(X1, X2) ∈ A].

Again it is important to observe that this function is a probability set function defined for subsets A of the space 𝒜.
Let us return to the example in our discussion of two random variables. Consider the subset A of 𝒜, where A = {(x1, x2); (x1, x2) = (1, 1), (1, 2)}. To compute Pr [(X1, X2) ∈ A] = P(A), we must include as elements of C all outcomes in 𝒞 for which the random variables X1 and X2 take values (x1, x2) which are elements of A. Now X1(c3) = 1, X2(c3) = 1, X1(c4) = 1, and X2(c4) = 1. Also, X1(c5) = 1, X2(c5) = 2, X1(c6) = 1, and X2(c6) = 2. Thus P(A) = Pr [(X1, X2) ∈ A] = P(C), where C = {c; c = c3, c4, c5, or c6}. Suppose that our probability set function P(C) assigns a probability of 1/8 to each of the eight elements of 𝒞. Then P(A), which can be written as Pr (X1 = 1, X2 = 1 or 2), is equal to 4/8 = 1/2. It is left for the reader to show that we can tabulate the probability, which is then assigned to each of the elements of 𝒜, with the following result:

(x1, x2)                    (0, 0)  (0, 1)  (1, 1)  (1, 2)  (2, 2)  (2, 3)
Pr [(X1, X2) = (x1, x2)]     1/8     1/8     2/8     2/8     1/8     1/8
This table depicts the distribution of probability over the elements of 𝒜, the space of the random variables X1 and X2.
The preceding notions about one and two random variables can be immediately extended to n random variables. We make the following definition of the space of n random variables.

Definition 10. Given a random experiment with the sample space 𝒞. Let the random variable Xi assign to each element c ∈ 𝒞 one and only one real number Xi(c) = xi, i = 1, 2, ..., n. The space of these random variables is the set of ordered n-tuplets 𝒜 = {(x1, x2, ..., xn); x1 = X1(c), ..., xn = Xn(c), c ∈ 𝒞}. Further, let A be a subset of 𝒜. Then Pr [(X1, ..., Xn) ∈ A] = P(C), where C = {c; c ∈ 𝒞 and [X1(c), X2(c), ..., Xn(c)] ∈ A}.

Again we should make the comment that Pr [(X1, ..., Xn) ∈ A] could be denoted by the probability set function P_{X1,...,Xn}(A). But, if there is no chance of misunderstanding, it will be written simply as P(A).
Up to this point, our illustrative examples have dealt with a sample space 𝒞 that contains a finite number of elements. We now give an example of a sample space 𝒞 that is an interval.

Example 1. Let the outcome of a random experiment be a point on the interval (0, 1). Thus, 𝒞 = {c; 0 < c < 1}. Let the probability set function be given by

P(C) = ∫_C dz.

For instance, if C = {c; 1/4 < c < 1/2}, then

P(C) = ∫ from 1/4 to 1/2 of dz = 1/4.

Define the random variable X to be X = X(c) = 3c + 2. Accordingly, the space of X is 𝒜 = {x; 2 < x < 5}. We wish to determine the probability set function of X, namely P(A), A ⊂ 𝒜. At this time, let A be the set {x; 2 < x < b}, where 2 < b < 5. Now X(c) is between 2 and b when and only when c ∈ C = {c; 0 < c < (b − 2)/3}. Hence

P_X(A) = P(A) = P(C) = ∫ from 0 to (b − 2)/3 of dz.

In the integral, make the change of variable x = 3z + 2 and obtain

P_X(A) = P(A) = ∫_2^b (1/3) dx = ∫_A (1/3) dx,

since A = {x; 2 < x < b}. This kind of argument holds for every set A ⊂ 𝒜 for which the integral exists. Thus the probability set function of X is this integral.
In statistics we are usually more interested in the probability set function of the random variable X than we are in the sample space 𝒞 and the probability set function P(C). Therefore, in most instances, we begin with an assumed distribution of probability for the random variable X. Moreover, we do this same kind of thing with two or more random variables. Two illustrative examples follow.

Example 2. Let the probability set function P(A) of a random variable X be

P(A) = ∫_A f(x) dx, where f(x) = 3x²/8, x ∈ 𝒜 = {x; 0 < x < 2}.

Let A1 = {x; 0 < x < 1/2} and A2 = {x; 1 < x < 2} be two subsets of 𝒜. Then

P(A1) = Pr (X ∈ A1) = ∫_A1 f(x) dx = ∫ from 0 to 1/2 of (3x²/8) dx = 1/64

and

P(A2) = Pr (X ∈ A2) = ∫_A2 f(x) dx = ∫_1^2 (3x²/8) dx = 7/8.

To compute P(A1 ∪ A2), we note that A1 ∩ A2 = ∅; then we have

P(A1 ∪ A2) = P(A1) + P(A2) = 57/64.

Example 3. Let 𝒜 = {(x, y); 0 < x < y < 1} be the space of two random variables X and Y. Let the probability set function be

P(A) = ∫∫_A 2 dx dy.

If A is taken to be A1 = {(x, y); 1/2 < x < y < 1}, then

P(A1) = Pr [(X, Y) ∈ A1] = ∫ from 1/2 to 1 ∫ from 1/2 to y of 2 dx dy = 1/4.

If A is taken to be A2 = {(x, y); x < y < 1, 0 < x ≤ 1/2}, then A2 = A1*, and

P(A2) = Pr [(X, Y) ∈ A2] = P(A1*) = 1 − P(A1) = 3/4.
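The probabilities found in Examples 2 and 3 can be confirmed numerically. The sketch below is not part of the original text; it approximates the integrals by simple midpoint Riemann sums, and the printed values should be close to 1/64, 7/8, and 1/4.

    n = 1000                                    # subintervals for the one-dimensional sums
    f = lambda x: 3.0 * x * x / 8.0             # the integrand of Example 2

    PA1 = sum(f((i + 0.5) * 0.5 / n) * 0.5 / n for i in range(n))          # integral over 0 < x < 1/2
    PA2 = sum(f(1.0 + (i + 0.5) * 1.0 / n) * 1.0 / n for i in range(n))    # integral over 1 < x < 2
    print(PA1, PA2)                             # about 0.015625 (= 1/64) and 0.875 (= 7/8)

    m = 400                                     # grid size for the double integral of Example 3
    PA1_xy = 0.0
    for i in range(m):
        x = (i + 0.5) / m
        for j in range(m):
            y = (j + 0.5) / m
            if 0.5 < x < y < 1.0:               # the region A1 = {(x, y); 1/2 < x < y < 1}
                PA1_xy += 2.0 / (m * m)         # the integrand 2 times the area of one grid cell
    print(PA1_xy)                               # about 0.25 (= 1/4)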
EXERCISES
1.31. Let a card be selected from an ordinary deck of playing cards. The outcome c is one of these 52 cards. Let X(c) = 4 if c is an ace, let X(c) = 3 if c is a king, let X(c) = 2 if c is a queen, let X(c) = 1 if c is a jack, and let X(c) = 0 otherwise. Suppose that P(C) assigns a probability of 1/52 to each outcome c. Describe the induced probability P_X(A) on the space 𝒜 = {x; x = 0, 1, 2, 3, 4} of the random variable X.
1.32. Let a point be selected from the sample space 𝒞 = {c; 0 < c < 10}. Let C ⊂ 𝒞 and let the probability set function be P(C) = ∫_C (1/10) dz. Define the random variable X to be X = X(c) = 2c − 10. Find the probability set function of X. Hint. If −10 < a < b < 10, note that a < X(c) < b when and only when (a + 10)/2 < c < (b + 10)/2.
1.33. Let the probability set function P(A) of two random variables X and Y be P(A) = Σ Σ_A f(x, y), where f(x, y) = 1/52, (x, y) ∈ 𝒜 = {(x, y); (x, y) = (0, 1), (0, 2), ..., (0, 13), (1, 1), ..., (1, 13), ..., (3, 13)}. Compute P(A) = Pr [(X, Y) ∈ A]: (a) when A = {(x, y); (x, y) = (0, 4), (1, 3), (2, 2)}; (b) when A = {(x, y); x + y = 4, (x, y) ∈ 𝒜}.
1.34. Let the probability set function P(A) of the random variable X be P(A) = ∫_A f(x) dx, where f(x) = 2x/9, x ∈ 𝒜 = {x; 0 < x < 3}. Let A1 = {x; 0 < x < 1}, A2 = {x; 2 < x < 3}. Compute P(A1) = Pr (X ∈ A1), P(A2) = Pr (X ∈ A2), and P(A1 ∪ A2) = Pr (X ∈ A1 ∪ A2).
1.35. Let the space of the random variable X be d = {x; 0 < x < I}.
HAl = {x; 0 < x < -H and A 2 = {x; 1: ~ x < I}, find P(A 2) if peAl) =-!:-
1.36. Let the space of the random variable X be d = {x; 0 < x < lO}
and let peAl) = -i, where Al = {x; 1 < x < 5}. Show that P(A 2) ~ i,
where A 2 = {x; 5 ~ x < lO}.
1.37. Let the subsets Al = {x; t < x < !} and A 2 = {x; 1- ~ x < I} of
the space d = {x; 0 < x < I} of the random variable X be such that
peAl) = t and P(A 2 ) = l Find peAl U A 2 ) , peA!), and peA! n A;).
1.38. Let Al = {(x, y); x ~ 2, y ~ 4}, A 2 = {(x, y); x ~ 2, y ~ I}, Aa =
{(x, y); x ~ 0, y ~ 4}, and A 4 = {(x, y); x ~ 0, y ~ I} be subsets of the
space d of two random variables X and Y, which is the entire two-dimen-
sional plane. If peAl) = i, P(A 2 ) = t P(Aa) = i, and P(A 4 ) = j, find
peAs), where As = {(x, y); 0 < x ~ 2, 1 < y ~ 4}.
1.39. Given ∫_A [1/π(1 + x²)] dx, where A ⊂ 𝒜 = {x; −∞ < x < ∞}. Show that the integral could serve as a probability set function of a random variable X whose space is 𝒜.
1.40. Let the probability set function of the random variable X be
P(A) = ∫_A e^(−x) dx, where 𝒜 = {x; 0 < x < ∞}.
Let Ak = {x; 2 − 1/k < x ≤ 3}, k = 1, 2, 3, .... Find lim (k→∞) Ak and P(lim (k→∞) Ak). Find P(Ak) and lim (k→∞) P(Ak). Note that lim (k→∞) P(Ak) = P(lim (k→∞) Ak).
1.6 The Probability Density Function
Let X denote a random variable with space 𝒜 and let A be a subset of 𝒜. If we know how to compute P(C), C ⊂ 𝒞, then for each A under consideration we can compute P(A) = Pr (X ∈ A); that is, we know how the probability is distributed over the various subsets of 𝒜. In this sense, we speak of the distribution of the random variable X, meaning, of course, the distribution of probability. Moreover, we can use this convenient terminology when more than one random variable is involved and, in the sequel, we shall do this.
In this section, we shall investigate some random variables whose distributions can be described very simply by what will be called the probability density function. The two types of distributions that we shall consider are called, respectively, the discrete type and the continuous type. For simplicity of presentation, we first consider a distribution of one random variable.
(a) The discrete type of random variable. Let X denote a random variable with one-dimensional space 𝒜. Suppose that the space 𝒜 is a set of points such that there is at most a finite number of points of 𝒜 in every finite interval. Such a set 𝒜 will be called a set of discrete points. Let a function f(x) be such that f(x) > 0, x ∈ 𝒜, and that

Σ_𝒜 f(x) = 1.

Whenever a probability set function P(A), A ⊂ 𝒜, can be expressed in terms of such an f(x) by

P(A) = Pr (X ∈ A) = Σ_A f(x),

then X is called a random variable of the discrete type, and X is said to have a distribution of the discrete type.
Example 1. Let X be a random variable of the discrete type with space 𝒜 = {x; x = 0, 1, 2, 3, 4}. Let

P(A) = Σ_A f(x),

where

f(x) = [4!/(x!(4 − x)!)] (1/2)^4, x ∈ 𝒜,

and, as usual, 0! = 1. Then if A = {x; x = 0, 1}, we have

Pr (X ∈ A) = [4!/(0! 4!)] (1/2)^4 + [4!/(1! 3!)] (1/2)^4 = 5/16.
Example 2. Let X be a random variable of the discrete type with space 𝒜 = {x; x = 1, 2, 3, ...}, and let

f(x) = (1/2)^x, x ∈ 𝒜.

Then

Pr (X ∈ A) = Σ_A f(x).

If A = {x; x = 1, 3, 5, 7, ...}, we have

Pr (X ∈ A) = (1/2) + (1/2)³ + (1/2)⁵ + ... = 2/3.
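Both discrete examples can be checked by direct summation. The sketch below is not part of the original text; it uses Fraction arithmetic for Example 1 and truncates the infinite series of Example 2 after enough terms to show the limit.

    from fractions import Fraction
    from math import comb

    # Example 1: f(x) = [4!/(x!(4 - x)!)](1/2)^4 = C(4, x)/16 on the space {0, 1, 2, 3, 4}.
    f1 = {x: Fraction(comb(4, x), 16) for x in range(5)}
    print(sum(f1.values()))                 # 1, so f1 sums to one over the space
    print(f1[0] + f1[1])                    # Pr (X ∈ {0, 1}) = 5/16

    # Example 2: f(x) = (1/2)^x on the space {1, 2, 3, ...}; sum the odd-numbered terms.
    odd_sum = sum(Fraction(1, 2 ** x) for x in range(1, 60, 2))
    print(float(odd_sum))                   # approaches 2/3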
(b) The continuous type of random variable. Let the one-dimensional set 𝒜 be such that the Riemann integral

∫_𝒜 f(x) dx = 1,

where (1) f(x) > 0, x ∈ 𝒜, and (2) f(x) has at most a finite number of discontinuities in every finite interval that is a subset of 𝒜. If 𝒜 is the space of the random variable X and if the probability set function P(A), A ⊂ 𝒜, can be expressed in terms of such an f(x) by

P(A) = Pr (X ∈ A) = ∫_A f(x) dx,

then X is said to be a random variable of the continuous type and to have a distribution of that type.
Example 3. Let the space 𝒜 = {x; 0 < x < ∞}, and let

f(x) = e^(−x), x ∈ 𝒜.

If X is a random variable of the continuous type so that

Pr (X ∈ A) = ∫_A e^(−x) dx,

we have, with A = {x; 0 < x < 1},

Pr (X ∈ A) = ∫_0^1 e^(−x) dx = 1 − e^(−1).

Note that Pr (X ∈ A) is the area under the graph of f(x) = e^(−x), which lies above the x-axis and between the vertical lines x = 0 and x = 1.
Example 4. Let X be a random variable of the continuous type with space 𝒜 = {x; 0 < x < 1}. Let the probability set function be
    P(A) = ∫_A f(x) dx,
where
    f(x) = cx², x ∈ 𝒜.
Since P(A) is a probability set function, P(𝒜) = 1. Hence the constant c is determined by
    ∫_0^1 cx² dx = 1,
or c = 3.
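The normalization step of Example 4 can also be done numerically. The following sketch is our own illustration (scipy is our choice of tool), assuming the same integrand x² on (0, 1).

```python
# Pick c so that the density c*x**2 integrates to one over (0, 1).
from scipy.integrate import quad

integral, _ = quad(lambda x: x**2, 0.0, 1.0)   # integral of x^2 over (0, 1) = 1/3
c = 1.0 / integral                             # so c = 3
print(c)                                       # 3.0 (up to floating-point error)
```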
It is seen that whether the random variable X is of the discrete type or of the continuous type, the probability Pr(X ∈ A) is completely determined by a function f(x). In either case f(x) is called the probability density function (hereafter abbreviated p.d.f.) of the random variable X.
If we restrict ourselves to random variables of either the discrete type
or the continuous type, we may work exclusively with the p.d.f. f(x).
This affords an enormous simplification; but it should be recognized
that this simplification is obtained at considerable cost from a mathe-
matical point of view. Not only shall we exclude from consideration
many random variables that do not have these types of distributions,
but we shall also exclude many interesting subsets of the space. In this
book, however, we shall in general restrict ourselves to these simple
types of random variables.
Remarks. Let X denote the number of spots that show when a die is cast. We can assume that X is a random variable with 𝒜 = {x; x = 1, 2, ..., 6} and with a p.d.f. f(x) = 1/6, x ∈ 𝒜. Other assumptions can be made to provide different mathematical models for this experiment. Experimental evidence can be used to help one decide which model is the more realistic. Next, let X denote the point at which a balanced pointer comes to rest. If the circumference is graduated 0 ≤ x < 1, a reasonable mathematical model for this experiment is to take X to be a random variable with 𝒜 = {x; 0 ≤ x < 1} and with a p.d.f. f(x) = 1, x ∈ 𝒜.
Both types of probability density functions can be used as distributional models for many random variables found in real situations. For illustrations consider the following. If X is the number of automobile accidents during a given day, then f(0), f(1), f(2), ... represent the probabilities of 0, 1, 2, ... accidents. On the other hand, if X is the length of life of a female born in a certain community, the integral [area under the graph of f(x) that lies above the x-axis and between the vertical lines x = 40 and x = 50]
    ∫_{40}^{50} f(x) dx
represents the probability that she dies between 40 and 50 (or the percentage
of these females dying between 40 and 50). A particular f(x) will be suggested later for each of these situations, but again experimental evidence must be used to decide whether we have realistic models.

The notion of the p.d.f. of one random variable X can be extended to the notion of the p.d.f. of two or more random variables. Under certain restrictions on the space 𝒜 and the function f > 0 on 𝒜 (restrictions that will not be enumerated here), we say that the two random variables X and Y are of the discrete type or of the continuous type, and have a distribution of that type, according as the probability set function P(A), A ⊂ 𝒜, can be expressed as
    P(A) = Pr[(X, Y) ∈ A] = Σ Σ_A f(x, y),
or as
    P(A) = Pr[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy.
In either case f is called the p.d.f. of the two random variables X and Y. Of necessity, P(𝒜) = 1 in each case. More generally, we say that the n random variables X₁, X₂, ..., Xₙ are of the discrete type or of the continuous type, and have a distribution of that type, according as the probability set function P(A), A ⊂ 𝒜, can be expressed as
    P(A) = Pr[(X₁, ..., Xₙ) ∈ A] = Σ ⋯ Σ_A f(x₁, ..., xₙ),
or as
    P(A) = Pr[(X₁, ..., Xₙ) ∈ A] = ∫⋯∫_A f(x₁, ..., xₙ) dx₁ ⋯ dxₙ.
The idea to be emphasized is that a function f, whether in one or more variables, essentially satisfies the conditions of being a p.d.f. if f > 0 on a space 𝒜 and if its integral [for the continuous type of random variable(s)] or its sum [for the discrete type of random variable(s)] over 𝒜 is one.

Our notation can be considerably simplified when we restrict ourselves to random variables of the continuous or discrete types. Suppose that the space of a continuous type of random variable X is 𝒜 = {x; 0 < x < ∞} and that the p.d.f. of X is e^{−x}, x ∈ 𝒜. We shall in no manner alter the distribution of X [that is, alter any P(A), A ⊂ 𝒜] if we extend the definition of the p.d.f. of X by writing
    f(x) = e^{−x}, 0 < x < ∞,
         = 0 elsewhere,
and then refer to f(x) as the p.d.f. of X. We have
    ∫_{−∞}^{∞} f(x) dx = ∫_{−∞}^{0} 0 dx + ∫_0^{∞} e^{−x} dx = 1.
Thus we may treat the entire axis of reals as though it were the space of X. Accordingly, we now replace
    ∫_𝒜 f(x) dx   by   ∫_{−∞}^{∞} f(x) dx.
Similarly, we may extend the definition of a p.d.f. f(x, y) over the entire xy-plane, or a p.d.f. f(x, y, z) throughout three-dimensional space, and so on. We shall do this consistently so that tedious, repetitious references to the space 𝒜 can be avoided. Once this is done, we replace
    ∫∫_𝒜 f(x, y) dx dy   by   ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy,
and so on. Similarly, after extending the definition of a p.d.f. of the discrete type, we replace, for one random variable,
    Σ_𝒜 f(x)   by   Σ_x f(x),
and, for two random variables,
    Σ Σ_𝒜 f(x, y)   by   Σ_y Σ_x f(x, y),
and so on.

In accordance with this convention (of extending the definition of a p.d.f.), it is seen that a point function f, whether in one or more variables, essentially satisfies the conditions of being a p.d.f. if (a) f is defined and is not negative for all real values of its argument(s) and if (b) its integral [for the continuous type of random variable(s)], or its sum [for the discrete type of random variable(s)] over all real values of its argument(s) is 1.

If f(x) is the p.d.f. of a continuous type of random variable X and if A is the set {x; a < x < b}, then P(A) = Pr(X ∈ A) can be written as
    Pr(a < X < b) = ∫_a^b f(x) dx.
Moreover, if A = {x; x = a}, then
    P(A) = Pr(X ∈ A) = Pr(X = a) = ∫_a^a f(x) dx = 0,
since the integral ∫_a^a f(x) dx is defined in calculus to be zero. That is, if X is a random variable of the continuous type, the probability of every set consisting of a single point is zero. This fact enables us to write, say,
    Pr(a < X < b) = Pr(a ≤ X ≤ b).
More important, this fact allows us to change the value of the p.d.f. of a continuous type of random variable X at a single point without altering the distribution of X. For instance, the p.d.f.
    f(x) = e^{−x}, 0 < x < ∞,
         = 0 elsewhere,
can be written as
    f(x) = e^{−x}, 0 ≤ x < ∞,
         = 0 elsewhere,
without changing any P(A). We observe that these two functions differ only at x = 0 and Pr(X = 0) = 0. More generally, if two probability density functions of random variables of the continuous type differ only on a set having probability zero, the two corresponding probability set functions are exactly the same. Unlike the continuous type, the p.d.f. of a discrete type of random variable may not be changed at any point, since a change in such a p.d.f. alters the distribution of probability.

Finally, if a p.d.f. in one or more variables is explicitly defined, we can see by inspection whether the random variables are of the continuous or discrete type. For example, it seems obvious that the p.d.f.
    f(x, y) = 9/4^{x+y}, x = 1, 2, 3, ..., y = 1, 2, 3, ...,
            = 0 elsewhere,
is a p.d.f. of two discrete-type random variables X and Y, whereas the p.d.f.
    f(x, y) = 4xy e^{−x²−y²}, 0 < x < ∞, 0 < y < ∞,
            = 0 elsewhere,
is clearly a p.d.f. of two continuous-type random variables X and Y. In such cases it seems unnecessary to specify which of the two simpler types of random variables is under consideration.

Example 5. Let the random variable X have the p.d.f.
    f(x) = 2x, 0 < x < 1,
         = 0 elsewhere.
Find Pr(1/2 < X < 3/4) and Pr(−1/2 < X < 1/2). First,
    Pr(1/2 < X < 3/4) = ∫_{1/2}^{3/4} f(x) dx = ∫_{1/2}^{3/4} 2x dx = 5/16.
Next,
    Pr(−1/2 < X < 1/2) = ∫_{−1/2}^{1/2} f(x) dx
                       = ∫_{−1/2}^{0} 0 dx + ∫_{0}^{1/2} 2x dx
                       = 0 + 1/4 = 1/4.

Example 6. Let f(x, y) = 6x²y, 0 < x < 1, 0 < y < 1, zero elsewhere, be the p.d.f. of two random variables X and Y. We have, for instance,
    Pr(0 < X < 3/4, 1/3 < Y < 2) = ∫_{1/3}^{2} ∫_{0}^{3/4} f(x, y) dx dy
                                 = ∫_{1/3}^{1} ∫_{0}^{3/4} 6x²y dx dy + ∫_{1}^{2} ∫_{0}^{3/4} 0 dx dy
                                 = 3/8 + 0 = 3/8.
Note that this probability is the volume under the surface f(x, y) = 6x²y and above the rectangular set {(x, y); 0 < x < 3/4, 1/3 < y < 1} in the xy-plane.
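The double integral of Example 6 is a convenient place for a numerical cross-check. The sketch below is ours (scipy is our choice of tool); it restricts the y-range to (1/3, 1) because the density vanishes for y > 1.

```python
# Numerical check of Example 6: integrate 6*x^2*y over 0 < x < 3/4, 1/3 < y < 1.
from scipy.integrate import dblquad

value, _ = dblquad(lambda y, x: 6 * x**2 * y,   # integrand f(x, y) = 6 x^2 y
                   0, 0.75,                     # x-limits
                   lambda x: 1/3, lambda x: 1)  # y-limits
print(value)                                    # approximately 0.375 = 3/8
```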
EXERCISES

1.41. For each of the following, find the constant c so that f(x) satisfies the conditions of being a p.d.f. of one random variable X.
(a) f(x) = c(2/3)^x, x = 1, 2, 3, ..., zero elsewhere.
(b) f(x) = cxe^{−x}, 0 < x < ∞, zero elsewhere.
1.42. Let f(x) = x/15, x = 1, 2, 3, 4, 5, zero elsewhere, be the p.d.f. of X. Find Pr(X = 1 or 2), Pr(1/2 < X < 5/2), and Pr(1 ≤ X ≤ 2).
1.43. For each of the following probability density functions of X, compute Pr(|X| < 1) and Pr(X² < 9).
(a) f(x) = x²/18, −3 < x < 3, zero elsewhere.
(b) f(x) = (x + 2)/18, −2 < x < 4, zero elsewhere.
1.44. Let f(x) = 1/x², 1 < x < ∞, zero elsewhere, be the p.d.f. of X. If A₁ = {x; 1 < x < 2} and A₂ = {x; 4 < x < 5}, find P(A₁ ∪ A₂) and P(A₁ ∩ A₂).
1.45. Let f(x₁, x₂) = 4x₁x₂, 0 < x₁ < 1, 0 < x₂ < 1, zero elsewhere, be the p.d.f. of X₁ and X₂. Find Pr(0 < X₁ < 1/2, 1/4 < X₂ < 1), Pr(X₁ = X₂), Pr(X₁ < X₂), and Pr(X₁ ≤ X₂). Hint. Recall that Pr(X₁ = X₂) would be the volume under the surface f(x₁, x₂) = 4x₁x₂ and above the line segment 0 < x₁ = x₂ < 1 in the x₁x₂-plane.
1.46. Let f(x₁, x₂, x₃) = exp[−(x₁ + x₂ + x₃)], 0 < x₁ < ∞, 0 < x₂ < ∞, 0 < x₃ < ∞, zero elsewhere, be the p.d.f. of X₁, X₂, X₃. Compute Pr(X₁ < X₂ < X₃) and Pr(X₁ = X₂ < X₃). The symbol exp(w) means e^w.
1.47. A mode of a distribution of one random variable X of the continuous or discrete type is a value of x that maximizes the p.d.f. f(x). If there is only one such x, it is called the mode of the distribution. Find the mode of each of the following distributions:
(a) f(x) = (1/2)^x, x = 1, 2, 3, ..., zero elsewhere.
(b) f(x) = 12x²(1 − x), 0 < x < 1, zero elsewhere.
(c) f(x) = (1/2)x²e^{−x}, 0 < x < ∞, zero elsewhere.
1.48. A median of a distribution of one random variable X of the discrete or continuous type is a value of x such that Pr(X < x) ≤ 1/2 and Pr(X ≤ x) ≥ 1/2. If there is only one such x, it is called the median of the distribution. Find the median of each of the following distributions:
(a) f(x) = [4!/(x!(4 − x)!)](1/4)^x (3/4)^{4−x}, x = 0, 1, 2, 3, 4, zero elsewhere.
(b) f(x) = 3x², 0 < x < 1, zero elsewhere.
(c) f(x) = 1/[π(1 + x²)], −∞ < x < ∞.
Hint. In parts (b) and (c), Pr(X < x) = Pr(X ≤ x) and thus that common value must equal 1/2 if x is to be the median of the distribution.
1.49. Let 0 < p < 1. A (100p)th percentile (quantile of order p) of the distribution of a random variable X is a value ξ_p such that Pr(X < ξ_p) ≤ p and Pr(X ≤ ξ_p) ≥ p. Find the twentieth percentile of the distribution that has p.d.f. f(x) = 4x³, 0 < x < 1, zero elsewhere. Hint. With a continuous-type random variable X, Pr(X < ξ_p) = Pr(X ≤ ξ_p) and hence that common value must equal p.
1.50. Show that
    ∫_0^∞ e^{−x} dx = 1,
and, for k ≥ 1, that (by integrating by parts)
    ∫_0^∞ x^k e^{−x} dx = k ∫_0^∞ x^{k−1} e^{−x} dx.
(a) What is the value of ∫_0^∞ x^n e^{−x} dx, where n is a nonnegative integer?
(b) Formulate a reasonable definition of the now meaningless symbol 0!.
(c) For what value of the constant c does the function f(x) = cx^n e^{−x}, 0 < x < ∞, zero elsewhere, satisfy the properties of a p.d.f.?
1.51. Given that the nonnegative function g(x) has the property that
    ∫_0^∞ g(x) dx = 1.
Show that f(x₁, x₂) = [2g(√(x₁² + x₂²))]/[π√(x₁² + x₂²)], 0 < x₁ < ∞, 0 < x₂ < ∞, zero elsewhere, satisfies the conditions of being a p.d.f. of two continuous-type random variables X₁ and X₂. Hint. Use polar coordinates.

1.7 The Distribution Function

Let the random variable X have the probability set function P(A), where A is a one-dimensional set. Take x to be a real number and consider the set A which is an unbounded set from −∞ to x, including the point x itself. For all such sets A we have P(A) = Pr(X ∈ A) = Pr(X ≤ x). This probability depends on the point x; that is, this probability is a function of the point x. This point function is denoted by the symbol F(x) = Pr(X ≤ x). The function F(x) is called the distribution function (sometimes, cumulative distribution function) of the random variable X. Since F(x) = Pr(X ≤ x), then, with f(x) the p.d.f., we have
    F(x) = Σ_{w ≤ x} f(w),
for the discrete type of random variable, and
    F(x) = ∫_{−∞}^{x} f(w) dw,
for the continuous type of random variable. We speak of a distribution function F(x) as being of the continuous or discrete type, depending on whether the random variable is of the continuous or discrete type.

Remark. If X is a random variable of the continuous type, the p.d.f. f(x) has at most a finite number of discontinuities in every finite interval. This means (1) that the distribution function F(x) is everywhere continuous and (2) that the derivative of F(x) with respect to x exists and is equal to f(x) at each point of continuity of f(x). That is, F'(x) = f(x) at each point of continuity of f(x). If the random variable X is of the discrete type, most surely the p.d.f. f(x) is not the derivative of F(x) with respect to x (that is, with respect to Lebesgue measure); but f(x) is the (Radon-Nikodym) derivative of F(x) with respect to a counting measure. A derivative is often called a density. Accordingly, we call these derivatives probability density functions.
Example 1. Let the random variable X of the discrete type have the p.d.f. f(x) = x/6, x = 1, 2, 3, zero elsewhere. The distribution function of X is
    F(x) = 0, x < 1,
         = 1/6, 1 ≤ x < 2,
         = 3/6, 2 ≤ x < 3,
         = 1, 3 ≤ x.
Here, as depicted in Figure 1.3, F(x) is a step function that is constant in every interval not containing 1, 2, or 3, but has steps of heights 1/6, 2/6, and 3/6 at those respective points. It is also seen that F(x) is everywhere continuous to the right.

[Figures 1.3 and 1.4: graphs of F(x) against x for Examples 1 and 2.]

Example 2. Let the random variable X of the continuous type have the p.d.f. f(x) = 2/x³, 1 < x < ∞, zero elsewhere. The distribution function of X is
    F(x) = ∫_{−∞}^{x} 0 dw = 0, x < 1,
         = ∫_1^x (2/w³) dw = 1 − 1/x², 1 ≤ x.
The graph of this distribution function is depicted in Figure 1.4. Here F(x) is a continuous function for all real numbers x; in particular, F(x) is everywhere continuous to the right. Moreover, the derivative of F(x) with respect to x exists at all points except at x = 1. Thus the p.d.f. of X is defined by this derivative except at x = 1. Since the set A = {x; x = 1} is a set of probability measure zero [that is, P(A) = 0], we are free to define the p.d.f. at x = 1 in any manner we please. One way to do this is to write f(x) = 2/x³, 1 < x < ∞, zero elsewhere.

There are several properties of a distribution function F(x) that can be listed as a consequence of the properties of the probability set function. Some of these are the following. In listing these properties, we shall not restrict X to be a random variable of the discrete or continuous type. We shall use the symbols F(∞) and F(−∞) to mean lim_{x→∞} F(x) and lim_{x→−∞} F(x), respectively. In like manner, the symbols {x; x ≤ ∞} and {x; x ≤ −∞} represent, respectively, the limits of the sets {x; x ≤ b} and {x; x ≤ −b} as b → ∞.
(a) 0 ≤ F(x) ≤ 1 because 0 ≤ Pr(X ≤ x) ≤ 1.
(b) F(x) is a nondecreasing function of x. For, if x' < x", then
    {x; x ≤ x"} = {x; x ≤ x'} ∪ {x; x' < x ≤ x"}
and
    Pr(X ≤ x") = Pr(X ≤ x') + Pr(x' < X ≤ x").
That is,
    F(x") − F(x') = Pr(x' < X ≤ x") ≥ 0.
(c) F(∞) = 1 and F(−∞) = 0 because the set {x; x ≤ ∞} is the entire one-dimensional space and the set {x; x ≤ −∞} is the null set.
From the proof of (b), it is observed that, if a < b, then
    Pr(a < X ≤ b) = F(b) − F(a).
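As a quick sanity check of the relation Pr(a < X ≤ b) = F(b) − F(a) for Example 2, the following sketch (ours, using scipy) compares the two computations on one interval.

```python
# Compare F(b) - F(a) with the integral of the p.d.f. of Example 2 over (a, b].
from scipy.integrate import quad

f = lambda x: 2.0 / x**3                          # p.d.f. for x > 1
F = lambda x: 0.0 if x < 1 else 1.0 - 1.0 / x**2  # distribution function found above

a, b = 2.0, 3.0
by_integration, _ = quad(f, a, b)
print(by_integration, F(b) - F(a))   # both equal 1/4 - 1/9 = 5/36, about 0.1389
```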
Suppose that we want to use F(x) to compute the probability Pr (X = b).
To do this, consider, with h > 0,
    lim_{h→0} Pr(b − h < X ≤ b) = lim_{h→0} [F(b) − F(b − h)].
Intuitively, it seems that lim_{h→0} Pr(b − h < X ≤ b) should exist and be equal to Pr(X = b) because, as h tends to zero, the limit of the set {x; b − h < x ≤ b} is the set that contains the single point x = b. The
fact that this limit is Pr (X = b) is a theorem that we accept without
proof. Accordingly, we have
Pr (X = b) = F(b) - F(b-),
where F(b - ) is the left-hand limit of F(x) at x = b. That is, the proba-
bility that X = b is the height of the step that F(x) has at x = b.
Hence, if the distribution function F(x) is continuous at x = b, then
Pr (X = b) = O.
There is a fourth property of F(x) that is now listed.
(d) F(x) is continuous to the right at each point x.
To prove this property, consider, with h > 0,
    lim_{h→0} Pr(a < X ≤ a + h) = lim_{h→0} [F(a + h) − F(a)].
We accept without proof a theorem which states, with h > 0, that
    lim_{h→0} Pr(a < X ≤ a + h) = P(∅) = 0.
Here also, the theorem is intuitively appealing because, as h tends to zero, the limit of the set {x; a < x ≤ a + h} is the null set. Accordingly, we write
    0 = F(a+) − F(a),
where F(a +) is the right-hand limit of F(x) at x = a. Hence F(x) is
continuous to the right at every point x = a.
The preceding discussion may be summarized in the following manner: A distribution function F(x) is a nondecreasing function of x, which is everywhere continuous to the right and has F(−∞) = 0, F(∞) = 1. The probability Pr(a < X ≤ b) is equal to the difference F(b) − F(a). If x is a discontinuity point of F(x), then the probability Pr(X = x) is equal to the jump which the distribution function has at the point x. If x is a continuity point of F(x), then Pr(X = x) = 0.
Let X be a random variable of the continuous type that has p.d.f. f(x), and let A be a set of probability measure zero; that is, P(A) = Pr(X ∈ A) = 0. It has been observed that we may change the definition of f(x) at any point in A without in any way altering the distribution of probability. The freedom to do this with the p.d.f. f(x) of a continuous type of random variable does not extend to the distribution function F(x); for, if F(x) is changed at so much as one point x, the probability Pr(X ≤ x) = F(x) is changed, and we have a different distribution of probability. That is, the distribution function F(x), not the p.d.f. f(x), is really the fundamental concept.
Remark. The definition of the distribution function makes it clear that the probability set function P determines the distribution function F. It is true, although not so obvious, that a probability set function P can be found from a distribution function F. That is, P and F give the same information about the distribution of probability, and which function is used is a matter of convenience.
We now give an illustrative example.
Example 3. Let a distribution function be given by
    F(x) = 0, x < 0,
         = (x + 1)/2, 0 ≤ x < 1,
         = 1, 1 ≤ x.
Then, for instance,
    Pr(−3 < X ≤ 1/2) = F(1/2) − F(−3) = 3/4 − 0 = 3/4
and
    Pr(X = 0) = F(0) − F(0−) = 1/2 − 0 = 1/2.
The graph of F(x) is shown in Figure 1.5. We see that F(x) is not always continuous, nor is it a step function. Accordingly, the corresponding distribution is neither of the continuous type nor of the discrete type. It may be described as a mixture of those types.
We shall now point out an important fact about a function of a random variable. Let X denote a random variable with space 𝒜. Consider the function Y = u(X) of the random variable X. Since X is a function defined on a sample space 𝒞, then Y = u(X) is a composite function defined on 𝒞. That is, Y = u(X) is itself a random variable which has its own space ℬ = {y; y = u(x), x ∈ 𝒜} and its own probability set function. If y ∈ ℬ, the event Y = u(X) ≤ y occurs when, and only when, the event X ∈ A ⊂ 𝒜 occurs, where A = {x; u(x) ≤ y}. That is, the distribution function of Y is
    G(y) = Pr(Y ≤ y) = Pr[u(X) ≤ y] = P(A).
The following example illustrates a method of finding the distribution function and the p.d.f. of a function of a random variable.

Example 4. Let f(x) = 1/2, −1 < x < 1, zero elsewhere, be the p.d.f. of the random variable X. Define the random variable Y by Y = X². We wish to find the p.d.f. of Y. If y ≥ 0, the probability Pr(Y ≤ y) is equivalent to
    Pr(X² ≤ y) = Pr(−√y ≤ X ≤ √y).
Accordingly, the distribution function of Y, G(y) = Pr(Y ≤ y), is given by
    G(y) = 0, y < 0,
         = ∫_{−√y}^{√y} (1/2) dx = √y, 0 ≤ y < 1,
         = 1, 1 ≤ y.
Since Y is a random variable of the continuous type, the p.d.f. of Y is g(y) = G'(y) at all points of continuity of g(y). Thus we may write
    g(y) = 1/(2√y), 0 < y < 1,
         = 0 elsewhere.

Let the random variables X and Y have the probability set function P(A), where A is a two-dimensional set. If A is the unbounded set {(u, v); u ≤ x, v ≤ y}, where x and y are real numbers, we have P(A) = Pr[(X, Y) ∈ A] = Pr(X ≤ x, Y ≤ y). This function of the point (x, y) is called the distribution function of X and Y and is denoted by
    F(x, y) = Pr(X ≤ x, Y ≤ y).
If X and Y are random variables of the continuous type that have p.d.f. f(x, y), then
    F(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(u, v) du dv.
Accordingly, at points of continuity of f(x, y), we have
    ∂²F(x, y)/∂x ∂y = f(x, y).
It is left as an exercise to show, in every case, that
    Pr(a < X ≤ b, c < Y ≤ d) = F(b, d) − F(b, c) − F(a, d) + F(a, c),
for all real constants a < b, c < d.

The distribution function of the n random variables X₁, X₂, ..., Xₙ is the point function
    F(x₁, x₂, ..., xₙ) = Pr(X₁ ≤ x₁, X₂ ≤ x₂, ..., Xₙ ≤ xₙ).
An illustrative example follows.

Example 5. Let f(x, y, z) = e^{−(x+y+z)}, 0 < x, y, z < ∞, zero elsewhere, be the p.d.f. of the random variables X, Y, and Z. Then the distribution function of X, Y, and Z is given by
    F(x, y, z) = Pr(X ≤ x, Y ≤ y, Z ≤ z)
               = ∫_0^z ∫_0^y ∫_0^x e^{−u−v−w} du dv dw
               = (1 − e^{−x})(1 − e^{−y})(1 − e^{−z}), 0 ≤ x, y, z < ∞,
and is equal to zero elsewhere. Incidentally, except for a set of probability measure zero, we have
    ∂³F(x, y, z)/∂x ∂y ∂z = f(x, y, z).
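A numerical illustration of the rectangle identity stated above may be helpful. The sketch below is our own; it uses the bivariate exponential density f(x, y) = e^{−x−y} (a convenient test case of our choosing, which also appears in Exercise 1.59 below), whose distribution function factors as (1 − e^{−x})(1 − e^{−y}) on the first quadrant.

```python
# Check Pr(a < X <= b, c < Y <= d) = F(b,d) - F(b,c) - F(a,d) + F(a,c) on one case.
import math
from scipy.integrate import dblquad

def F(x, y):
    return (1 - math.exp(-x)) * (1 - math.exp(-y)) if x > 0 and y > 0 else 0.0

a, b, c, d = 0.5, 2.0, 1.0, 3.0
lhs, _ = dblquad(lambda y, x: math.exp(-x - y), a, b, lambda x: c, lambda x: d)
rhs = F(b, d) - F(b, c) - F(a, d) + F(a, c)
print(lhs, rhs)   # the two numbers agree, up to quadrature error
```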
EXERCISES

1.52. Let f(x) be the p.d.f. of a random variable X. Find the distribution function F(x) of X and sketch its graph if:
(a) f(x) = 1, x = 0, zero elsewhere.
(b) f(x) = 1/3, x = −1, 0, 1, zero elsewhere.
(c) f(x) = x/15, x = 1, 2, 3, 4, 5, zero elsewhere.
(d) f(x) = 3(1 − x)², 0 < x < 1, zero elsewhere.
(e) f(x) = 1/x², 1 < x < ∞, zero elsewhere.
(f) f(x) = 1/3, 0 < x < 1 or 2 < x < 4, zero elsewhere.
1.53. Find the median of each of the distributions in Exercise 1.52.
1.54. Given the distribution function
    F(x) = 0, x < −1,
         = (x + 2)/4, −1 ≤ x < 1,
         = 1, 1 ≤ x.
Sketch the graph of F(x) and then compute: (a) Pr(−1/2 < X ≤ 1/2); (b) Pr(X = 0); (c) Pr(X = 1); (d) Pr(2 < X ≤ 3).
1.55. Let F(x, y) be the distribution function of X and Y. Show that
    Pr(a < X ≤ b, c < Y ≤ d) = F(b, d) − F(b, c) − F(a, d) + F(a, c),
for all real constants a < b, c < d.
1.56. Let f(x) = 1, 0 < x < 1, zero elsewhere, be the p.d.f. of X. Find the distribution function and the p.d.f. of Y = √X. Hint. Pr(Y ≤ y) = Pr(√X ≤ y) = Pr(X ≤ y²), 0 < y < 1.
1.57. Let f(x) = x/6, x = 1, 2, 3, zero elsewhere, be the p.d.f. of X. Find the distribution function and the p.d.f. of Y = X². Hint. Note that X is a random variable of the discrete type.
1.58. Let f(x) = (4 − x)/16, −2 < x < 2, zero elsewhere, be the p.d.f. of X.
(a) Sketch the distribution function and the p.d.f. of X on the same set of axes.
(b) If Y = |X|, compute Pr(Y ≤ 1).
(c) If Z = X², compute Pr(Z ≤ 1/4).
1.59. Let f(x, y) = e^{−x−y}, 0 < x < ∞, 0 < y < ∞, zero elsewhere, be the p.d.f. of X and Y. If Z = X + Y, compute Pr(Z ≤ 0), Pr(Z ≤ 6), and, more generally, Pr(Z ≤ z), for 0 < z < ∞. What is the p.d.f. of Z?
1.60. Explain why, with h > 0, the two limits lim_{h→0} Pr(b − h < X ≤ b) and lim_{h→0} F(b − h) exist. Hint. Note that Pr(b − h < X ≤ b) is bounded below by zero and F(b − h) is bounded above by both F(b) and 1.
1.61. Show that the function F(x, y) that is equal to 1, provided x + 2y ≥ 1, and that is equal to zero provided x + 2y < 1, cannot be a distribution function of two random variables. Hint. Find four numbers a < b, c < d, so that F(b, d) − F(a, d) − F(b, c) + F(a, c) is less than zero.
1.62. Let F(x) be the distribution function of the random variable X. If m is a number such that F(m) = 1/2, show that m is a median of the distribution.
1.63. Let f(x) = 1/3, −1 < x < 2, zero elsewhere, be the p.d.f. of X. Find the distribution function and the p.d.f. of Y = X². Hint. Consider Pr(X² ≤ y) for two cases: 0 ≤ y < 1 and 1 ≤ y < 4.

1.8 Certain Probability Models

Consider an experiment in which one chooses at random a point from the closed interval [a, b] that is on the real line. Thus the sample space 𝒞 is [a, b]. Let the random variable X be the identity function defined on 𝒞. Thus the space 𝒜 of X is 𝒜 = 𝒞. Suppose that it is reasonable to assume, from the nature of the experiment, that if an interval A is a subset of 𝒜, the probability of the event A is proportional to the length of A. Hence, if A is the interval [a, x], x ≤ b, then
    P(A) = Pr(X ∈ A) = Pr(a ≤ X ≤ x) = c(x − a),
where c is the constant of proportionality.
In the expression above, if we take x = b, we have
    1 = Pr(a ≤ X ≤ b) = c(b − a),
so c = 1/(b − a). Thus we will have an appropriate probability model if we take the distribution function of X, F(x) = Pr(X ≤ x), to be
    F(x) = 0, x < a,
         = (x − a)/(b − a), a ≤ x ≤ b,
         = 1, b < x.
Accordingly, the p.d.f. of X, f(x) = F'(x), may be written
    f(x) = 1/(b − a), a ≤ x ≤ b,
         = 0 elsewhere.
The derivative of F(x) does not exist at x = a nor at x = b; but the set {x; x = a, b} is a set of probability measure zero, and we elect to define f(x) to be equal to 1/(b − a) at those two points, just as a matter of convenience. We observe that this p.d.f. is a constant on 𝒜. If the p.d.f. of one or more variables of the continuous type or of the discrete type is a constant on the space 𝒜, we say that the probability is distributed uniformly over 𝒜. Thus, in the example above, we say that X has a uniform distribution over the interval [a, b].

Consider next an experiment in which one chooses at random a point (X, Y) from the unit square 𝒞 = 𝒜 = {(x, y); 0 < x < 1, 0 < y < 1}. Suppose that our interest is not in X or in Y but in Z = X + Y. Once a suitable probability model has been adopted, we shall see how to find the p.d.f. of Z. To be specific, let the nature of the random experiment be such that it is reasonable to assume that the distribution of probability over the unit square is uniform. Then the p.d.f. of X and Y may be written
    f(x, y) = 1, 0 < x < 1, 0 < y < 1,
            = 0 elsewhere,
and this describes the probability model. Now let the distribution function of Z be denoted by G(z) = Pr(X + Y ≤ z). Then
    G(z) = 0, z < 0,
         = ∫_0^z ∫_0^{z−x} dy dx = z²/2, 0 ≤ z < 1,
         = 1 − ∫_{z−1}^{1} ∫_{z−x}^{1} dy dx = 1 − (2 − z)²/2, 1 ≤ z < 2,
         = 1, 2 ≤ z.
Since G'(z) exists for all values of z, the p.d.f. of Z may then be written
    g(z) = z, 0 < z < 1,
         = 2 − z, 1 ≤ z < 2,
         = 0 elsewhere.
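The triangular p.d.f. just obtained is easy to check by simulation. The following Monte Carlo sketch is ours; the sample size and the interval (0.5, 1.5] are arbitrary choices.

```python
# Empirical check of g(z) for Z = X + Y with (X, Y) uniform on the unit square.
import random

random.seed(0)
n = 200_000
z_values = [random.random() + random.random() for _ in range(n)]

est = sum(1 for z in z_values if 0.5 < z <= 1.5) / n
exact = 0.75   # integral of g(z) over (0.5, 1.5]: 0.375 + 0.375 = 3/4
print(est, exact)
```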
It is clear that a different choice of the p.d.f. f(x, y) that describes the
probability model will, in general, lead to a different p.d.f. of Z.
We wish presently to extend and generalize some of the notions
expressed in the next three sentences. Let the discrete type of random
variable X have a uniform distribution of probability over the k points
of the space 𝒜 = {x; x = 1, 2, ..., k}. The p.d.f. of X is then f(x) = 1/k, x ∈ 𝒜, zero elsewhere. This type of p.d.f. is used to describe the probability model when each of the k points has the same probability, namely, 1/k.
The probability model described in the preceding paragraph will now be adapted to a more general situation. Let a probability set function P(C) be defined on a sample space 𝒞. Here 𝒞 may be a set in one, or two, or more dimensions. Let 𝒞 be partitioned into k mutually disjoint subsets C₁, C₂, ..., C_k in such a way that the union of these k mutually disjoint subsets is the sample space 𝒞. Thus the events C₁, C₂, ..., C_k are mutually exclusive and exhaustive. Suppose that the random experiment is of such a character that it may be assumed that each of the mutually exclusive and exhaustive events Cᵢ, i = 1, 2, ..., k, has the same probability. Necessarily then, P(Cᵢ) = 1/k, i = 1, 2, ..., k. Let the event E be the union of r of these mutually exclusive events, say
    E = C₁ ∪ C₂ ∪ ⋯ ∪ C_r,   r ≤ k.
Then
    P(E) = P(C₁) + P(C₂) + ⋯ + P(C_r) = r/k.
Frequently, the integer k is called the total number of ways (for this particular partition of 𝒞) in which the random experiment can terminate and the integer r is called the number of ways that are favorable to the event E. So, in this terminology, P(E) is equal to the number of ways favorable to the event E divided by the total number of ways in which the experiment can terminate. It should be emphasized that in order to assign, in this manner, the probability r/k to the event E, we must assume that each of the mutually exclusive and exhaustive events C₁, C₂, ..., C_k has the same probability 1/k. This assumption then becomes part of our probability model. Obviously, if this assumption is not realistic in an application, the probability of the event E cannot be computed in this way.
We next present two examples that are illustrative of this model.

Example 1. Let a card be drawn at random from an ordinary deck of 52 playing cards. The sample space 𝒞 is the union of k = 52 outcomes, and it is reasonable to assume that each of these outcomes has the same probability 1/52. Accordingly, if E₁ is the set of outcomes that are spades, P(E₁) = 13/52 = 1/4 because there are r₁ = 13 spades in the deck; that is, 1/4 is the probability of drawing a card that is a spade. If E₂ is the set of outcomes that are kings, P(E₂) = 4/52 = 1/13 because there are r₂ = 4 kings in the deck; that is, 1/13 is the probability of drawing a card that is a king. These computations are very easy because there are no difficulties in the determination of the appropriate values of r and k. However, instead of drawing only one card, suppose that five cards are taken, at random and without replacement, from this deck. We can think of each five-card hand as being an outcome in a sample space. It is reasonable to assume that each of these outcomes has the same probability. Now if E₁ is the set of outcomes in which each card of the hand is a spade, P(E₁) is equal to the number r₁ of all spade hands divided by the total number, say k, of five-card hands. It is shown in many books on algebra that
    r₁ = C(13, 5) = 13!/(5! 8!)   and   k = C(52, 5) = 52!/(5! 47!).
In general, if n is a positive integer and if x is a nonnegative integer with x ≤ n, then the binomial coefficient
    C(n, x) = n!/[x! (n − x)!]
is equal to the number of combinations of n things taken x at a time. Thus, here,
    P(E₁) = C(13, 5)/C(52, 5) = [(13)(12)(11)(10)(9)]/[(52)(51)(50)(49)(48)] = 0.0005,
approximately. Next, let E₂ be the set of outcomes in which at least one card is a spade. Then E₂* is the set of outcomes in which no card is a spade. There are r₂* = C(39, 5) such outcomes. Hence
    P(E₂) = 1 − C(39, 5)/C(52, 5) = 0.78,
approximately. Now suppose that E₃ is the set of outcomes in which exactly three cards are kings and exactly two cards are queens. We can select the three kings in any one of C(4, 3) ways and the two queens in any one of C(4, 2) ways. By a well-known counting principle, the number of outcomes in E₃ is r₃ = C(4, 3)C(4, 2). Thus P(E₃) = C(4, 3)C(4, 2)/C(52, 5). Finally, let E₄ be the set of outcomes in which there are exactly two kings, two queens, and one jack. Then
    P(E₄) = C(4, 2)C(4, 2)C(4, 1)/C(52, 5),
because the numerator of this fraction is the number of outcomes in E₄.

Example 2. A lot, consisting of 100 fuses, is inspected by the following procedure. Five of these fuses are chosen at random and tested; if all 5 "blow" at the correct amperage, the lot is accepted. If, in fact, there are 20 defective fuses in the lot, the probability of accepting the lot is, under appropriate assumptions,
    C(80, 5)/C(100, 5) = 0.32,
approximately. More generally, let the random variable X be the number of defective fuses among the 5 that are inspected. The space of X is 𝒜 = {x; x = 0, 1, 2, 3, 4, 5} and the p.d.f. of X is given by
    f(x) = Pr(X = x) = C(20, x)C(80, 5 − x)/C(100, 5), x = 0, 1, 2, 3, 4, 5,
         = 0 elsewhere.
This is an example of a discrete type of distribution called a hypergeometric distribution.

EXERCISES

(In order to solve some of these exercises, the reader must make certain assumptions.)
1.64. A bowl contains 16 chips, of which 6 are red, 7 are white, and 3 are blue. If 4 chips are taken at random and without replacement, find the probability that: (a) each of the 4 chips is red; (b) none of the 4 chips is red; (c) there is at least 1 chip of each color.
1.65. A person has purchased 10 of 1000 tickets sold in a certain raffle. To determine the five prize winners, 5 tickets are to be drawn at random and without replacement. Compute the probability that this person will win at least one prize. Hint. First compute the probability that the person does not win a prize.
1.66. Compute the probability of being dealt at random and without replacement a 13-card bridge hand consisting of: (a) 6 spades, 4 hearts, 2 diamonds, and 1 club; (b) 13 cards of the same suit.
1.67. Three distinct integers are chosen at random from the first 20 positive integers. Compute the probability that: (a) their sum is even; (b) their product is even.
1.68. There are five red chips and three blue chips in a bowl. The red chips are numbered 1, 2, 3, 4, 5, respectively, and the blue chips are numbered 1, 2, 3, respectively. If two chips are to be drawn at random and without replacement, find the probability that these chips have either the same number or the same color.
1.69. Let X have the uniform distribution given by the p.d.f. f(x) = 1/5, x = −2, −1, 0, 1, 2, zero elsewhere. Find the p.d.f. of Y = X². Hint. Note that Y has a distribution of the discrete type.
1.70. Let X and Y have the p.d.f. f(x, y) = 1, 0 < x < 1, 0 < y < 1, zero elsewhere. Find the p.d.f. of the product Z = XY.
1.71. Let 13 cards be taken, at random and without replacement, from an ordinary deck of playing cards. If X is the number of spades in these 13 cards, find the p.d.f. of X. If, in addition, Y is the number of hearts in these 13 cards, find the probability Pr(X = 2, Y = 5). What is the p.d.f. of X and Y?
1.72. Four distinct integers are chosen at random and without replacement from the first 10 positive integers. Let the random variable X be the next to the smallest of these four numbers. Find the p.d.f. of X.
1.73. In a lot of 50 light bulbs, there are 2 bad bulbs. An inspector examines 5 bulbs, which are selected at random and without replacement.
(a) Find the probability of at least 1 defective bulb among the 5.
(b) How many bulbs should he examine so that the probability of finding at least 1 bad bulb exceeds 1/2?
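Example 2's acceptance probability, and hypergeometric computations like the one in Exercise 1.73(a), can be checked with standard software. The sketch below is our own and uses scipy.stats.hypergeom.

```python
# Numerical check of Example 2: 100 fuses, 20 defective, sample of 5 without replacement.
from scipy.stats import hypergeom

rv = hypergeom(M=100, n=20, N=5)
print(rv.pmf(0))                          # Pr(X = 0) = C(20,0)C(80,5)/C(100,5), about 0.319
print(sum(rv.pmf(x) for x in range(6)))   # the p.d.f. sums to one
```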
1.9 Mathematical Expectation

One of the more useful concepts in problems involving distributions of random variables is that of mathematical expectation. Let X be a random variable having a p.d.f. f(x), and let u(X) be a function of X such that
    ∫_{−∞}^{∞} u(x)f(x) dx
exists, if X is a continuous type of random variable, or such that
    Σ_x u(x)f(x)
exists, if X is a discrete type of random variable. The integral, or the sum, as the case may be, is called the mathematical expectation (or expected value) of u(X) and is denoted by E[u(X)]. That is,
    E[u(X)] = ∫_{−∞}^{∞} u(x)f(x) dx,
if X is a continuous type of random variable, or
    E[u(X)] = Σ_x u(x)f(x),
if X is a discrete type of random variable.

Remarks. The usual definition of E[u(X)] requires that the integral (or sum) converge absolutely. However, in this book, each u(x) is of such a character that if the integral (or sum) exists, the convergence is absolute. Accordingly, we have not burdened the student with this additional provision.
The terminology "mathematical expectation" or "expected value" has its origin in games of chance. This can be illustrated as follows: Three small similar discs, numbered 1, 2, and 2, respectively, are placed in a bowl and are mixed. A player is to be blindfolded and is to draw a disc from the bowl. If he draws the disc numbered 1, he will receive $9; if he draws either disc numbered 2, he will receive $3. It seems reasonable to assume that the player has a "1/3 claim" on the $9 and a "2/3 claim" on the $3. His "total claim" is 9(1/3) + 3(2/3), or $5. If we take X to be a random variable having the p.d.f. f(x) = x/3, x = 1, 2, zero elsewhere, and u(x) = 15 − 6x, then E[u(X)] = Σ_x u(x)f(x) = Σ_{x=1}^{2} (15 − 6x)(x/3) = 5. That is, the mathematical expectation of u(X) is precisely the player's "claim" or expectation.
The student may observe that u(X) is a random variable Y with its own distribution of probability. Suppose the p.d.f. of Y is g(y). Then E(Y) is given by
    ∫_{−∞}^{∞} y g(y) dy   or   Σ_y y g(y),
according as Y is of the continuous type or of the discrete type. The question is: Does this have the same value as E[u(X)], which was defined above? The answer to this question is in the affirmative, as will be shown in Chapter 4.

More generally, let X₁, X₂, ..., Xₙ be random variables having p.d.f. f(x₁, x₂, ..., xₙ) and let u(X₁, X₂, ..., Xₙ) be a function of these variables such that the n-fold integral
(1)    ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} u(x₁, x₂, ..., xₙ)f(x₁, x₂, ..., xₙ) dx₁ dx₂ ⋯ dxₙ
exists, if the random variables are of the continuous type, or such that the n-fold sum
(2)    Σ_{xₙ} ⋯ Σ_{x₁} u(x₁, x₂, ..., xₙ)f(x₁, x₂, ..., xₙ)
exists if the random variables are of the discrete type. The n-fold integral (or the n-fold sum, as the case may be) is called the mathematical expectation, denoted by E[u(X₁, X₂, ..., Xₙ)], of the function u(X₁, X₂, ..., Xₙ).
Next, we shall point out some fairly obvious but useful facts about mathematical expectations when they exist.
(a) If k is a constant, then E(k) = k. This follows from expression (1) [or (2)] upon setting u = k and recalling that an integral (or sum) of a constant times a function is the constant times the integral (or sum) of the function. Of course, the integral (or sum) of the function f is 1.
(b) If k is a constant and v is a function, then E(kv) = kE(v). This follows from expression (1) [or (2)] upon setting u = kv and rewriting expression (1) [or (2)] as k times the integral (or sum) of the product vf.
(c) If k₁ and k₂ are constants and v₁ and v₂ are functions, then E(k₁v₁ + k₂v₂) = k₁E(v₁) + k₂E(v₂). This, too, follows from expression (1) [or (2)] upon setting u = k₁v₁ + k₂v₂ because the integral (or sum) of (k₁v₁ + k₂v₂)f is equal to the integral (or sum) of k₁v₁f plus the integral (or sum) of k₂v₂f. Repeated application of this property shows that if k₁, k₂, ..., k_m are constants and v₁, v₂, ..., v_m are functions, then
    E(k₁v₁ + k₂v₂ + ⋯ + k_m v_m) = k₁E(v₁) + k₂E(v₂) + ⋯ + k_m E(v_m).
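A quick numerical check of this linearity property may be useful; the sketch below is ours and borrows the density f(x) = 2(1 − x) from Example 1, which follows.

```python
# Verify E(6X + 3X^2) = 6 E(X) + 3 E(X^2) for one continuous density.
from scipy.integrate import quad

f = lambda x: 2 * (1 - x)
E = lambda u: quad(lambda x: u(x) * f(x), 0, 1)[0]

lhs = E(lambda x: 6 * x + 3 * x**2)
rhs = 6 * E(lambda x: x) + 3 * E(lambda x: x**2)
print(lhs, rhs)    # both equal 5/2
```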
This property of mathematical expectation leads us to characterize the symbol E as a linear operator.

Example 1. Let X have the p.d.f.
    f(x) = 2(1 − x), 0 < x < 1,
         = 0 elsewhere.
Then
    E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^1 (x)2(1 − x) dx = 1/3,
    E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫_0^1 (x²)2(1 − x) dx = 1/6,
and, of course,
    E(6X + 3X²) = 6(1/3) + 3(1/6) = 5/2.

Example 2. Let X have the p.d.f.
    f(x) = x/6, x = 1, 2, 3,
         = 0 elsewhere.
Then
    E(X³) = Σ_x x³ f(x) = 1(1/6) + 8(2/6) + 27(3/6) = 98/6.

Example 3. Let X and Y have the p.d.f.
    f(x, y) = x + y, 0 < x < 1, 0 < y < 1,
            = 0 elsewhere.
Accordingly,
    E(XY²) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy² f(x, y) dx dy
           = ∫_0^1 ∫_0^1 xy²(x + y) dx dy
           = 17/72.

Example 4. Let us divide, at random, a horizontal line segment of length 5 into two parts. If X is the length of the left-hand part, it is reasonable to assume that X has the p.d.f.
    f(x) = 1/5, 0 < x < 5,
         = 0 elsewhere.
The expected value of the length X is E(X) = 5/2 and the expected value of the length 5 − X is E(5 − X) = 5/2. But the expected value of the product of the two lengths is equal to
    E[X(5 − X)] = ∫_0^5 x(5 − x)(1/5) dx = 25/6 ≠ (5/2)².
That is, in general, the expected value of a product is not equal to the product of the expected values.

Example 5. A bowl contains five chips, which cannot be distinguished by a sense of touch alone. Three of the chips are marked $1 each and the remaining two are marked $4 each. A player is blindfolded and draws, at random and without replacement, two chips from the bowl. The player is paid an amount equal to the sum of the values of the two chips that he draws and the game is over. If it costs $4.75 to play this game, would we care to participate for any protracted period of time? Because we are unable to distinguish the chips by sense of touch, we assume that each of the 10 pairs that can be drawn has the same probability of being drawn. Let the random variable X be the number of chips, of the two to be chosen, that are marked $1. Then, under our assumption, X has the hypergeometric p.d.f.
    f(x) = C(3, x)C(2, 2 − x)/C(5, 2), x = 0, 1, 2,
         = 0 elsewhere.
If X = x, the player receives u(x) = x + 4(2 − x) = 8 − 3x dollars. Hence his mathematical expectation is equal to
    E[8 − 3X] = Σ_{x=0}^{2} (8 − 3x)f(x) = 44/10,
or $4.40.
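Example 4's warning, that the expected value of a product is generally not the product of the expected values, is easy to confirm numerically. The sketch below is ours (scipy is our tool of choice).

```python
# For X uniform on (0, 5), compare E[X(5 - X)] with E(X) * E(5 - X).
from scipy.integrate import quad

pdf = lambda x: 1.0 / 5.0
E_X,       _ = quad(lambda x: x * pdf(x), 0, 5)             # 2.5
E_5_minus, _ = quad(lambda x: (5 - x) * pdf(x), 0, 5)       # 2.5
E_product, _ = quad(lambda x: x * (5 - x) * pdf(x), 0, 5)   # 25/6, about 4.167
print(E_product, E_X * E_5_minus)                           # 4.167 versus 6.25
```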
EXERCISES

1.74. Let X have the p.d.f. f(x) = (x + 2)/18, −2 < x < 4, zero elsewhere. Find E(X), E[(X + 2)³], and E[6X − 2(X + 2)³].
1.75. Suppose that f(x) = 1/5, x = 1, 2, 3, 4, 5, zero elsewhere, is the p.d.f. of the discrete type of random variable X. Compute E(X) and E(X²). Use these two results to find E[(X + 2)²] by writing (X + 2)² = X² + 4X + 4.
1.76. If X and Y have the p.d.f. f(x, y) = 1/3, (x, y) = (0, 0), (0, 1), (1, 1), zero elsewhere, find E[(X − 1/2)(Y − 1/2)].
1.77. Let the p.d.f. of X and Y be f(x, y) = e^{−x−y}, 0 < x < ∞, 0 < y < ∞, zero elsewhere. Let u(X, Y) = X, v(X, Y) = Y, and w(X, Y) = XY. Show that E[u(X, Y)]·E[v(X, Y)] = E[w(X, Y)].
1.78. Let the p.d.f. of X and Y be f(x, y) = 2, 0 < x < y, 0 < y < 1, zero elsewhere. Let u(X, Y) = X, v(X, Y) = Y, and w(X, Y) = XY. Show that E[u(X, Y)]·E[v(X, Y)] ≠ E[w(X, Y)].
1.79. Let X have a p.d.f. f(x) that is positive at x = −1, 0, 1 and is zero elsewhere. (a) If f(0) = 1/2, find E(X²). (b) If f(0) = 1/2 and if E(X) = 1/6, determine f(−1) and f(1).
1.80. A bowl contains 10 chips, of which 8 are marked $2 each and 2 are marked $5 each. Let a person choose, at random and without replacement, 3 chips from this bowl. If the person is to receive the sum of the resulting amounts, find his expectation.
1.81. Let X be a random variable of the continuous type that has p.d.f. f(x). If m is the unique median of the distribution of X and b is a real constant, show that
    E(|X − b|) = E(|X − m|) + 2 ∫_m^b (b − x)f(x) dx,
provided that the expectations exist. For what value of b is E(|X − b|) a minimum?
1.82. Let f(x) = 2x, 0 < x < 1, zero elsewhere, be the p.d.f. of X.
(a) Compute E(√X). (b) Find the distribution function and the p.d.f. of Y = √X. (c) Compute E(Y) and compare this result with the answer obtained in part (a).
1.83. Two distinct integers are chosen at random and without replacement from the first six positive integers. Compute the expected value of the absolute value of the difference of these two numbers.

1.10 Some Special Mathematical Expectations

Certain mathematical expectations, if they exist, have special names and symbols to represent them. We shall mention now only those associated with one random variable. First, let u(X) = X, where X is a random variable of the discrete type having a p.d.f. f(x). Then
    E(X) = Σ_x x f(x).
If the discrete points of the space of positive probability density are a₁, a₂, a₃, ..., then
    E(X) = a₁f(a₁) + a₂f(a₂) + a₃f(a₃) + ⋯.
This sum of products is seen to be a "weighted average" of the values a₁, a₂, a₃, ..., the "weight" associated with each aᵢ being f(aᵢ). This suggests that we call E(X) the arithmetic mean of the values of X, or, more simply, the mean value of X (or the mean value of the distribution). The mean value μ of a random variable X is defined, when it exists, to be μ = E(X), where X is a random variable of the discrete or of the continuous type.
Another special mathematical expectation is obtained by taking u(X) = (X − μ)². If, initially, X is a random variable of the discrete type having a p.d.f. f(x), then
    E[(X − μ)²] = Σ_x (x − μ)² f(x)
                = (a₁ − μ)²f(a₁) + (a₂ − μ)²f(a₂) + ⋯,
if a₁, a₂, ... are the discrete points of the space of positive probability density. This sum of products may be interpreted as a "weighted average" of the squares of the deviations of the numbers a₁, a₂, ... from the mean value μ of those numbers where the "weight" associated with each (aᵢ − μ)² is f(aᵢ). This mean value of the square of the deviation of X from its mean value μ is called the variance of X (or the variance of the distribution).
The variance of X will be denoted by σ², and we define σ², if it exists, by σ² = E[(X − μ)²], whether X is a discrete or a continuous type of random variable.
It is worthwhile to observe that
    σ² = E(X² − 2μX + μ²);
and since E is a linear operator,
    σ² = E(X²) − 2μE(X) + μ²
       = E(X²) − 2μ² + μ²
       = E(X²) − μ².
This frequently affords an easier way of computing the variance of X.
It is customary to call σ (the positive square root of the variance) the standard deviation of X (or the standard deviation of the distribution). The number σ is sometimes interpreted as a measure of the dispersion of the points of the space relative to the mean value μ. We note that if the space contains only one point x for which f(x) > 0, then σ = 0.

Remark. Let the random variable X of the continuous type have the p.d.f. f(x) = 1/(2a), −a < x < a, zero elsewhere, so that σ = a/√3 is the standard deviation of the distribution of X. Next, let the random variable Y of the continuous type have the p.d.f. g(y) = 1/(4a), −2a < y < 2a, zero elsewhere, so that σ = 2a/√3 is the standard deviation of the distribution of Y. Here the standard deviation of Y is greater than that of X; this reflects the fact that the probability for Y is more widely distributed (relative to the mean zero) than is the probability for X.

We next define a third special mathematical expectation, called the moment-generating function of a random variable X. Suppose that there is a positive number h such that for −h < t < h the mathematical expectation E(e^{tX}) exists. Thus
    E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f(x) dx,
if X is a continuous type of random variable, or
    E(e^{tX}) = Σ_x e^{tx} f(x),
if X is a discrete type of random variable. This expectation is called the moment-generating function of X (or of the distribution) and is denoted by M(t). That is,
    M(t) = E(e^{tX}).
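Before turning to properties of M(t), here is a small numerical check of the Remark above on the standard deviation of a uniform distribution; the code and the particular value of a are ours.

```python
# For X uniform on (-a, a), the standard deviation is a/sqrt(3).
from math import sqrt
from scipy.integrate import quad

a = 2.0
pdf = lambda x: 1.0 / (2.0 * a)
mu,  _ = quad(lambda x: x * pdf(x), -a, a)       # 0
ex2, _ = quad(lambda x: x**2 * pdf(x), -a, a)    # a^2 / 3
print(sqrt(ex2 - mu**2), a / sqrt(3))            # both about 1.1547
```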
It is evident that if we set t = 0, we have M(0) = 1. As will be seen by example, not every distribution has a moment-generating function, but it is difficult to overemphasize the importance of a moment-generating function when it does exist. This importance stems from the fact that the moment-generating function is unique and completely determines the distribution of the random variable; thus, if two random variables have the same moment-generating function, they have the same distribution. This property of a moment-generating function will be very useful in subsequent chapters. Proof of the uniqueness of the moment-generating function is based on the theory of transforms in analysis, and therefore we merely assert this uniqueness.
Although the fact that a moment-generating function (when it exists) completely determines a distribution of one random variable will not be proved, it does seem desirable to try to make the assertion plausible. This can be done if the random variable is of the discrete type. For example, let it be given that
    M(t) = (1/10)e^t + (2/10)e^{2t} + (3/10)e^{3t} + (4/10)e^{4t}
is, for all real values of t, the moment-generating function of a random variable X of the discrete type. If we let f(x) be the p.d.f. of X and let a, b, c, d, ... be the discrete points in the space of X at which f(x) > 0, then
    (1/10)e^t + (2/10)e^{2t} + (3/10)e^{3t} + (4/10)e^{4t} = f(a)e^{at} + f(b)e^{bt} + ⋯.
Because this is an identity for all real values of t, it seems that the right-hand member should consist of but four terms and that each of the four should equal, respectively, one of those in the left-hand member; hence we may take a = 1, f(a) = 1/10; b = 2, f(b) = 2/10; c = 3, f(c) = 3/10; d = 4, f(d) = 4/10. Or, more simply, the p.d.f. of X is
    f(x) = x/10, x = 1, 2, 3, 4,
         = 0 elsewhere.
On the other hand, let X be a random variable of the continuous type and let it be given that
    M(t) = 1/(1 − t)², t < 1,
is the moment-generating function of X. That is, we are given
    1/(1 − t)² = ∫_{−∞}^{∞} e^{tx} f(x) dx, t < 1.
It is not at all obvious how f(x) is found. However, it is easy to see that a distribution with p.d.f.
    f(x) = xe^{−x}, 0 < x < ∞,
         = 0 elsewhere
has the moment-generating function M(t) = (1 − t)^{−2}, t < 1. Thus the random variable X has a distribution with this p.d.f. in accordance with the assertion of the uniqueness of the moment-generating function.
Since a distribution that has a moment-generating function M(t) is completely determined by M(t), it would not be surprising if we could obtain some properties of the distribution directly from M(t). For example, the existence of M(t) for −h < t < h implies that derivatives of all order exist at t = 0. Thus
    dM(t)/dt = M'(t) = ∫_{−∞}^{∞} x e^{tx} f(x) dx,
if X is of the continuous type, or
    dM(t)/dt = M'(t) = Σ_x x e^{tx} f(x),
if X is of the discrete type. Upon setting t = 0, we have in either case
    M'(0) = E(X) = μ.
The second derivative of M(t) is
    M"(t) = ∫_{−∞}^{∞} x² e^{tx} f(x) dx   or   Σ_x x² e^{tx} f(x),
so that M"(0) = E(X²). Accordingly,
    σ² = E(X²) − μ² = M"(0) − [M'(0)]².
For example, if M(t) = (1 − t)^{−2}, t < 1, as in the illustration above, then
    M'(t) = 2(1 − t)^{−3}
and
    M"(t) = 6(1 − t)^{−4}.
Hence
    μ = M'(0) = 2
and
    σ² = M"(0) − μ² = 6 − 4 = 2.
Of course, we could have computed μ and σ² from the p.d.f. by
    μ = ∫_{−∞}^{∞} x f(x) dx   and   σ² = ∫_{−∞}^{∞} x² f(x) dx − μ²,
respectively. Sometimes one way is easier than the other.
In general, if m is a positive integer and if M^{(m)}(t) means the mth derivative of M(t), we have, by repeated differentiation with respect to t,
    M^{(m)}(0) = E(X^m).
Now
    E(X^m) = ∫_{−∞}^{∞} x^m f(x) dx   or   Σ_x x^m f(x),
and integrals (or sums) of this sort are, in mechanics, called moments. Since M(t) generates the values of E(X^m), m = 1, 2, 3, ..., it is called the moment-generating function. In fact, we shall sometimes call E(X^m) the mth moment of the distribution, or the mth moment of X.

Example 1. Let X have the p.d.f.
    f(x) = (x + 1)/2, −1 < x < 1,
         = 0 elsewhere.
Then the mean value of X is
    μ = ∫_{−∞}^{∞} x f(x) dx = ∫_{−1}^{1} x (x + 1)/2 dx = 1/3,
while the variance of X is
    σ² = ∫_{−∞}^{∞} x² f(x) dx − μ² = ∫_{−1}^{1} x² (x + 1)/2 dx − (1/3)² = 2/9.

Example 2. If X has the p.d.f.
    f(x) = 1/x², 1 < x < ∞,
         = 0 elsewhere,
then the mean value of X does not exist, since
    lim_{b→∞} ∫_1^b x (1/x²) dx = lim_{b→∞} (ln b − ln 1)
does not exist.

Example 3. Given that the series
    1/1² + 1/2² + 1/3² + ⋯
converges to π²/6. Then
    f(x) = 6/(π²x²), x = 1, 2, 3, ...,
         = 0 elsewhere,
is the p.d.f. of a discrete type of random variable X. The moment-generating function of this distribution, if it exists, is given by
    M(t) = E(e^{tX}) = Σ_x e^{tx} f(x) = Σ_{x=1}^{∞} 6e^{tx}/(π²x²).
The ratio test may be used to show that this series diverges if t > 0. Thus there does not exist a positive number h such that M(t) exists for −h < t < h. Accordingly, the distribution having the p.d.f. f(x) of this example does not have a moment-generating function.

Example 4. Let X have the moment-generating function M(t) = e^{t²/2}, −∞ < t < ∞. We can differentiate M(t) any number of times to find the moments of X. However, it is instructive to consider this alternative method. The function M(t) is represented by the following MacLaurin's series:
    e^{t²/2} = 1 + (1/1!)(t²/2) + (1/2!)(t²/2)² + ⋯ + (1/k!)(t²/2)^k + ⋯
             = 1 + (1/2!)t² + [(3)(1)/4!]t⁴ + ⋯ + [(2k − 1)⋯(3)(1)/(2k)!]t^{2k} + ⋯.
In general, the MacLaurin's series for M(t) is
    M(t) = M(0) + [M'(0)/1!]t + [M"(0)/2!]t² + ⋯ + [M^{(m)}(0)/m!]t^m + ⋯
         = 1 + [E(X)/1!]t + [E(X²)/2!]t² + ⋯ + [E(X^m)/m!]t^m + ⋯.
Thus the coefficient of (t^m/m!) in the MacLaurin's series representation of M(t) is E(X^m). So, for our particular M(t), we have
    E(X^{2k}) = (2k − 1)(2k − 3)⋯(3)(1) = (2k)!/(2^k k!),
k = 1, 2, 3, ..., and E(X^{2k−1}) = 0, k = 1, 2, 3, ....

Remarks. In a more advanced course, we would not work with the moment-generating function because so many distributions do not have moment-generating functions. Instead, we would let i denote the imaginary unit, t an arbitrary real, and we would define φ(t) = E(e^{itX}). This expectation exists for every distribution and it is called the characteristic function of the distribution. To see why φ(t) exists for all real t, we note, in the continuous case, that its absolute value
    |φ(t)| = |∫_{−∞}^{∞} e^{itx} f(x) dx| ≤ ∫_{−∞}^{∞} |e^{itx} f(x)| dx.
However, |f(x)| = f(x) since f(x) is nonnegative and
    |e^{itx}| = |cos tx + i sin tx| = √(cos² tx + sin² tx) = 1.
Thus
    |φ(t)| ≤ ∫_{−∞}^{∞} f(x) dx = 1.
Accordingly, the integral for φ(t) exists for all real values of t. In the discrete case, a summation would replace the integral.
Every distribution has a unique characteristic function; and to each characteristic function there corresponds a unique distribution of probability. If X has a distribution with characteristic function φ(t), then, for instance, if E(X) and E(X²) exist, they are given, respectively, by iE(X) = φ'(0) and i²E(X²) = φ"(0). Readers who are familiar with complex-valued functions may write φ(t) = M(it) and, throughout this book, may prove certain theorems in complete generality.
Those who have studied Laplace and Fourier transforms will note a similarity between these transforms and M(t) and φ(t); it is the uniqueness of these transforms that allows us to assert the uniqueness of each of the moment-generating and characteristic functions.
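The moment computations of this section can be reproduced symbolically. The sketch below is our own (using sympy); it differentiates the two moment-generating functions discussed above, (1 − t)^{−2} and e^{t²/2}, at t = 0.

```python
# Read off moments by differentiating a moment-generating function at t = 0.
import sympy as sp

t = sp.symbols('t')

M1 = (1 - t)**-2                       # m.g.f. of the earlier illustration
print(sp.diff(M1, t, 1).subs(t, 0))    # E(X)   = 2
print(sp.diff(M1, t, 2).subs(t, 0))    # E(X^2) = 6, so sigma^2 = 6 - 4 = 2

M2 = sp.exp(t**2 / 2)                  # m.g.f. of Example 4
print(sp.diff(M2, t, 4).subs(t, 0))    # E(X^4) = (3)(1) = 3, matching the formula for k = 2
```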
EXERCISES
1.84. Find the mean and variance, if they exist, of each of the following distributions.
(a) f(x) = [3!/(x!(3 − x)!)](1/2)³, x = 0, 1, 2, 3, zero elsewhere.
(b) f(x) = 6x(1 − x), 0 < x < 1, zero elsewhere.
(c) f(x) = 2/x³, 1 < x < ∞, zero elsewhere.
1.85. Let f(x) = (1/2)^x, x = 1, 2, 3, ..., zero elsewhere, be the p.d.f. of the random variable X. Find the moment-generating function, the mean, and the variance of X.
1.86. For each of the following probability density functions, compute Pr(μ − 2σ < X < μ + 2σ).
(a) f(x) = 6x(1 − x), 0 < x < 1, zero elsewhere.
(b) f(x) = (1/2)^x, x = 1, 2, 3, ..., zero elsewhere.
1.87. If the variance of the random variable X exists, show that E(X²) ≥ [E(X)]².
1.88. Let a random variable X of the continuous type have a p.d.f. f(x) whose graph is symmetric with respect to x = c. If the mean value of X exists, show that E(X) = c. Hint. Show that E(X − c) equals zero by writing E(X − c) as the sum of two integrals: one from −∞ to c and the other from c to ∞. In the first, let y = c − x; and, in the second, z = x − c. Finally, use the symmetry condition f(c − y) = f(c + y) in the first.
1.89. Let the random variable X have mean μ, standard deviation σ, and moment-generating function M(t), −h < t < h. Show that
    E[(X − μ)/σ] = 0,   E{[(X − μ)/σ]²} = 1,
and
    E{exp[t((X − μ)/σ)]} = e^{−μt/σ} M(t/σ),   −hσ < t < hσ.
1.90. Show that the moment-generating function of the random variable X having the p.d.f. f(x) = 1/3, −1 < x < 2, zero elsewhere, is
    M(t) = (e^{2t} − e^{−t})/(3t), t ≠ 0,
         = 1, t = 0.
1.91. Let X be a random variable such that E[(X − b)²] exists for all real b. Show that E[(X − b)²] is a minimum when b = E(X).
1.92. Let f(x₁, x₂) = 2x₁, 0 < x₁ < 1, 0 < x₂ < 1, zero elsewhere, be the p.d.f. of X₁ and X₂. Compute E(X₁ + X₂) and E{[X₁ + X₂ − E(X₁ + X₂)]²}.
1.93. Let X denote a random variable for which E[(X − a)²] exists. Give an example of a distribution of a discrete type such that this expectation is zero. Such a distribution is called a degenerate distribution.
1.94. Let X be a random variable such that K(t) = E(t^X) exists for all real values of t in a certain open interval that includes the point t = 1. Show that K^{(m)}(1) is equal to the mth factorial moment E[X(X − 1)⋯(X − m + 1)].
1.95. Let X be a random variable. If m is a positive integer, the expectation E[(X − b)^m], if it exists, is called the mth moment of the distribution about the point b. Let the first, second, and third moments of the distribution about the point 7 be 3, 11, and 15, respectively. Determine the mean μ of X, and then find the first, second, and third moments of the distribution about the point μ.
1.96. Let X be a random variable such that R(t) = E(e^{t(X−b)}) exists for −h < t < h. If m is a positive integer, show that R^{(m)}(0) is equal to the mth moment of the distribution about the point b.
1.97. Let X be a random variable with mean μ and variance σ² such that the third moment E[(X − μ)³] about the vertical line through μ exists. The value of the ratio E[(X − μ)³]/σ³ is often used as a measure of skewness. Graph each of the following probability density functions and show that this measure is negative, zero, and positive for these respective distributions (said to be skewed to the left, not skewed, and skewed to the right, respectively).
(a) f(x) = (x + 1)/2, −1 < x < 1, zero elsewhere.
(b) f(x) = 1/2, −1 < x < 1, zero elsewhere.
(c) f(x) = (1 − x)/2, −1 < x < 1, zero elsewhere.
1.98. Let X be a random variable with mean μ and variance σ² such that the fourth moment E[(X − μ)⁴] about the vertical line through μ exists. The value of the ratio E[(X − μ)⁴]/σ⁴ is often used as a measure of kurtosis. Graph each of the following probability density functions and show that this measure is smaller for the first distribution.
(a) f(x) = 1/2, −1 < x < 1, zero elsewhere.
(b) f(x) = 3(1 − x²)/4, −1 < x < 1, zero elsewhere.
1.99. Let the random variable X have p.d.f.
    f(x) = p, x = −1, 1,
         = 1 − 2p, x = 0,
         = 0 elsewhere,
where 0 < p < 1/2. Find the measure of kurtosis as a function of p. Determine its value when p = 1/3, p = 1/4, p = 1/10, and p = 1/100. Note that the kurtosis increases as p decreases.
1.100. Let ψ(t) = ln M(t), where M(t) is the moment-generating function of a distribution. Prove that ψ'(0) = μ and ψ"(0) = σ².
1.101. Find the mean and the variance of the distribution that has the distribution function
    F(x) = 0, x < 0,
         = x/8, 0 ≤ x < 2,
         = x²/16, 2 ≤ x < 4,
         = 1, 4 ≤ x.
1.102. Find the moments of the distribution that has moment-generating function M(t) = (1 − t)^{−3}, t < 1. Hint. Differentiate twice the series
    (1 − t)^{−1} = 1 + t + t² + t³ + ⋯, −1 < t < 1.
1.103. Let X be a random variable of the continuous type with p.d.f. f(x), which is positive provided 0 < x < b < ∞, and is equal to zero elsewhere. Show that
    E(X) = ∫_0^b [1 − F(x)] dx,
where F(x) is the distribution function of X.
1.11 Chebyshev's Inequality

In this section we shall prove a theorem that enables us to find upper (or lower) bounds for certain probabilities. These bounds, however, are not necessarily close to the exact probabilities and, accordingly, we ordinarily do not use the theorem to approximate a probability. The principal uses of the theorem and a special case of it are in theoretical discussions.

Theorem 6. Let u(X) be a nonnegative function of the random variable X. If E[u(X)] exists, then, for every positive constant c,

Pr [u(X) ≥ c] ≤ E[u(X)]/c.

Proof. The proof is given when the random variable X is of the continuous type; but the proof can be adapted to the discrete case if we replace integrals by sums. Let A = {x; u(x) ≥ c} and let f(x) denote the p.d.f. of X. Then

E[u(X)] = ∫_{-∞}^{∞} u(x)f(x) dx = ∫_A u(x)f(x) dx + ∫_{A*} u(x)f(x) dx.

Since each of the integrals in the extreme right-hand member of the preceding equation is nonnegative, the left-hand member is greater than or equal to either of them. In particular,

E[u(X)] ≥ ∫_A u(x)f(x) dx.

However, if x ∈ A, then u(x) ≥ c; accordingly, the right-hand member of the preceding inequality is not increased if we replace u(x) by c. Thus

E[u(X)] ≥ c ∫_A f(x) dx.

Since

∫_A f(x) dx = Pr (X ∈ A) = Pr [u(X) ≥ c],

it follows that

E[u(X)] ≥ c Pr [u(X) ≥ c],

which is the desired result.

The preceding theorem is a generalization of an inequality which is often called Chebyshev's inequality. This inequality will now be established.

Theorem 7. Chebyshev's Inequality. Let the random variable X have a distribution of probability about which we assume only that there is a finite variance σ². This, of course, implies that there is a mean μ. Then for every k > 0,

Pr (|X - μ| ≥ kσ) ≤ 1/k²,

or, equivalently,

Pr (|X - μ| < kσ) ≥ 1 - 1/k².

Proof. In Theorem 6 take u(X) = (X - μ)² and c = k²σ². Then we have

Pr [(X - μ)² ≥ k²σ²] ≤ E[(X - μ)²]/(k²σ²).

Since the numerator of the right-hand member of the preceding inequality is σ², the inequality may be written

Pr (|X - μ| ≥ kσ) ≤ 1/k²,

which is the desired result. Naturally, we would take the positive number k to be greater than 1 to have an inequality of interest.

It is seen that the number 1/k² is an upper bound for the probability Pr (|X - μ| ≥ kσ). In the following example this upper bound and the exact value of the probability are compared in special instances.

Example 1. Let X have the p.d.f.

f(x) = 1/(2√3), -√3 < x < √3,
     = 0 elsewhere.

Here μ = 0 and σ² = 1. If k = 3/2, we have the exact probability

Pr (|X - μ| ≥ kσ) = Pr (|X| ≥ 3/2) = 1 - ∫_{-3/2}^{3/2} 1/(2√3) dx = 1 - √3/2.

By Chebyshev's inequality, the preceding probability has the upper bound 1/k² = 4/9. Since 1 - √3/2 = 0.134, approximately, the exact probability in this case is considerably less than the upper bound 4/9. If we take k = 2, we have the exact probability Pr (|X - μ| ≥ 2σ) = Pr (|X| ≥ 2) = 0. This
again is considerably less than the upper bound 1/k² = 1/4 provided by Chebyshev's inequality.
In each instance in the preceding example, the probability Pr (|X - μ| ≥ kσ) and its upper bound 1/k² differ considerably. This suggests that this inequality might be made sharper. However, if we want an inequality that holds for every k > 0 and holds for all random variables having finite variance, such an improvement is impossible, as is shown by the following example.

Example 2. Let the random variable X of the discrete type have probabilities 1/8, 6/8, 1/8 at the points x = -1, 0, 1, respectively. Here μ = 0 and σ² = 1/4. If k = 2, then 1/k² = 1/4 and Pr (|X - μ| ≥ kσ) = Pr (|X| ≥ 1) = 1/4. That is, the probability Pr (|X - μ| ≥ kσ) here attains the upper bound 1/k² = 1/4. Hence the inequality cannot be improved without further assumptions about the distribution of X.
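The comparisons made in Examples 1 and 2 are easy to reproduce numerically. The following sketch (ours, in plain Python) computes the exact probabilities and the Chebyshev bound 1/k² for both distributions.

```python
# Compare exact values of Pr(|X - mu| >= k*sigma) with the Chebyshev bound 1/k^2.
import math

# Example 1: uniform on (-sqrt(3), sqrt(3)), so mu = 0 and sigma = 1.
def uniform_tail(k):
    # Pr(|X| >= k) = 1 - k/sqrt(3) for 0 <= k <= sqrt(3), and 0 beyond that.
    return max(0.0, 1.0 - k / math.sqrt(3.0))

for k in (1.5, 2.0):
    print(f"uniform:  k={k}, exact={uniform_tail(k):.3f}, bound={1/k**2:.3f}")

# Example 2: Pr(X=-1) = Pr(X=1) = 1/8, Pr(X=0) = 6/8, so mu = 0 and sigma = 1/2.
k = 2.0
exact = 1/8 + 1/8                    # Pr(|X| >= k*sigma) = Pr(|X| >= 1)
print(f"discrete: k={k}, exact={exact:.3f}, bound={1/k**2:.3f}")   # bound is attained
```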
EXERCISES
1.104. Let X be a random variable with mean μ and let E[(X - μ)^(2k)] exist. Show, with d > 0, that Pr (|X - μ| ≥ d) ≤ E[(X - μ)^(2k)]/d^(2k).
1.105. Let X be a random variable such that Pr (X ≤ 0) = 0 and let μ = E(X) exist. Show that Pr (X ≥ 2μ) ≤ 1/2.
1.106. If X is a random variable such that E(X) = 3 and E(X²) = 13, use Chebyshev's inequality to determine a lower bound for the probability Pr (-2 < X < 8).
1.107. Let X be a random variable with moment-generating function M(t), -h < t < h. Prove that

Pr (X ≥ a) ≤ e^(-at) M(t), 0 < t < h,

and that

Pr (X ≤ a) ≤ e^(-at) M(t), -h < t < 0.

Hint. Let u(x) = e^(tx) and c = e^(ta) in Theorem 6. Note. These results imply that Pr (X ≥ a) and Pr (X ≤ a) are less than the respective greatest lower bounds of e^(-at) M(t) when 0 < t < h and when -h < t < 0.
1.108. The moment-generating function of X exists for all real values of t and is given by

M(t) = (e^t - e^(-t))/(2t), t ≠ 0, M(0) = 1.

Use the results of the preceding exercise to show that Pr (X ≥ 1) = 0 and Pr (X ≤ -1) = 0. Note that here h is infinite.
Chapter 2
Conditional Probability
and Stochastic
Independence
2.1 Conditional Probability
In some random experiments, we are interested only in those outcomes that are elements of a subset C₁ of the sample space 𝒞. This means, for our purposes, that the sample space is effectively the subset C₁. We are now confronted with the problem of defining a probability set function with C₁ as the "new" sample space.

Let the probability set function P(C) be defined on the sample space 𝒞 and let C₁ be a subset of 𝒞 such that P(C₁) > 0. We agree to consider only those outcomes of the random experiment that are elements of C₁; in essence, then, we take C₁ to be a sample space. Let C₂ be another subset of 𝒞. How, relative to the new sample space C₁, do we want to define the probability of the event C₂? Once defined, this probability is called the conditional probability of the event C₂, relative to the hypothesis of the event C₁; or, more briefly, the conditional probability of C₂, given C₁. Such a conditional probability is denoted by the symbol P(C₂|C₁). We now return to the question that was raised about the definition of this symbol. Since C₁ is now the sample space, the only elements of C₂ that concern us are those, if any, that are also elements of C₁, that is, the elements of C₁ ∩ C₂. It seems desirable, then, to define the symbol P(C₂|C₁) in such a way that

P(C₁|C₁) = 1 and P(C₂|C₁) = P(C₁ ∩ C₂|C₁).

Moreover, from a relative frequency point of view, it would seem logically inconsistent if we did not require that the ratio of the probabilities of the events C₁ ∩ C₂ and C₁, relative to the space C₁, be the same as the
ratio of the probabilities of these events relative to the space 𝒞; that is, we should have

P(C₂|C₁) = P(C₁ ∩ C₂|C₁)/P(C₁|C₁) = P(C₁ ∩ C₂)/P(C₁).

These three desirable conditions imply that the relation

P(C₂|C₁) = P(C₁ ∩ C₂)/P(C₁)

is a suitable definition of the conditional probability of the event C₂, given the event C₁, provided P(C₁) > 0. Moreover, we have:
(a) P(C₂|C₁) ≥ 0.
(b) P(C₂ ∪ C₃ ∪ ··· |C₁) = P(C₂|C₁) + P(C₃|C₁) + ···, provided C₂, C₃, ... are mutually disjoint sets.
(c) P(C₁|C₁) = 1.
Properties (a) and (c) are evident; proof of property (b) is left as an exercise. But these are precisely the conditions that a probability set function must satisfy. Accordingly, P(C₂|C₁) is a probability set function, defined for subsets of C₁. It may be called the conditional probability set function, relative to the hypothesis C₁; or the conditional probability set function, given C₁. It should be noted that this conditional probability set function, given C₁, is defined at this time only when P(C₁) > 0.

We have now defined the concept of conditional probability for subsets C of a sample space 𝒞. We wish to do the same kind of thing for subsets A of 𝒜, where 𝒜 is the space of one or more random variables defined on 𝒞. Let P denote the probability set function of the induced probability on 𝒜. If A₁ and A₂ are subsets of 𝒜, the conditional probability of the event A₂, given the event A₁, is

P(A₂|A₁) = P(A₁ ∩ A₂)/P(A₁),

provided P(A₁) > 0. This definition will apply to any space which has a probability set function assigned to it.

Example 1. A hand of 5 cards is to be dealt at random and without replacement from an ordinary deck of 52 playing cards. The conditional probability of an all-spade hand (C₂), relative to the hypothesis that there are at least 4 spades in the hand (C₁), is, since C₁ ∩ C₂ = C₂,

P(C₂|C₁) = P(C₂)/P(C₁) = \binom{13}{5} / [\binom{13}{4}\binom{39}{1} + \binom{13}{5}].

It is worth noting, if we let the random variable X equal the number of spades in a 5-card hand, that a reasonable probability model for X is given by the hypergeometric p.d.f.

f(x) = \binom{13}{x}\binom{39}{5 - x} / \binom{52}{5}, x = 0, 1, 2, 3, 4, 5,
     = 0 elsewhere.

Accordingly, we can write P(C₂|C₁) = Pr (X = 5)/Pr (X = 4, 5) = f(5)/[f(4) + f(5)].

From the definition of the conditional probability set function, we observe that

P(C₁ ∩ C₂) = P(C₁)P(C₂|C₁).

This relation is frequently called the multiplication rule for probabilities. Sometimes, after considering the nature of the random experiment, it is possible to make reasonable assumptions so that both P(C₁) and P(C₂|C₁) can be assigned. Then P(C₁ ∩ C₂) can be computed under these assumptions. This will be illustrated in Examples 2 and 3.

Example 2. A bowl contains eight chips. Three of the chips are red and the remaining five are blue. Two chips are to be drawn successively, at random and without replacement. We want to compute the probability that the first draw results in a red chip (C₁) and that the second draw results in a blue chip (C₂). It is reasonable to assign the following probabilities: P(C₁) = 3/8 and P(C₂|C₁) = 5/7. Thus, under these assignments, we have P(C₁ ∩ C₂) = (3/8)(5/7) = 15/56.

Example 3. From an ordinary deck of playing cards, cards are to be drawn successively, at random and without replacement. The probability that the third spade appears on the sixth draw is computed as follows. Let C₁ be the event of two spades in the first five draws and let C₂ be the event of a spade on the sixth draw. Thus the probability that we wish to compute is P(C₁ ∩ C₂). It is reasonable to take

P(C₁) = \binom{13}{2}\binom{39}{3} / \binom{52}{5} and P(C₂|C₁) = 11/47.

The desired probability P(C₁ ∩ C₂) is then the product of these two numbers. More generally, if X + 3 is the number of draws necessary to produce exactly three spades, a reasonable probability model for the random variable X is given by the p.d.f.

f(x) = [\binom{13}{2}\binom{39}{x} / \binom{52}{x + 2}] [11/(50 - x)], x = 0, 1, 2, ..., 39,
     = 0 elsewhere.

Then the particular probability which we computed is P(C₁ ∩ C₂) = Pr (X = 3) = f(3).

The multiplication rule can be extended to three or more events. In the case of three events, we have, by using the multiplication rule for two events,

P(C₁ ∩ C₂ ∩ C₃) = P[(C₁ ∩ C₂) ∩ C₃] = P(C₁ ∩ C₂)P(C₃|C₁ ∩ C₂).

But P(C₁ ∩ C₂) = P(C₁)P(C₂|C₁). Hence

P(C₁ ∩ C₂ ∩ C₃) = P(C₁)P(C₂|C₁)P(C₃|C₁ ∩ C₂).

This procedure can be used to extend the multiplication rule to four or more events. The general formula for k events can be proved by mathematical induction.

Example 4. Four cards are to be dealt successively, at random and without replacement, from an ordinary deck of playing cards. The probability of receiving a spade, a heart, a diamond, and a club, in that order, is (13/52)(13/51)(13/50)(13/49). This follows from the extension of the multiplication rule. In this computation, the assumptions that are involved seem clear.
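The multiplication-rule computations of Examples 2 and 3 are easy to check directly; the sketch below (ours, in Python, using the standard-library function math.comb) evaluates both probabilities.

```python
# A direct check (ours) of the multiplication-rule computations in Examples 2 and 3.
from math import comb

# Example 2: P(red first) * P(blue second | red first), with 3 red and 5 blue chips.
p_example2 = (3 / 8) * (5 / 7)
print(p_example2, 15 / 56)                       # both equal 15/56

# Example 3: the third spade appears on the sixth draw.
p_c1 = comb(13, 2) * comb(39, 3) / comb(52, 5)   # exactly two spades in five draws
p_c2_given_c1 = 11 / 47                          # then a spade on the sixth draw
print(p_c1 * p_c2_given_c1)
```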
EXERCISES

(In order to solve certain of these exercises, the student is required to make assumptions.)
2.1. If P(C₁) > 0 and if C₂, C₃, C₄, ... are mutually disjoint sets, show that P(C₂ ∪ C₃ ∪ ··· |C₁) = P(C₂|C₁) + P(C₃|C₁) + ···.
2.2. Prove that

P(C₁ ∩ C₂ ∩ C₃ ∩ C₄) = P(C₁)P(C₂|C₁)P(C₃|C₁ ∩ C₂)P(C₄|C₁ ∩ C₂ ∩ C₃).

2.3. A bowl contains eight chips. Three of the chips are red and five are blue. Four chips are to be drawn successively at random and without replacement. (a) Compute the probability that the colors alternate. (b) Compute the probability that the first blue chip appears on the third draw. (c) If X + 1 is the number of draws needed to produce the first blue chip, determine the p.d.f. of X.
2.4. A hand of 13 cards is to be dealt at random and without replacement from an ordinary deck of playing cards. Find the conditional probability that there are at least three kings in the hand relative to the hypothesis that the hand contains at least two kings.
2.5. A drawer contains eight pairs of socks. If six socks are taken at random and without replacement, compute the probability that there is at least one matching pair among these six socks. Hint. Compute the probability that there is not a matching pair.
2.6. A bowl contains 10 chips. Four of the chips are red, 5 are white, and 1 is blue. If 3 chips are taken at random and without replacement, compute the conditional probability that there is 1 chip of each color relative to the hypothesis that there is exactly 1 red chip among the 3.
2.7. Let each of the mutually disjoint sets C₁, ..., Cₘ have nonzero probability. If the set C is a subset of the union of C₁, ..., Cₘ, show that

P(C) = P(C₁)P(C|C₁) + ··· + P(Cₘ)P(C|Cₘ).

If P(C) > 0, prove Bayes' formula:

P(Cⱼ|C) = P(Cⱼ)P(C|Cⱼ)/[P(C₁)P(C|C₁) + ··· + P(Cₘ)P(C|Cₘ)], j = 1, ..., m.

Hint. P(C)P(Cⱼ|C) = P(Cⱼ)P(C|Cⱼ).
2.8. Bowl I contains 3 red chips and 7 blue chips. Bowl II contains 6 red chips and 4 blue chips. A bowl is selected at random and then 1 chip is drawn from this bowl. (a) Compute the probability that this chip is red. (b) Relative to the hypothesis that the chip is red, find the conditional probability that it is drawn from bowl II.
2.9. Bowl I contains 6 red chips and 4 blue chips. Five of these 10 chips are selected at random and without replacement and put in bowl II, which was originally empty. One chip is then drawn at random from bowl II. Relative to the hypothesis that this chip is blue, find the conditional probability that 2 red chips and 3 blue chips are transferred from bowl I to bowl II.

2.2 Marginal and Conditional Distributions

Let f(x₁, x₂) be the p.d.f. of two random variables X₁ and X₂. From this point on, for emphasis and clarity, we shall call a p.d.f. or a distribution function a joint p.d.f. or a joint distribution function when more than one random variable is involved. Thus f(x₁, x₂) is the joint
p.d.f. of the random variables X₁ and X₂. Consider the event a < X₁ < b, a < b. This event can occur when and only when the event a < X₁ < b, -∞ < X₂ < ∞ occurs; that is, the two events are equivalent, so that they have the same probability. But the probability of the latter event has been defined and is given by

Pr (a < X₁ < b, -∞ < X₂ < ∞) = ∫_a^b ∫_{-∞}^{∞} f(x₁, x₂) dx₂ dx₁

for the continuous case, and by

Pr (a < X₁ < b, -∞ < X₂ < ∞) = Σ_{a<x₁<b} Σ_{x₂} f(x₁, x₂)

for the discrete case. Now each of

∫_{-∞}^{∞} f(x₁, x₂) dx₂ and Σ_{x₂} f(x₁, x₂)

is a function of x₁ alone, say f₁(x₁). Thus, for every a < b, we have

Pr (a < X₁ < b) = ∫_a^b f₁(x₁) dx₁ (continuous case),
               = Σ_{a<x₁<b} f₁(x₁) (discrete case),

so that f₁(x₁) is the p.d.f. of X₁ alone. Since f₁(x₁) is found by summing (or integrating) the joint p.d.f. f(x₁, x₂) over all x₂ for a fixed x₁, we can think of recording this sum in the "margin" of the x₁x₂-plane. Accordingly, f₁(x₁) is called the marginal p.d.f. of X₁. In like manner

f₂(x₂) = ∫_{-∞}^{∞} f(x₁, x₂) dx₁ (continuous case),
       = Σ_{x₁} f(x₁, x₂) (discrete case),

is called the marginal p.d.f. of X₂.

Example 1. Let the joint p.d.f. of X₁ and X₂ be

f(x₁, x₂) = (x₁ + x₂)/21, x₁ = 1, 2, 3, x₂ = 1, 2,
          = 0 elsewhere.

Then, for instance,

Pr (X₁ = 3) = f(3, 1) + f(3, 2) = 4/21 + 5/21 = 3/7

and

Pr (X₂ = 2) = f(1, 2) + f(2, 2) + f(3, 2) = 3/21 + 4/21 + 5/21 = 4/7.

On the other hand, the marginal p.d.f. of X₁ is

f₁(x₁) = Σ_{x₂=1}^{2} (x₁ + x₂)/21 = (2x₁ + 3)/21, x₁ = 1, 2, 3,

zero elsewhere, and the marginal p.d.f. of X₂ is

f₂(x₂) = Σ_{x₁=1}^{3} (x₁ + x₂)/21 = (6 + 3x₂)/21, x₂ = 1, 2,

zero elsewhere. Thus the preceding probabilities may be computed as Pr (X₁ = 3) = f₁(3) = 3/7 and Pr (X₂ = 2) = f₂(2) = 4/7.

We shall now discuss the notion of a conditional p.d.f. Let X₁ and X₂ denote random variables of the discrete type which have the joint p.d.f. f(x₁, x₂) which is positive on 𝒜 and is zero elsewhere. Let f₁(x₁) and f₂(x₂) denote, respectively, the marginal probability density functions of X₁ and X₂. Take A₁ to be the set A₁ = {(x₁, x₂); x₁ = x₁', -∞ < x₂ < ∞}, where x₁' is such that P(A₁) = Pr (X₁ = x₁') = f₁(x₁') > 0, and take A₂ to be the set A₂ = {(x₁, x₂); -∞ < x₁ < ∞, x₂ = x₂'}. Then, by definition, the conditional probability of the event A₂, given the event A₁, is

P(A₂|A₁) = P(A₁ ∩ A₂)/P(A₁) = Pr (X₁ = x₁', X₂ = x₂')/Pr (X₁ = x₁') = f(x₁', x₂')/f₁(x₁').

That is, if (x₁, x₂) is any point at which f₁(x₁) > 0, the conditional probability that X₂ = x₂, given that X₁ = x₁, is f(x₁, x₂)/f₁(x₁). With x₁ held fast, and with f₁(x₁) > 0, this function of x₂ satisfies the conditions of being a p.d.f. of a discrete type of random variable X₂ because f(x₁, x₂)/f₁(x₁) is not negative and

Σ_{x₂} f(x₁, x₂)/f₁(x₁) = [1/f₁(x₁)] Σ_{x₂} f(x₁, x₂) = f₁(x₁)/f₁(x₁) = 1.

We now define the symbol f(x₂|x₁) by the relation

f(x₂|x₁) = f(x₁, x₂)/f₁(x₁),

and we call f(x₂|x₁) the conditional p.d.f. of the discrete type of random variable X₂, given that the discrete type of random variable X₁ = x₁. In a similar manner we define the symbol f(x₁|x₂) by the relation

f(x₁|x₂) = f(x₁, x₂)/f₂(x₂),

and we call f(x₁|x₂) the conditional p.d.f. of the discrete type of random variable X₁, given that the discrete type of random variable X₂ = x₂.
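The marginal and conditional probability density functions of Example 1 can be tabulated directly; the sketch below (ours, in Python, using the standard-library fractions module) does so.

```python
# Marginal and conditional p.d.f.s for f(x1, x2) = (x1 + x2)/21, x1 = 1, 2, 3, x2 = 1, 2.
from fractions import Fraction

f = {(x1, x2): Fraction(x1 + x2, 21) for x1 in (1, 2, 3) for x2 in (1, 2)}

f1 = {x1: sum(f[(x1, x2)] for x2 in (1, 2)) for x1 in (1, 2, 3)}   # marginal of X1
f2 = {x2: sum(f[(x1, x2)] for x1 in (1, 2, 3)) for x2 in (1, 2)}   # marginal of X2
print(f1)     # values 5/21, 7/21, 9/21
print(f2)     # values 9/21, 12/21

# Conditional p.d.f. of X2, given X1 = 3: f(x2 | 3) = f(3, x2)/f1(3).
cond = {x2: f[(3, x2)] / f1[3] for x2 in (1, 2)}
print(cond)   # values 4/9, 5/9
```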
Now let X₁ and X₂ denote random variables of the continuous type that have the joint p.d.f. f(x₁, x₂) and the marginal probability density functions f₁(x₁) and f₂(x₂), respectively. We shall use the results of the preceding paragraph to motivate a definition of a conditional p.d.f. of a continuous type of random variable. When f₁(x₁) > 0, we define the symbol f(x₂|x₁) by the relation

f(x₂|x₁) = f(x₁, x₂)/f₁(x₁).

In this relation, x₁ is to be thought of as having a fixed (but any fixed) value for which f₁(x₁) > 0. It is evident that f(x₂|x₁) is nonnegative and that

∫_{-∞}^{∞} f(x₂|x₁) dx₂ = ∫_{-∞}^{∞} f(x₁, x₂)/f₁(x₁) dx₂ = f₁(x₁)/f₁(x₁) = 1.

That is, f(x₂|x₁) has the properties of a p.d.f. of one continuous type of random variable. It is called the conditional p.d.f. of the continuous type of random variable X₂, given that the continuous type of random variable X₁ has the value x₁. When f₂(x₂) > 0, the conditional p.d.f. of the continuous type of random variable X₁, given that the continuous type of random variable X₂ has the value x₂, is defined by

f(x₁|x₂) = f(x₁, x₂)/f₂(x₂).

Since each of f(x₂|x₁) and f(x₁|x₂) is a p.d.f. of one random variable (whether of the discrete or the continuous type), each has all the properties of such a p.d.f. Thus, we can compute probabilities and mathematical expectations. If the random variables are of the continuous type, the probability

Pr (a < X₂ < b|X₁ = x₁) = ∫_a^b f(x₂|x₁) dx₂

is called "the conditional probability that a < X₂ < b, given that X₁ = x₁." If there is no ambiguity, this may be written in the form Pr (a < X₂ < b|x₁). Similarly, the conditional probability that c < X₁ < d, given X₂ = x₂, is

Pr (c < X₁ < d|X₂ = x₂) = ∫_c^d f(x₁|x₂) dx₁.

If u(X₂) is a function of X₂, the expectation

E[u(X₂)|x₁] = ∫_{-∞}^{∞} u(x₂) f(x₂|x₁) dx₂

is called the conditional expectation of u(X₂), given X₁ = x₁. In particular, if they exist, E(X₂|x₁) is the mean and E{[X₂ - E(X₂|x₁)]²|x₁} is the variance of the conditional distribution of X₂, given X₁ = x₁. It is convenient to refer to these as the "conditional mean" and the "conditional variance" of X₂, given X₁ = x₁. Of course we have

E{[X₂ - E(X₂|x₁)]²|x₁} = E(X₂²|x₁) - [E(X₂|x₁)]²

from an earlier result. In like manner, the conditional expectation of u(X₁), given X₂ = x₂, is given by

E[u(X₁)|x₂] = ∫_{-∞}^{∞} u(x₁) f(x₁|x₂) dx₁.

With random variables of the discrete type, these conditional probabilities and conditional expectations are computed by using summation instead of integration. An illustrative example follows.

Example 2. Let X₁ and X₂ have the joint p.d.f.

f(x₁, x₂) = 2, 0 < x₁ < x₂ < 1,
          = 0 elsewhere.

Then the marginal probability density functions are, respectively,

f₁(x₁) = ∫_{x₁}^{1} 2 dx₂ = 2(1 - x₁), 0 < x₁ < 1,
       = 0 elsewhere,

and

f₂(x₂) = ∫_0^{x₂} 2 dx₁ = 2x₂, 0 < x₂ < 1,
       = 0 elsewhere.

The conditional p.d.f. of X₁, given X₂ = x₂, is

f(x₁|x₂) = 2/(2x₂) = 1/x₂, 0 < x₁ < x₂,
         = 0 elsewhere.

Here the conditional mean and conditional variance of X₁, given X₂ = x₂, are, respectively,

E(X₁|x₂) = ∫_{-∞}^{∞} x₁ f(x₁|x₂) dx₁ = ∫_0^{x₂} x₁(1/x₂) dx₁ = x₂/2, 0 < x₂ < 1,

and

E{[X₁ - E(X₁|x₂)]²|x₂} = x₂²/12, 0 < x₂ < 1.

Finally, we shall compare the values of Pr (0 < X₁ < 1/2|X₂ = 3/4) and Pr (0 < X₁ < 1/2). We have

Pr (0 < X₁ < 1/2|X₂ = 3/4) = ∫_0^{1/2} f(x₁|3/4) dx₁ = ∫_0^{1/2} (4/3) dx₁ = 2/3,

but

Pr (0 < X₁ < 1/2) = ∫_0^{1/2} f₁(x₁) dx₁ = ∫_0^{1/2} 2(1 - x₁) dx₁ = 3/4.

We shall now discuss the notions of marginal and conditional probability density functions from the point of view of n random variables. All of the preceding definitions can be directly generalized to the case of n variables in the following manner. Let the random variables X₁, X₂, ..., Xₙ have the joint p.d.f. f(x₁, x₂, ..., xₙ). If the random variables are of the continuous type, then by an argument similar to the two-variable case, we have for every a < b,

Pr (a < X₁ < b) = ∫_a^b f₁(x₁) dx₁,

where f₁(x₁) is defined by the (n - 1)-fold integral

f₁(x₁) = ∫_{-∞}^{∞} ··· ∫_{-∞}^{∞} f(x₁, x₂, ..., xₙ) dx₂ ··· dxₙ.

Accordingly, f₁(x₁) is the p.d.f. of the one random variable X₁ and f₁(x₁) is called the marginal p.d.f. of X₁. The marginal probability density functions f₂(x₂), ..., fₙ(xₙ) of X₂, ..., Xₙ, respectively, are similar (n - 1)-fold integrals.

Up to this point, each marginal p.d.f. has been a p.d.f. of one random variable. It is convenient to extend this terminology to joint probability density functions. We shall do this now. Let f(x₁, x₂, ..., xₙ) be the joint p.d.f. of the n random variables X₁, X₂, ..., Xₙ, just as before. Now, however, let us take any group of k < n of these random variables and let us find the joint p.d.f. of them. This joint p.d.f. is called the marginal p.d.f. of this particular group of k variables. To fix the ideas, take n = 6, k = 3, and let us select the group X₂, X₄, X₅. Then the marginal p.d.f. of X₂, X₄, X₅ is the joint p.d.f. of this particular group of three variables, namely,

∫_{-∞}^{∞} ∫_{-∞}^{∞} ∫_{-∞}^{∞} f(x₁, x₂, x₃, x₄, x₅, x₆) dx₁ dx₃ dx₆,

if the random variables are of the continuous type.

We shall next extend the definition of a conditional p.d.f. If f₁(x₁) > 0, the symbol f(x₂, ..., xₙ|x₁) is defined by the relation

f(x₂, ..., xₙ|x₁) = f(x₁, x₂, ..., xₙ)/f₁(x₁),

and f(x₂, ..., xₙ|x₁) is called the joint conditional p.d.f. of X₂, ..., Xₙ, given X₁ = x₁. The joint conditional p.d.f. of any n - 1 random variables, say X₁, ..., X_{i-1}, X_{i+1}, ..., Xₙ, given Xᵢ = xᵢ, is defined as the joint p.d.f. of X₁, X₂, ..., Xₙ divided by the marginal p.d.f. fᵢ(xᵢ), provided fᵢ(xᵢ) > 0. More generally, the joint conditional p.d.f. of n - k of the random variables, for given values of the remaining k variables, is defined as the joint p.d.f. of the n variables divided by the marginal p.d.f. of the particular group of k variables, provided the latter p.d.f. is positive. We remark that there are many other conditional probability density functions; for instance, see Exercise 2.17.

Because a conditional p.d.f. is a p.d.f. of a certain number of random variables, the mathematical expectation of a function of these random variables has been defined. To emphasize the fact that a conditional p.d.f. is under consideration, such expectations are called conditional expectations. For instance, the conditional expectation of u(X₂, ..., Xₙ), given X₁ = x₁, is, for random variables of the continuous type, given by

E[u(X₂, ..., Xₙ)|x₁] = ∫_{-∞}^{∞} ··· ∫_{-∞}^{∞} u(x₂, ..., xₙ) f(x₂, ..., xₙ|x₁) dx₂ ··· dxₙ,

provided f₁(x₁) > 0 and the integral converges (absolutely). If the random variables are of the discrete type, conditional mathematical expectations are, of course, computed by using sums instead of integrals.
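The computations in Example 2 can be confirmed numerically; the sketch below (ours, in Python with SciPy) recovers the conditional mean, the conditional variance, and the two probabilities that were compared.

```python
# Numerical check of Example 2: f(x1, x2) = 2 on 0 < x1 < x2 < 1.
from scipy.integrate import quad

x2 = 0.75                               # condition on X2 = 3/4
cond_pdf = lambda x1: 1.0 / x2          # f(x1 | x2) = 1/x2 on (0, x2)

mean, _ = quad(lambda x1: x1 * cond_pdf(x1), 0, x2)
var, _ = quad(lambda x1: (x1 - mean) ** 2 * cond_pdf(x1), 0, x2)
print(mean, x2 / 2)                     # conditional mean x2/2
print(var, x2 ** 2 / 12)                # conditional variance x2^2/12

p_cond, _ = quad(cond_pdf, 0, 0.5)                   # Pr(0 < X1 < 1/2 | X2 = 3/4) = 2/3
p_marg, _ = quad(lambda x1: 2 * (1 - x1), 0, 0.5)    # Pr(0 < X1 < 1/2) = 3/4
print(p_cond, p_marg)
```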
EXERCISES

2.10. Let X₁ and X₂ have the joint p.d.f. f(x₁, x₂) = x₁ + x₂, 0 < x₁ < 1, 0 < x₂ < 1, zero elsewhere. Find the conditional mean and variance of X₂, given X₁ = x₁, 0 < x₁ < 1.
2.11. Let f(x₁|x₂) = c₁x₁/x₂², 0 < x₁ < x₂, 0 < x₂ < 1, zero elsewhere, and f₂(x₂) = c₂x₂⁴, 0 < x₂ < 1, zero elsewhere, denote, respectively, the conditional p.d.f. of X₁, given X₂ = x₂, and the marginal p.d.f. of X₂. Determine: (a) the constants c₁ and c₂; (b) the joint p.d.f. of X₁ and X₂; (c) Pr (1/4 < X₁ < 1/2|X₂ = 5/8); and (d) Pr (1/4 < X₁ < 1/2).
2.12. Let f(x₁, x₂) = 21x₁²x₂³, 0 < x₁ < x₂ < 1, zero elsewhere, be the joint p.d.f. of X₁ and X₂. Find the conditional mean and variance of X₁, given X₂ = x₂, 0 < x₂ < 1.
2.13. If X₁ and X₂ are random variables of the discrete type having p.d.f. f(x₁, x₂) = (x₁ + 2x₂)/18, (x₁, x₂) = (1, 1), (1, 2), (2, 1), (2, 2), zero elsewhere, determine the conditional mean and variance of X₂, given X₁ = x₁, for x₁ = 1 or 2.
2.14. Five cards are drawn at random and without replacement from a bridge deck. Let the random variables X₁, X₂, and X₃ denote, respectively, the number of spades, the number of hearts, and the number of diamonds that appear among the five cards. (a) Determine the joint p.d.f. of X₁, X₂, and X₃. (b) Find the marginal probability density functions of X₁, X₂, and X₃. (c) What is the joint conditional p.d.f. of X₂ and X₃, given that X₁ = 3?
2.15. Let X₁ and X₂ have the joint p.d.f. f(x₁, x₂) described as follows:

(x₁, x₂)    (0, 0)  (0, 1)  (1, 0)  (1, 1)  (2, 0)  (2, 1)
f(x₁, x₂)    1/18    3/18    4/18    3/18    6/18    1/18

and f(x₁, x₂) is equal to zero elsewhere. Find the two marginal probability density functions and the two conditional means.
2.16. Let us choose at random a point from the interval (0, 1) and let the random variable X₁ be equal to the number which corresponds to that point. Then choose a point at random from the interval (0, x₁), where x₁ is the experimental value of X₁; and let the random variable X₂ be equal to the number which corresponds to this point. (a) Make assumptions about the marginal p.d.f. f₁(x₁) and the conditional p.d.f. f(x₂|x₁). (b) Compute Pr (X₁ + X₂ ≥ 1). (c) Find the conditional mean E(X₁|x₂).
2.17. Let f(x) and F(x) denote, respectively, the p.d.f. and the distribution function of the random variable X. The conditional p.d.f. of X, given X > x₀, x₀ a fixed number, is defined by f(x|X > x₀) = f(x)/[1 - F(x₀)], x₀ < x, zero elsewhere. This kind of conditional p.d.f. finds application in a problem of time until death, given survival until time x₀. (a) Show that f(x|X > x₀) is a p.d.f. (b) Let f(x) = e^(-x), 0 < x < ∞, zero elsewhere. Compute Pr (X > 2|X > 1).

2.3 The Correlation Coefficient

Let X, Y, and Z denote random variables that have joint p.d.f. f(x, y, z). If u(x, y, z) is a function of x, y, and z, then E[u(X, Y, Z)] was defined, subject to its existence, on p. 45. The existence of all mathematical expectations will be assumed in this discussion. The means of X, Y, and Z, say μ₁, μ₂, and μ₃, are obtained by taking u(x, y, z) to be x, y, and z, respectively; and the variances of X, Y, and Z, say σ₁², σ₂², and σ₃², are obtained by setting the function u(x, y, z) equal to (x - μ₁)², (y - μ₂)², and (z - μ₃)², respectively. Consider the mathematical expectation

E[(X - μ₁)(Y - μ₂)] = E(XY - μ₂X - μ₁Y + μ₁μ₂)
                    = E(XY) - μ₂E(X) - μ₁E(Y) + μ₁μ₂
                    = E(XY) - μ₁μ₂.

This number is called the covariance of X and Y. The covariance of X and Z is given by E[(X - μ₁)(Z - μ₃)], and the covariance of Y and Z is E[(Y - μ₂)(Z - μ₃)]. If each of σ₁ and σ₂ is positive, the number

ρ₁₂ = E[(X - μ₁)(Y - μ₂)]/(σ₁σ₂)

is called the correlation coefficient of X and Y. If the standard deviations are positive, the correlation coefficient of any two random variables is defined to be the covariance of the two random variables divided by the product of the standard deviations of the two random variables. It should be noted that the expected value of the product of two random variables is equal to the product of their expectations plus their covariance.

Example 1. Let the random variables X and Y have the joint p.d.f.

f(x, y) = x + y, 0 < x < 1, 0 < y < 1,
        = 0 elsewhere.

We shall compute the correlation coefficient of X and Y. When only two variables are under consideration, we shall denote the correlation coefficient by ρ. Now

μ₁ = E(X) = ∫_0^1 ∫_0^1 x(x + y) dx dy = 7/12
and

σ₁² = E(X²) - μ₁² = ∫_0^1 ∫_0^1 x²(x + y) dx dy - (7/12)² = 11/144.

Similarly,

μ₂ = E(Y) = 7/12 and σ₂² = 11/144.

The covariance of X and Y is

E(XY) - μ₁μ₂ = ∫_0^1 ∫_0^1 xy(x + y) dx dy - (7/12)² = -1/144.

Accordingly, the correlation coefficient of X and Y is

ρ = (-1/144)/√((11/144)(11/144)) = -1/11.

Remark. For certain kinds of distributions of two random variables, say X and Y, the correlation coefficient ρ proves to be a very useful characteristic of the distribution. Unfortunately, the formal definition of ρ does not reveal this fact. At this time we make some observations about ρ, some of which will be explored more fully at a later stage. It will soon be seen that if a joint distribution of two variables has a correlation coefficient (that is, if both of the variances are positive), then ρ satisfies -1 ≤ ρ ≤ 1. If ρ = 1, there is a line with equation y = a + bx, b > 0, the graph of which contains all of the probability for the distribution of X and Y. In this extreme case, we have Pr (Y = a + bX) = 1. If ρ = -1, we have the same state of affairs except that b < 0. This suggests the following interesting question: When ρ does not have one of its extreme values, is there a line in the xy-plane such that the probability for X and Y tends to be concentrated in a band about this line? Under certain restrictive conditions this is in fact the case, and under those conditions we can look upon ρ as a measure of the intensity of the concentration of the probability for X and Y about that line.
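The value ρ = -1/11 obtained in Example 1 can be verified symbolically; the sketch below (ours, in Python with SymPy) computes the means, variances, and covariance by exact integration.

```python
# Exact computation (ours) of the correlation coefficient in Example 1: f(x, y) = x + y on the unit square.
import sympy as sp

x, y = sp.symbols('x y')
f = x + y

def E(g):
    # E[g(X, Y)] for the joint p.d.f. f on 0 < x < 1, 0 < y < 1.
    return sp.integrate(sp.integrate(g * f, (x, 0, 1)), (y, 0, 1))

mu1, mu2 = E(x), E(y)
var1, var2 = E(x**2) - mu1**2, E(y**2) - mu2**2
cov = E(x * y) - mu1 * mu2
rho = cov / sp.sqrt(var1 * var2)
print(mu1, var1, cov, sp.simplify(rho))   # 7/12, 11/144, -1/144, -1/11
```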
Next, let f(x, y) denote the joint p.d.f. of two random variables X and Y and let f₁(x) denote the marginal p.d.f. of X. The conditional p.d.f. of Y, given X = x, is

f(y|x) = f(x, y)/f₁(x)

at points where f₁(x) > 0. Then the conditional mean of Y, given X = x, is given by

E(Y|x) = ∫_{-∞}^{∞} y f(y|x) dy = ∫_{-∞}^{∞} y f(x, y) dy / f₁(x),

when dealing with random variables of the continuous type. This conditional mean of Y, given X = x, is, of course, a function of x alone, say φ(x). In like vein, the conditional mean of X, given Y = y, is a function of y alone, say ψ(y).

In case φ(x) is a linear function of x, say φ(x) = a + bx, we say the conditional mean of Y is linear in x; or that Y has a linear conditional mean. When φ(x) = a + bx, the constants a and b have simple values which will now be determined.

It will be assumed that neither σ₁² nor σ₂², the variances of X and Y, is zero. From

E(Y|x) = ∫_{-∞}^{∞} y f(x, y) dy / f₁(x) = a + bx,

we have

(1) ∫_{-∞}^{∞} y f(x, y) dy = (a + bx)f₁(x).

If both members of Equation (1) are integrated on x, it is seen that

(2) E(Y) = a + bE(X),

or

μ₂ = a + bμ₁,

where μ₁ = E(X) and μ₂ = E(Y). If both members of Equation (1) are first multiplied by x and then integrated on x, we have

(3) E(XY) = aE(X) + bE(X²),

or

ρσ₁σ₂ + μ₁μ₂ = aμ₁ + b(σ₁² + μ₁²),

where ρσ₁σ₂ is the covariance of X and Y. The simultaneous solution of Equations (2) and (3) yields

a = μ₂ - ρ(σ₂/σ₁)μ₁ and b = ρ(σ₂/σ₁),

so that

φ(x) = E(Y|x) = μ₂ + ρ(σ₂/σ₁)(x - μ₁)

is the conditional mean of Y, given X = x, when the conditional mean of Y is linear in x. If the conditional mean of X, given Y = y, is linear in y, then that conditional mean is given by

ψ(y) = E(X|y) = μ₁ + ρ(σ₁/σ₂)(y - μ₂).

We shall next investigate the variance of a conditional distribution under the assumption that the conditional mean is linear. The conditional variance of Y is given by

(4) E{[Y - E(Y|x)]²|x} = ∫_{-∞}^{∞} [y - μ₂ - ρ(σ₂/σ₁)(x - μ₁)]² f(y|x) dy
                       = ∫_{-∞}^{∞} [(y - μ₂) - ρ(σ₂/σ₁)(x - μ₁)]² f(x, y) dy / f₁(x)

when the random variables are of the continuous type. This variance is nonnegative and is at most a function of x alone. If then, it is multiplied by f₁(x) and integrated on x, the result obtained will be nonnegative. This result is

∫_{-∞}^{∞} ∫_{-∞}^{∞} [(y - μ₂) - ρ(σ₂/σ₁)(x - μ₁)]² f(x, y) dy dx
   = ∫_{-∞}^{∞} ∫_{-∞}^{∞} [(y - μ₂)² - 2ρ(σ₂/σ₁)(y - μ₂)(x - μ₁) + ρ²(σ₂²/σ₁²)(x - μ₁)²] f(x, y) dy dx
   = E[(Y - μ₂)²] - 2ρ(σ₂/σ₁)E[(X - μ₁)(Y - μ₂)] + ρ²(σ₂²/σ₁²)E[(X - μ₁)²]
   = σ₂² - 2ρ(σ₂/σ₁)ρσ₁σ₂ + ρ²(σ₂²/σ₁²)σ₁²
   = σ₂² - 2ρ²σ₂² + ρ²σ₂² = σ₂²(1 - ρ²) ≥ 0.

That is, if the variance, Equation (4), is denoted by k(x), then E[k(X)] = σ₂²(1 - ρ²) ≥ 0. Accordingly, ρ² ≤ 1, or -1 ≤ ρ ≤ 1. It is left as an exercise to prove that -1 ≤ ρ ≤ 1 whether the conditional mean is or is not linear.

Suppose that the variance, Equation (4), is positive but not a function of x; that is, the variance is a constant k > 0. Now if k is multiplied by f₁(x) and integrated on x, the result is k, so that k = σ₂²(1 - ρ²). Thus, in this case, the variance of each conditional distribution of Y, given X = x, is σ₂²(1 - ρ²). If ρ = 0, the variance of each conditional distribution of Y, given X = x, is σ₂², the variance of the marginal distribution of Y. On the other hand, if ρ² is near one, the variance of each conditional distribution of Y, given X = x, is relatively small, and there is a high concentration of the probability for this conditional distribution near the mean E(Y|x) = μ₂ + ρ(σ₂/σ₁)(x - μ₁).

It should be pointed out that if the random variables X and Y in the preceding discussion are taken to be of the discrete type, the results just obtained are valid.
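The identity E[k(X)] = σ₂²(1 - ρ²) is easy to verify for a particular joint p.d.f. The sketch below (ours, in Python with SymPy) uses f(x, y) = 2, 0 < x < y < 1, a density for which the conditional mean of Y is linear in x; this choice of density is our own illustration, not one made in the text at this point.

```python
# Symbolic check (ours) of E[k(X)] = sigma2^2 * (1 - rho^2) for f(x, y) = 2, 0 < x < y < 1.
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = sp.Integer(2)                                   # joint p.d.f. on 0 < x < y < 1
f1 = sp.integrate(f, (y, x, 1))                     # marginal of X: 2(1 - x)
f2 = sp.integrate(f, (x, 0, y))                     # marginal of Y: 2y

mu1 = sp.integrate(x * f1, (x, 0, 1))
mu2 = sp.integrate(y * f2, (y, 0, 1))
var1 = sp.integrate(x**2 * f1, (x, 0, 1)) - mu1**2
var2 = sp.integrate(y**2 * f2, (y, 0, 1)) - mu2**2
cov = sp.integrate(sp.integrate(x * y * f, (y, x, 1)), (x, 0, 1)) - mu1 * mu2
rho = cov / sp.sqrt(var1 * var2)

cond_mean = sp.integrate(y * f / f1, (y, x, 1))     # E(Y|x) = (1 + x)/2, linear in x
k = sp.simplify(sp.integrate((y - cond_mean)**2 * f / f1, (y, x, 1)))  # conditional variance k(x)
lhs = sp.simplify(sp.integrate(k * f1, (x, 0, 1)))  # E[k(X)]
rhs = sp.simplify(var2 * (1 - rho**2))
print(rho, lhs, rhs)                                # rho = 1/2; both sides equal 1/24
```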
Example 2. Let the random variables X and Y have the linear conditional means E(Y|x) = 4x + 3 and E(X|y) = (1/16)y - 3. In accordance with the general formulas for the linear conditional means, we see that E(Y|x) = μ₂ if x = μ₁ and E(X|y) = μ₁ if y = μ₂. Accordingly, in this special case, we have μ₂ = 4μ₁ + 3 and μ₁ = (1/16)μ₂ - 3, so that μ₁ = -15/4 and μ₂ = -12. The general formulas for the linear conditional means also show that the product of the coefficients of x and y, respectively, is equal to ρ² and that the quotient of these coefficients is equal to σ₂²/σ₁². Here ρ² = 4(1/16) = 1/4 with ρ = 1/2 (not -1/2), and σ₂²/σ₁² = 64. Thus, from the two linear conditional means, we are able to find the values of μ₁, μ₂, ρ, and σ₂/σ₁, but not the values of σ₁ and σ₂.

This section will conclude with a definition and an illustrative example. Let f(x, y) denote the joint p.d.f. of the two random variables X and Y. If E(e^(t₁X + t₂Y)) exists for -h₁ < t₁ < h₁, -h₂ < t₂ < h₂, where h₁ and h₂ are positive, it is denoted by M(t₁, t₂) and is called the moment-generating function of the joint distribution of X and Y. As in the case of one random variable, the moment-generating function M(t₁, t₂) completely determines the joint distribution of X and Y, and hence the marginal distributions of X and Y. In fact,

M(t₁, 0) = E(e^(t₁X)) = M(t₁) and M(0, t₂) = E(e^(t₂Y)) = M(t₂).

In addition, in the case of random variables of the continuous type, the partial derivatives of M(t₁, t₂), evaluated at t₁ = t₂ = 0, yield the moments E(X^k Y^m). For instance, in a simplified notation which appears to be clear,

(5) μ₁ = E(X) = ∂M(0, 0)/∂t₁, μ₂ = E(Y) = ∂M(0, 0)/∂t₂,
    σ₁² = E(X²) - μ₁² = ∂²M(0, 0)/∂t₁² - μ₁², σ₂² = E(Y²) - μ₂² = ∂²M(0, 0)/∂t₂² - μ₂²,
    E[(X - μ₁)(Y - μ₂)] = ∂²M(0, 0)/∂t₁∂t₂ - μ₁μ₂.

It is fairly obvious that the results of Equations (5) hold if X and Y are random variables of the discrete type. Thus the correlation coefficients may be computed by using the moment-generating function of the joint distribution if that function is readily available. An illustrative example follows.

Example 3. Let the continuous-type random variables X and Y have the joint p.d.f.

f(x, y) = e^(-y), 0 < x < y < ∞,
        = 0 elsewhere.

The moment-generating function of this joint distribution is

M(t₁, t₂) = ∫_0^∞ ∫_x^∞ exp (t₁x + t₂y - y) dy dx = 1/[(1 - t₁ - t₂)(1 - t₂)],

provided t₁ + t₂ < 1 and t₂ < 1. For this distribution, Equations (5) become

(6) μ₁ = 1, μ₂ = 2, σ₁² = 1, σ₂² = 2, E[(X - μ₁)(Y - μ₂)] = 1.

Verification of the results of Equations (6) is left as an exercise. If, momentarily, we accept these results, the correlation coefficient of X and Y is ρ = 1/√2. Furthermore, the moment-generating functions of the marginal distributions of X and Y are, respectively,

M(t₁, 0) = 1/(1 - t₁), t₁ < 1,

and

M(0, t₂) = 1/(1 - t₂)², t₂ < 1.

These moment-generating functions are, of course, respectively, those of the marginal probability density functions,

f₁(x) = ∫_x^∞ e^(-y) dy = e^(-x), 0 < x < ∞,

zero elsewhere, and

f₂(y) = e^(-y) ∫_0^y dx = ye^(-y), 0 < y < ∞,

zero elsewhere.

EXERCISES

2.18. Let the random variables X and Y have the joint p.d.f.
(a) f(x, y) = 1/3, (x, y) = (0, 0), (1, 1), (2, 2), zero elsewhere.
(b) f(x, y) = 1/3, (x, y) = (0, 2), (1, 1), (2, 0), zero elsewhere.
(c) f(x, y) = 1/3, (x, y) = (0, 0), (1, 1), (2, 0), zero elsewhere.
In each case compute the correlation coefficient of X and Y.
2.19. Let X and Y have the joint p.d.f. described as follows:

(x, y)    (1, 1)  (1, 2)  (1, 3)  (2, 1)  (2, 2)  (2, 3)
f(x, y)    2/15    4/15    3/15    1/15    1/15    4/15

and f(x, y) is equal to zero elsewhere. Find the correlation coefficient ρ.
2.20. Let f(x, y) = 2, 0 < x < y, 0 < y < 1, zero elsewhere, be the joint p.d.f. of X and Y. Show that the conditional means are, respectively, (1 + x)/2, 0 < x < 1, and y/2, 0 < y < 1. Show that the correlation coefficient of X and Y is ρ = 1/2.
2.21. Show that the variance of the conditional distribution of Y, given X = x, in Exercise 2.20, is (1 - x)²/12, 0 < x < 1, and that the variance of the conditional distribution of X, given Y = y, is y²/12, 0 < y < 1.
2.22. Verify the results of Equations (6) of this section.
2.23. Let X and Y have the joint p.d.f. f(x, y) = 1, -x < y < x, 0 < x < 1, zero elsewhere. Show that, on the set of positive probability density, the graph of E(Y|x) is a straight line, whereas that of E(X|y) is not a straight line.
2.24. If the correlation coefficient ρ of X and Y exists, show that -1 ≤ ρ ≤ 1. Hint. Consider the discriminant of the nonnegative quadratic function h(v) = E{[(X - μ₁) + v(Y - μ₂)]²}, where v is real and is not a function of X nor of Y.
2.25. Let ψ(t₁, t₂) = ln M(t₁, t₂), where M(t₁, t₂) is the moment-generating function of X and Y. Show that

∂ψ(0, 0)/∂t₁, ∂ψ(0, 0)/∂t₂, ∂²ψ(0, 0)/∂t₁², ∂²ψ(0, 0)/∂t₂², and ∂²ψ(0, 0)/∂t₁∂t₂

yield the means, the variances, and the covariance of the two random variables.
2.26. Let X₁, X₂, and X₃ be three random variables with means, variances, and correlation coefficients, denoted by μ₁, μ₂, μ₃; σ₁², σ₂², σ₃²; and
ρ₁₂, ρ₁₃, ρ₂₃, respectively. If E(X₁ - μ₁|x₂, x₃) = b₂(x₂ - μ₂) + b₃(x₃ - μ₃), where b₂ and b₃ are constants, determine b₂ and b₃ in terms of the variances and the correlation coefficients.
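As a check on Equations (5) and (6), the sketch below (ours, in Python with SciPy) computes the moments of the joint p.d.f. of Example 3, f(x, y) = e^(-y), 0 < x < y < ∞, by direct numerical integration and compares the resulting correlation coefficient with 1/√2.

```python
# Numerical check (ours) of the moments quoted for f(x, y) = exp(-y), 0 < x < y < infinity.
import math
from scipy.integrate import dblquad

def moment(k, m):
    # E(X^k Y^m) = integral over 0 < x < y < infinity of x^k y^m exp(-y) dy dx.
    val, _ = dblquad(lambda y, x: x**k * y**m * math.exp(-y), 0, math.inf,
                     lambda x: x, lambda x: math.inf)
    return val

mu1, mu2 = moment(1, 0), moment(0, 1)
var1 = moment(2, 0) - mu1**2
var2 = moment(0, 2) - mu2**2
cov = moment(1, 1) - mu1 * mu2
print(mu1, mu2, var1, var2, cov)          # approximately 1, 2, 1, 2, 1
print(cov / math.sqrt(var1 * var2), 1 / math.sqrt(2))   # approximately 0.7071
```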
2.4 Stochastic Independence
Let X₁ and X₂ denote random variables of either the continuous or the discrete type which have the joint p.d.f. f(x₁, x₂) and marginal probability density functions f₁(x₁) and f₂(x₂), respectively. In accordance with the definition of the conditional p.d.f. f(x₂|x₁), we may write the joint p.d.f. f(x₁, x₂) as

f(x₁, x₂) = f(x₂|x₁)f₁(x₁).

Suppose we have an instance where f(x₂|x₁) does not depend upon x₁. Then the marginal p.d.f. of X₂ is, for random variables of the continuous type,

f₂(x₂) = ∫_{-∞}^{∞} f(x₂|x₁)f₁(x₁) dx₁ = f(x₂|x₁) ∫_{-∞}^{∞} f₁(x₁) dx₁ = f(x₂|x₁).

Accordingly,

f₂(x₂) = f(x₂|x₁) and f(x₁, x₂) = f₁(x₁)f₂(x₂)

when f(x₂|x₁) does not depend upon x₁. That is, if the conditional distribution of X₂, given X₁ = x₁, is independent of any assumption about x₁, then f(x₁, x₂) = f₁(x₁)f₂(x₂). These considerations motivate the following definition.

Definition 1. Let the random variables X₁ and X₂ have the joint p.d.f. f(x₁, x₂) and the marginal probability density functions f₁(x₁) and f₂(x₂), respectively. The random variables X₁ and X₂ are said to be stochastically independent if, and only if, f(x₁, x₂) ≡ f₁(x₁)f₂(x₂). Random variables that are not stochastically independent are said to be stochastically dependent.

Remarks. Two comments should be made about the preceding definition. First, the product of two nonnegative functions f₁(x₁)f₂(x₂) means a function that is positive on a product space. That is, if f₁(x₁) and f₂(x₂) are positive on, and only on, the respective spaces 𝒜₁ and 𝒜₂, then the product of f₁(x₁) and f₂(x₂) is positive on, and only on, the product space 𝒜 = {(x₁, x₂); x₁ ∈ 𝒜₁, x₂ ∈ 𝒜₂}. For instance, if 𝒜₁ = {x₁; 0 < x₁ < 1} and 𝒜₂ = {x₂; 0 < x₂ < 3}, then 𝒜 = {(x₁, x₂); 0 < x₁ < 1, 0 < x₂ < 3}. The second remark pertains to the identity. The identity in Definition 1 should be interpreted as follows. There may be certain points (x₁, x₂) ∈ 𝒜 at which f(x₁, x₂) ≠ f₁(x₁)f₂(x₂). However, if A is the set of points (x₁, x₂) at which the equality does not hold, then P(A) = 0. In the subsequent theorems and the subsequent generalizations, a product of nonnegative functions and an identity should be interpreted in an analogous manner.

Example 1. Let the joint p.d.f. of X₁ and X₂ be

f(x₁, x₂) = x₁ + x₂, 0 < x₁ < 1, 0 < x₂ < 1,
          = 0 elsewhere.

It will be shown that X₁ and X₂ are stochastically dependent. Here the marginal probability density functions are

f₁(x₁) = ∫_{-∞}^{∞} f(x₁, x₂) dx₂ = ∫_0^1 (x₁ + x₂) dx₂ = x₁ + 1/2, 0 < x₁ < 1,
       = 0 elsewhere,

and

f₂(x₂) = ∫_{-∞}^{∞} f(x₁, x₂) dx₁ = ∫_0^1 (x₁ + x₂) dx₁ = 1/2 + x₂, 0 < x₂ < 1,
       = 0 elsewhere.

Since f(x₁, x₂) ≢ f₁(x₁)f₂(x₂), the random variables X₁ and X₂ are stochastically dependent.

The following theorem makes it possible to assert, without computing the marginal probability density functions, that the random variables X₁ and X₂ of Example 1 are stochastically dependent.

Theorem 1. Let the random variables X₁ and X₂ have the joint p.d.f. f(x₁, x₂). Then X₁ and X₂ are stochastically independent if and only if f(x₁, x₂) can be written as a product of a nonnegative function of x₁ alone and a nonnegative function of x₂ alone. That is,

f(x₁, x₂) ≡ g(x₁)h(x₂),

where g(x₁) > 0, x₁ ∈ 𝒜₁, zero elsewhere, and h(x₂) > 0, x₂ ∈ 𝒜₂, zero elsewhere.

Proof. If X₁ and X₂ are stochastically independent, then f(x₁, x₂) ≡ f₁(x₁)f₂(x₂), where f₁(x₁) and f₂(x₂) are the marginal probability density functions of X₁ and X₂, respectively. Thus, the condition f(x₁, x₂) ≡ g(x₁)h(x₂) is fulfilled.
Conversely, if f(x₁, x₂) ≡ g(x₁)h(x₂), then, for random variables of the continuous type, we have

f₁(x₁) = ∫_{-∞}^{∞} g(x₁)h(x₂) dx₂ = g(x₁) ∫_{-∞}^{∞} h(x₂) dx₂ = c₁g(x₁)

and

f₂(x₂) = ∫_{-∞}^{∞} g(x₁)h(x₂) dx₁ = h(x₂) ∫_{-∞}^{∞} g(x₁) dx₁ = c₂h(x₂),

where c₁ and c₂ are constants, not functions of x₁ or x₂. Moreover, c₁c₂ = 1 because

1 = ∫_{-∞}^{∞} ∫_{-∞}^{∞} g(x₁)h(x₂) dx₁ dx₂ = [∫_{-∞}^{∞} g(x₁) dx₁][∫_{-∞}^{∞} h(x₂) dx₂] = c₂c₁.

These results imply that

f(x₁, x₂) ≡ g(x₁)h(x₂) ≡ c₁g(x₁)c₂h(x₂) ≡ f₁(x₁)f₂(x₂).

Accordingly, X₁ and X₂ are stochastically independent.
If we now refer to Example 1, we see that the joint p.d.f.

f(x₁, x₂) = x₁ + x₂, 0 < x₁ < 1, 0 < x₂ < 1,
          = 0 elsewhere,

cannot be written as the product of a nonnegative function of x₁ alone and a nonnegative function of x₂ alone. Accordingly, X₁ and X₂ are stochastically dependent.

Example 2. Let the p.d.f. of the random variables X₁ and X₂ be f(x₁, x₂) = 8x₁x₂, 0 < x₁ < x₂ < 1, zero elsewhere. The formula 8x₁x₂ might suggest to some that X₁ and X₂ are stochastically independent. However, if we consider the space 𝒜 = {(x₁, x₂); 0 < x₁ < x₂ < 1}, we see that it is not a product space. This should make it clear that, in general, X₁ and X₂ must be stochastically dependent if the space of positive probability density of X₁ and X₂ is bounded by a curve that is neither a horizontal nor a vertical line.
We now give a theorem that frequently simplifies the calculations
of probabilities of events which involve stochastically independent
variables.
Theorem 2. If X₁ and X₂ are stochastically independent random variables with marginal probability density functions f₁(x₁) and f₂(x₂), respectively, then

Pr (a < X₁ < b, c < X₂ < d) = Pr (a < X₁ < b) Pr (c < X₂ < d)

for every a < b and c < d, where a, b, c, and d are constants.
Proof. From the stochastic independence of X₁ and X₂, the joint p.d.f. of X₁ and X₂ is f₁(x₁)f₂(x₂). Accordingly, in the continuous case,

Pr (a < X₁ < b, c < X₂ < d) = ∫_a^b ∫_c^d f₁(x₁)f₂(x₂) dx₂ dx₁
                            = [∫_a^b f₁(x₁) dx₁][∫_c^d f₂(x₂) dx₂]
                            = Pr (a < X₁ < b) Pr (c < X₂ < d);

or, in the discrete case,

Pr (a < X₁ < b, c < X₂ < d) = Σ_{a<x₁<b} Σ_{c<x₂<d} f₁(x₁)f₂(x₂)
                            = [Σ_{a<x₁<b} f₁(x₁)][Σ_{c<x₂<d} f₂(x₂)]
                            = Pr (a < X₁ < b) Pr (c < X₂ < d),

as was to be shown.
Example 3. In Example 1, X₁ and X₂ were found to be stochastically dependent. There, in general,

Pr (a < X₁ < b, c < X₂ < d) ≠ Pr (a < X₁ < b) Pr (c < X₂ < d).

For instance,

Pr (0 < X₁ < 1/2, 0 < X₂ < 1/2) = ∫_0^{1/2} ∫_0^{1/2} (x₁ + x₂) dx₁ dx₂ = 1/8,

whereas

Pr (0 < X₁ < 1/2) = ∫_0^{1/2} (x₁ + 1/2) dx₁ = 3/8

and

Pr (0 < X₂ < 1/2) = ∫_0^{1/2} (1/2 + x₂) dx₂ = 3/8.
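The three probabilities in Example 3 can be verified numerically; the following sketch (ours, in Python with SciPy) shows that the joint probability differs from the product of the marginal probabilities, as stochastic dependence requires.

```python
# Check (ours) that Pr(0 < X1 < 1/2, 0 < X2 < 1/2) differs from the product of the marginals
# for the dependent joint p.d.f. f(x1, x2) = x1 + x2 on the unit square.
from scipy.integrate import dblquad, quad

joint, _ = dblquad(lambda x2, x1: x1 + x2, 0, 0.5, lambda x1: 0.0, lambda x1: 0.5)
p1, _ = quad(lambda x1: x1 + 0.5, 0, 0.5)     # uses marginal f1(x1) = x1 + 1/2
p2, _ = quad(lambda x2: 0.5 + x2, 0, 0.5)     # uses marginal f2(x2) = 1/2 + x2
print(joint)          # 1/8
print(p1 * p2)        # (3/8)(3/8) = 9/64
```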
Not merely are calculations of some probabilities usually simpler
when we have stochastically independent random variables, but many
mathematical expectations, including certain moment-generating
functions, have comparably simpler computations. The following result
will prove so useful that we state it in the form of a theorem.
Theorem 3. Let the stochastically independent random variables X₁ and X₂ have the marginal probability density functions f₁(x₁) and f₂(x₂), respectively. The expected value of the product of a function u(X₁) of X₁ alone and a function v(X₂) of X₂ alone is, subject to their existence, equal to the product of the expected value of u(X₁) and the expected value of v(X₂); that is,

E[u(X₁)v(X₂)] = E[u(X₁)]E[v(X₂)].
Proof. The stochastic independence of X₁ and X₂ implies that the joint p.d.f. of X₁ and X₂ is f₁(x₁)f₂(x₂). Thus we have, by definition of mathematical expectation, in the continuous case,

E[u(X₁)v(X₂)] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} u(x₁)v(x₂)f₁(x₁)f₂(x₂) dx₁ dx₂
             = [∫_{-∞}^{∞} u(x₁)f₁(x₁) dx₁][∫_{-∞}^{∞} v(x₂)f₂(x₂) dx₂]
             = E[u(X₁)]E[v(X₂)];

or, in the discrete case,

E[u(X₁)v(X₂)] = Σ_{x₂} Σ_{x₁} u(x₁)v(x₂)f₁(x₁)f₂(x₂)
             = [Σ_{x₁} u(x₁)f₁(x₁)][Σ_{x₂} v(x₂)f₂(x₂)]
             = E[u(X₁)]E[v(X₂)],

as stated in the theorem.
Example 4. Let X and Y be two stochastically independent random variables with means μ₁ and μ₂ and positive variances σ₁² and σ₂², respectively. We shall show that the stochastic independence of X and Y implies that the correlation coefficient of X and Y is zero. This is true because the covariance of X and Y is equal to

E[(X - μ₁)(Y - μ₂)] = E(X - μ₁)E(Y - μ₂) = 0.
We shall now prove a very useful theorem about stochastically
independent random variables. The proof of the theorem relies heavily
upon our assertion that a moment-generating function, when it exists,
is unique and that it uniquely determines the distribution of probability.
Theorem 4. Let X₁ and X₂ denote random variables that have the joint p.d.f. f(x₁, x₂) and the marginal probability density functions f₁(x₁) and f₂(x₂), respectively. Furthermore, let M(t₁, t₂) denote the moment-generating function of the distribution. Then X₁ and X₂ are stochastically independent if and only if M(t₁, t₂) = M(t₁, 0)M(0, t₂).
Proof. If X₁ and X₂ are stochastically independent, then

M(t₁, t₂) = E(e^(t₁X₁ + t₂X₂)) = E(e^(t₁X₁)e^(t₂X₂)) = E(e^(t₁X₁))E(e^(t₂X₂)) = M(t₁, 0)M(0, t₂).

Thus the stochastic independence of X₁ and X₂ implies that the moment-generating function of the joint distribution factors into the product of the moment-generating functions of the two marginal distributions.

Suppose next that the moment-generating function of the joint distribution of X₁ and X₂ is given by M(t₁, t₂) = M(t₁, 0)M(0, t₂). Now X₁ has the unique moment-generating function which, in the continuous case, is given by

M(t₁, 0) = ∫_{-∞}^{∞} e^(t₁x₁) f₁(x₁) dx₁.

Similarly, the unique moment-generating function of X₂, in the continuous case, is given by

M(0, t₂) = ∫_{-∞}^{∞} e^(t₂x₂) f₂(x₂) dx₂.

Thus we have

M(t₁, 0)M(0, t₂) = [∫_{-∞}^{∞} e^(t₁x₁) f₁(x₁) dx₁][∫_{-∞}^{∞} e^(t₂x₂) f₂(x₂) dx₂]
                 = ∫_{-∞}^{∞} ∫_{-∞}^{∞} e^(t₁x₁ + t₂x₂) f₁(x₁)f₂(x₂) dx₁ dx₂.

We are given that M(t₁, t₂) = M(t₁, 0)M(0, t₂); so

M(t₁, t₂) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} e^(t₁x₁ + t₂x₂) f₁(x₁)f₂(x₂) dx₁ dx₂.

But M(t₁, t₂) is the moment-generating function of X₁ and X₂. Thus also

M(t₁, t₂) = ∫_{-∞}^{∞} ∫_{-∞}^{∞} e^(t₁x₁ + t₂x₂) f(x₁, x₂) dx₁ dx₂.

The uniqueness of the moment-generating function implies that the two distributions of probability that are described by f₁(x₁)f₂(x₂) and f(x₁, x₂) are the same. Thus

f(x₁, x₂) ≡ f₁(x₁)f₂(x₂).

That is, if M(t₁, t₂) = M(t₁, 0)M(0, t₂), then X₁ and X₂ are stochastically independent. This completes the proof when the random variables are of the continuous type. With random variables of the discrete type, the proof is made by using summation instead of integration.
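Theorem 4 can be illustrated numerically by evaluating M(t₁, t₂) and M(t₁, 0)M(0, t₂) at a particular point. The sketch below (ours, in Python with SciPy) does this for an independent pair with joint p.d.f. 4x₁x₂ on the unit square and for the dependent pair of Example 1; both densities are used here only as illustrations of ours.

```python
# Evaluate M(t1, t2) and M(t1, 0)*M(0, t2) at (t1, t2) = (1, 1) for two joint p.d.f.s (ours).
import math
from scipy.integrate import dblquad

def M(f, t1, t2):
    val, _ = dblquad(lambda x2, x1: math.exp(t1 * x1 + t2 * x2) * f(x1, x2),
                     0, 1, lambda x1: 0.0, lambda x1: 1.0)
    return val

indep = lambda x1, x2: 4 * x1 * x2        # equals (2*x1)(2*x2): X1 and X2 independent
dep = lambda x1, x2: x1 + x2              # Example 1: X1 and X2 dependent

for name, f in (("independent", indep), ("dependent", dep)):
    lhs = M(f, 1, 1)
    rhs = M(f, 1, 0) * M(f, 0, 1)
    print(name, round(lhs, 6), round(rhs, 6))   # equal only in the independent case
```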
Let the random variables X₁, X₂, ..., Xₙ have the joint p.d.f. f(x₁, x₂, ..., xₙ) and the marginal probability density functions f₁(x₁), f₂(x₂), ..., fₙ(xₙ), respectively. The definition of the stochastic independence of X₁ and X₂ is generalized to the mutual stochastic independence of X₁, X₂, ..., Xₙ as follows: The random variables
X₁, X₂, ..., Xₙ are said to be mutually stochastically independent if and only if f(x₁, x₂, ..., xₙ) ≡ f₁(x₁)f₂(x₂)···fₙ(xₙ). It follows immediately from this definition of the mutual stochastic independence of X₁, X₂, ..., Xₙ that

Pr (a₁ < X₁ < b₁, a₂ < X₂ < b₂, ..., aₙ < Xₙ < bₙ)
   = Pr (a₁ < X₁ < b₁) Pr (a₂ < X₂ < b₂)···Pr (aₙ < Xₙ < bₙ)
   = ∏_{i=1}^{n} Pr (aᵢ < Xᵢ < bᵢ),

where the symbol ∏_{i=1}^{n} φ(i) is defined to be

∏_{i=1}^{n} φ(i) = φ(1)φ(2)···φ(n).

The theorem that E[u(X₁)v(X₂)] = E[u(X₁)]E[v(X₂)] for stochastically independent random variables X₁ and X₂ becomes, for mutually stochastically independent random variables X₁, X₂, ..., Xₙ,

E[u₁(X₁)u₂(X₂)···uₙ(Xₙ)] = E[u₁(X₁)]E[u₂(X₂)]···E[uₙ(Xₙ)].

Remark. If X₁, X₂, and X₃ are mutually stochastically independent, they are pairwise stochastically independent (that is, Xᵢ and Xⱼ, i ≠ j, where i, j = 1, 2, 3, are stochastically independent). However, the following example, due to S. Bernstein, shows that pairwise independence does not necessarily imply mutual independence. Let X₁, X₂, and X₃ have the joint p.d.f.

f(x₁, x₂, x₃) = 1/4, (x₁, x₂, x₃) ∈ {(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)},
             = 0 elsewhere.

The joint p.d.f. of Xᵢ and Xⱼ, i ≠ j, is

f_ij(xᵢ, xⱼ) = 1/4, (xᵢ, xⱼ) ∈ {(0, 0), (1, 0), (0, 1), (1, 1)},
            = 0 elsewhere,

whereas the marginal p.d.f. of Xᵢ is

fᵢ(xᵢ) = 1/2, xᵢ = 0, 1,
       = 0 elsewhere.

Obviously, if i ≠ j, we have

f_ij(xᵢ, xⱼ) ≡ fᵢ(xᵢ)fⱼ(xⱼ),

and thus Xᵢ and Xⱼ are stochastically independent. However,

f(x₁, x₂, x₃) ≢ f₁(x₁)f₂(x₂)f₃(x₃).

Thus X₁, X₂, and X₃ are not mutually stochastically independent.

Example 5. Let X₁, X₂, and X₃ be three mutually stochastically independent random variables and let each have the p.d.f. f(x) = 2x, 0 < x < 1, zero elsewhere. The joint p.d.f. of X₁, X₂, X₃ is f(x₁)f(x₂)f(x₃) = 8x₁x₂x₃, 0 < xᵢ < 1, i = 1, 2, 3, zero elsewhere. Let Y be the maximum of X₁, X₂, and X₃. Then, for instance, we have

Pr (Y ≤ 1/2) = Pr (X₁ ≤ 1/2, X₂ ≤ 1/2, X₃ ≤ 1/2)
            = ∫_0^{1/2} ∫_0^{1/2} ∫_0^{1/2} 8x₁x₂x₃ dx₁ dx₂ dx₃
            = (1/2)⁶ = 1/64.

In a similar manner, we find that the distribution function of Y is

G(y) = Pr (Y ≤ y) = 0, y < 0,
                  = y⁶, 0 ≤ y < 1,
                  = 1, 1 ≤ y.

Accordingly, the p.d.f. of Y is

g(y) = 6y⁵, 0 < y < 1,
     = 0 elsewhere.

The moment-generating function of the joint distribution of n random variables X₁, X₂, ..., Xₙ is defined as follows. Let

E[exp (t₁X₁ + t₂X₂ + ··· + tₙXₙ)]

exist for -hᵢ < tᵢ < hᵢ, i = 1, 2, ..., n, where each hᵢ is positive. This expectation is denoted by M(t₁, t₂, ..., tₙ) and it is called the moment-generating function of the joint distribution of X₁, ..., Xₙ (or simply the moment-generating function of X₁, ..., Xₙ). As in the cases of one and two variables, this moment-generating function is unique and uniquely determines the joint distribution of the n variables (and hence all marginal distributions). For example, the moment-generating function of the marginal distribution of Xᵢ is M(0, ..., 0, tᵢ, 0, ..., 0), i = 1, 2, ..., n; that of the marginal distribution of Xᵢ and Xⱼ is M(0, ..., 0, tᵢ, 0, ..., 0, tⱼ, 0, ..., 0); and so on. Theorem 4 of this chapter can be generalized, and the factorization

M(t₁, t₂, ..., tₙ) = ∏_{i=1}^{n} M(0, ..., 0, tᵢ, 0, ..., 0)

is a necessary and sufficient condition for the mutual stochastic independence of X₁, X₂, ..., Xₙ.
Example 6. Let a fair coin be tossed at random on successive independent
trials. Let the random variable X_i = 1 or X_i = 0 according to whether the
outcome on the ith toss is a head or a tail, i = 1, 2, 3, .... Let the p.d.f. of
each X_i be f(x) = 1/2, x = 0, 1, zero elsewhere. Since the trials are independent,
we say that the random variables X_1, X_2, X_3, ... are mutually stochastically
independent. Then, for example, the probability that the first head appears
on the third trial is

    Pr(X_1 = 0, X_2 = 0, X_3 = 1)
        = Pr(X_1 = 0) Pr(X_2 = 0) Pr(X_3 = 1) = (1/2)^3 = 1/8.

In general, if Y is the number of the trial on which the first head appears,
then the p.d.f. of Y is

    g(y) = (1/2)^y,  y = 1, 2, 3, ...,
         = 0 elsewhere.

In particular, Pr(Y = 3) = g(3) = 1/8.

EXERCISES

2.27. Show that the random variables X_1 and X_2 with joint p.d.f.
f(x_1, x_2) = 12x_1x_2(1 − x_2), 0 < x_1 < 1, 0 < x_2 < 1, zero elsewhere, are
stochastically independent.

2.28. If the random variables X_1 and X_2 have the joint p.d.f. f(x_1, x_2) =
2e^{−x_1−x_2}, 0 < x_1 < x_2, 0 < x_2 < ∞, zero elsewhere, show that X_1 and X_2
are stochastically dependent.

2.29. Let f(x_1, x_2) = 1/16, x_1 = 1, 2, 3, 4, and x_2 = 1, 2, 3, 4, zero elsewhere,
be the joint p.d.f. of X_1 and X_2. Show that X_1 and X_2 are stochastically
independent.

2.30. Find Pr(0 < X_1 < 1/2, 0 < X_2 < 1/2) if the random variables X_1 and
X_2 have the joint p.d.f. f(x_1, x_2) = 4x_1(1 − x_2), 0 < x_1 < 1, 0 < x_2 < 1,
zero elsewhere.

2.31. Find the probability of the union of the events a < X_1 < b,
−∞ < X_2 < ∞ and −∞ < X_1 < ∞, c < X_2 < d if X_1 and X_2 are two
stochastically independent variables with Pr(a < X_1 < b) = 2/3 and
Pr(c < X_2 < d) = 5/8.

2.32. If f(x_1, x_2) = e^{−x_1−x_2}, 0 < x_1 < ∞, 0 < x_2 < ∞, zero elsewhere,
is the joint p.d.f. of the random variables X_1 and X_2, show that X_1 and X_2
are stochastically independent and that

    E(e^{t(X_1+X_2)}) = (1 − t)^{−2},  t < 1.

2.33. Let X_1, X_2, X_3, and X_4 be four mutually stochastically independent
random variables, each with p.d.f. f(x) = 3(1 − x)^2, 0 < x < 1, zero
elsewhere. If Y is the minimum of these four variables, find the distribution
function and the p.d.f. of Y.

2.34. A fair die is cast at random three independent times. Let the
random variable X_i be equal to the number of spots which appear on the
ith trial, i = 1, 2, 3. Let the random variable Y be equal to max(X_i). Find
the distribution function and the p.d.f. of Y. Hint. Pr(Y ≤ y) = Pr(X_i ≤ y,
i = 1, 2, 3).

2.35. Suppose a man leaves for work between 8:00 A.M. and 8:30 A.M. and
takes between 40 and 50 minutes to get to the office. Let X denote the time
of departure and let Y denote the time of travel. If we assume that these
random variables are stochastically independent and uniformly distributed,
find the probability that he arrives at the office before 9:00 A.M.

2.36. Let M(t_1, t_2, t_3) be the moment-generating function of the random
variables X_1, X_2, and X_3 of Bernstein's example, described in the final
remark of this section. Show that M(t_1, t_2, 0) = M(t_1, 0, 0)M(0, t_2, 0),
M(t_1, 0, t_3) = M(t_1, 0, 0)M(0, 0, t_3), M(0, t_2, t_3) = M(0, t_2, 0)M(0, 0, t_3), but
M(t_1, t_2, t_3) ≠ M(t_1, 0, 0)M(0, t_2, 0)M(0, 0, t_3). Thus X_1, X_2, X_3 are pairwise
stochastically independent but not mutually stochastically independent.

2.37. Generalize Theorem 1 of this chapter to the case of n mutually
stochastically independent random variables.

2.38. Generalize Theorem 4 of this chapter to the case of n mutually
stochastically independent random variables.
Chapter 3
Some Special Distributions
3.1 The Binomial, Trinomial, and Multinomial Distributions
In Chapter 1 we introduced the uniform distribution and the hypergeometric
distribution. In this chapter we shall discuss some other
important distributions of random variables frequently used in
statistics. We begin with the binomial distribution.
Recall, if n is a positive integer, that

    (a + b)^n = \sum_{x=0}^{n} \binom{n}{x} b^x a^{n−x}.

Consider the function defined by

    f(x) = \binom{n}{x} p^x (1 − p)^{n−x},  x = 0, 1, 2, ..., n,
         = 0 elsewhere,

where n is a positive integer and 0 < p < 1. Under these conditions it
is clear that f(x) ≥ 0 and that

    \sum_x f(x) = \sum_{x=0}^{n} \binom{n}{x} p^x (1 − p)^{n−x} = [(1 − p) + p]^n = 1.

That is, f(x) satisfies the conditions of being a p.d.f. of a random
variable X of the discrete type. A random variable X that has a p.d.f. of
the form of f(x) is said to have a binomial distribution, and any such
f(x) is called a binomial p.d.f. A binomial distribution will be denoted
by the symbol b(n, p). The constants n and p are called the parameters
of the binomial distribution. Thus, if we say that X is b(5, 1/3), we mean
that X has the binomial p.d.f.

    f(x) = \binom{5}{x} (1/3)^x (2/3)^{5−x},  x = 0, 1, ..., 5,
         = 0 elsewhere.
Remark. The binomial distribution serves as an excellent mathematical
model in a number of experimental situations. Consider a random experiment,
the outcome of which can be classified in but one of two mutually exclusive
and exhaustive ways, say, success or failure (for example, head or tail, life
or death, effective or noneffective, etc.). Let the random experiment be
repeated n independent times. Assume further that the probability of success,
say p, is the same on each repetition; thus the probability of failure on each
repetition is 1 − p. Define the random variable X_i, i = 1, 2, ..., n, to be
zero, if the outcome of the ith performance is a failure, and to be 1 if that
outcome is a success. We then have Pr(X_i = 0) = 1 − p and Pr(X_i = 1)
= p, i = 1, 2, ..., n. Since it has been assumed that the experiment is to be
repeated n independent times, the random variables X_1, X_2, ..., X_n are
mutually stochastically independent. According to the definition of X_i, the
sum Y = X_1 + X_2 + ⋯ + X_n is the number of successes throughout the
n repetitions of the random experiment. The following argument shows that
Y has a binomial distribution. Let y be an element of {y; y = 0, 1, 2, ..., n}.
Then Y = y if and only if exactly y of the variables X_1, X_2, ..., X_n have the
value 1, and each of the remaining n − y variables is equal to zero. There
are \binom{n}{y} ways in which exactly y ones can be assigned to y of the variables
X_1, X_2, ..., X_n. Since X_1, X_2, ..., X_n are mutually stochastically independent,
the probability of each of these ways is p^y(1 − p)^{n−y}. Now Pr(Y = y)
is the sum of the probabilities of these \binom{n}{y} mutually exclusive events; that is,

    Pr(Y = y) = \binom{n}{y} p^y (1 − p)^{n−y},  y = 0, 1, 2, ..., n,

zero elsewhere. This is the p.d.f. of a binomial distribution.
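A quick numerical check of this argument can be made by simulation. The following Python sketch is an illustration added here (not part of the text; the parameter values are chosen arbitrarily): it simulates the sum Y of n Bernoulli trials and compares the empirical frequencies with the binomial p.d.f.

    import math, random

    # Simulate Y = X1 + ... + Xn for independent Bernoulli trials and compare
    # the observed relative frequencies with the binomial p.d.f. b(n, p).
    n, p, trials = 7, 0.5, 100_000
    random.seed(1)

    counts = [0] * (n + 1)
    for _ in range(trials):
        y = sum(1 for _ in range(n) if random.random() < p)  # number of successes
        counts[y] += 1

    for y in range(n + 1):
        pdf = math.comb(n, y) * p**y * (1 - p)**(n - y)
        print(y, round(counts[y] / trials, 4), round(pdf, 4))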
The moment-generating function of a binomial distribution is easily
found. It is

    M(t) = \sum_x e^{tx} f(x) = \sum_{x=0}^{n} e^{tx}\binom{n}{x} p^x (1 − p)^{n−x}
         = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1 − p)^{n−x}
         = [(1 − p) + pe^t]^n
for all real values of t. The mean μ and the variance σ^2 of X may be
computed from M(t). Since

    M'(t) = n[(1 − p) + pe^t]^{n−1}(pe^t)

and

    M''(t) = n[(1 − p) + pe^t]^{n−1}(pe^t) + n(n − 1)[(1 − p) + pe^t]^{n−2}(pe^t)^2,

it follows that

    μ = M'(0) = np
and

    σ^2 = M''(0) − μ^2 = np + n(n − 1)p^2 − (np)^2 = np(1 − p).

Example 1. The binomial distribution with p.d.f.

    f(x) = \binom{7}{x} (1/2)^x (1 − 1/2)^{7−x},  x = 0, 1, 2, ..., 7,
         = 0 elsewhere,

has the moment-generating function

    M(t) = (1/2 + (1/2)e^t)^7,

has mean μ = np = 7/2, and has variance σ^2 = np(1 − p) = 7/4. Furthermore,
if X is the random variable with this distribution, we have

    Pr(0 ≤ X ≤ 1) = \sum_{x=0}^{1} f(x) = 1/128 + 7/128 = 8/128

and

    Pr(X = 5) = f(5) = \frac{7!}{5!\,2!}(1/2)^5(1/2)^2 = 21/128.

Example 2. If the moment-generating function of a random variable X is

    M(t) = (2/3 + (1/3)e^t)^5,

then X has a binomial distribution with n = 5 and p = 1/3; that is, the p.d.f.
of X is

    f(x) = \binom{5}{x} (1/3)^x (2/3)^{5−x},  x = 0, 1, 2, ..., 5,
         = 0 elsewhere.

Example 3. If Y is b(n, 1/3), then Pr(Y ≥ 1) = 1 − Pr(Y = 0) = 1 − (2/3)^n.
Suppose we wish to find the smallest value of n that yields Pr(Y ≥ 1) > 0.80.
We have 1 − (2/3)^n > 0.80 and 0.20 > (2/3)^n. Either by inspection or by use of
logarithms, we see that n = 4 is the solution. That is, the probability of at
least one success throughout n = 4 independent repetitions of a random
experiment with probability of success p = 1/3 is greater than 0.80.

Example 4. Let the random variable Y be equal to the number of
successes throughout n independent repetitions of a random experiment with
probability p of success. That is, Y is b(n, p). The ratio Y/n is called the
relative frequency of success. For every ε > 0, we have

    Pr(|Y/n − p| ≥ ε) = Pr(|Y − np| ≥ εn)
                      = Pr\left(|Y − μ| ≥ ε\sqrt{\frac{n}{p(1 − p)}}\,σ\right),

where μ = np and σ^2 = np(1 − p). In accordance with Chebyshev's inequality
with k = ε\sqrt{n/[p(1 − p)]}, we have

    Pr\left(|Y − μ| ≥ ε\sqrt{\frac{n}{p(1 − p)}}\,σ\right) ≤ \frac{p(1 − p)}{nε^2}

and hence

    Pr(|Y/n − p| ≥ ε) ≤ \frac{p(1 − p)}{nε^2}.

Now, for every fixed ε > 0, the right-hand member of the preceding inequality
is close to zero for sufficiently large n. That is,

    \lim_{n→∞} Pr(|Y/n − p| ≥ ε) = 0

and hence

    \lim_{n→∞} Pr(|Y/n − p| < ε) = 1.

Since this is true for every fixed ε > 0, we see, in a certain sense, that the
relative frequency of success is, for large values of n, close to the probability
p of success. This result is one form of the law of large numbers. It was
alluded to in the initial discussion of probability in Chapter 1 and will be
considered again, along with related concepts, in Chapter 5.
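The Chebyshev bound of Example 4 can be examined numerically. The Python sketch below is an added illustration (not part of the text; the values of p, ε, and n are chosen for the example): it compares the exact probability Pr(|Y/n − p| ≥ ε) with the bound p(1 − p)/(nε^2).

    import math

    def binom_pmf(n, y, p):
        # Evaluate the binomial p.d.f. in log space to avoid underflow for large n.
        logpmf = (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
                  + y * math.log(p) + (n - y) * math.log(1 - p))
        return math.exp(logpmf)

    p, eps = 0.5, 0.05
    for n in [100, 400, 1600]:
        exact = sum(binom_pmf(n, y, p) for y in range(n + 1) if abs(y / n - p) >= eps)
        bound = p * (1 - p) / (n * eps ** 2)
        print(n, round(exact, 4), round(bound, 4))   # both columns shrink as n grows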
Example 5. Let the mutually stochastically independent random variables
X_1, X_2, X_3 have the same distribution function F(x). Let Y be the
middle value of X_1, X_2, X_3. To determine the distribution function of Y, say
G(y) = Pr(Y ≤ y), we note that Y ≤ y if and only if at least two of the
random variables X_1, X_2, X_3 are less than or equal to y. Let us say that
the ith "trial" is a success if X_i ≤ y, i = 1, 2, 3; here each "trial" has the
probability of success F(y). In this terminology, G(y) = Pr(Y ≤ y) is then
the probability of at least two successes in three independent trials. Thus

    G(y) = \binom{3}{2}[F(y)]^2[1 − F(y)] + [F(y)]^3.

If F(x) is a continuous type of distribution function so that the p.d.f. of X
is F'(x) = f(x), then the p.d.f. of Y is

    g(y) = G'(y) = 6[F(y)][1 − F(y)]f(y).

Example 6. Consider a sequence of independent repetitions of a random
experiment with constant probability p of success. Let the random variable
Y denote the total number of failures in this sequence before the rth success;
that is, Y + r is equal to the number of trials necessary to produce exactly
r successes. Here r is a fixed positive integer. To determine the p.d.f. of Y, let
y be an element of {y; y = 0, 1, 2, ...}. Then, by the multiplication rule of
probabilities, Pr(Y = y) = g(y) is equal to the product of the probability

    \binom{y + r − 1}{r − 1} p^{r−1}(1 − p)^y

of obtaining exactly r − 1 successes in the first y + r − 1 trials and the
probability p of a success on the (y + r)th trial. Thus the p.d.f. g(y) of Y is
given by

    g(y) = \binom{y + r − 1}{r − 1} p^r(1 − p)^y,  y = 0, 1, 2, ...,

zero elsewhere. A distribution with a p.d.f. of the form g(y) is called a negative binomial
distribution; and any such g(y) is called a negative binomial p.d.f. The
distribution derives its name from the fact that g(y) is a general term in the
expansion of p^r[1 − (1 − p)]^{−r}. It is left as an exercise to show that the
moment-generating function of this distribution is M(t) = p^r[1 − (1 − p)e^t]^{−r},
for t < −ln(1 − p). If r = 1, then Y has the p.d.f.

    g(y) = p(1 − p)^y,  y = 0, 1, 2, ...,

zero elsewhere, and the moment-generating function M(t) = p[1 − (1 − p)e^t]^{−1}.
In this special case, r = 1, we say that Y has a geometric distribution.

The binomial distribution can be generalized to the trinomial
distribution. If n is a positive integer and a_1, a_2, a_3 are fixed constants,
we have

(1)    \sum_{x=0}^{n} \sum_{y=0}^{n−x} \frac{n!}{x!\,y!\,(n − x − y)!} a_1^x a_2^y a_3^{n−x−y}
          = \sum_{x=0}^{n} \frac{n!}{x!\,(n − x)!} a_1^x \sum_{y=0}^{n−x} \frac{(n − x)!}{y!\,(n − x − y)!} a_2^y a_3^{n−x−y}
          = \sum_{x=0}^{n} \frac{n!}{x!\,(n − x)!} a_1^x (a_2 + a_3)^{n−x}
          = (a_1 + a_2 + a_3)^n.

Let the function f(x, y) be given by

    f(x, y) = \frac{n!}{x!\,y!\,(n − x − y)!} p_1^x p_2^y p_3^{n−x−y},

where x and y are nonnegative integers with x + y ≤ n, and p_1, p_2,
and p_3 are positive proper fractions with p_1 + p_2 + p_3 = 1; and let
f(x, y) = 0 elsewhere. Accordingly, f(x, y) satisfies the conditions of
being a joint p.d.f. of two random variables X and Y of the discrete
type; that is, f(x, y) is nonnegative and its sum over all points (x, y) at
which f(x, y) is positive is equal to (p_1 + p_2 + p_3)^n = 1. The random
variables X and Y which have a joint p.d.f. of the form f(x, y) are said
to have a trinomial distribution, and any such f(x, y) is called a trinomial
p.d.f. The moment-generating function of a trinomial distribution,
in accordance with Equation (1), is given by

    M(t_1, t_2) = \sum_{x=0}^{n} \sum_{y=0}^{n−x} \frac{n!}{x!\,y!\,(n − x − y)!} (p_1e^{t_1})^x (p_2e^{t_2})^y p_3^{n−x−y}
                = (p_1e^{t_1} + p_2e^{t_2} + p_3)^n

for all real values of t_1 and t_2. The moment-generating functions of the
marginal distributions of X and Y are, respectively,

    M(t_1, 0) = (p_1e^{t_1} + p_2 + p_3)^n = [(1 − p_1) + p_1e^{t_1}]^n

and

    M(0, t_2) = (p_1 + p_2e^{t_2} + p_3)^n = [(1 − p_2) + p_2e^{t_2}]^n.

We see immediately, from Theorem 4, Section 2.4, that X and Y are
stochastically dependent. In addition, X is b(n, p_1) and Y is b(n, p_2).
Accordingly, the means and the variances of X and Y are, respectively,
μ_1 = np_1, μ_2 = np_2, σ_1^2 = np_1(1 − p_1), and σ_2^2 = np_2(1 − p_2).
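These marginal facts are easy to confirm by direct computation. The Python sketch below is an added illustration (not part of the text; n, p_1, p_2 are chosen arbitrarily): it checks that the trinomial p.d.f. sums to 1 and that the marginal p.d.f. of X agrees with b(n, p_1).

    import math

    n, p1, p2 = 5, 0.2, 0.3
    p3 = 1 - p1 - p2

    def trinomial(x, y):
        # Joint p.d.f. of a trinomial distribution at the point (x, y).
        return (math.factorial(n)
                / (math.factorial(x) * math.factorial(y) * math.factorial(n - x - y))
                * p1**x * p2**y * p3**(n - x - y))

    total = sum(trinomial(x, y) for x in range(n + 1) for y in range(n + 1 - x))
    print(round(total, 12))                                    # 1.0

    for x in range(n + 1):
        marginal = sum(trinomial(x, y) for y in range(n + 1 - x))
        binom = math.comb(n, x) * p1**x * (1 - p1)**(n - x)
        print(x, round(marginal, 6), round(binom, 6))          # the two columns agree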
Consider next the conditional p.d.f. of Y, given X = x. We have

    f(y|x) = \frac{(n − x)!}{y!\,(n − x − y)!}\left(\frac{p_2}{1 − p_1}\right)^y\left(\frac{p_3}{1 − p_1}\right)^{n−x−y},  y = 0, 1, ..., n − x,
           = 0 elsewhere.

Thus the conditional distribution of Y, given X = x, is b[n − x,
p_2/(1 − p_1)]. Hence the conditional mean of Y, given X = x, is the
linear function

    E(Y|x) = (n − x)\left(\frac{p_2}{1 − p_1}\right).

Likewise, we find that the conditional distribution of X, given Y = y,
is b[n − y, p_1/(1 − p_2)] and thus

    E(X|y) = (n − y)\left(\frac{p_1}{1 − p_2}\right).

Now recall (Example 2, Section 2.3) that the square of the correlation
coefficient, say ρ^2, is equal to the product of −p_2/(1 − p_1) and
−p_1/(1 − p_2), the coefficients of x and y in the respective conditional
means. Since both of these coefficients are negative (and thus ρ is
negative), we have

    ρ = −\sqrt{\frac{p_1p_2}{(1 − p_1)(1 − p_2)}}.

The trinomial distribution is generalized to the multinomial distribution
as follows. Let a random experiment be repeated n independent
times. On each repetition the experiment terminates in but one of k
mutually exclusive and exhaustive ways, say C_1, C_2, ..., C_k. Let p_i be
the probability that the outcome is an element of C_i and let p_i remain
constant throughout the n independent repetitions, i = 1, 2, ..., k.
Define the random variable X_i to be equal to the number of outcomes
which are elements of C_i, i = 1, 2, ..., k − 1. Furthermore, let x_1,
x_2, ..., x_{k−1} be nonnegative integers so that x_1 + x_2 + ⋯ + x_{k−1} ≤ n.
Then the probability that exactly x_1 terminations of the experiment
are in C_1, ..., exactly x_{k−1} terminations are in C_{k−1}, and hence exactly
n − (x_1 + ⋯ + x_{k−1}) terminations are in C_k is

    \frac{n!}{x_1! ⋯ x_{k−1}!\,x_k!} p_1^{x_1} ⋯ p_{k−1}^{x_{k−1}} p_k^{x_k},

where x_k is merely an abbreviation for n − (x_1 + ⋯ + x_{k−1}). This is
the multinomial p.d.f. of k − 1 random variables X_1, X_2, ..., X_{k−1} of
the discrete type. The moment-generating function of a multinomial
distribution is given by

    M(t_1, ..., t_{k−1}) = (p_1e^{t_1} + ⋯ + p_{k−1}e^{t_{k−1}} + p_k)^n

for all real values of t_1, t_2, ..., t_{k−1}. Thus each one-variable marginal
p.d.f. is binomial, each two-variable marginal p.d.f. is trinomial, and
so on.

EXERCISES

3.1. If the moment-generating function of a random variable X is
(2/3 + (1/3)e^t)^5, find Pr(X = 2 or 3).

3.2. The moment-generating function of a random variable X is
(2/3 + (1/3)e^t)^9. Show that

    Pr(μ − 2σ < X < μ + 2σ) = \sum_{x=1}^{5} \binom{9}{x}(1/3)^x(2/3)^{9−x}.

3.3. If X is b(n, p), show that

    E\left(\frac{X}{n}\right) = p  and  E\left[\left(\frac{X}{n} − p\right)^2\right] = \frac{p(1 − p)}{n}.

3.4. Let the mutually stochastically independent random variables
X_1, X_2, X_3 have the same p.d.f. f(x) = 3x^2, 0 < x < 1, zero elsewhere.
Find the probability that exactly two of these three variables exceed 1/2.

3.5. Let Y be the number of successes in n independent repetitions of a
random experiment having the probability of success p = 2/3. If n = 3,
compute Pr(2 ≤ Y); if n = 5, compute Pr(3 ≤ Y).

3.6. Let Y be the number of successes throughout n independent repetitions
of a random experiment having probability of success p = 1/4. Determine
the smallest value of n so that Pr(1 ≤ Y) ≥ 0.70.

3.7. Let the stochastically independent random variables X_1 and X_2 have
binomial distributions with parameters n_1 = 3, p_1 = 2/3 and n_2 = 4, p_2 = 1/2,
respectively. Compute Pr(X_1 = X_2). Hint. List the four mutually exclusive
ways that X_1 = X_2 and compute the probability of each.

3.8. Let X_1, X_2, ..., X_{k−1} have a multinomial distribution. (a) Find the
moment-generating function of X_2, X_3, ..., X_{k−1}. (b) What is the p.d.f.
of X_2, X_3, ..., X_{k−1}? (c) Determine the conditional p.d.f. of X_1, given
that X_2 = x_2, ..., X_{k−1} = x_{k−1}. (d) What is the conditional expectation
E(X_1|x_2, ..., x_{k−1})?
3.9. Let X be b(2, p) and let Y be b(4, p). If Pr(X ≥ 1) = 5/9, find
Pr(Y ≥ 1).

3.10. If x = r is the unique mode of a distribution that is b(n, p), show
that

    (n + 1)p − 1 < r < (n + 1)p.

Hint. Determine the values of x for which the ratio f(x + 1)/f(x) > 1.

3.11. One of the numbers 1, 2, ..., 6 is to be chosen by casting an unbiased
die. Let this random experiment be repeated five independent times.
Let the random variable X_1 be the number of terminations in the set
{x; x = 1, 2, 3} and let the random variable X_2 be the number of terminations
in the set {x; x = 4, 5}. Compute Pr(X_1 = 2, X_2 = 1).

3.12. Show that the moment-generating function of the negative binomial
distribution is M(t) = p^r[1 − (1 − p)e^t]^{−r}. Find the mean and the variance
of this distribution. Hint. In the summation representing M(t), make use of
the MacLaurin's series for (1 − w)^{−r}.

3.13. Let X_1 and X_2 have a trinomial distribution. Differentiate the
moment-generating function to show that their covariance is −np_1p_2.

3.14. If a fair coin is tossed at random five independent times, find the
conditional probability of five heads relative to the hypothesis that there
are at least four heads.

3.15. Let an unbiased die be cast at random seven independent times.
Compute the conditional probability that each side appears at least once
relative to the hypothesis that side 1 appears exactly twice.

3.16. Compute the measures of skewness and kurtosis of the binomial
distribution b(n, p).

3.17. Let

    f(x_1, x_2) = \frac{x_1}{15}\binom{x_1}{x_2}\left(\frac{1}{2}\right)^{x_1},  x_2 = 0, 1, ..., x_1,  x_1 = 1, 2, 3, 4, 5,

zero elsewhere, be the joint p.d.f. of X_1 and X_2. Determine: (a) E(X_2),
(b) u(x_1) = E(X_2|x_1), and (c) E[u(X_1)]. Compare the answers to parts (a)
and (c). Hint. Note that E(X_2) = \sum_{x_1=1}^{5} \sum_{x_2=0}^{x_1} x_2 f(x_1, x_2) and use the fact that
\sum_{y=0}^{n} y\binom{n}{y}(1/2)^n = n/2. Why?

3.18. Three fair dice are cast. In 10 independent casts, let X be the
number of times all three faces are alike and let Y be the number of times
only two faces are alike. Find the joint p.d.f. of X and Y and compute
E(6XY).

3.2 The Poisson Distribution

Recall that the series

    1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + ⋯ = \sum_{x=0}^{∞} \frac{m^x}{x!}

converges, for all values of m, to e^m. Consider the function defined
by

    f(x) = \frac{m^x e^{−m}}{x!},  x = 0, 1, 2, ...,
         = 0 elsewhere,

where m > 0. Since m > 0, then f(x) ≥ 0 and

    \sum_{x=0}^{∞} f(x) = e^{−m}\sum_{x=0}^{∞} \frac{m^x}{x!} = e^{−m}e^{m} = 1;

that is, f(x) satisfies the conditions of being a p.d.f. of a discrete type of
random variable. A random variable that has a p.d.f. of the form f(x)
is said to have a Poisson distribution, and any such f(x) is called a
Poisson p.d.f.

Remarks. Experience indicates that the Poisson p.d.f. may be used in a
number of applications with quite satisfactory results. For example, let the
random variable X denote the number of alpha particles emitted by a
radioactive substance that enter a prescribed region during a prescribed
interval of time. With a suitable value of m, it is found that X may be
assumed to have a Poisson distribution. Again let the random variable X
denote the number of defects on a manufactured article, such as a refrigerator
door. Upon examining many of these doors, it is found, with an appropriate
value of m, that X may be said to have a Poisson distribution. The number
of automobile accidents in some unit of time (or the number of insurance
claims in some unit of time) is often assumed to be a random variable which
has a Poisson distribution. Each of these instances can be thought of as a
process that generates a number of changes (accidents, claims, etc.) in a fixed
interval (of time or space and so on). If a process leads to a Poisson distribution,
that process is called a Poisson process. Some assumptions that ensure
a Poisson process will now be enumerated.

Let g(x, w) denote the probability of x changes in each interval of length
w. Furthermore, let the symbol o(h) represent any function such that
lim_{h→0} [o(h)/h] = 0; for example, h^2 = o(h) and o(h) + o(h) = o(h). The Poisson
postulates are the following:

(a) g(1, h) = λh + o(h), where λ is a positive constant and h > 0.
(b) \sum_{x=2}^{∞} g(x, h) = o(h).

(c) The numbers of changes in nonoverlapping intervals are stochastically
independent.

Postulates (a) and (c) state, in effect, that the probability of one change in
a short interval h is independent of changes in other nonoverlapping intervals
and is approximately proportional to the length of the interval. The substance
of (b) is that the probability of two or more changes in the same
short interval h is essentially equal to zero. If x = 0, we take g(0, 0) = 1. In
accordance with postulates (a) and (b), the probability of at least one change
in an interval of length h is λh + o(h) + o(h) = λh + o(h). Hence the
probability of zero changes in this interval of length h is 1 − λh − o(h). Thus
the probability g(0, w + h) of zero changes in an interval of length w + h is,
in accordance with postulate (c), equal to the product of the probability
g(0, w) of zero changes in an interval of length w and the probability
[1 − λh − o(h)] of zero changes in a nonoverlapping interval of length h.
That is,

    g(0, w + h) = g(0, w)[1 − λh − o(h)].

Then

    \frac{g(0, w + h) − g(0, w)}{h} = −λg(0, w) − \frac{o(h)g(0, w)}{h}.

If we take the limit as h → 0, we have

    D_w[g(0, w)] = −λg(0, w).

The solution of this differential equation is

    g(0, w) = ce^{−λw}.

The condition g(0, 0) = 1 implies that c = 1; so

    g(0, w) = e^{−λw}.

If x is a positive integer, we take g(x, 0) = 0. The postulates imply that

    g(x, w + h) = [g(x, w)][1 − λh − o(h)] + [g(x − 1, w)][λh + o(h)] + o(h).

Accordingly, we have

    \frac{g(x, w + h) − g(x, w)}{h} = −λg(x, w) + λg(x − 1, w) + \frac{o(h)}{h}

and

    D_w[g(x, w)] = −λg(x, w) + λg(x − 1, w),

for x = 1, 2, 3, .... It can be shown, by mathematical induction, that the
solutions to these differential equations, with boundary conditions g(x, 0) = 0
for x = 1, 2, 3, ..., are, respectively,

    g(x, w) = \frac{(λw)^x e^{−λw}}{x!},  x = 1, 2, 3, ....

Hence the number of changes X in an interval of length w has a Poisson
distribution with parameter m = λw.

The moment-generating function of a Poisson distribution is given
by

    M(t) = \sum_x e^{tx} f(x) = \sum_{x=0}^{∞} e^{tx}\frac{m^x e^{−m}}{x!}
         = e^{−m}\sum_{x=0}^{∞} \frac{(me^t)^x}{x!}
         = e^{−m}e^{me^t} = e^{m(e^t − 1)}

for all real values of t. Since

    M'(t) = M(t)(me^t)

and

    M''(t) = M(t)(me^t)^2 + M(t)(me^t),

then

    μ = M'(0) = m

and

    σ^2 = M''(0) − μ^2 = m + m^2 − m^2 = m.

That is, a Poisson distribution has μ = σ^2 = m > 0. On this account,
a Poisson p.d.f. is frequently written

    f(x) = \frac{μ^x e^{−μ}}{x!},  x = 0, 1, 2, ...,
         = 0 elsewhere.

Thus the parameter m in a Poisson p.d.f. is the mean μ. Table I in
Appendix B gives approximately the distribution function of the
Poisson distribution for various values of the parameter m = μ.
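For readers who want a quick numerical check, the following Python sketch (an added illustration, not part of the text; the value m = 2 is chosen arbitrarily) evaluates the Poisson p.d.f. directly, confirms that the mean and variance both equal m, and tabulates a few values of the distribution function of the kind given in Table I.

    import math

    def poisson_pdf(x, m):
        # Poisson p.d.f. f(x) = m^x e^(-m) / x!
        return m**x * math.exp(-m) / math.factorial(x)

    m = 2.0
    xs = range(40)                                    # ample terms for m = 2
    mean = sum(x * poisson_pdf(x, m) for x in xs)
    var = sum((x - mean) ** 2 * poisson_pdf(x, m) for x in xs)
    print(round(mean, 6), round(var, 6))              # both approximately 2.0

    # A few values of the distribution function Pr(X <= x).
    for x in range(5):
        print(x, round(sum(poisson_pdf(k, m) for k in range(x + 1)), 3))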
Example 1. Suppose that X has a Poisson distribution with μ = 2. Then
the p.d.f. of X is

    f(x) = \frac{2^x e^{−2}}{x!},  x = 0, 1, 2, ...,
         = 0 elsewhere.

The variance of this distribution is σ^2 = μ = 2. If we wish to compute
Pr(1 ≤ X), we have

    Pr(1 ≤ X) = 1 − Pr(X = 0) = 1 − f(0) = 1 − e^{−2} = 0.865,

approximately, by Table I of Appendix B.

Example 2. If the moment-generating function of a random variable X is

    M(t) = e^{4(e^t − 1)},

then X has a Poisson distribution with μ = 4. Accordingly, by way of
example,

    Pr(X = 3) = \frac{4^3 e^{−4}}{3!} = 0.195,

or, by Table I,

    Pr(X = 3) = Pr(X ≤ 3) − Pr(X ≤ 2) = 0.433 − 0.238 = 0.195.

Example 3. Let the probability of exactly one blemish in 1 foot of
wire be about 1/1000 and let the probability of two or more blemishes in that
length be, for all practical purposes, zero. Let the random variable X be the
number of blemishes in 3000 feet of wire. If we assume the stochastic
independence of the numbers of blemishes in nonoverlapping intervals, then
the postulates of the Poisson process are approximated, with λ = 1/1000 and
w = 3000. Thus X has an approximate Poisson distribution with mean
3000(1/1000) = 3. For example, the probability that there are exactly five
blemishes in 3000 feet of wire is

    Pr(X = 5) = \frac{3^5 e^{−3}}{5!}

and, by Table I,

    Pr(X = 5) = Pr(X ≤ 5) − Pr(X ≤ 4) = 0.101,

approximately.

EXERCISES

3.19. If the random variable X has a Poisson distribution such that
Pr(X = 1) = Pr(X = 2), find Pr(X = 4).

3.20. The moment-generating function of a random variable X is
e^{4(e^t − 1)}. Show that Pr(μ − 2σ < X < μ + 2σ) = 0.931.

3.21. In a lengthy manuscript, it is discovered that only 13.5 per cent of
the pages contain no typing errors. If we assume that the number of errors
per page is a random variable with a Poisson distribution, find the percentage
of pages that have exactly one error.

3.22. Let the p.d.f. f(x) be positive on and only on the nonnegative
integers. Given that f(x) = (4/x)f(x − 1), x = 1, 2, 3, .... Find f(x). Hint.
Note that f(1) = 4f(0), f(2) = (4^2/2!)f(0), and so on. That is, find each f(x)
in terms of f(0) and then determine f(0) from 1 = f(0) + f(1) + f(2) + ⋯.

3.23. Let X have a Poisson distribution with μ = 100. Use Chebyshev's
inequality to determine a lower bound for Pr(75 < X < 125).

3.24. Given that g(x, 0) = 0 and that

    D_w[g(x, w)] = −λg(x, w) + λg(x − 1, w)

for x = 1, 2, 3, .... If g(0, w) = e^{−λw}, show, by mathematical induction,
that

    g(x, w) = \frac{(λw)^x e^{−λw}}{x!},  x = 1, 2, 3, ....

3.25. Let the number of chocolate drops in a certain type of cookie have
a Poisson distribution. We want the probability that a cookie of this type
contains at least two chocolate drops to be greater than 0.99. Find the
smallest value that the mean of the distribution can take.

3.26. Compute the measures of skewness and kurtosis of the Poisson
distribution with mean μ.

3.27. Let X and Y have the joint p.d.f. f(x, y) = e^{−2}/[x!\,(y − x)!],
y = 0, 1, 2, ...; x = 0, 1, ..., y, zero elsewhere.
(a) Find the moment-generating function M(t_1, t_2) of this joint distribution.
(b) Compute the means, the variances, and the correlation coefficient of
X and Y.
(c) Determine the conditional mean E(X|y). Hint. Note that

    \sum_{x=0}^{y} [\exp(t_1x)]\,y!/[x!\,(y − x)!] = [1 + \exp(t_1)]^y.

Why?
3.3 The Gamma and Chi-Square Distributions
In this section we introduce the gamma and chi-square distributions.
It is proved in books on advanced calculus that the integral

    \int_0^{∞} y^{α−1}e^{−y}\, dy

exists for α > 0 and that the value of the integral is a positive number.
The integral is called the gamma function of α, and we write

    Γ(α) = \int_0^{∞} y^{α−1}e^{−y}\, dy.

If α = 1, clearly

    Γ(1) = \int_0^{∞} e^{−y}\, dy = 1.

If α > 1, an integration by parts shows that

    Γ(α) = (α − 1)\int_0^{∞} y^{α−2}e^{−y}\, dy = (α − 1)Γ(α − 1).

Accordingly, if α is a positive integer greater than 1,

    Γ(α) = (α − 1)(α − 2) ⋯ (3)(2)(1)Γ(1) = (α − 1)!.

Since Γ(1) = 1, this suggests that we take 0! = 1, as we have done.

In the integral that defines Γ(α), let us introduce a new variable
x by writing y = x/β, where β > 0. Then

    Γ(α) = \int_0^{∞} \left(\frac{x}{β}\right)^{α−1} e^{−x/β}\left(\frac{1}{β}\right) dx,

or, equivalently,

    1 = \int_0^{∞} \frac{1}{Γ(α)β^α} x^{α−1}e^{−x/β}\, dx.

Since α > 0, β > 0, and Γ(α) > 0, we see that

    f(x) = \frac{1}{Γ(α)β^α} x^{α−1}e^{−x/β},  0 < x < ∞,
         = 0 elsewhere,

is a p.d.f. of a random variable of the continuous type. A random
variable X that has a p.d.f. of this form is said to have a gamma distribution
with parameters α and β; and any such f(x) is called a gamma-type p.d.f.

Remark. The gamma distribution is frequently the probability model for
waiting times; for instance, in life testing, the waiting time until "death" is
the random variable which frequently has a gamma distribution. To see this,
let us assume the postulates of a Poisson process and let the interval of length
w be a time interval. Specifically, let the random variable W be the time that
is needed to obtain exactly k changes (possibly deaths), where k is a fixed
positive integer. Then the distribution function of W is

    G(w) = Pr(W ≤ w) = 1 − Pr(W > w).

However, the event W > w, for w > 0, is equivalent to the event in which
there are less than k changes in a time interval of length w. That is, if the
random variable X is the number of changes in an interval of length w,
then

    Pr(W > w) = \sum_{x=0}^{k−1} Pr(X = x) = \sum_{x=0}^{k−1} \frac{(λw)^x e^{−λw}}{x!}.

It is left as an exercise to verify that

    \int_{λw}^{∞} \frac{z^{k−1}e^{−z}}{(k − 1)!}\, dz = \sum_{x=0}^{k−1} \frac{(λw)^x e^{−λw}}{x!}.

If, momentarily, we accept this result, we have, for w > 0,

    G(w) = 1 − \int_{λw}^{∞} \frac{z^{k−1}e^{−z}}{Γ(k)}\, dz = \int_0^{λw} \frac{z^{k−1}e^{−z}}{Γ(k)}\, dz,

and for w ≤ 0, G(w) = 0. If we change the variable of integration in the
integral that defines G(w) by writing z = λy, then

    G(w) = \int_0^{w} \frac{λ^k y^{k−1}e^{−λy}}{Γ(k)}\, dy,  w > 0,

and G(w) = 0, w ≤ 0. Accordingly, the p.d.f. of W is

    g(w) = G'(w) = \frac{λ^k w^{k−1}e^{−λw}}{Γ(k)},  0 < w < ∞,
         = 0 elsewhere.

That is, W has a gamma distribution with α = k and β = 1/λ. If W is the
waiting time until the first change, that is, if k = 1, the p.d.f. of W is

    g(w) = λe^{−λw},  0 < w < ∞,
         = 0 elsewhere,

and W is said to have an exponential distribution.

We now find the moment-generating function of a gamma distribution.
Since

    M(t) = \int_0^{∞} e^{tx}\frac{1}{Γ(α)β^α} x^{α−1}e^{−x/β}\, dx
         = \int_0^{∞} \frac{1}{Γ(α)β^α} x^{α−1}e^{−x(1 − βt)/β}\, dx,

we may set y = x(1 − βt)/β, t < 1/β, or x = βy/(1 − βt), to obtain

    M(t) = \int_0^{∞} \frac{β/(1 − βt)}{Γ(α)β^α}\left(\frac{βy}{1 − βt}\right)^{α−1} e^{−y}\, dy.

That is,

    M(t) = \left(\frac{1}{1 − βt}\right)^{α}\int_0^{∞} \frac{1}{Γ(α)} y^{α−1}e^{−y}\, dy
         = \frac{1}{(1 − βt)^{α}},  t < \frac{1}{β}.

Now

    M'(t) = (−α)(1 − βt)^{−α−1}(−β)

and

    M''(t) = (−α)(−α − 1)(1 − βt)^{−α−2}(−β)^2.

Hence, for a gamma distribution, we have

    μ = M'(0) = αβ

and

    σ^2 = M''(0) − μ^2 = α(α + 1)β^2 − α^2β^2 = αβ^2.

Example 1. Let the waiting time W have a gamma p.d.f. with α = k and
β = 1/λ. Accordingly, E(W) = k/λ. If k = 1, then E(W) = 1/λ; that is, the
expected waiting time for k = 1 changes is equal to the reciprocal of λ.

Example 2. Let X be a random variable such that

    E(X^m) = \frac{(m + 3)!}{3!}\,3^m,  m = 1, 2, 3, ....

Then the moment-generating function of X is given by the series

    M(t) = 1 + \frac{4!}{3!\,1!}\,3t + \frac{5!}{3!\,2!}\,3^2t^2 + \frac{6!}{3!\,3!}\,3^3t^3 + ⋯.

This, however, is the Maclaurin's series for (1 − 3t)^{−4}, provided that
−1 < 3t < 1. Accordingly, X has a gamma distribution with α = 4 and
β = 3.

Remark. The gamma distribution is not only a good model for waiting
times, but one for many nonnegative random variables of the continuous
type. For illustrations, the distribution of certain incomes could be modeled
satisfactorily by the gamma distribution, since the two parameters α and β
provide a great deal of flexibility.

Let us now consider the special case of the gamma distribution in
which α = r/2, where r is a positive integer, and β = 2. A random
variable X of the continuous type that has the p.d.f.

    f(x) = \frac{1}{Γ(r/2)2^{r/2}} x^{r/2−1}e^{−x/2},  0 < x < ∞,
         = 0 elsewhere,

and the moment-generating function

    M(t) = (1 − 2t)^{−r/2},  t < \tfrac{1}{2},

is said to have a chi-square distribution, and any f(x) of this form is
called a chi-square p.d.f. The mean and the variance of a chi-square
distribution are μ = αβ = (r/2)2 = r and σ^2 = αβ^2 = (r/2)2^2 = 2r,
respectively. For no obvious reason, we call the parameter r the number
of degrees of freedom of the chi-square distribution (or of the chi-square
p.d.f.). Because the chi-square distribution has an important
role in statistics and occurs so frequently, we write, for brevity, that
X is χ^2(r) to mean that the random variable X has a chi-square distribution
with r degrees of freedom.

Example 3. If X has the p.d.f.

    f(x) = \tfrac{1}{4}xe^{−x/2},  0 < x < ∞,
         = 0 elsewhere,

then X is χ^2(4). Hence μ = 4, σ^2 = 8, and M(t) = (1 − 2t)^{−2}, t < 1/2.

Example 4. If X has the moment-generating function M(t) = (1 − 2t)^{−8},
t < 1/2, then X is χ^2(16).

If the random variable X is χ^2(r), then, with c_1 < c_2, we have

    Pr(c_1 ≤ X ≤ c_2) = Pr(X ≤ c_2) − Pr(X ≤ c_1),

since Pr(X = c_1) = 0. To compute such a probability, we need the
value of an integral like

    Pr(X ≤ x) = \int_0^{x} \frac{1}{Γ(r/2)2^{r/2}} w^{r/2−1}e^{−w/2}\, dw.

Tables of this integral for selected values of r and x have been prepared
and are partially reproduced in Table II in Appendix B.

Example 5. Let X be χ^2(10). Then, by Table II of Appendix B, with
r = 10,

    Pr(3.25 ≤ X ≤ 20.5) = Pr(X ≤ 20.5) − Pr(X ≤ 3.25)
                        = 0.975 − 0.025 = 0.95.

Again, by way of example, if Pr(a < X) = 0.05, then Pr(X ≤ a) = 0.95,
and thus a = 18.3 from Table II with r = 10.

Example 6. Let X have a gamma distribution with α = r/2, where r is
a positive integer, and β > 0. Define the random variable Y = 2X/β. We
seek the p.d.f. of Y. Now the distribution function of Y is

    G(y) = Pr(Y ≤ y) = Pr\left(X ≤ \frac{βy}{2}\right).

If y ≤ 0, then G(y) = 0; but if y > 0, then

    G(y) = \int_0^{βy/2} \frac{1}{Γ(r/2)β^{r/2}} x^{r/2−1}e^{−x/β}\, dx.

Accordingly, the p.d.f. of Y is

    g(y) = G'(y) = \frac{β/2}{Γ(r/2)β^{r/2}}\left(\frac{βy}{2}\right)^{r/2−1} e^{−y/2}

if y > 0. That is, Y is χ^2(r).

EXERCISES

3.28. If (1 − 2t)^{−6}, t < 1/2, is the moment-generating function of the
random variable X, find Pr(X < 5.23).

3.29. If X is χ^2(5), determine the constants c and d so that Pr(c < X < d)
= 0.95 and Pr(X < c) = 0.025.

3.30. If X has a gamma distribution with α = 3 and β = 4, find
Pr(3.28 < X < 25.2). Hint. Consider the probability of the equivalent
event 1.64 < Y < 12.6, where Y = 2X/4 = X/2.

3.31. Let X be a random variable such that E(X^m) = (m + 1)!\,2^m,
m = 1, 2, 3, .... Determine the distribution of X.

3.32. Show that

    \int_{μ}^{∞} \frac{1}{Γ(k)} z^{k−1}e^{−z}\, dz = \sum_{x=0}^{k−1} \frac{μ^x e^{−μ}}{x!},  k = 1, 2, 3, ....

This demonstrates the relationship between the distribution functions of the
gamma and Poisson distributions. Hint. Either integrate by parts k − 1
times or simply note that the "antiderivative" of z^{k−1}e^{−z} is

    −z^{k−1}e^{−z} − (k − 1)z^{k−2}e^{−z} − ⋯ − (k − 1)!\,e^{−z},

by differentiating the latter expression.

3.33. Let X_1, X_2, and X_3 be mutually stochastically independent random
variables, each with p.d.f. f(x) = e^{−x}, 0 < x < ∞, zero elsewhere. Find
the distribution of Y = minimum(X_1, X_2, X_3). Hint. Pr(Y ≤ y) = 1 −
Pr(Y > y) = 1 − Pr(X_i > y, i = 1, 2, 3).

3.34. Let X have a gamma distribution with p.d.f.

    f(x) = \frac{1}{β^2} xe^{−x/β},  0 < x < ∞,

zero elsewhere. If x = 2 is the unique mode of the distribution, find the
parameter β and Pr(X < 9.49).

3.35. Compute the measures of skewness and kurtosis of a gamma distribution
with parameters α and β.

3.36. Let X have a gamma distribution with parameters α and β. Show
that Pr(X ≥ 2αβ) ≤ (2/e)^α. Hint. Use the result of Exercise 1.107.

3.37. Give a reasonable definition of a chi-square distribution with zero
degrees of freedom. Hint. Work with the moment-generating function of a
distribution that is χ^2(r) and let r = 0.

3.38. In the Poisson postulates on page 99, let λ be a nonnegative
function of w, say λ(w), such that D_w[g(0, w)] = −λ(w)g(0, w). Suppose that
λ(w) = krw^{r−1}, r ≥ 1. (a) Find g(0, w), noting that g(0, 0) = 1. (b) Let W
be the time that is needed to obtain exactly one change. Find the distribution
function of W, namely G(w) = Pr(W ≤ w) = 1 − Pr(W > w) = 1 − g(0, w),
0 ≤ w, and then find the p.d.f. of W. This p.d.f. is that of the Weibull distribution,
which is used in the study of breaking strengths of materials.

3.39. Let X have a Poisson distribution with parameter m. If m is an
experimental value of a random variable having a gamma distribution with
α = 2 and β = 1, compute Pr(X = 0, 1, 2).

3.40. Let X have the uniform distribution with p.d.f. f(x) = 1, 0 < x < 1,
zero elsewhere. Find the distribution function of Y = −2 ln X. What is the
p.d.f. of Y?

3.4 The Normal Distribution

Consider the integral

    I = \int_{−∞}^{∞} \exp(−y^2/2)\, dy.

This integral exists because the integrand is a positive continuous
function which is bounded by an integrable function; that is,

    0 < \exp(−y^2/2) < \exp(−|y| + 1),  −∞ < y < ∞,

and

    \int_{−∞}^{∞} \exp(−|y| + 1)\, dy = 2e.
To evaluate the integral I, we note that I > 0 and that I^2 may be
written

    I^2 = \int_{−∞}^{∞}\int_{−∞}^{∞} \exp\left(−\frac{y^2 + z^2}{2}\right) dy\, dz.

This iterated integral can be evaluated by changing to polar coordinates.
If we set y = r cos θ and z = r sin θ, we have

    I^2 = \int_0^{2π}\int_0^{∞} e^{−r^2/2}\,r\, dr\, dθ = \int_0^{2π} dθ = 2π.

Accordingly, I = \sqrt{2π} and

    \int_{−∞}^{∞} \frac{1}{\sqrt{2π}} e^{−y^2/2}\, dy = 1.

If we introduce a new variable of integration, say z, by writing

    y = \frac{x − a}{b},  b > 0,

the preceding integral becomes

    \int_{−∞}^{∞} \frac{1}{b\sqrt{2π}} \exp\left[−\frac{(x − a)^2}{2b^2}\right] dx = 1.

Since b > 0, this implies that

    f(x) = \frac{1}{b\sqrt{2π}} \exp\left[−\frac{(x − a)^2}{2b^2}\right],  −∞ < x < ∞,

satisfies the conditions of being a p.d.f. of a continuous type of random
variable. A random variable of the continuous type that has a p.d.f. of
the form of f(x) is said to have a normal distribution, and any f(x) of
this form is called a normal p.d.f.

We can find the moment-generating function of a normal distribution
as follows. In

    M(t) = \int_{−∞}^{∞} e^{tx}\frac{1}{b\sqrt{2π}} \exp\left[−\frac{(x − a)^2}{2b^2}\right] dx
         = \int_{−∞}^{∞} \frac{1}{b\sqrt{2π}} \exp\left[−\frac{x^2 − 2(a + b^2t)x + a^2}{2b^2}\right] dx

we complete the square in the exponent. Thus M(t) becomes

    M(t) = \exp\left[\frac{(a + b^2t)^2 − a^2}{2b^2}\right]\int_{−∞}^{∞} \frac{1}{b\sqrt{2π}} \exp\left[−\frac{(x − a − b^2t)^2}{2b^2}\right] dx
         = \exp\left(at + \frac{b^2t^2}{2}\right)

because the integrand of the last integral can be thought of as a normal
p.d.f. with a replaced by a + b^2t, and hence it is equal to 1.

The mean μ and variance σ^2 of a normal distribution will be calculated
from M(t). Now

    M'(t) = M(t)(a + b^2t)

and

    M''(t) = M(t)(a + b^2t)^2 + M(t)b^2.

Thus

    μ = M'(0) = a

and

    σ^2 = M''(0) − μ^2 = b^2 + a^2 − a^2 = b^2.

This permits us to write a normal p.d.f. in the form of

    f(x) = \frac{1}{σ\sqrt{2π}} \exp\left[−\frac{(x − μ)^2}{2σ^2}\right],  −∞ < x < ∞,

a form that shows explicitly the values of μ and σ^2. The moment-generating
function M(t) can be written

    M(t) = \exp\left(μt + \frac{σ^2t^2}{2}\right).

Example 1. If X has the moment-generating function

    M(t) = e^{2t + 32t^2},

then X has a normal distribution with μ = 2, σ^2 = 64.

The normal p.d.f. occurs so frequently in certain parts of statistics
that we denote it, for brevity, by n(μ, σ^2). Thus, if we say that the
random variable X is n(0, 1), we mean that X has a normal distribution
with mean μ = 0 and variance σ^2 = 1, so that the p.d.f. of X is

    f(x) = \frac{1}{\sqrt{2π}} e^{−x^2/2},  −∞ < x < ∞.
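As a quick numerical illustration (added here, not part of the text; the grid step and the helper names pdf and Phi are ours), the Python sketch below checks that the n(2, 64) p.d.f. of Example 1 integrates to 1, that a crude Riemann sum reproduces M(t) = exp(μt + σ²t²/2), and that normal probabilities can be computed with the error function from the standard library.

    import math

    mu, sigma, t = 2.0, 8.0, 0.1              # the n(2, 64) distribution of Example 1

    def pdf(x):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    dx = 0.01
    xs = [mu - 10 * sigma + k * dx for k in range(int(20 * sigma / dx) + 1)]
    total = sum(pdf(x) * dx for x in xs)
    mgf = sum(math.exp(t * x) * pdf(x) * dx for x in xs)
    print(round(total, 4), round(mgf, 4), round(math.exp(mu * t + sigma**2 * t**2 / 2), 4))

    # Probabilities for n(0, 1) via the error function.
    Phi = lambda w: 0.5 * (1 + math.erf(w / math.sqrt(2)))
    print(round(Phi(2) - Phi(-2), 3))         # Pr(-2 < W < 2) is approximately 0.954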
If we say that X is n(5, 4), we mean that X has a normal distribution
with mean μ = 5 and variance σ^2 = 4, so that the p.d.f. of X is

    f(x) = \frac{1}{2\sqrt{2π}} \exp\left[−\frac{(x − 5)^2}{2(4)}\right],  −∞ < x < ∞.

Moreover, if

    f(x) = \frac{1}{\sqrt{2π}} e^{−x^2/2},  −∞ < x < ∞,

then X is n(0, 1).

The graph of

    f(x) = \frac{1}{σ\sqrt{2π}} \exp\left[−\frac{(x − μ)^2}{2σ^2}\right],  −∞ < x < ∞,

is seen (1) to be symmetric about a vertical axis through x = μ, (2) to
have its maximum of 1/(σ\sqrt{2π}) at x = μ, and (3) to have the x-axis as
a horizontal asymptote. It should be verified that (4) there are points
of inflection at x = μ ± σ.

Remark. Each of the special distributions considered thus far has been
"justified" by some derivation that is based upon certain concepts found
in elementary probability theory. Such a motivation for the normal distribution
is not given at this time; a motivation is presented in Chapter 5. However,
the normal distribution is one of the more widely used distributions in
applications of statistical methods. Variables that are often assumed to be
random variables having normal distributions (with appropriate values of
μ and σ) are the diameter of a hole made by a drill press, the score on a
test, the yield of a grain on a plot of ground, and the length of a newborn
child.

We now prove a very useful theorem.

Theorem 1. If the random variable X is n(μ, σ^2), σ^2 > 0, then the
random variable W = (X − μ)/σ is n(0, 1).

Proof. The distribution function G(w) of W is, since σ > 0,

    G(w) = Pr\left(\frac{X − μ}{σ} ≤ w\right) = Pr(X ≤ wσ + μ).

That is,

    G(w) = \int_{−∞}^{wσ + μ} \frac{1}{σ\sqrt{2π}} \exp\left[−\frac{(x − μ)^2}{2σ^2}\right] dx.

If we change the variable of integration by writing y = (x − μ)/σ, then

    G(w) = \int_{−∞}^{w} \frac{1}{\sqrt{2π}} e^{−y^2/2}\, dy.

Accordingly, the p.d.f. g(w) = G'(w) of the continuous-type random
variable W is

    g(w) = \frac{1}{\sqrt{2π}} e^{−w^2/2},  −∞ < w < ∞.

Thus W is n(0, 1), which is the desired result.

This fact considerably simplifies calculations of probabilities concerning
normally distributed variables, as will be seen presently. Suppose
that X is n(μ, σ^2). Then, with c_1 < c_2, we have, since Pr(X = c_1) = 0,

    Pr(c_1 < X < c_2) = Pr(X ≤ c_2) − Pr(X ≤ c_1)
        = Pr\left(\frac{X − μ}{σ} < \frac{c_2 − μ}{σ}\right) − Pr\left(\frac{X − μ}{σ} < \frac{c_1 − μ}{σ}\right)

because W = (X − μ)/σ is n(0, 1). That is, probabilities concerning X,
which is n(μ, σ^2), can be expressed in terms of probabilities concerning
W, which is n(0, 1). However, an integral such as

    \int_{−∞}^{k} \frac{1}{\sqrt{2π}} e^{−w^2/2}\, dw

cannot be evaluated by the fundamental theorem of calculus because
an "antiderivative" of e^{−w^2/2} is not expressible as an elementary
function. Instead, tables of the approximate value of this integral for
various values of k have been prepared and are partially reproduced in
Table III in Appendix B. We use the notation (for normal)

    N(x) = \int_{−∞}^{x} \frac{1}{\sqrt{2π}} e^{−w^2/2}\, dw;
thus, if X is n(μ, σ^2), then

    Pr(c_1 < X < c_2) = Pr\left(\frac{X − μ}{σ} < \frac{c_2 − μ}{σ}\right) − Pr\left(\frac{X − μ}{σ} < \frac{c_1 − μ}{σ}\right)
                      = N\left(\frac{c_2 − μ}{σ}\right) − N\left(\frac{c_1 − μ}{σ}\right).

It is left as an exercise to show that N(−x) = 1 − N(x).

Example 2. Let X be n(2, 25). Then, by Table III,

    Pr(0 < X < 10) = N\left(\frac{10 − 2}{5}\right) − N\left(\frac{0 − 2}{5}\right)
                   = N(1.6) − N(−0.4)
                   = 0.945 − (1 − 0.655) = 0.600

and

    Pr(−8 < X < 1) = N\left(\frac{1 − 2}{5}\right) − N\left(\frac{−8 − 2}{5}\right)
                   = N(−0.2) − N(−2)
                   = (1 − 0.579) − (1 − 0.977) = 0.398.

Example 3. Let X be n(μ, σ^2). Then, by Table III,

    Pr(μ − 2σ < X < μ + 2σ) = N\left(\frac{μ + 2σ − μ}{σ}\right) − N\left(\frac{μ − 2σ − μ}{σ}\right)
                            = N(2) − N(−2)
                            = 0.977 − (1 − 0.977) = 0.954.

Example 4. Suppose that 10 per cent of the probability for a certain
distribution that is n(μ, σ^2) is below 60 and that 5 per cent is above 90.
What are the values of μ and σ? We are given that the random variable X is
n(μ, σ^2) and that Pr(X ≤ 60) = 0.10 and Pr(X ≤ 90) = 0.95. Thus
N[(60 − μ)/σ] = 0.10 and N[(90 − μ)/σ] = 0.95. From Table III we have

    \frac{60 − μ}{σ} = −1.282,    \frac{90 − μ}{σ} = 1.645.

These conditions require that μ = 73.1 and σ = 10.2 approximately.

We close this section with an important theorem.

Theorem 2. If the random variable X is n(μ, σ^2), σ^2 > 0, then the
random variable V = (X − μ)^2/σ^2 is χ^2(1).

Proof. Because V = W^2, where W = (X − μ)/σ is n(0, 1), the
distribution function G(v) of V is, for v ≥ 0,

    G(v) = Pr(W^2 ≤ v) = Pr(−\sqrt{v} ≤ W ≤ \sqrt{v}).

That is,

    G(v) = 2\int_0^{\sqrt{v}} \frac{1}{\sqrt{2π}} e^{−w^2/2}\, dw,  0 ≤ v,

and

    G(v) = 0,  v < 0.

If we change the variable of integration by writing w = \sqrt{y}, then

    G(v) = \int_0^{v} \frac{1}{\sqrt{2π}\sqrt{y}} e^{−y/2}\, dy,  0 ≤ v.

Hence the p.d.f. g(v) = G'(v) of the continuous-type random variable
V is

    g(v) = \frac{1}{\sqrt{π}\sqrt{2}} v^{1/2−1} e^{−v/2},  0 < v < ∞,
         = 0 elsewhere.

Since g(v) is a p.d.f. and hence

    \int_0^{∞} g(v)\, dv = 1,

it must be that Γ(1/2) = \sqrt{π} and thus V is χ^2(1).

EXERCISES

3.41. If

    N(x) = \int_{−∞}^{x} \frac{1}{\sqrt{2π}} e^{−w^2/2}\, dw,

show that N(−x) = 1 − N(x).

3.42. If X is n(75, 100), find Pr(X < 60) and Pr(70 < X < 100).

3.43. If X is n(μ, σ^2), find b so that Pr[−b < (X − μ)/σ < b] = 0.90.

3.44. Let X be n(μ, σ^2) so that Pr(X < 89) = 0.90 and Pr(X < 94) =
0.95. Find μ and σ^2.

3.45. Show that the constant c can be selected so that f(x) = c\,2^{−x^2},
−∞ < x < ∞, satisfies the conditions of a normal p.d.f. Hint. Write 2 = e^{ln 2}.

3.46. If X is n(μ, σ^2), show that E(|X − μ|) = σ\sqrt{2/π}.

3.47. Show that the graph of a p.d.f. n(μ, σ^2) has points of inflection at
x = μ − σ and x = μ + σ.
3.48. Determine the ninetieth percentile of the distribution, which is
n(65, 25).

3.49. If e^{3t + 8t^2} is the moment-generating function of the random variable
X, find Pr(−1 < X < 9).

3.50. Let the random variable X have the p.d.f.

    f(x) = \frac{2}{\sqrt{2π}} e^{−x^2/2},  0 < x < ∞, zero elsewhere.

Find the mean and variance of X. Hint. Compute E(X) directly and E(X^2)
by comparing that integral with the integral representing the variance of a
variable that is n(0, 1).

3.51. Let X be n(5, 10). Find Pr[0.04 < (X − 5)^2 < 38.4].

3.52. If X is n(1, 4), compute the probability Pr(1 < X^2 < 9).

3.53. If X is n(75, 25), find the conditional probability that X is greater
than 80 relative to the hypothesis that X is greater than 77. See Exercise
2.17.

3.54. Let X be a random variable such that E(X^{2m}) = (2m)!/(2^m m!),
m = 1, 2, 3, ..., and E(X^{2m−1}) = 0, m = 1, 2, 3, .... Find the moment-generating
function and the p.d.f. of X.

3.55. Let the mutually stochastically independent random variables
X_1, X_2, and X_3 be n(0, 1), n(2, 4), and n(−1, 1), respectively. Compute the
probability that exactly two of these three variables are less than zero.

3.56. Compute the measures of skewness and kurtosis of a distribution
which is n(μ, σ^2).

3.57. Let the random variable X have a distribution that is n(μ, σ^2).
(a) Does the random variable Y = X^2 also have a normal distribution?
(b) Would the random variable Y = aX + b, a and b nonzero constants,
have a normal distribution? Hint. In each case, first determine Pr(Y ≤ y).

3.58. Let the random variable X be n(μ, σ^2). What would this distribution
be if σ^2 = 0? Hint. Look at the moment-generating function of X for σ^2 > 0
and investigate its limit as σ^2 → 0.

3.59. Let n(x) and N(x) be the p.d.f. and distribution function of a
distribution that is n(0, 1). Let Y have a truncated distribution with p.d.f.
g(y) = n(y)/[N(b) − N(a)], a < y < b, zero elsewhere. Show that E(Y) is
equal to [n(a) − n(b)]/[N(b) − N(a)].

3.60. Let f(x) and F(x) be the p.d.f. and the distribution function of a
distribution of the continuous type such that f'(x) exists for all x. Let the
mean of the truncated distribution that has p.d.f. g(y) = f(y)/F(b), −∞ <
y < b, zero elsewhere, be equal to −f(b)/F(b) for all real b. Prove that f(x)
is n(0, 1).

3.61. Let X and Y be stochastically independent random variables, each
with a distribution that is n(0, 1). Let Z = X + Y. Find the integral that
represents the distribution function G(z) = Pr(X + Y ≤ z) of Z. Determine
the p.d.f. of Z. Hint. We have that G(z) = \int_{−∞}^{∞} H(x, z)\, dx, where

    H(x, z) = \int_{−∞}^{z − x} \frac{1}{2π} \exp[−(x^2 + y^2)/2]\, dy.

Find G'(z) by evaluating \int_{−∞}^{∞} [∂H(x, z)/∂z]\, dx.

3.5 The Bivariate Normal Distribution

Let us investigate the function

    f(x, y) = \frac{1}{2πσ_1σ_2\sqrt{1 − ρ^2}} e^{−q/2},  −∞ < x < ∞, −∞ < y < ∞,

where, with σ_1 > 0, σ_2 > 0, and −1 < ρ < 1,

    q = \frac{1}{1 − ρ^2}\left[\left(\frac{x − μ_1}{σ_1}\right)^2 − 2ρ\left(\frac{x − μ_1}{σ_1}\right)\left(\frac{y − μ_2}{σ_2}\right) + \left(\frac{y − μ_2}{σ_2}\right)^2\right].

At this point we do not know that the constants μ_1, μ_2, σ_1^2, σ_2^2, and ρ
represent parameters of a distribution. As a matter of fact, we do not
know that f(x, y) has the properties of a joint p.d.f. It will now be shown
that:
(a) f(x, y) is a joint p.d.f.
(b) X is n(μ_1, σ_1^2) and Y is n(μ_2, σ_2^2).
(c) ρ is the correlation coefficient of X and Y.
A joint p.d.f. of this form is called a bivariate normal p.d.f., and the
random variables X and Y are said to have a bivariate normal distribution.

That the nonnegative function f(x, y) is actually a joint p.d.f. can
be seen as follows. Define f_1(x) by

    f_1(x) = \int_{−∞}^{∞} f(x, y)\, dy.
Now

    (1 − ρ^2)q = \left[\left(\frac{y − μ_2}{σ_2}\right) − ρ\left(\frac{x − μ_1}{σ_1}\right)\right]^2 + (1 − ρ^2)\left(\frac{x − μ_1}{σ_1}\right)^2
               = \left(\frac{y − b}{σ_2}\right)^2 + (1 − ρ^2)\left(\frac{x − μ_1}{σ_1}\right)^2,

where b = μ_2 + ρ(σ_2/σ_1)(x − μ_1). Thus

    f_1(x) = \frac{\exp[−(x − μ_1)^2/2σ_1^2]}{σ_1\sqrt{2π}} \int_{−∞}^{∞} \frac{\exp\{−(y − b)^2/[2σ_2^2(1 − ρ^2)]\}}{σ_2\sqrt{1 − ρ^2}\sqrt{2π}}\, dy.

For the purpose of integration, the integrand of the integral in this
expression for f_1(x) may be considered a normal p.d.f. with mean b and
variance σ_2^2(1 − ρ^2). Thus this integral is equal to 1 and

    f_1(x) = \frac{1}{σ_1\sqrt{2π}} \exp\left[−\frac{(x − μ_1)^2}{2σ_1^2}\right],  −∞ < x < ∞.

Since

    \int_{−∞}^{∞}\int_{−∞}^{∞} f(x, y)\, dy\, dx = \int_{−∞}^{∞} f_1(x)\, dx = 1,

the nonnegative function f(x, y) is a joint p.d.f. of two continuous-type
random variables X and Y. Accordingly, the function f_1(x) is the
marginal p.d.f. of X, and X is seen to be n(μ_1, σ_1^2). In like manner, we
see that Y is n(μ_2, σ_2^2).

Moreover, from the development above, we note that

    f(x, y) = f_1(x)\left(\frac{1}{σ_2\sqrt{1 − ρ^2}\sqrt{2π}} \exp\left[−\frac{(y − b)^2}{2σ_2^2(1 − ρ^2)}\right]\right),

where b = μ_2 + ρ(σ_2/σ_1)(x − μ_1). Accordingly, the second factor in the
right-hand member of the equation above is the conditional p.d.f. of
Y, given that X = x. That is, the conditional p.d.f. of Y, given
X = x, is itself normal with mean μ_2 + ρ(σ_2/σ_1)(x − μ_1) and variance
σ_2^2(1 − ρ^2). Thus, with a bivariate normal distribution, the conditional
mean of Y, given that X = x, is linear in x and is given by

    E(Y|x) = μ_2 + ρ\frac{σ_2}{σ_1}(x − μ_1).

Since the coefficient of x in this linear conditional mean E(Y|x) is
ρσ_2/σ_1, and since σ_1 and σ_2 represent the respective standard deviations,
the number ρ is, in fact, the correlation coefficient of X and Y. This
follows from the result, established in Section 2.3, that the coefficient
of x in a general linear conditional mean E(Y|x) is the product of the
correlation coefficient and the ratio σ_2/σ_1.

Although the mean of the conditional distribution of Y, given
X = x, depends upon x (unless ρ = 0), the variance σ_2^2(1 − ρ^2) is the
same for all real values of x. Thus, by way of example, given that
X = x, the conditional probability that Y is within (2.576)σ_2\sqrt{1 − ρ^2}
units of the conditional mean is 0.99, whatever the value of x. In this
sense, most of the probability for the distribution of X and Y lies in
the band

    μ_2 + ρ\frac{σ_2}{σ_1}(x − μ_1) ± 2.576σ_2\sqrt{1 − ρ^2}

about the graph of the linear conditional mean. For every fixed positive
σ_2, the width of this band depends upon ρ. Because the band is narrow
when ρ^2 is nearly 1, we see that ρ does measure the intensity of the
concentration of the probability for X and Y about the linear conditional
mean. This is the fact to which we alluded in the remark of
Section 2.3.

In a similar manner we can show that the conditional distribution
of X, given Y = y, is the normal distribution

    n\left[μ_1 + ρ\frac{σ_1}{σ_2}(y − μ_2),\; σ_1^2(1 − ρ^2)\right].

Example 1. Let us assume that in a certain population of married couples
the height X_1 of the husband and the height X_2 of the wife have a bivariate
normal distribution with parameters μ_1 = 5.8 feet, μ_2 = 5.3 feet, σ_1 = σ_2 =
0.2 foot, and ρ = 0.6. The conditional p.d.f. of X_2, given X_1 = 6.3, is normal
with mean 5.3 + (0.6)(6.3 − 5.8) = 5.6 and standard deviation (0.2)\sqrt{1 − 0.36} =
0.16. Accordingly, given that the height of the husband is 6.3 feet, the
probability that his wife has a height between 5.28 and 5.92 feet is

    Pr(5.28 < X_2 < 5.92 | x_1 = 6.3) = N(2) − N(−2) = 0.954.

The moment-generating function of a bivariate normal distribution
can be determined as follows. We have

    M(t_1, t_2) = \int_{−∞}^{∞}\int_{−∞}^{∞} e^{t_1x + t_2y} f(x, y)\, dx\, dy
                = \int_{−∞}^{∞} e^{t_1x} f_1(x)\left[\int_{−∞}^{∞} e^{t_2y} f(y|x)\, dy\right] dx

for all real values of t_1 and t_2. The integral within the brackets is the
moment-generating function of the conditional p.d.f. f(y|x). Since
f(y|x) is a normal p.d.f. with mean μ_2 + ρ(σ_2/σ_1)(x − μ_1) and variance
σ_2^2(1 − ρ^2), then

    \int_{−∞}^{∞} e^{t_2y} f(y|x)\, dy = \exp\left\{t_2\left[μ_2 + ρ\frac{σ_2}{σ_1}(x − μ_1)\right] + \frac{t_2^2σ_2^2(1 − ρ^2)}{2}\right\}.

Accordingly, M(t_1, t_2) can be written in the form

    \exp\left\{t_2μ_2 − t_2ρ\frac{σ_2}{σ_1}μ_1 + \frac{t_2^2σ_2^2(1 − ρ^2)}{2}\right\}\int_{−∞}^{∞} \exp\left[\left(t_1 + t_2ρ\frac{σ_2}{σ_1}\right)x\right] f_1(x)\, dx.

But E(e^{tX}) = \exp[μ_1t + (σ_1^2t^2)/2] for all real values of t. Accordingly,
if we set t = t_1 + t_2ρ(σ_2/σ_1), we see that M(t_1, t_2) is given by

    \exp\left\{t_2μ_2 − t_2ρ\frac{σ_2}{σ_1}μ_1 + \frac{t_2^2σ_2^2(1 − ρ^2)}{2} + μ_1\left(t_1 + t_2ρ\frac{σ_2}{σ_1}\right) + \frac{σ_1^2\left(t_1 + t_2ρ\frac{σ_2}{σ_1}\right)^2}{2}\right\},

or, equivalently,

    M(t_1, t_2) = \exp\left(μ_1t_1 + μ_2t_2 + \frac{σ_1^2t_1^2 + 2ρσ_1σ_2t_1t_2 + σ_2^2t_2^2}{2}\right).

It is interesting to note that if, in this moment-generating function
M(t_1, t_2), the correlation coefficient ρ is set equal to zero, then

    M(t_1, t_2) = M(t_1, 0)M(0, t_2).

Thus X and Y are stochastically independent when ρ = 0. If, conversely,
M(t_1, t_2) = M(t_1, 0)M(0, t_2), we have e^{ρσ_1σ_2t_1t_2} = 1. Since each
of σ_1 and σ_2 is positive, then ρ = 0. Accordingly, we have the following
theorem.

Theorem 3. Let X and Y have a bivariate normal distribution with
means μ_1 and μ_2, positive variances σ_1^2 and σ_2^2, and correlation coefficient ρ.
Then X and Y are stochastically independent if and only if ρ = 0.

As a matter of fact, if any two random variables are stochastically
independent and have positive standard deviations, we have noted in
Example 4 of Section 2.4 that ρ = 0. However, ρ = 0 does not in
general imply that two variables are stochastically independent; this
can be seen in Exercises 2.18(c) and 2.23. The importance of Theorem 3
lies in the fact that we now know when and only when two random
variables that have a bivariate normal distribution are stochastically
independent.

EXERCISES

3.62. Let X and Y have a bivariate normal distribution with parameters
μ_1 = 3, μ_2 = 1, σ_1^2 = 16, σ_2^2 = 25, and ρ = 3/5. Determine the following
probabilities:
(a) Pr(3 < Y < 8).
(b) Pr(3 < Y < 8 | x = 7).
(c) Pr(−3 < X < 3).
(d) Pr(−3 < X < 3 | y = −4).

3.63. If M(t_1, t_2) is the moment-generating function of a bivariate normal
distribution, compute the covariance by using the formula

    \frac{∂^2M(0, 0)}{∂t_1∂t_2} − \frac{∂M(0, 0)}{∂t_1}\frac{∂M(0, 0)}{∂t_2}.

Now let ψ(t_1, t_2) = ln M(t_1, t_2). Show that ∂^2ψ(0, 0)/∂t_1∂t_2 gives this covariance
directly.

3.64. Let X and Y have a bivariate normal distribution with parameters
μ_1 = 5, μ_2 = 10, σ_1^2 = 1, σ_2^2 = 25, and ρ > 0. If Pr(4 < Y < 16 | x = 5) =
0.954, determine ρ.

3.65. Let X and Y have a bivariate normal distribution with parameters
μ_1 = 20, μ_2 = 40, σ_1^2 = 9, σ_2^2 = 4, and ρ = 0.6. Find the shortest interval
for which 0.90 is the conditional probability that Y is in this interval, given
that X = 22.

3.66. Let f(x, y) = (1/2π) exp[−½(x^2 + y^2)]{1 + xy exp[−½(x^2 + y^2 − 2)]},
where −∞ < x < ∞, −∞ < y < ∞. If f(x, y) is a joint p.d.f., it is not a
normal bivariate p.d.f. Show that f(x, y) actually is a joint p.d.f. and that
each marginal p.d.f. is normal. Thus the fact that each marginal p.d.f. is
normal does not imply that the joint p.d.f. is bivariate normal.

3.67. Let X, Y, and Z have the joint p.d.f.

    (1/2π)^{3/2} \exp[−(x^2 + y^2 + z^2)/2]\{1 + xyz \exp[−(x^2 + y^2 + z^2)/2]\},

where −∞ < x < ∞, −∞ < y < ∞, and −∞ < z < ∞. While X, Y, and
Z are obviously stochastically dependent, show that X, Y, and Z are pairwise
stochastically independent and that each pair has a bivariate normal
distribution.

3.68. Let X and Y have a bivariate normal distribution with parameters
μ_1 = μ_2 = 0, σ_1^2 = σ_2^2 = 1, and correlation coefficient ρ. Find the distribution
of the random variable Z = aX + bY in which a and b are nonzero
constants. Hint. Write G(z) = Pr(Z ≤ z) as an iterated integral and compute
G'(z) = g(z) by differentiating under the first integral sign and then
evaluating the resulting integral by completing the square in the exponent.
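Before leaving this section, it may help to see the conditional-distribution results checked numerically. The Python sketch below is an added illustration (not part of the text; the sample size, seed, and helper name Phi are ours): it simulates the bivariate normal distribution of Example 1, confirms that the sample correlation coefficient is near ρ, and recomputes the conditional probability found there.

    import math, random

    # A bivariate normal pair can be generated from two independent n(0, 1)
    # variables Z1, Z2 via X1 = mu1 + s1*Z1, X2 = mu2 + s2*(rho*Z1 + sqrt(1-rho^2)*Z2).
    mu1, mu2, s1, s2, rho = 5.8, 5.3, 0.2, 0.2, 0.6
    random.seed(2)
    pairs = []
    for _ in range(100_000):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        pairs.append((mu1 + s1 * z1, mu2 + s2 * (rho * z1 + math.sqrt(1 - rho**2) * z2)))

    n = len(pairs)
    m1 = sum(x for x, _ in pairs) / n
    m2 = sum(y for _, y in pairs) / n
    cov = sum((x - m1) * (y - m2) for x, y in pairs) / n
    v1 = sum((x - m1) ** 2 for x, _ in pairs) / n
    v2 = sum((y - m2) ** 2 for _, y in pairs) / n
    print(round(cov / math.sqrt(v1 * v2), 3))          # close to rho = 0.6

    # Conditional distribution of X2 given X1 = 6.3: normal, mean 5.6, sd 0.16.
    Phi = lambda w: 0.5 * (1 + math.erf(w / math.sqrt(2)))
    b, sd = mu2 + rho * (s2 / s1) * (6.3 - mu1), s2 * math.sqrt(1 - rho**2)
    print(round(Phi((5.92 - b) / sd) - Phi((5.28 - b) / sd), 3))   # about 0.954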
Chapter 4
Distributions of Functions
of Random Variables
4.1 Sampling Theory
Let Xl> X
2
, ••• , K; denote n random variables that have the)oint
p.d.f.j(xl> X2' ••• , xn
) . These variables mayor may not be ~tochas:lcal~y
independent. Problems such as the foll.owing ~re very mteres.tmg m
themselves; but more importantly, their solutions often pr~vIde the
basis for making statistical inferences. Let Y be a random vanable that
is defined by a function of Xl' X 2 , • • • , X n, say Y = u(Xl> X 2 , ••• , X n) ·
O the P d f f(x X x ) is given can we find the p.d.f. of Y?
nce . . . l> 2"'" n '
In some of the preceding chapters, we have solved a fe~ of t~ese pro~-
lems. Among them are the following two. If n = 1 and If ~IIS n(p., a ),
then Y = (Xl - p.)/a is n(O, 1). If n is a positive integer, 1£ the random
. bl X . - 1 2 n are mutually stochastically independent,
vana es i' ~ - , , ... , ,
and each Xi has the same p.d.f. f(x) = PX(1 - P)l-X, x = 0, 1, and
zero elsewhere, and if Y = i Xi' then Yis b(n, P)· It should be observed
I
that Y = u(XI
) = (Xl - p.)/a is a function of Xl that depends upon
the two parameters of the normal distribution; whereas Y = u(Xl> X 2'
... , X
n
) = i Xi does not depend upon p, the parameter of the common
d f f th IX . - 1 2 n The distinction that we make between
p. ., 0 e ,,~- , , ... , .
these functions is brought out in the following definition.
Definition 1. A function of one or more random variables that does not depend upon any unknown parameter is called a statistic.
In accordance with this definition, the random variable Y = Σ_{i=1}^n Xi discussed above is a statistic. But the random variable Y = (X1 − μ)/σ is not a statistic unless μ and σ are known numbers. It should be noted that, although a statistic does not depend upon any unknown parameter, the distribution of that statistic may very well depend upon unknown parameters.
Remark. We remark, for the benefit of the more advanced reader,
that a statistic is usually defined to be a measurable function of the random
variables. In this book, however, we wish to minimize the use of measure
theoretic terminology so we have suppressed the modifier "measurable."
It is quite clear that a statistic is a random variable. In fact, some probabilists avoid the use of the word "statistic" altogether, and they refer to a measurable function of random variables as a random variable. We decided to use the word "statistic" because the reader will encounter it so frequently
in books and journals.
We can motivate the study of the distribution of a statistic in the
following way. Let a random variable X be defined on a sample space 𝒞 and let the space of X be denoted by 𝒜. In many situations confronting us, the distribution of X is not completely known. For instance,
we may know the distribution except for the value of an unknown
parameter. To obtain more information about this distribution (or the
unknown parameter), we shall repeat under identical conditions the
random experiment n independent times. Let the random variable Xi be a function of the ith outcome, i = 1, 2, ..., n. Then we call X1, X2, ..., Xn the items of a random sample from the distribution under consideration. Suppose that we can define a statistic Y = u(X1, X2, ..., Xn) whose p.d.f. is found to be g(y). Perhaps this p.d.f. shows that there is a great probability that Y has a value close to the unknown parameter. Once the experiment has been repeated in the manner indicated and we have X1 = x1, ..., Xn = xn, then y = u(x1, x2, ..., xn)
is a known number. It is to be hoped that this known number can in
some manner be used to elicit information about the unknown param-
eter. Thus a statistic may prove to be useful.
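The following sketch (ours; the exponential distribution, its mean θ, the sample size, and the seed are arbitrary choices, and numpy is assumed) illustrates the idea of the preceding paragraph: repeating the experiment many times shows that the observed values of the statistic X̄ tend to fall near the unknown parameter.

import numpy as np

rng = np.random.default_rng(0)
theta = 2.0       # the "unknown" parameter: the mean of the sampled distribution
n = 25            # sample size

# Repeat the random experiment: many random samples, each yielding one value of X-bar.
xbars = rng.exponential(theta, size=(10_000, n)).mean(axis=1)

print(xbars.mean())                           # close to theta
print(np.mean(np.abs(xbars - theta) < 0.5))   # most observed values of X-bar lie near theta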
Remarks. Let the random variable X be defined as the diameter of a
hole to be drilled by a certain drill press and let it be assumed that X has a
normal distribution. Past experience with many drill presses makes this
assumption plausible; but the assumption does not specify the mean μ nor the variance σ² of this normal distribution. The only way to obtain information about μ and σ² is to have recourse to experimentation. Thus we shall drill
a number, say n = 20, of these holes whose diameters will be X1, X2, ..., X20. Then X1, X2, ..., X20 is a random sample from the normal distribution under consideration. Once the holes have been drilled and the diameters measured, the 20 numbers may be used, as will be seen later, to elicit information about μ and σ².
The term" random sample" is now defined in a more formal manner.
Definition 2. Let Xl> X 2 , ••• , X n denote n mutually stochastically
independent random variables, each of which has the same but possibly
unknown p.d.f. f(x); that is, the probability density functions of
Xl> X 2, ... , X; are, respectivelY,]I(xI) = f(X I),f2(X2) = f(X2), ... ,fn(xn)
= f(xn), so that the joint p.d.f. is f(xl)f(x2)· . ·f(xn)· The random
variables Xl> X 2
, ••• , X n are then said to constitute a random sample
from a distribution that has p.d.f. f(x).
Later we shall define what we mean by a random sample from a
distribution of more than one random variable.
Sometimes it is convenient to refer to a random sample of size n from a given distribution and, as has been remarked, to refer to X1, X2, ..., Xn as the items of the random sample. A reexamination of Example 5 of Section 2.4 reveals that we found the p.d.f. of the statistic, which is the maximum of the items of a random sample of size n = 3, from a distribution with p.d.f. f(x) = 2x, 0 < x < 1, zero elsewhere. In the first Remark of Section 3.1 (and referred to in this section), we found the p.d.f. of the statistic, which is the sum of the items of a random sample of size n from a distribution that has p.d.f. f(x) = p^x(1 − p)^{1−x}, x = 0, 1, zero elsewhere.
In this book, most of the statistics that we shall encounter will be
functions of the items of a random sample from a given distribution.
Next, we define two important statistics of this type.
Definition 3. Let X1, X2, ..., Xn denote a random sample of size n from a given distribution. The statistic

X̄ = (X1 + X2 + ... + Xn)/n = Σ_{i=1}^n Xi/n

is called the mean of the random sample, and the statistic

S² = Σ_{i=1}^n (Xi − X̄)²/n

is called the variance of the random sample.
Remark. Many writers do not define the variance of a random sample as we have done but, instead, they take S² = Σ (Xi − X̄)²/(n − 1). There are good reasons for doing this. But a certain price has to be paid, as we shall indicate. Let x1, x2, ..., xn denote experimental values of the random variable X that has the p.d.f. f(x) and the distribution function F(x). Thus we may look upon x1, x2, ..., xn as the experimental values of a random sample of size n from the given distribution. The distribution of the sample is then defined to be the distribution obtained by assigning a probability of 1/n to each of the points x1, x2, ..., xn. This is a distribution of the discrete type. The corresponding distribution function will be denoted by Fn(x) and it is a step function. If we let fx denote the number of sample values that are less than or equal to x, then Fn(x) = fx/n, so that Fn(x) gives the relative frequency of the event X ≤ x in the set of n observations. The function Fn(x) is often called the "empirical distribution function" and it has a number of uses.

Because the distribution of the sample is a discrete distribution, the mean and the variance have been defined and are, respectively, Σ xi/n = x̄ and Σ (xi − x̄)²/n = s². Thus, if one finds the distribution of the sample and the associated empirical distribution function to be useful concepts, it would seem logically inconsistent to define the variance of a random sample in any way other than we have.
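A small computational sketch of Definition 3 and the Remark (ours; the five data values are hypothetical and numpy is assumed): it computes x̄, the sample variance s² with divisor n as defined above, the alternative divisor n − 1 mentioned in the Remark, and the empirical distribution function Fn(x).

import numpy as np

x = np.array([6.1, 5.8, 6.4, 6.0, 5.9])     # hypothetical observed sample values
n = len(x)

xbar = x.sum() / n                          # mean of the random sample
s2 = ((x - xbar) ** 2).sum() / n            # variance of the random sample (divisor n)
s2_alt = ((x - xbar) ** 2).sum() / (n - 1)  # the alternative divisor n - 1 of the Remark

def F_n(t):
    # empirical distribution function: relative frequency of the event X <= t
    return np.sum(x <= t) / n

print(xbar, s2, s2_alt, F_n(6.0))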
Random sampling distribution theory means the general problem
of finding distributions of functions of the items of a random sample.
Up to this point, the only method, other than direct probabilistic
arguments, of finding the distribution of a function of one or more
random variables is the distribution function technique. That is, if
X1, X2, ..., Xn are random variables, the distribution of Y = u(X1, X2, ..., Xn) is determined by computing the distribution function of Y,

G(y) = Pr [u(X1, X2, ..., Xn) ≤ y].
Even in what superficially appears to be a very simple problem, this
can be quite tedious. This fact is illustrated in the next paragraph.
Let X1, X2, X3 denote a random sample of size 3 from a distribution that is n(0, 1). Let Y denote the statistic that is the sum of the squares of the sample items. The distribution function of Y is given by

G(y) = Pr (X1² + X2² + X3² ≤ y).

If y < 0, then G(y) = 0. However, if y ≥ 0, then

G(y) = ∫∫∫_A (1/(2π)^{3/2}) exp [−(x1² + x2² + x3²)/2] dx1 dx2 dx3,
where A is the set of points (x1, x2, x3) interior to, or on the surface of, a sphere with center at (0, 0, 0) and radius equal to √y. This is not a simple integral. We might hope to make progress by changing to spherical coordinates:

x1 = ρ cos θ sin φ,  x2 = ρ sin θ sin φ,  x3 = ρ cos φ,

where ρ ≥ 0, 0 ≤ θ < 2π, 0 ≤ φ ≤ π. Then, for y ≥ 0,

G(y) = ∫_0^{√y} ∫_0^{2π} ∫_0^{π} (1/(2π)^{3/2}) e^{−ρ²/2} ρ² sin φ dφ dθ dρ
     = √(2/π) ∫_0^{√y} ρ² e^{−ρ²/2} dρ.

If we change the variable of integration by setting ρ = √w, we have

G(y) = ∫_0^y √(2/π) (√w/2) e^{−w/2} dw,

for y ≥ 0. Since Y is a random variable of the continuous type, the p.d.f. of Y is g(y) = G′(y). Thus

g(y) = (1/√(2π)) y^{3/2 − 1} e^{−y/2},  0 < y < ∞,
     = 0 elsewhere.

Because Γ(3/2) = (1/2)Γ(1/2) = (1/2)√π, and thus √(2π) = Γ(3/2)2^{3/2}, we see that Y is χ²(3).

The problem that we have just solved points up the desirability of having, if possible, various methods of determining the distribution of a function of random variables. We shall find that other techniques are available and that often a particular technique is vastly superior to the others in a given situation. These techniques will be discussed in subsequent sections.

Example 1. Let the random variable Y be distributed uniformly over the unit interval 0 < y < 1; that is, the distribution function of Y is

G(y) = 0,  y ≤ 0,
     = y,  0 < y < 1,
     = 1,  1 ≤ y.

Suppose that F(x) is a distribution function of the continuous type which is strictly increasing when 0 < F(x) < 1. If we define the random variable X by the relationship Y = F(X), we now show that X has a distribution which corresponds to F(x). If 0 < F(x) < 1, the inequalities X ≤ x and F(X) ≤ F(x) are equivalent. Thus, with 0 < F(x) < 1, the distribution function of X is

Pr (X ≤ x) = Pr [F(X) ≤ F(x)] = Pr [Y ≤ F(x)]

because Y = F(X). However, Pr (Y ≤ y) = G(y), so we have

Pr (X ≤ x) = G[F(x)] = F(x),  0 < F(x) < 1.

That is, the distribution function of X is F(x).

This result permits us to simulate random variables of different types. This is done by simply determining values of the uniform variable Y, usually with a computer. Then, after determining the observed value Y = y, solve the equation y = F(x), either explicitly or by numerical methods. This yields the inverse function x = F^{-1}(y). By the preceding result, this number x will be an observed value of X that has distribution function F(x).

It is also interesting to note that the converse of this result is true. If X has distribution function F(x) of the continuous type, then Y = F(X) is uniformly distributed over 0 < y < 1. The reason for this is, for 0 < y < 1, that

Pr (Y ≤ y) = Pr [F(X) ≤ y] = Pr [X ≤ F^{-1}(y)].

However, it is given that Pr (X ≤ x) = F(x), so

Pr (Y ≤ y) = F[F^{-1}(y)] = y,  0 < y < 1.

This is the distribution function of a random variable that is distributed uniformly on the interval (0, 1).
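The simulation idea described above is easy to carry out; the sketch below is ours (numpy assumed; the exponential target and the seed are arbitrary choices): uniform values y are generated and the equation y = F(x) is solved explicitly for x, just as in Exercise 4.9.

import numpy as np

rng = np.random.default_rng(1)
y = rng.uniform(size=5)        # observed values of the uniform variable Y

# Target distribution function: F(x) = 1 - e^{-x}, 0 < x < infinity.
# Solving y = F(x) explicitly gives x = -ln(1 - y).
x = -np.log(1.0 - y)

print(y)
print(x)   # an observed random sample from the distribution with p.d.f. e^{-x}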
EXERCISES

4.1. Show that

Σ_{i=1}^n (Xi − X̄)² = Σ_{i=1}^n Xi² − nX̄²,

where X̄ = Σ_{i=1}^n Xi/n.

4.2. Find the probability that exactly four items of a random sample of size 5 from the distribution having p.d.f. f(x) = (x + 1)/2, −1 < x < 1, zero elsewhere, exceed zero.

4.3. Let X1, X2, X3 be a random sample of size 3 from a distribution that is n(6, 4). Determine the probability that the largest sample item exceeds 8.
4.2 Transformations of Variables of the Discrete Type
4.11. Let Xl and X2denote a random sample of size 2 from a distribution
with p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere. Find the distribution
function and the p.d.f. of Y = X l/X2 ·
4.8. Let Xl and X 2denote a random sample of size 2 from a distribution
that is n(O, 1). Find the p.d.f. of Y = Xi + X~. Hint. In the double integral
representing Pr (Y :-::; y), use polar coordinates.
Let X have the Poisson p.d.f.

f(x) = μ^x e^{−μ}/x!,  x = 0, 1, 2, ...,
     = 0 elsewhere.
As we have done before, let d denote the space d = {x; x = 0,1,2, ...},
so that d is the set where f(x) > 0. Define a new random variable
Y by Y = 4X. We wish to find the p.d.f. of Y by the change-of-variable
technique. Let y = 4x. We call y = 4x a transformation from x to y,
and we say that the transformation maps the space d onto the space
f!lj = {y; y = 0, 4, 8, 12, ... }. The space f!lj is obtained by transforming
each point in d in accordance with y = 4x. We note two things about
this transformation. It is such that to each point in d there corresponds
one, and only one, point in f!lj; and conversely, to each point in f!lj there
corresponds one, and only one, point in d. That is, the transformation
y = 4x sets up a one-to-one correspondence between the points of d and
those of f!lj. Any function y = u(x) (not merely y = 4x) that maps a
space d (not merely our d) onto a space f!lj (not merely our f!lj) such that
there is a one-to-one correspondence between the points of d and those
of f!lj is called a one-to-one transformation. It is important to note that a
one-to-one transformation, y = u(x), implies that y is a single-valued
function of x, and that x is a single-valued function of y. In our case this
is obviously true, since y = 4x and x = (±)y.
Our problem is that of finding the p.d.f. g(y) of the discrete type of
random variable Y = 4X. Now g(y) = Pr (Y = y). Because there is a
one-to-one correspondence between the points of d and those of f!lj, the
event Y = y or 4X = y can occur when, and only when, the event X
= (i)y occurs. That is, the two events are equivalent and have the
same probability. Hence
g(y) = Pr (Y = y) = Pr (X = y/4) = μ^{y/4} e^{−μ}/(y/4)!,  y = 0, 4, 8, ...,
     = 0 elsewhere.
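A quick numerical illustration of this change of variable (ours; the value of μ is arbitrary and scipy is assumed): the p.d.f. of Y = 4X assigns to each point y = 0, 4, 8, ... exactly the Poisson probability of x = y/4 and is zero elsewhere.

from scipy.stats import poisson

mu = 2.3                                   # arbitrary Poisson mean

def g(y):
    # p.d.f. of Y = 4X: nonzero only at y = 0, 4, 8, ...
    if y % 4 != 0:
        return 0.0
    return poisson.pmf(y // 4, mu)

for y in (0, 4, 5, 8, 12):
    print(y, g(y))
print(sum(g(4 * x) for x in range(200)))   # the probabilities sum to (essentially) 1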
The foregoing detailed discussion should make the subsequent text
easier to read. Let X be a random variable of the discrete type, having
p.d.f. f(x). Let d denote the set of discrete points, at each of which
f(x) > 0, and let y = u(x) define a one-to-one transformation that maps
d onto f!lj. If we solve y = u(x) for x in terms of y, say, x = w(y), then
for each y ∈ ℬ, we have x = w(y) ∈ 𝒜. Consider the random variable Y = u(X). If y ∈ ℬ, then x = w(y) ∈ 𝒜, and the events Y = y [or u(X) = y] and X = w(y) are equivalent. Accordingly, the p.d.f. of Y is

g(y) = Pr (Y = y) = Pr [X = w(y)] = f[w(y)],  y ∈ ℬ,
     = 0 elsewhere.
An alternative method of finding the distribution of a function of
one or more random variables is called the change of variable technique.
There are some delicate questions (with particular reference to random
variables of the continuous type) involved in this technique, and these
make it desirable for us first to consider special cases.
4.9. The four values Yl = 0.42, Y2 = 0.31, Y3 = 0.87, and Y4 = 0.65
represent the observed values of a random sample of size n = 4 from the
uniform distribution over 0 < Y < 1. Using these four values, find a corre-
sponding observed random sample from a distribution that has p.d.f. f(x) =
e- x , 0 < X < 00, zero elsewhere.
4.10. Let Xv X2
denote a random sample of size 2 from a distribution
with p.d.f. f(x) = 1-, 0 < x < 2, zero elsewhere. Find the joint p.d.f, of Xl
and X 2
. Let Y = Xl + X 2 . Find the distribution function and the p.d.f.
of Y.
4.12. Let Xl, X 2
, X 3 be a random sample of size 3 from a distribution
having p.d.f. f(x) = 5x4 , 0 < X < 1, zero elsewhere. Let Y be the largest
item in the sample. Find the distribution function and p.d.f. of Y.
4.13. Let Xl and X2 be items of a random sample from a distribution
with p.d.f. f(x) = 2x, 0 < x < 1, zero elsewhere. Evaluate the conditional
probability Pr (Xl < X21Xl < 2X2) ·
4.7. Let yi = a + bxi, i = 1, 2, ..., n, where a and b are constants. Find ȳ = Σ yi/n and s_y² = Σ (yi − ȳ)²/n in terms of a, b, x̄ = Σ xi/n, and s_x² = Σ (xi − x̄)²/n.
4.4. Let X1, X2 be a random sample from the distribution having p.d.f. f(x) = 2x, 0 < x < 1, zero elsewhere. Find Pr (X1/X2 ≤ 1/2).
4.5. If the sample size is n = 2, find the constant c so that S² = c(X1 − X2)².
4.6. If xi = i, i = 1, 2, ..., n, compute the values of x̄ = Σ xi/n and s² = Σ (xi − x̄)²/n.
Example 2. Let X1 and X2 be two stochastically independent random variables that have Poisson distributions with means μ1 and μ2, respectively. The joint p.d.f. of X1 and X2 is

f(x1, x2) = μ1^{x1} μ2^{x2} e^{−μ1−μ2}/(x1! x2!),  x1 = 0, 1, 2, 3, ..., x2 = 0, 1, 2, 3, ...,

and is zero elsewhere. Thus the space 𝒜 is the set of points (x1, x2), where each of x1 and x2 is a nonnegative integer. We wish to find the p.d.f. of Y1 = X1 + X2. If we use the change of variable technique, we need to define a second random variable Y2. Because Y2 is of no interest to us, let us choose it in such a way that we have a simple one-to-one transformation. For example, take Y2 = X2. Then y1 = x1 + x2 and y2 = x2 represent a one-to-one transformation that maps 𝒜 onto

ℬ = {(y1, y2); y2 = 0, 1, ..., y1 and y1 = 0, 1, 2, ...}.
Note that, if (y1, y2) ∈ ℬ, then 0 ≤ y2 ≤ y1. The inverse functions are given by x1 = y1 − y2 and x2 = y2. Thus the joint p.d.f. of Y1 and Y2 is

g(y1, y2) = μ1^{y1−y2} μ2^{y2} e^{−μ1−μ2}/[(y1 − y2)! y2!],  (y1, y2) ∈ ℬ,

and is zero elsewhere. Consequently, the marginal p.d.f. of Y1 is given by

g1(y1) = Σ_{y2=0}^{y1} g(y1, y2)
       = (e^{−μ1−μ2}/y1!) Σ_{y2=0}^{y1} [y1!/((y1 − y2)! y2!)] μ1^{y1−y2} μ2^{y2}
       = (μ1 + μ2)^{y1} e^{−μ1−μ2}/y1!,  y1 = 0, 1, 2, ...,

and is zero elsewhere. That is, Y1 = X1 + X2 has a Poisson distribution with parameter μ1 + μ2.
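Example 2 can be checked numerically: summing on y2 is a discrete convolution, so convolving the two Poisson probability functions should reproduce the Poisson(μ1 + μ2) probabilities. The sketch below is ours; the means and the truncation point are arbitrary, and numpy and scipy are assumed.

import numpy as np
from scipy.stats import poisson

mu1, mu2 = 1.7, 3.1
k = np.arange(60)                          # truncation; the omitted tail is negligible

p1, p2 = poisson.pmf(k, mu1), poisson.pmf(k, mu2)
g1 = np.convolve(p1, p2)[:60]              # the sum on y2: a discrete convolution

print(np.max(np.abs(g1 - poisson.pmf(k, mu1 + mu2))))   # essentially zero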
Example 1. Let X have the binomial p.d.f.

f(x) = [3!/(x!(3 − x)!)] (2/3)^x (1/3)^{3−x},  x = 0, 1, 2, 3,
     = 0 elsewhere.

We seek the p.d.f. g(y) of the random variable Y = X². The transformation y = u(x) = x² maps 𝒜 = {x; x = 0, 1, 2, 3} onto ℬ = {y; y = 0, 1, 4, 9}. In general, y = x² does not define a one-to-one transformation; here, however, it does, for there are no negative values of x in 𝒜 = {x; x = 0, 1, 2, 3}. That is, we have the single-valued inverse function x = w(y) = √y (not −√y), and so

g(y) = f(√y) = [3!/((√y)!(3 − √y)!)] (2/3)^{√y} (1/3)^{3−√y},  y = 0, 1, 4, 9,
     = 0 elsewhere.
EXERCISES

4.14. Let X have the p.d.f. f(x) = 1/3, x = 1, 2, 3, zero elsewhere. Find the p.d.f. of Y = 2X + 1.
There are no essential difficulties involved in a problem like the following. Let f(x1, x2) be the joint p.d.f. of two discrete-type random variables X1 and X2 with 𝒜 the (two-dimensional) set of points at which f(x1, x2) > 0. Let y1 = u1(x1, x2) and y2 = u2(x1, x2) define a one-to-one transformation that maps 𝒜 onto ℬ. The joint p.d.f. of the two new random variables Y1 = u1(X1, X2) and Y2 = u2(X1, X2) is given by

g(y1, y2) = f[w1(y1, y2), w2(y1, y2)],  (y1, y2) ∈ ℬ,
          = 0 elsewhere,

where x1 = w1(y1, y2) and x2 = w2(y1, y2) are the single-valued inverses of y1 = u1(x1, x2) and y2 = u2(x1, x2). From this joint p.d.f. g(y1, y2) we may obtain the marginal p.d.f. of Y1 by summing on y2 or the marginal p.d.f. of Y2 by summing on y1.

Perhaps it should be emphasized that the technique of change of variables involves the introduction of as many "new" variables as there were "old" variables. That is, suppose that f(x1, x2, x3) is the joint p.d.f. of X1, X2, and X3, with 𝒜 the set where f(x1, x2, x3) > 0. Let us say we seek the p.d.f. of Y1 = u1(X1, X2, X3). We would then define (if possible) Y2 = u2(X1, X2, X3) and Y3 = u3(X1, X2, X3), so that y1 = u1(x1, x2, x3), y2 = u2(x1, x2, x3), y3 = u3(x1, x2, x3) define a one-to-one transformation of 𝒜 onto ℬ. This would enable us to find the joint p.d.f. of Y1, Y2, and Y3, from which we would get the marginal p.d.f. of Y1 by summing on y2 and y3.
For every 0 < a < b < 8, the event a < Y < b will occur when, and only when, the event (1/2)∛a < X < (1/2)∛b occurs, because there is a one-to-one correspondence between the points of 𝒜 and ℬ. Thus

Pr (a < Y < b) = Pr ((1/2)∛a < X < (1/2)∛b) = ∫_{∛a/2}^{∛b/2} 2x dx.

Let us rewrite this integral by changing the variable of integration from x to y by writing y = 8x³ or x = (1/2)∛y. Now

dx/dy = 1/(6y^{2/3}),

and, accordingly, we have

Pr (a < Y < b) = ∫_a^b [(∛y)/2] (2) [1/(6y^{2/3})] dy = ∫_a^b 1/(6y^{1/3}) dy.

Since this is true for every 0 < a < b < 8, the p.d.f. g(y) of Y is the integrand; that is,

g(y) = 1/(6y^{1/3}),  0 < y < 8,
     = 0 elsewhere.
This can be proved by comparing the coefficients of xk
in each member of
the identity (1 + x)nl (1 + x)n2 == (1 + x)nl +n2 •
4.19. Let Xl and X2 be stochastically independent random variables of
the discrete type with joint p.d.f. f1(X1)f2(X2), (Xl> x2) E.JJ!. Let Y1 = Ul(X1)
and Y2 = U2(X2) denote a one-to-one transformation that maps d onto fJI.
Show that Y1 = Ul(Xl) and Y2 = U2(X2) are stochastically independent.
4.15. uti»; x2) = (i) Xl +X2(t)2-Xl-X2, (Xl> X2) = (0,0), (0, 1), (I, 0), (I, 1).
zero elsewhere, is the joint p.d.I. of Xl and X 2 , find the joint p.d.f. of Yl =
Xl - X2 and Y2 = Xl + X 2•
4.16. Let X have the p.d.f. f(x) = a)x, X = 1, 2, 3, .. " zero elsewhere.
Find the p.d.f. of Y = X3.
4.17. Let Xl and X2 have the joint p.dJ.f(xl, x 2) = Xlx2/36, Xl = 1,2,3
and X 2 = 1,2,3, zero elsewhere. Find first the joint p.d.f, of Yl = X1X2 and
Y2 = X 2' and then find the marginal p.d.f. of Yr-
4.18. Let the stochastically independent random variables Xl and X2
be b(nl>P) and b{n2 , P), respectively. Find the joint p.d.f. of Yl = Xl + X2
and Y2 = X 2 , and then find the marginal p.d.f, of Yl . Hint. 'Use the fact
that
Here 𝒜 is the space {x; 0 < x < 1}, where f(x) > 0. Define the random variable Y by Y = 8X³ and consider the transformation y = 8x³. Under the transformation y = 8x³, the set 𝒜 is mapped onto the set ℬ = {y; 0 < y < 8}, and, moreover, the transformation is one-to-one.
4.3 Transformations of Variables of the Continuous Type

In the preceding section we introduced the notion of a one-to-one transformation and the mapping of a set 𝒜 onto a set ℬ under that transformation. Those ideas were sufficient to enable us to find the distribution of a function of several random variables of the discrete type. In this section we shall examine the same problem when the random variables are of the continuous type. It is again helpful to begin with a special problem.

Example 1. Let X be a random variable of the continuous type, having p.d.f.

f(x) = 2x,  0 < x < 1,
     = 0 elsewhere.
It is worth noting that we found the p.d.f. of the random variable
Y = 8X3 by using a theorem on the change of variable in a definite
integral. However, to obtain g(y) we actually need only two things:
(1) the set fJI of points y where g(y) > 0 and (2) the integrand of the
integral on y to which Pr (a < Y < b) is equal. These can be found by
two simple rules:
(a) Verify that the transformation y = 8x3 maps d = {x; 0 < x < 1}
onto fJI = {y; 0 < y < 8} and that the transformation is one-to-one.
(b) Determine g(y) on this set ℬ by substituting (1/2)∛y for x in f(x) and then multiplying this result by the derivative of (1/2)∛y. That is,

g(y) = f((1/2)∛y) d[(1/2)∛y]/dy = 1/(6y^{1/3}),  0 < y < 8,
     = 0 elsewhere.
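Rules (a) and (b) translate directly into a few lines of symbolic computation. The sketch below is ours (sympy assumed); it reproduces g(y) = 1/(6y^{1/3}) for 0 < y < 8.

from sympy import Rational, diff, simplify, symbols

y = symbols('y', positive=True)

f = lambda x: 2 * x                    # p.d.f. of X on 0 < x < 1
w = (y / 8) ** Rational(1, 3)          # inverse transformation x = w(y), 0 < y < 8

g = simplify(f(w) * abs(diff(w, y)))   # rule (b): f(w(y)) multiplied by |dw/dy|
print(g)                               # should reduce to 1/(6*y**(1/3))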
We shall accept a theorem in analysis on the change of variable in a
definite integral to enable us to state a more general result. Let X be a
random variable of the continuous type having p.d.f.J(x). Let d be the
one-dimensional space where j(x) > O. Consider the random variable
Y = u(X), where Y = u(x) defines a one-to-one transformation that
maps the set d onto the set !!lJ. Let the inverse of Y = u(x) be denoted
by x = w(y), and let the derivative dx/dy = w'(y) be continuous and
not vanish for all points y in !!lJ. Then the p.d.f. of the random variable
Y = u(X) is given by
g(y) = f[w(y)] |w′(y)|,  y ∈ ℬ,
     = 0 elsewhere,
where Iw'(y) I represents the absolute value of w'(y). This is precisely
what we did in Example 1 of this section, except there we deliberately
chose y = 8x3 to be an increasing function so that
FIGURE 4.1
dx/dy = w′(y) = 1/(6y^{2/3}),  0 < y < 8,

is positive, and hence

|w′(y)| = |1/(6y^{2/3})| = 1/(6y^{2/3}),  0 < y < 8.
Henceforth we shall refer to dx/dy = w'(y) as the Jacobian (denoted by
]) of the transformation. In most mathematical areas, ] = w'(y) is
referred to as the Jacobian of the inverse transformation x = w(y), but
in this book it will be called the Jacobian of the transformation, simply
for convenience.
= 0 elsewhere.
We are to show that the random variable Y = -21n X has a chi-square
distribution with 2 degrees of freedom. Here the transformation is y = u(x) = −2 ln x, so that x = w(y) = e^{−y/2}. The space 𝒜 is 𝒜 = {x; 0 < x < 1}, which the one-to-one transformation y = −2 ln x maps onto ℬ = {y; 0 < y < ∞}.
The Jacobian of the transformation is
J = dx/dy = w′(y) = −(1/2) e^{−y/2}.
Accordingly, the p.d.f. g(y) of Y = −2 ln X is

g(y) = f(e^{−y/2}) |J| = (1/2) e^{−y/2},  0 < y < ∞,
     = 0 elsewhere,
a p.d.f. that is chi-square with 2 degrees of freedom. Note that this problem
was first proposed in Exercise 3.40.
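A Monte Carlo check of Example 2 (ours; the sample size and seed are arbitrary, and numpy and scipy are assumed): values of −2 ln X, with X uniform on (0, 1), should be indistinguishable from a chi-square variable with 2 degrees of freedom.

import numpy as np
from scipy.stats import chi2, kstest

rng = np.random.default_rng(3)
x = rng.uniform(size=100_000)
y = -2.0 * np.log(1.0 - x)      # same distribution as -2 ln X, and avoids log(0)

print(y.mean())                              # about 2, the mean of chi-square(2)
print(kstest(y, chi2(df=2).cdf).pvalue)      # no evidence against chi-square(2)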
Example 2. Let X have the p.d.f.

f(x) = 1,  0 < x < 1,
This method of finding the p.d.f. of a function of one random variable
of the continuous type will now be extended to functions of two random
variables of this type. Again, only functions that define a one-to-one
transformation will be considered at this time. Let YI = uI(XV x2 ) and
Y2 = u2(XV x2) define a one-to-one transformation that maps a (two-
dimensional) set d in the xlx2-plane onto a (two-dimensional) set !!lJ in
the YlY2-plane. If we express each of Xl and X2 in terms of YI and Y2' we
can write Xl = wI(Yv Y2), X2 = w2(Yv Y2)' The determinant of order 2,
oXI oX I
0YI 0Y2
oX 2 oX 2
0YI 0Y2
is called the]acobian of the transformation and will be denoted by the
symbol]. It will be assumed that these first-order partial derivatives are
continuous and that the Jacobian] is not identically equal to zero in!!lJ.
An illustrative example may be desirable before we proceed with the
extension of the change of variable technique to two random variables
of the continuous type.
Example 3. Let d be the set d = {(Xl' X 2); 0 < Xl < 1,0 < X2 < I},
depicted in Figure 4.1. We wish to determine the set!!lJ in the YIY2-plane that
is the mapping of d under the one-to-one transformation
YI = uI(XV X2) = Xl + X2,
Y2 = u2(XV X2) = Xl - X2,
and we wish to compute the Jacobian of the transformation. Now
Xl = wI(Yv Y2) = ·HYI + Y2),
X2 = w2(Yv Y2) = ·HYI - Y2)'
FIGURE 4.2    FIGURE 4.3
We wish now to change variables of integration by writing Yl =
ul(Xl> x2), Y2 = u2(Xl> x2), or Xl = wl(Yl> Y2), X2 = w2(Yl> Y2). It has
been proved in analysis that this change of variables requires
into
o< Xl < 1, 0 < X2 < 1,
o< X < 1,
!(X) = 1,
cp(X1> x2) = !(XI)!(X2 ) = 1,
= 0 elsewhere.
= 0 elsewhere,
Thus for every set B in flJ,
g(Yl> Y2) = q:>[wl(Yl> Y2), w2(Yl> Y2)] Ill,
= 0 elsewhere.
Pr [(Yl> Y2) E B] = JB Jq:>[wl(Yl> Y2), W
2(Yl , Y2)]II! dYl dY2'
which implies that the joint p.d.f. g(Yl> Y2) of Yl and Y2 is
Accordingly, the marginal p.d.f. gl(Yl) of Y1 can be obtained from the
joint p.d.f. g(YI, Y2) in the usual manner by integrating on Y2' Five
examples of this result will be given.
Example 4. Let the random variable X have the p.d.f.
and let X1> X2 denote a random sample from this distribution. The joint
p.d.f. of Xl and X2is then
Consider the two random variables YI = XI + X2and Y2 = XI - X2' We
~ish to find the joint p.d.f of YI and Y 2 • Here the two-dimensional space d
III the xlx2-planeis that of Example 3 of this section. The one-to-one trans-
formation YI = Xl + X 2, Y2 = Xl - X 2 maps d onto the space fjJ of that
into
into
into
o= ·t(YI + Y2),
1 = ·t(YI + Y2),
o= ·t(YI - Y2),
1 = !(YI - Y2)'
.
Accordingly, fjJ is as shown in Figure 4.2. Finally,
Pr [(Yl, Y 2 ) E B] = Pr [(Xl> X 2 ) E A]
= LJq:>(Xl> x2) dXl dx2.
J = det[ ∂x1/∂y1  ∂x1/∂y2 ; ∂x2/∂y1  ∂x2/∂y2 ] = det[ 1/2  1/2 ; 1/2  −1/2 ] = −1/2.
We now proceed with the problem of finding the joint p.d.f. of two
functions of two continuous-type random variables. Let Xl and X 2 be
random variables of the continuous type, having joint p.d.f. q:>(xv x2 ) .
Let d be the two-dimensional set in the x1x2-plane where q:>(xv x2) > O.
Let Yl = u1(X v X 2 ) be a random variable whose p.d.f. is to be found.
If YI = U l(xv x2) and Y2 = u2(Xl> x2) define a one-to-one transformation
of d onto a set flJ in the Y1Y2-plane (with nonidentically vanishing
Jacobian), we can find, by use of a theorem in analysis, the joint p.d.f.
of Yl = u1(X v X 2) and Y2 = u2(Xv X 2). Let A be a subset of d, and
let B denote the mapping of A under the one-to-one transformation
(see Figure 4.3). The events (Xv X 2 ) EA and (Yl, Y 2 ) E B are equivalent.
Hence
To determine the set fjJ in the YIY2-plane onto which d is mapped under the
transformation, note that the boundaries of d are transformed as follows
into the boundaries of fjJ;
Xl = 0
example. Moreover, the Jacobian of that transformation has been shown to
be I = -toThus
g(Yl> Y2) = <pH(Yl + Y2)' t(Yl - Y2)]III
= f[t(Yl + Y2)]f[t(Yl - Y2)] III = t, (Yl> Y2) E ffl,
= 0 elsewhere.
Because ffl is not a product space, the random variables Y1 and Y2 are
stochastically dependent. The marginal p.d.f. of Y1 is given by
gl(Yl) = f",g(Yl' Y2) dY2'
o< Yl < 00,0 < Y2 < 1,
In accordance with Theorem 1, Section 2.4, the random variables are sto-
chastically independent. The marginal p.d.f. of Y2 is
o < Yl < 00, 0 < Y2 < I} in the Y1Y2-plane. The joint p.d.f. of Yl and Y2
is then
g(y1, y2) = y1 [1/(Γ(α)Γ(β))] (y1y2)^{α−1} [y1(1 − y2)]^{β−1} e^{−y1}
          = [y2^{α−1}(1 − y2)^{β−1}/(Γ(α)Γ(β))] y1^{α+β−1} e^{−y1},
          = 0 elsewhere.
If we refer to Figure 4.2, it is seen that

g1(y1) = ∫_{−y1}^{y1} (1/2) dy2 = y1,  0 < y1 ≤ 1,
       = ∫_{y1−2}^{2−y1} (1/2) dy2 = 2 − y1,  1 < y1 < 2,
       = 0 elsewhere.

In a similar manner, the marginal p.d.f. g2(y2) is given by

g2(y2) = ∫_{−y2}^{y2+2} (1/2) dy1 = y2 + 1,  −1 < y2 ≤ 0,
       = ∫_{y2}^{2−y2} (1/2) dy1 = 1 − y2,  0 < y2 < 1,
       = 0 elsewhere.
Example 5. Let Xl and X 2 be two stochastically independent random
variables that have gamma distributions and joint p.d.f.
This p.d.f. is that of the beta distribution with parameters α and β. Since g(y1, y2) ≡ g1(y1)g2(y2), it must be that the p.d.f. of Y1 is

g1(y1) = [1/Γ(α + β)] y1^{α+β−1} e^{−y1},  0 < y1 < ∞,
       = 0 elsewhere,

which is that of a gamma distribution with parameter values of α + β and 1.
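A simulation sketch for Example 5 (ours; α, β, the sample size, and the seed are arbitrary, and numpy and scipy are assumed): Y1 = X1 + X2 should behave like a gamma variable with parameters α + β and 1, Y2 = X1/(X1 + X2) like a beta variable with parameters α and β, and the two should be essentially uncorrelated.

import numpy as np
from scipy.stats import beta, gamma, kstest

rng = np.random.default_rng(5)
a, b, N = 2.0, 3.5, 100_000
x1 = rng.gamma(shape=a, scale=1.0, size=N)
x2 = rng.gamma(shape=b, scale=1.0, size=N)

y1, y2 = x1 + x2, x1 / (x1 + x2)
print(kstest(y1, gamma(a + b).cdf).pvalue)   # consistent with gamma(alpha + beta, 1)
print(kstest(y2, beta(a, b).cdf).pvalue)     # consistent with beta(alpha, beta)
print(np.corrcoef(y1, y2)[0, 1])             # near zero, as independence requires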
It is an easy exercise to show that the mean and the variance of Y 2 ,
which has a beta distribution with parameters 0: and ,8, are, respectively,
I =I~ ~ 1= 2;
Example 6. Let Yl = t(Xl - X 2 ) , where Xl and X2 are stochastically
independent random variables, each being X2(2).
The joint p.d.£. of Xl and
X 2 is
2 0:,8
a = (0: + ,8 + 1)(0: + ,8)2
0:
JL =--,
0:+,8
f(X l)f(X2) = lexp ( - Xl ; X2),
= 0 elsewhere.
Let Y 2 = X 2 so that Yl = t(xl - x2), Y2 = X2, or Xl = 2Yl + Y2' X2 = Y2
define a one-to-one transformation from d = {(Xl> X2); 0 < Xl < 00,
o < X2 < oo} onto f11J = {(Yl> Y2); - 2Yl < Y2 and 0 < Y2' -00 < Yl < co}.
The Jacobian of the transformation is
)
1 a-l B-le- x 1 - x 2 0 < Xl < 00,0 < X 2 < 00,
f(Xl> X2 = r(o:)r(,8) Xl X2 '
zero elsewhere where 0: > 0, ,8 > O. Let Yl = Xl + X2 and Y2 =
X l/(X1 + X 2)' We shall show that Y1and Y2are stochas~ically independent.
The space d is, exclusive of the points on the coordinate axes, the first
quadrant of the xlx2-plane. Now
Yl = ul(Xl> X2) = Xl + X2,
Xl
Y2 = u2(Xl> X2) = + X
Xl 2
may be written Xl = Y1Y2, X2 = Yl(1 - Y2)' so
I =  Y2 YlI= - u, '1= O.
1 - Y2 -Yl
The transformation is one-to-one, and it maps d onto d1J = {(Yl'Y2);
hence the joint p.d.f. of Y1 and Y2 is
g(y Y) - 8 e-Y 1 -Y2
1>2-4 '
are necessary. For instance, consider the important normal case in which we
desire to determine X so that it is nCO, 1). Of course, once X is determined,
other normal variables can then be obtained through X by the transformation
Z = aX + fL.
= 0 elsewhere.
gl(YI) = te-1Y11, -00 < Yl < 00.
This p.d.f. is now frequently called the double exponential p.d.f.
Example 7. In this example a rather important result is established.
Let Xl and X 2
be stochastically independent random variables of the
continuous type with joint p.d.f. !1(Xl)!2(X2) that is positive on the two-
dimensional space d. Let Yl = Ul(XI), a function of Xl alone, and
Y
2
= U2(X2), a function of X 2 alone. We assume for the present that
Yl = UI(XI), Y2 = U2(X2) define a one-to-one transformation from d onto a
two-dimensional set f!B in the YIY2-plane. Solving for Xl and X 2 in terms of
Yl and Y2' we have Xl = Wl(Yl) and X2 = W2(Y2), so
Thus the p.d.f. of Y1 is given by
gl(Yl) = f'" te-Y1
-Y2 dY2 = teY1
,
-2Yl
= fa'" -!e-Y1 -Y2 dY2 = te-Y1
,
or
-00 < Yl < 0,
os YI < 00,
To simulate normal variables, Box and Muller suggested the follow-
ing scheme. Let Y1> YZ be a random sample from the uniform distri-
bution over 0 < y < 1. Define X1 and X2 by

X1 = (−2 ln Y1)^{1/2} cos (2πY2),
X2 = (−2 ln Y1)^{1/2} sin (2πY2).
The corresponding transformation is one-to-one and maps {(Yl> yz);
o< YI < 1,0 < Yz < I} onto {(Xl> xz); -00 < Xl < 00, -00 < Xz < co]
except for sets involving Xl = 0 and X z = 0, which have probability
zero. The inverse transformation is given by
y1 = exp [−(x1² + x2²)/2],
y2 = (1/(2π)) arctan (x2/x1).
This has the Jacobian
J = det[ −x1 exp(−(x1² + x2²)/2)   −x2 exp(−(x1² + x2²)/2) ;
         −(1/(2π)) x2/(x1² + x2²)   (1/(2π)) x1/(x1² + x2²) ]
  = −(1/(2π)) exp [−(x1² + x2²)/2].
Since the joint p.d.f. of Y1 and Y2 is 1 on 0 < y1 < 1, 0 < y2 < 1, and zero elsewhere, the joint p.d.f. of X1 and X2 is

(1/(2π)) exp [−(x1² + x2²)/2],  −∞ < x1 < ∞, −∞ < x2 < ∞.

That is, X1 and X2 are stochastically independent random variables, each being n(0, 1).
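The Box-Muller scheme just described translates directly into code; the sketch below is ours (numpy assumed; the sample size and seed are arbitrary) and checks the first two moments and the correlation of the output.

import numpy as np

rng = np.random.default_rng(8)
N = 100_000
y1 = 1.0 - rng.uniform(size=N)     # uniform on (0, 1]; avoids log(0)
y2 = rng.uniform(size=N)

x1 = np.sqrt(-2.0 * np.log(y1)) * np.cos(2.0 * np.pi * y2)
x2 = np.sqrt(-2.0 * np.log(y1)) * np.sin(2.0 * np.pi * y2)

print(x1.mean(), x1.std(), x2.mean(), x2.std())   # approximately 0, 1, 0, 1
print(np.corrcoef(x1, x2)[0, 1])                  # approximately 0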
g(Y1> Y2) = !1[WI(YI)]!2[W2(Y2)]lw~(YI)W;(Y2)J,
= 0 elsewhere.
However, from the procedure for changing variables in the case of one
random variable, we see that the marginal probability density functions
of YI and Y2 are, respectively, gl(YI) = !1[WI(Yl)]lw~(YI)1 and g2(Y2)
!2[W2(Y2)] !W;(Y2)! for u, and Y2 in some appropriate sets. Consequently,
g(Y1> Y2) == gl(YI)g2(Y2)'
Thus, summarizing, we note that, if Xl and X 2 are stochastically independent
random variables, then YI = Ul(XI) and Y2 = U2(X2) are also stochastically
independent random variables. It has been seen that the result holds if Xl
and X 2
are of the discrete type; see Exercise 4.19.
Example 8. In the simulation of random variables using uniform random
variables, it is frequently difficult to solve Y = F(x) for x. Thus other methods
IW~(Yl) 0 I '()'( );p 0
J = 0 W;(Y2) = WI YI W2 Y2 .
Hence the joint p.d.f. of Yl and Y2 is
EXERCISES
4.20. Let X have the p.d.f. f(x) = x2/9,
0 < x < 3, zero elsewhere. Find
the p.d.f. of Y = X3.
4.21. If the p.d.f. of X is f(x) = 2xe- x2, 0 < x < 00, zero elsewhere,
determine the p.d.f. of Y = X2.
4.22. Let Xl' X 2 be a random sample from the normal distribution
nCO, 1). Show that the marginal p.d.f. of YI = XI/X2 is the Cauchy p.d.J.
I'(l - t)f(1 + t), -1 < t < 1. Hint. In the integral representing M(t), let
y = (1 + e-X)-l.
4.29. Let X have the uniform distribution over the interval (-7T/2, 7T/2).
Show that Y = tan X has a Cauchy distribution.
4.30. Let Xl and X 2 be two stochastically independent random variables
of the :ontinuous type with probability density functions f(xl
) and g(x2
) ,
respectively. Show that the p.d.f. hey) of Y = Xl + X 2
can be found by the
convolution formula,
h(y) = ∫_{−∞}^{∞} f(y − w) g(w) dw.
-00 < W < 00, °< v < 00,
Hint. Let Y2 = X 2 and take the p.d.f. of X2 to be equal to zero at X 2 = O.
Then determine the joint p.d.f. of YI and Y2' Be sure to multiply by the
absolute value of the Jacobian.
4.23. Find the mean and variance of the beta distribution considered in
Example 5. Hint. From that example, we know that
io
l
ya-I(1 - y)Ii-1 dy = r(a)r(,8)
f(a + ,8)
for all a > 0, ,8 > O.
4.24. Determine the constant c in each of the following so that eachf(x)
is a beta p.d.f
(a) f(x) = cx(1 - X)3, 0 < X < 1, zero elsewhere.
(b) f(x) = cx4(1 - X)5, 0 < X < 1, zero elsewhere.
(c) f(x) = cx2(1
- X)8, 0 < X < 1, zero elsewhere.
4.25. Determine the constant c so that f(x) = cx(3 - X)4, 0 < X < 3,
zero elsewhere, is a p.d.f.
4.26. Show that the graph of the beta p.d.f. is symmetric about the
vertical line through x = t if a = ,8.
4.27. Show, for k = 1, 2, ... , n, that
f (k _ 1)~~n _ k)! zk-I(1 - z)n-k dz = :~: (:)PX(1 - p)n-x.
This demonstrates the relationship between the distribution functions of
the beta and binomial distributions.
4.28. Let X have the logisticP.d.j.f(x) = e- x/(1 + e- X
)2, -00 < x < 00.
(a) Show that the graph of f(x) is symmetric about the vertical axis
through x = o.
(b) Find the distribution function of X.
(c) Show that the moment-generating function M(t) of X is
4.31. Let Xl and X 2 be two stochastically independent normal random
variables, each with mean zero and variance one (possibly resulting from a
Box-Muller transformation). Show that
Zl = iLl + alXV
Z2 = iL2 + pa2X I + a2Vf"=P2X2,
where 0 < aI, 0 < a2, and 0 < P < 1, have a bivariate normal distribution
with respective parameters t-«. iL2' at a~, and p.
4.32. Let Xl and X 2 denote a random sample of size 2 from a distribution
that is nip: a
2
) . Let YI = Xl + X2 and Y2 = Xl - X 2
• Find the joint
p.d.f. of YI and Y2 and show that these random variables are stochastically
independent.
4.33. Let Xl and X2 denote a random sample of size 2 from a distribution
that is neiL, a
2
) . Let YI = Xl + X 2 and Y2 = Xl + 2X . Show that the
• • 2
joint p.d.f. of YI and Y2 is bivariate normal with correlation coefficient
3/VlO.
4.4 The t and F Distributions
It is the purpose of this section to define two additional distributions
quite useful in certain problems of statistical inference. These are called,
respectively, the (Student's) t distribution and the F distribution.
Let W denote a random variable that is nCO, 1); let V denote a
~andom variable that is x2
(r); and let Wand V be stochastically
independent, Then the joint p.d.f. of Wand V, say cp(w, v), is the
product of the p.d.f. of Wand that of V or
φ(w, v) = [1/√(2π)] e^{−w²/2} [1/(Γ(r/2) 2^{r/2})] v^{r/2−1} e^{−v/2},  −∞ < w < ∞, 0 < v < ∞,
        = 0 elsewhere.
= 0 elsewhere.
Define a new random variable T by writing
The marginal p.d.f. of T is then
F = V/r1
vt-,
and we propose finding the p.d.f. gl(f) of F. The equations
f
-_ u/r1 ,
z = v,
v/r2
define a one-to-one transformation that maps the set d' = {(u, v);
o < u < 00,0 < v < oo] onto the set!!lJ = {(f, z); 0 < f < 00,0 < z
< co], Since u = (rIfr2)zf, v = z, the absolute value of the Jacobian of
the transformation is III = (r1/r2)z. The joint p.d.f. g(f, z) of the
random variables F and Z = V is then
1 (r1Zf)rl/2-1
g(f, z) = r(r1/2)r(r2/2)2(r1 +r2)/2 ~ zT2/
2
-
1
= 0 elsewhere.
We define the new random variable
[ z ("Ii )]"l
Z
x exp - - - + 1 -,
2 r2 "2
provided that (f, z) E!!lJ, and zero elsewhere. The marginal p.d.f. gl(f)
of F is then
has the immediately preceding p.d.f. gl(t). The distribution of the ran-
dom variable T is usually called a t distribution. It should be observed
that a t distribution is completely determined by the parameter r, the
number of degrees of freedom of the random variable that has the
chi-square distribution. Some approximate values of
Pr (T ≤ t) = ∫_{−∞}^t g1(w) dw

for selected values of r and t, can be found in Table IV in Appendix B.
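The definition of T can be checked by simulation; the sketch below is ours (numpy and scipy assumed; r, the sample size, and the seed are arbitrary). It also estimates Pr (|T| > 2.228) for r = 10, the probability asked for in Exercise 4.34.

import numpy as np
from scipy.stats import t, kstest

rng = np.random.default_rng(11)
r, N = 10, 200_000
w = rng.standard_normal(N)         # n(0, 1)
v = rng.chisquare(r, N)            # chi-square with r degrees of freedom, independent of w

T = w / np.sqrt(v / r)
print(kstest(T, t(df=r).cdf).pvalue)   # consistent with Student's t, r degrees of freedom
print(np.mean(np.abs(T) > 2.228))      # about 0.05 when r = 10 (Exercise 4.34)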
Next consider two stochastically independent chi-square random
variables V and V having r1 and r2 degrees of freedom, respectively.
The joint p.d.f. ep(u, v) of V and V is then
1
ep(u v) = url/2-1vr2/2-1e-(U+v)/2
, r (r1/2)r(r2/2)2(r1 +r 2)/2 '
o < u < 00, 0 < v < 00,
-00 < t < 00.
u=v
and
w
t=--
VV[r
s.(t) = f~00 g(t, u) du
= foo 1 u(r+1)/2-1 exp [-~ (1 + ~)l duo
o V271"Yr(r/2)2r/2 2 "
In this integral let z = u[l + (t2
/r)J/2, and it is seen that
foo 1 (2Z )(r+1)/2-1 ( 2 )
gl(t) = Jo V2Trrr(r/2)2r/2 1 + t2/r e-
Z
1 + t2/r dz
q(r + 1)/2J 1
= VTrrr(r/2) (1 + t2/r)(r+1)/2'
Thus, if W is n(O, 1), if V is X2(r), and if Wand V are stochastically
independent, then
W
T=-=
VV/r
g(t, u) = <pe~, u) III
_ 1 Ur/2-1 exp [-~ (1 + ~)J vu,
- y"i;r(r/2)2r/2 2 r vr
-00 < t < 00, 0 < u < 00,
define a one-to-one transformation that maps d' = {(w, v); -00 < w
< 00,0 < v < co} onto!!lJ = {(t,u); -00 < t < 00,0 < u < co}, Since
w = tVu/vr, v = u, the absolute value of the Jacobian of the trans-
formation is III = VU/Vr. Accordingly, the joint p.d.f. of T and V =
V is given by
W
T=--·
VV/r
The change-of-variable technique will be used to obtain the p.d.f. gl(t)
of T. The equations
If we change the variable of integration by writing
z(rd )
Y=- -+1,
2 r2
it can be seen that
_ (rJJ (r1/r2)Tl/
2(f)Tl/2- 1 ( 2y )<Tl+T2)/2-1 e:»
g1(f) - Jo r(r1/2)r(r2/2)2<T1 +T2)/2 r.tt:« + 1
4.38. Let T = WjvVj" where the stochastically independent variables
W and V are, respectively, normal with mean zero and variance 1 and
chi-square with, degrees of freedom. Show that P has an F distribution
with parameters '1 = 1 and '2 = r. Hint. What is the distribution of the
numerator of T2?
where F has an F distribution with parameters '1 and '2. has a beta distri-
bution.
4.39. Show that the t distribution with, = 1 degree of freedom and the
Cauchy distribution are the same.
4.40. Show that
o < f < 00,
( 2 )
x dy
rd/r2 + 1
r[(r1 + r2)/2](r1/r2)Tl/2 (f)T1/2-1
= r(r1/2)r(r2/2) (1 + rd/r2)<T1 +T
2)/2,
= 0 elsewhere.
Accordingly, if U and V are stochastically independent chi-square
variables with r 1 and r 2 degrees of freedom, respectively, then
4.41. Let Xl' X 2 be a random sample from a distribution having the p.d.f.
f(x) = e-x , 0 < x < 00, zero elsewhere. Show that Z = X 1/X2 has an F
distribution.
F = U/r1
V/r2
has the immediately preceding p.d.f. g1(f). The distribution of this ran-
dom variable is usually called an F distribution. It should be observed
that an F distribution is completely determined by the two parameters
r1 and r2' Table V in Appendix B gives some approximate values of
Pr (F 5, f) = f:g1(W) dw
4.5 Extensions of the Change-of-Variable Technique
In Section 4.3 it was seen that the determination of the joint p.d.f.
of two functions of two random variables of the continuous type was
essentially a corollary to a theorem in analysis having to do with the
change of variables in a twofold integral. This theorem has a natural
extension to n-fold integrals. This extension is as follows. Consider an
integral of the form
for selected values of r1> r2, and f.
EXERCISES
4.34. Let T have a t distribution with 10 degrees of freedom. Find
Pr (ITI > 2.228) from Table IV.
4.35. Let T have a t distribution with 14 degrees of freedom. Determine
b so that Pr (- b < T < b) = 0.90.
4.36. Let F have an F distribution with parameters zj and '2' Prove that
1jF has an F distribution with parameters z, and z..
4.37. If F has an F distribution with parameters r; = 5 and r, = 10, find
a and b so that Pr (F 5, a) = 0.05 and Pr (F 5, b) = 0.95, and, accordingly,
Pr (a < F < b) = 0.90. Hint. Write Pr (F 5, a) = Pr (ljF ~ 1ja)
1 - Pr (l/F 5, 1ja), and use the result of Exercise 4.36 and Table V.
taken over a subset A of an n-dimensional space .xl. Let
together with the inverse functions
define a one-to-one transformation that maps .xl onto f!J in the
Y1> Y2' ... , Ynspace (and hence maps the subset A of.xl onto a subset B
= 0 elsewhere.
Let
of ~). Let the first partial derivatives of the inverse functions be con-
tinuous and let the n by n determinant (called the Jacobian)
-00 < x < 00,
Yk+1 0 0 Y1
0 Yk+1 0 Y2
J= = ?fi+1'
0 0 Yk+1 Yk
-Yk+1 -Yk+1 -Yk+1 (1 - Y1 - ... - Yk)
Hence the joint p.d.f. of YI> ... , Yk, Yk+1 is given by
We now consider some other problems that are encountered when
transforming variables. Let X have the Cauchy p.d.f.
1
f(x) = 7T(1 + x2)'
Y~;'+1·"+ak+l-1Y~1-1 .. 'Y~k-1(1 - Y1 - ... - Yk)ak+l- l e- lIk+l
r(a1) ... r(ak)r(ak+1)
provided that (Y1' .. " Yk' Yk+ 1) Ef1# and is equal to zero elsewhere. The joint
p.d.f. of YI> .. " Y k is seen by inspection to be given by
( ) r(a1 + ... + ak+1) a -1 a -1(1 )a -1
gYI>···,Yk = Y11 ···Ykk - Y1 _ ... - Yk k+1 ,
r(a1) ... r(ak+ 1)
when 0 < Yl' i = 1, .. " k, Y1 + ... + Yk < 1, while the function g is equal
to zero elsewhere. Random variables Y I> .. " Yk that have a joint p.d.f. of
this form are said to have a Dirichlet distribution with parameters al> ... ,
ak' ak+1> and any such g(Y1"'" Yk) is called a Dirichlet p.d.f. It is seen, in
the special case of k = 1, that the Dirichlet p.d.f. becomes a beta p.d.f.
Moreover, it is also clear from the joint p.d.f. of YI> , Y k» Y k+1 that
Yk+1 has a gamma distribution with parameters a1 + + ak + ak+1 and
fJ = 1 and that Y k+1 is stochastically independent of YI> Y 2, 0 • • , Yk'
The associated transformation maps d = {(XI> ••• , Xk +1); 0 < Xl < 00,
i = 1, ... , k + I} onto the space
f1# = {(YI>"" Yk' Yk+1); 0 < Yl' i = 1, ... , k,
Y1 + ... + Yk < 1,0 < Yk+1 < co].
The single-valued inverse functions are Xl = Y1Yk+1, ...• Xk = YkYk+1,
X k+1 = Yk+1(1 - Y1 - ... - Yk), so that the Jacobian is
and let Y = X2. We seek the p.d.f. g(y) of Y. Consider the trans-
formation Y = x2
• This transformation maps the space of X, d =
{x; -00 < x < co], onto ~ = {y; 0 :s Y < co}, However, the transfor-
mationisnot one-to-one. To eachYE~, with the exceptionofy = 0, there
correspond two points xEd. For example, if Y = 4, we may have either
x = 2 or x = - 2. In such an instance, we represent d as the union of
two disjoint sets A 1 and A 2 such that Y = x2 defines a one-to-one trans-
0< Xl < 00,
i = 1,2, ... , k,
s,
Yl = X '
Xl + X2 + ... + k+1
and Y k
+1
= Xl + X 2 + ... + X k +1 denote k + 1 new random variables.
when (Y1' Y2' . 0 0 , Yn) E ~, and is zero elsewhere.
Example 1. Let XI> X 2 , ••• , X k + 1 be mutually stochastically indepen-
dent random variables, each having a gamma distribution with fJ = 1. The
joint p.d.f. of these variables may be written as
Whenever the conditions of this theorem are satisfied, we can determine
the joint p.d.f. of n functions of n random variables. Appropriate
changes of notation in Section 4.3 (to indicate n-space as opposed to
2-space) is all that is needed to show that the joint p.d.f. of the random
variables YI = UI(Xl>X2 " , .,Xn), Y2 = U 2(X I, X 2, · · .,Xn),···, Y, =
un(Xl> X 2, ... , Xn)-where the joint p.d.f. of Xl> X 2, .. 0' X; is
cp(Xl> . 0 • , xn)-is given by
f·A•fcp(Xl> X2' . 0 0 ' xn)dX1 dX20 • • dxn
= rB' fcp[w1(Yl> .. 0' Yn),W2(Yl> .... Yn)• . . ., Wn(Y1' 0 •• , Yn)]
x IJI dYI dY2' .. dYn'
oX1 OX1 oX1
°Y1 Oy2 oYn
oX2 OX2 OX2
J= °Y1 °Y2 oYn
oXn OX~, oXn
Oy1 °Y2 Oyn
not vanish identically in ~. Then
t = 1,2, ... , k,
Xl = wI,(Yv Y2' , Yn),
X2 = w2,(Yv Y2, , Yn),
OWli OWli oWli
°YI °Y2 °Yn
OW2i OW21 OW2i
11= °YI °Y2 °Yn t = 1,2, ... , k,
OWni OWn' OWni
0YI 0Y2 0Yn
be not identically equal to zero in flj. From a consideration of the prob-
ability of the union of k mutually exclusive events and by applying the
change of variable technique to the probability of each of these events,
define a one-to-one transformation of each A, onto flj. Thus, to each
point in flj there will correspond exactly one point in each of AI, A 2 ,
... ,A". Let
denote the k groups of n inverse functions, one group for each of these k
transformations. Let the first partial derivatives be continuous and let
each
why we sought to partition d (or a modification of d) into two disjoint
subsets such that the transformation Y = x2
maps each onto the same
flj. Had there been three inverse functions, we would have sought to
partition d (or a modified form of d) into three disjoint subsets,
and so on. It is hoped that this detailed discussion will make the follow-
ing paragraph easier to read.
Let <p(xv x2 , ••• , xn) be the joint p.d.f. of X v X 2' ... , X n' which are
random variables of the continuous type. Let d be the n-dimensional
space where <P(XI' x2, ... , xn) > 0, and consider the transformation
YI = UI(XI,X2," .,Xn),Y2 = U2(XI,X2,·· .,xn),·· ·,Yn = Un(XI,X2,·· .,xn),
which maps d onto flj in the YI,Y2, ... , Yn space. To each point of d
there will correspond, of course, but one point in flj; but to a point in
flj there may correspond more than one point in d. That is, the trans-
formation may not be one-to-one. Suppose, however, that we can
represent d as the union of a finite number, say k, of mutually disjoint
sets A v A 2 , ••• , A" so that
Y E B.
o < Y < 00,
g(y) 7T(1 + y)yY'
g(y) = .1;- [f( - Vy) + f(yY)],
2vy
With f(x) the Cauchy p.d.f. we have
1
In the first of these integrals, let x = - vy. Thus the Jacobian, say1 v
is -lj2yY; moreover, the set As is mapped onto B. In the second
integral let x = yY. Thus the Jacobian, say 12' is Ij2yY; moreover,
the set A 4
is also mapped onto B. Finally,
Pr (Y E B) = tf(-VY) - 2~Ydy + L
f(yY) 2~dy
= f[f( - vy) + f(Vy)] .1;- dy.
B 2vy
Hence the p.d.f. of Y is given by
= 0 elsewhere.
In the preceding discussion of a random variable of the c~ntinuous
type, we had two inverse functions, x = - yY and x = Vy. That is
formation that maps each of Al and A 2 onto flj. If we take Al to be
{x; -00 < x < O} and A 2
to be {x; 0 ::; x < co], we see that Al is mapped
onto {y; 0 < Y < oo], whereas A 2 is mapped onto {y; 0 ::; Y < co], and
these sets are not the same. Our difficulty is caused by the fact that
x = 0 is an element of d. Why, then, do we not return to the Cauchy
p.d.f. and take f(O) = O? Then our new dis d = {-oo < x < 00 but
x -# O}. We then take zl, = {x; -00 < x < 0}~ndA2 = {x; 0 < x < co],
Thus Y = x2 , with the inverse x = - vY, maps Al onto flj =
{y; 0 < Y < oo] and the transformation is one-to-one. Moreover, the
transformation Y = x2 , with inverse x = yY, maps A 2 on~o flj =
{y; 0 < Y < co] and the transformation is one-to-one. Consider the
probability Pr (Y E B), where B c flj. Let As = {x; x = - yY, Y E B}
C Al and let A 4
= {x; X = yY, Y E B} C A 2 • Then Y E B when and
only when X E As or X E A 4 • Thus we have
Pr (Y E B) = Pr (X E As) + Pr (X E A 4)
= f f(x) dx + f f(x) dx.
Aa A4
-00 < Yl < 00, 0 < Y2 < 00.
EXERCISES
We can make three interesting observations. The mean Yl of our random
sample is n(O, -!-); Y 2 , which is twice the variance of our sample, is i"(I); and
the two are stochastically independent. Thus the mean and the variance of
our sample are stochastically independent.
Xl =1= x2}. This space is the union of the two disjoint sets A l = {(Xl, X2);
X2 > Xl} and A2 = {(Xl' X2); X2 < Xl}' Moreover, our transformation now
defines a one-to-one transformation of each A" i = 1,2, onto the new fll =
{(Yv Y2); -00 < Yl < 00, 0 < Y2 < co}. We can now find the joint p.d.f., say
g(yv Y2), of the mean Y1 and twice the variance Y2 of our random sample. An
easy computation shows that 1111 = 1121 = I/V2Y2' Thus
it can be seen that the joint p.d.f. of Y1 = u1(X 1> X 2,···, X n), Y 2 =
U2(X
1
, X 2, ... , X n),... , Yn = un(X 1> X 2, ... , X n), is given by
k
g(Y1> Y2' ... , Yn) = .L J;<P[Wll(Y1>"" Yn), ... , Wni(Y1> ... , Yn)],
t=l
provided that (Y1> Y2' ... , Yn) E fll, and equals zero elsewhere. The p.d.f.
of any Yi' say Y1> is then
gl(Yl) = j:00 ••• j:00 g(Yl' Y2' .•. , Yn) dY2' .. dYn'
An illustrative example follows.
Example 2. To illustrate the result just obta~ne~, t~ke n = ~ and let
Xv X
2
denote a random sample of size 2 from a dIstrIbutIOn that ISn(O, 1).
The joint p.d.f. of x, and X 2 is
1 (Xi + x~)
f(xv x2) = 2n exp ---2-' -00 < Xl < 00, -00 < X2 < 00.
Let Y1 denote the mean and let Y2 denote twice the variance of the random
sample. The associated transformation is
Xl + X2
Yl =--z-'
(Xl - X2)2
Y2 = 2 .
This transformation maps d = {(xv X2); -00 < Xl < 00, -00 < X2 < co}
onto fll = {(Yv Y2); -00 < Yl < 00, 0 s Y2 < co}. But. the tran~formation
is not one-to-one because, to each point in fll, exclusive of points where
Y2 = 0, there correspond two points in d. In fact, the two groups of inverse
functions are
and
4.42. Let Xv X 2 , X s denote a random sample from a normal distribution
n(O, 1). Let the random variables Yv Y 2 , Ys be defined by
Xl = Yl cos Y2sin v; X 2= Yl sin Y2sin Ys, X s = Yl cos Ys,
where 0 s v, < 00,0 s Y2 < 2n, 0 s v, s tr. Show that Yv Y 2 , v, are
mutually stochastically independent.
4.43. Let Xv X 2 , X s denote a random sample from the distribution
having p.d.f. f(x) = e- x , 0 < X < 00, zero elsewhere. Show that
Moreover the set d cannot be represented as the union of two disjoint sets,
each of which under our transformation maps onto fll. Our difficulty is
caused by those points of d that lie on the line whose equation is X2 = Xl'
At each of these points we have Y2 = O. However, v:e c~n define f~Xl' X2)
to be zero at each point where Xl = X2' We can do this without .alten.ng the
distribution of probability, because the probability measure of this set ISzero.
Thus we have a new d = {(xv X2); -00 < Xl < 00, -00 < X2 < 00, but
are mutually stochastically independent.
4.44. Let Xv X 2 , ••• , X, be r mutually stochastically independent
gamma variables with parameters a = at and f3 = 1, i = 1, 2, .. " r,
r~spec~ively. Show that Yl = Xl + X 2 + ... + X, has a gamma distribu-
tion with parameters a = al + ... + a r and f3 = 1.Hint. Let Y2 = X 2 + ...
+ X" Ys = X s + ... + X Y = X
r, ... , T ,.
4.45. Let Y v ... , Yk have a Dirichlet distribution with parameters
av ••• , ak' ak+ r-
(a) Show that Yl has a beta distribution with parameters a = al and
f3 = a2 + ... + ak+l'
(b) Show that Yl + ... + Y" r :$ k, has a beta distribution with
parameters a = al + ... + aT and f3 = aT+ l + ... + ak+l'
(c) Show that Y l + Y2 , Ys + Y4 , Y5 , ••• , Y k , k :2: 5, have a Dirichlet
distribution with parameters al + a2' as + a4, a5, ... , ak' ak+ r- Hint. Recall
the definition of Y, in Example 1 and use the fact that the sum of several
stochastically independent gamma variables with f3 = 1 is a gamma
variable (Exercise 4.44).
4.46. Let Xl' X 2 , and Xs be three mutually stochastically independent
chi-square variables with rv r2 , and rs degrees of freedom, respectively.
(a) Show that Y l = X l/X2 and Y2 = Xl + X 2 are stochastically
independent and that Y2 is X2
(r l + r 2) .
(b) Deduce that
(1)
statistic of the random sample X X X It ill b h
. . V 2,··· , n' W e s own that
the joint p.d.f. of Y v Y2, .•• , Yn is given by
g(yv Y2' ... , Yn) = (nJ)!(Yl)!(Y2) ...!(Yn),
a < Yl < Y2 < ... < Yn < b,
= °elsewhere.
We sh~ll prove this only for the case n = 3, but the argument is seen to
be entirely general. With n = 3 the J' oint pdf f X X X I'S
. ' . . . 0 V 2, s
!(Xl)!x(X2)!(b
x)s)T
' hC.onslder a probability such as Pr (a < x, = X 2 < b,
a < a < . IS probability is given by
J:J:J::!(Xl)!(X2)!(xs) dXl dX2 dxs = 0,
and
are stochastically independent F variables.
4.47. Iff(x) = 1-, -1 < x < 1,zero elsewhere, is the p.d.f. of the random
variable X, find the p.d.f. of Y = X2.
4.48. If Xl' X2 is a random sample from a distribution that is n(O, 1),
find the joint p.d.f. of Yl = Xi + X~ and Y2 = X2 and the marginal p.d.f.
of Yl. Hint. Note thatthe space of Yl and Y2 is given by -vYl < Y2 < VYv
o< Yl < 00.
4.49. If X has the p.d.f. f(x) = t, -1 < x < 3, zero elsewhere, find the
p.d.f. of Y = X2. Hint. Here f!J = {y; 0 :$ Y < 9} and the event Y E B is
the union of two mutually exclusive events if B = {y; 0 < Y < 1}.
4.6 Distributions of Order Statistics
In this section the notion of an order statistic will be defined and
we shall investigate some of the simpler properties of such a statistic.
These statistics have in recent times come to play an important role in
statistical inference partly because some of their properties do not
depend upon the distribution from which the random sample is obtained.
Let Xv X 2 , •.• , X; denote a random sample from a distribution of
the continuous type having a p.d.f. !(x) that is positive, provided that
a < X < b. Let Y 1 be the smallest of these X" Y 2 the next X, in order
of magnitude, ... , and Yn the largest X,. That is, Y 1 < Y 2 < ... < Yn
represent Xl> X 2 , ••• , X n when the latter are arranged in ascending
order of magnitude. Then Y" i = 1,2, ... , n, is called the ith order
since
is defined in calculus to be zero. As has been pointed out we may
without altering the distribution of X X X define the J.' . t d f'
V 2, s, om p. . .
!(Xl)!(X2)!(xs) to be zero at all points (Xl' X2, xs) that have at least
two o.f their c~ordinates ~qual. Then the set .91, where !(Xl)!(X2)!(xs)
> 0, IS the umon of the SIX mutually disjoint sets:
Al = {(Xv X2, xs); a < Xl < X2 < Xs < b},
A 2 = {(Xv X2, xs); a < X2 < Xl < Xs < b},
As = {(Xl, X2, Xs); a < Xl < Xs < X2 < b},
A 4 = {(Xl, X2, Xs); a < X2 < Xs < Xl < b},
A 5 = {(Xv X2, Xs); a < Xs < Xl < X2 < b},
Ae = {(Xv X2, Xs); a < Xs < X2 < Xl < b}.
There are six of these sets because we can arrange X X • • I
I . . V 2, X s in precise y
3. = 6 ways. Consider the functions Yl = minimum of X X x· Y =
'ddl . . 1, 2, S' 2
rm e m magmtude of Xl, X2, Xs ; and Ys = maximum of Xl X X
Th f . , 2, a-
ese unctIons define one-to-one transformations that map each of
A v A 2 , ••• , ~e onto the same set f!4 = {(Yl' Y2' Ys); a < Yl < Y2 <
Ys < b}. The mverse functions are for points in A X = Y X _ Y
' . ' V I I , 2 - 2,
Xs = Ys; for pomts in A 2, they are Xl = Y2' X2 = Yv X
s = Ys; and so
on, for each of the remaining four sets. Then we have that
1 ° 0/
I, = ° 1 ° = 1
° 0 1
156
and
Distributions oj Functions oj Random Variables [Ch. 4 Sec. 4.6] Distributions oj Order Statistics
Accordingly,
157
010
12 = 1 0 0 =-1.
o 0 1
It is easily verified that the absolute value of each of the 3! = 6
Jacobians is +1. Thus the joint p.d.f. of the three order statistics
Yl = minimumofXl,X2,Xs; Y2 = middleinmagnitudeofX1>X2,XS;
Ys = maximum of Xl' X 2 , X s is
If x :::; a, F(x) = 0; and if b :::; z, F(x) = 1. Thus there is a unique median
m of the distribution with F(m) = t. Let Xl> X 2 , Xa denote a random
sample from this distribution and let Y1 < Y2 < Ya denote the order
statistics of the sample. We shall compute the probability that Y2 :::; m.
The joint p.d.f. of the three order statistics is
g(Yl' Y2, Ys) = Jll!(Yl)!(Y2)!(YS) + 1121!(Y2)!(Yl)!(YS) + ...
+ 1161!(Ys)!(Y2)!(Yl), a < Y1 < Y2 < Y3 < b,
= (31)!(Y1)!(Y2)!(Ys), a < Y1 < Y2 < Ys < b,
= 0 elsewhere.
a < Y1 < Y2 < ... < Yn < b,
Pr (Y2 s m) = 6 r{F(Y2)j(Y2) - [F(Y2)]2j(Y2}} dY2
= 6{[F(Y2)]2 _ [F(Y2)]a}m = ~.
2 3 a 2
1 - F(x) = F(b) - F(x)
= I:!(w) dw - I: !(w) dw
= I:f(w) dw.
F(x) = 0, x :::; a,
= I: f(w) dw, a < x < b,
= 1, b :::; x.
The procedure used in Example 1 can be used to obtain general
formulas for the marginal probability density functions of the order
statistics. We shall do this now. Let X denote a random variable of the
cont~nuous type having a p.d.f. j(x) that is positive and continuous,
provided that a < x < b, and is zero elsewhere. Then the distribution
function F(x) may be written
Accordingly, F'(x) = !(x), a < x < b. Moreover, if a < x < b,
Let X1> X 2 , ••• , X; denote a random sample of size n from this
distribution, and let Yl' Y2' ... , Yn denote the order statistics of this
random sample. Then the joint p.d.f. of Y1> Y2, .. " Yn is
g(Y1> Y2, .. " Yn) = nl!(Y1)!(Y2) .. -f(Yn),
= 0 elsewhere.
a < x < b.
F(x) = f:j(w) dw,
This is Equation (1) with n = 3.
In accordance with the natural extension of Theorem 1, Section 2.4,
to distributions of more than two random variables, it is seen that
the order statistics, unlike the items of the random sample, are sto-
chastically dependent.
Example 1. Let X denote a random variable of the continuous type with
a p.d.f. j(x) that is positive and continuous provided that a < x < b, and is
zero elsewhere. The distribution function F(x) of X may be written
h(Y2) = 6j(Y2) J:2 I:2j(Yl)j(Ya) dYl dYa,
= 6j(Y2)F(Y2)[1 - F(Y2)], a < Y2 < b,
= 0 elsewhere.
g(Y1' Y2' Ya) = 6j(Yl)j(Y2)j(Ya),
= 0 elsewhere.
The p.d.f. of Y2 is then
a < Yl < Y2 < Ya < b,
It will first be shown how the marginal p.d.f. of Yn may be expressed in
terms of the distribution function F(x) and the p.d.f. !(x) of the random
variable X. If a < Yn < b, the marginal p.d.f. of Yn is given by
gn(yn) = ∫ₐ^{yn} ··· ∫ₐ^{y3} ∫ₐ^{y2} n! f(y1) f(y2) ··· f(yn) dy1 dy2 ··· dy_{n−1}
= ∫ₐ^{yn} ··· ∫ₐ^{y3} n! F(y2) f(y2) ··· f(yn) dy2 ··· dy_{n−1},

since F(x) = ∫ₐˣ f(w) dw. Now

∫ₐ^{y3} F(y2) f(y2) dy2 = [F(y2)]²/2 |ₐ^{y3} = [F(y3)]²/2,

so that

gn(yn) = ∫ₐ^{yn} ··· ∫ₐ^{y4} n! {[F(y3)]²/2} f(y3) ··· f(yn) dy3 ··· dy_{n−1}.

If the successive integrations on y3, y4, ..., y_{n−1} are carried out, it is seen that

gn(yn) = n! [F(yn)]^{n−1}/(n − 1)! f(yn)
= n [F(yn)]^{n−1} f(yn),   a < yn < b,
= 0 elsewhere.

It will next be shown how to express the marginal p.d.f. of Y1 in terms of F(x) and f(x). We have, for a < y1 < b,

g1(y1) = ∫_{y1}^b ··· ∫_{y_{n−3}}^b ∫_{y_{n−2}}^b ∫_{y_{n−1}}^b n! f(y1) f(y2) ··· f(yn) dyn dy_{n−1} ··· dy2
= ∫_{y1}^b ··· ∫_{y_{n−3}}^b ∫_{y_{n−2}}^b n! f(y1) f(y2) ··· f(y_{n−1})[1 − F(y_{n−1})] dy_{n−1} ··· dy2.

But

∫_{y_{n−2}}^b [1 − F(y_{n−1})] f(y_{n−1}) dy_{n−1} = −[1 − F(y_{n−1})]²/2 |_{y_{n−2}}^b = [1 − F(y_{n−2})]²/2.

Upon completing the integrations, it is found that

g1(y1) = n[1 − F(y1)]^{n−1} f(y1),   a < y1 < b,
= 0 elsewhere.

Once it is observed that

∫ₐˣ [F(w)]^{α−1} f(w) dw = [F(x)]^α/α,   α > 0,

and that

∫_y^b [1 − F(w)]^{β−1} f(w) dw = [1 − F(y)]^β/β,   β > 0,

it is easy to express the marginal p.d.f. of any order statistic, say Yk, in terms of F(x) and f(x). This is done by evaluating the integral

gk(yk) = ∫ₐ^{yk} ··· ∫ₐ^{y2} ∫_{yk}^b ··· ∫_{y_{n−1}}^b n! f(y1) ··· f(yn) dyn ··· dy_{k+1} dy1 ··· dy_{k−1}.

The result is

(2)    gk(yk) = n!/[(k − 1)! (n − k)!] [F(yk)]^{k−1} [1 − F(yk)]^{n−k} f(yk),   a < yk < b,
       = 0 elsewhere.
Example 2. Let Y1 < Y2 < Y3 < Y4 denote the order statistics of a random sample of size 4 from a distribution having p.d.f.

f(x) = 2x,   0 < x < 1,
= 0 elsewhere.

We shall express the p.d.f. of Y3 in terms of f(x) and F(x) and then compute Pr (1/2 < Y3). Here F(x) = x², provided that 0 < x < 1, so that

g3(y3) = [4!/(2! 1!)] [F(y3)]² [1 − F(y3)] f(y3)
= 24 y3^5 (1 − y3²),   0 < y3 < 1,
= 0 elsewhere.

Thus

Pr (1/2 < Y3) = ∫_{1/2}^1 g3(y3) dy3
= ∫_{1/2}^1 24(y3^5 − y3^7) dy3 = 243/256.
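As a quick numerical check of Example 2 (a sketch, not part of the text; it assumes NumPy is available), note that F(x) = x² means X can be simulated as the square root of a uniform variable, so Pr (1/2 < Y3) can be estimated directly from the order statistics of many samples of size 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# F(x) = x^2 on (0, 1), so X = sqrt(U) with U uniform on (0, 1).
samples = np.sqrt(rng.uniform(size=(200_000, 4)))

# Y3 is the third order statistic (third smallest) of each sample of size 4.
y3 = np.sort(samples, axis=1)[:, 2]

print(np.mean(y3 > 0.5))   # close to 243/256
print(243 / 256)           # 0.94921875
```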
Finally, the joint p.d.f. of any two order statistics, say Yi < Yj, is as easily expressed in terms of F(x) and f(x). We have

gij(yi, yj) = ∫ₐ^{yi} ··· ∫ₐ^{y2} ∫_{yi}^{yj} ··· ∫_{y_{j−2}}^{yj} ∫_{yj}^b ··· ∫_{y_{n−1}}^b n! f(y1) ··· f(yn) dyn ··· dy_{j+1} dy_{j−1} ··· dy_{i+1} dy1 ··· dy_{i−1}.

Since, for γ > 0,

∫ₓ^y [F(y) − F(w)]^{γ−1} f(w) dw = −[F(y) − F(w)]^γ/γ |ₓ^y
= [F(y) − F(x)]^γ/γ,

it is found that

(3)    gij(yi, yj) = n!/[(i − 1)! (j − i − 1)! (n − j)!] [F(yi)]^{i−1} [F(yj) − F(yi)]^{j−i−1} [1 − F(yj)]^{n−j} f(yi) f(yj)

for a < yi < yj < b, and zero elsewhere.

Remark. There is an easy method of remembering a p.d.f. like that given in Formula (3). The probability Pr (yi < Yi < yi + Δi, yj < Yj < yj + Δj), where Δi and Δj are small, can be approximated by the following multinomial probability. In n independent trials, i − 1 outcomes must be less than yi [an event that has probability p1 = F(yi) on each trial]; j − i − 1 outcomes must be between yi + Δi and yj [an event with approximate probability p2 = F(yj) − F(yi) on each trial]; n − j outcomes must be greater than yj + Δj [an event with approximate probability p3 = 1 − F(yj) on each trial]; one outcome must be between yi and yi + Δi [an event with approximate probability p4 = f(yi)Δi on each trial]; and finally one outcome must be between yj and yj + Δj [an event with approximate probability p5 = f(yj)Δj on each trial]. This multinomial probability is

n!/[(i − 1)! (j − i − 1)! (n − j)! 1! 1!] p1^{i−1} p2^{j−i−1} p3^{n−j} p4 p5,

which is gij(yi, yj) Δi Δj.

Certain functions of the order statistics Y1, Y2, ..., Yn are important statistics themselves. A few of these are: (a) Yn − Y1, which is called the range of the random sample; (b) (Y1 + Yn)/2, which is called the midrange of the random sample; and (c) if n is odd, Y_{(n+1)/2}, which is called the median of the random sample.

Example 3. Let Y1, Y2, Y3 be the order statistics of a random sample of size 3 from a distribution having p.d.f.

f(x) = 1,   0 < x < 1,
= 0 elsewhere.

We seek the p.d.f. of the sample range Z1 = Y3 − Y1. Since F(x) = x, 0 < x < 1, the joint p.d.f. of Y1 and Y3 is

g13(y1, y3) = 6(y3 − y1),   0 < y1 < y3 < 1,
= 0 elsewhere.

In addition to Z1 = Y3 − Y1, let Z2 = Y3. Consider the functions z1 = y3 − y1, z2 = y3, and their inverses y1 = z2 − z1, y3 = z2, so that the corresponding Jacobian of the one-to-one transformation is

J = |∂y1/∂z1  ∂y1/∂z2; ∂y3/∂z1  ∂y3/∂z2| = |−1 1; 0 1| = −1.

Thus the joint p.d.f. of Z1 and Z2 is

h(z1, z2) = |−1| 6z1 = 6z1,   0 < z1 < z2 < 1,
= 0 elsewhere.

Accordingly, the p.d.f. of the range Z1 = Y3 − Y1 of the random sample of size 3 is

h1(z1) = ∫_{z1}^1 6z1 dz2 = 6z1(1 − z1),   0 < z1 < 1,
= 0 elsewhere.

EXERCISES

4.50. Let Y1 < Y2 < Y3 < Y4 be the order statistics of a random sample of size 4 from the distribution having p.d.f. f(x) = e^{−x}, 0 < x < ∞, zero elsewhere. Find Pr (3 ≤ Y4).

4.51. Let X1, X2, X3 be a random sample from a distribution of the continuous type having p.d.f. f(x) = 2x, 0 < x < 1, zero elsewhere. Compute
the probability that the smallest of these X, exceeds the median of the
distribution.
4.52. Let f(x) = 1/6, x = 1, 2, 3, 4, 5, 6, zero elsewhere, be the p.d.f. of a distribution of the discrete type. Show that the p.d.f. of the smallest item of a random sample of size 5 from this distribution is

g1(y1) = [(7 − y1)/6]^5 − [(6 − y1)/6]^5,   y1 = 1, 2, ..., 6,
zero elsewhere. Note that in this exercise the random sample is from a
distribution of the discrete type. All formulas in the text were derived under
the assumption that the random sample is from a distribution of the
continuous type and are not applicable. Why?
4.53. Let Y1 < Y2 < Ya < Y4 < Y5 denote the order statistics of a
random sample of size 5 from a distribution having p.d.f. f(x) = e- x
,
o < x < 00, zero elsewhere. Show that Zl = Y2 and Z2 = Y4 - Y2 are
stochastically independent. Hint. First find the joint p.d.f. of Y2 and Y 4 •
4.54. Let Y1 < Y2 < ... < Yn be the order statistics of a random
sample of size n from a distribution with p.d.f. f(x) = 1, 0 < x < 1, zero
elsewhere. Show that the kth order statistic Y" has a beta p.d.f. with param-
eters a = k and f3 = n - k + 1.
4.55. Let Y1 < Y2 < ... < Yn be the order statistics from a Weibull
distribution, Exercise 3.38, Section 3.3. Find the distribution function and
p.d.f. of Y 1 •
4.56. Find the probability that the range of a random sample of size 4 from the uniform distribution having the p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere, is less than 1/2.
4.57. Let Y1 < Y2 < Ya be the order statistics of a random sample of
size 3 from a distribution having the p.d.f. f(x) = 2x, 0 < x < 1, zero
elsewhere. Show that Zl = Y 1/Y2, Z2 = Y2/YS, and Zs = Ys are mutually
stochastically independent.
4.58. If a random sample of size 2 is taken from a distribution having
p.d.f. f(x) = 2(1 - x), 0 < x < 1, zero elsewhere, compute the probability
that one sample item is at least twice as large as the other.
4.59. Let Y1 < Y2 < Ya denote the order statistics of a random sample
of size 3 from a distribution with p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere.
Let Z = (Y1
+ Ys)/2 be the midrange of the sample. Find the p.d.f. of Z.
4.60. Let Y1 < Y2 denote the order statistics of a random sample of size 2 from n(0, σ²). Show that E(Y1) = −σ/√π. Hint. Evaluate E(Y1) by using the joint p.d.f. of Y1 and Y2, and first integrating on y1.
4.61. Let Y1 < Y2 be the order statistics of a random sample of size 2
from a distribution of the continuous type which has p.d.f, f(x) such that
f(x) > 0, provided x ~ 0, and f(x) = 0 elsewhere. Show that the stochastic
independence of Zl = Y1 and Z2 = Y2 - Y1 characterizes the gamma p.d.f.
f(x), which has parameters a = 1 and f3 > O. Hint. Use the change-of-
variable technique to find the joint p.d.f. of Zl and Z2 from that of Y1 and
Y2 • Accept the fact that the functional equation h(O)h(x + y) == h(x)h(y)
has the solution h(x) = c1e
c2
x , where C1 and C2 are constants.
4.62. Let Y denote the median of a random sample of size n = 2k + 1,
k a positive integer, from a distribution which is n(p., a2
) . Prove that the
graph of the p.d.f. of Y is symmetric with respect to the vertical axis through
Y = p. and deduce that E(Y) = p..
4.63. Let X and Y denote stochastically independent random variables
with respective probability density functions f(x) = 2x, 0 < x < 1, zero
elsewhere, and g(y) = 3y2
, 0 < Y < 1, zero elsewhere. Let U = min (X, Y)
and V = max (X, Y). Find the joint p.d.f. of U and V. Hint. Here the two
inverse transformations are given by x = u, Y = v and x = v, Y = U.
4.64. Let the joint p.d.f. of X and Y bef(x, y) = l./'-x(x + y), 0 < x < 1,
o < Y < 1, zero elsewhere. Let U = min (X, Y) and V = max (X, Y).
Find the joint p.d.f. of U and V.
4.65. Let X1, X2, ..., Xn be a random sample from a distribution of either type. A measure of spread is Gini's mean difference

G = Σ_{i<j} |Xi − Xj| / [n(n − 1)/2].

(a) If n = 10, find a1, a2, ..., a10 so that G = Σ_{i=1}^{10} ai Yi, where Y1, Y2, ..., Y10 are the order statistics of the sample.
(b) Show that E(G) = 2σ/√π if the sample arises from the normal distribution n(μ, σ²).
4.66. Let Y1 < Y2 < ... < Yn be the order statistics of a random
sample of size n from the exponential distribution with p.d.f. f(x) = e- x
,
o< x < 00, zero elsewhere.
(a) Show that Z1 = nY1, Z2 = (n − 1)(Y2 − Y1), Z3 = (n − 2)(Y3 − Y2), ..., Zn = Yn − Y_{n−1} are stochastically independent and that
each Z, has the exponential distribution.
(b) Demonstrate that all linear functions of Y v Y 2 , ••• , Y n , such as
n
2: a,Y" can be expressed as linear functions of stochastically independent
1
random variables.
4.67. In the Program Evaluation and Review Technique (PERT), we are
interested in the total time to complete a project that is comprised of a large
number of subprojects. For illustration, let X1, X2, X3 be three stochastically independent random times for three subprojects. If these subprojects are in series (the first one must be completed before the second starts, etc.), then we are interested in the sum Y = X1 + X2 + X3. If these are in parallel (can be worked on simultaneously), then we are interested in Z = max (X1, X2, X3). In the case each of these random variables has the uniform distribution with p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere, find (a) the p.d.f. of Y and (b) the p.d.f. of Z.

4.7 The Moment-Generating-Function Technique

The change-of-variable procedure has been seen, in certain cases, to be an effective method of finding the distribution of a function of several random variables. An alternative procedure, built around the concept of the moment-generating function of a distribution, will be presented in this section. This procedure is particularly effective in certain instances. We should recall that a moment-generating function, when it exists, is unique and that it uniquely determines the distribution of probability.

Let φ(x1, x2, ..., xn) denote the joint p.d.f. of the n random variables X1, X2, ..., Xn. These random variables may or may not be the items of a random sample from some distribution that has a given p.d.f. f(x). Let Y1 = u1(X1, X2, ..., Xn). We seek g(y1), the p.d.f. of the random variable Y1. Consider the moment-generating function of Y1. If it exists, it is given by

M(t) = E(e^{tY1}) = ∫_{−∞}^∞ e^{ty1} g(y1) dy1

in the continuous case. It would seem that we need to know g(y1) before we can compute M(t). That this is not the case is a fundamental fact. To see this consider

(1)    ∫_{−∞}^∞ ··· ∫_{−∞}^∞ exp [t u1(x1, ..., xn)] φ(x1, ..., xn) dx1 ··· dxn,

which we assume to exist for −h < t < h. We shall introduce n new variables of integration. They are y1 = u1(x1, x2, ..., xn), ..., yn = un(x1, x2, ..., xn). Momentarily, we assume that these functions define a one-to-one transformation. Let xi = wi(y1, y2, ..., yn), i = 1, 2, ..., n, denote the inverse functions and let J denote the Jacobian. Under this transformation, display (1) becomes

(2)    ∫_{−∞}^∞ ··· ∫_{−∞}^∞ e^{ty1} |J| φ[w1(y1, y2, ..., yn), ..., wn(y1, y2, ..., yn)] dy1 ··· dyn.

In accordance with Section 4.5,

|J| φ[w1(y1, y2, ..., yn), ..., wn(y1, y2, ..., yn)]

is the joint p.d.f. of Y1, Y2, ..., Yn. The marginal p.d.f. g(y1) of Y1 is obtained by integrating this joint p.d.f. on y2, ..., yn. Since the factor e^{ty1} does not involve the variables y2, ..., yn, display (2) may be written as

(3)    ∫_{−∞}^∞ e^{ty1} g(y1) dy1.

But this is by definition the moment-generating function M(t) of the distribution of Y1. That is, we can compute E{exp [t u1(X1, ..., Xn)]} and have the value of E(e^{tY1}), where Y1 = u1(X1, ..., Xn). This fact provides another technique to help us find the p.d.f. of a function of several random variables. For if the moment-generating function of Y1 is seen to be that of a certain kind of distribution, the uniqueness property makes it certain that Y1 has that kind of distribution. When the p.d.f. of Y1 is obtained in this manner, we say that we use the moment-generating-function technique.

The reader will observe that we have assumed the transformation to be one-to-one. We did this for simplicity of presentation. If the transformation is not one-to-one, let

xj = wji(y1, ..., yn),   j = 1, 2, ..., n,   i = 1, 2, ..., k,

denote the k groups of n inverse functions each. Let Ji, i = 1, 2, ..., k, denote the k Jacobians. Then

(4)    Σ_{i=1}^k |Ji| φ[w1i(y1, ..., yn), ..., wni(y1, ..., yn)]

is the joint p.d.f. of Y1, ..., Yn. Then display (1) becomes display (2) with |J| φ(w1, ..., wn) replaced by display (4). Hence our result is valid if the transformation is not one-to-one. It seems evident that we can treat the discrete case in an analogous manner with the same result.

It should be noted that the expectation, subject to its existence, of any function of Y1 can be computed in like manner. That is, if w(Y1) is a function of Y1, then

E[w(Y1)] = ∫_{−∞}^∞ w(y1) g(y1) dy1 = ∫_{−∞}^∞ ··· ∫_{−∞}^∞ w[u1(x1, ..., xn)] φ(x1, ..., xn) dx1 ··· dxn.
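To make the idea concrete, here is a small numerical sketch (not from the text; it assumes NumPy and a particular choice of u1): it estimates E{exp [t·u1(X1, X2)]} directly from draws of (X1, X2), without ever finding g(y1), and compares the result with the known moment-generating function of Y1 = X1 + X2 when X1 and X2 are independent n(0, 1), namely exp (t²).

```python
import numpy as np

rng = np.random.default_rng(1)
t = 0.7

# Y1 = u1(X1, X2) = X1 + X2 with X1, X2 independent n(0, 1).
x1 = rng.standard_normal(1_000_000)
x2 = rng.standard_normal(1_000_000)

# Left side: E{exp[t * u1(X1, X2)]} computed from the joint distribution of (X1, X2).
mgf_from_xs = np.mean(np.exp(t * (x1 + x2)))

# Right side: the m.g.f. of n(0, 2) evaluated at t, since X1 + X2 is n(0, 2).
mgf_known = np.exp(t**2)

print(mgf_from_xs, mgf_known)   # the two values should be close
```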
We shall now give some examples and prove some theorems where we use the moment-generating-function technique. In the first example, to emphasize the nature of the problem, we find the distribution of a rather simple statistic both by a direct probabilistic argument and by the moment-generating-function technique.

Example 1. Let the stochastically independent random variables X1 and X2 have the same p.d.f.

f(x) = x/6,   x = 1, 2, 3,
= 0 elsewhere;

that is, the p.d.f. of X1 is f(x1) and that of X2 is f(x2); and so the joint p.d.f. of X1 and X2 is

f(x1) f(x2) = x1 x2/36,   x1 = 1, 2, 3,   x2 = 1, 2, 3,
= 0 elsewhere.

A probability, such as Pr (X1 = 2, X2 = 3), can be seen immediately to be (2)(3)/36 = 1/6. However, consider a probability such as Pr (X1 + X2 = 3). The computation can be made by first observing that the event X1 + X2 = 3 is the union, exclusive of the events with probability zero, of the two mutually exclusive events (X1 = 1, X2 = 2) and (X1 = 2, X2 = 1). Thus

Pr (X1 + X2 = 3) = Pr (X1 = 1, X2 = 2) + Pr (X1 = 2, X2 = 1)
= (1)(2)/36 + (2)(1)/36 = 4/36.

More generally, let y represent any of the numbers 2, 3, 4, 5, 6. The probability of each of the events X1 + X2 = y, y = 2, 3, 4, 5, 6, can be computed as in the case y = 3. Let g(y) = Pr (X1 + X2 = y). Then the table

y       2      3      4      5      6
g(y)   1/36   4/36   10/36  12/36  9/36

gives the values of g(y) for y = 2, 3, 4, 5, 6. For all other values of y, g(y) = 0. What we have actually done is to define a new random variable Y by Y = X1 + X2, and we have found the p.d.f. g(y) of this random variable Y. We shall now solve the same problem, and by the moment-generating-function technique.

Now the moment-generating function of Y is

M(t) = E(e^{t(X1 + X2)})
= E(e^{tX1} e^{tX2})
= E(e^{tX1}) E(e^{tX2}),

since X1 and X2 are stochastically independent. In this example X1 and X2 have the same distribution, so they have the same moment-generating function; that is,

E(e^{tX1}) = E(e^{tX2}) = (1/6)e^t + (2/6)e^{2t} + (3/6)e^{3t}.

Thus

M(t) = [(1/6)e^t + (2/6)e^{2t} + (3/6)e^{3t}]²
= (1/36)e^{2t} + (4/36)e^{3t} + (10/36)e^{4t} + (12/36)e^{5t} + (9/36)e^{6t}.

This form of M(t) tells us immediately that the p.d.f. g(y) of Y is zero except at y = 2, 3, 4, 5, 6, and that g(y) assumes the values 1/36, 4/36, 10/36, 12/36, 9/36, respectively, at these points where g(y) > 0. This is, of course, the same result that was obtained in the first solution. There appears here to be little, if any, preference for one solution over the other. But in more complicated situations, and particularly with random variables of the continuous type, the moment-generating-function technique can prove very powerful.

Example 2. Let X1 and X2 be stochastically independent with normal distributions n(μ1, σ1²) and n(μ2, σ2²), respectively. Define the random variable Y by Y = X1 − X2. The problem is to find g(y), the p.d.f. of Y. This will be done by first finding the moment-generating function of Y. It is

M(t) = E(e^{t(X1 − X2)})
= E(e^{tX1} e^{−tX2})
= E(e^{tX1}) E(e^{−tX2}),

since X1 and X2 are stochastically independent. It is known that

E(e^{tX1}) = exp (μ1 t + σ1² t²/2)

for all real t. Then E(e^{−tX2}) can be obtained from E(e^{tX2}) by replacing t by −t. That is,

E(e^{−tX2}) = exp (−μ2 t + σ2² t²/2).

Finally, then,

M(t) = exp (μ1 t + σ1² t²/2) exp (−μ2 t + σ2² t²/2)
= exp [(μ1 − μ2)t + (σ1² + σ2²)t²/2].
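As a quick check of Example 1 (a sketch, not part of the text), the following enumerates the joint p.d.f. of X1 and X2 and recovers the values g(2), ..., g(6) that were read off the expansion of M(t).

```python
from fractions import Fraction
from collections import defaultdict

# p.d.f. f(x) = x/6 on x = 1, 2, 3, as in Example 1.
f = {x: Fraction(x, 6) for x in (1, 2, 3)}

g = defaultdict(Fraction)
for x1, p1 in f.items():
    for x2, p2 in f.items():
        g[x1 + x2] += p1 * p2      # accumulate Pr(X1 + X2 = y)

for y in sorted(g):
    print(y, g[y])   # 1/36, 4/36 (= 1/9), 10/36 (= 5/18), 12/36 (= 1/3), 9/36 (= 1/4)
```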
t < 1/2.
t < 1/2,   i = 1, 2, ..., n,
has a chi-square distribution with n degrees of freedom.
Not always do we sample from a distribution of one random vari-
able. Let the random variables X and Y have the joint p.d.f. f(x, y)
and let the 2n random variables (Xl' Y 1) , (X 2 , Y 2), ••• , (Xn, Y n) have
the joint p.d.f.
But this is the moment-generating function of a distribution that is
x2(rl + r2 + ... + rn). Accordingly, Y has this chi-square distribution.
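A simulation sketch of this additivity (not from the text; it assumes NumPy and SciPy and one particular choice r1 = 3, r2 = 5, r3 = 7): sums of independent chi-square variables are compared against the χ²(r1 + r2 + r3) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
r = (3, 5, 7)                    # degrees of freedom r1, r2, r3

# Y = X1 + X2 + X3 with the Xi mutually independent chi-square variables.
y = sum(rng.chisquare(df, size=100_000) for df in r)

# Compare the simulated distribution of Y with chi-square(r1 + r2 + r3).
print(stats.kstest(y, stats.chi2(sum(r)).cdf))   # small KS statistic, large p-value expected
```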
Next, let Xl> X 2 , • • • , X; be a random sample of size n from a
distribution that is n(fL, a2). In accordance with Theorem 2 of Section
3.4, each of the random variables (X, - fL)2/a2, i = 1, 2, ... , n, is X2(1).
Moreover, these n random variables are mutually stochastically inde-
pendent. Accordingly, by Theorem 2, the random variable Y =
n
L [(X, - fL)/aJ2 is X2(n). This proves the following theorem.
1
M(t) = E{exp [t(XI + X 2 + ... + X n)]}
= E(etXl)E(etx2) ... E(etxn)
because Xl> X 2, ••• , X n are mutually stochastically independent. Since
If, in Theorem 1, we set each kt = 1, we see that the sum of n
mutually stochastically independent normally distributed variables has
a normal distribution. The next theorem proves a similar result for
chi-square variables.
Theorem 2. Let X1, X2, ..., Xn be mutually stochastically independent variables that have, respectively, the chi-square distributions χ²(r1), χ²(r2), ..., and χ²(rn). Then the random variable Y = X1 + X2 + ... + Xn has a chi-square distribution with r1 + ... + rn degrees of freedom; that is, Y is χ²(r1 + ... + rn).
Proof. The moment-generating function of Y is
we have
Theorem 3. Let Xl> X 2,... , X; denote a random sample of size n
from a distribution that is n(fL, a2). The random variable
Distributions oj Functions oj Random Variables [eb. 4
That is, the moment-generating function of Y is
M(t) = Π_{i=1}^n exp [(ki μi)t + (ki² σi²)t²/2]
= exp [(Σ ki μi)t + (Σ ki² σi²)t²/2].

But this is the moment-generating function of a distribution that is n(Σ ki μi, Σ ki² σi²). This is the desired result.
The following theorem, which is a generalization of Example 2, is
very important in distribution theory.
Theorem 1. Let X1, X2, ..., Xn be mutually stochastically independent random variables having, respectively, the normal distributions n(μ1, σ1²), n(μ2, σ2²), ..., and n(μn, σn²). The random variable Y = k1X1 + k2X2 + ... + knXn, where k1, k2, ..., kn are real constants, is normally distributed with mean k1μ1 + ... + knμn and variance k1²σ1² + ... + kn²σn². That is, Y is n(Σ ki μi, Σ ki² σi²).
Proof. Because x; X 2,. . ., x, an: mutual~y ~tochastically inde-
pendent, the moment-generating function of Y IS glVen by
M(t) = E{exp [t(k1X1 + k2X2 + ... + knXn)]}
= E(etklXl)E(etk2X2) ... E(etknXn).
E(e^{tXi}) = exp (μi t + σi² t²/2),

for all real t, i = 1, 2, ..., n. Hence we have
The distribution of Y is completely determined by its moment-gene:ati~g
function M(t), and it is seen that Y has the p.d.f. g(y), whl~h IS
(
2 + 2) That is the difference between two stochastIcally
n fLl - fL2' Ul U2 . , . ' .
independent, normally distributed, random vanables IS Itself. a random
variable which is normally distributed with mean equal to the difference of
the means (in the order indicated) and the variance equal to the sum of the
variances.
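The following sketch (not part of the text; parameters chosen arbitrarily, NumPy assumed) checks this numerically: the simulated difference X1 − X2 has mean close to μ1 − μ2 and variance close to σ1² + σ2².

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, sigma1 = 6.0, 1.0
mu2, sigma2 = 7.0, 2.0

x1 = rng.normal(mu1, sigma1, size=200_000)
x2 = rng.normal(mu2, sigma2, size=200_000)
y = x1 - x2

# Sample mean and variance of Y versus mu1 - mu2 and sigma1^2 + sigma2^2.
print(y.mean(), mu1 - mu2)               # both near -1
print(y.var(), sigma1**2 + sigma2**2)    # both near 5
```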
n
real constants, has the moment-generating function M(t) = n M,(k,t).
1
(b) If each k; = 1 and if Xi is Poisson with mean /Li' i = 1, 2, ... , n,
prove that Y is Poisson with mean /Ll + ... + /Ln'
4.71. Let the stochastically independent random variables Xl and X2
have binomial distributions with parameters nl , PI = t and n2' P2 = t,
respectively. Show that Y = Xl - X 2 + n2has a binomial distribution with
parameters n = nl + n2, P = 1--
EXERCISES
4.70. Let Xl and X 2 be stochastically independent random variables.
Let Xl and Y = Xl + X 2have chi-square distributions with rl and r degrees
of freedom, respectively. Here rl < r. Show that X2 has a chi-square distribu-
tion with r - rl degrees of freedom. Hint. Write M(t) = E(e!(X1 +X2»
) and
make use of the stochastic independence of Xl and X 2.
4.68. Let the stochastically independent random variables Xl and X 2
have the same p.d.f. f(x) = 1" x = 1, 2, 3, 4, 5, 6, zero elsewhere. Find the
p.d.I. of Y = Xl + X 2 • Note, under appropriate assumptions, that Y may
be interpreted as the sum of the spots that appear when two dice are cast.
4.69. Let Xl and X 2be stochastically independent with normal distribu-
tions n(6, 1) and n(7, 1), respectively. Find Pr (Xl> X 2). Hint. Write
Pr (Xl> X 2) = Pr (Xl - X 2 > 0) and determine the distribution of
Xl - X 2·
4.72. Let X be n(O, 1). Use the moment-generating-function technique to
show that Y = X2 is x2
(l ). Hint. Evaluate the integral that represents
E(e!X2) by writing w = xVI - 2t, t < t.
4.73. Let Xl> X 2 , ••. , X; denote n mutually stochastically independent:
random variables with the moment-generating functions MI(t), M 2(t), ... ,
M n (t), respectively.
(a) Show that Y = klXI + k2X2 + ... + knXn, where kl> k2 , ••• , kn are
(
tl i Xi t2 i Yi)
= f:oo" -f~oo exp -;-- + -;-- cp dx l· .. dYn
= 0If_0000 f_0000 exp r~i + t~i)f(Xi' Yi) dx, dYi]·
The justification of the form of the right-hand member of the second
equality is that each pair (Xi' Y i ) has the same p.d.f., and th~t these
n pairs are mutually stochastically independent. The two~old integral
in the brackets in the last equality is the moment-generatmg function
of Xi and Y, (see Section 3.5) with tl replaced by tl/n and t2 replaced by
t2 /n. Accordingly,
cp = f(x v Yl)f(x2, Y2) ... f(xn, Yn),
the moment-generating function of the two means X and Y is given by
170
The n random pairs (Xv Y l ) , (X2, Y 2), · · · , (Xn, Y n) are then mutually
stochastically independent and are said to constitute a random sample
of size n from the distribution of X and Y. In the next paragraph we
shall take f(x, y) to be the normal bivariate p.d.f., .and we sha~l solve
a problem in sampling theory when we are samplmg from this two-
variable distribution.
Let (Xl' Yl), (X 2 , Y 2) , •• ·, (Xn,. Yn! d~note .a random sample of
size n from a bivariate normal distribution with p.d.f. f(x, y) and
t rs I/. u2 u2 and p We wish to find the joint p.d.f. of the
parame e ttl' ,...2' l' 2' .
two statistics X = i Xt!n and Y = i Ydn . We call X the mean of
1 1
Xl' ... , x, an~ Y the mean of. Yv ... , Yn' Sin~e t.he jo:t p.d.f. of the
2n random vanables (Xi' YJ, ~ = 1,2, ... , n, IS glVen y
TI
n ltlttl tzttz
M(tl , t2) = i=l exp -;; + -;;
ui(tl/n)Z + 2PUlu2(tl/n){tz/n) + u~(tz/n)Z]
+ 2
l
(ui/n)ti + 2p(uluZ/n)tltz + (u~/n)t~].
= exp tlttl + tzttz + 2
But this is the moment-generating function of a bivariate normal
distribution with means ttl and tt2' variances ui/n and u~/n, and corre-
lation coefficient p; therefore, X and Y have this joint distribution.
4.74. If Xl' X 2 , ••• , X; is a random sample from a distribution with
moment-generating function M(t), show that the moment-generating func-
n n
tions of L: X, and L: X;/n are, respectively, [M(t)Jn and [M(tjnW.
I 1
4.75. In Exercise 4.67 concerning PERT, find: (a) the p.d.f. of Y;
(b) the p.d.f. of Z in case each of the three stochastically independent
variables has the p.d.f. f(x) = e- x , 0 < x < 00, zero elsewhere.
4.76. If X and Y have a bivariate normal distribution with parameters
t-«. /Lz, ar, a~, and p, show that Z aX + bY + c is n(a/LI + b/L2 + c,
a2ar + 2abpalaZ + b2a~), where a, b, and c are constants. Hint. Use the
can be written
has Jacobian n. Since
distributions of the mean and the variance of this random sample,
that is, the distributions of the two statistics X = ±X,jn and 52 =
n 1
L (X, - X)2jn.
1
The problem of the distribution of X, the mean of the sample, is
solved by the use of Theorem 1 of Section 4.7. We have here, in the
notation of the statement of that theorem, 1-'1 = 1-'2 = ... = I-'n = 1-',
a~ = ~~ = ... = a; = a2
, and k1 = k2 = ... = kn = ljn. Accordingly,
Y = X has a normal distribution with mean and variance given by
n
= L(XI - X)2 + n(x - 1-')2
1
Xl = nYl - Y2 - .•. - Yn
X 2 = Y2
( l)n [L (x, - X)2 n(x - 1-')2]
• /2rra exp - ,
·v 2a2
2a2
respectively. That is, X̄ is n(μ, σ²/n).

Example 1. Let X̄ be the mean of a random sample of size 25 from a distribution that is n(75, 100). Thus X̄ is n(75, 4). Then, for instance,

Pr (71 < X̄ < 79) = N((79 − 75)/2) − N((71 − 75)/2)
= N(2) − N(−2) = 0.954.
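A quick check of Example 1 with SciPy (a sketch, not from the text):

```python
from scipy import stats

# X-bar is n(75, 4), i.e. normal with mean 75 and standard deviation 2.
xbar = stats.norm(loc=75, scale=2)
print(xbar.cdf(79) - xbar.cdf(71))   # approximately 0.954
```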
We now take up the problem of the distribution of 52, the variance
of a random sample X1> ... , Xn from a distribution that is n(l-', a2).
To do this, let us first consider the joint distribution of Y1 = X,
Y2 = X 2 , · · . , Y n = Xn- The corresponding transformation
n
because 2(x - II.) "'" (Xi - x) = 0, th " t d f f X X X
r: L.., e JOIn p... 0 1> 2,"" n
1
moment-generating function M(tl> t2) of X and Y to find the moment-
generating function of Z.
4.77. Let X and Y have a bivariate normal distribution with parameters
1-'1 = 25, 1-'2 = 35, a~ = 4, a~ = 16, and p = H. 1£ Z = 3X - 2Y, find
Pr (- 2 < Z < 19).
4.78. Let U and V be stochastically independent random variables, each
having a normal distribution with mean zero and variance 1. Show that
the moment-generating function E(et(uV») of the product UV is (1 - t2 )- l/2,
-1 < t < 1. Hint. Compare E(etuV
) with the integral of a bivariate normal
p.d.f. that has means equal to zero.
4.79. Let X and Y have a bivariate normal distribution with the param-
eters I-'l> 1-'2' at a~, and p. Show that W = X - 1-'1 and Z = (Y - 1-'2) -
p(a2/a1)(X - 1-'1) are stochastically independent normal variables.
4.80. Let Xl> X 2 , X3 be a random sample of size n = 3 from the normal
distribution n(O, 1).
(a) Show that Y1 = Xl + SX3 , Y2 = X2 + SX3 has a bivariate normal
distribution.
(b) Find the value of S so that the correlation coefficient p = 1-.
(c) What additional transformation involving Y1 and Y 2 would produce
a bivariate normal distribution with means 1-'1 and 1-'2' variances a~ and a~,
and the same correlation coefficient p?
4.81. Let Xl> X 2 , ••• , X n be a random sample of size n from the normal
n
distribution n(I-" a2
) . Find the joint distribution of Y = L ajXj and
n 1
Z = L bjXj, where the a, and b, are real constants. When, and only when,
1
are Y and Z stochastically independent? Hint. Note that the joint moment-
generating function E [exp (t1 ~ «x, + t2 ~ b,Xj ) ] is that of a bivariate
normal distribution.
4.82. Let Xl> X 2 be a random sample of size 2 from a distribution with
positive variance and moment-generating function M(t). 1£ Y = Xl + X2
and Z = Xl - X 2 are stochastically independent, prove that the distribution
from which the sample is taken is a normal distribution. Hint. Show that
m(t1, t2) = E{exp [t1(X 1 + X 2) + t2(X 1 - X 2)]} = M(t1 + t2)M(t1 - t2).
Express each member of m(tl>t2) = m(tl>O)m(O, t2 ) in terms of M; differentiate
twice with respect to t2 ; set t2 = 0; and solve the resulting differential
equation in M.
4.8 The Distributions of Xand nS2 jo2
Let X1> X 2 , • • • , X; denote a random sample of size n ~ 2 from a
distribution that is n(p., a2
) . In this section we shall investigate the
EXERCISES
4.83. Let X̄ be the mean of a random sample of size 5 from a normal distribution with μ = 0 and σ² = 125. Determine c so that Pr (X̄ < c) = 0.90.
(1 − 2t)^{−(n−1)/2},   t < 1/2.

That is, the conditional distribution of nS²/σ², given X̄ = x̄, is χ²(n − 1). Moreover, since it is clear that this conditional distribution does not depend upon x̄, X̄ and nS²/σ² must be stochastically independent or, equivalently, X̄ and S² are stochastically independent.

To summarize, we have established, in this section, three important properties of X̄ and S² when the sample arises from a distribution which is n(μ, σ²):
(a) X̄ is n(μ, σ²/n).
(b) nS²/σ² is χ²(n − 1).
(c) X̄ and S² are stochastically independent.

Determination of the p.d.f. of S² is left as an exercise.
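The three properties can be checked by simulation; the sketch below (not part of the text; it assumes NumPy and SciPy and one arbitrary choice of n, μ, σ) draws many samples of size n from n(μ, σ²) and looks at the behavior of X̄ and nS²/σ².

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, mu, sigma = 10, 3.0, 2.0

x = rng.normal(mu, sigma, size=(100_000, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1)                 # S^2 with divisor n, as in the text

print(xbar.mean(), xbar.var(), sigma**2 / n)        # (a): mean near mu, variance near sigma^2/n
print(stats.kstest(n * s2 / sigma**2,
                   stats.chi2(n - 1).cdf))          # (b): n S^2 / sigma^2 behaves like chi-square(n-1)
print(np.corrcoef(xbar, s2)[0, 1])                  # (c): correlation near 0, consistent with independence
```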
where °< 1 - 2t, or t < l However, this latter integral is exactly the
same as that of the conditional p.d.f. of Y2, Y3, ••. , Yn' given Y1 = Y1'
with a 2
replaced by a 2
j(1 - 2t) > 0, and thus must equal 1. Hence the
conditional moment-generating function of n52ja2
, given Y1 = Y1 or
equivalently X = x, is
(
l)n  (nYl - Y2 - ... - Yn - Y1)2
n --=- exp - 2a2
v27Ta
n )2
~ (y, - Y1 _ n(Yl - fJ-)2),
2a2 2a2
. 1,2, ... , n. The quotient of this joint p.d.f. and the
-00 < Yl < 00,1, =
p.d.f.
vn [n(Yl - fJ-)2}
-==- exp - 2 2
V27Ta a
of Y1 = X is the conditional p.d.f. of Y2' Y3' ... , Yn' given Y1 = Yl>
vn(V~7Tar-1 exp ( - 2~2)'
)
2 + ~ (y _ y)2 Since this is a
where q = (nYl - Y2 - ... - Yn - Yl '§" l '
joint conditional p.d.f., it must be, for all a > 0, that
foo .. .foo vn(~ r:exp ( - 2~2) dY2' .. dYn = 1.
_ 00 - 00 V27Ta
( + + x )jn and -00 < x, < 00, 1, =
where x represents Xl + X2 . . . n ..
1,2, ... , n. Accordingly, with Y1 = x, we find that the [oint p.d.f. of
Yl' Y2' ..• , Yn is
Now consider
n52 = ~ (X, - X)2
1
n )2 Q
=(nYI-Y2-"'-Yn-Y1)2+~(Y'-Yl =.
. f ti f 52j 2 - Qja2 given
The conditional moment-generatmg unc IOn 0 n a - ,
Y1=Yl>is
E(etQ /<r2 IY1) = Joo .. .foo vn(_1-r-1
exp [- (1 ~a;t)q) dY2" .dYn
_ 00 - 00 VZ:;a
4.84. If X is the mean of a random sample of size n from a normal
distribution with mean fL and variance 100, find n so that Pr (fL - 5 <
X < fL + 5) = 0.954.
4.85. Let Xl' X 2 , .•• , X 2 5 and Yl> Y 2 , ... , Y 2 5 be two random samples
from two independent normal distributions n(O, 16) and n(1, 9), respectively.
Let X and Y denote the corresponding sample means. Compute Pr (X > Y).
4.86. Find the mean and variance of 52 = i (Xl - X)2/n, where
1
Xl' X 2 , ••• , X n is a random sample from n(fL, a2
) . Hint. Find the mean and
variance of n5 2/a2
•
(
_ 1_ ) <
n- 1)/2f
OO
•• • foo vn[1
2:a;tX
n- 1)/2
1-2t - 0 0 - 0 0
[
(1 - 2t)q )
x exp - 2a2 dY2' .. dYn'
4.87. Let 52 be the variance of a random sample of size 6 from the normal
distribution n(fL, 12). Find Pr (2.30 < 52 < 22.2).
4.88. Find the p.d.f. of the sample variance 52, provided that the distri-
bution from which the sample arises is n(fL, a2
) .
4.89. Let X and S2 be the mean and the variance of a random sample
of size 25 from a distribution which is n(3, 100). Evaluate Pr (0 < X < 6,
55.2 < S2 < 145.6).
i < j.
The variance of Y is given by
a~ = E{[(k1X1 + ... + knXn) - (k11-'1 + ... + knl-'n)J2}
= E{[k1(X1 - 1-'1) + ... + kn(Xn - I-'n)J2}
= E{t1k~(Xi - l-'i)2 + 2 ..
t t kjkj(Xi - l-'i)(Xj - I-'j)}
n
= i~1 ktE[(Xi - l-'i)2J + 2 f<t kikjE[(Xi - l-'i)(Xj - I-'j)J.
~onsider E[(Xi - l-'i)(Xj - I-'j)J, i < j. Because Xi and X, are stochastically
independent, we have
E[(Xi - l-'i)(Xj - I-'j)J = E(Xi - l-'i)E(Xj - I-'j) = O.
Finally, then,
We can o~tain a more general result if, in Example 2, we remove
the hypothesIs. of mutual stochastic independence of X 1, X 2
, ••• , X
n
.
We shall do this and we shall let Pij denote the correlation coefficient
of Xi and Xj. Thus for easy reference to Example 2, we write
If we refer to Example 2, we see that again I-'y = i kifJ-i. But now
1
Thus we have the following theorem.
Theorem 4. Let X1, ..., Xn denote random variables that have means μ1, ..., μn and variances σ1², ..., σn². Let ρij, i ≠ j, denote the correlation coefficient of Xi and Xj and let k1, ..., kn denote real constants. The mean and the variance of the linear function
rr[(r - 2)/2J r
2[(r - 2)/2Jr[(r - 2)/2J = r - 2'
Distributions of Functions of Random Variables [Ch.4
Example 1. Given that W is n(O, 1), that V is X2(r)
with r ;::,.: 2, and let
Wand V be stochastically independent. The mean of the random variable
T = Wvr/V exists and is zero because the graph of the p.d.f. of T (see
Section 4.4) is symmetric about the vertical axis through t = O. The variance
of T, when it exists, could be computed by integrating the product of t2
and
the p.d.f. of T. But it seems much simpler to compute
Now W2 is X2(1), so E(W2) = 1. Furthermore,
E(r)_fro r 1 r/2-1 -v/2 d
V - 0 v2r/2f(r/2) v e v
exists if r > 2 and is given by
rr[(r - 2)/2J
2f(r/2)
Thus a~ = r/(r - 2), r > 2.
4.9 Expectations of Functions of Random Variables
Let Xl> X 2 , ••• , Xn denote random variables that have the joint
p.d.f. f(xl> X 2, • • • , xn). Let the random variable Y be defined by
Y = u(X1, X 2 , ••• , X n).We found in Section 4.7 that we could compute
expectations of functions of Y without first finding the p.d.f. of Y.
Indeed, this fact was the basis of the moment-generating-function
procedure for finding the p.d.f. of Y. We can take advantage of this
fact in a number of other instances. Some illustrative examples will be
given.
176
Example 2. Let Xi denote a random variable with mean ILi and variance
at, i = 1, 2, ... , n. Let X1> X2, ••• , Xn be mutually stochastically inde-
pendent and let kv k2, ... , kn denote real constants. We shall compute the
mean and variance of the linear function Y = k1X1 + k2X2 + ... + knXn.
Because E is a linear operator, the mean of Y is given by
I-'y = E(k1X1 + k2X2 + ... + knXn)
= k1E(X1) + k2E(X2) + ... + knE(Xn)
n
= k1IL1 + k21-'2 + ... + knl-'n = 2: kil-'i·
1
are, respectively,
n
fJ-y = 2: kifJ-i
1
and
The following corollary of this theorem is quite useful.

Corollary. Let X1, ..., Xn denote the items of a random sample of size n from a distribution that has mean μ and variance σ². The mean and the variance of Y = Σ_{i=1}^n ki Xi are, respectively, μY = (Σ ki)μ and σY² = (Σ ki²)σ².
Example 3. Let X̄ = Σ Xi/n denote the mean of a random sample of size n from a distribution that has mean μ and variance σ². In accordance with the Corollary, we have μ_X̄ = μ Σ (1/n) = μ and σ²_X̄ = σ² Σ (1/n)² = σ²/n. We have seen, in Section 4.8, that if our sample is from a distribution that is n(μ, σ²), then X̄ is n(μ, σ²/n). It is interesting that μ_X̄ = μ and σ²_X̄ = σ²/n whether the sample is or is not from a normal distribution.
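A short simulation sketch of that last remark (not from the text; the exponential distribution and n = 9 are arbitrary choices): for a decidedly non-normal distribution, the mean of X̄ is still μ and its variance is still σ²/n.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 9

# Exponential distribution with mean 1 (so mu = 1 and sigma^2 = 1): not normal.
xbars = rng.exponential(scale=1.0, size=(200_000, n)).mean(axis=1)

print(xbars.mean())         # near mu = 1
print(xbars.var(), 1 / n)   # both near sigma^2 / n = 1/9
```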
EXERCISES
4.90. Let X1, X2, X3, X4 be four mutually stochastically independent
random variables having the same p.d.f. f(x) = 2x, 0 < x < 1, zero else-
where. Find the mean and variance of the sum Y of these four random
variables.
4.91. Let Xl and Xz be two stochastically independent random variables
so that the variances of Xl and X 2are ar = k and a~ = 2, respectively. Given
that the variance of Y = 3X2 - Xl is 25, find k.
4.92. If the stochastically independent variables X1 and X2 have means μ1, μ2 and variances σ1², σ2², respectively, show that the mean and variance of the product Y = X1X2 are μ1μ2 and σ1²σ2² + μ1²σ2² + μ2²σ1², respectively.
4.93. Find the mean and variance of the sum Y of the items of a random
sample of size 5 from the distribution having p.d.f. f(x) = 6x(1 - x),
o< x < 1, zero elsewhere.
4.94. Determine the mean and variance of the mean X of a random
sample of size 9 from a distribution having p.d f. f(x) = 4xs, 0 < x < 1,
zero elsewhere.
4.95. Let X and Y be random variables with flol = 1, floz = 4, ar = 4,
a~ = 6, p = t. Find the mean and variance of Z = 3X - 2Y.
4.96. Let X and Y be stochastically independent random variables with
means flol' fJ-z and variances ar, a~. Determine the correlation coefficient of
X and Z = X - Y in terms of fJ-v floz, ar, a~.
4.97. Let flo and aZ denote the mean and variance of the random variable
X. Let Y = c + bX, where band c are real constants. Show that the mean
and the variance of Yare, respectively, C + bflo and b2a2
•
4.98. Let X and Y be random variables with means flov floz; variances
ar, a~; and correlation coefficient p. Show that the correlation coefficient of
W = aX + b, a > 0, and Z = cY + d, c > 0, is p.
4.99. A person rolls a die, tosses a coin, and draws a card from an
ordinary deck. He receives $3 for each point up on the die, $10 for a head,
$0 for a tail, and $1 for each spot on the card (jack = 11, queen = 12,
king = 13). If we assume that the three random variables involved are
mutually stochastically independent and uniformly distributed, compute
the mean and variance of the amount to be received.
4.100. Let U and V be two stochastically independent chi-square variables
with rl and rz degrees of freedom, respectively. Find the mean and variance
of F = (rzU)/(rlV). What restriction is needed on the parameters rl and rz
in order to ensure the existence of both the mean and the variance of F?
4.101. Let X1, X2, ..., Xn be a random sample of size n from a distribution with mean μ and variance σ². Show that E(S²) = (n − 1)σ²/n, where S² is the variance of the random sample. Hint. Write S² = (1/n) Σ (Xi − μ)² − (X̄ − μ)².
4.102. Let Xl and Xz be stochastically independent random variables
with nonzero variances. Find the correlation coefficient of Y = XlXZ and
Xl in terms of the means and variances of Xl and X z.
4.103. Let Xl and Xz have a joint distribution with parameters flol' floz,
at, a~, and p. Find the correlation coefficient of the linear functions Y =
alXl + azXz and Z = blXl + bzXz in terms of the real constants aI, az,
bv bz, and the parameters of the distribution.
4.104. Let Xv X 2 , •• " Xn be a random sample of size n from a distri-
bution which has mean flo and variance 0'2. Use Chebyshev's inequality to
show, for every E > 0, that lim Pr(iX - floI < E) = 1; this is another form
n-oo
of the law of large numbers.
4.105. Let Xl' X z, and Xs be random variables with equal variances
but with correlation coefficients P12 = 0.3, PIS = 0.5, and pzs = 0.2. Find
the correlation coefficient of the linear functions Y = Xl + Xz and Z =
X 2 + X s·
4.106. Find the variance of the sum of 10 random variables if each has
variance 5 and if each pair has correlation coefficient 0.5.
4.107. Let Xl"'" Xn be random variables that have means flov"" flon
and variances ar, , a~. Let P'J' ~ =1= j, denote the correlation coefficient of
X, and XJ' Let av , an and bI> .. " bn be real constants. Show that the
n n n n
covariance of Y = L: a,X, and Z = L: s.x, is L: L: a,bp,aJP'J' where
'=1 r e t 1=1 1=1
P« = 1, i = 1, 2, ... , n.
4.108. Let Xl and X 2 have a bivariate normal distribution with param-
eters jLl> jL2' ut u~, and p. Compute the means, the variances, and the cor-
relation coefficient of Yl = exp (Xl) and Y2 = exp (X2). Hint. Various
moments of Yl and Y2 can be found by assigning appropriate values to tl and
t2 in E[exp (tlXl + t2X 2)].
4.109. Let X be n(jL, u2) and consider the transformation X = In Y or,
equivalently, Y = eX.
(a) Find the mean and the variance of Y by first determining E(eX ) and
E[(eX)2].
(b) Find the p.d.f. of Y. This is called the lognormal distribution.
4.110. Let Xl and X 2 have a trinomial distribution with parameters n,
Pl> P2'
(a) What is the distribution of Y = Xl + X2?
(b) From the equality u~ = u~ + u~ + 2PUlU2' once again determine the
correlation coefficient p of Xl and X 2•
4.111. Let Yl = Xl + X 2 and Y2 = X 2 + X a, where Xl> X 2, and Xa
are three stochastically independent random variables. Find the joint
moment-generating function and the correlation coefficient of Y1 and Y2
provided that:
(a) Xi has a Poisson distribution with mean jLi' i = 1, 2, 3.
(b) Xi is n(jLi' u~), i = 1, 2, 3.
Chapter 5
Limiting Distributions
5.1 Limiting Distributions
In some of the preceding chapters it has been demonstrated by
example that the distribution of a random variable (perhaps a statistic)
often depends upon a positive integer n. For example, if the random
variable X is b(n, P), the distribution of X depends upon n. If X is the
mean of a random sample of size n from a distribution that is n(/L' a2
) ,
then X is itself n(/L, a2/n)
and the distribution of X depends upon n.
If S2is the variance of this random sample from the normal distribution
to which we have just referred, the random variable nS2/a2
is x2
(n - 1),
and so the distribution of this random variable depends upon n.
We know from experience that the determination of the p.d.f. of a
random variable can, upon occasion, present rather formidable com-
putational difficulties. For example, if X is the mean of a random
sample Xl> X 2 , ••• , X n from a distribution that has the p.d.f.
f(x) = 1, o < x < 1,
= 0 elsewhere,
then (Exercise 4.74) the moment-generating function of X̄ is given by [M(t/n)]^n, where here

M(t) = ∫_0^1 e^{tx} dx = (e^t − 1)/t,   t ≠ 0,
= 1,   t = 0.
os Y < 0,
0< y < 0,
-00 < Y < 0,
8 s y < 00.
-00 < Y < 8,
e ::; y < 00,
= 1,
= 1,
F(y) = 0,
The p.d.f. of Yn is

gn(y) = n y^{n−1}/θ^n,   0 < y < θ,
= 0 elsewhere,

and the distribution function of Yn is

Fn(y) = 0,   y < 0,
= ∫_0^y n z^{n−1}/θ^n dz = (y/θ)^n,   0 ≤ y < θ,
= 1,   θ ≤ y < ∞.
Then
Now
t =I- 0,
Hence
(
et/n_ 1)n
tin '
1, t = O.
Since the moment-generating function of X depends upon n, the
distribution of X depends upon n. It is true that various mathematical
techniques can be used to determine the p.d.f. of X for a fixed, but
arbitrarily fixed, positive integer n. But the p.d.f. is so complicated that
few, if any, of us would be interested in using it to compute probabilities
about X. One of the purposes of this chapter is to provide ways of
approximating, for large values of n, some of these complicated
probability density functions.
Consider a distribution that depends upon the positive integer n.
Clearly, the distribution function F of that distribution will also
depend upon n. Throughout this chapter, we denote this fact by
writing the distribution function as Fn and the corresponding p.d.f.
as in. Moreover, to emphasize the fact that we are working with
sequences of distribution functions, we place a subscript n on the ran-
dom variables. For example, we shall write
Fn(x) = ∫_{−∞}^x (√n/√(2π)) e^{−nw²/2} dw
for the distribution function of the mean Xn of a random sample of size
n from a normal distribution with mean zero and variance 1.
We now define a limiting distribution of a random variable whose
distribution depends upon n.
Definition 1. Let the distribution function F n(Y) of the random
variable Yn depend upon n, a positive integer. If F(y) is a distribution
function and if lim Fn(Y) = F(y) for every point Y at which F(y) is
is a distribution function. Moreover, lim Fn(y) = F(y) at each point of
n-co
continuity of F(y). In accordance with the definition of a limiting distribu-
tion, the random variable Yn has a limiting distribution with distribution
function F(y). Recall that a distribution of the discrete type which has a
probability of 1 at a single point has been called a degenerate distribution.
Thus in this example the limiting distribution of Yn is degenerate. Some-
times this is the case, sometimes a limiting distribution is not degenerate, and
sometimes there is no limiting distribution at all.
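A small sketch of Example 1's conclusion (not from the text; NumPy and θ = 2 are assumptions): Fn(y) = (y/θ)^n piles up all the probability near θ as n grows, so Yn converges to the degenerate distribution at θ.

```python
import numpy as np

theta = 2.0

def F_n(y, n):
    """Distribution function of the largest order statistic Y_n from uniform(0, theta)."""
    y = np.clip(y, 0.0, theta)
    return (y / theta) ** n

for n in (1, 10, 100, 1000):
    # Probability that Y_n lies within 0.05 of theta; it tends to 1 as n grows.
    print(n, 1.0 - F_n(theta - 0.05, n))
```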
Example 2. Let X n have the distribution function
n-> 00
= 0 elsewhere.
continuous, then the random variable Y n is said to have a limiting
distribution with distribution function F(y).
x = 0,
x> O.
x < 0,
= -t,
= 1,
lim F n(x) = 0,
n_co
It is clear that
If the change of variable v = vnw is made, we have
f(x) = 1/θ,   0 < x < θ,   0 < θ < ∞,
The following examples are illustrative of random variables that
have limiting distributions.
Example 1. Let Yn denote the nth order statistic of a random sample
Xl> X z, ... , X; from a distribution having p.d.f.
184
Now the function
F(x) = 0,
Limiting Distributions [eh. 5
x < 0,
Sec. 5.1] Limiting Distributions
and the distribution function of Zn is
z < 0,
185
Hence
0:::;; z < nO,
0< z < co.
0:::;; z,
z < 0,
G(z) = 0,
5
z
(0 - w/n)n-l (Z)n
= dw = 1 - 1 - -
o r nO '
= 1, nO:::;; z.
Now
1
x = 2 + -.
n
= 1, x;::: 0,
is a distribution function and lim Fn(x) = F(x) at every point of continuity
of F(x). To be sure, lim Fn(O)# F(O). but F(x) is not continuous at x = O.
Accordingly, the rand~;variable Xn has a limiti~g d~stribution with distribu-
tion function F(x). Again, this limiting distributlOn IS degenerate and has all
the probability at the one point x = O.
Example 3. The fact that limiting distributions, if ~hey exist: cannot in
general be determined by taking the limit of the p.d.f. WIll now be Illustrated.
Let X; have the p.d.f.
EXERCISES
5.1. Let Xn denote the mean of a random sample of size n from a distribu-
tion that is n(JL, 0'2). Find the limiting distribution of Xn.
5.2. Let YI denote the first order statistic of a random sample of size n
from a distribution that has the p.d.f. f(x) = e-(X-O), 0 < x < co, zero
elsewhere. Let Zn = n(YI - 0). Investigate the limiting distribution of Zn.
5.3. Let Yn denote the nth order statistic of a random sample from a
distribution of the continuous type that has distribution function F(x) and
p.d.f. f(x) = F'(x). Find the limiting distribution of Zn = n[1 - F(Yn)J.
5.4. Let Y2 denote the second order statistic of a random sample of size
n from a distribution of the continuous type that has distribution function
F(x) and p.d.f.f(x) = F'(x). Find the limiting distribution of W n = nF(Y2 ) .
5.5. Let the p.d.f. of Yn be fn(Y) = 1, Y = n, zero elsewhere. Show that
Yn does not have a limiting distribution. (In this case, the probability has
"escaped" to infinity.)
5.6. Let Xl, X 2 , •• • , X; be a random sample of size n from a distribution
n
which is n(JL, 0'2), where JL > O. Show that the sum Z; = L Xi does not have
I
a limiting distribution.
is a distribution function that is everywhere continuous and lim Gn(z) =
n ....00
G(z) at all points. Thus Zn has a limiting distribution with distribution
function G(z). This affords us an example of a limiting distribution that is
not degenerate.
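Here the limiting distribution function works out to G(z) = 1 − e^{−z/θ} for z ≥ 0; the sketch below (not from the text; NumPy and θ = 1 are assumptions) compares Gn(z) = 1 − (1 − z/(nθ))^n with that limit.

```python
import numpy as np

theta = 1.0
z = np.linspace(0.0, 5.0, 6)

def G_n(z, n):
    """Distribution function of Z_n = n(theta - Y_n) for 0 <= z <= n*theta."""
    return 1.0 - (1.0 - z / (n * theta)) ** n

G_limit = 1.0 - np.exp(-z / theta)

for n in (5, 50, 500):
    print(n, np.max(np.abs(G_n(z, n) - G_limit)))   # the maximum gap shrinks as n grows
```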
0< z < nO,
x:::;; 2,
x> 2.
x < 2,
= 1,
= 1,
F(x) = 0,
n ....00
(0 - z/n)n-l
hn(z) = on '
= 0 elsewhere,
Since
and
= 0 elsewhere.
Clearly, lim fn(x) = 0 for all values of x. This may suggest that X n has no
limiting dis~ribution. However, the distribution function of X; is
1
F () o x < 2 + -,
n X = , n
1
x> 2 +-,
- n
= 1, x;::: 2,
is a distribution function, and since lim Fn(x) = F(x) at all points of con-
n ....00
tinuity of F(x), there is a limiting distribution of X n with distribution function
F(x).
Example 4. Let Yn denote the nth order statistic of a random sample
from the uniform distribution of Example 1. Let Zn = n(O - Y n)·The p.d.f.
of Zn is
Because °s Fn(Y) s 1 for all values of Y and for every positive
integer n, it must be that
Since this is true for every E > 0, we have
as we were required to show.
To complete the proof of Theorem 1, we assume that
lim Fn[(c + E) - J
n-+ 00
1.
Y < C,
Y > C,
Y < C,
= 1,
lim Fn(Y) = 0,
n-+ 00
lim F n(Y) = 0,
n-+ 00
lim Fn(c - E) = 0,
n-+ 00
5.2 Stochastic Convergence
When the limiting distribution of a random variable is degenerate,
the random variable is said to converge stochastically to the constant
that has a probability of 1. Thus Examples 1 to 3 of Section 5.1
illustrate not only the notion of a limiting distribution but also the
concept of stochastic convergence. In Example 1, the nth order
statistic Yn converges stochastically to 8; in Example 2, the statistic
Xn converges stochastically to zero, the mean of the normal distribution
from which the sample was taken; and in Example 3, the random
variable X n converges stochastically to 2. We shall show that in some
instances the inequality of Chebyshev can be used to advantage in
proving stochastic convergence. But first we shall prove the following
theorem.
lim Fn(c - E) = 0,
n-+ 00
lim Fn[(c + E) - J = 1,
n-+ 00
Y > c.
= 1,
We are to prove that lim Pr (/Y n - c] < ~) _- 1 f
~ or every E > O.
n-+ 00
Because
Pr (IYn - c] < E) = Fn[(c + E)-J - Fn(c - E),
and because it is given that
Theorem 1. Let Fn(Y) denote the distribution function of a random
variable Yn whose distribution depends upon the positive integer n. Let c
denote a constant which does not depend upon n. The random variable Yn
converges stochastically to the constant c if and only if, for every E > 0, the
lim Pr(Yn - c] < E) = 1.
n-+ 00
Proof. First, assume that the lim Pr (I Yn - cl < E) = 1 for every
n-+ 00
E > 0. We are to prove that the random variable Yn converges stochasti-
cally to the constant c. This means we must prove that
n-+ 00
= 1, Y > c.
Note that we do not need to know anything about the lim Fn(c). For
n-+ 00
if the limit of Fn(Y) is as indicated, then Yn has a limiting distribution
with distribution function
F(y) = 0,
Y < C,
Y < C,
for every E > 0, we have the desired result. This completes the proof
of the theorem.
We should like to point out a simple but useful fact. Clearly,
Pr (iYn - c] < E) + Pr (I Yn - c] ~ E) = 1.
Thus the limit of Pr (I Yn - cl < E) is equal to 1 when and only when
lim Pr(Yn - cl ~ E) = 0.
n-+ 00
1 = lim Pr(IYn - cl < E) = lim Fn[(c + E)-J - lim Fn(c - E).
n-+oo n-+oo n-+oo
Pr (!Yn - cl < E) = Fn[(c + E)- J - Fn(c - E),
where Fn[(c + E) - Jis the left-hand limit of Fn(y) at Y = c + E. Thus
we have
Now
= 1, Y ~ c. That is, :his last limit is also a necessary and sufficient condition for the
stochastic convergence of the random variable Yn to the constant c.
.Ex~mp'le 1. Let X n denote the mean of a random sample of size n from
a d~stnbutlO~ that has mean fL and positive variance a2. Then the mean and
variance of X; are fL and a2jn.
Consider, for every fixed E > 0, the probability
Pr (|X̄n − μ| ≥ ε) = Pr (|X̄n − μ| ≥ kσ/√n),

where k = ε√n/σ. In accordance with the inequality of Chebyshev, this probability is less than or equal to 1/k² = σ²/(nε²). So, for every fixed ε > 0, we have

lim_{n→∞} Pr (|X̄n − μ| ≥ ε) ≤ lim_{n→∞} σ²/(nε²) = 0.

Hence X̄n converges stochastically to μ if σ² is finite. In a more advanced course, the student will learn that μ finite is sufficient to ensure this stochastic convergence.
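A simulation sketch of this convergence (not from the text; the normal distribution, ε = 0.2, and the sample sizes are arbitrary assumptions): for a fixed ε, the frequency of |X̄n − μ| ≥ ε is estimated for increasing n and set beside the Chebyshev bound σ²/(nε²).

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, eps = 0.0, 1.0, 0.2

for n in (10, 100, 1000):
    xbars = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    freq = np.mean(np.abs(xbars - mu) >= eps)
    bound = sigma**2 / (n * eps**2)
    print(n, freq, bound)   # the frequency tends to 0 and stays consistent with the bound
```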
Remark. The condition lim Pr (IYn - c] < €) = 1 is often used as
n--> 00
lim [1 + ~ + lj;(n)]cn,
n-+co n n
lim (1 t
2
t3 ) -n/2 . ( t
2
t3jvn)-n/2
--+372 =hm 1--+-- .
n-+ co n n n-+ co n n
For example,
where band c do not depend upon n and where lim lj;(n) = O. Then
n-+ co
Here b = _t2, C = -t, and lj;(n) = t3 jvn. Accordingly, for every fixed
value of t, the limit is et 2
/2 •
Example~. Let Y, have a distribution that is b(n,Pl. Suppose that the
mean fL = np IS the same for every n; that is, p = fLln, where fL is a constant.
. [ b lj;(n)]cn (b)cn
lim 1 + - + -- = lim 1 + - = ebc
•
n-+co n n n-+co n
the p~oblem we should like to avoid. If it exists, the moment-generating
func:lOn that corresponds to the distribution function Fn(Y) often
provides a convenient method of determining the limiting distribution
function. To emphasize that the distribution of a random variable Y
depends upon the positive integer n, in this chapter we shall write the
moment-generating function of Y, in the form M(t; n).
The following theorem, which is essentially Curtiss' modification of
a the~rem of Levy an~ Cramer, explains how the moment-generating
function may be used In problems of limiting distributions. A proof of
the t?eorem requires a knowledge of that same facet of analysis that
pe~mItte~ us to assert that a moment-generating function, when it
exists, umquely determines a distribution. Accordingly, no proof of the
theorem will be given.
Theorem 2. Let the random variable Ynhave the distribution function
F n(y) and the moment-generating function M (t; n) that exists for
-h < t <.h for all n. If there exists a distribution function F(y), with
correspond~.ng moment-generating function M(t), defined for ItI s hI < h,
such that lim M(t; n) = M(t), then Y n has a limiting distribution with
n-->00
distribution function F (y).
In this and the subsequent section are several illustration of the
use of Theorem 2. In some of these examples it is convenient to use a
certain limit that is established in some courses in advanced calculus.
We refer to a limit of the form
the definition of convergence in probability and one says that Y n converges
to c in probability. Thus stochastic convergence and convergence in prob-
ability are equivalent. A stronger type of convergence is given by
Pr (lim Yn = c) = 1; in this case we say that Yn converges to c with
n--> 00
probability 1. Although we do not consider this type of convergence, it is
known that the mean Xn
of a random sample converges with probability 1
to the mean fL of the distribution, provided that the latter exists. This is
one form of the strong law of large numbers.
5.9. Let Wn
denote a random variable with mean fL and variance bjn",
where p > 0, fL, and b are constants (not functions of n). Prove that Wn
converges stochastically to fL. Hint. Use Chebyshev's inequality.
5.10. Let Yn denote the nth order statistic of a random sample of size n
from a uniform distribution on the interval (0, 0), as in Example 1 of
Section 5.1. Prove that Zn = YY;; converges stochastically to YO.
5.3 Limiting Moment-Generating Functions
To find the limiting distribution function of a random variable Yn
by use of the definition of limiting distribution function obviously
requires that we know Fn(Y) for each positive integer n. But, as
indicated in the introductory remarks of Section 5.1, this is precisely
5.7. Let the random variable Y, have a distribution that is b(n, Pl·
(a) Prove that Ynln converges stochastically to p. This result is one form of
the weak law of large numbers. (b) Prove that 1 - Ynln converges stochastic-
ally to 1 - p.
5.8. Let S~ denote the variance of a random sample of size n from a
distribution that is n(fL, a2). Prove that nS~/(n - 1) converges stochastically
to a2
•
We shall find the limiting distribution of the binomial distribution, when
p = μ/n, by finding the limit of M(t; n). Now

M(t; n) = E(e^(tYn)) = [(1 − p) + pe^t]^n = [1 + μ(e^t − 1)/n]^n

for all real values of t. Hence we have

lim(n→∞) M(t; n) = e^(μ(e^t − 1))

for all real values of t. Since there exists a distribution, namely the Poisson
distribution with mean μ, that has this moment-generating function e^(μ(e^t − 1)),
then, in accordance with the theorem and under the conditions stated, it is
seen that Yn has a limiting Poisson distribution with mean μ.

Whenever a random variable has a limiting distribution, we may, if we
wish, use the limiting distribution as an approximation to the exact distri-
bution function. The result of this example enables us to use the Poisson
distribution as an approximation to the binomial distribution when n is large
and p is small. This is clearly an advantage, for it is easy to provide tables for
the one-parameter Poisson distribution. On the other hand, the binomial
distribution has two parameters, and tables for this distribution are very
ungainly. To illustrate the use of the approximation, let Y have a binomial
distribution with n = 50 and p = 1/25. Then

Pr (Y ≤ 1) = (24/25)^50 + 50(1/25)(24/25)^49 = 0.400,

approximately. Since μ = np = 2, the Poisson approximation to this prob-
ability is

e^(−2) + 2e^(−2) = 0.406.
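The arithmetic of this illustration is easy to reproduce. The following minimal
Python sketch (scipy.stats is an assumed tool here, not something used by the
text) compares the exact binomial value with the Poisson approximation.

    from scipy.stats import binom, poisson

    n, p = 50, 1 / 25
    exact = binom.cdf(1, n, p)       # Pr(Y <= 1) for Y that is b(50, 1/25)
    approx = poisson.cdf(1, n * p)   # Poisson approximation with mean np = 2
    print(round(exact, 3), round(approx, 3))   # about 0.400 and 0.406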
Example 2. Let Zn be χ²(n). Then the moment-generating function of Zn
is (1 − 2t)^(−n/2), t < 1/2. The mean and the variance of Zn are, respectively,
n and 2n. The limiting distribution of the random variable Yn = (Zn − n)/√(2n)
will be investigated. Now the moment-generating function of Yn is

M(t; n) = E{exp [t(Zn − n)/√(2n)]}
        = exp (−t√(n/2)) (1 − 2t/√(2n))^(−n/2),   t < √(2n)/2.

This may be written in the form

M(t; n) = [e^(t√(2/n)) − t√(2/n) e^(t√(2/n))]^(−n/2),   t < √(2n)/2.

In accordance with Taylor's formula, there exists a number ξ(n), between 0
and t√(2/n), such that

e^(t√(2/n)) = 1 + t√(2/n) + (1/2)[t√(2/n)]² + [e^(ξ(n))/6][t√(2/n)]³.

If this sum is substituted for e^(t√(2/n)) in the last expression for M(t; n), it is
seen that

M(t; n) = [1 − t²/n + ψ(n)/n]^(−n/2),

where

ψ(n) = √2 t³e^(ξ(n))/(3√n) − √2 t³/√n − 2e^(ξ(n)) t⁴/(3n).

Since ξ(n) → 0 as n → ∞, then lim(n→∞) ψ(n) = 0 for every fixed value of t. In
accordance with the limit proposition cited earlier in this section, we have

lim(n→∞) M(t; n) = e^(t²/2)

for all real values of t. That is, the random variable Yn = (Zn − n)/√(2n) has
a limiting normal distribution with mean zero and variance 1.
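This limiting normal distribution can be checked numerically. In the sketch
below (the evaluation point y = 1 and the use of scipy.stats are illustrative
assumptions of this note), the exact distribution function of Yn = (Zn − n)/√(2n)
is compared with the n(0, 1) distribution function as n grows.

    import numpy as np
    from scipy.stats import chi2, norm

    y = 1.0                                   # an arbitrary illustrative point
    for n in (10, 50, 200):
        # Pr(Yn <= y) = Pr(Zn <= n + y*sqrt(2n)) with Zn chi-square(n)
        exact = chi2.cdf(n + y * np.sqrt(2 * n), df=n)
        print(n, round(exact, 4), round(norm.cdf(y), 4))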
EXERCISES

5.11. Let Xn have a gamma distribution with parameter α = n and β,
where β is not a function of n. Let Yn = Xn/n. Find the limiting distribution
of Yn.

5.12. Let Zn be χ²(n) and let Wn = Zn/n². Find the limiting distribution
of Wn.

5.13. Let X be χ²(50). Approximate Pr (40 < X < 60).

5.14. Let p = 0.95 be the probability that a man, in a certain age group,
lives at least 5 years.
(a) If we are to observe 60 such men and if we assume independence, find
the probability that at least 56 of them live 5 or more years.
(b) Find an approximation to the result of part (a) by using the Poisson
distribution. Hint. Redefine p to be 0.05 and 1 − p = 0.95.

5.15. Let the random variable Zn have a Poisson distribution with
parameter μ = n. Show that the limiting distribution of the random variable
Yn = (Zn − n)/√n is normal with mean zero and variance 1.

5.16. Let S²n denote the variance of a random sample of size n from a
distribution that is n(μ, σ²). It has been proved that nS²n/(n − 1) converges
stochastically to σ². Prove that S²n converges stochastically to σ².

5.17. Let Xn and Yn have a bivariate normal distribution with parameters
μ₁, μ₂, σ²₁, σ²₂ (free of n) but ρ = 1 − 1/n. Consider the conditional distribu-
tion of Yn, given Xn = x. Investigate the limit of this conditional distribution
as n → ∞. What is the limiting distribution if ρ = −1 + 1/n? Reference to
these facts was made in the Remark, Section 2.3.
5.18. Let X̄n denote the mean of a random sample of size n from a Poisson
distribution with parameter μ = 1.
(a) Show that the moment-generating function of Yn = √n(X̄n − μ)/σ =
√n(X̄n − 1) is given by exp [−t√n + n(e^(t/√n) − 1)].
(b) Investigate the limiting distribution of Yn as n → ∞. Hint. Replace,
by its MacLaurin's series, the expression e^(t/√n), which is in the exponent of
the moment-generating function of Yn.

5.19. Let X̄n denote the mean of a random sample of size n from a
distribution that has p.d.f. f(x) = e^(−x), 0 < x < ∞, zero elsewhere.
(a) Show that the moment-generating function M(t; n) of Yn =
√n(X̄n − 1) is equal to [e^(t/√n) − (t/√n)e^(t/√n)]^(−n), t < √n.
(b) Find the limiting distribution of Yn as n → ∞.
This exercise and the immediately preceding one are special instances of
an important theorem that will be proved in the next section.
5.4 The Central Limit Theorem

It was seen (Section 4.8) that, if X₁, X₂, ..., Xn is a random sample
from a normal distribution with mean μ and variance σ², the random
variable

(X₁ + X₂ + ··· + Xn − nμ)/(σ√n) = √n(X̄n − μ)/σ

is, for every positive integer n, normally distributed with zero mean and
unit variance. In probability theory there is a very elegant theorem
called the central limit theorem. A special case of this theorem asserts
the remarkable and important fact that if X₁, X₂, ..., Xn denote the
items of a random sample of size n from any distribution having positive
variance σ² (and hence finite mean μ), then the random variable
√n(X̄n − μ)/σ has a limiting normal distribution with zero mean and
unit variance. If this fact can be established, it will imply, whenever
the conditions of the theorem are satisfied, that (for fixed n) the random
variable √n(X̄ − μ)/σ has an approximate normal distribution with
mean zero and variance 1. It will then be possible to use this approxi-
mate normal distribution to compute approximate probabilities con-
cerning X̄.

The more general form of the theorem is stated, but it is proved
only in the modified case. However, this is exactly the proof of the
theorem that would be given if we could use the characteristic function
in place of the moment-generating function.

Theorem 3. Let X₁, X₂, ..., Xn denote the items of a random sample
from a distribution that has mean μ and positive variance σ². Then the
random variable Yn = (Σ Xi − nμ)/(√n σ) = √n(X̄n − μ)/σ has a limit-
ing distribution that is normal with mean zero and variance 1.

Proof. In the modification of the proof, we assume the existence of
the moment-generating function M(t) = E(e^(tX)), −h < t < h, of the
distribution. However, this proof is essentially the same one that would
be given for this theorem in a more advanced course by replacing the
moment-generating function by the characteristic function φ(t) =
E(e^(itX)).

The function

m(t) = E[e^(t(X−μ))] = e^(−μt)M(t)

also exists for −h < t < h. Since m(t) is the moment-generating
function for X − μ, it must follow that m(0) = 1, m′(0) = E(X − μ)
= 0, and m″(0) = E[(X − μ)²] = σ². By Taylor's formula there exists
a number ξ between 0 and t such that

m(t) = m(0) + m′(0)t + m″(ξ)t²/2 = 1 + m″(ξ)t²/2.

If σ²t²/2 is added and subtracted, then

m(t) = 1 + σ²t²/2 + [m″(ξ) − σ²]t²/2.

Next consider M(t; n), where

M(t; n) = E{exp [t(X₁ + X₂ + ··· + Xn − nμ)/(σ√n)]}
        = E[exp (t(X₁ − μ)/(σ√n))] ··· E[exp (t(Xn − μ)/(σ√n))]
        = {E[exp (t(X − μ)/(σ√n))]}^n
        = [m(t/(σ√n))]^n,   −h < t/(σ√n) < h.
In m(t), replace t by t/(σ√n) to obtain

m(t/(σ√n)) = 1 + t²/(2n) + [m″(ξ) − σ²]t²/(2nσ²),

where now ξ is between 0 and t/(σ√n) with −hσ√n < t < hσ√n.
Accordingly,

M(t; n) = {1 + t²/(2n) + [m″(ξ) − σ²]t²/(2nσ²)}^n.

Since m″(t) is continuous at t = 0 and since ξ → 0 as n → ∞, we have

lim(n→∞) [m″(ξ) − σ²] = 0.

The limit proposition cited in Section 5.3 shows that

lim(n→∞) M(t; n) = e^(t²/2)

for all real values of t. This proves that the random variable Yn =
√n(X̄n − μ)/σ has a limiting normal distribution with mean zero and
variance 1.

We interpret this theorem as saying, with n a fixed positive integer,
that the random variable √n(X̄ − μ)/σ has an approximate normal
distribution with mean zero and variance 1; and in applications we
use the approximate normal p.d.f. as though it were the exact p.d.f.
of √n(X̄ − μ)/σ.

Some illustrative examples, here and later, will help show the
importance of this version of the central limit theorem.

Example 1. Let X̄ denote the mean of a random sample of size 75 from
the distribution that has the p.d.f.

f(x) = 1,   0 < x < 1,
     = 0 elsewhere.

It was stated in Section 5.1 that the exact p.d.f. of X̄, say g(x̄), is rather
complicated. It can be shown that g(x̄) has a graph at points of positive
probability density that is composed of arcs of 75 different polynomials of
degree 74. The computation of such a probability as Pr (0.45 < X̄ < 0.55)
would be extremely laborious. The conditions of the theorem are satisfied,
since M(t) exists for all real values of t. Moreover, μ = 1/2 and σ² = 1/12, so
we have approximately

Pr (0.45 < X̄ < 0.55) = Pr [√n(0.45 − μ)/σ < √n(X̄ − μ)/σ < √n(0.55 − μ)/σ]
                     = Pr [−1.5 < 30(X̄ − 0.5) < 1.5]
                     = 0.866,

from Table III in Appendix B.
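The figure 0.866 is easy to confirm both from the normal approximation and
by simulation. The sketch below is an editorial illustration; the simulation
size and seed are arbitrary assumptions.

    import numpy as np
    from scipy.stats import norm

    n, mu, sigma = 75, 0.5, np.sqrt(1 / 12)     # uniform(0, 1) population
    z = np.sqrt(n) * (0.55 - mu) / sigma        # equals 1.5, as in the text
    approx = norm.cdf(z) - norm.cdf(-z)         # normal approximation, about 0.866

    rng = np.random.default_rng(1)
    xbar = rng.uniform(0, 1, size=(100_000, n)).mean(axis=1)
    simulated = np.mean((0.45 < xbar) & (xbar < 0.55))
    print(round(approx, 3), round(simulated, 3))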
Example 2. Let X₁, X₂, ..., Xn denote a random sample from a
distribution that is b(1, p). Here μ = p, σ² = p(1 − p), and M(t) exists for
all real values of t. If Yn = X₁ + ··· + Xn, it is known that Yn is b(n, p).
Calculation of probabilities concerning Yn, when we do not use the Poisson
approximation, can be greatly simplified by making use of the fact that
(Yn − np)/√(np(1 − p)) = √n(X̄n − p)/√(p(1 − p)) = √n(X̄n − μ)/σ has a
limiting distribution that is normal with mean zero and variance 1. Let
n = 100 and p = 1/2, and suppose that we wish to compute Pr (Y = 48, 49,
50, 51, 52). Since Y is a random variable of the discrete type, the events Y =
48, 49, 50, 51, 52 and 47.5 < Y < 52.5 are equivalent. That is, Pr (Y = 48,
49, 50, 51, 52) = Pr (47.5 < Y < 52.5). Since np = 50 and np(1 − p) = 25,
the latter probability may be written

Pr (47.5 < Y < 52.5) = Pr [(47.5 − 50)/5 < (Y − 50)/5 < (52.5 − 50)/5]
                     = Pr [−0.5 < (Y − 50)/5 < 0.5].

Since (Y − 50)/5 has an approximate normal distribution with mean zero
and variance 1, Table III shows this probability to be approximately 0.382.

The convention of selecting the event 47.5 < Y < 52.5, instead of, say,
47.8 < Y < 52.3, as the event equivalent to the event Y = 48, 49, 50, 51, 52
seems to have originated in the following manner: The probability,
Pr (Y = 48, 49, 50, 51, 52), can be interpreted as the sum of five rectangular
areas where the rectangles have bases 1 but the heights are, respectively,
Pr (Y = 48), ..., Pr (Y = 52). If these rectangles are so located that the
midpoints of their bases are, respectively, at the points 48, 49, ..., 52 on a
horizontal axis, then in approximating the sum of these areas by an area
bounded by the horizontal axis, the graph of a normal p.d.f., and two
ordinates, it seems reasonable to take the two ordinates at the points 47.5
and 52.5.
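A brief numerical check of this continuity correction may be helpful; the use
of scipy.stats below is an assumption of this illustration, not of the text.

    from scipy.stats import binom, norm

    n, p = 100, 0.5
    exact = binom.cdf(52, n, p) - binom.cdf(47, n, p)   # Pr(Y = 48, ..., 52)
    # normal approximation using the corrected event 47.5 < Y < 52.5
    approx = norm.cdf((52.5 - 50) / 5) - norm.cdf((47.5 - 50) / 5)
    print(round(exact, 3), round(approx, 3))            # both are near 0.382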
EXERCISES
5.20. Let X̄ denote the mean of a random sample of size 100 from a
distribution that is χ²(50). Compute an approximate value of
Pr (49 < X̄ < 51).

5.21. Let X̄ denote the mean of a random sample of size 128 from a
gamma distribution with α = 2 and β = 4. Approximate Pr (7 < X̄ < 9).

5.22. Let Y be b(72, 1/3). Approximate Pr (22 ≤ Y ≤ 28).

5.23. Compute an approximate probability that the mean of a random
sample of size 15 from a distribution having p.d.f. f(x) = 3x², 0 < x < 1,
zero elsewhere, is between 3/5 and 4/5.
5.24. Let Y denote the sum of the items of a random sample of size 12
from a distribution having p.d.f. f(x) = 1/6, x = 1, 2, 3, 4, 5, 6, zero elsewhere.
Compute an approximate value of Pr (36 ≤ Y ≤ 48). Hint. Since the event
of interest is Y = 36, 37, ..., 48, rewrite the probability as Pr (35.5 < Y <
48.5).
5.25. Let Y be b(400, !). Compute an approximate value of
Pr (0.25 < Yin).
5.26. If Y is b(100, 1/2), approximate the value of Pr (Y = 50).

5.27. Let Y be b(n, 0.55). Find the smallest value of n so that (approxi-
mately) Pr (Y/n > 1/2) ≥ 0.95.
5.28. Let J(x) = l/x2, 1 < x < 00, zero elsewhere, be the p.d.f. of a
random variable X. Consider a random sample of size 72 from the distri-
bution having this p.d.f. Compute approximately the probability that more
than 50 of the items of the random sample are less than 3.
5.29. Forty-eight measurements are recorded to several decimal places.
Each of these 48 numbers is rounded off to the nearest integer. The sum of
the original 48 numbers is approximated by the sum of these integers. If we
assume that the errors made by rounding off are stochastically independent
and have uniform distributions over the interval (−1/2, 1/2), compute approxi-
mately the probability that the sum of the integers is within 2 units of the
true sum.

5.5 Some Theorems on Limiting Distributions

In this section we shall present some theorems that can often be
used to simplify the study of certain limiting distributions.

Theorem 4. Let Fn(u) denote the distribution function of a random
variable Un whose distribution depends upon the positive integer n. Let Un
converge stochastically to the constant c ≠ 0. The random variable Un/c
converges stochastically to 1.

The proof of this theorem is very easy and is left as an exercise.

Theorem 5. Let Fn(u) denote the distribution function of a random
variable Un whose distribution depends upon the positive integer n. Further,
let Un converge stochastically to the positive constant c and let Pr (Un < 0)
= 0 for every n. The random variable √Un converges stochastically to √c.

Proof. We are given that lim(n→∞) Pr (|Un − c| ≥ ε) = 0 for every
ε > 0. We are to prove that lim(n→∞) Pr (|√Un − √c| ≥ ε′) = 0 for
every ε′ > 0. Now the probability

Pr (|Un − c| ≥ ε) = Pr [|(√Un − √c)(√Un + √c)| ≥ ε]
                 = Pr (|√Un − √c| ≥ ε/(√Un + √c))
                 ≥ Pr (|√Un − √c| ≥ ε/√c).

If we let ε′ = ε/√c, and if we take the limit, as n becomes infinite, we
have

0 = lim(n→∞) Pr (|Un − c| ≥ ε) ≥ lim(n→∞) Pr (|√Un − √c| ≥ ε′) ≥ 0

for every ε′ > 0. This completes the proof.

The conclusions of Theorems 4 and 5 are very natural ones and they
certainly appeal to our intuition. There are many other theorems of
this flavor in probability theory. As exercises, it is to be shown that if
the random variables Un and Vn converge stochastically to the respec-
tive constants c and d, then UnVn converges stochastically to the
constant cd, and Un/Vn converges stochastically to the constant c/d,
provided that d ≠ 0. However, we shall accept, without proof, the
following theorem.

Theorem 6. Let Fn(u) denote the distribution function of a random
variable Un whose distribution depends upon the positive integer n. Let Un
have a limiting distribution with distribution function F(u). Let Hn(v)
denote the distribution function of a random variable Vn whose distribution
depends upon the positive integer n. Let Vn converge stochastically to 1.
The limiting distribution of the random variable Wn = Un/Vn is the same
as that of Un; that is, Wn has a limiting distribution with distribution
function F(w).
Example 1. Let Yn denote a random variable that is b(n, p), 0 < p < 1.
We know that

Un = (Yn − np)/√(np(1 − p))

has a limiting distribution that is n(0, 1). Moreover, it has been proved that
Yn/n and 1 − Yn/n converge stochastically to p and 1 − p, respectively; thus
(Yn/n)(1 − Yn/n) converges stochastically to p(1 − p). Then, by Theorem 4,
(Yn/n)(1 − Yn/n)/[p(1 − p)] converges stochastically to 1, and Theorem 5
asserts that the following does also:

Vn = {(Yn/n)(1 − Yn/n)/[p(1 − p)]}^(1/2).

Thus, in accordance with Theorem 6, the ratio Wn = Un/Vn, namely

(Yn − np)/√(n(Yn/n)(1 − Yn/n)),

has a limiting distribution that is n(0, 1). This fact enables us to write (with
n a fixed positive integer)

Pr [−2 < (Y − np)/√(n(Y/n)(1 − Y/n)) < 2] = 0.954,

approximately.
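The 0.954 statement is easy to probe by simulation. In the sketch below, the
values n = 100, p = 0.3, the seed, and the 10,000 replications are assumptions
made only for illustration; the unknown p(1 − p) is replaced by
(Yn/n)(1 − Yn/n) exactly as in the ratio Wn above.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, reps = 100, 0.3, 10_000
    y = rng.binomial(n, p, size=reps)
    phat = y / n                                       # Yn/n for each replication
    w = (y - n * p) / np.sqrt(n * phat * (1 - phat))   # the ratio Wn of the example
    print(np.mean(np.abs(w) < 2))                      # close to 0.954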
Example 2. Let X̄n and S²n denote, respectively, the mean and the
variance of a random sample of size n from a distribution that is n(μ, σ²),
σ² > 0. It has been proved that X̄n converges stochastically to μ and that S²n
converges stochastically to σ². Theorem 5 asserts that Sn converges stochastic-
ally to σ and Theorem 4 tells us that Sn/σ converges stochastically to 1. In
accordance with Theorem 6, the random variable Wn = σX̄n/Sn has the same
limiting distribution as does X̄n. That is, σX̄n/Sn converges stochastically to μ.
EXERCISES
5.30. Prove Theorem 4. Hint. Note that Pr (|Un/c − 1| < ε) =
Pr (|Un − c| < ε|c|), for every ε > 0. Then take ε′ = ε|c|.

5.31. Let X̄n denote the mean of a random sample of size n from a gamma
distribution with parameters α = μ > 0 and β = 1. Show that the limiting
distribution of √n(X̄n − μ)/√X̄n is n(0, 1).
5.32. Let Tn = (X̄n − μ)/√(S²n/(n − 1)), where X̄n and S²n represent,
respectively, the mean and the variance of a random sample of size n from a
distribution that is n(μ, σ²). Prove that the limiting distribution of Tn is
n(0, 1).
5.33. Let Xl' ... ' X n and YI , ... , Yn be the items of two independent
random samples, each of size n, from the distributions that have the
respective means JLI and JL2 and the common variance u2. Find the limiting
distribution of
[(X̄n − Ȳn) − (μ₁ − μ₂)] / (σ√(2/n)),

where X̄n and Ȳn are the respective means of the samples. Hint. Let Z̄n =
Σ Zi/n, where Zi = Xi − Yi.
5.34. Let U'; and Vn converge stochastically to e and d, respectively.
Prove the following.
(a) The sum U'; + Vnconverges stochastically to e + d. Hint. Show that
Pr (!Un + Vn - e - dl ~ E) :$ Pr (!Un - c] + IVn - dl ~ E) :$ Pr (IU'; - c]
~ E/2or IVn - dl ~ E/2) s Pr (IUn - cl ~ E/2) + Pr (IVn - dl ~ E/2).
(b) The product U';Vn converges stochastically to cd.
(c) If d i= 0, the ratio Un/Vn converges stochastically to cjd.
5.35. Let Un converge stochastically to c. If h(u) is a continuous function
at u = c, prove that h(Un) converges stochastically to h(c). Hint. For each
ε > 0, there exists a δ > 0 such that Pr [|h(Un) − h(c)| < ε] ≥ Pr [|Un − c|
< δ]. Why?
Example 1. Let X₁, X₂, ..., Xn denote a random sample from the
distribution with p.d.f.
of times and it is found that X₁ = x₁, X₂ = x₂, ..., Xn = xn, we shall
refer to x₁, x₂, ..., xn as the experimental values of X₁, X₂, ..., Xn or
as the sample data.
We shall use the terminology of the two preceding paragraphs, and
in this section we shall give some examples of statistical inference.
These examples will be built around the notion of a point estimate of
an unknown parameter in a p.d.f.
Let a random variable X have a p.d.f. that is of known functional
form but in which the p.d.f. depends upon an unknown parameter θ
that may have any value in a set Ω. This will be denoted by writing the
p.d.f. in the form f(x; θ), θ ∈ Ω. The set Ω will be called the parameter
space. Thus we are confronted, not with one distribution of prob-
ability, but with a family of distributions. To each value of θ, θ ∈ Ω,
there corresponds one member of the family. A family of probability
density functions will be denoted by the symbol {f(x; θ); θ ∈ Ω}. Any
member of this family of probability density functions will be denoted
by the symbol f(x; θ), θ ∈ Ω. We shall continue to use the special
symbols that have been adopted for the normal, the chi-square, and the
binomial distributions. We may, for instance, have the family
{n(θ, 1); θ ∈ Ω}, where Ω is the set −∞ < θ < ∞. One member of this
family of distributions is the distribution that is n(0, 1). Any arbitrary
member is n(θ, 1), −∞ < θ < ∞.

Consider a family of probability density functions {f(x; θ); θ ∈ Ω}.
It may be that the experimenter needs to select precisely one member
of the family as being the p.d.f. of his random variable. That is, he
needs a point estimate of θ. Let X₁, X₂, ..., Xn denote a random
sample from a distribution that has a p.d.f. which is one member (but
which member we do not know) of the family {f(x; θ); θ ∈ Ω} of prob-
ability density functions. That is, our sample arises from a distribution
that has the p.d.f. f(x; θ), θ ∈ Ω. Our problem is that of defining a
statistic Y₁ = u₁(X₁, X₂, ..., Xn), so that if x₁, x₂, ..., xn are the
observed experimental values of X₁, X₂, ..., Xn, then the number
y₁ = u₁(x₁, x₂, ..., xn) will be a good point estimate of θ.
The following illustration should help motivate one principle that is
often used in finding point estimates.
Chapter 6
Estimation
6.1 Point Estimation
The first five chapters of this book deal with certain concepts and
problems of probability theory. Throughout we have carefully dis-
tinguished between a sample space 𝒞 of outcomes and the space 𝒜
of one or more random variables defined on 𝒞. With this chapter we
begin a study of some problems in statistics and here we are more
interested in the number (or numbers) by which an outcome is repre-
sented than we are in the outcome itself. Accordingly, we shall adopt a
frequently used convention. We shall refer to a random variable X as
the outcome of a random experiment and we shall refer to the space of
X as the sample space. Were it not so awkward, we would call X the
numerical outcome. Once the experiment has been performed and it is
found that X = x, we shall call x the experimental value of X for that
performance of the experiment.
This convenient terminology can be used to advantage in more
general situations. To illustrate this, let a random experiment be
repeated n independent times and under identical conditions. Then
the random variables X l>X 2' ... , X n (each of which assigns a numerical
value to an outcome) constitute (Section 4.1) the items of a random
sample. If we are more concerned with the numerical representations of
the outcomes than with the outcomes themselves, it seems natural to
refer to Xl> X 2, .•. , X n as the outcomes. And what more appropriate
name can we give to the space of a random sample than the sample
space? Once the experiment has been performed the indicated number
f(x; θ) = θ^x (1 − θ)^(1−x),   x = 0, 1,
        = 0 elsewhere,
where 0 ≤ θ ≤ 1. The probability that X₁ = x₁, X₂ = x₂, ..., Xn = xn is
the joint p.d.f.

θ^(x₁)(1 − θ)^(1−x₁) θ^(x₂)(1 − θ)^(1−x₂) ··· θ^(xn)(1 − θ)^(1−xn) = θ^(Σxi)(1 − θ)^(n−Σxi),

where xi equals zero or 1, i = 1, 2, ..., n. This probability, which is the
joint p.d.f. of X₁, X₂, ..., Xn, may be regarded as a function of θ and, when
so regarded, is denoted by L(θ) and called the likelihood function. That is,

L(θ) = θ^(Σxi)(1 − θ)^(n−Σxi),   0 ≤ θ ≤ 1.
We might ask what value of θ would maximize the probability L(θ) of
obtaining this particular observed sample x₁, x₂, ..., xn. Certainly, this
maximizing value of θ would seemingly be a good estimate of θ because it
would provide the largest probability of this particular sample. However,
since the likelihood function L(θ) and its logarithm, ln L(θ), are maximized
for the same value θ, either L(θ) or ln L(θ) can be used. Here

ln L(θ) = (Σ xi) ln θ + (n − Σ xi) ln (1 − θ);

so we have

d ln L(θ)/dθ = Σ xi/θ − (n − Σ xi)/(1 − θ) = 0,

provided that θ is not equal to zero or 1. This is equivalent to the equation

(1 − θ) Σ xi = θ(n − Σ xi),

whose solution for θ is Σ xi/n. That Σ xi/n actually maximizes L(θ) and ln L(θ)
can be easily checked, even in the cases in which all of x₁, x₂, ..., xn equal
zero together or 1 together. That is, Σ xi/n is the value of θ that maximizes
L(θ). The corresponding statistic,

θ̂ = (1/n) Σ Xi,

is called the maximum likelihood estimator of θ. The observed value of θ̂,
namely Σ xi/n, is called the maximum likelihood estimate of θ. For a simple
example, suppose that n = 3, and x₁ = 1, x₂ = 0, x₃ = 1; then L(θ) =
θ²(1 − θ) and the observed θ̂ = 2/3 is the maximum likelihood estimate of θ.
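The tiny numerical case above can be verified directly. The following sketch
(the grid search over θ is an editorial device, not the text's method) confirms
that the log-likelihood of the observed sample 1, 0, 1 is largest near θ = 2/3,
which is also the sample mean.

    import numpy as np

    x = np.array([1, 0, 1])                      # the observed sample of the example
    theta = np.linspace(0.001, 0.999, 999)
    log_L = x.sum() * np.log(theta) + (x.size - x.sum()) * np.log(1 - theta)
    print(theta[np.argmax(log_L)], x.mean())     # both are about 2/3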
The principle of the method of maximum likelihood can now be
formulated easily. Consider a random sample Xl' X 2, •.• , X n from a
distribution having p.d.f. f(x; 8), 8 E Q. The joint p.d.f. of Xl' X 2 , •.• ,
Xn is f(x₁; θ)f(x₂; θ)···f(xn; θ). This joint p.d.f. may be regarded as a
function of θ. When so regarded, it is called the likelihood function L
of the random sample, and we write

L(θ; x₁, x₂, ..., xn) = f(x₁; θ)f(x₂; θ)···f(xn; θ),   θ ∈ Ω.

Suppose that we can find a nontrivial function of x₁, x₂, ..., xn, say
u(x₁, x₂, ..., xn), such that, when θ is replaced by u(x₁, x₂, ..., xn), the
likelihood function L is a maximum. That is, L[u(x₁, x₂, ..., xn);
x₁, x₂, ..., xn] is at least as great as L(θ; x₁, x₂, ..., xn) for every θ ∈ Ω.
Then the statistic u(X₁, X₂, ..., Xn) will be called a maximum likeli-
hood estimator of θ and will be denoted by the symbol θ̂ = u(X₁, X₂,
..., Xn). We remark that in many instances there will be a unique
maximum likelihood estimator θ̂ of a parameter θ, and often it may be
obtained by the process of differentiation.
Example 2. Let X₁, X₂, ..., Xn be a random sample from the normal
distribution n(θ, 1), −∞ < θ < ∞. Here

L(θ; x₁, x₂, ..., xn) = (2π)^(−n/2) exp [−Σ (xi − θ)²/2].

This function L can be maximized by setting the first derivative of L, with
respect to θ, equal to zero and solving the resulting equation for θ. We note,
however, that each of the functions L and ln L is a maximum for the same
value θ. So it may be easier to solve

d ln L(θ; x₁, x₂, ..., xn)/dθ = 0.

For this example,

d ln L(θ; x₁, x₂, ..., xn)/dθ = Σ (xi − θ).

If this derivative is equated to zero, the solution for θ is u(x₁, x₂, ..., xn) =
Σ xi/n. That Σ xi/n actually maximizes L is easily shown. Thus the statistic

θ̂ = (1/n) Σ Xi = X̄

is the unique maximum likelihood estimator of the mean θ.
It is interesting to note that in both Examples 1 and 2, it is true that
E(IJ) = e. That is, in each of these cases, the expected value of the
estimator is equal to the corresponding parameter, which leads to the
following definition.
Definition 1. Any statistic whose mathematical expectation is
equal to a parameter θ is called an unbiased estimator of the parameter
θ. Otherwise, the statistic is said to be biased.

Example 3. Let

f(x; θ) = 1/θ,   0 < x ≤ θ,   0 < θ < ∞,
        = 0 elsewhere,
and let X₁, X₂, ..., Xn denote a random sample from this distribution.
Note that we have taken 0 < x ≤ θ instead of 0 < x < θ so as to avoid a
discussion of supremum versus maximum. Here

L(θ; x₁, x₂, ..., xn) = 1/θⁿ,   0 < xi ≤ θ,

θm) ∈ Ω, depend on m parameters. This joint p.d.f., when regarded as
a function of (θ₁, θ₂, ..., θm) ∈ Ω, is called the likelihood function of the
random variables. Those functions u₁(x, y, ..., z), u₂(x, y, ..., z), ...,
um(x, y, ..., z) that maximize this likelihood function with respect to
θ₁, θ₂, ..., θm, respectively, define the maximum likelihood estimators

θ̂₁ = u₁(X, Y, ..., Z), θ̂₂ = u₂(X, Y, ..., Z), ..., θ̂m = um(X, Y, ..., Z)

of the m parameters.
Example 4. Let X₁, X₂, ..., Xn denote a random sample from a distri-
bution that is n(θ₁, θ₂), −∞ < θ₁ < ∞, 0 < θ₂ < ∞. We shall find θ̂₁ and
θ̂₂, the maximum likelihood estimators of θ₁ and θ₂. The logarithm of the
likelihood function may be written in the form

ln L(θ₁, θ₂; x₁, ..., xn) = −(n/2) ln (2πθ₂) − Σ (xi − θ₁)²/(2θ₂).

We observe that we may maximize by differentiation. We have

∂ ln L/∂θ₁ = Σ (xi − θ₁)/θ₂,   ∂ ln L/∂θ₂ = Σ (xi − θ₁)²/(2θ₂²) − n/(2θ₂).
Sometimes it is impossible to find maximum likelihood estimators
in a convenient closed form and numerical methods must be used to
maximize the likelihood function. For illustration, suppose that
X1> X 2 , ••• , X n is a random sample from a gamma distribution with
However, in Chapter 5 it has been shown that ~1 = X and ~2 = 52 converge
stochastically to 81 and 82, respectively, and thus they are consistent esti-
mators of 81 and 82 ,
If we equate these partial derivatives to zero and solve simultaneously the
two equations thus obtained, the solutions for θ₁ and θ₂ are found to be
Σ xi/n = x̄ and Σ (xi − x̄)²/n = s², respectively. It can be verified that these
solutions maximize L. Thus the maximum likelihood estimators of θ₁ = μ
and θ₂ = σ² are, respectively, the mean and the variance of the sample,
namely θ̂₁ = X̄ and θ̂₂ = S². Whereas θ̂₁ is an unbiased estimator of θ₁, the
estimator θ̂₂ = S² is biased because E(S²) = (n − 1)θ₂/n.
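The bias E(S²) = (n − 1)θ₂/n is easy to see numerically. In the sketch below,
the true variance 4, the sample size 5, the seed, and the number of replications
are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(4)
    mu, sigma2, n, reps = 0.0, 4.0, 5, 200_000
    x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    s2 = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # S^2 with divisor n
    print(s2.mean(), (n - 1) / n * sigma2)        # both are close to 3.2, not 4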
Consistency is a desirable property of an estimator; and, in all cases
of practical interest, maximum likelihood estimators are consistent.
The preceding definitions and properties are easily generalized.
Let X, Y, . . . , Z denote random variables that mayor may not be
stochastically independent and that mayor may not be identically
distributed. Let the joint p.d.f. g(x, y, ... , z; 01> 02' ... , Om), (01) 02' ... ,
While the maximum likelihood estimator θ̂ of θ in Example 3 is a
biased estimator, results in Chapter 5 show that the nth order statistic
θ̂ = max (Xi) = Yn converges stochastically to θ. Thus, in accordance
with the following definition, we say that θ̂ = Yn is a consistent
estimator of θ.

Definition 2. Any statistic that converges stochastically to a
parameter θ is called a consistent estimator of that parameter θ.
which is an ever-decreasing function of θ. The maximum of such functions
cannot be found by differentiation but by selecting θ as small as possible.
Now θ ≥ each xi; in particular, then, θ ≥ max (xi). Thus L can be made no
larger than

1/[max (xi)]ⁿ,

and the unique maximum likelihood estimator θ̂ of θ in this example is the
nth order statistic max (Xi). It can be shown that E[max (Xi)] = nθ/(n + 1).
Thus, in this instance, the maximum likelihood estimator of the parameter θ
is biased. That is, the property of unbiasedness is not in general a property
of a maximum likelihood estimator.
parameters α = θ₁ and β = θ₂, where θ₁ > 0, θ₂ > 0. It is difficult to
maximize

L(θ₁, θ₂; x₁, ..., xn) = [Γ(θ₁)θ₂^(θ₁)]^(−n) (x₁x₂···xn)^(θ₁−1) exp (−Σ xi/θ₂)

with respect to θ₁ and θ₂, owing to the presence of the gamma function
Γ(θ₁). However, to obtain easily point estimates of θ₁ and θ₂, let us
simply equate the first two moments of the distribution to the corre-
sponding moments of the sample. This seems like a reasonable way in
which to find estimators, since the empirical distribution Fn(x) converges
stochastically to F(x), and hence corresponding moments should be
about equal. Here in this illustration we have

θ₁θ₂ = X̄   and   θ₁θ₂² = S²,

the solutions of which are

θ̂₁ = X̄²/S²   and   θ̂₂ = S²/X̄.

We say that these latter two statistics, θ̂₁ and θ̂₂, are respective esti-
mators of θ₁ and θ₂ found by the method of moments.

To generalize the discussion of the preceding paragraph, let X₁, X₂,
..., Xn be a random sample of size n from a distribution with p.d.f.
f(x; θ₁, θ₂, ..., θr), (θ₁, ..., θr) ∈ Ω. The expectation E(Xᵏ) is frequently
called the kth moment of the distribution, k = 1, 2, 3, .... The sum
Mk = Σ Xiᵏ/n is the kth moment of the sample, k = 1, 2, 3, .... The
method of moments can be described as follows. Equate E(Xᵏ) to Mk,
beginning with k = 1 and continuing until there are enough equations
to provide unique solutions for θ₁, θ₂, ..., θr, say hi(M₁, M₂, ...),
i = 1, 2, ..., r, respectively. It should be noted that this could be
done in an equivalent manner by equating μ = E(X) to X̄ and
E[(X − μ)ᵏ] to Σ (Xi − X̄)ᵏ/n, k = 2, 3, and so on until unique solu-
tions for θ₁, θ₂, ..., θr are obtained. This alternative procedure was
used in the preceding illustration. In most practical cases, the esti-
mator θ̂i = hi(M₁, M₂, ...) of θi, found by the method of moments,
is a consistent estimator of θi, i = 1, 2, ..., r.
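A brief numerical illustration of these estimators may be useful. In the sketch
below, the true values θ₁ = 2 and θ₂ = 3 and the sample size 500 are assumed
only for illustration; the two printed quantities are the method-of-moments
estimates X̄²/S² and S²/X̄ described above.

    import numpy as np

    rng = np.random.default_rng(3)
    theta1, theta2, n = 2.0, 3.0, 500
    x = rng.gamma(shape=theta1, scale=theta2, size=n)

    xbar = x.mean()
    s2 = np.mean((x - xbar) ** 2)     # the sample variance S^2 of the text
    print(xbar**2 / s2, s2 / xbar)    # roughly 2 and 3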
EXERCISES

6.1. Let X₁, X₂, ..., Xn represent a random sample from each of the
distributions having the following probability density functions:
(a) f(x; θ) = θ^x e^(−θ)/x!, x = 0, 1, 2, ..., 0 ≤ θ < ∞, zero elsewhere, where
f(0; 0) = 1.
(b) f(x; θ) = θx^(θ−1), 0 < x < 1, 0 < θ < ∞, zero elsewhere.
(c) f(x; θ) = (1/θ)e^(−x/θ), 0 < x < ∞, 0 < θ < ∞, zero elsewhere.
(d) f(x; θ) = (1/2)e^(−|x−θ|), −∞ < x < ∞, −∞ < θ < ∞.
(e) f(x; θ) = e^(−(x−θ)), θ ≤ x < ∞, −∞ < θ < ∞, zero elsewhere.
In each case find the maximum likelihood estimator θ̂ of θ.

6.2. Let X₁, X₂, ..., Xn be a random sample from the distribution
having p.d.f. f(x; θ₁, θ₂) = (1/θ₂)e^(−(x−θ₁)/θ₂), θ₁ ≤ x < ∞, −∞ < θ₁ < ∞,
0 < θ₂ < ∞, zero elsewhere. Find the maximum likelihood estimators of
θ₁ and θ₂.

6.3. Let Y₁ < Y₂ < ··· < Yn be the order statistics of a random sample
from a distribution with p.d.f. f(x; θ) = 1, θ − 1/2 ≤ x ≤ θ + 1/2, −∞ < θ < ∞,
zero elsewhere. Show that every statistic u(X₁, X₂, ..., Xn) such that

Yn − 1/2 ≤ u(X₁, X₂, ..., Xn) ≤ Y₁ + 1/2

is a maximum likelihood estimator of θ. In particular, (4Y₁ + 2Yn + 1)/6,
(Y₁ + Yn)/2, and (2Y₁ + 4Yn − 1)/6 are three such statistics. Thus unique-
ness is not in general a property of a maximum likelihood estimator.

6.4. Let X₁, X₂, and X₃ have the multinomial distribution in which
n = 25, k = 4, and the unknown probabilities are θ₁, θ₂, and θ₃, respectively.
Here we can, for convenience, let X₄ = 25 − X₁ − X₂ − X₃ and θ₄ = 1 − θ₁ −
θ₂ − θ₃. If the observed values of the random variables are x₁ = 4, x₂ = 11,
and x₃ = 7, find the maximum likelihood estimates of θ₁, θ₂, and θ₃.

6.5. The Pareto distribution is frequently used as a model in study of
incomes and has the distribution function
F(x; θ₁, θ₂) = 1 − (θ₁/x)^(θ₂), θ₁ ≤ x, zero elsewhere, where θ₁ > 0 and θ₂ > 0.
If X₁, X₂, ..., Xn is a random sample from this distribution, find the maxi-
mum likelihood estimators of θ₁ and θ₂.

6.6. Let Yn be a statistic such that lim(n→∞) E(Yn) = θ and lim(n→∞) σ²(Yn) = 0.
Prove that Yn is a consistent estimator of θ. Hint. Pr (|Yn − θ| ≥ ε) ≤
E[(Yn − θ)²]/ε² and E[(Yn − θ)²] = [E(Yn − θ)]² + σ²(Yn). Why?

6.7. For each of the distributions in Exercise 6.1, find an estimator of θ
by the method of moments and show that it is consistent.
6.2 Measures of Quality of Estimators
Now it would seem that if y = u(xv x2 , ••• , xn) is to qualify as a
good point estimate of (J, there should be a great probability that the
statistic Y = u(Xl> X 2 , • • • , X n) will be close to 8; that is, 8 should
be a sort of rallying point for the numbers y = U(Xl> X 2, ••• , xn) . This
can be achieved in one way by selecting Y = u(XI , X 2 , " ., X n) in
such a way that not only is Y an unbiased estimator of θ but also the
variance of Y is as small as it can be made. We do this because the
variance of Y is a measure of the intensity of the concentration of the
probability for Y in the neighborhood of the point 8 = E(Y). Accord-
ingly, we define an unbiased minimum variance estimator of the param-
eter 8 in the following manner.
Definition 3. For a given positive integer n, Y = u(XI , X 2 , ••• , X n)
will be called an unbiased minimum variance estimator of the parameter
fJ if Y is unbiased, that is E( Y) = fJ, and if the variance of Y is less than
or equal to the variance of every other unbiased estimator of fJ.
For illustration, let Xl' X 2 , ••. , X g denote a random sample from
a distribution that is n(fJ, 1), -00 < 8 < 00. Since the statistic X =
(Xl + X 2 + ... + X g)/9 is n(fJ, t), X is an unbiased estimator of fJ.
The statistic Xl is n(fJ, 1), so Xl is also an unbiased estimator of fJ.
Although the variance t of X is less than the variance 1 of Xl, we
cannot say, with n = 9, that X is the unbiased minimum variance
estimator of fJ; that definition requires that the comparison be made
with every unbiased estimator of 8. To be sure, it is quite impossible to
tabulate all other unbiased estimators of this parameter fJ, so other
methods must be developed for making the comparisons of the variances.
A beginning on this problem will be made in Chapter 10.
Let us now discuss the problem of point estimation of a parameter
from a slightly different standpoint. Let Xl' X 2 , ••• , X n denote a
random sample of size n from a distribution that has the p.d.f. f(x; fJ),
fJ E O. The distribution may be either of the continuous or the discrete
type. Let Y = u(Xv X 2 , • • • , X n) be a statistic on which we wish to
base a point estimate of the parameter fJ. Let w(y) be that function of
the observed value of the statistic Y which is the point estimate of fJ.
Thus the function w decides the value of our point estimate of fJ and w
is called a decision function or a decision rule. One value of the decision
function, say w(y), is called a decision. Thus a numerically determined
point estimate of a parameter fJ is a decision. Now a decision may be
correct or it may be wrong. It would be useful to have a measure of the
seriousness of the difference, if any, between the true value of fJ and
the point estimate w(y). Accordingly, with each pair, [fJ, w(y)], fJ EO,
we associate a nonnegative number 2[8, w(y)] that reflects this
seriousness. We call the function 2 the loss junction. The expected
(mean). value of the loss function is called the risk junction. If g(y; 8),
8 EO, IS the p.d.f. of Y, the risk function R(8, w) is given by
R(8, w) = E{2[8, w(Y)]} = J~G02[8, w(y)]g(y; 8) dy
if Y is a random variable of the continuous type. It would be desirable
to select a decision function that minimizes the risk R(fJ, w) for all
values of 8, fJ E O. But this is usually impossible because the decision
function w that minimizes R(fJ, w) for one value of fJ may not minimize
R(fJ, w) for another value of fJ. Accordingly, we need either to restrict
our decision function to a certain class or to consider methods of order-
ing th~ risk func~ions. The following example, while very simple,
dramatizes these difficulties,
E~ample 1. Let Xl' X 2 , •• " X 25 be a random sample from a distribution
that IS n(e, 1), -00 < e < 00. Let Y = X, the mean of the random sample,
a~d let .p[e, w(y)] = [e - w(y)J2. We shall compare the two decision functions
g.lven by.wl(y) = Y and w2 (y) = 0 for -OCJ < Y < 00. The corresponding
nsk functions are
and
Obviously, if, in fact, e = 0, then w2 (y) = 0 is an excellent decision and we
have R(O, w2 ) = O. However, if e differs from zero by very much, it is
equally clear that W2(Y) = 0 is a poor decision. For example, if, in fact,
e = 2, R(2, w2) = 4 > R(2, WI) = .,}-s. In general, we see that R(e w ) <
R(e, ~l)' provided that -t < e < t and that otherwise R(e, w
2)
:::': R(e,2
w1).
That IS, one of these decision functions is better than the other for some
values of eand the other decision function is better for other values of e.
If, however, we had restricted our consideration to decision functions w
such that E[w(Y)] = θ for all values of θ, θ ∈ Ω, then the decision w₂(y) = 0
is not allowed. Under this restriction and with the given ℒ[θ, w(y)], the risk
function is the variance of the unbiased estimator w(Y), and we are con-
fronted with the problem of finding the unbiased minimum variance esti-
mator. In Chapter 10 we show that the solution is w(y) = y = x̄.

Suppose, however, that we do not want to restrict ourselves to decision
functions w such that E[w(Y)] = θ for all values of θ, θ ∈ Ω. Instead, let us
say that the decision function that minimizes the maximum of the risk
function is the best decision function. Because, in this example, R(θ, w₂) = θ²
is unbounded, w₂(y) = 0 is not, in accordance with this criterion, a good
decision function. On the other hand, with −∞ < θ < ∞, we have

max over θ of R(θ, w₁) = max over θ of (1/25) = 1/25.

Accordingly, w₁(y) = y = x̄ seems to be a very good decision in accordance
with this criterion because 1/25 is small. As a matter of fact, it can be proved
that w₁ is the best decision function, as measured by this minimax criterion,
when the loss function is ℒ[θ, w(y)] = [θ − w(y)]².
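The comparison of the two risk functions is easy to tabulate. The sketch below
evaluates R(θ, w₁) = 1/25 and R(θ, w₂) = θ² on an assumed grid of θ values
(the grid itself is an editorial choice) and reports the maxima and the region
where w₂ wins.

    import numpy as np

    theta = np.linspace(-3, 3, 601)              # illustrative grid of theta values
    risk_w1 = np.full_like(theta, 1 / 25)        # R(theta, w1) for w1(y) = y
    risk_w2 = theta**2                           # R(theta, w2) for w2(y) = 0
    print(risk_w1.max(), risk_w2.max())          # 0.04 versus 9 on this grid
    print(np.abs(theta[risk_w2 < risk_w1]).max())  # w2 is better only for |theta| < 1/5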
In this example we illustrated the following:
(a) Without some restriction on the decision function, it is difficult
to find a decision function that has a risk function which is uniformly
less than the risk function of another decision function.
(b) A principle of selecting a best decision function, called the
minimax principle. This principle may be stated as follows: If the
decision function given by wo(y) is such that, for all 0 E Q,
max R[O, wo(y)J ::; max R[O, w(y)J
8 8
for every other decision function w(y), then wo(y) is called a minimax
decision function.
With the restriction E[w(Y)J = 0 and the loss function.P[B, w(y)J =
[B - w(y)P, the decision function that minimizes the risk function
"yields an unbiased estimator with minimum variance. If, however, the
restriction E[w(Y)J = B is replaced by some other condition, the
decision function w(Y), if it exists, which minimizes E{[B - W(Y)J2}
uniformly in B is sometimes called the minimum mean-square-error
estimator. Exercises 6.13, 6.14, and 6.15 provide examples of this type
of estimator.
Another principle for selecting the decision function, which may be
called a best decision function, will be stated in Section 6.6.
EXERCISES
6.8. Show that the mean X̄ of a random sample of size n from a distri-
bution having p.d.f. f(x; θ) = (1/θ)e^(−x/θ), 0 < x < ∞, 0 < θ < ∞, zero else-
where, is an unbiased estimator of θ and has variance θ²/n.
6.9. Let Xl> X 2 , ••• , X n denote a random sample from a normal distribu-
n
tion with mean zero and variance 0, 0 < 0 < 00. Show that L XNn is an
1
unbiased estimator of 0 and has variance 202jn.
6.10. Let Y₁ < Y₂ < Y₃ be the order statistics of a random sample of
size 3 from the uniform distribution having p.d.f. f(x; θ) = 1/θ, 0 < x < θ,
0 < θ < ∞, zero elsewhere. Show that 4Y₁, 2Y₂, and (4/3)Y₃ are all unbiased
estimators of θ. Find the variance of each of these unbiased estimators.
6.11. Let YI and Y2 be two stochastically independent unbiased esti-
mators of O. Say the variance of YI is twice the variance of Y 2 . Find the
constants kI and k2 so that kI Y1 + k2 Y2 is an unbiased estimator with
smallest possible variance for such a linear combination.
6.12. In Example 1 of this section, take ℒ[θ, w(y)] = |θ − w(y)|. Show
that R(θ, w₁) = (1/5)√(2/π) and R(θ, w₂) = |θ|. Of these two decision functions
w₁ and w₂, which yields the smaller maximum risk?
6.13. Let Xl> X 2 , ••• , X; denote a random sample from a Poisson distri-
n
bution with parameter 0, 0 < 0 < 00. Let Y = L Xl and let .P[O,w(y)] =
1
[0 - w(y)]2. If we restrict our considerations to decision functions of the
form w(y) = b + yjn, where b does not depend upon y, show that R(O, w) =
b2 + Ojn. What decision function of this form yields a uniformly smaller risk
than every other decision function of this form? With this solution, say w
and 0 < 0 < 00, determine max R(O, w) if it exists.
8
6.14. Let Xl> X 2 , ••• , X n denote a random sample from a distribution
n
that is n(fL, 0), 0 < (} < 00, where fL is unknown. Let Y = L (Xl - X)2jn =
1
S2 and let .P[O,w(y)J = [0 - w(y)]2. If we consider decision functions of the
form w(y) = by, where b does not depend upon y, show that R(O, w) =
(02jn2)[(n2 - 1)b2 - 2n(n - l)b + n2]. Show that b = nj(n + 1) yields a
minimum risk for decision functions of this form. Note that nYj(n + 1) is
not an unbiased estimator of O. With w(y) = nyj(n + 1) and 0 < 0 < 00,
determine max R(O, w) if it exists.
8
6.15. Let Xl> X 2 , ••. , X; denote a random sample from a distribution
n
that is b(l, 0),0 ::; 0 ::; 1. Let Y = L X, and let .P[O, w(y)] = [0 - w(y)]2.
1
Consider decision functions of the form w(y) = by, where b does not depend
upon y. Prove that R(O, w) = b2nO(1 - 0) + (bn - 1)202. Show that
provided the value b is such that b2n ~ 2(bn - 1)2. Prove that b = 1jn does
not minimize max R(O, w).
8
6.3 Confidence Intervals for Means
Suppose we are willing to accept as a fact that the (numerical) out-
come X of a random experiment is a random variable that has a normal
distribution with known variance a2
but unknown mean p... That is, p.. is
some constant, but its value is unknown. To elicit some information
about p.., we decide to repeat the random experiment n independent
times, n being a fixed positive integer, and under identical conditions.
Let the random variables Xl> X 2 , • • • , X n denote, respectively, the
outcomes to be obtained on these n repetitions of the experiment. If our
assumptions are fulfilled, we then have under consideration a random
sample Xl> X 2 , ••• , X n from a distribution that is n(p.., a2
), a2
known.
Consider the maximum likelihood estimator of μ, namely μ̂ = X̄. Of
course, X̄ is n(μ, σ²/n) and (X̄ − μ)/(σ/√n) is n(0, 1). Thus

Pr [−2 < (X̄ − μ)/(σ/√n) < 2] = 0.954.

However, the events

−2 < (X̄ − μ)/(σ/√n) < 2,

−2σ/√n < X̄ − μ < 2σ/√n,

and

X̄ − 2σ/√n < μ < X̄ + 2σ/√n

are equivalent. Thus these events have the same probability. That is,

Pr (X̄ − 2σ/√n < μ < X̄ + 2σ/√n) = 0.954.
Since σ is a known number, each of the random variables X̄ − 2σ/√n
and X̄ + 2σ/√n is a statistic. The interval (X̄ − 2σ/√n, X̄ + 2σ/√n)
is a random interval. In this case, both end points of the interval are
statistics. The immediately preceding probability statement can be
read: Prior to the repeated independent performances of the random
experiment, the probability is 0.954 that the random interval
(X̄ − 2σ/√n, X̄ + 2σ/√n) includes the unknown fixed point
(parameter) μ.

Up to this point, only probability has been involved; the determina-
tion of the p.d.f. of X̄ and the determination of the random interval
were problems of probability. Now the problem becomes statistical.
Suppose the experiment yields X₁ = x₁, X₂ = x₂, ..., Xn = xn. Then
the sample value of X̄ is x̄ = (x₁ + x₂ + ··· + xn)/n, a known number.
Moreover, since σ is known, the interval (x̄ − 2σ/√n, x̄ + 2σ/√n) has
known end points. Obviously, we cannot say that 0.954 is the prob-
ability that the particular interval (x̄ − 2σ/√n, x̄ + 2σ/√n) includes
the parameter μ, for μ, although unknown, is some constant, and this
particular interval either does or does not include μ. However, the fact
that we had such a high degree of probability, prior to the performance
of the experiment, that the random interval (X̄ − 2σ/√n, X̄ + 2σ/√n)
includes the fixed point (parameter) μ leads us to have some reliance on
the particular interval (x̄ − 2σ/√n, x̄ + 2σ/√n). This reliance is re-
flected by calling the known interval (x̄ − 2σ/√n, x̄ + 2σ/√n) a 95.4
per cent confidence interval for μ. The number 0.954 is called the
confidence coefficient. The confidence coefficient is equal to the prob-
ability that the random interval includes the parameter. One may, of
course, obtain an 80 per cent, a 90 per cent, or a 99 per cent confidence
interval for μ by using 1.282, 1.645, or 2.576, respectively, instead of
the constant 2.

A statistical inference of this sort is an example of interval estimation
of a parameter. Note that the interval estimate of μ is found by taking
a good (here maximum likelihood) estimate x̄ of μ and adding and
subtracting twice the standard deviation of X̄, namely 2σ/√n, which
is small if n is large. If σ were not known, the end points of the random
interval would not be statistics. Although the probability statement
about the random interval remains valid, the sample data would not
yield an interval with known end points.
Example 1. If in the preceding discussion n = 40, σ² = 10, and x̄ = 7.164,
then (7.164 − 1.282√(10/40), 7.164 + 1.282√(10/40)), or (6.523, 7.805), is an
80 per cent confidence interval for μ. Thus we have an interval estimate of μ.
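The interval of Example 1 takes only a few lines to compute. The sketch
below simply reproduces that arithmetic.

    import numpy as np

    n, sigma2, xbar = 40, 10, 7.164             # the numbers of Example 1
    half_width = 1.282 * np.sqrt(sigma2 / n)    # 1.282 gives an 80 per cent interval
    print(xbar - half_width, xbar + half_width) # about (6.523, 7.805)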
In the next example we shall show how the central limit theorem
may be used to help us find an approximate confidence interval for p..
when our sample arises from a distribution that is not normal.
Example 2. Let X̄ denote the mean of a random sample of size 25 from
a distribution that has a moment-generating function, variance σ² = 100,
and mean μ. Since σ/√n = 2, then approximately

Pr [−1.96 < (X̄ − μ)/2 < 1.96] = 0.95,

or

Pr (X̄ − 3.92 < μ < X̄ + 3.92) = 0.95.

Let the observed mean of the sample be x̄ = 67.53. Accordingly, the interval
from x̄ − 3.92 = 63.61 to x̄ + 3.92 = 71.45 is an approximate 95 per cent
confidence interval for the mean μ.
Let us now turn to the problem of finding a confidence interval for
the mean μ of a normal distribution when we are not so fortunate as
to know the variance σ². In Section 4.8 we found that nS²/σ², where
S² is the variance of a random sample of size n from a distribution
that is n(μ, σ²), is χ²(n − 1). Thus we have √n(X̄ − μ)/σ to be n(0, 1),
nS²/σ² to be χ²(n − 1), and the two to be stochastically independent.
In Section 4.4 the random variable T was defined in terms of two such
random variables as these. In accordance with that section and the
foregoing results, we know that

T = [√n(X̄ − μ)/σ] / √(nS²/[σ²(n − 1)]) = √(n − 1)(X̄ − μ)/S

has a t distribution with n − 1 degrees of freedom, whatever the value
of σ² > 0. For a given positive integer n and a probability of 0.95, say,
we can find numbers a < b from Table IV in Appendix B, such that

Pr [a < √(n − 1)(X̄ − μ)/S < b] = 0.95.

Since the graph of the p.d.f. of the random variable T is symmetric
about the vertical axis through the origin, we would doubtless take
a = −b, b > 0. If the probability of this event is written (with a = −b)
in the form

Pr [X̄ − bS/√(n − 1) < μ < X̄ + bS/√(n − 1)] = 0.95,

then the interval [X̄ − bS/√(n − 1), X̄ + bS/√(n − 1)] is a random
interval having probability 0.95 of including the unknown fixed point
(parameter) μ. If the experimental values of X₁, X₂, ..., Xn are x₁, x₂,
..., xn with s² = Σ (xi − x̄)²/n, where x̄ = Σ xi/n, then the interval
[x̄ − bs/√(n − 1), x̄ + bs/√(n − 1)] is a 95 per cent confidence
interval for μ for every σ² > 0. Again this interval estimate of μ is
found by adding and subtracting a quantity, here bs/√(n − 1), to the
point estimate x̄.
Example 3. If in the preceding discussion n = 10, x̄ = 3.22, and s = 1.17,
then the interval [3.22 − (2.262)(1.17)/√9, 3.22 + (2.262)(1.17)/√9] or
(2.34, 4.10) is a 95 per cent confidence interval for μ.
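Example 3 is easily reproduced. In the sketch below, scipy.stats is an assumed
tool for obtaining the tabled value b = 2.262; note that the division is by
√(n − 1), because the text's s² uses the divisor n.

    import numpy as np
    from scipy.stats import t

    n, xbar, s = 10, 3.22, 1.17                  # the numbers of Example 3
    b = t.ppf(0.975, df=n - 1)                   # about 2.262 for 9 degrees of freedom
    half_width = b * s / np.sqrt(n - 1)
    print(round(xbar - half_width, 2), round(xbar + half_width, 2))   # about (2.34, 4.10)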
Remark. If one wishes to find a confidence interval for μ and if the
variance σ² of the nonnormal distribution is unknown (unlike Example 2 of
this section), he may with large samples proceed as follows. If certain weak
conditions are satisfied, then S², the variance of a random sample of size
n ≥ 2, converges stochastically to σ². Then in

[√n(X̄ − μ)/σ] / √(nS²/[(n − 1)σ²]) = √(n − 1)(X̄ − μ)/S

the numerator of the left-hand member has a limiting distribution that is
n(0, 1) and the denominator of that member converges stochastically to 1.
Thus √(n − 1)(X̄ − μ)/S has a limiting distribution that is n(0, 1). This fact
enables us to find approximate confidence intervals for μ when our con-
ditions are satisfied. A similar procedure can be followed in the next section
when seeking confidence intervals for the difference of the means of two
independent nonnormal distributions.
We shall now consider the problem of determining a confidence
interval for the unknown parameter p of a binomial distribution when
the parameter n is known. Let Y be b(n, p), where 0 < p < 1 and n is
known. Then p is the mean of Y/n. We shall use a result of Example 1,
Section 5.5, to find an approximate 95.4 per cent confidence interval
for the mean p. There we found that

Pr [−2 < (Y − np)/√(n(Y/n)(1 − Y/n)) < 2] = 0.954,

approximately. Since

(Y − np)/√(n(Y/n)(1 − Y/n)) = [(Y/n) − p] / √((Y/n)(1 − Y/n)/n),

the probability statement above can easily be written in the form

Pr [Y/n − 2√((Y/n)(1 − Y/n)/n) < p < Y/n + 2√((Y/n)(1 − Y/n)/n)] = 0.954,
approximately. Thus, for large n, if the experimental value of Y is y,
the interval

(y/n − 2√((y/n)(1 − y/n)/n), y/n + 2√((y/n)(1 − y/n)/n))

provides an approximate 95.4 per cent confidence interval for p.

A more complicated approximate 95.4 per cent confidence interval
can be obtained from the fact that Z = (Y − np)/√(np(1 − p)) has a
limiting distribution that is n(0, 1), and the fact that the event −2 <
Z < 2 is equivalent to the event

(1)   [Y + 2 − 2√(Y(n − Y)/n + 1)]/(n + 4) < p < [Y + 2 + 2√(Y(n − Y)/n + 1)]/(n + 4).

The first of these facts was established in Example 2, Section 5.4; the
proof of inequalities (1) is left as an exercise. Thus an experimental
value y of Y may be used in inequalities (1) to determine an approxi-
mate 95.4 per cent confidence interval for p.

If one wishes a 95 per cent confidence interval for p that does not
depend upon limiting distribution theory, he may use the following
approach. (This approach is quite general and can be used in other
instances.) Determine two increasing functions of p, say c₁(p) and c₂(p),
such that for each value of p we have, at least approximately,

Pr [c₁(p) < Y < c₂(p)] = 0.95.

The reason that this may be approximate is due to the fact that Y has
a distribution of the discrete type and thus it is, in general, impossible
to achieve the probability 0.95 exactly. With c₁(p) and c₂(p) increasing
functions, they have single-valued inverses, say d₁(y) and d₂(y),
respectively. Thus the events c₁(p) < Y < c₂(p) and d₂(Y) < p < d₁(Y)
are equivalent and we have, at least approximately,

Pr [d₂(Y) < p < d₁(Y)] = 0.95.

In the case of the binomial distribution, the functions c₁(p), c₂(p),
d₂(y), and d₁(y) cannot be found explicitly, but a number of books
provide tables of d₂(y) and d₁(y) for various values of n.

Example 4. If, in the preceding discussion, we take n = 100 and y = 20,
the first approximate 95.4 per cent confidence interval is given by

(0.2 − 2√((0.2)(0.8)/100), 0.2 + 2√((0.2)(0.8)/100)) or (0.12, 0.28).

The approximate 95.4 per cent confidence interval provided by inequalities (1) is

([22 − 2√(1600/100 + 1)]/104, [22 + 2√(1600/100 + 1)]/104)

or (0.13, 0.29). By referring to the appropriate tables found elsewhere, we
find that an approximate 95 per cent confidence interval has the limits
d₂(20) = 0.13 and d₁(20) = 0.29. Thus in this example we see that all three
methods yield results that are in substantial agreement.
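The two intervals of Example 4 that do not require tables are easy to compute.
The sketch below merely reproduces that arithmetic for n = 100 and y = 20.

    import numpy as np

    n, y = 100, 20
    phat = y / n

    # first interval of the text: phat plus or minus 2*sqrt(phat*(1 - phat)/n)
    half = 2 * np.sqrt(phat * (1 - phat) / n)
    print(round(phat - half, 2), round(phat + half, 2))   # about (0.12, 0.28)

    # interval obtained from inequalities (1)
    root = 2 * np.sqrt(y * (n - y) / n + 1)
    print(round((y + 2 - root) / (n + 4), 2),
          round((y + 2 + root) / (n + 4), 2))             # about (0.13, 0.29)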
Remark. The fact that the variance of Y/n is a function of p caused us
some difficulty in finding a confidence interval for p. Another way of handling
the problem is to try to find a function u(Y/n) of Y/n whose variance is
essentially free of p. Since Y/n converges stochastically to p, we can approxi-
mate u(Y/n) by the first two terms of its Taylor's expansion about p, namely
by

v(Y/n) = u(p) + (Y/n − p)u′(p).

Of course, v(Y/n) is a linear function of Y/n and thus also has an approximate
normal distribution; clearly, it has mean u(p) and variance

[u′(p)]² p(1 − p)/n.

But it is the latter that we want to be essentially free of p; thus we set it
equal to a constant, obtaining the differential equation

u′(p) = c/√(p(1 − p)).

A solution of this is

u(p) = (2c) arc sin √p.

If we take c = 1/2, we have, since u(Y/n) is approximately equal to v(Y/n),
that

u(Y/n) = arc sin √(Y/n).

This has an approximate normal distribution with mean arc sin √p and
variance 1/(4n). Hence we could find an approximate 95.4 per cent confidence
interval by using

Pr [−2 < (arc sin √(Y/n) − arc sin √p)/√(1/(4n)) < 2] = 0.954

and solving the inequalities for p.
6.22. Let Y be b(300, p). If the observed value of Y is y = 75, find an
approximate 90 per cent confidence interval for p.

6.23. Let X̄ be the mean of a random sample of size n from a distribution
that is n(μ, σ²), where the positive variance σ² is known. Use the fact that
N(2) − N(−2) = 0.954 to find, for each μ, c₁(μ) and c₂(μ) such that
Pr [c₁(μ) < X̄ < c₂(μ)] = 0.954. Note that c₁(μ) and c₂(μ) are increasing
functions of μ. Solve for the respective functions d₁(x̄) and d₂(x̄); thus we also
have that Pr [d₂(X̄) < μ < d₁(X̄)] = 0.954. Compare this with the answer
obtained previously in the text.
6.16. Let the observed value of the mean X of a random sample of size
20 from a distribution that is n(fL, 80) be 81.2. Find a 95 per cent confidence
interval for fL.
6.17. Let X be the mean of a random sample of size n from a distribution
that is n(fL, 9). Find n such that Pr (X - 1 < fL < X + 1) = 0.90, approxi-
mately.
6.18. Let a random sample of size 17 from the normal distribution
n(fL, (72) yield x = 4.7 and S2 = 5.76. Determine a 90 per cent confidence
interval for fL·
6.19. Let X denote the mean of a random sample of size n from a distri-
bution that has mean fL, variance 172 = 10, and a moment-generating
function. Find n so that the probability is approximately 0.954 that the
random interval (X - t, X + t) includes fL·
6.20. Let Xl> X 2
, ••• , X g be a random sample of size 9 from a distribu-
tion that is n(fL' (7
2).
(a) If a is known, find the length of a 95 per cent confidence interval for fL
if this interval is based on the random variable V9(X - fL)/a.
(b) If a is unknown, find the expected value of the length of a 95 per.cent
confidence interval for fL if this interval is based on the random vanable
V8(X - fL)/5.
(c) Compare these two answers. Hint. Write E(5) = (a/Vn)E[(n5
2/a2
)11
2J.
6.21. Let Xl> X 2
, ••• , X n
, X n +l be a random sample of size n + 1, n > 1,
n n _
from a distribution that is n(fL, (7
2). Let X = LXtin and 52 = L (X, - X)2/n.
1 1
Find the constant c so that the statistic c(X - Xn + l )/5 has a t distribution.
If n = 8, determine k such that Pr (X - k5 < Xg < X + k5) = 0.80. The
observed interval (x - ks, x + ks) is often called an 80 per cent prediction
interval for Xg •
(X - Y) - (fLl - f-t2)
Va2jn + a2jm
6.24. In the notation of the discussion of the confidence interval for p,
show that the event - 2 < Z < 2 is equivalent to inequalities (1). Hint.
First observe that - 2 < Z < 2 is equivalent to Z2 < 4, which can be
written as an inequality involving a quadratic expression in p.
6.25. Let X denote the mean of a random sample of size 25 from a
gamma-type distribution with a = 4 and f3 > O. Use the central limit
theorem to find an approximate 0.954 confidence interval for fL, the mean of
the gamma distribution. Hint. Base the confidence interval on the random
variable (X - 4(3)f(4f32f25)112 = 5Xf2f3 - 10.
219
6.4 Confidence Intervals for Differences of Means
The random variable T may also be used to obtain a confidence
interval for the difference μ1 - μ2 between the means of two inde-
pendent normal distributions, say n(μ1, σ²) and n(μ2, σ²), when the
distributions have the same, but unknown, variance σ².
Remark. Let X have a normal distribution with unknown parameters
μ1 and σ². A modification can be made in conducting the experiment so that
the variance of the distribution will remain the same but the mean of the
distribution will be changed; say, increased. After the modification has been
effected, let the random variable be denoted by Y, and let Y have a normal
distribution with unknown parameters μ2 and σ². Naturally, it is hoped that
μ2 is greater than μ1, that is, that μ1 - μ2 < 0. Accordingly, one seeks a
confidence interval for μ1 - μ2 in order to make a statistical inference.

A confidence interval for μ1 - μ2 may be obtained as follows: Let
X1, X2, ..., Xn and Y1, Y2, ..., Ym denote, respectively, independent
random samples from the two independent distributions having,
respectively, the probability density functions n(μ1, σ²) and n(μ2, σ²).
Denote the means of the samples by X̄ and Ȳ and the variances of the
samples by S1² and S2², respectively. It should be noted that these four
statistics are mutually stochastically independent. The stochastic
independence of X̄ and S1² (and, inferentially, that of Ȳ and S2²) was
established in Section 4.8; the assumption that X and Y have indepen-
dent distributions accounts for the stochastic independence of the
others. Thus X̄ and Ȳ are normally and stochastically independently
distributed with means μ1 and μ2 and variances σ²/n and σ²/m, respec-
tively. In accordance with Section 4.7, their difference X̄ - Ȳ is norm-
ally distributed with mean μ1 - μ2 and variance σ²/n + σ²/m. Then the
random variable

    [(X̄ - Ȳ) - (μ1 - μ2)]/√(σ²/n + σ²/m)
is normally distributed with zero mean and unit variance. This random
variable may serve as the numerator of a T random variable. Further,
nS1²/σ² and mS2²/σ² have stochastically independent chi-square distribu-
tions with n - 1 and m - 1 degrees of freedom, respectively, so that
their sum (nS1² + mS2²)/σ² has a chi-square distribution with n + m - 2
degrees of freedom, provided that n + m - 2 > 0. Because of the
mutual stochastic independence of X̄, Ȳ, S1², and S2², it is seen that

    √[(nS1² + mS2²)/(σ²(n + m - 2))]

may serve as the denominator of a T random variable. That is, the
random variable

    T = [(X̄ - Ȳ) - (μ1 - μ2)]/√{[(nS1² + mS2²)/(n + m - 2)](1/n + 1/m)}

has a t distribution with n + m - 2 degrees of freedom. As in the
previous section, we can (once n and m are specified positive integers
with n + m - 2 > 0) find a positive number b from Table IV of
Appendix B such that

    Pr (-b < T < b) = 0.95.
If we set

    R = √{[(nS1² + mS2²)/(n + m - 2)](1/n + 1/m)},

this probability may be written in the form

    Pr [(X̄ - Ȳ) - bR < μ1 - μ2 < (X̄ - Ȳ) + bR] = 0.95.

It follows that the random interval

    [(X̄ - Ȳ) - b√{[(nS1² + mS2²)/(n + m - 2)](1/n + 1/m)},
     (X̄ - Ȳ) + b√{[(nS1² + mS2²)/(n + m - 2)](1/n + 1/m)}]

has probability 0.95 of including the unknown fixed point (μ1 - μ2). As
usual, the experimental values of X̄, Ȳ, S1², and S2², namely x̄, ȳ, s1², and
s2², will provide a 95 per cent confidence interval for μ1 - μ2 when the
variances of the two independent normal distributions are unknown but
equal. A consideration of the difficulty encountered when the unknown
variances of the two independent normal distributions are not equal is
assigned to one of the exercises.

Example 1. It may be verified that if in the preceding discussion n = 10,
m = 7, x̄ = 4.2, ȳ = 3.4, s1² = 49, s2² = 32, then the interval (-5.16, 6.76)
is a 90 per cent confidence interval for μ1 - μ2.

Let Y1 and Y2 be two stochastically independent random variables
with binomial distributions b(n1, p1) and b(n2, p2), respectively. Let us
now turn to the problem of finding a confidence interval for the
difference p1 - p2 of the means of Y1/n1 and Y2/n2 when n1 and n2
are known. Since the mean and the variance of Y1/n1 - Y2/n2 are,
respectively, p1 - p2 and p1(1 - p1)/n1 + p2(1 - p2)/n2, the
random variable given by the ratio

    [(Y1/n1 - Y2/n2) - (p1 - p2)]/√[p1(1 - p1)/n1 + p2(1 - p2)/n2]

has mean zero and variance 1 for all positive integers n1 and n2.
Moreover, since both Y1 and Y2 have approximate normal distributions
for large n1 and n2, one suspects that the ratio has an approximate
normal distribution. This is actually the case, but it will not be proved
here. Moreover, if n1/n2 = c, where c is a fixed positive constant, the
result of Exercise 6.31 shows that the random variable

(1)    [(Y1/n1)(1 - Y1/n1)/n1 + (Y2/n2)(1 - Y2/n2)/n2] / [p1(1 - p1)/n1 + p2(1 - p2)/n2]

converges stochastically to 1 as n2 → ∞ (and thus n1 → ∞, since
n1/n2 = c, c > 0). In accordance with Theorem 6, Section 5.5, the
random variable

    W = [(Y1/n1 - Y2/n2) - (p1 - p2)]/U,

where

    U = √[(Y1/n1)(1 - Y1/n1)/n1 + (Y2/n2)(1 - Y2/n2)/n2],

has a limiting distribution that is n(0, 1). The event -2 < W < 2,
the probability of which is approximately equal to 0.954, is equivalent
to the event

    Y1/n1 - Y2/n2 - 2U < p1 - p2 < Y1/n1 - Y2/n2 + 2U.
Accordingly, the experimental values Yl and Y2 of Y l and Y 2 , respec-
tively, will provide an approximate 95.4 per cent confidence interval for
Pl - P2'
Example 2. If, in the preceding discussion, we take n1 = 100, n2 = 400,
y1 = 30, y2 = 80, then the experimental values of Y1/n1 - Y2/n2 and U
are 0.1 and √[(0.3)(0.7)/100 + (0.2)(0.8)/400] = 0.05, respectively. Thus the
interval (0, 0.2) is an approximate 95.4 per cent confidence interval for
p1 - p2.
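The arithmetic of Examples 1 and 2 can be reproduced with a short Python sketch (not part of the original text); the tabled constant b is taken from scipy.stats.t rather than Table IV of Appendix B:

```python
from math import sqrt
from scipy.stats import t

# Example 1: 90 per cent interval for mu1 - mu2 with a common unknown variance.
n, m = 10, 7
xbar, ybar, s1sq, s2sq = 4.2, 3.4, 49.0, 32.0
r = sqrt((n * s1sq + m * s2sq) / (n + m - 2) * (1 / n + 1 / m))
b = t.ppf(0.95, n + m - 2)                 # Pr(-b < T < b) = 0.90
print(xbar - ybar - b * r, xbar - ybar + b * r)   # about (-5.16, 6.76)

# Example 2: approximate 95.4 per cent interval for p1 - p2.
n1, n2, y1, y2 = 100, 400, 30, 80
p1hat, p2hat = y1 / n1, y2 / n2
u = sqrt(p1hat * (1 - p1hat) / n1 + p2hat * (1 - p2hat) / n2)
print(p1hat - p2hat - 2 * u, p1hat - p2hat + 2 * u)  # about (0, 0.2)
```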
EXERCISES
6.26. Let two independent random samples, each of size 10, from two
independent normal distributions n(ILv 0'2) and n(IL2,O'2) yield x = 4.8,
s~ = 8.64, fj = 5.6, s~ = 7.88. Find a 95 per cent confidence interval for
ILl - IL2'
6.27. Let two stochastically independent random variables Y1 and Y 2 ,
with binomial distributions that have parameters n1 = n2 = 100,Pv and P2'
respectively, be observed to be equal to Y1 = 50 and Y2 = 40. Determine an
approximate 90 per cent confidence interval for P1 - P2'
6.28. Discuss the problem of finding a confidence interval for the
difference ILl - IL2 between the two means of two independent normal
distributions if the variances O'~ and O'~ are known but not necessarily equal.
6.29. Discuss Exercise 6.28 when it is assumed that the variances are
unknown and unequal. This is a very difficult problem, and the discussion
should point out exactly where the difficulty lies. If, however, the variances
are unknown but their ratio O'Vo'~ is a known constant k, then a statistic that
is a T random variable can again be used. Why?
6.30. Let X and Y be the means of two independent random samples,
each of size n, from the respective distributions n(flo1' 0'2) and n(flo2,O'2),
where the common variance is known. Find n such that Pr (X - Y - 0'/5 <
flo1 - IL2 < X - Y + 0'/5) = 0.90.
6.31. Under the conditions given, show that the random variable defined
by ratio (1) of the text converges stochastically to 1.
6.5 Confidence Intervals for Variances

Let the random variable X be n(μ, σ²). We shall discuss the problem
of finding a confidence interval for σ². Our discussion will consist of two
parts: first, when μ is a known number, and second, when μ is unknown.

Let X1, X2, ..., Xn denote a random sample of size n from a
distribution that is n(μ, σ²), where μ is a known number. The maximum
likelihood estimator of σ² is Σ(Xi - μ)²/n, and the variable
Y = Σ(Xi - μ)²/σ² is χ²(n). Let us select a probability, say 0.95, and for
the fixed positive integer n determine values of a and b, a < b, from
Table II, such that

    Pr (a < Y < b) = 0.95.

Thus

    Pr [a < Σ(Xi - μ)²/σ² < b] = 0.95,

or

    Pr [Σ(Xi - μ)²/b < σ² < Σ(Xi - μ)²/a] = 0.95.

Since μ, a, and b are known constants, each of Σ(Xi - μ)²/b and
Σ(Xi - μ)²/a is a statistic. Moreover, the interval

    (Σ(Xi - μ)²/b, Σ(Xi - μ)²/a)

is a random interval having probability of 0.95 that it includes the
unknown fixed point (parameter) σ². Once the random experiment has
been performed, and it is found that X1 = x1, X2 = x2, ..., Xn = xn,
then the particular interval

    (Σ(xi - μ)²/b, Σ(xi - μ)²/a)

is a 95 per cent confidence interval for σ².

The reader will immediately observe that there are no unique
numbers a < b such that Pr (a < Y < b) = 0.95. A common method
of procedure is to find a and b such that Pr (Y < a) = 0.025 and
Pr (b < Y) = 0.025. That procedure will be followed in this book.
Example 1. If in the preceding discussion μ = 0, n = 10, and Σ xi² = 106.6
(summing over the 10 observations), then the interval (106.6/20.5, 106.6/3.25),
or (5.2, 32.8), is a 95 per cent confidence interval for the variance σ², since
Pr (Y < 3.25) = 0.025 and Pr (20.5 < Y) = 0.025, provided that Y has a
chi-square distribution with 10 degrees of freedom.
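A quick numerical check of Example 1 (an illustrative sketch, not part of the text), with the chi-square percentiles taken from scipy.stats rather than Table II:

```python
from scipy.stats import chi2

# mu = 0, n = 10, sum of (x - mu)^2 = 106.6; Y = sum/(sigma^2) is chi-square(10).
ss, df = 106.6, 10
a = chi2.ppf(0.025, df)   # about 3.25
b = chi2.ppf(0.975, df)   # about 20.5
print(ss / b, ss / a)     # about (5.2, 32.8)
```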
We now turn to the case in which μ is not known. This case can be
handled by making use of the facts that S² is the maximum likelihood
estimator of σ² and nS²/σ² is χ²(n - 1). For a fixed positive integer
n ≥ 2, we can find, from Table II, values of a and b, a < b, such that

    Pr (a < nS²/σ² < b) = 0.95.

Here, of course, we would find a and b by using a chi-square distribution
with n - 1 degrees of freedom. In accordance with the convention
previously adopted, we would select a and b so that

    Pr (nS²/σ² < a) = 0.025   and   Pr (nS²/σ² > b) = 0.025.

We then have

    Pr (nS²/b < σ² < nS²/a) = 0.95,

so that (nS²/b, nS²/a) is a random interval having probability 0.95 of
including the fixed but unknown point (parameter) σ². After the random
experiment has been performed and we find, say, X1 = x1, X2 = x2, ...,
Xn = xn, with s² = Σ (xi - x̄)²/n, we have, as a 95 per cent confidence
interval for σ², the interval (ns²/b, ns²/a).

Example 2. If, in the preceding discussion, we have n = 9, s² = 7.63,
then the interval [9(7.63)/17.5, 9(7.63)/2.18], or (3.92, 31.50), is a 95 per cent
confidence interval for the variance σ².
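Similarly, Example 2 can be checked with a brief sketch (not part of the text), using a chi-square distribution with n - 1 = 8 degrees of freedom:

```python
from scipy.stats import chi2

# n = 9, s^2 = 7.63; nS^2/sigma^2 is chi-square(n - 1).
n, s2 = 9, 7.63
a = chi2.ppf(0.025, n - 1)     # about 2.18
b = chi2.ppf(0.975, n - 1)     # about 17.5
print(n * s2 / b, n * s2 / a)  # about (3.92, 31.5)
```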
Next let X and Y denote stochastically independent random
variables that are n(μ1, σ1²) and n(μ2, σ2²), respectively. We shall deter-
mine a confidence interval for the ratio σ2²/σ1² when μ1 and μ2 are
unknown.

Remark. Consider a situation in which a random variable X has a
normal distribution with variance σ1². Although σ1² is not known, it is found
that the experimental values of X are quite widely dispersed, so that σ1²
must be fairly large. It is believed that a certain modification in conducting
the experiment may reduce the variance. After the modification has been
effected, let the random variable be denoted by Y, and let Y have a normal
distribution with variance σ2². Naturally, it is hoped that σ2² is less than σ1²,
that is, that σ2²/σ1² < 1. In order to make a statistical inference, we find a
confidence interval for the ratio σ2²/σ1².
Consider a random sample X1, X2, ..., Xn of size n ≥ 2 from the
distribution of X and a random sample Y1, Y2, ..., Ym of size m ≥ 2
from the independent distribution of Y. Here n and m may or may not
be equal. Let the means of the two samples be denoted by X̄ and Ȳ,
and the variances of the two samples by S1² = Σ (Xi - X̄)²/n and
S2² = Σ (Yi - Ȳ)²/m, respectively. The stochastically independent
random variables nS1²/σ1² and mS2²/σ2² have chi-square distributions
with n - 1 and m - 1 degrees of freedom, respectively. In Section 4.4 a
random variable called F was defined, and through the change-of-
variable technique the p.d.f. of F was obtained. If nS1²/σ1² is divided by
n - 1, the number of degrees of freedom, and if mS2²/σ2² is divided
by m - 1, then, by definition of an F random variable, we have that

    F = {nS1²/[σ1²(n - 1)]} / {mS2²/[σ2²(m - 1)]}

has an F distribution with parameters n - 1 and m - 1. For numeri-
cally given values of n and m and with a preassigned probability, say
0.95, we can determine from Table V of Appendix B, in accordance
with our convention, numbers 0 < a < b such that

    Pr [a < {nS1²/[σ1²(n - 1)]} / {mS2²/[σ2²(m - 1)]} < b] = 0.95.

If the probability of this event is written in the form

    Pr [a {mS2²/(m - 1)}/{nS1²/(n - 1)} < σ2²/σ1² < b {mS2²/(m - 1)}/{nS1²/(n - 1)}] = 0.95,

it is seen that the interval

    [a {mS2²/(m - 1)}/{nS1²/(n - 1)}, b {mS2²/(m - 1)}/{nS1²/(n - 1)}]

is a random interval having probability 0.95 of including the fixed but
unknown point σ2²/σ1². If the experimental values of X1, X2, ..., Xn and
of Y1, Y2, ..., Ym are denoted by x1, x2, ..., xn and y1, y2, ..., ym,
respectively, and if ns1² = Σ (xi - x̄)² and ms2² = Σ (yi - ȳ)², then the
interval with known end points, namely

    [a {ms2²/(m - 1)}/{ns1²/(n - 1)}, b {ms2²/(m - 1)}/{ns1²/(n - 1)}],

is a 95 per cent confidence interval for the ratio σ2²/σ1² of the two un-
known variances.
Example 3. If in the preceding discussion n = 10, m = 5, s1² = 20.0,
s2² = 35.6, then the interval

    [(1/4.72) {5(35.6)/4}/{10(20.0)/9}, (8.90) {5(35.6)/4}/{10(20.0)/9}],

or (0.4, 17.8), is a 95 per cent confidence interval for σ2²/σ1².
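The end points of Example 3 can be verified numerically; the following sketch (not part of the text) takes the F percentiles from scipy.stats rather than Table V:

```python
from scipy.stats import f

# n = 10, m = 5, s1^2 = 20.0, s2^2 = 35.6.
n, m, s1sq, s2sq = 10, 5, 20.0, 35.6
ratio = (m * s2sq / (m - 1)) / (n * s1sq / (n - 1))   # about 2.00
a = f.ppf(0.025, n - 1, m - 1)    # about 1/4.72
b = f.ppf(0.975, n - 1, m - 1)    # about 8.90
print(a * ratio, b * ratio)       # about (0.4, 17.8)
```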
EXERCISES
6.32. If 8.6, 7.9, 8.3, 6.4, 8.4,9.8,7.2,7.8,7.5 are the observed values of a
random sample of size 9 from a distribution that is n(8, a2
) , construct a
90 per cent confidence interval for a
2
.
6.33. Let Xl' X 2
, ••• , X; be a random sample from the distribution
n(/L, a2). Let 0 < a < b. Show that the mathematical expectation of the
length of the random interval [~(X, - /L)2/b, ~ (Xl - /L)2/a] is (b - a) x
(na2/ab).
6.34. A random sample of size 15 from the normal distribution n(/L'a
2)
yields x = 3.2 and S2 = 4.24. Determine a 90 per cent confidence interval
for a2
•
6.35. Let S² be the variance of a random sample of size n taken from a
distribution that is n(μ, σ²), where μ and σ² are unknown. Let g(z) be the p.d.f.
of Z = nS²/σ², which is χ²(n - 1). Let a and b be such that the observed
interval (ns²/b, ns²/a) is a 95 per cent confidence interval for σ². If its length
ns²(b - a)/ab is to be a minimum, show that a and b must satisfy the con-
dition that a²g(a) = b²g(b). Hint. If G(z) is the distribution function of Z,
then differentiate both G(b) - G(a) = 0.95 and (b - a)/ab with respect to
b, recalling that, from the first equation, a must be a function of b. Then
equate the latter derivative to zero.
6.36. Let two independent random samples of sizes n = 16 and m = 10,
taken from two independent normal distributions n(/Ll' ai) a.nd n(/L2' a~),
respectively, yield x = 3.6, sf = 4.14, fj = 13.6, s~ = 7.26. Fmd a 90 per
cent confidence interval for aVai when /Ll and /L2 are unknown.
6.37. Discuss the problem of finding a confidence interval for the ratio
a~/af of the two unknown variances of two independent normal distributions
if the means /Ll and /L2 are known.
6.38. Let Xl> X 2 , ••• , X 6 be a random sample of size 6 from a gamma
distribution with parameters a = 1 and unknown f3 > O. Discuss the
construction of a 98 per cent confidence interval for f3. Hint. What is the
6
distribution of 2 L Xtlf3?
1
6.39. Let Sfand S~ denote, respectively, the variances of random samples,
of sizes nand m, from two independent distributions that are n(/Ll> a2)
and n(/L2' a2). Use the fact that (nSf + mS~)/a2 is X2(n + m - 2) to find a
confidence interval for the common unknown variance a2
•
6.40. Let Y4 be the nth order statistic of a random sample, n = 4, from
a continuous-type uniform distribution on the interval (0, θ). Let 0 < c1 <
c2 ≤ 1 be selected so that Pr (c1θ < Y4 < c2θ) = 0.95. Verify that c1 =
(0.05)^(1/4) and c2 = 1 satisfy these conditions. What, then, is a 95 per cent
confidence interval for θ?
6.6 Bayesian Estimates
In Sections 6.3, 6.4, and 6.5 we constructed two statistics, say U and
V,. U < V, such that we have a preassigned probability p that the
random interval (U, V) contains a fixed but unknown point (parameter).
We then adopted this principle: Use the experimental results to com-
pute the values of U and V, say u and v; then call the interval (u, v) a
lOOp per cent confidence interval for the parameter. Adoption of this
principle provided us with one method of interval estimation. This
method of interval estimation is widely used in statistical literature and
in the applications. But it is important for us to understand that other
principles can be adopted. The student should constantly keep in mind
that as long as he is working with probability, he is in the realm of
mathematics; but once he begins to make inferences or to draw con-
clusions about a random experiment, which inferences are based upon
experimental data, he is in the field of statistics.
We shall now describe another approach to the problem of interval
estimation. This approach takes into account any prior knowledge of
the experiment that the statistician has and it is one application of a
principle of statistical inference that may be called Bayesian statistics.
Consider a random variable X that has a distribution of probability
that depends upon the symbol B, where eis an element of a well-
defined set Q. For example, if the symbol eis the mean of a normal
distribution, Q may be the real line. We have previously looked upon
Bas being some constant, although an unknown constant. Let us now
introduce a random variable 0 that has a distribution of probability
over the set Q; and, just as we look upon x as a possible value of the
random variable X, we now look upon B as a possible value of the
random variable 0. Thus the distribution of X depends upon B, a
random determination of the random variable 0. We shall denote the
p.d.f. of 0 by h(B) and we take h(B) = 0 when Bis not an element of Q.
Let Xl> X 2 , ••• , Xn denote a random sample from this distribution of
X and let Y denote a statistic that is a function of Xl> X 2 , ••• , X n•
We can find the p.d.f. of Y for every given B; that is, we can find the
conditional p.d.f. of Y, given 0 = B, which we denote by g(yIB). Thus
the joint p.d.f. of Y and 0 is given by
    k(y, θ) = h(θ)g(y|θ).
p.d.f. k(Bly) are known. Now, in general, how would we predict an
experimental value of any random variable, say W, if we want our
prediction to be "reasonably close" to the value to be observed? Many
statisticians would predict the mean, E(W), of the distribution of W;
others would predict a median (perhaps unique) of the distribution of
W; some would predict a mode (perhaps unique) of the distribution of
W; and some would have other predictions. However, it seems desirable
that the choice of the decision function should depend upon the loss
function 2[B, w(y)]. One way in which this dependence upon the loss
function can be reflected is to select the decision function w in such a
way that the conditional expectation of the loss is a minimum. A
Bayes' solution is a decision function w that minimizes
    E{ℒ[Θ, w(y)]|Y = y} = ∫₋∞^∞ ℒ[θ, w(y)]k(θ|y) dθ,
If Θ is a random variable of the continuous type, the marginal p.d.f. of
Y is given by

    k1(y) = ∫₋∞^∞ h(θ)g(y|θ) dθ.

If Θ is a random variable of the discrete type, integration would be
replaced by summation. In either case the conditional p.d.f. of Θ,
given Y = y, is

    k(θ|y) = h(θ)g(y|θ)/k1(y).
This relationship is one form of Bayes' formula (see Exercise 2.7,
Section 2.1).
In Bayesian statistics, the p.d.f. h(B) is called the prior P.d.f. of 0,
and the conditional p.d.f. k(8Iy) is called the posterior p.d.f. of 0. This
is because h(B) is the p.d.f. of 0 prior to the observation of Y, whereas
k(Bly) is the p.d.f. of 0 after the observation of Y has been made. In
many instances, h(B) is not known; yet the choice of h(B) affects the
p.d.f. k(8Iy). In these instances the statistician takes into account all
prior knowledge of the experiment and assigns the prior p.d.f. h(B).
This, of course, injects the problem of personal or subjective probability
(see the Remark, Section 1.1).
Suppose that we want a point estimate of B. From the Bayesian
viewpoint, this really amounts to selecting a decision function w so that
w(y) is a predicted value of B (an experimental value of the random
variable 0) when both the computed value y and the conditional
if 0 is a random variable of the continuous type. The usual modification
of the right-hand member of this equation is made for random variables
of the discrete type. If, for example, the loss function is given by
2[B, w(y)] = [B - W(y)]2, the Bayes' solution is given by w(y) =
E(0IY), the mean of the conditional distribution of 0, given Y = y.
This follows from the fact (Exercise 1.91) that E[(W - b)2], if it exists,
is .a minimum when b = E(W). If the loss function is given by
2[B, w(y)] = IB - w(y)J, then a median of the conditional distribution
of 0, given Y = y, is the Bayes' solution. This follows from the fact
(Exercise 1.81) that E(IW - bi), if it exists, is a minimum when b is
equal to any median of the distribution of W.
The conditional expectation of the loss, given Y = y, defines a
random variable that is a function of the statistic Y. The expected
value of that function of Y, in the notation of this section, is given by
    ∫₋∞^∞ {∫₋∞^∞ ℒ[θ, w(y)]k(θ|y) dθ}k1(y) dy = ∫₋∞^∞ {∫₋∞^∞ ℒ[θ, w(y)]g(y|θ) dy}h(θ) dθ,
in the continuous case. The integral within the braces in the latter
expression is, for every given BE Q, the risk function R(B, w); accord-
ingly, the latter expression is the mean value of the risk, or the expected
risk. Because a Bayes' solution minimizes
    ∫₋∞^∞ ℒ[θ, w(y)]k(θ|y) dθ
for every y for which k1 (y) > 0, it is evident that a Bayes' solution
w(y) minimizes this mean value of the risk. We now give an illustrative
example.

Example 1. Let X1, X2, ..., Xn denote a random sample from a
distribution that is b(1, θ), 0 < θ < 1. We seek a decision function w that
is a Bayes' solution. If Y = Σ Xi, then Y is b(n, θ). That is, the conditional
p.d.f. of Y, given Θ = θ, is

    g(y|θ) = (n choose y)θ^y(1 - θ)^(n-y),   y = 0, 1, ..., n,
           = 0 elsewhere.

We take the prior p.d.f. of the random variable Θ to be

    h(θ) = [Γ(α + β)/(Γ(α)Γ(β))]θ^(α-1)(1 - θ)^(β-1),   0 < θ < 1,
         = 0 elsewhere,

where α and β are assigned positive constants. Thus the joint p.d.f. of Y and
Θ is given by g(y|θ)h(θ) and the marginal p.d.f. of Y is

    k1(y) = ∫₀¹ h(θ)g(y|θ) dθ
          = (n choose y)[Γ(α + β)/(Γ(α)Γ(β))] ∫₀¹ θ^(y+α-1)(1 - θ)^(n-y+β-1) dθ
          = (n choose y)[Γ(α + β)Γ(α + y)Γ(n + β - y)]/[Γ(α)Γ(β)Γ(n + α + β)],   y = 0, 1, ..., n,
          = 0 elsewhere.

Finally, the conditional p.d.f. of Θ, given Y = y, is, at points of positive
probability density,

    k(θ|y) = g(y|θ)h(θ)/k1(y)
           = [Γ(n + α + β)/(Γ(α + y)Γ(n + β - y))]θ^(α+y-1)(1 - θ)^(β+n-y-1),   0 < θ < 1,

and y = 0, 1, ..., n. We take the loss function to be ℒ[θ, w(y)] = [θ - w(y)]².
Because Y is a random variable of the discrete type, whereas Θ is of the
continuous type, we have for the expected risk,

    ∫₀¹ {Σ_(y=0)^n [θ - w(y)]²(n choose y)θ^y(1 - θ)^(n-y)}h(θ) dθ
        = Σ_(y=0)^n {∫₀¹ [θ - w(y)]²k(θ|y) dθ}k1(y).

The Bayes' solution w(y) is the mean of the conditional distribution of Θ,
given Y = y. Thus

    w(y) = ∫₀¹ θk(θ|y) dθ
         = [Γ(n + α + β)/(Γ(α + y)Γ(n + β - y))] ∫₀¹ θ^(α+y)(1 - θ)^(β+n-y-1) dθ
         = (α + y)/(α + β + n).

This decision function w(y) minimizes

    ∫₀¹ [θ - w(y)]²k(θ|y) dθ

for y = 0, 1, ..., n and, accordingly, it minimizes the expected risk. It is
very instructive to note that this Bayes' solution can be written as

    w(y) = [n/(α + β + n)](y/n) + [(α + β)/(α + β + n)][α/(α + β)],

which is a weighted average of the maximum likelihood estimate y/n of θ
and the mean α/(α + β) of the prior p.d.f. of the parameter. Moreover, the
respective weights are n/(α + β + n) and (α + β)/(α + β + n). Thus we
see that α and β should be selected so that not only is α/(α + β) the desired
prior mean, but the sum α + β indicates the worth of the prior opinion,
relative to a sample of size n. That is, if we want our prior opinion to have as
much weight as a sample size of 20, we would take α + β = 20. So if our
prior mean is 3/4, we have that α and β are selected so that α = 15 and β = 5.

In Example 1 it is extremely convenient to notice that it is not
really necessary to determine k1(y) to find k(θ|y). If we divide g(y|θ)h(θ)
by k1(y) we must get the product of a factor, which depends upon y but
does not depend upon θ, say c(y), and

    θ^(y+α-1)(1 - θ)^(n-y+β-1).

That is,

    k(θ|y) = c(y)θ^(y+α-1)(1 - θ)^(n-y+β-1),   0 < θ < 1,

and y = 0, 1, ..., n. However, c(y) must be that "constant" needed to
make k(θ|y) a p.d.f., namely

    c(y) = Γ(n + α + β)/[Γ(y + α)Γ(n - y + β)].

Accordingly, Bayesian statisticians frequently write that k(θ|y) is pro-
portional to g(y|θ)h(θ); that is,

    k(θ|y) ∝ g(y|θ)h(θ).
Then to actually form the p.d.f. k(0ly), they simply find a "constant,"
which is some function of y, so that the expression integrates to 1. This
is now illustrated.
Example 2. Suppose that Y = X̄ is the mean of a random sample of
size n that arises from the normal distribution n(θ, σ²), where σ² is known.
Then g(y|θ) is n(θ, σ²/n). Further suppose that we are able to assign prior
knowledge to θ through a prior p.d.f. h(θ) that is n(θ0, σ0²). Then we have that

    k(θ|y) ∝ [1/(√(2π)σ/√n)][1/(√(2π)σ0)] exp{-(y - θ)²/(2σ²/n) - (θ - θ0)²/(2σ0²)}.

If we eliminate all constant factors (including factors involving y only), we
have

    k(θ|y) ∝ exp{-[(σ0² + σ²/n)θ² - 2(yσ0² + θ0σ²/n)θ]/[2(σ²/n)σ0²]}.

This can be simplified, by completing the square, to read (after eliminating
factors not involving θ)

    k(θ|y) ∝ exp{-[θ - (yσ0² + θ0σ²/n)/(σ0² + σ²/n)]²/[2(σ²/n)σ0²/(σ0² + σ²/n)]}.

That is, the posterior p.d.f. of the parameter is obviously normal with mean

    (yσ0² + θ0σ²/n)/(σ0² + σ²/n)

and variance (σ²/n)σ0²/(σ0² + σ²/n). If the square-error loss function is used,
this posterior mean is the Bayes' solution. Again, note that it is a weighted
average of the maximum likelihood estimate y = x̄ and the prior mean θ0.
Observe here and in Example 1 that the Bayes' solution gets closer to the
maximum likelihood estimate as n increases. Thus the Bayesian procedures
permit the decision maker to enter his or her prior opinions into the solution
in a very formal way such that the influences of those prior notions will be
less and less as n increases.
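A brief sketch (not part of the text) of the Bayes estimate in Example 1. The sample values n = 20 and y = 8 below are hypothetical, chosen only to illustrate the weighted-average form of w(y) with the prior α = 15, β = 5 mentioned above; the posterior is a beta distribution with parameters α + y and β + n - y.

```python
from scipy.stats import beta

# Beta prior with alpha = 15, beta = 5: prior mean 3/4 and "worth" alpha + beta = 20.
# The sample size n = 20 and observed y = 8 successes are hypothetical.
a0, b0, n, y = 15, 5, 20, 8

w = (a0 + y) / (a0 + b0 + n)     # Bayes estimate under squared-error loss
weighted = (n / (a0 + b0 + n)) * (y / n) + ((a0 + b0) / (a0 + b0 + n)) * (a0 / (a0 + b0))
print(w, weighted)                # both 0.575: the weighted-average form checks out

# The same number is the mean of the posterior distribution beta(a0 + y, b0 + n - y).
print(beta.mean(a0 + y, b0 + n - y))
```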
In Bayesian statistics all the information is contained in the
posterior p.d.f. k(Oly). In Examples 1 and 2 we found Bayesian point
estimates using the square-error loss function. It should be noted that
if 2[w(y), 0] = Iw(y) - 01. the absolute value of the error, then the
Bayes' solution would be the median of the posterior distribution of the
parameter, which is given by k(Oly). Hence the Bayes' solution changes,
as it should, with different loss functions.
If an interval estimate of θ is desired, we can now find two functions
u(y) and v(y) so that the conditional probability

    Pr [u(y) < Θ < v(y)|Y = y] = ∫_(u(y))^(v(y)) k(θ|y) dθ

is large, say 0.95. The experimental values of X1, X2, ..., Xn, say
x1, x2, ..., xn, provide us with an experimental value of Y, say y.
Then the interval u(y) to v(y) is an interval estimate of θ in the sense
that the conditional probability of Θ belonging to that interval is
equal to 0.95. For illustration, in Example 2 where the posterior p.d.f.
of the parameter was normal, the interval whose end points are found
by taking the mean of that distribution and adding and subtracting
1.96 of its standard deviation,

    (yσ0² + θ0σ²/n)/(σ0² + σ²/n) ± 1.96 √[(σ²/n)σ0²/(σ0² + σ²/n)],

serves as an interval estimate for θ with posterior probability of 0.95.
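The posterior interval just displayed can be evaluated with a short sketch (not part of the text); every numerical value below is hypothetical and serves only to illustrate the computation.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical values: known sigma, prior n(theta0, sigma0^2), sample size n,
# and observed sample mean y.
sigma, n, theta0, sigma0, y = 10.0, 25, 75.0, 4.0, 78.0

post_mean = (y * sigma0**2 + theta0 * sigma**2 / n) / (sigma0**2 + sigma**2 / n)
post_sd = sqrt((sigma**2 / n) * sigma0**2 / (sigma0**2 + sigma**2 / n))

lo, hi = post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd
print(lo, hi)
# The posterior probability assigned to (lo, hi) is 0.95:
print(norm.cdf(hi, post_mean, post_sd) - norm.cdf(lo, post_mean, post_sd))
```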
Finally, it should be noted that in Bayesian statistics it is really
better to begin with the sample items X1, X2, ..., Xn rather than some
statistic Y. We used the latter approach for convenience of notation.
If X1, X2, ..., Xn are used, then in our discussion, replace g(y|θ) by
f(x1|θ)f(x2|θ)···f(xn|θ) and k(θ|y) by k(θ|x1, x2, ..., xn). Thus we find
that

    k(θ|x1, x2, ..., xn) ∝ h(θ)f(x1|θ)f(x2|θ)···f(xn|θ).

If the statistic Y is chosen correctly (namely, as a sufficient statistic,
as explained in Chapter 10), we find that

    k(θ|x1, x2, ..., xn) = k(θ|y).
This is illustrated by Exercise 6.44. Of course, these Bayesian pro-
cedures can easily be extended to the case of several parameters, as
demonstrated by Exercise 6.45.
EXERCISES
6.41. Let Xv X 2 , ••• , X; denote a random sample from a distribution
that is n(e, ( 2
) , -00 < 0 < 00, where a2
is a given positive number. Let
Y = X, the mean of the random sample. Take the loss function to be
2[0, w(y)J = 10 - w(y)j. If 0 is an observed value of the random variable 0
that is n(p" r 2
) , where r 2
> 0 and p, are known numbers, find the Bayes'
solution w(y) for a point estimate of e.
6.42. Let X1, X2, ..., Xn denote a random sample from a Poisson
distribution with mean θ, 0 < θ < ∞. Let Y = Σ Xi and take the loss
function to be ℒ[θ, w(y)] = [θ - w(y)]². Let θ be an observed value of the
random variable Θ. If Θ has the p.d.f. h(θ) = θ^(α-1)e^(-θ/β)/[Γ(α)β^α],
0 < θ < ∞, zero elsewhere, where α > 0, β > 0 are known numbers, find
the Bayes' solution w(y) for a point estimate of θ.

6.43. Let Yn be the nth order statistic of a random sample of size n from
a distribution with p.d.f. f(x|θ) = 1/θ, 0 < x < θ, zero elsewhere. Take the
loss function to be ℒ[θ, w(yn)] = [θ - w(yn)]². Let θ be an observed value of
the random variable Θ, which has p.d.f. h(θ) = βa^β/θ^(β+1), a < θ < ∞, zero
elsewhere, with a > 0, β > 0. Find the Bayes' solution w(yn) for a point
estimate of θ.
6.44. Let Xl, X 2 , ••• , X; be a random sample from a distribution that
is b(l, B). Let the prior p.d.f. of 0 be a beta one with parameters a and 13.
Show that the posterior p.d.f. k(BIXl> x2, ... , xn) is exactly the same as
k(B[y) given in Example 1. This demonstrates that we get exactly the same
result whether we begin with the statistic Y or with the sample items. Hint.
Note that k(Blxl , x2, ... , xn) is proportional to the product of the joint p.d.f.
of Xl> X 2 , ••• , X n and the prior p.d.f. of B.
6.45. Let YI and Y2 be statistics that have a trinomial distribution with
parameters n, BI , and B2 . Here BI and B2 are observed values of the random
variables 01 and O2 , which have a Dirichlet distribution with known param-
eters aI' (X2' and (X3 (see Example 1, Section 4.5). Show that the conditional
distribution of 01 and O2 is Dirichlet and determine the conditional means
E(01 IYI' Y2) and E(02jYI' Y2)·
6.46. Let X be n(O, l/B). Assume that the unknown B is a value of a
random variable 0 which has a gamma distribution with parameters
(X = r/2 and f3 = 2/r, where r is a positive integer. Show that X has a marginal
t distribution with r degrees of freedom. This procedure is called one of
compounding, and it may be used by a Bayesian statistician as a way of first
presenting the t distribution, as well as other distributions.
6.47. Let X have a Poisson distribution with parameter 8. Assume that
the unknown Bis a value of a random variable 0 that has a gamma distri-
bution with parameters (X = rand f3 = (1 - P)/P, where r is a positive
integer and 0 < p < 1. Show, by the procedure of compounding, that X has
a marginal distribution which is negative binomial, a distribution that was
introduced earlier (Section 3.1) under very different assumptions.
6.48. In Example llet n = 30, (X = 10, and f3 = 5so that w(y) = (10 + y)j45
is the Bayes' estimate of 8.
(a) If Y has the binomial distribution b(30, B), compute the risk
E{[B - W(Y)]2}.
(b) Determine those values of Bfor which the risk of part (a) is less than
B(l - B)/30, the risk associated with the maximum likelihood estimator
Yin of B.
Chapter 7
Statistical Hypotheses
7.1 Some Examples and Definitions
The two principal areas of statistical inference are the areas of
estimation of parameters and of tests of statistical hypotheses. The
problem of estimation of parameters, both point and interval estima-
tion, has been treated. In this chapter some aspects of statistical
hypotheses and tests of statistical hypotheses will be considered. The
subject will be introduced by way of example.
Example 1. Let it be known that the outcome X of a random experiment
is n(B, 100). For instance, X may denote a score on a test, which score we
assume to be normally distributed with mean Band variance 100. Let us say
that past experience with this random experiment indicates that B = 75.
Suppose, owing possibly to some research in the area pertaining to this
experiment, some changes are made in the method of performing this random
experiment. It is then suspected that no longer does B = 75 but that now
B > 75. There is as yet no formal experimental evidence that B > 75; hence
the statement B > 75 is a conjecture or a statistical hypothesis. In admitting
that the statistical hypothesis B > 75 may be false, we allow, in effect, the
possibility that B :s; 75. Thus there are actually two statistical hypotheses.
First, that the unknown parameter B :s; 75; that is, there has been no increase
in B. Second, that the unknown parameter () > 75. Accordingly, the param-
eter space is Q = {(); -co < () < co}. We denote the first of these hypotheses
by the symbols Ho: B :s; 75 and the second by the symbols HI: B > 75. Since
the values B > 75 are alternatives to those where () :s; 75, the hypothesis
HI: B > 75 is called the alternative hypothesis. Needless to say, Ho could be
called the alternative HI; however, the conjecture, here () > 75, that is
function of Test 1, and the value of the power function at a parameter point
is called the power of Test 1 at that point. Because X̄ is n(θ, 4), we have

    K1(θ) = Pr [(X̄ - θ)/2 > (75 - θ)/2] = 1 - N[(75 - θ)/2];

the power function of Test 2, defined below, is

    K2(θ) = Pr (X̄ > 78) = 1 - N[(78 - θ)/2].
FIGURE 7.1. Graph of the power function K1(θ) of Test 1.
Some values of the power function of Test 2 are K 2(73) = 0.006, K 2(75) =
0.067, K 2(77) = 0.309, and K 2(79) = 0.691. That is, if 8 = 75, the proba-
bility of rejecting H o: 8 ~ 75 is 0.067; this is much more desirable than the
corresponding probability 1/2 that resulted from Test 1. However, if H0 is
false and, in fact, 8 = 77, the probability of rejecting H o: 8 ~ 75 (and hence
of accepting HI: 8 > 75) is only 0.309. In certain instances, this low prob-
ability 0.309 of a correct decision (the acceptance of HI when HI is true) is
objectionable. That is, Test 2 is not wholly satisfactory. Perhaps we can
overcome the undesirable features of Tests 1 and 2 if we proceed as in Test 3.
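The power functions K1(θ) and K2(θ) of Tests 1 and 2 can be evaluated directly (an illustrative sketch, not part of the text), with N(·) computed from scipy.stats.norm rather than Table III:

```python
from scipy.stats import norm

# X-bar is n(theta, 100/25) = n(theta, 4), so its standard deviation is 2.
def K1(theta):                      # Test 1: reject when x-bar > 75
    return 1 - norm.cdf((75 - theta) / 2)

def K2(theta):                      # Test 2: reject when x-bar > 78
    return 1 - norm.cdf((78 - theta) / 2)

for theta in (73, 75, 77, 79):
    print(theta, round(K1(theta), 3), round(K2(theta), 3))
# K1: 0.159, 0.500, 0.841, 0.977;  K2: 0.006, 0.067, 0.309, 0.691
```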
So, for illustration, we have, by Table III of Appendix B, the power at
8 = 75 to be K I(75) = 0.500. Other powers are K I(73) = 0.159, K I(77) =
0.841, and K I(79) = 0.977. The graph of K1(8) of Test 1 is depicted in Figure
7.1. Among other things, this means that, if 8 = 75, the probability of
rejecting the hypothesis H0: θ ≤ 75 is 1/2. That is, if θ = 75 so that H0 is
true, the probability of rejecting this true hypothesis H0 is 1/2. Many statisti-
cians and research workers find it very undesirable to have such a high
probability as 1/2 assigned to this kind of mistake: namely the rejection of
H o when Ho is a true hypothesis. Thus Test 1 does not appear to be a very
satisfactory test. Let us try to devise another test that does not have
this objectionable feature. We shall do this by making it more difficult to
reject the hypothesis H0' with the hope that this will give a smaller probability
of rejecting H o when that hypothesis is true.
Test 2. Let n = 25. We shall reject the hypothesis Ho: 8 ~ 75 and
accept the hypothesis HI: 8 > 75 if and only if x > 78. Here the critical
region is e = {(Xl' ... , X 25); Xl + ... + X 25 > (25)(78)}. The power function
of Test 2 is, because X is n(8, 4),
Pr [(Xl' ... , X 25 ) E C] = Pr (X > 75).
Obviously, this probability is a function of the parameter 8 and we shall
denote it by K I(8).
The function K 1(8) = Pr (X > 75) is called the power
made by the research worker is usually taken to be the alternative hypothesis.
In any case the problem is to decide which of these hypotheses is to be
accepted. To reach a decision, the random experiment is to be repe~ted a
number of independent times, say n, and the results observed. That IS, we
consider a random sample Xl, X 2 , ••• , X; from a distribution that is
n(8, 100), and we devise a rule that will tell us what decision t? make once
the experimental values, say Xl' X 2, •• " X n, have been determmed. Such a
rule is called a test of the hypothesis Ho: 8 ~ 75 against the alternative
hypothesis HI: 8 > 75. There is no bound on the number of rules or tes~s
that can be constructed. We shall consider three such tests. Our tests will
be constructed around the following notion. We shall partition the sample
space d into a subset e and its complement e*. If the experimental values of
Xl' X 2, ... , X n, say Xv X2' ... , Xn, are such that the point (Xl' X2' •.. , X n) E C,
we shall reject the hypothesis H o (accept the hypothesis HI)' If we have
(Xl' X
2,
•.• , X
n) E e*, we shall accept the hypothesis H0 (reject the hypothesis
HI)'
Test 1. Let n = 25. The sample space d is the set
{(xv X 2, ... , X 25); -00 < Xi < 00, i = 1,2, ... , 25}.
Let the subset e of the sample space be
e = {(xv X2' ... , X 25); Xl + X2 + ... + X 25 > (25)(75)}.
We shall reject the hypothesis Ho if and only if our 25 experimental values
are such that (Xl, X 2, ... , X 25) E c. If (xv X 2, .. " X25) is not an element of C,
we shall accept the hypothesis H o. This subset e of the sample space that
leads to the rejection of the hypothesis H 0: 8 ~ 75 is called the critical region
25 25
of Test 1. Now LXi> (25)(75) if and only if x > 75, where x = L xi/25.
I I
Thus we can much more conveniently say that we shall reject the hypothesis
H o: 8 ~ 75 and accept the hypothesis Hs: 8 > 75 if and only if the experi-
mentally determined value of the sample mean x is greater than 75. If
x ~ 75, we accept the hypothesis H o: 8 ~ 75. Our test then amounts to this:
We shall reject the hypothesis H o: 8 ~ 75 if the mean of the sample exceeds
the maximum value of the mean of the distribution when the hypothesis
H o is true.
It would help us to evaluate a test of a statistical hypothesis if we knew
the probability of rejecting that hypothesis (and hence of accepting the
alternative hypothesis). In our Test 1, this means that we want to compute
the probability
Equivalently, from Table III of Appendix B, we have

    K3(θ) = Pr (X̄ > c) = 1 - N[(c - θ)/(10/√n)].

The conditions K3(75) = 0.159 and K3(77) = 0.841 require that
Test 3. Let us first select a power function KsUJ) that has the features
of a small value at 8 = 75 and a large value at 8 = 77. For instance, take
K s(75)
= 0.159 and K s(77) = 0.841. To determine a test with such a power
function, let us reject Ho: 8 ::; 75 if and only if the experimental value x
of the mean of a random sample of size n is greater than some constant c.
Thus the critical region is C = {(Xl' X 2, •• " X n) ; Xl + X2 + ... + Xn > nc}.
It should be noted that the sample size n and the constant c have not been
determined as yet. However, since X is n(8, 1OO/n), the power function is
We have now illustrated the following concepts:
(a) A statistical hypothesis.
(b) A test of a hypothesis against an alternative hypothesis and
the associated concept of the critical region of the test.
(c) The power of a test.
These concepts will now be formally defined.
Definition 1. A statistical hypothesis is an assertion about the dis-
    f(x; θ) = (1/θ)e^(-x/θ),   0 < x < ∞,
            = 0 elsewhere.
If we refer again to Example 1, we see that the significance levels
of Tests 1, 2, and 3 of that example are 0.500,0.067, and 0.159, respec-
tively. An additional example may help clarify these definitions.
Example 2. It is known that the random variable X has a p.d.f, of the
form
tribution of one or more random variables. If the statistical hypothesis
completely specifies the distribution, it is called a simple statistical
hypothesis; if it does not, it is called a composite statistical hypothesis.
If we refer to Example 1, we see that both H o: e::; 75 andHI : e> 75
are composite statistical hypotheses, since neither of them completely
specifies the distribution. If there, instead of H o: e ::; 75, we had
Ho: e= 75, then Ho would have been a simple statistical hypothesis.
Definition 2. A test of a statistical hypothesis is a rule which, when
the experimental sample values have been obtained, leads to a decision
to accept or to reject the hypothesis under consideration.
Definition 3. Let C be that subset of the sample space which, in
accordance with a prescribed test, leads to the rejection of the hypoth-
esis under consideration. Then C is called the critical region of the test.
Definition 4. The power function of a test of a statistical hypothesis
H o against an alternative hypothesis HI is that function, defined for
all distributions under consideration, which yields the probability that
the sample point falls in the critical region C of the test, that is, a
function that yields the probability of rejecting the hypothesis under
consideration. The value of the power function at a parameter point is
called the power of the test at that point.
Definition 5. Let H0 denote a hypothesis that is to be tested
against an alternative hypothesis HI in accordance with a prescribed
test. The significance level of the test (or the size of the critical region C)
is the maximum value (actually supremum) of the power function of
the test when H 0 is true.
It is desired to test the simple hypothesis H o: 8 = 2 against the alternative
simple hypothesis HI: 8 = 4. Thus Q = {8; 8 = 2,4}. A random sample
Xl, X2 of size n = 2 will be used. The test to be used is defined by taking
    1 - N[(c - 75)/(10/√n)] = 0.159   and   1 - N[(c - 77)/(10/√n)] = 0.841;

that is,

    (c - 75)/(10/√n) = 1   and   (c - 77)/(10/√n) = -1.
The solution to these two equations in nand c is n = 100, c = 76. With
these values of nand c, other powers of Test 3 are K s(73) = 0.001 and
K s(79)
= 0.999. It is important to observe that although Test 3 has a more
desirable power function than those of Tests 1 and 2, a certain "price"
has been paid-a sample size of n = 100 is required in Test 3, whereas we
had n = 25 in the earlier tests.
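The determination of n and c for Test 3 can be reproduced numerically (a sketch, not part of the text):

```python
from math import sqrt
from scipy.stats import norm

# Test 3 requires K3(75) = 0.159 and K3(77) = 0.841, where
# K3(theta) = 1 - N((c - theta)/(10/sqrt(n))).  Hence
# (c - 75)/(10/sqrt(n)) = z and (c - 77)/(10/sqrt(n)) = -z with z = norm.ppf(0.841).
z = norm.ppf(0.841)                  # very nearly 1
# Subtracting the two equations gives 2/(10/sqrt(n)) = 2z, so sqrt(n) = 10z.
n = (10 * z) ** 2
c = 75 + z * 10 / sqrt(n)
print(round(n), round(c))            # 100 and 76
```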
Remark. Throughout the text we frequently say that we accept the
hypothesis H o if we do not reject H o in favor of HI' If this decision is made,
it certainly does not mean that H°is true or that we even believe that it is
true. All it means is, based upon the data at hand, that we are not convinced
that the hypothesis H o is wrong. Accordingly, the statement "We accept
H
o
" would possibly be better read as "We do not reject Hi;" However,
because it is in fairly common use, we use the statement "We accept Ho,"
but read it with this remark in mind.
the critical region to be C = {(xv x2) ; 9.5 :$ Xl + X 2 < co}. The power
function of the test and the significance level of the test will be determined.
There are but two probability density functions under consideration,
namely,J(x; 2) specified by Ho andf(x; 4) specified by Hl . Thus the power
function is defined at but two points () = 2 and () = 4. The power function
of the test is given by Pr[(XVX 2 ) EC]. If Ho is true, that is, () = 2, the
joint p.d.f. of Xl and X2 is
    f(x1; 2)f(x2; 2) = (1/4)e^(-(x1+x2)/2),   0 < x1 < ∞, 0 < x2 < ∞,
                     = 0 elsewhere,

and

    Pr [(X1, X2) ∈ C] = 1 - ∫₀^9.5 ∫₀^(9.5-x2) (1/4)e^(-(x1+x2)/2) dx1 dx2
                      = 0.05, approximately.

If H1 is true, that is, θ = 4, the joint p.d.f. of X1 and X2 is

    f(x1; 4)f(x2; 4) = (1/16)e^(-(x1+x2)/4),   0 < x1 < ∞, 0 < x2 < ∞,
                     = 0 elsewhere,

and

    Pr [(X1, X2) ∈ C] = 1 - ∫₀^9.5 ∫₀^(9.5-x2) (1/16)e^(-(x1+x2)/4) dx1 dx2
                      = 0.31, approximately.
Thus the power of the test is given by 0.05 for () = 2 and by 0.31 for () = 4.
That is, the probability of rejecting H o when H o is true is 0.05, and the
probability of rejecting Ho when Ho is false is 0.31. Since the significance
level of this test (or the size of the critical region) is the power of the test
when H o is true, the significance level of this test is 0.05.
The fact that the power of this test, when 8 = 4, is only 0.31 immediately
suggests that a search be made for another test which, with the same power
when () = 2, would have a power greater than 0.31 when () = 4. However,
Section 7.2 will make clear that such a search would be fruitless. That is,
there is no test with a significance level of 0.05 and based on a random
sample of size n = 2 that has a greater power at () = 4. The only manner in
which the situation may be improved is to have recourse to a random sample
of size n greater than 2.
Our computations of the powers of this test at the two points () = 2
and (J = 4 were purposely done the hard way to focus attention on funda-
mental concepts. A procedure that is computationally simpler is the follow-
ing. When the hypothesis H o is true, the random variable X is X2(2).
Thus
the random variable Xl + X2 = Y, say, is X2
(4). Accordingly, the power
of the test when H 0 is true is given by
Pr (Y ~ 9.5) = 1 - Pr (Y < 9.5) = 1 - 0.95 = 0.05,
from Table II of Appendix B. When the hypothesis Hl is true, the random
variable X/2 is X2(2); so the random variable (Xl + X 2)/2 = Z, say, is
X2(4).
Accordingly, the power of the test when H l is true is given by
    Pr (X1 + X2 ≥ 9.5) = Pr (Z ≥ 4.75) = ∫_(4.75)^∞ (1/4)z e^(-z/2) dz,
which is equal to 0.31, approximately.
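Both powers of Example 2 can be checked with chi-square probabilities (a sketch, not part of the text):

```python
from scipy.stats import chi2

# Under H0, Y = X1 + X2 is chi-square(4); under H1, Z = (X1 + X2)/2 is chi-square(4).
print(1 - chi2.cdf(9.5, 4))     # about 0.05, the significance level
print(1 - chi2.cdf(4.75, 4))    # about 0.31, the power at theta = 4
```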
Remark. The rejection of the hypothesis Ho when that hypothesis is
true is, of course, an incorrect decision or an error. This incorrect decision
is often called a type I error; accordingly, the significance level of the test
is the probability of committing an error of type 1. The acceptance of H o
when Ho is false (Hl is true) is called an error of type II. Thus the probability
of a type II error is 1minus the power of the test when Hl is true. Frequently,
it is disconcerting to the student to discover that there are so many names for
the same thing. However, since all of them are used in the statistical litera-
ture, we feel obligated to point out that" significance level," "size of the
critical region," "power of the test when H0 is true," and" the probability of
committing an error of type I" are all equivalent.
EXERCISES
7.1. Let X have a p.d.f. of the form f(x; 8) = (J:r;8-1, 0 < X < 1, zero
elsewhere, where 8E{8; 8 = 1,2}. To test the simple hypothesis H o: 8 = 1
against the alternative simple hypothesis Hl : (J = 2, use a random sample
X v X2 of size n = 2 and define the critical region to be C = {(xv x2 ) ;
3/4 :$ Xl X2}' Find the power function of the test.
7.2. Let X have a binomial distribution with parameters n = 10 and
p E {P; P = -!-, -!-}. The simple hypothesis H o:P = -!- is rejected, and the
alternative simple hypothesis Hl : P = -!- is accepted, if the observed value
of Xl' a random sample of size 1, is less than or equal to 3. Find the power
function of the test.
7.3. Let Xv X2 be a random sample of size n = 2 from the distribution
having p.d.f. f(x; (J) = (1/8)e- X
/
8
, 0 < X < 00, zero elsewhere. We reject
Ho: (J = 2 and accept Hl : (J = 1 if the observed values of Xv X 2, say Xl' X2'
are such that
Here n = {B; B = 1, 2}. Find the significance level of the test and the power
of the test when H0 is false.
7.4. Sketch, as in Figure 7.1, the graphs of the power functions of Tests
1, 2, and 3 of Example 1 of this section.
7.5. Let us assume that the life of a tire in miles, say X, is normally
distributed with mean B and standard deviation 5000. Past experience
indicates that 0 = 30,000. The manufacturer claims that the tires made by a
new process have mean B > 30,000, and it is very possible that B = 35,000.
Let us check his claim by testing Ho: 0 :0; 30,000 against Hl : B > 30,000.
We shall observe n independent values of X, say Xl' •.• , X n , and we shall
reject H o (thus accept H l ) if and only if x ;:::: c. Determine nand c so that the
power function K(O) of the test has the values K(30,000) = 0.01 and
K(35,000) = 0.98.
7.6. Let X have a Poisson distribution with mean B. Consider the
simple hypothesis Ho: B = t and the alternative composite hypothesis
H l : B < 1- Thus n = {B;O < 0:0; t}. Let Xl"",X12 denote a random
sample of size 12 from this distribution. We reject H o if and only if the
observed value of Y = Xl + ... + X12 :0; 2. If K(B) is the power function
of the test, find the powers K(t), K(t), K(t), K(i), and K(fz). Sketch the
graph of K(O). What is the significance level of the test?
7.2 Certain Best Tests
In this section we require that both the hypothesis H o' which is to
be tested, and the alternative hypothesis HI be simple hypotheses.
Thus, in all instances, the parameter space is a set that consists of
exactly two points. Under this restriction, we shall do three things:
(a) Define a best test for testing H 0 against HI'
(b) Prove a theorem that provides a method of determining a best
test.
(c) Give two examples.
Before we define a best test, one important observation should be
made. Certainly, a test specifies a critical region; but it can also be
said that a choice of a critical region defines a test. For instance, if one
is given the critical region C = {(Xl' X 2, X s); X~ + X~ + X~ ;:::: I}, the
test is determined: Three random variables Xl' X 2 , X s are to be con-
sidered; if the observed values are Xl> X 2, X s, accept H 0 if x~ + x~ + x~
< 1; otherwise, reject H o. That is, the terms "test" and "critical
region" can, in this sense, be used interchangeably. Thus, if we define
a best critical region, we have defined a best test.
Let f(x; 0) denote the p.d.f. of a random variable X. Let Xl> X 2
,
... , X; denote a random sample from this distribution, and consider
the two simple hypotheses H o: 0 = 0' and HI: 0 = Olf. Thus n =
{O; 0 = 0', Olf}. We now define a best critical region (and hence a best
test) for testing the simple hypothesis H o against the alternative simple
hypothesis HI' In this definition the symbols Pr [(Xl> X 2 , ••• , X n) E
C; HoJ and Pr [(Xl' X 2 , ••• , X n) E C; HlJ mean Pr [(Xl' X 2 , ••• , X n) E
CJ when, respectively, H o and HI are true.
Definition 6. Let C denote a subset of the sample space. Then C is
called a best critical region of size a for testing the simple hypothesis
H o: 0 = 0' against the alternative simple hypothesis HI: 0 = Olf if, for
every subset A of the sample space for which Pr [(Xl> ... , X n) E
A; HoJ = a:
(a) Pr [(Xl> X 2 , • • • , X n) E C; HoJ = a.
(b) Pr[(Xl,X2,···,Xn ) EC;HlJ ;:::: Pr[(Xl>X2 , ... ,Xn)EA;Hl J.
This definition states, in effect, the following: First assume H o
to be true. In general, there will be a multiplicity of subsets A of the
sample space such that Pr [(Xl' X 2 , ••• , X n) E AJ = a. Suppose that
there is one of these subsets, say C, such that when HI is true, the power
of the test associated with C is at least as great as the power of the test
associated with each other A. Then C is defined as a best critical region
of size a for testing H 0 against HI'
. In the following example we shall examine this definition in some
detail and in a very simple case.
Example 1. Consider the one random variable X that has a binomial
distribution with n = 5 and p = θ. Let f(x; θ) denote the p.d.f. of X and let
H0: θ = 1/2 and H1: θ = 3/4. The following tabulation gives, at points of
positive probability density, the values of f(x; 1/2), f(x; 3/4), and the ratio
f(x; 1/2)/f(x; 3/4).

    x                      0        1        2        3        4        5
    f(x; 1/2)             1/32     5/32    10/32    10/32     5/32     1/32
    f(x; 3/4)            1/1024  15/1024  90/1024 270/1024 405/1024 243/1024
    f(x; 1/2)/f(x; 3/4)    32      32/3     32/9    32/27    32/81   32/243

We shall use one random value of X to test the simple hypothesis H0: θ = 1/2
against the alternative simple hypothesis H1: θ = 3/4, and we shall first assign
the significance level of the test to be α = 1/32. We seek a best critical region
let k be a positive number. Let C be a subset of the sample space such that:

    (a) L(θ'; x1, x2, ..., xn)/L(θ''; x1, x2, ..., xn) ≤ k for each point (x1, x2, ..., xn) ∈ C;

    (b) L(θ'; x1, x2, ..., xn)/L(θ''; x1, x2, ..., xn) ≥ k for each point (x1, x2, ..., xn) ∈ C*;

    (c) α = Pr [(X1, X2, ..., Xn) ∈ C; H0].

Then C is a best critical region of size α for testing the simple hypothesis
H0: θ = θ' against the alternative simple hypothesis H1: θ = θ''.
Proof. We shall give the proof when the random variables are of
the continuous type. If C is the only critical region of size a, the
theorem is proved. If there is another critical region of size a, denote it
by A . For convenience, we shall let rit.f L (0; Xl> ••• , xn) dX1
••. dXn
be
denoted by fR L(O). In this notation we wish to show that
r L(O") ~ ! r L(O').
JAnc* k JAnc*
i L(()") ~ ! r L(()').
cnA* k JcnA*
J L(()") - r L(O") z ! i L(()') - ! i L(()');
C~ 1~ k~# k~C*
(1) fc L(()") - t L(()")
= I. L(()") + I. L(()") - f L(()") - f L(()")
cnA CnA. AnC AnC.
= I. L(()") - f L(()II).
cnA· Anco
Since C is the union of the disjoint sets C n A and C n A * and A is
the union of the disjoint sets A n C and A n C*, we have
However, by the hypothesis of the theorem, L(()") ~ (ljk)L(()') at
each point of C, and hence at each point of C n A*; thus
But L(O") ~ (ljk)L(O') at each point of C*, and hence at each point of
A n C*; accordingly,
"These inequalities imply that
    243/1024 = Pr (X ∈ A2; H1) > Pr (X ∈ A1; H1) = 1/1024.

It should be noted, in this problem, that the best critical region C = A2 of
size α = 1/32 is found by including in C the point (or points) at which f(x; 1/2)
is small in comparison with f(x; 3/4). This is seen to be true once it is observed
that the ratio f(x; 1/2)/f(x; 3/4) is a minimum at x = 5. Accordingly, the ratio
f(x; 1/2)/f(x; 3/4), which is given in the last line of the above tabulation, provides
us with a precise tool by which to find a best critical region C for certain given
values of α. To illustrate this, take α = 6/32. When H0 is true, each of the
subsets {x; x = 0, 1}, {x; x = 0, 4}, {x; x = 1, 5}, {x; x = 4, 5} has probability
measure 6/32. By direct computation it is found that the best critical region of
this size is {x; x = 4, 5}. This reflects the fact that the ratio f(x; 1/2)/f(x; 3/4) has
its two smallest values for x = 4 and x = 5. The power of this test, which
has α = 6/32, is

    Pr (X ∈ {x; x = 4, 5}; H1) = 405/1024 + 243/1024 = 648/1024.
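The tabulation and the best critical region of size α = 6/32 can be reproduced numerically (an illustrative sketch, not part of the text):

```python
from scipy.stats import binom

# X is b(5, theta); H0: theta = 1/2, H1: theta = 3/4.
xs = range(6)
f0 = {x: binom.pmf(x, 5, 0.5) for x in xs}
f1 = {x: binom.pmf(x, 5, 0.75) for x in xs}
ratio = {x: f0[x] / f1[x] for x in xs}          # f(x; 1/2)/f(x; 3/4) = 32/3**x
print({x: round(ratio[x], 3) for x in xs})

# For alpha = 6/32, the best critical region consists of the points with the
# smallest ratio whose H0-probabilities sum to 6/32, namely {4, 5}.
C = [4, 5]
print(sum(f0[x] for x in C))    # 6/32 = 0.1875, the size
print(sum(f1[x] for x in C))    # 648/1024, the power at theta = 3/4
```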
The preceding example should make the following theorem, due to
Neyman and Pearson, easier to understand. It is an important theorem
because it provides a systematic method of determining a best critical
region.
Neyman-Pearson Theorem. Let Xl' X 2 , • • • , X n , where n is a
fixed positive integer, denote a random sample from a distribution that has
p.d.]. f(x; ()). Then the joint p.d.f. of Xl> X 2 , ••• , X n is
L((); Xl> X 2, ••• , xn) = f(x1 ; ())f(x2 ; ()) ••• f(xn ; ()).
Let ()' and ()" be distinct fixed values of () so that n = {O; () = 0', Oil}, and
of size α = 1/32. If A1 = {x; x = 0} and A2 = {x; x = 5}, then Pr (X ∈ A1; H0)
= Pr (X ∈ A2; H0) = 1/32 and there is no other subset A3 of the space
{x; x = 0, 1, 2, 3, 4, 5} such that Pr (X ∈ A3; H0) = 1/32. Then either A1 or A2
is the best critical region C of size α = 1/32 for testing H0 against H1. We
note that Pr (X ∈ A1; H0) = 1/32 and that Pr (X ∈ A1; H1) = 1/1024. Thus, if
the set A1 is used as a critical region of size α = 1/32, we have the intolerable
situation that the probability of rejecting H0 when H1 is true (H0 is false)
is much less than the probability of rejecting H0 when H0 is true.

On the other hand, if the set A2 is used as a critical region, then
Pr (X ∈ A2; H0) = 1/32 and Pr (X ∈ A2; H1) = 243/1024. That is, the probability
of rejecting H0 when H1 is true is much greater than the probability of reject-
ing H0 when H0 is true. Certainly, this is a more desirable state of affairs, and
actually A2 is the best critical region of size α = 1/32. The latter statement
follows from the fact that, when H0 is true, there are but two subsets, A1 and
A2, of the sample space, each of whose probability measure is 1/32, and the fact
that
and, from Equation (1), we obtain
or
If this result is substituted in inequality (2), we obtain the desired
result,
    f(x; θ) = \frac{1}{\sqrt{2\pi}} \exp\left[−\frac{(x − θ)^2}{2}\right],   −∞ < x < ∞.

It is desired to test the simple hypothesis H0: θ = θ' = 0 against the alter-
native simple hypothesis H1: θ = θ'' = 1. Now

    \frac{L(θ'; x_1, ..., x_n)}{L(θ''; x_1, ..., x_n)}
        = \frac{(1/\sqrt{2\pi})^n \exp\left[−\left(\sum_1^n x_i^2\right)/2\right]}
               {(1/\sqrt{2\pi})^n \exp\left[−\left(\sum_1^n (x_i − 1)^2\right)/2\right]}
        = \exp\left(−\sum_1^n x_i + \frac{n}{2}\right).

If k > 0, the set of all points (x_1, x_2, ..., x_n) such that

    \exp\left(−\sum_1^n x_i + \frac{n}{2}\right) ≤ k
Suppose that it is the first form, u1 ≤ c1. Since θ' and θ'' are given
constants, u1(X1, X2, ..., Xn; θ', θ'') is a statistic; and if the p.d.f. of
this statistic can be found when H0 is true, then the significance level of
the test of H0 against H1 can be determined from this distribution.
That is,

    α = Pr[u1(X1, X2, ..., Xn; θ', θ'') ≤ c1; H0].

Moreover, the test may be based on this statistic; for, if the observed
values of X1, X2, ..., Xn are x1, x2, ..., xn, we reject H0 (accept H1) if
u1(x1, x2, ..., xn) ≤ c1.
A positive number k determines a best critical region C whose size
is ex = Pr [(Xl> X 2 , ••• , X n) E C; HoJ for that particular k. It may be
that this value of ex is unsuitable for the purpose at hand; that is, it is
too large or too small. However, if there is a statistic ul (Xl' X 2 , ••. , X n),
as in the preceding paragraph, whose p.d.I. can be determined when
H o is true, we need not experiment with various values of k to obtain
a desirable significance level. For if the distribution of the statistic is
known, or can be found, we may determine [1 such that Pr [Ul(Xl, X 2 ,
... , X n) :$ Cl; H oJ is a desirable significance level.
An illustrative example follows.
Example 2. Let Xl' X 2 • • • " X n denote a random sample from the
distribution that has the p.d.f.
    \int_C L(θ'') − \int_A L(θ'') ≥ \frac{1}{k}\left[\int_{C∩A*} L(θ') − \int_{A∩C*} L(θ')\right].

However,

    \int_{C∩A*} L(θ') − \int_{A∩C*} L(θ')
        = \int_{C∩A*} L(θ') + \int_{C∩A} L(θ') − \int_{A∩C} L(θ') − \int_{A∩C*} L(θ')
        = \int_C L(θ') − \int_A L(θ') = α − α = 0.
Remark. As stated in the theorem, conditions (a), (b), and (c) are suffi-
cient ones for region C to be a best critical region of size a. However, they are
also necessary. We discuss this briefly. Suppose there is a region A of size ex
that does not satisfy (a) and (b) and that is as powerful at 8 = 8" as C, which
satisfies (a), (b), and (c). Then expression (1) would be zero, since the power
at 8"using A is equal to that using C. It can be proved that to have expression
(1) equal zero A must be of the same form as C. As a matter of fact, in the
continuous case, A and C would essentially be the same region; that is, they
could differ only by a set having probability zero. However, in the discrete
case, if Pr [L(8') = kL(8"); HoJ is positive, A and C could be different sets,
but each would necessarily enjoy conditions (a), (b), and (c) to be a best
critical region of size a.
One aspect of the theorem to be emphasized is that if we take C to
be the set of all points (Xl' X 2, ••• , xn) which satisfy
    L(θ'; x1, x2, ..., xn) / L(θ''; x1, x2, ..., xn) ≤ k,   k > 0,

then, in accordance with the theorem, C will be a best critical region.
This inequality can frequently be expressed in one of the forms (where
c1 and c2 are constants)

    u1(x1, x2, ..., xn; θ', θ'') ≤ c1   or   u2(x1, x2, ..., xn; θ', θ'') ≥ c2.
IcL(O") - LL(O") ~ O.
If the random variables are of the discrete type, the proof is the same,
with integration replaced by summation.
However,
r L( 0')
JcnA>
(2)
is a best critical region. This inequality holds if and only if

    −\sum_1^n x_i + \frac{n}{2} ≤ \ln k

or, equivalently,

    \sum_1^n x_i ≥ \frac{n}{2} − \ln k = c.
distribution, nor, as a matter of fact, do the random variables X1, X2,
..., Xn need to be mutually stochastically independent. That is, if H0
is the simple hypothesis that the joint p.d.f. is g(x1, x2, ..., xn), and if
H1 is the alternative simple hypothesis that the joint p.d.f. is h(x1, x2,
..., xn), then C is a best critical region of size α for testing H0 against
H1 if, for k > 0:
In this case, a best critical region is the set C = {(x_1, x_2, ..., x_n); \sum_1^n x_i ≥ c},
where c is a constant that can be determined so that the size of the critical

    Pr(X̄ ≥ c_1; H_1) = \int_{c_1}^{∞} \frac{1}{\sqrt{2\pi}\sqrt{1/n}} \exp\left[−\frac{(x̄ − 1)^2}{2(1/n)}\right] dx̄.

For example, if n = 25 and if α is selected to be 0.05, then from Table III
we find that c1 = 1.645/\sqrt{25} = 0.329. Thus the power of this best test of H0
against H1 is 0.05, when H0 is true, and is
There is another aspect of this theorem that warrants special
mention. It has to do with the number of parameters that appear in the
p.d.f. Our notation suggests that there is but one parameter. However,
a careful review of the proof will reveal that nowhere was this needed
or assumed. The p.d.f. may depend upon any finite number of param-
eters. What is essential is that the hypothesis H o and the alternative
hypothesis HI be simple, namely that they completely specify the
distributions. With this in mind, we see that the simple hypotheses
H o and HI do not need to be hypotheses about the parameters of a
    H0: f(x) = e^{−1}/x!,   x = 0, 1, 2, ...,
             = 0 elsewhere,

against the alternative simple hypothesis

    H1: f(x) = (1/2)^{x+1},   x = 0, 1, 2, ...,
             = 0 elsewhere.

Here
Pr (Xl EC; H o) = 1 - Pr (Xl = 1,2; H o) = 0.448,
(a)'  g(x1, x2, ..., xn) / h(x1, x2, ..., xn) ≤ k  for (x1, x2, ..., xn) ∈ C;

(b)'  g(x1, x2, ..., xn) / h(x1, x2, ..., xn) ≥ k  for (x1, x2, ..., xn) ∈ C*;

(c)'  α = Pr[(X1, X2, ..., Xn) ∈ C; H0].
An illustrative example follows.
Example 3. Let X1, ..., Xn denote a random sample from a distribution
which has a p.d.f. f(x) that is positive on and only on the nonnegative
integers. It is desired to test the simple hypothesis

    \frac{g(x_1, ..., x_n)}{h(x_1, ..., x_n)}
        = \frac{e^{−n}/(x_1!\,x_2! \cdots x_n!)}{(1/2)^n (1/2)^{x_1 + x_2 + \cdots + x_n}}
        = \frac{(2e^{−1})^n\, 2^{\sum x_i}}{\prod_1^n (x_i!)}.

If k > 0, the set of points (x_1, x_2, ..., x_n) such that

    \left(\sum_1^n x_i\right) \ln 2 − \ln\left[\prod_1^n (x_i!)\right] ≤ \ln k − n \ln(2e^{−1}) = c

is a best critical region C. Consider the case of k = 1 and n = 1. The preced-
ing inequality may be written 2^{x_1}/x_1! ≤ e/2. This inequality is satisfied by all
points in the set C = {x_1; x_1 = 0, 3, 4, 5, ...}. Thus the power of the test
when H0 is true is
    \int_{0.329}^{∞} \frac{1}{\sqrt{2\pi}\sqrt{1/25}} \exp\left[−\frac{(x̄ − 1)^2}{2(1/25)}\right] dx̄
        = \int_{−3.355}^{∞} \frac{1}{\sqrt{2\pi}} e^{−w^2/2}\,dw = 0.999+,

when H1 is true.
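The constants of this example are easy to reproduce with a normal quantile and survival function in place of Table III; the sketch below assumes SciPy is available:

```python
from math import sqrt
from scipy.stats import norm   # assumes SciPy is available

n, alpha = 25, 0.05
c1 = norm.ppf(1 - alpha) / sqrt(n)           # 1.645/5 = 0.329
# Under H1: theta'' = 1, the sample mean is n(1, 1/n).
power = norm.sf(c1, loc=1, scale=1 / sqrt(n))  # 0.999+
print(round(c1, 3), round(power, 4))
```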
region is a desired number α. The event \sum_1^n X_i ≥ c is equivalent to the event
X̄ ≥ c/n = c1, say, so the test may be based upon the statistic X̄. If H0 is
true, that is, θ = θ' = 0, then X̄ has a distribution that is n(0, 1/n). For a
given positive integer n, the size of the sample, and a given significance level
α, the number c1 can be found from Table III in Appendix B, so that
Pr(X̄ ≥ c1; H0) = α. Hence, if the experimental values of X1, X2, ..., Xn
were, respectively, x1, x2, ..., xn, we would compute x̄ = \sum_1^n x_i/n. If x̄ ≥ c1,
the simple hypothesis H0: θ = θ' = 0 would be rejected at the significance
level α; if x̄ < c1, the hypothesis H0 would be accepted. The probability of
rejecting H0, when H0 is true, is α; the probability of rejecting H0, when H0
is false, is the value of the power of the test at θ = θ'' = 1. That is,
approximately, in accordance with Table I of Appendix B. The power of
the test when H1 is true is given by

    Pr(X1 ∈ C; H1) = 1 − Pr(X1 = 1, 2; H1) = 1 − (1/4 + 1/8) = 0.625.
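Both power values of Example 3 can be checked directly; the sketch below (standard library only) takes the alternative p.d.f. to be f(x) = (1/2)^{x+1}, as the ratio displayed above indicates:

```python
from math import exp, factorial

def f0(x):
    return exp(-1) / factorial(x)   # Poisson-type p.d.f. under H0

def f1(x):
    return 0.5 ** (x + 1)           # alternative p.d.f. under H1

# For k = 1 and n = 1 the best critical region is C = {0, 3, 4, 5, ...},
# so its complement is {1, 2}.
print(round(1 - (f0(1) + f0(2)), 3))   # 0.448, the power (size) under H0
print(round(1 - (f1(1) + f1(2)), 3))   # 0.625, the power under H1
```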
EXERCISES
7.7. In Example 2 of this section, let the simple hypotheses read
H
o
: (J = (J' = a and H1: (J = (J" = -1. Show that the best test of Ho
against H1
may be carried out by use of the statistic X, and that if n = 25
and a = 0.05, the power of the test is 0.999+ when H1 is true.
7.8. Let the random variable X have the p.d.f. f(x; θ) = (1/θ)e^{−x/θ},
0 < x < ∞, zero elsewhere. Consider the simple hypothesis H0: θ = θ' = 2
and the alternative hypothesis H1: (J = (J" = 4. Let Xl' X2 denote a
random sample of size 2 from this distribution. Show that the best test of
H
o
against H1
may be carried out by use of the statistic Xl + X2 and that
the assertion in Example 2 of Section 7.1 is correct.
7.9. Repeat Exercise 7.8 when H1: (J = (J" = 6. Generalize this for every
(J" > 2.
7.10. Let Xv X 2, ... , X10 be a random sample of size 10 from a normal
distribution n(O, a2) . Find a best critical region of size a = 0.05 for testing
H
o
: a2 = 1 against H1: a2 = 2. Is this a best critical region of size 0.05 for
testing Ho: a2 = 1 against H1: a2 = 4? Against H1: a
2
= at > I?
7.11. If Xl' X 2, ... , X n is a random sample from a distribution having
p.d.f. of the form j(x; (J) = (Jx8 - 1, a < x < 1, zero elsewhere, show that a
best critical region for testing H o: (J = 1 against H 1: (J = 2 is C =
{ (Xl ' X2' ... , Xn); C :::; fr Xi}'
.=1
7.12. Let Xl' X 2, ... , X1 0 be a random sample from a distribution
that is n((Jl' (J2)' Find a best test of the simple hypothesis Ho: (J1 = (J~ = 0,
(J2 = (J; = 1 against the alternative simple hypothesis H1: (J1 = (J~ = 1,
(J2 = (J~ = 4.
7.13. Let Xl' X 2,.. ., X n denote a random sample from a normal distri-
n
bution n((J,100). Show that C = {(XVX2""'Xn);c:::; X = i:Xdn} is a
best critical region for testing Ho: (J = 75 against H1: (J = 78. Find nand c
so that
Pr [(Xl' X 2, ... , X n) E C; HoJ = Pr (X ~ c; Ho) = 0.05
For example, K(2) = 0.05, K(4) = 0.31, and K(9.5) = 2/e. It is known
(Exercise 7.9) that C = {(Xl' x2) ; 9.5 :::; Xl + x2 < co} is a best critical
region of size 0.05 for testing the simple hypothesis H o: (J = 2 against
each simple hypothesis in the composite hypothesis H 1 : (J > 2.
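Since X1 + X2 has a gamma distribution with α = 2 and β = θ (see Exercise 7.25), the power function K(θ) = Pr(X1 + X2 ≥ 9.5; θ) can be evaluated with a gamma survival function; a sketch, assuming SciPy is available:

```python
from scipy.stats import gamma   # assumes SciPy is available

def K(theta):
    # Pr(X1 + X2 >= 9.5) when X1 + X2 is gamma with alpha = 2, beta = theta
    return gamma.sf(9.5, a=2, scale=theta)

for theta in (2, 4, 9.5):
    print(theta, round(K(theta), 3))   # about 0.05, 0.31, and 2/e = 0.736
```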
    f(x; θ) = \frac{1}{θ} e^{−x/θ},   0 < x < ∞,
7.14. Let Xl' X 2, . . ., X n denote a random sample from a distribution
having the p.d.f.j(x; p) = pX(l - P)l-X, X = 0,1, zero elsewhere. Show that
n
C = {(xv' .. , xn); LXi:::; c} is a best critical region for testing H o: P = t
1
against H 1 : p = t. Use the central limit theorem to find nand c so that
approximately Pr (~Xi s c; Ho) = 0.10 and Pr (~Xi :::; c; H1) = 0.80.
7.15. Let Xl' X 2, ... , X1 0 denote a random sample of size 10 from a
Poisson distribution with mean (J. Show that the critical region C defined by
10
f Xi ;::: 3 is a best critical region for testing H o: (J = 0.1 against H 1: (J = 0.5.
Determine, for this test, the significance level a and the power at (J = 0.5.
7.3 Uniformly Most Powerful Tests
This section will take up the problem of a test of a simple hypothesis
Ho against an alternative composite hypothesis H1 . We begin with an
example.
Example 1. Consider the p.d.f.
= °elsewhere,
of Example 2, Section 7.1. It is desired to test the simple hypothesis H o: (J = 2
against the alternative composite hypothesis H 1: (J > 2. Thus Q = {(J; (J ;::: 2}.
A random sample, Xl, X 2, of size n = 2 will be used, and the critical region
is C = {(xv x2); 9.5 :::; Xl + x2 < co}. It was shown in the example cited
that the significance level of the test is approximately 0.05 and that the
power of the test when (J = 4 is approximately 0.31. The power function
K((J) of the test for all (J ;::: 2 will now be obtained. We have
and
Pr [(Xv X 2, ... , X n) E C; H1J = Pr (X ;::: c; H1) = 0.90, approximately.
The preceding example affords an illustration of a test of a simple
hypothesis H o that is a best test of H o against every simple hypothesis
in the alternative composite hypothesis HI· We now define a critical
region, when it exists, which is a best critical region for testing a simple
hypothesis H 0 against an alternative composite hypothesis HI- It seems
desirable that this critical region should be a best critical region for
testing H o against each simple hypothesis in HI- That is, the power
function of the test that corresponds to this critical region should be
at least as great as the power function of any other test with the same
significance level for every simple hypothesis in HI-
Definition 7. The critical region C is a uniformly most powerful
critical region of size ex for testing the simple hypothesis H 0 against an
alternative composite hypothesis HI if the set C is a best critical region
of size ex for testing H 0 against each simple hypothesis in HI· A test
defined by this critical region C is called a uniformly most powerful test,
with significance level ex, for testing the simple hypothesis H 0 against
the alternative composite hypothesis HI-
As will be seen presently, uniformly most powerful tests do not
always exist. However, when they do exist, the Neyman-Pearson
theorem provides a technique for finding them. Some illustrative
examples are given here.
Example 2. Let Xv X 2
, • _ ., X; denote a random sample from a distri-
bution that is n(O, 8), where the variance 8 is an unknown positive number.
It will be shown that there exists a uniformly most powerful test with
significance level ex for testing the simple hypothesis Ho: 8 = 8', where 8' is a
fixed positive number, against the alternative composite hypothesis
H1: θ > θ'. Thus Ω = {θ; θ ≥ θ'}. The joint p.d.f. of X1, X2, ..., Xn is

    L(θ; x_1, x_2, ..., x_n) = \left(\frac{1}{2\pi θ}\right)^{n/2} \exp\left(−\frac{\sum_1^n x_i^2}{2θ}\right).

The set C = {(x_1, x_2, ..., x_n); \sum_1^n x_i^2 ≥ c} is then a best critical region for
testing the simple hypothesis H0: θ = θ' against the simple hypothesis
θ = θ''. It remains to determine c so that this critical region has the desired
size α. If H0 is true, the random variable \sum_1^n X_i^2/θ' has a chi-square distribu-
tion with n degrees of freedom. Since α = Pr(\sum_1^n X_i^2/θ' ≥ c/θ'; H0), c/θ' may
be read from Table II in Appendix Band c determined. Then C =
n
{(xv X 2, ••• , xn) ; t xf C: c} is a best critical region of size ex for testing
H o: 8 = 8' against the hypothesis 8 = 8". Moreover, for each number 8"
greater than 8', the foregoing argument holds. That is, if 8'" is another
number greater than 8', then C = {(Xl' ... , xn) ; ~ xf C: c} is a best critical
1
region of size ex for testing H o: 8 = 8' against the hypothesis 8 = 8"'. Accord-
n
ingly, C = {(Xl' ... , xn ); L xf C: c}is a uniformly most powerful critical region
1
of size ex for testing H 0: 8 = 8' against HI: 8 > 8'. If Xv x 2, .. " X n denote
the experimental values of Xl' X 2 , • . • , X n , then Ho: 8 = 8' is rejected at the
significance level ex, and HI: 8 > 8' is accepted, if ~ xf C: c; otherwise,
1
Ho: 8 = 8' is accepted.
If in the preceding discussion we take n = 15, ex = 0.05, and 8' = 3, then
here the two hypotheses will be H o: 8 = 3 and HI: 8 > 3. From Table II,
c/3 = 25 and hence c = 75.
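The constant c = 75 can be recovered from a chi-square quantile function in place of Table II; a sketch, assuming SciPy is available:

```python
from scipy.stats import chi2   # assumes SciPy is available

n, alpha, theta_prime = 15, 0.05, 3
quantile = chi2.ppf(1 - alpha, df=n)   # about 25.0 with 15 degrees of freedom
c = theta_prime * quantile             # reject H0: theta = 3 if sum(x_i**2) >= c = 75
print(round(quantile, 2), round(c, 1))
```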
. Example 3. Let Xv X 2 , ••• , X n denote a random sample from a
distribution that is n(8, 1), where the mean 8 is unknown. It will be shown
that there is no uniformly most powerful test of the simple hypothesis
H o: 8 = 8', where 8' is a fixed number, against the alternative composite
hypothesis HI: 8 =F 8'. Thus n = {8; -00 < 8 < co}. Let 8" be a number
not equal to 8'. Let k be a positive number and consider
Let θ'' represent a number greater than θ', and let k denote a positive
number. Let C be the set of points where L(θ'; x_1, ..., x_n)/L(θ''; x_1, ..., x_n) ≤ k,
that is, the set of points where

    \left(\frac{θ''}{θ'}\right)^{n/2} \exp\left[−\frac{(θ'' − θ')}{2θ'θ''}\sum_1^n x_i^2\right] ≤ k

or, equivalently,

    \sum_1^n x_i^2 ≥ \frac{2θ'θ''}{θ'' − θ'}\left[\frac{n}{2}\ln\frac{θ''}{θ'} − \ln k\right] = c.

    \frac{(1/2\pi)^{n/2} \exp\left[−\sum_1^n (x_i − θ')^2/2\right]}{(1/2\pi)^{n/2} \exp\left[−\sum_1^n (x_i − θ'')^2/2\right]} ≤ k.

The preceding inequality may be written as

    \exp\left[−(θ'' − θ')\sum_1^n x_i + \frac{n}{2}\{(θ'')^2 − (θ')^2\}\right] ≤ k

or

    (θ'' − θ')\sum_1^n x_i ≥ \frac{n}{2}[(θ'')^2 − (θ')^2] − \ln k.
This last inequality is equivalent to
    \sum_1^n x_i ≥ \frac{n}{2}(θ'' + θ') − \frac{\ln k}{θ'' − θ'}
distribution with mean 10(). Thus, with () = 0.1 so that the mean of Y is 1,
the significance level of the test is
Pr (Y :2: 3) = 1 - Pr (Y s 2) = 1 - 0.920 = 0.080.
10
If the uniformly most powerful critical region defined by LX, ?:: 4 is used,
1
the significance level is
provided ()" > ()', and it is equivalent to
    \sum_1^n x_i ≤ \frac{n}{2}(θ'' + θ') − \frac{\ln k}{θ'' − θ'}
    α = Pr(Y ≥ 4) = 1 − Pr(Y ≤ 3) = 1 − 0.981 = 0.019.
if ()" < ()'. The first of these two expressions defines a best critical region for
testing H 0: () = ()' against the hypothesis () = ()" provided that ()" > ()',
while the second expression defines a best critical region for testing H 0: 8 = 8'
against the hypothesis 8 = 8" provided that 8" < 8'. That is, a best critical
region for testing the simple hypothesis against an alternative simple
hypothesis, say 8 = 8' + 1, will not serve as a best critical region for testing
H o: () = ()' against the alternative simple hypothesis () = ()' - 1, say. By
definition, then, there is no uniformly most powerful test in the case under
consideration.
It should be noted that had the alternative composite hypothesis been
either H 1: () > ()' or H 1: () < ()', a uniformly most powerful test would exist
in each instance.
Example 4. In Exercise 7.15, the reader is asked to show that if a random
sample of size n = 10 is taken from a Poisson distribution with mean (), the
10
critical region defined by L X, :2: 3 is a best critical region for testing H 0: () =
1
0.1 against H 1 : 8 = 0.5. This critical region is also a uniformly most powerful
one for testing H a: () = 0.1 against H 1 : () > 0.1 because, with ()" > 0.1,
    \frac{(0.1)^{\sum x_i}\, e^{−10(0.1)}/(x_1!\,x_2! \cdots x_n!)}{(θ'')^{\sum x_i}\, e^{−10θ''}/(x_1!\,x_2! \cdots x_n!)} ≤ k

is equivalent to

    \left(\frac{0.1}{θ''}\right)^{\sum x_i} e^{−10(0.1 − θ'')} ≤ k.

The preceding inequality may be written as

    \left(\sum_1^{10} x_i\right)(\ln 0.1 − \ln θ'') ≤ \ln k + 10(0.1 − θ'')

or, since θ'' > 0.1, equivalently as

    \sum_1^{10} x_i ≥ \frac{\ln k + 1 − 10θ''}{\ln 0.1 − \ln θ''}.

Of course, \sum_1^{10} x_i ≥ 3 is of the latter form. The statistic Y = \sum_1^{10} X_i has a Poisson
If a significance level of about a = 0.05, say, is desired, most statisticians
would use one of these tests; that is, they would adjust the significance level
to that of one of these convenient tests. However, a significance level of
10 10
C( = 0.05 can be achieved exactly by rejecting Ho if LX, :2: 4 or if LX, = 3
1 1
and if an auxiliary independent random experiment resulted in "success,"
where the probability of success is selected to be equal to
    \frac{0.050 − 0.019}{0.080 − 0.019} = \frac{31}{61}.

This is due to the fact that, when θ = 0.1 so that the mean of Y is 1,

    Pr(Y ≥ 4) + Pr(Y = 3 and success) = 0.019 + Pr(Y = 3) Pr(success)
        = 0.019 + (0.061)(31/61) = 0.05.
The process of performing the auxiliary experiment to decide whether to
reject or not when Y = 3 is sometimes referred to as a randomized test.
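The probabilities used in this randomized test are easy to reproduce; the following sketch, assuming SciPy is available, computes the two attainable significance levels and the auxiliary success probability:

```python
from scipy.stats import poisson   # assumes SciPy is available

# Under H0: theta = 0.1, Y (the sum of the 10 observations) is Poisson with mean 1.
alpha_4 = poisson.sf(3, mu=1)     # Pr(Y >= 4) = 0.019
alpha_3 = poisson.sf(2, mu=1)     # Pr(Y >= 3) = 0.080
p_success = (0.05 - alpha_4) / (alpha_3 - alpha_4)        # about 31/61
size = alpha_4 + poisson.pmf(3, mu=1) * p_success          # 0.05, the randomized size
print(round(alpha_4, 3), round(alpha_3, 3), round(p_success, 3), round(size, 3))
```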
Remarks. Not many statisticians like randomized tests in practice,
because the use of them means that two statisticians could make the same
assumptions, observe the same data, apply the same test, and yet make
different decisions. Hence they usually adjust their significance level so as
not to randomize. As a matter of fact, many statisticians report what are
commonly called p-values. For illustrations, if in Example 4 the observed
Y is y = 4, the p-value is 0.019; and if it is y = 3, the p-value is 0.080. That
is, the p-value is the observed" tail" probability of a statistic being at least
as extreme as the particular observed value when H o is true. Hence, more
generally, if Y = u(Xl> X 2 , ••• , X n) is the statistic to be used in a test of
H o and if a uniformly most powerful critical region is of the form
{(x1, x2, ..., xn); u(x1, x2, ..., xn) ≤ c}, an observed value u(x1, x2, ..., xn) = d
would mean that the p-value = Pr(Y ≤ d; H0).
That is, if Cry) is the distribution function of Y = u(Xl> X 2 , ••• , X n) pro-
vided that H o is true, the p-value is equal to C(d) in this case. However,
C(Y), in the continuous case, is uniformly distributed on the unit interval,
so an observed value G(d) ≤ 0.05 would be equivalent to selecting c so that
Pr(Y ≤ c; H0) = 0.05 and observing that d ≤ c.
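For the Poisson statistic of Example 4, observed "tail" probabilities of this kind are computed directly from the null distribution; a sketch assuming SciPy is available:

```python
from scipy.stats import poisson   # assumes SciPy is available

# Example 4: large values of Y are extreme, so the p-value is Pr(Y >= y; H0).
for y in (4, 3):
    p_value = poisson.sf(y - 1, mu=1)
    print(y, round(p_value, 3))   # 0.019 and 0.080
```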
There is a final remark that should be made about uniformly most
powerful tests. Of course, in Definition 7, the word uniformly is associated
with e, that is, C is a best critical region of size a for testing H 0: () = ()o
against all evalues given by the composite alternative HI' However, suppose
that the form of such a region is
Then this form provides uniformly most powerful critical regions for all
attainable a values by, of course, appropriately changing the value of c.
That is, there is a certain uniformity property, also associated with a, that
is not always noted in statistics texts.
EXERCISES
7.16. Let X have the p.d.f. f(x; e) = eX(1 - ell-X, X = 0, 1, zero else-
where. We test the simple hypothesis H o: e= t against the alternative
composite hypothesis HI: e < t by taking a random sample of size 10 and
rejecting H 0: e = t if and only if the observed values Xl> Xz, ... , Xl O of the
10
sample items are such that .L x, ::; 1. Find the power function K(e), 0 <
1
e::; t, of this test.
7.17. Let X have a p.d.f. of the formf(x; e) = l/e,O < X < e, zero else-
where. Let Y1 < Yz < Y3 < Y4 denote the order statistics ot a random
sample of size 4 from this distribution. Let the observed value of Y4 be Y4'
We reject H o: () = 1 and accept HI: e i= 1 if either Y4 ::; ! or Y4 2 1. Find
the power function K(()), 0 < (), of the test.
7.18. Consider a normal distribution of the form n(e, 4). The simple
hypothesis H o: e= 0 is rejected, and the alternative composite hypothesis
HI: e > 0 is accepted if and only if the observed mean xof a random sample
of size 25 is greater than or equal to t. Find the power function K(e), 0 ::; e,
of this test.
7.19. Consider the two independent normal distributions n(fLl' 400) and
n(fLz, 225). Let e= fLl - fLz. Let x and fj denote the observed means of two
independent random samples, each of size n, from these t'NO distributions.
We reject H o: e= 0 and accept HI: () > 0 if and only if x - fj ;:::: c. If K(e)
is the power function of this test, find nand c so that K(O) = 0.05 and
K(lO) = 0.90, approximately.
7.20. If, in Example 2 of this section, H o: e = e', where e' is a fixed posi-
n
tive number, and HI: e < e', show that the set {(Xl, X2, ••• , X n); L: xF ::; c}is
1
a uniformly most powerful critical region for testing H o against HI'
7.21. If, in Example 2 of this section, H o: e= e', where ()' is a fixed
positive number, and HI: () i= e', show that there is no uniformly most
powerful test for testing H 0 against HI'
7.22. Let Xl' Xz, " " XZ5 denote a random sample of size 25 from a
normal distribution n((), 100). Find a uniformly most powerful critical region
of size a = 0.10 for testing H o: () = 75 against HI: e > 75.
7.23. Let Xl' X 2 , ••• , X; denote a random sample from a normal
distribution n(e, 16). Find the sample size n and a uniformly most powerful
test of H o: () = 25 against HI: e < 25 with power function K(e) so that
approximately K(25) = 0.10 and K(23) = 0.90.
7.24. Consider a distribution having a p.d.f. of the form f(x; e) =
eX(1 - ell-X, X = 0, 1, zero elsewhere. Let H o: e = -fo and HI: e > z~. Use
the central limit theorem to determine the sample size n of a random sample
so that a uniformly most powerful test of H o against HI has a power function
K(e), with approximately K(zlo) = 0.05 and K(fo) = 0.90.
7.25. Illustrative Example 1 of this section dealt with a random sample
of size n = 2 from a gamma distribution with a = 1, f3 = e. Thus the
moment-generating function of the distribution is (1 - ()t)-1 t < l/e,
e ;:::: 2. Let Z = Xl + X z. Show that Z has a gamma distribution with
ex = 2, f3 = (). Express the power function K(e) of Example 1 in terms of a
smgle integral. Generalize this for a random sample of size n.
7.26. Let X have the p.d f. f(x; e) = eX
(1 - W- x , X = 0, 1, zero else-
where. We test H o: () = ! against HI: e < ! by taking a random sample
5
Xl> X z, ... , X 5 of size n = 5 and rejecting Hoif Y = .L X, is observed to be
1
less than or equal to a constant c.
(a) Show that this is a uniformly most powerful test.
(b) Find the significance level when c = 1.
(c) Find the significance level when c = O.
(d) By using a randomized test, modify the tests given in part (b) and
part (c) to find a test with significance level a = -l'2.
7.4 Likelihood Ratio Tests
The notion of using the magnitude of the ratio of two probability
density functions as the basis of a best test or of a uniformly most
powerful test can be modified, and made intuitively appealing, to
provide a method of constructing a test of a composite hypothesis
against an alternative composite hypothesis or of constructing a test of a
simple hypothesis against an alternative composite hypothesis when a
uniformly most powerful test does not exist. This method leads to tests
called likelihood ratio tests. A likelihood ratio test, as just remarked, is not
necessarily a uniformly most powerful test, but it has been proved in
the literature that such a test often has desirable properties.
A certain terminology and notation will be introduced by means of
an example.
Example 1. Let the random variable X be n(8v 82) and let the param-
eter space be 0 = {(8v 82 ) ; -00 < 81 < 00,0 < 82 < oo}, Let the composite
hypothesis be H o: 81 = 0, 82 > 0, and let the alternative composite hypoth-
esis be H1 : 81 =1= 0, 82 > O. The set w = {(81, 82); 81 = 0,0 < 82 < co} is
a subset of 0 and will be called the subspace specified by the hypothesis Ho-
Then, for instance, the hypothesis H o may be described as H o: (81, 82) E w.
It is proposed that we test H o against all alternatives in H 1 .
Let Xv X 2 , ••• , X n denote a random sample of size n > 1 from the
distribution of this example. The joint p.d.f. of X1, X2, ..., Xn is, at each
point in Ω,

    L(θ_1, θ_2; x_1, ..., x_n) = \left(\frac{1}{2\pi θ_2}\right)^{n/2} \exp\left[−\frac{\sum_1^n (x_i − θ_1)^2}{2θ_2}\right] = L(Ω).

At each point (θ1, θ2) ∈ ω, the joint p.d.f. of X1, X2, ..., Xn is

    L(0, θ_2; x_1, ..., x_n) = \left(\frac{1}{2\pi θ_2}\right)^{n/2} \exp\left[−\frac{\sum_1^n x_i^2}{2θ_2}\right] = L(ω).
The joint p.d.f., now denoted by L(w), is not completely specified, since 82
may be any positive number; nor is the joint p.d.f., now denoted by L(O),
completely specified, since 81 may be any real number and 82 any positive
number. Thus the ratio of L(w) to L(O) could not provide a basis for a test
of H o against H 1 . Suppose, however, that we modify this ratio in the follow-
ing manner. We shall find the maximum of L(w) in w, that is, the maximum
of L(w) with respect to 82 , And we shall find the maximum of L(O) in 0;
that is, the maximum of L(O) with respect to 81 and 82 , The ratio of these
maxima will be taken as the criterion for a test of H o against H 1 . Let the
maximum of L(w) in w be denoted by L(w) and let the maximum of L(O)
in 0 be denoted by L(Q). Then the criterion for the test of Ho against H1 is
the likelihood ratio
Since L(w) and L(O) are probability density functions, , ~ 0; and since w
is a subset of 0, , ::; 1.
In our example the maximum, L(ω̂), of L(ω) is obtained by first setting

    \frac{d \ln L(ω)}{dθ_2} = −\frac{n}{2θ_2} + \frac{\sum_1^n x_i^2}{2θ_2^2}

equal to zero and solving for θ2. The solution for θ2 is \sum_1^n x_i^2/n, and this
number maximizes L(ω). Thus the maximum is

    L(ω̂) = \left(\frac{1}{2\pi \sum_1^n x_i^2/n}\right)^{n/2} \exp\left[−\frac{\sum_1^n x_i^2}{2\sum_1^n x_i^2/n}\right]
          = \left(\frac{n e^{−1}}{2\pi \sum_1^n x_i^2}\right)^{n/2}.
On the other hand, by using Example 4, Section 6.1, the maximum, L(Ω̂),
of L(Ω) is obtained by replacing θ1 and θ2 by \sum_1^n x_i/n = x̄ and \sum_1^n (x_i − x̄)^2/n,
respectively. That is,

    L(Ω̂) = \left(\frac{1}{2\pi \sum_1^n (x_i − x̄)^2/n}\right)^{n/2} \exp\left[−\frac{\sum_1^n (x_i − x̄)^2}{2\sum_1^n (x_i − x̄)^2/n}\right]
          = \left(\frac{n e^{−1}}{2\pi \sum_1^n (x_i − x̄)^2}\right)^{n/2}.

Thus here

    λ = \left(\frac{\sum_1^n (x_i − x̄)^2}{\sum_1^n x_i^2}\right)^{n/2}.

Because \sum_1^n x_i^2 = \sum_1^n (x_i − x̄)^2 + n x̄^2, λ may be written

    λ = \frac{1}{\left\{1 + \left[n x̄^2 / \sum_1^n (x_i − x̄)^2\right]\right\}^{n/2}}.
Now the hypothesis H o is 81 = 0, 82 > O. If the observed number x were
zero, the experiment tends to confirm H o. But if x = 0 and i x~ > 0, then
1
n
A = 1. On the other hand, if x and nx2/'L (Xl - X)2 deviate considerably
1
from zero, the experiment tends to negate Ho. Now the greater the deviation
n
of nx2/'L (x, - X)2 from zero, the smaller A becomes. That is, if A is used
1
as a test criterion, then an intuitively appealing critical region for testing
H0 is a set defined by 0 ::; A ::; Ao, where Ao is a positive proper fraction.
Thus we reject Ho if A ::; Ao. A test that has the critical region A ::; Ao is a
likelihood ratio test. In this example λ ≤ λ0 when and only when

    \frac{\sqrt{n}\,|x̄|}{\sqrt{\sum_1^n (x_i − x̄)^2/(n − 1)}} ≥ \sqrt{(n − 1)(λ_0^{−2/n} − 1)} = c.
If H0: θ1 = 0 is true, the results in Section 6.3 show that the statistic

    t(X_1, X_2, ..., X_n) = \frac{\sqrt{n}\,(X̄ − 0)}{\sqrt{\sum_1^n (X_i − X̄)^2/(n − 1)}}

has a t distribution with n − 1 degrees of freedom. Accordingly, in this
example the likelihood ratio test of H0 against H1 may be based on a T
statistic. For a given positive integer n, Table IV in Appendix B may be
used (with n − 1 degrees of freedom) to determine the number c such that
α = Pr[|t(X1, X2, ..., Xn)| ≥ c; H0] is the desired significance level of the
test. If the experimental values of X1, X2, ..., Xn are, respectively,
x1, x2, ..., xn, then we reject H0 if and only if |t(x1, x2, ..., xn)| ≥ c. If, for
instance, n = 6 and α = 0.05, then from Table IV, c = 2.571.
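In place of Table IV, the critical value c can be obtained from a t quantile function; a sketch assuming SciPy is available:

```python
from scipy.stats import t   # assumes SciPy is available

n, alpha = 6, 0.05
c = t.ppf(1 - alpha / 2, df=n - 1)   # about 2.571 with 5 degrees of freedom
print(round(c, 3))                   # reject H0: theta1 = 0 if |t(x1,...,xn)| >= c
```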
The preceding example should make the following generalization
easier to read: Let Xl> X 2 , • • • , X; denote n mutually stochastically
independent random variables having, respectively, the probability
density functions f,(x,; 8l> 82 , ••• , 8m), i = 1, 2, ... , n. The set that
consists of all parameter points (8l> 82 , ••• , 8m) is denoted by 0, which
we have called the parameter space. Let w be a subset of the parameter
space O. We wish to test the (simple or composite) hypothesis
H0: (8l> 82
, •.• , 8m
) E w against all alternative hypotheses. Define the
likelihood functions
n
L(w) = flfi(X,; 8l> 82 " " , 8m),
,=1
and
Let L(w) and L(Q) be the maxima, which we assume to exist, of these
two likelihood functions. The ratio of L(w) to L(Q) is called the likeli-
hood ratio and is denoted by
L(w)
A(Xl> X2 , ••• , xn) = A = --.
L(Q)
Let Ao be a positive proper function. The likelihood ratio test principle
states that the hypothesis Ho: (81, 82" " , 8m) Ew is rejected if and
only if
A(Xl> X2 , •.• , xn) = A ::; Ao.
~he. function A defines a random variable A(Xl> X 2, ..• , X n), and the
significance level of the test is given by
IX = Pr [A(Xl> X 2 , ••• , X n) ::; Ao; Hol
The likelihood ratio test principle is an intuitive one. However the
princip~e does lead to the same test, when testing a simple hypothesis
H o agamst an alternative simple hypothesis HI, as that given by the
Neyman-Pearson theorem (Exercise 7.29). Thus it might be expected
that a test based on this principle has some desirable properties.
An example of the preceding generalization will be given.
Example 2. Let the stochastically independent random variables X and
Y have distributions that are n((}l> ()3) and n(()2' ()3), where the means (}1
and ()2 and common variance ()3 are unknown. Then Q = {(() () ()).
11 2, 3'
-00 < (}1 < 00, -00 < ()2 < 00, 0 < ()3 < oo]. Let Xl> X 2 , ••• , X; and
Yl> Y 2 , ••• , Y m denote independent random samples from these distributions.
The. hypothesis Hi: (}l = ()2' unspecified, and ()3 unspecified, is to be tested
agamst all alternatives. Then w = {((}l> ()2' ()3); -00 < (}l = ()2 < 00,
o < ()3 < co}. Here Xl' X 2 , · · · , X n , Yl> Y 2 , ••• , Y m are n + m > 2
mutually stochastically independent random variables having the likelihood
functions
and
If
n
L(Q) = flf.(xj ; 81, 82 , ••• , 8m),
.=1
    \frac{∂ \ln L(ω)}{∂θ_1}   and   \frac{∂ \ln L(ω)}{∂θ_3}

are equated to zero, then (Exercise 7.30)
The solutions for θ1 and θ3 are, respectively,

(1)    u = \frac{\sum_1^n x_i + \sum_1^m y_i}{n + m}
and
       w = \frac{\sum_1^n (x_i − u)^2 + \sum_1^m (y_i − u)^2}{n + m},

and u and w maximize L(ω). The maximum is

    L(ω̂) = \left(\frac{e^{−1}}{2\pi w}\right)^{(n+m)/2}.

In like manner, u1, u2, and w' maximize L(Ω). The maximum is

    L(Ω̂) = \left(\frac{e^{−1}}{2\pi w'}\right)^{(n+m)/2},

so that

    λ(x_1, ..., x_n, y_1, ..., y_m) = λ = \frac{L(ω̂)}{L(Ω̂)} = \left(\frac{w'}{w}\right)^{(n+m)/2}.
The random variable defined by ,2/Cn+mJ is
n m
2: {Xi - [(nX + mY)/(n + m)]}2 + 2: {Y, - [(nX + mY)/(n + m)]}2
1 1
Now
i (Xi - nX + mY)2 = i [(Xi _X) + (X _ nX + mY)]2
1 n+m 1 n+m
= i (Xi - X)2 + n(X _ nX + mY)2
1 n + m
and
are equated to zero, then (Exercise 7.31)
In like manner, if
oIn L(0.)
081
'
oIn L(0.)
082
'
oIn L(0.)
083
I (Yl
- nX + mY)2 = i [(Yi _ Y) + (Y _ nX + mY)]2
1 n+m 1 n+m
= i (Yl
- Y)2 + m(Y _ nX + mY)2.
1 n + m
(2)
The solutions for 8v 82 , and 83 are, respectively,
n
2: Xi
U
1
= _1_,
n
m
2: v.
U
2
= _1_,
m
But
n(X _ nX + mY)2 = m
2n
(X _ Y)2
n + m (n + m)2
and
m(Y _ nX + mY)2 = n
2m
(X _ Y)2.
n + m (n + m)2
Hence the random variable defined by λ^{2/(n+m)} may be written

    \frac{\sum_1^n (X_i − X̄)^2 + \sum_1^m (Y_i − Ȳ)^2}
         {\sum_1^n (X_i − X̄)^2 + \sum_1^m (Y_i − Ȳ)^2 + [nm/(n + m)](X̄ − Ȳ)^2}
        = \frac{1}{1 + \dfrac{[nm/(n + m)](X̄ − Ȳ)^2}{\sum_1^n (X_i − X̄)^2 + \sum_1^m (Y_i − Ȳ)^2}}.
    t(X_1, ..., X_n) = \frac{\sqrt{n}\,X̄}{\sqrt{\sum_1^n (X_i − X̄)^2/(n − 1)}}
                     = \frac{\sqrt{n}\,X̄/σ}{\sqrt{\sum_1^n (X_i − X̄)^2/[σ^2(n − 1)]}}.

Here W1 = \sqrt{n}\,X̄/σ is n(\sqrt{n}\,θ_1/σ, 1), V1 = \sum_1^n (X_i − X̄)^2/σ^2 is χ^2(n − 1),
and W1 and V1 are stochastically independent. Thus, if θ1 ≠ 0, we see,
in accordance with the definition, that t(X1, ..., Xn) has a noncentral
t distribution with n − 1 degrees of freedom and noncentrality param-
eter δ1 = \sqrt{n}\,θ_1/σ. In Example 2 we had
In the light of this definition, let us reexamine the statistics of the
examples of this section. In Example 1 we had
n+m-2
J nm (X - V)
n+m
T = ---;==~=====
n+m-2
(n + m - 2) + T2
has, in accordance with Section 6.4, a t distribution with n + m - 2 degrees
of freedom. Thus the random variable defined by A2/(n + m) is
The test of H0 against all alternatives may then be based on a t distribution
with n + m - 2 degrees of freedom.
The likelihood ratio principle calls for the rejection of H 0 if and only if
A :::; Ao < 1. Thus the significance level of the test is
If the hypothesis H o: (}l = (}2 is true, the random variable
ex = Pr [(A(Xl , . · · , X n , Yl , · •• , Y m) s Ao;Hol
However, A(Xl , ... , X n, Y v .. " Y m) :::; Ao is equivalent to ITI ~ c, and so
T = W 2 ,
vV2/(n + m - 2)
ex = Pr(ITI ~ c;Ho)· where
For given values of nand m, the number c is determined from Table IV in
the Appendix (with n + m - 2 degrees of freedom) in such a manner as to
yield a desired ex. Then H 0 is rejected at a significance level ex if and only if
ItI ~ c, where t is the experimental value of T. If, for instance, n = 10,
m = 6, and ex = 0.05, then c = 2.145.
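A sketch of this two-sample procedure, assuming SciPy is available; the statistic follows the formula of this example, and the critical value replaces Table IV:

```python
from math import sqrt
from scipy.stats import t   # assumes SciPy is available

def two_sample_t(x, y):
    """T statistic of Example 2 for testing H0: theta1 = theta2."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    return sqrt(n * m / (n + m)) * (xbar - ybar) / sqrt(ss / (n + m - 2))

# Critical value for n = 10, m = 6, alpha = 0.05 (two-sided): about 2.145.
print(round(t.ppf(0.975, df=10 + 6 - 2), 3))
```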
In each of the two examples of this section it was found that
the likelihood ratio test could be based on a statistic which, when the
hypothesis H Q is true, has a t distribution. To help us compute the
powers of these tests at parameter points other than those described by
the hypothesis H Q, we turn to the following definition.
Definition 8. Let the random variable W be n(8, 1); let the random
variable V be X2
(r), and Wand V be stochastically independent. The
quotient
W
T=--=
VV/r
is said to have a noncentral t distribution with r degrees of freedom and
noncentrality parameter 8. If I) = 0, we say that T has a central t
distribution.
/1f;
m - /
W2 = - - (X - Y) a
n+m
and
V2 = [~(Xi - X)2 + ~ (Yi - Y)2]/a2.
Here W2 is n[Vnm/(n + m)(Ol - (2)/a, 1J, V2 is X2(n + m - 2), and
W2 and V2 are stochastically independent. Accordingly, if 01 of. °2, T
has a noncentral t distribution with n + m - 2 degrees of freedom and
noncentrality parameter 82 = vnm/(n + m)(Ol - (2)/a. It is interest-
ing to note that 81 = vn8l /a measures the deviation of °1 from
01 = 0 in units of the standard deviation a/vn of X. The noncentrality
parameter 82 = vnm/(n + m)(O! - (2)/a is equal to the deviation of
°1 - O
2 from 81 - O
2 = 0 in units of the standard deviation
aV(n + m)/nm of X - Y.
There are various tables of the noncentral t distribution, but they are
much too cumbersome to be included in this book. However, with the
aid of such tables, we can determine the power functions of these tests
as functions of the noncentrality parameters.
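With a noncentral t distribution function available, the power of the test of Example 1 can be computed without special tables. The sketch below assumes SciPy; the particular values n = 6, θ1 = 1, and σ = 1 are chosen only for illustration:

```python
from math import sqrt
from scipy.stats import nct, t   # assumes SciPy is available

n, alpha = 6, 0.05
c = t.ppf(1 - alpha / 2, df=n - 1)       # two-sided critical value, 2.571
delta1 = sqrt(n) * 1.0 / 1.0             # noncentrality sqrt(n)*theta1/sigma (assumed values)
# Power = Pr(|T| >= c) when T has a noncentral t distribution.
power = nct.sf(c, df=n - 1, nc=delta1) + nct.cdf(-c, df=n - 1, nc=delta1)
print(round(power, 3))
```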
In Example 2, in testing the equality of the means of two inde-
pendent normal distributions, it was assumed that the unknown
variances of the distributions were equal. Let us now consider the
problem of testing the equality of these two unknown variances.
Example 3. We are given the stochastically independent random sam-
ples Xl>"" X n
and Y 1
, •• " Ym from the independent distributions, which
are n(8l> 83
) and n(82, 84) , respectively. We have
Q = {(8l> 82, 83, 84) ; -00 < 8l> 82 < 00,0 < 83,84 < co].
The hypothesis H o: 83 = 84 , unspecified, with 81 and 82 also unspecified, is
to be tested against all alternatives. Then
w = {(81 , 82 , 83 , 84 ) ; -00 < 8l> 82 < 00,0 < 83 = 84 < co}.
It is easy to show (see Exercise 7.34) that the statistic defined by A =
L(w)/L(Q) is a function of the statistic
n
L: (Xt - X)2/(n - 1)
F = -'~~-------
L: (Y, - Y)2/(m - 1)
1
If 83
= 84
, this statistic F has an F distribution with n - 1 and m - 1
degrees of freedom. The hypothesis that (8l , 82 , 83 , 84) E w is rejected if the
computed F ~ Cl or if the computed F ~ C2• The constants Cl and C2 are
usually selected so that, if 83 = 84 ,
where (;(1 is the desired significance level of this test.
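One common way to select c1 and c2 is to give each tail probability α1/2 under H0; a sketch assuming SciPy is available, with n = 11 and m = 9 chosen only for illustration:

```python
from scipy.stats import f   # assumes SciPy is available

n, m, alpha1 = 11, 9, 0.05   # sample sizes chosen only for illustration
c1 = f.ppf(alpha1 / 2, dfn=n - 1, dfd=m - 1)
c2 = f.ppf(1 - alpha1 / 2, dfn=n - 1, dfd=m - 1)
# Reject H0: theta3 = theta4 if the computed F <= c1 or F >= c2.
print(round(c1, 3), round(c2, 3))
```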
EXERCISES
7.27. In Example 1 let n = 10, and let the experimental values of the
10 d . d
random variables yield x = 0.6 and L: (xt - X)2 = 3.6. If the test enve
1
in that example is used, do we accept or reject Ho: 81 = 0 at the 5 per cent
significance level?
B
7.28. In Example 2 let n = m = 8, x = 75.2, Y = 78.6, L: (Xt - X)2 =
1
71.2, I (Yt - y)2 = 54.8. If we use the test derived in that example, do we
1
accept or reject H o: 81
= 82 at the 5 per cent significance level?
7.29. Show that the likelihood ratio principle leads to the same test, when
testing a simple hypothesis H o against an alternative simple hypothesis H 1 •
as that given by the Neyman-Pearson theorem. Note that there are only
two points in Q.
7.30. Verify Equations (1) of Example 2 of this section.
7.31. Verify Equations (2) of Example 2 of this section.
7.32. Let Xl' X 2 , ., ., X; be a random sample from the normal distribu-
tion n(8, 1). Show that the likelihood ratio principle for testing H o: 8 = 8',
where 8' is specified, against H 1 : 8 f= 8' leads to the inequality Ix - 8'1 ~ c.
Is this a uniformly most powerful test of Ho against H1?
7.33. Let Xl' X 2 , ••• , X; be a random sample from the normal distribu-
tion n(8l , 82 ) , Show that the likelihood ratio principle for testing H o: 82 = 8;
specified, and 81 unspecified, against HI: 82 f= 8;, 81 unspecified, leads to a
n n
test that rejects when L: (xt - X)2 ~ Cl or L: (Xt - X)2 ~ C2, where Cl < C2
1 1
are selected appropriately.
7.34. Let Xl"'" X n and Yl , ... , Ym be random samples from the
independent distributions n(8l , 83) and n(82, 84) , respectively.
(a) Show that the likelihood ratio for testing H o: 81 = 82,83 = 84
against all alternatives is given by
{[~ (Xt - U)2 + ~ (Yt _ U)2]/(m + n)yn+m)/2'
where u = (nx + my)/(n + m).
(b) Show that the likelihood ratio test for testing Ho: 83 = 84 , 81 and 82
unspecified, against HI: 83 f= 84 , 81 and 82 unspecified, can be based on the
random variable
n
L: (Xt - X)2/(n - 1)
F = "':;~'---------
L: (Yt - Y)2/(m - 1)
1
7.35. Let n independent trials of an experiment be such that Xl' X 2, ... ,
X k are the respective numbers of times that the experiment ends in the
mutually exclusive and exhaustive events Al> A 2 , · · · , A k • If Pi = P(A i) is
constant throughout the n trials, then the probability of that particular
sequence of trials is L = Pf'P~2 Pfk.
(a) Recalling that PI + P2 + + Pk = 1, show that the likelihood
ratio for testing H o: P, = Pto > 0, i = 1,2, ... , k, against all alternatives
is given by
(b) Show that
-co < Xi < co.
_ 2In >. = i XI(XI - ,nfOI)2
/=1 (nPI)
where P; is between POI and xtln. Hint. Expand InPIO in a Taylor's serieswith
the remainder in the term involving (PIO - xtln)2.
(c) For large n, argue that xtl(np;)2 is approximated by l/(nPlo) and hence
_ 2In >. ~ i (x/ - npOI)
2
, when H is true.
1=1 npOI
Chapter 8
Other Statistical Tests
8.1 Chi-Square Tests
In this section we introduce tests of statistical hypotheses called
chi-square tests. A test of this sort was originally proposed by Karl
Pearson in 1900, and it provided one of the earlier methods of statistical
inference.
Let the random variable Xi be n(fLi' aT), i = 1, 2, ... ,n, and let
X:1> X 2 , ••• , X n be mutually stochastically independent. Thus the joint
p.d.f. of these variables is
    \frac{1}{σ_1 σ_2 \cdots σ_n (2\pi)^{n/2}} \exp\left[−\frac{1}{2}\sum_1^n \frac{(x_i − μ_i)^2}{σ_i^2}\right],   −∞ < x_i < ∞.

The random variable that is defined by the exponent (apart from the
coefficient −1/2) is \sum_1^n (X_i − μ_i)^2/σ_i^2, and this random variable is χ^2(n).
In Chapter 12 we shall generalize this joint normal distribution of
probability to n random variables that are stochastically dependent
and we shall call the distribution a multivariate normal distribution.
It will then be shown that a certain exponent in the joint p.d.f. (apart
from a coefficient of -!) defines a random variable that is X2
(n). This
fact is the mathematical basis of the chi-square tests.
Let us now discuss some random variables that have approximate
chi-square distributions. Let Xl be b(n, PI)' Since the random variable
Y = (Xl - npI)/VnPI(l - PI) has, as n -+ co, a limiting distribution
that is n(O, 1), we would strongly suspect that the limiting distribution
of Z = y2 is X2
(1). This is, in fact, the case, as will now be shown. If
Gn(y) represents the distribution function of Y, we know that
n - (Xl + ... + X k- l) and let Pk = 1 - (PI + ... + Pk-l)' Define
Qk-l by
Accordingly, since N(y) is everywhere continuous,
    Q_{k−1} = \sum_{i=1}^{k} \frac{(X_i − np_{i0})^2}{np_{i0}}
has an approximate chi-square distribution with k - 1 degrees of free-
dom. Since, when H 0 is true, nptois the expected value of Xi' one would
feel intuitively that experimental values of Qk-1 should not be too large
if H o is true. With this in mind, we may use Table II of the Appendix,
with k - 1 degrees of freedom, and find c so that Pr (Qk-l ~ c) = IX,
where IX is the desired significance level of the test. If, then, the hy-
pothesis H o is rejected when the observed value of Qk-l is at least as
It is proved in a more advanced course that, as n ~ 00, Qk-l has a
limiting distribution that is X2(k
- 1). If we accept this fact, we can
say that Qk-l has an approximate chi-square distribution with k - 1
degrees of freedom when n is a positive integer. Some writers caution
the user of this approximation to be certain that n is large enough
that each npi' i = 1, 2, ... , k, is at least equal to 5. In any case it is
important to realize that Qk-1 does not have a chi-square distribution,
only an approximate chi-square distribution.
The random variable Qk-l may serve as the basis of the tests of
certain statistical hypotheses which we now discuss. Let the sample
space d of a random experiment be the union of a finite number k of
mutually disjoint sets Al> A 2,... , A k. Furthermore, let P(Ai) = Pi,
i = 1,2, ... , k, where Pk = 1 - PI -'" - Pk-l> so that Pi is the
probability that the outcome of the random experiment is an element
of the set Ai' The random experiment is to be repeated n independent
times and Xi will represent the number of times the outcome is an
element of the set Ai' That is, Xl> X 2, ... , X k = n - Xl - ... -
X k - l are the frequencies with which the outcome is, respectively, an
element of Al> A 2, ... , A k. Then the joint p.d.f. of Xl> X 2,... , X k- l is
the multinomial p.d.f. with the parameters n, PI' ... , Pk-l' Consider
the simple hypothesis (concerning this multinomial p.d.f.) H o:PI = PlO'
P2 = P20"'" Pk-l = Pk-l.O (Pk = PkO = 1 - PlO - ... - Pk-l,O),
where PlO' ... , Pk-l,O are specified numbers. It is desired to test H o
against all alternatives.
If the hypothesis H 0 is true, the random variable
-00 < y < 00,
lim Gn(y) = N(y),
n-+ 00
If we change the variable of integration in this last integral by writing
w2
= v, then
lim Hn(z) = N(VZ) - N( - VZ)
n-+oo
Hn(z) = Pr (Z :::; z) = Pr (- VZ :::; Y :::; Vz)
= Gn(vz) - Gn[( - VZ) - ].
lim H (z) - (Z 1 1/2-1 -v/2 d
n-+oo n - Jo r(t)2l/2 v e v,
provided that z ~ 0. If z < 0, then lim Hn(z) = 0. Thus lim Hn(z) is
n-+ co n....00
equal to the distribution function of a random variable that is X2(1).
This is the desired result.
Let us now return to the random variable Xl which is b(n, PI)' Let
X 2 = n - Xl and let P2 = 1 - h. If we denote y2 by Ql instead of
Z, we see that Ql may be written as
where N(y) is the distribution function of a distribution that is n(O, 1).
Let Hn(z) represent, for each positive integer n, the distribution
function of Z = y2. Thus, if z ~ 0,
    \frac{(X_1 − np_1)^2}{np_1} + \frac{(X_1 − np_1)^2}{n(1 − p_1)}
        = \frac{(X_1 − np_1)^2}{np_1} + \frac{(X_2 − np_2)^2}{np_2}
because (Xl - nPl)2 = (n - X 2 - n + np2)2 = (X2 - np2)2, Since Ql
has a limiting chi-square distribution with 1 degree of freedom, we say,
when n is a positive integer, that Ql has an approximate chi-square
distribution with 1 degree of freedom. This result can be generalized as
follows.
Let Xl> X 2,... , X k- l have a multinomial distribution with param-
eters n,pl>" .,Pk-l> as in Section 3.1. As a convenience, let X k =
    \frac{(6 − 5)^2}{5} + \frac{(18 − 15)^2}{15} + \frac{(20 − 25)^2}{25} + \frac{(36 − 35)^2}{35} = \frac{64}{35} = 1.83,
z = 1,2, ... , k,
is a function of the unknown parameters ft and a2
. Suppose that we take
a random sample Yl> .• " Yn of size n from this distribution. If we let
Xi denote the frequency of Ai' i = 1,2, ... , k, so that Xl + ... + X k
n, the random variable
approximately. From Table II, with 4 - 1 = 3 degrees of freedom, the value
corresponding to a 0.025 significance level is c = 9.35. Since the observed
value of Q3 is less than 9.35, the hypothesis is accepted at the (approximate)
0.025 level of significance.
Thus far we have used the chi-square test when the hypothesis H o
is a simple hypothesis. More often we encounter hypotheses H oin which
the multinomial probabilities PI> P2' .. " Pk are not completely specified
by the hypothesis H o. That is, under H o' these probabilities are
functions of unknown parameters. For illustration, suppose that a
certain random variable Y can take on any real value. Let us partition
the space {y; -00 < Y < co] into k mutually disjoint sets AI> A 2 , ••• , A k
so that the events AI' A2 , "', Ak are mutually exclusive and exhaustive.
Let H °be the hypothesis that Y is n(ft, a2
) with ft and a2 unspecified.
Then each
cannot be computed once Xl> ... , X k have been observed, since each
Pi' and hence Qk-I' is a function of the unknown parameters ft and a2.
There is a way out of our trouble, however. We have noted that
Qk-I is a function of ft and a2
• Accordingly, choose the values of ft and
a2
that minimize Qk-I' Obviously, these values depend upon the ob-
served Xl = Xl' .. " X k = X k and are called minimum chi-square
estimates of ft and a2 . These point estimates of ft and a2 enable us to
compute numerically the estimates of each Pi' Accordingly, if these
values are used, Qk-1 can be computed once Yl' Y2' ... , Yn' and hence
Xl' X 2 , ••• , X k , are observed. However, a very important aspect of
the fact, which we accept without proof, is that now Qk-I is approxi-
mately X2
(k - 3). That is, the number of degrees of freedom of the
limiting chi-square distribution of Qk-1 is reduced by one for each
parameter estimated by the experimental data. This statement applies
not only to the problem at hand but also to more general situations.
Two examples will now be given. The first of these examples will deal
Since 15.6 > 11.1, the hypothesis P(Ai) = 1/6, i = 1, 2, ..., 6, is rejected at
the (approximate) 5 per cent significance level.
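The computation of Example 1 can be reproduced as follows (a sketch, assuming SciPy for the chi-square quantile in place of Table II):

```python
from scipy.stats import chi2   # assumes SciPy is available

observed = [13, 19, 11, 8, 5, 4]   # frequencies of A1, ..., A6
expected = [10] * 6                # n * p_i0 = 60 * (1/6)
Q5 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # 15.6
critical = chi2.ppf(0.95, df=5)                                  # about 11.07
print(Q5, round(critical, 2), Q5 >= critical)   # True: reject H0 at the 5 per cent level
```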
Example 2. A point is to be selected from the unit interval {x; 0 < x < I}
by a random process. Let Al = {x; 0 < x ~ t}, A 2 = {x; t < x ~ t},
A 3 = {x; t < x ~ i}, and A 4 = {x; i < x < I}. Let the probabilities Pi'
i = 1, 2, 3, 4, assigned to these sets under the hypothesis be determined by
the p.d.f. 2x, 0 < x < 1, zero elsewhere. Then these probabilities are,
respectively,
    p_{10} = \int_0^{1/4} 2x\,dx = \frac{1}{16},   p_{20} = \int_{1/4}^{1/2} 2x\,dx = \frac{3}{16},
    p_{30} = \int_{1/2}^{3/4} 2x\,dx = \frac{5}{16},   p_{40} = \int_{3/4}^{1} 2x\,dx = \frac{7}{16}.
    \frac{(13 − 10)^2}{10} + \frac{(19 − 10)^2}{10} + \frac{(11 − 10)^2}{10}
        + \frac{(8 − 10)^2}{10} + \frac{(5 − 10)^2}{10} + \frac{(4 − 10)^2}{10} = 15.6.
Thus the hypothesis to be tested is that Pi'P2'P3' and P4 = 1 - Pi - P2 - P3
have the preceding values in a multinomial distribution with k = 4. This
hypothesis is to be tested at an approximate 0.025 significance level by
repeating the random experiment n = 80 independent times under the same
conditions. Here the npiO' i = 1, 2, 3, 4, are, respectively,S, 15, 25, and 35.
Suppose the observed frequencies of AI> A 2 , A 3 , and A 4 to be 6, 18, 20, and
4
36, respectively. Then the observed value of Q3 = L (Xi - npiO)2/(nPiO) is
1
great as c, the test of H o will have a significance level that is approxi-
mately equal to a.
Some illustrative examples follow.
Example 1. One of the first six positive integers is to be chosen by a
random experiment (perhaps by the cast of a die). Let Ai = {x; X = i},
i = 1,2, ... , 6. The hypothesis Ho: P(A i ) = Pio = i, i = 1,2, ... , 6, will
be tested, at the approximate 5 per cent significance level, against all
alternatives. To make the test, the random experiment will be repeated, under
the same conditions, 60 independent times. In this example k = 6 and
npiO = 60(~) = 10, i = 1, 2, ... , 6. Let Xi denote the frequency with which
the random experiment terminates with the outcome in Ai' i = 1, 2, ... , 6,
a
and let Q5 = L (Xi - 10)2/10. If Ho is true, Table II, with k - 1 = 6 - 1 =
I
5 degrees of freedom, shows that we have Pr (Q5 ~ 11.1) = 0.05. Now
suppose that the experimental frequencies of AI' A 2 , ••• , A aare, respectively,
13, 19, 11, 8, 5, and 4. The observed value of Q5 is
with the test of the hypothesis that two multinomial distributions are
the same.
Remark. In many instances, such as that involving the mean fL and the
variance a2 of a normal distribution, minimum chi-square estimates are
difficult to compute. Hence other estimates, such as the maximum likelihood
»<:
estimates p- = Y and a2 = 52, are used to evaluate Pi and Qk-1' In general,
Qk-1 is not minimized by maximum likelihood estimates, and thus its
computed value is somewhat greater than it would be if minimum chi-square
estimates were used. Hence, when comparing it to a critical value listed in
the chi-square table with k - 3 degrees of freedom, there is a greater chance
of rejecting than there would be if the actual minimum of Qk-1 is used.
Accordingly, the approximate significance level of such a test will be some-
what higher than that value found in the table. This modification should be
kept in mind and, if at all possible, each h should be estimated using the
frequencies Xl"'" X k rather than using directly the items Y v Y 2, · · ·, Yn
of the random sample.
Example 3. Let us consider two independent multinomial distributions
with parameters nj, P1j, P2j, .. " Pkj, j = 1, 2, respectively. Let Xij' i =
1, 2, .. " k, j = 1, 2, represent the corresponding frequencies. If n1 and n2
are large, the random variable
value of this random variable is at least as great as an appropriate number
from Table II, with k - 1 degrees of freedom.
The second example deals with the subject of contingency tables.
Example 4. Let the result of a random experiment be classified by two
attributes (such as the color of the hair and the color of the eyes). That is,
one attribute of the outcome is one and only one of certain mutually exclusive
and exhaustive events, say A v A 2,.. " A a; and the other attribute of the
outcome is also one and only one of certain mutually exclusive and exhaustive
events, say B1, B 2, ... , B b. Let Pij = P(A i n B j), i = 1,2, ... , a; j =
1,2, ... , b. The random experiment is to be repeated n independent times
and Xii will denote the frequency of the event Ai n B j. Since there are
k = ab such events as Ai n B j , the random variable
has an approximate chi-square distribution with ab - 1 degrees of freedom,
provided that n is large. Suppose that we wish to test the independence of the
A attribute and the B attribute; that is, we wish to test the hypothesis
Ho: P(Ai n B j) = P(Ai)P(Bj), i = 1,2, ... , a; j = 1,2, ... , b. Let us
denote P(Ai) by A and P(Bj) by P'j; thus
b
A = L: Pif'
f=l
a
P·f = L: Pii'
i=l
is the sum of two stochastically independent random variables, each of
which we treat as though it were X2
(k - 1); that is, the random variable is
approximately X2(2k
- 2). Consider the hypothesis
and
b a b a
1 = L: L: Plf = L: P·f = L: A·
f=l i=l f=l i=l
Then the hypothesis can be formulated as H o:hj = Pi,P'j, i = 1,2, ... , a;
j = 1,2, ... , b. To test H o, we can use Qab-1 with Pii replaced by AP·f·
But if A, i = 1, 2, ... , a, and P'j, j = 1, 2, ... , b, are unknown, as they
frequently are in the applications, we cannot compute Qab-1 once the fre-
quencies are observed. In such a case we estimate these unknown parameters
by
where each Pi! = Pi2' i = 1, 2, .. " k, is unspecified. Thus we need point
estimates of these parameters. The maximum likelihood estimator of Pl1 =
P'2' based upon the frequencies Xlj' is (Xi! + X i2)/(n1 + n2), i = 1,2, ... , k.
Note that we need only k - 1 point estimates, because we have a point
estimate of Pk1 = Pk2 once we have point estimates of the first k - 1 prob-
abilities. In accordance with the fact that has been stated, the random
variable
A Xi'
Pl' =-,
n
b
where Xi' = 2: x.;
j=l
2 = 1,2, ... , a,
Since L:p,. = L: P.f = 1, we have estimated only a - 1 + b - 1 = a +
j
b - 2 parameters. So if these estimates are used in Qab-1, with Pii = Pi·P·j,
has an approximate X
2 distribution with 2k - 2 - (k - 1) = k - 1 degrees
of freedom. Thus we are able to test the hypothesis that two multinomial
distributions are the same; this hypothesis is rejected when the computed
and
A x.f h
P'j=-' were
n
j = 1,2, ... , b.
8.3. A die was cast n = 120 independent times and the following data
resulted:
then, according to the rule that has been stated in this section, the random
variable
~ = 1,2, ... , 7, 8.
If we consider these data to be observations from two independent multi-
nomial distributions with k = 5, test, at the 5 per cent significance level, the
hypothesis that the two distributions are the same (and hence the two
teaching procedures are equally effective).
8.6. Let the result of a random experiment be classified as one of the
mutually exclusive and exhaustive ways At> A 2
, A 3
and also as one of the
mutually exclusive and exhaustive ways Bl
, B2
, B3
, B4
• Two hundred
independent trials of the experiment result in the following data:
If we use a chi-square test, for what values of b would the hypothesis that the
die is unbiased be rejected at the 0.025 significance level?
8.4. Consider the problem from genetics of crossing two types of peas.
The Mendelian theory states that the probabilities of the classifications
(a) ro~nd and yellow, (b) wrinkled and yellow, (c) round and green, and
~d) wnnkled and green are -l6' -{6, 1~' and -l6, respectively. If, from 160
mde~end~nt observations, the observed frequencies of these respective
classlficatlOns are 86, 35, 26, and 13, are these data consistent with the
Mendelian theory? That is, test, with IX = 0.01, the hypothesis that the
respective probabilities are n.-, 1~6, 1~, and h.
8.5. Two different teaching procedures were used on two different
groups of students. Each group contained 100 students of about the same
ability. At the end of the term, an evaluating team assigned a letter grade to
each student. The results were tabulated as follows.
6
4O-b
100
100
Total
5
20
F
"
16
4
20
6
13
24
17
28
D
15
21
27
3
20
21
27
19
32
29
c
Grade
2
20
10
11
6
B
25
18
b
I 15
" 9
Group A
Frequency
Spots up
PIO = II2~27T exp [ (x2(4~)2] dx,
This hypothesis (concerning the multinomial p.d.f. with k = 8)is to be tested
at the 5 per cent level of significance, by a chi-squar~ test. If the observec,
frequencies of the sets Ai, i = 1,2, ... , 8, are, respectively, 60, 96, 140, 210,
172, 160, 88, and 74, would Ho be accepted at the (approximate) 5 per cent
level of significance?
EXERCISES
8.1. A number is to be selected from the interval {x; °< x < 2} by a
random process. Let Ai = {x; (i - 1)/2 < x.-:::; i12}, i =: .1,.2, 3, and let
A 4 = {x; t < x < 2}. A certain hypothesis assigns probabilities Pia to these
sets in accordance with Pia = tl (1)(2 - x) dx, i = 1, 2, 3, 4. This hypothesis
(concerning the multinomial p.d.f. with k = 4) is to be tested, at the 5 ~er'
cent level of significance, by a chi-square test. If the observed frequencies
of the sets Ai> i = 1, 2, 3, 4, are, respectively, 30, 30, 10, 10, would H o be
accepted at the (approximate) 5 per cent level of significance?
8.2. Let the following sets be defined. Al = {x; -OCJ < x :::; O}, Ai =
{x; i - 2 < x :::; i-I}, i = 2, ... , 7, and As = {a:; 6 < x < co}. ~ certain
hypothesis assigns probabilities Pia to these sets Ai m accordance with
~ ~ [Xlj - n(Xi.ln)(X)n)J2
i~ 6-1 n(Xdn)(X.iln)
In each of the four examples of this section we have indicated that
the statistic used to test the hypothesis H o has an approximate chi-
square distribution, provided that n is sufficiently large and H 0 is true.
To compute the power of any of these tests for values of t~e ~arameters
not described by Ho' we need the distribution of the statistic wh~n Ho
is not true. In each of these cases, the statistic has an approximate
distribution called a noncentral chi-square distribution. The noncentral
chi-square distribution will be discussed in Section 8.4.
has an approximate chi-square distribution with ab - 1 - (a + b - 2) =:=
(a - l)(b - 1) degrees of freedom provided t~at H.o i~ true. The hypothesis
H is then rejected if the computed value of this statistic exceeds the constant
c,where cis selected from Table II so that the test has the desired significance
level IX.
Test, at the 0.05 significance level, the hypothesis of independence of the A
attribute and the B attribute, namely H_0: P(A_i \cap B_j) = P(A_i)P(B_j),
i = 1, 2, 3 and j = 1, 2, 3, 4, against the alternative of dependence.

8.7. A certain genetic model suggests that the probabilities of a particular
trinomial distribution are, respectively, p_1 = p^2, p_2 = 2p(1 - p), and p_3 =
(1 - p)^2, where 0 < p < 1. If X_1, X_2, X_3 represent the respective fre-
quencies in n independent trials, explain how we could check on the adequacy
of the genetic model.
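The chi-square computation described in this section is easy to carry out numerically. The following sketch is an editorial addition, not part of the original text; it assumes Python with the numpy and scipy libraries and reuses the frequencies of Exercise 8.6 as illustrative data. It forms the estimated expected frequencies n\hat{p}_{i.}\hat{p}_{.j} and compares the statistic with the chi-square critical value having (a - 1)(b - 1) degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

# Observed a x b table of frequencies (here the data of Exercise 8.6).
x = np.array([[10, 21, 15,  6],
              [11, 27, 21, 13],
              [ 6, 19, 27, 24]], dtype=float)
n = x.sum()
p_row = x.sum(axis=1) / n                 # estimates of p_i.
p_col = x.sum(axis=0) / n                 # estimates of p_.j
expected = n * np.outer(p_row, p_col)     # n * p_i. * p_.j

q = ((x - expected) ** 2 / expected).sum()
df = (x.shape[0] - 1) * (x.shape[1] - 1)
c = chi2.ppf(0.95, df)                    # critical value for alpha = 0.05

print(f"chi-square statistic = {q:.3f}, critical value = {c:.3f}")
print("reject H0 (independence)" if q >= c else "accept H0 (independence)")
```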
8.2 The Distributions of Certain Quadratic Forms
A homogeneous polynomial of degree 2 in n variables is called a
quadratic form in those variables. If both the variables and the co-
efficients are real, the form is called a real quadratic form. Only real
quadratic forms will be considered in this book. To illustrate, the form
X_1^2 + X_1 X_2 + X_2^2 is a quadratic form in the two variables X_1 and X_2;
the form X_1^2 + X_2^2 + X_3^2 - 2X_1 X_2 is a quadratic form in the three
variables X_1, X_2, and X_3; but the form (X_1 - 1)^2 + (X_2 - 2)^2 =
X_1^2 + X_2^2 - 2X_1 - 4X_2 + 5 is not a quadratic form in X_1 and X_2,
although it is a quadratic form in the variables X_1 - 1 and X_2 - 2.
Let \bar{X} and S^2 denote, respectively, the mean and the variance of a
random sample X_1, X_2, ..., X_n from an arbitrary distribution. Thus

    nS^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} \left(X_i - \frac{X_1 + X_2 + \cdots + X_n}{n}\right)^2
         = \frac{n-1}{n}(X_1^2 + X_2^2 + \cdots + X_n^2) - \frac{2}{n} \sum_{i<j} X_i X_j

is a quadratic form in the n variables X_1, X_2, ..., X_n. If the sample
arises from a distribution that is n(\mu, \sigma^2), we know that the random
variable nS^2/\sigma^2 is \chi^2(n - 1) regardless of the value of \mu. This fact proved
useful in our search for a confidence interval for \sigma^2 when \mu is unknown.
It has been seen that tests of certain statistical hypotheses require a
statistic that is a quadratic form. For instance, Example 2, Section 7.3,
made use of the statistic \sum_{1}^{n} X_i^2, which is a quadratic form in the vari-
ables X_1, X_2, ..., X_n. Later in this chapter, tests of other statistical
hypotheses will be investigated, and it will be seen that functions of
statistics that are quadratic forms will be needed to carry out the tests
in an expeditious manner. But first we shall make a study of the
distribution of certain quadratic forms in normal and stochastically
independent random variables.
The following theorem will be proved in Chapter 12.
Theorem 1. Let Q = Q_1 + Q_2 + \cdots + Q_{k-1} + Q_k, where Q, Q_1, ...,
Q_k are k + 1 random variables that are real quadratic forms in n mutually
stochastically independent random variables which are normally distrib-
uted with the means \mu_1, \mu_2, ..., \mu_n and the same variance \sigma^2. Let Q/\sigma^2,
Q_1/\sigma^2, ..., Q_{k-1}/\sigma^2 have chi-square distributions with degrees of freedom
r, r_1, ..., r_{k-1}, respectively. Let Q_k be nonnegative. Then:
(a) Q_1, ..., Q_k are mutually stochastically independent, and hence
(b) Q_k/\sigma^2 has a chi-square distribution with r - (r_1 + \cdots + r_{k-1}) =
r_k degrees of freedom.
Three examples illustrative of the theorem will follow. Each of these
examples will deal with a distribution problem that is based on the
remarks made in the subsequent paragraph.
Let the random variable X have a distribution that is n(\mu, \sigma^2). Let
a and b denote positive integers greater than 1 and let n = ab. Con-
sider a random sample of size n = ab from this normal distribution.
The items of the random sample will be denoted by the symbols

    X_{11}, X_{12}, ..., X_{1j}, ..., X_{1b}
    X_{21}, X_{22}, ..., X_{2j}, ..., X_{2b}
    ...................................
    X_{i1}, X_{i2}, ..., X_{ij}, ..., X_{ib}
    ...................................
    X_{a1}, X_{a2}, ..., X_{aj}, ..., X_{ab}

In this notation the first subscript indicates the row, and the second
subscript indicates the column in which the item appears. Thus X_{ij} is
in row i and column j, i = 1, 2, ..., a and j = 1, 2, ..., b. By assump-
tion these n = ab random variables are mutually stochastically inde-
pendent, and each has the same normal distribution with mean \mu and
variance \sigma^2. Thus, if we wish, we may consider each row as being a
random sample of size b from the given distribution; and we may con-
sider each column as being a random sample of size a from the given
distribution. We now define a + b + 1 statistics. They are

    \bar{X} = \frac{X_{11} + \cdots + X_{1b} + \cdots + X_{a1} + \cdots + X_{ab}}{ab} = \sum_{i=1}^{a} \sum_{j=1}^{b} X_{ij}/(ab),

    \bar{X}_{i.} = \frac{X_{i1} + X_{i2} + \cdots + X_{ib}}{b} = \sum_{j=1}^{b} X_{ij}/b,   i = 1, 2, ..., a,

and

    \bar{X}_{.j} = \frac{X_{1j} + X_{2j} + \cdots + X_{aj}}{a} = \sum_{i=1}^{a} X_{ij}/a,   j = 1, 2, ..., b.

In some texts the statistic \bar{X} is denoted by \bar{X}_{..}, but we use \bar{X} for
simplicity. In any case, \bar{X} is the mean of the random sample of size n = ab;
the statistics \bar{X}_{1.}, \bar{X}_{2.}, ..., \bar{X}_{a.} are, respectively, the means
of the rows; and the statistics \bar{X}_{.1}, \bar{X}_{.2}, ..., \bar{X}_{.b} are, respectively, the
means of the columns. The examples illustrative of the theorem follow.

Example 1. Consider the variance S^2 of the random sample of size
n = ab. We have the algebraic identity

    abS^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X})^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} [(X_{ij} - \bar{X}_{i.}) + (\bar{X}_{i.} - \bar{X})]^2
          = \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2 + \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{i.} - \bar{X})^2
            + 2 \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})(\bar{X}_{i.} - \bar{X}).

The last term of the right-hand member of this identity may be written

    2 \sum_{i=1}^{a} \left[(\bar{X}_{i.} - \bar{X}) \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})\right] = 2 \sum_{i=1}^{a} [(\bar{X}_{i.} - \bar{X})(b\bar{X}_{i.} - b\bar{X}_{i.})] = 0,

and the term

    \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{i.} - \bar{X})^2

may be written

    b \sum_{i=1}^{a} (\bar{X}_{i.} - \bar{X})^2.

Thus

    abS^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2 + b \sum_{i=1}^{a} (\bar{X}_{i.} - \bar{X})^2,

or, for brevity,

    Q = Q_1 + Q_2.

Clearly, Q, Q_1, and Q_2 are quadratic forms in the n = ab variables X_{ij}. We
shall use the theorem with k = 2 to show that Q_1 and Q_2 are stochastically
independent. Since S^2 is the variance of a random sample of size n = ab
from the given normal distribution, then abS^2/\sigma^2 has a chi-square distribution
with ab - 1 degrees of freedom. Now

    Q_1/\sigma^2 = \sum_{i=1}^{a} \left[\sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2/\sigma^2\right].

For each fixed value of i, \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2/b is the variance of a random
sample of size b from the given normal distribution, and, accordingly,
\sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2/\sigma^2 has a chi-square distribution with b - 1 degrees of
freedom. Because the X_{ij} are mutually stochastically independent, Q_1/\sigma^2 is
the sum of a mutually stochastically independent random variables, each
having a chi-square distribution with b - 1 degrees of freedom. Hence
Q_1/\sigma^2 has a chi-square distribution with a(b - 1) degrees of freedom. Now
Q_2 = b \sum_{i=1}^{a} (\bar{X}_{i.} - \bar{X})^2 \ge 0. In accordance with the theorem, Q_1 and Q_2 are
stochastically independent, and Q_2/\sigma^2 has a chi-square distribution with
ab - 1 - a(b - 1) = a - 1 degrees of freedom.

Example 2. In abS^2 replace X_{ij} - \bar{X} by (X_{ij} - \bar{X}_{.j}) + (\bar{X}_{.j} - \bar{X}) to
obtain

    abS^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{.j})^2 + a \sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X})^2,

or, for brevity,

    Q = Q_3 + Q_4.

It is easy to show (Exercise 8.8) that Q_3/\sigma^2 has a chi-square distribution
with b(a - 1) degrees of freedom. Since Q_4 = a \sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X})^2 \ge 0, the
theorem enables us to assert that Q_3 and Q_4 are stochastically independent
and that Q_4/\sigma^2 has a chi-square distribution with ab - 1 - b(a - 1) =
b - 1 degrees of freedom.
Example 3. In abS^2 replace X_{ij} - \bar{X} by (\bar{X}_{i.} - \bar{X}) + (\bar{X}_{.j} - \bar{X}) +
(X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}) to obtain (Exercise 8.9)

    abS^2 = b \sum_{i=1}^{a} (\bar{X}_{i.} - \bar{X})^2 + a \sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X})^2 + \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X})^2,

or, for brevity,

    Q = Q_2 + Q_4 + Q_5,
where Q_2 and Q_4 are as defined in Examples 1 and 2. From Examples 1 and 2,
Q/\sigma^2, Q_2/\sigma^2, and Q_4/\sigma^2 have chi-square distributions with ab - 1, a - 1,
and b - 1 degrees of freedom, respectively. Since Q_5 \ge 0, the theorem
asserts that Q_2, Q_4, and Q_5 are mutually stochastically independent and that
Q_5/\sigma^2 has a chi-square distribution with ab - 1 - (a - 1) - (b - 1) =
(a - 1)(b - 1) degrees of freedom.
Once these quadratic form statistics have been shown to be stochastically
independent, a multiplicity of F statistics can be defined. For instance,
    \frac{Q_4/[\sigma^2(b - 1)]}{Q_3/[\sigma^2 b(a - 1)]} = \frac{Q_4/(b - 1)}{Q_3/[b(a - 1)]}

has an F distribution with b - 1 and b(a - 1) degrees of freedom; and

    \frac{Q_4/[\sigma^2(b - 1)]}{Q_5/[\sigma^2(a - 1)(b - 1)]} = \frac{Q_4/(b - 1)}{Q_5/[(a - 1)(b - 1)]}

has an F distribution with b - 1 and (a - 1)(b - 1) degrees of freedom. In
the subsequent sections it will be seen that some likelihood ratio tests of
certain statistical hypotheses can be based on these F statistics.
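The decomposition and independence properties just described can be checked empirically. The following simulation is an editorial sketch, not part of the original text; it assumes Python with numpy, and the values of a, b, \mu, and \sigma^2 are arbitrary choices. It verifies numerically that Q = Q_1 + Q_2, that Q_1/\sigma^2 and Q_2/\sigma^2 have the expected mean values a(b - 1) and a - 1, and that Q_1 and Q_2 are uncorrelated, as stochastic independence requires.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, mu, sigma2 = 4, 5, 10.0, 4.0
reps = 20000

q1_vals, q2_vals = [], []
for _ in range(reps):
    x = rng.normal(mu, np.sqrt(sigma2), size=(a, b))
    xbar = x.mean()
    row_means = x.mean(axis=1, keepdims=True)
    q1 = ((x - row_means) ** 2).sum()         # within-row sum of squares
    q2 = b * ((row_means - xbar) ** 2).sum()  # among-row sum of squares
    assert np.isclose(q1 + q2, ((x - xbar) ** 2).sum())  # Q = Q1 + Q2
    q1_vals.append(q1)
    q2_vals.append(q2)

q1_vals, q2_vals = np.array(q1_vals), np.array(q2_vals)
print(np.mean(q1_vals) / sigma2, a * (b - 1))   # about a(b - 1)
print(np.mean(q2_vals) / sigma2, a - 1)         # about a - 1
print(np.corrcoef(q1_vals, q2_vals)[0, 1])      # near zero
```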
EXERCISES
8.8. In Example 2 verify that Q = Q_3 + Q_4 and that Q_3/\sigma^2 has a chi-
square distribution with b(a - 1) degrees of freedom.

8.9. In Example 3 verify that Q = Q_2 + Q_4 + Q_5.

8.10. Let X_1, X_2, ..., X_n be a random sample from a normal distribution
n(\mu, \sigma^2). Show that

    \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=2}^{n} (X_i - \bar{X}')^2 + \frac{n - 1}{n}(X_1 - \bar{X}')^2,

where \bar{X} = \sum_{i=1}^{n} X_i/n and \bar{X}' = \sum_{i=2}^{n} X_i/(n - 1). Hint. Replace X_i - \bar{X} by
(X_i - \bar{X}') - (X_1 - \bar{X}')/n. Show that \sum_{i=2}^{n} (X_i - \bar{X}')^2/\sigma^2 has a chi-square
distribution with n - 2 degrees of freedom. Prove that the two terms in the
right-hand member are stochastically independent. What then is the distri-
bution of

    [(n - 1)/n](X_1 - \bar{X}')^2/\sigma^2 ?
8.11. Let X_{ijk}, i = 1, ..., a; j = 1, ..., b; k = 1, ..., c, be a random
sample of size n = abc from a normal distribution n(\mu, \sigma^2). Let \bar{X} =
\sum_{k=1}^{c} \sum_{j=1}^{b} \sum_{i=1}^{a} X_{ijk}/n and \bar{X}_{i..} = \sum_{k=1}^{c} \sum_{j=1}^{b} X_{ijk}/(bc). Show that

    \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{c} (X_{ijk} - \bar{X})^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{c} (X_{ijk} - \bar{X}_{i..})^2 + bc \sum_{i=1}^{a} (\bar{X}_{i..} - \bar{X})^2.

Show that \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{c} (X_{ijk} - \bar{X}_{i..})^2/\sigma^2 has a chi-square distribution with
a(bc - 1) degrees of freedom. Prove that the two terms in the right-hand
member are stochastically independent. What, then, is the distribution of
bc \sum_{i=1}^{a} (\bar{X}_{i..} - \bar{X})^2/\sigma^2? Furthermore, let \bar{X}_{.j.} = \sum_{k=1}^{c} \sum_{i=1}^{a} X_{ijk}/(ac) and \bar{X}_{ij.} =
\sum_{k=1}^{c} X_{ijk}/c. Show that

    \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{c} (X_{ijk} - \bar{X})^2
        = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{c} (X_{ijk} - \bar{X}_{ij.})^2 + bc \sum_{i=1}^{a} (\bar{X}_{i..} - \bar{X})^2 + ac \sum_{j=1}^{b} (\bar{X}_{.j.} - \bar{X})^2
          + c \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{ij.} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{X})^2.

Show that the four terms in the right-hand member, when divided by \sigma^2,
are mutually stochastically independent chi-square variables with ab(c - 1),
a - 1, b - 1, and (a - 1)(b - 1) degrees of freedom, respectively.
8.12. Let X_1, X_2, X_3, X_4 be a random sample of size n = 4 from the
normal distribution n(0, 1). Show that \sum_{i=1}^{4} (X_i - \bar{X})^2 equals

    \frac{(X_1 - X_2)^2}{2} + \frac{[X_3 - (X_1 + X_2)/2]^2}{3/2} + \frac{[X_4 - (X_1 + X_2 + X_3)/3]^2}{4/3}

and argue that these three terms are mutually stochastically independent,
each with a chi-square distribution with 1 degree of freedom.
8.3 A Test of the Equality of Several Means
Consider b mutually stochastically independent random variables
that have normal distributions with unknown means \mu_1, \mu_2, ..., \mu_b,
respectively, and unknown but common variance \sigma^2. Let X_{1j}, X_{2j}, ...,
X_{aj} represent a random sample of size a from the normal distribution
with mean \mu_j and variance \sigma^2, j = 1, 2, ..., b. It is desired to test the
composite hypothesis H_0: \mu_1 = \mu_2 = \cdots = \mu_b = \mu, \mu unspecified,
against all possible alternative hypotheses H_1. A likelihood ratio test
will be used. Here the total parameter space is
    \Omega = \{(\mu_1, \mu_2, ..., \mu_b, \sigma^2); -\infty < \mu_j < \infty, j = 1, 2, ..., b, 0 < \sigma^2 < \infty\}

and

    \omega = \{(\mu_1, \mu_2, ..., \mu_b, \sigma^2); -\infty < \mu_1 = \mu_2 = \cdots = \mu_b = \mu < \infty, 0 < \sigma^2 < \infty\}.

The likelihood functions, denoted by L(\omega) and L(\Omega), are, respectively,

    L(\omega) = \left(\frac{1}{2\pi\sigma^2}\right)^{ab/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \mu)^2\right]

and

    L(\Omega) = \left(\frac{1}{2\pi\sigma^2}\right)^{ab/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \mu_j)^2\right].

Now

    \partial \ln L(\omega)/\partial\mu = \frac{1}{\sigma^2} \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \mu)

and

    \partial \ln L(\omega)/\partial(\sigma^2) = -\frac{ab}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \mu)^2.

If we equate these partial derivatives to zero, the solutions for \mu and \sigma^2
are, respectively, in \omega,

(1)    \sum_{j=1}^{b} \sum_{i=1}^{a} x_{ij}/(ab) = \bar{x},    \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x})^2/(ab) = v,

and these numbers maximize L(\omega). Furthermore,

    \partial \ln L(\Omega)/\partial\mu_j = \frac{1}{\sigma^2} \sum_{i=1}^{a} (x_{ij} - \mu_j),   j = 1, 2, ..., b,

and

    \partial \ln L(\Omega)/\partial(\sigma^2) = -\frac{ab}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \mu_j)^2.

If we equate these partial derivatives to zero, the solutions for \mu_1,
\mu_2, ..., \mu_b, and \sigma^2 are, respectively, in \Omega,

(2)    \sum_{i=1}^{a} x_{ij}/a = \bar{x}_{.j},  j = 1, 2, ..., b,    \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2/(ab) = w,

and these numbers maximize L(\Omega). These maxima are, respectively,

    L(\hat{\omega}) = \left[\frac{ab}{2\pi \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x})^2}\right]^{ab/2} \exp\left[-\frac{ab \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x})^2}{2 \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x})^2}\right]

and

    L(\hat{\Omega}) = \left[\frac{ab}{2\pi \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2}\right]^{ab/2} \exp\left[-\frac{ab \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2}{2 \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2}\right].

Finally,

    \lambda = \frac{L(\hat{\omega})}{L(\hat{\Omega})} = \left[\frac{\sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2}{\sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x})^2}\right]^{ab/2}.

In the notation of Section 8.2, the statistics defined by the functions
\bar{x} and v given by Equations (1) of this section are

    \bar{X} = \sum_{j=1}^{b} \sum_{i=1}^{a} X_{ij}/(ab)   and   S^2 = \sum_{j=1}^{b} \sum_{i=1}^{a} (X_{ij} - \bar{X})^2/(ab) = Q/(ab);
while the statistics defined by the functions \bar{x}_{.1}, \bar{x}_{.2}, ..., \bar{x}_{.b} and w
given by Equations (2) in this section are, respectively, \bar{X}_{.j} = \sum_{i=1}^{a} X_{ij}/a,
j = 1, 2, ..., b, and Q_3/(ab) = \sum_{j=1}^{b} \sum_{i=1}^{a} (X_{ij} - \bar{X}_{.j})^2/(ab). Thus, in the
notation of Section 8.2, \lambda^{2/(ab)} defines the statistic Q_3/Q.
We reject the hypothesis H_0 if \lambda \le \lambda_0. To find \lambda_0 so that we have
a desired significance level \alpha, we must assume that the hypothesis H_0
is true. If the hypothesis H_0 is true, the random variables X_{ij} con-
stitute a random sample of size n = ab from a distribution that is
normal with mean \mu and variance \sigma^2. This being the case, it was shown
in Example 2, Section 8.2, that Q = Q_3 + Q_4, where Q_4 = a \sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X})^2;
that Q_3 and Q_4 are stochastically independent, and that Q_3/\sigma^2 and
Q_4/\sigma^2 have chi-square distributions with b(a - 1) and b - 1 degrees
of freedom, respectively. Thus the statistic defined by \lambda^{2/(ab)} may be
written

    \frac{Q_3}{Q_3 + Q_4} = \frac{1}{1 + Q_4/Q_3}.

The significance level of the test of H_0 is

    \alpha = Pr\left[\frac{1}{1 + Q_4/Q_3} \le \lambda_0^{2/(ab)}; H_0\right]
          = Pr\left[\frac{Q_4/(b - 1)}{Q_3/[b(a - 1)]} \ge c; H_0\right],

where

    c = \frac{b(a - 1)}{b - 1}\left(\lambda_0^{-2/(ab)} - 1\right).

But

    F = \frac{Q_4/[\sigma^2(b - 1)]}{Q_3/[\sigma^2 b(a - 1)]} = \frac{Q_4/(b - 1)}{Q_3/[b(a - 1)]}

has an F distribution with b - 1 and b(a - 1) degrees of freedom.
Hence the test of the composite hypothesis H_0: \mu_1 = \mu_2 = \cdots = \mu_b = \mu,
\mu unspecified, against all possible alternatives may be based on an F
statistic. The constant c is so selected as to yield the desired value of \alpha.
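A small computational sketch of this F test follows. It is an editorial addition, not part of the original text; Python with numpy and scipy is assumed, and the data matrix is hypothetical. The columns play the role of the b samples, each of size a, and Q_3 and Q_4 are formed exactly as above.

```python
import numpy as np
from scipy.stats import f

# Hypothetical data: a = 4 observations in each of b = 3 columns.
x = np.array([[5.1, 6.3, 7.0],
              [4.7, 6.8, 7.4],
              [5.6, 5.9, 6.6],
              [5.0, 6.1, 7.1]])
a, b = x.shape
xbar = x.mean()
col_means = x.mean(axis=0)

q3 = ((x - col_means) ** 2).sum()         # within-column sum of squares
q4 = a * ((col_means - xbar) ** 2).sum()  # among-column sum of squares

F = (q4 / (b - 1)) / (q3 / (b * (a - 1)))
c = f.ppf(0.95, b - 1, b * (a - 1))       # critical value for alpha = 0.05

print(f"F = {F:.3f}, critical value c = {c:.3f}")
print("reject H0" if F >= c else "accept H0")
```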
Remark. It should be pointed out that a test of the equality of the b
means fLJ' j = 1,2, ... , b, does not require that we take a random sample of
size a from each of the b normal distributions. That is, the samples may be
of different sizes, say aI, a2, ... , abo A consideration of this procedure is left
to Exercise 8.13.
Suppose now that we wish to compute the power of the test of H o
against HI when H o is false, that is, when we do not have fL1 = fL2 =
\cdots = \mu_b = \mu. It will be seen in Section 8.4 that, when H_1 is true, no
longer is Q_4/\sigma^2 a random variable that is \chi^2(b - 1). Thus we cannot use
an F statistic to compute the power of the test when H_1 is true. This
problem is discussed in Section 8.4.
An observation should be made in connection with maximizing a
likelihood function with respect to certain parameters. Sometimes it is
easier to avoid the use of the calculus. For example, L(\Omega) of this section
can be maximized with respect to \mu_j, for every fixed positive \sigma^2, by
minimizing

    z = \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \mu_j)^2

with respect to \mu_j, j = 1, 2, ..., b. Now z can be written as

    z = \sum_{j=1}^{b} \sum_{i=1}^{a} [(x_{ij} - \bar{x}_{.j}) + (\bar{x}_{.j} - \mu_j)]^2
      = \sum_{j=1}^{b} \sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2 + a \sum_{j=1}^{b} (\bar{x}_{.j} - \mu_j)^2.

Since each term in the right-hand member of the preceding equation
is nonnegative, clearly z is a minimum, with respect to \mu_j, if we take
\mu_j = \bar{x}_{.j}, j = 1, 2, ..., b.
EXERCISES
8.13. Let X_{1j}, X_{2j}, ..., X_{a_j j} represent independent random samples of
sizes a_j from normal distributions with means \mu_j and variances \sigma^2, j =
1, 2, ..., b. Show that

    \sum_{j=1}^{b} \sum_{i=1}^{a_j} (X_{ij} - \bar{X})^2 = \sum_{j=1}^{b} \sum_{i=1}^{a_j} (X_{ij} - \bar{X}_{.j})^2 + \sum_{j=1}^{b} a_j(\bar{X}_{.j} - \bar{X})^2,

or Q' = Q_3' + Q_4'. Here \bar{X} = \sum_{j=1}^{b} \sum_{i=1}^{a_j} X_{ij} / \sum_{j=1}^{b} a_j and \bar{X}_{.j} = \sum_{i=1}^{a_j} X_{ij}/a_j. If
\mu_1 = \mu_2 = \cdots = \mu_b, show that Q'/\sigma^2 and Q_3'/\sigma^2 have chi-square distributions.
Prove that Q_3' and Q_4' are stochastically independent, and hence Q_4'/\sigma^2 also
has a chi-square distribution. If the likelihood ratio \lambda is used to test
H_0: \mu_1 = \mu_2 = \cdots = \mu_b = \mu, \mu unspecified and \sigma^2 unknown, against all
possible alternatives, show that \lambda \le \lambda_0 is equivalent to the computed
F \ge c, where

    F = \frac{Q_4'/(b - 1)}{Q_3'/(\sum_{j=1}^{b} a_j - b)}.

What is the distribution of F when H_0 is true?
8.14. Using the notation of this section, assume that the means satisfy
the condition that \mu = \mu_1 + (b - 1)d = \mu_2 - d = \mu_3 - d = \cdots = \mu_b - d.
That is, the last b - 1 means are equal but differ from the first mean \mu_1,
provided that d \ne 0. Let a random sample of size a be taken from each of the
b independent normal distributions with common unknown variance \sigma^2.
(a) Show that the maximum likelihood estimators of \mu and d are \hat{\mu} = \bar{X}
and

    \hat{d} = \left[\sum_{j=2}^{b} \bar{X}_{.j}/(b - 1) - \bar{X}_{.1}\right]/b.

(b) Find Q_6 and Q_7 = c\hat{d}^2 so that, when d = 0, Q_7/\sigma^2 is \chi^2(1) and

    \sum_{j=1}^{b} \sum_{i=1}^{a} (X_{ij} - \bar{X})^2 = Q_3 + Q_6 + Q_7.

(c) Argue that the three terms in the right-hand member of part (b),
once divided by \sigma^2, are stochastically independent random variables with
chi-square distributions, provided that d = 0.
(d) The ratio Q_7/(Q_3 + Q_6) times what constant has an F distribution,
provided that d = 0?
8.4 Noncentral \chi^2 and Noncentral F

Let X_1, X_2, ..., X_n denote mutually stochastically independent
random variables that are n(\mu_i, \sigma^2), i = 1, 2, ..., n, and let Y = \sum_{1}^{n} X_i^2/\sigma^2.
If each \mu_i is zero, we know that Y is \chi^2(n). We shall now investigate the
distribution of Y when each \mu_i is not zero. The moment-generating
function of Y is given by

    M(t) = E\left[\exp\left(t \sum_{i=1}^{n} \frac{X_i^2}{\sigma^2}\right)\right] = \prod_{i=1}^{n} E\left[\exp\left(t \frac{X_i^2}{\sigma^2}\right)\right].

Consider

    E\left[\exp\left(\frac{tX_i^2}{\sigma^2}\right)\right] = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[\frac{tx_i^2}{\sigma^2} - \frac{(x_i - \mu_i)^2}{2\sigma^2}\right] dx_i.

The integral exists if t < 1/2. To evaluate the integral, note that

    \frac{tx_i^2}{\sigma^2} - \frac{(x_i - \mu_i)^2}{2\sigma^2} = \frac{t\mu_i^2}{\sigma^2(1 - 2t)} - \frac{1 - 2t}{2\sigma^2}\left(x_i - \frac{\mu_i}{1 - 2t}\right)^2.

Accordingly, with t < 1/2, we have

    E\left[\exp\left(\frac{tX_i^2}{\sigma^2}\right)\right]
        = \exp\left[\frac{t\mu_i^2}{\sigma^2(1 - 2t)}\right] \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1 - 2t}{2\sigma^2}\left(x_i - \frac{\mu_i}{1 - 2t}\right)^2\right] dx_i.

If we multiply the integrand by \sqrt{1 - 2t}, t < 1/2, we have the integral
of a normal p.d.f. with mean \mu_i/(1 - 2t) and variance \sigma^2/(1 - 2t). Thus

    E\left[\exp\left(\frac{tX_i^2}{\sigma^2}\right)\right] = \frac{1}{\sqrt{1 - 2t}} \exp\left[\frac{t\mu_i^2}{\sigma^2(1 - 2t)}\right],

and the moment-generating function of Y = \sum_{1}^{n} X_i^2/\sigma^2 is given by

    M(t) = \frac{1}{(1 - 2t)^{n/2}} \exp\left[\frac{t \sum_{1}^{n} \mu_i^2}{\sigma^2(1 - 2t)}\right],   t < \frac{1}{2}.

A random variable that has a moment-generating function of the
functional form

    M(t) = \frac{1}{(1 - 2t)^{r/2}} e^{t\theta/(1 - 2t)},

where t < 1/2, 0 \le \theta, and r is a positive integer, is said to have a non-
central chi-square distribution with r degrees of freedom and noncentrality
parameter \theta. If one sets the noncentrality parameter \theta = 0, one has
M(t) = (1 - 2t)^{-r/2}, which is the moment-generating function of a
random variable that is \chi^2(r). Such a random variable can appro-
priately be called a central chi-square variable. We shall use the symbol
\chi^2(r, \theta) to denote a noncentral chi-square distribution that has the
parameters r and \theta; and we shall say that a random variable is \chi^2(r, \theta)
to mean that the random variable has this kind of distribution. The
symbol \chi^2(r, 0) is equivalent to \chi^2(r). Thus our random variable
Y = \sum_{1}^{n} X_i^2/\sigma^2 of this section is \chi^2(n, \sum_{1}^{n} \mu_i^2/\sigma^2). If each \mu_i is equal to
zero, then Y is \chi^2(n, 0) or, more simply, Y is \chi^2(n).

The noncentral chi-square variables in which we have interest are
certain quadratic forms, in normally distributed variables, divided by
a variance \sigma^2. In our example it is worth noting that the noncentrality
parameter of \sum_{1}^{n} X_i^2/\sigma^2, which is \sum_{1}^{n} \mu_i^2/\sigma^2, may be computed by replacing
each X_i in the quadratic form by its mean \mu_i, i = 1, 2, ..., n. This is
no fortuitous circumstance; any quadratic form Q = Q(X_1, ..., X_n) in
normally distributed variables, which is such that Q/\sigma^2 is \chi^2(r, \theta), has
\theta = Q(\mu_1, \mu_2, ..., \mu_n)/\sigma^2; and if Q/\sigma^2 is a chi-square variable (central
or noncentral) for certain real values of \mu_1, \mu_2, ..., \mu_n, it is chi-square
(central or noncentral) for all real values of these means.

It should be pointed out that Theorem 1, Section 8.2, is valid
whether the random variables are central or noncentral chi-square
variables.

We next discuss a noncentral F variable. If U and V are stochasti-
cally independent and are, respectively, \chi^2(r_1) and \chi^2(r_2), the random
variable F has been defined by F = r_2 U/(r_1 V). Now suppose, in particu-
lar, that U is \chi^2(r_1, \theta), V is \chi^2(r_2), and that U and V are stochastically
independent. The random variable r_2 U/(r_1 V) is called a noncentral F
variable with r_1 and r_2 degrees of freedom and with noncentrality
parameter \theta. Note that the noncentrality parameter of F is precisely
the noncentrality parameter of the random variable U, which is
\chi^2(r_1, \theta).

Tables of noncentral chi-square and noncentral F are available in
the literature. However, like those of noncentral t, they are too bulky
to be put in this book.

EXERCISES

8.15. Let Y_i, i = 1, 2, ..., n, denote mutually stochastically inde-
pendent random variables that are, respectively, \chi^2(r_i, \theta_i), i = 1, 2, ..., n.
Prove that Z = \sum_{1}^{n} Y_i is \chi^2(\sum_{1}^{n} r_i, \sum_{1}^{n} \theta_i).

8.16. Compute the mean and the variance of a random variable that is
\chi^2(r, \theta).

8.17. Compute the mean of a random variable that has a noncentral F
distribution with degrees of freedom r_1 and r_2 > 2 and noncentrality
parameter \theta_1.

8.18. Show that the square of a noncentral T random variable is a
noncentral F random variable.

8.19. Let X_1 and X_2 be two stochastically independent random variables.
Let X_1 and Y = X_1 + X_2 be \chi^2(r_1, \theta_1) and \chi^2(r, \theta), respectively. Here
r_1 < r and \theta_1 \le \theta. Show that X_2 is \chi^2(r - r_1, \theta - \theta_1).

8.5 The Analysis of Variance

The problem considered in Section 8.3 is an example of a method
of statistical inference called the analysis of variance. This method
derives its name from the fact that the quadratic form abS^2, which is
a total sum of squares, is resolved into several component parts. In this
section other problems in the analysis of variance will be investigated.

Let X_{ij}, i = 1, 2, ..., a and j = 1, 2, ..., b, denote n = ab random
variables which are mutually stochastically independent and have
normal distributions with common variance \sigma^2. The means of these
normal distributions are \mu_{ij} = \mu + \alpha_i + \beta_j, where \sum_{i=1}^{a} \alpha_i = 0 and
\sum_{j=1}^{b} \beta_j = 0. For example, take a = 2, b = 3, \mu = 5, \alpha_1 = 1, \alpha_2 = -1,
\beta_1 = 1, \beta_2 = 0, and \beta_3 = -1. Then the ab = six random variables
have means

    \mu_{11} = 7,   \mu_{12} = 6,   \mu_{13} = 5,
    \mu_{21} = 5,   \mu_{22} = 4,   \mu_{23} = 3.

Had we taken \beta_1 = \beta_2 = \beta_3 = 0, the six random variables would
have had means

    \mu_{11} = 6,   \mu_{12} = 6,   \mu_{13} = 6,
    \mu_{21} = 4,   \mu_{22} = 4,   \mu_{23} = 4.

Thus, if we wish to test the composite hypothesis that

    \mu_{11} = \mu_{12} = \cdots = \mu_{1b},
    \mu_{21} = \mu_{22} = \cdots = \mu_{2b},
    \mu_{a1} = \mu_{a2} = \cdots = \mu_{ab},
we could say that we are testing the composite hypothesis that f31 =
f32 = ... = f3b (and hence each f3j = 0, since their sum is zero). On the
other hand, the composite hypothesis
of \sigma^2 under \omega, here denoted by \hat{\sigma}_\omega^2. So the likelihood ratio \lambda =
(\hat{\sigma}_\Omega^2/\hat{\sigma}_\omega^2)^{ab/2} is a monotone function of the statistic

    \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2/(ab)

    F = \frac{Q_4/(b - 1)}{Q_3/[b(a - 1)]}
    abS^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{i.} - \bar{X})^2 + \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X})^2
            + \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X})^2;
F = Q4/(b - 1) ,
Q5/[(a - 1)(b - 1)J
which has, under H o, an F distribution with b - 1 and (a - 1)(b - 1)
degrees of freedom. The hypothesis H 0 is rejected if F ;::: c, where a =
Pr (F ;::: c; H o).
If we are to compute the power function of the test, we need the
distribution of F when H 0 is not true. From Section 8.4 we know,
when HI is true, that Q4/a2 and Q5/a2 are stochastically independent
(central or noncentral) chi-square variables. We shall compute the non-
centrality parameters of Q4/a2 and Q5/a2 when HI is true. We have
E(X_{ij}) = \mu + \alpha_i + \beta_j, E(\bar{X}_{i.}) = \mu + \alpha_i, E(\bar{X}_{.j}) = \mu + \beta_j, and E(\bar{X}) =
\mu. Accordingly, the noncentrality parameter of Q_4/\sigma^2 is
is that estimator under w. A useful monotone function of the likelihood
thus the total sum of squares, abS^2, is decomposed into that among
rows (Q_2), that among columns (Q_4), and that remaining (Q_5). It is
interesting to observe that \hat{\sigma}_\Omega^2 = Q_5/(ab) is the maximum likelihood
estimator of \sigma^2 under \Omega and
upon which the test of the equality of means is based.
To help find a test for H o: f31 = f32 = ... = f3b = 0, where I-'ij =
I-' + a i + f3j, return to the decomposition of Example 3, Section 8.2,
namely Q = Q2 + Q4 + Q5' That is,
so we see that the total sum of squares, abS^2, is decomposed into a sum
of squares, Q_4, among column means and a sum of squares, Q_3, within
columns. The latter sum of squares, divided by n = ab, is the maximum
likelihood estimator of \sigma^2, provided that the parameters are in \Omega; and
we denote it by \hat{\sigma}_\Omega^2. Of course, S^2 is the maximum likelihood estimator
is the same as the composite hypothesis that a1 = a2 = ... = aa = 0.
Remarks. The model just described, and others similar to it, are widely
used in statistical applications. Consider a situation in which it is desirable
to investigate the effects of two factors that influence an outcome. Thus the
variety of a grain and the type of fertilizer used influence the yield; or the
teacher and the size of a class may influence the score on a standard test.
Let Xii denote the yield from the use of variety i of a grain and type j of
fertilizer. A test of the hypothesis that fl1 = fl2 = ... = flb = 0 would then
be a test of the hypothesis that the mean yield of each variety of grain is
the same regardless of the type of fertilizer used.
There is no loss of generality in assuming that \sum_{1}^{a} \alpha_i = \sum_{1}^{b} \beta_j = 0. To see
this, let \mu_{ij} = \mu' + \alpha_i' + \beta_j'. Write \bar{\alpha}' = \sum \alpha_i'/a and \bar{\beta}' = \sum \beta_j'/b. We have
\mu_{ij} = (\mu' + \bar{\alpha}' + \bar{\beta}') + (\alpha_i' - \bar{\alpha}') + (\beta_j' - \bar{\beta}') = \mu + \alpha_i + \beta_j, where \sum \alpha_i =
\sum \beta_j = 0.
1-'1b = 1-'2b = ... = I-'ab'
To construct a test of the composite hypothesis H o: f31 = f32 = ...
= f3b = 0 against all alternative hypotheses, we could obtain the corre-
sponding likelihood ratio. However, to gain more insight into such a test,
let us reconsider the likelihood ratio test of Section 8.3, namely that of
the equality of the means of b mutually independent distributions.
There the important quadratic forms are Q, Q3' and Q4' which are
related through the equation Q = Q4 + Q3' That is,
1-'12 = 1-'22 = ... = l-'a2,
1-'11 = 1-'21 = ... = I-'al>
    \sum_{j=1}^{b} \sum_{i=1}^{a} (\mu + \alpha_i + \beta_j - \mu - \alpha_i - \mu - \beta_j + \mu)^2/\sigma^2 = 0.

    H_0: \gamma_{ij} = 0,   i = 1, 2, ..., a,   j = 1, 2, ..., b,
8.21. If at least one \gamma_{ij} \ne 0, show that the F, which is used to test that

    F = \frac{\left[c \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{ij.} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{X})^2\right]/[(a - 1)(b - 1)]}{\left[\sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{c} (X_{ijk} - \bar{X}_{ij.})^2\right]/[ab(c - 1)]}.

The reader should verify that the noncentrality parameter of this F
distribution is equal to c \sum_{j=1}^{b} \sum_{i=1}^{a} \gamma_{ij}^2/\sigma^2. Thus F is central when H_0: \gamma_{ij} =
0, i = 1, 2, ..., a, j = 1, 2, ..., b, is true.
EXERCISES
8.20. Show that
that is, the total sum of squares is decomposed into that due to row
differences, that due to column differences, that due to interaction, and
that within cells. The test of
against all possible alternatives is based upon an F with (a - l)(b - 1)
and ab(c - 1) degrees of freedom,
    \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{c} (X_{ijk} - \bar{X})^2 = bc \sum_{i=1}^{a} (\bar{X}_{i..} - \bar{X})^2 + ac \sum_{j=1}^{b} (\bar{X}_{.j.} - \bar{X})^2
        + c \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{ij.} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{X})^2
        + \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{c} (X_{ijk} - \bar{X}_{ij.})^2;
That is, if Ylj = 0, each of the means in the first row is 2 greater than
the corresponding mean in the second row. In general, if each Ytj = 0,
the means of row i1 differ from the corresponding means of row i2 by a
constant. This constant may be different for different choices of i1 and
i2
. A similar statement can be made about the means of columns i1
and j2' The parameter Ytj is called the interaction associated with cell
(i, j). That is, the interaction between the ith level of one classification
and the jth level of the other classification is Ytj. One interesting
hypothesis to test is that each interaction is equal to zero. This will
now be investigated.
From Exercise 8.11 of Section 8.2 we have that
    \mu_{11} = 8,   \mu_{12} = 7,   \mu_{13} = 3,
    \mu_{21} = 4,   \mu_{22} = 3,   \mu_{23} = 5.

Note that, if each \gamma_{ij} = 0, then

    \mu_{11} = 7,   \mu_{12} = 6,   \mu_{13} = 5,
    \mu_{21} = 5,   \mu_{22} = 4,   \mu_{23} = 3.
and, under H o: a 1 = a2 = ... = aa = 0, has an F distribution with
a-I and (a - l)(b - 1) degrees of freedom.
The analysis-of-variance problem that has just been discussed is
usually referred to as a two-way classification with one observation per
cell. Each combination of i and j determines a cell; thus there is a total
of ab cells in this model. Let us now investigate another two-way
~lassification problem, but in this case we take c > 1 stochastically
mdependent observations per cell.
Let Xtjk' i = 1, 2, .. " a, j = 1,2, ... , b, and k = 1,2, ... , c, denote
n = abc random variables which are mutually stochastically indepen-
dent and which have normal distributions with common, but unknown,
variance \sigma^2. The mean of each X_{ijk}, k = 1, 2, ..., c, is \mu_{ij} = \mu + \alpha_i +
\beta_j + \gamma_{ij}, where \sum_{i=1}^{a} \alpha_i = 0, \sum_{j=1}^{b} \beta_j = 0, \sum_{i=1}^{a} \gamma_{ij} = 0, and \sum_{j=1}^{b} \gamma_{ij} = 0. For
example, take a = 2, b = 3, \mu = 5, \alpha_1 = 1, \alpha_2 = -1, \beta_1 = 1, \beta_2 = 0,
\beta_3 = -1, \gamma_{11} = 1, \gamma_{12} = 1, \gamma_{13} = -2, \gamma_{21} = -1, \gamma_{22} = -1, and \gamma_{23} = 2.
Then the means are
Thus, if the hypothesis H_0 is not true, F has a noncentral F distribution
with b - 1 and (a - 1)(b - 1) degrees of freedom and noncentrality
parameter a \sum_{j=1}^{b} \beta_j^2/\sigma^2. The desired probabilities can then be found in
tables of the noncentral F distribution.
A similar argument can be used to construct the F needed to test
the equality of row means; that is, this F is essentially the ratio of the
sum of squares among rows and Q5' In particular, this F is defined by
each interaction is equal to zero, has noncentrality parameter equal to
c \sum_{j=1}^{b} \sum_{i=1}^{a} \gamma_{ij}^2/\sigma^2.
8.6 A Regression Problem
It is easy to show (see Exercise 8.22) that the maximum likelihood
estimators of \alpha, \beta, and \sigma^2 are

    \hat{\alpha} = \sum_{1}^{n} X_i/n = \bar{X},    \hat{\beta} = \frac{\sum_{1}^{n} (c_i - \bar{c})X_i}{\sum_{1}^{n} (c_i - \bar{c})^2},

and

    \hat{\sigma}^2 = \frac{1}{n} \sum_{1}^{n} [X_i - \hat{\alpha} - \hat{\beta}(c_i - \bar{c})]^2.

Since \hat{\alpha} and \hat{\beta} are linear functions of X_1, X_2, ..., X_n, each is normally
distributed (Theorem 1, Section 4.7). It is easy to show (Exercise 8.23)
that their respective means are \alpha and \beta and their respective variances
are \sigma^2/n and \sigma^2/\sum_{1}^{n} (c_i - \bar{c})^2.

Consider next the algebraic identity (Exercise 8.24)

    \sum_{1}^{n} [X_i - \alpha - \beta(c_i - \bar{c})]^2 = \sum_{1}^{n} \{(\hat{\alpha} - \alpha) + (\hat{\beta} - \beta)(c_i - \bar{c}) + [X_i - \hat{\alpha} - \hat{\beta}(c_i - \bar{c})]\}^2
        = n(\hat{\alpha} - \alpha)^2 + (\hat{\beta} - \beta)^2 \sum_{1}^{n} (c_i - \bar{c})^2 + \sum_{1}^{n} [X_i - \hat{\alpha} - \hat{\beta}(c_i - \bar{c})]^2,
Consider a laboratory experiment the outcome of which depends
upon the temperature; that is, the technician first sets a temperature
dial at a fixed point C and subsequently observes the outcome of the
experiment for that dial setting. From past experience, the technician
knows that if he repeats the experiment with the temperature dial set
at the same point c, he is not likely to observe precisely the same out-
come. He then assumes that the outcome of his experiment is a random
variable X whose distribution depends not only upon certain unknown
parameters but also upon a nonrandom variable C which he can choose
more or less at pleasure. Let C
v C
2,... , C
n denote n arbitrarily selected
values of C (but not all equal) and let Xi denote the outcome of the
experiment when C = c., i = 1,2, ... , n. We then have the n pairs
(Xv C1), ••• , (Xn, en) in which the Xi are random variables but the Ci
are known numbers and i = 1,2, ... , n. Once the n experiments have
been performed (the first with C = C
v the second with C = C
2,and so on)
and the outcome of each recorded, we have the n pairs of known
numbers (Xl' C1), ••• , (Xn, cn)' These numbers are to be used to make
statistical inferences about the unknown parameters in the distribution
of the random variable X. Certain problems of this sort are called
regression problems and we shall study a particular one in some detail.
Let C
1 , C
2,... , C
n be n given numbers, not all equal, and let c =
n
2: c.jn, Let Xv X 2 , ••• , X; be n mutually stochastically independent
1
random variables with joint p.d.f.
Li«, fJ, (12; Xv X 2, ••• , X n)
or, for brevity,
(
1 )n/2 { 1 n }
= - exp - - '" [XI - ex - fJ(ci - C)J2 .
27TU2
2u2
f
Here Q, Qv Q2' and Q3 are real quadratic forms in the variables
Thus each Xi has a normal distribution with the same variance u2
, but
the means of these distributions are ex + fJ(cl - c). Since the ci are not
all equal, in this regression problem the means of the normal distribu-
tions depend upon the choice of C
v C
2, ... , Cn' We shall investigate ways
of making statistical inferences about the parameters ex, fJ, and u2.
i = 1,2, ... , n.
In this equation, Q represents the sum of the squares of n mutually
stochastically independent random variables that have normal dis-
tributions with means zero and variances u2 • Thus Q/u2
has a chi-square
distribution with n degrees of freedom. Each of the random variables
\sqrt{n}(\hat{\alpha} - \alpha)/\sigma and \sqrt{\sum_{1}^{n} (c_i - \bar{c})^2}(\hat{\beta} - \beta)/\sigma has a normal distribution with
zero mean and unit variance; thus each of Q_1/\sigma^2 and Q_2/\sigma^2 has a chi-
square distribution with 1 degree of freedom. Since Q_3 is nonnegative,
we have, in accordance with the theorem of Section 8.2, that Q_1, Q_2,
and Q_3 are mutually stochastically independent, so that Q_3/\sigma^2 has a
chi-square distribution with n - 1 - 1 = n - 2 degrees of freedom.
Then each of the random variables

    T_1 = \frac{[\sqrt{n}(\hat{\alpha} - \alpha)]/\sigma}{\sqrt{Q_3/[\sigma^2(n - 2)]}}
8.24. Verify that 2: [XI - a - f3(cl - C)]2 = Ql + Q2 + Q3' as stated in
1
the text.
8.25. Let the mutually stochastically independent random variables
Xl' X 2 , ••• , X; have, respectively, the probability density functions
n(f3cl , y2C~), i = 1, 2, ... , n, where the given numbers C1, c2, ••• , Cn are not
all equal and no one is zero. Find the maximum likelihood estimators of f3
and y2.
8.26. Let the mutually stochastically independent random variables
Xl> "', X; have the joint p.d.f.
and

    T_2 = \frac{\left[\sqrt{\sum_{1}^{n} (c_i - \bar{c})^2}(\hat{\beta} - \beta)\right]/\sigma}{\sqrt{Q_3/[\sigma^2(n - 2)]}}
where the given numbers c1 , c2 , ••• , Cn are not all equal. Let Ho: f3 = 0
(a and a2 unspecified). It is desired to use a likelihood ratio test to test H o
against all possible alternatives. Find , and see whether the test can be
based on a familiar statistic. Hint. In the notation of this section show that
has a t distribution with n - 2 degrees of freedom. These facts enable
us to obtain confidence intervals for \alpha and \beta. The fact that n\hat{\sigma}^2/\sigma^2 has a
chi-square distribution with n - 2 degrees of freedom provides a means
of determining a confidence interval for a 2• These are some of the
statistical inferences about the parameters to which reference was made
in the introductory remarks of this section.
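A short computational sketch of these inferences follows. It is an editorial addition, not part of the original text; Python with numpy and scipy is assumed, and the dial settings and outcomes are hypothetical. It computes \hat{\alpha}, \hat{\beta}, and Q_3 = n\hat{\sigma}^2 and uses T_2 to form a 95 per cent confidence interval for \beta.

```python
import numpy as np
from scipy.stats import t

c = np.array([10., 20., 30., 40., 50., 60.])       # chosen dial settings
x = np.array([8.2, 9.1, 10.3, 10.9, 12.0, 12.8])   # observed outcomes
n = len(c)
cbar = c.mean()

alpha_hat = x.mean()
beta_hat = ((c - cbar) * x).sum() / ((c - cbar) ** 2).sum()
resid = x - alpha_hat - beta_hat * (c - cbar)
q3 = (resid ** 2).sum()                             # Q3 = n * sigma_hat^2

# Confidence interval for beta based on T2, which is t(n - 2).
se_beta = np.sqrt(q3 / (n - 2)) / np.sqrt(((c - cbar) ** 2).sum())
t_crit = t.ppf(0.975, n - 2)
print(alpha_hat, beta_hat)
print((beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta))
```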
Remark. The more discerning reader should quite properly question
our constructions of T1 and T2 immediately above. We know that the
squares of the linear forms are stochastically independent of Q3 = nfP, but we
do not know, at this time, that the linear forms themselves enjoy this
independence. This problem arises again in Section 8.7. In Exercise 12.15,
a more general problem is proposed, of which the present case is a special
instance.
EXERCISES
8.22. Verify that the maximum likelihood estimators of a, {J, and (12 are
the EX, p, and iP given in this section.
8.23. Show that EX and phave the respective means a and {J and the
n
respective variances a2/n and (12/2: (c, - C)2.
1
8.27. Using the notation of Section 8.3, assume that the means fLj satisfy
a linear function of j, namely fLj = C + d[j - (b + 1)/2]. Let a random sample
of size a be taken from each of the b independent normal distributions with
common unknown variance a2
•
(a) Show that the maximum likelihood estimators of c and d are,
respectively, \hat{c} = \bar{X} and

    \hat{d} = \frac{\sum_{j=1}^{b} [j - (b + 1)/2](\bar{X}_{.j} - \bar{X})}{\sum_{j=1}^{b} [j - (b + 1)/2]^2}.

(b) Show that

    \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X})^2
        = \sum_{i=1}^{a} \sum_{j=1}^{b} \left[X_{ij} - \bar{X} - \hat{d}\left(j - \frac{b + 1}{2}\right)\right]^2 + \hat{d}^2 \sum_{j=1}^{b} a\left(j - \frac{b + 1}{2}\right)^2.
(c) Argue that the two terms in the right-hand member of part (b), once
divided by a2
, are stochastically independent random variables with chi-
square distributions provided that d = O.
(d) What F statistic would be used to test the equality of the means.
that is, H o: d = O?
Wv(n - 2)
vU
n
2: [(Xj - x)(Yj - V)]
W = ..::.l_-;:::====-_
J~ (Xj - X)2
VUJ[u~(n - 2)]
(3)
Then we have (Exercise 8.32)
wVn=2
VU
has a conditional t distribution with n - 2 degrees of freedom. Let
The left-hand member of this equation and the first term of the right-
hand member are, when divided by u~, respectively, conditionally
X2(n - 1) and conditionally X2(1). In accordance with Theorem 1,
the nonnegative quadratic form, say U, which is the second term of the
right-hand member of Equation (2), is conditionally stochastically
independent of W 2 , and, when divided by u~, is conditionally X
2
(n - 2).
Now WJU2 is n(O, 1). Then (Remark, Section 8.6)
is n(O, u~) (see Exercise 8.30). Thus the conditional distribution of
W2Ju~, given Xl = Xl' ... , X n = Xn, is X2(1). We have the algebraic
iden tity (see Exercise 8.31)
(1)
X; = X
n
, is X2 (n - 1). Moreover, the conditional distribution of the
linear function W of Y 1> Y 2, ••• , Y n'
n '
The conditional distribution of " (Y y-)2J 2 ' X
L., j - U2' given 1 = Xl' ••• ,1
1
This s.tat~stic R is called the correlationcoefficient of the random sample.
The likelihood ratio principle, which calls for the rejection of H if
A ::; Ao, is equivalent to the computed value of IRI ;::: c. That is, if ~he
absol~te value of the correlation coefficient of the sample is too large,
we reject the hypothesis that the correlation coefficient of the distri-
b.uti?n is equal to zero. To determine a value of c for a satisfactory
significance level, it will be necessary to obtain the distribution of R
or a function of R, when H o is true. This will now be done. '
Let Xl = Xl' X 2 = X2, " ., X; = Xn,n > 2, where Xl' X2,... , Xnand
_ n n
X = f xdn are fixed numbers such that 2: (xj - X)2 > 0. Consider the
1
conditional p.d.f. of Y 1> Y 2 ... Y given that X = X X = X
, , n' 1 1, 2 2, .•• ,
X n = Xn· Because Y 1> Y 2, ••• , Yn are mutually stochastically inde-
pendent and, with p = 0, are also mutually stochastically independent
of Xl' X 2, ... , X n, this conditional p.d.f. is given by
    R = \frac{\sum_{j=1}^{n} (X_j - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum_{j=1}^{n} (X_j - \bar{X})^2 \sum_{j=1}^{n} (Y_j - \bar{Y})^2}}.
8.7 A Test of Stochastic Independence
Let X and Y have a bivariate normal distribution with means ILl
a~d IL2' positive variances u~ and u~, and correlation coefficient p. We
wish to test the hypothesis that X and Yare stochastically independent.
Because two jointly normally distributed random variables are sto-
chastically independent if and only if p = 0, we test the hypothesis
H o: p = °against the hypothesis HI: p #- 0. A likelihood ratio test
will be used. Let (Xl' YI ), (X2, Y 2), ... , (Xn, Y n) denote a random
sample of size n > 2 from the bivariate normal distribution; that is
the joint p.d.f. of these 2n random variables is given by ,
f(X1> YI)f(X2,Y2) ... f(xn,Yn)'
~lth.ough it is.fair~y difficult to show, the statistic that is defined by the
likelihood ratio AIS a function of the statistic
this ratio has, given Xl = Xl> ••• , X; = Xn, a conditional t distribution
with n - 2 degrees of freedom. Note that the p.d.f., say g(t), of this
t distribution does not depend upon Xl> X 2, ••• , X n. Now the joint p.d.f.
of Xv X 2, ... , X; and Rvn - 2/vl - R2, where
the statistic Rvn - 2/v1 - R2 = T. In either case the significance
level of the test is
a = Pr (IR! ~ Cl; H o) = Pr (J TI ~ c2 ; H o),
where the constants Cl and C2 are chosen so as to give the desired
value of a.
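As a numerical sketch of this test (an editorial addition; Python with numpy and scipy is assumed, and the paired observations are hypothetical), the following code computes R and T = R\sqrt{n - 2}/\sqrt{1 - R^2} and compares |T| with the t critical value for a two-sided 5 per cent test.

```python
import numpy as np
from scipy.stats import t

x = np.array([1.2, 2.1, 2.9, 4.2, 5.0, 6.1, 7.3, 8.0])
y = np.array([2.0, 2.4, 3.5, 3.9, 5.1, 5.4, 6.8, 7.1])
n = len(x)

r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
T = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
c2 = t.ppf(0.975, n - 2)            # two-sided 5 per cent critical value

print(f"R = {r:.3f}, T = {T:.3f}, critical value = {c2:.3f}")
print("reject H0: rho = 0" if abs(T) >= c2 else "accept H0: rho = 0")
```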
Remark. It is also possible to obtain an approximate test of size a by
using the fact that
= 0 elsewhere.
We have now solved the problem of the distribution of R, when p = 0
and n > 2, or, perhaps more conveniently, that of RVn - 2/Vl - R2.
The likelihood ratio test of the hypothesis H o: p = 0 against all
alternatives H l: p =1= 0 may be based either on the statistic R or on
EXERCISES

8.28. Show that

    R = \frac{\sum_{j=1}^{n} (X_j - \bar{X})(Y_j - \bar{Y})}{\sqrt{\sum_{j=1}^{n} (X_j - \bar{X})^2 \sum_{j=1}^{n} (Y_j - \bar{Y})^2}}
      = \frac{\sum_{j=1}^{n} X_j Y_j - n\bar{X}\bar{Y}}{\sqrt{\left(\sum_{j=1}^{n} X_j^2 - n\bar{X}^2\right)\left(\sum_{j=1}^{n} Y_j^2 - n\bar{Y}^2\right)}}.
8.29. A random sample of size n = 6 from a bivariate normal distribu-
tion yields the value of the correlation coefficient to be 0.89. Would we
accept or reject, at the 5 per cent significance level, the hypothesis that
p = O?
8.30. Verify that W of Equation (1) of this section is n(O, a~).
8.31. Verify the algebraic identity (2) of this section.
8.32. Verify Equation (3) of this section.
8.33. Verify the p.d.f. (4) of this section.
    W = \frac{1}{2} \ln\left(\frac{1 + R}{1 - R}\right)

    \frac{1}{2} \ln\left(\frac{1 + \rho_0}{1 - \rho_0}\right).
has an approximate normal distribution with mean \frac{1}{2} \ln[(1 + \rho)/(1 - \rho)]
and variance 1/(n - 3). We accept this statement without proof. Thus a
test of H_0: \rho = 0 can be based on the statistic

    Z = \frac{\frac{1}{2} \ln[(1 + R)/(1 - R)] - \frac{1}{2} \ln[(1 + \rho)/(1 - \rho)]}{\sqrt{1/(n - 3)}},

with \rho = 0 so that \frac{1}{2} \ln[(1 + \rho)/(1 - \rho)] = 0. However, using W, we can
also test hypotheses like Ho: p = Po against Hl : p =1= Po, where Po is not
necessarily zero. In that case the hypothesized mean of W is
If we write T = R\sqrt{n - 2}/\sqrt{1 - R^2}, where T has a t distribution
with n - 2 > 0 degrees of freedom, it is easy to show, by the change-
of-variable technique (Exercise 8.33), that the p.d.f. of R is given by

(4)    g(r) = \frac{\Gamma[(n - 1)/2]}{\Gamma(1/2)\Gamma[(n - 2)/2]} (1 - r^2)^{(n - 4)/2},   -1 < r < 1,
       = 0 elsewhere.
is the product of g(t) and the joint p.d.f. of Xl' X 2 , ••• , X n. Integration
on Xl' X 2, ... , X n yields the marginal p.d.f. of RVn - 2/V1 - R2;
because g(t) does not depend upon Xv X2' ••• , X n it is obvious that this
marginal p.d.f. is g(t), the conditional p.d.f. of Rcvn - 2/Vl - R~.
The change-of-variable technique can now be used to find the p.d.f.
of R.
Remarks. Since R has, when P = 0, a conditional distribution that
does not depend upon Xl' X 2, •• " X n (and hence that conditional distribution
is, in fact, the marginal distribution of R), we have the remarkable fact that R
is stochastically independent of Xl' X 2 , ••• , X n . It follows that R is
stochastically independent of every function of Xl, X 2 , ••• , X; alone, that is,
a function that does not depend upon any Yj • In like manner, R is stochastic-
ally independent of every function of Y1> Y2, ••. , Yn alone. Moreover, a
careful review of the argument reveals that nowhere did we use the fact that
X has a normal marginal distribution. Thus, if X and Yare stochastically
independent, and if Y has a normal distribution, then R has the same
conditional distribution whatever be the distribution of X, subject to the
condition ~ (xj - X)2 > O. Moreover, if Pr [~ (X, - X')2 > 0] = 1, then R
has the same marginal distribution whatever be the distribution of X.
Chapter 9
Nonparametric Methods
9.1 Confidence Intervals for Distribution Quantiles
We shall first define the concept of a quantile of a distribution of a
random variable of the continuous type. Let X be a random variable
of the continuous type with p.d.f. f(x) and distribution function F(x).
Let p denote a positive proper fraction and assume that the equation
F(x) = p has a unique solution for X. This unique root is denoted by
the symbol gp and is called the quantile (of the distribution) of order p.
Thus Pr (X ~ gp) = F(gp) = p. For example, the quantile of order t
is the median of the distribution and Pr (X ~ gO.5) = F(gO.5) = t-
In Chapter 6 we computed the probability that a certain random
interval includes a special point. Frequently, this special point was a
parameter of the distribution of probability under consideration. Thus
we were led to the notion of an interval estimate of a parameter. If the
parameter happens to be a quantile of the distribution, and if we work
with certain functions of the order statistics, it will be seen that this
method of statistical inference is applicable to all distributions of the
continuous type. We call these methods distribution-free or nonpara-
metric methods of inference.
To obtain a distribution-free confidence interval for gp, the quantile
of order p, of a distribution of the continuous type with distribution
function F(x), take a random sample Xl' X 2 , ••• , X n of size n from that
distribution. Let Y1 < Y2 < ... < Yn be the order statistics of the
sample. Take Y, < Y, and consider the event Y, < gp < Y j • For the
ith order statistic Y, to be less than gp it must be true that at least i
of the X values are less than gpo Moreover, for the jth order statistic
to be greater than gp, fewer than j of the X values are less than gpo
That is, if we say that we have a "success" when an individual X
value is less than gp, then, in the n independent trials, there must be
at least i successes but fewer than j successes for the event Y, < gp < Y j
to occur. But since the probability of success on each trial is Pr (X < gp)
F(gp) = p, the probability of this event is
    Pr (Y_i < \xi_p < Y_j) = \sum_{w=i}^{j-1} \frac{n!}{w!(n - w)!} p^w (1 - p)^{n-w},
the probability of having at least i, but less than j, successes. When
particular values of n, i, and j are specified, this probability can be
computed. By this procedure, suppose it has been found that y =
Pr (Y, < e. < Y j ) . Then the probability is y that the random interval
(Yt, Y j ) includes the quantile of order p. If the experimental values of
Yt and Y, are, respectively, Yt and Yj, the interval (Yt, Yj) serves as a
100y per cent confidence interval for gp, the quantile of order p.
An illustrative example follows.
Example 1. Let Y1 < Y2 < Ya < Y4 be the order statistics of a random
sample of size 4 from a distribution of the continuous type. The probability
that the random interval (Y1 , Y 4 ) includes the median gO.5 of the distribution
will be computed. We have
    Pr (Y_1 < \xi_{0.5} < Y_4) = \sum_{w=1}^{3} \frac{4!}{w!(4 - w)!} \left(\frac{1}{2}\right)^4 = 0.875.
If Y1 and Y4 are observed to be Yl = 2.8 and Y4 = 4.2, respectively, the
interval (2.8, 4.2) is an 87.5 per cent confidence interval for the median gO.5
of the distribution.
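The probability just computed can be checked directly from the binomial sum. The following one-line sketch is an editorial addition (Python with scipy assumed).

```python
from scipy.stats import binom

n, p = 4, 0.5
# Pr(Y1 < xi_0.5 < Y4) = Pr(1 <= W <= 3) for W ~ b(4, 1/2).
gamma = sum(binom.pmf(w, n, p) for w in range(1, 4))
print(gamma)   # 0.875
```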
For samples of fairly large size, we can approximate the binomial
probabilities with those associated with normal distributions, as
illustrated in the next example.
Example 2. Let the following numbers represent the order statistics of
n = 27 observations obtained in a random sample from a certain distribution
of the continuous type.
61, 69, 71, 74, 79, 80, 83, 84, 86, 87, 92, 93, 96, 100,
104,105,113,121,122,129,141,143,156,164,191,217,276.
Say that we are interested in estimating the 25th percentile gO.25 (that is,
the quantile of order 0.25) of the distribution. Since (n + l)P = 28(1-) = 7,
the seventh order statistic, Y7 = 83, could serve as a point estimate of gO.25'
To get a confidence interval for gO.25, consider two order statistics, one less
than Y7 and the other greater, for illustration, Y4 and YIO' What is the con-
fidence coefficient associated with the interval (Y4' YIO)? Of course, before
the sample is drawn, we know that
    \gamma = Pr (Y_4 < \xi_{0.25} < Y_{10}) = \sum_{w=4}^{9} \binom{27}{w} (0.25)^w (0.75)^{27-w}.
(a) Show that Pr (YI < P. < Y 2 ) = t and compute the expected value
of the random length Y2 - Yr-
(b) If X is the mean of this sample, find the constant c such that
Pr (X - co < p. < X + cal = 1-, and compare the length of this random
interval with the expected value of that of part (a). Hint. See Exercise 4.60,
Section 4.6.
9.2 Tolerance Limits for Distributions
9.6. Let YI < Y2 < ... < Y 25be the order statistics of a random sample
of size n = 25 from a distribution of the continuous type. Compute approxi-
mately:
(a) Pr (Ys < go.s < YI S) '
(b) Pr (Y2 < gO.2 < Yg) .
(c) Pr (YI S < go.s < Y 23) ·
9.7. Let YI < Y2 < ... < YlO O be the order statistics of a random
sample of size n = 100 from a distribution of the continuous type. Find
i < j so that Pr (Y, < gO.2 < YJ
) is about equal to 0.95.
= 0 elsewhere,
then, if 0 < P < 1, we have
Pr [F(X) s PJ = f:dz = p.
Now F(x) = Pr (X ~ x). Since Pr (X = x) = 0, then F(x) is the
fractional part of the probability for the distribution of X that is
between -00 and x. If F(x) ~ p, then no more than lOOp per cent of
the probability for the distribution of X is between -00 and x. But
recall Pr [F(X) ~ PJ = p. That is, the probability that the random
o < z < 1,
h(z) = 1,
We propose now to investigate a problem that has something of the
same flavor as that treated in Section 9.1. Specifically, can we compute
the probability that a certain random interval includes (or covers) a
preassigned percentage of the probability for the distribution under
consideration? And, by appropriate selection of the random interval,
can we be led to an additional distribution-free method of statistical
inference?
Let X be a random variable with distribution function F(x) of the
continuous type. The random variable Z = F(X) is an important
random variable, and its distribution is given in Example 1, Section
4.1. It is our purpose now to make an interpretation. Since Z = F(X)
has the p.d.f,
9.1. Let Yn denote the nth order statistic of a random sample of size n
from a distribution of the continuous type. Find the smallest value of n for
which Pr (go.g < Y n) ~ 0.75.
9.2. Let Y1 < Y2 < Y3 < Y4 < Ys denote the order statistics of a
random sample of size 5 from a distribution of the continuous type. Com-
pute:
(a) Pr (YI < gO.5 < Ys)'
(b) Pr (YI < gO.2S < Y 3) ·
(c) Pr (Y4 < go.so < Ys)'
9.3. Compute Pr (Y3 < go.s < Y7 ) if YI < ... < Yg are the order
statistics of a random sample of size 9 from a distribution of the continuous
type.
9.4. Find the smallest value of n for which Pr (Y1 < go.s < Yn) ~ 0.99,
where YI
< ... < Y n
are the order statistics of a random sample of size n
from a distribution of the continuous type.
9.5. Let YI
< Y2
denote the order statistics of a random sample of size
2 from a distribution which is n(p., a2
) , where a2
is known.
EXERCISES
'Y = Pr (3.5 < W < 9.5),
Thus (Y4 = 74, YlO = 87) serves as an 81.4 per cent confidence interval for
gO.2S' It should be noted that we could choose other intervals also, for
illustration, (Y3 = 71, Yu = 92), and these would have different confidence
coefficients. The persons involved in the study must select the desired
confidence coefficient, and then the appropriate order statistics, Y, and Y J ,
are taken in such a way that i and j are fairly symmetrically located about
(n + l)p.
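The normal approximation used in Example 2 can be reproduced as follows. This sketch is an editorial addition (Python with scipy assumed); the exact binomial sum is printed for comparison.

```python
from scipy.stats import binom, norm

n, p = 27, 0.25
mean, sd = n * p, (n * p * (1 - p)) ** 0.5     # 6.75 and 2.25

# gamma = Pr(4 <= W <= 9) = Pr(3.5 < W < 9.5) with the continuity correction.
approx = norm.cdf((9.5 - mean) / sd) - norm.cdf((3.5 - mean) / sd)
exact = sum(binom.pmf(w, n, p) for w in range(4, 10))
print(approx, exact)   # both near 0.81
```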
That is,
where W is b(27, 1/4) with mean 27/4 = 6.75 and variance 81/16. Hence \gamma is
approximately equal to
variable Z = F(X) is less than or equal to p is precisely the probability
that the random interval (-00, X) contains no more than lOOp per cent
of the probability for the distribution. For example, the probability
that the random interval (-00, X) contains no more than 70 per cent
of the probability for the distribution is 0.70; and the probability that
the random interval (-00, X) contains more than 70 per cent of the
probability for the distribution is 1 - 0.70 = 0.30.
We now consider certain functions of the order statistics. Let
X X X denote a random sample of size n from a distribution
1, 2,···, n
that has a positive and continuous p.d.f. f(x) if and only if a < x < b;
and let F(x) denote the associated distribution function. Consider the
random variables F(Xl
), F(X2),... , F(Xn). These random variables
are mutually stochastically independent and each, in accordance with
Example 1, Section 4.1, has a uniform distribution on the interval (OJ 1).
Thus F(Xl), F(X2),... , F(Xn) is a random sample of size n from a
uniform distribution on the interval (0, 1). Consider the order statistics
of this random sample F(Xl ) , F(X2),... , F(Xn). Let Zl be the smallest
of these F(X;) , Z2 the next F(X;) in order of magnitude, .. " and Zn
the largest F(Xi). If Y 1> Y 2, .•. , Y nare the order statistics of the initial
random sample Xl' X 2
, ••• , X n, the fact that F(x) is a nondecreasing
(here, strictly increasing) function of x implies that Zl = F(Yl ),
Z2 = F(Y2),... , Zn = F(Yn). Thus the joint p.d.f. of ZlJ Z2' ... , Zn
is given by
    h(z_1, z_2, ..., z_n) = n!,   0 < z_1 < z_2 < \cdots < z_n < 1,
= 0 elsewhere.
This proves a special case of the following theorem.
Theorem 1. Let Y 1> Y 2, •• " Y n denote the order statistics of a
random sample of size n from a distribution of the continuous type that
has p.d.j. f(x) and distribution function F(x). The joint p.d.f. of the
random variables Z, = F(Yi), i = 1, 2, ... , n, is
    h(z_1, z_2, ..., z_n) = n!,   0 < z_1 < z_2 < \cdots < z_n < 1,
= 0 elsewhere.
Because the distribution function of Z = F(X) is given by z,
o < Z < 1, the marginal p.d.f. of Zk = F(Yk) is the following beta
p.d.f.:
(2)    \frac{n!}{(i - 1)!(j - i - 1)!(n - j)!} z_i^{i-1}(z_j - z_i)^{j-i-1}(1 - z_j)^{n-j},   0 < z_i < z_j < 1,
       = 0 elsewhere.
Moreover, the joint p.d.f. of Z; = F(Yi) and Z, = F(YJ
) is, with i < j,
given by
Sometimes this is a rather tedious computation. For this reason and
for the reason that coverages are important in distribution-free statistical
inference, we choose to introduce at this time the concept of a coverage.
Consider the random variables Wl = F(Yl) = Zl' W2 = F(Y2) -
F(Yl) = Z2 - Z1> W3 = F(Ys) - F(Y2) = Zs - Z2"'" Wn =
F(Yn) - F(Yn- l) = Zn - Zn-l' The random variable Wl is called a
coverage of the random interval {x; -00 < x < Y l} and the random
variable Wi, i = 2, 3, ... , n, is called a coverage of the random interval
{x; Yi- l < X < Vi}' We shall find the joint p.d.f. of the n coverages
Consider the difference Z, - Z, = F(Yj) - F(Y;), i < i Now
F(Yj) = Pr (X :5: Yj) and F(Yi) = Pr (X :5: Yi)' Since Pr (X = Y;) =
Pr (X = Yj) = 0, then the difference F(Yj) - F(Yi) is that fractional
part of the probability for the distribution of X that is between Y; and
Yj' Let p denote a positive proper fraction. If F(Yj) - F(Yi) ~ p, then
at least lOOp per cent of the probability for the distribution of X is
between Yi and Yj' Let it be given that y = Pr [F(Yj) - F(Yi) ~ Pl
Then the random interval (Yi , Y j ) has probability y of containing at
least lOOp per cent of the probability for the distribution of X. If now
Yi and Yj denote, respectively, experimental values of Y i and Y j, the
interval (Yi' Yj) either does or does not contain at least lOOp per cent
of the probability for the distribution of X. However, we refer to the
interval (Yi' Yj) as a lOGy per cent tolerance interval for lOOp per cent
of the probability for the distribution of X. In like vein, Yi and Yj are
called 100y per cent tolerance limits for lOOP per cent of the probability
for the distribution of X.
One way to compute the probability y = Pr [F(Yj) - F(Yi) ~ PJ
is to use Equation (2), which gives the joint p.d.f. of Z, = F(Y;) and
Zj = F(Yj ) . The required probability is then given by
(1)    h_k(z_k) = \frac{n!}{(k - 1)!(n - k)!} z_k^{k-1}(1 - z_k)^{n-k},   0 < z_k < 1,
       = 0 elsewhere.
WI> W 2' •.. , Wno First we note that the inverse functions of the
associated transformation are given by
because the integrand is the p.d.f. of F(Y6 ) - F(Yl ) . Accordingly,
y = 1 - 6(0.8)5 + 5(0.8)6 = 0.34,
Z2 = WI + W2'
za = WI + w 2 + wa,
Zn = WI + W2 + Wa + ... + wn•
We also note that the Jacobian is equal to 1 and that the space of
positive probability density is

{(w_1, w_2, ..., w_n); 0 < w_i, i = 1, 2, ..., n, w_1 + ... + w_n < 1}.

Since the joint p.d.f. of Z_1, Z_2, ..., Z_n is n!, 0 < z_1 < z_2 < ... < z_n < 1,
zero elsewhere, the joint p.d.f. of the n coverages is

k(w_1, ..., w_n) = n!,   0 < w_i, i = 1, ..., n, w_1 + ... + w_n < 1,
                 = 0 elsewhere.
A reexamination of Example 1 of Section 4.5 reveals that this is a
Dirichlet p.d.f. with k = n and α_1 = α_2 = ... = α_{n+1} = 1.
Because the p.d.f. k(w_1, ..., w_n) is symmetric in w_1, w_2, ..., w_n, it is
evident that the distribution of every sum of r, r < n, of these coverages
W_1, ..., W_n is exactly the same for each fixed value of r. For instance,
if i < j and r = j - i, the distribution of Z_j - Z_i = F(Y_j) - F(Y_i) =
W_{i+1} + W_{i+2} + ... + W_j is exactly the same as that of Z_{j-i} =
F(Y_{j-i}) = W_1 + W_2 + ... + W_{j-i}. But we know that the p.d.f. of
Z_{j-i} is the beta p.d.f. of the form
k_1(w) = n(1 - w)^{n-1},   0 < w < 1,
       = 0 elsewhere,
because W_1 = Z_1 = F(Y_1) has this p.d.f. Accordingly, the mathematical
expectation of each W_i is

\int_0^1 n w (1 - w)^{n-1}\, dw = \frac{1}{n + 1}.
Now the coverage W_i can be thought of as the area under the graph of the
p.d.f. f(x), above the x-axis, and between the lines x = Y_{i-1} and x = Y_i.
(We take Y_0 = -∞.) Thus the expected value of each of these random
areas W_i, i = 1, 2, ..., n, is 1/(n + 1). That is, the order statistics partition
the probability for the distribution into n + 1 parts, and the expected value
of each of these parts is 1/(n + 1). More generally, the expected value of
F(Y_j) - F(Y_i), i < j, is (j - i)/(n + 1), since F(Y_j) - F(Y_i) is the sum of
j - i of these coverages. This result provides a reason for calling Y_k, where
(n + 1)p = k, the (100p)th percentile of the sample, since

E[F(Y_k)] = \frac{k}{n + 1} = \frac{(n + 1)p}{n + 1} = p.
Example 2. Each of the coverages W_i, i = 1, 2, ..., n, has the beta
p.d.f.
approximately. That is, the observed values of y_1 and y_6 will define a
34 per cent tolerance interval for 80 per cent of the probability for the
distribution.
h_{j-i}(v) = \frac{\Gamma(n+1)}{\Gamma(j-i)\,\Gamma(n-j+i+1)}\, v^{\,j-i-1}(1 - v)^{\,n-j+i},   0 < v < 1,
           = 0 elsewhere.
Consequently, F(Y_j) - F(Y_i) has this p.d.f. and

Pr[F(Y_j) - F(Y_i) ≥ p] = \int_p^1 h_{j-i}(v)\, dv.
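If a numerical value of γ is wanted, this integral can be evaluated with any routine for the beta distribution. A minimal sketch, assuming SciPy is available (beta.sf gives the upper-tail probability of the beta p.d.f. h_{j-i}(v)):

```python
from scipy.stats import beta

def tolerance_gamma(n, i, j, p):
    """gamma = Pr[F(Y_j) - F(Y_i) >= p]; F(Y_j) - F(Y_i) has a beta
    distribution with parameters j - i and n - j + i + 1."""
    return beta.sf(p, j - i, n - j + i + 1)

# Example 1 of the text: n = 6, the interval (Y_1, Y_6), and p = 0.8
print(round(tolerance_gamma(6, 1, 6, 0.8), 3))   # about 0.345
```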
Example 1. Let Y_1 < Y_2 < ... < Y_6 be the order statistics of a
random sample of size 6 from a distribution of the continuous type. We
want to use the observed interval (y_1, y_6) as a tolerance interval for 80 per
cent of the distribution. Then

γ = Pr[F(Y_6) - F(Y_1) ≥ 0.8]
  = 1 - \int_0^{0.8} 30 v^4 (1 - v)\, dv,
EXERCISES
9.8. Let Y_1 and Y_n be, respectively, the first and nth order statistics of a
random sample of size n from a distribution of the continuous type having
distribution function F(x). Find the smallest value of n such that
Pr[F(Y_n) - F(Y_1) ≥ 0.5] is at least 0.95.
9.9. Let Y_2 and Y_{n-1} denote the second and the (n - 1)st order statistics
of a random sample of size n from a distribution of the continuous type
having distribution function F(x). Compute Pr[F(Y_{n-1}) - F(Y_2) ≥ p],
where 0 < p < 1.
9.10. Let Y_1 < Y_2 < ... < Y_48 be the order statistics of a random sample
of size 48 from a distribution of the continuous type. We want to use the
observed interval (y_4, y_45) as a 100γ per cent tolerance interval for 75 per cent
of the distribution.
(a) To what is γ equal?
(b) Approximate the integral in part (a) by noting that it can be written
as a partial sum of a binomial p.d.f., which in turn can be approximated by
probabilities associated with a normal distribution.
9.11. Let Y_1 < Y_2 < ... < Y_n be the order statistics of a random sample
of size n from a distribution of the continuous type having distribution
function F(x).
(a) What is the distribution of U = 1 - F(Y_j)?
(b) Determine the distribution of V = F(Y_n) - F(Y_j) + F(Y_i) -
F(Y_1), where i < j.
Suppose, however, that we are interested only in the alternative hy-
pothesis, which is H_1: F(ξ) > p_0. One procedure is to base the test of H_0
against H_1 upon the random variable Y, which is the number of items
less than or equal to ξ in a random sample of size n from the distribu-
tion. The statistic Y can be thought of as the number of "successes"
throughout n independent trials. Then, if H_0 is true, Y is b[n, p_0 = F(ξ)];
whereas if H_0 is false, Y is b[n, p = F(ξ)] whatever be the distribution
function F(x). We reject H_0 and accept H_1 if and only if the observed
value y ≥ c, where c is an integer selected such that Pr(Y ≥ c; H_0) is
some reasonable significance level α. The power function of the test is
given by
K(p) = \sum_{y=c}^{n} \binom{n}{y} p^{y} (1 - p)^{n-y},   p_0 ≤ p < 1,

where p = F(ξ).
In many places in the literature the test that we have just described
is called the sign test. The reason for this terminology is that the test
is based upon a statistic Y that is equal to the number of nonpositive
signs in the sequence X_1 - ξ, X_2 - ξ, ..., X_n - ξ. In the next section
a distribution-free test, which considers both the sign and the magni-
tude of each deviation X_i - ξ, is studied.
In certain instances we may wish to approximate
K(p) by using an approximation to the binomial distribution.
Suppose that the alternative hypothesis to H_0: F(ξ) = p_0 is
H_1: F(ξ) < p_0. Then the critical region is a set {y; y ≤ c_1}. Finally, if
the alternative hypothesis is H_1: F(ξ) ≠ p_0, the critical region is a set
{y; y ≤ c_2 or c_3 ≤ y}.
Frequently, p_0 = 1/2 and, in that case, the hypothesis is that the
given number ξ is a median of the distribution. In the following example,
this value of p_0 is used.
Example 1. Let X_1, X_2, ..., X_10 be a random sample of size 10 from a
distribution with distribution function F(x). We wish to test the hypothesis
H_0: F(72) = 1/2 against the alternative hypothesis H_1: F(72) > 1/2. Let Y be
the number of sample items that are less than or equal to 72. Let the observed
value of Y be y, and let the test be defined by the critical region {y; y ≥ 8}.
The power function of the test is given by

K(p) = \sum_{y=8}^{10} \binom{10}{y} p^{y} (1 - p)^{10-y},   1/2 ≤ p < 1,

where p = F(72). In particular, the significance level is α = K(1/2) =
56/1024 ≈ 0.055.
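The binomial sums in K(p) are easy to evaluate directly. A minimal sketch for the test of Example 1 (the alternative value p = 0.8 is an arbitrary illustrative choice, not from the text):

```python
from math import comb

def sign_test_power(n, c, p):
    """K(p) = Pr(Y >= c) when Y is b(n, p): power of the sign test with
    critical region {y; y >= c}."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(c, n + 1))

print(sign_test_power(10, 8, 0.5))   # significance level, 56/1024 = 0.0547
print(sign_test_power(10, 8, 0.8))   # power at the hypothetical p = 0.8, about 0.68
```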
H_0: Pr(X ∈ A_i) = p_{i0},   i = 1, 2, ..., k;
and a chi-square test, based upon a statistic that was denoted by
Q_{k-1}, was used to test the hypothesis H_0 against all alternative
hypotheses.
There is a certain subjective element in the use of this test, namely
the choice of k and of A_1, A_2, ..., A_k. But it is important to note that
the limiting distribution of Q_{k-1}, under H_0, is χ²(k - 1); that is, the
distribution of Q_{k-1} is free of p_{10}, p_{20}, ..., p_{k0} and, accordingly, of the
specified distribution of X. Here, and elsewhere, "under H_0" means
when H_0 is true. A test of a hypothesis H_0 based upon a statistic whose
distribution, under H_0, does not depend upon the specified distribution
or any parameters of that distribution is called a distribution-free or a
nonparametric test.
Next, let F(x) be the unknown distribution function of the random
variable X. Let there be given two numbers ξ and p_0, where 0 < p_0 < 1.
We wish to test the hypothesis H_0: F(ξ) = p_0, that is, the hypothesis
that ξ = ξ_{p_0}, the quantile of order p_0 of the distribution of X. We could
use the statistic Q_{k-1}, with k = 2, to test H_0 against all alternatives.
9.3 The Sign Test
Some of the chi-square tests of Section 8.1 are illustrative of the
type of tests that we investigate in the remainder of this chapter.
Recall that in that section we tested the hypothesis that the distribution
of a certain random variable X is a specified distribution. We did this
in the following manner. The space of X was partitioned into k mutually
disjoint sets A_1, A_2, ..., A_k. The probability p_{i0} that X ∈ A_i was
computed under the assumption that the specified distribution is the
correct distribution, i = 1, 2, ..., k. The original hypothesis was then
replaced by the hypothesis
for all x. Moreover, the probability that any two items of a random
sample are equal is zero, and in our discussion we shall assume that no
two are equal.
The problem is to test the hypothesis that the median ξ_{0.5} of the
distribution is equal to a fixed number, say ξ. Thus we may, in all
cases and without loss of generality, take ξ = 0. The reason for this is
that if ξ ≠ 0, then the fixed ξ can be subtracted from each sample item
and the resulting variables can be used to test the hypothesis that their
underlying distribution is symmetric about zero. Hence our conditions
on F(x) and f(x) become F(-x) = 1 - F(x) and f(-x) = f(x),
respectively.
To test the hypothesis H_0: F(0) = 1/2, we proceed by first ranking
X_1, X_2, ..., X_n according to magnitude, disregarding their algebraic
signs. Let R_i be the rank of |X_i| among |X_1|, |X_2|, ..., |X_n|, i = 1, 2,
..., n. For example, if n = 3 and if we have |X_2| < |X_3| < |X_1|, then
R_1 = 3, R_2 = 1, and R_3 = 2. Thus R_1, R_2, ..., R_n is an arrangement
of the first n positive integers 1, 2, ..., n. Further, let Z_i, i = 1, 2, ..., n,
be defined by
EXERCISES
9.12. Suggest a chi-square test of the hypothesis which states that a
distribution is one of the beta type, with parameters α = 2 and β = 2.
Further, suppose that the test is to be based upon a random sample of size
100. In the solution, give k, define A_1, A_2, ..., A_k, and compute each p_{i0}. If
possible, compare your proposal with those of other students. Are any of
them the same?
9.13. Let X_1, X_2, ..., X_48 be a random sample of size 48 from a distri-
bution that has the distribution function F(x). To test H_0: F(41) = 1/2 against
H_1: F(41) < 1/2, use the statistic Y, which is the number of sample items less
than or equal to 41. If the observed value of Y is y ≤ 7, reject H_0 and accept
H_1. If p = F(41), find the power function K(p), 0 < p ≤ 1/2, of the test.
Approximate α = K(1/2).
9.14. Let X_1, X_2, ..., X_100 be a random sample of size 100 from a
distribution that has distribution function F(x). To test H_0: F(90) - F(60)
= 1/2 against H_1: F(90) - F(60) > 1/2, use the statistic Y, which is the number
of sample items less than or equal to 90 but greater than 60. If the observed
value of Y, say y, is such that y ≥ c, reject H_0. Find c so that α = 0.05,
approximately.
9.4 A Test of Wilcoxon
Z_i = -1,   if X_i < 0,
    =  1,   if X_i > 0.
Suppose X_1, X_2, ..., X_n is a random sample from a distribution
with distribution function F(x). We have considered a test of the
hypothesis F(ξ) = 1/2, ξ given, which is based upon the signs of the
deviations X_1 - ξ, X_2 - ξ, ..., X_n - ξ. In this section a statistic is
studied that takes into account not only these signs, but also the
magnitudes of the deviations.
To find such a statistic that is distribution-free, we must make two
additional assumptions:
(a) F(x) is the distribution function of a continuous type of random
variable X.
(b) The p.d.f. f(x) of X has a graph that is symmetric about the
vertical axis through ξ_{0.5}, the median (which we assume to be unique)
of the distribution.
Thus

f(ξ_{0.5} - x) = f(ξ_{0.5} + x)

and

F(ξ_{0.5} - x) = 1 - F(ξ_{0.5} + x)
If we recall that Pr(X_i = 0) = 0, we see that it does not change the
probabilities whether we associate Z_i = 1 or Z_i = -1 with the out-
come X_i = 0.

The statistic W = \sum_{i=1}^{n} Z_i R_i is the Wilcoxon statistic. Note that in
computing this statistic we simply associate the sign of each X_i with
the rank of its absolute value and sum the resulting n products.
If the alternative to the hypothesis H_0: ξ_{0.5} = 0 is H_1: ξ_{0.5} > 0,
we reject H_0 if the observed value of W is an element of the set
{w; w ≥ c}. This is due to the fact that large positive values of W
indicate that most of the large deviations from zero are positive. For
alternatives ξ_{0.5} < 0 and ξ_{0.5} ≠ 0 the critical regions are, respectively,
the sets {w; w ≤ c_1} and {w; w ≤ c_2 or w ≥ c_3}. To compute probabilities
like Pr(W ≥ c; H_0), we need to determine the distribution of W,
under H_0.
To help us find the distribution of W, when H_0: F(0) = 1/2 is true,
we note the following facts:
(a) The assumption that f(x) = f(-x) ensures that Pr(X_i < 0) =
Pr(X_i > 0) = 1/2, i = 1, 2, ..., n.
M(t) = \prod_{i=1}^{n} \frac{e^{-it} + e^{it}}{2}.
g(w) = 1/8,   w = -6, -4, -2, 2, 4, 6,
     = 2/8,   w = 0,
     = 0 elsewhere.
\lim_{n \to \infty} \frac{\sum_{i=1}^{n} E(|U_i - \mu_i|^3)}{\left(\sum_{i=1}^{n} \sigma_i^2\right)^{3/2}} = 0,
then

\frac{\sum_{i=1}^{n} U_i - \sum_{i=1}^{n} \mu_i}{\sqrt{\sum_{i=1}^{n} \sigma_i^2}}

has a limiting distribution that is n(0, 1). For our variables V_1, V_2,
..., V_n we have
The variance of V_i is (-i)^2(1/2) + (i)^2(1/2) = i^2. Thus the variance of W is

\sigma_W^2 = \sum_{i=1}^{n} i^2 = \frac{n(n + 1)(2n + 1)}{6}.
For large values of n, the determination of the exact distribution of
W becomes tedious. Accordingly, one looks for an approximating
distribution. Although W is distributed as is the sum of n random
variables that are mutually stochastically independent, our form of the
central limit theorem cannot be applied because the n random variables
do not have identical distributions. However, a more general theorem,
due to Liapounov, states that if U_i has mean μ_i and variance σ_i^2,
i = 1, 2, ..., n, if U_1, U_2, ..., U_n are mutually stochastically inde-
pendent, if E(|U_i - μ_i|^3) is finite for every i, and if
\mu_W = E(W) = \sum_{i=1}^{n} E(V_i) = 0.
The mean and the variance of W are more easily computed directly
than by working with the moment-generating function M(t). Because
V = \sum_{i=1}^{n} V_i and W = \sum_{i=1}^{n} Z_i R_i have the same distribution, they have the
same mean and the same variance. When the hypothesis H_0: F(0) = 1/2
is true, it is easy to determine the values of these two characteristics of
the distribution of W. Since E(V_i) = 0, i = 1, 2, ..., n, we have
Thus the p.d.f. of W, for n = 3, is given by
We can express M(t) as the sum of terms of the form (a_j/2^n)e^{b_j t}. When
M(t) is written in this manner, we can determine by inspection the
p.d.f. of the discrete-type random variable W. For example, the smallest
value of W is found from the term (1/2^n)e^{-t}e^{-2t}\cdots e^{-nt} = (1/2^n)e^{-n(n+1)t/2},
and it is -n(n + 1)/2. The probability of this value of W is the
coefficient 1/2^n. To make these statements more concrete, take n = 3.
Then
The preceding observations enable us to say that W = \sum_{i=1}^{n} Z_i R_i
has the same distribution as the random variable V = \sum_{i=1}^{n} V_i, where
V_1, V_2, ..., V_n are mutually stochastically independent and

Pr(V_i = i) = Pr(V_i = -i) = 1/2,

i = 1, 2, ..., n. That V_1, V_2, ..., V_n are mutually stochastically
independent follows from the fact that Z_1, Z_2, ..., Z_n have that
property; that is, the numbers 1, 2, ..., n always appear in a sum W
and those numbers receive their algebraic signs by independent
assignment. Thus each of V_1, V_2, ..., V_n is like one and only one of
Z_1R_1, Z_2R_2, ..., Z_nR_n.
Since W and V have the same distribution, the moment-generating
function of W is that of V,

M(t) = \left(\frac{e^{-t} + e^{t}}{2}\right)\left(\frac{e^{-2t} + e^{2t}}{2}\right)\left(\frac{e^{-3t} + e^{3t}}{2}\right)
     = \frac{1}{8}\left(e^{-6t} + e^{-4t} + e^{-2t} + 2 + e^{2t} + e^{4t} + e^{6t}\right).
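For small n, the p.d.f. of W can also be obtained by direct enumeration of the 2^n equally likely sign assignments, which is exactly what the expansion of M(t) amounts to. A minimal sketch (not from the text):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

def wilcoxon_null_distribution(n):
    """Exact null p.d.f. of W = sum Z_i R_i: each rank 1, ..., n receives an
    independent +/- sign with probability 1/2."""
    dist = Counter()
    for signs in product((-1, 1), repeat=n):
        w = sum(s * r for s, r in zip(signs, range(1, n + 1)))
        dist[w] += Fraction(1, 2**n)
    return dict(sorted(dist.items()))

print(wilcoxon_null_distribution(3))
# {-6: 1/8, -4: 1/8, -2: 1/8, 0: 1/4, 2: 1/8, 4: 1/8, 6: 1/8}
```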
(b) Now Z_i = -1 if X_i < 0 and Z_i = 1 if X_i > 0, i = 1, 2, ..., n.
Hence we have Pr(Z_i = -1) = Pr(Z_i = 1) = 1/2, i = 1, 2, ..., n.
Moreover, Z_1, Z_2, ..., Z_n are mutually stochastically independent
because X_1, X_2, ..., X_n are mutually stochastically independent.
(c) The assumption that f(x) = f(-x) also assures that the rank R_i
of |X_i| does not depend upon the sign Z_i of X_i. More generally, R_1, R_2,
..., R_n are stochastically independent of Z_1, Z_2, ..., Z_n.
(d) A sum W is made up of the numbers 1, 2, ..., n, each number
with either a positive or a negative sign.
and it is known that

\sum_{i=1}^{n} i^3 = \frac{n^2(n + 1)^2}{4}.

Now

\lim_{n \to \infty} \frac{n^2(n + 1)^2/4}{[n(n + 1)(2n + 1)/6]^{3/2}} = 0

because the numerator is of order n^4 and the denominator is of order
n^{9/2}. Thus

\frac{W}{\sqrt{n(n + 1)(2n + 1)/6}}

is approximately n(0, 1) when H_0 is true. This allows us to approximate
probabilities like Pr(W ≥ c; H_0) when the sample size n is large.
Example 1. Let ξ_{0.5} be the median of a symmetric distribution that is
of the continuous type. To test, with α = 0.01, the hypothesis H_0: ξ_{0.5} = 75
against H_1: ξ_{0.5} > 75, we observed a random sample of size n = 18. Let it be
given that the deviations of these 18 values from 75 are the following
numbers:

1.5, -0.5, 1.6, 0.4, 2.3, -0.8, 3.2, 0.9, 2.9,
0.3, 1.8, -0.1, 1.2, 2.5, 0.6, -0.7, 1.9, 1.3.

The experimental value of the Wilcoxon statistic is equal to

w = 11 - 4 + 12 + 3 + 15 - 7 + 18 + 8 + 17 + 2 + 13 - 1 + 9 + 16 + 5 - 6 + 14 + 10 = 135.

Since, with n = 18, \sqrt{n(n + 1)(2n + 1)/6} = 45.92, we have that

0.01 = Pr(W/45.92 ≥ 2.326) = Pr(W ≥ 106.8).

Because w = 135 > 106.8, we reject H_0 at the approximate 0.01 significance
level.
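The arithmetic of Example 1 can be reproduced as follows. This sketch simply ranks the absolute deviations, attaches the signs, and forms the normal-approximation ratio (ties are ignored, as the text assumes a continuous distribution):

```python
import math

# the 18 deviations from 75 given in Example 1
devs = [1.5, -0.5, 1.6, 0.4, 2.3, -0.8, 3.2, 0.9, 2.9,
        0.3, 1.8, -0.1, 1.2, 2.5, 0.6, -0.7, 1.9, 1.3]

n = len(devs)
order = sorted(range(n), key=lambda i: abs(devs[i]))
ranks = [0] * n
for rank, idx in enumerate(order, start=1):
    ranks[idx] = rank                      # rank of |X_i| among the |X|'s

w = sum((1 if d > 0 else -1) * r for d, r in zip(devs, ranks))
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
print(w, round(sigma, 2), round(w / sigma, 3))   # 135, 45.92, 2.94
```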
There are many modifications and generalizations of the Wilcoxon
statistic. One generalization is the following: Let c_1 ≤ c_2 ≤ ... ≤ c_n be
nonnegative numbers. Then, in the Wilcoxon statistic, replace the
ranks 1, 2, ..., n by c_1, c_2, ..., c_n, respectively. For example, if n = 3
and if we have |X_2| < |X_3| < |X_1|, then R_1 = 3 is replaced by c_3,
R_2 = 1 by c_1, and R_3 = 2 by c_2. In this example the generalized
statistic is given by Z_1c_3 + Z_2c_1 + Z_3c_2. Similar to the Wilcoxon
statistic, this generalized statistic is distributed, under H_0, as is the sum
of n stochastically independent random variables, the ith of which
takes each of the values c_i ≠ 0 and -c_i with probability 1/2; if c_i = 0,
that variable takes the value c_i = 0 with probability 1. Some special
cases of this statistic are proposed in the Exercises.
EXERCISES
9.15. The observed values of a random sample of size 10 from a distri-
bution that is symmetric about ξ_{0.5} are 10.2, 14.1, 9.2, 11.3, 7.2, 9.8, 6.5,
11.8, 8.7, 10.8. Use Wilcoxon's statistic to test the hypothesis H_0: ξ_{0.5} = 8
against H_1: ξ_{0.5} > 8 if α = 0.05. Even though n is small, use the normal
approximation.
9.16. Find the distribution of W for n = 4 and n = 5. Hint. Multiply
the moment-generating function of W, with n = 3, by (e^{-4t} + e^{4t})/2 to get
that of W, with n = 4.
9.17. Let X_1, X_2, ..., X_n be mutually stochastically independent. If the
p.d.f. of X_i is uniform over the interval (-2^{1-i}, 2^{1-i}), i = 1, 2, 3, ..., show
that Liapounov's condition is not satisfied. The sum \sum_{i=1}^{n} X_i does not have an
approximate normal distribution because the first random variables in the
sum tend to dominate it.
9.18. If n = 4 and, in the notation of the text, c_1 = 1, c_2 = 2, c_3 =
c_4 = 3, find the distribution of the generalization of the Wilcoxon statistic,
say W_g. For a general n, find the mean and the variance of W_g if c_i = i,
i ≤ n/2, and c_i = [n/2] + 1, i > n/2, where [z] is the greatest integer function.
Does Liapounov's condition hold here?
9.19. A modification of Wilcoxon's statistic that is frequently used is
achieved by replacing R_i by R_i - 1; that is, use the modification W_m =
\sum_{i=1}^{n} Z_i(R_i - 1). Show that W_m/\sqrt{(n - 1)n(2n - 1)/6} has a limiting distri-
bution that is n(0, 1).
9.20. If, in the discussion of the generalization of the Wilcoxon statistic,
we let c_1 = c_2 = ... = c_n = 1, show that we obtain a statistic equivalent to
that used in the sign test.
9.21. If c_1, c_2, ..., c_n are selected so that

\frac{i}{n + 1} = \int_{-\infty}^{c_i} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx,   i = 1, 2, ..., n,

the generalized Wilcoxon W_g is an example of a normal scores statistic. If
n = 9, compute the mean and the variance of this W_g.
9.22. If c_i = 2^i, i = 1, 2, ..., n, the corresponding W_g is called the binary
statistic. Find the mean and the variance of this W_g. Is Liapounov's con-
dition satisfied?
9.5 The Equality of Two Distributions
If F(z) = G(z), for all z, then p_{i1} = p_{i2}, i = 1, 2, ..., k. Accordingly,
the hypothesis that F(z) = G(z), for all z, is replaced by the less
restrictive hypothesis H_0: p_{i1} = p_{i2}, i = 1, 2, ..., k.
number of items. (If the sample sizes are such that this is impossible, a
partition with approximately the same number of items in each group
suffices.) In effect, then, the partition A_1, A_2, ..., A_k is determined by
the experimental values themselves. This does not alter the fact that the
statistic, discussed in Example 3, Section 8.1, has a limiting distribution
that is χ²(k - 1). Accordingly, the procedures used in that example
may be used here.
Among the tests of this type there is one that is frequently used.
It is essentially a test of the equality of the medians of two independent
distributions. To simplify the discussion, we assume that m + n, the
size of the combined sample, is an even number, say m + n = 2h,
where h is a positive integer. We take k = 2 and the combined sample
of size m + n = 2h, which has been ordered, is separated into two parts,
a "lower half" and an "upper half," each containing h = (m + n)/2 of
the experimental values of X and Y. The statistic, suggested by Example
3, Section 8.1, could be used because it has, when H_0 is true, a limiting
distribution that is χ²(1). However, it is more interesting to find
the exact distribution of another statistic which enables us to test the
hypothesis H_0 against the alternative H_1: F(z) ≥ G(z) or against the
alternative H_1: F(z) ≤ G(z), as opposed to merely F(z) ≠ G(z). [Here,
and in the sequel, alternatives F(z) ≥ G(z) and F(z) ≤ G(z) and
F(z) ≠ G(z) mean that strict inequality holds on some set of positive
probability measure.] This other statistic is V, which is the number of
observed values of X that are in the lower half of the combined sample.
If the observed value of V is quite large, one might suspect that the
median of the distribution of X is smaller than that of the distribution
of Y. Thus the critical region of this test of the hypothesis H_0: F(z) =
G(z), for all z, against H_1: F(z) ≥ G(z) is of the form V ≥ c. Because
our combined sample is of even size, there is no unique median of the
sample. However, one can arbitrarily insert a number between the hth
and (h + 1)st ordered items and call it the median of the sample. On
this account, a test of the sort just described is called a median test.
Incidentally, if the alternative hypothesis is H_1: F(z) ≤ G(z), the
critical region is of the form V ≤ c.
The distribution of V is quite easy to find if the distribution func-
tions F(x) and G(y) are of the continuous type and if F(z) = G(z), for
all z. We shall now show that V has a hypergeometric p.d.f. Let
m + n = 2h, h a positive integer. To compute Pr(V = v), we need the
probability that exactly v of X_1, X_2, ..., X_m are in the lower half of
the ordered combined sample. Under our assumptions, the probability
is zero that any two of the 2h random variables are equal. The smallest
9.23. In the definition of Wilcoxon's statistic, let W_1 be the sum of the
ranks of those items of the sample that are positive and let W_2 be the sum
of the ranks of those items that are negative. Then W = W_1 - W_2.
(a) Show that W = 2W_1 - n(n + 1)/2 and W = n(n + 1)/2 - 2W_2.
(b) Compute the mean and the variance of each of W_1 and W_2.
In Sections 9.3 and 9.4 some tests of hypotheses about one distri-
bution were investigated. In this section, as in the next section, various
tests of the equality of two independent distributions are studied. By
the equality of two distributions, we mean that the two distribution
functions, say F and G, have F(z) = G(z) for all values of z.
The first test that we discuss is a natural extension of the chi-square
test. Let X and Y be stochastically independent variables with distri-
bution functions F(x) and G(y), respectively. We wish to test the
hypothesis that F(z) = G(z), for all z. Let us partition the real line into
k mutually disjoint sets A_1, A_2, ..., A_k. Define

p_{i1} = Pr(X ∈ A_i)  and  p_{i2} = Pr(Y ∈ A_i),   i = 1, 2, ..., k.
But this is exactly the problem of testing the equality of two inde-
pendent multinomial distributions that was considered in Example 3,
Section 8.1, and the reader is referred to that example for the details.
Some statisticians prefer a procedure which eliminates some of the
subjectivity of selecting the partitions. For a fixed positive integer k,
proceed as follows. Consider a random sample of size m from the
distribution of X and a random sample of size n from the independent
distribution of Y. Let the experimental values be denoted by x_1, x_2,
..., x_m and y_1, y_2, ..., y_n. Then combine the two samples into one
sample of size m + n and order the m + n values (not their absolute
values) in ascending order of magnitude. These ordered items are then
partitioned into k parts in such a way that each part has the same
h of the m + n = 2h items can be selected in any one of \binom{2h}{h} ways.
Each of these ways has the same probability. Of these \binom{2h}{h} ways, we
need to count the number of those in which exactly v of the m values of
X (and hence h - v of the n values of Y) appear in the lower h items.
But this is \binom{m}{v}\binom{n}{h-v}. Thus the p.d.f. of V is the hypergeometric p.d.f.

k(v) = Pr(V = v) = \frac{\binom{m}{v}\binom{n}{h-v}}{\binom{m+n}{h}},   v = 0, 1, 2, ..., m,
     = 0 elsewhere,
two values of X, and so on. In our example, there is a total of eight
runs. Three are runs of length 1; three are runs of length 2; and two
are runs of length 3. Note that the total number of runs is always
one more than the number of unlike adjacent symbols.
Of what can runs be suggestive? Suppose that with m = 7 and
n = 8 we have the following ordering:
x x x x x  y  x x  y y y y y y y.
To us, this strongly suggests that F(z) > G(z). For if, in fact, F(z) =
G(z) for all z, we would anticipate a greater number of runs. And if the
first run of five values of X were interchanged with the last run of seven
values of Y, this would suggest that F(z) < G(z). But runs can be
suggestive of other things. For example, with m = 7 and n = 8,
consider the runs.
where m + n = 2h.
The reader may be momentarily puzzled by the meaning of \binom{n}{h-v}
for v = 0, 1, 2, ..., m. For example, let m = 17, n = 3, so that h = 10.
Then we have \binom{3}{10-v}, v = 0, 1, ..., 17. However, we take \binom{n}{h-v}
to be zero if h - v is negative or if h - v > n.
If m + n is an odd number, say m + n = 2h + 1, it is left to the
reader to show that the p.d.f. k(v) gives the probability that exactly v
of the m values of X are among the lower h of the combined 2h + 1
values; that is, exactly v of the m values of X are less than the median
of the combined sample.
If the distribution functions F(x) and G(y) are of the continuous
type, there is another rather simple test of the hypothesis that F(z) =
G(z), for all z. This test is based upon the notion of runs of values of X
and of values of Y. We shall now explain what we mean by runs. Let
us again combine the sample of m values of X and the sample of n values
of Y into one collection of m + n ordered items arranged in ascending
order of magnitude. With m = 7 and n = 8 we might find that the 15
ordered items were in the arrangement
Note that in this ordering we have underscored the groups of successive
values of the random variable X and those of the random variable Y.
If we read from left to right, we would say that we have a run of one
value of X, followed by a run of three values of Y, followed by a run of
yyyy xxxxxxx yyyy.
This suggests to us that the medians of the distributions of X and Y
may very well be about the same, but that the "spread" (measured
possibly by the standard deviation) of the distribution of X is con-
siderably less than that of the distribution of Y.
Let the random variable R equal the number of runs in the com-
bined sample, once the combined sample has been ordered. Because our
random variables X and Y are of the continuous type, we may assume
that no two of these sample items are equal. We wish to find the p.d.f.
of R. To find this distribution, when F(z) = G(z), we shall suppose that
all arrangements of the m values of X and the n values of Y have equal
probabilities. We shall show that
(1)  Pr(R = 2k) = \frac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}},
     Pr(R = 2k + 1) = \frac{\binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1}\binom{n-1}{k}}{\binom{m+n}{m}},

when 2k and 2k + 1 are elements of the space of R.
To prove formulas (1), note that we can select the m positions for
the m values of X from the m + n positions in any one of \binom{m+n}{m}
ways. Since each of these choices yields one arrangement, the probability
of each arrangement is equal to 1/\binom{m+n}{m}. The problem is now to
determine how many of these arrangements yield R = r, where r is an
integer in the space of R. First, let r = 2k + 1, where k is a positive
integer. This means that there must be k + 1 runs of the ordered values
of X and k runs of the ordered values of Y, or vice versa. Consider first
the number of ways of obtaining k + 1 runs of the m values of X.
We can form k + 1 of these runs by inserting k "dividers" into the
m - 1 spaces between the values of X, with no more than one divider
per space. This can be done in any one of \binom{m-1}{k} ways. Similarly, we
can construct k runs of the n values of Y by inserting k - 1 dividers
into the n - 1 spaces between the values of Y, with no more than one
divider per space. This can be done in any one of \binom{n-1}{k-1} ways. The
joint operation can be performed in any one of \binom{m-1}{k}\binom{n-1}{k-1}
ways. These two sets of runs can be placed together to form r = 2k + 1
runs. But we could also have k runs of the values of X and k + 1 runs
of the values of Y. An argument similar to the preceding shows that this
can be effected in any one of \binom{m-1}{k-1}\binom{n-1}{k} ways. Thus

Pr(R = 2k + 1) = \frac{\binom{m-1}{k}\binom{n-1}{k-1} + \binom{m-1}{k-1}\binom{n-1}{k}}{\binom{m+n}{m}},
If the critical region of this run test of the hypothesis H_0: F(z) =
G(z), for all z, is of the form R ≤ c, it is easy to compute α = Pr(R ≤ c; H_0),
provided that m and n are small. Although it is not easy to show, the
distribution of R can be approximated, with large sample sizes m and n,
by a normal distribution with mean

μ = E(R) = \frac{2mn}{m + n} + 1

and variance

σ^2 = \frac{(μ - 1)(μ - 2)}{m + n - 1}.
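A short sketch of the computations just described, applied to the second illustrative arrangement of the text (m = 7, n = 8): count the runs in the ordered combined sample and form the normal-approximation ratio.

```python
import math

def run_test_normal_approx(labels):
    """Count runs in a sequence of 'x'/'y' labels (the ordered combined
    sample) and standardize R with the approximating normal moments."""
    m = labels.count('x')
    n = labels.count('y')
    r = 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    mu = 2 * m * n / (m + n) + 1
    var = (mu - 1) * (mu - 2) / (m + n - 1)
    return r, round(mu, 3), round(var, 3), round((r - mu) / math.sqrt(var), 3)

# the arrangement y y y y x x x x x x x y y y y discussed in the text
print(run_test_normal_approx(list("yyyyxxxxxxxyyyy")))
```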
The run test may also be used to test for randomness. That is, it can
be used as a check to see if it is reasonable to treat X_1, X_2, ..., X_s as a
random sample of size s from some continuous distribution. To facili-
tate the discussion, take s to be even. We are given the s values of X to
be x_1, x_2, ..., x_s, which are not ordered by magnitude but by the order
in which they were observed. However, there are s/2 of these values,
each of which is smaller than the remaining s/2 values. Thus we have a
"lower half" and an "upper half" of these values. In the sequence
x_1, x_2, ..., x_s, replace each value x that is in the lower half by the
letter L and each value in the upper half by the letter U. Then, for
example, with s = 10, a sequence such as
example, with s = 10, a sequence such as
LLLLULUUUU
which is the second of formulas (1).
which is the first of formulas (1).
If r = 2k, where k is a positive integer, we see that the ordered
values of X and the ordered values of Y must each be separated into
k runs. These operations can be performed in any one of \binom{m-1}{k-1}
and \binom{n-1}{k-1} ways, respectively. These two sets of runs can be placed
together to form r = 2k runs. But we may begin with either a run of
values of X or a run of values of Y. Accordingly, the probability of 2k
runs is

Pr(R = 2k) = \frac{2\binom{m-1}{k-1}\binom{n-1}{k-1}}{\binom{m+n}{m}}.
may suggest a trend toward increasing values of X; that is, these
values of X may not reasonably be looked upon as being the items of a
random sample. If trend is the only alternative to randomness, we can
make a test based upon R and reject the hypothesis of randomness if
R ≤ c. To make this test, we would use the p.d.f. of R with m = n = s/2.
On the other hand if, with s = 10, we find a sequence such as

L U L U L U L U L U,

our suspicions are aroused that there may be a nonrandom effect which
is cyclic even though R = 10. Accordingly, to test for a trend or a
cyclic effect, we could use a critical region of the form R ≤ c_1 or
R ≥ c_2.
If the sample size s is odd, the number of sample items in the
"upper half" and the number in the" lower half" will differ by one.
Then, for example, we could use the p.d.f. of R with m = (s - 1)/2
and n = (s + 1)/2, or vice versa.
We note that

Z_{ij} = 1,   if X_i < Y_j,
      = 0,   if X_i > Y_j,

and consider the statistic

U = \sum_{j=1}^{n} \sum_{i=1}^{m} Z_{ij}.

The sum

\sum_{i=1}^{m} Z_{ij}

counts the number of values of X that are less than Y_j, j = 1, 2, ..., n.
Thus U is the sum of these n counts. For example, with m = 4 and
n = 3, consider the observations
G(y) denote, respectively, the distribution functions of X and Y and let
X_1, X_2, ..., X_m and Y_1, Y_2, ..., Y_n denote independent samples from
these distributions. We shall discuss the Mann-Whitney-Wilcoxon test
of the hypothesis H_0: F(z) = G(z) for all values of z.
Let us define
9.25. In the median test, with m = 9 and n = 7, find the p.d.f. of the
random variable V, the number of values of X in the lower half of the
combined sample. In particular, what are the values of the probabilities
Pr (V = 0) and Pr (V = 9)?
EXERCISES
9.24. Let 3.1, 5.6, 4.7, 3.8, 4.2, 3.0, 5.1, 3.9, 4.8 and 5.3, 4.0, 4.9, 6.2,
3.7, 5.0, 6.5, 4.5, 5.5, 5.9, 4.4, 5.8 be observed samples of sizes m = 9 and
n = 12 from two independent distributions. With k = 3, use a chi-square
test to test, with α = 0.05 approximately, the equality of the two distri-
butions.
9.26. In the notation of the text, use the median test and the data given
in Exercise 9.24 to test, with α = 0.05, approximately, the hypothesis of the
equality of the two independent distributions against the alternative
hypothesis that F(z) ≥ G(z). If the exact probabilities are too difficult to
determine for m = 9 and n = 12, approximate these probabilities.
9.27. Using the notation of this section, let U be the number of observed
values of X in the smallest d items of the combined sample of m + n items.
Argue that

Pr(U = u) = \frac{\binom{m}{u}\binom{n}{d-u}}{\binom{m+n}{d}},   u = 0, 1, ..., m.

The statistic U could be used to test the equality of the (100p)th percentiles,
where (m + n)p = d, of the distributions of X and Y.
9.28. In the discussion of the run test, let the random variables R_1 and
R_2 be, respectively, the number of runs of the values of X and the number of
runs of the values of Y. Then R = R_1 + R_2. Let the pair (r_1, r_2) of integers
be in the space of (R_1, R_2); then |r_1 - r_2| ≤ 1. Show that the joint p.d.f. of
R_1 and R_2 is 2\binom{m-1}{r_1-1}\binom{n-1}{r_2-1}\big/\binom{m+n}{m} if r_1 = r_2; that this joint p.d.f. is
\binom{m-1}{r_1-1}\binom{n-1}{r_2-1}\big/\binom{m+n}{m} if |r_1 - r_2| = 1; and is zero elsewhere. Show
that the marginal p.d.f. of R_1 is \binom{m-1}{r_1-1}\binom{n+1}{r_1}\big/\binom{m+n}{m}, r_1 = 1, ..., m,
and is zero elsewhere. Find E(R_1). In a similar manner, find E(R_2). Compute
E(R) = E(R_1) + E(R_2).
9.6 The Mann-Whitney-Wilcoxon Test
We return to the problem of testing the equality of two independent
distributions of the continuous type. Let X and Y be stochastically
independent random variables of the continuous type. Let F(x) and
There are three values of x that are less than y_1; there are four values
of x that are less than y_2; and there is one value of x that is less than y_3.
Thus the experimental value of U is u = 3 + 4 + 1 = 8.
Clearly, the smallest value which U can take is zero, and the largest
value is mn. Thus the space of U is {u; u = 0, 1, 2, ..., mn}. If U is
large, the values of Y tend to be larger than the values of X, and this
suggests that F(z) ≥ G(z) for all z. On the other hand, a small value of
U suggests that F(z) ≤ G(z) for all z. Thus, if we test the hypothesis
H_0: F(z) = G(z) for all z against the alternative hypothesis H_1: F(z) ≥
G(z) for all z, the critical region is of the form U ≥ c_1. If the alternative
hypothesis is H_1: F(z) ≤ G(z) for all z, the critical region is of the form
U ≤ c_2. To determine the size of a critical region, we need the distri-
bution of U when H_0 is true.
If u belongs to the space of U, let us denote Pr (U = u) by the
symbol h(u; m, n). This notation focuses attention on the sample sizes
m and n. To determine the probability h(u; m, n), we first note that we
have m + n positions to be filled by m values of X and n values of Y.
We can fill m positions with the values of X in any one of \binom{m+n}{m}
ways. Once this has been done, the remaining n positions can be filled
with the values of Y. When H_0 is true, each of these arrangements has
the same probability, 1/\binom{m+n}{m}. The final right-hand position of an
arrangement may be either a value of X or a value of Y. This position
can be filled in anyone of m + n ways, m of which are favorable to X
and n of which are favorable to Y. Accordingly, the probability that
an arrangement ends with a value of X is m/(m + n) and the prob-
ability that an arrangement terminates with a value of Y is n/(m + n).
Now U can equal u in two mutually exclusive and exhaustive ways:
(1) The final right-hand position (the largest of the m + n values) in the
arrangement may be a value of X and the remaining (m - 1) values of
X and the n values of Y can be arranged so as to have U = u. The
probability that U = u, given an arrangement that terminates with a
value of X, is given by h(u; m - 1, n). Or (2) the largest value in the
arrangement can be a value of Y. This value of Y is greater than m
values of X. If we are to have U = u, the sum of n - 1 counts of the m
values of X with respect to the remaining n - 1 values of Y must be
u - m. Thus the probability that U = u, given an arrangement that
terminates in a value of Y, is given by h(u - m; m, n - 1). Accordingly,
the probability that U = u is
h(u; m, n) = \frac{m}{m+n}\, h(u; m - 1, n) + \frac{n}{m+n}\, h(u - m; m, n - 1).
We impose the following reasonable restrictions upon the function
h(u; m, n):
and if m = 1, n = 2, we have

h(0; 1, 2) = (1/3)h(0; 0, 2) + (2/3)h(-1; 1, 1) = (1/3)(1) + (2/3)(0) = 1/3,
h(1; 1, 2) = (1/3)h(1; 0, 2) + (2/3)h(0; 1, 1) = (1/3)(0) + (2/3)(1/2) = 1/3,
h(2; 1, 2) = (1/3)h(2; 0, 2) + (2/3)h(1; 1, 1) = (1/3)(0) + (2/3)(1/2) = 1/3.

In Exercise 9.29 the reader is to determine the distribution of U when
m = 2, n = 1; m = 2, n = 2; m = 1, n = 3; and m = 3, n = 1.
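The recursion and its boundary conditions translate directly into a short program; exact rational arithmetic keeps the probabilities in the form used above. A minimal sketch:

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def h(u, m, n):
    """Null p.d.f. Pr(U = u) of the Mann-Whitney-Wilcoxon statistic,
    from the recursion and boundary conditions of the text."""
    if u < 0:
        return Fraction(0)
    if m == 0 or n == 0:
        return Fraction(1) if u == 0 else Fraction(0)
    return (Fraction(m, m + n) * h(u, m - 1, n)
            + Fraction(n, m + n) * h(u - m, m, n - 1))

print([h(u, 1, 2) for u in range(3)])   # [1/3, 1/3, 1/3], as computed above
print([h(u, 2, 2) for u in range(5)])   # part (b) of Exercise 9.29
```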
For large values of m and n, it is desirable to use an approximate
distribution of U. Consider the mean and the variance of U when the
hypothesis H_0: F(z) = G(z), for all values of z, is true. Since U =
\sum_{j=1}^{n}\sum_{i=1}^{m} Z_{ij}, then

E(U) = \sum_{i=1}^{m}\sum_{j=1}^{n} E(Z_{ij}).

But

E(Z_{ij}) = (1)\, Pr(X_i < Y_j) + (0)\, Pr(X_i > Y_j) = 1/2

because, when H_0 is true, Pr(X_i < Y_j) = Pr(X_i > Y_j) = 1/2. Thus

E(U) = \sum_{i=1}^{m}\sum_{j=1}^{n} \frac{1}{2} = \frac{mn}{2}.
To compute the variance of U, we first find
Then it is easy, for small values of m and n, to compute these probabilities.
For example, if m = n = 1, we have

h(0; 1, 1) = (1/2)h(0; 0, 1) + (1/2)h(-1; 1, 0) = (1/2)(1) + (1/2)(0) = 1/2,
h(1; 1, 1) = (1/2)h(1; 0, 1) + (1/2)h(0; 1, 0) = (1/2)(0) + (1/2)(1) = 1/2;
h(u; 0, n) = 1,   u = 0,
           = 0,   u > 0, n ≥ 1,

h(u; m, 0) = 1,   u = 0,
           = 0,   u > 0, m ≥ 1,

and

h(u; m, n) = 0,   u < 0, m ≥ 0, n ≥ 0.
E(U^2) = \sum_{j=1}^{n}\sum_{i=1}^{m} E(Z_{ij}^2) + \sum_{k=1}^{n}\sum_{j \neq k}\sum_{i=1}^{m} E(Z_{ij}Z_{ik})
       + \sum_{j=1}^{n}\sum_{h=1}^{m}\sum_{i \neq h} E(Z_{ij}Z_{hj}) + \sum_{k=1}^{n}\sum_{j \neq k}\sum_{h=1}^{m}\sum_{i \neq h} E(Z_{ij}Z_{hk}).

Note that there are mn terms in the first of these sums, mn(n - 1)
in the second, mn(m - 1) in the third, and mn(m - 1)(n - 1) in the
fourth. When H_0 is true, we know that X_i, X_h, Y_j, and Y_k, i ≠ h,
j ≠ k, are mutually stochastically independent and have the same
distribution of the continuous type. Thus Pr(X_i < Y_j) = 1/2. Moreover,
Pr(X_i < Y_j, X_i < Y_k) = 1/3 because this is the probability that a
designated one of three items is less than each of the other two.
Similarly, Pr(X_i < Y_j, X_h < Y_j) = 1/3, i ≠ h. Finally, Pr(X_i < Y_j, X_h < Y_k)
= Pr(X_i < Y_j) Pr(X_h < Y_k) = 1/4, i ≠ h, j ≠ k. Hence we have

E(Z_{ij}^2) = (1)^2\, Pr(X_i < Y_j) = 1/2,
E(Z_{ij}Z_{ik}) = (1)(1)\, Pr(X_i < Y_j, X_i < Y_k) = 1/3,   j ≠ k,
E(Z_{ij}Z_{hj}) = (1)(1)\, Pr(X_i < Y_j, X_h < Y_j) = 1/3,   i ≠ h,
E(Z_{ij}Z_{hk}) = 1/4,   i ≠ h, j ≠ k.

Thus

E(U^2) = \frac{mn}{2} + \frac{mn(n-1)}{3} + \frac{mn(m-1)}{3} + \frac{mn(m-1)(n-1)}{4}

and

\sigma_U^2 = mn\left[\frac{1}{2} + \frac{n-1}{3} + \frac{m-1}{3} + \frac{(m-1)(n-1)}{4} - \frac{mn}{4}\right]
           = \frac{mn(m + n + 1)}{12}.
Although it is fairly difficult to prove, it is true, when F(z) = G(z) for
all z, that

\frac{U - \dfrac{mn}{2}}{\sqrt{\dfrac{mn(m + n + 1)}{12}}}

has, if each of m and n is large, an approximate distribution that
is n(0, 1). This fact enables us to compute, approximately, various
significance levels.
Prior to the introduction of the statistic U in the statistical litera-
ture, it had been suggested that a test of H_0: F(z) = G(z), for all z, be
based upon the following statistic, say T (not Student's t). Let T be the
sum of the ranks of Y_1, Y_2, ..., Y_n among the m + n items X_1, ...,
X_m, Y_1, ..., Y_n, once this combined sample has been ordered. In
Exercise 9.31 the reader is asked to show that

U = T - \frac{n(n + 1)}{2}.

This formula provides another method of computing U and it shows that
a test of H_0 based on U is equivalent to a test based on T. A generaliza-
tion of T is considered in Section 9.8.
Example 1. With the assumptions and the notation of this section, let
m = 10 and n = 9. Let the observed values of X be as given in the first row
and the observed values of Y as in the second row of the following display:

4.3, 5.9, 4.9, 3.1, 5.3, 6.4, 6.2, 3.8, 7.5, 5.8,
5.5, 7.9, 6.8, 9.0, 5.6, 6.3, 8.5, 4.6, 7.1.

Since, in the combined sample, the ranks of the values of y are 4, 7, 8, 12, 14,
15, 17, 18, 19, we have the experimental value of T to be equal to t = 114.
Thus u = 114 - 45 = 69. If F(z) = G(z) for all z, then, approximately,

0.05 = Pr\left(\frac{U - 45}{12.247} ≥ 1.645\right) = Pr(U ≥ 65.146).

Accordingly, at the 0.05 significance level, we reject the hypothesis
H_0: F(z) = G(z), for all z, and accept the alternative hypothesis H_1: F(z) ≥
G(z), for all z.
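The computations of Example 1 can be checked as follows. This sketch ranks the y values in the combined sample, forms T, then U = T - n(n + 1)/2, and finally the standardized ratio used in the normal approximation:

```python
import math

x = [4.3, 5.9, 4.9, 3.1, 5.3, 6.4, 6.2, 3.8, 7.5, 5.8]
y = [5.5, 7.9, 6.8, 9.0, 5.6, 6.3, 8.5, 4.6, 7.1]

combined = sorted(x + y)
t = sum(combined.index(v) + 1 for v in y)   # rank of each y value (no ties here)
m, n = len(x), len(y)
u = t - n * (n + 1) // 2

mean_u = m * n / 2
sd_u = math.sqrt(m * n * (m + n + 1) / 12)
print(t, u, round((u - mean_u) / sd_u, 3))   # 114, 69, about 1.96
```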
EXERCISES
9.29. Compute the distribution of U in each of the following cases:
(a) m = 2, n = 1; (b) m = 2, n = 2; (c) m = 1, n = 3; (d) m = 3, n = 1.
9.30. Suppose the hypothesis H_0: F(z) = G(z), for all z, is not true. Let
p = Pr(X_i < Y_j). Show that U/mn is an unbiased estimator of p and that it
converges stochastically to p as m → ∞ and n → ∞.
9.31. Show that U = T - [n(n + 1)]/2. Hint. Let Y_{(1)} < Y_{(2)} < ... <
Y_{(n)} be the order statistics of the random sample Y_1, Y_2, ..., Y_n. If R_i is the
rank of Y_{(i)} in the combined ordered sample, note that Y_{(i)} is greater than
R_i - i values of X.
9.32. In Example 1 of this section assume that the values came from
two independent normal distributions with means μ_1 and μ_2, respectively,
and with common variance σ^2. Calculate the Student's t which is used to test
the hypothesis H_0: μ_1 = μ_2. If the alternative hypothesis is H_1: μ_1 < μ_2, do
we accept or reject H_0 at the 0.05 significance level?
9.7 Distributions Under Alternative Hypotheses
In this section we shall discuss certain problems that are related
to a nonparametric test when the hypothesis H 0 is not true. Let X and
Y be stochastically independent random variables of the continuous
type with distribution functions F(x) and G(y), respectively, and
probability density functions f(x) and g(y). Let X_1, X_2, ..., X_m
and Y_1, Y_2, ..., Y_n denote independent random samples from these
distributions. Consider the hypothesis H 0: F(z) = G(z) for all values of
z. It has been seen that the test of this hypothesis may be based upon
the statistic U, which, when the hypothesis H o is true, has a distribution
that does not depend upon F(z) = G(z). Or this test can be based upon
the statistic T = U + n(n + 1)/2, where T is the sum of the ranks of
Y_1, Y_2, ..., Y_n in the combined sample. To elicit some information
about the distribution of T when the alternative hypothesis is true, let
us consider the joint distribution of the ranks of these values of Y.
Let Y_{(1)} < Y_{(2)} < ... < Y_{(n)} be the order statistics of the sample
Y_1, Y_2, ..., Y_n. Order the combined sample, and let R_i be the rank of
Y_{(i)}, i = 1, 2, ..., n. Thus there are i - 1 values of Y and R_i - i
values of X that are less than Y_{(i)}. Moreover, there are R_i - R_{i-1} - 1
values of X between Y_{(i-1)} and Y_{(i)}. If it is given that Y_{(1)} = y_1 <
Y_{(2)} = y_2 < ... < Y_{(n)} = y_n, then the conditional probability

(1)  Pr(R_1 = r_1, R_2 = r_2, ..., R_n = r_n | y_1 < y_2 < ... < y_n),

where r_1 < r_2 < ... < r_n ≤ m + n are positive integers, can be com-
puted by using the multinomial p.d.f. in the following manner. Define
the following sets: A_1 = {x; -∞ < x < y_1}, A_i = {x; y_{i-1} < x < y_i},
i = 2, ..., n, A_{n+1} = {x; y_n < x < ∞}. The conditional probabilities
of these sets are, respectively, p_1 = F(y_1), p_2 = F(y_2) - F(y_1), ...,
p_n = F(y_n) - F(y_{n-1}), p_{n+1} = 1 - F(y_n). Then the conditional prob-
ability of display (1) is given by

\frac{m!}{(r_1 - 1)!\,(r_2 - r_1 - 1)!\cdots(r_n - r_{n-1} - 1)!\,(m + n - r_n)!}\; p_1^{\,r_1-1} p_2^{\,r_2-r_1-1}\cdots p_n^{\,r_n-r_{n-1}-1} p_{n+1}^{\,m+n-r_n}.
To find the unconditional probability Pr(R_1 = r_1, R_2 = r_2, ...,
R_n = r_n), which we denote simply by Pr(r_1, ..., r_n), we multiply the
conditional probability by the joint p.d.f. of Y_{(1)} < Y_{(2)} < ... < Y_{(n)},
namely n!\, g(y_1)g(y_2)\cdots g(y_n), and then integrate on y_1, y_2, ..., y_n.
That is,

Pr(r_1, r_2, ..., r_n) = \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{y_3}\!\int_{-\infty}^{y_2} Pr(r_1, ..., r_n \mid y_1 < \cdots < y_n)\, n!\, g(y_1)\cdots g(y_n)\, dy_1\cdots dy_n,

where Pr(r_1, ..., r_n | y_1 < ... < y_n) denotes the conditional probability
in display (1).
Now that we have the joint distribution of R_1, R_2, ..., R_n, we can
find, theoretically, the distributions of functions of R_1, R_2, ..., R_n and,
in particular, the distribution of T = \sum_{i=1}^{n} R_i. From the latter we can find
that of U = T - n(n + 1)/2. To point out the extremely tedious
computational problems of distribution theory that we encounter,
we give an example. In this example we use the assumptions of this
section.
Example 1. Suppose that the hypothesis H_0 is not true but that in fact
f(x) = 1, 0 < x < 1, zero elsewhere, and g(y) = 2y, 0 < y < 1, zero else-
where. Let m = 3 and n = 2. Note that the space of U is the set {u; u = 0,
1, ..., 6}. Consider Pr(U = 5). This event U = 5 occurs when and only when
R_1 = 3, R_2 = 5, since in this section R_1 < R_2 are the ranks of Y_{(1)} < Y_{(2)}
in the combined sample and U = R_1 + R_2 - 3. Because F(x) = x, 0 < x ≤ 1,
we have

Pr(R_1 = 3, R_2 = 5) = \int_0^1\!\!\int_0^{y_2} 3y_1^2(y_2 - y_1)\,[2!\,(2y_1)(2y_2)]\, dy_1\, dy_2
                     = 24\int_0^1\left(\frac{y_2^6}{4} - \frac{y_2^6}{5}\right) dy_2 = \frac{6}{35}.

Consider next Pr(U = 4). The event U = 4 occurs if R_1 = 2, R_2 = 5 or if
R_1 = 3, R_2 = 4. Thus

Pr(U = 4) = Pr(R_1 = 2, R_2 = 5) + Pr(R_1 = 3, R_2 = 4);

the computation of each of these probabilities is similar to that of
Pr(R_1 = 3, R_2 = 5). This procedure may be continued until we have
computed Pr(U = u) for each u ∈ {u; u = 0, 1, ..., 6}.
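Since the exact computation is tedious, a rough Monte Carlo check is sometimes reassuring. The sketch below samples Y from G(y) = y^2 by the inverse-c.d.f. method and estimates Pr(U = 5) for comparison with the exact value found above:

```python
import random

def estimate_pr_u(target_u, m=3, n=2, reps=200_000, seed=7):
    """Monte Carlo estimate of Pr(U = target_u) when f(x) = 1 and
    g(y) = 2y on (0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.random() for _ in range(m)]
        ys = [rng.random() ** 0.5 for _ in range(n)]   # inverse of G(y) = y^2
        u = sum(1 for xi in xs for yj in ys if xi < yj)
        hits += (u == target_u)
    return hits / reps

print(estimate_pr_u(5))      # should be close to 6/35 = 0.171
```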
In the preceding example the probability density functions and the
sample sizes m and n were selected so as to provide relatively simple
integrations. The reader can discover for himself how tedious, and even
difficult, the computations become if the sample sizes are large or if the
probability density functions are not of a simple functional form.
EXERCISES
9.33. Let the probability density functions of X and Y be those given in
Example 1 of this section. Further let the sample sizes be m = 5 and n = 3.
If R_1 < R_2 < R_3 are the ranks of Y_{(1)} < Y_{(2)} < Y_{(3)} in the combined
sample, compute Pr(R_1 = 2, R_2 = 6, R_3 = 8).
9.34. Let X_1, X_2, ..., X_m be a random sample of size m from a distri-
bution of the continuous type with distribution function F(x) and p.d.f.
F'(x) = f(x). Let Y_1, Y_2, ..., Y_n be a random sample from a distribution
with distribution function G(y) = [F(y)]^θ, 0 < θ. If θ ≠ 1, this distribution
is called a Lehmann alternative. With θ = 2, show that

Pr(r_1, r_2, ..., r_n) = \frac{2^n\, r_1(r_2 + 1)(r_3 + 2)\cdots(r_n + n - 1)}{\binom{m+n}{m}(m + n + 1)(m + n + 2)\cdots(m + 2n)}.
which is equal to the number of the m values of X that are in the lower
half of the combined sample of m + n items (a statistic used in the
median test of Section 9.5).
To determine the mean and the variance of L, we make some
observations about the joint and marginal distributions of the ranks
R_1, R_2, ..., R_N. Clearly, from the results of Section 4.6 on the distri-
bution of order statistics of a random sample, we observe that each
permutation of the ranks has the same probability,
which is the sum of the ranks of Y_1, Y_2, ..., Y_n among the m + n items
(a statistic denoted by T in Section 9.6).
(b) Take c(i) = 1, provided that i ≤ (m + n)/2, zero otherwise. If
a_1 = ... = a_m = 1 and a_{m+1} = ... = a_N = 0, then
9.35. To generalize the results of Exercise 9.34, let G(y) = h[F(y)],
where h(z) is a differentiable function such that h(0) = 0, h(1) = 1, and
h'(z) > 0, 0 < z < 1. Show that

Pr(r_1, r_2, ..., r_n) = \frac{m!\, n!}{(m + n)!}\, E[h'(V_{r_1})\, h'(V_{r_2})\cdots h'(V_{r_n})],

where V_1 < V_2 < ... < V_{m+n} are the order statistics of a random sample of
size m + n from the uniform distribution over the interval (0, 1).
1/N!, where r_1, r_2, ..., r_N is any permutation of the first N positive
integers. This implies that the marginal p.d.f. of R_i is

Pr(R_i = r_i) = \frac{1}{N},   r_i = 1, 2, ..., N,

zero elsewhere, because the number of permutations in which R_i = r_i
is (N - 1)!, so that (N - 1)!/N! = 1/N. In a similar manner, the joint
marginal p.d.f. of R_i and R_j, i ≠ j, is

Pr(R_i = r_i, R_j = r_j) = \frac{1}{N(N - 1)},   r_i ≠ r_j,

zero elsewhere. That is, the (N - 2)-fold summation

\sum\cdots\sum \frac{1}{N!} = \frac{(N - 2)!}{N!} = \frac{1}{N(N - 1)},
V_1 = X_1, ..., V_m = X_m, V_{m+1} = Y_1, ..., V_N = Y_n.
These two special statistics result from the following respective assign-
ments for c(i) and a_1, a_2, ..., a_N:
(a) Take c(i) = i, a_1 = ... = a_m = 0 and a_{m+1} = ... = a_N = 1, so
that
is called a linear rank statistic.
To see that this type of statistic is actually a generalization of both
the Mann-Whitney-Wilcoxon statistic and also that statistic associ-
ated with the median test, let N = m + nand
In this section we consider a type of distribution-free statistic
that is, among other things, a generalization of the Mann-Whitney-
Wilcoxon statistic. Let V_1, V_2, ..., V_N be a random sample of size N
from a distribution of the continuous type. Let R_i be the rank of V_i
among V_1, V_2, ..., V_N, i = 1, 2, ..., N; and let c(i) be a scoring
function defined on the first N positive integers—that is, let c(1), c(2),
..., c(N) be some appropriately selected constants. If a_1, a_2, ..., a_N
are constants, then a statistic of the form
9.8 Linear Rank Statistics
L = \sum_{i=1}^{N} a_i c(R_i),   and, under assignment (a),   L = \sum_{i=m+1}^{m+n} R_i,
where the summation is over all permutations in which R_i = r_i and
R_j = r_j.
However, since
say, for all i = 1, 2, ... , N. In addition, we have that
However, we can determine a substitute for the second factor by
observing that
N\sum_{i=1}^{N}(a_i - \bar{a})^2 = N\sum_{i=1}^{N} a_i^2 - N^2\bar{a}^2
  = N\sum_{i=1}^{N} a_i^2 - \left(\sum_{i=1}^{N} a_i\right)^2
  = N\sum_{i=1}^{N} a_i^2 - \left[\sum_{i=1}^{N} a_i^2 + \sum_{i \neq j} a_i a_j\right]
  = (N - 1)\sum_{i=1}^{N} a_i^2 - \sum_{i \neq j} a_i a_j.
So, making this substitution in σ_L^2, we finally have that

σ_L^2 = \left[\frac{\sum_{k=1}^{N}(c_k - \bar{c})^2}{N(N - 1)}\right]\left[N\sum_{i=1}^{N}(a_i - \bar{a})^2\right]
      = \frac{1}{N - 1}\sum_{i=1}^{N}(a_i - \bar{a})^2\sum_{k=1}^{N}(c_k - \bar{c})^2.
In the special case in which N = m + n and

L = \sum_{i=m+1}^{N} c(R_i),
E[c(R_i)] = \sum_{r_i=1}^{N} c(r_i)\frac{1}{N} = \frac{c(1) + \cdots + c(N)}{N}.
Among other things these properties of the distribution of R_1,
R_2, ..., R_N imply that
If, for convenience, we let c(k) = c_k, then E[c(R_i)] = \frac{\sum_{k=1}^{N} c_k}{N} = \bar{c}
for all i = 1, 2, ..., N.
A simple expression for the covariance of c(R_i) and c(R_j), i ≠ j, is a
little more difficult to determine. That covariance is
0 = \left[\sum_{k=1}^{N}(c_k - \bar{c})\right]^2 = \sum_{k=1}^{N}(c_k - \bar{c})^2 + \sum_{k \neq h}(c_k - \bar{c})(c_h - \bar{c}),
the covariance can be written simply as
With these results, we first observe that the mean of L is
where \bar{a} = (\sum a_i)/N. Second, note that the variance of L is
the reader is asked to show that (Exercise 9.36)
μ_L = n\bar{c},   σ_L^2 = \frac{mn}{N(N - 1)}\sum_{k=1}^{N}(c_k - \bar{c})^2.
A further simplification when c_k = c(k) = k yields

μ_L = \frac{n(m + n + 1)}{2},   σ_L^2 = \frac{mn(m + n + 1)}{12};

these latter are, respectively, the mean and the variance of the statistic
T as defined in Section 9.6.
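These moment formulas are straightforward to evaluate for any choice of the a_i and the scores c_k. A minimal sketch, using the special case a_1 = ... = a_m = 0, a_{m+1} = ... = a_N = 1 and c(k) = k so that the result can be compared with the mean and variance of T:

```python
def linear_rank_moments(a, c):
    """Mean and variance of L = sum a_i c(R_i) under H_0, from the
    formulas of this section (a and c are lists of length N)."""
    N = len(a)
    a_bar = sum(a) / N
    c_bar = sum(c) / N
    mu = sum(a) * c_bar
    var = (sum((ai - a_bar) ** 2 for ai in a)
           * sum((ck - c_bar) ** 2 for ck in c) / (N - 1))
    return mu, var

m, n = 10, 9
a = [0] * m + [1] * n               # a_i = 0 for the X's, 1 for the Y's
c = list(range(1, m + n + 1))       # c(k) = k
mu, var = linear_rank_moments(a, c)
print(round(mu, 6), round(var, 6))  # 90.0, 150.0 = n(m+n+1)/2, mn(m+n+1)/12
```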
As in the case of the Mann-Whitney-Wilcoxon statistic, the deter-
mination of the exact distribution of a linear rank statistic L can be
very difficult. However, for many selections of the constants a_1, a_2, ...,
a_N and the scores c(1), c(2), ..., c(N), the ratio (L - μ_L)/σ_L has, for
large N, an approximate distribution that is n(0, 1). This approxima-
tion is better if the scores c(k) = c_k are like an ideal sample from a
normal distribution, in particular, symmetric and without extreme
values. For example, use of normal scores defined by

\frac{k}{N + 1} = \int_{-\infty}^{c_k} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{w^2}{2}\right) dw
makes the approximation better. However, even with the use of ranks,
c(k) = k, the approximation is reasonably good, provided that N is
large enough, say around 30 or greater.
In addition to being a generalization of statistics such as those of
Mann, Whitney, and Wilcoxon, we give two additional applications of
linear rank statistics in the following illustrations.
Example 1. Let X_1, X_2, ..., X_n denote n random variables. However,
suppose that we question whether they are items of a random sample due
either to possible lack of mutual stochastic independence or to the fact that
X_1, X_2, ..., X_n might not have the same distributions. In particular, say
we suspect a trend toward larger and larger values in the sequence X_1, X_2,
..., X_n. If R_i = rank(X_i), a statistic that could be used to test the alterna-
tive (trend) hypothesis is L = \sum_{i=1}^{n} i R_i. Under the assumption (H_0) that the
n random variables are actually items of a random sample from a distri-
bution of the continuous type, the reader is asked to show that (Exercise
9.37)
which in turn equals

1 - \frac{6\sum_{i=1}^{n}(R_i - Q_i)^2}{n(n^2 - 1)}.

From the first of these two additional expressions for Spearman's statistic,
it is clear that \sum_{i=1}^{n} R_i Q_i is an equivalent statistic for the purpose of testing
the stochastic independence of X and Y, say H_0. However, note that if H_0
is true, then the distribution of \sum_{i=1}^{n} Q_i R_i, which is not a linear rank statistic,
and that of L = \sum_{i=1}^{n} i R_i are the same. The reason for this is that the ranks R_1, R_2,
..., R_n and the ranks Q_1, Q_2, ..., Q_n are stochastically independent because
of the stochastic independence of X and Y. Hence, under H_0, pairing R_1,
R_2, ..., R_n at random with 1, 2, ..., n is distributionally equivalent to
pairing those ranks with Q_1, Q_2, ..., Q_n, which is simply a permutation of
1, 2, ..., n. The mean and the variance of L are given in Example 1.
μ_L = \frac{n(n + 1)^2}{4},   σ_L^2 = \frac{n^2(n + 1)^2(n - 1)}{144}.

EXERCISES
The critical region of the test is of the form L ≥ d, and the constant d can be
determined either by using the normal approximation or referring to a tabu-
lated distribution of L so that Pr(L ≥ d; H_0) is approximately equal to a
desired significance level α.
Example 2. Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be a random sample
from a bivariate distribution of the continuous type. Let R_i be the rank of
X_i among X_1, X_2, ..., X_n and Q_i be the rank of Y_i among Y_1, Y_2, ..., Y_n.
If X and Y have a large positive correlation coefficient, we would anticipate
that R_i and Q_i would tend to be large or small together. In particular, the
correlation coefficient of (R_1, Q_1), (R_2, Q_2), ..., (R_n, Q_n), namely the Spear-
man rank correlation coefficient,

\frac{\sum_{i=1}^{n}(R_i - \bar{R})(Q_i - \bar{Q})}{\sqrt{\sum_{i=1}^{n}(R_i - \bar{R})^2\sum_{i=1}^{n}(Q_i - \bar{Q})^2}},

would tend to be large. Since R_1, R_2, ..., R_n and Q_1, Q_2, ..., Q_n are
permutations of 1, 2, ..., n, this correlation coefficient can be shown
(Exercise 9.38) to equal
\frac{\sum_{i=1}^{n} R_i Q_i - n(n + 1)^2/4}{n(n^2 - 1)/12},
9.36. Use the notation of this section.
(a) Show that the mean and the variance of L = \sum_{i=m+1}^{N} c(R_i) are equal to
the expressions in the text.
(b) In the special case in which L = \sum_{i=m+1}^{N} R_i, show that μ_L and σ_L^2 are
those of T considered in Section 9.6. Hint. Recall that

\sum_{k=1}^{N} k^2 = \frac{N(N + 1)(2N + 1)}{6}.
9.37. If X_1, X_2, ..., X_n is a random sample from a distribution of the
continuous type and if R_i = rank(X_i), show that the mean and the variance
of L = \sum_{i=1}^{n} i R_i are n(n + 1)^2/4 and n^2(n + 1)^2(n - 1)/144, respectively.
9.38. Verify that the two additional expressions, given in Example 2,
for the Spearman rank correlation coefficient are equivalent to the first one.
Hint. \sum R_i^2 = n(n + 1)(2n + 1)/6 and \sum(R_i - Q_i)^2/2 = \sum(R_i^2 + Q_i^2)/2 -
\sum R_i Q_i.
9.39. Let X_1, X_2, ..., X_6 be a random sample of size n = 6 from a
distribution of the continuous type. Let R_i = rank(X_i) and take a_1 = a_6 = 9,
a_2 = a_5 = 4, a_3 = a_4 = 1. Find the mean and the variance of L = \sum_{i=1}^{6} a_i R_i,
a statistic that could be used to detect a parabolic trend in X_1, X_2, ..., X_6.
9.40. In the notation of this section, show that the covariance of the two
linear rank statistics, L_1 = \sum_{i=1}^{N} a_i c(R_i) and L_2 = \sum_{i=1}^{N} b_i d(R_i), is equal to

\sum_{j=1}^{N}(a_j - \bar{a})(b_j - \bar{b})\sum_{k=1}^{N}(c_k - \bar{c})(d_k - \bar{d})\Big/(N - 1),

where, for convenience, d_k = d(k).

Chapter 10
Sufficient Statistics
10.1 A Sufficient Statistic for a Parameter
In Section 6.2 we let X_1, X_2, ..., X_n denote a random sample of
size n from a distribution that has p.d.f. f(x; θ), θ ∈ Ω. In each of several
examples and exercises there, we tried to determine a decision function
w of a statistic Y = u(X_1, X_2, ..., X_n) or, for simplicity, a function w
of X_1, X_2, ..., X_n such that the expected value of a loss function
ℒ(θ, w) is a minimum. That is, we said that the "best" decision
function w(X_1, X_2, ..., X_n) for a given loss function ℒ(θ, w) is one that
minimizes the risk R(θ, w), which for a distribution of the continuous
type is given by

R(θ, w) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} ℒ[θ, w(x_1, ..., x_n)] f(x_1; θ)\cdots f(x_n; θ)\, dx_1\cdots dx_n.

In particular, if E[w(X_1, ..., X_n)] = θ and if ℒ(θ, w) = (θ - w)^2, the
best decision function (statistic) is an unbiased minimum variance
estimator of θ. For convenience of exposition in this chapter, we con-
tinue to call each unbiased minimum variance estimator of θ a best
estimator of that parameter. However, the reader must recognize that
"best" defined in this way is wholly arbitrary and could be changed by
modifying the loss function or relaxing the unbiased assumption.
The purpose of establishing this definition of a best estimator of θ is
to help us motivate, in a somewhat natural way, the study of an
important class of statistics called sufficient statistics. For illustration,
note that in Section 6.2 the mean X̄ of a random sample of X_1, X_2, ...,
Xₙ of size n = 9 from a distribution that is n(θ, 1) is unbiased and has variance less than that of the unbiased estimator X₁. However, to claim that it is a best estimator requires that a comparison be made with the variance of each other unbiased estimator of θ. Certainly, it is impossible to do this by tabulation, and hence we must have some mathematical means that essentially does this. Sufficient statistics provide a beginning to the solution of this problem.

To understand clearly the definition of a sufficient statistic for a parameter θ, we start with an illustration.

Example 1. Let X₁, X₂, ..., Xₙ denote a random sample from the distribution that has p.d.f.

f(x; θ) = θ^x (1 − θ)^{1−x},  x = 0, 1;  0 < θ < 1,
        = 0 elsewhere.

The statistic Y₁ = X₁ + X₂ + ··· + Xₙ has the p.d.f.

g₁(y₁; θ) = (n choose y₁) θ^{y₁} (1 − θ)^{n−y₁},  y₁ = 0, 1, ..., n,
          = 0 elsewhere.

What is the conditional probability

Pr (X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ | Y₁ = y₁) = P(A|B),

say, where y₁ = 0, 1, 2, ..., n? Unless the sum of the integers x₁, x₂, ..., xₙ (each of which equals zero or 1) is equal to y₁, this conditional probability obviously equals zero because A ∩ B = ∅. But in the case y₁ = Σxᵢ, we have that A ⊂ B so that A ∩ B = A and P(A|B) = P(A)/P(B); thus the conditional probability equals

[θ^{x₁}(1 − θ)^{1−x₁} θ^{x₂}(1 − θ)^{1−x₂} ··· θ^{xₙ}(1 − θ)^{1−xₙ}] / [(n choose y₁) θ^{y₁}(1 − θ)^{n−y₁}] = 1/(n choose y₁).

Since y₁ = x₁ + x₂ + ··· + xₙ equals the number of 1's in the n independent trials, this conditional probability is the probability of selecting a particular arrangement of y₁ 1's and (n − y₁) zeros. Note that this conditional probability does not depend upon the value of the parameter θ.

In general, let g₁(y₁; θ) be the p.d.f. of the statistic Y₁ = u₁(X₁, X₂, ..., Xₙ), where X₁, X₂, ..., Xₙ is a random sample arising from a distribution of the discrete type having p.d.f. f(x; θ), θ ∈ Ω. The conditional probability of X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ, given Y₁ = y₁, equals

f(x₁; θ)f(x₂; θ) ··· f(xₙ; θ) / g₁[u₁(x₁, x₂, ..., xₙ); θ],

provided that x₁, x₂, ..., xₙ are such that the fixed y₁ = u₁(x₁, x₂, ..., xₙ), and equals zero otherwise. We say that Y₁ = u₁(X₁, X₂, ..., Xₙ) is a sufficient statistic for θ if and only if this ratio does not depend upon θ. While, with distributions of the continuous type, we cannot use the same argument, we do, in this case, accept the fact that if this ratio does not depend upon θ, then the conditional distribution of X₁, X₂, ..., Xₙ, given Y₁ = y₁, does not depend upon θ. Thus, in both cases, we use the same definition of a sufficient statistic for θ.

Definition 1. Let X₁, X₂, ..., Xₙ denote a random sample of size n from a distribution that has p.d.f. f(x; θ), θ ∈ Ω. Let Y₁ = u₁(X₁, X₂, ..., Xₙ) be a statistic whose p.d.f. is g₁(y₁; θ). Then Y₁ is a sufficient statistic for θ if and only if

f(x₁; θ)f(x₂; θ) ··· f(xₙ; θ) / g₁[u₁(x₁, x₂, ..., xₙ); θ] = H(x₁, x₂, ..., xₙ),

where H(x₁, x₂, ..., xₙ) does not depend upon θ ∈ Ω for every fixed value of y₁ = u₁(x₁, x₂, ..., xₙ).

Remark. Why we use the terminology "sufficient statistic" can be explained as follows: If a statistic Y₁ satisfies the preceding definition, then the conditional joint p.d.f. of X₁, X₂, ..., Xₙ, given Y₁ = y₁, and hence of each other statistic, say Y₂ = u₂(X₁, X₂, ..., Xₙ), does not depend upon the parameter θ. As a consequence, once given Y₁ = y₁, it is impossible to use Y₂ to make a statistical inference about θ; for example, we could not find a confidence interval for θ based on Y₂. In a sense, Y₁ exhausts all the information about θ that is contained in the sample. It is in this sense that we call Y₁ a sufficient statistic for θ. In some instances it is preferable to call Y₁ a sufficient statistic for the family {f(x; θ); θ ∈ Ω} of probability density functions.

We now give an example that is illustrative of the definition.

Example 2. Let Y₁ < Y₂ < ··· < Yₙ denote the order statistics of a random sample X₁, X₂, ..., Xₙ from the distribution that has p.d.f.

f(x; θ) = e^{−(x−θ)},  θ < x < ∞,  −∞ < θ < ∞,
        = 0 elsewhere.
The p.d.f. of the statistic Y₁ is

g₁(y₁; θ) = ne^{−n(y₁−θ)},  θ < y₁ < ∞,
          = 0 elsewhere.

Thus we have that

f(x₁; θ)f(x₂; θ)f(x₃; θ) ··· f(xₙ; θ) / g₁(min xᵢ; θ) = e^{−(x₁−θ)}e^{−(x₂−θ)} ··· e^{−(xₙ−θ)} / [n e^{−n(min xᵢ − θ)}] = e^{−x₁−x₂−···−xₙ} / [n e^{−n min xᵢ}],

which is free of θ for each fixed y₁ = min (xᵢ), since y₁ ≤ xᵢ, i = 1, 2, ..., n. That is, neither the formula nor the domain of the resulting ratio depends upon θ, and hence the first order statistic Y₁ is a sufficient statistic for θ.
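A quick numerical check of Example 2 (an editorial illustration, not part of the original text): the ratio of the joint p.d.f. to the p.d.f. of Y₁ = min Xᵢ does not change as θ changes, as long as θ stays below every observation so the densities are positive. The sample point and θ values below are arbitrary.

```python
import numpy as np

x = np.array([1.9, 2.4, 3.1, 2.2])          # an arbitrary sample point
n, y1 = x.size, x.min()

def ratio(theta):
    joint = np.prod(np.exp(-(x - theta)))   # f(x1; theta) ... f(xn; theta), valid for theta < min(x)
    g1 = n * np.exp(-n * (y1 - theta))      # p.d.f. of the first order statistic at min(x)
    return joint / g1

print(ratio(0.5), ratio(1.5))               # equal: the ratio is free of theta
```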
If we are to show, by means of the definition, that a certain statistic Y₁ is or is not a sufficient statistic for a parameter θ, we must first of all know the p.d.f. of Y₁, say g₁(y₁; θ). In some instances it may be quite tedious to find this p.d.f. Fortunately, this problem can be avoided if we will but prove the following factorization theorem of Neyman.

Theorem 1. Let X₁, X₂, ..., Xₙ denote a random sample from a distribution that has p.d.f. f(x; θ), θ ∈ Ω. The statistic Y₁ = u₁(X₁, X₂, ..., Xₙ) is a sufficient statistic for θ if and only if we can find two nonnegative functions, k₁ and k₂, such that

f(x₁; θ)f(x₂; θ) ··· f(xₙ; θ) = k₁[u₁(x₁, x₂, ..., xₙ); θ] k₂(x₁, x₂, ..., xₙ),

where, for every fixed value of y₁ = u₁(x₁, x₂, ..., xₙ), k₂(x₁, x₂, ..., xₙ) does not depend upon θ.
Proof. We shall prove the theorem when the random variables are of the continuous type. Assume the factorization as stated in the theorem. In our proof we shall make the one-to-one transformation y₁ = u₁(x₁, ..., xₙ), y₂ = u₂(x₁, ..., xₙ), ..., yₙ = uₙ(x₁, ..., xₙ) having the inverse functions x₁ = w₁(y₁, ..., yₙ), x₂ = w₂(y₁, ..., yₙ), ..., xₙ = wₙ(y₁, ..., yₙ) and Jacobian J. The joint p.d.f. of the statistics Y₁, Y₂, ..., Yₙ is then given by

g(y₁, y₂, ..., yₙ; θ) = k₁[y₁; θ] k₂(w₁, w₂, ..., wₙ) |J|,

where wᵢ = wᵢ(y₁, y₂, ..., yₙ), i = 1, 2, ..., n. The p.d.f. of Y₁, say g₁(y₁; θ), is given by

g₁(y₁; θ) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(y₁, y₂, ..., yₙ; θ) dy₂ ··· dyₙ
          = k₁(y₁; θ) ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} |J| k₂(w₁, w₂, ..., wₙ) dy₂ ··· dyₙ.

Now the function k₂, for every fixed value of y₁ = u₁(x₁, ..., xₙ), does not depend upon θ. Nor is θ involved in either the Jacobian J or the limits of integration. Hence the (n − 1)-fold integral in the right-hand member of the preceding equation is a function of y₁ alone, say m(y₁). Thus

g₁(y₁; θ) = k₁(y₁; θ) m(y₁).

If m(y₁) = 0, then g₁(y₁; θ) = 0. If m(y₁) > 0, we can write

k₁[u₁(x₁, ..., xₙ); θ] = g₁[u₁(x₁, ..., xₙ); θ] / m[u₁(x₁, ..., xₙ)],

and the assumed factorization becomes

f(x₁; θ) ··· f(xₙ; θ) = g₁[u₁(x₁, ..., xₙ); θ] k₂(x₁, ..., xₙ) / m[u₁(x₁, ..., xₙ)].

Since neither the function k₂ nor the function m depends upon θ, then in accordance with the definition, Y₁ is a sufficient statistic for the parameter θ.

Conversely, if Y₁ is a sufficient statistic for θ, the factorization can be realized by taking the function k₁ to be the p.d.f. of Y₁, namely the function g₁. This completes the proof of the theorem.
Example 3. Let X₁, X₂, ..., Xₙ denote a random sample from a distribution that is n(θ, σ²), −∞ < θ < ∞, where the variance σ² is known. If x̄ = Σ_{i=1}^{n} xᵢ/n, then

Σ_{i=1}^{n} (xᵢ − θ)² = Σ_{i=1}^{n} [(xᵢ − x̄) + (x̄ − θ)]² = Σ_{i=1}^{n} (xᵢ − x̄)² + n(x̄ − θ)²

because

2 Σ_{i=1}^{n} (xᵢ − x̄)(x̄ − θ) = 2(x̄ − θ) Σ_{i=1}^{n} (xᵢ − x̄) = 0.

Thus the joint p.d.f. of X₁, X₂, ..., Xₙ may be written

{exp [−n(x̄ − θ)²/(2σ²)]} {(1/(σ√(2π)))ⁿ exp [−Σ_{i=1}^{n} (xᵢ − x̄)²/(2σ²)]}.

Since the first factor of the right-hand member of this equation depends upon x₁, x₂, ..., xₙ only through x̄, and since the second factor does not depend upon θ, the factorization theorem implies that the mean X̄ of the sample is, for any particular value of σ², a sufficient statistic for θ, the mean of the normal distribution.
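As an editorial illustration (not part of the original text), the factorization in Example 3 can be verified numerically: the joint density equals k₁(x̄; θ)·k₂(x) for every θ, with k₂ free of θ. The variance, sample point, and θ values below are arbitrary.

```python
import numpy as np

sigma2 = 2.0                                  # assumed known variance
x = np.array([0.3, -1.2, 2.1, 0.7])           # an arbitrary sample point
n, x_bar = x.size, x.mean()

def joint_pdf(theta):
    return np.prod(np.exp(-(x - theta) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

def k1(theta):
    return np.exp(-n * (x_bar - theta) ** 2 / (2 * sigma2))   # depends on the data only through x_bar

k2 = np.exp(-np.sum((x - x_bar) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2) ** n  # free of theta

for theta in (-1.0, 0.0, 2.5):
    print(np.isclose(joint_pdf(theta), k1(theta) * k2))        # True for every theta
```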
We could have used the definition in the preceding example because we know that X̄ is n(θ, σ²/n). Let us now consider an example in which the use of the definition is inappropriate.
Example 4. Let X₁, X₂, ..., Xₙ denote a random sample from a distribution with p.d.f.

f(x; θ) = θx^{θ−1},  0 < x < 1,
        = 0 elsewhere,

where 0 < θ. We shall use the factorization theorem to prove that the product u₁(X₁, X₂, ..., Xₙ) = X₁X₂···Xₙ is a sufficient statistic for θ. The joint p.d.f. of X₁, X₂, ..., Xₙ is

θⁿ(x₁x₂···xₙ)^{θ−1} = [θⁿ(x₁x₂···xₙ)^θ] (1/(x₁x₂···xₙ)),

where 0 < xᵢ < 1, i = 1, 2, ..., n. In the factorization theorem let

k₁[u₁(x₁, x₂, ..., xₙ); θ] = θⁿ(x₁x₂···xₙ)^θ

and

k₂(x₁, x₂, ..., xₙ) = 1/(x₁x₂···xₙ).

Since k₂(x₁, x₂, ..., xₙ) does not depend upon θ, the product X₁X₂···Xₙ is a sufficient statistic for θ.

There is a tendency for some readers to apply incorrectly the factorization theorem in those instances in which the domain of positive probability density depends upon the parameter θ. This is due to the fact that they do not give proper consideration to the domain of the function k₂(x₁, x₂, ..., xₙ) for every fixed value of y₁ = u₁(x₁, x₂, ..., xₙ). This will be illustrated in the next example.

Example 5. In Example 2, with f(x; θ) = e^{−(x−θ)}, θ < x < ∞, −∞ < θ < ∞, it was found that the first order statistic Y₁ is a sufficient statistic for θ. To illustrate our point, take n = 3 so that the joint p.d.f. of X₁, X₂, X₃ is given by

e^{−(x₁−θ)}e^{−(x₂−θ)}e^{−(x₃−θ)},  θ < xᵢ < ∞,

i = 1, 2, 3. We can factor this in several ways. One way, with n = 3, is given in Example 2. Another way would be to write the joint p.d.f. as the product

[e^{−3(max xᵢ − θ)}] [e^{3 max xᵢ − x₁ − x₂ − x₃}].

Certainly, there is no θ in the formula of the second factor, and it might be assumed that Y₃ = max Xᵢ is itself a sufficient statistic for θ. But what is the domain of the second factor for every fixed value of y₃ = max xᵢ? If max xᵢ = x₁, the domain is θ < x₂ < x₁, θ < x₃ < x₁; if max xᵢ = x₂, the domain is θ < x₁ < x₂, θ < x₃ < x₂; and if max xᵢ = x₃, the domain is θ < x₁ < x₃, θ < x₂ < x₃. That is, for each fixed y₃ = max xᵢ, the domain of the second factor depends upon θ. Thus the factorization theorem is not satisfied.

If the reader has some difficulty using the factorization theorem when the domain of positive probability density depends upon θ, we recommend use of the definition even though it may be somewhat longer.

Before taking the next step in our search for a best statistic for a parameter θ, let us consider an important property possessed by a sufficient statistic Y₁ = u₁(X₁, X₂, ..., Xₙ) for θ. The conditional p.d.f. of another statistic, say Y₂ = u₂(X₁, X₂, ..., Xₙ), given Y₁ = y₁, does not depend upon θ. On intuitive grounds, we might surmise that the conditional p.d.f. of Y₂, given some linear function aY₁ + b, a ≠ 0, of Y₁, does not depend upon θ. That is, it seems as though the random variable aY₁ + b is also a sufficient statistic for θ. This conjecture is correct. In fact, every function Z = u(Y₁), or Z = u[u₁(X₁, X₂, ..., Xₙ)] = v(X₁, X₂, ..., Xₙ), not involving θ, with a single-valued inverse Y₁ = w(Z), is also a sufficient statistic for θ. To prove this, we write, in accordance with the factorization theorem,

f(x₁; θ)f(x₂; θ) ··· f(xₙ; θ) = k₁[u₁(x₁, ..., xₙ); θ] k₂(x₁, ..., xₙ).

However, y₁ = w(z) or, equivalently, u₁(x₁, x₂, ..., xₙ) = w[v(x₁, x₂, ..., xₙ)], which is not a function of θ. Hence

f(x₁; θ)f(x₂; θ) ··· f(xₙ; θ) = k₁{w[v(x₁, ..., xₙ)]; θ} k₂(x₁, ..., xₙ).

Since the first factor of the right-hand member of this equation is a function of z = v(x₁, ..., xₙ) and θ, while the second factor does not depend upon θ, the factorization theorem implies that Z = u(Y₁) is also a sufficient statistic for θ.
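The factorization of Example 4 can likewise be checked numerically (an editorial illustration, not part of the original text); the sample point and θ values are arbitrary.

```python
import numpy as np

x = np.array([0.2, 0.7, 0.5, 0.9])       # an arbitrary point in (0, 1)^n
n, prod = x.size, np.prod(x)

for theta in (0.5, 2.0, 3.5):
    joint = theta ** n * prod ** (theta - 1)
    k1 = theta ** n * prod ** theta      # depends on the data only through the product
    k2 = 1.0 / prod                      # free of theta
    print(np.isclose(joint, k1 * k2))    # True for every theta
```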
The relationship of a sufficient statistic for θ to the maximum likelihood estimator of θ is contained in the following theorem.

Theorem 2. Let X₁, X₂, ..., Xₙ denote a random sample from a distribution that has p.d.f. f(x; θ), θ ∈ Ω. If a sufficient statistic Y₁ = u₁(X₁, X₂, ..., Xₙ) for θ exists and if a maximum likelihood estimator θ̂ of θ also exists uniquely, then θ̂ is a function of Y₁ = u₁(X₁, X₂, ..., Xₙ).
Proof. Let g₁(y₁; θ) be the p.d.f. of Y₁. Then by the definition of sufficiency, the likelihood function

L(θ; x₁, x₂, ..., xₙ) = f(x₁; θ)f(x₂; θ) ··· f(xₙ; θ)
                      = g₁[u₁(x₁, ..., xₙ); θ] H(x₁, ..., xₙ),

where H(x₁, ..., xₙ) does not depend upon θ. Thus L and g₁, as functions of θ, are maximized simultaneously. Since there is one and only one value of θ that maximizes L and hence g₁[u₁(x₁, ..., xₙ); θ], that value of θ must be a function of u₁(x₁, x₂, ..., xₙ). Thus the maximum likelihood estimator θ̂ is a function of the sufficient statistic Y₁ = u₁(X₁, X₂, ..., Xₙ).
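Theorem 2 can be illustrated numerically (an editorial sketch, not part of the original text) with the p.d.f. of Example 4, f(x; θ) = θx^{θ−1}: a generic numerical maximization of the likelihood returns the same value as the closed form θ̂ = −n/Σ ln xᵢ, which is a function of the sufficient statistic (the product of the observations) alone. The data and the search bounds are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.2, 0.7, 0.5, 0.9])
n = x.size

def neg_log_lik(theta):
    # -log L(theta) = -(n*ln(theta) + (theta - 1)*sum(ln x_i))
    return -(n * np.log(theta) + (theta - 1) * np.sum(np.log(x)))

theta_hat_numeric = minimize_scalar(neg_log_lik, bounds=(1e-6, 50), method="bounded").x
theta_hat_closed = -n / np.sum(np.log(x))     # n divided by minus the log of the product
print(theta_hat_numeric, theta_hat_closed)    # the two agree
```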
EXERCISES
10.1. Let X₁, X₂, ..., Xₙ be a random sample from the normal distribution n(0, θ), 0 < θ < ∞. Show that Σ_{i=1}^{n} Xᵢ² is a sufficient statistic for θ.

10.2. Prove that the sum of the items of a random sample of size n from a Poisson distribution having parameter θ, 0 < θ < ∞, is a sufficient statistic for θ.

10.3. Show that the nth order statistic of a random sample of size n from the uniform distribution having p.d.f. f(x; θ) = 1/θ, 0 < x < θ, 0 < θ < ∞, zero elsewhere, is a sufficient statistic for θ. Generalize this result by considering the p.d.f. f(x; θ) = Q(θ)M(x), 0 < x < θ, 0 < θ < ∞, zero elsewhere. Here, of course,

∫_{0}^{θ} M(x) dx = 1/Q(θ).

10.4. Let X₁, X₂, ..., Xₙ be a random sample of size n from a geometric distribution that has p.d.f. f(x; θ) = (1 − θ)^x θ, x = 0, 1, 2, ..., 0 < θ < 1, zero elsewhere. Show that Σ_{i=1}^{n} Xᵢ is a sufficient statistic for θ.

10.5. Show that the sum of the items of a random sample of size n from a gamma distribution that has p.d.f. f(x; θ) = (1/θ)e^{−x/θ}, 0 < x < ∞, 0 < θ < ∞, zero elsewhere, is a sufficient statistic for θ.

10.6. In each of the Exercises 10.1, 10.2, 10.4, and 10.5, show that the maximum likelihood estimator of θ is a function of the sufficient statistic for θ.

10.7. Let X₁, X₂, ..., Xₙ be a random sample of size n from a beta distribution with parameters α = θ > 0 and β = 2. Show that the product X₁X₂···Xₙ is a sufficient statistic for θ.

10.8. Let X₁, X₂, ..., Xₙ be a random sample of size n from a distribution with p.d.f. f(x; θ) = 1/{π[1 + (x − θ)²]}, −∞ < x < ∞, −∞ < θ < ∞. Can the joint p.d.f. of X₁, X₂, ..., Xₙ be written in the form given in Theorem 1? Does the parameter θ have a sufficient statistic?
10.2 The Rao-Blackwell Theorem
We shall prove the Rao-Blackwell theorem.
Theorem 3. Let X and Y denote random variables such that Y has mean μ and positive variance σ_Y². Let E(Y|x) = φ(x). Then E[φ(X)] = μ and σ²_{φ(X)} ≤ σ_Y².
Proof. We shall give the proof when the random variables are of the continuous type. Let f(x, y), f₁(x), f₂(y), and h(y|x) denote, respectively, the joint p.d.f. of X and Y, the two marginal probability density functions, and the conditional p.d.f. of Y, given X = x. Then

E(Y|x) = ∫_{−∞}^{∞} y h(y|x) dy = [∫_{−∞}^{∞} y f(x, y) dy] / f₁(x) = φ(x),

so that

∫_{−∞}^{∞} y f(x, y) dy = φ(x) f₁(x).

We have

E[φ(X)] = ∫_{−∞}^{∞} φ(x) f₁(x) dx = ∫_{−∞}^{∞} [∫_{−∞}^{∞} y f(x, y) dy] dx
        = ∫_{−∞}^{∞} y [∫_{−∞}^{∞} f(x, y) dx] dy
        = ∫_{−∞}^{∞} y f₂(y) dy = μ,

and the first part of the theorem is established. Consider next

σ_Y² = E[(Y − μ)²] = E{[(Y − φ(X)) + (φ(X) − μ)]²}
     = E{[Y − φ(X)]²} + E{[φ(X) − μ]²} + 2E{[Y − φ(X)][φ(X) − μ]}.

We shall show that the last term of the right-hand member of the immediately preceding equation is zero. We have

E{[Y − φ(X)][φ(X) − μ]} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [y − φ(x)][φ(x) − μ] f(x, y) dy dx.

In this integral we shall write f(x, y) in the form h(y|x)f₁(x), and we shall integrate first on y to obtain

∫_{−∞}^{∞} {∫_{−∞}^{∞} [y − φ(x)] h(y|x) dy} [φ(x) − μ] f₁(x) dx.
But φ(x) is the mean of the conditional p.d.f. h(y|x). Hence

∫_{−∞}^{∞} [y − φ(x)] h(y|x) dy = 0,

and, accordingly,

E{[Y − φ(X)][φ(X) − μ]} = 0.

Moreover,

σ²_{φ(X)} = E{[φ(X) − μ]²}

and

E{[Y − φ(X)]²} ≥ 0.

Accordingly,

σ_Y² ≥ σ²_{φ(X)},

and the theorem is proved when X and Y are random variables of the continuous type. The proof in the discrete case is identical to the proof given here with the exception that summation replaces integration.

It is interesting to note, in connection with the proof of the theorem, that unless the probability measure of the set {(x, y); y − φ(x) = 0} is equal to 1, then E{[Y − φ(X)]²} > 0, and we have the strict inequality σ_Y² > σ²_{φ(X)}.

We shall give an illustrative example.
We shall give an illustrative example.
Example 1. Let X and Y have a bivariate normal distribution with
means JL1 and JL2' with positive variances ut and u~, and with correlation
coefficient p. Here E(Y) = JL = JL2 and u~ = u~. Now E(Ylx) is linear in x
and it is given by
E(Ylx) = cp(x) = JL2 + PU2 (x - JL1).
U1
Thus cp(X) = JL2 + P(U2/U1)(X - JL1) and E[cp(X)] = JL2' as stated in the
theorem. Moreover,
With -1 < p < 1, we have the strict inequality u~ > p2u~. It should be
observed that cp(X) is not a statistic if at least one of the five parameters is
unknown.
We shall use the Rao-Blackwell theorem to help us in our search for a best estimator of a parameter. Let X₁, X₂, ..., Xₙ denote a random sample from a distribution that has p.d.f. f(x; θ), θ ∈ Ω, where it is known that Y₁ = u₁(X₁, X₂, ..., Xₙ) is a sufficient statistic for the parameter θ. Let Y₂ = u₂(X₁, X₂, ..., Xₙ) be another statistic (but not a function of Y₁ alone), which is an unbiased estimator of θ; that is, E(Y₂) = θ. Consider E(Y₂|y₁). This expectation is a function of y₁, say φ(y₁). Since Y₁ is a sufficient statistic for θ, the conditional p.d.f. of Y₂, given Y₁ = y₁, does not depend upon θ, so E(Y₂|y₁) = φ(y₁) is a function of y₁ alone. That is, here φ(y₁) is a statistic. In accordance with the Rao-Blackwell theorem, φ(Y₁) is an unbiased estimator of θ; and because Y₂ is not a function of Y₁ alone, the variance of φ(Y₁) is strictly less than the variance of Y₂. We shall summarize this discussion in the following theorem.

Theorem 4. Let X₁, X₂, ..., Xₙ, n a fixed positive integer, denote a random sample from a distribution (continuous or discrete) that has p.d.f. f(x; θ), θ ∈ Ω. Let Y₁ = u₁(X₁, X₂, ..., Xₙ) be a sufficient statistic for θ, and let Y₂ = u₂(X₁, X₂, ..., Xₙ), not a function of Y₁ alone, be an unbiased estimator of θ. Then E(Y₂|y₁) = φ(y₁) defines a statistic φ(Y₁). This statistic φ(Y₁) is a function of the sufficient statistic for θ; it is an unbiased estimator of θ; and its variance is less than that of Y₂.

This theorem tells us that in our search for a best estimator of a parameter, we may, if a sufficient statistic for the parameter exists, restrict that search to functions of the sufficient statistic. For if we begin with an unbiased estimator Y₂ that is not a function of the sufficient statistic Y₁ alone, then we can always improve on this by computing E(Y₂|y₁) = φ(y₁) so that φ(Y₁) is an unbiased estimator with smaller variance than that of Y₂.
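A small simulation (an editorial sketch, not part of the original text) makes the variance reduction concrete for Bernoulli data: start with the crude unbiased estimator Y₂ = X₁ and condition on the sufficient statistic Y₁ = ΣXᵢ; here E(X₁|Y₁) = Y₁/n = X̄. The parameter value, sample size, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)      # arbitrary seed
theta, n, reps = 0.4, 10, 300_000
samples = rng.binomial(1, theta, size=(reps, n))
y2 = samples[:, 0]                  # unbiased but wasteful: uses only the first observation
phi = samples.mean(axis=1)          # E(Y2 | Y1) = Y1/n, the conditioned estimator
print(y2.mean(), phi.mean())        # both close to theta = 0.4 (unbiased)
print(y2.var(), phi.var())          # variance drops from about 0.24 to about 0.024
```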
After Theorem 4 many students believe that it is necessary to find first some unbiased estimator Y₂ in their search for φ(Y₁), an unbiased estimator of θ based upon the sufficient statistic Y₁. This is not the case at all, and Theorem 4 simply convinces us that we can restrict our search for a best estimator to functions of Y₁. It frequently happens that E(Y₁) = aθ + b, where a ≠ 0 and b are constants, and thus (Y₁ − b)/a is a function of Y₁ that is an unbiased estimator of θ.
That is, we can usually find an unbiased estimator based on Y₁ without first finding an estimator Y₂. In the next two sections we discover that, in most instances, if there is one function φ(Y₁) that is unbiased, φ(Y₁) is the only unbiased estimator based on the sufficient statistic Y₁.

Remark. Since the unbiased estimator φ(Y₁), where φ(y₁) = E(Y₂|y₁), has variance smaller than that of the unbiased estimator Y₂ of θ, students sometimes reason as follows. Let the function Υ(y₃) = E[φ(Y₁)|Y₃ = y₃], where Y₃ is another statistic, which is not sufficient for θ. By the Rao-Blackwell theorem, we have that E[Υ(Y₃)] = θ and Υ(Y₃) has a smaller variance than does φ(Y₁). Accordingly, Υ(Y₃) must be better than φ(Y₁) as an unbiased estimator of θ. But this is not true because Y₃ is not sufficient; thus θ is present in the conditional distribution of Y₁, given Y₃ = y₃, and the conditional mean Υ(y₃). So although indeed E[Υ(Y₃)] = θ, Υ(Y₃) is not even a statistic because it involves the unknown parameter θ and hence cannot be used as an estimator.

EXERCISES

10.9. Let Y₁ < Y₂ < Y₃ < Y₄ < Y₅ be the order statistics of a random sample of size 5 from the uniform distribution having p.d.f. f(x; θ) = 1/θ, 0 < x < θ, 0 < θ < ∞, zero elsewhere. Show that 2Y₃ is an unbiased estimator of θ. Determine the joint p.d.f. of Y₃ and the sufficient statistic Y₅ for θ. Find the conditional expectation E(2Y₃|y₅) = φ(y₅). Compare the variances of 2Y₃ and φ(Y₅). Hint. All of the integrals needed in this exercise can be evaluated by making a change of variable such as z = y/θ and using the results associated with the beta p.d.f.; see Example 5, Section 4.3.

10.10. If X₁, X₂ is a random sample of size 2 from a distribution having p.d.f. f(x; θ) = (1/θ)e^{−x/θ}, 0 < x < ∞, 0 < θ < ∞, zero elsewhere, find the joint p.d.f. of the sufficient statistic Y₁ = X₁ + X₂ for θ and Y₂ = X₂. Show that Y₂ is an unbiased estimator of θ with variance θ². Find E(Y₂|y₁) = φ(y₁) and the variance of φ(Y₁).

10.11. Let the random variables X and Y have the joint p.d.f. f(x, y) = (2/θ²)e^{−(x+y)/θ}, 0 < x < y < ∞, zero elsewhere.
(a) Show that the mean and the variance of Y are, respectively, 3θ/2 and 5θ²/4.
(b) Show that E(Y|x) = x + θ. In accordance with the Rao-Blackwell theorem, the expected value of X + θ is that of Y, namely, 3θ/2, and the variance of X + θ is less than that of Y. Show that the variance of X + θ is in fact θ²/4.

10.12. In each of Exercises 10.1, 10.2, and 10.5, compute the expected value of the given sufficient statistic and, in each case, determine an unbiased estimator of θ that is a function of that sufficient statistic alone.

10.3 Completeness and Uniqueness

Let X₁, X₂, ..., Xₙ be a random sample from the distribution that has p.d.f.

f(x; θ) = θ^x e^{−θ}/x!,  x = 0, 1, 2, ...;  0 < θ,
        = 0 elsewhere.

From Exercise 10.2 of Section 10.1 we know that Y₁ = Σ_{i=1}^{n} Xᵢ is a sufficient statistic for θ and its p.d.f. is

g₁(y₁; θ) = (nθ)^{y₁} e^{−nθ}/y₁!,  y₁ = 0, 1, 2, ...,
          = 0 elsewhere.

Let us consider the family {g₁(y₁; θ); 0 < θ} of probability density functions. Suppose that the function u(Y₁) of Y₁ is such that E[u(Y₁)] = 0 for every θ > 0. We shall show that this requires u(y₁) to be zero at every point y₁ = 0, 1, 2, .... That is,

0 = u(0) = u(1) = u(2) = u(3) = ···.

We have for all θ > 0 that

0 = E[u(Y₁)] = Σ_{y₁=0}^{∞} u(y₁)(nθ)^{y₁} e^{−nθ}/y₁!
             = e^{−nθ}[u(0) + u(1)(nθ)/1! + u(2)(nθ)²/2! + ···].

Since e^{−nθ} does not equal zero, we have that

0 = u(0) + [nu(1)]θ + [n²u(2)/2]θ² + ···.

However, if such an infinite series converges to zero for all θ > 0, then each of the coefficients must equal zero. That is,

u(0) = 0,  nu(1) = 0,  n²u(2)/2 = 0, ...,

and thus 0 = u(0) = u(1) = u(2) = ···, as we wanted to show. Of course, the condition E[u(Y₁)] = 0 for all θ > 0 does not place any restriction on u(y₁) when y₁ is not a nonnegative integer. So we see that, in this illustration, E[u(Y₁)] = 0 for all θ > 0 requires that u(y₁) equals zero except on a set of points that has probability zero for each p.d.f. g₁(y₁; θ), 0 < θ. From the following definition we observe that the family {g₁(y₁; θ); 0 < θ} is complete.
Definition 2. Let the random variable Z of either the continuous type or the discrete type have a p.d.f. that is one member of the family {h(z; θ); θ ∈ Ω}. If the condition E[u(Z)] = 0, for every θ ∈ Ω, requires that u(z) be zero except on a set of points that has probability zero for each p.d.f. h(z; θ), θ ∈ Ω, then the family {h(z; θ); θ ∈ Ω} is called a complete family of probability density functions.

Remark. In Section 1.9 it was noted that the existence of E[u(X)] implies that the integral (or sum) converges absolutely. This absolute convergence was tacitly assumed in our definition of completeness and it is needed to prove that certain families of probability density functions are complete.
In order to show that certain families of probability density functions
of the continuous type are complete, we must appeal to the same type
of theorem in analysis that we used when we claimed that the moment-
generating function uniquely determines a distribution. This is illus-
trated in the next example.
Example 1. Let Z have a p.d.f. that is a member of the family {h(z; θ); 0 < θ < ∞}, where

h(z; θ) = (1/θ) e^{−z/θ},  0 < z < ∞,
        = 0 elsewhere.

Let us say that E[u(Z)] = 0 for every θ > 0. That is,

(1/θ) ∫_{0}^{∞} u(z) e^{−z/θ} dz = 0,  for θ > 0.

Readers acquainted with the theory of transforms will recognize the integral in the left-hand member as being essentially the Laplace transform of u(z). In that theory we learn that the only function u(z) transforming to a function of θ which is identically equal to zero is u(z) = 0, except (in our terminology) on a set of points that has probability zero for each h(z; θ), 0 < θ. That is, the family {h(z; θ); 0 < θ < ∞} is complete.
Let the parameter θ in the p.d.f. f(x; θ), θ ∈ Ω, have a sufficient statistic Y₁ = u₁(X₁, X₂, ..., Xₙ), where X₁, X₂, ..., Xₙ is a random sample from this distribution. Let the p.d.f. of Y₁ be g₁(y₁; θ), θ ∈ Ω. It has been seen that, if there is any unbiased estimator Y₂ (not a function of Y₁ alone) of θ, then there is at least one function of Y₁ that is an unbiased estimator of θ, and our search for a best estimator of θ may be restricted to functions of Y₁. Suppose it has been verified that a certain function φ(Y₁), not a function of θ, is such that E[φ(Y₁)] = θ for all values of θ, θ ∈ Ω. Let ψ(Y₁) be another function of the sufficient statistic Y₁ alone so that we have also E[ψ(Y₁)] = θ for all values of θ, θ ∈ Ω. Hence

E[φ(Y₁) − ψ(Y₁)] = 0,  θ ∈ Ω.

If the family {g₁(y₁; θ); θ ∈ Ω} is complete, the function φ(y₁) − ψ(y₁) = 0, except on a set of points that has probability zero. That is, for every other unbiased estimator ψ(Y₁) of θ, we have

φ(y₁) = ψ(y₁),

except possibly at certain special points. Thus, in this sense [namely φ(y₁) = ψ(y₁), except on a set of points with probability zero], φ(Y₁) is the unique function of Y₁ which is an unbiased estimator of θ. In accordance with the Rao-Blackwell theorem, φ(Y₁) has a smaller variance than every other unbiased estimator of θ. That is, the statistic φ(Y₁) is the best estimator of θ. This fact is stated in the following theorem of Lehmann and Scheffé.
Theorem 5. Let X₁, X₂, ..., Xₙ, n a fixed positive integer, denote a random sample from a distribution that has p.d.f. f(x; θ), θ ∈ Ω, let Y₁ = u₁(X₁, X₂, ..., Xₙ) be a sufficient statistic for θ, and let the family {g₁(y₁; θ); θ ∈ Ω} of probability density functions be complete. If there is a function of Y₁ that is an unbiased estimator of θ, then this function of Y₁ is the unique best estimator of θ. Here "unique" is used in the sense described in the preceding paragraph.
The statement that Y₁ is a sufficient statistic for a parameter θ, θ ∈ Ω, and that the family {g₁(y₁; θ); θ ∈ Ω} of probability density functions is complete is lengthy and somewhat awkward. We shall adopt the less descriptive, but more convenient, terminology that Y₁ is a complete sufficient statistic for θ. In the next section we shall study a fairly large class of probability density functions for which a complete sufficient statistic Y₁ for θ can be determined by inspection.
EXERCISES

10.13. If az² + bz + c = 0 for more than two values of z, then a = b = c = 0. Use this result to show that the family {b(2, θ); 0 < θ < 1} is complete.

10.14. Show that each of the following families {f(x; θ); 0 < θ < ∞} is not complete by finding at least one nonzero function u(x) such that E[u(X)] = 0 for all θ > 0.
(a) f(x; θ) = 1/(2θ), −θ < x < θ, zero elsewhere.
(b) n(0, θ).

10.15. Let X₁, X₂, ..., Xₙ represent a random sample from the discrete distribution having the probability density function

f(x; θ) = θ^x (1 − θ)^{1−x},  x = 0, 1,  0 < θ < 1,
        = 0 elsewhere.

Show that Y₁ = Σ_{i=1}^{n} Xᵢ is a complete sufficient statistic for θ. Find the unique function of Y₁ that is the best estimator of θ. Hint. Display E[u(Y₁)] = 0, show that the constant term u(0) is equal to zero, divide both members of the equation by θ ≠ 0, and repeat the argument.

10.16. Consider the family of probability density functions {h(z; θ); θ ∈ Ω}, where h(z; θ) = 1/θ, 0 < z < θ, zero elsewhere.
(a) Show that the family is complete provided that Ω = {θ; 0 < θ < ∞}. Hint. For convenience, assume that u(z) is continuous and note that the derivative of E[u(Z)] with respect to θ is equal to zero also.
(b) Show that this family is not complete if Ω = {θ; 1 < θ < ∞}. Hint. Concentrate on the interval 0 < z < 1 and find a nonzero function u(z) on that interval such that E[u(Z)] = 0 for all θ > 1.

10.17. Show that the first order statistic Y₁ of a random sample of size n from the distribution having p.d.f. f(x; θ) = e^{−(x−θ)}, θ < x < ∞, −∞ < θ < ∞, zero elsewhere, is a complete sufficient statistic for θ. Find the unique function of this statistic which is the best estimator of θ.

10.18. Let a random sample of size n be taken from a distribution of the discrete type with p.d.f. f(x; θ) = 1/θ, x = 1, 2, ..., θ, zero elsewhere, where θ is an unknown positive integer.
(a) Show that the largest item, say Y, of the sample is a complete sufficient statistic for θ.
(b) Prove that

[Y^{n+1} − (Y − 1)^{n+1}] / [Yⁿ − (Y − 1)ⁿ]

is the unique best estimator of θ.

10.4 The Exponential Class of Probability Density Functions

Consider a family {f(x; θ); θ ∈ Ω} of probability density functions, where Ω is the interval set Ω = {θ; γ < θ < δ}, where γ and δ are known constants, and where

(1)  f(x; θ) = exp [p(θ)K(x) + S(x) + q(θ)],  a < x < b,
             = 0 elsewhere.

A p.d.f. of the form (1) is said to be a member of the exponential class of probability density functions of the continuous type. If, in addition,
(a) neither a nor b depends upon θ, γ < θ < δ,
(b) p(θ) is a nontrivial continuous function of θ, γ < θ < δ,
(c) each of K′(x) ≢ 0 and S(x) is a continuous function of x, a < x < b,
we say that we have a regular case of the exponential class. For example, each member of the family {f(x; θ); 0 < θ < ∞}, where f(x; θ) is n(0, θ), represents a regular case of the exponential class of the continuous type because

f(x; θ) = (1/√(2πθ)) e^{−x²/(2θ)} = exp (−x²/(2θ) − ln √(2πθ)),  −∞ < x < ∞.

A p.d.f.

f(x; θ) = exp [p(θ)K(x) + S(x) + q(θ)],  x = a₁, a₂, a₃, ...,
        = 0 elsewhere,

is said to represent a regular case of the exponential class of probability density functions of the discrete type if
(a) the set {x; x = a₁, a₂, ...} does not depend upon θ,
(b) p(θ) is a nontrivial continuous function of θ, γ < θ < δ,
(c) K(x) is a nontrivial function of x on the set {x; x = a₁, a₂, ...}.

Let X₁, X₂, ..., Xₙ denote a random sample from a distribution that has a p.d.f. which represents a regular case of the exponential class of the continuous type. The joint p.d.f. of X₁, X₂, ..., Xₙ is

exp [p(θ) Σ_{i=1}^{n} K(xᵢ) + Σ_{i=1}^{n} S(xᵢ) + nq(θ)]

for a < xᵢ < b, i = 1, 2, ..., n, γ < θ < δ, and is zero elsewhere. At points of positive probability density, this joint p.d.f. may be written as the product of the two nonnegative functions

exp [p(θ) Σ_{i=1}^{n} K(xᵢ) + nq(θ)]  and  exp [Σ_{i=1}^{n} S(xᵢ)].
In accordance with the factorization theorem (Theorem 1, Section 10.1), Y₁ = Σ_{i=1}^{n} K(Xᵢ) is a sufficient statistic for the parameter θ. To prove that Y₁ = Σ_{i=1}^{n} K(Xᵢ) is a sufficient statistic for θ in the discrete case, we take the joint p.d.f. of X₁, X₂, ..., Xₙ to be positive on a discrete set of points, say, when xᵢ ∈ {x; x = a₁, a₂, ...}, i = 1, 2, ..., n. We then use the factorization theorem. It is left as an exercise to show that in either the continuous or the discrete case the p.d.f. of Y₁ is of the form

g₁(y₁; θ) = R(y₁) exp [p(θ)y₁ + nq(θ)]

at points of positive probability density. The points of positive probability density and the function R(y₁) do not depend upon θ.

At this time we use a theorem in analysis to assert that the family {g₁(y₁; θ); γ < θ < δ} of probability density functions is complete. This is the theorem we used when we asserted that a moment-generating function (when it exists) uniquely determines a distribution. In the present context it can be stated as follows.

Theorem 6. Let f(x; θ), γ < θ < δ, be a p.d.f. which represents a regular case of the exponential class. Then if X₁, X₂, ..., Xₙ (where n is a fixed positive integer) is a random sample from a distribution with p.d.f. f(x; θ), the statistic Y₁ = Σ_{i=1}^{n} K(Xᵢ) is a sufficient statistic for θ and the family {g₁(y₁; θ); γ < θ < δ} of probability density functions of Y₁ is complete. That is, Y₁ is a complete sufficient statistic for θ.

This theorem has useful implications. In a regular case of form (1), we can see by inspection that the sufficient statistic is Y₁ = Σ_{i=1}^{n} K(Xᵢ). If we can see how to form a function of Y₁, say φ(Y₁), so that E[φ(Y₁)] = θ, then the statistic φ(Y₁) is unique and is the best estimator of θ.

Example 1. Let X₁, X₂, ..., Xₙ denote a random sample from a normal distribution that has p.d.f.

f(x; θ) = (1/(σ√(2π))) exp [−(x − θ)²/(2σ²)],  −∞ < x < ∞,  −∞ < θ < ∞,

or

f(x; θ) = exp [(θ/σ²)x − x²/(2σ²) − ln √(2πσ²) − θ²/(2σ²)].

Here σ² is any fixed positive number. This is a regular case of the exponential class with

p(θ) = θ/σ²,  K(x) = x,  S(x) = −x²/(2σ²) − ln √(2πσ²),  q(θ) = −θ²/(2σ²).

Accordingly, Y₁ = X₁ + X₂ + ··· + Xₙ = nX̄ is a complete sufficient statistic for the mean θ of a normal distribution for every fixed value of the variance σ². Since E(Y₁) = nθ, then φ(Y₁) = Y₁/n = X̄ is the only function of Y₁ that is an unbiased estimator of θ; and being a function of the sufficient statistic Y₁, it has a minimum variance. That is, X̄ is the unique best estimator of θ. Incidentally, since Y₁ is a single-valued function of X̄, X̄ itself is also a complete sufficient statistic for θ.

Example 2. Consider a Poisson distribution with parameter θ, 0 < θ < ∞. The p.d.f. of this distribution is

f(x; θ) = θ^x e^{−θ}/x! = exp [(ln θ)x − ln (x!) − θ],  x = 0, 1, 2, ...,
        = 0 elsewhere.

In accordance with Theorem 6, Y₁ = Σ_{i=1}^{n} Xᵢ is a complete sufficient statistic for θ. Since E(Y₁) = nθ, the statistic φ(Y₁) = Y₁/n = X̄, which is also a complete sufficient statistic for θ, is the unique best estimator of θ.
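A brief simulation (an editorial sketch, not part of the original text) illustrates Example 2: since K(x) = x for the Poisson, Y₁ = ΣXᵢ is the complete sufficient statistic found by inspection, and φ(Y₁) = Y₁/n = X̄ should be unbiased for θ with variance θ/n. The parameter, sample size, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)     # arbitrary seed
theta, n, reps = 3.0, 10, 200_000
samples = rng.poisson(theta, size=(reps, n))
y1 = samples.sum(axis=1)           # the sufficient statistic for each simulated sample
phi = y1 / n                       # the candidate best estimator, X_bar
print(phi.mean())                  # close to theta = 3.0
print(phi.var(), theta / n)        # empirical variance close to theta/n
```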
EXERCISES

10.19. Write the p.d.f. f(x; θ), 0 < x < ∞, 0 < θ < ∞, zero elsewhere, in the exponential form. If X₁, X₂, ..., Xₙ is a random sample from this distribution, find a complete sufficient statistic Y₁ for θ and the unique function φ(Y₁) of this statistic that is the best estimator of θ. Is φ(Y₁) itself a complete sufficient statistic?
10.20. Let X₁, X₂, ..., Xₙ denote a random sample of size n > 2 from a distribution with p.d.f. f(x; θ) = θe^{−θx}, 0 < x < ∞, zero elsewhere, and θ > 0. Then Y = Σ_{i=1}^{n} Xᵢ is a sufficient statistic for θ. Prove that (n − 1)/Y is the best estimator of θ.

10.21. Let X₁, X₂, ..., Xₙ denote a random sample of size n from a distribution with p.d.f. f(x; θ) = θx^{θ−1}, 0 < x < 1, zero elsewhere, and θ > 0.
(a) Show that the geometric mean (X₁X₂···Xₙ)^{1/n} of the sample is a complete sufficient statistic for θ.
(b) Find the maximum likelihood estimator of θ, and observe that it is a function of this geometric mean.

10.22. Let X̄ denote the mean of the random sample X₁, X₂, ..., Xₙ from a gamma-type distribution with parameters α > 0 and β = θ > 0. Compute E[X₁|x̄]. Hint. Can you find directly a function ψ(X̄) of X̄ such that E[ψ(X̄)] = θ? Is E(X₁|x̄) = ψ(x̄)? Why?

10.23. Let X be a random variable with a p.d.f. of a regular case of the exponential class. Show that E[K(X)] = −q′(θ)/p′(θ), provided these derivatives exist, by differentiating both members of the equality

∫_{a}^{b} exp [p(θ)K(x) + S(x) + q(θ)] dx = 1

with respect to θ. By a second differentiation, find the variance of K(X).

10.24. Given that f(x; θ) = exp [θK(x) + S(x) + q(θ)], a < x < b, γ < θ < δ, represents a regular case of the exponential class. Show that the moment-generating function M(t) of Y = K(X) is M(t) = exp [q(θ) − q(θ + t)], γ < θ + t < δ.

10.25. Given, in the preceding exercise, that E(Y) = E[K(X)] = θ. Prove that Y is n(θ, 1). Hint. Consider M′(0) = θ and solve the resulting differential equation.

10.26. If X₁, X₂, ..., Xₙ is a random sample from a distribution that has a p.d.f. which is a regular case of the exponential class, show that the p.d.f. of Y₁ = Σ_{i=1}^{n} K(Xᵢ) is of the form g₁(y₁; θ) = R(y₁) exp [p(θ)y₁ + nq(θ)]. Hint. Let Y₂ = X₂, ..., Yₙ = Xₙ be n − 1 auxiliary random variables. Find the joint p.d.f. of Y₁, Y₂, ..., Yₙ and then the marginal p.d.f. of Y₁.

10.27. Let Y denote the median and let X̄ denote the mean of a random sample of size n = 2k + 1 from a distribution that is n(μ, σ²). Compute E(Y|X̄ = x̄). Hint. See Exercise 10.22.
10.5 Functions of a Parameter
Up to this point we have sought an unbiased and minimum variance estimator of a parameter θ. Not always, however, are we interested in θ but rather in a function of θ. This will be illustrated in the following examples.
Example 1. Let X₁, X₂, ..., Xₙ denote the items of a random sample of size n > 1 from a distribution that is b(1, θ), 0 < θ < 1. We know that if Y = Σ_{i=1}^{n} Xᵢ, then Y/n is the unique best estimator of θ. Now the variance of Y/n is θ(1 − θ)/n. Suppose that an unbiased and minimum variance estimator of this variance is sought. Because Y is a sufficient statistic for θ, it is known that we can restrict our search to functions of Y. Consider the statistic (Y/n)(1 − Y/n)/n. This statistic is suggested by the fact that Y/n is the best estimator of θ. The expectation of this statistic is given by

(1/n) E[(Y/n)(1 − Y/n)] = (1/n²)E(Y) − (1/n³)E(Y²).

Now E(Y) = nθ and E(Y²) = nθ(1 − θ) + n²θ². Hence

(1/n) E[(Y/n)(1 − Y/n)] = [(n − 1)/n] [θ(1 − θ)/n].

If we multiply both members of this equation by n/(n − 1), we find that the statistic (Y/n)(1 − Y/n)/(n − 1) is the unique best estimator of the variance of Y/n.
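The unbiasedness claimed in Example 1 is easy to check by simulation (an editorial sketch, not part of the original text); the parameter value, sample size, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)               # arbitrary seed
theta, n, reps = 0.3, 15, 400_000
y = rng.binomial(n, theta, size=reps)        # Y = number of successes in n Bernoulli(theta) trials
estimate = (y / n) * (1 - y / n) / (n - 1)   # the proposed best estimator of var(Y/n)
print(estimate.mean(), theta * (1 - theta) / n)   # the two agree closely
```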
A somewhat different, but very important, problem in point estimation is considered in the next example. In the example the distribution of a random variable X is described by a p.d.f. f(x; θ) that depends upon θ ∈ Ω. The problem is to estimate the fractional part of the probability for this distribution which is at or to the left of a fixed point c. Thus we seek an unbiased, minimum variance estimator of F(c; θ), where F(x; θ) is the distribution function of X.
Example 2. Let X₁, X₂, ..., Xₙ be a random sample of size n > 1 from a distribution that is n(θ, 1). Suppose that we wish to find a best estimator of the function of θ defined by

Pr (X ≤ c) = ∫_{−∞}^{c} (1/√(2π)) e^{−(x−θ)²/2} dx = N(c − θ),

where c is a fixed constant. There are many unbiased estimators of N(c − θ). We first exhibit one of these, say u(X₁), a function of X₁ alone. We shall then compute the conditional expectation, E[u(X₁)|X̄ = x̄] = φ(x̄), of this unbiased statistic, given the sufficient statistic X̄, the mean of the sample. In accordance with the theorems of Rao-Blackwell and Lehmann-Scheffé, φ(X̄) is the unique best estimator of N(c − θ).

Consider the function u(x₁), where

u(x₁) = 1,  x₁ ≤ c,
      = 0,  x₁ > c.

The expected value of the random variable u(X₁) is given by

E[u(X₁)] = ∫_{−∞}^{∞} u(x₁) (1/√(2π)) exp [−(x₁ − θ)²/2] dx₁
         = ∫_{−∞}^{c} (1/√(2π)) exp [−(x₁ − θ)²/2] dx₁,

because u(x₁) = 0, x₁ > c. But the latter integral has the value N(c − θ). That is, u(X₁) is an unbiased estimator of N(c − θ).

We shall next discuss the joint distribution of X₁ and X̄ and the conditional distribution of X₁, given X̄ = x̄. This conditional distribution will enable us to compute E[u(X₁)|X̄ = x̄] = φ(x̄). In accordance with Exercise 4.81, Section 4.7, the joint distribution of X₁ and X̄ is bivariate normal with means θ and θ, variances σ₁² = 1 and σ₂² = 1/n, and correlation coefficient ρ = 1/√n. Thus the conditional p.d.f. of X₁, given X̄ = x̄, is normal with linear conditional mean

θ + (ρσ₁/σ₂)(x̄ − θ) = x̄

and with variance

σ₁²(1 − ρ²) = (n − 1)/n.

The conditional expectation of u(X₁), given X̄ = x̄, is then

φ(x̄) = ∫_{−∞}^{∞} u(x₁) √(n/(n − 1)) (1/√(2π)) exp [−n(x₁ − x̄)²/(2(n − 1))] dx₁
     = ∫_{−∞}^{c} √(n/(n − 1)) (1/√(2π)) exp [−n(x₁ − x̄)²/(2(n − 1))] dx₁.

The change of variable z = √n(x₁ − x̄)/√(n − 1) enables us to write, with c′ = √n(c − x̄)/√(n − 1), this conditional expectation as

φ(x̄) = ∫_{−∞}^{c′} (1/√(2π)) e^{−z²/2} dz = N(c′) = N[√n(c − x̄)/√(n − 1)].

Thus the unique, unbiased, and minimum variance estimator of N(c − θ) is, for every fixed constant c, given by φ(X̄) = N[√n(c − X̄)/√(n − 1)].
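The unbiasedness of φ(X̄) can be verified by simulation (an editorial sketch, not part of the original text); here N(·) is the standard normal distribution function, and the parameter value, sample size, cutoff c, and seed are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)     # arbitrary seed
theta, n, c, reps = 0.5, 8, 1.0, 200_000
xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
phi = norm.cdf(np.sqrt(n) * (c - xbar) / np.sqrt(n - 1))   # the best estimator of N(c - theta)
print(phi.mean(), norm.cdf(c - theta))                     # the two agree closely
```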
Remark. We should like to draw the attention of the reader to a rather important fact. This has to do with the adoption of a principle, such as the principle of unbiasedness and minimum variance. A principle is not a theorem; and seldom does a principle yield satisfactory results in all cases. So far, this principle has provided quite satisfactory results. To see that this is not always the case, let X have a Poisson distribution with parameter θ, 0 < θ < ∞. We may look upon X as a random sample of size 1 from this distribution. Thus X is a complete sufficient statistic for θ. We seek the best estimator of e^{−2θ}, best in the sense of being unbiased and having minimum variance. Consider Y = (−1)^X. We have

E(Y) = E[(−1)^X] = Σ_{x=0}^{∞} (−θ)^x e^{−θ}/x! = e^{−2θ}.

Accordingly, (−1)^X is the (unique) best estimator of e^{−2θ}, in the sense described. Here our principle leaves much to be desired. We are endeavoring to elicit some information about the number e^{−2θ}, where 0 < e^{−2θ} < 1. Yet our point estimate is either −1 or +1, each of which is a very poor estimate of a number between zero and 1. We do not wish to leave the reader with the impression that an unbiased, minimum variance estimator is bad. That is not the case at all. We merely wish to point out that if one tries hard enough, he can find instances where such a statistic is not good. Incidentally, the maximum likelihood estimator of e^{−2θ} is, in the case where the sample size equals 1, e^{−2X}, which is probably a much better estimator in practice than is the "best" estimator (−1)^X.
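A short simulation (an editorial sketch, not part of the original text) makes the point of the Remark vivid: (−1)^X averages to e^{−2θ} but only ever takes the values −1 and +1, while the maximum likelihood estimator e^{−2X} is biased yet far more reasonable as a point estimate. The parameter and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)     # arbitrary seed
theta, reps = 1.0, 500_000
x = rng.poisson(theta, size=reps)
unbiased = (-1.0) ** x
mle = np.exp(-2.0 * x)
print(np.exp(-2 * theta))            # the target, about 0.135
print(unbiased.mean(), mle.mean())   # the first is close to the target; the m.l.e. is biased
print(np.unique(unbiased))           # the "best" estimator only ever takes the values -1 and 1
```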
EXERCISES
10.28. Let Xl> X 2,... , X n denote a random sample from a distribution
that is n(O, 1), -00 < 0 < 00. Find the best estimator of 02. Hint. First
determine E(X2).
10.29. Let X], X 2, ... , X n denote a random sample from a distribution
that is n(O, 0). Then Y = 2: X? is a sufficient statistic for O. Find the best
estimator of 02 •
10.30. In the notation of Example 2 of this section, is there a best esti-
mator of Pr (-c s X s c)? Here c > 0.
10.31. Let Xl> X 2 , •. " X; be a random sample from a Poisson distri-
bution with parameter 0 > 0. Find the best estimator of Pr (X ~ 1) =
(1 + O)e-9
• Hint. Let u(x]) = 1, x] ~ 1, zero elsewhere, and find
n
E[u(X])IY = y], where Y = 2: X,. Make use of Example 2, Section 4.2.
1
10.32. Let X], X 2 , •• " X n denote a random sample from a Poisson
distribution with parameter θ > 0. From the Remark of this section, we know that E[(−1)^{X₁}] = e^{−2θ}.
(a) Show that E[(−1)^{X₁}|Y₁ = y₁] = (1 − 2/n)^{y₁}, where Y₁ = X₁ + X₂ + ··· + Xₙ. Hint. First show that the conditional p.d.f. of X₁, X₂, ..., X_{n−1}, given Y₁ = y₁, is multinomial, and hence that of X₁ given Y₁ = y₁ is b(y₁, 1/n).
(b) Show that the maximum likelihood estimator of e^{−2θ} is e^{−2X̄}.
(c) Since Y₁ = nX̄, show that (1 − 2/n)^{Y₁} is approximately equal to e^{−2X̄} when n is large.
10.6 The Case of Several Parameters

In many of the interesting problems we encounter, the p.d.f. may not depend upon a single parameter θ, but perhaps upon two (or more) parameters, say θ₁ and θ₂, where (θ₁, θ₂) ∈ Ω, a two-dimensional parameter space. We now define joint sufficient statistics for the parameters. For the moment we shall restrict ourselves to the case of two parameters.

Definition 3. Let X₁, X₂, ..., Xₙ denote a random sample from a distribution that has p.d.f. f(x; θ₁, θ₂), where (θ₁, θ₂) ∈ Ω. Let Y₁ = u₁(X₁, X₂, ..., Xₙ) and Y₂ = u₂(X₁, X₂, ..., Xₙ) be two statistics whose joint p.d.f. is g₁(y₁, y₂; θ₁, θ₂). The statistics Y₁ and Y₂ are called joint sufficient statistics for θ₁ and θ₂ if and only if

f(x₁; θ₁, θ₂)f(x₂; θ₁, θ₂) ··· f(xₙ; θ₁, θ₂) / g₁[u₁(x₁, ..., xₙ), u₂(x₁, ..., xₙ); θ₁, θ₂] = H(x₁, x₂, ..., xₙ),

where, for every fixed y₁ = u₁(x₁, ..., xₙ) and y₂ = u₂(x₁, ..., xₙ), H(x₁, x₂, ..., xₙ) does not depend upon θ₁ or θ₂.

As may be anticipated, the factorization theorem can be extended. In our notation it can be stated in the following manner. The statistics Y₁ = u₁(X₁, X₂, ..., Xₙ) and Y₂ = u₂(X₁, X₂, ..., Xₙ) are joint sufficient statistics for the parameters θ₁ and θ₂ if and only if we can find two nonnegative functions k₁ and k₂ such that

f(x₁; θ₁, θ₂)f(x₂; θ₁, θ₂) ··· f(xₙ; θ₁, θ₂)
   = k₁[u₁(x₁, x₂, ..., xₙ), u₂(x₁, x₂, ..., xₙ); θ₁, θ₂] k₂(x₁, x₂, ..., xₙ),

where, for all fixed values of the functions y₁ = u₁(x₁, x₂, ..., xₙ) and y₂ = u₂(x₁, x₂, ..., xₙ), the function k₂(x₁, x₂, ..., xₙ) does not depend upon both or either of θ₁ and θ₂.

Example 1. Let X₁, X₂, ..., Xₙ be a random sample from a distribution having p.d.f.

f(x; θ₁, θ₂) = 1/(2θ₂),  θ₁ − θ₂ < x < θ₁ + θ₂,
             = 0 elsewhere,

where −∞ < θ₁ < ∞, 0 < θ₂ < ∞. Let Y₁ < Y₂ < ··· < Yₙ be the order statistics. The joint p.d.f. of Y₁ and Yₙ is given by

n(n − 1)(yₙ − y₁)^{n−2}/(2θ₂)ⁿ,  θ₁ − θ₂ < y₁ < yₙ < θ₁ + θ₂,

and equals zero elsewhere. Accordingly, the joint p.d.f. of X₁, X₂, ..., Xₙ can be written, for points of positive probability density,

(1/(2θ₂))ⁿ = {n(n − 1)[max (xᵢ) − min (xᵢ)]^{n−2}/(2θ₂)ⁿ} {1/(n(n − 1)[max (xᵢ) − min (xᵢ)]^{n−2})}.

Since the last factor does not depend upon the parameters, either the definition or the factorization theorem assures us that Y₁ and Yₙ are joint sufficient statistics for θ₁ and θ₂.
The extension of the notion of joint sufficient statistics for more
than two parameters is a natural one. Suppose that a certain p.d.f.
depends upon m parameters. Let a random sample of size n be taken
from the distribution that has this p.d.f. and define m statistics. These
m statistics are called joint sufficient statistics for the m parameters if
and only if the ratio of the joint p.d.f. of the items of the random
sample and the joint p.d.f. of these m statistics does not depend upon
the m parameters, whatever the fixed values of the m statistics. Again
the factorization theorem is readily extended.
There is an extension of the Rao-Blackwell theorem that can be
adapted to joint sufficient statistics for several parameters, but that
extension will not be included in this book. However, the concept of a
complete family of probability density functions is generalized as follows: Let

{h(v₁, v₂, ..., v_k; θ₁, θ₂, ..., θ_m);  (θ₁, θ₂, ..., θ_m) ∈ Ω}

denote a family of probability density functions of k random variables V₁, V₂, ..., V_k that depends upon m parameters (θ₁, θ₂, ..., θ_m) ∈ Ω.
Let u(v₁, v₂, ..., v_k) be a function of v₁, v₂, ..., v_k (but not a function of any or all of the parameters). If

E[u(V₁, V₂, ..., V_k)] = 0

for all (θ₁, θ₂, ..., θ_m) ∈ Ω implies that u(v₁, v₂, ..., v_k) = 0 at all points (v₁, v₂, ..., v_k), except on a set of points that has probability zero for all members of the family of probability density functions, we shall say that the family of probability density functions is a complete family.

The remainder of our treatment of the case of several parameters will be restricted to probability density functions that represent what we shall call regular cases of the exponential class. Let X₁, X₂, ..., Xₙ, n > m, denote a random sample from a distribution that depends on m parameters and has a p.d.f. of the form

(1)  f(x; θ₁, θ₂, ..., θ_m) = exp [Σ_{j=1}^{m} pⱼ(θ₁, θ₂, ..., θ_m)Kⱼ(x) + S(x) + q(θ₁, θ₂, ..., θ_m)]

for a < x < b, and equals zero elsewhere.

A p.d.f. of the form (1) is said to be a member of the exponential class of probability density functions of the continuous type. If, in addition,
(a) neither a nor b depends upon any or all of the parameters θ₁, θ₂, ..., θ_m,
(b) the pⱼ(θ₁, θ₂, ..., θ_m), j = 1, 2, ..., m, are nontrivial, functionally independent, continuous functions of θⱼ, γⱼ < θⱼ < δⱼ, j = 1, 2, ..., m,
(c) the Kⱼ(x), j = 1, 2, ..., m, are continuous for a < x < b and no one is a linear homogeneous function of the others,
(d) S(x) is a continuous function of x, a < x < b,
we say that we have a regular case of the exponential class.

The joint p.d.f. of X₁, X₂, ..., Xₙ is given, at points of positive probability density, by

exp [Σ_{j=1}^{m} pⱼ(θ₁, ..., θ_m) Σ_{i=1}^{n} Kⱼ(xᵢ) + Σ_{i=1}^{n} S(xᵢ) + nq(θ₁, ..., θ_m)]
   = exp [Σ_{j=1}^{m} pⱼ(θ₁, ..., θ_m) Σ_{i=1}^{n} Kⱼ(xᵢ) + nq(θ₁, ..., θ_m)] exp [Σ_{i=1}^{n} S(xᵢ)].

In accordance with the factorization theorem, the statistics

Y₁ = Σ_{i=1}^{n} K₁(Xᵢ),  Y₂ = Σ_{i=1}^{n} K₂(Xᵢ), ...,  Y_m = Σ_{i=1}^{n} K_m(Xᵢ)

are joint sufficient statistics for the m parameters θ₁, θ₂, ..., θ_m. It is left as an exercise to prove that the joint p.d.f. of Y₁, ..., Y_m is of the form

(2)  R(y₁, ..., y_m) exp [Σ_{j=1}^{m} pⱼ(θ₁, ..., θ_m)yⱼ + nq(θ₁, ..., θ_m)]

at points of positive probability density. These points of positive probability density and the function R(y₁, ..., y_m) do not depend upon any or all of the parameters θ₁, θ₂, ..., θ_m. Moreover, in accordance with a theorem in analysis, it can be asserted that, in a regular case of the exponential class, the family of probability density functions of these joint sufficient statistics Y₁, Y₂, ..., Y_m is complete when n > m. In accordance with a convention previously adopted, we shall refer to Y₁, Y₂, ..., Y_m as joint complete sufficient statistics for the parameters θ₁, θ₂, ..., θ_m.

Example 2. Let X₁, X₂, ..., Xₙ denote a random sample from a distribution that is n(θ₁, θ₂), −∞ < θ₁ < ∞, 0 < θ₂ < ∞. Thus the p.d.f. f(x; θ₁, θ₂) of the distribution may be written as

f(x; θ₁, θ₂) = exp [−(1/(2θ₂))x² + (θ₁/θ₂)x − θ₁²/(2θ₂) − ln √(2πθ₂)].

Therefore, we can take K₁(x) = x² and K₂(x) = x. Consequently, the statistics

Y₁ = Σ_{i=1}^{n} Xᵢ²  and  Y₂ = Σ_{i=1}^{n} Xᵢ

are joint complete sufficient statistics for θ₁ and θ₂. Since the relations

Z₁ = Y₂/n = X̄  and  Z₂ = (Y₁ − Y₂²/n)/(n − 1) = Σ_{i=1}^{n} (Xᵢ − X̄)²/(n − 1)

define a one-to-one transformation, Z₁ and Z₂ are also joint complete sufficient statistics for θ₁ and θ₂. Moreover,

E(Z₁) = θ₁  and  E(Z₂) = θ₂.

From completeness, we have that Z₁ and Z₂ are the only functions of Y₁ and Y₂ that are unbiased estimators of θ₁ and θ₂, respectively.

A p.d.f.

f(x; θ₁, θ₂, ..., θ_m) = exp [Σ_{j=1}^{m} pⱼ(θ₁, θ₂, ..., θ_m)Kⱼ(x) + S(x) + q(θ₁, θ₂, ..., θ_m)],  x = a₁, a₂, ...,

zero elsewhere, is said to represent a regular case of the exponential class of probability density functions of the discrete type if
(a) the set {x; x = a₁, a₂, ...} does not depend upon any or all of the parameters θ₁, θ₂, ..., θ_m,
(b) the pⱼ(θ₁, θ₂, ..., θ_m), j = 1, 2, ..., m, are nontrivial, functionally independent, continuous functions of θⱼ, γⱼ < θⱼ < δⱼ, j = 1, 2, ..., m,
(c) the Kⱼ(x), j = 1, 2, ..., m, are nontrivial functions of x on the set {x; x = a₁, a₂, ...} and no one is a linear function of the others.

Let X₁, X₂, ..., Xₙ denote a random sample from a discrete-type distribution that represents a regular case of the exponential class. Then the statements made above in connection with the random variable of the continuous type are also valid here.

Not always do we sample from a distribution of one random variable X. We could, for instance, sample from a distribution of two random variables V and W with joint p.d.f. f(v, w; θ₁, θ₂, ..., θ_m). Recall that by a random sample (V₁, W₁), (V₂, W₂), ..., (Vₙ, Wₙ) from a distribution of this sort, we mean that the joint p.d.f. of these 2n random variables is given by

f(v₁, w₁; θ₁, ..., θ_m)f(v₂, w₂; θ₁, ..., θ_m) ··· f(vₙ, wₙ; θ₁, ..., θ_m).

In particular, suppose that the random sample is taken from a distribution that has the p.d.f. of V and W of the exponential class

(3)  f(v, w; θ₁, ..., θ_m) = exp [Σ_{j=1}^{m} pⱼ(θ₁, ..., θ_m)Kⱼ(v, w) + S(v, w) + q(θ₁, ..., θ_m)]

for a < v < b, c < w < d, and equals zero elsewhere, where a, b, c, d do not depend on the parameters and conditions similar to (a), (b), (c), and (d), p. 366, are imposed. Then the m statistics

Y₁ = Σ_{i=1}^{n} K₁(Vᵢ, Wᵢ), ...,  Y_m = Σ_{i=1}^{n} K_m(Vᵢ, Wᵢ)

are joint complete sufficient statistics for the m parameters θ₁, θ₂, ..., θ_m.

EXERCISES

10.33. Let Y₁ < Y₂ < Y₃ be the order statistics of a random sample of size 3 from the distribution with p.d.f. f(x; θ₁, θ₂), θ₁ < x < ∞, zero elsewhere. Find the joint p.d.f. of Z₁ = Y₁, Z₂ = Y₂, and Z₃ = Y₁ + Y₂ + Y₃. The corresponding transformation maps the space {(y₁, y₂, y₃); θ₁ < y₁ < y₂ < y₃ < ∞} onto the space {(z₁, z₂, z₃); θ₁ < z₁ < z₂ < (z₃ − z₁)/2 < ∞}. Show that Z₁ and Z₃ are joint sufficient statistics for θ₁ and θ₂.

10.34. Let X₁, X₂, ..., Xₙ be a random sample from a distribution that has a p.d.f. of form (1) of this section. Show that Y₁ = Σ_{j=1}^{n} K₁(Xⱼ), ..., Y_m = Σ_{j=1}^{n} K_m(Xⱼ) have a joint p.d.f. of form (2) of this section.

10.35. Let (X₁, Y₁), (X₂, Y₂), ..., (Xₙ, Yₙ) denote a random sample of size n from a bivariate normal distribution with means μ₁ and μ₂, positive variances σ₁² and σ₂², and correlation coefficient ρ. Show that Σ_{i=1}^{n} Xᵢ, Σ_{i=1}^{n} Yᵢ, Σ_{i=1}^{n} Xᵢ², Σ_{i=1}^{n} Yᵢ², and Σ_{i=1}^{n} XᵢYᵢ are joint sufficient statistics for the five parameters. Are X̄ = Σ_{i=1}^{n} Xᵢ/n, Ȳ = Σ_{i=1}^{n} Yᵢ/n, S₁² = Σ_{i=1}^{n} (Xᵢ − X̄)²/n, S₂² = Σ_{i=1}^{n} (Yᵢ − Ȳ)²/n, and Σ_{i=1}^{n} (Xᵢ − X̄)(Yᵢ − Ȳ)/(nS₁S₂) also joint sufficient statistics for these parameters?

10.36. Let the p.d.f. f(x; θ₁, θ₂) be of the form

exp [p₁(θ₁, θ₂)K₁(x) + p₂(θ₁, θ₂)K₂(x) + S(x) + q(θ₁, θ₂)],  a < x < b,

zero elsewhere. Let K₁′(x) = cK₂′(x). Show that f(x; θ₁, θ₂) can be written in the form

exp [p(θ₁, θ₂)K₂(x) + S(x) + q₁(θ₁, θ₂)],  a < x < b,

zero elsewhere. This is the reason why it is required that no one Kⱼ′(x) be a linear homogeneous function of the others, that is, so that the number of sufficient statistics equals the number of parameters.

10.37. Let Y₁ < Y₂ < ··· < Yₙ be the order statistics of a random sample X₁, X₂, ..., Xₙ of size n from a distribution of the continuous type with p.d.f. f(x). Show that the ratio of the joint p.d.f. of X₁, X₂, ..., Xₙ and that of Y₁ < Y₂ < ··· < Yₙ is equal to 1/n!, which does not depend upon the underlying p.d.f. This suggests that Y₁ < Y₂ < ··· < Yₙ are joint sufficient statistics for the unknown "parameter" f.
Chapter 11

Further Topics in Statistical Inference

11.1 The Rao-Cramér Inequality

In this section we establish a lower bound for the variance of an unbiased estimator of a parameter.

Let X₁, X₂, ..., Xₙ denote a random sample from a distribution with p.d.f. f(x; θ), θ ∈ Ω = {θ; γ < θ < δ}, where γ and δ are known. Let Y = u(X₁, X₂, ..., Xₙ) be an unbiased estimator of θ. We shall show that the variance of Y, say σ_Y², satisfies the inequality

(1)  σ_Y² ≥ 1 / (n E{[∂ ln f(X; θ)/∂θ]²}).

Throughout this section, unless otherwise specified, it will be assumed that we may differentiate, with respect to a parameter, under an integral or a summation symbol. This means, among other things, that the domain of positive probability density does not depend upon θ.

We shall now give a proof of inequality (1) when X is a random variable of the continuous type. The reader can easily handle the discrete case by changing integrals to sums. Let g(y; θ) denote the p.d.f. of the unbiased statistic Y. We are given that

1 = ∫_{−∞}^{∞} f(xᵢ; θ) dxᵢ,  i = 1, 2, ..., n,

and

θ = ∫_{−∞}^{∞} y g(y; θ) dy = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} u(x₁, x₂, ..., xₙ) f(x₁; θ) ··· f(xₙ; θ) dx₁ ··· dxₙ.

The final form of the right-hand member of the second equation is justified by the discussion in Section 4.7. If we differentiate both members of each of these equations with respect to θ, we have

(2)  0 = ∫_{−∞}^{∞} [∂f(xᵢ; θ)/∂θ] dxᵢ = ∫_{−∞}^{∞} [∂ ln f(xᵢ; θ)/∂θ] f(xᵢ; θ) dxᵢ,  i = 1, 2, ..., n,

and

1 = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} u(x₁, x₂, ..., xₙ) [Σ_{i=1}^{n} (1/f(xᵢ; θ)) ∂f(xᵢ; θ)/∂θ] f(x₁; θ) ··· f(xₙ; θ) dx₁ ··· dxₙ
  = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} u(x₁, x₂, ..., xₙ) [Σ_{i=1}^{n} ∂ ln f(xᵢ; θ)/∂θ] f(x₁; θ) ··· f(xₙ; θ) dx₁ ··· dxₙ.

Define the random variable Z by Z = Σ_{i=1}^{n} [∂ ln f(Xᵢ; θ)/∂θ]. In accordance with the first of Equations (2) we have E(Z) = Σ_{i=1}^{n} E[∂ ln f(Xᵢ; θ)/∂θ] = 0. Moreover, Z is the sum of n mutually stochastically independent random variables each with mean zero and consequently with variance E{[∂ ln f(X; θ)/∂θ]²}. Hence the variance of Z is the sum of the n variances,

σ_Z² = n E{[∂ ln f(X; θ)/∂θ]²}.

Because Y = u(X₁, ..., Xₙ) and Z = Σ_{i=1}^{n} [∂ ln f(Xᵢ; θ)/∂θ], the second of Equations (2) shows that E(YZ) = 1. Recall (Section 2.3) that

E(YZ) = E(Y)E(Z) + ρσ_Yσ_Z,

where ρ is the correlation coefficient of Y and Z. Since E(Y) = θ and E(Z) = 0, we have

1 = θ·0 + ρσ_Yσ_Z  or  ρ = 1/(σ_Yσ_Z).

Now ρ² ≤ 1. Hence

ρ² = 1/(σ_Y²σ_Z²) ≤ 1  or  1/σ_Z² ≤ σ_Y².
o< x < 1,
If we replace u~ by its value, we have inequality (1),
1
u¥ ~ --------
nE[(0 ln~~; 8)fJ
Inequality (1) is known as the Rao-Cramér inequality. It provides, in cases in which we can differentiate with respect to a parameter under an integral or summation symbol, a lower bound on the variance of an unbiased estimator of a parameter, usually called the Rao-Cramér lower bound.

We now make the following definitions.

Definition 1. Let Y be an unbiased estimator of a parameter θ in such a case of point estimation. The statistic Y is called an efficient estimator of θ if and only if the variance of Y attains the Rao-Cramér lower bound.

It is left as an exercise to show, in these cases of point estimation, that E{[∂ ln f(X; θ)/∂θ]²} = −E[∂² ln f(X; θ)/∂θ²]. In some instances the latter is much easier to compute.

Definition 2. In cases in which we can differentiate with respect to a parameter under an integral or summation symbol, the ratio of the Rao-Cramér lower bound to the actual variance of any unbiased estimator of a parameter is called the efficiency of that statistic.
Example 1. Let X1, X2, ..., Xn denote a random sample from a Poisson distribution that has the mean θ > 0. It is known that X̄ is a maximum likelihood estimator of θ; we shall show that it is also an efficient estimator of θ. We have

∂ ln f(x; θ)/∂θ = ∂(x ln θ − θ − ln x!)/∂θ = x/θ − 1 = (x − θ)/θ.

Accordingly,

E{[∂ ln f(X; θ)/∂θ]²} = E[(X − θ)²]/θ² = σ²/θ² = θ/θ² = 1/θ.

The Rao-Cramér lower bound in this case is 1/[n(1/θ)] = θ/n. But θ/n is the variance σ_X̄² of X̄. Hence X̄ is an efficient estimator of θ.
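The claim in Example 1 is easy to check numerically. The following Python sketch (an illustration added here, not part of the original text; the values θ = 2 and n = 25 are arbitrary choices) simulates many Poisson samples and compares the sampling variance of X̄ with the Rao-Cramér lower bound θ/n.

```python
import numpy as np

# Monte Carlo check that X-bar attains the Rao-Cramer bound theta/n
# for the Poisson(theta) family (illustrative values only).
rng = np.random.default_rng(1)
theta, n, reps = 2.0, 25, 200_000

samples = rng.poisson(theta, size=(reps, n))
xbar = samples.mean(axis=1)

print("simulated Var(X-bar):", xbar.var())
print("Rao-Cramer bound theta/n:", theta / n)
```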
Example 2. Let S² denote the variance of a random sample of size n > 1 from a distribution that is n(μ, θ), 0 < θ < ∞. We know that E[nS²/(n − 1)] = θ. What is the efficiency of the estimator nS²/(n − 1)? We have

ln f(x; θ) = −(x − μ)²/(2θ) − ½ ln (2πθ),

∂ ln f(x; θ)/∂θ = (x − μ)²/(2θ²) − 1/(2θ),

and

∂² ln f(x; θ)/∂θ² = −(x − μ)²/θ³ + 1/(2θ²).

Accordingly,

−E[∂² ln f(X; θ)/∂θ²] = θ/θ³ − 1/(2θ²) = 1/(2θ²).

Thus the Rao-Cramér lower bound is 2θ²/n. Now nS²/θ is χ²(n − 1), so the variance of nS²/θ is 2(n − 1). Accordingly, the variance of nS²/(n − 1) is 2(n − 1)[θ²/(n − 1)²] = 2θ²/(n − 1). Thus the efficiency of the estimator nS²/(n − 1) is (n − 1)/n.
Example 3. Let X1, X2, ..., Xn denote a random sample of size n > 2 from a distribution with p.d.f.

f(x; θ) = θx^{θ−1} = exp(θ ln x − ln x + ln θ),  0 < x < 1,
        = 0 elsewhere.

It is easy to verify that the Rao-Cramér lower bound is θ²/n. Let Yi = −ln Xi. We shall indicate that each Yi has a gamma distribution. The associated transformation yi = −ln xi, with inverse xi = e^{−yi}, is one-to-one and the transformation maps the space {xi; 0 < xi < 1} onto the space {yi; 0 < yi < ∞}. We have |J| = e^{−yi}. Thus Yi has a gamma distribution with α = 1 and β = 1/θ. Let Z = −Σ_{i=1}^n ln Xi. Then Z has a gamma distribution with α = n and β = 1/θ. Accordingly, we have E(Z) = αβ = n/θ. This suggests that we compute the expectation of 1/Z to see if we can find an unbiased estimator of θ. A simple integration shows that E(1/Z) = θ/(n − 1). Hence (n − 1)/Z is an unbiased estimator of θ. With n > 2, the variance of (n − 1)/Z exists and is found to be θ²/(n − 2), so that the efficiency of (n − 1)/Z is (n − 2)/n. This efficiency tends to 1 as n increases. In such an instance, the estimator is said to be asymptotically efficient.
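As a numerical illustration of Example 3 (a sketch added here, not in the original text; the values θ = 2 and n = 10 are arbitrary), one can simulate (n − 1)/Z and compare its variance with θ²/(n − 2) and the Rao-Cramér bound θ²/n.

```python
import numpy as np

# Simulate (n-1)/Z with Z = -sum(log X_i), X_i ~ f(x; theta) = theta*x^(theta-1).
rng = np.random.default_rng(2)
theta, n, reps = 2.0, 10, 200_000

# X with density theta*x^(theta-1) on (0,1) is U^(1/theta) for U uniform(0,1).
x = rng.uniform(size=(reps, n)) ** (1.0 / theta)
z = -np.log(x).sum(axis=1)
est = (n - 1) / z

print("mean of (n-1)/Z:", est.mean())              # close to theta
print("variance of (n-1)/Z:", est.var())           # close to theta^2/(n-2)
print("theta^2/(n-2):", theta**2 / (n - 2))
print("efficiency (bound / variance):", (theta**2 / n) / est.var())
```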
The concept of joint efficient estimators of several parameters has
been developed along with the associated concept of joint efficiency of
several estimators. But limitations of space prevent their inclusion in
this book.
EXERCISES

11.1. Prove that X̄, the mean of a random sample of size n from a distribution that is n(θ, σ²), −∞ < θ < ∞, is, for every known σ² > 0, an efficient estimator of θ.

11.2. Show that the mean X̄ of a random sample of size n from a distribution which is b(1, θ), 0 < θ < 1, is an efficient estimator of θ.

11.3. Given f(x; θ) = 1/θ, 0 < x < θ, zero elsewhere, with θ > 0, formally compute the reciprocal of

n E{[∂ ln f(X; θ)/∂θ]²}.

Compare this with the variance of (n + 1)Yn/n, where Yn is the largest item of a random sample of size n from this distribution. Comment.

11.4. Given the p.d.f.

f(x; θ) = 1/{π[1 + (x − θ)²]},  −∞ < x < ∞, −∞ < θ < ∞.

Show that the Rao-Cramér lower bound is 2/n, where n is the size of a random sample from this Cauchy distribution.

11.5. Show, with appropriate assumptions, that

E{[∂ ln f(X; θ)/∂θ]²} = −E[∂² ln f(X; θ)/∂θ²].

Hint. Differentiate with respect to θ the first equation in display (2) of this section,

0 = ∫_{−∞}^{∞} [∂ ln f(x; θ)/∂θ] f(x; θ) dx.

11.2 The Sequential Probability Ratio Test

In Section 7.2 we proved a theorem that provided us with a method for determining a best critical region for testing a simple hypothesis against an alternative simple hypothesis. The theorem was as follows. Let X1, X2, ..., Xn be a random sample with fixed sample size n from a distribution that has p.d.f. f(x; θ), where θ ∈ {θ; θ = θ', θ''} and θ' and θ'' are known numbers. Let the joint p.d.f. of X1, X2, ..., Xn be denoted by

L(θ, n) = f(x1; θ) f(x2; θ) ··· f(xn; θ),

a notation that reveals both the parameter θ and the sample size n. If we reject H0: θ = θ' and accept H1: θ = θ'' when and only when

L(θ', n)/L(θ'', n) ≤ k,

where k > 0, then this is a best test of H0 against H1.

Let us now suppose that the sample size n is not fixed in advance. In fact, let the sample size be a random variable N with sample space {n; n = 1, 2, 3, ...}. An interesting procedure for testing the simple hypothesis H0: θ = θ' against the simple hypothesis H1: θ = θ'' is the following. Let k0 and k1 be two positive constants with k0 < k1. Observe the mutually stochastically independent outcomes X1, X2, X3, ... in sequence, say x1, x2, x3, ..., and compute

L(θ', 1)/L(θ'', 1),  L(θ', 2)/L(θ'', 2),  L(θ', 3)/L(θ'', 3),  ....

The hypothesis H0: θ = θ' is rejected (and H1: θ = θ'' is accepted) if and only if there exists a positive integer n so that (x1, x2, ..., xn) belongs to the set

Cn = {(x1, ..., xn); k0 < L(θ', j)/L(θ'', j) < k1, j = 1, ..., n − 1, and L(θ', n)/L(θ'', n) ≤ k0}.

On the other hand, the hypothesis H0: θ = θ' is accepted (and H1: θ = θ'' is rejected) if and only if there exists a positive integer n so that (x1, x2, ..., xn) belongs to the set

Bn = {(x1, ..., xn); k0 < L(θ', j)/L(θ'', j) < k1, j = 1, ..., n − 1, and L(θ', n)/L(θ'', n) ≥ k1}.

That is, we continue to observe sample items as long as

(1)  k0 < L(θ', n)/L(θ'', n) < k1.
We stop these observations in one of two ways:

(a) With rejection of H0: θ = θ' as soon as

L(θ', n)/L(θ'', n) ≤ k0,

or

(b) with acceptance of H0: θ = θ' as soon as

L(θ', n)/L(θ'', n) ≥ k1.

A test of this kind is called a sequential probability ratio test. Now, frequently inequality (1) can be conveniently expressed in an equivalent form

c0(n) < u(x1, x2, ..., xn) < c1(n),

where u(X1, X2, ..., Xn) is a statistic and c0(n) and c1(n) depend on the constants k0, k1, θ', θ'', and on n. Then the observations are stopped and a decision is reached as soon as either u(x1, x2, ..., xn) ≤ c0(n) or u(x1, x2, ..., xn) ≥ c1(n). We now give an illustrative example.

Example 1. Let X have a p.d.f.

f(x; θ) = θ^x (1 − θ)^{1−x},  x = 0, 1,
        = 0 elsewhere.

In the preceding discussion of a sequential probability ratio test, let H0: θ = 1/3 and H1: θ = 2/3; then, with Σxi = Σ_{i=1}^n xi,

L(1/3, n)/L(2/3, n) = [(1/3)^{Σxi}(2/3)^{n−Σxi}] / [(2/3)^{Σxi}(1/3)^{n−Σxi}] = 2^{n−2Σxi}.

If we take logarithms to the base 2, the inequality

k0 < L(1/3, n)/L(2/3, n) < k1,

with 0 < k0 < k1, becomes

log2 k0 < n − 2 Σ_{i=1}^n xi < log2 k1,

or, equivalently,

c0(n) = n/2 − (1/2) log2 k1 < Σ_{i=1}^n xi < n/2 − (1/2) log2 k0 = c1(n).

Note that L(1/3, n)/L(2/3, n) ≤ k0 if and only if c1(n) ≤ Σ_{i=1}^n xi; and L(1/3, n)/L(2/3, n) ≥ k1 if and only if c0(n) ≥ Σ_{i=1}^n xi. Thus we continue to observe outcomes as long as c0(n) < Σ_{i=1}^n xi < c1(n). The observation of outcomes is discontinued with the first value n of N for which either c1(n) ≤ Σ_{i=1}^n xi or c0(n) ≥ Σ_{i=1}^n xi. The inequality c1(n) ≤ Σ_{i=1}^n xi leads to the rejection of H0: θ = 1/3 (the acceptance of H1), and the inequality c0(n) ≥ Σ_{i=1}^n xi leads to the acceptance of H0: θ = 1/3 (the rejection of H1).

Remarks. At this point, the reader undoubtedly sees that there are many questions which should be raised in connection with the sequential probability ratio test. Some of these questions are possibly among the following:
(a) What is the probability of the procedure continuing indefinitely?
(b) What is the value of the power function of this test at each of the points θ = θ' and θ = θ''?
(c) If θ'' is one of several values of θ specified by an alternative composite hypothesis, say H1: θ > θ', what is the power function at each point θ ≥ θ'?
(d) Since the sample size N is a random variable, what are some of the properties of the distribution of N? In particular, what is the expected value E(N) of N?
(e) How does this test compare with tests that have a fixed sample size n?

A course in sequential analysis would investigate these and many other problems. However, in this book our objective is largely that of acquainting the reader with this kind of test procedure. Accordingly, we assert that the answer to question (a) is zero. Moreover, it can be proved that if θ = θ' or if θ = θ'', E(N) is smaller, for this sequential procedure, than the sample size of a fixed-sample-size test which has the same values of the power function at those points. We now consider question (b) in some detail.
In this section we shall denote the power of the test when H0 is true by the symbol α and the power of the test when H1 is true by the symbol 1 − β. Thus α is the probability of committing a type I error (the rejection of H0 when H0 is true), and β is the probability of committing a type II error (the acceptance of H0 when H0 is false). With the sets Cn and Bn as previously defined, and with random variables of the continuous type, we then have

α = Σ_{n=1}^∞ ∫_{Cn} L(θ', n),    1 − β = Σ_{n=1}^∞ ∫_{Cn} L(θ'', n).
Since the probability is 1 that the procedure will terminate, we also have

1 − α = Σ_{n=1}^∞ ∫_{Bn} L(θ', n),    β = Σ_{n=1}^∞ ∫_{Bn} L(θ'', n).

If (x1, x2, ..., xn) ∈ Cn, we have L(θ', n) ≤ k0 L(θ'', n); hence it is clear that

α = Σ_{n=1}^∞ ∫_{Cn} L(θ', n) ≤ k0 Σ_{n=1}^∞ ∫_{Cn} L(θ'', n) = k0(1 − β).

Because L(θ', n) ≥ k1 L(θ'', n) at each point of the set Bn, we have

1 − α = Σ_{n=1}^∞ ∫_{Bn} L(θ', n) ≥ k1 Σ_{n=1}^∞ ∫_{Bn} L(θ'', n) = k1 β.

Accordingly, it follows that

(2)  α/(1 − β) ≤ k0,    k1 ≤ (1 − α)/β,

provided that β is not equal to zero or 1.

Now let αa and βa be preassigned proper fractions; some typical values in the applications are 0.01, 0.05, and 0.10. If we take

k0 = αa/(1 − βa),    k1 = (1 − αa)/βa,

then inequalities (2) become

(3)  α/(1 − β) ≤ αa/(1 − βa),    (1 − αa)/βa ≤ (1 − α)/β,

or, equivalently,

α(1 − βa) ≤ αa(1 − β),    β(1 − αa) ≤ βa(1 − α).

If we add corresponding members of the immediately preceding inequalities, we find that

α + β − αβa − βαa ≤ αa + βa − βαa − αβa

and hence

α + β ≤ αa + βa.

That is, the sum α + β of the probabilities of the two kinds of errors is bounded above by the sum αa + βa of the preassigned numbers. Moreover, since α and β are positive proper fractions, inequalities (3) imply that

α ≤ αa/(1 − βa),    β ≤ βa/(1 − αa);

consequently, we have an upper bound on each of α and β. Various investigations of the sequential probability ratio test seem to indicate that in most practical cases, the values of α and β are quite close to αa and βa. This prompts us to approximate the power function at the points θ = θ' and θ = θ'' by αa and 1 − βa, respectively.

Example 2. Let X be n(θ, 100). To find the sequential probability ratio test for testing H0: θ = 75 against H1: θ = 78 such that each of α and β is approximately equal to 0.10, take

k0 = 0.10/(1 − 0.10) = 1/9,    k1 = (1 − 0.10)/0.10 = 9.

Since

L(75, n)/L(78, n) = exp[−Σ(xi − 75)²/2(100)] / exp[−Σ(xi − 78)²/2(100)] = exp[−(6Σxi − 459n)/200],

the inequality

k0 = 1/9 < L(75, n)/L(78, n) < 9 = k1

can be rewritten, by taking logarithms, as

−ln 9 < −(6Σxi − 459n)/200 < ln 9.

This inequality is equivalent to the inequality

c0(n) = 153n/2 − (100/3) ln 9 < Σ_{i=1}^n xi < 153n/2 + (100/3) ln 9 = c1(n).

Moreover, L(75, n)/L(78, n) ≤ k0 and L(75, n)/L(78, n) ≥ k1 are equivalent to the inequalities Σ_{i=1}^n xi ≥ c1(n) and Σ_{i=1}^n xi ≤ c0(n), respectively. Thus the observation of outcomes is discontinued with the first value n of N for which either Σ_{i=1}^n xi ≥ c1(n) or Σ_{i=1}^n xi ≤ c0(n). The inequality Σ_{i=1}^n xi ≥ c1(n) leads to the rejection of H0: θ = 75, and the inequality Σ_{i=1}^n xi ≤ c0(n) leads to the acceptance of H0: θ = 75. The power of the test is approximately 0.10 when H0 is true, and approximately 0.90 when H1 is true.
Remark. It is interesting to note that a sequential probability ratio test
can be thought of as a random-walk procedure. For illustrations, the final
inequalities of Examples 1 and 2 can be rewritten as
−log2 k1 < Σ_{i=1}^n 2(xi − 0.5) < −log2 k0

and

−(100/3) ln 9 < Σ_{i=1}^n (xi − 76.5) < (100/3) ln 9,

respectively. In each instance we can think of starting at the point zero and taking random steps until one of the boundaries is reached. In the first situation the random steps are 2(X1 − 0.5), 2(X2 − 0.5), 2(X3 − 0.5), ... and hence are of the same length, 1, but with random directions. In the second instance, both the length and the direction of the steps are random variables, X1 − 76.5, X2 − 76.5, X3 − 76.5, ....
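The boundaries c0(n) and c1(n) of Example 2 translate directly into a stopping rule. The Python sketch below is an illustration added here, not part of the original text; the true mean used to generate data is an arbitrary choice. It draws observations one at a time from n(θ, 100) and stops as soon as Σxi leaves the interval (c0(n), c1(n)).

```python
import numpy as np

# Sequential probability ratio test of H0: theta = 75 vs H1: theta = 78
# for X ~ n(theta, 100), with alpha_a = beta_a = 0.10 (as in Example 2).
def sprt_normal(rng, true_theta, sigma2=100.0, max_n=10_000):
    ln9 = np.log(9.0)
    total, n = 0.0, 0
    while n < max_n:
        n += 1
        total += rng.normal(true_theta, np.sqrt(sigma2))
        c0 = 76.5 * n - (100.0 / 3.0) * ln9
        c1 = 76.5 * n + (100.0 / 3.0) * ln9
        if total <= c0:
            return "accept H0", n
        if total >= c1:
            return "reject H0", n
    return "no decision", n

rng = np.random.default_rng(3)
decision, n_used = sprt_normal(rng, true_theta=75.0)
print(decision, "after", n_used, "observations")
```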
EXERCISES
11.6. Let X be n(0, θ) and, in the notation of this section, let θ' = 4, θ'' = 9, αa = 0.05, and βa = 0.10. Show that the sequential probability ratio test can be based upon the statistic Σ_{i=1}^n Xi². Determine c0(n) and c1(n).

11.7. Let X have a Poisson distribution with mean θ. Find the sequential probability ratio test for testing H0: θ = 0.02 against H1: θ = 0.07. Show that this test can be based upon the statistic Σ_{i=1}^n Xi. If αa = 0.20 and βa = 0.10, find c0(n) and c1(n).

11.8. Let the stochastically independent random variables Y and Z be n(μ1, 1) and n(μ2, 1), respectively. Let θ = μ1 − μ2. Let us observe mutually stochastically independent items from each distribution, say Y1, Y2, ... and Z1, Z2, .... To test sequentially the hypothesis H0: θ = 0 against H1: θ = ½, use the sequence Xi = Yi − Zi, i = 1, 2, .... If αa = βa = 0.05, show that the test can be based upon X̄ = Ȳ − Z̄. Find c0(n) and c1(n).

11.3 Multiple Comparisons

Consider b mutually stochastically independent random variables that have normal distributions with unknown means μ1, μ2, ..., μb, respectively, and with unknown but common variance σ². Let k1, k2, ..., kb represent b known real constants that are not all zero. We want to find a confidence interval for Σ_{j=1}^b kj μj, a linear function of the means μ1, μ2, ..., μb. To do this, we take a random sample X1j, X2j, ..., Xaj of size a from the distribution n(μj, σ²), j = 1, 2, ..., b. If we denote Σ_{i=1}^a Xij/a by X̄.j, then we know that X̄.j is n(μj, σ²/a), that Σ_{i=1}^a (Xij − X̄.j)²/σ² is χ²(a − 1), and that the two random variables are stochastically independent. Since the random samples are taken from mutually independent distributions, the 2b random variables X̄.j, Σ_{i=1}^a (Xij − X̄.j)²/σ², j = 1, 2, ..., b, are mutually stochastically independent. Moreover, X̄.1, X̄.2, ..., X̄.b and

Σ_{j=1}^b Σ_{i=1}^a (Xij − X̄.j)²/σ²

are mutually stochastically independent and the latter is χ²[b(a − 1)].

Let Z = Σ_{j=1}^b kj X̄.j. Then Z is normal with mean Σ_{j=1}^b kj μj and variance (Σ_{j=1}^b kj²)σ²/a, and Z is stochastically independent of

V = [1/(b(a − 1))] Σ_{j=1}^b Σ_{i=1}^a (Xij − X̄.j)².

Hence the random variable

T = { [Σ_{j=1}^b kj X̄.j − Σ_{j=1}^b kj μj] / √[(Σ_{j=1}^b kj²)σ²/a] } / √(V/σ²)

has a t distribution with b(a − 1) degrees of freedom. A positive number c can be found in Table IV in Appendix B, for certain values of α, 0 < α < 1, such that Pr(−c ≤ T ≤ c) = 1 − α. It follows that the probability is 1 − α that

Σ_{j=1}^b kj X̄.j − c√[(Σ_{j=1}^b kj²)(V/a)] ≤ Σ_{j=1}^b kj μj ≤ Σ_{j=1}^b kj X̄.j + c√[(Σ_{j=1}^b kj²)(V/a)].
The experimental values of X̄.j, j = 1, 2, ..., b, and V will provide a 100(1 − α) per cent confidence interval for Σ_{j=1}^b kj μj.

It should be observed that the confidence interval for Σ_{j=1}^b kj μj depends upon the particular choice of k1, k2, ..., kb. It is conceivable
that we may be interested in more than one linear function of μ1, μ2, ..., μb, such as μ2 − μ1, μ3 − (μ1 + μ2)/2, or μ1 + ··· + μb. We can, of course, find for each Σ_{j=1}^b kj μj a random interval that has a preassigned probability of including that particular Σ_{j=1}^b kj μj. But how can we compute the probability that simultaneously these random intervals include their respective linear functions of μ1, μ2, ..., μb? The following procedure of multiple comparisons, due to Scheffé, is one solution to this problem.

The random variable

Σ_{j=1}^b (X̄.j − μj)² / (σ²/a)

is χ²(b) and, because it is a function of X̄.1, ..., X̄.b alone, it is stochastically independent of the random variable

V = [1/(b(a − 1))] Σ_{j=1}^b Σ_{i=1}^a (Xij − X̄.j)².

Hence the random variable

F = { [Σ_{j=1}^b (X̄.j − μj)² / (σ²/a)] / b } / (V/σ²)

has an F distribution with b and b(a − 1) degrees of freedom. From Table V in Appendix B, for certain values of α, we can find a constant d such that Pr(F ≤ d) = 1 − α or

Pr[ Σ_{j=1}^b (X̄.j − μj)² ≤ bd(V/a) ] = 1 − α.

Note that Σ_{j=1}^b (X̄.j − μj)² is the square of the distance, in b-dimensional space, from the point (μ1, μ2, ..., μb) to the random point (X̄.1, X̄.2, ..., X̄.b). Consider a space of dimension b and let (t1, t2, ..., tb) denote the coordinates of a point in that space. An equation of a hyperplane that passes through the point (μ1, μ2, ..., μb) is given by

(1)  k1(t1 − μ1) + k2(t2 − μ2) + ··· + kb(tb − μb) = 0,

where not all the real numbers kj, j = 1, 2, ..., b, are equal to zero. The square of the distance from this hyperplane to the point (t1 = X̄.1, t2 = X̄.2, ..., tb = X̄.b) is

(2)  [k1(X̄.1 − μ1) + k2(X̄.2 − μ2) + ··· + kb(X̄.b − μb)]² / (k1² + k2² + ··· + kb²).

From the geometry of the situation it follows that Σ_{j=1}^b (X̄.j − μj)² is equal to the maximum of expression (2) with respect to k1, k2, ..., kb. Thus the inequality Σ_{j=1}^b (X̄.j − μj)² ≤ (bd)(V/a) holds if and only if

(3)  [k1(X̄.1 − μ1) + ··· + kb(X̄.b − μb)]² / (k1² + ··· + kb²) ≤ bd(V/a)

for every real k1, k2, ..., kb, not all zero. Accordingly, these two equivalent events have the same probability, 1 − α. However, inequality (3) may be written in the form

|Σ_{j=1}^b kj X̄.j − Σ_{j=1}^b kj μj| ≤ √[ bd (Σ_{j=1}^b kj²)(V/a) ].

Thus the probability is 1 − α that simultaneously, for all real k1, k2, ..., kb, not all zero,

(4)  Σ_{j=1}^b kj X̄.j − √[bd(Σ_{j=1}^b kj²)(V/a)] ≤ Σ_{j=1}^b kj μj ≤ Σ_{j=1}^b kj X̄.j + √[bd(Σ_{j=1}^b kj²)(V/a)].

Denote by A the event where inequality (4) is true for all real k1, ..., kb, and denote by B the event where that inequality is true for a finite number of b-tuples (k1, ..., kb). If the event A occurs, certainly the event B occurs. Hence P(A) ≤ P(B). In the applications, one is often interested only in a finite number of linear functions Σ_{j=1}^b kj μj. Once the experimental values are available, we obtain from (4) a confidence interval for each of these linear functions. Since P(B) ≥ P(A) = 1 − α, we have a confidence coefficient of at least 100(1 − α) per cent that the linear functions are in these respective confidence intervals.

Remarks. If the sample sizes, say a1, a2, ..., ab, are unequal, inequality (4) becomes

(4')  Σ_{j=1}^b kj X̄.j − √[bd Σ_{j=1}^b (kj²/aj) V] ≤ Σ_{j=1}^b kj μj ≤ Σ_{j=1}^b kj X̄.j + √[bd Σ_{j=1}^b (kj²/aj) V],

where

V = Σ_{j=1}^b Σ_{i=1}^{aj} (Xij − X̄.j)² / Σ_{j=1}^b (aj − 1)

and d is selected from Table V with b and Σ_{j=1}^b (aj − 1) degrees of freedom. Inequality (4') reduces to inequality (4) when a1 = a2 = ··· = ab. Moreover, if we restrict our attention to linear functions of the form Σ_{j=1}^b kj μj with Σ_{j=1}^b kj = 0 (such linear functions are called contrasts), the radical in inequality (4') is replaced by

√[(b − 1)d Σ_{j=1}^b (kj²/aj) V],

where d is now found in Table V with b − 1 and Σ_{j=1}^b (aj − 1) degrees of freedom.
In these multiple comparisons, one often finds that the length of a confidence interval is much greater than the length of a 100(1 − α) per cent confidence interval for a particular linear function Σ_{j=1}^b kj μj. But this is to be expected because in one case the probability 1 − α applies to just one event, and in the other it applies to the simultaneous occurrence of many events. One reasonable way to reduce the length of these intervals is to take a larger value of α, say 0.25, instead of 0.05. After all, it is still a very strong statement to say that the probability is 0.75 that all these events occur.
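For equal sample sizes, inequality (4) is straightforward to evaluate from data. The sketch below is an illustration added here, not part of the original text: the data are invented, and the use of scipy's F quantile simply stands in for Table V.

```python
import numpy as np
from scipy.stats import f as f_dist

# Scheffe simultaneous 95% confidence intervals for linear functions
# sum_j k_j * mu_j, based on b samples of common size a (made-up data).
rng = np.random.default_rng(4)
a, b, alpha = 8, 4, 0.05
data = rng.normal(loc=[10.0, 12.0, 11.0, 15.0], scale=2.0, size=(a, b))

xbar = data.mean(axis=0)                         # X-bar.j
v = ((data - xbar) ** 2).sum() / (b * (a - 1))   # pooled variance V
d = f_dist.ppf(1 - alpha, b, b * (a - 1))        # Pr(F <= d) = 1 - alpha

contrasts = [np.array(k, dtype=float) for k in
             ([1, -1, 0, 0], [0, 0, 1, -1], [1, 1, -1, -1])]
for k in contrasts:
    center = k @ xbar
    radius = np.sqrt(b * d * (k @ k) * v / a)    # radical in inequality (4)
    print(k, "interval:", (center - radius, center + radius))
```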
EXERCISES
11.9. If A1, A2, ..., Ak are events, prove, by induction, Boole's inequality

P(A1 ∪ A2 ∪ ··· ∪ Ak) ≤ Σ_{i=1}^k P(Ai).

Then show that

P(A1* ∩ A2* ∩ ··· ∩ Ak*) ≥ 1 − Σ_{i=1}^k P(Ai),

where Ai* denotes the complement of Ai.

11.10. In the notation of this section, let (ki1, ki2, ..., kib), i = 1, 2, ..., m, represent a finite number of b-tuples. The problem is to find simultaneous confidence intervals for Σ_{j=1}^b kij μj, i = 1, 2, ..., m, by a method different from that of Scheffé. Define the random variable Ti by

Ti = [Σ_{j=1}^b kij X̄.j − Σ_{j=1}^b kij μj] / √[(Σ_{j=1}^b kij²)(V/a)],   i = 1, 2, ..., m.

(a) Let the event Ai be given by −ci ≤ Ti ≤ ci, i = 1, 2, ..., m. Find the random variables Ui and Wi such that Ui ≤ Σ_{j=1}^b kij μj ≤ Wi is equivalent to Ai.
(b) Select ci such that P(Ai) = 1 − α/m; that is, P(Ai*) = α/m. Use the results of Exercise 11.9 to determine a lower bound on the probability that simultaneously the random intervals (U1, W1), ..., (Um, Wm) include Σ_{j=1}^b k1j μj, ..., Σ_{j=1}^b kmj μj, respectively.
(c) Let a = 3, b = 6, and α = 0.05. Consider the linear functions μ1 − μ2, μ2 − μ3, μ3 − μ4, μ4 − (μ5 + μ6)/2, and (μ1 + μ2 + ··· + μ6)/6. Here m = 5. Show that the lengths of the confidence intervals given by the results of part (b) are shorter than the corresponding ones given by the method of Scheffé, as described in the text. If m becomes sufficiently large, however, this is not the case.
11.4 Classification
The problem of classification can be described as follows. An investi-
gator makes a number of measurements on an item and wants to place
it into one of several categories (or classify it). For convenience in our
discussion, we assume that only two measurements, say X and Y,
are made on the item to be classified. Moreover, let X and Y have a joint p.d.f. f(x, y; θ), where the parameter θ represents one or more parameters. In our simplification, suppose that there are only two possible joint distributions (categories) for X and Y, which are indexed by the parameter values θ' and θ'', respectively. In this case, the problem then reduces to one of observing X = x and Y = y and then testing the hypothesis θ = θ' against the hypothesis θ = θ'', with the classification of X and Y being in accord with which hypothesis is accepted. From the Neyman-Pearson theorem, we know that a best decision of this sort is of the form: If

f(x, y; θ') / f(x, y; θ'') ≤ k,

choose the distribution indexed by θ''; that is, we classify (x, y) as coming from the distribution indexed by θ''. Otherwise, choose the distribution indexed by θ'; that is, we classify (x, y) as coming from the distribution indexed by θ'.

In order to investigate an appropriate value of k, let us consider a Bayesian approach to the problem (see Section 6.6). We need a p.d.f. h(θ) of the parameter, which here is of the discrete type since the parameter space Ω consists of but two points θ' and θ''. So we have that h(θ') + h(θ'') = 1. Of course, the conditional p.d.f. g(θ|x, y) of the parameter, given X = x, Y = y, is proportional to the product of h(θ) and f(x, y; θ),

g(θ|x, y) ∝ h(θ) f(x, y; θ).

In particular, in this case,
g(θ|x, y) = h(θ)f(x, y; θ) / [h(θ')f(x, y; θ') + h(θ'')f(x, y; θ'')].

Let us introduce a loss function ℒ[θ, w(x, y)], where the decision function w(x, y) selects decision w = θ' or decision w = θ''. Because the pairs (θ = θ', w = θ') and (θ = θ'', w = θ'') represent correct decisions, we always take ℒ(θ', θ') = ℒ(θ'', θ'') = 0. On the other hand, positive values of the loss function should be assigned for incorrect decisions; that is, ℒ(θ', θ'') > 0 and ℒ(θ'', θ') > 0.

A Bayes' solution to the problem is defined to be such that the conditional expected value of the loss ℒ[θ, w(x, y)], given X = x, Y = y, is a minimum. If w = θ', this conditional expectation is

Σ_θ ℒ(θ, θ') g(θ|x, y) = ℒ(θ'', θ')h(θ'')f(x, y; θ'') / [h(θ')f(x, y; θ') + h(θ'')f(x, y; θ'')]

because ℒ(θ', θ') = 0; and if w = θ'', it is

Σ_θ ℒ(θ, θ'') g(θ|x, y) = ℒ(θ', θ'')h(θ')f(x, y; θ') / [h(θ')f(x, y; θ') + h(θ'')f(x, y; θ'')]

because ℒ(θ'', θ'') = 0. Accordingly, a Bayes' solution is one that decides w = θ'' if the latter ratio is less than or equal to the former; or, equivalently, if

ℒ(θ', θ'')h(θ')f(x, y; θ') ≤ ℒ(θ'', θ')h(θ'')f(x, y; θ'').

That is, the decision w = θ'' is made if

f(x, y; θ') / f(x, y; θ'') ≤ ℒ(θ'', θ')h(θ'') / [ℒ(θ', θ'')h(θ')] = k;

otherwise, the decision w = θ' is made. Hence, if prior probabilities h(θ') and h(θ'') and losses ℒ(θ = θ', w = θ'') and ℒ(θ = θ'', w = θ') can be assigned, the constant k of the Neyman-Pearson theorem can be found easily from this formula.

Example 1. Let (x, y) be an observation of the random pair (X, Y), which has a bivariate normal distribution with parameters μ1, μ2, σ1², σ2², and ρ. In Section 3.5 that joint p.d.f. is given by

f(x, y; μ1, μ2, σ1², σ2², ρ) = [1/(2πσ1σ2√(1 − ρ²))] e^{−q(x, y; μ1, μ2)/2},  −∞ < x < ∞, −∞ < y < ∞,

where σ1 > 0, σ2 > 0, −1 < ρ < 1, and

q(x, y; μ1, μ2) = [1/(1 − ρ²)] { [(x − μ1)/σ1]² − 2ρ[(x − μ1)/σ1][(y − μ2)/σ2] + [(y − μ2)/σ2]² }.

Assume that σ1², σ2², and ρ are known but that we do not know whether the respective means of (X, Y) are (μ1', μ2') or (μ1'', μ2''). The inequality

f(x, y; μ1', μ2', σ1², σ2², ρ) / f(x, y; μ1'', μ2'', σ1², σ2², ρ) ≤ k

is equivalent to

−½[q(x, y; μ1', μ2') − q(x, y; μ1'', μ2'')] ≤ ln k.

Moreover, it is clear that the difference in the left-hand member of this inequality does not contain terms involving x², xy, and y². In particular, this inequality is the same as one of the form

(1)  ax + by ≤ ln k + ½[q(0, 0; μ1', μ2') − q(0, 0; μ1'', μ2'')],

or, for brevity, ax + by ≤ c. That is, if this linear function of x and y in the left-hand member of inequality (1) is less than or equal to a certain constant, we would classify that (x, y) as coming from the bivariate normal distribution with means μ1'' and μ2''. Otherwise, we would classify (x, y) as arising from the bivariate normal distribution with means μ1' and μ2'. Of course, if the prior probabilities and losses are given, k and thus c can be found easily; this will be illustrated in Exercise 11.11.

Once the rule for classification is established, the statistician might be interested in the two probabilities of misclassifications using that rule. The first of these two is associated with the classification of (x, y) as arising from the distribution indexed by θ'' if, in fact, it comes from that indexed by θ'. The second misclassification is similar, but with the interchange of θ' and θ''. In the previous example, the probabilities of these respective misclassifications are

Pr(aX + bY ≤ c; μ1', μ2')   and   Pr(aX + bY > c; μ1'', μ2'').
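As a small numerical illustration of this rule (a sketch added here; the parameter values, priors, and losses are invented for the example and are not taken from the text), the following code computes k from the priors and losses and classifies a point (x, y) by comparing the likelihood ratio with k. The misclassification probabilities could then be found from the normal distribution of aX + bY derived in the discussion that follows.

```python
import numpy as np

def q(x, y, m1, m2, s1, s2, rho):
    # Quadratic form q(x, y; mu1, mu2) of the bivariate normal density.
    z1, z2 = (x - m1) / s1, (y - m2) / s2
    return (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)

def density(x, y, m1, m2, s1, s2, rho):
    norm = 2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2)
    return np.exp(-q(x, y, m1, m2, s1, s2, rho) / 2) / norm

# Known common sigma1, sigma2, rho; two candidate mean pairs (illustrative values).
s1, s2, rho = 1.0, 1.0, 0.5
means1, means2 = (0.0, 0.0), (1.0, 1.0)      # (mu1', mu2') and (mu1'', mu2'')

# k = L(theta'', theta') h(theta'') / [L(theta', theta'') h(theta')]
h1, h2 = 1 / 3, 2 / 3       # prior probabilities (invented)
loss12, loss21 = 4.0, 1.0   # losses for the two kinds of wrong decision (invented)
k = loss21 * h2 / (loss12 * h1)

x, y = 0.4, 0.7
ratio = density(x, y, *means1, s1, s2, rho) / density(x, y, *means2, s1, s2, rho)
label = "second distribution (theta'')" if ratio <= k else "first distribution (theta')"
print("likelihood ratio:", ratio, " k:", k, " classify as:", label)
```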
Fortunately, the distribution of Z = aX + bY is easy to determine, so each of these probabilities is easy to calculate. The moment-generating function of Z is

E(e^{tZ}) = E[e^{t(aX + bY)}] = E(e^{atX + btY}).

Hence in the joint moment-generating function of X and Y found in Section 3.5, simply replace t1 by at and t2 by bt, to obtain

exp[ (aμ1 + bμ2)t + (a²σ1² + 2abρσ1σ2 + b²σ2²)t²/2 ].

However, this is the moment-generating function of the normal distribution

n(aμ1 + bμ2, a²σ1² + 2abρσ1σ2 + b²σ2²).

With this information, it is easy to compute the probabilities of misclassifications, and this will also be demonstrated in Exercise 11.11.

One final remark must be made with respect to the use of the important classification rule established in Example 1. In most instances the parameter values μ1', μ2' and μ1'', μ2'' as well as σ1², σ2², and ρ are unknown. In such cases the statistician has usually observed a random sample (frequently called a training sample) from each of the two distributions. Let us say the samples have sizes n' and n'', respectively, with sample characteristics

x̄', ȳ', (sx')², (sy')², r'   and   x̄'', ȳ'', (sx'')², (sy'')², r''.

Accordingly, if in inequality (1) the parameters μ1', μ2', μ1'', μ2'', σ1², σ2², and ρσ1σ2 are replaced by the unbiased estimates

x̄', ȳ', x̄'', ȳ'',  [n'(sx')² + n''(sx'')²]/(n' + n'' − 2),  [n'(sy')² + n''(sy'')²]/(n' + n'' − 2),  [n'r'sx'sy' + n''r''sx''sy'']/(n' + n'' − 2),

the resulting expression in the left-hand member is frequently called Fisher's linear discriminant function. Since those parameters have been estimated, the distribution theory associated with aX + bY is not appropriate for Fisher's function. However, if n' and n'' are large, the distribution of aX + bY does provide an approximation.

Although we have considered only bivariate distributions in this section, the results can easily be extended to multivariate normal distributions after a study of Chapter 12.

EXERCISES

11.11. In Example 1 let μ1' = μ2' = 0, μ1'' = μ2'' = 1, σ1² = 1, σ2² = 1, and ρ = ½.
(a) Evaluate inequality (1) when the prior probabilities are h(μ1', μ2') = 1/3 and h(μ1'', μ2'') = 2/3 and the losses are ℒ[θ = (μ1', μ2'), w = (μ1'', μ2'')] = 4 and ℒ[θ = (μ1'', μ2''), w = (μ1', μ2')] = 1.
(b) Find the distribution of the linear function aX + bY that results from part (a).
(c) Compute Pr(aX + bY ≤ c; μ1' = μ2' = 0) and Pr(aX + bY > c; μ1'' = μ2'' = 1).

11.12. Let X and Y have the joint p.d.f.

f(x, y; θ1, θ2) = [1/(θ1θ2)] exp(−x/θ1 − y/θ2),  0 < x < ∞, 0 < y < ∞,

zero elsewhere, where 0 < θ1, 0 < θ2. An observation (x, y) arises from the joint distribution with parameters equal to either θ1' = 1, θ2' = 5 or θ1'' = 3, θ2'' = 2. Determine the form of the classification rule.

11.13. Let X and Y have a joint bivariate normal distribution. An observation (x, y) arises from the joint distribution with parameters equal to either

μ1' = μ2' = 0, (σ1²)' = (σ2²)' = 1, ρ' = ½

or

μ1'' = μ2'' = 1, (σ1²)'' = 4, (σ2²)'' = 9, ρ'' = ½.

Show that the classification rule involves a second-degree polynomial in x and y.

11.5 Sufficiency, Completeness, and Stochastic Independence
In Chapter 10 we noted that if we have a sufficient statistic Y1 for a parameter θ, θ ∈ Ω, then h(z|y1), the conditional p.d.f. of another statistic Z, given Y1 = y1, does not depend upon θ. If, moreover, Y1 and Z are stochastically independent, the p.d.f. g2(z) of Z is such that g2(z) = h(z|y1), and hence g2(z) must not depend upon θ either. So the stochastic independence of a statistic Z and the sufficient statistic Y1 for a parameter θ means that the distribution of Z does not depend upon θ ∈ Ω.

It is interesting to investigate a converse of that property. Suppose that the distribution of a statistic Z does not depend upon θ; then, are Z and the sufficient statistic Y1 for θ stochastically independent? To
begin our search for the answer, we know that the joint p.d.f. of Y1 and Z is g1(y1; θ)h(z|y1), where g1(y1; θ) and h(z|y1) represent the marginal p.d.f. of Y1 and the conditional p.d.f. of Z given Y1 = y1, respectively. Thus the marginal p.d.f. of Z is

g2(z) = ∫_{−∞}^{∞} g1(y1; θ)h(z|y1) dy1,

which, by hypothesis, does not depend upon θ. Because

∫_{−∞}^{∞} g2(z)g1(y1; θ) dy1 = g2(z),

it follows, by taking the difference of the last two integrals, that

(1)  ∫_{−∞}^{∞} [g2(z) − h(z|y1)] g1(y1; θ) dy1 = 0

for all θ ∈ Ω. Since Y1 is a sufficient statistic for θ, h(z|y1) does not depend upon θ. By assumption, g2(z) and hence g2(z) − h(z|y1) do not depend upon θ. Now if the family {g1(y1; θ); θ ∈ Ω} is complete, Equation (1) would require that

g2(z) − h(z|y1) = 0   or   g2(z) = h(z|y1).

That is, the joint p.d.f. of Y1 and Z must be equal to

g1(y1; θ)h(z|y1) = g1(y1; θ)g2(z).

Accordingly, Y1 and Z are stochastically independent, and we have proved the following theorem.

Theorem 1. Let X1, X2, ..., Xn denote a random sample from a distribution having a p.d.f. f(x; θ), θ ∈ Ω, where Ω is an interval set. Let Y1 = u1(X1, X2, ..., Xn) be a sufficient statistic for θ, and let the family {g1(y1; θ); θ ∈ Ω} of probability density functions of Y1 be complete. Let Z = u(X1, X2, ..., Xn) be any other statistic (not a function of Y1 alone). If the distribution of Z does not depend upon θ, then Z is stochastically independent of the sufficient statistic Y1.

In the discussion above, it is interesting to observe that if Y1 is a sufficient statistic for θ, then the stochastic independence of Y1 and Z implies that the distribution of Z does not depend upon θ whether {g1(y1; θ); θ ∈ Ω} is or is not complete. However, in the converse, to prove the stochastic independence from the fact that g2(z) does not depend upon θ, we definitely need the completeness. Accordingly, if we are dealing with situations in which we know that the family {g1(y1; θ); θ ∈ Ω} is complete (such as a regular case of the exponential class), we can say that the statistic Z is stochastically independent of the sufficient statistic Y1 if, and only if, the distribution of Z does not depend upon θ.

It should be remarked that the theorem (including the special formulation of it for regular cases of the exponential class) extends immediately to probability density functions that involve m parameters for which there exist m joint sufficient statistics. For example, let X1, X2, ..., Xn be a random sample from a distribution having the p.d.f. f(x; θ1, θ2) that represents a regular case of the exponential class such that there are two joint complete sufficient statistics for θ1 and θ2. Then any other statistic Z = u(X1, X2, ..., Xn) is stochastically independent of the joint complete sufficient statistics if and only if the distribution of Z does not depend upon θ1 or θ2.

We give an example of the theorem that provides an alternative proof of the stochastic independence of X̄ and S², the mean and the variance of a random sample of size n from a distribution that is n(μ, σ²). This proof is presented as if we did not know that nS²/σ² is χ²(n − 1) because that fact and the stochastic independence were established in the same argument (see Section 4.8).

Example 1. Let X1, X2, ..., Xn denote a random sample of size n from a distribution that is n(μ, σ²). We know that the mean X̄ of the sample is, for every known σ², a complete sufficient statistic for the parameter μ, −∞ < μ < ∞. Consider the statistic

S² = (1/n) Σ_{i=1}^n (Xi − X̄)²

and the one-to-one transformation defined by Wi = Xi − μ, i = 1, 2, ..., n. Since W̄ = X̄ − μ, we have that

S² = (1/n) Σ_{i=1}^n (Xi − X̄)² = (1/n) Σ_{i=1}^n (Wi − W̄)²;

moreover, each Wi is n(0, σ²), i = 1, 2, ..., n. That is, S² can be written as a function of the random variables W1, W2, ..., Wn, which have distributions that do not depend upon μ. Thus S² must have a distribution that does not depend upon μ; and hence, by the theorem, S² and X̄, the complete sufficient statistic for μ, are stochastically independent.

The technique that is used in Example 1 can be generalized to situations in which there is a complete sufficient statistic for a location parameter θ. Let X1, X2, ..., Xn be a random sample from a distribution that has a p.d.f. of the form f(x − θ), for every real θ; that is, θ is a location parameter. Let Y1 = u1(X1, X2, ..., Xn) be a complete sufficient statistic for θ. Moreover, let Z = u(X1, X2, ..., Xn) be another statistic such that
u(x1 + d, x2 + d, ..., xn + d) = u(x1, x2, ..., xn)

for all real d. The one-to-one transformation defined by Wi = Xi − θ, i = 1, 2, ..., n, requires that the joint p.d.f. of W1, W2, ..., Wn be

f(w1)f(w2) ··· f(wn),

which does not depend upon θ. In addition, we have, because of the special functional nature of u(x1, x2, ..., xn), that

Z = u(W1 + θ, W2 + θ, ..., Wn + θ) = u(W1, W2, ..., Wn)

is a function of W1, W2, ..., Wn alone (not of θ). Hence Z must have a distribution that does not depend upon θ and thus, by the theorem, is stochastically independent of Y1.

Example 2. Let X1, X2, ..., Xn be a random sample of size n from the distribution having p.d.f.

f(x; θ) = e^{−(x−θ)},  θ < x < ∞, −∞ < θ < ∞,
        = 0 elsewhere.

Here the p.d.f. is of the form f(x − θ), where f(x) = e^{−x}, 0 < x < ∞, zero elsewhere. Moreover, we know (Exercise 10.17, Section 10.3) that the first order statistic Y1 = min (Xi) is a complete sufficient statistic for θ. Hence Y1 must be stochastically independent of each statistic u(X1, X2, ..., Xn) enjoying the property that

u(x1 + d, x2 + d, ..., xn + d) = u(x1, x2, ..., xn)

for all real d. Illustrations of such statistics are S², the sample range, and

(1/n) Σ_{i=1}^n [Xi − min (Xi)].

There is a result on stochastic independence of a complete sufficient statistic for a scale parameter and another statistic that corresponds to that associated with a location parameter. Let X1, X2, ..., Xn be a random sample from a distribution that has a p.d.f. of the form (1/θ)f(x/θ), for all θ > 0; that is, θ is a scale parameter. Let Y1 = u1(X1, X2, ..., Xn) be a complete sufficient statistic for θ. Say Z = u(X1, X2, ..., Xn) is another statistic such that

u(cx1, cx2, ..., cxn) = u(x1, x2, ..., xn)

for all c > 0. The one-to-one transformation defined by Wi = Xi/θ, i = 1, 2, ..., n, requires the following: (a) that the joint p.d.f. of W1, W2, ..., Wn be equal to

f(w1)f(w2) ··· f(wn),

which does not depend upon θ; and (b) that the statistic Z be equal to

Z = u(θW1, θW2, ..., θWn) = u(W1, W2, ..., Wn).

Since neither the joint p.d.f. of W1, W2, ..., Wn nor Z contain θ, the distribution of Z must not depend upon θ and thus, by the theorem, Z is stochastically independent of the complete sufficient statistic Y1 for the parameter θ.

Example 3. Let X1 and X2 denote a random sample of size n = 2 from a distribution with p.d.f.

f(x; θ) = (1/θ)e^{−x/θ},  0 < x < ∞, 0 < θ < ∞,
        = 0 elsewhere.

The p.d.f. is of the form (1/θ)f(x/θ), where f(x) = e^{−x}, 0 < x < ∞, zero elsewhere. We know (Section 10.4) that Y1 = X1 + X2 is a complete sufficient statistic for θ. Hence Y1 is stochastically independent of every statistic u(X1, X2) with the property u(cx1, cx2) = u(x1, x2). Illustrations of these are X1/X2 and X1/(X1 + X2), statistics that have F and beta distributions, respectively.

Finally, the location and the scale parameters can be combined in a p.d.f. of the form (1/θ2)f[(x − θ1)/θ2], −∞ < θ1 < ∞, 0 < θ2 < ∞. Through a one-to-one transformation defined by Wi = (Xi − θ1)/θ2, i = 1, 2, ..., n, it is easy to show that a statistic Z = u(X1, X2, ..., Xn) such that

u(cx1 + d, cx2 + d, ..., cxn + d) = u(x1, x2, ..., xn)

for −∞ < d < ∞, 0 < c < ∞, has a distribution that does not depend upon θ1 and θ2. Thus, by the extension of the theorem, the joint complete sufficient statistics Y1 and Y2 for the parameters θ1 and θ2 are stochastically independent of Z.

Example 4. Let X1, X2, ..., Xn denote a random sample from a distribution that is n(θ1, θ2), −∞ < θ1 < ∞, 0 < θ2 < ∞. In Example 2, Section 10.6, it was proved that the mean X̄ and the variance S² of the sample are joint complete sufficient statistics for θ1 and θ2. Consider any statistic Z = u(X1, X2, ..., Xn) which satisfies the property that u(cx1 + d, ..., cxn + d) = u(x1, ..., xn). That is, Z is stochastically independent of both X̄ and S².
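The independence asserted in Example 3 can be seen in a quick simulation (a sketch added for illustration; the exponential mean and number of replications are arbitrary): the correlation between X1/(X1 + X2) and the complete sufficient statistic X1 + X2 is essentially zero, and the behavior of the ratio does not change when we condition on small or large values of the sum.

```python
import numpy as np

# Example 3: for X1, X2 from an exponential scale family, the ratio
# X1/(X1 + X2) is stochastically independent of Y1 = X1 + X2.
rng = np.random.default_rng(6)
theta, reps = 3.0, 200_000
x1, x2 = rng.exponential(theta, reps), rng.exponential(theta, reps)
ratio, total = x1 / (x1 + x2), x1 + x2

print("correlation:", np.corrcoef(ratio, total)[0, 1])      # near 0
small, large = total < np.median(total), total >= np.median(total)
print("mean ratio given small sum:", ratio[small].mean())   # both near 1/2
print("mean ratio given large sum:", ratio[large].mean())
```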
Let n(θ1, θ3) and n(θ2, θ4) denote two independent normal distributions. Recall that in Example 2, Section 7.4, a statistic, which was denoted by T, was used to test the hypothesis that θ1 = θ2, provided the unknown variances θ3 and θ4 were equal. The hypothesis that θ1 = θ2 is rejected if the computed |T| ≥ c, where the constant c is selected so that α2 = Pr(|T| ≥ c; θ1 = θ2, θ3 = θ4) is the assigned significance level of the test. We shall show that, if θ3 = θ4, F of Example 3, Section 7.4, and T are stochastically independent. Among other things, this means that if these two tests are performed sequentially, with respective significance levels α1 and α2, the probability of accepting both these hypotheses, when they are true, is (1 − α1)(1 − α2). Thus the significance level of this joint test is α = 1 − (1 − α1)(1 − α2).

The stochastic independence of F and T, when θ3 = θ4, can be established by an appeal to sufficiency and completeness. The three statistics X̄, Ȳ, and Σ_{i=1}^n (Xi − X̄)² + Σ_{i=1}^m (Yi − Ȳ)² are joint complete sufficient statistics for the three parameters θ1, θ2, and θ3 = θ4. Obviously, the distribution of F does not depend upon θ1, θ2, and θ3 = θ4, and hence F is stochastically independent of the three joint complete sufficient statistics. However, T is a function of these three joint complete sufficient statistics alone, and, accordingly, T is stochastically independent of F. It is important to note that these two statistics are stochastically independent whether θ1 = θ2 or θ1 ≠ θ2, that is, whether T is or is not central. This permits us to calculate probabilities other than the significance level of the test. For example, if θ3 = θ4 and θ1 ≠ θ2, then

Pr(c1 < F < c2, |T| ≥ c; θ3 = θ4) = Pr(c1 < F < c2; θ3 = θ4) Pr(|T| ≥ c; θ3 = θ4).

The second factor in the right-hand member is evaluated by using the probabilities for a noncentral t distribution. Of course, if θ3 = θ4 and the difference θ1 − θ2 is large, we would want the preceding probability to be close to 1 because the event {c1 < F < c2, |T| ≥ c} leads to a correct decision, namely accept θ3 = θ4 and reject θ1 = θ2.
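A quick simulation makes this independence plausible (a sketch added for illustration; the sample sizes, means, and common variance are arbitrary choices): with equal variances, the sample correlation between the variance-ratio F statistic and the two-sample T statistic is essentially zero even when the means differ.

```python
import numpy as np

# Simulate the two-sample T and the variance-ratio F under equal variances
# (theta3 = theta4) but unequal means, and check that they are uncorrelated.
rng = np.random.default_rng(8)
n, m, reps = 15, 10, 100_000
mu_x, mu_y, sigma = 0.0, 1.0, 2.0      # theta1 != theta2, theta3 = theta4

x = rng.normal(mu_x, sigma, size=(reps, n))
y = rng.normal(mu_y, sigma, size=(reps, m))

sx2 = x.var(axis=1, ddof=1)
sy2 = y.var(axis=1, ddof=1)
f_stat = sx2 / sy2

sp2 = ((n - 1) * sx2 + (m - 1) * sy2) / (n + m - 2)
t_stat = (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(sp2 * (1 / n + 1 / m))

print("corr(F, T):", np.corrcoef(f_stat, t_stat)[0, 1])  # near 0
```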
EXERCISES

11.14. Let Y1 < Y2 < Y3 < Y4 denote the order statistics of a random sample of size n = 4 from a distribution having p.d.f. f(x; θ) = 1/θ, 0 < x < θ, zero elsewhere, where 0 < θ < ∞. Argue that the complete sufficient statistic Y4 for θ is stochastically independent of each of the statistics Y1/Y4 and (Y1 + Y2)/(Y3 + Y4). Hint. Show that the p.d.f. is of the form (1/θ)f(x/θ), where f(x) = 1, 0 < x < 1, zero elsewhere.

11.15. Let Y1 < Y2 < ··· < Yn be the order statistics of a random sample from the normal distribution n(θ, σ²), −∞ < θ < ∞. Show that the distribution of Z = Yn − Ȳ does not depend upon θ. Thus Ȳ = Σ_{i=1}^n Yi/n, a complete sufficient statistic for θ, is stochastically independent of Z.

11.16. Let X1, X2, ..., Xn be a random sample from the normal distribution n(θ, σ²), −∞ < θ < ∞. Prove that a necessary and sufficient condition that the statistics Z = Σ_{i=1}^n ai Xi and Y = Σ_{i=1}^n Xi, a complete sufficient statistic for θ, be stochastically independent is that Σ_{i=1}^n ai = 0.

11.17. Let X and Y be random variables such that E(X^k) and E(Y^k) ≠ 0 exist for k = 1, 2, 3, .... If the ratio X/Y and its denominator Y are stochastically independent, prove that E[(X/Y)^k] = E(X^k)/E(Y^k), k = 1, 2, 3, .... Hint. Write E(X^k) = E[Y^k (X/Y)^k].

11.18. Let Y1 < Y2 < ··· < Yn be the order statistics of a random sample of size n from a distribution that has p.d.f. f(x; θ) = (1/θ)e^{−x/θ}, 0 < x < ∞, 0 < θ < ∞, zero elsewhere. Show that the ratio R = nY1/Σ_{i=1}^n Yi and its denominator (a complete sufficient statistic for θ) are stochastically independent. Use the result of the preceding exercise to determine E(R^k), k = 1, 2, 3, ....

11.19. Let X1, X2, ..., X5 be a random sample of size 5 from the distribution that has p.d.f. f(x) = e^{−x}, 0 < x < ∞, zero elsewhere. Show that (X1 + X2)/(X1 + X2 + ··· + X5) and its denominator are stochastically independent. Hint. The p.d.f. f(x) is a member of {f(x; θ); 0 < θ < ∞}, where f(x; θ) = (1/θ)e^{−x/θ}, 0 < x < ∞, zero elsewhere.

11.20. Let Y1 < Y2 < ··· < Yn be the order statistics of a random sample from the normal distribution n(θ1, θ2), −∞ < θ1 < ∞, 0 < θ2 < ∞. Show that the joint complete sufficient statistics X̄ = Ȳ and S² for θ1 and θ2 are stochastically independent of each of (Yn − Ȳ)/S and (Yn − Y1)/S.

11.21. Let Y1 < Y2 < ··· < Yn be the order statistics of a random sample from a distribution with the p.d.f.

f(x; θ1, θ2) = (1/θ2) e^{−(x−θ1)/θ2},  θ1 < x < ∞, zero elsewhere,

where −∞ < θ1 < ∞, 0 < θ2 < ∞. Show that the joint complete sufficient statistics Y1 and X̄ = Ȳ for θ1 and θ2 are stochastically independent of (Y2 − Y1)/Σ_{i=1}^n (Yi − Y1).
11.6 Robust Nonparametric Methods
Frequently, an investigator is tempted to evaluate several test statistics associated with a single hypothesis and then use the one statistic that best supports his or her position, usually rejection. Obviously, this type of procedure changes the actual significance level of the test from the nominal α that is used. However, there is a way in which the investigator can first look at the data and then select a test statistic without changing this significance level. For illustration, suppose there are three possible test statistics W1, W2, W3 of the hypothesis H0 with respective critical regions C1, C2, C3 such that Pr(Wi ∈ Ci; H0) = α, i = 1, 2, 3. Moreover, suppose that a statistic Q, based upon the same data, selects one and only one of the statistics W1, W2, W3, and that W is then used to test H0. For example, we choose to use the test statistic Wi if Q ∈ Di, i = 1, 2, 3, where the events defined by D1, D2, and D3 are mutually exclusive and exhaustive. Now if Q and each Wi are stochastically independent when H0 is true, then the probability of rejection, using the entire procedure (selecting and testing), is, under H0,

Pr(Q ∈ D1, W1 ∈ C1) + Pr(Q ∈ D2, W2 ∈ C2) + Pr(Q ∈ D3, W3 ∈ C3)
  = Pr(Q ∈ D1) Pr(W1 ∈ C1) + Pr(Q ∈ D2) Pr(W2 ∈ C2) + Pr(Q ∈ D3) Pr(W3 ∈ C3)
  = α[Pr(Q ∈ D1) + Pr(Q ∈ D2) + Pr(Q ∈ D3)] = α.

That is, the procedure of selecting Wi using a stochastically independent statistic Q and then constructing a test of significance level α with the statistic Wi has overall significance level α.
Of course, the important element in this procedure is the ability to be able to find a selector Q that is independent of each test statistic W. This can frequently be done by using the fact that the complete sufficient statistics for the parameters, given by H0, are stochastically independent of every statistic whose distribution is free of those parameters (see Section 11.5). For illustration, if random samples of sizes m and n arise from two independent normal distributions with respective means μ1 and μ2 and common variance σ², then the complete sufficient statistics X̄, Ȳ, and

V = Σ_{i=1}^m (Xi − X̄)² + Σ_{i=1}^n (Yi − Ȳ)²

for μ1, μ2, and σ² are stochastically independent of every statistic whose distribution is free of μ1, μ2, and σ².

Thus, in general, we would hope to be able to find a selector Q that is a function of the complete sufficient statistics for the parameters, under H0, so that it is independent of the test statistics.
It is particularly interesting to note that it is relatively easy to use this technique in nonparametric methods by using the independence result based upon complete sufficient statistics for parameters. How can we use an argument depending on parameters in nonparametric methods? Although this does sound strange, it is due to the unfortunate choice of a name in describing this broad area of nonparametric methods. Most statisticians would prefer to describe the subject as being distribution-free, since the test statistics have distributions that do not depend on the underlying distribution of the continuous type, described by either the distribution function F or the p.d.f. f. In addition, the latter name provides the clue for our application here because we have many test statistics whose distributions are free of the unknown (infinite vector) "parameter" F (or f). We now must find complete sufficient statistics for the distribution function F of the continuous type. In many instances, this is easy to do.

In Exercise 10.37, Section 10.6, it is shown that the order statistics Y1 < Y2 < ··· < Yn of a random sample of size n from a distribution of the continuous type with p.d.f. F'(x) = f(x) are sufficient statistics for the "parameter" f (or F). Moreover, if the family of distributions contains all probability density functions of the continuous type, the family of joint probability density functions of Y1, Y2, ..., Yn is also complete. We accept this latter fact without proof, as it is beyond the level of this text; but doing so, we can now say that the order statistics Y1, Y2, ..., Yn are complete sufficient statistics for the parameter f (or F).

Accordingly, our selector Q will be based upon those complete sufficient statistics, the order statistics under H0. This allows us to independently choose a distribution-free test appropriate for this type of underlying distribution, and thus increase our power. Although it is well known that distribution-free tests hold the significance level α for all underlying distributions of the continuous type, they have often
been criticized because their powers are sometimes low. The independent selection of the distribution-free test to be used can help correct this. So selecting, or adapting the test to the data, provides a new dimension to nonparametric tests, which usually improves the power of the overall test.

A statistical test that maintains the significance level close to a desired significance level α for a wide variety of underlying distributions with good (not necessarily the best for any one type of distribution) power for all these distributions is described as being robust. As an illustration, the T used to test the equality of the means of two independent normal distributions (see Section 7.4) is quite robust provided that the underlying distributions are rather close to normal ones with common variance. However, if the class of distributions includes those that are not too close to normal ones, such as the Cauchy distribution, the test based upon T is not robust; the significance level is not maintained and the power of the T test is low with Cauchy distributions. As a matter of fact, the test based on the Mann-Whitney-Wilcoxon statistic (Section 9.6) is a much more robust test than that based upon T if the class of distributions is fairly wide (in particular, if long-tailed distributions such as the Cauchy are included).

An illustration of the adaptive distribution-free procedure that is robust is provided by considering a test of the equality of two independent distributions of the continuous type. From the discussion in Section 9.8, we know that we could construct many linear rank statistics by changing the scoring function. However, we concentrate on three such statistics mentioned explicitly in that section: that based on normal scores, say L1; that of Mann-Whitney-Wilcoxon, say L2; and that of the median test, say L3. Moreover, respective critical regions C1, C2, and C3 are selected so that, under the equality of the two distributions, we have

α = Pr(L1 ∈ C1) = Pr(L2 ∈ C2) = Pr(L3 ∈ C3).

Of course, we would like to use the test given by L1 ∈ C1 if the tails of the distributions are like or shorter than those of the normal distributions. With distributions having somewhat longer tails, L2 ∈ C2 provides an excellent test. And with distributions having very long tails, the test based on L3 ∈ C3 is quite satisfactory.

In order to select the appropriate test in an independent manner we let V1 < V2 < ··· < VN, where N = m + n, be the order statistics of the combined sample, which is of size N. Recall that if the two independent distributions are equal and have the same distribution function F, these order statistics are the complete sufficient statistics for the parameter F. Hence every statistic based on V1, V2, ..., VN is stochastically independent of L1, L2, and L3, since the latter statistics have distributions that do not depend upon F. In particular, the kurtosis (Exercise 1.98, Section 1.10) of the combined sample,

K = [ (1/N) Σ_{i=1}^N (Vi − V̄)^4 ] / [ (1/N) Σ_{i=1}^N (Vi − V̄)² ]²,

is stochastically independent of L1, L2, and L3. From Exercise 3.56, Section 3.4, we know that the kurtosis of the normal distribution is 3; hence if the two distributions were equal and normal, we would expect K to be about 3. Of course, a longer-tailed distribution has a bigger kurtosis. Thus one simple way of defining the independent selection procedure would be by letting

D1 = {k; k ≤ 3},    D2 = {k; 3 < k ≤ 8},    D3 = {k; 8 < k}.

These choices are not necessarily the best way of selecting the appropriate test, but they are reasonable and illustrative of the adaptive procedure. From the stochastic independence of K and (L1, L2, L3), we know that the overall test has significance level α. Since a more appropriate test has been selected, the power will be relatively good throughout a wide range of distributions. Accordingly, this distribution-free adaptive test is robust.

EXERCISES

11.22. Let F(x) be a distribution function of a distribution of the continuous type which is symmetric about its median ξ. We wish to test H0: ξ = 0 against H1: ξ > 0. Use the fact that the 2n values, Xi and −Xi, i = 1, 2, ..., n, after ordering, are complete sufficient statistics for F, provided that H0 is true. Then construct an adaptive distribution-free test based upon Wilcoxon's statistic and two of its modifications given in Exercises 9.20 and 9.21.

11.23. Suppose that the hypothesis H0 concerns the stochastic independence of two random variables X and Y. That is, we wish to test H0: F(x, y) = F1(x)F2(y), where F, F1, and F2 are the respective joint and marginal distribution functions of the continuous type, against all alternatives. Let (X1, Y1), (X2, Y2), ..., (Xn, Yn) be a random sample from the joint distribution. Under H0, the order statistics of X1, X2, ..., Xn and the order statistics of Y1, Y2, ..., Yn are, respectively, complete sufficient statistics for F1 and F2. Use Spearman's statistic (Example 2, Section 9.8) and at least two modifications of it to create an adaptive distribution-free test of H0. Hint. Instead of ranks, use normal and median scores (Section 9.8) to obtain two additional correlation coefficients. The one associated with the median scores is frequently called the quadrant test.
11.7 Robust Estimation
In Examples 2 and 4, Section 6.1, the maximum likelihood estimator μ̂ = X̄ of the mean μ of the normal distribution n(μ, σ²) was found by minimizing a certain sum of squares,

Σ_{i=1}^n (xi − μ)².

Also, in the regression problem of Section 8.6, the maximum likelihood estimators α̂ and β̂ of the α and β in the mean α + β(ci − c̄) were determined by minimizing the sum of squares

Σ_{i=1}^n [yi − α − β(ci − c̄)]².

Both of these procedures come under the general heading of the method of least squares, because in each case a sum of squares is minimized. More generally, in the estimation of means of normal distributions, the method of least squares or some generalization of it is always used. The problems in the analyses of variance found in Chapter 8 are good illustrations of this fact. Hence, in this sense, normal assumptions and the method of least squares are mathematical companions.

It is interesting to note what procedures are obtained if we consider distributions that have longer tails than those of a normal distribution. For illustration, in Exercise 6.1(d), Section 6.1, the sample arises from a double exponential distribution with p.d.f.

f(x; θ) = ½ e^{−|x−θ|},  −∞ < x < ∞,
        = 0 elsewhere,

where −∞ < θ < ∞. The maximum likelihood estimator, θ̂ = median (Xi), is found by minimizing the sum of absolute values,

Σ_{i=1}^n |xi − θ|,

and hence this is illustrative of the method of least absolute values.

Possibly a more extreme case is the determination of the maximum likelihood estimator of the center θ of the Cauchy distribution with p.d.f.

f(x; θ) = 1/{π[1 + (x − θ)²]},  −∞ < x < ∞,

where −∞ < θ < ∞. The logarithm of the likelihood function of a random sample X1, X2, ..., Xn from this distribution is

ln L(θ) = −n ln π − Σ_{i=1}^n ln [1 + (xi − θ)²].

To maximize, we differentiate ln L(θ) to obtain

d ln L(θ)/dθ = Σ_{i=1}^n 2(xi − θ)/[1 + (xi − θ)²] = 0.

The solution of this equation cannot be found in closed form, but the equation can be solved by some iterative process (for example, Newton's method), of course checking that the approximate solution actually provides the maximum of L(θ), approximately.

The generalization of these three special cases is described as follows. Let X1, X2, ..., Xn be a random sample from a distribution with a p.d.f. of the form f(x − θ), where θ is a location parameter such that −∞ < θ < ∞. Thus

ln L(θ) = Σ_{i=1}^n ln f(xi − θ) = −Σ_{i=1}^n ρ(xi − θ),

where ρ(x) = −ln f(x), and

d ln L(θ)/dθ = −Σ_{i=1}^n f'(xi − θ)/f(xi − θ) = Σ_{i=1}^n Ψ(xi − θ),

where ρ'(x) = Ψ(x). For the normal, double exponential, and Cauchy distributions, we have that these respective functions are

ρ(x) = ½ ln 2π + x²/2,        Ψ(x) = x;
ρ(x) = ln 2 + |x|,            Ψ(x) = −1, x < 0; = 1, 0 < x;
ρ(x) = ln π + ln (1 + x²),    Ψ(x) = 2x/(1 + x²).

Clearly, these functions are very different from one distribution to another; and hence the respective maximum likelihood estimators may differ greatly. Thus we would suspect that the maximum likelihood
estimator associated with one distribution would not necessarily be a
good estimator in another situation. This is true; for example, X̄ is a
very poor estimator of the median of a Cauchy distribution, as the
variance of X̄ does not even exist if the sample arises from a Cauchy
distribution. Intuitively, X̄ is not a good estimator with the Cauchy
distribution, because the very small or very large values (outliers) that
can arise from that distribution influence the mean X̄ of the sample too
much.
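For the Cauchy likelihood equation displayed earlier in this section, the
iterative solution mentioned there can be sketched in a few lines of Python
(the language and the sample values are ours and are chosen only for
illustration; they are not from the text):

    # Newton's method for d ln L/d(theta) = sum 2(x_i - theta)/[1 + (x_i - theta)^2] = 0.
    def score(theta, xs):
        # first derivative of ln L(theta)
        return sum(2*(x - theta) / (1 + (x - theta)**2) for x in xs)

    def score_prime(theta, xs):
        # second derivative of ln L(theta)
        return sum(2*((x - theta)**2 - 1) / (1 + (x - theta)**2)**2 for x in xs)

    xs = [0.2, -1.1, 0.8, 14.7, 0.5, 1.9, -0.3, 0.9]   # hypothetical observations
    theta = sorted(xs)[len(xs)//2]                      # start at the (upper) sample median
    for _ in range(20):                                 # Newton iterations
        theta = theta - score(theta, xs)/score_prime(theta, xs)
    # theta now approximates the maximum likelihood estimate of the center;
    # score_prime(theta, xs) < 0 indicates that a local maximum of ln L was found.

As the text warns, the iteration should always be followed by a check that
the limiting value actually maximizes L(θ).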
An estimator that is fairly good (small variance, say) for a wide
variety of distributions (not necessarily the best for any one of them)
is called a robust estimator. Also estimators associated with the solution
of the equation

\sum_{i=1}^{n} \Psi(x_i - \theta) = 0

are frequently called M-estimators (denoted by θ̂) because they can be
thought of as maximum likelihood estimators. So in finding a robust
M-estimator we must select a Ψ function which will provide an esti-
mator that is good for each distribution in the collection under con-
sideration. For certain theoretical reasons that we cannot explain at
this level, Huber suggested a Ψ function that is a combination of
those associated with the normal and double exponential distributions,

\Psi(x) = -k,  x < -k,
        = x,   -k \le x \le k,
        = k,   k < x.
In Exercise 11.25 the reader is asked to find the p.d.f. f(x) so that the
M-estimator associated with this Ψ function is the maximum likelihood
estimator of the location parameter θ in the p.d.f. f(x − θ).

With Huber's Ψ function, another problem arises. Note that if we
double (for illustration) each X₁, X₂, ..., Xₙ, estimators such as X̄ and
median(Xᵢ) also double. This is not at all true with the solution of the
equation

\sum_{i=1}^{n} \Psi(x_i - \theta) = 0,

where the Ψ function is that of Huber. One way to avoid this difficulty
is to solve another, but similar, equation instead,
(1)  \sum_{i=1}^{n} \Psi\!\left(\frac{x_i - \theta}{d}\right) = 0,

where d is a robust estimate of the scale. A popular d to use is

d = \text{median}\,|x_i - \text{median}(x_i)| / 0.6745.

The divisor 0.6745 is inserted in the definition of d because then the
expected value of the corresponding statistic D is about equal to σ, if
the sample arises from a normal distribution. That is, σ can be approxi-
mated by d under normal assumptions.
That scheme of selecting d also provides us with a clue for selecting
k. For if the sample actually arises from a normal distribution, we would
want most of the items x₁, x₂, ..., xₙ to satisfy the inequality

\frac{|x_i - \theta|}{d} \le k

because then

\Psi\!\left(\frac{x_i - \theta}{d}\right) = \frac{x_i - \theta}{d}.

That is, for illustration, if all the items satisfy this inequality, then
Equation (1) becomes

\sum_{i=1}^{n} \Psi\!\left(\frac{x_i - \theta}{d}\right) = \sum_{i=1}^{n} \frac{x_i - \theta}{d} = 0.

This has the solution x̄, which of course is most desirable with normal
distributions. Since d approximates σ, popular values of k to use are
1.5 and 2.0, because with those selections most normal variables would
satisfy the desired inequality.
Again an iterative process must usually be used to solve Equation
(1). One such scheme, Newton's method, is described. Let θ̂₁ be a first
estimate of θ, such as θ̂₁ = median(xᵢ). Approximate the left-hand
member of Equation (1) by the first two terms of Taylor's expansion
about θ̂₁ to obtain

\sum_{i=1}^{n} \Psi\!\left(\frac{x_i - \hat{\theta}_1}{d}\right) + (\theta - \hat{\theta}_1)\sum_{i=1}^{n} \Psi'\!\left(\frac{x_i - \hat{\theta}_1}{d}\right)\left(-\frac{1}{d}\right) = 0,

approximately. The solution of this provides a second estimate of θ,

\hat{\theta}_2 = \hat{\theta}_1 + \frac{d \sum_{i=1}^{n} \Psi\!\left(\dfrac{x_i - \hat{\theta}_1}{d}\right)}{\sum_{i=1}^{n} \Psi'\!\left(\dfrac{x_i - \hat{\theta}_1}{d}\right)},

which is called the one-step M-estimate of θ. If we use θ̂₂ in place of θ̂₁,
we obtain θ̂₃, the two-step M-estimate of θ. This process can continue
to obtain any desired degree of accuracy. With Huber's Ψ function,
the denominator of the second term,

\sum_{i=1}^{n} \Psi'\!\left(\frac{x_i - \hat{\theta}_1}{d}\right),
is particularly easy to compute because Ψ'(x) = 1, −k ≤ x ≤ k, and
zero elsewhere. Thus that denominator simply counts the number of
x₁, x₂, ..., xₙ such that |xᵢ − θ̂₁|/d ≤ k.
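As a concrete illustration of the one-step computation just described,
here is a minimal Python sketch using Huber's Ψ with k = 1.5; the data
and helper names are made up for the example and are not those of
Exercise 11.24:

    # One-step Huber M-estimate of location:
    #   theta2 = theta1 + d * sum Psi((x_i - theta1)/d) / #{i : |x_i - theta1|/d <= k},
    # where theta1 is the sample median and d is the scaled median absolute deviation.
    def median(v):
        s = sorted(v)
        m = len(s) // 2
        return s[m] if len(s) % 2 == 1 else 0.5*(s[m-1] + s[m])

    def huber_psi(t, k=1.5):
        return -k if t < -k else (k if t > k else t)

    xs = [3.1, 2.7, 2.9, 3.4, 9.8, 3.0, 2.6]            # hypothetical data with one outlier
    k = 1.5
    theta1 = median(xs)                                  # first estimate of theta
    d = median([abs(x - theta1) for x in xs]) / 0.6745   # robust estimate of scale
    num = sum(huber_psi((x - theta1)/d, k) for x in xs)
    den = sum(1 for x in xs if abs(x - theta1)/d <= k)   # Psi'(t) = 1 on [-k, k], zero elsewhere
    theta2 = theta1 + d * num / den                      # one-step M-estimate
    # Repeating the step with theta2 in place of theta1 gives the two-step estimate.

Because of the outlier, θ̂₂ stays close to the bulk of the data, whereas the
sample mean would be pulled toward the large observation.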
Although beyond the scope of this text, it can be shown, under very
general conditions with known σ = 1, that the limiting distribution of

\frac{\sqrt{n}(\hat{\theta} - \theta)}{\sqrt{E[\Psi^2(X - \theta)] / \{E[\Psi'(X - \theta)]\}^2}},

where θ̂ is the M-estimator associated with Ψ, is n(0, 1). In applications,
the denominator of this ratio can be approximated by the square root
of the corresponding sample quantity,

n \sum_{i=1}^{n} \Psi^2\!\left(\frac{x_i - \hat{\theta}}{d}\right) \Big/ \left[\sum_{i=1}^{n} \Psi'\!\left(\frac{x_i - \hat{\theta}}{d}\right)\right]^2.

Moreover, after this substitution has been made, it has been discovered
empirically that certain t-distributions approximate the distribution
of the ratio better than does n(0, 1).
These M-estimators can be extended to regression situations. In
general, they give excellent protection against outliers and bad data
points; yet these M-estimators perform almost as well as least-squares
estimators if the underlying distributions are actually normal.
EXERCISES
11.24. Compute the one-step M-estimate θ̂₂ using Huber's Ψ with k = 1.5
if n = 7 and the seven observations are 2.1, 5.2, 2.3, 1.4, 2.2, 2.3, and 1.6.
Here take θ̂₁ = 2.2, the median of the sample. Compare θ̂₂ with x̄.

11.25. Let the p.d.f. f(x) be such that the M-estimator associated with
Huber's Ψ function is a maximum likelihood estimator of the location
parameter in f(x − θ). Show that f(x) is of the form ce^{−ρ₁(x)}, where
ρ₁(x) = x²/2, |x| ≤ k, and ρ₁(x) = k|x| − k²/2, k < |x|.

11.26. Plot the Ψ functions associated with the normal, double ex-
ponential, and Cauchy distributions in addition to that of Huber. Why is the
M-estimator associated with the Ψ function of the Cauchy distribution
called a descending M-estimator?

11.27. Use the data in Exercise 11.24 to find the one-step descending
M-estimator θ̂₂ associated with Ψ(x) = sin(x/1.5), |x| ≤ 1.5π, zero elsewhere.
This was first proposed by D. F. Andrews. Compare this to x̄ and the one-step
M-estimator of Exercise 11.24.
Chapter 12
Further Normal Distribution Theory
12.1 The Multivariate Normal Distribution
We have studied in some detail normal distributions of one and of
two random variables. In this section, we shall investigate a joint
distribution of n random variables that will be called a multivariate
normal distribution. This investigation assumes that the student is
familiar with elementary matrix algebra, with real symmetric quadratic
forms, and with orthogonal transformations. Henceforth the expression
quadratic form means a quadratic form in a prescribed number of
variables whose matrix is real and symmetric. All symbols which
represent matrices will be set in boldface type.
Let A denote an n × n real symmetric matrix which is positive
definite. Let μ denote the n × 1 matrix such that μ', the transpose of
μ, is μ' = [μ₁, μ₂, ..., μₙ], where each μᵢ is a real constant. Finally, let
x denote the n × 1 matrix such that x' = [x₁, x₂, ..., xₙ]. We shall
show that if C is an appropriately chosen positive constant, the non-
negative function

f(x_1, x_2, \ldots, x_n) = C \exp\left[-\frac{(x - \mu)'A(x - \mu)}{2}\right],  -\infty < x_i < \infty,  i = 1, 2, \ldots, n,

is a joint p.d.f. of n random variables X₁, X₂, ..., Xₙ that are of the
continuous type. Thus we need to show that

(1)  \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n = 1.
Let t denote the n × 1 matrix such that t' = [t₁, t₂, ..., tₙ], where
t₁, t₂, ..., tₙ are arbitrary real numbers. We shall evaluate the integral

(2)  \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} C \exp\left[t'x - \frac{(x - \mu)'A(x - \mu)}{2}\right] dx_1 \cdots dx_n,

and then we shall subsequently set t₁ = t₂ = ··· = tₙ = 0, and thus
establish Equation (1). First we change the variables of integration in
integral (2) from x₁, x₂, ..., xₙ to y₁, y₂, ..., yₙ by writing x − μ = y,
where y' = [y₁, y₂, ..., yₙ]. The Jacobian of the transformation is one
and the n-dimensional x-space is mapped onto an n-dimensional
y-space, so that integral (2) may be written as

(3)  C \exp(t'\mu) \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left(t'y - \frac{y'Ay}{2}\right) dy_1 \cdots dy_n.

Because the real symmetric matrix A is positive definite, the n
characteristic numbers (proper values, latent roots, or eigenvalues)
a₁, a₂, ..., aₙ of A are positive. There exists an appropriately chosen
n × n real orthogonal matrix L (L' = L⁻¹, where L⁻¹ is the inverse of
L) such that

L'AL = \begin{bmatrix} a_1 & 0 & \cdots & 0 \\ 0 & a_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & a_n \end{bmatrix}

for a suitable ordering of a₁, a₂, ..., aₙ. We shall sometimes write
L'AL = diag[a₁, a₂, ..., aₙ]. In integral (3), we shall change the vari-
ables of integration from y₁, y₂, ..., yₙ to z₁, z₂, ..., zₙ by writing
y = Lz, where z' = [z₁, z₂, ..., zₙ]. The Jacobian of the transformation
is the determinant of the orthogonal matrix L. Since L'L = Iₙ, where
Iₙ is the unit matrix of order n, we have the determinant |L'L| = 1
and |L|² = 1. Thus the absolute value of the Jacobian is one. Moreover,
the n-dimensional y-space is mapped onto an n-dimensional z-space.
The integral (3) becomes

(4)  C \exp(t'\mu) \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left[t'Lz - \frac{z'(L'AL)z}{2}\right] dz_1 \cdots dz_n.

It is computationally convenient to write, momentarily, t'L = w',
where w' = [w₁, w₂, ..., wₙ]. Then

\exp[t'Lz] = \exp[w'z] = \exp\left(\sum_{i=1}^{n} w_i z_i\right).

Moreover,

\exp\left[-\frac{z'(L'AL)z}{2}\right] = \exp\left[-\sum_{i=1}^{n} \frac{a_i z_i^2}{2}\right].

Then integral (4) may be written as the product of n integrals in the
following manner:

(5)  C \exp(w'L'\mu) \prod_{i=1}^{n}\left[\int_{-\infty}^{\infty} \exp\left(w_i z_i - \frac{a_i z_i^2}{2}\right) dz_i\right]
    = C \exp(w'L'\mu) \prod_{i=1}^{n}\left[\sqrt{\frac{2\pi}{a_i}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi/a_i}} \exp\left(w_i z_i - \frac{a_i z_i^2}{2}\right) dz_i\right].

The integral that involves zᵢ can be treated as the moment-generating
function, with the more familiar symbol t replaced by wᵢ, of a distribu-
tion which is n(0, 1/aᵢ). Thus the right-hand member of Equation (5) is
equal to

(6)  C \exp(w'L'\mu) \prod_{i=1}^{n}\left[\sqrt{\frac{2\pi}{a_i}} \exp\left(\frac{w_i^2}{2a_i}\right)\right]
    = C \exp(w'L'\mu)\sqrt{\frac{(2\pi)^n}{a_1 a_2 \cdots a_n}} \exp\left(\sum_{i=1}^{n} \frac{w_i^2}{2a_i}\right).

Now, because L⁻¹ = L', we have

\sum_{i=1}^{n} \frac{w_i^2}{a_i} = w'(L'A^{-1}L)w = (Lw)'A^{-1}(Lw) = t'A^{-1}t.

Moreover, the determinant |A⁻¹| of A⁻¹ is

|A^{-1}| = |L'A^{-1}L| = \frac{1}{a_1 a_2 \cdots a_n}.

Accordingly, the right-hand member of Equation (6), which is equal to
integral (2), may be written as

(7)  C e^{t'\mu} \sqrt{(2\pi)^n |A^{-1}|} \exp\left(\frac{t'A^{-1}t}{2}\right).

If, in this function, we set t₁ = t₂ = ··· = tₙ = 0, we have the value
of the left-hand member of Equation (1). Thus, we have

C\sqrt{(2\pi)^n |A^{-1}|} = 1.
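This normalization is easy to check numerically. A minimal Python sketch
(NumPy and SciPy assumed available; the particular A and μ below are made
up for the illustration) verifies that, with C = [(2π)ⁿ|A⁻¹|]^{−1/2}, the
function C exp[−(x − μ)'A(x − μ)/2] coincides with the usual multivariate
normal p.d.f. whose covariance matrix is A⁻¹:

    import numpy as np
    from scipy.stats import multivariate_normal

    A = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # positive definite, symmetric
    mu = np.array([1.0, -2.0])
    C = 1.0 / np.sqrt((2*np.pi)**2 * np.linalg.det(np.linalg.inv(A)))

    def f(x):
        d = x - mu
        return C * np.exp(-0.5 * d @ A @ d)    # the p.d.f. in the form derived above

    rv = multivariate_normal(mean=mu, cov=np.linalg.inv(A))   # covariance matrix A^{-1}
    for x in [np.array([0.0, 0.0]), np.array([1.5, -2.5]), np.array([-1.0, 1.0])]:
        assert np.isclose(f(x), rv.pdf(x))     # the two densities agree pointwise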
Accordingly, the function

f(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\sqrt{|A^{-1}|}} \exp\left[-\frac{(x - \mu)'A(x - \mu)}{2}\right],

−∞ < xᵢ < ∞, i = 1, 2, ..., n, is a joint p.d.f. of n random variables
X₁, X₂, ..., Xₙ that are of the continuous type. Such a p.d.f. is called
a nonsingular multivariate normal p.d.f.

We have now proved that f(x₁, x₂, ..., xₙ) is a p.d.f. However, we
have proved more than that. Because f(x₁, x₂, ..., xₙ) is a p.d.f.,
integral (2) is the moment-generating function M(t₁, t₂, ..., tₙ) of this
joint distribution of probability. Since integral (2) is equal to function
(7), the moment-generating function of the multivariate normal
distribution is given by

\exp\left(t'\mu + \frac{t'A^{-1}t}{2}\right)

for all real values of t.

Let the elements of the real, symmetric, and positive definite
matrix A⁻¹ be denoted by σᵢⱼ, i, j = 1, 2, ..., n. Then

M(0, \ldots, 0, t_i, 0, \ldots, 0) = \exp\left(t_i \mu_i + \frac{\sigma_{ii} t_i^2}{2}\right)

is the moment-generating function of Xᵢ, i = 1, 2, ..., n. Thus, Xᵢ is
n(μᵢ, σᵢᵢ), i = 1, 2, ..., n. Moreover, with i ≠ j, we see that M(0, ..., 0,
tᵢ, 0, ..., 0, tⱼ, 0, ..., 0), the moment-generating function of Xᵢ and
Xⱼ, is equal to

\exp\left[t_i \mu_i + t_j \mu_j + \frac{\sigma_{ii} t_i^2 + 2\sigma_{ij} t_i t_j + \sigma_{jj} t_j^2}{2}\right].

But this is the moment-generating function of a bivariate normal distri-
bution, so that σᵢⱼ is the covariance of the random variables Xᵢ and Xⱼ.
Thus the matrix μ, where μ' = [μ₁, μ₂, ..., μₙ], is the matrix of the
means of the random variables X₁, ..., Xₙ. Moreover, the elements on
the principal diagonal of A⁻¹ are, respectively, the variances σᵢᵢ = σᵢ²,
i = 1, 2, ..., n, and the elements not on the principal diagonal of A⁻¹
are, respectively, the covariances σᵢⱼ = ρᵢⱼσᵢσⱼ, i ≠ j, of the random
variables X₁, X₂, ..., Xₙ. We call the matrix A⁻¹, which is given by

\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{12} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & & & \vdots \\ \sigma_{1n} & \sigma_{2n} & \cdots & \sigma_{nn} \end{bmatrix},

the covariance matrix of the multivariate normal distribution and
henceforth we shall denote this matrix by the symbol V. In terms of
the positive definite covariance matrix V, the multivariate normal
p.d.f. is written

\frac{1}{(2\pi)^{n/2}\sqrt{|V|}} \exp\left[-\frac{(x - \mu)'V^{-1}(x - \mu)}{2}\right],

−∞ < xᵢ < ∞, i = 1, 2, ..., n, and the moment-generating function of this
distribution is given by

\exp\left(t'\mu + \frac{t'Vt}{2}\right)

for all real values of t.

Example 1. Let X₁, X₂, ..., Xₙ have a multivariate normal distribution
with matrix μ of means and positive definite covariance matrix V. If we
let X' = [X₁, X₂, ..., Xₙ], then the moment-generating function M(t₁, t₂,
..., tₙ) of this joint distribution of probability is

(8)  E(e^{t'X}) = \exp\left(t'\mu + \frac{t'Vt}{2}\right).

Consider a linear function Y of X₁, X₂, ..., Xₙ which is defined by Y =
c'X = Σ₁ⁿ cᵢXᵢ, where c' = [c₁, c₂, ..., cₙ] and the several cᵢ are real and not
all zero. We wish to find the p.d.f. of Y. The moment-generating function
M(t) of the distribution of Y is given by

M(t) = E(e^{tY}) = E(e^{tc'X}).

Now the expectation (8) exists for all real values of t. Thus we can replace t'
in expectation (8) by tc' and obtain

M(t) = \exp\left(tc'\mu + \frac{c'Vc\, t^2}{2}\right).

Thus the random variable Y is n(c'μ, c'Vc).

EXERCISES

12.1. Let X₁, X₂, ..., Xₙ have a multivariate normal distribution with
positive definite covariance matrix V. Prove that these random variables are
mutually stochastically independent if and only if V is a diagonal matrix.

12.2. Let n = 2 and take

V = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}.

Determine |V|, V⁻¹, and (x − μ)'V⁻¹(x − μ). Compare the bivariate normal
p.d.f. with the multivariate normal p.d.f. when n = 2.
12.3. Let X₁, X₂, ..., Xₙ have a multivariate normal distribution, where
μ is the matrix of the means and V is the positive definite covariance matrix.
Let Y = c'X and Z = d'X, where X' = [X₁, ..., Xₙ], c' = [c₁, ..., cₙ],
and d' = [d₁, ..., dₙ] are real matrices. (a) Find M(t₁, t₂) = E(e^{t₁Y + t₂Z}) to
see that Y and Z have a bivariate normal distribution. (b) Prove that Y
and Z are stochastically independent if and only if c'Vd = 0. (c) If X₁,
X₂, ..., Xₙ are mutually stochastically independent random variables which
have the same variance σ², show that the necessary and sufficient condition
of part (b) becomes c'd = 0.

12.4. Let X' = [X₁, X₂, ..., Xₙ] have the multivariate normal distribu-
tion of Exercise 12.3. Consider the p linear functions of X₁, ..., Xₙ defined
by W = BX, where W' = [W₁, ..., W_p], p ≤ n, and B is a p × n real
matrix of rank p. Find M(v₁, ..., v_p) = E(e^{v'W}), where v' is the real matrix
[v₁, ..., v_p], to see that W₁, ..., W_p have a p-variate normal distribution
which has Bμ for the matrix of the means and BVB' for the covariance
matrix.

12.5. Let X' = [X₁, X₂, ..., Xₙ] have the n-variate normal distribution
of Exercise 12.3. Show that X₁, X₂, ..., X_p, p < n, have a p-variate normal
distribution. What submatrix of V is the covariance matrix of X₁, X₂, ...,
X_p? Hint. In the moment-generating function M(t₁, t₂, ..., tₙ) of X₁, X₂, ...,
Xₙ, let t_{p+1} = ··· = tₙ = 0.
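The conclusion of Example 1 of Section 12.1 — that Y = c'X is n(c'μ, c'Vc)
— is easy to check by simulation. A minimal sketch with NumPy (assumed
available; the particular μ, V, and c are made up for the illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, 0.0, -1.0])
    V = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])                 # positive definite covariance matrix
    c = np.array([1.0, -2.0, 0.5])

    X = rng.multivariate_normal(mu, V, size=200_000)  # each row is one observation of X'
    Y = X @ c                                          # Y = c'X for every observation

    print(Y.mean(), c @ mu)      # sample mean of Y versus c'mu (should be close)
    print(Y.var(), c @ V @ c)    # sample variance of Y versus c'Vc (should be close)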
12.2 The Distributions of Certain Quadratic Forms

Let Xᵢ, i = 1, 2, ..., n, denote mutually stochastically independent
random variables which are n(μᵢ, σᵢ²), i = 1, 2, ..., n, respectively.
Then Q = Σᵢ₌₁ⁿ (Xᵢ − μᵢ)²/σᵢ² is χ²(n). Now Q is a quadratic form in the
Xᵢ − μᵢ and Q is seen to be, apart from the coefficient −½, the random
variable which is defined by the exponent on the number e in the joint
p.d.f. of X₁, X₂, ..., Xₙ. We shall now show that this result can be
generalized.

Let X₁, X₂, ..., Xₙ have a multivariate normal distribution with
p.d.f.

\frac{1}{(2\pi)^{n/2}\sqrt{|V|}} \exp\left[-\frac{(x - \mu)'V^{-1}(x - \mu)}{2}\right],

where, as usual, the covariance matrix V is positive definite. We shall
show that the random variable Q (a quadratic form in the Xᵢ − μᵢ),
which is defined by (x − μ)'V⁻¹(x − μ), is χ²(n). We have for the
moment-generating function M(t) of Q the integral

M(t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{n/2}\sqrt{|V|}} \exp\left[t(x - \mu)'V^{-1}(x - \mu) - \frac{(x - \mu)'V^{-1}(x - \mu)}{2}\right] dx_1 \cdots dx_n.

With V⁻¹ positive definite, the integral is seen to exist for all real
values of t < ½. Moreover, (1 − 2t)V⁻¹, t < ½, is a positive definite
matrix and, since |(1 − 2t)V⁻¹| = (1 − 2t)ⁿ|V⁻¹|, it follows that

\frac{(1 - 2t)^{n/2}}{(2\pi)^{n/2}\sqrt{|V|}} \exp\left[-\frac{(x - \mu)'(1 - 2t)V^{-1}(x - \mu)}{2}\right]

can be treated as a multivariate normal p.d.f. If we multiply our
integrand by (1 − 2t)^{n/2}, we have this multivariate p.d.f. Thus the
moment-generating function of Q is given by

M(t) = \frac{1}{(1 - 2t)^{n/2}},  t < \tfrac{1}{2},

and Q is χ²(n), as we wished to show. This fact is the basis of the
chi-square tests that were discussed in Chapter 8.

The remarkable fact that the random variable which is defined by
(x − μ)'V⁻¹(x − μ) is χ²(n) stimulates a number of questions about
quadratic forms in normally distributed variables. We would like to
treat this problem in complete generality, but limitations of space
forbid this, and we find it necessary to restrict ourselves to some special
cases.

Let X₁, X₂, ..., Xₙ denote a random sample of size n from a distri-
bution which is n(0, σ²), σ² > 0. Let X' = [X₁, X₂, ..., Xₙ] and let A
denote an arbitrary n × n real symmetric matrix. We shall investigate
the distribution of the quadratic form X'AX. For instance, we know
that X'IₙX/σ² = X'X/σ² = Σ₁ⁿ Xᵢ²/σ² is χ²(n). First we shall find the
moment-generating function of X'AX/σ². Then we shall investigate
the conditions which must be imposed upon the real symmetric matrix
A if X'AX/σ² is to have a chi-square distribution. This moment-
generating function is given by

M(t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(\frac{tx'Ax}{\sigma^2} - \frac{x'x}{2\sigma^2}\right) dx_1 \cdots dx_n
     = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left[-\frac{x'(I - 2tA)x}{2\sigma^2}\right] dx_1 \cdots dx_n,

where I = Iₙ. The matrix I − 2tA is positive definite if we take |t|
sufficiently small, say |t| < h, h > 0. Moreover, we can treat

\frac{1}{(2\pi)^{n/2}|(I - 2tA)^{-1}\sigma^2|^{1/2}} \exp\left[-\frac{x'(I - 2tA)x}{2\sigma^2}\right]

as a multivariate normal p.d.f. Now |(I − 2tA)⁻¹σ²|^{1/2} =
σⁿ/|I − 2tA|^{1/2}. If we multiply our integrand by |I − 2tA|^{1/2}, we have
this multivariate p.d.f. Hence the moment-generating function of
X'AX/σ² is given by

(1)  M(t) = |I - 2tA|^{-1/2},  |t| < h.

It proves useful to express this moment-generating function in a
different form. To do this, let a₁, a₂, ..., aₙ denote the characteristic
numbers of A and let L denote an n × n orthogonal matrix such that
L'AL = diag[a₁, a₂, ..., aₙ]. Thus,

L'(I - 2tA)L = \begin{bmatrix} 1 - 2ta_1 & 0 & \cdots & 0 \\ 0 & 1 - 2ta_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 1 - 2ta_n \end{bmatrix}.

Then

|I - 2tA| = |L'(I - 2tA)L| = \prod_{i=1}^{n} (1 - 2ta_i).

Accordingly we can write M(t), as given in Equation (1), in the form

(2)  M(t) = \prod_{i=1}^{n} (1 - 2ta_i)^{-1/2},  |t| < h.

Let r, 0 < r ≤ n, denote the rank of the real symmetric matrix A.
Then exactly r of the real numbers a₁, a₂, ..., aₙ, say a₁, ..., aᵣ, are
not zero and exactly n − r of these numbers, say aᵣ₊₁, ..., aₙ, are
zero. Thus we can write the moment-generating function of X'AX/σ² as

M(t) = [(1 - 2ta_1)(1 - 2ta_2)\cdots(1 - 2ta_r)]^{-1/2}.

Now that we have found, in suitable form, the moment-generating
function of our random variable, let us turn to the question of the con-
ditions that must be imposed if X'AX/σ² is to have a chi-square
distribution. Assume that X'AX/σ² is χ²(k). Then

M(t) = [(1 - 2ta_1)(1 - 2ta_2)\cdots(1 - 2ta_r)]^{-1/2} = (1 - 2t)^{-k/2},

or, equivalently,

(1 - 2ta_1)(1 - 2ta_2)\cdots(1 - 2ta_r) = (1 - 2t)^k,  |t| < h.

Because the positive integers r and k are the degrees of these poly-
nomials, and because these polynomials are equal for infinitely many
values of t, we have k = r, the rank of A. Moreover, the uniqueness of
the factorization of a polynomial implies that a₁ = a₂ = ··· = aᵣ = 1.
If each of the nonzero characteristic numbers of a real symmetric
matrix is one, the matrix is idempotent, that is, A² = A, and con-
versely (see Exercise 12.7). Accordingly, if X'AX/σ² has a chi-square
distribution, then A² = A and the random variable is χ²(r), where r is
the rank of A. Conversely, if A is of rank r, 0 < r ≤ n, and if A² = A,
then A has exactly r characteristic numbers that are equal to one, and
the remaining n − r characteristic numbers are equal to zero. Thus the
moment-generating function of X'AX/σ² is given by (1 − 2t)^{−r/2},
t < ½, and X'AX/σ² is χ²(r). This establishes the following theorem.

Theorem 1. Let Q denote a random variable which is a quadratic
form in the items of a random sample of size n from a distribution which is
n(0, σ²). Let A denote the symmetric matrix of Q and let r, 0 < r ≤ n,
denote the rank of A. Then Q/σ² is χ²(r) if and only if A² = A.

Remark. If the normal distribution in Theorem 1 is n(μ, σ²), the
condition A² = A remains a necessary and sufficient condition that Q/σ²
have a chi-square distribution. In general, however, Q/σ² is not χ²(r) but,
instead, Q/σ² has a noncentral chi-square distribution if A² = A. The
number of degrees of freedom is r, the rank of A, and the noncentrality
parameter is μ'Aμ/σ², where μ' = [μ, μ, ..., μ]. Since μ'Aμ = μ² Σᵢ,ⱼ aᵢⱼ,
where A = [aᵢⱼ], then if μ ≠ 0, the conditions A² = A and Σᵢ,ⱼ aᵢⱼ = 0 are
necessary and sufficient conditions that Q/σ² be central χ²(r). Moreover, the
theorem may be extended to a quadratic form in random variables which
have a multivariate normal distribution with positive definite covariance
matrix V; here the necessary and sufficient condition that Q have a chi-square
distribution is AVA = A.
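Theorem 1 can be illustrated numerically. The following sketch (Python
with NumPy and SciPy, both assumed available) builds an idempotent
symmetric matrix A of rank r as an orthogonal projection — our own choice
of construction, not one from the text — simulates the quadratic form, and
compares it with the χ²(r) distribution:

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(1)
    n, r, sigma2 = 6, 2, 4.0
    H = rng.standard_normal((n, r))
    A = H @ np.linalg.inv(H.T @ H) @ H.T       # projection: symmetric, idempotent, rank r
    assert np.allclose(A @ A, A)               # A^2 = A, the condition of Theorem 1

    X = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
    Q = np.einsum('ij,jk,ik->i', X, A, X)      # Q = X'AX for each simulated sample
    print((Q/sigma2).mean(), r)                # mean of a chi-square(r) variable is r
    print((Q/sigma2).var(), 2*r)               # its variance is 2r
    print(np.mean(Q/sigma2 <= chi2.ppf(0.95, r)))   # should be near 0.95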
EXERCISES

12.6. Let Q = X₁X₂ − X₃X₄, where X₁, X₂, X₃, X₄ is a random sample
of size 4 from a distribution which is n(0, σ²). Show that Q/σ² does not have
a chi-square distribution. Find the moment-generating function of Q/σ².
12.7. Let A be a real symmetric matrix. Prove that each of the nonzero
characteristic numbers of A is equal to one if and only if A² = A. Hint.
Let L be an orthogonal matrix such that L'AL = diag[a₁, a₂, ..., aₙ] and
note that A is idempotent if and only if L'AL is idempotent.

12.8. The sum of the elements on the principal diagonal of a square
matrix A is called the trace of A and is denoted by tr A. (a) If B is n × m
and C is m × n, prove that tr (BC) = tr (CB). (b) If A is a square matrix
and if L is an orthogonal matrix, use the result of part (a) to show that
tr (L'AL) = tr A. (c) If A is a real symmetric idempotent matrix, use the
result of part (b) to prove that the rank of A is equal to tr A.

12.9. Let A = [aᵢⱼ] be a real symmetric matrix. Prove that Σᵢ Σⱼ aᵢⱼ² is
equal to the sum of the squares of the characteristic numbers of A. Hint.
If L is an orthogonal matrix, show that Σᵢ Σⱼ aᵢⱼ² = tr (A²) = tr (L'A²L) =
tr [(L'AL)(L'AL)].

12.10. Let X̄ and S² denote, respectively, the mean and the variance of a
random sample of size n from a distribution which is n(0, σ²). (a) If A denotes
the symmetric matrix of nX̄², show that A = (1/n)P, where P is the n × n
matrix, each of whose elements is equal to one. (b) Demonstrate that A is
idempotent and that tr A = 1. Thus nX̄²/σ² is χ²(1). (c) Show that the
symmetric matrix B of nS² is I − (1/n)P. (d) Demonstrate that B is
idempotent and that tr B = n − 1. Thus nS²/σ² is χ²(n − 1), as previously
proved otherwise. (e) Show that the product matrix AB is the zero matrix.
12.3 The Independence of Certain Quadratic Forms
We have previously investigated the stochastic independence of
linear functions of normally distributed variables (see Exercise 12.3).
In this section we shall prove some theorems about the stochastic
independence of quadratic forms. As we remarked on p. 411, we shall
confine our attention to normally distributed variables that constitute
a random sample of size n from a distribution that is n(0, σ²).

Let X₁, X₂, ..., Xₙ denote a random sample of size n from a
distribution which is n(0, σ²). Let A and B denote two real symmetric
matrices, each of order n. Let X' = [X₁, X₂, ..., Xₙ] and consider the
two quadratic forms X'AX and X'BX. We wish to show that these
quadratic forms are stochastically independent if and only if AB = 0,
the zero matrix. We shall first compute the moment-generating
function M(t₁, t₂) of the joint distribution of X'AX/σ² and X'BX/σ².
We have

M(t_1, t_2) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left[-\frac{x'(I - 2t_1A - 2t_2B)x}{2\sigma^2}\right] dx_1 \cdots dx_n.

The matrix I − 2t₁A − 2t₂B is positive definite if we take |t₁| and
|t₂| sufficiently small, say |t₁| < h₁, |t₂| < h₂, where h₁, h₂ > 0. Then,
as on p. 412, we have

M(t_1, t_2) = |I - 2t_1A - 2t_2B|^{-1/2},  |t_1| < h_1,  |t_2| < h_2.
Let us assume that X'AX/σ² and X'BX/σ² are stochastically indepen-
dent (so that likewise are X'AX and X'BX) and prove that AB = 0.
Thus we assume that

(1)  M(t_1, t_2) = M(t_1, 0)\,M(0, t_2)

for all t₁ and t₂ for which |tᵢ| < hᵢ, i = 1, 2. Identity (1) is equivalent
to the identity

(2)  |I - 2t_1A - 2t_2B| = |I - 2t_1A|\,|I - 2t_2B|.

Let r > 0 denote the rank of A and let a₁, a₂, ..., aᵣ denote the r
nonzero characteristic numbers of A. There exists an orthogonal
matrix L such that

L'AL = \begin{bmatrix} C_{11} & 0 \\ 0 & 0 \end{bmatrix} = C,  \quad C_{11} = \mathrm{diag}[a_1, a_2, \ldots, a_r],

for a suitable ordering of a₁, a₂, ..., aᵣ. Then L'BL may be written in
the identically partitioned form

L'BL = \begin{bmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{bmatrix} = D.

The identity (2) may be written as

(2')  |L'|\,|I - 2t_1A - 2t_2B|\,|L| = |L'|\,|I - 2t_1A|\,|L|\;|L'|\,|I - 2t_2B|\,|L|,
or as

(3)  |I - 2t_1C - 2t_2D| = |I - 2t_1C|\,|I - 2t_2D|.

The coefficient of (−2t₁)^r in the right-hand member of Equation (3)
is seen by inspection to be a₁a₂···aᵣ|I − 2t₂D|. It is not so easy to
find the coefficient of (−2t₁)^r in the left-hand member of Equation (3).
Conceive of expanding this determinant in terms of minors of order r
formed from the first r columns. One term in this expansion is the
product of the minor of order r in the upper left-hand corner, namely
|Iᵣ − 2t₁C₁₁ − 2t₂D₁₁|, and the minor of order n − r in the lower
right-hand corner, namely |I_{n−r} − 2t₂D₂₂|. Moreover, this product is
the only term in the expansion of the determinant that involves
(−2t₁)^r. Thus the coefficient of (−2t₁)^r in the left-hand member of
Equation (3) is a₁a₂···aᵣ|I_{n−r} − 2t₂D₂₂|. If we equate these coefficients
of (−2t₁)^r, we have, for all t₂, |t₂| < h₂,

(4)  |I - 2t_2D| = |I_{n-r} - 2t_2D_{22}|.

Equation (4) implies that the nonzero characteristic numbers of the
matrices D and D₂₂ are the same (see Exercise 12.17). Recall that the
sum of the squares of the characteristic numbers of a symmetric matrix
is equal to the sum of the squares of the elements of that matrix (see
Exercise 12.9). Thus the sum of the squares of the elements of matrix D
is equal to the sum of the squares of the elements of D₂₂. Since the
elements of the matrix D are real, it follows that each of the elements of
D₁₁, D₁₂, and D₂₁ is zero. Accordingly, we can write D in the form

D = L'BL = \begin{bmatrix} 0 & 0 \\ 0 & D_{22} \end{bmatrix}.

Thus CD = L'ALL'BL = 0 and L'ABL = 0 and AB = 0, as we
wished to prove.

To complete the proof of the theorem, we assume that AB = 0. We
are to show that X'AX/σ² and X'BX/σ² are stochastically independent.
We have, for all real values of t₁ and t₂,

(I - 2t_1A)(I - 2t_2B) = I - 2t_1A - 2t_2B + 4t_1t_2AB = I - 2t_1A - 2t_2B,

since AB = 0. Thus,

|I - 2t_1A - 2t_2B| = |I - 2t_1A|\,|I - 2t_2B|.

Since the moment-generating function of the joint distribution of
X'AX/σ² and X'BX/σ² is given by

M(t_1, t_2) = |I - 2t_1A - 2t_2B|^{-1/2},  |t_i| < h_i,  i = 1, 2,

we have

M(t_1, t_2) = M(t_1, 0)\,M(0, t_2),

and the proof of the following theorem is complete.

Theorem 2. Let Q₁ and Q₂ denote random variables which are
quadratic forms in the items of a random sample of size n from a distribu-
tion which is n(0, σ²). Let A and B denote respectively the real symmetric
matrices of Q₁ and Q₂. The random variables Q₁ and Q₂ are stochastically
independent if and only if AB = 0.

Remark. Theorem 2 remains valid if the random sample is from a
distribution which is n(μ, σ²), whatever be the real value of μ. Moreover,
Theorem 2 may be extended to quadratic forms in random variables that
have a joint multivariate normal distribution with a positive definite
covariance matrix V. The necessary and sufficient condition for the stochastic
independence of two such quadratic forms with symmetric matrices A and
B then becomes AVB = 0. In our Theorem 2, we have V = σ²I, so that
AVB = Aσ²IB = σ²AB = 0.

We shall next prove the theorem that was used in Chapter 8
(p. 279).
Theorem 3. Let Q = Q₁ + ··· + Q_{k−1} + Q_k, where Q, Q₁, ...,
Q_{k−1}, Q_k are k + 1 random variables that are quadratic forms in the items
of a random sample of size n from a distribution which is n(0, σ²). Let
Q/σ² be χ²(r), let Qᵢ/σ² be χ²(rᵢ), i = 1, 2, ..., k − 1, and let Q_k be non-
negative. Then the random variables Q₁, Q₂, ..., Q_k are mutually stochasti-
cally independent and, hence, Q_k/σ² is χ²(r_k = r − r₁ − ··· − r_{k−1}).

Proof. Take first the case of k = 2 and let the real symmetric
matrices of Q, Q₁, and Q₂ be denoted, respectively, by A, A₁, A₂. We are
given that Q = Q₁ + Q₂ or, equivalently, that A = A₁ + A₂. We
are also given that Q/σ² is χ²(r) and that Q₁/σ² is χ²(r₁). In accordance
with Theorem 1, p. 413, we have A² = A and A₁² = A₁. Since Q₂ ≥ 0,
each of the matrices A, A₁, and A₂ is positive semidefinite. Because
A² = A, we can find an orthogonal matrix L such that

L'AL = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}.

If then we multiply both members of A = A₁ + A₂ on the left by L'
and on the right by L, we have

L'AL = L'A_1L + L'A_2L.
Now each of A₁ and A₂, and hence each of L'A₁L and L'A₂L, is positive
semidefinite. Recall that, if a real symmetric matrix is positive semi-
definite, each element on the principal diagonal is positive or zero.
Moreover, if an element on the principal diagonal is zero, then all
elements in that row and all elements in that column are zero. Thus
L'AL = L'A₁L + L'A₂L can be written as

(5)  \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} = L'A_1L + L'A_2L,

where each of the positive semidefinite matrices L'A₁L and L'A₂L has
zeros in every position outside its upper left r × r block. Since A₁² = A₁,
we have (L'A₁L)(L'A₁L) = L'A₁L. If we multiply both members of
Equation (5) on the left by the matrix L'A₁L, and note that the block
structure just described gives (L'A₁L)[Iᵣ 0; 0 0] = L'A₁L, we see that

L'A_1L = L'A_1L + (L'A_1L)(L'A_2L).

Thus, (L'A₁L)(L'A₂L) = 0 and A₁A₂ = 0. In accordance with Theorem 2,
Q₁ and Q₂ are stochastically independent. This stochastic independence
immediately implies that Q₂/σ² is χ²(r₂ = r − r₁). This completes the proof
when k = 2. For k > 2, the proof may be made by induction. We shall
merely indicate how this can be done by using k = 3. Take A =
A₁ + A₂ + A₃, where A² = A, A₁² = A₁, A₂² = A₂, and A₃ is positive
semidefinite. Write A = A₁ + (A₂ + A₃) = A₁ + B₁, say. Now
A² = A, A₁² = A₁, and B₁ is positive semidefinite. In accordance with
the case of k = 2, we have A₁B₁ = 0, so that B₁² = B₁. With B₁ =
A₂ + A₃, where B₁² = B₁ and A₂² = A₂, it follows from the case of k = 2
that A₂A₃ = 0 and A₃² = A₃. If we regroup by writing A = A₂ +
(A₁ + A₃), we obtain A₁A₃ = 0, and so on.
Remark. In our statement of Theorem 3 we took X₁, X₂, ..., Xₙ to
be items of a random sample from a distribution which is n(0, σ²). We did
this because our proof of Theorem 2 was restricted to that case. In fact, if
Q', Q₁', ..., Q_k' are quadratic forms in any normal variables (including
multivariate normal variables), if Q' = Q₁' + ··· + Q_k', if Q', Q₁', ..., Q_{k−1}'
are central or noncentral chi-square, and if Q_k' is nonnegative, then Q₁', ..., Q_k'
are mutually stochastically independent and Q_k' is either central or noncentral
chi-square.
This section will conclude with a proof of a frequently quoted
theorem due to Cochran.
Theorem 4. Let X₁, X₂, ..., Xₙ denote a random sample from a
distribution which is n(0, σ²). Let the sum of the squares of these items be
written in the form

\sum_{i=1}^{n} X_i^2 = Q_1 + Q_2 + \cdots + Q_k,

where Qⱼ is a quadratic form in X₁, X₂, ..., Xₙ, with matrix Aⱼ which
has rank rⱼ, j = 1, 2, ..., k. The random variables Q₁, Q₂, ..., Q_k are
mutually stochastically independent and Qⱼ/σ² is χ²(rⱼ), j = 1, 2, ..., k,
if and only if Σ₁ᵏ rⱼ = n.
Proof. First assume the two conditions Σ₁ᵏ rⱼ = n and Σ₁ⁿ Xᵢ² = Σ₁ᵏ Qⱼ
to be satisfied. The latter equation implies that I = A₁ + A₂ + ···
+ A_k. Let Bᵢ = I − Aᵢ. That is, Bᵢ is the sum of the matrices A₁, ...,
A_k exclusive of Aᵢ. Let Rᵢ denote the rank of Bᵢ. Since the rank of the
sum of several matrices is less than or equal to the sum of the ranks,
we have Rᵢ ≤ Σ₁ᵏ rⱼ − rᵢ = n − rᵢ. However, I = Aᵢ + Bᵢ, so that
n ≤ rᵢ + Rᵢ and n − rᵢ ≤ Rᵢ. Hence Rᵢ = n − rᵢ. The characteristic
numbers of Bᵢ are the roots of the equation |Bᵢ − λI| = 0. Since
Bᵢ = I − Aᵢ, this equation can be written as |I − Aᵢ − λI| = 0.
Thus, we have |Aᵢ − (1 − λ)I| = 0. But each root of the last equation
is one minus a characteristic number of Aᵢ. Since Bᵢ has exactly
n − Rᵢ = rᵢ characteristic numbers that are zero, then Aᵢ has exactly
rᵢ characteristic numbers that are equal to one. However, rᵢ is the rank
of Aᵢ. Thus, each of the rᵢ nonzero characteristic numbers of Aᵢ is one.
That is, Aᵢ² = Aᵢ and thus Qᵢ/σ² is χ²(rᵢ), i = 1, 2, ..., k. In accordance
with Theorem 3, the random variables Q₁, Q₂, ..., Q_k are mutually
stochastically independent.

To complete the proof of Theorem 4, take Σ₁ⁿ Xᵢ² = Q₁ + Q₂ + ···
+ Q_k, let Q₁, Q₂, ..., Q_k be mutually stochastically independent, and
let Qⱼ/σ² be χ²(rⱼ), j = 1, 2, ..., k. Then Σ₁ᵏ Qⱼ/σ² is χ²(Σ₁ᵏ rⱼ). But
Σ₁ᵏ Qⱼ/σ² = Σ₁ⁿ Xᵢ²/σ² is χ²(n). Thus, Σ₁ᵏ rⱼ = n and the proof is complete.
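Theorem 4 — and, in passing, the condition AᵢAⱼ = 0 of Theorem 2 — can
be illustrated numerically. A minimal Python/NumPy sketch (NumPy assumed
available; the particular decomposition below is constructed for the
illustration and is not taken from the text):

    import numpy as np

    rng = np.random.default_rng(2)
    n, ranks, sigma2 = 8, (1, 3, 4), 2.0
    L, _ = np.linalg.qr(rng.standard_normal((n, n)))    # an n x n orthogonal matrix
    cols = np.split(np.arange(n), np.cumsum(ranks)[:-1])
    A = [L[:, c] @ L[:, c].T for c in cols]             # A_1 + A_2 + A_3 = I, each idempotent

    assert np.allclose(sum(A), np.eye(n))               # ranks add to n
    assert np.allclose(A[0] @ A[1], 0)                  # pairwise products are zero (Theorem 2)

    X = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
    Q = [np.einsum('ij,jk,ik->i', X, Aj, X) / sigma2 for Aj in A]
    for q, r in zip(Q, ranks):
        print(q.mean(), r, q.var(), 2*r)                # chi-square(r) moments
    print(np.corrcoef(Q[0], Q[1])[0, 1])                # near zero, consistent with independence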
EXERCISES
12.11. Let X₁, X₂, ..., Xₙ denote a random sample of size n from a
distribution which is n(0, σ²). Prove that Σ₁ⁿ Xᵢ² and every quadratic form,
which is nonidentically zero in X₁, X₂, ..., Xₙ, are stochastically dependent.

12.12. Let X₁, X₂, X₃, X₄ denote a random sample of size 4 from a distri-
bution which is n(0, σ²). Let Y = Σ₁⁴ aᵢXᵢ, where a₁, a₂, a₃, and a₄ are real
constants. If Y² and Q = X₁X₂ − X₃X₄ are stochastically independent,
determine a₁, a₂, a₃, and a₄.

12.13. Let A be the real symmetric matrix of a quadratic form Q in the
items of a random sample of size n from a distribution which is n(0, σ²).
Given that Q and the mean X̄ of the sample are stochastically independent.
What can be said of the elements of each row (column) of A? Hint. Are Q
and X̄² stochastically independent?

12.14. Let A₁, A₂, ..., A_k be the matrices of k > 2 quadratic forms
Q₁, Q₂, ..., Q_k in the items of a random sample of size n from a distribution
which is n(0, σ²). Prove that the pairwise stochastic independence of these
forms implies that they are mutually stochastically independent. Hint. Show
that AᵢAⱼ = 0, i ≠ j, permits E[exp(t₁Q₁ + t₂Q₂ + ··· + t_kQ_k)] to be
written as a product of the moment-generating functions of Q₁, Q₂, ..., Q_k.

12.15. Let X' = [X₁, X₂, ..., Xₙ], where X₁, X₂, ..., Xₙ are items of a
random sample from a distribution which is n(0, σ²). Let b' = [b₁, b₂, ..., bₙ]
be a real nonzero matrix, and let A be a real symmetric matrix of order n.
Prove that the linear form b'X and the quadratic form X'AX are stochastically
independent if and only if b'A = 0. Use this fact to prove that b'X and
X'AX are stochastically independent if and only if the two quadratic forms,
(b'X)² = X'bb'X and X'AX, are stochastically independent.

12.16. Let Q₁ and Q₂ be two nonnegative quadratic forms in the items of
a random sample from a distribution which is n(0, σ²). Show that another
quadratic form Q is stochastically independent of Q₁ + Q₂ if and only if Q is
stochastically independent of each of Q₁ and Q₂. Hint. Consider the orthogonal
transformation that diagonalizes the matrix of Q₁ + Q₂. After this trans-
formation, what are the forms of the matrices of Q, Q₁, and Q₂ if Q and
Q₁ + Q₂ are stochastically independent?

12.17. Prove that Equation (4) of this section implies that the nonzero
characteristic numbers of the matrices D and D₂₂ are the same. Hint. Let
λ = 1/(2t₂), t₂ ≠ 0, and show that Equation (4) is equivalent to
|D − λI| = (−λ)^r |D₂₂ − λI_{n−r}|.
Appendix A
References
[1] Anderson, T. W., An Introduction to Multivariate Statistical Analysis,
John Wiley & Sons, Inc., New York, 1958.
[2] Basu, D., "On Statistics Independent of a Complete Sufficient Statistic,"
Sankhyā, 15, 377 (1955).
[3] Box, G. E. P., and Muller, M. E., "A Note on the Generation of Random
Normal Deviates," Ann. Math. Stat., 29, 610 (1958).
[4] Carpenter, O., "Note on the Extension of Craig's Theorem to Non-
central Variates," Ann. Math. Stat., 21, 455 (1950).
[5] Cochran, W. G., "The Distribution of Quadratic Forms in a Normal
System, with Applications to the Analysis of Covariance," Proc.
Cambridge Phil. Soc., 30, 178 (1934).
[6] Craig, A. T., "Bilinear Forms in Normally Correlated Variables," Ann.
Math. Stat., 18, 565 (1947).
[7] Craig, A. T., "Note on the Independence of Certain Quadratic Forms,"
Ann. Math. Stat., 14, 195 (1943).
[8] Cramér, H., Mathematical Methods of Statistics, Princeton University
Press, Princeton, N.J., 1946.
[9] Curtiss, J. H., "A Note on the Theory of Moment Generating Functions,"
Ann. Math. Stat., 13, 430 (1942).
[10] Fisher, R. A., "On the Mathematical Foundations of Theoretical
Statistics," Phil. Trans. Royal Soc. London, Series A, 222, 309 (1921).
[11] Fraser, D. A. S., Nonparametric Methods in Statistics, John Wiley &
Sons, Inc., New York, 1957.
[12] Graybill, F. A., An Introduction to Linear Statistical Models, Vol. 1,
McGraw-Hill Book Company, New York, 1961.
[13] Hogg, R. V., "Adaptive Robust Procedures: A Partial Review and Some
Suggestions for Future Applications and Theory," J. Amer. Stat. Assoc.,
69, 909 (1974).
[14] Hogg, R. V., and Craig, A. T., "On the Decomposition of Certain Chi-
Square Variables," Ann. Math. Stat., 29, 608 (1958).
[15] Hogg, R. V., and Craig, A. T., "Sufficient Statistics in Elementary
Distribution Theory," Sankhyā, 17, 209 (1956).
[16] Huber, P., "Robust Statistics: A Review," Ann. Math. Stat., 43, 1041
(1972).
[17] Johnson, N. L., and Kotz, S., Continuous Univariate Distributions,
Vols. 1 and 2, Houghton Mifflin Company, Boston, 1970.
[18] Koopman, B. O., "On Distributions Admitting a Sufficient Statistic,"
Trans. Amer. Math. Soc., 39, 399 (1936).
[19] Lancaster, H. O., "Traces and Cumulants of Quadratic Forms in Normal
Variables," J. Royal Stat. Soc., Series B, 16, 247 (1954).
[20] Lehmann, E. L., Testing Statistical Hypotheses, John Wiley & Sons,
Inc., New York, 1959.
[21] Lehmann, E. L., and Scheffé, H., "Completeness, Similar Regions, and
Unbiased Estimation," Sankhyā, 10, 305 (1950).
[22] Lévy, P., Théorie de l'addition des variables aléatoires, Gauthier-Villars,
Paris, 1937.
[23] Mann, H. B., and Whitney, D. R., "On a Test of Whether One of Two
Random Variables Is Stochastically Larger Than the Other," Ann. Math.
Stat., 18, 50 (1947).
[24] Neyman, J., "Su un teorema concernente le cosiddette statistiche
sufficienti," Giornale dell'Istituto degli Attuari, 6, 320 (1935).
[25] Neyman, J., and Pearson, E. S., "On the Problem of the Most Efficient
Tests of Statistical Hypotheses," Phil. Trans. Royal Soc. London,
Series A, 231, 289 (1933).
[26] Pearson, K., "On the Criterion That a Given System of Deviations from
the Probable in the Case of a Correlated System of Variables Is Such
That It Can Be Reasonably Supposed to Have Arisen from Random
Sampling," Phil. Mag., Series 5, 50, 157 (1900).
[27] Pitman, E. J. G., "Sufficient Statistics and Intrinsic Accuracy," Proc.
Cambridge Phil. Soc., 32, 567 (1936).
[28] Rao, C. R., Linear Statistical Inference and Its Applications, John Wiley
& Sons, Inc., New York, 1965.
[29] Scheffé, H., The Analysis of Variance, John Wiley & Sons, Inc., New
York, 1959.
[30] Wald, A., Sequential Analysis, John Wiley & Sons, Inc., New York, 1947.
[31] Wilcoxon, F., "Individual Comparisons by Ranking Methods," Bio-
metrics Bull., 1, 80 (1945).
[32] Wilks, S. S., Mathematical Statistics, John Wiley & Sons, Inc., New
York, 1962.
Appendix B
Tables
TABLE I
The Poisson Distribution
Pr (X \le x) = \sum_{w=0}^{x} \frac{\mu^w e^{-\mu}}{w!}

μ = E(X)
x 0.5 1.0 1.5 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
0 0.607 0.368 0.223 0.135 0.050 0.018 0.007 0.002 0.001 0.000 0.000 0.000
1 0.910 0.736 0.558 0.406 0.199 0.092 0.040 0.017 0.007 0.003 0.001 0.000
2 0.986 0.920 0.809 0.677 0.423 0.238 0.125 0.062 0.030 0.014 0.006 0.003
3 0.998 0.981 0.934 0.857 0.647 0.433 0.265 0.151 0.082 0.042 0.021 0.010
4 1.000 0.996 0.981 0.947 0.815 0.629 0.440 0.285 0.173 0.100 0.055 0.029
5 0.999 0.996 0.983 0.916 0.785 0.616 0.446 0.301 0.191 0.116 0.067
6 1.000 0.999 0.995 0.966 0.889 0.762 0.606 0.450 0.313 0.207 0.130
7 1.000 0.999 0.988 0.949 0.867 0.744 0.599 0.453 0.324 0.220
8 1.000 0.996 0.979 0.932 0.847 0.729 0.593 0.456 0.333
9 0.999 0.992 0.968 0.916 0.830 0.717 0.587 0.458
10 1.000 0.997 0.986 0.957 0.901 0.816 0.706 0.583
11 0.999 0.995 0.980 0.947 0.888 0.803 0.697
12 1.000 0.998 0.991 0.973 0.936 0.876 0.792
13 0.999 0.996 0.987 0.966 0.926 0.864
14 1.000 0.999 0.994 0.983 0.959 0.917
15 0.999 0.998 0.992 0.978 0.951
16 1.000 0.999 0.996 0.989 0.973
17 1.000 0.998 0.995 0.986
18 0.999 0.998 0.993
19 1.000 0.999 0.997
20 1.000 0.998
21 0.999
22 1.000
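For readers with SciPy at hand, the entries of Tables I, II, III, and V can be
reproduced directly (a convenient way to check any digit that is doubtful in
this copy); a minimal sketch, using the functions of scipy.stats:

    from scipy.stats import poisson, chi2, norm, f

    print(poisson.cdf(3, 2.0))     # Table I:   Pr(X <= 3) with mu = 2.0      -> 0.857
    print(chi2.ppf(0.95, 10))      # Table II:  95th percentile, r = 10       -> 18.3
    print(norm.cdf(1.645))         # Table III: N(1.645)                      -> 0.950
    print(f.ppf(0.99, 5, 10))      # Table V:   99th percentile, r1=5, r2=10  -> 5.64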
TABLE II
The Chi-Square Distribution*

Pr (X \le x) = \int_{0}^{x} \frac{1}{\Gamma(r/2)\,2^{r/2}}\, w^{r/2 - 1} e^{-w/2}\, dw

Pr (X ≤ x)
r 0.01 0.025 0.050 0.95 0.975 0.99
1 0.000 0.001 0.004 3.84 5.02 6.63
2 0.020 0.051 0.103 5.99 7.38 9.21
3 0.115 0.216 0.352 7.81 9.35 11.3
4 0.297 0.484 0.711 9.49 11.1 13.3
5 0.554 0.831 1.15 11.1 12.8 15.1
6 0.872 1.24 1.64 12.6 14.4 16.8
7 1.24 1.69 2.17 14.1 16.0 18.5
8 1.65 2.18 2.73 15.5 17.5 20.1
9 2.09 2.70 3.33 16.9 19.0 21.7
10 2.56 3.25 3.94 18.3 20.5 23.2
11 3.05 3.82 4.57 19.7 21.9 24.7
12 3.57 4.40 5.23 21.0 23.3 26.2
13 4.11 5.01 5.89 22.4 24.7 27.7
14 4.66 5.63 6.57 23.7 26.1 29.1
15 5.23 6.26 7.26 25.0 27.5 30.6
16 5.81 6.91 7.96 26.3 28.8 32.0
17 6.41 7.56 8.67 27.6 30.2 33.4
18 7.01 8.23 9.39 28.9 31.5 34.8
19 7.63 8.91 10.1 30.1 32.9 36.2
20 8.26 9.59 10.9 31.4 34.2 37.6
21 8.90 10.3 11.6 32.7 35.5 38.9
22 9.54 11.0 12.3 33.9 36.8 40.3
23 10.2 11.7 13.1 35.2 38.1 41.6
24 10.9 12.4 13.8 36.4 39.4 43.0
25 11.5 13.1 14.6 37.7 40.6 44.3
26 12.2 13.8 15.4 38.9 41.9 45.6
27 12.9 14.6 16.2 40.1 43.2 47.0
28 13.6 15.3 16.9 41.3 44.5 48.3
29 14.3 16.0 17.7 42.6 45.7 49.6
30 15.0 16.8 18.5 43.8 47.0 50.9
* This table is abridged and adapted from "Tables of Percentage Points of the
Incomplete Beta Function and of the Chi-Square Distribution," Biometrika, 32 (1941).
It is published here with the kind permission of Professor E. S. Pearson on behalf of
the author, Catherine M. Thompson, and of the Biometrika Trustees.
TABLE III
The Normal Distribution

Pr (X \le x) = N(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, dw

[N(−x) = 1 − N(x)]

x N(x) x N(x) x N(x)
0.00 0.500 1.10 0.864 2.05 0.980
0.05 0.520 1.15 0.875 2.10 0.982
0.10 0.540 1.20 0.885 2.15 0.984
0.15 0.560 1.25 0.894 2.20 0.986
0.20 0.579 1.282 0.900 2.25 0.988
0.25 0.599 1.30 0.903 2.30 0.989
0.30 0.618 1.35 0.911 2.326 0.990
0.35 0.637 1.40 0.919 2.35 0.991
0.40 0.655 1.45 0.926 2.40 0.992
0.45 0.674 1.50 0.933 2.45 0.993
0.50 0.691 1.55 0.939 2.50 0.994
0.55 0.709 1.60 0.945 2.55 0.995
0.60 0.726 1.645 0.950 2.576 0.995
0.65 0.742 1.65 0.951 2.60 0.995
0.70 0.758 1.70 0.955 2.65 0.996
0.75 0.773 1.75 0.960 2.70 0.997
0.80 0.788 1.80 0.964 2.75 0.997
0.85 0.802 1.85 0.968 2.80 0.997
0.90 0.816 1.90 0.971 2.85 0.998
0.95 0.829 1.95 0.974 2.90 0.998
1.00 0.841 1.960 0.975 2.95 0.998
1.05 0.853 2.00 0.977 3.00 0.999
TABLE IV
The t-Distribution
TABLE V
The F Distribution*

Pr (F \le f) = \int_{0}^{f} \frac{\Gamma[(r_1 + r_2)/2]\,(r_1/r_2)^{r_1/2}\, w^{r_1/2 - 1}}{\Gamma(r_1/2)\,\Gamma(r_2/2)\,[1 + r_1 w/r_2]^{(r_1 + r_2)/2}}\, dw

Pr (F ≤ f)   r₂   r₁: 1   2   3   4   5   6   7   8   9   10   12   15
0.95 1 161 200 216 225 230 234 237 239 241 242 244 246
0.975 648 800 864 900 922 937 948 957 963 969 977 985
0.99 4052 4999 5403 5625 5764 5859 5928 5982 6023 6056 6106 6157
0.95 2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4 19.4 19.4
0.975 38.5 39.0 39.2 39.2 39.3 39.3 39.4 39.4 39.4 39.4 39.4 39.4
0.99 98.5 99.0 99.2 99.2 99.3 99.3 99.4 99.4 99.4 99.4 99.4 99.4
0.95 3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.70
0.975 17.4 16.0 15.4 15.1 14.9 14.7 14.6 14.5 14.5 14.4 14.3 14.3
0.99 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.3 27.2 27.1 26.9
0.95 4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.86
0.975 12.2 10.6 9.98 9.60 9.36 9.20 9.07 8.98 8.90 8.84 8.75 8.66
0.99 4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.5 14.4 14.2
0.95 5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.62
0.975 10.0 8.43 7.76 7.39 7.15 6.98 6.85 6.76 6.68 6.62 6.52 6.43
0.99 16.3 13.3 12.1 11.4 11.0 10.7 10.5 10.3 10.2 10.1 9.89 9.72
Pr (F ≤ f)   r₂   r₁: 1   2   3   4   5   6   7   8   9   10   12   15
0.95 6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.94
0.975 8.81 7.26 6.60 6.23 5.99 5.82 5.70 5.60 5.52 5.46 5.37 5.27
0.99 13.7 10.9 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.56
0.95 7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.51
0.975 8.07 6.54 5.89 5.52 5.29 5.12 4.99 4.90 4.82 4.76 4.67 4.57
0.99 12.2 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.31
0.95 8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.22
0.975 7.57 6.06 5.42 5.05 4.82 4.65 4.53 4.43 4.36 4.30 4.20 4.10
0.99 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.52
0.95 9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.01
0.975 7.21 5.71 5.08 4.72 4.48 4.32 4.20 4.10 4.03 3.96 3.87 3.77
0.99 10.6 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 4.96
0.95 10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.85
0.975 6.94 5.46 4.83 4.47 4.24 4.07 3.95 3.85 3.78 3.72 3.62 3.52
0.99 10.0 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.56
0.95 12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.62
0.975 6.55 5.10 4.47 4.12 3.89 3.73 3.61 3.51 3.44 3.37 3.28 3.18
0.99 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.01
0.95 15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.40
0.975 6.20 4.77 4.15 3.80 3.58 3.41 3.29 3.20 3.12 3.06 2.96 2.86
0.99 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.52
* This table is abridged and adapted from "Tables of Percentage Points of the Inverted Beta Distribution," Biometrika, 33 (1943). It is
published here with the kind permission of Professor E. S. Pearson on behalf of the authors, Maxine Merrington and Catherine M. Thompson,
and of the Biometrika Trustees.
Appendix C
Answers to Selected Exercises
1.63 1/3v'y,0 < y < 1; 1/6vy, 2.8 9 . t 3.28 0.05. 4.17 gl (Y1)
20' . Y1
1 < Y < 4; 0 elsewhere. 2.9 ..L 3.29 0.831, 12.8.
14' 1 ...L
2.10 (3x1 + 2)/(6x1 + 3); 3.30 0.90. 36
(:)/(~6).
2 ...±-
1.64 (a) (6x~ + 6x1 + 1)/ 3.31 X2(4).
36
3 -.J!._
(2)(6x1 + 3)2. 3.33 3e- 3y,0 < Y < 00.
36
4 .s,
(b) (~O)/ (~6). 2.12 3X2/4; 3x~/80. 3.34 2,0.95. 36
6 12
3.39 11 36
2.17 (b) l/e. 16' 9 .JL
2.18 (a) 1. (b) - 1. (c) O. 3.40 X2(2).
36
1 - C~O)/ (10
500).
4.20 -ff·, 0 < Y < 27.
1.65 2.19 7/v'804. 3.42 0.067,0.685. 4.23 a/(a + (3);
2.26 b2 = Ul(P12 - P13P23)/
3.44 71.3, 189.7.
a(3/[(a + (3 + l)(a + (3)2].
1.67 (b) 1 _ (~O)/ (~O). [u2(1 - p~3)J; 3.45 v'In 2/TT. 4.24 (a) 20. (b) 1260. (c) 495.
b3 = Ul(PI3 - PI2Pd/ 3.49 0.774. 4.25 10
243'
1.70 -In Z, 0 < Z < 1; [u3(1 - P~3)]. 3.50 v'2/TT; (TT - 2)/TT. 4.34 0.05.
oelsewhere. 2.31 7
3.51 0.90. 4.37 1/4.74, 3.33.
s·
1.72 (x - 1)(10 - x)(9 - x)/ 2.33 1 - (1 - y)12, 0 s Y < 1; 3.52 0.477. 4.42 (l/v'2TT)3yre- y:/2 sin Y3'
420, x = 2,3, ... ,8. 12(1 - y)l1, 0 < Y < 1. 3.53 0.461. o::; Y1 < 00,0 s Y2 < 2TT,
1.74 2; 86.4; -160.8. 2.34 g(y) = [y3 _ (y _ 1)3J/63, 3.54 n(O, 1). o s Y3 ::; TT.
1.75 3; 11; 27. Y = 1, 2, 3, 4, 5, 6. 3.55 0.433. 4.43 Y2Y5e-Ya,0 < v. < 1,
1.76 1. 2.35 1 3.56 0, 3. o < Y2 < 1, 0 < Y3 < 00.
9' 2'
1.78 (t)(t) ¥= i- 3.61 n(O, 2). 4.47 1/(2v'y),0 < Y < 1.
1.79 (a) t· (b) /(1) = l 3.62 (a) 0.264. (b) 0.440.
4.48 e- Yl/2/(2TTv'Y1 - y~),
1.80 $7.80. (c) 0.433. (d) 0.642.
1.83 7 CHAPTER 3 3.64 p=l - VYI < Y2 < v'Yl>
3'
1.84 (a) 1.5, 0.75. (b) 0.5,0.05. 3.65 (38.2,43.4). o < Y1 < 00.
3.1 ±Q
(c) 2; does not exist.
81' 4.50 1 - (1 - e- 3)4.
3.4 1.il
4.51 1.
1.85 et/(2 - et), t < In 2; 2; 2.
512'
3.6 5.
8'
1.92 1·..L 4.56 5
6' 36 3.9 65 16'
1.95 10; 0; 2; - 30. SI'
CHAPTER 4 4.57 48z1Z~Z~, 0 < ZI < 1,
3.11 ..L
1.97 2-(2 2q 72' o < Z2 < 1,0 < Z3 < 1.
--5-' -5-
3.14 1 4.3 0.405.
1.99 1/2P; i; t; 5; 50.
6' 4.58 .i:
3.15 _l..4:.. 4.4 1 12 .
1.101 g, ~H.
625' s· 4.63 6uv(u + v),
3.17 11. x /2' .ll 4.6 (n + 1)/2; (n2 - 1)/12.
1.106 0.84.
6' 1 , 6'
O<u<v<1.
3.18 l..2. 4.7 a + bi; b2s~.
4 .
3.19 0.09. 4.8 X2(2).
4.68 Y g(y)
3.22 4Xe- 4/x!, x = 0, 1,2, .... 4.11 t,O < Y < 1; 2 ...L
36
CHAPTER 2 3.23 0.84. 1/2y2, 1 < Y < 00. 3 2
36
3.27 (a) exp [- 2 + et2(1 + et1) ]. 4.12 y15, 0 ::; Y < 1; 15y14
, 4 -.L
2.3 (a) -t. (b) 55
6,
36
(b) fLl = 1, fL2 = 2, O<y<1. 5 .s,
36
[ (~)/ (~)][5/(8 - x)].
ai = 1, u~ = 2, 4.13 ± 6 ..L
(c) 7' 36
P = 0/2. 4.14 t, Y = 3,5,7. 7 -.J!._
36
2.6 l (c) y/2. 4.16 (t)~ii, Y = 1,8,27, .... 8 ..L
36
CHAPTER 7
5.23 0.840.
5.26 0.08.
5.28 0.267.
5.29 0.682.
5.34 n(O, 1).
CHAPTER 6
6.1 (a) X.
(b) -n/ln (X1X2· .. X n).
(c) X. (d) The median.
(e) The first order statistic.
6.2 The first order statistic Yl'
(
n - l)Y( Y)
-- 1+--·
n n - 1
0.067.
Reject H o.
0; 4(4n
- 1)/3; no.
2
99'
98; Q~6.
9.13
9.15
9.22
9.33
9.39
CHAPTER 9
9.2 (a) ~ ~. (b) 67-5/1024;
(c) (0.8)4.
9.4 8.
9.6 0.954; 0.92; 0.788.
9.8 8.
9.11 (a) Beta (n - j + 1, j).
(b) Beta (n - j + i-I,
j - i + 2).
10.27 i.
10.28 X2 - lIn.
CHAPTER 10
10.9 60y~(ys - Y3)/8s; 6Ys/5;
02/7;
02/35.
10.10 (1/02)e-Yt/9,
o < Y2 < Y1 < 00;
Yl/2; 02/2.
10.12 L Xf[n; L Xdn; L Xiln.
10.14 X; X.
10.15 Y11n.
10.17 Y1 - lIn.
n
10.19 Y1 = LXi; Yl/4n; yes.
1
10.31
CHAPTER 11
11.3 82/n; 82/n(n
+ 2).
11.6 co(n) = (14.4)
x (n In 1.5 - In 9.5);
c1(n) = (14.4)
x (n In 1.5 + In 18).
8.1 q3 = ¥[i > 7.81,
reject H o.
8.3 b s 8 or 32 s b.
8.4 q3 = 1-l < 11.3,
accept tt;
8.5 6.4 < 9.49, accept H o.
8.7 P= (Xl + X 2/2)/
(Xl + X 2 + X 3) ·
8.16 r + 0, 2r + 40.
8.17 r2 (8 + r1)/[r1(r2 - 2)J,
r2 > 2.
8.25 S= L (Xtlncj ) ,
L [(Xi - ~ci)2/ncn
8.29 Reject n;
7.5 n=190r20.
7.6 K(t) = 0.062;
K(/2) = 0.920.
10
7.10 L xf ~ 18.3; yes; yes.
1
10 10
7.12 3 L xf + 2 L X j ~ C.
1 1
7.13 95 or 96; 76.7.
7.14 38 or 39; 15.
7.15 0.08; 0.875.
7.16 (1 - 0)9(1 + 90).
7.17 1,0 < 0 s t; 1/(1604
) ,
t < 0 < 1; 1 - 15/(1604
) ,
1 s O.
7.19 53 or 54, 5.6.
7.22 Reject n, if i ~ 77.564.
7.23 26 or 27;
reject n; if i s 24.
7.24 220 or 221;
reject tt; if y ~ 17.
7.27 t = 3 > 2.262, reject H o.
7.28 ItI = 2.27 > 2.145,
reject a;
CHAPTER 8
n
L (Xj - Y1 )/n.
1
4 11 7
25-' 25' 25'
Y1 = min (Xj );
n/ln [(X1X2· .. Xn)/yn
(b) X/(l - X). (d) X.
(e) X - 1.
l 1-
3' 3'
w1 (y)·
b = 0; does not exist.
Does not exist.
(77.28, 85.12).
24 or 25.
(3.7, 5.7).
160.
(5i/6, 5i/4).
( - 3.6, 2.0).
135 or 136.
(0.43,2.21).
(2.68,9.68).
(0.71,5.50).
[YT2 + }w
2/nJ/(T2 + u2
/n).
(3(y + a)/(n(3 + 1).
6.11
6.12
6.13
6.14
6.16
6.17
6.18
6.19
6.25
6.26
6.30
6.32
6.34
6.36
6.41
6.42
6.7
6.4
6.5
5.1 Degenerate at fL.
5.2 Gamma (a = 1, (3 = 1).
5.3 Gamma (a = 1, (3 = 1).
5.4 Gamma (a = 2; (3 = 1).
5.13 0.682.
5.14 (b) 0.815.
5.17 Degenerate at fL2
+ (U2/U1)(X - fL1)'
5.18 (b) n(O, 1).
5.19 (b) n(O, 1).
5.21 0.954.
y g(y)
9 3
4
6
10 3
3
6
11 }6
12 3~
4.69 0.24.
4.77 0.818.
4.80 (b) - 1 or 1.
(c) Z, = ujYj + fLj.
n
4.81 L ajbj = O.
1
4.83 6.41.
4.84 n = 16.
4.86 (n - 1)u2/n;
2(n - 1)u4/n2
•
4.87 0.90.
4.89 0.78.
4.90 !; t.
4.91 7.
4.93 2.5; 0.25.
4.95 -5; 60 - 12V6.
4.96 Ul/VU~ + u~.
4.99 22.5, 1-*l.
4.100 r2 > 4.
-=--=--~c----=--~
4.102 fL2Ul/Vu~u~ + fL~U~ + fL~u~,
4.105 5/V39.
4.109 e/l+<J
2
/
2 ; e2/l+<J2(eI12
- 1).
CHAPTER 5
11.7
11.12
11.24
11.27
co(n) (0.05n - In 8)/
In 3.5;
cl(n) (0.05n + In 4.5)/
In 3.5.
(9y - 20x)/30 ~ c.
2.17; 2.44.
2.20.
CHAPTER 12
12.3 (a) exp {(tlC + t2d)'fL +
[(tlC + t2d)'V
x (tlc + t2d)J/2}.
12.12 a, = 0, i = 1, 2, 3, 4.
"
12.13 2: ali = 0, i = 1,2, ... , n.
i=l
Index
Analysis of variance, 291
Andrews, D. F., 404
Approximate distribution (s) , chi-square,
269
normal for binomial, 195, 198
normal for chi-square, 190
normal for Poisson, 191
Poisson for binomial, 190
of X, 194
Arc sine transformation, 217
Bayes' formula, 65,228
Bayesian methods, 227, 229, 385
Bernstein, 87
Beta distribution, 139, 149,310
Binary statistic, 319
Binomial distribution, 90, 132, 190, 195,
198, 305
Bivariate normal distribution, 117, 170
Boole's inequality, 384
Box-Muller transformation, 141
Cauchy distribution, 142
Central limit theorem, 192
Change of variable, 128, 132, 147
Characteristic function, 54
Characterization, 163, 172
Chebyshev's inequality, 58, 93, 188
Chi-square distribution, 107, 114, 169,
191,271,279,413
Chi-square test, 269, 312, 320
Classification, 385
Cochran's theorem, 419
Complete sufficient statistics, 355
Completeness, 353, 358, 367, 390
Compounding, 234
Conditional expectation, 69, 349
Conditional probability, 61, 68, 343
Conditional p.d.f., 67, 71, 118
Confidence coefficient, 213
Confidence interval, 212, 219, 222
for difference of means, 219
for means, 212, 214
for p, 215, 221
for quantiles, 304
435
for ratio of variances, 225
for regression parameters, 298
for variances, 222
Contingency tables, 275
Contrasts, 384
Convergence, 186, 196,204
in probability, 188
with probability one, 188
stochastic, 186, 196, 204
Convolution formula, 143
Correlation coefficient, 73, 300
Covariance, 73, 179, 408
Covariance matrix, 409
Coverage, 309
Cramer, 189
Critical region, 236, 239, 242
best, 243, 245
size, 239, 241
uniformly most powerful, 252
Curtiss, J. H., 189
Decision function, 208, 228, 341, 386
Degenerate distribution, 56, 183
Degrees of freedom, 107, 264, 273, 279,
289
Descending M-estimators, 404
Distribution, beta, 139, 149, 310
binomial, 90, 132, 190, 195, 198, 305
bivariate normal, 117, 170,386
Cauchy, 142
chi-square, 107, 114, 169, 191, 271,
279,413
conditional, 65, 71
continuous type, 24, 26
of coverages, 310
degenerate, 56, 183
Dirichlet, 149,310
discrete type, 23, 26
double exponential, 140
exponential, 105, 163
exponential class, 357, 366
of F, 146, 282
function, 31, 36, 125
of functions of random variables, 122
of F(X), 126
436
Distribution, beta (cant.)
gamma, 104
geometric, 94
hypergeometric, 42
limiting, 181, 193, 197, 270, 317
of linear functions, 168, 171, 176, 409
logistic, 142
lognormal, 180
marginal, 66
multinomial, 96, 270, 332
multivariate normal, 269, 405
negative binomial, 94
of noncentral chi-square, 289, 413
of noncentral F, 290, 295
of noncentral T, 264
normal, 109, 168, 193
of nS²/σ², 175
of order statistics, 154
Pareto, 207
Poisson, 99, 131, 190
posterior, 228
prior, 228
of quadratic forms, 278, 410
of R, 302
of runs, 323
of sample, 125
of T, 144, 176, 260, 298, 302
trinomial, 95
truncated, 116
uniform, 39, 126
Weibull, 109
of X̄, 173, 194
Distribution-free methods, 304, 312, 397
Double exponential distribution, 140
Efficiency, 372
Estimation, 200, 208, 227, 341
Bayesian, 227
interval, see also Confidence intervals,
212
maximum likelihood, 202, 347, 401
minimax, 210
point, 200, 208, 370
robust, 400
Estimator, 202
best, 341, 355, 363
consistent, 204
efficient, 372
maximum likelihood, 202, 347, 401
minimum chi-square, 273
minimum mean-square-error, 210
unbiased, 204, 341
unbiased minimum variance, 208, 341,
355, 361
Events, 2, 16
exhaustive, 15
mutually exclusive, 14, 17
Expectation (expected value), 44, 83,
176
of a product, 47, 83
Exponential class, 357, 366
Exponential distribution, 105, 163
F distribution, 146, 282
Factorization theorem, 344, 358, 364
Family of distributions, 201, 354
Fisher, R. A., 388
Frequency, 2, 271
relative, 2, 12, 93
Function, characteristic, 54, 192
decision, 208, 228, 341, 386
distribution, 31, 36, 125
exponential probability density, 105,
163
gamma, 104
likelihood, 202, 260
loss, 209, 341
moment-generating, 50, 77, 84, 164
of parameter, 361
point, 8
power, 236, 239, 252
probability density, 25, 26, 31
probability distribution, 31, 34, 36
probability set, 12, 17, 34
of random variables, 35, 44, 122
risk, 209, 229, 341
set, 8
Geometric mean, 360
Gini's mean difference, 163
Huber, P., 402
Hypothesis, see Statistical hypothesis,
Test of a statistical hypothesis
Independence, see Stochastic
independence
Inequality, Boole, 384
Chebyshev, 58, 93, 188
Rao-Blackwell, 349
Rao-Cramér, 372
Interaction, 295
Interval
confidence, 212
prediction, 218
random, 212
tolerance, 309
Jacobian, 134, 135, 147, 151
Joint conditional distribution, 71
Joint distribution function, 65
Joint probability density function, 65
Kurtosis, 57, 98, 103, 109, 116, 399
Law of large numbers, 93, 179, 188
Least squares, 400
Lehmann alternative, 334
Lehmann-Scheffé, 355
Lévy, P., 189
Liapounov, 317
Likelihood function, 202, 205, 260
Likelihood ratio tests, 257, 284
Limiting distribution, 181, 193, 197, 317
Limiting moment-generating function,
188
Linear discriminant function, 388
Linear functions, covariance, 179
mean, 176, 409
moment-generating function, 171, 409
variance, 177, 409
Linear rank statistic, 334
Logistic distribution, 142
Lognormal distribution, 180
Loss function, 209, 341
Mann-Whitney-Wilcoxon, 326, 334
Marginal p.d.f., 66
Maximum likelihood, 202, 347
estimator, 202, 205, 347, 401
method of, 202
Mean, 49, 124
conditional, 69, 75, 118, 349
of linear function, 176
of a sample, 124
of X, 49
of X̄, 178
Median, 30, 38, 161
M-estimators, 402
Method
of least absolute values, 400
of least squares, 400
of maximum likelihood, 202
of moments, 206
Midrange, 161
Minimax, criterion, 210
decision function, 210
Minimum chi-square estimates, 273
Mode, 30, 98
Moment-generating function, 50, 77, 84,
189
of binomial distribution, 91
of bivariate normal distribution, 119
of chi-square distribution, 107
of gamma distribution, 105
of multinomial distribution, 97
of multivariate normal distribution,
408
of noncentral chi-square distribution,
289
of normal distribution, 111
of Poisson distribution, 101
of trinomial distribution, 95
of X̄, 171
Moments, 52, 206
factorial, 56
method of, 206
Multinomial distribution, 96, 270, 332
Multiple comparisons, 380
Multiplication rule, 63, 64
Multivariate normal distribution, 269,
405
Neyman factorization theorem, 344, 358,
364
Neyman-Pearson theorem, 244, 267, 385
Noncentral chi-square, 289, 413
Noncentral F, 290, 295
Noncentral parameter, 264, 289, 413
Noncentral T, 264
Nonparametric, 304, 312, 397
Normal distribution, 109, 168, 193
Normal scores, 319, 337, 398
Order statistics, 154, 304, 308, 369
distribution, 155
functions of, 161, 308
Parameter, 91, 201
function of, 361
Parameter space, 201, 260
Pareto distribution, 207
Percentile, 30, 311
PERT, 163, 171
Personal probability, 3, 228
Poisson distribution, 99, 131, 190
Poisson process, 99, 104
Power, see also Function, Test of a statis-
tical hypothesis, 236, 239
Prediction interval, 218
Probability, 2, 12, 34, 40
conditional, 61, 68, 343
induced, 17
measure, 2, 12
models, 38
posterior, 228, 233
subjective, 3, 228
Probability density function, 25, 26, 31
conditional, 67
exponential class, 357, 366
posterior, 228
prior, 228
Probability set function, 12, 17, 34
p-values, 255
Quadrant test, 400
Quadratic forms, 278
distribution, 278, 410, 414
independence, 279, 414
Quantiles, 30, 304
confidence intervals for, 305
Random experiment, 1, 12, 38
Random interval, 212
Random sample, 124, 170, 368
Random sampling distribution theory,
125
Random variable, 16, 23, 35
continuous-type, 24, 26
discrete-type, 23, 26
mixture of types, 35
space of, 16, 19, 20, 27
Random walk, 380
Randomized test, 255
Range, 161
Rao-Blackwell theorem, 349
Rao-Cramér inequality, 372
Regression, 296
Relative frequency, 2, 12, 93
Risk function, 209, 229, 341
Robust methods, 398, 402
Sample, correlation coefficient, 300
mean of, 124
median of, 161
random, 124, 170, 368
space, 1, 12, 61, 200
variance, 124
Scheffé, H., 382
Sequential probability ratio test, 374
Set, 4
complement of, 7
of discrete points, 23
element of, 4
function, 8, 12
null, 5, 13
probability measure, 12
subset, 5, 13
Sets, algebra, 4
intersection, 5
union, 5
Significance level of test, 239, 241
Simulation, 127, 140
Skewness, 56, 98, 103, 109, 116
Space, 6, 16, 23, 24
parameter, 201, 260
product, 80
of random variables, 16, 19, 20, 23, 24
sample, 1, 12, 61, 200
Spearman rank correlation, 338, 400
Standard deviation, 49, 124
Statistic, see also Sufficient statistic(s),
122
Statistical hypothesis, 235, 238
alternative, 235
composite, 239, 252, 257
simple, 239, 245, 252
test of, 236, 239
Statistical inference, 201, 235
Stochastic convergence, 186, 196, 204
Stochastic dependence, 80
Stochastic independence, 80, 120, 132,
140, 275, 300, 390, 414
mutual, 85
of linear forms, 172
pairwise, 87, 121
of quadratic forms, 279, 414
test of, 275, 300
of X̄ and S², 175, 391
Sufficient statistic(s), 343, 364, 390
joint, 364
T distribution, 144, 176, 260, 264, 298,
302
Technique, change of variable, 128, 132,
147
distribution function, 125
moment-generating function, 164
Test of a statistical hypothesis, 236, 239
best, 243, 252
chi-square, 269, 312, 320
critical region of, 239
of equality of distributions, 320
of equality of means, 261, 283, 291
of equality of variances, 266
likelihood ratio, 257, 260, 284
median, 321
nonparametric, 304
power of, 236, 239, 252
of randomness, 325
run, 322
sequential probability ratio, 374
sign, 312
significance level, 239, 241
of stochastic independence, 275, 300
uniformly most powerful, 251
Tolerance interval, 307
Training sample, 388
Transformation, 129, 147
of continuous-type variables, 132, 147
of discrete-type variables, 128
not one-to-one, 149
one-to-one, 129, 132
Truncated distribution, 116
Types I and II errors, 241
Uniqueness, of best estimator, 355
of characteristic function, 55
of moment-generating function, 50
Variance, analysis of, 291
conditional, 69, 76, 118, 349
of a distribution, 49
of a linear function, 177
of a sample, 124
of X, 49
of X̄, 178
Venn diagram, 6
Weibull distribution, 109
Wilcoxon, 314, 326, 334

hogg_craig_-_introduction_to_mathematical_statistics_4th_edition1.pdf

  • 1. Robert V. Hogg Allen T. Craig THE UNIVERSITY OF IOWA Introduction to Mathematical Statistics Fourth Edition Macmillan Publishing Co., Inc. NEW YORK Collier Macmillan Publishers LONDON
  • 2. Preface Copyright © 1978, Macmillan Publishing Co., Inc. Printed in the United States of America Earlier editions © 1958 and 1959 and copyright © 1965 and 1970 by Macmillan Publishing Co., Inc. Macmillan Publishing Co., Inc. 866 Third Avenue, New York, New York 10022 Collier Macmillan Canada, Ltd. Library of Congress Cataloging in Publication Data Hogg, Robert V Introduction to mathematical statistics. We are much indebted to our colleagues throughout the country who have so generously provided us with suggestions on both the order of presentation and the kind of material to be included in this edition of Introduction to Mathematical Statistics. We believe that you will find the book much more adaptable for classroom use than the previous edition. Again, essentially all the distribution theory that is needed is found in the first five chapters. Estimation and tests of statistical hypotheses, including nonparameteric methods, follow in Chapters 6, 7, 8, and 9, respectively. However, sufficient statistics can be introduced earlier by considering Chapter 10 immediately after Chapter 6 on estimation. Many of the topics of Chapter 11 are such that they may also be introduced sooner: the Rae-Cramer inequality (11.1) and robust estimation (11.7) after measures of the quality of estimators (6.2), sequential analysis (11.2) after best tests (7.2), multiple com- parisons (11.3) after the analysis of variance (8.5), and classification (11.4) after material on the sample correlation coefficient (8.7). With this flexibility the first eight chapters can easily be covered in courses of either six semester hours or eight quarter hours, supplementing with the various topics from Chapters 9 through 11 as the teacher chooses and as the time permits. In a longer course, we hope many teachers and students will be interested in the topics of stochastic independence (11.5), robustness (11.6 and 11.7), multivariate normal distributions (12.1), and quadratic forms (12.2 and 12.3). We are obligated to Catherine M. Thompson and Maxine Merrington and to Professor E. S. Pearson for permission to include Tables II and V, which are abridgments and adaptations of tables published in Biometrika. We wish to thank Oliver & Boyd Ltd., Edinburgh, for permission to include Table IV, which is an abridgment and adaptation v 56789 131415 YEAR PRINTING Bibliography: p. Includes index. 1. Mathematical statistics. I. Craig, Allen Thornton, (date) joint author. II. Title. QA276.H59 1978 519 77-2884 ISBN 0-02-355710-9 (Hardbound) ISBN 0-02-978990-7 (International Edition) All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the Publisher. T~RN
  • 3. vi Preface of Table III from the book Statistical Tables for Biological, Agricultural, and Medical Research by the late Professor Sir Ronald A. Fisher, Cambridge, and Dr. Frank Yates, Rothamsted. Finally, we wish to thank Mrs. Karen Horner for her first-class help in the preparation of the manuscript. R. V. H. A. T. C. Contents Chapter 1 Distributions of Random Variables 1 1.1 Introduction 1 1.2 Algebra of Sets 4 1.3 Set Functions 8 1.4 The Probability Set Function 12 1.5 Random Variables 16 1.6 The Probability Density Function 23 1.7 The Distribution Function 31 1.8 Certain Probability Models 38 1.9 Mathematical Expectation 44 1.10 Some Special Mathematical Expectations 48 1.11 Chebyshev's Inequality 58 Chapter 2 Conditional Probability and Stochastic Independence 61 2.1 Conditional Probability 61 2.2 Marginal and Conditional Distributions 65 2.3 The Correlation Coefficient 73 2.4 Stochastic Independence 80 Chapter 3 Some Special Distributions 3.1 The Binomial, Trinomial, and Multinomial Distributions 90 3.2 The Poisson Distribution 99 3.3 The Gamma and Chi-Square Distributions 103 3.4 The Normal Distribution 109 3.5 The Bivariate Normal Distribution 117 vii 90
  • 4. viii Contents Contents ix Chapter 4 Distributions of Functions of Random 'Variables 4.1 Sampling Theory 122 4.2 Transformations of Variables of the Discrete Type 128 4.3 Transformations of Variables of the Continuous Type 132 4.4 The t and F Distributions 143 4.5 Extensions of the Change-of-Variable Technique 147 4.6 Distributions of Order Statistics 154 4.7 The Moment-Generating-Function Technique 164 4.8 The Distributions of X and nS2 ja2 172 4.9 Expectations of Functions of Random Variables 176 Chapter 5 Limiting Distributions 5.1 Limiting Distributions 181 5.2 Stochastic Convergence 186 5.3 Limiting Moment-Generating Functions 188 5.4 The Central Limit Theorem 192 5.5 Some Theorems on Limiting Distributions 196 122 181 Chapter 8 Other Statistical Tests 8.1 Chi-Square Tests 269 8.2 The Distributions of Certain Quadratic Forms 278 8.3 A Test of the Equality of Several Means 283 8.4 Noncentral X2 and Noncentral F 288 8.5 The Analysis of Variance 291 8.6 A Regression Problem 296 8.7 A Test of Stochastic Independence 300 Chapter 9 Nonparametric Methods 9.1 Confidence Intervals for Distribution Quantiles 304 9.2 Tolerance Limits for Distributions 307 9.3 The Sign Test 312 9.4 A Test of Wilcoxon 314 9.5 The Equality of Two Distributions 320 9.6 The Mann-Whitney-Wilcoxon Test 326 9.7 Distributions Under Alternative Hypotheses 331 9.8 Linear Rank Statistics 334 269 304 Chapter 6 Estimation 6.1 Point Estimation 200 6.2 Measures of Quality of Estimators 207 6.3 Confidence Intervals for Means 212 6.4 Confidence Intervals for Differences of Means 219 6.5 Confidence Intervals for Variances 222 6.6 Bayesian Estimates 227 200 Chapter 10 Sufficient Statistics 1v.1 A Sufficient Statistic for a Parameter 341 10.2 The Rao-Blackwell Theorem 349 10.3 Completeness and Uniqueness 353 10.4 The Exponential Class of Probability Density Functions 357 10.5 Functions of a Parameter 361 10.6 The Case of Several Parameters 364 341 Chapter 7 Statistical Hypotheses 7.1 Some Examples and Definitions 235 7.2 Certain Best Tests 242 7.3 Uniformly Most Powerful Tests 251 7.4 Likelihood Ratio Tests 257 235 Chapter 11 Further Topics in Statistical Inference 11.1 The Rae-Cramer Inequality 370 11.2 The Sequential Probability Ratio Test 374 11.3 Multiple Comparisons 380 11.4 Classification 385 370
  • 5. x Contents 11.5 Sufficiency, Completeness, and Stochastic Independence 389 11.6 Robust Nonparametric Methods 396 11.7 Robust Estimation 400 Chapter 12 Further Normal Distribution Theory 12.1 The Multivariate Normal Distribution 405 12.2 The Distributions of Certain Quadratic Forms 12.3 The Independence of Certain Quadratic Forms Appendix A References Appendix B Tables Appendix C Answers to Selected Exercises Index 410 414 405 421 423 429 435 Chapter I Distributions of Random Variables 1.1 Introduction Many kinds of investigations may be characterized in part by the fact that repeated experimentation, under essentially the same con- ditions, is more or less standard procedure. For instance, in medical research, interest may center on the effect of a drug that is to be administered; or an economist may be concerned with the prices of three specified commodities at various time intervals; or the agronomist may wish to study the effect that a chemical fertilizer has on the yield of a cereal grain. The only way in which an investigator can elicit information about any such phenomenon is to perform his experiment. Each experiment terminates with an outcome. But it is characteristic of these experiments that the outcome cannot be predicted with certainty prior to the performance of the experiment. Suppose that we have such an experiment, the outcome of which cannot be predicted with certainty, but the experiment is of such a nature that the collection of every possible outcome can be described prior to its performance. If this kind of experiment can be repeated under the same conditions, it is called a random experiment, and the collection of every possible outcome is called the experimental space or the sample space. Example 1. In the toss of a coin, let the outcome tails be denoted by T and let the outcome heads be denoted by H. If we assume that the coin may be repeatedly tossed under the same conditions, then the toss of this coin is an example of a random experiment in which the outcome is one of 1
  • 6. 2 Distributions oj Random Variables [eh.l Sec. 1.1] Introduction 3 the two symbols T and H; that is, the sample space is the collection of these two symbols. Example 2. In the cast of one red die and one white die, let the outcome be the ordered pair (number of spots up on the red die, number of spots up on the white die). If we assume that these two dice may be repeatedly cast under the same conditions, then the cast of this pair of dice is a random experiment and the sample space consists of the 36 order pairs (1, 1), .. " (1, 6), (2, 1), ... , (2, 6), ... , (6,6). Let ce denote a sample space, and let C represent a part of ce. If, upon the performance of the experiment, the outcome is in C, we shall say that the event C has occurred. Now conceive of our having made N repeated performances of the random experiment. Then we can count the numberf of times (the frequency) that the event C actually occurred throughout the N performances. The ratio fiN is called the relative frequency of the event C in these N experiments. A relative frequency is usually quite erratic for small values of N, as you can discover by tossing a coin. But as N increases, experience indicates that relative frequencies tend to stabilize. This suggests that we associate with the event C a number, say p, that is equal or approximately equal to that number about which the relative frequency seems to stabilize. If we do this, then the number p can be interpreted as that number which, in future performances of the experiment, the relative frequency of the event C will either equal or approximate. Thus, although we cannot predict the outcome of a random experiment, we can, for a large value of N, predict approximately the relative frequency with which the outcome will be in C. The number p associated with the event C is given various names. Sometimes it is called the probability that the outcome of the random experiment is in C; sometimes it is called the probability of the event C; and sometimes it is called the probability measure of C. The context usually suggests an appropriate choice of terminology. Example 3. Let ~ denote the sample space of Example 2 and let C be the collection of every ordered pair of ~ for which the sum of the pair is equal to seven. Thus C is the collection (1, 6), (2,5), (3,4), (4,3), (5,2), and (6,1). Suppose that the dice are cast N = 400 times and let f, the frequency of a sum of seven, bef = 60. Then the relative frequency with which the outcome was in C is [lN = lrP-o = 0.15. Thus we might associate with C a number p that is close to 0.15, and p would be called the probability of the event C. Remark. The preceding interpretation of probability is sometimes re- ferred to as the relative frequency approach, and it obviously depends upon the fact that an experiment can be repeated under essentially identical con- ditions. However, many persons extend probability to other situations by treating it as rational measure of belief. For example, the statement p = i would mean to them that their personal or subjective probability of the event C is equal to i. Hence, if they are not opposed to gambling, this could be interpreted as a willingness on their part to bet on the outcome of C so that the two possible payoffs are in the ratio PI(1 - P) = Ht = t. Moreover, if they truly believe that p = i is correct, they would be willing to accept either side of the bet: (a) win 3 units if C occurs and lose 2 if it does not occur, or (b) win 2 units if C does not occur and lose 3 if it does. 
However, since the mathematical properties of probability given in Section 1.4 are consistent with either of these interpretations, the subsequent mathematical develop- ment does not depend upon which approach is used. The primary purpose of having a mathematical theory of statistics is to provide mathematical models for random experiments. Once a model for such an experiment has been provided and the theory worked out in detail, the statistician may, within this framework, make inferences (that is, draw conclusions) about the random experiment. The construction of such a model requires a theory of probability. One of the more logically satisfying theories of probability is that based on the concepts of sets and functions of sets. These concepts are introduced in Sections 1.2 and 1.3. EXERCISES 1.1. In each of the following random experiments, describe the sample space s. Use any experience that you may have had (or use your intuition) to assign a value to the probability p of the event C in each of the following instances: (a) The toss of an unbiased coin where the event C is tails. (b) The cast of an honest die where the event C is a five or a six. (c) The draw of a card from an ordinary deck of playing cards where the event C occurs if the card is a spade. (d) The choice of a number on the interval zero to 1 where the event C occurs if the number is less than t. (e) The choice of a point from the interior of a square with opposite vertices (-1, -1) and (1, 1) where the event C occurs if the sum of the coordinates of the point is less than 1-. 1.2. A point is to be chosen in a haphazard fashion from the interior of a fixed circle. Assign a probability p that the point will be inside another circle, which has a radius of one-half the first circle and which lies entirely within the first circle. 1.3. An unbiased coin is to be tossed twice. Assign a probability h to the event that the first toss will be a head and that the second toss will be a
  • 7. 4 Distributions of Random Variables [Ch.l Sec. 1.2] Algebra of Sets 5 tail. Assign a probability P2 to the event that there will be one head and one tail in the two tosses. 1.2 Algebra of Sets The concept of a set or a collection of objects is usually left undefined. However, a particular set can be described so that there is no misunder- standing as to what collection of objects is under consideration. For example, the set of the first 10 positive integers is sufficiently well described to make clear that the numbers -i and 14 are not in the set, while the number 3 is in the set. If an object belongs to a set, it is said to be an element of the set. For example, if A denotes the set of real numbers x for which 0 ;::; x ;::; 1, then i is an element of the set A. The fact that i is an element of the set A is indicated by writing i EA. More generally, a E A means that a is an element of the set A. The sets that concern us will frequently be sets ofnumbers. However, the language of sets of points proves somewhat more convenient than that of sets of numbers. Accordingly, we briefly indicate how we use this terminology. In analytic geometry considerable emphasis is placed on the fact that to each point on a line (on which an origin and a unit point have been selected) there corresponds one and only one number, say x; and that to each number x there corresponds one and only one point on the line. This one-to-one correspondence between the numbers and points on a line enables us to speak, without misunderstanding, of the" point x" instead of the" number z." Furthermore, with a plane rectangular coordinate system and with x and y numbers, to each symbol (x, y) there corresponds one and only one point in the plane; and to each point in the plane there corresponds but one such symbol. Here again, we may speak of the" point (x, y)," meaning the" ordered number pair x and y." This convenient language can be used when we have a rectangular coordinate system in a space of three or more dimensions. Thus the" point (Xl' x2 , •• " xn) " means the numbers Xl' X 2, ••• , X n in the order stated. Accordingly, in describing our sets, we frequently speak of a set of points (a set whose elements are points), being careful, of course, to describe the set so as to avoid any ambiguity. The nota- tion A = {x; 0 ;::; x ;::; I} is read "A is the one-dimensional set of points x for which 0 ;::; x s 1." Similarly, A = {(x, y); 0 s x s 1, o s y s I} can be read" A is the two-dimensional set of points (x, y) that are interior to, or on the boundary of, a square with opposite vertices at (0,0) and (1, 1)." We now give some definitions (together with illustrative examples) that lead to an elementary algebra of sets adequate for our purposes. Definition 1. If each element of a set A I is also an element of set A 2, the set A I is called a subset of the set A 2 . This is indicated by writing Al c A 2 · If Al C A 2 and also A 2 c A v the two sets have the same elements, and this is indicated by writing Al = A 2 • Example 1. Let Al = {x; 0 ;::; x ;::; I} and A2 = {x; -1 ;::; x ;::; 2}. Here the one-dimensional set Al is seen to be a subset of the one-dimensional set A2 ; that is, Al C A 2 . Subsequently, when the dimensionality of the set is clear, we shall not make specific reference to it. Example 2. LetAI = {(x,y);O;::; x = y;::; Ij and zl, = {(x,y);O;::; x;::; 1, o;::; y ;::; I}. Since the elements of Al are the points on one diagonal of the square, then Al C A 2 • Definition 2. 
If a set A has no elements, A is called the null set. This is indicated by writing A = 0. Definition 3. The set of all elements that belong to at least one of the sets A I and A 2 is called the union of A I and A 2' The union of Al and A 2 is indicated by writing Al U A 2 . The union of several sets A v A 2 , As, ... is the set of all elements that belong to at least one of the several sets. This union is denoted by A I U A 2 U As u . .. or by Al U A 2 U ... U A k if a finite number k of sets is involved. Example 3. Let Al = {x; X = 0, 1, ... , lO}and A 2 = {x, X = 8,9, lO, 11, or 11 < x ;::; 12}. Then Al U A 2 = {x; X = 0, 1, ... , 8, 9, lO, 11, or 11 < x;::; 12} = {x; x = 0, 1, ... ,8,9, lO, or 11 ;::; x ;::; 12}. Example 4. Let Al and A 2 be defined as in Example 1. Then Al U A 2 = A 2 • Example 5. Let A 2 = 0. Then Al U A 2 = Al for every set AI' Example 6. For every set A, A u A = A. Example 7. Let A k = {x; 1/(k + 1) ;::; x ;::; I}, k = 1, 2, 3, . . .. Then Al U A 2 U As u ... = {x, 0 < x ;::; I}. Note that the number zero is not in this set, since it is not in one of the sets AI> A 2 , As,. '" Definition 4. The set of all elements that belong to each of the sets A I and A 2 is called the intersection of A I and A 2' The intersection of A I and A 2 is indicated by writing Al n A 2 . The intersection of several sets AI' A 2 , As, ... is the set of all elements that belong to each of the sets A v A 2 , As, .. " This intersection is denoted by Al n A 2 n As n ... or by Al n A 2 n ... n A k if a finite number k of sets is involved. Example 8. Let Al = {(x, y); (x, y) = (0, 0), (0, 1), (1, I)} and A 2 = {(x, y); (x, y) = (1,1), (1,2), (2, I)}. Then Al n A 2 = {(x, y); (x, y) = (1, I)}.
  • 8. 6 Distributions oj Random Variables [Clr. l Sec. 1.2] Algebra oj Sets 7 FIGURE U Example 14. Let the number of heads, in tossing a coin four times, be EXERCISES denoted by x. Of necessity, the number of heads will be one of the numbers 0, 1, 2, 3, 4. Here, then, the space is the set d = {x; x = 0, 1,2, 3, 4}. Example 15. Consider all nondegenerate rectangles of base x and height y. To be meaningful, both x and y must be positive. Thus the space is the set .xl = {(x, y); x > 0, Y > O}. Definition 6. Let .91 denote a space and let A be a subset of the set d. The set that consists of all elements of .91 that are not elements of A is called the complement of A (actually, with respect to d). The complement of A is denoted by A *. In particular, .91* = 0. Example 16. Let d be defined as in Example 14, and let the set A = {x; x = 0, I}.The complement of A (with respect to d) is A * = {x; x = 2,3, 4}. Esample l'[, Given A c dThenA u A* = d,A nA* = 0 ,A ud = d, And = A, and (A*)* = A. 1.4. Find the union A l U A 2 and the intersection A l n A 2 of the two sets A l and A 2 , where: (a) A l = {x; X = 0, 1, 2}, A 2 = {x; X = 2, 3, 4}. (b) A l = {x; 0 < x < 2}, A 2 = {x; 1 :::::; x < 3}. (c) A l = {(x,y);O < x < 2,0 < y < 2},A2 = {(x,y); 1 < x < 3,1 < y < 3}. 1.5. Find the complement A *of the set A with respect to the space d if: (a) d = {x; 0 < x < I}, A = {x; i :::::; x < I}. (b) d = {(x, y, z); x2 + y2 + Z2 :::::; I}, A = {(x, y, z); x2 + y2 + Z2 = I}. (c) d = {(x, y); Ixl + Iyl :::::; 2}, A = {(x, y); x2 + y2 < 2}. 1.6. List all possible arrangements of the four letters m, a, r, and y. Let A l be the collection of the arrangements in which y is in the last position. Let A 2 be the collection of the arrangements in which m is in the first position. Find the union and the intersection of A l and A 2 • 1.7. By use of Venn diagrams, in which the space d is the set of points enclosed by a rectangle containing the circles, compare the following sets: (a) A l n (A2 u As) and (Al n A 2 ) u (A l n As). (b) A l u (A2 n As) and (A l U A 2 ) n (A l U As). (c) (Al U A 2)* and AT n A~. (d) (Al n A 2)* and AT U A~. 1.8. If a sequence of sets A v A 2 , As, ... is such that A k c A k +V k = 1,2, 3, ... , the sequence is said to be a nondecreasing sequence. Give an example of this kind of sequence of sets. 1.9. If a sequence of sets A l , A 2 , As, ... is such that A k ::::> A k +V k = 1, 2, 3, ... , the sequence is said to be a nonincreasing sequence. Give an example of this kind of sequence of sets. FIGURE 1.1 A, uAz Example9. Let zl, = {(x,y); 0:::::; x + y:::::; lj and zl, = {(x,y); 1 < x + y}. Then A l and A 2 have no points in common and A l n A 2 = 0. Example 10. For every set A, A n A = A and A n 0 = 0. Example 11. Let A k = {x; 0 < x < 11k}, k = 1,2, 3, .... Then A l n A2 n As ... is the null set, since there is no point that belongs to each of the sets A v A 2 , As, .... Example 12. Let A l and A 2 represent the sets of points enclosed, respectively, by two intersecting circles. Then the sets A l U A 2 and A l n A 2 are represented, respectively, by the shaded regions in the Venn diagrams in Figure 1.1. Example 13. Let A v A 2 , and As represent the sets of points enclosed, respectively, by three intersecting circles. Then the sets (A l U A 2 ) n As and (Al n A 2 ) u As are depicted in Figure 1.2. Definition 5. In certain discussions or considerations the totality of all elements that pertain to the discussion can be described. This set of all elements under consideration is given a special name. 
It is called the space. We shall often denote spaces by capital script letters such as d, flB, and C(j'.
  • 9. 8 Distributions oj Random Variables [Ch.l Sec. 1.3] Set Functions 9 1.10. If AI> A 2 , A 3 , ••• are sets such that Ale c AIe+I> k = 1,2,3, ... , lim Ale is defined as the union A l U A2 U A 3 U .. '. Find lim Ale if: k-+co k-+oo (a) Ale = {x; 11k s x s 3 - 11k}, k = 1, 2, 3, ... ; (b) Ale = {(x, y); 11k S x2 + y2 S 4 - 11k}, k = 1,2,3, .... 1.11. If AI> A 2 , A 3 , ••• are sets such that Ale::> AIe+ l , k = 1,2,3, ... , lim Ale is defined as the intersection A l n A 2 n A3 n· . '. Find lim Ale if: k-+oo k-+oo (a) Ale = {x; 2 - 11k < x s 2}, k = 1, 2, 3, . (b) Ale = {x; 2 < x s 2 + 11k}, k = 1, 2, 3, . (c) Ale = {(x, y); 0 S x2 + y2 sIlk}, k = 1,2,3, .... Example 2. Let A be a set in two-dimensional space and let Q(A) be the area of A, if A has a finite area; otherwise, let Q(A) be undefined. Thus, if A = {(x, y); x2 + y2 S I}, then Q(A) = 7T; if A = {(x, y); (x, y) = (0,0), (1,1), (0, I)}, then Q(A) = 0; if A = {(x, y); 0 s x,O S y, x + Y s I}, then Q(A) = t. Example 3. Let A be a set in three-dimensional space and let Q(A) be the volume of A, if A has a finite volume; otherwise, let Q(A) be undefined. Thus, if A = {(x, y, z); 0 S x s 2,0 s y s 1,0 s z s 3}, then Q(A) = 6; if A = {(x, y, z); x2 + y2 + Z2 ~ I}, then Q(A) is undefined. At this point we introduce the following notations. The symbol 1.3 Set Functions In the calculus, functions such as f(x) = 2x, -00 < x < 00, Lf(x) dx will mean the ordinary (Riemann) integral of f(x) over a prescribed one-dimensional set A; the symbol = 0 elsewhere, = 0 elsewhere, or or possibly g(x, y) = e-X - Y , o< x < 00, 0 < y < 00, o S Xi S 1, i = 1,2, ... , n, LIg(x, y) dx dy will mean the Riemann integral of g(x, y) over a prescribed two- dimensional set A; and so on. To be sure, unless these sets A and these functions f(x) and g(x, y) are chosen with care, the integrals will frequently fail to exist. Similarly, the symbol 2:f(x) A = 0 elsewhere. will mean the sum extended over all x E A; the symbol 2: 2:g(x, y) A will mean the sum extended over all (x, y) E A; and so on. Example 4. Let A be a set in one-dimensional space and let Q(A) = "Lf(x), where A x = 0,1, x = 1,2,3, ... , = 0 elsewhere. f(x) = (1Y, f(x) = px{l - P)l-X, If A = {x; 0 S x s 3}, then Q(A) = 1- + m 2 + m 3 = l Example 5. Let Q(A) = "Lf(x), where A were of common occurrence. The value of f(x) at the" point x = 1" is f(l) = 2; the value of g(x, y) at the" point (-1, 3)" is g(- 1, 3) = 0; the value of h(x1 , x2 , ••• , xn) at the" point (1, 1, ... , 1)" is 3. Functions such as these are called functions of a point or, more simply, point functions because they are evaluated (if they have a value) at a point in a space of indicated dimension. There is no reason why, if they prove useful, we should not have functions that can be evaluated, not necessarily at a point, but for an entire set of points. Such functions are naturally called functions of a set or, more simply, set functions. We shall give some examples of set functions and evaluate them for certain simple sets. Example 1. Let A be a set in one-dimensional space and let Q(A) be equal to the number of points in A which correspond to positive integers. Then Q(A) is a function of the set A. Thus, if A = {x; 0 < x < 5}, then Q(A) = 4; if A = {x; x = -2, -I}, thenQ(A) = O;ifA = {x; -00 < x < 6}, then Q(A) = 5.
  • 10. = Q(AI ) + Q(A2); if A = Al U A 2,where Al = {x; 0 :$ x :S 2} and A 2 = {x; 1 :S x :S 3},then Q(A) = Q(AI U A 2) = f:e- Xdx = f:e- Xdx + f:e- Xdx - f:e- Xdx = Q(AI ) + Q(A2) - Q(AI n A 2). Example 7. Let A be a set in n-dimensional space and let Q(A) = r~. fdXI dX2 dxn• If A = {(Xl' x2,···, xn) ; 0 s Xl s X2 s :$ Xn :$ I}, then Q(A) = f: f:n . . . f:3 f:2 dXI dx2· .. dXn - 1 dx; 1 - -, where n! = n(n - 1) .. ·3' 2' 1. - n! X=o Q(A) = L PX(1 - P)l-X = 1 - P; x=o if A = {x; 1 :$ x :$ 2}, then Q(A) = f(l) = p. Example 6. Let A be a one-dimensional set and let Q(A) = Le-X dx. Thus, if A = {x; 0 :$ x < co}, then Q(A) = f: e- X dx = 1; if A = {x; 1 :$ x :$ 2}, then Q(A) = 1:e- X dx = e- l - e-2 ; if Al = {x; 0 :$ x :$ I} and A2 = {x; 1 < x :$ 3}, then Q(AI U A 2) = f:e- Xdx = f~ e- Xdx + 1:e- Xdx 1.14. For everyone-dimensional set A, let Q(A) be equal to the number of points in A that correspond to positive integers. If Al = {x; x a multiple of 3, less than or equal to 50} and A 2 = {x; x a multiple of 7, less than or equal to 50}, findQ(A I ) , Q(A2), Q(AI U A 2), andQ(A I n A 2). ShowthatQ(AI U A 2) = Q(AI ) + Q(A2) - Q(AI n A2). 11 Sec. 1.3] Set Functions a + ar + .. , + arn- l = a(1 - rn)j(1 - r) and lim Sn = aj(1 - r) provided n-e cc that Irl < 1. 1.13. For everyone-dimensional set A for which the integral exists, let Q(A) = fAf(x) da; where f(x) = 6x(1 - x), 0 < x < 1, zero elsewhere; otherwise, let Q(A) be undefined. If Al = {x; i- < x < i}, A 2 = {x; X = !}, and Aa = {x; 0 < x < 10}, find Q(AI ) , Q(A2), and Q(Aa). 1.18. Let d be the set of points interior to or on the boundary of a cube with edge 1. Moreover, say that the cube is in the first octant with one vertex at the point (0, 0, 0) and an opposite vertex is at the point (1, 1, 1). Let Q(A) = Iff dxdydz. (a) If A c dis the set{(x, y, z); 0 < x < y < z < I}, A compute Q(A). (b) If A is the subset {(x, y, z); 0 < x = Y = z < I}, compute Q(A). 1.15. For every two-dimensional set A, let Q(A) be equal to the number of points (x, y) in A for which both x and yare positive integers. Find Q(AI ) andQ(A2),where Al = {(x,y);x2 + y2:$ 4}andA2 = {(x,y);x2 + y2:s 9}. Note that Al C A 2 and that Q(AI ) :S Q(A2). 1.16. Let Q(A) = fA f (x2 + y2) dx dy for every two-dimensional set A for which the integral exists; otherwise, let Q(A) be undefined. If Al = {(x, y); -1 :S x :S 1, -1 :S Y :S I}, A 2 = {(x, y); -1 :S x = y:s I}, and Aa = {(x, y); x2 + y2 :S I}, findQ(A I ) , Q(A2), andQ(Aa). Hint. In evaluating Q(A2 ) , recall the definition of the double integral (or consider the volume under the surface z = x2 + y2 above the line segment - 1 :$ x = Y :S 1 in the xy-plane). Use polar coordinates in the calculation of Q(Aa). 1.17. Let d denote the set of points that are interior to or on the boundary of a square with opposite vertices at the point (0,0) and at the point (1, 1). LetQ(A) = Lfdydx. (a) If A c dis the set {(x, y); 0 < x < y < I}, compute Q(A). (b) If A c d is the set {(x, y); 0 < x = Y < I}, compute Q(A). (c) If A c d is the set {(x, y); 0 < xj2 :S y s 3xj2 < I}, compute Q(A). Distributions of Random Variables [Ch.l 10 If A = {x; x = O}, then EXERCISES 1.12. For everyone-dimensional set A, let Q(A) = Lf(x), wheref(x) = A (!)mX , x = 0, 1, 2, .. " zero elsewhere. If Al = {x; X = 0, 1,2, 3} and A 2 = {x; X = 0, 1,2, ...}, find Q(AI ) and Q(A2). Hint. Recall that Sn = 1.19. Let A denote the set {(x, y, z); x2 + y2 + Z2 :S I}. Evaluate Q(A) = JJf v'x2 + y2 + Z2 dx dy dz. Hint. Change variables to spherical coordinates. A 1.20. 
To join a certain club, a person must be either a statistician or a
  • 11. 12 Distributions of Random Variables [eh.l Sec. 1.4] The Probability Set Function 13 mathematician or both. Of the 2S members in this club, 19 are statisticians and 16are mathematicians. How many persons in the club are both a statisti- cian and a mathematician? 1.21. After a hard-fought football game, it was reported that, of the 11 starting players, 8 hurt a hip, 6 hurt an arm, S hurt a knee, 3 hurt both a hip and an arm, 2 hurt both a hip and a knee, 1 hurt both an arm and a knee, and no one hurt all three. Comment on the accuracy of the report. 1.4 The Probability Set Function Let <fl denote the set of every possible outcome of a random experi- ment; that is, <fl is the sample space. It is our purpose to define a set function P(C) such that if C is a subset of <fl, then P(C) is the probability that the outcome of the random experiment is an element of C. Hence- forth it will be tacitly assumed that the structure of each set C is sufficiently simple to allow the computation. We have already seen that advantages accrue if we take P(C) to be that number about which the relative frequencyfiN of the event C tends to stabilize after a long series of experiments. This important fact suggests some of the properties that we would surely want the set function P(C) to possess. For example, no relative frequency is ever negative; accordingly, we would want P(C) to be a nonnegative set function. Again, the relative frequency of the whole sample space <fl is always 1. Thus we would want P(<fl) = 1. Finally, if C1> C2 , Cs, ... are subsets of '?J' such that no two of these subsets have a point in common, the relative frequency of the union of these sets is the sum of the relative frequencies of the sets, and we would want the set function P(C) to reflect this additive property. We now formally define a probability set function. Definition 7. If P(C) is defined for a type of subset of the space <fl, and if (a) P(C) ~ 0, (b) P(C1 u C2 U Cs U ) = P(C1 ) + P(C2 ) + P(Cs) + .. " where the sets Ci , i = 1, 2, 3, , are such that no two have a point in common, (that is, where C, (' C, = 0, i # j), (c) P(<fl) = 1, then P(C) is called the probability set function of the outcome of the random experiment. For each subset C of '?J', the number P(C) is called the probability that the outcome of the random experiment is an element of the set C, or the probability of the event C, or the probability measure of the set C. A probability set function tells us how the probability is distributed over various subsets C of a sample space <fl. In this sense we speak of a distribution of probability. Remark. In the definition, the phrase" a type of subset of the space 'C" would be explained more fully in a more advanced course. Nevertheless, a few observations can be made about the collection of subsets that are of the type. From condition (c) of the definition, we see that the space 'C must be in the collection. Condition (b) implies that if the sets C1> C2 , C3 , •.• are in the collection, their union is also one of that type. Finally, we observe from the following theorems and their proofs that if the set C is in the collection, its complement must be one of those subsets. In particular, the null set, which is the complement of 'C, must be in the collection. The following theorems give us some other properties of a probability set function. In the statement of each of these theorems, P(C) is taken, tacitly, to be a probability set function defined for a certain type of subset of the sample space '?J'. Theorem 1. 
For each C c '?J', P(C) = 1 - P(C*). Proof. We have '?J' = C u C* and C (' C* = 0. Thus, from (c) and (b) of Definition 7, it follows that 1 = P(C) + P(C*), which is the desired result. Theorem 2. The probability of the null setis zero;that is, P(0) = O. Proof. In Theorem 1, take C = 0 so that C* = '?J'. Accordingly, we have P(0) = 1 - P('?J') = 1 - 1 = 0, and the theorem is proved. Theorem 3. If C1 and C2 are subsets of'?J' such that C1 c C2, then P(C1 ) s P(C2) · Proof. NowC2 = C1 u (ct (' C2) and C, {' (ct (' C2) = 0. Hence, from (b) of Definition 7, P(C2 ) = P(C1 ) + P(ct (' C2 ) . However, from (a) of Definition 7, P(Ct (' C2 ) ~ 0; accordingly, P(C2 ) ~ P(C1 ) · Theorem 4. For each C c '?J', 0 s P(C) ~ 1.
  • 12. Distributions of Random Variables (eh.l Proof. Since 0 c C c C{?, we have by Theorem 3 that with (b) of Definition 7. Moreover, if C{? = c1 U c2 U c3 U .. " the mutually exclusive events are further characterized as being exhaustive and the probability of their union is obviously equal to 1. 14 P(0) s P(C) s P(6') or os P(C) ~ 1, Sec. 1.4] The Probability Set Function 15 the desired result. Theorem 5. IfC1 and C2 are subsets ofC{?, then P(C 1 U C2 ) = P(C 1) + P(C 2) - P(C1 11 C2) . Proof. Each of the sets C1 U C2 and C2 can be represented, respec- tively, as a union of nonintersecting sets as follows: C1 U C2 = Cl U (ct 11 C2) and Thus, from (b) of Definition 7, P(C 1 U C2) = P(C 1) + P(ct ( C2 ) and P(C2 ) = P(C 1 11 C2 ) + P(ct 11 C2) . If the second of these equations is solved for P(ct 11 C2) and this result substituted in the first equation, we obtain P(C 1 U C2) = P(C 1) + P(C 2) - P(C 1 ( C2) . This completes the proof. Example 1. Let ~ denote the sample space of Example 2 of Section 1.1. Let the probability set function assign a probability of :1-6 to each of the 36 points in ~. If Cl = {c; C = (1, 1), (2, 1), (3, 1), (4, 1), (5, In and C2 = {c; C = (1,2), (2, 2), (3, 2n, then P(CI ) = -l(i, P(C2 ) = -l6' P(CI U C2) = 3' and P(CI n C2) = O. Example 2. Two coins are to be tossed and the outcome is the ordered pair (face on the first coin, face on the second coin). Thus the sample space may be represented as ~ = {c; c = (H, H), (H, T), (T, H), (T, Tn. Let the probability set function assign a probability of ! to each element of ~. Let Cl = {c; C = (H, H), (H, Tn and C2 = {c; C = (H, H), (T, Hn. Then P(Cl) = P(C2) = -t, P(CI n C2 ) = -.t, and, in accordance with Theorem 5, P(CI U C2 ) = 1- + 1- - ! = i· Let C{? denote a sample space and let Cl , C2 , C3 , ••• denote subsets of C{? If these subsets are such that no two have an element in common, they are called mutually disjoint sets and the corresponding events C1> C2 , C3 , ••• are said to be mutually exclusive events. Then, for example, P(C1 U C2 U C3 U, .. ) = P(C1) + P(C 2) + P(C3 ) + .. " in accordance EXERCISES 1.22. A positive integer from one to six is to be chosen by casting a die. Thus the elements c of the sample space ~ are 1, 2, 3, 4, 5, 6. Let Cl = {c; C = 1, 2, 3, 4}, C2 = {c; C = 3, 4, 5, 6}. If the probability set function P assigns a probability of i to each of the elements of~, compute P(Cl), P(C2 ) , P(CI n C2 ) , and P(CI U C2 ) . 1.23. A random experiment consists in drawing a card from an ordinary deck of 52 playing cards. Let the probability set function P assign a prob- ability of -h- to each of the 52 possible outcomes. Let Cl denote the collection of the 13 hearts and let C2 denote the collection of the 4 kings. Compute P(Cl), P(C2 ), P(CI n C2 ) , and P(CI U C2 ) . 1.24. A coin is to be tossed as many times as is necessary to turn up one head. Thus the elements c of the sample space ~ are H, TH, TTH, TTTH, and so forth. Let the probability set function P assign to these elements the respective probabilities -t, -.t, t, -1-6, and so forth. Show that P(~) = 1. Let Cl = {c; cis H, TH, TTH, TTTH, or TTTTH}. Compute P(CI)' Let C2 = {c; cis TTTTH or TTTTTH}. Compute P(C2) , P(CI n C2) , and P(CI U C2) . 1.25. If the sample space is ~ = CI U C2 and if P(Cl) = 0.8 and P(C2) = 0.5, find P(CI n C2 ) . 1.26. Let the sample space be ~ = {c; 0 < c < co}, Let C c ~ be defined by C = {c; 4 < c < oo} and take P(C) = Ie e- X dx. Evaluate P(C), P(C*), and P(C U C*). 1.27. 
If the sample space is ~ = {c; -00 < c < co}and if C c ~is a set for which the integral Ie e- ix i dx exists, show that this set function is not a probability set function. What constant could we multiply the integral by to make it a probability set function? 1.28. If Cl and C2 are subsets of the sample space jf show that P(CI n C2) ~ P(CI ) s P(CI U C2) s P(CI ) + P(C2) . 1.29. Let Cl , C2 , and Cs be three mutually disjoint subsets of the sample space~. Find P[(CI U C2) n Cs] and P(ct U q). 1.30. If Cl , C2 , and Cs are subsets of ~, show that P(CI U C2 U Cs) = P(CI ) + P(C2 ) + P(Cs) - P(CI n C2 ) - P(CI n Cs) - P(C2 n Cs) + P(CI n C2 n Cs).
  • 13. 16 Distributions of Random Variables [eh.l Sec. 1.5] Random Variables 17 What is the generalization of this result to four or more subsets of '?5? Hint. Write P(CI U C2 U C3) = P[CI U (C2 U C3 )] and use Theorem 5. 1.5 Random Variables The reader will perceive that a sample space '?5 may be tedious to describe if the elements of '?5 are not numbers. We shall now discuss how we may formulate a rule, or a set of rules, by which the elements C of '?5 may be represented by numbers x or ordered pairs of numbers (Xl' X 2) or, more generally, ordered n-tuplets of numbers (Xl' ... , xn) . We begin the discussion with a very simple example. Let the random experiment be the toss of a coin and let the sample space associated with the experiment be '?5 = {c; where c is Tor c is H} and T and H repre- sent, respectively, tails and heads. Let X be a function such that X(c) = 0 if c is T and let X(c) = 1 if c is H. Thus X is a real-valued , function defined on the sample space '?5 which takes us from the sample space '?5 to a space of real numbers d = {x; X = 0, 1}. We call X a random variable and, in this example, the space associated with X is d = {x; X = 0, 1}. We now formulate the definition of a random variable and its space. Definition 8. Given a random experiment with a sample space '?5. A function X, which assigns to each element c E'?5 one and only one real number X(c) = x, is called a random variable. The space of X is the set of real numbers d = {x; X = X(c), CE'?5}. It may be that the set '?5 has elements which are themselves real numbers. In such an instance we could write X(c) = c so that d = '?5. Let X be a random variable that is defined on a sample space '?5, and let d be the space of X. Further, let A be a subset of d. Just as we used the terminology" the event G," with G c '?5, we shall now speak of" the event A." The probability P(G) of the event G has been defined. We wish now to define the probability of the event A. This probability will be denoted by Pr (X E A), where Pr is an abbreviation of "the probability that." With A a subset of d, let G be that subset of '?5 such that G = {c; CE'?5 and X(c) E A}. Thus G has as its elements all out- comes in '?5 for which the random variable X has a value that is in A. This prompts us to define, as we now do, Pr (X E A) to be equal to P(G), where G = {c; CE '?5 and X(c) E A}. Thus Pr (X E A) is an assign- ment of probability to a set A, which is a subset of the space d associated with the random variable X. This assignment is determined by the probability set function P and the random variable X and is sometimes denoted by Px(A). That is, Pr (X E A) = Px(A) = P(G), where G = {c; CE'?5 and X(c) E A}. Thus a random variable X is a function that carries the probability from a sample space '?5 to a space d of real numbers. In this sense, with A c d, the probability Px(A) is often called an induced probability. The function Px(A) satisfies the conditions (a), (b), and (c) of the definition of a probability set function (Section 1.4). That is, Px(A) is also a probability set function. Conditions (a) and (c) are easily verified by observing, for an appropriate G, that Px(A) = P(G) ~ 0, and that '?5 = {c; CE'?5 and X(c) Ed} requires Px(d) = P('?5) = 1. In discussing condition (b), let us restrict our attention to two mutually exclusive events Al and A 2 • Here PX(A I U A 2 ) = P(G), where G = {c; CE'?5 and X(c) EAl U A 2} . 
However, G = {c; CE'?5 and X(c) E AI} U {c; CE'?5 and X(c) E A 2} , or, for brevity, G = GI U G2 • But GI and G2 are disjoint sets. This must be so, for if some c were common, say ct, then X(ct) E A I and X(ct) E A 2 • That is, the same number X(ct) belongs to both Al and A 2 . This is a contradiction because Al and A 2 are disjoint sets. Accordingly, P(G) = P(GI ) + P(G2) . However, by definition, P(GI ) is PX(A I ) and P(G2 ) is PX(A2 ) and thus PX(A I U A 2 ) = PX(A I ) + PX(A2 ) . This is condition (b) for two disjoint sets. Thus each of Px(A) and P(G) is a probability set function. But the reader should fully recognize that the probability set function P is defined for subsets G of '?5, whereas P x is defined for subsets A of d, and, in general, they are not the same set function. Nevertheless, they are closely related and some authors even drop the index X and write P(A) for Px(A). They think it is quite clear that P(A) means the probability of A, a subset of d, and P(G) means the probability of G, a subset of '?5. From this point on, we shall adopt this convention and simply write P(A).
  • 14. 18 Distributions of Random Variables [eh.l Sec. 1.5] Random Variables 19 valued functions defined on the sample space f'(?, which takes us from that sample space to the space of ordered number pairs d = {(xv x2 ) ; (Xl, x2 ) = (0,0), (0, 1), (1, 1), (1,2), (2,2), (2, 3)}. Thus Xl and X 2 are two random variables defined on the space f'(?, and, in this example, the space of these random variables is the two- dimensional set d given immediately above. We now formulate the definition of the space of two random variables. Definition 9. Given a random experiment with a sample space f'(? Consider two random variables Xl and X 2, which assign to each element c of f'(? one and only one ordered pair of numbers Xl (c) = Xv X 2(C) = x2. The space of Xl and X 2 is the set of ordered pairs d = {(xv x2 ) ; Xl = X 1(c), X2 = X 2(c), c E f'(?}. Let d be the space associated with the two random variables Xl and X 2 and let A be a subset of d. As in the case of one random variable we shall speak of the event A. We wish to define the probability of the event A, which we denote by Pr [(Xl' X 2) E A]. Take C = {c; C E f'(? and [X1 (c), X 2 (c)] E A}, where f'(? is the sample space. We then define Pr[(X1 , X 2 ) EA] = P(C), where P is the probability set function defined for subsets C of f'(? Here again we could denote Pr [(Xl, X 2 ) E A] by the probability set function P x t,x2(A); but, with our previous convention, we simply write P(A) = Pr [(Xl' X 2) E A]. Again it is important to observe that this function is a probability set function defined for subsets A of the space d. Let us return to the example in our discussion of two random vari- ables. Consider the subset A of d, where A = {(xv x2 ) ; (xv x2 ) = (1, 1), (1,2)}. To compute Pr [(Xv X 2 ) E A] = P(A), we must include as elements of C all outcomes in f'(? for which the random variables Xl and X 2 take values (xv x2 ) which are elements of A. Now X 1 (C S ) = 1, X 2 (CS ) = 1, X 1 (C4 ) = 1, and X 2 (C4) = 1. Also, X 1 (C5 ) = 1, X 2 (C 5 ) = 2, X 1(C6) = 1, and X 2 (C6 ) = 2. Thus P(A) = Pr [(Xv X 2 ) E A] = P(C), where C = {c; c = Cs, c4 , c5 , or c6}. Suppose that our probability set function P(C) assigns a probability of t to each of the eight elements of f'(? Then P(A), which can be written as Pr (Xl = 1, X 2 = 1 or 2), is equal to t = l It is left for the reader to show that we can tabulate the probability, which is then assigned to each of the elements of d. with the following result: ' (xv x2 ) (0,0) (0, 1) (1, 1) (1,2) (2, 2) (2,3) Perhaps an additional example will be helpful. Let a coin be tossed twice and let our interest be in the number of heads to be observed. Thus the sample space is f'(? = {c; where c is TT or TH or HT or HH}. Let X(c) = °if c is TT; let X(c) = 1 if c is either TH or HT; and let X(c) = 2 if c is HH. Thus the space of the random variable X is d = {x; x = 0, 1, 2}. Consider the subset A of the space d, where A = {x; x = I}. How is the probability of the event A defined? We take the subset C of f'(? to have as its elements all outcomes in f'(? for which the random variable X has a value that is an element of A. Because X(c) = 1 if c is either TH or HT, then C = {c; where c is TH or HT}. Thus P(A) = Pr (X E A) = P(C). Since A = {x; x = I}, then P(A) = Pr (X E A) can be written more simply as Pr (X = 1). Let C1 = {c; c is TT}, C2 = {c; c is TH}, Cs = {c; c is HT}, and C4 = {c; c is HH} denote subsets of f'(? 
We shall now discuss two random variables. Again, we start with an example. A coin is to be tossed three times and our interest is in the ordered number pair (number of H's on first two tosses, number of H's on all three tosses). Thus the sample space is 𝒞 = {c; c = cᵢ, i = 1, 2, ..., 8}, where c₁ is TTT, c₂ is TTH, c₃ is THT, c₄ is HTT, c₅ is THH, c₆ is HTH, c₇ is HHT, and c₈ is HHH. Let X₁ and X₂ be two functions such that X₁(c₁) = X₁(c₂) = 0, X₁(c₃) = X₁(c₄) = X₁(c₅) = X₁(c₆) = 1, X₁(c₇) = X₁(c₈) = 2; and X₂(c₁) = 0, X₂(c₂) = X₂(c₃) = X₂(c₄) = 1, X₂(c₅) = X₂(c₆) = X₂(c₇) = 2, X₂(c₈) = 3. Thus X₁ and X₂ are real-valued functions defined on the sample space 𝒞, which take us from that sample space to the space of ordered number pairs

𝒜 = {(x₁, x₂); (x₁, x₂) = (0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)}.

Thus X₁ and X₂ are two random variables defined on the space 𝒞, and, in this example, the space of these random variables is the two-dimensional set 𝒜 given immediately above. We now formulate the definition of the space of two random variables.

Definition 9. Given a random experiment with a sample space 𝒞. Consider two random variables X₁ and X₂, which assign to each element c of 𝒞 one and only one ordered pair of numbers X₁(c) = x₁, X₂(c) = x₂. The space of X₁ and X₂ is the set of ordered pairs 𝒜 = {(x₁, x₂); x₁ = X₁(c), x₂ = X₂(c), c ∈ 𝒞}.

Let 𝒜 be the space associated with the two random variables X₁ and X₂ and let A be a subset of 𝒜. As in the case of one random variable, we shall speak of the event A. We wish to define the probability of the event A, which we denote by Pr [(X₁, X₂) ∈ A]. Take C = {c; c ∈ 𝒞 and [X₁(c), X₂(c)] ∈ A}, where 𝒞 is the sample space. We then define Pr [(X₁, X₂) ∈ A] = P(C), where P is the probability set function defined for subsets C of 𝒞. Here again we could denote Pr [(X₁, X₂) ∈ A] by the probability set function P_{X₁,X₂}(A); but, with our previous convention, we simply write P(A) = Pr [(X₁, X₂) ∈ A]. Again it is important to observe that this function is a probability set function defined for subsets A of the space 𝒜.

Let us return to the example in our discussion of two random variables. Consider the subset A of 𝒜, where A = {(x₁, x₂); (x₁, x₂) = (1, 1), (1, 2)}. To compute Pr [(X₁, X₂) ∈ A] = P(A), we must include as elements of C all outcomes in 𝒞 for which the random variables X₁ and X₂ take values (x₁, x₂) which are elements of A. Now X₁(c₃) = 1, X₂(c₃) = 1, X₁(c₄) = 1, and X₂(c₄) = 1. Also, X₁(c₅) = 1, X₂(c₅) = 2, X₁(c₆) = 1, and X₂(c₆) = 2. Thus P(A) = Pr [(X₁, X₂) ∈ A] = P(C), where C = {c; c = c₃, c₄, c₅, or c₆}. Suppose that our probability set function P(C) assigns a probability of 1/8 to each of the eight elements of 𝒞. Then P(A), which can be written as Pr (X₁ = 1, X₂ = 1 or 2), is equal to 4/8 = 1/2. It is left for the reader to show that we can tabulate the probability, which is then assigned to each of the elements of 𝒜, with the following result:

(x₁, x₂)       (0, 0)   (0, 1)   (1, 1)   (1, 2)   (2, 2)   (2, 3)
Probability     1/8      1/8      2/8      2/8      1/8      1/8
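In the same spirit, the joint table just given can be reproduced by enumerating the eight outcomes, each carrying probability 1/8. A brief sketch (again ours; note that 2/8 prints in reduced form as 1/4):

```python
from fractions import Fraction
from collections import defaultdict

outcomes = ["TTT", "TTH", "THT", "HTT", "THH", "HTH", "HHT", "HHH"]

def X1(c):
    return c[:2].count("H")        # heads on the first two tosses

def X2(c):
    return c.count("H")            # heads on all three tosses

joint = defaultdict(Fraction)
for c in outcomes:
    joint[(X1(c), X2(c))] += Fraction(1, 8)    # each outcome has probability 1/8

print({pair: str(p) for pair, p in sorted(joint.items())})
# {(0, 0): '1/8', (0, 1): '1/8', (1, 1): '1/4', (1, 2): '1/4', (2, 2): '1/8', (2, 3): '1/8'}

A = {(1, 1), (1, 2)}
print(sum(p for pair, p in joint.items() if pair in A))    # Pr[(X1, X2) in A] = 1/2
```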
  • 15. 20 Distributions of Random Variables [eh.l Sec. 1.5] Random Variables 21 and exists. Thus the probability set function of X is this integral. P(A) = L tdx J 1 11 2 3x2 1 P(A l ) = Pr (X E A l ) = f(x) dx = 8 dx = 64 A , 0 3x2 where f(x) = 8' XEd = {x; 0 < x < 2}. P(A) = Lf(x) dx, In statistics we are usually more interested in the probability set function of the random variable X than we are in the sample space C(f and the probability set function P(C). Therefore, in most instances, we begin with an assumed distribution of probability for the random variable X. Moreover, we do this same kind of thing with two or more random variables. Two illustrative examples follow. Example 2. Let the probability set function P(A) of a random variable Xbe P(A 2) = Pr (X E A 2) = J f(x) dx = r 2 3t dx = ~. A2 Jl To compute P(A l U A 2 ) , we note that A l n A 2 = 0; then we have P(A l U A 2 ) = P(A l ) + P(A 2 ) = U. Example 3. Let d = {(x, y); 0 < x < y < I} be the space of two random variables X and Y. Let the probability set function be Let A l = {x; 0 < x < ·n and A 2 = {x; 1 < x < 2} be two subsets of d. Then since A = {x; 2 < x < b}. This kind of argument holds for every set A c d for which the integral P(A) = Lf2dxdy. If A is taken to be Al = {(x, y); t < x < y < I}, then P(A 1 ) = Pr [(X, Y) E A 1J = II IY 2 dx dy = t. 1/2 1/2 If A is taken to be A 2 = {(x, y); x < y < 1,0 < x :0; t}, then A 2 = At, and P(A 2 ) = Pr [(X, Y) E A 2J = P(A!) = 1 - P(A 1 ) = i- EXERCISES I 11 2 P(C) = dz = t. 114 «b-2l/3 Px(A) = P(A) = P(C) = Jo dz. P(C) = fa dz. For instance, if C = {c; t < c < !}, then In the integral, make the change of variable x = 3z + 2 and obtain Again we should make the comment that Pr [(Xl' ... , X n) E A] could be denoted by the probability set function P X 1 •.••• xn(A). But, if there is no chance of misunderstanding, it will be written simply as P(A). Up to this point, our illustrative examples have dealt with a sample space C(f that contains a finite number of elements. We now give an example of a sample space C(f that is an interval. Example 1. Let the outcome of a random experiment be a point on the interval (0, 1). Thus, ~ = {c; 0 < c < I}. Let the probability set function be given by This table depicts the distribution of probability over the elements of d, the space of the random variables Xl and X 2 • The preceding notions about one and two random variables can be immediately extended to n random variables. We make the following definition of the space of n random variables. Definition 10. Given a random experiment with the sample space C(f. Let the random variable X, assign to each element C E C(f one and only one real number X,(c) = z , i = 1,2, ... , n. The space of these random variables is the set of ordered n-tuplets .xl = {(Xl' X2 , ••• , Xn); Xl = Xl(C), ... , Xn = Xn(C), CE C(f}. Further, let A be a subset of .xl. Then Pr [(Xl>' .. , X n) E A] = P(C), where C = {c; CE C(f and [Xl(c), X 2(c), .• " Xn(c)] E A}. Define the random variable X to be X = X(c) = 3c + 2. Accordingly, the space of X is d = {x; 2 < x < 5}. We wish to determine the probability set function of X, namely P(A), A c d. At this time, let A be the set {x; 2 < x < b}, where 2 < b < 5. Now X(c) is between 2 and b when and only when C E C = {c; 0 < c < (b - 2)j3}. Hence Px(A) = P(A) = f: tdx = Ltdx, 1.31. Let a card be selected from an ordinary deck of playing cards. The outcome C is one of these 52 cards. Let X(c) = 4 if c is an ace, let X(c) = 3 if
  • 16. 22 Distributions oj Random Variables [Ch.l Sec. 1.6] The Probability Density Function 23 c is a king, let X(c) = 2 if c is a queen, let X(c) = 1 if c is a jack, and let X(c) = 0 otherwise. Suppose that P(C) assigns a probability of -l-i to each outcome c. Describe the induced probability Px(A) on the space d = {x; x = 0, 1, 2, 3, 4} of the random variable X. 1.32. Let a point be selected from the sample space rc = {c; 0 < c < lO}. Let Cere and let the probability set function be P(C) = fe /0 dz. Define the random variable X to be X = X(c) = 2c - 10. Find the probability set function of X. Hint. If -10 < a < b < 10, note that a < X(c) < b when and only when (a + lO)/2 < c < (b + 10)/2. 1.33. Let the probability set function peA) of two random variables X and Y be peA) = L L f(x, y), where f(x, y) = -h, (x, y) Ed = {(x, y); A (x, y) = (0, 1), (0, 2), ... , (0, 13), (1, 1), ... , (1, 13), ... , (3, 13)}. Compute peA) = Pr [(X, Y) E A): (a) when A = {(x, y); (x, y) = (0,4), (1, 3), (2, 2)}; (b) when A = {(x, y); x + y = 4, (x, y) Ed}. 1.34. Let the probability set function peA) of the random variable X be peA) = L f(x) dx, where f(x) = 2x/9, xEd = {x; 0 < x < 3}. Let Al = {x; 0 < x < I}, A 2 = {x; 2 < x < 3}. Compute peAl) = Pr [X E AI), P(A 2 ) = Pr (X E A 2 ) , and peAl U A 2 ) = Pr (X E Al U A 2 ) · 1.35. Let the space of the random variable X be d = {x; 0 < x < I}. HAl = {x; 0 < x < -H and A 2 = {x; 1: ~ x < I}, find P(A 2) if peAl) =-!:- 1.36. Let the space of the random variable X be d = {x; 0 < x < lO} and let peAl) = -i, where Al = {x; 1 < x < 5}. Show that P(A 2) ~ i, where A 2 = {x; 5 ~ x < lO}. 1.37. Let the subsets Al = {x; t < x < !} and A 2 = {x; 1- ~ x < I} of the space d = {x; 0 < x < I} of the random variable X be such that peAl) = t and P(A 2 ) = l Find peAl U A 2 ) , peA!), and peA! n A;). 1.38. Let Al = {(x, y); x ~ 2, y ~ 4}, A 2 = {(x, y); x ~ 2, y ~ I}, Aa = {(x, y); x ~ 0, y ~ 4}, and A 4 = {(x, y); x ~ 0, y ~ I} be subsets of the space d of two random variables X and Y, which is the entire two-dimen- sional plane. If peAl) = i, P(A 2 ) = t P(Aa) = i, and P(A 4 ) = j, find peAs), where As = {(x, y); 0 < x ~ 2, 1 < y ~ 4}. 1.39. Given fA [l/1T(1 + x2))dx, where A c d = {x; -00 < x < oo}. Show that the integral could serve as a probability set function of a random variable X whose space is d. 1.40. Let the probability set function of the random variable X be Let Ale = {x; 2 - l/k < x ~ 3}, k = 1,2,3,.... Find lim Ale and k -« co P( lim A k). Find peAk) and lim peAk)' Note that lim peAk) = P( lim A k). k-+OO k-+oo k-+oo k-+co 1.6 The Probability Density Function Let X denote a random variable with space sf and let A be a subset of d. If we know how to compute P(C), C c '6', then for each A under consideration we can compute peA) = Pr (X EA); that is, we know how the probability is distributed over the various subsets of sf. In this sense, we speak of the distribution of the random variable X, meaning, of course, the distribution of probability. Moreover, we can use this convenient terminology when more than one random variable is involved and, in the sequel, we shall do this. In this section, we shall investigate some random variables whose distributions can be described very simply by what will be called the probability density function. The two types of distributions that we shall consider are called, respectively, the discrete type and the continuous type. For simplicity of presentation, we first consider a distribution of one random variable. 
(a) The discrete type of random variable. Let X denote a random variable with one-dimensional space sf. Suppose that the space sf is a set of points such that there is at most a finite number of points of sf in every finite interval. Such a set sf will be called a set of discrete points. Let a function f(x) be such that f(x) > 0, X E .91, and that 2:f(x) = 1. @ Whenever a probability set function P(A), A c .91, can be expressed in terms of such an f(x) by peA) = Pr (X E A) = 2: f(x) , A then X is called a random variable of the discrete type, and X is said to have a distribution of the discrete type. Example 1. Let X be a random variable of the discrete type with space d = {x; x = 0, 1,2, 3, 4}. Let peA) = L f(x), A where peA) = Le- X dx, where d = {x; 0 < x < co}. 4! (1)4 f(x) = xl (4 - x)! 2 ' XEd,
  • 17. 24 Distributions of Random Variables [eh.l Sec. 1.6] The Probability Density Function 25 and, as usual, Ot = 1. Then if A = {x; x = 0, I}, we have 41 (1)4 4! (1)4 5 Pr (X E A) = O! 4! 2 + I! 3! 2 = 16' Example 2. Let X be a random variable of the discrete type with space d = {x; x = 1, 2, 3, ... }, and let f(x) = (t)X, xEd. Then Pr (X E A) = L f(x). A If A = {x; x = 1, 3, 5, 7, ...}, we have Pr (X EA) = (1-) + (1-)3 + (!)5 + ... = t· (b) The continuous type of random variable. Let the one-dimensional set d be such that the Riemann integral fd f(x) dx = 1, where (1) f(x) > 0, XEd, and (2) f(x) has at most a finite num~er of discontinuities in every finite interval that is a subset of d. If d IS the space of the random variable X and if the probability set function P(A), A c d, can be expressed in terms of such anf(x) by P(A) = Pr (X E A) = t f(x) dx, then X is said to be a random variable of the continuous type and to have a distribution of that type. Example 3. Let the space d = {x; 0 < x < co}, and let f(x) = e-x , xEd. If X is a random variable of the continuous type so that Pr (X E A) = t e- X dx, we have, with A = {x; 0 < x < I}, Pr (X E A) = f:e- X de = 1 - e- 1 • Note that Pr (X E A) is the area under the graph of f(x) = e- x , which lies above the z-axis and between the vertical lines x = 0 and x = 1. Example 4. Let X be a random variable of the continuous type with space d = {x; 0 < x < I}. Let the probability set function be P(A) = Lf(x) dx, where Since P(A) is a probability set function, P(d) = 1. Hence the constant c is determined by (l cx2 dx = 1 Jo ' or c = 3. It is seen that whether the random variable X is of the discrete type or of the continuous type, the probability Pr (X E A) is completely determined by a functionf(x). In either casef(x) is called the probability density function (hereafter abbreviated p.d.f.) of the random variable X. If we restrict ourselves to random variables of either the discrete type or the continuous type, we may work exclusively with the p.d.f. f(x). This affords an enormous simplification; but it should be recognized that this simplification is obtained at considerable cost from a mathe- matical point of view. Not only shall we exclude from consideration many random variables that do not have these types of distributions, but we shall also exclude many interesting subsets of the space. In this book, however, we shall in general restrict ourselves to these simple types of random variables. Remarks. Let X denote the number of spots that show when a die is cast. We can assume that X is a random variable with d = {x; x = 1,2, .. " 6} and with a p.d.f. f(x) = i, XEd. Other assumptions can be made to provide different mathematical models for this experiment. Experimental evidence can be used to help one decide which model is the more realistic. Next, let X denote the point at which a balanced pointer comes to rest. If the circum- ference is graduated 0 ~ x < 1, a reasonable mathematical model for this experiment is tc take X to be a random variable with d = {x; 0 ~ x < I} and with a p.d.f. f(x) = 1, XEd. Both types of probability density functions can be used as distributional models for many random variables found in real situations. For illustrations consider the following. If X is the number of automobile accidents during a given day, thenf(0),j(I),j(2), ... represent the probabilities of 0,1,2, ... accidents. 
On the other hand, if X is the length of life of a female born in a certain community, the integral

    ∫₄₀⁵⁰ f(x) dx

[the area under the graph of f(x) that lies above the x-axis and between the vertical lines x = 40 and x = 50] represents the probability that she dies between 40 and 50 (or the percentage
  • 18. or as or as = 0 elsewhere, P(A) = Pr [(X, Y) E A] = 'L 'L f(x, y), A 27 f~oof(x) dx. 'Lf(x), x 'L 'Lf(x, y), y x by by by L.J(x) dx Sec. 1.6] The Probability Density Function and then refer to f(x) as the p.d.f. of X. We have f~oof(x) dx = foo 0 dx + foOO e- X dx = 1. Thus we rr:ay treat the entire axis of reals as though it were the space of X. Accordingly, we now replace 'L 'Lf(x, y) oUl 'Lf(x) oUl and, for two random variables, and so on. In ~c~ordancewith this convention (of extending the definition of a p.d.f.), It IS seen that a point function f, whether in one or more variables esse~tially satis~es the conditions of being a p.d.f. if (a) f is defined ~nd IS not negatIve for all real values of its argument(s) and if (b) its mtegral [~or the continuous type of random variable(s)], or its sum [for the discrete type of random variable(s)] over all real values of its argument(s) is 1. .Hf(x) is the p.d.f. of a continuous type of random variable X and if A IS the set {x; a < x < b}, then P(A) = Pr (X E A) can be written as Pr (a < X < b) = f:f(x) dx. Moreover, if A = {x; x = a}, then Similarly, we may extend the definition of a p.d.f. f(x, y) over the entire xy-plane, or a p.d.f. /(x, Y'.z) throughout three-dimensional space, and so on. We shall do this consistently so that tedious, repetitious references to the spaced can be avoided. Once this is done, we replace foUl f f(x, y) dx dy by f~00 f~00 f(x, y) dx dy, a?d so on. Similarly, after extending the definition of a p.d.f. of the discrete type, we replace, for one random variable, P(A) = Pr (X EA) = Pr (X = a) = f:f(x) dx = 0, ~n.ce the integral [: f(x) dx is defined in calculus to be zero. That is, if IS a random vanable of the continuous type, the probability of every o < x < co, Distributions of Random Variables [Ch.l f(x) = e- x , 26 The notion of the p.d.f. of one random variable X can be extended to the notion of the p.d.f. of two or more random variables. Under certain restrictions on the space .91 and the function f > 0 on .91 (restrictions that will not be enumerated here), we say that the two random variables X and Yare of the discrete type or of the continuous type, and have a distribution of that type, according as the probability set function P(A), A c .91, can be expressed as of these females dying between 40 and 50). A particular f(x) willbe suggested later for each of these situations, but again experimental evidence must be used to decide whether we have realistic models. P(A) = Pr [(X, Y) E A] = Lff(x, y) dx dy. In either case f is called the p.d.f. of the two random variables X and Y. Of necessity, P(d) = 1 in each case. More generally, we say that the n random variables Xl' X 2 , ••• , Xn are of the discrete type or of the con- tinuous type, and have a distribution of that type, according as the probability set function P(A), A c d, can be expressed as P(A) = Pr [(Xl> ... , X n) E A] = 'L" .'L f(xl> ... , xn), A P(A) = Pr [(Xl> ... , X n) E A] = rA-ff(Xl> ... , xn) dX1 ... dxn• The idea to be emphasized is that a function f, whether in one or more variables, essentially satisfies the conditions of being a p.d.f. if f > 0 on a spaced and if its integral [for the continuous type of random variable(s)] or its sum [for the discrete type of random variable(s)] over .91 is one. Our notation can be considerably simplified when we restrict our- selves to random variables of the continuous or discrete types. Suppose that the space of a continuous type of random variable X is .91 = {x; 0 < x < co} and that the p.d.f. 
of X is e^(-x), x ∈ 𝒜. We shall in no manner alter the distribution of X [that is, alter any P(A), A ⊂ 𝒜] if we extend the definition of the p.d.f. of X by writing
  • 19. 28 Distributions of Random Variables [Ch.l Sec. 1.6] The Probability Density Function 29 = 0 elsewhere, set consisting of a single point is zero. This fact enables us to write, say, Pr (a < X < b) = Pr (a s X s b). More important, this fact allows us to change the value of the p.d.f. of a continuous type of random variable X at a single point without altering the distribution of X. For instance, the p.d.f. f(x) = e-x , o< x < 00, Find Pr (-!- < X < i) and Pr (--!- < X < -!-). First, f 3/4 f3/4 Pr (-t < X < i) = I(x) dx = 'lx dx = l6' 1/2 1/2 Next, f l /2 Pr (--t < X < -!-) = f(x) dx -1/2 fo il/2 = Odx + Zx d» -1/2 0 = 0 + -! = -!. = 0 elsewhere, without changing any P(A). We observe that these two functions differ only at x = 0 and Pr (X = 0) = O. More generally, if two probability density functions of random variables of the continuous type differ only on a set having probability zero, the two corresponding probability set functions are exactly the same. Unlike the continuous type, the p.d.f. of a discrete type of random variable may not be changed at any point, since a change in such a p.d.f. alters the distribution of probability. Finally, if a p.d.f. in one or more variables is explicitly defined, we can see by inspection whether the random variables are of the con- tinuous or discrete type. For example, it seems obvious that the p.d.f. can be written as f(x) = e-X , o ~ x < 00, Example 6. Let f(x, y) = 6x2y, 0 < X < 1, 0 < y < 1, = 0 elsewhere, be the p.d.f. of two random variables X and Y. We have, for instance, I 2 [3/4 Pr (0 < X < i,1- < Y < 2) = 113 Jo f(x, y) dx dy II i3/4 f2 i3/4 = 6x2y dx dy + 0 dx dy 1/3 0 1 0 = i + 0 = i· Note that this probability is the volume under the surface f(x, y) = 6x2y and above the rectangular set {(x, y); 0 < x < i, -1 < y < I} in the xy-plane. EXERCISES = 0 elsewhere. Example 5. Let the random variable X have the p.d.I, = 0 elsewhere, is a p.d.f. of two discrete-type random variables X and Y, whereas the p.d.f. 1.41. For each of the following, find the constant c so that f(x) satisfies the conditions of being a p.d.f. of one random variable X. (a) f(x) = cWx, x = 1, 2, 3,... , zero elsewhere. (b) f(x) = cee:», 0 < x < (X), zero elsewhere. 1.42. Letf(x) = x/IS, x = 1,2,3,4,5, zero elsewhere, be the p.d.f. of X. Find Pr (X = 1 or 2), Pr (-!- < X < f), and Pr (1 s X s 2). 1.43. For each of the following probability density functions of X, compute Pr (IXI < 1) and Pr (X2 < 9). (a) f(x) = x2 /18, -3 < x < 3, zero elsewhere. (b) f(x) = (x + 2)/18, -2 < x < 4, zero elsewhere. 1.44. Let f(x) = l/x2 , 1 < x < (X) , zero elsewhere, be the p.d.f. of X. If Al = {x; 1 < x < 2} and A 2 = {x; 4 < x < 5}, find peAl U A 2 ) and peAl (1 A 2) . 1.45. Let f(x!> x2) = 4Xl X2, 0 < Xl < 1, 0 < X2 < 1, zero elsewhere, be the p.d.f. of Xl and X 2· Find Pr (0 < Xl < -!-, t < X 2 < 1), Pr (Xl = X 2), 0< x < 1, o< x < 00, 0 < y < 00, x = 1, 2, 3, ... , y = 1,2, 3, ... , I(x) = 'lx, 9 f(x, y) = 4x + y ' f(x, y) = 4xye-X 2 - y2 , = 0 elsewhere, is clearly a p.d.f. of two continuous-type random variables X and Y. In such cases it seems unnecessary to specify which of the two simpler types of random variables is under consideration.
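Because the probabilities in Examples 5 and 6 are ordinary iterated Riemann integrals, they can also be checked numerically. The sketch below uses SciPy's quad and dblquad for that check; the numerical verification is ours and is not part of the text.

```python
from scipy.integrate import quad, dblquad

# Example 5: f(x) = 2x on 0 < x < 1, zero elsewhere.
f = lambda x: 2 * x if 0 < x < 1 else 0.0
print(quad(f, 0.5, 0.75)[0])     # Pr(1/2 < X < 3/4) = 5/16 = 0.3125
print(quad(f, -0.5, 0.5)[0])     # Pr(-1/2 < X < 1/2) = 0 + 1/4 = 0.25

# Example 6: f(x, y) = 6x^2 y on the unit square, zero elsewhere; the effective
# region of integration is 0 < x < 3/4 and 1/3 < y < 1, since f vanishes for y >= 1.
p, _ = dblquad(lambda y, x: 6 * x**2 * y, 0, 0.75, lambda x: 1/3, lambda x: 1.0)
print(p)                         # Pr(0 < X < 3/4, 1/3 < Y < 2) = 3/8 = 0.375
```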
  • 20. Distributions of Random Variables [eh.l and, for k ~ 1, that (by integrating by parts) f: xe"" dx = f: e- X dx = 1, 31 for the discrete type of random variable, and F(x) = f:a) f(w) dw, for the continuous type of random variable. We speak of a distribution function F(x) 3.S being of the continuous or discrete type, depending on whether the random variable is of the continuous or discrete type. Remark. If X is a random variable of the continuous type, the p.d.f. j(x) has at most a finite number of discontinuities in every finite interval. This means (1) that the distribution function F(x) is everywhere continuous and (2) that the derivative of F(x) with respect to x exists and is equal to j(x) at each point of continuity ofj(x). That is, F'(x) = j(x) at each point of continuity of j(x). If the random variable X is of the discrete type, most surely the p.d.f. j(x) is not the derivative of F(x) with respect to x (that is, with respect to Lebesgue measure); butj(x) is the (Radon-Nikodym) deriva- tive of F(x) with respect to a counting measure. A derivative is often called a density. Accordingly, we call these derivatives probability density junctions. F(x) = L f(w), W:S:x fo'" g(x) dx = 1. Show thatj(xl' x2) = [2g(v'x~ + X~)JI(7TVX~ + x~), 0 < Xl < 00,0 < X 2 < 00, zero elsewhere, satisfies the conditions of being a p.d.f. of two continuous- type random variables Xl and X 2 . Hint. Use polar coordinates. (c) For what value of the constant c does the function j(x) = cxne- x , o< X < 00, zero elsewhere, satisfy the properties of a p.d.f.? 1.51. Given that the nonnegative function g(x) has the property that 1.7 The Distribution Function Let the random variable X have the probability set function P(A), where A is a one-dimensional set. Take x to be a real number and con- sider the set A which is an unbounded set from -00 to x, including the point x itself. For all such sets A we have P(A) = Pr (X E A) = Pr (X ~ x). This probability depends on the point x; that is, this probability is a function of the point x. This point function is denoted by the symbol F(x) = Pr (X :::; x). The function F(x) is called the distribution junction (sometimes, cumulative distribution junction) of the random variable X. Since F(x) = Pr (X ~ x), then, with f(x) the p.d.f., we have Sec. 1.7] The Distribution Function (a) j(x) = 4! (!)x (~) 4-X, X = 0, 1, 2, 3, 4, zero elsewhere. x! (4 - x)! 4 4 (b) j(x) = 3x2, 0 < X < 1, zero elsewhere. 1 (c) j(x) = 7T(1 + x2) ' -00 < X < 00. Hint. In parts (b) and (c), Pr (X < x) = Pr (X ~ x) and thus that common value must equal 1- if x is to be the median of the distribution. 1.49. Let 0 < p < 1. A (100P)th percentile (quantile of order P) of the distribution of a random variable X is a value ~p such that Pr (X < ~p) s p and Pr (X ~ ~p) ~ p. Find the twentieth percentile of the distribution that has p.d.f. j(x) = 4x3 , 0 < X < 1, zero elsewhere. Hint. With a continuous- type random variable X, Pr (X < gp) = Pr (X ~ ~p) and hence that common value must equal p. 1.50. Show that (a) What is the value of f: xne- X d», where n is a nonnegative integer? (b) Formulate a reasonable definition of the now meaningless symbol 01. 30 Pr (Xl < X 2 ) , and Pr (Xl s X 2) . Hint. Recall that Pr (Xl = x, would be the volume under the surface j(Xl> x2) = 4X1X2 and above the line segment o< Xl = X2 < 1 in the xlx2-plane. 1.46. Let j(Xl' X2' x3) = exp [- (Xl + X 2 + x3)] , 0 < Xl < 00, 0 < X2 < 00,0 < X3 < 00, zero elsewhere, be the p.d.f. 
of X₁, X₂, X₃. Compute Pr (X₁ < X₂ < X₃) and Pr (X₁ = X₂ < X₃). The symbol exp (w) means e^w.

1.47. A mode of a distribution of one random variable X of the continuous or discrete type is a value of x that maximizes the p.d.f. f(x). If there is only one such x, it is called the mode of the distribution. Find the mode of each of the following distributions:
(a) f(x) = (1/2)^x, x = 1, 2, 3, ..., zero elsewhere.
(b) f(x) = 12x²(1 − x), 0 < x < 1, zero elsewhere.
(c) f(x) = (1/2)x²e^(−x), 0 < x < ∞, zero elsewhere.

1.48. A median of a distribution of one random variable X of the discrete or continuous type is a value x such that Pr (X < x) ≤ 1/2 and Pr (X ≤ x) ≥ 1/2. If there is only one such x, it is called the median of the distribution. Find the median of each of the following distributions:
  • 21. 32 Distributions oj Random Variables [Ch.l Sec. 1.7] The Distribution Function 33 F(x) F(x) --------- - _.~;;;;.- ...... =--............................._~ 2 3 x x FIGURE 1.3 FIGURE 1.4 Example 1. Let the random variable X of the discrete type have the p.d.f.j(x) = x/6, x = 1,2,3, zero elsewhere. The distribution function of X is Here, as depicted in Figure 1.3, F(x) is a step function that is constant in every interval not containing 1, 2, or 3, but has steps of heights i, i, and i at those respective points. It is also seen that F(x) is everywhere continuous to the right. Example 2. Let the random variable X of the continuous type have the p.d.f. j(x) = 2/x3 , 1 < x < CX), zero elsewhere. The distribution function of Xis = fX ~ dw = 1 - .!-, 1 w3 x2 F(x) = I:00 °dw = 0, and F(x") - F(x') = Pr (x' < X ~ x") ?: O. (c) F(oo) = 1 and F( -(0) = 0 because the set {x; x ~ co] is the entire one-dimensional space and the set {x; x :$ -oo} is the null set. From the proof of (b), it is observed that, if a < b, then Pr (X s x") = Pr (X s x') + Pr (x' < X :$ x"). That is, and lim F(x), respectively. In like manner, the symbols {x; x :$ co} x-+ - 00 and {x; x :$ -oo} represent, respectively, the limits of the sets {x; x ~ b} and {x; x ~ - b} as b -7 00. (a) 0 -s F(x) s 1 because 0 ~ Pr (X s x) s 1. (b) F(x) is a nondecreasing function of x. For, if x' < x", then {x; x ~ x"} = {x; x ~ x'} U {x; x' < x ~ x"} 1 s x. x < 1, x < 1, 1 :$ x < 2, 2 s x < 3, 3 :$ x. = 1, -1- - 6' _.J. - 6' F(x) = 0, The graph of this distribution function is depicted in Figure 1.4. Here F(x) is a continuous function for all real numbers x; in particular, F(x) is every- where continuous to the right. Moreover, the derivative of F(x) with respect to x exists at all points except at x = 1. Thus the p.d.f. of X is defined by this derivative except at x = 1. Since the set A = {x; x = I} is a set of probability measure zero [that is, P(A) = OJ, we are free to define the p.d.f. at x = 1 in any manner we please. One way to do this is to writej(x) = 2/x3 , 1 < x < 00, zero elsewhere. There are several properties of a distribution function F(x) that can be listed as a consequence of the properties of the probability set function. Some of these are the following. In listing these properties, we shall not restrict X to be a random variable of the discrete or continuous type. We shall use the symbols F(oo) and F( -(0) to mean lim F(x) x'" 00 Pr (a < X s b) = F(b) - F(a). Suppose that we want to use F(x) to compute the probability Pr (X = b). To do this, consider, with h > 0, lim Pr (b - h < X :$ b) = lim [F(b) - F(b - h)]. h ... O h"'O Intuitively, it seems that lim Pr (b - h < X ~ b) should exist and be h...O equal to Pr (X = b) because, as h tends to zero, the limit of the set {x; b - h < x ~ b}is the set that contains the single point x = b. The fact that this limit is Pr (X = b) is a theorem that we accept without proof. Accordingly, we have Pr (X = b) = F(b) - F(b-),
  • 22. 34 Distributions of Random Variables [Ch.l Sec. 1.7] The Distribution Function 35 where F(b - ) is the left-hand limit of F(x) at x = b. That is, the proba- bility that X = b is the height of the step that F(x) has at x = b. Hence, if the distribution function F(x) is continuous at x = b, then Pr (X = b) = O. There is a fourth property of F(x) that is now listed. (d) F(x) is continuous to the right at each point x. To prove this property, consider, with h > 0, F(x) FIGURE 1.5 x lim Pr (a < X ~ a + h) = lim [F(a + h) - F(a)]. h-->O h-->O We accept without proof a theorem which states, with h > 0, that lim Pr (a < X ~ a + h) = P(O) = O. h-->O Here also, the theorem is intuitively appealing because, as h tends to zero, the limit of the set {x; a < x ~ a + h}is the null set. Accordingly, we write o= F(a+) - F(a), where F(a +) is the right-hand limit of F(x) at x = a. Hence F(x) is continuous to the right at every point x = a. The preceding discussion may be summarized in the following manner: A distribution function F(x) is a nondecreasing function of x, which is everywhere continuous to the right and has F( -(0) = 0, F(oo) = 1. The probability Pr (a < X ~ b) is equal to the difference F(b) - F(a). If x is a discontinuity point of F(x), then the probability Pr (X = x) is equal to the jump which the distribution function has at the point x. If x is a continuity point of F(x), then Pr (X = x) = O. Let X be a random variable of the continuous type that has p.d.f. j(x) , and let A be a set of probability measure zero; that is, P(A) = Pr (X E A) = O. It has been observed that we may change the definition of j(x) at any point in A without in any way altering the distribution of probability. The freedom to do this with the p.d.f. j(x), of a con- tinuous type of random variable does not extend to the distribution function F(x); for, if F(x) is changed at so much as one point x, the probability Pr (X ~ x) = F(x) is changed, and we have a different distribution of probability. That is, the distribution function F(x), not the p.d.f. j(x), is really the fundamental concept. Remark. The definition of the distribution function makes it clear that the probability set function P determines the distribution function F. It is true, although not so obvious, that a probability set function P can be found from a distribution function F. That is, P and F give the same information about the.distribution of probability, and which function is used is a matter of convemence. We now give an illustrative example. Example 3. Let a distribution function be given by F(x) = 0, x < 0, x + 1 = -2-' 0 s x < 1, = 1, 1 ~ x. Then, for instance, Pr (-3 < X ~ -!-) = Fm - F( -3) = ! - 0 = ! and Pr (X = 0) = F(O) - F(O-) = -!- - 0 = l ~he graph o~ ~(x) is shown in Figure 1.5. We see that F(x) is not always c?ntI~uou.s, nor IS It a step ~unction. Accordingly, the corresponding distribu- tion ~s neither of the continuous type nor of the discrete type. It may be descnbed as a mixture of those types. We shall now point out an important fact about a function of a rand~m variable. Let X denote a random variable with space d. ConsIder the function Y = u(X) of the random variable X. Since X is a fun:tion defined on a sample space ee, then Y = u(X) is a composite fu~ctlOn defined on ee. That is, Y = u(X) is itself a random variable which .~as its own space fJ1J = {y; y = u(x), xEd} and its own probabl1Ity set function. If y E fJ1J, the event Y = u(X) s y occurs when, and o~ly whe~, t~e event X E A c d occurs, where A = {x; u(x) s y}. 
That is, the distribution function of Y is G(y) = Pr (Y ≤ y) = Pr [u(X) ≤ y] = P(A).
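Since the distribution function, not the p.d.f., is the fundamental concept, computations such as those in Example 3 reduce to evaluating F and its one-sided limits. A small sketch of those two computations follows (ours; the left-hand limit F(0−) is approximated numerically).

```python
def F(x):
    """Distribution function of Example 3: a mixture of discrete and continuous parts."""
    if x < 0:
        return 0.0
    if x < 1:
        return (x + 1) / 2
    return 1.0

# Pr(a < X <= b) = F(b) - F(a)
print(F(0.5) - F(-3))            # Pr(-3 < X <= 1/2) = 3/4 - 0 = 0.75

# Pr(X = b) = F(b) - F(b-); approximate the left-hand limit F(0-) numerically
h = 1e-12
print(F(0) - F(0 - h))           # jump at x = 0: 1/2 - 0 = 0.5
```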
  • 23. 36 Distributions oj Random Variables [eh.l Sec. 1.7] The Distribution Function 37 = 0 elsewhere. Accordingly, the distribution function of Y, G(y) = Pr (Y :$ y), is given by Since Y is a random variable of the continuous type, the p.d.f. of Y is g(y) = G'(y) at all points of continuity of g(y). Thus we may write o:$ x, y, z < 00, 0 3 F(x, y, z) - f( ) ox oy oz - x, y, z . = f:f:f: e- U - V - W du dv dw and is equal to zero elsewhere. Incidentally, except for a set of probability measure zero, we have F(x, y, z) = Pr (X :$ z, Y s y, Z s z) Example 5. Let f(x, y, z) = e-(x+y+Z), 0 < z, y, z < 00, zero elsewhere, be the p.d.f. of the random variables X, Y, and Z. Then the distribution function of X, Y, and Z is given by F(xv x2, ••• , xn) = Pr (Xl :$ Xv X 2 :$ x2, ••• , X; :$ xn ) . An illustrative example follows. The distribution function of the n random variables Xl' X 2 , ••• , X; is the point function o:$ Y < 1, o< Y < 1, 1 g(y) = . r' 2vy G(y) = 0, y < 0, = f./Y_ 1- dx = Vii, - .;y = 1, 1 :$ y. The following example illustrates a method of finding the distribution function and the p.d.f. of a function of a random variable. Example 4. Let f(x) = 1-, -1 < x < 1, zero elsewhere, be the p.d.f. of the random variable X. Define the random variable Y by Y = X2. We wish to find the p.d.f. of Y. If y ~ 0, the probability Pr (Y :$ y) is equivalent to Pr (X2 s y) = Pr (-Vii s X s Vii). EXERCISES 1.55. Let F(x, y) be the distribution function of X and Y. Show that x+2 = -4-' -1 :$ x < I, = 1, 1 :$ x. Sketch the graph of F(x) and then compute: (a) Pr (-t < X :$ t); (b) Pr (X = 0); (c) Pr (X = 1); (d) Pr (2 < X :$ 3). x < -1, F(x) = 0, 1.52. Letf(x) be the p.d.f. of a random variable X. Find the distribution function F(x) of X and sketch its graph if: (a) f(x) = 1, x = 0, zero elsewhere. (b) f(x) = t, x = -1, 0, 1, zero elsewhere. (c) f(x) = x/IS, x = 1, 2, 3, 4, 5, zero elsewhere. (d) f(x) = 3(1 - x)2, 0 < x < 1, zero elsewhere. (e) f(x) = 1/x2 , 1 < x < 00, zero elsewhere. (f) f(x) = t, 0 < x < 1 or 2 < x < 4, zero elsewhere. 1.53. Find the median of each of the distributions in Exercise 1.52. 1.54. Given the distribution function F(x, y) = I:ro s:eo j(u, v) du dv. Accordingly, at points of continuity ofj(x, y), we have fj2 F(x, y) = j( ) ox oy x, y . It is left as an exercise to show, in every case, that Pr (a < X :$ b, c < Y s d) = F(b, d) - F(b, c) - F(a, d) + F(a, c), for all real constants a < b, c < d. Let the random variables X and Y have the probability set function P(A), where A is a two-dimensional set. If A is the unbounded set {(u, v); u :$ X, v :$ y}, where X and yare real numbers, we have P(A) = Pr [(X, Y) E A] = Pr (X s x, Y :$ y). This function of the point (x, y) is called the distribution function of X and Y and is denoted by F(x, y) = Pr (X s x, Y s y). If X and Yare random variables of the continuous type that have p.d.f. j(x, y), then
  • 24. 38 Distributions oj Random Variables [eb.l Sec. 1.8] Certain Probability Models 39 Pr (a < X ~ b, e < Y ~ d) = F(b, d) - F(b, e) - F(a, d) + F(a, e), for all real constants a < b, e < d. interval A is a subset of d, the probability of the event A is proportional to the length of A. Hence, if A is the interval [a, x], x ~ b, then = 0 elsewhere. = 1, b < x. = 0 elsewhere, 1 = Pr (a ~ X ~ b) = c(b - a), a ~ x s b, o < x < 1, 0 < Y < 1, f(x) f(x, y) = 1, Accordingly, the p.d.f. of X, f(x) = F'(x), may be written 1 =--, b-a P(A) = Pr (X E A) = Pr (a ~ X ~ x) = c(x - a), The derivative of F(x) does not exist at x = a nor at x = b; but the set {x; x = a, b}is a set of probability measure zero, and we elect to define f(x) to be equal to l/(b - a) at those two points, just as a matter of convenience. We observe that this p.d.f. is a constant on d. If the p.d.f. of one or more variables of the continuous type or of the discrete type is a constant on the space d, we say that the probability is distributed uniformly over d. Thus, in the example above, we say that X has a uniform distribution over the interval [a, b]. Consider next an experiment in which one chooses at random a point (X, Y) from the unit square ~ = d = {(x, y); 0 < x < 1, o< y < I}. Suppose that our interest is not in X or in Y but in Z = X + Y. Once a suitable probability model has been adopted, we shall see how to find the p.d.f. of Z. To be specific, let the nature of the random experiment be such that it is reasonable to assume that the distribution of probability over the unit square is uniform. Then the p.d.f. of X and Y may be written where c is the constant of proportionality. In the expression above, if we take x = b, we have so c = l/(b - a). Thus we will have an appropriate probability model if we take the distribution function of X, F(x) = Pr (X ~ x), to be F(x) = 0, x < a, x-a - --, a ~ x ~ b, -b-a 1.56. Let f(x) = 1, 0 < x < 1, zero elsewhere, be the p.d.f. of X. Find the distribution function and the p.d.f. of Y = fl. Hint. Pr (Y ~ y) = Pr (fl ~ y) = Pr (X s y2), 0 < y < 1. 1.57. Letf(x) = xj6, x = 1,2,3, zero elsewhere, be the p.d.f. of X. Find the distribution function and the p.d.f. of Y = X2. Hint. Note that X is a random variable of the discrete type. 1.58. Letf(x) = (4 - x)j16, -2 < x < 2, zero elsewhere, be the p.d.f. of X. (a) Sketch the distribution function and the p.d.f. of X on the same set of axes. (b) If Y = lXI, compute Pr (Y s 1). (c) If Z = X2, compute Pr (Z ~ t). 1.59. Let f(x, y) = e- X - Y , 0 < x < 00, 0 < Y < 00, zero elsewhere, be the p.d.f. of X and Y. If Z = X + Y, compute Pr (Z s 0), Pr (Z s 6), and, more generally, Pr (Z ~ z), for 0 < z < 00. What is the p.d.f. of Z? 1.60. Explain why, with h > 0, the two limits lim Pr (b - h < X ~ b) h-+O and lim F(b - h) exist. Hint. Note that Pr (b - h < X ~ b) is bounded h-+O below by zero and F(b - h) is bounded above by both F(b) and 1. 1.61. Showthat the function F(x, y) that is equal to 1,providedx + 2y ;::: 1, and that is equal to zero provided x + 2y < 1, cannot be a distribution function of two random variables. Hint. Find four numbers a < b, e < d, so that F(b, d) - F(a, d) - F(b, e) + F(a, e) is less than zero. 1.62. Let F(x) be the distribution function of the random variable X. If m is a number such that F(m) = ,!-, show that m is a median of the distri- bution. 1.63. Let f(x) = j-, -1 < x < 2, zero elsewhere, be the p.d.f. of X. Find the distribution function and the p.d.f. of Y = X2. Hint. 
Consider Pr (X2 ~ y) for two cases: 0 s y < 1 and 1 ~ y < 4. Consider an experiment in which one chooses at random a point from the closed interval [a, b] that is on the real line. Thus the sample space ~ is [a, b]. Let the random variable X be the identity function defined on ~. Thus the space d of X is d = ~. Suppose that it is reasonable to assume, from the nature of the experiment, that if an 1.8 Certain Probability Models
  • 25. 40 Distributions of Random Variables [Chv I Sec. 1.8] Certain Probability Models 41 and this describes the probability modeL Now let the distribution function of Z be denoted by G(z) = Pr (X + Y s z). Then = 1, 2 :s; z, Since G'(z) exists for all values of z, the p.d.f. of Z may then be written = 0 elsewhere. It is clear that a different choice of the p.d.f. f(x, y) that describes the probability model will, in general, lead to a different p.d.f. of Z. We wish presently to extend and generalize some of the notions expressed in the next three sentences. Let the discrete type of random variable X have a uniform distribution of probability over the k points of the space d = {x; x = 1, 2, ... , k}. The p.d.f. of X is then f(x) = 1jk, x E.9I, zero elsewhere. This type of p.d.f. is used to describe the probability model when each of the k points has the same probability, namely, 1jk. The probability model described in the preceding paragraph will now be adapted to a more general situation. Let a probability set function P(C) be defined on a sample space C(? Here C(? may be a set in one, or two, or more dimensions. Let rc be partitioned into k mutually disjoint subsets Cl> C2 , ••• , C/c in such a way that the union of these k mutually disjoint subsets is the sample space C(? Thus the events Cl> C2 , ••• , C/c are mutually exclusive and exhaustive. Suppose that the random experiment is of such a character that it may be assumed that each of the mutually exclusive and exhaustive events Ci , i = 1,2, , k, has the same probability. Necessarily then, P(Ci ) = 1jk, i = 1,2, , k. Let the event E be the union of r of these mutually exclusive events, say r P(E) = P(C1) + P(C2 ) + ... + P(Cr) = k' ( 52) 52! k = 5 = 5! 47!· and ( 13) 13! r1 = 5 = 5! 8! In general, if n is a positive integer and if x is a nonnegative integer with x :s; n, then the binomial coefficient (n ) nl x = x! (n - x)! is equal to the number of combinations of n things taken x at a time. Thus, here, en (13){12)(ll)(10)(9) P(E1 } = C52) = (52)(51)(50)(49)(48) = 0.0005, Frequently, the integer k is called the total number of ways (for this particular partition of c(?) in which the random experiment can ter- minate and the integer r is called the number of ways that are favorable to the event E. So, in this terminology, P(E) is equal to the number of ways favorable to the event E divided by the total number of ways in which the experiment can terminate. It should be emphasized that in order to assign, in this manner, the probability rjk to the event E, we must assume that each of the mutually exclusive and exhaustive events Cl> C2 , ••• , C/c has the same probability 1jk. This assumption then becomes part of our probability modeL Obviously, if this assumption is not realistic in an application, the probability of the event E cannot be computed in this way. We next present two examples that are illustrative of this modeL Example 1. Let a card be drawn at random from an ordinary deck of 52 playing cards. The sample space Cfjis the union of k = 52 outcomes, and it is reasonable to assume that each of these outcomes has the same probability -l-i' Accordingly, if E1 is the set of outcomes that are spades, P(E1 } = g = t because there are r1 = 13 spades in the deck; that is, t is the probability of drawing a card that is a spade. If E 2 is the set of outcomes that are kings, P(E2} = n = -ls because there are r2 = 4 kings in the deck; that is, -ls is the probability of drawing a card that is a king. 
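These counting arguments are easy to verify with math.comb, which evaluates the binomial coefficient n!/(x!(n − x)!) directly. The short check below is ours, not part of Example 1.

```python
from math import comb

# Single draw from a 52-card deck under the equally likely model: P(E) = r/k.
print(13 / 52, 4 / 52)       # P(spade) = 1/4, P(king) = 1/13

# Five cards drawn at random and without replacement.
k = comb(52, 5)              # total number of five-card hands, 52!/(5! 47!)
r1 = comb(13, 5)             # number of hands in which every card is a spade
print(r1 / k)                # ~ 0.000495, the 0.0005 quoted in Example 1
```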
These computations are very easy because there are no difficultiesin the determination of the appropriate values of rand k. However, instead of drawing only one card, suppose that fivecards are taken, at random and without replacement, from this deck. We can think of each five-card hand as being an outcome in a sample space. It is reasonable to assume that each of these outcomes has the same probability. Now if E1 is the set of outcomes in which each card of the hand is a spade, P(E1 } is equal to the number r1 of all spade hands divided by the total number, say k, of five-card hands. It is shown in many books on algebra that 1 s z < 2, r s k. (2 - Z)2 2 ' 1 :s; z < 2, os z < 1, o < z < 1, = 2 - z, g(z) = z, z < 0, E = C1 U C2 U ... U C" rz r-x Z2 =JoJo dydx=Z' = 1 - i1 i1 dy dx = 1 z-1 z-x G(z) = 0, Then
  • 26. 42 Distributions of Random Variables [eb.l Sec. 1.8] Certain Probability Models 43 approximately. Next, let E2 be the set of outcomes in which at least one card is a spade. Then E~ is the set of outcomes in which no card is a spade. There are r~ = C:) such outcomes Hence because the numerator of this fraction is the number of outcomes in E 4 • Example 2. A lot, consisting of 100 fuses, is inspected by the following procedure. Five of these fuses are chosen at random and tested; if all 5 "blow" at the correct amperage, the lot is accepted. If, in fact, there are 20 defective fuses in the lot, the probability of accepting the lot is, under appropriate assumptions, EXERCISES 1.69. Let X have the uniform distribution given by the p.d.f. f(x) = t, x = -2, -1, 0, 1,2, zero elsewhere. Find the p.d.I, of Y = X2. Hint. Note that Y has a distribution of the discrete type. 1.70. Let X and Y have the p.d.f. f(x, y) = 1, 0 < x < 1, 0 < y < 1, zero elsewhere. Find the p.d.f. of the product Z = XY. 1.71. Let 13 cards be taken, at random and without replacement, from an ordinary deck of playing cards. If X is the number of spades in these 13 cards, find the p.d.f. of X. If, in addition, Y is the number of hearts in these 13 cards, find the probability Pr (X = 2, Y = 5). What is the p.d.f. of X and Y? 1.72. Four distinct integers are chosen at random and without replace- ment from the first 10 positive integers. Let the random variable X be the next to the smallest of these four numbers. Find the p.d.f. of X. 1.73. In a lot of 50 light bulbs, there are 2 bad bulbs. An inspector examines 5 bulbs, which are selected at random and without replacement. (a) Find the probability of at least 1 defective bulb among the 5. (In order to solve some of these exercises, the reader must make certain assumptions.) 1.64. A bowl contains 16 chips, of which 6 are red, 7 are white, and 3 are blue. If 4 chips are taken at random and without replacement, find the probability that: (a) each of the 4 chips is red; (b) none of the 4 chips is red; (c) there is at least 1 chip of each color. 1.65. A person has purchased 10 of 1000 tickets sold in a certain raffle. To determine the five prize winners,S tickets are to be drawn at random and without replacement. Compute the probability that this person will win at least one prize. Hint. First compute the probability that the person does not win a prize. 1.66. Compute the probability of being dealt at random and without replacement a 13-card bridge hand consisting of: (a) 6 spades, 4 hearts, 2 diamonds, and 1 club; (b) 13 cards of the same suit. 1.67. Three distinct integers are chosen at random from the first 20 positive integers. Compute the probability that: (a) their sum is even; (b) their product is even. 1.68. There are five red chips and three blue chips in a bowl. The red chips are numbered 1, 2, 3, 4, 5, respectively, and the blue chips are numbered 1, 2, 3, respectively. If two chips are to be drawn at random and without replacement, find the probability that these chips have either the same number or the same color. x = 0, 1,2,3,4,5, and (~)(5~ x) f(x) = Pr (X = x) = C~O) , = 0 elsewhere. This is an example of a discrete type of distribution called a hypergeometric distribution. (~O) C~O) = 0.32, approximately. More generally, let the random variable X be the number of defective fuses among the 5 that are inspected. The space of X is d = {x; x = 0, 1, 2, 3, 4, 5} and the p.d.f. 
of X is given by Now suppose that E 3 is the set of outcomes in which exactly three cards are kings and exactly two cards are queens. We can select the three kings in any one of G) ways and the two queens in anyone of G) ways By a well-known counting principle, the number of outcomes in E3 is r3 = G)G)· Thus P(E3 ) = G)G)/C~)· Finally, let E4 be the set of outcomes in which there are exactly two kings, two queens, and one jack. Then
  • 27. 44 Distributions oj Random Variables [Ch.l Sec. 1.9] Mathematical Expectation 45 (b) How many bulbs should he examine so that the probability of finding at least 1 bad bulb exceeds j ? distribution of probability. Suppose the p.d.f. of Y is g(y). Then E(Y) is given by 1.9 Mathematical Expectation J~<x>yg(y) dy or Lyg(y), 'II One of the more useful concepts in problems involving distributions of random variables is that of mathematical expectation. Let X be a random variable having a p.d.f. j(x) , and let u(X) be a function of X such that J:00 u(x)j(x) dx exists, if X is a continuous type of random variable, or such that 2: u(x)j(x) x according as Y is of the continuous type or of the discrete type. The question is: Does this have the same value as E[u(X)], which was defined above? The answer to this question is in the affirmative, as will be shown in Chapter 4. More generally, let Xl' X 2 , ••• , X; be random variables having p.d.f.j(xv x2,···, xn) and let u(XI, X 2, ... , X n) be a function of these variables such that the n-fold integral (1) I~00 ••• 1:00U(Xv x2, ... , xn)f(xl, x2, ... , xn) dXI dX2... dx; exists, if the random variables are of the continuous type, or such that the n-fold sum exists if the random variables are of the discrete type. The n-fold integral (or the n-fold sum, as the case may be) is called the mathe- matical expectation, denoted by E[u(Xv X 2 , ••• , X n)] , of the function u(Xv X 2, ... , X n). Next, we shall point out some fairly obvious but useful facts about mathematical expectations when they exist. (a) If k is a constant, then E(k) = k. This follows from expression (1) [or (2)] upon setting u = k and recalling that an integral (or sum) of a constant times a function is the constant times the integral (or sum) of the function. Of course, the integral (or sum) of the function j is 1. (b) If k is a constant and v is a function, then E(kv) = kE(v). This follows from expression (1) [or (2)] upon setting u = kv and rewriting expression (1) [or (2)] as k times the integral (or sum) of the product vf. (c) If kl and k2 are constants and VI and V2 are functions, then E(klvl + k2v2) = kIE(vl) + k2E(V2)' This, too, follows from expression (1) [or (2)] upon setting u = klvl + k2v2 because the integral (or sum) of (klv l + k2v2)j is equal to the integral (or sum) of klVd plus the integral (or sum) of k2v2 f. Repeated application of this property shows that if kv k2, ... , km are constants and Vv v2, ... , Vm are functions, then exists, if X is a discrete type of random variable. The integral, or the sum, as the case may be, is called the mathematical expectation (or expected value) of u(X) and is denoted by E[u(X)]. That is, E[u(X)] = I~00 u(x)j(x) dx, if X is a continuous type of random variable, or E[u(X)] = 2: u(x)j(x), x if X is a discrete type of random variable. Remarks. The usual definition of E[u(X)] requires that the integral (or sum) converge absolutely. However, in this book, each u(x) is of such a character that if the integral (or sum) exists, the convergence is absolute. Accordingly, wehave not burdened the student with this additional provision. The terminology "mathematical expectation" or "expected value" has its origin in games of chance. This can be illustrated as follows: Three small similar discs, numbered 1, 2, and 2, respectively, are placed in a bowl and are mixed. A player is to be blindfolded and is to draw a disc from the bowl. 
If he draws the disc numbered 1, he will receive $9; if he draws either disc numbered 2, he will receive $3. It seems reasonable to assume that the player has a "j- claim" on the $9 and a "t claim" on the $3. His" total claim" is 9(j-) + 3(t), or $5. If we take X to be a random variable having the p.d.f. f(x) = xJ3, x = 1, 2, zero elsewhere, and u(x) = 15 - 6x, then 2 E[u(X)] = L u(x)f(x) = L (15 - 6x)(xj3) = 5. That is, the mathematical x x=l expectation of u(X) is precisely the player's" claim" or expectation. The student may observe that u(X) is a random variable Y with its own (2)
  • 28. 46 Distributions of Random Variables [eh.l Sec. 1.9] Mathematical Expectation 47 and, of course, E(6X + 3X2) = 6(t) + 3(i-) = t. Example 2. Let X have the p.d.f. This property of mathematical expectation leads us to characterize the symbol E as a linear operator. Example 1. Let X have the p.d.f. E(X) = I~00 xJ(x) dx = I:(x)2(1 - x) d» = t, E(X2) = I~00 x2J(x) dx = I:(x2)2(1 - x) dx = !, The expected value of the length X is E(X) = t and the expected value of the length 5 - X is E(5 - X) = t. But the expected value of the product of the two lengths is equal to E[X(5 - X)J = I:x(5 - x)(t) dx = 265 =f (t)2. That is, in general, the expected value of a product is not equal to the product of the expected values. Example 5. A bowl contains five chips, which cannot be distinguished by a sense of touch alone. Three of the chips are marked $1 each and the re- maining two are marked $4 each. A player is blindfolded and draws, at random and without replacement, two chips from the bowl. The player is paid an amount equal to the sum of the values of the two chips that he draws and the game is over. If it costs $4.75 cents to play this game, would we care to participate for any protracted period of time? Because we are unable to distinguish the chips by sense of touch, we assume that each of the 10 pairs that can be drawn has the same probability of being drawn. Let the random variable X be the number of chips, of the two to be chosen, that are marked $1. Then, under our assumption, X has the hypergeometric p.d.f. o< x < 1, x = 1,2,3, x J(x) = "6' J(x) = 2(1 - x), = 0 elsewhere. Then = 0 elsewhere. Then = ! + 1. 6 Q + 8i = 'l.l. Example 3. Let X and Y have the p.d.f. J(x, y) = x + y, 0 < x < 1, 0 < Y < 1, = 0 elsewhere. x = 0, 1,2, = 0 elsewhere. If X = x, the player receives u(x) = x + 4(2 - x) = 8 - 3x dollars. Hence his mathematical expectation is equal to 2 E[8 - 3X] = L (8 - 3x)J(x) = t~, x=o Accordingly, or $4.40. = 0 elsewhere. = U· Example 4. Let us divide, at random, a horizontal line segment of length 5 into two parts. If X is the length of the left-hand part, it is reasonable to assume that X has the p.d.f. E(XY2) = I~00 I~00 xy2J(X, y) dx dy = I:f:xy2(X + y) d» dy I i ,II EXERCISES 1.74. Let X have the p.d.f. J(x) = (x + 2)/18, -2 < x < 4, zero else- where. Find E(X), E[(X + 2)3J, and E[6X - 2(X + 2)3). 1.75. Suppose thatJ(x) = t, x = 1,2,3,4, 5, zero elsewhere, is the p.d.f. of the discrete type of random variable X. Compute E(X) and E(X2). Use these two results to find E[(X + 2)2J by writing (X + 2)2 = X2 + 4X + 4. 1.76. If X and Y have the p.d.f.J(x, y) = t, (x, y) = (0,0), (0, 1), (1, 1), zero elsewhere, find E[(X - t)(Y - t)]. 1.77. Let the p.d.f. of X and Y beJ(x, y) = e-X - Y , 0 < x < 00,0 < Y < 00, 0< x < 5, J(x) = t,
  • 29. Distributions of Random Variables [eh.l and since E is a linear operator, 49 Sec. 1.10] Some Special Mathematical Expectations if at> a2 , • • • are the discrete points of the space of positive probability density. This sum of products may be interpreted as a "weighted average" of the squares of the deviations of the numbers at> a2' ... from the mean value fL of those numbers where the" weight" associated with each (aj - fL)2 is f(a j ) . This mean value of the square of the deviation of X from its mean value fL is called the variance of X (or the variance of the distribution). The variance of X will be denoted by a2 , and we define a2, if it exists, by a2 = E[(X - fL)2], whether X is a discrete or a continuous type of random variable. It is worthwhile to observe that a2 = E(X2) - 2fLE(X) + fL2 = E(X2) - 2fL2 + fL2 = E(X2) - fL2. This frequency affords an easier way of computing the variance of X. It is customary to call a (the positive square root of the variance) the standard deviation of X (or the standard deviation of the distribution). The number a is sometimes interpreted as a measure of the dispersion of the points of the space relative to the mean value fL. We note that if the space contains only one point x for which f(x) > 0, then a = O. Remark. Let the random variable X of the continuous type have the p.d.f. f(x) = 1/2a, -a < x < a, zero elsewhere, so that a = alV3 is the This sum of products is seen to be a "weighted average" of the values aI' a2 , a3 , ••• , the "weight" associated with each a, being f(aj ) . This suggests that we call E(X) the arithmetic mean of the values of X, or, more simply, the mean valueof X (or the mean value of the distribution). The mean value fL of a random variable X is defined, when it exists, to be fL = E(X), where X is a random variable of the discrete or of the continuous type. Another special mathematical expectation is obtained by taking u(X) = (X - fL)2. If, initially, X is a random variable of the discrete type having a p.d.f. f(x), then E[(X - fL)2] = L (x - fL)2f(x) x If the discrete points of the space of positive probability density are at> a2 , a3 , ••• , then E(IX - bl) = E(IX - ml) + 2f:(b - x)f(x) dx, provided that the expectations exist. For what value of b is E(IX - bl) a minimum? 1.82. Let f(x) = 2x, 0 < x < 1, zero elsewhere, be the p.d.f. of X. (a) Compute E(fl). (b) Find the distribution function and the p.d.f. of Y = fl. (c) Compute E(Y) and compare this result with the answer obtained in part (a). 1.83. Two distinct integers are chosen at random and without replace- ment from the first six positive integers. Compute the expected value of the absolute value of the difference of these two numbers. 1.10 Some Special Mathematical Expectations Certain mathematical expectations, if they exist, have special names and symbols to represent them. We shall mention now only those associated with one random variable. First, let u(X) = X, where X is a random variable of the discrete type having a p.d.f. f(x). Then E(X) = L xf(x). x 48 zero elsewhere. Let u(X, Y) = X, v(X, Y) = Y, and w(X, Y) = XV. Show that E[u(X, Y)J . E[v(X, Y)J = E[w(X, Y)J. 1.78. Let the p.d.f. of X and Y be f(x, y) = 2, 0 < x < y, 0 < y < 1, zero elsewhere. Let u(X, Y) = X, v(X, Y) = Y and w(X, Y) = XV. Show that E[u(X, Y)J . E[v(X, Y)J =I E[w(X, V)]. 1.79. Let X have a p.d.f.f(x) that is positive at x = -1,0,.1 and is zerlo elsewhere. (a) If f(O) = -1, find E(X2 ) . (b) If f(O) = -1 and If E(X) = 6' determine f( -1) andf(l). 1.80. 
A bowl contains 10 chips, of which 8 are marked $2 each and 2 are marked $5 each. Let a person choose, at random and without replaceme~t, 3 chips from this bowl. If the person is to receive the sum of the resultmg amounts, find his expectation. 1.81. Let X be a random variable of the continuous type that has p.d.f. f(x). If m is the unique median of the distribution ofX and bis a real constant, show that
standard deviation of the distribution of X. Next, let the random variable Y of the continuous type have the p.d.f. g(y) = 1/4a, -2a < y < 2a, zero elsewhere, so that sigma = 2a/sqrt(3) is the standard deviation of the distribution of Y. Here the standard deviation of Y is greater than that of X; this reflects the fact that the probability for Y is more widely distributed (relative to the mean zero) than is the probability for X.

We next define a third special mathematical expectation, called the moment-generating function of a random variable X. Suppose that there is a positive number h such that for -h < t < h the mathematical expectation E(e^{tX}) exists. Thus

E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx} f(x) dx,

if X is a continuous type of random variable, or

E(e^{tX}) = \sum_x e^{tx} f(x),

if X is a discrete type of random variable. This expectation is called the moment-generating function of X (or of the distribution) and is denoted by M(t). That is,

M(t) = E(e^{tX}).

It is evident that if we set t = 0, we have M(0) = 1. As will be seen by example, not every distribution has a moment-generating function, but it is difficult to overemphasize the importance of a moment-generating function when it does exist. This importance stems from the fact that the moment-generating function is unique and completely determines the distribution of the random variable; thus, if two random variables have the same moment-generating function, they have the same distribution. This property of a moment-generating function will be very useful in subsequent chapters. Proof of the uniqueness of the moment-generating function is based on the theory of transforms in analysis, and therefore we merely assert this uniqueness.

Although the fact that a moment-generating function (when it exists) completely determines a distribution of one random variable will not be proved, it does seem desirable to try to make the assertion plausible. This can be done if the random variable is of the discrete type. For example, let it be given that

M(t) = (1/10)e^{t} + (2/10)e^{2t} + (3/10)e^{3t} + (4/10)e^{4t}

is, for all real values of t, the moment-generating function of a random variable X of the discrete type. If we let f(x) be the p.d.f. of X and let a, b, c, d, ... be the discrete points in the space of X at which f(x) > 0, then M(t) = \sum_x e^{tx} f(x), or

(1/10)e^{t} + (2/10)e^{2t} + (3/10)e^{3t} + (4/10)e^{4t} = f(a)e^{at} + f(b)e^{bt} + \cdots.

Because this is an identity for all real values of t, it seems that the right-hand member should consist of but four terms and that each of the four should equal, respectively, one of those in the left-hand member; hence we may take a = 1, f(a) = 1/10; b = 2, f(b) = 2/10; c = 3, f(c) = 3/10; d = 4, f(d) = 4/10. Or, more simply, the p.d.f. of X is

f(x) = x/10, x = 1, 2, 3, 4,
     = 0 elsewhere.

On the other hand, let X be a random variable of the continuous type and let it be given that

M(t) = 1/(1 - t)^2, t < 1,

is the moment-generating function of X. That is, we are given

1/(1 - t)^2 = \int_{-\infty}^{\infty} e^{tx} f(x) dx, t < 1.

It is not at all obvious how f(x) is found. However, it is easy to see that a distribution with p.d.f.

f(x) = x e^{-x}, 0 < x < \infty,
     = 0 elsewhere,

has the moment-generating function M(t) = (1 - t)^{-2}, t < 1. Thus the random variable X has a distribution with this p.d.f., in accordance with the assertion of the uniqueness of the moment-generating function.
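To make the discrete illustration concrete, the following short Python sketch (not part of the text; it assumes SymPy is available) builds E(e^{tX}) from the p.d.f. f(x) = x/10, x = 1, 2, 3, 4, and confirms that it agrees with the given M(t).

```python
import sympy as sp

t = sp.symbols('t', real=True)

# Discrete p.d.f. from the example above: f(x) = x/10, x = 1, 2, 3, 4
f = {x: sp.Rational(x, 10) for x in range(1, 5)}

# M(t) = E(e^{tX}) = sum over the support of e^{tx} f(x)
M = sum(sp.exp(t * x) * p for x, p in f.items())

# The "given" moment-generating function of the example
M_given = (sp.exp(t) + 2*sp.exp(2*t) + 3*sp.exp(3*t) + 4*sp.exp(4*t)) / 10

print(sp.simplify(M - M_given) == 0)   # True: same M(t), hence same distribution
print(M.subs(t, 0))                    # 1, since M(0) = E(1) = 1
```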
Since a distribution that has a moment-generating function M(t) is completely determined by M(t), it would not be surprising if we could obtain some properties of the distribution directly from M(t). For example, the existence of M(t) for -h < t < h implies that derivatives of all orders exist at t = 0. Thus

dM(t)/dt = M'(t) = \int_{-\infty}^{\infty} x e^{tx} f(x) dx,

if X is of the continuous type, or

dM(t)/dt = M'(t) = \sum_x x e^{tx} f(x),

if X is of the discrete type. Upon setting t = 0, we have in either case

M'(0) = E(X) = mu.

The second derivative of M(t) gives, in the same way, M''(0) = E(X^2). Accordingly,

sigma^2 = E(X^2) - mu^2 = M''(0) - [M'(0)]^2.

For example, if M(t) = (1 - t)^{-2}, t < 1, as in the illustration above, then M'(t) = 2(1 - t)^{-3} and M''(t) = 6(1 - t)^{-4}. Hence mu = M'(0) = 2 and sigma^2 = M''(0) - mu^2 = 6 - 4 = 2. Of course, we could have computed mu and sigma^2 from the p.d.f. by

mu = \int_{-\infty}^{\infty} x f(x) dx  and  sigma^2 = \int_{-\infty}^{\infty} x^2 f(x) dx - mu^2,

respectively. Sometimes one way is easier than the other.

In general, if m is a positive integer and if M^{(m)}(t) means the mth derivative of M(t), we have, by repeated differentiation with respect to t,

M^{(m)}(0) = E(X^m).

Now

E(X^m) = \int_{-\infty}^{\infty} x^m f(x) dx  or  \sum_x x^m f(x),

and integrals (or sums) of this sort are, in mechanics, called moments. Since M(t) generates the values of E(X^m), m = 1, 2, 3, ..., it is called the moment-generating function. In fact, we shall sometimes call E(X^m) the mth moment of the distribution, or the mth moment of X.

Example 1. Let X have the p.d.f.

f(x) = (x + 1)/2, -1 < x < 1,
     = 0 elsewhere.

Then the mean value of X is

mu = \int_{-\infty}^{\infty} x f(x) dx = \int_{-1}^{1} x (x + 1)/2 dx = 1/3,

while the variance of X is

sigma^2 = \int_{-\infty}^{\infty} x^2 f(x) dx - mu^2 = \int_{-1}^{1} x^2 (x + 1)/2 dx - (1/3)^2 = 2/9.

Example 2. If X has the p.d.f.

f(x) = 1/x^2, 1 < x < \infty,
     = 0 elsewhere,

then the mean value of X does not exist, since

\int_{1}^{b} x (1/x^2) dx = \lim_{b \to \infty} (\ln b - \ln 1)

does not exist.
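Referring back to the illustration with M(t) = (1 - t)^{-2}, the derivative formulas above are easy to verify symbolically; the sketch below is not from the text and assumes SymPy.

```python
import sympy as sp

t = sp.symbols('t', real=True)

# M(t) = (1 - t)^(-2), t < 1, from the illustration above
M = (1 - t) ** -2

mu = sp.diff(M, t).subs(t, 0)                    # M'(0) = E(X)
second_moment = sp.diff(M, t, 2).subs(t, 0)      # M''(0) = E(X^2)
print(mu, second_moment - mu**2)                 # 2 2

# The same moments computed directly from the p.d.f. f(x) = x e^{-x}, 0 < x < oo
x = sp.symbols('x', positive=True)
f = x * sp.exp(-x)
print(sp.integrate(x * f, (x, 0, sp.oo)),        # mean: 2
      sp.integrate(x**2 * f, (x, 0, sp.oo)) - 4) # variance: 2
```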
Example 3. It is given that the series

1/1^2 + 1/2^2 + 1/3^2 + \cdots

converges to pi^2/6. Then

f(x) = 6/(pi^2 x^2), x = 1, 2, 3, ...,
     = 0 elsewhere,

is the p.d.f. of a discrete type of random variable X. The moment-generating function of this distribution, if it exists, is given by

M(t) = E(e^{tX}) = \sum_x e^{tx} f(x) = \sum_{x=1}^{\infty} 6 e^{tx}/(pi^2 x^2).

The ratio test may be used to show that this series diverges if t > 0. Thus there does not exist a positive number h such that M(t) exists for -h < t < h. Accordingly, the distribution having the p.d.f. f(x) of this example does not have a moment-generating function.

Example 4. Let X have the moment-generating function M(t) = e^{t^2/2}, -\infty < t < \infty. We can differentiate M(t) any number of times to find the moments of X. However, it is instructive to consider this alternative method. The function M(t) is represented by the following MacLaurin's series:

e^{t^2/2} = 1 + (1/1!)(t^2/2) + (1/2!)(t^2/2)^2 + \cdots + (1/k!)(t^2/2)^k + \cdots
          = 1 + (1/2!) t^2 + ((3)(1)/4!) t^4 + \cdots + ((2k - 1)\cdots(3)(1)/(2k)!) t^{2k} + \cdots.

In general, the MacLaurin's series for M(t) is

M(t) = M(0) + (M'(0)/1!) t + (M''(0)/2!) t^2 + \cdots + (M^{(m)}(0)/m!) t^m + \cdots
     = 1 + (E(X)/1!) t + (E(X^2)/2!) t^2 + \cdots + (E(X^m)/m!) t^m + \cdots.

Thus the coefficient of t^m/m! in the MacLaurin's series representation of M(t) is E(X^m). So, for our particular M(t), we have

E(X^{2k}) = (2k - 1)(2k - 3)\cdots(3)(1) = (2k)!/(2^k k!), k = 1, 2, 3, ...,

and E(X^{2k-1}) = 0, k = 1, 2, 3, ....

Remarks. In a more advanced course, we would not work with the moment-generating function because so many distributions do not have moment-generating functions. Instead, we would let i denote the imaginary unit, t an arbitrary real, and we would define phi(t) = E(e^{itX}). This expectation exists for every distribution and it is called the characteristic function of the distribution. To see why phi(t) exists for all real t, we note, in the continuous case, that its absolute value

|phi(t)| = |\int_{-\infty}^{\infty} e^{itx} f(x) dx| \le \int_{-\infty}^{\infty} |e^{itx} f(x)| dx.

However, |f(x)| = f(x) since f(x) is nonnegative, and

|e^{itx}| = |\cos tx + i \sin tx| = \sqrt{\cos^2 tx + \sin^2 tx} = 1.

Thus |phi(t)| \le \int_{-\infty}^{\infty} f(x) dx = 1. Accordingly, the integral for phi(t) exists for all real values of t. In the discrete case, a summation would replace the integral.

Every distribution has a unique characteristic function; and to each characteristic function there corresponds a unique distribution of probability. If X has a distribution with characteristic function phi(t), then, for instance, if E(X) and E(X^2) exist, they are given, respectively, by iE(X) = phi'(0) and i^2 E(X^2) = phi''(0). Readers who are familiar with complex-valued functions may write phi(t) = M(it) and, throughout this book, may prove certain theorems in complete generality. Those who have studied Laplace and Fourier transforms will note a similarity between these transforms and M(t) and phi(t); it is the uniqueness of these transforms that allows us to assert the uniqueness of each of the moment-generating and characteristic functions.
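The even-moment formula of Example 4 can be checked from the MacLaurin's series itself; a short sketch (not part of the text, assuming SymPy) expands e^{t^2/2} and reads off E(X^m) as m! times the coefficient of t^m.

```python
import sympy as sp

t = sp.symbols('t', real=True)
M = sp.exp(t**2 / 2)               # M(t) = e^{t^2/2} from Example 4

# E(X^m) is m! times the coefficient of t^m in the MacLaurin series of M(t)
series = sp.series(M, t, 0, 9).removeO()
for m in range(1, 9):
    print(m, series.coeff(t, m) * sp.factorial(m))
# Odd moments are 0; even moments are (2k)!/(2^k k!): 1, 3, 15, 105 for m = 2, 4, 6, 8
```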
EXERCISES

1.84. Find the mean and variance, if they exist, of each of the following distributions.
(a) f(x) = 3!/[x!(3 - x)!] (1/2)^3, x = 0, 1, 2, 3, zero elsewhere.
(b) f(x) = 6x(1 - x), 0 < x < 1, zero elsewhere.
(c) f(x) = 2/x^3, 1 < x < \infty, zero elsewhere.

1.85. Let f(x) = (1/2)^x, x = 1, 2, 3, ..., zero elsewhere, be the p.d.f. of the random variable X. Find the moment-generating function, the mean, and the variance of X.

1.86. For each of the following probability density functions, compute Pr (mu - 2 sigma < X < mu + 2 sigma).
(a) f(x) = 6x(1 - x), 0 < x < 1, zero elsewhere.
(b) f(x) = (1/2)^x, x = 1, 2, 3, ..., zero elsewhere.

1.87. If the variance of the random variable X exists, show that E(X^2) \ge [E(X)]^2.

1.88. Let a random variable X of the continuous type have a p.d.f. f(x) whose graph is symmetric with respect to x = c. If the mean value of X exists, show that E(X) = c. Hint. Show that E(X - c) equals zero by writing E(X - c) as the sum of two integrals: one from -\infty to c and the other from c to \infty. In the first, let y = c - x; and, in the second, z = x - c. Finally, use the symmetry condition f(c - y) = f(c + y) in the first.

1.89. Let the random variable X have mean mu, standard deviation sigma, and moment-generating function M(t), -h < t < h. Show that

E[(X - mu)/sigma] = 0,  E{[(X - mu)/sigma]^2} = 1,

and

E{exp [t(X - mu)/sigma]} = e^{-mu t/sigma} M(t/sigma), -h sigma < t < h sigma.
1.90. Show that the moment-generating function of the random variable X having the p.d.f. f(x) = 1/3, -1 < x < 2, zero elsewhere, is

M(t) = (e^{2t} - e^{-t})/(3t), t \ne 0,
     = 1, t = 0.

1.91. Let X be a random variable such that E[(X - b)^2] exists for all real b. Show that E[(X - b)^2] is a minimum when b = E(X).

1.92. Let f(x1, x2) = 2x1, 0 < x1 < 1, 0 < x2 < 1, zero elsewhere, be the p.d.f. of X1 and X2. Compute E(X1 + X2) and E{[X1 + X2 - E(X1 + X2)]^2}.

1.93. Let X denote a random variable for which E[(X - a)^2] exists. Give an example of a distribution of a discrete type such that this expectation is zero. Such a distribution is called a degenerate distribution.

1.94. Let X be a random variable such that K(t) = E(t^X) exists for all real values of t in a certain open interval that includes the point t = 1. Show that K^{(m)}(1) is equal to the mth factorial moment E[X(X - 1)\cdots(X - m + 1)].

1.95. Let X be a random variable. If m is a positive integer, the expectation E[(X - b)^m], if it exists, is called the mth moment of the distribution about the point b. Let the first, second, and third moments of the distribution about the point 7 be 3, 11, and 15, respectively. Determine the mean mu of X, and then find the first, second, and third moments of the distribution about the point mu.

1.96. Let X be a random variable such that R(t) = E(e^{t(X - b)}) exists for -h < t < h. If m is a positive integer, show that R^{(m)}(0) is equal to the mth moment of the distribution about the point b.

1.97. Let X be a random variable with mean mu and variance sigma^2 such that the third moment E[(X - mu)^3] about the vertical line through mu exists. The value of the ratio E[(X - mu)^3]/sigma^3 is often used as a measure of skewness. Graph each of the following probability density functions and show that this measure is negative, zero, and positive for these respective distributions (said to be skewed to the left, not skewed, and skewed to the right, respectively).
(a) f(x) = (x + 1)/2, -1 < x < 1, zero elsewhere.
(b) f(x) = 1/2, -1 < x < 1, zero elsewhere.
(c) f(x) = (1 - x)/2, -1 < x < 1, zero elsewhere.

1.98. Let X be a random variable with mean mu and variance sigma^2 such that the fourth moment E[(X - mu)^4] about the vertical line through mu exists. The value of the ratio E[(X - mu)^4]/sigma^4 is often used as a measure of kurtosis. Graph each of the following probability density functions and show that this measure is smaller for the first distribution.
(a) f(x) = 1/2, -1 < x < 1, zero elsewhere.
(b) f(x) = 3(1 - x^2)/4, -1 < x < 1, zero elsewhere.

1.99. Let the random variable X have p.d.f.

f(x) = p, x = -1, 1,
     = 1 - 2p, x = 0,
     = 0 elsewhere,

where 0 < p < 1/2. Find the measure of kurtosis as a function of p. Determine its value when p = 1/4, p = 1/10, p = 1/100, and p = 1/1000. Note that the kurtosis increases as p decreases.

1.100. Let psi(t) = ln M(t), where M(t) is the moment-generating function of a distribution. Prove that psi'(0) = mu and psi''(0) = sigma^2.

1.101. Find the mean and the variance of the distribution that has the distribution function

F(x) = 0, x < 0,
     = x/8, 0 \le x < 2,
     = x^2/16, 2 \le x < 4,
     = 1, 4 \le x.

1.102. Find the moments of the distribution that has moment-generating function M(t) = (1 - t)^{-3}, t < 1. Hint. Differentiate twice the series (1 - t)^{-1} = 1 + t + t^2 + t^3 + \cdots, -1 < t < 1.

1.103. Let X be a random variable of the continuous type with p.d.f. f(x), which is positive provided 0 < x < b < \infty, and is equal to zero elsewhere. Show that

E(X) = \int_0^b [1 - F(x)] dx,

where F(x) is the distribution function of X.
1.11 Chebyshev's Inequality

In this section we shall prove a theorem that enables us to find upper (or lower) bounds for certain probabilities. These bounds, however, are not necessarily close to the exact probabilities and, accordingly, we ordinarily do not use the theorem to approximate a probability. The principal uses of the theorem and a special case of it are in theoretical discussions.

Theorem 6. Let u(X) be a nonnegative function of the random variable X. If E[u(X)] exists, then, for every positive constant c,

Pr [u(X) \ge c] \le E[u(X)]/c.

Proof. The proof is given when the random variable X is of the continuous type; but the proof can be adapted to the discrete case if we replace integrals by sums. Let A = {x; u(x) \ge c} and let f(x) denote the p.d.f. of X. Then

E[u(X)] = \int_{-\infty}^{\infty} u(x) f(x) dx = \int_A u(x) f(x) dx + \int_{A*} u(x) f(x) dx.

Since each of the integrals in the extreme right-hand member of the preceding equation is nonnegative, the left-hand member is greater than or equal to either of them. In particular,

E[u(X)] \ge \int_A u(x) f(x) dx.

However, if x \in A, then u(x) \ge c; accordingly, the right-hand member of the preceding inequality is not increased if we replace u(x) by c. Thus

E[u(X)] \ge c \int_A f(x) dx.

Since

\int_A f(x) dx = Pr (X \in A) = Pr [u(X) \ge c],

it follows that

E[u(X)] \ge c Pr [u(X) \ge c],

which is the desired result.

The preceding theorem is a generalization of an inequality which is often called Chebyshev's inequality. This inequality will now be established.

Theorem 7. Chebyshev's Inequality. Let the random variable X have a distribution of probability about which we assume only that there is a finite variance sigma^2. This, of course, implies that there is a mean mu. Then for every k > 0,

Pr (|X - mu| \ge k sigma) \le 1/k^2,

or, equivalently,

Pr (|X - mu| < k sigma) \ge 1 - 1/k^2.

Proof. In Theorem 6 take u(X) = (X - mu)^2 and c = k^2 sigma^2. Then we have

Pr [(X - mu)^2 \ge k^2 sigma^2] \le E[(X - mu)^2]/(k^2 sigma^2).

Since the numerator of the right-hand member of the preceding inequality is sigma^2, the inequality may be written

Pr (|X - mu| \ge k sigma) \le 1/k^2,

which is the desired result. Naturally, we would take the positive number k to be greater than 1 to have an inequality of interest.

It is seen that the number 1/k^2 is an upper bound for the probability Pr (|X - mu| \ge k sigma). In the following example this upper bound and the exact value of the probability are compared in special instances.

Example 1. Let X have the p.d.f.

f(x) = 1/(2 sqrt(3)), -sqrt(3) < x < sqrt(3),
     = 0 elsewhere.

Here mu = 0 and sigma^2 = 1. If k = 3/2, we have the exact probability

Pr (|X - mu| \ge k sigma) = Pr (|X| \ge 3/2) = 1 - \int_{-3/2}^{3/2} 1/(2 sqrt(3)) dx = 1 - sqrt(3)/2.

By Chebyshev's inequality, the preceding probability has the upper bound 1/k^2 = 4/9. Since 1 - sqrt(3)/2 = 0.134, approximately, the exact probability in this case is considerably less than the upper bound 4/9. If we take k = 2, we have the exact probability Pr (|X - mu| \ge 2 sigma) = Pr (|X| \ge 2) = 0. This again is considerably less than the upper bound 1/k^2 = 1/4 provided by Chebyshev's inequality.
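The comparison in Example 1 is easy to reproduce numerically; the following sketch (not from the text) uses only Python's standard library.

```python
import math

# Uniform p.d.f. on (-sqrt(3), sqrt(3)): mu = 0, sigma^2 = 1 (Example 1)
a = math.sqrt(3.0)

def exact_tail(k):
    """Pr(|X - mu| >= k*sigma) for the uniform(-sqrt(3), sqrt(3)) distribution."""
    if k >= a:
        return 0.0
    return 1.0 - k / a   # 1 minus (length of (-k, k)) / (length of the support)

for k in (1.5, 2.0):
    print(k, exact_tail(k), 1.0 / k**2)   # exact probability versus the bound 1/k^2
# k = 1.5: exact is about 0.134, bound is 0.444...; k = 2: exact 0, bound 0.25
```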
In each instance in the preceding example, the probability Pr (|X - mu| \ge k sigma) and its upper bound 1/k^2 differ considerably. This suggests that this inequality might be made sharper. However, if we want an inequality that holds for every k > 0 and holds for all random variables having finite variance, such an improvement is impossible, as is shown by the following example.

Example 2. Let the random variable X of the discrete type have probabilities 1/8, 6/8, 1/8 at the points x = -1, 0, 1, respectively. Here mu = 0 and sigma^2 = 1/4. If k = 2, then 1/k^2 = 1/4 and Pr (|X - mu| \ge k sigma) = Pr (|X| \ge 1) = 1/4. That is, the probability Pr (|X - mu| \ge k sigma) here attains the upper bound 1/k^2 = 1/4. Hence the inequality cannot be improved without further assumptions about the distribution of X.

EXERCISES

1.104. Let X be a random variable with mean mu and let E[(X - mu)^{2k}] exist. Show, with d > 0, that Pr (|X - mu| \ge d) \le E[(X - mu)^{2k}]/d^{2k}.

1.105. Let X be a random variable such that Pr (X \le 0) = 0 and let mu = E(X) exist. Show that Pr (X \ge 2 mu) \le 1/2.

1.106. If X is a random variable such that E(X) = 3 and E(X^2) = 13, use Chebyshev's inequality to determine a lower bound for the probability Pr (-2 < X < 8).

1.107. Let X be a random variable with moment-generating function M(t), -h < t < h. Prove that Pr (X \ge a) \le e^{-at} M(t), 0 < t < h, and that Pr (X \le a) \le e^{-at} M(t), -h < t < 0. Hint. Let u(x) = e^{tx} and c = e^{ta} in Theorem 6. Note. These results imply that Pr (X \ge a) and Pr (X \le a) are less than the respective greatest lower bounds of e^{-at} M(t) when 0 < t < h and when -h < t < 0.

1.108. The moment-generating function of X exists for all real values of t and is given by

M(t) = (e^{t} - e^{-t})/(2t), t \ne 0, M(0) = 1.

Use the results of the preceding exercise to show that Pr (X \ge 1) = 0 and Pr (X \le -1) = 0. Note that here h is infinite.

Chapter 2

Conditional Probability and Stochastic Independence

2.1 Conditional Probability

In some random experiments, we are interested only in those outcomes that are elements of a subset C1 of the sample space C. This means, for our purposes, that the sample space is effectively the subset C1. We are now confronted with the problem of defining a probability set function with C1 as the "new" sample space.

Let the probability set function P(C) be defined on the sample space C and let C1 be a subset of C such that P(C1) > 0. We agree to consider only those outcomes of the random experiment that are elements of C1; in essence, then, we take C1 to be a sample space. Let C2 be another subset of C. How, relative to the new sample space C1, do we want to define the probability of the event C2? Once defined, this probability is called the conditional probability of the event C2, relative to the hypothesis of the event C1; or, more briefly, the conditional probability of C2, given C1. Such a conditional probability is denoted by the symbol P(C2|C1). We now return to the question that was raised about the definition of this symbol. Since C1 is now the sample space, the only elements of C2 that concern us are those, if any, that are also elements of C1, that is, the elements of C1 \cap C2. It seems desirable, then, to define the symbol P(C2|C1) in such a way that

P(C1|C1) = 1  and  P(C2|C1) = P(C1 \cap C2|C1).

Moreover, from a relative frequency point of view, it would seem logically inconsistent if we did not require that the ratio of the probabilities of the events C1 \cap C2 and C1, relative to the space C1, be the same as the
ratio of the probabilities of these events relative to the space C; that is, we should have

P(C2|C1) = P(C1 \cap C2|C1)/P(C1|C1) = P(C1 \cap C2)/P(C1).

These three desirable conditions imply that the relation

P(C2|C1) = P(C1 \cap C2)/P(C1)

is a suitable definition of the conditional probability of the event C2, given the event C1, provided P(C1) > 0. Moreover, we have:

(a) P(C2|C1) \ge 0.
(b) P(C2 \cup C3 \cup \cdots |C1) = P(C2|C1) + P(C3|C1) + \cdots, provided C2, C3, ... are mutually disjoint sets.
(c) P(C1|C1) = 1.

Properties (a) and (c) are evident; proof of property (b) is left as an exercise. But these are precisely the conditions that a probability set function must satisfy. Accordingly, P(C2|C1) is a probability set function, defined for subsets of C1. It may be called the conditional probability set function, relative to the hypothesis C1; or the conditional probability set function, given C1. It should be noted that this conditional probability set function, given C1, is defined at this time only when P(C1) > 0.

We have now defined the concept of conditional probability for subsets C of a sample space. We wish to do the same kind of thing for subsets A of the space of one or more random variables defined on that sample space. Let P denote the probability set function of the induced probability on that space. If A1 and A2 are subsets of it, the conditional probability of the event A2, given the event A1, is

P(A2|A1) = P(A1 \cap A2)/P(A1),

provided P(A1) > 0. This definition will apply to any space which has a probability set function assigned to it.

Example 1. A hand of 5 cards is to be dealt at random and without replacement from an ordinary deck of 52 playing cards. The conditional probability of an all-spade hand (C2), relative to the hypothesis that there are at least 4 spades in the hand (C1), is, since C1 \cap C2 = C2,

P(C2|C1) = P(C2)/P(C1).

It is worth noting, if we let the random variable X equal the number of spades in a 5-card hand, that a reasonable probability model for X is given by the hypergeometric p.d.f.

f(x) = C(13, x) C(39, 5 - x)/C(52, 5), x = 0, 1, 2, 3, 4, 5,
     = 0 elsewhere,

where C(n, k) denotes the binomial coefficient. Accordingly, we can write

P(C2|C1) = Pr (X = 5)/Pr (X = 4 or 5) = f(5)/[f(4) + f(5)].

From the definition of the conditional probability set function, we observe that

P(C1 \cap C2) = P(C1) P(C2|C1).

This relation is frequently called the multiplication rule for probabilities. Sometimes, after considering the nature of the random experiment, it is possible to make reasonable assumptions so that both P(C1) and P(C2|C1) can be assigned. Then P(C1 \cap C2) can be computed under these assumptions. This will be illustrated in Examples 2 and 3.

Example 2. A bowl contains eight chips. Three of the chips are red and the remaining five are blue. Two chips are to be drawn successively, at random and without replacement. We want to compute the probability that the first draw results in a red chip (C1) and that the second draw results in a blue chip (C2). It is reasonable to assign the following probabilities: P(C1) = 3/8 and P(C2|C1) = 5/7. Thus, under these assignments, we have P(C1 \cap C2) = (3/8)(5/7) = 15/56.

Example 3. From an ordinary deck of playing cards, cards are to be drawn successively, at random and without replacement. The probability that the third spade appears on the sixth draw is computed as follows. Let C1 be the event of two spades in the first five draws and let C2 be the event of a spade on the sixth draw. Thus the probability that we wish to compute is P(C1 \cap C2). It is reasonable to take P(C1) = C(13, 2) C(39, 3)/C(52, 5) and P(C2|C1) = 11/47. The desired probability P(C1 \cap C2) is then the product of these two numbers. More generally, if X + 3 is the number of draws necessary to produce exactly three spades, a reasonable probability model for the random variable X is given by a p.d.f. f(x) that is positive for x = 0, 1, 2, ..., 39 and zero elsewhere. Then the particular probability which we computed is P(C1 \cap C2) = Pr (X = 3) = f(3).
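The conditional probability of Example 1 can be evaluated directly from the hypergeometric model stated there; the sketch below is not part of the text and uses only Python's standard library.

```python
from math import comb

# Hypergeometric p.d.f. for the number of spades X in a 5-card hand (Example 1)
def f(x):
    return comb(13, x) * comb(39, 5 - x) / comb(52, 5)

# P(C2 | C1) = Pr(X = 5) / Pr(X = 4 or 5) = f(5) / (f(4) + f(5))
print(f(5) / (f(4) + f(5)))   # about 0.044: an all-spade hand is rare even given at least 4 spades
```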
The multiplication rule can be extended to three or more events. In the case of three events, we have, by using the multiplication rule for two events,

P(C1 \cap C2 \cap C3) = P[(C1 \cap C2) \cap C3] = P(C1 \cap C2) P(C3|C1 \cap C2).

But P(C1 \cap C2) = P(C1) P(C2|C1). Hence

P(C1 \cap C2 \cap C3) = P(C1) P(C2|C1) P(C3|C1 \cap C2).

This procedure can be used to extend the multiplication rule to four or more events. The general formula for k events can be proved by mathematical induction.

Example 4. Four cards are to be dealt successively, at random and without replacement, from an ordinary deck of playing cards. The probability of receiving a spade, a heart, a diamond, and a club, in that order, is (13/52)(13/51)(13/50)(13/49). This follows from the extension of the multiplication rule. In this computation, the assumptions that are involved seem clear.

EXERCISES

(In order to solve certain of these exercises, the student is required to make assumptions.)

2.1. If P(C1) > 0 and if C2, C3, C4, ... are mutually disjoint sets, show that P(C2 \cup C3 \cup \cdots |C1) = P(C2|C1) + P(C3|C1) + \cdots.

2.2. Prove that P(C1 \cap C2 \cap C3 \cap C4) = P(C1) P(C2|C1) P(C3|C1 \cap C2) P(C4|C1 \cap C2 \cap C3).

2.3. A bowl contains eight chips. Three of the chips are red and five are blue. Four chips are to be drawn successively at random and without replacement. (a) Compute the probability that the colors alternate. (b) Compute the probability that the first blue chip appears on the third draw. (c) If X + 1 is the number of draws needed to produce the first blue chip, determine the p.d.f. of X.

2.4. A hand of 13 cards is to be dealt at random and without replacement from an ordinary deck of playing cards. Find the conditional probability that there are at least three kings in the hand relative to the hypothesis that the hand contains at least two kings.

2.5. A drawer contains eight pairs of socks. If six socks are taken at random and without replacement, compute the probability that there is at least one matching pair among these six socks. Hint. Compute the probability that there is not a matching pair.

2.6. A bowl contains 10 chips. Four of the chips are red, 5 are white, and 1 is blue. If 3 chips are taken at random and without replacement, compute the conditional probability that there is 1 chip of each color relative to the hypothesis that there is exactly 1 red chip among the 3.

2.7. Let each of the mutually disjoint sets C1, ..., Cm have nonzero probability. If the set C is a subset of the union of C1, ..., Cm, show that P(C) = P(C1) P(C|C1) + \cdots + P(Cm) P(C|Cm). If P(C) > 0, prove Bayes' formula:

P(Cj|C) = P(Cj) P(C|Cj)/[P(C1) P(C|C1) + \cdots + P(Cm) P(C|Cm)], j = 1, ..., m.

Hint. P(C) P(Cj|C) = P(Cj) P(C|Cj).

2.8. Bowl I contains 3 red chips and 7 blue chips. Bowl II contains 6 red chips and 4 blue chips. A bowl is selected at random and then 1 chip is drawn from this bowl. (a) Compute the probability that this chip is red. (b) Relative to the hypothesis that the chip is red, find the conditional probability that it is drawn from bowl II.

2.9. Bowl I contains 6 red chips and 4 blue chips. Five of these 10 chips are selected at random and without replacement and put in bowl II, which was originally empty. One chip is then drawn at random from bowl II. Relative to the hypothesis that this chip is blue, find the conditional probability that 2 red chips and 3 blue chips are transferred from bowl I to bowl II.

2.2 Marginal and Conditional Distributions

Let f(x1, x2) be the p.d.f. of two random variables X1 and X2. From this point on, for emphasis and clarity, we shall call a p.d.f. or a distribution function a joint p.d.f. or a joint distribution function when more than one random variable is involved. Thus f(x1, x2) is the joint p.d.f. of the random variables X1 and X2.
Consider the event a < X1 < b, a < b. This event can occur when and only when the event a < X1 < b, -\infty < X2 < \infty occurs; that is, the two events are equivalent, so that they have the same probability. But the probability of the latter event has been defined and is given by

Pr (a < X1 < b, -\infty < X2 < \infty) = \int_a^b \int_{-\infty}^{\infty} f(x1, x2) dx2 dx1

for the continuous case, and by

Pr (a < X1 < b, -\infty < X2 < \infty) = \sum_{a < x1 < b} \sum_{x2} f(x1, x2)

for the discrete case. Now each of

\int_{-\infty}^{\infty} f(x1, x2) dx2  and  \sum_{x2} f(x1, x2)

is a function of x1 alone, say f1(x1). Thus, for every a < b, we have

Pr (a < X1 < b) = \int_a^b f1(x1) dx1   (continuous case),
                = \sum_{a < x1 < b} f1(x1)   (discrete case),

so that f1(x1) is the p.d.f. of X1 alone. Since f1(x1) is found by summing (or integrating) the joint p.d.f. f(x1, x2) over all x2 for a fixed x1, we can think of recording this sum in the "margin" of the x1x2-plane. Accordingly, f1(x1) is called the marginal p.d.f. of X1. In like manner,

f2(x2) = \int_{-\infty}^{\infty} f(x1, x2) dx1   (continuous case),
       = \sum_{x1} f(x1, x2)   (discrete case),

is called the marginal p.d.f. of X2.

Example 1. Let the joint p.d.f. of X1 and X2 be

f(x1, x2) = (x1 + x2)/21, x1 = 1, 2, 3, x2 = 1, 2,
          = 0 elsewhere.

Then, for instance, Pr (X1 = 3) = f(3, 1) + f(3, 2) = 9/21 = 3/7 and Pr (X2 = 2) = f(1, 2) + f(2, 2) + f(3, 2) = 12/21 = 4/7. On the other hand, the marginal p.d.f. of X1 is

f1(x1) = \sum_{x2=1}^{2} (x1 + x2)/21 = (2x1 + 3)/21, x1 = 1, 2, 3,

zero elsewhere, and the marginal p.d.f. of X2 is

f2(x2) = \sum_{x1=1}^{3} (x1 + x2)/21 = (6 + 3x2)/21, x2 = 1, 2,

zero elsewhere. Thus the preceding probabilities may be computed as Pr (X1 = 3) = f1(3) = 3/7 and Pr (X2 = 2) = f2(2) = 4/7.

We shall now discuss the notion of a conditional p.d.f. Let X1 and X2 denote random variables of the discrete type which have the joint p.d.f. f(x1, x2), which is positive on the space of the variables and is zero elsewhere. Let f1(x1) and f2(x2) denote, respectively, the marginal probability density functions of X1 and X2. Take A1 to be the set A1 = {(x1, x2); x1 = x1', -\infty < x2 < \infty}, where x1' is such that P(A1) = Pr (X1 = x1') = f1(x1') > 0, and take A2 to be the set A2 = {(x1, x2); -\infty < x1 < \infty, x2 = x2'}. Then, by definition, the conditional probability of the event A2, given the event A1, is

P(A2|A1) = P(A1 \cap A2)/P(A1) = Pr (X1 = x1', X2 = x2')/Pr (X1 = x1') = f(x1', x2')/f1(x1').

That is, if (x1, x2) is any point at which f1(x1) > 0, the conditional probability that X2 = x2, given that X1 = x1, is f(x1, x2)/f1(x1). With x1 held fast, and with f1(x1) > 0, this function of x2 satisfies the conditions of being a p.d.f. of a discrete type of random variable X2, because f(x1, x2)/f1(x1) is nonnegative and

\sum_{x2} f(x1, x2)/f1(x1) = [1/f1(x1)] \sum_{x2} f(x1, x2) = f1(x1)/f1(x1) = 1.

We now define the symbol f(x2|x1) by the relation

f(x2|x1) = f(x1, x2)/f1(x1),

and we call f(x2|x1) the conditional p.d.f. of the discrete type of random variable X2, given that the discrete type of random variable X1 = x1. In a similar manner we define the symbol f(x1|x2) by the relation

f(x1|x2) = f(x1, x2)/f2(x2),

and we call f(x1|x2) the conditional p.d.f. of the discrete type of random variable X1, given that the discrete type of random variable X2 = x2.
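For the discrete joint p.d.f. of Example 1, the marginal and conditional probabilities can be tabulated directly; this small sketch is not part of the text and uses only Python's standard library.

```python
from fractions import Fraction

# Joint p.d.f. f(x1, x2) = (x1 + x2)/21 on x1 = 1, 2, 3 and x2 = 1, 2 (Example 1)
f = {(x1, x2): Fraction(x1 + x2, 21) for x1 in (1, 2, 3) for x2 in (1, 2)}

f1 = {x1: sum(p for (a, _), p in f.items() if a == x1) for x1 in (1, 2, 3)}
f2 = {x2: sum(p for (_, b), p in f.items() if b == x2) for x2 in (1, 2)}

print(f1)   # marginal of X1, i.e. (2*x1 + 3)/21 in lowest terms
print(f2)   # marginal of X2, i.e. (6 + 3*x2)/21 in lowest terms

# Conditional p.d.f. of X2 given X1 = 3: f(x2 | 3) = f(3, x2) / f1(3)
print({x2: f[(3, x2)] / f1[3] for x2 in (1, 2)})   # {1: 4/9, 2: 5/9}
```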
Now let X1 and X2 denote random variables of the continuous type that have the joint p.d.f. f(x1, x2) and the marginal probability density functions f1(x1) and f2(x2), respectively. We shall use the results of the preceding paragraph to motivate a definition of a conditional p.d.f. of a continuous type of random variable. When f1(x1) > 0, we define the symbol f(x2|x1) by the relation

f(x2|x1) = f(x1, x2)/f1(x1).

In this relation, x1 is to be thought of as having a fixed (but any fixed) value for which f1(x1) > 0. It is evident that f(x2|x1) is nonnegative and that

\int_{-\infty}^{\infty} f(x2|x1) dx2 = \int_{-\infty}^{\infty} f(x1, x2)/f1(x1) dx2 = f1(x1)/f1(x1) = 1.

That is, f(x2|x1) has the properties of a p.d.f. of one continuous type of random variable. It is called the conditional p.d.f. of the continuous type of random variable X2, given that the continuous type of random variable X1 has the value x1. When f2(x2) > 0, the conditional p.d.f. of the continuous type of random variable X1, given that the continuous type of random variable X2 has the value x2, is defined by

f(x1|x2) = f(x1, x2)/f2(x2).

Since each of f(x2|x1) and f(x1|x2) is a p.d.f. of one random variable (whether of the discrete or the continuous type), each has all the properties of such a p.d.f. Thus we can compute probabilities and mathematical expectations. If the random variables are of the continuous type, the probability

Pr (a < X2 < b | X1 = x1) = \int_a^b f(x2|x1) dx2

is called "the conditional probability that a < X2 < b, given that X1 = x1." If there is no ambiguity, this may be written in the form Pr (a < X2 < b | x1). Similarly, the conditional probability that c < X1 < d, given X2 = x2, is

Pr (c < X1 < d | X2 = x2) = \int_c^d f(x1|x2) dx1.

If u(X2) is a function of X2, the expectation

E[u(X2)|x1] = \int_{-\infty}^{\infty} u(x2) f(x2|x1) dx2

is called the conditional expectation of u(X2), given X1 = x1. In particular, if they exist, E(X2|x1) is the mean and E{[X2 - E(X2|x1)]^2|x1} is the variance of the conditional distribution of X2, given X1 = x1. It is convenient to refer to these as the "conditional mean" and the "conditional variance" of X2, given X1 = x1. Of course, we have

E{[X2 - E(X2|x1)]^2|x1} = E(X2^2|x1) - [E(X2|x1)]^2

from an earlier result. In like manner, the conditional expectation of u(X1), given X2 = x2, is given by

E[u(X1)|x2] = \int_{-\infty}^{\infty} u(x1) f(x1|x2) dx1.

With random variables of the discrete type, these conditional probabilities and conditional expectations are computed by using summation instead of integration. An illustrative example follows.

Example 2. Let X1 and X2 have the joint p.d.f.

f(x1, x2) = 2, 0 < x1 < x2 < 1,
          = 0 elsewhere.

Then the marginal probability density functions are, respectively,

f1(x1) = \int_{x1}^{1} 2 dx2 = 2(1 - x1), 0 < x1 < 1,
       = 0 elsewhere,

and

f2(x2) = \int_0^{x2} 2 dx1 = 2x2, 0 < x2 < 1,
       = 0 elsewhere.

The conditional p.d.f. of X1, given X2 = x2, is

f(x1|x2) = 2/(2x2) = 1/x2, 0 < x1 < x2,
         = 0 elsewhere.
Here the conditional mean and conditional variance of X1, given X2 = x2, are, respectively,

E(X1|x2) = \int_{-\infty}^{\infty} x1 f(x1|x2) dx1 = \int_0^{x2} x1 (1/x2) dx1 = x2/2, 0 < x2 < 1,

and

E{[X1 - E(X1|x2)]^2|x2} = \int_0^{x2} (x1 - x2/2)^2 (1/x2) dx1 = x2^2/12, 0 < x2 < 1.

Finally, we shall compare the values of Pr (0 < X1 < 1/2 | X2 = 3/4) and Pr (0 < X1 < 1/2). We have

Pr (0 < X1 < 1/2 | X2 = 3/4) = \int_0^{1/2} f(x1|3/4) dx1 = \int_0^{1/2} (4/3) dx1 = 2/3,

but

Pr (0 < X1 < 1/2) = \int_0^{1/2} f1(x1) dx1 = \int_0^{1/2} 2(1 - x1) dx1 = 3/4.

We shall now discuss the notions of marginal and conditional probability density functions from the point of view of n random variables. All of the preceding definitions can be directly generalized to the case of n variables in the following manner. Let the random variables X1, X2, ..., Xn have the joint p.d.f. f(x1, x2, ..., xn). If the random variables are of the continuous type, then by an argument similar to the two-variable case, we have for every a < b,

Pr (a < X1 < b) = \int_a^b f1(x1) dx1,

where f1(x1) is defined by the (n - 1)-fold integral

f1(x1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x1, x2, ..., xn) dx2 \cdots dxn.

Accordingly, f1(x1) is the p.d.f. of the one random variable X1 and f1(x1) is called the marginal p.d.f. of X1. The marginal probability density functions f2(x2), ..., fn(xn) of X2, ..., Xn, respectively, are similar (n - 1)-fold integrals.

Up to this point, each marginal p.d.f. has been a p.d.f. of one random variable. It is convenient to extend this terminology to joint probability density functions. We shall do this now. Let f(x1, x2, ..., xn) be the joint p.d.f. of the n random variables X1, X2, ..., Xn, just as before. Now, however, let us take any group of k < n of these random variables and let us find the joint p.d.f. of them. This joint p.d.f. is called the marginal p.d.f. of this particular group of k variables. To fix the ideas, take n = 6, k = 3, and let us select the group X2, X4, X5. Then the marginal p.d.f. of X2, X4, X5 is the joint p.d.f. of this particular group of three variables, namely,

\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x1, x2, x3, x4, x5, x6) dx1 dx3 dx6,

if the random variables are of the continuous type.

We shall next extend the definition of a conditional p.d.f. If f1(x1) > 0, the symbol f(x2, ..., xn|x1) is defined by the relation

f(x2, ..., xn|x1) = f(x1, x2, ..., xn)/f1(x1),

and f(x2, ..., xn|x1) is called the joint conditional p.d.f. of X2, ..., Xn, given X1 = x1. The joint conditional p.d.f. of any n - 1 random variables, say X1, ..., X_{i-1}, X_{i+1}, ..., Xn, given Xi = xi, is defined as the joint p.d.f. of X1, X2, ..., Xn divided by the marginal p.d.f. fi(xi), provided fi(xi) > 0. More generally, the joint conditional p.d.f. of n - k of the random variables, for given values of the remaining k variables, is defined as the joint p.d.f. of the n variables divided by the marginal p.d.f. of the particular group of k variables, provided the latter p.d.f. is positive. We remark that there are many other conditional probability density functions; for instance, see Exercise 2.17.

Because a conditional p.d.f. is a p.d.f. of a certain number of random variables, the mathematical expectation of a function of these random variables has been defined. To emphasize the fact that a conditional p.d.f. is under consideration, such expectations are called conditional expectations. For instance, the conditional expectation of u(X2, ..., Xn), given X1 = x1, is, for random variables of the continuous type, given by

E[u(X2, ..., Xn)|x1] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x2, ..., xn) f(x2, ..., xn|x1) dx2 \cdots dxn,

provided f1(x1) > 0 and the integral converges (absolutely). If the random variables are of the discrete type, conditional mathematical expectations are, of course, computed by using sums instead of integrals.
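Example 2 can be verified symbolically; the brief sketch below is not part of the text and assumes SymPy.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', positive=True)

f = 2                                           # joint p.d.f. on 0 < x1 < x2 < 1 (Example 2)
f1 = sp.integrate(f, (x2, x1, 1))               # marginal of X1: 2*(1 - x1)
f2 = sp.integrate(f, (x1, 0, x2))               # marginal of X2: 2*x2
cond = f / f2                                   # f(x1 | x2) = 1/x2 on 0 < x1 < x2

mean = sp.integrate(x1 * cond, (x1, 0, x2))                     # conditional mean: x2/2
var = sp.integrate((x1 - mean)**2 * cond, (x1, 0, x2))          # conditional variance: x2**2/12
print(f1, f2, sp.simplify(mean), sp.simplify(var))

# Compare Pr(0 < X1 < 1/2 | X2 = 3/4) with the unconditional Pr(0 < X1 < 1/2)
print(sp.integrate(cond.subs(x2, sp.Rational(3, 4)), (x1, 0, sp.Rational(1, 2))),  # 2/3
      sp.integrate(f1, (x1, 0, sp.Rational(1, 2))))                                # 3/4
```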
EXERCISES

2.10. Let X1 and X2 have the joint p.d.f. f(x1, x2) = x1 + x2, 0 < x1 < 1, 0 < x2 < 1, zero elsewhere. Find the conditional mean and variance of X2, given X1 = x1, 0 < x1 < 1.

2.11. Let f(x1|x2) = c1 x1/x2^2, 0 < x1 < x2, 0 < x2 < 1, zero elsewhere, and f2(x2) = c2 x2^4, 0 < x2 < 1, zero elsewhere, denote, respectively, the conditional p.d.f. of X1, given X2 = x2, and the marginal p.d.f. of X2. Determine: (a) the constants c1 and c2; (b) the joint p.d.f. of X1 and X2; (c) Pr (1/4 < X1 < 1/2 | X2 = 5/8); and (d) Pr (1/4 < X1 < 1/2).

2.12. Let f(x1, x2) = 21 x1^2 x2^3, 0 < x1 < x2 < 1, zero elsewhere, be the joint p.d.f. of X1 and X2. Find the conditional mean and variance of X1, given X2 = x2, 0 < x2 < 1.

2.13. If X1 and X2 are random variables of the discrete type having p.d.f. f(x1, x2) = (x1 + 2x2)/18, (x1, x2) = (1, 1), (1, 2), (2, 1), (2, 2), zero elsewhere, determine the conditional mean and variance of X2, given X1 = x1, for x1 = 1 or 2.

2.14. Five cards are drawn at random and without replacement from a bridge deck. Let the random variables X1, X2, and X3 denote, respectively, the number of spades, the number of hearts, and the number of diamonds that appear among the five cards. (a) Determine the joint p.d.f. of X1, X2, and X3. (b) Find the marginal probability density functions of X1, X2, and X3. (c) What is the joint conditional p.d.f. of X2 and X3, given that X1 = 3?

2.15. Let X1 and X2 have the joint p.d.f. f(x1, x2) described as follows:

(x1, x2)  | (0, 0) (0, 1) (1, 0) (1, 1) (2, 0) (2, 1)
f(x1, x2) |  1/18   3/18   4/18   3/18   6/18   1/18

and f(x1, x2) is equal to zero elsewhere. Find the two marginal probability density functions and the two conditional means.

2.16. Let us choose at random a point from the interval (0, 1) and let the random variable X1 be equal to the number which corresponds to that point. Then choose a point at random from the interval (0, x1), where x1 is the experimental value of X1, and let the random variable X2 be equal to the number which corresponds to this point. (a) Make assumptions about the marginal p.d.f. f1(x1) and the conditional p.d.f. f(x2|x1). (b) Compute Pr (X1 + X2 \ge 1). (c) Find the conditional mean E(X1|x2).

2.17. Let f(x) and F(x) denote, respectively, the p.d.f. and the distribution function of the random variable X. The conditional p.d.f. of X, given X > x0, x0 a fixed number, is defined by f(x|X > x0) = f(x)/[1 - F(x0)], x0 < x, zero elsewhere. This kind of conditional p.d.f. finds application in a problem of time until death, given survival until time x0. (a) Show that f(x|X > x0) is a p.d.f. (b) Let f(x) = e^{-x}, 0 < x < \infty, zero elsewhere. Compute Pr (X > 2 | X > 1).

2.3 The Correlation Coefficient

Let X, Y, and Z denote random variables that have the joint p.d.f. f(x, y, z). If u(x, y, z) is a function of x, y, and z, then E[u(X, Y, Z)] was defined, subject to its existence, on p. 45. The existence of all mathematical expectations will be assumed in this discussion. The means of X, Y, and Z, say mu1, mu2, and mu3, are obtained by taking u(x, y, z) to be x, y, and z, respectively; and the variances of X, Y, and Z, say sigma1^2, sigma2^2, and sigma3^2, are obtained by setting the function u(x, y, z) equal to (x - mu1)^2, (y - mu2)^2, and (z - mu3)^2, respectively. Consider the mathematical expectation

E[(X - mu1)(Y - mu2)] = E(XY - mu2 X - mu1 Y + mu1 mu2)
                      = E(XY) - mu2 E(X) - mu1 E(Y) + mu1 mu2
                      = E(XY) - mu1 mu2.

This number is called the covariance of X and Y. The covariance of X and Z is given by E[(X - mu1)(Z - mu3)], and the covariance of Y and Z is E[(Y - mu2)(Z - mu3)]. If each of sigma1 and sigma2 is positive, the number

rho12 = E[(X - mu1)(Y - mu2)]/(sigma1 sigma2)

is called the correlation coefficient of X and Y. If the standard deviations are positive, the correlation coefficient of any two random variables is defined to be the covariance of the two random variables divided by the product of the standard deviations of the two random variables. It should be noted that the expected value of the product of two random variables is equal to the product of their expectations plus their covariance.

Example 1. Let the random variables X and Y have the joint p.d.f.

f(x, y) = x + y, 0 < x < 1, 0 < y < 1,
        = 0 elsewhere.

We shall compute the correlation coefficient of X and Y. When only two variables are under consideration, we shall denote the correlation coefficient by rho. Now

mu1 = E(X) = \int_0^1 \int_0^1 x(x + y) dx dy = 7/12.
Similarly, mu2 = E(Y) = 7/12, and

sigma1^2 = E(X^2) - mu1^2 = \int_0^1 \int_0^1 x^2(x + y) dx dy - (7/12)^2 = 11/144,

with sigma2^2 = 11/144 as well. The covariance of X and Y is

E(XY) - mu1 mu2 = \int_0^1 \int_0^1 xy(x + y) dx dy - (7/12)^2 = -1/144.

Accordingly, the correlation coefficient of X and Y is

rho = (-1/144)/sqrt((11/144)(11/144)) = -1/11.

Remark. For certain kinds of distributions of two random variables, say X and Y, the correlation coefficient rho proves to be a very useful characteristic of the distribution. Unfortunately, the formal definition of rho does not reveal this fact. At this time we make some observations about rho, some of which will be explored more fully at a later stage. It will soon be seen that if a joint distribution of two variables has a correlation coefficient (that is, if both of the variances are positive), then rho satisfies -1 \le rho \le 1. If rho = 1, there is a line with equation y = a + bx, b > 0, the graph of which contains all of the probability for the distribution of X and Y. In this extreme case, we have Pr (Y = a + bX) = 1. If rho = -1, we have the same state of affairs except that b < 0. This suggests the following interesting question: when rho does not have one of its extreme values, is there a line in the xy-plane such that the probability for X and Y tends to be concentrated in a band about this line? Under certain restrictive conditions this is in fact the case, and under those conditions we can look upon rho as a measure of the intensity of the concentration of the probability for X and Y about that line.

Next, let f(x, y) denote the joint p.d.f. of two random variables X and Y and let f1(x) denote the marginal p.d.f. of X. The conditional p.d.f. of Y, given X = x, is

f(y|x) = f(x, y)/f1(x)

at points where f1(x) > 0. Then the conditional mean of Y, given X = x, is given by

E(Y|x) = \int_{-\infty}^{\infty} y f(y|x) dy = \int_{-\infty}^{\infty} y f(x, y) dy / f1(x),

when dealing with random variables of the continuous type. This conditional mean of Y, given X = x, is, of course, a function of x alone, say phi(x). In like vein, the conditional mean of X, given Y = y, is a function of y alone, say psi(y).

In case phi(x) is a linear function of x, say phi(x) = a + bx, we say the conditional mean of Y is linear in x, or that Y has a linear conditional mean. When phi(x) = a + bx, the constants a and b have simple values which will now be determined. It will be assumed that neither sigma1^2 nor sigma2^2, the variances of X and Y, is zero. From

(1)    E(Y|x) = \int_{-\infty}^{\infty} y f(x, y) dy / f1(x) = a + bx,

we have

\int_{-\infty}^{\infty} y f(x, y) dy = (a + bx) f1(x).

If both members of Equation (1) are integrated on x, it is seen that

(2)    E(Y) = a + b E(X),  or  mu2 = a + b mu1,

where mu1 = E(X) and mu2 = E(Y). If both members of Equation (1) are first multiplied by x and then integrated on x, we have

(3)    E(XY) = a E(X) + b E(X^2),  or  rho sigma1 sigma2 + mu1 mu2 = a mu1 + b(sigma1^2 + mu1^2),

where rho sigma1 sigma2 is the covariance of X and Y. The simultaneous solution of Equations (2) and (3) yields

a = mu2 - rho (sigma2/sigma1) mu1  and  b = rho (sigma2/sigma1).

That is,

phi(x) = E(Y|x) = mu2 + rho (sigma2/sigma1)(x - mu1)

is the conditional mean of Y, given X = x, when the conditional mean of Y is linear in x. If the conditional mean of X, given Y = y, is linear in y, then that conditional mean is given by

psi(y) = E(X|y) = mu1 + rho (sigma1/sigma2)(y - mu2).
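A symbolic check of Example 1 (not part of the text; it assumes SymPy):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x + y                                            # joint p.d.f. on the unit square (Example 1)

def E(g):
    """Expectation of g(X, Y) under f(x, y) = x + y on 0 < x, y < 1."""
    return sp.integrate(g * f, (x, 0, 1), (y, 0, 1))

mu1, mu2 = E(x), E(y)                                # both 7/12
var1, var2 = E(x**2) - mu1**2, E(y**2) - mu2**2      # both 11/144
cov = E(x*y) - mu1*mu2                               # -1/144
print(mu1, var1, cov, sp.simplify(cov / sp.sqrt(var1 * var2)))   # rho = -1/11
```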
Example 2. Let the random variables X and Y have the linear conditional means E(Y|x) = 4x + 3 and E(X|y) = (1/16)y - 3. In accordance with the general formulas for the linear conditional means, we see that E(Y|x) = mu2 if x = mu1 and E(X|y) = mu1 if y = mu2. Accordingly, in this special case, we have mu2 = 4 mu1 + 3 and mu1 = (1/16) mu2 - 3, so that mu1 = -15/4 and mu2 = -12. The general formulas for the linear conditional means also show that the product of the coefficients of x and y, respectively, is equal to rho^2 and that the quotient of these coefficients is equal to sigma2^2/sigma1^2. Here rho^2 = 4(1/16) = 1/4 with rho = 1/2 (not -1/2), and sigma2^2/sigma1^2 = 64. Thus, from the two linear conditional means, we are able to find the values of mu1, mu2, rho, and sigma2/sigma1, but not the values of sigma1 and sigma2.

We shall next investigate the variance of a conditional distribution under the assumption that the conditional mean is linear. The conditional variance of Y is given by

(4)    E{[Y - E(Y|x)]^2|x} = \int_{-\infty}^{\infty} [(y - mu2) - rho (sigma2/sigma1)(x - mu1)]^2 f(x, y) dy / f1(x)

when the random variables are of the continuous type. This variance is nonnegative and is at most a function of x alone. If, then, it is multiplied by f1(x) and integrated on x, the result obtained will be nonnegative. This result is

\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [(y - mu2) - rho (sigma2/sigma1)(x - mu1)]^2 f(x, y) dy dx
  = \int \int [(y - mu2)^2 - 2 rho (sigma2/sigma1)(y - mu2)(x - mu1) + rho^2 (sigma2^2/sigma1^2)(x - mu1)^2] f(x, y) dy dx
  = E[(Y - mu2)^2] - 2 rho (sigma2/sigma1) E[(X - mu1)(Y - mu2)] + rho^2 (sigma2^2/sigma1^2) E[(X - mu1)^2]
  = sigma2^2 - 2 rho (sigma2/sigma1) rho sigma1 sigma2 + rho^2 (sigma2^2/sigma1^2) sigma1^2
  = sigma2^2 - 2 rho^2 sigma2^2 + rho^2 sigma2^2 = sigma2^2 (1 - rho^2) \ge 0.

That is, if the variance, Equation (4), is denoted by k(x), then E[k(X)] = sigma2^2 (1 - rho^2) \ge 0. Accordingly, rho^2 \le 1, or -1 \le rho \le 1. It is left as an exercise to prove that -1 \le rho \le 1 whether the conditional mean is or is not linear.

Suppose that the variance, Equation (4), is positive but not a function of x; that is, the variance is a constant k > 0. Now if k is multiplied by f1(x) and integrated on x, the result is k, so that k = sigma2^2 (1 - rho^2). Thus, in this case, the variance of each conditional distribution of Y, given X = x, is sigma2^2 (1 - rho^2). If rho = 0, the variance of each conditional distribution of Y, given X = x, is sigma2^2, the variance of the marginal distribution of Y. On the other hand, if rho^2 is near one, the variance of each conditional distribution of Y, given X = x, is relatively small, and there is a high concentration of the probability for this conditional distribution near the mean E(Y|x) = mu2 + rho (sigma2/sigma1)(x - mu1). It should be pointed out that if the random variables X and Y in the preceding discussion are taken to be of the discrete type, the results just obtained are valid.

This section will conclude with a definition and an illustrative example. Let f(x, y) denote the joint p.d.f. of the two random variables X and Y. If E(e^{t1 X + t2 Y}) exists for -h1 < t1 < h1, -h2 < t2 < h2, where h1 and h2 are positive, it is denoted by M(t1, t2) and is called the moment-generating function of the joint distribution of X and Y. As in the case of one random variable, the moment-generating function M(t1, t2) completely determines the joint distribution of X and Y, and hence the marginal distributions of X and Y. In fact,

M(t1, 0) = E(e^{t1 X}) = M(t1)  and  M(0, t2) = E(e^{t2 Y}) = M(t2).

In addition, in the case of random variables of the continuous type, the partial derivatives of M(t1, t2) at t1 = t2 = 0 yield the product moments E(X^k Y^m). For instance, in a simplified notation which appears to be clear,

(5)    mu1 = E(X) = \partial M(0, 0)/\partial t1,   mu2 = E(Y) = \partial M(0, 0)/\partial t2,
       sigma1^2 = E(X^2) - mu1^2 = \partial^2 M(0, 0)/\partial t1^2 - mu1^2,
       sigma2^2 = E(Y^2) - mu2^2 = \partial^2 M(0, 0)/\partial t2^2 - mu2^2,
       E[(X - mu1)(Y - mu2)] = \partial^2 M(0, 0)/\partial t1 \partial t2 - mu1 mu2.
It is fairly obvious that the results of Equations (5) hold if X and Y are random variables of the discrete type. Thus the correlation coefficients may be computed by using the moment-generating function of the joint distribution if that function is readily available. An illustrative example follows.

Example 3. Let the continuous-type random variables X and Y have the joint p.d.f.

f(x, y) = e^{-y}, 0 < x < y < \infty,
        = 0 elsewhere.

The moment-generating function of this joint distribution is

M(t1, t2) = \int_0^{\infty} \int_x^{\infty} exp (t1 x + t2 y - y) dy dx = 1/[(1 - t1 - t2)(1 - t2)],

provided t1 + t2 < 1 and t2 < 1. For this distribution, Equations (5) become

(6)    mu1 = 1, mu2 = 2, sigma1^2 = 1, sigma2^2 = 2, E[(X - mu1)(Y - mu2)] = 1.

Verification of the results of Equations (6) is left as an exercise. If, momentarily, we accept these results, the correlation coefficient of X and Y is rho = 1/sqrt(2). Furthermore, the moment-generating functions of the marginal distributions of X and Y are, respectively,

M(t1, 0) = 1/(1 - t1), t1 < 1,  and  M(0, t2) = 1/(1 - t2)^2, t2 < 1.

These moment-generating functions are, of course, respectively, those of the marginal probability density functions

f1(x) = \int_x^{\infty} e^{-y} dy = e^{-x}, 0 < x < \infty, zero elsewhere,

and

f2(y) = \int_0^{y} e^{-y} dx = y e^{-y}, 0 < y < \infty, zero elsewhere.

EXERCISES

2.18. Let the random variables X and Y have the joint p.d.f.
(a) f(x, y) = 1/3, (x, y) = (0, 0), (1, 1), (2, 2), zero elsewhere.
(b) f(x, y) = 1/3, (x, y) = (0, 2), (1, 1), (2, 0), zero elsewhere.
(c) f(x, y) = 1/3, (x, y) = (0, 0), (1, 1), (2, 0), zero elsewhere.
In each case compute the correlation coefficient of X and Y.

2.19. Let X and Y have the joint p.d.f. described as follows:

(x, y)  | (1, 1) (1, 2) (1, 3) (2, 1) (2, 2) (2, 3)
f(x, y) |  2/15   4/15   3/15   1/15   1/15   4/15

and f(x, y) is equal to zero elsewhere. Find the correlation coefficient rho.

2.20. Let f(x, y) = 2, 0 < x < y, 0 < y < 1, zero elsewhere, be the joint p.d.f. of X and Y. Show that the conditional means are, respectively, (1 + x)/2, 0 < x < 1, and y/2, 0 < y < 1. Show that the correlation coefficient of X and Y is rho = 1/2.

2.21. Show that the variance of the conditional distribution of Y, given X = x, in Exercise 2.20, is (1 - x)^2/12, 0 < x < 1, and that the variance of the conditional distribution of X, given Y = y, is y^2/12, 0 < y < 1.

2.22. Verify the results of Equations (6) of this section.

2.23. Let X and Y have the joint p.d.f. f(x, y) = 1, -x < y < x, 0 < x < 1, zero elsewhere. Show that, on the set of positive probability density, the graph of E(Y|x) is a straight line, whereas that of E(X|y) is not a straight line.

2.24. If the correlation coefficient rho of X and Y exists, show that -1 \le rho \le 1. Hint. Consider the discriminant of the nonnegative quadratic function h(v) = E{[(X - mu1) + v(Y - mu2)]^2}, where v is real and is not a function of X nor of Y.

2.25. Let psi(t1, t2) = ln M(t1, t2), where M(t1, t2) is the moment-generating function of X and Y. Show that

\partial psi(0, 0)/\partial ti, \partial^2 psi(0, 0)/\partial ti^2, i = 1, 2,  and  \partial^2 psi(0, 0)/\partial t1 \partial t2

yield the means, the variances, and the covariance of the two random variables.

2.26. Let X1, X2, and X3 be three random variables with means, variances, and correlation coefficients, denoted by mu1, mu2, mu3; sigma1^2, sigma2^2, sigma3^2; and rho12, rho13, rho23, respectively. If E(X1 - mu1|x2, x3) = b2(x2 - mu2) + b3(x3 - mu3), where b2 and b3 are constants, determine b2 and b3 in terms of the variances and the correlation coefficients.
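Referring back to Example 3, the entries of Equations (6) can be checked directly from the joint moment-generating function; the sketch below is not from the text and assumes SymPy.

```python
import sympy as sp

t1, t2 = sp.symbols('t1 t2')
M = 1 / ((1 - t1 - t2) * (1 - t2))            # joint m.g.f. of Example 3

def moment(k, m):
    """E(X^k Y^m) as the mixed partial derivative of M at (0, 0)."""
    return sp.diff(M, *([t1] * k + [t2] * m)).subs({t1: 0, t2: 0})

mu1, mu2 = moment(1, 0), moment(0, 1)          # 1 and 2
var1 = moment(2, 0) - mu1**2                   # 1
var2 = moment(0, 2) - mu2**2                   # 2
cov = moment(1, 1) - mu1 * mu2                 # 1
print(mu1, mu2, var1, var2, cov, cov / sp.sqrt(var1 * var2))   # rho = sqrt(2)/2
```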
2.4 Stochastic Independence

Let X1 and X2 denote random variables of either the continuous or the discrete type which have the joint p.d.f. f(x1, x2) and marginal probability density functions f1(x1) and f2(x2), respectively. In accordance with the definition of the conditional p.d.f. f(x2|x1), we may write the joint p.d.f. f(x1, x2) as

f(x1, x2) = f(x2|x1) f1(x1).

Suppose we have an instance where f(x2|x1) does not depend upon x1. Then the marginal p.d.f. of X2 is, for random variables of the continuous type,

f2(x2) = \int_{-\infty}^{\infty} f(x2|x1) f1(x1) dx1 = f(x2|x1) \int_{-\infty}^{\infty} f1(x1) dx1 = f(x2|x1),

when f(x2|x1) does not depend upon x1. That is, if the conditional distribution of X2, given X1 = x1, is independent of any assumption about x1, then f2(x2) = f(x2|x1) and f(x1, x2) = f1(x1) f2(x2). These considerations motivate the following definition.

Definition 1. Let the random variables X1 and X2 have the joint p.d.f. f(x1, x2) and the marginal probability density functions f1(x1) and f2(x2), respectively. The random variables X1 and X2 are said to be stochastically independent if, and only if, f(x1, x2) \equiv f1(x1) f2(x2). Random variables that are not stochastically independent are said to be stochastically dependent.

Remarks. Two comments should be made about the preceding definition. First, the product of two nonnegative functions f1(x1) f2(x2) means a function that is positive on a product space. That is, if f1(x1) and f2(x2) are positive on, and only on, the respective spaces A1 and A2, then the product of f1(x1) and f2(x2) is positive on, and only on, the product space A = {(x1, x2); x1 \in A1, x2 \in A2}. For instance, if A1 = {x1; 0 < x1 < 1} and A2 = {x2; 0 < x2 < 3}, then A = {(x1, x2); 0 < x1 < 1, 0 < x2 < 3}. The second remark pertains to the identity. The identity in Definition 1 should be interpreted as follows. There may be certain points (x1, x2) \in A at which f(x1, x2) \ne f1(x1) f2(x2). However, if A0 is the set of points (x1, x2) at which the equality does not hold, then P(A0) = 0. In the subsequent theorems and the subsequent generalizations, a product of nonnegative functions and an identity should be interpreted in an analogous manner.

Example 1. Let the joint p.d.f. of X1 and X2 be

f(x1, x2) = x1 + x2, 0 < x1 < 1, 0 < x2 < 1,
          = 0 elsewhere.

It will be shown that X1 and X2 are stochastically dependent. Here the marginal probability density functions are

f1(x1) = \int_{-\infty}^{\infty} f(x1, x2) dx2 = \int_0^1 (x1 + x2) dx2 = x1 + 1/2, 0 < x1 < 1,
       = 0 elsewhere,

and

f2(x2) = \int_{-\infty}^{\infty} f(x1, x2) dx1 = \int_0^1 (x1 + x2) dx1 = 1/2 + x2, 0 < x2 < 1,
       = 0 elsewhere.

Since f(x1, x2) is not identically equal to f1(x1) f2(x2), the random variables X1 and X2 are stochastically dependent.

The following theorem makes it possible to assert, without computing the marginal probability density functions, that the random variables X1 and X2 of Example 1 are stochastically dependent.

Theorem 1. Let the random variables X1 and X2 have the joint p.d.f. f(x1, x2). Then X1 and X2 are stochastically independent if and only if f(x1, x2) can be written as a product of a nonnegative function of x1 alone and a nonnegative function of x2 alone. That is,

f(x1, x2) \equiv g(x1) h(x2),

where g(x1) > 0, x1 \in A1, zero elsewhere, and h(x2) > 0, x2 \in A2, zero elsewhere.

Proof. If X1 and X2 are stochastically independent, then f(x1, x2) \equiv f1(x1) f2(x2), where f1(x1) and f2(x2) are the marginal probability density functions of X1 and X2, respectively. Thus the condition f(x1, x2) \equiv g(x1) h(x2) is fulfilled.
Conversely, if f(x1, x2) \equiv g(x1) h(x2), then, for random variables of the continuous type, we have

f1(x1) = \int_{-\infty}^{\infty} g(x1) h(x2) dx2 = g(x1) \int_{-\infty}^{\infty} h(x2) dx2 = c1 g(x1)

and

f2(x2) = \int_{-\infty}^{\infty} g(x1) h(x2) dx1 = h(x2) \int_{-\infty}^{\infty} g(x1) dx1 = c2 h(x2),

where c1 and c2 are constants, not functions of x1 or x2. Moreover, c1 c2 = 1 because

1 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x1) h(x2) dx1 dx2 = [\int_{-\infty}^{\infty} g(x1) dx1][\int_{-\infty}^{\infty} h(x2) dx2] = c2 c1.

These results imply that

f(x1, x2) \equiv g(x1) h(x2) \equiv c1 g(x1) c2 h(x2) \equiv f1(x1) f2(x2).

Accordingly, X1 and X2 are stochastically independent.

If we now refer to Example 1, we see that the joint p.d.f.

f(x1, x2) = x1 + x2, 0 < x1 < 1, 0 < x2 < 1,
          = 0 elsewhere,

cannot be written as the product of a nonnegative function of x1 alone and a nonnegative function of x2 alone. Accordingly, X1 and X2 are stochastically dependent.

Example 2. Let the p.d.f. of the random variables X1 and X2 be f(x1, x2) = 8 x1 x2, 0 < x1 < x2 < 1, zero elsewhere. The formula 8 x1 x2 might suggest to some that X1 and X2 are stochastically independent. However, if we consider the space A = {(x1, x2); 0 < x1 < x2 < 1}, we see that it is not a product space. This should make it clear that, in general, X1 and X2 must be stochastically dependent if the space of positive probability density of X1 and X2 is bounded by a curve that is neither a horizontal nor a vertical line.

We now give a theorem that frequently simplifies the calculations of probabilities of events which involve stochastically independent variables.

Theorem 2. If X1 and X2 are stochastically independent random variables with marginal probability density functions f1(x1) and f2(x2), respectively, then

Pr (a < X1 < b, c < X2 < d) = Pr (a < X1 < b) Pr (c < X2 < d)

for every a < b and c < d, where a, b, c, and d are constants.

Proof. From the stochastic independence of X1 and X2, the joint p.d.f. of X1 and X2 is f1(x1) f2(x2). Accordingly, in the continuous case,

Pr (a < X1 < b, c < X2 < d) = \int_a^b \int_c^d f1(x1) f2(x2) dx2 dx1
   = [\int_a^b f1(x1) dx1][\int_c^d f2(x2) dx2]
   = Pr (a < X1 < b) Pr (c < X2 < d);

or, in the discrete case,

Pr (a < X1 < b, c < X2 < d) = \sum_{a < x1 < b} \sum_{c < x2 < d} f1(x1) f2(x2)
   = [\sum_{a < x1 < b} f1(x1)][\sum_{c < x2 < d} f2(x2)]
   = Pr (a < X1 < b) Pr (c < X2 < d),

as was to be shown.

Example 3. In Example 1, X1 and X2 were found to be stochastically dependent. There, in general,

Pr (a < X1 < b, c < X2 < d) \ne Pr (a < X1 < b) Pr (c < X2 < d).

For instance,

Pr (0 < X1 < 1/2, 0 < X2 < 1/2) = \int_0^{1/2} \int_0^{1/2} (x1 + x2) dx1 dx2 = 1/8,

whereas

Pr (0 < X1 < 1/2) = \int_0^{1/2} (x1 + 1/2) dx1 = 3/8

and

Pr (0 < X2 < 1/2) = \int_0^{1/2} (1/2 + x2) dx2 = 3/8.

Not merely are calculations of some probabilities usually simpler when we have stochastically independent random variables, but many mathematical expectations, including certain moment-generating functions, have comparably simpler computations. The following result will prove so useful that we state it in the form of a theorem.

Theorem 3. Let the stochastically independent random variables X1 and X2 have the marginal probability density functions f1(x1) and f2(x2), respectively. The expected value of the product of a function u(X1) of X1 alone and a function v(X2) of X2 alone is, subject to their existence, equal to the product of the expected value of u(X1) and the expected value of v(X2); that is,

E[u(X1) v(X2)] = E[u(X1)] E[v(X2)].
  • 48. 84 Conditional Probability and Stochastic Independence [Ch. 2 Sec. 2.4] Stochastic Independence 85 Proof. The stochastic independence of Xl and X 2 implies that the joint p.d.f. of Xl and X 2 is fl(X l)f2(X2), Thus, we have, by definition of mathematical expectation, in the continuous case, E[u(X1)V(X2)] = I~00 f~00 u(Xl)V(x2)fl(xl)f2(X2) dXl dX2 = [f~oo u(xl)fl(Xl) dXl][I~oo v(x2)f2(X2) dx2] = E[u(Xl)]E[v(X2)]; or, in the discrete case, E[U(Xl)V(X2)] = 2: 2: u(Xl)V(X2)fl(xl)f2(X2) X2 Xl = [tu(xl)fl(Xl)][~ v(x2)f2(X2)] = E[u(Xl)]E[v(X2)], as stated in the theorem. Example 4. Let X and Y be two stochastically independent random variables with means u, and!Jo2 and positive variances a~ and a~, respectively. We shall show that the stochastic independence of X and Y implies that the correlation coefficient of X and Y is zero. This is true because the covariance of X and Y is equal to We shall now prove a very useful theorem about stochastically independent random variables. The proof of the theorem relies heavily upon our assertion that a moment-generating function, when it exists, is unique and that it uniquely determines the distribution of probability. Thereom 4. Let Xl and X 2 denote random variables that have the joint p.d.j. f(xl, x2) and the marginal probability density functions fl (Xl) and f2(X2), respectively. Furthermore, let M (tv t2) denotethe moment- generating function of the distribution. Then Xl and X 2 are stochastically independent if and only if M(tv t2) = M(tv O)M(O, t2)· Proof. If Xl and X 2 are stochastically independent, then M(tv t2 ) = E(et,x, +t2X 2) = E(e!lX,et2X2) = E(e!lX,)E(e!2X2) = M(tv O)M(O, t2). Thus the stochastic independence of Xl and X 2 implies that the moment-generating function of the joint distribution factors into the product of the moment-generating functions of the two marginal distributions. Suppose next that the moment-generating function of the joint distribution of Xl and X 2 is given by M(t l, t2) = M(tv O)M(O, t2). Now Xl has the unique moment-generating function which, in the con- tinuous case, is given by M(tl,O) = I~00 e!lX'fl(Xl) dxl· Similarly, the unique moment-generating function of X 2 , in the con- tinuous case, is given by Thus we have M(tv O)M(O, t2) = [f~00 e!lX'fl(Xl) dXl][f~00 e!2X2f2(X 2) dx2] = f~oo f~oo e!,x, +!2X2fl(x l)f2(X2) dXl dX2' We are given that M(tl, t2) = M(tv O)M(O, t2); so M(tv t2) = I~oo I~00 e!,x, +!2X2fl(Xl)f2(X 2) dXl dx2· But M(tv t2) is the moment-generating function of Xl and X 2. Thus also M(tv t2) = f~00 I~00 e!,x, +!2X2f(x l, x2) dXl dx2· The uniqueness of the moment-generating function implies that the two distributions of probability that are described by I,(xl)f2(x2) and f(xv x2) are the same. Thus f(xv x2 ) =I,(xl)f2(X2), That is, if M(t l, t2) = M(tv O)M(O, t2),then Xl and X 2are stochastically independent. This completes the proof when the random variables are of the continuous type. With random variables of the discrete type, the proof is made by using summation instead of integration. Let the random variables Xv X 2,... , X; have the joint p.d.f. f(xl, X2,.. " xn) and the marginal probability density functions fl(Xl),f2(X2)," .,fn(xn), respectively. The definition of the stochastic independence of Xl and X 2 is generalized to the mutual stochastic independence of Xl' X 2 , ••• , X; as follows: The random variables
  • 49. 86 Conditional Probability and Stochastic Independence [Ch, 2 Sec. 2.4] Stochastic Independence 87 Xl> X 2 , ••• , X n are said to be mutually stochastically independent if and only if f(xl, X2, ... , xn) == fl(xl)f2(X2)' . -fn(xn), It follows immedi- ately from this definition of the mutual stochastic independence of Xl' X 2 , ••• , X n that Pr (al < Xl < bl> a2 < X 2 < b2,·· ., an < X n < bn) = Pr (al < x, < bl) Pr (a2 < X 2 < b2)·· ·Pr (an < x; < bn) n = f1 Pr (a, < X, < b,), 1=1 n where the symbol f1 ep(i) is defined to be j=l n f1 ep(i) = ep(1)ep(2) . .. ep(n). f ee L The theorem that E[u(Xl)V(X2)J = E[u(Xl)JE[v(X2)J for stochastically independent random variables Xl and X 2 becomes, for mutually stochastically independent random variables Xl' X 2 , ••• , X n, Remark. If Xl> X 2 , and X s are mutually stochastically independent, they are pairwise stochastically independent (that is, X, and XJ' i -# j, where i, J = 1, 2, 3 are stochastically independent). However, the following example, due to S. Bernstein, shows that pairwise independence does not necessarily imply mutual independence. Let Xl> X 2 , and X s have the joint p.d.f. f(xl> X2,xs) = t, (Xl> X2,xs) E {(I, 0, 0), (0, 1,0), (0,0, 1), (1, 1, I)}, = °elsewhere. The joint p.d.f. of X, and XJ' i -# j, is };J(xj , XJ) = t, (x" XJ) E {(O, 0), (1,0), (0, 1), (1, I)}, = °elsewhere, whereas the marginal p.d.f. of X, is Xl = 0, 1, = °elsewhere. Obviously, if i -# j, we have Accordingly, the p.d.f. of Y is g(y) = 6y5 , °< y < 1, = °elsewhere. Pr (Y ~ t) = Pr (Xl ~ t, X 2 ~ -t, Xs ~ -!) fl/2 (1/2 (1/2 = Jo Jo Jo 8XlX2XS dXI dX2dxs = (1-)6 = 6~' In a similar manner, we find that the distribution function of Y is f.j(Xj , x;) == };(xtlfj(xj), and thus X, and X J are stochastically independent. However, f(xl, X2,xs) ot. fl(xl)f2(x2)fs(xs). Thus Xl' X 2 , and X s are not mutually stochastically independent. . Example 5. Let Xl> X 2 , and X s be three mutually stochastically mdependent random variables and let each have the p.d.f. f(x) = a, °< x < 1, zero elsewhere. The joint p.d.f. of Xl' X 2, Xs isf(xl)f(x2)f(xs) = 8XIX2XS , °< z, < 1, i = 1, 2, 3, zero elsewhere. Let Y be the maximum of Xl> X 2 , and X s. Then, for instance, we have y < 0, °s y < I, 1 s y. = y6, = 1, G(y) = Pr (Y ~ y) = 0, The moment-generating function of the joint distribution of n random variables Xl> X 2, ••. , X n is defined as follows. Let E[exp (tlXl + t2X2 + ... + tnXn)J n M(tl, t2, ... , tn) = f1 M(O, ... ,0, t., 0, ... ,0) j=l exist for -h, < i, < h" i = 1,2, ... , n, where each h, is positive. This expectation is denoted by M(tl, t2, .. " tn) and it is called the moment- generating function of the joint distribution of Xl' ... , X n (or simply the moment-generating function of Xl, ... , X n). As in the cases of one and two variables, this moment-generating function is unique and uniquely determines the joint distribution of the n variables (and hence all marginal distributions). For example, the moment-generating function of the marginal distribution of XI is M(O, ... , 0, t., 0, ... , 0), i = 1,2, ... , n; that of the marginal distribution of X, and X J is M (0, ... , 0, t., 0, ... , 0, tl , 0, ... , 0); and so on. Theorem 4 of this chapter can be generalized, and the factorization is a necessary and sufficient condition for the mutual stochastic independence of Xl> X 2 , ••• , X n . or
  • 50. 88 Conditional Probability and Stochastic Independence [Ch, 2 Sec. 2.4] Stochastic Independence 89 = 0 elsewhere. In particular, Pr (Y = 3) = g(3) = -§-. Pr (Xl = 0, X2 = 0, X3 = 1) = Pr (Xl = 0) Pr (X2 = 0) Pr (X3 = 1) = (-!-)3 = t- In general, if Y is the number of the trial on which the first head appears, then the p.d f. of Y is Example 6. Let a fair coin be tossed at random on successive independent trials. Let the random variable X, = 1 or X, = 0 according to whether the outcome on the ith toss is a head or a tail, i = 1,2,3, .... Let the p.d.f. of each X, bef(x) = -!-' x = 0, 1,zero elsewhere. Since the trials are independent, we say that the random variables Xl' X 2 , X 3 , ••• are mutually stochastically independent. Then, for example, the probability that the first head appears on the third trial is elsewhere. If Y is the minimum of these four variables, find the distribution function and the p.d.f. of Y. 2.34. A fair die is cast at random three independent times. Let the ~ando.m v.ariable X, be equal to the number of spots which appear on the tth t~lal,.t =.1,2,3. Let the random variable Y be equal to max (X,), Find ~he distribution function and the p.d.f. of Y. Hint. Pr (Y s y) = Pr (X, s y, t = 1,2, 3). 2.35. Suppose a man leaves for work between 8: 00 A.M. and 8: 30 A.M. and takes between 40 and 50 minutes to get to the office. Let X denote the time of departure and let Y denote the time of travel. If we assume that these random variables are stochastically independent and uniformly distributed, find the probability that he arrives at the office before 9: 00 A.M. 2.36. Let M(tl , t2 , t3 ) be the moment-generating function of the random variables Xl' X 2 , and X3 of Bernstein's example, described in the final remark of this section. Show that M(tl , t2 , 0) = M(t l , 0, O)M(O, t2 , 0), M(tl , 0, t3 ) = M(t l , 0, O)M(O, 0, t3 ) , M(O, t2 , t3 ) = M(O, t2 , O)M(O, 0, t3 ) , but M(tl , t2, t3) i= M(tl, 0, O)M(O, t2, O)M(O, 0, t3). Thus Xl' X 2, X3 are pairwise stochastically independent but not mutually stochastically independent. 2.37'. Gene~alize Theorem 1 of this chapter to the case of n mutually stochastically mdependent random variables. 2.38.. Gene~alize Theorem 4 of this chapter to the case of n mutually stochastically mdependent random variables. y = 1,2,3, ... , g(y) = my, EXERCISES 2.27. Show that the random variables Xl and X2 with joint p.d.f. f(XI' X2) = 12xlx2(1 - x2), 0 < Xl < 1, 0 < X2 < 1, zero elsewhere, are stochastically independent. 2.28. If the random variables XI and X2 have the joint p.d.f. f(xl, x2) = 2e-xl-x2, 0 < Xl < X2, 0 < X2 < 00, zero elsewhere, show that Xl and X2 are stochastically dependent. 2.29. Let f(xl, X2) = -il.6"' Xl = 1,2,3,4, and X2 = 1,2,3,4, zero else- where, be the joint p.d.f. of X I and X2' Show that X I and X2are stochastically independent. 2.30. Find Pr (0 < Xl < 1,0 < X 2 < 1) if the random variables Xl and X 2 have the joint p.d.f. f(xl, x2) = 4XI(1 - X2)' 0 < Xl < 1, 0 < X2 < 1, zero elsewhere. 2.31. Find the probability of the union of the events a < Xl < b, -00 < X 2 < 00 and -00 < Xl < 00, C < X 2 < d if Xl and X2 are two stochastically independent variables with Pr (a < Xl < b) = t and Pr (c < X 2 < d) = 1· 2.32. If f(xl , x2) = e- x l- x 2, 0 < Xl < 00, 0 < X2 < 00, zero elsewhere, is the joint p.d.f. of the random variables Xl and X 2 , show that Xl and X2 are stochastically independent and that E(el(X1 +X2)) = (1 - t)-2, t < 1. 2.33. 
Let X1, X2, X3, and X4 be four mutually stochastically independent random variables, each with p.d.f. f(x) = 3(1 - x)^2, 0 < x < 1, zero
  • 51. Chapter 3 Some Special Distributions 3.1 The Binomial, Trinomial, and Multinomial Distributions In Chapter 1 we introduced the uniform distribution and the hyper- geometric distribution. In this chapter we shall discuss some other important distributions of random variables frequently used III statistics. We begin with the binomial distribution. Recall, if n is a positive integer, that Consider the function defined by x = 0,1,2, ... , n, = 0 elsewhere, where n is a positive integer and 0 < p < 1. Under these conditions it is clear that f(x) :2: 0 and that ~ f(x) = ~o (:)px(l - p)n-x = [(1 _ P) + p]n = 1. That is, f(x) satisfies the conditions of being a p.d.f. of a random variable X of the discrete type. A random variable X that has a p.d.f. of the form of f(x) is said to have a binomial distribution, ~nd any such f(x) is called a binomial p.d.f. A binomial distribution will be denoted 90 Sec. 3.1] The Binomial, Trinomial, and Multinomial Distributions 91 by the symbol b(n, P). The constants nand p are called the parameters of the binomial distribution. Thus, if we say that X is b(5, !-), we mean that X has the binomial p.d.f. x = 0,1, ... ,5, = 0 elsewhere. Remark. The binomial distribution serves as an excellent mathematical model in a number of experimental situations. Consider a random experiment, the outcome of which can be classified in but one of two mutually exclusive and exhaustive ways, say, success or failure (for example, head or tail, life or death, effective or noneffective, etc.). Let the random experiment be repeated n independent times. Assume further that the probability of success, say p,is the same on each repetition; thus the probability of failure on each repetition is 1 - p. Define the random variable X" i = 1, 2, ... , n, to be zero, if the outcome of the ith performance is a failure, and to be 1 if that outcome is a success. We then have Pr (X, = 0) = 1 - P and Pr (Xl = 1) = p, i = 1, 2, ... , n. Since it has been assumed that the experiment is to be repeated n independent times, the random variables Xl> X 2 , •• " X n are mutually stochastically independent. According to the definition of X" the sum Y = Xl + X 2 + ... + X; is the number of successes throughout the n repetitions of the random experiment. The following argument shows that Y has a binomial distribution. Let y be an element of {y; y = 0, 1,2, ... , n}. Then Y = y if and only if exactly y of the variables Xl' X 2, ••. , X n have the value 1, and each of the remaining n - y variables is equal to zero. There are (;) ways in which exactly y ones can be assigned to y of the variables Xl> X 2 , •• " X n · Since Xl' X 2 , ••• , X; are mutually stochastically inde- pendent, the probability of each ofthese ways ispY(1 - p)n-y. Now Pr (Y = y) is the sum of the probabilities of these (;) mutually exclusive events; that is, Pr (Y = y) = (;)PY(1 - p)n-y, y = 0, 1,2, .. " n, zero elsewhere. This is the p.d.f. of a binomial distribution. The moment-generating function of a binomial distribution is easily found. It is M(t) = ~ etxf(x) = x~o etx(:)px(l - p)n-x x~o (:) (pet)x(l - p)n-x = [(1 - P) + pet]n
  • 52. Some Special Distributions [Ch. 3 92 for all real values of t. The mean fL and the variance a 2 of X may be computed from M(t). Since M'(t) = n[(1 - p) + petJn-l(pet ) and M"(t) = n[(1 - P) + petJn-l(pet) + n(n - 1)[(1 - P) + pe tJn-2(pet)2, it follows that fL = M'(O) = np Sec. 3.1] The Binomial, Trinomial, and Multinomial Distributions 93 Example 3. If Yis b(n, t), then Pr (Y ;::-: 1) = 1 - Pr (Y = 0) = 1 - (t)n. Suppose we wish to find the smallest value of n that yields Pr (Y ;::-: 1) > 0.80. We have 1 - (j-)n > 0.80 and 0.20 > (t)n. Either by inspection or by use of logarithms, we see that n = 4 is the solution. That is, the probability of at least one success throughout n = 4 independent repetitions of a random experiment with probability of success p = t is greater than 0.80. Example 4. Let the random variable Y be equal to the number of successes throughout n independent repetitions of a random experiment with probability p of success. That is, Y is b(n, P). The ratio Yin is called the relative frequency of success. For every e > 0, we have 1 1 7 8 Pr (0 s X s 1) = L f(x) = 128 + 128 = 128 x=o and a2 = M"(O) - fL2 = np + n(n - 1)p2 - (np)2 = np(1 - P)· Example 1. The binomial distribution with p.d.f. = °elsewhere, has the moment-generating function M(t) = (! + !et )7, has mean I-' = np = t, and has variance a2 = np(l - P) = l Furthermore, if X is the random variable with this distribution, we have Pr ( ~ - pi ;::-: e) = Pr (IY - nPI ;::-: en) = Pr (IY- 1-'1;::-: eJp(1 ~ P) a)' Now, for every fixed e > 0, the right-hand member of the preceding inequality is close to zero for sufficiently large n. That is, ( J n ) P(1 - P) Pr IY - 1-'1 ;::-: e P(1 _ P) a ~ ne2 where I-' = np and a2 = np(1 - P). In accordance with Chebyshev's in- equality with k = eVnlP(1 - P), we have and hence x = 0, 1, 2, ... , 7, (7 )(I)X( 1)7-X f(x) = x 2 1 - 2 ' and Pr (X = 5) = f(5) 7! (1)5 (1)2 21 = 51 21 2 2 = 128' and Example 2. If the moment-generating function of a random variable X is M(t) = (t + j-et)5, then X has a binomial distribution with n = 5 and p = j-; that is, the p.d.f. of X is (5 )(1)X(2)5-X f(x) = x "3 "3 ' = °elsewhere. x=0,1,2, ...,5, Since this is true for every fixed e > 0, we see, in a certain sense, that the relative frequency of success is for large values of n, close to the probability p of success. This result is one form of the law of large numbers. It was alluded to in the initial discussion of probability in Chapter 1 and will be considered again, along with related concepts, in Chapter 5. Example 5. Let the mutually stochastically independent random vari- ables Xl> X 2 , X 3 have the same distribution function F(x). Let Y be the middle value of Xl' X 2 , X 3 • To determine the distribution function of Y, say
  • 53. The binomial distribution can be generalized to the trinomial = °elsewhere. g(y) = G'(y) = 6[F(y)][1 - F(y)Jf(y)· Sec. 3.1] The Binomial, Trinomial, and Multinomial Distributions 95 distribution. If n is a positive integer and aI' a2 , aa are fixed constants, we have n! f(x, y) = xl y! (n _ x _ y)! PfP~P~-X-y, where x and yare nonnegative integers with x + y < nand P P - , 11 21 and Pa are positive proper fractions with PI + P2 + P» = 1; and let f(x, y) = 0 elsewhere. Accordingly, f(x, y) satisfies the conditions of being a joint p.d.f. of two random variables X and Y of the discrete' ty~e; that is,j(x, y) is nonnegative and its sum over all points (x, y) at which f(x, y) is positive is equal to (PI + P2 + Pa)n = 1. The random variables X and Y which have a joint p.d.f. of the formf(x, y) are said to have a trinomial distribution, and any such f(x, y) is called a tri- nomial p.d.j. The moment-generating function of a trinomial distri- bution, in accordance with Equation (1), is given by n n-x n! M(tl> t2) = L L (Pletl)X(het2)ypn-X-y x=o y=o xl y! (n - x _ y)! a = (Pletl + P2et2 + Pa)n for all real values of tl and t2 • The moment-generating functions of the marginal distributions of X and Yare, respectively, M(tl> 0) = (Pletl + h + Pa)n = [(1 - PI) + Pletl]n n n-x n' (1) L L . aXaYan-x-y zi e O y=o xl yl (n - x _ y)! 1 2 a _ ~ n! af n-x (n - x)! - L, L a~a~-X-Y zree O x! (n - x)! y=o y! (n - x - y)! n n! x~o xl (n _ x)! af(a2 + aa)n-x = (al + a2 + aa)n. Let the function f(x, y) be given by and M(O, t2) = (PI + het2 + Pa)n = [(1 - P2) + P2et2]n. We see immediately, from Theorem 4, Section 2.4, that X and Yare stochas.tically dependent. In addition, X is b(n, PI) and Y is b(n, P2)' Accordmgly, the means and the variances of X and Yare, respectively, 1-'-1 = npl' 1-'-2 = np2' ai = nPl(1 - PI), and a~ = np2(1 - P2)' y = 0, 1,2, ... , y = 0,1,2, ... , Some Special Distributions [eh.3 g(y) = P(l - P)Y, ( y + r - 1)pr-l(l _ P)Y r - 1 zero elsewhere, and the moment-generating function M(t) = P[l - (1 - p)etJ-l. In this special case, r = I, we say that Y has a geometric distribution. A distribution with a p.d.f. of the form g(y) is called a negative binomial distribution; and any such g(y) is called a negative binomial p.d.f. The distribution derives its name from the fact that g(y) is a general term in the expansion of pr[l - (1 - P)]-r. It is left as an exercise to show that the moment-generating function of this distribution is M(t) = pr[l - (1 - p)etJ -r, for t < -In (1 - P). If r = 1, then Y has the p.d.f. G(y) = G)[F(y)J2[1 - F(y)J + [F(y)J3. of obtaining exactly r - 1 successes in the first y + r - 1 trials and the probability p of a success on the (y + r)th trial. Thus the p.d.f. g(y) of Y is given by Example 6. Consider a sequence of independent repetitions of a random experiment with constant probability p of success. Let the random variable Y denote the total number of failures in this sequence before the rth success; that is, Y + r is equal to the number of trials necessary to produce exactly r successes. Here r is a fixed positive integer. To determine the p.d.f. of Y,let y be an element of {y; y = 0, 1, 2, ...}. Then, by the multiplication rule of probabilities, Pr (Y = y) = g(y) is equal to the product of the probability If F(x) is a continuous type of distribution function so that the p.d.f. of X is F'(x) = f(x), then the p.d.f. of Y is G(y) = Pr (Y ~ y), we note that Y ~ y if and only if at least two of the random variables Xv X 2 , Xa are less than or equal to y. 
Let us say that the ith "trial" is a success if X. ~ y, i = 1,2,3; here each "trial" has the probability of success F(y). In this terminology, G(y) = Pr (Y ~ y) is then the probability of at least two successes in three independent trials. Thus 94
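As a concrete check of this result (my addition, not in the original), take F(y) = y, 0 < y < 1, so that X1, X2, X3 are uniform on (0, 1); then G(y) = 3y^2(1 - y) + y^3 = 3y^2 - 2y^3. The short simulation below, assuming numpy is available, compares this with the empirical distribution function of the middle value.

```python
# Monte Carlo check of Example 5 with F(y) = y (three Uniform(0, 1) variables);
# the middle value Y should have G(y) = 3*y**2*(1 - y) + y**3.
import numpy as np

rng = np.random.default_rng(0)
samples = np.median(rng.uniform(size=(100_000, 3)), axis=1)   # middle of each triple

for y in (0.25, 0.5, 0.75):
    empirical = np.mean(samples <= y)
    exact = 3 * y**2 * (1 - y) + y**3
    print(f"y = {y}: empirical {empirical:.4f}, exact {exact:.4f}")
```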
  • 54. 96 Some Special Distributions [Ch, 3 Sec. 3.1] The Binomial, Trinomial, and Multinomial Distributions 97 y=O,I, ...,n-x, Consider next the conditional p.d.f. of Y, given X = x. We have f(yIX) = (n - x)! (~)Y(~)n-x-y y! (n - x - y)! 1 - Pl 1 - Pl ' = 0 elsewhere. Thus the conditional distribution of Y, given X = x, is ben - x, h/(1 - Pl)]. Hence the conditional mean of Y, given X = x, is the linear function the multinomial p.d.f. of k - 1 random variables Xv X 2 , ••• , X"_l of the discrete type. The moment-generating function of a multinomial distribution is given by M(tv ... , t"-l) = (Pletl + ... + p"_letk-l + p,,)n for all real values of tv t2 , ••• , t"-l' Thus each one-variable marginal p.d.f. is binomial, each two-variable marginal p.d.f, is trinomial, and so on. EXERCISES Pr (tt - 2a < X < tt + 2a) = X~l (:)Gr(jr-x • 3.3. If X is b(n,P), show that n P(1 - P) and 3.1. If the moment-generating function of a random variable X is (t + i et)5, find Pr (X = 2 or 3). 3.2. The moment-generating function of a random variable X is (i + tet)9. Show that Now recall (Example 2, Section 2.3) that the square of the correlation coefficient, say p2, is equal to the product of -P2/(1 - Pl) and -Pl/(l - P2), the coefficients of x and y in the respective conditional means. Since both of these coefficients are negative (and thus p is negative), we have E(YIX) = (n - x) (~). 1 - P1 Likewise, we find that the conditional distribution of X, given Y = y, is ben - y, Pl/(1 - P2)] and thus E(Xly) = (n - y) (1 ~lpJ' The trinomial distribution is generalized to the multinomial distri- bution as follows. Let a random experiment be repeated n independent times. On each repetition the experiment terminates in but one of k mutually exclusive and exhaustive ways, say C1, C2, ... , Ck- Let PI be the probability that the outcome is an element of CI and let PI remain constant throughout the n independent repetitions, i = 1,2, ... , k. Define the random variable XI to be equal to the number of outcomes which are elements of Ct, i = 1, 2, ... , k - 1. Furthermore, let Xl' X 2, •• • , X"-l be nonnegative integers so that Xl + X 2 + ... + X"_l ~ n. Then the probability that exactly Xl terminations of the experiment are in Cv , exactly X"-l terminations are in C"-I' and hence exactly n - (Xl + + X"_l) terminations are in C" is n! - - - - - - PIXl ... P"X ..!ylp%k, Xl! ... X"-l! X,,! where x" is merely an abbreviation for n - (Xl + ... + X"-l)' This is 3.4. Let the mutually stochastically independent random variables Xl' X 2 , X3 have the same p.d f. f(x) = 3x2 , 0 < X < 1, zero elsewhere. Find the probability that exactly two of these three variables exceed -t. 3.5. Let Y be the number of successes in n independent repetitions of a random experiment having the probability of success P = i. If n = 3, compute Pr (2 s Y); if n = 5, compute Pr (3 ~ Y). 3.6. Let Y be the number of successes throughout n independent repe- titions of a random experiment having probability of success P = t Deter- mine the smallest value of n so that Pr (1 ~ Y) ~ 0.70. 3.7. Let the stochastically independent random variables Xl and X2 have binomial distributions wrth parameters nl = 3, PI = i and n2= 4, P2 = -t, respectively. Compute Pr (Xl = X 2) . Hint. List the four mutually exclusive ways that Xl = X2 and compute the probability of each. 3.8. Let Xl' X 2 , ••• , X"_l have a multinomial distribution (a) Find the moment-generating function of X 2 , X 3 , ••. , X"-l' (b) What is the p.d.f. of X 2 , X 3 , ••• , X"_l? 
(c) Determine the conditional p.d.f. of X1, given that X2 = x2, ..., Xk-1 = xk-1. (d) What is the conditional expectation E(X1 | x2, ..., xk-1)?
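A quick numerical aside (mine; the scipy.stats calls and the parameter values are assumptions, not part of the text): each one-variable marginal of a trinomial distribution is binomial, as claimed above, and this can be verified by summing the trinomial p.d.f. over the other variable.

```python
# Check that the X-marginal of a trinomial distribution is b(n, p1):
# sum the trinomial p.d.f. over all admissible y for each x.
from scipy.stats import binom, multinomial

n, p1, p2 = 10, 0.2, 0.3      # illustrative parameter values (my choice)
p3 = 1 - p1 - p2

for x in range(n + 1):
    marginal = sum(
        multinomial.pmf([x, y, n - x - y], n, [p1, p2, p3])
        for y in range(n - x + 1)
    )
    assert abs(marginal - binom.pmf(x, n, p1)) < 1e-9
print("X-marginal of the trinomial agrees with b(n, p1)")
```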
  • 55. 98 Some Special Distributions [Ch, 3 Sec. 3.2] The Poisson Distribution 99 where m > 0. Since m > 0, thenf(x) ~ °and x = 0,1,2, ... , mse:» =---, xl = °elsewhere, f(x) Recall that the series that is,f(x) satisfies the conditions of being a p.d.f. of a discrete type of random variable. A random variable that has a p.d.f. of the form f(x) is said to have a Poisson distribution, and any such f(x) is called a Poisson p.d.j. Remarks. Experience indicates that the Poisson p.d.f, may be used in a number of applications with quite satisfactory results. For example, let the random variable X denote the number of alpha particles emitted by a radioactive substance that enter a prescribed region during a prescribed interval of time. With a suitable value of m, it is found that X may be assumed to have a Poisson distribution. Again let the random variable X denote the number of defects on a manufactured article, such as a refrigerator door. Upon examining many of these doors, it is found, with an appropriate value of m, that X may be said to have a Poisson distribution. The number of automobile accidents in some unit of time (or the number of insurance claims in some unit of time) is often assumed to be a random variable which has a Poisson distribution. Each of these instances can be thought of as a process that generates a number of changes (accidents, claims, etc.) in a fixed interval (of time or space and so on). If a process leads to a Poisson distri- bution, that process is called a Poisson process. Some assumptions that ensure a Poisson process will now be enumerated. Let g(x, w) denote the probability of x changes in each interval of length w. Furthermore, let the symbol o(h) represent any function such that lim [o(h)fh] = 0; for example, h2 = o(h) and o(h) + o(h) = o(h). The Poisson h-+O postulates are the following: (a) g(l, h) = Ah + o(h), where Ais a positive constant and h > O. converge, for all values of m, to em. Consider the function f(x) defined by 3.2 The Poisson Distribution x2 = 0, 1, ... , Xl> Xl = 1,2, 3,4, 5, zero elsewhere, be the joint p.d.f. of Xl and X 2 . Determine: (a) E(X2 ) , (b) u(xl ) = E(X2 !X I ) , and (c) E[u(XI ) ]. Compare the answers to parts (a) 5 Xl and (c). Hint. Note that E(X2) = L L x2f(xl , x2) and use the fact that xl=l X2=O i y(n) (t)n = nf2. Why? y=O Y 3.15. Let an unbiased die be cast at random seven independent times. Compute the conditional probability that each side appears at least once relative to the hypothesis that side 1 appears exactly twice. 3.16. Compute the measures of skewness and kurtosis of the binomial distribution b(n, Pl. 3.17. Let 3.9. Let X be b(2,P) and let Y be b(4,Pl. If Pr (X ~ 1) = i, find Pr (Y ~ 1). 3.10. If x = r is the unique mode of a distribution that is b(n, P), show that (n + I)P - 1 < r < (n + l)p. Hint. Determine the values of x for which the ratio f(x + l)ff(x) > 1. 3.11. One of the numbers 1, 2, ... , 6 is to be chosen by casting an un- biased die. Let this random experiment be repeated five independent times. Let the random variable Xl be the number of terminations in the set {x; x = 1,2, 3} and let the random variable X 2 be the number of termina- tions in the set {x; x = 4, 5}. Compute Pr (Xl = 2, X 2 = 1). 3.12. Show that the moment-generating function of the negative binomial distribution is M(t) = PT [1 - (1 - p)et] -T. Find the mean and the variance of this distribution. Hint. In the summation representing M(t), make use of the MacLaurin's series for (1 - w) -T. 3.13. 
Let X1 and X2 have a trinomial distribution. Differentiate the moment-generating function to show that their covariance is -np1p2. 3.14. If a fair coin is tossed at random five independent times, find the conditional probability of five heads relative to the hypothesis that there are at least four heads. 3.18. Three fair dice are cast. In 10 independent casts, let X be the number of times all three faces are alike and let Y be the number of times only two faces are alike. Find the joint p.d.f. of X and Y and compute E(6XY).
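To make the relative-frequency discussion of Example 4 computable, the following sketch (my addition, assuming scipy.stats is available) compares the exact probability Pr(|Y/n - p| ≥ ε) for Y distributed b(n, p) with the Chebyshev bound p(1 - p)/(nε²); the bound is crude, but both quantities go to zero as n grows.

```python
# Exact tail probability of the relative frequency Y/n versus the Chebyshev
# bound p(1 - p)/(n*eps**2) that appears in Example 4.
from scipy.stats import binom

p, eps = 0.5, 0.05
for n in (100, 1_000, 10_000):
    k_lo = round(n * (p - eps))   # here n*(p - eps) is an integer, so Y <= k_lo means Y/n <= p - eps
    k_hi = round(n * (p + eps))   # likewise, Y >= k_hi means Y/n >= p + eps
    exact = binom.cdf(k_lo, n, p) + binom.sf(k_hi - 1, n, p)
    bound = p * (1 - p) / (n * eps**2)
    print(f"n = {n}: exact {exact:.4f}, Chebyshev bound {bound:.4f}")
```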
  • 56. 100 Some Special Distributions [Ch. 3 Sec. 3.2] The Poisson Distribution 101 co (b) L g(x, h) = o(h). :1:=2 (c) The numbers of changes in nonoverlapping intervals are stochastically independent. Postulates (a) and (c) state, in effect, that the probability of one change in a short interval h is independent of changes in other nonoverlapping intervals and is approximately proportional to the length of the interval. The sub- stance of (b) is that the probability of two or more changes in the same short interval h is essentially equal to zero. If x = 0, we take g(O, 0) = 1. In accordance with postulates (a) and (b), the probability of at least one change in an interval of length h is >"h + o(h) + o(h) = M + o(h). Hence the probability of zero changes in this interval of length h is 1 - >"h - o(h). Thus the probability g(O, w + h) of zero changes in an interval of length w + his, in accordance with postulate (c), equal to the product of the probability g(O, w) of zero changes in an interval of length wand the probability [1 - >"h - o(h)] of zero changes in a nonoverlapping interval of length h. That is, g(O, w + h) = g(O, w)[1 - M - o(h)]. Then g(O, w + h) - g(O, w) = _ (0 ) _ o(h)g(O, w) h IIg ,w s": If we take the limit as h -+ 0, we have Dw[g(O, w)] = - ,g(0, w). The solution of this differential equation is g(O, w) = ce-~w. The condition g(O, 0) = 1 implies that c = 1; so g(O, w) = e-~w. for x = 1,2,3, .... It can be shown, by mathematical induction, that the solutions to these differential equations, with boundary conditions g(x, 0) = ° for x = 1, 2, 3, ... , are, respectively, x = 1,2,3, .... Hence the number of changes X in an interval of length w has a Poisson distribution with parameter m = >..w. The moment-generating function of a Poisson distribution is given by for all real values of t. Since and then fJ- = M'(O) = m and That is, a Poisson distribution has fJ- = (72 = m > O. On this account, a Poisson p.d.f. is frequently written If x is a positive integer, we take g(x, 0) = 0. The postulates imply that g(x, w + h) = [g(x, w)][1 - >"h - o(h)] + [g(x - 1, w)][M + o(h)] + o(h). Accordingly, we have J(x) x = 0,1,2, ... , g(x, w + h) - g(x, w) ( ) ( 1) o(h) h = - IIg z, W + IIg X - ,w + h and Dw[g(x, w)] = ->..g(x, w) + ,g(x - 1, w), = 0 elsewhere. Thus the parameter m in a Poisson p.d.f. is the mean fJ-. Table I in Appendix B gives approximately the distribution function of the Poisson distribution for various values of the parameter m = fJ-.
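The distribution function tabulated in Table I can also be computed directly; the sketch below (my addition, assuming scipy.stats is available) confirms that μ = σ² = m and evaluates a couple of Poisson probabilities of the kind worked in the examples that follow.

```python
# Poisson distribution with parameter m: mean and variance both equal m,
# and poisson.cdf plays the role of Table I of Appendix B.
from scipy.stats import poisson

m = 2.0
mean, var = poisson.stats(m, moments="mv")
print(mean, var)                               # both 2.0

# Pr(1 <= X) = 1 - Pr(X = 0) = 1 - e**(-2), about 0.865 (anticipates Example 1 below)
print(1 - poisson.pmf(0, m))

# Pr(X = 3) = Pr(X <= 3) - Pr(X <= 2) for m = 4, about 0.195 (anticipates Example 2 below)
print(poisson.cdf(3, 4) - poisson.cdf(2, 4))
```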
  • 57. 102 Some Special Distributions [eh.3 Sec. 3.3] The Gamma and Chi-Square Distributions 103 = 0 elsewhere. Example 1. Suppose that X has a Poisson distribution with fJ- = 2. Then the p.d.f. of X is The variance of this distribution is a2 = fJ- = 2. If we wish to compute Pr (1 :s: X), we have Pr (1 s X) = 1 - Pr (X = 0) = 1 - j(O) = 1 - e- 2 = 0.865, approximately, by Table I of Appendix B. Example 2. If the moment-generating function of a random variable X is M(t) = e4(e l - 1), then X has a Poisson distribution with fJ- = 4. Accordingly, by way of example, 2xe- 2 j(x) =-, xl x = 0, 1,2, ... , 3.20. The moment-generating function of a random variable X is e4 (e' - 1). Show that Pr (fJ- - 2a < X < fJ- + 2a) = 0.931. 3.21. In a lengthy manuscript, it is discovered that only 13.5 per cent of the pages contain no typing errors. If we assume that the number of errors per page is a random variable with a Poisson distribution, find the percentage of pages that have exactly one error. 3.22. Let the p.d.f. j(x) be positive on and only on the nonnegative integers. Given that j(x) = (4Jx)j(x - 1), x = 1,2,3, .... Find j(x). Hint. Note thatj(l) = 4j(0),f(2) = W/2!)j(0), and so on. That is, find eachj(x) in terms of j(O) and then determine j(O) from 1 = j(O) + j(l) + j(2) + .. ", 3.23. Let X have a Poisson distribution with fJ- = 100. Use Chebyshev's inequality to determine a lower bound for Pr (75 < X < 125). 3.24. Given that g(x, 0) = 0 and that Dw[g(x, w)] = - Ag(X, w) + Ag(X - 1, w) for x = 1,2, 3, .... If g(O, w) = e- AW , show, by mathematical induction, that or, by Table I, Pr (X = 3) = Pr (X s 3) - Pr (X :s: 2) = 0.433 - 0.238 = 0.195. Example 3. Let the probability of exactly one blemish in 1 foot of wire be about 10 100 and let the probability of two or more blemishes in that length be, for all practical purposes, zero. Let the random variable X be the number of blemishes in 3000 feet of wire. If we assume the stochastic independence of the numbers of blemishes in nonoverlapping intervals, then the postulates of the Poisson process are approximated, with A= 10 100 and w = 3000. Thus X has an approximate Poisson distribution with mean 3000(10 100) = 3. For example, the probability that there are exactly five blemishes in 3000 feet of wire is and, by Table I, Pr (X = 5) = Pr (X s 5) - Pr (X s 4) = 0.101, approximately. EXERCISES 3.19. If the random variable X has a Poisson distribution such that Pr (X = 1) = Pr (X = 2), find Pr (X = 4). (Aw)Xe- AW g(x, w) = , ' x = 1, 2, 3, .... x. 3.25. Let the number of chocolate drops in a certain type of cookie have a Poisson distribution. We want the probability that a cookie of this type contains at least two chocolate drops to be greater than 0.99. Find the smallest value that the mean of the distribution can take. 3.26. Compute the measures of skewness and kurtosis of the Poisson distribution with mean fJ-. 3.27. Let X and Y have the joint p.d.f. j(x, y) = e- 2/[x! (y - x)!], y = 0, 1, 2, ... ; x = 0, 1, .. " y, zero elsewhere. (a) Find the moment-generating function M(t1 , t2 ) of this joint distribu- tion. (b) Compute the means, the variances, and the correlation coefficient of X and Y. (c) Determine the conditional mean E(Xly). Hint. Note that y L: [exp (t1x)]y!J[x! (y - x)!] = [1 + exp (t1 }JY· x=o Why? 3.3 The Gamma and Chi-Square Distributions In this section we introduce the gamma and chi-square distributions. It is proved in books on advanced calculus that the integral fooo ya-1e- Y dy
  • 58. 104 Some Special Distributions [Ch. 3 Sec. 3.3] The Gamma and Chi-Square Distributions 105 exists for a > 0 and that the value of the integral is a positive number. The integral is called the gamma function of a, and we write f(a) = fo'Xl y«-le-Y dy. However, the event W > w, for w > 0, is equivalent to the event in which there are le~s than k.changes in a time interval of length w. That is, if the random vanable X IS the number of changes in an interval of le gth then n w, or, equivalently, If a = 1, clearly w > 0, 0< w < 00, 0< w < 00, g(w) = Ae-'w, It is left as an exercise to verify that k-l k-l (A) A Pr (W > w) = L Pr (X = x) = L w X e- w. x=o x=o xl M(t) = 0 elsewhere, We now find the moment-generating function of a gamma distri- bution. Since 1 00 1 = --x«-le-x(l-Ptl/P dx o f(a)f3« , = 0 elsewhere. Joo zk-le-2 _ k-l (AW)Xe-AW Aw (k - 1)!dz - x~o xl . If, momentarily, we accept this result, we have, for w > 0, ('Xl Zk-le-2 lAW Zk-le-2 G(w) = 1 - JAW r(k) dz = 0 r(k) dz, ~nd for w ~ 0, G(w) = O. If we change the variable of integration in the integral that defines G(w) by writing z = AY, then and G(w) = 0, w ~ O. Accordingly, the p.d.f. of W is Th~~ is, ~ has a gamma distribution with a = k and f3 = 1/A. If W is the waiting time until the first change, that is, if k = 1, the p.d.f. of W is and W is said to have an exponential distribution. o < x < 00, f(x) 1 = foo _1_ x«-le- X / Pdx. o f(a)f3« Since a > 0, f3 > 0, and I'(«) > 0, we see that 1 = -_ x«-le- x/P r(a)f3« ' f(a) = (a - 1)(a - 2) ... (3)(2)(1)r(1) = (a - 1)1. Since I'[I) = 1, this suggests that we take O! = 1, as we have done. In the integral that defines I'{«), let us introduce a new variable x by writing y = x/f3, where f3 > O. Then [00 (X)«-l (1) f(a) = Jo ~ e- x/P ~ dx, = 0 elsewhere, is a p.d.f. of a random variable of the continuous type. A random variable X that has a p.d.f. of this form is said to have a gamma dis- tribution with parameters a and f3; and any suchf(x) is called a gamma- type p.d.j. Remark. The gamma distribution is frequently the probability model for waiting times; for instances, in life testing, the waiting time until" death" is the random variable which frequently has a gamma distribution. To see this, let us assume the postulates of a Poisson process and let the interval of length w be a time interval. Specifically, let the random variable W be the time that is needed to obtain exactly k changes (possibly deaths), where k is a fixed positive integer. Then the distribution function of W is G(w) = Pr (W ~ w) = 1 - Pr (W > w). r(l) = fooo e:v dy = 1. If a > 1, an integration by parts shows that I'(«) = (a - 1) fo oo y«-2e- Y dy = (a - 1)r(a - 1). Accordingly, if a is a positive integer greater than 1,
  • 59. and M"(t) = (-a)(-a - 1)(1 - f3t)-a-2(_f3)2. Hence, for a gamma distribution, we have JL = M'(O) = af3 107 t < t, M(t) = (1 - 2t)-r/2, Sec. 3.3] The Gamma and Chi-Square Distributions Let us now consider the special case of the gamma distribution in which a = rj2, where r is a positive integer, and f3 = 2. A random variable X of the continuous type that has the p.d.f. f( ) 1 r/2-l -x/2 0 x r(rj2)2r/2x e , < x < 00, = 0 elsewhere, and the moment-generating function Pr (3.25 :::; X :::; 20.5) = Pr (X s 20.5) - Pr (X s 3.25) = 0.975 - 0.025 = 0.95. is said to have a chi-square distribution, and any f(x) of this form is called a chi-square p.d.j. The mean and the variance of a chi-square distribution are JL = af3 = (rj2)2 = rand a2 = af32 = (rj2)22 = 2r, respectively. For no obvious reason, we call the parameter r the number of degrees of freedom of the chi-square distribution (or of the chi- square p.d.f.). Because the chi-square distribution has an important role in statistics and occurs so frequently, we write, for brevity, that X is X2 (r) to mean that the random variable X has a chi-square distri- bution with r degrees of freedom. Example 3. If X has the p.d.f. f(x) = !xe- x /2 , 0 < x < 00, = 0 elsewhere, then X is X2 (4). Hence J10 = 4, a2 = 8. and M(t) = (1 - 2t) -2, t < l Example 4. If Xhas the moment-generating function M(t) = (1- 2t)-8, t < 1. then X is X2 (16). If the random variable X is X2 (r), then, with Cl < C2' we have Pr (cl s X :::; c2 ) = Pr (X s c2 ) - Pr (X s Cl), since Pr (X = cl ) = O. To compute such a probability, we need the value of an integral like Pr (X < x) = IX 1 wr/2-le-W/2 dw. - 0 r(rj2)2r/2 Tables of this integral for selected values of r and x have been prepared and are partially reproduced in Table II in Appendix B. Example 5. Let X be X2 (10). Then, by Table II of Appendix B, with r = 10, m=I,2,3..... Some Special Distributions [Ch. 3 ( _ I _ ) a roo _1 ya-le-Y dy 1 - f3t Jo r(a) 1 1 t < _. (1 - f3t)«' f3 M(t) M'(t) (-a)(1 - f3t)-a-l(_f3) Then the moment-generating function of X is given by the series 4! 3 5! 32 2 6! 3 3 t3 ••• M(t) = 1 + 3! I! t + 3! 2! t + 3! 3! + . This however. is the Maclaurin's series for (1 - 3t)-4, provided that _ 1 ~ 3t < 1. Accordingly. X has a gamma distribution with a = 4 and f3 = 3. Remark. The gamma distribution is not only a good model for ~aiting times, but one for many nonnegative random variables of the continuous type. For illustrations, the distribution of certain incomes could be modeled satisfactorily by the gamma distribution, since the two parameters a and f3 provide a great deal of flexibility. Example 1. Let the waiting time W have a gamma p.d.f. with a =.k and f3 = 1/t... Accordingly, E(W) = k/t... If k ~ 1, then E(W) = .1/>..; that IS, the expected waiting time for k = 1 changes IS equal to the reciprocal of >... Example 2. Let X be a random variable such that and Now That is, 106 we may set y = x(1 - f3t)jf3, t < Ijf3, or x = f3yj(1 - f3t), to obtain M(t) = roo f3j(1 - f3t) (2JL)a-le_y dy. Jo r(a)f3a 1 - f3t
  • 60. k = 1,2,3, .... Some Special Distributions [Ch, 3 and 109 o< x < 00, 1 j(x) = fJ2 xe-x /P, I~oo exp (-Iyl + 1) dy = 2e. the distribution of Y = minimum (Xl> X 2,Xs). Hint. Pr (Y :::; y) = 1 - Pr (Y > y) = 1 - Pr (Xl> y, i = 1,2, 3). 3.34. Let X have a gamma distribution with p.d.f. Sec. 3.4] The Normal Distribution zero elsewhere. If x = 2 is the unique mode of the distribution, find the parameter fJ and Pr (X < 9.49). 3.35. Compute the measures of skewness and kurtosis of a gamma distri- bution with parameters a and fJ. 3.36. Let X have a gamma distribution with parameters a and fJ. Show that Pr (X ~ 2afJ) s (2/e)a. Hint. Use the result of Exercise 1.107. 3.37. Give a reasonable definition of a chi-square distribution with zero degrees of freedom. Hint. Work with the moment-generating function of a distribution that is x2 (r) and let r = O. 3.38. In the Poisson postulates on page 99, let , be a nonnegative function of w, say '(w), such that Dw[g(O, w)J = - '(w)g(O, w). Suppose that ,(w) = krwT- l, r ~ 1. (a) Find g(O, w) noting that g(O, 0) = 1. (b) Let W be the time that is needed to obtain exactly one change. Find the distribution function of W, namely G(w) = Pr (W :::; w) = 1 - Pr (W > w) = 1 - g(O, w), o :::; w, and then find the p.d.f. of W. This p.d.f. is that of the Weibull distri- bution, which is used in the study of breaking strengths of materials. 3.39. Let X have a Poisson distribution with parameter m. If m is an experimental value of a random variable having a gamma distribution with a = 2 and fJ = 1, compute Pr (X = 0, 1, 2). 3.40. Let X have the uniform distribution with p.d.f.j(x) = 1,0 < x < 1, zero elsewhere. Find the distribution function of Y = -21n X. What is the p.d.f. of Y? Consider the integral I = I~oo exp (_y2/2) dy. This integral exists because the integrand is a positive continuous function which is bounded by an integrable function; that is, o< exp (_y2/2) < exp (-Iyl + 1), -00 < y < 00, 3.4 The Normal Distribution f "" 1 k-l p.xe-/L - - Zk-Ie-Z dz = L: --,-' /L r(k) x=o x. This demonstrates the relationship between the distribution functions of the gamma and Poisson distributions. Hint. Either integrate by parts k - 1 times or simply note that the "antiderivative" of Zk-Ie- Z is _zk-Ie-~ - (k _ l)zk-2e- z _ ... - (k - 1)1 e- Z by differentiating the latter expression. 3.33. Let x.. X 2 , and x, be mutually stochastically independent rand?m variables, each with p.d.f. j(x) = e- x , 0 < x < 00, zero elsewhere. Fmd Accordingly, the p.d.f. of Y is G'( ) fJ/2 (fJ /2)T/2 -Ie- y/2 g(y) = Y = r(r/2)W/2 y if y > O. That is, Y is i'{r). EXERCISES 3.28. If (1 - 2t)-6, t < -!-' is the moment-generating function of the random variable X, find Pr (X < 5.23). 3.29. If X is x2 (5), determine the constants cand d so that Pr (c < X < d) = 0.95 and Pr (X < c) = 0.025. 3.30. If X has a gamma distribution with a = 3 and fJ = 4, find Pr (3.28 < X < 25.2). Hint. Consider the probability of the equivalent event 1.64 < Y < 12.6, where Y = 2X/4 = X/2. 3.31. Let X be a random variable such that E(xm) = (m + 1)1 2 m, m = 1,2,3, .... Determine the distribution of X. 3.32. Show that If y s 0, then G(y) = 0; but if y > 0, then f PY/2 1 G( ) - xT/2-le-X/P dx. y - r(r/2)pI2 o 108 Again, by way of example, if Pr (~ < X) = 0.05, then Pr (X :::; a) = 0.95, and thus a = 18.3 from Table II WIth r = 10. Example 6. Let X have a gamma distribution wit~ a = r/2, where r is a positive integer, and fJ > O. Define the rand~m vana?le Y = 2X/fJ· We seek the p.d.f. of Y. 
Now the distribution function of Y is G(y) = Pr(Y ≤ y) = Pr(X ≤ βy/2).
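A brief numerical check of Example 6, and of the Table II lookup in Example 5, follows; this is my addition, and the scipy.stats parameterization (with β as the scale parameter) is an assumption noted in the comments.

```python
# If X has a gamma distribution with alpha = r/2 and scale beta, then Y = 2X/beta
# should be chi-square(r); compare Pr(X <= beta*y/2) with the chi-square c.d.f.
import numpy as np
from scipy.stats import chi2, gamma

r, beta = 6, 3.0                     # illustrative values (my choice)
y = np.linspace(0.5, 20.0, 5)
via_gamma = gamma.cdf(beta * y / 2, a=r / 2, scale=beta)   # G(y) = Pr(X <= beta*y/2)
via_chi2 = chi2.cdf(y, df=r)
print(np.max(np.abs(via_gamma - via_chi2)))                # essentially zero

# Example 5: X is chi-square(10); Table II gives Pr(3.25 <= X <= 20.5) = 0.95
print(chi2.cdf(20.5, 10) - chi2.cdf(3.25, 10))             # about 0.950
```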
  • 61. 110 Some Special Distributions [eh.3 Sec. 3.4] The Normal Distribution 111 To evaluate the integral I, we note that I > 0 and that J2 may be written fOO foo (y2 + Z2) [2 = -00 -00 exp - 2 dy dz. This iterated integral can be evaluated by changing to polar coordinates. If we set y = r cos 0 and z = r sin 0, we have J2 = f:3t fooo e- r2/2r dr dO = f:3t dO = 27T. Accordingly, I = yZ; and we complete the square in the exponent. Thus M(t) becomes M(t) = exp [ a 2 - (a + b 2t)2] foo _1_ ex [_ (x - a - b 2t)2] 2b2 _00 bvz:;, P 2b2 dx = exp (ai + b~2) because the integrand of the last integral can be thought of as a normal p.d.f. with a replaced by a + b2t, and hence it is equal to 1. The mean I-t and variance 0- 2 of a normal distribution will be calcu- lated from M(t). Now M'(t) = M(t)(a + b2t) and f OO 1 -= e- y 2/2 dy = 1. -00 V27T Thus If we introduce a new variable of integration, say z, by writing I-t = M'(O) = a and a form that shows explicitly the values of I-t and 0- 2 . The moment- generating function M(t) can be written 0- 2 = M"(O) - 1-t2 = b2 + a2 - a2 = b2. This permits us to write a normal p.d.f. in the form of -00 < x < 00, f(x) = 1_ exp [ o-V27T b > 0, (x - a)2] 2b2 dx = 1. x-a y=--' b fOO 1 [ --exp -00 byZ; Since b > 0, this implies that the preceding integral becomes 1 [(X - a)2] f(x) = --exp - , byZ; 2b2 -00 < x < 00 -00 < x < 00. satisfies the conditions of being a p.d.f. of a continuous type of random variable. A random variable of the continuous type that has a p.d.f. of the form of f(x) is said to have a normal distribution, and any f(x) of this form is called a normal p.d.f. We can find the moment-generating function of a normal distribu- tion as follows. In f OO 1 [(X a)2] M (t) = etx • r;;- exp - 2b2 dx - 00 bv 27T fOO 1 ( = --exp -00 bv27T Example 1. If X has the moment-generating function M(t) = e2t+32t2, then X has a normal distribution with flo = 2, 0- 2 = 64. The normal p.d.f. occurs so frequently in certain parts of statistics that we denote it, for brevity, by n(l-t, 0- 2 ) . Thus, if we say that the random variable X is n(O, 1), we mean that X has a normal distribution with mean I-t = 0 and variance 0- 2 = 1, so that the p.d.f. of X is 1 f(x) = - - e- X 2 / 2 yZ; ,
  • 62. 112 Some Special Distributions [Ch.B Sec. 3.4] The Normal Distribution 113 If we say that X is n(5, 4), we mean that X has a normal distribution with mean p, = 5 and variance a2 = 4, so that the p.d.f. of X is 1 l (x - 5)21 f(x) = 2V2?T exp - 2(4) , Moreover, if -00 < x < 00. If we change the variable of integration by writing y = (x - p,)/a, then f 1 G(w) = -- e- y 2 /2 dy. -00 y'2; Accordingly, the p.d.f. g(w) = G'(w) of the continuous-type random variable W is then X is n(O, 1). The graph of g(w) -00 < w < 00. is seen (1) to be symmetric about a vertical axis through x = p" (2) to have its maximum of 1/avz;.at x = p" and (3) to have the x-axis as a horizontal asymptote. It should be verified that (4) there are points of inflection at x = p, ± a. Remark. Each of the special distributions considered thus far has been "justified" by some derivation that is based upon certain concepts found in elementary probability theory. Such a motivation for the normal distribu- tion is not given at this time; a motivation is presented in Chapter 5. How- ever, the normal distribution is one of the more widely used distributions in applications of statistical methods. Variables that are often assumed to be random variables having normal distributions (with appropriate values of flo and u) are the diameter of a hole made by a drill press, the score on a test, the yield of a grain on a plot of ground, and the length of a newborn child. 1 l (x - p,)21 f(x) = aVZ; exp - 2a2 ' -00 < x < 00, Thus W is n(O, 1), which is the desired result. This fact considerably simplifies calculations of probabilities con- cerning normally distributed variables, as will be seen presently. Sup- pose that X is n(p" a2 ) . Then, with C1 < C2 we have, since Pr (X = c1 ) = 0, = Pr (X ~ p, < C 2 ~ p,) _Pr (X ~ p. < c1 ~ p,) because W = (X - p,)/a is n(O, 1). That is, probabilities concerning X, which is n(p" a2 ) , can be expressed in terms of probabilities concerning W, which is n(O, 1). However, an integral such as We now prove a very useful theorem. Theorem 1. If the random variable X is n(p" a2 ) , a2 > 0, then the random variable W = (X - p,)/a is n(O, 1). Proof. The distribution function G(w) of W is, since a > 0, G(w) = Pr (X ~ P, s w) = Pr (X s tso + p,). That is, J W C1 tIL 1 l (x - p,)2J G(w) = . j - exp - 2 2 dx. -00 ov 2?T a cannot be evaluated by the fundamental theorem of calculus because an "antiderivative" of e-W 2 / 2 is not expressible as an elementary function. Instead, tables of the approximate value of this integral for various values of k have been prepared and are partially reproduced in Table III in Appendix B. We use the notation (for normal) J x 1 N(x) = -- e-W 2 / 2 dw; -00 VZ;
  • 63. 114 Some Special Distributions [Ch. 3 Sec. 3.4J The Normal Distribution 115 thus, if X is n(I-'-' a2 ) , then Pr (c1 < X< c2) = Pr (X ~ I-'- < C 2 ~ 1-'-) ( X - I-'- c1 - 1-'-) - Pr --- < --- a a That is, f -v'V 1 G(v) = 2 - - e-w 2 /2 dw, o VZ; os v, If we change the variable of integration by writing w = yY, then Hence the p.d.f. g(v) = G'(v) of the continuous-type random variable V is f v 1 G(v) = e- y / 2 dy, oVZ;vy o :s; v. v < O. G(v) = 0, and It is left as an exercise to show that N( -x) = 1 - N(x). Example 2. Let X be n(2, 25). Then, by Table III, ( 10 - 2) (0 - 2) Pr (0 < X < 10) = N -5- - N -5- = N(1.6) - N( -0.4) = 0.945 - (1 - 0.655) = 0.600 I x 1 N(x) = -- e-w2 /2 dw, -<Xl VZ; EXERCISES 3.41. If 3.46. If X is n(p., a2 ) , show that E([X - p.!) = av/2/TT. 3.47. Show that the graph of a p.d.I. n(p., a2 ) has points of inflection at x = p. - a and x = p. + a. Since g(v) is a p.d.f. and hence fooo g(v) dv = 1, it must be that r(t) = V:;;: and thus V is X2 (1). o < V < 00, 1 _=--= vl/2-1e-V/2 V7TV2 ' = 0 elsewhere. g(v) show that N( -x) = 1 - N(x). 3.42. If X is n(75, 100), find Pr (X < 60) and Pr (70 < X < 100). 3.43. If X is n(p., a2 ) , find b so that Pr [-b < (X - p.)/a < b] = 0.90. 3.44. Let X be n(p., a2 ) so that Pr (X < 89) = 0.90 and Pr (X < 94) = 0.95. Find p. and a2• 3.45. Show that the constant c can be selected so that f(x) = c2- x 2 , -00 < X < 00, satisfies the conditions of a normal p.d.f. Hint. Write 2 = e1n 2. These conditions require that p. = 73.1 and a = 10.2 approximately. We close this section with an important theorem. Theorem 2. If the random variable X is n(fL, a2), a2 > 0, then the random variable V = (X - p.)2ja2 is X2(1). Proof. Because V = W2, where W = (X - fL)!a is n(O, 1), the distribution function G(v) of V is, for v 2: 0, G(v) = Pr (W2 ::; v) = Pr (- vv s W s Vv). Pr (- 8 < X < 1) = NC ~ 2) - N(-8 5- 2) = N( -0.2) - N( -2) = (1 - 0.579) - (1 - 0.977) = 0.398. Example 3. Let X be n(p., a2 ) . Then, by Table III, ~0-~<X<p.+~=Nt+~-1-Nt-~-1 = N(2) - N(-2) = 0.977 - (1 - 0.977) = 0.954. Example 4. Suppose that 10 per cent of the probability for a certain distribution that is n(p., a2 ) is below 60 and that 5 per cent is above 90. What are the values of p. and a? We are given that the random variable X is n(p., a2 ) and that Pr (X ::; 60) = 0.10 and Pr (X ::; 90) = 0.95. Thus N[(60 - p.)/a] = 0.10 and N[(90 - p.)/a] = 0.95. From Table III we have 60 - p. = -1.282, 90 - P. = 1.645. a a and
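In place of Table III one can evaluate N(x) directly; the sketch below (my addition, assuming scipy.stats) redoes Examples 2 and 3 by standardizing exactly as Theorem 1 prescribes.

```python
# N(x) is the standard normal distribution function; by Theorem 1,
# Pr(c1 < X < c2) = N((c2 - mu)/sigma) - N((c1 - mu)/sigma) when X is n(mu, sigma**2).
from scipy.stats import norm

mu, sigma = 2.0, 5.0                                               # Example 2: X is n(2, 25)
print(norm.cdf((10 - mu) / sigma) - norm.cdf((0 - mu) / sigma))    # about 0.600
print(norm.cdf((1 - mu) / sigma) - norm.cdf((-8 - mu) / sigma))    # about 0.398

# Example 3: Pr(mu - 2*sigma < X < mu + 2*sigma) = N(2) - N(-2), about 0.954
print(norm.cdf(2) - norm.cdf(-2))
```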
  • 64. 116 Some Special Distributions [eh. 3 Sec. 3.5] The Bivariate Normal Distribution 117 where, with Ul > 0, U2 > 0, and - 1 < p < 1, -00 < x < 00, -00 < y < 00, j(x, y) At this point we do not know that the constants iLl' iL2' at a~, and p represent parameters of a distribution. As a matter of fact, we do not know thatj(x, y) has the properties of a joint p.d.f.Tt will now be shown that: (a) f(x, y) is a joint p.d.f. (b) X is n(iLl> ar) and Y is n(iL2' a~). (c) p is the correlation coefficient of X and Y. A joint p.d.f. of this form is called a bivariate normal p.df-, and the random variables X and Yare said to have a bivariate normal distribu- tion. That the nonnegative function f(x, y) is actually a joint p.d.f. can be seen as follows. Define fl(X) by mean of the truncated distribution that has p.d.f. g(y) = j(y)jF(b), -00 < y < b, zero elsewhere, be equal to - j(b)jF(b) for all real b. Prove that j(x) is n(O, 1). 3.61. Let X and Y be stochastically independent random variables, each with a distribution that is n(O, 1). Let Z = X + Y. Find the integral that represents the distribution function G(z) = Pr (X + Y ::;; z) of Z. Deter- mine the p.d.f. of Z. Hint. We have that G(z) = J~00 H(x, z) dx, where I Z - X 1 H(x, z) = _00 27T exp [- (x2 + y2)/2] dy. Find G'(z) by evaluating e00 [8H(x, z)/8z] dx. 3.5 The Bivariate Normal Distribution Let us investigate the function o< x < 00, zero elsewhere. 3.48. Determine the ninetieth percentile of the distribution, which is n(65,25). 3.49. If e3t+8t2 is the moment-generating function of the random variable X, find Pr (-1 < X < 9). 3.50. Let the random variable X have the p.d.f. Find the mean and variance of X. Hint. Compute E(X) directly and E(X2) by comparing that integral with the integral representing the variance of a variable that is n(O, 1). 3.51. Let X be n(5, 10). Find Pr [0.04 < (X - 5)2 < 38.4]. 3.52. If X is n(l, 4), compute the probability Pr (1 < X2 < 9). 3.53. If X is n(75, 25), find the conditional probability that X is greater than 80 relative to the hypothesis that X is greater than 77. See Exercise 2.17. 3.54. Let X be a random variable such that E(X2m) = (2m)!j(2mm!), m = 1,2,3, ... and E(X2m-l) = 0, m = 1,2,3, .... Find the moment- generating function and the p.d.f. of X. 3.55. Let the mutually stochastically independent random variables Xl' X 2 , and X a be n(O, 1), n(2, 4), and n( -1, 1), respectively. Compute the probability that exactly two of these three variables are less than zero. 3.56. Compute the measures of skewness and kurtosis of a distribution which is n(f.L, a2 ) . 3.57. Let the random variable X have a distribution that is n(f.L, a2 ) . (a) Does the random variable Y = X2 also have a normal distribution? (b) Would the random variable Y = aX + b, a and b nonzero constants, have a normal distribution? Hint. In each case, first determine Pr (Y ::;; y). 3.58. Let the random variable X be n(f.L, a2 ) . What would this distribution be if a2 = O? Hint. Look at the moment-generating function of X for a2 > 0 and investigate its limit as a2 --+ O. 3.59. Let n(x) and N(x) be the p.d.f. and distribution function of a distribution that is n(O, 1). Let Y have a truncated distribution with p.d.f. g(y) = n(y)j[N(b) - N(a)], a < y < b, zero elsewhere. Show that E(Y) is equal to [n(a) - n(b)]/[N(b) - N(a)]. 3.60. Let j(x) and F(x) be the p.d.f. and the distribution function of a distribution of the continuous type such that f'(x) exists for all x. Let the
  • 65. 118 Some Special Distributions [Ch, 3 Sec. 3.5] The Bivariate Normal Distribution 119 Now (1 _ p2)q = [(y ~2iL2) - p(X ~liLl)r+ (1 - p2)(X ~liLlr = (y ~ b) + (1 _ p2)(X ~liLlr, where b = iL2 + p(a2/al)(X - iLl)' Thus _ exp [- (x - iLl)2/2atJ foo exp {- (y - b)2/[2a~(1 - p2)J} dy. fl(x) - al-yl2; -00 a2V 1 - p2-y12; For the purpose of integration, the integrand of the ~ntegral in this expression for fl(x) may be considered a.normal p.d.f. with mean band variance a~(l - p2). Thus this integral IS equal to 1 and f ( ) 1 [(X - iLl)2J -00 < X < 00. 1 X = . rr> exp - 2a2 ' alV 27T 1 Since f~00 f:oof(x, y) dy dx = f:oofl(x) dx = 1, the nonnegative functionf(x, y) is a joint p.d.f. of two ~ontinuou:-type random variables X and Y. Accordingly, the function fl(x) IS the . 1 d f of X and X is seen to be n(iLl' at)· In like manner, we margma p. . . , see that Y is n(iL2' a~). Moreover, from the development above, we note that ( 1 [(y - b)2 J) f(x, y) = fl(x) a2 V 1 _ p2-y12; exp - 2a~(l _ p2) , h b + ( / )(X - II.) Accordingly the second factor in the were = iL2 P a2 al .1 . , .. right-hand member of the equation above ~s. the conditional p.d:f of Y, given that X = x. That is, the conditional p.d.f of Y, ~Iven X = x is itself normal with mean iL2 + p(a2/al)(X - iLl) and v~r~ance 2(1 _' 2) Thus with a bivariate normal distribution, the conditional a2 p. , . . mean of Y, given that X = x, is linear in x and IS gIVen by a2 ) E(Ylx) = iL2 + P - (x - iLl . al Since the coefficient of x in this linear conditional mean E(:VI~) is / d . and a represent the respective standard deVIatIOns, pa2 aI, an since al 2 Thi the number p is, in fact, the correlation coefficient of X and Y. IS follows from the result, established in Section 2.3, that the coefficient of x in a general linear conditional mean E(Ylx) is the product of the correlation coefficient and the ratio a2/al' Although the mean of the conditional distribution of Y, given X = x, depends upon x (unless p = 0), the variance a§(1 - p2) is the same for all real values of x. Thus, by way of example, given that X = x, the conditional probability that Y is within (2.576)a2V 1 _ p2 units of the conditional mean is 0.99, whatever the value of x. In this sense, most of the probability for the distribution of X and Y lies in the band about the graph of the linear conditional mean. For every fixed positive a2' the width of this band depends upon p. Because the band is narrow when p2 is nearly 1, we see that p does measure the intensity of the concentration of the probability for X and Y about the linear con- ditional mean. This is the fact to which we alluded in the remark of Section 2.3. In a similar manner we can show that the conditional distribution of X, given Y = y, is the normal distribution Example 1. Let us assume that in a certain population of married couples the height Xl of the husband and the height X2 of the wife have a bivariate normal distribution with parameters Il-l = 5.8 feet, 1l-2 = 5.3 feet, al = U2 = 0.2 foot, and p = 0.6. The conditional p.d.f. of X 2 , given Xl = 6.3, is normal with mean5.3 +(0.6)(6.3 -5.8) =5.6 and standarddeviation (0.2)V(1-0.36) = 0.16. Accordingly, given that the height of the husband is 6.3 feet, the probability that his wife has a height between 5.28 and 5.92 feet is Pr (5.28 < X2 < 5.921xl = 6.3) = N(2) - N( -2) = 0.954. The moment-generating function of a bivariate normal distribution can be determined as follows. 
We have
M(t₁, t₂) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{t₁x + t₂y} f(x, y) dx dy = ∫_{−∞}^{∞} e^{t₁x} f₁(x) [ ∫_{−∞}^{∞} e^{t₂y} f(y|x) dy ] dx
for all real values of t₁ and t₂. The integral within the brackets is the
  • 66. 3.63. If M(tv t2) is the moment-generating function of a bivariate normal distribution, compute the covariance by using the formula Now let if;(tv t2) = In M(t1, t2). Show that a2if;(0, 0)/8t1at2 gives this co- variance directly. 3.67. Let X, Y, and Z have the joint p.d.f. (1/217')3/2 exp [- (x2 + y2 + z2)/2J{1 + xyz exp [- (x2 + y2 + z2)/2J}, 121 aM(O, 0) 8M(0, 0) at1 at2 3.65. Let X and Y have a bivariate normal distribution with parameters iLl = 20, iL2 = 40, ai = 9, a~ = 4, and p = 0.6. Find the shortest interval for which 0.90 is the conditional probability that Y is in this interval, given that X = 22. 3.66. Letf(x,y) = (1/217') exp[ -t(x2+ y2)J{1 + xyexp[ --t(x2 + y2 - 2)]}, where -00 < x < 00, -00 < y < 00. If f(x, y) is a joint p.d.f., it is not a normal bivariate p.d.f. Show that f(x, y) actually is a joint p.d.f. and that each marginal p.d.f. is normal. Thus the fact that each marginal p.d.f. is normal does not imply that the joint p.d.f. is bivariate normal. where -00 < x < 00, -00 < Y < 00, and -00 < z < 00. While X, Y, and Z are obviously stochastically dependent, show that X, Y, and Z are pair- wise stochastically independent and that each pair has a bivariate normal distribution. EXERCISES 3.68. Let X and Y have a bivariate normal distribution with parameters iLl = iL2 = 0, ai = a~ = 1, and correlation coefficient p. Find the distri- bution of the random variable Z = aX + bY in which a and b are nonzero constants. Hint. Write G(z) = Pr (Z :::; z) as an iterated integral and com- pute G'(z) = g(z) by differentiating under the first integral sign and then evaluating the resulting integral by completing the square in the exponent. Sec. 3.5] The Bivariate Normal Distribution 3.64. Let X and Y have a bivariate normal distribution with parameters iLl = 5, iL2 = 10, ai = 1, a~ = 25, and p > 0. If Pr (4 < Y < 161x = 5) = 0.954, determine p. 3.62. Let X and Y have a bivariate normal distribution with parameters 1-'-1 = 3, iL2 = 1, ai = 16, ~ = 25, and p = -t. Determine the following probabilities: (a) Pr (3 < Y < 8). (b) Pr (3 < Y < 81x = 7). (c) Pr(-3<X<3). (d) Pr (-3 < X < 31y = -4). or equivalently, 22 , ( u~ti + 2pU1U2t1t2 + u2t2). M(t v t2) = exp I-'-lt1 + 1-'-2t2 + 2 It is interesting to note that if, in this moment-generating function . ffici t . t equal to zero then M(tv t2), the correlatIOn coe cien pIS se ' M(t 1,t2) = M(t 1,O)M(O, t2)· Thus X and Yare stochastically independent when p = 0,. If, con~ O)M(O t) we have eP<Jl<J2tlt2 = 1. Since eac versely, M(t:, t2) =:.M(tt 1 h ' ~ ; Accordingly, we have the following of U1 and u2ISpositive, en P - . theorem. 3 L t X and Y have a bivariate normal distribution with Theorem . e lati jfi . nt means 1-'-1 and 1-'-2' positive variances ut and u§,~nd corre at~~n c~ OCM p. Then X and Yare stochastically independent if and only if p - . A matter of fact, jf any two random variables are stochastical~y s a . . have noted III independent and have positive standard deVIatIOns, we . . 4 th t - 0 However p = 0 does not III Example 4 of Section 2. a p - . ' . d t: this I · I that two variables are stochastically llldepen en , genera Imp y . f Th rem 3 . E . 2 18(c) and 2.23. The Importance 0 eo can be seen III xercises . h two random . . h f ct that we now know when and only w en ~:~~~I:Seth:t have a bivariate normal distribution are stochastically independent. Some Special Distributions [eh. 3 120 t ting function of the conditional p.d.f. f(yx). ~ince momen -genera 1 ( I )( ) and vanance f(Ylx) is a normal p.d.f. 
with mean μ₂ + ρ(σ₂/σ₁)(x − μ₁) and variance σ₂²(1 − ρ²), then
∫_{−∞}^{∞} e^{t₂y} f(y|x) dy = exp{ t₂[μ₂ + ρ(σ₂/σ₁)(x − μ₁)] + t₂²σ₂²(1 − ρ²)/2 }.
Accordingly, M(t₁, t₂) can be written in the form
exp{ t₂μ₂ − t₂ρ(σ₂/σ₁)μ₁ + t₂²σ₂²(1 − ρ²)/2 } ∫_{−∞}^{∞} exp[ (t₁ + t₂ρσ₂/σ₁)x ] f₁(x) dx.
But E(e^{tX}) = exp[ μ₁t + σ₁²t²/2 ] for all real values of t. Accordingly, if we set t = t₁ + t₂ρ(σ₂/σ₁), we see that M(t₁, t₂) is given by
exp{ t₂μ₂ − t₂ρ(σ₂/σ₁)μ₁ + t₂²σ₂²(1 − ρ²)/2 + μ₁(t₁ + t₂ρσ₂/σ₁) + σ₁²(t₁ + t₂ρσ₂/σ₁)²/2 }.
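Completing the square in the exponent gives the familiar form M(t₁, t₂) = exp[ μ₁t₁ + μ₂t₂ + (σ₁²t₁² + 2ρσ₁σ₂t₁t₂ + σ₂²t₂²)/2 ] stated in the text. As a rough check (not part of the text), the sketch below compares this closed form with a Monte Carlo estimate of E[exp(t₁X + t₂Y)] using NumPy; the particular parameter and t values are arbitrary.

# A sketch (not from the text): Monte Carlo check of the bivariate normal m.g.f.
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 1.0, -0.5, 2.0, 1.5, 0.6   # arbitrary illustrative values
t1, t2 = 0.2, -0.1                                  # small t's keep the estimate stable

cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]
x, y = rng.multivariate_normal([mu1, mu2], cov, size=1_000_000).T

mc = np.mean(np.exp(t1 * x + t2 * y))
exact = np.exp(mu1*t1 + mu2*t2 + (s1**2*t1**2 + 2*rho*s1*s2*t1*t2 + s2**2*t2**2) / 2)
print(mc, exact)   # the two values should agree to a few decimal places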
  • 67. Sec. 4.1] Sampling Theory 123 Chapter 4 Distributions of Functions of Random Variables 4.1 Sampling Theory Let Xl> X 2 , ••• , K; denote n random variables that have the)oint p.d.f.j(xl> X2' ••• , xn ) . These variables mayor may not be ~tochas:lcal~y independent. Problems such as the foll.owing ~re very mteres.tmg m themselves; but more importantly, their solutions often pr~vIde the basis for making statistical inferences. Let Y be a random vanable that is defined by a function of Xl' X 2 , • • • , X n, say Y = u(Xl> X 2 , ••• , X n) · O the P d f f(x X x ) is given can we find the p.d.f. of Y? nce . . . l> 2"'" n ' In some of the preceding chapters, we have solved a fe~ of t~ese pro~- lems. Among them are the following two. If n = 1 and If ~IIS n(p., a ), then Y = (Xl - p.)/a is n(O, 1). If n is a positive integer, 1£ the random . bl X . - 1 2 n are mutually stochastically independent, vana es i' ~ - , , ... , , and each Xi has the same p.d.f. f(x) = PX(1 - P)l-X, x = 0, 1, and zero elsewhere, and if Y = i Xi' then Yis b(n, P)· It should be observed I that Y = u(XI ) = (Xl - p.)/a is a function of Xl that depends upon the two parameters of the normal distribution; whereas Y = u(Xl> X 2' ... , X n ) = i Xi does not depend upon p, the parameter of the common d f f th IX . - 1 2 n The distinction that we make between p. ., 0 e ,,~- , , ... , . these functions is brought out in the following definition. Definition 1. A function of one or more random variab~e~ that does not depend upon any unknown parameter is called a statistic. 122 n In accordance with this definition, the random variable Y = .L: X, I discussed above is a statistic. But the random variable Y = (Xl - J-t)/a is not a statistic unless J-t and a are known numbers. It should be noted that, although a statistic does not depend upon any unknown param- eter, the distribution of that statistic may very well depend upon unknown parameters. Remark. We remark, for the benefit of the more advanced reader, that a statistic is usually defined to be a measurable function of the random variables. In this book, however, we wish to minimize the use of measure theoretic terminology so we have suppressed the modifier "measurable." It is quite clear that a statistic is a random variable. In fact, someprobabilists avoid the use of the word"statistic" altogether, and they refer to a measure- able function of random variables as a random variable. We decided to use the word"statistic" because the reader will encounter it so frequently in books and journals. We can motivate the study of the distribution of a statistic in the following way. Let a random variable X be defined on a sample space <'t' and let the space of X be denoted by d In many situations con- fronting us, the distribution of X is not completely known. For instance, we may know the distribution except for the value of an unknown parameter. To obtain more information about this distribution (or the unknown parameter), we shall repeat under identical conditions the random experiment n independent times. Let the random variable X, be a function of the ith outcome, i = 1,2, ... , n. Then we call Xl> X 2 , ••• , X n the items of a random sample from the distribution under consideration. Suppose that we can define a statistic Y = u(Xl> X 2 , ... , X n) whose p.d.f. is found to be g(y). Perhaps this p.d.f. shows that there is a great probability that Y has a value close to the unknown parameter. Once the experiment has been repeated in the manner indicated and we have Xl = xl> ... 
, Xₙ = xₙ, then y = u(x₁, x₂, ..., xₙ) is a known number. It is to be hoped that this known number can in some manner be used to elicit information about the unknown parameter. Thus a statistic may prove to be useful.
Remarks. Let the random variable X be defined as the diameter of a hole to be drilled by a certain drill press and let it be assumed that X has a normal distribution. Past experience with many drill presses makes this assumption plausible; but the assumption does not specify the mean μ nor the variance σ² of this normal distribution. The only way to obtain information about μ and σ² is to have recourse to experimentation. Thus we shall drill
  • 68. 124 Distributions oj Functions oj Random Variables [eh.4 Sec. 4.1] Sampling Theory 125 a number, say n = 20, of these holes whose diameters will be Xv X 2 , ••• , X 2 0 • Then Xl' X 2 , ••• , X2 0 is a random sample from the normal distribution under consideration. Once the holes have been drilled and the diameters measured, the 20 numbers may be used, as will be seen later, to elicit information about fL and a2 • The term" random sample" is now defined in a more formal manner. Definition 2. Let Xl> X 2 , ••• , X n denote n mutually stochastically independent random variables, each of which has the same but possibly unknown p.d.f. f(x); that is, the probability density functions of Xl> X 2, ... , X; are, respectivelY,]I(xI) = f(X I),f2(X2) = f(X2), ... ,fn(xn) = f(xn), so that the joint p.d.f. is f(xl)f(x2)· . ·f(xn)· The random variables Xl> X 2 , ••• , X n are then said to constitute a random sample from a distribution that has p.d.f. f(x). Later we shall define what we mean by a random sample from a distribution of more than one random variable. Sometimes it is convenient to refer to a random sample of size n from a given distribution and, as has been remarked, to refer to Xl> X 2 , ••• , X n as the items of the random sample. A reexamination of Example 5 of Section 2.4 reveals that we found the p.d.f. of the statistic, which is the maximum of the items of a random sample of size n = 3, from a distribution with p.d.f. f(x) = 2x, °< x < 1, zero elsewhere. In the first Remark of Section 3.1 (and referred to in this section), we found the p.d.f. of the statistic, which is the sum of the items of a random sample of size n from a distribution that has p.d.f. f(x) = px(1 - n::: X = 0, 1, zero elsewhere. In this book, most of the statistics that we shall encounter will be functions of the items of a random sample from a given distribution. Next, we define two important statistics of this type. Definition 3. Let Xl> X 2' .•• , X n denote a random sample of size n from a given distribution. The statistic X = Xl + X 2 + ... + Xn = i Xl... n 1=1 n is called the mean of the random sample, and the statistic is called the variance of the random sample. Remark. Many writers do not define the variance of a random sample as we have done but, instead, they take 52 = ~ (X, - X)2/(n - 1). There I ~re .goodreasons for doing this. But a certain price has to be paid, as we shall indicate, Let Xl' X 2, ••• , X n denote experimental values of the random variable X that has the p.d.f. f(x) and the distribution function F(x). Thus we may l~ok upon Xv X 2, ••• , Xn as the experimental values of a .random sample of SIze n from the given distribution. The distribution of the sample is then defined to be t.he distribution obtained by assigning a probability of lin to each of the POl~ts x~, x:' .. :' X n• This is a distribution of the discrete type. The corres.pondmg distribution function will be denoted by Fn(x) and it is a step function. If we let fx denote the number of sample values that are less than or equal to x, then Fn(x) = fxln, so that Fn(x) gives the relative fre- quency of the event X ::; X in the set of n observations. The function Fn(x) is often called the" empirical distribution function" and it has a number of uses. Because the distribution of the sample is a discrete distribution, the mean and the variance have been defined and are, respectively, i:«[« = x n I and f (Xl - x)2ln = S2. 
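As a concrete illustration of the two Remarks above (the drill-press sample and the distribution of the sample), the following sketch is offered. It is not from the text, and the values of μ and σ are invented for the illustration; it simulates n = 20 hole diameters, computes the sample mean and the sample variance with divisor n, and evaluates the empirical distribution function Fₙ(x).

# A sketch (not from the text): sample mean, sample variance (divisor n),
# and empirical distribution function for n = 20 simulated hole diameters.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.25, 0.002          # hypothetical true mean and standard deviation (inches)
x = rng.normal(mu, sigma, size=20)

xbar = x.mean()                  # sample mean: sum of x_i divided by n
s2 = np.mean((x - xbar) ** 2)    # sample variance with divisor n

def F_n(t, data=x):
    """Empirical distribution function: fraction of sample values <= t."""
    return np.mean(data <= t)

print(xbar, s2, F_n(mu))         # F_n(mu) is near 1/2 for a symmetric distribution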
Thus, if one finds the distribution of the sample and the associated empirical distribution function to be useful concepts, it would seem logically inconsistent to define the variance of a random sample in any way other than we have.
Random sampling distribution theory means the general problem of finding distributions of functions of the items of a random sample. Up to this point, the only method, other than direct probabilistic arguments, of finding the distribution of a function of one or more random variables is the distribution function technique. That is, if X₁, X₂, ..., Xₙ are random variables, the distribution of Y = u(X₁, X₂, ..., Xₙ) is determined by computing the distribution function of Y,
G(y) = Pr [u(X₁, X₂, ..., Xₙ) ≤ y].
Even in what superficially appears to be a very simple problem, this can be quite tedious. This fact is illustrated in the next paragraph.
Let X₁, X₂, X₃ denote a random sample of size 3 from a distribution that is n(0, 1). Let Y denote the statistic that is the sum of the squares of the sample items. The distribution function of Y is given by G(y) = Pr (X₁² + X₂² + X₃² ≤ y). If y < 0, then G(y) = 0. However, if y ≥ 0, then
G(y) = ∫∫∫_A (2π)^{−3/2} exp[ −(x₁² + x₂² + x₃²)/2 ] dx₁ dx₂ dx₃,
  • 69. 126 Distributions of Functions of Random Variables [eh.4 Sec. 4.1] Sampling Theory 127 where p 2: 0, 0 ~ B < 27T, 0 ~ rp ~ 7T. Then, for y 2: 0, G(y) - S'/Y f2nfn _1_ e- p 2 /2p2 sin rp drp dB dp - (27T)3/2 o • 0 0 where A is the set of points (Xl> x2, x3) interior to, or on the surface of, a sphere with center at (0,0,0) and radius equal to yY. This is not a simple integral. We might hope to make progress by changing to spherical coordinates: = J~ L./Y p2e- p 2/2 dp. If we change the variable of integration by setting p = VW, we have G( ) - )2 iY vw -w/2 d y _ - -e w, 7T 0 2 for y 2: O. Since Y is a random variable of the continuous type, the p.d.f. of Y is g(y) = G'(y). Thus O<y<l. Pr (Y s y) = P[P-l(y)] = y, Pr (X s x) = Pr [F(X) ~ F(x)] = Pr [Y ~ F(x)] because Y = F(X). However, Pr (Y ~ y) = G(y), so we have " Pr (X ~ x) = G[F(x)] = F(x) , 0 < F(x) < 1. EXERCISES This is the distribution function of a random variable that is distri- buted uniformly on the interval (0, 1). 4.1. Show that That is, the distribution function of X is F(x). corresponds to F(x). If 0 < F(x) < 1, the inequalities X s x and F(X) s F(x) are equivalent. Thus, with 0 < F(x) < 1, the distribution function of Xis This result permits us to simulate random variables of different types. This is done by simply determining values of the uniform variable Y, usually with a computer. Then, after determining the observed value Y = y, solve the equation y = P(x), either explicitly or by numerical methods. This yields the inverse function X = P-l(y). By the preceding result, this number X will be an observed value of X that has distribution function P(x). It is also interesting to note that the converse of this result is true. If X has distribution function P(x) of the continuous type, then Y = P(X) is uniformly distributed over 0 < y < 1. The reason for this is, for 0 < y < 1, that Pr (Y s y) = Pr [P(X) ~ y] = Pr [X s P-I(y)]. However, it is given that Pr (X s x) = P(x), so X 3 = P cos rp, o < y < 00, X2 = p sin Bsin rp, Xl = Pcos Bsin rp, 1 g(y) = __ y3/2-le-Y/2, VZ; = 0 elsewhere. Because r(!) = (t)r(!-) = mV;, and thus V27T = r(!)23 / 2 , we see that Y is X2 (3). The problem that we have just solved points up the desirability of having, if possible, various methods of determining the distribution of a function of random variables. We shall find that other techniques are available and that often a particular technique is vastly superior to the others in a given situation. These techniques will be discussed in subsequent sections. Example 1. Let the random variable Y be distributed uniformly over the unit interval 0 < y < 1; that is, the distribution function of Y is G(y) = 0, y s 0 = y, 0 < y < 1, = 1, 1 s y. Suppose that F(x) is a distribution function of the continuous type which is strictly increasing when 0 < F(x) < 1. If we define the random variable X by the relationship Y = F(X), we now show that X has a distribution which n where X = LXt!n. I 4.2. Find the probability that exactly four items of a random sample of size 5 from the distribution having p.d.f. f(x) = (x + 1)/2, -1 < x < 1, zero elsewhere, exceed zero. 4.3. Let Xl> X 2 , X3 be a random sample of size 3 from a distribution that is n(6, 4). Determine the probability that the largest sample item exceeds 8.
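Example 1 of this section is the basis of the inversion method of simulating random variables. The following sketch is not from the text; it applies the result to the distribution with p.d.f. f(x) = e^{−x}, 0 < x < ∞ (cf. Exercise 4.9 in the next set), for which F(x) = 1 − e^{−x} and hence x = F⁻¹(y) = −ln(1 − y).

# A sketch (not from the text): simulating from f(x) = e^{-x}, 0 < x < infinity,
# by inverting its distribution function F(x) = 1 - e^{-x}.
import math
import random

random.seed(4)

def sample_exponential():
    y = random.random()          # Y uniform on (0, 1)
    return -math.log(1.0 - y)    # x = F^{-1}(y)

# The four uniform values used in Exercise 4.9 give a corresponding exponential sample:
for y in (0.42, 0.31, 0.87, 0.65):
    print(round(-math.log(1.0 - y), 3))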
  • 70. 4.2 Transformations of Variables of the Discrete Type 4.11. Let Xl and X2denote a random sample of size 2 from a distribution with p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere. Find the distribution function and the p.d.f. of Y = X l/X2 · 4.8. Let Xl and X 2denote a random sample of size 2 from a distribution that is n(O, 1). Find the p.d.f. of Y = Xi + X~. Hint. In the double integral representing Pr (Y :-::; y), use polar coordinates. 129 y = 0,4,8, ... , x = 0,1,2, ... , f(x) Sec. 4.2] Transformations of Variables of the Discrete Type Let X have the Poisson p.d.f. = °elsewhere. As we have done before, let d denote the space d = {x; x = 0,1,2, ...}, so that d is the set where f(x) > 0. Define a new random variable Y by Y = 4X. We wish to find the p.d.f. of Y by the change-of-variable technique. Let y = 4x. We call y = 4x a transformation from x to y, and we say that the transformation maps the space d onto the space f!lj = {y; y = 0, 4, 8, 12, ... }. The space f!lj is obtained by transforming each point in d in accordance with y = 4x. We note two things about this transformation. It is such that to each point in d there corresponds one, and only one, point in f!lj; and conversely, to each point in f!lj there corresponds one, and only one, point in d. That is, the transformation y = 4x sets up a one-to-one correspondence between the points of d and those of f!lj. Any function y = u(x) (not merely y = 4x) that maps a space d (not merely our d) onto a space f!lj (not merely our f!lj) such that there is a one-to-one correspondence between the points of d and those of f!lj is called a one-to-one transformation. It is important to note that a one-to-one transformation, y = u(x), implies that y is a single-valued function of x, and that x is a single-valued function of y. In our case this is obviously true, since y = 4x and x = (±)y. Our problem is that of finding the p.d.f. g(y) of the discrete type of random variable Y = 4X. Now g(y) = Pr (Y = y). Because there is a one-to-one correspondence between the points of d and those of f!lj, the event Y = y or 4X = y can occur when, and only when, the event X = (i)y occurs. That is, the two events are equivalent and have the same probability. Hence ( Y) f-tY /4e - J.l g(y) = Pr (Y = y) = Pr X = 4 = (yJ4)!' 0= elsewhere. The foregoing detailed discussion should make the subsequent text easier to read. Let X be a random variable of the discrete type, having p.d.f. f(x). Let d denote the set of discrete points, at each of which f(x) > 0, and let y = u(x) define a one-to-one transformation that maps d onto f!lj. If we solve y = u(x) for x in terms of y, say, x = w(y), then for each y E f!lj, we have x = w(y) Ed. Consider the random variable Y = u(X). If Y E f!lj, then x = w(y) Ed, and the events Y = y [or Distributions of Functions of Random Variables [Ch.4 An alternative method of finding the distribution of a function of one or more random variables is called the change of variable technique. There are some delicate questions (with particular reference to random variables of the continuous type) involved in this technique, and these make it desirable for us first to consider special cases. 4.9. The four values Yl = 0.42, Y2 = 0.31, Y3 = 0.87, and Y4 = 0.65 represent the observed values of a random sample of size n = 4 from the uniform distribution over 0 < Y < 1. Using these four values, find a corre- sponding observed random sample from a distribution that has p.d.f. f(x) = e- x , 0 < X < 00, zero elsewhere. 4.10. 
Let X₁ and X₂ denote a random sample of size 2 from a distribution with p.d.f. f(x) = 1/2, 0 < x < 2, zero elsewhere. Find the joint p.d.f. of X₁ and X₂. Let Y = X₁ + X₂. Find the distribution function and the p.d.f. of Y.
4.12. Let X₁, X₂, X₃ be a random sample of size 3 from a distribution having p.d.f. f(x) = 5x⁴, 0 < x < 1, zero elsewhere. Let Y be the largest item in the sample. Find the distribution function and p.d.f. of Y.
4.13. Let X₁ and X₂ be items of a random sample from a distribution with p.d.f. f(x) = 2x, 0 < x < 1, zero elsewhere. Evaluate the conditional probability Pr (X₁ < X₂ | X₁ < 2X₂).
4.7. Let yᵢ = a + bxᵢ, i = 1, 2, ..., n, where a and b are constants. Find ȳ = Σ yᵢ/n and s_y² = Σ (yᵢ − ȳ)²/n in terms of a, b, x̄ = Σ xᵢ/n, and s_x² = Σ (xᵢ − x̄)²/n.
4.4. Let X₁, X₂ be a random sample from the distribution having p.d.f. f(x) = 2x, 0 < x < 1, zero elsewhere. Find Pr (X₁/X₂ ≤ 1/2).
4.5. If the sample size is n = 2, find the constant c so that S² = c(X₁ − X₂)².
4.6. If xᵢ = i, i = 1, 2, ..., n, compute the values of x̄ = Σ xᵢ/n and s² = Σ (xᵢ − x̄)²/n.
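Returning to the Poisson example in the text of Section 4.2 above, where Y = 4X has p.d.f. g(y) = μ^{y/4} e^{−μ}/(y/4)! on y = 0, 4, 8, ..., a short sketch (not from the text) confirms the change-of-variable result by comparing g(y) with a simulated relative frequency; the value of μ is arbitrary.

# A sketch (not from the text): checking the p.d.f. of Y = 4X when X is Poisson(mu).
import math
import numpy as np

rng = np.random.default_rng(2)
mu = 3.0                                   # arbitrary Poisson mean
y_values = 4 * rng.poisson(mu, size=200_000)

def g(y, mu=mu):
    """p.d.f. of Y = 4X obtained by the change-of-variable technique."""
    if y % 4 != 0:
        return 0.0
    k = y // 4
    return mu**k * math.exp(-mu) / math.factorial(k)

for y in (0, 4, 8, 12):
    print(y, round(np.mean(y_values == y), 4), round(g(y), 4))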
  • 71. 130 Distributions of Functions of Random Variables [Ch.4 Sec. 4.2] Transformations of Variables of the Discrete Type 131 Y = 0,1,4,9, P4 = {(Yl> Y2); Y2 = 0, 1, ... , Yl and Yl = 0,1,2, ...}. Example 2. Let Xl and X2 be two stochastically independent random variables that have Poisson distributions with means fL1 and fL2' respectively. The joint p.d.f. of Xl and X 2 is one-to-one transformation of d onto@.This would enable us to find the joint p.d.f. of Yl' Y2, and Y3 from which we would get the marginal p.d.f. of Y1 by summing on Y2 and Ys. Xl = 0, 1,2, 3, .. " X2 = 0, 1, 2, 3, ... , fLf1fL~2e -U1-u2 Xl! x2! ' and is zero elsewhere. Thus the space d is the set of points (Xl> x2), where each of Xl and X 2 is a nonnegative integer. We wish to find the p.d.f. of Y 1 = Xl + X 2' If we use the change of variable technique, we need to define a second random variable Y 2' Because Y 2 is of no interest to us, let us choose it in such a way that we have a simple one-to-one transformation. For example, take Y 2 = X 2. Then Yl = Xl + X2 and Y2 = X2 represent a one-to-one transformation that maps d onto We seek the p.d.f. g(y) of the random variable Y = X2. The transformation Y = u(x) = x2 maps d = {x; x = 0, 1, 2, 3} onto P4 = {y; Y = 0, 1,4, 9}. In general, Y = x2 does not define a one-to-one transformation; here, how- ever, it does, for there are no negative values of x in d = {x; x = 0, 1, 2, 3}. That is, we have the single-valued inverse function x = w(y) = vY (not -vy), and so 3! (2)-.!Y(I)3--'!Y g(y) = f(vY) = (vY)! (3 - Vy)!"3 "3 ' = °elsewhere. u(X) = y] and X = w(y) are equivalent. Accordingly, the p.d.f. of Y is g(y) = Pr (Y = y) = Pr [X = w(y)] = f[w(y)], Y E @, = 0 elsewhere. Example 1. Let X have the binomial p.d.f. f(x) = x! (/~ x)! (~rG) 3-X, X = 0, 1,2,3, = °elsewhere. Yl = 0, 1,2, ... , Y1 gl(Yl) = L g(Yl> Y2) Y2=O e- U1 -U2 L Y1 Yl! - - - fLY 1 -Y2fLY2 Yl! Y2=O (Yl - Y2)! Y2! 1 2 (fLl + fL2)Y1e- U1-u2 Yl! ' Note that, if (Yl' Y2) E P4, then 0 ~ Y2 ~ Yl' The inverse functions are given by Xl = Yl - Y2 and X2 = Y2' Thus the joint p.d.f. of Y l and Y 2 is and is zero elsewhere. Consequently, the marginal p.d.f. of Y l is given by EXERCISES and is zero elsewhere. That is, Yl = Xl + X 2 has a Poisson distribution with parameter fLl + fL2' 4.14. Let X have a p.d.f, f(x) = !, x = 1, 2, 3, zero elsewhere. Find the p.d.f. of Y = 2X + 1. g(yv Y2) = f[W1(Y1' Y2), W2(Y1' Y2)]' = 0 elsewhere, where Xl = w1(Yv Y2) and X2 = w2(Yv Y2) are the single-valued inverses of Y1 = U 1(x~, x2) and Y2 = u2(XV x2)· From this joint p.d.f. g(Y1' Y2) we may obtain the marginal p.d.f. of Y1 by summing on Y2 or the marginal p.d.f. of Y2 by summing on Y1' Perhaps it should be emphasized that the technique of change of variables involves the introduction of as many "new" variables as there were" old" variables. That is, suppose that f(x1 , X2 , xs) is the joint p.d.f. of Xl' X 2, and X s, with d the set where f(xv x2, xs) > O. Let us say we seek the p.d.f. of Y1 = u1 (X V X 2 , X s)· We would then define (if possible) Y2 = u2(X V X 2, X s) and Ys = us(Xv X 2, X s), so that Yl = u1(XV X2' xs), Y2 = U2(X1, X2' xs), Ys = us(xv X2, xs) define a There are no essential difficulties involved in a problem like the following. Let f(xv x2 ) be the joint p.d.f. of two discrete-type random variables Xl and X 2 with d the (two-dimensional) set of points at which f(x1, x2) > O. Let Y1 = U 1(X1, x2) and Y2 = U 2(X1, x2) define a one-to-one transformation that maps d onto @. The joint p.d.f. 
of the two new random variables Y₁ = u₁(X₁, X₂) and Y₂ = u₂(X₁, X₂) is given by
  • 72. 132 Distributions of Functions of Random Variables [eb." Sec. 4.3] Transformations of Variables of the Continuous Type 133 = 0 elsewhere. and, accordingly, we have !{Ib occurs because there is a one-to-one correspondence between the points of .JJ! and £fl. Thus o< Y < 8, 1 g(y) = 6y 1/3 ' Let us rewrite this integral by changing the variable of integration from X to y by writing y = 8x3 or X = !tIY. Now dx 1 dy = 6y213 ' Pr (a < Y < b) = Pr (!iY'a < X < !{/b) = f{'bJ2 2x d -'/_ X. va/2 _Jb (tIY)( 1 ) Pr (a < Y < b) - a 2 2 6 y213 dy [ b 1 = 6 113 dy. • a Y Since this is true for every 0 < a < b < 8, the p.d.f. g(y) of Y is the inte- grand; that is, This can be proved by comparing the coefficients of xk in each member of the identity (1 + x)nl (1 + x)n2 == (1 + x)nl +n2 • 4.19. Let Xl and X2 be stochastically independent random variables of the discrete type with joint p.d.f. f1(X1)f2(X2), (Xl> x2) E.JJ!. Let Y1 = Ul(X1) and Y2 = U2(X2) denote a one-to-one transformation that maps d onto fJI. Show that Y1 = Ul(Xl) and Y2 = U2(X2) are stochastically independent. 4.15. uti»; x2) = (i) Xl +X2(t)2-Xl-X2, (Xl> X2) = (0,0), (0, 1), (I, 0), (I, 1). zero elsewhere, is the joint p.d.I. of Xl and X 2 , find the joint p.d.f. of Yl = Xl - X2 and Y2 = Xl + X 2• 4.16. Let X have the p.d.f. f(x) = a)x, X = 1, 2, 3, .. " zero elsewhere. Find the p.d.f. of Y = X3. 4.17. Let Xl and X2 have the joint p.dJ.f(xl, x 2) = Xlx2/36, Xl = 1,2,3 and X 2 = 1,2,3, zero elsewhere. Find first the joint p.d.f, of Yl = X1X2 and Y2 = X 2' and then find the marginal p.d.f. of Yr- 4.18. Let the stochastically independent random variables Xl and X2 be b(nl>P) and b{n2 , P), respectively. Find the joint p.d.f. of Yl = Xl + X2 and Y2 = X 2 , and then find the marginal p.d.f, of Yl . Hint. 'Use the fact that = 0 elsewhere. Here .JJ! is the space {x; 0 < x < I}, where f(x) > O. Define the random variable Y by Y = 8X3 and consider the transformation y = 8x2 • Under the transformation Y = 8x2 , the set.JJ!is mapped onto the set flJ ={y; 0 < Y < 8}, and, moreover, the transformation is one-to-one. For every 0 < a < b < 8, .3;- the event a < Y < b will occur when, and only when, the event lv a < X < 4.3 Transformations of Variables of the Continuous Type In the preceding section we introduced the notion of a one-to-one transformation and the mapping of a set d onto a set fJI under that transformation. Those ideas were sufficient to enable us to find the distribution of a function of several random variables of the discrete type. In this section we shall examine the same problem when the random variables are of the continuous type. It is again helpful to begin with a special problem. Example 1. Let X be a random variable of the continuous type, having p.d.f. o< y < 8, f(x) = 2x, 0< x < I, It is worth noting that we found the p.d.f. of the random variable Y = 8X3 by using a theorem on the change of variable in a definite integral. However, to obtain g(y) we actually need only two things: (1) the set fJI of points y where g(y) > 0 and (2) the integrand of the integral on y to which Pr (a < Y < b) is equal. These can be found by two simple rules: (a) Verify that the transformation y = 8x3 maps d = {x; 0 < x < 1} onto fJI = {y; 0 < y < 8} and that the transformation is one-to-one. (b) Determine g(y) on this set fJI by substituting t{ly for x in J(x) and then multiplying this result by the derivative of t{lY. That is, ( ) = J({Iy) d[(t){Iyj = _1_ g y 2 dy 6yl /3 ' = 0 elsewhere. 
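The two rules just stated can be checked empirically. The following sketch (not from the text) simulates X with p.d.f. f(x) = 2x on (0, 1) by inversion (F(x) = x², so X = √U for U uniform), forms Y = 8X³, and compares a histogram estimate of the density of Y with g(y) = 1/(6y^{1/3}) on 0 < y < 8.

# A sketch (not from the text): Monte Carlo check that Y = 8 X^3 has
# p.d.f. g(y) = 1/(6 y^{1/3}) on 0 < y < 8 when X has p.d.f. f(x) = 2x on (0, 1).
import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=500_000)
x = np.sqrt(u)            # X has distribution function F(x) = x^2, so X = sqrt(U)
y = 8 * x**3

counts, edges = np.histogram(y, bins=40, range=(0.0, 8.0), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
g = 1.0 / (6.0 * mid ** (1.0 / 3.0))

for m, c, gg in zip(mid[5::10], counts[5::10], g[5::10]):
    print(round(m, 2), round(c, 3), round(gg, 3))   # histogram estimate vs. g(y)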
We shall accept a theorem in analysis on the change of variable in a definite integral to enable us to state a more general result. Let X be a random variable of the continuous type having p.d.f. f(x). Let A be the
  • 73. 134 Distributions of Functions of Random Variables [Ch.4 Sec. 4.3] Transformations of Variables of the Continuous Type 135 one-dimensional space where j(x) > O. Consider the random variable Y = u(X), where Y = u(x) defines a one-to-one transformation that maps the set d onto the set !!lJ. Let the inverse of Y = u(x) be denoted by x = w(y), and let the derivative dx/dy = w'(y) be continuous and not vanish for all points y in !!lJ. Then the p.d.f. of the random variable Y = u(X) is given by g(y) = j[w(y)Jlw'(y) I, y E!!lJ, = 0 elsewhere, where Iw'(y) I represents the absolute value of w'(y). This is precisely what we did in Example 1 of this section, except there we deliberately chose y = 8x3 to be an increasing function so that (0,0) FIGURE 4.1 dx '( ) 1 0 8 dy = w Y = 6y2 /3 ' < Y < , is positive, and hence 16y~/31 = 6y~/3' 0 < Y < 8. Henceforth we shall refer to dx/dy = w'(y) as the Jacobian (denoted by ]) of the transformation. In most mathematical areas, ] = w'(y) is referred to as the Jacobian of the inverse transformation x = w(y), but in this book it will be called the Jacobian of the transformation, simply for convenience. = 0 elsewhere. We are to show that the random variable Y = -21n X has a chi-square distribution with 2 degrees of freedom. Here the transformation is Y = u(x) = -21n X, so that x = w(y) = e:v!", The space dis d = {x; 0 < X < I}, which the one-to-one transformation Y = -21n X maps onto!!lJ = {y; 0 < Y < oo]. The Jacobian of the transformation is J = dx = w'(y) = _!e-Y/2 • dy 2 Accordingly, the p.d.f. g(y) of Y = - 2 In X is g(y) = !(e-Y / 2 )IJI = !e-Y / 2 , 0 < Y < 00, = 0 elsewhere, a p.d.f. that is chi-square with 2 degrees of freedom. Note that this problem was first proposed in Exercise 3.40. Example 2. Let X have the p.d.f. j(x) ~ 1, o< x < 1, This method of finding the p.d.f. of a function of one random variable of the continuous type will now be extended to functions of two random variables of this type. Again, only functions that define a one-to-one transformation will be considered at this time. Let YI = uI(XV x2 ) and Y2 = u2(XV x2) define a one-to-one transformation that maps a (two- dimensional) set d in the xlx2-plane onto a (two-dimensional) set !!lJ in the YlY2-plane. If we express each of Xl and X2 in terms of YI and Y2' we can write Xl = wI(Yv Y2), X2 = w2(Yv Y2)' The determinant of order 2, oXI oX I 0YI 0Y2 oX 2 oX 2 0YI 0Y2 is called the]acobian of the transformation and will be denoted by the symbol]. It will be assumed that these first-order partial derivatives are continuous and that the Jacobian] is not identically equal to zero in!!lJ. An illustrative example may be desirable before we proceed with the extension of the change of variable technique to two random variables of the continuous type. Example 3. Let d be the set d = {(Xl' X 2); 0 < Xl < 1,0 < X2 < I}, depicted in Figure 4.1. We wish to determine the set!!lJ in the YIY2-plane that is the mapping of d under the one-to-one transformation YI = uI(XV X2) = Xl + X2, Y2 = u2(XV X2) = Xl - X2, and we wish to compute the Jacobian of the transformation. Now Xl = wI(Yv Y2) = ·HYI + Y2), X2 = w2(Yv Y2) = ·HYI - Y2)'
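Example 2 above can also be verified by simulation. The sketch below (not from the text) transforms uniform observations by y = −2 ln x and compares the empirical distribution with the χ²(2) distribution function 1 − e^{−y/2}.

# A sketch (not from the text): Y = -2 ln X is chi-square with 2 degrees of
# freedom when X is uniform on (0, 1); its distribution function is 1 - e^{-y/2}.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(size=400_000)
y = -2.0 * np.log(x)

for t in (0.5, 1.0, 2.0, 5.0):
    empirical = np.mean(y <= t)
    exact = 1.0 - np.exp(-t / 2.0)
    print(t, round(empirical, 4), round(exact, 4))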
  • 74. 136 Distributions of Functions of Random Variables [Ch.4 Sec. 4.3] Transformations of Variables of the Continuous Type 137 Yz Yz (0,0) (0,0) FIGURE 4.3 Yl FIGURE 4.2 We wish now to change variables of integration by writing Yl = ul(Xl> x2), Y2 = u2(Xl> x2), or Xl = wl(Yl> Y2), X2 = w2(Yl> Y2). It has been proved in analysis that this change of variables requires into o< Xl < 1, 0 < X2 < 1, o< X < 1, !(X) = 1, cp(X1> x2) = !(XI)!(X2 ) = 1, = 0 elsewhere. = 0 elsewhere, Thus for every set B in flJ, g(Yl> Y2) = q:>[wl(Yl> Y2), w2(Yl> Y2)] Ill, = 0 elsewhere. Pr [(Yl> Y2) E B] = JB Jq:>[wl(Yl> Y2), W 2(Yl , Y2)]II! dYl dY2' which implies that the joint p.d.f. g(Yl> Y2) of Yl and Y2 is Accordingly, the marginal p.d.f. gl(Yl) of Y1 can be obtained from the joint p.d.f. g(YI, Y2) in the usual manner by integrating on Y2' Five examples of this result will be given. Example 4. Let the random variable X have the p.d.f. and let X1> X2 denote a random sample from this distribution. The joint p.d.f. of Xl and X2is then Consider the two random variables YI = XI + X2and Y2 = XI - X2' We ~ish to find the joint p.d.f of YI and Y 2 • Here the two-dimensional space d III the xlx2-planeis that of Example 3 of this section. The one-to-one trans- formation YI = Xl + X 2, Y2 = Xl - X 2 maps d onto the space fjJ of that into into into o= ·t(YI + Y2), 1 = ·t(YI + Y2), o= ·t(YI - Y2), 1 = !(YI - Y2)' . Accordingly, fjJ is as shown in Figure 4.2. Finally, Pr [(Yl, Y 2 ) E B] = Pr [(Xl> X 2 ) E A] = LJq:>(Xl> x2) dXl dx2. OXI oXI 1 1 0YI 0Y2 Z Z 1 J = oX2 oX2 1 1 = -Z· 0YI 0Y2 Z-Z We now proceed with the problem of finding the joint p.d.f. of two functions of two continuous-type random variables. Let Xl and X 2 be random variables of the continuous type, having joint p.d.f. q:>(xv x2 ) . Let d be the two-dimensional set in the x1x2-plane where q:>(xv x2) > O. Let Yl = u1(X v X 2 ) be a random variable whose p.d.f. is to be found. If YI = U l(xv x2) and Y2 = u2(Xl> x2) define a one-to-one transformation of d onto a set flJ in the Y1Y2-plane (with nonidentically vanishing Jacobian), we can find, by use of a theorem in analysis, the joint p.d.f. of Yl = u1(X v X 2) and Y2 = u2(Xv X 2). Let A be a subset of d, and let B denote the mapping of A under the one-to-one transformation (see Figure 4.3). The events (Xv X 2 ) EA and (Yl, Y 2 ) E B are equivalent. Hence To determine the set fjJ in the YIY2-plane onto which d is mapped under the transformation, note that the boundaries of d are transformed as follows into the boundaries of fjJ; Xl = 0
  • 75. example. Moreover, the Jacobian of that transformation has been shown to be I = -toThus g(Yl> Y2) = <pH(Yl + Y2)' t(Yl - Y2)]III = f[t(Yl + Y2)]f[t(Yl - Y2)] III = t, (Yl> Y2) E ffl, = 0 elsewhere. Because ffl is not a product space, the random variables Y1 and Y2 are stochastically dependent. The marginal p.d.f. of Y1 is given by gl(Yl) = f",g(Yl' Y2) dY2' o< Yl < 00,0 < Y2 < 1, In accordance with Theorem 1, Section 2.4, the random variables are sto- chastically independent. The marginal p.d.f. of Y2 is o < Yl < 00, 0 < Y2 < I} in the Y1Y2-plane. The joint p.d.f. of Yl and Y2 is then 139 1 g(Yl' Y2) = (Yl) r(0:)r(,8) (Y1Y2)a-l[Yl(1 - Y2)]B-le- Y1 _ y~-l(1 - Y2)B-l ya+B-le-Y - r(0:)r(,8) 1 1, = 0 elsewhere. Sec. 4.3] Transformations oj Variables oj the Continuous Type Distributions oj Functions oj Random Variables [Ch.4 138 o< u, ::; 1, If we refer to Figure 4.2, it is seen that J Y l gl(Yl) = 1- dY2 = Yl> -Yl f 2- Y1 = 1- dY2 = 2 - Yl' YI- 2 = 0 elsewhere. 1 < s, < 2, 0< Y2 < 1, = 0 elsewhere. o < Y2 < 1, -1 < Y2 s 0, o< Yl < 00 In a similar manner, the marginal p.d.f. g2(Y2) is given by J Y2+2 1 g2(Y2) = 1- dYl = Y2 + , -Y2 f 2 - Y2 = 1- dYl = 1 - Y2' Y2 = 0 elsewhere. Example 5. Let Xl and X 2 be two stochastically independent random variables that have gamma distributions and joint p.d.f. This p.d.f. is that of the beta distribution with parameters 0: and ,8. Since g(Yl> Y2) == gl(Yl)g2(Y2), it must be that the p.d.f. of Y l is 1 gl(Yl) = r(o: + ,8) y~+B-le-Yl, = 0 elsewhere, which is that of a gamma distribution with parameter values of 0: + ,8and 1. It is an easy exercise to show that the mean and the variance of Y 2 , which has a beta distribution with parameters 0: and ,8, are, respectively, I =I~ ~ 1= 2; Example 6. Let Yl = t(Xl - X 2 ) , where Xl and X2 are stochastically independent random variables, each being X2(2). The joint p.d.£. of Xl and X 2 is 2 0:,8 a = (0: + ,8 + 1)(0: + ,8)2 0: JL =--, 0:+,8 f(X l)f(X2) = lexp ( - Xl ; X2), = 0 elsewhere. Let Y 2 = X 2 so that Yl = t(xl - x2), Y2 = X2, or Xl = 2Yl + Y2' X2 = Y2 define a one-to-one transformation from d = {(Xl> X2); 0 < Xl < 00, o < X2 < oo} onto f11J = {(Yl> Y2); - 2Yl < Y2 and 0 < Y2' -00 < Yl < co}. The Jacobian of the transformation is ) 1 a-l B-le- x 1 - x 2 0 < Xl < 00,0 < X 2 < 00, f(Xl> X2 = r(o:)r(,8) Xl X2 ' zero elsewhere where 0: > 0, ,8 > O. Let Yl = Xl + X2 and Y2 = X l/(X1 + X 2)' We shall show that Y1and Y2are stochas~ically independent. The space d is, exclusive of the points on the coordinate axes, the first quadrant of the xlx2-plane. Now Yl = ul(Xl> X2) = Xl + X2, Xl Y2 = u2(Xl> X2) = + X Xl 2 may be written Xl = Y1Y2, X2 = Yl(1 - Y2)' so I = Y2 YlI= - u, '1= O. 1 - Y2 -Yl The transformation is one-to-one, and it maps d onto d1J = {(Yl'Y2);
  • 76. 140 Distributions of Functions of Random Variables [Ch. 4 Sec. 4.3] Transformations of Variables of the Continuous Type 141 hence the joint p.d.f. of Y1 and Y2 is g(y Y) - 8 e-Y 1 -Y2 1>2-4 ' are necessary. For instance, consider the important normal case in which we desire to determine X so that it is nCO, 1). Of course, once X is determined, other normal variables can then be obtained through X by the transformation Z = aX + fL. = 0 elsewhere. gl(YI) = te-1Y11, -00 < Yl < 00. This p.d.f. is now frequently called the double exponential p.d.f. Example 7. In this example a rather important result is established. Let Xl and X 2 be stochastically independent random variables of the continuous type with joint p.d.f. !1(Xl)!2(X2) that is positive on the two- dimensional space d. Let Yl = Ul(XI), a function of Xl alone, and Y 2 = U2(X2), a function of X 2 alone. We assume for the present that Yl = UI(XI), Y2 = U2(X2) define a one-to-one transformation from d onto a two-dimensional set f!B in the YIY2-plane. Solving for Xl and X 2 in terms of Yl and Y2' we have Xl = Wl(Yl) and X2 = W2(Y2), so Thus the p.d.f. of Y1 is given by gl(Yl) = f'" te-Y1 -Y2 dY2 = teY1 , -2Yl = fa'" -!e-Y1 -Y2 dY2 = te-Y1 , or -00 < Yl < 0, os YI < 00, To simulate normal variables, Box and Muller suggested the follow- ing scheme. Let Y1> YZ be a random sample from the uniform distri- bution over 0 < Y < 1. Define X I and X Z by Xl = (-2In YI)I/Z cos (27TYZ)' X z = (-2In yl)l/Z sin (27TYZ)' The corresponding transformation is one-to-one and maps {(Yl> yz); o< YI < 1,0 < Yz < I} onto {(Xl> xz); -00 < Xl < 00, -00 < Xz < co] except for sets involving Xl = 0 and X z = 0, which have probability zero. The inverse transformation is given by ( x~ + X~) YI = exp - 2 ' 1 Xz Yz = - arctan -. 27T Xl This has the Jacobian ( _ xz) exp ( _ x~ ~ X~) l/xI ]= ( xZ+XZ) exp _ I Z 2 Since the joint p.d.f. of YI and Y z is 1 on 0 < YI < 1,0 < Yz < 1, and zero elsewhere, the joint p.d.f. of Xl and X z is That is, Xl and X z are stochastically independent random variables each being nCO, 1). ' g(Y1> Y2) = !1[WI(YI)]!2[W2(Y2)]lw~(YI)W;(Y2)J, = 0 elsewhere. However, from the procedure for changing variables in the case of one random variable, we see that the marginal probability density functions of YI and Y2 are, respectively, gl(YI) = !1[WI(Yl)]lw~(YI)1 and g2(Y2) !2[W2(Y2)] !W;(Y2)! for u, and Y2 in some appropriate sets. Consequently, g(Y1> Y2) == gl(YI)g2(Y2)' Thus, summarizing, we note that, if Xl and X 2 are stochastically independent random variables, then YI = Ul(XI) and Y2 = U2(X2) are also stochastically independent random variables. It has been seen that the result holds if Xl and X 2 are of the discrete type; see Exercise 4.19. Example 8. In the simulation of random variables using uniform random variables, it is frequently difficult to solve Y = F(x) for x. Thus other methods IW~(Yl) 0 I '()'( );p 0 J = 0 W;(Y2) = WI YI W2 Y2 . Hence the joint p.d.f. of Yl and Y2 is
  • 77. EXERCISES 4.20. Let X have the p.d.f. f(x) = x2/9, 0 < x < 3, zero elsewhere. Find the p.d.f. of Y = X3. 4.21. If the p.d.f. of X is f(x) = 2xe- x2, 0 < x < 00, zero elsewhere, determine the p.d.f. of Y = X2. 4.22. Let Xl' X 2 be a random sample from the normal distribution nCO, 1). Show that the marginal p.d.f. of YI = XI/X2 is the Cauchy p.d.J. Sec. 4.4] The t and F Distributions 142 Distributions oj Functions oj Random Variables [eh.4 143 I'(l - t)f(1 + t), -1 < t < 1. Hint. In the integral representing M(t), let y = (1 + e-X)-l. 4.29. Let X have the uniform distribution over the interval (-7T/2, 7T/2). Show that Y = tan X has a Cauchy distribution. 4.30. Let Xl and X 2 be two stochastically independent random variables of the :ontinuous type with probability density functions f(xl ) and g(x2 ) , respectively. Show that the p.d.f. hey) of Y = Xl + X 2 can be found by the convolution formula, -00 < YI < 00. hey) = I~cof(Y - w)g(w) dw. -00 < W < 00, °< v < 00, Hint. Let Y2 = X 2 and take the p.d.f. of X2 to be equal to zero at X 2 = O. Then determine the joint p.d.f. of YI and Y2' Be sure to multiply by the absolute value of the Jacobian. 4.23. Find the mean and variance of the beta distribution considered in Example 5. Hint. From that example, we know that io l ya-I(1 - y)Ii-1 dy = r(a)r(,8) f(a + ,8) for all a > 0, ,8 > O. 4.24. Determine the constant c in each of the following so that eachf(x) is a beta p.d.f (a) f(x) = cx(1 - X)3, 0 < X < 1, zero elsewhere. (b) f(x) = cx4(1 - X)5, 0 < X < 1, zero elsewhere. (c) f(x) = cx2(1 - X)8, 0 < X < 1, zero elsewhere. 4.25. Determine the constant c so that f(x) = cx(3 - X)4, 0 < X < 3, zero elsewhere, is a p.d.f. 4.26. Show that the graph of the beta p.d.f. is symmetric about the vertical line through x = t if a = ,8. 4.27. Show, for k = 1, 2, ... , n, that f (k _ 1)~~n _ k)! zk-I(1 - z)n-k dz = :~: (:)PX(1 - p)n-x. This demonstrates the relationship between the distribution functions of the beta and binomial distributions. 4.28. Let X have the logisticP.d.j.f(x) = e- x/(1 + e- X )2, -00 < x < 00. (a) Show that the graph of f(x) is symmetric about the vertical axis through x = o. (b) Find the distribution function of X. (c) Show that the moment-generating function M(t) of X is 4.31. Let Xl and X 2 be two stochastically independent normal random variables, each with mean zero and variance one (possibly resulting from a Box-Muller transformation). Show that Zl = iLl + alXV Z2 = iL2 + pa2X I + a2Vf"=P2X2, where 0 < aI, 0 < a2, and 0 < P < 1, have a bivariate normal distribution with respective parameters t-«. iL2' at a~, and p. 4.32. Let Xl and X 2 denote a random sample of size 2 from a distribution that is nip: a 2 ) . Let YI = Xl + X2 and Y2 = Xl - X 2 • Find the joint p.d.f. of YI and Y2 and show that these random variables are stochastically independent. 4.33. Let Xl and X2 denote a random sample of size 2 from a distribution that is neiL, a 2 ) . Let YI = Xl + X 2 and Y2 = Xl + 2X . Show that the • • 2 joint p.d.f. of YI and Y2 is bivariate normal with correlation coefficient 3/VlO. 4.4 The t and F Distributions It is the purpose of this section to define two additional distributions quite useful in certain problems of statistical inference. These are called, respectively, the (Student's) t distribution and the F distribution. Let W denote a random variable that is nCO, 1); let V denote a ~andom variable that is x2 (r); and let Wand V be stochastically independent, Then the joint p.d.f. 
of Wand V, say cp(w, v), is the product of the p.d.f. of Wand that of V or 1 1 cp(w, v) = -- e- w2/2 vT/2-1e-v/2 VZ; f(r/2)2T / 2 ' = °elsewhere.
  • 78. 144 Distributions oj Functions oj Random Variables [eh.4 Sec. 4.4] The t and F Distributions 145 = 0 elsewhere. Define a new random variable T by writing The marginal p.d.f. of T is then F = V/r1 vt-, and we propose finding the p.d.f. gl(f) of F. The equations f -_ u/r1 , z = v, v/r2 define a one-to-one transformation that maps the set d' = {(u, v); o < u < 00,0 < v < oo] onto the set!!lJ = {(f, z); 0 < f < 00,0 < z < co], Since u = (rIfr2)zf, v = z, the absolute value of the Jacobian of the transformation is III = (r1/r2)z. The joint p.d.f. g(f, z) of the random variables F and Z = V is then 1 (r1Zf)rl/2-1 g(f, z) = r(r1/2)r(r2/2)2(r1 +r2)/2 ~ zT2/ 2 - 1 = 0 elsewhere. We define the new random variable [ z ("Ii )]"l Z x exp - - - + 1 -, 2 r2 "2 provided that (f, z) E!!lJ, and zero elsewhere. The marginal p.d.f. gl(f) of F is then has the immediately preceding p.d.f. gl(t). The distribution of the ran- dom variable T is usually called a t distribution. It should be observed that a t distribution is completely determined by the parameter r, the number of degrees of freedom of the random variable that has the chi-square distribution. Some approximate values of Pr (T s t) = roo gl(W) dw for selected values of rand t, can be found in Table IV in Appendix B. Next consider two stochastically independent chi-square random variables V and V having r1 and r2 degrees of freedom, respectively. The joint p.d.f. ep(u, v) of V and V is then 1 ep(u v) = url/2-1vr2/2-1e-(U+v)/2 , r (r1/2)r(r2/2)2(r1 +r 2)/2 ' o < u < 00, 0 < v < 00, -00 < t < 00. u=v and w t=-- VV[r s.(t) = f~00 g(t, u) du = foo 1 u(r+1)/2-1 exp [-~ (1 + ~)l duo o V271"Yr(r/2)2r/2 2 " In this integral let z = u[l + (t2 /r)J/2, and it is seen that foo 1 (2Z )(r+1)/2-1 ( 2 ) gl(t) = Jo V2Trrr(r/2)2r/2 1 + t2/r e- Z 1 + t2/r dz q(r + 1)/2J 1 = VTrrr(r/2) (1 + t2/r)(r+1)/2' Thus, if W is n(O, 1), if V is X2(r), and if Wand V are stochastically independent, then W T=-= VV/r g(t, u) = <pe~, u) III _ 1 Ur/2-1 exp [-~ (1 + ~)J vu, - y"i;r(r/2)2r/2 2 r vr -00 < t < 00, 0 < u < 00, define a one-to-one transformation that maps d' = {(w, v); -00 < w < 00,0 < v < co} onto!!lJ = {(t,u); -00 < t < 00,0 < u < co}, Since w = tVu/vr, v = u, the absolute value of the Jacobian of the trans- formation is III = VU/Vr. Accordingly, the joint p.d.f. of T and V = V is given by W T=--· VV/r The change-of-variable technique will be used to obtain the p.d.f. gl(t) of T. The equations
  • 79. 146 Distributions of Functions of Random Variables [eb.4 Sec. 4.5] Extensions of the Change-of-Variable Technique 147 If we change the variable of integration by writing z(rd ) Y=- -+1, 2 r2 it can be seen that _ (rJJ (r1/r2)Tl/ 2(f)Tl/2- 1 ( 2y )<Tl+T2)/2-1 e:» g1(f) - Jo r(r1/2)r(r2/2)2<T1 +T2)/2 r.tt:« + 1 4.38. Let T = WjvVj" where the stochastically independent variables W and V are, respectively, normal with mean zero and variance 1 and chi-square with, degrees of freedom. Show that P has an F distribution with parameters '1 = 1 and '2 = r. Hint. What is the distribution of the numerator of T2? where F has an F distribution with parameters '1 and '2. has a beta distri- bution. 4.39. Show that the t distribution with, = 1 degree of freedom and the Cauchy distribution are the same. 4.40. Show that o < f < 00, ( 2 ) x dy rd/r2 + 1 r[(r1 + r2)/2](r1/r2)Tl/2 (f)T1/2-1 = r(r1/2)r(r2/2) (1 + rd/r2)<T1 +T 2)/2, = 0 elsewhere. Accordingly, if U and V are stochastically independent chi-square variables with r 1 and r 2 degrees of freedom, respectively, then 4.41. Let Xl' X 2 be a random sample from a distribution having the p.d.f. f(x) = e-x , 0 < x < 00, zero elsewhere. Show that Z = X 1/X2 has an F distribution. F = U/r1 V/r2 has the immediately preceding p.d.f. g1(f). The distribution of this ran- dom variable is usually called an F distribution. It should be observed that an F distribution is completely determined by the two parameters r1 and r2' Table V in Appendix B gives some approximate values of Pr (F 5, f) = f:g1(W) dw 4.5 Extensions of the Change-of-Variable Technique In Section 4.3 it was seen that the determination of the joint p.d.f. of two functions of two random variables of the continuous type was essentially a corollary to a theorem in analysis having to do with the change of variables in a twofold integral. This theorem has a natural extension to n-fold integrals. This extension is as follows. Consider an integral of the form for selected values of r1> r2, and f. EXERCISES 4.34. Let T have a t distribution with 10 degrees of freedom. Find Pr (ITI > 2.228) from Table IV. 4.35. Let T have a t distribution with 14 degrees of freedom. Determine b so that Pr (- b < T < b) = 0.90. 4.36. Let F have an F distribution with parameters zj and '2' Prove that 1jF has an F distribution with parameters z, and z.. 4.37. If F has an F distribution with parameters r; = 5 and r, = 10, find a and b so that Pr (F 5, a) = 0.05 and Pr (F 5, b) = 0.95, and, accordingly, Pr (a < F < b) = 0.90. Hint. Write Pr (F 5, a) = Pr (ljF ~ 1ja) 1 - Pr (l/F 5, 1ja), and use the result of Exercise 4.36 and Table V. taken over a subset A of an n-dimensional space .xl. Let together with the inverse functions define a one-to-one transformation that maps .xl onto f!J in the Y1> Y2' ... , Ynspace (and hence maps the subset A of.xl onto a subset B
  • 80. = 0 elsewhere. Let of ~). Let the first partial derivatives of the inverse functions be con- tinuous and let the n by n determinant (called the Jacobian) 149 -00 < x < 00, Yk+1 0 0 Y1 0 Yk+1 0 Y2 J= = ?fi+1' 0 0 Yk+1 Yk -Yk+1 -Yk+1 -Yk+1 (1 - Y1 - ... - Yk) Hence the joint p.d.f. of YI> ... , Yk, Yk+1 is given by Sec. 4.5] Extensions oj the Change-of-Variable Technique We now consider some other problems that are encountered when transforming variables. Let X have the Cauchy p.d.f. 1 f(x) = 7T(1 + x2)' Y~;'+1·"+ak+l-1Y~1-1 .. 'Y~k-1(1 - Y1 - ... - Yk)ak+l- l e- lIk+l r(a1) ... r(ak)r(ak+1) provided that (Y1' .. " Yk' Yk+ 1) Ef1# and is equal to zero elsewhere. The joint p.d.f. of YI> .. " Y k is seen by inspection to be given by ( ) r(a1 + ... + ak+1) a -1 a -1(1 )a -1 gYI>···,Yk = Y11 ···Ykk - Y1 _ ... - Yk k+1 , r(a1) ... r(ak+ 1) when 0 < Yl' i = 1, .. " k, Y1 + ... + Yk < 1, while the function g is equal to zero elsewhere. Random variables Y I> .. " Yk that have a joint p.d.f. of this form are said to have a Dirichlet distribution with parameters al> ... , ak' ak+1> and any such g(Y1"'" Yk) is called a Dirichlet p.d.f. It is seen, in the special case of k = 1, that the Dirichlet p.d.f. becomes a beta p.d.f. Moreover, it is also clear from the joint p.d.f. of YI> , Y k» Y k+1 that Yk+1 has a gamma distribution with parameters a1 + + ak + ak+1 and fJ = 1 and that Y k+1 is stochastically independent of YI> Y 2, 0 • • , Yk' The associated transformation maps d = {(XI> ••• , Xk +1); 0 < Xl < 00, i = 1, ... , k + I} onto the space f1# = {(YI>"" Yk' Yk+1); 0 < Yl' i = 1, ... , k, Y1 + ... + Yk < 1,0 < Yk+1 < co]. The single-valued inverse functions are Xl = Y1Yk+1, ...• Xk = YkYk+1, X k+1 = Yk+1(1 - Y1 - ... - Yk), so that the Jacobian is and let Y = X2. We seek the p.d.f. g(y) of Y. Consider the trans- formation Y = x2 • This transformation maps the space of X, d = {x; -00 < x < co], onto ~ = {y; 0 :s Y < co}, However, the transfor- mationisnot one-to-one. To eachYE~, with the exceptionofy = 0, there correspond two points xEd. For example, if Y = 4, we may have either x = 2 or x = - 2. In such an instance, we represent d as the union of two disjoint sets A 1 and A 2 such that Y = x2 defines a one-to-one trans- 0< Xl < 00, i = 1,2, ... , k, Distributions oj Functions oj Random Variables [eh.4 s, Yl = X ' Xl + X2 + ... + k+1 and Y k +1 = Xl + X 2 + ... + X k +1 denote k + 1 new random variables. when (Y1' Y2' . 0 0 , Yn) E ~, and is zero elsewhere. Example 1. Let XI> X 2 , ••• , X k + 1 be mutually stochastically indepen- dent random variables, each having a gamma distribution with fJ = 1. The joint p.d.f. of these variables may be written as Whenever the conditions of this theorem are satisfied, we can determine the joint p.d.f. of n functions of n random variables. Appropriate changes of notation in Section 4.3 (to indicate n-space as opposed to 2-space) is all that is needed to show that the joint p.d.f. of the random variables YI = UI(Xl>X2 " , .,Xn), Y2 = U 2(X I, X 2, · · .,Xn),···, Y, = un(Xl> X 2, ... , Xn)-where the joint p.d.f. of Xl> X 2, .. 0' X; is cp(Xl> . 0 • , xn)-is given by f·A•fcp(Xl> X2' . 0 0 ' xn)dX1 dX20 • • dxn = rB' fcp[w1(Yl> .. 0' Yn),W2(Yl> .... Yn)• . . ., Wn(Y1' 0 •• , Yn)] x IJI dYI dY2' .. dYn' oX1 OX1 oX1 °Y1 Oy2 oYn oX2 OX2 OX2 J= °Y1 °Y2 oYn oXn OX~, oXn Oy1 °Y2 Oyn not vanish identically in ~. Then 148
  • 81. Distributions oj Functions oj Random Variables [Ch.4 151 t = 1,2, ... , k, Xl = wI,(Yv Y2' , Yn), X2 = w2,(Yv Y2, , Yn), OWli OWli oWli °YI °Y2 °Yn OW2i OW21 OW2i 11= °YI °Y2 °Yn t = 1,2, ... , k, OWni OWn' OWni 0YI 0Y2 0Yn be not identically equal to zero in flj. From a consideration of the prob- ability of the union of k mutually exclusive events and by applying the change of variable technique to the probability of each of these events, define a one-to-one transformation of each A, onto flj. Thus, to each point in flj there will correspond exactly one point in each of AI, A 2 , ... ,A". Let denote the k groups of n inverse functions, one group for each of these k transformations. Let the first partial derivatives be continuous and let each Sec. 4.5] Extensions oj the Change-of-Variable Technique why we sought to partition d (or a modification of d) into two disjoint subsets such that the transformation Y = x2 maps each onto the same flj. Had there been three inverse functions, we would have sought to partition d (or a modified form of d) into three disjoint subsets, and so on. It is hoped that this detailed discussion will make the follow- ing paragraph easier to read. Let <p(xv x2 , ••• , xn) be the joint p.d.f. of X v X 2' ... , X n' which are random variables of the continuous type. Let d be the n-dimensional space where <P(XI' x2, ... , xn) > 0, and consider the transformation YI = UI(XI,X2," .,Xn),Y2 = U2(XI,X2,·· .,xn),·· ·,Yn = Un(XI,X2,·· .,xn), which maps d onto flj in the YI,Y2, ... , Yn space. To each point of d there will correspond, of course, but one point in flj; but to a point in flj there may correspond more than one point in d. That is, the trans- formation may not be one-to-one. Suppose, however, that we can represent d as the union of a finite number, say k, of mutually disjoint sets A v A 2 , ••• , A" so that Y E B. o < Y < 00, g(y) 7T(1 + y)yY' g(y) = .1;- [f( - Vy) + f(yY)], 2vy With f(x) the Cauchy p.d.f. we have 1 In the first of these integrals, let x = - vy. Thus the Jacobian, say1 v is -lj2yY; moreover, the set As is mapped onto B. In the second integral let x = yY. Thus the Jacobian, say 12' is Ij2yY; moreover, the set A 4 is also mapped onto B. Finally, Pr (Y E B) = tf(-VY) - 2~Ydy + L f(yY) 2~dy = f[f( - vy) + f(Vy)] .1;- dy. B 2vy Hence the p.d.f. of Y is given by = 0 elsewhere. In the preceding discussion of a random variable of the c~ntinuous type, we had two inverse functions, x = - yY and x = Vy. That is 150 formation that maps each of Al and A 2 onto flj. If we take Al to be {x; -00 < x < O} and A 2 to be {x; 0 ::; x < co], we see that Al is mapped onto {y; 0 < Y < oo], whereas A 2 is mapped onto {y; 0 ::; Y < co], and these sets are not the same. Our difficulty is caused by the fact that x = 0 is an element of d. Why, then, do we not return to the Cauchy p.d.f. and take f(O) = O? Then our new dis d = {-oo < x < 00 but x -# O}. We then take zl, = {x; -00 < x < 0}~ndA2 = {x; 0 < x < co], Thus Y = x2 , with the inverse x = - vY, maps Al onto flj = {y; 0 < Y < oo] and the transformation is one-to-one. Moreover, the transformation Y = x2 , with inverse x = yY, maps A 2 on~o flj = {y; 0 < Y < co] and the transformation is one-to-one. Consider the probability Pr (Y E B), where B c flj. Let As = {x; x = - yY, Y E B} C Al and let A 4 = {x; X = yY, Y E B} C A 2 • Then Y E B when and only when X E As or X E A 4 • Thus we have Pr (Y E B) = Pr (X E As) + Pr (X E A 4) = f f(x) dx + f f(x) dx. Aa A4
  • 82. -00 < Yl < 00, 0 < Y2 < 00. EXERCISES We can make three interesting observations. The mean Yl of our random sample is n(O, -!-); Y 2 , which is twice the variance of our sample, is i"(I); and the two are stochastically independent. Thus the mean and the variance of our sample are stochastically independent. 153 Sec. 4.5] Extensions oj the Change-of-Variabte Technique Xl =1= x2}. This space is the union of the two disjoint sets A l = {(Xl, X2); X2 > Xl} and A2 = {(Xl' X2); X2 < Xl}' Moreover, our transformation now defines a one-to-one transformation of each A" i = 1,2, onto the new fll = {(Yv Y2); -00 < Yl < 00, 0 < Y2 < co}. We can now find the joint p.d.f., say g(yv Y2), of the mean Y1 and twice the variance Y2 of our random sample. An easy computation shows that 1111 = 1121 = I/V2Y2' Thus 152 Distributions oj Functions oj Random Variables [eb.4 it can be seen that the joint p.d.f. of Y1 = u1(X 1> X 2,···, X n), Y 2 = U2(X 1 , X 2, ... , X n),... , Yn = un(X 1> X 2, ... , X n), is given by k g(Y1> Y2' ... , Yn) = .L J;<P[Wll(Y1>"" Yn), ... , Wni(Y1> ... , Yn)], t=l provided that (Y1> Y2' ... , Yn) E fll, and equals zero elsewhere. The p.d.f. of any Yi' say Y1> is then gl(Yl) = j:00 ••• j:00 g(Yl' Y2' .•. , Yn) dY2' .. dYn' An illustrative example follows. Example 2. To illustrate the result just obta~ne~, t~ke n = ~ and let Xv X 2 denote a random sample of size 2 from a dIstrIbutIOn that ISn(O, 1). The joint p.d.f. of x, and X 2 is 1 (Xi + x~) f(xv x2) = 2n exp ---2-' -00 < Xl < 00, -00 < X2 < 00. Let Y1 denote the mean and let Y2 denote twice the variance of the random sample. The associated transformation is Xl + X2 Yl =--z-' (Xl - X2)2 Y2 = 2 . This transformation maps d = {(xv X2); -00 < Xl < 00, -00 < X2 < co} onto fll = {(Yv Y2); -00 < Yl < 00, 0 s Y2 < co}. But. the tran~formation is not one-to-one because, to each point in fll, exclusive of points where Y2 = 0, there correspond two points in d. In fact, the two groups of inverse functions are and 4.42. Let Xv X 2 , X s denote a random sample from a normal distribution n(O, 1). Let the random variables Yv Y 2 , Ys be defined by Xl = Yl cos Y2sin v; X 2= Yl sin Y2sin Ys, X s = Yl cos Ys, where 0 s v, < 00,0 s Y2 < 2n, 0 s v, s tr. Show that Yv Y 2 , v, are mutually stochastically independent. 4.43. Let Xv X 2 , X s denote a random sample from the distribution having p.d.f. f(x) = e- x , 0 < X < 00, zero elsewhere. Show that Moreover the set d cannot be represented as the union of two disjoint sets, each of which under our transformation maps onto fll. Our difficulty is caused by those points of d that lie on the line whose equation is X2 = Xl' At each of these points we have Y2 = O. However, v:e c~n define f~Xl' X2) to be zero at each point where Xl = X2' We can do this without .alten.ng the distribution of probability, because the probability measure of this set ISzero. Thus we have a new d = {(xv X2); -00 < Xl < 00, -00 < X2 < 00, but are mutually stochastically independent. 4.44. Let Xv X 2 , ••• , X, be r mutually stochastically independent gamma variables with parameters a = at and f3 = 1, i = 1, 2, .. " r, r~spec~ively. Show that Yl = Xl + X 2 + ... + X, has a gamma distribu- tion with parameters a = al + ... + a r and f3 = 1.Hint. Let Y2 = X 2 + ... + X" Ys = X s + ... + X Y = X r, ... , T ,. 4.45. Let Y v ... , Yk have a Dirichlet distribution with parameters av ••• , ak' ak+ r-
  • 83. (a) Show that Yl has a beta distribution with parameters a = al and f3 = a2 + ... + ak+l' (b) Show that Yl + ... + Y" r :$ k, has a beta distribution with parameters a = al + ... + aT and f3 = aT+ l + ... + ak+l' (c) Show that Y l + Y2 , Ys + Y4 , Y5 , ••• , Y k , k :2: 5, have a Dirichlet distribution with parameters al + a2' as + a4, a5, ... , ak' ak+ r- Hint. Recall the definition of Y, in Example 1 and use the fact that the sum of several stochastically independent gamma variables with f3 = 1 is a gamma variable (Exercise 4.44). 4.46. Let Xl' X 2 , and Xs be three mutually stochastically independent chi-square variables with rv r2 , and rs degrees of freedom, respectively. (a) Show that Y l = X l/X2 and Y2 = Xl + X 2 are stochastically independent and that Y2 is X2 (r l + r 2) . (b) Deduce that (1) Sec. 4.6] Distributions of Order Statistics 154 Distributions of Functions of Random Variables [Ch.4 155 statistic of the random sample X X X It ill b h . . V 2,··· , n' W e s own that the joint p.d.f. of Y v Y2, .•• , Yn is given by g(yv Y2' ... , Yn) = (nJ)!(Yl)!(Y2) ...!(Yn), a < Yl < Y2 < ... < Yn < b, = °elsewhere. We sh~ll prove this only for the case n = 3, but the argument is seen to be entirely general. With n = 3 the J' oint pdf f X X X I'S . ' . . . 0 V 2, s !(Xl)!x(X2)!(b x)s)T ' hC.onslder a probability such as Pr (a < x, = X 2 < b, a < a < . IS probability is given by J:J:J::!(Xl)!(X2)!(xs) dXl dX2 dxs = 0, and are stochastically independent F variables. 4.47. Iff(x) = 1-, -1 < x < 1,zero elsewhere, is the p.d.f. of the random variable X, find the p.d.f. of Y = X2. 4.48. If Xl' X2 is a random sample from a distribution that is n(O, 1), find the joint p.d.f. of Yl = Xi + X~ and Y2 = X2 and the marginal p.d.f. of Yl. Hint. Note thatthe space of Yl and Y2 is given by -vYl < Y2 < VYv o< Yl < 00. 4.49. If X has the p.d.f. f(x) = t, -1 < x < 3, zero elsewhere, find the p.d.f. of Y = X2. Hint. Here f!J = {y; 0 :$ Y < 9} and the event Y E B is the union of two mutually exclusive events if B = {y; 0 < Y < 1}. 4.6 Distributions of Order Statistics In this section the notion of an order statistic will be defined and we shall investigate some of the simpler properties of such a statistic. These statistics have in recent times come to play an important role in statistical inference partly because some of their properties do not depend upon the distribution from which the random sample is obtained. Let Xv X 2 , •.• , X; denote a random sample from a distribution of the continuous type having a p.d.f. !(x) that is positive, provided that a < X < b. Let Y 1 be the smallest of these X" Y 2 the next X, in order of magnitude, ... , and Yn the largest X,. That is, Y 1 < Y 2 < ... < Yn represent Xl> X 2 , ••• , X n when the latter are arranged in ascending order of magnitude. Then Y" i = 1,2, ... , n, is called the ith order since is defined in calculus to be zero. As has been pointed out we may without altering the distribution of X X X define the J.' . t d f' V 2, s, om p. . . !(Xl)!(X2)!(xs) to be zero at all points (Xl' X2, xs) that have at least two o.f their c~ordinates ~qual. Then the set .91, where !(Xl)!(X2)!(xs) > 0, IS the umon of the SIX mutually disjoint sets: Al = {(Xv X2, xs); a < Xl < X2 < Xs < b}, A 2 = {(Xv X2, xs); a < X2 < Xl < Xs < b}, As = {(Xl, X2, Xs); a < Xl < Xs < X2 < b}, A 4 = {(Xl, X2, Xs); a < X2 < Xs < Xl < b}, A 5 = {(Xv X2, Xs); a < Xs < Xl < X2 < b}, Ae = {(Xv X2, Xs); a < Xs < X2 < Xl < b}. There are six of these sets because we can arrange X X • • I I . . 
V 2, X s in precise y 3. = 6 ways. Consider the functions Yl = minimum of X X x· Y = 'ddl . . 1, 2, S' 2 rm e m magmtude of Xl, X2, Xs ; and Ys = maximum of Xl X X Th f . , 2, a- ese unctIons define one-to-one transformations that map each of A v A 2 , ••• , ~e onto the same set f!4 = {(Yl' Y2' Ys); a < Yl < Y2 < Ys < b}. The mverse functions are for points in A X = Y X _ Y ' . ' V I I , 2 - 2, Xs = Ys; for pomts in A 2, they are Xl = Y2' X2 = Yv X s = Ys; and so on, for each of the remaining four sets. Then we have that 1 ° 0/ I, = ° 1 ° = 1 ° 0 1
  • 84. 156 and Distributions oj Functions oj Random Variables [Ch. 4 Sec. 4.6] Distributions oj Order Statistics Accordingly, 157 010 12 = 1 0 0 =-1. o 0 1 It is easily verified that the absolute value of each of the 3! = 6 Jacobians is +1. Thus the joint p.d.f. of the three order statistics Yl = minimumofXl,X2,Xs; Y2 = middleinmagnitudeofX1>X2,XS; Ys = maximum of Xl' X 2 , X s is If x :::; a, F(x) = 0; and if b :::; z, F(x) = 1. Thus there is a unique median m of the distribution with F(m) = t. Let Xl> X 2 , Xa denote a random sample from this distribution and let Y1 < Y2 < Ya denote the order statistics of the sample. We shall compute the probability that Y2 :::; m. The joint p.d.f. of the three order statistics is g(Yl' Y2, Ys) = Jll!(Yl)!(Y2)!(YS) + 1121!(Y2)!(Yl)!(YS) + ... + 1161!(Ys)!(Y2)!(Yl), a < Y1 < Y2 < Y3 < b, = (31)!(Y1)!(Y2)!(Ys), a < Y1 < Y2 < Ys < b, = 0 elsewhere. a < Y1 < Y2 < ... < Yn < b, Pr (Y2 s m) = 6 r{F(Y2)j(Y2) - [F(Y2)]2j(Y2}} dY2 = 6{[F(Y2)]2 _ [F(Y2)]a}m = ~. 2 3 a 2 1 - F(x) = F(b) - F(x) = I:!(w) dw - I: !(w) dw = I:f(w) dw. F(x) = 0, x :::; a, = I: f(w) dw, a < x < b, = 1, b :::; x. The procedure used in Example 1 can be used to obtain general formulas for the marginal probability density functions of the order statistics. We shall do this now. Let X denote a random variable of the cont~nuous type having a p.d.f. j(x) that is positive and continuous, provided that a < x < b, and is zero elsewhere. Then the distribution function F(x) may be written Accordingly, F'(x) = !(x), a < x < b. Moreover, if a < x < b, Let X1> X 2 , ••• , X; denote a random sample of size n from this distribution, and let Yl' Y2' ... , Yn denote the order statistics of this random sample. Then the joint p.d.f. of Y1> Y2, .. " Yn is g(Y1> Y2, .. " Yn) = nl!(Y1)!(Y2) .. -f(Yn), = 0 elsewhere. a < x < b. F(x) = f:j(w) dw, This is Equation (1) with n = 3. In accordance with the natural extension of Theorem 1, Section 2.4, to distributions of more than two random variables, it is seen that the order statistics, unlike the items of the random sample, are sto- chastically dependent. Example 1. Let X denote a random variable of the continuous type with a p.d.f. j(x) that is positive and continuous provided that a < x < b, and is zero elsewhere. The distribution function F(x) of X may be written h(Y2) = 6j(Y2) J:2 I:2j(Yl)j(Ya) dYl dYa, = 6j(Y2)F(Y2)[1 - F(Y2)], a < Y2 < b, = 0 elsewhere. g(Y1' Y2' Ya) = 6j(Yl)j(Y2)j(Ya), = 0 elsewhere. The p.d.f. of Y2 is then a < Yl < Y2 < Ya < b, It will first be shown how the marginal p.d.f. of Yn may be expressed in terms of the distribution function F(x) and the p.d.f. !(x) of the random variable X. If a < Yn < b, the marginal p.d.f. of Yn is given by
  • 85. 158 Distributions of Functions of Random Variables [Ch.4 Sec. 4.6] Distributions of Order Statistics 159 Upon completing the integrations, it is found that since F(x) = S:f(w) dw. Now J Y 3 [F(Y2)]2IY 3 a F(Y2)f(Y2) dY2 = 2 a [F(YS)]2 = 2 But so that gl(Yl) = n[1 - F(Yl)]n-1f(Yl)' = 0 elsewhere. Once it is observed that Ix [F(w)]a-lf(w) dw = [F(x)]a, a a a < Yl < b, a> 0 so and that fb [1 _ F(W)]8-1f(w) dw = [1 - F(y)]8, Y ~ ~ > 0, If the successive integrations on Y4' ... , Yn-l are carried out, it is seen that _ 1 [F(YnW- 1 gn(Yn) - n. (n _ 1)1 j(Yn) = n[F(Yn)]n-lj(Yn)' a < Yn < b, = 0 elsewhere. It will next be shown how to express the marginal p.d.f. of Y 1 in terms of F(x) andj(x). We have, for a < Yl < b, gl(Yl) = 5:1' ..J:n-3 J:n-2 J:n-1 n! f(Yl)f(Y2)' . -J(Yn) dYn dYn-l' . ·dY2 =fb···fb fb nlj(Yl)f(Y2)'" Yl Yn-3 Yn-2 j(Yn-l)[l - F(Yn-l)] dYn-l .. , dY2' But f b [1 - F(Yn_l)]2 b Yn-2 [1 - F(Yn-l)]f(Yn-l) dYn-l = 2 Yn-2 [1 - F(Yn_2)]2 2 ' it is easy to express the marginal p.d.f. of any order statistic, say Y k' in terms of F(x) andj(x). This is done by evaluating the integral The result is a < Yk < b, = 0 elsewhere. Example 2. Let Y1 < Y2 < Ys < Y4 denote the order statistics of a random sample of size 4 from a distribution having p.d.f. j(x) = 2x, 0 < x < 1, = 0 elsewhere. We shall express the p.d.f. of Y3 in terms ofj(x) and F(x) and then compute Pr (t < Y 3 ) . Here F(x) = x2 , provided that 0 < x < 1, so that o< Y3 < 1, = 0 elsewhere.
  • 86. 160 Distributions of Functions of Random Variables [Ch.4 Sec. 4.6] Distributions of Order Statistics 161 Thus Pr (-!- < Y3) = fal g3(Y3) dY3 1/2 = fl 24(yg - y~) dY3 = ~t~· 1/2 Finally, the joint p.d.f. of any two order statistics, say Y, < Y;, is as easily expressed in terms of F(x) andf(x). We have g (y Y) = JYI. .. JY2fYi . . . SYi fb ... (b n!f(Yl)' .. ij i'; a a Yl Yj-2 u, JYn-l Certain functions of the order statistics Y l' Y 2, ••• , Y n are important statistics themselves. A few of these are: (a) Yn - Yl' which is called the range of the random sample; (b) (Y1 + Yn)/2, which is called the midrange of the random sample; and (c) if n is odd, Y(n+l)/2' which is called the median of the random sample. Example 3. Let Yl> Y 2 , Y 3 be the order statistics of a random sample of size 3 from a distribution having p.d.f. f(x) = 1, 0 < X < 1, = 0 elsewhere. Since, for y > 0, ry [F(y) _ F(W)y-lf(w) dw = _[F(y) - F(w)]Y IY Jx y x [F(y) - F(x)]Y, y We seek the p.d.f. of the sample range ZI = Y3 - Y1. Since F(x) = X, o< X < 1, the joint p.d.f. of Y1 and Y3 is gI3(Yl> Y3) = 6(Y3 - Yl), 0 < Yl < Y3 < 1, = 0 elsewhere. In addition to ZI = Y3 - Y 1, let Z2 = Y 3. Consider the functions ZI = Y3 - Yl> Z2 = Y3' and their inverses Yl = Z2 - Zl> Y3 = Z2' so that the corresponding Jacobian of the one-to-one-transformation is EXERCISES = 0 elsewhere. o< ZI < 1, 1 -111 o 1 =-1. °Yl °Yl ]= OZI OZ2 °Y3 °Y3 OZI OZ2 Accordingly, the p.d.f. of the range ZI = Y3 - Y1 of the random sample of size 3 is Thus the joint p.d.f. of ZI and Z2 is h(zl> Z2) = 1-116z1 = 6z1 , = 0 elsewhere. 4.50. Let Y 1 < Y 2 < Y 3 < Y 4 be the order statistics of a random sample of size 4 from the distribution having p.d.f. f(x) = e- x , 0 < x < 00, zero elsewhere. Find Pr (3 s Y4)' 4.51. Let Xl> X 2 , X; be a random sample from a distribution of the continuous type having p.d.f. f(x) = 2x,O < x < 1, zero elsewhere. Compute n! n! pi-lpj-i-lpn-jp P (i _ 1)1 (j _ i-I)! (n _ j)! 11 I! 1 2 3 4 5' which is g(Yj, YJ}L1jL1j • for a < Yi < Y; < b, and zero elsewhere. Remark. There is an easy method of remembering a p.d.f. like that given in Formula (3). The probability Pr (Yi < Y, < Yi + L1f , Yj < Y, < Yj + L1j), where L1t and L1j are small, can be approximated by the following multinomial probability. In n independent trials, i-I outcomes must be less than Yt (an event that has probability PI = F(Yi) on each trial); j - i-I outcomes must be between Yi + L1i and Yj [an event with approximate probability P2 = F(YJ) - F(Yf) on each trial]; n - j outcomes must be greater than y. + L1. (an event with approximate probability P3 = 1 - F(Yj) on each J J • trial); one outcome must be between Yi and Yf + L1i (an event with approxi- mate probability P4 = f(Yi)L1i on each trial); and finally one outcome must be between Yj and Yj + L1j [an event with approximate probability P5 = f(Yj)L1j on each trial]. This multinomial probability is (3) gij(Yi' Y;) = (i _ I)! (j - i-I)! (n - j)! x [F(Yi)]i-l[F(y;) - F(Yi)r- i- 1 [1 - F(y;)]n-1(Yt)f(y;) it is found that
  • 87. 162 Distributions of Functions of Random Variables [eh.4 Sec. 4.6] Distributions of Order Statistics 163 the probability that the smallest of these X, exceeds the median of the distribution. 4.52. Let f(x) = i, x = 1, 2, 3, 4, 5, 6, zero elsewhere, be the p.d.f. of a distribution of the discrete type. Show that the p.d.f. of the smallest item of a random sample of size 5 from this distribution is zero elsewhere. Note that in this exercise the random sample is from a distribution of the discrete type. All formulas in the text were derived under the assumption that the random sample is from a distribution of the continuous type and are not applicable. Why? 4.53. Let Y1 < Y2 < Ya < Y4 < Y5 denote the order statistics of a random sample of size 5 from a distribution having p.d.f. f(x) = e- x , o < x < 00, zero elsewhere. Show that Zl = Y2 and Z2 = Y4 - Y2 are stochastically independent. Hint. First find the joint p.d.f. of Y2 and Y 4 • 4.54. Let Y1 < Y2 < ... < Yn be the order statistics of a random sample of size n from a distribution with p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere. Show that the kth order statistic Y" has a beta p.d.f. with param- eters a = k and f3 = n - k + 1. 4.55. Let Y1 < Y2 < ... < Yn be the order statistics from a Weibull distribution, Exercise 3.38, Section 3.3. Find the distribution function and p.d.f. of Y 1 • 4.56. Find the probability that the range of a random sample of size 4 from the uniform distribution having the p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere, is less than -to 4.57. Let Y1 < Y2 < Ya be the order statistics of a random sample of size 3 from a distribution having the p.d.f. f(x) = 2x, 0 < x < 1, zero elsewhere. Show that Zl = Y 1/Y2, Z2 = Y2/YS, and Zs = Ys are mutually stochastically independent. 4.58. If a random sample of size 2 is taken from a distribution having p.d.f. f(x) = 2(1 - x), 0 < x < 1, zero elsewhere, compute the probability that one sample item is at least twice as large as the other. 4.59. Let Y1 < Y2 < Ya denote the order statistics of a random sample of size 3 from a distribution with p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere. Let Z = (Y1 + Ys)/2 be the midrange of the sample. Find the p.d.f. of Z. 4.60. Let Y1 < Y2 denote the order statistics of a random sample of size 2 from n(O, a2 ) . Show that E(Y1 ) = -a/V;. Hint. Evaluate E(Y1 ) by using the joint p.d.f. of Y1 and Y 2 , and first integrating on Yl' 4.61. Let Y1 < Y2 be the order statistics of a random sample of size 2 (7- Yl) 5 (6 - Yl) 5 gl(Yl) = -6- - -6- , Yl = 1, 2, ... , 6, from a distribution of the continuous type which has p.d.f, f(x) such that f(x) > 0, provided x ~ 0, and f(x) = 0 elsewhere. Show that the stochastic independence of Zl = Y1 and Z2 = Y2 - Y1 characterizes the gamma p.d.f. f(x), which has parameters a = 1 and f3 > O. Hint. Use the change-of- variable technique to find the joint p.d.f. of Zl and Z2 from that of Y1 and Y2 • Accept the fact that the functional equation h(O)h(x + y) == h(x)h(y) has the solution h(x) = c1e c2 x , where C1 and C2 are constants. 4.62. Let Y denote the median of a random sample of size n = 2k + 1, k a positive integer, from a distribution which is n(p., a2 ) . Prove that the graph of the p.d.f. of Y is symmetric with respect to the vertical axis through Y = p. and deduce that E(Y) = p.. 4.63. Let X and Y denote stochastically independent random variables with respective probability density functions f(x) = 2x, 0 < x < 1, zero elsewhere, and g(y) = 3y2 , 0 < Y < 1, zero elsewhere. 
Let U = min (X, Y) and V = max (X, Y). Find the joint p.d.f. of U and V. Hint. Here the two inverse transformations are given by x = u, Y = v and x = v, Y = U. 4.64. Let the joint p.d.f. of X and Y bef(x, y) = l./'-x(x + y), 0 < x < 1, o < Y < 1, zero elsewhere. Let U = min (X, Y) and V = max (X, Y). Find the joint p.d.f. of U and V. 4.65. Let Xv X 2 , ••. , X; be a random sample from a distribution of either type. A measure of spread is Gini's mean difference 10 (a) If n = 10, find av a2 , ••. , a10 so that G = 2: a,Yi , where Y v Y 2 , ••• , '=1 Y1 0 are the order statistics of the sample. (b) Show that E(G) = 2a/V; if the sample arises from the normal distribution n(p., a2 ) . 4.66. Let Y1 < Y2 < ... < Yn be the order statistics of a random sample of size n from the exponential distribution with p.d.f. f(x) = e- x , o< x < 00, zero elsewhere. (a) Show that Zl = nYv Z2 = (n - 1)(Y2 - Y1 ), Zs = (n - 2) (Ys - Y 2 ) , •.. , Z; = Yn - Yn - 1 are stochastically independent and that each Z, has the exponential distribution. (b) Demonstrate that all linear functions of Y v Y 2 , ••• , Y n , such as n 2: a,Y" can be expressed as linear functions of stochastically independent 1 random variables. 4.67. In the Program Evaluation and Review Technique (PERT), we are interested in the total time to complete a project that is comprised of a large
  • 88. 4.7 The Moment-Generating-Function Technique M(t) = E(etYl) = J~oo etYlg(Y1) dY1 in the continuous case. It would seem that we need to know g(Y1) before we can compute M(t). That this is not the case is a fundamental fact. To see this consider 165 j = 1, 2, ... , n, ~ = 1, 2, .. " k, k L IIilep[W1j(Y1"'" Yn), ... , wni(Yv .. " Yn)] i=l Sec. 4.7] The Moment-Generating-Function Technique In accordance with Section 4.5, IIlep[WI(Yv Y2' ... , Yn), ... , Wn(Yv Y2' .. " Yn)] is the joint p.d.f. of Y v Y2" •• , Yn' The marginal p.d.f. g(Y1) of Y1 is obtained by integrating this joint p.d.f. on Y2" .. , Yn' Since the factor etYl does not involve the variables Y2"'" Yn' display (2) may be written as (3) J~<x> etYlg(Y1) dY1' But this is by definition the moment-generating function M(t) of the distribution of Yl' That is, we can compute E[exp (tu1(Xv . . ., X n)] and have the value of E(etY1 ) , where Y1 = u1(X V ••• , X n) . This fact provides another technique to help us find the p.d.f. of a function of several random variables. For if the moment-generating function of Y1 is seen to be that of a certain kind of distribution, the uniqueness property makes it certain that Y1 has that kind of distribution. When the p.d.f. of Y1 is obtained in this manner, we say that we use the moment-generating-function technique. The reader will observe that we have assumed the transformation to be one-to-one. We did this for simplicity of presentation. If the transformation is not one-to-one, let E[w(Y1)] = J~00 W(Y1)g(Y1) dY1 = J~00 ••• J:00 w[u1(xv . • ., xn)]ep(Xl' ... , xn) dXl' .. dxn· denote the k groups of n inverse functions each. Let Ii' i = 1, 2, ... , k, denote the k J acobians. Then (4) is the joint p.d.f. of Yv ... , Yn' Then display (1) becomes display (2) with IJlep(w1,... , wn) replaced by display (4). Hence our result is valid if the transformation is not one-to-one. It seems evident that we can treat the discrete case in an analogous manner with the same result. It should be noted that the expectation, subject to its existence, of any function of Y1 can be computed in like manner. That is, if W(Y1) is a function of Yv then Distributions oj Functions oj Random Variables [eb.4o 1M The change-of-variable procedure has been seen, in certain cases, to be an effective method of finding the distribution of a function of several random variables. An alternative procedure, built around the concept of the moment-generating function of a distribution, will be presented in this section. This procedure is particularly effective in certain instances. We should recall that a moment-generating function, when it exists, is unique and that it uniquely determines the distribution of a probability. Let ep(xv x2,. . ., xn) denote the joint p.d.f. of the n random vari- ables Xv X 2 , · •• , X n . These random variables mayor may not be the items of a random sample from some distribution that has a given p.d.f. f(x). Let Y1 = U1(X1, X 2,... , X n). We seek g(Y1)' the p.d.f. of the random variable Y 1 . Consider the moment-generating function of Y 1 • If it exists, it is given by (1) J ~00 ••• J:00 exp [tU1(xv· .. , xn)]ep(xv ... , xn) dX1... dxn, which we assume to exist for - h < t < h. We shall introduce n new variables of integration. They are Y1 = u1(XV X2' .. " xn),... , Yn = un(xv X2,. . ., xn). Momentarily, we assume that these functions define a one-to-one transformation. Let Xi = Wt(Y1' Y2' .. " Yn), i = 1,2, ... , n, denote the inverse functions and let I denote the Jacobian. 
Under this transformation, display (1) becomes number of subprojects. For illustration, let Xl> X 2 , X a be three stochastically independent random times for three subprojects. If these subprojects are in series (the first one must be completed before the second starts, etc.), then we are interested in the sum Y = Xl + X 2 + X a. If these are in parallel (can be worked on simultaneously), then we are interested in Z = max (Xl> X 2 , X a). In the case each of these random variables has the uniform distribution with p.d.f. f(x) = 1, 0 < x < 1, zero elsewhere, find (a) the p.d.f. of Yand (b) the p.d.f. of Z.
  • 89. 166 Distributions of Functions of Random Variables [eh.4 Sec. 4.7] The Moment-Generating-Function Technique 167 = 0 elsewhere; that is, the p.d.f. of Xl isf(xl) and that of X 2 isf(x2) ; and so the joint p.d.f. of Xl and X 2 is We shall now give some examples and prove some theorems where we use the moment-generating-function technique. In the first example, to emphasize the nature of the problem, we find the distribution of a rather simple statistic both by a direct probabilistic argument and by the moment-generating-function technique. Example 1. Let the stochastically independent random variables Xl and X 2 have the same p.d.f. Thus This form of M(t) tells us immediately that the p.d.f. g(y) of Y is zero except at y = 2, 3, 4, 5, 6, and that g(y) assumes the values :1-6, 3 46' H, ~~, l6' respectively, at these points where g(y) > O. This is, of course, the same result that was obtained in the first solution. There appears here to be little, if any, preference for one solution over the other. But in more complicated situations, and particularly with random variables of the continuous type, the moment-generating-function technique can prove very powerful. Example 2. Let Xl and X 2 be stochastically independent with normal distributions n(flol> an and n(flo2' a~), respectively. Define the random variable Y by Y = Xl - X 2 • The problem is to find g(y), the p.d.f. of Y. This will be done by first finding the moment-generating function of Y. It is M(t) = E(et(X1 -X2» ) = E(etXle-tx2) = E(etXl)E(e-tx2), since Xl and X 2 are stochastically independent. It is known that since Xl and X 2 are stochastically independent. In this example Xl and X 2 have the same distribution, so they have the same moment-generating function; that is, x = 1,2,3, Xl = 1, 2, 3, X2 = 1,2,3, x f(x) = (;' = 0 elsewhere. A probability, such as Pr (Xl = 2, X 2 = 3), can be seen immediately to be (2)(3)/36 = l However, consider a probability such as Pr (Xl + X 2 = 3). The computation can be made by first observing that the event X I + X 2 = 3 is the union, exclusive of the events with probability zero, of the two mutually exclusive events (Xl = 1, X 2 = 2) and (Xl = 2, X 2 = 1). Thus Pr (Xl + X 2 = 3) = Pr (Xl = 1, X 2 = 2) + Pr (Xl = 2, X 2 = 1) (1)(2) (2)(1) 4 =~+~=36' More generally, let y represent any of the numbers 2,3,4,5,6. The probability of each of the events Xl + X 2 = y, Y = 2,3,4,5,6, can be computed as in the case y = 3. Let g(y) = Pr (Xl + X 2 = y). Then the table and that g(y) -16 3 46 ~~ ~~ l6 gives the values of g(y) for y = 2,3,4,5,6. For all other values of y, g(y) = O. What we have actually done is to define a new random variable Y by Y = Xl + X 2 , and we have found the p.d.f. g(y) of this random variable Y. We shall now solve the same problem, and by the moment-generating-function technique. Now the moment-generating function of Y is M(t) = E(et(x1 +X2» ) = E(etXletX2) = E(etxl)E(etX2), Finally, then, for all real t. Then E(e-tX2) can be obtained from E(etX2) by replacing t by -ct. That is, ( (a~ + a~jt2) = exp (flol - flo2)t + 2 . 2 3 456 y
  • 90. Now 169 t < t. t < t, i = 1,2, ... , n, Sec. 4.7] The Moment-Generating-Function Technique has a chi-square distribution with n degrees of freedom. Not always do we sample from a distribution of one random vari- able. Let the random variables X and Y have the joint p.d.f. f(x, y) and let the 2n random variables (Xl' Y 1) , (X 2 , Y 2), ••• , (Xn, Y n) have the joint p.d.f. But this is the moment-generating function of a distribution that is x2(rl + r2 + ... + rn). Accordingly, Y has this chi-square distribution. Next, let Xl> X 2 , • • • , X; be a random sample of size n from a distribution that is n(fL, a2). In accordance with Theorem 2 of Section 3.4, each of the random variables (X, - fL)2/a2, i = 1, 2, ... , n, is X2(1). Moreover, these n random variables are mutually stochastically inde- pendent. Accordingly, by Theorem 2, the random variable Y = n L [(X, - fL)/aJ2 is X2(n). This proves the following theorem. 1 M(t) = E{exp [t(XI + X 2 + ... + X n)]} = E(etXl)E(etx2) ... E(etxn) because Xl> X 2, ••• , X n are mutually stochastically independent. Since If, in Theorem 1, we set each kt = 1, we see that the sum of n mutually stochastically independent normally distributed variables has a normal distribution. The next theorem proves a similar result for chi-square variables. Theorem 2. Let Xl> X 2' .•• , X n bemutually stochasticallyindepen- dent variables that have, respectively, the chi-square distributions X2(r1) , X2(r2), .. " and X2(rn)' Then the random variable Y = XI + X2 + ... + X n has a chi-square distribution with r1 + ... + rn degrees of freedom; that is, Y is X2(r1 + ... + rn). Proof. The moment-generating function of Y is we have Theorem 3. Let Xl> X 2,... , X; denote a random sample of size n from a distribution that is n(fL, a2). The random variable Distributions oj Functions oj Random Variables [eb. 4 That is, the moment-generating function of Y is n [ (k 2a2)t2] M(t) = [lexp (k,fLI)t + 1 2 1 [ n (~k~a~)t2] = exp (fk1fL1)t + 2 . But this is the moment-generating function of a distribution that is n(~ k1fLp ~ k~a~). This is the desired result. The following theorem, which is a generalization of Example 2, is very important in distribution theory. Theorem 1. Let Xl> X 2,. . ., K; be mutually stochastically indepen- dent random variables having, respectively, the normal distributions nI H a2) n(ll. a2) and n(ll. a2). The random variable Y = klXl + JA'1' l' r2' 2' ••• , rn' n . k 2X2 + ... + knXn, where kl , k2,. . ., kn are real constants, ss normally distributed with mean klfLl + ... + knfLn and variance krar + ... + k~a~. That is, Y is n(~ ktfL" ~ k~a~). Proof. Because x; X 2,. . ., x, an: mutual~y ~tochastically inde- pendent, the moment-generating function of Y IS glVen by M(t) = E{exp [t(k1X1 + k2X2 + ... + knXn)]} = E(etklXl)E(etk2X2) ... E(etknXn). E(etXt) = exp (fL,t + a~2), for all real t, i = 1,2, ... , n. Hence we have 168 The distribution of Y is completely determined by its moment-gene:ati~g function M(t), and it is seen that Y has the p.d.f. g(y), whl~h IS ( 2 + 2) That is the difference between two stochastIcally n fLl - fL2' Ul U2 . , . ' . independent, normally distributed, random vanables IS Itself. a random variable which is normally distributed with mean equal to the difference of the means (in the order indicated) and the variance equal to the sum of the variances.
  • 91. Distributions of Functions of Random Variables [eh.4 n real constants, has the moment-generating function M(t) = n M,(k,t). 1 (b) If each k; = 1 and if Xi is Poisson with mean /Li' i = 1, 2, ... , n, prove that Y is Poisson with mean /Ll + ... + /Ln' 4.71. Let the stochastically independent random variables Xl and X2 have binomial distributions with parameters nl , PI = t and n2' P2 = t, respectively. Show that Y = Xl - X 2 + n2has a binomial distribution with parameters n = nl + n2, P = 1-- 171 EXERCISES Sec. 4.7] The Moment-Generating-Function Technique 4.70. Let Xl and X 2 be stochastically independent random variables. Let Xl and Y = Xl + X 2have chi-square distributions with rl and r degrees of freedom, respectively. Here rl < r. Show that X2 has a chi-square distribu- tion with r - rl degrees of freedom. Hint. Write M(t) = E(e!(X1 +X2» ) and make use of the stochastic independence of Xl and X 2. 4.68. Let the stochastically independent random variables Xl and X 2 have the same p.d.f. f(x) = 1" x = 1, 2, 3, 4, 5, 6, zero elsewhere. Find the p.d.I. of Y = Xl + X 2 • Note, under appropriate assumptions, that Y may be interpreted as the sum of the spots that appear when two dice are cast. 4.69. Let Xl and X 2be stochastically independent with normal distribu- tions n(6, 1) and n(7, 1), respectively. Find Pr (Xl> X 2). Hint. Write Pr (Xl> X 2) = Pr (Xl - X 2 > 0) and determine the distribution of Xl - X 2· 4.72. Let X be n(O, 1). Use the moment-generating-function technique to show that Y = X2 is x2 (l ). Hint. Evaluate the integral that represents E(e!X2) by writing w = xVI - 2t, t < t. 4.73. Let Xl> X 2 , ••. , X; denote n mutually stochastically independent: random variables with the moment-generating functions MI(t), M 2(t), ... , M n (t), respectively. (a) Show that Y = klXI + k2X2 + ... + knXn, where kl> k2 , ••• , kn are ( tl i Xi t2 i Yi) = f:oo" -f~oo exp -;-- + -;-- cp dx l· .. dYn = 0If_0000 f_0000 exp r~i + t~i)f(Xi' Yi) dx, dYi]· The justification of the form of the right-hand member of the second equality is that each pair (Xi' Y i ) has the same p.d.f., and th~t these n pairs are mutually stochastically independent. The two~old integral in the brackets in the last equality is the moment-generatmg function of Xi and Y, (see Section 3.5) with tl replaced by tl/n and t2 replaced by t2 /n. Accordingly, cp = f(x v Yl)f(x2, Y2) ... f(xn, Yn), the moment-generating function of the two means X and Y is given by 170 The n random pairs (Xv Y l ) , (X2, Y 2), · · · , (Xn, Y n) are then mutually stochastically independent and are said to constitute a random sample of size n from the distribution of X and Y. In the next paragraph we shall take f(x, y) to be the normal bivariate p.d.f., .and we sha~l solve a problem in sampling theory when we are samplmg from this two- variable distribution. Let (Xl' Yl), (X 2 , Y 2) , •• ·, (Xn,. Yn! d~note .a random sample of size n from a bivariate normal distribution with p.d.f. f(x, y) and t rs I/. u2 u2 and p We wish to find the joint p.d.f. of the parame e ttl' ,...2' l' 2' . two statistics X = i Xt!n and Y = i Ydn . We call X the mean of 1 1 Xl' ... , x, an~ Y the mean of. Yv ... , Yn' Sin~e t.he jo:t p.d.f. of the 2n random vanables (Xi' YJ, ~ = 1,2, ... , n, IS glVen y TI n ltlttl tzttz M(tl , t2) = i=l exp -;; + -;; ui(tl/n)Z + 2PUlu2(tl/n){tz/n) + u~(tz/n)Z] + 2 l (ui/n)ti + 2p(uluZ/n)tltz + (u~/n)t~]. 
= exp tlttl + tzttz + 2 But this is the moment-generating function of a bivariate normal distribution with means ttl and tt2' variances ui/n and u~/n, and corre- lation coefficient p; therefore, X and Y have this joint distribution. 4.74. If Xl' X 2 , ••• , X; is a random sample from a distribution with moment-generating function M(t), show that the moment-generating func- n n tions of L: X, and L: X;/n are, respectively, [M(t)Jn and [M(tjnW. I 1 4.75. In Exercise 4.67 concerning PERT, find: (a) the p.d.f. of Y; (b) the p.d.f. of Z in case each of the three stochastically independent variables has the p.d.f. f(x) = e- x , 0 < x < 00, zero elsewhere. 4.76. If X and Y have a bivariate normal distribution with parameters t-«. /Lz, ar, a~, and p, show that Z aX + bY + c is n(a/LI + b/L2 + c, a2ar + 2abpalaZ + b2a~), where a, b, and c are constants. Hint. Use the
  • 92. 172 Distributions of Functions of Random Variables [Ch.4 Sec. 4.8] The Distributions of X and nS2Ja2 173 can be written has Jacobian n. Since distributions of the mean and the variance of this random sample, that is, the distributions of the two statistics X = ±X,jn and 52 = n 1 L (X, - X)2jn. 1 The problem of the distribution of X, the mean of the sample, is solved by the use of Theorem 1 of Section 4.7. We have here, in the notation of the statement of that theorem, 1-'1 = 1-'2 = ... = I-'n = 1-', a~ = ~~ = ... = a; = a2 , and k1 = k2 = ... = kn = ljn. Accordingly, Y = X has a normal distribution with mean and variance given by n = L(XI - X)2 + n(x - 1-')2 1 Xl = nYl - Y2 - .•. - Yn X 2 = Y2 ( l)n [L (x, - X)2 n(x - 1-')2] • /2rra exp - , ·v 2a2 2a2 respectively. That is, X is n(I-" a2jn). Example 1. Let X be the mean of a random sample o~ size 25 from a distribution that is n(75, 100). Thus X IS n(75, 4). Then, for instance, Pr (71 < X < 79) = NC9 ; 75) _ NC1 ; 75) = N(2) - N(-2) = 0.954. We now take up the problem of the distribution of 52, the variance of a random sample X1> ... , Xn from a distribution that is n(l-', a2). To do this, let us first consider the joint distribution of Y1 = X, Y2 = X 2 , · · . , Y n = Xn- The corresponding transformation n because 2(x - II.) "'" (Xi - x) = 0, th " t d f f X X X r: L.., e JOIn p... 0 1> 2,"" n 1 moment-generating function M(tl> t2) of X and Y to find the moment- generating function of Z. 4.77. Let X and Y have a bivariate normal distribution with parameters 1-'1 = 25, 1-'2 = 35, a~ = 4, a~ = 16, and p = H. 1£ Z = 3X - 2Y, find Pr (- 2 < Z < 19). 4.78. Let U and V be stochastically independent random variables, each having a normal distribution with mean zero and variance 1. Show that the moment-generating function E(et(uV») of the product UV is (1 - t2 )- l/2, -1 < t < 1. Hint. Compare E(etuV ) with the integral of a bivariate normal p.d.f. that has means equal to zero. 4.79. Let X and Y have a bivariate normal distribution with the param- eters I-'l> 1-'2' at a~, and p. Show that W = X - 1-'1 and Z = (Y - 1-'2) - p(a2/a1)(X - 1-'1) are stochastically independent normal variables. 4.80. Let Xl> X 2 , X3 be a random sample of size n = 3 from the normal distribution n(O, 1). (a) Show that Y1 = Xl + SX3 , Y2 = X2 + SX3 has a bivariate normal distribution. (b) Find the value of S so that the correlation coefficient p = 1-. (c) What additional transformation involving Y1 and Y 2 would produce a bivariate normal distribution with means 1-'1 and 1-'2' variances a~ and a~, and the same correlation coefficient p? 4.81. Let Xl> X 2 , ••• , X n be a random sample of size n from the normal n distribution n(I-" a2 ) . Find the joint distribution of Y = L ajXj and n 1 Z = L bjXj, where the a, and b, are real constants. When, and only when, 1 are Y and Z stochastically independent? Hint. Note that the joint moment- generating function E [exp (t1 ~ «x, + t2 ~ b,Xj ) ] is that of a bivariate normal distribution. 4.82. Let Xl> X 2 be a random sample of size 2 from a distribution with positive variance and moment-generating function M(t). 1£ Y = Xl + X2 and Z = Xl - X 2 are stochastically independent, prove that the distribution from which the sample is taken is a normal distribution. Hint. Show that m(t1, t2) = E{exp [t1(X 1 + X 2) + t2(X 1 - X 2)]} = M(t1 + t2)M(t1 - t2). Express each member of m(tl>t2) = m(tl>O)m(O, t2 ) in terms of M; differentiate twice with respect to t2 ; set t2 = 0; and solve the resulting differential equation in M. 
4.8 The Distributions of Xand nS2 jo2 Let X1> X 2 , • • • , X; denote a random sample of size n ~ 2 from a distribution that is n(p., a2 ) . In this section we shall investigate the
  • 93. Distributions of Functions of Random Variables [eh.4 EXERCISES 4.83. Let X be the mean of a random sample of size 5 from a normal distribution with fL = 0 and a 2 = 125. Determme c so that Pr (X < c) = 0.90. 175 t < t. That is, the conditional distribution of n52ja2, given X = x, is X2(n - 1). Moreover, since it is clear that this conditional distribution does not depend upon x, X and n52ja2 must be stochastically independent or, equivalently, X and 52 are stochastically independent. To summarize, we have established, in this section, three important properties of X and 52 when the sample arises from a distribution which is n(fJ-, a2): (a) X is n(fJ-, a2jn). (b) n52ja2 is X2(n - 1). (c) X and 52 are stochastically independent. Determination of the p.d.f. of 52 is left as an exercise. where °< 1 - 2t, or t < l However, this latter integral is exactly the same as that of the conditional p.d.f. of Y2, Y3, ••. , Yn' given Y1 = Y1' with a 2 replaced by a 2 j(1 - 2t) > 0, and thus must equal 1. Hence the conditional moment-generating function of n52ja2 , given Y1 = Y1 or equivalently X = x, is Sec. 4.8] The Distributions of X and nS2ja2 ( l)n (nYl - Y2 - ... - Yn - Y1)2 n --=- exp - 2a2 v27Ta n )2 ~ (y, - Y1 _ n(Yl - fJ-)2), 2a2 2a2 . 1,2, ... , n. The quotient of this joint p.d.f. and the -00 < Yl < 00,1, = p.d.f. vn [n(Yl - fJ-)2} -==- exp - 2 2 V27Ta a of Y1 = X is the conditional p.d.f. of Y2' Y3' ... , Yn' given Y1 = Yl> vn(V~7Tar-1 exp ( - 2~2)' ) 2 + ~ (y _ y)2 Since this is a where q = (nYl - Y2 - ... - Yn - Yl '§" l ' joint conditional p.d.f., it must be, for all a > 0, that foo .. .foo vn(~ r:exp ( - 2~2) dY2' .. dYn = 1. _ 00 - 00 V27Ta 174 ( + + x )jn and -00 < x, < 00, 1, = where x represents Xl + X2 . . . n .. 1,2, ... , n. Accordingly, with Y1 = x, we find that the [oint p.d.f. of Yl' Y2' ..• , Yn is Now consider n52 = ~ (X, - X)2 1 n )2 Q =(nYI-Y2-"'-Yn-Y1)2+~(Y'-Yl =. . f ti f 52j 2 - Qja2 given The conditional moment-generatmg unc IOn 0 n a - , Y1=Yl>is E(etQ /<r2 IY1) = Joo .. .foo vn(_1-r-1 exp [- (1 ~a;t)q) dY2" .dYn _ 00 - 00 VZ:;a 4.84. If X is the mean of a random sample of size n from a normal distribution with mean fL and variance 100, find n so that Pr (fL - 5 < X < fL + 5) = 0.954. 4.85. Let Xl' X 2 , .•• , X 2 5 and Yl> Y 2 , ... , Y 2 5 be two random samples from two independent normal distributions n(O, 16) and n(1, 9), respectively. Let X and Y denote the corresponding sample means. Compute Pr (X > Y). 4.86. Find the mean and variance of 52 = i (Xl - X)2/n, where 1 Xl' X 2 , ••• , X n is a random sample from n(fL, a2 ) . Hint. Find the mean and variance of n5 2/a2 • ( _ 1_ ) < n- 1)/2f OO •• • foo vn[1 2:a;tX n- 1)/2 1-2t - 0 0 - 0 0 [ (1 - 2t)q ) x exp - 2a2 dY2' .. dYn' 4.87. Let 52 be the variance of a random sample of size 6 from the normal distribution n(fL, 12). Find Pr (2.30 < 52 < 22.2). 4.88. Find the p.d.f. of the sample variance 52, provided that the distri- bution from which the sample arises is n(fL, a2 ) .
  • 94. 4.89. Let X and S2 be the mean and the variance of a random sample of size 25 from a distribution which is n(3, 100). Evaluate Pr (0 < X < 6, 55.2 < S2 < 145.6). 177 i < j. Sec. 4.9] Expectations of Functions of Random Variables The variance of Y is given by a~ = E{[(k1X1 + ... + knXn) - (k11-'1 + ... + knl-'n)J2} = E{[k1(X1 - 1-'1) + ... + kn(Xn - I-'n)J2} = E{t1k~(Xi - l-'i)2 + 2 .. t t kjkj(Xi - l-'i)(Xj - I-'j)} n = i~1 ktE[(Xi - l-'i)2J + 2 f<t kikjE[(Xi - l-'i)(Xj - I-'j)J. ~onsider E[(Xi - l-'i)(Xj - I-'j)J, i < j. Because Xi and X, are stochastically independent, we have E[(Xi - l-'i)(Xj - I-'j)J = E(Xi - l-'i)E(Xj - I-'j) = O. Finally, then, We can o~tain a more general result if, in Example 2, we remove the hypothesIs. of mutual stochastic independence of X 1, X 2 , ••• , X n . We shall do this and we shall let Pij denote the correlation coefficient of Xi and Xj. Thus for easy reference to Example 2, we write If we refer to Example 2, we see that again I-'y = i kifJ-i. But now 1 Thus we have the following theorem. Theorem 4. Let Xl>.'" X; denote random variables that have means fJ-1' ... , t-« and variances a~, ... , a~. Let Pij, i # j, denote the COr- relation coefficient of Xi and X, and let kl> .. " kn denote real constants. The mean and the variance of the linear function rr[(r - 2)/2J r 2[(r - 2)/2Jr[(r - 2)/2J = r - 2' Distributions of Functions of Random Variables [Ch.4 Example 1. Given that W is n(O, 1), that V is X2(r) with r ;::,.: 2, and let Wand V be stochastically independent. The mean of the random variable T = Wvr/V exists and is zero because the graph of the p.d.f. of T (see Section 4.4) is symmetric about the vertical axis through t = O. The variance of T, when it exists, could be computed by integrating the product of t2 and the p.d.f. of T. But it seems much simpler to compute Now W2 is X2(1), so E(W2) = 1. Furthermore, E(r)_fro r 1 r/2-1 -v/2 d V - 0 v2r/2f(r/2) v e v exists if r > 2 and is given by rr[(r - 2)/2J 2f(r/2) Thus a~ = r/(r - 2), r > 2. 4.9 Expectations of Functions of Random Variables Let Xl> X 2 , ••• , Xn denote random variables that have the joint p.d.f. f(xl> X 2, • • • , xn). Let the random variable Y be defined by Y = u(X1, X 2 , ••• , X n).We found in Section 4.7 that we could compute expectations of functions of Y without first finding the p.d.f. of Y. Indeed, this fact was the basis of the moment-generating-function procedure for finding the p.d.f. of Y. We can take advantage of this fact in a number of other instances. Some illustrative examples will be given. 176 Example 2. Let Xi denote a random variable with mean ILi and variance at, i = 1, 2, ... , n. Let X1> X2, ••• , Xn be mutually stochastically inde- pendent and let kv k2, ... , kn denote real constants. We shall compute the mean and variance of the linear function Y = k1X1 + k2X2 + ... + knXn. Because E is a linear operator, the mean of Y is given by I-'y = E(k1X1 + k2X2 + ... + knXn) = k1E(X1) + k2E(X2) + ... + knE(Xn) n = k1IL1 + k21-'2 + ... + knl-'n = 2: kil-'i· 1 are, respectively, n fJ-y = 2: kifJ-i 1 and
  • 95. 178 Distributions of Functions of Random Variables [Ch.4 Sec. 4.9] Expectations of Functions of Random Variables 179 the variance (~k~)0'2. The following corollary of this theorem is quite useful. Corollary. Let Xl"'" X; denote the items of a random sample of size n from a distribution that has mean fL and variance 0'2. The mean and n . (n) d 2 of Y = f«x, are, respectwely, fLy = fk, flo an o' y = Example 3. Let X = i X,ln denote the mean of a random sample of 1 size n from a distribution that has mean flo and variance aZ • In accordance with the Corollary, we have flox = flo ~ (lIn) = flo and a~ = aZ~ (lln)Z = a2/n. We have seen, in Section 4.8, that if our sample is from a distribution that is n(flo, 0'2), then X is n(flo, aZln). It is interesting that flox = flo and a~ = aZIn whether the sample is or is not from a normal distribution. EXERCISES 4 90 Let X X X X be four mutually stochastically independent • • 1, 2' 3, 4 random variables having the same p.d.f. f(x) = 2x, 0 < x < 1, zero else- where. Find the mean and variance of the sum Y of these four random variables. 4.91. Let Xl and Xz be two stochastically independent random variables so that the variances of Xl and X 2are ar = k and a~ = 2, respectively. Given that the variance of Y = 3X2 - Xl is 25, find k. 4.92. If the stochastically independent variables Xl and Xz have means flov floz and variances ar, a~, respectively, show that the mean and variance of the product Y = XlXZ are flolflo2 and ara~ + flora~ + flo~ar, respectively. 4.93. Find the mean and variance of the sum Y of the items of a random sample of size 5 from the distribution having p.d.f. f(x) = 6x(1 - x), o< x < 1, zero elsewhere. 4.94. Determine the mean and variance of the mean X of a random sample of size 9 from a distribution having p.d f. f(x) = 4xs, 0 < x < 1, zero elsewhere. 4.95. Let X and Y be random variables with flol = 1, floz = 4, ar = 4, a~ = 6, p = t. Find the mean and variance of Z = 3X - 2Y. 4.96. Let X and Y be stochastically independent random variables with means flol' fJ-z and variances ar, a~. Determine the correlation coefficient of X and Z = X - Y in terms of fJ-v floz, ar, a~. 4.97. Let flo and aZ denote the mean and variance of the random variable X. Let Y = c + bX, where band c are real constants. Show that the mean and the variance of Yare, respectively, C + bflo and b2a2 • 4.98. Let X and Y be random variables with means flov floz; variances ar, a~; and correlation coefficient p. Show that the correlation coefficient of W = aX + b, a > 0, and Z = cY + d, c > 0, is p. 4.99. A person rolls a die, tosses a coin, and draws a card from an ordinary deck. He receives $3 for each point up on the die, $10 for a head, $0 for a tail, and $1 for each spot on the card (jack = 11, queen = 12, king = 13). If we assume that the three random variables involved are mutually stochastically independent and uniformly distributed, compute the mean and variance of the amount to be received. 4.100. Let U and V be two stochastically independent chi-square variables with rl and rz degrees of freedom, respectively. Find the mean and variance of F = (rzU)/(rlV). What restriction is needed on the parameters rl and rz in order to ensure the existence of both the mean and the variance of F? 4.101. Let Xv X z, .. " X; be a random sample of size n from a distribu- tion with mean flo and variance aZ• Show that £(5Z) = (n - l)aZ ln, where 52 n is the variance of the random sample. Hint. Write 5z = (lin) L: (X, - flo)Z - 1 (X - flo)z. 4.102. 
Let Xl and Xz be stochastically independent random variables with nonzero variances. Find the correlation coefficient of Y = XlXZ and Xl in terms of the means and variances of Xl and X z. 4.103. Let Xl and Xz have a joint distribution with parameters flol' floz, at, a~, and p. Find the correlation coefficient of the linear functions Y = alXl + azXz and Z = blXl + bzXz in terms of the real constants aI, az, bv bz, and the parameters of the distribution. 4.104. Let Xv X 2 , •• " Xn be a random sample of size n from a distri- bution which has mean flo and variance 0'2. Use Chebyshev's inequality to show, for every E > 0, that lim Pr(iX - floI < E) = 1; this is another form n-oo of the law of large numbers. 4.105. Let Xl' X z, and Xs be random variables with equal variances but with correlation coefficients P12 = 0.3, PIS = 0.5, and pzs = 0.2. Find the correlation coefficient of the linear functions Y = Xl + Xz and Z = X 2 + X s· 4.106. Find the variance of the sum of 10 random variables if each has variance 5 and if each pair has correlation coefficient 0.5. 4.107. Let Xl"'" Xn be random variables that have means flov"" flon and variances ar, , a~. Let P'J' ~ =1= j, denote the correlation coefficient of X, and XJ' Let av , an and bI> .. " bn be real constants. Show that the n n n n covariance of Y = L: a,X, and Z = L: s.x, is L: L: a,bp,aJP'J' where '=1 r e t 1=1 1=1 P« = 1, i = 1, 2, ... , n.
  • 96. 180 Distributions of Functions of Random Variables [eb.4 4.108. Let Xl and X 2 have a bivariate normal distribution with param- eters jLl> jL2' ut u~, and p. Compute the means, the variances, and the cor- relation coefficient of Yl = exp (Xl) and Y2 = exp (X2). Hint. Various moments of Yl and Y2 can be found by assigning appropriate values to tl and t2 in E[exp (tlXl + t2X 2)]. 4.109. Let X be n(jL, u2) and consider the transformation X = In Y or, equivalently, Y = eX. (a) Find the mean and the variance of Y by first determining E(eX ) and E[(eX)2]. (b) Find the p.d.f. of Y. This is called the lognormal distribution. 4.110. Let Xl and X 2 have a trinomial distribution with parameters n, Pl> P2' (a) What is the distribution of Y = Xl + X2? (b) From the equality u~ = u~ + u~ + 2PUlU2' once again determine the correlation coefficient p of Xl and X 2• 4.111. Let Yl = Xl + X 2 and Y2 = X 2 + X a, where Xl> X 2, and Xa are three stochastically independent random variables. Find the joint moment-generating function and the correlation coefficient of Y1 and Y2 provided that: (a) Xi has a Poisson distribution with mean jLi' i = 1, 2, 3. (b) Xi is n(jLi' u~), i = 1, 2, 3. Chapter 5 Limiting Distributions 5.1 Limiting Distributions In some of the preceding chapters it has been demonstrated by example that the distribution of a random variable (perhaps a statistic) often depends upon a positive integer n. For example, if the random variable X is b(n, P), the distribution of X depends upon n. If X is the mean of a random sample of size n from a distribution that is n(/L' a2 ) , then X is itself n(/L, a2/n) and the distribution of X depends upon n. If S2is the variance of this random sample from the normal distribution to which we have just referred, the random variable nS2/a2 is x2 (n - 1), and so the distribution of this random variable depends upon n. We know from experience that the determination of the p.d.f. of a random variable can, upon occasion, present rather formidable com- putational difficulties. For example, if X is the mean of a random sample Xl> X 2 , ••• , X n from a distribution that has the p.d.f. f(x) = 1, o < x < 1, = 0 elsewhere, then (Exercise 4.74) the moment-generating function of X is given by [M(t/n)]n, where here I I et - 1 M(t) = etxdx = --, o t = 1, t = O. 181 t =F 0,
  • 97. 182 Limiting Distributions [eh.5 Sec. 5.1] Limiting Distributions 183 os Y < 0, 0< y < 0, -00 < Y < 0, 8 s y < 00. -00 < Y < 8, e ::; y < 00, = 1, = 1, F(y) = 0, The p.d.f. of Y n is = 0 elsewhere, and the distribution function of Yn is Fn(Y) = 0, Y < 0, = JlI nzn-l = ('¥.)n on dz 8' o = 1, 8 ::; Y < 00. Then Now t =I- 0, Hence ( et/n_ 1)n tin ' 1, t = O. Since the moment-generating function of X depends upon n, the distribution of X depends upon n. It is true that various mathematical techniques can be used to determine the p.d.f. of X for a fixed, but arbitrarily fixed, positive integer n. But the p.d.f. is so complicated that few, if any, of us would be interested in using it to compute probabilities about X. One of the purposes of this chapter is to provide ways of approximating, for large values of n, some of these complicated probability density functions. Consider a distribution that depends upon the positive integer n. Clearly, the distribution function F of that distribution will also depend upon n. Throughout this chapter, we denote this fact by writing the distribution function as Fn and the corresponding p.d.f. as in. Moreover, to emphasize the fact that we are working with sequences of distribution functions, we place a subscript n on the ran- dom variables. For example, we shall write x 1 Fn(x) = J e- nw2/2 dw -00 vrrnVZ; for the distribution function of the mean Xn of a random sample of size n from a normal distribution with mean zero and variance 1. We now define a limiting distribution of a random variable whose distribution depends upon n. Definition 1. Let the distribution function F n(Y) of the random variable Yn depend upon n, a positive integer. If F(y) is a distribution function and if lim Fn(Y) = F(y) for every point Y at which F(y) is is a distribution function. Moreover, lim Fn(y) = F(y) at each point of n-co continuity of F(y). In accordance with the definition of a limiting distribu- tion, the random variable Yn has a limiting distribution with distribution function F(y). Recall that a distribution of the discrete type which has a probability of 1 at a single point has been called a degenerate distribution. Thus in this example the limiting distribution of Yn is degenerate. Some- times this is the case, sometimes a limiting distribution is not degenerate, and sometimes there is no limiting distribution at all. Example 2. Let X n have the distribution function n-> 00 = 0 elsewhere. continuous, then the random variable Y n is said to have a limiting distribution with distribution function F(y). x = 0, x> O. x < 0, = -t, = 1, lim F n(x) = 0, n_co It is clear that If the change of variable v = vnw is made, we have o < x < 0, 0 < 8 < 00, 1 f(x) = 0' The following examples are illustrative of random variables that have limiting distributions. Example 1. Let Yn denote the nth order statistic of a random sample Xl> X z, ... , X; from a distribution having p.d.f.
  • 98. 184 Now the function F(x) = 0, Limiting Distributions [eh. 5 x < 0, Sec. 5.1] Limiting Distributions and the distribution function of Zn is z < 0, 185 Hence 0:::;; z < nO, 0< z < co. 0:::;; z, z < 0, G(z) = 0, 5 z (0 - w/n)n-l (Z)n = dw = 1 - 1 - - o r nO ' = 1, nO:::;; z. Now 1 x = 2 + -. n = 1, x;::: 0, is a distribution function and lim Fn(x) = F(x) at every point of continuity of F(x). To be sure, lim Fn(O)# F(O). but F(x) is not continuous at x = O. Accordingly, the rand~;variable Xn has a limiti~g d~stribution with distribu- tion function F(x). Again, this limiting distributlOn IS degenerate and has all the probability at the one point x = O. Example 3. The fact that limiting distributions, if ~hey exist: cannot in general be determined by taking the limit of the p.d.f. WIll now be Illustrated. Let X; have the p.d.f. EXERCISES 5.1. Let Xn denote the mean of a random sample of size n from a distribu- tion that is n(JL, 0'2). Find the limiting distribution of Xn. 5.2. Let YI denote the first order statistic of a random sample of size n from a distribution that has the p.d.f. f(x) = e-(X-O), 0 < x < co, zero elsewhere. Let Zn = n(YI - 0). Investigate the limiting distribution of Zn. 5.3. Let Yn denote the nth order statistic of a random sample from a distribution of the continuous type that has distribution function F(x) and p.d.f. f(x) = F'(x). Find the limiting distribution of Zn = n[1 - F(Yn)J. 5.4. Let Y2 denote the second order statistic of a random sample of size n from a distribution of the continuous type that has distribution function F(x) and p.d.f.f(x) = F'(x). Find the limiting distribution of W n = nF(Y2 ) . 5.5. Let the p.d.f. of Yn be fn(Y) = 1, Y = n, zero elsewhere. Show that Yn does not have a limiting distribution. (In this case, the probability has "escaped" to infinity.) 5.6. Let Xl, X 2 , •• • , X; be a random sample of size n from a distribution n which is n(JL, 0'2), where JL > O. Show that the sum Z; = L Xi does not have I a limiting distribution. is a distribution function that is everywhere continuous and lim Gn(z) = n ....00 G(z) at all points. Thus Zn has a limiting distribution with distribution function G(z). This affords us an example of a limiting distribution that is not degenerate. 0< z < nO, x:::;; 2, x> 2. x < 2, = 1, = 1, F(x) = 0, n ....00 (0 - z/n)n-l hn(z) = on ' = 0 elsewhere, Since and = 0 elsewhere. Clearly, lim fn(x) = 0 for all values of x. This may suggest that X n has no limiting dis~ribution. However, the distribution function of X; is 1 F () o x < 2 + -, n X = , n 1 x> 2 +-, - n = 1, x;::: 2, is a distribution function, and since lim Fn(x) = F(x) at all points of con- n ....00 tinuity of F(x), there is a limiting distribution of X n with distribution function F(x). Example 4. Let Yn denote the nth order statistic of a random sample from the uniform distribution of Example 1. Let Zn = n(O - Y n)·The p.d.f. of Zn is
  • 99. 186 Limiting Distributions [eh.5 Sec. 5.2] Stochastic Convergence 187 Because °s Fn(Y) s 1 for all values of Y and for every positive integer n, it must be that Since this is true for every E > 0, we have as we were required to show. To complete the proof of Theorem 1, we assume that lim Fn[(c + E) - J n-+ 00 1. Y < C, Y > C, Y < C, = 1, lim Fn(Y) = 0, n-+ 00 lim F n(Y) = 0, n-+ 00 lim Fn(c - E) = 0, n-+ 00 5.2 Stochastic Convergence When the limiting distribution of a random variable is degenerate, the random variable is said to converge stochastically to the constant that has a probability of 1. Thus Examples 1 to 3 of Section 5.1 illustrate not only the notion of a limiting distribution but also the concept of stochastic convergence. In Example 1, the nth order statistic Yn converges stochastically to 8; in Example 2, the statistic Xn converges stochastically to zero, the mean of the normal distribution from which the sample was taken; and in Example 3, the random variable X n converges stochastically to 2. We shall show that in some instances the inequality of Chebyshev can be used to advantage in proving stochastic convergence. But first we shall prove the following theorem. lim Fn(c - E) = 0, n-+ 00 lim Fn[(c + E) - J = 1, n-+ 00 Y > c. = 1, We are to prove that lim Pr (/Y n - c] < ~) _- 1 f ~ or every E > O. n-+ 00 Because Pr (IYn - c] < E) = Fn[(c + E)-J - Fn(c - E), and because it is given that Theorem 1. Let Fn(Y) denote the distribution function of a random variable Yn whose distribution depends upon the positive integer n. Let c denote a constant which does not depend upon n. The random variable Yn converges stochastically to the constant c if and only if, for every E > 0, the lim Pr(Yn - c] < E) = 1. n-+ 00 Proof. First, assume that the lim Pr (I Yn - cl < E) = 1 for every n-+ 00 E > 0. We are to prove that the random variable Yn converges stochasti- cally to the constant c. This means we must prove that n-+ 00 = 1, Y > c. Note that we do not need to know anything about the lim Fn(c). For n-+ 00 if the limit of Fn(Y) is as indicated, then Yn has a limiting distribution with distribution function F(y) = 0, Y < C, Y < C, for every E > 0, we have the desired result. This completes the proof of the theorem. We should like to point out a simple but useful fact. Clearly, Pr (iYn - c] < E) + Pr (I Yn - c] ~ E) = 1. Thus the limit of Pr (I Yn - cl < E) is equal to 1 when and only when lim Pr(Yn - cl ~ E) = 0. n-+ 00 1 = lim Pr(IYn - cl < E) = lim Fn[(c + E)-J - lim Fn(c - E). n-+oo n-+oo n-+oo Pr (!Yn - cl < E) = Fn[(c + E)- J - Fn(c - E), where Fn[(c + E) - Jis the left-hand limit of Fn(y) at Y = c + E. Thus we have Now = 1, Y ~ c. That is, :his last limit is also a necessary and sufficient condition for the stochastic convergence of the random variable Yn to the constant c. .Ex~mp'le 1. Let X n denote the mean of a random sample of size n from a d~stnbutlO~ that has mean fL and positive variance a2. Then the mean and variance of X; are fL and a2jn. Consider, for every fixed E > 0, the probability Pr (/Xn - fLl ~ E) = Pr (/Xn - fL/ ~ ~~),
  • 100. EXERCISES 2 lim Pr (Xn - fLl ~ €) ::; lim ~ = 0. n-+oo n_ CO n€ where k = €vnla. In accordance with the inequality of Chebyshev, this probability is less than or equal to 1/k2 = a2In€2. So, for every fixed € > 0, we have Hence x, converges stochastically to fL if a2 is finite. In a more advanced course, the student will learn that fL finite is sufficient to ensure this stochastic convergence. Remark. The condition lim Pr (IYn - c] < €) = 1 is often used as n--> 00 lim [1 + ~ + lj;(n)]cn, n-+co n n lim (1 t 2 t3 ) -n/2 . ( t 2 t3jvn)-n/2 --+372 =hm 1--+-- . n-+ co n n n-+ co n n For example, where band c do not depend upon n and where lim lj;(n) = O. Then n-+ co Here b = _t2, C = -t, and lj;(n) = t3 jvn. Accordingly, for every fixed value of t, the limit is et 2 /2 • Example~. Let Y, have a distribution that is b(n,Pl. Suppose that the mean fL = np IS the same for every n; that is, p = fLln, where fL is a constant. . [ b lj;(n)]cn (b)cn lim 1 + - + -- = lim 1 + - = ebc • n-+co n n n-+co n Sec. 5.3] Limiting Moment-Generating Functions 189 the p~oblem we should like to avoid. If it exists, the moment-generating func:lOn that corresponds to the distribution function Fn(Y) often provides a convenient method of determining the limiting distribution function. To emphasize that the distribution of a random variable Y depends upon the positive integer n, in this chapter we shall write the moment-generating function of Y, in the form M(t; n). The following theorem, which is essentially Curtiss' modification of a the~rem of Levy an~ Cramer, explains how the moment-generating function may be used In problems of limiting distributions. A proof of the t?eorem requires a knowledge of that same facet of analysis that pe~mItte~ us to assert that a moment-generating function, when it exists, umquely determines a distribution. Accordingly, no proof of the theorem will be given. Theorem 2. Let the random variable Ynhave the distribution function F n(y) and the moment-generating function M (t; n) that exists for -h < t <.h for all n. If there exists a distribution function F(y), with correspond~.ng moment-generating function M(t), defined for ItI s hI < h, such that lim M(t; n) = M(t), then Y n has a limiting distribution with n-->00 distribution function F (y). In this and the subsequent section are several illustration of the use of Theorem 2. In some of these examples it is convenient to use a certain limit that is established in some courses in advanced calculus. We refer to a limit of the form Limiting Distributions [Ch, 5 the definition of convergence in probability and one says that Y n converges to c in probability. Thus stochastic convergence and convergence in prob- ability are equivalent. A stronger type of convergence is given by Pr (lim Yn = c) = 1; in this case we say that Yn converges to c with n--> 00 probability 1. Although we do not consider this type of convergence, it is known that the mean Xn of a random sample converges with probability 1 to the mean fL of the distribution, provided that the latter exists. This is one form of the strong law of large numbers. 5.9. Let Wn denote a random variable with mean fL and variance bjn", where p > 0, fL, and b are constants (not functions of n). Prove that Wn converges stochastically to fL. Hint. Use Chebyshev's inequality. 5.10. Let Yn denote the nth order statistic of a random sample of size n from a uniform distribution on the interval (0, 0), as in Example 1 of Section 5.1. Prove that Zn = YY;; converges stochastically to YO. 
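The Chebyshev argument of Example 1 can be illustrated numerically. The following is a small simulation sketch, not part of the original text; the exponential(1) parent distribution, the value of ε, and the sample sizes are arbitrary assumptions made only for illustration. It shows Pr(|X̄n − μ| ≥ ε) shrinking toward zero while staying below the Chebyshev bound σ²/(nε²).

```python
# Illustrative sketch (not from the text): stochastic convergence of the
# sample mean to mu, with the Chebyshev bound sigma^2 / (n * eps^2).
# The exponential(1) parent (mu = 1, sigma^2 = 1), eps, and the sample
# sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(1978)
mu, sigma2, eps, reps = 1.0, 1.0, 0.25, 20_000

for n in (25, 100, 400):
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    p_hat = np.mean(np.abs(xbar - mu) >= eps)
    bound = sigma2 / (n * eps**2)
    print(f"n = {n:4d}   Pr(|Xbar - mu| >= eps) ~ {p_hat:.4f}   bound {bound:.4f}")
```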
5.3 Limiting Moment-Generating Functions To find the limiting distribution function of a random variable Yn by use of the definition of limiting distribution function obviously requires that we know Fn(Y) for each positive integer n. But, as indicated in the introductory remarks of Section 5.1, this is precisely 5.7. Let the random variable Y, have a distribution that is b(n, Pl· (a) Prove that Ynln converges stochastically to p. This result is one form of the weak law of large numbers. (b) Prove that 1 - Ynln converges stochastic- ally to 1 - p. 5.8. Let S~ denote the variance of a random sample of size n from a distribution that is n(fL, a2). Prove that nS~/(n - 1) converges stochastically to a2 • 188
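The limit proposition quoted for Section 5.3, lim [1 + b/n + ψ(n)/n]^(cn) = e^(bc) when ψ(n) → 0, can be checked numerically. The short sketch below is not from the text; it uses the text's own illustration b = −t², c = −1/2, ψ(n) = t³/√n, for which the limit is e^(t²/2), at one arbitrarily chosen value of t.

```python
# Sketch (not from the text): numerical check of
#   lim [1 + b/n + psi(n)/n]^(c n) = e^(b c)   when psi(n) -> 0,
# with b = -t^2, c = -1/2, psi(n) = t^3/sqrt(n), so the limit is e^(t^2/2).
import math

t = 1.3                                   # arbitrary fixed value of t
b, c = -t**2, -0.5
target = math.exp(t**2 / 2)

for n in (10, 100, 1_000, 10_000, 100_000):
    psi = t**3 / math.sqrt(n)
    value = (1 + b / n + psi / n) ** (c * n)
    print(f"n = {n:6d}   value = {value:.6f}   target = {target:.6f}")
```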
  • 101. n-+ 00 M(t; n) = E{exp [t(Z~~nn)]} 191 Sec. 5.3] Limiting Moment-Generating Functions Since g(n) -»- 0 as n -»- 00, then lim ~(n) = 0 for every fixed value of t. In accordance with the limit proposition cited earlier in this section, we have lim M(t; n) = et2/2 n-+ 00 for.al~ ~eal values of t. That is, the random variable v, = (Zn - n)/VZn has a limiting normal distribution with mean zero and variance 1. M(t; n) = (1 _~ + ~~)) -n/2, where ~(n) If this sum is substituted for et./2 /n in the last expression for M(t; n), it is seen that 5.11.. Let X n hav~ a gamma distribution with parameter a = nand (3, where filS not a function of n. Let Y n = Xnln. Find the limiting distribution of v; 5.12. Let Z; be x2(n) and let Wn = Znln2. Find the limiting distribution of Wn • 5.13. Let X be X2 (50). Approximate Pr (40 < X < 60). . 5.14. Let p = 0.95 be the probability that a man, in a certain age group, lives at least 5 years. (a) If w.e. are to observe 60 such men and if we assume independence, find the probability that at least 56 of them live 5 or more years. (b) Find an approximation to the result of part (a) by using the Poisson distribution. Hint. Redefine p to be 0.05 and 1 - P = 0.95. 5.15. Let the random variable Zn have a Poisson distribution with parameter fL = n. Show that the limiting distribution of the random variable Y n = (Zn - n)/V1i is normal with mean zero and variance 1. 5.16. Let S~ denote the variance of a random sample of size n from a distribu~ion that is n(fL, a2). It has been proved that nS~/(n - 1) converges stochastically to a2 . Prove that S~ converges stochastically to a2 • 5.17. Let X n and Y n have a bivariate normal distribution with parameters fLv fL2' ar, a~ (free of n) but p = 1 - lin. Consider the conditional distribu- tion of Yn, given X; = x. Investigate the limit of this conditional distribution as n -»- 00. What is the limiting distribution if p = - 1 + lin? Reference to these facts was made in the Remark, Section Z.3. EXERCISES t < VZn. Z -: Limiting Distributions [eh.5 e- 2 + Ze-2 = 0.406. This may be written in the form ( _ !2 -) -n/2 M(t; n) = et./2/n - t,J 1i et./2/n , In accordance with Taylor's formula, there exists a number g(n), between 0 and tVZln, such that (2 1 ( (2) 2 e~(n) ( (2)3 et./2/n = 1 + t,J n+:2 t,J n + 6 t,J 1i . approximately. Since fL = np = Z, the Poisson approximation to this prob- ability is Example 2. Let z; be x2 (n). Then the moment-generating function of z; is (1 _ Zt) -n/2, t < -t. The mean and the variance of Zn are, respectively, n and Zn.The limiting distribution of the random variable Y, = (Zn - n)/VZn will be investigated. Now the moment-generating function of Yn is for all real values of t. Since there exists a distribution, namely the Poisson distribution with mean fL, that has this moment-generating function e~(et-l), then in accordance with the theorem and under the conditions stated, it is seen that Yn has a limiting Poisson distribution with mean fL· Whenever a random variable has a limiting distribution, we may, if we wish, use the limiting distribution as an approximation to the exact distri- bution function. The result of this example enables us to use the Poisson distribution as an approximation to the binomial distribution when n is large and p is small. This is clearly an advantage, for it is easy to provide tables for the one-parameter Poisson distribution. On the other hand, the binomial distribution has two parameters, and tables for this distribution are very ungainly. 
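The comparison made in this example is easy to reproduce. Below is a small computational sketch (not part of the text): for Y distributed b(50, 1/25), the exact value of Pr(Y ≤ 1) is about 0.400, while the Poisson approximation with μ = np = 2 gives e^(−2) + 2e^(−2) ≈ 0.406.

```python
# Sketch (not from the text): Poisson approximation to the binomial,
# reproducing the comparison 0.400 (exact) versus 0.406 (Poisson).
import math

n, p = 50, 1 / 25
mu = n * p                                              # mu = np = 2

exact = sum(math.comb(n, y) * p**y * (1 - p) ** (n - y) for y in (0, 1))
approx = math.exp(-mu) * (1 + mu)                       # e^(-mu) + mu e^(-mu)

print(f"exact binomial Pr(Y <= 1)  = {exact:.3f}")      # about 0.400
print(f"Poisson approximation      = {approx:.3f}")     # about 0.406
```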
To illustrate the use of the approximation, let Y have a binomial distribution with n = 50 and p = 1/25. Then Pr (Y ≤ 1) = (24/25)^50 + 50(1/25)(24/25)^49 = 0.400, approximately. We shall find the limiting distribution of the binomial distribution, when p = μ/n, by finding the limit of M(t; n). Now M(t; n) = E(e^(tYn)) = [(1 − p) + pe^t]^n = [1 + μ(e^t − 1)/n]^n for all real values of t. Hence we have lim M(t; n) = e^(μ(e^t − 1))
  • 102. 5.4 The Central Limit Theorem It was seen (Section 4.8) that, if Xl> X 2 , • • • , X n is a random sample from a normal distribution with mean iL and variance a2 , the random variable 5.18. Let s;denote the mean of a random sample of size n from a Poisson distribution with parameter iL = l. (a) Show that the moment-generating function of Yn = Vn(Xn - p.)/a = Vn(Xn - 1) is given by exp [-tvn + n(et/·./n- 1)]. (b) Investigate the limiting distribution of Y n as n -+ 00. Hint. Replace, by its Macl.aurin's series, the expression etl ";" , which is in the exponent of the moment-generating function of Yn' 5.19. Let Xn denote the mean of a random sample of size n from a distribution that has p.d.f. f(x) = e- x , 0 < x < 00, zero elsewhere. (a) Show that the moment-generating function M(t; n) of Y, = vn(Xn - 1) is equal to [et/";" - (t/vn)et/.J1iJ -n, t < vn. (b) Find the limiting distribution of Yn as n -+ 00. This exercise and the immediately preceding one are special instances of an important theorem that will be proved in the next section. t -h < -- < h. aYri = E[exp (t X;;n iL) J...E[exp (t X;;n iL)J = {E[exp (tSf)Jf = [mC~dr, Sec. 5.4] The Central Limit Theorem 193 Theorem 3. LetX X X d t th . , ' . 1, 2,"" n eno e e items of a random sample from a d~stnbutlOn that has mean iL and positive variance a2• Then the random variable Yn = (iXi - niL)'/vna = vn(X - )/ h 1" 1 n iL a as a imit- ing distribution that is normal with mean zero and variance 1. Proof In the modification of the proof, we assume the existence of t~e ~on:ent-generating function M(t) = E(etX ) , -h < t < h, of the dIst~Ibuhon.H~wever, this proof is essentially the same one that would be given for thIS. theorem in a more advanced course by replacing the mo~ent-generatmg function by the characteristic function (t) E~~. r The function m(t) = E[et(x-/l)J = e-/ltM(t) also exists for - h < t < h Since m(t) is the t . f . . momen -generatmg ~nctlOn fo~, X - iL, it must follow that m(O) = 1, m'(O) = E(X - iL) - 0, and m (0) = E[(X - 11.)2J - 2 By T I 'f I h r: - a . ay or s ormu a t ere exists a number t between 0 and t such that m(t) = m(O) + m'(O)t + m"(t)t 2 2 m"(t)t2 = 1 + . 2 If a 2 t2 /2is added and subtracted, then m(t) = 1 + a 2 t 2 + [m"(t) - a2Jt2 2 2 Next consider M(t; n), where Limiting Distributions [eb.5 n f Xi - niL vn(Xn - iL) avn a is, for every positive integer n, normally distributed with zero mean and unit variance. In probability theory there is a very elegant theorem called the central limit theorem. A special case of this theorem asserts the remarkable and important fact that if Xl' X 2 , •• " X n denote the items of a random sample of size n from any distribution having positive variance a2 (and hence finite mean iL), then the random variable vn(Xn - iLl/a has a limiting normal distribution with zero mean and unit variance. If this fact can be established, it will imply, whenever the conditions of the theorem are satisfied, that (for fixed n) the random variable vn(X - iLl/a has an approximate normal distribution with mean zero and variance 1. It will then be possible to use this approxi- mate normal distribution to compute approximate probabilities con- cering X. The more general form of the theorem is stated, but it is proved only in the modified case. However, this is exactly the proof of the theorem that would be given if we could use the characteristic function in place of the moment-generating function. 192
  • 103. = 0 elsewhere. lim [m"(g) - a2 ] = O. n--+ OO Since m"(t) is continuous at t = 0 and since g-+ 0 as n -+ 00, we have 195 Sec. 5.4] The Central Limit Theorem since M(t) exists for all real values of t. Moreover, JL = ! and a2 = -h, so we have approximately Pr (0.45 < X < 0.55) = Pr [Vn(0.45 - JL) < vn(X - JL) < vn(0.55 - JL)] a a a = Pr [- 1.5 < 30(X - 0.5) < 1.5J = 0.866, from Table III in Appendix B. Example 2. Let Xl' X 2 , ••• , X; denote a random sample from a distribution that is b(l, P). Here JL = p, a2 = P(l - P), and M(t) exists for all real values of t. If Yn = Xl + ... + X n , it is known that Y, is b(n,p). Calculation of probabilities concerning Yn, when we do not use the Poisson approximation, can be greatly simplified by making use of the fact that (Yn - np)/vnp(l - P) = vn(Xn - P)/vP(l - P) = vn(Xn - JL)/a has a limiting distribution that is normal with mean zero and variance 1. Let n = 100 and p = -t, and suppose that we wish to compute Pr (Y = 48,49, 50,51,52). Since Y is a random variable of the discrete type, the events Y = 48, 49, 50, 51, 52 and 47.5 < Y < 52.5 are equivalent. That is, Pr (Y = 48, 49,50,51,52) = Pr (47.5 < Y < 52.5). Since np = 50 and np(l - P) = 25, the latter probability may be written Pr (47.5 < Y < 52.5) = Pr (47.5 5- 50 < Y ~ 50 < 52.55- 50) = Pr (-0.5 < Y ~ 50 < 0.5). Since (Y - 50)/5 has an approximate normal distribution with mean zero and variance 1, Table III shows this probability to be approximately 0.382. The convention of selecting the event 47.5 < Y < 52.5, instead of, say, 47.8 < Y < 52.3, as the event equivalent to the event Y = 48,49,50,51, 52 seems to have originated in the following manner: The probability, Pr (Y = 48,49,50,51,52), can be interpreted as the sum of five rectangular areas where the rectangles have bases 1 but the heights are, respectively, Pr (Y = 48), .. " Pr (Y = 52). If these rectangles are so located that the midpoints of their bases are, respectively, at the points 48, 49, ... , 52 on a horizontal axis, then in approximating the sum of these areas by an area bounded by the horizontal axis, the graph of a normal p.d.f., and two ordinates, it seems reasonable to take the two ordinates at the points 47.5 and 52.5. Limiting Distributions [eh.5 o < x < 1, f(x) = 1, We interpret this theorem as saying, with n a fixed positive integer, that the random variable vn(X - fL)ja has an approximate normal distribution with mean zero and variance 1; and in applications we use the approximate normal p.d.f. as though it were the exact p.d.f. of v'n(X - JL)ja. Some illustrative examples, here and later, will help show the importance of this version of the central limit theorem. Example 1. Let X denote the mean of a random sample of size 75 from the distribution that has the p.d.f. { t2 [m"(g) - a2 ]t 2 }n . M(t; n) = 1 + -2 + 2 2 n no for all real values of t. This proves that the random variable Yn = vn(X n - JL)ja has a limiting normal distribution with mean zero and variance 1. The limit proposition cited in Section 5.3 shows that lim M(t; n) = et 2 / 2 n--+ 00 where now g is between 0 and tjavn with - havn < t < havn. Accordingly, mC~) = 194 In m(t), replace t by tjav'n to obtain It was stated in Section 5.1 that the exact p.d.f. of X, say g(x), is rather complicated. It can be shown that g(x) has a grap~ at points of p~sitive probability density that is composed of arcs of 75 different polynomIals of degree 74. The computation of such a probability as Pr (0.45 < X <.0.55) would be extremely laborious. 
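The continuity-correction computation of Example 2 can also be verified directly. The sketch below is not from the text; it compares the exact probability Pr(Y = 48, 49, ..., 52) for Y that is b(100, 1/2) with the normal approximation Pr(−0.5 < (Y − 50)/5 < 0.5), which the text reports as approximately 0.382.

```python
# Sketch (not from the text): normal approximation with continuity correction
# for Y ~ b(100, 1/2); the approximate value reported in the text is 0.382.
import math

def phi(z):
    # Standard normal distribution function, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p = 100, 0.5
mean, sd = n * p, math.sqrt(n * p * (1 - p))            # 50 and 5

exact = sum(math.comb(n, y) * 0.5**n for y in range(48, 53))
approx = phi((52.5 - mean) / sd) - phi((47.5 - mean) / sd)

print(f"exact Pr(48 <= Y <= 52)  = {exact:.3f}")
print(f"normal approximation     = {approx:.3f}")
```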
The conditions of the theorem are satisfied. EXERCISES 5.20. Let X̄ denote the mean of a random sample of size 100 from a distribution that is χ²(50). Compute an approximate value of Pr (49 < X̄ < 51).
  • 104. 5.23. Compute an approximate probability that the mean of a random sample of size 15 from a distribution having p.d.f. J(x) = 3x2, 0 < X < 1, zero elsewhere, is between t and t. 5.21. Let X denote the mean of a random sample of size 128 from a gamma distribution with a = 2 and f3 = 4. Approximate Pr (7 < X < 9). 5.22. Let Y be b(72, j-). Approximate Pr (22 s Y :s; 28). 5.24. Let Y denote the sum of the items of a random sample of size 12 from a distribution having p.d.f.J(x) = i, x = 1,2,3,4,5,6, zero elsewhere. Compute an approximate value of Pr (36 :s; Y :s; 48). Hint. Since the event of interest is Y = 36, 37, ... , 48, rewrite the probability as Pr (35.5 < Y < 48.5). 5.25. Let Y be b(400, !). Compute an approximate value of Pr (0.25 < Yin). 5.26. If Y is b(100, f), approximate the value of Pr (Y = 50). 5.27. Let Y be b(n, 0.55). Find the smallest value of n so that (approxi- mately) Pr (Yin> f) ~ 0.95. 5.28. Let J(x) = l/x2, 1 < x < 00, zero elsewhere, be the p.d.f. of a random variable X. Consider a random sample of size 72 from the distri- bution having this p.d.f. Compute approximately the probability that more than 50 of the items of the random sample are less than 3. Pr (lUn - cl ~ E) = Pr [1(vUn - VC)(v'V: + VC)I ~ E] = Pr (Is/U, - vcl ~ E ) VtJ;. + VC ~ Pr (IVUn - vcl ~ :e) ~ o. If we let' Ivc d if h E = E c, an 1 we take the limit, as n becomes infinite we ave ' Sec. 5.5] Some Theorems on Limiting Distributions 197 !heorem 5. Let Fn(u) denote the distribution function of a random ~arzable ti, whose distri~ution depends upon the positive integer n. Further ~ o, converge stochasltcally to the positive constant c and let Pr (Un < 0) - 0for every n. The random variable VtJ;. convergesstochastically to VC. Proof. We are given that the lim Pr (/U; - cl ~ E) = 0 for every n-+<XJ E > O. We are to prove that the lim Pr (/VtJ;. - vcl ~ E') = 0 for I n-+ 00 every E > O. Now the probability o= !~~ Pr (jUn - cl ~ E) ~ lim Pr (jVtJ;. - Vel ~ E') = 0 n-+<XJ Limiting Distributions [Ch, 5 196 5.29. Forty-eight measurements are recorded to several decimal places. Each of these 48 numbers is rounded off to the nearest integer. The sum of the original 48 numbers is approximated by the sum of these integers. If we assume that the errors made by rounding off are stochastically independent and have uniform distributions over the interval (-f, f), compute approxi- mately the probability that the sum of the integers is within 2 units of the true sum. 5.5 Some Theorems on Limiting Distributions In this section we shall present some theorems that can often be used to simplify the study of certain limiting distributions. Theorem 4. Let Fn(u) denote the distribution function of a random variable U'; whose distribution depends upon the positive integer n. Let U'; converge stochastically to the constant c #- O. The random variable Un/c converges stochastically to 1. The proof of this theorem is very easy and is left as an exercise. for every E' > O. This completes the proof. T?e conclusions of Theorems 4 and 5 are very natural ones and they certainly appeal to 0 . t iti Th . . ur III U1 IOn. ere are many other theor f ~~lS flavor in pr.obability theory. As exercises, it is to be shown ~~:t~f t . e random vanables U'; and Vn converge stochastically to the respec- rve constants c and d the U V , n n n converges stochastically to the constant cd and U IV c t h . . ' n n onverges s oc ashcally to the constant cid fPrlo lvl~ed that d #- O. However, we shall accept, without proof th~ o owmg theorem. 
Theorem 6. Let Fn(u) denote the distribution function of a random variable Un whose distribution depends upon the positive integer n. Let Un have a limiting distribution with distribution function F(u). Let Hn(v) denote the distribution function of a random variable Vn whose distribution depends upon the positive integer n. Let Vn converge stochastically to 1. The limiting distribution of the random variable Wn = Un/Vn is the same as that of Un; that is, Wn has a limiting distribution with distribution function F(w).
  • 105. 198 Limiting Distributions [Ch.5 Sec. 5.5] Some Theorems on Limiting Distributions 199 Example 1. Let Yn denote a random variable that is b(n, P), °< P < 1. We know that u _ Yn - np n - Vnp(l-p) has a limiting distribution that is n(O, 1). Moreover, it has been proved that Yn/n and 1 - Yn/n converge stochastically to p and 1 - p, respectively; thus (Yn/n)(1 - Yn/n) converges stochastically to P(1 - Pl. Then, by Theorem 4, (Yn/n)(1 - Y n/n)/[p(1 - P)] converges stochastically to 1, and Theorem 5 asserts that the following does also: V = [(Yn/n)(1 - Yn/n)] 1/2. n P(1 _ P) Thus, in accordance with Theorem 6, the ratio Wn = Un/Vn, namely Y, - np Vn(Yn/n)(1 - Yn/n) , has a limiting distribution that is n(O, 1). This fact enables us to write (with n a fixed positive integer) [ Y - np ] Pr -2 < < 2 = 0.954, Vn(Y/n)(1 - Yin) approximately. Example 2. Let X n and S~ denote, respectively, the mean and the variance of a random sample of size n from a distribution that is n(JL, u2), u2 > 0. It has been proved that s; converges stochastically to JL and that S~ converges stochastically to u2 • Theorem 5 asserts that Sn converges stochastic- ally to a and Theorem 4 tells us that Si]« converges stochastically to 1. In accordance with Theorem 6, the random variable Wn = uXn/Snhas the same limiting distribution as does x; That is, uXn/Snconverges stochastically to JL. EXERCISES 5.30. Prove Theorem 4. Hint. Note that Pr (!Un/e - 11 < E) = Pr (!Un - c] < Elel), for every E > 0. Then take E' = €Iel. 5.31. Let X; denote the mean of a random sample of size n from a gamma distribution with parameters ex = JL > °and f:3 = 1. Show that the limiting distribution of vn(Xn - JL)/VX n is n(O, 1). 5.32. Let T'; = (Xn - JL)/VS~/(n - 1), where X n and S~ represent, respectively, the mean and the variance of a random sample of size n from a distribution that is n(JL' u2). Prove that the limiting distribution of Tn is n(O, 1). 5.33. Let Xl' ... ' X n and YI , ... , Yn be the items of two independent random samples, each of size n, from the distributions that have the respective means JLI and JL2 and the common variance u2. Find the limiting distribution of (Xn - Yn) - (JLI - JL2) uV2/n ' where x, and Yn are the respective means of the samples. Hint. Let Zn = n L Zt/n, where Z, = X, - Y,. I 5.34. Let U'; and Vn converge stochastically to e and d, respectively. Prove the following. (a) The sum U'; + Vnconverges stochastically to e + d. Hint. Show that Pr (!Un + Vn - e - dl ~ E) :$ Pr (!Un - c] + IVn - dl ~ E) :$ Pr (IU'; - c] ~ E/2or IVn - dl ~ E/2) s Pr (IUn - cl ~ E/2) + Pr (IVn - dl ~ E/2). (b) The product U';Vn converges stochastically to cd. (c) If d i= 0, the ratio Un/Vn converges stochastically to cjd. 5.35. Let U'; converge stochastically to c. If h(u) is a continuous function at u = c, prove that h(Un) converges stochastically to h(c). Hint. For each E > 0, there exists a 0 > °such that Pr [h(Un) - h(c)l < E] ~ Pr [!Un - c] < 0]. Why?
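Example 1 above can be illustrated by simulation. The sketch below is not from the text; the values of n, p, and the number of replications are arbitrary assumptions. It checks that (Yn − np)/√(n(Yn/n)(1 − Yn/n)) falls in (−2, 2) with probability close to 0.954, as the limiting n(0, 1) distribution indicates.

```python
# Sketch (not from the text): the studentized binomial ratio
#   (Y - np) / sqrt(n (Y/n)(1 - Y/n))
# is approximately n(0, 1), so it lies in (-2, 2) about 95.4% of the time.
import numpy as np

rng = np.random.default_rng(7)
n, p, reps = 400, 0.3, 100_000                  # arbitrary choices

y = rng.binomial(n, p, size=reps)
phat = y / n
ok = (phat > 0) & (phat < 1)                    # the ratio is undefined otherwise
ratio = (y[ok] - n * p) / np.sqrt(n * phat[ok] * (1 - phat[ok]))

print(f"Pr(-2 < ratio < 2) ~ {np.mean(np.abs(ratio) < 2):.3f}   (compare 0.954)")
```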
  • 106. Example 1. Let Xl> X 2 , ••• , X; denote a random sample from the distribution with p.d.f. of times and it is found that Xl = xb X 2 = X2"'" X; = X n, we shall refer to xb x2 , ••• , xn as the experimental values of X b X 2 , ••• , X n or as the sample data. We shall use the terminology of the two preceding paragraphs, and in this section we shall give some examples of statistical inference. These examples will be built around the notion of a point estimate of an unknown parameter in a p.d.f. Let a random variable X have a p.d.f. that is of known functional form but in which the p.d.f. depends upon an unknown parameter () that may have any value in a set Q. This will be denoted by writing the p.d.f. in the formj(x; ()), () E Q. The set Q will be called the parameter space. Thus we are confronted, not with one distribution of prob- ability, but with a family of distributions. To each value of (), () E .0, there corresponds one member of the family. A family of probability density functions will be denoted by the symbol {j(x; ()); () ED}. Any member of this family of probability density functions will be denoted by the symbol j(x; ()), () E D. We shall continue to use the special symbols that have been adopted for the normal, the chi-square, and the binomial distributions. We may, for instance, have the family {n((), 1); () ED}, where Q is the set -00 < () < 00. One member of this family of distributions is the distribution that is n(O, 1). Any arbitrary member is n((), 1), -00 < () < 00. Consider a family of probability density functions {j(x; ()); () ED}. It may be that the experimenter needs to select precisely one member of the family as being the p.d.f. of his random variable. That is, he needs a point estimate of (). Let Xl> X 2 , ••• , X n denote a random sample from a distribution that has a p.d.f. which is one member (but which member we do not know) of the family {j(x; ()); () E Q} of prob- ability density functions. That is, our sample arises from a distribution that has the p.d.f. j(x; ()); () E Q. Our problem is that of defining a statistic YI = ul (Xl> X 2' •. " X n), so that if Xl> x2 , ••• , xn are the observed experimental values of Xl> X 2 , ••• , X n, then the number YI = uI(Xl> x2 , ••• , xn) will be a good point estimate of (). The following illustration should help motivate one principle that is often used in finding point estimates. Chapter 6 Estimation 6.1 Point Estimation The first five chapters of this book deal with certain concepts and problems of probability theory. Throughout we have carefully dis- tinguished between a sample space '?l of outcomes and the space d of one or more random variables defined on '?l. With this chapter we begin a study of some problems in statistics and here we are more interested in the number (or numbers) by which an outcome is repre- sented than we are in the outcome itself. Accordingly, we shall adopt a frequently used convention. We shall refer to a random variable X as the outcome of a random experiment and we shall refer to the space of X as the sample space. Were it not so awkward, we would call X the numerical outcome. Once the experiment has been performed and it is found that X = X, we shall call X the experimental value of X for that performance of the experiment. This convenient terminology can be used to advantage in more general situations. To illustrate this, let a random experiment be repeated n independent times and under identical conditions. Then the random variables X l>X 2' ... 
, Xn (each of which assigns a numerical value to an outcome) constitute (Section 4.1) the items of a random sample. If we are more concerned with the numerical representations of the outcomes than with the outcomes themselves, it seems natural to refer to X1, X2, ..., Xn as the outcomes. And what more appropriate name can we give to the space of a random sample than the sample space? Once the experiment has been performed the indicated number
f(x; θ) = θ^x (1 − θ)^(1 − x), x = 0, 1, = 0 elsewhere,
  • 107. 202 Estimation [eh. 6 Sec. 6.1] Point Estimation 203 where 0 ::; e::; 1. The probability that Xl = XV X2 = x2 , ••• , X; = xn is the joint p.d.f. eX1(1 _ e)l-X1eX2(1 _ e)l-x2... eXn(1 _ e)l-Xn = e2;X1(1 _ e)n-2;xl , where Xl equals zero or 1, i = 1,2, ... , n. This probability, which is the joint p.d.f. of Xl' X 2 , ••• , X n, may be regarded as a function of 8 and, when so regarded, is denoted by L(e) and called the likelihood function.That is, L(e) = e2;x.(1 - e)n-2;x 1, 0 s e ::; 1. We might ask what value of 8 would maximize the probability L(8) of obtaining this particular observed sample Xv X 2, .. " X n . Certainly, this maximizing value of 8 would seemingly be a good estimate of 8 because it would provide the largest probability of this particular sample. However, since the likelihood function L(e) and its logarithm, In L(e), are maximized for the same value e, either L(e) or In L(e) can be used. Here In L(8) = (~Xl) In e+ (n - ~Xl) In (1 - 8); so we have din L(8) = 2: Xl _ n - 2: Xl = 0 d8 8 1 - 8 ' provided that 8 is not equal to zero or 1. This is equivalent to the equation (1 - e) ~ Xl = e(n - ~ X} n n whose solution for 8 is 2: xt/n. That 2: xt/n actually maximizes L(8) and In L( 8) 1 1 can be easily checked, even in the cases in which all of Xv X2' .•. , Xn equal " zero together or 1 together. That is, 2: x,ln is the value of ethat maximizes 1 L( e). The corresponding statistic, h=~~x. n L t, 1=1 is called the maximum likelihood estimator of e. The observed value of h, " namely 2: xJn, is called the maximum likelihood estimate of 8. For a simple 1 example, suppose that n = 3, and Xl = 1, X 2 = 0, X3 = 1, then L(e) = e2 (1 - 8) and the observed h= t is the maximum likelihood estimate of e. The principle of the method of maximum likelihood can now be formulated easily. Consider a random sample Xl' X 2, •.• , X n from a distribution having p.d.f. f(x; 8), 8 E Q. The joint p.d.f. of Xl' X 2 , •.• , X n iSf(x1 ; 8)f(x2 ; e) .. ·f(xn ; 8). This joint p.d.f. may be regarded as a function of e. When so regarded, it is called the likelihood function L of the random sample, and we write eE .0. Suppose that we can find a nontrivial function of Xl> x 2 , ••• , X", say u(xl> x2 , ••• , x,,), such that, when eis replaced by u(x1 , x2 , ••• , X n ), the likelihood function L is a maximum. That is, L[u(xl> x2 , ••• , x,,); Xl> x2 , · •• , x,,] is at least as great as L( 8; Xl' x2 , • • • , xn) for every 8 E .0. Then the statistic u(X1, X 2 , ••• , X,,) will be called a maximum likeli- hood estimator of eand will be denoted by the symbol B= u(Xl> X 2 , .. " X,,). We remark that in many instances there will be a unique maximum likelihood estimator Bof a parameter e, and often it may be obtained by the process of differentiation. Example 2. Let Xv X 2 , ••• , X; be a random sample from the normal distribution n(8, 1), -OCJ < e< OCJ. Here This function L can be maximized by setting the first derivative of L, with respect to e, equal to zero and solving the resulting equation for 8. We note, however, that each of the functions L and In L is a maximum for the same value 8. So it may be easier to solve dlnL(e;x1'X2" " , Xn) = ° de . For this example, dIn L(e; Xl, X2' ... , X,,) = ~ ( . _ 8) de L... x, . 1 If this derivative is equated to zero, the solution for 8 is u(x1 , X2' ... , X n ) = n n 2: xt/n. That 2: xJn actually maximizes L is easily shown. Thus the statistic 1 1 is the unique maximum likelihood estimator of the mean e. 
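As a numerical companion to Examples 1 and 2 (not part of the original text), the sketch below maximizes the Bernoulli log-likelihood ln L(θ) = (Σxᵢ) ln θ + (n − Σxᵢ) ln(1 − θ) over a fine grid and compares the maximizer with the closed-form estimate Σxᵢ/n; the data are made up for illustration.

```python
# Sketch (not from the text): maximize the Bernoulli log-likelihood on a grid
# and compare with the closed-form maximum likelihood estimate, the sample mean.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])    # made-up sample of 0's and 1's
n, s = x.size, x.sum()

theta = np.linspace(0.001, 0.999, 9_999)
log_lik = s * np.log(theta) + (n - s) * np.log(1 - theta)

print(f"grid maximizer        {theta[np.argmax(log_lik)]:.3f}")
print(f"closed form sum(x)/n  {s / n:.3f}")
```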
It is interesting to note that in both Examples 1 and 2, it is true that E(θ̂) = θ. That is, in each of these cases, the expected value of the estimator is equal to the corresponding parameter, which leads to the following definition.
  • 108. 204 Estimation [Ch, 6 Sec. 6.1] Point Estimation 205 = 0 elsewhere, Definition 1. Any statistic whose mathematical expectation is equal to a parameter °is called an unbiased estimator of the parameter O. Otherwise, the statistic is said to be biased. Example 3. Let 1 f(x; 0) = (j' o < x s 8, 0 < 8 < 00, Om) EO n, depend on m parameters. This joint p.d.f., when regarded as a function of (01) 0z, ... , Om) EO n, is called the likelihood function of the random variables. Those functions u1(x, y, ... , z), uz(x, y, ... , z), ... , um(x, y, . . . , z) that maximize this likelihood function with respect to 01> 0z,· .. , Om' respectively, define the maximum likelihood estimators &1 = u1(X, Y, ... , Z), &z = uz(X, Y, ... , Z), ... , &m = um(X, Y, ... , Z) and let Xl> X 2 , ••• , X; denote a random sample from this distribution. Note that we have taken 0 < X ::; 8 instead of 0 < x < 8 so as to avoid a discussion of supremum versus maximum. Here 0< X, s 8, of the m parameters. Example 4. Let Xl> X2, ..• , Xn denote a random sample from a distri- bution that is n(81 , ( 2), -00 < 81 < 00, 0 < 82 < 00. We shall find ~l and ~2' the maximum likelihood estimators of 81 and 82 , The logarithm of the likelihood function may be written in the form We observe that we may maximize by differentiation. We have (n - 1)82 n (n - 1)a2 n n aIn L f (x, - (1)2 n ae;: = 28~ - 282 ' aln L ae;- = Sometimes it is impossible to find maximum likelihood estimators in a convenient closed form and numerical methods must be used to maximize the likelihood function. For illustration, suppose that X1> X 2 , ••• , X n is a random sample from a gamma distribution with However, in Chapter 5 it has been shown that ~1 = X and ~2 = 52 converge stochastically to 81 and 82, respectively, and thus they are consistent esti- mators of 81 and 82 , If we equate these partial derivatives to zero and solve simultaneously the two equations thus obtained, the solutions for 81 and 82 are found to be n n 2: x,jn = x and 2: (x, - X)2/n = S2, respectively. It can be verified that these 1 1 solutions maximize L. Thus the maximum likelihood estimators of 81 = f1. and 82 = a2 are, respectively, the mean and the variance of the sample, namely ~1 = X and ~2 = 52. Whereas ~1 is an unbiased estimator of 81, the estimator ~2 = 52 is biased because Consistency is a desirable property of an estimator; and, in all cases of practical interest, maximum likelihood estimators are consistent. The preceding definitions and properties are easily generalized. Let X, Y, . . . , Z denote random variables that mayor may not be stochastically independent and that mayor may not be identically distributed. Let the joint p.d.f. g(x, y, ... , z; 01> 02' ... , Om), (01) 02' ... , While the maximum likelihood estimator &of °in Example 3 is a biased estimator, results in Chapter 5 show that the nth order statistic &= max (X,) = Yn converges stochastically to 0. Thus, in accordance with the following definition, we say that IJ = Yn is a consistent estimator of 0. Definition 2. Any statistic that converges stochastically to a parameter °is called a consistent estimator of that parameter 0. 1 [max (XtW and the unique maximum likelihood estimator ~ of 8 in this example is the nth order statistic max (X,), It can be shown that E[max (Xt) ] = n8/(n + 1). Thus, in this instance, the maximum likelihood estimator of the parameter 8 is biased. That is, the property of unbiasedness is not in general a property of a maximum likelihood estimator. 
which is an ever-decreasing function of θ. The maximum of such functions cannot be found by differentiation but by selecting θ as small as possible. Now θ ≥ each x_i; in particular, then, θ ≥ max (x_i). Thus L can be made no larger than
  • 109. 206 Estimation [Ch.6 Sec. 6.2] Measures of Quality of Estimators 207 parameters a = ()1 and f3 = ()2' where ()1 > 0, ()2 > O. It is difficult to maximize We say that these latter two statistics, 81 and 82, are respective esti- mators of (J1 and (J2 found by the method of moments. To generalize the discussion of the preceding paragraph, let Xl' X 2 , . . ., X n be a random sample of size n from a distribution with p.d.f. f(x; (J1' (J2' •.. , (Jr), ((J1" •• , (Jr) E n. The expectation E(Xk) is frequently called the kth moment of the distribution, k = 1,2, 3, .. " The sum L(8v (J2; Xv ••. , x n) = [r((J~)(Jglr(X1X2 ••. Xn)81 -1 exp ( - ~ xii(J2 ) with respect to (J1 and (J2' owing to the presence of the gamma function r((J1)' However, to obtain easily point estimates of (J1 and (J2' let us simply equate the first two moments of the distribution to the corre- sponding moments of the sample. This seems like a reasonable way in which to find estimators, since the empirical distribution Fn(x)converges stochastically to F(x), and hence corresponding moments should be about equal. Here in this illustration we have (a) j(x; 0) = OXe- 9/x!, X = 0, 1,2, ... , 0 ~ 0 < 00, zero elsewhere, where j(O; 0) = 1. (b) j(x; 0) = Ox9 - 1, 0 < x < 1,0 < 6 < 00, zero elsewhere. (c) j(x; 6) = (1/6)e- x / 9 , 0 < x < 00, 0 < 6 < 00, zero elsewhere. (d) j(x; 6) = !e- 1x - B1, -00 < x < 00, -00 < 6 < 00. (e) j(x; 6) = e>:», 6 ~ x < 00, -00 < 6 < 00, zero elsewhere. In each case find the maximum likelihood estimator ~ of 6. 6.2. Let Xl' X 2 , · · . , X; be a random sample from the distribution having p.d.f. j(x; 61, ( 2) = (1/62) r (X - 91l/B2, 61 ~ x < 00, -00 < 61 < 00, o < O 2 < 00, zero elsewhere. Find the maximum likelihood estimators of 61 and °2 , 6.3. Let Y1 < Y2 < ... < Yn be the order statistics of a random sample from a distribution with p.d.f.j(x; 6) = 1,6- ! ~ x ~ 6 + !, -00 < 6 < 00, zero elsewhere. Show that every statistic u(Xl> X 2 , ••• , X n ) such that Y n - 1- ~ u(Xl , X 2 , ••• , X n ) ~ Yl + 1- is a maximum likelihood estimator of O. In particular, (4Yl + 2Yn + 1)/6, (Y, + Y n)/2, and (2Yl + 4Yn - 1)/6 are three such statistics. Thus unique- ness is not in general a property of a maximum likelihood estimator. 6.4. Let Xl, X 2 , and X a have the multinomial distribution in which n = 25, k = 4, and the unknown probabilities are 61 , 62 , and 0a,respectively. Here we can,for convenience, let X, = 25 - Xl - X2 - Xa and 64 = 1 - 61 - 62 - 6a· If the observed values of the random variables are Xl = 4, X 2 = 11, and X a = 7, find the maximum likelihood estimates of °1, 62 , and ea' 6.5. The Pareto distribution is frequently used as a model in study of incomes and has the distribution function F(x; °1, ( 2) = 1 - (6dx)B2 , 01 ~ X, zero elsewhere, where 01 > 0 and 62 > O. If Xl' X 2 , ••. , X; is a random sample from this distribution, find the maxi- mum likelihood estimators of 61 and °2 , 6.6. Let Y n be a statistic such that lim E(Yn) = 0 and lim af n = O. n- 00 n-+ co Prove that Y, is a consistent estimator of 0. Hint. Pr (IYn - 6! ~ €) ~ E[(Yn - O)2J/€2 and E[(Yn - 8)2J = [E(Yn - OW + at. Why? 6.7. For each of the distributions in Exercise 6.1, find an estimator of 0 by the method of moments and show that it is consistent. and ()182 = X, the solutions of which are _ X2 81 = S2 n M k = 2: X~jn is the kth moment of the sample, k = 1,2,3, .. " The 1 method of moments can be described as follows. 
Equate E(X^k) to M_k, beginning with k = 1 and continuing until there are enough equations to provide unique solutions for θ1, θ2, ..., θr, say h_i(M1, M2, ...), i = 1, 2, ..., r, respectively. It should be noted that this could be done in an equivalent manner by equating μ = E(X) to X̄ and E[(X − μ)^k] to Σ (X_i − X̄)^k/n, k = 2, 3, ..., and so on until unique solutions for θ1, θ2, ..., θr are obtained. This alternative procedure was used in the preceding illustration. In most practical cases, the estimator θ̂_i = h_i(M1, M2, ...) of θ_i, found by the method of moments, is a consistent estimator of θ_i, i = 1, 2, ..., r.
EXERCISES 6.1. Let X1, X2, ..., Xn represent a random sample from each of the distributions having the following probability density functions:
6.2 Measures of Quality of Estimators Now it would seem that if y = u(x1, x2, ..., xn) is to qualify as a good point estimate of θ, there should be a great probability that the
  • 110. 208 Estimation [eh.6 Sec. 6.2] Measures of Quality of Estimators 209 statistic Y = u(Xl> X 2 , • • • , X n) will be close to 8; that is, 8 should be a sort of rallying point for the numbers y = U(Xl> X 2, ••• , xn) . This can be achieved in one way by selecting Y = u(XI , X 2 , " ., X n) in such a way that not only is Y an unbiased esimator of 8 but also the variance of Y is as small as it can be made. We do this because the variance of Y is a measure of the intensity of the concentration of the probability for Y in the neighborhood of the point 8 = E(Y). Accord- ingly, we define an unbiased minimum variance estimator of the param- eter 8 in the following manner. Definition 3. For a given positive integer n, Y = u(XI , X 2 , ••• , X n) will be called an unbiased minimum variance estimator of the parameter fJ if Y is unbiased, that is E( Y) = fJ, and if the variance of Y is less than or equal to the variance of every other unbiased estimator of fJ. For illustration, let Xl' X 2 , ••. , X g denote a random sample from a distribution that is n(fJ, 1), -00 < 8 < 00. Since the statistic X = (Xl + X 2 + ... + X g)/9 is n(fJ, t), X is an unbiased estimator of fJ. The statistic Xl is n(fJ, 1), so Xl is also an unbiased estimator of fJ. Although the variance t of X is less than the variance 1 of Xl, we cannot say, with n = 9, that X is the unbiased minimum variance estimator of fJ; that definition requires that the comparison be made with every unbiased estimator of 8. To be sure, it is quite impossible to tabulate all other unbiased estimators of this parameter fJ, so other methods must be developed for making the comparisons of the variances. A beginning on this problem will be made in Chapter 10. Let us now discuss the problem of point estimation of a parameter from a slightly different standpoint. Let Xl' X 2 , ••• , X n denote a random sample of size n from a distribution that has the p.d.f. f(x; fJ), fJ E O. The distribution may be either of the continuous or the discrete type. Let Y = u(Xv X 2 , • • • , X n) be a statistic on which we wish to base a point estimate of the parameter fJ. Let w(y) be that function of the observed value of the statistic Y which is the point estimate of fJ. Thus the function w decides the value of our point estimate of fJ and w is called a decision function or a decision rule. One value of the decision function, say w(y), is called a decision. Thus a numerically determined point estimate of a parameter fJ is a decision. Now a decision may be correct or it may be wrong. It would be useful to have a measure of the seriousness of the difference, if any, between the true value of fJ and the point estimate w(y). Accordingly, with each pair, [fJ, w(y)], fJ EO, we associate a nonnegative number 2[8, w(y)] that reflects this seriousness. We call the function 2 the loss junction. The expected (mean). value of the loss function is called the risk junction. If g(y; 8), 8 EO, IS the p.d.f. of Y, the risk function R(8, w) is given by R(8, w) = E{2[8, w(Y)]} = J~G02[8, w(y)]g(y; 8) dy if Y is a random variable of the continuous type. It would be desirable to select a decision function that minimizes the risk R(fJ, w) for all values of 8, fJ E O. But this is usually impossible because the decision function w that minimizes R(fJ, w) for one value of fJ may not minimize R(fJ, w) for another value of fJ. Accordingly, we need either to restrict our decision function to a certain class or to consider methods of order- ing th~ risk func~ions. 
The following example, while very simple, dramatizes these difficulties, E~ample 1. Let Xl' X 2 , •• " X 25 be a random sample from a distribution that IS n(e, 1), -00 < e < 00. Let Y = X, the mean of the random sample, a~d let .p[e, w(y)] = [e - w(y)J2. We shall compare the two decision functions g.lven by.wl(y) = Y and w2 (y) = 0 for -OCJ < Y < 00. The corresponding nsk functions are and Obviously, if, in fact, e = 0, then w2 (y) = 0 is an excellent decision and we have R(O, w2 ) = O. However, if e differs from zero by very much, it is equally clear that W2(Y) = 0 is a poor decision. For example, if, in fact, e = 2, R(2, w2) = 4 > R(2, WI) = .,}-s. In general, we see that R(e w ) < R(e, ~l)' provided that -t < e < t and that otherwise R(e, w 2) :::': R(e,2 w1). That IS, one of these decision functions is better than the other for some values of eand the other decision function is better for other values of e. If, however, we had restricted our consideration to decision functions w :uch that E[w(Y)] = e for all values of e, e E Q, then the decision w 2(y) = 0 IS no~ all~wed. Und~r this restriction and with the given .p[e, w(y)], the risk function 1: the vanance of the unbiased estimator w(Y), and we are con- fronted with the problem of finding the unbiased minimum variance esti- mator. In Chapter 10 we show that the solution is w(y) = Y = X. S~ppose, however, that we do not want to restrict ourselves to decision functions w such that E[w(Y)] = e for all values of e, e E Q. Instead, let us say t?at. the decision function that minimizes the maximum of the risk function IS the best decision function. Because, in this example, R(e, w 2 ) = e2
  • 111. 210 Estimation [Ch, 6 Sec. 6.2] Measures of Quality of Estimators 211 is unbounded, w2 (y) = 0 is not, in accordance, with this criterion, a good decision function. On the other hand, with -00 < 0 < 00, we have max R(O, WI) = max Us) = -i-so 8 8 Accordingly, wI(y) = Y = x seems to be a very good decision in accordance with this criterion because 2~- is small. As a matter of fact, it can be proved that WI is the best decision function, as measured by this minimax criterion, when the loss function is .P[O,w(y)J = [0 - w(y)]2. In this example we illustrated the following: (a) Without some restriction on the decision function, it is difficult to find a decision function that has a risk function which is uniformly less than the risk function of another decision function. (b) A principle of selecting a best decision function, called the minimax principle. This principle may be stated as follows: If the decision function given by wo(y) is such that, for all 0 E Q, max R[O, wo(y)J ::; max R[O, w(y)J 8 8 for every other decision function w(y), then wo(y) is called a minimax decision function. With the restriction E[w(Y)J = 0 and the loss function.P[B, w(y)J = [B - w(y)P, the decision function that minimizes the risk function "yields an unbiased estimator with minimum variance. If, however, the restriction E[w(Y)J = B is replaced by some other condition, the decision function w(Y), if it exists, which minimizes E{[B - W(Y)J2} uniformly in B is sometimes called the minimum mean-square-error estimator. Exercises 6.13, 6.14, and 6.15 provide examples of this type of estimator. Another principle for selecting the decision function, which may be called a best decision function, will be stated in Section 6.6. EXERCISES 6.8. Show that the mean X of a random sample of size n from a distri- bution having p.d.f.j(x; 0) = (ljO)e-<x/8), 0 < x < 00,0 < 0 < 00, zero else- where, is an unbiased estimator of 0 and has variance 02jn. 6.9. Let Xl> X 2 , ••• , X n denote a random sample from a normal distribu- n tion with mean zero and variance 0, 0 < 0 < 00. Show that L XNn is an 1 unbiased estimator of 0 and has variance 202jn. 6.10. Let YI < Y2 < Y3 be the order statistics of a random sample of size 3 from the uniform distribution having p.d.f. j(x; 0) = 1jO, 0 < x < 0, o < 0 < 00, zero elsewhere. Show that 4Yl> 2Y2, and 1Y3 are all unbiased estimators of O. Find the variance of each of these unbiased estimators. 6.11. Let YI and Y2 be two stochastically independent unbiased esti- mators of O. Say the variance of YI is twice the variance of Y 2 . Find the constants kI and k2 so that kI Y1 + k2 Y2 is an unbiased estimator with smallest possible variance for such a linear combination. 6.12. In Example 1 of this section, take .P[0, w(y)] = 10 - w(y)I. Show that R(O, WI) = tV2j7T and R(O, w2) = IO!. Of these two decision functions WI and W 2, which yields the smaller maximum risk? 6.13. Let Xl> X 2 , ••• , X; denote a random sample from a Poisson distri- n bution with parameter 0, 0 < 0 < 00. Let Y = L Xl and let .P[O,w(y)] = 1 [0 - w(y)]2. If we restrict our considerations to decision functions of the form w(y) = b + yjn, where b does not depend upon y, show that R(O, w) = b2 + Ojn. What decision function of this form yields a uniformly smaller risk than every other decision function of this form? With this solution, say w and 0 < 0 < 00, determine max R(O, w) if it exists. 8 6.14. Let Xl> X 2 , ••• , X n denote a random sample from a distribution n that is n(fL, 0), 0 < (} < 00, where fL is unknown. 
Let Y = L (Xl - X)2jn = 1 S2 and let .P[O,w(y)J = [0 - w(y)]2. If we consider decision functions of the form w(y) = by, where b does not depend upon y, show that R(O, w) = (02jn2)[(n2 - 1)b2 - 2n(n - l)b + n2]. Show that b = nj(n + 1) yields a minimum risk for decision functions of this form. Note that nYj(n + 1) is not an unbiased estimator of O. With w(y) = nyj(n + 1) and 0 < 0 < 00, determine max R(O, w) if it exists. 8 6.15. Let Xl> X 2 , ••. , X; denote a random sample from a distribution n that is b(l, 0),0 ::; 0 ::; 1. Let Y = L X, and let .P[O, w(y)] = [0 - w(y)]2. 1 Consider decision functions of the form w(y) = by, where b does not depend upon y. Prove that R(O, w) = b2nO(1 - 0) + (bn - 1)202. Show that provided the value b is such that b2n ~ 2(bn - 1)2. Prove that b = 1jn does not minimize max R(O, w). 8
  • 112. 212 Estimation [Ch.6 Sec. 6.3] Confidence Intervals for Means 213 6.3 Confidence Intervals for Means Suppose we are willing to accept as a fact that the (numerical) out- come X of a random experiment is a random variable that has a normal distribution with known variance a2 but unknown mean p... That is, p.. is some constant, but its value is unknown. To elicit some information about p.., we decide to repeat the random experiment n independent times, n being a fixed positive integer, and under identical conditions. Let the random variables Xl> X 2 , • • • , X n denote, respectively, the outcomes to be obtained on these n repetitions of the experiment. If our assumptions are fulfilled, we then have under consideration a random sample Xl> X 2 , ••• , X n from a distribution that is n(p.., a2 ), a2 known. Consider the maximum likelihood estimator of p.., namely p.. = X. Of course, X is n(p.., a2/n) and (X - p..)/(a/vn) is n(O, 1). Thus Pr (-Z < ~/~: < z) = 0.954. However, the events X - p.. -Z < - - < Z a/v;:" , -Za Za --= < X - p.. < - , vn vn and Za Za X--<p.<X+- v'n vn are equivalent. Thus these events have the same probability. That is, ( Za Za) Pr X - vn < p.. < X + vn = 0.954. Since a is a known number, each of the random variables X - Za/vn and X + Za/v'n is a statistic. The interval (X - Za/vn, X + Za/v'n) is a random interval. In this case, both end points of the interval are statistics. The immediately preceding probability statement can be read. Prior to the repeated independent performances of the random experiment, the probability is 0.954 that the random interval (X - Za/,Vn, X + Za/v'n) includes the unknown fixed point (parameter) p... Up to this point, only probability has been involved; the determina- tion of the p.d.f. of X and the determination of the random interval were problems of probability. Now the problem becomes statistical. Suppose the experiment yields Xl = Xl> X2 = X 2, ••• , X n = X n . Then the sample value of X is x = (Xl + X 2 + ... + xn)/n, a known number. Moreover, since a is known, the interval (x - Za/v'n, x + Za/v'n) has known end points. Obviously, we cannot say that 0.954 is the prob- ability that the particular interval (x - Za/v'n,x + Za/v'n) includes the parameter p.., for p.., although unknown, is some constant, and this particular interval either does or does not include p... However, the fact that we had such a high degree of probability, prior to the performance of the experiment, that the random interval (X - Za/vn, X + Za/v'n) includes the fixed point (parameter) p..leads us to have some reliance on the particular interval (x - Za/v'n,x + Za/vn). This reliance is re- flected by calling the known interval (x - Za/vn, x + Za/v'n) a 95.4 per cent confidence interval for p... The number 0.954 is called the confidence coefficient. The confidence coefficient is equal to the prob- ability that the random interval includes the parameter. One may, of course, obtain an 80 per cent, a 90 per cent, or a 99 per cent confidence interval for p.. by using 1.Z8Z, 1.645, or Z.576, respectively, instead of the constant Z. A statistical inference of this sort is an example of interval estimation of a parameter. Note that the interval estimate of p.. is found by taking a good (here maximum likelihood) estimate x of p.. and adding and subtracting twice the standard deviation of X, namely Za/v'n, which is small if n is large. If a were not known, the end points of the random interval would not be statistics. 
Although the probability statement about the random interval remains valid, the sample data would not yield an interval with known end points. Example 1. If in the preceding discussionn = 40, a2 = 10, and x = 7.164, then (7.164 - 1.282V!%, 7.164 + 1.282V!%), or (6.523, 7.805), is an 80 per cent confidence interval for /lo. Thus we have an interval estimate of /lo. In the next example we shall show how the central limit theorem may be used to help us find an approximate confidence interval for p.. when our sample arises from a distribution that is not normal.
  • 113. 214 Estimation [Ch, 6 Sec. 6.3] Confidence Intervals for Means 215 Example 2. Let X denote the mean of a random sample of size 25 from a distribution that has a moment-generating function, variance a 2 = 100, and mean fL. Since a/Vn = 2, then approximately Pr ( -1.96 < X ~ fL < 1.96) = 0.95, or Pr (X - 3.92 < fL < X + 3.92) = 0.95. Let the observed mean of the sample be x = 67.53. Accordingly, the interval from x - 3.92 = 63.61 to x + 3.92 = 71.45 is an approximate 95 per cent confidence interval for the mean fL· Let us now turn to the problem of finding a confidence interval for the mean fL of a normal distribution when we are not so fortunate as to know the variance a2. In Section 4.8 we found that n5 2ja2, where 52 is the variance of a random sample of size n from a distribution that is n(fL' a2), is X2(n - 1). Thus we have vn(X - fL)ja to be n(O, 1), n52ja2 to be l(n - 1), and the two to be stochastically independent. In Section 4.4 the random variable T was defined in terms of two such random variables as these. In accordance with that section and the foregoing results, we know that T = [vn(X - fL)Jja vn52j[a2(n - I)J has a t distribution with n - 1 degrees of freedom, whatever the value of a2 > O. For a given positive integer n and a probability of 0.95, say, we can find numbers a < b from Table IV in Appendix B, such that Pr (a < ~ < b) = 0.95. S] n - 1 Since the graph of the p.d.f. of the random variable T is symmetric about the vertical axis through the origin, we would doubtless take a = - b,b > O. If the probability of this event is written (with a = - b) in the form ( b5 b5 ) Pr X - vn=l' < fL < X + V = 0.95, n-l n-l then the interval [X - (b5jvn=l') , X + (b5jvn=l')J is a random interval having probability 0.95 of including the unknown fixed point (parameter) fL. If the experimental values of Xl' X 2 , · · · , X; are Xl> X 2, ••• , X n with S2 = ~ (xj - x)2jn, where x = ~ xdn, then the interval [x - (bsjVn - 1), x + (bsjvn=1)J is a 95 per cent confidence interval for fL for every a2 > O. Again this interval estimate of fL is fo~nd b~ adding and subtracting a quantity, here bsjvn - 1, to the pomt estimate X. Example 3. If in the preceding discussion n = 10, x = 3.22, and s = 1.17, then the i~terval [3.22 - (2.262)(1.17)/V9, 3.22 + (2.262)(1.17)jV9]or (2.34,4.10) IS a 95 per cent confidence interval for fL. Remark. If one wishes to find a confidence interval for fL and if the va.riance.a2 of the nonnormal distribution is unknown (unlike Example 2 of this section), he may with large samples proceed as follows. If certain weak conditions are satisfied, then 52, the variance of a random sample of size n ;::: 2, converges stochastically to a2 • Then in Vn(X - fL)/a _ vn=1(X - fL) Vn52/(n - 1)a2 - 5 the numerator of the left-hand member has a limiting distribution that is n(O, 1) and the denominator of that member converges stochastically to 1. Thus Vn - l(X - fL)/5 has a limiting distribution that is n(O, 1). This fact enables us to find approximate confidence intervals for fL when our con- ditions are satisfied. A similar procedure can be followed in the next section when seeking confidence intervals for the difference of the means of two independent nonnormal distributions. . We shall now consider the problem of determining a confidence mterval for the unknown parameter p of a binomial distribution when the parameter n is known. Let Y be b(n, p), where 0 < p < 1 and n is kno,:n. Then p is the mean of Yjn. 
We shall use a result of Example 1, Section 5.5, to find an approximate 95.4 per cent confidence interval for the mean p. There we found that p[ Y-np ] r -2 < < 2 = 0.954, vn(Yjn)(1 - Yjn) approximately. Since Y - np (Yjn) - p Vn(Yjn)(l - Yjn) = V(Yjn)(1 - Yjn)jn' the probability statement above can easily be written in the form Pr [Y _ 2j(Yjn)(1 - Yjn) < p < Y + 2j(Yjn)(1 - Yjn)] = n n n n 0.954,
  • 114. 216 Estimation [eb.6 Sec. 6.3] Confidence Intervals jor Means 217 approximately. Thus, for large n, if the experimental value of Y is y, the interval provides an approximate 95.4 per cent confidence interval for p. A more complicated approximate 95.4 per cent confidence interval can be obtained from the fact that Z = (Y - np)/v'np(1 - P) has a limiting distribution that is n(O, 1), and the fact that the event -2 < Z < 2 is equivalent to the event The first of these facts was established in Example 2, Section 5.4; the proof of inequalities (1) is left as an exercise. Thus an experimental value Y of Y may be used in inequalities (1) to determine an approxi- mate 95.4 per cent confidence interval for p. If one wishes a 95 per cent confidence interval for p that does not depend upon limiting distribution theory, he may use the following approach. (This approach is quite general and can be used in other instances.) Determine two increasingfunctions of p, say c1(P) and c2(P), such that for each value of p we have, at least approximately, But it is the latter that we want to be essentially free of p; thus we set it equal to a constant, obtaining the differential equation v(~) = u(P) + (~ - p)u'(P). u'(P) = VP(I C _ P) Of course, v(Y/n) is a linear function of Yin and thus also has an approximate normal distribution; clearly, it has mean u(P) and variance [U'(P)]2P(1 - P). n (0.2 - zv'(0.2)(0.8)/100,0.2 + 2v (0.2)(0.8)/100) or (0.12, 0.28). The approxi- mate 95.4 per cent confidence interval provided by inequalities (1) is ( 22 - 2v (1600/100) + 1, 22 + 2v(1600/100) + 1) 104 104 or (0.13,0.29). By referring to the appropriate tables found elsewhere, we find that an approximate 95 per cent confidence interval has the limits d2(20) = 0.13 and d1(20) = 0.29. Thus in this example we see that all three methods yield results that are in substantial agreement. Remark. The fact that the variance of Yin is a function of p caused us some difficulty in finding a confidence interval for p.Another way of handling the problem is to try to find a function u(Y/n) of Yin, whose variance is essentially free of p. Since Yin converges stochastically to p,we can approxi- mate u(Y/n) by the first two terms of its Taylor's expansion about p, namely by A solution of this is u(P) = (2c) arc sin Vp. If we take c = t, we have, since u(Y/n) is approximately equal to v(Y/n), that u(~) = arc sin J~. Thi~ has an approximate normal distribution with mean arc sin Vpand variance 1/4n. Hence we could find an approximate 95.4 per cent confidence interval by using P ( 2 arc sin VY/n - arc sin Vp ) r - < < 2 = 0954 v'1/4n . and solving the inequalities for p. Y + 2 + 2v'[Y(n - Y)/n] + 1 < P < . n+4 Y + 2 - 2v'[Y(n - Y)/n] + 1 n+4 The reason that this may be approximate is due to the fact that Y has a distribution of the discrete type and thus it is, in general, impossible to achieve the probability 0.95 exactly. With c1(P) and c2(P) increasing functions, they have single-valued inverses, say d1 (y) and d2 (y), respectively. Thus the events c1(P) < Y < c2(P) and d2(Y) < P < d1(Y) are equivalent and we have, at least approximately, Pr [d2(Y) < P < d1(Y)] = 0.95. In the case of the binomial distribution, the functions c1(P), c2(P), d2(y), and d1(y) cannot be found explicitly, but a number of books provide tables of d2 (y) and d1 (y) for various values of n. Example 4. If, in the preceding discussion, we take n = 100 and y = 20, the first approximate 95.4 per cent confidence interval is given by (1)
  • 115. 6.22. Let Y be b(300, Pl. If the observed value of Y is y = 75, find an approximate 90 per cent confidence interval for p. 6.23. Let X be the mean of a random sample of size n from a distribution that is n(fL, (7 2), where the positive variance 172 is known. Use the fact that N(2) - N( - 2) = 0.954 to find, for each fL, Cl (fL) and C 2(fL) ~uch t~at Pr [Cl(fL) < X < C 2(fL)] = 0.954. Note that Cl(fL) and C2(fL) are mcreasmg functions of fL. Solve for the respective functions dl (x) and d2 (x); thus we also have that Pr [d2(X) < fL < dl(X)] = 0.954. Compare this with the answer obtained previously in the text. 6.16. Let the observed value of the mean X of a random sample of size 20 from a distribution that is n(fL, 80) be 81.2. Find a 95 per cent confidence interval for fL. 6.17. Let X be the mean of a random sample of size n from a distribution that is n(fL, 9). Find n such that Pr (X - 1 < fL < X + 1) = 0.90, approxi- mately. 6.18. Let a random sample of size 17 from the normal distribution n(fL, (72) yield x = 4.7 and S2 = 5.76. Determine a 90 per cent confidence interval for fL· 6.19. Let X denote the mean of a random sample of size n from a distri- bution that has mean fL, variance 172 = 10, and a moment-generating function. Find n so that the probability is approximately 0.954 that the random interval (X - t, X + t) includes fL· 6.20. Let Xl> X 2 , ••• , X g be a random sample of size 9 from a distribu- tion that is n(fL' (7 2). (a) If a is known, find the length of a 95 per cent confidence interval for fL if this interval is based on the random variable V9(X - fL)/a. (b) If a is unknown, find the expected value of the length of a 95 per.cent confidence interval for fL if this interval is based on the random vanable V8(X - fL)/5. (c) Compare these two answers. Hint. Write E(5) = (a/Vn)E[(n5 2/a2 )11 2J. 6.21. Let Xl> X 2 , ••• , X n , X n +l be a random sample of size n + 1, n > 1, n n _ from a distribution that is n(fL, (7 2). Let X = LXtin and 52 = L (X, - X)2/n. 1 1 Find the constant c so that the statistic c(X - Xn + l )/5 has a t distribution. If n = 8, determine k such that Pr (X - k5 < Xg < X + k5) = 0.80. The observed interval (x - ks, x + ks) is often called an 80 per cent prediction interval for Xg • (X - Y) - (fLl - f-t2) Va2jn + a2jm 6.24. In the notation of the discussion of the confidence interval for p, show that the event - 2 < Z < 2 is equivalent to inequalities (1). Hint. First observe that - 2 < Z < 2 is equivalent to Z2 < 4, which can be written as an inequality involving a quadratic expression in p. 6.25. Let X denote the mean of a random sample of size 25 from a gamma-type distribution with a = 4 and f3 > O. Use the central limit theorem to find an approximate 0.954 confidence interval for fL, the mean of the gamma distribution. Hint. Base the confidence interval on the random variable (X - 4(3)f(4f32f25)112 = 5Xf2f3 - 10. 219 A confidence interval for fLl - f-t2 may be obtained as follows: Let Xl' X 2 , · · · , X n and Yl , Y 2 , ••• , Ym denote, respectively, independent random samples from the two independent distributions having, respectively, the probability density functions n(fLl' (7 2) and n(fL2' (7 2). Denote the means of the samples by X and Y and the variances of the samples by Si and S~, respectively. It should be noted that these four statistics are mutually stochastically independent. 
The stochastic independence of X and Si (and, inferentially that of Y and S~) was established in Section 4.8; the assumption that X and Y have indepen- dent distributions accounts for the stochastic independence of the others. Thus X and Yare normally and stochastically independently distributed with means f-tl and f-t2 and variances a2jn and a2jm, respec- tively. In accordance with Section 4.7, their difference X - Y is norm- ally distributed with mean fLl - f-t2 and variance a2jn + a2jm . Then the random variable 6.4 Confidence Intervals for Differences of Means The random variable T may also be used to obtain a confidence interval for the difference f-tl - f-t2 between the means of two inde- pendent normal distributions, say n(fLl' (7 2) and n(f-t2' (7 2), when the distributions have the same, but unknown, variance 17 2• Remark. Let X have a normal distribution with unknown parameters fLl and 172 • A modification can be made in conducting the experiment so that the variance of the distribution will remain the same but the mean of the distribution will be changed; say, increased. After the modification has been effected, let the random variable be denoted by Y, and let Y have a normal distribution with unknown parameters fL2 and 172. Naturally, it is hoped that fL2 is greater than fLl> that is, that fLl - fL2 < O. Accordingly, one seeks a confidence interval for fLl - fL2 in order to make a statistical inference. Sec. 6.4] Confidence Intervals for Differences of Means Estimation [Oh.6 EXERCISES 218
  • 116. 220 Estimation [eh.6 Sec. 6.4] Confidence Intervals for Differences of Means 221 is normally distributed with zero mean and unit variance. This random variable may serve as the numerator of a T random variable. Further, nSVa2 and mS~/a2 have stochastically independent chi-sq~are distribu- tions with n - 1 and m - 1 degrees of freedom, respectively, so that their sum (nSi + mS~)/a2 has a chi-square distribution with n + m - 2 degrees of freedom, provided that m + n - 2 > O. Because of the mutual stochastic independence of X, Y, Si, and S~, it is seen that J nSi + mS~ a2(n + m - 2) may serve as the denominator of a T random variable. That is, the random variable T = (X - Y) - (1-'1 - 1-'2) Jns i + mS~ (!.- + !) n+m-2 n m has a t distribution with n + m - 2 degrees of freedom. As in the previous section, we can (once nand m are specified positive integers with n + m - 2 > 0) find a positive number b from Table IV of Appendix B such that Pr(-b < T < b) = 0.95. unknown variances of the two independent normal distributions are not equal is assigned to one of the exercises. Example 1. It may be verifiedthat if in the preceding discussion n = 10, m = 7, x = 4.2, ji = 3.4, s~ = 49, s~ = 32, then the interval (-5.16, 6.76) is a 90 per cent confidenceinterval for fLl - fL2' Let Y1 and Y2 be two stochastically independent random variables with binomial distributions b(nv PI) and b(n2,P2)' respectively. Let us now turn to the problem of finding a confidence interval for the difference PI - P2 of the means of Yl/nl and Y2/n2 when nl and n2 are known. Since the mean and the variance of Yd», - Y2/n2 are, respectively, PI - hand Pl(l - Pl)/nl + P2(1 - h)/n2, then the random variable given by the ratio (Yl/nl - Y 2/n2) - (PI - P2) VPl(l - Pl)/nl + P2(1 - P2)/n2 has mean zero and variance 1 for all positive integers nl and n2 • Moreover, since both Y1 and Y2 have approximate normal distributions for large nl and n2 , one suspects that the ratio has an approximate normal distribution. This is actually the case, but it will not be proved here. Moreover, if ndn2 = c, where C is a fixed positive constant, the result of Exercise 6.31 shows that the random variable If we set R = Jnsi + mS~ (.!. + ].), n+m-2 n m (1) (Ydnl)(1 - Yl/nl)/nl + (Y2/n2)(1 - Y 2/n2)/n2 Pl(1 - Pl)/nl + P2(1 - P2)/n2 this probability may be written in the form Pr [(X - Y) - bR < 1-'1 - 1-'2 < (X - Y) + bR] = 0.95. It follows that the random interval _ bJnsi + mS~ (~ + 2-), n+m-2 n m (X _ Y) + bJnSi + mS~ (!. + 2-)] n+m-2 n m has probability 0.95 of including the unknown fixed point (1-'1 - 1-'2)' As usual, the experimental values of X, Y, Si, and S~, namely X, fj, si, and s~, will provide a 95 per cent confidence interval for 1-'1 - 1-'2 when the variances of the two independent normal distributions are unknown but equal. A consideration of the difficulty encountered when the converges stochastically to 1 as n2 --+ 00 (and thus nl --+ 00, since nl/n2 = c, C > 0). In accordance with Theorem 6, Section 5.5, the random variable w = (Ydnl - Y 2/n2) - (PI - P2), u where has a limiting distribution that is n(O, 1). The event - 2 < W < 2, the probability of which is approximately equal to 0.954, is equivalent to the event
Accordingly, the experimental values $y_1$ and $y_2$ of $Y_1$ and $Y_2$, respectively, will provide an approximate 95.4 per cent confidence interval for $p_1 - p_2$.

Example 2. If, in the preceding discussion, we take $n_1 = 100$, $n_2 = 400$, $y_1 = 30$, $y_2 = 80$, then the experimental values of $Y_1/n_1 - Y_2/n_2$ and $U$ are $0.1$ and $\sqrt{(0.3)(0.7)/100 + (0.2)(0.8)/400} = 0.05$, respectively. Thus the interval $(0, 0.2)$ is an approximate 95.4 per cent confidence interval for $p_1 - p_2$.

EXERCISES

6.26. Let two independent random samples, each of size 10, from two independent normal distributions $n(\mu_1, \sigma^2)$ and $n(\mu_2, \sigma^2)$ yield $\bar{x} = 4.8$, $s_1^2 = 8.64$, $\bar{y} = 5.6$, $s_2^2 = 7.88$. Find a 95 per cent confidence interval for $\mu_1 - \mu_2$.

6.27. Let two stochastically independent random variables $Y_1$ and $Y_2$, with binomial distributions that have parameters $n_1 = n_2 = 100$, $p_1$, and $p_2$, respectively, be observed to be equal to $y_1 = 50$ and $y_2 = 40$. Determine an approximate 90 per cent confidence interval for $p_1 - p_2$.

6.28. Discuss the problem of finding a confidence interval for the difference $\mu_1 - \mu_2$ between the two means of two independent normal distributions if the variances $\sigma_1^2$ and $\sigma_2^2$ are known but not necessarily equal.

6.29. Discuss Exercise 6.28 when it is assumed that the variances are unknown and unequal. This is a very difficult problem, and the discussion should point out exactly where the difficulty lies. If, however, the variances are unknown but their ratio $\sigma_1^2/\sigma_2^2$ is a known constant $k$, then a statistic that is a $T$ random variable can again be used. Why?

6.30. Let $\bar{X}$ and $\bar{Y}$ be the means of two independent random samples, each of size $n$, from the respective distributions $n(\mu_1, \sigma^2)$ and $n(\mu_2, \sigma^2)$, where the common variance is known. Find $n$ such that
\[
\Pr(\bar{X} - \bar{Y} - \sigma/5 < \mu_1 - \mu_2 < \bar{X} - \bar{Y} + \sigma/5) = 0.90.
\]

6.31. Under the conditions given, show that the random variable defined by ratio (1) of the text converges stochastically to 1.

6.5 Confidence Intervals for Variances

Let the random variable $X$ be $n(\mu, \sigma^2)$. We shall discuss the problem of finding a confidence interval for $\sigma^2$. Our discussion will consist of two parts: first, when $\mu$ is a known number, and second, when $\mu$ is unknown.

Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution that is $n(\mu, \sigma^2)$, where $\mu$ is a known number. The maximum likelihood estimator of $\sigma^2$ is $\sum_1^n (X_i - \mu)^2/n$, and the variable $Y = \sum_1^n (X_i - \mu)^2/\sigma^2$ is $\chi^2(n)$. Let us select a probability, say 0.95, and for the fixed positive integer $n$ determine values of $a$ and $b$, $a < b$, from Table II, such that
\[
\Pr(a < Y < b) = 0.95.
\]
Thus
\[
\Pr\left( a < \frac{\sum_1^n (X_i - \mu)^2}{\sigma^2} < b \right) = 0.95,
\]
or
\[
\Pr\left[ \frac{\sum_1^n (X_i - \mu)^2}{b} < \sigma^2 < \frac{\sum_1^n (X_i - \mu)^2}{a} \right] = 0.95.
\]
Since $\mu$, $a$, and $b$ are known constants, each of $\sum_1^n (X_i - \mu)^2/b$ and $\sum_1^n (X_i - \mu)^2/a$ is a statistic. Moreover, the interval $\bigl( \sum_1^n (X_i - \mu)^2/b, \; \sum_1^n (X_i - \mu)^2/a \bigr)$ is a random interval having probability of 0.95 that it includes the unknown fixed point (parameter) $\sigma^2$. Once the random experiment has been performed, and it is found that $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, then the particular interval $\bigl( \sum_1^n (x_i - \mu)^2/b, \; \sum_1^n (x_i - \mu)^2/a \bigr)$ is a 95 per cent confidence interval for $\sigma^2$. The reader will immediately observe that there are no unique numbers $a < b$ such that $\Pr(a < Y < b) = 0.95$. A common method of procedure is to find $a$ and $b$ such that $\Pr(Y < a) = 0.025$ and $\Pr(b < Y) = 0.025$. That procedure will be followed in this book.
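As a small numerical aside (ours, not the text's), the equal-tails convention just described amounts to reading two chi-square quantiles from Table II. The sketch below does the same look-up with scipy for the case n = 10 and the sum of squares 106.6 used in Example 1, which follows.

```python
from scipy.stats import chi2

n = 10                   # known-mean case: Y = sum((x_i - mu)^2)/sigma^2 is chi-square(n)
a = chi2.ppf(0.025, n)   # Pr(Y < a) = 0.025, about 3.25
b = chi2.ppf(0.975, n)   # Pr(Y > b) = 0.025, about 20.5

sum_sq = 106.6           # observed value of sum((x_i - mu)^2)
print(round(a, 2), round(b, 2))     # 3.25 20.48
print(sum_sq / b, sum_sq / a)       # approximately (5.2, 32.8)
```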
Example 1. If in the preceding discussion $\mu = 0$, $n = 10$, and $\sum_1^{10} x_i^2 = 106.6$, then the interval $(106.6/20.5, \; 106.6/3.25)$, or $(5.2, 32.8)$, is a 95 per cent confidence interval for the variance $\sigma^2$, since $\Pr(Y < 3.25) = 0.025$ and
  • 118. Estimation [Ch.6 that is, that a~/a~ < 1. In order to make a statistical inference, we find a confidence interval for the ratio a~/a~. 224 Pr (20.5 < Y) = 0.025, provided that Y has a chi-square distribution with 10 degrees of freedom. Sec. 6.5] Confidence Intervals for Variances 225 We now turn to the case in which fL is not known. This case can be handled by making use of the facts that S2 is the maximu~.lik:lihood estimator of a2 and nS2/a2 is X2(n - 1). For a fixed positive integer n 2': 2, we can find, from Table II, values of a and b, a < b, such that Pr (a < n~2 < b) = 0.95. Here, of course, we would find a and b by using a chi-square distribution with n - 1 degrees of freedom. In accordance with the convention previously adopted, we would select a and b so that Pr (n~2 < a) = 0.025 and Pr (n~2 > b) = 0.025. We then have ( nS2 nS2) Pr b < a 2 < a = 0.95 so that (nS2/b,nS2/a) is a random interval having probability 0.95 of including the fixed but unknown point (parameter) a2 . After the random experiment has been performed and we find, say, Xl = X1> X 2 = X2,... , X n = X n , with S2 = ~ (Xi - X)2/n, we have, as a 95 per cent confidence 1 interval for a2, the interval (ns2/b, ns2/a). Example 2. If, in the preceding discussion, we have n = 9, S2 = 7.63, then the interval [9(7.63)/17.5, 9(7.63)/2.18J or (3.92,31.50) is a 95 per cent confidence interval for the variance a 2 • Next let X and Y denote stochastically independent random variables that are n(fL1' ar) and n(fL2' a~), respectively. We shall deter- mine a confidence interval for the ratio a~/ar when fLl and fL2 are unknown. Remark. Consider a situation in which a random variable X has a normal distribution with variance a~. Although at is not known, it is found that the experimental values of X are quite widely dispersed, so that. at must be fairly large. It is believed that a certain modifica:ion i~ conducting the experiment may reduce the variance. After the modification has been effected let the random variable be denoted by Y, and let Y have a normal distribution with variance a~. Naturally, it is hoped that a~ is less than at Consider a random sample Xl' X 2, ... , X; of size n 2': 2 from the distribution of X and a random sample Y1> Y2, ••• , Ym of size m 2': 2 from the independent distribution of Y. Here nand m mayor may not be equaL Let the means of the two samples be denoted by X and Y, n and the variances of the two samples by Sr = L (Xi - X)2/n and m 1 S~ = L (Yi - y)2/m, respectively. The stochastically independent 1 random variables nSVar and mS~/a~ have chi-square distributions with n - 1 and m - 1 degrees of freedom, respectively. In Section 4.4 a random variable called F was defined, and through the change-of- variable technique the p.d.f. of F was obtained. If nSVar is divided by n - 1, the number of degrees of freedom, and if mS~/a~ is divided by m - 1, then, by definition of an F random variable, we have that F = nSr/[ar(n - 1)] mS~/[a~(m - 1)] has an F distribution with parameters n - 1 and m - 1. For numeri- cally given values of nand m and with a preassigned probability, say 0.95, we can determine from Table V of Appendix B, in accordance with our convention, numbers 0 < a < b such that [ nSr/[ar(n - 1)] ] Pr a < mS~/[a~(m _ 1)] < b = 0.95. 
If the probability of this event is written in the form
\[
\Pr\left[ a\,\frac{mS_2^2/(m - 1)}{nS_1^2/(n - 1)} < \frac{\sigma_2^2}{\sigma_1^2} < b\,\frac{mS_2^2/(m - 1)}{nS_1^2/(n - 1)} \right] = 0.95,
\]
it is seen that the interval
\[
\left[ a\,\frac{mS_2^2/(m - 1)}{nS_1^2/(n - 1)}, \; b\,\frac{mS_2^2/(m - 1)}{nS_1^2/(n - 1)} \right]
\]
is a random interval having probability 0.95 of including the fixed but unknown point $\sigma_2^2/\sigma_1^2$. If the experimental values of $X_1, X_2, \ldots, X_n$ and of $Y_1, Y_2, \ldots, Y_m$ are denoted by $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$,
respectively, and if $ns_1^2 = \sum_1^n (x_i - \bar{x})^2$ and $ms_2^2 = \sum_1^m (y_i - \bar{y})^2$, then the interval with known end points, namely
\[
\left[ a\,\frac{ms_2^2/(m - 1)}{ns_1^2/(n - 1)}, \; b\,\frac{ms_2^2/(m - 1)}{ns_1^2/(n - 1)} \right],
\]
is a 95 per cent confidence interval for the ratio $\sigma_2^2/\sigma_1^2$ of the two unknown variances.

Example 3. If in the preceding discussion $n = 10$, $m = 5$, $s_1^2 = 20.0$, $s_2^2 = 35.6$, then the interval
\[
\left[ \left(\frac{1}{4.72}\right) \frac{5(35.6)/4}{10(20.0)/9}, \; (8.90)\,\frac{5(35.6)/4}{10(20.0)/9} \right]
\]
or $(0.4, 17.8)$ is a 95 per cent confidence interval for $\sigma_2^2/\sigma_1^2$.
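The end points in Example 3 come from the two tabled F quantiles with 9 and 4 degrees of freedom. The short Python sketch below is ours, with scipy standing in for Table V; it reproduces the interval.

```python
from scipy.stats import f

n, m = 10, 5
s1_sq, s2_sq = 20.0, 35.6        # sample variances with divisors n and m

a = f.ppf(0.025, n - 1, m - 1)   # lower 2.5 per cent point, about 1/4.72
b = f.ppf(0.975, n - 1, m - 1)   # upper 2.5 per cent point, about 8.90

ratio = (m * s2_sq / (m - 1)) / (n * s1_sq / (n - 1))
print(a * ratio, b * ratio)      # approximately (0.4, 17.8)
```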
EXERCISES

6.32. If 8.6, 7.9, 8.3, 6.4, 8.4, 9.8, 7.2, 7.8, 7.5 are the observed values of a random sample of size 9 from a distribution that is $n(8, \sigma^2)$, construct a 90 per cent confidence interval for $\sigma^2$.

6.33. Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution $n(\mu, \sigma^2)$. Let $0 < a < b$. Show that the mathematical expectation of the length of the random interval $\bigl[ \sum_1^n (X_i - \mu)^2/b, \; \sum_1^n (X_i - \mu)^2/a \bigr]$ is $(b - a)(n\sigma^2/ab)$.

6.34. A random sample of size 15 from the normal distribution $n(\mu, \sigma^2)$ yields $\bar{x} = 3.2$ and $s^2 = 4.24$. Determine a 90 per cent confidence interval for $\sigma^2$.

6.35. Let $S^2$ be the variance of a random sample of size $n$ taken from a distribution that is $n(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are unknown. Let $g(z)$ be the p.d.f. of $Z = nS^2/\sigma^2$, which is $\chi^2(n - 1)$. Let $a$ and $b$ be such that the observed interval $(ns^2/b, ns^2/a)$ is a 95 per cent confidence interval for $\sigma^2$. If its length $ns^2(b - a)/ab$ is to be a minimum, show that $a$ and $b$ must satisfy the condition that $a^2 g(a) = b^2 g(b)$. Hint. If $G(z)$ is the distribution function of $Z$, then differentiate both $G(b) - G(a) = 0.95$ and $(b - a)/ab$ with respect to $b$, recalling that, from the first equation, $a$ must be a function of $b$. Then equate the latter derivative to zero.

6.36. Let two independent random samples of sizes $n = 16$ and $m = 10$, taken from two independent normal distributions $n(\mu_1, \sigma_1^2)$ and $n(\mu_2, \sigma_2^2)$, respectively, yield $\bar{x} = 3.6$, $s_1^2 = 4.14$, $\bar{y} = 13.6$, $s_2^2 = 7.26$. Find a 90 per cent confidence interval for $\sigma_2^2/\sigma_1^2$ when $\mu_1$ and $\mu_2$ are unknown.

6.37. Discuss the problem of finding a confidence interval for the ratio $\sigma_2^2/\sigma_1^2$ of the two unknown variances of two independent normal distributions if the means $\mu_1$ and $\mu_2$ are known.

6.38. Let $X_1, X_2, \ldots, X_6$ be a random sample of size 6 from a gamma distribution with parameters $\alpha = 1$ and unknown $\beta > 0$. Discuss the construction of a 98 per cent confidence interval for $\beta$. Hint. What is the distribution of $2\sum_1^6 X_i/\beta$?

6.39. Let $S_1^2$ and $S_2^2$ denote, respectively, the variances of random samples, of sizes $n$ and $m$, from two independent distributions that are $n(\mu_1, \sigma^2)$ and $n(\mu_2, \sigma^2)$. Use the fact that $(nS_1^2 + mS_2^2)/\sigma^2$ is $\chi^2(n + m - 2)$ to find a confidence interval for the common unknown variance $\sigma^2$.

6.40. Let $Y_4$ be the $n$th order statistic of a random sample, $n = 4$, from a continuous-type uniform distribution on the interval $(0, \theta)$. Let $0 < c_1 < c_2 \le 1$ be selected so that $\Pr(c_1\theta < Y_4 < c_2\theta) = 0.95$. Verify that $c_1 = \sqrt[4]{0.05}$ and $c_2 = 1$ satisfy these conditions. What, then, is a 95 per cent confidence interval for $\theta$?

6.6 Bayesian Estimates

In Sections 6.3, 6.4, and 6.5 we constructed two statistics, say $U$ and $V$, $U < V$, such that we have a preassigned probability $p$ that the random interval $(U, V)$ contains a fixed but unknown point (parameter). We then adopted this principle: Use the experimental results to compute the values of $U$ and $V$, say $u$ and $v$; then call the interval $(u, v)$ a $100p$ per cent confidence interval for the parameter. Adoption of this principle provided us with one method of interval estimation. This method of interval estimation is widely used in statistical literature and in the applications. But it is important for us to understand that other principles can be adopted. The student should constantly keep in mind that as long as he is working with probability, he is in the realm of mathematics; but once he begins to make inferences or to draw conclusions about a random experiment, which inferences are based upon experimental data, he is in the field of statistics.

We shall now describe another approach to the problem of interval estimation. This approach takes into account any prior knowledge of the experiment that the statistician has and it is one application of a principle of statistical inference that may be called Bayesian statistics. Consider a random variable $X$ that has a distribution of probability that depends upon the symbol $\theta$, where $\theta$ is an element of a well-defined set $\Omega$. For example, if the symbol $\theta$ is the mean of a normal distribution, $\Omega$ may be the real line.
We have previously looked upon $\theta$ as being some constant, although an unknown constant. Let us now introduce a random variable $\Theta$ that has a distribution of probability over the set $\Omega$; and, just as we look upon $x$ as a possible value of the random variable $X$, we now look upon $\theta$ as a possible value of the random variable $\Theta$. Thus the distribution of $X$ depends upon $\theta$, a random determination of the random variable $\Theta$. We shall denote the p.d.f. of $\Theta$ by $h(\theta)$ and we take $h(\theta) = 0$ when $\theta$ is not an element of $\Omega$. Let $X_1, X_2, \ldots, X_n$ denote a random sample from this distribution of $X$ and let $Y$ denote a statistic that is a function of $X_1, X_2, \ldots, X_n$. We can find the p.d.f. of $Y$ for every given $\theta$; that is, we can find the conditional p.d.f. of $Y$, given $\Theta = \theta$, which we denote by $g(y|\theta)$. Thus the joint p.d.f. of $Y$ and $\Theta$ is given by
\[
k(y, \theta) = h(\theta)g(y|\theta).
\]
If $\Theta$ is a random variable of the continuous type, the marginal p.d.f. of $Y$ is given by
\[
k_1(y) = \int_{-\infty}^{\infty} h(\theta)g(y|\theta)\,d\theta.
\]
If $\Theta$ is a random variable of the discrete type, integration would be replaced by summation. In either case the conditional p.d.f. of $\Theta$, given $Y = y$, is
\[
k(\theta|y) = \frac{h(\theta)g(y|\theta)}{k_1(y)}.
\]
This relationship is one form of Bayes' formula (see Exercise 2.7, Section 2.1).

In Bayesian statistics, the p.d.f. $h(\theta)$ is called the prior p.d.f. of $\Theta$, and the conditional p.d.f. $k(\theta|y)$ is called the posterior p.d.f. of $\Theta$. This is because $h(\theta)$ is the p.d.f. of $\Theta$ prior to the observation of $Y$, whereas $k(\theta|y)$ is the p.d.f. of $\Theta$ after the observation of $Y$ has been made. In many instances, $h(\theta)$ is not known; yet the choice of $h(\theta)$ affects the p.d.f. $k(\theta|y)$. In these instances the statistician takes into account all prior knowledge of the experiment and assigns the prior p.d.f. $h(\theta)$. This, of course, injects the problem of personal or subjective probability (see the Remark, Section 1.1).

Suppose that we want a point estimate of $\theta$. From the Bayesian viewpoint, this really amounts to selecting a decision function $w$ so that $w(y)$ is a predicted value of $\theta$ (an experimental value of the random variable $\Theta$) when both the computed value $y$ and the conditional p.d.f. $k(\theta|y)$ are known. Now, in general, how would we predict an experimental value of any random variable, say $W$, if we want our prediction to be "reasonably close" to the value to be observed? Many statisticians would predict the mean, $E(W)$, of the distribution of $W$; others would predict a median (perhaps unique) of the distribution of $W$; some would predict a mode (perhaps unique) of the distribution of $W$; and some would have other predictions. However, it seems desirable that the choice of the decision function should depend upon the loss function $\mathcal{L}[\theta, w(y)]$. One way in which this dependence upon the loss function can be reflected is to select the decision function $w$ in such a way that the conditional expectation of the loss is a minimum. A Bayes' solution is a decision function $w$ that minimizes
\[
E\{\mathcal{L}[\Theta, w(y)] \mid Y = y\} = \int_{-\infty}^{\infty} \mathcal{L}[\theta, w(y)]\,k(\theta|y)\,d\theta,
\]
if $\Theta$ is a random variable of the continuous type. The usual modification of the right-hand member of this equation is made for random variables of the discrete type. If, for example, the loss function is given by $\mathcal{L}[\theta, w(y)] = [\theta - w(y)]^2$, the Bayes' solution is given by $w(y) = E(\Theta|y)$, the mean of the conditional distribution of $\Theta$, given $Y = y$. This follows from the fact (Exercise 1.91) that $E[(W - b)^2]$, if it exists, is a minimum when $b = E(W)$.
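The fact that the posterior mean minimizes the expected squared-error loss is easy to verify numerically. The sketch below is ours: it discretizes an arbitrary posterior density on a grid (the particular density is only an illustration, not one from the text) and checks that the minimizer of the expected loss agrees with the posterior mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# An arbitrary (unnormalized) posterior density on a grid; any positive,
# integrable shape would do for this illustration.
theta = np.linspace(0.001, 10, 4000)
step = theta[1] - theta[0]
post = theta**2 * np.exp(-theta)      # gamma-like shape, stands in for k(theta | y)
post /= np.sum(post * step)           # normalize numerically

def expected_sq_loss(b):
    # E{ [Theta - b]^2 | Y = y } under the discretized posterior
    return np.sum((theta - b) ** 2 * post * step)

posterior_mean = np.sum(theta * post * step)
best_b = minimize_scalar(expected_sq_loss).x

print(posterior_mean, best_b)   # the two values agree (both are about 3.0 here)
```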
If the loss function is given by $\mathcal{L}[\theta, w(y)] = |\theta - w(y)|$, then a median of the conditional distribution of $\Theta$, given $Y = y$, is the Bayes' solution. This follows from the fact (Exercise 1.81) that $E(|W - b|)$, if it exists, is a minimum when $b$ is equal to any median of the distribution of $W$.

The conditional expectation of the loss, given $Y = y$, defines a random variable that is a function of the statistic $Y$. The expected value of that function of $Y$, in the notation of this section, is given by
\[
\int_{-\infty}^{\infty} \left\{ \int_{-\infty}^{\infty} \mathcal{L}[\theta, w(y)]\,k(\theta|y)\,d\theta \right\} k_1(y)\,dy
= \int_{-\infty}^{\infty} \left\{ \int_{-\infty}^{\infty} \mathcal{L}[\theta, w(y)]\,g(y|\theta)\,dy \right\} h(\theta)\,d\theta,
\]
in the continuous case. The integral within the braces in the latter expression is, for every given $\theta \in \Omega$, the risk function $R(\theta, w)$; accordingly, the latter expression is the mean value of the risk, or the expected risk. Because a Bayes' solution minimizes
\[
\int_{-\infty}^{\infty} \mathcal{L}[\theta, w(y)]\,k(\theta|y)\,d\theta
\]
for every $y$ for which $k_1(y) > 0$, it is evident that a Bayes' solution
  • 121. 230 Estimation [eb.6 Sec. 6.6] Bayesian Estimates 231 °< 8 < 1, y = 0,1, ... , n, D [8 - w(y)]2k(8Iy) d8 for y = 0, 1, ... , n and, accordingly, it minimizes the expected risk. It is very instructive to note that this Bayes' solution can be written as °< 8 < 1, k(8Iy) = c(y)8y+ a- I(1 - 8)n-u+Ii-1, That is, w(y) = C + ; + n) ~ + (a:;~ n) a:f3 which is a weighted average of the maximum likelihood estimate yin of 8 and the mean al(a + f3) of the prior p.d.f. of the parameter. Moreover, the respective weights are nj(a + f3 + n) and (a + f3)/(a + f3 + n). Thus we see that a and f3 should be selected so that not only is al(a + f3) the desired prior mean, but the sum a + f3 indicates the worth of the prior opinion, relative to a sample of size n. That is, if we want our prior opinion to have as much weight as a sample size of 20, we would take a + f3 = 20. So if our prior mean is i; we have that a and f3 are selected so that a = 15 and f3 = 5. In Example 1 it is extremely convenient to notice that it is not really necessary to determine kl (y) to find k(8Iy). If we divide g(yl 8)h(8) by kl (y) we must get the product of a factor, which depends upon y but does not depend upon 8, say c(y), and 8u+a-I(1 _ 8)n-u+li- 1. a+y a+f3+n This decision function w(y) minimizes The Bayes' solution w(y) is the mean of the conditional distribution of e, given Y = y. Thus w(y) = J:8k(8jy) d8 = r(n + a + f3) (I 8a+Y(1 _ 8)Ii+n-Y-1 d8 r(a + y)r(n + f3 - y) Jo and y = 0, 1, ... , n. However, c(y) must be that" constant" needed to make k(8Iy) a p.d.f., namely r(n + ex + (3) c(y) = r(y + a)r(n - y + (3) Accordingly, Bayesian statisticians frequently write that k(8ly) is pro- portional to g(yI8)h(8); that is, k(8Iy) o: g(yI8)h(8). h(8) = r(a + f3) 8a-1(1 _ 8)1i-1 r(a)r(f3) , = °elsewhere. where a and f3 are assigned positive constants. Thus the joint p.d.f. of Yand e is given by g(yl8)h(8) and the marginal p.d.f. of Y is I:{~o [8 - W(y)J2(;)8Y(1 - 8)n-Y}h(8) d8 Y~{f[8 - w(y)J2k(8Iy) d8}k1(y). k(81 ) = g(yl8)h(8) y k1(y) _ r(n + a + f3) 8a+y-1(1 _ 8)Ii+n-Y-1 °< 8 < 1, - r(a + y)r(n + f3 - y) , and y = 0, 1, ... , n. We take the loss function to be 2[8, w(y)J = [8 - W(y)J2. Because Y is a random variable of the discrete type, whereas e is of the continuous type, we have for the expected risk, k1(y) = J:h(8)g(yj8) d8 = (n) r(a + f3) II8y+a-1(1 _ 8)n-Y+Ii-1 se y r(a)r(f3) 0 _ cr(a + f3)r(a + y)r(n + f3 - y), y = 0,1,2, ... , n, - y r(a)r(f3)r(n + a + f3) = °elsewhere. Finally, the conditional p.d.f. of e, given Y = y, is, at points of positive probability density, g(yj8) = (;)8Y(1 - W-Y, = °elsewhere. We take the prior p.d.f. of the random variable e to be w(y) minimizes this mean value of the risk. We now give an illustrative example. Example 1. Let Xl' X 2 , ••• , X n denote a random sample from a distribution that is b(l, 8), °< 8 < 1. We seek a decision function w that is a Bayes' solution. If Y = i Xi, then Y is b(n, 0). That is, the conditional 1 p.d.f. of Y, given e = 8, is
Then to actually form the p.d.f. $k(\theta|y)$, they simply find a "constant," which is some function of $y$, so that the expression integrates to 1. This is now illustrated.

Example 2. Suppose that $Y = \bar{X}$ is the mean of a random sample of size $n$ that arises from the normal distribution $n(\theta, \sigma^2)$, where $\sigma^2$ is known. Then $g(y|\theta)$ is $n(\theta, \sigma^2/n)$. Further suppose that we are able to assign prior knowledge to $\theta$ through a prior p.d.f. $h(\theta)$ that is $n(\theta_0, \sigma_0^2)$. Then we have that
\[
k(\theta|y) \propto \frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left[ -\frac{(y - \theta)^2}{2(\sigma^2/n)} - \frac{(\theta - \theta_0)^2}{2\sigma_0^2} \right].
\]
If we eliminate all constant factors (including factors involving $y$ only), we have
\[
k(\theta|y) \propto \exp\left[ -\frac{(\sigma_0^2 + \sigma^2/n)\theta^2 - 2(y\sigma_0^2 + \theta_0\sigma^2/n)\theta}{2(\sigma^2/n)\sigma_0^2} \right].
\]
This can be simplified, by completing the square, to read (after eliminating factors not involving $\theta$)
\[
k(\theta|y) \propto \exp\left[ -\frac{\left( \theta - \dfrac{y\sigma_0^2 + \theta_0\sigma^2/n}{\sigma_0^2 + \sigma^2/n} \right)^2}{\dfrac{2(\sigma^2/n)\sigma_0^2}{\sigma_0^2 + \sigma^2/n}} \right].
\]
That is, the posterior p.d.f. of the parameter is obviously normal with mean
\[
\frac{y\sigma_0^2 + \theta_0\sigma^2/n}{\sigma_0^2 + \sigma^2/n}
\]
and variance $(\sigma^2/n)\sigma_0^2/(\sigma_0^2 + \sigma^2/n)$. If the square-error loss function is used, this posterior mean is the Bayes' solution. Again, note that it is a weighted average of the maximum likelihood estimate $y = \bar{x}$ and the prior mean $\theta_0$. Observe here and in Example 1 that the Bayes' solution gets closer to the maximum likelihood estimate as $n$ increases. Thus the Bayesian procedures permit the decision maker to enter his or her prior opinions into the solution in a very formal way such that the influences of those prior notions will be less and less as $n$ increases.

In Bayesian statistics all the information is contained in the posterior p.d.f. $k(\theta|y)$. In Examples 1 and 2 we found Bayesian point estimates using the square-error loss function. It should be noted that if $\mathcal{L}[w(y), \theta] = |w(y) - \theta|$, the absolute value of the error, then the Bayes' solution would be the median of the posterior distribution of the parameter, which is given by $k(\theta|y)$. Hence the Bayes' solution changes, as it should, with different loss functions.

If an interval estimate of $\theta$ is desired, we can now find two functions $u(y)$ and $v(y)$ so that the conditional probability
\[
\Pr[u(y) < \Theta < v(y) \mid Y = y] = \int_{u(y)}^{v(y)} k(\theta|y)\,d\theta
\]
is large, say 0.95. The experimental values of $X_1, X_2, \ldots, X_n$, say $x_1, x_2, \ldots, x_n$, provide us with an experimental value of $Y$, say $y$. Then the interval $u(y)$ to $v(y)$ is an interval estimate of $\theta$ in the sense that the conditional probability of $\Theta$ belonging to that interval is equal to 0.95. For illustration, in Example 2 where the posterior p.d.f. of the parameter was normal, the interval, whose end points are found by taking the mean of that distribution and adding and subtracting 1.96 of its standard deviation,
\[
\frac{y\sigma_0^2 + \theta_0\sigma^2/n}{\sigma_0^2 + \sigma^2/n} \pm 1.96\sqrt{\frac{(\sigma^2/n)\sigma_0^2}{\sigma_0^2 + \sigma^2/n}},
\]
serves as an interval estimate for $\theta$ with posterior probability of 0.95.

Finally, it should be noted that in Bayesian statistics it is really better to begin with the sample items $X_1, X_2, \ldots, X_n$ rather than some statistic $Y$. We used the latter approach for convenience of notation. If $X_1, X_2, \ldots, X_n$ are used, then in our discussion, replace $g(y|\theta)$ by $f(x_1|\theta)f(x_2|\theta)\cdots f(x_n|\theta)$ and $k(\theta|y)$ by $k(\theta|x_1, x_2, \ldots, x_n)$. Thus we find that
\[
k(\theta|x_1, x_2, \ldots, x_n) \propto h(\theta)f(x_1|\theta)f(x_2|\theta)\cdots f(x_n|\theta).
\]
If the statistic $Y$ is chosen correctly (namely, as a sufficient statistic, as explained in Chapter 10), we find that $k(\theta|x_1, x_2, \ldots, x_n) = k(\theta|y)$. This is illustrated by Exercise 6.44.
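A short numerical sketch of Example 2's updating and of the 95 per cent interval estimate just described may help; it is ours, and the particular numbers (known variance 100, n = 25, observed mean 77, prior mean 75, prior variance 4) are only illustrative.

```python
from math import sqrt

sigma2 = 100.0             # known variance of X
n = 25                     # sample size
ybar = 77.0                # observed sample mean (illustrative)
theta0, tau2 = 75.0, 4.0   # prior mean and prior variance sigma_0^2

# Posterior of Theta given Y = ybar is normal with this mean and variance.
post_mean = (ybar * tau2 + theta0 * sigma2 / n) / (tau2 + sigma2 / n)
post_var = (sigma2 / n) * tau2 / (tau2 + sigma2 / n)

# Bayes point estimate under square-error loss: the posterior mean,
# a weighted average of ybar and theta0.
print(post_mean)           # 76.0 here, halfway, since tau2 equals sigma2/n

# Interval estimate with posterior probability 0.95
half = 1.96 * sqrt(post_var)
print(post_mean - half, post_mean + half)
```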
Of course, these Bayesian procedures can easily be extended to the case of several parameters, as demonstrated by Exercise 6.45.

EXERCISES

6.41. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is $n(\theta, \sigma^2)$, $-\infty < \theta < \infty$, where $\sigma^2$ is a given positive number. Let $Y = \bar{X}$, the mean of the random sample. Take the loss function to be $\mathcal{L}[\theta, w(y)] = |\theta - w(y)|$. If $\theta$ is an observed value of the random variable $\Theta$ that is $n(\mu, \tau^2)$, where $\tau^2 > 0$ and $\mu$ are known numbers, find the Bayes' solution $w(y)$ for a point estimate of $\theta$.

6.42. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a Poisson distribution with mean $\theta$, $0 < \theta < \infty$. Let $Y = \sum_1^n X_i$ and take the loss
function to be $\mathcal{L}[\theta, w(y)] = [\theta - w(y)]^2$. Let $\theta$ be an observed value of the random variable $\Theta$. If $\Theta$ has the p.d.f. $h(\theta) = \theta^{\alpha - 1}e^{-\theta/\beta}/\Gamma(\alpha)\beta^{\alpha}$, $0 < \theta < \infty$, zero elsewhere, where $\alpha > 0$, $\beta > 0$ are known numbers, find the Bayes' solution $w(y)$ for a point estimate of $\theta$.

6.43. Let $Y_n$ be the $n$th order statistic of a random sample of size $n$ from a distribution with p.d.f. $f(x|\theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere. Take the loss function to be $\mathcal{L}[\theta, w(y_n)] = [\theta - w(y_n)]^2$. Let $\theta$ be an observed value of the random variable $\Theta$, which has p.d.f. $h(\theta) = \beta\alpha^{\beta}/\theta^{\beta + 1}$, $\alpha < \theta < \infty$, zero elsewhere, with $\alpha > 0$, $\beta > 0$. Find the Bayes' solution $w(y_n)$ for a point estimate of $\theta$.

6.44. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution that is $b(1, \theta)$. Let the prior p.d.f. of $\Theta$ be a beta one with parameters $\alpha$ and $\beta$. Show that the posterior p.d.f. $k(\theta|x_1, x_2, \ldots, x_n)$ is exactly the same as $k(\theta|y)$ given in Example 1. This demonstrates that we get exactly the same result whether we begin with the statistic $Y$ or with the sample items. Hint. Note that $k(\theta|x_1, x_2, \ldots, x_n)$ is proportional to the product of the joint p.d.f. of $X_1, X_2, \ldots, X_n$ and the prior p.d.f. of $\theta$.

6.45. Let $Y_1$ and $Y_2$ be statistics that have a trinomial distribution with parameters $n$, $\theta_1$, and $\theta_2$. Here $\theta_1$ and $\theta_2$ are observed values of the random variables $\Theta_1$ and $\Theta_2$, which have a Dirichlet distribution with known parameters $\alpha_1$, $\alpha_2$, and $\alpha_3$ (see Example 1, Section 4.5). Show that the conditional distribution of $\Theta_1$ and $\Theta_2$ is Dirichlet and determine the conditional means $E(\Theta_1|y_1, y_2)$ and $E(\Theta_2|y_1, y_2)$.

6.46. Let $X$ be $n(0, 1/\theta)$. Assume that the unknown $\theta$ is a value of a random variable $\Theta$ which has a gamma distribution with parameters $\alpha = r/2$ and $\beta = 2/r$, where $r$ is a positive integer. Show that $X$ has a marginal $t$ distribution with $r$ degrees of freedom. This procedure is called one of compounding, and it may be used by a Bayesian statistician as a way of first presenting the $t$ distribution, as well as other distributions.

6.47. Let $X$ have a Poisson distribution with parameter $\theta$. Assume that the unknown $\theta$ is a value of a random variable $\Theta$ that has a gamma distribution with parameters $\alpha = r$ and $\beta = (1 - p)/p$, where $r$ is a positive integer and $0 < p < 1$. Show, by the procedure of compounding, that $X$ has a marginal distribution which is negative binomial, a distribution that was introduced earlier (Section 3.1) under very different assumptions.

6.48. In Example 1 let $n = 30$, $\alpha = 10$, and $\beta = 5$ so that $w(y) = (10 + y)/45$ is the Bayes' estimate of $\theta$. (a) If $Y$ has the binomial distribution $b(30, \theta)$, compute the risk $E\{[\theta - w(Y)]^2\}$. (b) Determine those values of $\theta$ for which the risk of part (a) is less than $\theta(1 - \theta)/30$, the risk associated with the maximum likelihood estimator $Y/n$ of $\theta$.

Chapter 7

Statistical Hypotheses

7.1 Some Examples and Definitions

The two principal areas of statistical inference are the areas of estimation of parameters and of tests of statistical hypotheses. The problem of estimation of parameters, both point and interval estimation, has been treated. In this chapter some aspects of statistical hypotheses and tests of statistical hypotheses will be considered. The subject will be introduced by way of example.

Example 1. Let it be known that the outcome $X$ of a random experiment is $n(\theta, 100)$. For instance, $X$ may denote a score on a test, which score we assume to be normally distributed with mean $\theta$ and variance 100.
Let us say that past experience with this random experiment indicates that B = 75. Suppose, owing possibly to some research in the area pertaining to this experiment, some changes are made in the method of performing this random experiment. It is then suspected that no longer does B = 75 but that now B > 75. There is as yet no formal experimental evidence that B > 75; hence the statement B > 75 is a conjecture or a statistical hypothesis. In admitting that the statistical hypothesis B > 75 may be false, we allow, in effect, the possibility that B :s; 75. Thus there are actually two statistical hypotheses. First, that the unknown parameter B :s; 75; that is, there has been no increase in B. Second, that the unknown parameter () > 75. Accordingly, the param- eter space is Q = {(); -co < () < co}. We denote the first of these hypotheses by the symbols Ho: B :s; 75 and the second by the symbols HI: B > 75. Since the values B > 75 are alternatives to those where () :s; 75, the hypothesis HI: B > 75 is called the alternative hypothesis. Needless to say, Ho could be called the alternative HI; however, the conjecture, here () > 75, that is 235
  • 124. 236 Statistical Hypotheses [Ch.7 Sec. 7.1] Some Examples and Definitions 237 function of Test 1, and the value of the power function at a parameter point is called the power of Test 1 at that point. Because X is n(8, 4), we have - (78 - 8) K 2(8) = Pr (X > 78) = 1 - N -2- . ( X - 8 75 - 8) (75 - 8) K I(8) = Pr -2- > -2- = 1 - N -2- . I 79 1 -l FIGURE 7.1 o ~(B) Some values of the power function of Test 2 are K 2(73) = 0.006, K 2(75) = 0.067, K 2(77) = 0.309, and K 2(79) = 0.691. That is, if 8 = 75, the proba- bility of rejecting H o: 8 ~ 75 is 0.067; this is much more desirable than the corresponding probability 1- that resulted from Test 1. However, if H0 is false and, in fact, 8 = 77, the probability of rejecting H o: 8 ~ 75 (and hence of accepting HI: 8 > 75) is only 0.309. In certain instances, this low prob- ability 0.309 of a correct decision (the acceptance of HI when HI is true) is objectionable. That is, Test 2 is not wholly satisfactory. Perhaps we can overcome the undesirable features of Tests 1 and 2 if we proceed as in Test 3. So, for illustration, we have, by Table III of Appendix B, the power at 8 = 75 to be K I(75) = 0.500. Other powers are K I(73) = 0.159, K I(77) = 0.841, and K I(79) = 0.977. The graph of K1(8) of Test 1 is depicted in Figure 7.1. Among other things, this means that, if 8 = 75, the probability of rejecting the hypothesis H o: 8 ~ 75 is 1-- That is, if 8 = 75 so that H o is true, the probability of rejecting this true hypothesis H0 is 1-- Many statisti- cians and research workers find it very undesirable to have such a high probability as -t assigned to this kind of mistake: namely the rejection of H o when Ho is a true hypothesis. Thus Test 1 does not appear to be a very satisfactory test. Let us try to devise another test that does not have this objectionable feature. We shall do this by making it more difficult to reject the hypothesis H0' with the hope that this will give a smaller probability of rejecting H o when that hypothesis is true. Test 2. Let n = 25. We shall reject the hypothesis Ho: 8 ~ 75 and accept the hypothesis HI: 8 > 75 if and only if x > 78. Here the critical region is e = {(Xl' ... , X 25); Xl + ... + X 25 > (25)(78)}. The power function of Test 2 is, because X is n(8, 4), Pr [(Xl' ... , X 25 ) E C] = Pr (X > 75). Obviously, this probability is a function of the parameter 8 and we shall denote it by K I(8). The function K 1(8) = Pr (X > 75) is called the power made by the research worker is usually taken to be the alternative hypothesis. In any case the problem is to decide which of these hypotheses is to be accepted. To reach a decision, the random experiment is to be repe~ted a number of independent times, say n, and the results observed. That IS, we consider a random sample Xl, X 2 , ••• , X; from a distribution that is n(8, 100), and we devise a rule that will tell us what decision t? make once the experimental values, say Xl' X 2, •• " X n, have been determmed. Such a rule is called a test of the hypothesis Ho: 8 ~ 75 against the alternative hypothesis HI: 8 > 75. There is no bound on the number of rules or tes~s that can be constructed. We shall consider three such tests. Our tests will be constructed around the following notion. We shall partition the sample space d into a subset e and its complement e*. If the experimental values of Xl' X 2, ... , X n, say Xv X2' ... , Xn, are such that the point (Xl' X2' •.. 
, X n) E C, we shall reject the hypothesis H o (accept the hypothesis HI)' If we have (Xl' X 2, •.• , X n) E e*, we shall accept the hypothesis H0 (reject the hypothesis HI)' Test 1. Let n = 25. The sample space d is the set {(xv X 2, ... , X 25); -00 < Xi < 00, i = 1,2, ... , 25}. Let the subset e of the sample space be e = {(xv X2' ... , X 25); Xl + X2 + ... + X 25 > (25)(75)}. We shall reject the hypothesis Ho if and only if our 25 experimental values are such that (Xl, X 2, ... , X 25) E c. If (xv X 2, .. " X25) is not an element of C, we shall accept the hypothesis H o. This subset e of the sample space that leads to the rejection of the hypothesis H 0: 8 ~ 75 is called the critical region 25 25 of Test 1. Now LXi> (25)(75) if and only if x > 75, where x = L xi/25. I I Thus we can much more conveniently say that we shall reject the hypothesis H o: 8 ~ 75 and accept the hypothesis Hs: 8 > 75 if and only if the experi- mentally determined value of the sample mean x is greater than 75. If x ~ 75, we accept the hypothesis H o: 8 ~ 75. Our test then amounts to this: We shall reject the hypothesis H o: 8 ~ 75 if the mean of the sample exceeds the maximum value of the mean of the distribution when the hypothesis H o is true. It would help us to evaluate a test of a statistical hypothesis if we knew the probability of rejecting that hypothesis (and hence of accepting the alternative hypothesis). In our Test 1, this means that we want to compute the probability
Test 3. Let us first select a power function $K_3(\theta)$ that has the features of a small value at $\theta = 75$ and a large value at $\theta = 77$. For instance, take $K_3(75) = 0.159$ and $K_3(77) = 0.841$. To determine a test with such a power function, let us reject $H_0\colon \theta \le 75$ if and only if the experimental value $\bar{x}$ of the mean of a random sample of size $n$ is greater than some constant $c$. Thus the critical region is $C = \{(x_1, x_2, \ldots, x_n)\colon x_1 + x_2 + \cdots + x_n > nc\}$. It should be noted that the sample size $n$ and the constant $c$ have not been determined as yet. However, since $\bar{X}$ is $n(\theta, 100/n)$, the power function is $K_3(\theta) = \Pr(\bar{X} > c)$. Equivalently, from Table III of Appendix B, we have
\[
K_3(\theta) = \Pr(\bar{X} > c) = 1 - N\!\left( \frac{c - \theta}{10/\sqrt{n}} \right).
\]
The conditions $K_3(75) = 0.159$ and $K_3(77) = 0.841$ require that
\[
1 - N\!\left( \frac{c - 75}{10/\sqrt{n}} \right) = 0.159, \qquad 1 - N\!\left( \frac{c - 77}{10/\sqrt{n}} \right) = 0.841.
\]
Equivalently,
\[
\frac{c - 75}{10/\sqrt{n}} = 1, \qquad \frac{c - 77}{10/\sqrt{n}} = -1.
\]
The solution to these two equations in $n$ and $c$ is $n = 100$, $c = 76$. With these values of $n$ and $c$, other powers of Test 3 are $K_3(73) = 0.001$ and $K_3(79) = 0.999$. It is important to observe that although Test 3 has a more desirable power function than those of Tests 1 and 2, a certain "price" has been paid: a sample size of $n = 100$ is required in Test 3, whereas we had $n = 25$ in the earlier tests.
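The power function of Test 3 is easy to evaluate with the standard normal distribution function. The short Python sketch below is ours; it checks the quoted power values for n = 100 and c = 76.

```python
from scipy.stats import norm

n, c = 100, 76
sigma = 10.0   # standard deviation of X, so Xbar has standard deviation 10/sqrt(n)

def K3(theta):
    # Power function of Test 3: probability that Xbar exceeds c when the mean is theta.
    return 1 - norm.cdf((c - theta) / (sigma / n**0.5))

for theta in (73, 75, 77, 79):
    print(theta, round(K3(theta), 3))
# prints 0.001, 0.159, 0.841, 0.999, as in the text
```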
Remark. Throughout the text we frequently say that we accept the hypothesis $H_0$ if we do not reject $H_0$ in favor of $H_1$. If this decision is made, it certainly does not mean that $H_0$ is true or that we even believe that it is true. All it means is, based upon the data at hand, that we are not convinced that the hypothesis $H_0$ is wrong. Accordingly, the statement "We accept $H_0$" would possibly be better read as "We do not reject $H_0$." However, because it is in fairly common use, we use the statement "We accept $H_0$," but read it with this remark in mind.

We have now illustrated the following concepts: (a) A statistical hypothesis. (b) A test of a hypothesis against an alternative hypothesis and the associated concept of the critical region of the test. (c) The power of a test. These concepts will now be formally defined.

Definition 1. A statistical hypothesis is an assertion about the distribution of one or more random variables. If the statistical hypothesis completely specifies the distribution, it is called a simple statistical hypothesis; if it does not, it is called a composite statistical hypothesis.

If we refer to Example 1, we see that both $H_0\colon \theta \le 75$ and $H_1\colon \theta > 75$ are composite statistical hypotheses, since neither of them completely specifies the distribution. If there, instead of $H_0\colon \theta \le 75$, we had $H_0\colon \theta = 75$, then $H_0$ would have been a simple statistical hypothesis.

Definition 2. A test of a statistical hypothesis is a rule which, when the experimental sample values have been obtained, leads to a decision to accept or to reject the hypothesis under consideration.

Definition 3. Let $C$ be that subset of the sample space which, in accordance with a prescribed test, leads to the rejection of the hypothesis under consideration. Then $C$ is called the critical region of the test.

Definition 4. The power function of a test of a statistical hypothesis $H_0$ against an alternative hypothesis $H_1$ is that function, defined for all distributions under consideration, which yields the probability that the sample point falls in the critical region $C$ of the test, that is, a function that yields the probability of rejecting the hypothesis under consideration. The value of the power function at a parameter point is called the power of the test at that point.

Definition 5. Let $H_0$ denote a hypothesis that is to be tested against an alternative hypothesis $H_1$ in accordance with a prescribed test. The significance level of the test (or the size of the critical region $C$) is the maximum value (actually supremum) of the power function of the test when $H_0$ is true.

If we refer again to Example 1, we see that the significance levels of Tests 1, 2, and 3 of that example are 0.500, 0.067, and 0.159, respectively.

An additional example may help clarify these definitions.

Example 2. It is known that the random variable $X$ has a p.d.f. of the form
\[
f(x; \theta) = \frac{1}{\theta} e^{-x/\theta}, \qquad 0 < x < \infty,
\]
zero elsewhere. It is desired to test the simple hypothesis $H_0\colon \theta = 2$ against the alternative simple hypothesis $H_1\colon \theta = 4$. Thus $\Omega = \{\theta\colon \theta = 2, 4\}$. A random sample $X_1$, $X_2$ of size $n = 2$ will be used. The test to be used is defined by taking
the critical region to be $C = \{(x_1, x_2)\colon 9.5 \le x_1 + x_2 < \infty\}$. The power function of the test and the significance level of the test will be determined.

There are but two probability density functions under consideration, namely, $f(x; 2)$ specified by $H_0$ and $f(x; 4)$ specified by $H_1$. Thus the power function is defined at but two points $\theta = 2$ and $\theta = 4$. The power function of the test is given by $\Pr[(X_1, X_2) \in C]$. If $H_0$ is true, that is, $\theta = 2$, the joint p.d.f. of $X_1$ and $X_2$ is
\[
f(x_1; 2)f(x_2; 2) = \tfrac{1}{4} e^{-(x_1 + x_2)/2}, \qquad 0 < x_1 < \infty, \quad 0 < x_2 < \infty,
\]
zero elsewhere, and
\[
\Pr[(X_1, X_2) \in C] = 1 - \int_0^{9.5} \int_0^{9.5 - x_2} \tfrac{1}{4} e^{-(x_1 + x_2)/2}\,dx_1\,dx_2 = 0.05,
\]
approximately. If $H_1$ is true, that is, $\theta = 4$, the joint p.d.f. of $X_1$ and $X_2$ is
\[
f(x_1; 4)f(x_2; 4) = \tfrac{1}{16} e^{-(x_1 + x_2)/4}, \qquad 0 < x_1 < \infty, \quad 0 < x_2 < \infty,
\]
zero elsewhere, and
\[
\Pr[(X_1, X_2) \in C] = 1 - \int_0^{9.5} \int_0^{9.5 - x_2} \tfrac{1}{16} e^{-(x_1 + x_2)/4}\,dx_1\,dx_2 = 0.31,
\]
approximately. Thus the power of the test is given by 0.05 for $\theta = 2$ and by 0.31 for $\theta = 4$. That is, the probability of rejecting $H_0$ when $H_0$ is true is 0.05, and the probability of rejecting $H_0$ when $H_0$ is false is 0.31. Since the significance level of this test (or the size of the critical region) is the power of the test when $H_0$ is true, the significance level of this test is 0.05.

The fact that the power of this test, when $\theta = 4$, is only 0.31 immediately suggests that a search be made for another test which, with the same power when $\theta = 2$, would have a power greater than 0.31 when $\theta = 4$. However, Section 7.2 will make clear that such a search would be fruitless. That is, there is no test with a significance level of 0.05 and based on a random sample of size $n = 2$ that has a greater power at $\theta = 4$. The only manner in which the situation may be improved is to have recourse to a random sample of size $n$ greater than 2.

Our computations of the powers of this test at the two points $\theta = 2$ and $\theta = 4$ were purposely done the hard way to focus attention on fundamental concepts. A procedure that is computationally simpler is the following. When the hypothesis $H_0$ is true, the random variable $X$ is $\chi^2(2)$. Thus the random variable $X_1 + X_2 = Y$, say, is $\chi^2(4)$. Accordingly, the power of the test when $H_0$ is true is given by
\[
\Pr(Y \ge 9.5) = 1 - \Pr(Y < 9.5) = 1 - 0.95 = 0.05,
\]
from Table II of Appendix B. When the hypothesis $H_1$ is true, the random variable $X/2$ is $\chi^2(2)$; so the random variable $(X_1 + X_2)/2 = Z$, say, is $\chi^2(4)$. Accordingly, the power of the test when $H_1$ is true is given by
\[
\Pr(X_1 + X_2 \ge 9.5) = \Pr(Z \ge 4.75) = \int_{4.75}^{\infty} \tfrac{1}{4} z e^{-z/2}\,dz,
\]
which is equal to 0.31, approximately.

Remark. The rejection of the hypothesis $H_0$ when that hypothesis is true is, of course, an incorrect decision or an error. This incorrect decision is often called a type I error; accordingly, the significance level of the test is the probability of committing an error of type I. The acceptance of $H_0$ when $H_0$ is false ($H_1$ is true) is called an error of type II. Thus the probability of a type II error is 1 minus the power of the test when $H_1$ is true. Frequently, it is disconcerting to the student to discover that there are so many names for the same thing. However, since all of them are used in the statistical literature, we feel obligated to point out that "significance level," "size of the critical region," "power of the test when $H_0$ is true," and "the probability of committing an error of type I" are all equivalent.

EXERCISES

7.1. Let $X$ have a p.d.f. of the form $f(x; \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, zero elsewhere, where $\theta \in \{\theta\colon \theta = 1, 2\}$.
To test the simple hypothesis $H_0\colon \theta = 1$ against the alternative simple hypothesis $H_1\colon \theta = 2$, use a random sample $X_1$, $X_2$ of size $n = 2$ and define the critical region to be $C = \{(x_1, x_2)\colon \tfrac{3}{4} \le x_1 x_2\}$. Find the power function of the test.

7.2. Let $X$ have a binomial distribution with parameters $n = 10$ and $p \in \{p\colon p = \tfrac{1}{4}, \tfrac{1}{2}\}$. The simple hypothesis $H_0\colon p = \tfrac{1}{2}$ is rejected, and the alternative simple hypothesis $H_1\colon p = \tfrac{1}{4}$ is accepted, if the observed value of $X_1$, a random sample of size 1, is less than or equal to 3. Find the power function of the test.

7.3. Let $X_1$, $X_2$ be a random sample of size $n = 2$ from the distribution having p.d.f. $f(x; \theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, zero elsewhere. We reject $H_0\colon \theta = 2$ and accept $H_1\colon \theta = 1$ if the observed values of $X_1$, $X_2$, say $x_1$, $x_2$, are such that
  • 127. 242 Statistical Hypotheses [Oh, 7 Sec. 7.2] Certain Best Tests 243 Here n = {B; B = 1, 2}. Find the significance level of the test and the power of the test when H0 is false. 7.4. Sketch, as in Figure 9.1, the graphs of the power functions of Tests 1, 2, and 3 of Example 1 of this section. 7.5. Let us assume that the life of a tire in miles, say X, is normally distributed with mean B and standard deviation 5000. Past experience indicates that 0 = 30,000. The manufacturer claims that the tires made by a new process have mean B > 30,000, and it is very possible that B = 35,000. Let us check his claim by testing Ho: 0 :0; 30,000 against Hl : B > 30,000. We shall observe n independent values of X, say Xl' •.• , X n , and we shall reject H o (thus accept H l ) if and only if x ;:::: c. Determine nand c so that the power function K(O) of the test has the values K(30,000) = 0.01 and K(35,000) = 0.98. 7.6. Let X have a Poisson distribution with mean B. Consider the simple hypothesis Ho: B = t and the alternative composite hypothesis H l : B < 1- Thus n = {B;O < 0:0; t}. Let Xl"",X12 denote a random sample of size 12 from this distribution. We reject H o if and only if the observed value of Y = Xl + ... + X12 :0; 2. If K(B) is the power function of the test, find the powers K(t), K(t), K(t), K(i), and K(fz). Sketch the graph of K(O). What is the significance level of the test? 7.2 Certain Best Tests In this section we require that both the hypothesis H o' which is to be tested, and the alternative hypothesis HI be simple hypotheses. Thus, in all instances, the parameter space is a set that consists of exactly two points. Under this restriction, we shall do three things: (a) Define a best test for testing H 0 against HI' (b) Prove a theorem that provides a method of determining a best test. (c) Give two examples. Before we define a best test, one important observation should be made. Certainly, a test specifies a critical region; but it can also be said that a choice of a critical region defines a test. For instance, if one is given the critical region C = {(Xl' X 2, X s); X~ + X~ + X~ ;:::: I}, the test is determined: Three random variables Xl' X 2 , X s are to be con- sidered; if the observed values are Xl> X 2, X s, accept H 0 if x~ + x~ + x~ < 1; otherwise, reject H o. That is, the terms "test" and "critical region" can, in this sense, be used interchangeably. Thus, if we define a best critical region, we have defined a best test. Let f(x; 0) denote the p.d.f. of a random variable X. Let Xl> X 2 , ... , X; denote a random sample from this distribution, and consider the two simple hypotheses H o: 0 = 0' and HI: 0 = Olf. Thus n = {O; 0 = 0', Olf}. We now define a best critical region (and hence a best test) for testing the simple hypothesis H o against the alternative simple hypothesis HI' In this definition the symbols Pr [(Xl> X 2 , ••• , X n) E C; HoJ and Pr [(Xl' X 2 , ••• , X n) E C; HlJ mean Pr [(Xl' X 2 , ••• , X n) E CJ when, respectively, H o and HI are true. Definition 6. Let C denote a subset of the sample space. Then C is called a best critical region of size a for testing the simple hypothesis H o: 0 = 0' against the alternative simple hypothesis HI: 0 = Olf if, for every subset A of the sample space for which Pr [(Xl> ... , X n) E A; HoJ = a: (a) Pr [(Xl> X 2 , • • • , X n) E C; HoJ = a. (b) Pr[(Xl,X2,···,Xn ) EC;HlJ ;:::: Pr[(Xl>X2 , ... ,Xn)EA;Hl J. This definition states, in effect, the following: First assume H o to be true. 
In general, there will be a multiplicity of subsets $A$ of the sample space such that $\Pr[(X_1, X_2, \ldots, X_n) \in A] = \alpha$. Suppose that there is one of these subsets, say $C$, such that when $H_1$ is true, the power of the test associated with $C$ is at least as great as the power of the test associated with each other $A$. Then $C$ is defined as a best critical region of size $\alpha$ for testing $H_0$ against $H_1$.

In the following example we shall examine this definition in some detail and in a very simple case.

Example 1. Consider the one random variable $X$ that has a binomial distribution with $n = 5$ and $p = \theta$. Let $f(x; \theta)$ denote the p.d.f. of $X$ and let $H_0\colon \theta = \tfrac{1}{2}$ and $H_1\colon \theta = \tfrac{3}{4}$. The following tabulation gives, at points of positive probability density, the values of $f(x; \tfrac{1}{2})$, $f(x; \tfrac{3}{4})$, and the ratio $f(x; \tfrac{1}{2})/f(x; \tfrac{3}{4})$.

x                          0        1        2         3         4         5
f(x; 1/2)                1/32     5/32    10/32     10/32      5/32      1/32
f(x; 3/4)              1/1024  15/1024  90/1024  270/1024  405/1024  243/1024
f(x; 1/2)/f(x; 3/4)       32     32/3     32/9     32/27     32/81    32/243

We shall use one random value of $X$ to test the simple hypothesis $H_0\colon \theta = \tfrac{1}{2}$ against the alternative simple hypothesis $H_1\colon \theta = \tfrac{3}{4}$, and we shall first assign the significance level of the test to be $\alpha = \tfrac{1}{32}$. We seek a best critical region
  • 128. 244 Statistical Hypotheses [Ch. 7 Sec. 7.2] Certain Best Tests 245 for each point (Xl> X 2, ••. , xn) E C. for each point (Xl> x2 , ••• , xn) E C*. let k be a positive number. Let C be a subset ofthe sample space such that: (a) L{0'; Xl> X 2, •.• , xn) L(()";Xl> X 2, ••• , xn ) ~ k, (c) a = Pr[(Xl>X2"",Xn) EC;Ho]. Then C is a best critical region of size a for testing the simple hypothesis H 0: 0 = 0' against the alternative simple hypothesis HI: 0 = 0". Proof. We shall give the proof when the random variables are of the continuous type. If C is the only critical region of size a, the theorem is proved. If there is another critical region of size a, denote it by A . For convenience, we shall let rit.f L (0; Xl> ••• , xn) dX1 ••. dXn be denoted by fR L(O). In this notation we wish to show that r L(O") ~ ! r L(O'). JAnc* k JAnc* i L(()") ~ ! r L(()'). cnA* k JcnA* J L(()") - r L(O") z ! i L(()') - ! i L(()'); C~ 1~ k~# k~C* (1) fc L(()") - t L(()") = I. L(()") + I. L(()") - f L(()") - f L(()") cnA CnA. AnC AnC. = I. L(()") - f L(()II). cnA· Anco Since C is the union of the disjoint sets C n A and C n A * and A is the union of the disjoint sets A n C and A n C*, we have However, by the hypothesis of the theorem, L(()") ~ (ljk)L(()') at each point of C, and hence at each point of C n A*; thus But L(O") ~ (ljk)L(O') at each point of C*, and hence at each point of A n C*; accordingly, "These inequalities imply that l04l4 = Pr(XEA2;H1 ) > Pr(XEA1;H1) = 10 124' It should be noted, in this problem, that the best critical region C = A 2 of size a = -f2 is found by including in C the point (or points) at which f(x; -t) is small in comparison withf(x; ,t). This is seen to be true once it is observed that the ratio f(x; -t)/f(x; ,t) is a minimum at x = 5. Accordingly, the ratio f(x; -t)/f(x; i), which is given in the last line of the above tabulation, provides us with a precise tool by which to find a best critical region C for certain given values of a. To illustrate this, take a = 3 62' When H o is true, each of the subsets {x; x = 0, I}, {x; x = 0, 4},{x; x = 1, 5}, {x; x = 4, 5}has probability measure 3~' By direct computation it is found that the best critical region of this size is {x; x = 4, 5}.This reflects the fact that the ratio f{x; -t)/f(x; i) has its two smallest values for x = 4 and x = 5. The power of this test, which has a = f2' is The preceding example should make the following theorem, due to Neyman and Pearson, easier to understand. It is an important theorem because it provides a systematic method of determining a best critical region. Neyman-Pearson Theorem. Let Xl' X 2 , • • • , X n , where n is a fixed positive integer, denote a random sample from a distribution that has p.d.]. f(x; ()). Then the joint p.d.f. of Xl> X 2 , ••• , X n is L((); Xl> X 2, ••• , xn) = f(x1 ; ())f(x2 ; ()) ••• f(xn ; ()). Let ()' and ()" be distinct fixed values of () so that n = {O; () = 0', Oil}, and of size a = l2' If Al = {x; X = O} and A 2 = {x; X = 5},then Pr (X E AI; H o) = Pr (X E A2 ; H o) = -i2- and there is no other subset A 3 of the space {x; x = 0, 1,2,3,4, 5}such that Pr (X E A3 ; Ho) = h. Then either Al or A 2 is the best critical region C of size a = :h: for testing Ho against HI' We note that Pr (X E AI; H o) = 3~ and that Pr (X E AI; HI) = ll24' Thus, if the set Al is used as a critical region of size a = -h, we have the intolerable situation that the probability of rejecting H o when HI is true (Ho is false) is much less than the probability of rejecting Ho when Ho is true. 
On the other hand, if the set A2 is used as a critical region, then Pr(X ∈ A2; H0) = 1/32 and Pr(X ∈ A2; H1) = 243/1024. That is, the probability of rejecting H0 when H1 is true is much greater than the probability of rejecting H0 when H0 is true. Certainly, this is a more desirable state of affairs, and actually A2 is the best critical region of size α = 1/32. The latter statement follows from the fact that, when H0 is true, there are but two subsets, A1 and A2, of the sample space, each of whose probability measure is 1/32, and from the fact that 243/1024 = Pr(X ∈ A2; H1) > Pr(X ∈ A1; H1) = 1/1024.
  • 129. 246 and, from Equation (1), we obtain Statistical Hypotheses [Ch. 7 Sec. 7.2] Certain Best Tests or 247 If this result is substituted in inequality (2), we obtain the desired result, -00 < x < 00. (l/v27T)nexp [-(~ (X, - 1)2)/2] = exp ( - ~ x, + ~). L(8'; Xl"", X n ) L(8"; Xl>"" X n) . _ 1 ((x - 8)2) f(x, 8) - V27T exp - 2 ' It is desired to test the simple hypothesis H 0: 8 = 8' = 0 against the alter- native simple hypothesis HI: 8 = 8" = 1. Now IF k > 0, the set of all points (xv X 2, •• " Xn ) such that exp ( - ~ X, + ~) :$ k Suppose that it is the first form, U l :$ Cl. Since 0' and 0" are given constants, Ul(Xl, X 2 , ••• , X n ; 0', 0") is a statistic; and if the p.d.f. of this statistic can be found when H o is true, then the significance level of the test of H 0 against HI can be determined from this distribution. That is, Moreover, the test may be based on this statistic; for, if the observed values of Xl' X 2 , · . · , X n are Xl> x2 , •• ·, X n, we reject H o (accept HI) if Ul(Xl, X 2,···, xn ) :$ Cl· A positive number k determines a best critical region C whose size is ex = Pr [(Xl> X 2 , ••• , X n) E C; HoJ for that particular k. It may be that this value of ex is unsuitable for the purpose at hand; that is, it is too large or too small. However, if there is a statistic ul (Xl' X 2 , ••. , X n), as in the preceding paragraph, whose p.d.I. can be determined when H o is true, we need not experiment with various values of k to obtain a desirable significance level. For if the distribution of the statistic is known, or can be found, we may determine [1 such that Pr [Ul(Xl, X 2 , ... , X n) :$ Cl; H oJ is a desirable significance level. An illustrative example follows. Example 2. Let Xl' X 2 • • • " X n denote a random sample from the distribution that has the p.d.f. = ex - ex = O. I L(O') AnC· = f L(O') + J L((J') - I L(O') - I L(O') cnA. CnA AnC AnC* = fc L(O') - LL(O') r L(O") - f L(O") ~ ~ [ r L(O') - i L(O')]' Jc A k JcnA' Anc' Remark. As stated in the theorem, conditions (a), (b), and (c) are suffi- cient ones for region C to be a best critical region of size a. However, they are also necessary. We discuss this briefly. Suppose there is a region A of size ex that does not satisfy (a) and (b) and that is as powerful at 8 = 8" as C, which satisfies (a), (b), and (c). Then expression (1) would be zero, since the power at 8"using A is equal to that using C. It can be proved that to have expression (1) equal zero A must be of the same form as C. As a matter of fact, in the continuous case, A and C would essentially be the same region; that is, they could differ only by a set having probability zero. However, in the discrete case, if Pr [L(8') = kL(8"); HoJ is positive, A and C could be different sets, but each would necessarily enjoy conditions (a), (b), and (c) to be a best critical region of size a. One aspect of the theorem to be emphasized is that if we take C to be the set of all points (Xl' X 2, ••• , xn) which satisfy L(O';Xl,X2 " " , xn) < k k 0 (0 " ) -, >, L ; Xl> X 2, ••• , X n then, in accordance with the theorem, C will be a best critical region. This inequality can frequently be expressed in one of the forms (where Cl and C2 are constants) IcL(O") - LL(O") ~ O. If the random variables are of the discrete type, the proof is the same, with integration replaced by summation. However, r L( 0') JcnA> (2)
  • 130. or, equivalently, is a best critical region. This inequality holds if and only if n n LXl ~ 2 - in k = C. I n n -2>1+2~lnk I Sec. 7.2] Certain Best Tests 249 distribution, nor, as a matter of fact, do the random variables X X X d l> 2, : .. , ': nee to be mu:ually stochastically independent. That is, if H o IS t~e simple hyp~thes.Is that the joint p.d.f, is g(xl , X 2, •• " x n ), and if HI IS the alternative simple hypothesis that the joint p.d.f, is h(x l , X 2, ••• : X n), then C is a best critical region of size IX for testing H 0 against HI If, for k > 0: Statistical Hypotheses [Ch, 7 248 n In this case, a best critical region is the set C = {(Xl> X2, ..• , xn) ; L: Xl ~ c}, I where C is a constant that can be determined so that the size of the critical fco 1 ( (x - 1)2) - Pr (X ~ cI ; HI) = Cl v27Tv1/n exp - 2(1/n) dx. For example, if n = 25 and if a is selected to be 0.05, then from Table III we find that Cl = 1.645/V25 = 0.329. Thus the power of this best test of Ho against HI is 0.05, when H o is true, and is There is another aspect of this theorem that warrants special mention. It has to do with the number of parameters that appear in the p.d.f. Our notation suggests that there is but one parameter. However, a careful review of the proof will reveal that nowhere was this needed or assumed. The p.d.f. may depend upon any finite number of param- eters. What is essential is that the hypothesis H o and the alternative hypothesis HI be simple, namely that they completely specify the distributions. With this in mind, we see that the simple hypotheses H o and HI do not need to be hypotheses about the parameters of a X = 0, 1,2, ... , X = 0, 1, 2, ... , e- l Ho:f(x) =-, xl Here = 0 elsewhere. = 0 elsewhere, against the alternative simple hypothesis Pr (Xl EC; H o) = 1 - Pr (Xl = 1,2; H o) = 0.448, (b)' g(xl> X 2, ••• , xn) h(x ) ~ k for (Xl> X 2, ••• , xn) E C*. l> X 2, ••• , Xn (c)' IX = Pr [(Xl> X 2 , ... , X n) E C; HoJ. An illustrative example follows. .Example 3. Let Xv . . " X n denote a random sample from a distribution ~hIch has ~ p.d..f. f(x) that is positive on and only on the nonnegative integers. It IS desired to test the simple hypothesis g(Xv··., Xn) _ e-n/(Xl! x2 !· · ·xnJ) h(Xl> ... , Xn) - (t)n(1_)x1 +x2+"'+xn = (2e-l)n2~x, n Fl (Xl!) I If k > 0, the set of points (xv X 2, ••• , X n ) such that (~Xl) In 2 - In [Q (x,!)] :s; In k - n In (2e- l) = C ~s a .best cr~tical region C. Consider the case of k = 1 and n = 1. The preced- 1ll~ me~uahty may be written 2x l/XI! s e/2. This inequality is satisfied by all points m the set C = {Xl; Xl = 0,3,4,5, ...}. Thus the power of the test when H 0 is true is (x -11)2] dx = f'" 1 e- w 2/2 dw = 0.999+, 2(g) -3.355 V27T f'" 1 [ -==--= exp 0.329 vz:;,V-is when HI is true. n region is a desired number a. The event L: X, ~ C is equivalent to the event I X ~ c/n = cl , say, so the test may be based upon the statistic X. If Ho is true, that is, 8 = 8' = 0, then X has a distribution that is n(O, l/n). For a given positive integer n, the size of the sample, and a given significance level a, the number Cl can be found from Table III in Appendix B, so that Pr (X ~ Cl; H o) = a. Hence, if the experimental values of Xl' X 2 , ••• , X; n were, respectively, Xv X2 , ••• , Xn, we would compute x = L: «[n. If x ~ CI, I the simple hypothesis H o: 8 = 8' = 0 would be rejected at the significance level a; if X < Cv the hypothesis Ho would be accepted. 
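For Example 2, the constant c1 = 0.329 and the power 0.999+ quoted above for n = 25 and α = 0.05 can be reproduced directly. A minimal sketch, assuming Python with scipy:

    from math import sqrt
    from scipy.stats import norm

    n, alpha = 25, 0.05
    c1 = norm.ppf(1 - alpha) / sqrt(n)           # about 1.645/5 = 0.329
    # under H1 the sample mean is n(1, 1/n)
    power = norm.sf(c1, loc=1, scale=1 / sqrt(n))
    print(round(c1, 3), round(power, 4))         # 0.329 and 0.9996 ("0.999+")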
The probability of rejecting H0 when H0 is true is α; the probability of rejecting H0 when H0 is false is the value of the power of the test at θ = θ″ = 1. That is,

    Pr(X̄ ≥ c1; H1) = ∫_{c1}^{∞} [1/(√(2π) √(1/n))] exp[−(x̄ − 1)²/(2(1/n))] dx̄.
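The significance level 0.448 and the power 0.625 quoted for Example 3 (with k = 1 and n = 1, so that H0 is accepted only when x = 1 or x = 2) can be checked the same way. A minimal sketch in plain Python (not part of the text):

    import math

    def f0(x):                 # H0: f(x) = e**(-1) / x!
        return math.exp(-1) / math.factorial(x)

    def f1(x):                 # H1: f(x) = (1/2)**(x + 1)
        return 0.5 ** (x + 1)

    # with k = 1 and n = 1 the critical region is C = {0, 3, 4, 5, ...},
    # so H0 is accepted only when x = 1 or x = 2
    alpha = 1 - (f0(1) + f0(2))
    power = 1 - (f1(1) + f1(2))
    print(round(alpha, 3), round(power, 3))      # 0.448 and 0.625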
  • 131. approximately, in accordance with Table I of Appendix B. The power of the test when H1 is true is given by Pr (Xl EC; H1) = 1 - Pr (Xl = 1,2; H1) = 1 - (! + t) = 0.625. EXERCISES 7.7. In Example 2 of this section, let the simple hypotheses read H o : (J = (J' = a and H1: (J = (J" = -1. Show that the best test of Ho against H1 may be carried out by use of the statistic X, and that if n = 25 and a = 0.05, the power of the test is 0.999+ when H1 is true. 7.8. Let the random variable X have the p.d.f. j(x; (J) = (1/(J)e- X'8 , a < x < 00, zero elsewhere. Consider the simple hypothesis H o: (J = (J' = 2 and the alternative hypothesis H1: (J = (J" = 4. Let Xl' X2 denote a random sample of size 2 from this distribution. Show that the best test of H o against H1 may be carried out by use of the statistic Xl + X2 and that the assertion in Example 2 of Section 7.1 is correct. 7.9. Repeat Exercise 7.8 when H1: (J = (J" = 6. Generalize this for every (J" > 2. 7.10. Let Xv X 2, ... , X10 be a random sample of size 10 from a normal distribution n(O, a2) . Find a best critical region of size a = 0.05 for testing H o : a2 = 1 against H1: a2 = 2. Is this a best critical region of size 0.05 for testing Ho: a2 = 1 against H1: a2 = 4? Against H1: a 2 = at > I? 7.11. If Xl' X 2, ... , X n is a random sample from a distribution having p.d.f. of the form j(x; (J) = (Jx8 - 1, a < x < 1, zero elsewhere, show that a best critical region for testing H o: (J = 1 against H 1: (J = 2 is C = { (Xl ' X2' ... , Xn); C :::; fr Xi}' .=1 7.12. Let Xl' X 2, ... , X1 0 be a random sample from a distribution that is n((Jl' (J2)' Find a best test of the simple hypothesis Ho: (J1 = (J~ = 0, (J2 = (J; = 1 against the alternative simple hypothesis H1: (J1 = (J~ = 1, (J2 = (J~ = 4. 7.13. Let Xl' X 2,.. ., X n denote a random sample from a normal distri- n bution n((J,100). Show that C = {(XVX2""'Xn);c:::; X = i:Xdn} is a best critical region for testing Ho: (J = 75 against H1: (J = 78. Find nand c so that Pr [(Xl' X 2, ... , X n) E C; HoJ = Pr (X ~ c; Ho) = 0.05 For example, K(2) = 0.05, K(4) = 0.31, and K(9.5) = 21e. It is known (Exercise 7.9) that C = {(Xl' x2) ; 9.5 :::; Xl + x2 < co} is a best critical region of size 0.05 for testing the simple hypothesis H o: (J = 2 against each simple hypothesis in the composite hypothesis H 1 : (J > 2. 251 a < x < 00, 1 f(x; (J) = '8 e>", Sec. 7.3] Uniformly Most Powerful Tests 7.14. Let Xl' X 2, . . ., X n denote a random sample from a distribution having the p.d.f.j(x; p) = pX(l - P)l-X, X = 0,1, zero elsewhere. Show that n C = {(xv' .. , xn); LXi:::; c} is a best critical region for testing H o: P = t 1 against H 1 : p = t. Use the central limit theorem to find nand c so that approximately Pr (~Xi s c; Ho) = 0.10 and Pr (~Xi :::; c; H1) = 0.80. 7.15. Let Xl' X 2, ... , X1 0 denote a random sample of size 10 from a Poisson distribution with mean (J. Show that the critical region C defined by 10 f Xi ;::: 3 is a best critical region for testing H o: (J = 0.1 against H 1: (J = 0.5. Determine, for this test, the significance level a and the power at (J = 0.5. 7.3 Uniformly Most Powerful Tests This section will take up the problem of a test of a simple hypothesis Ho against an alternative composite hypothesis H1 . We begin with an example. Example 1. Consider the p.d.f. = °elsewhere, of Example 2, Section 7.1. It is desired to test the simple hypothesis H o: (J = 2 against the alternative composite hypothesis H 1: (J > 2. Thus Q = {(J; (J ;::: 2}. 
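The power-function values K(2) = 0.05, K(4) = 0.31, and K(9.5) = 2/e quoted above can be reproduced by using the fact that X1 + X2 has a gamma distribution with α = 2 and β = θ (Exercise 7.25). A minimal sketch, assuming scipy:

    from scipy.stats import gamma

    def K(theta):
        # X1 + X2 has a gamma distribution with shape 2 and scale theta,
        # so K(theta) = Pr(X1 + X2 >= 9.5; theta)
        return gamma.sf(9.5, a=2, scale=theta)

    for theta in (2, 4, 9.5):
        print(theta, round(K(theta), 3))         # 0.05, 0.314, 0.736 (= 2/e)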
A random sample, X1, X2, of size n = 2 will be used, and the critical region is C = {(x1, x2); 9.5 ≤ x1 + x2 < ∞}. It was shown in the example cited that the significance level of the test is approximately 0.05 and that the power of the test when θ = 4 is approximately 0.31. The power function K(θ) of the test for all θ ≥ 2 will now be obtained. We have

    K(θ) = ∫∫_C (1/θ²) exp[−(x1 + x2)/θ] dx1 dx2 = [(θ + 9.5)/θ] e^{−9.5/θ},  θ ≥ 2,

and Pr[(X1, X2, ..., Xn) ∈ C; H1] = Pr(X̄ ≥ c; H1) = 0.90, approximately.

The preceding example affords an illustration of a test of a simple hypothesis H0 that is a best test of H0 against every simple hypothesis
  • 132. 252 Statistical Hypotheses [eh.7 Sec. 7.3] Uniformly Most Powerful Tests 253 in the alternative composite hypothesis HI· We now define a critical region, when it exists, which is a best critical region for testing a simple hypothesis H 0 against an alternative composite hypothesis HI- It seems desirable that this critical region should be a best critical region for testing H o against each simple hypothesis in HI- That is, the power function of the test that corresponds to this critical region should be at least as great as the power function of any other test with the same significance level for every simple hypothesis in HI- Definition 7. The critical region C is a uniformly most powerful critical region of size ex for testing the simple hypothesis H 0 against an alternative composite hypothesis HI if the set C is a best critical region of size ex for testing H 0 against each simple hypothesis in HI· A test defined by this critical region C is called a uniformly most powerful test, with significance level ex, for testing the simple hypothesis H 0 against the alternative composite hypothesis HI- As will be seen presently, uniformly most powerful tests do not always exist. However, when they do exist, the Neyman-Pearson theorem provides a technique for finding them. Some illustrative examples are given here. Example 2. Let Xv X 2 , • _ ., X; denote a random sample from a distri- bution that is n(O, 8), where the variance 8 is an unknown positive number. It will be shown that there exists a uniformly most powerful test with significance level ex for testing the simple hypothesis Ho: 8 = 8', where 8' is a fixed positive number, against the alternative composite hypothesis HI: 8> 8'. Thus n = {8; 8 C: 8'}. The joint p.d.f. of Xv X 2 , · _., X; is The set C = {(xv X2, , xn); ~ xf C: c} is then a best critical region for 1 testing the simple hypothesis H o: 8 = 8' against the simple hypothesis 8 = 8". It remains to determine c so that this critical region has the desired size ex. If H o is true, the random variable i. Xfl8' has a chi-square distribu- 1 tion with n degrees of freedom. Since ex = Pr (~ Xfl8' C: c]8'; H 0)' c]8' may be read from Table II in Appendix Band c determined. Then C = n {(xv X 2, ••• , xn) ; t xf C: c} is a best critical region of size ex for testing H o: 8 = 8' against the hypothesis 8 = 8". Moreover, for each number 8" greater than 8', the foregoing argument holds. That is, if 8'" is another number greater than 8', then C = {(Xl' ... , xn) ; ~ xf C: c} is a best critical 1 region of size ex for testing H o: 8 = 8' against the hypothesis 8 = 8"'. Accord- n ingly, C = {(Xl' ... , xn ); L xf C: c}is a uniformly most powerful critical region 1 of size ex for testing H 0: 8 = 8' against HI: 8 > 8'. If Xv x 2, .. " X n denote the experimental values of Xl' X 2 , • . • , X n , then Ho: 8 = 8' is rejected at the significance level ex, and HI: 8 > 8' is accepted, if ~ xf C: c; otherwise, 1 Ho: 8 = 8' is accepted. If in the preceding discussion we take n = 15, ex = 0.05, and 8' = 3, then here the two hypotheses will be H o: 8 = 3 and HI: 8 > 3. From Table II, c/3 = 25 and hence c = 75. . Example 3. Let Xv X 2 , ••• , X n denote a random sample from a distribution that is n(8, 1), where the mean 8 is unknown. It will be shown that there is no uniformly most powerful test of the simple hypothesis H o: 8 = 8', where 8' is a fixed number, against the alternative composite hypothesis HI: 8 =F 8'. Thus n = {8; -00 < 8 < co}. Let 8" be a number not equal to 8'. 
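For Example 2, the constant c of the uniformly most powerful critical region can be obtained from the chi-square distribution; with n = 15, α = 0.05, and θ′ = 3 this reproduces c/3 = 25 and c = 75. A minimal sketch, assuming scipy:

    from scipy.stats import chi2

    n, alpha, theta_prime = 15, 0.05, 3.0
    # under H0 the statistic sum(X_i**2) / theta_prime is chi-square with n d.f.
    c_over_theta = chi2.ppf(1 - alpha, df=n)
    print(round(c_over_theta, 1), round(theta_prime * c_over_theta, 1))   # 25.0 and 75.0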
Let θ″ represent a number greater than θ′, and let k denote a positive number. Let C be the set of points where

    (θ″/θ′)^{n/2} exp[−((θ″ − θ′)/(2θ′θ″)) Σ_{i=1}^{n} x_i²] ≤ k,

that is, the set of points where

    Σ_{i=1}^{n} x_i² ≥ [2θ′θ″/(θ″ − θ′)] [(n/2) ln(θ″/θ′) − ln k] = c.

Let k be a positive number and consider

    {(1/2π)^{n/2} exp[−Σ_{i=1}^{n} (x_i − θ′)²/2]} / {(1/2π)^{n/2} exp[−Σ_{i=1}^{n} (x_i − θ″)²/2]} ≤ k.

The preceding inequality may be written as

    (θ″ − θ′) Σ_{i=1}^{n} x_i ≥ (n/2)[(θ″)² − (θ′)²] − ln k.
  • 133. 254 Statistical Hypotheses [Ch. 7 Sec. 7.3] Uniformly Most Powerful Tests 255 This last inequality is equivalent to ~ n (()" ()') In k L.. x, :2: "2 + - ()" _ ()" 1 distribution with mean 10(). Thus, with () = 0.1 so that the mean of Y is 1, the significance level of the test is Pr (Y :2: 3) = 1 - Pr (Y s 2) = 1 - 0.920 = 0.080. 10 If the uniformly most powerful critical region defined by LX, ?:: 4 is used, 1 the significance level is provided ()" > ()', and it is equivalent to ~ n (()" ()') In k L..Xl ~ "2 + - ()" _ ()' 1 a = Pr (Y :2: 4) 1 - Pr (Y ~ 3) = 1 - 0.981 = 0.019. if ()" < ()'. The first of these two expressions defines a best critical region for testing H 0: () = ()' against the hypothesis () = ()" provided that ()" > ()', while the second expression defines a best critical region for testing H 0: 8 = 8' against the hypothesis 8 = 8" provided that 8" < 8'. That is, a best critical region for testing the simple hypothesis against an alternative simple hypothesis, say 8 = 8' + 1, will not serve as a best critical region for testing H o: () = ()' against the alternative simple hypothesis () = ()' - 1, say. By definition, then, there is no uniformly most powerful test in the case under consideration. It should be noted that had the alternative composite hypothesis been either H 1: () > ()' or H 1: () < ()', a uniformly most powerful test would exist in each instance. Example 4. In Exercise 7.15, the reader is asked to show that if a random sample of size n = 10 is taken from a Poisson distribution with mean (), the 10 critical region defined by L X, :2: 3 is a best critical region for testing H 0: () = 1 0.1 against H 1 : 8 = 0.5. This critical region is also a uniformly most powerful one for testing H a: () = 0.1 against H 1 : () > 0.1 because, with ()" > 0.1, (0.1)LAe- 10(0.l)/(x 1! x2 !· . ·xn!) < k (()")L:Xle-10(O")/(X1! x2!· . ·xn!) - is equivalent to ( 0.1) L:x, -10(0.1- 8") k ()" e ~ . The preceding inequality may be written as (~x,)(ln 0.1 - In ()") s In k + 10(0.1 - ()") or, since 8" > 0.1, equivalently as n In k + 1 - 108" 2:Xl?:: In 0 1 - In ()" . 1 • 10 . . Y ~X h P' I Of course, Lx, :2: 3 is of the latter form. The statistic = L., I as a Olsson 1 1 If a significance level of about a = 0.05, say, is desired, most statisticians would use one of these tests; that is, they would adjust the significance level to that of one of these convenient tests. However, a significance level of 10 10 C( = 0.05 can be achieved exactly by rejecting Ho if LX, :2: 4 or if LX, = 3 1 1 and if an auxiliary independent random experiment resulted in "success," where the probability of success is selected to be equal to 0.050 - 0.019 31 0.080 - 0.019 = 61· This is due to the fact that, when () = 0.1 so that the mean of Y is 1, Pr (Y :2: 4) + Pr (Y = 3 and success) = 0.019 + Pr (Y = 3) Pr (success) = 0.019 + (0.061)-H = 0.05. The process of performing the auxiliary experiment to decide whether to reject or not when Y = 3 is sometimes referred to as a randomized test. Remarks. Not many statisticians like randomized tests in practice, because the use of them means that two statisticians could make the same assumptions, observe the same data, apply the same test, and yet make different decisions. Hence they usually adjust their significance level so as not to randomize. As a matter of fact, many statisticians report what are commonly called p-values. For illustrations, if in Example 4 the observed Y is y = 4, the p-value is 0.019; and if it is y = 3, the p-value is 0.080. 
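The probabilities appearing in Example 4 and in the remark on randomized tests and p-values can be verified from the Poisson distribution of Y. A minimal sketch, assuming scipy:

    from scipy.stats import poisson

    mu = 1.0                            # mean of Y = X1 + ... + X10 when theta = 0.1
    p3 = poisson.sf(2, mu)              # Pr(Y >= 3; H0), the p-value when y = 3
    p4 = poisson.sf(3, mu)              # Pr(Y >= 4; H0), the p-value when y = 4
    print(f"{p3:.3f} {p4:.3f}")         # 0.080 0.019

    # randomized test of exact size 0.05: reject if Y >= 4; if Y == 3, reject
    # with probability pi, chosen so that the overall level is 0.05
    pi = (0.05 - p4) / poisson.pmf(3, mu)
    print(round(pi, 3))                              # about 0.51 (the text's 31/61)
    print(round(p4 + pi * poisson.pmf(3, mu), 3))    # 0.05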
That is, the p-value is the observed "tail" probability of a statistic being at least as extreme as the particular observed value when H0 is true. Hence, more generally, if Y = u(X1, X2, ..., Xn) is the statistic to be used in a test of H0 and if a uniformly most powerful critical region is of the form u(x1, x2, ..., xn) ≤ c, an observed value u(x1, x2, ..., xn) = d would mean that the

    p-value = Pr(Y ≤ d; H0).

That is, if G(y) is the distribution function of Y = u(X1, X2, ..., Xn), provided that H0 is true, the p-value is equal to G(d) in this case. However, G(Y), in the continuous case, is uniformly distributed on the unit interval,
  • 134. 256 Statistical Hypotheses [Ch. 7 Sec. 7.4] Likelihood Ratio Tests 257 so an observed value G(d) ::; 0.05 would be equivalent to selecting c so that and observing that d ::; c. There is a final remark that should be made about uniformly most powerful tests. Of course, in Definition 7, the word uniformly is associated with e, that is, C is a best critical region of size a for testing H 0: () = ()o against all evalues given by the composite alternative HI' However, suppose that the form of such a region is Then this form provides uniformly most powerful critical regions for all attainable a values by, of course, appropriately changing the value of c. That is, there is a certain uniformity property, also associated with a, that is not always noted in statistics texts. EXERCISES 7.16. Let X have the p.d.f. f(x; e) = eX(1 - ell-X, X = 0, 1, zero else- where. We test the simple hypothesis H o: e= t against the alternative composite hypothesis HI: e < t by taking a random sample of size 10 and rejecting H 0: e = t if and only if the observed values Xl> Xz, ... , Xl O of the 10 sample items are such that .L x, ::; 1. Find the power function K(e), 0 < 1 e::; t, of this test. 7.17. Let X have a p.d.f. of the formf(x; e) = l/e,O < X < e, zero else- where. Let Y1 < Yz < Y3 < Y4 denote the order statistics ot a random sample of size 4 from this distribution. Let the observed value of Y4 be Y4' We reject H o: () = 1 and accept HI: e i= 1 if either Y4 ::; ! or Y4 2 1. Find the power function K(()), 0 < (), of the test. 7.18. Consider a normal distribution of the form n(e, 4). The simple hypothesis H o: e= 0 is rejected, and the alternative composite hypothesis HI: e > 0 is accepted if and only if the observed mean xof a random sample of size 25 is greater than or equal to t. Find the power function K(e), 0 ::; e, of this test. 7.19. Consider the two independent normal distributions n(fLl' 400) and n(fLz, 225). Let e= fLl - fLz. Let x and fj denote the observed means of two independent random samples, each of size n, from these t'NO distributions. We reject H o: e= 0 and accept HI: () > 0 if and only if x - fj ;:::: c. If K(e) is the power function of this test, find nand c so that K(O) = 0.05 and K(lO) = 0.90, approximately. 7.20. If, in Example 2 of this section, H o: e = e', where e' is a fixed posi- n tive number, and HI: e < e', show that the set {(Xl, X2, ••• , X n); L: xF ::; c}is 1 a uniformly most powerful critical region for testing H o against HI' 7.21. If, in Example 2 of this section, H o: e= e', where ()' is a fixed positive number, and HI: () i= e', show that there is no uniformly most powerful test for testing H 0 against HI' 7.22. Let Xl' Xz, " " XZ5 denote a random sample of size 25 from a normal distribution n((), 100). Find a uniformly most powerful critical region of size a = 0.10 for testing H o: () = 75 against HI: e > 75. 7.23. Let Xl' X 2 , ••• , X; denote a random sample from a normal distribution n(e, 16). Find the sample size n and a uniformly most powerful test of H o: () = 25 against HI: e < 25 with power function K(e) so that approximately K(25) = 0.10 and K(23) = 0.90. 7.24. Consider a distribution having a p.d.f. of the form f(x; e) = eX(1 - ell-X, X = 0, 1, zero elsewhere. Let H o: e = -fo and HI: e > z~. Use the central limit theorem to determine the sample size n of a random sample so that a uniformly most powerful test of H o against HI has a power function K(e), with approximately K(zlo) = 0.05 and K(fo) = 0.90. 7.25. 
Illustrative Example 1 of this section dealt with a random sample of size n = 2 from a gamma distribution with a = 1, f3 = e. Thus the moment-generating function of the distribution is (1 - ()t)-1 t < l/e, e ;:::: 2. Let Z = Xl + X z. Show that Z has a gamma distribution with ex = 2, f3 = (). Express the power function K(e) of Example 1 in terms of a smgle integral. Generalize this for a random sample of size n. 7.26. Let X have the p.d f. f(x; e) = eX (1 - W- x , X = 0, 1, zero else- where. We test H o: () = ! against HI: e < ! by taking a random sample 5 Xl> X z, ... , X 5 of size n = 5 and rejecting Hoif Y = .L X, is observed to be 1 less than or equal to a constant c. (a) Show that this is a uniformly most powerful test. (b) Find the significance level when c = 1. (c) Find the significance level when c = O. (d) By using a randomized test, modify the tests given in part (b) and part (c) to find a test with significance level a = -l'2. 7.4 Likelihood Ratio Tests The notion of using the magnitude of the ratio of two probability density functions as the basis of a best test or of a uniformly most powerful test can be modified, and made intuitively appealing, to provide a method of constructing a test of a composite hypothesis
  • 135. 258 Statistical Hypotheses [Ch, 7 Sec. 7.4] Likelihood Ratio Tests 259 against an alternative composite hypothesis or of constructing a test of a simple hypothesis against an alternative composite hypothesis when a uniformly most powerful test does not exist. This method leads to tests called likelihood ratio tests. A likelihood ratio test, as just remarked, is not necessarily a uniformly most powerful test, but it has been proved in the literature that such a test often has desirable properties. A certain terminology and notation will be introduced by means of an example. Example 1. Let the random variable X be n(8v 82) and let the param- eter space be 0 = {(8v 82 ) ; -00 < 81 < 00,0 < 82 < oo}, Let the composite hypothesis be H o: 81 = 0, 82 > 0, and let the alternative composite hypoth- esis be H1 : 81 =1= 0, 82 > O. The set w = {(81, 82); 81 = 0,0 < 82 < co} is a subset of 0 and will be called the subspace specified by the hypothesis Ho- Then, for instance, the hypothesis H o may be described as H o: (81, 82) E w. It is proposed that we test H o against all alternatives in H 1 . Let Xv X 2 , ••• , X n denote a random sample of size n > 1 from the distribution of this example. The joint p.d.f. of Xv X 2 , ••• , X; is, at each point in 0, At each point (8v 82 ) E co, the joint p.d.f. of Xv X 2 , ••• , X n is [ n ] 1 n/2 L: x~ L(O, 82; Xv ... , x n) = (27T8J exp - ~82 = L(w). The joint p.d.f., now denoted by L(w), is not completely specified, since 82 may be any positive number; nor is the joint p.d.f., now denoted by L(O), completely specified, since 81 may be any real number and 82 any positive number. Thus the ratio of L(w) to L(O) could not provide a basis for a test of H o against H 1 . Suppose, however, that we modify this ratio in the follow- ing manner. We shall find the maximum of L(w) in w, that is, the maximum of L(w) with respect to 82 , And we shall find the maximum of L(O) in 0; that is, the maximum of L(O) with respect to 81 and 82 , The ratio of these maxima will be taken as the criterion for a test of H o against H 1 . Let the maximum of L(w) in w be denoted by L(w) and let the maximum of L(O) in 0 be denoted by L(Q). Then the criterion for the test of Ho against H1 is the likelihood ratio Since L(w) and L(O) are probability density functions, , ~ 0; and since w is a subset of 0, , ::; 1. In our example the maximum, L(w), of L(w) is obtained by first setting n dIn L(w) n f x~ =--+- d82 282 28~ n equal to zero and solving for 82 , The solution for 82 is L: xfln, and this 1 number maximizes L(w). Thus the maximum is ( n ) 1 n/2 L: x~ L(w) = exp _ 1 ( 2~~x1Jn ) 2 ~x~/n ( _1)n/2 = 2:~X~ On the other hand, by using Example 4, Section 6.1, the maximum, L(O), of L(O) is obtained by replacing 81 and 82 by i «[n = x and i (Xl - x)2jn, 1 1 respectively. That is, [ n ] • 1 n/2 L: (Xl - x)2 L(O) = n exp - ~ [2~,f(x, - - . 2fix, - ')'/n [ -1 ] n/2 27T ~~:l -X)2 • Thus here [ n ] n/2 _ f (Xl - X)2 A - • n L: X~ 1 n n Because L: x~ = L: (Xl - x)2 + nx2, Amay be written 1 1 1 A = . {I + [nx2j~ (x, - x)2]f/2 Now the hypothesis H o is 81 = 0, 82 > O. If the observed number x were zero, the experiment tends to confirm H o. But if x = 0 and i x~ > 0, then 1
  • 136. 260 Statistical Hypotheses [Ch. 7 Sec. 7.4] Likelihood Ratio Tests 261 n A = 1. On the other hand, if x and nx2/'L (Xl - X)2 deviate considerably 1 from zero, the experiment tends to negate Ho. Now the greater the deviation n of nx2/'L (x, - X)2 from zero, the smaller A becomes. That is, if A is used 1 as a test criterion, then an intuitively appealing critical region for testing H0 is a set defined by 0 ::; A ::; Ao, where Ao is a positive proper fraction. Thus we reject Ho if A ::; Ao. A test that has the critical region A ::; Ao is a likelihood ratio test. In this example A ::; Ao when and only when J" -y'n xl '" -y'(0 - 1)(A,"" - 1) ~ c. 'L (Xl - x)2/(n - 1) 1 If Ho : (}1 = 0 is true, the results in Section 6.3 show that the statistic vn (X - 0) has a t distribution with n - 1 degrees of freedom. Accordingly, in this example the likelihood ratio test of H 0 against HI may be based on a T statistic. For a given positive integer n, Table IV in Appendix B may be used (with n - 1 degrees of freedom) to determine the number c such that IX = Pr [Jt(X v X 2 , ••• , X n )! ~ c; HoJ is the desired significance level of the test. If the experimental values of Xl' X 2 , · · · , X n are, respectively, Xl' X 2, ••• , X n, then we reject Ho if and only if It(xv X 2, ••• , xn) I ~ c. If, for instance, n = 6 and ex = 0.05, then from Table IV, c = 2.571. The preceding example should make the following generalization easier to read: Let Xl> X 2 , • • • , X; denote n mutually stochastically independent random variables having, respectively, the probability density functions f,(x,; 8l> 82 , ••• , 8m), i = 1, 2, ... , n. The set that consists of all parameter points (8l> 82 , ••• , 8m) is denoted by 0, which we have called the parameter space. Let w be a subset of the parameter space O. We wish to test the (simple or composite) hypothesis H0: (8l> 82 , •.• , 8m ) E w against all alternative hypotheses. Define the likelihood functions n L(w) = flfi(X,; 8l> 82 " " , 8m), ,=1 and Let L(w) and L(Q) be the maxima, which we assume to exist, of these two likelihood functions. The ratio of L(w) to L(Q) is called the likeli- hood ratio and is denoted by L(w) A(Xl> X2 , ••• , xn) = A = --. L(Q) Let Ao be a positive proper function. The likelihood ratio test principle states that the hypothesis Ho: (81, 82" " , 8m) Ew is rejected if and only if A(Xl> X2 , •.• , xn) = A ::; Ao. ~he. function A defines a random variable A(Xl> X 2, ..• , X n), and the significance level of the test is given by IX = Pr [A(Xl> X 2 , ••• , X n) ::; Ao; Hol The likelihood ratio test principle is an intuitive one. However the princip~e does lead to the same test, when testing a simple hypothesis H o agamst an alternative simple hypothesis HI, as that given by the Neyman-Pearson theorem (Exercise 7.29). Thus it might be expected that a test based on this principle has some desirable properties. An example of the preceding generalization will be given. Example 2. Let the stochastically independent random variables X and Y have distributions that are n((}l> ()3) and n(()2' ()3), where the means (}1 and ()2 and common variance ()3 are unknown. Then Q = {(() () ()). 11 2, 3' -00 < (}1 < 00, -00 < ()2 < 00, 0 < ()3 < oo]. Let Xl> X 2 , ••• , X; and Yl> Y 2 , ••• , Y m denote independent random samples from these distributions. The. hypothesis Hi: (}l = ()2' unspecified, and ()3 unspecified, is to be tested agamst all alternatives. Then w = {((}l> ()2' ()3); -00 < (}l = ()2 < 00, o < ()3 < co}. 
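Before continuing with Example 2, the algebra of Example 1 can be checked numerically: the likelihood ratio λ and the T statistic satisfy λ = [1 + T²/(n − 1)]^{−n/2}, so λ ≤ λ0 if and only if |T| ≥ c. A minimal sketch, using a synthetic sample (the data and seed are illustrative, not from the text) and assuming numpy and scipy:

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.3, scale=2.0, size=6)   # synthetic sample, n = 6

    n = x.size
    xbar = x.mean()
    ss = ((x - xbar) ** 2).sum()

    lam = (1 + n * xbar ** 2 / ss) ** (-n / 2)   # likelihood ratio of Example 1
    T = np.sqrt(n) * xbar / np.sqrt(ss / (n - 1))
    print(lam, (1 + T ** 2 / (n - 1)) ** (-n / 2))   # the two agree
    print(t.ppf(0.975, df=n - 1))                    # 2.571, the c for n = 6, alpha = 0.05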
Here X1, X2, ..., Xn, Y1, Y2, ..., Ym are n + m > 2 mutually stochastically independent random variables having the likelihood functions

    L(ω) = (1/(2πθ3))^{(n+m)/2} exp{−[Σ_{i=1}^{n} (x_i − θ1)² + Σ_{i=1}^{m} (y_i − θ1)²]/(2θ3)}

and

    L(Ω) = (1/(2πθ3))^{(n+m)/2} exp{−[Σ_{i=1}^{n} (x_i − θ1)² + Σ_{i=1}^{m} (y_i − θ2)²]/(2θ3)}.

If ∂ln L(ω)/∂θ1 and ∂ln L(ω)/∂θ3
  • 137. 262 are equated to zero, then (Exercise 7.30) Statistical Hypotheses [Ch. 7 Sec. 7.4] Likelihood Ratio Tests and U v U 2, and w' maximize L(0.). The maximum is 263 (1) A _ (e- 1)(n+mJ/2 L(~l) - 27TW' ' so that The solutions for 81 and 83 are, respectively, n m 2: Xi + 2: s, U = 1 1 n+m and n m 2: (Xi - U)2 + 2: (Yi - U)2 W = 1 1 n+m and U and w maximize L(w). The maximum is ( e- 1 ) (n+mJ/2 L(w) = - 27TW L(w) (W')(n+mJ/2 '(x1, · · · , Xn, Y1'···' Ym) = , = L(Q) = W . The random variable defined by ,2/Cn+mJ is n m 2: {Xi - [(nX + mY)/(n + m)]}2 + 2: {Y, - [(nX + mY)/(n + m)]}2 1 1 Now i (Xi - nX + mY)2 = i [(Xi _X) + (X _ nX + mY)]2 1 n+m 1 n+m = i (Xi - X)2 + n(X _ nX + mY)2 1 n + m and are equated to zero, then (Exercise 7.31) In like manner, if oIn L(0.) 081 ' oIn L(0.) 082 ' oIn L(0.) 083 I (Yl - nX + mY)2 = i [(Yi _ Y) + (Y _ nX + mY)]2 1 n+m 1 n+m = i (Yl - Y)2 + m(Y _ nX + mY)2. 1 n + m (2) The solutions for 8v 82 , and 83 are, respectively, n 2: Xi U 1 = _1_, n m 2: v. U 2 = _1_, m But n(X _ nX + mY)2 = m 2n (X _ Y)2 n + m (n + m)2 and m(Y _ nX + mY)2 = n 2m (X _ Y)2. n + m (n + m)2 Hence the random variable defined by ,2/(n+mJ may be written n m 2: (Xi - X)2 + 2: (Yi - Y)2 1 1 n m 2: (Xi - X)2 + 2: (Yi - Y)2 + [nm/(n + m)J(X _ Y)2 1 1 1 1 [nm/(n + m)](X - Y)2 + n m 2: (Xl - X)2 + 2: (Yi - Y)2 1 1
  • 138. 264 Statistical Hypotheses [Ch, 7 Sec. 7.4] Likelihood Ratio Tests 265 vnX J~ (Xi - X)2/(n - 1) Vnx/a J~ (Xi - X)2/[a2(n - 1)] Here WI = VnX/a is n(vn8l/a, 1), VI = ~ (Xi - X)2/a2 is x2(n - 1), ! and WI and VI are stochastically independent. Thus, if 81 of. 0, we see, in accordance with the definition, that t(X!> ... , X n) has a noncentral t distribution with n - 1 degrees of freedom and noncentrality param- eter 81 = vn8l /a. In Example 2 we had In the light of this definition, let us reexamine the statistics of the examples of this section. In Example 1 we had n+m-2 J nm (X - V) n+m T = ---;==~===== n+m-2 (n + m - 2) + T2 has, in accordance with Section 6.4, a t distribution with n + m - 2 degrees of freedom. Thus the random variable defined by A2/(n + m) is The test of H0 against all alternatives may then be based on a t distribution with n + m - 2 degrees of freedom. The likelihood ratio principle calls for the rejection of H 0 if and only if A :::; Ao < 1. Thus the significance level of the test is If the hypothesis H o: (}l = (}2 is true, the random variable ex = Pr [(A(Xl , . · · , X n , Yl , · •• , Y m) s Ao;Hol However, A(Xl , ... , X n, Y v .. " Y m) :::; Ao is equivalent to ITI ~ c, and so T = W 2 , vV2/(n + m - 2) ex = Pr(ITI ~ c;Ho)· where For given values of nand m, the number c is determined from Table IV in the Appendix (with n + m - 2 degrees of freedom) in such a manner as to yield a desired ex. Then H 0 is rejected at a significance level ex if and only if ItI ~ c, where t is the experimental value of T. If, for instance, n = 10, m = 6, and ex = 0.05, then c = 2.145. In each of the two examples of this section it was found that the likelihood ratio test could be based on a statistic which, when the hypothesis H Q is true, has a t distribution. To help us compute the powers of these tests at parameter points other than those described by the hypothesis H Q, we turn to the following definition. Definition 8. Let the random variable W be n(8, 1); let the random variable V be X2 (r), and Wand V be stochastically independent. The quotient W T=--= VV/r is said to have a noncentral t distribution with r degrees of freedom and noncentrality parameter 8. If I) = 0, we say that T has a central t distribution. /1f; m - / W2 = - - (X - Y) a n+m and V2 = [~(Xi - X)2 + ~ (Yi - Y)2]/a2. Here W2 is n[Vnm/(n + m)(Ol - (2)/a, 1J, V2 is X2(n + m - 2), and W2 and V2 are stochastically independent. Accordingly, if 01 of. °2, T has a noncentral t distribution with n + m - 2 degrees of freedom and noncentrality parameter 82 = vnm/(n + m)(Ol - (2)/a. It is interest- ing to note that 81 = vn8l /a measures the deviation of °1 from 01 = 0 in units of the standard deviation a/vn of X. The noncentrality parameter 82 = vnm/(n + m)(O! - (2)/a is equal to the deviation of °1 - O 2 from 81 - O 2 = 0 in units of the standard deviation aV(n + m)/nm of X - Y. There are various tables of the noncentral t distribution, but they are much too cumbersome to be included in this book. However, with the aid of such tables, we can determine the power functions of these tests as functions of the noncentrality parameters.
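Although tables of the noncentral t distribution are not included in this book, such probabilities are readily computed. The following minimal sketch (assuming scipy; the effect size used is an arbitrary illustration, not a value from the text) evaluates the power of the two-sided test of Example 2 with n = 10, m = 6, and α = 0.05:

    from math import sqrt
    from scipy.stats import t, nct

    n, m, alpha = 10, 6, 0.05
    df = n + m - 2
    c = t.ppf(1 - alpha / 2, df)                 # 2.145, as quoted above

    # power of the two-sided test |T| >= c when (theta1 - theta2)/sigma = 1
    # (this effect size is an arbitrary illustration)
    delta = sqrt(n * m / (n + m)) * 1.0          # noncentrality parameter
    power = nct.sf(c, df, delta) + nct.cdf(-c, df, delta)
    print(round(c, 3), round(power, 2))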
  • 139. 266 Statistical Hypotheses [Ch. 7 Sec. 7.4] Likelihood Ratio Tests 267 In Example 2, in testing the equality of the means of two inde- pendent normal distributions, it was assumed that the unknown variances of the distributions were equal. Let us now consider the problem of testing the equality of these two unknown variances. Example 3. We are given the stochastically independent random sam- ples Xl>"" X n and Y 1 , •• " Ym from the independent distributions, which are n(8l> 83 ) and n(82, 84) , respectively. We have Q = {(8l> 82, 83, 84) ; -00 < 8l> 82 < 00,0 < 83,84 < co]. The hypothesis H o: 83 = 84 , unspecified, with 81 and 82 also unspecified, is to be tested against all alternatives. Then w = {(81 , 82 , 83 , 84 ) ; -00 < 8l> 82 < 00,0 < 83 = 84 < co}. It is easy to show (see Exercise 7.34) that the statistic defined by A = L(w)/L(Q) is a function of the statistic n L: (Xt - X)2/(n - 1) F = -'~~------- L: (Y, - Y)2/(m - 1) 1 If 83 = 84 , this statistic F has an F distribution with n - 1 and m - 1 degrees of freedom. The hypothesis that (8l , 82 , 83 , 84) E w is rejected if the computed F ~ Cl or if the computed F ~ C2• The constants Cl and C2 are usually selected so that, if 83 = 84 , where (;(1 is the desired significance level of this test. EXERCISES 7.27. In Example 1 let n = 10, and let the experimental values of the 10 d . d random variables yield x = 0.6 and L: (xt - X)2 = 3.6. If the test enve 1 in that example is used, do we accept or reject Ho: 81 = 0 at the 5 per cent significance level? B 7.28. In Example 2 let n = m = 8, x = 75.2, Y = 78.6, L: (Xt - X)2 = 1 71.2, I (Yt - y)2 = 54.8. If we use the test derived in that example, do we 1 accept or reject H o: 81 = 82 at the 5 per cent significance level? 7.29. Show that the likelihood ratio principle leads to the same test, when testing a simple hypothesis H o against an alternative simple hypothesis H 1 • as that given by the Neyman-Pearson theorem. Note that there are only two points in Q. 7.30. Verify Equations (1) of Example 2 of this section. 7.31. Verify Equations (2) of Example 2 of this section. 7.32. Let Xl' X 2 , ., ., X; be a random sample from the normal distribu- tion n(8, 1). Show that the likelihood ratio principle for testing H o: 8 = 8', where 8' is specified, against H 1 : 8 f= 8' leads to the inequality Ix - 8'1 ~ c. Is this a uniformly most powerful test of Ho against H1? 7.33. Let Xl' X 2 , ••• , X; be a random sample from the normal distribu- tion n(8l , 82 ) , Show that the likelihood ratio principle for testing H o: 82 = 8; specified, and 81 unspecified, against HI: 82 f= 8;, 81 unspecified, leads to a n n test that rejects when L: (xt - X)2 ~ Cl or L: (Xt - X)2 ~ C2, where Cl < C2 1 1 are selected appropriately. 7.34. Let Xl"'" X n and Yl , ... , Ym be random samples from the independent distributions n(8l , 83) and n(82, 84) , respectively. (a) Show that the likelihood ratio for testing H o: 81 = 82,83 = 84 against all alternatives is given by {[~ (Xt - U)2 + ~ (Yt _ U)2]/(m + n)yn+m)/2' where u = (nx + my)/(n + m). (b) Show that the likelihood ratio test for testing Ho: 83 = 84 , 81 and 82 unspecified, against HI: 83 f= 84 , 81 and 82 unspecified, can be based on the random variable n L: (Xt - X)2/(n - 1) F = "':;~'--------- L: (Yt - Y)2/(m - 1) 1 7.35. Let n independent trials of an experiment be such that Xl' X 2, ... 
, Xk are the respective numbers of times that the experiment ends in the mutually exclusive and exhaustive events A1, A2, ..., Ak. If p_i = P(A_i) is constant throughout the n trials, then the probability of that particular sequence of trials is L = p1^{x1} p2^{x2} ··· pk^{xk}.
(a) Recalling that p1 + p2 + ··· + pk = 1, show that the likelihood ratio for testing H0: p_i = p_{i0} > 0, i = 1, 2, ..., k, against all alternatives is given by
  • 140. 268 (b) Show that Statistical Hypotheses [eh. 7 -co < Xi < co. _ 2In >. = i XI(XI - ,nfOI)2 /=1 (nPI) where P; is between POI and xtln. Hint. Expand InPIO in a Taylor's serieswith the remainder in the term involving (PIO - xtln)2. (c) For large n, argue that xtl(np;)2 is approximated by l/(nPlo) and hence _ 2In >. ~ i (x/ - npOI) 2 , when H is true. 1=1 npOI Chapter 8 Other Statistical Tests 8.1 Chi-Square Tests In this section we introduce tests of statistical hypotheses called chi-square tests. A test of this sort was originally proposed by Karl Pearson in 1900, and it provided one of the earlier methods of statistical inference. Let the random variable Xi be n(fLi' aT), i = 1, 2, ... ,n, and let X:1> X 2 , ••• , X n be mutually stochastically independent. Thus the joint p.d.f. of these variables is 1 [ 1 ~ (Xi - fLi)2] exp - - L., - - - a1a2 ... an(21T)n/2 2 1 ai ' The random variable that is defined by the exponent (apart from the n coefficient -!) is .L (Xi - fLi)2/a t,and this random variable is X2(n). 1 In Chapter 12 we shall generalize this joint normal distribution of probability to n random variables that are stochastically dependent and we shall call the distribution a multivariate normal distribution. It will then be shown that a certain exponent in the joint p.d.f. (apart from a coefficient of -!) defines a random variable that is X2 (n). This fact is the mathematical basis of the chi-square tests. Let us now discuss some random variables that have approximate chi-square distributions. Let Xl be b(n, PI)' Since the random variable Y = (Xl - npI)/VnPI(l - PI) has, as n -+ co, a limiting distribution that is n(O, 1), we would strongly suspect that the limiting distribution 269
  • 141. 270 Other Statistical Tests [eh. 8 Sec. 8.1] Chi-Square Tests 271 of Z = y2 is X2 (1). This is, in fact, the case, as will now be shown. If Gn(y) represents the distribution function of Y, we know that n - (Xl + ... + X k- l) and let Pk = 1 - (PI + ... + Pk-l)' Define Qk-l by Accordingly, since N(y) is everywhere continuous, Qk-l = i [(Xi - nPto)2] 1 npto has an approximate chi-square distribution with k - 1 degrees of free- dom. Since, when H 0 is true, nptois the expected value of Xi' one would feel intuitively that experimental values of Qk-1 should not be too large if H o is true. With this in mind, we may use Table II of the Appendix, with k - 1 degrees of freedom, and find c so that Pr (Qk-l ~ c) = IX, where IX is the desired significance level of the test. If, then, the hy- pothesis H o is rejected when the observed value of Qk-l is at least as It is proved in a more advanced course that, as n ~ 00, Qk-l has a limiting distribution that is X2(k - 1). If we accept this fact, we can say that Qk-l has an approximate chi-square distribution with k - 1 degrees of freedom when n is a positive integer. Some writers caution the user of this approximation to be certain that n is large enough that each npi' i = 1, 2, ... , k, is at least equal to 5. In any case it is important to realize that Qk-1 does not have a chi-square distribution, only an approximate chi-square distribution. The random variable Qk-l may serve as the basis of the tests of certain statistical hypotheses which we now discuss. Let the sample space d of a random experiment be the union of a finite number k of mutually disjoint sets Al> A 2,... , A k. Furthermore, let P(Ai) = Pi, i = 1,2, ... , k, where Pk = 1 - PI -'" - Pk-l> so that Pi is the probability that the outcome of the random experiment is an element of the set Ai' The random experiment is to be repeated n independent times and Xi will represent the number of times the outcome is an element of the set Ai' That is, Xl> X 2, ... , X k = n - Xl - ... - X k - l are the frequencies with which the outcome is, respectively, an element of Al> A 2, ... , A k. Then the joint p.d.f. of Xl> X 2,... , X k- l is the multinomial p.d.f. with the parameters n, PI' ... , Pk-l' Consider the simple hypothesis (concerning this multinomial p.d.f.) H o:PI = PlO' P2 = P20"'" Pk-l = Pk-l.O (Pk = PkO = 1 - PlO - ... - Pk-l,O), where PlO' ... , Pk-l,O are specified numbers. It is desired to test H o against all alternatives. If the hypothesis H 0 is true, the random variable -00 < y < 00, lim Gn(y) = N(y), n-+ 00 If we change the variable of integration in this last integral by writing w2 = v, then lim Hn(z) = N(VZ) - N( - VZ) n-+oo Hn(z) = Pr (Z :::; z) = Pr (- VZ :::; Y :::; Vz) = Gn(vz) - Gn[( - VZ) - ]. lim H (z) - (Z 1 1/2-1 -v/2 d n-+oo n - Jo r(t)2l/2 v e v, provided that z ~ 0. If z < 0, then lim Hn(z) = 0. Thus lim Hn(z) is n-+ co n....00 equal to the distribution function of a random variable that is X2(1). This is the desired result. Let us now return to the random variable Xl which is b(n, PI)' Let X 2 = n - Xl and let P2 = 1 - h. If we denote y2 by Ql instead of Z, we see that Ql may be written as where N(y) is the distribution function of a distribution that is n(O, 1). Let Hn(z) represent, for each positive integer n, the distribution function of Z = y2. 
Thus, if z ≥ 0,

    Hn(z) = Pr(Z ≤ z) = Pr(−√z ≤ Y ≤ √z) = Gn(√z) − Gn[(−√z)−].

With Q1 denoting Y², Q1 may be written as

    Q1 = (X1 − np1)²/[np1(1 − p1)] = (X1 − np1)²/(np1) + (X1 − np1)²/[n(1 − p1)] = (X1 − np1)²/(np1) + (X2 − np2)²/(np2),

because (X1 − np1)² = (n − X2 − n + np2)² = (X2 − np2)². Since Q1 has a limiting chi-square distribution with 1 degree of freedom, we say, when n is a positive integer, that Q1 has an approximate chi-square distribution with 1 degree of freedom. This result can be generalized as follows.

Let X1, X2, ..., X_{k−1} have a multinomial distribution with parameters n, p1, ..., p_{k−1}, as in Section 3.1. As a convenience, let Xk =
  • 142. 272 Other Statistical Tests [Ch, 8 Sec. 8.1] Chi-Square Tests 273 (6 - 5)2 (18 - 15)2 (20 - 25)2 (36 - 35)2 = 64 = 1 83 5 + 15 + 25 + 35 35" z = 1,2, ... , k, is a function of the unknown parameters ft and a2 . Suppose that we take a random sample Yl> .• " Yn of size n from this distribution. If we let Xi denote the frequency of Ai' i = 1,2, ... , k, so that Xl + ... + X k n, the random variable approximately. From Table II, with 4 - 1 = 3 degrees of freedom, the value corresponding to a 0.025 significance level is c = 9.35. Since the observed value of Q3 is less than 9.35, the hypothesis is accepted at the (approximate) 0.025 level of significance. Thus far we have used the chi-square test when the hypothesis H o is a simple hypothesis. More often we encounter hypotheses H oin which the multinomial probabilities PI> P2' .. " Pk are not completely specified by the hypothesis H o. That is, under H o' these probabilities are functions of unknown parameters. For illustration, suppose that a certain random variable Y can take on any real value. Let us partition the space {y; -00 < Y < co] into k mutually disjoint sets AI> A 2 , ••• , A k so that the events AI' A2 , "', Ak are mutually exclusive and exhaustive. Let H °be the hypothesis that Y is n(ft, a2 ) with ft and a2 unspecified. Then each cannot be computed once Xl> ... , X k have been observed, since each Pi' and hence Qk-I' is a function of the unknown parameters ft and a2. There is a way out of our trouble, however. We have noted that Qk-I is a function of ft and a2 • Accordingly, choose the values of ft and a2 that minimize Qk-I' Obviously, these values depend upon the ob- served Xl = Xl' .. " X k = X k and are called minimum chi-square estimates of ft and a2 . These point estimates of ft and a2 enable us to compute numerically the estimates of each Pi' Accordingly, if these values are used, Qk-1 can be computed once Yl' Y2' ... , Yn' and hence Xl' X 2 , ••• , X k , are observed. However, a very important aspect of the fact, which we accept without proof, is that now Qk-I is approxi- mately X2 (k - 3). That is, the number of degrees of freedom of the limiting chi-square distribution of Qk-1 is reduced by one for each parameter estimated by the experimental data. This statement applies not only to the problem at hand but also to more general situations. Two examples will now be given. The first of these examples will deal Since 15.6 > 11.1, the hypothesis P(A i) = i. i = 1, 2, ... ,6, is rejected at the (approximate) 5 per cent significance level. Example 2. A point is to be selected from the unit interval {x; 0 < x < I} by a random process. Let Al = {x; 0 < x ~ t}, A 2 = {x; t < x ~ t}, A 3 = {x; t < x ~ i}, and A 4 = {x; i < x < I}. Let the probabilities Pi' i = 1, 2, 3, 4, assigned to these sets under the hypothesis be determined by the p.d.f. 2x, 0 < x < 1, zero elsewhere. Then these probabilities are, respectively, f I /4 1 PIo = 0 2x dx = T6' (13 - 10)2 (19 - 10)2 (11 - 10)2 10 + 10 + 10 (8 - 10)2 (5 - 10)2 (4 - 10)2 _ 15 6 + 10 + 10 + 10 - ., Thus the hypothesis to be tested is that Pi'P2'P3' and P4 = 1 - Pi - P2 - P3 have the preceding values in a multinomial distribution with k = 4. This hypothesis is to be tested at an approximate 0.025 significance level by repeating the random experiment n = 80 independent times under the same conditions. Here the npiO' i = 1, 2, 3, 4, are, respectively,S, 15, 25, and 35. Suppose the observed frequencies of AI> A 2 , A 3 , and A 4 to be 6, 18, 20, and 4 36, respectively. 
Then the observed value of Q3 = L (Xi - npiO)2/(nPiO) is 1 great as c, the test of H o will have a significance level that is approxi- mately equal to a. Some illustrative examples follow. Example 1. One of the first six positive integers is to be chosen by a random experiment (perhaps by the cast of a die). Let Ai = {x; X = i}, i = 1,2, ... , 6. The hypothesis Ho: P(A i ) = Pio = i, i = 1,2, ... , 6, will be tested, at the approximate 5 per cent significance level, against all alternatives. To make the test, the random experiment will be repeated, under the same conditions, 60 independent times. In this example k = 6 and npiO = 60(~) = 10, i = 1, 2, ... , 6. Let Xi denote the frequency with which the random experiment terminates with the outcome in Ai' i = 1, 2, ... , 6, a and let Q5 = L (Xi - 10)2/10. If Ho is true, Table II, with k - 1 = 6 - 1 = I 5 degrees of freedom, shows that we have Pr (Q5 ~ 11.1) = 0.05. Now suppose that the experimental frequencies of AI' A 2 , ••• , A aare, respectively, 13, 19, 11, 8, 5, and 4. The observed value of Q5 is
  • 143. 274 Other Statistical Tests [Ch.8 Sec. 8.1] Chi-Square Tests 275 with the test of the hypothesis that two multinomial distributions are the same. Remark. In many instances, such as that involving the mean fL and the variance a2 of a normal distribution, minimum chi-square estimates are difficult to compute. Hence other estimates, such as the maximum likelihood »<: estimates p- = Y and a2 = 52, are used to evaluate Pi and Qk-1' In general, Qk-1 is not minimized by maximum likelihood estimates, and thus its computed value is somewhat greater than it would be if minimum chi-square estimates were used. Hence, when comparing it to a critical value listed in the chi-square table with k - 3 degrees of freedom, there is a greater chance of rejecting than there would be if the actual minimum of Qk-1 is used. Accordingly, the approximate significance level of such a test will be some- what higher than that value found in the table. This modification should be kept in mind and, if at all possible, each h should be estimated using the frequencies Xl"'" X k rather than using directly the items Y v Y 2, · · ·, Yn of the random sample. Example 3. Let us consider two independent multinomial distributions with parameters nj, P1j, P2j, .. " Pkj, j = 1, 2, respectively. Let Xij' i = 1, 2, .. " k, j = 1, 2, represent the corresponding frequencies. If n1 and n2 are large, the random variable value of this random variable is at least as great as an appropriate number from Table II, with k - 1 degrees of freedom. The second example deals with the subject of contingency tables. Example 4. Let the result of a random experiment be classified by two attributes (such as the color of the hair and the color of the eyes). That is, one attribute of the outcome is one and only one of certain mutually exclusive and exhaustive events, say A v A 2,.. " A a; and the other attribute of the outcome is also one and only one of certain mutually exclusive and exhaustive events, say B1, B 2, ... , B b. Let Pij = P(A i n B j), i = 1,2, ... , a; j = 1,2, ... , b. The random experiment is to be repeated n independent times and Xii will denote the frequency of the event Ai n B j. Since there are k = ab such events as Ai n B j , the random variable has an approximate chi-square distribution with ab - 1 degrees of freedom, provided that n is large. Suppose that we wish to test the independence of the A attribute and the B attribute; that is, we wish to test the hypothesis Ho: P(Ai n B j) = P(Ai)P(Bj), i = 1,2, ... , a; j = 1,2, ... , b. Let us denote P(Ai) by A and P(Bj) by P'j; thus b A = L: Pif' f=l a P·f = L: Pii' i=l is the sum of two stochastically independent random variables, each of which we treat as though it were X2 (k - 1); that is, the random variable is approximately X2(2k - 2). Consider the hypothesis and b a b a 1 = L: L: Plf = L: P·f = L: A· f=l i=l f=l i=l Then the hypothesis can be formulated as H o:hj = Pi,P'j, i = 1,2, ... , a; j = 1,2, ... , b. To test H o, we can use Qab-1 with Pii replaced by AP·f· But if A, i = 1, 2, ... , a, and P'j, j = 1, 2, ... , b, are unknown, as they frequently are in the applications, we cannot compute Qab-1 once the fre- quencies are observed. In such a case we estimate these unknown parameters by where each Pi! = Pi2' i = 1, 2, .. " k, is unspecified. Thus we need point estimates of these parameters. The maximum likelihood estimator of Pl1 = P'2' based upon the frequencies Xlj' is (Xi! + X i2)/(n1 + n2), i = 1,2, ... , k. 
Note that we need only k − 1 point estimates, because we have a point estimate of p_{k1} = p_{k2} once we have point estimates of the first k − 1 probabilities. In accordance with the fact that has been stated, the random variable

    Σ_{j=1}^{2} Σ_{i=1}^{k} {X_{ij} − n_j[(X_{i1} + X_{i2})/(n1 + n2)]}² / {n_j[(X_{i1} + X_{i2})/(n1 + n2)]}

has an approximate χ² distribution with 2k − 2 − (k − 1) = k − 1 degrees of freedom. Thus we are able to test the hypothesis that two multinomial distributions are the same; this hypothesis is rejected when the computed value of the statistic is at least as great as an appropriate number from Table II, with k − 1 degrees of freedom. For the contingency tables of Example 4, the estimates are

    p̂_{i·} = x_{i·}/n,  where x_{i·} = Σ_{j=1}^{b} x_{ij},  i = 1, 2, ..., a.

Since Σ_i p̂_{i·} = Σ_j p̂_{·j} = 1, we have estimated only a − 1 + b − 1 = a + b − 2 parameters. So if these estimates are used in Q_{ab−1}, with p_{ij} = p̂_{i·} p̂_{·j},
  • 144. Other Statistical Tests [eb. 8 8.3. A die was cast n = 120 independent times and the following data resulted: 276 then, according to the rule that has been stated in this section, the random variable Sec. 8.1] Chi-Square Tests 277 ~ = 1,2, ... , 7, 8. If we consider these data to be observations from two independent multi- nomial distributions with k = 5, test, at the 5 per cent significance level, the hypothesis that the two distributions are the same (and hence the two teaching procedures are equally effective). 8.6. Let the result of a random experiment be classified as one of the mutually exclusive and exhaustive ways At> A 2 , A 3 and also as one of the mutually exclusive and exhaustive ways Bl , B2 , B3 , B4 • Two hundred independent trials of the experiment result in the following data: If we use a chi-square test, for what values of b would the hypothesis that the die is unbiased be rejected at the 0.025 significance level? 8.4. Consider the problem from genetics of crossing two types of peas. The Mendelian theory states that the probabilities of the classifications (a) ro~nd and yellow, (b) wrinkled and yellow, (c) round and green, and ~d) wnnkled and green are -l6' -{6, 1~' and -l6, respectively. If, from 160 mde~end~nt observations, the observed frequencies of these respective classlficatlOns are 86, 35, 26, and 13, are these data consistent with the Mendelian theory? That is, test, with IX = 0.01, the hypothesis that the respective probabilities are n.-, 1~6, 1~, and h. 8.5. Two different teaching procedures were used on two different groups of students. Each group contained 100 students of about the same ability. At the end of the term, an evaluating team assigned a letter grade to each student. The results were tabulated as follows. 6 4O-b 100 100 Total 5 20 F " 16 4 20 6 13 24 17 28 D 15 21 27 3 20 21 27 19 32 29 c Grade 2 20 10 11 6 B 25 18 b I 15 " 9 Group A Frequency Spots up PIO = II2~27T exp [ (x2(4~)2] dx, This hypothesis (concerning the multinomial p.d.f. with k = 8)is to be tested at the 5 per cent level of significance, by a chi-squar~ test. If the observec, frequencies of the sets Ai, i = 1,2, ... , 8, are, respectively, 60, 96, 140, 210, 172, 160, 88, and 74, would Ho be accepted at the (approximate) 5 per cent level of significance? EXERCISES 8.1. A number is to be selected from the interval {x; °< x < 2} by a random process. Let Ai = {x; (i - 1)/2 < x.-:::; i12}, i =: .1,.2, 3, and let A 4 = {x; t < x < 2}. A certain hypothesis assigns probabilities Pia to these sets in accordance with Pia = tl (1)(2 - x) dx, i = 1, 2, 3, 4. This hypothesis (concerning the multinomial p.d.f. with k = 4) is to be tested, at the 5 ~er' cent level of significance, by a chi-square test. If the observed frequencies of the sets Ai> i = 1, 2, 3, 4, are, respectively, 30, 30, 10, 10, would H o be accepted at the (approximate) 5 per cent level of significance? 8.2. Let the following sets be defined. Al = {x; -OCJ < x :::; O}, Ai = {x; i - 2 < x :::; i-I}, i = 2, ... , 7, and As = {a:; 6 < x < co}. ~ certain hypothesis assigns probabilities Pia to these sets Ai m accordance with ~ ~ [Xlj - n(Xi.ln)(X)n)J2 i~ 6-1 n(Xdn)(X.iln) In each of the four examples of this section we have indicated that the statistic used to test the hypothesis H o has an approximate chi- square distribution, provided that n is sufficiently large and H 0 is true. 
With pij replaced by p̂i·p̂·j, the statistic Q_{ab-1} has an approximate chi-square distribution with ab - 1 - (a + b - 2) = (a - 1)(b - 1) degrees of freedom, provided that H0 is true. The hypothesis H0 is then rejected if the computed value of this statistic exceeds the constant c, where c is selected from Table II so that the test has the desired significance level α. To compute the power of any of these tests for values of the parameters not described by H0, we need the distribution of the statistic when H0 is not true. In each of these cases, the statistic has an approximate distribution called a noncentral chi-square distribution. The noncentral chi-square distribution will be discussed in Section 8.4.
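The contingency-table computation can likewise be sketched numerically; the counts below are hypothetical and serve only to illustrate the (a - 1)(b - 1) degrees of freedom. The same routine applied to a 2 x k table of counts carries out the test that two multinomial distributions are the same.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3 x 4 table of counts for the A and B classifications
# (a = 3 rows, b = 4 columns), with n = 200 trials in all.
table = np.array([[10, 21, 15, 6],
                  [11, 27, 21, 13],
                  [ 6, 19, 27, 24]])

q, p_value, df, expected = chi2_contingency(table, correction=False)
print(q, df, p_value)      # df = (a - 1)(b - 1) = 6
```

The routine forms the estimated expected counts n p̂i· p̂·j internally and returns the chi-square statistic together with its degrees of freedom and approximate p-value.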
  • 145. 278 Other Statistical Tests [Ch, 8 Sec. 8.2] The Distributions of Certain Quadratic Forms 279 Test, at the 0.05 significancelevel, the hypothesis of independence of the A attribute and the B attribute, namely H o: P(A, n Bj ) = P(A,)P(Bj ) , i = 1, 2, 3 and j = 1, 2, 3, 4, against the alternative of dependence. 8.7. A certain genetic model suggests that the probabilities of a particular trinomial distribution are, respectively, Pl = p, P2 = 2P(1 - P), and Pa = (1 - P)2, where 0 < P < 1. If Xl> X 2, Xa represent the respective fre- quencies in n independent trials, explain how we could check on the adequacy of the genetic model. 8.2 The Distributions of Certain Quadratic Forms A homogeneous polynomial of degree 2 in n variables is called a quadratic form in those variables. If both the variables and the co- efficients are real, the form is called a real quadratic form. Only real quadratic forms will be considered in this book. To illustrate, the form XI + X lX2 + X~ is a quadratic form in the two variables Xl and X 2 ; the form XI + X~ + X~ - 2XlX2 is a quadratic form in the three variables Xl' X 2, and X 3 ; but the form (Xl - 1)2 + (X2 - 2)2 = XI + X~ - 2Xl - 4X2 + 5 is not a quadratic form in Xl and X 2 , although it is a quadratic form in the variables Xl - 1 and X 2 - 2. Let X and S2 denote, respectively, the mean and the variance of a random sample Xl> X 2 , ••• , X; from an arbitrary distribution. Thus ~ ~l(Xl - x, + X 2 n+'" + X n )2 nS2 = L. (Xl - X)2 = L. 1 n - 1 (X2 X2 X2) =-n- 1+ 2+"'+ n is a quadratic form in the n variables Xl> X 2 , ••• , X n . If the sample arises from a distribution that is n(/-L' ( 2 ) , we know that the random variable nS2/a2 is X2(n - 1) regardless of the value of /-L. This fact proved useful in our search for a confidence interval for a2 when /-L is unknown. It has been seen that tests of certain statistical hypotheses require a statistic that is a quadratic form. For instance, Example 2, Section 7.3, n . made use of the statistic .L X~, which is a quadratic form in the van- 1 abIes Xl> X 2 , ••. , X n• Later in this chapter, tests of other statistical hypotheses will be investigated, and it will be seen that functions of statistics that are quadratic forms will be needed to carry out the tests in an expeditious manner. But first we shall make a study of the distribution of certain quadratic forms in normal and stochastically independent random variables. The following theorem will be proved in Chapter 12. Theorem 1. Let Q = Ql + Q2 + ... + Qk-l + Qk' where Q, Ql> .. " Qk are k + 1 random variables that are real quadratic forms in n mutually stochastically independent random variables which are normally distrib- uted with the means /-Ll' /-L2' ••• , /-Ln and the same variance a2. Let Q/a2, QIfa2, , Qk_l/a2 have chi-square distributions with degrees of freedom r, rl> , rk-l> respectively. Let Qk be nonnegative. Then: (a) Ql"'" Qk are mutually stochastically independent, and hence (b) Qk/a2 has a chi-square distribution with r - (rl + ... + rk- l) rk degrees offreedom. Three examples illustrative of the theorem will follow. Each of these examples will deal with a distribution problem that is based on the remarks made in the subsequent paragraph. Let the random variable X have a distribution that is n(/-L' ( 2 ) . Let a and b denote positive integers greater than 1 and let n = abo Con- sider a random sample of size n = ab from this normal distribution. The items of the random sample will be denoted by the symbols Xu, X12, ... 
, x.; X lb X 2l, X 22, ... , X 21, ... , X 2b XIl> X t2, ... , x.; ... , x; X al, X a2, .. 0, X aj, ... , x.; In this notation the first subscript indicates the row, and the second subscript indicates the column in which the item appears. Thus Xl1 is in row i and column j, i = 1,2, ... , a and j = 1,2, ... , b. By assump- tion these n = ab random variables are mutually stochastically inde- pendent, and each has the same normal distribution with mean /-L and variance a2 • Thus, if we wish, we may consider each row as being a random sample of size b from the given distribution; and we may con-
  • 146. Other Statistical Tests [eh. 8 Sec. 8.2] The Distributions of Certain Quadratic Forms 281 280 sider each column as being a random sample .of size a from the given distribution. We now define a + b + 1 statIstIcs. They are a b '2 '2 Xu X + ... + X + ... + X ab i=l j=l X _ X 11 + ... + Ib al = ab - ab Thus or, for brevity, a b X)2 abS2 = L: L: (Xii - 1=1 i=1 or Clearly, Q, Q1> and Q2 are quadratic forms in the n = ab variables Xii. We shall use the theorem with k = 2 to show that Ql and Q2 are stochastically independent. Since S2 is the variance of a random sample of size n = ab from the given normal distribution, then abS2/a2 has a chi-square distribution with ab - 1 degrees of freedom. Now b For each fixed value of i, L: (Xli - Xd 2/b is the variance of a random i=1 sample of size b from the given normal distribution, and, accordingly, b L: (Xli - Xd 2/a2 has a chi-square distribution with b - 1 degrees of i=1 freedom. Because the Xii are mutually stochastically independent, Q1/a2 is the sum of a mutually stochastically independent random variables, each having a chi-square distribution with b - 1 degrees of freedom. Hence Q1/a2 has a chi-square distribution with a(b - 1) degrees of freedom. Now a Q2 = b L: (Xl' - X)2 ~ 0. In accordance with the theorem, Q1 and Q2 are i=1 stochastically independent, and Q2/a2 has a chi-square distribution with ab - 1 - a(b - 1) = a - 1 degrees of freedom. Example 2. In abS2 replace Xii - X by (Xu - Xi) + (Xi - X) to obtain j = 1,2, ... , b. i = 1,2, ... , a, b '2 Xu X + X + ... + X ib i=l X il i2 =-, I· = b b + 2 ~ ±(Xii - Xd(Xi. - X). i= 1 i=1 . .d tit ay be written The last term of the right-hand member of this I en I y m a (X _ X) ~ (X _ Xt , ) ] = 2 i [(Xi' - XHbXi. - bXdJ = 0, 2 L: ( i· .L.. ii 1 i=1 i=1 1=1 a '2 Xu X . + X . + ... + Xaj i=l X _ 11 21 = - , .j - a a . d b X but we use X for I~ so~~ texts the statis..!.ic=~ISisd;:~~eanYofthe random sample of sImplIcIty. In any c~se: X .. X are respectively, the means size n = abo the statIstIcs Xl.' X 2 ., ••• , a·' ti 1 the , . . X- X X are respec ive y, f the rows: and the statIstIcs ·1' ·2' ... , ' b ' f 11 :eans of th~ columns. The examples illustrative of the theorem 0 o~. . f h d m sample of SIze Example 1. Consider the vanance S2 0 t e ran 0 n = abo We have the algebraic identity and and the term a b X)2 L: L: (Xi· - 1=1 i=1 may be written or, for brevity, Q = Q3 + Q4' It is easy to show (Exercise 8.8) that Q3/a2 has a chi-square distribution • • b WIth b(a - 1) degrees of freedom. Since Q4 = a L: (X.j - X)2 ~ 0, the j=1 theorem enables us to assert that Q3 and Q4 are stochastically independent
  • 147. 282 Other Statistical Tests [eh. 8 Sec. 8.3] A Test of the Equality of Several Means 283 and that Q4/u2 has a chi-square distribution with ab - 1 - b(a - 1) = b - 1 degrees of freedom. Example 3. In abS2 replace Xii - X by (Xi' - X) + (X'i - X) + (Xii - Xi' - X' i + X) to obtain (Exercise 8.9) or, for brevity, Q = Q2 + Q4 + Q5' where Q2 and Q4 are as defined in Examples 1 and 2. From Examples 1 and 2, Q/u 2, Q2/u2, and Q4/u2 have chi-square distributions with ab - 1, a-I, and b - 1 degrees of freedom, respectively. Since Q5 ~ 0, the theorem asserts that Q2' Q4' and Q5 are mutually stochastically independent and that Q5/u2 has a chi-square distribution with ab - 1 - (a - 1) - (b - 1) = (a - l)(b - 1) degrees of freedom. Once these quadratic form statistics have been shown to be stochastically independent, a multiplicity of F statistics can be defined. For instance, Q4/[u2(b - 1)] _ Q4/(b - 1) Qs/[u2b(a - 1)] - Qs/[b(a - 1)] has an F distribution with b - 1 and b(a - 1) degrees of freedom; and Q4/[u2(b - 1)] _ Q4 Q5/[u2(a - l)(b - 1)] - Q5/(a - 1) has an F distribution with b - 1 and (a - l)(b - 1) degrees of freedom. In the subsequent sections it will be seen that some likelihood ratio tests of certain statistical hypotheses can be based on these F statistics. EXERCISES 8.8. In Example 2 verify that Q = Qs + Q4 and that Qs/u2 has a chi- square distribution with b(a - 1) degrees of freedom. 8.9. In Example 3 verify that Q = Q2 + Q4 + Q5' 8.10. Let Xl, X 2 , •• " Xn be a random sample from a normal distribution neiL, ( 2 ) . Show that n n 1 L (XI - X)2 = L (XI - X')2 + ~ (Xl - X')2, 1=1 1=2 n n n where X = L Xdn and X' = L Xd(n - 1). Hint. Replace Xi - X by 1=1 1=2 n (XI - X') - (Xl - X')/n. Show that 1~2 (XI - X')2/u2 has a chi-square distribution with n - 2 degrees of freedom. Prove that the two terms in the right-hand member are stochastically independent. What then is the distri- bution of [en - 1)/n](Xl - X')2 f 2 • o 8.11. Let X lik, i = 1, ... , a; j = 1, ... , b; k = 1, ... , c, be a random sample of size n = abc from a normal distribution n(p, ( 2 ) . Let X = c b a c b L L L Xlik/n and XI" = L L Xlik/be. Show that k=l i=l 1=1 k=1 i=1 a b c a b c 2 a X X)2 L L L (Xlik - X)2 = L L L (Xlik - XI") + be ~ ( I" - • 1=1 i=1 k=l 1=1 i=1 k=1 1-1 Show that ~ ±i (X'ik - X,..)2/U2 has a chi-square distribution with 1=1 i=1 k=1 a(be - 1) degrees of freedom. Prove that the two te~s in t~e ~ght:hand member are stochastically independent. What, then, IS the distribution of be ~ (XI" - X)2/U2? Furthermore, let X.i· = i ~ Xlik/ae and Xli' = 1=1 k=ll=l c L Xlik/e. Show that k=l a b c L L L ix.; - X)2 1=1 i=l k=l = i ±i tx.; - XiJ·)2 + be ~ (XI" - X)2 + ae ~ (X.i. - X)2 1=1 i=l k=l 1=1 i-I + e ~ ±(XiJ. - XI" - X.i. + X)2. 1=1 i=l Show that the four terms in the right-hand member, when divided by u2 , are mutually stochastically independent chi-square variables with ab(e - 1), a-I, b - 1, and (a - l)(b - 1) degrees of freedom, respectively. 8.12. Let Xl, X 2, Xs, X4 be a random sample of size n = 4 from the 4 normal distribution nCO, 1). Show that L (XI - X)2 equals 1=1 (Xl - X 2)2 [Xs - (Xl + X 2)/2]2 [X4 - (Xl + X 2 + XS)/3]2 2 + ~ + ~ and argue that these three terms are mutually stochastically independent, each with a chi-square distribution with 1 degree of freedom. 8.3 A Test of the Equality of Several Means Consider b mutually stochastically independent random variables that have normal distributions with unknown means ftv ft2' ... , ftb'
  • 148. respectively, and unknown but common variance a2. Let Xli' X 21, ••• , X a; represent a random sample of size a from the normal distribution with mean fJ-; and variance a2 , j = 1,2, , b. It is desired to test the composite hypothesis H o: fJ-1 = fJ-2 = = fJ-b = fJ-, fJ- unspecified, against all possible alternative hypotheses H 1. A likelihood ratio test will be used. Here the total parameter space is Sec. 8.3] A Test of the Equality of Several Means and these numbers maximize L(w). Furthermore, 285 j = 1,2, ... , b, aIn L(Q) ab 1 b a 2 o( 2) = - -Z 2 + -Z 4 L: L: (xU - fJ-;) • a a a 1=1 i=1 aIn L(Q) ofJ-; and Other Statistical Tests [Ch. 8 284 If we equate these partial derivatives to zero, the solutions for fJ-1' fJ-2' ••• , fJ-b' and a2 are, respectively, in Q, and w = {(fJ-1' fJ-2' ••• , fJ-b' a2 ) ; -00 < fJ-1 = fJ-2 = ... = fJ-b = fJ- < 00, 0 < a2 < co]. The likelihood functions, denoted by L(w) and L(Q) are, respectively, (Z) a 2: Xi; i=1 - --- = X.;, a j = 1, Z, ... , b, ( 1 ) ab/2 [ 1 b a ] L(w) = 22 exp --2 2 L L (Xii - fJ-)2 ~a a 1=1 i=1 and Now oIn L(w) ofJ- b a 2: 2: (Xii - fJ-) ;=11=1 and these numbers maximize L(Q). These maxima are, respectively, [ ab ]ab/2 f ab ± i (Xii - X)21 L(w) = b a exp _ _ .:....I=_1::..-:..i_=...;.1 _ Z~ 1~1 i~1 (Xii - X)2 ZJ1 it (Xii - X)2 and and oIn L(w) o(a2 ) ab 1 b a --Z 2 + -4 L L (XU - fJ-)2. a 2a 1=1 1=1 If we equate these partial derivatives to zero, the solutions for fJ- and a2 are, respectively, in w, b a 2: 2: Xu 1=1 1=1 = X, ab Finally, (1) b a 2: 2: (XU - X)2 ;=1 1=1 ab = V, A = L(w) = fJ1 it (Xij - X.I)21 ab/2 L(D.) ± i (Xij - X)2 1=1 i=1 In the notation of Section 8.Z, the statistics defined by the functions x and v given by Equations (1) of this section are X = ±i Xii and S2 = i i (Xij - X)2 = Q; ;=1 i=1 ab ;=1 i=1 ab ab
  • 149. 286 Other Statistical Tests [Ch, 8 Sec. 8.3] A Test oj the Equality oj Several Means 287 while the statistics defined by the functions x1, x2' ••. , X.b and w a given by Equations (2) in this section are, respectively, X J = L X'J/a, ,=1 b a J = 1,2, ... , b, and Q3/ab = L L (X'J - X J)2/ab. Thus, in the J=l,=l notation of Section 8.2, >..2/ab defines the statistic Q3/Q. We reject the hypothesis H o if >.. s >"0' To find >"0 so that we have a desired significance level a, we must assume that the hypothesis H o is true. If the hypothesis H o is true, the random variables X'J con- stitute a random sample of size n = ab from a distribution that is normal with mean fL and variance a2 . This being the case, it was shown b in Example 2, Section 8.2, that Q=Q3+Q4' where Q4=a L (X.J- X)2; J=l that Q3 and Q4 are stochastically independent, and that Q3/a2 and Q4/a2 have chi-square distributions with b(a - 1) and b - 1 degrees of freedom, respectively. Thus the statistic defined by >..2/ab may be written 1 The significance level of the test of H 0 is a = Pr [1 + ~4/Q3 ~ >..~/ab; H o] _ [Q4/(b - 1) >. ] - Pr Q3/[b(a _ 1)J - c, H o ' where = b(a - 1) ( -2/ab _ 1) C b _ 1 "0 . But F = Q4/[a 2(b - 1)J = Q4/(b - 1) Q3/[a2b(a - 1)J Q3/[b(a - 1)J has an F distribution with b - 1 and b(a - 1) degrees of freedom. Hence the test of the composite hypothesis H 0: fL1 = fL2 = ... = fLb = fL, fL unspecified, against all possible alternatives may be based on an F statistic. The constant c is so selected as to yield the desired value of a. Remark. It should be pointed out that a test of the equality of the b means fLJ' j = 1,2, ... , b, does not require that we take a random sample of size a from each of the b normal distributions. That is, the samples may be of different sizes, say aI, a2, ... , abo A consideration of this procedure is left to Exercise 8.13. Suppose now that we wish to compute the power of the test of H o against HI when H o is false, that is, when we do not have fL1 = fL2 = ... = fLb = fL· It will be seen in Section 8.4- that, when HI is true, no longer is Q4/a2 a random variable that is X2(b - 1). Thus we cannot use an F statistic to compute the power of the test when HI is true. This problem is discussed in Section 8.4-. An observation should be made in connection with maximizing a likelihood function with respect to certain parameters. Sometimes it is easier to avoid the use of the calculus. For example, L(Q.) of this section can be maximized with respect to fLJ' for every fixed positive a2, by minimizing b a Z = L L (X'J - fLJ)2 J=l,=l with respect to fLJ' j = 1,2, ... , b. Now z can be written as b a Z = L L [(x'J - X J) + (x J - fLJ)J2 j=l 1=1 b a b = L L (X'J - X J)2 + a L (x J - fLJ)2. J=l i=l J=l Since each term in the right-hand member of the preceding equation is nonnegative, clearly z is a minimum, with respect to fLJ' if we take fLJ = x.J, j = 1, 2, ... , b. EXERCISES 8.13. Let X Ij, X 2J' ... , X aJj represent independent random samples of sizes aJ from normal distributions with means fLJ and variances a2, j = 1,2, .. " b. Show that b aJ b aJ or Q' = Q; + Q~. Here X = L L X I1/ L «, and X'J = L XjJ/aj • If j=1 j=l j=1 i=l fL1 = fL2 = ... = fLb' show that Q'/a2and Q;/a2have chi-square distributions. Prove that Q; and Q~ are stochastically independent, and hence Q~/a2 also has a chi-square distribution. If the likelihood ratio , is used to test H 0: fLl = fL2 = ... = fLb = fL, fL unspecified and a2 unknown, against all
  • 150. 288 Other Statistical Tests [Ch. 8 Sec. 8.4] Noncentral X2 and Noncentral F 289 possible alternatives, show that A ~ Ao is equivalent to the computed F ~ c, where What is the distribution of F when H 0 is true? 8.14. Using the notation of this section, assume that the means satisfy the condition that iL = iLl + (b - l)d = iL2 - d = iLa - d = ... = iLb - d. That is, the last b - 1 means are equal but differ from the first mean iLl> provided that d =I- O. Let a random sample of size a be taken from each of the b independent normal distributions with common unknown variance a2• (a) Show that the maximum likelihood estimators of iL and dare (L = X and J = [J2X)(b - 1) - s.,]Ib. (b) Find Qa and Q7 = ca2so that when d = 0, Q71 a2 is X2(1)and (c) Argue that the three terms in the right-hand member of part (b), once divided by a2 , are stochastically independent random variables with chi-square distributions, provided that d = O. (d) The ratio Q7/(Qa + Qa) times what constant has an F distribution, provided that d = O? The integral exists if t < t. To evaluate the integral, note that Accordingly, with t < -1, we have E[exp e~~)] = exp [a2(:fL~ 2t)] 5:00 a~21T exp [ - 1 ~22t (Xj-1 :j2tY] dx; If we multiply the integrand by V1 - 2t, t < -1, we have the integral of a normal p.d.f. with mean fLtI(1 - 2t) and variance a2J(1 - 2t). Thus [ ( tX 2) ] 1 [tfL'?- ] E exp a2' = VI _ 2t exp a2(1 ~ 2t) , n and the moment-generating function of Y = 2: X~Ja2 is given by 1 A random variable that has a moment-generating function of the functional form M(t) = 1 et81<1-2t> (1 - 2W/2 ' 8.4 Noncentral X2 and Noncentral F Let Xl' X 2 , ••• , X n denote mutually stochastically independent n random variables that are n(fLj, a2), i = 1,2, ... , n, and let Y = 2: XUa2. 1 If each fLj is zero, we know that Y is X2 (n). We shall now investigate the distribution of Y when each fLj is not zero. The moment-generating function of Y is given by [ n ] 1 t2:fL~ M(t) = (1 _ 2t)n/2 exp a2(l l _ 2t) , 1 t < 2' M(t) = E[exp (t j~ ~!)] = D E[exp (t ~!)l Consider E[exp (tX a2~) ] _ foo 1 [tx~ (xj - fLj)2] d - . /- exp 2 - 2 2 Xt· -00 av 21T a a where t < t, 0 < 0, and r is a positive integer, is said to have a non- centralchi-square distribution with r degrees of freedom and noncentrality parameter O. If one sets the noncentrality parameter 0 = 0, one has M(t) = (1 - 2t)-r/2, which is the moment-generating function of a random variable that is X2 (r). Such a random variable can appro- priately be called a centralchi-square variable. We shall use the symbol x2 (r, 0) to denote a noncentral chi-square distribution that has the parameters rand 0; and we shall say that a random variable is X2 (r, 0)
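A quick Monte Carlo check of the additivity asserted in Exercise 8.15 can be made with SciPy's noncentral chi-square distribution; the degrees of freedom and noncentrality parameters below are arbitrary choices made only for illustration.

```python
import numpy as np
from scipy.stats import ncx2

rng = np.random.default_rng(0)

# Y1 and Y2 are independent noncentral chi-square variables; Exercise 8.15
# asserts that Z = Y1 + Y2 is chi-square(r1 + r2, theta1 + theta2).
r1, th1 = 3, 2.0
r2, th2 = 5, 1.5
z = ncx2.rvs(r1, th1, size=100_000, random_state=rng) \
    + ncx2.rvs(r2, th2, size=100_000, random_state=rng)

# Compare simulated moments with those of chi-square(r, theta):
# mean = r + theta and variance = 2r + 4theta (compare Exercise 8.16).
r, th = r1 + r2, th1 + th2
print(z.mean(), r + th)            # both near 11.5
print(z.var(), 2 * r + 4 * th)     # both near 30.0
```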
  • 151. 290 Other Statistical Tests [Ch, 8 Sec. 8.5] The Analysis of Variance 291 8.5 The Analysis of Variance Had we taken f31 = f32 = f33 = 0, the six random variables would have had means 8.19. Let Xl and X2 be two stochastically independent random variables. Let Xl and Y = Xl + X2 be X2 (rv ( 1) and X2 (r , 0), respectively. Here rl < rand fh ~ O. Show that X2 is X2 (r - r1• 0 - ( 1 ) , distribution with degrees of freedom r l and r 2 > 2 and noncentrality parameter O. 8.18. Show that the square of a noncentral T random variable is a noncentral F random variable. f'23 = 3. f'13 = 5, f'13 = 6, f'23 = 4. f'12 = 6, f'22 = 4, f'12 = 6, f'22 = 4, f'll = 6, f'21 = 4, f'll = 7, f'21 = 5, a normal distributions are f'ij = f' + (Xi + f3j, where L (Xi = 0 and I b L f3j = O. For example, take a = 2, b = 3, f' = 5, (Xl = 1, (X2 = -1, 1 f3l = 1, f32 = 0, and f33 = -1. Then the ab = six random variables have means The problem considered in Section 8.3 is an example of a method of statistical inference called the analysis of variance. This method derives its name from the fact that the quadratic form abS 2 , which is a total sum of squares, is resolved into several component parts. In this section other problems in the analysis of variance will be investigated. Let X ij, i = 1,2, ... , a and j = 1,2, ... , b, denote n = ab random variables which are mutually stochastically independent and have normal distributions with common variance a2 • The means of these EXERCISES to mean that the random variable has this kind of distribution. The symbol X2 (r, 0) is equivalent to X2 (r). Thus our random variable y = ~ Xf/a 2 of this section is X2 (n, ~ f'f/a2). If each f'i is equal to zero, then Y is X2 (n, 0) or, more simply, Y is X2 (n). The noncentral chi-square variables in which we have interest are certain quadratic forms, in normally distributed variables, divided by a variance a 2 • In our example it is worth noting that the noncentrality n n parameter of L Xf/a2, which is L f'f/a 2, may be computed by replacing 1 1 each Xi in the quadratic form by its mean f'i' i = 1, 2, ... , n. This is no fortuitous circumstance; any quadratic form Q = Q(X1, .•• , X n) in normally distributed variables, which is such that Q/a 2 is X2 (r, 0), has o= Q(f'l' f'2' ••• , f'n)/a2; and if Q/a 2 is a chi-square variable (central or noncentral) for certain real values of f'l' f'2' ... , f'n. it is chi-square (central or noncentral) for all real values of these means. It should be pointed out that Theorem 1, Section 8.2, is valid whether the random variables are central or noncentral chi-square variables. We next discuss a noncentral F variable. If U and V are stochasti- cally independent and are, respectively, X2(r1) and X2 (r2) , the random variable F has been defined by F = r 2 U/rl V. Now suppose, in particu- lar, that U is x2 (rl , 0), V is x2 h ), and that U and V are stochastically independent. The random variable r2 U/Yl V is called a noncentral F variable with Y l and r2 degrees of freedom and with noncentrality parameter 0. Note that the noncentrality parameter of F is precisely the noncentrality parameter of the random variable U, which is X2 (rl> 0). Tables of noncentral chi-square and noncentral F are available in the literature. However, like those of noncentral t, they are too bulky to be put in this book. 8.15. Let Yj , i = 1, 2, .. " n, denote mutually stochastically inde- pendent random variables that are, respectively, x2 (rj , OJ), i = 1,2, ... , n. n (n n ) Prove that Z = L Yj is X2 L ri, L OJ . 
1 1 1 Thus, if we wish to test the composite hypothesis that f'll = f'12 = ... = f'lb, 8.16. Compute the mean and the variance of a random variable that is X2 (r , 0). 8.17. Compute the mean of a random variable that has a noncentral F f'21 = f'22 = ... = f'2b, f'al = f'a2 = ... = f'ab,
  • 152. 292 Other Statistical Tests [Ch. 8 Sec. 8.5] The Analysis of Variance 293 we could say that we are testing the composite hypothesis that f31 = f32 = ... = f3b (and hence each f3j = 0, since their sum is zero). On the other hand, the composite hypothesis r-; of a2 under w, here denoted by a~. So the likelihood ratio , = r-. r-; (a~/a~)ab/2 is a monotone function of the statistic a b (X i j - X i .)2 L 2: i=l j=l ab F = Q4/(b - 1) Q3/[b(a - 1)] a b a b abS2 = :L :L (Xi' - X)2 + :L :L (X.j - X)2 i=l j=l i=l j=l a b + :L :L (Xi j - Xi. - x'j + X)2; i=l j=l F = Q4/(b - 1) , Q5/[(a - 1)(b - 1)J which has, under H o, an F distribution with b - 1 and (a - 1)(b - 1) degrees of freedom. The hypothesis H 0 is rejected if F ;::: c, where a = Pr (F ;::: c; H o). If we are to compute the power function of the test, we need the distribution of F when H 0 is not true. From Section 8.4 we know, when HI is true, that Q4/a2 and Q5/a2 are stochastically independent (central or noncentral) chi-square variables. We shall compute the non- centrality parameters of Q4/a2 and Q5/a2 when HI is true. We have E(Xij) = I-' + ai + f3j, E(Xi .) = I-' + ai' E(X.j) = I-' + f3j and E(X) 1-'. Accordingly, the noncentrality parameter of Q4/a2 is is that estimator under w. A useful monotone function of the likelihood thus the total sum of squares, abS2, is decomposed into that among rows (Q2), that among columns (Q4) , and that remaining (Q5)' It is r-; interesting to observe that a~ = Q5/ab is the maximum likelihood estimator of a2 under Q and upon which the test of the equality of means is based. To help find a test for H o: f31 = f32 = ... = f3b = 0, where I-'ij = I-' + a i + f3j, return to the decomposition of Example 3, Section 8.2, namely Q = Q2 + Q4 + Q5' That is, so we see that the total sum of squares, abS2 , is decomposed into a sum of squares, Q4, among column means and a sum of squares, Q3, within columns. The latter sum of squares, divided by n = ab, is the maximum likelihood estimator of a2 , provided that the parameters are in Q; and -<: we denote it by a~. Of course, S2 is the maximum likelihood estimator is the same as the composite hypothesis that a1 = a2 = ... = aa = 0. Remarks. The model just described, and others similar to it, are widely used in statistical applications. Consider a situation in which it is desirable to investigate the effects of two factors that influence an outcome. Thus the variety of a grain and the type of fertilizer used influence the yield; or the teacher and the size of a class may influence the score on a standard test. Let Xii denote the yield from the use of variety i of a grain and type j of fertilizer. A test of the hypothesis that fl1 = fl2 = ... = flb = 0 would then be a test of the hypothesis that the mean yield of each variety of grain is the same regardless of the type of fertilizer used. a b There is no loss of generality in assuming that 2: ai = 2: flj = O. To see 1 1 this, let floij = flo' + a; + fl;· Write a' = 2: a;;a and p' = 2: fl;;b. We have floij = (flo' + fi' + P') + (a; - fi') + (fl; - P') = flo + ai + flj, where 2: ai = 2: flj = O. 1-'1b = 1-'2b = ... = I-'ab' To construct a test of the composite hypothesis H o: f31 = f32 = ... = f3b = 0 against all alternative hypotheses, we could obtain the corre- sponding likelihood ratio. However, to gain more insight into such a test, let us reconsider the likelihood ratio test of Section 8.3, namely that of the equality of the means of b mutually independent distributions. 
There the important quadratic forms are Q, Q3' and Q4' which are related through the equation Q = Q4 + Q3' That is, 1-'12 = 1-'22 = ... = l-'a2, 1-'11 = 1-'21 = ... = I-'al>
  • 153. 294 Other Statistical Tests [eh.8 Sec. 8.5] The Analysis of Variance 295 b a 2 .2 (I-' + al + flj - I-' - al - I-' - flj + 1-')2 ::...j=....;l:......:...'=....;1=-- ---,;0-- = 0. ~ = 1,2, ... , a, j = 1,2, ... , b, H o: Ylj = 0, 8.21. If at least one Ylf "# 0, show that the F, which is used to test that F = [c,tJ1(Xtj. - XI .. - X.j. + X)2]/[(a - l)(b - l)J. [22 L (Xtjk - Xtj.)2J/[ab(c - l)J The reader should verify that the noncentrality parameter of this F b a distribution is equal to CL L Y~j/a2. Thus F is central when H o:Ytj = j=ll=l 0, i = 1, 2, ... , a, j = 1, 2, ... , b, is true. EXERCISES 8.20. Show that that is, the total sum of squares is decomposed into that due to row differences, that due to column differences, that due to interaction, and that within cells. The test of against all possible alternatives is based upon an F with (a - l)(b - 1) and ab(c - 1) degrees of freedom, a b c a b 2 2 2 (Xtjk - X)2 = be 2 (XI" - X)2 + ac 2 (X. j. - X)2 1=1 j=l k=l 1=1 j=l a b + C2 2 (Xtj. - XI" - X. j. + X)2 1=1 j=l a b c + 2 2 2 (Xtjk - Xtj.)2; 1=1 j=l k=l That is, if Ylj = 0, each of the means in the first row is 2 greater than the corresponding mean in the second row. In general, if each Ytj = 0, the means of row i1 differ from the corresponding means of row i2 by a constant. This constant may be different for different choices of i1 and i2 . A similar statement can be made about the means of columns i1 and j2' The parameter Ytj is called the interaction associated with cell (i, j). That is, the interaction between the ith level of one classification and the jth level of the other classification is Ytj. One interesting hypothesis to test is that each interaction is equal to zero. This will now be investigated. From Exercise 8.11 of Section 8.2 we have that 1-'13 = 3, 1-'23 = 5. 1-'13 = 5, 1-'23 = 3. 1-'12 = 7, 1-'22 = 3, 1-'12 = 6, 1-'22 = 4, 1-'11 = 8, 1-'21 = 4, 1-'11 = 7, 1-'21 = 5, Note that, if each Ylj = 0, then and, under H o: a 1 = a2 = ... = aa = 0, has an F distribution with a-I and (a - l)(b - 1) degrees of freedom. The analysis-of-variance problem that has just been discussed is usually referred to as a two-way classification with one observation per cell. Each combination of i and j determines a cell; thus there is a total of ab cells in this model. Let us now investigate another two-way ~lassification problem, but in this case we take c > 1 stochastically mdependent observations per cell. Let Xtjk' i = 1, 2, .. " a, j = 1,2, ... , b, and k = 1,2, ... , c, denote n = abc random variables which are mutually stochastically indepen- dent and which have normal distributions with common, but unknown, variance a2. The mean of each Xtjk' k = 1, 2, ... , c, is I-'Ij = I-' + al + a b a b flj + ytj, where .2 al = 0, 2 flj = 0, 2 Ytj = 0, and 2 Ytj = 0. For '=1 j=l 1=1 j=l example, take a = 2, b = 3, I-' = 5, a1 = 1, a2 = - 1, fl1 = 1, fl2 = 0, fl3 = -1, Y11 = 1, Y12 = 1, Y13 = - 2, Y21 = -1, Y22 = -1, and Y23 = 2. Then the means are Thus, if the hypothesis H ois not true, F has a noncentral F distribution with b - 1 and (a - l)(b - 1) degrees of freedom and noncentrality b parameter a 2 flJla2. The desired probabilities can then be found in j=l tables of the noncentral F distribution. A similar argument can be used to construct the F needed to test the equality of row means; that is, this F is essentially the ratio of the sum of squares among rows and Q5' In particular, this F is defined by
  • 154. 296 Other Statistical Tests [Ch, 8 Sec. 8.6] A Regression Problem 297 each interaction is equal to zero, has noncentrality parameter equal to b a C 2: 2: YU(12. i=l 1=1 8.6 A Regression Problem It is easy to show (see Exercise 8.22) that the maximum likelihood estimators of ex, {:3, and (12 are n 2: Xi & = _1_ = X n ' or and ~=----- i [Xi - ex - fJ(Ci - C)]2 = i {(a - ex) + (~ - fJ)(Ci - c) 1 1 + [Xi - & - ~(Ci - c)W n = n(& - ex)2 + (~ - fJ)2 L (CI - 13)2 1 + ieXi - & - ~(Ci - 13)]2, 1 Consider next the algebraic identity (Exercise 8.24) Since & and ~ are linearfunctions of X v X 2, ••• , X n' each is normally distributed (Theorem 1, Section 4.7). It is easy to show (Exercise 8.23) that their respective means are ex and fJ and their respective variances n are u2/n and u2/2: (ci - C)2. 1 Consider a laboratory experiment the outcome of which depends upon the temperature; that is, the technician first sets a temperature dial at a fixed point C and subsequently observes the outcome of the experiment for that dial setting. From past experience, the technician knows that if he repeats the experiment with the temperature dial set at the same point c, he is not likely to observe precisely the same out- come. He then assumes that the outcome of his experiment is a random variable X whose distribution depends not only upon certain unknown parameters but also upon a nonrandom variable C which he can choose more or less at pleasure. Let C v C 2,... , C n denote n arbitrarily selected values of C (but not all equal) and let Xi denote the outcome of the experiment when C = c., i = 1,2, ... , n. We then have the n pairs (Xv C1), ••• , (Xn, en) in which the Xi are random variables but the Ci are known numbers and i = 1,2, ... , n. Once the n experiments have been performed (the first with C = C v the second with C = C 2,and so on) and the outcome of each recorded, we have the n pairs of known numbers (Xl' C1), ••• , (Xn, cn)' These numbers are to be used to make statistical inferences about the unknown parameters in the distribution of the random variable X. Certain problems of this sort are called regression problems and we shall study a particular one in some detail. Let C 1 , C 2,... , C n be n given numbers, not all equal, and let c = n 2: c.jn, Let Xv X 2 , ••• , X; be n mutually stochastically independent 1 random variables with joint p.d.f. Li«, fJ, (12; Xv X 2, ••• , X n) or, for brevity, ( 1 )n/2 { 1 n } = - exp - - '" [XI - ex - fJ(ci - C)J2 . 27TU2 2u2 f Here Q, Qv Q2' and Q3 are real quadratic forms in the variables Thus each Xi has a normal distribution with the same variance u2 , but the means of these distributions are ex + fJ(cl - c). Since the ci are not all equal, in this regression problem the means of the normal distribu- tions depend upon the choice of C v C 2, ... , Cn' We shall investigate ways of making statistical inferences about the parameters ex, fJ, and u2. i = 1,2, ... , n. In this equation, Q represents the sum of the squares of n mutually stochastically independent random variables that have normal dis- tributions with means zero and variances u2 • Thus Q/u2 has a chi-square
  • 155. 298 Other Statistical Tests [Ch, 8 Sec. 8.6] A Regression Problem 299 distribution with n degrees of freedom. Each of the random variables V11,(a - alIa and J~ (ci - C)2(~ - (3)/a has a normal distribution with 1 zero mean and unit variance; thus each of Ql/a2 and Q2/a2 has a chi- square distribution with 1 degree of freedom. Since Q3 is nonnegative, we have, in accordance with the theorem of Section 8.2, that Qv Q2' and Q3 are mutually stochastically independent, so that Q3/a2 has a chi-square distribution with n - 1 - 1 = n - 2 degrees of freedom. Then each of the random variables [Vn(a - a)J/a T1 = --=V~Q;=3/~[a::::;::2=:=( n=_~277:)] a-a n 8.24. Verify that 2: [XI - a - f3(cl - C)]2 = Ql + Q2 + Q3' as stated in 1 the text. 8.25. Let the mutually stochastically independent random variables Xl' X 2 , ••• , X; have, respectively, the probability density functions n(f3cl , y2C~), i = 1, 2, ... , n, where the given numbers C1, c2, ••• , Cn are not all equal and no one is zero. Find the maximum likelihood estimators of f3 and y2. 8.26. Let the mutually stochastically independent random variables Xl> "', X; have the joint p.d.f. and [J~ (ci - C)2(~ - (3)]/a VQ3/[a2(n - 2)] where the given numbers c1 , c2 , ••• , Cn are not all equal. Let Ho: f3 = 0 (a and a2 unspecified). It is desired to use a likelihood ratio test to test H o against all possible alternatives. Find , and see whether the test can be based on a familiar statistic. Hint. In the notation of this section show that has a t distribution with n - 2 degrees of freedom. These facts enable us to obtain confidence intervals for a and f3. The fact that nf)2/a2 has a chi-square distribution with n - 2 degrees of freedom provides a means of determining a confidence interval for a 2• These are some of the statistical inferences about the parameters to which reference was made in the introductory remarks of this section. Remark. The more discerning reader should quite properly question our constructions of T1 and T2 immediately above. We know that the squares of the linear forms are stochastically independent of Q3 = nfP, but we do not know, at this time, that the linear forms themselves enjoy this independence. This problem arises again in Section 8.7. In Exercise 12.15, a more general problem is proposed, of which the present case is a special instance. EXERCISES 8.22. Verify that the maximum likelihood estimators of a, {J, and (12 are the EX, p, and iP given in this section. 8.23. Show that EX and phave the respective means a and {J and the n respective variances a2/n and (12/2: (c, - C)2. 1 8.27. Using the notation of Section 8.3, assume that the means fLj satisfy a linear function of j, namely fLj = C + d[j - (b + 1)/2]. Let a random sample of size a be taken from each of the b independent normal distributions with common unknown variance a2 • (a) Show that the maximum likelihood estimators of c and dare, respectively, c= X and , b b d = 2: [j - (b + 1)/2J(x'j - X)I 2: [j - (b + 1)/2J2. j=l j=l (b) Show that a b L L (XtJ - X)2 1=1 j=l a b [ A( b + 1)]2 b (. b + 1)2 I~ j~ s; - X - d j - -2- + J2 ~1 a J - -2- . (c) Argue that the two terms in the right-hand member of part (b), once divided by a2 , are stochastically independent random variables with chi- square distributions provided that d = O. (d) What F statistic would be used to test the equality of the means. that is, H o: d = O?
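The estimators α̂ and β̂ and the statistic T2 can be illustrated numerically. In the Python sketch below the dial settings c_i and the outcomes x_i are hypothetical, and β = 0 is taken as the null value in the t test.

```python
import numpy as np
from scipy.stats import t

# Hypothetical dial settings c_i and observed outcomes x_i.
c = np.array([40.0, 45.0, 50.0, 55.0, 60.0, 65.0])
x = np.array([1.2, 1.9, 2.1, 2.9, 3.4, 3.9])
n = len(c)
cbar = c.mean()

# Maximum likelihood estimates under the model X_i = alpha + beta(c_i - cbar) + e_i.
alpha_hat = x.mean()
beta_hat = np.sum((c - cbar) * x) / np.sum((c - cbar) ** 2)
resid = x - alpha_hat - beta_hat * (c - cbar)
sigma2_hat = np.sum(resid ** 2) / n        # MLE of sigma^2; Q3 = n * sigma2_hat

# T2 = sqrt(sum (c_i - cbar)^2) (beta_hat - beta) / sqrt(Q3 / (n - 2))
# has a t distribution with n - 2 degrees of freedom; here beta = 0 under H0.
q3 = n * sigma2_hat
t2 = np.sqrt(np.sum((c - cbar) ** 2)) * beta_hat / np.sqrt(q3 / (n - 2))
p_value = 2 * t.sf(abs(t2), df=n - 2)
print(alpha_hat, beta_hat, t2, p_value)
```

The same T2, inverted in the usual way, yields the confidence interval for β mentioned in the text.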
  • 156. 301 Wv(n - 2) vU n 2: [(Xj - x)(Yj - V)] W = ..::.l_-;:::====-_ J~ (Xj - X)2 VUJ[u~(n - 2)] (3) Then we have (Exercise 8.32) wVn=2 VU has a conditional t distribution with n - 2 degrees of freedom. Let The left-hand member of this equation and the first term of the right- hand member are, when divided by u~, respectively, conditionally X2(n - 1) and conditionally X2(1). In accordance with Theorem 1, the nonnegative quadratic form, say U, which is the second term of the right-hand member of Equation (2), is conditionally stochastically independent of W 2 , and, when divided by u~, is conditionally X 2 (n - 2). Now WJU2 is n(O, 1). Then (Remark, Section 8.6) is n(O, u~) (see Exercise 8.30). Thus the conditional distribution of W2Ju~, given Xl = Xl' ... , X n = Xn, is X2(1). We have the algebraic iden tity (see Exercise 8.31) (1) X; = X n , is X2 (n - 1). Moreover, the conditional distribution of the linear function W of Y 1> Y 2, ••• , Y n' Sec. 8.7] A Test of Stochastic Independence Other Statistical Tests [Ch. 8 n ' The conditional distribution of " (Y y-)2J 2 ' X L., j - U2' given 1 = Xl' ••• ,1 1 This s.tat~stic R is called the correlationcoefficient of the random sample. The likelihood ratio principle, which calls for the rejection of H if A ::; Ao, is equivalent to the computed value of IRI ;::: c. That is, if ~he absol~te value of the correlation coefficient of the sample is too large, we reject the hypothesis that the correlation coefficient of the distri- b.uti?n is equal to zero. To determine a value of c for a satisfactory significance level, it will be necessary to obtain the distribution of R or a function of R, when H o is true. This will now be done. ' Let Xl = Xl' X 2 = X2, " ., X; = Xn,n > 2, where Xl' X2,... , Xnand _ n n X = f xdn are fixed numbers such that 2: (xj - X)2 > 0. Consider the 1 conditional p.d.f. of Y 1> Y 2 ... Y given that X = X X = X , , n' 1 1, 2 2, .•• , X n = Xn· Because Y 1> Y 2, ••• , Yn are mutually stochastically inde- pendent and, with p = 0, are also mutually stochastically independent of Xl' X 2, ... , X n, this conditional p.d.f. is given by n 2: (Xj - X)(Yj - Y) R = j=l Ji (Xj - X)2 i (Yj - Y)2' j=l j=l 8.7 A Test of Stochastic Independence Let X and Y have a bivariate normal distribution with means ILl a~d IL2' positive variances u~ and u~, and correlation coefficient p. We wish to test the hypothesis that X and Yare stochastically independent. Because two jointly normally distributed random variables are sto- chastically independent if and only if p = 0, we test the hypothesis H o: p = °against the hypothesis HI: p #- 0. A likelihood ratio test will be used. Let (Xl' YI ), (X2, Y 2), ... , (Xn, Y n) denote a random sample of size n > 2 from the bivariate normal distribution; that is the joint p.d.f. of these 2n random variables is given by , f(X1> YI)f(X2,Y2) ... f(xn,Yn)' ~lth.ough it is.fair~y difficult to show, the statistic that is defined by the likelihood ratio AIS a function of the statistic 300
  • 157. 302 Other Statistical Tests [eh. 8 Sec. 8.7] A Test of Stochastic Independence 303 this ratio has, given Xl = Xl> ••• , X; = Xn, a conditional t distribution with n - 2 degrees of freedom. Note that the p.d.f., say g(t), of this t distribution does not depend upon Xl> X 2, ••• , X n. Now the joint p.d.f. of Xv X 2, ... , X; and Rvn - 2/vl - R2, where the statistic Rvn - 2/v1 - R2 = T. In either case the significance level of the test is a = Pr (IR! ~ Cl; H o) = Pr (J TI ~ c2 ; H o), where the constants Cl and C2 are chosen so as to give the desired value of a. Remark. It is also possible to obtain an approximate test of size a by using the fact that = 0 elsewhere. We have now solved the problem of the distribution of R, when p = 0 and n > 2, or, perhaps more conveniently, that of RVn - 2/Vl - R2. The likelihood ratio test of the hypothesis H o: p = 0 against all alternatives H l: p =1= 0 may be based either on the statistic R or on n 2:X,Yj - nXY 1 J(~ X~ - nX2) (~ Y~ - nY2) 8.28. Show that n 2: (Xj - X)(Yj - Y) R = ----;:::::::l~===:===== Ji (X, - X)2 i (Y, - Y)2 1 1 EXERCISES 8.29. A random sample of size n = 6 from a bivariate normal distribu- tion yields the value of the correlation coefficient to be 0.89. Would we accept or reject, at the 5 per cent significance level, the hypothesis that p = O? 8.30. Verify that W of Equation (1) of this section is n(O, a~). 8.31. Verify the algebraic identity (2) of this section. 8.32. Verify Equation (3) of this section. 8.33. Verify the p.d.f. (4) of this section. W= !In (1 + R) 2 1 - R !In(1 + po). 2 1 - Po has an approximate normal distribution with mean 1- In [(1 + p)j(1 - p)] and variance 1j(n - 3). We accept this statement without proof. Thus a test of H o: p = 0 can be based on the statistic Z = 1- In [(1 + R)j(1 - R)] - 1- In [(1 + p)j(1 - p)], V1j(n - 3) with p = 0 so that 1- In [(1 + p)j(1 - p)] = O. However, using W, we can also test hypotheses like Ho: p = Po against Hl : p =1= Po, where Po is not necessarily zero. In that case the hypothesized mean of W is -1 < r < 1, g(r) (4) If we write T = Rvn - 2/vl - R2, where T has a t distribution with n - 2 > 0 degrees of freedom, it is easy to show, by the change- of-variable technique (Exercise 8.33), that the p.d.f. of R is given by r[(n - 1)/2J (1 _ r2) (n - 4)/2 , r(t)r[(n - 2)/2J is the product of g(t) and the joint p.d.f. of Xl' X 2 , ••• , X n. Integration on Xl' X 2, ... , X n yields the marginal p.d.f. of RVn - 2/V1 - R2; because g(t) does not depend upon Xv X2' ••• , X n it is obvious that this marginal p.d.f. is g(t), the conditional p.d.f. of Rcvn - 2/Vl - R~. The change-of-variable technique can now be used to find the p.d.f. of R. Remarks. Since R has, when P = 0, a conditional distribution that does not depend upon Xl' X 2, •• " X n (and hence that conditional distribution is, in fact, the marginal distribution of R), we have the remarkable fact that R is stochastically independent of Xl' X 2 , ••• , X n . It follows that R is stochastically independent of every function of Xl, X 2 , ••• , X; alone, that is, a function that does not depend upon any Yj • In like manner, R is stochastic- ally independent of every function of Y1> Y2, ••. , Yn alone. Moreover, a careful review of the argument reveals that nowhere did we use the fact that X has a normal marginal distribution. 
Thus, if X and Yare stochastically independent, and if Y has a normal distribution, then R has the same conditional distribution whatever be the distribution of X, subject to the condition ~ (xj - X)2 > O. Moreover, if Pr [~ (X, - X')2 > 0] = 1, then R has the same marginal distribution whatever be the distribution of X.
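A short numerical sketch of this test, using the relation T = R sqrt(n - 2)/sqrt(1 - R^2); the paired observations below are simulated and purely illustrative.

```python
import numpy as np
from scipy.stats import t, pearsonr

rng = np.random.default_rng(1)
n = 20
x = rng.normal(size=n)        # x and y are drawn independently here,
y = rng.normal(size=n)        # so H0: rho = 0 is in fact true

r = np.corrcoef(x, y)[0, 1]                 # sample correlation coefficient R
T = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * t.sf(abs(T), df=n - 2)        # two-sided test of H0: rho = 0

# pearsonr carries out the same t-based test of rho = 0 directly.
r2, p2 = pearsonr(x, y)
print(r, T, p_value)
print(r2, p2)
```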
  • 158. Sec. 9.1] Confidence Intervals for Distribution Quantiles 305 Chapter 9 Nonparametric Methods 9.1 Confidence Intervals for Distribution Quantiles We shall first define the concept of a quantile of a distribution of a random variable of the continuous type. Let X be a random variable of the continuous type with p.d.f. f(x) and distribution function F(x). Let p denote a positive proper fraction and assume that the equation F(x) = p has a unique solution for X. This unique root is denoted by the symbol gp and is called the quantile (of the distribution) of order p. Thus Pr (X ~ gp) = F(gp) = p. For example, the quantile of order t is the median of the distribution and Pr (X ~ gO.5) = F(gO.5) = t- In Chapter 6 we computed the probability that a certain random interval includes a special point. Frequently, this special point was a parameter of the distribution of probability under consideration. Thus we were led to the notion of an interval estimate of a parameter. If the parameter happens to be a quantile of the distribution, and if we work with certain functions of the order statistics, it will be seen that this method of statistical inference is applicable to all distributions of the continuous type. We call these methods distribution-free or nonpara- metric methods of inference. To obtain a distribution-free confidence interval for gp, the quantile of order p, of a distribution of the continuous type with distribution function F(x), take a random sample Xl' X 2 , ••• , X n of size n from that distribution. Let Y1 < Y2 < ... < Yn be the order statistics of the sample. Take Y, < Y, and consider the event Y, < gp < Y j • For the ith order statistic Y, to be less than gp it must be true that at least i 304 of the X values are less than gpo Moreover, for the jth order statistic to be greater than gp, fewer than j of the X values are less than gpo That is, if we say that we have a "success" when an individual X value is less than gp, then, in the n independent trials, there must be at least i successes but fewer than j successes for the event Y, < gp < Y j to occur. But since the probability of success on each trial is Pr (X < gp) F(gp) = p, the probability of this event is j-l n! Pr (Y, < gp < Y j ) = 2. '( _ )' pW(1 - p)n-w, w=, W. n W. the probability of having at least i, but less than j, successes. When particular values of n, i, and j are specified, this probability can be computed. By this procedure, suppose it has been found that y = Pr (Y, < e. < Y j ) . Then the probability is y that the random interval (Yt, Y j ) includes the quantile of order p. If the experimental values of Yt and Y, are, respectively, Yt and Yj, the interval (Yt, Yj) serves as a 100y per cent confidence interval for gp, the quantile of order p. An illustrative example follows. Example 1. Let Y1 < Y2 < Ya < Y4 be the order statistics of a random sample of size 4 from a distribution of the continuous type. The probability that the random interval (Y1 , Y 4 ) includes the median gO.5 of the distribution will be computed. We have Pr (Y1 < gO.5 < Y 4 ) = W~l wI (/~ w)! Gr = 0.875. If Y1 and Y4 are observed to be Yl = 2.8 and Y4 = 4.2, respectively, the interval (2.8, 4.2) is an 87.5 per cent confidence interval for the median gO.5 of the distribution. For samples of fairly large size, we can approximate the binomial probabilities with those associated with normal distributions, as illustrated in the next example. Example 2. 
Let the following numbers represent the order statistics of n = 27 observations obtained in a random sample from a certain distribution of the continuous type. 61, 69, 71, 74, 79, 80, 83, 84, 86, 87, 92, 93, 96, 100, 104, 105, 113, 121, 122, 129, 141, 143, 156, 164, 191, 217, 276. Say that we are interested in estimating the 25th percentile ξ0.25 (that is, the quantile of order 0.25) of the distribution. Since (n + 1)p = 28(1/4) = 7, the seventh order statistic, Y7 = 83, could serve as a point estimate of ξ0.25. To get a confidence interval for ξ0.25, consider two order statistics, one less
  • 159. 306 Nonparametric Methods [Ch, 9 Sec. 9.2] Tolerance Limits for Distributions 307 than Y7 and the other greater, for illustration, Y4 and YIO' What is the con- fidence coefficient associated with the interval (Y4' YIO)? Of course, before the sample is drawn, we know that 'Y = Pr (Y4 < gO.25 < YI O) = ~4 (~)(0.25)W(0.75)27-w. (a) Show that Pr (YI < P. < Y 2 ) = t and compute the expected value of the random length Y2 - Yr- (b) If X is the mean of this sample, find the constant c such that Pr (X - co < p. < X + cal = 1-, and compare the length of this random interval with the expected value of that of part (a). Hint. See Exercise 4.60, Section 4.6. 9.2 Tolerance Limits for Distributions 9.6. Let YI < Y2 < ... < Y 25be the order statistics of a random sample of size n = 25 from a distribution of the continuous type. Compute approxi- mately: (a) Pr (Ys < go.s < YI S) ' (b) Pr (Y2 < gO.2 < Yg) . (c) Pr (YI S < go.s < Y 23) · 9.7. Let YI < Y2 < ... < YlO O be the order statistics of a random sample of size n = 100 from a distribution of the continuous type. Find i < j so that Pr (Y, < gO.2 < YJ ) is about equal to 0.95. = 0 elsewhere, then, if 0 < P < 1, we have Pr [F(X) s PJ = f:dz = p. Now F(x) = Pr (X ~ x). Since Pr (X = x) = 0, then F(x) is the fractional part of the probability for the distribution of X that is between -00 and x. If F(x) ~ p, then no more than lOOp per cent of the probability for the distribution of X is between -00 and x. But recall Pr [F(X) ~ PJ = p. That is, the probability that the random o < z < 1, h(z) = 1, We propose now to investigate a problem that has something of the same flavor as that treated in Section 9.1. Specifically, can we compute the probability that a certain random interval includes (or covers) a preassigned percentage of the probability for the distribution under consideration? And, by appropriate selection of the random interval, can we be led to an additional distribution-free method of statistical inference? Let X be a random variable with distribution function F(x) of the continuous type. The random variable Z = F(X) is an important random variable, and its distribution is given in Example 1, Section 4.1. It is our purpose now to make an interpretation. Since Z = F(X) has the p.d.f, 9.1. Let Yn denote the nth order statistic of a random sample of size n from a distribution of the continuous type. Find the smallest value of n for which Pr (go.g < Y n) ~ 0.75. 9.2. Let Y1 < Y2 < Y3 < Y4 < Ys denote the order statistics of a random sample of size 5 from a distribution of the continuous type. Com- pute: (a) Pr (YI < gO.5 < Ys)' (b) Pr (YI < gO.2S < Y 3) · (c) Pr (Y4 < go.so < Ys)' 9.3. Compute Pr (Y3 < go.s < Y7 ) if YI < ... < Yg are the order statistics of a random sample of size 9 from a distribution of the continuous type. 9.4. Find the smallest value of n for which Pr (Y1 < go.s < Yn) ~ 0.99, where YI < ... < Y n are the order statistics of a random sample of size n from a distribution of the continuous type. 9.5. Let YI < Y2 denote the order statistics of a random sample of size 2 from a distribution which is n(p., a2 ) , where a2 is known. EXERCISES 'Y = Pr (3.5 < W < 9.5), Thus (Y4 = 74, YlO = 87) serves as an 81.4 per cent confidence interval for gO.2S' It should be noted that we could choose other intervals also, for illustration, (Y3 = 71, Yu = 92), and these would have different confidence coefficients. 
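The confidence coefficient of Example 2 can be checked numerically, both exactly and through the normal approximation; a short Python sketch follows.

```python
from scipy.stats import binom, norm

n, p = 27, 0.25

# Exact: gamma = Pr(Y4 < xi_0.25 < Y10) = sum_{w=4}^{9} C(27, w) p^w (1 - p)^(27 - w).
gamma_exact = sum(binom.pmf(w, n, p) for w in range(4, 10))

# Normal approximation with continuity correction: Pr(3.5 < W < 9.5),
# where W is b(27, 1/4) with mean 6.75 and variance 27(1/4)(3/4) = 5.0625.
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
gamma_approx = norm.cdf((9.5 - mu) / sigma) - norm.cdf((3.5 - mu) / sigma)

print(gamma_exact, gamma_approx)   # compare with the 81.4 per cent reported in the text
```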
The persons involved in the study must select the desired confidence coefficient, and then the appropriate order statistics, Yi and Yj, are taken in such a way that i and j are fairly symmetrically located about (n + 1)p. (In the normal approximation of Example 2, W is b(27, 1/4) with mean 27/4 = 6.75 and variance 81/16; hence γ is approximately equal to 0.814.)
  • 160. 308 Nonparametric Methods [Ch. 9 Sec. 9.2] Tolerance Limits for Distributions 309 variable Z = F(X) is less than or equal to p is precisely the probability that the random interval (-00, X) contains no more than lOOp per cent of the probability for the distribution. For example, the probability that the random interval (-00, X) contains no more than 70 per cent of the probability for the distribution is 0.70; and the probability that the random interval (-00, X) contains more than 70 per cent of the probability for the distribution is 1 - 0.70 = 0.30. We now consider certain functions of the order statistics. Let X X X denote a random sample of size n from a distribution 1, 2,···, n that has a positive and continuous p.d.f. f(x) if and only if a < x < b; and let F(x) denote the associated distribution function. Consider the random variables F(Xl ), F(X2),... , F(Xn). These random variables are mutually stochastically independent and each, in accordance with Example 1, Section 4.1, has a uniform distribution on the interval (OJ 1). Thus F(Xl), F(X2),... , F(Xn) is a random sample of size n from a uniform distribution on the interval (0, 1). Consider the order statistics of this random sample F(Xl ) , F(X2),... , F(Xn). Let Zl be the smallest of these F(X;) , Z2 the next F(X;) in order of magnitude, .. " and Zn the largest F(Xi). If Y 1> Y 2, .•. , Y nare the order statistics of the initial random sample Xl' X 2 , ••• , X n, the fact that F(x) is a nondecreasing (here, strictly increasing) function of x implies that Zl = F(Yl ), Z2 = F(Y2),... , Zn = F(Yn). Thus the joint p.d.f. of ZlJ Z2' ... , Zn is given by h(z1> Z2' ... , zn) = nt, 0 < Zl < Z2 < ... < Zn < 1, = 0 elsewhere. This proves a special case of the following theorem. Theorem 1. Let Y 1> Y 2, •• " Y n denote the order statistics of a random sample of size n from a distribution of the continuous type that has p.d.j. f(x) and distribution function F(x). The joint p.d.f. of the random variables Z, = F(Yi), i = 1, 2, ... , n, is h(z1> z2"",zn) = nt, 0 < Zl < Z2 < ... < zn < 1, = 0 elsewhere. Because the distribution function of Z = F(X) is given by z, o < Z < 1, the marginal p.d.f. of Zk = F(Yk) is the following beta p.d.f.: n! (i - 1)! (j - i - 1)! (n - j)! X Z~-l(Z. - z.)j-i-l(l - z.)n-j I ) & J J = 0 elsewhere. Moreover, the joint p.d.f. of Z; = F(Yi) and Z, = F(YJ ) is, with i < j, given by Sometimes this is a rather tedious computation. For this reason and for the reason that coverages are important in distribution-free statistical inference, we choose to introduce at this time the concept of a coverage. Consider the random variables Wl = F(Yl) = Zl' W2 = F(Y2) - F(Yl) = Z2 - Z1> W3 = F(Ys) - F(Y2) = Zs - Z2"'" Wn = F(Yn) - F(Yn- l) = Zn - Zn-l' The random variable Wl is called a coverage of the random interval {x; -00 < x < Y l} and the random variable Wi, i = 2, 3, ... , n, is called a coverage of the random interval {x; Yi- l < X < Vi}' We shall find the joint p.d.f. of the n coverages Consider the difference Z, - Z, = F(Yj) - F(Y;), i < i Now F(Yj) = Pr (X :5: Yj) and F(Yi) = Pr (X :5: Yi)' Since Pr (X = Y;) = Pr (X = Yj) = 0, then the difference F(Yj) - F(Yi) is that fractional part of the probability for the distribution of X that is between Y; and Yj' Let p denote a positive proper fraction. 
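A simulation sketch of the marginal result just stated, assuming a uniform parent distribution so that F(x) = x and Zk is simply the kth order statistic of a uniform sample; the values of n and k below are arbitrary.

```python
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(2)
n, k = 10, 3        # sample size and order-statistic index (arbitrary choices)

# Simulate Z_k = F(Y_k): for a uniform(0, 1) parent, Z_k is the kth order
# statistic of a uniform sample of size n.
z_k = np.sort(rng.uniform(size=(50_000, n)), axis=1)[:, k - 1]

# Equation (1) says Z_k has the beta p.d.f. with parameters k and n - k + 1.
print(z_k.mean(), k / (n + 1))                         # both near 3/11 = 0.2727
print(kstest(z_k, beta(k, n - k + 1).cdf).statistic)   # small if the fit is good
```

The same beta family gives the distribution of F(Yj) - F(Yi), which is what the tolerance-limit computations below rest on.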
If F(Yj) - F(Yi) ~ p, then at least lOOp per cent of the probability for the distribution of X is between Yi and Yj' Let it be given that y = Pr [F(Yj) - F(Yi) ~ Pl Then the random interval (Yi , Y j ) has probability y of containing at least lOOp per cent of the probability for the distribution of X. If now Yi and Yj denote, respectively, experimental values of Y i and Y j, the interval (Yi' Yj) either does or does not contain at least lOOp per cent of the probability for the distribution of X. However, we refer to the interval (Yi' Yj) as a lOGy per cent tolerance interval for lOOp per cent of the probability for the distribution of X. In like vein, Yi and Yj are called 100y per cent tolerance limits for lOOP per cent of the probability for the distribution of X. One way to compute the probability y = Pr [F(Yj) - F(Yi) ~ PJ is to use Equation (2), which gives the joint p.d.f. of Z, = F(Y;) and Zj = F(Yj ) . The required probability is then given by o < Zk < 1, n! k-l(l Z )n-k (1) hk(zk) (k _ 1)! (n _ k)! Zk - k ' = 0 elsewhere.
  • 161. 310 Nonparametric Methods [Ch. 9 Sec. 9.2] Tolerance Limits for Distributions 311 WI> W 2' •.. , Wno First we note that the inverse functions of the associated transformation are given by because the integrand is the p.d.f. of F(Y6 ) - F(Yl ) . Accordingly, y = 1 - 6(0.8)5 + 5(0.8)6 = 0.34, Z2 = WI + W2' za = WI + w 2 + wa, Zn = WI + W2 + Wa + ... + wn• We also note that the Jacobian is equal to 1 and that the space of positive probability density is {(WI> W2"'" wn) ; 0 < Wt' i = 1,2, ... , n, WI + ... + wn < I}. Sincethejointp.d.f. of Zl' Z2"'" Znisn!, 0 < Zl < Z2 < ... <: Zn < 1, zero elsewhere, the joint p.d.f. of the n coverages is = 0 elsewhere. A reexamination of Example 1 of Section 4.5 reveals that this is a Dirichlet p.d.f. with k = n and (Xl = (X2 = ... = (Xn +1 = 1. Because the p.d.f. k(wl> ... , wn) is symmetric in WI> W 2, ••• , W n, it is evident that the distribution of every sum of r, r < n, of these coverages WI>' .. , W n is exactly the same for each fixed value of r. For instance, if i < Jand r = J - i, the distribution of Z, - Z, = F(Yj ) - F(Yt) = W + W· 2 + ... + W· is exactly the same as that of Zj-t = i+ I 1+ J F(Yj- t) = WI + W2 + ... + Wj-i' But we know that the p.d.f. of Zj-i is the beta p.d.f. of the form 0< w < 1, kl (w) = n(1 - w)n-l, = 0 elsewhere, because WI = ZI = F(Yl ) has this p.d.f. Accordingly, the mathematical expectation of each Wi is fl nw(1 _ w)n-l dw = _1_. o n + 1 Now the coverage Wi can be thought of as the area under the graph of the p.d.f. j(x), above the z-axis, and between the lines x = Yj - l and x = Yi . (We take Yo = -co.) Thus the expected value of each of these random areas Wi' i = 1,2, ... , n, is 1/(n + 1). That is, the order statistics partition the probability for the distribution into n + 1 parts, and the expected value of each of these parts is 1/(n + 1). More generally, the expected value of F(Yj ) - F(Yj ) , i < j, is (j - i)/(n + 1), since F(Yj ) - F(Yj ) is the sum of j - i of these coverages. This result provides a reason for calling Ylcs where (n + I)P = k, the (100P)th percentile oj the sample, since E[F(Y )J = _k_ = (n + I)P = p. k n+l n+l Example 2. Each of the coverages Wi' i = 1, 2, ... , n, has the beta p.d.f. approximately. That is, the observed values of Yl and Y6 will define a 34 per cent tolerance interval for 80 per cent of the probability for the distribution. o < Wt, i = 1, ... , n, WI + ... + W n < 1, k(w W ) = n!, 1, ••. , n o < v < 1, r(n + 1) Vj-i-I(1 _ v)n-Hi, hj-t(v) = ru - i)r(n - J + i + 1) = 0 elsewhere. Consequently, F(Yj ) - F(Yi ) has this p.d.f. and Pr [F(Yj) - F(Yt) ;? PJ = f:hj_t(v) dv. Example 1. Let Y l < Y2 < ... < Y6 be the order statistics of a random sample of size 6 from a distribution of the continuous type. We want to use the observed interval (Yl' Y6) as a tolerance interval for 80 per cent of the distribution. Then y = Pr [F(Y6) - F(Y1) ;? 0.8] = 1 - f:·830v4(1 - v) dv, EXERCISES 9.8. Let Yl and Y n be, respectively, the first and nth order statistics of a random sample of size n from a distribution of the continuous type having distribution function F(x). Find the smallest value of n such that Pr [F(Yn) - F(Yl ) ;? 0.5] is at least 0.95. 9.9. Let Y2 and Y n - l denote the second and the (n - l)st order statistics of a random sample of size n from a distribution of the continuous type having distribution function F(x). Compute Pr [F(Yn - 1) - F(Y2) ;? PJ, where 0 < p < 1. 9.10. Let Yl < Y2 < ... 
< Y_48 be the order statistics of a random sample of size 48 from a distribution of the continuous type. We want to use the observed interval (y_4, y_45) as a 100γ per cent tolerance interval for 75 per cent of the distribution.
  • 162. 312 Nonparametric Methods [Ch, 9 Sec. 9.3] The Sign Test 313 (a) To what is yequal? . . ' (b) Approximate the integral in part (a) by notmg that It can ~e wntten as a partial sum of a binomial p.d.f., which in turn can be approxImated by probabilities associated with a normal distribution. 9.11. Let YI < Y2 < ... < Y n be the order statistics of a random sample of size n from a distribution of the continuous type having distribution func- tion F(x). (a) What is the distribution of U = 1 - F(Yj )? (b) Determine the distribution of V = F(Yn) - F(YJ) + F(Yt) - F(YI ) , where i < j. Suppose, however, that we are interested only in the alternative hy- pothesis, which is HI: FW > Po.One procedure is to base the test of H 0 against HI upon the random variable Y, which is the number of items less than or equal to gin a random sample of size n from the distribu- tion. The statistic Y can be thought of as the number of "successes" throughout nindependent trials. Then,ifHoistrue, Yisb[n,po = FW]; whereas if Ho is false, Y is ben, P = F(g)] whatever be the distribution function F(x). We reject H o and accept HI if and only if the observed value y ;?: c, where c is an integer selected such that Pr (Y ;?: c; Ho) is some reasonable significance level 0:. The power function of the test is given by where P = F(72). In particular, the significance level is 1 2:=:; P < 1, Po s P < 1, K(P) In many places in the literature the test that we have just described is called the sign test. The reason for this terminology is that the test is based upon a statistic Y that is equal to the number of nonpositive signs in the sequence Xl - g, X2 - g, ..., X n- g. In the next section a distribution-free test, which considers both the sign and the magni- tude of each deviation Xt - g, is studied. where P = FW· In certain instances we may wish to approximate K(P) by using an approximation to the binomial distribution. Suppose that the alternative hypothesis to H o: FW = Po is HI: FW < Po. Then the critical region is a set {y; y :=:; c1} . Finally, if the alternative hypothesis is HI: FW "# Po, the critical region is a set {y; y s c2 or ca s y}. Frequently, Po = t and, in that case, the hypothesis is that the given number gis a median of the distribution. In the following example, this value of Po is used. Example 1. Let Xl, X 2 , ••• , X I O be a random sample of size 10 from a distribution with distribution function F(x). We wish to test the hypothesis H o: F(72) = 1- against the alternative hypothesis HI: F(72) > -t. Let Y be the number of sample items that are less than or equal to 72. Let the observed value of Y be y, and let the test be defined by the critical region {y; y ;?: 8}. The power function of the test is given by ~ = 1,2, ... , k; and a chi-square test, based upon a statistic that was denoted by Q was used to test the hypothesis H 0 against all alternative k-I> hypotheses. There is a certain subjective element in the use of this test, namely the choice of k and of A I> A 2, ••• , Ak' But it is important to note that the limiting distribution of Qk-1, under Ho, is X2(k - 1); that is, the distribution of Qk-1 is free of PlO' P20, ... , PkO and, accordingly, of the specified distribution of X. Here, and elsewhere, "under r:0'.' means when H is true. A test of a hypothesis Ho based upon a statistic whose distribution, under H 0' does not depend upon the specified distribution or any parameters of that distribution is called a distribution-free or a nonparametric test. 
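Because Y is b(n, p_0) under H_0, the significance level and the power function K(p) of the sign test are binomial tail probabilities. The following minimal sketch (SciPy assumed; the helper name is our own) re-traces Example 1 of Section 9.3 above, where n = 10, p_0 = 1/2, and the critical region is {y; y ≥ 8}.

```python
from scipy import stats

n, c = 10, 8                       # Example 1 of Section 9.3: reject H0 if y >= 8

def power(p):
    """K(p) = Pr(Y >= c) when Y is b(n, p)."""
    return stats.binom.sf(c - 1, n, p)

alpha = power(0.5)                 # significance level, p = p0 = 1/2
print(alpha)                       # 56/1024 ≈ 0.0547
for p in (0.6, 0.7, 0.8, 0.9):     # the power grows as p = F(72) moves above 1/2
    print(p, power(p))
```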
Next, let F(x) be the unknown distribution function of the random variable X. Let there be given two numbers ξ and p_0, where 0 < p_0 < 1. We wish to test the hypothesis H_0: F(ξ) = p_0, that is, the hypothesis that ξ = ξ_{p0}, the quantile of order p_0 of the distribution of X. We could use the statistic Q_{k−1}, with k = 2, to test H_0 against all alternatives. 9.3 The Sign Test Some of the chi-square tests of Section 8.1 are illustrative of the type of tests that we investigate in the remainder of this chapter. Recall that in that section we tested the hypothesis that the distribution of a certain random variable X is a specified distribution. We did this in the following manner. The space of X was partitioned into k mutually disjoint sets A_1, A_2, ..., A_k. The probability p_{i0} that X ∈ A_i was computed under the assumption that the specified distribution is the correct distribution, i = 1, 2, ..., k. The original hypothesis was then replaced by the hypothesis
  • 163. 314 Nonparametric Methods [Ch. 9 Sec. 9.4] A Test of Wilcoxon 315 for all x. Moreover, the probability that any two items of a random sample are equal is zero, and in our discussion we shall assume that no two are equal. The problem is to test the hypothesis that the median gO.5 of the distribution is equal to a fixed number, say g. Thus we may, in all cases and without loss of generality, take g = 0. The reason for this is that if g =1= 0, then the fixed gcan be subtracted from each sample item and the resulting variables can be used to test the hypothesis that their underlying distribution is symmetric about zero. Hence our conditions on F(x) and f(x) become F( - x) = 1 - F(x) and f( - x) = f(x), respectively. To test the hypothesis Ho: F(O) = t, we proceed by first ranking Xl' X 2 , ••• , X; according to magnitude, disregarding their algebraic signs. Let R, be the rank of IXil among lXII, IX2 1 , ... , IXnl, i = 1,2, ... , n. For example, if n = 3 and if we have IX2 1 < IXsl < lXII, then R l = 3, R2 = 1, and Rs = 2. Thus Rl , R2 , ••• , R; is an arrangement of the first n positive integers 1,2, ... , n. Further, let Zi' i = 1,2, ... , n, be defined by EXERCISES 9.12. Suggest a chi-square test of the hypothesis which states that a distribution is one of the beta type, with parameters a = 2 and f:3 = 2. Further, suppose that the test is to be based upon a random sample of size 100. In the solution, give k, define AI> A 2 , ••• , A k , and compute each Pto. If possible, compare your proposal with those of other students. Are any of them the same? 9.13. Let Xl' X 2 , ••• , X 46 be a random sample of size 48 from a distri- bution that has the distribution function F(x). To test H o: F(41) = t against HI: F(41) < t, use the statistic Y, which is the number of sample items less than or equal to 41. If the observed value of Y is y :s; 7, reject Ho and accept HI' If P = F(41), find the power function K(P), 0 < P :s; t, of the test. Approximate a = K(t). 9.14. Let Xl' X 2 , ••• , XI OO be a random sample of size 100 from a distribution that has distribution function F(x). To test H o: F(90) - F(60) = 1- against HI: F(90) - F(60) > 1-, use the statistic Y, which is the number of sample items less than or equal to 90 but greater than 60. If the observed value of Y, say y, is such that y ~ c, reject H o. Find c so that a = 0.05, approximately. 9.4 A Test of Wilcoxon Z, = -1, = 1, if Xi < 0, if Xi > 0. Suppose Xl' X 2 , • • • , X n is a random sample from a distribution with distribution function F(x). We have considered a test of the hypothesis F(g) = t, g given, which is based upon the signs of the deviations Xl - g, X 2 - g, ... , X n - g. In this section a statistic is studied that takes into account not only these signs, but also the magnitudes of the deviations. To find such a statistic that is distribution-free, we must make two additional assumptions: (a) F(x) is the distribution function of a continuous type of random variable X. (b) The p.d.f. f(x) of X has a graph that is symmetric about the vertical axis through gO.5' the median (which we assume to be unique) of the distribution. Thus f(go.5 - x) = f(go.5 + x), and F(gO.5 - x) 1 - F(gO.5 + x) If we recall that Pr (Xi = 0) = 0, we see that it does not change the probabilities whether we associate Z, = 1 or Z, = - 1 with the out- come Xi = 0. n The statistic W = L ZiRi is the Wilcoxon statistic. 
Note that in computing this statistic we simply associate the sign of each X_i with the rank of its absolute value and sum the resulting n products. If the alternative to the hypothesis H_0: ξ_{0.5} = 0 is H_1: ξ_{0.5} > 0, we reject H_0 if the observed value of W is an element of the set {w; w ≥ c}. This is due to the fact that large positive values of W indicate that most of the large deviations from zero are positive. For alternatives ξ_{0.5} < 0 and ξ_{0.5} ≠ 0 the critical regions are, respectively, the sets {w; w ≤ c_1} and {w; w ≤ c_2 or w ≥ c_3}. To compute probabilities like Pr (W ≥ c; H_0), we need to determine the distribution of W under H_0. To help us find the distribution of W, when H_0: F(0) = 1/2 is true, we note the following facts: (a) The assumption that f(x) = f(−x) ensures that Pr (X_i < 0) = Pr (X_i > 0) = 1/2, i = 1, 2, ..., n.
  • 164. n (e-tt + eit) =I1 . t=l 2 317 w = -6, -4, -2,2,4,6, w = 0, -~ - 8, g(w) = t, n L E(I tj, - I-'t1 3 ) lim t=l - 0 n ....co (n )3/2 -, L o~ t=1 then n n L Ut - L I-'t ,=1 t=1 JJ10 f has a limiting distribution that is n(O, 1). For our variables Vv V2 , "', Vn we have The variance of V, is (- i)2 (!-) + (i)2(!-) = i 2. Thus the variance of W is o~ = ~>2 = n(n + 1)(2n + 1). 1 6 For large values of n, the determination of the exact distribution of W becomes tedious. Accordingly, one looks for an approximating distribution. Although W is distributed as is the sum of n random variables that are mutually stochastically independent, our form of the central limit theorem cannot be applied because the n random variables do not have identical distributions. However, a more general theorem, due to Liapounov, states that if U, has mean I-'t and variance or, i = 1,2, ... , n, if Uv U2 , ••• , U'; are mutually stochastically inde- pendent, if E(I U, - I-'t1 3 ) is finite for every i, and if n I-'w = E(W) = L E(Vt) = O. 1 Sec. 9.4] A Test oj Wilcoxon = 0 elsewhere. The mean and the variance of Ware more easily computed directly than by working with the moment-generating function M(t). Because n n V = L Vt and W = L ZtRt have the same distribution they have the 1 1 ' same mean and the same variance. When the hypothesis H 0: F(O) = !- is true, it is easy to determine the values of these two characteristics of the distribution of W. Since E(Vt) = 0, i = 1, 2, ... , n, we have Thus the p.d.f. of W, for n = 3, is given by Nonparametric Methods [Ch. 9 We can express M(t) as the sum of terms of the form (aj/2n)eblt. When M(t) is written in this manner, we can determine by inspection the p.d.f. of the discrete-type random variable W. For example, the smallest value of W is found from the term (1/2n)e-te- 2t... e- nt = (1/2n)e- n<n+1)t/2 and it is -n(n + 1)/2. The probability of this value of W is the coefficient 1/2n . To make these statements more concrete, take n = 3. Then n The preceding observations enable us to say that W = L ZiRt 1 n has the same distribution as the random variable V = L Vt, where 1 V v V2, •.. , Vn are mutually stochastically independent and Pr (Vt = i) = Pr (Vt = - i) = !-, i = 1, 2, ... , n. That V v V 2' ••. , Vn are mutually stochastically independent follows from the fact that Z1' Z2' ... , Zn have that property; that is, the numbers 1,2, ... , n always appear in a sum W and those numbers receive their algebraic signs by independent assignment. Thus each of V v V 2, ... , Vn is like one and only one of Z1RV Z2R2, ... , ZnRn· Since Wand V have the same distribution, the moment-generating function of W is that of V, M(t) (e- t t e t) r 2+ e 2t) r:te 3t) (t)(e-6t + e-4t + e-2t + 2 + e2t + e4t + e6t). (b) Now Z, = - 1 if X, < 0 and Z, = 1 if X, > 0, i = 1, 2, , n. Hence we have Pr (Zt = -1) = Pr (Zt = 1) =!-, i = 1,2, , n. Moreover, Z1' Z2' ... , Z; are mutually stochastically independent because Xv X 2 , ••• , X n are mutually stochastically independent. (c) The assumption thatf(x) = f( -x) also assures that the rank R, of IX,I does not depend upon the sign Z, of Xt. More generally, R1 , R2 , ... , Rn are stochastically independent of Z1' Z2' , Zn· (d) A sum W is made up of the numbers 1,2, , n, each number with either a positive or a negative sign. 316
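For small n, the p.d.f. of W displayed above can be checked by brute force: under H_0 the n signs are independent and equally likely, so each of the 2^n sign assignments to the ranks 1, 2, ..., n has probability 1/2^n. The following sketch (standard library only; the helper name is our own) recovers the n = 3 p.d.f. and the variance n(n + 1)(2n + 1)/6 = 14.

```python
from itertools import product
from fractions import Fraction

def w_distribution(n):
    """Exact p.d.f. of W = sum of +-1 times the ranks 1..n, all 2^n sign choices equally likely."""
    pdf = {}
    for signs in product((-1, 1), repeat=n):
        w = sum(s * r for s, r in zip(signs, range(1, n + 1)))
        pdf[w] = pdf.get(w, Fraction(0)) + Fraction(1, 2 ** n)
    return dict(sorted(pdf.items()))

dist = w_distribution(3)
print(dist)   # probability 1/8 at w = -6, -4, -2, 2, 4, 6 and 1/4 at w = 0

n = 3
var = sum(p * w * w for w, p in dist.items())
print(var, Fraction(n * (n + 1) * (2 * n + 1), 6))   # both equal 14
```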
  • 165. 318 Nonparametric Methods [Ch. 9 Sec. 9.4] A Test of Wilcoxon 319 and it is known that n L i=1 Now lim n 2(n + 1)2/4 = ° n-+oo [n(n + 1)(2n + 1)/6]3/2 because the numerator is of order n4 and the denominator is of order n 9 /2 • Thus W Vn(n + 1)(2n + 1)/6 is approximately n(O, 1) when Ho is true. This allows us to approximate probabilities like Pr (W :2: c; H o) when the sample size n is large. Example 1. Let gO.5 be the median of a symmetric distribution that is of the continuous type. To test, with ex = 0.01, the hypothesis H o: gO.5 = 75 against HI: gO.5 > 75, we observed a random sample of sizen = 18. Let it be given that the deviations of these 18 values from 75 are the following numbers: 1.5, -0.5, 1.6,0.4,2.3, -0.8,3.2,0.9,2.9, 0.3,1.8, -0.1, 1.2,2.5,0.6, -0.7, 1.9, 1.3. The experimental value of the Wilcoxon statistic is equal to w = 11 - 4 + 12 + 3 + 15 - 7 + 18 + 8 + 17 + 2 + 13 - 1 + 9 + 16 + 5 - 6 + 14 + 10 = 135. Since, with n = 18 so that vn(n + 1)(2n + 1)/6 = 45.92, we have that 0.01 = Pr (4~2 :2: 2.326) = Pr (W :2: 106.8). Because w = 135 > 106.8, we reject Ho at the approximate 0.01 significance level. There are many modifications and generalizations of the Wilcoxon statistic. One generalization is the following: Let C1 :s; C2 :s; ... :s; Cn be nonnegative numbers. Then, in the Wilcoxon statistic, replace the ranks 1, 2, ... , n by c1> C2, ••• , Cn' respectively. For example, if n = 3 and if we have IX2 ! < IX3 ! < lXII, then R1 = 3 is replaced by C3, R2 = 1 by C1, and R3 = 2 by c2 • In this example the generalized statistic is given by Z1C3 + Z2Cl + Z3C2' Similar to the Wilcoxon statistic, this generalized statistic is distributed under Ho, as is the sum of n stochastically independent random variables, the ith of which takes each of the values c, -=f °and - c, with probability -t; if Ci = 0, that variable takes the value Ci = °with probability 1. Some special cases of this statistic are proposed in the Exercises. EXERCISES 9.15. The observed values of a random sample of size 10 from a distri- bution that is symmetric about gO.5 are 10.2, 14.1, 9.2, 11.3, 7.2, 9.8, 6.5, 11.8,8.7, 10.8. Use Wilcoxon's statistic to test the hypothesis Ho: gO.5 = 8 against HI: gO.5 > 8 if ex = 0.05. Even though n is small, use the normal approximation. 9.16. Find the distribution of W for n = 4 and n = 5. Hint. Multiply the moment-generating function of W, with n = 3, by (e-4t + e4t )/2 to get that of W, with n = 4. 9.17. Let Xl, X 2 , ••• , X; be mutually stochastically independent. If the p.d.f. of X, is uniform over the interval (_21 - ' , 21 - i ) , i = 1,2,3, ... , show n that Liapounov's condition is not satisfied. The sum L: Xi does not have an '=1 approximate normal distribution because the first random variables in the sum tend to dominate it. 9.18. If n = 4 and, in the notation of the text, C1 = 1, C2 = 2, C3 = C4 = 3, find the distribution of the generalization of the Wilcoxon statistic, say Wg • For a general n, find the mean and the variance of Wg if c, = i, i:s; nI2,andc, = [nI2] + l,i > nI2,where[z]isthegreatestintegerfunction. Does Liapounov's condition hold here? 9.19. A modification of Wilcoxon's statistic that is frequently used is achieved by replacing R, by R, - 1; that is, use the modification Wm = i Zi(R, - 1). Show that Wm/v(n - l)n(2n - 1)/6 has a limiting distri- 1 bution that is n(O, 1). 9.20. If, in the discussion of the generalization of the Wilcoxon statistic, we let C 1 = C 2 = ... 
= c_n = 1, show that we obtain a statistic equivalent to that used in the sign test. 9.21. If c_1, c_2, ..., c_n are selected so that i/(n + 1) = ∫_0^{c_i} √(2/π) e^{−x²/2} dx, i = 1, 2, ..., n, the generalized Wilcoxon W_g is an example of a normal scores statistic. If n = 9, compute the mean and the variance of this W_g. 9.22. If c_i = 2^i, i = 1, 2, ..., n, the corresponding W_g is called the binary statistic. Find the mean and the variance of this W_g. Is Liapounov's condition satisfied?
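As a numerical illustration, the following sketch (NumPy assumed) re-traces Example 1 of Section 9.4 above: it ranks the absolute deviations from 75, attaches the signs, and compares the observed w with the approximate critical value 2.326·σ_W ≈ 106.8.

```python
import numpy as np

# Deviations from 75 in Example 1 of Section 9.4 (n = 18).
d = np.array([1.5, -0.5, 1.6, 0.4, 2.3, -0.8, 3.2, 0.9, 2.9,
              0.3, 1.8, -0.1, 1.2, 2.5, 0.6, -0.7, 1.9, 1.3])

ranks = np.abs(d).argsort().argsort() + 1          # ranks of the absolute deviations
w = float(np.sum(np.sign(d) * ranks))              # Wilcoxon statistic, w = 135

n = len(d)
sigma_w = np.sqrt(n * (n + 1) * (2 * n + 1) / 6)   # ≈ 45.92 under H0
print(w, sigma_w, 2.326 * sigma_w)                 # 135 > 106.8, so reject H0 at about the 0.01 level
```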
  • 166. 320 Nonparametric Methods [Ch. 9 Sec. 9.5] The Equality oj Two Distributions 321 9.5 The Equality of Two Distributions If F(z) = G(z), for all z, then »« = PiZ, i = 1,2, ... , k. Accordingly, the hypothesis that F(z) = G(z), for all z, is replaced by the less restrictive hypothesis number of items. (If the sample sizes are such that this is impossible, a partition with approximately the same number of items in each group suffices.) In effect, then, the partition Al> A z, ... , A k is determined by the experimental values themselves. This does not alter the fact that the statistic, discussed in Example 3, Section 8.1, has a limiting distribution that is XZ(k - 1). Accordingly, the procedures used in that example may be used here. Among the tests of this type there is one that is frequently used. It is essentially a test of the equality of the medians of two independent distributions. To simplify the discussion, we assume that m + n, the size of the combined sample, is an even number, say m + n = 2h, where h is a positive integer. We take k = 2 and the combined sample of size m + n = 2h, which has been ordered, is separated into two parts, a "lower half" and an "upper half," each containing h = (m + n)/2 of the experimental values of X and Y. The statistic, suggested by Example 3, Section 8.1, could be used because it has, when H o is true, a limiting distribution that is xZ(l). However, it is more interesting to find the exact distribution of another statistic which enables us to test the hypothesis H a against the alternative HI: F(z) 2:: G(z) or against the alternative HI: F(z) ~ G(z) as opposed to merely F(z) -=f. G(z). [Here, and in the sequel, alternatives F(z) 2:: G(z) and F(z) s G(z) and F(z) -=f. G(z) mean that strict inequality holds on some set of positive probability measure.] This other statistic is V, which is the number of observed values of X that are in the lower half of the combined sample. If the observed value of V is quite large, one might suspect that the median of the distribution of X is smaller than that of the distribution of Y. Thus the critical region of this test of the hypothesis H o: F(z) = G(z), for all z, against HI: F(z) 2:: G(z) is of the form V 2:: c. Because our combined sample is of even size, there is no unique median of the sample. However, one can arbitrarily insert a number between the hth and (h + l)st ordered items and call it the median of the sample. On this account, a test of the sort just described is called a median test. Incidentally, if the alternative hypothesis is HI: F(z) ~ G(z), the critical region is of the form V ~ c. The distribution of V is quite easy to find if the distribution fun- tions F(x) and G(y) are of the continuous type and if F(z) = G(z), for all z. We shall now show that V has a hypergeometric p.d.f. Let m + n = 2h, h a positive integer. To compute Pr (V = v), we need the probability that exactly v of Xl> X z, " " X m are in the lower half of the ordered combined sample. Under our assumptions, the probability is zero that any two of the 2h random variables are equal. The smallest t = 1,2, 0 0 0 , k. i = 1,2, ... , k, t = 1,2, ... , k. PiZ = Pr (Y EAt), 9.23. In the definition of Wilcoxon's statistic, let WI be the sum of the ranks of those items of the sample that are positive and let WZ be the sum of the ranks of those items that are negative. Then W = WI - W2 • (a) Show that W = 2WI - n(n + 1)/2 and W = n(n + 1)/2 - 2Wz. (b) Compute the mean and the variance of each of WI and W z. 
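The identity in Exercise 9.23(a) is easy to check numerically: with no zeros or ties, the positive and the negative items together carry all the ranks 1, 2, ..., n, so W = W_1 − W_2 = 2W_1 − n(n + 1)/2 = n(n + 1)/2 − 2W_2. The short sketch below (NumPy assumed, made-up sample values) verifies this.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=15)                     # hypothetical sample with no ties or zeros

ranks = np.abs(x).argsort().argsort() + 1   # ranks of |x_i|
w1 = ranks[x > 0].sum()                     # sum of ranks of the positive items
w2 = ranks[x < 0].sum()                     # sum of ranks of the negative items
n = len(x)

w = np.sum(np.sign(x) * ranks)
print(w, w1 - w2, 2 * w1 - n * (n + 1) / 2, n * (n + 1) / 2 - 2 * w2)   # all four agree
```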
In Sections 9.3 and 9.4 some tests of hypotheses about one distribution were investigated. In this section, as in the next section, various tests of the equality of two independent distributions are studied. By the equality of two distributions, we mean that the two distribution functions, say F and G, have F(z) = G(z) for all values of z. The first test that we discuss is a natural extension of the chi-square test. Let X and Y be stochastically independent variables with distribution functions F(x) and G(y), respectively. We wish to test the hypothesis that F(z) = G(z), for all z. Let us partition the real line into k mutually disjoint sets A_1, A_2, ..., A_k. Define p_{i1} = Pr (X ∈ A_i) and p_{i2} = Pr (Y ∈ A_i), i = 1, 2, ..., k. But this is exactly the problem of testing the equality of two independent multinomial distributions that was considered in Example 3, Section 8.1, and the reader is referred to that example for the details. Some statisticians prefer a procedure which eliminates some of the subjectivity of selecting the partitions. For a fixed positive integer k, proceed as follows. Consider a random sample of size m from the distribution of X and a random sample of size n from the independent distribution of Y. Let the experimental values be denoted by x_1, x_2, ..., x_m and y_1, y_2, ..., y_n. Then combine the two samples into one sample of size m + n and order the m + n values (not their absolute values) in ascending order of magnitude. These ordered items are then partitioned into k parts in such a way that each part has the same
  • 167. 322 Nonparametric Methods [Ch. 9 Sec. 9.5] The Equality of Two Distributions 323 h of the m + n = 2h items can be selected in anyone of (~) ways. Each of these ways has the same probability. Of these (~) ways, we need to count the number of those in which exactly v of the m values of X (and hence h - v of the n values of Y) appear in the lower h items. But this is (:)(h : v), Thus the p.d.f. of Vis the hypergeometric p.d.f. k(v) = Pr (V = v) = °elsewhere, v = 0, 1,2, .. " m, two values of X, and so on. In our example, there is a total of eight runs. Three are runs of length 1; three are runs of length 2; and two are runs of length 3. Note that the total number of runs is always one more than the number of unlike adjacent symbols. Of what can runs be suggestive? Suppose that with m = 7 and n = 8 we have the following ordering: xxxxx ~ xx yyyyyyy. To us, this strongly suggests that F(z) > G(z). For if, in fact, F(z) G(z) for all z, we would anticipate a greater number of runs. And if the first run of five values of X were interchanged with the last run of seven values of Y, this would suggest that F(z) < G(z). But runs can be suggestive of other things. For example, with m = 7 and n = 8, consider the runs. where m + n = 2h. The reader may be momentarily puzzled by the meaning of (h : v) for v = 0, 1,2, ... ,m.Forexample,letm = 17,n = 3,sothath = 10. Then we have Co~ v), v= 0,1, ... ,17. However, we take (h : v) to be zero if h - v is negative or if h - v > n, If m + n is an odd number, say m + n = 2h + 1, it is left to the reader to show that the p.d.f. k(v) gives the probability that exactly v of the m values of X are among the lower h of the combined 2h + 1 values; that is, exactly v of the m values of X are less than the median of the combined sample. If the distribution functions F(x) and G(y) are of the continuous type, there is another rather simple test of the hypothesis that F(z) = G(z), for all z. This test is based upon the notion of runs of values of X and of values of Y. We shall now explain what we mean by runs. Let us again combine the sample of m values of X and the sample of n values of Y into one collection of m + n ordered items arranged in ascending order of magnitude. With m = 7 and n = 8 we might find that the 15 ordered items were in the arrangement Note that in this ordering we have underscored the groups of successive values of the random variable X and those of the random variable Y. If we read from left to right, we would say that we have a run of one value of X, followed by a run of three values of Y, followed by a run of yyyy xxxxxxx yyyy. This suggests to us that the medians of the distributions of X and Y may very well be about the same, but that the "spread" (measured possibly by the standard deviation) of the distribution of X is con- siderably less than that of the distribution of Y. Let the random variable R equal the number of runs in the com- bined sample, once the combined sample has been ordered. Because our random variables X and Yare of the continuous type, we may assume that no two of these sample items are equal. We wish to find the p.d.f. of R. To find this distribution, when F(z) = G(z), we shall suppose that all arrangements of the m values of X and the n values of Y have equal probabilities. We shall show that Pr (R = 2k + 1) = { (m ~ 1) (~ =~) + (; ~ n(n ~ 1) }/ (m; n) (1) when 2k and 2k + 1 are elements of the space of R. 
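The hypergeometric p.d.f. k(v) obtained above for the median-test statistic V can be evaluated with a library routine. The sketch below assumes SciPy, and the mapping onto SciPy's hypergeom parameterization is our own; it uses the m = 17, n = 3 illustration above, for which k(v) vanishes unless h − v lies between 0 and n.

```python
from scipy import stats

def median_test_pmf(m, n, v):
    """k(v) = C(m, v) C(n, h - v) / C(m + n, h), with h = (m + n)/2; zero outside the support."""
    h = (m + n) // 2
    # scipy's hypergeom(M, K, N): population M = m + n, K = m values of X, N = h draws (lower half)
    return stats.hypergeom(m + n, m, h).pmf(v)

# The text's illustration: m = 17, n = 3, so h = 10; k(v) is zero unless 7 <= v <= 10,
# which is exactly the convention of taking C(n, h - v) = 0 when h - v < 0 or h - v > n.
for v in (5, 8, 12):
    print(v, median_test_pmf(17, 3, v))
```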
To prove formulas (1), note that we can select the m positions for the m values of X from the m + n positions in any one of $\binom{m+n}{m}$ ways. Since each of these choices yields one arrangement, the probability of each arrangement is equal to $1/\binom{m+n}{m}$. The problem is now to
  • 168. 324 Nonparametric Methods [Ch, 9 Sec. 9.5] The Equality oj Two Distributions 325 determine how many of these arrangements yield R = r, where r is an integer in the space of R. First, let r = 2k + 1, where k is a positive integer. This means that there must be k + 1 runs of the ordered values of X and k runs of the ordered values of Y or vice versa. Consider first the number of ways of obtaining k + 1 runs of the m values of X. We can form k + 1 of these runs by inserting k "dividers" into the m - 1 spaces between the values of X, with no more than one divider per space. This can be done in anyone of (m ~ 1) ways. Similarly, we can construct k runs of the n values of Y by inserting k - 1 dividers into the n - 1 spaces between the values of Y, with no more than one divider per space. This can be done in anyone of (~ =: Dways. The .. . b f d' f (m - 1)(n - 1) joint operation can e per orme in anyone a k k _ 1 ways. These two sets of runs can be placed together to form r = 2k + 1 runs. But we could also have k runs of the values of X and k + 1 rum, of the values of Y. An argument similar to the preceding shows that this (m- l)(n - 1) can be effected in anyone of k _ 1 k ways. Thus Pr (R = 2k + 1) If the critical region of this run test of the hypothesis Ho: F(z) = G(z) for allzis of the form R ~ c,it is easy to compute IX = Pr(R ~ c;Ho), provided that m and n are small. Although it is not easy to show, the distribution of R can be approximated, with large sample sizes m and n, by a normal distribution with mean f-L = E(R) = 2 m m;n+ 1 and variance 2 (f-L - l)(f-L - 2) a = . m+n-1 The run test may also be used to test for randomness. That is, it can be used as a check to see if it is reasonable to treat Xl' X 2 , ••• , X, as a random sample of size s from some continuous distribution. To facil- tate the discussion, take s to be even. We are given the s values of X to be Xl> X 2, •.. , xs, which are not ordered by magnitude but by the order in which they were observed. However, there are s/2 of these values, each of which is smaller than the remaining s/2 values. Thus we have a "lower half" and an "upper half" of these values. In the sequence Xl> X 2, ••• ,xs' replace each value X that is in the lower half by the letter L and each value in the upper half by the letter U. Then, for example, with s = 10, a sequence such as LLLLULUUUU which is the second of formulas (1). which is the first of formulas (1). If r = 2k, where k is a positive integer, we see that the ordered values of X and the ordered values of Y must each be separated into k runs. These operations can be performed in anyone of (: =: D ( n - 1) and k _ 1 ways, respectively. These two sets of runs can be placed together to form r = 2k runs. But we may begin with either a run of values of X or a run of values of Y. Accordingly, the probability of 2k runs is Pr (R = 2k) 2(m - l)(n - 1) k-1 k-1 (m ~ n) may suggest a trend toward increasing values of X; that is, these values of X may not reasonably be looked upon as being the items of a random sample. If trend is the only alternative to randomness, we can make a test based upon R and reject the hypothesis of randomness if R ~ c.To make this test, we would use the p.d.f. of R with m = n = s/2. On the other hand if, with s = 10, we find a sequence such as L H L H L H L H L H, our suspicions are aroused that there may be a nonrandom effect which is cyclic even though R = 10. 
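Formulas (1) and the normal approximation for R translate directly into code. The sketch below (standard library only; the helper name is our own) uses the sample sizes m = 7 and n = 8 of the illustrations above and checks that the exact run probabilities sum to 1.

```python
from math import comb

def run_pmf(r, m, n):
    """Exact Pr(R = r) for the number of runs in the ordered combined sample, under H0."""
    total = comb(m + n, m)
    if r % 2 == 0:                         # r = 2k runs
        k = r // 2
        return 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1) / total
    k = (r - 1) // 2                       # r = 2k + 1 runs
    return (comb(m - 1, k) * comb(n - 1, k - 1)
            + comb(m - 1, k - 1) * comb(n - 1, k)) / total

m, n = 7, 8                                # the sizes used in the text's illustrations
print(sum(run_pmf(r, m, n) for r in range(2, m + n + 1)))   # the exact probabilities sum to 1

mu = 2 * m * n / (m + n) + 1               # approximate mean of R for large m and n
var = (mu - 1) * (mu - 2) / (m + n - 1)    # approximate variance of R
print(mu, var)
```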
Accordingly, to test for a trend or a cyclic effect, we could use a critical region of the form R ~ CI or R :2: C2• If the sample size s is odd, the number of sample items in the "upper half" and the number in the" lower half" will differ by one. Then, for example, we could use the p.d.f. of R with m = (s - 1)/2 and n = (s + 1)/2, or vice versa.
  • 169. 326 Nonparametric Methods [Ch. 9 Sec. 9.6] The Mann-Whitney-Wilcoxon Test 327 We note that and consider the statistic m 2: z., i=l Zij = 1, = 0, n m U = 2: 2: z.; j=l i=l counts the number of values of X that are less than YJ , j = 1, 2, ... , n. Thus U is the sum of these n counts. For example, with m = 4 and n = 3, consider the observations G(y) denote, respectively, the distribution functions of X and Y and let Xl, X 2 , • • • , Xm and Y v Y 2 , •• " Y, denote independent samples from these distributions. We shall discuss the Mann-Whitney-Wilcoxon test of the hypothesis Ho: F(z) = G(z) for all values of z. Let us define 9.25. In the median test, with m = 9 and n = 7, find the p.d.f. of the random variable V, the number of values of X in the lower half of the combined sample. In particular, what are the values of the probabilities Pr (V = 0) and Pr (V = 9)? EXERCISES 9.24. Let 3.1, 5.6, 4.7, 3.8, 4.2, 3.0, 5.1, 3.9, 4.8 and 5.3, 4.0, 4.9, 6.2, 3.7, 5.0, 6.5, 4.5, 5.5, 5.9, 4.4, 5.8 be observed samples of sizes m = 9 and n = 12 from two independent distributions. With k = 3, use a chi-square test to test, with a = 0.05 approximately, the equality of the two distri- butions. 9.26. In the notation of the text, use the median test and the data given in Exercise 9.24 to test, with a = 0.05, approximately, the hypothesis of the equality of the two independent distributions against the alternative hypothesis that F(z) ~ G(z). If the exact probabilities are too difficult to determine for m = 9 and n = 12, approximate these probabilities. 9.27. Using the notation of this section, let U be the number of observed values of X in the smallest d items of the combined sample of m + n items. Argue that u = 0,1, ... , m. The statistic U could be used to test the equality of the (100P)th percentiles, where (m + n)p = d, of the distributions of X and Y. 9.28. In the discussion of the run test, let the random variables R1 and R2 be, respectively, the number of runs of the values of X and the number of runs of the values of Y. Then R = R1 + R2. Let the pair (Yv Y2) of integers be in the space of (R1 , R2) ; then Ir1 - r21 :::; 1. Show that the joint p.d.f. of R1 and R2 is 2(~ =D c=D /(m ~ n) if Y 1 = Y 2; that this joint p.d.f. is c..D c=~)/(m ~ n) if Ir1 - Y 2 ! = 1; and is zero elsewhere. Show that the marginal p.d.f. of R1 is (~ =D (n ~ 1)/(m ~ n) Y 1 = 1, ... , m, and is zero elsewhere. Find E(R1) . In a similar manner, find E(R2) . Compute E(R) = E(R1 ) + E(R2) . 9.6 The Mann-Whitney-Wilcoxon Test We return to the problem of testing the equality of two independent distributions of the continuous type. Let X and Y be stochastically independent random variables of the continuous type. Let F(x) and There are three values of x that are less than Y1; there are four values of x that are less than Y2; and there is one value of x that is less than Ya. Thus the experimental value of U is u = 3 + 4 + 1 = 8. Clearly, the smallest value which U can take is zero, and the largest value is mn. Thus the space of U is {u; u = 0, 1,2, ... , mn}. If U is large, the values of Y tend to be larger than the values of X, and this suggests that F(z) ~ G(z) for all z. On the other hand, a small value of U suggests that F(z) :::; G(z) for all z, Thus, if we test the hypothesis H o: F(z) = G(z) for all z against the alternative hypothesis HI: F(z) ~ G(z) for all z, the critical region is of the form U ~ c1 . 
If the alternative hypothesis is H_1: F(z) ≤ G(z) for all z, the critical region is of the form U ≤ c_2. To determine the size of a critical region, we need the distribution of U when H_0 is true. If u belongs to the space of U, let us denote Pr (U = u) by the symbol h(u; m, n). This notation focuses attention on the sample sizes m and n. To determine the probability h(u; m, n), we first note that we have m + n positions to be filled by m values of X and n values of Y. We can fill m positions with the values of X in any one of $\binom{m+n}{m}$ ways. Once this has been done, the remaining n positions can be filled
  • 170. 328 Nonparametric Methods [Ch. 9 Sec. 9.6] The Mann-Whitney-Wilcoxon Test 329 with the values of Y. When H o is true, each of these arrangements has the same probability, 1/(m; n). The final right-hand position of an arrangement may be either a value of X or a value of Y. This position can be filled in anyone of m + n ways, m of which are favorable to X and n of which are favorable to Y. Accordingly, the probability that an arrangement ends with a value of X is m/(m + n) and the prob- ability that an arrangement terminates with a value of Y is n/(m + n). Now U can equal u in two mutually exclusive and exhaustive ways: (1) The final right-hand position (the largest of the m + n values) in the arrangement may be a value of X and the remaining (m - 1) values of X and the n values of Y can be arranged so as to have U = u. The probability that U = u, given an arrangement that terminates with a value of X, is given by h(u; m - 1, n). Or (2) the largest value in the arrangement can be a value of Y. This value of Y is greater than m values of X. If we are to have U = u, the sum of n - 1 counts of the m values of X with respect to the remaining n - 1 values of Y must be u - m. Thus the probability that U = u, given an arrangement that terminates in a value of Y, is given by h(u - m; m, n - 1). Accordingly, the probability that U = u is h(u, m, n) = (~)h(U; m - 1, n) + (_n_)h(U - m; m, n - 1). m+n m+n We impose the following reasonable restrictions upon the function h(u; m, n): and if m = 1, n = 2, we have h(O; 1,2) = th(O; 0, 2) + th( -1; 1, 1) = t· 1 + t· °= t, h(1; 1,2) = th(1; 0, 2) + th(O; 1, 1) = t· °+ t· t = t, h(2; 1, 2) = th(2; 0, 2) + th(1; 1, 1) = t· °+ t· t = 1-- In Exercise 9.29 the reader is to determine the distribution of U when m = 2, n = 1; m = 2, n = 2; m = 1, n = 3; and m = 3, n = 1. For large values of m and n, it is desirable to use an approximate distribution of U. Consider the mean and the variance of U when the hypothesis Ho: F(z) = G(z), for all values of z, is true. Since U = n m .L .L Z'J' then J=l ,=1 m n E(U) .L.L E(Z'J)' ,=1 J=l But E(Z'1) = (1) Pr (Xj < Y J) + (0) Pr (X, > YJ) = t because, when H 0 is true, Pr (X, < YJ) = Pr (X, > YJ) = l Thus m n (1) mn E(U) = 2: 2: - =-. ,=1 J=l 2 2 To compute the variance of U, we first find Then it is easy, for small values m and n, to compute these probabilities. For example, if m = n = 1, we have h(O; 1, 1) = th(O; 0, 1) + th( -1; 1,0) = t· 1 + t· °= t" h(1; 1, 1) = th(1; 0,1) + th(O; 1,0) = t· °+ t· 1 = t; and and h(u; 0, n) = 1, = 0, h(U, m, 0) = 1, = 0, h(u; m, n) = 0, u = 0, u > 0, n ~ 1, u = 0, U > 0, m ~ 1, u < 0, m ~ 0, n ~ 0. n m n n m = .L .L E(Z;,) + .L .L .L E(Z'JZjk) J=l ,=1 k=l J=l ,=1 k"'J nmm nnmm + .L .L .L E(Z'JZhJ) + .L .L .L .L E(Z'1Zhk)' J=l h=l ,=1 k=l 1=1 h=l ,=1 h"" k"'J h"" Note that there are mn terms in the first of these sums, mn(n - 1) in the second, mn(m - 1) in the third, and mn(m - 1)(n - 1) in the fourth. When H o is true, we know that X" X h , Y J , and Y k , i #- h, j #- k, are mutually stochastically independent and have the same distribution of the continuous type. Thus Pr (X, < Y J) = t. Moreover, Pr (X, < YJ , X, < Yk) = t because this is the probability that a designated one of three items is less than each of the other two.
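The recursion and boundary conditions above translate directly into a memoized function. The sketch below (standard library only; the function name mirrors the text's h(u; m, n)) reproduces the values h(u; 1, 2) = 1/3 computed above and checks that E(U) = mn/2 for m = 4, n = 3.

```python
from functools import lru_cache
from fractions import Fraction

@lru_cache(maxsize=None)
def h(u, m, n):
    """Pr(U = u) under H0, via the recursion and boundary conditions of the text."""
    if u < 0:
        return Fraction(0)
    if m == 0 or n == 0:
        return Fraction(1) if u == 0 else Fraction(0)
    return (Fraction(m, m + n) * h(u, m - 1, n)
            + Fraction(n, m + n) * h(u - m, m, n - 1))

print([h(u, 1, 2) for u in range(3)])          # 1/3, 1/3, 1/3, as computed in the text
m, n = 4, 3
mean = sum(u * h(u, m, n) for u in range(m * n + 1))
print(mean, Fraction(m * n, 2))                # E(U) = mn/2
```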
  • 171. 330 Nonparametric Methods [Ch, 9 Sec. 9.7] Distributions Under Alternative Hypotheses 331 i =1= h, j =1= k. j =1= k, i =1= h, Similarly, Pr (X, < YJ, X h < YJ) = t. Finally, Pr (Xl < YJ , X; < Y k ) = Pr (X, < YJ) Pr (Xh < Y k ) = l Hence we have E(ZrJ) = (1)2 Pr (X, < Y J ) = t, E(ZtJZ'k) = (1)(1) Pr (X, < YJ, X, < Yk) = j-, E(Z'JZhJ) = (1)(1) Pr (X, < YJ, X; < YJ) = j-, and E(ZtJZhk) Thus E(U2 ) = mn mn(n - 1) mn(m - 1) mn(m - 1)(n - 1) 2+ 3 + 3 + 4 and 2 _ [1 n - 1 m - 1 (m - 1)(n - 1) _ m 4n] au - mn 2 + -3- + -3- + 4 mn(m + n + 1) 12 Although it is fairly difficult to prove, it is true, when F(z) = G(z) for all z, that U _ mn 2 Jmn(m + n + 1) 12 has, if each of m and n is large, an approximate distribution that is n(O, 1). This fact enables us to compute, approximately, various significance levels. Prior to the introduction of the statistic U in the statistical litera- ture, it had been suggested that a test of H o: F(z) = G(z), for all z, be based upon the following statistic, say T (not Student's t). Let T be the sum of the ranks of Y 1 , Y 2 , .•. , Yn among the m + n items Xl>' 00' X m' Y 1> 0 • • , Y n' once this combined sample has been ordered. In Exercise 9.31 the reader is asked to show that U = T _ n(n + 1)0 2 This formula provides another method of computing U and it shows that a test of H 0 based on U is equivalent to a test based on T. A generaliza- tion of T is considered in Section 9.8. Example 1. With the assumptions and the notation of this section, let m = 10 and n = 9. Let the observed values of X be as given in the first row and the observed values of Y as in the second row of the following display: 4.3, 5.9, 4.9, 3.1, 5.3, 6.4, 6.2, 3.8, 7.5, 5.8, 5.5,7.9,6.8,9.0,5.6,6.3,8.5,4.6,7.1. Since, in the combined sample, the ranks of the values of yare 4, 7, 8, 12, 14, 15, 17, 18, 19, we have the experimental value of T to be equal to t = 114. Thus u = 114 - 45 = 69. If F(z) = G(z) for all z, then, approximately, (u - 45 ) 0.05 = Pr 12.247 ~ 1.645 = Pr (U ~ 65.146). Accordingly, at the 0.05 significance level, we reject the hypothesis H o: F(z) = G(z), for all z, and accept the alternative hypothesis H 1 : F(z) ~ G(z), for all z. EXERCISES 9.29. Compute the distribution of U in each of the following cases: (a) m = 2, n = 1; (b) m = 2, n = 2; (c) m = 1, n = 3; (d) m = 3, n = 1. 9.30. Suppose the hypothesis H o: F(z) = G(z), for all z, is not true. Let P = Pr (X, < YJ ) . Show that U/mn is an unbiased estimator of p and that it converges stochastically to p as m ~ 00 and n ~ 00. 9.31. Show that U = T - [n(n + I)J/2. Hint. Let Y(l) < Y(2) < ... < Y(n) be the order statistics of the random sample Yl> Y2, ... , Ytr- If R, is the rank of Y(,) in the combined ordered sample, note that Y(,) is greater than R, - i values of X. 9.32. In Example 1 of this section assume that the values came from two independent normal distributions with means JL1 and JL2' respectively, and with common variance u2 . Calculate the Student's t which is used to test the hypothesis H 0: JL1 = JL2' If the alternative hypothesis is H 1: JL1 < JL2' do we accept or reject H o at the 0.05 significance level? 9.7 Distributions Under Alternative Hypotheses In this section we shall discuss certain problems that are related to a nonparametric test when the hypothesis H 0 is not true. Let X and Y be stochastically independent random variables of the continuous type with distribution functions F(x) and G(y), respectively, and probability density functions f(x) and g(y). 
Let Xl' X 2 , ••• , X m and Y1, Y 2, . 0 0, Yn denote independent random samples from these distributions. Consider the hypothesis H 0: F(z) = G(z) for all values of
  • 172. 332 Nonparametric Methods [Ch, 9 Sec. 9.7] Distributions Under Alternative Hypotheses 333 z. It has been seen that the test of this hypothesis may be based upon the statistic U, which, when the hypothesis H o is true, has a distribution that does not depend upon F(z) = G(z). Or this test can be based upon the statistic T = U + n(n + 1)/2, where T is the sum of the ranks of Y v Y 2,.. " Y n in the combined sample. To elicit some information about the distribution of T when the alternative hypothesis is true, let us consider the joint distribution of the ranks of these values of Y. Let Y(I) < Y(2) < ... < Y(n) be the order statistics of the sample Y v Y 2,... , Yn- Order the combined sample, and let R, be the rank of Y(i)' i = 1,2, ... , n. Thus there are i-I values of Y and R, - i values of X that are less than Y(i)' Moreover, there are R, - Ri- 1 - 1 values of X between Y(i -1) and Y(i). If it is given that Y(l) = Yl < Y(2) = Y2 < ... < Y(n) = Yn' then the conditional probability (1) Pr (R1 = r v R2 = r 2, · .. , R; = r nlYl < Y2 < ... < Yn), where ri < r 2 < ... < r n .:::; m + n are positive integers, can be com- puted by using the multinomial p.d.f. in the following manner. Define the following sets: Al = {x; -00 < x < Yl}, Ai = {x; Yi-l < X < Yi}' i = 2, ... , n, A n+ l = {x; Yn < X < co}. The conditional probabilities of these sets are, respectively, PI = F(YI)' P2 = F(Y2) - F(YI)' ... , P« = F(Yn) - F(Yn-I), Pn+l = 1 - F(Yn)' Then the conditional prob- ability of display (1) is given by (r1 - 1)1 (r2 - r1 - I)!··· (rn - rn- 1 - I)! (m + n - rn)! To find the unconditional probability Pr (RI = r l , R2 = r 2 , ••• , Rn = rn), which we denote simply by Pr (rv ... , rn), we multiply the conditional probability by the joint p.d.f. of Y(l) < Y(2) < ... < Y(n), namely n! g(Yl)g(Y2) ... g(Yn) , and then integrate on Yv Y2' ... , Yn- That is, Pr (rl , r2,... , rn) = I~00 .. -I~300 J:200 Pr (rv ... , rnlYl < ... < Yn)nl x g(Yl) ... g(Yn) dYl ... dYn' where Pr (rl , ... , rnlYl < ... < Yn) denotes the conditional probability in display (1). Now that we have the joint distribution of R1 , R2 , ••• , Rn, we can find, theoretically, the distributions of functions of R v R 2 , ••• , R; and, n in particular, the distribution of T = 2: Ri . From the latter we can find 1 that of U = T - n(n + 1)/2. To point out the extremely tedious computational problems of distribution theory that we encounter, we give an example. In this example we use the assumptions of this section. Example 1. Suppose that an hypothesis H o is not true but that in fact f(x) = 1, 0 < x < 1, zero elsewhere, and g(y) = 2y, 0 < Y < 1, zero else- where. Let m = 3 and n = 2. Note that the space of U is the set {u; u = 0, 1, ... , 6}.Consider Pr (U = 5). This event U = 5 occurs when and only when RI = 3, R2 = 5, since in this section RI < R2 are the ranks of Y(l) < Y(2) in the combined sample and U = RI + R2 - 3. Because F(x) = x, 0 < x .:::; 1, we have II(y6 y6) = 24 0 4 2 - 52 dY2 = /5' Consider next Pr (U = +). The event U = 4 occurs if RI = 2, R2 = 5 or if RI = 3, R2 = 4. Thus Pr (U = 4) = Pr (RI = 2, R2 = 5) + Pr (RI = 3, R2 = 4); the computation of each of these probabilities is similar to that of Pr (RI = 3, R2 = 5). This procedure may be continued until we have computed Pr (U = u) for each u E {u; U = 0, 1, ... , 6}. In the preceding example the probability density functions and the sample sizes m and n were selected so as to provide relatively simple integrations. 
The reader can discover for himself how tedious, and even difficult, the computations become if the sample sizes are large or if the probability density functions are not of a simple functional form. EXERCISES 9.33. Let the probability density functions of X and Y be those given in Example 1 of this section. Further let the sample sizes be m = 5 and n = 3. If RI < R2 < R3 are the ranks of Y(1) < Y(2) < Y(31 in the combined sample, compute Pr (RI = 2, R2 = 6, R3 = 8). 9.34. Let Xl, X 2 , ••• , Xm be a random sample of size m from a distri- bution of the continuous type with distribution function F(x) and p.d.f. F'(x) = f(x). Let Y v Y 2, ••• , Y n be a random sample from a distribution
  • 173. where VI < V2 < ... < Vm+n are the order statistics of a random sample of size m + n from the uniform distribution over the interval (0, 1). with distribution function G(y) = [F(y)J9, 0 < e. If e =1= 1, this distribution is called a Lehmann alternative. With e= 2, show that 335 which is equal to the number of the m values of X that are in the lower half of the combined sample of m + n items (a statistic used in the median test of Section 9.5). To determine the mean and the variance of L, we make some observations about the joint and marginal distributions of the ranks Rv R2 , ••• , RN • Clearly, from the results of Section 4.6 on the distri- bution of order statistics of a random sample, we observe that each permutation of the ranks has the same probability, which is the sum of the ranks of Y1> Y2' ••• , Yn among the m + n items (a statistic denoted by T in Section 9.6). (b) Take c(i) = 1, provided that i $ (m + n)/2, zero otherwise. If al = ... = am = 1 and am + 1 = ... = aN = 0, then Sec. 9.8] Linear Rank Statistics Nonparametric Methods [Ch, 9 9.35. To generalize the results of Exercise 9.34, let G(y) = h[F(y)], where h(z) is a differentiable function such that h(O) = 0, h(l) = 1, and h'(z) > 0,0 < z < 1. Show that Pr tr.. r 2 , .• " r n ) = 2 nr l(r2 + l)(rs+ 2) ... (rn + n - 1) . (m+ n) m (m + n + l)(m + n + 2)... (m + 2n) 334 1'; = I, 2, ... , N, In a similar manner, the joint marginal p.d.f. of R, and Rj , i =1= j, is zero elsewhere. That is, the (n - 2)-fold summation where rl, r2 , ••• , rNis any permutation of the first N positive integers. This implies that the marginal p.d.f. of R; is 1 (N - 2)1 1 2: ...2: N! = Nt = N(N - 1)' zero elsewhere, because the number of permutations in which R, = Tt is (N - I)! so that VI = Xl"'" Vm = X m, Vm+ 1 = Y l, · · · , VN = v; These two special statistics result from the following respective assign- ments for c(i) and av a2 , ••• , aN: (a) Take c(i) = i, al = ... = am = °and am+ 1 = ... = aN = I, so that is called a linear rank statistic. To see that this type of statistic is actually a generalization of both the Mann-Whitney-Wilcoxon statistic and also that statistic associ- ated with the median test, let N = m + nand In this section we consider a type of distribution-free statistic that is, among other things, a generalization of the Mann-Whitney- Wilcoxon statistic. Let Vv V2, ••• , VN be a random sample of size N from a distribution of the continuous type. Let R, be the rank of Vi among Vv V2 , •. • , VN' i = 1,2, ... , N; and let c(i) be a scoring function defined on the first N positive integers-that is, let c(I), c(2), .. " c(N) be some appropriately selected constants. If av a2 , ••. , aN are constants, then a statistic of the form 9.8 Linear Rank Statistics N m+n L = .L a;c(R;) .L s; ;=1 ;=m+1 where the summation is over all permutations in which R; = rt and R, = rj •
  • 174. However, since say, for all i = 1, 2, ... , N. In addition, we have that 337 Sec. 9.8] Linear Rank Statistics However we can determine a substitute for the second factor by observ- , ing that N N N 2 (al - Zi)2 = N 2 at - N2Zi2 1=1 1=1 = N J1 af - (~1 air = N ~ af - [.~ af + 2 2 ala;] 1=1 t=1 1*, N = (N - 1) 2 af - 22 ala;. 1=1 1*; So, making this substitution in az, we finally have that az = [i: (ck - C)2 ][N i (at - zW] k=1 N(N - 1) 1=1 1 N N = - - L (at - Zi)2 L (ck - C)2. N - 1 1= 1 k=1 In the special case in which N = m + nand N L = 2 c(RI) . l=m+1 Nonparametric Methods [Ch, 9 N ( 1) c(l) + ... + c(N) E[c(RI)J = r;1 c(rl)N = N . Among other things these properties of the distribution of Rv R2 , ... , RN imply that 336 If, for convenience, we let c(k) = ck , then for all i = 1,2, ... , N. A simple expression for the covariance of c(RI) and c(R;), i i= j, is a little more difficult to determine. That covariance is [ N ]2 N °= 2 (ck - c) = 2 (ck - C)2 + 2 2 (ck - c)(ch - c), k=1 k=1 k*h the covariance can be written simply as With these results, we first observe that the mean of L is where a = (2 at)/N. Second, note that the variance of L is the reader is asked to show that (Exercise 9.36) 2 mn ~ ( -)2 iLL = nc, aL = N(N _ 1) k~1 c k - C • A further simplification when Ck = c(k) = k yields n(m + n + 1) mn(m + n + 1), az = --'------:-::------' iLL = 2 ' 12 these latter are, respectively, the mean and the variance of the statistic T as defined in Section 9.6. As in the case of the Mann-Whitney-Wilcoxon statistic, the deter- mination of the exact distribution of a linear rank statistic L can be very difficult. However, for many selections of the constants a1 , a2' ... , aN and the scores c(l), c(2), ... , c(N), the ratio (L - iLL)/aL has, for large N, an approximate distribution that is n(O, 1). This approxima- tion is better if the scores c(k) = Ck are like an ideal sample from a normal distribution, in particular, symmetric and without extreme values. For example, use of normal scores defined by N : 1 = fkoo V~7T exp ( - ~2) dw
  • 175. 338 Nonparametric Methods [Ch. 9 Sec. 9.8] Linear Rank Statistics 339 makes the approximation better. However, even with the use of ranks, c(k) = k, the approximation is reasonably good, provided that N is large enough, say around 30 or greater. In addition to being a generalization of statistics such as those of Mann, Whitney, and Wilcoxon, we give two additional applications of linear rank statistics in the following illustrations. Example 1. Let Xl' X 2, ••. , X; denote n random variables. However, s~ppose that .we question whether they are items of a random sample due either to possible lack of mutual stochastic independence or to the fact that Xl' X 2 , ••• , X n might not have the same distributions. In particular, say we suspect a trend toward larger and larger values in the sequence X X b 2, ... , X n • If R, = rank (XI), a statistic that could be used to test the alterna- tive (trend) hypothesis is L = I iRI • Under the assumption (H ) that the 1=1 a n random variables are actually items of a random sample from a distri- bution of the continuous type, the reader is asked to show that (Exercise 9.37) which in turn equals From the first of these two additional expressions for Spearman's statistic, it is clear that I R,QI is an equivalent statistic for the purpose of testing 1=1 the stochastic independence of X and Y, say Ha. However, note that if Ha n is true, then the distribution of 2: Q,R" which is not a linear rank statistic, 1= 1 and L = i iR, are the same. The reason for this is that the ranks Rl , R2, 1=1 ... , R; and the ranks QI, Q2. . . ., Qn are stochastically independent because of the stochastic independence of X and Y. Hence, under Ha, pairing RI , R 2 , ••• , R; at random with 1, 2, , n is distributionally equivalent to pairing those ranks with QI' Q2' , Qn' which is simply a permutation of 1, 2, ... , n. The mean and the variance of L is given in Example 1. ! n(n + 1)2 floL = 4 ' EXERCISES The critical region of the test is of the form L :2:: d, and the constant d can be determined either by using the normal approximation or referring to a tabu- lated distribution of L so that Pr (L :2:: d; Ha) is approximately equal to a desired significance level Ct. Example 2. Let (Xv Yl ) , (X2 , Y 2) , •.. , (X n, Yn) be a random sample from a bivariate distribution of the continuous type. Let R, be the rank of Xi among Xv X 2 , ••• , X; and Q, be the rank of Y among Y Y Y t 1J 2,···, n- If X and Y have a large positive correlation coefficient, we would anticipate that R, and Q, would tend to be large or small together. In particular the correlation coefficient of (Rv QI), (R2, Q2), ... , (Rn,Qn), namely the S~ear­ man rank correlation coefficient, n _ _ 2: (RI - R)(Q, - Q) '=1 would tend to be large. Since Rl , R2, ... , R; and Q Q Q v 2, ••• , n are permutations of 1,2, ... , n, this correlation coefficient can be shown (Exercise 9.38) to equal n 2: RIQI - n(n + 1)2/4 1=1 n(n2 - 1)/12 9.36. Use the notation of this section. N (a) Show that the mean and the variance of L = 2: c(RI) are equal to I=m+l the expressions in the text. N (b) In the special case in which L = 2: R" show that floL and a2 are I=m+l those of T considered in Section 9.6. Hint. Recall that i k2 = N(N + 1~(2N + 1). k=l 9.37. If Xl' X 2 • . .. , X; is a random sample from a distribution of the continuous type and if R, = rank (Xi), show that the mean and the variance of L = 2: iR, are n(n + 1)2/4 and n2(n + 1)2(n - 1)/144, respectively. 9.38. 
Verify that the two additional expressions, given in Example 2, for the Spearman rank correlation coefficient are equivalent to the first one. Hint. 2: R~ = n(n + 1)(2n + 1)/6 and 2: (R, - Q.)2/2 = 2: (R~ + Qn/2 - 2: R,Q,. 9.39. Let Xl, X 2 , ••• , Xs be a random sample of size n = 6 from a distribution of the continuous type. Let R, = rank (Xl) and take al = as = 9, e a2 = a5 = 4, a3 = a4 = 1. Find the mean and the variance of L = 2: a,R" i= 1 a statistic that could be used to detect a parabolic trend in X v X 2, ... , X e-
  • 176. 340 Nonparametric Methods [Ch. 9 9.40. In the notation of this section show that the covariance of the two linear rank statistics, L 1 = l ajc(R.) and L = !!: b.d(R.) . 1 j = 1 1 2 j ;;;''1 1 1 , IS equa to N _ N j~l (aj - a)(bj - b)k~l (ck - c)(dk - il)j(N - 1), where, for convenience, dk = d(k). Chapter IO Sufficient Statistics 10.1 A Sufficient Statistic for a Parameter In Section 6.2 we let Xl, X 2 , ••• , X; denote a random sample of size n from a distribution that has p.d.f.j(x; B), BE Q. In each of several examples and exercises there, we tried to determine a decision function w of a statistic Y = u(Xv X 2 , ••• , X n) or, for simplicity, a function w of Xl, X 2 , ••• , X n such that the expected value of a loss function 2(B, w) is a minimum. That is, we said that the "best" decision function w(Xv X 2 , ••• , X n) for a given loss function 2(B, w) is one that minimizes the risk R(B, w), which for a distribution of the continuous type is given by R(B, w) = I~a>" -I~a> 2[B, w(xv"" xn)Jj(XI; B)·· -f(xn; B) dxl · · ·dxn- In particular, if E[w(XI, . . .. Xn)J = B and if 2(B, w) = (B - W)2, the best decision function (statistic) is an unbiased minimum variance estimator of B. For convenience of exposition in this chapter, we con- tinue to call each unbiased minimum variance estimator of B a best estimator of that parameter. However, the reader must recognize that "best" defined in this way is wholly arbitrary and could be changed by modifying the loss function or relaxing the unbiased assumption. The purpose of establishing this definition of a best estimator of Bis to help us motivate, in a somewhat natural way, the study of an important class of statistics called sufficient statistics. For illustration, note that in Section 6.2 the mean X of a random sample of Xl' X2 , • • • , 341
  • 177. 342 Sufficient Statistics [Ch. 10 Sec. 10.1] A Sufficient Statistic for a Parameter 343 = 0 elsewhere. What is the conditional probability Pr (Xl = XV X2 = X2 , · · . , X; = Xn/Yl = YI) = P(AIB), Example 2. Let Y I < Y 2 < ... < Y n denote the order statistics of a random sample Xl' X 2 , ••• , X; from the distribution that has p.d.f. f(x; 8) = e-(X-O" 8 < X < OCJ, -OCJ < 8 < 00, = 0 elsewhere. We now give an example that is illustrative of the definition. f,(',-:XI:-..=...;--;8)",--f~( X=2;_8.:-)_.'---,'f,-;,(,-,xn~;-:..8) ) - = H(Xl> X2,···, Xn, gl[U l (Xl> X2, ... , Xn); 8J where H(xl , X2 , ••• , xn) does not depend upon 8 En for every fixed value of Yl = ul(Xl> X 2, ••• , xn)· Remark. Why we use the terminology "sufficient statistic" can be explained as follows: If a statistic Y I satisfies the preceding definition, then the conditional joint p.d.f. of Xv X 2 , ••• , X n, given Y I = YI' and hence of each other statistic, say Y 2 = U 2 (X I, X 2 , ••• , X n ), does not depend upon the parameter 8. As a consequence, once given Y I = Yl> it is impossible to use Y 2 to make a statistical inference about 8; for example, we could not find a confidence interval for 8 based on Y 2' In a sense, Y I exhausts all the informa- tion about 8 that is contained in the sample. It is in this sense that we call Y I a sufficient statistic for 8. In some instances it is preferable to call Y I a sufficient statistic for the family {f(x; 8); 8 E Q} of probability density functions. f(X l; 8)f(x2; 8)· . -f(xn; 8) gl[Ul(Xl> X2,... , xn); 8J ' provided that Xl> x2,. . ., Xnare such that the fixed Yl = Ul(Xl>X2, ... , xn), and equals zero otherwise. We say that Yl = U l (X l> X 2" " , X n) is a sufficient statistic for 8 if and only if this ratio does not depend upon 8. While, with distributions of the continuous type, we cannot use the same argument, we do, in this case, accept the fact that if this ratio does not depend upon 8, then the conditional distribution of Xl' X 2 , •.. , X n , given Y1 = Yl> does not depend upon 8. Thus, in both cases, we use the same definition of a sufficient statistic for 8. Definition 1. Let Xl> X 2' ..• , X n denote a random sample of size n from a distribution that has p.d.f. f(x; 8), 8 E n. Let Yl = ul(Xl> X 2 , ••• , X n) be a statistic whose p.d.f. is gl(Yl; 8). Then Y1 is a sufficient statistic for 8 if and only if conditional probability of Xl = Xl> X 2 = X 2, •• ·, X n = X n, given Y1 = Yl' equals ( n )8L:x.(1 _ 8)n-1,xl LX, 1 YI=O,l, ... ,n, x = 0, 1; 0 < 8 < 1; f(X; 8) = 8X(1 - 8)l-x, = 0 elsewhere. say, where YI = 0, 1, 2, .. " n? Unless the sum of the integers Xl' X 2, ••• , Xn (each of which equals zero or 1) is equal to Yl> this conditional probability obviously equals zero because A () B = 0. But in the case YI = 2: X" we have that A C B so that A () B = A and P(AIB) = P(A)/P(B); thus the conditional probability equals 8XI(1 - 8)l-x18X2(1 - 8)I-X2... 8Xn(1 _ 8)I-xn (~)8Yl(1 - 8)n-Y1 The statistic Y I = Xl + X 2 + ... + X n has the p.d.f. gl(YI; 8) = (~)8YI(1 - 8)n-YI, Since YI = Xl + X2 + ... + Xn equals the number of l's in the n independent trials, this conditional probability is the probability of selecting a particular arrangement of YI l's and (n - YI) zeros. Note that this conditional prob- ability does not depend upon the value of the parameter 8. In general, let gl(Yl; 8) be the p.d.I. of the statistic Yl = Ul(Xl, X 2 , ••• , X n), where Xl' X 2 , ••• , X; is a random sample arising from a distribution of the discrete type having p.d.f. 
$f(x; \theta)$, $\theta \in \Omega$. $X_n$ of size $n = 9$ from a distribution that is $n(\theta, 1)$ is unbiased and has variance less than that of the unbiased estimator $X_1$. However, to claim that it is a best estimator requires that a comparison be made with the variance of every other unbiased estimator of $\theta$. Certainly, it is impossible to do this by tabulation, and hence we must have some mathematical means that essentially does this. Sufficient statistics provide a beginning to the solution of this problem. To understand clearly the definition of a sufficient statistic for a parameter $\theta$, we start with an illustration.

Example 1. Let $X_1, X_2, \ldots, X_n$ denote a random sample from the distribution that has p.d.f.
  • 178. 344 Sufficient Statistics [Ch, 10 Sec. 10.1] A SUfficient Statistic for a Parameter 345 The p.d.f, of the statistic Y1 is gl(Y1; 0) = ne-n(Yl -8), 0 < Y1 < 00, = 0 elsewhere. Thus we have that e-(X1 -8)e-(x2-8) . . .e-(Xn -8) e-xl -x2 - ... -xn gl(min Xi; 0) ne n(mlnxj ) , which is free of 0 for each fixed Y1 = min (Xi), since Y1 ~ Xi' i = 1, 2, ... , n. That is, neither the formula nor the domain of the resulting ratio depends upon 0, and hence the first order statistic Y1 is a sufficient statistic for O. If we are to show, by means of the definition, that a certain statistic Y1 is or is not a sufficient statistic for a parameter 8, we must first of all know the p.d.f. of Yl> say gl(Yl; 8). In some instances it may be quite tedious to find this p.d.f. Fortunately, this problem can be avoided if we will but prove the following factorization theorem of Neyman. Theorem 1. Let Xl> X 2, 0 • • , X; denote a random sample from a distribution that has p.d.f. f(x; 8), 8 E Q. The statistic Yl = ul(Xl> X 2, . '0' X n) is a sufficient statistic for 8 if and only if we can find two nonnegative functions, kl and k2, such that f(xl; 0)f(x2; 8) ... f(xn; 8) = kl[Ul(Xl> X2, 0 0 . , xn); 8]k2(Xl, X2, 0 • 0 ' Xn), where, for every fixed value of Yl = Ul(Xl, x2, 0 0 . , Xn), k2(xl> x2, 0 • 0 ' xn) does not depend upon 8. Proof. We shall prove the theorem when the random variables are of the continuous type. Assume the factorization as stated in the theorem. In our proof we shall make the one-to-one transformation Yl = ul (Xl' . 0 . , Xn), Y2 = U2(Xl> 0 • 0 ' Xn), 0 • • , Yn = Un(Xl> 0 • 0 ' xn) having the inverse functions Xl = wl(Yl>"" Yn), X2 = w2(Yl> 0 • • , Yn), . 0 0 ' Xn = Wn(Yl' ... , Yn) and Jacobian]. The joint p.d.f, of the statistics Yl> Y2' o • " Yn is then given by where Wi = wi(Yl> Y2' 0 0 . , Yn), i = 1, 2, 0 • • , n. The p.d.f. of Yl> say gl (Yl; 8), is given by gl(Yl; 0) = I~<Xl" -f~<Xlg(Yl' Y2' 0 ' 0 , Yn; 8) dY2' 0 ·dYn = kl(Yl; 8) I~<Xl' .. I~<Xl IJlk2(W1,W2, 0 • • , wn) dY2°' ·dYno Now the function k2, for every fixed value of Y1 = Ul(Xl, . '0' Xn), does not depend upon 8. Nor is 8 involved in either the Jacobian J or the limits of integration. Hence the (n - Ij-fold integral in the right- hand member of the preceding equation is a function of Yl alone, say m(Yl)' Thus gl(Yl; 8) = kl(Yl; 8)m(Yl)' If m(Yl) = 0, then gl(Yl; 8) = 0. If m(Yl) > 0, we can write k[ ( ) ' 8] _ gl[Ul (Xl> ooo, xn); 8] 1 Ul Xl> .. 0' Xn, - [( )] , m Ul Xl>' 0', Xn and the assumed factorization becomes f( . 8) f( . 8) ( ). 8 k2(xl> 0 0 0 , Xn) Xl' '" Xn, = gl[UlXl>,..,Xn , ] ( ) m[Ul Xl> 0 0 0 , Xn ] Since neither the function k2 nor the function m depends upon 8, then in accordance with the definition, Y1 is a sufficient statistic for the parameter 8. Conversely, if Y1 is a sufficient statistic for 8, the factorization can be realized by taking the function «. to be the p.d.f. of Yl , namely the function gl' This completes the proof of the theorem. Example 3. Let Xl> X 2 , ••• , X; denote a random sample from a distri- bution that is n(O, a2 ) , -00 < 0 < 00, where the variance a2 is known. If n x = L «[«, then 1 n n n L (Xi - 8)2 = L [(Xi - x) + (x - 8)]2 = L (Xi - x)2 + n(x - 0)2 i=l i=1 i=1 because n n 2 L (Xi - X)(X - 0) = 2(x - 0) L (Xi - x) = O. i=1 i=l Thus the joint p.d.f. 
of $X_1, X_2, \ldots, X_n$ may be written
$$\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\!\left[-\frac{\sum_{i=1}^n (x_i - \theta)^2}{2\sigma^2}\right] = \exp\!\left[-\frac{n(\bar x - \theta)^2}{2\sigma^2}\right]\left\{\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\!\left[-\frac{\sum_{i=1}^n (x_i - \bar x)^2}{2\sigma^2}\right]\right\}.$$
Since the first factor of the right-hand member of this equation depends upon $x_1, x_2, \ldots, x_n$ only through $\bar x$, and since the second factor does not depend upon $\theta$, the factorization theorem implies that the mean $\bar X$ of the sample is,
  • 179. 346 Sufficient Statistics [Ch, 10 Sec. 10.1] A Sufficient Statistic for a Parameter 347 for any particular value of a2 , a sufficient statistic for 8, the mean of the normal distribution. We could have used the definition in the preceding example because we know that X is n(B, (J2jn). Let us now consider an example in which the use of the definition is inappropriate. Example 4. Let Xv X 2 , ••• , X; denote a random sample from a distri- bution with p.d.f. Certainly, there is no 8 in the formula of the second factor, and it might be assumed that Y3 = max Xi is itself a sufficient statistic for 8. But what is the domain of the second factor for every fixed value of Y3 = max Xi? If max Xi = Xl, the domain is 8 < X2 < Xv 8 < X3 < Xl; if max Xi = X2 , the domain is 8 < Xl < X2 , 8 < X3 < X2 ; and if max Xi = X3, the domain is 8 < Xl < X3' 8 < X2 < X3. That is, for each fixed Y3 = max Xi' the domain of the second factor depends upon 8. Thus the factorization theorem is not satisfied. and Since k2 (xV X 2, ••• , xn) does not depend upon 8, the product XlX2 • • .Xn is a sufficient statistic for 8, = 0 elsewhere, where 0 < 8. We shall use the factorization theorem to prove that the product Ul(Xl, X 2 , ••• , X n) = XlX2 • • ,Xn is a sufficient statistic for 8. The joint p.d.f. of Xv X 2 , ••• , X; is 8n(XlX2" 'Xn)B-l = [8n(XlX2" 'Xn)BJ( 1 ), XlX2 • • 'Xn where 0 < Xi < 1, i = 1,2, .. " n. In the factorization theorem let kl[Ul(Xl, X2,···, xn); 8J = 8n(XlX2" 'Xn)B If the reader has some difficulty using the factorization theorem when the domain of positive probability density depends upon B, we recommend use of the definition even though it may be somewhat longer. Before taking the next step in our search for a best statistic for a parameter B, let us consider an important property possessed by a sufficient statistic Yl = ul(XV X 2 , ••• , X n) for B. The conditional p.d.f. of another statistic, say Y2 = u2(XvX2,···,Xn ), given Yl = Yv does not depend upon B. On intuitive grounds, we might surmise that the conditional p.d.£. of Y2' given some linear function a Y1 + b, a 1= 0, of Yv does not depend upon B. That is, it seems as though the random variable a Y1 + b is also a sufficient statistic for B. This con- jecture is correct. In fact, every function Z = u(Yl ) , or Z = u[ul(XV X 2 , .•• , X n)] = v(Xl, X 2 , .•• , X n), not involving B, with a single-valued inverse Yl = w(Z), is also a sufficient statistic for B. To prove this, we write, in accordance with the factorization theorem, o< X < 1, [i»; 8) = 8xB- 1, There is a tendency for some readers to apply incorrectly the factorization theorem in those instances in which the domain of positive probability density depends upon the parameter B. This is due to the fact that they do not give proper consideration to the domain of the function k2 (xv X 2, ••• ,xn) for every fixed value of Yl = u1(XV X 2, •.• , xn)· This will be illustrated in the next example. Example 5. In Example 2, with f(x; 8) = e-(X-B), 8 < X < 00, -00 < 8 < 00, it was found that the first order statistic Yl is a sufficient statistic for 8. To illustrate our point, take n = 3 so that the joint p.d.f. of Xl> X 2 , X 3 is given by 8 < Xi < 00, i = 1, 2, 3, We can factor this in several ways. One way, with n = 3, is given in Example 2. Another way would be to write the joint p.d.f. as the product However, Yl = w(z) or, equivalently, Ul(Xl, X2 , ••• , xn) = w[v(xv X2, ... , xn)] , which is not a function of B. 
Hence
$$f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta) = k_1\{w[v(x_1, x_2, \ldots, x_n)]; \theta\}\, k_2(x_1, x_2, \ldots, x_n).$$
Since the first factor of the right-hand member of this equation is a function of $z = v(x_1, \ldots, x_n)$ and $\theta$, while the second factor does not depend upon $\theta$, the factorization theorem implies that $Z = u(Y_1)$ is also a sufficient statistic for $\theta$.

The relationship of a sufficient statistic for $\theta$ to the maximum likelihood estimator of $\theta$ is contained in the following theorem.

Theorem 2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that has p.d.f. $f(x; \theta)$, $\theta \in \Omega$. If a sufficient statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ for $\theta$ exists and if a maximum likelihood estimator $\hat\theta$ of $\theta$ also exists uniquely, then $\hat\theta$ is a function of $Y_1 = u_1(X_1, X_2, \ldots, X_n)$.
  • 180. 348 Sufficient Statistics [Ch, 10 Sec. 10.2] The Rae-Blackwell Theorem 349 Proof. Let gl(Yl; e) be the p.d.f. of Y1. Then by the definition of sufficiency, the likelihood function L(e; Xv x2, ... , Xn) = f(x1; e)f(X2; e)· . -j(xn ; e) = gl[Ul(X1"", Xn) ; e]H(xv···, Xn), where H(xv ... , xn) does not depend upon e. Thus L and g., as functions of e, are maximized simultaneously. Since there is one and only one value of e that maximizes L and hence gl[U1(X1, ... , xn); 0], that value of e must be a function of u1(Xv X2, ... , xn) . Thus the maximum likelihood estimator 0 is a function of the sufficient statistic Y1 = U1(X1, X 2 , ••• , X n)· EXERCISES 10.1. Let Xl, X 2 , ••• , X; be a random sample from the normal distribu- • n tion n(O, 8), 0 < 8 < 00. Show that L X: is a sufficient statistic for 8. 1 10.2. Prove that the sum of the items of a random sample of size n from a Poisson distribution having parameter 8, 0 < 8 < 00, is a sufficient statistic for 8. 10.3. Show that the nth order statistic of a random sample of size n from the uniform distribution having p.d.f. f(x; 8) = l/e, 0 < x < e,o < e < 00, zero elsewhere, is a sufficient statistic for 8. Generalize this result by consider- ing the p.d.f. f(x; e) = Q(8)M(x), 0 < x < 0, 0 < 0 < 00, zero elsewhere. Here, of course, f o 1 o M(x) dx = Q(ef 10.4. Let Xl, X 2 , ••• , X; be a random sample of size n from a geometric distribution that has p.d.f. f(x; e) = (1 - e)XO, x = 0, 1, 2, ... , 0 < 0 < 1, n zero elsewhere. Show that L X, is a sufficient statistic for O. 1 10.5. Show that the sum of the items of a random sample of size n from a gamma distribution that has p.d.f. f(x; 0) = (1/e)e- X10 , 0 < x < 00, o < 0 < 00, zero elsewhere, is a sufficient statistic for e. 10.6. In each of the Exercises 10.1, 10.2, 10.4, and 10.5, show that the maximum likelihood estimator of eis a function of the sufficient statistic for O. 10.7. Let Xl' X 2 , •• " X n be a random sample of size n from a beta distribution with parameters ex = 0 > 0 and f3 = 2. Show that the product X lX2 • • ,Xn is a sufficient statistic for O. 10.8. Let Xl> X 2 , ••• , X n be a random sample of size n from a distribution with p.d.f. f(x; 8) = 1/7T[1 + (x - 8)2J, -00 < x < 00, -00 < 8 < 00. Can the joint p.d.f. of Xl> X 2 , ••• , X n be written in the form given in Theorem 1? Does the parameter 8 have a sufficient statistic? 10.2 The Rao-Blackwell Theorem We shall prove the Rao-Blackwell theorem. Theorem 3. Let X and Y denote random variables such that Y has mean fL and positive variance uf. Let E(Ylx) = <p(x). Then E[<p(X)] = fL and u~(X) :::; uf· Proof. We shall give the proof when the random variables are of the continuous type. Let f(x, y), fl(X) , f2(Y), and h(ylx) denote, respec- tively, the joint p.d.f. of X and Y, the two marginal probability density functions, and the conditional p.d.f. of Y, given X = x. Then f oo 00 yf(x,y)dy E(Ylx) = roo yh(ylx) dy = - 00 fl(X) = <p(x), so that f~00 yf(x, y) dy = <P(X)fl(X). We have E[<p(X)] = f~oo<p(x)fl(X)dx = f~oo[f~ooYf(x,y)dy]dx = f~00 y[f~oof(x, y) dX] dy = f:00 Yf2(Y) dy = fl-, and the first part of the theorem is established. Consider next uf = E[(Y - fl-)2J = E{[(Y - <p(X)) + (<p(X) - fl-)]2} = E{[Y - <p(X)P} + E{[<p(X) - fLP} + 2E{[Y - <p(X)][<p(X) - fl-]}. We shall show that the last term of the right-hand member of the immediately preceding equation is zero. We have E{[Y - <p(X)][<p(X) - fL]} = f~00 f~00 [y - <p(x)][<p(x) - fl-]f(x, y) dy dx. 
In this integral we shall write $f(x, y)$ in the form $h(y|x)f_1(x)$, and we shall integrate first on $y$ to obtain
$$E\{[Y - \varphi(X)][\varphi(X) - \mu]\} = \int_{-\infty}^{\infty} [\varphi(x) - \mu]\left\{\int_{-\infty}^{\infty} [y - \varphi(x)]\, h(y|x)\, dy\right\} f_1(x)\, dx.$$
  • 181. 350 Sufficient Statistics [eb. 10 Sec. 10.2] The Rao-Blackwell Theorem 351 But cp(x) is the mean of the conditional p.d.f. h(ylx). Hence f~oo [y - cp(x)]h(ylx) dy = 0, and, accordingly, E{[Y - cp(X)][cp(X) - ft]} = 0. Moreover, and Accordingly, and the theorem is proved when X and Yare random variables of the continuous type. The proof in the discrete case is identical to the proof given here with the exception that summation replaces integration. It is interesting to note, in connection with the proof of the theorem, that unless the probability measure of the set {(x, y); Y - cp(x) = O} is equal to 1 then E{[Y - cp(X)J2} > 0, and we have the strict inequality a~ > a~(x). We shall give an illustrative example. Example 1. Let X and Y have a bivariate normal distribution with means JL1 and JL2' with positive variances ut and u~, and with correlation coefficient p. Here E(Y) = JL = JL2 and u~ = u~. Now E(Ylx) is linear in x and it is given by E(Ylx) = cp(x) = JL2 + PU2 (x - JL1). U1 Thus cp(X) = JL2 + P(U2/U1)(X - JL1) and E[cp(X)] = JL2' as stated in the theorem. Moreover, With -1 < p < 1, we have the strict inequality u~ > p2u~. It should be observed that cp(X) is not a statistic if at least one of the five parameters is unknown. We shall use the Rao-Blackwell theorem to help us in our search for a best estimator of a parameter. Let Xl> X 2 , ••• , X; denote a random sample from a distribution that has p.d.f. f(x; 0), 0 E Q, where it is known that Y1 = U 1(X 1, X 2, , X n) is a sufficient statistic for the parameter O. Let Y2 = U 2(X 1, X 2, , X n) be another statistic (but not a function of Y1 alone), which is an unbiased estimator of 0; that is, E(Y2) = O. Consider E(Y2IY1). This expectation is a function of Yl> say CP(Y1). Since Y1 is a sufficient statistic for 0, the conditional p.d.f. of Y 2, given Y1 = Yl> does not depend upon 0, so E(Y2IY1) = CP(Y1) is a function of Y1 alone. That is, here cp(Y1) is a statistic. In accordance with the Rao-Blackwell theorem, cp(Y1) is an unbiased estimator of 0; and because Y2 is not a function of Y1 alone, the variance of cp(Y1) is strictly less than the variance of Y2. We shall summarize this discussion in the following theorem. Theorem 4. Let Xl' X 2, ... , X n, n a fixed positive integer, denote a random sample from a distribution (continuous or discrete) that has P.d.f.f(x; 0), 0 E Q. Let Y1 = U 1(X 1, X 2, ... , X n) be a sufficient statistic for 0, and let Y2 = U 2(X 1, X 2, ... , X n) , not a function of Y1 alone, be an unbiased estimator of o. Then E(Y2!Y1) = CP(Y1) defines a statistic cp( Y1). This statistic cp( Y1) is a function of the sufficient statistic for 0,. it is an unbiased estimator of 0,. and its variance is less than that of Y2. This theorem tells us that in our search for a best estimator of a parameter, we may, if a sufficient statistic for the parameter exists, restrict that search to functions of the sufficient statistic. For if we begin with an unbiased estimator Y2 that is not a function of the sufficient statistic Y1 alone, then we can always improve on this by computing E(Y2!Y1) = CP(Y1) so that cp(Y1) is an unbiased estimator with smaller variance than that of Y2. After Theorem 4 many students believe that it is necessary to find first some unbiased estimator Y2 in their search for cp(Y1), an unbiased estimator of 0 based upon the sufficient statistic Y1. This is not the case at all, and Theorem 4 simply convinces us that we can restrict our search for a best estimator to functions of Y1. 
It frequently happens that $E(Y_1) = a\theta + b$, where $a \neq 0$ and $b$ are constants, and thus $(Y_1 - b)/a$ is a function of $Y_1$ that is an unbiased estimator of $\theta$.
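The improvement promised by Theorem 4 is easy to see numerically. The following sketch, which is not part of the text, simulates the situation of Exercise 10.10: for two independent observations from the exponential p.d.f. $(1/\theta)e^{-x/\theta}$, the statistic $Y_2 = X_2$ is an unbiased estimator of $\theta$, while conditioning on the sufficient statistic $Y_1 = X_1 + X_2$ gives $\varphi(Y_1) = Y_1/2$, which is unbiased with half the variance. The parameter value and replication count below are arbitrary choices for the illustration.

```python
# A minimal Monte Carlo sketch (not from the text) of the Rao-Blackwell idea in
# Theorem 4, using the setup of Exercise 10.10: X1, X2 iid with p.d.f.
# (1/theta)exp(-x/theta).  Y2 = X2 is unbiased for theta; conditioning on the
# sufficient statistic Y1 = X1 + X2 gives phi(Y1) = Y1/2, also unbiased but
# with smaller variance.  theta and reps are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
theta, reps = 3.0, 200_000

x1 = rng.exponential(scale=theta, size=reps)
x2 = rng.exponential(scale=theta, size=reps)

y2 = x2                # unbiased estimator, not a function of Y1 alone
phi = (x1 + x2) / 2.0  # E(Y2 | Y1) = Y1/2, the conditioned estimator

print("mean of Y2      :", y2.mean())    # both means are close to theta
print("mean of phi(Y1) :", phi.mean())
print("var  of Y2      :", y2.var())     # close to theta^2
print("var  of phi(Y1) :", phi.var())    # close to theta^2/2, strictly smaller
```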
  • 182. 352 Sufficient Statistics [Ch, 10 Sec. 10.3] Completeness and Uniqueness 353 10.3 Completeness and Uniqueness Let X 1> X 2, .•• , X n be a random sample from the distribution that has p.d.f. n L XI is a 1=1 0<8 Y1 = 0, 1, Z, ... x = 0,1,2, ... ; = 0 elsewhere. f(x; 8) From Exercise 10.Z of Section 10.1 we know that Y1 sufficient statistic for 8 and its p.d.f. is (n8)Y 1e- ne Y1! = 0 elsewhere. That is, we can usually find an unbiased estimator based on Y1 without first finding an estimator Y2' In the next two sections we discover that, in most instances, if there is one function <plY1) that is unbiased, <p(Y1) is the only unbiased estimator based on the sufficient statistic Y1 • Remark. Since the unbiased estimator <p(Y1) , where <P(Yl) = E(YzIY1), has variance smaller than that of the unbiased estimator Y z of 8, students sometimes reason as follows. Let the function Y(Ya) = E[<p(Y1)Ya = Ya], where Ya is another statistic, which is not sufficient for 8. By the Rao- Blackwell theorem, we have that E[Y(Ya)] = 8 and Y(Ya) has a smaller variance than does <p(Y1) . Accordingly, Y(Ya) must be better than <p(Y1) as an unbiased estimator of 8. But this is not true because Ya is not sufficient; thus 8 is present in the conditional distribution of Y1 , given Ya = Ya, and the conditional mean Y(Ya)' So although indeed E[Y(Ya)] = 8, Y(Ya) is not even a statistic because it involves the unknown parameter 8 and hence cannot be used as an estimator. implies that Since e-ne does not equal zero, we have that o< 8 u(O) = 0, _ne[ n8 (n8)2 ] = e u(O) + u(l) - + u(2) - + ... . I! Z! o = u(O) + [nu(1)J8 + [n2~(Z)]82 + .. '. Let us consider the family {gl(Y1; 8); 0 < 8} of probability density functions. Suppose that the function u(Y1 ) of Y1 is such that E[u(Y1)] = 0 for every 8 > O. We shall show that this requires U(Y1) to be zero at every point Y1 = 0, 1, 2, .... That is, o= u(O) = u(l) = u(Z) = u(3) We have for all 8 > 0 that However, if such an infinite series converges to zero for all 8 > 0, then each of the coefficients must equal zero. That is, n2 u(2) nu(l) = 0, -Z- = 0, ... and thus 0 = u(O) = u(l) = u(Z) ="', as we wanted to show. Of EXERCISES 10.9. Let Y1 < Y z < Ya < Y4 < Y5 be the order statistics of a random sample of size 5 from the uniform distribution having p.d.f. f(x; 8) = 1/8, o< x < 8, 0 < 8 < 00, zero elsewhere. Show that 2Ya is an unbiased estimator of 8. Determine the joint p.d.f. of Ya and the sufficient statistic Y5 for 8. Find the conditional expectation E(2YaIY5) = <P(Y5)' Compare the variances of 2Ya and <p(Y5). Hint. All of the integrals needed in this exercise can be evaluated by making a change of variable such as z = yI8 and using the results associated with the beta p.d.f.; see Example 5, Section 4.3. 10.10. If Xl> X z is a random sample of size 2 from a distribution having p.d.f. f(x; 8) = (1/8)e- X10 , 0 < x < 00, 0 < 8 < 00, zero elsewhere, find the joint p.d.f. of the sufficient statistic Yl = Xl + X z for 8 and Y z = X z. Show that Yzis an unbiased estimator of 8 with variance 8z. Find E(YzIY1) = <P(Yl) and the variance of <p(Y1 ) . 10.11. Let the random variables X and Y have the joint p.d.f. f(x, y) = (2/8Z )e- (x +YJ/o, 0 < x < Y < 00, zero elsewhere. (a) Show that the mean and the variance of Yare, respectively, 38/2 and 58z/4. (b) Show that E(Ylx) = x + 8. In accordance with the Rao-Blackwell theorem, the expected value of X + 8 is that of Y, namely, 38/2, and the variance of X + 8 is less than that of Y. Show that the variance of X + 8 is in fact 8z/4. 
10.12. In each of Exercises 10.1, 10.2, and 10.5, compute the expected value of the given sufficient statistic and, in each case, determine an unbiased estimator of $\theta$ that is a function of that sufficient statistic alone.
  • 183. 354 Sufficient Statistics [Ch. 10 Sec. 10.3] Completeness and Uniqueness 355 for 8 > O. course, the condition E[u(YI)] = 0 for all 8 > 0 does not place any restriction on U(YI) when YI is not a nonnegative integer. So we see that, in this illustration, E[u(YI)] = 0 for all 8 > 0 requires that U(YI) equals zero except on a set of points that has probability zero for each p.d.f. gl(YI; 8), 0 < 8. From the following definition we observe that the family {gl(YI; 8); 0 < 8} is complete. Definition 2. Let the random variable Z of either the continuous type or the discrete type have a p.d.f. that is one member of the family {h(z; 8); 8 E Q}. If the condition E[u(Z)] = 0, for every 8 E Q, requires that u(z) be zero except on a set of points that has probability zero for each p.d.f. h(z; 8), 8 E Q, then the family {h(z; 8); 8 E Q} is called a complete family of probability density functions. Remark. In Section 1.9 it was noted that the existence of E[u(X)J implies that the integral (or sum) converge absolutely. This absolute con- vergence was tacitly assumed in our definition of completeness and it is needed to prove that certain families of probability density functions are complete. In order to show that certain families of probability density functions of the continuous type are complete, we must appeal to the same type of theorem in analysis that we used when we claimed that the moment- generating function uniquely determines a distribution. This is illus- trated in the next example. Example 1. Let Z have a p.d.f. that is a member of the family {h(z; 8), 0 < 8 < co}, where h(z; 8) = ~ e- z/ 9 , 0 < z < co, = 0 elsewhere. Let us say that E[u(Z)J = 0 for every 8 > O. That is, 1 foo 8 0 u(z)e-z/odz = 0, Readers acquainted with the theory of transforms will recognize the integral in the left-hand member as being essentially the Laplace transform of u(z). In that theory we learn that the only function u(z) transforming to a function of 8which is identically equal to zero is u(z) = 0, except (in our terminology) on a set of points that has probability zero for each h(z; 8), 0 < 8. That is, the family {h(z; 8); 0 < 8 < co}is complete. Let the parameter 8 in the p.d.f. f(x; 8), 8 E Q, have a sufficient statistic Y I = til(XI, X 2, ••• , X n), where Xv X 2, ••• , X; is a random sample from this distribution. Let the p.d.f. of Y I be gl(YI; 8), 8 E Q. It has been seen that, if there is any unbiased estimator Y2 (not a func- tion of Y I alone) of 8, then there is at least one function of Y I that is an unbiased estimator of 8, and our search for a best estimator of 8 may be restricted to functions of Y I . Suppose it has been verified that a certain function <p(YI), not a function of 8, is such that E[<p(YI)J = 8 for all values of 8,8 E Q. Let !fJ(YI) be another function of the sufficient statistic YI alone so that we have also E[!fJ(YI)] = 8 for all values of 8, 8 E Q. Hence oE Q. If the family {gl(YI; 0); 0 E Q} is complete, the function <P(YI) !fJ(YI) = 0, except on a set of points that has probability zero. That is, for every other unbiased estimator !fJ(YI) of 0, we have except possibly at certain special points. Thus, in this sense [namely <P(YI) = !fJ(YI), except on a set of points with probability zero], <p(YI) is the unique function of Yv which is an unbiased estimator of O. In accordance with the Rao-Blackwell theorem, <p(YI) has a smaller variance than every other unbiased estimator of O. That is, the statistic <p(YI) is the best estimator of B. 
This fact is stated in the following theorem of Lehmann and Scheffé.

Theorem 5. Let $X_1, X_2, \ldots, X_n$, $n$ a fixed positive integer, denote a random sample from a distribution that has p.d.f. $f(x; \theta)$, $\theta \in \Omega$, let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a sufficient statistic for $\theta$, and let the family $\{g_1(y_1; \theta);\ \theta \in \Omega\}$ of probability density functions be complete. If there is a function of $Y_1$ that is an unbiased estimator of $\theta$, then this function of $Y_1$ is the unique best estimator of $\theta$. Here "unique" is used in the sense described in the preceding paragraph.

The statement that $Y_1$ is a sufficient statistic for a parameter $\theta$, $\theta \in \Omega$, and that the family $\{g_1(y_1; \theta);\ \theta \in \Omega\}$ of probability density functions is complete is lengthy and somewhat awkward. We shall adopt the less descriptive, but more convenient, terminology that $Y_1$ is a complete sufficient statistic for $\theta$. In the next section we shall study a fairly large class of probability density functions for which a complete sufficient statistic $Y_1$ for $\theta$ can be determined by inspection.
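As a numerical illustration of Theorem 5 (a sketch added here, not part of the text), consider the uniform distribution of Exercise 10.3 with p.d.f. $f(x; \theta) = 1/\theta$, $0 < x < \theta$. The largest order statistic $Y_n$ is a complete sufficient statistic for $\theta$, and $(n+1)Y_n/n$ is the unbiased function of it, hence the unique best estimator; the estimator $2\bar X$ is also unbiased but is not a function of $Y_n$ and has larger variance. The simulation below merely checks these two facts; the parameter and sample sizes are arbitrary.

```python
# Small simulation (not from the text) illustrating Theorem 5 for the uniform
# distribution of Exercise 10.3, f(x; theta) = 1/theta on (0, theta).  The
# largest order statistic Y_n is a complete sufficient statistic, and
# (n+1)Y_n/n is the unbiased estimator based on it; 2*Xbar is also unbiased
# but has larger variance.  theta, n, and reps are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 5.0, 10, 100_000

x = rng.uniform(0.0, theta, size=(reps, n))
best = (n + 1) / n * x.max(axis=1)   # function of the complete sufficient statistic
other = 2.0 * x.mean(axis=1)         # unbiased, but not a function of Y_n

print("means :", best.mean(), other.mean())   # both near theta
print("vars  :", best.var(), other.var())     # best estimator has smaller variance
print("theory:", theta**2 / (n * (n + 2)), theta**2 / (3 * n))
```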
  • 184. 356 Sufficient Statistics [Ch. 10 Sec. 10.4] The Exponential Class of Probability Density Functions 357 10.13. If azZ + bz + c = 0 for more than two values of z, then a = b = c = O. Use this result to show that the family {b(2, e); 0 < B < I}is complete. 10.14. Show that each of the following families {f(x; B); 0 < (} < co} is not complete by finding at least one nonzero function u(x) such that E[u(X)] = O. for all B > O. 10.4 The Exponential Class of Probability Density Functions Consider a family {j(x; e); e E Q} of probability density functions, where Q is the interval set Q = {e; y < e < o}, where y and 0 are known constants, and where EXERCISES 1 (a) f(x; B) = 2B' - B < x < B, (1) f(x; e) = exp [p(e)K(x) + S(x) + q(e)], = 0 elsewhere. a < x < b, = 0 elsewhere. (b) n(O, B). 10.15. Let Xl> Xz, .. " X; represent a random sample from the discrete distribution having the probability density function f(x; e) = BX(l - (j)l-x, = 0 elsewhere. x = 0, 1,0 < B < 1, A p.d.f. of the form (1) is said to be a member of the exponential class of probability density functions of the continuous type. If, in addition, (a) neither a nor b depends upon e, y < e < 0, (b) p(e) is a nontrivial continuous function of e, y < e < o. (c) each of K'(x) =f:. 0 and S(x) is a continuous function of x, a < x < b, we say that we have a regular case of the exponential class. A p.d.f. exp [P(e) ~ K(x,) + ~ S(x,) + nq(e)] Let Xl' Xz, 0 0 . , X; denote a random sample from a distribution that has a p.d.f. which represents a regular case of the exponential class of the continuous type. The joint p.d.I. of Xl> X z, .. 0, X; is -00 < x < 00. 1 2 = __ e- x /ZfJ V27Te = exp ( - ;eX Z - In V27T8) , f(x; e) For example, each member of the family {j(x; e); °< e < co], where f(x; e) is n(O, e), represents a regular case of the exponential class of the continuous type because f(x; e) = exp [p(e)K(x) + S(x) + q(e)], = 0 elsewhere is said to represent a regular case of the exponential class of probability density functions of the discrete type if (a) The set {x; x = aI, az, ... } does not depend upon e. (b) p(e) is a nontrivial continuous function of e, y < e < o. (c) K(x) is a nontrivial function of x on the set {x; x = al> az, ••. }. is the unique best estimator of B. n Show that YI = L: X, is a complete sufficient statistic for B. Find the unique I function of YI that is the best estimator of B. Hint. Display E[u(YI )] = 0, show that the constant term u(O) is equal to zero, divide both members of the equation by B i= 0, and repeat the argument. 10.16. Consider the family of probability density functions {h(z; B); BE Q}, where h(z; B) = liB, 0 < z < B, zero elsewhere. (a) Show that the family is complete provided that Q = {B; 0 < B < co}, Hint. For convenience, assume that u(z) is continuous and note that the derivative of E[u(Z)J with respect to Bis equal to zero also. (b) Show that this family is not complete if Q = {B; 1 < B < co}. Hint. Concentrate on the interval 0 < z < 1 and find a nonzero function u(z) on that interval such that E[u(Z)J = °for all B > 1. 10.17. Show that the first order statistic Y I of a random sample of size n from the distribution having p.d.f. f(x; B) = e-(X-B), B < x < 00, -00 < B < 00, zero elsewhere, is a complete sufficient statistic for B. Find the unique function of this statistic which is the best estimator of B. 10.18. Let a random sample of size n be taken from a distribution of the discrete type with p.d.f. f(x; B) = liB, x = 1, 2, ... 
, $\theta$, zero elsewhere, where $\theta$ is an unknown positive integer. (a) Show that the largest item, say $Y$, of the sample is a complete sufficient statistic for $\theta$. (b) Prove that $[Y^{n+1} - (Y - 1)^{n+1}]/[Y^n - (Y - 1)^n]$
  • 185. exp [P((J) ~K(XI) + nq((J)] exp [~S(XI)l for a < XI < b, i = 1, 2, ... , n, 'Y < (J < 8, and is zero elsewhere. At points of positive probability density, this joint p.d.f. may be written as the product of the two nonnegative functions Sec. 10.4] The ExpOnential Class of Probability Density Functions 359 Example I. Let Xv X 2 , •• " X n denote a random sample from a normal distribution that has p.d.f. 358 Sufficient Statistics [Ch, 10 or . 1 [(X - 8)2] f(x,8) = aV2rr exp - 2a2 ' -00 < x < 00, -00 < 8 < 00, Here a2 is any fixed positive number. This is a regular case of the exponential class with Accordingly, YI = Xl + X2 + ... + X; = nX is a complete sufficient statistic for the mean 0 of a normal distribution for every fixed value of the variance a2 . Since E(YI) = nO, then <p(YI) = YI/n = X is the only function of YI that is an unbiased estimator of 8; and being a function of the sufficient statistic YI , it has a minimum variance. That is, X is the unique best estimator of 8. Incidentally, since YI is a single-valued function of X, X itself is also a complete sufficient statistic for 8. Example 2. Consider a Poisson distribution with parameter 8,0 < 8 < 00. The p.d.f. of this distribution is x=O,1,2, ... , 02 q(O) = - - . 2a2 K(x) = x, 8 P(8) = -, a2 8xe- 8 f(x; 8) = -,- = exp [(In 8)x - In (xl) - 8], x. ( 8 X2 _ ( 2 ) f(x; 8) = exp - x - - - In v2rra2 - - . a2 2a2 2a2 In accordance with the factorization theorem (Theorem 1, Section 10.1) n y 1 = L K(XI ) is a sufficient statistic for the parameter (J. To prove 1 n that y 1 = L K(Xi ) is a sufficient statistic for (J in the discrete case, 1 we take the joint p.d.f. of Xv X 2 , • • • , X; to be positive on a discrete set of points, say, when XI E {x; X = al> a2, ••. }, i = 1,2, ... , n. We then use the factorization theorem. It is left as an exercise to show that in either the continuous or the discrete case the p.d.f. of Y1 is of the form at points of positive probability density. The points of positive prob- ability density and the function R(Yl) do not depend upon 8. At this time we use a theorem in analysis to assert that the family {gl(Yl; 8); 'Y < 8 < 8} of probability density functions is complete. This is the theorem we used when we asserted that a moment-generating function (when it exists) uniquely determines a distribution. In the present context it can be stated as follows. Theorem 6. Let f(x; 8), 'Y < 8 < 8, be a p.d.J. which represents a regular case of the exponential class. Then if Xl> X 2 , ••• , X; (where n is a fixed positive integer) is a random sample from a distribution with p.d.J. n f(x; 8), the statistic Y1 = L K(X.) is a sufficient statistic for 8 and the 1 family {gl (Yl; (J); 'Y < B < 8} of probability density functions of Y1 is complete. That is, Y1 is a complete sufficient statistic for B. = 0 elsewhere. In accordance with Theorem 6, YI = ~ Xl is a complete sufficient statistic I for 8. Since E(YI) = n8, the statistic <p(YI) = YI/n = X, which is also a complete sufficient statistic for 8, is the unique best estimator of 8. EXERCISES 10.19. Write the p.d.f. This theorem has useful implications. In a regular case of form (1), n we can see by inspection that the sufficient statistic is Y1 = .L K(Xi ) . 1 If we can see how to form a function of Y 1 , say <p(Y1) , so that E[<p(Y1) ] = 8, then the statistic <p(Y1) is unique and is the best estimator of B. o< x < 00, 0 < 8 < 00, zero elsewhere, in the exponential form. If X1> X2' •.. 
, $X_n$ is a random sample from this distribution, find a complete sufficient statistic $Y_1$ for $\theta$
  • 186. 360 Sufficient Statistics [Ch, 10 Sec. 10.5] Functions of a Parameter 361 and the unique function cp(YI ) of this statistic that is the best estimator of B. Is cp(YI ) itself a complete sufficient statistic? 10.20. Let Xl> X 2 , ••• , X; denote a random sample of size n > 2 from a distribution with p.d.f. j(x; B) = Be-8X, 0 < x < 00, zero elsewhere, and n B > O. Then Y = L X, is a sufficient statistic for B. Prove that (n - l)jY I is the best estimator of B. 10.21. Let Xl' X 2 , ••• , X n denote a random sample of size n from a distribution with p.d.f.j(x; B) = Bx8- 0 < x < 1, zero elsewhere, and B > O. (a) Show that the geometric mean (XIX2 • · .Xn)l/n of the sample is a complete sufficient statistic for B. (b) Find the maximum likelihood estimator of B, and observe that it is a function of this geometric mean. 10.22. Let X denote the mean of the random sample Xl> X 2 , ••• , X n from a gamma-type distribution with parameters a > 0 and f3 = B > O. Compute E[Xllx]. Hint. Can you find directly a function ifi(X) of X such that E[ifi(X)] = B? Is E(Xllx) = ifi(x)? Why? 10.23. Let X be a random variable with a p.d.f. of a regular case of the exponential class. Show that E[K(X)] = -q'(B)jp'(B), provided these derivatives exist, by differentiating both members of the equality f:exp [P(B)K(x) + Sex) + q(B)] dx = 1 with respect to B. By a second differentiation, find the variance of K(X). 10.24. Given that j(x; B) = exp [BK(x) + Sex) + q(B)], a < x < b, y < B < 8, represents a regular case of the exponential class. Show that the moment-generating function M(t) of Y = K(X) isM(t) = exp[q(B) - q(B + t)], y < B + t < 8. 10.25. Given, in the preceding exercise, that E(Y) = E[K(X)] = B. Prove that Y is n(B, 1). Hint. Consider M'(O) = B and solve the resulting differential equation. 10.26. If Xl' X 2 , ••• , X n is a random sample from a distribution that has a p.d.f. which is a regular case of the exponential class, show that the p.d.f. of n YI = L K(X,) is of the form gl(YI; B) = R(YI) exp [P(B)YI + nq(B)]. Hint. I Let Y2 = X 2 , ••• , Yn = X n be n - 1 auxiliary random variables. Find the joint p.d.f. of YI , Y 2 , ••• , Y, and then the marginal p.d.f. of YI . 10.27. Let Y denote the median and let X denote the mean of a random sample of size n = 2k + 1 from a distribution that is n(p., a2 ) . Compute E(YIX = x). Hint. See Exercise 10.22. 10.5 Functions of a Parameter Up to this point we have sought an unbiased and minimum variance estimator of a parameter O. Not always, however, are we interested in 0 but rather in a function of O. This will be illustrated in the following examples. Example 1. Let Xl> X 2 , ••• , X; denote the items of a random sample of size n > 1 from a distribution that is b(l, B), 0 < B < 1. We know that if n Y = L X" then Yjn is the unique best estimator of B. Now the variance of I Yjn is B(l - B)jn. Suppose that an unbiased and minimum variance esti- mator of this variance is sought. Because Y is a sufficient statistic for B, it is known that we can restrict our search to functions of Y. Consider the statistic (Yjn)(l - Yjn)/n. This statistic is suggested by the fact that Yjn is the best estimator of B. The expectation of this statistic is given by 1 [Y ( V)] 1 1 2 nE n 1 - n = n2E(Y) - n3 E(Y ). Now E(Y) = nB and E(Y2) = nB(l - B) + n2B2. Hence ~ E[Y (1 _V)] = n - 1B(l - B). n n n n n If we multiply both members of this equation by nj(n - 1), we find that the statistic (Yjn)(l - Yjn)/(n - 1) is the unique best estimator of the variance of Yjn. 
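A quick Monte Carlo check of Example 1 of Section 10.5 (a sketch added here, not part of the text): for $Y$ that is $b(n, \theta)$, the statistic $(Y/n)(1 - Y/n)/(n - 1)$ should average to $\theta(1 - \theta)/n$, the variance of $Y/n$, while the seemingly natural statistic $(Y/n)(1 - Y/n)/n$ is biased. The parameter values below are arbitrary.

```python
# Sketch (not from the text) checking Example 1 of Section 10.5: with
# Y ~ b(n, theta), the statistic (Y/n)(1 - Y/n)/(n - 1) is an unbiased
# estimator of the variance theta(1 - theta)/n of Y/n, whereas dividing by n
# instead of n - 1 gives a biased estimator.  theta, n, reps are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 15, 200_000

y = rng.binomial(n, theta, size=reps)
p_hat = y / n

unbiased = p_hat * (1 - p_hat) / (n - 1)
biased = p_hat * (1 - p_hat) / n

print("target theta(1-theta)/n  :", theta * (1 - theta) / n)
print("mean of unbiased statistic:", unbiased.mean())
print("mean of biased statistic  :", biased.mean())
```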
A somewhat different, but very important problem in point estimation is considered in the next example. In the example the distribution of a random variable $X$ is described by a p.d.f. $f(x; \theta)$ that depends upon $\theta \in \Omega$. The problem is to estimate the fractional part of the probability for this distribution which is at or to the left of a fixed point $c$. Thus we seek an unbiased, minimum variance estimator of $F(c; \theta)$, where $F(x; \theta)$ is the distribution function of $X$.

Example 2. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n > 1$ from a distribution that is $n(\theta, 1)$. Suppose that we wish to find a best estimator of the function of $\theta$ defined by
$$\Pr(X \le c) = \int_{-\infty}^{c} \frac{1}{\sqrt{2\pi}}\, e^{-(x - \theta)^2/2}\, dx = N(c - \theta),$$
where $c$ is a fixed constant. There are many unbiased estimators of $N(c - \theta)$. We first exhibit one of these, say $u(X_1)$, a function of $X_1$ alone. We shall then
  • 187. 362 Sufficient Statistics [Ch. 10 Sec. 10.5] Functions of a Parameter 363 compute the conditional expectation, E[u(X]) IX = x] = ep(x), of this un- biased statistic, given the sufficient statistic X, the mean of the sample. In accordance with the theorems of Rao-Blackwell and Lehmann-Scheffe, ep(X) is the unique best estimator of N(c - 0). Consider the function u(x]), where The expected value of the random variable u(X]) is given by E[u(X])] = f.,u(x]) v'~7T exp [ - (x] ~ 0)2] dx] = fC (1) 1 exp [_ (x] - 0)2] dx], - co v'27T 2 because u(x]) = 0, x] > c. But the latter integral has the value N(c - 0). That is, u(X]) is an unbiased estimator of N(c - 0). We shall next discuss the joint distribution of X] and X and the condi- tional distribution of Xl> given X = x. This conditional distribution will enable us to compute E[u(X]) IX = x] = ep(x). In accordance with Exercise 4.81, Section 4.7, the joint distribution of X] and X is bivariate normal with means 0 ~nd 0, variances at = 1 and a~ = l/n, and correlation coefficient p = 1/v'n. Thus the conditional p.d.f. of Xl> given X = x, is normal with linear conditional mean o+ pal (x - 0) = x a2 and with variance 2(1 2) n - 1 a] - p = - - . n The conditional expectation of u(X]), given X = x, is then (-)- f'" ()J-n 1 [n(x] - X)2] ep X - -00 u x] n _ 1 v'27Texp - 2(n _ 1) dx] _ fC J-n 1 [n(x] - X)2] - -00 n - 1 v'27Texp - 2(n _ 1) dx]. The change of variable z = v'n(x] - x)/v'n - 1 enables us to write, with c' = v'n(c - x)/v'n - 1, this conditional expectation as ep(x) = fC' 1 e-z2/2dz = N(c') = N[v'n(c - X)]. - co v'27T v'n - 1 Thus the unique, unbiased, and minimum variance estimator of N(c - 0) is, for every fixed constant c, given by ep(X) = N[v'n(c - X)/v'n _ 1]. Remark. We should like to draw the attention of the reader to a rather important fact. This has to do with the adoption of a principle, such as the principle of unbiasedness and minimum variance. A principle is not a theorem; and seldom does a principle yield satisfactory results in all cases. So far, this principle has provided quite satisfactory results. To see that this is not always the case, let X have a Poisson distribution with parameter 0, °< 0 < 00. We may look upon X as a random sample of size 1 from this distribution. Thus X is a complete sufficient statistic for O. We seek the best estimator of e- 29 , best in the sense of being unbiased and having minimum variance. Consider Y = (_l)x. We have '" (_ O)Xe- 9 E(Y) = E[( _l)X] = L , = e- 29• x=Q x. Accordingly, (_l)X is the (unique) best estimator of e- 29, in the sense described. Here our principle leaves much to be desired. We are endeavoring to elicit some information about the number e-2 9 , where °< e- 29 < 1. Yet our point estimate is either -lor +1, each of which is a very poor estimate of a number between zero and 1. We do not wish to leave the reader with the impression that an unbiased, minimum variance estimator is bad.That is not the case at all. We merely wish to point out that if one tries hard enough, he can find instances where such a statistic is not good. Incidentally, the maximum likelihood estimator of e- 29 is, in the case where the sample size equals 1, e- 2 X , which is probably a much better estimator in practice than is the "best" estimator (_l)x. EXERCISES 10.28. Let Xl> X 2,... , X n denote a random sample from a distribution that is n(O, 1), -00 < 0 < 00. Find the best estimator of 02. Hint. First determine E(X2). 10.29. Let X], X 2, ... 
$X_n$ denote a random sample from a distribution that is $n(0, \theta)$. Then $Y = \sum X_i^2$ is a sufficient statistic for $\theta$. Find the best estimator of $\theta^2$.

10.30. In the notation of Example 2 of this section, is there a best estimator of $\Pr(-c \le X \le c)$? Here $c > 0$.

10.31. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with parameter $\theta > 0$. Find the best estimator of $\Pr(X \le 1) = (1 + \theta)e^{-\theta}$. Hint. Let $u(x_1) = 1$, $x_1 \le 1$, zero elsewhere, and find $E[u(X_1) \mid Y = y]$, where $Y = \sum_{i=1}^n X_i$. Make use of Example 2, Section 4.2.

10.32. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a Poisson
  • 188. 364 Sufficient Statistics [Ch. 10 Sec. 10.6] The Case of Several Parameters 365 distribution with parameter () > O. From the Remark of this section, we know that E[( -1)X1J = e- 29• (a) Show that E[(-l)X l lY1 = Y1J = (1 - 2jn)Yl , where Yl = Xl + X 2 + ... + X n • Hint. First show that the conditional p.d.f. of Xl, X 2 , •• " X n - v given Y1 = Yv is multinomial, and hence that of Xl given Y1 = Y1 is b(Y1' 1jn). (b) Show that the maximum likelihood estimator of e- 29 is e- 2X • (c) Since Y1 = nx, show that (1 - 2jn)Yl is approximately equal to e- 2X when n is large. Example 1. Let Xv X 2 , ••• , X n be a random sample from a distribution having p.d.f. = 0 elsewhere, where -00 < ()1 < 00,0 < ()2 < 00. Let Y1 < Y2 < ... < Y, be the order statistics. The joint p.d.f. of Y1 and Y n is given by 10.6 The Case of Several Parameters In many of the interesting problems we encounter, the p.d.f. may not depend upon a single parameter B, but perhaps upon two (or more) parameters, say B1 and B2, where (Bv (2) E.o, a two-dimensional parameter space. We now define joint sufficient statistics for the parameters. For the moment we shall restrict ourselves to the case of two parameters. Definition 3. Let Xl' X 2 , ••• , X; denote a random sample from a distribution that has p.d.f. f(x; B1 , ( 2), where (Bv (2 ) E.o. Let Y1 = u1(XV X 2, ... , X n) and Y2 = U2(X1, X 2, ... , X n) be two statistics whose joint p.d.f. is gdY1> Y2; B1> ( 2)· The statistics Y1 and Y2 are called faint sufficient statistics for B1 and B2 if and only if where, for every fixed Y1 = u1(X1> ... , xn) and Y2 = U2(X1, ... , xn) , H(x1> X2,.. " xn) does not depend upon B1 or B2. As may be anticipated, the factorization theorem can be extended. In our notation it can be stated in the following manner. The statistics Y1 = U1(X1, X 2,· .. , X n) and Y2 = U2(X1, X 2,. . . , X n) are joint sufficient statistics for the parameters B1 and B 2 if and only if we can find two nonnegative functions k1 and k2 such that f(x1; B1> ( 2)f(x2; B1> (2) ... f(xn ; B1> ( 2) = k1[u1(X1> X2, .. " xn) , u2(U1> X2,. . ., xn) ; B1> B2Jk2(X1, X2 , .•• , xn) , where, for all fixed values of the functions Y1 = U1(X1, X2, ... , xn) and Y2 = U2(X1, X2, ... , Xn), the function k 2(x1> X2, ... , xn) does not depend upon both or either of B 1 and B 2 . and equals zero elsewhere. Accordingly, the joint p.d.f. of Xv X 2 , ••• , X; can be written, for points of positive probability density, ( ~ ) n = n(n - l)[max(x,) - min (x,)]n-2 2()2 (2()2)n Since the last factor does not depend upon the parameters, either the def- inition or the factorization theorem assures us that Y1 and Yn are joint sufficient statistics for ()1 and ()2' The extension of the notion of joint sufficient statistics for more than two parameters is a natural one. Suppose that a certain p.d.f. depends upon m parameters. Let a random sample of size n be taken from the distribution that has this p.d.f. and define m statistics. These m statistics are called joint sufficient statistics for the m parameters if and only if the ratio of the joint p.d.f. of the items of the random sample and the joint p.d.f. of these m statistics does not depend upon the m parameters, whatever the fixed values of the m statistics. Again the factorization theorem is readily extended. There is an extension of the Rao-Blackwell theorem that can be adapted to joint sufficient statistics for several parameters, but that extension will not be included in this book. 
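To make the two-parameter factorization concrete, here is a small sketch (not part of the text) for the uniform p.d.f. of Example 1 of this section, $f(x; \theta_1, \theta_2) = 1/(2\theta_2)$ for $\theta_1 - \theta_2 < x < \theta_1 + \theta_2$. Its likelihood is $(2\theta_2)^{-n}$ whenever $\theta_1 - \theta_2 < \min x_i$ and $\max x_i < \theta_1 + \theta_2$, and zero otherwise, so it depends on the data only through $(\min x_i, \max x_i)$; two samples of the same size sharing the same extremes therefore give identical likelihoods. The numerical values used below are arbitrary.

```python
# Sketch (not from the text) of joint sufficiency of (min, max) in Example 1 of
# Section 10.6: the likelihood of a sample from the uniform distribution on
# (theta1 - theta2, theta1 + theta2) is (2*theta2)^(-n) times an indicator that
# depends only on min(x) and max(x), so two samples with the same extremes and
# the same n have identical likelihood functions.
import numpy as np

def likelihood(x, theta1, theta2):
    x = np.asarray(x, dtype=float)
    inside = (theta1 - theta2 < x.min()) and (x.max() < theta1 + theta2)
    return (2.0 * theta2) ** (-x.size) if inside else 0.0

# Two different samples of size 4 with the same minimum (1.0) and maximum (6.0).
sample_a = [1.0, 2.5, 4.0, 6.0]
sample_b = [1.0, 5.9, 1.1, 6.0]

grid = [(t1, t2) for t1 in np.linspace(2.0, 5.0, 7) for t2 in np.linspace(1.0, 5.0, 9)]
same = all(np.isclose(likelihood(sample_a, t1, t2), likelihood(sample_b, t1, t2))
           for t1, t2 in grid)
print("identical likelihoods over the grid:", same)   # True
```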
However, the concept of a complete family of probability density functions is generalized as follows: Let $\{f(v_1, v_2, \ldots, v_k;\ \theta_1, \theta_2, \ldots, \theta_m);\ (\theta_1, \theta_2, \ldots, \theta_m) \in \Omega\}$ denote a family of probability density functions of $k$ random variables $V_1, V_2, \ldots, V_k$ that depends upon $m$ parameters $(\theta_1, \theta_2, \ldots, \theta_m) \in \Omega$.
  • 189. 366 Sufficient Statistics [Ch, 10 Sec. 10.6] The Case oj Several Parameters 367 and and Z2 = Y1 - Y~/n = L (Xi - X)2 n-l n-l ( - 1 81 8~ .1-) f(x; 8v 82 ) = exp 28 2 x2 + 8 2 X - 28 2 - In v 27T82 • Therefore, we can take K 1 (x) = x2 and K 2 (x) = x. Consequently, the statistics are joint sufficient statistics for the m parameters 81 , 82 , ••• r 8m• It is left as an exercise to prove that the joint p.d.f. of Y1> ... , Ym is of the form From completeness, we have that Z1 and Z2 are the only functions of Y1 and Y2 that are unbiased estimators of 81 and (J2' respectively. are joint complete sufficient statistics for (J1 and 82 , Since the relations (2) R(Y1' ... , Ym) exp [J1PA81,· .. s 8m)Yj + nq(81, . 00' 8m)] at points of positive probability density. These points of positive probability density and the function R(Y1' 0 • • , Ym) do not depend upon any or all of the parameters 81 , 82 , •. " 8m. Moreover, in accordance with a theorem in analysis, it can be asserted that, in a regular case of the exponential class, the family of probability density functions of these joint sufficient statistics Y1 , Y2 , ••• , Ym is complete when n > m, In accordance with a convention previously adopted, we shall refer to Yl' Y2' .. " Ym as joint complete sufficient statistics for the parameters 81> 82 , •• 0' 8m• Example 2. Let Xv X 2 , ••• , Xn denote a random sample from a distribution that is n(81 , 82) , -00 < 81 < 00, 0 < 82 < 00. Thus the p.d.f. f(x; 81 , 82 ) of the distribution may be written as define a one-to-one transformation, Z1 and Z2 are also joint complete sufficient statistics for 81 and 82 , Moreover, A p.d.f. f i»; 81 , 82 " " , 8m) = exp [JlA81, 82 , 0", 8m)KAx) + S(x) + q(81, 82 " 0 0, 8m)} Let U(V1' V2' • 0 0' v k ) be a function of VI' V2, 0 0 0, Vk (but not a function of any or all of the parameters). If E[U(V1' V2 , ... , Vk ) ] = 0 for all (81) 82, .•• , 8m) E Q implies that u(v1> V2, • 0 . , v k) = 0 at all points (V1> V2, 0 • • , Vk), except on a set of points that has probability zero for all members of the family of probability density functions, we shall say that the family of probability density functions is a complete family. The remainder of our treatment of the case of several parameters will be restricted to probability density functions that represent what we shall call regular cases of the exponential class. Let Xl' X 2 , ••• , X n, n > m, denote a random sample from a distribution that depends on m parameters and has a p.d.f. of the form (1) f(x; 81> 82 , 0 • " 8m) = exp [Jlj(81 , 82 , •.• , 8m)Kj(x) + S(X) + q(81, 82 , • 0 . , 8m)] for a < x < b, and equals zero elsewhere. A p.d.f. of the form (1) is said to be a member of the exponential class of probability density functions of the continuous type. If, in addition, (a) neither a nor b depends upon any or all of the parameters 81> 82 , •• 0' 8m, (b) the Pj(81) 82, 0 • • , 8m), j = 1,2, 0 • • , m, are nontrivial, func- tionally independent, continuous functions of 8j , Yj < 8j < OJ' j = 1,2, .. 0, rn, (c) the Kj(x), j = 1,2,. 0 . , m, are continuous for a < x < band no one is a linear homogeneous function of the others, (d) S(x) is a continuous function of x, a < x < b, we say that we have a regular case of the exponential class. The joint p.d.f. of Xl' X 2 , ••• , X n is given, at points of positive probability density, by exp [J1pj(81).. " 8m) itKj(xi) + itS(Xi) + nq(81) ... , 8m)] = exp [J1P;(8v ... 
$\ldots, \theta_m) \sum_{i=1}^n K_j(x_i) + nq(\theta_1, \ldots, \theta_m)\Big] \exp\Big[\sum_{i=1}^n S(x_i)\Big]$. In accordance with the factorization theorem, the statistics
  • 190. 368 Sufficient Statistics [Ch, 10 Sec. 10.6] The Case of Several Parameters 369 a < x < b, zero elsewhere. Let K~(x) = cK;(x). Show thatf(x; 81> ( 2) can be written in the form a < x < b, 10.36. Let the p.d.f. f(x; 81> ( 2) be of the form exp [P1(81, (2)K 1(x) + P2(81, ( 2)K 2(x) + S(x) + q(81) (2)J, zero elsewhere. This is the reason why it is required that no one K~(x) be a linear homogeneous function of the others, that is, so that the number of sufficient statistics equals the number of parameters. 10.37. Let Y1 < Y2 < ... < Yn be the order statistics of a random sample Xv X 2 , .•. , X; of size n from a distribution of the continuous type with p.d.f. f(x). Show that the ratio of the joint p.d.f. of X1> X 2 , ••• , X n and that of Y1 < Y2 < ... < Y, is equal to lin!, which does not depend upon the underlying p.d.f. This suggests that Y1 < Y2 < ... < Y, are joint sufficient statistics for the unknown" parameter" j. 10.35. Let (Xl' Y1 ) , (X2, Y2), ... , (Xn , Yn) denote a random sample of size n from a bivariate normal distribution with means flo1 and flo2' positive n n variances ar and a~, and correlation coefficient p. Show that Ls; L v, 1 1 n n n LXt, L Yr, and LXjYj are joint sufficient statistics for the five parameters. 1 1 1 _ n 11 n n Are X = LXdn, Y = L Ydn, Sr = L (Xj - X)2In, S~ = L (Yj - y)2In, 1 1 1 1 n and L (Xj - X)(Yi - Y)lnS1S2 also joint sufficient statistics for these 1 parameters? zero elsewhere. Find the joint p.d.f. of Zl = Y 1, Z2 = Y 2, and Zs = Y1 + Y2 + Ys' The corresponding transformation maps the space {(Y1> Y2' Ys); e, < Y1 < Y2 < Ys < co] onto the space {(Z1> Z2' zs); 81 < Zl < Z2 < (zs - zl)/2 < co]. Show that Zl and Zs are joint sufficient statistics for 81 and 82 , 10.34. Let Xl' X 2, ... , X; be a random sample from a distribution that n has a p.d.f. of form (1) of this section. Show that Y 1 = L K 1(Xj ) , ••• , j=l n Ym = L Km(Xj ) have a joint p.d.f. of form (2) of this section. j=l are joint complete sufficient statistics for the m parameters 8v 82 , 0 • " 8m• zero elsewhere, is said to represent a regular case of the exponential class of probability density functions of the discrete type if (a) the set {x; x = av a2 , ••• } does not depend upon any or all of the parameters 81, 82 , •.• , 8m, (b) the Pj(81 , 82 , ••• , 8m), j = 1,2, ... , m, are nontrivial, func- tionally independent, continuous functions of 8j, Yj < 8j < OJ' j = 1,2, .. 0, rn, (c) the Kj(x), j = 1, 2, 0 0 . , m, are nontrivial functions of x on the set {x; x = a1, a2 , • • . } and no one is a linear function of the others. Let Xl' X 2 , •.• , X n denote a random sample from a discrete-type distribution that represents a regular case of the exponential class. Then the statements made above in connection with the random variable of the continuous type are also valid here. Not always do we sample from a distribution of one random variable X. We could, for instance, sample from a distribution of two random variables V and W with joint p.d.f.f(v, w; 8v 82, 0 0 . , 8m) , Recall that by a random sample (Vv W1), (V2 , W2 ) , . '0, (Vn, Wn) from a distribution of this sort, we mean that the joint p.d.f. of these 2n random variables is given by f(v1 , w1 ; 8v ... , 8m)j(v2 , w2 ; 8v 0 0 0 , 8m) ... f(vn, wn ; 81 , ..• , 8m) , In particular, suppose that the random sample is taken from a distribu- tion that has the p.d.f. of V and W of the exponential class (3) f(v, w; 8v . . . , 8m) = exp [J1pj(8v ... , 8m)Kj(v, w) + S(v, w) + q(8v · .. 
$\ldots, \theta_m)\big]$ for $a < v < b$, $c < w < d$, and equals zero elsewhere, where $a$, $b$, $c$, $d$ do not depend on the parameters and conditions similar to (a), (b), (c), and (d) of the continuous case above are imposed. Then the $m$ statistics

EXERCISES

10.33. Let $Y_1 < Y_2 < Y_3$ be the order statistics of a random sample of size 3 from the distribution with p.d.f.
  • 191. Sec. 11.1] The Rao-Crarner Inequality 371 The final form of the right-hand member of the second equation is justified by the discussion in Section 4.7. If we differentiate both mem- bers of each of these equations with respect to 8, we have Chapter II Further Topics in Statistical Inference o = f00 8f(x;; 8) dx, = f00 8ln f(x;; B) f( . 8) d (2) - 00 88 ' _00 80 X;, x" _foo foo [n 1 8f(x;; 8)] 1 - -00'" -00 u(xl , x2, 0 0 0 ' xn) f f(x;; B) 8B x f(xl ; B) 0 • • f(xn; 8) dX1 0 0 0 dXn = foo ..·foo u(xv x2, 0 0 0 ' xn)[i _81.-:;nf:.....:..(x....:.:.;;-!..B)] -00 -00 I 8B 11.1 The Rae-Cramer Inequality In this section we establish a lower bound for the variance of an unbiased estimator of a parameter. Let Xl, X 2 , ••• , X; denote a random sample from a distribution with p.d.f. f(x; B), BE Q = {B; y < B < o}, where y and 0 are known. Let Y = u(XI, X 2, ... , X n) be an unbiased estimator of B. We shall show that the variance of Y, say O"~, satisfies the inequality Band 1 p=_o O"yO"z or 1 = B' 0 + pO"yO"z n Define the random variable Z by Z = L[8Inf(X,; B)/8B]. In accordance I with the first of Equations (2)we have E(Z) = i E[8Inf(X;; B)/8BJ = o. I Moreover, Z is the sum of n mutually stochastically independent random variables each with mean zero and consequently with variance E{[8lnf(X; B)/80P}. Hence the variance of Z is the sum of the n variances, n Because Y = u(XI, 0 0 . , X n) and Z = L [8lnf(X;; 8)/88J, the second I of Equations (2) shows that E( YZ) = 1. Recall (Section 2.3) that E(YZ) = E(Y)E(Z) + PO"yO"z, where p is the correlation coefficient of Y and Z. Since E(Y) E(Z) = 0, we have 1 nE{[8lnf(X; B)/8BJ2} (1) Throughout this section, unless otherwise specified, it will be assumed that we may differentiate, with respect to a parameter, under an integral or a summation symbol. This means, among other things, that the domain of positive probability density does not depend upon B. We shall now give a proof of inequality (1) when X is a random variable of the continuous type. The reader can easily handle the discrete case by changing integrals to sums. Let g(y; 8) denote the p.d.f. of the unbiased statistic Y. We are given that 1 = f~oo f(x;; B) dx" i = 1,2, 0'" n, and B = f~00 yg(y; B) dy = f~00 ••• f:00 u(xv X2, 0 0 . , Xn)f(XI; 8) 0 • 0 f(xn; 8)dX1 0 0 0 dxno 370 Now p2 s 1. Hence or 1 2 "2:::; O"y. o"z
  • 192. 372 Further Topics in Statistical Inference [Oh, 11 Sec. 11.1] The Roo-Cramer Inequality 373 o< x < 1, If we replace u~ by its value, we have inequality (1), 1 u¥ ~ -------- nE[(0 ln~~; 8)fJ Inequality (1) is known as the Rae-Cramer inequality. It provides, in cases in which we can differentiate with respect to a parameter under an integral or summation symbol, a lower bound on the variance of an unbiased estimator of a parameter, usually called the Rae-Cramer lower bound. We now make the following definitions. Definition 1. Let Y be an unbiased estimator of a parameter 8 in such a case of point estimation. The statistic Y is called an efficient estimator of eif and only if the variance of Y attains the Rae-Cramer lower bound. It is left as an exercise to show, in these cases of point estimation, that E{[o Inf(X; 8)/oeJ2} = -E[02Inf(X; 8)/082]. In some instances the latter is much easier to compute. Definition 2. In cases in which we can differentiate with respect to a parameter under an integral or summation symbol, the ratio of the Rae-Cramer lower bound to the actual variance of any unbiased estimator of a parameter is called the efficiency of that statistic. Example 1. Let Xl' X 2 , ••• , X; denote a random sample from a Poisson distribution that has the mean B > O. It is known that X is a maximum likelihood estimator of B; we shall show that it is also an efficient estimator of B. We have () lnf(x; B) = ~ ( I B _ B-1 ') 8B (}B x n n x. x x-8 =--1 =--. B B Accordingly, E[(8In f(X; B))2] = E(X - B)2 = 0'2 = i = !. 8B B2 B2 B2 B The Rae-Cramer lower bound in this case is l/[n(l/B)] = B/n. But B/nis the variance O'~ of X. Hence X is an efficient estimator of B. Example 2. Let 52 denote the variance of a random sample of size n > 1 from a distribution that is n(p., B), 0 < B < 00. We know that E[n52/(n - l)J = B. What is the efficiency of the estimator n52/(n - I)? We have I f( . B) = _ (x - p.)2 _ In (Z7TB) n x, ZB Z' olnf(x; B) (x - p.)2 1 8B = ZB2 - ZB' and 82Inf(x; B) (x - p.)2 1 8B2 = - BS + ZB2' Accordingly, _E[821n f(X; B)] = i __ 1 = _1. 8B2 BS ZB2 ZB2 Thus the Rae-Cramer lower bound is ZB2/n. Now n52/B is X2(n - 1), so the variance of n5 2/B is Z(n - 1). Accordingly, the variance of n52/(n - 1) is Z(n - 1)[B2/(n - 1)2J = ZB2/(n - 1). Thus the efficiency of the estimator n52/(n - 1) is (n - l)/n. Example 3. Let Xv X 2 , ••• , X; denote a random sample of size n > Z from a distribution with p.d.f. f(x; B) = BXB- I = exp (B In x - In x + In B), = 0 elsewhere. It is easy to verify that the Rao-Crarner lower bound is 82 /n.Let Y, = -In Xl' We shall indicate that each Yl has a gamma distribution. The associated transformation y, = -In x" with inverse Xl = «:v«, is one-to-one and the transformation maps the space {Xl; 0 < x, < I} onto the space {y,; 0 < Yl < co}. We have III = e-Y ' . Thus Y, has a gamma distribution with a = 1 7t and /3 = l/B. Let Z = - L In Xl' Then Z has a gamma distribution with I a = nand /3 = l/B. Accordingly, we have E(Z) = a/3 = niB. This suggests that we compute the expectation of l/Z to see if we can find an unbiased estimator of B. A simple integration shows that E(l/Z) = B/(n - 1). Hence (n - l)/Z is an unbiased estimator of B. With n > Z, the variance of (n - l)/Z exists and is found to be B2/(n - Z), so that the efficiency of (n - l)/Z is (n - Z)/n. This efficiency tends to 1 as n increases. In such an instance, the estimator is said to be asymptotically efficient. 
The concept of joint efficient estimators of several parameters has been developed along with the associated concept of joint efficiency of several estimators. But limitations of space prevent their inclusion in this book.
EXERCISES

11.1. Prove that $\bar{X}$, the mean of a random sample of size $n$ from a distribution that is $n(\theta, \sigma^2)$, $-\infty < \theta < \infty$, is, for every known $\sigma^2 > 0$, an efficient estimator of $\theta$.

11.2. Show that the mean $\bar{X}$ of a random sample of size $n$ from a distribution which is $b(1, \theta)$, $0 < \theta < 1$, is an efficient estimator of $\theta$.

11.3. Given $f(x;\theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere, with $\theta > 0$, formally compute the reciprocal of $nE\{[\partial \ln f(X;\theta)/\partial\theta]^2\}$. Compare this with the variance of $(n+1)Y_n/n$, where $Y_n$ is the largest item of a random sample of size $n$ from this distribution. Comment.

11.4. Given the p.d.f.
$$f(x;\theta) = \frac{1}{\pi[1 + (x-\theta)^2]}, \qquad -\infty < x < \infty, \; -\infty < \theta < \infty.$$
Show that the Rao-Cramér lower bound is $2/n$, where $n$ is the size of a random sample from this Cauchy distribution.

11.5. Show, with appropriate assumptions, that $E\{[\partial \ln f(X;\theta)/\partial\theta]^2\} = -E[\partial^2 \ln f(X;\theta)/\partial\theta^2]$. Hint. Differentiate with respect to $\theta$ the first equation in display (2) of this section,
$$0 = \int_{-\infty}^{\infty} \frac{\partial \ln f(x;\theta)}{\partial\theta} f(x;\theta)\, dx.$$

11.2 The Sequential Probability Ratio Test

In Section 7.2 we proved a theorem that provided us with a method for determining a best critical region for testing a simple hypothesis against an alternative simple hypothesis. The theorem was as follows. Let $X_1, X_2, \ldots, X_n$ be a random sample with fixed sample size $n$ from a distribution that has p.d.f. $f(x;\theta)$, where $\theta \in \{\theta : \theta = \theta', \theta''\}$ and $\theta'$ and $\theta''$ are known numbers. Let the joint p.d.f. of $X_1, X_2, \ldots, X_n$ be denoted by
$$L(\theta, n) = f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta),$$
a notation that reveals both the parameter $\theta$ and the sample size $n$. If we reject $H_0: \theta = \theta'$ and accept $H_1: \theta = \theta''$ when and only when
$$\frac{L(\theta', n)}{L(\theta'', n)} \le k,$$
where $k > 0$, then this is a best test of $H_0$ against $H_1$.

Let us now suppose that the sample size $n$ is not fixed in advance. In fact, let the sample size be a random variable $N$ with sample space $\{n : n = 1, 2, 3, \ldots\}$. An interesting procedure for testing the simple hypothesis $H_0: \theta = \theta'$ against the simple hypothesis $H_1: \theta = \theta''$ is the following. Let $k_0$ and $k_1$ be two positive constants with $k_0 < k_1$. Observe the mutually stochastically independent outcomes $X_1, X_2, X_3, \ldots$ in sequence, say $x_1, x_2, x_3, \ldots$, and compute
$$\frac{L(\theta', 1)}{L(\theta'', 1)}, \quad \frac{L(\theta', 2)}{L(\theta'', 2)}, \quad \frac{L(\theta', 3)}{L(\theta'', 3)}, \ldots.$$
The hypothesis $H_0: \theta = \theta'$ is rejected (and $H_1: \theta = \theta''$ is accepted) if and only if there exists a positive integer $n$ so that $(x_1, x_2, \ldots, x_n)$ belongs to the set
$$C_n = \left\{(x_1, \ldots, x_n) : k_0 < \frac{L(\theta', j)}{L(\theta'', j)} < k_1,\ j = 1, \ldots, n-1,\ \text{and}\ \frac{L(\theta', n)}{L(\theta'', n)} \le k_0\right\}.$$
On the other hand, the hypothesis $H_0: \theta = \theta'$ is accepted (and $H_1: \theta = \theta''$ is rejected) if and only if there exists a positive integer $n$ so that $(x_1, x_2, \ldots, x_n)$ belongs to the set
$$B_n = \left\{(x_1, \ldots, x_n) : k_0 < \frac{L(\theta', j)}{L(\theta'', j)} < k_1,\ j = 1, \ldots, n-1,\ \text{and}\ \frac{L(\theta', n)}{L(\theta'', n)} \ge k_1\right\}.$$
That is, we continue to observe sample items as long as
$$k_0 < \frac{L(\theta', n)}{L(\theta'', n)} < k_1. \tag{1}$$
We stop these observations in one of two ways: (a) with rejection of $H_0: \theta = \theta'$ as soon as
$$\frac{L(\theta', n)}{L(\theta'', n)} \le k_0,$$
or (b) with acceptance of $H_0: \theta = \theta'$ as soon as
$$\frac{L(\theta', n)}{L(\theta'', n)} \ge k_1.$$
A test of this kind is called a sequential probability ratio test. Frequently, inequality (1) can be conveniently expressed in an equivalent form
$$c_0(n) < u(x_1, x_2, \ldots, x_n) < c_1(n),$$
where $u(X_1, X_2, \ldots, X_n)$ is a statistic and $c_0(n)$ and $c_1(n)$ depend upon the constants $k_0$, $k_1$, $\theta'$, $\theta''$, and upon $n$. Then the observations are stopped and a decision is reached as soon as $u \le c_0(n)$ or $u \ge c_1(n)$. We now give an illustrative example.

Example 1. Let $X$ have the p.d.f.
$$f(x;\theta) = \theta^x(1-\theta)^{1-x}, \qquad x = 0, 1,$$
$$= 0 \text{ elsewhere.}$$
In the preceding discussion of a sequential probability ratio test, let $H_0: \theta = \tfrac{1}{3}$ and $H_1: \theta = \tfrac{2}{3}$; then, with $\sum x_j = \sum_1^n x_j$,
$$\frac{L(\tfrac{1}{3}, n)}{L(\tfrac{2}{3}, n)} = \frac{(\tfrac{1}{3})^{\sum x_j}(\tfrac{2}{3})^{n-\sum x_j}}{(\tfrac{2}{3})^{\sum x_j}(\tfrac{1}{3})^{n-\sum x_j}} = 2^{n - 2\sum x_j}.$$
If we take logarithms to the base 2, the inequality
$$k_0 < \frac{L(\tfrac{1}{3}, n)}{L(\tfrac{2}{3}, n)} < k_1,$$
with $0 < k_0 < k_1$, becomes
$$\log_2 k_0 < n - 2\sum_1^n x_j < \log_2 k_1,$$
or, equivalently,
$$c_0(n) = \frac{n}{2} - \frac{1}{2}\log_2 k_1 < \sum_1^n x_j < \frac{n}{2} - \frac{1}{2}\log_2 k_0 = c_1(n).$$
Note that $L(\tfrac{1}{3}, n)/L(\tfrac{2}{3}, n) \le k_0$ if and only if $c_1(n) \le \sum_1^n x_j$; and $L(\tfrac{1}{3}, n)/L(\tfrac{2}{3}, n) \ge k_1$ if and only if $c_0(n) \ge \sum_1^n x_j$. Thus we continue to observe outcomes as long as $c_0(n) < \sum_1^n x_j < c_1(n)$. The observation of outcomes is discontinued with the first value $n$ of $N$ for which either $c_1(n) \le \sum_1^n x_j$ or $c_0(n) \ge \sum_1^n x_j$. The inequality $c_1(n) \le \sum_1^n x_j$ leads to the rejection of $H_0: \theta = \tfrac{1}{3}$ (the acceptance of $H_1$), and the inequality $c_0(n) \ge \sum_1^n x_j$ leads to the acceptance of $H_0: \theta = \tfrac{1}{3}$ (the rejection of $H_1$).
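The stopping rule of Example 1 is easy to put into code. The following Python sketch (not part of the text) runs the Bernoulli sequential probability ratio test with the boundaries $c_0(n)$ and $c_1(n)$ above; the particular values of $k_0$ and $k_1$ are arbitrary illustrative choices.

```python
import numpy as np

def bernoulli_sprt(x, k0=1/9, k1=9.0):
    """SPRT of Example 1 (H0: theta = 1/3 vs H1: theta = 2/3) on 0-1 outcomes x."""
    s = 0.0
    for n, xi in enumerate(x, start=1):
        s += xi
        c0 = n / 2 - 0.5 * np.log2(k1)
        c1 = n / 2 - 0.5 * np.log2(k0)
        if s >= c1:
            return "reject H0", n
        if s <= c0:
            return "accept H0", n
    return "no decision yet", len(x)

rng = np.random.default_rng(1)
x = rng.binomial(1, 1/3, size=1000)     # data generated under H0 (theta = 1/3)
print(bernoulli_sprt(x))                # usually accepts H0 after a modest n
```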
Remarks. At this point, the reader undoubtedly sees that there are many questions which should be raised in connection with the sequential probability ratio test. Some of these questions are possibly among the following: (a) What is the probability of the procedure continuing indefinitely? (b) What is the value of the power function of this test at each of the points $\theta = \theta'$ and $\theta = \theta''$? (c) If $\theta''$ is one of several values of $\theta$ specified by an alternative composite hypothesis, say $H_1: \theta > \theta'$, what is the power function at each point $\theta \ge \theta'$? (d) Since the sample size $N$ is a random variable, what are some of the properties of the distribution of $N$? In particular, what is the expected value $E(N)$ of $N$? (e) How does this test compare with tests that have a fixed sample size $n$?

A course in sequential analysis would investigate these and many other problems. However, in this book our objective is largely that of acquainting the reader with this kind of test procedure. Accordingly, we assert that the answer to question (a) is zero. Moreover, it can be proved that if $\theta = \theta'$ or if $\theta = \theta''$, $E(N)$ is smaller, for this sequential procedure, than the sample size of a fixed-sample-size test which has the same values of the power function at those points. We now consider question (b) in some detail.

In this section we shall denote the power of the test when $H_0$ is true by the symbol $\alpha$ and the power of the test when $H_1$ is true by the symbol $1 - \beta$. Thus $\alpha$ is the probability of committing a type I error (the rejection of $H_0$ when $H_0$ is true), and $\beta$ is the probability of committing a type II error (the acceptance of $H_0$ when $H_1$ is true). With the sets $C_n$ and $B_n$ as previously defined, and with random variables of the continuous type, we then have
$$\alpha = \sum_{n=1}^{\infty} \int_{C_n} L(\theta', n), \qquad 1 - \beta = \sum_{n=1}^{\infty} \int_{C_n} L(\theta'', n).$$
Since the probability is 1 that the procedure will terminate, we also have
$$1 - \alpha = \sum_{n=1}^{\infty} \int_{B_n} L(\theta', n), \qquad \beta = \sum_{n=1}^{\infty} \int_{B_n} L(\theta'', n).$$
If $(x_1, x_2, \ldots, x_n) \in C_n$, we have $L(\theta', n) \le k_0 L(\theta'', n)$; hence it is clear that
$$\alpha = \sum_{n=1}^{\infty} \int_{C_n} L(\theta', n) \le \sum_{n=1}^{\infty} \int_{C_n} k_0 L(\theta'', n) = k_0(1 - \beta).$$
Because $L(\theta', n) \ge k_1 L(\theta'', n)$ at each point of the set $B_n$, we have
$$1 - \alpha = \sum_{n=1}^{\infty} \int_{B_n} L(\theta', n) \ge \sum_{n=1}^{\infty} \int_{B_n} k_1 L(\theta'', n) = k_1\beta.$$
Accordingly, it follows that
$$\frac{\alpha}{1 - \beta} \le k_0, \qquad k_1 \le \frac{1 - \alpha}{\beta}, \tag{2}$$
provided that $\beta$ is not equal to zero or 1.

Now let $\alpha_a$ and $\beta_a$ be preassigned proper fractions; some typical values in the applications are 0.01, 0.05, and 0.10. If we take
$$k_0 = \frac{\alpha_a}{1 - \beta_a}, \qquad k_1 = \frac{1 - \alpha_a}{\beta_a},$$
then inequalities (2) become
$$\frac{\alpha}{1 - \beta} \le \frac{\alpha_a}{1 - \beta_a}, \qquad \frac{1 - \alpha_a}{\beta_a} \le \frac{1 - \alpha}{\beta}; \tag{3}$$
or, equivalently,
$$\alpha(1 - \beta_a) \le (1 - \beta)\alpha_a, \qquad \beta(1 - \alpha_a) \le (1 - \alpha)\beta_a.$$
If we add corresponding members of the immediately preceding inequalities, we find that
$$\alpha + \beta - \alpha\beta_a - \beta\alpha_a \le \alpha_a + \beta_a - \beta\alpha_a - \alpha\beta_a$$
and hence $\alpha + \beta \le \alpha_a + \beta_a$. That is, the sum $\alpha + \beta$ of the probabilities of the two kinds of errors is bounded above by the sum $\alpha_a + \beta_a$ of the preassigned numbers. Moreover, since $\alpha$ and $\beta$ are positive proper fractions, inequalities (3) imply that
$$\alpha \le \frac{\alpha_a}{1 - \beta_a}, \qquad \beta \le \frac{\beta_a}{1 - \alpha_a};$$
consequently, we have an upper bound on each of $\alpha$ and $\beta$. Various investigations of the sequential probability ratio test seem to indicate that in most practical cases the values of $\alpha$ and $\beta$ are quite close to $\alpha_a$ and $\beta_a$. This prompts us to approximate the power function at the points $\theta = \theta'$ and $\theta = \theta''$ by $\alpha_a$ and $1 - \beta_a$, respectively.

Example 2. Let $X$ be $n(\theta, 100)$. To find the sequential probability ratio test for testing $H_0: \theta = 75$ against $H_1: \theta = 78$ such that each of $\alpha$ and $\beta$ is approximately equal to 0.10, take
$$k_0 = \frac{0.10}{1 - 0.10} = \frac{1}{9}, \qquad k_1 = \frac{1 - 0.10}{0.10} = 9.$$
Since
$$\frac{L(75, n)}{L(78, n)} = \frac{\exp\left[-\sum_1^n (x_i - 75)^2/2(100)\right]}{\exp\left[-\sum_1^n (x_i - 78)^2/2(100)\right]} = \exp\left(-\frac{6\sum_1^n x_i - 459n}{200}\right),$$
the inequality
$$k_0 = \frac{1}{9} < \frac{L(75, n)}{L(78, n)} < 9 = k_1$$
can be rewritten, by taking logarithms, as
$$-\ln 9 < -\frac{6\sum_1^n x_i - 459n}{200} < \ln 9.$$
This inequality is equivalent to the inequality
$$c_0(n) = \frac{153n}{2} - \frac{100}{3}\ln 9 < \sum_1^n x_i < \frac{153n}{2} + \frac{100}{3}\ln 9 = c_1(n).$$
Moreover, $L(75, n)/L(78, n) \le k_0$ and $L(75, n)/L(78, n) \ge k_1$ are equivalent to the inequalities $\sum_1^n x_i \ge c_1(n)$ and $\sum_1^n x_i \le c_0(n)$, respectively. Thus the observation of outcomes is discontinued with the first value $n$ of $N$ for which either $\sum_1^n x_i \ge c_1(n)$ or $\sum_1^n x_i \le c_0(n)$. The inequality $\sum_1^n x_i \ge c_1(n)$ leads to the rejection of $H_0: \theta = 75$, and the inequality $\sum_1^n x_i \le c_0(n)$ leads to the acceptance of $H_0: \theta = 75$. The power of the test is approximately 0.10 when $H_0$ is true, and approximately 0.90 when $H_1$ is true.
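As a check on the approximations $\alpha \approx \alpha_a$ and $\beta \approx \beta_a$, the following Python sketch (an illustration, not part of the text) simulates the test of Example 2 many times under each hypothesis and reports the observed error rates; the simulation sizes are arbitrary.

```python
import numpy as np

def sprt_normal(rng, theta, n_max=10_000):
    """Example 2: H0: theta = 75 vs H1: theta = 78, sigma^2 = 100, alpha_a = beta_a = 0.10."""
    half_width = (100 / 3) * np.log(9)
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.normal(theta, 10.0)
        if total >= 76.5 * n + half_width:     # c1(n) crossed
            return "reject H0"
        if total <= 76.5 * n - half_width:     # c0(n) crossed
            return "accept H0"
    return "no decision"

rng = np.random.default_rng(2)
reps = 5000
alpha_hat = np.mean([sprt_normal(rng, 75.0) == "reject H0" for _ in range(reps)])
beta_hat = np.mean([sprt_normal(rng, 78.0) == "accept H0" for _ in range(reps)])
print(alpha_hat, beta_hat)   # each should be near 0.10, and their sum at most about 0.20
```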
Remark. It is interesting to note that a sequential probability ratio test can be thought of as a random-walk procedure. For illustration, the final inequalities of Examples 1 and 2 can be rewritten as
$$-\log_2 k_1 < \sum_1^n 2(x_i - 0.5) < -\log_2 k_0$$
and
$$-\frac{100}{3}\ln 9 < \sum_1^n (x_i - 76.5) < \frac{100}{3}\ln 9,$$
respectively. In each instance we can think of starting at the point zero and taking random steps until one of the boundaries is reached. In the first situation the random steps are $2(X_1 - 0.5), 2(X_2 - 0.5), 2(X_3 - 0.5), \ldots$ and hence are of the same length, 1, but with random directions. In the second instance, both the length and the direction of the steps are random variables, $X_1 - 76.5, X_2 - 76.5, X_3 - 76.5, \ldots$.

EXERCISES

11.6. Let $X$ be $n(0, \theta)$ and, in the notation of this section, let $\theta' = 4$, $\theta'' = 9$, $\alpha_a = 0.05$, and $\beta_a = 0.10$. Show that the sequential probability ratio test can be based upon the statistic $\sum_1^n X_i^2$. Determine $c_0(n)$ and $c_1(n)$.

11.7. Let $X$ have a Poisson distribution with mean $\theta$. Find the sequential probability ratio test for testing $H_0: \theta = 0.02$ against $H_1: \theta = 0.07$. Show that this test can be based upon the statistic $\sum_1^n X_i$. If $\alpha_a = 0.20$ and $\beta_a = 0.10$, find $c_0(n)$ and $c_1(n)$.

11.8. Let the stochastically independent random variables $Y$ and $Z$ be $n(\mu_1, 1)$ and $n(\mu_2, 1)$, respectively. Let $\theta = \mu_1 - \mu_2$. Let us observe mutually stochastically independent items from each distribution, say $Y_1, Y_2, \ldots$ and $Z_1, Z_2, \ldots$. To test sequentially the hypothesis $H_0: \theta = 0$ against $H_1: \theta = \tfrac{1}{2}$, use the sequence $X_i = Y_i - Z_i$, $i = 1, 2, \ldots$. If $\alpha_a = \beta_a = 0.05$, show that the test can be based upon $\bar{X} = \bar{Y} - \bar{Z}$. Find $c_0(n)$ and $c_1(n)$.

11.3 Multiple Comparisons

Consider $b$ mutually stochastically independent random variables that have normal distributions with unknown means $\mu_1, \mu_2, \ldots, \mu_b$, respectively, and with unknown but common variance $\sigma^2$. Let $k_1, k_2, \ldots, k_b$ represent $b$ known real constants that are not all zero. We want to find a confidence interval for $\sum_1^b k_j\mu_j$, a linear function of the means $\mu_1, \mu_2, \ldots, \mu_b$. To do this, we take a random sample $X_{1j}, X_{2j}, \ldots, X_{aj}$ of size $a$ from the distribution $n(\mu_j, \sigma^2)$, $j = 1, 2, \ldots, b$. If we denote $\sum_{i=1}^a X_{ij}/a$ by $\bar{X}_{.j}$, then we know that $\bar{X}_{.j}$ is $n(\mu_j, \sigma^2/a)$, that $\sum_{i=1}^a (X_{ij} - \bar{X}_{.j})^2/\sigma^2$ is $\chi^2(a-1)$, and that the two random variables are stochastically independent. Since the random samples are taken from mutually independent distributions, the $2b$ random variables $\bar{X}_{.j}$, $\sum_{i=1}^a (X_{ij} - \bar{X}_{.j})^2/\sigma^2$, $j = 1, 2, \ldots, b$, are mutually stochastically independent. Moreover, $\bar{X}_{.1}, \bar{X}_{.2}, \ldots, \bar{X}_{.b}$ and
$$\sum_{j=1}^b \sum_{i=1}^a \frac{(X_{ij} - \bar{X}_{.j})^2}{\sigma^2}$$
are mutually stochastically independent and the latter is $\chi^2[b(a-1)]$.

Let $Z = \sum_1^b k_j\bar{X}_{.j}$. Then $Z$ is normal with mean $\sum_1^b k_j\mu_j$ and variance $(\sum_1^b k_j^2)\sigma^2/a$, and $Z$ is stochastically independent of
$$V = \frac{1}{b(a-1)} \sum_{j=1}^b \sum_{i=1}^a (X_{ij} - \bar{X}_{.j})^2.$$
Hence the random variable
$$T = \frac{\left(Z - \sum k_j\mu_j\right)\big/\sqrt{(\sum k_j^2)\sigma^2/a}}{\sqrt{V/\sigma^2}} = \frac{\sum k_j\bar{X}_{.j} - \sum k_j\mu_j}{\sqrt{(\sum k_j^2)V/a}}$$
has a $t$ distribution with $b(a-1)$ degrees of freedom. A positive number $c$ can be found in Table IV in Appendix B, for certain values of $\alpha$, $0 < \alpha < 1$, such that $\Pr(-c \le T \le c) = 1 - \alpha$. It follows that the probability is $1 - \alpha$ that
$$\sum_1^b k_j\bar{X}_{.j} - c\sqrt{\frac{(\sum k_j^2)V}{a}} \le \sum_1^b k_j\mu_j \le \sum_1^b k_j\bar{X}_{.j} + c\sqrt{\frac{(\sum k_j^2)V}{a}}.$$
The experimental values of $\bar{X}_{.j}$, $j = 1, 2, \ldots, b$, and $V$ will provide a $100(1-\alpha)$ per cent confidence interval for $\sum_1^b k_j\mu_j$.
It should be observed that the confidence interval for $\sum_1^b k_j\mu_j$ depends upon the particular choice of $k_1, k_2, \ldots, k_b$. It is conceivable that we may be interested in more than one linear function of $\mu_1, \mu_2, \ldots, \mu_b$, such as $\mu_2 - \mu_1$, $\mu_3 - (\mu_1 + \mu_2)/2$, or $\mu_1 + \cdots + \mu_b$. We can, of course, find for each $\sum_1^b k_j\mu_j$ a random interval that has a preassigned probability of including that particular $\sum_1^b k_j\mu_j$. But how can we compute the probability that simultaneously these random intervals include their respective linear functions of $\mu_1, \mu_2, \ldots, \mu_b$? The following procedure of multiple comparisons, due to Scheffé, is one solution to this problem.

The random variable
$$\frac{a\sum_{j=1}^b (\bar{X}_{.j} - \mu_j)^2}{\sigma^2}$$
is $\chi^2(b)$ and, because it is a function of $\bar{X}_{.1}, \ldots, \bar{X}_{.b}$ alone, it is stochastically independent of the random variable
$$V = \frac{1}{b(a-1)} \sum_{j=1}^b \sum_{i=1}^a (X_{ij} - \bar{X}_{.j})^2.$$
Hence the random variable
$$F = \frac{a\sum_{j=1}^b (\bar{X}_{.j} - \mu_j)^2/b}{V}$$
has an $F$ distribution with $b$ and $b(a-1)$ degrees of freedom. From Table V in Appendix B, for certain values of $\alpha$, we can find a constant $d$ such that $\Pr(F \le d) = 1 - \alpha$ or
$$\Pr\left[\sum_{j=1}^b (\bar{X}_{.j} - \mu_j)^2 \le bd\frac{V}{a}\right] = 1 - \alpha.$$
Note that $\sum_{j=1}^b (\bar{X}_{.j} - \mu_j)^2$ is the square of the distance, in $b$-dimensional space, from the point $(\mu_1, \mu_2, \ldots, \mu_b)$ to the random point $(\bar{X}_{.1}, \bar{X}_{.2}, \ldots, \bar{X}_{.b})$. Consider a space of dimension $b$ and let $(t_1, t_2, \ldots, t_b)$ denote the coordinates of a point in that space. An equation of a hyperplane that passes through the point $(\mu_1, \mu_2, \ldots, \mu_b)$ is given by
$$k_1(t_1 - \mu_1) + k_2(t_2 - \mu_2) + \cdots + k_b(t_b - \mu_b) = 0, \tag{1}$$
where not all the real numbers $k_j$, $j = 1, 2, \ldots, b$, are equal to zero. The square of the distance from this hyperplane to the point $(t_1 = \bar{X}_{.1}, t_2 = \bar{X}_{.2}, \ldots, t_b = \bar{X}_{.b})$ is
$$\frac{[k_1(\bar{X}_{.1} - \mu_1) + k_2(\bar{X}_{.2} - \mu_2) + \cdots + k_b(\bar{X}_{.b} - \mu_b)]^2}{k_1^2 + k_2^2 + \cdots + k_b^2}. \tag{2}$$
From the geometry of the situation it follows that $\sum_1^b (\bar{X}_{.j} - \mu_j)^2$ is equal to the maximum of expression (2) with respect to $k_1, k_2, \ldots, k_b$. Thus the inequality $\sum_1^b (\bar{X}_{.j} - \mu_j)^2 \le (bd)(V/a)$ holds if and only if
$$\frac{\left[\sum_1^b k_j(\bar{X}_{.j} - \mu_j)\right]^2}{\sum_1^b k_j^2} \le \frac{bdV}{a} \tag{3}$$
for every real $k_1, k_2, \ldots, k_b$, not all zero. Accordingly, these two equivalent events have the same probability, $1 - \alpha$. However, inequality (3) may be written in the form
$$\left|\sum_1^b k_j\bar{X}_{.j} - \sum_1^b k_j\mu_j\right| \le \sqrt{bd\left(\sum_1^b k_j^2\right)\frac{V}{a}}.$$
Thus the probability is $1 - \alpha$ that simultaneously, for all real $k_1, k_2, \ldots, k_b$, not all zero,
$$\sum_1^b k_j\bar{X}_{.j} - \sqrt{bd\left(\sum_1^b k_j^2\right)\frac{V}{a}} \le \sum_1^b k_j\mu_j \le \sum_1^b k_j\bar{X}_{.j} + \sqrt{bd\left(\sum_1^b k_j^2\right)\frac{V}{a}}. \tag{4}$$
Denote by $A$ the event where inequality (4) is true for all real $k_1, \ldots, k_b$, and denote by $B$ the event where that inequality is true for a finite number of $b$-tuples $(k_1, \ldots, k_b)$. If the event $A$ occurs, certainly the event $B$ occurs. Hence $P(A) \le P(B)$. In the applications, one is often interested only in a finite number of linear functions $\sum_1^b k_j\mu_j$. Once the experimental values are available, we obtain from (4) a confidence interval for each of these linear functions. Since $P(B) \ge P(A) = 1 - \alpha$, we have a confidence coefficient of at least $100(1-\alpha)$ per cent that the linear functions are in these respective confidence intervals.
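A short Python sketch (an illustration, not from the text) of how interval (4) might be computed from an $a \times b$ data matrix follows; the data and the two linear functions chosen are hypothetical, and the $F$ quantile from scipy plays the role of the constant $d$ from Table V.

```python
import numpy as np
from scipy.stats import f

def scheffe_intervals(X, K, alpha=0.05):
    """X is an (a, b) array: column j holds the sample from n(mu_j, sigma^2).
    Each row of K is a vector (k_1, ..., k_b).  Returns the Scheffé intervals (4)."""
    a, b = X.shape
    xbar = X.mean(axis=0)                               # the b sample means
    V = ((X - xbar) ** 2).sum() / (b * (a - 1))         # pooled variance estimate
    d = f.ppf(1 - alpha, b, b * (a - 1))                # Pr(F <= d) = 1 - alpha
    est = K @ xbar
    half = np.sqrt(b * d * (K ** 2).sum(axis=1) * V / a)
    return np.column_stack([est - half, est + half])

rng = np.random.default_rng(3)
X = rng.normal(5.0, 2.0, size=(8, 4))                   # hypothetical data, a = 8, b = 4
K = np.array([[1, -1, 0, 0],                            # mu_1 - mu_2
              [1, 0, 0, -1]])                           # mu_1 - mu_4
print(scheffe_intervals(X, K))
```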
Remarks. If the sample sizes, say $a_1, a_2, \ldots, a_b$, are unequal, inequality (4) becomes
$$\sum_1^b k_j\bar{X}_{.j} - \sqrt{bd\left(\sum_1^b \frac{k_j^2}{a_j}\right)V} \le \sum_1^b k_j\mu_j \le \sum_1^b k_j\bar{X}_{.j} + \sqrt{bd\left(\sum_1^b \frac{k_j^2}{a_j}\right)V}, \tag{4'}$$
where
$$V = \frac{\sum_{j=1}^b \sum_{i=1}^{a_j} (X_{ij} - \bar{X}_{.j})^2}{\sum_1^b (a_j - 1)}$$
and $d$ is selected from Table V with $b$ and $\sum_1^b (a_j - 1)$ degrees of freedom. Inequality (4') reduces to inequality (4) when $a_1 = a_2 = \cdots = a_b$. Moreover, if we restrict our attention to linear functions of the form $\sum_1^b k_j\mu_j$ with $\sum_1^b k_j = 0$ (such linear functions are called contrasts), the radical in inequality (4') is replaced by
$$\sqrt{(b-1)d\left(\sum_1^b \frac{k_j^2}{a_j}\right)V},$$
where $d$ is now found in Table V with $b-1$ and $\sum_1^b (a_j - 1)$ degrees of freedom.

In these multiple comparisons, one often finds that the length of a confidence interval is much greater than the length of a $100(1-\alpha)$ per cent confidence interval for a particular linear function $\sum_1^b k_j\mu_j$. But this is to be expected because in one case the probability $1 - \alpha$ applies to just one event, and in the other it applies to the simultaneous occurrence of many events. One reasonable way to reduce the length of these intervals is to take a larger value of $\alpha$, say 0.25, instead of 0.05. After all, it is still a very strong statement to say that the probability is 0.75 that all these events occur.

EXERCISES

11.9. If $A_1, A_2, \ldots, A_k$ are events, prove, by induction, Boole's inequality
$$P(A_1 \cup A_2 \cup \cdots \cup A_k) \le \sum_1^k P(A_i).$$
Then show that
$$P(A_1^* \cap A_2^* \cap \cdots \cap A_k^*) \ge 1 - \sum_1^k P(A_i),$$
where $A_i^*$ denotes the complement of $A_i$.

11.10. In the notation of this section, let $(k_{i1}, k_{i2}, \ldots, k_{ib})$, $i = 1, 2, \ldots, m$, represent a finite number of $b$-tuples. The problem is to find simultaneous confidence intervals for $\sum_{j=1}^b k_{ij}\mu_j$, $i = 1, 2, \ldots, m$, by a method different from that of Scheffé. Define the random variable $T_i$ by
$$T_i = \frac{\sum_{j=1}^b k_{ij}\bar{X}_{.j} - \sum_{j=1}^b k_{ij}\mu_j}{\sqrt{\left(\sum_{j=1}^b k_{ij}^2\right)V/a}}, \qquad i = 1, 2, \ldots, m.$$
(a) Let the event $A_i^*$ be given by $-c_i \le T_i \le c_i$, $i = 1, 2, \ldots, m$. Find the random variables $U_i$ and $W_i$ such that $U_i \le \sum_{j=1}^b k_{ij}\mu_j \le W_i$ is equivalent to $A_i^*$.
(b) Select $c_i$ such that $P(A_i^*) = 1 - \alpha/m$; that is, $P(A_i) = \alpha/m$. Use the results of Exercise 11.9 to determine a lower bound on the probability that simultaneously the random intervals $(U_1, W_1), \ldots, (U_m, W_m)$ include $\sum_{j=1}^b k_{1j}\mu_j, \ldots, \sum_{j=1}^b k_{mj}\mu_j$, respectively.
(c) Let $a = 3$, $b = 6$, and $\alpha = 0.05$. Consider the linear functions $\mu_1 - \mu_2$, $\mu_2 - \mu_3$, $\mu_3 - \mu_4$, $\mu_4 - (\mu_5 + \mu_6)/2$, and $(\mu_1 + \mu_2 + \cdots + \mu_6)/6$. Here $m = 5$. Show that the lengths of the confidence intervals given by the results of part (b) are shorter than the corresponding ones given by the method of Scheffé, as described in the text. If $m$ becomes sufficiently large, however, this is not the case.

11.4 Classification

The problem of classification can be described as follows. An investigator makes a number of measurements on an item and wants to place it into one of several categories (or classify it). For convenience in our discussion, we assume that only two measurements, say $X$ and $Y$, are made on the item to be classified. Moreover, let $X$ and $Y$ have a joint p.d.f. $f(x, y; \theta)$, where the parameter $\theta$ represents one or more parameters. In our simplification, suppose that there are only two possible joint distributions (categories) for $X$ and $Y$, which are indexed by the parameter values $\theta'$ and $\theta''$, respectively. In this case, the problem then reduces to one of observing $X = x$ and $Y = y$ and then testing the hypothesis $\theta = \theta'$ against the hypothesis $\theta = \theta''$, with the classification of $X$ and $Y$ being in accord with which hypothesis is accepted. From the Neyman-Pearson theorem, we know that a best decision of this sort is of the form: If
$$\frac{f(x, y; \theta')}{f(x, y; \theta'')} \le k,$$
choose the distribution indexed by $\theta''$; that is, we classify $(x, y)$ as coming from the distribution indexed by $\theta''$.
Otherwise, choose the distribution indexed by $\theta'$; that is, we classify $(x, y)$ as coming from the distribution indexed by $\theta'$. In order to investigate an appropriate value of $k$, let us consider a Bayesian approach to the problem (see Section 6.6). We need a p.d.f. $h(\theta)$ of the parameter, which here is of the discrete type since the parameter space $\Omega$ consists of but two points $\theta'$ and $\theta''$. So we have that $h(\theta') + h(\theta'') = 1$. Of course, the conditional p.d.f. $g(\theta|x, y)$ of the parameter, given $X = x$, $Y = y$, is proportional to the product of $h(\theta)$ and $f(x, y; \theta)$,
$$g(\theta|x, y) \propto h(\theta)f(x, y; \theta).$$
In particular, in this case,
$$g(\theta|x, y) = \frac{h(\theta)f(x, y; \theta)}{h(\theta')f(x, y; \theta') + h(\theta'')f(x, y; \theta'')}, \qquad \theta = \theta', \theta''.$$
Let us introduce a loss function $\mathscr{L}[\theta, w(x, y)]$, where the decision function $w(x, y)$ selects decision $w = \theta'$ or decision $w = \theta''$. Because the pairs $(\theta = \theta', w = \theta')$ and $(\theta = \theta'', w = \theta'')$ represent correct decisions, we always take $\mathscr{L}(\theta', \theta') = \mathscr{L}(\theta'', \theta'') = 0$. On the other hand, positive values of the loss function should be assigned for incorrect decisions; that is, $\mathscr{L}(\theta', \theta'') > 0$ and $\mathscr{L}(\theta'', \theta') > 0$.

A Bayes' solution to the problem is defined to be such that the conditional expected value of the loss $\mathscr{L}[\theta, w(x, y)]$, given $X = x$, $Y = y$, is a minimum. If $w = \theta'$, this conditional expectation is
$$\sum_{\theta} \mathscr{L}(\theta, \theta')g(\theta|x, y) = \frac{\mathscr{L}(\theta'', \theta')h(\theta'')f(x, y; \theta'')}{h(\theta')f(x, y; \theta') + h(\theta'')f(x, y; \theta'')}$$
because $\mathscr{L}(\theta', \theta') = 0$; and if $w = \theta''$, it is
$$\sum_{\theta} \mathscr{L}(\theta, \theta'')g(\theta|x, y) = \frac{\mathscr{L}(\theta', \theta'')h(\theta')f(x, y; \theta')}{h(\theta')f(x, y; \theta') + h(\theta'')f(x, y; \theta'')}$$
because $\mathscr{L}(\theta'', \theta'') = 0$. Accordingly, a Bayes' solution is one that decides $w = \theta''$ if the latter ratio is less than or equal to the former; or, equivalently, if
$$\mathscr{L}(\theta', \theta'')h(\theta')f(x, y; \theta') \le \mathscr{L}(\theta'', \theta')h(\theta'')f(x, y; \theta'').$$
That is, the decision $w = \theta''$ is made if
$$\frac{f(x, y; \theta')}{f(x, y; \theta'')} \le \frac{\mathscr{L}(\theta'', \theta')h(\theta'')}{\mathscr{L}(\theta', \theta'')h(\theta')} = k;$$
otherwise, the decision $w = \theta'$ is made. Hence, if prior probabilities $h(\theta')$ and $h(\theta'')$ and losses $\mathscr{L}(\theta = \theta', w = \theta'')$ and $\mathscr{L}(\theta = \theta'', w = \theta')$ can be assigned, the constant $k$ of the Neyman-Pearson theorem can be found easily from this formula.

Example 1. Let $(x, y)$ be an observation of the random pair $(X, Y)$, which has a bivariate normal distribution with parameters $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$, and $\rho$. In Section 3.5 that joint p.d.f. is given by
$$f(x, y; \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\,e^{-q(x, y; \mu_1, \mu_2)/2}, \qquad -\infty < x < \infty, \; -\infty < y < \infty,$$
where $\sigma_1 > 0$, $\sigma_2 > 0$, $-1 < \rho < 1$, and
$$q(x, y; \mu_1, \mu_2) = \frac{1}{1-\rho^2}\left[\left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2\right].$$
Assume that $\sigma_1^2$, $\sigma_2^2$, and $\rho$ are known but that we do not know whether the respective means of $(X, Y)$ are $(\mu_1', \mu_2')$ or $(\mu_1'', \mu_2'')$. The inequality
$$\frac{f(x, y; \mu_1', \mu_2', \sigma_1^2, \sigma_2^2, \rho)}{f(x, y; \mu_1'', \mu_2'', \sigma_1^2, \sigma_2^2, \rho)} \le k$$
is equivalent to
$$-\tfrac{1}{2}[q(x, y; \mu_1', \mu_2') - q(x, y; \mu_1'', \mu_2'')] \le \ln k.$$
Moreover, it is clear that the difference in the left-hand member of this inequality does not contain terms involving $x^2$, $xy$, and $y^2$. In particular, this inequality is the same as a linear inequality in $x$ and $y$ whose right-hand member is
$$\ln k + \tfrac{1}{2}[q(0, 0; \mu_1', \mu_2') - q(0, 0; \mu_1'', \mu_2'')],$$
or, for brevity,
$$ax + by \le c. \tag{1}$$
That is, if this linear function of $x$ and $y$ in the left-hand member of inequality (1) is less than or equal to a certain constant, we would classify that $(x, y)$ as coming from the bivariate normal distribution with means $\mu_1''$ and $\mu_2''$. Otherwise, we would classify $(x, y)$ as arising from the bivariate normal distribution with means $\mu_1'$ and $\mu_2'$. Of course, if the prior probabilities and losses are given, $k$ and thus $c$ can be found easily; this will be illustrated in Exercise 11.11.

Once the rule for classification is established, the statistician might be interested in the two probabilities of misclassification using that rule. The first of these two is associated with the classification of $(x, y)$ as arising from the distribution indexed by $\theta''$ if, in fact, it comes from that indexed by $\theta'$. The second misclassification is similar, but with the interchange of $\theta'$ and $\theta''$. In the previous example, the probabilities of these respective misclassifications are
$$\Pr(aX + bY \le c;\ \mu_1', \mu_2') \qquad \text{and} \qquad \Pr(aX + bY > c;\ \mu_1'', \mu_2'').$$
Fortunately, the distribution of $Z = aX + bY$ is easy to determine, so each of these probabilities is easy to calculate.
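To make the rule concrete, here is a small Python sketch (not from the text) that derives $a$, $b$, and $c$ for Example 1 by evaluating the quadratic forms, and then computes the two misclassification probabilities using the normal distribution of $Z = aX + bY$ that is derived immediately below; the parameter values and the constant $k$ are hypothetical choices.

```python
import numpy as np
from scipy.stats import norm

s1, s2, rho = 1.0, 1.0, 0.5            # assumed known sigma_1, sigma_2, rho
mu_p, mu_pp = (0.0, 0.0), (1.0, 1.0)   # hypothetical means under theta' and theta''
k = 1.0                                # hypothetical constant from priors and losses

def q(x, y, m1, m2):
    u, v = (x - m1) / s1, (y - m2) / s2
    return (u * u - 2 * rho * u * v + v * v) / (1 - rho ** 2)

def g(x, y):                           # -0.5*[q(x,y;mu') - q(x,y;mu'')], linear in x and y
    return -0.5 * (q(x, y, *mu_p) - q(x, y, *mu_pp))

a, b = g(1, 0) - g(0, 0), g(0, 1) - g(0, 0)
c = np.log(k) - g(0, 0)                # classify as theta'' when a*x + b*y <= c

sd = np.sqrt(a**2 * s1**2 + 2 * a * b * rho * s1 * s2 + b**2 * s2**2)
p1 = norm.cdf((c - (a * mu_p[0] + b * mu_p[1])) / sd)    # classify theta'' | theta' true
p2 = norm.sf((c - (a * mu_pp[0] + b * mu_pp[1])) / sd)   # classify theta'  | theta'' true
print(a, b, c, p1, p2)
```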
The moment-generating function of $Z$ is
$$E(e^{tZ}) = E[e^{t(aX+bY)}] = E(e^{atX + btY}).$$
Hence in the joint moment-generating function of $X$ and $Y$ found in Section 3.5, we simply replace $t_1$ by $at$ and $t_2$ by $bt$, to obtain
$$E(e^{tZ}) = \exp\left[t(a\mu_1 + b\mu_2) + \frac{t^2(a^2\sigma_1^2 + 2ab\rho\sigma_1\sigma_2 + b^2\sigma_2^2)}{2}\right].$$
However, this is the moment-generating function of the normal distribution
$$n(a\mu_1 + b\mu_2,\ a^2\sigma_1^2 + 2ab\rho\sigma_1\sigma_2 + b^2\sigma_2^2).$$
With this information, it is easy to compute the probabilities of misclassification, and this will also be demonstrated in Exercise 11.11.

One final remark must be made with respect to the use of the important classification rule established in Example 1. In most instances the parameter values $\mu_1', \mu_2'$ and $\mu_1'', \mu_2''$, as well as $\sigma_1^2$, $\sigma_2^2$, and $\rho$, are unknown. In such cases the statistician has usually observed a random sample (frequently called a training sample) from each of the two distributions. Let us say the samples have sizes $n'$ and $n''$, respectively, with sample characteristics
$$\bar{x}',\ \bar{y}',\ (s_x')^2,\ (s_y')^2,\ r' \qquad \text{and} \qquad \bar{x}'',\ \bar{y}'',\ (s_x'')^2,\ (s_y'')^2,\ r''.$$
Accordingly, if in inequality (1) the parameters $\mu_1', \mu_2', \mu_1'', \mu_2''$, $\sigma_1^2$, $\sigma_2^2$, and $\rho\sigma_1\sigma_2$ are replaced by the unbiased estimates
$$\bar{x}',\ \bar{y}',\ \bar{x}'',\ \bar{y}'',\ \frac{n'(s_x')^2 + n''(s_x'')^2}{n' + n'' - 2},\ \frac{n'(s_y')^2 + n''(s_y'')^2}{n' + n'' - 2},\ \frac{n'r's_x's_y' + n''r''s_x''s_y''}{n' + n'' - 2},$$
the resulting expression in the left-hand member is frequently called Fisher's linear discriminant function. Since those parameters have been estimated, the distribution theory associated with $aX + bY$ is not appropriate for Fisher's function. However, if $n'$ and $n''$ are large, the distribution of $aX + bY$ does provide an approximation.

Although we have considered only bivariate distributions in this section, the results can easily be extended to multivariate normal distributions after a study of Chapter 12.

EXERCISES

11.11. In Example 1 let $\mu_1' = \mu_2' = 0$, $\mu_1'' = \mu_2'' = 1$, $\sigma_1^2 = 1$, $\sigma_2^2 = 1$, and $\rho = \tfrac{1}{2}$.
(a) Evaluate inequality (1) when the prior probabilities are $h(\mu_1', \mu_2') = \tfrac{1}{3}$ and $h(\mu_1'', \mu_2'') = \tfrac{2}{3}$ and the losses are $\mathscr{L}[\theta = (\mu_1', \mu_2'), w = (\mu_1'', \mu_2'')] = 4$ and $\mathscr{L}[\theta = (\mu_1'', \mu_2''), w = (\mu_1', \mu_2')] = 1$.
(b) Find the distribution of the linear function $aX + bY$ that results from part (a).
(c) Compute $\Pr(aX + bY \le c;\ \mu_1' = \mu_2' = 0)$ and $\Pr(aX + bY > c;\ \mu_1'' = \mu_2'' = 1)$.

11.12. Let $X$ and $Y$ have the joint p.d.f.
$$f(x, y; \theta_1, \theta_2) = \frac{1}{\theta_1\theta_2}\exp\left(-\frac{x}{\theta_1} - \frac{y}{\theta_2}\right), \qquad 0 < x < \infty,\ 0 < y < \infty,$$
zero elsewhere, where $0 < \theta_1$, $0 < \theta_2$. An observation $(x, y)$ arises from the joint distribution with parameters equal to either $\theta_1' = 1$, $\theta_2' = 5$ or $\theta_1'' = 3$, $\theta_2'' = 2$. Determine the form of the classification rule.

11.13. Let $X$ and $Y$ have a joint bivariate normal distribution. An observation $(x, y)$ arises from the joint distribution with parameters equal to either
$$\mu_1' = \mu_2' = 0, \quad (\sigma_1^2)' = (\sigma_2^2)' = 1, \quad \rho' = \tfrac{1}{2}$$
or
$$\mu_1'' = \mu_2'' = 1, \quad (\sigma_1^2)'' = 4, \quad (\sigma_2^2)'' = 9, \quad \rho'' = \tfrac{1}{2}.$$
Show that the classification rule involves a second-degree polynomial in $x$ and $y$.

11.5 Sufficiency, Completeness, and Stochastic Independence

In Chapter 10 we noted that if we have a sufficient statistic $Y_1$ for a parameter $\theta$, $\theta \in \Omega$, then $h(z|y_1)$, the conditional p.d.f. of another statistic $Z$, given $Y_1 = y_1$, does not depend upon $\theta$. If, moreover, $Y_1$ and $Z$ are stochastically independent, the p.d.f. $g_2(z)$ of $Z$ is such that $g_2(z) = h(z|y_1)$, and hence $g_2(z)$ must not depend upon $\theta$ either. So the stochastic independence of a statistic $Z$ and the sufficient statistic $Y_1$ for a parameter $\theta$ means that the distribution of $Z$ does not depend upon $\theta \in \Omega$. It is interesting to investigate a converse of that property. Suppose that the distribution of a statistic $Z$ does not depend upon $\theta$; then, are $Z$ and the sufficient statistic $Y_1$ for $\theta$ stochastically independent?
To begin our search for the answer, we know that the joint p.d.f. of $Y_1$ and $Z$ is $g_1(y_1; \theta)h(z|y_1)$, where $g_1(y_1; \theta)$ and $h(z|y_1)$ represent the marginal p.d.f. of $Y_1$ and the conditional p.d.f. of $Z$ given $Y_1 = y_1$, respectively. Thus the marginal p.d.f. of $Z$ is
$$g_2(z) = \int_{-\infty}^{\infty} g_1(y_1; \theta)h(z|y_1)\, dy_1,$$
which, by hypothesis, does not depend upon $\theta$. Because
$$\int_{-\infty}^{\infty} g_2(z)g_1(y_1; \theta)\, dy_1 = g_2(z),$$
it follows, by taking the difference of the last two integrals, that
$$\int_{-\infty}^{\infty} [g_2(z) - h(z|y_1)]g_1(y_1; \theta)\, dy_1 = 0 \tag{1}$$
for all $\theta \in \Omega$. Since $Y_1$ is a sufficient statistic for $\theta$, $h(z|y_1)$ does not depend upon $\theta$. By assumption, $g_2(z)$ and hence $g_2(z) - h(z|y_1)$ do not depend upon $\theta$. Now if the family $\{g_1(y_1; \theta) : \theta \in \Omega\}$ is complete, Equation (1) would require that
$$g_2(z) - h(z|y_1) = 0 \qquad \text{or} \qquad g_2(z) = h(z|y_1).$$
That is, the joint p.d.f. of $Y_1$ and $Z$ must be equal to
$$g_1(y_1; \theta)h(z|y_1) = g_1(y_1; \theta)g_2(z).$$
Accordingly, $Y_1$ and $Z$ are stochastically independent, and we have proved the following theorem.

Theorem 1. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution having a p.d.f. $f(x; \theta)$, $\theta \in \Omega$, where $\Omega$ is an interval set. Let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a sufficient statistic for $\theta$, and let the family $\{g_1(y_1; \theta) : \theta \in \Omega\}$ of probability density functions of $Y_1$ be complete. Let $Z = u(X_1, X_2, \ldots, X_n)$ be any other statistic (not a function of $Y_1$ alone). If the distribution of $Z$ does not depend upon $\theta$, then $Z$ is stochastically independent of the sufficient statistic $Y_1$.

In the discussion above, it is interesting to observe that if $Y_1$ is a sufficient statistic for $\theta$, then the stochastic independence of $Y_1$ and $Z$ implies that the distribution of $Z$ does not depend upon $\theta$ whether $\{g_1(y_1; \theta) : \theta \in \Omega\}$ is or is not complete. However, in the converse, to prove the stochastic independence from the fact that $g_2(z)$ does not depend upon $\theta$, we definitely need the completeness. Accordingly, if we are dealing with situations in which we know that the family $\{g_1(y_1; \theta) : \theta \in \Omega\}$ is complete (such as a regular case of the exponential class), we can say that the statistic $Z$ is stochastically independent of the sufficient statistic $Y_1$ if, and only if, the distribution of $Z$ does not depend upon $\theta$.

It should be remarked that the theorem (including the special formulation of it for regular cases of the exponential class) extends immediately to probability density functions that involve $m$ parameters for which there exist $m$ joint sufficient statistics. For example, let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution having the p.d.f. $f(x; \theta_1, \theta_2)$ that represents a regular case of the exponential class such that there are two joint complete sufficient statistics for $\theta_1$ and $\theta_2$. Then any other statistic $Z = u(X_1, X_2, \ldots, X_n)$ is stochastically independent of the joint complete sufficient statistics if and only if the distribution of $Z$ does not depend upon $\theta_1$ or $\theta_2$.

We give an example of the theorem that provides an alternative proof of the stochastic independence of $\bar{X}$ and $S^2$, the mean and the variance of a random sample of size $n$ from a distribution that is $n(\mu, \sigma^2)$. This proof is presented as if we did not know that $nS^2/\sigma^2$ is $\chi^2(n-1)$, because that fact and the stochastic independence were established in the same argument (see Section 4.8).

Example 1. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution that is $n(\mu, \sigma^2)$. We know that the mean $\bar{X}$ of the sample is, for every known $\sigma^2$, a complete sufficient statistic for the parameter $\mu$, $-\infty < \mu < \infty$. Consider the statistic
$$S^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$$
and the one-to-one transformation defined by $W_i = X_i - \mu$, $i = 1, 2, \ldots, n$. Since $\bar{W} = \bar{X} - \mu$, we have that
$$S^2 = \frac{1}{n}\sum_{i=1}^n (W_i - \bar{W})^2;$$
moreover, each $W_i$ is $n(0, \sigma^2)$, $i = 1, 2, \ldots, n$. That is, $S^2$ can be written as a function of the random variables $W_1, W_2, \ldots, W_n$, which have distributions that do not depend upon $\mu$. Thus $S^2$ must have a distribution that does not depend upon $\mu$; and hence, by the theorem, $S^2$ and $\bar{X}$, the complete sufficient statistic for $\mu$, are stochastically independent.
The technique that is used in Example 1 can be generalized to situations in which there is a complete sufficient statistic for a location parameter $\theta$. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution that has a p.d.f. of the form $f(x - \theta)$, for every real $\theta$; that is, $\theta$ is a location parameter. Let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a complete sufficient statistic for $\theta$. Moreover, let $Z = u(X_1, X_2, \ldots, X_n)$ be another statistic such that
$$u(x_1 + d, x_2 + d, \ldots, x_n + d) = u(x_1, x_2, \ldots, x_n)$$
for all real $d$. The one-to-one transformation defined by $W_i = X_i - \theta$, $i = 1, 2, \ldots, n$, requires that the joint p.d.f. of $W_1, W_2, \ldots, W_n$ be
$$f(w_1)f(w_2)\cdots f(w_n),$$
which does not depend upon $\theta$. In addition, we have, because of the special functional nature of $u(x_1, x_2, \ldots, x_n)$, that
$$Z = u(W_1 + \theta, W_2 + \theta, \ldots, W_n + \theta) = u(W_1, W_2, \ldots, W_n)$$
is a function of $W_1, W_2, \ldots, W_n$ alone (not of $\theta$). Hence $Z$ must have a distribution that does not depend upon $\theta$ and thus, by the theorem, is stochastically independent of $Y_1$.

Example 2. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the distribution having p.d.f.
$$f(x; \theta) = e^{-(x-\theta)}, \qquad \theta < x < \infty,\ -\infty < \theta < \infty,$$
$$= 0 \text{ elsewhere.}$$
Here the p.d.f. is of the form $f(x - \theta)$, where $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere. Moreover, we know (Exercise 10.17, Section 10.3) that the first order statistic $Y_1 = \min(X_i)$ is a complete sufficient statistic for $\theta$. Hence $Y_1$ must be stochastically independent of each statistic $u(X_1, X_2, \ldots, X_n)$ enjoying the property that
$$u(x_1 + d, x_2 + d, \ldots, x_n + d) = u(x_1, x_2, \ldots, x_n)$$
for all real $d$. Illustrations of such statistics are $S^2$, the sample range, and
$$\frac{1}{n}\sum_{i=1}^n [X_i - \min(X_i)].$$

There is a result on the stochastic independence of a complete sufficient statistic for a scale parameter and another statistic that corresponds to that associated with a location parameter. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution that has a p.d.f. of the form $(1/\theta)f(x/\theta)$, for all $\theta > 0$; that is, $\theta$ is a scale parameter. Let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a complete sufficient statistic for $\theta$. Say $Z = u(X_1, X_2, \ldots, X_n)$ is another statistic such that
$$u(cx_1, cx_2, \ldots, cx_n) = u(x_1, x_2, \ldots, x_n)$$
for all $c > 0$. The one-to-one transformation defined by $W_i = X_i/\theta$, $i = 1, 2, \ldots, n$, requires the following: (a) that the joint p.d.f. of $W_1, W_2, \ldots, W_n$ be equal to
$$f(w_1)f(w_2)\cdots f(w_n),$$
and (b) that the statistic $Z$ be equal to
$$Z = u(\theta W_1, \theta W_2, \ldots, \theta W_n) = u(W_1, W_2, \ldots, W_n).$$
Since neither the joint p.d.f. of $W_1, W_2, \ldots, W_n$ nor $Z$ contains $\theta$, the distribution of $Z$ must not depend upon $\theta$ and thus, by the theorem, $Z$ is stochastically independent of the complete sufficient statistic $Y_1$ for the parameter $\theta$.

Example 3. Let $X_1$ and $X_2$ denote a random sample of size $n = 2$ from a distribution with p.d.f.
$$f(x; \theta) = \frac{1}{\theta}e^{-x/\theta}, \qquad 0 < x < \infty,\ 0 < \theta < \infty,$$
$$= 0 \text{ elsewhere.}$$
The p.d.f. is of the form $(1/\theta)f(x/\theta)$, where $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere. We know (Section 10.4) that $Y_1 = X_1 + X_2$ is a complete sufficient statistic for $\theta$. Hence $Y_1$ is stochastically independent of every statistic $u(X_1, X_2)$ with the property $u(cx_1, cx_2) = u(x_1, x_2)$. Illustrations of these are $X_1/X_2$ and $X_1/(X_1 + X_2)$, statistics that have $F$ and beta distributions, respectively.

Finally, the location and the scale parameters can be combined in a p.d.f. of the form $(1/\theta_2)f[(x - \theta_1)/\theta_2]$, $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Through a one-to-one transformation defined by $W_i = (X_i - \theta_1)/\theta_2$, $i = 1, 2, \ldots, n$, it is easy to show that a statistic $Z = u(X_1, X_2, \ldots, X_n)$ such that
$$u(cx_1 + d, \ldots, cx_n + d) = u(x_1, \ldots, x_n)$$
for $-\infty < d < \infty$, $0 < c < \infty$, has a distribution that does not depend upon $\theta_1$ and $\theta_2$. Thus, by the extension of the theorem, the joint complete sufficient statistics $Y_1$ and $Y_2$ for the parameters $\theta_1$ and $\theta_2$ are stochastically independent of $Z$.

Example 4. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is $n(\theta_1, \theta_2)$, $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. In Example 2, Section 10.6, it was proved that the mean $\bar{X}$ and the variance $S^2$ of the sample are joint complete sufficient statistics for $\theta_1$ and $\theta_2$. Consider the statistic $Z = u(X_1, X_2, \ldots, X_n)$, which satisfies the property that
$$u(cx_1 + d, \ldots, cx_n + d) = u(x_1, \ldots, x_n).$$
That is, $Z$ is stochastically independent of both $\bar{X}$ and $S^2$.
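As a quick numerical illustration of Example 3 (not part of the text), the following Python sketch draws many samples of size 2 from an exponential distribution and checks that the scale-free ratio $X_1/(X_1 + X_2)$ behaves like a beta(1, 1), that is, uniform, variable and is essentially uncorrelated with $Y_1 = X_1 + X_2$; the sample size and the value of $\theta$ are arbitrary.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(4)
theta = 3.0                                  # arbitrary scale parameter
x1, x2 = rng.exponential(theta, size=(2, 100_000))

ratio = x1 / (x1 + x2)                       # scale-free statistic u(X1, X2)
y1 = x1 + x2                                 # complete sufficient statistic for theta

print(kstest(ratio, "uniform"))              # beta(1,1) = uniform(0,1): large p-value expected
print(np.corrcoef(ratio, y1)[0, 1])          # near zero, consistent with independence
```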
Let $n(\theta_1, \theta_3)$ and $n(\theta_2, \theta_4)$ denote two independent normal distributions. Recall that in Example 2, Section 7.4, a statistic, which was denoted by $T$, was used to test the hypothesis that $\theta_1 = \theta_2$, provided the unknown variances $\theta_3$ and $\theta_4$ were equal. The hypothesis that $\theta_1 = \theta_2$ is rejected if the computed $|T| \ge c$, where the constant $c$ is selected so that $\alpha_2 = \Pr(|T| \ge c;\ \theta_1 = \theta_2, \theta_3 = \theta_4)$ is the assigned significance level of the test. We shall show that, if $\theta_3 = \theta_4$, $F$ of Example 3, Section 7.4, and $T$ are stochastically independent. Among other things, this means that if these two tests are performed sequentially, with respective significance levels $\alpha_1$ and $\alpha_2$, the probability of accepting both these hypotheses, when they are true, is $(1 - \alpha_1)(1 - \alpha_2)$. Thus the significance level of this joint test is $\alpha = 1 - (1 - \alpha_1)(1 - \alpha_2)$.

The stochastic independence of $F$ and $T$, when $\theta_3 = \theta_4$, can be established by an appeal to sufficiency and completeness. The three statistics $\bar{X}$, $\bar{Y}$, and $\sum_1^n (X_i - \bar{X})^2 + \sum_1^m (Y_j - \bar{Y})^2$ are joint complete sufficient statistics for the three parameters $\theta_1$, $\theta_2$, and $\theta_3 = \theta_4$. Obviously, the distribution of $F$ does not depend upon $\theta_1$, $\theta_2$, and $\theta_3 = \theta_4$, and hence $F$ is stochastically independent of the three joint complete sufficient statistics. However, $T$ is a function of these three joint complete sufficient statistics alone, and, accordingly, $T$ is stochastically independent of $F$. It is important to note that these two statistics are stochastically independent whether $\theta_1 = \theta_2$ or $\theta_1 \ne \theta_2$, that is, whether $T$ is or is not central. This permits us to calculate probabilities other than the significance level of the test. For example, if $\theta_3 = \theta_4$ and $\theta_1 \ne \theta_2$, then
$$\Pr(c_1 < F < c_2,\ |T| \ge c) = \Pr(c_1 < F < c_2)\Pr(|T| \ge c).$$
The second factor in the right-hand member is evaluated by using the probabilities for a noncentral $t$ distribution. Of course, if $\theta_3 = \theta_4$ and the difference $\theta_1 - \theta_2$ is large, we would want the preceding probability to be close to 1 because the event $\{c_1 < F < c_2,\ |T| \ge c\}$ leads to a correct decision, namely accept $\theta_3 = \theta_4$ and reject $\theta_1 = \theta_2$.

EXERCISES

11.14. Let $Y_1 < Y_2 < Y_3 < Y_4$ denote the order statistics of a random sample of size $n = 4$ from a distribution having p.d.f. $f(x; \theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere, where $0 < \theta < \infty$. Argue that the complete sufficient statistic $Y_4$ for $\theta$ is stochastically independent of each of the statistics $Y_1/Y_4$ and $(Y_1 + Y_2)/(Y_3 + Y_4)$. Hint. Show that the p.d.f. is of the form $(1/\theta)f(x/\theta)$, where $f(x) = 1$, $0 < x < 1$, zero elsewhere.

11.15. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from the normal distribution $n(\theta, \sigma^2)$, $-\infty < \theta < \infty$. Show that the distribution of $Z = Y_n - \bar{Y}$ does not depend upon $\theta$. Thus $\bar{Y} = \sum_1^n Y_i/n$, a complete sufficient statistic for $\theta$, is stochastically independent of $Z$.

11.16. Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution $n(\theta, \sigma^2)$, $-\infty < \theta < \infty$. Prove that a necessary and sufficient condition that the statistics $Z = \sum_1^n a_iX_i$ and $Y = \sum_1^n X_i$, a complete sufficient statistic for $\theta$, be stochastically independent is that $\sum_1^n a_i = 0$.

11.17. Let $X$ and $Y$ be random variables such that $E(X^k)$ and $E(Y^k) \ne 0$ exist for $k = 1, 2, 3, \ldots$. If the ratio $X/Y$ and its denominator $Y$ are stochastically independent, prove that $E[(X/Y)^k] = E(X^k)/E(Y^k)$, $k = 1, 2, 3, \ldots$. Hint. Write $E(X^k) = E[Y^k(X/Y)^k]$.

11.18. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$ from a distribution that has p.d.f. $f(x; \theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, $0 < \theta < \infty$, zero elsewhere. Show that the ratio $R = nY_1/\sum_1^n Y_i$ and its denominator (a complete sufficient statistic for $\theta$) are stochastically independent. Use the result of the preceding exercise to determine $E(R^k)$, $k = 1, 2, 3, \ldots$.
11.19. Let $X_1, X_2, \ldots, X_5$ be a random sample of size 5 from the distribution that has p.d.f. $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere. Show that $(X_1 + X_2)/(X_1 + X_2 + \cdots + X_5)$ and its denominator are stochastically independent. Hint. The p.d.f. $f(x)$ is a member of $\{f(x; \theta) : 0 < \theta < \infty\}$, where $f(x; \theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, zero elsewhere.

11.20. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from the normal distribution $n(\theta_1, \theta_2)$, $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Show that the joint complete sufficient statistics $\bar{X} = \bar{Y}$ and $S^2$ for $\theta_1$ and $\theta_2$ are stochastically independent of each of $(Y_n - \bar{Y})/S$ and $(Y_n - Y_1)/S$.

11.21. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a distribution with the p.d.f.
$$f(x; \theta_1, \theta_2) = \frac{1}{\theta_2}\exp\left(-\frac{x - \theta_1}{\theta_2}\right), \qquad \theta_1 < x < \infty,$$
zero elsewhere, where $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Show that the joint complete sufficient statistics $Y_1$ and $\bar{X} = \bar{Y}$ for $\theta_1$ and $\theta_2$ are stochastically independent of $(Y_2 - Y_1)/\sum_1^n (Y_i - Y_1)$.
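The claim above that, when $\theta_3 = \theta_4$, the $F$ and $T$ tests are independent, so that the two-stage procedure has level $1 - (1 - \alpha_1)(1 - \alpha_2)$, can be checked by simulation. The following Python sketch (an illustration, not from the text) uses equal-size samples and arbitrary parameter values.

```python
import numpy as np
from scipy.stats import f, ttest_ind

rng = np.random.default_rng(5)
n = m = 15
a1 = a2 = 0.05
c_lo, c_hi = f.ppf(a1 / 2, n - 1, m - 1), f.ppf(1 - a1 / 2, n - 1, m - 1)

reject = 0
reps = 20_000
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n)          # both hypotheses true: equal means, equal variances
    y = rng.normal(0.0, 1.0, m)
    F = x.var(ddof=1) / y.var(ddof=1)    # ratio of sample variances, F(n-1, m-1) under H0
    t_p = ttest_ind(x, y, equal_var=True).pvalue
    if not (c_lo < F < c_hi) or t_p < a2:
        reject += 1                       # at least one of the two hypotheses rejected

print(reject / reps, 1 - (1 - a1) * (1 - a2))   # empirical level vs 1-(1-a1)(1-a2) = 0.0975
```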
11.6 Robust Nonparametric Methods

Frequently, an investigator is tempted to evaluate several test statistics associated with a single hypothesis and then use the one statistic that best supports his or her position, usually rejection. Obviously, this type of procedure changes the actual significance level of the test from the nominal $\alpha$ that is used. However, there is a way in which the investigator can first look at the data and then select a test statistic without changing this significance level. For illustration, suppose there are three possible test statistics $W_1, W_2, W_3$ of the hypothesis $H_0$ with respective critical regions $C_1, C_2, C_3$ such that $\Pr(W_i \in C_i; H_0) = \alpha$, $i = 1, 2, 3$. Moreover, suppose that a statistic $Q$, based upon the same data, selects one and only one of the statistics $W_1, W_2, W_3$, and the one selected is then used to test $H_0$. For example, we choose to use the test statistic $W_i$ if $Q \in D_i$, $i = 1, 2, 3$, where the events defined by $D_1$, $D_2$, and $D_3$ are mutually exclusive and exhaustive. Now if $Q$ and each $W_i$ are stochastically independent when $H_0$ is true, then the probability of rejection, using the entire procedure (selecting and testing), is, under $H_0$,
$$\Pr(Q \in D_1, W_1 \in C_1) + \Pr(Q \in D_2, W_2 \in C_2) + \Pr(Q \in D_3, W_3 \in C_3)$$
$$= \Pr(Q \in D_1)\Pr(W_1 \in C_1) + \Pr(Q \in D_2)\Pr(W_2 \in C_2) + \Pr(Q \in D_3)\Pr(W_3 \in C_3)$$
$$= \alpha[\Pr(Q \in D_1) + \Pr(Q \in D_2) + \Pr(Q \in D_3)] = \alpha.$$
That is, the procedure of selecting $W_i$ using a stochastically independent statistic $Q$ and then constructing a test of significance level $\alpha$ with the statistic $W_i$ has overall significance level $\alpha$.

Of course, the important element in this procedure is the ability to find a selector $Q$ that is independent of each test statistic $W$. This can frequently be done by using the fact that the complete sufficient statistics for the parameters, given by $H_0$, are stochastically independent of every statistic whose distribution is free of those parameters (see Section 11.5). For illustration, if random samples of sizes $m$ and $n$ arise from two independent normal distributions with respective means $\mu_1$ and $\mu_2$ and common variance $\sigma^2$, then the complete sufficient statistics $\bar{X}$, $\bar{Y}$, and
$$V = \sum_1^m (X_i - \bar{X})^2 + \sum_1^n (Y_i - \bar{Y})^2$$
for $\mu_1$, $\mu_2$, and $\sigma^2$ are stochastically independent of every statistic whose distribution is free of $\mu_1$, $\mu_2$, and $\sigma^2$. Thus, in general, we would hope to be able to find a selector $Q$ that is a function of the complete sufficient statistics for the parameters, under $H_0$, so that it is independent of the test statistics.

It is particularly interesting to note that it is relatively easy to use this technique in nonparametric methods by using the independence result based upon complete sufficient statistics for parameters. How can we use an argument depending on parameters in nonparametric methods? Although this does sound strange, it is due to the unfortunate choice of a name in describing this broad area of nonparametric methods. Most statisticians would prefer to describe the subject as being distribution-free, since the test statistics have distributions that do not depend on the underlying distribution of the continuous type, described by either the distribution function $F$ or the p.d.f. $f$. In addition, the latter name provides the clue for our application here because we have many test statistics whose distributions are free of the unknown (infinite vector) "parameter" $F$ (or $f$).
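The level-preservation argument above can be illustrated numerically. The Python sketch below (not from the text) selects between two distribution-free two-sample tests available in scipy, the Mann-Whitney-Wilcoxon test and Mood's median test, using the kurtosis of the combined sample as the selector, and estimates the overall rejection rate under equal distributions; the cutoff value 8 and the sample sizes are arbitrary choices echoing the adaptive procedure described later in this section.

```python
import numpy as np
from scipy.stats import mannwhitneyu, median_test, kurtosis

rng = np.random.default_rng(6)
m = n = 30
alpha = 0.05
reps = 5000

rejections = 0
for _ in range(reps):
    x, y = rng.normal(size=m), rng.normal(size=n)       # H0: the two distributions are equal
    K = kurtosis(np.concatenate([x, y]), fisher=False)   # selector based on the combined sample
    if K <= 8:
        p = mannwhitneyu(x, y, alternative="two-sided").pvalue
    else:
        p = median_test(x, y)[1]                         # Mood's median test p-value
    rejections += (p < alpha)

print(rejections / reps)    # should be close to alpha = 0.05
```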
We now must find complete sufficient statistics for the distribution function $F$ of the continuous type. In many instances, this is easy to do. In Exercise 10.37, Section 10.6, it is shown that the order statistics $Y_1 < Y_2 < \cdots < Y_n$ of a random sample of size $n$ from a distribution of the continuous type with p.d.f. $F'(x) = f(x)$ are sufficient statistics for the "parameter" $f$ (or $F$). Moreover, if the family of distributions contains all probability density functions of the continuous type, the family of joint probability density functions of $Y_1, Y_2, \ldots, Y_n$ is also complete. We accept this latter fact without proof, as it is beyond the level of this text; but doing so, we can now say that the order statistics $Y_1, Y_2, \ldots, Y_n$ are complete sufficient statistics for the parameter $f$ (or $F$). Accordingly, our selector $Q$ will be based upon those complete sufficient statistics, the order statistics under $H_0$. This allows us to independently choose a distribution-free test appropriate for the type of underlying distribution, and thus increase our power. Although it is well known that distribution-free tests hold the significance level $\alpha$ for all underlying distributions of the continuous type, they have often
been criticized because their powers are sometimes low. The independent selection of the distribution-free test to be used can help correct this. So selecting, or adapting the test to the data, provides a new dimension to nonparametric tests, which usually improves the power of the overall test.

A statistical test that maintains the significance level close to a desired significance level $\alpha$ for a wide variety of underlying distributions, with good (not necessarily the best for any one type of distribution) power for all these distributions, is described as being robust. As an illustration, the $T$ used to test the equality of the means of two independent normal distributions (see Section 7.4) is quite robust provided that the underlying distributions are rather close to normal ones with common variance. However, if the class of distributions includes those that are not too close to normal ones, such as the Cauchy distribution, the test based upon $T$ is not robust; the significance level is not maintained and the power of the $T$ test is low with Cauchy distributions. As a matter of fact, the test based on the Mann-Whitney-Wilcoxon statistic (Section 9.6) is a much more robust test than that based upon $T$ if the class of distributions is fairly wide (in particular, if long-tailed distributions such as the Cauchy are included).

An illustration of the adaptive distribution-free procedure that is robust is provided by considering a test of the equality of two independent distributions of the continuous type. From the discussion in Section 9.8, we know that we could construct many linear rank statistics by changing the scoring function. However, we concentrate on three such statistics mentioned explicitly in that section: that based on normal scores, say $L_1$; that of Mann-Whitney-Wilcoxon, say $L_2$; and that of the median test, say $L_3$. Moreover, respective critical regions $C_1$, $C_2$, and $C_3$ are selected so that, under the equality of the two distributions, we have
$$\alpha = \Pr(L_1 \in C_1) = \Pr(L_2 \in C_2) = \Pr(L_3 \in C_3).$$
Of course, we would like to use the test given by $L_1 \in C_1$ if the tails of the distributions are like or shorter than those of the normal distributions. With distributions having somewhat longer tails, $L_2 \in C_2$ provides an excellent test. And with distributions having very long tails, the test based on $L_3 \in C_3$ is quite satisfactory.

In order to select the appropriate test in an independent manner we let $V_1 < V_2 < \cdots < V_N$, where $N = m + n$, be the order statistics of the combined sample, which is of size $N$. Recall that if the two independent distributions are equal and have the same distribution function $F$, these order statistics are the complete sufficient statistics for the parameter $F$. Hence every statistic based on $V_1, V_2, \ldots, V_N$ is stochastically independent of $L_1$, $L_2$, and $L_3$, since the latter statistics have distributions that do not depend upon $F$. In particular, the kurtosis (Exercise 1.98, Section 1.10) of the combined sample,
$$K = \frac{\dfrac{1}{N}\sum_{i=1}^N (V_i - \bar{V})^4}{\left[\dfrac{1}{N}\sum_{i=1}^N (V_i - \bar{V})^2\right]^2},$$
is stochastically independent of $L_1$, $L_2$, and $L_3$. From Exercise 3.56, Section 3.4, we know that the kurtosis of the normal distribution is 3; hence if the two distributions were equal and normal, we would expect $K$ to be about 3. Of course, a longer-tailed distribution has a bigger kurtosis. Thus one simple way of defining the independent selection procedure would be by letting
$$D_1 = \{k : k \le 3\}, \qquad D_2 = \{k : 3 < k \le 8\}, \qquad D_3 = \{k : 8 < k\}.$$
These choices are not necessarily the best way of selecting the appropriate test, but they are reasonable and illustrative of the adaptive procedure. From the stochastic independence of $K$ and $(L_1, L_2, L_3)$, we know that the overall test has significance level $\alpha$. Since a more appropriate test has been selected, the power will be relatively good throughout a wide range of distributions. Accordingly, this distribution-free adaptive test is robust.

EXERCISES

11.22. Let $F(x)$ be a distribution function of a distribution of the continuous type which is symmetric about its median $\xi$. We wish to test $H_0: \xi = 0$ against $H_1: \xi > 0$. Use the fact that the $2n$ values, $X_i$ and $-X_i$, $i = 1, 2, \ldots, n$, after ordering, are complete sufficient statistics for $F$, provided that $H_0$ is true. Then construct an adaptive distribution-free test based upon Wilcoxon's statistic and two of its modifications given in Exercises 9.20 and 9.21.

11.23. Suppose that the hypothesis $H_0$ concerns the stochastic independence of two random variables $X$ and $Y$. That is, we wish to test $H_0: F(x, y) = F_1(x)F_2(y)$, where $F$, $F_1$, and $F_2$ are the respective joint and marginal distribution functions of the continuous type, against all alternatives. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample from the joint distribution. Under $H_0$, the order statistics of $X_1, X_2, \ldots, X_n$ and the order statistics of $Y_1, Y_2, \ldots, Y_n$ are, respectively, complete sufficient statistics for $F_1$ and $F_2$.
Use Spearman's statistic (Example 2, Section 9.8) and at least two modifications of it to create an adaptive distribution-free test of $H_0$. Hint. Instead of ranks, use normal and median scores (Section 9.8) to obtain two additional correlation coefficients. The one associated with the median scores is frequently called the quadrant test.

11.7 Robust Estimation

In Examples 2 and 4, Section 6.1, the maximum likelihood estimator $\hat{\mu} = \bar{X}$ of the mean $\mu$ of the normal distribution $n(\mu, \sigma^2)$ was found by minimizing a certain sum of squares,
$$\sum_{i=1}^n (x_i - \mu)^2.$$
Also, in the regression problem of Section 8.6, the maximum likelihood estimators $\hat{\alpha}$ and $\hat{\beta}$ of the $\alpha$ and $\beta$ in the mean $\alpha + \beta(c_i - \bar{c})$ were determined by minimizing the sum of squares
$$\sum_{i=1}^n [y_i - \alpha - \beta(c_i - \bar{c})]^2.$$
Both of these procedures come under the general heading of the method of least squares, because in each case a sum of squares is minimized. More generally, in the estimation of means of normal distributions, the method of least squares or some generalization of it is always used. The problems in the analyses of variance found in Chapter 8 are good illustrations of this fact. Hence, in this sense, normal assumptions and the method of least squares are mathematical companions.

It is interesting to note what procedures are obtained if we consider distributions that have longer tails than those of a normal distribution. For illustration, in Exercise 6.1(d), Section 6.1, the sample arises from a double exponential distribution with p.d.f.
$$f(x; \theta) = \tfrac{1}{2}e^{-|x - \theta|}, \qquad -\infty < x < \infty,$$
where $-\infty < \theta < \infty$. The maximum likelihood estimator, $\hat{\theta} = \text{median}(X_i)$, is found by minimizing the sum of absolute values,
$$\sum_{i=1}^n |x_i - \theta|,$$
and hence this is illustrative of the method of least absolute values.

Possibly a more extreme case is the determination of the maximum likelihood estimator of the center $\theta$ of the Cauchy distribution with p.d.f.
$$f(x; \theta) = \frac{1}{\pi[1 + (x - \theta)^2]}, \qquad -\infty < x < \infty,$$
where $-\infty < \theta < \infty$. The logarithm of the likelihood function of a random sample $X_1, X_2, \ldots, X_n$ from this distribution is
$$\ln L(\theta) = -n\ln\pi - \sum_{i=1}^n \ln[1 + (x_i - \theta)^2].$$
To maximize, we differentiate $\ln L(\theta)$ to obtain
$$\frac{d\ln L(\theta)}{d\theta} = \sum_{i=1}^n \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0.$$
The solution of this equation cannot be found in closed form, but the equation can be solved by some iterative process (for example, Newton's method), of course checking that the approximate solution actually provides the maximum of $L(\theta)$, approximately.
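Since the Cauchy likelihood equation has no closed-form solution, an iterative scheme of the kind just mentioned can be used. The short Python sketch below (an illustration only, with hypothetical data and no convergence safeguards) applies Newton's method to the likelihood equation $d\ln L(\theta)/d\theta = 0$ displayed above, starting from the sample median.

```python
import numpy as np

def cauchy_mle(x, iterations=20):
    """Newton's method for the Cauchy location MLE: solve d ln L(theta)/d theta = 0."""
    theta = np.median(x)                           # a sensible starting value
    for _ in range(iterations):
        r = x - theta
        score = np.sum(2 * r / (1 + r**2))               # first derivative of ln L(theta)
        hess = np.sum(2 * (r**2 - 1) / (1 + r**2)**2)    # second derivative of ln L(theta)
        theta -= score / hess
    return theta

rng = np.random.default_rng(7)
x = rng.standard_cauchy(50) + 10.0                 # hypothetical sample centered at theta = 10
print(np.median(x), cauchy_mle(x))
```

In practice one would also verify that the value returned actually maximizes $L(\theta)$, as the text cautions.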
The generalization of these three special cases is described as follows. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with a p.d.f. of the form $f(x - \theta)$, where $\theta$ is a location parameter such that $-\infty < \theta < \infty$. Thus
$$\ln L(\theta) = \sum_{i=1}^n \ln f(x_i - \theta) = -\sum_{i=1}^n \rho(x_i - \theta),$$
where $\rho(x) = -\ln f(x)$, and
$$\frac{d\ln L(\theta)}{d\theta} = -\sum_{i=1}^n \frac{f'(x_i - \theta)}{f(x_i - \theta)} = \sum_{i=1}^n \Psi(x_i - \theta),$$
where $\rho'(x) = \Psi(x)$. For the normal, double exponential, and Cauchy distributions, these respective functions are
$$\rho(x) = \frac{1}{2}\ln 2\pi + \frac{x^2}{2}, \qquad \Psi(x) = x;$$
$$\rho(x) = \ln 2 + |x|, \qquad \Psi(x) = -1,\ x < 0, \quad \Psi(x) = 1,\ 0 < x;$$
$$\rho(x) = \ln\pi + \ln(1 + x^2), \qquad \Psi(x) = \frac{2x}{1 + x^2}.$$
Clearly, these functions are very different from one distribution to another; and hence the respective maximum likelihood estimators may differ greatly. Thus we would suspect that the maximum likelihood estimator associated with one distribution would not necessarily be a good estimator in another situation. This is true; for example, $\bar{X}$ is a very poor estimator of the median of a Cauchy distribution, as the variance of $\bar{X}$ does not even exist if the sample arises from a Cauchy distribution. Intuitively, $\bar{X}$ is not a good estimator with the Cauchy distribution, because the very small or very large values (outliers) that can arise from that distribution influence the mean $\bar{X}$ of the sample too much.

An estimator that is fairly good (small variance, say) for a wide variety of distributions (not necessarily the best for any one of them) is called a robust estimator. Also, estimators associated with the solution of the equation
$$\sum_{i=1}^n \Psi(x_i - \theta) = 0$$
are frequently called M-estimators (denoted by $\hat{\theta}$) because they can be thought of as maximum likelihood estimators. So in finding a robust M-estimator we must select a $\Psi$ function which will provide an estimator that is good for each distribution in the collection under consideration. For certain theoretical reasons that we cannot explain at this level, Huber suggested a $\Psi$ function that is a combination of those associated with the normal and double exponential distributions,
$$\Psi(x) = -k, \quad x < -k,$$
$$\qquad\;\;\, = x, \quad\;\; -k \le x \le k,$$
$$\qquad\;\;\, = k, \quad\;\; k < x.$$
In Exercise 11.25 the reader is asked to find the p.d.f. $f(x)$ so that the M-estimator associated with this $\Psi$ function is the maximum likelihood estimator of the location parameter $\theta$ in the p.d.f. $f(x - \theta)$.

With Huber's $\Psi$ function, another problem arises. Note that if we double (for illustration) each of $X_1, X_2, \ldots, X_n$, estimators such as $\bar{X}$ and $\text{median}(X_i)$ also double. This is not at all true with the solution of the equation
$$\sum_{i=1}^n \Psi(x_i - \theta) = 0,$$
where the $\Psi$ function is that of Huber. One way to avoid this difficulty is to solve another, but similar, equation instead,
$$\sum_{i=1}^n \Psi\left(\frac{x_i - \theta}{d}\right) = 0, \tag{1}$$
where $d$ is a robust estimate of the scale. A popular $d$ to use is
$$d = \frac{\text{median}\,|x_i - \text{median}(x_i)|}{0.6745}.$$
The divisor 0.6745 is inserted in the definition of $d$ because then the expected value of the corresponding statistic $D$ is about equal to $\sigma$, if the sample arises from a normal distribution. That is, $\sigma$ can be approximated by $d$ under normal assumptions.

That scheme of selecting $d$ also provides us with a clue for selecting $k$. For if the sample actually arises from a normal distribution, we would want most of the items $x_1, x_2, \ldots, x_n$ to satisfy the inequality
$$\left|\frac{x_i - \theta}{d}\right| \le k,$$
because then
$$\Psi\left(\frac{x_i - \theta}{d}\right) = \frac{x_i - \theta}{d}.$$
That is, for illustration, if all the items satisfy this inequality, then Equation (1) becomes
$$\sum_{i=1}^n \Psi\left(\frac{x_i - \theta}{d}\right) = \sum_{i=1}^n \frac{x_i - \theta}{d} = 0.$$
This has the solution $\bar{x}$, which of course is most desirable with normal distributions. Since $d$ approximates $\sigma$, popular values of $k$ to use are 1.5 and 2.0, because with those selections most normal variables would satisfy the desired inequality.

Again an iterative process must usually be used to solve Equation (1). One such scheme, Newton's method, is described. Let $\hat{\theta}_1$ be a first estimate of $\theta$, such as $\hat{\theta}_1 = \text{median}(x_i)$. Approximate the left-hand member of Equation (1) by the first two terms of Taylor's expansion about $\hat{\theta}_1$ to obtain
$$\sum_{i=1}^n \Psi\left(\frac{x_i - \hat{\theta}_1}{d}\right) + (\theta - \hat{\theta}_1)\sum_{i=1}^n \Psi'\left(\frac{x_i - \hat{\theta}_1}{d}\right)\left(-\frac{1}{d}\right) = 0,$$
approximately. The solution of this provides a second estimate of $\theta$,
$$\hat{\theta}_2 = \hat{\theta}_1 + \frac{d\sum_{i=1}^n \Psi\left(\dfrac{x_i - \hat{\theta}_1}{d}\right)}{\sum_{i=1}^n \Psi'\left(\dfrac{x_i - \hat{\theta}_1}{d}\right)},$$
which is called the one-step M-estimate of $\theta$.
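The one-step M-estimate is simple to compute. The following Python sketch (not from the text) implements it with Huber's $\Psi$, $k = 1.5$, and the MAD-based scale estimate $d$ defined above; the data are hypothetical.

```python
import numpy as np

def one_step_huber(x, k=1.5):
    """One-step M-estimate of location with Huber's psi and the MAD-based scale d."""
    theta1 = np.median(x)
    d = np.median(np.abs(x - np.median(x))) / 0.6745
    z = (x - theta1) / d
    psi = np.clip(z, -k, k)                       # Huber's psi applied to the scaled residuals
    psi_prime = (np.abs(z) <= k).astype(float)    # psi'(z) = 1 on [-k, k], 0 elsewhere
    return theta1 + d * psi.sum() / psi_prime.sum()

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(10.0, 1.0, 25), [55.0, -30.0]])   # hypothetical data with outliers
print(x.mean(), np.median(x), one_step_huber(x))
```

The two gross outliers pull the sample mean away from 10, while the one-step estimate stays close to the median, which is the kind of behavior the text attributes to robust M-estimators.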
If we use $\hat{\theta}_2$ in place of $\hat{\theta}_1$, we obtain $\hat{\theta}_3$, the two-step M-estimate of $\theta$. This process can continue to obtain any desired degree of accuracy. With Huber's $\Psi$ function, the denominator of the second term,
$$\sum_{i=1}^{n} \Psi'\!\left(\frac{x_i - \hat{\theta}_1}{d}\right),$$
is particularly easy to compute because $\Psi'(x) = 1$, $-k \le x \le k$, and zero elsewhere. Thus that denominator simply counts the number of $x_1, x_2, \ldots, x_n$ such that $|x_i - \hat{\theta}_1|/d \le k$.

Although beyond the scope of this text, it can be shown, under very general conditions with known $d = 1$, that the limiting distribution of
$$\frac{\sqrt{n}\,(\hat{\theta} - \theta)}{\sqrt{E[\Psi^2(X - \theta)]/\{E[\Psi'(X - \theta)]\}^2}},$$
where $\hat{\theta}$ is the M-estimator associated with $\Psi$, is $n(0, 1)$. In applications, the denominator of this ratio can be approximated by the square root of
$$\frac{n \sum_{i=1}^{n} \Psi^2(x_i - \hat{\theta})}{\left[\sum_{i=1}^{n} \Psi'(x_i - \hat{\theta})\right]^2}.$$
Moreover, after this substitution has been made, it has been discovered empirically that certain t-distributions approximate the distribution of the ratio better than does $n(0, 1)$.

These M-estimators can be extended to regression situations. In general, they give excellent protection against outliers and bad data points; yet these M-estimators perform almost as well as least-squares estimators if the underlying distributions are actually normal.

EXERCISES

11.24. Compute the one-step M-estimate $\hat{\theta}_2$ using Huber's $\Psi$ with $k = 1.5$ if $n = 7$ and the seven observations are 2.1, 5.2, 2.3, 1.4, 2.2, 2.3, and 1.6. Here take $\hat{\theta}_1 = 2.2$, the median of the sample. Compare $\hat{\theta}_2$ with $\bar{x}$.

11.25. Let the p.d.f. $f(x)$ be such that the M-estimator associated with Huber's $\Psi$ function is a maximum likelihood estimator of the location parameter in $f(x - \theta)$. Show that $f(x)$ is of the form $c e^{-\rho_1(x)}$, where $\rho_1(x) = x^2/2$, $|x| \le k$, and $\rho_1(x) = k|x| - k^2/2$, $k < |x|$.

11.26. Plot the $\Psi$ functions associated with the normal, double exponential, and Cauchy distributions in addition to that of Huber. Why is the M-estimator associated with the $\Psi$ function of the Cauchy distribution called a descending M-estimator?

11.27. Use the data in Exercise 11.24 to find the one-step descending M-estimator $\hat{\theta}_2$ associated with $\Psi(x) = \sin(x/1.5)$, $|x| \le 1.5\pi$, zero elsewhere. This was first proposed by D. F. Andrews. Compare this to $\bar{x}$ and the one-step M-estimator of Exercise 11.24.

Chapter 12
Further Normal Distribution Theory

12.1 The Multivariate Normal Distribution

We have studied in some detail normal distributions of one and of two random variables. In this section, we shall investigate a joint distribution of $n$ random variables that will be called a multivariate normal distribution. This investigation assumes that the student is familiar with elementary matrix algebra, with real symmetric quadratic forms, and with orthogonal transformations. Henceforth the expression quadratic form means a quadratic form in a prescribed number of variables whose matrix is real and symmetric. All symbols which represent matrices will be set in boldface type.

Let $A$ denote an $n \times n$ real symmetric matrix which is positive definite. Let $\mu$ denote the $n \times 1$ matrix such that $\mu'$, the transpose of $\mu$, is $\mu' = [\mu_1, \mu_2, \ldots, \mu_n]$, where each $\mu_i$ is a real constant. Finally, let $x$ denote the $n \times 1$ matrix such that $x' = [x_1, x_2, \ldots, x_n]$. We shall show that if $C$ is an appropriately chosen positive constant, the nonnegative function
$$f(x_1, x_2, \ldots, x_n) = C \exp\left[-\frac{(x - \mu)'A(x - \mu)}{2}\right], \quad -\infty < x_i < \infty,\ i = 1, 2, \ldots, n,$$
is a joint p.d.f. of $n$ random variables $X_1, X_2, \ldots, X_n$ that are of the continuous type. Thus we need to show that
$$(1) \qquad \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n = 1.$$
Let $t$ denote the $n \times 1$ matrix such that $t' = [t_1, t_2, \ldots, t_n]$, where $t_1, t_2, \ldots, t_n$ are arbitrary real numbers. We shall evaluate the integral
$$(2) \qquad \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} C \exp\left[t'x - \frac{(x - \mu)'A(x - \mu)}{2}\right] dx_1 \cdots dx_n,$$
and then we shall subsequently set $t_1 = t_2 = \cdots = t_n = 0$, and thus establish Equation (1).

First we change the variables of integration in integral (2) from $x_1, x_2, \ldots, x_n$ to $y_1, y_2, \ldots, y_n$ by writing $x - \mu = y$, where $y' = [y_1, y_2, \ldots, y_n]$. The Jacobian of the transformation is one and the $n$-dimensional $x$-space is mapped onto an $n$-dimensional $y$-space, so that integral (2) may be written as
$$(3) \qquad C \exp(t'\mu) \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left(t'y - \frac{y'Ay}{2}\right) dy_1 \cdots dy_n.$$
Because the real symmetric matrix $A$ is positive definite, the $n$ characteristic numbers (proper values, latent roots, or eigenvalues) $a_1, a_2, \ldots, a_n$ of $A$ are positive. There exists an appropriately chosen $n \times n$ real orthogonal matrix $L$ ($L' = L^{-1}$, where $L^{-1}$ is the inverse of $L$) such that
$$L'AL = \operatorname{diag}[a_1, a_2, \ldots, a_n]$$
for a suitable ordering of $a_1, a_2, \ldots, a_n$. In integral (3), we shall change the variables of integration from $y_1, y_2, \ldots, y_n$ to $z_1, z_2, \ldots, z_n$ by writing $y = Lz$, where $z' = [z_1, z_2, \ldots, z_n]$. The Jacobian of the transformation is the determinant of the orthogonal matrix $L$. Since $L'L = I_n$, where $I_n$ is the unit matrix of order $n$, we have the determinant $|L'L| = 1$ and $|L|^2 = 1$. Thus the absolute value of the Jacobian is one. Moreover, the $n$-dimensional $y$-space is mapped onto an $n$-dimensional $z$-space. The integral (3) becomes
$$(4) \qquad C \exp(t'\mu) \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left[t'Lz - \frac{z'(L'AL)z}{2}\right] dz_1 \cdots dz_n.$$
It is computationally convenient to write, momentarily, $t'L = w'$, where $w' = [w_1, w_2, \ldots, w_n]$. Then
$$\exp[t'Lz] = \exp[w'z] = \exp\left(\sum_{i=1}^{n} w_i z_i\right).$$
Moreover,
$$\exp\left[-\frac{z'(L'AL)z}{2}\right] = \exp\left(-\sum_{i=1}^{n} \frac{a_i z_i^2}{2}\right).$$
Then integral (4) may be written as the product of $n$ integrals in the following manner:
$$(5) \qquad C \exp(w'L'\mu) \prod_{i=1}^{n} \left[\int_{-\infty}^{\infty} \exp\left(w_i z_i - \frac{a_i z_i^2}{2}\right) dz_i\right]
= C \exp(w'L'\mu) \prod_{i=1}^{n} \left[\sqrt{\frac{2\pi}{a_i}} \int_{-\infty}^{\infty} \frac{\exp\left(w_i z_i - \dfrac{a_i z_i^2}{2}\right)}{\sqrt{2\pi/a_i}}\, dz_i\right].$$
The integral that involves $z_i$ can be treated as the moment-generating function, with the more familiar symbol $t$ replaced by $w_i$, of a distribution which is $n(0, 1/a_i)$. Thus the right-hand member of Equation (5) is equal to
$$(6) \qquad C \exp(w'L'\mu) \prod_{i=1}^{n} \left[\sqrt{\frac{2\pi}{a_i}} \exp\left(\frac{w_i^2}{2a_i}\right)\right]
= C \exp(w'L'\mu) \sqrt{\frac{(2\pi)^n}{a_1 a_2 \cdots a_n}} \exp\left(\sum_{i=1}^{n} \frac{w_i^2}{2a_i}\right).$$
Now, because $L^{-1} = L'$, we have
$$\sum_{i=1}^{n} \frac{w_i^2}{a_i} = w'(L'A^{-1}L)w = (Lw)'A^{-1}(Lw) = t'A^{-1}t.$$
Moreover, the determinant $|A^{-1}|$ of $A^{-1}$ is
$$|A^{-1}| = |L'A^{-1}L| = \frac{1}{a_1 a_2 \cdots a_n}.$$
Accordingly, the right-hand member of Equation (6), which is equal to integral (2), may be written as
$$(7) \qquad C e^{t'\mu} \sqrt{(2\pi)^n |A^{-1}|}\, \exp\left(\frac{t'A^{-1}t}{2}\right).$$
If, in this function, we set $t_1 = t_2 = \cdots = t_n = 0$, we have the value of the left-hand member of Equation (1). Thus we have
$$C \sqrt{(2\pi)^n |A^{-1}|} = 1.$$
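This value of $C$ can be checked numerically; the following Python sketch (ours, with an $A$ and $\mu$ chosen arbitrarily for $n = 2$) simply integrates the function of Equation (1) and confirms that the result is approximately one.

```python
import numpy as np
from scipy import integrate

# Numerical check of Equation (1) for n = 2 with
# C = 1 / sqrt((2*pi)^n |A^{-1}|) and an arbitrary positive definite A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # real, symmetric, positive definite
mu = np.array([1.0, -1.0])
n = 2
C = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(np.linalg.inv(A)))

def f(x2, x1):
    # dblquad passes the inner variable (x2) first.
    d = np.array([x1, x2]) - mu
    return C * np.exp(-0.5 * d @ A @ d)

total, _ = integrate.dblquad(f, -np.inf, np.inf,
                             lambda x1: -np.inf, lambda x1: np.inf)
print(total)   # approximately 1.0
```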
Accordingly, the function
$$f(x_1, x_2, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\sqrt{|A^{-1}|}} \exp\left[-\frac{(x - \mu)'A(x - \mu)}{2}\right], \quad -\infty < x_i < \infty,\ i = 1, 2, \ldots, n,$$
is a joint p.d.f. of $n$ random variables $X_1, X_2, \ldots, X_n$ that are of the continuous type. Such a p.d.f. is called a nonsingular multivariate normal p.d.f.

We have now proved that $f(x_1, x_2, \ldots, x_n)$ is a p.d.f. However, we have proved more than that. Because $f(x_1, x_2, \ldots, x_n)$ is a p.d.f., integral (2) is the moment-generating function $M(t_1, t_2, \ldots, t_n)$ of this joint distribution of probability. Since integral (2) is equal to function (7), the moment-generating function of the multivariate normal distribution is given by
$$\exp\left(t'\mu + \frac{t'A^{-1}t}{2}\right).$$
Let the elements of the real, symmetric, and positive definite matrix $A^{-1}$ be denoted by $\sigma_{ij}$, $i, j = 1, 2, \ldots, n$. Then
$$M(0, \ldots, 0, t_i, 0, \ldots, 0) = \exp\left(t_i \mu_i + \frac{\sigma_{ii} t_i^2}{2}\right)$$
is the moment-generating function of $X_i$, $i = 1, 2, \ldots, n$. Thus $X_i$ is $n(\mu_i, \sigma_{ii})$, $i = 1, 2, \ldots, n$. Moreover, with $i \ne j$, we see that $M(0, \ldots, 0, t_i, 0, \ldots, 0, t_j, 0, \ldots, 0)$, the moment-generating function of $X_i$ and $X_j$, is equal to
$$\exp\left(t_i \mu_i + t_j \mu_j + \frac{\sigma_{ii} t_i^2 + 2\sigma_{ij} t_i t_j + \sigma_{jj} t_j^2}{2}\right).$$
But this is the moment-generating function of a bivariate normal distribution, so that $\sigma_{ij}$ is the covariance of the random variables $X_i$ and $X_j$. Thus the matrix $\mu$, where $\mu' = [\mu_1, \mu_2, \ldots, \mu_n]$, is the matrix of the means of the random variables $X_1, \ldots, X_n$. Moreover, the elements on the principal diagonal of $A^{-1}$ are, respectively, the variances $\sigma_{ii} = \sigma_i^2$, $i = 1, 2, \ldots, n$, and the elements not on the principal diagonal of $A^{-1}$ are, respectively, the covariances $\sigma_{ij} = \rho_{ij}\sigma_i\sigma_j$, $i \ne j$, of the random variables $X_1, X_2, \ldots, X_n$. We call the matrix $A^{-1}$, which is given by
$$A^{-1} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{12} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{1n} & \sigma_{2n} & \cdots & \sigma_{nn} \end{bmatrix},$$
the covariance matrix of the multivariate normal distribution and henceforth we shall denote this matrix by the symbol $V$. In terms of the positive definite covariance matrix $V$, the multivariate normal p.d.f. is written
$$\frac{1}{(2\pi)^{n/2}\sqrt{|V|}} \exp\left[-\frac{(x - \mu)'V^{-1}(x - \mu)}{2}\right], \quad -\infty < x_i < \infty,\ i = 1, 2, \ldots, n,$$
and the moment-generating function of this distribution is given by
$$(8) \qquad \exp\left(t'\mu + \frac{t'Vt}{2}\right)$$
for all real values of $t$.

Example 1. Let $X_1, X_2, \ldots, X_n$ have a multivariate normal distribution with matrix $\mu$ of means and positive definite covariance matrix $V$. If we let $X' = [X_1, X_2, \ldots, X_n]$, then the moment-generating function $M(t_1, t_2, \ldots, t_n)$ of this joint distribution of probability is
$$E(e^{t'X}) = \exp\left(t'\mu + \frac{t'Vt}{2}\right).$$
Consider a linear function $Y$ of $X_1, X_2, \ldots, X_n$ which is defined by $Y = c'X = \sum_{i=1}^{n} c_i X_i$, where $c' = [c_1, c_2, \ldots, c_n]$ and the several $c_i$ are real and not all zero. We wish to find the p.d.f. of $Y$. The moment-generating function $M(t)$ of the distribution of $Y$ is given by $M(t) = E(e^{tY}) = E(e^{tc'X})$. Now the expectation (8) exists for all real values of $t$. Thus we can replace $t'$ in expectation (8) by $tc'$ and obtain
$$M(t) = \exp\left(tc'\mu + \frac{c'Vc\, t^2}{2}\right).$$
Thus the random variable $Y$ is $n(c'\mu, c'Vc)$.
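A brief simulation sketch of Example 1 (ours, with an arbitrarily chosen $\mu$, $V$, and $c$) illustrates that $Y = c'X$ has mean $c'\mu$ and variance $c'Vc$:

```python
import numpy as np

# Simulation check of Example 1: Y = c'X is n(c'mu, c'Vc).
rng = np.random.default_rng(1)
mu = np.array([2.0, -1.0, 0.0])
V = np.array([[4.0, 1.0, 0.5],
              [1.0, 2.0, 0.3],
              [0.5, 0.3, 1.0]])       # positive definite covariance matrix
c = np.array([1.0, -2.0, 3.0])

X = rng.multivariate_normal(mu, V, size=200_000)   # each row is one realization of X'
Y = X @ c

print(Y.mean(), c @ mu)      # sample mean of Y versus c'mu
print(Y.var(), c @ V @ c)    # sample variance of Y versus c'Vc
```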
EXERCISES

12.1. Let $X_1, X_2, \ldots, X_n$ have a multivariate normal distribution with positive definite covariance matrix $V$. Prove that these random variables are mutually stochastically independent if and only if $V$ is a diagonal matrix.

12.2. Let $n = 2$ and take
$$V = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}.$$
Determine $|V|$, $V^{-1}$, and $(x - \mu)'V^{-1}(x - \mu)$. Compare the bivariate normal p.d.f. with the multivariate normal p.d.f. when $n = 2$.

12.3. Let $X_1, X_2, \ldots, X_n$ have a multivariate normal distribution, where $\mu$ is the matrix of the means and $V$ is the positive definite covariance matrix. Let $Y = c'X$ and $Z = d'X$, where $X' = [X_1, \ldots, X_n]$, $c' = [c_1, \ldots, c_n]$, and $d' = [d_1, \ldots, d_n]$ are real matrices. (a) Find $M(t_1, t_2) = E(e^{t_1 Y + t_2 Z})$ to see that $Y$ and $Z$ have a bivariate normal distribution. (b) Prove that $Y$ and $Z$ are stochastically independent if and only if $c'Vd = 0$. (c) If $X_1, X_2, \ldots, X_n$ are mutually stochastically independent random variables which have the same variance $\sigma^2$, show that the necessary and sufficient condition of part (b) becomes $c'd = 0$.

12.4. Let $X' = [X_1, X_2, \ldots, X_n]$ have the multivariate normal distribution of Exercise 12.3. Consider the $p$ linear functions of $X_1, \ldots, X_n$ defined by $W = BX$, where $W' = [W_1, \ldots, W_p]$, $p \le n$, and $B$ is a $p \times n$ real matrix of rank $p$. Find $M(v_1, \ldots, v_p) = E(e^{v'W})$, where $v'$ is the real matrix $[v_1, \ldots, v_p]$, to see that $W_1, \ldots, W_p$ have a $p$-variate normal distribution which has $B\mu$ for the matrix of the means and $BVB'$ for the covariance matrix.

12.5. Let $X' = [X_1, X_2, \ldots, X_n]$ have the $n$-variate normal distribution of Exercise 12.3. Show that $X_1, X_2, \ldots, X_p$, $p < n$, have a $p$-variate normal distribution. What submatrix of $V$ is the covariance matrix of $X_1, X_2, \ldots, X_p$? Hint. In the moment-generating function $M(t_1, t_2, \ldots, t_n)$ of $X_1, X_2, \ldots, X_n$, let $t_{p+1} = \cdots = t_n = 0$.

12.2 The Distributions of Certain Quadratic Forms

Let $X_i$, $i = 1, 2, \ldots, n$, denote mutually stochastically independent random variables which are $n(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$, respectively. Then $Q = \sum_{i=1}^{n} (X_i - \mu_i)^2/\sigma_i^2$ is $\chi^2(n)$. Now $Q$ is a quadratic form in the $X_i - \mu_i$, and $Q$ is seen to be, apart from the coefficient $-\tfrac{1}{2}$, the random variable which is defined by the exponent on the number $e$ in the joint p.d.f. of $X_1, X_2, \ldots, X_n$. We shall now show that this result can be generalized.

Let $X_1, X_2, \ldots, X_n$ have a multivariate normal distribution with p.d.f.
$$\frac{1}{(2\pi)^{n/2}\sqrt{|V|}} \exp\left[-\frac{(x - \mu)'V^{-1}(x - \mu)}{2}\right],$$
where, as usual, the covariance matrix $V$ is positive definite. We shall show that the random variable $Q$ (a quadratic form in the $X_i - \mu_i$), which is defined by $(x - \mu)'V^{-1}(x - \mu)$, is $\chi^2(n)$. We have for the moment-generating function $M(t)$ of $Q$ the integral
$$M(t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{(2\pi)^{n/2}\sqrt{|V|}} \exp\left[t(x - \mu)'V^{-1}(x - \mu) - \frac{(x - \mu)'V^{-1}(x - \mu)}{2}\right] dx_1 \cdots dx_n.$$
With $V^{-1}$ positive definite, the integral is seen to exist for all real values of $t < \tfrac{1}{2}$. Moreover, $(1 - 2t)V^{-1}$, $t < \tfrac{1}{2}$, is a positive definite matrix and, since $|(1 - 2t)V^{-1}| = (1 - 2t)^n |V^{-1}|$, it follows that
$$\frac{(1 - 2t)^{n/2}}{(2\pi)^{n/2}\sqrt{|V|}} \exp\left[-\frac{(x - \mu)'(1 - 2t)V^{-1}(x - \mu)}{2}\right]$$
can be treated as a multivariate normal p.d.f. If we multiply our integrand by $(1 - 2t)^{n/2}$, we have this multivariate p.d.f. Thus the moment-generating function of $Q$ is given by
$$M(t) = \frac{1}{(1 - 2t)^{n/2}}, \quad t < \tfrac{1}{2},$$
and $Q$ is $\chi^2(n)$, as we wished to show. This fact is the basis of the chi-square tests that were discussed in Chapter 8.

The remarkable fact that the random variable which is defined by $(x - \mu)'V^{-1}(x - \mu)$ is $\chi^2(n)$ stimulates a number of questions about quadratic forms in normally distributed variables. We would like to treat this problem in complete generality, but limitations of space forbid this, and we find it necessary to restrict ourselves to some special cases.

Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$, $\sigma^2 > 0$. Let $X' = [X_1, X_2, \ldots, X_n]$ and let $A$ denote an arbitrary $n \times n$ real symmetric matrix.
We shall investigate the distribution of the quadratic form $X'AX$. For instance, we know that $X'I_nX/\sigma^2 = X'X/\sigma^2 = \sum_{i=1}^{n} X_i^2/\sigma^2$ is $\chi^2(n)$. First we shall find the moment-generating function of $X'AX/\sigma^2$. Then we shall investigate
the conditions which must be imposed upon the real symmetric matrix $A$ if $X'AX/\sigma^2$ is to have a chi-square distribution. This moment-generating function is given by
$$M(t) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left(\frac{t x'Ax}{\sigma^2} - \frac{x'x}{2\sigma^2}\right) dx_1 \cdots dx_n
= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left[-\frac{x'(I - 2tA)x}{2\sigma^2}\right] dx_1 \cdots dx_n,$$
where $I = I_n$. The matrix $I - 2tA$ is positive definite if we take $|t|$ sufficiently small, say $|t| < h$, $h > 0$. Moreover, we can treat
$$\frac{1}{(2\pi)^{n/2}\sqrt{|(I - 2tA)^{-1}\sigma^2|}} \exp\left[-\frac{x'(I - 2tA)x}{2\sigma^2}\right]$$
as a multivariate normal p.d.f. Now $|(I - 2tA)^{-1}\sigma^2|^{1/2} = \sigma^n / |I - 2tA|^{1/2}$. If we multiply our integrand by $|I - 2tA|^{1/2}$, we have this multivariate p.d.f. Hence the moment-generating function of $X'AX/\sigma^2$ is given by
$$(1) \qquad M(t) = |I - 2tA|^{-1/2}, \quad |t| < h.$$
It proves useful to express this moment-generating function in a different form. To do this, let $a_1, a_2, \ldots, a_n$ denote the characteristic numbers of $A$ and let $L$ denote an $n \times n$ orthogonal matrix such that $L'AL = \operatorname{diag}[a_1, a_2, \ldots, a_n]$. Then
$$|I - 2tA| = |L'(I - 2tA)L| = \prod_{i=1}^{n} (1 - 2t a_i).$$
Accordingly we can write $M(t)$, as given in Equation (1), in the form
$$(2) \qquad M(t) = \prod_{i=1}^{n} (1 - 2t a_i)^{-1/2}, \quad |t| < h.$$
Let $r$, $0 < r \le n$, denote the rank of the real symmetric matrix $A$. Then exactly $r$ of the real numbers $a_1, a_2, \ldots, a_n$, say $a_1, \ldots, a_r$, are not zero and exactly $n - r$ of these numbers, say $a_{r+1}, \ldots, a_n$, are zero. Thus we can write the moment-generating function of $X'AX/\sigma^2$ as
$$M(t) = [(1 - 2ta_1)(1 - 2ta_2)\cdots(1 - 2ta_r)]^{-1/2}.$$
Now that we have found, in suitable form, the moment-generating function of our random variable, let us turn to the question of the conditions that must be imposed if $X'AX/\sigma^2$ is to have a chi-square distribution. Assume that $X'AX/\sigma^2$ is $\chi^2(k)$. Then
$$M(t) = [(1 - 2ta_1)(1 - 2ta_2)\cdots(1 - 2ta_r)]^{-1/2} = (1 - 2t)^{-k/2},$$
or, equivalently,
$$(1 - 2ta_1)(1 - 2ta_2)\cdots(1 - 2ta_r) = (1 - 2t)^k, \quad |t| < h.$$
Because the positive integers $r$ and $k$ are the degrees of these polynomials, and because these polynomials are equal for infinitely many values of $t$, we have $k = r$, the rank of $A$. Moreover, the uniqueness of the factorization of a polynomial implies that $a_1 = a_2 = \cdots = a_r = 1$. If each of the nonzero characteristic numbers of a real symmetric matrix is one, the matrix is idempotent, that is, $A^2 = A$, and conversely (see Exercise 12.7). Accordingly, if $X'AX/\sigma^2$ has a chi-square distribution, then $A^2 = A$ and the random variable is $\chi^2(r)$, where $r$ is the rank of $A$. Conversely, if $A$ is of rank $r$, $0 < r \le n$, and if $A^2 = A$, then $A$ has exactly $r$ characteristic numbers that are equal to one, and the remaining $n - r$ characteristic numbers are equal to zero. Thus the moment-generating function of $X'AX/\sigma^2$ is given by $(1 - 2t)^{-r/2}$, $t < \tfrac{1}{2}$, and $X'AX/\sigma^2$ is $\chi^2(r)$. This establishes the following theorem.

Theorem 1. Let $Q$ denote a random variable which is a quadratic form in the items of a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$. Let $A$ denote the symmetric matrix of $Q$ and let $r$, $0 < r \le n$, denote the rank of $A$. Then $Q/\sigma^2$ is $\chi^2(r)$ if and only if $A^2 = A$.

Remark. If the normal distribution in Theorem 1 is $n(\mu, \sigma^2)$, the condition $A^2 = A$ remains a necessary and sufficient condition that $Q/\sigma^2$ have a chi-square distribution. In general, however, $Q/\sigma^2$ is not $\chi^2(r)$ but, instead, $Q/\sigma^2$ has a noncentral chi-square distribution if $A^2 = A$. The number of degrees of freedom is $r$, the rank of $A$, and the noncentrality parameter is $\mu'A\mu/\sigma^2$, where $\mu' = [\mu, \mu, \ldots, \mu]$.
Since $\mu'A\mu = \mu^2 \sum_i \sum_j a_{ij}$, where $A = [a_{ij}]$, then, if $\mu \ne 0$, the conditions $A^2 = A$ and $\sum_i \sum_j a_{ij} = 0$ are necessary and sufficient conditions that $Q/\sigma^2$ be central $\chi^2(r)$. Moreover, the theorem may be extended to a quadratic form in random variables which have a multivariate normal distribution with positive definite covariance matrix $V$; here the necessary and sufficient condition that $Q$ have a chi-square distribution is $AVA = A$.
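Theorem 1 is easy to illustrate by simulation. The Python sketch below (ours) uses the rank-one idempotent matrix $A = (1/n)P$, $P$ the $n \times n$ matrix of ones, so that $X'AX/\sigma^2 = n\bar{X}^2/\sigma^2$ should behave like $\chi^2(1)$; the variable names are our own.

```python
import numpy as np
from scipy import stats

# Illustration of Theorem 1 with A = (1/n)P, an idempotent matrix of rank 1,
# for which X'AX / sigma^2 = n*Xbar^2 / sigma^2 should be chi-square(1).
rng = np.random.default_rng(2)
n, sigma = 5, 3.0
A = np.ones((n, n)) / n
print(np.allclose(A @ A, A))                      # A is idempotent

X = rng.normal(0.0, sigma, size=(100_000, n))     # each row is a random sample
Q = np.einsum("ij,jk,ik->i", X, A, X) / sigma**2  # X'AX / sigma^2 for every row

# Compare a few empirical quantiles with those of chi-square(1).
for p in (0.5, 0.9, 0.99):
    print(p, np.quantile(Q, p), stats.chi2.ppf(p, df=1))
```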
EXERCISES

12.6. Let $Q = X_1X_2 - X_3X_4$, where $X_1, X_2, X_3, X_4$ is a random sample of size 4 from a distribution which is $n(0, \sigma^2)$. Show that $Q/\sigma^2$ does not have a chi-square distribution. Find the moment-generating function of $Q/\sigma^2$.

12.7. Let $A$ be a real symmetric matrix. Prove that each of the nonzero characteristic numbers of $A$ is equal to one if and only if $A^2 = A$. Hint. Let $L$ be an orthogonal matrix such that $L'AL = \operatorname{diag}[a_1, a_2, \ldots, a_n]$ and note that $A$ is idempotent if and only if $L'AL$ is idempotent.

12.8. The sum of the elements on the principal diagonal of a square matrix $A$ is called the trace of $A$ and is denoted by $\operatorname{tr} A$. (a) If $B$ is $n \times m$ and $C$ is $m \times n$, prove that $\operatorname{tr}(BC) = \operatorname{tr}(CB)$. (b) If $A$ is a square matrix and if $L$ is an orthogonal matrix, use the result of part (a) to show that $\operatorname{tr}(L'AL) = \operatorname{tr} A$. (c) If $A$ is a real symmetric idempotent matrix, use the result of part (b) to prove that the rank of $A$ is equal to $\operatorname{tr} A$.

12.9. Let $A = [a_{ij}]$ be a real symmetric matrix. Prove that $\sum_i \sum_j a_{ij}^2$ is equal to the sum of the squares of the characteristic numbers of $A$. Hint. If $L$ is an orthogonal matrix, show that $\sum_i \sum_j a_{ij}^2 = \operatorname{tr}(A^2) = \operatorname{tr}(L'A^2L) = \operatorname{tr}[(L'AL)(L'AL)]$.

12.10. Let $\bar{X}$ and $S^2$ denote, respectively, the mean and the variance of a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$. (a) If $A$ denotes the symmetric matrix of $n\bar{X}^2$, show that $A = (1/n)P$, where $P$ is the $n \times n$ matrix, each of whose elements is equal to one. (b) Demonstrate that $A$ is idempotent and that $\operatorname{tr} A = 1$. Thus $n\bar{X}^2/\sigma^2$ is $\chi^2(1)$. (c) Show that the symmetric matrix $B$ of $nS^2$ is $I - (1/n)P$. (d) Demonstrate that $B$ is idempotent and that $\operatorname{tr} B = n - 1$. Thus $nS^2/\sigma^2$ is $\chi^2(n - 1)$, as previously proved otherwise. (e) Show that the product matrix $AB$ is the zero matrix.

12.3 The Independence of Certain Quadratic Forms

We have previously investigated the stochastic independence of linear functions of normally distributed variables (see Exercise 12.3). In this section we shall prove some theorems about the stochastic independence of quadratic forms. As we remarked on p. 411, we shall confine our attention to normally distributed variables that constitute a random sample of size $n$ from a distribution that is $n(0, \sigma^2)$.

Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$. Let $A$ and $B$ denote two real symmetric matrices, each of order $n$. Let $X' = [X_1, X_2, \ldots, X_n]$ and consider the two quadratic forms $X'AX$ and $X'BX$. We wish to show that these quadratic forms are stochastically independent if and only if $AB = 0$, the zero matrix. We shall first compute the moment-generating function $M(t_1, t_2)$ of the joint distribution of $X'AX/\sigma^2$ and $X'BX/\sigma^2$. We have
$$M(t_1, t_2) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left[-\frac{x'(I - 2t_1A - 2t_2B)x}{2\sigma^2}\right] dx_1 \cdots dx_n.$$
The matrix $I - 2t_1A - 2t_2B$ is positive definite if we take $|t_1|$ and $|t_2|$ sufficiently small, say $|t_1| < h_1$, $|t_2| < h_2$, where $h_1, h_2 > 0$. Then, as on p. 412, we have
$$M(t_1, t_2) = |I - 2t_1A - 2t_2B|^{-1/2}, \quad |t_1| < h_1,\ |t_2| < h_2.$$
Let us assume that $X'AX/\sigma^2$ and $X'BX/\sigma^2$ are stochastically independent (so that likewise are $X'AX$ and $X'BX$) and prove that $AB = 0$. Thus we assume that
$$(1) \qquad M(t_1, t_2) = M(t_1, 0)\,M(0, t_2)$$
for all $t_1$ and $t_2$ for which $|t_i| < h_i$, $i = 1, 2$. Identity (1) is equivalent to the identity
$$(2) \qquad |I - 2t_1A - 2t_2B| = |I - 2t_1A|\,|I - 2t_2B|.$$
Let $r > 0$ denote the rank of $A$ and let $a_1, a_2, \ldots, a_r$ denote the $r$ nonzero characteristic numbers of $A$. There exists an orthogonal matrix $L$ such that
$$L'AL = \begin{bmatrix} C_{11} & 0 \\ 0 & 0 \end{bmatrix}, \qquad C_{11} = \operatorname{diag}[a_1, a_2, \ldots, a_r],$$
for a suitable ordering of $a_1, a_2, \ldots, a_r$. Then $L'BL$ may be written in the identically partitioned form
$$L'BL = D = \begin{bmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{bmatrix}.$$
The identity (2) may be written as
$$(2') \qquad |L'|\,|I - 2t_1A - 2t_2B|\,|L| = |L'|\,|I - 2t_1A|\,|L|\;|L'|\,|I - 2t_2B|\,|L|,$$
or as
$$(3) \qquad \begin{vmatrix} I_r - 2t_1C_{11} - 2t_2D_{11} & -2t_2D_{12} \\ -2t_2D_{21} & I_{n-r} - 2t_2D_{22} \end{vmatrix} = \prod_{i=1}^{r} (1 - 2t_1a_i)\,|I_n - 2t_2D|.$$
The coefficient of $(-2t_1)^r$ in the right-hand member of Equation (3) is seen by inspection to be $a_1a_2\cdots a_r |I - 2t_2D|$. It is not so easy to find the coefficient of $(-2t_1)^r$ in the left-hand member of Equation (3). Conceive of expanding this determinant in terms of minors of order $r$ formed from the first $r$ columns. One term in this expansion is the product of the minor of order $r$ in the upper left-hand corner, namely $|I_r - 2t_1C_{11} - 2t_2D_{11}|$, and the minor of order $n - r$ in the lower right-hand corner, namely $|I_{n-r} - 2t_2D_{22}|$. Moreover, this product is the only term in the expansion of the determinant that involves $(-2t_1)^r$. Thus the coefficient of $(-2t_1)^r$ in the left-hand member of Equation (3) is $a_1a_2\cdots a_r |I_{n-r} - 2t_2D_{22}|$. If we equate these coefficients of $(-2t_1)^r$, we have, for all $t_2$, $|t_2| < h_2$,
$$(4) \qquad |I_n - 2t_2D| = |I_{n-r} - 2t_2D_{22}|.$$
Equation (4) implies that the nonzero characteristic numbers of the matrices $D$ and $D_{22}$ are the same (see Exercise 12.17). Recall that the sum of the squares of the characteristic numbers of a symmetric matrix is equal to the sum of the squares of the elements of that matrix (see Exercise 12.9). Thus the sum of the squares of the elements of matrix $D$ is equal to the sum of the squares of the elements of $D_{22}$. Since the elements of the matrix $D$ are real, it follows that each of the elements of $D_{11}$, $D_{12}$, and $D_{21}$ is zero. Accordingly, we can write $D$ in the form
$$D = \begin{bmatrix} 0 & 0 \\ 0 & D_{22} \end{bmatrix}.$$
Thus $CD = (L'AL)(L'BL) = 0$; hence $L'ABL = 0$ and $AB = 0$, as we wished to prove.

To complete the proof of the theorem, we assume that $AB = 0$. We are to show that $X'AX/\sigma^2$ and $X'BX/\sigma^2$ are stochastically independent. We have, for all real values of $t_1$ and $t_2$,
$$(I - 2t_1A)(I - 2t_2B) = I - 2t_1A - 2t_2B + 4t_1t_2AB = I - 2t_1A - 2t_2B,$$
since $AB = 0$. Thus
$$|I - 2t_1A - 2t_2B| = |I - 2t_1A|\,|I - 2t_2B|.$$
Since the moment-generating function of the joint distribution of $X'AX/\sigma^2$ and $X'BX/\sigma^2$ is given by
$$M(t_1, t_2) = |I - 2t_1A - 2t_2B|^{-1/2}, \quad |t_i| < h_i,\ i = 1, 2,$$
we have $M(t_1, t_2) = M(t_1, 0)\,M(0, t_2)$, and the proof of the following theorem is complete.

Theorem 2. Let $Q_1$ and $Q_2$ denote random variables which are quadratic forms in the items of a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$. Let $A$ and $B$ denote respectively the real symmetric matrices of $Q_1$ and $Q_2$. The random variables $Q_1$ and $Q_2$ are stochastically independent if and only if $AB = 0$.

Remark. Theorem 2 remains valid if the random sample is from a distribution which is $n(\mu, \sigma^2)$, whatever be the real value of $\mu$. Moreover, Theorem 2 may be extended to quadratic forms in random variables that have a joint multivariate normal distribution with a positive definite covariance matrix $V$. The necessary and sufficient condition for the stochastic independence of two such quadratic forms with symmetric matrices $A$ and $B$ then becomes $AVB = 0$. In our Theorem 2, we have $V = \sigma^2 I$, so that $AVB = A\sigma^2 IB = \sigma^2 AB = 0$.
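A small illustration of Theorem 2 (ours), using the matrices of Exercise 12.10, is sketched below in Python. With $A = (1/n)P$ and $B = I - (1/n)P$ we have $AB = 0$, so $X'AX$ and $X'BX$ are stochastically independent; the simulation only checks the weaker fact that they are uncorrelated.

```python
import numpy as np

# Theorem 2 with A = (1/n)P and B = I - (1/n)P (Exercise 12.10): AB = 0.
rng = np.random.default_rng(3)
n, sigma = 8, 2.0
P = np.ones((n, n))
A = P / n
B = np.eye(n) - P / n
print(np.allclose(A @ B, 0.0))                  # AB is the zero matrix

X = rng.normal(0.0, sigma, size=(100_000, n))
Q1 = np.einsum("ij,jk,ik->i", X, A, X)          # X'AX = n*Xbar^2 for each sample
Q2 = np.einsum("ij,jk,ik->i", X, B, X)          # X'BX = sum of squared deviations

# Independence is what the theorem asserts; as a crude check, the sample
# correlation of Q1 and Q2 should be near zero.
print(np.corrcoef(Q1, Q2)[0, 1])
```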
We shall next prove the theorem that was used in Chapter 8 (p. 279).

Theorem 3. Let $Q = Q_1 + \cdots + Q_{k-1} + Q_k$, where $Q, Q_1, \ldots, Q_{k-1}, Q_k$ are $k + 1$ random variables that are quadratic forms in the items of a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$. Let $Q/\sigma^2$ be $\chi^2(r)$, let $Q_i/\sigma^2$ be $\chi^2(r_i)$, $i = 1, 2, \ldots, k - 1$, and let $Q_k$ be nonnegative. Then the random variables $Q_1, Q_2, \ldots, Q_k$ are mutually stochastically independent and, hence, $Q_k/\sigma^2$ is $\chi^2(r_k = r - r_1 - \cdots - r_{k-1})$.

Proof. Take first the case of $k = 2$ and let the real symmetric matrices of $Q$, $Q_1$, and $Q_2$ be denoted, respectively, by $A$, $A_1$, $A_2$. We are given that $Q = Q_1 + Q_2$ or, equivalently, that $A = A_1 + A_2$. We are also given that $Q/\sigma^2$ is $\chi^2(r)$ and that $Q_1/\sigma^2$ is $\chi^2(r_1)$. In accordance with Theorem 1, p. 413, we have $A^2 = A$ and $A_1^2 = A_1$. Since $Q_2 \ge 0$, each of the matrices $A$, $A_1$, and $A_2$ is positive semidefinite. Because $A^2 = A$, we can find an orthogonal matrix $L$ such that
$$L'AL = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}.$$
If then we multiply both members of $A = A_1 + A_2$ on the left by $L'$ and on the right by $L$, we have
$$\begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} = L'A_1L + L'A_2L.$$
Now each of $A_1$ and $A_2$, and hence each of $L'A_1L$ and $L'A_2L$, is positive semidefinite. Recall that, if a real symmetric matrix is positive semidefinite, each element on the principal diagonal is positive or zero. Moreover, if an element on the principal diagonal is zero, then all elements in that row and all elements in that column are zero. Thus $L'AL = L'A_1L + L'A_2L$ can be written as
$$(5) \qquad \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} = L'A_1L + L'A_2L,$$
where every element of $L'A_1L$ and of $L'A_2L$ outside the upper left-hand $r \times r$ block is zero. Since $A_1^2 = A_1$, we have $(L'A_1L)(L'A_1L) = L'A_1L$. If we multiply both members of Equation (5) on the left by the matrix $L'A_1L$, we see that
$$L'A_1L = (L'A_1L)(L'A_1L) + (L'A_1L)(L'A_2L),$$
or, equivalently, $L'A_1L = L'A_1L + (L'A_1L)(L'A_2L)$. Thus $(L'A_1L)(L'A_2L) = 0$ and $A_1A_2 = 0$. In accordance with Theorem 2, $Q_1$ and $Q_2$ are stochastically independent. This stochastic independence immediately implies that $Q_2/\sigma^2$ is $\chi^2(r_2 = r - r_1)$. This completes the proof when $k = 2$. For $k > 2$, the proof may be made by induction. We shall merely indicate how this can be done by using $k = 3$. Take $A = A_1 + A_2 + A_3$, where $A^2 = A$, $A_1^2 = A_1$, $A_2^2 = A_2$, and $A_3$ is positive semidefinite. Write $A = A_1 + (A_2 + A_3) = A_1 + B_1$, say. Now $A^2 = A$, $A_1^2 = A_1$, and $B_1$ is positive semidefinite. In accordance with the case of $k = 2$, we have $A_1B_1 = 0$, so that $B_1^2 = B_1$. With $B_1 = A_2 + A_3$, where $B_1^2 = B_1$ and $A_2^2 = A_2$, it follows from the case of $k = 2$ that $A_2A_3 = 0$ and $A_3^2 = A_3$. If we regroup by writing $A = A_2 + (A_1 + A_3)$, we obtain $A_1A_3 = 0$, and so on.

Remark. In our statement of Theorem 3 we took $X_1, X_2, \ldots, X_n$ to be items of a random sample from a distribution which is $n(0, \sigma^2)$. We did this because our proof of Theorem 2 was restricted to that case. In fact, if $Q', Q_1', \ldots, Q_k'$ are quadratic forms in any normal variables (including multivariate normal variables), if $Q' = Q_1' + \cdots + Q_k'$, if $Q', Q_1', \ldots, Q_{k-1}'$ are central or noncentral chi-square, and if $Q_k'$ is nonnegative, then $Q_1', \ldots, Q_k'$ are mutually stochastically independent and $Q_k'$ is either central or noncentral chi-square.

This section will conclude with a proof of a frequently quoted theorem due to Cochran.

Theorem 4. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution which is $n(0, \sigma^2)$. Let the sum of the squares of these items be written in the form
$$\sum_{i=1}^{n} X_i^2 = Q_1 + Q_2 + \cdots + Q_k,$$
where $Q_j$ is a quadratic form in $X_1, X_2, \ldots, X_n$, with matrix $A_j$ which has rank $r_j$, $j = 1, 2, \ldots, k$. The random variables $Q_1, Q_2, \ldots, Q_k$ are mutually stochastically independent and $Q_j/\sigma^2$ is $\chi^2(r_j)$, $j = 1, 2, \ldots, k$, if and only if $\sum_{j=1}^{k} r_j = n$.

Proof. First assume the two conditions $\sum_{j=1}^{k} r_j = n$ and $\sum_{i=1}^{n} X_i^2 = \sum_{j=1}^{k} Q_j$ to be satisfied. The latter equation implies that $I = A_1 + A_2 + \cdots + A_k$. Let $B_i = I - A_i$. That is, $B_i$ is the sum of the matrices $A_1, \ldots, A_k$ exclusive of $A_i$. Let $R_i$ denote the rank of $B_i$. Since the rank of the sum of several matrices is less than or equal to the sum of the ranks, we have $R_i \le \sum_{j=1}^{k} r_j - r_i = n - r_i$. However, $I = A_i + B_i$, so that $n \le r_i + R_i$ and $n - r_i \le R_i$. Hence $R_i = n - r_i$. The characteristic numbers of $B_i$ are the roots of the equation $|B_i - \lambda I| = 0$. Since $B_i = I - A_i$, this equation can be written as $|I - A_i - \lambda I| = 0$. Thus we have $|A_i - (1 - \lambda)I| = 0$. But each root of the last equation is one minus a characteristic number of $A_i$. Since $B_i$ has exactly $n - R_i = r_i$ characteristic numbers that are zero, then $A_i$ has exactly $r_i$ characteristic numbers that are equal to one. However, $r_i$ is the rank of $A_i$. Thus each of the $r_i$ nonzero characteristic numbers of $A_i$ is one.
That is, $A_i^2 = A_i$, and thus $Q_i/\sigma^2$ is $\chi^2(r_i)$, $i = 1, 2, \ldots, k$. In accordance with Theorem 3, the random variables $Q_1, Q_2, \ldots, Q_k$ are mutually stochastically independent.

To complete the proof of Theorem 4, take $\sum_{i=1}^{n} X_i^2 = Q_1 + Q_2 + \cdots + Q_k$, let $Q_1, Q_2, \ldots, Q_k$ be mutually stochastically independent, and let $Q_j/\sigma^2$ be $\chi^2(r_j)$, $j = 1, 2, \ldots, k$. Then $\sum_{j=1}^{k} Q_j/\sigma^2$ is $\chi^2\!\left(\sum_{j=1}^{k} r_j\right)$. But $\sum_{j=1}^{k} Q_j/\sigma^2 = \sum_{i=1}^{n} X_i^2/\sigma^2$ is $\chi^2(n)$. Thus $\sum_{j=1}^{k} r_j = n$ and the proof is complete.
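A simulation sketch of Cochran's theorem (ours) for the familiar decomposition $\sum X_i^2 = n\bar{X}^2 + \sum (X_i - \bar{X})^2$, whose matrices have ranks $1$ and $n - 1$, is given below in Python; the two pieces should behave like independent $\chi^2(1)$ and $\chi^2(n - 1)$ variables.

```python
import numpy as np
from scipy import stats

# Cochran's theorem for sum X_i^2 = n*Xbar^2 + sum (X_i - Xbar)^2, ranks 1 and n-1.
rng = np.random.default_rng(4)
n, sigma = 6, 1.5
X = rng.normal(0.0, sigma, size=(100_000, n))

xbar = X.mean(axis=1)
Q1 = n * xbar**2                                   # rank-1 quadratic form
Q2 = ((X - xbar[:, None])**2).sum(axis=1)          # rank-(n-1) quadratic form
assert np.allclose(Q1 + Q2, (X**2).sum(axis=1))    # the decomposition holds exactly

# Q1/sigma^2 and Q2/sigma^2 should behave like chi-square(1) and chi-square(n-1).
print(np.quantile(Q1 / sigma**2, 0.9), stats.chi2.ppf(0.9, df=1))
print(np.quantile(Q2 / sigma**2, 0.9), stats.chi2.ppf(0.9, df=n - 1))
```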
EXERCISES

12.11. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$. Prove that $\sum_{i=1}^{n} X_i^2$ and every quadratic form, which is nonidentically zero in $X_1, X_2, \ldots, X_n$, are stochastically dependent.

12.12. Let $X_1, X_2, X_3, X_4$ denote a random sample of size 4 from a distribution which is $n(0, \sigma^2)$. Let $Y = \sum_{i=1}^{4} a_i X_i$, where $a_1, a_2, a_3$, and $a_4$ are real constants. If $Y^2$ and $Q = X_1X_2 - X_3X_4$ are stochastically independent, determine $a_1, a_2, a_3$, and $a_4$.

12.13. Let $A$ be the real symmetric matrix of a quadratic form $Q$ in the items of a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$. Given that $Q$ and the mean $\bar{X}$ of the sample are stochastically independent, what can be said of the elements of each row (column) of $A$? Hint. Are $Q$ and $\bar{X}^2$ stochastically independent?

12.14. Let $A_1, A_2, \ldots, A_k$ be the matrices of $k > 2$ quadratic forms $Q_1, Q_2, \ldots, Q_k$ in the items of a random sample of size $n$ from a distribution which is $n(0, \sigma^2)$. Prove that the pairwise stochastic independence of these forms implies that they are mutually stochastically independent. Hint. Show that $A_iA_j = 0$, $i \ne j$, permits $E[\exp(t_1Q_1 + t_2Q_2 + \cdots + t_kQ_k)]$ to be written as a product of the moment-generating functions of $Q_1, Q_2, \ldots, Q_k$.

12.15. Let $X' = [X_1, X_2, \ldots, X_n]$, where $X_1, X_2, \ldots, X_n$ are items of a random sample from a distribution which is $n(0, \sigma^2)$. Let $b' = [b_1, b_2, \ldots, b_n]$ be a real nonzero matrix, and let $A$ be a real symmetric matrix of order $n$. Prove that the linear form $b'X$ and the quadratic form $X'AX$ are stochastically independent if and only if $b'A = 0$. Use this fact to prove that $b'X$ and $X'AX$ are stochastically independent if and only if the two quadratic forms, $(b'X)^2 = X'bb'X$ and $X'AX$, are stochastically independent.

12.16. Let $Q_1$ and $Q_2$ be two nonnegative quadratic forms in the items of a random sample from a distribution which is $n(0, \sigma^2)$. Show that another quadratic form $Q$ is stochastically independent of $Q_1 + Q_2$ if and only if $Q$ is stochastically independent of each of $Q_1$ and $Q_2$. Hint. Consider the orthogonal transformation that diagonalizes the matrix of $Q_1 + Q_2$. After this transformation, what are the forms of the matrices of $Q$, $Q_1$, and $Q_2$ if $Q$ and $Q_1 + Q_2$ are stochastically independent?

12.17. Prove that Equation (4) of this section implies that the nonzero characteristic numbers of the matrices $D$ and $D_{22}$ are the same. Hint. Let $\lambda = 1/(2t_2)$, $t_2 \ne 0$, and show that Equation (4) is equivalent to $|D - \lambda I| = (-\lambda)^r |D_{22} - \lambda I_{n-r}|$.

Appendix A
References

[1] Anderson, T. W., An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, Inc., New York, 1958.
[2] Basu, D., "On Statistics Independent of a Complete Sufficient Statistic," Sankhyā, 15, 377 (1955).
[3] Box, G. E. P., and Muller, M. E., "A Note on the Generation of Random Normal Deviates," Ann. Math. Stat., 29, 610 (1958).
[4] Carpenter, O., "Note on the Extension of Craig's Theorem to Non-central Variates," Ann. Math. Stat., 21, 455 (1950).
[5] Cochran, W. G., "The Distribution of Quadratic Forms in a Normal System, with Applications to the Analysis of Covariance," Proc. Cambridge Phil. Soc., 30, 178 (1934).
[6] Craig, A. T., "Bilinear Forms in Normally Correlated Variables," Ann. Math. Stat., 18, 565 (1947).
[7] Craig, A. T., "Note on the Independence of Certain Quadratic Forms," Ann. Math. Stat., 14, 195 (1943).
[8] Cramér, H., Mathematical Methods of Statistics, Princeton University Press, Princeton, N.J., 1946.
[9] Curtiss, J. H., "A Note on the Theory of Moment Generating Functions," Ann. Math.
Stat., 13, 430 (1942).
[10] Fisher, R. A., "On the Mathematical Foundations of Theoretical Statistics," Phil. Trans. Royal Soc. London, Series A, 222, 309 (1921).
[11] Fraser, D. A. S., Nonparametric Methods in Statistics, John Wiley & Sons, Inc., New York, 1957.
[12] Graybill, F. A., An Introduction to Linear Statistical Models, Vol. 1, McGraw-Hill Book Company, New York, 1961.
[13] Hogg, R. V., "Adaptive Robust Procedures: A Partial Review and Some
Suggestions for Future Applications and Theory," J. Amer. Stat. Assoc., 69, 909 (1974).
[14] Hogg, R. V., and Craig, A. T., "On the Decomposition of Certain Chi-Square Variables," Ann. Math. Stat., 29, 608 (1958).
[15] Hogg, R. V., and Craig, A. T., "Sufficient Statistics in Elementary Distribution Theory," Sankhyā, 17, 209 (1956).
[16] Huber, P., "Robust Statistics: A Review," Ann. Math. Stat., 43, 1041 (1972).
[17] Johnson, N. L., and Kotz, S., Continuous Univariate Distributions, Vols. 1 and 2, Houghton Mifflin Company, Boston, 1970.
[18] Koopman, B. O., "On Distributions Admitting a Sufficient Statistic," Trans. Amer. Math. Soc., 39, 399 (1936).
[19] Lancaster, H. O., "Traces and Cumulants of Quadratic Forms in Normal Variables," J. Royal Stat. Soc., Series B, 16, 247 (1954).
[20] Lehmann, E. L., Testing Statistical Hypotheses, John Wiley & Sons, Inc., New York, 1959.
[21] Lehmann, E. L., and Scheffé, H., "Completeness, Similar Regions, and Unbiased Estimation," Sankhyā, 10, 305 (1950).
[22] Lévy, P., Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris, 1937.
[23] Mann, H. B., and Whitney, D. R., "On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other," Ann. Math. Stat., 18, 50 (1947).
[24] Neyman, J., "Su un teorema concernente le cosiddette statistiche sufficienti," Giornale dell'Istituto degli Attuari, 6, 320 (1935).
[25] Neyman, J., and Pearson, E. S., "On the Problem of the Most Efficient Tests of Statistical Hypotheses," Phil. Trans. Royal Soc. London, Series A, 231, 289 (1933).
[26] Pearson, K., "On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling," Phil. Mag., Series 5, 50, 157 (1900).
[27] Pitman, E. J. G., "Sufficient Statistics and Intrinsic Accuracy," Proc. Cambridge Phil. Soc., 32, 567 (1936).
[28] Rao, C. R., Linear Statistical Inference and Its Applications, John Wiley & Sons, Inc., New York, 1965.
[29] Scheffé, H., The Analysis of Variance, John Wiley & Sons, Inc., New York, 1959.
[30] Wald, A., Sequential Analysis, John Wiley & Sons, Inc., New York, 1947.
[31] Wilcoxon, F., "Individual Comparisons by Ranking Methods," Biometrics Bull., 1, 80 (1945).
[32] Wilks, S. S., Mathematical Statistics, John Wiley & Sons, Inc., New York, 1962.

Appendix B
Tables

TABLE I
The Poisson Distribution

$$\Pr(X \le x) = \sum_{w=0}^{x} \frac{\mu^{w} e^{-\mu}}{w!}$$
[Table I gives $\Pr(X \le x)$ for $x = 0, 1, \ldots, 22$ and for $\mu = E(X) = 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0$, and $10.0$; the tabulated entries are not reproduced here.]
TABLE II
The Chi-Square Distribution*

$$\Pr(X \le x) = \int_{0}^{x} \frac{1}{\Gamma(r/2)\,2^{r/2}}\, w^{r/2 - 1} e^{-w/2}\, dw$$

[Table II gives, for $r = 1, 2, \ldots, 30$ degrees of freedom, the values $x$ such that $\Pr(X \le x) = 0.01, 0.025, 0.05, 0.95, 0.975$, and $0.99$; the tabulated entries are not reproduced here.]

* This table is abridged and adapted from "Tables of Percentage Points of the Incomplete Beta Function and of the Chi-Square Distribution," Biometrika, 32 (1941). It is published here with the kind permission of Professor E. S. Pearson on behalf of the author, Catherine M. Thompson, and of the Biometrika Trustees.

TABLE III
The Normal Distribution

$$\Pr(X \le x) = N(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, dw, \qquad [N(-x) = 1 - N(x)]$$

[Table III gives $N(x)$ for $x = 0.00$ to $3.00$; the tabulated entries are not reproduced here.]
TABLE IV
The t Distribution

[The entries of Table IV are not legible in this scan and are omitted.]

TABLE V
The F Distribution*

$$\Pr(F \le f) = \int_{0}^{f} \frac{\Gamma[(r_1 + r_2)/2]\,(r_1/r_2)^{r_1/2}\, w^{r_1/2 - 1}}{\Gamma(r_1/2)\,\Gamma(r_2/2)\,(1 + r_1 w/r_2)^{(r_1 + r_2)/2}}\, dw$$

[Table V gives, for various pairs of degrees of freedom $(r_1, r_2)$, the values $f$ such that $\Pr(F \le f) = 0.95, 0.975$, and $0.99$; the tabulated entries are not reproduced here.]
* This table is abridged and adapted from "Tables of Percentage Points of the Inverted Beta Distribution," Biometrika, 33 (1943). It is published here with the kind permission of Professor E. S. Pearson on behalf of the authors, Maxine Merrington and Catherine M. Thompson, and of the Biometrika Trustees.
Appendix C
Answers to Selected Exercises

[The answers for the exercises of Chapters 1 through 4 are not legible in this scan and are omitted.]
[The answers for the exercises of Chapters 4 through 11 are not legible in this scan and are omitted.]
11.7 $c_0(n) = (0.05n - \ln 8)/\ln 3.5$; $c_1(n) = (0.05n + \ln 4.5)/\ln 3.5$.
11.12 $(9\bar{y} - 20\bar{x})/30 \ge c$.
11.24 2.17; 2.44.
11.27 2.20.

CHAPTER 12

12.3 (a) $\exp\{(t_1c + t_2d)'\mu + [(t_1c + t_2d)'V(t_1c + t_2d)]/2\}$.
12.12 $a_i = 0$, $i = 1, 2, 3, 4$.
12.13 $\sum_{j=1}^{n} a_{ij} = 0$, $i = 1, 2, \ldots, n$.

Index

Analysis of variance, 291
Andrews, D. F., 404
Approximate distribution(s), chi-square, 269
  normal for binomial, 195, 198
  normal for chi-square, 190
  normal for Poisson, 191
  Poisson for binomial, 190
  of X̄, 194
Arc sine transformation, 217
Bayes' formula, 65, 228
Bayesian methods, 227, 229, 385
Bernstein, 87
Beta distribution, 139, 149, 310
Binary statistic, 319
Binomial distribution, 90, 132, 190, 195, 198, 305
Bivariate normal distribution, 117, 170
Boole's inequality, 384
Box-Muller transformation, 141
Cauchy distribution, 142
Central limit theorem, 192
Change of variable, 128, 132, 147
Characteristic function, 54
Characterization, 163, 172
Chebyshev's inequality, 58, 93, 188
Chi-square distribution, 107, 114, 169, 191, 271, 279, 413
Chi-square test, 269, 312, 320
Classification, 385
Cochran's theorem, 419
Complete sufficient statistics, 355
Completeness, 353, 358, 367, 390
Compounding, 234
Conditional expectation, 69, 349
Conditional probability, 61, 68, 343
Conditional p.d.f., 67, 71, 118
Confidence coefficient, 213
Confidence interval, 212, 219, 222
  for difference of means, 219
  for means, 212, 214
  for p, 215, 221
  for quantiles, 304
  for ratio of variances, 225
  for regression parameters, 298
  for variances, 222
Contingency tables, 275
Contrasts, 384
Convergence, 186, 196, 204
  in probability, 188
  with probability one, 188
  stochastic, 186, 196, 204
Convolution formula, 143
Correlation coefficient, 73, 300
Covariance, 73, 179, 408
Covariance matrix, 409
Coverage, 309
Cramér, 189
Critical region, 236, 239, 242
  best, 243, 245
  size, 239, 241
  uniformly most powerful, 252
Curtiss, J. H., 189
Decision function, 208, 228, 341, 386
Degenerate distribution, 56, 183
Degrees of freedom, 107, 264, 273, 279, 289
Descending M-estimators, 404
Distribution, beta, 139, 149, 310
  binomial, 90, 132, 190, 195, 198, 305
  bivariate normal, 117, 170, 386
  Cauchy, 142
  chi-square, 107, 114, 169, 191, 271, 279, 413
  conditional, 65, 71
  continuous type, 24, 26
  of coverages, 310
  degenerate, 56, 183
  Dirichlet, 149, 310
  discrete type, 23, 26
  double exponential, 140
  exponential, 105, 163
  exponential class, 357, 366
  of F, 146, 282
  function, 31, 36, 125
  of functions of random variables, 122
  of F(X), 126
  • 224. 436 Distribution, beta (cant.) gamma, 104 geometric, 94 hypergeometric, 42 limiting, 181, 193, 197, 270, 317 of linear functions, 168, 171, 176, 409 logistic, 142 lognormal, 180 marginal, 66 multinomial, 96, 270, 332 multivariate normal, 269, 405 negative binomial, 94 of noncentral chi-square, 289, 413 of noncentral F, 290, 295 of noncentral T, 264 normal, 109, 168, 193 of nS2/a2, 175 of order statistics, 154 Pareto, 207 Poisson, 99, 131, 190 posterior, 228 prior, 228 of quadratic forms, 278, 410 of R, 302 of runs, 323 of sample, 125 of T, 144, 176, 260, 298, 302 trinomial, 95 truncated, 116 uniform, 39, 126 Weibull,109 of :Y, 173, 194 Distribution-free methods, 304, 312, 397 Double exponential distribution, 140 Efficiency, 372 Estimation, 200, 208,227,341 Bayesian, 227 interval, see also Confidence intervals, 212 maximum likelihood, 202, 347, 401 minimax, 210 point, 200, 208, 370 robust, 400 Estimator, 202 best, 341, 355, 363 consistent, 204 efficient, 372 maximum likelihood, 202, 347, 401 minimum chi-square, 273 minimum mean-square-error, 210 unbiased, 204, 341 unbiased minimum variance, 208, 341, 355, 361 Events, 2, 16 exhaustive, 15 mutually exclusive, 14, 17 Expectation (expected value), 44, 83, 176 Index of a product, 47, 83 Exponential class, 357, 366 Exponential distribution, 105, 163 F distribution, 146, 282 Factorization theorem, 344, 358, 364 Family of distributions, 201, 354 Fisher, R. A., 388 Frequency, 2, 271 relative, 2, 12,93 Function, characteristic, 54, 192 decision, 208, 228, 341, 386 distribution, 31, 36, 125 exponential probability density, 105, 163 gamma, 104 likelihood, 202, 260 loss, 209, 341 moment-generating, 50,77, 84, 164 of parameter, 361 point, 8 power, 236, 239, 252 probability density, 25, 26, 31 probability distribution, 31, 34, 36 probability set, 12, 17, 34 of random variables, 35, 44, 122 risk, 209, 229, 341 set, 8 Geometric mean, 360 Gini's mean difference, 163 Huber, P., 402 Hypothesis, see Statistical hypothesis, Test of a statistical hypothesis Independence, see Stochastic independence Inequality, Boole, 384 Chebyshev, 58, 93, 188 Rao-Blackwell, 349 Rao-Cramerv J'Zz Interaction, 295 Interval confidence, 212 prediction, 218 random, 212 tolerance, 309 Jacobian, 134, 135, 147, 151 Joint conditional distribution, 71 Joint distribution function, 65 Joint probability density function, 65 Kurtosis, 57, 98,103,109,116,399 Law of large numbers, 93, 179, 188 Least squares, 400 Index Lehmann alternative, 334 Lehmann-8cheffe, 355 Levy, P., 189 Liapounov,317 Likelihood function, 202, 205, 260 Likelihood ratio tests, 257, 284 Limiting distribution, 181, 193, 197, 317 Limiting moment-generating function, 188 Linear discriminant function, 388 Linear functions, covariance, 179 mean, 176, 409 moment-generating function, 171, 409 variance, 177, 409 Linear rank statistic, 334 Logistic distribution, 142 Lognormal distribution, 180 Loss function, 309, 341 Mann-Whitney-Wilcoxon, 326, 334 Marginal p.d.f., 66 Maximum likelihood, 202, 347 estimator, 202, 205, 347, 401 method of, 202 Mean, 49, 124 conditional, 69,75, 118, 349 of linear function, 176 of a sample, 124 of X, 49 ofK,178 Median, 30, 38, 161 M-estimators,402 Method of least absolute values, 400 of least squares, 400 of maximum likelihood, 202 of moments, 206 Midrange, 161 Minimax, criterion, 210 decision function, 210 Minimum chi-square estimates, 273 Mode, 30, 98 Moment-generating function, 50, 77, 84, 189 of binomial distribution, 91 of bivariate normal distribution, 119 of chi-square distribution, 107 of 
gamma distribution, 105 of multinomial distribution, 97 of multivariate normal distribution, 408 of noncentral chi-square distribution, 289 of normal distribution, 111 of Poisson distribution, 101 of trinomial distribution, 95 of:Y,171 Moments, 52, 206 factorial, 56 437 method of, 206 Multinomial distribution, 96, 270, 332 Multiple comparisons, 380 Multiplication rule, 63, 64 Multivariate normal distribution, 269, 405 Neyman factorization theorem, 344, 358, 364 Neyman-Pearson theorem, 244, 267, 385 Noncentral chi-square, 289, 413 Noncentral F, 290, 295 Noncentral parameter, 264, 289,413 Noncentral T, 264 Nonparametric, 304, 312, 397 Normal distribution, 109, 168, 193 Normal scores, 319, 337, 398 Order statistics, 154, 304, 308, 369 distribution, 155 functions of, 161, 308 Parameter, 91, 201 function of, 361 Parameter space, 201, 260 Pareto distribution, 207 Percentile, 30, 311 PERT, 163, 171 Personal probability, 3, 228 Poisson distribution, 99, 131, 190 Poisson process, 99, 104 Power, see also Function, Test of a statis- tical hypothesis, 236, 239 Prediction interval, 218 Probability, 2, 12, 34, 40 conditional, 61, 68, 343 induced, 17 measure, 2, 12 models, 38 posterior, 228, 233 subjective, 3, 228 Probability density function, 25,26,31 conditional, 67 exponential class, 357, 366 posterior, 228 prior, 228 Probability set function, 12, 17, 34 p-values,255 Quadrant test, 400 Quadratic forms, 278 distribution, 278, 410, 414 independence, 279, 414 Quantiles, 30, 304 confidence intervals for, 305 Random experiment, 1, 12, 38 Random interval, 212 Random sample, 124, 170,368
  • 225. 438 Random sampling distribution theory, 125 Random variable, 16,23, 35 continuous-type, 24, 26 discrete-type, 23, 26 mixture of types, 35 space of, 16, 19,20,27 Random walk, 380 Randomized test, 255 Range, 161 Rao-Blackwell theorem, 349 Rae-Cramer inequality, 372 Regression, 296 Relative frequency, 2, 12,93 Risk function, 209, 229, 341 Robust methods, 398, 402 Sample, correlation coefficient, 300 mean of, 124 median of, 161 random, 124, 170,368 space, 1, 12, 61, 200 variance, 124 Scheffe, H., 382 Sequential probability ratio test, 374 Set, 4 complement of, 7 of discrete points, 23 element of, 4 function, 8, 12 null, 5, 13 probability measure, 12 subset, 5, 13 Sets, algebra, 4 intersection, 5 union, 5 Significance level of test, 239,241 Simulation, 127, 140 Skewness, 56, 98, 103, 109, 116 Space, 6, 16,23,24 parameter, 201, 260 product, 80 of random variables, 16, 19,20,23,24 sample, 1, 12, 61, 200 Spearman rank correlation, 338, 400 Standard deviation, 49, 124 Statistic, see also Sufficient statistic(s), 122 Statistical hypothesis, 235, 238 alternative, 235 composite, 239, 252, 257 simple, 239, 245, 252 test of, 236, 239 Statistical inference, 201,235 Stochastic convergence, 186, 196,204 Stochastic dependence, 80 Stochastic independence, 80, 120, 132, 140,275,300,390,414 Index mutual, 85 of linear forms, 172 pairwise, 87, 121 of quadratic forms, 279,414 test of, 275, 300 of X and S2, 175, 391 Sufficient statistic(s), 343, 364, 390 joint, 364 T distribution, 144, 176, 260, 264, 298, 302 Technique, change of variable, 128, 132, 147 distribution function, 125 moment-generating function, 164 Test of a statistical hypothesis, 236, 239 best, 243,252 chi-square, 269, 312, 320 critical region of, 239 of equality of distributions, 320 of equality of means, 261, 283, 291 of equality of variances, 266 likelihood ratio, 257, 260, 284 median, 321 nonparametric, 304 power of, 236, 239, 252 of randomness, 325 run, 322 sequential probability ratio, 374 sign, 312 significance level, 239, 241 of stochastic independence, 275, 300 uniformly most powerful, 251 Tolerance interval, 307 Training sample, 388 Transformation, 129, 147 of continuous-type variables, 132, 147 of discrete-type variables, 12F not one-to-one, 149 one-to-one, 129, 132 Truncated distribution, 116 Types I and II errors, 241 Uniqueness, of best estimator, 355 of characteristic function, 55 of moment-generating function, 50 Variance, analysis of, 291 conditional, 69, 76, 118,349 of a distribution, 49 of a linear function, 177 of a sample, 124 of X, 49 of Y, 178 Venn diagram, 6 Weibull distribution, 109 Wilcoxon, 314, 326, 334