F.M. Dekking C. Kraaikamp
H.P. Lopuhaä L.E. Meester
A Modern Introduction to
Probability and Statistics
Understanding Why and How
With 120 Figures
Frederik Michel Dekking
Cornelis Kraaikamp
Hendrik Paul Lopuhaä
Ludolf Erwin Meester
Delft Institute of Applied Mathematics
Delft University of Technology
Mekelweg 4
2628 CD Delft
The Netherlands
Whilst we have made considerable efforts to contact all holders of copyright material contained in this
book, we may have failed to locate some of them. Should holders wish to contact the Publisher, we
will be happy to come to some arrangement with them.
British Library Cataloguing in Publication Data
A modern introduction to probability and statistics. —
(Springer texts in statistics)
1. Probabilities 2. Mathematical statistics
I. Dekking, F. M.
519.2
ISBN 1852338962
Library of Congress Cataloging-in-Publication Data
A modern introduction to probability and statistics : understanding why and how / F.M. Dekking ... [et
al.].
p. cm. — (Springer texts in statistics)
Includes bibliographical references and index.
ISBN 1-85233-896-2
1. Probabilities—Textbooks. 2. Mathematical statistics—Textbooks. I. Dekking, F.M. II.
Series.
QA273.M645 2005
519.2—dc22 2004057700
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
stored or transmitted, in any form or by any means, with the prior permission in writing of the publish-
ers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the
Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to
the publishers.
ISBN-10: 1-85233-896-2
ISBN-13: 978-1-85233-896-1
Springer Science+Business Media
springeronline.com
© Springer-Verlag London Limited 2005
The use of registered names, trademarks, etc. in this publication does not imply, even in the absence
of a specific statement, that such names are exempt from the relevant laws and regulations and therefore
free for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the informa-
tion contained in this book and cannot accept any legal responsibility or liability for any errors or
omissions that may be made.
Printed in the United States of America
12/3830/543210 Printed on acid-free paper SPIN 10943403
Preface
Probability and statistics are fascinating subjects on the interface between
mathematics and applied sciences that help us understand and solve practical
problems. We believe that you, by learning how stochastic methods come
about and why they work, will be able to understand the meaning of statistical
statements as well as judge the quality of their content, when facing such
problems on your own. Our philosophy is one of how and why: instead of just
presenting stochastic methods as cookbook recipes, we prefer to explain the
principles behind them.
In this book you will find the basics of probability theory and statistics. In
addition, there are several topics that go somewhat beyond the basics but
that ought to be present in an introductory course: simulation, the Poisson
process, the law of large numbers, and the central limit theorem. Computers
have brought many changes in statistics. In particular, the bootstrap has
earned its place. It provides the possibility to derive confidence intervals and
perform tests of hypotheses where traditional (normal approximation or large
sample) methods are inappropriate. We believe it is a modern, useful tool
that one should learn about.
Examples and datasets in this book are mostly from real-life situations; at
least, that is what we looked for in illustrations of the material. Anybody who
has inspected datasets with the purpose of using them as elementary examples
knows that this is hard: on the one hand, you do not want to boldly state
assumptions that are clearly not satisfied; on the other hand, long explanations
concerning side issues distract from the main points. We hope that we found
a good middle way.
A first course in calculus is needed as a prerequisite for this book. In addition
to high-school algebra, some infinite series are used (exponential, geometric).
Integration and differentiation are the most important skills, mainly concern-
ing one variable (the exceptions, two-dimensional integrals, are encountered in
Chapters 9–11). Although the mathematics is kept to a minimum, we strived
to be mathematically correct throughout the book. With respect to probabil-
ity and statistics the book is self-contained.
The book is aimed at undergraduate engineering students, and students from
more business-oriented studies (who may gloss over some of the more mathe-
matically oriented parts). At our own university we also use it for students in
applied mathematics (where we put a little more emphasis on the math and
add topics like combinatorics, conditional expectations, and generating func-
tions). It is designed for a one-semester course: on average two hours in class
per chapter, the first for a lecture, the second doing exercises. The material
is also well-suited for self-study, as we know from experience.
We have divided attention about evenly between probability and statistics.
The very first chapter is a sampler with differently flavored introductory ex-
amples, ranging from scientific success stories to a controversial puzzle. Topics
that follow are elementary probability theory, simulation, joint distributions,
the law of large numbers, the central limit theorem, statistical modeling (in-
formal: why and how we can draw inference from data), data analysis, the
bootstrap, estimation, simple linear regression, confidence intervals, and hy-
pothesis testing. Instead of a few chapters with a long list of discrete and
continuous distributions, with an enumeration of the important attributes of
each, we introduce a few distributions when presenting the concepts and the
others where they arise (more) naturally. A list of distributions and their
characteristics is found in Appendix A.
With the exception of the first one, chapters in this book consist of three main
parts. First, about four sections discussing new material, interspersed with a
handful of so-called Quick exercises. Working these—two-or-three-minute—
exercises should help to master the material and provide a break from reading
to do something more active. On about two dozen occasions you will find
indented paragraphs labeled Remark, where we felt the need to discuss more
mathematical details or background material. These remarks can be skipped
without loss of continuity; in most cases they require a bit more mathematical
maturity. Whenever persons are introduced in examples we have determined
their sex by looking at the chapter number and applying the rule “He is odd,
she is even.” Solutions to the quick exercises are found in the second to last
section of each chapter.
The last section of each chapter is devoted to exercises, on average thirteen
per chapter. For about half of the exercises, answers are given in Appendix C,
and for half of these, full solutions in Appendix D. Exercises with both a
short answer and a full solution are marked with  and those with only a
short answer are marked with  (when more appropriate, for example, in
“Show that . . . ” exercises, the short answer provides a hint to the key step).
Typically, the section starts with some easy exercises and the order of the
material in the chapter is more or less respected. More challenging exercises
are found at the end.
Much of the material in this book would benefit from illustration with a
computer using statistical software. A complete course should also involve
computer exercises. Topics like simulation, the law of large numbers, the
central limit theorem, and the bootstrap loudly call for this kind of experi-
ence. For this purpose, all the datasets discussed in the book are available at
http://www.springeronline.com/1-85233-896-2. The same Web site also pro-
vides access, for instructors, to a complete set of solutions to the exercises;
go to the Springer online catalog or contact textbooks@springer-sbm.com to
apply for your password.
Delft, The Netherlands F. M. Dekking
January 2005 C. Kraaikamp
H. P. Lopuhaä
L. E. Meester
Contents
1 Why probability and statistics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Biometry: iris recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Killer football . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Cars and goats: the Monty Hall dilemma . . . . . . . . . . . . . . . . . . . 4
1.4 The space shuttle Challenger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Statistics versus intelligence agencies . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 The speed of light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Outcomes, events, and probability . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Sample spaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Products of sample spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 An infinite sample space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Conditional probability and independence . . . . . . . . . . . . . . . . . 25
3.1 Conditional probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 The multiplication rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 The law of total probability and Bayes’ rule. . . . . . . . . . . . . . . . . 30
3.4 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 The probability distribution of a discrete random variable . . . . 43
4.3 The Bernoulli and binomial distributions . . . . . . . . . . . . . . . . . . . 45
4.4 The geometric distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1 Probability density functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 The uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3 The exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4 The Pareto distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.6 Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.7 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1 What is simulation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2 Generating realizations of random variables . . . . . . . . . . . . . . . . . 72
6.3 Comparing two jury rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.4 The single-server queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7 Expectation and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.1 Expected values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.2 Three examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.3 The change-of-variable formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.4 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8 Computations with random variables . . . . . . . . . . . . . . . . . . . . . . 103
8.1 Transforming discrete random variables . . . . . . . . . . . . . . . . . . . . 103
8.2 Transforming continuous random variables . . . . . . . . . . . . . . . . . . 104
8.3 Jensen’s inequality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.4 Extremes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9 Joint distributions and independence . . . . . . . . . . . . . . . . . . . . . . 115
9.1 Joint distributions of discrete random variables . . . . . . . . . . . . . . 115
9.2 Joint distributions of continuous random variables . . . . . . . . . . . 118
9.3 More than two random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.4 Independent random variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.5 Propagation of independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9.6 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
10 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.1 Expectation and joint distributions . . . . . . . . . . . . . . . . . . . . . . . . 135
10.2 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
10.3 The correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11 More computations with more random variables . . . . . . . . . . . 151
11.1 Sums of discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . 151
11.2 Sums of continuous random variables . . . . . . . . . . . . . . . . . . . . . . 154
11.3 Product and quotient of two random variables . . . . . . . . . . . . . . 159
11.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
11.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12 The Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
12.1 Random points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
12.2 Taking a closer look at random arrivals. . . . . . . . . . . . . . . . . . . . . 168
12.3 The one-dimensional Poisson process . . . . . . . . . . . . . . . . . . . . . . . 171
12.4 Higher-dimensional Poisson processes . . . . . . . . . . . . . . . . . . . . . . 173
12.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
12.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
13 The law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
13.1 Averages vary less . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
13.2 Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
13.3 The law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
13.4 Consequences of the law of large numbers . . . . . . . . . . . . . . . . . . 188
13.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
13.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
14 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
14.1 Standardizing averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
14.2 Applications of the central limit theorem . . . . . . . . . . . . . . . . . . . 199
14.3 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
15 Exploratory data analysis: graphical summaries . . . . . . . . . . . . 207
15.1 Example: the Old Faithful data . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
15.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
15.3 Kernel density estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
15.4 The empirical distribution function . . . . . . . . . . . . . . . . . . . . . . . . 219
15.5 Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
15.6 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
15.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
16 Exploratory data analysis: numerical summaries . . . . . . . . . . . 231
16.1 The center of a dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
16.2 The amount of variability of a dataset. . . . . . . . . . . . . . . . . . . . . . 233
16.3 Empirical quantiles, quartiles, and the IQR . . . . . . . . . . . . . . . . . 234
16.4 The box-and-whisker plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
16.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
16.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
17 Basic statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
17.1 Random samples and statistical models . . . . . . . . . . . . . . . . . . . . 245
17.2 Distribution features and sample statistics . . . . . . . . . . . . . . . . . . 248
17.3 Estimating features of the “true” distribution . . . . . . . . . . . . . . . 253
17.4 The linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
17.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
17.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
18 The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
18.1 The bootstrap principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
18.2 The empirical bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
18.3 The parametric bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
18.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
18.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
19 Unbiased estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
19.1 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
19.2 Investigating the behavior of an estimator . . . . . . . . . . . . . . . . . . 287
19.3 The sampling distribution and unbiasedness . . . . . . . . . . . . . . . . 288
19.4 Unbiased estimators for expectation and variance . . . . . . . . . . . . 292
19.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
19.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
20 Efficiency and mean squared error . . . . . . . . . . . . . . . . . . . . . . . . . 299
20.1 Estimating the number of German tanks . . . . . . . . . . . . . . . . . . . 299
20.2 Variance of an estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
20.3 Mean squared error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
20.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
20.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
21 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
21.1 Why a general principle? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
21.2 The maximum likelihood principle . . . . . . . . . . . . . . . . . . . . . . . . . 314
21.3 Likelihood and loglikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
21.4 Properties of maximum likelihood estimators . . . . . . . . . . . . . . . . 321
21.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
21.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
22 The method of least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
22.1 Least squares estimation and regression . . . . . . . . . . . . . . . . . . . . 329
22.2 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
22.3 Relation with maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 335
22.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
22.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
23 Confidence intervals for the mean . . . . . . . . . . . . . . . . . . . . . . . . . 341
23.1 General principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
23.2 Normal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
23.3 Bootstrap confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
23.4 Large samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
23.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
23.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
24 More on confidence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
24.1 The probability of success . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
24.2 Is there a general method? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
24.3 One-sided confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
24.4 Determining the sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
24.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
24.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
25 Testing hypotheses: essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
25.1 Null hypothesis and test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . 373
25.2 Tail probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
25.3 Type I and type II errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
25.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
25.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
26 Testing hypotheses: elaboration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
26.1 Significance level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
26.2 Critical region and critical values . . . . . . . . . . . . . . . . . . . . . . . . . . 386
26.3 Type II error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
26.4 Relation with confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . 392
26.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
26.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
27 The t-test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
27.1 Monitoring the production of ball bearings. . . . . . . . . . . . . . . . . . 399
27.2 The one-sample t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
27.3 The t-test in a regression setting. . . . . . . . . . . . . . . . . . . . . . . . . . . 405
27.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
27.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
28 Comparing two samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
28.1 Is dry drilling faster than wet drilling? . . . . . . . . . . . . . . . . . . . . . 415
28.2 Two samples with equal variances . . . . . . . . . . . . . . . . . . . . . . . . . 416
28.3 Two samples with unequal variances . . . . . . . . . . . . . . . . . . . . . . . 419
28.4 Large samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
28.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
28.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
A Summary of distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
B Tables of the normal and t-distributions . . . . . . . . . . . . . . . . . . . 431
C Answers to selected exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
D Full solutions to selected exercises . . . . . . . . . . . . . . . . . . . . . . . . . 445
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
List of symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
1
Why probability and statistics?
Is everything on this planet determined by randomness? This question is open
to philosophical debate. What is certain is that every day thousands and
thousands of engineers, scientists, business persons, manufacturers, and others
are using tools from probability and statistics.
The theory and practice of probability and statistics were developed during
the last century and are still actively being refined and extended. In this book
we will introduce the basic notions and ideas, and in this first chapter we
present a diverse collection of examples where randomness plays a role.
1.1 Biometry: iris recognition
Biometry is the art of identifying a person on the basis of his or her personal
biological characteristics, such as fingerprints or voice. From recent research
it appears that with the human iris one can beat all existing automatic hu-
man identification systems. Iris recognition technology is based on the visible
qualities of the iris. It converts these—via a video camera—into an “iris code”
consisting of just 2048 bits. This is done in such a way that the code is hardly
sensitive to the size of the iris or the size of the pupil. However, at different
times and different places the iris code of the same person will not be exactly
the same. Thus one has to allow for a certain percentage of mismatching bits
when identifying a person. In fact, the system allows about 34% mismatches!
How can this lead to a reliable identification system? The miracle is that dif-
ferent persons have very different irides. In particular, over a large collection
of different irides the code bits take the values 0 and 1 about half of the time.
But that is certainly not sufficient: if one bit determined the other 2047,
then we could only distinguish two persons. In other words, single bits may
be random, but the correlation between bits is also crucial (we will discuss
correlation at length in Chapter 10). John Daugman, who developed the
iris recognition technology, made comparisons between 222 743 pairs of iris
codes and concluded that of the 2048 bits 266 may be considered as uncor-
related ([6]). He then argues that we may consider an iris code as the result
of 266 coin tosses with a fair coin. This implies that if we compare two such
codes from different persons, then there is an astronomically small probability
that these two differ in less than 34% of the bits—almost all pairs will differ
in about 50% of the bits. This is illustrated in Figure 1.1, which originates
from [6], and was kindly provided by John Daugman. The iris code data con-
sist of numbers between 0 and 1, each a Hamming distance (the fraction of
mismatches) between two iris codes. The data have been summarized in two
histograms, that is, two graphs that show the number of counts of Hamming
distances falling in a certain interval. We will encounter histograms and other
summaries of data in Chapter 15. One sees from the figure that for codes from
the same iris (left side) the mismatch fraction is only about 0.09, while for
different irides (right side) it is about 0.46.
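To get a feeling for how small this probability is, one can follow Daugman's argument and treat the comparison of two codes from different persons as 266 tosses of a fair coin. The Python sketch below (our own illustration, not part of the original analysis, and assuming scipy is available) computes the binomial probability that fewer than 34% of these 266 bits disagree.

```python
from scipy.stats import binom

n = 266           # effectively independent bits in an iris code (Daugman's estimate)
p = 0.5           # for different irides, each bit mismatches with probability 1/2
threshold = 0.34  # two codes are declared a match if fewer than 34% of bits disagree

k = int(threshold * n)            # largest number of mismatches below the threshold
false_match = binom.cdf(k, n, p)  # P(a pair of different irides matches by chance)
print(false_match)                # astronomically small, roughly of order 1e-7

# In contrast, a typical pair of different irides disagrees in about n*p = 133 bits,
# that is, about 50% of them.
```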
[Figure 1.1: two histograms of Hamming distances, one for 546 comparisons of same-iris pairs (mean 0.089, standard deviation 0.042) and one for 222,743 comparisons of different-iris pairs (mean 0.456, standard deviation 0.018), together with theoretical binomial curves, a theoretical cross-over point at HD = 0.342, and a theoretical cross-over rate of 1 in 1.2 million (d′ = 11.36).]
Fig. 1.1. Comparison of same and different iris pairs.
Source: J. Daugman. Second IMA Conference on Image Processing: Mathematical Methods, Algorithms and Applications, 2000. Ellis Horwood Publishing Limited.
You may still wonder how it is possible that irides distinguish people so well.
What about twins, for instance? The surprising thing is that although the
color of eyes is hereditary, many features of iris patterns seem to be pro-
duced by so-called epigenetic events. This means that during embryo develop-
ment the iris structure develops randomly. In particular, the iris patterns of
(monozygotic) twins are as discrepant as those of two arbitrary individuals.
For this reason, as early as the 1930s, eye specialists proposed that iris
patterns might be used for identification purposes.
1.2 Killer football
A couple of years ago the prestigious British Medical Journal published a
paper with the title “Cardiovascular mortality in Dutch men during 1996
European football championship: longitudinal population study” ([41]). The
authors claim to have shown that the effect of a single football match is
detectable in national mortality data. They consider the mortality from in-
farctions (heart attacks) and strokes, and the “explanation” of the increase is
a combination of heavy alcohol consumption and stress caused by watching
the football match on June 22 between the Netherlands and France (lost by
the Dutch team!). The authors mainly support their claim with a figure like
Figure 1.2, which shows the number of deaths from the causes mentioned (for
men over 45), during the period June 17 to June 27, 1996. The middle horizon-
tal line marks the average number of deaths on these days, and the upper and
lower horizontal lines mark what the authors call the 95% confidence inter-
val. The construction of such an interval is usually performed with standard
statistical techniques, which you will learn in Chapter 23. The interpretation
of such an interval is rather tricky. That the bar on June 22 sticks out of the
confidence interval should support the “killer claim.”
[Figure 1.2: bar chart of the daily number of deaths (0–40) around June 22, with horizontal lines marking the average and the 95% confidence interval.]
Fig. 1.2. Number of deaths from infarction or stroke in (part of) June 1996.
It is rather surprising that such a conclusion is based on a single football
match, and one could wonder why no probability model is proposed in the
paper. In fact, as we shall see in Chapter 12, it would not be a bad idea to
model the time points at which deaths occur as a so-called Poisson process.
Once we have done this, we can compute how often a pattern like the one in the
figure might occur—without paying attention to football matches and other
high-risk national events. To do this we need the mean number of deaths per
day. This number can be obtained from the data by an estimation procedure
(the subject of Chapters 19 to 23). We use the sample mean, which is equal to
(10 · 27.2 + 41)/11 = 313/11 = 28.45. (Here we have to make a computation
like this because we only use the data in the paper: 27.2 is the average over
the 5 days preceding and following the match, and 41 is the number of deaths
on the day of the match.) Now let p_high be the probability that there are
41 or more deaths on a day, and let p_usual be the probability that there are
between 21 and 34 deaths on a day—here 21 and 34 are the lowest and the
highest number that fall in the interval in Figure 1.2. From the formula of the
Poisson distribution given in Chapter 12 one can compute that p_high = 0.008
and p_usual = 0.820. Since events on different days are independent according
to the Poisson process model, the probability p of a pattern as in the figure is
p = (p_usual)^5 · p_high · (p_usual)^5 = 0.0011.
From this it can be shown by (a generalization of) the law of large numbers
(which we will study in Chapter 13) that such a pattern would appear about
once every 1/0.0011 = 909 days. So it is not overwhelmingly exceptional to
find such a pattern, and the fact that there was an important football match
on the day in the middle of the pattern might just have been a coincidence.
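For readers who want to redo this computation, here is a minimal Python sketch using scipy's Poisson distribution; the mean and the bounds 21, 34, and 41 are taken from the text above, and the printed values can be compared with the 0.008, 0.820, and 0.0011 reported there.

```python
from scipy.stats import poisson

lam = (10 * 27.2 + 41) / 11          # sample mean: about 28.45 deaths per day

p_high = poisson.sf(40, lam)         # P(41 or more deaths on a single day)
p_usual = poisson.cdf(34, lam) - poisson.cdf(20, lam)   # P(between 21 and 34 deaths)

# Under the Poisson process model different days are independent, so the pattern in
# Figure 1.2 (five "usual" days, one "high" day, five more "usual" days) has probability
p_pattern = p_usual**5 * p_high * p_usual**5

print(p_high, p_usual, p_pattern)    # compare with 0.008, 0.820, and 0.0011
print(1 / p_pattern)                 # expected number of days between such patterns
```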
1.3 Cars and goats: the Monty Hall dilemma
On Sunday September 9, 1990, the following question appeared in the “Ask
Marilyn” column in Parade, a Sunday supplement to many newspapers across
the United States:
Suppose you’re on a game show, and you’re given the choice of three
doors; behind one door is a car; behind the others, goats. You pick a
door, say No. 1, and the host, who knows what’s behind the doors,
opens another door, say No. 3, which has a goat. He then says to you,
“Do you want to pick door No. 2?” Is it to your advantage to switch
your choice?—Craig F. Whitaker, Columbia, Md.
Marilyn’s answer—one should switch—caused an avalanche of reactions, in to-
tal an estimated 10 000. Some of these reactions were not so flattering (“You
are the goat”), quite a lot were by professional mathematicians (“You blew
it, and blew it big,” “You are utterly incorrect . . . . How many irate mathe-
maticians are needed to change your mind?”). Perhaps some of the reactions
were so strong, because Marilyn vos Savant, the author of the column, is in
the Guinness Book of Records for having one of the highest IQs in the world.
The switching question was inspired by Monty Hall’s “Let’s Make a Deal”
game show, which ran with small interruptions for 23 years on various U.S.
television networks.
Although it is not explicitly stated in the question, the game show host will
always open a door with a goat after you make your initial choice. Many
people would argue that in this situation it does not matter whether one
would change or not: one door has a car behind it, the other a goat, so the
odds to get the car are fifty-fifty. To see why they are wrong, consider the
following argument. In the original situation two of the three doors have a
goat behind them, so with probability 2/3 your initial choice was wrong, and
with probability 1/3 it was right. Now the host opens a door with a goat (note
that he can always do this). In case your initial choice was wrong the host has
only one option to show a door with a goat, and switching leads you to the
door with the car. In case your initial choice was right the host has two goats
to choose from, so switching will lead you to a goat. We see that switching
is the best strategy, doubling our chances to win. To stress this argument,
consider the following generalization of the problem: suppose there are 10 000
doors, behind one is a car and behind the rest, goats. After you make your
choice, the host will open 9998 doors with goats, and offers you the option to
switch. To change or not to change, that’s the question! Still not convinced?
Use your Internet browser to find one of the zillion sites where one can run a
simulation of the Monty Hall problem (more about simulation in Chapter 6).
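Alternatively, you can write such a simulation yourself in a few lines. The sketch below (our own illustration in Python) plays the game many times with and without switching; the estimated winning probabilities should come out close to 1/3 and 2/3.

```python
import random

def play_monty_hall(switch):
    doors = [0, 1, 2]
    car = random.choice(doors)        # the door hiding the car
    choice = random.choice(doors)     # the contestant's initial choice
    # The host opens a door that hides a goat and is not the contestant's door.
    opened = random.choice([d for d in doors if d != choice and d != car])
    if switch:
        # Switch to the single remaining closed door.
        choice = [d for d in doors if d not in (choice, opened)][0]
    return choice == car              # True if the contestant wins the car

n = 100_000
for switch in (False, True):
    wins = sum(play_monty_hall(switch) for _ in range(n))
    print("switch" if switch else "stay", wins / n)
```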
In fact, there are quite a lot of variations on the problem. For example, the
situation that there are four doors: you select a door, the host always opens a
door with a goat, and offers you to select another door. After you have made
up your mind he opens a door with a goat, and again offers you to switch.
After you have decided, he opens the door you selected. What is now the best
strategy? In this situation switching only at the last possible moment yields
a probability of 3/4 to bring the car home. Using the law of total probability
from Section 3.3 you will find that this is indeed the best possible strategy.
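The four-door variant can be checked the same way. In the sketch below (again our own illustration) the contestant ignores the first offer and switches only after the host has opened the second goat door; the estimated winning probability should be close to 3/4.

```python
import random

def play_four_doors():
    doors = [0, 1, 2, 3]
    car = random.choice(doors)
    choice = random.choice(doors)
    # First offer: the host opens a goat door other than the chosen one;
    # the contestant stays with the original door.
    opened1 = random.choice([d for d in doors if d != choice and d != car])
    # Second offer: the host opens another goat door; the contestant now switches
    # to the one door that is still closed.
    opened2 = random.choice(
        [d for d in doors if d not in (choice, opened1) and d != car])
    choice = [d for d in doors if d not in (choice, opened1, opened2)][0]
    return choice == car

n = 100_000
print(sum(play_four_doors() for _ in range(n)) / n)   # should be close to 3/4
```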
1.4 The space shuttle Challenger
On January 28, 1986, the space shuttle Challenger exploded about one minute
after it had taken off from the launch pad at Kennedy Space Center in Florida.
The seven astronauts on board were killed and the spacecraft was destroyed.
The cause of the disaster was explosion of the main fuel tank, caused by flames
of hot gas erupting from one of the so-called solid rocket boosters.
These solid rocket boosters had been cause for concern since the early years
of the shuttle. They are manufactured in segments, which are joined at a later
stage, resulting in a number of joints that are sealed to protect against leakage.
This is done with so-called O-rings, which in turn are protected by a layer
of putty. When the rocket motor ignites, high pressure and high temperature
build up within. In time these may burn away the putty and subsequently
erode the O-rings, eventually causing hot flames to erupt on the outside. In a
nutshell, this is what actually happened to the Challenger.
After the explosion, an investigative commission determined the causes of the
disaster, and a report was issued with many findings and recommendations
([24]). On the evening of January 27, a decision to launch the next day had
been made, notwithstanding the fact that an extremely low temperature of
31°F had been predicted, well below the operating limit of 40°F set by Morton
Thiokol, the manufacturer of the solid rocket boosters. Apparently, a “man-
agement decision” was made to overrule the engineers’ recommendation not
to launch. The inquiry faulted both NASA and Morton Thiokol management
for giving in to the pressure to launch, ignoring warnings about problems with
the seals.
The Challenger launch was the 24th of the space shuttle program, and we
shall look at the data on the number of failed O-rings, available from previous
launches (see [5] for more details). Each rocket has three O-rings, and two
rocket boosters are used per launch, so in total six O-rings are used each
time. Because low temperatures are known to adversely affect the O-rings,
we also look at the corresponding launch temperature. In Figure 1.3 the dots
show the number of failed O-rings per mission (there are 23 dots—one time the
boosters could not be recovered from the ocean; temperatures are rounded to
the nearest degree Fahrenheit; in case of two or more equal data points these
are shifted slightly). If you ignore the dots representing zero failures, which
all occurred at high temperatures, a temperature effect is not apparent.
[Figure 1.3: dot plot of the number of failed O-rings (0–6) per mission against launch temperature in °F (30–90), with the fitted curve 6 · p(t).]
Source: based on data from Volume VI of the Report of the Presidential
Commission on the space shuttle Challenger accident, Washington, DC, 1986.
Fig. 1.3. Space shuttle failure data of pre-Challenger missions and fitted model of
expected number of failures per mission function.
In a model to describe these data, the probability p(t) that an individual
O-ring fails should depend on the launch temperature t. Per mission, the
number of failed O-rings follows a so-called binomial distribution: six O-rings,
and each may fail with probability p(t); more about this distribution and the
circumstances under which it arises can be found in Chapter 4. A logistic
model was used in [5] to describe the dependence on t:
p(t) = e^(a+b·t) / (1 + e^(a+b·t)).
A high value of a + b · t corresponds to a high value of p(t), a low value to
low p(t). Values of a and b were determined from the data, according to the
following principle: choose a and b so that the probability that we get data as
in Figure 1.3 is as high as possible. This is an example of the use of the method
of maximum likelihood, which we shall discuss in Chapter 21. This results in
a = 5.085 and b = −0.1156, which indeed leads to lower probabilities at higher
temperatures, and to p(31) = 0.8178. We can also compute the (estimated)
expected number of failures, 6·p(t), as a function of the launch temperature t;
this is the plotted line in the figure.
Combining the estimates with estimated probabilities of other events that
should happen for a complete failure of the field-joint, the estimated proba-
bility of such a failure is 0.023. With six field-joints, the probability of at least
one complete failure is then 1 − (1 − 0.023)^6 = 0.13!
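The numbers in this section are easy to reproduce. The following Python sketch (our illustration, using the values of a and b quoted above) evaluates the logistic model, the expected number of failures 6 · p(t), and the probability of at least one complete field-joint failure.

```python
import math

a, b = 5.085, -0.1156     # estimates reported in the text

def p(t):
    """Probability that an individual O-ring fails at launch temperature t (in °F)."""
    return math.exp(a + b * t) / (1 + math.exp(a + b * t))

print(p(31))              # about 0.8178 at the predicted launch temperature of 31 °F
print(6 * p(31))          # expected number of failed O-rings at 31 °F
print(6 * p(70))          # much smaller at a more typical launch temperature

# With an estimated probability of 0.023 for a complete failure of one field-joint,
# the probability of at least one complete failure among six field-joints is
print(1 - (1 - 0.023) ** 6)   # about 0.13
```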
1.5 Statistics versus intelligence agencies
During World War II, information about Germany’s war potential was essen-
tial to the Allied forces in order to schedule the time of invasions and to carry
out the allied strategic bombing program. Methods for estimating German
production used during the early phases of the war proved to be inadequate.
In order to obtain more reliable estimates of German war production, ex-
perts from the Economic Warfare Division of the American Embassy and the
British Ministry of Economic Warfare started to analyze markings and serial
numbers obtained from captured German equipment.
Each piece of enemy equipment was labeled with markings, which included
all or some portion of the following information: (a) the name and location
of the marker; (b) the date of manufacture; (c) a serial number; and (d)
miscellaneous markings such as trademarks, mold numbers, casting numbers,
etc. The purpose of these markings was to maintain an effective check on
production standards and to perform spare parts control. However, these same
markings offered Allied intelligence a wealth of information about German
industry.
The first products to be analyzed were tires taken from German aircraft shot
over Britain and from supply dumps of aircraft and motor vehicle tires cap-
tured in North Africa. The marking on each tire contained the maker’s name,
a serial number, and a two-letter code for the date of manufacture. The first
step in analyzing the tire markings involved breaking the two-letter date code.
It was conjectured that one letter represented the month and the other the
year of manufacture, and that there should be 12 letter variations for the
month code and 3 to 6 for the year code. This, indeed, turned out to be true.
The following table presents examples of the 12 letter variations used by four
different manufacturers.
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Dunlop T I E B R A P O L N U D
Fulda F U L D A M U N S T E R
Phoenix F O N I X H A M B U R G
Sempirit A B C D E F G H I J K L
Reprinted with permission from “An empirical approach to economic intelligence”
by R. Ruggles and H. Brodie, pp. 72–91, Vol. 42, No. 237. © 1947 by the
American Statistical Association. All rights reserved.
For instance, the Dunlop code was Dunlop Arbeit spelled backwards. Next,
the year code was broken and the numbering system was solved so that for
each manufacturer individually the serial numbers could be dated. Moreover,
for each month, the serial numbers could be recoded to numbers running
from 1 to some unknown largest number N, and the observed (recoded) serial
numbers could be seen as a subset of this. The objective was to estimate N
for each month and each manufacturer separately by means of the observed
(recoded) serial numbers. In Chapter 20 we discuss two different methods
of estimation, and we show that the method based on only the maximum
observed (recoded) serial number is much better than the method based on
the average observed (recoded) serial numbers.
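Chapter 20 treats these two estimation methods in detail; a small simulation already shows the difference. The Python sketch below (our own illustration, with made-up values for N and the sample size, and using standard forms of the maximum-based and average-based estimators) compares their root mean squared errors.

```python
import random

def compare_estimators(N=1000, sample_size=20, runs=10_000):
    """Estimate the largest serial number N from a sample of observed serial numbers."""
    sq_err_max, sq_err_mean = 0.0, 0.0
    for _ in range(runs):
        sample = random.sample(range(1, N + 1), sample_size)
        est_max = max(sample) * (sample_size + 1) / sample_size - 1   # maximum-based
        est_mean = 2 * sum(sample) / sample_size - 1                  # average-based
        sq_err_max += (est_max - N) ** 2
        sq_err_mean += (est_mean - N) ** 2
    return (sq_err_max / runs) ** 0.5, (sq_err_mean / runs) ** 0.5

print(compare_estimators())   # the maximum-based estimator has a much smaller error
```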
With a sample of about 1400 tires from five producers, individual monthly
output figures were obtained for almost all months over a period from 1939
to mid-1943. The following table compares the accuracy of estimates of the
average monthly production of all manufacturers of the first quarter of 1943
with the statistics of the Speer Ministry that became available after the war.
The accuracy of the estimates can be appreciated even more if we compare
them with the figures obtained by Allied intelligence agencies. They estimated,
using other methods, the production to be between 900 000 and 1 200 000 per month!
Type of tire Estimated production Actual production
Truck and passenger car 147 000 159 000
Aircraft 28 500 26 400
——— ———
Total 175 500 186 100
Reprinted with permission from “An empirical approach to economic intelligence”
by R. Ruggles and H. Brodie, pp. 72–91, Vol. 42, No. 237. © 1947 by the
American Statistical Association. All rights reserved.
1.6 The speed of light
In 1983 the definition of the meter (the SI unit of length) was changed to:
The meter is the length of the path traveled by light in vacuum during a time
interval of 1/299 792 458 of a second. This implicitly defines the speed of light
as 299 792 458 meters per second. It was done because one thought that the
speed of light was so accurately known that it made more sense to define the
meter in terms of the speed of light rather than vice versa, a remarkable end
to a long story of scientific discovery. For a long time most scientists believed
that the speed of light was infinite. Early experiments devised to demonstrate
the finiteness of the speed of light failed because the speed is so extraordi-
narily high. In the 18th century this debate was settled, and work started on
determination of the speed, using astronomical observations, but a century
later scientists turned to earth-based experiments. Albert Michelson refined
experimental arrangements from two previous experiments and conducted a
series of measurements in June and early July of 1879, at the U.S. Naval
Academy in Annapolis. In this section we give a very short summary of his
work. It is extracted from an article in Statistical Science ([18]).
The principle of speed measurement is easy, of course: measure a distance and
the time it takes to travel that distance; the speed equals distance divided by
time. For an accurate determination, both the distance and the time need
to be measured accurately, and with the speed of light this is a problem:
either we should use a very large distance and the accuracy of the distance
measurement is a problem, or we have a very short time interval, which is also
very difficult to measure accurately.
In Michelson’s time it was known that the speed of light was about 300 000
km/s, and he embarked on his study with the goal of an improved value of the
speed of light. His experimental setup is depicted schematically in Figure 1.4.
Light emitted from a light source is aimed, through a slit in a fixed plate,
at a rotating mirror; we call its distance from the plate the radius. At one
particular angle, this rotating mirror reflects the beam in the direction of a
distant (fixed) flat mirror. On its way the light first passes through a focusing
lens. This second mirror is positioned in such a way that it reflects the beam
back in the direction of the rotating mirror. In the time it takes the light to
travel back and forth between the two mirrors, the rotating mirror has moved
by an angle α, resulting in a reflection on the plate that is displaced with
respect to the source beam that passed through the slit. The radius and the
displacement determine the angle α because
tan 2α = displacement / radius,
and combined with the number of revolutions per second (rps) of the mirror,
this determines the elapsed time:
time = (α/2π) / rps.
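Since the light travels the distance between the two mirrors twice during this time, the speed follows as 2 × distance divided by the elapsed time. The following Python sketch illustrates the computation with made-up but plausible numbers; these are not Michelson's actual measurements.

```python
import math

# Illustrative values only (not Michelson's data):
radius = 8.6           # distance from the plate to the rotating mirror, in meters
displacement = 0.111   # displacement of the reflected image on the plate, in meters
rps = 257              # revolutions per second of the rotating mirror
distance = 600.0       # distance between rotating mirror and fixed mirror, in meters

alpha = math.atan(displacement / radius) / 2   # from tan 2α = displacement / radius
elapsed = (alpha / (2 * math.pi)) / rps        # time = (α/2π) / rps
speed = 2 * distance / elapsed                 # light covers the distance twice
print(speed)                                   # close to 3.0e8 m/s for these numbers
```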
[Figure 1.4: schematic of Michelson's setup. The labeled elements are the light source, the plate, the radius (the distance from the plate to the rotating mirror), the distance to the fixed mirror, and the displacement of the reflected beam on the plate.]
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
α
Rotating
mirror
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Focusing
lens
Fig. 1.4. Michelson’s experiment.
During this time the light traveled twice the distance between the mirrors, so
the speed of light in air now follows:
c_air = (2 · distance) / time .
All in all, it looks simple: just measure the four quantities—distance, radius,
displacement and the revolutions per second—and do the calculations. This
is much harder than it looks, and problems in the form of inaccuracies are
lurking everywhere. An error in any of these quantities translates directly into
some error in the final result.
Michelson did the utmost to reduce errors. For example, the distance between
the mirrors was about 2000 feet, and to measure it he used a steel measuring
tape. Its nominal length was 100 feet, but he carefully checked this using a
copy of the official “standard yard.” He found that the tape was in fact 100.006
feet. This way he eliminated a (small) systematic error.
Now imagine using the tape to measure a distance of 2000 feet: you have to use
the tape 20 times, each time marking the next 100 feet. Do it again, and you
probably find a slightly different answer, no matter how hard you try to be
very precise in every step of the measuring procedure. This kind of variation
is inevitable: sometimes we end up with a value that is a bit too high, other
times it is too low, but on average we’re doing okay—assuming that we have
eliminated sources of systematic error, as in the measuring tape. Michelson
measured the distance five times, which resulted in values between 1984.93
and 1985.17 feet (after correcting for the temperature-dependent stretch), and
he used the average as the “true distance.”
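To illustrate the averaging step, here is a small Python sketch; only the two extreme readings (1984.93 and 1985.17 feet) are taken from the description above, and the three values in between are invented purely for the example.

# Average of five repeated distance measurements (in feet).
# The smallest and largest values are the ones quoted above; the
# three middle values are made up for illustration.
readings = [1984.93, 1985.02, 1985.08, 1985.12, 1985.17]
mean_distance = sum(readings) / len(readings)
print(round(mean_distance, 3))   # the average used as the "true distance"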
In many phases of the measuring process Michelson attempted to identify
and determine systematic errors and subsequently applied corrections. He
also systematically repeated measuring steps and averaged the results to re-
duce variability. His final dataset consists of 100 separate measurements (see
Table 17.1), but each is in fact summarized and averaged from repeated mea-
surements on several variables. The final result he reported was that the speed
of light in vacuum (this involved a conversion) was 299 944 ± 51 km/s, where
the 51 is an indication of the uncertainty in the answer. In retrospect, we must
conclude that, in spite of Michelson’s admirable meticulousness, some source
of error must have slipped his attention, as his result is off by about 150 km/s.
With current methods we would derive from his data a so-called 95% confi-
dence interval: 299 944 ± 15.5 km/s, suggesting that Michelson’s uncertainty
analysis was a little conservative. The methods used to construct confidence
intervals are the topic of Chapters 23 and 24.
2
Outcomes, events, and probability
The world around us is full of phenomena we perceive as random or unpre-
dictable. We aim to model these phenomena as outcomes of some experiment,
where you should think of experiment in a very general sense. The outcomes
are elements of a sample space Ω, and subsets of Ω are called events. The events
will be assigned a probability, a number between 0 and 1 that expresses how
likely the event is to occur.
2.1 Sample spaces
Sample spaces are simply sets whose elements describe the outcomes of the
experiment in which we are interested.
We start with the most basic experiment: the tossing of a coin. Assuming that
we will never see the coin land on its rim, there are two possible outcomes:
heads and tails. We therefore take as the sample space associated with this
experiment the set Ω = {H, T }.
In another experiment we ask the next person we meet on the street in which
month her birthday falls. An obvious choice for the sample space is
Ω = {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec}.
In a third experiment we load a scale model for a bridge up to the point
where the structure collapses. The outcome is the load at which this occurs.
In reality, one can only measure with finite accuracy, e.g., to five decimals, and
a sample space with just those numbers would strictly be adequate. However,
in principle, the load itself could be any positive number and therefore Ω =
(0, ∞) is the right choice. Even though in reality there may also be an upper
limit to what loads are conceivable, it is not necessary or practical to try to
limit the outcomes correspondingly.
In a fourth experiment, we find on our doormat three envelopes, sent to us by
three different persons, and we look in which order the envelopes lie on top of
each other. Coding them 1, 2, and 3, the sample space would be
Ω = {123, 132, 213, 231, 312, 321}.
Quick exercise 2.1 If we received mail from four different persons, how
many elements would the corresponding sample space have?
In general one might consider the order in which n different objects can be
placed. This is called a permutation of the n objects. As we have seen, there
are 6 possible permutations of 3 objects, and 4 · 6 = 24 of 4 objects. What
happens is that if we add the nth object, then this can be placed in any of n
positions in any of the permutations of n − 1 objects. Therefore there are
n · (n − 1) · · · 3 · 2 · 1 = n!
possible permutations of n objects. Here n! is the standard notation for this
product and is pronounced “n factorial.” It is convenient to define 0! = 1.
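To make the counting argument concrete, a short Python sketch can list the 3! = 6 orderings of the three envelopes and check the formula n! for a few small values of n:

from itertools import permutations
from math import factorial

# The 6 possible orderings of three envelopes labeled 1, 2, 3.
print(["".join(p) for p in permutations("123")])

# For n objects the number of permutations equals n!.
for n in range(1, 7):
    assert len(list(permutations(range(n)))) == factorial(n)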
2.2 Events
Subsets of the sample space are called events. We say that an event A occurs
if the outcome of the experiment is an element of the set A. For example, in
the birthday experiment we can ask for the outcomes that correspond to a
long month, i.e., a month with 31 days. This is the event
L = {Jan, Mar, May, Jul, Aug, Oct, Dec}.
Events may be combined according to the usual set operations.
For example if R is the event that corresponds to the months that have the
letter r in their (full) name (so R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec}),
then the long months that contain the letter r are
L ∩ R = {Jan, Mar, Oct, Dec}.
The set L∩R is called the intersection of L and R and occurs if both L and R
occur. Similarly, we have the union A∪B of two sets A and B, which occurs if
at least one of the events A and B occurs. Another common operation is taking
complements. The event A^c = {ω ∈ Ω : ω ∉ A} is called the complement of A;
it occurs if and only if A does not occur. The complement of Ω is denoted
∅, the empty set, which represents the impossible event. Figure 2.1 illustrates
these three set operations.
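On a finite sample space these are ordinary set operations; as an illustration, the following Python sketch computes the intersection, union, and complement for the events L and R of the birthday-month example.

# Sample space: the twelve months.
omega = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}
L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}            # long months
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}     # months with an r

print(L & R)        # intersection L ∩ R: {'Jan', 'Mar', 'Oct', 'Dec'}
print(L | R)        # union L ∪ R: every month except Jun
print(omega - L)    # complement L^c: the five short months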
[Figure 2.1: three Venn diagrams in Ω, showing the intersection A ∩ B, the union A ∪ B, and the complement A^c.]
Fig. 2.1. Diagrams of intersection, union, and complement.
We call events A and B disjoint or mutually exclusive if A and B have no
outcomes in common; in set terminology: A∩B = ∅. For example, the event L
“the birthday falls in a long month” and the event {Feb} are disjoint.
Finally, we say that event A implies event B if the outcomes of A also lie
in B. In set notation: A ⊂ B; see Figure 2.2.
Some people like to use double negations:
“It is certainly not true that neither John nor Mary is to blame.”
This is equivalent to: “John or Mary is to blame, or both.” The following
useful rules formalize this mental operation to a manipulation with events.
DeMorgan’s laws. For any two events A and B we have
(A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c .
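For a finite sample space the laws are easy to verify by brute force; the following sketch checks both identities for the events L and R of the birthday example, using Python sets purely as an illustration.

omega = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}
L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}

def complement(A):
    return omega - A

# DeMorgan's laws: (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c.
assert complement(L | R) == complement(L) & complement(R)
assert complement(L & R) == complement(L) | complement(R)
print("DeMorgan's laws hold for L and R")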
Quick exercise 2.2 Let J be the event “John is to blame” and M the event
“Mary is to blame.” Express the two statements above in terms of the events
J, J^c, M, and M^c, and check the equivalence of the statements by means of
DeMorgan’s laws.
[Figure 2.2: two diagrams in Ω, one showing disjoint sets A and B, the other showing A as a subset of B.]
Fig. 2.2. Minimal and maximal intersection of two sets.
2.3 Probability
We want to express how likely it is that an event occurs. To do this we will
assign a probability to each event. The assignment of probabilities to events is
in general not an easy task, and some of the coming chapters will be dedicated
directly or indirectly to this problem. Since each event has to be assigned a
probability, we speak of a probability function. It has to satisfy two basic
properties.
Definition. A probability function P on a finite sample space Ω
assigns to each event A in Ω a number P(A) in [0,1] such that
(i) P(Ω) = 1, and
(ii) P(A ∪ B) = P(A) + P(B) if A and B are disjoint.
The number P(A) is called the probability that A occurs.
Property (i) expresses that the outcome of the experiment is always an element
of the sample space, and property (ii) is the additivity property of a probability
function. It implies additivity of the probability function over more than two
sets; e.g., if A, B, and C are disjoint events, then the two events A ∪ B and
C are also disjoint, so
P(A ∪ B ∪ C) = P(A ∪ B) + P(C) = P(A) + P(B) + P(C) .
We will now look at some examples. When we want to decide whether Peter
or Paul has to wash the dishes, we might toss a coin. The fact that we consider
this a fair way to decide translates into the opinion that heads and tails are
equally likely to occur as the outcome of the coin-tossing experiment. So we
put
P({H}) = P({T}) = 1/2 .
Formally we have to write {H} for the set consisting of the single element H,
because a probability function is defined on events, not on outcomes. From
now on we shall drop these brackets.
Now it might happen, for example due to an asymmetric distribution of the
mass over the coin, that the coin is not completely fair. For example, it might
be the case that
P(H) = 0.4999 and P(T ) = 0.5001.
More generally we can consider experiments with two possible outcomes, say
“failure” and “success”, which have probabilities 1 − p and p to occur, where
p is a number between 0 and 1. For example, when our experiment consists
of buying a ticket in a lottery with 10 000 tickets and only one prize, where
“success” stands for winning the prize, then p = 10^−4.
How should we assign probabilities in the second experiment, where we ask
for the month in which the next person we meet has his or her birthday? In
analogy with what we have just done, we put
P(Jan) = P(Feb) = · · · = P(Dec) = 1/12 .
Some of you might object to this and propose that we put, for example,
P(Jan) = 31/365 and P(Apr) = 30/365 ,
because we have long months and short months. But then the very precise
among us might remark that this does not yet take care of leap years.
Quick exercise 2.3 If you would take care of the leap years, assuming that
one in every four years is a leap year (which again is an approximation to
reality!), how would you assign a probability to each month?
In the third experiment (the buckling load of a bridge), where the outcomes are
real numbers, it is impossible to assign a positive probability to each outcome
(there are just too many outcomes!). We shall come back to this problem in
Chapter 5, restricting ourselves in this chapter to finite and countably infinite¹
sample spaces.
¹ This means: although infinite, we can still count them one by one; Ω = {ω1, ω2, . . .}. The interval [0, 1] of real numbers is an example of an uncountable sample space.
In the fourth experiment it makes sense to assign equal probabilities to all six
outcomes:
P(123) = P(132) = P(213) = P(231) = P(312) = P(321) = 1/6 .
Until now we have only assigned probabilities to the individual outcomes of the
experiments. To assign probabilities to events we use the additivity property.
For instance, to find the probability P(T ) of the event T that in the three
envelopes experiment envelope 2 is on top we note that
P(T ) = P(213) + P(231) = 1/6 + 1/6 = 1/3 .
In general, additivity of P implies that the probability of an event is obtained
by summing the probabilities of the outcomes belonging to the event.
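In a computation this is just a sum over the outcomes in the event; a small sketch, using exact fractions:

from fractions import Fraction

# Three-envelope experiment: each of the 6 orderings has probability 1/6.
p = {o: Fraction(1, 6) for o in ("123", "132", "213", "231", "312", "321")}

T = {"213", "231"}                     # event: envelope 2 is on top
print(sum(p[o] for o in T))            # 1/3

in_order = {"123", "321"}              # event: the envelopes are in order
print(sum(p[o] for o in in_order))     # 1/3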
Quick exercise 2.4 Compute P(L) and P(R) in the birthday experiment.
Finally we mention a rule that permits us to compute probabilities of events
A and B that are not disjoint. Note that we can write A = (A ∩ B) ∪ (A ∩ B^c),
which is a disjoint union; hence
P(A) = P(A ∩ B) + P(A ∩ B^c) .
If we split A ∪ B in the same way with B and B^c, we obtain the events
(A ∪ B) ∩ B, which is simply B, and (A ∪ B) ∩ B^c, which is nothing but A ∩ B^c. Thus
P(A ∪ B) = P(B) + P(A ∩ B^c) .
Eliminating P(A ∩ B^c) from these two equations we obtain the following rule.
The probability of a union. For any two events A and B we
have
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) .
From the additivity property we can also find a way to compute probabilities
of complements of events: from A ∪ A^c = Ω, we deduce that
P(A^c) = 1 − P(A) .
2.4 Products of sample spaces
Basic to statistics is that one usually does not consider one experiment, but
that the same experiment is performed several times. For example, suppose
we throw a coin two times. What is the sample space associated with this new
experiment? It is clear that it should be the set
Ω = {H, T } × {H, T } = {(H, H), (H, T ), (T, H), (T, T )}.
If in the original experiment we had a fair coin, i.e., P(H) = P(T ), then in
this new experiment all 4 outcomes again have equal probabilities:
P((H, H)) = P((H, T)) = P((T, H)) = P((T, T)) = 1/4 .
Somewhat more generally, if we consider two experiments with sample spaces
Ω1 and Ω2 then the combined experiment has as its sample space the set
Ω = Ω1 × Ω2 = {(ω1, ω2) : ω1 ∈ Ω1, ω2 ∈ Ω2}.
If Ω1 has r elements and Ω2 has s elements, then Ω1 × Ω2 has rs elements.
Now suppose that in the first, the second, and the combined experiment all
outcomes are equally likely to occur. Then the outcomes in the first experi-
ment have probability 1/r to occur, those of the second experiment 1/s, and
those of the combined experiment probability 1/rs. Motivated by the fact that
1/rs = (1/r) × (1/s), we will assign probability pi pj to the outcome (ωi, ωj)
in the combined experiment, in the case that ωi has probability pi and ωj has
probability pj to occur. One should realize that this is by no means the only
way to assign probabilities to the outcomes of a combined experiment. The
preceding choice corresponds to the situation where the two experiments do
not influence each other in any way. What we mean by this influence will be
explained in more detail in the next chapter.
Quick exercise 2.5 Consider the sample space {a1, a2, a3, a4, a5, a6} of some
experiment, where outcome ai has probability pi for i = 1, . . . , 6. We perform
this experiment twice in such a way that the associated probabilities are
P((ai, ai)) = pi, and P((ai, aj)) = 0 if i ≠ j, for i, j = 1, . . . , 6.
Check that P is a probability function on the sample space Ω = {a1, . . . , a6}×
{a1, . . . , a6} of the combined experiment. What is the relationship between
the first experiment and the second experiment that is determined by this
probability function?
We started this section with the experiment of throwing a coin twice. If we
want to learn more about the randomness associated with a particular exper-
iment, then we should repeat it more often, say n times. For example, if we
perform an experiment with outcomes 1 (success) and 0 (failure) five times,
and we consider the event A “exactly one experiment was a success,” then
this event is given by the set
A = {(0, 0, 0, 0, 1), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 1, 0, 0, 0), (1, 0, 0, 0, 0)}
in Ω = {0, 1} × {0, 1} × {0, 1} × {0, 1} × {0, 1}. Moreover, if success has
probability p and failure probability 1 − p, then
P(A) = 5 · (1 − p)^4 · p,
since there are five outcomes in the event A, each having probability (1 − p)^4 · p.
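Such a probability can be checked by enumerating the product sample space; the sketch below runs through all 32 outcomes of five success/failure experiments for an arbitrarily chosen p, and also computes the probability of exactly two successes, which you can compare with your answer to Quick exercise 2.6.

from itertools import product

p = 0.3   # success probability, chosen arbitrarily for the illustration

def prob(outcome):
    """Probability of an outcome (ω1, ..., ω5), each ωi being 0 or 1."""
    result = 1.0
    for w in outcome:
        result *= p if w == 1 else 1 - p
    return result

omega = list(product([0, 1], repeat=5))
p_exactly_one = sum(prob(w) for w in omega if sum(w) == 1)
print(p_exactly_one, 5 * (1 - p) ** 4 * p)   # both equal P(A)

p_exactly_two = sum(prob(w) for w in omega if sum(w) == 2)
print(p_exactly_two)   # compare with your answer to Quick exercise 2.6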
Quick exercise 2.6 What is the probability of the event B “exactly two
experiments were successful”?
In general, when we perform an experiment n times, then the corresponding
sample space is
Ω = Ω1 × Ω2 × · · · × Ωn,
where Ωi for i = 1, . . . , n is a copy of the sample space of the original exper-
iment. Moreover, we assign probabilities to the outcomes (ω1, . . . , ωn) in the
standard way described earlier, i.e.,
P((ω1, ω2, . . . , ωn)) = p1 · p2 · · · pn,
if each ωi has probability pi.
2.5 An infinite sample space
We end this chapter with an example of an experiment with infinitely many
outcomes. We toss a coin repeatedly until the first head turns up. The outcome
of the experiment is the number of tosses it takes to have this first occurrence
of a head. Our sample space is the space of all positive natural numbers
Ω = {1, 2, 3, . . .}.
What is the probability function P for this experiment?
Suppose the coin has probability p of falling on heads and probability 1−p to
fall on tails, where 0 < p < 1. We determine the probability P(n) for each n.
Clearly P(1) = p, the probability that we have a head right away. The event
{2} corresponds to the outcome (T, H) in {H, T }×{H, T }, so we should have
P(2) = (1 − p)p.
Similarly, the event {n} corresponds to the outcome (T, T, . . ., T, T, H) in the
space {H, T } × · · · × {H, T }. Hence we should have, in general,
P(n) = (1 − p)^(n−1) p, n = 1, 2, 3, . . . .
Does this define a probability function on Ω = {1, 2, 3, . . .}? Then we should
at least have P(Ω) = 1. It is not directly clear how to calculate P(Ω): since
the sample space is no longer finite we have to amend the definition of a
probability function.
Definition. A probability function on an infinite (or finite) sample
space Ω assigns to each event A in Ω a number P(A) in [0, 1] such
that
(i) P(Ω) = 1, and
(ii) P(A1 ∪ A2 ∪ A3 ∪ · · ·) = P(A1) + P(A2) + P(A3) + · · ·
if A1, A2, A3, . . . are disjoint events.
Note that this new additivity property is an extension of the previous one
because if we choose A3 = A4 = · · · = ∅, then
P(A1 ∪ A2) = P(A1 ∪ A2 ∪ ∅ ∪ ∅ ∪ · · ·)
= P(A1) + P(A2) + 0 + 0 + · · · = P(A1) + P(A2) .
Now we can compute the probability of Ω:
P(Ω) = P(1) + P(2) + · · · + P(n) + · · ·
= p + (1 − p)p + · · · + (1 − p)^(n−1) p + · · ·
= p[1 + (1 − p) + · · · + (1 − p)^(n−1) + · · · ].
The sum 1 + (1 − p) + · · · + (1 − p)^(n−1) + · · · is an example of a geometric
series. It is well known that when |1 − p| < 1,
1 + (1 − p) + · · · + (1 − p)^(n−1) + · · · = 1 / (1 − (1 − p)) = 1/p .
Therefore we do indeed have P(Ω) = p · (1/p) = 1.
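The convergence is easy to watch numerically; the sketch below sums P(n) = (1 − p)^(n−1) p over the first N outcomes and sees the total approach 1 as N grows (the value of p is chosen arbitrarily).

p = 0.25   # probability of heads, chosen arbitrarily for the illustration

def P(n):
    """Probability that the first head appears on toss n."""
    return (1 - p) ** (n - 1) * p

for N in (5, 10, 25, 100):
    print(N, sum(P(n) for n in range(1, N + 1)))   # tends to 1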
Quick exercise 2.7 Suppose an experiment in a laboratory is repeated every
day of the week until it is successful, the probability of success being p. The
first experiment is started on a Monday. What is the probability that the
series ends on the next Sunday?
2.6 Solutions to the quick exercises
2.1 The sample space is Ω = {1234, 1243, 1324, 1342, . . ., 4321}. The best way
to count its elements is by noting that for each of the 6 outcomes of the three-
envelope experiment we can put a fourth envelope in any of 4 positions. Hence
Ω has 4 · 6 = 24 elements.
2.2 The statement “It is certainly not true that neither John nor Mary is to
blame” corresponds to the event (Jc
∩Mc
)c
. The statement “John or Mary is
to blame, or both” corresponds to the event J ∪ M. Equivalence now follows
from DeMorgan’s laws.
2.3 In four years we have 365×3+366 = 1461 days. Hence long months each
have a probability 4 × 31/1461 = 124/1461, and short months a probability
120/1461 to occur. Moreover, {Feb} has probability 113/1461.
2.4 Since there are 7 long months and 8 months with an “r” in their name,
we have P(L) = 7/12 and P(R) = 8/12.
2.5 Checking that P is a probability function on Ω amounts to verifying that
0 ≤ P((ai, aj)) ≤ 1 for all i and j and noting that
P(Ω) = Σ_{i,j=1}^{6} P((ai, aj)) = Σ_{i=1}^{6} P((ai, ai)) = Σ_{i=1}^{6} pi = 1.
The two experiments are totally coupled: one has outcome ai if and only if
the other has outcome ai.
2.6 Now there are 10 outcomes in B (for example (0,1,0,1,0)), each having
probability (1 − p)^3 p^2. Hence P(B) = 10(1 − p)^3 p^2.
2.7 This happens if and only if the experiment fails on Monday, . . . , Saturday,
and is a success on Sunday. This has probability p(1 − p)^6 to happen.
2.7 Exercises
2.1 Let A and B be two events in a sample space for which P(A) = 2/3,
P(B) = 1/6, and P(A ∩ B) = 1/9. What is P(A ∪ B)?
2.2 Let E and F be two events for which one knows that the probability that
at least one of them occurs is 3/4. What is the probability that neither E nor
F occurs? Hint: use one of DeMorgan’s laws: E^c ∩ F^c = (E ∪ F)^c.
2.3 Let C and D be two events for which one knows that P(C) = 0.3, P(D) =
0.4, and P(C ∩ D) = 0.2. What is P(C^c ∩ D)?
2.4 We consider events A, B, and C, which can occur in some experiment.
Is it true that the probability that only A occurs (and not B or C) is equal
to P(A ∪ B ∪ C) − P(B) − P(C) + P(B ∩ C)?
2.5 The event A ∩ B^c that A occurs but not B is sometimes denoted as A \ B.
Here \ is the set-theoretic minus sign. Show that P(A \ B) = P(A) − P(B) if
B implies A, i.e., if B ⊂ A.
2.6 When P(A) = 1/3, P(B) = 1/2, and P(A ∪ B) = 3/4, what is
a. P(A ∩ B)?
b. P(A^c ∪ B^c)?
2.7 Let A and B be two events. Suppose that P(A) = 0.4, P(B) = 0.5, and
P(A ∩ B) = 0.1. Find the probability that A or B occurs, but not both.
2.8 Suppose the events D1 and D2 represent disasters, which are rare:
P(D1) ≤ 10^−6 and P(D2) ≤ 10^−6. What can you say about the probability
that at least one of the disasters occurs? What about the probability that
they both occur?
2.9 We toss a coin three times. For this experiment we choose the sample
space
Ω = {HHH, THH, HTH, HHT, TTH, THT, HTT, TTT}
where T stands for tails and H for heads.
a. Write down the set of outcomes corresponding to each of the following
events:
A : “we throw tails exactly two times.”
B : “we throw tails at least two times.”
C : “tails did not appear before a head appeared.”
D : “the first throw results in tails.”
b. Write down the set of outcomes corresponding to each of the following
events: A^c, A ∪ (C ∩ D), and A ∩ D^c.
2.10 In some sample space we consider two events A and B. Let C be the
event that A or B occurs, but not both. Express C in terms of A and B, using
only the basic operations “union,” “intersection,” and “complement.”
2.11 An experiment has only two outcomes. The first has probability p to
occur, the second probability p^2. What is p?
2.12 In the UEFA Euro 2004 playoffs draw 10 national football teams
were matched in pairs. A lot of people complained that “the draw was not
fair,” because each strong team had been matched with a weak team (this
is commercially the most interesting). It was claimed that such a matching
is extremely unlikely. We will compute the probability of this “dream draw”
in this exercise. In the spirit of the three-envelope example of Section 2.1
we put the names of the 5 strong teams in envelopes labeled 1, 2, 3, 4, and
5 and of the 5 weak teams in envelopes labeled 6, 7, 8, 9, and 10. We shuffle
the 10 envelopes and then match the envelope on top with the next envelope,
the third envelope with the fourth envelope, and so on. One particular way
a “dream draw” occurs is when the five envelopes labeled 1, 2, 3, 4, 5 are in
the odd numbered positions (in any order!) and the others are in the even
numbered positions. This way corresponds to the situation where the first
match of each strong team is a home match. Since for each pair there are
two possibilities for the home match, the total number of possibilities for the
“dream draw” is 2^5 = 32 times as large.
a. An outcome of this experiment is a sequence like 4, 9, 3, 7, 5, 10, 1, 8, 2, 6 of
labels of envelopes. What is the probability of an outcome?
b. How many outcomes are there in the event “the five envelopes labeled
1, 2, 3, 4, 5 are in the odd positions—in any order, and the envelopes la-
beled 6, 7, 8, 9, 10 are in the even positions—in any order”?
c. What is the probability of a “dream draw”?
2.13 In some experiment first an arbitrary choice is made out of four pos-
sibilities, and then an arbitrary choice is made out of the remaining three
possibilities. One way to describe this is with a product of two sample spaces
{a, b, c, d}:
Ω = {a, b, c, d} × {a, b, c, d}.
a. Make a 4×4 table in which you write the probabilities of the outcomes.
b. Describe the event “c is one of the chosen possibilities” and determine its
probability.
2.14 Consider the Monty Hall “experiment” described in Section 1.3. The
door behind which the car is parked we label a, the other two b and c. As the
sample space we choose a product space
Ω = {a, b, c} × {a, b, c}.
Here the first entry gives the choice of the candidate, and the second entry
the choice of the quizmaster.
a. Make a 3×3 table in which you write the probabilities of the outcomes.
N.B. You should realize that the candidate does not know that the car
is in a, but the quizmaster will never open the door labeled a because he
knows that the car is there. You may assume that the quizmaster makes
an arbitrary choice between the doors labeled b and c, when the candidate
chooses door a.
b. Consider the situation of a “no switching” candidate who will stick to his
or her choice. What is the event “the candidate wins the car,” and what
is its probability?
c. Consider the situation of a “switching” candidate who will not stick to
her choice. What is now the event “the candidate wins the car,” and what
is its probability?
2.15 The rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B) from Section 2.3 is often
useful to compute the probability of the union of two events. What would be
the corresponding rule for three events A, B, and C? It should start with
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − · · · .
Hint: you could use the sum rule suitably, or you could make a diagram as in
Figure 2.1.
2.16 Three events E, F, and G cannot occur simultaneously. Further it
is known that P(E ∩ F) = P(F ∩ G) = P(E ∩ G) = 1/3. Can you deter-
mine P(E)?
Hint: if you try to use the formula of Exercise 2.15 then it seems that you do
not have enough information; make a diagram instead.
2.17 A post office has two counters where customers can buy stamps, etc.
If you are interested in the number of customers in the two queues that will
form for the counters, what would you take as sample space?
2.18 In a laboratory, two experiments are repeated every day of the week in
different rooms until at least one is successful, the probability of success be-
ing p for each experiment. Supposing that the experiments in different rooms
and on different days are performed independently of each other, what is the
probability that the laboratory scores its first successful experiment on day n?
2.19 We repeatedly toss a coin. A head has probability p, and a tail prob-
ability 1 − p to occur, where 0 < p < 1. The outcome of the experiment we
are interested in is the number of tosses it takes until a head occurs for the
second time.
a. What would you choose as the sample space?
b. What is the probability that it takes 5 tosses?
3
Conditional probability and independence
Knowing that an event has occurred sometimes forces us to reassess the prob-
ability of another event; the new probability is the conditional probability. If
the conditional probability equals what the probability was before, the events
involved are called independent. Often, conditional probabilities and indepen-
dence are needed if we want to compute probabilities, and in many other
situations they simplify the work.
3.1 Conditional probability
In the previous chapter we encountered the events L, “born in a long month,”
and R, “born in a month with the letter r.” Their probabilities are easy to
compute: since L = {Jan, Mar, May, Jul, Aug, Oct, Dec} and R = {Jan, Feb,
Mar, Apr, Sep, Oct, Nov, Dec}, one finds
P(L) = 7/12 and P(R) = 8/12 .
Now suppose that it is known about the person we meet in the street that
he was born in a “long month,” and we wonder whether he was born in
a “month with the letter r.” The information given excludes five outcomes
of our sample space: it cannot be February, April, June, September, or
November. Seven possible outcomes are left, of which only four—those in
R ∩ L = {Jan, Mar, Oct, Dec}—are favorable, so we reassess the probability
as 4/7. We call this the conditional probability of R given L, and we write:
P(R | L) = 4/7 .
This is not the same as P(R ∩ L), which is 1/3. Also note that P(R | L) is the
proportion that P(R ∩ L) is of P(L).
Quick exercise 3.1 Let N = R^c be the event “born in a month without r.”
What is the conditional probability P(N | L)?
Recalling the three envelopes on our doormat, consider the events “envelope 1
is the middle one” (call this event A) and “envelope 2 is the middle one” (B).
Then P(A) = P(213 or 312) = 1/3; by symmetry, the same is found for P(B).
We say that the envelopes are in order if their order is either 123 or 321.
Suppose we know that they are not in order, but otherwise we do not know
anything; what are the probabilities of A and B, given this information?
Let C be the event that the envelopes are not in order, so: C = {123, 321}^c =
{132, 213, 231, 312}. We ask for the probabilities of A and B, given that C
occurs. Event C consists of four elements, two of which also belong to A:
A ∩ C = {213, 312}, so P(A | C) = 1/2. The probability of A ∩ C is half of
P(C). No element of C also belongs to B, so P(B | C) = 0.
Quick exercise 3.2 Calculate P(C | A) and P(C^c | A ∪ B).
In general, computing the probability of an event A, given that an event C
occurs, means finding which fraction of the probability of C is also in the
event A.
Definition. The conditional probability of A given C is given by:
P(A | C) = P(A ∩ C) / P(C) ,
provided P(C) > 0.
Quick exercise 3.3 Show that P(A | C) + P(A^c | C) = 1.
This exercise shows that the rule P(A^c) = 1 − P(A) also holds for conditional
probabilities. In fact, even more is true: if we have a fixed conditioning event C
and define Q(A) = P(A | C) for events A ⊂ Ω, then Q is a probability function
and hence satisfies all the rules as described in Chapter 2. The definition of
conditional probability agrees with our intuition and it also works in situations
where computing probabilities by counting outcomes does not.
A chemical reactor: residence times
Consider a continuously stirred reactor vessel where a chemical reaction takes
place. On one side fluid or gas flows in, mixes with whatever is already present
in the vessel, and eventually flows out on the other side. One characteristic
of each particular reaction setup is the so-called residence time distribution,
which tells us how long particles stay inside the vessel before moving on. We
consider a continuously stirred tank: the contents of the vessel are perfectly
mixed at all times.
Let Rt denote the event “the particle has a residence time longer than t
seconds.” In Section 5.3 we will see how continuous stirring determines the
probabilities; here we just use that in a particular continuously stirred tank,
Rt has probability e^−t. So:
P(R3) = e^−3 = 0.04978 . . .
P(R4) = e^−4 = 0.01831 . . . .
We can use the definition of conditional probability to find the probability
that a particle that has stayed more than 3 seconds will stay more than 4:
P(R4 | R3) = P(R4 ∩ R3) / P(R3) = P(R4) / P(R3) = e^−4 / e^−3 = e^−1 = 0.36787 . . . .
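The same computation takes only a few lines; since R4 ⊂ R3, the intersection R4 ∩ R3 is simply R4, and the sketch below also checks that the answer does not depend on the starting time.

from math import exp

def P_R(t):
    """P(residence time longer than t seconds) for this stirred tank."""
    return exp(-t)

# P(R4 | R3) = P(R4 ∩ R3) / P(R3) = P(R4) / P(R3), since R4 is contained in R3.
print(P_R(4) / P_R(3))                    # e^{-1} = 0.36787...

# The ratio is the same for any starting time t:
for t in (0.0, 1.0, 5.0, 10.0):
    print(t, P_R(t + 1) / P_R(t))         # always e^{-1}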
Quick exercise 3.4 Calculate P(R3 | R4^c).
For more details on the subject of residence time distributions see, for example,
the book on reaction engineering by Fogler ([11]).
3.2 The multiplication rule
From the definition of conditional probability we derive a useful rule by mul-
tiplying left and right by P(C).
The multiplication rule. For any events A and C:
P(A ∩ C) = P(A | C) · P(C) .
Computing the probability of A∩C can hence be decomposed into two parts,
computing P(C) and P(A | C) separately, which is often easier than computing
P(A ∩ C) directly.
The probability of no coincident birthdays
Suppose you meet two arbitrarily chosen people. What is the probability their
birthdays are different? Let B2 denote the event that this happens. Whatever
the birthday of the first person is, there is only one day the second person
cannot “pick” as birthday, so:
P(B2) = 1 − 1/365 .
When the same question is asked with three people, conditional probabilities
become helpful. The event B3 can be seen as the intersection of the event B2,
“the first two have different birthdays,” with event A3 “the third person has
a birthday that does not coincide with that of one of the first two persons.”
Using the multiplication rule:
P(B3) = P(A3 ∩ B2) = P(A3 | B2)P(B2) .
The conditional probability P(A3 | B2) is the probability that, when two days
are already marked on the calendar, a day picked at random is not marked,
or
P(A3 | B2) = 1 − 2/365 ,
and so
P(B3) = P(A3 | B2)P(B2) = (1 − 2/365) · (1 − 1/365) = 0.9918.
We are already halfway to solving the general question: in a group of n arbi-
trarily chosen people, what is the probability there are no coincident birth-
days? The event Bn of no coincident birthdays among the n persons is the
same as: “the birthdays of the first n − 1 persons are different” (the event
Bn−1) and “the birthday of the nth person does not coincide with a birthday
of any of the first n − 1 persons” (the event An), that is,
Bn = An ∩ Bn−1.
Applying the multiplication rule yields:
P(Bn) = P(An | Bn−1) · P(Bn−1) = (1 − (n − 1)/365) · P(Bn−1)
as person n should avoid n − 1 days. Applying the same step to P(Bn−1),
P(Bn−2), etc., we find:
P(Bn) = (1 − (n − 1)/365) · P(An−1 | Bn−2) · P(Bn−2)
= (1 − (n − 1)/365) · (1 − (n − 2)/365) · P(Bn−2)
...
= (1 − (n − 1)/365) · · · (1 − 2/365) · P(B2)
= (1 − (n − 1)/365) · · · (1 − 2/365) · (1 − 1/365) .
This can be used to compute the probability for arbitrary n. For example,
we find: P(B22) = 0.5243 and P(B23) = 0.4927. In Figure 3.1 the probability
P(Bn) is plotted for n = 1, . . . , 100, with dotted lines drawn at n = 23 and
at probability 0.5. It may be hard to believe, but with just 23 people the
probability of all birthdays being different is less than 50%!
[Figure 3.1: a plot of P(Bn) against n, decreasing from 1.0 toward 0.0 as n runs from 1 to 100.]
Fig. 3.1. The probability P(Bn) of no coincident birthdays for n = 1, . . . , 100.
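The product formula is easy to evaluate by machine; the sketch below computes P(Bn) and confirms that n = 23 is the first group size for which the probability of no coincident birthdays drops below one half.

def p_no_coincidence(n, days=365):
    """P(Bn): the probability that n people all have different birthdays."""
    prob = 1.0
    for k in range(1, n):
        prob *= 1 - k / days
    return prob

print(round(p_no_coincidence(22), 4))   # 0.5243
print(round(p_no_coincidence(23), 4))   # 0.4927

first = next(n for n in range(1, 101) if p_no_coincidence(n) < 0.5)
print(first)                            # 23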
Quick exercise 3.5 Compute the probability that three arbitrary people are
born in different months. Can you give the formula for n people?
It matters how one conditions
Conditioning can help to make computations easier, but it matters how it is
applied. To compute P(A ∩ C) we may condition on C to get
P(A ∩ C) = P(A | C) · P(C) ;
or we may condition on A and get
P(A ∩ C) = P(C | A) · P(A) .
Both ways are valid, but often one of P(A | C) and P(C | A) is easy and the
other is not. For example, in the birthday example one could have tried:
P(B3) = P(A3 ∩ B2) = P(B2 | A3)P(A3) ,
but just trying to understand the conditional probability P(B2 | A3) already
is confusing:
The probability that the first two persons’ birthdays differ given that
the third person’s birthday does not coincide with the birthday of one
of the first two . . . ?
Conditioning should lead to easier probabilities; if not, it is probably the
wrong approach.
3.3 The law of total probability and Bayes’ rule
We will now discuss two important rules that help probability computations
by means of conditional probabilities. We introduce both of them in the next
example.
Testing for mad cow disease
In early 2001 the European Commission introduced massive testing of cattle
to determine infection with the transmissible form of Bovine Spongiform En-
cephalopathy (BSE) or “mad cow disease.” As no test is 100% accurate, most
tests have the problem of false positives and false negatives. A false positive
means that according to the test the cow is infected, but in actuality it is not.
A false negative means an infected cow is not detected by the test.
Imagine we test a cow. Let B denote the event “the cow has BSE” and T
the event “the test comes up positive” (this is test jargon for: according to
the test we should believe the cow is infected with BSE). One can “test the
test” by analyzing samples from cows that are known to be infected or known
to be healthy and so determine the effectiveness of the test. The European
Commission had this done for four tests in 1999 (see [19]) and for several
more later. The results for what the report calls Test A may be summarized
as follows: an infected cow has a 70% chance of testing positive, and a healthy
cow just 10%; in formulas:
P(T | B) = 0.70,
P(T | B^c) = 0.10.
Suppose we want to determine the probability P(T ) that an arbitrary cow
tests positive. The tested cow is either infected or it is not: event T occurs in
combination with B or with B^c
(there are no other possibilities). In terms of
events
T = (T ∩ B) ∪ (T ∩ B^c),
so that
P(T ) = P(T ∩ B) + P(T ∩ B^c) ,
because T ∩ B and T ∩ B^c are disjoint. Next, apply the multiplication rule (in
such a way that the known conditional probabilities appear!):
P(T ∩ B) = P(T | B) · P(B)
P(T ∩ B^c) = P(T | B^c) · P(B^c)    (3.1)
so that
P(T ) = P(T | B) · P(B) + P(T | B^c) · P(B^c) .    (3.2)
This is an application of the law of total probability: computing a probability
through conditioning on several disjoint events that make up the whole sample
space (in this case two). Suppose¹ P(B) = 0.02; then from the last equation
we conclude: P(T ) = 0.02 · 0.70 + (1 − 0.02) · 0.10 = 0.112.
¹ We choose this probability for the sake of the calculations that follow. The true value is unknown and varies from country to country. The BSE risk for the Netherlands for 2003 was estimated to be P(B) ≈ 0.000013.
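The same computation as a small sketch, with the numbers used above:

p_B = 0.02            # assumed prior probability that the cow has BSE
p_T_given_B = 0.70    # probability of a positive test for an infected cow
p_T_given_Bc = 0.10   # probability of a positive test for a healthy cow

# Law of total probability, conditioning on the two cases B and B^c.
p_T = p_T_given_B * p_B + p_T_given_Bc * (1 - p_B)
print(p_T)   # 0.112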
Quick exercise 3.6 Calculate P(T ) when P(T | B) = 0.99 and P(T | B^c) = 0.05.
Following is a general statement of the law.
The law of total probability. Suppose C1, C2, . . . , Cm are
disjoint events such that C1 ∪ C2 ∪ · · · ∪ Cm = Ω. The probability of
an arbitrary event A can be expressed as:
P(A) = P(A | C1)P(C1) + P(A | C2)P(C2) + · · · + P(A | Cm)P(Cm) .
Figure 3.2 illustrates the law for m = 5. The event A is the disjoint union of
A∩Ci, for i = 1, . . . , 5, so P(A) = P(A ∩ C1)+· · ·+P(A ∩ C5), and for each i
the multiplication rule states P(A ∩ Ci) = P(A | Ci) · P(Ci).
[Figure 3.2: the sample space Ω partitioned into disjoint events C1, . . . , C5, with the event A split into the pieces A ∩ C1, . . . , A ∩ C5.]
Fig. 3.2. The law of total probability (illustration for m = 5).
In the BSE example, we have just two mutually exclusive events: substitute m = 2, C1 = B, C2 = B^c, and A = T to obtain (3.2).
Another, perhaps more pertinent, question about the BSE test is the following:
suppose my cow tests positive; what is the probability it really has BSE?
Translated, this asks for the value of P(B | T ). The information we were given
is P(T | B), a conditional probability, but the wrong one. We would like to
switch T and B.
Start with the definition of conditional probability and then use equations
(3.1) and (3.2):
1 We choose this probability for the sake of the calculations that follow. The true value is unknown and varies from country to country. The BSE risk for the Netherlands for 2003 was estimated to be P(B) ≈ 0.000013.
P(B | T) = P(T ∩ B) / P(T) = P(T | B) · P(B) / [P(T | B) · P(B) + P(T | B^c) · P(B^c)].
So with P(B) = 0.02 we find
P(B | T) = (0.70 · 0.02) / (0.70 · 0.02 + 0.10 · (1 − 0.02)) = 0.125,
and by a similar calculation: P(B | T^c) = 0.0068. These probabilities reflect that Test A is not a very good test; a perfect test would result in P(B | T) = 1 and P(B | T^c) = 0. In Exercise 3.4 we redo this calculation, replacing P(B) = 0.02 with a more realistic number.
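The two-event calculation above is easy to script; a minimal Python sketch (our own illustration, with a helper name chosen only for this example) that reproduces the numbers for Test A and also evaluates the more realistic prior from the footnote:

def bayes_positive(prior_b, p_pos_given_b, p_pos_given_not_b):
    """P(B | T): probability of infection given a positive test, by Bayes' rule."""
    p_pos = p_pos_given_b * prior_b + p_pos_given_not_b * (1 - prior_b)
    return p_pos_given_b * prior_b / p_pos

# Test A: P(T | B) = 0.70, P(T | B^c) = 0.10, prior P(B) = 0.02
print(bayes_positive(0.02, 0.70, 0.10))       # 0.125
# with the more realistic Dutch prior P(B) = 0.000013
print(bayes_positive(0.000013, 0.70, 0.10))   # about 9.1e-05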
What we have just seen is known as Bayes’ rule, after the English clergyman
Thomas Bayes who derived this in the 18th century. The general statement
follows.
Bayes’ rule. Suppose the events C1, C2, . . . , Cm are disjoint and
C1 ∪ C2 ∪ · · · ∪ Cm = Ω. The conditional probability of Ci, given an
arbitrary event A, can be expressed as:
P(Ci | A) = P(A | Ci) · P(Ci) / [P(A | C1)P(C1) + P(A | C2)P(C2) + · · · + P(A | Cm)P(Cm)].
This is the traditional form of Bayes’ formula. It follows from
P(Ci | A) = P(A | Ci) · P(Ci) / P(A)    (3.3)
in combination with the law of total probability applied to P(A) in the de-
nominator. Purists would refer to (3.3) as Bayes’ rule, and perhaps they are
right.
Quick exercise 3.7 Calculate P(B | T) and P(B | T^c) if P(T | B) = 0.99 and P(T | B^c) = 0.05.
3.4 Independence
Consider three probabilities from the previous section:
P(B) = 0.02,   P(B | T) = 0.125,   P(B | T^c) = 0.0068.
If we know nothing about a cow, we would say that there is a 2% chance it is
infected. However, if we know it tested positive, we can say there is a 12.5%
chance the cow is infected. On the other hand, if it tested negative, there is
only a 0.68% chance. We see that the two events are related in some way: the
probability of B depends on whether T occurs.
Imagine the opposite: the test is useless. Whether the cow is infected is unre-
lated to the outcome of the test, and knowing the outcome of the test does not
change our probability of B: P(B | T ) = P(B). In this case we would call B
independent of T .
Definition. An event A is called independent of B if
P(A | B) = P(A) .
From this simple definition many statements can be derived. For example, because P(A^c | B) = 1 − P(A | B) and 1 − P(A) = P(A^c), we conclude:
A independent of B ⇔ A^c independent of B.    (3.4)
By application of the multiplication rule, if A is independent of B, then
P(A ∩ B) = P(A | B)P(B) = P(A) P(B). On the other hand, if P(A ∩ B) = P(A) P(B), then P(A | B) = P(A) follows from the definition of conditional probability.
This shows:
A independent of B ⇔ P(A ∩ B) = P(A) P(B) .
Finally, by definition of conditional probability, if A is independent of B, then
P(B | A) = P(A ∩ B) / P(A) = P(A) · P(B) / P(A) = P(B),
that is, B is independent of A. This works in reverse, too, so we have:
A independent of B ⇔ B independent of A.    (3.5)
This statement says that in fact, independence is a mutual property. Therefore,
the expressions “A is independent of B” and “A and B are independent” are
used interchangeably. From the three ⇔-statements it follows that there are
in fact 12 ways to show that A and B are independent; and if they are, there
are 12 ways to use that.
Independence. To show that A and B are independent it suffices
to prove just one of the following:
P(A | B) = P(A) ,
P(B | A) = P(B) ,
P(A ∩ B) = P(A) P(B) ,
where A may be replaced by A^c and B replaced by B^c, or both. If one of these statements holds, all of them are true. If two events are not independent, they are called dependent.
Recall the birthday events L “born in a long month” and R “born in a month
with the letter r.” Let H be the event “born in the first half of the year,”
so P(H) = 1/2. Also, P(H | R) = 1/2. So H and R are independent, and we
conclude, for example, P(R^c | H^c) = P(R^c) = 1 − 8/12 = 1/3.
We know that P(L ∩ H) = 1/4 and P(L) = 7/12. Checking that 1/2 × 7/12 = 7/24 ≠ 1/4, you conclude that L and H are dependent.
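Claims like these are easy to verify by enumerating the 12 equally likely months; a small Python sketch (ours; the sets below simply encode the events used above):

from fractions import Fraction

L = {1, 3, 5, 7, 8, 10, 12}          # long months (31 days)
R = {1, 2, 3, 4, 9, 10, 11, 12}      # months whose name contains the letter r
H = {1, 2, 3, 4, 5, 6}               # first half of the year

def prob(event):
    return Fraction(len(event), 12)  # each month has probability 1/12

def independent(A, B):
    return prob(A & B) == prob(A) * prob(B)

print(independent(H, R))   # True:  P(H ∩ R) = 4/12 = P(H) P(R)
print(independent(L, H))   # False: P(L ∩ H) = 1/4, but P(L) P(H) = 7/24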
Quick exercise 3.8 Derive the statement "R^c is independent of H^c" from "H is independent of R" using rules (3.4) and (3.5).
Since the words dependence and independence have several meanings, one
sometimes uses the terms stochastic or statistical dependence and indepen-
dence to avoid ambiguity.
Remark 3.1 (Physical and stochastic independence). Stochastic
dependence or independence can sometimes be established by inspecting
whether there is any physical dependence present. The following statements
may be made.
If events have to do with processes or experiments that have no physical con-
nection, they are always stochastically independent. If they are connected
to the same physical process, then, as a rule, they are stochastically de-
pendent, but stochastic independence is possible in exceptional cases. The
events H and R are an example.
Independence of two or more events
When more than two events are involved we need a more elaborate definition
of independence. The reason behind this is explained by an example following
the definition.
Independence of two or more events. Events A1, A2, . . . ,
Am are called independent if
P(A1 ∩ A2 ∩ · · · ∩ Am) = P(A1) P(A2) · · · P(Am)
and this statement also holds when any number of the events A1,
. . . , Am are replaced by their complements throughout the formula.
You see that we need to check 2^m equations to establish the independence of
m events. In fact, m + 1 of those equations are redundant, but we chose this
version of the definition because it is easier.
The reason we need to do so much more checking to establish independence
for multiple events is that there are subtle ways in which events may depend
on each other. Consider the question:
Is independence for three events A, B, and C the same as: A and B are
independent; B and C are independent; and A and C are independent?
The answer is “No,” as the following example shows. Perform two independent
tosses of a coin. Let A be the event “heads on toss 1,” B the event “heads on
toss 2,” and C “the two tosses are equal.”
First, get the probabilities. Of course, P(A) = P(B) = 1/2, but also
P(C) = P(A ∩ B) + P(A^c ∩ B^c) = 1/4 + 1/4 = 1/2.
What about independence? Events A and B are independent by assumption,
so check the independence of A and C. Given that the first toss is heads (A
occurs), C occurs if and only if the second toss is heads as well (B occurs), so
P(C | A) = P(B | A) = P(B) = 1/2 = P(C).
By symmetry, also P(C | B) = P(C), so all pairs taken from A, B, C are
independent: the three are called pairwise independent. Checking the full con-
ditions for independence, we find, for example:
P(A ∩ B ∩ C) = P(A ∩ B) = 1/4,   whereas   P(A) P(B) P(C) = 1/8,
and
P(A ∩ B ∩ C^c) = P(∅) = 0,   whereas   P(A) P(B) P(C^c) = 1/8.
The reason for this is clear: whether C occurs follows deterministically from
the outcomes of tosses 1 and 2.
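With only four equally likely outcomes the claims can be checked by brute force; a short Python sketch (ours):

from fractions import Fraction
from itertools import product

omega = list(product([0, 1], repeat=2))    # two fair coin tosses, 1 = heads

A = {w for w in omega if w[0] == 1}        # heads on toss 1
B = {w for w in omega if w[1] == 1}        # heads on toss 2
C = {w for w in omega if w[0] == w[1]}     # the two tosses are equal

def prob(event):
    return Fraction(len(event), len(omega))

# pairwise independence holds ...
print(prob(A & B) == prob(A) * prob(B))    # True
print(prob(A & C) == prob(A) * prob(C))    # True
print(prob(B & C) == prob(B) * prob(C))    # True
# ... but full independence fails
print(prob(A & B & C) == prob(A) * prob(B) * prob(C))    # False: 1/4 versus 1/8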
3.5 Solutions to the quick exercises
3.1 N = {May, Jun, Jul, Aug}, L = {Jan, Mar, May, Jul, Aug, Oct, Dec},
and N ∩ L = {May, Jul, Aug}. Three out of seven outcomes of L belong to
N as well, so P(N | L) = 3/7.
3.2 The event A is contained in C. So when A occurs, C also occurs; therefore
P(C | A) = 1.
Since C^c = {123, 321} and A ∪ B = {123, 321, 312, 213}, one can see that two of the four outcomes of A ∪ B belong to C^c as well, so P(C^c | A ∪ B) = 1/2.
3.3 Using the definition we find:
P(A | C) + P(A^c | C) = P(A ∩ C)/P(C) + P(A^c ∩ C)/P(C) = 1,
because C can be split into disjoint parts A ∩ C and A^c ∩ C and therefore
P(A ∩ C) + P(A^c ∩ C) = P(C).
3.4 This asks for the probability that the particle stays more than 3 seconds,
given that it does not stay longer than 4 seconds, so 4 or less. From the
definition:
P(R3 | R4^c) = P(R3 ∩ R4^c) / P(R4^c).
The event R3 ∩ R4^c describes: longer than 3 but not longer than 4 seconds. Furthermore, R3 is the disjoint union of the events R3 ∩ R4^c and R3 ∩ R4 = R4, so P(R3 ∩ R4^c) = P(R3) − P(R4) = e^{−3} − e^{−4}. Using the complement rule: P(R4^c) = 1 − P(R4) = 1 − e^{−4}. Together:
P(R3 | R4^c) = (e^{−3} − e^{−4}) / (1 − e^{−4}) = 0.0315/0.9817 = 0.0321.
3.5 Instead of a calendar of 365 days, we have one with just 12 months. Let
Cn be the event n arbitrary persons have different months of birth. Then
P(C3) = (1 − 2/12) · (1 − 1/12) = 55/72 = 0.7639
and it is no surprise that this is much smaller than P(B3). The general formula is
P(Cn) = (1 − (n − 1)/12) · · · (1 − 2/12) · (1 − 1/12).
Note that it is correct even if n is 13 or more, in which case P(Cn) = 0.
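The same product is easily evaluated for any group size; a brief Python sketch (ours) of the general formula:

def prob_all_different_months(n, m=12):
    """P(Cn): n persons all have different birth months, with m equally likely months."""
    p = 1.0
    for i in range(1, n):
        p *= 1 - i / m
    return p

print(prob_all_different_months(3))    # 0.7638..., matching 55/72
print(prob_all_different_months(13))   # 0.0: thirteen people cannot all differ over 12 months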
3.6 Repeating the calculation we find:
P(T ∩ B) = 0.99 · 0.02 = 0.0198,   P(T ∩ B^c) = 0.05 · 0.98 = 0.0490,
so P(T) = P(T ∩ B) + P(T ∩ B^c) = 0.0198 + 0.0490 = 0.0688.
3.7 In the solution to Quick exercise 3.6 we already found P(T ∩ B) = 0.0198 and P(T) = 0.0688, so
P(B | T) = P(T ∩ B) / P(T) = 0.0198/0.0688 = 0.2878.
Further, P(T^c) = 1 − 0.0688 = 0.9312 and P(T^c | B) = 1 − P(T | B) = 0.01. So, P(B ∩ T^c) = 0.01 · 0.02 = 0.0002 and
P(B | T^c) = 0.0002/0.9312 = 0.00021.
3.8 It takes three steps of applying (3.4) and (3.5):
H independent of R ⇔ H^c independent of R    by (3.4),
H^c independent of R ⇔ R independent of H^c    by (3.5),
R independent of H^c ⇔ R^c independent of H^c    by (3.4).
3.6 Exercises
3.1  Your lecturer wants to walk from A to B (see the map). To do so, he
first randomly selects one of the paths to C, D, or E. Next he selects randomly
one of the possible paths at that moment (so if he first selected the path to
E, he can either select the path to A or the path to F), etc. What is the
probability that he will reach B after two selections?
[Map for Exercise 3.1: the points A, B, C, D, E, and F with the connecting paths.]
3.2  A fair die is thrown twice. A is the event “sum of the throws equals 4,”
B is “at least one of the throws is a 3.”
a. Calculate P(A | B).
b. Are A and B independent events?
3.3  We draw two cards from a regular deck of 52. Let S1 be the event “the
first one is a spade,” and S2 “the second one is a spade.”
a. Compute P(S1), P(S2 | S1), and P(S2 | S1^c).
b. Compute P(S2) by conditioning on whether the first card is a spade.
3.4  A Dutch cow is tested for BSE, using Test A as described in Section 3.3,
with P(T | B) = 0.70 and P(T | B^c) = 0.10. Assume that the BSE risk for the Netherlands is the same as in 2003, when it was estimated to be P(B) = 1.3 · 10^{−5}. Compute P(B | T) and P(B | T^c).
3.5 A ball is drawn at random from an urn containing one red and one white
ball. If the white ball is drawn, it is put back into the urn. If the red ball
is drawn, it is returned to the urn together with two more red balls. Then a
second draw is made. What is the probability a red ball was drawn on both
the first and the second draws?
3.6 We choose a month of the year, in such a manner that each month has
the same probability. Find out whether the following events are independent:
a. the events “outcome is an even numbered month” (i.e., February, April,
June, etc.) and “outcome is in the first half of the year.”
b. the events “outcome is an even numbered month” (i.e., February, April,
June, etc.) and “outcome is a summer month” (i.e., June, July, August).
3.7  Calculate
a. P(A ∪ B) if it is given that P(A) = 1/3 and P(B | A^c) = 1/4.
b. P(B) if it is given that P(A ∪ B) = 2/3 and P(A^c | B^c) = 1/2.
3.8  Spaceman Spiff’s spacecraft has a warning light that is supposed to
switch on when the freem blasters are overheated. Let W be the event “the
warning light is switched on” and F “the freem blasters are overheated.”
Suppose the probability of freem blaster overheating is P(F) = 0.1, that the probability that the light is switched on when they actually are overheated is 0.99, and that there is a 2% chance that it comes on when nothing is wrong: P(W | F^c) = 0.02.
a. Determine the probability that the warning light is switched on.
b. Determine the conditional probability that the freem blasters are over-
heated, given that the warning light is on.
3.9  A certain grapefruit variety is grown in two regions in southern Spain.
Both areas get infested from time to time with parasites that damage the
crop. Let A be the event that region R1 is infested with parasites and B that
region R2 is infested. Suppose P(A) = 3/4, P(B) = 2/5 and P(A ∪ B) = 4/5.
If the food inspection detects the parasite in a ship carrying grapefruits from
R1, what is the probability region R2 is infested as well?
3.10 A student takes a multiple-choice exam. Suppose for each question he
either knows the answer or gambles and chooses an option at random. Further
suppose that if he knows the answer, the probability of a correct answer is 1,
and if he gambles this probability is 1/4. To pass, students need to answer at
least 60% of the questions correctly. The student has “studied for a minimal
pass,” i.e., with probability 0.6 he knows the answer to a question. Given that
he answers a question correctly, what is the probability that he actually knows
the answer?
3.11 A breath analyzer, used by the police to test whether drivers exceed
the legal limit set for the blood alcohol percentage while driving, is known to
satisfy
P(A | B) = P(A^c | B^c) = p,
where A is the event "breath analyzer indicates that legal limit is exceeded" and B "driver's blood alcohol percentage exceeds legal limit." On Saturday night about 5% of the drivers are known to exceed the limit.
a. Describe in words the meaning of P(B^c | A).
b. Determine P(B^c | A) if p = 0.95.
c. How big should p be so that P(B | A) = 0.9?
3.12 The events A, B, and C satisfy: P(A | B ∩ C) = 1/4, P(B | C) = 1/3,
and P(C) = 1/2. Calculate P(A^c ∩ B ∩ C).
3.13 In Exercise 2.12 we computed the probability of a “dream draw” in the
UEFA playoffs lottery by counting outcomes. Recall that there were ten teams
in the lottery, five considered “strong” and five considered “weak.” Introduce
events Di, “the ith pair drawn is a dream combination,” where a “dream
combination” is a pair of a strong team with a weak team, and i = 1, . . . , 5.
a. Compute P(D1).
b. Compute P(D2 | D1) and P(D1 ∩ D2).
c. Compute P(D3 | D1 ∩ D2) and P(D1 ∩ D2 ∩ D3).
d. Continue the procedure to obtain the probability of a “dream draw”:
P(D1 ∩ · · · ∩ D5).
3.14 Recall the Monty Hall problem from Section 1.3. Let R be the event
“the prize is behind the door you chose initially,” and W the event “you win
the prize by switching doors.”
a. Compute P(W | R) and P(W | R^c).
b. Compute P(W) using the law of total probability.
3.15 Two independent events A and B are given, and P(B | A ∪ B) = 2/3,
P(A | B) = 1/2. What is P(B)?
3.16 You are diagnosed with an uncommon disease. You know that there
only is a 1% chance of getting it. Use the letter D for the event “you have the
disease” and T for “the test says so.” It is known that the test is imperfect:
P(T | D) = 0.98 and P(T^c | D^c) = 0.95.
a. Given that you test positive, what is the probability that you really have
the disease?
b. You obtain a second opinion: an independent repetition of the test. You
test positive again. Given this, what is the probability that you really have
the disease?
3.17 You and I play a tennis match. It is deuce, which means if you win the
next two rallies, you win the game; if I win both rallies, I win the game; if
we each win one rally, it is deuce again. Suppose the outcome of a rally is
independent of other rallies, and you win a rally with probability p. Let W be
the event “you win the game,” G “the game ends after the next two rallies,”
and D “it becomes deuce again.”
a. Determine P(W | G).
b. Show that P(W) = p^2 + 2p(1 − p)P(W | D) and use P(W) = P(W | D)
(why is this so?) to determine P(W).
c. Explain why the answers are the same.
3.18 Suppose A and B are events with 0 < P(A) < 1 and 0 < P(B) < 1.
a. If A and B are disjoint, can they be independent?
b. If A and B are independent, can they be disjoint?
c. If A ⊂ B, can A and B be independent?
d. If A and B are independent, can A and A ∪ B be independent?
4
Discrete random variables
The sample space associated with an experiment, together with a probability
function defined on all its events, is a complete probabilistic description of
that experiment. Often we are interested only in certain features of this de-
scription. We focus on these features using random variables. In this chapter
we discuss discrete random variables, and in the next we will consider contin-
uous random variables. We introduce the Bernoulli, binomial, and geometric
random variables.
4.1 Random variables
Suppose we are playing the board game “Snakes and Ladders,” where the
moves are determined by the sum of two independent throws with a die. An
obvious choice of the sample space is
Ω = {(ω1, ω2) : ω1, ω2 ∈ {1, 2, . . ., 6} }
= {(1, 1), (1, 2), . . ., (1, 6), (2, 1), . . ., (6, 5), (6, 6)}.
However, as players of the game, we are only interested in the sum of the
outcomes of the two throws, i.e., in the value of the function S : Ω → R, given
by
S( ω1, ω2 ) = ω1 + ω2 for (ω1, ω2) ∈ Ω.
In Table 4.1 the possible results of the first throw (top margin), those of the
second throw (left margin), and the corresponding values of S (body) are
given. Note that the values of S are constant on lines perpendicular to the
diagonal. We denote the event that the function S attains the value k by
{S = k}, which is an abbreviation of “the subset of those ω = (ω1, ω2) ∈ Ω
for which S( ω1, ω2 ) = ω1 + ω2 = k,” i.e.,
{S = k} = {(ω1, ω2) ∈ Ω : S( ω1, ω2) = k }.
Table 4.1. Two throws with a die and the corresponding sum.
ω2 \ ω1   1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12
Quick exercise 4.1 List the outcomes in the event {S = 8}.
We denote the probability of the event {S = k} by
P(S = k) ,
although formally we should write P({S = k}) instead of P(S = k). In our
example, S attains only the values k = 2, 3, . . . , 12 with positive probability.
For example,
P(S = 2) = P((1, 1)) = 1/36,
P(S = 3) = P({(1, 2), (2, 1)}) = 2/36,
while
P(S = 13) = P(∅) = 0,
because 13 is an “impossible outcome.”
Quick exercise 4.2 Use Table 4.1 to determine P(S = k) for k = 4, 5, . . . , 12.
Now suppose that for some other game the moves are given by the maximum
of two independent throws. In this case we are interested in the value of the
function M : Ω → R, given by
M( ω1, ω2 ) = max{ω1, ω2} for (ω1, ω2) ∈ Ω.
In Table 4.2 the possible results of the first throw (top margin), those of the
second throw (left margin), and the corresponding values of M (body) are
given. The functions S and M are examples of what we call discrete random
variables.
Definition. Let Ω be a sample space. A discrete random variable
is a function X : Ω → R that takes on a finite number of values
a1, a2, . . . , an or an infinite number of values a1, a2, . . . .
Table 4.2. Two throws with a die and the corresponding maximum.
ω2 \ ω1   1 2 3 4 5 6
1 1 2 3 4 5 6
2 2 2 3 4 5 6
3 3 3 3 4 5 6
4 4 4 4 4 5 6
5 5 5 5 5 5 6
6 6 6 6 6 6 6
In a way, a discrete random variable X “transforms” a sample space Ω to a
more “tangible” sample space Ω̃, whose events are more directly related to
what you are interested in. For instance, S transforms Ω = {(1, 1), (1, 2), . . .,
(1, 6), (2, 1), . . . , (6, 5), (6, 6)} to Ω̃ = {2, . . ., 12}, and M transforms Ω to
Ω̃ = {1, . . . , 6}. Of course, there is a price to pay: one has to calculate the
probabilities of X. Or, to say things more formally, one has to determine
the probability distribution of X, i.e., to describe how the probability mass is
distributed over possible values of X.
4.2 The probability distribution of a discrete random
variable
Once a discrete random variable X is introduced, the sample space Ω is no
longer important. It suffices to list the possible values of X and their corre-
sponding probabilities. This information is contained in the probability mass
function of X.
Definition. The probability mass function p of a discrete random variable X is the function p : R → [0, 1], defined by
p(a) = P(X = a)   for −∞ < a < ∞.
If X is a discrete random variable that takes on the values a1, a2, . . ., then p(ai) > 0, p(a1) + p(a2) + · · · = 1, and p(a) = 0 for all other a.
As an example we give the probability mass function p of M.
a 1 2 3 4 5 6
p(a) 1/36 3/36 5/36 7/36 9/36 11/36
Of course, p(a) = 0 for all other a.
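Both probability mass functions can be obtained by enumerating the 36 equally likely outcomes; a minimal Python sketch (ours), which reproduces the table of M above and the probabilities of S asked for in Quick exercise 4.2:

from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # the 36 equally likely pairs (ω1, ω2)

def pmf(random_variable):
    counts = Counter(random_variable(w) for w in outcomes)
    return {k: Fraction(c, 36) for k, c in sorted(counts.items())}

p_M = pmf(lambda w: max(w))            # maximum of the two throws
p_S = pmf(lambda w: w[0] + w[1])       # sum of the two throws

print(p_M)   # p(1) = 1/36, p(2) = 3/36, ..., p(6) = 11/36 (printed in lowest terms)
print(p_S)   # p(2) = 1/36, p(3) = 2/36, ..., p(12) = 1/36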
The distribution function of a random variable
As we will see, so-called continuous random variables cannot be specified
by giving a probability mass function. However, the distribution function of
a random variable X (also known as the cumulative distribution function)
allows us to treat discrete and continuous random variables in the same way.
Definition. The distribution function F of a random variable X
is the function F : R → [0, 1], defined by
F(a) = P(X ≤ a)   for −∞ < a < ∞.
Both the probability mass function and the distribution function of a discrete
random variable X contain all the probabilistic information of X; the probabil-
ity distribution of X is determined by either of them. In fact, the distribution
function F of a discrete random variable X can be expressed in terms of the
probability mass function p of X and vice versa. If X attains values a1, a2, . . .,
such that p(ai) > 0 and p(a1) + p(a2) + · · · = 1, then
F(a) = Σ_{ai ≤ a} p(ai).
We see that, for a discrete random variable X, the distribution function F
jumps in each of the ai, and is constant between successive ai. The height of
the jump at ai is p(ai); in this way p can be retrieved from F. For example,
see Figure 4.1, where p and F are displayed for the random variable M.
Fig. 4.1. Probability mass function and distribution function of M.
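The relation F(a) = Σ p(ai) over all ai ≤ a, and the recovery of p from the jumps of F, can be illustrated in a few lines of Python (our own sketch, using the probability mass function of M given above):

from fractions import Fraction

# probability mass function of M, the maximum of two die throws
p = {a: Fraction(2 * a - 1, 36) for a in range(1, 7)}

def F(x):
    """Distribution function: the sum of p(a) over all values a <= x."""
    return sum(prob for a, prob in p.items() if a <= x)

print(F(3))          # 9/36 = 1/4
print(F(3.5))        # the same value: F is constant between jump points
print(F(3) - F(2))   # 5/36, the height of the jump at 3, which equals p(3)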
We end this section with three properties of the distribution function F of a
random variable X:
1. For a ≤ b one has that F(a) ≤ F(b). This property is an immediate
consequence of the fact that a ≤ b implies that the event {X ≤ a} is
contained in the event {X ≤ b}.
2. Since F(a) is a probability, the value of the distribution function is always
between 0 and 1. Moreover,
lim_{a→+∞} F(a) = lim_{a→+∞} P(X ≤ a) = 1,   lim_{a→−∞} F(a) = lim_{a→−∞} P(X ≤ a) = 0.
3. F is right-continuous, i.e., one has
lim_{ε↓0} F(a + ε) = F(a).
This is indicated in Figure 4.1 by bullets. Henceforth we will omit these
bullets.
Conversely, any function F satisfying 1, 2, and 3 is the distribution function
of some random variable (see Remarks 6.1 and 6.2).
Quick exercise 4.3 Let X be a discrete random variable, and let a be such that p(a) > 0. Show that F(a) = P(X < a) + p(a).
There are many discrete random variables that arise in a natural way. We
introduce three of them in the next two sections.
4.3 The Bernoulli and binomial distributions
The Bernoulli distribution is used to model an experiment with only two pos-
sible outcomes, often referred to as “success” and “failure”, usually encoded
as 1 and 0.
Definition. A discrete random variable X has a Bernoulli distri-
bution with parameter p, where 0 ≤ p ≤ 1, if its probability mass
function is given by
pX(1) = P(X = 1) = p and pX(0) = P(X = 0) = 1 − p.
We denote this distribution by Ber(p).
Note that we wrote pX instead of p for the probability mass function of X. This
was done to emphasize its dependence on X and to avoid possible confusion
with the parameter p of the Bernoulli distribution.
Consider the (fictitious) situation that you attend, completely unprepared,
a multiple-choice exam. It consists of 10 questions, and each question has
four alternatives (of which only one is correct). You will pass the exam if
you answer six or more questions correctly. You decide to answer each of the
questions in a random way, in such a way that the answer of one question is
not affected by the answers of the others. What is the probability that you
will pass?
Setting for i = 1, 2, . . . , 10
Ri = 1 if the ith answer is correct, and Ri = 0 if the ith answer is incorrect,
the number of correct answers X is given by
X = R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10.
Quick exercise 4.4 Calculate the probability that you answered the first
question correctly and the second one incorrectly.
Clearly, X attains only the values 0, 1, . . ., 10. Let us first consider the case
X = 0. Since the answers to the different questions do not influence each other,
we conclude that the events {R1 = a1}, . . . , {R10 = a10} are independent for
every choice of the ai, where each ai is 0 or 1. We find
P(X = 0) = P(not a single Ri equals 1)
= P(R1 = 0, R2 = 0, . . . , R10 = 0)
= P(R1 = 0) P(R2 = 0) · · · P(R10 = 0)
= (3/4)^{10}.
The probability that we have answered exactly one question correctly equals
P(X = 1) = (1/4) · (3/4)^9 · 10,
which is the probability that the answer is correct times the probability that the other nine answers are wrong, times the number of ways in which this can occur:
P(X = 1) = P(R1 = 1) P(R2 = 0) P(R3 = 0) · · · P(R10 = 0)
+ P(R1 = 0) P(R2 = 1) P(R3 = 0) · · · P(R10 = 0)
+ · · ·
+ P(R1 = 0) P(R2 = 0) P(R3 = 0) · · · P(R10 = 1).
In general we find for k = 0, 1, . . . , 10, again using independence, that
P(X = k) = (1/4)^k · (3/4)^{10−k} · C10,k,
which is the probability that k questions were answered correctly times the probability that the other 10 − k answers are wrong, times the number of ways C10,k this can occur.
So C10,k is the number of different ways in which one can choose k correct
answers from the list of 10. We already have seen that C10,0 = 1, because
there is only one way to do everything wrong; and that C10,1 = 10, because
each of the 10 questions may have been answered correctly.
More generally, if we have to choose k different objects out of an ordered list
of n objects, and the order in which we pick the objects matters, then for
the first object you have n possibilities, and no matter which object you pick,
for the second one there are n − 1 possibilities. For the third there are n − 2
possibilities, and so on, with n − (k − 1) possibilities for the kth. So there are
n(n − 1) · · · (n − (k − 1))
ways to choose the k objects.
In how many ways can we choose three questions? When the order matters,
there are 10 · 9 · 8 ways. However, the order in which these three questions are
selected does not matter: to answer questions 2, 5, and 8 correctly is the same
as answering questions 8, 2, and 5 correctly, and so on. The triplet {2, 5, 8}
can be chosen in 3 · 2 · 1 different orders, all with the same result. There are
six permutations of the numbers 2, 5, and 8 (see page 14).
Thus, compensating for this six-fold overcount, the number C10,3 of ways to correctly answer 3 questions out of 10 becomes
C10,3 = (10 · 9 · 8) / (3 · 2 · 1).
More generally, for n ≥ 1 and 1 ≤ k ≤ n,
Cn,k = [n(n − 1) · · · (n − (k − 1))] / [k(k − 1) · · · 2 · 1].
Note that this is equal to
n! / (k! (n − k)!),
which is usually denoted by (n choose k), so Cn,k = (n choose k). Moreover, in accordance with 0! = 1 (as defined in Chapter 2), we put Cn,0 = (n choose 0) = 1.
Quick exercise 4.5 Show that (n choose n−k) = (n choose k).
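In Python the binomial coefficient is available as math.comb; a quick check (ours) of C10,3 and of the symmetry in Quick exercise 4.5:

import math

print(10 * 9 * 8 // (3 * 2 * 1))   # 120, the count derived in the text
print(math.comb(10, 3))            # 120, the library's binomial coefficient
print(math.comb(10, 7))            # 120 as well, illustrating (n choose n-k) = (n choose k)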
Substituting (10 choose k) for C10,k we obtain
P(X = k) = (10 choose k) (1/4)^k (3/4)^{10−k}.
Since P(X ≥ 6) = P(X = 6) + · · · + P(X = 10), it is now an easy (but te-
dious) exercise to determine the probability that you will pass. One finds that
P(X ≥ 6) = 0.0197. It pays to study, doesn’t it?!
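The "easy (but tedious)" sum is readily left to a computer; a short Python sketch (ours) that reproduces P(X ≥ 6) = 0.0197:

from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a Bin(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.25
prob_pass = sum(binom_pmf(k, n, p) for k in range(6, n + 1))
print(round(prob_pass, 4))   # 0.0197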
The preceding random variable X is an example of a random variable with a
binomial distribution with parameters n = 10 and p = 1/4.
Definition. A discrete random variable X has a binomial distri-
bution with parameters n and p, where n = 1, 2, . . . and 0 ≤ p ≤ 1,
if its probability mass function is given by
pX(k) = P(X = k) = (n choose k) p^k (1 − p)^{n−k}   for k = 0, 1, . . . , n.
We denote this distribution by Bin(n, p).
Figure 4.2 shows the probability mass function pX and distribution function
FX of a Bin(10, 1/4) distributed random variable.
Fig. 4.2. Probability mass function and distribution function of the Bin(10, 1/4) distribution.
4.4 The geometric distribution
In 1986, Weinberg and Gladen [38] investigated the number of menstrual cycles it took women to become pregnant, measured from the moment they had decided to become pregnant. We model the number of cycles up to pregnancy by a random variable X.
Assume that the probability that a woman becomes pregnant during a particular cycle is equal to p, for some p with 0 < p ≤ 1, independent of the previous cycles. Then clearly P(X = 1) = p. Due to the independence of consecutive cycles, one finds for k = 1, 2, . . . that
P(X = k) = P(no pregnancy in the first k − 1 cycles, pregnancy in the kth) = (1 − p)^{k−1} p.
This random variable X is an example of a random variable with a geometric
distribution with parameter p.
Definition. A discrete random variable X has a geometric distri-
bution with parameter p, where 0 < p ≤ 1, if its probability mass function is given by
pX(k) = P(X = k) = (1 − p)^{k−1} p   for k = 1, 2, . . . .
We denote this distribution by Geo(p).
Figure 4.3 shows the probability mass function pX and distribution function
FX of a Geo(1/4) distributed random variable.
Fig. 4.3. Probability mass function and distribution function of the Geo(1/4) distribution.
Quick exercise 4.6 Let X have a Geo(p) distribution. For n ≥ 0, show that P(X > n) = (1 − p)^n.
The geometric distribution has a remarkable property, which is known as the memoryless property.¹ For n, k = 0, 1, 2, . . . one has
P(X > n + k | X > k) = P(X > n).
We can derive this equality using the result from Quick exercise 4.6:
P(X > n + k | X > k) = P({X > k + n} ∩ {X > k}) / P(X > k) = P(X > k + n) / P(X > k)
= (1 − p)^{n+k} / (1 − p)^k = (1 − p)^n = P(X > n).
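The memoryless property is also easy to check numerically; a minimal Python sketch (ours), for a Geo(1/4) random variable and arbitrarily chosen n and k:

p = 0.25

def tail(n):
    """P(X > n) for X with a Geo(p) distribution."""
    return (1 - p) ** n

n, k = 3, 5
lhs = tail(n + k) / tail(k)    # P(X > n + k | X > k)
rhs = tail(n)                  # P(X > n)
print(lhs, rhs)                # both equal 0.421875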
4.5 Solutions to the quick exercises
4.1 From Table 4.1, one finds that
{S = 8} = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)}.
4.2 From Table 4.1, one determines the following table.
k          4     5     6     7     8     9     10    11    12
P(S = k)   3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
4.3 Since {X ≤ a} = {X < a} ∪ {X = a}, it follows that
F(a) = P(X ≤ a) = P(X < a) + P(X = a) = P(X < a) + p(a).
Not very interestingly: this also holds if p(a) = 0.
4.4 The probability that you answered the first question correctly and the second one incorrectly is given by P(R1 = 1, R2 = 0). Due to independence, this is equal to P(R1 = 1) P(R2 = 0) = (1/4) · (3/4) = 3/16.
4.5 Rewriting yields
(n choose n − k) = n! / ((n − k)! (n − (n − k))!) = n! / (k! (n − k)!) = (n choose k).
¹ In fact, the geometric distribution is the only discrete distribution with this property.
4.6 There are two ways to show that P(X > n) = (1 − p)^n. The easiest way is to realize that P(X > n) is the probability that we had "no success in the first n trials," which clearly equals (1 − p)^n. A more involved way is by calculation:
P(X > n) = P(X = n + 1) + P(X = n + 2) + · · ·
= (1 − p)^n p + (1 − p)^{n+1} p + · · ·
= (1 − p)^n p [1 + (1 − p) + (1 − p)^2 + · · · ].
If we recall from calculus that
Σ_{k=0}^{∞} (1 − p)^k = 1 / (1 − (1 − p)) = 1/p,
the answer follows immediately.
4.6 Exercises
4.1  Let Z represent the number of times a 6 appeared in two independent
throws of a die, and let S and M be as in Section 4.1.
a. Describe the probability distribution of Z, by giving either the probability
mass function pZ of Z or the distribution function FZ of Z. What type of
distribution does Z have, and what are the values of its parameters?
b. List the outcomes in the events {M = 2, Z = 0}, {S = 5, Z = 1}, and
{S = 8, Z = 1}. What are their probabilities?
c. Determine whether the events {M = 2} and {Z = 0} are independent.
4.2 Let X be a discrete random variable with probability mass function p given by:
a      −1   0    1    2
p(a)   1/4  1/8  1/8  1/2
and p(a) = 0 for all other a.
a. Let the random variable Y be defined by Y = X^2, i.e., if X = 2, then Y = 4. Calculate the probability mass function of Y.
b. Calculate the value of the distribution functions of X and Y in a = 1, a = 3/4, and a = π − 3.
4.3  Suppose that the distribution function of a discrete random variable X
is given by
F(a) =
  0     for a < 0,
  1/3   for 0 ≤ a < 1/2,
  1/2   for 1/2 ≤ a < 3/4,
  1     for a ≥ 3/4.
Determine the probability mass function of X.
4.4 You toss n coins, each showing heads with probability p, independently
of the other tosses. Each coin that shows tails is tossed again. Let X be the
total number of heads.
a. What type of distribution does X have? Specify its parameter(s).
b. What is the probability mass function of the total number of heads X?
4.5 A fair die is thrown until the sum of the results of the throws exceeds 6.
The random variable X is the number of throws needed for this. Let F be the
distribution function of X. Determine F(1), F(2), and F(7).
4.6  Three times we randomly draw a number from the following numbers:
1 2 3.
If Xi represents the ith draw, i = 1, 2, 3, then the probability mass function
of Xi is given by
a            1    2    3
P(Xi = a)    1/3  1/3  1/3
and P(Xi = a) = 0 for all other a. We assume that each draw is independent of the previous draws. Let X̄ be the average of X1, X2, and X3, i.e.,
X̄ = (X1 + X2 + X3) / 3.
a. Determine the probability mass function pX̄ of X̄.
b. Compute the probability that exactly two draws are equal to 1.
4.7  A shop receives a batch of 1000 cheap lamps. The odds that a lamp is
defective are 0.1%. Let X be the number of defective lamps in the batch.
a. What kind of distribution does X have? What is/are the value(s) of pa-
rameter(s) of this distribution?
b. What is the probability that the batch contains no defective lamps? One
defective lamp? More than two defective ones?
4.8  In Section 1.4 we saw that each space shuttle has six O-rings and that
each O-ring fails with probability
p(t) = e^{a+b·t} / (1 + e^{a+b·t}),
where a = 5.085, b = −0.1156, and t is the temperature (in degrees Fahrenheit) at the time of the launch of the space shuttle. At the time of the fatal launch of the Challenger, t = 31, yielding p(31) = 0.8178.
a. Let X be the number of failing O-rings at launch temperature 31°F. What
type of probability distribution does X have, and what are the values of
its parameters?
b. What is the probability P(X ≥ 1) that at least one O-ring fails?
4.9 For simplicity’s sake, let us assume that all space shuttles will be launched
at 81°F (which is the highest recorded launch temperature in Figure 1.3). With
this temperature, the probability of an O-ring failure is equal to p(81) = 0.0137
(see Section 1.4 or Exercise 4.8).
a. What is the probability that during 23 launches no O-ring will fail, but
that at least one O-ring will fail during the 24th launch of a space shuttle?
b. What is the probability that no O-ring fails during 24 launches?
4.10  Early in the morning, a group of m people decides to use the elevator
in an otherwise deserted building of 21 floors. Each of these persons chooses
his or her floor independently of the others, and—from our point of view—
completely at random, so that each person selects a floor with probability
1/21. Let Sm be the number of times the elevator stops. In order to study
Sm, we introduce for i = 1, 2, . . ., 21 random variables Ri, given by
Ri = 1 if the elevator stops at the ith floor, and Ri = 0 if the elevator does not stop at the ith floor.
a. Each Ri has a Ber(p) distribution. Show that p = 1 − (20/21)^m.
b. From the way we defined Sm, it follows that
Sm = R1 + R2 + · · · + R21.
Can we conclude that Sm has a Bin(21, p) distribution, with p as in part a? Why or why not?
c. Clearly, if m = 1, one has that P(S1 = 1) = 1. Show that for m = 2
P(S2 = 1) = 1/21 = 1 − P(S2 = 2),
and that S3 has the following distribution.
a            1      2       3
P(S3 = a)    1/441  60/441  380/441
4.11 You decide to play monthly in two different lotteries, and you stop play-
ing as soon as you win a prize in one (or both) lotteries of at least one million
euros. Suppose that every time you participate in these lotteries, the proba-
bility to win one million (or more) euros is p1 for one of the lotteries and p2
for the other. Let M be the number of times you participate in these lotteries
until winning at least one prize. What kind of distribution does M have, and
what is its parameter?
4.12  You and a friend want to go to a concert, but unfortunately only one
ticket is still available. The man who sells the tickets decides to toss a coin
until heads appears. In each toss heads appears with probability p, where
0 < p < 1, independent of each of the previous tosses. If the number of tosses
needed is odd, your friend is allowed to buy the ticket; otherwise you can buy
it. Would you agree to this arrangement?
4.13  A box contains an unknown number N of identical bolts. In order
to get an idea of the size N, we randomly mark one of the bolts from the
box. Next we select at random a bolt from the box. If this is the marked bolt
we stop, otherwise we return the bolt to the box, and we randomly select a
second one, etc. We stop when the selected bolt is the marked one. Let X be
the number of times a bolt was selected. Later (in Exercise 21.11) we will try
to find an estimate of N. Here we look at the probability distribution of X.
a. What is the probability distribution of X? Specify its parameter(s)!
b. The drawback of this approach is that X can attain any of the values
1, 2, 3, . . ., so that if N is large we might be sampling from the box for
quite a long time. We decide to sample from the box in a slightly different
way: after we have randomly marked one of the bolts in the box, we
select at random a bolt from the box. If this is the marked one, we stop,
otherwise we randomly select a second bolt (we do not return the selected
bolt). We stop when we select the marked bolt. Let Y be the number of
times a bolt was selected.
Show that P(Y = k) = 1/N for k = 1, 2, . . . , N (Y has a so-called discrete
uniform distribution).
c. Instead of randomly marking one bolt in the box, we mark m bolts, with
m smaller than N. Next, we randomly select r bolts; Z is the number of
marked bolts in the sample.
Show that
P(Z = k) = (m choose k) (N−m choose r−k) / (N choose r),   for k = 0, 1, 2, . . . , r.
(Z has a so-called hypergeometric distribution, with parameters m, N,
and r.)
4.14 We throw a coin until a head turns up for the second time, where p is the
probability that a throw results in a head and we assume that the outcome
of each throw is independent of the previous outcomes. Let X be the number
of times we have thrown the coin.
a. Determine P(X = 2), P(X = 3), and P(X = 4).
b. Show that P(X = n) = (n − 1) p^2 (1 − p)^{n−2} for n ≥ 2.
5
Continuous random variables
Many experiments have outcomes that take values on a continuous scale. For
example, in Chapter 2 we encountered the load at which a model of a bridge
collapses. These experiments have continuous random variables naturally as-
sociated with them.
5.1 Probability density functions
One way to look at continuous random variables is that they arise by a (never-
ending) process of refinement from discrete random variables. Suppose, for
example, that a discrete random variable associated with some experiment
takes on the value 6.283 with probability p. If we refine, in the sense that we
also get to know the fourth decimal, then the probability p is spread over the
outcomes 6.2830, 6.2831, . . ., 6.2839. Usually this will mean that each of these
new values is taken on with a probability that is much smaller than p—the
sum of the ten probabilities is p. Continuing the refinement process to more
and more decimals, the probabilities of the possible values of the outcomes
become smaller and smaller, approaching zero. However, the probability that
the possible values lie in some fixed interval [a, b] will settle down. This is
closely related to the way sums converge to an integral in the definition of the
integral and motivates the following definition.
Definition. A random variable X is continuous if for some function f : R → R and for any numbers a and b with a ≤ b,
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
The function f has to satisfy f(x) ≥ 0 for all x and ∫_{−∞}^{∞} f(x) dx = 1. We call f the probability density function (or probability density) of X.
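As a numerical aside (ours; the density 3x² on [0, 1] is chosen only for the example), the defining integral can be approximated by a simple midpoint Riemann sum:

def riemann_prob(f, a, b, n=10_000):
    """Approximate P(a <= X <= b), i.e. the integral of the density f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h   # midpoint rule

def f(x):
    # example density: 3x^2 on [0, 1]; it is nonnegative and integrates to 1
    return 3 * x**2 if 0 <= x <= 1 else 0.0

print(riemann_prob(f, 0.2, 0.5))   # close to 0.5**3 - 0.2**3 = 0.117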
Fig. 5.1. Area under a probability density function f on the interval [a, b].
Note that the probability that X lies in an interval [a, b] is equal to the area
under the probability density function f of X over the interval [a, b]; this
is illustrated in Figure 5.1. So if the interval gets smaller and smaller, the
probability will go to zero: for any positive ε
P(a − ε ≤ X ≤ a + ε) = ∫_{a−ε}^{a+ε} f(x) dx,
and sending ε to 0, it follows that for any a
P(X = a) = 0.
This implies that for continuous random variables you may be careless about the precise form of the intervals:
P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a < X < b) = P(a ≤ X < b).
What does f(a) represent? Note (see also Figure 5.2) that
P(a − ε ≤ X ≤ a + ε) = ∫_{a−ε}^{a+ε} f(x) dx ≈ 2εf(a)    (5.1)
for small positive ε. Hence f(a) can be interpreted as a (relative) measure of
how likely it is that X will be near a. However, do not think of f(a) as a
probability: f(a) can be arbitrarily large. An example of such an f is given in
the following exercise.
Quick exercise 5.1 Let the function f be defined by f(x) = 0 if x ≤ 0 or x ≥ 1, and f(x) = 1/(2√x) for 0 < x < 1. You can check quickly that f satisfies the two properties of a probability density function. Let X be a random variable with f as its probability density function. Compute the probability that X lies between 10^{−4} and 10^{−2}.
Fig. 5.2. Approximating the probability that X lies ε-close to a.
You should realize that discrete random variables do not have a probability
density function f and continuous random variables do not have a probability
mass function p, but that both have a distribution function F(a) = P(X ≤ a).
Using the fact that for a < b the event {X ≤ b} is a disjoint union of the events {X ≤ a} and {a < X ≤ b}, we can express the probability that X lies in an interval (a, b] directly in terms of F for both cases:
P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a).
There is a simple relation between the distribution function F and the probability density function f of a continuous random variable. It follows from integral calculus that
F(b) = ∫_{−∞}^{b} f(x) dx   and   f(x) = (d/dx) F(x),
where the latter holds for all x where f is continuous.
Both the probability density function and the distribution function of a con-
tinuous random variable X contain all the probabilistic information about X;
the probability distribution of X is described by either of them.
We illustrate all this with an example. Suppose we want to make a probability
model for an experiment that can be described as “an object hits a disc of
radius r in a completely arbitrary way” (of course, this is not you playing
darts—nevertheless we will refer to this example as the darts example). We
are interested in the distance X between the hitting point and the center of
the disc. Since distances cannot be negative, we have F(b) = P(X ≤ b) = 0
when b < 0. Since the object hits the disc, we have F(b) = 1 when b > r. That the dart hits the disc in a completely arbitrary way we interpret as that the probability of hitting any region is proportional to the area of that region. In particular, because the disc has area πr^2 and the disc with radius b has area πb^2, we should put
F(b) = P(X ≤ b) = πb^2 / πr^2 = b^2 / r^2   for 0 ≤ b ≤ r.
Then the probability density function f of X is equal to 0 outside the interval [0, r] and
f(x) = (d/dx) F(x) = (1/r^2) (d/dx) x^2 = 2x/r^2   for 0 ≤ x ≤ r.
Quick exercise 5.2 Compute for the darts example the probability that 0 < X ≤ r/2, and the probability that r/2 < X ≤ r.
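The darts distribution function can also be checked by simulation; a brief Python sketch (ours), taking r = 1 purely for illustration:

import math
import random

random.seed(1)
r = 1.0

def random_distance():
    """Distance to the center of a point hit 'completely arbitrarily' (uniformly) on the disc."""
    while True:
        x, y = random.uniform(-r, r), random.uniform(-r, r)
        if x * x + y * y <= r * r:      # rejection sampling: keep only points inside the disc
            return math.hypot(x, y)

b = 0.5 * r
n = 100_000
hits = sum(random_distance() <= b for _ in range(n))
print(hits / n)    # close to F(b) = b**2 / r**2 = 0.25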
5.2 The uniform distribution
In this section we encounter a continuous random variable that describes an
experiment where the outcome is completely arbitrary, except that we know
that it lies between certain bounds. Many experiments of physical origin have
this kind of behavior. For instance, suppose we measure for a long time the
emission of radioactive particles of some material. Suppose that the experi-
ment consists of recording in each hour at what times the particles are emitted.
Then the outcomes will lie in the interval [0,60] minutes. If the measurements
would concentrate in any way, there is either something wrong with your
Geiger counter or you are about to discover some new physical law. Not con-
centrating in any way means that subintervals of the same length should have
the same probability. It is then clear (cf. equation (5.1)) that the probability
density function associated with this experiment should be constant on [0, 60].
This motivates the following definition.
Definition. A continuous random variable has a uniform distribu-
tion on the interval [α, β] if its probability density function f is given
by f(x) = 0 if x is not in [α, β] and
f(x) = 1/(β − α)   for α ≤ x ≤ β.
We denote this distribution by U(α, β).
Quick exercise 5.3 Argue that the distribution function F of a random
variable that has a U(α, β) distribution is given by F(x) = 0 if x < α, F(x) = 1 if x > β, and F(x) = (x − α)/(β − α) for α ≤ x ≤ β.
In Figure 5.3 the probability density function and the distribution function of
a U(0, 1/3) distribution are depicted.
Fig. 5.3. The probability density function and the distribution function of the U(0, 1/3) distribution.
5.3 The exponential distribution
We already encountered the exponential distribution in the chemical reactor
example of Chapter 3. We will give an argument why it appears in that ex-
ample. Let v be the effluent volumetric flow rate, i.e., the volume that leaves
the reactor over a time interval [0, t] is vt (and an equal volume enters the
vessel at the other end). Let V be the volume of the reactor vessel. Then in
total a fraction (v/V ) · t will have left the vessel during [0, t], when t is not
too large. Let the random variable T be the residence time of a particle in
the vessel. To compute the distribution of T , we divide the interval [0, t] in
n small intervals of equal length t/n. Assuming perfect mixing, so that the
particle’s position is uniformly distributed over the volume, the particle has
probability p = (v/V )·t/n to have left the vessel during any of the n intervals
of length t/n. If we assume that the behavior of the particle in different time
intervals of length t/n is independent, we have, if we call “leaving the vessel”
a success, that T has a geometric distribution with success probability p. It
follows (see also Quick exercise 4.6) that the probability P(T > t) that the
particle is still in the vessel at time t is, for large n, well approximated by
(1 − p)^n = (1 − vt/(Vn))^n.
But then, letting n → ∞, we obtain (recall a well-known limit from your
calculus course)
P(T > t) = lim_{n→∞} (1 − (vt/V)·(1/n))^n = e^{−(v/V)t}.
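The well-known limit used here, (1 − x/n)^n → e^{−x}, is easy to check numerically; the following lines (an added illustration in Python, with x standing for (v/V)·t and an arbitrary value chosen for it) show the convergence.

    import math

    x = 0.5   # plays the role of (v/V) * t; an arbitrary illustrative value
    for n in (10, 100, 1_000, 10_000):
        print(n, (1 - x / n) ** n)
    print("limit:", math.exp(-x))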
It follows that the distribution function of T equals 1 − e^{−(v/V)t}, and differentiating
we obtain that the probability density function fT of T is equal to
fT(t) = (d/dt)(1 − e^{−(v/V)t}) = (v/V) e^{−(v/V)t} for t ≥ 0.
This is an example of an exponential distribution, with parameter v/V .
Definition. A continuous random variable has an exponential dis-
tribution with parameter λ if its probability density function f is
given by f(x) = 0 if x < 0 and
f(x) = λe^{−λx} for x ≥ 0.
We denote this distribution by Exp(λ).
The distribution function F of an Exp(λ) distribution is given by
F(a) = 1 − e^{−λa} for a ≥ 0.
In Figure 5.4 we show the probability density function and the distribution
function of the Exp(0.25) distribution.
Fig. 5.4. The probability density and the distribution function of the Exp(0.25) distribution.
Since we obtained the exponential distribution directly from the geometric
distribution it should not come as a surprise that the exponential distribution
also satisfies the memoryless property, i.e., if X has an exponential distribu-
tion, then for all s, t > 0,
P(X > s + t | X > s) = P(X > t).
Actually, this follows directly from
P(X > s + t | X > s) = P(X > s + t)/P(X > s) = e^{−λ(s+t)}/e^{−λs} = e^{−λt} = P(X > t).
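The memoryless property can also be observed in simulated data. The sketch below (an added illustration, not part of the text; Python with NumPy assumed) estimates P(X > s + t | X > s) and P(X > t) for the Exp(0.25) distribution and compares both with e^{−λt}.

    import numpy as np

    rng = np.random.default_rng(2)
    lam, s, t = 0.25, 3.0, 5.0
    x = rng.exponential(scale=1 / lam, size=1_000_000)   # Exp(lambda) samples

    cond = np.mean(x[x > s] > s + t)        # estimate of P(X > s + t | X > s)
    uncond = np.mean(x > t)                 # estimate of P(X > t)
    print(cond, uncond, np.exp(-lam * t))   # both should be close to e^{-lambda t}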
Quick exercise 5.4 A study of the response time of a certain computer system
yields that the response time in seconds is exponentially distributed
with parameter 0.25. What is the probability that the response time
exceeds 5 seconds?
5.4 The Pareto distribution
More than a century ago the economist Vilfredo Pareto ([20]) noticed that
the number of people whose income exceeded level x was well approximated
by C/x^α, for some constants C and α > 0 (it appears that for all countries
α is around 1.5). A similar phenomenon occurs with city sizes, earthquake
rupture areas, insurance claims, and sizes of commercial companies. When
these quantities are modeled as realizations of random variables X, then their
distribution functions are of the type F(x) = 1 − 1/x^α for x ≥ 1. (Here
1 is a more or less arbitrarily chosen starting point—what matters is the
behavior for large x.) Differentiating, we obtain probability densities of the
form f(x) = α/x^{α+1}. This motivates the following definition.
Definition. A continuous random variable has a Pareto distribution
with parameter α > 0 if its probability density function f is given
by f(x) = 0 if x < 1 and
f(x) = α/x^{α+1} for x ≥ 1.
We denote this distribution by Par(α).
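As an added illustration (not part of the text; Python with NumPy assumed, and the sampling method used is the inverse of F, a technique discussed in Chapter 6), one can draw Par(α) realizations and check that the fraction exceeding a level x behaves like 1/x^α, in the spirit of Pareto's observation.

    import numpy as np

    rng = np.random.default_rng(3)
    alpha = 1.5
    u = rng.uniform(size=1_000_000)
    x = (1 - u) ** (-1 / alpha)          # Par(alpha) samples, since F(x) = 1 - x^(-alpha)

    for level in (2, 5, 10):
        # empirical tail P(X > level) versus the model tail 1/level^alpha
        print(level, np.mean(x > level), level ** (-alpha))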
Fig. 5.5. The probability density and the distribution function of the Par(0.5) distribution.
In Figure 5.5 we depicted the probability density f and the distribution func-
tion F of the Par(0.5) distribution.
5.5 The normal distribution
The normal distribution plays a central role in probability theory and statis-
tics. One of its first applications was due to C.F. Gauss, who used it in 1809
to model observational errors in astronomy; see [13]. We will see in Chap-
ter 14 that the normal distribution is an important tool to approximate the
probability distribution of the average of independent random variables.
Definition. A continuous random variable has a normal distribution
with parameters µ and σ² > 0 if its probability density function
f is given by
f(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²} for −∞ < x < ∞.
We denote this distribution by N(µ, σ²).
In Figure 5.6 the graphs of the probability density function f and distribution
function F of the normal distribution with µ = 3 and σ² = 6.25 are displayed.
Fig. 5.6. The probability density and the distribution function of the N(3, 6.25) distribution.
If X has an N(µ, σ²) distribution, then its distribution function is given by
F(a) = ∫_{−∞}^a (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²} dx for −∞ < a < ∞.
Unfortunately there is no explicit expression for F: the antiderivative of f
cannot be written in terms of elementary functions.
However, as we shall see in Chapter 8, any N(µ, σ²) distributed random vari-
able can be turned into an N(0, 1) distributed random variable by a simple
transformation. As a consequence, a table of the N(0, 1) distribution suffices.
The latter is called the standard normal distribution, and because of its special
role the letter φ has been reserved for its probability density function:
φ(x) = (1/√(2π)) e^{−(1/2)x²} for −∞ < x < ∞.
Note that φ is symmetric around zero: φ(−x) = φ(x) for each x. The corre-
sponding distribution function is denoted by Φ. The table for the standard nor-
mal distribution (see Table B.1) does not contain the values of Φ(a), but rather
the so-called right tail probabilities 1 − Φ(a). If, for instance, we want to know
the probability that a standard normal random variable Z is smaller than or
equal to 1, we use that P(Z ≤ 1) = 1 − P(Z ≥ 1). In the table we find that
P(Z ≥ 1) = 1 − Φ(1) is equal to 0.1587. Hence P(Z ≤ 1) = 1 − 0.1587 = 0.8413.
With the table you can handle tail probabilities with numbers a given to two
decimals. To find, for instance, P(Z > 1.07), we stay in the same row in the
table but move to the seventh column to find that P(Z > 1.07) = 0.1423.
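Today one would usually let software play the role of Table B.1. The following lines (an added illustration, not part of the text; they use Python's standard-library NormalDist, available from Python 3.8) reproduce the values just mentioned.

    from statistics import NormalDist

    Z = NormalDist()            # standard normal distribution, mu = 0, sigma = 1
    print(1 - Z.cdf(1.0))       # right tail P(Z >= 1), approximately 0.1587
    print(Z.cdf(1.0))           # P(Z <= 1), approximately 0.8413
    print(1 - Z.cdf(1.07))      # P(Z > 1.07), approximately 0.1423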
Quick exercise 5.5 Let the random variable Z have a standard normal
distribution. Use Table B.1 to find P(Z ≤ 0.75). How do you know—without
doing any calculations—that the answer should be larger than 0.5?
5.6 Quantiles
Recall the chemical reactor example, where the residence time T , measured
in minutes, has an exponential distribution with parameter λ = v/V = 0.25.
As we shall see in the next chapters, a consequence of this choice of λ is that
the mean time the particle stays in the vessel is 4 minutes. However, from the
viewpoint of process control this is not the quantity of interest. Often, there
will be some minimal amount of time the particle has to stay in the vessel to
participate in the chemical reaction, and we would want that at least 90% of
the particles stay in the vessel this minimal amount of time. In other words,
we are interested in the number q with the property that P(T > q) = 0.9, or
equivalently,
P(T ≤ q) = 0.1.
The number q is called the 0.1th quantile or 10th percentile of the distribution.
In the case at hand it is easy to determine. We should have
P(T ≤ q) = 1 − e^{−0.25q} = 0.1.
This holds exactly when e^{−0.25q} = 0.9 or when −0.25q = ln(0.9) = −0.105.
So q = 0.42. Hence, although the mean residence time is 4 minutes, 10% of
the particles stay less than 0.42 minute in the vessel, which is just slightly
more than 25 seconds! We use the following general definition.
Definition. Let X be a continuous random variable and let p be a
number between 0 and 1. The pth quantile or 100pth percentile of
the distribution of X is the smallest number qp such that
F(qp) = P(X ≤ qp) = p.
The median of a distribution is its 50th percentile.
Quick exercise 5.6 What is the median of the U(2, 7) distribution?
For continuous random variables qp is often easy to determine. Indeed, if F is
strictly increasing from 0 to 1 on some interval (which may be infinite to one
or both sides), then
qp = F^inv(p),
where F^inv is the inverse of F. This is illustrated in Figure 5.7 for the
Exp(0.25) distribution.
Fig. 5.7. The pth quantile qp of the Exp(0.25) distribution.
For an exponential distribution it is easy to compute quantiles. This is dif-
ferent for the standard normal distribution, where we have to use a table
(like Table B.1). For example, the 90th percentile of a standard normal is the
number q0.9 such that Φ(q0.9) = 0.9, which is the same as 1 − Φ(q0.9) = 0.1,
and the table gives us q0.9 = 1.28. This is illustrated in Figure 5.8, with both
the probability density function and the distribution function of the standard
normal distribution.
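Both computations can be reproduced in a few lines of code (an added illustration, not part of the text; Python's standard-library NormalDist is assumed): for the Exp(0.25) distribution the quantile function is qp = −ln(1 − p)/λ, and for the standard normal one can use the inverse distribution function directly instead of Table B.1.

    import math
    from statistics import NormalDist

    lam = 0.25
    q10 = -math.log(1 - 0.1) / lam     # 0.1th quantile of Exp(0.25), approximately 0.42
    print(q10)

    Z = NormalDist()
    print(Z.inv_cdf(0.90))             # 90th percentile of N(0, 1), approximately 1.28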
Quick exercise 5.7 Find the 0.95th quantile q0.95 of a standard normal
distribution, accurate to two decimals.
Fig. 5.8. The 90th percentile of the N(0, 1) distribution.
5.7 Solutions to the quick exercises
5.1 We know from integral calculus that for 0 ≤ a ≤ b ≤ 1
∫_a^b f(x) dx = ∫_a^b 1/(2√x) dx = √b − √a.
Hence ∫_{−∞}^{∞} f(x) dx = ∫_0^1 1/(2√x) dx = 1 (so f is a probability density
function—nonnegativity being obvious), and
P(10⁻⁴ ≤ X ≤ 10⁻²) = ∫_{10⁻⁴}^{10⁻²} 1/(2√x) dx = √(10⁻²) − √(10⁻⁴) = 10⁻¹ − 10⁻² = 0.09.
Actually, the random variable X arises in a natural way; see equation (7.1).
5.2 We have P(0 < X ≤ r/2) = F(r/2) − F(0) = (1/2)² − 0² = 1/4, and
P(r/2 < X ≤ r) = F(r) − F(r/2) = 1 − 1/4 = 3/4, no matter what the radius
of the disc is!
5.3 Since f(x) = 0 for x < α, we have F(x) = 0 if x < α. Also, since f(x) = 0
for all x > β, F(x) = 1 if x > β. In between,
F(x) = ∫_{−∞}^x f(y) dy = ∫_α^x 1/(β − α) dy = (x − α)/(β − α).
In other words: the distribution function increases linearly from the value 0
in α to the value 1 in β.
5.4 If X is the response time, we ask for P(X > 5). This equals
P(X > 5) = e^{−0.25·5} = e^{−1.25} = 0.2865 . . . .
5.5 In the eighth row and sixth column of the table, we find that 1−Φ(0.75) =
0.2266. Hence the answer is 1 − 0.2266 = 0.7734. Because of the symmetry of
the probability density φ, half of the mass of a standard normal distribution
lies on the negative axis. Hence for any number a > 0, it should be true that
P(Z ≤ a) > P(Z ≤ 0) = 0.5.
5.6 The median is the number q0.5 = F^inv(0.5). You either see directly that
you have got half of the mass to both sides of the middle of the interval, hence
q0.5 = (2 + 7)/2 = 4.5, or you solve with the distribution function:
1/2 = F(q) = (q − 2)/(7 − 2), and so q = 4.5.
5.7 Since Φ(q0.95) = 0.95 is the same as 1 − Φ(q0.95) = 0.05, the table gives
us q0.95 = 1.64, or more precisely, if we interpolate between the fourth and
the fifth column: 1.645.
5.8 Exercises
5.1 Let X be a continuous random variable with probability density function
f(x) = 3/4 for 0 ≤ x ≤ 1, 1/4 for 2 ≤ x ≤ 3, and 0 elsewhere.
a. Draw the graph of f.
b. Determine the distribution function F of X, and draw its graph.
5.2  Let X be a random variable that takes values in [0, 1], and is further
given by F(x) = x² for 0 ≤ x ≤ 1. Compute P(1/2 < X ≤ 3/4).
5.3 Let a continuous random variable X be given that takes values in [0, 1],
and whose distribution function F satisfies
F(x) = 2x² − x⁴ for 0 ≤ x ≤ 1.
a. Compute P(1/4 ≤ X ≤ 3/4).
b. What is the probability density function of X?
5.4  Jensen, arriving at a bus stop, just misses the bus. Suppose that he
decides to walk if the (next) bus takes longer than 5 minutes to arrive. Suppose
also that the time in minutes between the arrivals of buses at the bus stop is
a continuous random variable with a U(4, 6) distribution. Let X be the time
that Jensen will wait.
a. What is the probability that X is less than 4½ (minutes)?
b. What is the probability that X equals 5 (minutes)?
c. Is X a discrete random variable or a continuous random variable?
5.5  The probability density function f of a continuous random variable X
is given by:
f(x) = cx + 3 for −3 ≤ x ≤ −2, 3 − cx for 2 ≤ x ≤ 3, and 0 elsewhere.
a. Compute c.
b. Compute the distribution function of X.
5.6 Let X have an Exp(0.2) distribution. Compute P(X  5).
5.7 The score of a student on a certain exam is represented by a number
between 0 and 1. Suppose that the student passes the exam if this number
is at least 0.55. Suppose we model this experiment by a continuous random
variable S, the score, whose probability density function is given by
f(x) = 4x for 0 ≤ x ≤ 1/2, 4 − 4x for 1/2 ≤ x ≤ 1, and 0 elsewhere.
a. What is the probability that the student fails the exam?
b. What is the score that he will obtain with a 50% chance, in other words,
what is the 50th percentile of the score distribution?
5.8  Consider Quick exercise 5.2. For another dart thrower it is given that his
distance to the center of the disc Y is described by the following distribution
function:
G(b) = √(b/r) for 0 < b < r,
and G(b) = 0 for b ≤ 0, G(b) = 1 for b ≥ r.
a. Sketch the probability density function g(y) = (d/dy) G(y).
b. Is this person “better” than the person in Quick exercise 5.2?
c. Sketch a distribution function associated to a person who in 90% of his
throws hits the disc no further than 0.1 · r of the center.
5.9  Suppose we choose arbitrarily a point from the square with corners at
(2,1), (3,1), (2,2), and (3,2). The random variable A is the area of the triangle
with its corners at (2,1), (3,1) and the chosen point (see Figure 5.9).
a. What is the largest area A that can occur, and what is the set of points
for which A ≤ 1/4?
Fig. 5.9. A triangle in a square.
b. Determine the distribution function F of A.
c. Determine the probability density function f of A.
5.10 Consider again the chemical reactor example with parameter λ = 0.5.
We saw in Section 5.6 that 10% of the particles stay in the vessel no longer
than about 12 seconds—while the mean residence time is 2 minutes. Which
percentage of the particles stay no longer than 2 minutes in the vessel?
5.11 Compute the median of an Exp(λ) distribution.
5.12  Compute the median of a Par(1) distribution.
5.13  We consider a random variable Z with a standard normal distribution.
a. Show why the symmetry of the probability density function φ of Z implies
that for any a one has Φ(−a) = 1 − Φ(a).
b. Use this to compute P(Z ≤ −2).
5.14 Determine the 10th percentile of a standard normal distribution.
6
Simulation
Sometimes probabilistic models are so complex that the tools of mathemat-
ical analysis are not sufficient to answer all relevant questions about them.
Stochastic simulation is an alternative approach: values are generated for the
random variables and inserted into the model, thus mimicking outcomes for
the whole system. It is shown in this chapter how one can use uniform ran-
dom number generators to mimic random variables. Also two larger simulation
examples are presented.
6.1 What is simulation?
In many areas of science, technology, government, and business, models are
used to gain understanding of some part of reality (the portion of interest is
often referred to as “the system”). Sometimes these are physical models, such
as a scale model of an airplane in a wind tunnel or a scale model of a chemical
plant. Other models are abstract, such as macroeconomic models consisting
of equations relating things like interest rates, unemployment, and inflation
or partial differential equations describing global weather patterns.
In simulation, one uses a model to create specific situations in order to study
the response of the model to them and then interprets this in terms of what
would happen to the system “in the real world.” In this way, one can carry
out experiments that are impossible, too dangerous, or too expensive to do
in the real world—addressing questions like: What happens to the average
temperature if we reduce the greenhouse gas emissions globally by 50%? Can
the plane still fly if engines 3 and 4 stop in midair? What happens to the
distribution of wealth if we halve the tax rate?
More specifically, we focus on situations and problems where randomness or
uncertainty or both play a significant or dominant role and should be modeled
explicitly. Models for such systems involve random variables, and we speak of
probabilistic or stochastic models. Simulating them is stochastic simulation. In
the preceding chapters we have encountered some of the tools of probability
theory, and we will encounter others in the chapters to come. With these tools
we can compute quantities of interest explicitly for many models. Stochastic
simulation of a system means generating values for all the random variables
in the model, according to their specified distributions, and recording and
analyzing what happens. We refer to the generated values as realizations of
the random variables.
For us, there are two reasons to learn about stochastic simulation. The first is
that for complex systems, simulation can be an alternative to mathematical
analysis, sometimes the only one. The second reason is that through simula-
tion, we can get more feeling for random variables, and this is why we study
stochastic simulation at this point in the book. We start by asking how we
can generate a realization of a random variable.
6.2 Generating realizations of random variables
Simulations are almost always done using computers, which usually have one
or more so-called (pseudo) random number generators. A call to the random
number generator returns a random number between 0 and 1, which mimics
a realization of a U(0, 1) variable. With this source of uniform (pseudo) ran-
domness we can construct any random variable we want by transforming the
outcome, as we shall see.
Quick exercise 6.1 Describe how you can simulate a coin toss when instead
of a coin you have a die. Any ideas on how to simulate a roll of a die if you
only have a coin?
Bernoulli random variables
Suppose U has a U(0, 1) distribution. To construct a Ber(p) random variable
for some 0  p  1, we define
X =

1 if U  p,
0 if U ≥ p
so that
P(X = 1) = P(U  p) = p,
P(X = 0) = P(U ≥ p) = 1 − p.
This random variable X has a Bernoulli distribution with parameter p.
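In code this construction is a single comparison. The sketch below (an added illustration, not part of the text; it uses Python's random module) generates Ber(p) realizations this way and checks the relative frequency of ones.

    import random

    random.seed(6)
    p = 0.3

    def bernoulli(p):
        # X = 1 if U < p, and X = 0 if U >= p, with U a U(0, 1) realization
        u = random.random()
        return 1 if u < p else 0

    samples = [bernoulli(p) for _ in range(100_000)]
    print(sum(samples) / len(samples))   # should be close to p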
Quick exercise 6.2 A random variable Y has outcomes 1, 3, and 4 with the
following probabilities: P(Y = 1) = 3/5, P(Y = 3) = 1/5, and P(Y = 4) =
1/5. Describe how to construct Y from a U(0, 1) random variable.
Continuous random variables
Suppose we have the distribution function F of a continuous random variable
and we wish to construct a random variable with this distribution. We show
how to do this if F is strictly increasing from 0 to 1 on an interval. In that
case F has an inverse function F^inv. Figure 6.1 shows an example: F is strictly
increasing on the interval [2, 10]; the inverse F^inv is a function from the interval
[0, 1] to the interval [2, 10].
Fig. 6.1. Simulating a continuous random variable using the distribution function.
Note how u relates to F^inv(u) as F(x) relates to x. We see that u ≤ F(x)
is equivalent with F^inv(u) ≤ x. If instead of a real number u we consider a
U(0, 1) random variable U, we obtain that the corresponding events are the
same:
{U ≤ F(x)} = {F^inv(U) ≤ x}. (6.1)
We know about the U(0, 1) random variable U that P(U ≤ b) = b for any
number 0 ≤ b ≤ 1. Substituting b = F(x) we see
P(U ≤ F(x)) = F(x).
From equality (6.1), therefore,
P(F^inv(U) ≤ x) = F(x);
in other words, the random variable F^inv(U) has distribution function F.
What remains is to find the function F^inv. From Figure 6.1 we see
F(x) = u ⇔ x = F^inv(u),
so if we solve the equation F(x) = u for x, we obtain the expression for
F^inv(u).
Exponential random variables
We apply this method to the exponential distribution. On the interval [0, ∞),
the Exp(λ) distribution function is strictly increasing and given by
F(x) = 1 − e^{−λx}.
To find F^inv, we solve the equation F(x) = u:
F(x) = u ⇔ 1 − e^{−λx} = u
⇔ e^{−λx} = 1 − u
⇔ −λx = ln(1 − u)
⇔ x = −(1/λ) ln(1 − u),
so F^inv(u) = −(1/λ) ln(1 − u), and if U has a U(0, 1) distribution, then the random
variable X defined by
X = F^inv(U) = −(1/λ) ln(1 − U)
has an Exp(λ) distribution.
In practice, one replaces 1 − U with U, because both have a U(0, 1) distribution
(see Exercise 6.3). Leaving out the subtraction leads to more efficient computer
code. So instead of X we may use
Y = −(1/λ) ln(U),
which also has an Exp(λ) distribution.
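This recipe translates directly into code. The following sketch (an added illustration, not the authors' code; plain Python) generates realizations of X and of Y and compares the empirical tail probability with e^{−λt}.

    import math
    import random

    random.seed(7)
    lam = 0.25
    n = 200_000

    # random.random() returns values in [0, 1); the chance of drawing exactly 0 is negligible
    x = [-math.log(1 - random.random()) / lam for _ in range(n)]  # X = -(1/lam) ln(1 - U)
    y = [-math.log(random.random()) / lam for _ in range(n)]      # Y = -(1/lam) ln(U)

    t = 5.0
    print(sum(v > t for v in x) / n)   # empirical P(X > t)
    print(sum(v > t for v in y) / n)   # empirical P(Y > t)
    print(math.exp(-lam * t))          # exact value e^{-lambda t}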
Quick exercise 6.3 A distribution function F is 0 for x < 1 and 1 for x > 3,
and F(x) = (1/4)(x − 1)² if 1 ≤ x ≤ 3. Let U be a U(0, 1) random variable.
Construct a random variable with distribution F from U.
Remark 6.1 (The general case). The restriction we imposed earlier,
that the distribution function should be strictly increasing, is not really
necessary. Furthermore, a distribution function with jumps or a flat section
somewhere in the middle is not a problem either. We illustrate this with an
example in Figure 6.2.
This F has a jump at 4 and so for a corresponding X we should have
P(X = 4) = 0.2, the size of the jump. We see that whenever U is in the
interval [0.3, 0.5], it is mapped to 4 by our method, and that this happens
with exactly the right probability!
The flat section of F between 7 and 8 seems to pose a problem: the equa-
tion F(a) = 0.85 has as its solution any a between 7 and 8, and we can-
not define a unique inverse. This, however, does not really matter, because
P(U = 0.85) = 0, and we can define the inverse F^inv(0.85) in any way we
want. Taking the left endpoint, here the number 7, agrees best with the
definition of quantiles (see page 66).
Fig. 6.2. A distribution function with a jump and a flat section.
Remark 6.2 (Existence of random variables). The previous remark
supplies a sketchy argument for the fact that any nondecreasing, right-continuous
function F, with lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1, is the
distribution of some random variable.
Generating sequences
For simulations we often want to generate realizations for a large number of
random variables. Random number generators have been designed with this
purpose in mind: each new call mimics a new U(0, 1) random variable. The
sequence of numbers thus generated is considered as a realization of a sequence
of U(0, 1) random variables U1, U2, U3,. . . with the special property that the
events {Ui ≤ ai} are independent¹ for every choice of the ai.
6.3 Comparing two jury rules
At the Olympic Games there are several sports events that are judged by a
jury, including gymnastics, figure skating, and ice dancing. During the 2002
winter games a dispute arose concerning the gold medal in ice dancing: there
were allegations that the Russian team had bribed a French jury member,
thereby causing the Russian pair to win just ahead of the Canadians. We look
into operating rules for juries, although we leave the effects of bribery to the
exercises (Exercise 6.11).
Suppose we have a jury of seven members, and for each performance each
juror assigns a grade. The seven grades are to be transformed into a final
score. Two rules to do this are under consideration, and we want to choose
¹ In Chapter 9 we return to the question of independence between random variables.
the better one. For the first one, the highest and lowest scores are removed
and the final score is the average of the remaining five. For the second rule,
the scores are put in ascending order and the middle one is assigned as final
score. Before you continue reading, consider which rule is better and how you
can verify this.
A probabilistic model
For our investigation we assume that the scores the jurors assign deviate by
some random amount from the true or deserved score. We model the score
that juror i assigns when the performance deserves a score g by
Yi = g + Zi for i = 1, . . . , 7, (6.2)
where Z1, . . . , Z7 are random variables with values around zero. Let h1 and
h2 be functions implementing the two rules:
h1(y1, . . . , y7) = average of the middle five of y1, . . . , y7,
h2(y1, . . . , y7) = middle value of y1, . . . , y7.
We are interested in deviations from the deserved score g:
T = h1(Y1, . . . , Y7) − g,
M = h2(Y1, . . . , Y7) − g.
(6.3)
The distributions of T and M depend on the individual jury grades, and
through those, on the juror-deviations Z1, Z2, . . . , Z7, which we model as
U(−0.5, 0.5) variables. This more or less finishes the modeling phase: we have
given a stochastic model that mimics the workings of a jury and have defined,
in terms of the variables in the model, the random variables T and M that
represent the errors that result after application of the jury rules.
In any serious application, the model should be validated. This means that
one tries to gather evidence to convince oneself and others that the model
adequately reflects the workings of the real system. In this chapter we are
more interested in showing what you can do with simulation once you have a
model, so we skip the validation.
The next phase is analysis: which of the deviations is closer to zero? Because
T and M are random variables, we would have to clarify what we mean by
that, and answering the question certainly involves computing probabilities
about T and M. We cannot do this with what we have learned so far, but we
know how to simulate, so this is what we do.
Simulation
To generate a realization of a U(−0.5, 0.5) random variable, we only need to
subtract 0.5 from the result we obtain from a call to the random generator.
We do this 7 times and insert the resulting values in (6.2) as jury deviations
Z1, . . . , Z7, and substitute them in equations (6.3) to obtain T and M (the
value of g is irrelevant: it drops out of the calculation):
T = average of the middle five of Z1, . . . , Z7,
M = middle value of Z1, . . . , Z7.
(6.4)
In simulation terminology, this is called a run: we have gone through the whole
procedure once, inserting realizations for the random variables. If we repeat
the whole procedure, we have a second run; see Table 6.1 for the results of
five runs.
Table 6.1. Simulation results for the two jury rules.
Run Z1 Z2 Z3 Z4 Z5 Z6 Z7 T M
1 −0.45 −0.08 −0.38 0.11 −0.42 0.48 0.02 −0.15 −0.08
2 −0.37 −0.18 0.05 −0.10 0.01 0.28 0.31 0.01 0.01
3 0.08 0.07 0.47 −0.21 −0.33 −0.22 −0.48 −0.12 −0.21
4 0.24 0.08 −0.11 0.19 −0.03 0.02 0.44 0.10 0.08
5 0.10 0.18 −0.39 −0.24 −0.36 −0.25 0.20 −0.11 −0.24
Quick exercise 6.4 The next realizations for Z1,. . . , Z7 are: −0.05, 0.26,
0.25, 0.39, 0.22, 0.23, 0.13. Determine the corresponding realizations of T
and M.
Table 6.1 can be used to check some computations. We also see that the real-
ization of T was closest to zero in runs 3 and 5, the realization of M was closest
to zero in runs 1 and 4, and they were (about) the same in run 2. There is no
clear conclusion from this, and even if there was, one could wonder whether
the next five runs would yield the same picture. Because the whole process
mimics randomness, one has to expect some variation—or perhaps a lot. In
later chapters we will get a better understanding of this variation; for the
moment we just say that judgment based on a large number of runs is better.
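A sketch of such a multi-run simulation (an added illustration, not the authors' code; Python with NumPy assumed): each run draws seven U(−0.5, 0.5) deviations, applies both rules, and at the end we count in which fraction of the runs rule 1 came closer to the deserved score.

    import numpy as np

    rng = np.random.default_rng(8)
    runs = 1_000

    z = rng.uniform(-0.5, 0.5, size=(runs, 7))   # juror deviations Z1,...,Z7 per run
    z_sorted = np.sort(z, axis=1)

    T = z_sorted[:, 1:6].mean(axis=1)   # rule 1: average of the middle five grades
    M = z_sorted[:, 3]                  # rule 2: the middle (fourth smallest) grade

    better = np.mean(np.abs(M) - np.abs(T) > 0)
    print(better)   # fraction of runs in which rule 1 ends up closer to the true score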
We do one thousand runs and exchange the table for pictures. Figure 6.3 de-
picts, for juror 1, a histogram of all the deviations from the true score g. For
each interval of length 0.05 we have counted the number of runs for which the
deviation of juror 1 fell in that interval. These numbers vary from about 40
to about 60.
This is just to get an idea about the results for an individual juror. In Fig-
ure 6.4 we see histograms for the final scores. Comparing the histograms, it
seems that the realizations of T are more concentrated near zero than those
of M.
Fig. 6.3. Deviations of juror 1 from the deserved score, one thousand runs.
Fig. 6.4. One thousand realizations of T and M.
However, the two histograms do not tell us anything about the relation be-
tween T and M, so we plot the realizations of pairs (T, M) for all one thousand
runs (Figure 6.5). From this plot we see that in most cases M and T go in
the same direction: if T is positive, then usually M is also positive, and the
same goes for negative values. In terms of the final scores, both rules generally
overvalue and undervalue the performance simultaneously. On closer exami-
nation, with help of the line drawn from (−0.5, −0.5) to (0.5, 0.5), we see that
the T values tend to be a little closer to zero than the M values.
This suggests that we make a histogram that shows the difference of the
absolute deviations from true score. For rule 1 this absolute deviation is |T |,
for rule 2 it is |M|. If the difference |M| − |T | is positive, then T is closer to
zero than M, and the difference tells us by how much. A negative difference
Fig. 6.5. Plot of the points (T, M), one thousand runs.
means that M was closer. In Figure 6.6 all the differences are shown in a
histogram. The bars to the right of zero represent 696 runs. So, in about 70%
of the runs, rule 1 resulted in a final score that is closer to the true score than
rule 2. In about 30% of the cases, rule 2 was better, but generally by a smaller
amount, as we see from the histogram.
Fig. 6.6. Differences |M| − |T| for one thousand runs.
6.4 The single-server queue
There are many situations in life where you stand in a line waiting for some
service: when you want to withdraw money from a cash dispenser, borrow
books at the library, be admitted to the emergency room at the hospital, or
pump gas at the gas station. Many other queueing situations are hidden: an
email message you send might be queued at the local server until it has sent
all messages that were submitted ahead of yours; searching the Internet, your
browser sends and receives packets of information that are queued at various
stages and locations; in assembly lines, partly finished products move from
station to station, each time waiting for the next component to be added.
We are going to study one simple queueing model, the so-called single-server
queue: it has one server or service mechanism, and the arriving customers
await their turn in order of their arrival. For definiteness, think of an oasis
with one big water well. People arrive at the well with bottles, jerry cans, and
other types of containers, to pump water. The supply of water is large, but
the pump capacity is limited. The pump is about to be replaced, and while it
is clear that a larger pump capacity will result in shorter waiting times, more
powerful pumps are also more expensive. Therefore, to prepare a decision that
balances costs and benefits, we wish to investigate the relationship between
pump capacity and system performance.
Modeling the system
A stochastic model is in order: some general characteristics are known, such
as how many people arrive per day and how much water they take on average,
but the individual arrival times and amounts are unpredictable. We introduce
random variables to describe them: let T1 be the time between the start at
time zero and the arrival of the first customer, T2 the time between the arrivals
of the first and the second customer, T3 the time between the second and the
third, etc.; these are called the interarrival times. Let Si be the length of time
that customer i needs to use the pump; in standard terminology this is called
the service time. This is our description so far:
Arrivals at: T1 T1 + T2 T1 + T2 + T3 etc.
Service times: S1 S2 S3 etc.
The pump capacity v (liters per minute) is not a random variable but a model
parameter or decision variable, whose “best” value we wish to determine. So
if customer i requires Ri liters of water, then her service time is
Si = Ri/v.
To complete the model description, we need to specify the distribution of the
random variables Ti and Ri:
Interarrival times: every Ti has an Exp(0.5) distribution (minutes);
Service requirement: every Ri has a U(2, 5) distribution (liters).
This particular choice of distributions would have to be supported by evidence
that they are suited for the system at hand: a validation step as suggested for
the jury model is appropriate here as well. For many arrival type processes,
however, the exponential distribution is reasonable as a model for the inter-
arrival times (see Chapter 12). The particular uniform distribution chosen for
the required amount of water says that all amounts between 2 and 5 liters are
equally likely. So there is no sheik who owns a 5000-liter water truck in “our”
oasis.
To evaluate system performance, we want to extract from the model the wait-
ing times of the customers and how busy it is at the pump.
Waiting times
Let Wi denote the waiting time of customer i. The first customer is lucky;
the system starts empty, and so W1 = 0. For customer i the waiting time
depends on how long customer i−1 spent in the system compared to the time
between their respective arrivals. We see that if the interarrival time Ti is long,
relatively speaking, then customer i arrives after the departure of customer
i − 1, and so Wi = 0:
[Timeline: customer i − 1 arrives, waits Wi−1, is served during Si−1, and departs before customer i arrives (interarrival time Ti), so Wi = 0.]
On the other hand, if customer i arrives before the departure, the waiting
time Wi equals whatever remains of Wi−1 + Si−1:
[Timeline: customer i arrives (after interarrival time Ti) while customer i − 1 is still in the system, so Wi = Wi−1 + Si−1 − Ti.]
Summarizing the two cases, we obtain:
Wi = max{Wi−1 + Si−1 − Ti, 0}. (6.5)
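The recursion (6.5) is straightforward to program. The sketch below (an added illustration, not the authors' code; Python with NumPy assumed) generates interarrival times and service requirements as specified above, computes the waiting times for both pump capacities, and also keeps track of the running average of the waiting times.

    import numpy as np

    rng = np.random.default_rng(9)
    n = 10_000
    t = rng.exponential(scale=1 / 0.5, size=n)   # interarrival times T_i ~ Exp(0.5) (minutes)
    r = rng.uniform(2, 5, size=n)                # required amounts R_i ~ U(2, 5) (liters)

    for v in (2, 3):                             # pump capacity in liters per minute
        s = r / v                                # service times S_i = R_i / v
        w = np.zeros(n)                          # W_1 = 0: the first customer does not wait
        for i in range(1, n):
            # waiting time recursion (6.5): W_i = max(W_{i-1} + S_{i-1} - T_i, 0)
            w[i] = max(w[i - 1] + s[i - 1] - t[i], 0.0)
        running_avg = np.cumsum(w) / np.arange(1, n + 1)   # average of the first n waiting times
        print(v, running_avg[-1])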
To carry out a simulation, we start at time zero and generate realizations of
the interarrival times (the Ti) and service requirements (the Ri) for as long
as we want, computing the other quantities that follow from the model on the
way. Table 6.2 shows the values generated this way, for two pump capacities
(v = 2 and 3) for the first six customers. Note that in both cases we use the
same realizations of Ti and Ri.
Table 6.2. Results of a short simulation.
Input realizations v = 2 v = 3
i Ti Arr.time Ri Si Wi Si Wi
1 0.24 0.24 4.39 2.20 0 1.46 0
2 1.97 2.21 4.00 2.00 0.23 1.33 0
3 1.73 3.94 2.33 1.17 0.50 0.78 0
4 2.82 6.76 4.03 2.01 0 1.34 0
5 1.01 7.77 4.17 2.09 1.00 1.39 0.33
6 1.09 8.86 4.24 2.12 1.99 1.41 0.63
Quick exercise 6.5 The next four realizations are T7: 1.86; R7: 4.79; T8:
1.08; and R8: 2.33. Complete the corresponding rows of the table.
Longer simulations produce so many numbers that we will drown in them
unless we think of something. First, we summarize the waiting times of the
first n customers with their average:
W̄n = (W1 + W2 + · · · + Wn)/n. (6.6)
Then, instead of giving a table, we plot the pairs (n, W̄n), for n = 1, 2, . . . until
the end of the simulation. In Figure 6.7 we see that both lines bounce up and
down a bit. Toward the end, the average waiting time for pump capacity 3 is
about 0.5 and for v = 2 about 2. In a longer simulation we would see each of
the averages converge to a limiting value (a consequence of the so-called law
of large numbers, the topic of Chapter 13).
Fig. 6.7. Averaged waiting times at the well, for pump capacity 2 and 3.
6.4 The single-server queue 83
Work-in-system
To show how busy it is at the pump one could record how many customers are
waiting in the queue and plot this quantity against time. A slightly different
approach is to record at every moment how much work there is in the system,
that is, how much time it would take to serve everyone present at that moment.
For example, if I am halfway through filling my 4-liter jerry can and three
persons are waiting who require 2, 3, and 5 liters, respectively, then there are
12 liters to go; at v = 2, there is 6 minutes of work in the system, and at
v = 3 just 4.
The amount of work in the system just before a customer arrives equals the
waiting time of that customer, because it is exactly the time it takes to finish
the work for everybody ahead of her. The work-in-system at time t tells us
how long the wait would be if somebody were to arrive at t. For this reason,
this quantity is also called the virtual waiting time.
Figure 6.8 shows the work-in-system as a function of time for the first 15
minutes, using the same realizations that were the basis for Table 6.2. In the
top graph, corresponding to v = 2, the work in the system jumps to 2.20
(which is the realization of R1/2) at t = 0.24, when the first customer arrives.
So at t = 2.21, which is 1.97 later, there is 2.20 − 1.97 = 0.23 minute of work
left; this is the waiting time for customer 2, who brings an amount of work
of 2.00 minutes, so the peak at t = 2.21 is 0.23 + 2.00 = 2.23, etc. In the bottom
graph we see the work-in-system reach zero more often, because the individual
(work) amounts are 2/3 of what they are when v = 2. More often, arriving
[Figure: work in system (0 to 5 minutes of work) plotted against time t from 0 to 15 minutes, two panels.]
Fig. 6.8. Work in system: top, v = 2; bottom, v = 3.
[Figure: work in system (0 to 10 minutes of work) plotted against time t from 0 to 100 minutes, two panels.]
Fig. 6.9. Work in system: top, v = 2; bottom, v = 3.
customers find the queue empty and the pump not in use; they do not have
to wait.
In Figure 6.9 the work-in-system is depicted as a function of time for the
first 100 minutes of our run. At pump capacity 2 the virtual waiting time
peaks at close to 11 minutes after about 55 minutes, whereas with v = 3 the
corresponding peak is only about 4 minutes. There also is a marked difference
in the proportion of time the system is empty.
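The work-in-system curves of Figures 6.8 and 6.9 can be traced with the same ingredients: the work jumps up by Si at every arrival and decreases at rate 1 in between, never dropping below zero. A small sketch of ours (not the authors' code) that evaluates this quantity at an arbitrary time point:

def work_in_system(t, arrivals, services):
    # Virtual waiting time at time t: process the arrivals up to time t; at each
    # arrival the work jumps up by the service time, and between arrivals it
    # decreases at rate 1, never dropping below zero.
    work, prev = 0.0, 0.0
    for a, s in zip(arrivals, services):
        if a > t:
            break
        work = max(work - (a - prev), 0.0) + s
        prev = a
    return max(work - (t - prev), 0.0)

# The Table 6.2 realizations with v = 2:
arrivals = [0.24, 2.21, 3.94, 6.76, 7.77, 8.86]
services = [2.20, 2.00, 1.17, 2.01, 2.09, 2.12]
print(round(work_in_system(2.21, arrivals, services), 2))   # 2.23, the peak at t = 2.21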
6.5 Solutions to the quick exercises
6.1 To simulate the coin, choose any three of the six possible outcomes of
the die, report heads if one of these three outcomes turns up, and report tails
otherwise. For example, heads if the outcome is odd, tails if it is even.
To simulate the die using a coin is more difficult; one solution is as follows.
Toss the coin three times and use the following conversion table to map the
result:
Coins HHH HHT HTH HTT THH THT
Die 1 2 3 4 5 6
Repeat the coin tosses if you get TTH or TTT.
6.2 Let the U(0, 1) variable be U and set:
Y = 1 if U < 3/5,   Y = 3 if 3/5 ≤ U < 4/5,   Y = 4 if U ≥ 4/5.
So, for example, P(Y = 3) = P(3/5 ≤ U < 4/5) = 1/5.
6.3 The given distribution function F is strictly increasing between 1 and 3, so we use the method with F inv. Solve the equation F(x) = (1/4)(x − 1)² = u for x. This yields x = 1 + 2√u, so we can set X = 1 + 2√U. If you need to be convinced, determine FX.
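The inversion in this solution is mechanical enough to sketch in code; the following lines are only an illustration of ours, with the standard library's uniform generator playing the role of U.

import random
from math import sqrt

def draw_X(rng):
    # Inversion for F(x) = (1/4)(x - 1)^2 on [1, 3]: solving F(x) = u gives x = 1 + 2*sqrt(u).
    u = rng.random()            # a realization of the U(0, 1) variable U
    return 1 + 2 * sqrt(u)

rng = random.Random(0)
sample = [draw_X(rng) for _ in range(100_000)]
print(sum(sample) / len(sample))    # close to E[X] = 1 + 2 E[sqrt(U)] = 1 + 4/3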
6.4 In ascending order the values are −0.05, 0.13, 0.22, 0.23, 0.25, 0.26, 0.39,
so for M we find 0.23, and for T (0.13 + 0.22 + 0.23 + 0.25 + 0.26)/5 = 0.22.
6.5 We find:
Input realizations v = 2 v = 3
i Ti Arr.time Ri Si Wi Si Wi
7 1.86 10.72 4.79 2.39 2.25 1.60 0.18
8 1.08 11.80 2.33 1.16 3.57 0.78 0.70
6.6 Exercises
6.1 Let U have a U(0, 1) distribution.
a. Describe how to simulate the outcome of a roll with a die using U.
b. Define Y as follows: round 6U + 1 down to the nearest integer. What are
the possible outcomes of Y and their probabilities?
6.2  We simulate the random variable X = 1 + 2√U constructed in Quick exercise 6.3. As realization for U we obtain from the pseudo random generator the number u = 0.3782739.
a. What is the corresponding realization x of the random variable X?
b. If the next call to the random generator yields u = 0.3, will the corre-
sponding realization for X be larger or smaller than the value you found
in a?
c. What is the probability the next draw will be smaller than the value you
found in a?
6.3 Let U have a U(0, 1) distribution. Show that Z = 1 − U has a U(0, 1)
distribution by deriving the probability density function or the distribution
function.
6.4 Let F be the distribution function as given in Quick exercise 6.3: F(x) is 0 for x < 1 and 1 for x > 3, and F(x) = (1/4)(x − 1)² if 1 ≤ x ≤ 3. In the answer it is claimed that X = 1 + 2√U has distribution function F, where U is a U(0, 1) random variable. Verify this by computing P(X ≤ a) and checking that this equals F(a), for any a.
6.5  We have seen that if U has a U(0, 1) distribution, then X = − ln U has an Exp(1) distribution. Check this by verifying that P(X ≤ a) = 1 − e^{−a} for a ≥ 0.
6.6  Somebody messed up the random number generator in your computer:
instead of uniform random numbers it generates numbers with an Exp(2) dis-
tribution. Describe how to construct a U(0, 1) random variable U from an
Exp(2) distributed X.
Hint: look at how you obtain an Exp(2) random variable from a U(0, 1) ran-
dom variable.
6.7  In models for the lifetimes of mechanical components one sometimes uses random variables with distribution functions from the so-called Weibull family. Here is an example: F(x) = 0 for x < 0, and F(x) = 1 − e^{−5x²} for x ≥ 0.
Construct a random variable Z with this distribution from a U(0, 1) variable.
6.8 A random variable X has a Par(3) distribution, so with distribution function F with F(x) = 0 for x < 1, and F(x) = 1 − x^{−3} for x ≥ 1. For details on the Pareto distribution see Section 5.4. Describe how to construct X from a U(0, 1) random variable.
6.9  In Quick exercise 6.1 we simulated a die by tossing three coins. Recall
that we might need several attempts before succeeding.
a. What is the probability that we succeed on the first try?
b. Let N be the number of tries that we need. Determine the distribution
of N.
6.10  There is usually more than one way to simulate a particular random
variable. In this exercise we consider two ways to generate geometric random
variables.
a. We give you a sequence of independent U(0, 1) random variables U1, U2,
. . . . From this sequence, construct a sequence of Bernoulli random vari-
ables. From the sequence of Bernoulli random variables, construct a (sin-
gle) Geo(p) random variable.
b. It is possible to generate a Geo(p) random variable using just one U(0, 1)
random variable. If calls to the random number generator take a lot of
CPU time, this would lead to faster simulation programs. Set λ = − ln(1 − p) and let Y have an Exp(λ) distribution. We obtain Z from Y by rounding to the nearest integer greater than Y . Note that Z is a discrete random variable, whereas Y is a continuous one. Show that, nevertheless, the event {Z > n} is the same as {Y > n}. Use this to compute P(Z > n) from the distribution of Y . What is the distribution of Z? (See Quick exercise 4.6.)
6.11 Reconsider the jury example (see Section 6.3). Suppose the first jury
member is bribed to vote in favor of the present candidate.
a. How should you now model Y1? Describe how you can investigate which
of the two rules is less sensitive to the effect of the bribery.
b. The International Skating Union decided to adopt a rule similar to the
following: randomly discard two of the jury scores, then average the re-
maining scores. Describe how to investigate this rule. Do you expect this
rule to be more sensitive to the bribery than the two rules already dis-
cussed, or less sensitive?
6.12  A tiny financial model. To investigate investment strategies, con-
sider the following:
You can choose to invest your money in one particular stock or put it in a
savings account. Your initial capital is €1000. The interest rate r is 0.5% per
month and does not change. The initial stock price is €100. Your stochastic
model for the stock price is as follows: next month the price is the same as
this month with probability 1/2, with probability 1/4 it is 5% lower, and with
probability 1/4 it is 5% higher. This principle applies for every new month.
There are no transaction costs when you buy or sell stock.
Your investment strategy for the next 5 years is: convert all your money to
stock when the price drops below €95, and sell all stock and put the money
in the bank when the stock price exceeds €110.
Describe how to simulate the results of this strategy for the model given.
6.13 We give you an unfair coin and you do not know P(H) for this coin. Can
you simulate a fair coin, and how many tosses do you need for each fair coin
toss?
7
Expectation and variance
Random variables are complicated objects, containing a lot of information
on the experiments that are modeled by them. If we want to summarize a
random variable by a single number, then this number should undoubtedly
be its expected value. The expected value, also called the expectation or mean,
gives the center—in the sense of average value—of the distribution of the
random variable. If we allow a second number to describe the random variable,
then we look at its variance, which is a measure of spread of the distribution
of the random variable.
7.1 Expected values
An oil company needs drill bits in an exploration project. Suppose that it is
known that (after rounding to the nearest hour) drill bits of the type used
in this particular project will last 2, 3, or 4 hours with probabilities 0.1, 0.7,
and 0.2. If a drill bit is replaced by one of the same type each time it has worn
out, how long could exploration be continued if in total the company would
reserve 10 drill bits for the exploration job? What most people would do to
answer this question is to take the weighted average
0.1 · 2 + 0.7 · 3 + 0.2 · 4 = 3.1,
and conclude that the exploration could continue for 10 × 3.1, or 31 hours.
This weighted average is what we call the expected value or expectation of the
random variable X whose distribution is given by
P(X = 2) = 0.1, P(X = 3) = 0.7, P(X = 4) = 0.2.
It might happen that the company is unlucky and that each of the 10 drill bits
has worn out after two hours, in which case exploration ends after 20 hours.
At the other extreme, they may be lucky and drill for 40 hours on these 10
bits. However, it is a mathematical fact that the conclusion about a 31-hour
total drilling time is correct in the following sense: for a large number n of
drill bits the total running time will be around n times 3.1 hours with high
probability. In the example, where n = 10, we have, for instance, that drilling
will continue for 29, 30, 31, 32, or 33 hours with probability more than 0.86,
while the probability that it will last only for 20, 21, 22, 23, or 24 hours is less
than 0.00006. We will come back to this in Chapters 13 and 14. This example
illustrates the following definition.
Definition. The expectation of a discrete random variable X taking
the values a1, a2, . . . and with probability mass function p is the
number
E[X] = Σi ai P(X = ai) = Σi ai p(ai).
We also call E[X] the expected value or mean of X. Since the expectation is
determined by the probability distribution of X only, we also speak of the
expectation or mean of the distribution.
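For a distribution given as a short list of values and probabilities, the definition translates directly into a one-line computation; here is a sketch of ours using the drill bit distribution from the example above.

def expectation(values, probs):
    # E[X] = sum over i of a_i * P(X = a_i)
    return sum(a * p for a, p in zip(values, probs))

# Drill bit example: X = 2, 3, or 4 hours with probabilities 0.1, 0.7, 0.2
print(expectation([2, 3, 4], [0.1, 0.7, 0.2]))   # 3.1 (up to floating-point rounding)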
Quick exercise 7.1 Let X be the discrete random variable that takes the
values 1, 2, 4, 8, and 16, each with probability 1/5. Compute the expectation
of X.
Looking at an expectation as a weighted average gives a more physical in-
terpretation of this notion, namely as the center of gravity of weights p(ai)
placed at the points ai. For the random variable associated with the drill bit,
this is illustrated in Figure 7.1.
[Figure: point masses 0.1, 0.7, and 0.2 placed at the values 2, 3, and 4, balancing at 3.1.]
Fig. 7.1. Expected value as center of gravity.
This point of view also leads the way to how one should define the expected
value of a continuous random variable. Let, for example, X be a continuous
random variable whose probability density function f is zero outside the in-
terval [0, 1]. It seems reasonable to approximate X by the discrete random
variable Y , taking the values 1/n, 2/n, . . . , (n − 1)/n, 1, with as probabilities the masses that X assigns to the intervals [(k − 1)/n, k/n]:
P(Y = k/n) = P((k − 1)/n ≤ X ≤ k/n) = ∫_{(k−1)/n}^{k/n} f(x) dx.
We have a good idea of the size of this probability. For large n, it can be approximated well in terms of f:
P(Y = k/n) = ∫_{(k−1)/n}^{k/n} f(x) dx ≈ (1/n) f(k/n).
The “center-of-gravity” interpretation suggests that the expectation E[Y ] of Y should approximate the expectation E[X] of X. We have
E[Y ] = Σ_{k=1}^{n} (k/n) P(Y = k/n) ≈ Σ_{k=1}^{n} (k/n) f(k/n) (1/n).
By the definition of a definite integral, for large n the right-hand side is close to
∫_{0}^{1} x f(x) dx.
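Before turning this into a definition, the limit can be checked numerically: as n grows, the sum Σ (k/n) f(k/n) (1/n) approaches ∫_{0}^{1} x f(x) dx. A sketch of ours, taking as an arbitrary example the density f(x) = 2x on [0, 1], whose expectation is 2/3:

def approx_expectation(f, n):
    # The sum  sum_{k=1}^{n} (k/n) * f(k/n) * (1/n)  from the text.
    return sum((k / n) * f(k / n) * (1 / n) for k in range(1, n + 1))

f = lambda x: 2 * x            # an example density on [0, 1]; its expectation is 2/3
for n in (10, 100, 1000):
    print(n, approx_expectation(f, n))   # approaches 2/3 as n grows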
This motivates the following definition.
Definition. The expectation of a continuous random variable X
with probability density function f is the number
E[X] = ∫_{−∞}^{∞} x f(x) dx.
We also call E[X] the expected value or mean of X. Note that E[X] is indeed the center of gravity of the mass distribution described by the function f:
E[X] = ∫_{−∞}^{∞} x f(x) dx = (∫_{−∞}^{∞} x f(x) dx) / (∫_{−∞}^{∞} f(x) dx).
This is illustrated in Figure 7.2.
[Figure: a probability density f, with the balancing point of the area under f marked on the horizontal axis.]
Fig. 7.2. Expected value as center of gravity, continuous case.
Quick exercise 7.2 Compute the expectation of a random variable U that
is uniformly distributed over [2, 5].
Remark 7.1 (The expected value may not exist!). In the definitions in this section we have been rather careless about the convergence of sums and integrals. Let us take a closer look at the integral I = ∫_{−∞}^{∞} x f(x) dx. Since a probability density function cannot take negative values, we have I = I− + I+ with I− = ∫_{−∞}^{0} x f(x) dx a negative and I+ = ∫_{0}^{∞} x f(x) dx a positive number. However, it may happen that I− equals −∞ or I+ equals +∞. If both I− = −∞ and I+ = +∞, then we say that the expected value does not exist. An example of a continuous random variable for which the expected value does not exist is the random variable with the Cauchy distribution (see also page 161), having probability density function
f(x) = 1/(π(1 + x²)) for −∞ < x < ∞.
For this random variable
I+ = ∫_{0}^{∞} x · 1/(π(1 + x²)) dx = [(1/(2π)) ln(1 + x²)]_{0}^{∞} = +∞,
I− = ∫_{−∞}^{0} x · 1/(π(1 + x²)) dx = [(1/(2π)) ln(1 + x²)]_{−∞}^{0} = −∞.
If I− is finite but I+ = +∞, then we say that the expected value is infinite. A distribution that has an infinite expectation is the Pareto distribution with parameter α = 1 (see Exercise 7.11). The remarks we made on the integral in the definition of E[X] for continuous X apply similarly to the sum in the definition of E[X] for discrete random variables X.
7.2 Three examples
The geometric distribution
If you buy a lottery ticket every week and you have a chance of 1 in 10 000
of winning the jackpot, what is the expected number of weeks you have to
buy tickets before you get the jackpot? The answer is: 10 000 weeks (almost
two centuries!). The number of weeks is modeled by a random variable with
a geometric distribution with parameter p = 10^{−4}.
The expectation of a geometric distribution. Let X have a geometric distribution with parameter p; then
E[X] = Σ_{k=1}^{∞} k p(1 − p)^{k−1} = 1/p.
Here Σ_{k=1}^{∞} k p(1 − p)^{k−1} = 1/p follows from the formula Σ_{k=1}^{∞} k x^{k−1} = 1/(1 − x)² that has been derived in your calculus course. We will see a simple (probabilistic) way to obtain the value of this sum in Chapter 11.
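A quick numerical check of ours (not from the text): truncating the series at a large index already gives 1/p.

p = 1e-4
# Truncated series sum_{k>=1} k * p * (1 - p)^(k - 1); terms with k beyond 10^6 are negligible here.
series = sum(k * p * (1 - p) ** (k - 1) for k in range(1, 1_000_000))
print(series)   # close to 1/p = 10 000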
The exponential distribution
In Section 5.6 we considered the chemical reactor example, where the residence
time T , measured in minutes, has an Exp(0.5) distribution. We claimed that
this implies that the mean time a particle stays in the vessel is 2 minutes.
More generally, we have the following.
The expectation of an exponential distribution. Let X have an exponential distribution with parameter λ; then
E[X] = ∫_{0}^{∞} x λ e^{−λx} dx = 1/λ.
The integral has been determined in your calculus course (with the technique
of integration by parts).
The normal distribution
Here, using that the normal density integrates to 1 and applying the substi-
tution z = (x − µ)/σ,
E[X] = ∫_{−∞}^{∞} x · (1/(σ√(2π))) e^{−½((x−µ)/σ)²} dx = µ + ∫_{−∞}^{∞} (x − µ) · (1/(σ√(2π))) e^{−½((x−µ)/σ)²} dx
= µ + σ ∫_{−∞}^{∞} z · (1/√(2π)) e^{−½z²} dz = µ,
where the integral is 0, because the integrand is an odd function. We obtained
the following rule.
The expectation of a normal distribution. Let X be an N(µ, σ²) distributed random variable. Then
E[X] = ∫_{−∞}^{∞} x · (1/(σ√(2π))) e^{−½((x−µ)/σ)²} dx = µ.
7.3 The change-of-variable formula
Often one does not want to compute the expected value of a random variable X but rather of a function of X, as, for example, X². We then need to determine the distribution of Y = X², for example by computing the distribution function FY of Y (this is an example of the general problem of how distributions change under transformations—this topic is the subject of Chapter 8). For a concrete example, suppose an architect wants maximal variety in the sizes of buildings: these should be of the same width and depth X, but X is uniformly distributed between 0 and 10 meters. What is the distribution of the area X² of a building; in particular, will this distribution be (anything near to) uniform? Let us compute FY ; for 0 ≤ a ≤ 100:
FY (a) = P(X² ≤ a) = P(X ≤ √a) = √a / 10.
Hence the probability density function fY of the area is, for 0 < y < 100 meters squared, given by
fY (y) = (d/dy) FY (y) = (d/dy)(√y / 10) = 1/(20√y). (7.1)
This means that the buildings with small areas are heavily overrepresented,
because fY explodes near 0—see also Figure 7.3, in which we plotted fY .
Surprisingly, this is not very visible in Figure 7.4, an example where we should
believe our calculations more than our eyes. In the figure the locations of
the buildings are generated by a Poisson process, the subject of Chapter 12.
Suppose that a contractor has to make an offer on the price of the foundations
of the buildings. The amount of concrete he needs will be proportional to the
area X² of a building. So his problem is: what is the expected area of a building? With fY from (7.1) he finds
E[X²] = E[Y ] = ∫_{0}^{100} y · 1/(20√y) dy = ∫_{0}^{100} (√y / 20) dy = (1/20)(2/3) y^{3/2} |_{0}^{100} = 33 1/3 m².
[Figure: the graph of the density fY from (7.1), which explodes near 0.]
Fig. 7.3. The probability density of the square of a U(0, 10) random variable.
It is interesting to note that we really need to do this calculation, because the expected area is not simply the product of the expected width and the expected depth, which is 25 m². However, there is a much easier way in which the contractor could have obtained this result. He could have argued that the value of the area is x² when x is the width, and that he should take the weighted average of those values, where the weight at width x is given by the value fX(x) of the probability density of X. Then he would have computed
E[X²] = ∫_{−∞}^{∞} x² fX(x) dx = ∫_{0}^{10} x² · (1/10) dx = (1/30) x³ |_{0}^{10} = 33 1/3 m².
It is indeed a mathematical theorem that this is always a correct way to
compute expected values of functions of random variables.
[Figure: the sampled widths marked on a 0–10 m scale (top) and the corresponding square buildings placed in a 100×300 m area (bottom).]
Fig. 7.4. Top: widths of the buildings between 0 and 10 meters. Bottom: corresponding buildings in a 100×300 m area.
The change-of-variable formula. Let X be a random variable, and let g : R → R be a function.
If X is discrete, taking the values a1, a2, . . . , then
E[g(X)] = Σi g(ai) P(X = ai).
If X is continuous, with probability density function f, then
E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx.
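For the architect's example both routes can be checked numerically: integrating y fY (y) over (0, 100) and integrating x² fX(x) over (0, 10) should both give 33 1/3 m². A sketch of ours, using a simple midpoint rule rather than a library integrator:

from math import sqrt

def midpoint_integral(g, a, b, n=100_000):
    # Integrate g over [a, b] with the midpoint rule (which avoids the endpoint y = 0).
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

f_X = lambda x: 1 / 10                   # density of the width X ~ U(0, 10)
f_Y = lambda y: 1 / (20 * sqrt(y))       # density (7.1) of the area Y = X^2

print(midpoint_integral(lambda y: y * f_Y(y), 0, 100))      # E[Y]    ~ 33.33
print(midpoint_integral(lambda x: x ** 2 * f_X(x), 0, 10))  # E[X^2]  ~ 33.33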
Quick exercise 7.3 Let X have a Ber(p) distribution. Compute E[2^X].
An operation that occurs very often in practice is a change of units, e.g., from
Fahrenheit to Celsius. What happens then to the expectation? Here we have
to apply the formula with the function g(x) = rx + s, where r and s are
real numbers. When X has a continuous distribution, the change-of-variable
formula yields:
E[rX + s] = ∫_{−∞}^{∞} (rx + s) f(x) dx = r ∫_{−∞}^{∞} x f(x) dx + s ∫_{−∞}^{∞} f(x) dx = rE[X] + s.
A similar computation with integrals replaced by sums gives the same result
for discrete random variables.
7.4 Variance
Suppose you are offered an opportunity for an investment whose expected
return is €500. If you are given the extra information that this expected
value is the result of a 50% chance of a €450 return and a 50% chance of a
€550 return, then you would not hesitate to spend €450 on this investment.
However, if the expected return were the result of a 50% chance of a €0 return
and a 50% chance of a €1000 return, then most people would be reluctant to
spend such an amount. This demonstrates that the spread (around the mean)
of a random variable is of great importance. Usually this is measured by the
expected squared deviation from the mean.
Definition. The variance Var(X) of a random variable X is the number
Var(X) = E[(X − E[X])²].
Note that the variance of a random variable is always positive (or 0). Furthermore, there is the question of existence and finiteness (cf. Remark 7.1). In practical situations one often considers the standard deviation defined by √Var(X), because it has the same dimension as E[X].
As an example, let us compute the variance of a normal distribution. If X has an N(µ, σ²) distribution, then:
Var(X) = E[(X − E[X])²] = ∫_{−∞}^{∞} (x − µ)² · (1/(σ√(2π))) e^{−½((x−µ)/σ)²} dx = σ² ∫_{−∞}^{∞} z² · (1/√(2π)) e^{−½z²} dz.
Here we substituted z = (x − µ)/σ. Using integration by parts one finds that
∫_{−∞}^{∞} z² · (1/√(2π)) e^{−½z²} dz = 1.
We have found the following property.
Variance of a normal distribution. Let X be an N(µ, σ²) distributed random variable. Then
Var(X) = ∫_{−∞}^{∞} (x − µ)² · (1/(σ√(2π))) e^{−½((x−µ)/σ)²} dx = σ².
Quick exercise 7.4 Let us call the two returns discussed above Y1 and Y2,
respectively. Compute the variance and standard deviation of Y1 and Y2.
It is often not practical to compute Var(X) directly from the definition, but
one uses the following rule.
An alternative expression for the variance. For any random variable X,
Var(X) = E[X²] − (E[X])².
To see that this rule holds, we apply the change-of-variable formula. Suppose X is a continuous random variable with probability density function f (the discrete case runs completely analogously). Using the change-of-variable formula, well-known properties of the integral, and ∫_{−∞}^{∞} f(x) dx = 1, we find
Var(X) = E[(X − E[X])²] = ∫_{−∞}^{∞} (x − E[X])² f(x) dx
= ∫_{−∞}^{∞} (x² − 2xE[X] + (E[X])²) f(x) dx
= ∫_{−∞}^{∞} x² f(x) dx − 2E[X] ∫_{−∞}^{∞} x f(x) dx + (E[X])² ∫_{−∞}^{∞} f(x) dx
= E[X²] − 2(E[X])² + (E[X])² = E[X²] − (E[X])².
With this rule we make two steps: first we compute E[X], then we compute E[X²]. The latter is called the second moment of X. Let us compare the computations, using the definition and this rule for the drill bit example. Recall that for this example X takes the values 2, 3, and 4 with probabilities 0.1, 0.7, and 0.2. We found that E[X] = 3.1. According to the definition
Var(X) = E[(X − 3.1)²] = 0.1 · (2 − 3.1)² + 0.7 · (3 − 3.1)² + 0.2 · (4 − 3.1)²
= 0.1 · (−1.1)² + 0.7 · (−0.1)² + 0.2 · (0.9)²
= 0.1 · 1.21 + 0.7 · 0.01 + 0.2 · 0.81
= 0.121 + 0.007 + 0.162 = 0.29.
Using the rule is neater and somewhat faster:
Var(X) = E[X²] − (3.1)² = 0.1 · 2² + 0.7 · 3² + 0.2 · 4² − 9.61
= 0.1 · 4 + 0.7 · 9 + 0.2 · 16 − 9.61
= 0.4 + 6.3 + 3.2 − 9.61 = 0.29.
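The two routes to Var(X) in the drill bit example are easily mirrored in code; the following lines are a sketch of ours.

values, probs = [2, 3, 4], [0.1, 0.7, 0.2]

E = lambda g: sum(g(a) * p for a, p in zip(values, probs))   # E[g(X)] for this discrete X

mean = E(lambda x: x)                        # 3.1
var_def = E(lambda x: (x - mean) ** 2)       # definition: E[(X - E[X])^2]
var_alt = E(lambda x: x ** 2) - mean ** 2    # rule:       E[X^2] - (E[X])^2
print(mean, var_def, var_alt)                # 3.1  0.29  0.29 (up to rounding)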
What happens to the variance if we change units? At the end of the pre-
vious section we showed that E[rX + s] = rE[X] + s. This can be used to
obtain the corresponding rule for the variance under change of units (see also
Exercise 7.15).
Expectation and variance under change of units. For any random variable X and any real numbers r and s,
E[rX + s] = rE[X] + s, and Var(rX + s) = r² Var(X).
Note that the variance is insensitive to the shift over s. Can you understand
why this must be true without doing any computations?
7.5 Solutions to the quick exercises
7.1 We have
E[X] = Σi ai P(X = ai) = 1 · (1/5) + 2 · (1/5) + 4 · (1/5) + 8 · (1/5) + 16 · (1/5) = 31/5 = 6.2.
7.2 The probability density function f of U is given by f(x) = 0 outside [2, 5] and f(x) = 1/3 for 2 ≤ x ≤ 5; hence
E[U] = ∫_{−∞}^{∞} x f(x) dx = ∫_{2}^{5} (1/3) x dx = (1/6) x² |_{2}^{5} = 3 1/2.
7.3 Using the change-of-variable formula we obtain
E[2^X] = Σi 2^{ai} P(X = ai) = 2^0 · P(X = 0) + 2^1 · P(X = 1) = 1 · (1 − p) + 2 · p = 1 − p + 2p = 1 + p.
You could also have noted that Y = 2^X has a distribution given by P(Y = 1) = 1 − p, P(Y = 2) = p; hence
E[2^X] = E[Y ] = 1 · P(Y = 1) + 2 · P(Y = 2) = 1 · (1 − p) + 2 · p = 1 + p.
7.4 We have
Var(Y1) = (1/2)(450 − 500)² + (1/2)(550 − 500)² = 50² = 2500,
so Y1 has standard deviation €50 and
Var(Y2) = (1/2)(0 − 500)² + (1/2)(1000 − 500)² = 500² = 250 000,
so Y2 has standard deviation €500.
7.6 Exercises
7.1  Let T be the outcome of a roll with a fair die.
a. Describe the probability distribution of T , that is, list the outcomes and
the corresponding probabilities.
b. Determine E[T ] and Var(T ).
7.2  The probability distribution of a discrete random variable X is given
by
P(X = −1) = 1/5, P(X = 0) = 2/5, P(X = 1) = 2/5.
a. Compute E[X].
b. Give the probability distribution of Y = X² and compute E[Y ] using the distribution of Y .
c. Determine E[X²] using the change-of-variable formula. Check your answer against the answer in b.
d. Determine Var(X).
7.3 For a certain random variable X it is known that E[X] = 2, Var(X) = 3. What is E[X²]?
7.4 Let X be a random variable with E[X] = 2, Var(X) = 4. Compute the
expectation and variance of 3 − 2X.
7.5  Determine the expectation and variance of the Ber(p) distribution.
7.6  The random variable Z has probability density function f(z) = 3z²/19 for 2 ≤ z ≤ 3 and f(z) = 0 elsewhere. Determine E[Z]. Before you do the calculation: will the answer lie closer to 2 than to 3 or the other way around?
7.7 Given is a random variable X with probability density function f given by f(x) = 0 for x < 0 and for x > 1, and f(x) = 4x − 4x³ for 0 ≤ x ≤ 1.
Determine the expectation and variance of the random variable 2X + 3.
7.8  Given is a continuous random variable X whose distribution function F satisfies F(x) = 0 for x < 0, F(x) = 1 for x > 1, and F(x) = x(2 − x) for 0 ≤ x ≤ 1. Determine E[X].
7.9 Let U be a random variable with a U(α, β) distribution.
a. Determine the expectation of U.
b. Determine the variance of U.
7.10  Let X have an exponential distribution with parameter λ.
a. Determine E[X] and E[X²] using partial integration.
b. Determine Var(X).
7.11  In this exercise we take a look at the mean of a Pareto distribution.
a. Determine the expectation of a Par(2) distribution.
b. Determine the expectation of a Par(1/2) distribution.
c. Let X have a Par(α) distribution. Show that E[X] = α/(α − 1) if α > 1.
7.12 For which α is the variance of a Par(α) distribution finite? Compute the
variance for these α.
7.13 Remember that we found on page 95 that the expected area of a building was 33 1/3 m², whereas the square of the expected width was only 25 m². This phenomenon is more general: show that for any random variable X one has E[X²] ≥ (E[X])².
Hint: you might use that Var(X) ≥ 0.
7.14 Suppose we choose arbitrarily a point from the square with corners at
(2,1), (3,1), (2,2), and (3,2). The random variable A is the area of the triangle
with its corners at (2,1), (3,1), and the chosen point. (See also Exercise 5.9
and Figure 7.5.) Compute E[A].
[Figure: the square with corners (2, 1), (3, 1), (2, 2), and (3, 2); the triangle A has corners (2, 1), (3, 1), and the randomly chosen point.]
Fig. 7.5. A triangle in a 1×1 square.
7.15  Let X be a random variable and r and s any real numbers. Use the
change-of-units rule E[rX + s] = rE[X] + s for the expectation to obtain a
and b.
a. Show that Var(rX) = r² Var(X).
b. Show that Var(X + s) = Var(X).
c. Combine parts a and b to show that Var(rX + s) = r² Var(X).
7.16  The probability density function f of the random variable X used in Figure 7.2 is given by f(x) = 0 outside (0, 1) and f(x) = −4x ln(x) for 0 < x < 1. Compute the position of the balancing point in the figure, that is, compute the expectation of X.
7.17  Let U be a discrete random variable taking the values a1, . . . , ar with
probabilities p1, . . . , pr.
a. Suppose all ai ≥ 0, but that E[U] = 0. Show then a1 = a2 = · · · = ar = 0. In other words: P(U = 0) = 1.
b. Suppose that V is a random variable taking the values b1, . . . , br with
probabilities p1, . . . , pr. Show that Var(V ) = 0 implies
P(V = E[V ]) = 1.
Hint: apply a with U = (V − E[V ])².
8
Computations with random variables
There are many ways to make new random variables from old ones. Of course
this is not a goal in itself; usually new variables are created naturally in
the process of solving a practical problem. The expectations and variances
of such new random variables can be calculated with the change-of-variable
formula. However, often one would like to know the distributions of the new
random variables. We shall show how to determine these distributions, how
to compare expectations of random variables and their transformed versions
(Jensen’s inequality), and how to determine the distributions of maxima and
minima of several random variables.
8.1 Transforming discrete random variables
The problem we consider in this section and the next is how the distribution
of a random variable X changes if we apply a function g to it, thus obtaining
a new random variable Y :
Y = g(X).
When X is a discrete random variable this is usually not too hard to do: it
is just a matter of bookkeeping. We illustrate this with an example. Imagine
an airline company that sells tickets for a flight with 150 available seats. It
has no idea about how many tickets it will sell. Suppose, to keep the example
simple, that the number X of tickets that will be sold can be anything from 1
to 200. Moreover, suppose that each possibility has equal probability to occur,
i.e., P(X = j) = 1/200 for j = 1, 2, . . . , 200. The real interest of the airline
company is in the random variable Y, which is the number of passengers that
have to be refused. What is the distribution of Y ? To answer this, note that
nobody will be refused when the passengers fit in the plane, hence
P(Y = 0) = P(X ≤ 150) = 150/200 = 3/4.
For the other values, k = 1, 2, . . . , 50,
P(Y = k) = P(X = 150 + k) = 1/200.
Note that in this example the function g is given by g(x) = max{x − 150, 0}.
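Since X is discrete, the bookkeeping for Y = g(X) is a few lines of code; the sketch below (ours, not the book's) tabulates P(Y = k) for the airline example.

from collections import defaultdict

def push_forward(pmf, g):
    # Distribution of Y = g(X), given the probability mass function of X as a dict.
    result = defaultdict(float)
    for x, p in pmf.items():
        result[g(x)] += p
    return dict(result)

pmf_X = {j: 1 / 200 for j in range(1, 201)}              # P(X = j) = 1/200, j = 1, ..., 200
pmf_Y = push_forward(pmf_X, lambda x: max(x - 150, 0))
print(pmf_Y[0], pmf_Y[1])                                # 3/4 and 1/200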
Quick exercise 8.1 Let Z be the number of passengers that will be in the
plane. Determine the probability distribution of Z. What is the function g in
this case?
8.2 Transforming continuous random variables
We now turn to continuous random variables. Since single values occur with
probability zero for a continuous random variable, the approach above does
not work. The strategy now is to first determine the distribution function of
the transformed random variable Y = g(X) and then the probability density
by differentiating. We shall illustrate this with the following example (actually
we saw an example of such a computation in Section 7.3 with the function
g(x) = x²).
We consider two methods that traffic police employ to determine whether
you deserve a fine for speeding. From experience, the traffic police think that
vehicles are driving at speeds ranging from 60 to 90 km/hour at a certain
road section where the speed limit is 80 km/hour. They assume that the
speed of the cars is uniformly distributed over this interval. The first method
is measuring the speed at a fixed spot in the road section. With this method
the police will find that about (90 − 80)/(90 − 60) = 1/3 of the cars will be
fined.
For the second method, cameras are put at the beginning and end of a 1-km
road section, and a driver is fined if he spends less than a certain amount of
time in the road section. Cars driving at 60 km/hour need one minute, those
driving at 90 km/hour only 40 seconds. Let us therefore model the time T
an arbitrary car spends in the section by a uniform distribution over (40, 60)
seconds. What is the speed V we deduce from this travelling time? Note that
for 40 ≤ t ≤ 60,
P(T ≤ t) = (t − 40)/20.
Since there are 3600 seconds in an hour we have that
V = g(T ) = 3600/T.
We therefore find for the distribution function FV (v) = P(V ≤ v) of the speed V that
FV (v) = P(3600/T ≤ v) = P(T ≥ 3600/v) = 1 − ((3600/v) − 40)/20 = 3 − 180/v
for all speeds v between 60 and 90. We can now obtain the probability density fV of V by differentiating:
fV (v) = (d/dv) FV (v) = (d/dv)(3 − 180/v) = 180/v²
for 60 ≤ v ≤ 90.
It is amusing to note that with the second model the traffic police write fewer speeding tickets because
P(V > 80) = 1 − P(V ≤ 80) = 1 − (3 − 180/80) = 1/4.
(With the first model we found probability 1/3 that a car drove faster than 80 km/hour.) This is related to a famous result in road traffic research, which is succinctly phrased as: “space mean speed < time mean speed” (see [37]).
It is also related to Jensen’s inequality, which we introduce in Section 8.3.
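The difference between the two measuring methods is also easy to reproduce by simulation; the sketch below is ours, with "fast" meaning faster than 80 km/hour.

import random

rng = random.Random(0)
n = 100_000

# Method 1: measure the speed of a passing car directly, V ~ U(60, 90).
fast1 = sum(rng.uniform(60, 90) > 80 for _ in range(n)) / n

# Method 2: measure the travel time T ~ U(40, 60) seconds over 1 km and set V = 3600 / T.
fast2 = sum(3600 / rng.uniform(40, 60) > 80 for _ in range(n)) / n

print(fast1, fast2)   # roughly 1/3 and 1/4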
Similar to the way this is done in the traffic example, one can determine
the distribution of Y = 1/X for any X with a continuous distribution. The
outcome will be that if X has density fX, then the density fY of Y is given
by
fY (y) = (d/dy) FY (y) = (1/y²) fX(1/y) for y < 0 and y > 0.
One can give fY (0) any value; often one puts fY (0) = 0.
Quick exercise 8.2 Let X have a continuous distribution with probability density fX(x) = 1/[π(1 + x²)]. What is the distribution of Y = 1/X?
We turn to a second example. A very common transformation is a change of units, for instance, from Celsius to Fahrenheit. If X is temperature expressed in degrees Celsius, then Y = (9/5)X + 32 is the temperature in degrees Fahrenheit. Let FX and FY be the distribution functions of X and Y . Then we have for any a
FY (a) = P(Y ≤ a) = P((9/5)X + 32 ≤ a) = P(X ≤ (5/9)(a − 32)) = FX((5/9)(a − 32)).
By differentiating FY (using the chain rule), we obtain the probability density fY (y) = (5/9) fX((5/9)(y − 32)). We can do this for more general changes of units, and we obtain the following useful rule.
Change-of-units transformation. Let X be a continuous random variable with distribution function FX and probability density function fX. If we change units to Y = rX + s for real numbers r > 0 and s, then
FY (y) = FX((y − s)/r) and fY (y) = (1/r) fX((y − s)/r).
As an example, let X be a random variable with an N(µ, σ²) distribution, and let Y = rX + s. Then this rule gives us
fY (y) = (1/r) fX((y − s)/r) = (1/(rσ√(2π))) e^{−½((y−rµ−s)/(rσ))²}
for −∞ < y < ∞. On the right-hand side we recognize the probability density of a normal distribution with parameters rµ + s and r²σ². This illustrates the following rule.
Normal random variables under change of units. Let X be a random variable with an N(µ, σ²) distribution. For any r ≠ 0 and any s, the random variable rX + s has an N(rµ + s, r²σ²) distribution.
Note that if X has an N(µ, σ2
) distribution, then with r = 1/σ and s = −µ/σ
we conclude that
Z =
1
σ
X +

−
µ
σ

=
X − µ
σ
has an N(0, 1) distribution. As a consequence
FX(a) = P(X ≤ a) = P(σZ + µ ≤ a) = P(Z ≤ (a − µ)/σ) = Φ((a − µ)/σ).
So any probability for an N(µ, σ²) distributed random variable X can be
expressed in terms of an N(0, 1) distributed random variable Z.
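Readers who want to check such probabilities numerically rather than from Table B.1 can use the identity Φ(z) = (1 + erf(z/√2))/2. The following sketch is ours, not part of the book, and the parameters µ = 10 and σ = 2 are arbitrary illustrative choices.

    from math import erf, sqrt

    def norm_cdf(a, mu, sigma):
        """P(X <= a) for X ~ N(mu, sigma^2), via standardization and
        the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
        z = (a - mu) / sigma                      # the standardized value (a - mu)/sigma
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    # Illustrative parameters (not from the text): X ~ N(10, 4), so sigma = 2.
    print(norm_cdf(13, mu=10, sigma=2))           # Phi(1.5), approximately 0.9332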
Quick exercise 8.3 Compute the probabilities P(X ≤ 5) and P(X ≥ 2) for
X with an N(4, 25) distribution.
8.3 Jensen’s inequality
Without actually computing the distribution of g(X) we can often tell how
E[g(X)] relates to g(E[X]). For the change-of-units transformation g(x) =
rx + s we know that E[g(X)] = g(E[X]) (see Section 7.3). It is a common
error to equate these two sides for other functions g. In fact, equality will very
rarely occur for nonlinear g.
For example, suppose that a company that produces microelectronic parts
has a target production of 240 chips per day, but the yield has only been 40,
60, and 80 chips on three consecutive days. The average production over the
three days then is 60 chips, so on average the production should have been
4 times higher to reach the target. However, one can also look at this in the
following way: on the three days the production should have been 240/40 = 6,
240/60 = 4, and 240/80 = 3 times higher. On average that is
(1/3)(6 + 4 + 3) = 13/3 ≈ 4.33
times higher! What happens here can be explained (take for X the part of the
target production that is realized, where you give equal probabilities to the
three outcomes 1/6, 1/4, and 1/3) by the fact that if X is a random variable
taking positive values, then always
1/E[X] < E[1/X],
unless Var(X) = 0, which only happens if X is not random at all (cf. Exer-
cise 7.17). This inequality is the case g(x) = 1/x on (0, ∞) of the following
result that holds for general convex functions g.
Jensen’s inequality. Let g be a convex function, and let X be
a random variable. Then
g(E[X]) ≤ E[g(X)] .
Recall from calculus that a twice differentiable function g is convex on an
interval I if g″(x) ≥ 0 for all x in I, and strictly convex if g″(x) > 0 for
all x in I. When X takes its values in an interval I (this can, for instance,
be I = (−∞, ∞)), and g is strictly convex on I, then strict inequality holds:
g(E[X]) < E[g(X)], unless X is not random.
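As a minimal numerical check of this inequality, the sketch below (our own Python, not the book's) takes the production example above, where X has the values 1/6, 1/4, and 1/3 with equal probabilities, and compares 1/E[X] with E[1/X] in exact fractions.

    from fractions import Fraction as F

    # X is the realized fraction of the target production:
    # the values 1/6, 1/4, 1/3, each with probability 1/3.
    values = [F(1, 6), F(1, 4), F(1, 3)]
    probs = [F(1, 3)] * 3

    EX = sum(p * x for p, x in zip(probs, values))       # E[X] = 1/4
    E_inv_X = sum(p / x for p, x in zip(probs, values))  # E[1/X] = 13/3

    print(1 / EX, E_inv_X)    # 4 versus 13/3: indeed 1/E[X] < E[1/X]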
In Figure 8.1 we illustrate the way in which this result can be obtained for
the special case of a random variable X that takes two values, a and b. In the
figure, X takes these two values with probability 3/4 and 1/4 respectively.
Convexity of g forces any line segment connecting two points on the graph of
g to lie above the part of the graph between these two points. So if we choose
the line segment from (a, g(a)) to (b, g(b)), then it follows that the point
(E[X], E[g(X)]) = ((3/4)a + (1/4)b, (3/4)g(a) + (1/4)g(b)) = (3/4)·(a, g(a)) + (1/4)·(b, g(b))
on this line lies “above” the point (E[X], g(E[X])) on the graph of g. Hence
E[g(X)] ≥ g(E[X]).
[Figure: a convex function g, with the points a, E[X], and b marked on the horizontal axis and the values g(E[X]) and E[g(X)] marked on the vertical axis.]
Fig. 8.1. Jensen’s inequality.
A simple example is given by g(x) = x². This function is convex (g″(x) = 2
for all x), and hence
(E[X])² ≤ E[X²].
Note that this is exactly the same as saying that Var(X) ≥ 0, which we have
already seen in Section 7.4.
Quick exercise 8.4 Let X be a random variable with Var(X) > 0. Which
is true: E[e^(−X)] < e^(−E[X]) or E[e^(−X)] > e^(−E[X])?
8.4 Extremes
In many situations the maximum (or minimum) of a sequence X1, X2, . . . , Xn
of random variables is the variable of interest. For instance, let X1, X2,
. . . , X365 be the water level of a river during the days of a particular year
for a particular location. Suppose there will be flooding if the level exceeds a
certain height—usually the height of the dykes. The question whether flood-
ing occurs during a year is completely answered by looking at the maximum
of X1, X2, . . . , X365. If one wants to predict occurrence of flooding in the fu-
ture, the probability distribution of this maximum is of great interest. Similar
models arise, for instance, when one is interested in possible damage from a
series of shocks or in the extent of a contamination plume in the subsurface.
We want to find the distribution of the random variable
Z = max{X1, X2, . . . , Xn}.
We can determine the distribution function of Z by realizing that the maxi-
mum of the Xi is smaller than a number a if and only if all Xi are smaller
than a:
FZ (a) = P(Z ≤ a) = P(max{X1, . . . , Xn} ≤ a) = P(X1 ≤ a, . . . , Xn ≤ a) .
Now suppose that the events {Xi ≤ ai} are independent for every choice
of the ai. In this case we call the random variables independent (see also
Chapter 9, where we study independence of random variables). In particular,
the events {Xi ≤ a} are independent for all a. It then follows that
FZ (a) = P(X1 ≤ a, . . . , Xn ≤ a) = P(X1 ≤ a) · · · P(Xn ≤ a) .
Hence, if all random variables have the same distribution function F, then
the following result holds.
The distribution of the maximum. Let X1, X2, . . . , Xn be n
independent random variables with the same distribution function
F, and let Z = max{X1, X2, . . . , Xn}. Then
FZ(a) = (F(a))ⁿ.
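The rule is easy to verify by simulation. In the sketch below (our code; the choices n = 5, a = 1.2 and the Exp(1) distribution, for which F(a) = 1 − e^(−a), are arbitrary) the empirical probability P(Z ≤ a) is compared with (F(a))ⁿ.

    import random, math

    random.seed(2)
    n, trials, a = 5, 100_000, 1.2    # illustrative choices, not from the book

    # Z is the maximum of n independent Exp(1) random variables.
    hits = sum(max(random.expovariate(1.0) for _ in range(n)) <= a
               for _ in range(trials))

    print(hits / trials)              # empirical P(Z <= a)
    print((1 - math.exp(-a)) ** n)    # (F(a))^n from the rule, about 0.167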
Quick exercise 8.5 Let X1, X2, . . . , Xn be independent random variables,
all with a U(0, 1) distribution. Let Z = max{X1, . . . , Xn}. Compute the dis-
tribution function and the probability density function of Z.
What can we say about the distribution of the minimum? Let
V = min{X1, X2, . . . , Xn}.
We can now find the distribution function FV of V by observing that the
minimum of the Xi is larger than a number a if and only if all Xi are larger
than a. The trick is to switch to the complement of the event {V ≤ a}:
FV(a) = P(V ≤ a) = 1 − P(V > a) = 1 − P(min{X1, . . . , Xn} > a)
= 1 − P(X1 > a, . . . , Xn > a) .
So using independence and switching back again, we obtain
FV(a) = 1 − P(X1 > a, . . . , Xn > a) = 1 − P(X1 > a) · · · P(Xn > a)
= 1 − (1 − P(X1 ≤ a)) · · · (1 − P(Xn ≤ a)).
We have found the following result for the minimum.
The distribution of the minimum. Let X1, X2, . . . , Xn be n
independent random variables with the same distribution function
F, and let V = min{X1, X2, . . . , Xn}. Then
FV(a) = 1 − (1 − F(a))ⁿ.
Quick exercise 8.6 Let X1, X2, . . . , Xn be independent random variables,
all with a U(0, 1) distribution. Let V = min{X1, . . . , Xn}. Compute the dis-
tribution function and the probability density function of V .
8.5 Solutions to the quick exercises
8.1 Clearly Z can take the values 1, . . . , 150. The value 150 is special:
the plane is full if 150 or more people buy a ticket. Hence P(Z = 150) =
P(X ≥ 150) = 51/200. For the other values we have P(Z = i) = P(X = i) =
1/200, for i = 1, . . . , 149. Clearly, here g(x) = min{150, x}.
8.2 The probability density of Y = 1/X is
fY(y) = (1/y²) · 1/(π(1 + (1/y)²)) = 1/(π(1 + y²)).
We see that 1/X has the same distribution as X! (This distribution is called
the standard Cauchy distribution, it will be introduced in Chapter 11.)
8.3 First define Z = (X −4)/5, which has an N(0, 1) distribution. Then from
Table B.1
P(X ≤ 5) = P(Z ≤ (5 − 4)/5) = P(Z ≤ 0.20) = 1 − 0.4207 = 0.5793.
Similarly, using the symmetry of the normal distribution,
P(X ≥ 2) = P(Z ≥ (2 − 4)/5) = P(Z ≥ −0.40) = P(Z ≤ 0.40) = 0.6554.
8.4 If g(x) = e^(−x), then g″(x) = e^(−x) > 0; hence g is strictly convex. It follows
from Jensen’s inequality that
e^(−E[X]) ≤ E[e^(−X)].
Moreover, if Var(X) > 0, then the inequality is strict.
8.5 The distribution function of the Xi is given by F(x) = x on [0, 1]. Therefore
the distribution function FZ of the maximum Z is equal to FZ(a) = (F(a))ⁿ = aⁿ.
Its probability density function is
fZ(z) = d/dz FZ(z) = nz^(n−1)  for 0 ≤ z ≤ 1.
8.6 The distribution function of the Xi is given by F(x) = x on [0, 1]. Therefore
the distribution function FV of the minimum V is equal to FV(a) = 1 − (1 − a)ⁿ.
Its probability density function is
fV(v) = d/dv FV(v) = n(1 − v)^(n−1)  for 0 ≤ v ≤ 1.
8.6 Exercises
8.1  Often one is interested in the distribution of the deviation of a random
variable X from its mean µ = E[X]. Let X take the values 80, 90, 100, 110,
and 120, all with probability 0.2; then E[X] = µ = 100. Determine the dis-
tribution of Y = |X − µ|. That is, specify the values Y can take and give the
corresponding probabilities.
8.2  Suppose X has a uniform distribution over the points {1, 2, 3, 4, 5, 6}
and that g(x) = sin(πx/2).
a. Determine the distribution of Y = g(X) = sin(πX/2), that is, specify the
values Y can take and give the corresponding probabilities.
b. Let Z = cos(πX/2). Determine the distribution of Z.
c. Determine the distribution of W = Y² + Z². Warning: in this example
there is a very special dependency between Y and Z, and in general it is
much harder to determine the distribution of a random variable that is a
function of two other random variables. This is the subject of Chapter 11.
8.3  The continuous random variable U is uniformly distributed over [0, 1].
a. Determine the distribution function of V = 2U + 7. What kind of distri-
bution does V have?
b. Determine the distribution function of V = rU + s for all real numbers
r > 0 and s. See Exercise 8.9 for what happens for negative r.
8.4 Transforming exponential distributions.
a. Let X have an Exp(1/2) distribution. Determine the distribution function
of (1/2)X. What kind of distribution does (1/2)X have?
b. Let X have an Exp(λ) distribution. Determine the distribution function
of λX. What kind of distribution does λX have?
8.5  Let X be a continuous random variable with probability density func-
tion
fX(x) = (3/4)x(2 − x)  for 0 ≤ x ≤ 2,  and fX(x) = 0 elsewhere.
a. Determine the distribution function FX.
b. Let Y = √X. Determine the distribution function FY .
c. Determine the probability density of Y .
8.6 Let X be a continuous random variable with probability density fX that
takes only positive values and let Y = 1/X.
a. Determine FY(y) and show that
fY(y) = (1/y²) fX(1/y)  for y > 0.
b. Let Z = 1/Y . Using a, determine the probability density fZ of Z, in terms
of fX.
8.7 Let X have a Par(α) distribution. Determine the distribution function of
ln X. What kind of a distribution does ln X have?
8.8  Let X have an Exp(1) distribution, and let α and λ be positive numbers.
Determine the distribution function of the random variable
W = X^(1/α)/λ.
The distribution of the random variable W is called the Weibull distribution
with parameters α and λ.
8.9 Let X be a continuous random variable. Express the distribution function
and probability density of the random variable Y = −X in terms of those of X.
8.10  Let X be an N(3, 4) distributed random variable. Use the rule for
normal random variables under change of units and Table B.1 to determine
the probabilities P(X ≥ 3) and P(X ≤ 1).
8.11  Let X be a random variable, and let g be a twice differentiable function
with g″(x) ≤ 0 for all x. Such a function is called a concave function. Show
that for concave functions always
g(E[X]) ≥ E[g(X)] .
8.12  Let X be a random variable with the following probability mass func-
tion:
x          0     1     100    10 000
P(X = x)  1/4   1/4    1/4     1/4
a. Determine the distribution of Y = √X.
b. Which is larger, E[√X] or √(E[X])?
Hint: use Exercise 8.11, or start by showing that the function g(x) = −√x
is convex.
c. Compute √(E[X]) and E[√X] to check your answer (and to see that it
makes a big difference!).
8.13 Let W have a U(π, 2π) distribution. What is larger: E[sin(W)] or
sin(E[W])? Check your answer by computing these two numbers.
8.14 In this exercise we take a look at Jensen’s inequality for the function
g(x) = x³ (which is neither convex nor concave on (−∞, ∞)).
a. Can you find a (discrete) random variable X with Var(X) > 0 such that
E[X³] = (E[X])³?
b. Under what kind of conditions on a random variable X will the inequality
E[X³] > (E[X])³ certainly hold?
8.15 Let X1, X2, . . . , Xn be independent random variables, all with a U(0, 1)
distribution. Let Z = max{X1, . . . , Xn} and V = min{X1, . . . , Xn}.
a. Compute E[max{X1, X2}] and E[min{X1, X2}].
b. Compute E[Z] and E[V ] for general n.
c. Can you argue directly (using the symmetry of the uniform distribu-
tion (see Exercise 6.3) and not the result of the computation in b) that
1 − E[max{X1, . . . , Xn}] = E[min{X1, . . . , Xn}]?
8.16 In this exercise we derive a kind of Jensen inequality for the minimum.
a. Let a and b be real numbers. Show that
min{a, b} = (1/2)(a + b − |a − b|).
b. Let X and Y be independent random variables with the same distribution
and finite expectation. Deduce from a that
E[min{X, Y }] = E[X] − (1/2) E[|X − Y |].
c. Show that
E[min{X, Y }] ≤ min{E[X] , E[Y ]}.
Remark: this is not so interesting, since min{E[X] , E[Y ]} = E[X] = E[Y ],
but we will see in the exercises of Chapter 11 that this inequality is also true
for X and Y that do not have the same distribution.
8.17 Let X1, . . . , Xn be n independent random variables with the same dis-
tribution function F.
a. Convince yourself that for any numbers x1, . . . , xn it is true that
min{x1, . . . , xn} = − max{−x1, . . . , −xn}.
b. Let Z = max{X1, X2, . . . , Xn} and V = min{X1, X2, . . . , Xn}. Use Exer-
cise 8.9 and the observation in a to deduce the formula
FV(a) = 1 − (1 − F(a))ⁿ
directly from the formula
FZ(a) = (F(a))ⁿ.
8.18  Let X1, X2, . . . , Xn be independent random variables, all with an
Exp(λ) distribution. Let V = min{X1, . . . , Xn}. Determine the distribution
function of V . What kind of distribution is this?
8.19  From the “north pole” N of a circle with diameter 1, a point Q on
the circle is mapped to a point t on the line by its projection from N, as
illustrated in Figure 8.2.
[Figure: a circle of diameter 1 with north pole N; a point Q on the circle is projected from N onto the point t on the line, and ϕ denotes the corresponding angle.]
Fig. 8.2. Mapping the circle to the line.
Suppose that the point Q is uniformly chosen on the circle. This is the same
as saying that the angle ϕ is uniformly chosen from the interval [−π/2, π/2] (can
you see this?). Let X be this angle, so that X is uniformly distributed over
the interval [−π/2, π/2]. This means that P(X ≤ ϕ) = 1/2 + ϕ/π (cf. Quick
exercise 5.3). What will be the distribution of the projection of Q on the line?
Let us call this random variable Z. Then it is clear that the event {Z ≤ t} is
equal to the event {X ≤ ϕ}, where t and ϕ correspond to each other under
the projection. This means that tan(ϕ) = t, which is the same as saying that
arctan(t) = ϕ.
a. What part of the circle is mapped to the interval [1, ∞)?
b. Compute the distribution function of Z using the correspondence between
t and ϕ.
c. Compute the probability density function of Z.
The distribution of Z is called the Cauchy distribution (which will be discussed
in Chapter 11).
9
Joint distributions and independence
Random variables related to the same experiment often influence one another.
In order to capture this, we introduce the joint distribution of two or more
random variables. We also discuss the notion of independence for random
variables, which models the situation where random variables do not influence
each other. As with single random variables we treat these topics for discrete
and continuous random variables separately.
9.1 Joint distributions of discrete random variables
In a census one is usually interested in several variables, such as income, age,
and gender. In themselves these variables are interesting, but when two (or more) are
studied simultaneously, detailed information is obtained on the society where
the census is performed. For instance, studying income, age, and gender jointly
might give insight into the emancipation of women.
Without mentioning it explicitly, we already encountered several examples of
joint distributions of discrete random variables. For example, in Chapter 4 we
defined two random variables S and M, the sum and the maximum of two
independent throws of a die.
Quick exercise 9.1 List the elements of the event {S = 7, M = 4} and
compute its probability.
In general, the joint distribution of two discrete random variables X and Y ,
defined on the same sample space Ω, is given by prescribing the probabilities
of all possible values of the pair (X, Y ).
Definition. The joint probability mass function p of two discrete
random variables X and Y is the function p : R² → [0, 1], defined by
p(a, b) = P(X = a, Y = b)  for −∞ < a, b < ∞.
To stress the dependence on (X, Y ), we sometimes write pX,Y instead of p.
If X and Y take on the values a1, a2, . . . , ak and b1, b2, . . . , bℓ, respectively,
the joint distribution of X and Y can simply be described by listing all the
possible values of p(ai, bj). For example, for the random variables S and M
from Chapter 4 we obtain Table 9.1.
Table 9.1. Joint probability mass function p(a, b) = P(S = a, M = b).
b
a 1 2 3 4 5 6
2 1/36 0 0 0 0 0
3 0 2/36 0 0 0 0
4 0 1/36 2/36 0 0 0
5 0 0 2/36 2/36 0 0
6 0 0 1/36 2/36 2/36 0
7 0 0 0 2/36 2/36 2/36
8 0 0 0 1/36 2/36 2/36
9 0 0 0 0 2/36 2/36
10 0 0 0 0 1/36 2/36
11 0 0 0 0 0 2/36
12 0 0 0 0 0 1/36
From this table we can retrieve the distribution of S and of M. For example,
because
{S = 6} = {S = 6, M = 1} ∪ {S = 6, M = 2} ∪ · · · ∪ {S = 6, M = 6},
and because the six events
{S = 6, M = 1}, {S = 6, M = 2}, . . ., {S = 6, M = 6}
are mutually exclusive, we find that
pS(6) = P(S = 6) = P(S = 6, M = 1) + · · · + P(S = 6, M = 6)
= p(6, 1) + p(6, 2) + · · · + p(6, 6)
= 0 + 0 + 1/36 + 2/36 + 2/36 + 0 = 5/36.
Table 9.2. Joint distribution and marginal distributions of S and M.
b
a 1 2 3 4 5 6 pS(a)
2 1/36 0 0 0 0 0 1/36
3 0 2/36 0 0 0 0 2/36
4 0 1/36 2/36 0 0 0 3/36
5 0 0 2/36 2/36 0 0 4/36
6 0 0 1/36 2/36 2/36 0 5/36
7 0 0 0 2/36 2/36 2/36 6/36
8 0 0 0 1/36 2/36 2/36 5/36
9 0 0 0 0 2/36 2/36 4/36
10 0 0 0 0 1/36 2/36 3/36
11 0 0 0 0 0 2/36 2/36
12 0 0 0 0 0 1/36 1/36
pM (b) 1/36 3/36 5/36 7/36 9/36 11/36 1
Thus we see that the probabilities of S can be obtained by taking the sum
of the joint probabilities in the rows of Table 9.1. This yields the probability
distribution of S, i.e., all values of pS(a) for a = 2, . . . , 12. We speak of the
marginal distribution of S. In Table 9.2 we have added this distribution in the
right “margin” of the table. Similarly, summing over the columns of Table 9.1
yields the marginal distribution of M, in the bottom margin of Table 9.2.
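Tables 9.1 and 9.2 can be reproduced by brute force over the 36 equally likely outcomes of the two throws. The sketch below is our own Python illustration (the book contains no code): it builds the joint probability mass function of S and M and then sums over rows and columns to recover the marginals.

    from itertools import product
    from fractions import Fraction as F

    # Joint pmf of S (sum) and M (maximum) of two independent fair dice.
    joint = {}
    for d1, d2 in product(range(1, 7), repeat=2):
        key = (d1 + d2, max(d1, d2))
        joint[key] = joint.get(key, F(0)) + F(1, 36)

    # Marginals: sum the joint probabilities over columns and rows.
    p_S = {a: sum(p for (s, m), p in joint.items() if s == a) for a in range(2, 13)}
    p_M = {b: sum(p for (s, m), p in joint.items() if m == b) for b in range(1, 7)}

    print(joint[(6, 3)])     # 1/36, as in Table 9.1
    print(p_S[6], p_M[3])    # 5/36 and 5/36, the margins in Table 9.2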
The joint distribution of two random variables contains a lot more information
than the two marginal distributions. This can be illustrated by the fact that in
many cases the joint probability mass function of X and Y cannot be retrieved
from the marginal probability mass functions pX and pY . A simple example
is given in the following quick exercise.
Quick exercise 9.2 Let X and Y be two discrete random variables, with
joint probability mass function p, given by the following table, where ε is an
arbitrary number between −1/4 and 1/4.
b
a 0 1 pX(a)
0 1/4 − ε 1/4 + ε . . .
1 1/4 + ε 1/4 − ε . . .
pY (b) . . . . . . . . .
Complete the table, and conclude that we cannot retrieve p from pX and pY .
The joint distribution function
As in the case of a single random variable, the distribution function enables
us to treat pairs of discrete and pairs of continuous random variables in the
same way.
Definition. The joint distribution function F of two random vari-
ables X and Y is the function F : R² → [0, 1] defined by
F(a, b) = P(X ≤ a, Y ≤ b)  for −∞ < a, b < ∞.
Quick exercise 9.3 Compute F(5, 3) for the joint distribution function F
of the pair (S, M).
The distribution functions FX and FY can be obtained from the joint distri-
bution function of X and Y . As before, we speak of the marginal distribution
functions. The following rule holds.
From joint to marginal distribution function. Let F be
the joint distribution function of random variables X and Y . Then
the marginal distribution function of X is given for each a by
FX(a) = P(X ≤ a) = F(a, +∞) = lim_{b→∞} F(a, b),   (9.1)
and the marginal distribution function of Y is given for each b by
FY(b) = P(Y ≤ b) = F(+∞, b) = lim_{a→∞} F(a, b).   (9.2)
9.2 Joint distributions of continuous random variables
We saw in Chapter 5 that the probability that a single continuous random
variable X lies in an interval [a, b], is equal to the area under the probability
density function f of X over the interval (see also Figure 5.1). For the joint
distribution of continuous random variables X and Y the situation is analo-
gous: the probability that the pair (X, Y ) falls in the rectangle [a1, b1]×[a2, b2]
is equal to the volume under the joint probability density function f(x, y) of
(X, Y ) over the rectangle. This is illustrated in Figure 9.1, where a chunk of
a joint probability density function f(x, y) is displayed for x between −0.5
and 1 and for y between −1.5 and 1. Its volume represents the probability
P(−0.5 ≤ X ≤ 1, −1.5 ≤ Y ≤ 1). As the volume under f on [−0.5, 1]×[−1.5, 1]
is equal to the integral of f over this rectangle, this motivates the following
definition.
[Figure: surface plot of a joint probability density function f(x, y), with the volume above the rectangle [−0.5, 1] × [−1.5, 1] shaded.]
Fig. 9.1. Volume under a joint probability density function f on the rectangle
[−0.5, 1] × [−1.5, 1].
Definition. Random variables X and Y have a joint continuous
distribution if for some function f : R² → R and for all numbers
a1, a2 and b1, b2 with a1 ≤ b1 and a2 ≤ b2,
P(a1 ≤ X ≤ b1, a2 ≤ Y ≤ b2) = ∫_{a1}^{b1} ∫_{a2}^{b2} f(x, y) dx dy.
The function f has to satisfy f(x, y) ≥ 0 for all x and y, and
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1. We call f the joint probability density
function of X and Y .
As in the one-dimensional case there is a simple relation between the joint
distribution function F and the joint probability density function f:
F(a, b) = ∫_{−∞}^{a} ∫_{−∞}^{b} f(x, y) dx dy   and   f(x, y) = ∂²/∂x∂y F(x, y).
A joint probability density function of two random variables is also called
a bivariate probability density. An explicit example of such a density is the
function
f(x, y) = (30/π) e^(−50x² − 50y² + 80xy)
for −∞ < x < ∞ and −∞ < y < ∞; see Figure 9.2. This is an example of
a bivariate normal density (see Remark 11.2 for a full description of bivariate
normal distributions).
We illustrate a number of properties of joint continuous distributions by means
of the following simple example. Suppose that X and Y have joint probability
[Figure: surface plot of this bivariate normal density f(x, y).]
Fig. 9.2. A bivariate normal probability density function.
density function
f(x, y) = (2/75)(2x²y + xy²)  for 0 ≤ x ≤ 3 and 1 ≤ y ≤ 2,
and f(x, y) = 0 otherwise; see Figure 9.3.
[Figure: surface plot of this density above the rectangle [0, 3] × [1, 2].]
Fig. 9.3. The probability density function f(x, y) = (2/75)(2x²y + xy²).
As an illustration of how to compute joint probabilities:
P(1 ≤ X ≤ 2, 4/3 ≤ Y ≤ 5/3) = ∫_{1}^{2} ∫_{4/3}^{5/3} f(x, y) dy dx
= (2/75) ∫_{1}^{2} [ ∫_{4/3}^{5/3} (2x²y + xy²) dy ] dx
= (2/75) ∫_{1}^{2} ( x² + (61/81)x ) dx = 187/2025.
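The value 187/2025 ≈ 0.0923 is easy to confirm numerically, for instance with a crude midpoint rule over the rectangle [1, 2] × [4/3, 5/3]. The sketch below is ours; the grid size is an arbitrary choice.

    # Midpoint-rule check of P(1 <= X <= 2, 4/3 <= Y <= 5/3) for the density above.
    def f(x, y):
        return (2 / 75) * (2 * x**2 * y + x * y**2)

    n = 400
    dx = (2 - 1) / n
    dy = (5 / 3 - 4 / 3) / n
    total = sum(f(1 + (i + 0.5) * dx, 4 / 3 + (j + 0.5) * dy) * dx * dy
                for i in range(n) for j in range(n))

    print(total, 187 / 2025)    # both approximately 0.0923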
Next, for a between 0 and 3 and b between 1 and 2, we determine the ex-
pression of the joint distribution function. Since f(x, y) = 0 for x < 0 or
y < 1,
F(a, b) = P(X ≤ a, Y ≤ b) = ∫_{−∞}^{a} [ ∫_{−∞}^{b} f(x, y) dy ] dx
= (2/75) ∫_{0}^{a} [ ∫_{1}^{b} (2x²y + xy²) dy ] dx
= (1/225) ( 2a³b² − 2a³ + a²b³ − a² ).
Note that for either a outside [0, 3] or b outside [1, 2], the expression for F(a, b)
is different. For example, suppose that a is between 0 and 3 and b is larger
than 2. Since f(x, y) = 0 for y > 2, we find for any b ≥ 2:
F(a, b) = P(X ≤ a, Y ≤ b) = P(X ≤ a, Y ≤ 2) = F(a, 2) = (1/225)(6a³ + 7a²).
Hence, applying (9.1) one finds the marginal distribution function of X:
FX(a) = lim_{b→∞} F(a, b) = (1/225)(6a³ + 7a²)
for a between 0 and 3.
Quick exercise 9.4 Show that FY(b) = (1/75)(3b³ + 18b² − 21) for b between 1
and 2.
The probability density of X can be found by differentiating FX:
fX(x) = d/dx FX(x) = d/dx [ (1/225)(6x³ + 7x²) ] = (2/225)(9x² + 7x)
for x between 0 and 3. It is also possible to obtain the probability density
function of X directly from f(x, y). Recall that we determined marginal prob-
abilities of discrete random variables by summing over the joint probabilities
(see Table 9.2). In a similar way we can find fX. For x between 0 and 3,
fX(x) = ∫_{−∞}^{∞} f(x, y) dy = (2/75) ∫_{1}^{2} (2x²y + xy²) dy = (2/225)(9x² + 7x).
This illustrates the following rule.
From joint to marginal probability density function. Let
f be the joint probability density function of random variables X
and Y . Then the marginal probability densities of X and Y can be
found as follows:
fX(x) = ∫_{−∞}^{∞} f(x, y) dy   and   fY(y) = ∫_{−∞}^{∞} f(x, y) dx.
Hence the probability density function of each of the random variables X and
Y can easily be obtained by “integrating out” the other variable.
Quick exercise 9.5 Determine fY (y).
9.3 More than two random variables
To determine the joint distribution of n random variables X1, X2, . . . , Xn, all
defined on the same sample space Ω, we have to describe how the probability
mass is distributed over all possible values of (X1, X2, . . . , Xn). In fact, it
suffices to specify the joint distribution function F of X1, X2, . . . , Xn, which
is defined by
F(a1, a2, . . . , an) = P(X1 ≤ a1, X2 ≤ a2, . . . , Xn ≤ an)
for −∞ < a1, a2, . . . , an < ∞.
In case the random variables X1, X2, . . . , Xn are discrete, the joint distribution
can also be characterized by specifying the joint probability mass function p
of X1, X2, . . . , Xn, defined by
p(a1, a2, . . . , an) = P(X1 = a1, X2 = a2, . . . , Xn = an)
for −∞ < a1, a2, . . . , an < ∞.
Drawing without replacement
Let us illustrate the use of the joint probability mass function with an example.
In the weekly Dutch National Lottery Show, 6 balls are drawn from a vase
that contains balls numbered from 1 to 41. Clearly, the first number takes
values 1, 2, . . ., 41 with equal probabilities. Is this also the case for—say—the
third ball?
Let us consider a more general situation. Suppose a vase contains balls num-
bered 1, 2, . . . , N. We draw n balls without replacement from the vase. Note
that n cannot be larger than N. Each ball is selected with equal probability,
i.e., in the first draw each ball has probability 1/N, in the second draw each of
the N −1 remaining balls has probability 1/(N −1), and so on. Let Xi denote
the number on the ball in the i-th draw, for i = 1, 2, . . . , n. In order to obtain
the marginal probability mass function of Xi, we first compute the joint proba-
bility mass function of X1, X2, . . . , Xn. Since there are N(N −1) · · · (N −n+1)
possible combinations for the values of X1, X2, . . . , Xn, each having the same
probability, the joint probability mass function is given by
p(a1, a2, . . . , an) = P(X1 = a1, X2 = a2, . . . , Xn = an)
= 1/( N(N − 1) · · · (N − n + 1) ),
for all distinct values a1, a2, . . . , an with 1 ≤ aj ≤ N. Clearly X1, X2, . . . , Xn
influence each other. Nevertheless, the marginal distribution of each Xi is
the same. This can be seen as follows. Similar to obtaining the marginal
probability mass functions in Table 9.2, we can find the marginal probability
mass function of Xi by summing the joint probability mass function over all
possible values of X1, . . . , Xi−1, Xi+1, . . . , Xn:
pXi(k) = Σ p(a1, . . . , ai−1, k, ai+1, . . . , an)
= Σ 1/( N(N − 1) · · · (N − n + 1) ),
where the sum runs over all distinct values a1, a2, . . . , an with 1 ≤ aj ≤ N
and ai = k. Since there are (N − 1)(N − 2) · · · (N − n + 1) such combinations,
we conclude that the marginal probability mass function of Xi is given by
pXi(k) = (N − 1)(N − 2) · · · (N − n + 1) · 1/( N(N − 1) · · · (N − n + 1) ) = 1/N,
for k = 1, 2, . . ., N. We see that the marginal probability mass function of
each Xi is the same, assigning equal probability 1/N to each possible value.
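This conclusion, that every draw and not just the first one is uniform over 1, . . . , N, is easy to confirm by simulation. The sketch below is ours; the values N = 10 and n = 4 are arbitrary illustrative choices, not those of the lottery.

    import random
    from collections import Counter

    random.seed(3)
    N, n, trials = 10, 4, 200_000

    # random.sample draws n distinct balls in order; record the third draw (i = 3).
    third = Counter(random.sample(range(1, N + 1), n)[2] for _ in range(trials))

    # Each value 1, ..., N appears with relative frequency close to 1/N = 0.1.
    print({k: round(v / trials, 3) for k, v in sorted(third.items())})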
In case the random variables X1, X2, . . . , Xn are continuous, the joint dis-
tribution is defined in a similar way as in the case of two variables. We say
that the random variables X1, X2, . . . , Xn have a joint continuous distribu-
tion if for some function f : Rⁿ → R and for all numbers a1, a2, . . . , an and
b1, b2, . . . , bn with ai ≤ bi,
P(a1 ≤ X1 ≤ b1, a2 ≤ X2 ≤ b2, . . . , an ≤ Xn ≤ bn)
= ∫_{a1}^{b1} ∫_{a2}^{b2} · · · ∫_{an}^{bn} f(x1, x2, . . . , xn) dx1 dx2 · · · dxn.
Again f has to satisfy f(x1, x2, . . . , xn) ≥ 0 and f has to integrate to 1. We
call f the joint probability density of X1, X2, . . . , Xn.
9.4 Independent random variables
In earlier chapters we have spoken of independence of random variables, an-
ticipating a formal definition. On page 46 we postulated that the events
{R1 = a1}, {R2 = a2}, . . . , {R10 = a10}
related to the Bernoulli random variables R1, . . . , R10 are independent. How
should one define independence of random variables? Intuitively, random vari-
ables X and Y are independent if every event involving only X is indepen-
dent of every event involving only Y . Since for two discrete random variables
X and Y , any event involving X and Y is the union of events of the type
{X = a, Y = b}, an adequate definition for independence would be
P(X = a, Y = b) = P(X = a) P(Y = b) , (9.3)
for all possible values a and b. However, this definition is useless for continuous
random variables. Both the discrete and the continuous case are covered by
the following definition.
Definition. The random variables X and Y , with joint distribution
function F, are independent if
P(X ≤ a, Y ≤ b) = P(X ≤ a) P(Y ≤ b) ,
that is,
F(a, b) = FX(a)FY (b) (9.4)
for all possible values a and b. Random variables that are not inde-
pendent are called dependent.
Note that independence of X and Y guarantees that the joint probability of
{X ≤ a, Y ≤ b} factorizes. More generally, the following is true: if X and Y
are independent, then
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B) , (9.5)
for all suitable A and B, such as intervals and points. As a special case we
can take A = {a}, B = {b}, which yields that for independent X and Y the
probability of {X = a, Y = b} equals the product of the marginal probabilities.
In fact, for discrete random variables the definition of independence can be
reduced—after cumbersome computations—to equality (9.3). For continuous
random variables X and Y we find, differentiating both sides of (9.4) with
respect to x and y, that
f(x, y) = fX(x)fY (y).
Quick exercise 9.6 Determine for which value of ε the discrete random
variables X and Y from Quick exercise 9.2 are independent.
More generally, random variables X1, X2, . . . , Xn, with joint distribution func-
tion F, are independent if for all values a1, . . . , an,
F(a1, a2, . . . , an) = FX1 (a1)FX2 (a2) · · · FXn (an).
As in the case of two discrete random variables, the discrete random variables
X1, X2, . . . , Xn are independent if
P(X1 = a1, . . . , Xn = an) = P(X1 = a1) · · · P(Xn = an) ,
for all possible values a1, . . . , an. Thus we see that the definition of inde-
pendence for discrete random variables is in agreement with our intuitive
interpretation given earlier in (9.3).
In case of independent continuous random variables X1, X2, . . . , Xn with joint
probability density function f, differentiating the joint distribution function
with respect to all the variables gives that
f(x1, x2, . . . , xn) = fX1 (x1)fX2 (x2) · · · fXn (xn) (9.6)
for all values x1, . . . , xn. By integrating both sides over (−∞, a1]×(−∞, a2]×
· · ·×(−∞, an], we find the definition of independence. Hence in the continuous
case, (9.6) is equivalent to the definition of independence.
9.5 Propagation of independence
A natural question is whether transformed independent random variables are
again independent. We start with a simple example. Let X and Y be two
independent random variables with joint distribution function F. Take an
interval I = (a, b] and define random variables U and V as follows:
U = 1 if X ∈ I and U = 0 if X ∉ I,   and   V = 1 if Y ∈ I and V = 0 if Y ∉ I.
Are U and V independent? Yes, they are! By using (9.5) and the independence
of X and Y , we can write
P(U = 0, V = 1) = P(X ∈ I^c, Y ∈ I) = P(X ∈ I^c) P(Y ∈ I) = P(U = 0) P(V = 1).
By a similar reasoning one finds that for all values a and b,
P(U = a, V = b) = P(U = a) P(V = b) .
This illustrates the fact that for independent random variables X1, X2, . . . , Xn,
the random variables Y1, Y2, . . . , Yn, where each Yi is determined by Xi only,
inherit the independence from the Xi. The general rule is given here.
Propagation of independence. Let X1, X2, . . . , Xn be indepen-
dent random variables. For each i, let hi : R → R be a function and
define the random variable
Yi = hi(Xi).
Then Y1, Y2, . . . , Yn are also independent.
Often one uses this rule with all functions the same: hi = h. For instance, in
the preceding example,
h(x) = 1 if x ∈ I and h(x) = 0 if x ∉ I.
The rule is also useful when we need different transformations for different
Xi. We already saw an example of this in Chapter 6. In the single-server
queue example in Section 6.4, the Exp(0.5) random variables T1, T2, . . . and
U(2, 5) random variables S1, S2, . . . are required to be independent. They are
generated according to the technique described in Section 6.2. With a se-
quence U1, U2, . . . of independent U(0, 1) random variables we can accomplish
independence of the Ti and Si as follows:
Ti = F^inv(U2i−1) and Si = G^inv(U2i),
where F and G are the distribution functions of the Exp(0.5) distribution and
the U(2, 5) distribution. The propagation-of-independence rule now guaran-
tees that all random variables T1, S1, T2, S2, . . . are independent.
9.6 Solutions to the quick exercises
9.1 The only possibilities with the sum equal to 7 and the maximum equal
to 4 are the combinations (3, 4) and (4, 3). They both have probability 1/36,
so that P(S = 7, M = 4) = 2/36.
9.2 Since pX(0), pX(1), pY (0), and pY (1) are all equal to 1/2, knowing only
pX and pY yields no information on ε whatsoever. You have to be a student
at Hogwarts to be able to get the values of p right!
9.3 Since S and M are discrete random variables, F(5, 3) is the sum of the
probabilities P(S = a, M = b) of all combinations (a, b) with a ≤ 5 and b ≤ 3.
From Table 9.2 we see that this sum is 8/36.
9.4 For a between 0 and 3 and for b between 1 and 2, we have seen that
F(a, b) = (1/225)(2a³b² − 2a³ + a²b³ − a²).
Since f(x, y) = 0 for x > 3, we find for any a ≥ 3 and b between 1 and 2:
F(a, b) = P(X ≤ a, Y ≤ b) = P(X ≤ 3, Y ≤ b) = F(3, b) = (1/75)(3b³ + 18b² − 21).
As a result, applying (9.2) yields that FY(b) = lim_{a→∞} F(a, b) = F(3, b) =
(1/75)(3b³ + 18b² − 21), for b between 1 and 2.
9.5 For y between 1 and 2, we have seen that FY(y) = (1/75)(3y³ + 18y² − 21).
Differentiating with respect to y yields that
fY(y) = d/dy FY(y) = (1/25)(3y² + 12y),
for y between 1 and 2 (and fY(y) = 0 otherwise). The probability density
function of Y can also be obtained directly from f(x, y). For y between 1
and 2:
fY(y) = ∫_{−∞}^{∞} f(x, y) dx = (2/75) ∫_{0}^{3} (2x²y + xy²) dx
= (2/75) [ (2/3)x³y + (1/2)x²y² ]_{x=0}^{x=3} = (1/25)(3y² + 12y).
Since f(x, y) = 0 for values of y not between 1 and 2, we have that fY(y) =
∫_{−∞}^{∞} f(x, y) dx = 0 for these y’s.
9.6 The number ε is between −1/4 and 1/4. Now X and Y are independent
in case p(i, j) = P(X = i, Y = j) = P(X = i) P(Y = j) = pX(i)pY (j), for all
i, j = 0, 1. If i = j = 0, we should have
1/4 − ε = p(0, 0) = pX(0) pY(0) = 1/4.
This implies that ε = 0. Furthermore, for all other combinations (i, j) one
can check that for ε = 0 also p(i, j) = pX(i) pY (j), so that X and Y are
independent. If ε ≠ 0, we have p(0, 0) ≠ pX(0) pY(0), so that X and Y are
dependent.
9.7 Exercises
9.1 The joint probabilities P(X = a, Y = b) of discrete random variables X
and Y are given in the following table (which is based on the magical square
in Albrecht Dürer’s engraving Melencolia I in Figure 9.4). Determine the
marginal probability distributions of X and Y , i.e., determine the probabilities
P(X = a) and P(Y = b) for a, b = 1, 2, 3, 4.
Fig. 9.4. Albrecht Dürer’s Melencolia I.
Albrecht Dürer (German, 1471-1528) Melencolia I, 1514. Engraving. Bequest
of William P. Chapman, Jr., Class of 1895. Courtesy of the Herbert F. Johnson
Museum of Art, Cornell University.
a
b 1 2 3 4
1 16/136 3/136 2/136 13/136
2 5/136 10/136 11/136 8/136
3 9/136 6/136 7/136 12/136
4 4/136 15/136 14/136 1/136
9.2  The joint probability distribution of two discrete random variables X
and Y is partly given in the following table.
a
b 0 1 2 P(Y = b)
−1 . . . . . . . . . 1/2
1 . . . 1/2 . . . 1/2
P(X = a) 1/6 2/3 1/6 1
a. Complete the table.
b. Are X and Y dependent or independent?
9.3 Let X and Y be two random variables, with joint distribution the Melen-
colia distribution, given by the table in Exercise 9.1. What is
a. P(X = Y )?
b. P(X + Y = 5)?
c. P(1 < X ≤ 3, 1 < Y ≤ 3)?
d. P((X, Y ) ∈ {1, 4} × {1, 4})?
9.4 This exercise will be easy for those familiar with Japanese puzzles called
nonograms. The marginal probability distributions of the discrete random
variables X and Y are given in the following table:
a
b 1 2 3 4 5 P(Y = b)
1 5/14
2 4/14
3 2/14
4 2/14
5 1/14
P(X = a) 1/14 5/14 4/14 2/14 2/14 1
Moreover, for a and b from 1 to 5 the joint probability P(X = a, Y = b) is
either 0 or 1/14. Determine the joint probability distribution of X and Y .
9.5  Let η be an unknown real number, and let the joint probabilities
P(X = a, Y = b) of the discrete random variables X and Y be given by the
following table:
a
b      −1          0          1
4    η − 1/16   1/4 − η       0
5     1/8        3/16        1/8
6    η + 1/16    1/16      1/4 − η
a. Which are the values η can attain?
b. Is there a value of η for which X and Y are independent?
9.6  Let X and Y be two independent Ber(1/2) random variables. Define
random variables U and V by:
U = X + Y and V = |X − Y |.
a. Determine the joint and marginal probability distributions of U and V .
b. Find out whether U and V are dependent or independent.
9.7 To investigate the relation between hair color and eye color, the hair color
and eye color of 5383 persons was recorded. The data are given in the following
table:
Hair color
Eye color Fair/red Medium Dark/black
Light 1168 825 305
Dark 573 1312 1200
Source: B. Everitt and G. Dunn. Applied multivariate data analysis. Second
edition Hodder Arnold, 2001; Table 4.12. Reproduced by permission of Hodder
 Stoughton.
Eye color is encoded by the values 1 (Light) and 2 (Dark), and hair color by
1 (Fair/red), 2 (Medium), and 3 (Dark/black). By dividing the numbers in
the table by 5383, the table is turned into a joint probability distribution for
random variables X (hair color) taking values 1 to 3 and Y (eye color) taking
values 1 and 2.
a. Determine the joint and marginal probability distributions of X and Y .
b. Find out whether X and Y are dependent or independent.
9.8  Let X and Y be independent random variables with probability distri-
butions given by
P(X = 0) = P(X = 1) = 1/2 and P(Y = 0) = P(Y = 2) = 1/2.
a. Compute the distribution of Z = X + Y .
b. Let Ỹ and Z̃ be independent random variables, where Ỹ has the same
distribution as Y , and Z̃ the same distribution as Z. Compute the distri-
bution of X̃ = Z̃ − Ỹ .
9.9  Suppose that the joint distribution function of X and Y is given by
F(x, y) = 1 − e^(−2x) − e^(−y) + e^(−(2x+y))  if x > 0, y > 0,
and F(x, y) = 0 otherwise.
a. Determine the marginal distribution functions of X and Y .
b. Determine the joint probability density function of X and Y .
c. Determine the marginal probability density functions of X and Y .
d. Find out whether X and Y are independent.
9.10  Let X and Y be two continuous random variables with joint proba-
bility density function
f(x, y) = (12/5) xy(1 + y)  for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1,
and f(x, y) = 0 otherwise.
a. Find the probability P(1/4 ≤ X ≤ 1/2, 1/3 ≤ Y ≤ 2/3).
b. Determine the joint distribution function of X and Y for a and b between
0 and 1.
c. Use your answer from b to find FX(a) for a between 0 and 1.
d. Apply the rule on page 122 to find the probability density function of X
from the joint probability density function f(x, y). Use the result to verify
your answer from c.
e. Find out whether X and Y are independent.
9.11  Let X and Y be two continuous random variables, with the same
joint probability density function as in Exercise 9.10. Find the probability
P(X < Y ) that X is smaller than Y .
9.12 The joint probability density function f of the pair (X, Y ) is given by
f(x, y) = K(3x² + 8xy)  for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 2,
and f(x, y) = 0 for all other values of x and y. Here K is some positive
constant.
a. Find K.
b. Determine the probability P(2X ≤ Y ).
9.13  On a disc with origin (0, 0) and radius 1, a point (X, Y ) is selected by
throwing a dart that hits the disc in an arbitrary place. This is best described
by the joint probability density function f of X and Y , given by
f(x, y) = c if x² + y² ≤ 1, and f(x, y) = 0 otherwise,
where c is some positive constant.
a. Determine c.
b. Let R = √(X² + Y²) be the distance from (X, Y ) to the origin. Determine
the distribution function FR.
c. Determine the marginal density function fX. Without doing any calcula-
tions, what can you say about fY ?
9.14 An arbitrary point (X, Y ) is drawn from the square [−1, 1] × [−1, 1].
This means that for any region G in the plane, the probability that (X, Y ) is
in G, is given by the area of G ∩ □ divided by the area of □, where □ denotes
the square [−1, 1] × [−1, 1]:
P((X, Y ) ∈ G) = (area of G ∩ □) / (area of □).
a. Determine the joint probability density function of the pair (X, Y ).
b. Check that X and Y are two independent, U(−1, 1) distributed random
variables.
9.15  Let the pair (X, Y ) be drawn arbitrarily from the triangle ∆ with
vertices (0, 0), (0, 1), and (1, 1).
a. Use Figure 9.5 to show that the joint distribution function F of the pair
(X, Y ) satisfies
F(a, b) =
0          for a or b less than 0,
a(2b − a)  for (a, b) in the triangle ∆,
b²         for b between 0 and 1 and a larger than b,
2a − a²    for a between 0 and 1 and b larger than 1,
1          for a and b larger than 1.
b. Determine the joint probability density function f of the pair (X, Y ).
c. Show that fX(x) = 2 − 2x for x between 0 and 1 and that fY (y) = 2y for
y between 0 and 1.
9.16 (Continuation of Exercise 9.15) An arbitrary point (U, V ) is drawn from
the unit square [0, 1]×[0, 1]. Let X and Y be defined as in Exercise 9.15. Show
that min{U, V } has the same distribution as X and that max{U, V } has the
same distribution as Y .
[Figure: the triangle ∆ with vertices (0, 0), (0, 1), and (1, 1), intersected with the rectangle (−∞, a] × (−∞, b] that has the point (a, b) as its upper right corner.]
Fig. 9.5. Drawing (X, Y ) from (−∞, a] × (−∞, b] ∩ ∆.
9.17 Let U1 and U2 be two independent random variables, both uniformly
distributed over [0, a]. Let V = min{U1, U2} and Z = max{U1, U2}. Show
that the joint distribution function of V and Z is given by
F(s, t) = P(V ≤ s, Z ≤ t) = ( t² − (t − s)² ) / a²  for 0 ≤ s ≤ t ≤ a.
Hint: note that V ≤ s and Z ≤ t happens exactly when both U1 ≤ t and
U2 ≤ t, but not both s < U1 ≤ t and s < U2 ≤ t.
9.18 Suppose a vase contains balls numbered 1, 2, . . ., N. We draw n balls
without replacement from the vase. Each ball is selected with equal probability,
i.e., in the first draw each ball has probability 1/N, in the second draw each
of the N − 1 remaining balls has probability 1/(N − 1), and so on. For i =
1, 2, . . . , n, let Xi denote the number on the ball in the ith draw. We have
shown that the marginal probability mass function of Xi is given by
pXi(k) = 1/N,  for k = 1, 2, . . ., N.
a. Show that E[Xi] = (N + 1)/2.
b. Compute the variance of Xi. You may use the identity
1 + 4 + 9 + · · · + N² = (1/6)N(N + 1)(2N + 1).
9.19  Let X and Y be two continuous random variables, with joint proba-
bility density function
f(x, y) = (30/π) e^(−50x² − 50y² + 80xy)
for −∞ < x < ∞ and −∞ < y < ∞; see also Figure 9.2.
a. Determine positive numbers a, b, and c such that
50x² − 80xy + 50y² = (ay − bx)² + cx².
b. Setting µ = (4/5)x and σ = 1/10, show that
(√50 y − √32 x)² = (1/2)((y − µ)/σ)²
and use this to show that
∫_{−∞}^{∞} e^(−(√50 y − √32 x)²) dy = √(2π)/10.
c. Use the results from b to determine the probability density function fX
of X. What kind of distribution does X have?
9.20 Suppose we throw a needle on a large sheet of paper, on which horizontal
lines are drawn, which are at needle-length apart (see also Exercise 21.16).
Choose one of the horizontal lines as x-axis, and let (X, Y ) be the center of the
needle. Furthermore, let Z be the distance of this center (X, Y ) to the nearest
horizontal line under (X, Y ), and let H be the angle between the needle and
the positive x-axis.
a. Assuming that the length of the needle is equal to 1, argue that Z has
a U(0, 1) distribution. Also argue that H has a U(0, π) distribution and
that Z and H are independent.
b. Show that the needle hits a horizontal line when
Z ≤ (1/2) sin H   or   1 − Z ≤ (1/2) sin H.
c. Show that the probability that the needle will hit one of the horizontal
lines equals 2/π.
10
Covariance and correlation
In this chapter we see how the joint distribution of two or more random vari-
ables is used to compute the expectation of a combination of these random
variables. We discuss the expectation and variance of a sum of random vari-
ables and introduce the notions of covariance and correlation, which express
to some extent the way two random variables influence each other.
10.1 Expectation and joint distributions
China vases of various shapes are produced in the Delftware factories in the
old city of Delft. One particular simple cylindrical model has height H and
radius R centimeters. Due to all kinds of circumstances—the place of the vase
in the oven, the fact that the vases are handmade, etc.—H and R are not
constants but are random variables. The volume of a vase is equal to the
random variable V = πHR², and one is interested in its expected value E[V ].
When fV denotes the probability density of V , then by definition
E[V ] = ∫_{−∞}^{∞} v fV(v) dv.
However, to obtain E[V ], we do not necessarily need to determine fV from
the joint probability density f of H and R! Since V is a function of H and R,
we can use a rule similar to the change-of-variable formula from Chapter 7:
E[V ] = E[πHR²] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} πhr² f(h, r) dh dr.
Suppose that H has a U(25, 35) distribution and that R has a U(7.5, 12.5)
distribution. In the case that H and R are also independent, we have
E[V ] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} πhr² fH(h)fR(r) dh dr = ∫_{25}^{35} ∫_{7.5}^{12.5} πhr² · (1/10) · (1/5) dh dr
= (π/50) ∫_{25}^{35} h dh ∫_{7.5}^{12.5} r² dr = 9621.127 cm³.
This illustrates the following general rule.
Two-dimensional change-of-variable formula. Let X and
Y be random variables, and let g : R² → R be a function.
If X and Y are discrete random variables with values a1, a2, . . . and
b1, b2, . . . , respectively, then
E[g(X, Y )] = Σ_i Σ_j g(ai, bj) P(X = ai, Y = bj).
If X and Y are continuous random variables with joint probability
density function f, then
E[g(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy.
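As a quick check on the vase computation above, E[πHR²] can also be estimated by simulation; the sketch below is ours, and the sample size is an arbitrary choice.

    import random, math

    random.seed(4)
    trials = 1_000_000

    # H ~ U(25, 35), R ~ U(7.5, 12.5), independent; V = pi * H * R^2.
    total = sum(math.pi * random.uniform(25, 35) * random.uniform(7.5, 12.5) ** 2
                for _ in range(trials))

    print(total / trials)    # close to 9621.127 cm^3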
As an example, take g(x, y) = xy for discrete random variables X and Y with
the joint probability distribution given in Table 10.1. The expectation of XY
is computed as follows:
E[XY ] = (0 · 0) · 0 + (1 · 0) · 1/4 + (2 · 0) · 0
+ (0 · 1) · 1/4 + (1 · 1) · 0 + (2 · 1) · 1/4
+ (0 · 2) · 0 + (1 · 2) · 1/4 + (2 · 2) · 0 = 1.
A natural question is whether this value can also be obtained from E[X] E[Y ].
We return to this question later in this chapter. First we address the expec-
tation of the sum of two random variables.
Table 10.1. Joint probabilities P(X = a, Y = b).
a
b 0 1 2
0 0 1/4 0
1 1/4 0 1/4
2 0 1/4 0
Quick exercise 10.1 Compute E[X + Y ] for the random variables with the
joint distribution given in Table 10.1.
For discrete X and Y with values a1, a2, . . . and b1, b2, . . . , respectively, we
see that
E[X + Y ] = Σ_i Σ_j (ai + bj) P(X = ai, Y = bj)
= Σ_i Σ_j ai P(X = ai, Y = bj) + Σ_i Σ_j bj P(X = ai, Y = bj)
= Σ_i ai [ Σ_j P(X = ai, Y = bj) ] + Σ_j bj [ Σ_i P(X = ai, Y = bj) ]
= Σ_i ai P(X = ai) + Σ_j bj P(Y = bj)
= E[X] + E[Y ].
A similar line of reasoning applies in case X and Y are continuous random
variables. The following general rule holds.
Linearity of expectations. For all numbers r, s, and t and
random variables X and Y , one has
E[rX + sY + t] = rE[X] + sE[Y ] + t.
Quick exercise 10.2 Determine the marginal distributions for the random
variables X and Y with the joint distribution given in Table 10.1, and use
them to compute E[X] and E[Y ]. Check that E[X]+E[Y ] is equal to E[X + Y ],
which was computed in Quick exercise 10.1.
More generally, for random variables X1, . . . , Xn and numbers s1, . . . , sn and t,
E[s1X1 + · · · + snXn + t] = s1E[X1] + · · · + snE[Xn] + t.
This rule is a powerful instrument. For example, it provides an easy way to
compute the expectation of a random variable X with a Bin(n, p) distribution.
If we were to use the definition of expectation, we would have to compute
E[X] = Σ_{k=0}^{n} k P(X = k) = Σ_{k=0}^{n} k (n choose k) p^k (1 − p)^(n−k).
To determine this sum is not straightforward. However, there is a simple alter-
native. Recall the multiple-choice example from Section 4.3. We represented
the number of correct answers out of 10 multiple-choice questions as a sum of
10 Bernoulli random variables. More generally, any random variable X with
a Bin(n, p) distribution can be represented as
X = R1 + R2 + · · · + Rn,
where R1, R2, . . . , Rn are independent Ber(p) random variables, i.e.,
Ri = 1 with probability p,  and Ri = 0 with probability 1 − p.
Since E[Ri] = 0 · (1 − p) + 1 · p = p, for every i = 1, 2, . . ., n, the linearity-of-
expectations rule yields
E[X] = E[R1] + E[R2] + · · · + E[Rn] = np.
Hence we conclude that the expectation of a Bin(n, p) distribution equals np.
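The representation X = R1 + · · · + Rn also lends itself to a direct simulation check of E[X] = np. The sketch below is ours; n = 10 and p = 0.3 are arbitrary illustrative values.

    import random

    random.seed(5)
    n, p, trials = 10, 0.3, 100_000

    # Generate a Bin(n, p) value as a sum of n independent Ber(p) variables.
    def binomial_sample():
        return sum(random.random() < p for _ in range(n))

    mean = sum(binomial_sample() for _ in range(trials)) / trials
    print(mean, n * p)    # the sample mean is close to np = 3.0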
Remark 10.1 (More than two random variables). In both the discrete
and continuous cases, the change-of-variable formula for n random variables
is a straightforward generalization of the change-of-variable formula for two
random variables. For instance, if X1, X2, . . . , Xn are continuous random
variables, with joint probability density function f, and g is a function from
Rn
to R, then
E[g(X1, . . . , Xn)] =
 ∞
−∞
· · ·
 ∞
−∞
g(x1, . . . , xn)f(x1, . . . , xn) dx1 · · · dxn.
10.2 Covariance
In the previous section we have seen that for two random variables X and Y
always
E[X + Y ] = E[X] + E[Y ] .
Does such a simple relation also hold for the variance of the sum Var(X + Y )
or for expectation of the product E[XY ]? We will investigate this in the
current section.
For the variables X and Y from the example in Section 9.2 with joint proba-
bility density
f(x, y) = (2/75)(2x²y + xy²)  for 0 ≤ x ≤ 3 and 1 ≤ y ≤ 2,
one can show that
Var(X + Y ) = 939/2000   and   Var(X) + Var(Y ) = 989/2500 + 791/10 000 = 4747/10 000
(see Exercise 10.10). This shows, in contrast to the linearity-of-expectations
rule, that Var(X + Y ) is generally not equal to Var(X)+ Var(Y ). To deter-
mine Var(X + Y ), we exploit its definition:
Var(X + Y ) = E[(X + Y − E[X + Y ])²].
Now X + Y − E[X + Y ] = (X − E[X]) + (Y − E[Y ]), so that
(X + Y − E[X + Y ])² = (X − E[X])² + (Y − E[Y ])² + 2(X − E[X])(Y − E[Y ]).
Taking expectations on both sides, another application of the linearity-of-
expectations rule gives
Var(X + Y ) = Var(X) + Var(Y ) + 2E[(X − E[X])(Y − E[Y ])] .
That is, the variance of the sum X + Y equals the sum of the variances of X
and Y , plus an extra term 2E[(X − E[X])(Y − E[Y ])]. To some extent this
term expresses the way X and Y influence each other.
Definition. Let X and Y be two random variables. The covariance
between X and Y is defined by
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] .
Loosely speaking, if the covariance of X and Y is positive, then if X has a
realization larger than E[X], it is likely that Y will have a realization larger
than E[Y ], and the other way around. In this case we say that X and Y are
positively correlated. In case the covariance is negative, the opposite effect oc-
curs; X and Y are negatively correlated. In case Cov(X, Y ) = 0 we say that X
and Y are uncorrelated. An easy consequence of the linearity-of-expectations
property (see Exercise 10.19) is the following rule.
An alternative expression for the covariance. Let X and
Y be two random variables, then
Cov(X, Y ) = E[XY ] − E[X] E[Y ] .
For X and Y from the example in Section 9.2, we have E[X] = 109/50,
E[Y ] = 157/100, and E[XY ] = 171/50 (see Exercise 10.10). Thus we see that
X and Y are negatively correlated:
Cov(X, Y ) = 171/50 − (109/50)·(157/100) = −13/5000 < 0.
Moreover, this also illustrates that, in contrast to the expectation of the sum,
for the expectation of the product, in general E[XY ] is not equal to E[X] E[Y ].
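The alternative expression also lends itself to a direct numerical check. The sketch below, assuming numpy, approximates E[X], E[Y ], and E[XY ] for the joint density of Section 9.2 by a Riemann sum and recovers Cov(X, Y ) ≈ −13/5000.

import numpy as np

# Riemann-sum approximation for f(x, y) = (2/75)(2 x^2 y + x y^2) on [0, 3] x [1, 2]
x = np.linspace(0, 3, 1501)
y = np.linspace(1, 2, 501)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")
f = (2 / 75) * (2 * X**2 * Y + X * Y**2)
EX = (X * f).sum() * dx * dy             # approximately 109/50
EY = (Y * f).sum() * dx * dy             # approximately 157/100
EXY = (X * Y * f).sum() * dx * dy        # approximately 171/50
print(EXY - EX * EY)                     # close to -13/5000 = -0.0026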
Independent versus uncorrelated
Now let X and Y be two independent random variables. One expects that X
and Y are uncorrelated: they have nothing to do with one another! This is
indeed the case, for instance, if X and Y are discrete; one finds that
E[XY ] = Σ_i Σ_j ai bj P(X = ai, Y = bj)
       = Σ_i Σ_j ai bj P(X = ai) P(Y = bj)
       = (Σ_i ai P(X = ai)) · (Σ_j bj P(Y = bj))
       = E[X] E[Y ] .
A similar reasoning holds in case X and Y are continuous random variables.
The alternative expression for the covariance leads to the following important
observation.
Independent versus uncorrelated. If two random variables
X and Y are independent, then X and Y are uncorrelated.
Note that the reverse is not necessarily true. If X and Y are uncorrelated,
they need not be independent. This is illustrated in the next quick exercise.
Quick exercise 10.3 Consider the random variables X and Y with the joint
distribution given in Table 10.1. Check that X and Y are dependent, but that
also E[XY ] = E[X] E[Y ].
From the preceding we also deduce the following rule on the variance of the
sum of two random variables.
Variance of the sum. Let X and Y be two random variables.
Then always
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ) .
If X and Y are uncorrelated,
Var(X + Y ) = Var(X) + Var(Y ) .
Hence, we always have that E[X + Y ] = E[X]+E[Y ], whereas Var(X + Y ) =
Var(X)+Var(Y ) only holds for uncorrelated random variables (and hence for
independent random variables!).
As with the linearity-of-expectations rule, the rule for the variance of the
sum of uncorrelated random variables holds more generally. For uncorrelated
random variables X1, X2, . . . , Xn, we have
Var(X1 + X2 + · · · + Xn) = Var(X1) + Var(X2) + · · · + Var(Xn) .
This rule provides an easy way to compute the variance of a random variable
with a Bin(n, p) distribution. Recall the representation for a Bin(n, p) random
variable X:
X = R1 + R2 + · · · + Rn.
Each Ri has variance
Var(Ri) = E[Ri²] − (E[Ri])² = 0²·(1 − p) + 1²·p − (E[Ri])² = p − p² = p(1 − p).
Using the independence of the Ri, the rule for the variance of the sum yields
Var(X) = Var(R1) + Var(R2) + · · · + Var(Rn) = np(1 − p).
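As with the expectation, this can be checked by simulation; a short Python sketch (assuming numpy, with the same illustrative n = 10 and p = 0.3 as before):

import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 0.3
x = rng.binomial(1, p, size=(100_000, n)).sum(axis=1)   # Bin(n, p) via Bernoulli sums
print(x.var(), n * p * (1 - p))                         # sample variance vs np(1 - p) = 2.1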
10.3 The correlation coefficient
In the previous section we saw that the covariance between random vari-
ables gives an indication of how they influence one another. A disadvan-
tage of the covariance is the fact that it depends on the units in which the
random variables are represented. For instance, suppose that the length in
inches and weight in kilograms of Dutch citizens are modeled by random vari-
ables L and W. Someone prefers to represent the length in centimeters. Since
1 inch = 2.54 cm, one is dealing with the transformed random variable 2.54L.
The covariance between 2.54L and W is
Cov(2.54L, W) = E[(2.54L)W] − E[2.54L] E[W] = 2.54 (E[LW] − E[L] E[W]) = 2.54 Cov(L, W) .
That is, the covariance increases by a factor 2.54, which is somewhat dis-
turbing since changing from inches to centimeters does not essentially alter
the dependence between length and weight. This illustrates that the covari-
ance changes under a change of units. The following rule provides the exact
relationship.
Covariance under change of units. Let X and Y be two
random variables. Then
Cov(rX + s, tY + u) = rt Cov(X, Y )
for all numbers r, s, t, and u.
See Exercise 10.14 for a derivation of this rule.
Quick exercise 10.4 For X and Y in the example in Section 9.2 (see also
Section 10.2), show that Cov(−2X + 7, 5Y − 3) = 13/500.
The preceding discussion indicates that the covariance Cov(X, Y ) may not
always be suitable to express the dependence between X and Y . For this
reason there is a standardized version of the covariance called the correlation
coefficient of X and Y .
Definition. Let X and Y be two random variables. The correlation
coefficient ρ(X, Y ) is defined to be 0 if Var(X) = 0 or Var(Y ) = 0,
and otherwise
ρ(X, Y ) = Cov(X, Y ) / √(Var(X) Var(Y )) .
Note that ρ(X, Y ) remains unaffected by a change of units, and therefore it
is dimensionless. For instance, if X and Y are measured in kilometers, then
Cov(X, Y ), Var(X), and Var(Y ) are in km², so that the dimension of ρ(X, Y ) is km²/(√km² · √km²).
For X and Y in the example in Section 9.2, recall that Cov(X, Y ) = −13/5000.
We also have Var(X) = 989/2500 and Var(Y ) = 791/10 000 (see Exer-
cise 10.10), so that
ρ(X, Y ) = (−13/5000) / √((989/2500) · (791/10 000)) = −0.0147.
Quick exercise 10.5 For X and Y in the example in Section 9.2, show that
ρ(−2X + 7, 5Y − 3) = 0.0147.
The previous quick exercise illustrates the following linearity property for the
correlation coefficient. For numbers r, s, t, and u fixed, with r, t ≠ 0, and random
variables X and Y :
ρ(rX + s, tY + u) = −ρ(X, Y ) if rt < 0,  and  ρ(rX + s, tY + u) = ρ(X, Y ) if rt > 0.
Thus we see that the size of the correlation coefficient is unaffected by a change
of units, but note the possibility of a change of sign.
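A quick simulation illustrates both points at once: scaling with rt < 0 leaves the size of the correlation coefficient unchanged but flips its sign. The sketch assumes numpy; the correlated pair below is constructed only for illustration.

import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(50_000)
y = 0.5 * x + rng.standard_normal(50_000)               # an artificially correlated pair
rho = np.corrcoef(x, y)[0, 1]
rho_scaled = np.corrcoef(-2 * x + 7, 5 * y - 3)[0, 1]   # r = -2, t = 5, so rt < 0
print(rho, rho_scaled)                                   # same magnitude, opposite sign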
Two random variables X and Y are “most correlated” if X = Y or if X = −Y .
As a matter of fact, in the former case ρ(X, Y ) = 1, while in the latter case
ρ(X, Y ) = −1. In general—for nonconstant random variables X and Y —the
following property holds:
−1 ≤ ρ(X, Y ) ≤ 1.
For a formal derivation of this property, see the next remark.
Remark 10.2 (Correlations are between −1 and 1). Here we give a
proof of the preceding formula. Since the variance of any random variable
is nonnegative, we have that
0 ≤ Var( X/√Var(X) + Y/√Var(Y ) )
  = Var( X/√Var(X) ) + Var( Y/√Var(Y ) ) + 2 Cov( X/√Var(X), Y/√Var(Y ) )
  = Var(X)/Var(X) + Var(Y )/Var(Y ) + 2 Cov(X, Y )/√(Var(X) Var(Y ))
  = 2 (1 + ρ(X, Y )) .
This implies ρ(X, Y ) ≥ −1. Using the same argument but replacing X by
−X shows that ρ(X, Y ) ≤ 1.
10.4 Solutions to the quick exercises
10.1 The expectation of X + Y is computed as follows:
E[X + Y ] = (0 + 0)·0 + (1 + 0)·(1/4) + (2 + 0)·0 + (0 + 1)·(1/4) + (1 + 1)·0 + (2 + 1)·(1/4) + (0 + 2)·0 + (1 + 2)·(1/4) + (2 + 2)·0 = 2.
10.2 First complete Table 10.1 with the marginal distributions:
   b \ a      0      1      2    P(Y = b)
     0        0     1/4     0      1/4
     1       1/4     0     1/4     1/2
     2        0     1/4     0      1/4
  P(X = a)   1/4    1/2    1/4      1
It follows that E[X] = 0·(1/4) + 1·(1/2) + 2·(1/4) = 1, and similarly E[Y ] = 1.
Therefore E[X] + E[Y ] = 2, which is equal to E[X + Y ] as computed in
Quick exercise 10.1.
10.3 From Table 10.1, as completed in Quick exercise 10.2, we see that X
and Y are dependent. For instance, P(X = 0, Y = 0) ≠ P(X = 0) P(Y = 0).
From Quick exercise 10.2 we know that E[X] = E[Y ] = 1. Because we already
computed E[XY ] = 1, it follows that E[XY ] = E[X] E[Y ]. According to the
alternative expression for the covariance this means that Cov(X, Y ) = 0, i.e.,
X and Y are uncorrelated.
10.4 We already computed Cov(X, Y ) = −13/5000 in Section 10.2. Hence, by
the linearity-of-covariance rule Cov(−2X + 7, 5Y − 3) = (−2)·5·(−13/5000) =
13/500.
10.5 From Quick exercise 10.4 we have Cov(−2X + 7, 5Y − 3) = 13/500.
Since Var(X) = 989/2500 and Var(Y ) = 791/10 000, by definition of the
correlation coefficient and the rule for variances,
ρ(−2X + 7, 5Y − 3) = Cov(−2X + 7, 5Y − 3) / √(Var(−2X + 7) · Var(5Y − 3))
                   = (13/500) / √(4 Var(X) · 25 Var(Y ))
                   = (13/500) / √((3956/2500) · (19 775/10 000)) = 0.0147.
10.5 Exercises
10.1  Consider the joint probability distribution of X and Y from Exer-
cise 9.7, obtained from data on hair color and eye color, for which we already
computed the expectations and variances of X and Y , as well as E[XY ].
a. Compute Cov(X, Y ). Are X and Y positively correlated, negatively correlated, or uncorrelated?
b. Compute the correlation coefficient between X and Y .
10.2  Consider the two discrete random variables X and Y with joint dis-
tribution derived in Exercise 9.2:
   b \ a      0      1      2    P(Y = b)
    −1       1/6    1/6    1/6     1/2
     1        0     1/2     0      1/2
  P(X = a)   1/6    2/3    1/6      1
a. Determine E[XY ].
b. Note that X and Y are dependent. Show that X and Y are uncorrelated.
c. Determine Var(X + Y ).
d. Determine Var(X − Y ).
10.3 Let U and V be the two random variables from Exercise 9.6. We have
seen that U and V are dependent with joint probability distribution
   b \ a      0      1      2    P(V = b)
     0       1/4     0     1/4     1/2
     1        0     1/2     0      1/2
  P(U = a)   1/4    1/2    1/4      1
Determine the covariance Cov(U, V ) and the correlation coefficient ρ(U, V ).
10.4 Consider the joint probability distribution of the discrete random vari-
ables X and Y from the Melencolia Exercise 9.1. Compute Cov(X, Y ).
   b \ a     1        2        3        4
     1    16/136    3/136    2/136   13/136
     2     5/136   10/136   11/136    8/136
     3     9/136    6/136    7/136   12/136
     4     4/136   15/136   14/136    1/136
10.5  Suppose X and Y are discrete random variables taking values 0,1,
and 2. The following is given about the joint and marginal distributions:
   b \ a      0       1       2     P(Y = b)
     0      8/72     ...    10/72     1/3
     1     12/72    9/72     ...      1/2
     2       ...    3/72     ...      ...
  P(X = a)   1/3     ...     ...       1
a. Complete the table.
b. Compute the expectation of X and of Y and the covariance between X
and Y .
c. Are X and Y independent?
10.6  Suppose X and Y are discrete random variables taking values c−1, c,
and c + 1. The following is given about the joint and marginal distributions:
   b \ a     c − 1     c     c + 1   P(Y = b)
   c − 1      2/45    9/45    4/45     1/3
     c        7/45    5/45    3/45     1/3
   c + 1      6/45    1/45    8/45     1/3
  P(X = a)     1/3     1/3     1/3      1
a. Take c = 0 and compute the expectation of X and of Y and the covariance
between X and Y .
b. Show that X and Y are uncorrelated, no matter what the value of c is.
Hint: one could compute Cov(X, Y ), but there is a short solution using
the rule on the covariance under change of units (see page 141) together
with part a.
c. Are X and Y independent?
10.7  Consider the joint distribution of Quick exercise 9.2 and take ε fixed
between −1/4 and 1/4:
   a \ b       0          1        pX(a)
     0      1/4 − ε    1/4 + ε      1/2
     1      1/4 + ε    1/4 − ε      1/2
  pY (b)      1/2        1/2         1
a. Take ε = 1/8 and compute Cov(X, Y ).
b. Take ε = 1/8 and compute ρ(X, Y ).
c. For which values of ε is ρ(X, Y ) equal to −1, 0, or 1?
10.8 Let X and Y be random variables such that
E[X] = 2, E[Y ] = 3, and Var(X) = 4.
a. Show that E[X²] = 8.
b. Determine the expectation of −2X² + Y .
10.9  Suppose the blood of 1000 persons has to be tested to see which ones
are infected by a (rare) disease. Suppose that the probability that the test
is positive is p = 0.001. The obvious way to proceed is to test each person,
which results in a total of 1000 tests. An alternative procedure is the following.
Distribute the blood of the 1000 persons over 25 groups of size 40, and mix
half of the blood of each of the 40 persons with that of the others in each
group. Now test the aggregated blood sample of each group: when the test is
negative no one in that group has the disease; when the test is positive, at
least one person in the group has the disease, and one will test the other half
of the blood of all 40 persons of that group separately. In total, that gives 41
tests for that group. Let Xi be the total number of tests one has to perform
for the ith group using this alternative procedure.
a. Describe the probability distribution of Xi, i.e., list the possible values it
takes on and the corresponding probabilities.
b. What is the expected number of tests for the ith group? What is the
expected total number of tests? What do you think of this alternative
procedure for blood testing?
10.10  Consider the variables X and Y from the example in Section 9.2
with joint probability density
f(x, y) = (2/75)(2x²y + xy²)  for 0 ≤ x ≤ 3 and 1 ≤ y ≤ 2
and marginal probability densities
fX(x) = (2/225)(9x² + 7x)  for 0 ≤ x ≤ 3,
fY (y) = (1/25)(3y² + 12y)  for 1 ≤ y ≤ 2.
a. Compute E[X], E[Y ], and E[X + Y ].
b. Compute E[X²], E[Y ²], E[XY ], and E[(X + Y )²].
c. Compute Var(X + Y ), Var(X), and Var(Y ) and check that Var(X + Y ) ≠ Var(X) + Var(Y ).
10.11 Recall the relation between degrees Celsius and degrees Fahrenheit
degrees Fahrenheit = (9/5) · degrees Celsius + 32.
Let X and Y be the average daily temperatures in degrees Celsius in Ams-
terdam and Antwerp. Suppose that Cov(X, Y ) = 3 and ρ(X, Y ) = 0.8. Let T
and S be the same temperatures in degrees Fahrenheit. Compute Cov(T, S)
and ρ(T, S).
10.12 Consider the independent random variables H and R from the vase
example, with a U(25, 35) and a U(7.5, 12.5) distribution. Compute E[H]
and E[R²], and check that E[V ] = π E[H] E[R²].
10.13 Let X and Y be as in the triangle example in Exercise 9.15. Recall from
Exercise 9.16 that X and Y represent the minimum and maximum coordinate
of a point that is drawn from the unit square: X = min{U, V } and Y =
max{U, V }.
a. Show that E[X] = 1/3, Var(X) = 1/18, E[Y ] = 2/3, and Var(Y ) = 1/18.
Hint: you might consult Exercise 8.15.
b. Check that Var(X + Y ) = 1/6, by using that U and V are independent
and that X + Y = U + V .
c. Determine the covariance Cov(X, Y ) using the results from a and b.
10.14  Let X and Y be two random variables and let r, s, t, and u be
arbitrary real numbers.
a. Derive from the definition that Cov(X + s, Y + u) = Cov(X, Y ).
b. Derive from the definition that Cov(rX, tY ) = rtCov(X, Y ).
c. Combine parts a and b to show Cov(rX + s, tY + u) = rtCov(X, Y ).
10.15 In Figure 10.1 three plots are displayed. For each plot we carried out a
simulation in which we generated 500 realizations of a pair of random variables
(X, Y ). We have chosen three different joint distributions of X and Y .
[Figure 10.1: three scatterplots of the simulated pairs (X, Y ), each with 500 points; both axes run from −2 to 2.]
Fig. 10.1. Some scatterplots.
a. Indicate for each plot whether it corresponds to random variables X and
Y that are positively correlated, negatively correlated, or uncorrelated.
b. Which plot corresponds to random variables X and Y for which |ρ(X, Y )|
is maximal?
10.16  Let X and Y be random variables.
a. Express Cov(X, X + Y ) in terms of Var(X) and Cov(X, Y ).
b. Are X and X + Y positively correlated, uncorrelated, or negatively cor-
related, or can anything happen?
c. Same question as in part b, but now assume that X and Y are uncorre-
lated.
10.17 Extending the variance of the sum rule. For mathematical con-
venience we first extend the sum rule to three random variables with zero
expectation. Next we further extend the rule to three random variables with
nonzero expectation. By the same line of reasoning we extend the rule to n
random variables.
a. Let X, Y and Z be random variables with expectation 0. Show that
Var(X + Y + Z) = Var(X) + Var(Y ) + Var(Z)
+ 2Cov(X, Y ) + 2Cov(X, Z) + 2Cov(Y, Z) .
Hint: directly apply that for real numbers y1, . . . , yn
(y1 + · · · + yn)² = y1² + · · · + yn² + 2y1y2 + 2y1y3 + · · · + 2yn−1yn.
b. Now show a for X, Y , and Z with nonzero expectation.
Hint: you might use the rules on pages 98 and 141 about variance and
covariance under a change of units.
c. Derive a general variance of the sum rule, i.e., show that if X1, X2, . . . , Xn
are random variables, then
Var(X1 + X2 + · · · + Xn) = Var(X1) + · · · + Var(Xn)
    + 2Cov(X1, X2) + 2Cov(X1, X3) + · · · + 2Cov(X1, Xn)
    + 2Cov(X2, X3) + · · · + 2Cov(X2, Xn)
    + · · ·
    + 2Cov(Xn−1, Xn) .
d. Show that if the variances are all equal to σ² and the covariances are all equal to some constant γ, then
Var(X1 + X2 + · · · + Xn) = nσ² + n(n − 1)γ.
10.18  Consider a vase containing balls numbered 1, 2, . . . , N. We draw
n balls without replacement from the vase. Each ball is selected with equal
probability, i.e., in the first draw each ball has probability 1/N, in the second
draw each of the N − 1 remaining balls has probability 1/(N − 1), and so
on. For i = 1, 2, . . ., n, let Xi denote the number on the ball in the ith draw.
From Exercise 9.18 we know that the variance of Xi equals
Var(Xi) = (1/12)(N − 1)(N + 1).
Show that
Cov(X1, X2) = −(1/12)(N + 1).
Before you do the exercise: why do you think the covariance is negative?
Hint: use Var(X1 + X2 + · · · + XN ) = 0 (why?), and apply Exercise 10.17.
10.19 Derive the alternative expression for the covariance: Cov(X, Y ) =
E[XY ] − E[X] E[Y ].
Hint: work out (X − E[X])(Y − E[Y ]) and use linearity of expectations.
10.20 Determine ρ(U, U²) when U has a U(0, a) distribution. Here a is a positive number.
11 More computations with more random variables
Often one is interested in combining random variables, for instance, in taking
the sum. In previous chapters, we have seen that it is fairly easy to describe
the expected value and the variance of this new random variable. Often more
details are needed, and one also would like to have its probability distribu-
tion. In this chapter we consider the probability distributions of the sum, the
product, and the quotient of two random variables.
11.1 Sums of discrete random variables
In a solo race across the Pacific Ocean, a ship has one spare radio set for
communications. Each of the two radios has probability p of failing each time
it is switched on. The skipper uses the radio once every day. Let X be the
number of days the radio is switched on until it fails (so if the radio can be
used for two days and fails on the third day, X attains the value 3). Similarly,
let Y be the number of days the spare radio is switched on until it fails. Note
that these random variables are similar to the one discussed in Section 4.4,
which modeled the number of cycles until pregnancy. Hence, X and Y are
Geo(p) distributed random variables. Suppose that p = 1/75 and that the
trip will last 100 days. Then at first sight the skipper does not need to worry
about radio contact: the number of days the first radio lasts is X − 1 days,
and similarly the spare radio lasts Y −1 days. Therefore the expected number
of days he is able to have radio contact is
E[X − 1 + Y − 1] = E[X] + E[Y ] − 2 = 1/p + 1/p − 2 = 148 days!
The skipper—who has some training in probability theory—still has some
concerns about the risk he runs with these two radios. What if the probability
P(X + Y − 2 ≤ 99) that his two radios break down before the end of the trip
is large?
This example illustrates that it is important to study the probability distri-
bution of the sum Z = X + Y of two discrete random variables. The random
variable Z takes on values ai + bj, where ai is a possible value of X and bj
of Y . Hence, the probability mass function of Z is given by
pZ(c) = Σ_{(i,j): ai+bj=c} P(X = ai, Y = bj) ,
where the sum runs over all possible values ai of X and bj of Y such that
ai + bj = c. Because the sum only runs over values ai that are equal to c − bj,
we simplify the summation and write
pZ(c) = Σ_j P(X = c − bj, Y = bj) ,
where the sum runs over all possible values bj of Y . When X and Y are
independent, then P(X = c − bj, Y = bj) = P(X = c − bj) P(Y = bj). This
leads to the following rule.
Adding two independent discrete random variables. Let X
and Y be two independent discrete random variables, with probabil-
ity mass functions pX and pY . Then the probability mass function
pZ of Z = X + Y satisfies
pZ(c) = Σ_j pX(c − bj) pY (bj),
where the sum runs over all possible values bj of Y .
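The rule is a discrete convolution and is straightforward to carry out by computer. The sketch below (plain Python; the function name pmf_sum is ours, not from the text) convolves two probability mass functions stored as dictionaries and can be used to recompute the distribution of the sum of two fair dice.

def pmf_sum(p_x, p_y):
    """Probability mass function of X + Y for independent discrete X and Y."""
    p_z = {}
    for a, pa in p_x.items():
        for b, pb in p_y.items():
            p_z[a + b] = p_z.get(a + b, 0.0) + pa * pb
    return p_z

die = {k: 1 / 6 for k in range(1, 7)}
s = pmf_sum(die, die)
print(s[3], s[8])        # P(S = 3) and P(S = 8) for two fair dice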
Quick exercise 11.1 Let S be the sum of two independent throws with
a die, so S = X + Y , where X and Y are independent, and P(X = k) =
P(Y = k) = 1/6, for k = 1, . . . , 6. Use the addition rule to compute P(S = 3)
and P(S = 8), and compare your answers with Table 9.2.
In the solo race example, X and Y are independent Geo(p) distributed random
variables. Let Z = X + Y ; then by the above rule for k ≥ 2
P(X + Y = k) = pZ(k) = Σ_{ℓ=1}^{∞} pX(k − ℓ) pY (ℓ).
Because pX(a) = 0 for a ≤ 0, all terms in this sum with ℓ ≥ k vanish, hence
P(X + Y = k) = Σ_{ℓ=1}^{k−1} pX(k − ℓ) · pY (ℓ) = Σ_{ℓ=1}^{k−1} (1 − p)^(k−ℓ−1) p · (1 − p)^(ℓ−1) p
             = Σ_{ℓ=1}^{k−1} p²(1 − p)^(k−2) = (k − 1)p²(1 − p)^(k−2).
Note that X + Y does not have a geometric distribution.
Remark 11.1 (The expected value of a geometric distribution).
The preceding gives us the opportunity to calculate the expected value of
the geometric distribution in an easy way. Since the probabilities of Z add
up to one:
1 = Σ_{k=2}^{∞} pZ(k) = Σ_{k=2}^{∞} (k − 1)p²(1 − p)^(k−2) = p Σ_{ℓ=1}^{∞} ℓ p(1 − p)^(ℓ−1);
it follows that
E[X] = Σ_{ℓ=1}^{∞} ℓ p(1 − p)^(ℓ−1) = 1/p.
Returning to the solo race example, it is clear that the skipper does have
grounds to worry:
P(X + Y − 2 ≤ 99) = P(X + Y ≤ 101) = Σ_{k=2}^{101} P(X + Y = k)
                  = Σ_{k=2}^{101} (k − 1)(1/75)²(1 − 1/75)^(k−2) = 0.3904.
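This probability can be reproduced directly from the formula P(X + Y = k) = (k − 1)p²(1 − p)^(k−2); a one-line check in Python:

p = 1 / 75
print(sum((k - 1) * p**2 * (1 - p)**(k - 2) for k in range(2, 102)))   # about 0.3904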
The sum of two binomial random variables
It is not always necessary to use the addition rule for two independent discrete
random variables to find the distribution of their sum. For example, let X and
Y be two independent random variables, where X has a Bin(n, p) distribution
and Y has a Bin(m, p) distribution. Since a Bin(n, p) distribution models
the number of successes in n independent trials with success probability p,
heuristically, X + Y represents the number of successes in n + m trials with
success probability p and should therefore have a Bin(n + m, p) distribution.
A more formal reasoning is the following. Let
R1, R2, . . . , Rn, S1, S2, . . . , Sm
be independent Ber(p) distributed random variables. Recall that a Bin(n, p)
distributed random variable has the same distribution as the sum of n inde-
pendent Ber(p) distributed random variables (see Section 4.3 or 10.2). Hence
X has the same distribution as R1 + R2 + · · · + Rn and Y has the same
distribution as S1 + S2 + · · · + Sm. This means that X + Y has the same dis-
tribution as the sum of n+m independent Ber(p) variables and therefore has
a Bin(n + m, p) distribution. This can also be verified analytically by means
of the addition rule, using that X and Y are also independent.
Quick exercise 11.2 For i = 1, 2, 3, let Xi be a Bin(ni, p) distributed ran-
dom variable, and suppose that X1, X2, and X3 are independent. Argue that
Z = X1 + X2 + X3 is a Bin(n1 + n2 + n3, p) distributed random variable.
11.2 Sums of continuous random variables
Let X and Y be two continuous random variables. What can we say about the
probability density function of Z = X+Y ? We start with an example. Suppose
that X and Y are two independent, U(0, 1) distributed random variables. One
might be tempted to think that Z is also uniformly distributed.
Note that the joint probability density function f of X and Y is equal to the
product of the marginal probability density functions fX and fY :
f(x, y) = fX(x)fY (y) = 1 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1,
and f(x, y) = 0 otherwise. Let us compute the distribution function FZ of Z.
It is easy to see that FZ (a) = 0 for a ≤ 0 and FZ (a) = 1 for a ≥ 2. For a
between 0 and 1, let G be that part of the plane below the line x+y = a, and
let ∆ be the triangle with vertices (0, 0), (a, 0), and (0, a); see Figure 11.1.
[Figure 11.1 shows the unit square [0, 1] × [0, 1], the line x + y = a, and the triangle ∆ with vertices (0, 0), (a, 0), and (0, a); the region G lies below the line.]
Fig. 11.1. The region G in the plane where x + y ≤ a (with 0 < a < 1) intersected with ∆.
Since f(x, y) = 0 outside [0, 1] × [0, 1], the distribution function of Z is given
by
FZ (a) = P(Z ≤ a) = P(X + Y ≤ a) = ∫∫_G f(x, y) dx dy = ∫∫_∆ 1 dx dy = area of ∆ = (1/2)a²
for 0 < a < 1. For the case where 1 ≤ a < 2 one can draw a similar figure (see Figure 11.2), from which one can find that
FZ(a) = 1 − (1/2)(2 − a)²  for 1 ≤ a < 2.
[Figure 11.2 shows the unit square, the line x + y = a for 1 ≤ a < 2, the triangle ∆, and the region G below the line.]
Fig. 11.2. The region G in the plane where x + y ≤ a (with 1 ≤ a < 2) intersected with ∆.
We see that Z is not uniformly distributed.
In general, the distribution function FZ of the sum Z of two continuous ran-
dom variables X and Y is given by
FZ (a) = P(Z ≤ a) = P(X + Y ≤ a) = ∫∫_{(x,y): x+y≤a} f(x, y) dx dy.
The double integral on the right-hand side can be written as a repeated in-
tegral, first over x and then over y. Note that x and y are between minus
and plus infinity and that they also have to satisfy x + y ≤ a or, equivalently,
x ≤ a − y. This means that the integral over x runs from minus infinity to a − y, and the integral over y runs from minus infinity to plus infinity. Hence
FZ (a) = ∫_{−∞}^{∞} ( ∫_{−∞}^{a−y} f(x, y) dx ) dy.
In case X and Y are independent, the last double integral can be written as
∫_{−∞}^{∞} ( ∫_{−∞}^{a−y} fX(x) dx ) fY (y) dy,
and we find that
FZ(a) = ∫_{−∞}^{∞} FX(a − y) fY (y) dy
for −∞ < a < ∞. Differentiating FZ we find the following rule.
Adding two independent continuous random variables.
Let X and Y be two independent continuous random variables, with
probability density functions fX and fY . Then the probability den-
sity function fZ of Z = X + Y is given by
fZ(z) = ∫_{−∞}^{∞} fX(z − y) fY (y) dy
for −∞ < z < ∞.
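The convolution integral can also be evaluated numerically. The sketch below (assuming numpy) discretizes the integral for two independent U(0, 1) variables; the values agree with the density obtained by differentiating the distribution function FZ found above.

import numpy as np

def f_uniform(u):
    # density of the U(0, 1) distribution
    return np.where((u >= 0) & (u <= 1), 1.0, 0.0)

y = np.linspace(-1, 3, 4001)
dy = y[1] - y[0]
for z in (0.25, 1.25):
    fz = np.sum(f_uniform(z - y) * f_uniform(y)) * dy   # approximates the convolution integral
    print(z, fz)          # about 0.25 and 0.75, the values of the triangular density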
The single-server queue revisited
In the single-server queue model from Section 6.4, T1 is the time between
the start at time zero and the arrival of the first customer and Ti is the
time between the arrival of the (i − 1)th and ith customer at a well. We are
interested in the arrival time of the nth customer at the well. For n ≥ 1, let
Zn be the arrival time of the nth customer at the well: Zn = T1 + · · · + Tn.
Since each Ti has an Exp(0.5) distribution, it follows from the linearity-of-
expectations rule in Section 10.1 that the expected arrival time of the nth
customer is
E[Zn] = E[T1 + · · · + Tn] = E[T1] + · · · + E[Tn] = 2n minutes.
We would like to know whether the pump capacity is sufficient; for instance,
when the service times Si are independent U(2, 5) distributed random vari-
ables (this is the case when the pump capacity v = 1). In that case, at most
30 customers can pump water at the well in the first hour. If P(Z30 ≤ 60) is
large, one might be tempted to increase the capacity of the well.
Recalling that the Ti are independent Exp(λ) random variables, it follows
from the addition rule that fT1+T2 (z) = 0 if z < 0, and for z ≥ 0 that
fZ2 (z) = fT1+T2 (z) = ∫_{−∞}^{∞} fT1 (z − y) fT2 (y) dy = ∫_0^z λe^(−λ(z−y)) · λe^(−λy) dy
        = λ²e^(−λz) ∫_0^z dy = λ²z e^(−λz).
Viewing T1 + T2 + T3 as the sum of T1 and T2 + T3, we find, by applying the
addition rule again, that fZ3 (z) = 0 if z < 0, and for z ≥ 0 that
fZ3 (z) = fT1+T2+T3 (z) = ∫_{−∞}^{∞} fT1 (z − y) fT2+T3 (y) dy = ∫_0^z λe^(−λ(z−y)) · λ²y e^(−λy) dy
        = λ³e^(−λz) ∫_0^z y dy = (1/2)λ³z²e^(−λz).
Repeating this procedure, we find that fZn (z) = 0 if z < 0, and
fZn (z) = λ(λz)^(n−1) e^(−λz) / (n − 1)!
for z ≥ 0. Using integration by parts we find (see Exercise 11.13) that for
n ≥ 1 and a ≥ 0:
P(Zn ≤ a) = 1 − e^(−λa) Σ_{i=0}^{n−1} (λa)^i / i! .
Since λ = 1/2, it follows that
P(Z30 ≤ 60) = 0.524.
Even if each customer fills his jerrican in the minimum time of 2 minutes, we
see that after an hour with probability 0.524, people will be waiting at the
pump!
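The number 0.524 follows directly from the formula for P(Zn ≤ a) with λ = 1/2, n = 30, and a = 60; a small check in Python using only the standard library:

from math import exp, factorial

lam, n, a = 0.5, 30, 60
prob = 1 - exp(-lam * a) * sum((lam * a) ** i / factorial(i) for i in range(n))
print(prob)    # approximately 0.524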
The random variable Zn is an example of a gamma random variable, defined
as follows.
Definition. A continuous random variable X has a gamma distribution with parameters α > 0 and λ > 0 if its probability density function f is given by f(x) = 0 for x < 0 and
f(x) = λ(λx)^(α−1) e^(−λx) / Γ(α)  for x ≥ 0,
where the quantity Γ(α) is a normalizing constant such that f integrates to 1. We denote this distribution by Gam(α, λ).
The quantity Γ(α) is for α > 0 defined by
Γ(α) = ∫_0^∞ t^(α−1) e^(−t) dt.
It satisfies for α > 0 and n = 1, 2, . . .
Γ(α + 1) = αΓ(α)  and  Γ(n) = (n − 1)!
(see also Exercise 11.12). It follows from our example that the sum of n inde-
pendent Exp(λ) distributed random variables has a Gam(n, λ) distribution,
also known as the Erlang-n distribution with parameter λ.
The sum of independent normal random variables
Using the addition rule you can show that the sum of two independent nor-
mally distributed random variables is again a normally distributed random
variable. For instance, if X and Y are independent N(0, 1) distributed random
variables, one has
fX+Y (z) = ∫_{−∞}^{∞} fX(z − y) fY (y) dy
         = ∫_{−∞}^{∞} (1/√(2π)) e^(−(1/2)(z−y)²) · (1/√(2π)) e^(−(1/2)y²) dy
         = ∫_{−∞}^{∞} (1/√(2π))² e^(−(1/2)(2y² − 2yz + z²)) dy.
To prepare a change of variables, we subtract the term (1/2)z² from 2y² − 2yz + z² to complete the square in the exponent:
2y² − 2yz + (1/2)z² = ( √2 (y − z/2) )².
In this way we find, changing the integration variable to t = √2 (y − z/2):
fX+Y (z) = (1/√(2π)) e^(−z²/4) ∫_{−∞}^{∞} (1/√(2π)) e^(−(1/2)(2y² − 2yz + (1/2)z²)) dy
         = (1/√(2π)) e^(−z²/4) ∫_{−∞}^{∞} (1/√(2π)) e^(−(1/2)[√2(y − z/2)]²) dy
         = (1/√(2π)) e^(−z²/4) (1/√2) ∫_{−∞}^{∞} (1/√(2π)) e^(−t²/2) dt
         = (1/√(4π)) e^(−z²/4) ∫_{−∞}^{∞} φ(t) dt.
Since φ is the probability density of the standard normal distribution, it in-
tegrates to 1, so that
fX+Y (z) = (1/√(4π)) e^(−z²/4),
which is the probability density of the N(0, 2) distribution. Thus, X + Y also
has a normal distribution. This is more generally true.
The sum of independent normal random variables. If X and
Y are independent random variables with a normal distribution, then
X + Y also has a normal distribution.
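A short simulation (assuming numpy) illustrates the rule for the standard normal case treated above: the sum of two independent N(0, 1) samples has sample mean and variance close to those of the N(0, 2) distribution.

import numpy as np

rng = np.random.default_rng(4)
z = rng.standard_normal(500_000) + rng.standard_normal(500_000)   # X + Y
print(z.mean(), z.var())   # close to 0 and 2, matching the N(0, 2) density found above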
Quick exercise 11.3 Let X and Y be independent random variables, where
X has an N(3, 16) distribution, and Y an N(5, 9) distribution. Then X + Y
is a normally distributed random variable. What are its parameters?
Rather surprisingly, independence of X and Y is not a prerequisite, as can be
seen in the following remark.
Remark 11.2 (Sums of dependent normal random variables). We say the pair X, Y has a bivariate normal distribution if their joint probability density equals
(1/(2πσX σY √(1 − ρ²))) exp( −(1/2) · (1/(1 − ρ²)) · Q(x, y) ),
where
Q(x, y) = ((x − µX)/σX)² − 2ρ ((x − µX)/σX)((y − µY )/σY ) + ((y − µY )/σY )².
Here µX and µY are the expectations of X and Y , σX² and σY² are their variances, and ρ is the correlation coefficient of X and Y . If X and Y have such a bivariate normal distribution, then X has an N(µX, σX²) and Y has an N(µY , σY²) distribution. Moreover, one can show that X + Y has an N(µX + µY , σX² + σY² + 2ρσXσY ) distribution. An example of a bivariate normal probability density is displayed in Figure 9.2. This probability density corresponds to parameters µX = µY = 0, σX = σY = 1/6, and ρ = 0.8.
sity corresponds to parameters µX = µY = 0, σX = σY = 1/6, and ρ = 0.8.
11.3 Product and quotient of two random variables
Recall from Chapter 7 the example of the architect who wants maximal vari-
ety in the sizes of buildings. The architect wants more variety and therefore
replaces the square buildings by rectangular buildings: the buildings should
be of width X and depth Y , where X and Y are independent and uniformly
distributed between 0 and 10 meters. Since X and Y are independent, the
expected area of a building equals E[XY ] = E[X] E[Y ] = 5 · 5 = 25 m². But
what can one say about the distribution of the area Z = XY of an arbitrary
building?
Let us calculate the distribution function of Z. Clearly FZ (a) = 0 if a < 0 and FZ (a) = 1 if a > 100. For a between 0 and 100 we can compute FZ (a)
with the help of Figure 11.3.
We find
FZ (a) = P(Z ≤ a) = P(XY ≤ a)
       = (area of the shaded region in Figure 11.3) / (area of [0, 10] × [0, 10])
       = (1/100) ( (a/10) · 10 + ∫_{a/10}^{10} (a/x) dx )
       = (1/100) ( a + [a ln x]_{a/10}^{10} )
       = a(1 + 2 ln 10 − ln a) / 100.
Hence the probability density function fZ of Z is given by
[Figure 11.3 shows the square [0, 10] × [0, 10] with the curve xy = a; the shaded region G consists of the points (x, y) in the square with xy ≤ a.]
Fig. 11.3. The region G in the plane where xy ≤ a intersected with [0, 10]×[0, 10].
fZ(z) = (d/dz) FZ (z) = (d/dz) [ z(1 + 2 ln 10 − ln z)/100 ] = (ln 100 − ln z)/100
for 0 < z < 100 m².
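A Monte Carlo check of this density (assuming numpy): the fraction of simulated areas XY that fall in a small interval around a point z0 should, after dividing by the interval length, be close to fZ(z0). The point z0 = 25 and the half-width h = 0.5 are chosen here only for illustration.

import numpy as np

rng = np.random.default_rng(5)
z = rng.uniform(0, 10, 1_000_000) * rng.uniform(0, 10, 1_000_000)   # simulated areas X * Y
z0, h = 25.0, 0.5
empirical = np.mean((z > z0 - h) & (z < z0 + h)) / (2 * h)
print(empirical, (np.log(100) - np.log(z0)) / 100)                  # both about 0.0139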
This computation can be generalized to arbitrary independent continuous
random variables, and we obtain the following formula for the probability
density function of the product of two random variables.
Product of independent continuous random variables. Let
X and Y be two independent continuous random variables with prob-
ability densities fX and fY . Then the probability density function
fZ of Z = XY is given by
fZ(z) = ∫_{−∞}^{∞} fY (z/x) fX(x) (1/|x|) dx
for −∞ < z < ∞.
For the quotient Z = X/Y of two independent random variables X and
Y it is now fairly easy to derive the probability density function. Since the
independence of X and Y implies that X and 1/Y are independent, the
preceding rule yields
fZ(z) = ∫_{−∞}^{∞} f1/Y (z/x) fX(x) (1/|x|) dx.
Recall from Section 8.2 that the probability density function of 1/Y is given
by
f1/Y (y) = (1/y²) fY (1/y).
Substituting this in the integral, after changing the variable of integration, we
find the following rule.
Quotient of independent continuous random variables.
Let X and Y be two independent continuous random variables with
probability densities fX and fY . Then the probability density func-
tion fZ of Z = X/Y is given by
fZ(z) = ∫_{−∞}^{∞} fX(zx) fY (x) |x| dx
for −∞ < z < ∞.
The quotient of two independent normal random variables
Let X and Y be independent random variables, both having a standard normal
distribution. When we compute the quotient Z of X and Y , we find a so-called
standard Cauchy distribution:
fZ(z) = ∫_{−∞}^{∞} |x| (1/√(2π)) e^(−(1/2)z²x²) · (1/√(2π)) e^(−(1/2)x²) dx
      = (1/(2π)) ∫_{−∞}^{∞} |x| e^(−(1/2)(z²+1)x²) dx = 2 · (1/(2π)) ∫_0^∞ x e^(−(1/2)(z²+1)x²) dx
      = (1/π) [ −1/(z² + 1) · e^(−(1/2)(z²+1)x²) ]_0^∞ = 1/(π(z² + 1)) .
This is the special case α = 0, β = 1 of the following family of distributions.
Definition. A continuous random variable has a Cauchy distribu-
tion with parameters α and β > 0 if its probability density function f is given by
f(x) = β / (π(β² + (x − α)²))  for −∞ < x < ∞.
We denote this distribution by Cau(α, β).
By integrating, we find that the distribution function F of a Cauchy distri-
bution is given by
F(x) = 1/2 + (1/π) arctan((x − α)/β) .
The parameter α is the point of symmetry of the probability density func-
tion f. Note that α is not the expected value of Z. As a matter of fact, it was
shown in Remark 7.1 that the expected value does not exist! The probabil-
ity density f is shown together with the distribution function F for the case
α = 2, β = 5 in Figure 11.4.
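As an illustration of the result above, one can compare the empirical distribution function of a simulated ratio of independent standard normal samples with the standard Cauchy distribution function F(x) = 1/2 + arctan(x)/π (a sketch assuming numpy; the evaluation points are arbitrary).

import numpy as np

rng = np.random.default_rng(6)
z = rng.standard_normal(1_000_000) / rng.standard_normal(1_000_000)   # Z = X/Y
for x in (-1.0, 0.0, 2.0):
    print(np.mean(z <= x), 0.5 + np.arctan(x) / np.pi)   # empirical cdf vs Cau(0, 1) cdf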
[Figure 11.4 shows the probability density f (left) and the distribution function F (right) of the Cau(2, 5) distribution, plotted for x between −12 and 16.]
Fig. 11.4. The graphs of f and F of the Cau(2, 5) distribution.
Quick exercise 11.4 Argue—without doing any calculations—that if Z has
a standard Cauchy distribution, 1/Z also has a standard Cauchy distribution.
11.4 Solutions to the quick exercises
11.1 Using the addition rule we find
P(S = 3) = Σ_{j=1}^{6} pX(3 − j) pY (j)
         = pX(2)pY (1) + pX(1)pY (2) + pX(0)pY (3) + pX(−1)pY (4) + pX(−2)pY (5) + pX(−3)pY (6)
         = 1/36 + 1/36 + 0 + 0 + 0 + 0 = 1/18
and
P(S = 8) = Σ_{j=1}^{6} pX(8 − j) pY (j)
         = pX(7)pY (1) + pX(6)pY (2) + pX(5)pY (3) + pX(4)pY (4) + pX(3)pY (5) + pX(2)pY (6)
         = 0 + 1/36 + 1/36 + 1/36 + 1/36 + 1/36 = 5/36.
11.2 We have seen that X1 + X2 is a Bin(n1 + n2, p) distributed random
variable. Viewing X1 + X2 + X3 as the sum of X1 + X2 and X3, it follows
that X1 + X2 + X3 is a Bin(n1 + n2 + n3, p) distributed random variable.
11.3 The sum rule for two normal random variables tells us that X + Y is
a normally distributed random variable. Its parameters are expectation and
variance of X + Y . Hence by linearity of expectations
µX+Y = E[X + Y ] = E[X] + E[Y ] = µX + µY = 3 + 5 = 8,
and by the rule for the variance of the sum
σ²X+Y = Var(X) + Var(Y ) + 2Cov(X, Y ) = σX² + σY² = 16 + 9 = 25,
using that Cov(X, Y ) = 0 due to independence of X and Y .
11.4 In the examples we have seen that the quotient X/Y of two independent
standard normal random variables has a standard Cauchy distribution. Since
Z = X/Y , the random variable 1/Z = Y/X. This is also the quotient of two
independent standard normal random variables, and it has a standard Cauchy
distribution.
11.5 Exercises
11.1  Let X and Y be independent random variables with a discrete uniform
distribution, i.e., with probability mass functions
pX(k) = pY (k) = 1/N, for k = 1, . . . , N.
Use the addition rule for discrete random variables on page 152 to determine
the probability mass function of Z = X + Y for the following two cases.
a. Suppose N = 6, so that X and Y represent two throws with a die. Show
that
pZ(k) = P(X + Y = k) = (k − 1)/36 for k = 2, . . . , 6,  and  pZ(k) = (13 − k)/36 for k = 7, . . . , 12.
You may check this with Quick exercise 11.1.
b. Determine the expression for pZ (k) for general N.
11.2  Consider a discrete random variable X taking values k = 0, 1, 2, . . .
with probabilities
P(X = k) = (µ^k / k!) e^(−µ),
where µ > 0. This is the Poisson distribution with parameter µ. We will learn
more about this distribution in Chapter 12. This exercise illustrates that the
sum of independent Poisson variables again has a Poisson distribution.
a. Let X and Y be independent random variables, each having a Poisson
distribution with µ = 1. Show that for k = 0, 1, 2, . . .
P(X + Y = k) = (2^k / k!) e^(−2),
by using Σ_{ℓ=0}^{k} (k choose ℓ) = 2^k.
b. Let X and Y be independent random variables, each having a Poisson
distribution with parameters λ and µ. Show that for k = 0, 1, 2, . . .
P(X + Y = k) = ((λ + µ)^k / k!) e^(−(λ+µ)),
by using Σ_{ℓ=0}^{k} (k choose ℓ) p^ℓ (1 − p)^(k−ℓ) = 1 for p = µ/(λ + µ).
We conclude that X +Y has a Poisson distribution with parameter λ+µ.
11.3 Let X and Y be two independent random variables, where X has a
Ber(p) distribution, and Y has a Ber(q) distribution. When p = q = r, we
know that X + Y has a Bin(2, r) distribution. Suppose that p = 1/2 and
q = 1/4. Determine P(X + Y = k), for k = 0, 1, 2, and conclude that X + Y
does not have a binomial distribution.
11.4  Let X and Y be two independent random variables, where X has an
N(2, 5) distribution and Y has an N(5, 9) distribution. Define Z = 3X−2Y +1.
a. Compute E[Z] and Var(Z).
b. What is the distribution of Z?
c. Compute P(Z ≤ 6).
11.5  Let X and Y be two independent, U(0, 1) distributed random vari-
ables. Use the rule on addition of independent continuous random variables
on page 156 to show that the probability density function of X + Y is given
by
fZ(z) = z      for 0 ≤ z < 1,
fZ(z) = 2 − z  for 1 ≤ z ≤ 2,
fZ(z) = 0      otherwise.
11.6  Let X and Y be independent random variables with probability den-
sities
fX(x) = (1/4) x e^(−x/2)  and  fY (y) = (1/4) y e^(−y/2).
Use the rule on addition of independent continuous random variables to de-
termine the probability density of Z = X + Y .
11.7  The two random variables in Exercise 11.6 are special cases of
Gam(α, λ) variables, namely with α = 2 and λ = 1/2. More generally, let
X1, . . . , Xn be independent Gam(k, λ) distributed random variables, where
λ > 0 and k is a positive integer. Argue—without doing any calculations—
that X1 + · · · + Xn has a Gam(nk, λ) distribution.
11.8 We investigate the effect on the Cauchy distribution under a change of
units.
a. Let X have a standard Cauchy distribution. What is the distribution of
Y = rX + s?
b. Let X have a Cau(α, β) distribution. What is the distribution of the
random variable (X − α)/β?
11.9  Let X and Y be independent random variables with a Par(α) and
Par(β) distribution.
a. Take α = 3 and β = 1 and determine the probability density of Z = XY .
b. Determine the probability density of Z = XY for general α and β.
11.10 Let X and Y be independent random variables with a Par(α) and
Par(β) distribution.
a. Take α = β = 2. Show that Z = X/Y has probability density
fZ(z) = z for 0 < z < 1,  and  fZ(z) = 1/z³ for 1 ≤ z < ∞.
b. For general α, β > 0, show that Z = X/Y has probability density
fZ(z) = (αβ/(α + β)) z^(β−1)        for 0 < z < 1,
fZ(z) = (αβ/(α + β)) · 1/z^(α+1)    for 1 ≤ z < ∞.
11.11 Let X1, X2, and X3 be three independent Geo(p) distributed random
variables, and let Z = X1 + X2 + X3.
a. Show for k ≥ 3 that the probability mass function pZ of Z is given by
pZ(k) = P(X1 + X2 + X3 = k) = (1/2)(k − 2)(k − 1) p³ (1 − p)^(k−3).
b. Use the fact that Σ_{k=3}^{∞} pZ(k) = 1 to show that
p² ( E[X1²] + E[X1] ) = 2.
c. Use E[X1] = 1/p and part b to conclude that
E[X1²] = (2 − p)/p²  and  Var(X1) = (1 − p)/p².
11.12 Show that Γ(1) = 1, and use integration by parts to show that
Γ(x + 1) = xΓ(x) for x > 0.
Use this last expression to show for n = 1, 2, . . . that
Γ(n) = (n − 1)!
11.13 Let Zn have an Erlang-n distribution with parameter λ.
a. Use integration by parts to show that for a ≥ 0 and n ≥ 2:
P(Zn ≤ a) = ∫_0^a (λ^n z^(n−1) / (n − 1)!) e^(−λz) dz = −((λa)^(n−1)/(n − 1)!) e^(−λa) + P(Zn−1 ≤ a) .
b. Use a to show that for a ≥ 0:
P(Zn ≤ a) = −Σ_{i=1}^{n−1} ((λa)^i / i!) e^(−λa) + P(Z1 ≤ a) .
c. Conclude that for a ≥ 0:
P(Zn ≤ a) = 1 − e^(−λa) Σ_{i=0}^{n−1} (λa)^i / i! .
12 The Poisson process
In many random phenomena we encounter, it is not just one or two random
variables that play a role but a whole collection. In that case one often speaks
of a random process. The Poisson process is a simple kind of random process,
which models the occurrence of random points in time or space. There are
numerous ways in which processes of random points arise: some examples are
presented in the first section. The Poisson process describes in a certain sense
the most random way to distribute points in time or space. This is made more
precise with the notions of homogeneity and independence.
12.1 Random points
Typical examples of the occurrence of random time points are: arrival times
of email messages at a server, the times at which asteroids hit the earth,
arrival times of radioactive particles at a Geiger counter, times at which your
computer crashes, the times at which electronic components fail, and arrival
times of people at a pump in an oasis.
Examples of the occurrence of random points in space are: the locations of
asteroid impacts with earth (2-dimensional), the locations of imperfections in a
material (3-dimensional), and the locations of trees in a forest (2-dimensional).
Some of these phenomena are better modeled by the Poisson process than
others. Loosely speaking, one might say that the Poisson process model often
applies in situations where there is a very large population, and each member
of the population has a very small probability to produce a point of the
process. This is, for instance, well fulfilled in the Geiger counter example
where, in a huge collection of atoms, just a few will emit a radioactive particle
(see [28]). A property of the Poisson process—as we will see shortly—is that
points may lie arbitrarily close together. Therefore the tree locations are not
so well modeled by the Poisson process.
12.2 Taking a closer look at random arrivals
A well-known example that is usually modeled by the Poisson process is that
of calls arriving at a telephone exchange—the exchange is connected to a large
number of people who make phone calls now and then. This will be our leading
example in this section.
Telephone calls arrive at random times X1, X2, . . . at the telephone exchange
during a time interval [0, t].
[Figure: the arrival times X1, . . . , X5 of the calls marked on a time axis running from 0 to t.]
The two basic assumptions we make on these random arrivals are
1. (Homogeneity) The rate λ at which arrivals occur is constant over time:
in a subinterval of length u the expectation of the number of telephone
calls is λu.
2. (Independence) The numbers of arrivals in disjoint time intervals are in-
dependent random variables.
Homogeneity is also called weak stationarity. We denote the total number of
calls in an interval I by N(I), abbreviating N([0, t]) to Nt. Homogeneity then
implies that we require
E[Nt] = λt.
To get hold of the distribution of Nt we divide the interval [0, t] into n intervals
of length t/n. When n is large enough, every interval Ij,n = ((j − 1) t/n, j t/n]
will contain either 0 or 1 arrival.

[Figure: the interval [0, t] divided into n subintervals of length t/n, with the arrival times X1, . . . , X5 marked on the time axis.]

For such a large n (which also satisfies n > λt), let Rj be the number of arrivals
in the time interval Ij,n. Since Rj is
0 or 1, Rj has a Ber(pj) distribution for some pj. Recall that for a Bernoulli
random variable E[Rj] = 0 · (1 − pj) + 1 · pj = pj. By the homogeneity
assumption, for each j
pj = λ · length of Ij,n = λt/n.
Summing the number of calls in the intervals gives the total number of calls,
hence
Nt = R1 + R2 + · · · + Rn.
By the independence assumption, the Rj are independent random variables,
therefore Nt has a Bin(n, p) distribution, with p = λt/n.
Remark 12.1 (About this approximation). The argument just given
seems pretty convincing, but actually Rj does not have a Bernoulli distri-
bution, whatever the value of n. A way to see this is the following. Every
interval Ij,n is a union of the two intervals I2j−1,2n and I2j,2n. Hence the
probability that Ij,n contains two calls is at least (λt/2n)² = λ²t²/4n²,
which is larger than zero.
Note however, that the probability of having two arrivals is of smaller order
than the probability that Rj takes the value 1. If we add a third assumption,
namely that the probability of two or more calls arriving in an interval Ij,n
tends to zero faster than 1/n, then the conclusion below on the distribution
of Nt is valid.
We have found that (at least in first approximation)
P(Nt = k) = (n choose k) (λt/n)^k (1 − λt/n)^{n−k}   for k = 0, . . . , n.
In this analysis n is a rather artificial parameter, of which we only know that
it should not be “too small.” It therefore seems a good idea to get rid of n
by letting n go to infinity, hoping that the probability distribution of Nt will
settle down. Note that
lim_{n→∞} (n choose k) (1/n^k) = lim_{n→∞} (n/n) · ((n − 1)/n) · · · ((n − k + 1)/n) · (1/k!) = 1/k!,
and from calculus we know that
lim_{n→∞} (1 − λt/n)^n = e^{−λt}.
Since certainly
lim_{n→∞} (1 − λt/n)^{−k} = 1,
we obtain, combining these three limits, that
lim_{n→∞} P(Nt = k) = lim_{n→∞} (n choose k) (1/n^k) · (λt)^k · (1 − λt/n)^n · (1 − λt/n)^{−k} = ((λt)^k/k!) e^{−λt}.
Since
e^{−λt} Σ_{k=0}^{∞} (λt)^k/k! = e^{−λt} e^{λt} = 1,
we have indeed run into a probability distribution on the numbers 0, 1, 2, . . . .
Note that all these probabilities are determined by the single value λt. This
motivates the following definition.
Definition. A discrete random variable X has a Poisson distribu-
tion with parameter µ, where µ > 0, if its probability mass function p
is given by
p(k) = P(X = k) = (µ^k/k!) e^{−µ}   for k = 0, 1, 2, . . . .
We denote this distribution by Pois(µ).
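This limiting probability mass function is easy to check numerically. The following sketch (Python, using only the standard math module; the values λ = 2, t = 1, k = 3 are arbitrary and not from the text) compares the Bin(n, λt/n) probabilities with the Pois(λt) probability as n grows.

import math

def binom_pmf(k, n, p):
    # P(Nt = k) under the Bin(n, p) approximation with p = lambda*t/n
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    # P(X = k) for X ~ Pois(mu)
    return mu**k / math.factorial(k) * math.exp(-mu)

lam, t, k = 2.0, 1.0, 3                      # illustrative values only
for n in (10, 100, 1000, 10000):
    print(n, binom_pmf(k, n, lam * t / n))   # approaches the Poisson value
print("limit", poisson_pmf(k, lam * t))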
Figure 12.1 displays the graphs of the probability mass functions of the Poisson
distribution with µ = 0.9 (left) and the Poisson distribution with µ = 5
(right).
Fig. 12.1. The probability mass functions of the Pois(0.9) and the Pois(5) distri-
butions.
Quick exercise 12.1 Consider the event “exactly one call arrives in the
interval [0, 2s].” The probability of this event is P(N2s = 1) = λ · 2s · e^{−λ·2s}.
But note that this event is the same as “there is exactly one call in the interval
[0, s) and no calls in the interval [s, 2s], or no calls in [0, s) and exactly one call
in [s, 2s].” Verify (using assumptions 1 and 2) that you get the same answer
if you compute the probability of the event in this way.
We do have a hint¹ about what the expectation and variance of a Poisson
random variable might be: since E[Nt] = λt for all n, we anticipate that the
limiting Poisson distribution will have expectation λt. Similarly, since Nt has
a Bin(n, λt/n) distribution, we anticipate that the variance will be
lim_{n→∞} Var(Nt) = lim_{n→∞} n · (λt/n) · (1 − λt/n) = λt.
¹ This is really not more than a hint: there are simple examples where the distribu-
tions of random variables converge to a distribution whose expectation is different
from the limit of the expectations of the distributions! (cf. Exercise 12.14).
Actually, the expectation of a Poisson random variable X with parameter µ
is easy to compute:
E[X] = Σ_{k=0}^{∞} k (µ^k/k!) e^{−µ} = e^{−µ} Σ_{k=1}^{∞} µ^k/(k − 1)!
     = µ e^{−µ} Σ_{k=1}^{∞} µ^{k−1}/(k − 1)! = µ e^{−µ} Σ_{j=0}^{∞} µ^j/j! = µ.
In a similar way the variance can be determined (see Exercise 12.8), and we
arrive at the following rule.
The expectation and variance of a Poisson distribution.
Let X have a Poisson distribution with parameter µ; then
E[X] = µ and Var(X) = µ.
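As a quick check of this rule one can simulate many Pois(µ) realizations and compare the sample mean and sample variance with µ. A minimal sketch, assuming Python with NumPy (the seed and µ = 5 are arbitrary):

import numpy as np

rng = np.random.default_rng(seed=1)
mu = 5.0
x = rng.poisson(lam=mu, size=100_000)   # 100,000 Pois(mu) realizations

print("sample mean    ", x.mean())      # both should be close to mu = 5
print("sample variance", x.var())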
12.3 The one-dimensional Poisson process
We will derive some properties of the sequence of random points X1, X2, . . .
that we considered in the previous section. What we derived so far is that for
any interval (s, s + t] the number N((s, s + t]) of points Xi in that interval is
a random variable with a Pois(λt) distribution.
Interarrival times
The differences
Ti = Xi − Xi−1
are called interarrival times. Here we define T1 = X1, the time of the first
arrival. To determine the probability distribution of T1, we observe that the
event {T1 > t} that the first call arrives after time t is the same as the event
{Nt = 0} that no calls have been made in [0, t]. But this implies that
P(T1 ≤ t) = 1 − P(T1 > t) = 1 − P(Nt = 0) = 1 − e^{−λt}.
Therefore T1 has an exponential distribution with parameter λ.
To compute the joint distribution of T1 and T2, we consider the conditional
probability that T2 > t, given that T1 = s, and use the property that arrivals
in different intervals are independent:
P(T2 > t | T1 = s) = P(no arrivals in (s, s + t] | T1 = s)
= P(no arrivals in (s, s + t])
= P(N((s, s + t]) = 0) = e^{−λt}.
Since this answer does not depend on s, we conclude that T1 and T2 are
independent, and
P(T2 > t) = e^{−λt},
i.e., T2 also has an exponential distribution with parameter λ. Actually, al-
though the conclusion is correct, the method to derive it is not, because we
conditioned on the event {T1 = s}, which has zero probability. This problem
could be circumvented by conditioning on the event that T1 lies in some small
interval, but that will not be done here. Analogously, one can show that the Ti
are independent and have an Exp(λ) distribution. This nice property allows
us to give a simple definition of the one-dimensional Poisson process.
Definition. The one-dimensional Poisson process with intensity λ
is a sequence X1, X2, X3, . . . of random variables having the property
that the interarrival times X1, X2 −X1, X3 −X2, . . . are independent
random variables, each with an Exp(λ) distribution.
Note that the connection with Nt is as follows: Nt is equal to the number of
Xi that are smaller than (or equal to) t.
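The definition translates directly into a simulation recipe: keep adding independent Exp(λ) interarrival times until time t is passed. The sketch below is an illustration in Python with NumPy (not part of the text); it returns the points in [0, t], so that Nt is simply their number.

import numpy as np

def poisson_process_1d(lam, t, rng):
    # Generate arrival times X1, X2, ... in [0, t] by summing
    # independent Exp(lam) interarrival times.
    arrivals = []
    s = rng.exponential(scale=1 / lam)       # T1 = X1
    while s <= t:
        arrivals.append(s)
        s += rng.exponential(scale=1 / lam)  # add the next interarrival time
    return arrivals

rng = np.random.default_rng(seed=2)
points = poisson_process_1d(lam=0.5, t=20.0, rng=rng)
print("N_t =", len(points))                  # Pois(lam * t) = Pois(10) distributed
print(points[:5])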
Quick exercise 12.2 We model the arrivals of email messages at a server as
a Poisson process. Suppose that on average 330 messages arrive per minute.
What would you choose for the intensity λ in messages per second? What is
the expectation of the interarrival time?
An obvious question is: what is the distribution of Xi? This has already been
answered in Chapter 11: since Xi is a sum of i independent exponentially
distributed random variables, we have the following.
The points of the Poisson process. For i = 1, 2, . . . the random
variable Xi has a Gam(i, λ) distribution.
The distribution of points
Another interesting question is: if we know that n points are generated in an
interval, where do these points lie? Since the distribution of the number of
points only depends on the length of the interval, and not on its location, it
suffices to determine this for an interval starting at 0. Let this interval be [0, a].
We start with the simplest case, where there is one point in [0, a]: suppose
that N([0, a]) = 1. Then, for 0 < s < a:
P(X1 ≤ s | N([0, a]) = 1) = P(X1 ≤ s, N([0, a]) = 1) / P(N([0, a]) = 1)
= P(N([0, s]) = 1, N((s, a]) = 0) / P(N([0, a]) = 1)
= (λs e^{−λs} · e^{−λ(a−s)}) / (λa e^{−λa}) = s/a.
We find that conditional on the event {N([0, a]) = 1}, the random variable
X1 is uniformly distributed over the interval [0, a].
Now suppose that it is given that there are two points in [0, a]: N([0, a]) =
2. In a way similar to what we did for one point, we can show that (see
Exercise 12.12)
P(X1 ≤ s, X2 ≤ t | N([0, a]) = 2) = (t² − (t − s)²)/a².
Now recall the result of Exercise 9.17: if U1 and U2 are two independent
random variables, both uniformly distributed over [0, a], then the joint distri-
bution function of V = min(U1, U2) and Z = max(U1, U2) is given by
P(V ≤ s, Z ≤ t) = (t² − (t − s)²)/a²   for 0 ≤ s ≤ t ≤ a.
Thus we have found that, if we forget about their order, the two points in
[0, a] are independent and uniformly distributed over [0, a]. With somewhat
more work, this generalizes to an arbitrary number of points, and we arrive
at the following property.
Location of the points, given their number. Given that
the Poisson process has n points in the interval [a, b], the locations
of these points are independently distributed, each with a uniform
distribution on [a, b].
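This property also gives the usual two-step recipe for simulating the points of a Poisson process on an interval: first draw the number of points from a Pois(λ(b − a)) distribution, then place that many independent U(a, b) points. A hedged sketch in Python with NumPy (parameter values are only illustrative):

import numpy as np

def poisson_points_on_interval(lam, a, b, rng):
    # Step 1: the number of points in [a, b] has a Pois(lam * (b - a)) distribution.
    n = rng.poisson(lam * (b - a))
    # Step 2: given the number, the locations are i.i.d. uniform on [a, b].
    return np.sort(rng.uniform(a, b, size=n))

rng = np.random.default_rng(seed=3)
print(poisson_points_on_interval(lam=2.0, a=0.0, b=5.0, rng=rng))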
12.4 Higher-dimensional Poisson processes
Our definition of the one-dimensional Poisson process, starting with the in-
terarrival times, does not generalize easily, because it is based on the ordering
of the real numbers. However, we can easily extend the assumptions of inde-
pendence, homogeneity, and the Poisson distribution property. To do this we
need a higher-dimensional version of the concept of length. We denote the k-
dimensional volume of a set A in k-dimensional space by m(A). For instance,
in the plane m(A) is the area of A, and in space m(A) is the volume of A.
Definition. The k-dimensional Poisson process with intensity λ
is a collection X1, X2, X3, . . . of random points having the property
that if N(A) denotes the number of points in the set A, then
1. (Homogeneity) The random variable N(A) has a Poisson distri-
bution with parameter λm(A).
2. (Independence) For disjoint sets A1, A2, . . . , An the random vari-
ables N(A1), N(A2), . . . , N(An) are independent.
Quick exercise 12.3 Suppose that the locations of defects in a certain type of
material follow the two-dimensional Poisson process model. For this material
it is known that it contains on average five defects per square meter. What is
the probability that a strip of length 2 meters and width 5 cm will be without
defects?
In Figure 7.4 the locations of the buildings the architect wanted to distribute
over a 100-by-300-m terrain have been generated by a two-dimensional Poisson
process. This has been done in the following way. One can again show that
given the total number of points in a set, these points are uniformly distributed
over the set. This leads to the following procedure: first one generates a value
n from a Poisson distribution with the appropriate parameter (λ times the
area), then one generates n times a point uniformly distributed over the 100-
by-300 rectangle.
Actually one can generate a higher-dimensional Poisson process in a way that
is very similar to the natural way this can be done for the one-dimensional
process. Directly from the definition of the one-dimensional process we see
that it can be obtained by consecutively generating points with exponentially
distributed gaps. We will explain a similar procedure for dimension two. For
s > 0, let
Ms = N(Cs),
where Cs is the circular region of radius s, centered at the origin. Since Cs
has area πs², Ms has a Poisson distribution with parameter λπs². Let Ri
denote the distance of the ith closest point to the origin. This is illustrated
in Figure 12.2.
Note that Ri is the analogue of the ith arrival time for the one-dimensional
Poisson process: we have in fact that
Ri ≤ s if and only if Ms ≥ i.
In particular, with i = 1 and s = √t,
P(R1² ≤ t) = P(R1 ≤ √t) = P(M_{√t} > 0) = 1 − e^{−λπt}.
In other words: R1² is Exp(λπ) distributed. For general i, we can similarly
write
P(Ri² ≤ t) = P(Ri ≤ √t) = P(M_{√t} ≥ i).
Fig. 12.2. The Poisson process in the plane, with the two circles of the two points
closest to the origin.
So
P(Ri² ≤ t) = 1 − e^{−λπt} Σ_{j=0}^{i−1} (λπt)^j/j!,
which means that Ri² has a Gam(i, λπ) distribution—as we saw on page 157.
Since gamma distributions arise as sums of independent exponential distribu-
tions, we can also write
Ri² = R_{i−1}² + Ti,
where the Ti are independent Exp(λπ) random variables (and where R0 = 0).
Note that this is quite similar to the one-dimensional case. To simulate the
two-dimensional Poisson process from a sequence U1, U2, . . . of independent
U(0, 1) random variables, one can therefore proceed as follows (recall from
Section 6.2 that −(1/λ) ln(Ui) has an Exp(λ) distribution): for i = 1, 2, . . .
put
Ri = √( R_{i−1}² − (1/(λπ)) ln(U2i) );
this gives the distance of the ith point to the origin, and then put the point
on this circle according to an angle value generated by 2πU2i−1. This is the
correct way to do it, because one can show that in polar coordinates the radius
and the angle of a Poisson process point are independent of each other, and
the angle is uniformly distributed over [0, 2π]. The latter is called the isotropy
property of the Poisson process.
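The procedure just described takes only a few lines of code. The sketch below (Python with NumPy; the intensity λ = 5 and the radius 1 are chosen only as an example) generates the squared distances by adding Exp(λπ) steps and attaches a uniform angle to each point.

import numpy as np

def poisson_process_2d(lam, r_max, rng):
    # Squared distances R_i^2 grow by independent Exp(lam*pi) steps;
    # each point gets an independent angle, uniform on [0, 2*pi).
    points = []
    r_sq = 0.0
    while True:
        r_sq += rng.exponential(scale=1 / (lam * np.pi))  # T_i ~ Exp(lam*pi)
        r = np.sqrt(r_sq)
        if r > r_max:
            return np.array(points)
        phi = rng.uniform(0.0, 2 * np.pi)
        points.append((r * np.cos(phi), r * np.sin(phi)))

rng = np.random.default_rng(seed=4)
pts = poisson_process_2d(lam=5.0, r_max=1.0, rng=rng)
print(len(pts), "points; expected about", 5.0 * np.pi * 1.0**2)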
12.5 Solutions to the quick exercises
12.1 The probability of exactly one call in [0, s) and no calls in [s, 2s] equals
P(N([0, s)) = 1, N([s, 2s]) = 0) = P(N([0, s)) = 1) P(N([s, 2s]) = 0)
= P(N([0, s)) = 1) P(N([0, s]) = 0)
= λs e^{−λs} · e^{−λs},
because of independence and homogeneity. In the same way, the probability
of exactly one call in [s, 2s] and no calls in [0, s) is equal to e^{−λs} · λs e^{−λs}. And
indeed: λs e^{−λs} · e^{−λs} + e^{−λs} · λs e^{−λs} = 2λs e^{−λ·2s}.
12.2 Because there are 60 seconds in a minute, we have 60λ = 330. It follows
that λ = 5.5. Since the interarrival times have an Exp(λ) distribution, the
expected time between messages is 1/λ = 0.18 second.
12.3 The intensity of this process is λ = 5 per m². The area of the strip is
2 · (1/20) = 1/10 m². Hence the probability that no defects occur in the strip
is e^{−λ·(area of strip)} = e^{−5·(1/10)} = e^{−1/2} ≈ 0.61.
12.6 Exercises
12.1  In each of the following examples, try to indicate whether the Poisson
process would be a good model.
a. The times of bankruptcy of enterprises in the United States.
b. The times a chicken lays its eggs.
c. The times of airplane crashes in a worldwide registration.
d. The locations of wrongly spelled words in a book.
e. The times of traffic accidents at a crossroad.
12.2 The number of customers that visit a bank on a day is modeled by a
Poisson distribution. It is known that the probability of no customers at all
is 0.00001. What is the expected number of customers?
12.3 Let N have a Pois(4) distribution. What is P(N = 4)?
12.4 Let X have a Pois(2) distribution. What is P(X ≤ 1)?
12.5  The number of errors on a hard disk is modeled as a Poisson random
variable with expectation one error in every Mb, that is, in every 2²⁰ bytes.
a. What is the probability of at least one error in a sector of 512 bytes?
b. The hard disk is an 18.62-Gb disk drive with 39 054 015 sectors. What is
the probability of at least one error on the hard disk?
12.6  A certain brand of copper wire has flaws about every 40 centimeters.
Model the locations of the flaws as a Poisson process. What is the probability
of two flaws in 1 meter of wire?
12.7  The Poisson model is sometimes used to study the flow of traffic ([15]).
If the traffic can flow freely, it behaves like a Poisson process. A 20-minute
time interval is divided into 10-second time slots. At a certain point along the
highway the number of passing cars is registered for each 10-second time slot.
Let nj be the number of slots in which j cars have passed for j = 0, . . . , 9.
Suppose that one finds
j 0 1 2 3 4 5 6 7 8 9
nj 19 38 28 20 7 3 4 0 0 1
Note that the total number of cars passing in these 20 minutes is 230.
a. What would you choose for the intensity parameter λ?
b. Suppose one estimates the probability of 0 cars passing in a 10-second
time slot by n0 divided by the total number of time slots. Does that
(reasonably) agree with the value that follows from your answer in a?
c. What would you take for the probability that 10 cars pass in a 10-second
time slot?
12.8  Let X be a Poisson random variable with parameter µ.
a. Compute E[X(X − 1)].
b. Compute Var(X), using that Var(X) = E[X(X − 1)] + E[X] − (E[X])².
12.9 Let Y1 and Y2 be independent Poisson random variables with parameter
µ1, respectively µ2. Show that Y = Y1 + Y2 also has a Poisson distribution.
Instead of using the addition rule in Section 11.1 as in Exercise 11.2, you
can prove this without doing any computations by considering the number
of points of a Poisson process (with intensity 1) in two disjoint intervals of
length µ1 and µ2.
12.10 Let X be a random variable with a Pois(µ) distribution. Show the
following. If µ < 1, then the probabilities P(X = k) are strictly decreasing
in k. If µ > 1, then the probabilities P(X = k) are first increasing, then
decreasing (cf. Figure 12.1). What happens if µ = 1?
12.11  Consider the one-dimensional Poisson process with intensity λ. Show
that the number of points in [0, t], given that the number of points in [0, 2t]
is equal to n, has a Bin(n, 1
2 ) distribution.
Hint: write the event {N([0, s]) = k, N([0, 2s]) = n} as the intersection of the
(independent!) events {N([0, s]) = k} and {N((s, 2s]) = n − k}.
12.12 We consider the one-dimensional Poisson process. Suppose for some
a  0 it is given that there are exactly two points in [0, a], or in other words:
Na = 2. The goal of this exercise is to determine the joint distribution of X1
and X2, the locations of the two points, conditional on Na = 2.
a. Prove that for 0 < s < t < a
P(X1 ≤ s, X2 ≤ t, Na = 2)
= P(X2 ≤ t, Na = 2) − P(X1 > s, X2 ≤ t, Na = 2).
b. Deduce from a that
P(X1 ≤ s, X2 ≤ t, Na = 2) = e^{−λa} ( λ²t²/2! − λ²(t − s)²/2! ).
c. Deduce from b that for 0 < s < t < a
P(X1 ≤ s, X2 ≤ t | Na = 2) = (t² − (t − s)²)/a².
12.13 Walking through a meadow we encounter two kinds of flowers, daisies
and dandelions. As we walk in a straight line, we model the positions of the
flowers we encounter with a one-dimensional Poisson process with intensity λ.
It appears that about one in every four flowers is a daisy. Forgetting about
the dandelions, what does the process of the daisies look like? This question
will be answered with the following steps.
a. Let Nt be the total number of flowers, Xt the number of daisies, and Yt
be the number of dandelions we encounter during the first t minutes of
our walk. Note that Xt + Yt = Nt. Suppose that each flower is a daisy
with probability 1/4, independent of the other flowers. Argue that
P(Xt = n, Yt = m | Nt = n + m) = (n + m choose n) (1/4)^n (3/4)^m.
b. Show that
P(Xt = n, Yt = m) = (1/n!) (1/m!) (1/4)^n (3/4)^m e^{−λt} (λt)^{n+m},
by conditioning on Nt and using a.
c. By writing e^{−λt} = e^{−(λ/4)t} e^{−(3λ/4)t} and summing over m, show that
P(Xt = n) = (1/n!) e^{−(λ/4)t} (λt/4)^n.
Since it is clear that the numbers of daisies that we encounter in disjoint time
intervals are independent, we may conclude from c that the process (Xt) is
again a Poisson process, with intensity λ/4. One often says that the process
(Xt) is obtained by thinning the process (Nt). In our example this corresponds
to picking all the dandelions.
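A short simulation makes the thinning property concrete: keep each point of a Poisson process independently with probability 1/4 and look at the number of kept points. A sketch in Python with NumPy (λ = 2, t = 10 and the number of runs are arbitrary choices):

import numpy as np

rng = np.random.default_rng(seed=5)
lam, t, runs = 2.0, 10.0, 20_000

daisy_counts = []
for _ in range(runs):
    n_flowers = rng.poisson(lam * t)          # all flowers in [0, t]
    keep = rng.random(n_flowers) < 0.25       # each one is a daisy with probability 1/4
    daisy_counts.append(keep.sum())

daisy_counts = np.array(daisy_counts)
print("mean number of daisies  ", daisy_counts.mean())   # about (lam/4)*t = 5
print("variance of that number ", daisy_counts.var())    # also about 5, as for Pois(5)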
12.14  In this exercise we look at a simple example of random variables Xn
that have the property that their distributions converge to the distribution of
a random variable X as n → ∞, while it is not true that their expectations
converge to the expectation of X. Let for n = 1, 2, . . . the random variables
Xn be defined by
P(Xn = 0) = 1 − 1/n   and   P(Xn = 7n) = 1/n.
a. Let X be the random variable that is equal to 0 with probability 1. Show
that for all a the probability mass functions pXn (a) of the Xn converge to
the probability mass function pX(a) of X as n → ∞. Note that E[X]=0.
b. Show that nonetheless E[Xn] = 7 for all n.
13
The law of large numbers
For many experiments and observations concerning natural phenomena—such
as measuring the speed of light—one finds that performing the procedure twice
under (what seem) identical conditions results in two different outcomes. Un-
controllable factors cause “random” variation. In practice one tries to over-
come this as follows: the experiment is repeated a number of times and the
results are averaged in some way. In this chapter we will see why this works so
well, using a model for repeated measurements. We view them as a sequence
of independent random variables, each with the same unknown distribution.
It is a probabilistic fact that from such a sequence—in principle—any feature
of the distribution can be recovered. This is a consequence of the law of large
numbers.
13.1 Averages vary less
Scientists and engineers involved in experimental work have known for cen-
turies that more accurate answers are obtained when measurements or ex-
periments are repeated a number of times and one averages the individual
outcomes.¹
¹ We leave the problem of systematic errors aside but will return to it in Chapter 19.
For example, if you read a description of A.A. Michelson’s work
done in 1879 to determine the speed of light, you would find that for each
value he collected, repeated measurements at several levels were performed.
In an article in Statistical Science describing his work ([18]), R.J. MacKay
and R.W. Oldford state: “It is clear that Michelson appreciated the power
of averaging to reduce variability in measurement.” We shall see that we can
understand this reduction using only what we have learned so far about prob-
ability in combination with a simple inequality called Chebyshev’s inequality.
Throughout this chapter we consider a sequence of random variables X1, X2,
X3, . . . . You should think of Xi as the result of the ith repetition of a partic-
ular measurement or experiment. We confine ourselves to the situation where
experimental conditions of subsequent experiments are identical, and the out-
come of any one experiment does not influence the outcomes of others. Under
those circumstances, the random variables of the sequence are independent,
and all have the same distribution, and we therefore call X1, X2, X3, . . . an
independent and identically distributed sequence. We shall denote the distri-
bution function of each random variable Xi by F, its expectation by µ, and
the standard deviation by σ.
The average of the first n random variables in the sequence is
X̄n = (X1 + X2 + · · · + Xn)/n,
and using linearity of expectations we find:
E[X̄n] = (1/n) E[X1 + X2 + · · · + Xn] = (1/n)(µ + µ + · · · + µ) = µ.
By the variance-of-the-sum rule, using the independence of X1, . . . , Xn,
Var(X̄n) = (1/n²) Var(X1 + X2 + · · · + Xn) = (1/n²)(σ² + σ² + · · · + σ²) = σ²/n.
This establishes the following rule.
Expectation and variance of an average. If X̄n is the average
of n independent random variables with the same expectation µ and
variance σ², then
E[X̄n] = µ   and   Var(X̄n) = σ²/n.
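A simulation check of this rule is straightforward; the sketch below (Python with NumPy, not from the text) uses Gam(2, 1) variables, for which µ = 2 and σ² = 2, and compares the sample variance of X̄n with σ²/n.

import numpy as np

rng = np.random.default_rng(seed=11)
sigma2 = 2.0                                   # variance of a single Gam(2,1) variable

for n in (1, 4, 16, 100):
    # 10,000 independent realizations of the average of n Gam(2,1) variables
    samples = rng.gamma(shape=2.0, scale=1.0, size=(10_000, n)).mean(axis=1)
    print(n, round(samples.mean(), 3), round(samples.var(), 3), "vs", sigma2 / n)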
The expectation of X̄n is again µ, and its standard deviation is less than that
of a single Xi by a factor √n; the “typical distance” from µ is √n smaller.
The latter property is what Michelson used to gain accuracy. To illustrate
this, we analyze an example.
Suppose the random variables X1, X2, . . . are continuous with a Gam(2, 1)
distribution, so with probability density:
f(x) = x e^{−x}   for x ≥ 0.
Recall from Section 11.2 that this means that each Xi is distributed as the
sum of two independent Exp(1) random variables. Hence, Sn = X1 +· · ·+Xn
is distributed as the sum of 2n independent Exp(1) random variables, which
has a Gam(2n, 1) distribution, with probability density
fSn(x) = x^{2n−1} e^{−x}/(2n − 1)!   for x ≥ 0.
Because X̄n = Sn/n, we find by applying the change-of-units rule (page 106):
fX̄n(x) = n fSn(nx) = n (nx)^{2n−1} e^{−nx}/(2n − 1)!   for x ≥ 0.
This is the probability density of the Gam(2n, n) distribution.
So we have determined the distribution of X̄n explicitly and we can investigate
what happens as n increases, for example, by plotting probability densities.
In the left-hand column of Figure 13.1 you see plots of fX̄n
for n = 1, 2, 4, 9,
16, and 400 (note that for n = 1 this is just f itself). For comparison, we take
as a second example a so-called bimodal density function: a density with two
bumps, formally called modes. For the same values of n we determined the
probability density function of X̄n (unlike the previous example, we are not
concerned with the computations, just with the results). The graphs of these
densities are given side by side with the gamma densities in Figure 13.1.
The graphs clearly show that, as n increases, there is “contraction” of the
probability mass near the expected value µ (for the gamma densities this is 2,
for the bimodal densities 2.625).
Quick exercise 13.1 Compare the probabilities that X̄n is within 0.5 of its
expected value for n = 1, 4, 16, and 400. Do this for the gamma case only
by estimating the probabilities from the graphs in the left-hand column of
Figure 13.1.
13.2 Chebyshev’s inequality
The contraction of probability mass near the expectation is a consequence of
the fact that, for any probability distribution, most probability mass is within
a few standard deviations from the expectation. To show this we will employ
the following tool, which provides a bound for the probability that the random
variable Y is outside the interval (E[Y ] − a, E[Y ] + a).
Chebyshev’s inequality. For an arbitrary random variable Y
and any a > 0:
P(|Y − E[Y ]| ≥ a) ≤ (1/a²) Var(Y).
We shall derive this inequality for continuous Y (the discrete case is similar).
Let fY be the probability density function of Y . Let µ denote E[Y ]. Then:
Var(Y ) = ∫_{−∞}^{∞} (y − µ)² fY (y) dy ≥ ∫_{|y−µ|≥a} (y − µ)² fY (y) dy
        ≥ ∫_{|y−µ|≥a} a² fY (y) dy = a² P(|Y − µ| ≥ a).
Fig. 13.1. Densities of averages. Left column: from a gamma density; right column:
from a bimodal density.
Dividing both sides of the resulting inequality by a², we obtain Chebyshev’s
inequality.
Denote Var(Y ) by σ² and consider the probability that Y is within a few
standard deviations from its expectation µ:
P(|Y − µ| < kσ) = 1 − P(|Y − µ| ≥ kσ),
where k is a small integer. Setting a = kσ in Chebyshev’s inequality, we find
P(|Y − µ| < kσ) ≥ 1 − Var(Y )/(k²σ²) = 1 − 1/k².    (13.1)
For k = 2, 3, 4 the right-hand side is 3/4, 8/9, and 15/16, respectively. This
suggests that with Chebyshev’s inequality we can make very strong state-
ments. For most distributions, however, the actual value of P(|Y − µ| < kσ)
is even higher than the lower bound (13.1). We summarize this as a somewhat
loose rule.
The “µ ± a few σ” rule. Most of the probability mass of a
random variable is within a few standard deviations from its expec-
tation.
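To see how conservative Chebyshev's bound usually is, one can compare 1 − 1/k² with the actual probability for a concrete distribution. A simulation sketch in Python with NumPy (the Gam(2, 1) choice here is only an illustration and is not the distribution of the quick exercise below):

import numpy as np

rng = np.random.default_rng(seed=6)
y = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)   # Gam(2,1): mu = 2, sigma^2 = 2
mu, sigma = 2.0, np.sqrt(2.0)

for k in (1, 2, 3, 4):
    actual = np.mean(np.abs(y - mu) < k * sigma)      # estimate of P(|Y - mu| < k sigma)
    bound = 1 - 1 / k**2                              # Chebyshev lower bound
    print(k, round(actual, 3), ">=", bound)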
Quick exercise 13.2 Calculate P(|Y − µ| < kσ) exactly for k = 1, 2, 3, 4
when Y has an Exp(1) distribution and compare this with the bounds from
Chebyshev’s inequality.
13.3 The law of large numbers
We return to the independent and identically distributed sequence of ran-
dom variables X1, X2, . . . with expectation µ and variance σ². We apply
Chebyshev’s inequality to the average X̄n, where we use E[X̄n] = µ and
Var(X̄n) = σ²/n, and where ε > 0:
P(|X̄n − µ| > ε) = P(|X̄n − E[X̄n]| > ε) ≤ (1/ε²) Var(X̄n) = σ²/(nε²).
The right-hand side vanishes as n goes to infinity, no matter how small ε is.
This proves the following law.
The law of large numbers. If X̄n is the average of n independent
random variables with expectation µ and variance σ², then for any
ε > 0:
lim_{n→∞} P(|X̄n − µ| > ε) = 0.
A connection with experimental work
Let us try to interpret the law of large numbers from an experimenter’s per-
spective. Imagine you conduct a series of experiments. The experimental setup
is complicated and your measurements vary quite a bit around the “true” value
you are after. Suppose (unknown to you) your measurements have a gamma
distribution, and its expectation is what you want to determine. You decide
to do a certain number of measurements, say n, and to use their average as
your estimate of the expectation.
We can simulate all this, and Figure 13.2 shows the results of a simulation,
where we chose the same Gam(2, 1) distribution, i.e., with expectation µ = 2.
We anticipated that you might want to do as many as 500 measurements, so
we generated realizations for X1, X2, . . . , X500. For each n we computed the
average of the first n values and plotted these averages against n in Figure 13.2.
Fig. 13.2. Averages of realizations of a sequence of gamma distributed random
variables.
If your decision is to do 200 repetitions, you would find (in this simulation) a
value of about 2.09 (slightly too high, but you wouldn’t know!), whereas with
n = 400 you would be almost exactly correct with 1.99, and with n = 500
again a little farther away with 2.06. For another sequence of realizations, the
details in the pattern that you see in Figure 13.2 would be different, but the
general dampening of the oscillations would still be present. This follows from
what we saw earlier, that as n is larger, the probability for the average to be
within a certain distance of the expectation increases, in the limit even to 1.
In practice it may happen that with a large number of repetitions your average
is farther from the “true” value than with a smaller number of repetitions—if
it is, then you had bad luck, because the odds are in your favor.
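A simulation in the spirit of Figure 13.2 takes only a few lines; the following sketch (Python with NumPy; the seed is arbitrary) prints the running averages of 500 Gam(2, 1) realizations at a few values of n.

import numpy as np

rng = np.random.default_rng(seed=7)
x = rng.gamma(shape=2.0, scale=1.0, size=500)      # 500 Gam(2,1) "measurements", mu = 2
running_avg = np.cumsum(x) / np.arange(1, 501)     # X-bar_n for n = 1, ..., 500

for n in (10, 50, 200, 400, 500):
    print(n, round(running_avg[n - 1], 3))         # should settle down near 2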
The averages may fail to converge
The law of large numbers is valid if the expectation of the distribution F is
finite. This is not always the case. For example, the Cauchy and some Pareto
distributions have heavy tails: their probability densities do go to 0 as x
becomes large, but (too) slowly.²
² They represent two separate cases: the Cauchy expectation does not exist (see
Remark 7.1) and the Par(α)’s expectation is +∞ if α ≤ 1 (see Section 7.2).
On the left in Figure 13.3 you see the result
of a simulation with Cau(2, 1) random variables. As in the gamma case, the
averages tend to go toward 2 (which is the point of symmetry of the Cau(2, 1)
density), but once in a while a very large (positive or negative) realization of
an Xi throws off the average.
Fig. 13.3. Averages of realizations of a sequence of Cauchy (at left) and Pareto (at
right) distributed random variables.
On the right in Figure 13.3 the result of a simulation with a Par(0.99) distri-
bution is shown. Its expectation is infinite. In the plot we see segments where
the average “drifts downward,” separated by upward jumps, which correspond
to Xi with extremely large values. The effect of the jumps dominates: it can
be shown that X̄n grows beyond any level.
You might think that these patterns are phenomena that occur because of
the short length of the simulation and that in longer simulations they would
disappear after some value of n. However, the patterns as described will con-
tinue to occur and the results of a longer simulation, say to n = 5000, would
not look any “better.”
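The contrast with the heavy-tailed case can be reproduced by replacing the gamma draws with Cauchy draws; a sketch (Python with NumPy, where Cau(2, 1) realizations are obtained by shifting standard Cauchy realizations by 2):

import numpy as np

rng = np.random.default_rng(seed=8)
x = 2.0 + rng.standard_cauchy(size=5000)          # Cau(2,1) realizations
running_avg = np.cumsum(x) / np.arange(1, 5001)   # running averages X-bar_n

for n in (100, 500, 1000, 2500, 5000):
    print(n, round(running_avg[n - 1], 3))        # keeps getting thrown off by outliers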
Remark 13.1 (There is a stronger law of large numbers). Even
though it is a strong statement, the law of large numbers in this paragraph
is more accurately known as the weak law of large numbers. A stronger
result holds, the strong law of large numbers, which says that:
P( lim_{n→∞} X̄n = µ ) = 1.
This is also expressed as “as n goes to infinity, X̄n converges to µ with
probability 1.” It is not easy to see, but it is true that the strong law is
actually stronger. The conditions for the law of large numbers, as stated
in this section, could be relaxed. They suffice for both versions of the law.
The conditions can be weakened to a point where the weak law still follows
from them, but the strong law does not anymore; the strong law requires
the stronger conditions.
13.4 Consequences of the law of large numbers
We continue with the sequence X1, X2, . . . of independent random variables
with distribution function F. In the previous section we saw how we could
recover the (unknown) expectation µ from a realization of the sequence. We
shall see that in fact we can recover any feature of the probability distribu-
tion. In order to avoid unnecessary indices, as in E[X1] and P(X1 ∈ C), we
introduce an additional random variable X that also has F as its distribution
function.
Recovering the probability of an event
Suppose that, rather than being interested in µ = E[X], we want to know the
probability of an event, for example,
p = P(X ∈ C) , where C = (a, b] for some a  b.
If you do not know this probability p, you would probably estimate it from
how often the event {Xi ∈ C} occurs in the sequence. You would use the
relative frequency of Xi ∈ C among X1, . . . , Xn: the number of times the
set C was hit divided by n. Define for each i:
Yi = 1 if Xi ∈ C,   0 if Xi ∉ C.
The random variable Yi indicates whether the corresponding Xi hits the set C;
it is called an indicator random variable. In general, an indicator random
variable for an event A is a random variable that is 1 when A occurs and 0
when Aᶜ occurs. Using this terminology, Yi is the indicator random variable
of the event {Xi ∈ C}. Its expectation is given by

    E[Yi] = 1 · P(Xi ∈ C) + 0 · P(Xi ∉ C) = P(Xi ∈ C) = P(X ∈ C) = p.
Using the Yi, the relative frequency is expressed as (Y1 +Y2 +· · ·+Yn)/n = Ȳn.
Note that the random variables Y1, Y2, . . . are independent; the Xi form an in-
dependent sequence, and Yi is determined from Xi only (this is an application
of the rule about propagation of independence; see page 126).
The law of large numbers, with p in the role of µ, can now be applied to Ȳn;
it is the average of n independent random variables with expectation p and
variance p(1 − p), so
    lim_{n→∞} P( |Ȳn − p| > ε ) = 0    (13.2)

for any ε > 0. By reasoning along the same lines as in the previous section, we
see that from a long sequence of realizations we can get an accurate estimate
of the probability p.
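As an illustration (an addition, not taken from the book), the sketch below estimates p = P(X ∈ (lo, hi]) for Gam(2, 1) samples by the relative frequency of the indicator variables; the interval, sample size, and seed are arbitrary choices, and SciPy is used only to compute the exact value for comparison.

    # Sketch (not from the book): estimate p = P(X in (lo, hi]) by the relative
    # frequency of indicator variables, for Gam(2, 1) samples.
    import numpy as np
    from scipy.stats import gamma

    rng = np.random.default_rng(seed=2)
    n = 10_000
    lo, hi = 1.0, 3.0                      # the set C = (lo, hi]

    x = rng.gamma(shape=2.0, scale=1.0, size=n)
    y = (x > lo) & (x <= hi)               # indicator variables Y_i of {X_i in C}
    relative_frequency = y.mean()          # the average of the Y_i

    exact_p = gamma.cdf(hi, a=2) - gamma.cdf(lo, a=2)
    print(f"relative frequency {relative_frequency:.4f}, exact p {exact_p:.4f}")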
Recovering the probability density function
Consider the continuous case, where f is the probability density function
corresponding with F, and now choose C = (a − h, a + h], for some (small)
positive h. By equation (13.2), for large n:
    Ȳn ≈ p = P(X ∈ C) = ∫_{a−h}^{a+h} f(x) dx ≈ 2h f(a).    (13.3)

This relationship suggests estimating the probability density at a as follows:

    f(a) ≈ Ȳn / (2h) = (the number of times Xi ∈ C for i ≤ n) / (n · the length of C).
In Figure 13.4 we have done so for h = 0.25 and two values of a: 2 and 4.
Rather than plotting the estimate in just one point, we use the same value
for the whole interval (a − h, a + h]. This results in a vertical bar, whose area
corresponds to Ȳn:
    height · width = (Ȳn / (2h)) · 2h = Ȳn.
These estimates are based on the realizations of 500 independent Gam(2, 1)
distributed random variables. In order to be able to see how well things came
Fig. 13.4. Estimating the density at two points.
out, the Gam(2, 1) density function is shown as well; near a = 2 the estimate
is very accurate, but around a = 4 it is a little too low.
There really is no reason to derive estimated values around just a few points,
as is done in Figure 13.4. We might as well cover the whole x-axis with a grid
(with grid size 2h) and do the computation for each point in the grid, thus
covering the axis with a series of bars. The resulting bar graph is called a
histogram. Figure 13.5 shows the result for two sets of realizations.
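A minimal sketch of this recipe (an addition, not part of the book) is given below. It computes the bar-height estimate Ȳn/(2h) from 500 Gam(2, 1) realizations at a = 2 and a = 4 with h = 0.25, as in Figure 13.4; since the seed is an arbitrary choice, the estimates will differ somewhat from the values read off the figure.

    # Sketch (not from the book): the density estimate f(a) ~ Ybar_n / (2h).
    import numpy as np

    rng = np.random.default_rng(seed=3)
    x = rng.gamma(shape=2.0, scale=1.0, size=500)
    h = 0.25

    def density_estimate(sample, a, h):
        # Relative frequency of (a-h, a+h] divided by the interval length 2h.
        in_interval = (sample > a - h) & (sample <= a + h)
        return in_interval.mean() / (2 * h)

    for a in (2.0, 4.0):
        estimate = density_estimate(x, a, h)
        true_density = a * np.exp(-a)      # the Gam(2, 1) density x*exp(-x) at x = a
        print(f"a = {a}: estimate {estimate:.3f}, true density {true_density:.3f}")

Covering a whole grid of points a with such bars is exactly the histogram construction shown in Figure 13.5.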
Fig. 13.5. Recovering the density function by way of histograms.
The top graph is constructed from the same realizations as Figure 13.4 and
the bottom graph is constructed from a new set of realizations. Both graphs
match the general shape of the density, with some bumps and valleys that are
particular for the corresponding set of realizations. In Chapters 15 and 17 we
shall return to histograms and treat them more elaborately.
Quick exercise 13.3 The height of the bar at x = 2 in the first histogram
is 0.26. How many of the 500 realizations were between 1.75 and 2.25?
13.5 Solutions to the quick exercises
13.1 The answers you have found should be in the neighborhood of the fol-
lowing exact values:
    n                        1      4      16     400
    P(|X̄n − µ| < 0.5)      0.27   0.52   0.85   1.00
13.2 Because Y has an Exp(1) distribution, µ = 1 and Var(Y) = σ² = 1; we
find for k ≥ 1:

    P(|Y − µ| < kσ) = P(|Y − 1| < k)
                    = P(1 − k < Y < k + 1) = P(Y < k + 1) = 1 − e^{−(k+1)}.

Using this formula and (13.1) we obtain the following numbers:

    k                              1      2      3      4
    Lower bound from Chebyshev     0      0.750  0.889  0.938
    P(|Y − 1| < k)                 0.865  0.950  0.982  0.993
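These numbers are easy to check; the small sketch below (an addition, not part of the book) recomputes both rows of the table using SciPy's Exp(1) distribution, which has mean 1 and standard deviation 1 in its default parametrization.

    # Sketch (not from the book): recompute the table for the Exp(1) distribution.
    from scipy.stats import expon

    for k in (1, 2, 3, 4):
        chebyshev_bound = 1 - 1 / k**2                # lower bound from Chebyshev
        exact = expon.cdf(1 + k) - expon.cdf(1 - k)   # P(|Y - 1| < k)
        print(f"k = {k}: bound {chebyshev_bound:.3f}, exact {exact:.3f}")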
13.3 The value of Ȳn for this bar equals its area 0.26 · 0.5 = 0.13. The bar
represents 13% of the values, or 0.13 · 500 = 65 realizations.
13.6 Exercises
13.1 Verify the “µ±a few σ” rule as you did in Quick exercise 13.2 for the fol-
lowing distributions: U(−1, 1), U(−a, a), N(0, 1), N(µ, σ²), Par(3), Geo(1/2).
Construct a table as in the answer to the quick exercise and enter a line for
each distribution.
13.2 An accountant wants to simplify his bookkeeping by rounding amounts
to the nearest integer, for example, rounding €99.53 and €100.46 both to
€100. What is the cumulative effect of this if there are, say, 100 amounts? To
study this we model the rounding errors by 100 independent U(−0.5, 0.5) ran-
dom variables X1, X2, . . . , X100.
a. Compute the expectation and the variance of the Xi.
b. Use Chebyshev's inequality to compute an upper bound for the probability
P(|X1 + X2 + · · · + X100| > 10) that the cumulative rounding error X1 +
X2 + · · · + X100 exceeds €10.
13.3 Consider the situation of the previous exercise. A manager wants to
know what happens to the mean absolute error (1/n) Σ_{i=1}^{n} |Xi| as n becomes
large. What can you say about this, applying the law of large numbers?
13.4  Of the voters in Florida, a proportion p will vote for candidate G,
and a proportion 1 − p will vote for candidate B. In an election poll a number
of voters are asked for whom they will vote. Let Xi be the indicator random
variable for the event “the ith person interviewed will vote for G.” A model
for the election poll is that the people to be interviewed are selected in such
a way that the indicator random variables X1, X2,. . . are independent and
have a Ber(p) distribution.
a. Suppose we use X̄n to predict p. According to Chebyshev’s inequality, how
large should n be (how many people should be interviewed) such that the
probability that X̄n is within 0.2 of the “true” p is at least 0.9?
Hint: solve this first for p = 1/2, and use that p(1 − p) ≤ 1/4 for all
0 ≤ p ≤ 1.
b. Answer the same question, but now X̄n should be within 0.1 of p.
c. Answer the question from part a, but now the probability should be at
least 0.95.
d. If p > 1/2 candidate G wins; if X̄n > 1/2 you predict that G will win.
Find an n (as small as you can) such that the probability that you predict
correctly is at least 0.9, if in fact p = 0.6.
13.5 You are trying to determine the melting point of a new material, of
which you have a large number of samples. For each sample that you measure
you find a value close to the actual melting point c but corrupted with a
measurement error. We model this with random variables:
Mi = c + Ui
where Mi is the measured value in degrees Kelvin, and Ui is the random
measurement error. It is known that E[Ui] = 0 and Var(Ui) = 3, for each i, and that
we may consider the random variables M1, M2, . . . independent. According
to Chebyshev’s inequality, how many samples do you need to measure to be
90% sure that the average of the measurements is within half a degree of c?
13.6  The casino La bella Fortuna is for sale and you think you might want
to buy it, but you want to know how much money you are going to make. All
the present owner can tell you is that the roulette game Red or Black is played
about 1000 times a night, 365 days a year. Each time it is played you have
probability 19/37 of winning the player's bet of €1 and probability 18/37 of
having to pay the player €1.
Explain in detail why the law of large numbers can be used to determine the
income of the casino, and determine how much it is.
13.7 Let X1, X2, . . . be a sequence of independent and identically distributed
random variables with distribution function F. Define Fn as follows: for any a,

    Fn(a) = (number of Xi in (−∞, a]) / n.

Consider a fixed a and introduce the appropriate indicator random variables (as
in Section 13.4). Compute their expectation and variance and show that the
law of large numbers tells us that

    lim_{n→∞} P(|Fn(a) − F(a)| > ε) = 0.
13.8 In Section 13.4 we described how the probability density function
could be recovered from a sequence X1, X2, X3, . . . . We consider the
Gam(2, 1) probability density discussed in the main text and a histogram bar
around the point a = 2. Then f(a) = f(2) = 2e⁻² ≈ 0.27 and the estimate
for f(2) is Ȳn/(2h), with Ȳn as in (13.3).
a. Express the standard deviation of Ȳn/2h in terms of n and h.
b. Choose h = 0.25. How large should n be (according to Chebyshev’s in-
equality) so that the estimate is within 20% of the “true value”, with
probability 80%?
13.9 Let X1, X2, . . . be an independent sequence of U(−1, 1) random
variables and let Tn = (1/n) Σ_{i=1}^{n} Xi². It is claimed that for some a and any
ε > 0

    lim_{n→∞} P(|Tn − a| > ε) = 0.
a. Explain how this could be true.
b. Determine a.
13.10 Let Mn be the maximum of n independent U(0, 1) random variables.
a. Derive the exact expression for P(|Mn − 1| > ε).
Hint: see Section 8.4.
b. Show that lim_{n→∞} P(|Mn − 1| > ε) = 0. Can this be derived from Cheby-
shev's inequality or the law of large numbers?
13.11 For some t > 1, let X be a random variable taking the values 0 and t,
with probabilities

    P(X = 0) = 1 − 1/t   and   P(X = t) = 1/t.

Then E[X] = 1 and Var(X) = t − 1. Consider the probability P(|X − 1| > a).
a. Verify the following: if t = 10 and a = 8 then P(|X − 1| > a) = 1/10 and
Chebyshev's inequality gives an upper bound for this probability of 9/64.
The difference is 9/64 − 1/10 ≈ 0.04. We will say that for t = 10 the
Chebyshev gap for X at a = 8 is 0.04.
b. Compute the Chebyshev gap for t = 10 at a = 5 and at a = 10.
c. Can you find a gap smaller than 0.01, smaller than 0.001, smaller than
0.0001?
d. Do you think one could improve Chebyshev’s inequality, i.e., find an upper
bound closer to the true probabilities?
13.12 (A more general law of large numbers). Let X1, X2, . . . be a
sequence of independent random variables, with E[Xi] = µi and Var(Xi) = σi²,
for i = 1, 2, . . .. Suppose that 0 < σi² ≤ M for all i. Let a be an arbitrary
positive number.
a. Apply Chebyshev's inequality to show that

    P( |X̄n − (1/n) Σ_{i=1}^{n} µi| > a ) ≤ (Var(X1) + · · · + Var(Xn)) / (n²a²).

b. Conclude from a that

    lim_{n→∞} P( |X̄n − (1/n) Σ_{i=1}^{n} µi| > a ) = 0.

Check that the law of large numbers is a special case of this result.
14
The central limit theorem
The central limit theorem is a refinement of the law of large numbers.
For a large number of independent identically distributed random variables
X1, . . . , Xn, with finite variance, the average X̄n approximately has a normal
distribution, no matter what the distribution of the Xi is. In the first section
we discuss the proper normalization of X̄n to obtain a normal distribution
in the limit. In the second section we will use the central limit theorem to
approximate probabilities of averages and sums of random variables.
14.1 Standardizing averages
In the previous chapter we saw that the law of large numbers guarantees
the convergence to µ of the average X̄n of n independent random variables
X1, . . . , Xn, all having the same expectation µ and variance σ². This conver-
gence was illustrated by Figure 13.1. Closer examination of this figure suggests
another phenomenon: for the two distributions considered (i.e., the Gam(2, 1)
distribution and a bimodal distribution), the probability density function of
X̄n seems to become symmetrical and bell shaped around the expected value µ
as n becomes larger and larger. However, the bell collapses into a single spike
at µ. Nevertheless, by a proper normalization it is possible to stabilize the
bell shape, as we will see.
In order to let the distribution of X̄n settle down it seems to be a good idea
to stabilize the expectation and variance. Since E[X̄n] = µ for all n, only the
variance needs some special attention. In Figure 14.1 we depict the probability
density function of the centered average X̄n − µ of Gam(2, 1) random variables,
multiplied by three different powers of n. In the left column we display the
density of n^{1/4}(X̄n − µ), in the middle column the density of n^{1/2}(X̄n − µ), and
in the right column the density of n(X̄n − µ). These figures suggest that √n
is the right factor to stabilize the bell shape.
Fig. 14.1. Multiplying the difference X̄n − µ of n Gam(2, 1) random variables, for
n = 1, 2, 4, 16, and 100. Left column: n^{1/4}(X̄n − µ); middle column: √n (X̄n − µ);
right column: n(X̄n − µ).
Indeed, according to the rule for the variance of an average (see page 182),
we have Var(X̄n) = σ²/n, and therefore for any number C:

    Var( C(X̄n − µ) ) = Var( C X̄n ) = C² Var( X̄n ) = C² σ²/n.

To stabilize the variance we therefore must choose C = √n. In fact, by choosing
C = √n/σ, one standardizes the averages, i.e., the resulting random variable
Zn, defined by

    Zn = √n (X̄n − µ)/σ ,    n = 1, 2, . . . ,
has expected value 0 and variance 1. What more can we say about the distri-
bution of the random variables Zn?
In case X1, X2, . . . are independent N(µ, σ²) distributed random variables,
we know from Section 11.2 and the rule on expectation and variance under
change of units (see page 98), that Zn has an N(0, 1) distribution for all n. For
the gamma and bimodal random variables from Section 13.1 we depicted the
probability density function of Zn in Figure 14.2. For both examples we see
that the probability density functions of the Zn seem to converge to the prob-
ability density function of the N(0, 1) distribution, indicated by the dotted
line. The following amazing result states that this behavior generally occurs
no matter what distribution we start with.
The central limit theorem. Let X1, X2, . . . be any sequence
of independent identically distributed random variables with finite
positive variance. Let µ be the expected value and σ² the variance
of each of the Xi. For n ≥ 1, let Zn be defined by

    Zn = √n (X̄n − µ)/σ ;

then for any number a

    lim_{n→∞} F_{Zn}(a) = Φ(a),
where Φ is the distribution function of the N(0, 1) distribution. In
words: the distribution function of Zn converges to the distribution
function Φ of the standard normal distribution.
Note that

    Zn = (X̄n − E[X̄n]) / √Var(X̄n) ,

which is a more direct way to see that Zn is the average X̄n standardized.
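The convergence described by the central limit theorem can also be checked by simulation. The sketch below (an addition, not from the book) simulates Zn for Gam(2, 1) samples and compares the fraction of realizations with Zn ≤ 1 to Φ(1) ≈ 0.84; the number of repetitions and the seed are arbitrary choices.

    # Sketch (not from the book): simulate Z_n = sqrt(n) (Xbar_n - mu) / sigma
    # for Gam(2, 1) samples and compare with the standard normal distribution.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(seed=4)
    mu, sigma = 2.0, np.sqrt(2.0)     # expectation and standard deviation of Gam(2, 1)

    def simulate_zn(n, repetitions=20_000):
        samples = rng.gamma(shape=2.0, scale=1.0, size=(repetitions, n))
        return np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

    for n in (1, 4, 16, 100):
        zn = simulate_zn(n)
        print(f"n = {n:3d}: fraction with Z_n <= 1: {np.mean(zn <= 1):.3f}, "
              f"Phi(1) = {norm.cdf(1):.3f}")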
Fig. 14.2. Densities of standardized averages Zn, for n = 1, 2, 4, 16, and 100. Left
column: from a gamma density; right column: from a bimodal density. Dotted line:
N(0, 1) probability density.
One can also write Zn as a standardized sum

    Zn = (X1 + · · · + Xn − nµ) / (σ√n).    (14.1)

In the next section we will see that this last representation of Zn is very
helpful when one wants to approximate probabilities of sums of independent
identically distributed random variables.
Since

    X̄n = (σ/√n) Zn + µ,

it follows that X̄n approximately has an N(µ, σ²/n) distribution; see the
change-of-units rule for normal random variables on page 106. This explains
the symmetrical bell shape of the probability densities in Figure 13.1.
Remark 14.1 (Some history). Originally, the central limit theorem was
proved in 1733 by De Moivre for independent Ber(1/2) distributed random
variables. Lagrange extended De Moivre's result to Ber(p) random variables
variables. Lagrange extended De Moivre’s result to Ber(p) random variables
and later formulated the central limit theorem as stated above. Around
1901 a first rigorous proof of this result was given by Lyapunov. Several
versions of the central limit theorem exist with weaker conditions than those
presented here. For example, for applications it is interesting that it is not
necessary that all Xi have the same distribution; see Ross [26], Section 8.3,
or Feller [8], Section 8.4, and Billingsley [3], Section 27.
14.2 Applications of the central limit theorem
The central limit theorem provides a tool to approximate the probability
distribution of the average or the sum of independent identically distributed
random variables. This plays an important role in applications, for instance,
see Sections 23.4, 24.1, 26.2, and 27.2. Here we will illustrate the use of the
central limit theorem to approximate probabilities of averages and sums of
random variables in three examples. The first example deals with an average;
the other two concern sums of random variables.
Did we have bad luck?
In the example in Section 13.3 averages of independent Gam(2, 1) distributed
random variables were simulated for n = 1, . . . , 500. In Figure 13.2 the realiza-
tion of X̄n for n = 400 is 1.99, which is almost exactly equal to the expected
value 2. For n = 500 the simulation was 2.06, a little bit farther away. Did
we have bad luck, or is a value 2.06 or higher not unusual? To answer this
question we want to compute P(X̄n ≥ 2.06). We will find an approximation
of this probability using the central limit theorem.
Note that

    P(X̄n ≥ 2.06) = P(X̄n − µ ≥ 2.06 − µ)
                  = P( √n (X̄n − µ)/σ ≥ √n (2.06 − µ)/σ )
                  = P( Zn ≥ √n (2.06 − µ)/σ ).

Since the Xi are Gam(2, 1) random variables, µ = E[Xi] = 2 and σ² =
Var(Xi) = 2. We find for n = 500 that

    P(X̄500 ≥ 2.06) = P( Z500 ≥ √500 · (2.06 − 2)/√2 ) = P(Z500 ≥ 0.95)
                    = 1 − P(Z500 < 0.95).

It now follows from the central limit theorem that

    P(X̄500 ≥ 2.06) ≈ 1 − Φ(0.95) = 0.1711.
This is close to the exact answer 0.1710881, which was obtained using the
probability density of X̄n as given in Section 13.1.
Thus we see that there is about a 17% probability that the average X̄500 is at
least 0.06 above 2. Since 17% is quite large, we conclude that the value 2.06
is not unusual. In other words, we did not have bad luck; n = 500 is simply
not large enough to be that close. Would 2.06 be unusual if n = 5000?
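For readers who want to redo this computation, the sketch below (an addition, not from the book) evaluates both the normal approximation and the exact probability; the exact value uses the standard fact that the sum of 500 independent Gam(2, 1) random variables has a Gam(1000, 1) distribution.

    # Sketch (not from the book): normal approximation and exact value of
    # P(Xbar_500 >= 2.06) for Gam(2, 1) random variables.
    import numpy as np
    from scipy.stats import norm, gamma

    n, mu, sigma = 500, 2.0, np.sqrt(2.0)
    z = np.sqrt(n) * (2.06 - mu) / sigma            # about 0.95

    clt_approximation = norm.sf(z)                  # 1 - Phi(z), about 0.1711
    exact = gamma.sf(n * 2.06, a=2 * n)             # P(sum >= 1030), sum ~ Gam(1000, 1)

    print(f"z = {z:.2f}")
    print(f"CLT approximation: {clt_approximation:.4f}")
    print(f"exact value:       {exact:.7f}")        # about 0.1710881, as quoted above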
Quick exercise 14.1 Show that P(X̄5000 ≥ 2.06) ≈ 0.0013, using the central
limit theorem.
Rounding amounts to the nearest integer
In Exercise 13.2 an accountant wanted to simplify his bookkeeping by round-
ing amounts to the nearest integer, and you were asked to use Chebyshev's
inequality to compute an upper bound for the probability

    p = P(|X1 + X2 + · · · + X100| > 10)

that the cumulative rounding error X1 + X2 + · · · + X100 exceeds €10. This
upper bound equals 1/12. In order to know the exact value of p one has to
determine the distribution of the sum X1 + · · · + X100. This is difficult, but the
central limit theorem is a handy tool to get an approximation of p. Clearly,

    p = P(X1 + · · · + X100 < −10) + P(X1 + · · · + X100 > 10).
Standardizing as in (14.1), for the second probability we write, with n = 100
    P(X1 + · · · + Xn > 10) = P(X1 + · · · + Xn − nµ > 10 − nµ)
                            = P( (X1 + · · · + Xn − nµ)/(σ√n) > (10 − nµ)/(σ√n) )
                            = P( Zn > (10 − nµ)/(σ√n) ).

The Xi are U(−0.5, 0.5) random variables, µ = E[Xi] = 0, and σ² =
Var(Xi) = 1/12, so that

    P(X1 + · · · + X100 > 10) = P( Z100 > (10 − 100 · 0)/(√(1/12) · √100) ) = P(Z100 > 3.46).

It follows from the central limit theorem that

    P(Z100 > 3.46) ≈ 1 − Φ(3.46) = 0.0003.

Similarly,

    P(X1 + · · · + X100 < −10) ≈ Φ(−3.46) = 0.0003.

Thus we find that p = 0.0006.
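The same numbers can be reproduced in a few lines; the sketch below (an addition, not from the book) computes the normal approximation and adds a plain Monte Carlo check, the number of simulation runs being an arbitrary choice.

    # Sketch (not from the book): normal approximation and Monte Carlo check of
    # p = P(|X_1 + ... + X_100| > 10) for U(-0.5, 0.5) rounding errors.
    import numpy as np
    from scipy.stats import norm

    n = 100
    sigma_sum = np.sqrt(n / 12.0)          # standard deviation of the sum
    z = 10.0 / sigma_sum                   # about 3.46

    p_clt = norm.sf(z) + norm.cdf(-z)      # P(sum > 10) + P(sum < -10)

    rng = np.random.default_rng(seed=5)
    sums = rng.uniform(-0.5, 0.5, size=(100_000, n)).sum(axis=1)
    p_mc = np.mean(np.abs(sums) > 10)

    # p_clt comes out near 0.0005; the 0.0006 above results from rounding each
    # tail probability to 0.0003 before adding them.
    print(f"CLT approximation: {p_clt:.4f}")
    print(f"Monte Carlo:       {p_mc:.4f}")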
Normal approximation of the binomial distribution
In Section 4.3 we considered the (fictitious) situation that you attend, com-
pletely unprepared, a multiple-choice exam consisting of 10 questions. We saw
that the probability you will pass equals
P(X ≥ 6) = 0.0197,
where X—being the sum of 10 independent Ber(1/4) random variables—has
a Bin(10, 1/4) distribution. As we saw in Chapter 4 it is rather easy, but te-
dious, to calculate P(X ≥ 6). Although n is small, we investigate what the
central limit theorem will yield as an approximation of P(X ≥ 6). Recall that
a random variable with a Bin(n, p) distribution can be written as the sum of
n independent Ber(p) distributed random variables R1, . . . , Rn. Substituting
n = 10, µ = p = 1/4, and σ² = p(1 − p) = 3/16, it follows from the central
limit theorem that

    P(X ≥ 6) = P(R1 + · · · + Rn ≥ 6)
             = P( (R1 + · · · + Rn − nµ)/(σ√n) ≥ (6 − nµ)/(σ√n) )
             = P( Z10 ≥ (6 − 2½)/(√(3/16) · √10) )
             ≈ 1 − Φ(2.56) = 0.0052.
The number 0.0052 is quite a poor approximation for the true value 0.0197.
Note however, that we could also argue that
    P(X ≥ 6) = P(X > 5)
             = P(R1 + · · · + Rn > 5)
             = P( Z10 > (5 − 2½)/(√(3/16) · √10) )
             ≈ 1 − Φ(1.83) = 0.0336,
which gives an approximation that is too large! A better approach lies some-
where in the middle, as the following quick exercise illustrates.
Quick exercise 14.2 Apply the central limit theorem to find 0.0143 as an ap-
proximation to P(X ≥ 5½). Since P(X ≥ 6) = P(X ≥ 5½), this also provides
an approximation of P(X ≥ 6).
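The effect of putting the cutoff at 6, at 5, or halfway in between can be seen at a glance from the sketch below (an addition, not from the book), which lists the three normal approximations next to the exact binomial value.

    # Sketch (not from the book): normal approximations of P(X >= 6) for
    # X ~ Bin(10, 1/4) with cutoffs 6, 5, and 5.5, versus the exact probability.
    from math import sqrt
    from scipy.stats import binom, norm

    n, p = 10, 0.25
    mu, sigma = n * p, sqrt(n * p * (1 - p))    # mean and sd of the sum R_1 + ... + R_10

    for cutoff in (6.0, 5.0, 5.5):
        approx = norm.sf((cutoff - mu) / sigma)    # 1 - Phi of the standardized cutoff
        print(f"cutoff {cutoff}: normal approximation {approx:.4f}")

    exact = binom.sf(5, n, p)                      # P(X > 5) = P(X >= 6), about 0.0197
    print(f"exact binomial value: {exact:.4f}")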
How large should n be?
In view of the previous examples one might raise the question of how large n
should be to have a good approximation when using the central limit theorem.
In other words, how fast is the convergence to the normal distribution? This
is a difficult question to answer in general. For instance, in the third example
one might initially be tempted to think that the approximation was quite
poor, but after taking the fact into account that we approximate a discrete
distribution by a continuous one we obtain a considerable improvement of the
approximation, as was illustrated in Quick exercise 14.2. For another example,
see Figure 14.2. Here we see that the convergence is slightly faster for the
bimodal distribution than for the Gam(2, 1) distribution, which is due to the
fact that the Gam(2, 1) is rather asymmetric.
In general the approximation might be poor when n is small, when the dis-
tribution of the Xi is asymmetric, bimodal, or discrete, or when the value a
in P(X̄n < a) is far from the center of the distribution of the Xi.
14.3 Solutions to the quick exercises
14.1 In the same way we approximated P(X̄n ≥ 2.06) using the central limit
theorem, we have that

    P(X̄n ≥ 2.06) = P( Zn ≥ √n (2.06 − µ)/σ ).
With µ = 2 and σ = √2, we find for n = 5000 that

    P(X̄5000 ≥ 2.06) = P(Z5000 ≥ 3),
which is approximately equal to 1−Φ(3) = 0.0013, thanks to the central limit
theorem. Because we think that 0.13% is a small probability, to find 2.06 as
a value for X̄5000 would mean that you really had bad luck!
14.2 Similar to the computation of P(X ≥ 6), we have

    P(X ≥ 5½) = P(R1 + · · · + R10 ≥ 5½)
              = P( Z10 ≥ (5½ − 2½)/(√(3/16) · √10) )
              ≈ 1 − Φ(2.19) = 0.0143.

We have seen that using the central limit theorem to approximate P(X ≥ 6)
gives an underestimate of this probability, while using the central limit the-
orem for P(X > 5) gives an overestimate. Since 5½ is "in the middle," the
approximation will be better.
14.4 Exercises
14.1 Let X1, X2, . . . , X144 be independent identically distributed random
variables, each with expected value µ = E[Xi] = 2, and variance σ² =
Var(Xi) = 4. Approximate P(X1 + X2 + · · · + X144  144), using the central
limit theorem.
14.2  Let X1, X2, . . . , X625 be independent identically distributed random
variables, with probability density function f given by
    f(x) = 3(1 − x)²   for 0 ≤ x ≤ 1,
    f(x) = 0           otherwise.
Use the central limit theorem to approximate P(X1 + X2 + · · · + X625  170).
14.3  In Exercise 13.4 a you were asked to use Chebyshev’s inequality to
determine how large n should be (how many people should be interviewed) so
that the probability that X̄n is within 0.2 of the “true” p is at least 0.9. Here
p is the proportion of the voters in Florida who will vote for G (and 1 − p is
the proportion of the voters who will vote for B). How large should n at least
be according to the central limit theorem?
14.4  In the single-server queue model from Section 6.4, Ti is the time
between the arrival of the (i − 1)th and ith customers. Furthermore, one
of the model assumptions is that the Ti are independent, Exp(0.5) dis-
tributed random variables. In Section 11.2 we saw that the probability
P(T1 + · · · + T30 ≤ 60) of the 30th customer arriving within an hour at the
well is equal to 0.542. Find the normal approximation of this probability.
14.5  Let X be a Bin(n, p) distributed random variable. Show that the
random variable
    (X − np) / √( np(1 − p) )

has a distribution that is approximately standard normal.
14.6  Again, as in the previous exercise, let X be a Bin(n, p) distributed
random variable.
a. An exact computation yields that P(X ≤ 25) = 0.55347, when n = 100
and p = 1/4. Use the central limit theorem to give an approximation of
P(X ≤ 25) and P(X < 26).
b. When n = 100 and p = 1/4, then P(X ≤ 2) = 1.87 · 10⁻¹⁰. Use the central
limit theorem to give an approximation of this probability.
14.7 Let X1, X2, . . . , Xn be n independent random variables, each with ex-
pected value µ and finite positive variance σ². Use Chebyshev's inequality to
show that for any a > 0 one has

    P( | n^{1/4} (X̄n − µ)/σ | ≥ a ) ≤ 1/(a² √n).
Use this fact to explain the occurrence of a single spike in the left column of
Figure 14.1.
14.8 Let X1, X2, . . . be a sequence of independent N(0, 1) distributed random
variables. For n = 1, 2, . . ., let Yn be the random variable, defined by
    Yn = X1² + · · · + Xn².

a. Show that E[Xi²] = 1.
b. One can show—using integration by parts—that E[Xi⁴] = 3. Deduce from
this that Var(Xi²) = 2.
c. Use the central limit theorem to approximate P(Y100  110).
14.9  A factory produces links for heavy metal chains. The research lab of
the factory models the length (in cm) of a link by the random variable X,
with expected value E[X] = 5 and variance Var(X) = 0.04. The length of a
link is defined in such a way that the length of a chain is equal to the sum of
the lengths of its links. The factory sells chains of 50 meters; to be on the safe
side 1002 links are used for such chains. The factory guarantees that the chain
is not shorter than 50 meters. If by chance a chain is too short, the customer
is reimbursed, and a new chain is given for free.
a. Give an estimate of the probability that for a chain of at least 50 meters
more than 1002 links are needed. For what percentage of the chains does
the factory have to reimburse clients and provide free chains?
b. The sales department of the factory notices that it has to hand out a
lot of free chains and asks the research lab what is wrong. After further
investigations the research lab reports to the sales department that the
expectation value 5 is incorrect, and that the correct value is 4.99 (cm).
Do you think that it was necessary to report such a minor change of this
value?
14.10 Chebyshev’s inequality was used in Exercise 13.5 to determine how
many times n one needs to measure a sample to be 90% sure that the average
of the measurements is within half a degree of the actual melting point c of a
new material.
a. Use the normal approximation to find a less conservative value for n.
b. Only if the random errors Ui in the measurements have a normal distribution
is the value of n from part a "exact"; in all other cases it is an approximation.
Explain this.
15
Exploratory data analysis: graphical summaries
In the previous chapters we focused on probability models to describe random
phenomena. Confronted with a new phenomenon, we want to learn about the
randomness that is associated with it. It is common to conduct an experiment
for this purpose and record observations concerning the phenomenon. The set
of observations is called a dataset. By exploring the dataset we can gain insight
into what probability model suits the phenomenon.
Frequently you will have to deal with a dataset that contains so many ele-
ments that it is necessary to condense the data for easy visual comprehension
of general characteristics. In this chapter we present several graphical methods
to do so. To graphically represent univariate datasets, consisting of repeated
measurements of one particular quantity, we discuss the classical histogram,
the more recently introduced kernel density estimates and the empirical dis-
tribution function. To represent a bivariate dataset, which consists of repeated
measurements of two quantities, we use the scatterplot.
15.1 Example: the Old Faithful data
The Old Faithful geyser at Yellowstone National Park, Wyoming, USA, was
observed from August 1st to August 15th, 1985. During that time, data were
collected on the duration of eruptions. There were 272 eruptions observed, of
which the recorded durations are listed in Table 15.1. The data are given in
seconds.
The variety in the lengths of the eruptions indicates that randomness is in-
volved. By exploring the dataset we might learn about this randomness. For
instance: we would like to know which durations are more likely to occur than others;
is there something like “the typical duration of an eruption”; do the durations
vary symmetrically around the center of the dataset; and so on. In order to
retrieve this type of information, just listing the observed durations does not
help us very much. Somehow we must summarize the observed data. We could
Table 15.1. Duration in seconds of 272 eruptions of the Old Faithful geyser.
216 108 200 137 272 173 282 216 117 261
110 235 252 105 282 130 105 288 96 255
108 105 207 184 272 216 118 245 231 266
258 268 202 242 230 121 112 290 110 287
261 113 274 105 272 199 230 126 278 120
288 283 110 290 104 293 223 100 274 259
134 270 105 288 109 264 250 282 124 282
242 118 270 240 119 304 121 274 233 216
248 260 246 158 244 296 237 271 130 240
132 260 112 289 110 258 280 225 112 294
149 262 126 270 243 112 282 107 291 221
284 138 294 265 102 278 139 276 109 265
157 244 255 118 276 226 115 270 136 279
112 250 168 260 110 263 113 296 122 224
254 134 272 289 260 119 278 121 306 108
302 240 144 276 214 240 270 245 108 238
132 249 120 230 210 275 142 300 116 277
115 125 275 200 250 260 270 145 240 250
113 275 255 226 122 266 245 110 265 131
288 110 288 246 238 254 210 262 135 280
126 261 248 112 276 107 262 231 116 270
143 282 112 230 205 254 144 288 120 249
112 256 105 269 240 247 245 256 235 273
245 145 251 133 267 113 111 257 237 140
249 141 296 174 275 230 125 262 128 261
132 267 214 270 249 229 235 267 120 257
286 272 111 255 119 135 285 247 129 265
109 268
Source: W. Härdle. Smoothing techniques with implementation in S. 1991;
Table 3, page 201. Springer New York.
start by computing the mean of the data, which is 209.3 for the Old Faithful
data. However, this is a poor summary of the dataset, because there is a lot
more information in the observed durations. How do we get hold of this?
Just staring at the dataset for a while tells us very little. To see something,
we have to rearrange the data somehow. The first thing we could do is order
the data. The result is shown in Table 15.2. Putting the elements in order
already provides more information. For instance, it is now immediately clear
that all elements lie between 96 and 306.
Quick exercise 15.1 Which two elements of the Old Faithful dataset split
the dataset into three groups of equal size?
A closer look at the ordered data shows that the two middle elements (the
136th and 137th elements in ascending order) are equal to 240, which is much
closer to the maximum value 306 than to the minimum value 96. This seems to
Table 15.2. Ordered durations of eruptions of the Old Faithful geyser.
96 100 102 104 105 105 105 105 105 105
107 107 108 108 108 108 109 109 109 110
110 110 110 110 110 110 111 111 112 112
112 112 112 112 112 112 113 113 113 113
115 115 116 116 117 118 118 118 119 119
119 120 120 120 120 121 121 121 122 122
124 125 125 126 126 126 128 129 130 130
131 132 132 132 133 134 134 135 135 136
137 138 139 140 141 142 143 144 144 145
145 149 157 158 168 173 174 184 199 200
200 202 205 207 210 210 214 214 216 216
216 216 221 223 224 225 226 226 229 230
230 230 230 230 231 231 233 235 235 235
237 237 238 238 240 240 240 240 240 240
242 242 243 244 244 245 245 245 245 245
246 246 247 247 248 248 249 249 249 249
250 250 250 250 251 252 254 254 254 255
255 255 255 256 256 257 257 258 258 259
260 260 260 260 260 261 261 261 261 262
262 262 262 263 264 265 265 265 265 266
266 267 267 267 268 268 269 270 270 270
270 270 270 270 270 271 272 272 272 272
272 273 274 274 274 275 275 275 275 276
276 276 276 277 278 278 278 279 280 280
282 282 282 282 282 282 283 284 285 286
287 288 288 288 288 288 288 289 289 290
290 291 293 294 294 296 296 296 300 302
304 306
indicate that the dataset is somewhat asymmetric, but even from the ordered
dataset we cannot get a clear picture of this asymmetry. Also, geologists be-
lieve that there are two different kinds of eruptions that play a role. Hence one
would expect two separate values around which the elements of the dataset
would accumulate, corresponding to the typical durations of the two types of
eruptions. Again it is not clear, not even from the ordered dataset, what these
two typical values are. It would be better to have a plot of the dataset that
reflects symmetry or asymmetry of the data and from which we can easily see
where the elements accumulate. In the following sections we will discuss two
such methods.
15.2 Histograms
The classical method to graphically represent data is the histogram, which
probably dates from the mortality studies of John Graunt in 1662 (see West-
210 15 Exploratory data analysis: graphical summaries
ergaard [39], p.22). The term histogram appears to have been used first by
Karl Pearson ([22]). Figure 15.1 displays a histogram of the Old Faithful data.
The picture immediately reveals the asymmetry of the dataset and the fact
that the elements accumulate somewhere near 120 and 270, which was not
clear from Tables 15.1 and 15.2.
Fig. 15.1. Histogram of the Old Faithful data.
The construction of the histogram is as follows. Let us denote a generic (uni-
variate) dataset of size n by
x1, x2, . . . , xn
and suppose we want to construct a histogram. We use the version of the
histogram that is scaled in such a way that the total area under the curve is
equal to one.1
First we divide the range of the data into intervals. These intervals are called
bins and are denoted by
B1, B2, . . . , Bm.
The length of an interval Bi is denoted by |Bi| and is called the bin width.
The bins do not necessarily have the same width. In Figure 15.1 we have eight
bins of equal bin width. We want the area under the histogram on each bin
Bi to reflect the number of elements in Bi. Since the total area 1 under the
histogram then corresponds to the total number of elements n in the dataset,
the area under the histogram on a bin Bi is equal to the proportion of elements
in Bi:
(the number of xj in Bi) / n.
1 The reason to scale the histogram so that the total area under the curve is equal to
one is that if we view the data as being generated from some unknown probability
density f (see Chapter 17), such a histogram can be used as a crude estimate of f.
The height of the histogram on bin Bi must then be equal to
(the number of xj in Bi) / (n |Bi|).
Quick exercise 15.2 Use Table 15.2 to count how many elements fall into
each of the bins (90, 120], (120, 150], . . . , (300, 330] in Figure 15.1 and com-
pute the height on each bin.
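As an illustration, here is a minimal Python sketch of this height computation; the ten-element dataset and the bins below are made up for the example.

# Heights of a density-scaled histogram: (number of x_j in B_i) / (n * |B_i|),
# so that the total area under the histogram equals one.
data = [96, 105, 118, 130, 216, 240, 245, 255, 270, 282]   # made-up sample
bins = [(90, 120), (120, 150), (150, 180), (180, 210),
        (210, 240), (240, 270), (270, 300)]                 # bins (left, right]

n = len(data)
heights = []
for left, right in bins:
    count = sum(1 for x in data if left < x <= right)
    heights.append(count / (n * (right - left)))

print(heights)
# The areas height * |B_i| sum to one:
print(sum(h * (right - left) for h, (left, right) in zip(heights, bins)))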
Choice of the bin width
Consider a histogram with bins of equal width. In that case the bins are of
the form
Bi = (r + (i − 1)b, r + ib] for i = 1, 2, . . . , m,
where r is some reference point smaller than the minimum of the dataset,
and b denotes the bin width. In Figure 15.2, three histograms of the Old
Faithful data of Table 15.2 are displayed with bin widths equal to 2, 30, and
90, respectively. Clearly, the choice of the bin width b, or the corresponding
choice of the number of bins m, will determine what the resulting histogram
will look like. Choosing the bin width too small will result in a chaotic figure
with many isolated peaks. Choosing the bin width too large will result in a
figure without much detail, at the risk of losing information about general
characteristics. In Figure 15.2, bin width b = 2 is somewhat too small. Bin
width b = 90 is clearly too large and produces a histogram that no longer
captures the fact that the data show two separate modes near 120 and 270.
How does one go about choosing the bin width? In practice, this might boil
down to picking the bin width by trial and error, continuing until the figure
looks reasonable. Mathematical research, however, has provided some guide-
lines for a data-based choice for b or m. Formulas that may effectively be
used are m = 1 + 3.3 log10(n) (see [34]) or b = 3.49 s n^(−1/3) (see [29]; see also
Remark 15.1), where s is the sample standard deviation (see Section 16.2 for
the definition of the sample standard deviation).
Fig. 15.2. Histograms of the Old Faithful data with different bin widths (2, 30, and 90).
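As an illustration, both rules are easy to evaluate in Python; the values n = 272 and s = 68.48 for the Old Faithful data are the ones quoted in this chapter.

import math

n = 272      # number of recorded eruptions (Table 15.1)
s = 68.48    # sample standard deviation of the durations (Quick exercise 15.3)

m = 1 + 3.3 * math.log10(n)      # suggested number of bins
b = 3.49 * s * n ** (-1 / 3)     # suggested bin width

print(round(m, 1))    # approximately 9 bins
print(round(b, 1))    # bin width of approximately 36.9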
Remark 15.1 (Normal reference method for histograms). Let
Hn(x) denote the height of the histogram at x and suppose that we view our
dataset as being generated from a probability distribution with density f.
We would like to find the bin width that minimizes the difference between
Hn and f, measured by the so-called mean integrated squared error (MISE)

E[ ∫_−∞^∞ (Hn(x) − f(x))² dx ].

Under suitable smoothness conditions on f, the value of b that minimizes
the MISE as n goes to infinity is given by

b = C(f) n^(−1/3)   where   C(f) = 6^(1/3) ( ∫_−∞^∞ f′(x)² dx )^(−1/3)

(see for instance [29] or [12]). A simple data-based choice for b is obtained by
estimating the constant C(f). The normal reference method takes f to be
the density of an N(µ, σ²) distribution, in which case C(f) = (24√π)^(1/3) σ.
Estimating σ by the sample standard deviation s (see Chapter 16 for a
definition of s) would result in bin width

b = (24√π)^(1/3) s n^(−1/3).

For the Old Faithful data this would give b = 36.89.
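As a side computation (a sketch in LaTeX, assuming f is the N(µ, σ²) density), the constant C(f) of the normal reference method can be obtained as follows; it also explains the constant 3.49 in the rule mentioned earlier.

% For f the N(\mu,\sigma^2) density one has f'(x) = -\frac{x-\mu}{\sigma^2} f(x), so
\int_{-\infty}^{\infty} f'(x)^2 \, dx
   = \frac{1}{\sigma^4} \int_{-\infty}^{\infty} (x-\mu)^2 f(x)^2 \, dx
   = \frac{1}{4\sqrt{\pi}\,\sigma^3},
% and therefore
C(f) = 6^{1/3} \left( \int_{-\infty}^{\infty} f'(x)^2 \, dx \right)^{-1/3}
     = \left( 24\sqrt{\pi} \right)^{1/3} \sigma \approx 3.49\,\sigma .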
Quick exercise 15.3 If we construct a histogram for the Old Faithful data
with equal bin width b = 3.49 s n^(−1/3), how many bins will we need to cover the
data if s = 68.48?
The main advantage of the histogram is that it is simple. Its disadvantage is
the discrete character of the plot. In Figure 15.1 it is still somewhat unclear
which two values correspond to the typical durations of the two types of
eruptions. Another well-known artifact is that changing the bin width slightly
or keeping the bin width fixed and shifting the bins slightly may result in a
figure of a different nature. A method that produces a smoother figure and is
less sensitive to these kinds of changes will be discussed in the next section.
15.3 Kernel density estimates
We can graphically represent data in a more variegated plot by a so-called
kernel density estimate. The basic ideas of kernel density estimation first ap-
peared in the early 1950s. Rosenblatt [25] and Parzen [21] provided the stim-
ulus for further research on this topic. Although the method was introduced
in the middle of the last century, until recently it remained unpopular as a
tool for practitioners because of its computationally intensive nature.
Fig. 15.3. Kernel density estimate of the Old Faithful data.

Figure 15.3 displays a kernel density estimate of the Old Faithful data. Again
the picture immediately reveals the asymmetry of the dataset, but it is much
smoother than the histogram in Figure 15.1. Note that it is now easier to
detect the two typical values around which the elements accumulate.
The idea behind the construction of the plot is to “put a pile of sand” around
each element of the dataset. At places where the elements accumulate, the
sand will pile up. The actual plot is constructed by choosing a kernel K and
a bandwidth h. The kernel K reflects the shape of the piles of sand, whereas
the bandwidth is a tuning parameter that determines how wide the piles
of sand will be. Formally, a kernel K is a function K : R → R. Figure 15.4
displays several well-known kernels. A kernel K typically satisfies the following
conditions:
(K1) K is a probability density, i.e., K(u) ≥ 0 and ∫_−∞^∞ K(u) du = 1;
(K2) K is symmetric around zero, i.e., K(u) = K(−u);
(K3) K(u) = 0 for |u| > 1.
Examples are the Epanechnikov kernel:
K(u) = (3/4)(1 − u²) for −1 ≤ u ≤ 1
and K(u) = 0 elsewhere, and the triweight kernel
K(u) = (35/32)(1 − u²)³ for −1 ≤ u ≤ 1
and K(u) = 0 elsewhere. Sometimes one uses kernels that do not satisfy
condition (K3), for example, the normal kernel
K(u) = (1/√(2π)) e^(−u²/2) for −∞ < u < ∞.
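As an illustration, these kernels translate directly into small Python functions; the crude Riemann sum at the end checks condition (K1) numerically.

import math

def epanechnikov(u):
    return 0.75 * (1 - u ** 2) if -1 <= u <= 1 else 0.0

def triweight(u):
    return (35 / 32) * (1 - u ** 2) ** 3 if -1 <= u <= 1 else 0.0

def normal_kernel(u):
    return math.exp(-0.5 * u ** 2) / math.sqrt(2 * math.pi)

# Condition (K1): each kernel integrates to (approximately) one.
grid = [i / 1000 for i in range(-5000, 5001)]
for K in (epanechnikov, triweight, normal_kernel):
    print(round(sum(K(u) * 0.001 for u in grid), 3))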
Fig. 15.4. Examples of well-known kernels K: the triangular, cosine, Epanechnikov,
biweight, triweight, and normal kernels.

Let us denote a kernel density estimate by fn,h, and suppose that we want to
construct fn,h for a dataset x1, x2, . . . , xn. In Figure 15.5 the construction is
illustrated for a dataset containing five elements, where we use the Epanech-
nikov kernel and bandwidth h = 0.5. First we scale the kernel K (solid line)
into the function
t → (1/h) K(t/h).
The scaled kernel (dotted line) is of the same type as the original kernel, with
area 1 under the curve but is positive on the interval [−h, h] instead of [−1, 1]
and higher (lower) when h is smaller (larger) than 1. Next, we put a scaled
kernel around each element xi in the dataset. This results in functions of the
type
t → (1/h) K((t − xi)/h).
These shifted kernels (dotted lines) have the same shape as the transformed
kernel, all with area 1 under the curve, but they are now symmetric around
xi and positive on the interval [xi − h, xi + h]. We see that the graphs of the
shifted kernels will overlap whenever xi and xj are close to each other, so
that things will pile up more at places where more elements accumulate. The
kernel density estimate fn,h is constructed by summing the scaled kernels and
dividing them by n, in order to obtain area 1 under the curve:
fn,h(t) = (1/n) [ (1/h) K((t − x1)/h) + (1/h) K((t − x2)/h) + · · · + (1/h) K((t − xn)/h) ],

or briefly,

fn,h(t) = (1/(nh)) Σ_{i=1}^{n} K((t − xi)/h).   (15.1)

Fig. 15.5. Construction of a kernel density estimate fn,h.
When computing fn,h(t), we assign higher weights to observations xi closer to
t, in contrast to the histogram where we simply count the number of observa-
tions in the bin that contains t. Note that as a consequence of condition (K1),
fn,h itself is a probability density:
fn,h(t) ≥ 0 and ∫_−∞^∞ fn,h(t) dt = 1.
Quick exercise 15.4 Check that the total area under the kernel density
estimate is equal to one, i.e., show that
∫_−∞^∞ fn,h(t) dt = 1.
Note that computing fn,h is very computationally intensive. Its common use
nowadays is therefore a typical product of the recent developments in com-
puter hardware, despite the fact that the method was introduced much earlier.
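A deliberately naive Python sketch of formula (15.1) may make the construction clearer; the five-element dataset below is made up, mimicking the situation of Figure 15.5, and the code is not meant as an efficient implementation.

import math

def epanechnikov(u):
    return 0.75 * (1 - u ** 2) if -1 <= u <= 1 else 0.0

def f_nh(t, data, h, K=epanechnikov):
    # Kernel density estimate of formula (15.1): one scaled and shifted
    # kernel per observation, summed and divided by n*h.
    n = len(data)
    return sum(K((t - x) / h) for x in data) / (n * h)

data = [-1.2, -0.3, 0.1, 0.6, 1.4]          # made-up dataset of five elements
for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(t, round(f_nh(t, data, h=0.5), 3))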
Choice of the bandwidth
The bandwidth h plays the same role for kernel density estimates as the bin
width b does for histograms. In Figure 15.6 three kernel density estimates of
the Old Faithful data are plotted with the triweight kernel and bandwidths
1.8, 18, and 180. It is clear that the choice of the bandwidth h determines
largely what the resulting kernel density estimate will look like. Choosing the
bandwidth too small will produce a curve with many isolated peaks. Choosing
the bandwidth too large will produce a very smooth curve, at the risk of
smoothing away important features of the data. In Figure 15.6 bandwidth
h = 1.8 is somewhat too small. Bandwidth h = 180 is clearly too large and
produces an oversmoothed kernel density estimate that no longer captures the
fact that the data show two separate modes.
Fig. 15.6. Kernel estimates of the Old Faithful data (bandwidths 1.8, 18, and 180).
How does one go about choosing the bandwidth? Similar to histograms, in
practice one could do this by trial and error and continue until one obtains
a reasonable picture. Recent research, however, has provided some guidelines
for a data-based choice of h. A formula that may effectively be used is
h = 1.06 s n^(−1/5), where s denotes the sample standard deviation (see, for instance,
[31]; see also Remark 15.2).
Remark 15.2 (Normal reference method for kernel estimates).
Suppose we view our dataset as being generated from a probability dis-
tribution with density f. Let K be a fixed chosen kernel and let fn,h be
the kernel density estimate. We would like to take the bandwidth that min-
imizes the difference between fn,h and f, measured by the so-called mean
integrated squared error (MISE)

E[ ∫_−∞^∞ (fn,h(x) − f(x))² dx ].

Under suitable smoothness conditions on f, the value of h that minimizes
the MISE, as n goes to infinity, is given by

h = C1(f) C2(K) n^(−1/5),

where the constants C1(f) and C2(K) are given by

C1(f) = ( 1 / ∫_−∞^∞ f″(x)² dx )^(1/5)   and
C2(K) = ( ∫_−∞^∞ K(u)² du )^(1/5) / ( ∫_−∞^∞ u² K(u) du )^(2/5).

After choosing the kernel K, one can compute the constant C2(K) to obtain
a simple data-based choice for h by estimating the constant C1(f). For
instance, for the normal kernel one finds C2(K) = (2√π)^(−1/5). As with
histograms (see Remark 15.1), the normal reference method takes f to be
the density of an N(µ, σ²) distribution, in which case C1(f) = (8√π/3)^(1/5) σ.
Estimating σ by the sample standard deviation s (see Chapter 16 for a
definition of s) would result in bandwidth

h = (4/3)^(1/5) s n^(−1/5).

For the Old Faithful data, this would give h = 23.64.
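As an illustration, this rule of thumb takes only a few lines of Python to evaluate, again with the values n = 272 and s = 68.48 quoted for the Old Faithful data.

n = 272
s = 68.48
h_exact = (4 / 3) ** (1 / 5) * s * n ** (-1 / 5)   # normal reference constant
h_rule  = 1.06 * s * n ** (-1 / 5)                 # rounded constant used earlier
print(round(h_exact, 2), round(h_rule, 2))         # approximately 23.64 and 23.66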
Quick exercise 15.5 If we construct a kernel density estimate for the Old
Faithful data with bandwidth h = 1.06 s n^(−1/5), then on what interval is fn,h
strictly positive if s = 68.48?
Choice of the kernel
To construct a kernel density estimate, one has to choose a kernel K and a
bandwidth h. The choice of kernel is less important. In Figure 15.7 we have
plotted two kernel density estimates for the Old Faithful data of Table 15.1:
one is constructed with the triweight kernel (solid line), and one with the
Epanechnikov kernel (dotted line), both with the same bandwidth h = 24. As
one can see, the graphs are very similar. If one wants to compare with the
normal kernel, one should set the bandwidth of the normal kernel at about
h/4. This has to do with the fact that the normal kernel is much more spread
out than the two kernels mentioned here, which are zero outside [−1, 1].
Fig. 15.7. Kernel estimates of the Old Faithful data with different kernels: triweight
(solid line) and Epanechnikov kernel (dotted), both with bandwidth h = 24.
Boundary kernels
In order to estimate the parameters of a software reliability model, failure data
are collected. Usually the most desirable type of failure data results when the
Table 15.3. Interfailure times between successive failures.
30 113 81 115 9 2 91 112 15 138
50 77 24 108 88 670 120 26 114 325
55 242 68 422 180 10 1146 600 15 36
4 0 8 227 65 176 58 457 300 97
263 452 255 197 193 6 79 816 1351 148
21 233 134 357 193 236 31 369 748 0
232 330 365 1222 543 10 16 529 379 44
129 810 290 300 529 281 160 828 1011 445
296 1755 1064 1783 860 983 707 33 868 724
2323 2930 1461 843 12 261 1800 865 1435 30
143 108 0 3110 1247 943 700 875 245 729
1897 447 386 446 122 990 948 1082 22 75
482 5509 100 10 1071 371 790 6150 3321 1045
648 5485 1160 1864 4116
Source: J.D. Musa, A. Iannino, and K. Okumoto. Software reliability: mea-
surement, prediction, application. McGraw-Hill, New York, 1987; Table on
page 305.
failure times are recorded, or equivalently, the length of an interval between
successive failures. The data in Table 15.3 are observed interfailure times in
CPU seconds for a certain control software system. On the left in Figure 15.8
a kernel density estimate of the observed interfailure times is plotted. Note
that to the left of the origin, fn,h is positive. This is absurd, since it suggests
that there are negative interfailure times.
Fig. 15.8. Kernel density estimate of the software reliability data with symmetric
and boundary kernel.

This phenomenon is a consequence of the fact that one uses a symmetric kernel.
In that case, the resulting kernel density estimate will always be positive
on the interval [xi − h, xi + h] for every element xi in the dataset. Hence, obser-
vations close to zero will cause the kernel density estimate fn,h to be positive
to the left of zero. It is possible to improve the kernel density estimate in a
neighborhood of zero by means of a so-called boundary kernel. Without going
into detail about the construction of such an improvement, we will only show
the result of this. On the right in Figure 15.8 the histogram of the interfailure
times is plotted together with the kernel density estimate constructed with a
symmetric kernel (dotted line) and with the boundary kernel density estimate
(solid line). The boundary kernel density estimate is 0 to the left of the ori-
gin and is adjusted on the interval [0, h). On the interval [h, ∞) both kernel
density estimates are the same.
15.4 The empirical distribution function
Another way to graphically represent a dataset is to plot the data in a cumu-
lative manner. This can be done using the empirical cumulative distribution
function of the data. It is denoted by Fn and is defined at a point x as the
proportion of elements in the dataset that are less than or equal to x:
Fn(x) = (number of elements in the dataset ≤ x) / n.
To illustrate the construction of Fn, consider the dataset consisting of the
elements
4 3 9 1 7.
The corresponding empirical distribution function is displayed in Figure 15.9.
For x < 1, there are no elements less than or equal to x, so that Fn(x) = 0. For
1 ≤ x < 3, only the element 1 is less than or equal to x, so that Fn(x) = 1/5.
For 3 ≤ x < 4, the elements 1 and 3 are less than or equal to x, so that
Fn(x) = 2/5, and so on.
In general, the graph of Fn has the form of a staircase, with Fn(x) = 0 for all
x smaller than the minimum of the dataset and Fn(x) = 1 for all x greater
than the maximum of the dataset. Between the minimum and maximum, Fn
has a jump of size 1/n at each element of the dataset and is constant between
successive elements. In Figure 15.9, the marks • and ◦ are added to the graph
to emphasize the fact that, for instance, the value of Fn(x) at x = 3 is 0.4, not
0.2. Usually, we leave these out, and one might also connect the horizontal
segments by vertical lines.
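As an illustration, a minimal Python sketch of Fn for the five-element dataset above shows these staircase values directly.

def make_Fn(data):
    # Empirical distribution function: proportion of elements <= x.
    xs = sorted(data)
    n = len(xs)
    return lambda x: sum(1 for v in xs if v <= x) / n

Fn = make_Fn([4, 3, 9, 1, 7])
for x in (0.5, 1, 2, 3, 4, 7, 9, 10):
    print(x, Fn(x))
# Fn jumps by 1/5 at each element: Fn(1) = 0.2, Fn(3) = 0.4, ..., Fn(9) = 1.0.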
Fig. 15.9. Empirical distribution function.

In Figure 15.10 the empirical distribution functions are plotted for the Old
Faithful data and the software reliability data. The fact that the Old Faithful
data accumulate in the neighborhood of 120 and 270 is reflected in the graph
of Fn by the fact that it is steeper at these places: the jumps of Fn succeed each
other faster. In regions where the elements of the dataset are more stretched
out, the graph of Fn is flatter. Similar behavior can be seen for the software
reliability data in the neighborhood of zero. The elements accumulate more
close to zero, less as we move to the right. This is reflected by the empirical
distribution function, which is very steep near zero and flattens out if we move
to the right.
The graph of the empirical distribution function for the Old Faithful data
agrees with the histogram in Figure 15.1 whose height is the largest on the
bins (90, 120] and (240, 270]. In fact, there is a one-to-one relation between the
two graphical summaries of the data: the area under the histogram on a single
bin is equal to the relative frequency of elements that lie in that bin, which is
also equal to the increase of Fn on that bin. For instance, the area under the
histogram on bin (240, 270] for the Old Faithful data is equal to 30 · 0.0092 =
Fig. 15.10. Empirical distribution function of the Old Faithful data and the software reliability data.
On the other hand, Fn(270) = 215/272 = 0.7904 and Fn(240) = 140/272 = 0.5147,
whose difference Fn(270) − Fn(240) is also equal to 0.276.
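This one-to-one relation is easy to check numerically for any dataset. A minimal sketch, assuming NumPy and using simulated stand-in data (the Old Faithful observations themselves are listed in Table 15.2, which is not reproduced here), compares the area under a density histogram on each bin with the increase of Fn on that bin; apart from observations falling exactly on a bin edge the two numbers coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=500.0, size=135)     # stand-in data; any sample will do
n = len(data)

edges = np.linspace(0.0, data.max() + 1.0, 8)     # seven equal-width bins covering the data
heights, _ = np.histogram(data, bins=edges, density=True)

def Fn(x):
    """Empirical distribution function of `data`."""
    return np.mean(data <= x)

for left, right, height in zip(edges[:-1], edges[1:], heights):
    area = height * (right - left)                # area under the histogram on the bin
    increase = Fn(right) - Fn(left)               # increase of F_n on the same bin
    print(f"({left:8.1f}, {right:8.1f}]   area = {area:.4f}   increase = {increase:.4f}")
```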
Quick exercise 15.6 Suppose that for a dataset consisting of 300 elements,
the value of the empirical distribution function in the point 1.5 is equal to
0.7. How many elements in the dataset are strictly greater than 1.5?
Remark 15.3 (Fn as a discrete distribution function). Note that
Fn satisfies the four properties of a distribution function: it is continuous
from the right, Fn(x) → 0 as x → −∞, Fn(x) → 1 as x → ∞ and Fn is
nondecreasing. This means that Fn itself is a distribution function of some
random variable. Indeed, Fn is the distribution function of the discrete ran-
dom variable that attains values x1, x2, . . . , xn with equal probability 1/n.
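The remark can be made concrete by simulation: drawing from the distribution with distribution function Fn is the same as drawing an element of the dataset uniformly at random. A minimal sketch, assuming NumPy (the number of draws is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.array([4, 3, 9, 1, 7])

# A random variable with distribution function F_n attains each data point with
# probability 1/n, so drawing from it amounts to uniform resampling from the data.
draws = rng.choice(data, size=10_000, replace=True)

# The proportion of draws <= x should be close to F_n(x) itself.
for x in [2, 3, 6, 9]:
    print(x, np.mean(draws <= x), np.mean(data <= x))
```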
15.5 Scatterplot
In some situations one wants to investigate the relationship between two or
more variables. In the case of two variables x and y, the dataset consists of
pairs of observations:
(x1, y1), (x2, y2), . . . , (xn, yn).
We call such a dataset a bivariate dataset in contrast to the univariate dataset,
which consists of observations of one particular quantity. We often like to in-
vestigate whether the value of variable y depends on the value of the variable x,
and if so, whether we can describe the relation between the two variables. A
first step is to take a look at the data, i.e., to plot the points (xi, yi) for
i = 1, 2, . . . , n. Such a plot is called a scatterplot.
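Producing a scatterplot amounts to nothing more than plotting the observation pairs. A minimal sketch, assuming matplotlib and a few made-up pairs (xi, yi) purely for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical bivariate observations (x_i, y_i); replace by a real dataset.
x = [1.0, 1.5, 2.1, 2.8, 3.4, 4.0]
y = [2.3, 2.9, 3.8, 5.1, 6.2, 7.4]

plt.scatter(x, y, marker=".", color="black")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatterplot of the pairs (x_i, y_i)")
plt.show()
```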
Drilling in rock
During a study about “dry” and “wet” drilling in rock, six holes were drilled,
three corresponding to each process. In a dry hole one forces compressed air
down the drill rods to flush the cutting and the drive hammer, whereas in a
wet hole one forces water. As the hole gets deeper, a 5-foot rod has to be
added to the drill. In each hole the time to advance each 5 feet was recorded,
down to a total depth of 400 feet. The data in Table 15.4 are in 1/100 minute
and are derived from the original data in [23]. The original data consisted of
drill times for each of the six holes and contained missing observations and
observations that were known to be too large. The data in Table 15.4 are the
mean drill times of the bona fide observations at each depth for dry and wet
drilling.
One of the questions of interest is whether drill time depends on depth. To in-
vestigate this, we plot the mean drill time against depth. Figure 15.11 displays
Table 15.4. Mean drill time.
Depth Dry Wet Depth Dry Wet
5 640.67 830.00 205 803.33 962.33
10 674.67 800.00 210 794.33 864.67
15 708.00 711.33 215 760.67 805.67
20 735.67 867.67 220 789.50 966.00
25 754.33 940.67 225 904.50 1010.33
30 723.33 941.33 230 940.50 936.33
35 664.33 924.33 235 882.00 915.67
40 727.67 873.00 240 783.50 956.33
45 658.67 874.67 245 843.50 936.00
50 658.00 843.33 250 813.50 803.67
55 705.67 885.67 255 658.00 697.33
60 700.00 881.67 260 702.50 795.67
65 720.67 822.00 265 623.50 1045.33
70 701.33 886.33 270 739.00 1029.67
75 716.67 842.50 275 907.50 977.00
80 649.67 874.67 280 846.00 1054.33
85 667.33 889.33 285 829.00 1001.33
90 612.67 870.67 290 975.50 1042.00
95 656.67 916.00 295 998.00 1200.67
100 614.00 888.33 300 1037.50 1172.67
105 584.00 835.33 305 984.00 1019.67
110 619.67 776.33 310 972.50 990.33
115 666.00 811.67 315 834.00 1173.33
120 695.00 874.67 320 675.00 1165.67
125 702.00 846.00 325 686.00 1142.00
130 739.67 920.67 330 963.00 1030.67
135 790.67 896.33 335 961.50 1089.67
140 730.33 810.33 340 932.00 1154.33
145 674.00 912.33 345 1054.00 1238.50
150 749.00 862.33 350 1038.00 1208.67
155 709.67 828.00 355 1238.00 1134.67
160 769.00 812.67 360 927.00 1088.00
165 663.00 795.67 365 850.00 1004.00
170 679.33 897.67 370 1066.00 1104.00
175 740.67 881.00 375 962.50 970.33
180 776.50 819.67 380 1025.50 1054.50
185 688.00 853.33 385 1205.50 1143.50
190 761.67 844.33 390 1168.00 1044.00
195 800.00 919.00 395 1032.50 978.33
200 845.50 933.33 400 1162.00 1104.00
Source: R. Penner and D.G. Watts. Mining information. The American
Statistician, 45:4–9, 1991; Table 1 on page 6.
Fig. 15.11. Scatterplots of mean drill time versus depth.
the resulting scatterplots for the dry and wet holes. The scatterplots seem to
indicate that in the beginning the drill time hardly depends on depth, at least
up to, let’s say, 250 feet. At greater depth, the drill time seems to vary over a
larger range and increases somewhat with depth. A possible explanation for
this is that the drill moved from softer to harder material. This is suggested
by the fact that the drill hit an ore lens at about 250 feet and that such ore
lenses naturally occur at the boundary between two different materials (see [23]
for details).
A more important question is whether one can drill holes faster using dry
drilling or wet drilling. The scatterplots seem to suggest that dry drilling
might be faster. We will come back to this later.
Predicting Janka hardness of Australian timber
The Janka hardness test is a standard test to measure the hardness of wood.
It measures the force required to push a steel ball with a diameter of 11.28
millimeters (0.444 inch) into the wood to a depth of half the ball’s diameter.
To measure Janka hardness directly is difficult. However, it is related to the
density of the wood, which is comparatively easy to measure. In Table 15.5
a bivariate dataset is given of density (x) and Janka hardness (y) of 36 Aus-
tralian eucalypt hardwoods.
In order to get an impression of the relationship between hardness and den-
sity, we made a scatterplot of the bivariate dataset, which is displayed in
Figure 15.12. It consists of all points (xi, yi) for i = 1, 2, . . . , 36. The scatter-
plot might provide suggestions for the formula that describes the relationship
between the variables x and y. In this case, a linear relationship between the
two variables does not seem unreasonable. Later (Chapter 22) we will discuss
Table 15.5. Density and hardness of Australian timber.
Density Hardness Density Hardness Density Hardness
24.7 484 39.4 1210 53.4 1880
24.8 427 39.9 989 56.0 1980
27.3 413 40.3 1160 56.5 1820
28.4 517 40.6 1010 57.3 2020
28.4 549 40.7 1100 57.6 1980
29.0 648 40.7 1130 59.2 2310
30.3 587 42.9 1270 59.8 1940
32.7 704 45.8 1180 66.0 3260
35.6 979 46.9 1400 67.4 2700
38.5 914 48.2 1760 68.8 2890
38.8 1070 51.5 1710 69.1 2740
39.3 1020 51.5 2010 69.1 3140
Source: E.J. Williams. Regression analysis. John Wiley & Sons Inc., New
York, 1959; Table 3.1 on page 43.
how one can establish such a linear relationship by means of the observed
pairs.
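As a rough preview, one can already put a least-squares line through the scatterplot; the sketch below, assuming NumPy and using the data of Table 15.5, shows one simple way to obtain such a line and a prediction at density 65, without anticipating the precise treatment of Chapter 22.

```python
import numpy as np

density = np.array([24.7, 24.8, 27.3, 28.4, 28.4, 29.0, 30.3, 32.7, 35.6,
                    38.5, 38.8, 39.3, 39.4, 39.9, 40.3, 40.6, 40.7, 40.7,
                    42.9, 45.8, 46.9, 48.2, 51.5, 51.5, 53.4, 56.0, 56.5,
                    57.3, 57.6, 59.2, 59.8, 66.0, 67.4, 68.8, 69.1, 69.1])
hardness = np.array([484, 427, 413, 517, 549, 648, 587, 704, 979,
                     914, 1070, 1020, 1210, 989, 1160, 1010, 1100, 1130,
                     1270, 1180, 1400, 1760, 1710, 2010, 1880, 1980, 1820,
                     2020, 1980, 2310, 1940, 3260, 2700, 2890, 2740, 3140])

slope, intercept = np.polyfit(density, hardness, deg=1)   # least-squares straight line
print(f"hardness is approximately {intercept:.0f} + {slope:.1f} * density")
print(f"prediction at density 65: {intercept + slope * 65:.0f}")
```

The printed prediction can be compared with the eyeball prediction of Quick exercise 15.7.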
Quick exercise 15.7 Suppose we have a eucalypt hardwood tree with den-
sity 65. What would your prediction be for the corresponding Janka hardness?
Fig. 15.12. Scatterplot of Janka hardness versus density of wood.
15.6 Solutions to the quick exercises
15.1 There are 272 elements in the dataset. The 91st and 182nd elements
of the ordered data divide the dataset in three groups, each consisting of 90
elements. From a closer look at Table 15.2 we find that these two elements
are 145 and 260.
15.2 In Table 15.2 one can easily count the number of observations in each
of the bins (90, 120], . . ., (300, 330]. The heights on each bin can be computed
by dividing the number of observations in each bin by 272·30 = 8160. We get
the following:
Bin Count Height Bin Count Height
(90, 120] 55 0.0067 (210, 240] 34 0.0042
(120,150] 37 0.0045 (240, 270] 75 0.0092
(150,180] 5 0.0006 (270, 300] 54 0.0066
(180,210] 9 0.0011 (300, 330] 3 0.0004
15.3 From Table 15.2 we see that we must cover an interval of length of at
least 306 − 96 = 210 with bins of width b = 3.49 · 68.48 · 272^{−1/3} = 36.89.
Since 210/36.89 = 5.69, we need at least six bins to cover the whole dataset.
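The arithmetic of this solution is easily reproduced; a minimal Python sketch, taking n = 272 and the sample standard deviation 68.48 from the solution above:

```python
import math

n = 272                           # number of Old Faithful observations
s = 68.48                         # sample standard deviation used above
data_range = 306 - 96             # length of the interval to be covered

b = 3.49 * s * n ** (-1 / 3)      # bin width
print(round(b, 2))                # 36.89
print(math.ceil(data_range / b))  # 6 bins are needed
```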
15.4 By means of formula (15.1), we can write
∫_{−∞}^{∞} fn,h(t) dt = (1/(nh)) Σ_{i=1}^{n} ∫_{−∞}^{∞} K((t − xi)/h) dt.
For any i = 1, . . . , n, we find by change of integration variables t = hu + xi that
∫_{−∞}^{∞} K((t − xi)/h) dt = h ∫_{−∞}^{∞} K(u) du = h,
where we also use condition (K1). This directly yields
∫_{−∞}^{∞} fn,h(t) dt = (1/(nh)) · n · h = 1.
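This computation can also be checked numerically. A minimal sketch, assuming NumPy and taking, for illustration, the triangular kernel K(u) = 1 − |u| on [−1, 1] (one possible kernel whose integral is 1); the dataset and bandwidth are arbitrary:

```python
import numpy as np

def K(u):
    """Triangular kernel; its integral over the real line equals 1 (condition (K1))."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 1.0 - np.abs(u), 0.0)

def f_nh(t, data, h):
    """Kernel density estimate f_{n,h}(t) = (1/(nh)) sum_i K((t - x_i)/h)."""
    data = np.asarray(data, dtype=float)
    return np.mean(K((t[:, None] - data[None, :]) / h), axis=1) / h

data = np.array([4.0, 3.0, 9.0, 1.0, 7.0])
h = 1.5
t = np.linspace(data.min() - h, data.max() + h, 4001)
dt = t[1] - t[0]
print(np.sum(f_nh(t, data, h)) * dt)       # numerically close to 1
```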
15.5 The kernel density estimate will be strictly positive between the min-
imum minus h and the maximum plus h. The bandwidth equals h = 1.06 ·
68.48 · 272^{−1/5} = 23.66. From Table 15.2, we see that this will be between
96 − 23.66 = 72.34 and 306 + 23.66 = 329.66.
15.6 By definition the number of elements less than or equal to 1.5 is
F300(1.5) · 300 = 210. Hence 90 elements are strictly greater than 1.5.
15.7 Just by drawing a straight line that seems to fit the datapoints well, the
authors predicted a Janka hardness of about 2700.
15.7 Exercises
15.1 In [33] Stephen Stigler discusses data from the Edinburgh Medical and
Surgical Journal (1817). These concern the chest circumference of 5732 Scot-
tish soldiers, measured in inches. The following information is given about the
histogram with bin width 1, the first bin starting at 32.5.
Bin Count Bin Count
(32.5, 33.5] 3 (40.5, 41.5] 935
(33.5, 34.5] 19 (41.5, 42.5] 646
(34.5, 35.5] 81 (42.5, 43.5] 313
(35.5, 36.5] 189 (43.5, 44.5] 168
(36.5, 37.5] 409 (44.5, 45.5] 50
(37.5, 38.5] 753 (45.5, 46.5] 18
(38.5, 39.5] 1062 (46.5, 47.5] 3
(39.5, 40.5] 1082 (47.5, 48.5] 1
Source: S.M. Stigler. The history of statistics – The measurement of uncer-
tainty before 1900. Cambridge, Massachusetts, 1986.
a. Compute the height of the histogram on each bin.
b. Make a sketch of the histogram. Would you view the dataset as being
symmetric or skewed?
15.2 Recall the example of the space shuttle Challenger in Section 1.4. The
following list contains the launch temperatures in degrees Fahrenheit during
previous takeoffs.
66 70 69 68 67 72 73 70 57 63 70 78
67 53 67 75 70 81 76 79 75 76 58
Source: Presidential commission on the space shuttle Challenger accident.
Report on the space shuttle Challenger accident. Washington, DC, 1986; table
on pages 129–131.
a. Compute the heights of a histogram with bin width 5, the first bin starting
at 50.
b. On January 28, 1986, during the launch of the space shuttle Challenger,
the temperature was 31 degrees Fahrenheit. Given the dataset of launch
temperatures of previous takeoffs, would you consider 31 as a representa-
tive launch temperature?
15.3  An article in Biometrika discusses an example about mine disasters
during the period from March 15, 1851, to March 22, 1962. A dataset has
been obtained of 190 recorded time intervals (in days) between successive
coal mine disasters involving ten or more men killed. The ordered data are
listed in Table 15.6.
Table 15.6. Number of days between successive coal mine disasters.
0 1 1 2 2 3 4 4 4 6
7 10 11 12 12 12 13 15 15 16
16 16 17 17 18 19 19 19 20 20
22 23 24 25 27 28 29 29 29 31
31 32 33 34 34 36 36 37 40 41
41 42 43 45 47 48 49 50 53 54
54 55 56 59 59 61 61 65 66 66
70 72 75 78 78 78 80 80 81 88
91 92 93 93 95 95 96 96 97 99
101 108 110 112 113 114 120 120 123 123
124 124 125 127 129 131 134 137 139 143
144 145 151 154 156 157 176 182 186 187
188 189 190 193 194 197 202 203 208 215
216 217 217 217 218 224 225 228 232 233
250 255 275 275 275 276 286 292 307 307
312 312 315 324 326 326 329 330 336 345
348 354 361 364 368 378 388 420 431 456
462 467 498 517 536 538 566 632 644 745
806 826 871 952 1205 1312 1358 1630 1643 2366
Source: R.G. Jarrett. A note on the intervals between coal mining disasters.
Biometrika, 66:191-193, 1979; by permission of the Biometrika Trustees.
a. Compute the height on each bin of the histogram with bins [0, 250],
(250, 500], . . ., (2250, 2500].
b. Make a sketch of the histogram. Would you view the dataset as being
symmetric or skewed?
15.4  The ordered software data (see also Table 15.3) are given in the fol-
lowing list.
0 0 0 2 4 6 8 9 10 10
10 12 15 15 16 21 22 24 26 30
30 31 33 36 44 50 55 58 65 68
75 77 79 81 88 91 97 100 108 108
112 113 114 115 120 122 129 134 138 143
148 160 176 180 193 193 197 227 232 233
236 242 245 255 261 263 281 290 296 300
300 325 330 357 365 369 371 379 386 422
445 446 447 452 457 482 529 529 543 600
648 670 700 707 724 729 748 790 810 816
828 843 860 865 868 875 943 948 983 990
1011 1045 1064 1071 1082 1146 1160 1222 1247 1351
1435 1461 1755 1783 1800 1864 1897 2323 2930 3110
3321 4116 5485 5509 6150
a. Compute the heights on each bin of the histogram with bins [0, 500],
(500, 1000], and so on.
b. Compute the value of the empirical distribution function in the endpoints
of the bins.
c. Check that the area under the histogram on bin (1000, 1500] is equal to
the increase Fn(1500) − Fn(1000) of the empirical distribution function
on this bin. Actually, this is true for each single bin (see Exercise 15.11).
15.5  Suppose we construct a histogram with bins [0,1], (1,3], (3,5], (5,8],
(8,11], (11,14], and (14,18]. Given are the values of the empirical distribution
function at the boundaries of the bins:
t 0 1 3 5 8 11 14 18
Fn(t) 0 0.225 0.445 0.615 0.735 0.805 0.910 1.000
Compute the height of the histogram on each bin.
15.6  Given is the following information about a histogram:
Bin Height
(0,2] 0.245
(2,4] 0.130
(4,7] 0.050
(7,11] 0.020
(11,15] 0.005
Compute the value of the empirical distribution function in the point t = 7.
15.7 In Exercise 15.2 a histogram was constructed for the Challenger data. On
which bin does the empirical distribution function have the largest increase?
15.8 Define a function K by
K(u) = cos(πu) for − 1 ≤ u ≤ 1
and K(u) = 0 elsewhere. Check whether K satisfies the conditions (K1)–(K3)
for a kernel function.
15.9 On the basis of the duration of an eruption of the Old Faithful geyser,
park rangers try to predict the waiting time to the next eruption. In Fig-
ure 15.13 a scatterplot is displayed of the duration and the time to the next
eruption in seconds.
a. Does the scatterplot give reason to believe that the duration of an eruption
influences the time to the next eruption?
Fig. 15.13. Scatterplot of the Old Faithful data.
b. Suppose you have just observed an eruption that lasted 250 seconds. What
would you predict for the time to the next eruption?
c. The dataset of durations shows two modes, i.e., there are two places where
the data accumulate (see, for instance, the histogram in Figure 15.1). How
many modes does the dataset of waiting times show?
15.10 Figure 15.14 displays the graph of an empirical distribution function
of a dataset consisting of 200 elements. How many modes does the dataset
show?
Fig. 15.14. Empirical distribution function.
15.11  Given is a histogram and the empirical distribution function Fn of
the same dataset. Show that the height of the histogram on a bin (a, b] is
equal to
(Fn(b) − Fn(a)) / (b − a).
15.12  Let fn,h be a kernel estimate. As mentioned in Section 15.3, fn,h
itself is a probability density.
a. Show that the corresponding expectation is equal to
∫_{−∞}^{∞} t fn,h(t) dt = x̄n.
Hint: you might consult the solution to Quick exercise 15.4.
b. Show that the second moment corresponding to fn,h satisfies
∫_{−∞}^{∞} t^2 fn,h(t) dt = (1/n) Σ_{i=1}^{n} xi^2 + h^2 ∫_{−∞}^{∞} u^2 K(u) du.
16
Exploratory data analysis: numerical summaries
The classical way to describe important features of a dataset is to give several
numerical summaries. We discuss numerical summaries for the center of a
dataset and for the amount of variability among the elements of a dataset, and
then we introduce the notion of quantiles for a dataset. To distinguish these
quantities from corresponding notions for probability distributions of random
variables, we will often add the word sample or empirical; for instance, we will
speak of the sample mean and empirical quantiles. We end this chapter with
the boxplot, which combines some of the numerical summaries in a graphical
display.
16.1 The center of a dataset
The best-known method to identify the center of a dataset is to compute the
sample mean
x̄n = (x1 + x2 + · · · + xn)/n.    (16.1)
For the sake of notational convenience we will sometimes drop the subscript n
and write x̄ instead of x̄n. The following dataset consists of hourly tempera-
tures in degrees Fahrenheit (rounded to the nearest integer), recorded at Wick
in northern Scotland from 5 p.m. December 31, 1960, to 3 a.m. January 1,
1961. The sample mean of the 11 measurements is equal to 44.7.
43 43 41 41 41 42 43 58 58 41 41
Source: V. Barnett and T. Lewis. Outliers in statistical data. Third edition,
1994. John Wiley  Sons Limited. Reproduced with permission.
Another way to identify the center of a dataset is by means of the sample
median, which we will denote by Med(x1, x2, . . . , xn) or briefly Medn. The
sample median is defined as the middle element of the dataset when it is put
in ascending order. When n is odd, it is clear what this means. When n is even,
we take the average of the two middle elements. For the Wick temperature
data the sample median is equal to 42.
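Both summaries are computed routinely by software; a minimal sketch, assuming NumPy, for the (uncorrected) Wick temperature data:

```python
import numpy as np

wick = np.array([43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41])
print(wick.mean())        # sample mean, about 44.7
print(np.median(wick))    # sample median, 42.0
```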
Quick exercise 16.1 Compute the sample mean and sample median of the
dataset
4.6 3.0 3.2 4.2 5.0.
Both methods have pros and cons. The sample mean is the natural analogue
for a dataset of what the expectation is for a probability distribution. However,
it is very sensitive to outliers, by which we mean observations in the dataset
that deviate a lot from the bulk of the data.
To illustrate the sensitivity of the sample mean, consider the Wick tempera-
ture data displayed in Figure 16.1. The values 58 and 58 recorded at midnight
and 1 a.m. are clearly far from the bulk of the data and give grounds for
concern whether they are genuine (58 degrees Fahrenheit seems very warm
at midnight for New Year’s in northern Scotland). To investigate their effect
on the sample mean we compute the average of the data, leaving out these
measurements, which gives 41.8 (instead of 44.7). The sample median of the
data is equal to 41 (instead of 42) when leaving out the measurements with
value 58. The median is more robust in the sense that it is hardly affected by
a few outliers.
Fig. 16.1. The Wick temperature data.
It should be emphasized that this discussion is only meant to illustrate the
sensitivity of the sample mean and by no means is intended to suggest we leave
out measurements that deviate a lot from the bulk of the data! It is important
to be aware of the presence of an outlier. In that case, one could try to find out
whether there is perhaps something suspicious about this measurement. This
might lead to assigning a smaller weight to such a measurement or even to
removing it from the dataset. However, sometimes it is possible to reconstruct
the exact circumstances and correct the measurement. For instance, after
further inquiry in the temperature example it turned out that at midnight
the meteorological office changed its recording unit from degrees Fahrenheit
to 1/10th degree Celsius (so 58 and 41 should read 5.8°C and 4.1°C). The
corrected values in degrees Fahrenheit (to the nearest integer) are
43 43 41 41 41 42 43 42 42 39 39.
For the corrected data the sample mean is 41.5 and the sample median is 42.
Quick exercise 16.2 Consider the same dataset as in Quick exercise 16.1.
Suppose that someone misreads the dataset as
4.6 30 3.2 4.2 50.
Compute the sample mean and sample median and compare these values with
the ones you found in Quick exercise 16.1.
16.2 The amount of variability of a dataset
To quantify the amount of variability among the elements of a dataset, one
often uses the sample variance defined by
sn^2 = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄n)^2.
Up to a scaling factor this is equal to the average squared deviation from x̄n.
At first sight, it seems more natural to define the sample variance by
s̃n^2 = (1/n) Σ_{i=1}^{n} (xi − x̄n)^2.
Why we choose the factor 1/(n−1) instead of 1/n will be explained later (see
Chapter 19). Because sn^2 is in different units from the elements of the dataset,
one often prefers the sample standard deviation
sn = √( (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄n)^2 ),
which is measured in the same units as the elements of the dataset itself.
Just as the sample mean, the sample standard deviation is very sensitive to
outliers. For the (uncorrected) Wick temperature data the sample standard
deviation is 6.62, or 0.97 if we leave out the two measurements with value 58.
For the corrected data the standard deviation is 1.44. A more robust measure
of variability is the median of absolute deviations or MAD, which is defined
as follows. Consider the absolute deviation of every element xi with respect
to the sample median:
|xi − Med(x1, x2, . . . , xn)|
or briefly
|xi − Medn|.
The MAD is obtained by taking the median of all these absolute deviations
MAD(x1, x2, . . . , xn) = Med(|x1 − Medn|, . . . , |xn − Medn|). (16.2)
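A minimal sketch, assuming NumPy, that computes the sample standard deviation and the MAD for the (uncorrected) Wick temperature data:

```python
import numpy as np

wick = np.array([43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41], dtype=float)

sample_variance = np.sum((wick - wick.mean()) ** 2) / (len(wick) - 1)
sample_std = np.sqrt(sample_variance)            # about 6.62

mad = np.median(np.abs(wick - np.median(wick)))  # equal to 1

print(sample_std, mad)
```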
Quick exercise 16.3 Compute the sample standard deviation for the dataset
of Quick exercise 16.1 for which it is given that the values of xi − x̄n are:
−1.0, 0.6, −0.8, 0.2, 1.0.
Also compute the MAD for this dataset.
Just as the sample median, the MAD is hardly affected by outliers. For the
(uncorrected) Wick temperature data the MAD is 1 and equal to 0 if we leave
out the two measurements with value 58 (the value 0 seems a bit strange,
but is a consequence of the fact that the observations are given in degrees
Fahrenheit rounded to the nearest integer). For the corrected data the MAD
is 1.
Quick exercise 16.4 Compute the sample standard deviation for the mis-
read dataset of Quick exercise 16.2 for which it is given that the values of
xi − x̄n are:
11.6, −13.8, −15.2, −14.2, 31.6.
Also compute the MAD for this dataset and compare both values with the
ones you found in Quick exercise 16.3.
16.3 Empirical quantiles, quartiles, and the IQR
The sample median divides the dataset in two more or less equal parts: about
half of the elements are less than the median and about half of the elements
are greater than the median. More generally, we can divide the dataset in
two parts in such a way that a proportion p is less than a certain number
and a proportion 1 − p is greater than this number. Such a number is called
the 100pth empirical percentile or the pth empirical quantile and is denoted by
qn(p). For a suitable introduction of empirical quantiles we need the notion
of order statistics.
The order statistics consist of the same elements as in the original dataset
x1, x2, . . . , xn, but in ascending order. Denote by x(k) the kth element in the
ordered list. Then
x(1) ≤ x(2) ≤ · · · ≤ x(n)
are called the order statistics of x1, x2, . . . , xn. The order statistics of the Wick
temperature data are
41 41 41 41 41 42 43 43 43 58 58.
Note that by putting the elements in order, it is possible that successive order
statistics are the same, for instance, x(1) = · · · = x(5) = 41. Another example
is Table 15.2, which lists the order statistics of the Old Faithful dataset.
To compute empirical quantiles one linearly interpolates between order statis-
tics of the dataset. Let 0 < p < 1, and suppose we want to compute the pth
empirical quantile for a dataset x1, x2, . . . , xn. The following computation is
based on requiring that the ith order statistic is the i/(n + 1) quantile. If we
denote the integer part of a by ⌊a⌋, then the computation of qn(p) runs as
follows:
qn(p) = x(k) + α(x(k+1) − x(k))
with k = ⌊p(n + 1)⌋ and α = p(n + 1) − k. On the left in Figure 16.2 the
relation between the pth quantile and the empirical distribution function is
illustrated for the Old Faithful data.
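The interpolation rule translates directly into code. A minimal sketch, assuming NumPy; the dataset of Quick exercise 16.1 is used as an example and the function name qn is chosen here only for illustration:

```python
import numpy as np

def qn(data, p):
    """Empirical quantile q_n(p) by the interpolation rule above
    (valid when 1 <= k < n, i.e. for 1/(n+1) <= p < n/(n+1))."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    k = int(np.floor(p * (n + 1)))
    alpha = p * (n + 1) - k
    return x[k - 1] + alpha * (x[k] - x[k - 1])   # x[k-1] is the k-th order statistic

data = [4.6, 3.0, 3.2, 4.2, 5.0]
print(qn(data, 0.4))    # k = 2, alpha = 0.4, so q_5(0.4) = 3.2 + 0.4 * (4.2 - 3.2) = 3.6
```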
Fig. 16.2. Empirical quantile and quartiles for the Old Faithful data.
Quick exercise 16.5 Compute the 55th empirical percentile for the Wick
temperature data.
Lower and upper quartiles
Instead of identifying only the center of the dataset, Tukey [35] suggested
giving a five-number summary of the dataset: the minimum, the maximum,
the sample median, and the 25th and 75th empirical percentiles. The 25th
empirical percentile qn(0.25) is called the lower quartile and the 75th empirical
percentile qn(0.75) is called the upper quartile. Together with the median, the
lower and upper quartiles divide the dataset in four more or less equal parts
consisting of about one quarter of the number of elements. The relation of
the two quartiles and the median with the empirical distribution function is
illustrated for the Old Faithful data on the right of Figure 16.2. The distance
between the lower quartile and the median, relative to the distance between
the upper quartile and the median, gives some indication on the skewness of
the dataset. The distance between the upper and lower quartiles is called the
interquartile range, or IQR:
IQR = qn(0.75) − qn(0.25).
The IQR specifies the range of the middle half of the dataset. It could also
serve as a robust measure of the amount of variability among the elements of
the dataset. For the Old Faithful data the five-number summary is
Minimum Lower quartile Median Upper quartile Maximum
96 129.25 240 267.75 306
and the IQR is 138.5.
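Computing a five-number summary from the quantile rule of Section 16.3 takes only a few lines; a minimal sketch, assuming NumPy and using the small dataset of Quick exercise 16.1 as an example:

```python
import numpy as np

def qn(data, p):
    """Empirical quantile by the interpolation rule of Section 16.3."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    k = int(np.floor(p * (n + 1)))
    alpha = p * (n + 1) - k
    return x[k - 1] + alpha * (x[k] - x[k - 1])

data = [4.6, 3.0, 3.2, 4.2, 5.0]
lower, median, upper = (qn(data, p) for p in (0.25, 0.50, 0.75))
print(min(data), lower, median, upper, max(data))   # 3.0, 3.1, 4.2, 4.8, 5.0 (up to rounding)
print("IQR:", upper - lower)                        # about 1.7
```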
Quick exercise 16.6 Compute the five-number summary for the (uncor-
rected) Wick temperature data.
16.4 The box-and-whisker plot
Tukey [35] also proposed visualizing the five-number summary discussed in
the previous section by a so-called box-and-whisker plot, briefly boxplot. Fig-
ure 16.3 displays a boxplot. The data are now on the vertical axis, where we
left out the numbers on the axis in order to explain the construction of the
figure. The horizontal width of the box is irrelevant. In the vertical direction
the box extends from the lower to the upper quartile, so that the height of the
box is precisely the IQR. The horizontal line inside the box corresponds to the
sample median. Up from the upper quartile we measure out a distance of 1.5
times the IQR and draw a so-called whisker up to the largest observation that
lies within this distance, where we put a horizontal line. Similarly, down from
the lower quartile we measure out a distance of 1.5 times the IQR and draw
a whisker to the smallest observation that lies within this distance, where
we also put a horizontal line. All other observations beyond the whiskers are
marked by ◦. Such an observation is called an outlier.
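Statistical software draws boxplots along these lines; a minimal sketch, assuming matplotlib and using the (uncorrected) Wick temperature data. Note that matplotlib's default quartile computation may differ slightly from the interpolation rule of this book.

```python
import matplotlib.pyplot as plt

wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

# Box from the lower to the upper quartile, whiskers at most 1.5 times the IQR
# beyond the box, remaining observations drawn individually as outliers.
plt.boxplot(wick, whis=1.5, flierprops={"marker": "o", "fillstyle": "none"})
plt.ylabel("Temperature")
plt.show()
```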
Fig. 16.3. A boxplot.
In Figure 16.4 the boxplots of the Old Faithful data and of the software relia-
bility data (see also Chapter 15) are displayed. The skewness of the software
reliability data produces a boxplot with whiskers of very different length and
with several observations beyond the upper quartile plus 1.5 times the IQR.
The boxplot of the Old Faithful data illustrates one of the shortcomings of the
boxplot; it does not capture the fact that the data show two separate peaks.
However, the position of the sample median inside the box does suggest that
the dataset is skewed.
Quick exercise 16.7 Suppose we want to construct a boxplot of the (uncor-
rected) Wick temperature data. What is the height of the box, the length of
both whiskers, and which measurements fall outside the box and whiskers?
Would you consider the two values 58 extreme outliers?
Fig. 16.4. Boxplot of the Old Faithful data and the software data.
Using boxplots to compare several datasets
Although the boxplot provides some information about the structure of the
data, such as center, range, skewness or symmetry, it is a poor graphical
display of the dataset. Graphical summaries such as the histogram and kernel
density estimate are more informative displays of a single dataset. Boxplots
become useful if we want to compare several sets of data in a simple graphical
display. In Figure 16.5 boxplots are displayed of the average drill time for
dry and wet drilling up to a depth of 250 feet for the drill data discussed in
Section 15.5 (see also Table 15.4). It is clear that the boxplot corresponding
to dry drilling differs from that corresponding to wet drilling. However, the
question is whether this difference can still be attributed to chance or is caused
by the drilling technique used. We will return to this type of question in
Chapter 25.
Fig. 16.5. Boxplot of average drill times.
16.5 Solutions to the quick exercises
16.1 The average is
x̄n = (4.6 + 3.0 + 3.2 + 4.2 + 5.0)/5 = 20/5 = 4.
The median is the middle element of 3.0, 3.2, 4.2, 4.6, and 5.0, which gives
Medn = 4.2.
16.2 The average is
x̄n = (4.6 + 30 + 3.2 + 4.2 + 50)/5 = 90/5 = 18,
which differs by 14.4 from the average we found in Quick exercise 16.1. The
median is the middle element of 3.2, 4.2, 4.6, 30, and 50. This gives Medn =
4.6, which differs by only 0.4 from the median we found in Quick exercise 16.1.
As one can see, the median is hardly affected by the two outliers.
16.3 The sample variance is
sn^2 = ((−1)^2 + (0.6)^2 + (−0.8)^2 + (0.2)^2 + (1.0)^2)/(5 − 1) = 3.04/4 = 0.76
so that the sample standard deviation is sn = √0.76 = 0.872. The median is
4.2, so that the absolute deviations from the median are given by
0.4 1.2 1.0 0.0 0.8.
The MAD is the median of these numbers, which is 0.8.
16.4 The sample variance is
sn^2 = ((11.6)^2 + (−13.8)^2 + (−15.2)^2 + (−14.2)^2 + (31.6)^2)/(5 − 1) = 1756.24/4 = 439.06
so that the sample standard deviation is sn = √439.06 = 20.95, which is a
difference of 20.08 from the value we found in Quick exercise 16.3. The median
is 4.6, so that the absolute deviations from the median are given by
0.0 25.4 1.4 0.4 45.4.
The MAD is the median of these numbers, which is 1.4. Just as the median,
the MAD is hardly affected by the two outliers.
16.5 We have k = ⌊0.55 · 12⌋ = ⌊6.6⌋ = 6, so that α = 0.6. This gives
qn(0.55) = x(6) + 0.6 · (x(7) − x(6)) = 42 + 0.6 · (43 − 42) = 42.6.
16.6 From the order statistics of the Wick temperature data
41 41 41 41 41 42 43 43 43 58 58
it can be seen immediately that minimum, maximum, and median are given by
41, 58, and 42. For the lower quartile we have k = ⌊0.25 · 12⌋ = 3, so that α = 0
and qn(0.25) = x(3) = 41. For the upper quartile we have k = ⌊0.75 · 12⌋ = 9,
so that again α = 0 and qn(0.75) = x(9) = 43. Hence for the Wick temperature
data the five-number summary is
Minimum Lower quartile Median Upper quartile Maximum
41 41 42 43 58
16.7 From the five-number summary for the Wick temperature data (see
Quick exercise 16.6), it follows immediately that the height of the box is the
IQR: 43 − 41 = 2. If we measure out a distance of 1.5 times 2 down from the
lower quartile 41, we see that the smallest observation within this range is
41, which means that the lower whisker has length zero. Similarly, the upper
whisker has length zero. The two measurements with value 58 are outside the
box and whiskers. The two values 58 are clearly far away from the bulk of the
data and should be considered extreme outliers.
16.6 Exercises
16.1  Use the order statistics of the software data as given in Exercise 15.4
to answer the following questions.
a. Compute the sample median.
b. Compute the lower and upper quartiles and the IQR.
c. Compute the 37th empirical percentile.
16.2 Compute for the Old Faithful data the distance of the lower and upper
quartiles to the median and explain the difference.
16.3  Recall the example about the space shuttle Challenger in Section 1.4.
The following table lists the order statistics of launch temperatures during
take-offs in degrees Fahrenheit, including the launch temperature on Jan-
uary 28, 1986.
31 53 57 58 63 66 67 67 67 68 69 70
70 70 70 72 73 75 75 76 76 78 79 81
a. Find the sample median and the lower and upper quartiles.
b. Sketch the boxplot of this dataset.
c. On January 28, 1986, the launch temperature was 31 degrees Fahrenheit.
Comment on the value 31 with respect to the other data points.
16.4  The sample mean and sample median of the uncorrected Wick tem-
perature data (in degrees Fahrenheit) are 44.7 and 42. We transform the data
from degrees Fahrenheit (xi) to degrees Celsius (yi) by means of the formula
yi = (5/9)(xi − 32),
which gives the following dataset:
55/9 55/9 5 5 5 50/9 55/9 130/9 130/9 5 5.
a. Check that ȳn = (5/9)(x̄n − 32).
b. Is it also true that Med(y1, . . . , yn) = (5/9)(Med(x1, . . . , xn) − 32)?
c. Suppose we have a dataset x1, x2, . . . , xn and construct y1, y2, . . . , yn
where yi = axi + b with a and b being real numbers. Do similar rela-
tions hold for the sample mean and sample median? If so, state them.
16.5 Consider the uncorrected Wick temperature data in degrees Fahrenheit
(xi) and the corresponding temperatures in degrees Celsius (yi) as given in
Exercise 16.4. The sample standard deviation and the MAD for the Wick data
are 6.62 and 1.
a. Let sF and sC denote the sample standard deviations of x1, x2, . . . , xn
and y1, y2, . . . , yn respectively. Check that sC = (5/9) sF.
b. Let MADF and MADC denote the MAD of x1, x2, . . . , xn and y1, y2, . . . , yn
respectively. Is it also true that MADC = (5/9) MADF?
c. Suppose we have a dataset x1, x2, . . . , xn and construct y1, y2, . . . , yn
where yi = axi + b with a and b being real numbers. Do similar rela-
tions hold for the sample standard deviation and the MAD? If so, state
them.
16.6  Consider two datasets: 1, 5, 9 and 2, 4, 6, 8.
a. Denote the sample means of the two datasets by x̄ and ȳ. Is it true that the
average (x̄ + ȳ)/2 of x̄ and ȳ is equal to the sample mean of the combined
dataset with 7 elements?
b. Suppose we have two other datasets: one of size n with sample mean
x̄n and another dataset of size m with sample mean ȳm. Is it always
true that the average (x̄n + ȳm)/2 of x̄n and ȳm is equal to the sample
mean of the combined dataset with n + m elements? If no, then provide
a counterexample. If yes, then explain this.
c. If m = n, is (x̄n +ȳm)/2 equal to the sample mean of the combined dataset
with n + m elements?
16.7 Consider the two datasets from Exercise 16.6.
a. Denote the sample medians of the two datasets by Medx and Medy. Is it
true that the sample median (Medx +Medy)/2 of the two sample medians
is equal to the sample median of the combined dataset with 7 elements?
b. Suppose we have two other datasets: one of size n with sample median
Medx and another dataset of size m with sample median Medy. Is it
always true that the sample median (Medx + Medy)/2 of the two sample
medians is equal to the sample median of the combined dataset with n+m
elements? If no, then provide a counterexample. If yes, then explain this.
c. What if m = n?
16.8  Compute the MAD for the combined dataset of 7 elements from Ex-
ercise 16.6.
16.9 Consider a dataset x1, x2, . . . , xn with xi ≠ 0. We construct a second
dataset y1, y2, . . . , yn, where
yi = 1/xi.
a. Suppose dataset x1, x2, . . . , xn consists of −6, 1, 15. Is it true that ȳ3 =
1/x̄3?
b. Suppose that n is odd. Is it true that ȳn = 1/x̄n?
c. Suppose that n is odd and each xi > 0. Is it true that Med(y1, . . . , yn) =
1/Med(x1, . . . , xn)? What about when n is even?
16.10  A method to investigate the sensitivity of the sample mean and the
sample median to extreme outliers is to replace one or more elements in a
given dataset by a number y and investigate the effect when y goes to infinity.
To illustrate this, consider the dataset from Quick exercise 16.1:
4.6 3.0 3.2 4.2 5.0
with sample mean 4 and sample median 4.2.
a. We replace the element 3.2 by some real number y. What happens with
the sample mean and the sample median of this new dataset as y → ∞?
b. We replace a number of elements by some real number y. How many
elements do we need to replace so that the sample median of the new
dataset goes to infinity as y → ∞?
c. Suppose we have another dataset of size n. How many elements do we
need to replace by some real number y, so that the sample mean of the
new dataset goes to infinity as y → ∞? And how many elements do we
need to replace, so that the sample median of the new dataset goes to
infinity?
16.11 Just as in Exercise 16.10 we investigate the sensitivity of the sample
standard deviation and the MAD to extreme outliers, by considering the same
dataset with sample standard deviation 0.872 and MAD equal to 0.8. Answer
the same three questions for the sample standard deviation and the MAD
instead of the sample mean and sample median.
16.12  Compute the sample mean and sample median for the dataset
1, 2, . . ., N
in case N is odd and in case N is even. You may use the fact that
1 + 2 + · · · + N = N(N + 1)/2.
16.13 Compute the sample standard deviation and MAD for the dataset
−N, . . . , −1, 0, 1, . . ., N.
You may use the fact that
1^2 + 2^2 + · · · + N^2 = N(N + 1)(2N + 1)/6.
16.14 Check that the 50th empirical percentile is the sample median.
16.15  The following rule is useful for the computation of the sample vari-
ance (and standard deviation). Show that
(1/n) Σ_{i=1}^{n} (xi − x̄n)^2 = ( (1/n) Σ_{i=1}^{n} xi^2 ) − (x̄n)^2,
where x̄n = (Σ_{i=1}^{n} xi)/n.
16.16 Recall Exercise 15.12, where we computed the mean and second mo-
ment corresponding to a density estimate fn,h. Show that the variance corre-
sponding to fn,h satisfies:
∫_{−∞}^{∞} t^2 fn,h(t) dt − ( ∫_{−∞}^{∞} t fn,h(t) dt )^2 = (1/n) Σ_{i=1}^{n} (xi − x̄n)^2 + h^2 ∫_{−∞}^{∞} u^2 K(u) du.
16.17 Suppose we have a dataset x1, x2, . . . , xn. Check that if p = i/(n + 1)
the pth empirical quantile is the ith order statistic.
17
Basic statistical models
In this chapter we introduce a common statistical model. It corresponds to
the situation where the elements of the dataset are repeated measurements
of the same quantity and where different measurements do not influence each
other. Next, we discuss the probability distribution of the random variables
that model the measurements and illustrate how sample statistics can help
to select a suitable statistical model. Finally, we discuss the simple linear
regression model that corresponds to the situation where the elements of the
dataset are paired measurements.
17.1 Random samples and statistical models
In Chapter 1 we briefly discussed Michelson’s experiment conducted between
June 5 and July 2 in 1879, in which 100 measurements were obtained on the
speed of light. The values are given in Table 17.1 and represent the speed
of light in air in km/sec minus 299 000. The variation among the 100 values
suggests that measuring the speed of light is subject to random influences. As
we have seen before, we describe random phenomena by means of a probability
model, i.e., we interpret the outcome of an experiment as a realization of
some random variable. Hence the first measurement is modeled by a random
variable X1 and the value 850 is interpreted as the realization of X1. Similarly,
the second measurement is modeled by a random variable X2 and the value 740
is interpreted as the realization of X2. Since both measurements are obtained
under the same experimental conditions, it is justified to assume that the
probability distributions of X1 and X2 are the same. More generally, the 100
measurements are modeled by random variables
X1, X2, . . . , X100
with the same probability distribution, and the values in Table 17.1 are inter-
preted as realizations of X1, X2, . . . , X100. Moreover, because we believe that
Table 17.1. Michelson data on the speed of light.
850 740 900 1070 930 850 950 980 980 880
1000 980 930 650 760 810 1000 1000 960 960
960 940 960 940 880 800 850 880 900 840
830 790 810 880 880 830 800 790 760 800
880 880 880 860 720 720 620 860 970 950
880 910 850 870 840 840 850 840 840 840
890 810 810 820 800 770 760 740 750 760
910 920 890 860 880 720 840 850 850 780
890 840 780 810 760 810 790 810 820 850
870 870 810 740 810 940 950 800 810 870
Source: E.N. Dorsey. The velocity of light. Transactions of the American
Philosophical Society. 34(1):1-110, 1944; Table 22 on pages 60-61.
Michelson took great care not to have the measurements influence each other,
the random variables X1, X2, . . . , X100 are assumed to be mutually indepen-
dent (see also Remark 3.1 about physical and stochastic independence). Such
a collection of random variables is called a random sample or briefly, sample.
Random sample. A random sample is a collection of random vari-
ables X1, X2, . . . , Xn, that have the same probability distribution
and are mutually independent.
If F is the distribution function of each random variable Xi in a random
sample, we speak of a random sample from F. Similarly, we speak of a random
sample from a density f, a random sample from an N(µ, σ^2) distribution, etc.
Quick exercise 17.1 Suppose we have a random sample X1, X2 from a dis-
tribution with variance 1. Compute the variance of X1 + X2.
Properties that are inherent to the random phenomenon under study may
provide additional knowledge about the distribution of the sample. Recall
the software data discussed in Chapter 15. The data are observed lengths in
CPU seconds between successive failures that occur during the execution of
a certain real-time command. Typically, in a situation like this, in a small
time interval, either 0 or 1 failure occurs. Moreover, failures occur with small
probability and in disjoint time intervals failures occur independent of each
other. In addition, let us assume that the rate at which the failures occur
is constant over time. According to Chapter 12, this justifies the choice of
a Poisson process to model the series of failures. From the properties of the
Poisson process we know that the interfailure times are independent and have
the same exponential distribution. Hence we model the software data as the
realization of a random sample from an exponential distribution.
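One can also turn such a model around and simulate a random sample from it; a minimal sketch, assuming NumPy and a purely hypothetical failure rate (the value below is illustrative and not estimated from the software data):

```python
import numpy as np

rng = np.random.default_rng(1)

lam = 1 / 650.0      # hypothetical failure rate per CPU second (illustrative only)
n = 135

# A random sample of size n from an Exp(lambda) model distribution:
# independent random variables with the same exponential distribution.
interfailure_times = rng.exponential(scale=1 / lam, size=n)
print(interfailure_times.mean())   # should be in the neighborhood of 1/lambda = 650
```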
In some cases we may not be able to specify the type of distribution. Take, for
instance, the Old Faithful data consisting of observed durations of eruptions
of the Old Faithful geyser. Due to lack of specific geological knowledge about
the subsurface and the mechanism that governs the eruptions, we prefer not to
assume a particular type of distribution. However, we do model the durations
as the realization of a random sample from a continuous distribution on (0, ∞).
In each of the three examples the dataset was obtained from repeated mea-
surements performed under the same experimental conditions. The basic sta-
tistical model for such a dataset is to consider the measurements as a random
sample and to interpret the dataset as the realization of the random sample.
Knowledge about the phenomenon under study and the nature of the experi-
ment may lead to partial specification of the probability distribution of each
Xi in the sample. This should be included in the model.
Statistical model for repeated measurements. A dataset
consisting of values x1, x2, . . . , xn of repeated measurements of the
same quantity is modeled as the realization of a random sample
X1, X2, . . . , Xn. The model may include a partial specification of
the probability distribution of each Xi.
The probability distribution of each Xi is called the model distribution. Usu-
ally it refers to a collection of distributions: in the Old Faithful example to
the collection of all continuous distributions on (0, ∞), in the software ex-
ample to the collection of all exponential distributions. In the latter case the
parameter of the exponential distribution is called the model parameter. The
unique distribution from which the sample actually originates is assumed to
be one particular member of this collection and is called the “true” distribu-
tion. Similarly, in the software example, the parameter corresponding to the
“true” exponential distribution is called the “true” parameter. The word true
is put between quotation marks because it does not refer to something in the
real world, but only to a distribution (or parameter) in the statistical model,
which is merely an approximation of the real situation.
Quick exercise 17.2 We obtain a dataset of ten elements by tossing a coin
ten times and recording the result of each toss. What is an appropriate sta-
tistical model and corresponding model distribution for this dataset?
Of course there are situations where the assumption of independence or identi-
cal distributions is unrealistic. In that case a different statistical model would
be more appropriate. However, we will restrict ourselves mainly to the case
where the dataset can be modeled as the realization of a random sample.
Once we have formulated a statistical model for our dataset, we can use the
dataset to infer knowledge about the model distribution. Important questions
about the corresponding model distribution are
• which feature of the model distribution represents the quantity of interest
  and how do we use our dataset to determine a value for this?
• which model distribution fits a particular dataset best?
These questions can be diverse, and answering them may be difficult. For
instance, the Old Faithful data are modeled as a realization of a random
sample from a continuous distribution. Suppose we are interested in a complete
characterization of the “true” distribution, such as the distribution function
F or the probability density f. Since there are no further specifications about
the type of distribution, our problem would be to estimate the complete curve
of F or f on the basis of our dataset.
On the other hand, the software data are modeled as the realization of a
random sample from an exponential distribution. In that case F and f are
completely characterized by a single parameter λ:
F(x) = 1 − e^{−λx} and f(x) = λe^{−λx} for x ≥ 0.
Even if we are interested in the curves of F and f, our problem would reduce
to estimating a single parameter on the basis of our dataset.
In other cases we may not be interested in the distribution as a whole, but
only in a specific feature of the model distribution that represents the quantity
of interest. For instance, in a physical experiment, such as the one performed
by Michelson, one usually thinks of each measurement as
measurement = quantity of interest + measurement error.
The quantity of interest, in this case the speed of light, is thought of as being
some (unknown) constant and the measurement error is some random fluc-
tuation. In the absence of systematic error, the measurement error can be
modeled by a random variable with zero expectation and finite variance. In
that case the measurements are modeled by a random sample from a distribu-
tion with some unknown expectation and finite variance. The speed of light is
represented by the expectation of the model distribution. Our problem would
be to estimate the expectation of the model distribution on the basis of our
dataset.
In the remaining chapters, we will develop several statistical methods to infer
knowledge about the “true” distribution or about a specific feature of it, by
means of a dataset. In the remainder of this chapter we will investigate how
the graphical and numerical summaries of our dataset can serve as a first
indication of what an appropriate choice would be for this distribution or for
a specific feature, such as its expectation.
17.2 Distribution features and sample statistics
In Chapters 15 and 16 we have discussed several empirical summaries of
datasets. They are examples of numbers, curves, and other objects that are a
function
h(x1, x2, . . . , xn)
of the dataset x1, x2, . . . , xn only. Since datasets are modeled as realizations
of random samples X1, X2, . . . , Xn, an object h(x1, x2, . . . , xn) is a realization
of the corresponding random object
h(X1, X2, . . . , Xn).
Such an object, which depends on the random sample X1, X2, . . . , Xn only, is
called a sample statistic.
If a statistical model adequately describes the dataset at hand, then the sample
statistics corresponding to the empirical summaries should somehow reflect
corresponding features of the model distribution. We have already seen a
mathematical justification for this in Chapter 13 for the sample statistic
X̄n = (X1 + X2 + · · · + Xn)/n,
based on a sample X1, X2, . . . , Xn from a probability distribution with expec-
tation µ. According to the law of large numbers,
lim_{n→∞} P(|X̄n − µ| > ε) = 0
for every ε > 0. This means that for large sample size n, the sample mean
of most realizations of the random sample is close to the expectation of the
corresponding distribution. In fact, all sample statistics discussed in Chap-
ters 15 and 16 are close to corresponding distribution features. To illustrate
this we generate an artificial dataset from a normal distribution with pa-
rameters µ = 5 and σ = 2, using a technique similar to the one described
in Section 6.2. Next, we compare the sample statistics with corresponding
features of this distribution.
The empirical distribution function
Let X1, X2, . . . , Xn be a random sample from distribution function F, and let
Fn(a) = (number of Xi in (−∞, a]) / n
be the empirical distribution function of the sample. Another application of
the law of large numbers (see Exercise 13.7) yields that for every ε > 0,
lim_{n→∞} P(|Fn(a) − F(a)| > ε) = 0.
This means that for most realizations of the random sample the empirical
distribution function Fn is close to F:
Fn(a) ≈ F(a).
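This approximation is easy to check numerically. The following sketch is an addition to the text; it assumes Python with NumPy and SciPy and simply compares Fn and F at a few points for samples from the N(5, 4) distribution used in this section.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 5.0, 2.0          # the N(5, 4) distribution: expectation 5, variance 4

def ecdf(sample, a):
    # empirical distribution function: fraction of observations <= a
    return np.mean(sample <= a)

for n in (20, 200, 2000):
    x = rng.normal(mu, sigma, size=n)
    for a in (3.0, 5.0, 7.0):
        print(f"n={n:5d}  a={a}  Fn(a)={ecdf(x, a):.3f}  F(a)={norm.cdf(a, mu, sigma):.3f}")

For larger n the two columns agree more closely, which is what Figure 17.1 shows graphically.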
Fig. 17.1. Empirical distribution functions of normal samples.
Hence the empirical distribution function of the normal dataset should resem-
ble the distribution function
F(a) = ∫_{−∞}^{a} (1/(2√(2π))) e^{−(1/2)((x−5)/2)²} dx
of the N(5, 4) distribution, and the fit should become better as the sample size
n increases. An illustration of this can be found in Figure 17.1. We displayed
the empirical distribution functions of datasets generated from an N(5, 4)
distribution together with the “true” distribution function F (dotted lines),
for sample sizes n = 20 (left) and n = 200 (right).
The histogram and the kernel density estimate
Suppose the random sample X1, X2, . . . , Xn is generated from a continuous
distribution with probability density f. In Section 13.4 we have seen yet an-
other consequence of the law of large numbers:
(number of Xi in (x − h, x + h]) / (2hn) ≈ f(x).
When (x − h, x + h] is a bin of a histogram of the random sample, this means
that the height of the histogram approximates the value of f at the midpoint
of the bin:
height of the histogram on (x − h, x + h] ≈ f(x).
Similarly, the kernel density estimate of a random sample approximates the
corresponding probability density f:
fn,h(x) ≈ f(x).
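Both approximations can be computed directly from a sample. The sketch below is an addition to the text; it assumes Python with NumPy, and the triangular kernel and the bandwidths are arbitrary illustrative choices, not the ones used for the figures in this chapter.

import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 5.0, 2.0
x = rng.normal(mu, sigma, size=200)      # a sample from the N(5, 4) distribution

def hist_height(sample, t, h):
    # height of a histogram on the bin (t-h, t+h]
    return np.sum((sample > t - h) & (sample <= t + h)) / (2 * h * len(sample))

def kde(sample, t, h):
    # kernel density estimate f_{n,h}(t) with the triangular kernel K(u) = 1-|u| on [-1, 1]
    u = (t - sample) / h
    K = np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)
    return K.sum() / (len(sample) * h)

def f_true(t):
    # probability density of the N(5, 4) distribution
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for t in (3.0, 5.0, 7.0):
    print(f"t={t}: histogram={hist_height(x, t, 0.5):.3f}  "
          f"kde={kde(x, t, 0.7):.3f}  f(t)={f_true(t):.3f}")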
Fig. 17.2. Histogram and kernel density estimate of a sample of size 200.
So the histogram and kernel density estimate of the normal dataset should
resemble the graph of the probability density
f(x) = (1/(2√(2π))) e^{−(1/2)((x−5)/2)²}
of the N(5, 4) distribution. This is illustrated in Figure 17.2, where we dis-
played a histogram and a kernel density estimate of our dataset consisting of
200 values generated from the N(5, 4) distribution. It should be noted that
with a smaller dataset the similarity can be much worse. This is demonstrated
in Figure 17.3, which is based on the dataset consisting of 20 values generated
from the same distribution.
Fig. 17.3. Histogram and kernel density estimate of a sample of size 20.
Remark 17.1 (About the approximations). Let Hn be the height of
the histogram on the interval (x − h, x + h], which is assumed to be a bin of
the histogram. Direct application of the law of large numbers merely yields
that Hn converges to
(1/(2h)) ∫_{x−h}^{x+h} f(u) du.
Only for small h is this close to f(x). However, if we let h tend to 0 as n
increases, a variation on the law of large numbers will guarantee that Hn
converges to f(x): for every ε > 0,
lim_{n→∞} P(|Hn − f(x)| > ε) = 0.
A possible choice is the optimal bin width mentioned in Remark 15.1. Sim-
ilarly, direct application of the law of large numbers yields that a kernel
density estimator with fixed bandwidth h converges to
∫_{−∞}^{∞} f(x + hu)K(u) du.
Once more, only for small h is this close to f(x), provided that K is symmetric
and integrates to one. However, by letting the bandwidth h tend
to 0 as n increases, yet another variation on the law of large numbers will
guarantee that fn,h(x) converges to f(x): for every ε > 0,
lim_{n→∞} P(|fn,h(x) − f(x)| > ε) = 0.
A possible choice is the optimal bandwidth mentioned in Remark 15.2.
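A small numerical check of this remark (an addition to the text; Python with NumPy is assumed, and the bandwidth h = n^(-1/5) is just one of many ways to let h tend to 0 as n grows):

import numpy as np

rng = np.random.default_rng(8)
mu, sigma, t = 5.0, 2.0, 5.0
f_t = 1 / (sigma * np.sqrt(2 * np.pi))     # true N(5, 4) density at t = 5, about 0.1995

for n in (100, 10_000, 1_000_000):
    h = n ** (-1 / 5)                       # bandwidth shrinking as n increases
    x = rng.normal(mu, sigma, size=n)
    Hn = np.sum((x > t - h) & (x <= t + h)) / (2 * h * n)
    print(f"n={n:8d}  h={h:.3f}  Hn={Hn:.4f}  f(t)={f_t:.4f}")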
The sample mean, the sample median, and empirical quantiles
As we saw in Section 5.5, the expectation of an N(µ, σ²) distribution is µ;
so the N(5, 4) distribution has expectation 5. According to the law of large
numbers: X̄n ≈ µ. This is illustrated by our dataset of 200 values generated
from the N(5, 4) distribution for which we find
x̄200 = 5.012.
For the sample median we find
Med(x1, . . . , x200) = 5.018.
This illustrates the fact that the sample median of a random sample from
F approximates the median q0.5 = F^inv(0.5). In fact, we have the following
general property for the pth empirical quantile:
qn(p) ≈ F^inv(p) = qp.
In the special case of the N(µ, σ²) distribution, the expectation and the me-
dian coincide, which explains why the sample mean and sample median of the
normal dataset are so close to each other.
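These statements are easy to reproduce with a simulated sample. The sketch below is an addition to the text; it assumes Python with NumPy, and its output will not coincide with the values 5.012 and 5.018 above, since those depend on the particular dataset that was generated.

import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, size=200)        # a sample of size 200 from the N(5, 4) distribution

print("sample mean              :", x.mean())              # approximates mu = 5
print("sample median            :", np.median(x))          # approximates q_0.5 = 5
print("25th empirical percentile:", np.quantile(x, 0.25))  # approximates q_0.25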
The sample variance and standard deviation, and the MAD
As we saw in Section 5.5, the standard deviation and variance of an N(µ, σ²)
distribution are σ and σ²; so for the N(5, 4) distribution these are 2 and 4.
Another consequence of the law of large numbers is that
Sn² ≈ σ² and Sn ≈ σ.
This is illustrated by our normal dataset of size 200, for which we find
s200² = 4.761 and s200 = 2.182
for the sample variance and sample standard deviation.
For the MAD of the dataset we find 1.334, which clearly differs from the
standard deviation 2 of the N(5, 4) distribution. The reason is that
MAD(X1, X2, . . . , Xn) ≈ F^inv(0.75) − F^inv(0.5),
for any distribution that is symmetric around its median F^inv(0.5). For the
N(5, 4) distribution F^inv(0.75) − F^inv(0.5) = 2Φ^inv(0.75) = 1.3490, where
Φ denotes the distribution function of the standard normal distribution (see
Exercise 17.10).
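The same simulated sample can be used to check these statements as well (an addition to the text; Python with NumPy and SciPy is assumed):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, size=200)        # the same simulated N(5, 4) sample as before

s2 = x.var(ddof=1)                        # sample variance, approximates sigma^2 = 4
s = np.sqrt(s2)                           # sample standard deviation, approximates sigma = 2
mad = np.median(np.abs(x - np.median(x))) # median of absolute deviations from the median

print("sample variance   :", s2)
print("sample st. dev.   :", s)
print("MAD               :", mad)
print("2 * Phi^inv(0.75) :", 2 * norm.ppf(0.75))   # 1.3490, the value mentioned above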
Relative frequencies
For continuous distributions the histogram and kernel density estimates of a
random sample approximate the corresponding probability density f. For dis-
crete distributions we would like to have a sample statistic that approximates
the probability mass function. In Section 13.4 we saw that, as a consequence
of the law of large numbers, relative frequencies based on a random sample ap-
proximate corresponding probabilities. As a special case, for a random sample
X1, X2, . . . , Xn from a discrete distribution with probability mass function p,
one has that
(number of Xi equal to a) / n ≈ p(a).
This means that the relative frequency of a’s in the sample approximates
the value of the probability mass function at a. Table 17.2 lists the sample
statistics and the corresponding distribution features they approximate.
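For the discrete case the check is just as simple. The sketch below is an addition to the text; it assumes Python with NumPy and SciPy and uses a Poisson distribution purely as an example of a discrete model distribution.

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(5)
mu = 2.0
x = rng.poisson(mu, size=1000)            # a sample from the Pois(2) distribution

for a in range(6):
    rel_freq = np.mean(x == a)            # relative frequency of a's in the sample
    print(f"a={a}: relative frequency={rel_freq:.3f}  p(a)={poisson.pmf(a, mu):.3f}")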
17.3 Estimating features of the “true” distribution
In the previous section we generated a dataset of 200 elements from a proba-
bility distribution, and we have seen that certain features of this distribution
are approximated by corresponding sample statistics. In practice, the situa-
tion is reversed. In that case we have a dataset of n elements that is modeled
as the realization of a random sample with a probability distribution that is
unknown to us. Our goal is to use our dataset to estimate a certain feature
of this distribution that represents the quantity of interest. In this section we
will discuss a few examples.
Table 17.2. Some sample statistics and corresponding distribution features.
Sample statistic                              Distribution feature

Graphical
  Empirical distribution function Fn          Distribution function F
  Kernel density estimate fn,h and histogram  Probability density f
  (Number of Xi equal to a)/n                 Probability mass function p(a)

Numerical
  Sample mean X̄n                              Expectation µ
  Sample median Med(X1, X2, . . . , Xn)       Median q0.5 = F^inv(0.5)
  pth empirical quantile qn(p)                100pth percentile qp = F^inv(p)
  Sample variance Sn²                         Variance σ²
  Sample standard deviation Sn                Standard deviation σ
  MAD(X1, X2, . . . , Xn)                     F^inv(0.75) − F^inv(0.5), for symmetric F
The Old Faithful data
We stick to the assumptions of Section 17.1: by lack of knowledge on this phe-
nomenon we prefer not to specify a particular parametric type of distribution,
and we model the Old Faithful data as the realization of a random sample of
size 272 from a continuous probability distribution. From the previous section
we know that the kernel density estimate and the empirical distribution func-
tion of the dataset approximate the probability density f and the distribution
function F of this distribution. In Figure 17.4 a kernel density estimate (left)
and the empirical distribution function (right) are displayed. Indeed, neither
graph resembles the probability density function or distribution function of
any of the familiar parametric distributions. Instead of viewing both graphs
Fig. 17.4. Nonparametric estimates for f and F based on the Old Faithful data.
only as graphical summaries of the data, we can also use both curves as esti-
mates for f and F. We estimate the model probability density f by means of
the kernel density estimate and the model distribution function F by means
of the empirical distribution function. Since neither estimate assumes a par-
ticular parametric model, they are called nonparametric estimates.
The software data
Next consider the software reliability data. As motivated in Section 17.1,
we model interfailure times as the realization of a random sample from an
exponential distribution. To see whether an exponential distribution is indeed
a reasonable model, we plot a histogram and a kernel density estimate using
a boundary kernel in Figure 17.5.
Fig. 17.5. Histogram and kernel density estimate for the software data.
Both seem to corroborate the assumption of an exponential distribution. Ac-
cepting this, we are left with estimating the parameter λ. Because for the
exponential distribution E[X] = 1/λ, the law of large numbers suggests 1/x̄
as an estimate for λ. For our dataset x̄ = 656.88, which yields 1/x̄ = 0.0015.
In Figure 17.6 we compare the estimated exponential density (left) and dis-
tribution function (right) with the corresponding nonparametric estimates.
Note that the nonparametric estimates do not assume an exponential model
for the data. But, if an exponential distribution were the right model, the
kernel density estimate and empirical distribution function should resemble
the estimated exponential density and distribution function. At first sight the
fit seems reasonable, although near zero the data accumulate more than one
might perhaps expect for a sample of size 135 from an exponential distri-
bution, and the other way around at the other end of the data range. The
question is whether this phenomenon can be attributed to chance or is caused
by the fact that the exponential model is the wrong model. We will return to
this type of question in Chapter 25 (see also Chapter 18).
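The estimation step itself takes only a few lines. The sketch below is an addition to the text; it assumes Python with NumPy, and since the 135 observed interfailure times are not reproduced here, it uses simulated stand-in data.

import numpy as np

rng = np.random.default_rng(6)
# Stand-in for the observed interfailure times; the real dataset has n = 135
# values with sample mean 656.88 CPU seconds.
interfailure = rng.exponential(scale=656.88, size=135)

lam_hat = 1.0 / interfailure.mean()       # estimate of lambda, since E[X] = 1/lambda
print("estimated lambda:", lam_hat)

# Compare the fitted exponential distribution function with the empirical one.
def F_exp(t, lam):
    return 1.0 - np.exp(-lam * t)

def ecdf(sample, t):
    return np.mean(sample <= t)

for t in (200.0, 1000.0, 3000.0):
    print(f"t={t:6.0f}  Fn(t)={ecdf(interfailure, t):.3f}  fitted F(t)={F_exp(t, lam_hat):.3f}")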
Fig. 17.6. Kernel density estimate and empirical cdf for software data (solid) compared to f and F of the estimated exponential distribution.
Michelson data
Consider the Michelson data on the speed of light. In this case we are not
particularly interested in estimation of the “true” distribution, but solely in
the expectation of this distribution, which represents the speed of light. The
law of large numbers suggests to estimate the expectation by the sample
mean x̄, which equals 852.4.
17.4 The linear regression model
Recall the example about predicting Janka hardness of wood from the density
of the wood in Section 15.5. The idea is, of course, that Janka hardness is
related to the density: the higher the density of the wood, the higher the
value of Janka hardness. This suggests a relationship of the type
hardness = g(density of timber)
for some increasing function g. This is supported by the scatterplot of the data
in Figure 17.7. A closer look at the bivariate dataset in Table 15.5 suggests
that randomness is also involved. For instance, for the value 51.5 of the density,
different corresponding values of Janka hardness were observed. One way to
model such a situation is by means of a regression model:
hardness = g(density of timber) + random fluctuation.
The important question now is what sort of function g fits well to the points
in the scatterplot?
In general, this may be a difficult question to answer. We may have so little
knowledge about the phenomenon under study, and the data points may be
Fig. 17.7. Scatterplot of Janka hardness versus wood density.
scattered in such a way, that there is no reason to assume a specific type of
function for g. However, for the Janka hardness data it makes sense to assume
that g is increasing, but this still leaves us with many possibilities. Looking at
the scatterplot, at first sight it does not seem unreasonable to assume that g is
a straight line, i.e., Janka hardness depends linearly on the density of timber.
The fact that the points are not exactly on a straight line is then modeled by
a random fluctuation with respect to the straight line:
hardness = α + β · (density of timber) + random fluctuation.
This is a loose description of a simple linear regression model. A more complete
description is given below.
Simple linear regression model. In a simple linear regression
model for a bivariate dataset (x1, y1), (x2, y2), . . . , (xn, yn), we as-
sume that x1, x2, . . . , xn are nonrandom and that y1, y2, . . . , yn are
realizations of random variables Y1, Y2, . . . , Yn satisfying
Yi = α + βxi + Ui for i = 1, 2, . . ., n,
where U1, . . . , Un are independent random variables with E[Ui] = 0
and Var(Ui) = σ².
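A simulation sketch of this model may help to see what the assumptions mean in practice. It is an addition to the text; Python with NumPy is assumed, and the values of α, β, σ, the xi, and the normal choice for the Ui are arbitrary illustrative choices (the model itself only requires E[Ui] = 0 and Var(Ui) = σ²).

import numpy as np

rng = np.random.default_rng(7)
alpha, beta, sigma = -1000.0, 50.0, 150.0   # illustrative parameter values
x = np.linspace(25.0, 70.0, 36)             # nonrandom explanatory values x_1, ..., x_n

# U_1, ..., U_n independent with E[U_i] = 0 and Var(U_i) = sigma^2
U = rng.normal(0.0, sigma, size=x.size)
Y = alpha + beta * x + U                    # the response values Y_i

print("first three (x_i, Y_i) pairs:")
for xi, yi in zip(x[:3], Y[:3]):
    print(f"  ({xi:.1f}, {yi:.1f})")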
The line y = α + βx is called the regression line. The parameters α and β
represent the intercept and slope of the regression line. Usually, the x-variable
is called the explanatory variable and the y-variable is called the response
variable. One also refers to x and y as independent and dependent variables.
The random variables U1, U2, . . . , Un are assumed to be independent when the
different measurements do not influence each other. They are assumed to have
expectation zero, because the random fluctuation is considered to be around
the regression line y = α + βx. Finally, because each random fluctuation
is supposed to have the same amount of variability, we assume that all Ui
have the same variance. Note that by the propagation of independence rule
in Section 9.4, independence of the Ui implies independence of Yi. However,
Y1, Y2, . . . , Yn do not form a random sample. Indeed, the Yi have different
distributions because every Yi has a different expectation
E[Yi] = E[α + βxi + Ui] = α + βxi + E[Ui] = α + βxi.
Quick exercise 17.3 Consider the simple linear regression model as defined
earlier. Compute the variance of Yi.
The parameters α and β are unknown and our task will be to estimate them on
the basis of the data. We will come back to this in Chapter 22. In Figure 17.8
the scatterplot for the Janka hardness data is displayed with the estimated
Fig. 17.8. Estimated regression line for the Janka hardness data.
regression line
y = −1160.5 + 57.51x.
Taking a closer look at Figure 17.8, you might wonder whether
y = α + βx + γx2
would be a more appropriate model. By trying to answer this question we
enter the area of multiple linear regression. We will not pursue this topic; we
restrict ourselves to simple linear regression.
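As a preview of how such a line can be fitted, the sketch below applies the least squares recipe that is developed in Chapter 22 to the simulated data from the previous sketch. It is an addition to the text, assumes Python with NumPy, and does not use the Janka hardness data themselves.

import numpy as np

rng = np.random.default_rng(7)
alpha, beta, sigma = -1000.0, 50.0, 150.0
x = np.linspace(25.0, 70.0, 36)
Y = alpha + beta * x + rng.normal(0.0, sigma, size=x.size)

# Least squares: choose a and b to minimize sum (Y_i - a - b*x_i)^2.
b_hat = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = Y.mean() - b_hat * x.mean()

print(f"estimated regression line: y = {a_hat:.1f} + {b_hat:.2f} x")

The estimates come out in the neighborhood of the values of α and β used in the simulation.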
17.5 Solutions to the quick exercises
17.1 Because X1, X2 form a random sample, they are independent. Using
the rule about the variance of the sum of independent random variables, this
means that Var(X1 + X2) = Var(X1) + Var(X2) = 1 + 1 = 2.
17.2 The result of each toss of a coin can be modeled by a Bernoulli random
variable taking values 1 (heads) and 0 (tails). In the case when it is known
that we are tossing a fair coin, heads and tails occur with equal probability.
Since it is reasonable to assume that the tosses do not influence each other,
the outcomes of the ten tosses are modeled as the realization of a random
sample X1, . . . , X10 from a Bernoulli distribution with parameter p = 1/2. In
this case the model distribution is completely specified and coincides with the
“true” distribution: a Ber(1/2) distribution.
In the case when we are dealing with a possibly unfair coin, the outcomes
of the ten tosses are still modeled as the realization of a random sample
X1, . . . , X10 from a Bernoulli distribution, but we cannot specify the value
of the parameter p. The model distribution is a Bernoulli distribution. The
“true” distribution is a Bernoulli distribution with one particular value for p,
unknown to us.
17.3 Note that the xi are considered nonrandom. By the rules for the vari-
ance, we find Var(Yi) = Var(α + βxi + Ui) = Var(Ui) = σ².
17.6 Exercises
17.1 Figure 17.9 displays several histograms, kernel density estimates, and
empirical distribution functions. It is known that all figures correspond to
datasets of size 200 that are generated from normal distributions N(0, 1),
N(0, 9), and N(3, 1), and from exponential distributions Exp(1) and Exp(1/3).
Report for each figure from which distribution the dataset has been generated.
17.2 Figure 17.10 displays several boxplots. It is known that all figures
correspond to datasets of size 200 that are generated from the same five dis-
tributions as in Exercise 17.1. Report for each boxplot from which distribution
the dataset has been generated.
17.3 At a London underground station, the number of women was counted
in each of 100 queues of length 10. In this way a dataset x1, x2, . . . , x100 was
obtained, where xi denotes the observed number of women in the ith queue.
The dataset is summarized in the following table and lists the number of
queues with 0 women, 1 woman, 2 women, etc.
Fig. 17.9. Graphical representations of different datasets from Exercise 17.1.
Fig. 17.10. Boxplot of different datasets from Exercise 17.2.
Count 0 1 2 3 4 5 6 7 8 9 10
Frequency 1 3 4 23 25 19 18 5 1 1 0
Source: R.A. Jinkinson and M. Slater. Critical discussion of a graphical
method for identifying discrete distributions. The Statistician, 30:239–248,
1981; Table 1 on page 240.
In the statistical model for this dataset, we assume that the observed counts
are a realization of a random sample X1, X2, . . . , X100.
a. Assume that people line up in such a way that a man or woman in a
certain position is independent of the other positions, and that in each
position one has a woman with equal probability. What is an appropriate
choice for the model distribution?
b. Use the table to find an estimate for the parameter(s) of the model dis-
tribution chosen in part a.
17.4 During the Second World War, London was hit by numerous flying
bombs. The following data are from an area in South London of 36 square
kilometers. The area was divided into 576 squares with sides of length 1/4
kilometer. For each of the 576 squares the number of hits was recorded. In
this way we obtain a dataset x1, x2, . . . , x576, where xi denotes the number of
hits in the ith square. The data are summarized in the following table which
lists the number of squares with no hits, 1 hit, 2 hits, etc.
Number of hits 0 1 2 3 4 5 6 7
Number of squares 229 211 93 35 7 0 0 1
Source: R.D. Clarke. An application of the Poisson distribution. Journal of
the Institute of Actuaries, 72:48, 1946; Table 1 on page 481. Faculty and
Institute of Actuaries.
An interesting question is whether London was hit in a completely random
manner. In that case a Poisson distribution should fit the data.
a. If we model the dataset as the realization of a random sample from a
Poisson distribution with parameter µ, then what would you choose as an
estimate for µ?
b. Check the fit with a Poisson distribution by comparing some of the ob-
served relative frequencies of 0’s, 1’s, 2’s, etc., with the corresponding
probabilities for the Poisson distribution with µ estimated as in part a.
17.5 We return to the example concerning the number of menstrual cycles
up to pregnancy, where the number of cycles was modeled by a geometric
random variable (see Section 4.4). The original data concerned 100 smoking
and 486 nonsmoking women. For 7 smokers and 12 nonsmokers, the exact
number of cycles up to pregnancy was unknown. In the following tables we only
incorporated the 93 smokers and 474 nonsmokers, for which the exact number
of cycles was observed. Another analysis, based on the complete dataset, is
done in Section 21.1.
a. Consider the dataset x1, x2, . . . , x93 corresponding to the smoking women,
where xi denotes the number of cycles for the ith smoking woman. The
data are summarized in the following table.
Cycles 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 29 16 17 4 3 9 4 5 1 1 1 3
Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution ap-
plied to comparative fecundability studies. Biometrics, 42(3):547–560, 1986.
The table lists the number of women that had to wait 1 cycle, 2 cycles,
etc. If we model the dataset as the realization of a random sample from a
geometric distribution with parameter p, then what would you choose as
an estimate for p?
b. Also estimate the parameter p for the 474 nonsmoking women, which
is also modeled as the realization of a random sample from a geometric
distribution. The dataset y1, y2, . . . , y474, where yj denotes the number of
cycles for the jth nonsmoking woman, is summarized here:
Cycles 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 198 107 55 38 18 22 7 9 5 3 6 6
Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution ap-
plied to comparative fecundability studies. Biometrics, 42(3):547–560, 1986.
You may use that y1 + y2 + · · · + y474 = 1285.
c. Compare the estimates of the probability of becoming pregnant in three
or fewer cycles for smoking and nonsmoking women.
17.6 Recall Exercise 15.1 about the chest circumference of 5732 Scottish sol-
diers, where we constructed the histogram displayed in Figure 17.11. The
histogram suggests modeling the data as the realization of a random sample
from a normal distribution.
a. Suppose that for the dataset Σ xi = 228377.2 and Σ xi² = 9124064. What
would you choose as estimates for the parameters µ and σ of the N(µ, σ²) distribution?
Hint: you may want to use the relation from Exercise 16.15.
b. Give an estimate for the probability that a Scottish soldier has a chest
circumference between 38.5 and 42.5 inches.
Fig. 17.11. Histogram of chest circumferences.
17.7 Recall Exercise 15.3 about time intervals between successive coal mine
disasters. Let us assume that the rate at which the disasters occur is constant
over time and that on a single day a disaster takes place with small probability
independently of what happens on other days. According to Chapter 12 this
suggests modeling the series of disasters with a Poisson process. Figure 17.12
displays a histogram and empirical distribution function of the observed time
intervals.
a. In the statistical model for this dataset we model the 190 time intervals
as the realization of a random sample. What would you choose for the
model distribution?
b. The sum of the observed time intervals is 40 549 days. Give an estimate
for the parameter(s) of the distribution chosen in part a.
Fig. 17.12. Histogram of time intervals between successive disasters.
17.8 The following data represent the number of revolutions to failure (in
millions) of 22 deep-groove ball-bearings.
17.88 28.92 33.00 41.52 42.12
45.60 48.48 51.84 51.96 54.12
55.56 67.80 68.64 68.88 84.12
93.12 98.64 105.12 105.84 127.92
128.04 173.40
Source: J. Lieblein and M. Zelen. Statistical investigation of the fatigue-life
of deep-groove ball-bearings. Journal of Research, National Bureau of Stan-
dards, 57:273–316, 1956; specimen worksheet on page 286.
Lieblein and Zelen propose modeling the dataset as a realization of a random
sample from a Weibull distribution, which has distribution function
F(x) = 1 − e^{−(λx)^α}   for x ≥ 0,
and F(x) = 0 for x < 0, where α, λ > 0.
a. Suppose that X is a random variable with a Weibull distribution. Check that the random variable Y = X^α has an exponential distribution with parameter λ^α and conclude that E[X^α] = 1/λ^α.
b. Use part a to explain how one can use the data in the table to find
an estimate for the parameter λ, if it is given that the parameter α is
estimated by 2.102.
17.9  The volume (i.e., the effective wood production in cubic meters),
height (in meters), and diameter (in meters) (measured at 1.37 meter above
the ground) are recorded for 31 black cherry trees in the Allegheny National
Forest in Pennsylvania. The data are listed in Table 17.3. They were collected
to find an estimate for the volume of a tree (and therefore for the timber
yield), given its height and diameter. For each tree the volume y and the
value of x = d²h are recorded, where d and h are the diameter and height
of the tree. The resulting points (x1, y1), . . . , (x31, y31) are displayed in the
scatterplot in Figure 17.13.
We model the data by the following linear regression model (without intercept):
Yi = βxi + Ui   for i = 1, 2, . . . , 31.
a. What physical reasons justify the linear relationship between y and d²h?
Hint: how does the volume of a cylinder relate to its diameter and height?
b. We want to find an estimate for the slope β of the line y = βx. Two
natural candidates are the average slope z̄n, where zi = yi/xi, and the
Table 17.3. Measurements on black cherry trees.
Diameter Height Volume
0.21 21.3 0.29
0.22 19.8 0.29
0.22 19.2 0.29
0.27 21.9 0.46
0.27 24.7 0.53
0.27 25.3 0.56
0.28 20.1 0.44
0.28 22.9 0.52
0.28 24.4 0.64
0.28 22.9 0.56
0.29 24.1 0.69
0.29 23.2 0.59
0.29 23.2 0.61
0.30 21.0 0.60
0.30 22.9 0.54
0.33 22.6 0.63
0.33 25.9 0.96
0.34 26.2 0.78
0.35 21.6 0.73
0.35 19.5 0.71
0.36 23.8 0.98
0.36 24.4 0.90
0.37 22.6 1.03
0.41 21.9 1.08
0.41 23.5 1.21
0.44 24.7 1.57
0.44 25.0 1.58
0.45 24.4 1.65
0.46 24.4 1.46
0.46 24.4 1.44
0.52 26.5 2.18
Source: A.C. Atkinson. Regression diagnostics, trend formations and con-
structed variables (with discussion). Journal of the Royal Statistical Society,
Series B, 44:1–36, 1982.
slope of the averages ȳ/x̄. In Chapter 22 we will encounter the so-called
least squares estimate:
( Σ_{i=1}^{n} xi yi ) / ( Σ_{i=1}^{n} xi² ).
Fig. 17.13. Scatterplot of the black cherry tree data.
Compute all three estimates for the data in Table 17.3. You need at least
5 digits accuracy, and you may use that Σxi = 87.456, Σyi = 26.486, Σ(yi/xi) = 9.369, Σxiyi = 95.498, and Σxi² = 314.644.
17.10 Let X be a random variable with (continuous) distribution function F.
Let m = q0.5 = F^inv(0.5) be the median of F and define the random variable Y = |X − m|.
a. Show that Y has distribution function G, defined by
G(y) = F(m + y) − F(m − y).
b. The MAD of F is the median of G. Show that if the density f corresponding to F is symmetric around its median m, then
G(y) = 2F(m + y) − 1
and derive that
G^inv(1/2) = F^inv(3/4) − F^inv(1/2).
c. Use b to conclude that the MAD of an N(µ, σ²) distribution is equal to σΦ^inv(3/4), where Φ is the distribution function of a standard normal distribution. Recall that the distribution function F of an N(µ, σ²) can be written as
F(x) = Φ((x − µ)/σ).
You might check that, as stated in Section 17.2, the MAD of the N(5, 4) distribution is equal to 2Φ^inv(3/4) = 1.3490.
17.11 In this exercise we compute the MAD of the Exp(λ) distribution.
a. Let X have an Exp(λ) distribution, with median m = (ln 2)/λ. Show that
Y = |X − m| has distribution function
G(y) = (1/2)(e^{λy} − e^{−λy}).
b. Argue that the MAD of the Exp(λ) distribution is a solution of the equa-
tion e^{2λy} − e^{λy} − 1 = 0.
c. Compute the MAD of the Exp(λ) distribution.
Hint: put x = e^{λy} and first solve for x.
18
The bootstrap
In the forthcoming chapters we will develop statistical methods to infer knowl-
edge about the model distribution and encounter several sample statistics to
do this. In the previous chapter we have seen examples of sample statistics
that can be used to estimate different model features, for instance, the em-
pirical distribution function to estimate the model distribution function F,
and the sample mean to estimate the expectation µ corresponding to F. One
of the things we would like to know is how close a sample statistic is to the
model feature it is supposed to estimate. For instance, what is the probability
that the sample mean and µ differ more than a given tolerance ε? For this
we need to know the distribution of X̄n − µ. More generally, it is important
to know how a sample statistic is distributed in relation to the corresponding
model feature. For the distribution of the sample mean we saw a normal limit
approximation in Chapter 14. In this chapter we discuss a simulation proce-
dure that approximates the distribution of the sample mean for finite sample
size. Moreover, the method is more generally applicable to sample statistics
other than the sample mean.
18.1 The bootstrap principle
Consider the Old Faithful data introduced in Chapter 15, which we modeled
as the realization of a random sample of size n = 272 from some distribution
function F. The sample mean x̄n of the observed durations equals 209.3. What
does this say about the expectation µ of F? As we saw in Chapter 17, the value
209.3 is a natural estimate for µ, but to conclude that µ is equal to 209.3 is
unwise. The reason is that, if we were to observe a new dataset of durations, we would obtain a different sample mean as an estimate for µ. This should not come
as a surprise. Since the dataset x1, x2, . . . , xn is just one possible realization
of the random sample X1, X2, . . . , Xn, the observed sample mean is just one
possible realization of the random variable
X̄n = (X1 + X2 + · · · + Xn)/n.
A new dataset is another realization of the random sample, and the cor-
responding sample mean is another realization of the random variable X̄n.
Hence, to infer something about µ, one should take into account how realiza-
tions of X̄n vary. This variation is described by the probability distribution
of X̄n.
In principle¹ it is possible to determine the distribution function of X̄n from
the distribution function F of the random sample X1, X2, . . . , Xn. However,
F is unknown. Nevertheless, in Chapter 17 we saw that the observed dataset
reflects most features of the “true” probability distribution. Hence the natural
thing to do is to compute an estimate F̂ for the distribution function F and
then to consider a random sample from F̂ and the corresponding sample mean
as substitutes for the random sample X1, X2, . . . , Xn from F and the random
variable X̄n. A random sample from F̂ is called a bootstrap random sample,
or briefly bootstrap sample, and is denoted by
X*1, X*2, . . . , X*n
to distinguish it from the random sample X1, X2, . . . , Xn from the “true” F.
The corresponding average is called the bootstrapped sample mean, and this
random variable is denoted by
X̄*n = (X*1 + X*2 + · · · + X*n)/n
to distinguish it from the random variable X̄n. The idea is now to use the
distribution of X̄*n to approximate the distribution of X̄n.
The preceding procedure is called the bootstrap principle for the sample mean.
Clearly, it can be applied to any sample statistic h(X1, X2, . . . , Xn) by approx-
imating its probability distribution by that of the corresponding bootstrapped
sample statistic h(X*1, X*2, . . . , X*n).
Bootstrap principle. Use the dataset x1, x2, . . . , xn to com-
pute an estimate F̂ for the “true” distribution function F. Replace
the random sample X1, X2, . . . , Xn from F by a random sample
X*1, X*2, . . . , X*n from F̂, and approximate the probability distribution of h(X1, X2, . . . , Xn) by that of h(X*1, X*2, . . . , X*n).
Returning to the sample mean, the first question that comes to mind is, of
course, how well does the distribution of X̄*n approximate the distribution
¹ In Section 11.1 we saw how the distribution of the sum of independent random
variables can be computed. Together with the change-of-units rule (see page 106),
the distribution of X̄n can be determined. See also Section 13.1, where this is done
for independent Gam(2, 1) variables.
of X̄n? Or more generally, how well does the distribution of a bootstrapped
sample statistic h(X*1, X*2, . . . , X*n) approximate the distribution of the sample statistic of interest h(X1, X2, . . . , Xn)? Applied in such a straightforward manner, the bootstrap approximation for the distribution of X̄n by that of X̄*n may not be so good (see Remark 18.1). The bootstrap approximation will
improve if we approximate the distribution of the centered sample mean:
X̄n − µ,
where µ is the expectation corresponding to F. The bootstrapped version
would be the random variable
X̄*n − µ*,
where µ* is the expectation corresponding to F̂. Often the bootstrap approximation of the distribution of a sample statistic will improve if we somehow normalize the sample statistic by relating it to a corresponding feature of the “true” distribution. An example is the centered sample median
Med(X1, X2, . . . , Xn) − F^inv(0.5),
where we subtract the median F^inv(0.5) of F. Another example is the normalized sample variance
Sn²/σ²,
where we divide by the variance σ² of F.
Quick exercise 18.1 Describe how the bootstrap principle should be applied
to approximate the distribution of Med(X1, X2, . . . , Xn) − F^inv(0.5).
Remark 18.1 (The bootstrap for the sample mean). To see why
the bootstrap approximation for X̄n may be bad, consider a dataset
x1, x2, . . . , xn that is a realization of a random sample X1, X2, . . . , Xn from
an N(µ, 1) distribution. In that case the corresponding sample mean X̄n
has an N(µ, 1/n) distribution. We estimate µ by x̄n and replace the ran-
dom sample from an N(µ, 1) distribution by a bootstrap random sample
X*1, X*2, . . . , X*n from an N(x̄n, 1) distribution. The corresponding bootstrapped sample mean X̄*n has an N(x̄n, 1/n) distribution. Therefore the distribution functions Gn and G*n of the random variables X̄n and X̄*n can be determined:
Gn(a) = Φ(√n(a − µ))   and   G*n(a) = Φ(√n(a − x̄n)).
In this case it turns out that the maximum distance between the two distribution functions is equal to
2Φ((1/2)√n |x̄n − µ|) − 1.
Since X̄n has an N(µ, 1/n) distribution, this value is approximately equal to
2Φ(|z|/2) − 1, where z is a realization of an N(0, 1) random variable Z. This only equals zero for z = 0, so that the distance between the distribution functions of X̄n and X̄*n will almost always be strictly positive, even for
large n.
The question that remains is what to take as an estimate F̂ for F. This
will depend on how well F can be specified. For the Old Faithful data we
cannot say anything about the type of distribution. However, for the software
data it seems reasonable to model the dataset as a realization of a random
sample from an Exp(λ) distribution and then we only have to estimate the
parameter λ. Different assumptions about F give rise to different bootstrap
procedures. We will discuss two of them in the next sections.
18.2 The empirical bootstrap
Suppose we consider our dataset x1, x2, . . . , xn as a realization of a random
sample from a distribution function F. When we cannot make any assumptions
about the type of F, we can always estimate F by the empirical distribution
function of the dataset:
F̂(a) = Fn(a) = (number of xi less than or equal to a) / n.
Since we estimate F by the empirical distribution function, the corresponding
bootstrap principle is called the empirical bootstrap. Applying this principle
to the centered sample mean, the random sample X1, X2, . . . , Xn from F is
replaced by a bootstrap random sample X*1, X*2, . . . , X*n from Fn, and the distribution of X̄n − µ is approximated by that of X̄*n − µ*, where µ* denotes
the expectation corresponding to Fn. The question is, of course, how good
this approximation is. A mathematical theorem tells us that the empirical
bootstrap works for the centered sample mean, i.e., the distribution of X̄n −µ
is well approximated by that of X̄*n − µ* (see Remark 18.2). On the other hand, there are (normalized) sample statistics for which the empirical bootstrap fails, such as
1 − (maximum of X1, X2, . . . , Xn) / θ,
based on a random sample X1, X2, . . . , Xn from a U(0, θ) distribution (see
Exercise 18.12).
Remark 18.2 (The empirical bootstrap for X̄n −µ). For the centered
sample mean the bootstrap approximation works, even if we estimate F
by the empirical distribution function Fn. If Gn denotes the distribution
function of X̄n − µ and G*n the distribution function of its bootstrapped version X̄*n − µ*, then the maximum distance between G*n and Gn goes to zero with probability one:
P( lim_{n→∞} sup_{t∈R} |G*n(t) − Gn(t)| = 0 ) = 1
(see, for instance, Singh [32]). In fact, the empirical bootstrap approxima-
tion can be improved by approximating the distribution of the standardized
average √n(X̄n − µ)/σ by its bootstrapped version √n(X̄*n − µ*)/σ*, where σ and σ* denote the standard deviations of F and Fn. This approximation
is even better than the normal approximation by the central limit theorem!
See, for instance, Hall [14].
Let us continue with approximating the distribution of X̄n − µ by that of
X̄*n − µ*. First note that the empirical distribution function Fn of the original
dataset is the distribution function of a discrete random variable that attains
the values x1, x2, . . . , xn, each with probability 1/n. This means that each of
the bootstrap random variables X*i has expectation
µ* = E[X*i] = x1 · (1/n) + x2 · (1/n) + · · · + xn · (1/n) = x̄n.
Therefore, applying the empirical bootstrap to X̄n − µ means approximating
its distribution by that of X̄*n − x̄n. In principle it would be possible to determine the probability distribution of X̄*n − x̄n. Indeed, the random variable X̄*n is based on the random variables X*i, whose distribution we know precisely: it takes values x1, x2, . . . , xn with equal probability 1/n. Hence we could determine the possible values of X̄*n − x̄n and the corresponding probabilities.
For small n this can be done (see Exercise 18.5), but for large n this becomes
cumbersome. Therefore we invoke a second approximation.
Recall the jury example in Section 6.3, where we investigated the variation
of two different rules that a jury might use to assign grades. In terms of
the present chapter, the jury example deals with a random sample from a
U(−0.5, 0.5) distribution and two different sample statistics T and M, cor-
responding to the two rules. To investigate the distribution of T and M,
a simulation was carried out with one thousand runs, where in every run we
generated a realization of a random sample from the U(−0.5, 0.5) distribution
and computed the corresponding realization of T and M. The one thousand
realizations give a good impression of how T and M vary around the deserved
score (see Figure 6.4).
Returning to the distribution of X̄*n − x̄n, the analogue would be to repeatedly generate a realization of the bootstrap random sample from Fn and every time compute the corresponding realization of X̄*n − x̄n. The resulting realizations would give a good impression about the distribution of X̄*n − x̄n. A realization
of the bootstrap random sample is called a bootstrap dataset and is denoted
by
x*1, x*2, . . . , x*n
to distinguish it from the original dataset x1, x2, . . . , xn. For the centered
sample mean the simulation procedure is as follows.
Empirical bootstrap simulation (for X̄n − µ). Given a dataset x1, x2, . . . , xn, determine its empirical distribution function Fn as an estimate of F, and compute the expectation
µ* = x̄n = (x1 + x2 + · · · + xn)/n
corresponding to Fn.
1. Generate a bootstrap dataset x*1, x*2, . . . , x*n from Fn.
2. Compute the centered sample mean for the bootstrap dataset:
x̄*n − x̄n,
where
x̄*n = (x*1 + x*2 + · · · + x*n)/n.
Repeat steps 1 and 2 many times.
Note that generating a value x*i from Fn is equivalent to choosing one of the
elements x1, x2, . . . , xn of the original dataset with equal probability 1/n.
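As a concrete illustration, the following is a minimal Python/NumPy sketch of this simulation procedure (the code is not part of the book, and the array durations is a hypothetical stand-in for an observed dataset such as the n = 272 Old Faithful durations):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the observed dataset x_1, ..., x_n.
durations = rng.gamma(shape=9.0, scale=25.0, size=272)

n = durations.size
xbar_n = durations.mean()            # mu* corresponding to F_n

B = 1000                             # number of repetitions of steps 1 and 2
centered_means = np.empty(B)
for b in range(B):
    # Step 1: bootstrap dataset from F_n, i.e., draw n elements
    # from the data with replacement (each with probability 1/n).
    bootstrap_sample = rng.choice(durations, size=n, replace=True)
    # Step 2: centered sample mean of the bootstrap dataset.
    centered_means[b] = bootstrap_sample.mean() - xbar_n

# The B values in centered_means reflect the distribution of the
# bootstrapped centered sample mean.
```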
The empirical bootstrap simulation is described for the centered sample mean,
but clearly a similar simulation procedure can be formulated for any (normal-
ized) sample statistic.
Remark 18.3 (Some history). Although Efron [7] in 1979 drew attention
to diverse applications of the empirical bootstrap simulation, it already
existed before that time, but not as a unified widely applicable technique.
See Hall [14] for references to earlier ideas along similar lines and to further
development of the bootstrap. One of Efron’s contributions was to point out
how to combine the bootstrap with modern computational power. In this
way, the interest in this procedure is a typical consequence of the influence of
computers on the development of statistics in the past decades. Efron also
coined the term “bootstrap,” which is inspired by the American version
of one of the tall stories of the Baron von Münchhausen, who claimed to
have lifted himself out of a swamp by pulling the strap on his boot (in the
European version he lifted himself by pulling his hair).
Quick exercise 18.2 Describe the empirical bootstrap simulation for the
centered sample median Med(X1, X2, . . . , Xn) − F^inv(0.5).
For the Old Faithful data we carried out the empirical bootstrap simulation
for the centered sample mean with one thousand repetitions. In Figure 18.1
a histogram (left) and kernel density estimate (right) are displayed of one
thousand centered bootstrap sample means
x̄*n,1 − x̄n   x̄*n,2 − x̄n   · · ·   x̄*n,1000 − x̄n.
Fig. 18.1. Histogram and kernel density estimate of centered bootstrap sample means.
Since these are realizations of the random variable X̄*n − x̄n, we know from Section 17.2 that they reflect the distribution of X̄*n − x̄n. Hence, as the distribution of X̄*n − x̄n approximates that of X̄n − µ, the centered bootstrap sample means also reflect the distribution of X̄n − µ. This leads to the following
application.
An application of the empirical bootstrap
Let us return to our example about the Old Faithful data, which are mod-
eled as a realization of a random sample from some F. Suppose we estimate
the expectation µ corresponding to F by x̄n = 209.3. Can we say how far
away 209.3 is from the “true” expectation µ? To be honest, the answer is
no. . . (oops). In a situation like this, the measurements and their correspond-
ing average are subject to randomness, so that we cannot say anything with
absolute certainty about how far away the average will be from µ. One of the
things we can say is how likely it is that the average is within a given distance
from µ.
To get an impression of how close the average of a dataset of n = 272 ob-
served durations of the Old Faithful geyser is to µ, we want to compute the
probability that the sample mean deviates more than 5 from µ:
P(|X̄n − µ| > 5).
Direct computation of this probability is impossible, because we do not know the distribution of the random variable X̄n − µ. However, since the distribution of X̄*n − x̄n approximates the distribution of X̄n − µ, we can approximate the probability as follows
P(|X̄n − µ| > 5) ≈ P(|X̄*n − x̄n| > 5) = P(|X̄*n − 209.3| > 5),
where we have also used that for the Old Faithful data, x̄n = 209.3. As we
mentioned before, in principle it is possible to compute the last probability
exactly. Since this is too cumbersome, we approximate P(|X̄*n − 209.3| > 5) by means of the one thousand centered bootstrap sample means obtained from the empirical bootstrap simulation:
x̄*n,1 − 209.3   x̄*n,2 − 209.3   · · ·   x̄*n,1000 − 209.3.
In view of Table 17.2, a natural estimate for P(|X̄*n − 209.3| > 5) is the relative frequency of centered bootstrap sample means that are greater than 5 in absolute value:
(number of i with |x̄*n,i − 209.3| greater than 5) / 1000.
For the centered bootstrap sample means of Figure 18.1, this relative frequency is 0.227. Hence, we obtain the following bootstrap approximation
P(|X̄n − µ| > 5) ≈ P(|X̄*n − 209.3| > 5) ≈ 0.227.
It should be emphasized that the second approximation can be made ar-
bitrarily accurate by increasing the number of repetitions in the bootstrap
procedure.
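In code, this bootstrap approximation is just the relative frequency of replicates exceeding 5 in absolute value. Continuing in the spirit of the earlier sketch (again an illustration with a hypothetical dataset, not the actual Old Faithful measurements):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dataset and its sample mean.
data = rng.gamma(shape=9.0, scale=25.0, size=272)
xbar_n = data.mean()

# One thousand centered bootstrap sample means.
centered_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean() - xbar_n
    for _ in range(1000)
])

# Bootstrap approximation of P(|Xbar_n - mu| > 5).
p_approx = np.mean(np.abs(centered_means) > 5)
print(p_approx)
```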
18.3 The parametric bootstrap
Suppose we consider our dataset as a realization of a random sample from a
distribution of a specific parametric type. In that case the distribution function
is completely determined by a parameter or vector of parameters θ: F = Fθ.
Then we do not have to estimate the whole distribution function F, but it
suffices to estimate the parameter(vector) θ by θ̂ and estimate F by
F̂ = Fθ̂.
The corresponding bootstrap principle is called the parametric bootstrap.
Let us investigate what this would mean for the centered sample mean. First
we should realize that the expectation of Fθ is also determined by θ: µ = µθ.
The parametric bootstrap for the centered sample mean now amounts to the
following. The random sample X1, X2, . . . , Xn from the “true” distribution
function Fθ is replaced by a bootstrap random sample X*1, X*2, . . . , X*n from Fθ̂, and the probability distribution of X̄n − µθ is approximated by that of
X̄*n − µ*,
where
µ* = µθ̂
denotes the expectation corresponding to Fθ̂.
Often the parametric bootstrap approximation is better than the empirical
bootstrap approximation, as illustrated in the next quick exercise.
Quick exercise 18.3 Suppose the dataset x1, x2, . . . , xn is a realization of a
random sample X1, X2, . . . , Xn from an N(µ, 1) distribution. Estimate µ by
x̄n and consider a bootstrap random sample X*1, X*2, . . . , X*n from an N(x̄n, 1) distribution. Check that the probability distributions of X̄n − µ and X̄*n − x̄n are the same: an N(0, 1/n) distribution.
Once more, in principle it is possible to determine the distribution of X̄*n − µθ̂
exactly. However, in contrast with the situation considered in the previous
quick exercise, in some cases this is still cumbersome. Again a simulation
procedure may help us out. For the centered sample mean the procedure is as
follows.
Parametric bootstrap simulation (for X̄n − µ). Given a dataset x1, x2, . . . , xn, compute an estimate θ̂ for θ. Determine Fθ̂ as an estimate for Fθ, and compute the expectation µ* = µθ̂ corresponding to Fθ̂.
1. Generate a bootstrap dataset x*1, x*2, . . . , x*n from Fθ̂.
2. Compute the centered sample mean for the bootstrap dataset:
x̄*n − µθ̂,
where
x̄*n = (x*1 + x*2 + · · · + x*n)/n.
Repeat steps 1 and 2 many times.
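For example, with an Exp(λ) model and λ estimated by 1/x̄n (as for the software data), the procedure can be sketched in Python/NumPy as follows; this is an illustrative sketch, not code from the book, and interfailure_times is a hypothetical stand-in for the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for the n = 135 observed interfailure times.
interfailure_times = rng.exponential(scale=660.0, size=135)

n = interfailure_times.size
lam_hat = 1.0 / interfailure_times.mean()   # estimate of lambda
mu_hat = 1.0 / lam_hat                      # expectation of Exp(lam_hat)

B = 1000
centered_means = np.empty(B)
for b in range(B):
    # Step 1: bootstrap dataset from the estimated model distribution.
    bootstrap_sample = rng.exponential(scale=1.0 / lam_hat, size=n)
    # Step 2: centered sample mean of the bootstrap dataset.
    centered_means[b] = bootstrap_sample.mean() - mu_hat
```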
As an application we will use the parametric bootstrap simulation to investi-
gate whether the exponential distribution is a reasonable model for the soft-
ware data.
Are the software data exponential?
Consider fitting an exponential distribution to the software data, as discussed
in Section 17.3. At first sight, Figure 17.6 shows a reasonable fit with the ex-
ponential distribution. One way to quantify the difference between the dataset
and the exponential model is to compute the maximum distance between the
empirical distribution function Fn of the dataset and the exponential distri-
bution function Fλ̂ estimated from the dataset:
tks = sup_{a∈R} |Fn(a) − Fλ̂(a)|.
Here Fλ̂(a) = 0 for a < 0 and
Fλ̂(a) = 1 − e^{−λ̂a}   for a ≥ 0,
where λ̂ = 1/x̄n is estimated from the dataset. The quantity tks is called the
Kolmogorov-Smirnov distance between Fn and Fλ̂.
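Since Fλ̂ is continuous and increasing while Fn jumps by 1/n at each (ordered) observation, the supremum is attained at a data point, comparing Fλ̂ there with the value of Fn just before and just after the jump. The sketch below (an illustration, not code from the book; the dataset is a hypothetical placeholder) computes tks this way:

```python
import numpy as np

def ks_distance_exponential(data):
    """Compute sup_a |F_n(a) - F_lamhat(a)| with lamhat = 1/mean(data)."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    f_hat = 1.0 - np.exp(-x / x.mean())      # fitted Exp cdf at the data points
    upper = np.arange(1, n + 1) / n - f_hat  # F_n just after each jump
    lower = f_hat - np.arange(0, n) / n      # F_n just before each jump
    return max(upper.max(), lower.max())

# Hypothetical stand-in for the software data.
rng = np.random.default_rng(4)
print(ks_distance_exponential(rng.exponential(scale=660.0, size=135)))
```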
The idea behind the use of this distance is the following. If F denotes the
“true” distribution function, then according to Section 17.2 the empirical
distribution function Fn will resemble F whether F equals the distribution
function Fλ of some Exp(λ) distribution or not. On the other hand, if the
“true” distribution function is Fλ, then the estimated exponential distribu-
tion function Fλ̂ will resemble Fλ, because λ̂ = 1/x̄n is close to the “true” λ.
Therefore, if F = Fλ, then both Fn and Fλ̂ will be close to the same distribu-
tion function, so that tks is small; if F is different from Fλ, then Fn and Fλ̂
are close to two different distribution functions, so that tks is large. The value
tks is always between 0 and 1, and the further away this value is from 0, the
more it is an indication that the exponential model is inappropriate. For the
software dataset we find λ̂ = 1/x̄n = 0.0015 and tks = 0.176. Does this speak
against the believed exponential model?
One way to investigate this is to find out whether, in the case when the data are
truly a realization of an exponential random sample from Fλ, the value 0.176 is
unusually large. To answer this question we consider the sample statistic that
corresponds to tks. The estimate λ̂ = 1/x̄n is replaced by the random variable
Λ̂ = 1/X̄n, and the empirical distribution function of the dataset is replaced
by the empirical distribution function of the random sample X1, X2, . . . , Xn
(again denoted by Fn):
Fn(a) = (number of Xi less than or equal to a) / n.
In this way, tks is a realization of the sample statistic
Tks = sup_{a∈R} |Fn(a) − FΛ̂(a)|.
To find out whether 0.176 is an exceptionally large value for the random vari-
able Tks, we must determine the probability distribution of Tks. However, this
is impossible because the parameter λ of the Exp(λ) distribution is unknown.
We will approximate the distribution of Tks by a parametric bootstrap. We use
the dataset to estimate λ by λ̂ = 1/x̄n = 0.0015 and replace the random sam-
ple X1, X2, . . . , Xn from Fλ by a bootstrap random sample X*1, X*2, . . . , X*n from Fλ̂. Next we approximate the distribution of Tks by that of its bootstrapped version
T*ks = sup_{a∈R} |F*n(a) − FΛ̂*(a)|,
where F*n is the empirical distribution function of the bootstrap random sample:
F*n(a) = (number of X*i less than or equal to a) / n,
and Λ̂* = 1/X̄*n, with X̄*n being the average of the bootstrap random sample.
The bootstrapped sample statistic T*ks is too complicated to determine its
probability distribution, and hence we perform a parametric bootstrap simu-
lation:
1. We generate a bootstrap dataset x*1, x*2, . . . , x*135 from an exponential distribution with parameter λ̂ = 0.0015.
2. We compute the bootstrapped KS distance
t*ks = sup_{a∈R} |F*n(a) − Fλ̂*(a)|,
where F*n denotes the empirical distribution function of the bootstrap dataset and Fλ̂* denotes the estimated exponential distribution function, where λ̂* = 1/x̄*n is computed from the bootstrap dataset.
We repeat steps 1 and 2 one thousand times, which results in one thousand
values of the bootstrapped KS distance. In Figure 18.2 we have displayed a
histogram and kernel density estimate of the one thousand bootstrapped KS
distances. It is clear that if the software data would come from an exponential
distribution, the value 0.176 of the KS distance would be very unlikely! This
strongly suggests that the exponential distribution is not the right model for
the software data. The reason for this is that the Poisson process is the wrong
model for the series of failures. A closer inspection shows that the rate at
which failures occur over time is not constant, as was assumed in Chapter 17,
but decreases.
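The two steps above, repeated one thousand times, translate directly into code. The following Python/NumPy sketch (an illustration, not from the book; the observed dataset and hence the observed KS distance are placeholders) also reports the fraction of bootstrapped KS distances that reach the observed value:

```python
import numpy as np

rng = np.random.default_rng(5)

def ks_distance_exponential(data):
    """sup_a |F_n(a) - F_lamhat(a)| with lamhat = 1/mean(data)."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    f_hat = 1.0 - np.exp(-x / x.mean())
    return max((np.arange(1, n + 1) / n - f_hat).max(),
               (f_hat - np.arange(0, n) / n).max())

# Hypothetical stand-in for the 135 interfailure times.
data = rng.exponential(scale=660.0, size=135)
lam_hat = 1.0 / data.mean()
t_ks = ks_distance_exponential(data)

# Steps 1 and 2, one thousand times.
t_ks_star = np.array([
    ks_distance_exponential(rng.exponential(scale=1.0 / lam_hat, size=data.size))
    for _ in range(1000)
])

# Fraction of bootstrapped KS distances at least as large as the observed one.
print(np.mean(t_ks_star >= t_ks))
```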
Fig. 18.2. One thousand bootstrapped KS distances.
18.4 Solutions to the quick exercises
18.1 You could have written something like the following: “Use the dataset
x1, x2, . . . , xn to compute an estimate F̂ for F. Replace the random sample
X1, X2, . . . , Xn from F by a random sample X*1, X*2, . . . , X*n from F̂, and approximate the probability distribution of
Med(X1, X2, . . . , Xn) − F^inv(0.5)
by that of Med(X*1, X*2, . . . , X*n) − F̂^inv(0.5), where F̂^inv(0.5) is the median of F̂.”
18.2 You could have written something like the following: “Given a dataset
x1, x2, . . . , xn, determine its empirical distribution function Fn as an estimate
of F, and the median Fn^inv(0.5) of Fn.
1. Generate a bootstrap dataset x*1, x*2, . . . , x*n from Fn.
2. Compute the sample median for the bootstrap dataset:
Med*n − Fn^inv(0.5),
where Med*n = sample median of x*1, x*2, . . . , x*n.
Repeat steps 1 and 2 many times.”
Note that if n is odd, then Fn^inv(0.5) equals the sample median of the original
dataset, but this is not necessarily so for n even.
18.3 According to Remark 11.2 about the sum of independent normal ran-
dom variables, the sum of n independent N(µ, 1) distributed random variables
has an N(nµ, n) distribution. Hence by the change-of-units rule for the normal
distribution (see page 106), it follows that X̄n has an N(µ, 1/n) distribution,
and that X̄n − µ has an N(0, 1/n) distribution. Similarly, the average X̄*n of n independent N(x̄n, 1) distributed bootstrap random variables has an N(x̄n, 1/n) distribution, and therefore X̄*n − x̄n again has an N(0, 1/n) distribution.
18.5 Exercises
18.1 We generate a bootstrap dataset x*1, x*2, . . . , x*6 from the empirical
distribution function of the dataset
2 1 1 4 6 3,
i.e., we draw (with replacement) six values from these numbers with equal
probability 1/6. How many different bootstrap datasets are possible? Are
they all equally likely to occur?
18.2 We generate a bootstrap dataset x*1, x*2, x*3, x*4 from the empirical distri-
bution function of the dataset
1 3 4 6.
a. Compute the probability that the bootstrap sample mean is equal to 1.
b. Compute the probability that the maximum of the bootstrap dataset is
equal to 6.
c. Compute the probability that exactly two elements in the bootstrap sam-
ple are less than 2.
18.3 We generate a bootstrap dataset x*1, x*2, . . . , x*10 from the empirical
distribution function of the dataset
0.39 0.41 0.38 0.44 0.40
0.36 0.34 0.46 0.35 0.37.
a. Compute the probability that the bootstrap dataset has exactly three
elements equal to 0.35.
b. Compute the probability that the bootstrap dataset has at most two ele-
ments less than or equal to 0.38.
c. Compute the probability that the bootstrap dataset has exactly two ele-
ments less than or equal to 0.38 and all other elements greater than 0.42.
18.4  Consider the dataset from Exercise 18.3, with maximum 0.46.
a. We generate a bootstrap random sample X*1, X*2, . . . , X*10 from the empirical distribution function of the dataset. Compute P(M*10 < 0.46), where
M*10 = max{X*1, X*2, . . . , X*10}.
b. The same question as in a, but now for a dataset with distinct elements x1, x2, . . . , xn and maximum mn. Compute P(M*n < mn), where M*n is the maximum of a bootstrap random sample X*1, X*2, . . . , X*n generated
from the empirical distribution function of the dataset.
18.5  Suppose we have a dataset
0 3 6,
which is the realization of a random sample from a distribution function F. If
we estimate F by the empirical distribution function, then according to the
bootstrap principle applied to the centered sample mean X̄3 − µ, we must
replace this random variable by its bootstrapped version X̄*3 − x̄3. Determine the possible values for the bootstrap random variable X̄*3 − x̄3 and the corre-
sponding probabilities.
18.6 Suppose that the dataset x1, x2, . . . , xn is a realization of a random
sample from an Exp(λ) distribution with distribution function Fλ, and that
x̄n = 5.
a. Check that the median of the Exp(λ) distribution is mλ = (ln 2)/λ (see
also Exercise 5.11).
b. Suppose we estimate λ by 1/x̄n. Describe the parametric bootstrap sim-
ulation for Med(X1, X2, . . . , Xn) − mλ.
18.7  To give an example in which the bootstrapped centered sample mean
in the parametric and empirical bootstrap simulations may be different, con-
sider the following situation. Suppose that the dataset x1, x2, . . . , xn is a re-
alization of a random sample from a U(0, θ) distribution with expectation
µ = θ/2. We estimate θ by
θ̂ = ((n + 1)/n) · mn,
where mn = max{x1, x2, . . . , xn}. Describe the parametric bootstrap simula-
tion for the centered sample mean X̄n − µ.
18.8  Here is an example in which the bootstrapped centered sample mean in
the parametric and empirical bootstrap simulations are the same. Consider the
software data with average x̄n = 656.8815 and median mn = 290, modeled as
a realization of a random sample X1, X2, . . . , Xn from a distribution function
F with expectation µ. By means of a bootstrap simulation we would like to get an
impression of the distribution of X̄n − µ.
a. Suppose that we assume nothing about the distribution of the interfailure
times. Describe the appropriate bootstrap simulation procedure with one
thousand repetitions.
b. Suppose we assume that F is the distribution function of an Exp(λ) distri-
bution, where λ is estimated by 1/x̄n = 0.0015. Describe the appropriate
bootstrap simulation procedure with one thousand repetitions.
c. Suppose we assume that F is the distribution function of an Exp(λ) dis-
tribution, and that (as suggested by Exercise 18.6 a) the parameter λ
is estimated by (ln 2)/mn = 0.0024. Describe the appropriate bootstrap
simulation procedure with one thousand repetitions.
18.9  Consider the dataset from Exercises 15.1 and 17.6 consisting of mea-
sured chest circumferences of Scottish soldiers with average x̄n = 39.85 and
sample standard deviation sn = 2.09. The histogram in Figure 17.11 suggests
modeling the data as the realization of a random sample X1, X2, . . . , Xn from
an N(µ, σ²) distribution. We estimate µ by the sample mean and we are inter-
ested in the probability that the sample mean deviates more than 1 from µ:
P(|X̄n − µ| > 1). Describe how one can use the bootstrap principle to approximate this probability, i.e., describe the distribution of the bootstrap random sample X*1, X*2, . . . , X*n and compute P(|X̄*n − µ*| > 1). Note that one does
not need a simulation to approximate this latter probability.
18.10 Consider the software data, with average x̄n = 656.8815, modeled as
a realization of a random sample X1, X2, . . . , Xn from a distribution func-
tion F. We estimate the expectation µ of F by the sample mean and we are
interested in the probability that the sample mean deviates more than ten
from µ: P(|X̄n − µ| > 10).
a. Suppose we assume nothing about the distribution of the interfailure
times. Describe how one can obtain a bootstrap approximation for the
probability, i.e., describe the appropriate bootstrap simulation procedure
with one thousand repetitions and how the results of this simulation can
be used to approximate the probability.
b. Suppose we assume that F is the distribution function of an Exp(λ) dis-
tribution. Describe how one can obtain a bootstrap approximation for the
probability.
18.11 Consider the dataset of measured chest circumferences of 5732 Scottish
soldiers (see Exercises 15.1, 17.6, and 18.9). The Kolmogorov-Smirnov distance
between the empirical distribution function and the distribution function
Fx̄n,sn of the normal distribution with estimated parameters µ̂ = x̄n = 39.85
and σ̂ = sn = 2.09 is equal to
tks = sup_{a∈R} |Fn(a) − Fx̄n,sn(a)| = 0.0987,
where x̄n and sn denote sample mean and sample standard deviation of the
dataset. Suppose we want to perform a bootstrap simulation with one thou-
sand repetitions for the KS distance to investigate to which degree the value
0.0987 agrees with the assumed normality of the dataset. Describe the appro-
priate bootstrap simulation that must be carried out.
18.12 To give an example where the empirical bootstrap fails, consider the
following situation. Suppose our dataset x1, x2, . . . , xn is a realization of a
random sample X1, X2, . . . , Xn from a U(0, θ) distribution. Consider the nor-
malized sample statistic
Tn = 1 − Mn/θ,
where Mn is the maximum of X1, X2, . . . , Xn. Let X*1, X*2, . . . , X*n be a bootstrap random sample from the empirical distribution function of our dataset, and let M*n be the corresponding bootstrap maximum. We are going to compare the distribution functions of Tn and its bootstrap counterpart
T*n = 1 − M*n/mn,
where mn is the maximum of x1, x2, . . . , xn.
a. Check that P(Tn ≤ 0) = 0 and show that
P(T*n ≤ 0) = 1 − (1 − 1/n)^n.
Hint: first argue that P(T*n ≤ 0) = P(M*n = mn), and then use the result
of Exercise 18.4.
b. Let Gn(t) = P(Tn ≤ t) be the distribution function of Tn, and similarly let
G*n(t) = P(T*n ≤ t) be the distribution function of the bootstrap statistic T*n. Conclude from part a that the maximum distance between G*n and Gn can be bounded from below as follows:
sup_{t∈R} |G*n(t) − Gn(t)| ≥ 1 − (1 − 1/n)^n.
c. Use part b to argue that for all n, the maximum distance between G*n and Gn is greater than 0.632:
sup_{t∈R} |G*n(t) − Gn(t)| ≥ 1 − e^{−1} = 0.632.
Hint: you may use that e^{−x} ≥ 1 − x for all x.
We conclude that even for very large sample sizes the maximum distance
between the distribution functions of Tn and its bootstrap counterpart T*n
is at least 0.632.
18.13 (Exercise 18.12 continued). In contrast to the empirical bootstrap, the
parametric bootstrap for Tn does work. Suppose we estimate the parameter θ
of the U(0, θ) distribution by
θ̂ = ((n + 1)/n) · mn,   where mn = maximum of x1, x2, . . . , xn.
Let now X*1, X*2, . . . , X*n be a bootstrap random sample from a U(0, θ̂) distribution, and let M*n be the corresponding bootstrap maximum. Again, we are going to compare the distribution function Gn of Tn = 1 − Mn/θ with the distribution function G*n of its bootstrap counterpart T*n = 1 − M*n/θ̂.
a. Check that the distribution function Fθ of a U(0, θ) distribution is given
by
Fθ(a) = a/θ   for 0 ≤ a ≤ θ.
b. Check that the distribution function of Tn is
Gn(t) = P(Tn ≤ t) = 1 − (1 − t)^n   for 0 ≤ t ≤ 1.
Hint: rewrite P(Tn ≤ t) as 1 − P(Mn ≤ θ(1 − t)) and use the rule on
page 109 about the distribution function of the maximum.
c. Show that T*n has the same distribution function:
G*n(t) = P(T*n ≤ t) = 1 − (1 − t)^n   for 0 ≤ t ≤ 1.
This means that, in contrast to the empirical bootstrap (see Exer-
cise 18.12), the parametric bootstrap works perfectly in this situation.
19
Unbiased estimators
In Chapter 17 we saw that a dataset can be modeled as a realization of a
random sample from a probability distribution and that quantities of interest
correspond to features of the model distribution. One of our tasks is to use the
dataset to estimate a quantity of interest. We shall mainly deal with the situ-
ation where it is modeled as one of the parameters of the model distribution
or as a certain function of the parameters. We will first discuss what we mean
exactly by an estimator and then introduce the notion of unbiasedness as a
desirable property for estimators. We end the chapter by providing unbiased
estimators for the expectation and variance of a model distribution.
19.1 Estimators
Consider the arrivals of packages at a network server. One is interested in the
intensity at which packages arrive on a generic day and in the percentage of
minutes during which no packages arrive. If the arrivals occur completely at
random in time, the arrival process can be modeled by a Poisson process. This
would mean that the number of arrivals during one minute is modeled by a
random variable having a Poisson distribution with (unknown) parameter µ.
The intensity of the arrivals is then modeled by the parameter µ itself, and
the percentage of minutes during which no packages arrive is modeled by the
probability of zero arrivals: e^{−µ}. Suppose one observes the arrival process for a
while and gathers a dataset x1, x2, . . . , xn, where xi represents the number of
arrivals in the ith minute. Our task will be to estimate, based on the dataset,
the parameter µ and a function of the parameter: e^{−µ}.
This example is typical for the general situation in which our dataset is mod-
eled as a realization of a random sample X1, X2, . . . , Xn from a probability
distribution that is completely determined by one or more parameters. The
parameters that determine the model distribution are called the model param-
eters. We focus on the situation where the quantity of interest corresponds
to a feature of the model distribution that can be described by the model
parameters themselves or by some function of the model parameters. This
distribution feature is referred to as the parameter of interest. In discussing
this general setup we shall denote the parameter of interest by the Greek
letter θ. So, for instance, in our network server example, µ is the model pa-
rameter. When we are interested in the arrival intensity, the role of θ is played
by the parameter µ itself, and when we are interested in the percentage of
idle minutes the role of θ is played by e^{−µ}.
Whatever method we use to estimate the parameter of interest θ, the result
depends only on our dataset.
Estimate. An estimate is a value t that only depends on the dataset
x1, x2, . . . , xn, i.e., t is some function of the dataset only:
t = h(x1, x2, . . . , xn).
This description of estimate is a bit formal. The idea is, of course, that the
value t, computed from our dataset x1, x2, . . . , xn, gives some indication of
the “true” value of the parameter θ. We have already met several estimates in
Chapter 17; see, for instance, Table 17.2. This table illustrates that the value
of an estimate can be anything: a single number, a vector of numbers, even a
complete curve.
Let us return to our network server example in which our dataset x1, x2, . . . , xn
is modeled as a realization of a random sample from a Pois(µ) distribution.
The intensity at which packages arrive is then represented by the parameter µ.
Since the parameter µ is the expectation of the model distribution, the law
of large numbers suggests the sample mean x̄n as a natural estimate for µ.
On the other hand, the parameter µ also represents the variance of the model
distribution, so that by a similar reasoning another natural estimate is the
sample variance sn².
The percentage of idle minutes is modeled by the probability of zero arrivals.
Similar to the reasoning in Section 13.4, a natural estimate is the relative
frequency of zeros in the dataset:
(number of xi equal to zero) / n.
On the other hand, the probability of zero arrivals can be expressed as a function of the model parameter: e^{−µ}. Hence, if we estimate µ by x̄n, we could also estimate e^{−µ} by e^{−x̄n}.
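The two competing estimates are easy to compute side by side; a small illustration (not part of the book) with a hypothetical dataset of per-minute arrival counts:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical dataset: number of arrivals in each of n = 30 minutes.
counts = rng.poisson(lam=np.log(10), size=30)

# Estimate 1: relative frequency of zeros in the dataset.
p0_relative_frequency = np.mean(counts == 0)

# Estimate 2: plug the sample mean into e^{-mu}.
p0_plug_in = np.exp(-counts.mean())

print(p0_relative_frequency, p0_plug_in)
```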
Quick exercise 19.1 Suppose we estimate the probability of zero arrivals
e^{−µ} by the relative frequency of xi equal to zero. Deduce an estimate for µ
from this.
The preceding examples illustrate that one can often think of several estimates
for the parameter of interest. This raises questions like
• When is one estimate better than another?
• Does there exist a best possible estimate?
For instance, can we say which of the values x̄n or sn² computed from the
dataset is closer to the “true” parameter µ? The answer is no. The measure-
ments and the corresponding estimates are subject to randomness, so that
we cannot say anything with certainty about which of the two is closer to µ.
One of the things we can say for each of them is how likely it is that they are
within a given distance from µ. To this end, we consider the random variables
that correspond to the estimates. Because our dataset x1, x2, . . . , xn is mod-
eled as a realization of a random sample X1, X2, . . . , Xn, the estimate t is a
realization of a random variable T .
Estimator. Let t = h(x1, x2, . . . , xn) be an estimate based on the
dataset x1, x2, . . . , xn. Then t is a realization of the random variable
T = h(X1, X2, . . . , Xn).
The random variable T is called an estimator.
The word estimator refers to the method or device for estimation. This is
distinguished from estimate, which refers to the actual value computed from
a dataset. Note that estimators are special cases of sample statistics. In the
remainder of this chapter we will discuss the notion of unbiasedness that
describes to some extent the behavior of estimators.
19.2 Investigating the behavior of an estimator
Let us continue with our network server example. Suppose we have observed
the network for 30 minutes and we have recorded the number of arrivals in
each minute. The dataset is modeled as a realization of a random sample
X1, X2, . . . , Xn of size n = 30 from a Pois(µ) distribution. Let us concentrate
on estimating the probability p0 of zero arrivals, which is an unknown number
between 0 and 1. As motivated in the previous section, we have the following
possible estimators:
S = (number of Xi equal to zero) / n   and   T = e^{−X̄n}.
Our first estimator S can only attain the values 0, 1/30, 2/30, . . . , 1, so that in general it cannot give the exact value of p0. Similarly for our second estimator T, which can only attain the values 1, e^{−1/30}, e^{−2/30}, . . . . So clearly, we
cannot expect our estimators always to give the exact value of p0 on basis of
30 observations. Well, then what can we expect from a reasonable estimator?
To get an idea of the behavior of both estimators, we pretend we know µ
and we simulate the estimation process in the case of n = 30 observations.
Let us choose µ = ln 10, so that p0 = e−µ
= 0.1. We draw 30 values from
a Poisson distribution with parameter µ = ln 10 and compute the value of
estimators S and T . We repeat this 500 times, so that we have 500 values
for each estimator. In Figure 19.1 a frequency histogram¹ of these values
for estimator S is displayed on the left and for estimator T on the right.
Clearly, the values of both estimators vary around the value 0.1, which they
are supposed to estimate.
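This simulation experiment is easy to reproduce; a minimal Python/NumPy sketch (an illustration, not code from the book) of the 500 repetitions with n = 30 and µ = ln 10:

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.log(10)          # so that p0 = e^{-mu} = 0.1
n, runs = 30, 500

s_values = np.empty(runs)
t_values = np.empty(runs)
for r in range(runs):
    sample = rng.poisson(lam=mu, size=n)
    s_values[r] = np.mean(sample == 0)     # estimator S
    t_values[r] = np.exp(-sample.mean())   # estimator T

# Both collections of values fluctuate around p0 = 0.1.
print(s_values.mean(), t_values.mean())
```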
Fig. 19.1. Frequency histograms of 500 values for estimators S (left) and T (right) of p0 = 0.1.
19.3 The sampling distribution and unbiasedness
We have just seen that the values generated for estimator S fluctuate around
p0 = 0.1. Although the value of this estimator is not always equal to 0.1, it
is desirable that on average, S is on target, i.e., E[S] = 0.1. Moreover, it is
desirable that this property holds no matter what the actual value of p0 is,
i.e.,
E[S] = p0
irrespective of the value 0 < p0 < 1. In order to find out whether this is
true, we need the probability distribution of the estimator S. Of course this
¹ In a frequency histogram the height of each vertical bar equals the frequency of
values in the corresponding bin.
is simply the distribution of a random variable, but because estimators are
constructed from a random sample X1, X2, . . . , Xn, we speak of the sampling
distribution.
The sampling distribution. Let T = h(X1, X2, . . . , Xn) be an
estimator based on a random sample X1, X2, . . . , Xn. The probabil-
ity distribution of T is called the sampling distribution of T .
The sampling distribution of S can be found as follows. Write
S = Y/n,
where Y is the number of Xi equal to zero. If for each i we label Xi = 0 as
a success, then Y is equal to the number of successes in n independent trials
with p0 as the probability of success. Similar to Section 4.3, it follows that Y
has a Bin(n, p0) distribution. Hence the sampling distribution of S is that of
a Bin(n, p0) distributed random variable divided by n. This means that S is
a discrete random variable that attains the values k/n, where k = 0, 1, . . . , n,
with probabilities given by
pS(k/n) = P(S = k/n) = P(Y = k) = (n choose k) p0^k (1 − p0)^{n−k}.
The probability mass function of S for the case n = 30 and p0 = 0.1 is
displayed in Figure 19.2. Since S = Y/n and Y has a Bin(n, p0) distribution,
it follows that
E[S] = E[Y]/n = np0/n = p0.
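A quick numerical check of this identity (an illustration, not part of the book) sums k/n against the Bin(n, p0) probability mass function for n = 30 and p0 = 0.1:

```python
from math import comb

n, p0 = 30, 0.1

# E[S] = sum over k of (k/n) * P(Y = k), with Y having a Bin(n, p0) distribution.
expected_S = sum((k / n) * comb(n, k) * p0**k * (1 - p0) ** (n - k)
                 for k in range(n + 1))
print(expected_S)   # equals p0 = 0.1 up to rounding error
```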
So, indeed, the estimator S for p0 has the property E[S] = p0. This property
reflects the fact that estimator S has no systematic tendency to produce
Fig. 19.2. Probability mass function of S.
estimates that are larger than p0, and no systematic tendency to produce
estimates that are smaller than p0. This is a desirable property for estimators,
and estimators that have this property are called unbiased.
Definition. An estimator T is called an unbiased estimator for the
parameter θ, if
E[T ] = θ
irrespective of the value of θ. The difference E[T ] − θ is called the
bias of T ; if this difference is nonzero, then T is called biased.
Let us return to our second estimator for the probability of zero arrivals in
the network server example: T = e^{−X̄n}. The sampling distribution can be
obtained as follows. Write
T = e^{−Z/n},
where Z = X1 + X2 + · · · + Xn. From Exercise 12.9 we know that the random variable Z, being the sum of n independent Pois(µ) random variables, has a Pois(nµ) distribution. This means that T is a discrete random variable attaining values e^{−k/n}, where k = 0, 1, . . . and the probability mass function of T is given by
pT(e^{−k/n}) = P(T = e^{−k/n}) = P(Z = k) = e^{−nµ}(nµ)^k / k!.
The probability mass function of T for the case n = 30 and p0 = 0.1 is
displayed in Figure 19.3. From the histogram in Figure 19.1 as well as from
the probability mass function in Figure 19.3, you may get the impression
that T is also an unbiased estimator. However, this is not the case, which follows
immediately from an application of Jensen’s inequality:
Fig. 19.3. Probability mass function of T.
E[T] = E[e^{−X̄n}] > e^{−E[X̄n]},
where we have a strict inequality because the function g(x) = e^{−x} is strictly convex (g''(x) = e^{−x} > 0). Recall that the parameter µ equals the expectation
of the Pois(µ) model distribution, so that according to Section 13.1 we have
E[X̄n] = µ. We find that
E[T] > e^{−µ} = p0,
which means that the estimator T for p0 has positive bias. In fact we can
compute E[T ] exactly (see Exercise 19.9):
E[T] = E[e^{−X̄n}] = e^{−nµ(1−e^{−1/n})}.
Note that n(1 − e^{−1/n}) → 1, so that
E[T] = e^{−nµ(1−e^{−1/n})} → e^{−µ} = p0
as n goes to infinity. Hence, although T has positive bias, the bias decreases
to zero as the sample size becomes larger. In Figure 19.4 the expectation of
T is displayed as a function of the sample size n for the case µ = ln(10). For
n = 30 the difference between E[T ] and p0 = 0.1 equals 0.0038.
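This bias is easy to verify numerically (an illustration, not part of the book): evaluating the exact formula for µ = ln 10 and n = 30 gives E[T] ≈ 0.1038, hence a bias of about 0.0038.

```python
import numpy as np

mu, n = np.log(10), 30
expected_T = np.exp(-n * mu * (1 - np.exp(-1 / n)))
print(expected_T - 0.1)   # bias of T for p0 = 0.1; approximately 0.0038
```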
Fig. 19.4. E[T] as a function of n.
Quick exercise 19.2 If we estimate p0 = e^{−µ} by the relative frequency of
zeros S = Y/n, then we could estimate µ by U = − ln(S). Argue that U is a
biased estimator for µ. Is the bias positive or negative?
We conclude this section by returning to the estimation of the parameter µ.
Apart from the (biased) estimator in Quick exercise 19.2 we also considered
the sample mean X̄n and sample variance Sn² as possible estimators for µ. These are both unbiased estimators for the parameter µ. This is a direct consequence of a more general property of X̄n and Sn², which is discussed in
the next section.
19.4 Unbiased estimators for expectation and variance
Sometimes the quantity of interest can be described by the expectation or
variance of the model distribution, and it is irrelevant whether this distribution
is of a parametric type. In this section we propose unbiased estimators for
these distribution features.
Unbiased estimators for expectation and variance. Sup-
pose X1, X2, . . . , Xn is a random sample from a distribution with
finite expectation µ and finite variance σ². Then
X̄n = (X1 + X2 + · · · + Xn)/n
is an unbiased estimator for µ and
Sn² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄n)²
is an unbiased estimator for σ².
The first statement says that E[X̄n] = µ, which was shown in Section 13.1.
The second statement says E[Sn²] = σ². To see this, use linearity of expectations
to write

   E[Sn²] = 1/(n − 1) · Σ_{i=1}^{n} E[(Xi − X̄n)²].

Since E[X̄n] = µ, we have E[Xi − X̄n] = E[Xi] − E[X̄n] = 0. Now note that
for any random variable Y with E[Y] = 0, we have

   Var(Y) = E[Y²] − (E[Y])² = E[Y²].

Applying this to Y = Xi − X̄n, it follows that

   E[(Xi − X̄n)²] = Var(Xi − X̄n).

Note that we can write

   Xi − X̄n = ((n − 1)/n) · Xi − (1/n) · Σ_{j≠i} Xj.

Then from the rules concerning variances of sums of independent random
variables we find that

   Var(Xi − X̄n) = Var( ((n − 1)/n) · Xi − (1/n) · Σ_{j≠i} Xj )
                 = ((n − 1)²/n²) · Var(Xi) + (1/n²) · Σ_{j≠i} Var(Xj)
                 = ( (n − 1)²/n² + (n − 1)/n² ) · σ²
                 = ((n − 1)/n) · σ².

We conclude that

   E[Sn²] = 1/(n − 1) · Σ_{i=1}^{n} E[(Xi − X̄n)²]
          = 1/(n − 1) · Σ_{i=1}^{n} Var(Xi − X̄n)
          = 1/(n − 1) · n · ((n − 1)/n) · σ² = σ².

This explains why we divide by n − 1 in the formula for Sn²; only in this case
is Sn² an unbiased estimator for the "true" variance σ². If we divided by
n instead of n − 1, we would obtain an estimator with negative bias; it would
systematically produce too-small estimates for σ².
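The effect of the divisor can also be checked numerically. The following sketch is an added illustration (it assumes NumPy and, for concreteness, a normal model with σ² = 4): it averages the two variance estimators over many samples of size n = 5.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, reps = 5, 4.0, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1
v2_biased   = samples.var(axis=1, ddof=0)   # divide by n

print(s2_unbiased.mean())   # close to sigma^2 = 4
print(v2_biased.mean())     # close to (n-1)/n * sigma^2 = 3.2
```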
Quick exercise 19.3 Consider the following estimator for σ²:

   Vn² = (1/n) · Σ_{i=1}^{n} (Xi − X̄n)².

Compute the bias E[Vn²] − σ² for this estimator, where you can keep computations
simple by realizing that Vn² = (n − 1)Sn²/n.
Unbiasedness does not always carry over
We have seen that Sn² is an unbiased estimator for the "true" variance σ². A
natural question is whether Sn is again an unbiased estimator for σ. This is not
the case. Since the function g(x) = x² is strictly convex, Jensen's inequality
yields that

   σ² = E[Sn²] > (E[Sn])²,

which implies that E[Sn] < σ. Another example is the network arrivals, in
which X̄n is an unbiased estimator for µ, whereas e^{−X̄n} is positively biased
with respect to e^{−µ}. These examples illustrate a general fact: unbiasedness
does not always carry over, i.e., if T is an unbiased estimator for a parameter θ,
then g(T) does not have to be an unbiased estimator for g(θ).
However, there is one special case in which unbiasedness does carry over,
namely if g(T ) = aT + b. Indeed, if T is unbiased for θ: E[T ] = θ, then by the
change-of-units rule for expectations,
E[aT + b] = aE[T ] + b = aθ + b,
which means that aT + b is unbiased for aθ + b.
19.5 Solutions to the quick exercises
19.1 Write y for the number of xi equal to zero. Denote the probability of
zero by p0, so that p0 = e−µ
. This means that µ = − ln(p0). Hence if we
estimate p0 by the relative frequency y/n, we can estimate µ by − ln(y/n).
19.2 The function g(x) = − ln(x) is strictly convex, since g″(x) = 1/x² > 0.
Hence by Jensen's inequality

   E[U] = E[− ln(S)] > − ln(E[S]).

Since we have seen that E[S] = p0 = e^{−µ}, it follows that E[U] > − ln(E[S]) =
− ln(e^{−µ}) = µ. This means that U has positive bias.
19.3 Using that E[Sn²] = σ², we find that

   E[Vn²] = E[((n − 1)/n) · Sn²] = ((n − 1)/n) · E[Sn²] = ((n − 1)/n) · σ².

We conclude that the bias of Vn² equals E[Vn²] − σ² = −σ²/n < 0.
19.6 Exercises
19.1  Suppose our dataset is a realization of a random sample X1, X2, . . . , Xn
from a uniform distribution on the interval [−θ, θ], where θ is unknown.
a. Show that

      T = (3/n) · (X1² + X2² + · · · + Xn²)

   is an unbiased estimator for θ².
b. Is √T also an unbiased estimator for θ? If not, argue whether it has
   positive or negative bias.
19.2 Suppose the random variables X1, X2, . . . , Xn have the same expecta-
tion µ.
a. Is S = (1/2)X1 + (1/3)X2 + (1/6)X3 an unbiased estimator for µ?
b. Under what conditions on constants a1, a2, . . . , an is
T = a1X1 + a2X2 + · · · + anXn
an unbiased estimator for µ?
19.3  Suppose the random variables X1, X2, . . . , Xn have the same expec-
tation µ. For which constants a and b is
T = a(X1 + X2 + · · · + Xn) + b
an unbiased estimator for µ?
19.4 Recall Exercise 17.5 about the number of cycles to pregnancy. Suppose
the dataset corresponding to the table in Exercise 17.5 a is modeled as a
realization of a random sample X1, X2, . . . , Xn from a Geo(p) distribution,
where 0 < p ≤ 1 is unknown. Motivated by the law of large numbers, a
natural estimator for p is
T = 1/X̄n.
a. Check that T is a biased estimator for p and find out whether it has
positive or negative bias.
b. In Exercise 17.5 we discussed the estimation of the probability that a
woman becomes pregnant within three or fewer cycles. One possible esti-
mator for this probability is the relative frequency of women that became
pregnant within three cycles
   S = (number of Xi ≤ 3)/n.
Show that S is an unbiased estimator for this probability.
19.5  Suppose a dataset is modeled as a realization of a random sample
X1, X2, . . . , Xn from an Exp(λ) distribution, where λ > 0 is unknown. Let
µ denote the corresponding expectation and let Mn denote the minimum of
X1, X2, . . . , Xn. Recall from Exercise 8.18 that Mn has an Exp(nλ) distribu-
tion. Find out for which constant c the estimator
T = cMn
is an unbiased estimator for µ.
19.6  Consider the following dataset of lifetimes of ball bearings in hours.
6278 3113 5236 11584 12628 7725 8604 14266 6125 9350
3212 9003 3523 12888 9460 13431 17809 2812 11825 2398
Source: J.E. Angus. Goodness-of-fit tests for exponentiality based on a loss-
of-memory type functional equation. Journal of Statistical Planning and In-
ference, 6:241-251, 1982; example 5 on page 249.
One is interested in estimating the minimum lifetime of this type of ball bear-
ing. The dataset is modeled as a realization of a random sample X1, . . . , Xn.
Each random variable Xi is represented as
Xi = δ + Yi,
where Yi has an Exp(λ) distribution and δ > 0 is an unknown parameter that
is supposed to model the minimum lifetime. The objective is to construct an
unbiased estimator for δ. It is known that
   E[Mn] = δ + 1/(nλ)   and   E[X̄n] = δ + 1/λ,

where Mn = minimum of X1, X2, . . . , Xn and X̄n = (X1 + X2 + · · · + Xn)/n.
a. Check that

      T = (n/(n − 1)) · (X̄n − Mn)

   is an unbiased estimator for 1/λ.
b. Construct an unbiased estimator for δ.
c. Use the dataset to compute an estimate for the minimum lifetime δ. You
may use that the average lifetime of the data is 8563.5.
19.7 Leaves are divided into four different types: starchy-green, sugary-white,
starchy-white, and sugary-green. According to genetic theory, the types occur
with probabilities (1/4)(θ + 2), (1/4)θ, (1/4)(1 − θ), and (1/4)(1 − θ), respectively, where
0 < θ < 1. Suppose one has n leaves. Then the number of starchy-green
leaves is modeled by a random variable N1 with a Bin(n, p1) distribution,
where p1 = (1/4)(θ + 2), and the number of sugary-white leaves is modeled by
a random variable N2 with a Bin(n, p2) distribution, where p2 = (1/4)θ. The
among 3839 leaves.
Type Count
Starchy-green 1997
Sugary-white 32
Starchy-white 906
Sugary-green 904
Source: R.A. Fisher. Statistical methods for research workers. Hafner, New
York, 1958; Table 62 on page 299.
Consider the following two estimators for θ:

   T1 = (4/n) · N1 − 2   and   T2 = (4/n) · N2.
a. Check that both T1 and T2 are unbiased estimators for θ.
b. Compute the value of both estimators for θ.
19.8  Recall the black cherry trees example from Exercise 17.9, modeled by
a linear regression model without intercept
Yi = βxi + Ui for i = 1, 2, . . . , n,
where U1, U2, . . . , Un are independent random variables with E[Ui] = 0 and
Var(Ui) = σ2
. We discussed three estimators for the parameter β:
   B1 = (1/n) · (Y1/x1 + · · · + Yn/xn),
   B2 = (Y1 + · · · + Yn)/(x1 + · · · + xn),
   B3 = (x1Y1 + · · · + xnYn)/(x1² + · · · + xn²).
Show that all three estimators are unbiased for β.
19.9 Consider the network example where the dataset is modeled as a real-
ization of a random sample X1, X2, . . . , Xn from a Pois(µ) distribution. We
estimate the probability of zero arrivals e^{−µ} by means of T = e^{−X̄n}. Check
that

   E[T] = e^{−nµ(1−e^{−1/n})}.

Hint: write T = e^{−Z/n}, where Z = X1 + X2 + · · · + Xn has a Pois(nµ)
distribution.
20
Efficiency and mean squared error
In the previous chapter we introduced the notion of unbiasedness as a de-
sirable property of an estimator. If several unbiased estimators for the same
parameter of interest exist, we need a criterion for comparison of these estima-
tors. A natural criterion is some measure of spread of the estimators around
the parameter of interest. For unbiased estimators we will use variance. For
arbitrary estimators we introduce the notion of mean squared error (MSE),
which combines variance and bias.
20.1 Estimating the number of German tanks
In this section we come back to the problem of estimating German war produc-
tion as discussed in Section 1.5. We consider serial numbers on tanks, recoded
to numbers running from 1 to some unknown largest number N. Given is a
subset of n numbers of this set. The objective is to estimate the total number
of tanks N on the basis of the observed serial numbers.
Denote the observed distinct serial numbers by x1, x2, . . . , xn. This dataset
can be modeled as a realization of random variables X1, X2, . . . , Xn repre-
senting n draws without replacement from the numbers 1, 2, . . . , N with equal
probability. Note that in this example our dataset is not a realization of a
random sample, because the random variables X1, X2, . . . , Xn are dependent.
We propose two unbiased estimators. The first one is based on the sample
mean
   X̄n = (X1 + X2 + · · · + Xn)/n,
and the second one is based on the sample maximum
Mn = max{X1, X2, . . . , Xn}.
An estimator based on the sample mean
To construct an unbiased estimator for N based on the sample mean, we start
by computing the expectation of X̄n. The linearity-of-expectations rule also
applies to dependent random variables, so that
   E[X̄n] = (E[X1] + E[X2] + · · · + E[Xn]) / n.

In Section 9.3 we saw that the marginal distribution of each Xi is the same:

   P(Xi = k) = 1/N   for k = 1, 2, . . . , N.

Therefore the expectation of each Xi is given by

   E[Xi] = 1 · (1/N) + 2 · (1/N) + · · · + N · (1/N)
         = (1 + 2 + · · · + N)/N
         = (N(N + 1)/2)/N
         = (N + 1)/2.

It follows that

   E[X̄n] = (E[X1] + E[X2] + · · · + E[Xn]) / n = (N + 1)/2.

This directly implies that

   T1 = 2X̄n − 1

is an unbiased estimator for N, since the change-of-units rule yields that

   E[T1] = E[2X̄n − 1] = 2E[X̄n] − 1 = 2 · (N + 1)/2 − 1 = N.
Quick exercise 20.1 Suppose we have observed tanks with (recoded) serial
numbers
61 19 56 24 16.
Compute the value of the estimator T1 for the total number of tanks.
An estimator based on the sample maximum
To construct an unbiased estimator for N based on the maximum, we first
compute the expectation of Mn. We start by computing the probability that
Mn = k, where k takes the values n, . . . , N. Similar to the combinatorics
used in Section 4.3 to derive the binomial distribution, the number of ways
to draw n numbers without replacement from 1, 2, . . . , N is the binomial
coefficient C(N, n). Hence each combination has probability 1/C(N, n). In order
to have Mn = k, we must have one number equal to k and choose the other
n − 1 numbers out of the numbers 1, 2, . . . , k − 1. There are C(k − 1, n − 1)
ways to do this. Hence for the possible values k = n, n + 1, . . . , N,

   P(Mn = k) = C(k − 1, n − 1) / C(N, n)
             = [(k − 1)! / ((k − n)!(n − 1)!)] · [(N − n)! n! / N!]
             = n · [(k − 1)!/(k − n)!] · [(N − n)!/N!].

Thus the expectation of Mn is given by

   E[Mn] = Σ_{k=n}^{N} k P(Mn = k)
         = Σ_{k=n}^{N} k · n · [(k − 1)!/(k − n)!] · [(N − n)!/N!]
         = Σ_{k=n}^{N} n · [k!/(k − n)!] · [(N − n)!/N!]
         = n · [(N − n)!/N!] · Σ_{k=n}^{N} k!/(k − n)!.

How to continue the computation of E[Mn]? We use a trick: we start by
rearranging

   1 = Σ_{j=n}^{N} P(Mn = j) = Σ_{j=n}^{N} n · [(j − 1)!/(j − n)!] · [(N − n)!/N!],

finding that

   Σ_{j=n}^{N} (j − 1)!/(j − n)! = N! / (n (N − n)!).     (20.1)

This holds for any N and any n ≤ N. In particular we could replace N by
N + 1 and n by n + 1:

   Σ_{j=n+1}^{N+1} (j − 1)!/(j − n − 1)! = (N + 1)! / ((n + 1)(N − n)!).

Changing the summation variable to k = j − 1, we obtain

   Σ_{k=n}^{N} k!/(k − n)! = (N + 1)! / ((n + 1)(N − n)!).     (20.2)

This is exactly what we need to finish the computation of E[Mn]. Substituting
(20.2) in what we obtained earlier, we find

   E[Mn] = n · [(N − n)!/N!] · Σ_{k=n}^{N} k!/(k − n)!
         = n · [(N − n)!/N!] · (N + 1)! / ((n + 1)(N − n)!)
         = n · (N + 1)/(n + 1).
Quick exercise 20.2 Choosing n = N in this formula yields E[MN ] = N.
Can you argue that this is the right answer without doing any computations?
With the formula for E[Mn] we can derive immediately that

   T2 = ((n + 1)/n) · Mn − 1

is an unbiased estimator for N, since by the change-of-units rule,

   E[T2] = E[((n + 1)/n) · Mn − 1] = ((n + 1)/n) · E[Mn] − 1
         = ((n + 1)/n) · n(N + 1)/(n + 1) − 1 = N.
Quick exercise 20.3 Compute the value of estimator T2 for the total number
of tanks on basis of the observed numbers from Quick exercise 20.1.
20.2 Variance of an estimator
In the previous section we saw that we can construct two completely different
estimators for the total number of tanks N that are both unbiased. The obvious
question is: which of the two is better? To answer this question, we investigate
how both estimators vary around the parameter of interest N. Although we
could in principle compute the distributions of T1 and T2, we carry out a
small simulation study instead. Take N = 1000 and n = 10 fixed. We draw
10 numbers, without replacement, from 1, 2, . . . , 1000 and compute the value
of the estimators T1 and T2. We repeat this two thousand times, so that we
have 2000 values for both estimators. In Figure 20.1 we have displayed the
histogram of the 2000 values for T1 on the left and the histogram of the 2000
values for T2 on the right. From the histograms, which reflect the probability
[Figure 20.1: two histograms, for T1 (left) and T2 (right), on a horizontal scale from 300 to 1600 centered at N = 1000, with vertical scale from 0 to 0.008.]
Fig. 20.1. Histograms of two thousand values for T1 (left) and T2 (right).
mass functions of both estimators, we see that the distributions of T1 and
T2 are of completely different types. As can be expected from the fact that
both estimators are unbiased, the values vary around the parameter of interest
N = 1000. The most important difference between the histograms is that the
variation in the values of T2 is less than the variation in the values of T1.
This suggests that estimator T2 estimates the total number of tanks more
efficiently than estimator T1, in the sense that it produces estimates that
are more concentrated around the parameter of interest N than estimates
produced by T1. Recall that the variance measures the spread of a random
variable. Hence the previous discussion motivates the use of the variance of
an estimator to evaluate its performance.
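A sketch of such a simulation study is given below (added here as an illustration, assuming NumPy): it repeats the draw-without-replacement experiment 2000 times for N = 1000 and n = 10 and compares how strongly the values of T1 and T2 vary around N.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, reps = 1000, 10, 2000

t1 = np.empty(reps)
t2 = np.empty(reps)
for i in range(reps):
    # n draws without replacement from 1, 2, ..., N
    x = rng.choice(np.arange(1, N + 1), size=n, replace=False)
    t1[i] = 2 * x.mean() - 1               # T1 = 2*Xbar_n - 1
    t2[i] = (n + 1) / n * x.max() - 1      # T2 = (n+1)/n * M_n - 1

# Both averages should be close to N = 1000; T2 varies much less than T1.
print(t1.mean(), t1.std(), t2.mean(), t2.std())
```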
Efficiency. Let T1 and T2 be two unbiased estimators for the same
parameter θ. Then estimator T2 is called more efficient than estima-
tor T1 if Var(T2) < Var(T1), irrespective of the value of θ.
Let us compare T1 and T2 using this criterion. For T1 we have

   Var(T1) = Var(2X̄n − 1) = 4 Var(X̄n).
Although the Xi are not independent, it is true that all pairs (Xi, Xj) with
i ≠ j have the same distribution (this follows in the same way in which
we showed on page 122 that all Xi have the same distribution). With the
variance-of-the-sum rule for n random variables (see Exercise 10.17), we find
that
Var(X1 + · · · + Xn) = nVar(X1) + n(n − 1)Cov(X1, X2) .
In Exercises 9.18 and 10.18, we computed that

   Var(X1) = (1/12)(N − 1)(N + 1),   Cov(X1, X2) = −(1/12)(N + 1).
We find therefore that

   Var(T1) = 4 Var(X̄n) = (4/n²) Var(X1 + · · · + Xn)
           = (4/n²) [ n · (1/12)(N − 1)(N + 1) − n(n − 1) · (1/12)(N + 1) ]
           = (1/(3n)) (N + 1)[N − 1 − (n − 1)]
           = (N + 1)(N − n)/(3n).
Obtaining the variance of T2 is a little more work. One can compute the
variance of Mn in a way that is very similar to the way we obtained E[Mn].
The result is (see Remark 20.1 for details)
   Var(Mn) = n(N + 1)(N − n) / ((n + 2)(n + 1)²).
Remark 20.1 (How to compute this variance). The trick is to compute
not E[Mn²] but E[Mn(Mn + 1)]. First we derive an identity from Equation
(20.1) as before, this time replacing N by N + 2 and n by n + 2:

   Σ_{j=n+2}^{N+2} (j − 1)!/(j − n − 2)! = (N + 2)! / ((n + 2)(N − n)!).

Changing the summation variable to k = j − 2 yields

   Σ_{k=n}^{N} (k + 1)!/(k − n)! = (N + 2)! / ((n + 2)(N − n)!).

With this formula one can obtain:

   E[Mn(Mn + 1)] = Σ_{k=n}^{N} k(k + 1) · n · [(k − 1)!/(k − n)!] · [(N − n)!/N!]
                 = n(N + 1)(N + 2)/(n + 2).

Since we know E[Mn], we can determine E[Mn²] from this, and subsequently
the variance of Mn.
With the expression for the variance of Mn, we derive

   Var(T2) = Var( ((n + 1)/n) · Mn − 1 ) = ((n + 1)²/n²) Var(Mn)
           = (N + 1)(N − n) / (n(n + 2)).
We see that Var(T2) < Var(T1) for all N and n ≥ 2. Hence T2 is always more
efficient than T1, except when n = 1. In this case the variances are equal,
simply because the estimators are the same—they both equal X1.
The quotient Var(T1)/Var(T2) is called the relative efficiency of T2 with
respect to T1. In our case the relative efficiency of T2 with respect to T1
equals

   Var(T1)/Var(T2) = [(N + 1)(N − n)/(3n)] · [n(n + 2)/((N + 1)(N − n))] = (n + 2)/3.
Surprisingly, this quotient does not depend on N, and we see clearly the
advantage of T2 over T1 as the sample size n gets larger.
Quick exercise 20.4 Let n = 5, and let the sample be
7 3 10 45 15.
Compute the value of the estimator T1 for N. Do you notice anything strange?
The self-contradictory behavior of T1 in Quick exercise 20.4 is not rare: this
phenomenon will occur for up to 50% of the samples if n and N are large.
This gives another reason to prefer T2 over T1.
Remark 20.2 (The Cramér-Rao inequality). Suppose we have a ran-
dom sample from a continuous distribution with probability density function
fθ, where θ is the parameter of interest. Under certain smoothness condi-
tions on the density fθ, the variance of an unbiased estimator T for θ always
has to be larger than or equal to a certain positive number, the so-called
Cramér-Rao lower bound:
   Var(T) ≥ 1 / ( n E[ (∂/∂θ ln fθ(X))² ] )   for all θ.
Here n is the size of the sample and X a random variable whose density
function is fθ. In some cases we can find unbiased estimators attaining this
bound. These are called minimum variance unbiased estimators. An exam-
ple is the sample mean for the expectation of an exponential distribution.
(We will consider this case in Exercise 20.3.)
20.3 Mean squared error
In the last section we compared two unbiased estimators by considering their
spread around the value to be estimated, where the spread was measured by
the variance. Although unbiasedness is a desirable property, the performance
of an estimator should mainly be judged by the way it spreads around the
parameter θ to be estimated. This leads to the following definition.
Definition. Let T be an estimator for a parameter θ. The mean
squared error of T is the number MSE(T) = E[(T − θ)²].
According to this criterion, an estimator T1 performs better than an estimator
T2 if MSE(T1) < MSE(T2). Note that

   MSE(T) = E[(T − θ)²]
          = E[(T − E[T] + E[T] − θ)²]
          = E[(T − E[T])²] + 2 E[T − E[T]] (E[T] − θ) + (E[T] − θ)²
          = Var(T) + (E[T] − θ)²,

where the middle term vanishes because E[T − E[T]] = 0.
So the MSE of T turns out to be the variance of T plus the square of the bias
of T . In particular, when T is unbiased, the MSE of T is just the variance
of T . This means that we already used mean squared errors to compare the
estimators T1 and T2 in the previous section. We extend the notion of efficiency
by saying that estimator T2 is more efficient than estimator T1 (for the same
parameter of interest), if the MSE of T2 is smaller than the MSE of T1.
Unbiasedness and efficiency
A biased estimator with a small variance may be more useful than an unbiased
estimator with a large variance. We illustrate this with the network server
[Figure 20.2: two histograms of estimates, both on a horizontal scale running from 0 past e^{−µ} up to about 0.4, with vertical scale from 0 to 10.]
Fig. 20.2. Histograms of a thousand values for S (left) and T (right).
example from Section 19.2. Recall that our goal was to estimate the probability
p0 = e−µ
of zero arrivals (of packages) in a minute. We did have two promising
candidates as estimators:
   S = (number of Xi equal to zero)/n   and   T = e^{−X̄n}.
In Figure 20.2 we depict histograms of one thousand simulations of the values
of S and T computed for random samples of size n = 25 from a Pois(µ)
distribution, where µ = 2. Considering the way the values of the (biased!)
estimator T are more concentrated around the true value e^{−µ} = e^{−2} = 0.1353,
we would be inclined to prefer T over S. This choice is strongly supported
by the fact that T is more efficient than S: MSE(T ) is always smaller than
MSE(S), as illustrated in Figure 20.3.
[Figure 20.3: MSE(S) and MSE(T) plotted as functions of µ for 0 ≤ µ ≤ 5, on a vertical scale from 0.000 to 0.010, with the curve for MSE(T) lying below the one for MSE(S).]
Fig. 20.3. MSEs of S and T as a function of µ.
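The comparison in Figure 20.3 can be reproduced by simulation. The following sketch, added here and assuming NumPy, estimates MSE(S) and MSE(T) for n = 25 and a few values of µ.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 25, 50_000

for mu in (0.5, 1.0, 2.0, 4.0):
    p0 = np.exp(-mu)
    x = rng.poisson(mu, size=(reps, n))
    S = (x == 0).mean(axis=1)          # relative frequency of zeros
    T = np.exp(-x.mean(axis=1))        # e^{-Xbar_n}
    mse_S = np.mean((S - p0) ** 2)
    mse_T = np.mean((T - p0) ** 2)
    print(mu, mse_S, mse_T)            # mse_T comes out smaller for each mu
```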
20.4 Solutions to the quick exercises
20.1 We have x̄5 = (61 + 19 + 56 + 24 + 16)/5 = 176/5 = 35.2. Therefore
t1 = 2 · 35.2 − 1 = 69.4.
20.2 When n = N, we have drawn all the numbers. But then the largest
number MN is N, and so E[MN ] = N.
20.3 We have t2 = (6/5) · 61 − 1 = 72.2.
20.4 Since 45 is in the sample, N has to be at least 45. Adding the numbers
yields 7 + 3 + 10 + 15 + 45 = 80. So t1 = 2x̄n − 1 = 2 · 16 − 1 = 31. What is
strange about this is that the estimate for N is far smaller than the number
45 in the sample!
20.5 Exercises
20.1 Given is a random sample X1, X2, . . . , Xn from a distribution with finite
variance σ2
. We estimate the expectation of the distribution with the sample
mean X̄n. Argue that the larger our sample, the more efficient our estimator.
What is the relative efficiency Var(X̄n)/Var(X̄2n) of X̄2n with respect to X̄n?
20.2  Given are two estimators S and T for a parameter θ. Furthermore it
is known that Var(S) = 40 and Var(T ) = 4.
a. Suppose that we know that E[S] = θ and E[T ] = θ + 3. Which estimator
would you prefer, and why?
b. Suppose that we know that E[S] = θ and E[T ] = θ + a for some positive
number a. For each a, which estimator would you prefer, and why?
20.3  Suppose we have a random sample X1, . . . , Xn from an Exp(λ) distri-
bution. Suppose we want to estimate the mean 1/λ. According to Section 19.4
the estimator
   T1 = X̄n = (1/n)(X1 + X2 + · · · + Xn)
is an unbiased estimator of 1/λ. Let Mn be the minimum of X1, X2, . . . , Xn.
Recall from Exercise 8.18 that Mn has an Exp(nλ) distribution. In Exer-
cise 19.5 you have determined that
T2 = nMn
is another unbiased estimator for 1/λ. Which of the estimators T1 and T2
would you choose for estimating the mean 1/λ? Substantiate your answer.
20.4  Consider the situation of this chapter, where we have to estimate the
parameter N from a sample x1, . . . , xn drawn without replacement from the
numbers {1, . . ., N}. To keep it simple, we consider n = 2. Let M = M2 be
the maximum of X1 and X2. We have found that T2 = 3M/2 − 1 is a good
unbiased estimator for N. We want to construct a new unbiased estimator
T3 based on the minimum L of X1 and X2. In the following you may use
that the random variable L has the same distribution as the random variable
N + 1 − M (this follows from symmetry considerations).
a. Show that T3 = 3L − 1 is an unbiased estimator for N.
b. Compute Var(T3) using that Var(M) = (N + 1)(N − 2)/18. (The latter
has been computed in Remark 20.1.)
c. What is the relative efficiency of T2 with respect to T3?
20.5 Someone is proposing two unbiased estimators U and V , with the same
variance Var(U) = Var(V ). It therefore appears that we would not prefer one
estimator over the other. However, we could go for a third estimator, namely
W = (U + V )/2. Note that W is unbiased. To judge the quality of W we
want to compute its variance. Lacking information on the joint probability
distribution of U and V , this is impossible. However, we should prefer W in
any case! To see this, show by means of the variance-of-the-sum rule that the
relative efficiency of U with respect to W is equal to

   Var((U + V)/2) / Var(U) = 1/2 + (1/2) ρ(U, V).
Here ρ(U, V ) is the correlation coefficient. Why does this result imply that we
should use W instead of U (or V )?
20.6 A geodesic engineer measures the three unknown angles α1, α2, and α3
of a triangle. He models the uncertainty in the measurements by considering
them as realizations of three independent random variables T1, T2, and T3
with expectations
E[T1] = α1, E[T2] = α2, E[T3] = α3,
and all three with the same variance σ2
. In order to make use of the fact that
the three angles must add to π, he also considers new estimators U1, U2, and
U3 defined by
   U1 = T1 + (1/3)(π − T1 − T2 − T3),
   U2 = T2 + (1/3)(π − T1 − T2 − T3),
   U3 = T3 + (1/3)(π − T1 − T2 − T3).
(Note that the “deviation” π − T1 − T2 − T3 is evenly divided over the three
measurements and that U1 + U2 + U3 = π.)
a. Compute E[U1] and Var(U1) .
b. What does he gain in efficiency when he uses U1 instead of T1 to estimate
the angle α1?
c. What kind of estimator would you choose for α1 if it is known that the
triangle is isosceles (i.e., α1 = α2)?
20.7  (Exercise 19.7 continued.) Leaves are divided into four different types:
starchy-green, sugary-white, starchy-white, and sugary-green. According to
genetic theory, the types occur with probabilities (1/4)(θ + 2), (1/4)θ, (1/4)(1 − θ), and
(1/4)(1 − θ), respectively, where 0 < θ < 1. Suppose one has n leaves. Then the
number of starchy-green leaves is modeled by a random variable N1 with a
Bin(n, p1) distribution, where p1 = (1/4)(θ + 2), and the number of sugary-white
leaves is modeled by a random variable N2 with a Bin(n, p2) distribution,
where p2 = (1/4)θ. Consider the following two estimators for θ:

   T1 = (4/n) · N1 − 2   and   T2 = (4/n) · N2.
In Exercise 19.7 you showed that both T1 and T2 are unbiased estimators
for θ. Which estimator would you prefer? Motivate your answer.
20.8  Let X̄n and Ȳm be the sample means of two independent random
samples of size n (resp. m) from the same distribution with mean µ. We
combine these two estimators to a new estimator T by putting
T = rX̄n + (1 − r)Ȳm,
where r is some number between 0 and 1.
a. Show that T is an unbiased estimator for the mean µ.
b. Show that T is most efficient when r = n/(n + m).
20.9 Given is a random sample X1, X2, . . . , Xn from a Ber(p) distribution.
One considers the estimators
   T1 = (1/n)(X1 + · · · + Xn)   and   T2 = min{X1, . . . , Xn}.
a. Are T1 and T2 unbiased estimators for p?
b. Show that
   MSE(T1) = (1/n) p(1 − p),   MSE(T2) = p^n − 2p^{n+1} + p².
c. Which estimator is more efficient when n = 2?
20.10 Suppose we have a random sample X1, . . . , Xn from an Exp(λ) distri-
bution. We want to estimate the expectation 1/λ. According to Section 19.4,
   X̄n = (1/n)(X1 + X2 + · · · + Xn)
is an unbiased estimator of 1/λ. Let us consider more generally estimators T
of the form
T = c · (X1 + X2 + · · · + Xn) ,
where c is a real number. We are interested in the MSE of these estimators
and would like to know whether there are choices for c that yield a smaller
MSE than the choice c = 1/n.
a. Compute MSE(T ) for each c.
b. For which c does the estimator perform best in the MSE sense? Compare
this to the unbiased estimator X̄n that one obtains for c = 1/n.
20.11  In Exercise 17.9 we modeled diameters of black cherry trees with the
linear regression model (without intercept)
Yi = βxi + Ui
for i = 1, 2, . . ., n. As usual, the Ui here are independent random variables
with E[Ui]=0, and Var(Ui) = σ2
.
We considered three estimators for the slope β of the line y = βx: the so-
called least squares estimator T1 (which will be considered in Chapter 22),
the average slope estimator T2, and the slope of the averages estimator T3.
These estimators are defined by:
   T1 = (Σ_{i=1}^{n} xi Yi) / (Σ_{i=1}^{n} xi²),
   T2 = (1/n) Σ_{i=1}^{n} Yi/xi,
   T3 = (Σ_{i=1}^{n} Yi) / (Σ_{i=1}^{n} xi).
In Exercise 19.8 it was shown that all three estimators are unbiased. Compute
the MSE of all three estimators.
Remark: it can be shown that T1 is always more efficient than T3, which in
turn is more efficient than T2. To prove the first inequality one uses a famous
inequality called the Cauchy–Schwarz inequality; for the second inequality
one uses Jensen's inequality (can you see how?).
20.12 Let X1, X2, . . . , Xn represent n draws without replacement from the
numbers 1, 2, . . . , N with equal probability. The goal of this exercise is to
compute the distribution of Mn in a way other than by the combinatorial
analysis we did in this chapter.
a. Compute P(Mn ≤ k), by using, as in Section 8.4, that:
P(Mn ≤ k) = P(X1 ≤ k, X2 ≤ k, . . . , Xn ≤ k) .
b. Derive that

      P(Mn = n) = n!(N − n)! / N!.

c. Show that for k = n + 1, . . . , N

      P(Mn = k) = n · [(k − 1)!/(k − n)!] · [(N − n)!/N!].
21
Maximum likelihood
In previous chapters we could easily construct estimators for various param-
eters of interest because these parameters had a natural sample analogue:
expectation versus sample mean, probabilities versus relative frequencies, etc.
However, in some situations such an analogue does not exist. In this chap-
ter, a general principle to construct estimators is introduced, the so-called
maximum likelihood principle. Maximum likelihood estimators have certain
attractive properties that are discussed in the last section.
21.1 Why a general principle?
In Section 4.4 we modeled the number of cycles up to pregnancy by a ran-
dom variable X with a geometric distribution with (unknown) parameter p.
Weinberg and Gladen studied the effect of smoking on the number of cycles
and obtained the data in Table 21.1 for 100 smokers and 486 nonsmokers.
Table 21.1. Observed numbers of cycles up to pregnancy.
Number of cycles    1   2   3   4   5   6   7   8   9  10  11  12  > 12
Smokers            29  16  17   4   3   9   4   5   1   1   1   3     7
Nonsmokers        198 107  55  38  18  22   7   9   5   3   6   6    12
Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution ap-
plied to comparative fecundability studies. Biometrics, 42(3):547–560, 1986.
Is the parameter p, which equals the probability of becoming pregnant after
one cycle, different for smokers and nonsmokers? Let us try to find out by
estimating p in the two cases.
What would be reasonable ways to estimate p? Since p = P(X = 1), the law
of large numbers (see Section 13.3) motivates use of
   S = (number of Xi equal to 1)/n
as an estimator for p. This yields estimates p = 29/100 = 0.29 for smokers and
p = 198/486 = 0.41 for nonsmokers. We know from Section 19.4 that S is an
unbiased estimator for p. However, one cannot escape the feeling that S is a
“bad” estimator: S does not use all the information in the table, i.e., the way
the women are distributed over the numbers 2, 3, . . . of observed numbers of
cycles is not used. One would like to have an estimator that incorporates all
the available information. Due to the way the data are given, this seems to be
difficult. For instance, estimators based on the average cannot be evaluated,
because 7 smokers and 12 nonsmokers had an unknown number of cycles
up to pregnancy (larger than 12). If one simply ignores the last column in
Table 21.1 as we did in Exercise 17.5, the average can be computed and yields
1/x̄93 = 0.2809 as an estimate of p for smokers and 1/x̄474 = 0.3688 for
nonsmokers. However, because we discard seven values larger than 12 in case
of the smokers and twelve values larger than 12 in case of the nonsmokers, we
overestimate p in both cases.
In the next section we introduce a general principle to find an estimate for a
parameter of interest, the maximum likelihood principle. This principle yields
good estimators and will solve problems such as those stated earlier.
21.2 The maximum likelihood principle
Suppose a dealer of computer chips is offered on the black market two batches
of 10 000 chips each. According to the seller, in one batch about 50% of the
chips are defective, while this percentage is about 10% in the other batch. Our
dealer is only interested in this last batch. Unfortunately the seller cannot tell
the two batches apart. To help him to make up his mind, the seller offers our
dealer one batch, from which he is allowed to select and test 10 chips. After
selecting 10 chips arbitrarily, it turns out that only the second one is defective.
Our dealer at once decides to buy this batch. Is this a wise decision?
With the batch where 50% of the chips are defective it is more likely that
defective chips will appear, whereas with the other batch one would expect
hardly any defective chip. Clearly, our dealer chooses the batch for which it is
most likely that only one chip is defective. This is also the guiding idea behind
the maximum likelihood principle.
The maximum likelihood principle. Given a dataset, choose
the parameter(s) of interest in such a way that the data are most
likely.
Set Ri = 1 in case the ith tested chip was defective and Ri = 0 in case it
was operational, where i = 1, . . . , 10. Then R1, . . . , R10 are ten independent
Ber(p) distributed random variables, where p is the probability that a ran-
domly selected chip is defective. The probability that the observed data occur
is equal to
   P(R1 = 0, R2 = 1, R3 = 0, . . . , R10 = 0) = p(1 − p)^9.
For the batch where about 10% of the chips are defective we find that

   P(R1 = 0, R2 = 1, R3 = 0, . . . , R10 = 0) = (1/10) · (9/10)^9 = 0.039,

whereas for the other batch

   P(R1 = 0, R2 = 1, R3 = 0, . . . , R10 = 0) = (1/2) · (1/2)^9 = 0.00098.
So the probability for the batch with only 10% defective chips is about 40
times larger than the probability for the other batch. Given the data, our
dealer made a sound decision.
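As a check on the arithmetic, the two probabilities and their ratio can be computed directly; the short sketch below, added here, uses plain Python.

```python
def prob_one_defective_at_position_2(p):
    # P(R1=0, R2=1, R3=0, ..., R10=0) for ten independent Ber(p) tests
    return p * (1 - p) ** 9

good_batch = prob_one_defective_at_position_2(0.10)   # about 0.039
bad_batch  = prob_one_defective_at_position_2(0.50)   # about 0.00098
print(good_batch, bad_batch, good_batch / bad_batch)  # ratio is about 40
```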
Quick exercise 21.1 Which batch should the dealer choose if only the first
three chips are defective?
Returning to the example of the number of cycles up to pregnancy, denoting
Xi as the number of cycles up to pregnancy of the ith smoker, recall that
   P(Xi = k) = (1 − p)^{k−1} p

and

   P(Xi > 12) = P(no success in cycle 1 to 12) = (1 − p)^{12};
cf. Quick exercise 4.6. From Table 21.1 we see that there are 29 smokers for
which Xi = 1, that there are 16 for which Xi = 2, etc. Since we model the
data as a random sample from a geometric distribution, the probability of the
data—as a function of p—is given by
   L(p) = C · P(Xi = 1)^{29} · P(Xi = 2)^{16} · · · P(Xi = 12)^{3} · P(Xi > 12)^{7}
        = C · p^{29} · ((1 − p)p)^{16} · · · ((1 − p)^{11} p)^{3} · ((1 − p)^{12})^{7}
        = C · p^{93} · (1 − p)^{322}.
Here C is the number of ways we can assign 29 ones, 16 twos, . . . , 3 twelves,
and 7 numbers larger than 12 to 100 smokers.¹ According to the maximum
likelihood principle we now choose p, with 0 ≤ p ≤ 1, in such a way that L(p)
is maximal. Since C does not depend on p, we do not need to know the value
of C explicitly to find for which p the function L(p) is maximal.

¹ C = 311657028822819441451842682167854800096263625208359116504431153487280760832000000000.
Differentiating L(p) with respect to p yields that

   L′(p) = C [93 p^{92}(1 − p)^{322} − 322 p^{93}(1 − p)^{321}]
         = C p^{92}(1 − p)^{321} [93(1 − p) − 322p]
         = C p^{92}(1 − p)^{321} (93 − 415p).
Now L′(p) = 0 if p = 0, p = 1, or p = 93/415 = 0.224, and L(p) attains its
unique maximum in this last point (check this!). We say that 93/415 = 0.224 is
the maximum likelihood estimate of p for the smokers. Note that this estimate
is quite a lot smaller than the estimate 0.29 for the smokers we found in the
previous section, and the estimate 0.2809 you obtained in Exercise 17.5.
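The same maximization can be carried out numerically from the raw counts in Table 21.1. The sketch below is an added illustration (it assumes NumPy and SciPy); it maximizes the loglikelihood 93 ln(p) + 322 ln(1 − p), which equals ln L(p) up to the constant ln(C), and recovers 93/415 ≈ 0.224.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# smokers: counts for 1, 2, ..., 12 cycles and for "more than 12 cycles"
counts = np.array([29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3])
censored = 7

def neg_loglik(p):
    k = np.arange(1, 13)
    # observed cycles contribute (1-p)^{k-1} p, censored ones (1-p)^{12}
    ll = np.sum(counts * ((k - 1) * np.log(1 - p) + np.log(p)))
    ll += censored * 12 * np.log(1 - p)
    return -ll

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, 93 / 415)   # both approximately 0.224
```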
Quick exercise 21.2 Check that for the nonsmokers the probability of the
data is given by
   L(p) = constant · p^{474} · (1 − p)^{955}.
Compute the maximum likelihood estimate for p.
Remark 21.1 (Some history). The method of maximum likelihood es-
timation was propounded by Ronald Aylmer Fisher in a highly influential
paper. In fact, this paper does not contain the original statement of the
method, which was published by Fisher in 1912 [9], nor does it contain
the original definition of likelihood, which appeared in 1921 (see [10]). The
roots of the maximum likelihood method date back as far as 1713, when
Jacob Bernoulli’s Ars Conjectandi ([1]) was posthumously published. In the
eighteenth century other important contributions were by Daniel Bernoulli,
Lambert, and Lagrange (see also [2], [16], and [17]). It is interesting to re-
mark that another giant of statistics, Karl Pearson, had not understood
Fisher’s method. Fisher was hurt by Pearson’s lack of understanding, which
eventually led to a violent confrontation.
21.3 Likelihood and loglikelihood
Suppose we have a dataset x1, x2, . . . , xn, modeled as a realization of a random
sample from a distribution characterized by a parameter θ. To stress the
dependence of the distribution on θ, we write
pθ(x)
for the probability mass function in case we have a sample from a discrete
distribution and
fθ(x)
for the probability density function when we have a sample from a continuous
distribution.
For a dataset x1, x2, . . . , xn modeled as the realization of a random sample
X1, . . . , Xn from a discrete distribution, the maximum likelihood principle
now tells us to estimate θ by that value, for which the function L(θ), given by
L(θ) = P(X1 = x1, . . . , Xn = xn) = pθ(x1) · · · pθ(xn)
is maximal. This value is called the maximum likelihood estimate of θ. The
function L(θ) is called the likelihood function. This is a function of θ, deter-
mined by the numbers x1, x2, . . . , xn.
In case the sample is from a continuous distribution we clearly need to de-
fine the likelihood function L(θ) in a way different from the discrete case (if
we defined L(θ) as in the discrete case, we would always have L(θ) = 0). For
a reasonable definition of the likelihood function we have the following
motivation. Let fθ be the probability density function of X, and let ε > 0 be
some fixed, small number. It is sensible to choose θ in such a way that the
probability P(x1 − ε ≤ X1 ≤ x1 + ε, . . . , xn − ε ≤ Xn ≤ xn + ε) is maximal.
Since the Xi are independent, we find that

   P(x1 − ε ≤ X1 ≤ x1 + ε, . . . , xn − ε ≤ Xn ≤ xn + ε)
      = P(x1 − ε ≤ X1 ≤ x1 + ε) · · · P(xn − ε ≤ Xn ≤ xn + ε)     (21.1)
      ≈ fθ(x1)fθ(x2) · · · fθ(xn)(2ε)^n,

where in the last step we used that (see also Equation (5.1))

   P(xi − ε ≤ Xi ≤ xi + ε) = ∫_{xi−ε}^{xi+ε} fθ(x) dx ≈ 2ε fθ(xi).
Note that the right-hand side of (21.1) is maximal whenever the function
fθ(x1)fθ(x2) · · · fθ(xn) is maximal, irrespective of the value of ε. In view of
this, given a dataset x1, x2, . . . , xn, the likelihood function L(θ) is defined by
L(θ) = fθ(x1)fθ(x2) · · · fθ(xn)
in the continuous case.
Maximum likelihood estimates. The maximum likelihood es-
timate of θ is the value t = h(x1, x2, . . . , xn) that maximizes the
likelihood function L(θ). The corresponding random variable
T = h(X1, X2, . . . , Xn)
is called the maximum likelihood estimator for θ.
As an example, suppose we have a dataset x1, x2, . . . , xn modeled as a re-
alization of a random sample from an Exp(λ) distribution, with probability
density function given by fλ(x) = 0 if x < 0 and

   fλ(x) = λ e^{−λx}   for x ≥ 0.

Then the likelihood is given by

   L(λ) = fλ(x1)fλ(x2) · · · fλ(xn) = λe^{−λx1} · λe^{−λx2} · · · λe^{−λxn}
        = λ^n · e^{−λ(x1+x2+···+xn)}.
To obtain the maximum likelihood estimate of λ it is enough to find the
maximum of L(λ). To do so, we determine the derivative of L(λ):
   d/dλ L(λ) = nλ^{n−1} e^{−λ Σ_{i=1}^{n} xi} − λ^n (Σ_{i=1}^{n} xi) e^{−λ Σ_{i=1}^{n} xi}
             = n λ^{n−1} e^{−λ Σ_{i=1}^{n} xi} (1 − (λ/n) Σ_{i=1}^{n} xi).
We see that d (L(λ)) /dλ = 0 if and only if
1 − λx̄n = 0,
i.e., if λ = 1/x̄n. Check that for this value of λ the likelihood function L(λ)
attains a maximum! So the maximum likelihood estimator for λ is 1/X̄n.
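As an added numerical check (assuming NumPy and SciPy), one can simulate exponential data, maximize the loglikelihood numerically, and compare the result with 1/x̄n:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
x = rng.exponential(scale=1 / 0.7, size=200)   # Exp(lambda) data, lambda = 0.7

def neg_loglik(lam):
    # minus the loglikelihood n*ln(lambda) - lambda*sum(x)
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1 / x.mean())   # numerical maximizer agrees with 1/xbar_n
```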
In the example of the number of cycles up to pregnancy of smoking women,
we have seen that L(p) = C ·p93
·(1−p)322
. The maximum likelihood estimate
of p was found by differentiating L(p). Differentiating is not always possible,
as the following example shows.
Estimating the upper endpoint of a uniform distribution
Suppose the dataset x1 = 0.98, x2 = 1.57, and x3 = 0.31 is the realization
of a random sample from a U(0, θ) distribution with θ > 0 unknown. The
probability density function of each Xi is now given by fθ(x) = 0 if x is not
in [0, θ] and
   fθ(x) = 1/θ   for 0 ≤ x ≤ θ.

The likelihood L(θ) is zero if θ is smaller than at least one of the xi, and
equals 1/θ³ if θ is greater than or equal to each of the three xi, i.e.,

   L(θ) = fθ(x1)fθ(x2)fθ(x3) = 1/θ³   if θ ≥ max(x1, x2, x3) = 1.57,
                             = 0      if θ < max(x1, x2, x3) = 1.57.
[Figure 21.1: the likelihood L(θ), which is zero for θ < 1.57 and equal to 1/θ³ for θ ≥ 1.57, with the observations 0.31, 0.98, and 1.57 marked on the horizontal axis and a vertical scale from 0 to 0.2.]
Fig. 21.1. Likelihood function L(θ) of a sample from a U(0, θ) distribution.
Figure 21.1 depicts this likelihood function. One glance at this figure is enough
to realize that L(θ) attains its maximum at max (x1, x2, x3) = 1.57.
In general, given a dataset x1, x2, . . . , xn originating from a U(0, θ) distribu-
tion, we see that L(θ) = 0 if θ is smaller than at least one of the xi and that
L(θ) = 1/θ^n if θ is greater than or equal to the largest of the xi. We conclude
that the maximum likelihood estimator of θ is given by max {X1, X2, . . . , Xn}.
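For the small dataset above this can also be seen numerically; the added sketch below (NumPy assumed) evaluates L(θ) on a grid and locates its maximum at the sample maximum 1.57.

```python
import numpy as np

x = np.array([0.98, 1.57, 0.31])
theta = np.linspace(0.01, 3.0, 300)

# L(theta) = theta^{-n} if theta >= max(x), and 0 otherwise
L = np.where(theta >= x.max(), theta ** (-len(x)), 0.0)
print(theta[np.argmax(L)], x.max())   # both close to 1.57
```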
Loglikelihood
In the preceding example it was easy to find the value of the parameter for
which the likelihood is maximal. Usually one can find the maximum by dif-
ferentiating the likelihood function L(θ). The calculation of the derivative of
L(θ) may be tedious, because L(θ) is a product of terms, all involving θ (see
also Quick exercise 21.3). To differentiate L(θ) we have to apply the product
rule from calculus. Considering the logarithm of L(θ) changes the product of
the terms involving θ into a sum of logarithms of these terms, which makes
the process of differentiating easier. Moreover, because the logarithm is an in-
creasing function, the likelihood function L(θ) and the loglikelihood function
ℓ(θ), defined by

   ℓ(θ) = ln(L(θ)),

attain their extreme values for the same values of θ. In particular, L(θ) is
maximal if and only if ℓ(θ) is maximal. This is illustrated in Figure 21.2 by
the likelihood function L(p) = Cp^{93}(1 − p)^{322} and the loglikelihood function
ℓ(p) = ln(C) + 93 ln(p) + 322 ln(1 − p) for the smokers.
In the situation that we have a dataset x1, x2, . . . , xn modeled as a realiza-
tion of a random sample from an Exp(λ) distribution, we found as likelihood
function L(λ) = λ^n · e^{−λ(x1+x2+···+xn)}. Therefore, the loglikelihood function
is given by

   ℓ(λ) = n ln(λ) − λ(x1 + x2 + · · · + xn).
[Figure 21.2: two panels, the likelihood L(p) (vertical scale up to 5 · 10^{−13}) and the loglikelihood ℓ(p) (vertical scale from −300 to 0), both plotted against p and both attaining their maximum at p = 93/415.]
Fig. 21.2. The graphs of the likelihood function L(p) and the loglikelihood function
ℓ(p) for the smokers.
Quick exercise 21.3 In this example, use the loglikelihood function ℓ(λ) to
show that the maximum likelihood estimate of λ equals 1/x̄n.
Estimating the parameters of the normal distribution
Suppose that the dataset x1, x2, . . . , xn is a realization of a random sample
from an N(µ, σ2
) distribution, with µ and σ unknown. What are the maximum
likelihood estimates for µ and σ?
In this case θ is the vector (µ, σ), and therefore the likelihood function is a
function of two variables:
L(µ, σ) = fµ,σ(x1)fµ,σ(x2) · · · fµ,σ(xn),
where each fµ,σ(x) is the N(µ, σ²) probability density function:

   fµ,σ(x) = (1/(σ√(2π))) e^{−(1/2)((x−µ)/σ)²},   −∞ < x < ∞.
Since

   ln(fµ,σ(x)) = − ln(σ) − ln(√(2π)) − (1/2)((x − µ)/σ)²,

one finds that

   ℓ(µ, σ) = ln(fµ,σ(x1)) + · · · + ln(fµ,σ(xn))
           = −n ln(σ) − n ln(√(2π)) − (1/(2σ²)) [(x1 − µ)² + · · · + (xn − µ)²].
The partial derivatives of ℓ are

   ∂ℓ/∂µ = (1/σ²) [(x1 − µ) + (x2 − µ) + · · · + (xn − µ)] = (n/σ²)(x̄n − µ),

   ∂ℓ/∂σ = −n/σ + (1/σ³) [(x1 − µ)² + (x2 − µ)² + · · · + (xn − µ)²]
         = −(n/σ³) [σ² − (1/n) Σ_{i=1}^{n} (xi − µ)²].

Solving ∂ℓ/∂µ = 0 and ∂ℓ/∂σ = 0 yields

   µ = x̄n   and   σ = √( (1/n) Σ_{i=1}^{n} (xi − x̄n)² ).
It is not hard to show that for these values of µ and σ the likelihood func-
tion L(µ, σ) attains a maximum. We find that x̄n is the maximum likelihood
estimate for µ and that

   √( (1/n) Σ_{i=1}^{n} (xi − x̄n)² )

is the maximum likelihood estimate for σ.
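These closed-form expressions can be verified numerically. The sketch below is an added illustration (assuming NumPy and SciPy): it maximizes ℓ(µ, σ) for a simulated normal sample and compares the maximizer with x̄n and the root of the average squared deviation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
x = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_loglik(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    # minus loglikelihood, with the additive constant -n*ln(sqrt(2*pi)) dropped
    return len(x) * np.log(sigma) + 0.5 * np.sum(((x - mu) / sigma) ** 2)

res = minimize(neg_loglik, x0=[0.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x
print(mu_hat, x.mean())                              # both close to xbar_n
print(sigma_hat, np.sqrt(np.mean((x - x.mean()) ** 2)))
```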
21.4 Properties of maximum likelihood estimators
Apart from the fact that the maximum likelihood principle provides a general
principle to construct estimators, one can also show that maximum likelihood
estimators have several desirable properties.
Invariance principle
In the previous example, we saw that

   Dn = √( (1/n) Σ_{i=1}^{n} (Xi − X̄n)² )

is the maximum likelihood estimator for the parameter σ of an N(µ, σ²) distribution.
Does this imply that Dn² is the maximum likelihood estimator for σ²?
This is indeed the case! In general one can show that if T is the maximum
likelihood estimator of a parameter θ and g(θ) is an invertible function of θ,
then g(T) is the maximum likelihood estimator for g(θ).
Asymptotic unbiasedness
The maximum likelihood estimator T may be biased. For example, because
Dn² = ((n − 1)/n) Sn², for the previously mentioned maximum likelihood estimator
Dn² of the parameter σ² of an N(µ, σ²) distribution, it follows from Section 19.4
that

   E[Dn²] = E[((n − 1)/n) Sn²] = ((n − 1)/n) E[Sn²] = ((n − 1)/n) σ².

We see that Dn² is a biased estimator for σ², but also that as n goes to
infinity, the expected value of Dn² converges to σ². This holds more generally.
Under mild conditions on the distribution of the random variables Xi under
consideration (see, e.g., [36]), one can show that asymptotically (that is, as
the size n of the dataset goes to infinity) maximum likelihood estimators are
unbiased. By this we mean that if Tn = h(X1, X2, . . . , Xn) is the maximum
likelihood estimator for a parameter θ, then

   lim_{n→∞} E[Tn] = θ.
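The convergence E[Dn²] → σ² can be illustrated by a small added simulation (NumPy assumed), which estimates E[Dn²] for increasing sample sizes.

```python
import numpy as np

rng = np.random.default_rng(10)
sigma2, reps = 4.0, 100_000

for n in (2, 5, 20, 100):
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    d2 = samples.var(axis=1, ddof=0)      # D_n^2 uses divisor n
    # E[D_n^2] = (n-1)/n * sigma^2, which approaches sigma^2 = 4
    print(n, d2.mean(), (n - 1) / n * sigma2)
```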
Asymptotic minimum variance
The variance of an unbiased estimator for a parameter θ is always larger than
or equal to a certain positive number, known as the Cramér-Rao lower bound
(see Remark 20.2). Again under mild conditions one can show that maxi-
mum likelihood estimators have asymptotically the smallest variance among
unbiased estimators. That is, asymptotically the variance of the maximum
likelihood estimator for a parameter θ attains the Cramér-Rao lower bound.
21.5 Solutions to the quick exercises
21.1 In the case that only the first three chips are defective, the probability
that the observed data occur is equal to
   P(R1 = 1, R2 = 1, R3 = 1, R4 = 0, . . . , R10 = 0) = p³(1 − p)⁷.

For the batch where about 10% of the chips are defective we find that

   P(R1 = 1, R2 = 1, R3 = 1, R4 = 0, . . . , R10 = 0) = (1/10)³ · (9/10)⁷ = 0.00048,

whereas for the other batch this probability is equal to

   (1/2)³ · (1/2)⁷ = 0.00098.
= 0.00098.
So the probability for the batch with about 50% defective chips is about 2
times larger than the probability for the other batch. In view of this, it would
be reasonable to choose the other batch, not the tested one.
21.2 From Table 21.1 we derive
   L(p) = constant · P(Xi = 1)^{198} · P(Xi = 2)^{107} · · · P(Xi = 12)^{6} · P(Xi > 12)^{12}
        = constant · p^{198} · ((1 − p)p)^{107} · · · ((1 − p)^{11} p)^{6} · ((1 − p)^{12})^{12}
        = constant · p^{474} · (1 − p)^{955}.
Here the constant is the number of ways we can assign 198 ones, 107 twos, . . . ,
6 twelves, and 12 numbers larger than 12 to 486 nonsmokers. Differentiating
L(p) with respect to p yields that
   L′(p) = constant · [474 p^{473}(1 − p)^{955} − 955 p^{474}(1 − p)^{954}]
         = constant · p^{473}(1 − p)^{954} [474(1 − p) − 955p]
         = constant · p^{473}(1 − p)^{954} (474 − 1429p).

Now L′(p) = 0 if p = 0, p = 1, or p = 474/1429 = 0.33, and L(p) attains its
unique maximum in this last point.
21.3 The loglikelihood function ℓ(λ) has derivative

   ℓ′(λ) = n/λ − (x1 + x2 + · · · + xn) = n (1/λ − x̄n).

One finds that ℓ′(λ) = 0 if and only if λ = 1/x̄n and that this is a maximum.
The maximum likelihood estimate for λ is therefore 1/x̄n.
21.6 Exercises
21.1  Consider the following situation. Suppose we have two fair dice, D1
with 5 red sides and 1 white side and D2 with 1 red side and 5 white sides.
We pick one of the dice randomly, and throw it repeatedly until red comes
up for the first time. With the same die this experiment is repeated two more
times. Suppose the following happens:
First experiment: first red appears in 3rd throw
Second experiment: first red appears in 5th throw
Third experiment: first red appears in 4th throw.
Show that for die D1 this happens with probability 5.7424 · 10^{−8}, and for
die D2 the probability with which this happens is 8.9725 · 10^{−4}. Given these
probabilities, which die do you think we picked?
21.2  We throw an unfair coin repeatedly until heads comes up for the first
time. We repeat this experiment three times (with the same coin) and obtain
the following data:
First experiment: heads first comes up in 3rd throw
Second experiment: heads first comes up in 5th throw
Third experiment: heads first comes up in 4th throw.
Let p be the probability that heads comes up in a throw with this coin.
Determine the maximum likelihood estimate p̂ of p.
21.3 In Exercise 17.4 we modeled the hits of London by flying bombs by a
Poisson distribution with parameter µ.
a. Use the data from Exercise 17.4 to find the maximum likelihood estimate
of µ.
b. Suppose the summarized data from Exercise 17.4 got corrupted in the
following way:
Number of hits 0 or 1 2 3 4 5 6 7
Number of squares 440 93 35 7 0 0 1
Using this new data, what is the maximum likelihood estimate of µ?
21.4  In Section 19.1, we considered the arrivals of packages at a network
server, where we modeled the number of arrivals per minute by a Pois(µ)
distribution. Let x1, x2, . . . , xn be a realization of a random sample from a
Pois(µ) distribution. We saw on page 286 that a natural estimate of the
probability of zeros in the dataset is given by
   (number of xi equal to zero)/n.

a. Show that the likelihood L(µ) is given by

      L(µ) = (e^{−nµ} / (x1! · · · xn!)) · µ^{x1+x2+···+xn}.
b. Determine the loglikelihood ℓ(µ) and the formula of the maximum likeli-
hood estimate for µ.
c. What is the maximum likelihood estimate for the probability e−µ
of zero
arrivals?
21.5  Suppose that x1, x2, . . . , xn is a dataset, which is a realization of a
random sample from a normal distribution.
a. Let the probability density of this normal distribution be given by
   fµ(x) = (1/√(2π)) e^{−(1/2)(x−µ)²}   for −∞ < x < ∞.
Determine the maximum likelihood estimate for µ.
b. Now suppose that the density of this normal distribution is given by
   fσ(x) = (1/(σ√(2π))) e^{−(1/2) x²/σ²}   for −∞ < x < ∞.
Determine the maximum likelihood estimate for σ.
21.6 Let x1, x2, . . . , xn be a dataset that is a realization of a random sample
from a distribution with probability density fδ(x) given by
   fδ(x) = e^{−(x−δ)}  for x ≥ δ,   and   fδ(x) = 0  for x < δ.
a. Draw the likelihood L(δ).
b. Determine the maximum likelihood estimate for δ.
21.7  Suppose that x1, x2, . . . , xn is a dataset, which is a realization of a ran-
dom sample from a Rayleigh distribution, which is a continuous distribution
with probability density function given by
   fθ(x) = (x/θ²) e^{−(1/2) x²/θ²}   for x ≥ 0.
In this case what is the maximum likelihood estimate for θ?
21.8  (Exercises 19.7 and 20.7 continued) A certain type of plant can be di-
vided into four types: starchy-green, starchy-white, sugary-green, and sugary-
white. The following table lists the counts of the various types among 3839
leaves.
Type Count
Starchy-green 1997
Sugary-white 32
Starchy-white 906
Sugary-green 904
Setting

   X = 1 if the observed leaf is of type starchy-green,
       2 if the observed leaf is of type sugary-white,
       3 if the observed leaf is of type starchy-white,
       4 if the observed leaf is of type sugary-green,

the probability mass function p of X is given by

   a       1             2        3             4
   p(a)    (1/4)(2 + θ)  (1/4)θ   (1/4)(1 − θ)  (1/4)(1 − θ)

and p(a) = 0 for all other a. Here 0 < θ < 1 is an unknown parameter,
which was estimated in Exercise 19.7. We want to find a maximum likelihood
estimate of θ.
a. Use the data to find the likelihood L(θ) and the loglikelihood ℓ(θ).
b. What is the maximum likelihood estimate of θ using the data from the
preceding table?
c. Suppose that we have the counts of n different leaves: n1 of type starchy-
green, n2 of type sugary-white, n3 of type starchy-white, and n4 of type
sugary-green (so n = n1 + n2 + n3 + n4). Determine the general formula
for the maximum likelihood estimate of θ.
21.9  Let x1, x2, . . . , xn be a dataset that is a realization of a random sample
from a U(α, β) distribution (with α and β unknown, α < β). Determine the
maximum likelihood estimates for α and β.
21.10 Let x1, x2, . . . , xn be a dataset, which is a realization of a random
sample from a Par(α) distribution. What is the maximum likelihood estimate
for α?
21.11  In Exercise 4.13 we considered the situation where we have a box
containing an unknown number—say N—of identical bolts. In order to get an
idea of the size of N we introduced three random variables X, Y , and Z. Here
we will use X and Y , and in the next exercise Z, to find maximum likelihood
estimates of N.
a. Suppose that x1, x2, . . . , xn is a dataset, which is a realization of a random
sample from a Geo(1/N) distribution. Determine the maximum likelihood
estimate for N.
b. Suppose that y1, y2, . . . , yn is a dataset, which is a realization of a random
sample from a discrete uniform distribution on 1, 2, . . . , N. Determine the
maximum likelihood estimate for N.
21.12 (Exercise 21.11 continued.) Suppose that m bolts in the box were
marked and then r bolts were selected from the box; Z is the number of
marked bolts in the sample. (Recall that it was shown in Exercise 4.13 c that
Z has a hypergeometric distribution, with parameters m, N, and r.) Suppose
that k bolts in the sample were marked. Show that the likelihood L(N) is
given by
L(N) = \binom{m}{k} \binom{N−m}{r−k} / \binom{N}{r}.

Next show that L(N) increases for N < mr/k and decreases for N > mr/k, and conclude that mr/k is the maximum likelihood estimate for N.
21.13 Often one can model the times that customers arrive at a shop rather
well by a Poisson process with (unknown) rate λ (customers/hour). On a
certain day, one of the attendants noticed that between noon and 12.45 p.m.
two customers arrived, and another attendant noticed that on the same day
one customer arrived between 12.15 and 1 p.m. Use the observations of the
attendants to determine the maximum likelihood estimate of λ.
21.14 A very inexperienced archer shoots an arrow n times at a disc of (unknown) radius θ. The disc is hit every time, but at completely random places.
Let r1, r2, . . . , rn be the distances of the various hits to the center of the disc.
Determine the maximum likelihood estimate for θ.
21.15 On January 28, 1986, the main fuel tank of the space shuttle Challenger
exploded shortly after takeoff. Essential in this accident was the leakage of
some of the six O-rings of the Challenger. In Section 1.4 the probability of
failure of an O-ring is given by
p(t) = e^{a+b·t} / (1 + e^{a+b·t}),

where t is the temperature at launch in degrees Fahrenheit. In Table 21.2 the temperature t (in °F, rounded to the nearest integer) and the number of failures N for 23 missions are given, ordered according to increasing temperatures. (See also Figure 1.3, where these data are graphically depicted.) Give the likelihood L(a, b) and the loglikelihood ℓ(a, b).
Table 21.2. Space shuttle failure data of pre-Challenger missions.
t 53 57 58 63 66 67 67 67
N 2 1 1 1 0 0 0 0
t 68 69 70 70 70 70 72 73
N 0 0 0 0 1 1 0 0
t 75 75 76 76 78 79 81
N 0 2 0 0 0 0 0
21.16 In the 18th century Georges-Louis Leclerc, Comte de Buffon (1707–
1788) found an amusing way to approximate the number π using probability
theory and statistics. Buffon had the following idea: take a needle and a large
sheet of paper, and draw horizontal lines that are a needle-length apart. Throw
the needle a number of times (say n times) on the sheet, and count how often it
hits one of the horizontal lines. Say this number is sn, then sn is the realization
of a Bin(n, p) distributed random variable Sn. Here p is the probability that
the needle hits one of the horizontal lines. In Exercise 9.20 you found that
p = 2/π. Show that
T = 2n/Sn

is the maximum likelihood estimator for π.
22
The method of least squares
The maximum likelihood principle provides a way to estimate parameters. The
applicability of the method is quite general but not universal. For example,
in the simple linear regression model, introduced in Section 17.4, we need to
know the distribution of the response variable in order to find the maximum
likelihood estimates for the parameters involved. In this chapter we will see
how these parameters can be estimated using the method of least squares.
Furthermore, the relation between least squares and maximum likelihood will
be investigated in the case of normally distributed errors.
22.1 Least squares estimation and regression
Recall from Section 17.4 the simple linear regression model for a bivariate
dataset (x1, y1), (x2, y2), . . . , (xn, yn). In this model x1, x2, . . . , xn are non-
random and y1, y2, . . . , yn are realizations of random variables Y1, Y2, . . . , Yn
satisfying
Yi = α + βxi + Ui for i = 1, 2, . . ., n,
where U1, U2, . . . , Un are independent random variables with zero expectation and variance σ^2. How can one obtain estimates for the parameters α, β, and σ^2 in this model?
Note that we cannot find maximum likelihood estimates for these parameters,
simply because we have no further knowledge about the distribution of the Ui
(and consequently of the Yi). We want to choose α and β in such a way that
we obtain a line that fits the data best. A classical approach to do this is to
consider the sum of squared distances between the observed values yi and the
values α +βxi on the regression line y = α +βx. See Figure 22.1, where these
distances are indicated. The method of least squares prescribes to choose α
and β such that the sum of squares

S(α, β) = Σ_{i=1}^{n} (yi − α − βxi)^2
Fig. 22.1. The observed value yi corresponding to xi and the value α + βxi on the
regression line y = α + βx.
is minimal. The ith term in the sum is the squared distance in the vertical
direction from (xi, yi) to the line y = α + βx. To find these so-called least
squares estimates, we differentiate S(α, β) with respect to α and β, and we
set the derivatives equal to 0:
∂S(α, β)/∂α = 0  ⇔  Σ_{i=1}^{n} (yi − α − βxi) = 0,
∂S(α, β)/∂β = 0  ⇔  Σ_{i=1}^{n} (yi − α − βxi) xi = 0.
This is equivalent to
nα + β Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi,
α Σ_{i=1}^{n} xi + β Σ_{i=1}^{n} xi^2 = Σ_{i=1}^{n} xi yi.
For example, for the timber data from Table 15.5 we would obtain
36 α + 1646.4 β = 52 901
1646.4 α + 81750.02 β = 2 790 525.
These are two equations with two unknowns α and β. Solving for α and β
yields the solutions α̂ = −1160.5 and β̂ = 57.51. In Figure 22.2 a scatterplot of
the timber dataset, together with the estimated regression line y = −1160.5+
57.51x, is depicted.
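For readers who want to check this numerically, the two normal equations form a small linear system that can be solved directly. The following is a minimal Python sketch, not part of the original text (NumPy is assumed available; the coefficients are the timber sums quoted above):

import numpy as np

# Normal equations for the timber data, with the coefficients quoted above:
#     36*alpha + 1646.4*beta    = 52901
# 1646.4*alpha + 81750.02*beta  = 2790525
A = np.array([[36.0, 1646.4],
              [1646.4, 81750.02]])
b = np.array([52901.0, 2790525.0])

alpha_hat, beta_hat = np.linalg.solve(A, b)
print(alpha_hat, beta_hat)   # approximately -1160.5 and 57.51 (up to rounding)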
Quick exercise 22.1 Suppose you are given a piece of Australian timber with
density 65. What would you choose as an estimate for the Janka hardness?
[Figure 22.2 axes: wood density (horizontal) and hardness (vertical).]
Fig. 22.2. Scatterplot and estimated regression line for the timber data.
In general, writing Σ instead of Σ_{i=1}^{n}, we find the following formulas for the estimates α̂ (the intercept) and β̂ (the slope):

β̂ = ( n Σ xiyi − (Σ xi)(Σ yi) ) / ( n Σ xi^2 − (Σ xi)^2 )     (22.1)

α̂ = ȳn − β̂ x̄n.     (22.2)
Since S(α, β) is an elliptic paraboloid (a “vase”), it follows that (α̂, β̂) is the
unique minimum of S(α, β) (except when all xi are equal).
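The same estimates also follow from (22.1) and (22.2) using only the summary sums. A small Python sketch of this computation, not part of the original text (the sums are the timber values quoted above):

# Least squares estimates from formulas (22.1) and (22.2), using the timber sums.
n = 36
sum_x, sum_y = 1646.4, 52901.0
sum_x2, sum_xy = 81750.02, 2790525.0

beta_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
alpha_hat = sum_y / n - beta_hat * sum_x / n   # alpha-hat = y-bar minus beta-hat times x-bar
print(alpha_hat, beta_hat)                     # again approximately -1160.5 and 57.51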
Quick exercise 22.2 Check that the line y = α̂ + β̂x always passes through
the “center of gravity” (x̄n, ȳn).
Least squares estimators are unbiased
We denote the least squares estimates by α̂ and β̂. It is quite common to also
denote the least squares estimators by α̂ and β̂:
α̂ = Ȳn − β̂ x̄n,    β̂ = ( n Σ xiYi − (Σ xi)(Σ Yi) ) / ( n Σ xi^2 − (Σ xi)^2 ).
In Exercise 22.12 it is shown that β̂ is an unbiased estimator for β. Using this
and the fact that E[Yi] = α + βxi (see page 258), we find for α̂:
E[α̂] = E[Ȳn] − x̄n E[β̂]
      = (1/n) Σ_{i=1}^{n} E[Yi] − x̄n β
      = (1/n) Σ_{i=1}^{n} (α + βxi) − x̄n β = α + βx̄n − x̄n β
      = α.
We see that α̂ is an unbiased estimator for α.
An unbiased estimator for σ^2

In the simple linear regression model the assumptions imply that the random variables Yi are independent with variance σ^2. Unfortunately, one cannot apply the usual estimator (1/(n − 1)) Σ_{i=1}^{n} (Yi − Ȳn)^2 for the variance of the Yi (see Section 19.4), because different Yi have different expectations. What would be a reasonable estimator for σ^2? The following quick exercise suggests a candidate.
Quick exercise 22.3 Let U1, U2, . . . , Un be independent random variables, each with expected value zero and variance σ^2. Show that

T = (1/n) Σ_{i=1}^{n} Ui^2

is an unbiased estimator for σ^2.
At first sight one might be tempted to think that the unbiased estimator T from this quick exercise is a useful tool to estimate σ^2. Unfortunately, we only observe the xi and Yi, not the Ui. However, from the fact that Ui = Yi − α − βxi, it seems reasonable to try

(1/n) Σ_{i=1}^{n} (Yi − α̂ − β̂xi)^2     (22.3)

as an estimator for σ^2. Tedious calculations show that the expected value of this random variable equals ((n − 2)/n) σ^2. But then we can easily turn it into an unbiased estimator for σ^2.
An unbiased estimator for σ^2. In the simple linear regression model the random variable

σ̂^2 = (1/(n − 2)) Σ_{i=1}^{n} (Yi − α̂ − β̂xi)^2

is an unbiased estimator for σ^2.
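A quick simulation can make the unbiasedness claim plausible. The sketch below (Python with NumPy; the true values α = 2, β = 3, σ = 1.5 and the design points are chosen only for illustration) repeatedly generates data from the simple linear regression model and averages σ̂^2:

import numpy as np

rng = np.random.default_rng(1)
n = 20
x = np.linspace(1.0, 10.0, n)
alpha, beta, sigma = 2.0, 3.0, 1.5   # illustrative true parameter values

estimates = []
for _ in range(10_000):
    y = alpha + beta * x + rng.normal(0.0, sigma, n)
    # least squares estimates via (22.1) and (22.2)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x ** 2) - x.sum() ** 2)
    a = y.mean() - b * x.mean()
    rss = np.sum((y - a - b * x) ** 2)
    estimates.append(rss / (n - 2))            # sigma-hat squared

print(np.mean(estimates))                      # should be close to sigma**2 = 2.25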
22.2 Residuals
A way to explore whether the simple linear regression model is appropriate
to model a given bivariate dataset is to inspect a scatterplot of the so-called
residuals ri against the xi. The ith residual ri is defined as the vertical distance
between the ith point and the estimated regression line:
ri = yi − α̂ − β̂xi, i = 1, 2, . . ., n.
When a linear model is appropriate, the scatterplot of the residuals ri against
the xi should show truly random fluctuations around zero, in the sense that
it should not exhibit any trend or pattern. This seems to be the case in
Figure 22.3, which shows the residuals for the black cherry tree data from
Exercise 17.9.
Fig. 22.3. Scatterplot of ri versus xi for the black cherry tree data.
Quick exercise 22.4 Recall from Quick exercise 22.2 that (x̄n, ȳn) is on the regression line y = α̂ + β̂x, i.e., that ȳn = α̂ + β̂x̄n. Use this to show that Σ_{i=1}^{n} ri = 0, i.e., that the sum of the residuals is zero.
In Figure 22.4 we depicted ri versus xi for the timber dataset. In this case a
slight parabolic pattern can be observed. Figures 22.2 and 22.4 suggest that
Fig. 22.4. Scatterplot of ri versus xi for the timber data with the simple linear
regression model Yi = α + βxi + Ui.
for the timber dataset a better model might be

Yi = α + βxi + γxi^2 + Ui   for i = 1, 2, . . ., n.

In this new model the residuals are

ri = yi − α̂ − β̂xi − γ̂xi^2,

where α̂, β̂, and γ̂ are the least squares estimates obtained by minimizing

Σ_{i=1}^{n} (yi − α − βxi − γxi^2)^2.
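Minimizing this sum is still a linear least squares problem, because the model is linear in α, β, and γ; only the design matrix gains a column xi^2. A Python sketch on simulated data (the true coefficients and the noise level are arbitrary, chosen only to illustrate the computation), not part of the original text:

import numpy as np

rng = np.random.default_rng(2)
x = np.arange(20.0, 75.0, 5.0)
y = 100.0 - 20.0 * x + 1.0 * x ** 2 + rng.normal(0.0, 50.0, x.size)  # illustrative data

X = np.column_stack([np.ones_like(x), x, x ** 2])   # columns 1, x_i, x_i^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)        # minimizes the sum of squares
alpha_hat, beta_hat, gamma_hat = coef

residuals = y - X @ coef
print(coef)
print(residuals.sum())   # numerically close to 0, as for any model with an intercept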
In Figure 22.5 we depicted ri versus xi. The residuals display no trend or
pattern, except that they “fan out”—an example of a phenomenon called
heteroscedasticity.
Fig. 22.5. Scatterplot of ri versus xi for the timber data with the model Yi = α + βxi + γxi^2 + Ui.
Heteroscedasticity
The assumption of equal variance of the Ui (and therefore of the Yi) is called
homoscedasticity. In case the variance of Yi depends on the value of xi, we
speak of heteroscedasticity. For instance, heteroscedasticity occurs when Yi
with a large expected value have a larger variance than those with small ex-
pected values. This produces a “fanning out” effect, which can be observed
in Figure 22.5. This figure strongly suggests that the timber data are het-
eroscedastic. Possible ways out of this problem are a technique called weighted
least squares or the use of variance-stabilizing transformations.
22.3 Relation with maximum likelihood
To apply the method of least squares no assumption is needed about the type
of distribution of the Ui. In case the type of distribution of the Ui is known,
the maximum likelihood principle can be applied. Consider, for instance, the
classical situation where the Ui are independent with an N(0, σ2
) distribution.
What are the maximum likelihood estimates for α and β?
In this case the Yi are independent, and Yi has an N(α+βxi, σ2
) distribution.
Under these assumptions and assuming that the linear model is appropriate
to model a given bivariate dataset, the ri should look like the realization of a
random sample from a normal distribution. As an example a histogram of the
residuals ri of the cherry tree data of Exercise 17.9 is depicted in Figure 22.6.
Fig. 22.6. Histogram of the residuals ri for the black cherry tree data.
The data do not exhibit strong evidence against the assumption of normality.
When Yi has an N(α + βxi, σ^2) distribution, the probability density of Yi is given by

fi(y) = (1/(σ√(2π))) e^{-(y − α − βxi)^2/(2σ^2)}   for −∞ < y < ∞.

Since

ln(fi(yi)) = −ln(σ) − ln(√(2π)) − (1/2) ((yi − α − βxi)/σ)^2,

the loglikelihood is:

ℓ(α, β, σ) = ln(f1(y1)) + · · · + ln(fn(yn))
           = −n ln(σ) − n ln(√(2π)) − (1/(2σ^2)) Σ_{i=1}^{n} (yi − α − βxi)^2.
Note that for any fixed σ > 0, the loglikelihood ℓ(α, β, σ) attains its maximum precisely when Σ_{i=1}^{n} (yi − α − βxi)^2 is minimal. Hence, in case the Ui are independent with an N(0, σ^2) distribution, the maximum likelihood principle and the least squares method yield the same estimators.
To find the maximum likelihood estimate of σ we differentiate ℓ(α, β, σ) with respect to σ:

∂ℓ(α, β, σ)/∂σ = −n/σ + (1/σ^3) Σ_{i=1}^{n} (yi − α − βxi)^2.

It follows (from the invariance principle on page 321) that the maximum likelihood estimator of σ^2 is given by

(1/n) Σ_{i=1}^{n} (Yi − α̂ − β̂xi)^2,

which is the estimator from (22.3).
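The agreement between the two principles can also be checked numerically: maximizing the loglikelihood over (α, β, σ) returns the same α̂ and β̂ as the closed-form least squares formulas. A sketch, not part of the original text (Python, assuming NumPy and SciPy; the simulated data and the true values 1, 2, 0.5 are only for illustration):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)   # illustrative data with normal errors

def neg_loglik(theta):
    # negative loglikelihood, with the additive constant n*ln(sqrt(2*pi)) dropped
    a, b, log_s = theta
    s = np.exp(log_s)                          # parametrize sigma > 0 via its log
    r = y - a - b * x
    return n * np.log(s) + np.sum(r ** 2) / (2.0 * s ** 2)

res = minimize(neg_loglik, x0=np.zeros(3))

# closed-form least squares estimates, formulas (22.1) and (22.2)
b_ls = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x ** 2) - x.sum() ** 2)
a_ls = y.mean() - b_ls * x.mean()

print(res.x[:2], (a_ls, b_ls))                                     # the two pairs agree closely
print(np.exp(res.x[2]) ** 2, np.mean((y - a_ls - b_ls * x) ** 2))  # ML estimate of sigma^2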
22.4 Solutions to the quick exercises
22.1 We can use the estimated regression line y = −1160.5+57.51x to predict
the Janka hardness. For density x = 65 we find as a prediction for the Janka
hardness y = 2577.65.
22.2 Rewriting α̂ = ȳn − β̂x̄n, it follows that ȳn = α̂ + β̂x̄n, which means that (x̄n, ȳn) is a point on the estimated regression line y = α̂ + β̂x.
22.3 We need to show that E[T] = σ^2. Since E[Ui] = 0, we have Var(Ui) = E[Ui^2], so that:

E[T] = E[ (1/n) Σ_{i=1}^{n} Ui^2 ] = (1/n) Σ_{i=1}^{n} E[Ui^2] = (1/n) Σ_{i=1}^{n} Var(Ui) = σ^2.
22.4 Since ri = yi − (α̂ + β̂xi) for i = 1, 2, . . . , n, it follows that the sum of the residuals equals

Σ ri = Σ yi − ( nα̂ + β̂ Σ xi ) = nȳn − ( nα̂ + nβ̂x̄n ) = n( ȳn − (α̂ + β̂x̄n) ) = 0,

because ȳn = α̂ + β̂x̄n, according to Quick exercise 22.2.
22.5 Exercises
22.1  Consider the following bivariate dataset:
(1, 2) (3, 1.8) (5, 1).
a. Determine the least squares estimates α̂ and β̂ of the parameters of the
regression line y = α + βx.
b. Determine the residuals r1, r2, and r3 and check that they add up to 0.
c. Draw in one figure the scatterplot of the data and the estimated regression
line y = α̂ + β̂x.
22.2 Adding one point may dramatically change the estimates of α and β.
Suppose one extra datapoint is added to the dataset of the previous exercise
and that we have as dataset:
(0, 0) (1, 2) (3, 1.8) (5, 1).
Determine the least squares estimate of β̂. A point such as (0, 0), which dra-
matically changes the estimates for α and β, is called a leverage point.
22.3 Suppose we have the following bivariate dataset:
(1, 3.1) (1.7, 3.9) (2.1, 3.8) (2.5, 4.7) (2.7, 4.5).
a. Determine the least squares estimates α̂ and β̂ of the parameters of the
regression line y = α + βx. You may use that Σ xi = 10, Σ yi = 20, Σ xi^2 = 21.84, and Σ xiyi = 41.61.
b. Draw in one figure the scatterplot of the data and the estimated regression
line y = α̂ + β̂x.
22.4 We are given a bivariate dataset (x1, y1), (x2, y2), . . . , (x100, y100). For
this bivariate dataset it is known that Σ xi = 231.7, Σ xi^2 = 2400.8, Σ yi = 321, and Σ xiyi = 5189. What are the least squares estimates α̂ and β̂ of the parameters of the regression line y = α + βx?
22.5  For the timber dataset it seems reasonable to leave out the intercept α
(“no hardness without density”). The model then becomes
Yi = βxi + Ui for i = 1, 2, . . . , n.
Show that the least squares estimator β̂ of β is now given by

β̂ = ( Σ_{i=1}^{n} xiYi ) / ( Σ_{i=1}^{n} xi^2 )

by minimizing the appropriate sum of squares.
22.6  (Quick exercise 22.1 and Exercise 22.5 continued). Suppose we are
given a piece of Australian timber with density 65. What would you choose
as an estimate for the Janka hardness, based on the regression model with
no intercept? Recall that

xiyi = 2790525 and

x2
i = 81750.02 (see also
Section 22.1).
22.7 Consider the dataset
(x1, y1), (x2, y2), . . . , (xn, yn),
where x1, x2, . . . , xn are nonrandom and y1, y2, . . . , yn are realizations of ran-
dom variables Y1, Y2, . . . , Yn, satisfying
Yi = e^{α+βxi} + Ui   for i = 1, 2, . . . , n.

Here U1, U2, . . . , Un are independent random variables with zero expectation and variance σ^2. What are the least squares estimates for the parameters α
and β in this model?
22.8  Which simple regression model has the larger residual sum of squares Σ_{i=1}^{n} ri^2, the model with intercept or the one without?
22.9 For some datasets it seems reasonable to leave out the slope β. For
example, in the jury example from Section 6.3 it was assumed that the score
that juror i assigns when the performance deserves a score g is Yi = g + Zi,
where Zi is a random variable with values around zero. In general, when the
slope β is left out, the model becomes
Yi = α + Ui for i = 1, 2, . . ., n.
Show that Ȳn is the least squares estimator α̂ of α.
22.10  In the method of least squares we choose α and β in such a way
that the sum of squared residuals S(α, β) is minimal. Since the ith term in
this sum is the squared vertical distance from (xi, yi) to the regression line
y = α + βx, one might also wonder whether it is a good idea to replace this
squared distance simply by the distance. So, given a bivariate dataset
(x1, y1), (x2, y2), . . . , (xn, yn),
choose α and β in such a way that the sum
A(α, β) = Σ_{i=1}^{n} |yi − α − βxi|
is minimal. We will investigate this by a simple example. Consider the follow-
ing bivariate dataset:
(0, 2), (1, 2), (2, 0).
a. Determine the least squares estimates α̂ and β̂, and draw in one figure
the scatterplot of the data and the estimated regression line y = α̂ + β̂x.
Finally, determine A(α̂, β̂).
b. One might wonder whether α̂ and β̂ also minimize A(α, β). To investigate
this, choose β = −1 and find α’s for which A(α, −1) < A(α̂, β̂). For which
α is A(α, −1) minimal?
c. Find α and β for which A(α, β) is minimal.
22.11 Consider the dataset (x1, y1), (x2, y2), . . . , (xn, yn), where the xi are
nonrandom and the yi are realizations of random variables Y1, Y2, . . . , Yn sat-
isfying
Yi = g(xi) + Ui for i = 1, 2, . . . , n,
where U1, U2, . . . , Un are independent random variables with zero expecta-
tion and variance σ^2. Visual inspection of the scatterplot of our dataset in
Fig. 22.7. Scatterplot of yi versus xi.
Figure 22.7 suggests that we should model the Yi by

Yi = βxi + γxi^2 + Ui   for i = 1, 2, . . . , n.
a. Show that the least squares estimators β̂ and γ̂ satisfy

β Σ xi^2 + γ Σ xi^3 = Σ xiyi,
β Σ xi^3 + γ Σ xi^4 = Σ xi^2 yi.
b. Infer from a—for instance, by using linear algebra—that the estimators β̂ and γ̂ are given by

β̂ = ( (Σ xiYi)(Σ xi^4) − (Σ xi^3)(Σ xi^2 Yi) ) / ( (Σ xi^2)(Σ xi^4) − (Σ xi^3)^2 )

and

γ̂ = ( (Σ xi^2)(Σ xi^2 Yi) − (Σ xi^3)(Σ xiYi) ) / ( (Σ xi^2)(Σ xi^4) − (Σ xi^3)^2 ).
22.12  The least squares estimator β̂ from (22.1) is an unbiased estimator for β. You can show this in four steps.
a. First show that

E[β̂] = ( n Σ xi E[Yi] − (Σ xi)(Σ E[Yi]) ) / ( n Σ xi^2 − (Σ xi)^2 ).

b. Next use that E[Yi] = α + βxi to obtain that

E[β̂] = ( n Σ xi(α + βxi) − (Σ xi)(nα + β Σ xi) ) / ( n Σ xi^2 − (Σ xi)^2 ).

c. Simplify this last expression to find

E[β̂] = ( nα Σ xi + nβ Σ xi^2 − nα Σ xi − β(Σ xi)^2 ) / ( n Σ xi^2 − (Σ xi)^2 ).

d. Finally, conclude that β̂ is an unbiased estimator for β.
23
Confidence intervals for the mean
Sometimes, a range of plausible values for an unknown parameter is preferred
to a single estimate. We shall discuss how to turn data into what are called
confidence intervals and show that this can be done in such a manner that
definite statements can be made about how confident we are that the true pa-
rameter value is in the reported interval. This level of confidence is something
you can choose. We start this chapter with the general principle of confidence
intervals. We continue with confidence intervals for the mean, the common
way to refer to confidence intervals made for the expected value of the model
distribution. Depending on the situation, one of the four methods presented
will apply.
23.1 General principle
In previous chapters we have encountered sample statistics as estimators for
distribution features. This started somewhat informally in Chapter 17, where
it was claimed, for example, that the sample mean and the sample variance
are usually close to µ and σ2
of the underlying distribution. Bias and MSE
of estimators, discussed in Chapters 19 and 20, are used to judge the quality
of estimators. If we have at our disposal an estimator T for an unknown
parameter θ, we use its realization t as our estimate for θ. For example, when
collecting data on the speed of light, as Michelson did (see Section 13.1), the
unknown speed of light would be the parameter θ, our estimator T could
be the sample mean, and Michelson’s data then yield an estimate t for θ of
299 852.4 km/sec. We call this number a point estimate: if we are required
to select one number, this is it. Had the measurements started a day earlier,
however, the whole experiment would in essence be the same, but the results
might have been different. Hence, we cannot say that the estimate equals the
speed of light but rather that it is close to the true speed of light. For example,
we could say something like: “we have great confidence that the true speed of
light is somewhere between . . . and . . . .” In addition to providing an interval
of plausible values for θ we would want to add a specific statement about how
confident we are that the true θ is among them.
In this chapter we shall present methods to make confidence statements about
unknown parameters, based on knowledge of the sampling distributions of cor-
responding estimators. To illustrate the main idea, suppose the estimator T
is unbiased for the speed of light θ. For the moment, also suppose that T
has standard deviation σT = 100 km/sec (we shall drop this unrealistic as-
sumption shortly). Then, applying formula (13.1), which was derived from
Chebyshev’s inequality (see Section 13.2), we find
P(|T − θ| < 2σT) ≥ 3/4.     (23.1)
In words this reads: with probability at least 75%, the estimator T is within
2σT = 200 of the true speed of light θ. We could rephrase this as
T ∈ (θ − 200, θ + 200) with probability at least 75%.
However, if I am near the city of Paris, then the city of Paris is near me: the
statement “T is within 200 of θ” is the same as “θ is within 200 of T ,” and
we could equally well rephrase (23.1) as
θ ∈ (T − 200, T + 200) with probability at least 75%.
Note that of the last two equations the first is a statement about a random
variable T being in a fixed interval, whereas in the second equation the interval
is random and the statement is about the probability that the random interval
covers the fixed but unknown θ. The interval (T − 200, T + 200) is sometimes
called an interval estimator, and its realization is an interval estimate.
Evaluating T for the Michelson data we find as its realization t = 299 852.4,
and this yields the statement
θ ∈ (299 652.4, 300 052.4). (23.2)
Because we substituted the realization for the random variable, we cannot
claim that (23.2) holds with probability at least 75%: either the true speed of
light θ belongs to the interval or it does not; the statement we make is either
true or false, we just do not know which. However, because the procedure
guarantees a probability of at least 75% of getting a “right” statement, we
say:
θ ∈ (299 652.4, 300 052.4) with confidence at least 75%. (23.3)
The construction of this confidence interval only involved an unbiased estima-
tor and knowledge of its standard deviation. When more information on the
sampling distribution of the estimator is available, more refined statements
can be made, as we shall see shortly.
Quick exercise 23.1 Repeat the preceding derivation, starting from the
statement P(|T − θ| < 3σT) ≥ 8/9 (check that this follows from Chebyshev’s
inequality). What is the resulting confidence interval for the speed of light,
and what is the corresponding confidence?
A general definition
Many confidence intervals are of the form^1

(t − c · σT , t + c · σT)
we just encountered, where c is a number near 2 or 3. The corresponding
confidence is often much higher than in the preceding example. Because there
are many other ways confidence intervals can (or have to) be constructed, the
general definition looks a bit different.
Confidence intervals. Suppose a dataset x1, . . . , xn is given,
modeled as realization of random variables X1, . . . , Xn. Let θ be the
parameter of interest, and γ a number between 0 and 1. If there exist
sample statistics Ln = g(X1, . . . , Xn) and Un = h(X1, . . . , Xn) such
that
P(Ln < θ < Un) = γ
for every value of θ, then
(ln, un),
where ln = g(x1, . . . , xn) and un = h(x1, . . . , xn), is called a 100γ%
confidence interval for θ. The number γ is called the confidence level.
Sometimes sample statistics Ln and Un as required in the definition do not
exist, but one can find Ln and Un that satisfy
P(Ln < θ < Un) ≥ γ.
The resulting confidence interval (ln, un) is called a conservative 100γ% confi-
dence interval for θ: the actual confidence level might be higher. For example,
the interval in (23.2) is a conservative 75% confidence interval.
Quick exercise 23.2 Why is the interval in (23.2) a conservative 75% con-
fidence interval?
There is no way of knowing whether an individual confidence interval is cor-
rect, in the sense that it indeed does cover θ. The procedure guarantees that
each time we make a confidence interval we have probability γ of covering θ.
What this means in practice can easily be illustrated with an example, using
simulation:
^1 Another form is, for example, (c1t, c2t).
Generate x1, . . . , x20 from an N(0, 1) distribution. Next, pretend that
it is known that the data are from a normal distribution but that both
µ and σ are unknown. Construct the 90% confidence interval for the
expectation µ using the method described in the next section, which
says to use (ln, un) with

ln = x̄20 − 1.729 · s20/√20,    un = x̄20 + 1.729 · s20/√20,
where x̄20 and s20 are the sample mean and standard deviation. Fi-
nally, check whether the “true µ,” in this case 0, is in the confidence
interval.
We repeated the whole procedure 50 times, making 50 confidence intervals
for µ. Each confidence interval is based on a fresh independently generated
set of data. The 50 intervals are plotted in Figure 23.1 as horizontal line segments, and at µ (0!) a vertical line is drawn. We count 46 “hits”: only four intervals do not contain the true µ.

Fig. 23.1. Fifty 90% confidence intervals for µ = 0.
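Such a simulation is easy to reproduce. A Python sketch, not part of the original text (NumPy and SciPy assumed; your count will differ with another random seed, but should typically be around 45 out of 50):

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
n, reps, alpha = 20, 50, 0.10
c = t.ppf(1 - alpha / 2, df=n - 1)          # the critical value 1.729 used above

hits = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n)
    half = c * x.std(ddof=1) / np.sqrt(n)   # half-width of the 90% interval
    hits += (x.mean() - half < 0.0 < x.mean() + half)

print(hits, "out of", reps, "intervals cover the true mu = 0")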
Quick exercise 23.3 Suppose you were to make 40 confidence intervals with
confidence level 95%. About how many of them should you expect to be
“wrong”? Should you be surprised if 10 of them are wrong?
In the remainder of this chapter we consider confidence intervals for the mean:
confidence intervals for the unknown expectation µ of the distribution from
which the sample originates. We start with the situation where it is known that
the data originate from a normal distribution, first with known variance, then
with unknown variance. Then we drop the normal assumption, first use the
bootstrap, and finally show how, for very large samples, confidence intervals
based on the central limit theorem are made.
23.2 Normal data
Suppose the data can be seen as the realization of a sample X1, . . . , Xn from
an N(µ, σ2
) distribution and µ is the (unknown) parameter of interest. If the
variance σ2
is known, confidence intervals are easily derived. Before we do
this, some preparation has to be done.
Critical values
We shall need so-called critical values for the standard normal distribution.
The critical value zp of an N(0, 1) distribution is the number that has right
tail probability p. It is defined by
P(Z ≥ zp) = p,
where Z is an N(0, 1) random variable. For example, from Table B.1 we read
P(Z ≥ 1.96) = 0.025, so z0.025 = 1.96. In fact, zp is the (1 − p)th quantile of
the standard normal distribution:
Φ(zp) = P(Z ≤ zp) = 1 − p.
By the symmetry of the standard normal density, P(Z ≤ −zp) = P(Z ≥ zp) =
p, so P(Z ≥ −zp) = 1 − p and therefore
z1−p = −zp.
For example, z0.975 = −z0.025 = −1.96. All this is illustrated in Figure 23.2.
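Critical values do not have to be read from a table; any statistical package provides the quantile function of the standard normal. A short Python check, not part of the original text (SciPy assumed):

from scipy.stats import norm

# z_p is the (1 - p)th quantile of the standard normal distribution
print(norm.ppf(1 - 0.025))   # z_{0.025} = 1.96
print(norm.ppf(1 - 0.975))   # z_{0.975} = -1.96, illustrating z_{1-p} = -z_p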
Quick exercise 23.4 Determine z0.01 and z0.95 from Table B.1.
Fig. 23.2. Critical values of the standard normal distribution.
Variance known
If X1, . . . , Xn is a random sample from an N(µ, σ^2) distribution, then X̄n has an N(µ, σ^2/n) distribution, and from the properties of the normal distribution (see page 106), we know that

(X̄n − µ) / (σ/√n)

has an N(0, 1) distribution.
If cl and cu are chosen such that P(cl < Z < cu) = γ for an N(0, 1) distributed random variable Z, then

γ = P( cl < (X̄n − µ)/(σ/√n) < cu )
  = P( cl·σ/√n < X̄n − µ < cu·σ/√n )
  = P( X̄n − cu·σ/√n < µ < X̄n − cl·σ/√n ).

We have found that

Ln = X̄n − cu·σ/√n   and   Un = X̄n − cl·σ/√n
satisfy the confidence interval definition: the interval (Ln, Un) covers µ with probability γ. Therefore

( x̄n − cu·σ/√n , x̄n − cl·σ/√n )

is a 100γ% confidence interval for µ. A common choice is to divide α = 1 − γ evenly between the tails,^2 that is, solve cl and cu from

P(Z ≥ cu) = α/2   and   P(Z ≤ cl) = α/2,

so that cu = zα/2 and cl = z1−α/2 = −zα/2. Summarizing, the 100(1 − α)% confidence interval for µ is:

( x̄n − zα/2 · σ/√n , x̄n + zα/2 · σ/√n ).

^2 Here this choice could be motivated by the fact that it leads to the shortest confidence interval; in other examples the shortest interval requires an asymmetric division of α. If you are only concerned with the left or right boundary of the confidence interval, see the next chapter.
For example, if α = 0.05, we use z0.025 = 1.96 and the 95% confidence interval is

( x̄n − 1.96 · σ/√n , x̄n + 1.96 · σ/√n ).
Example: gross calorific content of coal
When a shipment of coal is traded, a number of its properties should be known
accurately, because the value of the shipment is determined by them. An im-
portant example is the so-called gross calorific value, which characterizes the
heat content and is a numerical value in megajoules per kilogram (MJ/kg).
The International Organization for Standardization (ISO) issues standard pro-
cedures for the determination of these properties. For the gross calorific value,
there is a method known as ISO 1928. When the procedure is carried out prop-
erly, resulting measurement errors are known to be approximately normal,
with a standard deviation of about 0.1 MJ/kg. Laboratories that operate
according to standard procedures receive ISO certificates. In Table 23.1, a
number of such ISO 1928 measurements is given for a shipment of Osterfeld
coal coded 262DE27.
Table 23.1. Gross calorific value measurements for Osterfeld 262DE27.
23.870 23.730 23.712 23.760 23.640 23.850 23.840 23.860
23.940 23.830 23.877 23.700 23.796 23.727 23.778 23.740
23.890 23.780 23.678 23.771 23.860 23.690 23.800
Source: A.M.H. van der Veen and A.J.M. Broos. Interlaboratory study pro-
gramme “ILS coal characterization”—reported data. Technical report, NMi
Van Swinden Laboratorium B.V., The Netherlands, 1996.
We want to combine these values into a confidence statement about the “true”
gross calorific content of Osterfeld 262DE27. From the data, we compute x̄n =
23.788. Using the given σ = 0.1 and α = 0.05, we find the 95% confidence
interval

23.788 − 1.96
0.1
√
23
, 23.788 + 1.96
0.1
√
23

= (23.747, 23.829) MJ/kg.
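A sketch of this computation in Python, not part of the original text (NumPy assumed; the measurements are those of Table 23.1 and σ = 0.1 is the known standard deviation of the ISO 1928 method):

import numpy as np

x = np.array([23.870, 23.730, 23.712, 23.760, 23.640, 23.850, 23.840, 23.860,
              23.940, 23.830, 23.877, 23.700, 23.796, 23.727, 23.778, 23.740,
              23.890, 23.780, 23.678, 23.771, 23.860, 23.690, 23.800])
sigma, z = 0.1, 1.96                      # known sigma and z_{0.025}

half = z * sigma / np.sqrt(len(x))
print(x.mean() - half, x.mean() + half)   # approximately (23.747, 23.829)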
Variance unknown
When σ is unknown, the fact that (X̄n − µ)/(σ/√n) has a standard normal distribution has become useless, as it involves this unknown σ, which would subsequently appear in the confidence interval. However, if we substitute the estimator Sn for σ, the resulting random variable

(X̄n − µ) / (Sn/√n)
has a distribution that only depends on n and not on µ or σ. Moreover, its
density can be given explicitly.
Definition. A continuous random variable has a t-distribution with
parameter m, where m ≥ 1 is an integer, if its probability density is
given by
f(x) = km (1 + x^2/m)^{−(m+1)/2}   for −∞ < x < ∞,

where km = Γ((m + 1)/2) / ( Γ(m/2) √(mπ) ). This distribution is denoted
by t(m) and is referred to as the t-distribution with m degrees of
freedom.
The normalizing constant km is given in terms of the gamma function, which
was defined on page 157. For m = 1, it evaluates to k1 = 1/π, and the resulting
density is that of the standard Cauchy distribution (see page 161). If X has
a t(m) distribution, then E[X] = 0 for m ≥ 2 and Var(X) = m/(m − 2)
for m ≥ 3. Densities of t-distributions look like that of the standard normal
distribution: they are also symmetric around 0 and bell-shaped. As m goes
to infinity the limit of the t(m) density is the standard normal density. The
distinguishing feature is that densities of t-distributions have heavier tails:
f(x) goes to zero as x goes to +∞ or −∞, but more slowly than the density
φ(x) of the standard normal distribution. These properties are illustrated in
Figure 23.3, which shows the densities and distribution functions of the t(1),
t(2), and t(5) distribution as well as those of the standard normal.
We will also need critical values for the t(m) distribution: the critical value
tm,p is the number satisfying
P(T ≥ tm,p) = p,
where T is a t(m) distributed random variable. Because the t-distribution is
symmetric around zero, using the same reasoning as for the critical values of
the standard normal distribution, we find:
Fig. 23.3. Three t-distributions and the standard normal distribution. The dotted
line corresponds to the standard normal. The other distributions depicted are the
t(1), t(2), and t(5), which in that order resemble the standard normal more and
more.
tm,1−p = −tm,p.
For example, in Table B.2 we read t10,0.01 = 2.764, and from this we deduce
that t10,0.99 = −2.764.
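As with the normal critical values, these numbers can be obtained from the quantile function of the t-distribution instead of Table B.2; a quick check in Python, not part of the original text (SciPy assumed):

from scipy.stats import t

print(t.ppf(1 - 0.01, df=10))    # t_{10,0.01} = 2.764
print(t.ppf(1 - 0.99, df=10))    # t_{10,0.99} = -2.764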
Quick exercise 23.5 Determine t3,0.01 and t35,0.9975 from Table B.2.
We now return to the distribution of (X̄n − µ)/(Sn/√n) and construct a confidence interval for µ.

The studentized mean of a normal random sample. For a random sample X1, . . . , Xn from an N(µ, σ^2) distribution, the studentized mean

(X̄n − µ) / (Sn/√n)

has a t(n − 1) distribution, regardless of the values of µ and σ.
From this fact and using critical values of the t-distribution, we derive that
P( −tn−1,α/2 < (X̄n − µ)/(Sn/√n) < tn−1,α/2 ) = 1 − α,     (23.4)
and in the same way as when σ is known it now follows that a 100(1 − α)%
confidence interval for µ is given by:
( x̄n − tn−1,α/2 · sn/√n , x̄n + tn−1,α/2 · sn/√n ).
Returning to the coal example, there was another shipment, of Daw Mill
258GB41 coal, where there were actually some doubts whether the stated
accuracy of the ISO 1928 method was attained. We therefore prefer to consider
σ unknown and estimate it from the data, which are given in Table 23.2.
Table 23.2. Gross calorific value measurements for Daw Mill 258GB41.
30.990 31.030 31.060 30.921 30.920 30.990 31.024 30.929
31.050 30.991 31.208 30.830 31.330 30.810 31.060 30.800
31.091 31.170 31.026 31.020 30.880 31.125
Source: A.M.H. van der Veen and A.J.M. Broos. Interlaboratory study pro-
gramme “ILS coal characterization”—reported data. Technical report, NMi
Van Swinden Laboratorium B.V., The Netherlands, 1996.
Doing this, we find x̄n = 31.012 and sn = 0.1294. Because n = 22, for a 95%
confidence interval we use t21,0.025 = 2.080 and obtain

( 31.012 − 2.080 · 0.1294/√22 , 31.012 + 2.080 · 0.1294/√22 ) = (30.954, 31.069).
Note that this confidence interval is (40%!) wider than the one we made for
the Osterfeld coal, with almost the same sample size. There are two reasons
for this; one is that σ = 0.1 is replaced by the (larger) estimate sn = 0.1294,
and the second is that the critical value z0.025 = 1.96 is replaced by the larger
t21,0.025 = 2.080. The differences in the method and the ingredients seem
minor, but they matter, especially for small samples.
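The Daw Mill interval can be reproduced directly from the data in Table 23.2; a Python sketch, not part of the original text (NumPy and SciPy assumed):

import numpy as np
from scipy.stats import t

x = np.array([30.990, 31.030, 31.060, 30.921, 30.920, 30.990, 31.024, 30.929,
              31.050, 30.991, 31.208, 30.830, 31.330, 30.810, 31.060, 30.800,
              31.091, 31.170, 31.026, 31.020, 30.880, 31.125])
n = len(x)
c = t.ppf(1 - 0.025, df=n - 1)            # t_{21,0.025} = 2.080

half = c * x.std(ddof=1) / np.sqrt(n)     # s_n replaces the unknown sigma
print(x.mean() - half, x.mean() + half)   # approximately (30.954, 31.069)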
23.3 Bootstrap confidence intervals
It is not uncommon that the methods of the previous section are used even
when the normal distribution is not a good model for the data. In some cases
this is not a big problem: with small deviations from normality the actual
confidence level of a constructed confidence interval may deviate only a few
percent from the intended confidence level. For large datasets the central limit
theorem in fact ensures that this method provides confidence intervals with
approximately correct confidence levels, as we shall see in the next section.
If we doubt the normality of the data and we do not have a large sample, usu-
ally the best thing to do is to bootstrap. Suppose we have a dataset x1, . . . , xn,
modeled as a realization of a random sample from some distribution F, and
we want to construct a confidence interval for its (unknown) expectation µ.
In the previous section we saw that it suffices to find numbers cl and cu such that

P( cl < (X̄n − µ)/(Sn/√n) < cu ) = 1 − α.

The 100(1 − α)% confidence interval would then be

( x̄n − cu · sn/√n , x̄n − cl · sn/√n ),

where, of course, x̄n and sn are the sample mean and the sample standard deviation. To find cl and cu we need to know the distribution of the studentized mean

T = (X̄n − µ) / (Sn/√n).
We apply the bootstrap principle. From the data x1, . . . , xn we determine an estimate F̂ of F. Let X1*, . . . , Xn* be a random sample from F̂, with µ* = E[Xi*], and consider

T* = (X̄n* − µ*) / (Sn*/√n).

The distribution of T* is now used as an approximation to the distribution of T. If we use F̂ = Fn, we get the following.
Empirical bootstrap simulation for the studentized mean. Given a dataset x1, x2, . . . , xn, determine its empirical distribution function Fn as an estimate of F. The expectation corresponding to Fn is µ* = x̄n.
1. Generate a bootstrap dataset x1*, x2*, . . . , xn* from Fn.
2. Compute the studentized mean for the bootstrap dataset:

   t* = (x̄n* − x̄n) / (sn*/√n),

   where x̄n* and sn* are the sample mean and sample standard deviation of x1*, x2*, . . . , xn*.
Repeat steps 1 and 2 many times.
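A minimal Python sketch of this procedure, not part of the original text (NumPy assumed). It estimates cl* and cu* from the simulated t* values by empirical quantiles, which is close to, though not identical with, the order-statistic rule used in the example below; the exponential data at the end are made up, purely to have something to feed the function:

import numpy as np

rng = np.random.default_rng(5)

def studentized_bootstrap_ci(x, alpha=0.10, B=1000):
    # empirical bootstrap confidence interval for mu, following the procedure above
    n = len(x)
    x_bar, s = x.mean(), x.std(ddof=1)
    t_star = np.empty(B)
    for i in range(B):
        xs = rng.choice(x, size=n, replace=True)                         # step 1: sample from F_n
        t_star[i] = (xs.mean() - x_bar) / (xs.std(ddof=1) / np.sqrt(n))  # step 2
    cl, cu = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])             # c_l* and c_u*
    return x_bar - cu * s / np.sqrt(n), x_bar - cl * s / np.sqrt(n)

data = rng.exponential(scale=1.0, size=40)   # illustrative skewed data
print(studentized_bootstrap_ci(data))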
From the bootstrap experiment we can determine cl* and cu* such that

P( cl* < (X̄n* − µ*)/(Sn*/√n) < cu* ) ≈ 1 − α.

By the bootstrap principle we may transfer this statement about the distribution of T* to the distribution of T. That is, we may use these estimated critical values as bootstrap approximations to cl and cu:

cl ≈ cl*   and   cu ≈ cu*.
Therefore, we call

( x̄n − cu* · sn/√n , x̄n − cl* · sn/√n )

a 100(1 − α)% bootstrap confidence interval for µ.
Example: the software data
Recall the software data, a dataset of interfailure times (see Section 17.3).
From the nature of the data—failure times are positive numbers—and the
histogram (Figure 17.5), we know that they should not be modeled as a real-
ization of a random sample from a normal distribution. From the data we know
x̄n = 656.88, sn = 1037.3, and n = 135. We generate one thousand bootstrap datasets, and for each dataset we compute t* as in step 2 of the procedure. The histogram and empirical distribution function made from these one thousand values are estimates of the density and the distribution function, respectively, of the bootstrap sample statistic T*; see Figure 23.4.
Fig. 23.4. Histogram and empirical distribution function of the studentized boot-
strap simulation results for the software data.
We want to make a 90% bootstrap confidence interval, so we need c∗l and c∗u, or the 0.05th and 0.95th quantile from the empirical distribution function in Figure 23.4. The 50th order statistic of the one thousand t∗ values is −2.107. This means that 50 out of the one thousand values, or 5%, are smaller than or equal to this value, and so c∗l = −2.107. Similarly, from the 951st order statistic, 1.389, we obtain c∗u = 1.389. (These results deviate slightly from the definition of empirical quantiles as given in Section 16.3; that method is a little more accurate.) Inserting these values, we find the following 90% bootstrap confidence interval for µ:
    ( 656.88 − 1.389 · 1037.3/√135 , 656.88 − (−2.107) · 1037.3/√135 ) = (532.9, 845.0).
Quick exercise 23.6 The 25th and 976th order statistic from the preceding
bootstrap results are −2.443 and 1.713, respectively. Use these numbers to
construct a confidence interval for µ. What is the corresponding confidence
level?
Why the bootstrap may be better
The reason to use the bootstrap is that it should lead to a more accurate approximation of the distribution of the studentized mean than the t(n − 1) distribution that follows from assuming normality. If, in the previous example, we thought we had normal data, we would use critical values from the t(134) distribution: t134,0.05 = 1.656. The result would be
    ( 656.88 − 1.656 · 1037.3/√135 , 656.88 + 1.656 · 1037.3/√135 ) = (509.0, 804.7).
Comparing the intervals, we see that here the bootstrap interval is a little
larger and, as opposed to the t-interval, not centered around the sample mean
but skewed to the right side. This is one of the features of the bootstrap:
if the distribution from which the data originate is skewed, this is reflected
in the confidence interval. Looking at the histogram of the software data
(Figure 17.5), we see that it is skewed to the right: it has a long tail on the
right, but not on the left, so the same most likely holds for the distribution
from which these data originate. The skewness is reflected in the confidence
interval, which extends more to the right of x̄n than to the left. In some sense,
the bootstrap adapts to the shape of the distribution, and in this way it leads
to more accurate confidence statements than using the method for normal
data. What we mean by this is that, for example, with the normal method
only 90% of the 95% confidence statements would actually cover the true
value, whereas for the bootstrap intervals this percentage would be close(r)
to 95%.
23.4 Large samples
A variant of the central limit theorem states that as n goes to infinity, the
distribution of the studentized mean
    (X̄n − µ) / (Sn/√n)
approaches the standard normal distribution. This fact is the basis for so-
called large sample confidence intervals. Suppose X1, . . . , Xn is a random
sample from some distribution F with expectation µ. If n is large enough,
we may use
    P( −zα/2 < (X̄n − µ)/(Sn/√n) < zα/2 ) ≈ 1 − α.        (23.5)
This implies that if x1, . . . , xn can be seen as a realization of a random sample
from some unknown distribution with expectation µ and if n is large enough,
then
    ( x̄n − zα/2 · sn/√n , x̄n + zα/2 · sn/√n )
is an approximate 100(1 − α)% confidence interval for µ.
Just as earlier with the central limit theorem, a key question is “how big
should n be?” Again, there is no easy answer. To give you some idea, we have
listed in Table 23.3 the results of a small simulation experiment. For each of
the distributions, sample sizes, and confidence levels listed, we constructed
10 000 confidence intervals with the large sample method; the numbers listed
in the table are the confidence levels as estimated from the simulation, the
coverage probabilities. The chosen Pareto distribution is very skewed, and this
shows; the coverage probabilities for the exponential are just a few percent
off.
Table 23.3. Estimated coverage probabilities for large sample confidence intervals for non-normal data.

    Distribution    n     γ = 0.900   γ = 0.950
    Exp(1)          20      0.851       0.899
    Exp(1)         100      0.890       0.938
    Par(2.1)        20      0.727       0.774
    Par(2.1)       100      0.798       0.849
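The following sketch (not from the book) indicates how numbers such as those in Table 23.3 can be produced. It assumes that Par(α) denotes the Pareto distribution with distribution function 1 − x^(−α) for x ≥ 1, so that its expectation is α/(α − 1); the expectation of Exp(1) is 1.

    import numpy as np

    def coverage(sampler, mu, n, z, reps=10_000, seed=1):
        """Fraction of large sample intervals that cover the true expectation mu."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(reps):
            x = sampler(rng, n)
            half = z * x.std(ddof=1) / np.sqrt(n)       # z_{alpha/2} * s_n / sqrt(n)
            hits += (x.mean() - half < mu < x.mean() + half)
        return hits / reps

    exp1 = lambda rng, n: rng.standard_exponential(n)    # Exp(1)
    par21 = lambda rng, n: rng.random(n) ** (-1 / 2.1)   # Par(2.1) by inverse transform

    # gamma = 0.95 uses z_{0.025} = 1.96; gamma = 0.90 would use 1.645.
    print(coverage(exp1, 1.0, 20, 1.96))          # compare with 0.899 in Table 23.3
    print(coverage(par21, 2.1 / 1.1, 100, 1.96))  # compare with 0.849 in Table 23.3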
In the case of simulation one can often quite easily generate a very large
number of independent repetitions, and then this question poses no problem.
In other cases there may be nothing better to do than hope that the dataset
is large enough. We give an example where (we believe!) this is definitely the
case.
In an article published in 1910 ([28]), Rutherford and Geiger reported their
observations on the radioactive decay of the element polonium. Using a small
disk coated with polonium they counted the number of emitted alpha-particles
during 2608 intervals of 7.5 seconds each. The dataset consists of the counted
number of alpha-particles for each of the 2608 intervals and can be summarized
as in Table 23.4.
Table 23.4. Alpha-particle counts for 2608 intervals of 7.5 seconds.

    Count      0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
    Frequency 57  203  383  525  532  408  273  139   45   27   10    4    0    1    1

Source: E. Rutherford and H. Geiger (with a note by H. Bateman), The probability variations in the distribution of α particles, Phil. Mag., 6: 698–704, 1910; the table on page 701.
The total number of counted alpha-particles is 10 097; the average number
per interval is therefore 3.8715. The sample standard deviation can also
be computed from the table; it is 1.9225. So we know of the actual data
x1, x2, . . . , x2608 (where the counts xi are between 0 and 14) that x̄n = 3.8715
and sn = 1.9225. We construct a 98% confidence interval for the expected
number of particles per interval. As z0.01 = 2.33 this results in
    ( 3.8715 − 2.33 · 1.9225/√2608 , 3.8715 + 2.33 · 1.9225/√2608 ) = (3.784, 3.959).
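The computation is easily reproduced from the frequency table; a sketch (not from the book, numpy assumed):

    import numpy as np

    counts = np.arange(15)
    freq = np.array([57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1])

    n = freq.sum()                                                # 2608 intervals
    xbar = (counts * freq).sum() / n                              # 3.8715
    s = np.sqrt(((counts - xbar) ** 2 * freq).sum() / (n - 1))    # 1.9225

    z = 2.33                                                      # z_{0.01}
    print(xbar - z * s / np.sqrt(n), xbar + z * s / np.sqrt(n))   # about (3.784, 3.959)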
23.5 Solutions to the quick exercises
23.1 From the probability statement, we derive, using σT = 100 and 8/9 =
0.889:
θ ∈ (T − 300, T + 300) with probability at least 88%.
With t = 299 852.4, this becomes
θ ∈ (299 552.4, 300 152.4) with confidence at least 88%.
23.2 Chebyshev’s inequality only gives an upper bound. The actual value
of P(|T − θ| < 2σT) could be higher than 3/4, depending on the distribution
of T . For example, in Quick exercise 13.2 we saw that in case of an exponen-
tial distribution this probability is 0.865. For other distributions, even higher
values are attained; see Exercise 13.1.
23.3 For each of the confidence intervals we have a 5% probability that
it is wrong. Therefore, the number of wrong confidence intervals has a
Bin(40, 0.05) distribution, and we would expect about 40 · 0.05 = 2 to be
wrong. The standard deviation of this distribution is √(40 · 0.05 · 0.95) = 1.38.
The outcome “10 confidence intervals wrong” is (10 − 2)/1.38 = 5.8 standard
deviations from the expectation and would be a surprising outcome indeed.
(The probability of 10 or more wrong is 0.00002.)
23.4 We need to solve P(Z ≥ a) = 0.01. In Table B.1 we find P(Z ≥ 2.33) =
0.0099 ≈ 0.01, so z0.01 ≈ 2.33. For z0.95 we need to solve P(Z ≥ a) = 0.95,
and because this is in the left tail of the distribution, we use z0.95 = −z0.05.
In the table we read P(Z ≥ 1.64) = 0.0505 and P(Z ≥ 1.65) = 0.0495, from
which we conclude z0.05 ≈ (1.64 + 1.65)/2 = 1.645 and z0.95 ≈ −1.645.
23.5 In Table B.1 we find P(T3 ≥ 4.541) = 0.01, so t3,0.01 = 4.541. For
t35,0.9975, we need to use t35,0.9975 = −t35,0.0025. In the table we find t30,0.0025 =
3.030 and t40,0.0025 = 2.971, and by interpolation t35,0.0025 ≈ (3.030 +
2.971)/2 = 3.0005. Hence, t35,0.9975 ≈ −3.000.
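Instead of reading Table B.1 and interpolating, such critical values can also be computed directly; a sketch (not from the book, scipy assumed):

    from scipy.stats import norm, t

    print(norm.ppf(0.99))      # z_{0.01}, about 2.33
    print(norm.ppf(0.95))      # z_{0.05}, about 1.645
    print(t.ppf(0.99, 3))      # t_{3,0.01}, about 4.541
    print(-t.ppf(0.0025, 35))  # t_{35,0.0025}, close to the interpolated value 3.00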
23.6 The order statistics are estimates for c∗0.025 and c∗0.975, respectively. So the corresponding α is 0.05, and the 95% bootstrap confidence interval for µ is:
    ( 656.88 − 1.713 · 1037.3/√135 , 656.88 − (−2.443) · 1037.3/√135 ) = (504.0, 875.0).
23.6 Exercises
23.1  A bottling machine is known to fill wine bottles with amounts that
follow an N(µ, σ²) distribution, with σ = 5 (ml). In a sample of 16 bottles,
x̄ = 743 (ml) was found. Construct a 95% confidence interval for µ.
23.2  You are given a dataset that may be considered a realization of a
normal random sample. The size of the dataset is 34, the average is 3.54, and
the sample standard deviation is 0.13. Construct a 98% confidence interval
for the unknown expectation µ.
23.3 You have ordered 10 bags of cement, which are supposed to weigh 94 kg
each. The average weight of the 10 bags is 93.5 kg. Assuming that the 10
weights can be viewed as a realization of a random sample from a normal
distribution with unknown parameters, construct a 95% confidence interval
for the expected weight of a bag. The sample standard deviation of the 10
weights is 0.75.
23.4 A new type of car tire is launched by a tire manufacturer. The auto-
mobile association performs a durability test on a random sample of 18 of
these tires. For each tire the durability is expressed as a percentage: a score
of 100 (%) means that the tire lasted exactly as long as the average standard
tire, an accepted comparison standard. From the multitude of factors that in-
fluence the durability of individual tires the assumption is warranted that the
durability of an arbitrary tire follows an N(µ, σ²) distribution. The parameters µ and σ² characterize the tire type, and µ could be called the durability
index for this type of tire. The automobile association found for the tested
tires: x̄18 = 195.3 and s18 = 16.7. Construct a 95% confidence interval for µ.
23.5  During the 2002 Winter Olympic Games in Salt Lake City a newspaper
article mentioned the alleged advantage speed-skaters have in the 1500 m race
if they start in the outer lane. In the men’s 1500m, there were 24 races, but
in race 13 (really!) someone fell and did not finish. The results in seconds of
the remaining 23 races are listed in Table 23.5. You should know that who
races against whom, in which race, and who starts in the outer lane are all
determined by a fair lottery.
Table 23.5. Speed-skating results in seconds, men’s 1500 m (except race 13), 2002
Winter Olympic Games.
Race number   Inner lane   Outer lane   Difference
1 107.04 105.98 1.06
2 109.24 108.20 1.04
3 111.02 108.40 2.62
4 108.02 108.58 −0.56
5 107.83 105.51 2.32
6 109.50 112.01 −2.51
7 111.81 112.87 −1.06
8 111.02 106.40 4.62
9 106.04 104.57 1.47
10 110.15 110.70 −0.55
11 109.42 109.45 −0.03
12 108.13 109.57 −1.44
14 105.86 105.97 −0.11
15 108.27 105.63 2.64
16 107.63 105.41 2.22
17 107.72 110.26 −2.54
18 106.38 105.82 0.56
19 107.78 106.29 1.49
20 108.57 107.26 1.31
21 106.99 103.95 3.04
22 107.21 106.00 1.21
23 105.34 105.26 0.08
24 108.76 106.75 2.01
Mean 108.25 107.43 0.82
St.dev. 1.70 2.42 1.78
a. As a consequence of the lottery and the fact that many different factors
contribute to the actual time difference “inner lane minus outer lane” the
assumption of a normal distribution for the difference is warranted. The
numbers in the last column can be seen as realizations from an N(δ, σ²) distribution, where δ is the expected outer lane advantage. Construct a 95% confidence interval for δ. N.B. n = 23, not 24!
b. You decide to make a bootstrap confidence interval instead. Describe the
appropriate bootstrap experiment.
c. The bootstrap experiment was performed with one thousand repetitions.
Part of the bootstrap outcomes are listed in the following table. From the
ordered list of results, numbers 21 to 60 and 941 to 980 are given. Use
these to construct a 95% bootstrap confidence interval for δ.
21–25 −2.202 −2.164 −2.111 −2.109 −2.101
26–30 −2.099 −2.006 −1.985 −1.967 −1.929
31–35 −1.917 −1.898 −1.864 −1.830 −1.808
36–40 −1.800 −1.799 −1.774 −1.773 −1.756
41–45 −1.736 −1.732 −1.731 −1.717 −1.716
46–50 −1.699 −1.692 −1.691 −1.683 −1.666
51–55 −1.661 −1.644 −1.638 −1.637 −1.620
56–60 −1.611 −1.611 −1.601 −1.600 −1.593
941–945 1.648 1.667 1.669 1.689 1.696
946–950 1.708 1.722 1.726 1.735 1.814
951–955 1.816 1.825 1.856 1.862 1.864
956–960 1.875 1.877 1.897 1.905 1.917
961–965 1.923 1.948 1.961 1.987 2.001
966–970 2.015 2.015 2.017 2.018 2.034
971–975 2.035 2.037 2.039 2.053 2.060
976–980 2.088 2.092 2.101 2.129 2.143
23.6  A dataset x1, x2, . . . , xn is given, modeled as a realization of a sam-
ple X1, X2, . . . , Xn from an N(µ, 1) distribution. Suppose there are sample
statistics Ln = g(X1, . . . , Xn) and Un = h(X1, . . . , Xn) such that
P(Ln < µ < Un) = 0.95
for every value of µ. Suppose that the corresponding 95% confidence interval
derived from the data is (ln, un) = (−2, 5).
a. Suppose θ = 3µ + 7. Let L̃n = 3Ln + 7 and Ũn = 3Un + 7. Show that
    P( L̃n < θ < Ũn ) = 0.95.
b. Write the 95% confidence interval for θ in terms of ln and un.
c. Suppose θ = 1 − µ. Again, find L̃n and Ũn, as well as the confidence
interval for θ.
d. Suppose θ = µ². Can you construct a confidence interval for θ?
23.7  A 95% confidence interval for the parameter µ of a Pois(µ) distri-
bution is given: (2, 3). Let X be a random variable with this distribution.
Construct a 95% confidence interval for P(X = 0) = e^(−µ).
23.8 Suppose that in Exercise 23.1 the content of the bottles has to be de-
termined by weighing. It is known that the wine bottles involved weigh on
average 250 grams, with a standard deviation of 15 grams, and the weights
follow a normal distribution. For a sample of 16 bottles, an average weight of
998 grams was found. You may assume that 1 ml of wine weighs 1 gram, and
that the filling amount is independent of the bottle weight. Construct a 95%
confidence interval for the expected amount of wine per bottle, µ.
23.9 Consider the alpha-particle counts discussed in Section 23.4; the data
are given in Table 23.4. We want to bootstrap in order to make a bootstrap
confidence interval for the expected number of particles in a 7.5-second inter-
val.
a. Describe in detail how you would perform the bootstrap simulation.
b. The bootstrap experiment was performed with one thousand repetitions.
Part of the (ordered) bootstrap t∗
’s are given in the following table. Con-
struct the 95% bootstrap confidence interval for the expected number of
particles in a 7.5-second interval.
1–5 −2.996 −2.942 −2.831 −2.663 −2.570
6–10 −2.537 −2.505 −2.290 −2.273 −2.228
11–15 −2.193 −2.112 −2.092 −2.086 −2.045
16–20 −1.983 −1.980 −1.978 −1.950 −1.931
21–25 −1.920 −1.910 −1.893 −1.889 −1.888
26–30 −1.865 −1.864 −1.832 −1.817 −1.815
31–35 −1.755 −1.751 −1.749 −1.746 −1.744
36–40 −1.734 −1.723 −1.710 −1.708 −1.705
41–45 −1.703 −1.700 −1.696 −1.692 −1.691
46–50 −1.691 −1.675 −1.660 −1.656 −1.650
951–955 1.635 1.638 1.643 1.648 1.661
956–960 1.666 1.668 1.678 1.681 1.686
961–965 1.692 1.719 1.721 1.753 1.772
966–970 1.773 1.777 1.806 1.814 1.821
971–975 1.824 1.826 1.837 1.838 1.845
976–980 1.862 1.877 1.881 1.883 1.956
981–985 1.971 1.992 2.060 2.063 2.083
986–990 2.089 2.177 2.181 2.186 2.224
991–995 2.234 2.264 2.273 2.310 2.348
996–1000 2.483 2.556 2.870 2.890 3.546
c. Answer this without doing any calculations: if we made the 98% boot-
strap confidence interval, would it be smaller or larger than the interval
constructed in Section 23.4?
23.10 In a report you encounter a 95% confidence interval (1.6, 7.8) for the
parameter µ of an N(µ, σ²) distribution. The interval is based on 16 observa-
tions, constructed according to the studentized mean procedure.
a. What is the mean of the (unknown) dataset?
b. You prefer to have a 99% confidence interval for µ. Construct it.
23.11  A 95% confidence interval for the unknown expectation of some
distribution contains the number 0.
a. We construct the corresponding 98% confidence interval, using the same
data. Will it contain the number 0?
b. The confidence interval in fact is a bootstrap confidence interval. We re-
peat the bootstrap experiment (using the same data) and construct a new
95% confidence interval based on the results. Will it contain the number 0?
c. We collect new data, resulting in a dataset of the same size. With this data,
we construct a 95% confidence interval for the unknown expectation. Will
the interval contain 0?
23.12 Let Z1, . . . , Zn be a random sample from an N(0, 1) distribution. Define
Xi = µ + σZi for i = 1, . . . , n and σ > 0. Let Z̄, X̄ denote the sample averages and SZ and SX the sample standard deviations of the Zi and Xi, respectively.
a. Show that X1, . . . , Xn is a random sample from an N(µ, σ²) distribution.
b. Express X̄ and SX in terms of Z̄, SZ , µ, and σ.
c. Verify that
    (X̄ − µ) / (SX/√n) = Z̄ / (SZ/√n),
and explain why this shows that the distribution of the studentized mean
does not depend on µ and σ.
24
More on confidence intervals
While in Chapter 23 we were solely concerned with confidence intervals for
expectations, in this chapter we treat a variety of topics. First, we focus on
confidence intervals for the parameter p of the binomial distribution. Then,
based on an example, we briefly discuss a general method to construct confi-
dence intervals. One-sided confidence intervals, or upper and lower confidence
bounds, are discussed next. At the end of the chapter we investigate the ques-
tion of how to determine the sample size when a confidence interval of a certain
width is desired.
24.1 The probability of success
A common situation is that we observe a random variable X with a Bin(n, p)
distribution and use X to estimate p. For example, if we want to estimate
the proportion of voters that support candidate G in an election, we take a
sample from the voter population and determine the proportion in the sample
that supports G. If n individuals are selected at random from the population,
where a proportion p supports candidate G, the number of supporters X in
the sample is modeled by a Bin(n, p) distribution; we count the supporters of
candidate G as “successes.” Usually, the sample proportion X/n is taken as
an estimator for p.
If we want to make a confidence interval for p, based on the number of suc-
cesses X in the sample, we need to find statistics L and U (see the definition
of confidence intervals on page 343) such that
P(L < p < U) = 1 − α,
where L and U are to be based on X only. In general, this problem does
not have a solution. However, the method for large n described next, some-
times called “the Wilson method” (see [40]), yields confidence intervals with
confidence level approximately 100(1 − α)%. (How close the true confidence
level is to 100(1 − α)% depends on the (unknown) p, though it is known that
for p near 0 and 1 it is too low. For some details and an alternative for this
situation, see Remark 24.1.)
Recall the normal approximation to the binomial distribution, a consequence
of the central limit theorem (see page 201 and Exercise 14.5): for large n, the
distribution of X is approximately normal and
    (X − np) / √(np(1 − p))
is approximately standard normal. By dividing by n in both the numerator and the denominator, we see that this equals:
    (X/n − p) / √(p(1 − p)/n).
Therefore, for large n
    P( −zα/2 < (X/n − p) / √(p(1 − p)/n) < zα/2 ) ≈ 1 − α.
Note that the event
    −zα/2 < (X/n − p) / √(p(1 − p)/n) < zα/2
is the same as
    ( (X/n − p) / √(p(1 − p)/n) )² < (zα/2)²
or
    (X/n − p)² − (zα/2)² · p(1 − p)/n < 0.
To derive expressions for L and U we can rewrite the inequality in this state-
ment to obtain the form L < p < U, but the resulting formulas are rather
awkward. To obtain the confidence interval, we instead substitute the data
values directly and then solve for p, which yields the desired result.
Suppose, in a sample of 125 voters, 78 support one candidate. What is the 95%
confidence interval for the population proportion p supporting that candidate?
The realization of X is x = 78 and n = 125. We substitute this, together with
zα/2 = z0.025 = 1.96, in the last inequality:
    ( 78/125 − p )² − ( (1.96)²/125 ) · p(1 − p) < 0,
Fig. 24.1. The parabola 1.0307 p² − 1.2787 p + 0.3894 and the resulting confidence interval.
or, working out squares and products and grouping terms:
    1.0307 p² − 1.2787 p + 0.3894 < 0.
This quadratic form describes a parabola, which is depicted in Figure 24.1.
For other values of n and x the result is always a quadratic inequality of this form, with a positive coefficient for p² and a similar picture. For the confidence
interval we need to find the values where the parabola intersects the horizontal
axis. The solutions we find are:
    p1,2 = [ −(−1.2787) ± √( (−1.2787)² − 4 · 1.0307 · 0.3894 ) ] / ( 2 · 1.0307 ) = 0.6203 ± 0.0835;
hence, l = 0.54 and u = 0.70, so the resulting confidence interval is (0.54, 0.70).
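The same computation is easy to automate. The sketch below (not from the book) builds the quadratic inequality from x, n, and zα/2 and returns its two roots as the endpoints of the confidence interval.

    import numpy as np

    def wilson_interval(x, n, z):
        """Solve (x/n - p)^2 - z^2 p(1 - p)/n <= 0 for p."""
        phat = x / n
        a = 1 + z**2 / n               # coefficient of p^2
        b = -(2 * phat + z**2 / n)     # coefficient of p
        c = phat**2                    # constant term
        d = np.sqrt(b**2 - 4 * a * c)
        return (-b - d) / (2 * a), (-b + d) / (2 * a)

    print(wilson_interval(78, 125, 1.96))    # about (0.54, 0.70)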
Quick exercise 24.1 Suppose in another election we find 80 supporters in a
sample of 200. Suppose we use α = 0.0456 for which zα/2 = 2. Construct the
corresponding confidence interval for p.
Remark 24.1 (Coverage probabilities and an alternative method).
Because of the discrete nature of the binomial distribution, the probabil-
ity that the confidence interval covers the true parameter value depends
on p. As a function of p it typically oscillates in a sawtooth-like manner
around 1 − α, being too high for some values and too low for others. This cannot be avoided; the phenomenon is present in every method. In an average sense, the method treated in the text yields
coverage probabilities close to 1 − α, though for arbitrarily high values of n
it is possible to find p’s for which the actual coverage is several percentage
points too low. The low coverage occurs for p’s near 0 and 1.
An alternative is the method proposed by Agresti and Coull, which overall
is more conservative than the Wilson method (in fact, the Agresti-Coull
interval contains the Wilson interval as a proper subset). Especially for p
near 0 or 1 this method yields conservative confidence intervals. Define
    X̃ = X + (zα/2)²/2   and   ñ = n + (zα/2)²,
and p̃ = X̃/ñ. The approximate 100(1 − α)% confidence interval is then given by
    ( p̃ − zα/2 √( p̃(1 − p̃)/ñ ) , p̃ + zα/2 √( p̃(1 − p̃)/ñ ) ).
For a clear survey paper on confidence intervals for p we recommend Brown
et al. [4].
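For comparison, a sketch of the Agresti-Coull interval described in the remark (again not from the book):

    import numpy as np

    def agresti_coull_interval(x, n, z):
        x_tilde = x + z**2 / 2
        n_tilde = n + z**2
        p_tilde = x_tilde / n_tilde
        half = z * np.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
        return p_tilde - half, p_tilde + half

    print(agresti_coull_interval(78, 125, 1.96))   # very close to the Wilson interval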
24.2 Is there a general method?
We have now seen a number of examples of confidence intervals, and while it
should be clear to you that in each of these cases the resulting intervals are
valid confidence intervals, you may wonder how we go about finding confidence
intervals in new situations. One could ask: is there a general method? We first
consider an example.
A confidence interval for the minimum lifetime
Suppose we have a random sample X1, . . . , Xn from a shifted exponential
distribution, that is, Xi = δ + Yi, where Y1, . . . , Yn are a random sample from
an Exp(1) distribution. This type of random variable is sometimes used to
model lifetimes; a minimum lifetime is guaranteed, but otherwise the lifetime
has an exponential distribution. The unknown parameter δ represents the
minimum lifetime, and the probability density of the Xi is positive only for
values greater than δ.
To derive information about δ it is natural to use the smallest observed value
T = min{X1, . . . , Xn}. This is also the maximum likelihood estimator for δ;
see Exercise 21.6. Writing
T = min{δ + Y1, . . . , δ + Yn} = δ + min{Y1, . . . , Yn}
and observing that M = min{Y1, . . . , Yn} has an Exp(n) distribution (see
Exercise 8.18), we find for the distribution function of T: FT(a) = 0 for a < δ and
    FT(a) = P(T ≤ a) = P(δ + M ≤ a) = P(M ≤ a − δ) = 1 − e^(−n(a−δ))   for a ≥ δ.        (24.1)
Next, we solve
    P(cl < T < cu) = 1 − α
by requiring
    P(T ≤ cl) = P(T ≥ cu) = α/2.
Using (24.1) we find the following equations:
    1 − e^(−n(cl−δ)) = α/2   and   e^(−n(cu−δ)) = α/2,
whose solutions are
    cl = δ − (1/n) ln(1 − α/2)   and   cu = δ − (1/n) ln(α/2).
Both cl and cu are values larger than δ, because the logarithms are negative.
We have found that, whatever the value of δ:
    P( δ − (1/n) ln(1 − α/2) < T < δ − (1/n) ln(α/2) ) = 1 − α.
By rearranging the inequalities, we see this is equivalent to
    P( T + (1/n) ln(α/2) < δ < T + (1/n) ln(1 − α/2) ) = 1 − α,
and therefore a 100(1 − α)% confidence interval for δ is given by
    ( t + (1/n) ln(α/2) , t + (1/n) ln(1 − α/2) ).        (24.2)
For α = 0.05 this becomes:
    ( t − 3.69/n , t − 0.0253/n ).
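In code, interval (24.2) takes only a few lines; the sketch below (not from the book) computes it from the sample minimum. For α = 0.05 the two logarithms are ln 0.025 ≈ −3.69 and ln 0.975 ≈ −0.0253, which is where the numbers above come from.

    import numpy as np

    def min_lifetime_ci(data, alpha=0.05):
        """Interval (24.2) for the shift delta of a shifted Exp(1) sample."""
        t = np.min(data)
        n = len(data)
        return t + np.log(alpha / 2) / n, t + np.log(1 - alpha / 2) / n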
Quick exercise 24.2 Suppose you have a dataset of size 15 from a shifted
Exp(1) distribution, whose minimum value is 23.5. What is the 99% confidence
interval for δ?
Looking back at the example, we see that the confidence interval could be
constructed because we know that T −δ = M has an exponential distribution.
There are many more examples of this type: some function g(T, θ) of a sample
statistic T and the unknown parameter θ has a known distribution. However,
this still does not cover all the ways to construct confidence intervals (see also
the following remark).
Remark 24.2 (About a general method). Suppose X1, . . . , Xn is a
random sample from some distribution depending on some unknown pa-
rameter θ and let T be a sample statistic. One possible choice is to select
a T that is an estimator for θ, but this is not necessary. In each case, the
distribution of T depends on θ, just as that of X1, . . . , Xn does. In some
cases it might be possible to find functions g(θ) and h(θ) such that
    P( g(θ) < T < h(θ) ) = 1 − α   for every value of θ.        (24.3)
If this is so, then confidence statements about θ can be made. In more special cases, for example if g and h are strictly increasing, the inequalities g(θ) < T < h(θ) can be rewritten as
    h⁻¹(T) < θ < g⁻¹(T),
and then (24.3) is equivalent to
    P( h⁻¹(T) < θ < g⁻¹(T) ) = 1 − α   for every value of θ.
Checking with the confidence interval definition, we see that the last statement implies that (h⁻¹(t), g⁻¹(t)) is a 100(1 − α)% confidence interval for θ.
24.3 One-sided confidence intervals
Suppose you are in charge of a power plant that generates and sells electricity,
and you are about to buy a shipment of coal, say a shipment of the Daw Mill
coal identified as 258GB41 earlier. You plan to buy the shipment if you are
confident that the gross calorific content exceeds 31.00 MJ/kg. At the end of
Section 23.2 we obtained for the gross calorific content the 95% confidence
interval (30.946, 31.067): based on the data we are 95% confident that the
gross calorific content is higher than 30.946 and lower than 31.067.
In the present situation, however, we are only interested in the lower bound:
we would prefer a confidence statement of the type “we are 95% confident
that the gross calorific content exceeds 31.00.” Modifying equation (23.4) we
find
    P( (X̄n − µ) / (Sn/√n) < tn−1,α ) = 1 − α,
which is equivalent to
    P( X̄n − tn−1,α · Sn/√n < µ ) = 1 − α.
We conclude that
    ( x̄n − tn−1,α · sn/√n , ∞ )
is a 100(1 − α)% one-sided confidence interval for µ. For the Daw Mill coal, using α = 0.05, with t21,0.05 = 1.721 this results in:
    ( 31.012 − 1.721 · 0.1294/√22 , ∞ ) = (30.964, ∞).
We see that because “all uncertainty may be put on one side,” the lower
bound in the one-sided interval is higher than that in the two-sided one,
though still below 31.00. Other situations may require a confidence upper
bound. For example, if the calorific value is below a certain number you can
try to negotiate a lower price.
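A minimal sketch of such a one-sided interval (not from the book; scipy assumed for the t quantile):

    import numpy as np
    from scipy.stats import t

    def lower_confidence_bound(xbar, s, n, alpha=0.05):
        """100(1 - alpha)% lower confidence bound for mu, normal data."""
        return xbar - t.ppf(1 - alpha, n - 1) * s / np.sqrt(n)

    print(lower_confidence_bound(31.012, 0.1294, 22))   # about 30.964, as above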
The definition of confidence intervals (page 343) can be extended to include
one-sided confidence intervals as well. If we have a sample statistic Ln such
that
P(Ln < θ) = γ
for every value of the parameter of interest θ, then
(ln, ∞)
is called a 100γ% one-sided confidence interval for θ. The number ln is
sometimes called a 100γ% lower confidence bound for θ. Similarly, Un with
P(θ < Un) = γ for every value of θ, yields the one-sided confidence interval
(−∞, un), and un is called a 100γ% upper confidence bound.
Quick exercise 24.3 Determine the 99% upper confidence bound for the
gross calorific value of the Daw Mill coal.
24.4 Determining the sample size
The narrower the confidence interval the better (why?). As a general prin-
ciple, we know that more accurate statements can be made if we have more
measurements. Sometimes, an accuracy requirement is set, even before data
are collected, and the corresponding sample size is to be determined. We pro-
vide an example of how to do this and note that this generally can be done,
but the actual computation varies with the type of confidence interval.
Consider the question of the calorific content of coal once more. We have a
shipment of coal to test and we want to obtain a 95% confidence interval,
but it should not be wider than 0.05 MJ/kg, i.e., the lower and upper bound
should not differ by more than 0.05. How many measurements do we need?
We answer this question for the case when ISO method 1928 is used, whence
we may assume that measurements are normally distributed with standard
deviation σ = 0.1. When the desired confidence level is 1 − α, the width of
the confidence interval will be
    2 · zα/2 · σ/√n.
Requiring that this is at most w means finding the smallest n that satisfies
    2zα/2 · σ/√n ≤ w
or
    n ≥ ( 2zα/2σ / w )².
For the example: w = 0.05, σ = 0.1, and z0.025 = 1.96; so
    n ≥ ( 2 · 1.96 · 0.1 / 0.05 )² = 61.4,
that is, we should perform at least 62 measurements.
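A sketch of this sample-size computation (not from the book):

    import math

    def sample_size(sigma, width, z):
        """Smallest n with 2 * z * sigma / sqrt(n) <= width."""
        return math.ceil((2 * z * sigma / width) ** 2)

    print(sample_size(0.1, 0.05, 1.96))   # 62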
In case σ is unknown, we somehow have to estimate it, and then the method
can only give an indication of the required sample size. The standard deviation
as we (afterwards) estimate it from the data may turn out to be quite different,
and the obtained confidence interval may be smaller or larger than intended.
Quick exercise 24.4 What is the required sample size if we want the 99%
confidence interval to be 0.05 MJ/kg wide?
24.5 Solutions to the quick exercises
24.1 We need to solve
    ( 80/200 − p )² − ( 2²/200 ) · p(1 − p) < 0,   or   1.02 p² − 0.82 p + 0.16 < 0.
The solutions are:
    p1,2 = [ −(−0.82) ± √( (−0.82)² − 4 · 1.02 · 0.16 ) ] / ( 2 · 1.02 ) = 0.4020 ± 0.0686,
so the confidence interval is (0.33, 0.47).
24.2 We should substitute n = 15, t = 23.5, and α = 0.01 into:
    ( t + (1/n) ln(α/2) , t + (1/n) ln(1 − α/2) ),
which yields
    ( 23.5 − 5.30/15 , 23.5 − 0.0050/15 ) = (23.1467, 23.4997).
24.3 The upper confidence bound is given by
    un = x̄n + t21,0.01 · sn/√22,
where x̄n = 31.012, t21,0.01 = 2.518, and sn = 0.1294. Substitution yields un = 31.081.
24.4 The confidence level changes to 99%, so we use z0.005 = 2.576 instead
of 1.96 in the computation:
    n ≥ ( 2 · 2.576 · 0.1 / 0.05 )² = 106.2,
so we need at least 107 measurements.
24.6 Exercises
24.1  Of a series of 100 (independent and identical) chemical experiments,
70 were concluded successfully. Construct a 90% confidence interval for the
success probability of this type of experiment.
24.2 In January 2002 the Euro was introduced and soon after stories started
to circulate that some of the Euro coins would not be fair coins, because the
“national side” of some coins would be too heavy or too light (see, for example,
the New Scientist of January 4, 2002, but also national newspapers of that
date).
a. A French 1 Euro coin was tossed six times, resulting in 1 heads and 5 tails.
Is it reasonable to use the Wilson method, introduced in Section 24.1, to
construct a confidence interval for p?
b. A Belgian 1 Euro coin was tossed 250 times: 140 heads and 110 tails.
Construct a 95% confidence interval for the probability of getting heads
with this coin.
24.3 In Exercise 23.1, what sample size is needed if we want a 99% confidence
interval for µ at most 1 ml wide?
24.4  Recall Exercise 23.3 and the 10 bags of cement that should each weigh
94 kg. The average weight was 93.5 kg, with sample standard deviation 0.75.
a. Based on these data, how many bags would you need to sample to make
a 90% confidence interval that is 0.1 kg wide?
b. Suppose you actually do measure the required number of bags and con-
struct a new confidence interval. Is it guaranteed to be at most 0.1 kg
wide?
24.5 Suppose we want to make a 95% confidence interval for the probability
of getting heads with a Dutch 1 Euro coin, and it should be at most 0.01
wide. To determine the required sample size, we note that the probability of
getting heads is about 0.5. Furthermore, if X has a Bin(n, p) distribution,
with n large and p ≈ 0.5, then
    (X − np) / √(n/4)
is approximately standard normal.
a. Use this statement to derive that the width of the 95% confidence interval for p is approximately z0.025/√n. Use this width to determine how large n should be.
b. The coin is thrown the number of times just computed, resulting in 19 477
times heads. Construct the 95% confidence interval and check whether the
required accuracy is attained.
24.6  Environmentalists have taken 16 samples from the wastewater of a
chemical plant and measured the concentration of a certain carcinogenic sub-
stance. They found x̄16 = 2.24 (ppm) and s16² = 1.12, and want to use these
data in a lawsuit against the plant. It may be assumed that the data are a
realization of a normal random sample.
a. Construct the 97.5% one-sided confidence interval that the environmen-
talists made to convince the judge that the concentration exceeds legal
limits.
b. The plant management uses the same data to construct a 97.5% one-
sided confidence interval to show that concentrations are not too high.
Construct this interval as well.
24.7 Consider once more the Rutherford-Geiger data as given in Section 23.4.
Knowing that the number of α-particle emissions during an interval has a
Poisson distribution, we may see the data as observations from a Pois(µ)
distribution. The central limit theorem tells us that the average X̄n of a large number of independent Pois(µ) random variables approximately has a normal distribution and hence that
    (X̄n − µ) / (√µ/√n)
has a distribution that is approximately N(0, 1).
a. Show that the large sample 95% confidence interval contains those values
of µ for which
    (x̄n − µ)² ≤ (1.96)² · µ/n.
b. Use the result from a to construct the large sample 95% confidence interval
based on the Rutherford-Geiger data.
c. Compare the result with that of Exercise 23.9 b. Is this surprising?
24.8  Recall Exercise 23.5 about the 1500 m speed-skating results in the 2002
Winter Olympic Games. If there were no outer lane advantage, the number
out of the 23 completed races won by skaters starting in the outer lane would
have a Bin(23, p) distribution with p = 1/2, because of the lane assignment
by lottery.
a. Of the 23 races, 15 were won by the skater starting in the outer lane. Use
this information to construct a 95% confidence interval for p by means
of the Wilson method. If you think that n = 23 is probably too small to
use a method based on the central limit theorem, we agree. We should be
careful with conclusions we draw from this confidence interval.
b. The question posed earlier “Is there an outer lane advantage?” implies that
a one-sided confidence interval is more suitable. Construct the appropriate
95% one-sided confidence interval for p by first constructing a 90% two-
sided confidence interval.
24.9  Suppose we have a dataset x1, . . . , x12 that may be modeled as the
realization of a random sample X1, . . . , X12 from a U(0, θ) distribution, with
θ unknown. Let M = max{X1, . . . , X12}.
a. Show that for 0 ≤ t ≤ 1
    P( M/θ ≤ t ) = t¹².
b. Use α = 0.1 and solve
    P( M/θ ≤ cl ) = P( M/θ ≥ cu ) = α/2.
c. Suppose the realization of M is m = 3. Construct the 90% confidence
interval for θ.
d. Derive the general expression for a confidence interval of level 1 −α based
on a sample of size n.
24.10 Suppose we have a dataset x1, . . . , xn that may be modeled as the
realization of a random sample X1, . . . , Xn from an Exp(λ) distribution, where
λ is unknown. Let Sn = X1 + · · · + Xn.
a. Check that λSn has a Gam(n, 1) distribution.
b. The following quantiles of the Gam(20, 1) distribution are given: q0.05 =
13.25 and q0.95 = 27.88. Use these to construct a 90% confidence interval
for λ when n = 20.
25
Testing hypotheses: essentials
The statistical methods that we have discussed until now have been devel-
oped to infer knowledge about certain features of the model distribution that
represent our quantities of interest. These inferences often take the form of
numerical estimates, as either single numbers or confidence intervals. How-
ever, sometimes the conclusion to be drawn is not expressed numerically, but
is concerned with choosing between two conflicting theories, or hypotheses.
For instance, one has to assess whether the lifetime of a certain type of ball
bearing deviates or does not deviate from the lifetime guaranteed by the man-
ufacturer of the bearings; an engineer wants to know whether dry drilling is
faster or the same as wet drilling; a gynecologist wants to find out whether
smoking affects or does not affect the probability of getting pregnant; the Al-
lied Forces want to know whether the German war production is equal to or
smaller than what Allied intelligence agencies reported. The process of formu-
lating the possible conclusions one can draw from an experiment and choosing
between two alternatives is known as hypothesis testing. In this chapter we
start to explore this statistical methodology.
25.1 Null hypothesis and test statistic
We will introduce the basic concepts of hypothesis testing with an exam-
ple. Let us return to the analysis of German war equipment. During World
War II the Allied Forces received reports by the Allied intelligence agencies
on German war production. The numbers of produced tires, tanks, and other
equipment, as claimed in these reports, were a lot higher than indicated by
the observed serial numbers. The objective was to decide whether the actual
produced quantities were smaller than the ones reported.
For simplicity suppose that we have observed tanks with (recoded) serial num-
bers
61 19 56 24 16.
Furthermore, suppose that the Allied intelligence agencies report a production
of 350 tanks.1
This is a lot more than we would surmise from the observed
data. We want to choose between the proposition that the total number of
tanks is 350 and the proposition that the total number is smaller than 350.
The two competing propositions are called null hypothesis, denoted by H0, and
alternative hypothesis, denoted by H1. The way we go about choosing between
H0 and H1 is conceptually similar to the way a jury deliberates in a court
trial. The null hypothesis corresponds to the position of the defendant: just
as he is presumed to be innocent until proven guilty, so is the null hypothesis
presumed to be true until the data provide convincing evidence against it.
The alternative hypothesis corresponds to the charges brought against the
defendant.
To decide whether H0 is false we use a statistical model. As argued in Chap-
ter 20 the (recoded) serial numbers are modeled as a realization of random
variables X1, X2, . . . , X5 representing five draws without replacement from the
numbers 1, 2, . . . , N. The parameter N represents the total number of tanks.
The two hypotheses in question are
    H0 : N = 350
    H1 : N < 350.
If we reject the null hypothesis we will accept H1; we speak of rejecting H0
in favor of H1. Usually, the alternative hypothesis represents the theory or
belief that we would like to accept if we do reject H0. This means that we
must carefully choose H1 in relation with our interests in the problem at hand.
In our example we are particularly interested in whether the number of tanks
is less than 350; so we test the null hypothesis against H1 : N < 350. If we were interested in whether the number of tanks differs from 350, or is greater than 350, we would test against H1 : N ≠ 350 or H1 : N > 350.
Quick exercise 25.1 In the drilling example from Sections 15.5 and 16.4 the
data on drill times for dry drilling are modeled as a realization of a random
sample from a distribution with expectation µ1, and similarly the data for wet
drilling correspond to a distribution with expectation µ2. We want to know
whether dry drilling is faster than wet drilling. To this end we test the null
hypothesis H0 : µ1 = µ2 (the drill time is the same for both methods). What
would you choose for H1?
The next step is to select a criterion based on X1, X2, . . . , X5 that provides an
indication about whether H0 is false. Such a criterion involves a test statistic.
1
This may seem ridiculous. However, when after the war official German produc-
tion statistics became available, the average monthly production of tanks during
the period 1940–1943 was 342. During the war this number was estimated at 327,
whereas Allied intelligence reported 1550! (see [27]).
Test Statistic. Suppose the dataset is modeled as the realization
of random variables X1, X2, . . . , Xn. A test statistic is any sample
statistic T = h(X1, X2, . . . , Xn), whose numerical value is used to
decide whether we reject H0.
In the tank example we use the test statistic
T = max{X1, X2, . . . , X5}.
Having chosen a test statistic T , we investigate what sort of values T can
attain. These values can be viewed on a credibility scale for H0, and we must
determine which of these values provide evidence in favor of H0, and which
provide evidence in favor of H1. First of all note that if we find a value of
T larger than 350, we immediately know that H0 as well as H1 is false. If
this happens, we actually should be considering another testing problem, but
for the current problem of testing H0 : N = 350 against H1 : N < 350 such
values are irrelevant. Hence the possible values of T that are of interest to us
are the integers from 5 to 350.
If H0 is true, then what is a typical value for T and what is not? Remember from Section 20.1 that, because n = 5, the expectation of T is E[T] = (5/6)(N + 1). This means that the distribution of T is centered around (5/6)(N + 1). Hence, if H0 is true, then typical values of T are in the neighborhood of (5/6) · 351 = 292.5.
Values of T that deviate a lot from 292.5 are evidence against H0. Values that
are much greater than 292.5 are evidence against H0 but provide even stronger
evidence against H1. For such values we will not reject H0 in favor of H1. Also
values a little smaller than 292.5 are grounds not to reject H0, because we are
committed to giving H0 the benefit of the doubt. On the other hand, values
of T very close to 5 should be considered as strong evidence against the null
hypothesis and are in favor of H1, hence they lead to a decision to reject H0.
This is summarized in Figure 25.1.
[Number line from 5 to 350: values near 5 are in favor of H1, values around 292.5 are in favor of H0, and values above 350 are against both H0 and H1.]
Fig. 25.1. Values of the test statistic T.
Quick exercise 25.2 Another possible test statistic would be X̄5. If we use
its values as a credibility scale for H0, then what are the possible values of
X̄5, which values of X̄5 are in favor of H1 : N < 350, and which values are in
favor of H0 : N = 350?
For the data we find
t = max{61, 19, 56, 24, 16} = 61
as the realization of the test statistic. How do we use this to decide on H0?
25.2 Tail probabilities
As we have just seen, if H0 is true, then typical values of T are in the neighborhood of (5/6) · 351 = 292.5. In view of Figure 25.1, the more a value of T is to the
left, the stronger evidence it provides in favor of H1. The value 61 is in the left
region of Figure 25.1. Can we now reject H0 and conclude that N is smaller
than 350, or can the fact that we observe 61 as maximum be attributed to
chance? In courtroom terminology: can we reach the conclusion that the null
hypothesis is false beyond reasonable doubt? One way to investigate this is to
examine how likely it is that one would observe a value of T that provides
even stronger evidence against H0 than 61, in the situation that N = 350. If
this is very unlikely, then 61 already bears strong evidence against H0.
Values of T that provide stronger evidence against H0 than 61 are to the
left of 61. Therefore we compute P(T ≤ 61). In the situation that N = 350,
the test statistic T is the maximum of 5 numbers drawn without replacement
from 1, 2, . . . , 350. We find that
    P(T ≤ 61) = P(max{X1, X2, . . . , X5} ≤ 61) = (61/350) · (60/349) · · · (57/346) = 0.00014.
This probability is so small that we view the value 61 as strong evidence
against the null hypothesis. Indeed, if the null hypothesis would be true, then
values of T that would provide the same or even stronger evidence against H0
than 61 are very unlikely to occur, i.e., they occur with probability 0.00014!
In other words, the observed value 61 is exceptionally small in case H0 is true.
At this point we can do two things: either we believe that H0 is true and
that something very unlikely has happened, or we believe that events with
such a small probability do not happen in practice, so that T ≤ 61 could
only have occurred because H0 is false. We choose to believe that things
happening with probability 0.00014 are so exceptional that we reject the null
hypothesis H0 : N = 350 in favor of the alternative hypothesis H1 : N  350.
In courtroom terminology: we say that a value of T smaller than or equal to
61 implies that the null hypothesis is false beyond reasonable doubt.
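The tail probability itself is a short computation; the sketch below (not from the book) evaluates P(T ≤ m) for the maximum of n serial numbers drawn without replacement from 1, . . . , N.

    def p_value_max(m, N, n=5):
        """P(max of n draws without replacement from 1..N is at most m)."""
        p = 1.0
        for i in range(n):
            p *= (m - i) / (N - i)
        return p

    print(p_value_max(61, 350))    # about 0.00014
    print(p_value_max(200, 350))   # about 0.0596, the value used in Section 25.3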
P-values
In our example, the more a value of T is to the left, the stronger evidence
it provides against H0. For this reason we computed the left tail probability
P(T ≤ 61). In other situations, the direction in which values of T provide
stronger evidence against H0 may be to the right of the observed value t,
in which case one would compute a right tail probability P(T ≥ t). In both
cases the tail probability expresses how likely it is to obtain a value of the
test statistic T at least as extreme as the value t observed for the data. Such
a probability is called a p-value. In a way, the size of the p-value reflects how
much evidence the observed value t provides against H0. The smaller the
p-value, the stronger evidence the observed value t bears against H0.
The phrase “at least as extreme as the observed value t” refers to a particular
direction, namely the direction in which values of T provide stronger evidence
against H0 and in favor of H1. In our example, this was to the left of 61, and
the p-value corresponding to 61 was P(T ≤ 61) = 0.00014. In this case it is
clear what is meant by “at least as extreme as t” and which tail probability
corresponds to the p-value. However, in some testing problems one can deviate
from H0 in both directions. In such cases it may not be clear what values of
T are at least as extreme as the observed value, and it may be unclear how
the p-value should be computed. One approach to a solution in this case is
to simply compute the one-tailed p-value that corresponds to the direction in
which t deviates from H0.
Quick exercise 25.3 Suppose that the Allied intelligence agencies had re-
ported a production of 80 tanks, so that we would test H0 : N = 80 against
H1 : N < 80. Compute the p-value corresponding to 61. Would you conclude
H0 is false beyond reasonable doubt?
25.3 Type I and type II errors
Suppose that the maximum is 200 instead of 61. This is also to the left of
the expected value 292.5 of T . Is it far enough to the left to reject the null
hypothesis? In this case the p-value is equal to
    P(T ≤ 200) = P(max{X1, X2, . . . , X5} ≤ 200) = (200/350) · (199/349) · · · (196/346) = 0.0596.
This means that if the total number of produced tanks is 350, then in 5.96%
of all cases we would observe a value of T that is at least as extreme as the
value 200. Before we decide whether 0.0596 is small enough to reject the null
hypothesis let us explore in more detail what the preceding probability stands
for.
It is important to distinguish between (1) the true state of nature: H0 is true
or H1 is true and (2) our decision: we reject or do not reject H0 on the basis
of the data. In our example the possibilities for the true state of nature are:
• H0 is true, i.e., there are 350 tanks produced.
• H1 is true, i.e., the number of tanks produced is less than 350.
We do not know in which situation we are. There are two possible decisions:
• We reject H0 in favor of H1.
• We do not reject H0.
This leads to four possible situations, which are summarized in Figure 25.2.
True state of nature
H0 is true H1 is true
Reject H0 Type I error Correct decision
Our decision on the
basis of the data
Not reject H0 Correct decision Type II error
Fig. 25.2. Four situations when deciding about H0.
There are two situations in which the decision made on the basis of the data is
wrong. The null hypothesis H0 may be true, whereas the data lead to rejection
of H0. On the other hand, the alternative hypothesis H1 may be true, whereas
we do not reject H0 on the basis of the data. These wrong decisions are called
type I and type II errors.
Type I and II errors. A type I error occurs if we falsely reject
H0. A type II error occurs if we falsely do not reject H0.
In courtroom terminology, a type I error corresponds to convicting an innocent
defendant, whereas a type II error corresponds to acquitting a criminal.
If H0 : N = 350 is true, then the decision to reject H0 is a type I error. We
will never know whether we make a type I error. However, given a particular
decision rule, we can say something about the probability of committing a
type I error. Suppose the decision rule would be “reject H0 : N = 350 when-
ever T ≤ 200.” With this decision rule the probability of committing a type I
error is P(T ≤ 200) = 0.0596. If we are willing to run the risk of committing
a type I error with probability 0.0596, we could adopt this decision rule. This
would also mean that on the basis of an observed maximum of 200 we would
reject H0 in favor of H1 : N  350.
Quick exercise 25.4 Suppose we adopt the following decision rule about the
null hypothesis: “reject H0 : N = 350 whenever T ≤ 250.” Using this decision
rule, what is the probability of committing a type I error?
The question remains what amount of risk one is willing to take to falsely
reject H0, or in courtroom terminology: how small should the p-value be to
reach a conclusion that is “beyond reasonable doubt”? In many situations,
as a rule of thumb 0.05 is used as the level where reasonable doubt begins.
Something happening with probability less than or equal to 0.05 is then viewed
as being too exceptional. However, there is no general rule that specifies how
small the p-value must be to reject H0. There is no way to argue that this
probability should be below 0.10 or 0.18 or 0.009—or anything else.
A possible solution is to solely report the p-value corresponding to the ob-
served value of the test statistic. This is objective and does not have the
arbitrariness of a preselected level such as 0.05. An investigator who reports
the p-value conveys the maximum amount of information contained in the
dataset and permits all decision makers to choose their own level and make
their own decision about the null hypothesis. This is especially important
when there is no justifiable reason for preselecting a particular value for such
a level.
25.4 Solutions to the quick exercises
25.1 One is interested in whether dry drilling is faster than wet drilling.
Hence if we reject H0 : µ1 = µ2, we would like to conclude that the drill time
is smaller for dry drilling than for wet drilling. Since µ1 and µ2 represent the
drill time for dry and wet drilling, we should choose H1 : µ1 < µ2.
25.2 The value of X̄5 is at least 3 and if we find a value of X̄5 that is larger
than 348, then at least one of the five numbers must be greater than 350, so
that we immediately know that H0 as well as H1 is false. Hence the possible
values of X̄5 that are relevant for our testing problem are between 3 and 348.
We know from Section 20.1 that 2X̄5 − 1 is an unbiased estimator for N,
no matter what the value of N is. This implies that values of X̄5 itself are
centered around (N + 1)/2. Hence values close to 351/2=175.5 are in favor
of H0, whereas values close to 3 are in favor of H1. Values close to 348 are
against H0, but also against H1. See Figure 25.3.
[Number line from 3 to 348: values near 3 are in favor of H1, values around 175.5 are in favor of H0, and values above 348 are against both H0 and H1.]
Fig. 25.3. Values of the test statistic X̄5.
25.3 The p-value corresponding to 61 is now equal to
    P(T ≤ 61) = (61/80) · (60/79) · · · (57/76) = 0.2475.
If H0 is true, then in 24.75% of the time one will observe a value T less than
or equal to 61. Such values are not exceptionally small for T under H0, and
therefore the evidence that the value 61 bears against H0 is pretty weak. We
cannot reject H0 beyond reasonable doubt.
25.4 The type I error associated with the decision rule occurs if N = 350
(H0 is true) and t ≤ 250 (reject H0). The probability that this happens is
    P(T ≤ 250) = (250/350) · (249/349) · · · (246/346) = 0.1838.
25.5 Exercises
25.1 In a study about train delays in The Netherlands one was interested in
whether arrival delays of trains exhibit more variation during rush hours than
during quiet hours. The observed arrival delays during rush hours are mod-
eled as realizations of a random sample from a distribution with variance σ1², and similarly the observed arrival delays during quiet hours correspond to a distribution with variance σ2². One tests the null hypothesis H0 : σ1 = σ2.
What do you choose as the alternative hypothesis?
25.2  On average, the number of babies born in Cleveland, Ohio, in the
month of September is 1472. On January 26, 1977, the city was immobilized
by a blizzard. Nine months later, in September 1977, the recorded number of
births was 1718. Can the increase of 246 be attributed to chance? To inves-
tigate this, the number of births in the month of September is modeled by a
Poisson random variable with parameter µ, and we test H0 : µ = 1472. What
would you choose as the alternative hypothesis?
25.3 Recall Exercise 17.9 about black cherry trees. The scatterplot of y (vol-
ume) versus x = d²h (squared diameter times height) seems to indicate that
the regression line y = α + βx runs through the origin. One wants to inves-
tigate whether this is true by means of a testing problem. Formulate a null
hypothesis and alternative hypothesis in terms of (one of) the parameters α
and β.
25.4  Consider the example from Section 4.4 about the number of cycles
up to pregnancy of smoking and nonsmoking women. Suppose the observed
number of cycles are modeled as realizations of random samples from geo-
metric distributions. Let p1 be the parameter of the geometric distribution
corresponding to smoking women and p2 be the parameter for the nonsmok-
ing women. We are interested in whether p1 is different from p2, and we
investigate this by testing H0 : p1 = p2 against H1 : p1 ≠ p2.
a. If the data are as given in Exercise 17.5, what would you choose as a test
statistic?
b. What would you choose as a test statistic, if you were given the extra
knowledge as in Table 21.1?
c. Suppose we are interested in whether smoking women are less likely to get
pregnant than nonsmoking women. What is the appropriate alternative
hypothesis in this case?
25.5  Suppose a dataset is a realization of a random sample X1, X2, . . . , Xn
from a uniform distribution on [0, θ], for some (unknown) θ > 0. We test
H0 : θ = 5 versus H1 : θ ≠ 5.
a. We take T1 = max{X1, X2, . . . , Xn} as our test statistic. Specify what
the (relevant) possible values are for T and which are in favor of H0 and
which are in favor of H1. For instance, make a picture like Figure 25.1.
b. Same as a, but now for test statistic T2 = |2X̄n − 5|.
25.6  To test a certain null hypothesis H0 one uses a test statistic T with
a continuous sampling distribution. One agrees that H0 is rejected if one
observes a value t of the test statistic for which (under H0) the right tail
probability P(T ≥ t) is smaller than or equal to 0.05. Given below are different
values t and a corresponding left or right tail probability (under H0). Specify
for each case what the p-value is, if possible, and whether we should reject H0.
a. t = 2.34 and P(T ≥ 2.34) = 0.23.
b. t = 2.34 and P(T ≤ 2.34) = 0.23.
c. t = 0.03 and P(T ≥ 0.03) = 0.968.
d. t = 1.07 and P(T ≤ 1.07) = 0.981.
e. t = 1.07 and P(T ≤ 2.34) = 0.01.
f. t = 2.34 and P(T ≤ 1.07) = 0.981.
g. t = 2.34 and P(T ≤ 1.07) = 0.800.
25.7 (Exercise 25.2 continued). The number of births in September is mod-
eled by a Poisson random variable T with parameter µ, which represents the
expected number of births. Suppose that one uses T to test the null hypothe-
sis H0 : µ = 1472 and that one decides to reject H0 on the basis of observing
the value t = 1718.
a. In which direction do values of T provide evidence against H0 (and in
favor of H1)?
b. Compute the p-value corresponding to t = 1718, where you may use the
fact that the distribution of T can be approximated by an N(µ, µ) distri-
bution.
25.8 Suppose we want to test the null hypothesis that our dataset is a realiza-
tion of a random sample from a standard normal distribution. As test statistic
we use the Kolmogorov-Smirnov distance between the empirical distribution
function Fn of the data and the distribution function Φ of the standard nor-
mal:
T = sup_{a∈R} |Fn(a) − Φ(a)|.
What are the possible values of T and in which direction do values of T deviate
from the null hypothesis?
25.9 Recall the example from Section 18.3, where we investigated whether the
software data are exponential by means of the Kolmogorov-Smirnov distance
between the empirical distribution function Fn of the data and the estimated
exponential distribution function:
Tks = sup_{a∈R} |Fn(a) − (1 − e^(−λ̂a))|.
For the data we found tks = 0.176. By means of a new parametric bootstrap
we simulated 100 000 realizations of Tks and found that all of them are smaller
than 0.176. What can you say about the p-value corresponding to 0.176?
25.10  Consider the coal data from Table 23.1, where 23 gross calorific value
measurements are listed for Osterfeld coal coded 262DE27. We modeled this
dataset as a realization of a random sample from a normal distribution with
expectation µ unknown and standard deviation 0.1 MJ/kg. We are planning
to buy a shipment if the gross calorific value exceeds 23.75 MJ/kg. In order
to decide whether this is sensible, we test the null hypothesis H0 : µ = 23.75
with test statistic X̄n.
a. What would you choose as the alternative hypothesis?
b. For the dataset x̄n is 23.788. Compute the corresponding p-value, using
that X̄n has an N(23.75, (0.1)²/23) distribution under the null hypothesis.
25.11  One is given a number t, which is the realization of a random vari-
able T with an N(µ, 1) distribution. To test H0 : µ = 0 against H1 : µ ≠ 0,
one uses T as the test statistic. One decides to reject H0 in favor of H1 if
|t| ≥ 2. Compute the probability of committing a type I error.
26
Testing hypotheses: elaboration
In the previous chapter we introduced the setup for testing a null hypothesis
against an alternative hypothesis using a test statistic T . The notions of type I
error and type II error were introduced. A type I error occurs when we falsely
reject H0 on the basis of the observed value of T , whereas a type II error
occurs when we falsely do not reject H0. The decision to reject H0 or not was
based on the size of the p-value. In this chapter we continue the introduction
of basic concepts of testing hypotheses, such as significance level and critical
region, and investigate the probability of committing a type II error.
26.1 Significance level
As mentioned in the previous chapter, there is no general rule that specifies a
level below which the p-value is considered exceptionally small. However, there
are situations where this level is set a priori, and the question is: which values
of the test statistic should then lead to rejection of H0? To illustrate this, con-
sider the following example. The speed limit on freeways in The Netherlands
is 120 kilometers per hour. A device next to freeway A2 between Amsterdam
and Utrecht measures the speed of passing vehicles. Suppose that the device
is designed in such a way that it conducts three measurements of the speed
of a passing vehicle, modeled by a random sample X1, X2, X3. On the basis
of the value of the average X̄3, the driver is either fined for speeding or not.
For what values of X̄3 should we fine the driver, if we allow that 5% of the
drivers are fined unjustly?
Let us rephrase things in terms of a testing problem. Each measurement can
be thought of as
measurement = true speed + measurement error.
Suppose for the moment that the measuring device is carefully calibrated, so
that the measurement error is modeled by a random variable with mean zero
and known variance σ², say σ² = 4. Moreover, in physical experiments such as
this one, the measurement error is often modeled by a random variable with a
normal distribution. In that case, the measurements X1, X2, X3 are modeled
by a random sample from an N(µ, 4) distribution, where the parameter µ
represents the true speed of the passing vehicle. Our testing problem can now
be formulated as testing
H0 : µ = 120 against H1 : µ > 120,
with test statistic
T = (X1 + X2 + X3)/3 = X̄3.
Since sums of independent normal random variables again have a normal dis-
tribution (see Remark 11.2), it follows that X̄3 has an N(µ, 4/3) distribution.
In particular, the distribution of T = X̄3 is centered around µ no matter what
the value of µ is. Values of T close to 120 are therefore in favor of H0. Values of
T that are far from 120 are considered as strong evidence against H0. Values
much larger than 120 suggest that µ > 120 and are therefore in favor of H1.
Values much smaller than 120 suggest that µ < 120. They also constitute
evidence against H0, but even stronger evidence against H1. Thus we reject
H0 in favor of H1 only for values of T larger than 120. See also Figure 26.1.
[Fig. 26.1. Possible values of T = X̄3: values to the right of 120 are in favor of H1.]
Rejection of H0 in favor of H1 corresponds to fining the driver for speeding.
Unjustly fining a driver corresponds to falsely rejecting H0, i.e., committing
a type I error. Since we allow 5% of the drivers to be fined unjustly, we are
dealing with a testing problem where the probability of committing a type I
error is set a priori at 0.05. The question is: for which values of T should
we reject H0? The decision rule for rejecting H0 should be such that the
corresponding probability of committing a type I error is 0.05. The value 0.05
is called the significance level.
Significance level. The significance level is the largest accept-
able probability of committing a type I error and is denoted by α,
where 0 < α < 1.
We speak of “performing the test at level α,” as well as “rejecting H0 in
favor of H1 at level α.” In our example we are testing H0 : µ = 120 against
H1 : µ > 120 at level 0.05.
Quick exercise 26.1 Suppose that in the freeway example H0 : µ = 120 is
rejected in favor of H1 : µ > 120 at level α = 0.05. Will it necessarily be
rejected at level α = 0.01? On the other hand, suppose that H0 : µ = 120
is rejected in favor of H1 : µ > 120 at level α = 0.01. Will it necessarily be
rejected at level α = 0.05?
Let us continue with our example and determine for which values of T = X̄3
we should reject H0 at level α = 0.05 in favor of H1 : µ > 120. Suppose
we decide to fine each driver whose recorded average speed is 121 or more,
i.e., we reject H0 whenever T ≥ 121. Then how large is the probability of a
type I error P(T ≥ 121)? When H0 : µ = 120 is true, then T = X̄3 has an
N(120, 4/3) distribution, so that by the change-of-units rule for the normal
distribution (see page 106), the random variable
Z = (T − 120)/(2/√3)
has an N(0, 1) distribution. This implies that
P(T ≥ 121) = P( (T − 120)/(2/√3) ≥ (121 − 120)/(2/√3) ) = P(Z ≥ 0.87).
From Table B.1, we find P(Z ≥ 0.87) = 0.1922, which means that the prob-
ability of a type I error is greater than the significance level α = 0.05. Since
this level was defined as the largest acceptable probability of a type I error,
we do not reject H0. Similarly, if we decide to reject H0 whenever we record
an average of 122 or more, the probability of a type I error equals 0.0416
(check this). This is smaller than α = 0.05, so in that case we reject H0. The
boundary case is the value c that satisfies P(T ≥ c) = 0.05. To find c, we must
solve
P( Z ≥ (c − 120)/(2/√3) ) = 0.05.
From Table B.2 we have that z0.05 = t∞,0.05 = 1.645, so that we find
(c − 120)/(2/√3) = 1.645,
which leads to
c = 120 + 1.645 · 2/√3 = 121.9.
Hence, if we set the significance level α at 0.05, we should reject H0 : µ = 120
in favor of H1 : µ > 120 whenever T ≥ 121.9. For our freeway example this
means that if the average recorded speed of a passing vehicle is greater than
or equal to 121.9, then the driver is fined for speeding. With this decision rule,
at most 5% of the drivers get fined unjustly.
In connection with p-values: the significance level is the level below which
the p-value is sufficiently small to reject H0. Indeed, for any observed value
t ≥ 121.9 we reject H0, and the p-value for such a t is at most 0.05:
P(T ≥ t) ≤ P(T ≥ 121.9) = 0.05.
We will see more about this relation in the next section.
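As an illustration of the two computations above, here is a small Python sketch (not from the text) that recomputes the critical value 121.9 with scipy and evaluates the p-value of a hypothetical observed average speed; the value t_obs = 122.5 is made up for the example.

```python
# Sketch of the freeway example: under H0, T = X3_bar ~ N(120, 4/3).
from scipy.stats import norm
import math

sigma_T = 2 / math.sqrt(3)                    # standard deviation of X3_bar
c = norm.ppf(0.95, loc=120, scale=sigma_T)    # critical value at level 0.05
print(round(c, 1))                            # 121.9

# p-value of a hypothetical observed average speed; reject H0 iff p-value <= 0.05
t_obs = 122.5                                 # made-up observed value
p_value = norm.sf(t_obs, loc=120, scale=sigma_T)
print(p_value <= 0.05, round(p_value, 4))     # True, about 0.015
```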
26.2 Critical region and critical values
In the freeway example the significance level 0.05 corresponds to the decision
rule “reject H0 : µ = 120 in favor of H1 : µ > 120 whenever T ≥ 121.9.” The
set K = [121.9, ∞) consisting of values of the test statistic T for which we
reject H0 is called the critical region. The value 121.9, which is the boundary case
between rejecting and not rejecting H0, is called the critical value.
Critical region and critical values. Suppose we test H0
against H1 at significance level α by means of a test statistic T .
The set K ⊂ R that corresponds to all values of T for which we
reject H0 in favor of H1 is called the critical region. Values on the
boundary of the critical region are called critical values.
The precise shape of the critical region depends on both the chosen significance
level α and the test statistic T that is used. But it will always be such that
the probability that T ∈ K satisfies
P(T ∈ K) ≤ α in the case that H0 is true.
At this point it becomes important to emphasize whether probabilities are
computed under the assumption that H0 is true. With a slight abuse of nota-
tion, we briefly write P(T ∈ K | H0) for the probability.
Relation with p-values
If we record average speed t = 124, then this value falls in the critical region
K = [121.9, ∞), so that H0 : µ = 120 is rejected in favor of H1 : µ > 120. On
the other hand we can also compute the p-value corresponding to the observed
value 124. Since values of T to the right provide stronger evidence against H0,
the p-value is the following right tail probability
P(T ≥ 124 | H0) = P( (T − 120)/(2/√3) ≥ (124 − 120)/(2/√3) ) = P(Z ≥ 3.46) = 0.0003,
which is smaller than the significance level 0.05. This is no coincidence.
In general, suppose that we perform a test at level α using test statistic T
and that we have observed t as the value of our test statistic. Then
t ∈ K ⇔ the p-value corresponding to t is less than or equal to α.
Figure 26.2 illustrates this for a testing problem where values of T to the
right provide evidence against H0 and in favor of H1. In that case, the p-value
corresponds to the right tail probability P(T ≥ t | H0). The shaded area to the
right of cα corresponds to α = P(T ≥ cα | H0), whereas the more intensely
shaded area to the right of t represents the p-value. We see that deciding
whether to reject H0 at a given significance level α can be done by comparing
either t with cα or the p-value with α. For this reason the p-value is sometimes
called the observed significance level.
[Fig. 26.2. P-value and critical value: the sampling distribution of T under H0, with critical region K = [cα, ∞); the area to the right of cα equals α, and the (smaller) area to the right of t is the p-value.]
The concepts of critical value and p-value have their own merit. The critical
region and the corresponding critical values specify exactly what values of T
lead to rejection of H0 at a given level α. This can be done even without
obtaining a dataset and computing the value t of the test statistic. The p-
value, on the other hand, represents the strength of the evidence the observed
value t bears against H0. But it does not specify all values of T that lead to
rejection of H0 at a given level α.
Quick exercise 26.2 In our freeway example, we have already computed
the relevant tail probability to decide whether a person with recorded average
speed t = 124 gets fined if we set the significance level at 0.05. Suppose the
significance level is set at α = 0.01 (we allow 1% of the drivers to get fined
unjustly). Determine whether a person with recorded average speed t = 124
gets fined (H0 : µ = 120 is rejected). Furthermore, determine the critical
region in this case.
Sometimes the critical region K can be constructed such that P(T ∈ K | H0) is
exactly equal to α, as in the freeway example. However, when the distribution
of T is discrete, this is not always possible. This is illustrated by the next
example.
After the introduction of the Euro, Polish mathematicians claimed that the
Belgian 1 Euro coin is not a fair coin (see, for instance, the New Scientist,
January 4, 2002). Suppose we put a 1 Euro coin to the test. We will throw
it ten times and record X, the number of heads. Then X has a Bin(10, p)
distribution, where p denotes the probability of heads. We like to find out
whether p differs from 1/2. Therefore we test
H0 : p =
1
2
(the coin is fair) against H1 : p =
1
2
(the coin is not fair).
We use X as the test statistic. When we set the significance level α at 0.05,
for what values of X will we reject H0 and conclude that the coin is not fair?
Let us first find out what values of X are in favor of H1. If H0 : p = 1/2 is
true, then E[X] = 10 · 1/2 = 5, so that values of X close to 5 are in favor of H0.
Values close to 10 suggest that p > 1/2 and values close to 0 suggest that
p < 1/2. Hence, both values close to 0 and values close to 10 are in favor of
H1 : p ≠ 1/2.
[Diagram: possible values of X from 0 to 10; values close to 0 and values close to 10 are in favor of H1.]
This means that we will reject H0 in favor of H1 whenever X ≤ cl or X ≥ cu.
Therefore, the critical region is the set
K = {0, 1, . . ., cl} ∪ {cu, . . . , 9, 10}.
The boundary values cl and cu are called left and right critical values. They
must be chosen such that the critical region K is as large as possible and still
satisfies
P(X ∈ K | H0) = P(X ≤ cl | p = 1/2) + P(X ≥ cu | p = 1/2) ≤ 0.05.
Here P(X ≥ cu | p = 1/2) denotes the probability P(X ≥ cu) computed with X
having a Bin(10, 1/2) distribution. Since we have no preference for rejecting H0
for values close to 0 or close to 10, we divide 0.05 over the two sides, and we
choose cl as large as possible and cu as small as possible such that
P(X ≤ cl | p = 1/2) ≤ 0.025 and P(X ≥ cu | p = 1/2) ≤ 0.025.
Table 26.1. Left tail probabilities of the Bin(10, 1/2) distribution.

k    P(X ≤ k)      k    P(X ≤ k)
0    0.00098       6    0.82813
1    0.01074       7    0.94531
2    0.05469       8    0.98926
3    0.17188       9    0.99902
4    0.37696      10    1.00000
5    0.62305
The left tail probabilities of the Bin(10, 1/2) distribution are listed in Ta-
ble 26.1. We immediately see that cl = 1 is the largest value such that
P(X ≤ cl | p = 1/2) ≤ 0.025. Similarly, cu = 9 is the smallest value such that
P(X ≥ cu | p = 1/2) ≤ 0.025. Indeed, when X has a Bin(10, 1/2) distribution,
P(X ≥ 9) = 1 − P(X ≤ 8) = 1 − 0.98926 = 0.01074,
P(X ≥ 8) = 1 − P(X ≤ 7) = 1 − 0.94531 = 0.05469.
Hence, if we test H0 : p = 1/2 against H1 : p ≠ 1/2 at level α = 0.05, the
critical region is the set K = {0, 1, 9, 10}. The corresponding probability of a type I error is
P(X ∈ K) = P(X ≤ 1) + P(X ≥ 9) = 0.01074 + 0.01074 = 0.02148,
which is smaller than the significance level. You may perform ten throws with
your favorite coin and see whether the number of heads falls in the critical
region.
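The search for cl and cu can also be automated. The following Python sketch (an illustration, not the book's code) uses scipy's binomial distribution to find the largest cl and smallest cu whose tail probabilities stay below α/2, and then evaluates the resulting probability of a type I error.

```python
# Sketch: critical values cl and cu for X ~ Bin(10, 1/2) at level 0.05,
# splitting the level evenly over the two tails as in the text.
from scipy.stats import binom

n, p0, alpha = 10, 0.5, 0.05
# largest cl with P(X <= cl) <= alpha/2
cl = max(k for k in range(n + 1) if binom.cdf(k, n, p0) <= alpha / 2)
# smallest cu with P(X >= cu) <= alpha/2  (binom.sf(k - 1) equals P(X >= k))
cu = min(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha / 2)
print(cl, cu)                                    # 1 9

type_I = binom.cdf(cl, n, p0) + binom.sf(cu - 1, n, p0)
print(round(type_I, 5))                          # 0.02148
```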
Quick exercise 26.3 Recall the tank example where we tested H0 : N = 350
against H1 : N < 350 by means of the test statistic T = max Xi. Suppose that
we perform the test at level 0.05. Deduce the critical region K corresponding
to level 0.05 from the left tail probabilities given here:
k 195 194 193 192 191
P(T ≤ k | H0) 0.0525 0.0511 0.0498 0.0485 0.0472
Is P(T ∈ K | H0) = 0.05?
One- and two-tailed p-values
In the Euro coin example, we deviate from H0 : p = 1/2 in two directions:
values of X both far to the right and far to the left of 5 are evidence against H0.
Suppose that in ten throws with the 1 Euro coin we recorded x heads. What
would the p-value be corresponding to x? The problem is that the direction
in which values of X are at least as extreme as the observed value x depends
on whether x lies to the right or to the left of 5.
At this point there are two natural solutions. One may report the appropri-
ate left or right tail probability, which corresponds to the direction in which
x deviates from H0. For instance, if x lies to the right of 5, we compute
P(X ≥ x | H0). This is called a one-tailed p-value. The disadvantage of one-
tailed p-values is that they are somewhat misleading about how strong the
evidence of the observed value x bears against H0. In view of the relation
between rejection on the basis of critical values or on the basis of a p-value,
the one-tailed p-value should be compared to α/2. On the other hand, since
people are inclined to compare p-values with the significance level α itself,
one could also double the one-tailed p-value and compare this with α. This
double-tail probability is called a two-tailed p-value. It doesn’t make much
of a difference, as long as one also reports whether the reported p-value is
one-tailed or two-tailed.
Let us illustrate things by means of the findings by the Polish mathematicians.
They performed 250 throws with a Belgian 1 Euro coin and recorded heads
140 times (see also Exercise 24.2). The question is whether this provides strong
enough evidence against H0 : p = 1/2. The observed value 140 is to the right
of 125, the value we would expect if H0 is true. Hence the one-tailed p-value
is P(X ≥ 140), where now X has a Bin(250, 1/2) distribution. By means of the
normal approximation (see page 201), we find
P(X ≥ 140) = P( (X − 125)/√((1/4) · 250) ≥ (140 − 125)/√((1/4) · 250) )
≈ P(Z ≥ 1.90) = 1 − Φ(1.90) = 0.0287.
Therefore the two-tailed p-value is approximately 0.0574, which does not pro-
vide very strong evidence against H0. In fact, the exact two-tailed p-value,
computed by means of statistical software, is 0.066, which is even larger.
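For readers who want to reproduce these numbers, here is a short Python sketch (not part of the text) computing the exact one- and two-tailed p-values for 140 heads in 250 throws, together with the normal approximation used above.

```python
# Sketch: one- and two-tailed p-values for 140 heads in 250 throws, X ~ Bin(250, 1/2).
from scipy.stats import binom, norm
import math

n, p0, x = 250, 0.5, 140
one_tailed_exact = binom.sf(x - 1, n, p0)   # P(X >= 140)
print(round(2 * one_tailed_exact, 3))       # about 0.066, as reported in the text

# normal approximation (no continuity correction), as in the text
z = (x - n * p0) / math.sqrt(n * p0 * (1 - p0))
print(round(2 * norm.sf(z), 4))             # about 0.058; the text rounds z to 1.90 and gets 0.0574
```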
Quick exercise 26.4 In a Dutch newspaper (De Telegraaf, January 3, 2002)
it was reported that the Polish mathematicians recorded heads 150 times.
What are the one- and two-tailed probabilities in this case? Do they now have
a case?
26.3 Type II error
As we have just seen, by setting a significance level α, we are able to control
the probability of committing a type I error; it will at most be α. For instance,
let us return to the freeway example and suppose that we adopt the decision
rule to fine the driver for speeding if her average observed speed is at least
121.9, i.e.,
reject H0 : µ = 120 in favor of H1 : µ > 120 whenever T = X̄3 ≥ 121.9.
From Section 26.1 we know that with this decision rule, the probability of a
type I error is 0.05. What is the probability of committing a type II error?
This corresponds to the percentage of drivers whose true speed is above 120
but who do not get fined because their recorded average speed is below 121.9.
For instance, suppose that a car passes at true speed µ = 125. A type II error
occurs when T < 121.9, and since T = X̄3 has an N(125, 4/3) distribution,
the probability that this happens is
P(T < 121.9 | µ = 125) = P( (T − 125)/(2/√3) < (121.9 − 125)/(2/√3) ) = Φ(−2.68) = 0.0036.
This looks promising, but now consider a vehicle passing at true speed µ =
123. The probability of committing a type II error in this case is
P(T < 121.9 | µ = 123) = P( (T − 123)/(2/√3) < (121.9 − 123)/(2/√3) ) = Φ(−0.95) = 0.1711.
Hence 17.11% of all drivers that pass at speed µ = 123 will not get fined. In
Figure 26.3 the last situation is illustrated. The curve on the left represents the
probability density of the N(120, 4/3) distribution, which is the distribution
of T under the null hypothesis. The shaded area on the right of 121.9 represents
the probability of committing a type I error
P(T ≥ 121.9 | µ = 120) = 0.05.
The curve on the right is the probability density of the N(123, 4/3) distribu-
tion, which is the distribution of T under the alternative µ = 123. The shaded
area on the left of 121.9 represents the probability of a type II error
[Fig. 26.3. Type I and type II errors in the freeway example: the sampling distribution of T under H0 (µ = 120) and the sampling distribution of T when µ = 123, with the critical value 121.9; the area under the H0 density to the right of 121.9 is the type I error probability 0.05 (“Reject H0”), and the area under the µ = 123 density to the left of 121.9 is the type II error probability (“Do not reject H0”).]
P(T < 121.9 | µ = 123) = 0.1711.
Shifting µ further to the right will result in a smaller probability of a type II
error. However, shifting µ toward the value 120 leads to a larger probability
of a type II error. In fact it can be arbitrarily close to 0.95.
The previous example illustrates that the probability of committing a type II
error depends on the actual value of µ in the alternative hypothesis H1 : µ > 120.
The closer µ is to 120, the higher the probability of a type II error will
be. In contrast with the probability of a type I error, which is always at most
α, the probability of a type II error may be arbitrarily close to 1 − α. This is
illustrated in the next quick exercise.
Quick exercise 26.5 What is the probability of a type II error in the freeway
example if µ = 120.1?
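A small Python sketch (an illustration, not the book's code) evaluates the type II error probability P(T < 121.9 | µ) for several values of µ in the freeway example, showing how it grows as µ approaches 120.

```python
# Sketch: type II error P(T < 121.9 | mu) in the freeway example,
# where T = X3_bar ~ N(mu, 4/3) and H0 is rejected when T >= 121.9.
from scipy.stats import norm
import math

def type_II(mu, c=121.9, sigma=2 / math.sqrt(3)):
    return norm.cdf(c, loc=mu, scale=sigma)

for mu in [120.1, 123, 125]:
    print(mu, round(type_II(mu), 4))
# about 0.9405, 0.1704, 0.0036; the text rounds the z-values first and
# reports 0.9406, 0.1711, 0.0036.
```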
26.4 Relation with confidence intervals
When testing H0 : µ = 120 against H1 : µ > 120 at level 0.05 in the freeway
example, the critical value was obtained by the formula
c0.05 = 120 + 1.645 · 2/√3.
On the other hand, using that X̄3 has an N(µ, 4/3) distribution, a 95% lower
confidence bound for µ in this case can be derived from
ln = x̄3 − 1.645 · 2/√3.
Although, at first sight, testing hypotheses and constructing confidence inter-
vals seem to be two separate statistical procedures, they are in fact intimately
related. In the freeway example, observe that for a given dataset x1, x2, x3,
we reject H0 : µ = 120 in favor of H1 : µ > 120 at level 0.05
⇔ x̄3 ≥ 120 + 1.645 · 2/√3
⇔ x̄3 − 1.645 · 2/√3 ≥ 120
⇔ 120 is not in the 95% one-sided confidence interval for µ.
This is not a coincidence. In general, the following applies. Suppose that for
some parameter θ we test H0 : θ = θ0. Then
we reject H0 : θ = θ0 in favor of H1 : θ > θ0 at level α
if and only if
θ0 is not in the 100(1 − α)% one-sided confidence interval for θ.
The same relation holds for testing against H1 : θ < θ0, and a similar relation
holds between testing against H1 : θ ≠ θ0 and two-sided confidence intervals:
we reject H0 : θ = θ0 in favor of H1 : θ ≠ θ0 at level α
if and only if
θ0 is not in the 100(1 − α)% two-sided confidence region for θ.
In fact, one could use these facts to define the 100(1−α)% confidence region for
a parameter θ as the set of values θ0 for which the null hypothesis H0 : θ = θ0
is not rejected at level α.
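The duality can be made concrete for the freeway example. The Python sketch below is illustrative only (the observed averages 121.5 and 124.0 are made up); it checks that rejecting H0 at level 0.05 coincides with the 95% lower confidence bound lying at or above 120.

```python
# Sketch of the duality: rejecting H0: mu = 120 in favor of mu > 120 at level 0.05
# happens exactly when 120 is not in the one-sided interval (ln, infinity).
from scipy.stats import norm
import math

sigma_T = 2 / math.sqrt(3)
z05 = norm.ppf(0.95)                     # 1.645

def reject(x3_bar, mu0=120):
    return x3_bar >= mu0 + z05 * sigma_T

def lower_bound(x3_bar):
    return x3_bar - z05 * sigma_T        # ln from the text

for x3_bar in [121.5, 124.0]:            # hypothetical observed averages
    print(reject(x3_bar), lower_bound(x3_bar) >= 120)   # the two booleans agree
```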
It should be emphasized that these relations only hold if the random variable
that is used to construct the confidence interval relates appropriately to the
test statistic. For instance, the preceding relations do not hold if on the one
hand, we construct a confidence interval for the parameter µ of an N(µ, σ²)
distribution by means of the studentized mean (X̄n − µ)/(Sn/√n), and on the
other hand, use the sample median Medn to test a null hypothesis for µ.
26.5 Solutions to the quick exercises
26.1 In the first situation, we reject at significance level α = 0.05, which
means that the probability of committing a type I error is at most 0.05. This
does not necessarily mean that this probability will also be less than or equal to
0.01. Therefore with this information we cannot know whether we also reject
at level α = 0.01. In the reversed situation, if we reject at level α = 0.01, then
the probability of committing a type I error is at most 0.01, and is therefore
also smaller than 0.05. This means that we also reject at level α = 0.05.
26.2 To decide whether we should reject H0 : µ = 120 at level 0.01, we could
compute P(T ≥ 124 | H0) and compare this with 0.01. We have already seen
that P(T ≥ 124 | H0) = 0.0003. This is (much) smaller than the significance
level α = 0.01, so we should reject.
The critical region is K = [c, ∞), where we must solve c from
P( Z ≥ (c − 120)/(2/√3) ) = 0.01.
Since z0.01 = 2.326, this means that c = 120 + 2.326 · (2/√3) = 122.7.
26.3 The critical region is of the form K = {5, 6, . . ., c}, where the criti-
cal value c is the largest value, for which P(T ≤ c | H0) is still less than or
equal to 0.05. From the table we immediately see that c = 193 and that
P(T ∈ K | H0) = P(T ≤ 193 | H0) = 0.0498, which is not equal to 0.05.
26.4 By means of the normal approximation, for the one-tailed p-value we
find
P(X ≥ 150) = P( (X − 125)/√((1/4) · 250) ≥ (150 − 125)/√((1/4) · 250) )
= P(Z ≥ 3.16) ≈ 1 − Φ(3.16) = 0.0008.
The two-tailed p-value is 0.0016. This is a lot smaller than the two-tailed p-
value 0.0574, corresponding to 140 heads. It seems that with 150 heads the
mathematicians would have a case; the Belgian Euro coin would then appear
not to be fair.
26.5 The probability of a type II error is
P(T < 121.9 | µ = 120.1) = P( (T − 120.1)/(2/√3) < (121.9 − 120.1)/(2/√3) ) = Φ(1.56) = 0.9406.
26.6 Exercises
26.1 Polygraphs that are used in criminal investigations are supposed to in-
dicate whether a person is lying or telling the truth. However the procedure
is not infallible, as is illustrated by the following example. An experienced
polygraph examiner was asked to make an overall judgment for each of a
total of 280 records, of which 140 were from guilty suspects and 140 from inno-
cent suspects. The results are listed in Table 26.2. We view each judgment
as a problem of hypothesis testing, with the null hypothesis corresponding to
“suspect is innocent” and the alternative hypothesis to “suspect is guilty.”
Estimate the probabilities of a type I error and a type II error that apply to
this polygraph method on the basis of Table 26.2.
26.2 Consider the testing problem in Exercise 25.11. Compute the probability
of committing a type II error if the true value of µ is 1.
26.3  One generates a number x from a uniform distribution on the interval
[0, θ]. One decides to test H0 : θ = 2 against H1 : θ ≠ 2 by rejecting H0 if
x ≤ 0.1 or x ≥ 1.9.
a. Compute the probability of committing a type I error.
b. Compute the probability of committing a type II error if the true value
of θ is 2.5.
26.4 To investigate the hypothesis that a horse’s chances of winning an eight-
horse race on a circular track are affected by its position in the starting lineup,
Table 26.2. Examiners and suspects.

                             Suspect’s true status
                             Innocent    Guilty
Examiner’s    Acquitted        131          15
assessment    Convicted          9         125
Source: F.S. Horvath and J.E. Reid. The reliability of polygraph examiner
diagnosis of truth and deception. Journal of Criminal Law, Criminology,
and Police Science, 62(2):276–281, 1971.
the starting position of each of 144 winners was recorded ([30]). It turned out
that 29 of these winners had starting position one (closest to the rail on the
inside track). We model the number of winners with starting position one by
a random variable T with a Bin(144, p) distribution. We test the hypothesis
H0 : p = 1/8 against H1 : p > 1/8 at level α = 0.01 with T as test statistic.
a. Argue whether the test procedure involves a right critical value, a left
critical value, or both.
b. Use the normal approximation to compute the critical value(s) correspond-
ing to α = 0.01, determine the critical region, and report your conclusion
about the null hypothesis.
26.5  Recall Exercises 23.5 and 24.8 about the 1500 m speed-skating results
in the 2002 Winter Olympic Games. The number of races won by skaters
starting in the outer lane is modeled by a random variable X with a Bin(23, p)
distribution. The question of whether there is an outer lane advantage was
investigated in Exercise 24.8 by means of constructing confidence intervals
using the normal approximation. In this exercise we examine this question by
testing the null hypothesis H0 : p = 1/2 against H1 : p > 1/2 using X as the
test statistic. The distribution of X under H0 is given in Table 26.3. Out of
23 completed races, 15 were won by skaters starting in the outer lane.
a. Compute the p-value corresponding to x = 15 and report your conclusion
if we perform the test at level 0.05. Does your conclusion agree with the
confidence interval you found for p in Exercise 24.8 b?
b. Determine the critical region corresponding to significance level α = 0.05.
c. Compute the probability of committing a type I error if we base our
decision rule on the critical region determined in b.
Table 26.3. Left tail probabilities for the Bin(23, 1/2) distribution.

k   P(X ≤ k)     k   P(X ≤ k)     k   P(X ≤ k)
0    0.0000      8    0.1050     16    0.9827
1    0.0000      9    0.2024     17    0.9947
2    0.0000     10    0.3388     18    0.9987
3    0.0002     11    0.5000     19    0.9998
4    0.0013     12    0.6612     20    1.0000
5    0.0053     13    0.7976     21    1.0000
6    0.0173     14    0.8950     22    1.0000
7    0.0466     15    0.9534     23    1.0000
d. Use the normal approximation to determine the probability of committing
a type II error for the case p = 0.6, if we base our decision rule on the
critical region determined in b.
26.6  Consider Exercises 25.2 and 25.7. One decides to test H0 : µ = 1472
against H1 : µ > 1472 at level α = 0.05 on the basis of the recorded value
1718 of the test statistic T .
a. Argue whether the test procedure involves a right critical value, a left
critical value, or both.
b. Use the fact that the distribution of T can be approximated by an N(µ, µ)
distribution to determine the critical value(s) and the critical region, and
report your conclusion about the null hypothesis.
26.7 A random sample X1, X2 is drawn from a uniform distribution on the
interval [0, θ]. We wish to test H0 : θ = 1 against H1 : θ < 1 by rejecting if
X1 + X2 ≤ c. Find the value of c and the critical region that correspond to a
level of significance 0.05.
Hint: use Exercise 11.5.
26.8  This exercise is meant to illustrate that the shape of the critical region
is not necessarily similar to the type of alternative hypothesis. The type of
alternative hypothesis and the test statistic used determine the shape of the
critical region.
Suppose that X1, X2, . . . , Xn form a random sample from an Exp(λ) distri-
bution, and we test H0 : λ = 1 with test statistics T = X̄n and T′ = e^(−X̄n).
a. Suppose we test the null hypothesis against H1 : λ > 1. Determine for
both test procedures whether they involve a right critical value, a left
critical value, or both.
b. Same question as in part a, but now test against H1 : λ ≠ 1.
26.9  Similar to Exercise 26.8, but with a random sample X1, X2, . . . , Xn
from an N(µ, 1) distribution. We test H0 : µ = 0 with test statistics T = (X̄n)²
and T′ = 1/X̄n.
a. Suppose that we test the null hypothesis against H1 : µ ≠ 0. Determine
the shape of the critical region for both test procedures.
b. Same question as in part a, but now test against H1 : µ > 0.
27
The t-test
In many applications the quantity of interest can be represented by the ex-
pectation of the model distribution. In some of these applications one wants
to know whether this expectation deviates from some a priori specified value.
This can be investigated by means of a statistical test, known as the t-test.
We consider this test both under the assumption that the model distribution
is normal and without the assumption of normality. Furthermore, we discuss a
similar test for the slope and the intercept in a simple linear regression model.
27.1 Monitoring the production of ball bearings
The production lines in a large industrial corporation are set to produce a spe-
cific type of steel ball bearing with a diameter of 1 millimeter. In order to
check the performance of the production lines, a number of ball bearings are
picked at the end of the day and their diameters are measured. Suppose we ob-
serve 20 diameters of ball bearings from the production lines, which are listed
in Table 27.1. The average diameter is x̄20 = 1.03 millimeter. This clearly
deviates from the target value 1, but the question is whether the difference
can be attributed to chance or whether it is large enough to conclude that
the production line is producing ball bearings with a wrong diameter. To an-
swer this question, we model the dataset as a realization of a random sample
X1, X2, . . . , X20 from a probability distribution with expected value µ. The
parameter µ represents the diameter of ball bearings produced by the production
lines. In order to investigate whether this diameter deviates from 1, we
test the null hypothesis H0 : µ = 1 against H1 : µ ≠ 1.

Table 27.1. Diameters of ball bearings.

1.018 1.009 1.042 1.053 0.969 1.002 0.988 1.019 1.062 1.032
1.072 0.977 1.062 1.044 1.069 1.029 0.979 1.096 1.079 0.999
This example illustrates a situation that often occurs: the data x1, x2, . . . , xn
are a realization of a random sample X1, X2, . . . , Xn from a distribution with
expectation µ, and we want to test whether µ equals an a priori specified value,
say µ0. According to the law of large numbers, X̄n is close to µ for large n.
This suggests a test statistic based on X̄n − µ0; realizations of X̄n − µ0 close
to zero are in favor of the null hypothesis. Does X̄n − µ0 suffice as a test
statistic?
In our example, x̄n − µ0 = 1.03 − 1 = 0.03. Should we interpret this as small?
First, note that under the null hypothesis E[X̄n − µ0] = µ − µ0 = 0. Now, if
X̄n − µ0 had standard deviation 1, then the value 0.03 would be within one
standard deviation of E[X̄n − µ0]. The “µ ± a few σ” rule on page 185 then
suggests that the value 0.03 is not exceptional; it must be seen as a small
deviation. On the other hand, if X̄n − µ0 has standard deviation 0.001, then
the value 0.03 is 30 standard deviations away from E[X̄n − µ0]. According to
the “µ ± a few σ” rule this is very exceptional; the value 0.03 must be seen
as a large deviation. The next quick exercise provides a concrete example.
Quick exercise 27.1 Suppose that X̄n is a normal random variable with
expectation 1 and variance 1. Determine P(X̄n − 1 ≥ 0.03). Find the same
probability, but for the case where the variance is (0.01)².
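The two probabilities in this quick exercise can be evaluated directly; a minimal Python sketch (not from the text) illustrates why 0.03 is unremarkable in the first case and extreme in the second.

```python
# Sketch: P(Xbar - 1 >= 0.03) for Xbar ~ N(1, 1) and for Xbar ~ N(1, (0.01)^2).
from scipy.stats import norm

print(norm.sf(0.03, loc=0, scale=1))      # about 0.488: a tiny fraction of one sd
print(norm.sf(0.03, loc=0, scale=0.01))   # about 0.0013: three sds away
```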
This discussion illustrates that we must standardize X̄n − µ0 to incorporate
its variation. Recall that
Var(X̄n − µ0) = Var(X̄n) = σ²/n,
where σ² is the variance of each Xi. Hence, standardizing X̄n − µ0 means
that we should divide by σ/√n. Since σ is unknown, we substitute the sample
standard deviation Sn for σ. This leads to the following test statistic for the
null hypothesis H0 : µ = µ0:
T = (X̄n − µ0) / (Sn/√n).
Values of T close to zero are in favor of H0 : µ = µ0. Large positive values of
T suggest that µ > µ0 and large negative values suggest that µ < µ0; both
are evidence against H0.
For the ball bearing data one finds that sn = 0.0372, so that
t = (x̄n − µ0)/(sn/√n) = (1.03 − 1)/(0.0372/√20) = 3.607.
This is clearly different from zero, but the question is whether this difference
is large enough to reject H0 : µ = 1. To answer this question, we need to know
the probability distribution of T under the null hypothesis. Note that under
the null hypothesis H0 : µ = µ0, the test statistic
T = (X̄n − µ0)/(Sn/√n)
is the studentized mean (see also Chapter 23)
(X̄n − µ)/(Sn/√n).
Hence, under the null hypothesis, the probability distribution of T is the same
as that of the studentized mean.
27.2 The one-sample t-test
The classical assumption is that the dataset is a realization of a random sample
from an N(µ, σ²) distribution. In that case our test statistic T turns out to
have a t-distribution under the null hypothesis, as we will see later. For this
reason, the test for the null hypothesis H0 : µ = µ0 is called the (one-sample)
t-test. Without the assumption of normality, we will use the bootstrap to
approximate the distribution of T . For large sample sizes, this distribution
can be approximated by means of the central limit theorem. We start with
the first case.
Normal data
Suppose that the dataset x1, x2, . . . , xn is a realization of a random sample
X1, X2, . . . , Xn from an N(µ, σ²) distribution. Then, according to the rule on
page 349, the studentized mean has a t(n − 1) distribution. An immediate
consequence is that, under the null hypothesis H0 : µ = µ0, also our test
statistic T has a t(n − 1) distribution. Therefore, if we test H0 : µ = µ0
against H1 : µ ≠ µ0 at level α, then we must reject the null hypothesis in
favor of H1 : µ ≠ µ0, if
T ≤ −tn−1,α/2 or T ≥ tn−1,α/2.
Similar decision rules apply to alternatives H1 : µ > µ0 and H1 : µ < µ0.
Suppose that in the ball bearing example we test H0 : µ = 1 against H1 :
µ ≠ 1 at level α = 0.05. From Table B.2 we find t19,0.025 = 2.093. Hence, we
must reject if T ≤ −2.093 or T ≥ 2.093. For the ball bearing data we found
t = 3.607, which means we reject the null hypothesis at level α = 0.05.
Alternatively, one might report the one-tailed p-value corresponding to the
observed value t and compare this with α/2. The one-tailed p-value is ei-
ther a right or a left tail probability, which must be computed by means
of the t(n − 1) distribution. In our ball bearing example the one-tailed p-
value is the right tail probability P(T ≥ 3.607). From Table B.2 we see
that this probability is between 0.0005 and 0.0010, which is smaller than
α/2 = 0.025 (to be precise, by means of a statistical software package we
found P(T ≥ 3.607) = 0.00094). The data provide strong enough evidence
against the null hypothesis, so that it seems sensible to adjust the settings of
the production line.
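As an aside, the whole computation for the ball bearing data can be reproduced with standard software. The Python sketch below (illustrative, not the book's code) applies scipy's one-sample t-test to the data of Table 27.1 and also computes the statistic directly.

```python
# Sketch: one-sample t-test of H0: mu = 1 for the ball bearing diameters.
import numpy as np
from scipy import stats

diameters = np.array([
    1.018, 1.009, 1.042, 1.053, 0.969, 1.002, 0.988, 1.019, 1.062, 1.032,
    1.072, 0.977, 1.062, 1.044, 1.069, 1.029, 0.979, 1.096, 1.079, 0.999,
])

t_stat, p_two_sided = stats.ttest_1samp(diameters, popmean=1.0)
print(round(t_stat, 3), round(p_two_sided, 4))   # t about 3.6, two-sided p about 0.002

# the same statistic computed directly
n = len(diameters)
t_manual = (diameters.mean() - 1.0) / (diameters.std(ddof=1) / np.sqrt(n))
print(round(t_manual, 3))
```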
Quick exercise 27.2 Suppose that the data in Table 27.1 are from two
separate production lines. The first ten measurements have average 1.0194 and
standard deviation 0.0290, whereas the last ten measurements have average
1.0406 and standard deviation 0.0428. Perform the t-test H0 : µ = 1 against
H1 : µ ≠ 1 at level α = 0.01 for both datasets separately, assuming normality.
Nonnormal data
Draw a rectangle with height h and width w (let us agree that w > h), and
within this rectangle draw a square with sides of length h (see Figure 27.1).
This creates another (smaller) rectangle with horizontal and vertical sides of
[Fig. 27.1. Rectangle with square within: a rectangle of height h and width w with a square of side h drawn inside it, leaving a smaller rectangle with sides of lengths w − h and h.]
lengths w − h and h. A large rectangle with a vertical-to-horizontal ratio that
is equal to the horizontal-to-vertical ratio for the small rectangle, i.e.,
h/w = (w − h)/h,
was called a “golden rectangle” by the ancient Greeks, who often used these in
their architecture. After solving for h/w, we obtain that the height-to-width
ratio h/w is equal to the “golden number” (√5 − 1)/2 ≈ 0.618. The data in
Table 27.2 represent corresponding h/w ratios for rectangles used by Shoshoni
Indians to decorate their leather goods. Is it reasonable to assume that they
were also using golden rectangles? We examine this by means of a t-test.

Table 27.2. Ratios for Shoshoni rectangles.

0.693 0.749 0.654 0.670 0.662 0.672 0.615 0.606 0.690 0.628
0.668 0.611 0.606 0.609 0.601 0.553 0.570 0.844 0.576 0.933

Source: C. Dubois (ed.). Lowie’s selected papers in anthropology, 1960.
The Regents of the University of California.
The observed ratios are modeled as a realization of a random sample from a
distribution with expectation µ, where the parameter µ represents the true
esthetic preference for height-to-width ratios of the Shoshoni Indians. We want
to test
H0 : µ = 0.618 against H1 : µ ≠ 0.618.
For the Shoshoni ratios, x̄n = 0.6605 and sn = 0.0925, so that the value of
the test statistic is
t = (x̄n − 0.618)/(sn/√n) = (0.6605 − 0.618)/(0.0925/√20) = 2.055.
Closer examination of the data indicates that the normal distribution is not
the right model. For instance, by definition the height-to-width ratios h/w
are always between 0 and 1. Because some of the data points are also close
to the right boundary 1, the normal distribution is inappropriate. If we cannot
assume a normal model distribution, we can no longer conclude that our test
statistic has a t(n − 1) distribution under the null hypothesis.
Since there is no reason to assume any other particular type of distribution
to model the data, we approximate the distribution of T under the null hy-
pothesis. Recall that this distribution is the same as that of the studentized
mean (see the end of Section 27.1). To approximate its distribution, we use
the empirical bootstrap simulation for the studentized mean, as described
on page 351. We generate 10 000 bootstrap datasets and for each bootstrap
dataset x∗1, x∗2, . . . , x∗n, we compute
t∗ = (x̄∗n − 0.6605)/(s∗n/√n).
In Figure 27.2 the kernel density estimate and empirical distribution function
are displayed for 10 000 bootstrap values t∗. Suppose we test H0 : µ = 0.618
against H1 : µ ≠ 0.618 at level α = 0.05. In the same way as in Section 23.3,
we find the following bootstrap approximations for the critical values:
c∗l = −3.334 and c∗u = 1.644.
[Fig. 27.2. Kernel density estimate and empirical distribution function of 10 000 bootstrap values t∗; the 0.025 and 0.975 quantiles of the empirical distribution function lie at −3.334 and 1.644.]
Since for the Shoshoni data the value 2.055 of the test statistic is greater
than 1.644, we reject the null hypothesis at level 0.05. Alternatively, we can
also compute a bootstrap approximation of the one-tailed p-value correspond-
ing to 2.055, which is the right tail probability P(T ≥ 2.055). The bootstrap
approximation for this probability is
(number of t∗ values greater than or equal to 2.055) / 10 000 = 0.0067.
Hence P(T ≥ 2.055) ≈ 0.0067, which is smaller than α/2 = 0.025. The value
2.055 should be considered as exceptionally large, and we reject the null hy-
pothesis. The esthetic preference for height-to-width ratios of the Shoshoni
Indians differs from that of the ancient Greeks.
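As a computational aside, this bootstrap p-value is nothing more than the fraction of bootstrap values that are at least as large as the observed test statistic. A minimal Python sketch (NumPy), assuming tstar is an array holding the bootstrap values t∗ and t_obs the observed value; the function name is ours:

import numpy as np

def bootstrap_right_tail_pvalue(tstar, t_obs):
    # fraction of bootstrap values greater than or equal to the observed value
    return np.mean(np.asarray(tstar) >= t_obs)

# e.g., bootstrap_right_tail_pvalue(tstar, 2.055) gave 0.0067 for the 10 000
# Shoshoni bootstrap values described above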
Large samples
For large sample sizes the distribution of the studentized mean can be ap-
proximated by a standard normal distribution (see Section 23.4). This means
that for large sample sizes the distribution of the t-test statistic under the
null hypothesis can also be approximated by a standard normal distribution.
To illustrate this, recall the Old Faithful data. Park rangers in Yellowstone
National Park inform the public about the behavior of the geyser, such as the
expected time between successive eruptions and the length of the duration of
an eruption. Suppose they claim that the expected length of an eruption is
4 minutes (240 seconds). Does this seem likely on the basis of the data from
Section 15.1? We investigate this by testing H0 : µ = 240 against H1 : µ ≠ 240
at level α = 0.001, where µ is the expectation of the model distribution. The
value of the test statistic is
t = (x̄n − 240) / (sn/√n) = (209.3 − 240) / (68.48/√272) = −7.39.
The one-tailed p-value P(T ≤ −7.39) can be approximated by P(Z ≤ −7.39),
where Z has an N(0, 1) distribution. From Table B.1 we see that this probabil-
ity is smaller than P(Z ≤ −3.49) = 0.0002. This is smaller than α/2 = 0.0005,
so we reject the null hypothesis at level 0.001. In fact the p-value is much
smaller: a statistical software package gives P(Z ≤ −7.39) = 7.5 · 10⁻¹⁴. The
data provide overwhelming evidence against H0 : µ = 240, so that we conclude
that the expected length of an eruption is different from 4 minutes.
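As an illustration of this computation, here is a minimal Python sketch (SciPy), using only the summary statistics quoted above (x̄n = 209.3, sn = 68.48, n = 272); the function name is ours:

from math import sqrt
from scipy.stats import norm

def t_statistic(xbar, s, n, mu0):
    # studentized mean (x̄n − µ0)/(sn/√n)
    return (xbar - mu0) / (s / sqrt(n))

t = t_statistic(209.3, 68.48, 272, 240)   # about −7.39
p_left = norm.cdf(t)                      # normal approximation of P(T ≤ t)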
Quick exercise 27.3 Compute the critical region K for the test, using the
normal approximation, and check that t = −7.39 falls in K.
In fact, if we were to test H0 : µ = 240 against H1 : µ < 240, the p-value
corresponding to t = −7.39 is the left tail probability P(T ≤ −7.39). This
probability is very small, so that we also reject the null hypothesis in favor
of this alternative and conclude that the expected length of an eruption is
smaller than 4 minutes.
27.3 The t-test in a regression setting
Is calcium in your drinking water good for your health? In England and Wales,
an investigation of environmental causes of disease was conducted. The annual
mortality rate (percentage of deaths) and the calcium concentration in the
drinking water supply were recorded for 61 large towns. The data in Table 27.3
represent the annual mortality rate averaged over the years 1958–1964, and
the calcium concentration in parts per million. In Figure 27.3 the 61 paired
measurements are displayed in a scatterplot. The scatterplot shows a slight
downward trend, which suggests that higher concentrations of calcium lead
to lower mortality rates. The question is whether this is really the case or if
the slight downward trend should be attributed to chance.
To investigate this question we model the mortality data by means of a simple
linear regression model with normally distributed errors, with the mortality
rate as the dependent variable y and the calcium concentration as the inde-
pendent variable x:
Yi = α + βxi + Ui for i = 1, 2, . . . , 61,
where U1, U2, . . . , U61 is a random sample from an N(0, σ²) distribution. The
parameter β represents the change of the mortality rate if we increase the
calcium concentration by one unit. We test the null hypothesis H0 : β = 0
(calcium has no effect on the mortality rate) against H1 : β < 0 (higher
concentration of calcium reduces the mortality rate).
This example illustrates the general situation, where the dataset
(x1, y1), (x2, y2), . . . , (xn, yn)
Table 27.3. Mortality data.
Rate Calcium Rate Calcium Rate Calcium Rate Calcium
1247 105 1466 5 1299 78 1359 84
1392 73 1307 78 1254 96 1318 122
1260 21 1096 138 1402 37 1309 59
1259 133 1175 107 1486 5 1456 90
1236 101 1369 68 1257 50 1527 60
1627 53 1486 122 1485 81 1519 21
1581 14 1625 13 1668 17 1800 14
1609 18 1558 10 1807 15 1637 10
1755 12 1491 20 1555 39 1428 39
1723 44 1379 94 1742 8 1574 9
1569 91 1591 16 1772 15 1828 8
1704 26 1702 44 1427 27 1724 6
1696 6 1711 13 1444 14 1591 49
1987 8 1495 14 1587 75 1713 71
1557 13 1640 57 1709 71 1625 20
1378 71
Source: M. Hills and the M345 Course Team. M345 Statistical Methods,
Units 3: Examining Straight-line Data, 1986, Milton Keynes: Open Uni-
versity, 28. Data provided by Professor M.J. Gardner, Medical Research Coun-
cil Environmental Epidemiology Research Unit, Southampton.
Fig. 27.3. Scatterplot of the mortality data (mortality rate against calcium concentration in ppm).
is modeled by a simple linear regression model, and one wants to test a null
hypothesis of the form H0 : α = α0 or H0 : β = β0. Similar to the one-sample
t-test we will construct a test statistic for each of these null hypotheses. With
normally distributed errors, these test statistics have a t-distribution under
the null hypothesis. For this reason, for both null hypotheses the test is called
a t-test.
The t-test for the slope
For the null hypothesis H0 : β = β0, we use as test statistic
Tb = (β̂ − β0) / Sb,
where β̂ is the least squares estimator for β (see Chapter 22) and
S²b = n σ̂² / (n Σ xi² − (Σ xi)²).
In this expression,
σ̂² = (1/(n − 2)) Σᵢ₌₁ⁿ (Yi − α̂ − β̂xi)²
is the estimator for σ² as introduced on page 332. It can be shown that
Var(β̂ − β0) = n σ² / (n Σ xi² − (Σ xi)²),
so that the random variable S²b is an estimator for the variance of β̂ − β0.
Hence, similar to the test statistic for the one-sample t-test, the test statistic Tb
compares the estimator β̂ with the value β0 and standardizes by dividing by
an estimator for the standard deviation of β̂ − β0. Values of Tb close to zero
are in favor of the null hypothesis H0 : β = β0. Large positive values of Tb
suggest that β > β0, whereas large negative values of Tb suggest that β < β0.
Recall that in the case of normal random samples the one-sample t-test statis-
tic has a t(n − 1) distribution under the null hypothesis. For the same reason,
it is also a fact that in the case of normally distributed errors the test statis-
tic Tb has a t(n − 2) distribution under the null hypothesis H0 : β = β0.
In our mortality example we want to test H0 : β = 0 against H1 : β < 0. For
the data we find β̂ = −3.2261 and sb = 0.4847, so that the value of Tb is
tb = −3.2261 / 0.4847 = −6.656.
If we test at level α = 0.05, then we must compare this value with the left
critical value −t59,0.05. This value is not in Table B.2, but we have that
−1.676 = −t50,0.05 < −t59,0.05.
This means that tb is much smaller than −t59,0.05, so that we reject the null hy-
pothesis at level 0.05. How much evidence the value tb = −6.656 bears against
the null hypothesis is expressed by the one-tailed p-value P(Tb ≤ −6.656).
From Table B.2 we can only see that this probability is smaller than 0.0005.
By means of a statistical package we find P(Tb ≤ −6.656) = 5.2 · 10⁻⁹. The
data provide overwhelming evidence against the null hypothesis. We conclude
that higher concentrations of calcium correspond to lower mortality rates.
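The computations in this example are easy to reproduce from the raw data. Below is a minimal Python sketch (NumPy) following the formulas for β̂, σ̂², and S²b given above; x and y are assumed to be arrays with the calcium concentrations and mortality rates, and the function name is ours:

import numpy as np

def slope_t_statistic(x, y, beta0=0.0):
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxx, sxy = np.sum(x * x), np.sum(x * y)
    denom = n * sxx - sx**2
    beta_hat = (n * sxy - sx * sy) / denom             # least squares slope
    alpha_hat = (sy - beta_hat * sx) / n               # least squares intercept
    sigma2_hat = np.sum((y - alpha_hat - beta_hat * x)**2) / (n - 2)
    sb = np.sqrt(n * sigma2_hat / denom)               # estimated st.dev. of β̂ − β0
    return (beta_hat - beta0) / sb                     # Tb

# For the mortality data this yields tb ≈ −6.656.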
Quick exercise 27.4 The data in Table 27.3 can be separated into measure-
ments for towns at least as far north as Derby and towns south of Derby. For
the data corresponding to 35 towns at least as far north as Derby, one finds
β̂ = −1.9313 and sb = 0.8479. Test H0 : β = 0 against H1 : β < 0 at level
0.01, i.e., compute the value of the test statistic and report your conclusion
about the null hypothesis.
The t-test for the intercept
We test the null hypothesis H0 : α = α0 with test statistic
Ta = (α̂ − α0) / Sa,    (27.1)
where α̂ is the least squares estimator for α and
S²a = (Σ xi²) σ̂² / (n Σ xi² − (Σ xi)²),
with σ̂² defined as before. The random variable S²a is an estimator for the
variance
Var(α̂ − α0) = (Σ xi²) σ² / (n Σ xi² − (Σ xi)²).
Again, we compare the estimator α̂ with the value α0 and standardize by
dividing by an estimator for the standard deviation of α̂ − α0. Values of Ta
close to zero are in favor of the null hypothesis H0 : α = α0. Large positive
values of Ta suggest that α > α0, whereas large negative values of Ta suggest
that α < α0. Like Tb, in the case of normal errors, the test statistic Ta has a
t(n − 2) distribution under the null hypothesis H0 : α = α0.
As an illustration, recall Exercise 17.9 where we modeled the volume y of
black cherry trees by means of a linear model without intercept, with inde-
pendent variable x = d²h, where d and h are the diameter and height of the
trees. The scatterplot of the pairs (x1, y1), (x2, y2), . . . , (x31, y31) is displayed
in Figure 27.4. As mentioned in Exercise 17.9, there are physical reasons to
leave out the intercept. We want to investigate whether this is confirmed by
the data. To this end, we model the data by a simple linear regression model
with intercept
Yi = α + βxi + Ui for i = 1, 2, . . . , 31,
where U1, U2, . . . , U31 are a random sample from an N(0, σ²) distribution, and
we test H0 : α = 0 against H1 : α ≠ 0 at level 0.10. The value of the test
statistic is
ta = −0.2977 / 0.9636 = −0.3089,
and the left critical value is −t29,0.05 = −1.699. This means we cannot reject
the null hypothesis. The data do not provide sufficient evidence against H0 :
α = 0, which is confirmed by the one-tailed p-value P(Ta ≤ −0.3089) = 0.3798
(computed by means of a statistical package). We conclude that the intercept
does not contribute significantly to the model.
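The intercept test can be computed in exactly the same way, only the standardization changes. A minimal Python sketch under the same assumptions as for the slope (x and y are NumPy arrays; the function name is ours):

import numpy as np

def intercept_t_statistic(x, y, alpha0=0.0):
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxx, sxy = np.sum(x * x), np.sum(x * y)
    denom = n * sxx - sx**2
    beta_hat = (n * sxy - sx * sy) / denom
    alpha_hat = (sy - beta_hat * sx) / n
    sigma2_hat = np.sum((y - alpha_hat - beta_hat * x)**2) / (n - 2)
    sa = np.sqrt(sxx * sigma2_hat / denom)   # S²a = (Σ xi²) σ̂² / (n Σ xi² − (Σ xi)²)
    return (alpha_hat - alpha0) / sa         # Ta

# For the black cherry tree data this yields ta ≈ −0.3089.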
Fig. 27.4. Scatterplot of the black cherry tree data.
27.4 Solutions to the quick exercises
27.1 If Y has an N(1, 1) distribution, then Y − 1 has an N(0, 1) distri-
bution. Therefore, from Table B.1: P(Y − 1 ≥ 0.03) = 0.4880. If Y has an
N(1, (0.01)²) distribution, then (Y − 1)/0.01 has an N(0, 1) distribution. In
that case,
P(Y − 1 ≥ 0.03) = P((Y − 1)/0.01 ≥ 3) = 0.0013.
27.2 For the first and last ten measurements the values of the test statistic
are
t = (1.0194 − 1) / (0.0290/√10) = 2.115 and
t = (1.0406 − 1) / (0.0428/√10) = 3.000.
The critical value t9,0.025 = 2.262, which means we reject the null hypothesis
for the second production line, but not for the first production line.
27.3 The critical region is of the form K = (−∞, cl] ∪ [cu, ∞). The right
critical value cu is approximated by z0.0005 = t∞,0.0005 = 3.291, which can be
found in Table B.2. By symmetry of the normal distribution, the left critical
value cl is approximated by −z0.0005 = −3.291. Clearly, t = −7.39 < −3.291,
so that it falls in K.
27.4 The value of the test statistic is
tb = −1.9313 / 0.8479 = −2.2778.
The left critical value is equal to −t33,0.01, which is not in Table B.2, but we
see that −t33,0.01 < −t40,0.01 = −2.423. This means that −t33,0.01 < tb, so
that we cannot reject H0 : β = 0 against H1 : β < 0 at level 0.01.
27.5 Exercises
27.1 We perform a t-test for the null hypothesis H0 : µ = 10 by means of
a dataset consisting of n = 16 elements with sample mean 11 and sample
variance 4. We use significance level 0.05.
a. Should we reject the null hypothesis in favor of H1 : µ ≠ 10?
b. What if we test against H1 : µ > 10?
27.2  The Cleveland Casting Plant is a large highly automated producer of
gray and nodular iron automotive castings for Ford Motor Company. One
process variable of interest to Cleveland Casting is the pouring tempera-
ture of molten iron. The pouring temperatures (in degrees Fahrenheit) of ten
crankshafts are given in Table 27.4. The target setting for the pouring tem-
perature is set at 2550 degrees. One wants to conduct a test at level α = 0.01
to determine whether the pouring temperature differs from the target setting.
Table 27.4. Pouring temperatures of ten crankshafts.
2543 2541 2544 2620 2560
2559 2562 2553 2552 2553
From A structural model relating process inputs and final product character-
istics, Quality Engineering, 1995, Vol. 7, No. 4, pp. 693–704, by B. Price and
B. Barth. Reproduced by permission of Taylor & Francis, Inc.,
http://www.taylorandfrancis.com
a. Formulate the appropriate null hypothesis and alternative hypothesis.
b. Compute the value of the test statistic and report your conclusion. You
may assume a normal model distribution and use that the sample variance
is 517.34.
27.3 Table 27.5 lists the results of tensile adhesion tests on 22 U-700 alloy
specimens. The data are loads at failure in MPa. The sample mean is 13.71
and the sample standard deviation is 3.55. You may assume that the data
originated from a normal distribution with expectation µ. One is interested
in whether the load at failure exceeds 10 MPa. We investigate this by means
of a t-test for the null hypothesis H0 : µ = 10.
a. What do you choose as the alternative hypothesis?
b. Compute the value of the test statistic and report your conclusion, when
performing the test at level 0.05.
Table 27.5. Loads at failure of U-700 specimens.
19.8 18.5 17.6 16.7 15.8
15.4 14.1 13.6 11.9 11.4
11.4 8.8 7.5 15.4 15.4
19.5 14.9 12.7 11.9 11.4
10.1 7.9
Source: C.C. Berndt. Instrumented Tensile adhesion tests on plasma sprayed
thermal barrier coatings. Journal of Materials Engineering II(4): 275-282,
Dec 1989. Springer-Verlag New York Inc.
27.4 Consider the coal data from Table 23.2, where 22 gross calorific value
measurements are listed for Daw Mill coal coded 258GB41. We modeled this
dataset as a realization of a random sample from an N(µ, σ²) distribution
with µ and σ unknown. We are planning to buy a shipment if the gross
calorific value exceeds 31.00 MJ/kg. The sample mean and sample variance
of the data are x̄n = 31.012 and sn = 0.1294. Perform a t-test for the null
hypothesis H0 : µ = 31.00 against H1 : µ > 31.00 using significance level 0.01,
i.e., compute the value of the test statistic, the critical value of the test, and
report your conclusion.
27.5  In the November 1988 issue of Science a study was reported on the
inbreeding of tropical swarm-founding wasps. Each member of a sample of
197 wasps was captured, frozen, and subjected to a series of genetic tests,
from which an inbreeding coefficient was determined. The sample mean and
the sample standard deviation of the coefficients are x̄197 = 0.044 and s197 =
0.884. If a species does not have the tendency to inbreed, their true inbreeding
coefficient is 0. Determine by means of a test whether the inbreeding coefficient
for this species of wasp exceeds 0.
a. Formulate the appropriate null hypothesis and alternative hypothesis and
compute the value of the test statistic.
b. Compute the p-value corresponding to the value of the test statistic and
report your conclusion about the null hypothesis.
27.6 The stopping distance of an automobile is related to its speed. The data
in Table 27.6 give the stopping distance in feet and speed in miles per hour
of an automobile. The data are modeled by means of a simple linear regression
model with normally distributed errors, with the square root of the stopping
distance as dependent variable y and the speed as independent variable x:
Yi = α + βxi + Ui, for i = 1, . . . , 7.
For the dataset we find
α̂ = 5.388, β̂ = 4.252, sa = 1.874, sb = 0.242.
Table 27.6. Speed and stopping distance of automobiles.
Speed 20.5 20.5 30.5 30.5 40.5 48.8 57.8
Distance 15.4 13.3 33.9 27.0 73.1 113.0 142.6
Source: K.A. Brownlee. Statistical theory and methodology in science and
engineering. Wiley, New York, 1960; Table II.9 on page 372.
One would expect that the intercept can be taken equal to 0, since zero speed
would yield zero stopping distance. Investigate whether this is confirmed by
the data by performing the appropriate test at level 0.10. Formulate the proper
null and alternative hypothesis, compute the value of the test statistic, and
report your conclusion.
27.7  In a study about the effect of wall insulation, the weekly gas con-
sumption (in 1000 cubic feet) and the average outside temperature (in de-
grees Celsius) were measured for a certain house in southeast England, for 26
weeks before and 30 weeks after cavity-wall insulation had been installed.
The house thermostat was set at 20 degrees throughout. The data are listed
in Table 27.7. We model the data before insulation by means of a simple lin-
ear regression model with normally distributed errors and gas consumption
as response variable. A similar model was used for the data after insulation.
Given are
Before insulation: α̂ = 6.8538, β̂ = −0.3932 and sa = 0.1184, sb = 0.0196
After insulation: α̂ = 4.7238, β̂ = −0.2779 and sa = 0.1297, sb = 0.0252.
a. Use the data before insulation to investigate whether smaller outside tem-
peratures lead to higher gas consumption. Formulate the proper null and
alternative hypothesis, compute the value of the test statistic, and report
your conclusion, using significance level 0.05.
b. Do the same for the data after insulation.
Table 27.7. Temperature and gas consumption.
Before insulation After insulation
Temperature Gas consumption Temperature Gas consumption
−0.8 7.2 −0.7 4.8
−0.7 6.9 0.8 4.6
0.4 6.4 1.0 4.7
2.5 6.0 1.4 4.0
2.9 5.8 1.5 4.2
3.2 5.8 1.6 4.2
3.6 5.6 2.3 4.1
3.9 4.7 2.5 4.0
4.2 5.8 2.5 3.5
4.3 5.2 3.1 3.2
5.4 4.9 3.9 3.9
6.0 4.9 4.0 3.5
6.0 4.3 4.0 3.7
6.0 4.4 4.2 3.5
6.2 4.5 4.3 3.5
6.3 4.6 4.6 3.7
6.9 3.7 4.7 3.5
7.0 3.9 4.9 3.4
7.4 4.2 4.9 3.7
7.5 4.0 4.9 4.0
7.5 3.9 5.0 3.6
7.6 3.5 5.3 3.7
8.0 4.0 6.2 2.8
8.5 3.6 7.1 3.0
9.1 3.1 7.2 2.8
10.2 2.6 7.5 2.6
8.0 2.7
8.7 2.8
8.8 1.3
9.7 1.5
Source: MDST242 Statistics in Society, Unit 45: Review, 2nd edition, 1984,
Milton Keynes: The Open University, Figures 2.5 and 2.6.
28
Comparing two samples
Many applications are concerned with two groups of observations of the same
kind that originate from two possibly different model distributions, and the
question is whether these distributions have different expectations. We de-
scribe a test for equality of expectations, where we consider normal and non-
normal model distributions and equal and unequal variances of the model
distributions.
28.1 Is dry drilling faster than wet drilling?
Recall the drilling example from Sections 15.5 and 16.4. The question was
whether dry drilling is faster than wet drilling. The scatterplots in Figure 15.11
seem to suggest that up to a depth of 250 feet the drill time does not depend
on depth. Therefore, for a first investigation of a possible difference between
dry and wet drilling we only consider the (mean) drill times up to this depth.
A more thorough study can be found in [23].
The boxplots of the drill times for both types of drilling are displayed in
Figure 28.1. Clearly, the boxplot for dry drilling is positioned lower than the
Fig. 28.1. Boxplot of drill times.
one for wet drilling. However, the question is whether this difference can be
attributed to chance or if it is large enough to conclude that the dry drill
time is shorter than the wet drill time. To answer this question, we model the
datasets of dry and wet drill times as realizations of random samples from
two distribution functions F and G, one with expected value µ1 and the other
with expected value µ2. The parameters µ1 and µ2 represent the drill times
of dry drilling and wet drilling, respectively. We test H0 : µ1 = µ2 against
H1 : µ1 < µ2.
This example illustrates a general situation where we compare two datasets
x1, x2, . . . , xn and y1, y2, . . . , ym,
which are the realization of independent random samples
X1, X2, . . . , Xn and Y1, Y2, . . . , Ym
from two distributions, and we want to test whether the expectations of both
distributions are the same. Both the variance σ²X of the Xi and the variance
σ²Y of the Yj are unknown.
Note that the null hypothesis is equivalent to the statement µ1 − µ2 = 0. For
this reason, similar to Chapter 27, the test statistic for the null hypothesis
H0 : µ1 = µ2 is based on an estimator X̄n − Ȳm for the difference µ1 − µ2. As
before, we standardize X̄n − Ȳm by an estimator for its variance
Var(X̄n − Ȳm) = σ²X/n + σ²Y/m.
Recall that the sample variances S²X and S²Y of the Xi and Yj are unbiased
estimators for σ²X and σ²Y. We will use a combination of S²X and S²Y to
construct an estimator for Var(X̄n − Ȳm). The actual standardization of X̄n − Ȳm
depends on whether the variances of the Xi and Yj are the same. We distinguish
between the two cases σ²X = σ²Y and σ²X ≠ σ²Y. In the next section we
consider the case of equal variances.
Quick exercise 28.1 Looking at the boxplots in Figure 28.1, does the assumption
σ²X = σ²Y seem reasonable to you? Can you think of a way to quantify your belief?
28.2 Two samples with equal variances
Suppose that the samples originate from distributions with the same (but
unknown) variance:
σ²X = σ²Y = σ².
In this case we can pool the sample variances S²X and S²Y by constructing
a linear combination aS²X + bS²Y that is an unbiased estimator for σ². One
particular choice is the weighted average
((n − 1)S²X + (m − 1)S²Y) / (n + m − 2).
It has the property that for normally distributed samples it has the smallest
variance among all unbiased linear combinations of S²X and S²Y (see Exercise 28.5).
Moreover, the weights depend on the sample sizes. This is appropriate, since if
one sample is much larger than the other, the estimate of σ² from that sample
is more reliable and should receive greater weight.
We find that the pooled variance
S²p = ((n − 1)S²X + (m − 1)S²Y) / (n + m − 2) · (1/n + 1/m)
is an unbiased estimator for
Var(X̄n − Ȳm) = σ² (1/n + 1/m).
This leads to the following test statistic for the null hypothesis H0 : µ1 = µ2:
Tp = (X̄n − Ȳm) / Sp.
As before, we compare the estimator X̄n − Ȳm with 0 (the value of µ1 − µ2
under the null hypothesis), and we standardize by dividing by the estimator Sp
for the standard deviation of X̄n − Ȳm. Values of Tp close to zero are in favor
of the null hypothesis H0 : µ1 = µ2. Large positive values of Tp suggest that
µ1 > µ2, whereas large negative values suggest that µ1 < µ2.
The next step is to determine the distribution of Tp. Note that under the null
hypothesis H0 : µ1 = µ2, the test statistic Tp is the pooled studentized mean
difference
((X̄n − Ȳm) − (µ1 − µ2)) / Sp.
Hence, under the null hypothesis, the probability distribution of Tp is the
same as that of the pooled studentized mean difference. To determine its
distribution, we distinguish between normal and nonnormal data.
Normal samples
In the same way as the studentized mean of a single normal sample has a
t(n − 1) distribution (see page 349), it is also a fact that if two independent
samples originate from normal distributions, i.e.,
X1, X2, . . . , Xn random sample from N(µ1, σ²)
Y1, Y2, . . . , Ym random sample from N(µ2, σ²),
then the pooled studentized mean difference has a t(n + m − 2) distribution.
Hence, under the null hypothesis, the test statistic Tp has a t(n + m − 2)
distribution. For this reason, a test for the null hypothesis H0 : µ1 = µ2 is
called a two-sample t-test.
Suppose that in our drilling example we model our datasets as realizations
of random samples of sizes n = m = 50 from two normal distributions with
equal variances, and we test H0 : µ1 = µ2 against H1 : µ1 < µ2 at level 0.05.
For the data we find x̄50 = 727.78, ȳ50 = 873.02, and sp = 13.62, so that
tp = (727.78 − 873.02) / 13.62 = −10.66.
We compare this with the left critical value −t98,0.05. This value is not in
Table B.2, but −1.676 = −t50,0.05 < −t98,0.05. This means that tp < −t98,0.05,
so that we reject H0 : µ1 = µ2 in favor of H1 : µ1 < µ2 at level 0.05. The p-
value corresponding to tp = −10.66 is the left tail probability P(T ≤ −10.66).
From Table B.2 we can only see that this is smaller than 0.0005 (a statistical
software package gives P(T ≤ −10.66) = 2.25 · 10⁻¹⁸). The data provide over-
whelming evidence against the null hypothesis, so that we conclude that dry
drilling is faster than wet drilling.
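A minimal Python sketch of this pooled two-sample t-test (NumPy/SciPy), assuming x and y are arrays with the dry and wet drill times; the function name is ours and Sp follows the definition above:

import numpy as np
from scipy.stats import t as t_dist

def pooled_t_test(x, y):
    n, m = len(x), len(y)
    sp2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) \
          / (n + m - 2) * (1/n + 1/m)
    tp = (x.mean() - y.mean()) / np.sqrt(sp2)
    p_left = t_dist.cdf(tp, df=n + m - 2)   # left tail p-value for normal samples
    return tp, p_left

# For the drill times this yields tp ≈ −10.66.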
Quick exercise 28.2 Suppose that in the ball bearing example of Quick
exercise 27.2, we test H0 : µ1 = µ2 against H1 : µ1 ≠ µ2, where µ1 and µ2
represent the diameters of a ball bearing from the first and second production
line. What are the critical values corresponding to level α = 0.01?
Nonnormal samples
Similar to the one-sample t-test, if we cannot assume normal model distribu-
tions, then we can no longer conclude that our test statistic has a t(n + m − 2)
distribution under the null hypothesis. Recall that under the null hypothesis,
the distribution of our test statistic is the same as that of the pooled studen-
tized mean difference (see page 417).
To approximate its distribution, we use the empirical bootstrap simulation
for the pooled studentized mean difference
((X̄n − Ȳm) − (µ1 − µ2)) / Sp.
Given datasets x1, x2, . . . , xn and y1, y2, . . . , ym, determine their empirical dis-
tribution functions Fn and Gm as estimates for F and G. The expectations
corresponding to Fn and Gm are µ∗1 = x̄n and µ∗2 = ȳm. Then repeat the
following two steps many times:
1. Generate a bootstrap dataset x∗1, x∗2, . . . , x∗n from Fn and a bootstrap
   dataset y∗1, y∗2, . . . , y∗m from Gm.
2. Compute the pooled studentized mean difference for the bootstrap data:
   t∗p = ((x̄∗n − ȳ∗m) − (x̄n − ȳm)) / s∗p,
   where x̄∗n and ȳ∗m are the sample means of the bootstrap datasets, and
   (s∗p)² = ((n − 1)(s∗X)² + (m − 1)(s∗Y)²) / (n + m − 2) · (1/n + 1/m)
   with (s∗X)² and (s∗Y)² the sample variances of the bootstrap datasets.
The reason that in each iteration we subtract x̄n − ȳm is that µ1 − µ2 is
the difference of the expectations of the two model distributions. Therefore,
according to the bootstrap principle we should replace this by the difference
x̄n − ȳm of the expectations corresponding to the two empirical distribution
functions.
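A minimal Python sketch of this bootstrap procedure (NumPy), assuming x and y are arrays with the two datasets; the number of iterations, the random generator, and the function name are our choices:

import numpy as np

def bootstrap_pooled_tstar(x, y, iterations=1000, seed=None):
    rng = np.random.default_rng(seed)
    n, m = len(x), len(y)
    diff = x.mean() - y.mean()                    # x̄n − ȳm replaces µ1 − µ2
    tstar = np.empty(iterations)
    for i in range(iterations):
        xs = rng.choice(x, size=n, replace=True)  # bootstrap dataset from Fn
        ys = rng.choice(y, size=m, replace=True)  # bootstrap dataset from Gm
        sp2 = ((n - 1) * xs.var(ddof=1) + (m - 1) * ys.var(ddof=1)) \
              / (n + m - 2) * (1/n + 1/m)
        tstar[i] = (xs.mean() - ys.mean() - diff) / np.sqrt(sp2)
    return tstar

# Left critical value at level 0.05: np.quantile(bootstrap_pooled_tstar(x, y), 0.05)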
We carried out this bootstrap simulation for the drill times. The result of this
simulation can be seen in Figure 28.2, where a histogram and the empirical
distribution function are displayed for one thousand bootstrap values of t∗p.
Suppose that we test H0 : µ1 = µ2 against H1 : µ1 < µ2 at level 0.05. The
bootstrap approximation for the left critical value is c∗l = −1.659. The value
of tp = −10.66, computed from the data, is much smaller. Hence, also on the
basis of the bootstrap simulation we reject the null hypothesis and conclude
that the dry drill time is shorter than the wet drill time.
Fig. 28.2. Histogram and empirical distribution function of 1000 bootstrap values for T∗p.
28.3 Two samples with unequal variances
During an investigation about weather modification, a series of experiments
was conducted in southern Florida from 1968 to 1972. These experiments
were designed to investigate the use of massive silver-iodide seeding. It was
Table 28.1. Rainfall data.
Unseeded
1202.6 830.1 372.4 345.5 321.2 244.3
163.0 147.8 95.0 87.0 81.2 68.5
47.3 41.1 36.6 29.0 28.6 26.3
26.1 24.4 21.7 17.3 11.5 4.9
4.9 1.0
Seeded
2745.6 1697.8 1656.0 978.0 703.4 489.1
430.0 334.1 302.8 274.7 274.7 255.0
242.5 200.7 198.6 129.6 119.0 118.3
115.3 92.4 40.6 32.7 31.4 17.5
7.7 4.1
Source: J. Simpson, A. Olsen, and J.C. Eden. A Bayesian analysis of a mul-
tiplicative treatment effect in weather modification. Technometrics, 17:161–
166, 1975; Table 1 on page 162.
hypothesized that under specified conditions, this leads to invigorated cumulus
growth and prolonged lifetimes, thereby causing increased precipitation. In
these experiments, 52 isolated cumulus clouds were observed, of which 26 were
selected at random and injected with silver-iodide smoke. Rainfall amounts
(in acre-feet) were recorded for all clouds. They are listed in Table 28.1. To
investigate whether seeding leads to increased rainfall, we test H0 : µ1 = µ2
against H1 : µ1 < µ2, where µ1 and µ2 represent the rainfall for unseeded and
seeded clouds.
In Figure 28.3 the boxplots of both datasets are displayed. From this we
see that the assumption of equal variances may not be realistic. Indeed, this
is confirmed by the values s²X = 77 521 and s²Y = 423 524 of the sample
variances of the datasets. This means that we need to test H0 : µ1 = µ2
without the assumption of equal variances. As before, the test statistic will be
a standardized version of X̄n − Ȳm, but S²p is no longer an unbiased estimator for
Var(X̄n − Ȳm) = σ²X/n + σ²Y/m.
However, if we estimate σ²X and σ²Y by S²X and S²Y, then the nonpooled variance
S²d = S²X/n + S²Y/m
is an unbiased estimator for Var(X̄n − Ȳm). This leads to the test statistic
Td = (X̄n − Ȳm) / Sd.
Fig. 28.3. Boxplots of rainfall.
Again, we compare the estimator X̄n − Ȳm with zero and standardize by
dividing by an estimator for the standard deviation of X̄n − Ȳm. Values of Td
close to zero are in favor of the null hypothesis H0 : µ1 = µ2.
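A minimal Python sketch of the nonpooled statistic (NumPy), assuming x and y are arrays with the unseeded and seeded rainfall amounts; the function name is ours:

import numpy as np

def nonpooled_t_statistic(x, y):
    # S²d = S²X/n + S²Y/m
    sd2 = x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y)
    return (x.mean() - y.mean()) / np.sqrt(sd2)

# For the rainfall data this yields td ≈ −1.998.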
Quick exercise 28.3 Consider the ball bearing example from Quick exer-
cise 27.2. Compute the value of Td for this example.
Under the null hypothesis H0 : µ1 = µ2, the test statistic
Td = (X̄n − Ȳm) / Sd
is equal to the nonpooled studentized mean difference
((X̄n − Ȳm) − (µ1 − µ2)) / Sd.
Therefore, the distribution of Td under the null hypothesis is the same as that
of the nonpooled studentized mean difference. Unfortunately, its distribution
is not a t-distribution, not even in the case of normal samples. This means
that we have to approximate this distribution.
Similar to the previous section, we use the empirical bootstrap simulation for
the nonpooled studentized mean difference. The only difference with the proce-
dure outlined in the previous section is that now in each iteration we compute
the nonpooled studentized mean difference for the bootstrap datasets:
t∗d = ((x̄∗n − ȳ∗m) − (x̄n − ȳm)) / s∗d,
where x̄∗n and ȳ∗m are the sample means of the bootstrap datasets, and
(s∗d)² = (s∗X)²/n + (s∗Y)²/m
with (s∗X)² and (s∗Y)² the sample variances of the bootstrap datasets.
Fig. 28.4. Histogram and empirical distribution function of 1000 bootstrap values of T∗d.
We carried out this bootstrap simulation for the cloud seeding data. The
result of this simulation can be seen in Figure 28.4, where a histogram and
the empirical distribution function are displayed for one thousand values t∗d.
The bootstrap approximation for the left critical value corresponding to level
0.05 is c∗l = −1.405. For the data we find the value
td = (164.59 − 441.98) / 138.92 = −1.998.
This is smaller than c∗l, so we reject the null hypothesis. Although the evidence
against the null hypothesis is not overwhelming, there is some indication that
seeding clouds leads to more rainfall.
28.4 Large samples
Variants of the central limit theorem state that as n and m both tend to
infinity, the distributions of the pooled studentized mean difference
((X̄n − Ȳm) − (µ1 − µ2)) / Sp
and the nonpooled studentized mean difference
((X̄n − Ȳm) − (µ1 − µ2)) / Sd
both approach the standard normal distribution. This fact can be used to
approximate the distribution of the test statistics Tp and Td under the null
hypothesis by a standard normal distribution.
We illustrate this by means of the following example. To investigate whether a
restricted diet promotes longevity, two groups of randomly selected rats were
put on the different diets. One group of n = 106 rats was put on a restricted
diet, the other group of m = 89 rats on an ad libitum diet (i.e., unrestricted
eating). The data in Table 28.2 represent the remaining lifetime in days of two
groups of rats after they were put on the different diets. The average lifetimes
are x̄n = 968.75 and ȳm = 684.01 days. To investigate whether a restricted
diet promotes longevity, we test H0 : µ1 = µ2 against H1 : µ1 > µ2, where
µ1 and µ2 represent the lifetime of a rat on a restricted diet and on an ad
libitum diet, respectively.
If we may assume equal variances, we compute
tp = (968.75 − 684.01) / 32.88 = 8.66.
This value is larger than the right critical value z0.0005 = 3.291, which means
that we would reject H0 : µ1 = µ2 in favor of H1 : µ1 > µ2 at level α = 0.0005.
Table 28.2. Rat data.
Restricted
105 193 211 236 302 363 389 390 391 403
530 604 605 630 716 718 727 731 749 769
770 789 804 810 811 833 868 871 875 893
897 901 906 907 919 923 931 940 957 958
961 962 974 979 982 1001 1008 1010 1011 1012
1014 1017 1032 1039 1045 1046 1047 1057 1063 1070
1073 1076 1085 1090 1094 1099 1107 1119 1120 1128
1129 1131 1133 1136 1138 1144 1149 1160 1166 1170
1173 1181 1183 1188 1190 1203 1206 1209 1218 1220
1221 1228 1230 1231 1233 1239 1244 1258 1268 1294
1316 1327 1328 1369 1393 1435
Ad libitum
89 104 387 465 479 494 496 514 532 536
545 547 548 582 606 609 619 620 621 630
635 639 648 652 653 654 660 665 667 668
670 675 677 678 678 681 684 688 694 695
697 698 702 704 710 711 712 715 716 717
720 721 730 731 732 733 735 736 738 739
741 743 746 749 751 753 764 765 768 770
773 777 779 780 788 791 794 796 799 801
806 807 815 836 838 850 859 894 963
Source: B.L. Berger, D.D. Boos, and F.M. Guess. Tests and confidence sets
for comparing two mean residual life functions. Biometrics, 44:103–115, 1988.
The p-value is the right tail probability P(Tp ≥ 8.66), which we approximate
by P(Z ≥ 8.66), where Z has an N(0, 1) distribution. From Table B.1 we see
that this probability is smaller than P(Z ≥ 3.49) = 0.0002. By means of a
statistical package we find P(Z ≥ 8.66) = 2.4 · 10⁻¹⁶.
If we repeat the test without the assumption of equal variances, we compute
td = (968.75 − 684.01) / 31.08 = 9.16,
which also leads to rejection of the null hypothesis. In this case, the p-value
P(Td ≥ 9.16) ≈ P(Z ≥ 9.16) is even smaller since 9.16 > 8.66 (a statistical
package gives P(Z ≥ 9.16) = 2.6 · 10⁻¹⁸). The data provide overwhelming
evidence against the null hypothesis, and we conclude that a restricted diet
promotes longevity.
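As in the one-sample case, these large-sample p-values are right tail probabilities of the standard normal distribution. A minimal Python sketch (SciPy):

from scipy.stats import norm

def normal_right_tail_pvalue(t):
    # large-sample approximation P(T ≥ t) ≈ P(Z ≥ t) = 1 − Φ(t)
    return norm.sf(t)

# normal_right_tail_pvalue(8.66) and normal_right_tail_pvalue(9.16) are far below
# any conventional significance level.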
28.5 Solutions to the quick exercises
28.1 Just by looking at the boxplots, the authors believe that the assumption
σ²X = σ²Y is reasonable. The lengths of the boxplots and their IQRs are almost
the same. However, the boxplots do not reveal how the elements of the dataset
vary around the center. One way of quantifying our belief would be to compare
the sample variances of the datasets. One possibility is to compare the ratio of
both sample variances; a ratio close to one would support our belief of equal
variances (in case of normal samples, this is a standard test called the F-test).
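A small Python sketch of such a quantification (NumPy/SciPy), assuming x and y are arrays with the two samples; the function name is ours, and the F reference distribution applies to normal samples:

import numpy as np
from scipy.stats import f as f_dist

def variance_ratio_test(x, y):
    ratio = x.var(ddof=1) / y.var(ddof=1)       # ratio of sample variances
    n, m = len(x), len(y)
    # two-sided p-value: double the smaller tail of the F(n − 1, m − 1) distribution
    tail = min(f_dist.cdf(ratio, n - 1, m - 1), f_dist.sf(ratio, n - 1, m - 1))
    return ratio, 2 * tail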
28.2 In this case we have a right and left critical value. From Quick ex-
ercise 27.2 we know that n = m = 10, so that the right critical value is
t18,0.005 = 2.878 and the left critical value is −t18,0.005 = −2.878.
28.3 We first compute s²d = (0.0290)²/10 + (0.0428)²/10 = 0.000267 and then
td = (1.0194 − 1.0406)/√0.000267 = −1.297.
28.6 Exercises
28.1  The data in Table 28.3 represent salaries (in pounds Sterling) in 72
randomly selected advertisements in The Guardian (April 6, 1992). When
a range was given in the advertisement, the midpoint of the range is repro-
duced in the table. The data are salaries corresponding to two kinds of occu-
pations (n = m = 72): (1) creative, media, and marketing and (2) education.
The sample mean and sample variance of the two datasets are, respectively:
(1) x̄72 = 17 410 and s²x = 41 258 741,
(2) ȳ72 = 19 818 and s²y = 50 744 521.
Table 28.3. Salaries in two kinds of occupations.
Occupation (1) Occupation (2)
17703 13796 12000 25899 17378 19236
42000 22958 22900 21676 15594 18780
18780 10750 13440 15053 17375 12459
15723 13552 17574 19461 20111 22700
13179 21000 22149 22485 16799 35750
37500 18245 17547 17378 12587 20539
22955 19358 9500 15053 24102 13115
13000 22000 25000 10998 12755 13605
13500 12000 15723 18360 35000 20539
13000 16820 12300 22533 20500 16629
11000 17709 10750 23008 13000 27500
12500 23065 11000 24260 18066 17378
13000 18693 19000 25899 35403 15053
10500 14472 13500 18021 17378 20594
12285 12000 32000 17970 14855 9866
13000 20000 17783 21074 21074 21074
16000 18900 16600 15053 19401 25598
15000 14481 18000 20739 15053 15053
13944 35000 11406 15053 15083 31530
23960 18000 23000 30800 10294 16799
11389 30000 15379 37000 11389 15053
12587 12548 21458 48000 11389 14359
17000 17048 21262 16000 26544 15344
9000 13349 20000 20147 14274 31000
Source: D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski.
Small data sets. Chapman and Hall, London, 1994; dataset 385. Data col-
lected by D.J. Hand.
Suppose that the datasets are modeled as realizations of normal distributions
with expectations µ1 and µ2, which represent the salaries for occupations (1)
and (2).
a. Test the null hypothesis that the salary for both occupations is the same
at level α = 0.05 under the assumption of equal variances. Formulate
the proper null and alternative hypotheses, compute the value of the test
statistic, and report your conclusion.
b. Do the same without the assumption of equal variances.
c. As a comparison, one carries out an empirical bootstrap simulation for the
nonpooled studentized mean difference. The bootstrap approximations for
the critical values are c∗l = −2.004 and c∗u = 2.133. Report your conclusion
about the salaries on the basis of the bootstrap results.
28.2 The data in Table 28.4 represent the duration of pregnancy for 1669
women who gave birth in a maternity hospital in Newcastle-upon-Tyne, Eng-
land, in 1954.
Table 28.4. Durations of pregnancy.
Duration Medical Emergency Social
11 1
15 1
17 1
20 1
22 1 2
24 1 3
25 2 1
26 1
27 2 2 1
28 1 2 1
29 3 1
30 3 5 1
31 4 5 2
32 10 9 2
33 6 6 2
34 12 7 10
35 23 11 4
36 26 13 19
37 54 16 30
38 68 35 72
39 159 38 115
40 197 32 155
41 111 27 128
42 55 25 64
43 29 8 16
44 4 5 3
45 3 1 6
46 1 1 1
47 1
56 1
Source: D.J. Newell. Statistical aspects of the demand for maternity beds.
Journal of the Royal Statistical Society, Series A, 127:1–33, 1964.
The durati
  • 1. F.M. Dekking C. Kraaikamp H.P. Lopuhaä L.E. Meester A Modern Introduction to Probability and Statistics Understanding Why and How With 120 Figures
  • 2. Frederik Michel Dekking Cornelis Kraaikamp Hendrik Paul Lopuhaä Ludolf Erwin Meester Delft Institute of Applied Mathematics Delft University of Technology Mekelweg 4 2628 CD Delft The Netherlands Whilst we have made considerable efforts to contact all holders of copyright material contained in this book, we may have failed to locate some of them. Should holders wish to contact the Publisher, we will be happy to come to some arrangement with them. British Library Cataloguing in Publication Data A modern introduction to probability and statistics. — (Springer texts in statistics) 1. Probabilities 2. Mathematical statistics I. Dekking, F. M. 519.2 ISBN 1852338962 Library of Congress Cataloging-in-Publication Data A modern introduction to probability and statistics : understanding why and how / F.M. Dekking ... [et al.]. p. cm. — (Springer texts in statistics) Includes bibliographical references and index. ISBN 1-85233-896-2 1. Probabilities—Textbooks. 2. Mathematical statistics—Textbooks. I. Dekking, F.M. II. Series. QA273.M645 2005 519.2—dc22 2004057700 Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publish- ers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers. ISBN-10: 1-85233-896-2 ISBN-13: 978-1-85233-896-1 Springer Science+Business Media springeronline.com © Springer-Verlag London Limited 2005 The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use. The publisher makes no representation, express or implied, with regard to the accuracy of the informa- tion contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made. Printed in the United States of America 12/3830/543210 Printed on acid-free paper SPIN 10943403
  • 3. Preface Probability and statistics are fascinating subjects on the interface between mathematics and applied sciences that help us understand and solve practical problems. We believe that you, by learning how stochastic methods come about and why they work, will be able to understand the meaning of statistical statements as well as judge the quality of their content, when facing such problems on your own. Our philosophy is one of how and why: instead of just presenting stochastic methods as cookbook recipes, we prefer to explain the principles behind them. In this book you will find the basics of probability theory and statistics. In addition, there are several topics that go somewhat beyond the basics but that ought to be present in an introductory course: simulation, the Poisson process, the law of large numbers, and the central limit theorem. Computers have brought many changes in statistics. In particular, the bootstrap has earned its place. It provides the possibility to derive confidence intervals and perform tests of hypotheses where traditional (normal approximation or large sample) methods are inappropriate. It is a modern useful tool one should learn about, we believe. Examples and datasets in this book are mostly from real-life situations, at least that is what we looked for in illustrations of the material. Anybody who has inspected datasets with the purpose of using them as elementary examples knows that this is hard: on the one hand, you do not want to boldly state assumptions that are clearly not satisfied; on the other hand, long explanations concerning side issues distract from the main points. We hope that we found a good middle way. A first course in calculus is needed as a prerequisite for this book. In addition to high-school algebra, some infinite series are used (exponential, geometric). Integration and differentiation are the most important skills, mainly concern- ing one variable (the exceptions, two dimensional integrals, are encountered in Chapters 9–11). Although the mathematics is kept to a minimum, we strived
  • 4. VI Preface to be mathematically correct throughout the book. With respect to probabil- ity and statistics the book is self-contained. The book is aimed at undergraduate engineering students, and students from more business-oriented studies (who may gloss over some of the more mathe- matically oriented parts). At our own university we also use it for students in applied mathematics (where we put a little more emphasis on the math and add topics like combinatorics, conditional expectations, and generating func- tions). It is designed for a one-semester course: on average two hours in class per chapter, the first for a lecture, the second doing exercises. The material is also well-suited for self-study, as we know from experience. We have divided attention about evenly between probability and statistics. The very first chapter is a sampler with differently flavored introductory ex- amples, ranging from scientific success stories to a controversial puzzle. Topics that follow are elementary probability theory, simulation, joint distributions, the law of large numbers, the central limit theorem, statistical modeling (in- formal: why and how we can draw inference from data), data analysis, the bootstrap, estimation, simple linear regression, confidence intervals, and hy- pothesis testing. Instead of a few chapters with a long list of discrete and continuous distributions, with an enumeration of the important attributes of each, we introduce a few distributions when presenting the concepts and the others where they arise (more) naturally. A list of distributions and their characteristics is found in Appendix A. With the exception of the first one, chapters in this book consist of three main parts. First, about four sections discussing new material, interspersed with a handful of so-called Quick exercises. Working these—two-or-three-minute— exercises should help to master the material and provide a break from reading to do something more active. On about two dozen occasions you will find indented paragraphs labeled Remark, where we felt the need to discuss more mathematical details or background material. These remarks can be skipped without loss of continuity; in most cases they require a bit more mathematical maturity. Whenever persons are introduced in examples we have determined their sex by looking at the chapter number and applying the rule “He is odd, she is even.” Solutions to the quick exercises are found in the second to last section of each chapter. The last section of each chapter is devoted to exercises, on average thirteen per chapter. For about half of the exercises, answers are given in Appendix C, and for half of these, full solutions in Appendix D. Exercises with both a short answer and a full solution are marked with and those with only a short answer are marked with (when more appropriate, for example, in “Show that . . . ” exercises, the short answer provides a hint to the key step). Typically, the section starts with some easy exercises and the order of the material in the chapter is more or less respected. More challenging exercises are found at the end.
  • 5. Preface VII Much of the material in this book would benefit from illustration with a computer using statistical software. A complete course should also involve computer exercises. Topics like simulation, the law of large numbers, the central limit theorem, and the bootstrap loudly call for this kind of experi- ence. For this purpose, all the datasets discussed in the book are available at http://www.springeronline.com/1-85233-896-2. The same Web site also pro- vides access, for instructors, to a complete set of solutions to the exercises; go to the Springer online catalog or contact textbooks@springer-sbm.com to apply for your password. Delft, The Netherlands F. M. Dekking January 2005 C. Kraaikamp H. P. Lopuhaä L. E. Meester
  • 6. Contents 1 Why probability and statistics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Biometry: iris recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Killer football . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Cars and goats: the Monty Hall dilemma . . . . . . . . . . . . . . . . . . . 4 1.4 The space shuttle Challenger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5 Statistics versus intelligence agencies . . . . . . . . . . . . . . . . . . . . . . . 7 1.6 The speed of light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2 Outcomes, events, and probability . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1 Sample spaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Products of sample spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 An infinite sample space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.6 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3 Conditional probability and independence . . . . . . . . . . . . . . . . . 25 3.1 Conditional probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 The multiplication rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.3 The law of total probability and Bayes’ rule. . . . . . . . . . . . . . . . . 30 3.4 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
  • 7. X Contents 4 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 The probability distribution of a discrete random variable . . . . 43 4.3 The Bernoulli and binomial distributions . . . . . . . . . . . . . . . . . . . 45 4.4 The geometric distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.1 Probability density functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.2 The uniform distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.3 The exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.4 The Pareto distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.5 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.6 Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.7 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.1 What is simulation? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.2 Generating realizations of random variables . . . . . . . . . . . . . . . . . 72 6.3 Comparing two jury rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.4 The single-server queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 7 Expectation and variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.1 Expected values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.2 Three examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 7.3 The change-of-variable formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 7.4 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 8 Computations with random variables . . . . . . . . . . . . . . . . . . . . . . 103 8.1 Transforming discrete random variables . . . . . . . . . . . . . . . . . . . . 103 8.2 Transforming continuous random variables . . . . . . . . . . . . . . . . . . 104 8.3 Jensen’s inequality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
  • 8. Contents XI 8.4 Extremes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 8.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 8.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 9 Joint distributions and independence . . . . . . . . . . . . . . . . . . . . . . 115 9.1 Joint distributions of discrete random variables . . . . . . . . . . . . . . 115 9.2 Joint distributions of continuous random variables . . . . . . . . . . . 118 9.3 More than two random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 122 9.4 Independent random variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 9.5 Propagation of independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 9.6 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 9.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 10 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 10.1 Expectation and joint distributions . . . . . . . . . . . . . . . . . . . . . . . . 135 10.2 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 10.3 The correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 10.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 11 More computations with more random variables . . . . . . . . . . . 151 11.1 Sums of discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . 151 11.2 Sums of continuous random variables . . . . . . . . . . . . . . . . . . . . . . 154 11.3 Product and quotient of two random variables . . . . . . . . . . . . . . 159 11.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 11.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 12 The Poisson process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 12.1 Random points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 12.2 Taking a closer look at random arrivals. . . . . . . . . . . . . . . . . . . . . 168 12.3 The one-dimensional Poisson process . . . . . . . . . . . . . . . . . . . . . . . 171 12.4 Higher-dimensional Poisson processes . . . . . . . . . . . . . . . . . . . . . . 173 12.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 12.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 13 The law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 13.1 Averages vary less . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 13.2 Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
  • 9. XII Contents 13.3 The law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 13.4 Consequences of the law of large numbers . . . . . . . . . . . . . . . . . . 188 13.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 13.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 14 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 14.1 Standardizing averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 14.2 Applications of the central limit theorem . . . . . . . . . . . . . . . . . . . 199 14.3 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 15 Exploratory data analysis: graphical summaries . . . . . . . . . . . . 207 15.1 Example: the Old Faithful data . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 15.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 15.3 Kernel density estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 15.4 The empirical distribution function . . . . . . . . . . . . . . . . . . . . . . . . 219 15.5 Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221 15.6 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 15.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 16 Exploratory data analysis: numerical summaries . . . . . . . . . . . 231 16.1 The center of a dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 16.2 The amount of variability of a dataset. . . . . . . . . . . . . . . . . . . . . . 233 16.3 Empirical quantiles, quartiles, and the IQR . . . . . . . . . . . . . . . . . 234 16.4 The box-and-whisker plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 16.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238 16.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 17 Basic statistical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 17.1 Random samples and statistical models . . . . . . . . . . . . . . . . . . . . 245 17.2 Distribution features and sample statistics . . . . . . . . . . . . . . . . . . 248 17.3 Estimating features of the “true” distribution . . . . . . . . . . . . . . . 253 17.4 The linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256 17.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 17.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
  • 10. Contents XIII 18 The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 18.1 The bootstrap principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 18.2 The empirical bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272 18.3 The parametric bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 18.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 18.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 19 Unbiased estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285 19.1 Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285 19.2 Investigating the behavior of an estimator . . . . . . . . . . . . . . . . . . 287 19.3 The sampling distribution and unbiasedness . . . . . . . . . . . . . . . . 288 19.4 Unbiased estimators for expectation and variance . . . . . . . . . . . . 292 19.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 19.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 20 Efficiency and mean squared error . . . . . . . . . . . . . . . . . . . . . . . . . 299 20.1 Estimating the number of German tanks . . . . . . . . . . . . . . . . . . . 299 20.2 Variance of an estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302 20.3 Mean squared error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 20.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 20.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307 21 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 21.1 Why a general principle? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 21.2 The maximum likelihood principle . . . . . . . . . . . . . . . . . . . . . . . . . 314 21.3 Likelihood and loglikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316 21.4 Properties of maximum likelihood estimators . . . . . . . . . . . . . . . . 321 21.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322 21.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 22 The method of least squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329 22.1 Least squares estimation and regression . . . . . . . . . . . . . . . . . . . . 329 22.2 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332 22.3 Relation with maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . 335 22.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336 22.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
  • 11. XIV Contents 23 Confidence intervals for the mean . . . . . . . . . . . . . . . . . . . . . . . . . 341 23.1 General principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341 23.2 Normal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 23.3 Bootstrap confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350 23.4 Large samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 23.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355 23.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356 24 More on confidence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361 24.1 The probability of success . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361 24.2 Is there a general method? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 24.3 One-sided confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366 24.4 Determining the sample size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 24.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368 24.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369 25 Testing hypotheses: essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 25.1 Null hypothesis and test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . 373 25.2 Tail probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 25.3 Type I and type II errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 25.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379 25.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380 26 Testing hypotheses: elaboration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383 26.1 Significance level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383 26.2 Critical region and critical values . . . . . . . . . . . . . . . . . . . . . . . . . . 386 26.3 Type II error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390 26.4 Relation with confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . 392 26.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393 26.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394 27 The t-test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399 27.1 Monitoring the production of ball bearings. . . . . . . . . . . . . . . . . . 399 27.2 The one-sample t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401 27.3 The t-test in a regression setting. . . . . . . . . . . . . . . . . . . . . . . . . . . 405 27.4 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409 27.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
  • 12. Contents XV 28 Comparing two samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415 28.1 Is dry drilling faster than wet drilling? . . . . . . . . . . . . . . . . . . . . . 415 28.2 Two samples with equal variances . . . . . . . . . . . . . . . . . . . . . . . . . 416 28.3 Two samples with unequal variances . . . . . . . . . . . . . . . . . . . . . . . 419 28.4 Large samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422 28.5 Solutions to the quick exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424 28.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424 A Summary of distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429 B Tables of the normal and t-distributions . . . . . . . . . . . . . . . . . . . 431 C Answers to selected exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435 D Full solutions to selected exercises . . . . . . . . . . . . . . . . . . . . . . . . . 445 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475 List of symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
  • 13. 1 Why probability and statistics? Is everything on this planet determined by randomness? This question is open to philosophical debate. What is certain is that every day thousands and thousands of engineers, scientists, business persons, manufacturers, and others are using tools from probability and statistics. The theory and practice of probability and statistics were developed during the last century and are still actively being refined and extended. In this book we will introduce the basic notions and ideas, and in this first chapter we present a diverse collection of examples where randomness plays a role. 1.1 Biometry: iris recognition Biometry is the art of identifying a person on the basis of his or her personal biological characteristics, such as fingerprints or voice. From recent research it appears that with the human iris one can beat all existing automatic hu- man identification systems. Iris recognition technology is based on the visible qualities of the iris. It converts these—via a video camera—into an “iris code” consisting of just 2048 bits. This is done in such a way that the code is hardly sensitive to the size of the iris or the size of the pupil. However, at different times and different places the iris code of the same person will not be exactly the same. Thus one has to allow for a certain percentage of mismatching bits when identifying a person. In fact, the system allows about 34% mismatches! How can this lead to a reliable identification system? The miracle is that dif- ferent persons have very different irides. In particular, over a large collection of different irides the code bits take the values 0 and 1 about half of the time. But that is certainly not sufficient: if one bit would determine the other 2047, then we could only distinguish two persons. In other words, single bits may be random, but the correlation between bits is also crucial (we will discuss correlation at length in Chapter 10). John Daugman who has developed the iris recognition technology made comparisons between 222 743 pairs of iris
  • 14. 2 1 Why probability and statistics? codes and concluded that of the 2048 bits 266 may be considered as uncor- related ([6]). He then argues that we may consider an iris code as the result of 266 coin tosses with a fair coin. This implies that if we compare two such codes from different persons, then there is an astronomically small probability that these two differ in less than 34% of the bits—almost all pairs will differ in about 50% of the bits. This is illustrated in Figure 1.1, which originates from [6], and was kindly provided by John Daugman. The iris code data con- sist of numbers between 0 and 1, each a Hamming distance (the fraction of mismatches) between two iris codes. The data have been summarized in two histograms, that is, two graphs that show the number of counts of Hamming distances falling in a certain interval. We will encounter histograms and other summaries of data in Chapter 15. One sees from the figure that for codes from the same iris (left side) the mismatch fraction is only about 0.09, while for different irides (right side) it is about 0.46. 0 2000 6000 10000 14000 18000 22000 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0 10 20 30 40 50 60 70 80 90 100 120 Hamming Distance Count d’ = 11.36 mean = 0.089 stnd dev = 0.042 mean = 0.456 stnd dev = 0.018 222,743 comparisons of different iris pairs 546 comparisons of same iris pairs DECISION ENVIRONMENT FOR IRIS RECOGNITION Theoretical curves: binomial family Theoretical cross-over point: HD = 0.342 Theoretical cross-over rate: 1 in 1.2 million C Fig. 1.1. Comparison of same and different iris pairs. Source: J.Daugman. Second IMA Conference on Image Processing: Mathe- matical Methods, Algorithms and Applications, 2000. Ellis Horwood Pub- lishing Limited. You may still wonder how it is possible that irides distinguish people so well. What about twins, for instance? The surprising thing is that although the color of eyes is hereditary, many features of iris patterns seem to be pro- duced by so-called epigenetic events. This means that during embryo develop- ment the iris structure develops randomly. In particular, the iris patterns of (monozygotic) twins are as discrepant as those of two arbitrary individuals.
  • 15. 1.2 Killer football 3 For this reason, as early as in the 1930s, eye specialists proposed that iris patterns might be used for identification purposes. 1.2 Killer football A couple of years ago the prestigious British Medical Journal published a paper with the title “Cardiovascular mortality in Dutch men during 1996 European football championship: longitudinal population study” ([41]). The authors claim to have shown that the effect of a single football match is detectable in national mortality data. They consider the mortality from in- farctions (heart attacks) and strokes, and the “explanation” of the increase is a combination of heavy alcohol consumption and stress caused by watching the football match on June 22 between the Netherlands and France (lost by the Dutch team!). The authors mainly support their claim with a figure like Figure 1.2, which shows the number of deaths from the causes mentioned (for men over 45), during the period June 17 to June 27, 1996. The middle horizon- tal line marks the average number of deaths on these days, and the upper and lower horizontal lines mark what the authors call the 95% confidence inter- val. The construction of such an interval is usually performed with standard statistical techniques, which you will learn in Chapter 23. The interpretation of such an interval is rather tricky. That the bar on June 22 sticks out off the confidence interval should support the “killer claim.” June 18 June 22 June 26 0 10 20 30 40 Deaths Fig. 1.2. Number of deaths from infarction or stroke in (part of) June 1996. It is rather surprising that such a conclusion is based on a single football match, and one could wonder why no probability model is proposed in the paper. In fact, as we shall see in Chapter 12, it would not be a bad idea to model the time points at which deaths occur as a so-called Poisson process.
  • 16. 4 1 Why probability and statistics? Once we have done this, we can compute how often a pattern like the one in the figure might occur—without paying attention to football matches and other high-risk national events. To do this we need the mean number of deaths per day. This number can be obtained from the data by an estimation procedure (the subject of Chapters 19 to 23). We use the sample mean, which is equal to (10 · 27.2 + 41)/11 = 313/11 = 28.45. (Here we have to make a computation like this because we only use the data in the paper: 27.2 is the average over the 5 days preceding and following the match, and 41 is the number of deaths on the day of the match.) Now let phigh be the probability that there are 41 or more deaths on a day, and let pusual be the probability that there are between 21 and 34 deaths on a day—here 21 and 34 are the lowest and the highest number that fall in the interval in Figure 1.2. From the formula of the Poisson distribution given in Chapter 12 one can compute that phigh = 0.008 and pusual = 0.820. Since events on different days are independent according to the Poisson process model, the probability p of a pattern as in the figure is p = p5 usual · phigh · p5 usual = 0.0011. From this it can be shown by (a generalization of) the law of large numbers (which we will study in Chapter 13) that such a pattern would appear about once every 1/0.0011 = 899 days. So it is not overwhelmingly exceptional to find such a pattern, and the fact that there was an important football match on the day in the middle of the pattern might just have been a coincidence. 1.3 Cars and goats: the Monty Hall dilemma On Sunday September 9, 1990, the following question appeared in the “Ask Marilyn” column in Parade, a Sunday supplement to many newspapers across the United States: Suppose you’re on a game show, and you’re given the choice of three doors; behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?—Craig F. Whitaker, Columbia, Md. Marilyn’s answer—one should switch—caused an avalanche of reactions, in to- tal an estimated 10 000. Some of these reactions were not so flattering (“You are the goat”), quite a lot were by professional mathematicians (“You blew it, and blew it big,” “You are utterly incorrect . . . . How many irate mathe- maticians are needed to change your mind?”). Perhaps some of the reactions were so strong, because Marilyn vos Savant, the author of the column, is in the Guinness Book of Records for having one of the highest IQs in the world.
  • 17. 1.4 The space shuttle Challenger 5 The switching question was inspired by Monty Hall’s “Let’s Make a Deal” game show, which ran with small interruptions for 23 years on various U.S. television networks. Although it is not explicitly stated in the question, the game show host will always open a door with a goat after you make your initial choice. Many people would argue that in this situation it does not matter whether one would change or not: one door has a car behind it, the other a goat, so the odds to get the car are fifty-fifty. To see why they are wrong, consider the following argument. In the original situation two of the three doors have a goat behind them, so with probability 2/3 your initial choice was wrong, and with probability 1/3 it was right. Now the host opens a door with a goat (note that he can always do this). In case your initial choice was wrong the host has only one option to show a door with a goat, and switching leads you to the door with the car. In case your initial choice was right the host has two goats to choose from, so switching will lead you to a goat. We see that switching is the best strategy, doubling our chances to win. To stress this argument, consider the following generalization of the problem: suppose there are 10 000 doors, behind one is a car and behind the rest, goats. After you make your choice, the host will open 9998 doors with goats, and offers you the option to switch. To change or not to change, that’s the question! Still not convinced? Use your Internet browser to find one of the zillion sites where one can run a simulation of the Monty Hall problem (more about simulation in Chapter 6). In fact, there are quite a lot of variations on the problem. For example, the situation that there are four doors: you select a door, the host always opens a door with a goat, and offers you to select another door. After you have made up your mind he opens a door with a goat, and again offers you to switch. After you have decided, he opens the door you selected. What is now the best strategy? In this situation switching only at the last possible moment yields a probability of 3/4 to bring the car home. Using the law of total probability from Section 3.3 you will find that this is indeed the best possible strategy. 1.4 The space shuttle Challenger On January 28, 1986, the space shuttle Challenger exploded about one minute after it had taken off from the launch pad at Kennedy Space Center in Florida. The seven astronauts on board were killed and the spacecraft was destroyed. The cause of the disaster was explosion of the main fuel tank, caused by flames of hot gas erupting from one of the so-called solid rocket boosters. These solid rocket boosters had been cause for concern since the early years of the shuttle. They are manufactured in segments, which are joined at a later stage, resulting in a number of joints that are sealed to protect against leakage. This is done with so-called O-rings, which in turn are protected by a layer of putty. When the rocket motor ignites, high pressure and high temperature
  • 18. 6 1 Why probability and statistics? build up within. In time these may burn away the putty and subsequently erode the O-rings, eventually causing hot flames to erupt on the outside. In a nutshell, this is what actually happened to the Challenger. After the explosion, an investigative commission determined the causes of the disaster, and a report was issued with many findings and recommendations ([24]). On the evening of January 27, a decision to launch the next day had been made, notwithstanding the fact that an extremely low temperature of 31◦ F had been predicted, well below the operating limit of 40◦ F set by Morton Thiokol, the manufacturer of the solid rocket boosters. Apparently, a “man- agement decision” was made to overrule the engineers’ recommendation not to launch. The inquiry faulted both NASA and Morton Thiokol management for giving in to the pressure to launch, ignoring warnings about problems with the seals. The Challenger launch was the 24th of the space shuttle program, and we shall look at the data on the number of failed O-rings, available from previous launches (see [5] for more details). Each rocket has three O-rings, and two rocket boosters are used per launch, so in total six O-rings are used each time. Because low temperatures are known to adversely affect the O-rings, we also look at the corresponding launch temperature. In Figure 1.3 the dots show the number of failed O-rings per mission (there are 23 dots—one time the boosters could not be recovered from the ocean; temperatures are rounded to the nearest degree Fahrenheit; in case of two or more equal data points these are shifted slightly.). If you ignore the dots representing zero failures, which all occurred at high temperatures, a temperature effect is not apparent. 30 40 50 60 70 80 90 Launch temperature in ◦ F 0 1 2 3 4 5 6 Failures · ·· · ·· · ···· · · · ··· · · ···· . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 · p(t) Source: based on data from Volume VI of the Report of the Presidential Commission on the space shuttle Challenger accident, Washington, DC, 1986. Fig. 1.3. Space shuttle failure data of pre-Challenger missions and fitted model of expected number of failures per mission function.
  • 19. 1.5 Statistics versus intelligence agencies 7 In a model to describe these data, the probability p(t) that an individual O-ring fails should depend on the launch temperature t. Per mission, the number of failed O-rings follows a so-called binomial distribution: six O-rings, and each may fail with probability p(t); more about this distribution and the circumstances under which it arises can be found in Chapter 4. A logistic model was used in [5] to describe the dependence on t: p(t) = ea+b·t 1 + ea+b·t . A high value of a + b · t corresponds to a high value of p(t), a low value to low p(t). Values of a and b were determined from the data, according to the following principle: choose a and b so that the probability that we get data as in Figure 1.3 is as high as possible. This is an example of the use of the method of maximum likelihood, which we shall discuss in Chapter 21. This results in a = 5.085 and b = −0.1156, which indeed leads to lower probabilities at higher temperatures, and to p(31) = 0.8178. We can also compute the (estimated) expected number of failures, 6·p(t), as a function of the launch temperature t; this is the plotted line in the figure. Combining the estimates with estimated probabilities of other events that should happen for a complete failure of the field-joint, the estimated proba- bility of such a failure is 0.023. With six field-joints, the probability of at least one complete failure is then 1 − (1 − 0.023)6 = 0.13! 1.5 Statistics versus intelligence agencies During World War II, information about Germany’s war potential was essen- tial to the Allied forces in order to schedule the time of invasions and to carry out the allied strategic bombing program. Methods for estimating German production used during the early phases of the war proved to be inadequate. In order to obtain more reliable estimates of German war production, ex- perts from the Economic Warfare Division of the American Embassy and the British Ministry of Economic Warfare started to analyze markings and serial numbers obtained from captured German equipment. Each piece of enemy equipment was labeled with markings, which included all or some portion of the following information: (a) the name and location of the marker; (b) the date of manufacture; (c) a serial number; and (d) miscellaneous markings such as trademarks, mold numbers, casting numbers, etc. The purpose of these markings was to maintain an effective check on production standards and to perform spare parts control. However, these same markings offered Allied intelligence a wealth of information about German industry. The first products to be analyzed were tires taken from German aircraft shot over Britain and from supply dumps of aircraft and motor vehicle tires cap- tured in North Africa. The marking on each tire contained the maker’s name,
1.5 Statistics versus intelligence agencies

During World War II, information about Germany's war potential was essential to the Allied forces in order to schedule the time of invasions and to carry out the allied strategic bombing program. Methods for estimating German production used during the early phases of the war proved to be inadequate. In order to obtain more reliable estimates of German war production, experts from the Economic Warfare Division of the American Embassy and the British Ministry of Economic Warfare started to analyze markings and serial numbers obtained from captured German equipment.

Each piece of enemy equipment was labeled with markings, which included all or some portion of the following information: (a) the name and location of the marker; (b) the date of manufacture; (c) a serial number; and (d) miscellaneous markings such as trademarks, mold numbers, casting numbers, etc. The purpose of these markings was to maintain an effective check on production standards and to perform spare parts control. However, these same markings offered Allied intelligence a wealth of information about German industry.

The first products to be analyzed were tires taken from German aircraft shot over Britain and from supply dumps of aircraft and motor vehicle tires captured in North Africa. The marking on each tire contained the maker's name, a serial number, and a two-letter code for the date of manufacture. The first step in analyzing the tire markings involved breaking the two-letter date code. It was conjectured that one letter represented the month and the other the year of manufacture, and that there should be 12 letter variations for the month code and 3 to 6 for the year code. This, indeed, turned out to be true. The following table presents examples of the 12 letter variations used by four different manufacturers.

            Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  Dunlop     T   I   E   B   R   A   P   O   L   N   U   D
  Fulda      F   U   L   D   A   M   U   N   S   T   E   R
  Phoenix    F   O   N   I   X   H   A   M   B   U   R   G
  Sempirit   A   B   C   D   E   F   G   H   I   J   K   L

Reprinted with permission from "An empirical approach to economic intelligence" by R. Ruggles and H. Brodie, pp. 72–91, Vol. 42, No. 237. © 1947 by the American Statistical Association. All rights reserved.

For instance, the Dunlop code was Dunlop Arbeit spelled backwards. Next, the year code was broken and the numbering system was solved so that for each manufacturer individually the serial numbers could be dated. Moreover, for each month, the serial numbers could be recoded to numbers running from 1 to some unknown largest number N, and the observed (recoded) serial numbers could be seen as a subset of this. The objective was to estimate N for each month and each manufacturer separately by means of the observed (recoded) serial numbers. In Chapter 20 we discuss two different methods of estimation, and we show that the method based on only the maximum observed (recoded) serial number is much better than the method based on the average observed (recoded) serial numbers.

With a sample of about 1400 tires from five producers, individual monthly output figures were obtained for almost all months over a period from 1939 to mid-1943. The following table compares the accuracy of estimates of the average monthly production of all manufacturers of the first quarter of 1943 with the statistics of the Speer Ministry that became available after the war. The accuracy of the estimates can be appreciated even more if we compare them with the figures obtained by Allied intelligence agencies. They estimated, using other methods, the production between 900 000 and 1 200 000 per month!

  Type of tire              Estimated production   Actual production
  Truck and passenger car        147 000               159 000
  Aircraft                        28 500                26 400
  Total                          175 500               186 100

Reprinted with permission from "An empirical approach to economic intelligence" by R. Ruggles and H. Brodie, pp. 72–91, Vol. 42, No. 237. © 1947 by the American Statistical Association. All rights reserved.
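The comparison made in Chapter 20 can be previewed with a small simulation. This is only an illustrative sketch, not the book's code; the two formulas below (a maximum-based estimate m·(1 + 1/k) − 1 and an average-based estimate 2·(sample mean) − 1) are common choices for this serial-number problem and are assumptions here, not necessarily the exact estimators discussed in Chapter 20.

```python
import random

def compare_estimators(N=1000, k=20, trials=10_000, seed=1):
    """Draw k distinct serial numbers from 1..N and compare two ways of estimating N."""
    rng = random.Random(seed)
    se_max = se_avg = 0.0
    for _ in range(trials):
        sample = rng.sample(range(1, N + 1), k)
        est_max = max(sample) * (1 + 1 / k) - 1   # uses only the maximum
        est_avg = 2 * sum(sample) / k - 1         # uses the average
        se_max += (est_max - N) ** 2
        se_avg += (est_avg - N) ** 2
    return (se_max / trials) ** 0.5, (se_avg / trials) ** 0.5

# Root-mean-square errors: the maximum-based estimate is far more accurate.
print(compare_estimators())
```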
1.6 The speed of light

In 1983 the definition of the meter (the SI unit of one meter) was changed to: The meter is the length of the path traveled by light in vacuum during a time interval of 1/299 792 458 of a second. This implicitly defines the speed of light as 299 792 458 meters per second. It was done because one thought that the speed of light was so accurately known that it made more sense to define the meter in terms of the speed of light rather than vice versa, a remarkable end to a long story of scientific discovery. For a long time most scientists believed that the speed of light was infinite. Early experiments devised to demonstrate the finiteness of the speed of light failed because the speed is so extraordinarily high. In the 18th century this debate was settled, and work started on determination of the speed, using astronomical observations, but a century later scientists turned to earth-based experiments. Albert Michelson refined experimental arrangements from two previous experiments and conducted a series of measurements in June and early July of 1879, at the U.S. Naval Academy in Annapolis. In this section we give a very short summary of his work. It is extracted from an article in Statistical Science ([18]).

The principle of speed measurement is easy, of course: measure a distance and the time it takes to travel that distance; the speed equals distance divided by time. For an accurate determination, both the distance and the time need to be measured accurately, and with the speed of light this is a problem: either we should use a very large distance and the accuracy of the distance measurement is a problem, or we have a very short time interval, which is also very difficult to measure accurately. In Michelson's time it was known that the speed of light was about 300 000 km/s, and he embarked on his study with the goal of an improved value of the speed of light.

His experimental setup is depicted schematically in Figure 1.4. Light emitted from a light source is aimed, through a slit in a fixed plate, at a rotating mirror; we call its distance from the plate the radius. At one particular angle, this rotating mirror reflects the beam in the direction of a distant (fixed) flat mirror. On its way the light first passes through a focusing lens. This second mirror is positioned in such a way that it reflects the beam back in the direction of the rotating mirror. In the time it takes the light to travel back and forth between the two mirrors, the rotating mirror has moved by an angle α, resulting in a reflection on the plate that is displaced with respect to the source beam that passed through the slit. The radius and the displacement determine the angle α because

tan 2α = displacement / radius

and combined with the number of revolutions per second (rps) of the mirror, this determines the elapsed time:

time = α / (2π · rps).
[Figure 1.4: schematic of the setup. Labels in the figure: light source, plate, displacement, radius, rotating mirror, angle α, focusing lens, distance, fixed mirror.]
Fig. 1.4. Michelson's experiment.

During this time the light traveled twice the distance between the mirrors, so the speed of light in air now follows:

c_air = 2 · distance / time.

All in all, it looks simple: just measure the four quantities—distance, radius, displacement and the revolutions per second—and do the calculations. This is much harder than it looks, and problems in the form of inaccuracies are lurking everywhere. An error in any of these quantities translates directly into some error in the final result.

Michelson did the utmost to reduce errors. For example, the distance between the mirrors was about 2000 feet, and to measure it he used a steel measuring tape. Its nominal length was 100 feet, but he carefully checked this using a copy of the official "standard yard." He found that the tape was in fact 100.006 feet. This way he eliminated a (small) systematic error. Now imagine using the tape to measure a distance of 2000 feet: you have to use the tape 20 times, each time marking the next 100 feet. Do it again, and you probably find a slightly different answer, no matter how hard you try to be very precise in every step of the measuring procedure.
This kind of variation is inevitable: sometimes we end up with a value that is a bit too high, other times it is too low, but on average we’re doing okay—assuming that we have eliminated sources of systematic error, as in the measuring tape. Michelson measured the distance five times, which resulted in values between 1984.93 and 1985.17 feet (after correcting for the temperature-dependent stretch), and he used the average as the “true distance.” In many phases of the measuring process Michelson attempted to identify and determine systematic errors and subsequently applied corrections. He
also systematically repeated measuring steps and averaged the results to reduce variability. His final dataset consists of 100 separate measurements (see Table 17.1), but each is in fact summarized and averaged from repeated measurements on several variables. The final result he reported was that the speed of light in vacuum (this involved a conversion) was 299 944 ± 51 km/s, where the 51 is an indication of the uncertainty in the answer. In retrospect, we must conclude that, in spite of Michelson's admirable meticulousness, some source of error must have slipped his attention, as his result is off by about 150 km/s. With current methods we would derive from his data a so-called 95% confidence interval: 299 944 ± 15.5 km/s, suggesting that Michelson's uncertainty analysis was a little conservative. The methods used to construct confidence intervals are the topic of Chapters 23 and 24.
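To make the geometry of Figure 1.4 concrete, here is a small sketch that turns the four measured quantities into a speed. The numbers are illustrative placeholders of roughly the right order of magnitude, not Michelson's actual measurements; only the formulas tan 2α = displacement/radius, time = α/(2π · rps), and c_air = 2 · distance/time come from the text.

```python
from math import atan, pi

def speed_of_light_air(distance_m, radius_m, displacement_m, rps):
    """Combine the four measured quantities, following the formulas in the text."""
    alpha = atan(displacement_m / radius_m) / 2   # tan(2*alpha) = displacement / radius
    elapsed = alpha / (2 * pi * rps)              # time for the mirror to turn by alpha
    return 2 * distance_m / elapsed               # the light covers the distance twice

# Placeholder values only; prints roughly 3.0e8 m/s.
print(speed_of_light_air(distance_m=605.0, radius_m=8.6, displacement_m=0.112, rps=257))
```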
2 Outcomes, events, and probability

The world around us is full of phenomena we perceive as random or unpredictable. We aim to model these phenomena as outcomes of some experiment, where you should think of experiment in a very general sense. The outcomes are elements of a sample space Ω, and subsets of Ω are called events. The events will be assigned a probability, a number between 0 and 1 that expresses how likely the event is to occur.

2.1 Sample spaces

Sample spaces are simply sets whose elements describe the outcomes of the experiment in which we are interested. We start with the most basic experiment: the tossing of a coin. Assuming that we will never see the coin land on its rim, there are two possible outcomes: heads and tails. We therefore take as the sample space associated with this experiment the set Ω = {H, T}.

In another experiment we ask the next person we meet on the street in which month her birthday falls. An obvious choice for the sample space is Ω = {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec}.

In a third experiment we load a scale model for a bridge up to the point where the structure collapses. The outcome is the load at which this occurs. In reality, one can only measure with finite accuracy, e.g., to five decimals, and a sample space with just those numbers would strictly be adequate. However, in principle, the load itself could be any positive number and therefore Ω = (0, ∞) is the right choice. Even though in reality there may also be an upper limit to what loads are conceivable, it is not necessary or practical to try to limit the outcomes correspondingly.
In a fourth experiment, we find on our doormat three envelopes, sent to us by three different persons, and we look in which order the envelopes lie on top of each other. Coding them 1, 2, and 3, the sample space would be Ω = {123, 132, 213, 231, 312, 321}.

Quick exercise 2.1 If we received mail from four different persons, how many elements would the corresponding sample space have?

In general one might consider the order in which n different objects can be placed. This is called a permutation of the n objects. As we have seen, there are 6 possible permutations of 3 objects, and 4 · 6 = 24 of 4 objects. What happens is that if we add the nth object, then this can be placed in any of n positions in any of the permutations of n − 1 objects. Therefore there are n · (n − 1) · · · 3 · 2 · 1 = n! possible permutations of n objects. Here n! is the standard notation for this product and is pronounced "n factorial." It is convenient to define 0! = 1.

2.2 Events

Subsets of the sample space are called events. We say that an event A occurs if the outcome of the experiment is an element of the set A. For example, in the birthday experiment we can ask for the outcomes that correspond to a long month, i.e., a month with 31 days. This is the event L = {Jan, Mar, May, Jul, Aug, Oct, Dec}.

Events may be combined according to the usual set operations. For example, if R is the event that corresponds to the months that have the letter r in their (full) name (so R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec}), then the long months that contain the letter r are L ∩ R = {Jan, Mar, Oct, Dec}. The set L ∩ R is called the intersection of L and R and occurs if both L and R occur. Similarly, we have the union A ∪ B of two sets A and B, which occurs if at least one of the events A and B occurs. Another common operation is taking complements. The event A^c = {ω ∈ Ω : ω ∉ A} is called the complement of A; it occurs if and only if A does not occur. The complement of Ω is denoted ∅, the empty set, which represents the impossible event. Figure 2.1 illustrates these three set operations.
[Figure 2.1: three Venn diagrams within Ω, with panels labeled "Intersection A ∩ B", "Union A ∪ B", and "Complement A^c".]
Fig. 2.1. Diagrams of intersection, union, and complement.

We call events A and B disjoint or mutually exclusive if A and B have no outcomes in common; in set terminology: A ∩ B = ∅. For example, the event L "the birthday falls in a long month" and the event {Feb} are disjoint. Finally, we say that event A implies event B if the outcomes of A also lie in B. In set notation: A ⊂ B; see Figure 2.2.

Some people like to use double negations: "It is certainly not true that neither John nor Mary is to blame." This is equivalent to: "John or Mary is to blame, or both." The following useful rules formalize this mental operation to a manipulation with events.

DeMorgan's laws. For any two events A and B we have
(A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c.

Quick exercise 2.2 Let J be the event "John is to blame" and M the event "Mary is to blame." Express the two statements above in terms of the events J, J^c, M, and M^c, and check the equivalence of the statements by means of DeMorgan's laws.

[Figure 2.2: two Venn diagrams within Ω, with panels labeled "Disjoint sets A and B" and "A subset of B".]
Fig. 2.2. Minimal and maximal intersection of two sets.
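A quick way to see these set operations at work is with Python sets. This is an illustrative sketch, not part of the book; it checks DeMorgan's laws on the month events L and R defined above.

```python
omega = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}            # sample space
L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}         # long months
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}  # months with an "r"

def complement(A):
    return omega - A

print(L & R)   # intersection L ∩ R: {'Jan', 'Mar', 'Oct', 'Dec'}
print(L | R)   # union L ∪ R
print(complement(L | R) == complement(L) & complement(R))   # True: (A ∪ B)^c = A^c ∩ B^c
print(complement(L & R) == complement(L) | complement(R))   # True: (A ∩ B)^c = A^c ∪ B^c
```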
2.3 Probability

We want to express how likely it is that an event occurs. To do this we will assign a probability to each event. The assignment of probabilities to events is in general not an easy task, and some of the coming chapters will be dedicated directly or indirectly to this problem. Since each event has to be assigned a probability, we speak of a probability function. It has to satisfy two basic properties.

Definition. A probability function P on a finite sample space Ω assigns to each event A in Ω a number P(A) in [0,1] such that
(i) P(Ω) = 1, and
(ii) P(A ∪ B) = P(A) + P(B) if A and B are disjoint.
The number P(A) is called the probability that A occurs.

Property (i) expresses that the outcome of the experiment is always an element of the sample space, and property (ii) is the additivity property of a probability function. It implies additivity of the probability function over more than two sets; e.g., if A, B, and C are disjoint events, then the two events A ∪ B and C are also disjoint, so

P(A ∪ B ∪ C) = P(A ∪ B) + P(C) = P(A) + P(B) + P(C).

We will now look at some examples. When we want to decide whether Peter or Paul has to wash the dishes, we might toss a coin. The fact that we consider this a fair way to decide translates into the opinion that heads and tails are equally likely to occur as the outcome of the coin-tossing experiment. So we put

P({H}) = P({T}) = 1/2.

Formally we have to write {H} for the set consisting of the single element H, because a probability function is defined on events, not on outcomes. From now on we shall drop these brackets.

Now it might happen, for example due to an asymmetric distribution of the mass over the coin, that the coin is not completely fair. For example, it might be the case that P(H) = 0.4999 and P(T) = 0.5001. More generally we can consider experiments with two possible outcomes, say "failure" and "success", which have probabilities 1 − p and p to occur, where p is a number between 0 and 1. For example, when our experiment consists of buying a ticket in a lottery with 10 000 tickets and only one prize, where "success" stands for winning the prize, then p = 10^{−4}.

How should we assign probabilities in the second experiment, where we ask for the month in which the next person we meet has his or her birthday? In analogy with what we have just done, we put
P(Jan) = P(Feb) = · · · = P(Dec) = 1/12.

Some of you might object to this and propose that we put, for example,

P(Jan) = 31/365 and P(Apr) = 30/365,

because we have long months and short months. But then the very precise among us might remark that this does not yet take care of leap years.

Quick exercise 2.3 If you would take care of the leap years, assuming that one in every four years is a leap year (which again is an approximation to reality!), how would you assign a probability to each month?

In the third experiment (the buckling load of a bridge), where the outcomes are real numbers, it is impossible to assign a positive probability to each outcome (there are just too many outcomes!). We shall come back to this problem in Chapter 5, restricting ourselves in this chapter to finite and countably infinite¹ sample spaces.

¹ This means: although infinite, we can still count them one by one; Ω = {ω_1, ω_2, . . .}. The interval [0,1] of real numbers is an example of an uncountable sample space.

In the fourth experiment it makes sense to assign equal probabilities to all six outcomes:

P(123) = P(132) = P(213) = P(231) = P(312) = P(321) = 1/6.

Until now we have only assigned probabilities to the individual outcomes of the experiments. To assign probabilities to events we use the additivity property. For instance, to find the probability P(T) of the event T that in the three envelopes experiment envelope 2 is on top we note that

P(T) = P(213) + P(231) = 1/6 + 1/6 = 1/3.

In general, additivity of P implies that the probability of an event is obtained by summing the probabilities of the outcomes belonging to the event.

Quick exercise 2.4 Compute P(L) and P(R) in the birthday experiment.

Finally we mention a rule that permits us to compute probabilities of events A and B that are not disjoint. Note that we can write A = (A ∩ B) ∪ (A ∩ B^c), which is a disjoint union; hence

P(A) = P(A ∩ B) + P(A ∩ B^c).

If we split A ∪ B in the same way with B and B^c, we obtain the events (A ∪ B) ∩ B, which is simply B, and (A ∪ B) ∩ B^c, which is nothing but A ∩ B^c.
Thus

P(A ∪ B) = P(B) + P(A ∩ B^c).

Eliminating P(A ∩ B^c) from these two equations we obtain the following rule.

The probability of a union. For any two events A and B we have
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

From the additivity property we can also find a way to compute probabilities of complements of events: from A ∪ A^c = Ω, we deduce that

P(A^c) = 1 − P(A).

2.4 Products of sample spaces

Basic to statistics is that one usually does not consider one experiment, but that the same experiment is performed several times. For example, suppose we throw a coin two times. What is the sample space associated with this new experiment? It is clear that it should be the set

Ω = {H, T} × {H, T} = {(H, H), (H, T), (T, H), (T, T)}.

If in the original experiment we had a fair coin, i.e., P(H) = P(T), then in this new experiment all 4 outcomes again have equal probabilities:

P((H, H)) = P((H, T)) = P((T, H)) = P((T, T)) = 1/4.

Somewhat more generally, if we consider two experiments with sample spaces Ω_1 and Ω_2 then the combined experiment has as its sample space the set

Ω = Ω_1 × Ω_2 = {(ω_1, ω_2) : ω_1 ∈ Ω_1, ω_2 ∈ Ω_2}.

If Ω_1 has r elements and Ω_2 has s elements, then Ω_1 × Ω_2 has rs elements. Now suppose that in the first, the second, and the combined experiment all outcomes are equally likely to occur. Then the outcomes in the first experiment have probability 1/r to occur, those of the second experiment 1/s, and those of the combined experiment probability 1/rs. Motivated by the fact that 1/rs = (1/r) × (1/s), we will assign probability p_i · p_j to the outcome (ω_i, ω_j) in the combined experiment, in the case that ω_i has probability p_i and ω_j has probability p_j to occur. One should realize that this is by no means the only way to assign probabilities to the outcomes of a combined experiment. The preceding choice corresponds to the situation where the two experiments do not influence each other in any way. What we mean by this influence will be explained in more detail in the next chapter.
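A small sketch of this product construction (illustrative only, not from the book): build Ω_1 × Ω_2 with itertools.product and give each pair the product of the individual probabilities.

```python
from itertools import product

# Two (possibly unfair) coins, given as dictionaries of outcome probabilities:
p1 = {"H": 0.4999, "T": 0.5001}
p2 = {"H": 0.5, "T": 0.5}

# Combined experiment: outcomes are pairs, probabilities multiply.
combined = {(w1, w2): p1[w1] * p2[w2] for w1, w2 in product(p1, p2)}

print(combined)                # four outcomes with probabilities p_i * p_j
print(sum(combined.values()))  # 1.0, so this is again a probability function
```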
Quick exercise 2.5 Consider the sample space {a_1, a_2, a_3, a_4, a_5, a_6} of some experiment, where outcome a_i has probability p_i for i = 1, . . . , 6. We perform this experiment twice in such a way that the associated probabilities are

P((a_i, a_i)) = p_i, and P((a_i, a_j)) = 0 if i ≠ j,

for i, j = 1, . . . , 6. Check that P is a probability function on the sample space Ω = {a_1, . . . , a_6} × {a_1, . . . , a_6} of the combined experiment. What is the relationship between the first experiment and the second experiment that is determined by this probability function?

We started this section with the experiment of throwing a coin twice. If we want to learn more about the randomness associated with a particular experiment, then we should repeat it more often, say n times. For example, if we perform an experiment with outcomes 1 (success) and 0 (failure) five times, and we consider the event A "exactly one experiment was a success," then this event is given by the set

A = {(0, 0, 0, 0, 1), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 1, 0, 0, 0), (1, 0, 0, 0, 0)}

in Ω = {0, 1} × {0, 1} × {0, 1} × {0, 1} × {0, 1}. Moreover, if success has probability p and failure probability 1 − p, then

P(A) = 5 · (1 − p)^4 · p,

since there are five outcomes in the event A, each having probability (1 − p)^4 · p.

Quick exercise 2.6 What is the probability of the event B "exactly two experiments were successful"?

In general, when we perform an experiment n times, then the corresponding sample space is Ω = Ω_1 × Ω_2 × · · · × Ω_n, where Ω_i for i = 1, . . . , n is a copy of the sample space of the original experiment. Moreover, we assign probabilities to the outcomes (ω_1, . . . , ω_n) in the standard way described earlier, i.e.,

P((ω_1, ω_2, . . . , ω_n)) = p_1 · p_2 · · · p_n,

if each ω_i has probability p_i.
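A brute-force check of this kind of computation (an illustrative sketch, not from the book): enumerate all 32 outcomes of {0, 1}^5, attach the product probabilities, and sum over the event.

```python
from itertools import product

p = 0.3                                     # an arbitrary success probability
outcomes = list(product([0, 1], repeat=5))  # the 32 outcomes of the repeated experiment

def prob(outcome):
    """Product probability: p for every success, 1 - p for every failure."""
    successes = sum(outcome)
    return p ** successes * (1 - p) ** (5 - successes)

A = [w for w in outcomes if sum(w) == 1]    # exactly one success

print(sum(prob(w) for w in A))   # equals 5 * (1 - p)**4 * p
print(5 * (1 - p) ** 4 * p)
```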
2.5 An infinite sample space

We end this chapter with an example of an experiment with infinitely many outcomes. We toss a coin repeatedly until the first head turns up. The outcome of the experiment is the number of tosses it takes to have this first occurrence of a head. Our sample space is the space of all positive natural numbers Ω = {1, 2, 3, . . .}.

What is the probability function P for this experiment? Suppose the coin has probability p of falling on heads and probability 1 − p to fall on tails, where 0 < p < 1. We determine the probability P(n) for each n. Clearly P(1) = p, the probability that we have a head right away. The event {2} corresponds to the outcome (T, H) in {H, T} × {H, T}, so we should have

P(2) = (1 − p)p.

Similarly, the event {n} corresponds to the outcome (T, T, . . . , T, T, H) in the space {H, T} × · · · × {H, T}. Hence we should have, in general,

P(n) = (1 − p)^{n−1} p, n = 1, 2, 3, . . . .

Does this define a probability function on Ω = {1, 2, 3, . . .}? Then we should at least have P(Ω) = 1. It is not directly clear how to calculate P(Ω): since the sample space is no longer finite we have to amend the definition of a probability function.

Definition. A probability function on an infinite (or finite) sample space Ω assigns to each event A in Ω a number P(A) in [0, 1] such that
(i) P(Ω) = 1, and
(ii) P(A_1 ∪ A_2 ∪ A_3 ∪ · · ·) = P(A_1) + P(A_2) + P(A_3) + · · · if A_1, A_2, A_3, . . . are disjoint events.

Note that this new additivity property is an extension of the previous one because if we choose A_3 = A_4 = · · · = ∅, then

P(A_1 ∪ A_2) = P(A_1 ∪ A_2 ∪ ∅ ∪ ∅ ∪ · · ·) = P(A_1) + P(A_2) + 0 + 0 + · · · = P(A_1) + P(A_2).

Now we can compute the probability of Ω:

P(Ω) = P(1) + P(2) + · · · + P(n) + · · ·
     = p + (1 − p)p + · · · + (1 − p)^{n−1} p + · · ·
     = p[1 + (1 − p) + · · · + (1 − p)^{n−1} + · · · ].

The sum 1 + (1 − p) + · · · + (1 − p)^{n−1} + · · · is an example of a geometric series. It is well known that when |1 − p| < 1,

1 + (1 − p) + · · · + (1 − p)^{n−1} + · · · = 1 / (1 − (1 − p)) = 1/p.

Therefore we do indeed have P(Ω) = p · (1/p) = 1.
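A numerical sanity check of P(Ω) = 1 (an illustrative sketch): the partial sums of (1 − p)^{n−1} p approach 1 as more terms are included.

```python
p = 0.25   # any value with 0 < p < 1

def partial_sum(n_terms):
    """P(1) + ... + P(n_terms) for the first-head distribution P(n) = (1 - p)**(n - 1) * p."""
    return sum((1 - p) ** (n - 1) * p for n in range(1, n_terms + 1))

for n in (1, 5, 20, 100):
    print(n, partial_sum(n))   # tends to 1, matching the geometric series 1/(1 - (1 - p)) = 1/p
```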
Quick exercise 2.7 Suppose an experiment in a laboratory is repeated every day of the week until it is successful, the probability of success being p. The first experiment is started on a Monday. What is the probability that the series ends on the next Sunday?

2.6 Solutions to the quick exercises

2.1 The sample space is Ω = {1234, 1243, 1324, 1342, . . . , 4321}. The best way to count its elements is by noting that for each of the 6 outcomes of the three-envelope experiment we can put a fourth envelope in any of 4 positions. Hence Ω has 4 · 6 = 24 elements.

2.2 The statement "It is certainly not true that neither John nor Mary is to blame" corresponds to the event (J^c ∩ M^c)^c. The statement "John or Mary is to blame, or both" corresponds to the event J ∪ M. Equivalence now follows from DeMorgan's laws.

2.3 In four years we have 365 × 3 + 366 = 1461 days. Hence long months each have a probability 4 × 31/1461 = 124/1461, and short months a probability 120/1461 to occur. Moreover, {Feb} has probability 113/1461.

2.4 Since there are 7 long months and 8 months with an "r" in their name, we have P(L) = 7/12 and P(R) = 8/12.

2.5 Checking that P is a probability function on Ω amounts to verifying that 0 ≤ P((a_i, a_j)) ≤ 1 for all i and j and noting that

P(Ω) = Σ_{i,j=1}^{6} P((a_i, a_j)) = Σ_{i=1}^{6} P((a_i, a_i)) = Σ_{i=1}^{6} p_i = 1.

The two experiments are totally coupled: one has outcome a_i if and only if the other has outcome a_i.

2.6 Now there are 10 outcomes in B (for example (0,1,0,1,0)), each having probability (1 − p)^3 p^2. Hence P(B) = 10(1 − p)^3 p^2.

2.7 This happens if and only if the experiment fails on Monday, . . . , Saturday, and is a success on Sunday. This has probability p(1 − p)^6 to happen.

2.7 Exercises

2.1 Let A and B be two events in a sample space for which P(A) = 2/3, P(B) = 1/6, and P(A ∩ B) = 1/9. What is P(A ∪ B)?
2.2 Let E and F be two events for which one knows that the probability that at least one of them occurs is 3/4. What is the probability that neither E nor F occurs? Hint: use one of DeMorgan's laws: E^c ∩ F^c = (E ∪ F)^c.

2.3 Let C and D be two events for which one knows that P(C) = 0.3, P(D) = 0.4, and P(C ∩ D) = 0.2. What is P(C^c ∩ D)?

2.4 We consider events A, B, and C, which can occur in some experiment. Is it true that the probability that only A occurs (and not B or C) is equal to P(A ∪ B ∪ C) − P(B) − P(C) + P(B ∩ C)?

2.5 The event A ∩ B^c that A occurs but not B is sometimes denoted as A \ B. Here \ is the set-theoretic minus sign. Show that P(A \ B) = P(A) − P(B) if B implies A, i.e., if B ⊂ A.

2.6 When P(A) = 1/3, P(B) = 1/2, and P(A ∪ B) = 3/4, what is
a. P(A ∩ B)?
b. P(A^c ∪ B^c)?

2.7 Let A and B be two events. Suppose that P(A) = 0.4, P(B) = 0.5, and P(A ∩ B) = 0.1. Find the probability that A or B occurs, but not both.

2.8 Suppose the events D_1 and D_2 represent disasters, which are rare: P(D_1) ≤ 10^{−6} and P(D_2) ≤ 10^{−6}. What can you say about the probability that at least one of the disasters occurs? What about the probability that they both occur?

2.9 We toss a coin three times. For this experiment we choose the sample space
Ω = {HHH, THH, HTH, HHT, TTH, THT, HTT, TTT}
where T stands for tails and H for heads.
a. Write down the set of outcomes corresponding to each of the following events:
A : "we throw tails exactly two times."
B : "we throw tails at least two times."
C : "tails did not appear before a head appeared."
D : "the first throw results in tails."
b. Write down the set of outcomes corresponding to each of the following events: A^c, A ∪ (C ∩ D), and A ∩ D^c.

2.10 In some sample space we consider two events A and B. Let C be the event that A or B occurs, but not both. Express C in terms of A and B, using only the basic operations "union," "intersection," and "complement."
2.11 An experiment has only two outcomes. The first has probability p to occur, the second probability p^2. What is p?

2.12 In the UEFA Euro 2004 playoffs draw 10 national football teams were matched in pairs. A lot of people complained that "the draw was not fair," because each strong team had been matched with a weak team (this is commercially the most interesting). It was claimed that such a matching is extremely unlikely. We will compute the probability of this "dream draw" in this exercise. In the spirit of the three-envelope example of Section 2.1 we put the names of the 5 strong teams in envelopes labeled 1, 2, 3, 4, and 5 and of the 5 weak teams in envelopes labeled 6, 7, 8, 9, and 10. We shuffle the 10 envelopes and then match the envelope on top with the next envelope, the third envelope with the fourth envelope, and so on. One particular way a "dream draw" occurs is when the five envelopes labeled 1, 2, 3, 4, 5 are in the odd numbered positions (in any order!) and the others are in the even numbered positions. This way corresponds to the situation where the first match of each strong team is a home match. Since for each pair there are two possibilities for the home match, the total number of possibilities for the "dream draw" is 2^5 = 32 times as large.
a. An outcome of this experiment is a sequence like 4, 9, 3, 7, 5, 10, 1, 8, 2, 6 of labels of envelopes. What is the probability of an outcome?
b. How many outcomes are there in the event "the five envelopes labeled 1, 2, 3, 4, 5 are in the odd positions—in any order, and the envelopes labeled 6, 7, 8, 9, 10 are in the even positions—in any order"?
c. What is the probability of a "dream draw"?

2.13 In some experiment first an arbitrary choice is made out of four possibilities, and then an arbitrary choice is made out of the remaining three possibilities. One way to describe this is with a product of two sample spaces {a, b, c, d}:
Ω = {a, b, c, d} × {a, b, c, d}.
a. Make a 4×4 table in which you write the probabilities of the outcomes.
b. Describe the event "c is one of the chosen possibilities" and determine its probability.

2.14 Consider the Monty Hall "experiment" described in Section 1.3. The door behind which the car is parked we label a, the other two b and c. As the sample space we choose a product space
Ω = {a, b, c} × {a, b, c}.
Here the first entry gives the choice of the candidate, and the second entry the choice of the quizmaster.
a. Make a 3×3 table in which you write the probabilities of the outcomes. N.B. You should realize that the candidate does not know that the car is in a, but the quizmaster will never open the door labeled a because he knows that the car is there. You may assume that the quizmaster makes an arbitrary choice between the doors labeled b and c, when the candidate chooses door a.
b. Consider the situation of a "no switching" candidate who will stick to his or her choice. What is the event "the candidate wins the car," and what is its probability?
c. Consider the situation of a "switching" candidate who will not stick to her choice. What is now the event "the candidate wins the car," and what is its probability?

2.15 The rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B) from Section 2.3 is often useful to compute the probability of the union of two events. What would be the corresponding rule for three events A, B, and C? It should start with
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − · · · .
Hint: you could use the sum rule suitably, or you could make a diagram as in Figure 2.1.

2.16 Three events E, F, and G cannot occur simultaneously. Further it is known that P(E ∩ F) = P(F ∩ G) = P(E ∩ G) = 1/3. Can you determine P(E)? Hint: if you try to use the formula of Exercise 2.15 then it seems that you do not have enough information; make a diagram instead.

2.17 A post office has two counters where customers can buy stamps, etc. If you are interested in the number of customers in the two queues that will form for the counters, what would you take as sample space?

2.18 In a laboratory, two experiments are repeated every day of the week in different rooms until at least one is successful, the probability of success being p for each experiment. Supposing that the experiments in different rooms and on different days are performed independently of each other, what is the probability that the laboratory scores its first successful experiment on day n?

2.19 We repeatedly toss a coin. A head has probability p, and a tail probability 1 − p to occur, where 0 < p < 1. The outcome of the experiment we are interested in is the number of tosses it takes until a head occurs for the second time.
a. What would you choose as the sample space?
b. What is the probability that it takes 5 tosses?
3 Conditional probability and independence

Knowing that an event has occurred sometimes forces us to reassess the probability of another event; the new probability is the conditional probability. If the conditional probability equals what the probability was before, the events involved are called independent. Often, conditional probabilities and independence are needed if we want to compute probabilities, and in many other situations they simplify the work.

3.1 Conditional probability

In the previous chapter we encountered the events L, "born in a long month," and R, "born in a month with the letter r." Their probabilities are easy to compute: since

L = {Jan, Mar, May, Jul, Aug, Oct, Dec} and
R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec},

one finds P(L) = 7/12 and P(R) = 8/12.

Now suppose that it is known about the person we meet in the street that he was born in a "long month," and we wonder whether he was born in a "month with the letter r." The information given excludes five outcomes of our sample space: it cannot be February, April, June, September, or November. Seven possible outcomes are left, of which only four—those in R ∩ L = {Jan, Mar, Oct, Dec}—are favorable, so we reassess the probability as 4/7. We call this the conditional probability of R given L, and we write:

P(R | L) = 4/7.

This is not the same as P(R ∩ L), which is 1/3. Also note that P(R | L) is the proportion that P(R ∩ L) is of P(L).
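The same reassessment takes two lines in code (an illustrative sketch, reusing the month events from Chapter 2): with all twelve months equally likely, count the favorable outcomes within the conditioning event.

```python
L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}         # long months
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}  # months with an "r"

print(len(R & L) / len(L))   # P(R | L) = 4/7: the fraction of L that also lies in R
print(len(R & L) / 12)       # P(R ∩ L) = 1/3, which is not the same thing
```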
Quick exercise 3.1 Let N = R^c be the event "born in a month without r." What is the conditional probability P(N | L)?

Recalling the three envelopes on our doormat, consider the events "envelope 1 is the middle one" (call this event A) and "envelope 2 is the middle one" (B). Then P(A) = P(213 or 312) = 1/3; by symmetry, the same is found for P(B). We say that the envelopes are in order if their order is either 123 or 321. Suppose we know that they are not in order, but otherwise we do not know anything; what are the probabilities of A and B, given this information? Let C be the event that the envelopes are not in order, so:

C = {123, 321}^c = {132, 213, 231, 312}.

We ask for the probabilities of A and B, given that C occurs. Event C consists of four elements, two of which also belong to A: A ∩ C = {213, 312}, so P(A | C) = 1/2. The probability of A ∩ C is half of P(C). No element of C also belongs to B, so P(B | C) = 0.

Quick exercise 3.2 Calculate P(C | A) and P(C^c | A ∪ B).

In general, computing the probability of an event A, given that an event C occurs, means finding which fraction of the probability of C is also in the event A.

Definition. The conditional probability of A given C is given by:
P(A | C) = P(A ∩ C) / P(C),
provided P(C) > 0.

Quick exercise 3.3 Show that P(A | C) + P(A^c | C) = 1.

This exercise shows that the rule P(A^c) = 1 − P(A) also holds for conditional probabilities. In fact, even more is true: if we have a fixed conditioning event C and define Q(A) = P(A | C) for events A ⊂ Ω, then Q is a probability function and hence satisfies all the rules as described in Chapter 2. The definition of conditional probability agrees with our intuition and it also works in situations where computing probabilities by counting outcomes does not.

A chemical reactor: residence times

Consider a continuously stirred reactor vessel where a chemical reaction takes place. On one side fluid or gas flows in, mixes with whatever is already present in the vessel, and eventually flows out on the other side. One characteristic of each particular reaction setup is the so-called residence time distribution, which tells us how long particles stay inside the vessel before moving on. We consider a continuously stirred tank: the contents of the vessel are perfectly mixed at all times.
Let R_t denote the event "the particle has a residence time longer than t seconds." In Section 5.3 we will see how continuous stirring determines the probabilities; here we just use that in a particular continuously stirred tank, R_t has probability e^{−t}. So:

P(R_3) = e^{−3} = 0.04978 . . .
P(R_4) = e^{−4} = 0.01831 . . . .

We can use the definition of conditional probability to find the probability that a particle that has stayed more than 3 seconds will stay more than 4:

P(R_4 | R_3) = P(R_4 ∩ R_3) / P(R_3) = P(R_4) / P(R_3) = e^{−4} / e^{−3} = e^{−1} = 0.36787 . . . .

Quick exercise 3.4 Calculate P(R_3 | R_4^c).

For more details on the subject of residence time distributions see, for example, the book on reaction engineering by Fogler ([11]).

3.2 The multiplication rule

From the definition of conditional probability we derive a useful rule by multiplying left and right by P(C).

The multiplication rule. For any events A and C:
P(A ∩ C) = P(A | C) · P(C).

Computing the probability of A ∩ C can hence be decomposed into two parts, computing P(C) and P(A | C) separately, which is often easier than computing P(A ∩ C) directly.

The probability of no coincident birthdays

Suppose you meet two arbitrarily chosen people. What is the probability their birthdays are different? Let B_2 denote the event that this happens. Whatever the birthday of the first person is, there is only one day the second person cannot "pick" as birthday, so:

P(B_2) = 1 − 1/365.

When the same question is asked with three people, conditional probabilities become helpful. The event B_3 can be seen as the intersection of the event B_2,
  • 39. 28 3 Conditional probability and independence “the first two have different birthdays,” with event A3 “the third person has a birthday that does not coincide with that of one of the first two persons.” Using the multiplication rule: P(B3) = P(A3 ∩ B2) = P(A3 | B2)P(B2) . The conditional probability P(A3 | B2) is the probability that, when two days are already marked on the calendar, a day picked at random is not marked, or P(A3 | B2) = 1 − 2 365 , and so P(B3) = P(A3 | B2)P(B2) = 1 − 2 365 · 1 − 1 365 = 0.9918. We are already halfway to solving the general question: in a group of n arbi- trarily chosen people, what is the probability there are no coincident birth- days? The event Bn of no coincident birthdays among the n persons is the same as: “the birthdays of the first n − 1 persons are different” (the event Bn−1) and “the birthday of the nth person does not coincide with a birthday of any of the first n − 1 persons” (the event An), that is, Bn = An ∩ Bn−1. Applying the multiplication rule yields: P(Bn) = P(An | Bn−1) · P(Bn−1) = 1 − n − 1 365 · P(Bn−1) as person n should avoid n − 1 days. Applying the same step to P(Bn−1), P(Bn−2), etc., we find: P(Bn) = 1 − n − 1 365 · P(An−1 | Bn−2) · P(Bn−2) = 1 − n − 1 365 · 1 − n − 2 365 · P(Bn−2) . . . = 1 − n − 1 365 · · · 1 − 2 365 · P(B2) = 1 − n − 1 365 · · · 1 − 2 365 · 1 − 1 365 . This can be used to compute the probability for arbitrary n. For example, we find: P(B22) = 0.5243 and P(B23) = 0.4927. In Figure 3.1 the probability
• 40. 3.2 The multiplication rule 29

Fig. 3.1. The probability P(Bn) of no coincident birthdays for n = 1, . . . , 100.

P(Bn) is plotted for n = 1, . . . , 100, with dotted lines drawn at n = 23 and at probability 0.5. It may be hard to believe, but with just 23 people the probability of all birthdays being different is less than 50%!

Quick exercise 3.5 Compute the probability that three arbitrary people are born in different months. Can you give the formula for n people?

It matters how one conditions

Conditioning can help to make computations easier, but it matters how it is applied. To compute P(A ∩ C) we may condition on C to get P(A ∩ C) = P(A | C) · P(C); or we may condition on A and get P(A ∩ C) = P(C | A) · P(A). Both ways are valid, but often one of P(A | C) and P(C | A) is easy and the other is not. For example, in the birthday example one could have tried P(B3) = P(A3 ∩ B2) = P(B2 | A3) P(A3), but just trying to understand the conditional probability P(B2 | A3) is already confusing: the probability that the first two persons' birthdays differ, given that the third person's birthday does not coincide with the birthday of one of the first two . . . ? Conditioning should lead to easier probabilities; if not, it is probably the wrong approach.
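The product formula for P(Bn) derived above is easy to evaluate by machine. The following sketch (ours, not the book's) computes P(Bn) exactly as in the derivation and reports the first group size for which the probability of all-different birthdays drops below 50%.

```python
def prob_no_coincident_birthdays(n, days=365):
    """P(B_n): probability that n people all have different birthdays."""
    p = 1.0
    for j in range(1, n):          # person j+1 must avoid the j days already taken
        p *= 1 - j / days
    return p

print(round(prob_no_coincident_birthdays(22), 4))   # 0.5243
print(round(prob_no_coincident_birthdays(23), 4))   # 0.4927

# Smallest group size for which the probability of all-different birthdays is below 50%.
n = 1
while prob_no_coincident_birthdays(n) >= 0.5:
    n += 1
print(n)                                             # 23
```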
  • 41. 30 3 Conditional probability and independence 3.3 The law of total probability and Bayes’ rule We will now discuss two important rules that help probability computations by means of conditional probabilities. We introduce both of them in the next example. Testing for mad cow disease In early 2001 the European Commission introduced massive testing of cattle to determine infection with the transmissible form of Bovine Spongiform En- cephalopathy (BSE) or “mad cow disease.” As no test is 100% accurate, most tests have the problem of false positives and false negatives. A false positive means that according to the test the cow is infected, but in actuality it is not. A false negative means an infected cow is not detected by the test. Imagine we test a cow. Let B denote the event “the cow has BSE” and T the event “the test comes up positive” (this is test jargon for: according to the test we should believe the cow is infected with BSE). One can “test the test” by analyzing samples from cows that are known to be infected or known to be healthy and so determine the effectiveness of the test. The European Commission had this done for four tests in 1999 (see [19]) and for several more later. The results for what the report calls Test A may be summarized as follows: an infected cow has a 70% chance of testing positive, and a healthy cow just 10%; in formulas: P(T | B) = 0.70, P(T | Bc ) = 0.10. Suppose we want to determine the probability P(T ) that an arbitrary cow tests positive. The tested cow is either infected or it is not: event T occurs in combination with B or with Bc (there are no other possibilities). In terms of events T = (T ∩ B) ∪ (T ∩ Bc ), so that P(T ) = P(T ∩ B) + P(T ∩ Bc ) , because T ∩B and T ∩Bc are disjoint. Next, apply the multiplication rule (in such a way that the known conditional probabilities appear!): P(T ∩ B) = P(T | B) · P(B) P(T ∩ Bc ) = P(T | Bc ) · P(Bc ) (3.1) so that P(T ) = P(T | B) · P(B) + P(T | Bc ) · P(Bc ) . (3.2) This is an application of the law of total probability: computing a probability through conditioning on several disjoint events that make up the whole sample
• 42. 3.3 The law of total probability and Bayes' rule 31

space (in this case two). Suppose[1] P(B) = 0.02; then from the last equation we conclude: P(T) = 0.02 · 0.70 + (1 − 0.02) · 0.10 = 0.112.

Quick exercise 3.6 Calculate P(T) when P(T | B) = 0.99 and P(T | B^c) = 0.05.

Following is a general statement of the law.

The law of total probability. Suppose C1, C2, . . . , Cm are disjoint events such that C1 ∪ C2 ∪ · · · ∪ Cm = Ω. The probability of an arbitrary event A can be expressed as:

P(A) = P(A | C1)P(C1) + P(A | C2)P(C2) + · · · + P(A | Cm)P(Cm).

Figure 3.2 illustrates the law for m = 5. The event A is the disjoint union of A ∩ Ci, for i = 1, . . . , 5, so P(A) = P(A ∩ C1) + · · · + P(A ∩ C5), and for each i the multiplication rule states P(A ∩ Ci) = P(A | Ci) · P(Ci).

Fig. 3.2. The law of total probability (illustration for m = 5). (The figure shows Ω partitioned into C1, . . . , C5, with the event A cut into the pieces A ∩ C1, . . . , A ∩ C5.)

In the BSE example, we have just two mutually exclusive events: substitute m = 2, C1 = B, C2 = B^c, and A = T to obtain (3.2). Another, perhaps more pertinent, question about the BSE test is the following: suppose my cow tests positive; what is the probability it really has BSE? Translated, this asks for the value of P(B | T). The information we were given is P(T | B), a conditional probability, but the wrong one. We would like to switch T and B. Start with the definition of conditional probability and then use equations (3.1) and (3.2):

[1] We choose this probability for the sake of the calculations that follow. The true value is unknown and varies from country to country. The BSE risk for the Netherlands for 2003 was estimated to be P(B) ≈ 0.000013.
  • 43. 32 3 Conditional probability and independence P(B | T ) = P(T ∩ B) P(T ) = P(T | B) · P(B) P(T | B) · P(B) + P(T | Bc) · P(Bc) . So with P(B) = 0.02 we find P(B | T ) = 0.70 · 0.02 0.70 · 0.02 + 0.10 · (1 − 0.02) = 0.125, and by a similar calculation: P(B | T c ) = 0.0068. These probabilities reflect that this Test A is not a very good test; a perfect test would result in P(B | T ) = 1 and P(B | T c ) = 0. In Exercise 3.4 we redo this calculation, replacing P(B) = 0.02 with a more realistic number. What we have just seen is known as Bayes’ rule, after the English clergyman Thomas Bayes who derived this in the 18th century. The general statement follows. Bayes’ rule. Suppose the events C1, C2, . . . , Cm are disjoint and C1 ∪ C2 ∪ · · · ∪ Cm = Ω. The conditional probability of Ci, given an arbitrary event A, can be expressed as: P(Ci | A) = P(A | Ci) · P(Ci) P(A | C1)P(C1) + P(A | C2)P(C2) + · · · + P(A | Cm)P(Cm) . This is the traditional form of Bayes’ formula. It follows from P(Ci | A) = P(A | Ci) · P(Ci) P(A) (3.3) in combination with the law of total probability applied to P(A) in the de- nominator. Purists would refer to (3.3) as Bayes’ rule, and perhaps they are right. Quick exercise 3.7 Calculate P(B | T ) and P(B | T c ) if P(T | B) = 0.99 and P(T | Bc ) = 0.05. 3.4 Independence Consider three probabilities from the previous section: P(B) = 0.02, P(B | T ) = 0.125, P(B | T c ) = 0.0068. If we know nothing about a cow, we would say that there is a 2% chance it is infected. However, if we know it tested positive, we can say there is a 12.5%
  • 44. 3.4 Independence 33 chance the cow is infected. On the other hand, if it tested negative, there is only a 0.68% chance. We see that the two events are related in some way: the probability of B depends on whether T occurs. Imagine the opposite: the test is useless. Whether the cow is infected is unre- lated to the outcome of the test, and knowing the outcome of the test does not change our probability of B: P(B | T ) = P(B). In this case we would call B independent of T . Definition. An event A is called independent of B if P(A | B) = P(A) . From this simple definition many statements can be derived. For example, because P(Ac | B) = 1 − P(A | B) and 1 − P(A) = P(Ac ), we conclude: A independent of B ⇔ Ac independent of B. (3.4) By application of the multiplication rule, if A is independent of B, then P(A ∩ B) = P(A | B)P(B) = P(A) P(B). On the other hand, if P(A ∩ B) = P(A) P(B), then P(A | B) = P(A) follows from the definition of independence. This shows: A independent of B ⇔ P(A ∩ B) = P(A) P(B) . Finally, by definition of conditional probability, if A is independent of B, then P(B | A) = P(A ∩ B) P(A) = P(A) · P(B) P(A) = P(B) , that is, B is independent of A. This works in reverse, too, so we have: A independent of B ⇔ B independent of A. (3.5) This statement says that in fact, independence is a mutual property. Therefore, the expressions “A is independent of B” and “A and B are independent” are used interchangeably. From the three ⇔-statements it follows that there are in fact 12 ways to show that A and B are independent; and if they are, there are 12 ways to use that. Independence. To show that A and B are independent it suffices to prove just one of the following: P(A | B) = P(A) , P(B | A) = P(B) , P(A ∩ B) = P(A) P(B) , where A may be replaced by Ac and B replaced by Bc , or both. If one of these statements holds, all of them are true. If two events are not independent, they are called dependent.
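The three probabilities that opened this section can be reproduced in a few lines. The sketch below is our own illustration: it applies the law of total probability and Bayes' rule to the Test A numbers P(T | B) = 0.70 and P(T | B^c) = 0.10 with the assumed prior P(B) = 0.02, and then compares P(B ∩ T) with P(B) P(T) to confirm that B and T are dependent.

```python
p_B = 0.02           # assumed prior probability that the cow has BSE (as in the text)
p_T_given_B = 0.70   # probability of a positive test for an infected cow (Test A)
p_T_given_Bc = 0.10  # probability of a positive test for a healthy cow (Test A)

# Law of total probability: P(T) = P(T|B)P(B) + P(T|B^c)P(B^c)
p_T = p_T_given_B * p_B + p_T_given_Bc * (1 - p_B)

# Bayes' rule: P(B|T) and P(B|T^c)
p_B_given_T = p_T_given_B * p_B / p_T
p_B_given_Tc = (1 - p_T_given_B) * p_B / (1 - p_T)

print(round(p_T, 3))            # 0.112
print(round(p_B_given_T, 3))    # 0.125
print(round(p_B_given_Tc, 4))   # 0.0068

# Dependence check: P(B ∩ T) versus P(B) P(T)
print(round(p_T_given_B * p_B, 4), round(p_B * p_T, 4))   # 0.014 vs 0.0022
```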
  • 45. 34 3 Conditional probability and independence Recall the birthday events L “born in a long month” and R “born in a month with the letter r.” Let H be the event “born in the first half of the year,” so P(H) = 1/2. Also, P(H | R) = 1/2. So H and R are independent, and we conclude, for example, P(Rc | Hc ) = P(Rc ) = 1 − 8/12 = 1/3. We know that P(L ∩ H) = 1/4 and P(L) = 7/12. Checking 1/2 ×7/12 = 1/4, you conclude that L and H are dependent. Quick exercise 3.8 Derive the statement “Rc is independent of Hc ” from “H is independent of R” using rules (3.4) and (3.5). Since the words dependence and independence have several meanings, one sometimes uses the terms stochastic or statistical dependence and indepen- dence to avoid ambiguity. Remark 3.1 (Physical and stochastic independence). Stochastic dependence or independence can sometimes be established by inspecting whether there is any physical dependence present. The following statements may be made. If events have to do with processes or experiments that have no physical con- nection, they are always stochastically independent. If they are connected to the same physical process, then, as a rule, they are stochastically de- pendent, but stochastic independence is possible in exceptional cases. The events H and R are an example. Independence of two or more events When more than two events are involved we need a more elaborate definition of independence. The reason behind this is explained by an example following the definition. Independence of two or more events. Events A1, A2, . . . , Am are called independent if P(A1 ∩ A2 ∩ · · · ∩ Am) = P(A1) P(A2) · · · P(Am) and this statement also holds when any number of the events A1, . . . , Am are replaced by their complements throughout the formula. You see that we need to check 2m equations to establish the independence of m events. In fact, m + 1 of those equations are redundant, but we chose this version of the definition because it is easier. The reason we need to do so much more checking to establish independence for multiple events is that there are subtle ways in which events may depend on each other. Consider the question: Is independence for three events A, B, and C the same as: A and B are independent; B and C are independent; and A and C are independent?
  • 46. 3.5 Solutions to the quick exercises 35 The answer is “No,” as the following example shows. Perform two independent tosses of a coin. Let A be the event “heads on toss 1,” B the event “heads on toss 2,” and C “the two tosses are equal.” First, get the probabilities. Of course, P(A) = P(B) = 1/2, but also P(C) = P(A ∩ B) + P(Ac ∩ Bc ) = 1 4 + 1 4 = 1 2 . What about independence? Events A and B are independent by assumption, so check the independence of A and C. Given that the first toss is heads (A occurs), C occurs if and only if the second toss is heads as well (B occurs), so P(C | A) = P(B | A) = P(B) = 1 2 = P(C) . By symmetry, also P(C | B) = P(C), so all pairs taken from A, B, C are independent: the three are called pairwise independent. Checking the full con- ditions for independence, we find, for example: P(A ∩ B ∩ C) = P(A ∩ B) = 1 4 , whereas P(A) P(B) P(C) = 1 8 , and P(A ∩ B ∩ Cc ) = P(∅) = 0, whereas P(A) P(B) P(Cc ) = 1 8 . The reason for this is clear: whether C occurs follows deterministically from the outcomes of tosses 1 and 2. 3.5 Solutions to the quick exercises 3.1 N = {May, Jun, Jul, Aug}, L = {Jan, Mar, May, Jul, Aug, Oct, Dec}, and N ∩ L = {May, Jul, Aug}. Three out of seven outcomes of L belong to N as well, so P(N | L) = 3/7. 3.2 The event A is contained in C. So when A occurs, C also occurs; therefore P(C | A) = 1. Since Cc = {123, 321} and A ∪ B = {123, 321, 312, 213}, one can see that two of the four outcomes of A ∪ B belong to Cc as well, so P(Cc | A ∪ B) = 1/2. 3.3 Using the definition we find: P(A | C) + P(Ac | C) = P(A ∩ C) P(C) + P(Ac ∩ C) P(C) = 1, because C can be split into disjoint parts A ∩ C and Ac ∩ C and therefore P(A ∩ C) + P(Ac ∩ C) = P(C) .
  • 47. 36 3 Conditional probability and independence 3.4 This asks for the probability that the particle stays more than 3 seconds, given that it does not stay longer than 4 seconds, so 4 or less. From the definition: P(R3 | Rc 4) = P(R3 ∩ Rc 4) P(Rc 4) . The event R3 ∩ Rc 4 describes: longer than 3 but not longer than 4 seconds. Furthermore, R3 is the disjoint union of the events R3 ∩Rc 4 and R3 ∩R4 = R4, so P(R3 ∩ Rc 4) = P(R3) − P(R4) = e−3 − e−4 . Using the complement rule: P(Rc 4) = 1 − P(R4) = 1 − e−4 . Together: P(R3 | Rc 4) = e−3 − e−4 1 − e−4 = 0.0315 0.9817 = 0.0321. 3.5 Instead of a calendar of 365 days, we have one with just 12 months. Let Cn be the event n arbitrary persons have different months of birth. Then P(C3) = 1 − 2 12 · 1 − 1 12 = 55 72 = 0.7639 and it is no surprise that this is much smaller than P(B3). The general formula is P(Cn) = 1 − n − 1 12 · · · 1 − 2 12 · 1 − 1 12 . Note that it is correct even if n is 13 or more, in which case P(Cn) = 0. 3.6 Repeating the calculation we find: P(T ∩ B) = 0.99 · 0.02 = 0.0198 P(T ∩ Bc ) = 0.05 · 0.98 = 0.0490 so P(T ) = P(T ∩ B) + P(T ∩ Bc ) = 0.0198 + 0.0490 = 0.0688. 3.7 In the solution to Quick exercise 3.5 we already found P(T ∩ B) = 0.0198 and P(T ) = 0.0688, so P(B | T ) = P(T ∩ B) P(T ) = 0.0198 0.0688 = 0.2878. Further, P(T c ) = 1 − 0.0688 = 0.9312 and P(T c | B) = 1 − P(T | B) = 0.01. So, P(B ∩ T c ) = 0.01 · 0.02 = 0.0002 and P(B | T c ) = 0.0002 0.9312 = 0.00021. 3.8 It takes three steps of applying (3.4) and (3.5): H independent of R ⇔ Hc independent of R by (3.4) Hc independent of R ⇔ R independent of Hc by (3.5) R independent of Hc ⇔ Rc independent of Hc by (3.4).
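Before turning to the exercises, here is a small enumeration check (ours, not the book's) of the two-coin example from Section 3.4, with A = "heads on toss 1," B = "heads on toss 2," and C = "the two tosses are equal": the three events are pairwise independent, yet not independent.

```python
from itertools import product

# Sample space of two fair coin tosses, each of the four outcomes with probability 1/4.
omega = list(product('HT', repeat=2))

A = {w for w in omega if w[0] == 'H'}    # heads on toss 1
B = {w for w in omega if w[1] == 'H'}    # heads on toss 2
C = {w for w in omega if w[0] == w[1]}   # the two tosses are equal

def P(event):
    return len(event) / len(omega)

# Pairwise independence: P(X ∩ Y) = P(X) P(Y) for each pair
print(P(A & B) == P(A) * P(B))   # True
print(P(A & C) == P(A) * P(C))   # True
print(P(B & C) == P(B) * P(C))   # True

# Full independence fails: P(A ∩ B ∩ C) = 1/4, but P(A) P(B) P(C) = 1/8
print(P(A & B & C), P(A) * P(B) * P(C))   # 0.25 0.125
```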
  • 48. 3.6 Exercises 37 3.6 Exercises 3.1 Your lecturer wants to walk from A to B (see the map). To do so, he first randomly selects one of the paths to C, D, or E. Next he selects randomly one of the possible paths at that moment (so if he first selected the path to E, he can either select the path to A or the path to F), etc. What is the probability that he will reach B after two selections? A B C D E F • • • • • • • • • . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 A fair die is thrown twice. A is the event “sum of the throws equals 4,” B is “at least one of the throws is a 3.” a. Calculate P(A | B). b. Are A and B independent events? 3.3 We draw two cards from a regular deck of 52. Let S1 be the event “the first one is a spade,” and S2 “the second one is a spade.” a. Compute P(S1), P(S2 | S1), and P(S2 | Sc 1). b. Compute P(S2) by conditioning on whether the first card is a spade. 3.4 A Dutch cow is tested for BSE, using Test A as described in Section 3.3, with P(T | B) = 0.70 and P(T | Bc ) = 0.10. Assume that the BSE risk for the Netherlands is the same as in 2003, when it was estimated to be P(B) = 1.3 · 10−5 . Compute P(B | T ) and P(B | T c ). 3.5 A ball is drawn at random from an urn containing one red and one white ball. If the white ball is drawn, it is put back into the urn. If the red ball is drawn, it is returned to the urn together with two more red balls. Then a second draw is made. What is the probability a red ball was drawn on both the first and the second draws? 3.6 We choose a month of the year, in such a manner that each month has the same probability. Find out whether the following events are independent: a. the events “outcome is an even numbered month” (i.e., February, April, June, etc.) and “outcome is in the first half of the year.” b. the events “outcome is an even numbered month” (i.e., February, April, June, etc.) and “outcome is a summer month” (i.e., June, July, August).
  • 49. 38 3 Conditional probability and independence 3.7 Calculate a. P(A ∪ B) if it is given that P(A) = 1/3 and P(B | Ac ) = 1/4. b. P(B) if it is given that P(A ∪ B) = 2/3 and P(Ac | Bc ) = 1/2. 3.8 Spaceman Spiff’s spacecraft has a warning light that is supposed to switch on when the freem blasters are overheated. Let W be the event “the warning light is switched on” and F “the freem blasters are overheated.” Suppose the probability of freem blaster overheating P(F) is 0.1, that the light is switched on when they actually are overheated is 0.99, and that there is a 2% chance that it comes on when nothing is wrong: P(W | Fc ) = 0.02. a. Determine the probability that the warning light is switched on. b. Determine the conditional probability that the freem blasters are over- heated, given that the warning light is on. 3.9 A certain grapefruit variety is grown in two regions in southern Spain. Both areas get infested from time to time with parasites that damage the crop. Let A be the event that region R1 is infested with parasites and B that region R2 is infested. Suppose P(A) = 3/4, P(B) = 2/5 and P(A ∪ B) = 4/5. If the food inspection detects the parasite in a ship carrying grapefruits from R1, what is the probability region R2 is infested as well? 3.10 A student takes a multiple-choice exam. Suppose for each question he either knows the answer or gambles and chooses an option at random. Further suppose that if he knows the answer, the probability of a correct answer is 1, and if he gambles this probability is 1/4. To pass, students need to answer at least 60% of the questions correctly. The student has “studied for a minimal pass,” i.e., with probability 0.6 he knows the answer to a question. Given that he answers a question correctly, what is the probability that he actually knows the answer? 3.11 A breath analyzer, used by the police to test whether drivers exceed the legal limit set for the blood alcohol percentage while driving, is known to satisfy P(A | B) = P(Ac | Bc ) = p, where A is the event “breath analyzer indicates that legal limit is exceeded” and B “driver’s blood alcohol percentage exceeds legal limit.” On Saturday night about 5% of the drivers are known to exceed the limit. a. Describe in words the meaning of P(Bc | A). b. Determine P(Bc | A) if p = 0.95. c. How big should p be so that P(B | A) = 0.9? 3.12 The events A, B, and C satisfy: P(A | B ∩ C) = 1/4, P(B | C) = 1/3, and P(C) = 1/2. Calculate P(Ac ∩ B ∩ C).
  • 50. 3.6 Exercises 39 3.13 In Exercise 2.12 we computed the probability of a “dream draw” in the UEFA playoffs lottery by counting outcomes. Recall that there were ten teams in the lottery, five considered “strong” and five considered “weak.” Introduce events Di, “the ith pair drawn is a dream combination,” where a “dream combination” is a pair of a strong team with a weak team, and i = 1, . . . , 5. a. Compute P(D1). b. Compute P(D2 | D1) and P(D1 ∩ D2). c. Compute P(D3 | D1 ∩ D2) and P(D1 ∩ D2 ∩ D3). d. Continue the procedure to obtain the probability of a “dream draw”: P(D1 ∩ · · · ∩ D5). 3.14 Recall the Monty Hall problem from Section 1.3. Let R be the event “the prize is behind the door you chose initially,” and W the event “you win the prize by switching doors.” a. Compute P(W | R) and P(W | Rc ). b. Compute P(W) using the law of total probability. 3.15 Two independent events A and B are given, and P(B | A ∪ B) = 2/3, P(A | B) = 1/2. What is P(B)? 3.16 You are diagnosed with an uncommon disease. You know that there only is a 1% chance of getting it. Use the letter D for the event “you have the disease” and T for “the test says so.” It is known that the test is imperfect: P(T | D) = 0.98 and P(T c | Dc ) = 0.95. a. Given that you test positive, what is the probability that you really have the disease? b. You obtain a second opinion: an independent repetition of the test. You test positive again. Given this, what is the probability that you really have the disease? 3.17 You and I play a tennis match. It is deuce, which means if you win the next two rallies, you win the game; if I win both rallies, I win the game; if we each win one rally, it is deuce again. Suppose the outcome of a rally is independent of other rallies, and you win a rally with probability p. Let W be the event “you win the game,” G “the game ends after the next two rallies,” and D “it becomes deuce again.” a. Determine P(W | G). b. Show that P(W) = p2 + 2p(1 − p)P(W | D) and use P(W) = P(W | D) (why is this so?) to determine P(W). c. Explain why the answers are the same.
  • 51. 40 3 Conditional probability and independence 3.18 Suppose A and B are events with 0 P(A) 1 and 0 P(B) 1. a. If A and B are disjoint, can they be independent? b. If A and B are independent, can they be disjoint? c. If A ⊂ B, can A and B be independent? d. If A and B are independent, can A and A ∪ B be independent?
  • 52. 4 Discrete random variables The sample space associated with an experiment, together with a probability function defined on all its events, is a complete probabilistic description of that experiment. Often we are interested only in certain features of this de- scription. We focus on these features using random variables. In this chapter we discuss discrete random variables, and in the next we will consider contin- uous random variables. We introduce the Bernoulli, binomial, and geometric random variables. 4.1 Random variables Suppose we are playing the board game “Snakes and Ladders,” where the moves are determined by the sum of two independent throws with a die. An obvious choice of the sample space is Ω = {(ω1, ω2) : ω1, ω2 ∈ {1, 2, . . ., 6} } = {(1, 1), (1, 2), . . ., (1, 6), (2, 1), . . ., (6, 5), (6, 6)}. However, as players of the game, we are only interested in the sum of the outcomes of the two throws, i.e., in the value of the function S : Ω → R, given by S( ω1, ω2 ) = ω1 + ω2 for (ω1, ω2) ∈ Ω. In Table 4.1 the possible results of the first throw (top margin), those of the second throw (left margin), and the corresponding values of S (body) are given. Note that the values of S are constant on lines perpendicular to the diagonal. We denote the event that the function S attains the value k by {S = k}, which is an abbreviation of “the subset of those ω = (ω1, ω2) ∈ Ω for which S( ω1, ω2 ) = ω1 + ω2 = k,” i.e., {S = k} = {(ω1, ω2) ∈ Ω : S( ω1, ω2) = k }.
  • 53. 42 4 Discrete random variables Table 4.1. Two throws with a die and the corresponding sum. ω1 ω2 1 2 3 4 5 6 1 2 3 4 5 6 7 2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 5 6 7 8 9 10 11 6 7 8 9 10 11 12 Quick exercise 4.1 List the outcomes in the event {S = 8}. We denote the probability of the event {S = k} by P(S = k) , although formally we should write P({S = k}) instead of P(S = k). In our example, S attains only the values k = 2, 3, . . . , 12 with positive probability. For example, P(S = 2) = P( (1, 1) ) = 1 36 , P(S = 3) = P( {(1, 2), (2, 1)} ) = 2 36 , while P(S = 13) = P( ∅ ) = 0, because 13 is an “impossible outcome.” Quick exercise 4.2 Use Table 4.1 to determine P(S = k) for k = 4, 5, . . . , 12. Now suppose that for some other game the moves are given by the maximum of two independent throws. In this case we are interested in the value of the function M : Ω → R, given by M( ω1, ω2 ) = max{ω1, ω2} for (ω1, ω2) ∈ Ω. In Table 4.2 the possible results of the first throw (top margin), those of the second throw (left margin), and the corresponding values of M (body) are given. The functions S and M are examples of what we call discrete random variables. Definition. Let Ω be a sample space. A discrete random variable is a function X : Ω → R that takes on a finite number of values a1, a2, . . . , an or an infinite number of values a1, a2, . . . .
  • 54. 4.2 The probability distribution of a discrete random variable 43 Table 4.2. Two throws with a die and the corresponding maximum. ω1 ω2 1 2 3 4 5 6 1 1 2 3 4 5 6 2 2 2 3 4 5 6 3 3 3 3 4 5 6 4 4 4 4 4 5 6 5 5 5 5 5 5 6 6 6 6 6 6 6 6 In a way, a discrete random variable X “transforms” a sample space Ω to a more “tangible” sample space Ω̃, whose events are more directly related to what you are interested in. For instance, S transforms Ω = {(1, 1), (1, 2), . . ., (1, 6), (2, 1), . . . , (6, 5), (6, 6)} to Ω̃ = {2, . . ., 12}, and M transforms Ω to Ω̃ = {1, . . . , 6}. Of course, there is a price to pay: one has to calculate the probabilities of X. Or, to say things more formally, one has to determine the probability distribution of X, i.e., to describe how the probability mass is distributed over possible values of X. 4.2 The probability distribution of a discrete random variable Once a discrete random variable X is introduced, the sample space Ω is no longer important. It suffices to list the possible values of X and their corre- sponding probabilities. This information is contained in the probability mass function of X. Definition. The probability mass function p of a discrete random variable X is the function p : R → [0, 1], defined by p(a) = P(X = a) for − ∞ a ∞. If X is a discrete random variable that takes on the values a1, a2, . . ., then p(ai) 0, p(a1) + p(a2) + · · · = 1, and p(a) = 0 for all other a. As an example we give the probability mass function p of M. a 1 2 3 4 5 6 p(a) 1/36 3/36 5/36 7/36 9/36 11/36 Of course, p(a) = 0 for all other a.
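As a quick machine check (our addition, not the book's), the probability mass functions of S and M can be read off by enumerating the 36 equally likely outcomes of two throws:

```python
from collections import Counter
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))       # the 36 pairs (ω1, ω2)

pmf_S = Counter(w1 + w2 for w1, w2 in outcomes)       # counts for the sum S
pmf_M = Counter(max(w1, w2) for w1, w2 in outcomes)   # counts for the maximum M

# Probability mass function of M: 1/36, 3/36, 5/36, 7/36, 9/36, 11/36
for a in sorted(pmf_M):
    print(a, f"{pmf_M[a]}/36")

# A value from Table 4.1 / Quick exercise 4.2: P(S = 8) = 5/36
print(f"{pmf_S[8]}/36")
```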
• 55. 44 4 Discrete random variables

The distribution function of a random variable

As we will see, so-called continuous random variables cannot be specified by giving a probability mass function. However, the distribution function of a random variable X (also known as the cumulative distribution function) allows us to treat discrete and continuous random variables in the same way.

Definition. The distribution function F of a random variable X is the function F : R → [0, 1], defined by F(a) = P(X ≤ a) for −∞ < a < ∞.

Both the probability mass function and the distribution function of a discrete random variable X contain all the probabilistic information of X; the probability distribution of X is determined by either of them. In fact, the distribution function F of a discrete random variable X can be expressed in terms of the probability mass function p of X and vice versa. If X attains values a1, a2, . . . , such that p(ai) > 0 and p(a1) + p(a2) + · · · = 1, then

F(a) = \sum_{a_i ≤ a} p(a_i).

We see that, for a discrete random variable X, the distribution function F jumps at each of the ai and is constant between successive ai. The height of the jump at ai is p(ai); in this way p can be retrieved from F. For example, see Figure 4.1, where p and F are displayed for the random variable M.

Fig. 4.1. Probability mass function and distribution function of M (the probability mass function takes the values 1/36, 3/36, 5/36, 7/36, 9/36, 11/36 at a = 1, . . . , 6; the distribution function accordingly jumps to 1/36, 4/36, 9/36, 16/36, 25/36, 1 at those points).
  • 56. 4.3 The Bernoulli and binomial distributions 45 We end this section with three properties of the distribution function F of a random variable X: 1. For a ≤ b one has that F(a) ≤ F(b). This property is an immediate consequence of the fact that a ≤ b implies that the event {X ≤ a} is contained in the event {X ≤ b}. 2. Since F(a) is a probability, the value of the distribution function is always between 0 and 1. Moreover, lim a→+∞ F(a) = lim a→+∞ P(X ≤ a) = 1 lim a→−∞ F(a) = lim a→−∞ P(X ≤ a) = 0. 3. F is right-continuous, i.e., one has lim ε↓0 F(a + ε) = F(a). This is indicated in Figure 4.1 by bullets. Henceforth we will omit these bullets. Conversely, any function F satisfying 1, 2, and 3 is the distribution function of some random variable (see Remarks 6.1 and 6.2). Quick exercise 4.3 Let X be a discrete random variable, and let a be such that p(a) 0. Show that F(a) = P(X a) + p(a). There are many discrete random variables that arise in a natural way. We introduce three of them in the next two sections. 4.3 The Bernoulli and binomial distributions The Bernoulli distribution is used to model an experiment with only two pos- sible outcomes, often referred to as “success” and “failure”, usually encoded as 1 and 0. Definition. A discrete random variable X has a Bernoulli distri- bution with parameter p, where 0 ≤ p ≤ 1, if its probability mass function is given by pX(1) = P(X = 1) = p and pX(0) = P(X = 0) = 1 − p. We denote this distribution by Ber(p). Note that we wrote pX instead of p for the probability mass function of X. This was done to emphasize its dependence on X and to avoid possible confusion with the parameter p of the Bernoulli distribution.
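As a small side illustration (ours, not the book's), a Ber(p) random variable can be simulated as the indicator of an event with probability p; in a long run of simulated draws the relative frequency of 1's should be close to p.

```python
import random

random.seed(1)

def bernoulli(p):
    """One draw from a Ber(p) distribution: 1 with probability p, otherwise 0."""
    return 1 if random.random() < p else 0

p = 0.25
n = 100_000
draws = [bernoulli(p) for _ in range(n)]
print(sum(draws) / n)   # close to 0.25 (a simulation estimate, not exactly p)
```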
  • 57. 46 4 Discrete random variables Consider the (fictitious) situation that you attend, completely unprepared, a multiple-choice exam. It consists of 10 questions, and each question has four alternatives (of which only one is correct). You will pass the exam if you answer six or more questions correctly. You decide to answer each of the questions in a random way, in such a way that the answer of one question is not affected by the answers of the others. What is the probability that you will pass? Setting for i = 1, 2, . . . , 10 Ri = 1 if the ith answer is correct 0 if the ith answer is incorrect, the number of correct answers X is given by X = R1 + R2 + R3 + R4 + R5 + R6 + R7 + R8 + R9 + R10. Quick exercise 4.4 Calculate the probability that you answered the first question correctly and the second one incorrectly. Clearly, X attains only the values 0, 1, . . ., 10. Let us first consider the case X = 0. Since the answers to the different questions do not influence each other, we conclude that the events {R1 = a1}, . . . , {R10 = a10} are independent for every choice of the ai, where each ai is 0 or 1. We find P(X = 0) = P(not a single Ri equals 1) = P(R1 = 0, R2 = 0, . . . , R10 = 0) = P(R1 = 0) P(R2 = 0) · · · P(R10 = 0) = 3 4 10 . The probability that we have answered exactly one question correctly equals P(X = 1) = 1 4 · 3 4 9 · 10, which is the probability that the answer is correct times the probability that the other nine answers are wrong, times the number of ways in which this can occur: P(X = 1) = P(R1 = 1) P(R2 = 0) P(R3 = 0) · · · P(R10 = 0) + P(R1 = 0) P(R2 = 1) P(R3 = 0) · · · P(R10 = 0) . . . + P(R1 = 0) P(R2 = 0) P(R3 = 0) · · · P(R10 = 1) . In general we find for k = 0, 1, . . ., 10, again using independence, that
  • 58. 4.3 The Bernoulli and binomial distributions 47 P(X = k) = 1 4 k · 3 4 10−k · C10,k, which is the probability that k questions were answered correctly times the probability that the other 10−k answers are wrong, times the number of ways C10,k this can occur. So C10,k is the number of different ways in which one can choose k correct answers from the list of 10. We already have seen that C10,0 = 1, because there is only one way to do everything wrong; and that C10,1 = 10, because each of the 10 questions may have been answered correctly. More generally, if we have to choose k different objects out of an ordered list of n objects, and the order in which we pick the objects matters, then for the first object you have n possibilities, and no matter which object you pick, for the second one there are n − 1 possibilities. For the third there are n − 2 possibilities, and so on, with n − (k − 1) possibilities for the kth. So there are n(n − 1) · · · (n − (k − 1)) ways to choose the k objects. In how many ways can we choose three questions? When the order matters, there are 10 · 9 · 8 ways. However, the order in which these three questions are selected does not matter: to answer questions 2, 5, and 8 correctly is the same as answering questions 8, 2, and 5 correctly, and so on. The triplet {2, 5, 8} can be chosen in 3 · 2 · 1 different orders, all with the same result. There are six permutations of the numbers 2, 5, and 8 (see page 14). Thus, compensating for this six-fold overcount, the number C10,3 of ways to correctly answer 3 questions out of 10 becomes C10,3 = 10 · 9 · 8 3 · 2 · 1 . More generally, for n ≥ 1 and 1 ≤ k ≤ n, Cn,k = n(n − 1) · · · (n − (k − 1)) k(k − 1) · · · 2 · 1 . Note that this is equal to n! k! (n − k)! , which is usually denoted by n k , so Cn,k = n k . Moreover, in accordance with 0! = 1 (as defined in Chapter 2), we put Cn,0 = n 0 = 1. Quick exercise 4.5 Show that n n−k = n k . Substituting 10 k for C10,k we obtain
• 59. 48 4 Discrete random variables

P(X = k) = \binom{10}{k} (1/4)^k (3/4)^{10−k}.

Since P(X ≥ 6) = P(X = 6) + · · · + P(X = 10), it is now an easy (but tedious) exercise to determine the probability that you will pass. One finds that P(X ≥ 6) = 0.0197. It pays to study, doesn't it?!

The preceding random variable X is an example of a random variable with a binomial distribution with parameters n = 10 and p = 1/4.

Definition. A discrete random variable X has a binomial distribution with parameters n and p, where n = 1, 2, . . . and 0 ≤ p ≤ 1, if its probability mass function is given by

pX(k) = P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}   for k = 0, 1, . . . , n.

We denote this distribution by Bin(n, p).

Figure 4.2 shows the probability mass function pX and distribution function FX of a Bin(10, 1/4) distributed random variable.

Fig. 4.2. Probability mass function and distribution function of the Bin(10, 1/4) distribution.
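The "easy (but tedious)" sum P(X ≥ 6) becomes a one-liner once the Bin(10, 1/4) probability mass function is coded; the sketch below is our own check of the value 0.0197 and uses Python's math.comb for the binomial coefficient.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X with a Bin(n, p) distribution."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 1/4
prob_pass = sum(binom_pmf(k, n, p) for k in range(6, n + 1))
print(round(prob_pass, 4))   # 0.0197

# The probability mass function sums to 1, as it should:
print(round(sum(binom_pmf(k, n, p) for k in range(n + 1)), 10))   # 1.0
```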
• 60. 4.4 The geometric distribution 49

4.4 The geometric distribution

In 1986, Weinberg and Gladen [38] investigated the number of menstrual cycles it took women to become pregnant, measured from the moment they had decided to become pregnant. We model the number of cycles up to pregnancy by a random variable X. Assume that the probability that a woman becomes pregnant during a particular cycle is equal to p, for some p with 0 < p ≤ 1, independent of the previous cycles. Then clearly P(X = 1) = p. Due to the independence of consecutive cycles, one finds for k = 1, 2, . . . that

P(X = k) = P(no pregnancy in the first k − 1 cycles, pregnancy in the kth) = (1 − p)^{k−1} p.

This random variable X is an example of a random variable with a geometric distribution with parameter p.

Definition. A discrete random variable X has a geometric distribution with parameter p, where 0 < p ≤ 1, if its probability mass function is given by

pX(k) = P(X = k) = (1 − p)^{k−1} p   for k = 1, 2, . . . .

We denote this distribution by Geo(p).

Figure 4.3 shows the probability mass function pX and distribution function FX of a Geo(1/4) distributed random variable.

Fig. 4.3. Probability mass function and distribution function of the Geo(1/4) distribution.

Quick exercise 4.6 Let X have a Geo(p) distribution. For n ≥ 0, show that P(X > n) = (1 − p)^n.
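For Quick exercise 4.6 it may help to see the identity numerically first. The sketch below (ours) sums the Geo(1/4) probability mass function beyond n and compares the tail with the closed form (1 − p)^n; it illustrates, but of course does not prove, the identity.

```python
p = 1/4

def geo_pmf(k, p):
    """P(X = k) for X with a Geo(p) distribution, k = 1, 2, ..."""
    return (1 - p)**(k - 1) * p

n = 5
# Tail probability P(X > n), summed far enough that the remaining terms are negligible.
tail = sum(geo_pmf(k, p) for k in range(n + 1, 2000))
print(round(tail, 6), round((1 - p)**n, 6))   # both approximately 0.237305
```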
  • 61. 50 4 Discrete random variables The geometric distribution has a remarkable property, which is known as the memoryless property.1 For n, k = 0, 1, 2, . . . one has P(X n + k | X k) = P(X n) . We can derive this equality using the result from Quick exercise 4.6: P(X n + k | X k) = P({X k + n} ∩ {X k}) P(X k) = P(X k + n) P(X k) = (1 − p) n+k (1 − p) k = (1 − p) n = P(X n) . 4.5 Solutions to the quick exercises 4.1 From Table 4.1, one finds that {S = 8} = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)}. 4.2 From Table 4.1, one determines the following table. k 4 5 6 7 8 9 10 11 12 P(S = k) 3 36 4 36 5 36 6 36 5 36 4 36 3 36 2 36 1 36 4.3 Since {X ≤ a} = {X a} ∪ {X = a}, it follows that F(a) = P(X ≤ a) = P(X a) + P(X = a) = P(X a) + p(a). Not very interestingly: this also holds if p(a) = 0. 4.4 The probability that you answered the first question correctly and the second one incorrectly is given by P(R1 = 1, R2 = 0). Due to independence, this is equal to P(R1 = 1) P(R2 = 0) = 1 4 · 3 4 = 3 16 . 4.5 Rewriting yields n n − k = n! (n − k)! (n − (n − k))! = n! k!(n − k)! = n k . 1 In fact, the geometric distribution is the only discrete random variable with this property.
  • 62. 4.6 Exercises 51 4.6 There are two ways to show that P(X n) = (1 − p)n . The easiest way is to realize that P(X n) is the probability that we had “no success in the first n trials,” which clearly equals (1 − p) n . A more involved way is by calculation: P(X n) = P(X = n + 1) + P(X = n + 2) + · · · = (1 − p)n p + (1 − p)n+1 p + · · · = (1 − p)n p 1 + (1 − p) + (1 − p)2 + · · · . If we recall from calculus that ∞ k=0 (1 − p)k = 1 1 − (1 − p) = 1 p , the answer follows immediately. 4.6 Exercises 4.1 Let Z represent the number of times a 6 appeared in two independent throws of a die, and let S and M be as in Section 4.1. a. Describe the probability distribution of Z, by giving either the probability mass function pZ of Z or the distribution function FZ of Z. What type of distribution does Z have, and what are the values of its parameters? b. List the outcomes in the events {M = 2, Z = 0}, {S = 5, Z = 1}, and {S = 8, Z = 1}. What are their probabilities? c. Determine whether the events {M = 2} and {Z = 0} are independent. 4.2 Let X be a discrete random variable with probability mass function p given by: a −1 0 1 2 p(a) 1 4 1 8 1 8 1 2 and p(a) = 0 for all other a. a. Let the random variable Y be defined by Y = X2 , i.e., if X = 2, then Y = 4. Calculate the probability mass function of Y . b. Calculate the value of the distribution functions of X and Y in a = 1, a = 3/4, and a = π − 3. 4.3 Suppose that the distribution function of a discrete random variable X is given by
  • 63. 52 4 Discrete random variables F(a) = ⎧ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎩ 0 for a 0 1 3 for 0 ≤ a 1 2 1 2 for 1 2 ≤ a 3 4 1 for a ≥ 3 4 . Determine the probability mass function of X. 4.4 You toss n coins, each showing heads with probability p, independently of the other tosses. Each coin that shows tails is tossed again. Let X be the total number of heads. a. What type of distribution does X have? Specify its parameter(s). b. What is the probability mass function of the total number of heads X? 4.5 A fair die is thrown until the sum of the results of the throws exceeds 6. The random variable X is the number of throws needed for this. Let F be the distribution function of X. Determine F(1), F(2), and F(7). 4.6 Three times we randomly draw a number from the following numbers: 1 2 3. If Xi represents the ith draw, i = 1, 2, 3, then the probability mass function of Xi is given by a 1 2 3 P(Xi = a) 1 3 1 3 1 3 and P(Xi = a) = 0 for all other a. We assume that each draw is independent of the previous draws. Let X̄ be the average of X1, X2, and X3, i.e., X̄ = X1 + X2 + X3 3 . a. Determine the probability mass function pX̄ of X̄. b. Compute the probability that exactly two draws are equal to 1. 4.7 A shop receives a batch of 1000 cheap lamps. The odds that a lamp is defective are 0.1%. Let X be the number of defective lamps in the batch. a. What kind of distribution does X have? What is/are the value(s) of pa- rameter(s) of this distribution? b. What is the probability that the batch contains no defective lamps? One defective lamp? More than two defective ones? 4.8 In Section 1.4 we saw that each space shuttle has six O-rings and that each O-ring fails with probability
  • 64. 4.6 Exercises 53 p(t) = ea+b·t 1 + ea+b·t , where a = 5.085, b = −0.1156, and t is the temperature (in degrees Fahren- heit) at the time of the launch of the space shuttle. At the time of the fatal launch of the Challenger, t = 31, yielding p(31) = 0.8178. a. Let X be the number of failing O-rings at launch temperature 31◦ F. What type of probability distribution does X have, and what are the values of its parameters? b. What is the probability P(X ≥ 1) that at least one O-ring fails? 4.9 For simplicity’s sake, let us assume that all space shuttles will be launched at 81◦ F (which is the highest recorded launch temperature in Figure 1.3). With this temperature, the probability of an O-ring failure is equal to p(81) = 0.0137 (see Section 1.4 or Exercise 4.8). a. What is the probability that during 23 launches no O-ring will fail, but that at least one O-ring will fail during the 24th launch of a space shuttle? b. What is the probability that no O-ring fails during 24 launches? 4.10 Early in the morning, a group of m people decides to use the elevator in an otherwise deserted building of 21 floors. Each of these persons chooses his or her floor independently of the others, and—from our point of view— completely at random, so that each person selects a floor with probability 1/21. Let Sm be the number of times the elevator stops. In order to study Sm, we introduce for i = 1, 2, . . ., 21 random variables Ri, given by Ri = 1 if the elevator stops at the ith floor 0 if the elevator does not stop at the ith floor. a. Each Ri has a Ber(p) distribution. Show that p = 1 − 20 21 m . b. From the way we defined Sm, it follows that Sm = R1 + R2 + · · · + R21. Can we conclude that Sm has a Bin(21, p) distribution, with p as in part a? Why or why not? c. Clearly, if m = 1, one has that P(S1 = 1) = 1. Show that for m = 2 P(S2 = 1) = 1 21 = 1 − P(S2 = 2) , and that S3 has the following distribution. a 1 2 3 P(S3 = a) 1/441 60/441 380/441
  • 65. 54 4 Discrete random variables 4.11 You decide to play monthly in two different lotteries, and you stop play- ing as soon as you win a prize in one (or both) lotteries of at least one million euros. Suppose that every time you participate in these lotteries, the proba- bility to win one million (or more) euros is p1 for one of the lotteries and p2 for the other. Let M be the number of times you participate in these lotteries until winning at least one prize. What kind of distribution does M have, and what is its parameter? 4.12 You and a friend want to go to a concert, but unfortunately only one ticket is still available. The man who sells the tickets decides to toss a coin until heads appears. In each toss heads appears with probability p, where 0 p 1, independent of each of the previous tosses. If the number of tosses needed is odd, your friend is allowed to buy the ticket; otherwise you can buy it. Would you agree to this arrangement? 4.13 A box contains an unknown number N of identical bolts. In order to get an idea of the size N, we randomly mark one of the bolts from the box. Next we select at random a bolt from the box. If this is the marked bolt we stop, otherwise we return the bolt to the box, and we randomly select a second one, etc. We stop when the selected bolt is the marked one. Let X be the number of times a bolt was selected. Later (in Exercise 21.11) we will try to find an estimate of N. Here we look at the probability distribution of X. a. What is the probability distribution of X? Specify its parameter(s)! b. The drawback of this approach is that X can attain any of the values 1, 2, 3, . . ., so that if N is large we might be sampling from the box for quite a long time. We decide to sample from the box in a slightly different way: after we have randomly marked one of the bolts in the box, we select at random a bolt from the box. If this is the marked one, we stop, otherwise we randomly select a second bolt (we do not return the selected bolt). We stop when we select the marked bolt. Let Y be the number of times a bolt was selected. Show that P(Y = k) = 1/N for k = 1, 2, . . . , N (Y has a so-called discrete uniform distribution). c. Instead of randomly marking one bolt in the box, we mark m bolts, with m smaller than N. Next, we randomly select r bolts; Z is the number of marked bolts in the sample. Show that P(Z = k) = m k N−m r−k N r , for k = 0, 1, 2, . . ., r. (Z has a so-called hypergeometric distribution, with parameters m, N, and r.) 4.14 We throw a coin until a head turns up for the second time, where p is the probability that a throw results in a head and we assume that the outcome
  • 66. 4.6 Exercises 55 of each throw is independent of the previous outcomes. Let X be the number of times we have thrown the coin. a. Determine P(X = 2), P(X = 3), and P(X = 4). b. Show that P(X = n) = (n − 1)p2 (1 − p)n−2 for n ≥ 2.
• 67. 5 Continuous random variables

Many experiments have outcomes that take values on a continuous scale. For example, in Chapter 2 we encountered the load at which a model of a bridge collapses. These experiments have continuous random variables naturally associated with them.

5.1 Probability density functions

One way to look at continuous random variables is that they arise by a (never-ending) process of refinement from discrete random variables. Suppose, for example, that a discrete random variable associated with some experiment takes on the value 6.283 with probability p. If we refine, in the sense that we also get to know the fourth decimal, then the probability p is spread over the outcomes 6.2830, 6.2831, . . . , 6.2839. Usually this will mean that each of these new values is taken on with a probability that is much smaller than p—the sum of the ten probabilities is p. Continuing the refinement process to more and more decimals, the probabilities of the possible values of the outcomes become smaller and smaller, approaching zero. However, the probability that the possible values lie in some fixed interval [a, b] will settle down. This is closely related to the way sums converge to an integral in the definition of the integral, and it motivates the following definition.

Definition. A random variable X is continuous if for some function f : R → R and for any numbers a and b with a ≤ b,

P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

The function f has to satisfy f(x) ≥ 0 for all x and ∫_{−∞}^{∞} f(x) dx = 1. We call f the probability density function (or probability density) of X.
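To connect the definition with the refinement idea above, here is a small numerical sketch (our own, using the arbitrarily chosen density f(x) = 3x^2 on [0, 1], which is not from the text): a midpoint Riemann sum over cells of width h approximates P(a ≤ X ≤ b), and the approximations settle down on the integral of f over [a, b] as h shrinks.

```python
def f(x):
    # An arbitrary valid density on [0, 1]: f(x) = 3x^2 is nonnegative and integrates to 1.
    return 3 * x * x if 0 <= x <= 1 else 0.0

def interval_prob(a, b, h):
    """Midpoint Riemann-sum approximation of P(a <= X <= b) with cells of width h."""
    n = int(round((b - a) / h))
    return sum(f(a + (i + 0.5) * h) * h for i in range(n))

a, b = 0.2, 0.7
for h in (0.1, 0.01, 0.001):
    print(h, round(interval_prob(a, b, h), 6))   # values approach 0.335 as h shrinks

print("exact:", round(b**3 - a**3, 6))           # integral of 3x^2 from a to b = 0.335
```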
• 68. 58 5 Continuous random variables

Fig. 5.1. Area under a probability density function f on the interval [a, b].

Note that the probability that X lies in an interval [a, b] is equal to the area under the probability density function f of X over the interval [a, b]; this is illustrated in Figure 5.1. So if the interval gets smaller and smaller, the probability will go to zero: for any positive ε,

P(a − ε ≤ X ≤ a + ε) = ∫_{a−ε}^{a+ε} f(x) dx,

and sending ε to 0, it follows that for any a, P(X = a) = 0. This implies that for continuous random variables you may be careless about the precise form of the intervals:

P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a < X < b) = P(a ≤ X < b).

What does f(a) represent? Note (see also Figure 5.2) that

P(a − ε ≤ X ≤ a + ε) = ∫_{a−ε}^{a+ε} f(x) dx ≈ 2εf(a)   (5.1)

for small positive ε. Hence f(a) can be interpreted as a (relative) measure of how likely it is that X will be near a. However, do not think of f(a) as a probability: f(a) can be arbitrarily large.
An example of such an f is given in the following exercise.

Quick exercise 5.1 Let the function f be defined by f(x) = 0 if x ≤ 0 or x ≥ 1, and f(x) = 1/(2√x) for 0 < x < 1. You can check quickly that f satisfies the two properties of a probability density function. Let X be a random variable with f as its probability density function. Compute the probability that X lies between 10^{−4} and 10^{−2}.
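If you want to verify such a computation numerically, a few lines of code suffice. The sketch below is our own illustration (not part of the book's text); it integrates the density of Quick exercise 5.1 with a simple midpoint rule and compares the result with the exact value √b − √a.

```python
import math

def f(x):
    # density of Quick exercise 5.1: 1/(2*sqrt(x)) on (0, 1), and 0 elsewhere
    return 1.0 / (2.0 * math.sqrt(x)) if 0.0 < x < 1.0 else 0.0

def midpoint_integral(g, a, b, n=100_000):
    # crude midpoint rule; adequate here because f is smooth on [a, b]
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

a, b = 1e-4, 1e-2
print(midpoint_integral(f, a, b))      # approximately 0.09
print(math.sqrt(b) - math.sqrt(a))     # exact value: 0.1 - 0.01 = 0.09
```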
Fig. 5.2. Approximating the probability that X lies ε-close to a.

You should realize that discrete random variables do not have a probability density function f and continuous random variables do not have a probability mass function p, but that both have a distribution function F(a) = P(X ≤ a). Using the fact that for a < b the event {X ≤ b} is a disjoint union of the events {X ≤ a} and {a < X ≤ b}, we can express the probability that X lies in an interval (a, b] directly in terms of F for both cases:

  P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) = F(b) − F(a).

There is a simple relation between the distribution function F and the probability density function f of a continuous random variable. It follows from integral calculus that

  F(b) = ∫_{−∞}^{b} f(x) dx   and   f(x) = (d/dx) F(x),

where the second equality holds for all x where f is continuous. Both the probability density function and the distribution function of a continuous random variable X contain all the probabilistic information about X; the probability distribution of X is described by either of them. We illustrate all this with an example.

Suppose we want to make a probability model for an experiment that can be described as "an object hits a disc of radius r in a completely arbitrary way" (of course, this is not you playing darts; nevertheless we will refer to this example as the darts example).
We are interested in the distance X between the hitting point and the center of the disc. Since distances cannot be negative, we have F(b) = P(X ≤ b) = 0 when b < 0. Since the object hits the disc, we have F(b) = 1 when b > r. That the dart hits the disc in a completely arbitrary way we interpret as meaning that the probability of hitting any region is proportional to the area of that region. In particular, because the disc has area πr² and the disc with radius b has area πb², we should put
  F(b) = P(X ≤ b) = πb²/(πr²) = b²/r²   for 0 ≤ b ≤ r.

Then the probability density function f of X is equal to 0 outside the interval [0, r] and

  f(x) = (d/dx) F(x) = (1/r²) (d/dx) x² = 2x/r²   for 0 ≤ x ≤ r.

Quick exercise 5.2 Compute for the darts example the probability that 0 < X ≤ r/2, and the probability that r/2 < X ≤ r.

5.2 The uniform distribution

In this section we encounter a continuous random variable that describes an experiment where the outcome is completely arbitrary, except that we know that it lies between certain bounds. Many experiments of physical origin have this kind of behavior. For instance, suppose we measure for a long time the emission of radioactive particles of some material. Suppose that the experiment consists of recording in each hour at what times the particles are emitted. Then the outcomes will lie in the interval [0, 60] minutes. If the measurements would concentrate in any way, there is either something wrong with your Geiger counter or you are about to discover some new physical law. Not concentrating in any way means that subintervals of the same length should have the same probability. It is then clear (cf. equation (5.1)) that the probability density function associated with this experiment should be constant on [0, 60]. This motivates the following definition.

Definition. A continuous random variable has a uniform distribution on the interval [α, β] if its probability density function f is given by f(x) = 0 if x is not in [α, β] and

  f(x) = 1/(β − α)   for α ≤ x ≤ β.

We denote this distribution by U(α, β).

Quick exercise 5.3 Argue that the distribution function F of a random variable that has a U(α, β) distribution is given by F(x) = 0 if x < α, F(x) = 1 if x > β, and F(x) = (x − α)/(β − α) for α ≤ x ≤ β.

In Figure 5.3 the probability density function and the distribution function of a U(0, 1/3) distribution are depicted.
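As a quick sanity check on the darts model, one can simulate points that are uniform over the disc and compare the empirical fraction of distances below b with F(b) = b²/r². The sketch below is our own illustration; the radius r = 2 and the value b = 1.3 are arbitrary choices.

```python
import random

def distance_of_uniform_hit(r):
    # draw a point uniformly from the disc of radius r by rejection from the
    # enclosing square, and return its distance to the center
    while True:
        x, y = random.uniform(-r, r), random.uniform(-r, r)
        if x * x + y * y <= r * r:
            return (x * x + y * y) ** 0.5

r, n = 2.0, 100_000
hits = [distance_of_uniform_hit(r) for _ in range(n)]
b = 1.3
print(sum(d <= b for d in hits) / n)   # empirical P(X <= b)
print((b / r) ** 2)                    # model value b^2 / r^2 = 0.4225
```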
Fig. 5.3. The probability density function and the distribution function of the U(0, 1/3) distribution.

5.3 The exponential distribution

We already encountered the exponential distribution in the chemical reactor example of Chapter 3. We will give an argument why it appears in that example. Let v be the effluent volumetric flow rate, i.e., the volume that leaves the reactor over a time interval [0, t] is vt (and an equal volume enters the vessel at the other end). Let V be the volume of the reactor vessel. Then in total a fraction (v/V)·t will have left the vessel during [0, t], when t is not too large. Let the random variable T be the residence time of a particle in the vessel. To compute the distribution of T, we divide the interval [0, t] into n small intervals of equal length t/n. Assuming perfect mixing, so that the particle's position is uniformly distributed over the volume, the particle has probability p = (v/V)·t/n to have left the vessel during any of the n intervals of length t/n. If we assume that the behavior of the particle in different time intervals of length t/n is independent, we have, if we call "leaving the vessel" a success, that T has a geometric distribution with success probability p. It follows (see also Quick exercise 4.6) that the probability P(T > t) that the particle is still in the vessel at time t is, for large n, well approximated by

  (1 − p)^n = (1 − vt/(Vn))^n.

But then, letting n → ∞, we obtain (recall a well-known limit from your calculus course)

  P(T > t) = lim_{n→∞} (1 − (vt/V)·(1/n))^n = e^{−(v/V)t}.

It follows that the distribution function of T equals 1 − e^{−(v/V)t}, and differentiating we obtain that the probability density function fT of T is equal to
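The calculus limit used here is easy to check numerically. The following sketch is our own illustration, with v/V = 0.25 per minute and t = 5 as arbitrary example values; it shows (1 − vt/(Vn))^n approaching e^{−(v/V)t} as n grows.

```python
import math

v_over_V, t = 0.25, 5.0           # flow rate over vessel volume, and a time point
for n in (10, 100, 1_000, 10_000):
    p = v_over_V * t / n          # probability of leaving in one sub-interval
    print(n, (1.0 - p) ** n)      # approaches the limit below as n grows
print(math.exp(-v_over_V * t))    # e^{-(v/V) t}, about 0.2865
```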
  fT(t) = (d/dt)(1 − e^{−(v/V)t}) = (v/V) e^{−(v/V)t}   for t ≥ 0.

This is an example of an exponential distribution, with parameter v/V.

Definition. A continuous random variable has an exponential distribution with parameter λ if its probability density function f is given by f(x) = 0 if x < 0 and

  f(x) = λe^{−λx}   for x ≥ 0.

We denote this distribution by Exp(λ). The distribution function F of an Exp(λ) distribution is given by

  F(a) = 1 − e^{−λa}   for a ≥ 0.

In Figure 5.4 we show the probability density function and the distribution function of the Exp(0.25) distribution.

Fig. 5.4. The probability density and the distribution function of the Exp(0.25) distribution.

Since we obtained the exponential distribution directly from the geometric distribution, it should not come as a surprise that the exponential distribution also satisfies the memoryless property, i.e., if X has an exponential distribution, then for all s, t > 0,

  P(X > s + t | X > s) = P(X > t).

Actually, this follows directly from

  P(X > s + t | X > s) = P(X > s + t) / P(X > s) = e^{−λ(s+t)} / e^{−λs} = e^{−λt} = P(X > t).
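The memoryless property can also be seen in simulated data. The sketch below is ours; λ = 0.25, s = 2, and t = 5 are arbitrary choices. It estimates P(X > s + t | X > s) from exponential samples and compares it with e^{−λt}.

```python
import math
import random

lam, s, t, n = 0.25, 2.0, 5.0, 200_000
samples = [random.expovariate(lam) for _ in range(n)]

past_s = [x for x in samples if x > s]
conditional = sum(x > s + t for x in past_s) / len(past_s)
print(conditional)           # estimate of P(X > s + t | X > s)
print(math.exp(-lam * t))    # P(X > t) = e^{-lambda t}, about 0.2865
```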
Quick exercise 5.4 A study of the response time of a certain computer system yields that the response time in seconds has an exponentially distributed time with parameter 0.25. What is the probability that the response time exceeds 5 seconds?

5.4 The Pareto distribution

More than a century ago the economist Vilfredo Pareto ([20]) noticed that the number of people whose income exceeded level x was well approximated by C/x^α, for some constants C and α > 0 (it appears that for all countries α is around 1.5). A similar phenomenon occurs with city sizes, earthquake rupture areas, insurance claims, and sizes of commercial companies. When these quantities are modeled as realizations of random variables X, then their distribution functions are of the type

  F(x) = 1 − 1/x^α   for x ≥ 1.

(Here 1 is a more or less arbitrarily chosen starting point; what matters is the behavior for large x.) Differentiating, we obtain probability densities of the form f(x) = α/x^{α+1}. This motivates the following definition.

Definition. A continuous random variable has a Pareto distribution with parameter α > 0 if its probability density function f is given by f(x) = 0 if x < 1 and

  f(x) = α/x^{α+1}   for x ≥ 1.

We denote this distribution by Par(α).
Fig. 5.5. The probability density and the distribution function of the Par(0.5) distribution.
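To get a feeling for the heavy tail of a Pareto distribution, one can simulate from it: if U is uniform on (0, 1), then (1 − U)^{−1/α} has distribution function 1 − 1/x^α (this inversion idea is developed in Chapter 6). The sketch below is our own illustration for α = 0.5; the check point x0 = 4 is arbitrary.

```python
import random

def pareto_sample(alpha):
    # (1 - U)^(-1/alpha) has distribution function 1 - 1/x^alpha for x >= 1
    return (1.0 - random.random()) ** (-1.0 / alpha)

alpha, n = 0.5, 100_000
xs = [pareto_sample(alpha) for _ in range(n)]
x0 = 4.0
print(sum(x <= x0 for x in xs) / n)   # empirical F(4)
print(1.0 - 1.0 / x0 ** alpha)        # exact: 1 - 1/sqrt(4) = 0.5
```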
In Figure 5.5 we depicted the probability density f and the distribution function F of the Par(0.5) distribution.

5.5 The normal distribution

The normal distribution plays a central role in probability theory and statistics. One of its first applications was due to C.F. Gauss, who used it in 1809 to model observational errors in astronomy; see [13]. We will see in Chapter 14 that the normal distribution is an important tool to approximate the probability distribution of the average of independent random variables.

Definition. A continuous random variable has a normal distribution with parameters µ and σ² > 0 if its probability density function f is given by

  f(x) = (1/(σ√(2π))) e^{−½((x−µ)/σ)²}   for −∞ < x < ∞.

We denote this distribution by N(µ, σ²).

In Figure 5.6 the graphs of the probability density function f and distribution function F of the normal distribution with µ = 3 and σ² = 6.25 are displayed.

Fig. 5.6. The probability density and the distribution function of the N(3, 6.25) distribution.

If X has an N(µ, σ²) distribution, then its distribution function is given by

  F(a) = ∫_{−∞}^{a} (1/(σ√(2π))) e^{−½((x−µ)/σ)²} dx   for −∞ < a < ∞.
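Although F has no expression in elementary functions, it can be evaluated numerically, for instance via the error function erf, in terms of which Φ(x) = ½(1 + erf(x/√2)). The sketch below is our own illustration; it evaluates normal probabilities, including the N(3, 6.25) case of Figure 5.6.

```python
import math

def normal_cdf(a, mu, sigma):
    # F(a) for N(mu, sigma^2), written via the error function:
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) with z = (a - mu) / sigma
    return 0.5 * (1.0 + math.erf((a - mu) / (sigma * math.sqrt(2.0))))

print(normal_cdf(3.0, 3.0, 2.5))   # 0.5, by symmetry around mu for N(3, 6.25)
print(normal_cdf(1.0, 0.0, 1.0))   # about 0.8413, the table value of P(Z <= 1)
```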
Unfortunately there is no explicit expression for F; f has no antiderivative. However, as we shall see in Chapter 8, any N(µ, σ²) distributed random variable can be turned into an N(0, 1) distributed random variable by a simple transformation. As a consequence, a table of the N(0, 1) distribution suffices. The latter is called the standard normal distribution, and because of its special role the letter φ has been reserved for its probability density function:

  φ(x) = (1/√(2π)) e^{−½x²}   for −∞ < x < ∞.

Note that φ is symmetric around zero: φ(−x) = φ(x) for each x. The corresponding distribution function is denoted by Φ. The table for the standard normal distribution (see Table B.1) does not contain the values of Φ(a), but rather the so-called right tail probabilities 1 − Φ(a). If, for instance, we want to know the probability that a standard normal random variable Z is smaller than or equal to 1, we use that P(Z ≤ 1) = 1 − P(Z ≥ 1). In the table we find that P(Z ≥ 1) = 1 − Φ(1) is equal to 0.1587. Hence P(Z ≤ 1) = 1 − 0.1587 = 0.8413. With the table you can handle tail probabilities with numbers a given to two decimals. To find, for instance, P(Z ≥ 1.07), we stay in the same row in the table but move to the seventh column to find that P(Z ≥ 1.07) = 0.1423.

Quick exercise 5.5 Let the random variable Z have a standard normal distribution. Use Table B.1 to find P(Z ≤ 0.75). How do you know, without doing any calculations, that the answer should be larger than 0.5?

5.6 Quantiles

Recall the chemical reactor example, where the residence time T, measured in minutes, has an exponential distribution with parameter λ = v/V = 0.25. As we shall see in the next chapters, a consequence of this choice of λ is that the mean time the particle stays in the vessel is 4 minutes. However, from the viewpoint of process control this is not the quantity of interest. Often, there will be some minimal amount of time the particle has to stay in the vessel to participate in the chemical reaction, and we would want that at least 90% of the particles stay in the vessel this minimal amount of time. In other words, we are interested in the number q with the property that P(T > q) = 0.9, or equivalently, P(T ≤ q) = 0.1. The number q is called the 0.1th quantile or 10th percentile of the distribution. In the case at hand it is easy to determine. We should have

  P(T ≤ q) = 1 − e^{−0.25q} = 0.1.

This holds exactly when e^{−0.25q} = 0.9, or when −0.25q = ln(0.9) = −0.105. So q = 0.42. Hence, although the mean residence time is 4 minutes, 10% of
the particles stay less than 0.42 minute in the vessel, which is just slightly more than 25 seconds! We use the following general definition.

Definition. Let X be a continuous random variable and let p be a number between 0 and 1. The pth quantile or 100pth percentile of the distribution of X is the smallest number qp such that

  F(qp) = P(X ≤ qp) = p.

The median of a distribution is its 50th percentile.

Quick exercise 5.6 What is the median of the U(2, 7) distribution?

For continuous random variables qp is often easy to determine. Indeed, if F is strictly increasing from 0 to 1 on some interval (which may be infinite to one or both sides), then qp = Finv(p), where Finv is the inverse of F. This is illustrated in Figure 5.7 for the Exp(0.25) distribution.

Fig. 5.7. The pth quantile qp of the Exp(0.25) distribution.

For an exponential distribution it is easy to compute quantiles. This is different for the standard normal distribution, where we have to use a table (like Table B.1). For example, the 90th percentile of a standard normal is the number q0.9 such that Φ(q0.9) = 0.9, which is the same as 1 − Φ(q0.9) = 0.1, and the table gives us q0.9 = 1.28. This is illustrated in Figure 5.8, with both the probability density function and the distribution function of the standard normal distribution.

Quick exercise 5.7 Find the 0.95th quantile q0.95 of a standard normal distribution, accurate to two decimals.
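Both kinds of quantile computations are easy to reproduce in a few lines of code. The sketch below is our own (the bisection tolerance is an arbitrary choice); it inverts the Exp(0.25) distribution function in closed form and inverts Φ numerically via the error function.

```python
import math

def exp_quantile(p, lam):
    # Finv(p) for Exp(lam): solve 1 - exp(-lam * q) = p
    return -math.log(1.0 - p) / lam

def std_normal_quantile(p, tol=1e-10):
    # invert Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) by bisection
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if phi(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

print(exp_quantile(0.1, 0.25))     # ~0.42 minutes, the reactor example
print(std_normal_quantile(0.90))   # ~1.2816, the table value 1.28
print(std_normal_quantile(0.95))   # ~1.6449 (Quick exercise 5.7)
```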
Fig. 5.8. The 90th percentile of the N(0, 1) distribution.

5.7 Solutions to the quick exercises

5.1 We know from integral calculus that for 0 ≤ a ≤ b ≤ 1

  ∫_a^b f(x) dx = ∫_a^b 1/(2√x) dx = √b − √a.

Hence ∫_{−∞}^{∞} f(x) dx = ∫_0^1 1/(2√x) dx = 1 (so f is a probability density function, nonnegativity being obvious), and

  P(10^{−4} ≤ X ≤ 10^{−2}) = ∫_{10^{−4}}^{10^{−2}} 1/(2√x) dx = √(10^{−2}) − √(10^{−4}) = 10^{−1} − 10^{−2} = 0.09.

Actually, the random variable X arises in a natural way; see equation (7.1).
5.2 We have P(0 < X ≤ r/2) = F(r/2) − F(0) = (1/2)² − 0² = 1/4, and P(r/2 < X ≤ r) = F(r) − F(r/2) = 1 − 1/4 = 3/4, no matter what the radius of the disc is!

5.3 Since f(x) = 0 for x < α, we have F(x) = 0 if x < α. Also, since f(x) = 0 for all x > β, F(x) = 1 if x > β. In between,

  F(x) = ∫_{−∞}^{x} f(y) dy = ∫_{α}^{x} 1/(β − α) dy = [y/(β − α)]_{y=α}^{y=x} = (x − α)/(β − α).

In other words, the distribution function increases linearly from the value 0 in α to the value 1 in β.

5.4 If X is the response time, we ask for P(X > 5). This equals

  P(X > 5) = e^{−0.25·5} = e^{−1.25} = 0.2865...
5.5 In the eighth row and sixth column of the table, we find that 1 − Φ(0.75) = 0.2266. Hence the answer is 1 − 0.2266 = 0.7734. Because of the symmetry of the probability density φ, half of the mass of a standard normal distribution lies on the negative axis. Hence for any number a > 0, it should be true that P(Z ≤ a) > P(Z ≤ 0) = 0.5.

5.6 The median is the number q0.5 = Finv(0.5). You either see directly that you have half of the mass on both sides of the middle of the interval, hence q0.5 = (2 + 7)/2 = 4.5, or you solve with the distribution function: 1/2 = F(q) = (q − 2)/(7 − 2), and so q = 4.5.

5.7 Since Φ(q0.95) = 0.95 is the same as 1 − Φ(q0.95) = 0.05, the table gives us q0.95 = 1.64, or more precisely, if we interpolate between the fourth and the fifth column, 1.645.

5.8 Exercises

5.1 Let X be a continuous random variable with probability density function

  f(x) = 3/4 for 0 ≤ x ≤ 1,  f(x) = 1/4 for 2 ≤ x ≤ 3,  and f(x) = 0 elsewhere.

a. Draw the graph of f.
b. Determine the distribution function F of X, and draw its graph.

5.2 Let X be a random variable that takes values in [0, 1], and is further given by F(x) = x² for 0 ≤ x ≤ 1. Compute P(1/2 < X ≤ 3/4).

5.3 Let a continuous random variable X be given that takes values in [0, 1], and whose distribution function F satisfies F(x) = 2x² − x⁴ for 0 ≤ x ≤ 1.
a. Compute P(1/4 ≤ X ≤ 3/4).
b. What is the probability density function of X?

5.4 Jensen, arriving at a bus stop, just misses the bus. Suppose that he decides to walk if the (next) bus takes longer than 5 minutes to arrive. Suppose also that the time in minutes between the arrivals of buses at the bus stop is a continuous random variable with a U(4, 6) distribution. Let X be the time that Jensen will wait.
a. What is the probability that X is less than 4 1/2 (minutes)?
b. What is the probability that X equals 5 (minutes)?
c. Is X a discrete random variable or a continuous random variable?

5.5 The probability density function f of a continuous random variable X is given by:

  f(x) = cx + 3 for −3 ≤ x ≤ −2,  f(x) = 3 − cx for 2 ≤ x ≤ 3,  and f(x) = 0 elsewhere.

a. Compute c.
b. Compute the distribution function of X.

5.6 Let X have an Exp(0.2) distribution. Compute P(X > 5).

5.7 The score of a student on a certain exam is represented by a number between 0 and 1. Suppose that the student passes the exam if this number is at least 0.55. Suppose we model this experiment by a continuous random variable S, the score, whose probability density function is given by

  f(x) = 4x for 0 ≤ x ≤ 1/2,  f(x) = 4 − 4x for 1/2 ≤ x ≤ 1,  and f(x) = 0 elsewhere.

a. What is the probability that the student fails the exam?
b. What is the score that he will obtain with a 50% chance, in other words, what is the 50th percentile of the score distribution?

5.8 Consider Quick exercise 5.2. For another dart thrower it is given that his distance to the center of the disc Y is described by the following distribution function: G(b) = b/r for 0 < b < r, and G(b) = 0 for b ≤ 0, G(b) = 1 for b ≥ r.
a. Sketch the probability density function g(y) = (d/dy) G(y).
b. Is this person "better" than the person in Quick exercise 5.2?
c. Sketch a distribution function associated to a person who in 90% of his throws hits the disc no further than 0.1·r from the center.

5.9 Suppose we choose arbitrarily a point from the square with corners at (2,1), (3,1), (2,2), and (3,2). The random variable A is the area of the triangle with its corners at (2,1), (3,1), and the chosen point (see Figure 5.9).
a. What is the largest area A that can occur, and what is the set of points for which A ≤ 1/4?
Fig. 5.9. A triangle in a square.

b. Determine the distribution function F of A.
c. Determine the probability density function f of A.

5.10 Consider again the chemical reactor example with parameter λ = 0.5. We saw in Section 5.6 that 10% of the particles stay in the vessel no longer than about 12 seconds, while the mean residence time is 2 minutes. Which percentage of the particles stay no longer than 2 minutes in the vessel?

5.11 Compute the median of an Exp(λ) distribution.

5.12 Compute the median of a Par(1) distribution.

5.13 We consider a random variable Z with a standard normal distribution.
a. Show why the symmetry of the probability density function φ of Z implies that for any a one has Φ(−a) = 1 − Φ(a).
b. Use this to compute P(Z ≤ −2).

5.14 Determine the 10th percentile of a standard normal distribution.
6 Simulation

Sometimes probabilistic models are so complex that the tools of mathematical analysis are not sufficient to answer all relevant questions about them. Stochastic simulation is an alternative approach: values are generated for the random variables and inserted into the model, thus mimicking outcomes for the whole system. It is shown in this chapter how one can use uniform random number generators to mimic random variables. Also two larger simulation examples are presented.

6.1 What is simulation?

In many areas of science, technology, government, and business, models are used to gain understanding of some part of reality (the portion of interest is often referred to as "the system"). Sometimes these are physical models, such as a scale model of an airplane in a wind tunnel or a scale model of a chemical plant. Other models are abstract, such as macroeconomic models consisting of equations relating things like interest rates, unemployment, and inflation, or partial differential equations describing global weather patterns.

In simulation, one uses a model to create specific situations in order to study the response of the model to them and then interprets this in terms of what would happen to the system "in the real world." In this way, one can carry out experiments that are impossible, too dangerous, or too expensive to do in the real world, addressing questions like: What happens to the average temperature if we reduce the greenhouse gas emissions globally by 50%? Can the plane still fly if engines 3 and 4 stop in midair? What happens to the distribution of wealth if we halve the tax rate?

More specifically, we focus on situations and problems where randomness or uncertainty or both play a significant or dominant role and should be modeled explicitly. Models for such systems involve random variables, and we speak of probabilistic or stochastic models. Simulating them is stochastic simulation. In
the preceding chapters we have encountered some of the tools of probability theory, and we will encounter others in the chapters to come. With these tools we can compute quantities of interest explicitly for many models. Stochastic simulation of a system means generating values for all the random variables in the model, according to their specified distributions, and recording and analyzing what happens. We refer to the generated values as realizations of the random variables.

For us, there are two reasons to learn about stochastic simulation. The first is that for complex systems, simulation can be an alternative to mathematical analysis, sometimes the only one. The second reason is that through simulation, we can get more feeling for random variables, and this is why we study stochastic simulation at this point in the book. We start by asking how we can generate a realization of a random variable.

6.2 Generating realizations of random variables

Simulations are almost always done using computers, which usually have one or more so-called (pseudo) random number generators. A call to the random number generator returns a random number between 0 and 1, which mimics a realization of a U(0, 1) variable. With this source of uniform (pseudo) randomness we can construct any random variable we want by transforming the outcome, as we shall see.

Quick exercise 6.1 Describe how you can simulate a coin toss when instead of a coin you have a die. Any ideas on how to simulate a roll of a die if you only have a coin?

Bernoulli random variables

Suppose U has a U(0, 1) distribution. To construct a Ber(p) random variable for some 0 < p < 1, we define

  X = 1 if U < p, and X = 0 if U ≥ p,

so that

  P(X = 1) = P(U < p) = p,
  P(X = 0) = P(U ≥ p) = 1 − p.

This random variable X has a Bernoulli distribution with parameter p.

Quick exercise 6.2 A random variable Y has outcomes 1, 3, and 4 with the following probabilities: P(Y = 1) = 3/5, P(Y = 3) = 1/5, and P(Y = 4) = 1/5. Describe how to construct Y from a U(0, 1) random variable.
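To make this concrete, here is a small Python sketch (our own illustration, using the standard library's random number generator as the source of U(0, 1) values) that constructs a Ber(p) random variable and the three-valued Y of Quick exercise 6.2.

```python
import random

def bernoulli(p):
    # X = 1 if U < p, else 0, with U "uniform" on (0, 1)
    return 1 if random.random() < p else 0

def discrete_y():
    # Quick exercise 6.2: split [0, 1) into pieces of length 3/5, 1/5, 1/5
    u = random.random()
    if u < 3 / 5:
        return 1
    elif u < 4 / 5:
        return 3
    else:
        return 4

n = 100_000
print(sum(bernoulli(0.3) for _ in range(n)) / n)            # about 0.3
print(sum(discrete_y() == 1 for _ in range(n)) / n)         # about 0.6
```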
Continuous random variables

Suppose we have the distribution function F of a continuous random variable and we wish to construct a random variable with this distribution. We show how to do this if F is strictly increasing from 0 to 1 on an interval. In that case F has an inverse function Finv. Figure 6.1 shows an example: F is strictly increasing on the interval [2, 10]; the inverse Finv is a function from the interval [0, 1] to the interval [2, 10].

Fig. 6.1. Simulating a continuous random variable using the distribution function.

Note how u relates to Finv(u) as F(x) relates to x. We see that u ≤ F(x) is equivalent with Finv(u) ≤ x. If instead of a real number u we consider a U(0, 1) random variable U, we obtain that the corresponding events are the same:

  {U ≤ F(x)} = {Finv(U) ≤ x}.   (6.1)

We know about the U(0, 1) random variable U that P(U ≤ b) = b for any number 0 ≤ b ≤ 1. Substituting b = F(x) we see P(U ≤ F(x)) = F(x). From equality (6.1), therefore,

  P(Finv(U) ≤ x) = F(x);

in other words, the random variable Finv(U) has distribution function F. What remains is to find the function Finv. From Figure 6.1 we see

  F(x) = u ⇔ x = Finv(u),

so if we solve the equation F(x) = u for x, we obtain the expression for Finv(u).
Exponential random variables

We apply this method to the exponential distribution. On the interval [0, ∞), the Exp(λ) distribution function is strictly increasing and given by F(x) = 1 − e^{−λx}. To find Finv, we solve the equation F(x) = u:

  F(x) = u ⇔ 1 − e^{−λx} = u ⇔ e^{−λx} = 1 − u ⇔ −λx = ln(1 − u) ⇔ x = −(1/λ) ln(1 − u),

so Finv(u) = −(1/λ) ln(1 − u), and if U has a U(0, 1) distribution, then the random variable X defined by

  X = Finv(U) = −(1/λ) ln(1 − U)

has an Exp(λ) distribution. In practice, one replaces 1 − U by U, because both have a U(0, 1) distribution (see Exercise 6.3). Leaving out the subtraction leads to more efficient computer code. So instead of X we may use

  Y = −(1/λ) ln(U),

which also has an Exp(λ) distribution.

Quick exercise 6.3 A distribution function F is 0 for x < 1 and 1 for x > 3, and F(x) = (1/4)(x − 1)² if 1 ≤ x ≤ 3. Let U be a U(0, 1) random variable. Construct a random variable with distribution F from U.

Remark 6.1 (The general case). The restriction we imposed earlier, that the distribution function should be strictly increasing, is not really necessary. Furthermore, a distribution function with jumps or a flat section somewhere in the middle is not a problem either. We illustrate this with an example in Figure 6.2. This F has a jump at 4, and so for a corresponding X we should have P(X = 4) = 0.2, the size of the jump. We see that whenever U is in the interval [0.3, 0.5], it is mapped to 4 by our method, and that this happens with exactly the right probability! The flat section of F between 7 and 8 seems to pose a problem: the equation F(a) = 0.85 has as its solution any a between 7 and 8, and we cannot define a unique inverse. This, however, does not really matter, because P(U = 0.85) = 0, and we can define the inverse Finv(0.85) in any way we want. Taking the left endpoint, here the number 7, agrees best with the definition of quantiles (see page 66).
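As a sketch of how this looks in code (ours, not the book's; λ = 0.25 is chosen to match the reactor example), the following generates Exp(λ) realizations as −(1/λ) ln(U) and also implements the inverse needed for Quick exercise 6.3.

```python
import math
import random

def exp_realization(lam):
    u = 1.0 - random.random()        # lies in (0, 1], so the logarithm is defined
    return -math.log(u) / lam        # has an Exp(lam) distribution

def quick_exercise_6_3():
    # F(x) = (x - 1)^2 / 4 on [1, 3], so Finv(u) = 1 + 2 * sqrt(u)
    return 1.0 + 2.0 * math.sqrt(random.random())

lam, n = 0.25, 100_000
ys = [exp_realization(lam) for _ in range(n)]
print(sum(ys) / n)                   # close to 4, the mean of Exp(0.25)
print(sum(y > 5 for y in ys) / n)    # close to e^{-1.25}, about 0.2865

xs = [quick_exercise_6_3() for _ in range(n)]
print(sum(x <= 2 for x in xs) / n)   # close to F(2) = 1/4
```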
Fig. 6.2. A distribution function with a jump and a flat section.

Remark 6.2 (Existence of random variables). The previous remark supplies a sketchy argument for the fact that any nondecreasing, right-continuous function F, with lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1, is the distribution function of some random variable.

Generating sequences

For simulations we often want to generate realizations for a large number of random variables. Random number generators have been designed with this purpose in mind: each new call mimics a new U(0, 1) random variable. The sequence of numbers thus generated is considered as a realization of a sequence of U(0, 1) random variables U1, U2, U3, ... with the special property that the events {Ui ≤ ai} are independent for every choice of the ai (in Chapter 9 we return to the question of independence between random variables).

6.3 Comparing two jury rules

At the Olympic Games there are several sports events that are judged by a jury, including gymnastics, figure skating, and ice dancing. During the 2002 winter games a dispute arose concerning the gold medal in ice dancing: there were allegations that the Russian team had bribed a French jury member, thereby causing the Russian pair to win just ahead of the Canadians. We look into operating rules for juries, although we leave the effects of bribery to the exercises (Exercise 6.11).

Suppose we have a jury of seven members, and for each performance each juror assigns a grade. The seven grades are to be transformed into a final score. Two rules to do this are under consideration, and we want to choose
the better one. For the first one, the highest and lowest scores are removed and the final score is the average of the remaining five. For the second rule, the scores are put in ascending order and the middle one is assigned as final score. Before you continue reading, consider which rule is better and how you can verify this.

A probabilistic model

For our investigation we assume that the scores the jurors assign deviate by some random amount from the true or deserved score. We model the score that juror i assigns when the performance deserves a score g by

  Yi = g + Zi   for i = 1, ..., 7,   (6.2)

where Z1, ..., Z7 are random variables with values around zero. Let h1 and h2 be functions implementing the two rules:

  h1(y1, ..., y7) = average of the middle five of y1, ..., y7,
  h2(y1, ..., y7) = middle value of y1, ..., y7.

We are interested in deviations from the deserved score g:

  T = h1(Y1, ..., Y7) − g,
  M = h2(Y1, ..., Y7) − g.   (6.3)

The distributions of T and M depend on the individual jury grades, and through those, on the juror-deviations Z1, Z2, ..., Z7, which we model as U(−0.5, 0.5) variables. This more or less finishes the modeling phase: we have given a stochastic model that mimics the workings of a jury and have defined, in terms of the variables in the model, the random variables T and M that represent the errors that result after application of the jury rules.

In any serious application, the model should be validated. This means that one tries to gather evidence to convince oneself and others that the model adequately reflects the workings of the real system. In this chapter we are more interested in showing what you can do with simulation once you have a model, so we skip the validation.

The next phase is analysis: which of the deviations is closer to zero? Because T and M are random variables, we would have to clarify what we mean by that, and answering the question certainly involves computing probabilities about T and M. We cannot do this with what we have learned so far, but we know how to simulate, so this is what we do.

Simulation

To generate a realization of a U(−0.5, 0.5) random variable, we only need to subtract 0.5 from the result we obtain from a call to the random generator.
We do this 7 times and insert the resulting values in (6.2) as jury deviations Z1, ..., Z7, and substitute them in equations (6.3) to obtain T and M (the value of g is irrelevant: it drops out of the calculation):

  T = average of the middle five of Z1, ..., Z7,
  M = middle value of Z1, ..., Z7.   (6.4)

In simulation terminology, this is called a run: we have gone through the whole procedure once, inserting realizations for the random variables. If we repeat the whole procedure, we have a second run; see Table 6.1 for the results of five runs.

Table 6.1. Simulation results for the two jury rules.

Run    Z1     Z2     Z3     Z4     Z5     Z6     Z7      T      M
  1  −0.45  −0.08  −0.38   0.11  −0.42   0.48   0.02  −0.15  −0.08
  2  −0.37  −0.18   0.05  −0.10   0.01   0.28   0.31   0.01   0.01
  3   0.08   0.07   0.47  −0.21  −0.33  −0.22  −0.48  −0.12  −0.21
  4   0.24   0.08  −0.11   0.19  −0.03   0.02   0.44   0.10   0.08
  5   0.10   0.18  −0.39  −0.24  −0.36  −0.25   0.20  −0.11  −0.24

Quick exercise 6.4 The next realizations for Z1, ..., Z7 are: −0.05, 0.26, 0.25, 0.39, 0.22, 0.23, 0.13. Determine the corresponding realizations of T and M.

Table 6.1 can be used to check some computations. We also see that the realization of T was closest to zero in runs 3 and 5, the realization of M was closest to zero in runs 1 and 4, and they were (about) the same in run 2. There is no clear conclusion from this, and even if there was, one could wonder whether the next five runs would yield the same picture. Because the whole process mimics randomness, one has to expect some variation, or perhaps a lot. In later chapters we will get a better understanding of this variation; for the moment we just say that judgment based on a large number of runs is better.

We do one thousand runs and exchange the table for pictures. Figure 6.3 depicts, for juror 1, a histogram of all the deviations from the true score g. For each interval of length 0.05 we have counted the number of runs for which the deviation of juror 1 fell in that interval. These numbers vary from about 40 to about 60. This is just to get an idea about the results for an individual juror. In Figure 6.4 we see histograms for the final scores. Comparing the histograms, it seems that the realizations of T are more concentrated near zero than those of M.
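A sketch of the thousand-run simulation in Python (our own code; the book does not prescribe a language) is given below. It applies the two rules to seven U(−0.5, 0.5) deviations per run, as in (6.4), and counts how often rule 1 ends up closer to the deserved score.

```python
import random

def one_run():
    # seven juror deviations, uniform on (-0.5, 0.5); g cancels, so we work
    # with the deviations directly, as in (6.4)
    z = sorted(random.random() - 0.5 for _ in range(7))
    t = sum(z[1:6]) / 5          # rule 1: average of the middle five
    m = z[3]                     # rule 2: the middle value
    return t, m

runs = [one_run() for _ in range(1000)]
rule1_closer = sum(abs(t) < abs(m) for t, m in runs)
print(rule1_closer / len(runs))  # compare with the roughly 70% reported in the text
```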
Fig. 6.3. Deviations of juror 1 from the deserved score, one thousand runs.

Fig. 6.4. One thousand realizations of T and M.

However, the two histograms do not tell us anything about the relation between T and M, so we plot the realizations of pairs (T, M) for all one thousand runs (Figure 6.5). From this plot we see that in most cases M and T go in the same direction: if T is positive, then usually M is also positive, and the same goes for negative values. In terms of the final scores, both rules generally overvalue and undervalue the performance simultaneously. On closer examination, with help of the line drawn from (−0.5, −0.5) to (0.5, 0.5), we see that the T values tend to be a little closer to zero than the M values.

This suggests that we make a histogram that shows the difference of the absolute deviations from the true score. For rule 1 this absolute deviation is |T|, for rule 2 it is |M|. If the difference |M| − |T| is positive, then T is closer to zero than M, and the difference tells us by how much.
Fig. 6.5. Plot of the points (T, M), one thousand runs.

A negative difference means that M was closer. In Figure 6.6 all the differences are shown in a histogram. The bars to the right of zero represent 696 runs. So, in about 70% of the runs, rule 1 resulted in a final score that is closer to the true score than rule 2. In about 30% of the cases, rule 2 was better, but generally by a smaller amount, as we see from the histogram.

Fig. 6.6. Differences |M| − |T| for one thousand runs.
6.4 The single-server queue

There are many situations in life where you stand in a line waiting for some service: when you want to withdraw money from a cash dispenser, borrow books at the library, be admitted to the emergency room at the hospital, or pump gas at the gas station. Many other queueing situations are hidden: an email message you send might be queued at the local server until it has sent all messages that were submitted ahead of yours; searching the Internet, your browser sends and receives packets of information that are queued at various stages and locations; in assembly lines, partly finished products move from station to station, each time waiting for the next component to be added.

We are going to study one simple queueing model, the so-called single-server queue: it has one server or service mechanism, and the arriving customers await their turn in order of their arrival. For definiteness, think of an oasis with one big water well. People arrive at the well with bottles, jerry cans, and other types of containers, to pump water. The supply of water is large, but the pump capacity is limited. The pump is about to be replaced, and while it is clear that a larger pump capacity will result in shorter waiting times, more powerful pumps are also more expensive. Therefore, to prepare a decision that balances costs and benefits, we wish to investigate the relationship between pump capacity and system performance.

Modeling the system

A stochastic model is in order: some general characteristics are known, such as how many people arrive per day and how much water they take on average, but the individual arrival times and amounts are unpredictable. We introduce random variables to describe them: let T1 be the time between the start at time zero and the arrival of the first customer, T2 the time between the arrivals of the first and the second customer, T3 the time between the second and the third, etc.; these are called the interarrival times. Let Si be the length of time that customer i needs to use the pump; in standard terminology this is called the service time. This is our description so far:

Arrivals at: T1, T1 + T2, T1 + T2 + T3, etc.
Service times: S1, S2, S3, etc.

The pump capacity v (liters per minute) is not a random variable but a model parameter or decision variable, whose “best” value we wish to determine. So if customer i requires Ri liters of water, then her service time is Si = Ri / v. To complete the model description, we need to specify the distribution of the random variables Ti and Ri:
Interarrival times: every Ti has an Exp(0.5) distribution (minutes);
Service requirement: every Ri has a U(2, 5) distribution (liters).

This particular choice of distributions would have to be supported by evidence that they are suited for the system at hand: a validation step as suggested for the jury model is appropriate here as well. For many arrival-type processes, however, the exponential distribution is reasonable as a model for the interarrival times (see Chapter 12). The particular uniform distribution chosen for the required amount of water says that all amounts between 2 and 5 liters are equally likely. So there is no sheik who owns a 5000-liter water truck in “our” oasis.

To evaluate system performance, we want to extract from the model the waiting times of the customers and how busy it is at the pump.

Waiting times

Let Wi denote the waiting time of customer i. The first customer is lucky; the system starts empty, and so W1 = 0. For customer i the waiting time depends on how long customer i − 1 spent in the system compared to the time between their respective arrivals. We see that if the interarrival time Ti is long, relatively speaking, then customer i arrives after the departure of customer i − 1, and so Wi = 0:

[time line: the interval of length Ti between the arrival of customer i − 1 and the arrival of customer i completely contains the waiting time Wi−1 and the service time Si−1, so Wi = 0.]

On the other hand, if customer i arrives before the departure, the waiting time Wi equals whatever remains of Wi−1 + Si−1:

[time line: Ti is shorter than Wi−1 + Si−1, and the leftover piece is Wi = Wi−1 + Si−1 − Ti.]

Summarizing the two cases, we obtain

Wi = max{Wi−1 + Si−1 − Ti, 0}.     (6.5)

To carry out a simulation, we start at time zero and generate realizations of the interarrival times (the Ti) and service requirements (the Ri) for as long as we want, computing the other quantities that follow from the model on the way. Table 6.2 shows the values generated this way, for two pump capacities (v = 2 and v = 3) for the first six customers. Note that in both cases we use the same realizations of Ti and Ri.
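In code the recursion (6.5) is a two-line loop. The following minimal Python sketch (function and variable names are my own, the seed is arbitrary) draws the Ti and Ri from the distributions specified above and reuses the same realizations for both pump capacities:

import numpy as np

rng = np.random.default_rng(123)
n = 6
T = rng.exponential(scale=2.0, size=n)   # interarrival times; Exp(0.5) has mean 1/0.5 = 2 minutes
R = rng.uniform(2.0, 5.0, size=n)        # required amounts of water, U(2, 5) liters

def waiting_times(T, R, v):
    S = R / v                            # service times for pump capacity v
    W = np.zeros(len(T))                 # W[0] = 0: the first customer does not wait
    for i in range(1, len(T)):
        W[i] = max(W[i-1] + S[i-1] - T[i], 0.0)   # recursion (6.5)
    return S, W

for v in (2, 3):
    S, W = waiting_times(T, R, v)
    print("v =", v, " arrivals:", np.round(np.cumsum(T), 2), " waiting times:", np.round(W, 2))

The numbers differ from Table 6.2 because the pseudo random realizations differ, but the pattern is the same: with the same Ti and Ri, the waiting times for v = 3 are never larger than those for v = 2.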
Table 6.2. Results of a short simulation.

       Input realizations        v = 2           v = 3
  i     Ti   Arr.time    Ri      Si     Wi       Si     Wi
  1    0.24    0.24     4.39    2.20    0        1.46   0
  2    1.97    2.21     4.00    2.00    0.23     1.33   0
  3    1.73    3.94     2.33    1.17    0.50     0.78   0
  4    2.82    6.76     4.03    2.01    0        1.34   0
  5    1.01    7.77     4.17    2.09    1.00     1.39   0.33
  6    1.09    8.86     4.24    2.12    1.99     1.41   0.63

Quick exercise 6.5 The next four realizations are T7: 1.86; R7: 4.79; T8: 1.08; and R8: 2.33. Complete the corresponding rows of the table.

Longer simulations produce so many numbers that we will drown in them unless we think of something. First, we summarize the waiting times of the first n customers with their average:

\bar{W}_n = \frac{W_1 + W_2 + \cdots + W_n}{n}.     (6.6)

Then, instead of giving a table, we plot the pairs (n, \bar{W}_n), for n = 1, 2, . . . until the end of the simulation. In Figure 6.7 we see that both lines bounce up and down a bit. Toward the end, the average waiting time for pump capacity 3 is about 0.5 and for v = 2 about 2. In a longer simulation we would see each of the averages converge to a limiting value (a consequence of the so-called law of large numbers, the topic of Chapter 13).

Fig. 6.7. Averaged waiting times at the well, for pump capacity 2 and 3.
Work-in-system

To show how busy it is at the pump one could record how many customers are waiting in the queue and plot this quantity against time. A slightly different approach is to record at every moment how much work there is in the system, that is, how much time it would take to serve everyone present at that moment. For example, if I am halfway through filling my 4-liter jerry can and three persons are waiting who require 2, 3, and 5 liters, respectively, then there are 12 liters to go; at v = 2, there is 6 minutes of work in the system, and at v = 3 just 4. The amount of work in the system just before a customer arrives equals the waiting time of that customer, because it is exactly the time it takes to finish the work for everybody ahead of her. The work-in-system at time t tells us how long the wait would be if somebody were to arrive at t. For this reason, this quantity is also called the virtual waiting time.

Figure 6.8 shows the work-in-system as a function of time for the first 15 minutes, using the same realizations that were the basis for Table 6.2. In the top graph, corresponding to v = 2, the work in the system jumps to 2.20 (which is the realization of R1/2) at t = 0.24, when the first customer arrives. So at t = 2.21, which is 1.97 later, there is 2.20 − 1.97 = 0.23 minute of work left; this is the waiting time for customer 2, who brings an amount of work of 2.00 minutes, so the peak at t = 2.21 is 0.23 + 2.00 = 2.23, etc. In the bottom graph we see the work-in-system reach zero more often, because the individual (work) amounts are 2/3 of what they are when v = 2. More often, arriving customers find the queue empty and the pump not in use; they do not have to wait.

Fig. 6.8. Work in system: top, v = 2; bottom, v = 3.
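The work-in-system curve itself can be reconstructed from the arrival and service times: it jumps up by Si at each arrival and drains at rate 1 in between, never dropping below zero. A small sketch of this bookkeeping (my own implementation and names, with an arbitrary seed):

import numpy as np

rng = np.random.default_rng(123)
n = 200
T = rng.exponential(2.0, n)                 # interarrival times, mean 2 minutes
arrivals = np.cumsum(T)
R = rng.uniform(2.0, 5.0, n)                # required liters

def work_in_system(arrivals, S, t_grid):
    # Virtual waiting time: add S[j] at each arrival, drain at rate 1 in between.
    V, w, last, j = np.zeros_like(t_grid), 0.0, 0.0, 0
    for k, t in enumerate(t_grid):
        while j < len(arrivals) and arrivals[j] <= t:
            w = max(w - (arrivals[j] - last), 0.0) + S[j]
            last, j = arrivals[j], j + 1
        V[k] = max(w - (t - last), 0.0)
    return V

t_grid = np.linspace(0.0, 100.0, 1001)
for v in (2, 3):
    V = work_in_system(arrivals, R / v, t_grid)
    print("v =", v, " peak work:", round(V.max(), 1),
          " fraction of grid points with empty system:", round(np.mean(V == 0.0), 2))

Plotting V against t_grid for both capacities reproduces pictures like Figures 6.8 and 6.9.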
In Figure 6.9 the work-in-system is depicted as a function of time for the first 100 minutes of our run.

Fig. 6.9. Work in system: top, v = 2; bottom, v = 3.
At pump capacity 2 the virtual waiting time peaks at close to 11 minutes after about 55 minutes, whereas with v = 3 the corresponding peak is only about 4 minutes. There also is a marked difference in the proportion of time the system is empty. 6.5 Solutions to the quick exercises 6.1 To simulate the coin, choose any three of the six possible outcomes of the die, report heads if one of these three outcomes turns up, and report tails otherwise. For example, heads if the outcome is odd, tails if it is even. To simulate the die using a coin is more difficult; one solution is as follows. Toss the coin three times and use the following conversion table to map the result: Coins HHH HHT HTH HTT THH THT Die 1 2 3 4 5 6 Repeat the coin tosses if you get TTH or TTT.
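Solution 6.1 is a small example of simulation by rejection: toss triples of coins until one of the six accepted patterns appears. A minimal sketch in Python (names and seed are mine):

import random

random.seed(7)

def die_from_coins():
    # Map six of the eight equally likely triples to 1..6; reject TTH and TTT and retry.
    table = {"HHH": 1, "HHT": 2, "HTH": 3, "HTT": 4, "THH": 5, "THT": 6}
    while True:
        toss = "".join(random.choice("HT") for _ in range(3))
        if toss in table:
            return table[toss]

rolls = [die_from_coins() for _ in range(60000)]
print([round(rolls.count(k) / len(rolls), 3) for k in range(1, 7)])   # each frequency close to 1/6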
6.2 Let the U(0, 1) variable be U and set: Y = 1 if U < 3/5, Y = 3 if 3/5 ≤ U < 4/5, and Y = 4 if U ≥ 4/5. So, for example, P(Y = 3) = P(3/5 ≤ U < 4/5) = 1/5.

6.3 The given distribution function F is strictly increasing between 1 and 3, so we use the method with F^inv. Solve the equation F(x) = ¼(x − 1)² = u for x. This yields x = 1 + 2√u, so we can set X = 1 + 2√U. If you need to be convinced, determine F_X.

6.4 In ascending order the values are −0.05, 0.13, 0.22, 0.23, 0.25, 0.26, 0.39, so for M we find 0.23, and for T (0.13 + 0.22 + 0.23 + 0.25 + 0.26)/5 = 0.22.

6.5 We find:

       Input realizations        v = 2           v = 3
  i     Ti   Arr.time    Ri      Si     Wi       Si     Wi
  7    1.86   10.72     4.79    2.39    2.25     1.60   0.18
  8    1.08   11.80     2.33    1.16    3.57     0.78   0.70

6.6 Exercises

6.1 Let U have a U(0, 1) distribution.
a. Describe how to simulate the outcome of a roll with a die using U.
b. Define Y as follows: round 6U + 1 down to the nearest integer. What are the possible outcomes of Y and their probabilities?

6.2 We simulate the random variable X = 1 + 2√U constructed in Quick exercise 6.3. As realization for U we obtain from the pseudo random generator the number u = 0.3782739.
a. What is the corresponding realization x of the random variable X?
b. If the next call to the random generator yields u = 0.3, will the corresponding realization for X be larger or smaller than the value you found in a?
c. What is the probability the next draw will be smaller than the value you found in a?
6.3 Let U have a U(0, 1) distribution. Show that Z = 1 − U has a U(0, 1) distribution by deriving the probability density function or the distribution function.

6.4 Let F be the distribution function as given in Quick exercise 6.3: F(x) is 0 for x < 1 and 1 for x > 3, and F(x) = ¼(x − 1)² if 1 ≤ x ≤ 3. In the answer it is claimed that X = 1 + 2√U has distribution function F, where U is a U(0, 1) random variable. Verify this by computing P(X ≤ a) and checking that this equals F(a), for any a.

6.5 We have seen that if U has a U(0, 1) distribution, then X = − ln U has an Exp(1) distribution. Check this by verifying that P(X ≤ a) = 1 − e^{−a} for a ≥ 0.

6.6 Somebody messed up the random number generator in your computer: instead of uniform random numbers it generates numbers with an Exp(2) distribution. Describe how to construct a U(0, 1) random variable U from an Exp(2) distributed X. Hint: look at how you obtain an Exp(2) random variable from a U(0, 1) random variable.

6.7 In models for the lifetimes of mechanical components one sometimes uses random variables with distribution functions from the so-called Weibull family. Here is an example: F(x) = 0 for x < 0, and F(x) = 1 − e^{−5x²} for x ≥ 0. Construct a random variable Z with this distribution from a U(0, 1) variable.

6.8 A random variable X has a Par(3) distribution, so with distribution function F with F(x) = 0 for x < 1, and F(x) = 1 − x^{−3} for x ≥ 1. For details on the Pareto distribution see Section 5.4. Describe how to construct X from a U(0, 1) random variable.

6.9 In Quick exercise 6.1 we simulated a die by tossing three coins. Recall that we might need several attempts before succeeding.
a. What is the probability that we succeed on the first try?
b. Let N be the number of tries that we need. Determine the distribution of N.

6.10 There is usually more than one way to simulate a particular random variable. In this exercise we consider two ways to generate geometric random variables.
a. We give you a sequence of independent U(0, 1) random variables U1, U2, . . . . From this sequence, construct a sequence of Bernoulli random variables.
From the sequence of Bernoulli random variables, construct a (single) Geo(p) random variable.
b. It is possible to generate a Geo(p) random variable using just one U(0, 1) random variable. If calls to the random number generator take a lot of CPU time, this would lead to faster simulation programs. Set λ = − ln(1 − p) and let Y have an Exp(λ) distribution. We obtain Z from Y by rounding to the nearest integer greater than Y. Note that Z is a discrete random variable, whereas Y is a continuous one. Show that, nevertheless, the event {Z > n} is the same as {Y > n}. Use this to compute P(Z > n) from the distribution of Y. What is the distribution of Z? (See Quick exercise 4.6.)

6.11 Reconsider the jury example (see Section 6.3). Suppose the first jury member is bribed to vote in favor of the present candidate.
a. How should you now model Y1? Describe how you can investigate which of the two rules is less sensitive to the effect of the bribery.
b. The International Skating Union decided to adopt a rule similar to the following: randomly discard two of the jury scores, then average the remaining scores. Describe how to investigate this rule. Do you expect this rule to be more sensitive to the bribery than the two rules already discussed, or less sensitive?

6.12 A tiny financial model. To investigate investment strategies, consider the following: You can choose to invest your money in one particular stock or put it in a savings account. Your initial capital is €1000. The interest rate r is 0.5% per month and does not change. The initial stock price is €100. Your stochastic model for the stock price is as follows: next month the price is the same as this month with probability 1/2, with probability 1/4 it is 5% lower, and with probability 1/4 it is 5% higher. This principle applies for every new month. There are no transaction costs when you buy or sell stock. Your investment strategy for the next 5 years is: convert all your money to stock when the price drops below €95, and sell all stock and put the money in the bank when the stock price exceeds €110. Describe how to simulate the results of this strategy for the model given.

6.13 We give you an unfair coin and you do not know P(H) for this coin. Can you simulate a fair coin, and how many tosses do you need for each fair coin toss?
  • 98. 7 Expectation and variance Random variables are complicated objects, containing a lot of information on the experiments that are modeled by them. If we want to summarize a random variable by a single number, then this number should undoubtedly be its expected value. The expected value, also called the expectation or mean, gives the center—in the sense of average value—of the distribution of the random variable. If we allow a second number to describe the random variable, then we look at its variance, which is a measure of spread of the distribution of the random variable. 7.1 Expected values An oil company needs drill bits in an exploration project. Suppose that it is known that (after rounding to the nearest hour) drill bits of the type used in this particular project will last 2, 3, or 4 hours with probabilities 0.1, 0.7, and 0.2. If a drill bit is replaced by one of the same type each time it has worn out, how long could exploration be continued if in total the company would reserve 10 drill bits for the exploration job? What most people would do to answer this question is to take the weighted average 0.1 · 2 + 0.7 · 3 + 0.2 · 4 = 3.1, and conclude that the exploration could continue for 10 × 3.1, or 31 hours. This weighted average is what we call the expected value or expectation of the random variable X whose distribution is given by P(X = 2) = 0.1, P(X = 3) = 0.7, P(X = 4) = 0.2. It might happen that the company is unlucky and that each of the 10 drill bits has worn out after two hours, in which case exploration ends after 20 hours. At the other extreme, they may be lucky and drill for 40 hours on these 10
bits. However, it is a mathematical fact that the conclusion about a 31-hour total drilling time is correct in the following sense: for a large number n of drill bits the total running time will be around n times 3.1 hours with high probability. In the example, where n = 10, we have, for instance, that drilling will continue for 29, 30, 31, 32, or 33 hours with probability more than 0.86, while the probability that it will last only for 20, 21, 22, 23, or 24 hours is less than 0.00006. We will come back to this in Chapters 13 and 14. This example illustrates the following definition.

Definition. The expectation of a discrete random variable X taking the values a1, a2, . . . and with probability mass function p is the number

E[X] = \sum_{i} a_i P(X = a_i) = \sum_{i} a_i p(a_i).

We also call E[X] the expected value or mean of X. Since the expectation is determined by the probability distribution of X only, we also speak of the expectation or mean of the distribution.

Quick exercise 7.1 Let X be the discrete random variable that takes the values 1, 2, 4, 8, and 16, each with probability 1/5. Compute the expectation of X.

Looking at an expectation as a weighted average gives a more physical interpretation of this notion, namely as the center of gravity of weights p(ai) placed at the points ai. For the random variable associated with the drill bit, this is illustrated in Figure 7.1.

Fig. 7.1. Expected value as center of gravity.
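The weighted average in the definition is a one-line computation, and it can be checked against the average of a large simulated sample (a sketch of my own, not part of the text):

import numpy as np

values = np.array([2.0, 3.0, 4.0])      # drill bit lifetimes in hours
probs = np.array([0.1, 0.7, 0.2])

print(np.sum(values * probs))           # E[X] = 0.1*2 + 0.7*3 + 0.2*4 = 3.1

rng = np.random.default_rng(0)
sample = rng.choice(values, size=100_000, p=probs)
print(sample.mean())                    # close to 3.1 for a large sample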
This point of view also leads the way to how one should define the expected value of a continuous random variable. Let, for example, X be a continuous random variable whose probability density function f is zero outside the interval [0, 1]. It seems reasonable to approximate X by the discrete random variable Y, taking the values

1/n, 2/n, . . . , (n − 1)/n, 1

with as probabilities the masses that X assigns to the intervals [(k − 1)/n, k/n]:

P\left(Y = \tfrac{k}{n}\right) = P\left(\tfrac{k-1}{n} \le X \le \tfrac{k}{n}\right) = \int_{(k-1)/n}^{k/n} f(x)\,dx.

We have a good idea of the size of this probability. For large n, it can be approximated well in terms of f:

P\left(Y = \tfrac{k}{n}\right) = \int_{k/n - 1/n}^{k/n} f(x)\,dx \approx \tfrac{1}{n}\, f\!\left(\tfrac{k}{n}\right).

The “center-of-gravity” interpretation suggests that the expectation E[Y] of Y should approximate the expectation E[X] of X. We have

E[Y] = \sum_{k=1}^{n} \tfrac{k}{n}\, P\left(Y = \tfrac{k}{n}\right) \approx \sum_{k=1}^{n} \tfrac{k}{n}\, f\!\left(\tfrac{k}{n}\right) \tfrac{1}{n}.

By the definition of a definite integral, for large n the right-hand side is close to

\int_{0}^{1} x f(x)\,dx.

This motivates the following definition.

Definition. The expectation of a continuous random variable X with probability density function f is the number

E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.

We also call E[X] the expected value or mean of X.

Note that E[X] is indeed the center of gravity of the mass distribution described by the function f:

E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \frac{\int_{-\infty}^{\infty} x f(x)\,dx}{\int_{-\infty}^{\infty} f(x)\,dx}.

This is illustrated in Figure 7.2.
Fig. 7.2. Expected value as center of gravity, continuous case.

Quick exercise 7.2 Compute the expectation of a random variable U that is uniformly distributed over [2, 5].

Remark 7.1 (The expected value may not exist!). In the definitions in this section we have been rather careless about the convergence of sums and integrals. Let us take a closer look at the integral

I = \int_{-\infty}^{\infty} x f(x)\,dx.

Since a probability density function cannot take negative values, we have I = I_- + I_+ with I_- = \int_{-\infty}^{0} x f(x)\,dx a negative and I_+ = \int_{0}^{\infty} x f(x)\,dx a positive number. However, it may happen that I_- equals −∞ or I_+ equals +∞. If both I_- = −∞ and I_+ = +∞, then we say that the expected value does not exist. An example of a continuous random variable for which the expected value does not exist is the random variable with the Cauchy distribution (see also page 161), having probability density function

f(x) = \frac{1}{\pi(1 + x^2)} \quad\text{for } -\infty < x < \infty.

For this random variable

I_+ = \int_{0}^{\infty} \frac{x}{\pi(1 + x^2)}\,dx = \left[\frac{1}{2\pi}\ln(1 + x^2)\right]_{0}^{\infty} = +\infty,

I_- = \int_{-\infty}^{0} \frac{x}{\pi(1 + x^2)}\,dx = \left[\frac{1}{2\pi}\ln(1 + x^2)\right]_{-\infty}^{0} = -\infty.

If I_- is finite but I_+ = +∞, then we say that the expected value is infinite.
A distribution that has an infinite expectation is the Pareto distribution with parameter α = 1 (see Exercise 7.11). The remarks we made on the integral in the definition of E[X] for continuous X apply similarly to the sum in the definition of E[X] for discrete random variables X.
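The practical consequence of a nonexistent expectation shows up immediately in simulation: running averages of Cauchy observations never settle down, while those of, say, standard normal observations do. A quick illustration (my own sketch, not from the book):

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
cauchy = rng.standard_cauchy(n)
normal = rng.standard_normal(n)

for m in (1_000, 10_000, 100_000):
    # running averages after m observations
    print(m, round(cauchy[:m].mean(), 3), round(normal[:m].mean(), 4))

The normal averages shrink toward 0 as m grows; the Cauchy averages keep making large jumps, no matter how many observations are used.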
7.2 Three examples

The geometric distribution

If you buy a lottery ticket every week and you have a chance of 1 in 10 000 of winning the jackpot, what is the expected number of weeks you have to buy tickets before you get the jackpot? The answer is: 10 000 weeks (almost two centuries!). The number of weeks is modeled by a random variable with a geometric distribution with parameter p = 10^{−4}.

The expectation of a geometric distribution. Let X have a geometric distribution with parameter p; then

E[X] = \sum_{k=1}^{\infty} k p (1-p)^{k-1} = \frac{1}{p}.

Here \sum_{k=1}^{\infty} k p (1-p)^{k-1} = 1/p follows from the formula \sum_{k=1}^{\infty} k x^{k-1} = 1/(1-x)^2 that has been derived in your calculus course. We will see a simple (probabilistic) way to obtain the value of this sum in Chapter 11.

The exponential distribution

In Section 5.6 we considered the chemical reactor example, where the residence time T, measured in minutes, has an Exp(0.5) distribution. We claimed that this implies that the mean time a particle stays in the vessel is 2 minutes. More generally, we have the following.

The expectation of an exponential distribution. Let X have an exponential distribution with parameter λ; then

E[X] = \int_{0}^{\infty} x \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}.

The integral has been determined in your calculus course (with the technique of integration by parts).

The normal distribution

Here, using that the normal density integrates to 1 and applying the substitution z = (x − µ)/σ,

E[X] = \int_{-\infty}^{\infty} x \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12\left(\frac{x-\mu}{\sigma}\right)^2} dx
     = \mu + \int_{-\infty}^{\infty} (x-\mu) \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12\left(\frac{x-\mu}{\sigma}\right)^2} dx
     = \mu + \sigma \int_{-\infty}^{\infty} z \frac{1}{\sqrt{2\pi}} e^{-\frac12 z^2}\,dz = \mu,

where the integral is 0, because the integrand is an odd function. We obtained the following rule.

The expectation of a normal distribution. Let X be an N(µ, σ²) distributed random variable. Then

E[X] = \int_{-\infty}^{\infty} x \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12\left(\frac{x-\mu}{\sigma}\right)^2} dx = \mu.

7.3 The change-of-variable formula

Often one does not want to compute the expected value of a random variable X but rather of a function of X, as, for example, X². We then need to determine the distribution of Y = X², for example by computing the distribution function F_Y of Y (this is an example of the general problem of how distributions change under transformations; this topic is the subject of Chapter 8). For a concrete example, suppose an architect wants maximal variety in the sizes of buildings: these should be of the same width and depth X, but X is uniformly distributed between 0 and 10 meters. What is the distribution of the area X² of a building; in particular, will this distribution be (anything near to) uniform? Let us compute F_Y; for 0 ≤ a ≤ 100:

F_Y(a) = P(X^2 \le a) = P(X \le \sqrt{a}) = \frac{\sqrt{a}}{10}.

Hence the probability density function f_Y of the area is, for 0 < y < 100 meters squared, given by

f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} \frac{\sqrt{y}}{10} = \frac{1}{20\sqrt{y}}.     (7.1)

This means that the buildings with small areas are heavily overrepresented, because f_Y explodes near 0 (see also Figure 7.3, in which we plotted f_Y). Surprisingly, this is not very visible in Figure 7.4, an example where we should believe our calculations more than our eyes. In the figure the locations of the buildings are generated by a Poisson process, the subject of Chapter 12. Suppose that a contractor has to make an offer on the price of the foundations of the buildings. The amount of concrete he needs will be proportional to the area X² of a building. So his problem is: what is the expected area of a building? With f_Y from (7.1) he finds

E[X^2] = E[Y] = \int_{0}^{100} y \cdot \frac{1}{20\sqrt{y}}\,dy = \int_{0}^{100} \frac{\sqrt{y}}{20}\,dy = \left[\frac{1}{20}\cdot\frac{2}{3}\, y^{3/2}\right]_{0}^{100} = 33\tfrac13\ \mathrm{m}^2.
Fig. 7.3. The probability density of the square of a U(0, 10) random variable.

It is interesting to note that we really need to do this calculation, because the expected area is not simply the product of the expected width and the expected depth, which is 25 m². However, there is a much easier way in which the contractor could have obtained this result. He could have argued that the value of the area is x² when x is the width, and that he should take the weighted average of those values, where the weight at width x is given by the value f_X(x) of the probability density of X. Then he would have computed

E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx = \int_{0}^{10} x^2 \cdot \frac{1}{10}\,dx = \left[\frac{1}{30} x^3\right]_{0}^{10} = 33\tfrac13\ \mathrm{m}^2.

It is indeed a mathematical theorem that this is always a correct way to compute expected values of functions of random variables.

Fig. 7.4. Top: widths of the buildings between 0 and 10 meters. Bottom: corresponding buildings in a 100×300 m area.
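A quick Monte Carlo check of the contractor's computation (a sketch of my own, not from the book) also makes the point above concrete: the expected area is not the square of the expected width.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=1_000_000)   # widths X ~ U(0, 10)

print(np.mean(x**2))    # close to 100/3 = 33.33..., the expected area E[X^2]
print(np.mean(x)**2)    # close to 25, the square of the expected width (E[X])^2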
The change-of-variable formula. Let X be a random variable, and let g : R → R be a function. If X is discrete, taking the values a1, a2, . . . , then

E[g(X)] = \sum_{i} g(a_i)\, P(X = a_i).

If X is continuous, with probability density function f, then

E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.

Quick exercise 7.3 Let X have a Ber(p) distribution. Compute E[2^X].

An operation that occurs very often in practice is a change of units, e.g., from Fahrenheit to Celsius. What happens then to the expectation? Here we have to apply the formula with the function g(x) = rx + s, where r and s are real numbers. When X has a continuous distribution, the change-of-variable formula yields:

E[rX + s] = \int_{-\infty}^{\infty} (rx + s) f(x)\,dx = r\int_{-\infty}^{\infty} x f(x)\,dx + s\int_{-\infty}^{\infty} f(x)\,dx = rE[X] + s.

A similar computation with integrals replaced by sums gives the same result for discrete random variables.

7.4 Variance

Suppose you are offered an opportunity for an investment whose expected return is €500. If you are given the extra information that this expected value is the result of a 50% chance of a €450 return and a 50% chance of a €550 return, then you would not hesitate to spend €450 on this investment. However, if the expected return were the result of a 50% chance of a €0 return and a 50% chance of a €1000 return, then most people would be reluctant to spend such an amount. This demonstrates that the spread (around the mean) of a random variable is of great importance. Usually this is measured by the expected squared deviation from the mean.

Definition. The variance Var(X) of a random variable X is the number

Var(X) = E\left[(X - E[X])^2\right].
Note that the variance of a random variable is always positive (or 0). Furthermore, there is the question of existence and finiteness (cf. Remark 7.1). In practical situations one often considers the standard deviation, defined by √Var(X), because it has the same dimension as E[X]. As an example, let us compute the variance of a normal distribution. If X has an N(µ, σ²) distribution, then:

Var(X) = E\left[(X - E[X])^2\right] = \int_{-\infty}^{\infty} (x-\mu)^2 \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12\left(\frac{x-\mu}{\sigma}\right)^2} dx = \sigma^2 \int_{-\infty}^{\infty} z^2 \frac{1}{\sqrt{2\pi}} e^{-\frac12 z^2}\,dz.

Here we substituted z = (x − µ)/σ. Using integration by parts one finds that

\int_{-\infty}^{\infty} z^2 \frac{1}{\sqrt{2\pi}} e^{-\frac12 z^2}\,dz = 1.

We have found the following property.

Variance of a normal distribution. Let X be an N(µ, σ²) distributed random variable. Then

Var(X) = \int_{-\infty}^{\infty} (x-\mu)^2 \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac12\left(\frac{x-\mu}{\sigma}\right)^2} dx = \sigma^2.

Quick exercise 7.4 Let us call the two returns discussed above Y1 and Y2, respectively. Compute the variance and standard deviation of Y1 and Y2.

It is often not practical to compute Var(X) directly from the definition, but one uses the following rule.

An alternative expression for the variance. For any random variable X,

Var(X) = E\left[X^2\right] - \left(E[X]\right)^2.

To see that this rule holds, we apply the change-of-variable formula. Suppose X is a continuous random variable with probability density function f (the discrete case runs completely analogously). Using the change-of-variable formula, well-known properties of the integral, and \int_{-\infty}^{\infty} f(x)\,dx = 1, we find
Var(X) = E\left[(X - E[X])^2\right] = \int_{-\infty}^{\infty} (x - E[X])^2 f(x)\,dx
       = \int_{-\infty}^{\infty} \left(x^2 - 2xE[X] + (E[X])^2\right) f(x)\,dx
       = \int_{-\infty}^{\infty} x^2 f(x)\,dx - 2E[X]\int_{-\infty}^{\infty} x f(x)\,dx + (E[X])^2 \int_{-\infty}^{\infty} f(x)\,dx
       = E\left[X^2\right] - 2(E[X])^2 + (E[X])^2 = E\left[X^2\right] - (E[X])^2.

With this rule we make two steps: first we compute E[X], then we compute E[X²]. The latter is called the second moment of X. Let us compare the computations, using the definition and this rule, for the drill bit example. Recall that for this example X takes the values 2, 3, and 4 with probabilities 0.1, 0.7, and 0.2. We found that E[X] = 3.1. According to the definition,

Var(X) = E[(X − 3.1)²] = 0.1 · (2 − 3.1)² + 0.7 · (3 − 3.1)² + 0.2 · (4 − 3.1)²
       = 0.1 · (−1.1)² + 0.7 · (−0.1)² + 0.2 · (0.9)²
       = 0.1 · 1.21 + 0.7 · 0.01 + 0.2 · 0.81 = 0.121 + 0.007 + 0.162 = 0.29.

Using the rule is neater and somewhat faster:

Var(X) = E[X²] − (3.1)² = 0.1 · 2² + 0.7 · 3² + 0.2 · 4² − 9.61
       = 0.1 · 4 + 0.7 · 9 + 0.2 · 16 − 9.61 = 0.4 + 6.3 + 3.2 − 9.61 = 0.29.

What happens to the variance if we change units? At the end of the previous section we showed that E[rX + s] = rE[X] + s. This can be used to obtain the corresponding rule for the variance under a change of units (see also Exercise 7.15).

Expectation and variance under change of units. For any random variable X and any real numbers r and s,

E[rX + s] = rE[X] + s,  and  Var(rX + s) = r² Var(X).

Note that the variance is insensitive to the shift over s. Can you understand why this must be true without doing any computations?
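Both routes to the variance, and the change-of-units rule, are easy to verify numerically for the drill bit distribution; the sketch below is my own check, not part of the text.

import numpy as np

values = np.array([2.0, 3.0, 4.0])
probs = np.array([0.1, 0.7, 0.2])

mean = np.sum(values * probs)                        # E[X] = 3.1
var_def = np.sum(probs * (values - mean) ** 2)       # definition: E[(X - E[X])^2]
var_alt = np.sum(probs * values ** 2) - mean ** 2    # rule: E[X^2] - (E[X])^2
print(var_def, var_alt)                              # both 0.29 (up to rounding)

r, s = 2.0, 5.0                                      # an arbitrary change of units
new_values = r * values + s
new_mean = np.sum(new_values * probs)
new_var = np.sum(probs * (new_values - new_mean) ** 2)
print(new_var, r**2 * var_def)                       # both 1.16: the shift s has dropped out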
7.5 Solutions to the quick exercises

7.1 We have

E[X] = \sum_{i} a_i P(X = a_i) = 1\cdot\tfrac15 + 2\cdot\tfrac15 + 4\cdot\tfrac15 + 8\cdot\tfrac15 + 16\cdot\tfrac15 = \tfrac{31}{5} = 6.2.

7.2 The probability density function f of U is given by f(x) = 0 outside [2, 5] and f(x) = 1/3 for 2 ≤ x ≤ 5; hence

E[U] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{2}^{5} \tfrac13 x\,dx = \left[\tfrac16 x^2\right]_{2}^{5} = 3\tfrac12.

7.3 Using the change-of-variable formula we obtain

E\left[2^X\right] = \sum_{i} 2^{a_i} P(X = a_i) = 2^0 \cdot P(X = 0) + 2^1 \cdot P(X = 1) = 1\cdot(1-p) + 2\cdot p = 1 + p.

You could also have noted that Y = 2^X has a distribution given by P(Y = 1) = 1 − p, P(Y = 2) = p; hence

E\left[2^X\right] = E[Y] = 1\cdot P(Y = 1) + 2\cdot P(Y = 2) = 1\cdot(1-p) + 2\cdot p = 1 + p.

7.4 We have Var(Y1) = ½(450 − 500)² + ½(550 − 500)² = 50² = 2500, so Y1 has standard deviation €50, and Var(Y2) = ½(0 − 500)² + ½(1000 − 500)² = 500² = 250 000, so Y2 has standard deviation €500.

7.6 Exercises

7.1 Let T be the outcome of a roll with a fair die.
a. Describe the probability distribution of T, that is, list the outcomes and the corresponding probabilities.
b. Determine E[T] and Var(T).

7.2 The probability distribution of a discrete random variable X is given by

P(X = −1) = 1/5,  P(X = 0) = 2/5,  P(X = 1) = 2/5.
a. Compute E[X].
b. Give the probability distribution of Y = X² and compute E[Y] using the distribution of Y.
c. Determine E[X²] using the change-of-variable formula. Check your answer against the answer in b.
d. Determine Var(X).

7.3 For a certain random variable X it is known that E[X] = 2, Var(X) = 3. What is E[X²]?

7.4 Let X be a random variable with E[X] = 2, Var(X) = 4. Compute the expectation and variance of 3 − 2X.

7.5 Determine the expectation and variance of the Ber(p) distribution.

7.6 The random variable Z has probability density function f(z) = 3z²/19 for 2 ≤ z ≤ 3 and f(z) = 0 elsewhere. Determine E[Z]. Before you do the calculation: will the answer lie closer to 2 than to 3 or the other way around?

7.7 Given is a random variable X with probability density function f given by f(x) = 0 for x < 0 and for x > 1, and f(x) = 4x − 4x³ for 0 ≤ x ≤ 1. Determine the expectation and variance of the random variable 2X + 3.

7.8 Given is a continuous random variable X whose distribution function F satisfies F(x) = 0 for x < 0, F(x) = 1 for x > 1, and F(x) = x(2 − x) for 0 ≤ x ≤ 1. Determine E[X].

7.9 Let U be a random variable with a U(α, β) distribution.
a. Determine the expectation of U.
b. Determine the variance of U.

7.10 Let X have an exponential distribution with parameter λ.
a. Determine E[X] and E[X²] using partial integration.
b. Determine Var(X).

7.11 In this exercise we take a look at the mean of a Pareto distribution.
a. Determine the expectation of a Par(2) distribution.
b. Determine the expectation of a Par(½) distribution.
c. Let X have a Par(α) distribution. Show that E[X] = α/(α − 1) if α > 1.

7.12 For which α is the variance of a Par(α) distribution finite? Compute the variance for these α.
7.13 Remember that we found on page 95 that the expected area of a building was 33⅓ m², whereas the square of the expected width was only 25 m². This phenomenon is more general: show that for any random variable X one has

E[X²] ≥ (E[X])².

Hint: you might use that Var(X) ≥ 0.

7.14 Suppose we choose arbitrarily a point from the square with corners at (2,1), (3,1), (2,2), and (3,2). The random variable A is the area of the triangle with its corners at (2,1), (3,1), and the chosen point. (See also Exercise 5.9 and Figure 7.5.) Compute E[A].

Fig. 7.5. A triangle in a 1×1 square.

7.15 Let X be a random variable and r and s any real numbers. Use the change-of-units rule E[rX + s] = rE[X] + s for the expectation to obtain a and b.
a. Show that Var(rX) = r² Var(X).
b. Show that Var(X + s) = Var(X).
c. Combine parts a and b to show that Var(rX + s) = r² Var(X).

7.16 The probability density function f of the random variable X used in Figure 7.2 is given by f(x) = 0 outside (0, 1) and f(x) = −4x ln(x) for 0 < x < 1. Compute the position of the balancing point in the figure, that is, compute the expectation of X.

7.17 Let U be a discrete random variable taking the values a1, . . . , ar with probabilities p1, . . . , pr.
a. Suppose all ai ≥ 0, but that E[U] = 0. Show then
a1 = a2 = · · · = ar = 0. In other words, P(U = 0) = 1.
b. Suppose that V is a random variable taking the values b1, . . . , br with probabilities p1, . . . , pr. Show that Var(V) = 0 implies P(V = E[V]) = 1. Hint: apply part a with U = (V − E[V])².
8 Computations with random variables

There are many ways to make new random variables from old ones. Of course this is not a goal in itself; usually new variables are created naturally in the process of solving a practical problem. The expectations and variances of such new random variables can be calculated with the change-of-variable formula. However, often one would like to know the distributions of the new random variables. We shall show how to determine these distributions, how to compare expectations of random variables and their transformed versions (Jensen's inequality), and how to determine the distributions of maxima and minima of several random variables.

8.1 Transforming discrete random variables

The problem we consider in this section and the next is how the distribution of a random variable X changes if we apply a function g to it, thus obtaining a new random variable Y:

Y = g(X).

When X is a discrete random variable this is usually not too hard to do: it is just a matter of bookkeeping. We illustrate this with an example. Imagine an airline company that sells tickets for a flight with 150 available seats. It has no idea about how many tickets it will sell. Suppose, to keep the example simple, that the number X of tickets that will be sold can be anything from 1 to 200. Moreover, suppose that each possibility has equal probability to occur, i.e., P(X = j) = 1/200 for j = 1, 2, . . . , 200. The real interest of the airline company is in the random variable Y, which is the number of passengers that have to be refused. What is the distribution of Y? To answer this, note that nobody will be refused when the passengers fit in the plane, hence

P(Y = 0) = P(X ≤ 150) = 150/200 = 3/4.
  • 113. 104 8 Computations with random variables For the other values, k = 1, 2 . . ., 50 P(Y = k) = P(X = 150 + k) = 1 200 . Note that in this example the function g is given by g(x) = max{x − 150, 0}. Quick exercise 8.1 Let Z be the number of passengers that will be in the plane. Determine the probability distribution of Z. What is the function g in this case? 8.2 Transforming continuous random variables We now turn to continuous random variables. Since single values occur with probability zero for a continuous random variable, the approach above does not work. The strategy now is to first determine the distribution function of the transformed random variable Y = g(X) and then the probability density by differentiating. We shall illustrate this with the following example (actually we saw an example of such a computation in Section 7.3 with the function g(x) = x2 ). We consider two methods that traffic police employ to determine whether you deserve a fine for speeding. From experience, the traffic police think that vehicles are driving at speeds ranging from 60 to 90 km/hour at a certain road section where the speed limit is 80 km/hour. They assume that the speed of the cars is uniformly distributed over this interval. The first method is measuring the speed at a fixed spot in the road section. With this method the police will find that about (90 − 80)/(90 − 60) = 1/3 of the cars will be fined. For the second method, cameras are put at the beginning and end of a 1-km road section, and a driver is fined if he spends less than a certain amount of time in the road section. Cars driving at 60 km/hour need one minute, those driving at 90 km/hour only 40 seconds. Let us therefore model the time T an arbitrary car spends in the section by a uniform distribution over (40, 60) seconds. What is the speed V we deduce from this travelling time? Note that for 40 ≤ t ≤ 60, P(T ≤ t) = t − 40 20 . Since there are 3600 seconds in an hour we have that V = g(T ) = 3600 T . We therefore find for the distribution function FV (v) = P(V ≤ v) of the speed V that
FV(v) = P(3600/T ≤ v) = P(T ≥ 3600/v) = 1 − ((3600/v) − 40)/20 = 3 − 180/v
for all speeds v between 60 and 90. We can now obtain the probability density fV of V by differentiating:
fV(v) = (d/dv) FV(v) = (d/dv)(3 − 180/v) = 180/v²  for 60 ≤ v ≤ 90.
It is amusing to note that with the second model the traffic police write fewer speeding tickets, because
P(V > 80) = 1 − P(V ≤ 80) = 1 − (3 − 180/80) = 1/4.
(With the first model we found probability 1/3 that a car drove faster than 80 km/hour.) This is related to a famous result in road traffic research, which is succinctly phrased as: "space mean speed < time mean speed" (see [37]). It is also related to Jensen's inequality, which we introduce in Section 8.3.
Similar to the way this is done in the traffic example, one can determine the distribution of Y = 1/X for any X with a continuous distribution. The outcome will be that if X has density fX, then the density fY of Y is given by
fY(y) = (d/dy) FY(y) = (1/y²) fX(1/y)  for y > 0 and for y < 0.
One can give fY(0) any value; often one puts fY(0) = 0.
Quick exercise 8.2 Let X have a continuous distribution with probability density fX(x) = 1/[π(1 + x²)]. What is the distribution of Y = 1/X?
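The contrast between the two methods of the traffic police is easy to check empirically. The following sketch (a hypothetical Python illustration) draws travel times T uniformly from (40, 60) seconds over the 1-km section, converts them to speeds V = 3600/T, and estimates P(V > 80); it also samples spot speeds uniformly from (60, 90) for comparison with the first method.

import random

random.seed(1)
n = 100_000

# Second method: travel time T ~ U(40, 60) seconds, deduced speed V = 3600 / T.
fined_camera = sum(3600 / random.uniform(40, 60) > 80 for _ in range(n)) / n

# First method: spot speed assumed U(60, 90) km/hour.
fined_spot = sum(random.uniform(60, 90) > 80 for _ in range(n)) / n

print(fined_camera)  # close to 1/4
print(fined_spot)    # close to 1/3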
We turn to a second example. A very common transformation is a change of units, for instance, from Celsius to Fahrenheit. If X is temperature expressed in degrees Celsius, then Y = (9/5)X + 32 is the temperature in degrees Fahrenheit. Let FX and FY be the distribution functions of X and Y. Then we have for any a
FY(a) = P(Y ≤ a) = P((9/5)X + 32 ≤ a) = P(X ≤ (5/9)(a − 32)) = FX((5/9)(a − 32)).
By differentiating FY (using the chain rule), we obtain the probability density
fY(y) = (5/9) fX((5/9)(y − 32)).
We can do this for more general changes of units, and we obtain the following useful rule.
Change-of-units transformation. Let X be a continuous random variable with distribution function FX and probability density function fX. If we change units to Y = rX + s for real numbers r > 0 and s, then
FY(y) = FX((y − s)/r)  and  fY(y) = (1/r) fX((y − s)/r).
As an example, let X be a random variable with an N(µ, σ²) distribution, and let Y = rX + s. Then this rule gives us
fY(y) = (1/r) fX((y − s)/r) = (1/(rσ√(2π))) e^{−(1/2)((y − rµ − s)/(rσ))²}  for −∞ < y < ∞.
On the right-hand side we recognize the probability density of a normal distribution with parameters rµ + s and r²σ². This illustrates the following rule.
Normal random variables under change of units. Let X be a random variable with an N(µ, σ²) distribution. For any r ≠ 0 and any s, the random variable rX + s has an N(rµ + s, r²σ²) distribution.
Note that if X has an N(µ, σ²) distribution, then with r = 1/σ and s = −µ/σ we conclude that
Z = (1/σ)X + (−µ/σ) = (X − µ)/σ
has an N(0, 1) distribution. As a consequence
FX(a) = P(X ≤ a) = P(σZ + µ ≤ a) = P(Z ≤ (a − µ)/σ) = Φ((a − µ)/σ).
So any probability for an N(µ, σ²) distributed random variable X can be expressed in terms of an N(0, 1) distributed random variable Z.
Quick exercise 8.3 Compute the probabilities P(X ≤ 5) and P(X ≥ 2) for X with an N(4, 25) distribution.
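Quick exercise 8.3 can also be checked numerically. The sketch below (a hypothetical Python illustration) evaluates the standard normal distribution function Φ via the error function and applies the standardization Z = (X − µ)/σ with µ = 4 and σ = 5.

import math

def Phi(z):
    # Standard normal distribution function, expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 4.0, 5.0                  # X has an N(4, 25) distribution
p_le_5 = Phi((5 - mu) / sigma)        # P(X <= 5) = Phi(0.2), about 0.5793
p_ge_2 = 1.0 - Phi((2 - mu) / sigma)  # P(X >= 2) = 1 - Phi(-0.4), about 0.6554
print(round(p_le_5, 4), round(p_ge_2, 4))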
8.3 Jensen's inequality
Without actually computing the distribution of g(X) we can often tell how E[g(X)] relates to g(E[X]). For the change-of-units transformation g(x) = rx + s we know that E[g(X)] = g(E[X]) (see Section 7.3). It is a common error to equate these two sides for other functions g. In fact, equality will very rarely occur for nonlinear g. For example, suppose that a company that produces microelectronic parts has a target production of 240 chips per day, but the yield has only been 40, 60, and 80 chips on three consecutive days. The average production over the three days then is 60 chips, so on average the production should have been 4 times higher to reach the target. However, one can also look at this in the following way: on the three days the production should have been 240/40 = 6, 240/60 = 4, and 240/80 = 3 times higher. On average that is (1/3)(6 + 4 + 3) = 13/3 ≈ 4.33 times higher! What happens here can be explained (take for X the part of the target production that is realized, where you give equal probabilities to the three outcomes 1/6, 1/4, and 1/3) by the fact that if X is a random variable taking positive values, then always
1/E[X] < E[1/X],
unless Var(X) = 0, which only happens if X is not random at all (cf. Exercise 7.17). This inequality is the case g(x) = 1/x on (0, ∞) of the following result that holds for general convex functions g.
Jensen's inequality. Let g be a convex function, and let X be a random variable. Then
g(E[X]) ≤ E[g(X)].
Recall from calculus that a twice differentiable function g is convex on an interval I if g''(x) ≥ 0 for all x in I, and strictly convex if g''(x) > 0 for all x in I. When X takes its values in an interval I (this can, for instance, be I = (−∞, ∞)), and g is strictly convex on I, then strict inequality holds: g(E[X]) < E[g(X)], unless X is not random.
In Figure 8.1 we illustrate the way in which this result can be obtained for the special case of a random variable X that takes two values, a and b. In the figure, X takes these two values with probability 3/4 and 1/4 respectively. Convexity of g forces any line segment connecting two points on the graph of g to lie above the part of the graph between these two points. So if we choose the line segment from (a, g(a)) to (b, g(b)), then it follows that the point
(E[X], E[g(X)]) = ((3/4)a + (1/4)b, (3/4)g(a) + (1/4)g(b)) = (3/4)(a, g(a)) + (1/4)(b, g(b))
on this line lies "above" the point (E[X], g(E[X])) on the graph of g. Hence E[g(X)] ≥ g(E[X]).
Fig. 8.1. Jensen's inequality.
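The chip-production example above is easy to reproduce. The following sketch (a hypothetical Python illustration) takes X equal to 1/6, 1/4, or 1/3 with equal probabilities and compares 1/E[X] with E[1/X].

from fractions import Fraction as F

outcomes = [F(1, 6), F(1, 4), F(1, 3)]   # realized fraction of the target
probs = [F(1, 3)] * 3

EX = sum(p * x for p, x in zip(probs, outcomes))        # E[X] = 1/4
E_inv_X = sum(p / x for p, x in zip(probs, outcomes))   # E[1/X] = 13/3

print(1 / EX, E_inv_X)   # 4 versus 13/3, so 1/E[X] < E[1/X]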
A simple example is given by g(x) = x². This function is convex (g''(x) = 2 for all x), and hence
(E[X])² ≤ E[X²].
Note that this is exactly the same as saying that Var(X) ≥ 0, which we have already seen in Section 7.4.
Quick exercise 8.4 Let X be a random variable with Var(X) > 0. Which is true: E[e^{−X}] < e^{−E[X]} or E[e^{−X}] > e^{−E[X]}?
8.4 Extremes
In many situations the maximum (or minimum) of a sequence X1, X2, . . . , Xn of random variables is the variable of interest. For instance, let X1, X2, . . . , X365 be the water level of a river during the days of a particular year for a particular location. Suppose there will be flooding if the level exceeds a certain height—usually the height of the dykes. The question whether flooding occurs during a year is completely answered by looking at the maximum of X1, X2, . . . , X365. If one wants to predict occurrence of flooding in the future, the probability distribution of this maximum is of great interest. Similar models arise, for instance, when one is interested in possible damage from a series of shocks or in the extent of a contamination plume in the subsurface.
We want to find the distribution of the random variable
Z = max{X1, X2, . . . , Xn}.
We can determine the distribution function of Z by realizing that the maximum of the Xi is smaller than a number a if and only if all Xi are smaller than a:
  • 118. 8.4 Extremes 109 FZ (a) = P(Z ≤ a) = P(max{X1, . . . , Xn} ≤ a) = P(X1 ≤ a, . . . , Xn ≤ a) . Now suppose that the events {Xi ≤ ai} are independent for every choice of the ai. In this case we call the random variables independent (see also Chapter 9, where we study independence of random variables). In particular, the events {Xi ≤ a} are independent for all a. It then follows that FZ (a) = P(X1 ≤ a, . . . , Xn ≤ a) = P(X1 ≤ a) · · · P(Xn ≤ a) . Hence, if all random variables have the same distribution function F, then the following result holds. The distribution of the maximum. Let X1, X2, . . . , Xn be n independent random variables with the same distribution function F, and let Z = max{X1, X2, . . . , Xn}. Then FZ(a) = (F(a))n . Quick exercise 8.5 Let X1, X2, . . . , Xn be independent random variables, all with a U(0, 1) distribution. Let Z = max{X1, . . . , Xn}. Compute the dis- tribution function and the probability density function of Z. What can we say about the distribution of the minimum? Let V = min{X1, X2, . . . , Xn}. We can now find the distribution function FV of V by observing that the minimum of the Xi is larger than a number a if and only if all Xi are larger than a. The trick is to switch to the complement of the event {V ≤ a}: FV (a) = P(V ≤ a) = 1 − P(V a) = 1 − P(min{X1, . . . , Xn} a) = 1 − P(X1 a, . . . , Xn a) . So using independence and switching back again, we obtain FV (a) = 1 − P(X1 a, . . . , Xn a) = 1 − P(X1 a) · · · P(Xn a) = 1 − (1 − P(X1 ≤ a)) · · · (1 − P(Xn ≤ a)). We have found the following result for the minimum. The distribution of the minimum. Let X1, X2, . . . , Xn be n independent random variables with the same distribution function F, and let V = min{X1, X2, . . . , Xn}. Then FV (a) = 1 − (1 − F(a))n . Quick exercise 8.6 Let X1, X2, . . . , Xn be independent random variables, all with a U(0, 1) distribution. Let V = min{X1, . . . , Xn}. Compute the dis- tribution function and the probability density function of V .
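Quick exercises 8.5 and 8.6 (solved in Section 8.5) can be checked empirically. The sketch below (a hypothetical Python illustration) simulates the maximum and the minimum of n independent U(0, 1) random variables and compares the empirical probabilities with a^n and 1 − (1 − a)^n.

import random

random.seed(2)
n, runs, a = 4, 100_000, 0.5

count_max, count_min = 0, 0
for _ in range(runs):
    xs = [random.random() for _ in range(n)]
    count_max += max(xs) <= a
    count_min += min(xs) <= a

print(count_max / runs, a**n)              # F_Z(a) = a^n = 0.0625
print(count_min / runs, 1 - (1 - a)**n)    # F_V(a) = 1 - (1-a)^n = 0.9375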
  • 119. 110 8 Computations with random variables 8.5 Solutions to the quick exercises 8.1 Clearly Z can take the values 1, . . . , 150. The value 150 is special: the plane is full if 150 or more people buy a ticket. Hence P(Z = 150) = P(X ≥ 150) = 51/200. For the other values we have P(Z = i) = P(X = i) = 1/200, for i = 1, . . . , 149. Clearly, here g(x) = min{150, x}. 8.2 The probability density of Y = 1/X is fY (y) = 1 y2 1 π(1 + (1 y )2) = 1 π(1 + y2) . We see that 1/X has the same distribution as X! (This distribution is called the standard Cauchy distribution, it will be introduced in Chapter 11.) 8.3 First define Z = (X −4)/5, which has an N(0, 1) distribution. Then from Table B.1 P(X ≤ 5) = P Z ≤ 5 − 4 5 = P(Z ≤ 0.20) = 1 − 0.4207 = 0.5793. Similarly, using the symmetry of the normal distribution, P(X ≥ 2) = P Z ≥ 2 − 4 5 = P(Z ≥ −0.40) = P(Z ≤ 0.40) = 0.6554. 8.4 If g(x) = e−x , then g (x) = e−x 0; hence g is strictly convex. It follows from Jensen’s inequality that e−E[X] ≤ E e−X . Moreover, if Var(X) 0, then the inequality is strict. 8.5 The distribution function of the Xi is given by F(x) = x on [0, 1]. There- fore the distribution function FZ of the maximum Z is equal to FZ(a) = (F(a))n = an . Its probability density function is fZ(z) = d dz FZ (z) = nzn−1 for 0 ≤ z ≤ 1. 8.6 The distribution function of the Xi is given by F(x) = x on [0, 1]. There- fore the distribution function FV of the minimum V is equal to FV (a) = 1 − (1 − a)n . Its probability density function is fV (v) = d dv FV (v) = n(1 − v)n−1 for 0 ≤ v ≤ 1.
  • 120. 8.6 Exercises 111 8.6 Exercises 8.1 Often one is interested in the distribution of the deviation of a random variable X from its mean µ = E[X]. Let X take the values 80, 90, 100, 110, and 120, all with probability 0.2; then E[X] = µ = 100. Determine the dis- tribution of Y = |X − µ|. That is, specify the values Y can take and give the corresponding probabilities. 8.2 Suppose X has a uniform distribution over the points {1, 2, 3, 4, 5, 6} and that g(x) = sin(π 2 x). a. Determine the distribution of Y = g(X) = sin(π 2 X), that is, specify the values Y can take and give the corresponding probabilities. b. Let Z = cos(π 2 X). Determine the distribution of Z. c. Determine the distribution of W = Y 2 + Z2 . Warning: in this example there is a very special dependency between Y and Z, and in general it is much harder to determine the distribution of a random variable that is a function of two other random variables. This is the subject of Chapter 11. 8.3 The continuous random variable U is uniformly distributed over [0, 1]. a. Determine the distribution function of V = 2U + 7. What kind of distri- bution does V have? b. Determine the distribution function of V = rU + s for all real numbers r 0 and s. See Exercise 8.9 for what happens for negative r. 8.4 Transforming exponential distributions. a. Let X have an Exp(1 2 ) distribution. Determine the distribution function of 1 2 X. What kind of distribution does 1 2 X have? b. Let X have an Exp(λ) distribution. Determine the distribution function of λX. What kind of distribution does λX have? 8.5 Let X be a continuous random variable with probability density func- tion fX(x) = 3 4 x(2 − x) for 0 ≤ x ≤ 2 0 elsewhere. a. Determine the distribution function FX. b. Let Y = √ X. Determine the distribution function FY . c. Determine the probability density of Y . 8.6 Let X be a continuous random variable with probability density fX that takes only positive values and let Y = 1/X.
  • 121. 112 8 Computations with random variables a. Determine FY (y) and show that fY (y) = 1 y2 fX 1 y for y 0. b. Let Z = 1/Y . Using a, determine the probability density fZ of Z, in terms of fX. 8.7 Let X have a Par(α) distribution. Determine the distribution function of ln X. What kind of a distribution does ln X have? 8.8 Let X have an Exp(1) distribution, and let α and λ be positive numbers. Determine the distribution function of the random variable W = X1/α λ . The distribution of the random variable W is called the Weibull distribution with parameters α and λ. 8.9 Let X be a continuous random variable. Express the distribution function and probability density of the random variable Y = −X in terms of those of X. 8.10 Let X be an N(3, 4) distributed random variable. Use the rule for normal random variables under change of units and Table B.1 to determine the probabilities P(X ≥ 3) and P(X ≤ 1). 8.11 Let X be a random variable, and let g be a twice differentiable function with g (x) ≤ 0 for all x. Such a function is called a concave function. Show that for concave functions always g(E[X]) ≥ E[g(X)] . 8.12 Let X be a random variable with the following probability mass func- tion: x 0 1 100 10 000 P(X = x) 1 4 1 4 1 4 1 4 a. Determine the distribution of Y = √ X. b. Which is larger E √ X or E[X]? Hint: use Exercise 8.11, or start by showing that the function g(x) = − √ x is convex. c. Compute E[X] and E √ X to check your answer (and to see that it makes a big difference!). 8.13 Let W have a U(π, 2π) distribution. What is larger: E[sin(W)] or sin(E[W])? Check your answer by computing these two numbers.
  • 122. 8.6 Exercises 113 8.14 In this exercise we take a look at Jensen’s inequality for the function g(x) = x3 (which is neither convex nor concave on (−∞, ∞)). a. Can you find a (discrete) random variable X with Var(X) 0 such that E X3 = (E[X])3 ? b. Under what kind of conditions on a random variable X will the inequality E X3 (E[X])3 certainly hold? 8.15 Let X1, X2, . . . , Xn be independent random variables, all with a U(0, 1) distribution. Let Z = max{X1, . . . , Xn} and V = min{X1, . . . , Xn}. a. Compute E[max{X1, X2}] and E[min{X1, X2}]. b. Compute E[Z] and E[V ] for general n. c. Can you argue directly (using the symmetry of the uniform distribu- tion (see Exercise 6.3) and not the result of the computation in b) that 1 − E[max{X1, . . . , Xn}] = E[min{X1, . . . , Xn}]? 8.16 In this exercise we derive a kind of Jensen inequality for the minimum. a. Let a and b be real numbers. Show that min{a, b} = 1 2 (a + b − |a − b|). b. Let X and Y be independent random variables with the same distribution and finite expectation. Deduce from a that E[min{X, Y }] = E[X] − 1 2 E[|X − Y |] . c. Show that E[min{X, Y }] ≤ min{E[X] , E[Y ]}. Remark: this is not so interesting, since min{E[X] , E[Y ]} = E[X] = E[Y ], but we will see in the exercises of Chapter 11 that this inequality is also true for X and Y, which do not have the same distribution. 8.17 Let X1, . . . , Xn be n independent random variables with the same dis- tribution function F. a. Convince yourself that for any numbers x1, . . . , xn it is true that min{x1, . . . , xn} = − max{−x1, . . . , −xn}. b. Let Z = max{X1, X2, . . . , Xn} and V = min{X1, X2, . . . , Xn}. Use Exer- cise 8.9 and the observation in a to deduce the formula
FV(a) = 1 − (1 − F(a))^n
directly from the formula FZ(a) = (F(a))^n.
8.18 Let X1, X2, . . . , Xn be independent random variables, all with an Exp(λ) distribution. Let V = min{X1, . . . , Xn}. Determine the distribution function of V. What kind of distribution is this?
8.19 From the "north pole" N of a circle with diameter 1, a point Q on the circle is mapped to a point t on the line by its projection from N, as illustrated in Figure 8.2.
Fig. 8.2. Mapping the circle to the line.
Suppose that the point Q is uniformly chosen on the circle. This is the same as saying that the angle ϕ is uniformly chosen from the interval [−π/2, π/2] (can you see this?). Let X be this angle, so that X is uniformly distributed over the interval [−π/2, π/2]. This means that P(X ≤ ϕ) = 1/2 + ϕ/π (cf. Quick exercise 5.3). What will be the distribution of the projection of Q on the line? Let us call this random variable Z. Then it is clear that the event {Z ≤ t} is equal to the event {X ≤ ϕ}, where t and ϕ correspond to each other under the projection. This means that tan(ϕ) = t, which is the same as saying that arctan(t) = ϕ.
a. What part of the circle is mapped to the interval [1, ∞)?
b. Compute the distribution function of Z using the correspondence between t and ϕ.
c. Compute the probability density function of Z.
The distribution of Z is called the Cauchy distribution (which will be discussed in Chapter 11).
  • 124. 9 Joint distributions and independence Random variables related to the same experiment often influence one another. In order to capture this, we introduce the joint distribution of two or more random variables. We also discuss the notion of independence for random variables, which models the situation where random variables do not influence each other. As with single random variables we treat these topics for discrete and continuous random variables separately. 9.1 Joint distributions of discrete random variables In a census one is usually interested in several variables, such as income, age, and gender. In itself these variables are interesting, but when two (or more) are studied simultaneously, detailed information is obtained on the society where the census is performed. For instance, studying income, age, and gender jointly might give insight to the emancipation of women. Without mentioning it explicitly, we already encountered several examples of joint distributions of discrete random variables. For example, in Chapter 4 we defined two random variables S and M, the sum and the maximum of two independent throws of a die. Quick exercise 9.1 List the elements of the event {S = 7, M = 4} and compute its probability. In general, the joint distribution of two discrete random variables X and Y , defined on the same sample space Ω, is given by prescribing the probabilities of all possible values of the pair (X, Y ).
Definition. The joint probability mass function p of two discrete random variables X and Y is the function p : R² → [0, 1], defined by
p(a, b) = P(X = a, Y = b)  for −∞ < a, b < ∞.
To stress the dependence on (X, Y), we sometimes write pX,Y instead of p.
If X and Y take on the values a1, a2, . . . , ak and b1, b2, . . . , bℓ, respectively, the joint distribution of X and Y can simply be described by listing all the possible values of p(ai, bj). For example, for the random variables S and M from Chapter 4 we obtain Table 9.1.

Table 9.1. Joint probability mass function p(a, b) = P(S = a, M = b).

         b
  a      1      2      3      4      5      6
  2    1/36     0      0      0      0      0
  3      0    2/36     0      0      0      0
  4      0    1/36   2/36     0      0      0
  5      0      0    2/36   2/36     0      0
  6      0      0    1/36   2/36   2/36     0
  7      0      0      0    2/36   2/36   2/36
  8      0      0      0    1/36   2/36   2/36
  9      0      0      0      0    2/36   2/36
 10      0      0      0      0    1/36   2/36
 11      0      0      0      0      0    2/36
 12      0      0      0      0      0    1/36

From this table we can retrieve the distribution of S and of M. For example, because
{S = 6} = {S = 6, M = 1} ∪ {S = 6, M = 2} ∪ · · · ∪ {S = 6, M = 6},
and because the six events {S = 6, M = 1}, {S = 6, M = 2}, . . . , {S = 6, M = 6} are mutually exclusive, we find that
pS(6) = P(S = 6) = P(S = 6, M = 1) + · · · + P(S = 6, M = 6)
      = p(6, 1) + p(6, 2) + · · · + p(6, 6)
      = 0 + 0 + 1/36 + 2/36 + 2/36 + 0 = 5/36.
Table 9.2. Joint distribution and marginal distributions of S and M.

         b
  a      1      2      3      4      5      6     pS(a)
  2    1/36     0      0      0      0      0      1/36
  3      0    2/36     0      0      0      0      2/36
  4      0    1/36   2/36     0      0      0      3/36
  5      0      0    2/36   2/36     0      0      4/36
  6      0      0    1/36   2/36   2/36     0      5/36
  7      0      0      0    2/36   2/36   2/36     6/36
  8      0      0      0    1/36   2/36   2/36     5/36
  9      0      0      0      0    2/36   2/36     4/36
 10      0      0      0      0    1/36   2/36     3/36
 11      0      0      0      0      0    2/36     2/36
 12      0      0      0      0      0    1/36     1/36
 pM(b) 1/36   3/36   5/36   7/36   9/36  11/36       1

Thus we see that the probabilities of S can be obtained by taking the sum of the joint probabilities in the rows of Table 9.1. This yields the probability distribution of S, i.e., all values of pS(a) for a = 2, . . . , 12. We speak of the marginal distribution of S. In Table 9.2 we have added this distribution in the right "margin" of the table. Similarly, summing over the columns of Table 9.1 yields the marginal distribution of M, in the bottom margin of Table 9.2.
The joint distribution of two random variables contains a lot more information than the two marginal distributions. This can be illustrated by the fact that in many cases the joint probability mass function of X and Y cannot be retrieved from the marginal probability mass functions pX and pY. A simple example is given in the following quick exercise.
Quick exercise 9.2 Let X and Y be two discrete random variables, with joint probability mass function p, given by the following table, where ε is an arbitrary number between −1/4 and 1/4.

         b
  a        0          1       pX(a)
  0    1/4 − ε    1/4 + ε      . . .
  1    1/4 + ε    1/4 − ε      . . .
 pY(b)   . . .      . . .      . . .

Complete the table, and conclude that we cannot retrieve p from pX and pY.
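Tables 9.1 and 9.2 can also be produced by brute-force enumeration of the 36 equally likely outcomes of two dice. The sketch below (a hypothetical Python illustration) builds the joint probability mass function of (S, M) and recovers the marginal distributions by summing over rows and columns.

from fractions import Fraction as F
from collections import defaultdict

p = defaultdict(F)                      # joint pmf of (S, M)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        s, m = d1 + d2, max(d1, d2)
        p[(s, m)] += F(1, 36)

# Marginal distributions, as in the margins of Table 9.2.
pS = defaultdict(F)
pM = defaultdict(F)
for (s, m), prob in p.items():
    pS[s] += prob
    pM[m] += prob

print(p[(6, 3)])   # 1/36
print(pS[6])       # 5/36
print(pM[6])       # 11/36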
  • 127. 118 9 Joint distributions and independence The joint distribution function As in the case of a single random variable, the distribution function enables us to treat pairs of discrete and pairs of continuous random variables in the same way. Definition. The joint distribution function F of two random vari- ables X and Y is the function F : R2 → [0, 1] defined by F(a, b) = P(X ≤ a, Y ≤ b) for − ∞ a, b ∞. Quick exercise 9.3 Compute F(5, 3) for the joint distribution function F of the pair (S, M). The distribution functions FX and FY can be obtained from the joint distri- bution function of X and Y . As before, we speak of the marginal distribution functions. The following rule holds. From joint to marginal distribution function. Let F be the joint distribution function of random variables X and Y . Then the marginal distribution function of X is given for each a by FX(a) = P(X ≤ a) = F(a, +∞) = lim b→∞ F(a, b), (9.1) and the marginal distribution function of Y is given for each b by FY (b) = P(Y ≤ b) = F(+∞, b) = lim a→∞ F(a, b). (9.2) 9.2 Joint distributions of continuous random variables We saw in Chapter 5 that the probability that a single continuous random variable X lies in an interval [a, b], is equal to the area under the probability density function f of X over the interval (see also Figure 5.1). For the joint distribution of continuous random variables X and Y the situation is analo- gous: the probability that the pair (X, Y ) falls in the rectangle [a1, b1]×[a2, b2] is equal to the volume under the joint probability density function f(x, y) of (X, Y ) over the rectangle. This is illustrated in Figure 9.1, where a chunk of a joint probability density function f(x, y) is displayed for x between −0.5 and 1 and for y between −1.5 and 1. Its volume represents the probability P(−0.5 ≤ X ≤ 1, −1.5 ≤ Y ≤ 1). As the volume under f on [−0.5, 1]×[−1.5, 1] is equal to the integral of f over this rectangle, this motivates the following definition.
Fig. 9.1. Volume under a joint probability density function f on the rectangle [−0.5, 1] × [−1.5, 1].
Definition. Random variables X and Y have a joint continuous distribution if for some function f : R² → R and for all numbers a1, a2 and b1, b2 with a1 ≤ b1 and a2 ≤ b2,
P(a1 ≤ X ≤ b1, a2 ≤ Y ≤ b2) = ∫_{a1}^{b1} ∫_{a2}^{b2} f(x, y) dx dy.
The function f has to satisfy f(x, y) ≥ 0 for all x and y, and
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.
We call f the joint probability density function of X and Y.
As in the one-dimensional case there is a simple relation between the joint distribution function F and the joint probability density function f:
F(a, b) = ∫_{−∞}^{a} ∫_{−∞}^{b} f(x, y) dx dy  and  f(x, y) = (∂²/∂x∂y) F(x, y).
A joint probability density function of two random variables is also called a bivariate probability density. An explicit example of such a density is the function
f(x, y) = (30/π) e^{−50x² − 50y² + 80xy}  for −∞ < x < ∞ and −∞ < y < ∞;
see Figure 9.2. This is an example of a bivariate normal density (see Remark 11.2 for a full description of bivariate normal distributions).
Fig. 9.2. A bivariate normal probability density function.
We illustrate a number of properties of joint continuous distributions by means of the following simple example. Suppose that X and Y have joint probability
density function
f(x, y) = (2/75)(2x²y + xy²)  for 0 ≤ x ≤ 3 and 1 ≤ y ≤ 2,
and f(x, y) = 0 otherwise; see Figure 9.3.
Fig. 9.3. The probability density function f(x, y) = (2/75)(2x²y + xy²).
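Computations with this density, such as the rectangle probability worked out in the next paragraphs, can be double-checked numerically. The following sketch (a hypothetical Python illustration) approximates double integrals of f with a midpoint Riemann sum and verifies that f integrates to 1.

def f(x, y):
    # Joint density from the example: (2/75)(2x^2*y + x*y^2) on [0,3] x [1,2].
    if 0 <= x <= 3 and 1 <= y <= 2:
        return (2 / 75) * (2 * x * x * y + x * y * y)
    return 0.0

def integrate(x_lo, x_hi, y_lo, y_hi, steps=400):
    # Midpoint Riemann sum over the rectangle [x_lo, x_hi] x [y_lo, y_hi].
    hx = (x_hi - x_lo) / steps
    hy = (y_hi - y_lo) / steps
    total = 0.0
    for i in range(steps):
        x = x_lo + (i + 0.5) * hx
        for j in range(steps):
            y = y_lo + (j + 0.5) * hy
            total += f(x, y) * hx * hy
    return total

print(integrate(0, 3, 1, 2))          # close to 1
print(integrate(1, 2, 4/3, 5/3))      # close to 187/2025, about 0.0923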
  • 130. 9.2 Joint distributions of continuous random variables 121 As an illustration of how to compute joint probabilities: P 1 ≤ X ≤ 2, 4 3 ≤ Y ≤ 5 3 = 2 1 5 3 4 3 f(x, y) dx dy = 2 75 2 1 5 3 4 3 (2x2 y + xy2 ) dy dx = 2 75 2 1 x2 + 61 81 x dx = 187 2025 . Next, for a between 0 and 3 and b between 1 and 2, we determine the ex- pression of the joint distribution function. Since f(x, y) = 0 for x 0 or y 1, F(a, b) = P(X ≤ a, Y ≤ b) = a −∞ b −∞ f(x, y) dy dx = 2 75 a 0 b 1 (2x2 y + xy2 ) dy dx = 1 225 2a3 b2 − 2a3 + a2 b3 − a2 . Note that for either a outside [0, 3] or b outside [1, 2], the expression for F(a, b) is different. For example, suppose that a is between 0 and 3 and b is larger than 2. Since f(x, y) = 0 for y 2, we find for any b ≥ 2: F(a, b) = P(X ≤ a, Y ≤ b) = P(X ≤ a, Y ≤ 2) = F(a, 2) = 1 225 6a3 + 7a2 . Hence, applying (9.1) one finds the marginal distribution function of X: FX (a) = lim b→∞ F(a, b) = 1 225 6a3 + 7a2 for a between 0 and 3. Quick exercise 9.4 Show that FY (b) = 1 75 3b3 + 18b2 − 21 for b between 1 and 2. The probability density of X can be found by differentiating FX: fX(x) = d dx FX (x) = d dx 1 225 6x3 + 7x2 = 2 225 9x2 + 7x for x between 0 and 3. It is also possible to obtain the probability density function of X directly from f(x, y). Recall that we determined marginal prob- abilities of discrete random variables by summing over the joint probabilities (see Table 9.2). In a similar way we can find fX. For x between 0 and 3,
  • 131. 122 9 Joint distributions and independence fX(x) = ∞ −∞ f(x, y) dy = 2 75 2 1 2x2 y + xy2 dy = 2 225 9x2 + 7x . This illustrates the following rule. From joint to marginal probability density function. Let f be the joint probability density function of random variables X and Y . Then the marginal probability densities of X and Y can be found as follows: fX(x) = ∞ −∞ f(x, y) dy and fY (y) = ∞ −∞ f(x, y) dx. Hence the probability density function of each of the random variables X and Y can easily be obtained by “integrating out” the other variable. Quick exercise 9.5 Determine fY (y). 9.3 More than two random variables To determine the joint distribution of n random variables X1, X2, . . . , Xn, all defined on the same sample space Ω, we have to describe how the probability mass is distributed over all possible values of (X1, X2, . . . , Xn). In fact, it suffices to specify the joint distribution function F of X1, X2, . . . , Xn, which is defined by F(a1, a2, . . . , an) = P(X1 ≤ a1, X2 ≤ a2, . . . , Xn ≤ an) for −∞ a1, a2, . . . , an ∞. In case the random variables X1, X2, . . . , Xn are discrete, the joint distribution can also be characterized by specifying the joint probability mass function p of X1, X2, . . . , Xn, defined by p(a1, a2, . . . , an) = P(X1 = a1, X2 = a2, . . . , Xn = an) for −∞ a1, a2, . . . , an ∞. Drawing without replacement Let us illustrate the use of the joint probability mass function with an example. In the weekly Dutch National Lottery Show, 6 balls are drawn from a vase that contains balls numbered from 1 to 41. Clearly, the first number takes values 1, 2, . . ., 41 with equal probabilities. Is this also the case for—say—the third ball?
  • 132. 9.3 More than two random variables 123 Let us consider a more general situation. Suppose a vase contains balls num- bered 1, 2, . . . , N. We draw n balls without replacement from the vase. Note that n cannot be larger than N. Each ball is selected with equal probability, i.e., in the first draw each ball has probability 1/N, in the second draw each of the N −1 remaining balls has probability 1/(N −1), and so on. Let Xi denote the number on the ball in the i-th draw, for i = 1, 2, . . . , n. In order to obtain the marginal probability mass function of Xi, we first compute the joint proba- bility mass function of X1, X2, . . . , Xn. Since there are N(N −1) · · · (N −n+1) possible combinations for the values of X1, X2, . . . , Xn, each having the same probability, the joint probability mass function is given by p(a1, a2, . . . , an) = P(X1 = a1, X2 = a2, . . . , Xn = an) = 1 N(N − 1) · · · (N − n + 1) , for all distinct values a1, a2, . . . , an with 1 ≤ aj ≤ N. Clearly X1, X2, . . . , Xn influence each other. Nevertheless, the marginal distribution of each Xi is the same. This can be seen as follows. Similar to obtaining the marginal probability mass functions in Table 9.2, we can find the marginal probability mass function of Xi by summing the joint probability mass function over all possible values of X1, . . . , Xi−1, Xi+1, . . . , Xn: pXi (k) = p(a1, . . . , ai−1, k, ai+1, . . . , an) = 1 N(N − 1) · · · (N − n + 1) , where the sum runs over all distinct values a1, a2, . . . , an with 1 ≤ aj ≤ N and ai = k. Since there are (N − 1)(N − 2) · · · (N − n + 1) such combinations, we conclude that the marginal probability mass function of Xi is given by pXi (k) = (N − 1)(N − 2) · · · (N − n + 1) · 1 N(N − 1) · · · (N − n + 1) = 1 N , for k = 1, 2, . . ., N. We see that the marginal probability mass function of each Xi is the same, assigning equal probability 1/N to each possible value. In case the random variables X1, X2, . . . , Xn are continuous, the joint dis- tribution is defined in a similar way as in the case of two variables. We say that the random variables X1, X2, . . . , Xn have a joint continuous distribu- tion if for some function f : Rn → R and for all numbers a1, a2, . . . , an and b1, b2, . . . , bn with ai ≤ bi, P(a1 ≤ X1 ≤ b1, a2 ≤ X2 ≤ b2, . . . , an ≤ Xn ≤ bn) = b1 a1 b2 a2 · · · bn an f(x1, x2, . . . , xn) dx1 dx2 · · · dxn. Again f has to satisfy f(x1, x2, . . . , xn) ≥ 0 and f has to integrate to 1. We call f the joint probability density of X1, X2, . . . , Xn.
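The conclusion that each Xi has the same marginal distribution, even though the draws influence one another, is easy to check by simulation. The sketch below (a hypothetical Python illustration, using the lottery numbers N = 41 and n = 6) estimates the distribution of the ball drawn third.

import random
from collections import Counter

random.seed(3)
N, n, runs = 41, 6, 200_000

third = Counter()
for _ in range(runs):
    draws = random.sample(range(1, N + 1), n)   # n draws without replacement
    third[draws[2]] += 1                        # number on the third ball

# Each value 1, ..., N should appear with relative frequency close to 1/N.
print(max(third.values()) / runs, min(third.values()) / runs, 1 / N)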
  • 133. 124 9 Joint distributions and independence 9.4 Independent random variables In earlier chapters we have spoken of independence of random variables, an- ticipating a formal definition. On page 46 we postulated that the events {R1 = a1}, {R2 = a2}, . . . , {R10 = a10} related to the Bernoulli random variables R1, . . . , R10 are independent. How should one define independence of random variables? Intuitively, random vari- ables X and Y are independent if every event involving only X is indepen- dent of every event involving only Y . Since for two discrete random variables X and Y , any event involving X and Y is the union of events of the type {X = a, Y = b}, an adequate definition for independence would be P(X = a, Y = b) = P(X = a) P(Y = b) , (9.3) for all possible values a and b. However, this definition is useless for continuous random variables. Both the discrete and the continuous case are covered by the following definition. Definition. The random variables X and Y , with joint distribution function F, are independent if P(X ≤ a, Y ≤ b) = P(X ≤ a) P(Y ≤ b) , that is, F(a, b) = FX(a)FY (b) (9.4) for all possible values a and b. Random variables that are not inde- pendent are called dependent. Note that independence of X and Y guarantees that the joint probability of {X ≤ a, Y ≤ b} factorizes. More generally, the following is true: if X and Y are independent, then P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B) , (9.5) for all suitable A and B, such as intervals and points. As a special case we can take A = {a}, B = {b}, which yields that for independent X and Y the probability of {X = a, Y = b} equals the product of the marginal probabilities. In fact, for discrete random variables the definition of independence can be reduced—after cumbersome computations—to equality (9.3). For continuous random variables X and Y we find, differentiating both sides of (9.4) with respect to x and y, that f(x, y) = fX(x)fY (y).
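For discrete random variables the factorization criterion (9.3) can be tested directly. The sketch below (a hypothetical Python illustration) rebuilds the joint probability mass function of (S, M) from Table 9.1 and checks whether it equals the product of the marginals; S and M turn out to be dependent.

from fractions import Fraction as F
from collections import defaultdict

# Joint pmf of (S, M) for two fair dice, as in Table 9.1.
p = defaultdict(F)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        p[(d1 + d2, max(d1, d2))] += F(1, 36)

pS = defaultdict(F)
pM = defaultdict(F)
for (s, m), prob in p.items():
    pS[s] += prob
    pM[m] += prob

# Independence would require p(a, b) = pS(a) * pM(b) for all a and b.
independent = all(p[(a, b)] == pS[a] * pM[b]
                  for a in range(2, 13) for b in range(1, 7))
print(independent)                      # False: S and M are dependent
print(p[(2, 1)], pS[2] * pM[1])         # 1/36 versus 1/1296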
  • 134. 9.5 Propagation of independence 125 Quick exercise 9.6 Determine for which value of ε the discrete random variables X and Y from Quick exercise 9.2 are independent. More generally, random variables X1, X2, . . . , Xn, with joint distribution func- tion F, are independent if for all values a1, . . . , an, F(a1, a2, . . . , an) = FX1 (a1)FX2 (a2) · · · FXn (an). As in the case of two discrete random variables, the discrete random variables X1, X2, . . . , Xn are independent if P(X1 = a1, . . . , Xn = an) = P(X1 = a1) · · · P(Xn = an) , for all possible values a1, . . . , an. Thus we see that the definition of inde- pendence for discrete random variables is in agreement with our intuitive interpretation given earlier in (9.3). In case of independent continuous random variables X1, X2, . . . , Xn with joint probability density function f, differentiating the joint distribution function with respect to all the variables gives that f(x1, x2, . . . , xn) = fX1 (x1)fX2 (x2) · · · fXn (xn) (9.6) for all values x1, . . . , xn. By integrating both sides over (−∞, a1]×(−∞, a2]× · · ·×(−∞, an], we find the definition of independence. Hence in the continuous case, (9.6) is equivalent to the definition of independence. 9.5 Propagation of independence A natural question is whether transformed independent random variables are again independent. We start with a simple example. Let X and Y be two independent random variables with joint distribution function F. Take an interval I = (a, b] and define random variables U and V as follows: U = 1 if X ∈ I 0 if X / ∈ I, and V = 1 if Y ∈ I 0 if Y / ∈ I. Are U and V independent? Yes, they are! By using (9.5) and the independence of X and Y , we can write P(U = 0, V = 1) = P(X ∈ Ic , Y ∈ I) = P(X ∈ Ic ) P(Y ∈ I) = P(U = 0) P(V = 1) . By a similar reasoning one finds that for all values a and b,
  • 135. 126 9 Joint distributions and independence P(U = a, V = b) = P(U = a) P(V = b) . This illustrates the fact that for independent random variables X1, X2, . . . , Xn, the random variables Y1, Y2, . . . , Yn, where each Yi is determined by Xi only, inherit the independence from the Xi. The general rule is given here. Propagation of independence. Let X1, X2, . . . , Xn be indepen- dent random variables. For each i, let hi : R → R be a function and define the random variable Yi = hi(Xi). Then Y1, Y2, . . . , Yn are also independent. Often one uses this rule with all functions the same: hi = h. For instance, in the preceding example, h(x) = 1 if x ∈ I 0 if x / ∈ I. The rule is also useful when we need different transformations for different Xi. We already saw an example of this in Chapter 6. In the single-server queue example in Section 6.4, the Exp(0.5) random variables T1, T2, . . . and U(2, 5) random variables S1, S2, . . . are required to be independent. They are generated according to the technique described in Section 6.2. With a se- quence U1, U2, . . . of independent U(0, 1) random variables we can accomplish independence of the Ti and Si as follows: Ti = Finv (U2i−1) and Si = Ginv (U2i), where F and G are the distribution functions of the Exp(0.5) distribution and the U(2, 5) distribution. The propagation-of-independence rule now guaran- tees that all random variables T1, S1, T2, S2, . . . are independent. 9.6 Solutions to the quick exercises 9.1 The only possibilities with the sum equal to 7 and the maximum equal to 4 are the combinations (3, 4) and (4, 3). They both have probability 1/36, so that P(S = 7, M = 4) = 2/36. 9.2 Since pX(0), pX(1), pY (0), and pY (1) are all equal to 1/2, knowing only pX and pY yields no information on ε whatsoever. You have to be a student at Hogwarts to be able to get the values of p right! 9.3 Since S and M are discrete random variables, F(5, 3) is the sum of the probabilities P(S = a, M = b) of all combinations (a, b) with a ≤ 5 and b ≤ 3. From Table 9.2 we see that this sum is 8/36.
  • 136. 9.7 Exercises 127 9.4 For a between 0 and 3 and for b between 1 and 2, we have seen that F(a, b) = 1 225 2a3 b2 − 2a3 + a2 b3 − a2 . Since f(x, y) = 0 for x 3, we find for any a ≥ 3 and b between 1 and 2: F(a, b) = P(X ≤ a, Y ≤ b) = P(X ≤ 3, Y ≤ b) = F(3, b) = 1 75 3b3 + 18b2 − 21 . As a result, applying (9.2) yields that FY (b) = lima→∞ F(a, b) = F(3, b) = 1 75 3b3 + 18b2 − 21 , for b between 1 and 2. 9.5 For y between 1 and 2, we have seen that FY (y) = 1 75 3y3 + 18y2 − 21 . Differentiating with respect to y yields that fY (y) = d dy FY (y) = 1 25 (3y2 + 12y), for y between 1 and 2 (and fY (y) = 0 otherwise). The probability density function of Y can also be obtained directly from f(x, y). For y between 1 and 2: fY (y) = ∞ −∞ f(x, y) dx = 2 75 3 0 (2x2 y + xy2 ) dx = 2 75 2 3 x3 y + 1 2 x2 y2 x=3 x=0 = 1 25 (3y2 + 12y). Since f(x, y) = 0 for values of y not between 1 and 2, we have that fY (y) = ∞ −∞ f(x, y) dx = 0 for these y’s. 9.6 The number ε is between −1/4 and 1/4. Now X and Y are independent in case p(i, j) = P(X = i, Y = j) = P(X = i) P(Y = j) = pX(i)pY (j), for all i, j = 0, 1. If i = j = 0, we should have 1 4 − ε = p(0, 0) = pX(0) pY (0) = 1 4 . This implies that ε = 0. Furthermore, for all other combinations (i, j) one can check that for ε = 0 also p(i, j) = pX(i) pY (j), so that X and Y are independent. If ε = 0, we have p(0, 0) = pX(0) pY (0), so that X and Y are dependent. 9.7 Exercises 9.1 The joint probabilities P(X = a, Y = b) of discrete random variables X and Y are given in the following table (which is based on the magical square in Albrecht Dürer’s engraving Melencolia I in Figure 9.4). Determine the marginal probability distributions of X and Y , i.e., determine the probabilities P(X = a) and P(Y = b) for a, b = 1, 2, 3, 4.
  • 137. 128 9 Joint distributions and independence Fig. 9.4. Albrecht Dürer’s Melencolia I. Albrecht Dürer (German, 1471-1528) Melencolia I, 1514. Engraving. Bequest of William P. Chapman, Jr., Class of 1895. Courtesy of the Herbert F. Johnson Museum of Art, Cornell University. a b 1 2 3 4 1 16/136 3/136 2/136 13/136 2 5/136 10/136 11/136 8/136 3 9/136 6/136 7/136 12/136 4 4/136 15/136 14/136 1/136
  • 138. 9.7 Exercises 129 9.2 The joint probability distribution of two discrete random variables X and Y is partly given in the following table. a b 0 1 2 P(Y = b) −1 . . . . . . . . . 1/2 1 . . . 1/2 . . . 1/2 P(X = a) 1/6 2/3 1/6 1 a. Complete the table. b. Are X and Y dependent or independent? 9.3 Let X and Y be two random variables, with joint distribution the Melen- colia distribution, given by the table in Exercise 9.1. What is a. P(X = Y )? b. P(X + Y = 5)? c. P(1 X ≤ 3, 1 Y ≤ 3)? d. P((X, Y ) ∈ {1, 4} × {1, 4})? 9.4 This exercise will be easy for those familiar with Japanese puzzles called nonograms. The marginal probability distributions of the discrete random variables X and Y are given in the following table: a b 1 2 3 4 5 P(Y = b) 1 5/14 2 4/14 3 2/14 4 2/14 5 1/14 P(X = a) 1/14 5/14 4/14 2/14 2/14 1 Moreover, for a and b from 1 to 5 the joint probability P(X = a, Y = b) is either 0 or 1/14. Determine the joint probability distribution of X and Y . 9.5 Let η be an unknown real number, and let the joint probabilities P(X = a, Y = b) of the discrete random variables X and Y be given by the following table:
  • 139. 130 9 Joint distributions and independence a b −1 0 1 4 η − 1 16 1 4 − η 0 5 1 8 3 16 1 8 6 η + 1 16 1 16 1 4 − η a. Which are the values η can attain? b. Is there a value of η for which X and Y are independent? 9.6 Let X and Y be two independent Ber(1 2 ) random variables. Define random variables U and V by: U = X + Y and V = |X − Y |. a. Determine the joint and marginal probability distributions of U and V . b. Find out whether U and V are dependent or independent. 9.7 To investigate the relation between hair color and eye color, the hair color and eye color of 5383 persons was recorded. The data are given in the following table: Hair color Eye color Fair/red Medium Dark/black Light 1168 825 305 Dark 573 1312 1200 Source: B. Everitt and G. Dunn. Applied multivariate data analysis. Second edition Hodder Arnold, 2001; Table 4.12. Reproduced by permission of Hodder Stoughton. Eye color is encoded by the values 1 (Light) and 2 (Dark), and hair color by 1 (Fair/red), 2 (Medium), and 3 (Dark/black). By dividing the numbers in the table by 5383, the table is turned into a joint probability distribution for random variables X (hair color) taking values 1 to 3 and Y (eye color) taking values 1 and 2. a. Determine the joint and marginal probability distributions of X and Y . b. Find out whether X and Y are dependent or independent. 9.8 Let X and Y be independent random variables with probability distri- butions given by P(X = 0) = P(X = 1) = 1 2 and P(Y = 0) = P(Y = 2) = 1 2 .
  • 140. 9.7 Exercises 131 a. Compute the distribution of Z = X + Y . b. Let Ỹ and Z̃ be independent random variables, where Ỹ has the same distribution as Y , and Z̃ the same distribution as Z. Compute the distri- bution of X̃ = Z̃ − Ỹ . 9.9 Suppose that the joint distribution function of X and Y is given by F(x, y) = 1 − e−2x − e−y + e−(2x+y) if x 0, y 0, and F(x, y) = 0 otherwise. a. Determine the marginal distribution functions of X and Y . b. Determine the joint probability density function of X and Y . c. Determine the marginal probability density functions of X and Y . d. Find out whether X and Y are independent. 9.10 Let X and Y be two continuous random variables with joint proba- bility density function f(x, y) = 12 5 xy(1 + y) for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise. a. Find the probability P 1 4 ≤ X ≤ 1 2 , 1 3 ≤ Y ≤ 2 3 . b. Determine the joint distribution function of X and Y for a and b between 0 and 1. c. Use your answer from b to find FX(a) for a between 0 and 1. d. Apply the rule on page 122 to find the probability density function of X from the joint probability density function f(x, y). Use the result to verify your answer from c. e. Find out whether X and Y are independent. 9.11 Let X and Y be two continuous random variables, with the same joint probability density function as in Exercise 9.10. Find the probability P(X Y ) that X is smaller than Y . 9.12 The joint probability density function f of the pair (X, Y ) is given by f(x, y) = K(3x2 + 8xy) for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 2, and f(x, y) = 0 for all other values of x and y. Here K is some positive constant. a. Find K. b. Determine the probability P(2X ≤ Y ).
  • 141. 132 9 Joint distributions and independence 9.13 On a disc with origin (0, 0) and radius 1, a point (X, Y ) is selected by throwing a dart that hits the disc in an arbitrary place. This is best described by the joint probability density function f of X and Y , given by f(x, y) = c if x2 + y2 ≤ 1 0 otherwise, where c is some positive constant. a. Determine c. b. Let R = √ X2 + Y 2 be the distance from (X, Y ) to the origin. Determine the distribution function FR. c. Determine the marginal density function fX. Without doing any calcula- tions, what can you say about fY ? 9.14 An arbitrary point (X, Y ) is drawn from the square [−1, 1] × [−1, 1]. This means that for any region G in the plane, the probability that (X, Y ) is in G, is given by the area of G ∩ divided by the area of , where denotes the square [−1, 1] × [−1, 1]: P((X, Y ) ∈ G) = area of G ∩ area of . a. Determine the joint probability density function of the pair (X, Y ). b. Check that X and Y are two independent, U(−1, 1) distributed random variables. 9.15 Let the pair (X, Y ) be drawn arbitrarily from the triangle ∆ with vertices (0, 0), (0, 1), and (1, 1). a. Use Figure 9.5 to show that the joint distribution function F of the pair (X, Y ) satisfies F(a, b) = ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ 0 for a or b less than 0 a(2b − a) for (a, b) in the triangle ∆ b2 for b between 0 and 1 and a larger than b 2a − a2 for a between 0 and 1 and b larger than 1 1 for a and b larger than 1. b. Determine the joint probability density function f of the pair (X, Y ). c. Show that fX(x) = 2 − 2x for x between 0 and 1 and that fY (y) = 2y for y between 0 and 1. 9.16 (Continuation of Exercise 9.15) An arbitrary point (U, V ) is drawn from the unit square [0, 1]×[0, 1]. Let X and Y be defined as in Exercise 9.15. Show that min{U, V } has the same distribution as X and that max{U, V } has the same distribution as Y .
Fig. 9.5. Drawing (X, Y) from (−∞, a] × (−∞, b] ∩ ∆.
9.17 Let U1 and U2 be two independent random variables, both uniformly distributed over [0, a]. Let V = min{U1, U2} and Z = max{U1, U2}. Show that the joint distribution function of V and Z is given by
F(s, t) = P(V ≤ s, Z ≤ t) = (t² − (t − s)²)/a²  for 0 ≤ s ≤ t ≤ a.
Hint: note that V ≤ s and Z ≤ t happens exactly when both U1 ≤ t and U2 ≤ t, but not both s < U1 ≤ t and s < U2 ≤ t.
9.18 Suppose a vase contains balls numbered 1, 2, . . . , N. We draw n balls without replacement from the vase. Each ball is selected with equal probability, i.e., in the first draw each ball has probability 1/N, in the second draw each of the N − 1 remaining balls has probability 1/(N − 1), and so on. For i = 1, 2, . . . , n, let Xi denote the number on the ball in the ith draw. We have shown that the marginal probability mass function of Xi is given by
pXi(k) = 1/N,  for k = 1, 2, . . . , N.
a. Show that E[Xi] = (N + 1)/2.
b. Compute the variance of Xi. You may use the identity
1 + 4 + 9 + · · · + N² = (1/6)N(N + 1)(2N + 1).
9.19 Let X and Y be two continuous random variables, with joint probability density function
f(x, y) = (30/π) e^{−50x² − 50y² + 80xy}  for −∞ < x < ∞ and −∞ < y < ∞;
see also Figure 9.2.
  • 143. 134 9 Joint distributions and independence a. Determine positive numbers a, b, and c such that 50x2 − 80xy + 50y2 = (ay − bx)2 + cx2 . b. Setting µ = 4 5 x, and σ = 1 10 , show that ( √ 50y − √ 32x)2 = 1 2 y − µ σ 2 and use this to show that ∞ −∞ e−( √ 50y− √ 32x)2 dy = √ 2π 10 . c. Use the results from b to determine the probability density function fX of X. What kind of distribution does X have? 9.20 Suppose we throw a needle on a large sheet of paper, on which horizontal lines are drawn, which are at needle-length apart (see also Exercise 21.16). Choose one of the horizontal lines as x-axis, and let (X, Y ) be the center of the needle. Furthermore, let Z be the distance of this center (X, Y ) to the nearest horizontal line under (X, Y ), and let H be the angle between the needle and the positive x-axis. a. Assuming that the length of the needle is equal to 1, argue that Z has a U(0, 1) distribution. Also argue that H has a U(0, π) distribution and that Z and H are independent. b. Show that the needle hits a horizontal line when Z ≤ 1 2 sin H or 1 − Z ≤ 1 2 sin H. c. Show that the probability that the needle will hit one of the horizontal lines equals 2/π.
  • 144. 10 Covariance and correlation In this chapter we see how the joint distribution of two or more random vari- ables is used to compute the expectation of a combination of these random variables. We discuss the expectation and variance of a sum of random vari- ables and introduce the notions of covariance and correlation, which express to some extent the way two random variables influence each other. 10.1 Expectation and joint distributions China vases of various shapes are produced in the Delftware factories in the old city of Delft. One particular simple cylindrical model has height H and radius R centimeters. Due to all kinds of circumstances—the place of the vase in the oven, the fact that the vases are handmade, etc.—H and R are not constants but are random variables. The volume of a vase is equal to the random variable V = πHR2 , and one is interested in its expected value E[V ]. When fV denotes the probability density of V , then by definition E[V ] = ∞ −∞ vfV (v) dv. However, to obtain E[V ], we do not necessarily need to determine fV from the joint probability density f of H and R! Since V is a function of H and R, we can use a rule similar to the change-of-variable formula from Chapter 7: E[V ] = E πHR2 = ∞ −∞ ∞ −∞ πhr2 f(h, r) dh dr. Suppose that H has a U(25, 35) distribution and that R has a U(7.5, 12.5) distribution. In the case that H and R are also independent, we have
E[V] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} πhr² fH(h)fR(r) dh dr = ∫_{25}^{35} ∫_{7.5}^{12.5} πhr² · (1/10) · (1/5) dh dr
     = (π/50) ∫_{25}^{35} h dh ∫_{7.5}^{12.5} r² dr = 9621.127 cm³.
This illustrates the following general rule.
Two-dimensional change-of-variable formula. Let X and Y be random variables, and let g : R² → R be a function. If X and Y are discrete random variables with values a1, a2, . . . and b1, b2, . . . , respectively, then
E[g(X, Y)] = Σ_i Σ_j g(ai, bj) P(X = ai, Y = bj).
If X and Y are continuous random variables with joint probability density function f, then
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy.
As an example, take g(x, y) = xy for discrete random variables X and Y with the joint probability distribution given in Table 10.1. The expectation of XY is computed as follows:
E[XY] = (0 · 0) · 0 + (1 · 0) · 1/4 + (2 · 0) · 0
      + (0 · 1) · 1/4 + (1 · 1) · 0 + (2 · 1) · 1/4
      + (0 · 2) · 0 + (1 · 2) · 1/4 + (2 · 2) · 0 = 1.
A natural question is whether this value can also be obtained from E[X] E[Y]. We return to this question later in this chapter. First we address the expectation of the sum of two random variables.

Table 10.1. Joint probabilities P(X = a, Y = b).

         a
  b      0      1      2
  0      0    1/4      0
  1    1/4     0     1/4
  2      0    1/4      0
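Both computations on this page can be verified with a short program. The sketch below (a hypothetical Python illustration) estimates E[V] = E[πHR²] by Monte Carlo for independent H ~ U(25, 35) and R ~ U(7.5, 12.5), and computes E[XY] exactly from Table 10.1 via the two-dimensional change-of-variable formula.

import math
import random
from fractions import Fraction as F

random.seed(4)
runs = 200_000

# Monte Carlo estimate of E[pi * H * R^2] for independent uniform H and R.
total = 0.0
for _ in range(runs):
    h = random.uniform(25, 35)
    r = random.uniform(7.5, 12.5)
    total += math.pi * h * r * r
print(total / runs)        # close to 9621.127 cm^3

# Exact E[XY] from Table 10.1 (only the nonzero entries are listed).
p = {(0, 1): F(1, 4), (1, 0): F(1, 4), (1, 2): F(1, 4), (2, 1): F(1, 4)}
E_XY = sum(a * b * prob for (a, b), prob in p.items())
print(E_XY)                # 1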
  • 146. 10.1 Expectation and joint distributions 137 Quick exercise 10.1 Compute E[X + Y ] for the random variables with the joint distribution given in Table 10.1. For discrete X and Y with values a1, a2, . . . and b1, b2, . . . , respectively, we see that E[X + Y ] = i j (ai + bj)P(X = ai, Y = bj) = i j aiP(X = ai, Y = bj) + i j bjP(X = ai, Y = bj) = i ai j P(X = ai, Y = bj) + j bj i P(X = ai, Y = bj) = i aiP(X = ai) + j bjP(Y = bj) = E[X] + E[Y ] . A similar line of reasoning applies in case X and Y are continuous random variables. The following general rule holds. Linearity of expectations. For all numbers r, s, and t and random variables X and Y , one has E[rX + sY + t] = rE[X] + sE[Y ] + t. Quick exercise 10.2 Determine the marginal distributions for the random variables X and Y with the joint distribution given in Table 10.1, and use them to compute E[X] en E[Y ]. Check that E[X]+E[Y ] is equal to E[X + Y ], which was computed in Quick exercise 10.1. More generally, for random variables X1, . . . , Xn and numbers s1, . . . , sn and t, E[s1X1 + · · · + snXn + t] = s1E[X1] + · · · + snE[Xn] + t. This rule is a powerful instrument. For example, it provides an easy way to compute the expectation of a random variable X with a Bin(n, p) distribution. If we would use the definition of expectation, we have to compute E[X] = n k=0 kP(X = k) = n k=0 k n k pk (1 − p)n−k . To determine this sum is not straightforward. However, there is a simple alter- native. Recall the multiple-choice example from Section 4.3. We represented
  • 147. 138 10 Covariance and correlation the number of correct answers out of 10 multiple-choice questions as a sum of 10 Bernoulli random variables. More generally, any random variable X with a Bin(n, p) distribution can be represented as X = R1 + R2 + · · · + Rn, where R1, R2, . . . , Rn are independent Ber(p) random variables, i.e., Ri = 1 with probability p 0 with probability 1 − p. Since E[Ri] = 0 · (1 − p) + 1 · p = p, for every i = 1, 2, . . ., n, the linearity-of- expectations rule yields E[X] = E[R1] + E[R2] + · · · + E[Rn] = np. Hence we conclude that the expectation of a Bin(n, p) distribution equals np. Remark 10.1 (More than two random variables). In both the discrete and continuous cases, the change-of-variable formula for n random variables is a straightforward generalization of the change-of-variable formula for two random variables. For instance, if X1, X2, . . . , Xn are continuous random variables, with joint probability density function f, and g is a function from Rn to R, then E[g(X1, . . . , Xn)] = ∞ −∞ · · · ∞ −∞ g(x1, . . . , xn)f(x1, . . . , xn) dx1 · · · dxn. 10.2 Covariance In the previous section we have seen that for two random variables X and Y always E[X + Y ] = E[X] + E[Y ] . Does such a simple relation also hold for the variance of the sum Var(X + Y ) or for expectation of the product E[XY ]? We will investigate this in the current section. For the variables X and Y from the example in Section 9.2 with joint proba- bility density f(x, y) = 2 75 2x2 y + xy2 for 0 ≤ x ≤ 3 and 1 ≤ y ≤ 2, one can show that Var(X + Y ) = 939 2000 and Var(X) + Var(Y ) = 989 2500 + 791 10 000 = 4747 10 000
  • 148. 10.2 Covariance 139 (see Exercise 10.10). This shows, in contrast to the linearity-of-expectations rule, that Var(X + Y ) is generally not equal to Var(X)+ Var(Y ). To deter- mine Var(X + Y ), we exploit its definition: Var(X + Y ) = E (X + Y − E[X + Y ])2 . Now X + Y − E[X + Y ] = (X − E[X]) + (Y − E[Y ]), so that (X + Y − E[X + Y ]) 2 = (X − E[X]) 2 + (Y − E[Y ]) 2 + 2 (X − E[X]) (Y − E[Y ]) . Taking expectations on both sides, another application of the linearity-of- expectations rule gives Var(X + Y ) = Var(X) + Var(Y ) + 2E[(X − E[X])(Y − E[Y ])] . That is, the variance of the sum X + Y equals the sum of the variances of X and Y , plus an extra term 2E[(X − E[X])(Y − E[Y ])]. To some extent this term expresses the way X and Y influence each other. Definition. Let X and Y be two random variables. The covariance between X and Y is defined by Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] . Loosely speaking, if the covariance of X and Y is positive, then if X has a realization larger than E[X], it is likely that Y will have a realization larger than E[Y ], and the other way around. In this case we say that X and Y are positively correlated. In case the covariance is negative, the opposite effect oc- curs; X and Y are negatively correlated. In case Cov(X, Y ) = 0 we say that X and Y are uncorrelated. An easy consequence of the linearity-of-expectations property (see Exercise 10.19) is the following rule. An alternative expression for the covariance. Let X and Y be two random variables, then Cov(X, Y ) = E[XY ] − E[X] E[Y ] . For X and Y from the example in Section 9.2, we have E[X] = 109/50, E[Y ] = 157/100, and E[XY ] = 171/50 (see Exercise 10.10). Thus we see that X and Y are negatively correlated: Cov(X, Y ) = 171 50 − 109 50 · 157 100 = − 13 5000 0. Moreover, this also illustrates that, in contrast to the expectation of the sum, for the expectation of the product, in general E[XY ] is not equal to E[X] E[Y ].
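As an editorial aside (not from the book), the quoted values E[X] = 109/50, E[Y] = 157/100, and E[XY] = 171/50 for the Section 9.2 example can be checked numerically. The sketch below assumes SciPy is installed; the helper names are ours.

    from scipy.integrate import dblquad

    # Joint density from Section 9.2: f(x, y) = (2/75)(2x^2 y + x y^2)
    # on 0 <= x <= 3, 1 <= y <= 2; dblquad integrates over y first, then x.
    f = lambda y, x: 2 / 75 * (2 * x**2 * y + x * y**2)
    E = lambda g: dblquad(lambda y, x: g(x, y) * f(y, x), 0, 3, 1, 2)[0]

    EX  = E(lambda x, y: x)       # 2.18    = 109/50
    EY  = E(lambda x, y: y)       # 1.57    = 157/100
    EXY = E(lambda x, y: x * y)   # 3.42    = 171/50
    print(EXY - EX * EY)          # -0.0026 = -13/5000, i.e., Cov(X, Y)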
  • 149. 140 10 Covariance and correlation Independent versus uncorrelated Now let X and Y be two independent random variables. One expects that X and Y are uncorrelated: they have nothing to do with one another! This is indeed the case, for instance, if X and Y are discrete; one finds that E[XY ] = i j aibjP(X = ai, Y = bj) = i j aibjP(X = ai) P(Y = bj) = i aiP(X = ai) j bjP(Y = bj) = E[X] E[Y ] . A similar reasoning holds in case X and Y are continuous random variables. The alternative expression for the covariance leads to the following important observation. Independent versus uncorrelated. If two random variables X and Y are independent, then X and Y are uncorrelated. Note that the reverse is not necessarily true. If X and Y are uncorrelated, they need not be independent. This is illustrated in the next quick exercise. Quick exercise 10.3 Consider the random variables X and Y with the joint distribution given in Table 10.1. Check that X and Y are dependent, but that also E[XY ] = E[X] E[Y ]. From the preceding we also deduce the following rule on the variance of the sum of two random variables. Variance of the sum. Let X and Y be two random variables. Then always Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ) . If X and Y are uncorrelated, Var(X + Y ) = Var(X) + Var(Y ) . Hence, we always have that E[X + Y ] = E[X]+E[Y ], whereas Var(X + Y ) = Var(X)+Var(Y ) only holds for uncorrelated random variables (and hence for independent random variables!). As with the linearity-of-expectations rule, the rule for the variance of the sum of uncorrelated random variables holds more generally. For uncorrelated random variables X1, X2, . . . , Xn, we have
  • 150. 10.3 The correlation coefficient 141 Var(X1 + X2 + · · · + Xn) = Var(X1) + Var(X2) + · · · + Var(Xn) . This rule provides an easy way to compute the variance of a random variable with a Bin(n, p) distribution. Recall the representation for a Bin(n, p) random variable X: X = R1 + R2 + · · · + Rn. Each Ri has variance Var(Ri) = E R2 i − (E[Ri]) 2 = 02 · (1 − p) + 12 · p − (E[Ri]) 2 = p − p2 = p(1 − p). Using the independence of the Ri, the rule for the variance of the sum yields Var(X) = Var(R1) + Var(R2) + · · · + Var(Rn) = np(1 − p). 10.3 The correlation coefficient In the previous section we saw that the covariance between random vari- ables gives an indication of how they influence one another. A disadvan- tage of the covariance is the fact that it depends on the units in which the random variables are represented. For instance, suppose that the length in inches and weight in kilograms of Dutch citizens are modeled by random vari- ables L and W. Someone prefers to represent the length in centimeters. Since 1 inch ≡ 2.53 cm, one is dealing with a transformed random variable 2.53L. The covariance between 2.53L and W is Cov(2.53L, W) = E[(2.53L)W] − E[2.53L]E[W] = 2.53 E[LW] − E[L] E[W] = 2.53 Cov(L, W) . That is, the covariance increases with a factor 2.53, which is somewhat dis- turbing since changing from inches to centimeters does not essentially alter the dependence between length and weight. This illustrates that the covari- ance changes under a change of units. The following rule provides the exact relationship. Covariance under change of units. Let X and Y be two random variables. Then Cov(rX + s, tY + u) = rt Cov(X, Y ) for all numbers r, s, t, and u. See Exercise 10.14 for a derivation of this rule.
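The change-of-units rule for the covariance is easy to check by simulation. The Python sketch below is an editorial addition (not in the book); the particular joint distribution and the numbers r, s, t, u are arbitrary choices of ours.

    import numpy as np

    rng = np.random.default_rng(2)
    # Any dependent pair will do; here Y = X + noise.
    x = rng.normal(size=100_000)
    y = x + rng.normal(size=100_000)
    r, s, t, u = 2.53, 1.0, -4.0, 7.0

    cov = lambda a, b: np.cov(a, b)[0, 1]
    print(cov(r * x + s, t * y + u))   # these two numbers agree:
    print(r * t * cov(x, y))           # Cov(rX + s, tY + u) = rt Cov(X, Y)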
  • 151. 142 10 Covariance and correlation Quick exercise 10.4 For X and Y in the example in Section 9.2 (see also Section 10.2), show that Cov(−2X + 7, 5Y − 3) = 13/500. The preceding discussion indicates that the covariance Cov(X, Y) may not always be suitable to express the dependence between X and Y. For this reason there is a standardized version of the covariance, called the correlation coefficient of X and Y. Definition. Let X and Y be two random variables. The correlation coefficient ρ(X, Y) is defined to be 0 if Var(X) = 0 or Var(Y) = 0, and otherwise ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)). Note that ρ(X, Y) remains unaffected by a change of units, and therefore it is dimensionless. For instance, if X and Y are measured in kilometers, then Cov(X, Y), Var(X), and Var(Y) are in km², so that the dimension of ρ(X, Y) is km²/(√km² · √km²). For X and Y in the example in Section 9.2, recall that Cov(X, Y) = −13/5000. We also have Var(X) = 989/2500 and Var(Y) = 791/10 000 (see Exercise 10.10), so that ρ(X, Y) = (−13/5000) / √((989/2500) · (791/10 000)) = −0.0147. Quick exercise 10.5 For X and Y in the example in Section 9.2, show that ρ(−2X + 7, 5Y − 3) = 0.0147. The previous quick exercise illustrates the following linearity property for the correlation coefficient. For fixed numbers r, s, t, and u, with r, t ≠ 0, and random variables X and Y: ρ(rX + s, tY + u) = −ρ(X, Y) if rt < 0, and ρ(rX + s, tY + u) = ρ(X, Y) if rt > 0. Thus we see that the size of the correlation coefficient is unaffected by a change of units, but note the possibility of a change of sign. Two random variables X and Y are "most correlated" if X = Y or if X = −Y. As a matter of fact, in the former case ρ(X, Y) = 1, while in the latter case ρ(X, Y) = −1. In general—for nonconstant random variables X and Y—the following property holds: −1 ≤ ρ(X, Y) ≤ 1. For a formal derivation of this property, see the next remark.
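For the record, the value ρ(X, Y) = −0.0147 and the sign change in Quick exercise 10.5 follow directly from the moments quoted above. The short Python computation below is an editorial check, not part of the text.

    from math import sqrt

    cov_xy, var_x, var_y = -13 / 5000, 989 / 2500, 791 / 10_000
    print(cov_xy / sqrt(var_x * var_y))                      # about -0.0147

    # Change of units with rt = (-2) * 5 < 0: the size stays, the sign flips.
    print((-2) * 5 * cov_xy / sqrt(4 * var_x * 25 * var_y))  # about +0.0147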
  • 152. 10.4 Solutions to the quick exercises 143 Remark 10.2 (Correlations are between −1 and 1). Here we give a proof of the preceding formula. Since the variance of any random variable is nonnegative, we have that 0 ≤ Var X Var(X) + Y Var(Y ) = Var X Var(X) + Var Y Var(Y ) + 2Cov X Var(X) , Y Var(Y ) = Var(X) Var(X) + Var(Y ) Var(Y ) + 2Cov(X, Y ) Var(X) Var(Y ) = 2 (1 + ρ(X, Y )) . This implies ρ(X, Y ) ≥ −1. Using the same argument but replacing X by −X shows that ρ(X, Y ) ≤ 1. 10.4 Solutions to the quick exercises 10.1 The expectation of X + Y is computed as follows: E[X + Y ] = (0 + 0) · 0 + (1 + 0) · 1 4 + (2 + 0) · 0 + (0 + 1) · 1 4 + (1 + 1) · 0 + (2 + 1) · 1 4 + (0 + 2) · 0 + (1 + 2) · 1 4 + (2 + 2) · 0 = 2. 10.2 First complete Table 10.1 with the marginal distributions: a b 0 1 2 P(Y = b) 0 0 1/4 0 1/4 1 1/4 0 1/4 1/2 2 0 1/4 0 1/4 P(X = a) 1/4 1/2 1/4 1 It follows that E[X] = 0 · 1 4 + 1 · 1 2 + 2 · 1 4 = 1, and similarly E[Y ] = 1. Therefore E[X] + E[Y ] = 2, which is equal to E[X + Y ] as computed in Quick exercise 10.1.
  • 153. 144 10 Covariance and correlation 10.3 From Table 10.1, as completed in Quick exercise 10.2, we see that X and Y are dependent. For instance, P(X = 0, Y = 0) = P(X = 0) P(Y = 0). From Quick exercise 10.2 we know that E[X] = E[Y ] = 1. Because we already computed E[XY ] = 1, it follows that E[XY ] = E[X] E[Y ]. According to the alternative expression for the covariance this means that Cov(X, Y ) = 0, i.e., X and Y are uncorrelated. 10.4 We already computed Cov(X, Y ) = −13/5000 in Section 10.2. Hence, by the linearity-of-covariance rule Cov(−2X + 7, 5Y − 3) = (−2)·5·(−13/5000) = 13/500. 10.5 From Quick exercise 10.4 we have Cov(−2X + 7, 5Y − 3) = 13/500. Since Var(X) = 989/2500 and Var(Y ) = 791/10 000, by definition of the correlation coefficient and the rule for variances, ρ(−2X + 7, 5Y − 3) = Cov(−2X + 7, 5Y − 3) Var(−2X + 7) · Var(5Y − 3) = 13 500 4Var(X) · 25Var(Y ) = 13 500 3956 2500 · 19775 10 000 = 0.0147. 10.5 Exercises 10.1 Consider the joint probability distribution of X and Y from Exer- cise 9.7, obtained from data on hair color and eye color, for which we already computed the expectations and variances of X and Y , as well as E[XY ]. a. Compute Cov(X, Y ). Are X and Y positively correlated, negative corre- lated, or uncorrelated? b. Compute the correlation coefficient between X and Y . 10.2 Consider the two discrete random variables X and Y with joint dis- tribution derived in Exercise 9.2: a b 0 1 2 P(Y = b) −1 1/6 1/6 1/6 1/2 1 0 1/2 0 1/2 P(X = a) 1/6 2/3 1/6 1 a. Determine E[XY ]. b. Note that X and Y are dependent. Show that X and Y are uncorrelated.
  • 154. 10.5 Exercises 145 c. Determine Var(X + Y ). d. Determine Var(X − Y ). 10.3 Let U and V be the two random variables from Exercise 9.6. We have seen that U and V are dependent with joint probability distribution a b 0 1 2 P(V = b) 0 1/4 0 1/4 1/2 1 0 1/2 0 1/2 P(U = a) 1/4 1/2 1/4 1 Determine the covariance Cov(U, V ) and the correlation coefficient ρ(U, V ). 10.4 Consider the joint probability distribution of the discrete random vari- ables X and Y from the Melencolia Exercise 9.1. Compute Cov(X, Y ). a b 1 2 3 4 1 16/136 3/136 2/136 13/136 2 5/136 10/136 11/136 8/136 3 9/136 6/136 7/136 12/136 4 4/136 15/136 14/136 1/136 10.5 Suppose X and Y are discrete random variables taking values 0,1, and 2. The following is given about the joint and marginal distributions: a b 0 1 2 P(Y = b) 0 8/72 . . . 10/72 1/3 1 12/72 9/72 . . . 1/2 2 . . . 3/72 . . . . . . P(X = a) 1/3 . . . . . . 1 a. Complete the table. b. Compute the expectation of X and of Y and the covariance between X and Y . c. Are X and Y independent?
  • 155. 146 10 Covariance and correlation 10.6 Suppose X and Y are discrete random variables taking values c−1, c, and c + 1. The following is given about the joint and marginal distributions: a b c − 1 c c + 1 P(Y = b) c − 1 2/45 9/45 4/45 1/3 c 7/45 5/45 3/45 1/3 c + 1 6/45 1/45 8/45 1/3 P(X = a) 1/3 1/3 1/3 1 a. Take c = 0 and compute the expectation of X and of Y and the covariance between X and Y . b. Show that X and Y are uncorrelated, no matter what the value of c is. Hint: one could compute Cov(X, Y ), but there is a short solution using the rule on the covariance under change of units (see page 141) together with part a. c. Are X and Y independent? 10.7 Consider the joint distribution of Quick exercise 9.2 and take ε fixed between −1/4 and 1/4: b a 0 1 pX(a) 0 1/4 − ε 1/4 + ε 1/2 1 1/4 + ε 1/4 − ε 1/2 pY (b) 1/2 1/2 1 a. Take ε = 1/8 and compute Cov(X, Y ). b. Take ε = 1/8 and compute ρ(X, Y ). c. For which values of ε is ρ(X, Y ) equal to −1, 0, or 1? 10.8 Let X and Y be random variables such that E[X] = 2, E[Y ] = 3, and Var(X) = 4. a. Show that E X2 = 8. b. Determine the expectation of −2X2 + Y . 10.9 Suppose the blood of 1000 persons has to be tested to see which ones are infected by a (rare) disease. Suppose that the probability that the test
  • 156. 10.5 Exercises 147 is positive is p = 0.001. The obvious way to proceed is to test each person, which results in a total of 1000 tests. An alternative procedure is the following. Distribute the blood of the 1000 persons over 25 groups of size 40, and mix half of the blood of each of the 40 persons with that of the others in each group. Now test the aggregated blood sample of each group: when the test is negative no one in that group has the disease; when the test is positive, at least one person in the group has the disease, and one will test the other half of the blood of all 40 persons of that group separately. In total, that gives 41 tests for that group. Let Xi be the total number of tests one has to perform for the ith group using this alternative procedure. a. Describe the probability distribution of Xi, i.e., list the possible values it takes on and the corresponding probabilities. b. What is the expected number of tests for the ith group? What is the expected total number of tests? What do you think of this alternative procedure for blood testing? 10.10 Consider the variables X and Y from the example in Section 9.2 with joint probability density f(x, y) = 2 75 2x2 y + xy2 for 0 ≤ x ≤ 3 and 1 ≤ y ≤ 2 and marginal probability densities fX(x) = 2 225 9x2 + 7x for 0 ≤ x ≤ 3 fY (y) = 1 25 (3y2 + 12y) for 1 ≤ y ≤ 2. a. Compute E[X], E[Y ], and E[X + Y ]. b. Compute E X2 , E Y 2 , E[XY ], and E (X + Y )2 , c. Compute Var(X + Y ), Var(X), and Var(Y ) and check that Var(X + Y ) = Var(X) + Var(Y ). 10.11 Recall the relation between degrees Celsius and degrees Fahrenheit degrees Fahrenheit = 9 5 · degrees Celsius + 32. Let X and Y be the average daily temperatures in degrees Celsius in Ams- terdam and Antwerp. Suppose that Cov(X, Y ) = 3 and ρ(X, Y ) = 0.8. Let T and S be the same temperatures in degrees Fahrenheit. Compute Cov(T, S) and ρ(T, S). 10.12 Consider the independent random variables H and R from the vase example, with a U(25, 35) and a U(7.5, 12.5) distribution. Compute E[H] and E R2 and check that E[V ] = πE[H] E R2 .
  • 157. 148 10 Covariance and correlation 10.13 Let X and Y be as in the triangle example in Exercise 9.15. Recall from Exercise 9.16 that X and Y represent the minimum and maximum coordinate of a point that is drawn from the unit square: X = min{U, V} and Y = max{U, V}. a. Show that E[X] = 1/3, Var(X) = 1/18, E[Y] = 2/3, and Var(Y) = 1/18. Hint: you might consult Exercise 8.15. b. Check that Var(X + Y) = 1/6, by using that U and V are independent and that X + Y = U + V. c. Determine the covariance Cov(X, Y) using the results from a and b. 10.14 Let X and Y be two random variables and let r, s, t, and u be arbitrary real numbers. a. Derive from the definition that Cov(X + s, Y + u) = Cov(X, Y). b. Derive from the definition that Cov(rX, tY) = rt Cov(X, Y). c. Combine parts a and b to show Cov(rX + s, tY + u) = rt Cov(X, Y). 10.15 In Figure 10.1 three plots are displayed. For each plot we carried out a simulation in which we generated 500 realizations of a pair of random variables (X, Y). We have chosen three different joint distributions of X and Y.
[Figure 10.1: three scatterplots, each showing 500 simulated realizations of (X, Y); both axes run from −2 to 2.]
Fig. 10.1. Some scatterplots.
a. Indicate for each plot whether it corresponds to random variables X and Y that are positively correlated, negatively correlated, or uncorrelated. b. Which plot corresponds to random variables X and Y for which |ρ(X, Y)| is maximal? 10.16 Let X and Y be random variables. a. Express Cov(X, X + Y) in terms of Var(X) and Cov(X, Y). b. Are X and X + Y positively correlated, uncorrelated, or negatively correlated, or can anything happen?
  • 158. 10.5 Exercises 149 c. Same question as in part b, but now assume that X and Y are uncorre- lated. 10.17 Extending the variance of the sum rule. For mathematical con- venience we first extend the sum rule to three random variables with zero expectation. Next we further extend the rule to three random variables with nonzero expectation. By the same line of reasoning we extend the rule to n random variables. a. Let X, Y and Z be random variables with expectation 0. Show that Var(X + Y + Z) = Var(X) + Var(Y ) + Var(Z) + 2Cov(X, Y ) + 2Cov(X, Z) + 2Cov(Y, Z) . Hint: directly apply that for real numbers y1, . . . , yn (y1 + · · · + yn)2 = y2 1 + · · · + y2 n + 2y1y2 + 2y1y3 + · · · + 2yn−1yn. b. Now show a for X, Y , and Z with nonzero expectation. Hint: you might use the rules on pages 98 and 141 about variance and covariance under a change of units. c. Derive a general variance of the sum rule, i.e., show that if X1, X2, . . . , Xn are random variables, then Var(X1 + X2 + · · · + Xn) = Var(X1) + · · · +Var(Xn) +2Cov(X1, X2) + 2Cov(X1, X3) + · · · + 2Cov(X1, Xn) + 2Cov(X2, X3) + · · · + 2Cov(X2, Xn) ... + 2Cov(Xn−1, Xn) . d. Show that if the variances are all equal to σ2 and the covariances are all equal to some constant γ, then Var(X1 + X2 + · · · + Xn) = nσ2 + n(n − 1)γ. 10.18 Consider a vase containing balls numbered 1, 2, . . . , N. We draw n balls without replacement from the vase. Each ball is selected with equal probability, i.e., in the first draw each ball has probability 1/N, in the second draw each of the N − 1 remaining balls has probability 1/(N − 1), and so on. For i = 1, 2, . . ., n, let Xi denote the number on the ball in the ith draw. From Exercise 9.18 we know that the variance of Xi equals Var(Xi) = 1 12 (N − 1)(N + 1).
  • 159. 150 10 Covariance and correlation Show that Cov(X1, X2) = − 1 12 (N + 1). Before you do the exercise: why do you think the covariance is negative? Hint: use Var(X1 + X2 + · · · + XN ) = 0 (why?), and apply Exercise 10.17. 10.19 Derive the alternative expression for the covariance: Cov(X, Y ) = E[XY ] − E[X] E[Y ]. Hint: work out (X − E[X])(Y − E[Y ]) and use linearity of expectations. 10.20 Determine ρ U, U2 when U has a U(0, a) distribution. Here a is a positive number.
  • 160. 11 More computations with more random variables Often one is interested in combining random variables, for instance, in taking the sum. In previous chapters, we have seen that it is fairly easy to describe the expected value and the variance of this new random variable. Often more details are needed, and one also would like to have its probability distribu- tion. In this chapter we consider the probability distributions of the sum, the product, and the quotient of two random variables. 11.1 Sums of discrete random variables In a solo race across the Pacific Ocean, a ship has one spare radio set for communications. Each of the two radios has probability p of failing each time it is switched on. The skipper uses the radio once every day. Let X be the number of days the radio is switched on until it fails (so if the radio can be used for two days and fails on the third day, X attains the value 3). Similarly, let Y be the number of days the spare radio is switched on until it fails. Note that these random variables are similar to the one discussed in Section 4.4, which modeled the number of cycles until pregnancy. Hence, X and Y are Geo(p) distributed random variables. Suppose that p = 1/75 and that the trip will last 100 days. Then at first sight the skipper does not need to worry about radio contact: the number of days the first radio lasts is X − 1 days, and similarly the spare radio lasts Y −1 days. Therefore the expected number of days he is able to have radio contact is E[X − 1 + Y − 1] = E[X] + E[Y ] − 2 = 1 p + 1 p − 2 = 148 days! The skipper—who has some training in probability theory—still has some concerns about the risk he runs with these two radios. What if the probability P(X + Y − 2 ≤ 99) that his two radios break down before the end of the trip is large? skip 11
  • 161. 152 11 More computations with more random variables This example illustrates that it is important to study the probability distri- bution of the sum Z = X + Y of two discrete random variables. The random variable Z takes on values ai + bj, where ai is a possible value of X and bj of Y . Hence, the probability mass function of Z is given by pZ(c) = (i,j):ai+bj =c P(X = ai, Y = bj) , where the sum runs over all possible values ai of X and bj of Y such that ai + bj = c. Because the sum only runs over values ai that are equal to c − bj, we simplify the summation and write pZ (c) = j P(X = c − bj, Y = bj) , where the sum runs over all possible values bj of Y . When X and Y are independent, then P(X = c − bj, Y = bj) = P(X = c − bj) P(Y = bj). This leads to the following rule. Adding two independent discrete random variables. Let X and Y be two independent discrete random variables, with probabil- ity mass functions pX and pY . Then the probability mass function pZ of Z = X + Y satisfies pZ(c) = j pX(c − bj)pY (bj), where the sum runs over all possible values bj of Y . Quick exercise 11.1 Let S be the sum of two independent throws with a die, so S = X + Y , where X and Y are independent, and P(X = k) = P(Y = k) = 1/6, for k = 1, . . . , 6. Use the addition rule to compute P(S = 3) and P(S = 8), and compare your answers with Table 9.2. In the solo race example, X and Y are independent Geo(p) distributed random variables. Let Z = X + Y ; then by the above rule for k ≥ 2 P(X + Y = k) = pZ(k) = ∞ =1 pX(k − )pY (). Because pX(a) = 0 for a ≤ 0, all terms in this sum with ≥ k vanish, hence P(X + Y = k) = k−1 =1 pX(k − ) · pY () = k−1 =1 (1 − p)k−−1 p · (1 − p)−1 p = k−1 =1 p2 (1 − p)k−2 = (k − 1)p2 (1 − p)k−2 . Note that X + Y does not have a geometric distribution.
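A quick numerical check of the convolution result above (an editorial addition, assuming NumPy; the names are ours): the formula P(X + Y = k) = (k − 1)p²(1 − p)^(k−2) and a direct convolution of the two geometric probability mass functions give the same answer to the skipper's question, since X + Y − 2 ≤ 99 is the same event as X + Y ≤ 101.

    import numpy as np

    p = 1 / 75
    k = np.arange(2, 102)                       # values 2, ..., 101
    pmf_sum = (k - 1) * p**2 * (1 - p)**(k - 2)
    print(pmf_sum.sum())                        # P(X + Y <= 101), about 0.39

    # The same probability by convolving the Geo(p) pmf with itself:
    kk = np.arange(1, 101)
    geo = (1 - p)**(kk - 1) * p
    print(np.convolve(geo, geo)[:100].sum())    # same value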
  • 162. 11.1 Sums of discrete random variables 153 Remark 11.1 (The expected value of a geometric distribution). The preceding gives us the opportunity to calculate the expected value of the geometric distribution in an easy way. Since the probabilities of Z add up to one: 1 = ∞ k=2 pZ(k) = ∞ k=2 (k − 1)p2 (1 − p)k−2 = p ∞ =1 p(1 − p)−1 ; it follows that E[X] = ∞ =1 p(1 − p)−1 = 1 p . Returning to the solo race example, it is clear that the skipper does have grounds to worry: P(X + Y − 2 ≤ 99) = P(X + Y ≤ 101) = 101 k=2 P(X + Y = k) = 101 k=2 (k − 1)( 1 75 )2 (1 − 1 75 )k−2 = 0.3904. The sum of two binomial random variables It is not always necessary to use the addition rule for two independent discrete random variables to find the distribution of their sum. For example, let X and Y be two independent random variables, where X has a Bin(n, p) distribution and Y has a Bin(m, p) distribution. Since a Bin(n, p) distribution models the number of successes in n independent trials with success probability p, heuristically, X + Y represents the number of successes in n + m trials with success probability p and should therefore have a Bin(n + m, p) distribution. A more formal reasoning is the following. Let R1, R2, . . . , Rn, S1, S2, . . . , Sm be independent Ber(p) distributed random variables. Recall that a Bin(n, p) distributed random variable has the same distribution as the sum of n inde- pendent Ber(p) distributed random variables (see Section 4.3 or 10.2). Hence X has the same distribution as R1 + R2 + · · · + Rn and Y has the same distribution as S1 + S2 + · · · + Sm. This means that X + Y has the same dis- tribution as the sum of n+m independent Ber(p) variables and therefore has a Bin(n + m, p) distribution. This can also be verified analytically by means of the addition rule, using that X and Y are also independent. Quick exercise 11.2 For i = 1, 2, 3, let Xi be a Bin(ni, p) distributed ran- dom variable, and suppose that X1, X2, and X3 are independent. Argue that Z = X1 + X2 + X3 is a Bin(n1 + n2 + n3, p) distributed random variable.
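The heuristic argument for the sum of two binomial random variables can also be verified numerically with the addition rule. The sketch below is an editorial aside, assuming SciPy is available; the parameters 3, 5, and 0.3 are arbitrary choices of ours.

    import numpy as np
    from scipy.stats import binom

    n, m, p = 3, 5, 0.3
    px = binom.pmf(np.arange(n + 1), n, p)      # Bin(n, p) pmf
    py = binom.pmf(np.arange(m + 1), m, p)      # Bin(m, p) pmf
    pz = np.convolve(px, py)                    # pmf of X + Y by the addition rule
    print(np.allclose(pz, binom.pmf(np.arange(n + m + 1), n + m, p)))   # True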
  • 163. 154 11 More computations with more random variables 11.2 Sums of continuous random variables Let X and Y be two continuous random variables. What can we say about the probability density function of Z = X + Y? We start with an example. Suppose that X and Y are two independent, U(0, 1) distributed random variables. One might be tempted to think that Z is also uniformly distributed. Note that the joint probability density function f of X and Y is equal to the product of the marginal probability functions fX and fY: f(x, y) = fX(x)fY(y) = 1 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise. Let us compute the distribution function FZ of Z. It is easy to see that FZ(a) = 0 for a ≤ 0 and FZ(a) = 1 for a ≥ 2. For a between 0 and 1, let G be that part of the plane below the line x + y = a, and let ∆ be the triangle with vertices (0, 0), (a, 0), and (0, a); see Figure 11.1.
[Figure 11.1: the unit square [0, 1] × [0, 1] with the line x + y = a; the triangle ∆ with vertices (0, 0), (a, 0), and (0, a) is shaded.]
Fig. 11.1. The region G in the plane where x + y ≤ a (with 0 < a < 1) intersected with ∆.
Since f(x, y) = 0 outside [0, 1] × [0, 1], the distribution function of Z is given by FZ(a) = P(Z ≤ a) = P(X + Y ≤ a) = ∫∫_G f(x, y) dx dy = ∫∫_∆ 1 dx dy = area of ∆ = a²/2 for 0 < a < 1. For the case where 1 ≤ a < 2 one can draw a similar figure (see Figure 11.2), from which one can find that FZ(a) = 1 − (2 − a)²/2 for 1 ≤ a < 2.
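As an editorial check (not in the book, assuming NumPy), one can compare this formula for FZ with the empirical distribution function of simulated sums of two independent U(0, 1) variables.

    import numpy as np

    rng = np.random.default_rng(3)
    z = rng.uniform(size=10**6) + rng.uniform(size=10**6)
    for a in (0.5, 1.0, 1.5):
        exact = a**2 / 2 if a < 1 else 1 - (2 - a)**2 / 2
        print(a, (z <= a).mean(), exact)        # empirical and exact F_Z(a) agree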
  • 164. 11.2 Sums of continuous random variables 155
[Figure 11.2: the unit square [0, 1] × [0, 1] with the line x + y = a for 1 ≤ a < 2; the region below the line inside the square is shaded.]
Fig. 11.2. The region G in the plane where x + y ≤ a (with 1 ≤ a < 2) intersected with ∆.
We see that Z is not uniformly distributed. In general, the distribution function FZ of the sum Z of two continuous random variables X and Y is given by FZ(a) = P(Z ≤ a) = P(X + Y ≤ a) = ∫∫_{(x,y): x+y≤a} f(x, y) dx dy. The double integral on the right-hand side can be written as a repeated integral, first over x and then over y. Note that x and y are between minus and plus infinity and that they also have to satisfy x + y ≤ a or, equivalently, x ≤ a − y. This means that the integral over x runs from minus infinity to a − y, and the integral over y runs from minus infinity to plus infinity. Hence FZ(a) = ∫_{−∞}^{∞} ∫_{−∞}^{a−y} f(x, y) dx dy. In case X and Y are independent, the last double integral can be written as ∫_{−∞}^{∞} ( ∫_{−∞}^{a−y} fX(x) dx ) fY(y) dy, and we find that FZ(a) = ∫_{−∞}^{∞} FX(a − y) fY(y) dy for −∞ < a < ∞. Differentiating FZ we find the following rule.
  • 165. 156 11 More computations with more random variables Adding two independent continuous random variables. Let X and Y be two independent continuous random variables, with probability density functions fX and fY . Then the probability den- sity function fZ of Z = X + Y is given by fZ(z) = ∞ −∞ fX(z − y)fY (y) dy for −∞ z ∞. The single-server queue revisited In the single-server queue model from Section 6.4, T1 is the time between the start at time zero and the arrival of the first customer and Ti is the time between the arrival of the (i − 1)th and ith customer at a well. We are interested in the arrival time of the nth customer at the well. For n ≥ 1, let Zn be the arrival time of the nth customer at the well: Zn = T1 + · · · + Tn. Since each Ti has an Exp(0.5) distribution, it follows from the linearity-of- expectations rule in Section 10.1 that the expected arrival time of the nth customer is E[Zn] = E[T1 + · · · + Tn] = E[T1] + · · · + E[Tn] = 2n minutes. We would like to know whether the pump capacity is sufficient; for instance, when the service times Si are independent U(2, 5) distributed random vari- ables (this is the case when the pump capacity v = 1). In that case, at most 30 customers can pump water at the well in the first hour. If P(Z30 ≤ 60) is large, one might be tempted to increase the capacity of the well. Recalling that the Ti are independent Exp(λ) random variables, it follows from the addition rule that fT1+T2 (z) = 0 if z 0, and for z ≥ 0 that fZ2 (z) = fT1+T2 (z) = ∞ −∞ fT1 (z − y)fT2 (y) dy = z 0 λe−λ(z−y) · λe−λy dy = λ2 e−λz z 0 dy = λ2 ze−λz . Viewing T1 + T2 + T3 as the sum of T1 and T2 + T3, we find, by applying the addition rule again, that fZ3 (z) = 0 if z 0, and for z ≥ 0 that fZ3 (z) = fT1+T2+T3 (z) = ∞ −∞ fT1 (z − y)fT2+T3 (y) dy = z 0 λe−λ(z−y) · λ2 ye−λy dy = λ3 e−λz z 0 y dy = 1 2 λ3 z2 e−λz .
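Before the general formula for fZn that follows, here is an editorial numerical check (not part of the text, assuming NumPy): discretizing the Exp(λ) density and convolving it with itself reproduces the density λ²ze^(−λz) found above for Z2 = T1 + T2.

    import numpy as np

    lam, dz = 0.5, 0.001
    z = np.arange(0, 40, dz)
    f_exp = lam * np.exp(-lam * z)                      # Exp(lambda) density on a grid
    f_z2 = np.convolve(f_exp, f_exp)[: len(z)] * dz     # numerical density of T1 + T2
    print(np.max(np.abs(f_z2 - lam**2 * z * np.exp(-lam * z))))   # close to 0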
  • 166. 11.2 Sums of continuous random variables 157 Repeating this procedure, we find that fZn (z) = 0 if z 0, and fZn (z) = λ (λz) n−1 e−λz (n − 1)! for z ≥ 0. Using integration by parts we find (see Exercise 11.13) that for n ≥ 1 and a ≥ 0: P(Zn ≤ a) = 1 − e−λa n−1 i=0 (λa)i i! . Since λ = 1/2, it follows that P(Z30 ≤ 60) = 0.524. Even if each customer fills his jerrican in the minimum time of 2 minutes, we see that after an hour with probability 0.524, people will be waiting at the pump! The random variable Zn is an example of a gamma random variable, defined as follows. Definition. A continuous random variable X has a gamma dis- tribution with parameters α 0 and λ 0 if its probability density function f is given by f(x) = 0 for x 0 and f(x) = λ (λx)α−1 e−λx Γ(α) for x ≥ 0, where the quantity Γ(α) is a normalizing constant such that f inte- grates to 1. We denote this distribution by Gam(α, λ). The quantity Γ(α) is for α 0 defined by Γ(α) = ∞ 0 tα−1 e−t dt. It satisfies for α 0 and n = 1, 2, . . . Γ(α + 1) = αΓ(α) and Γ(n) = (n − 1)! (see also Exercise 11.12). It follows from our example that the sum of n inde- pendent Exp(λ) distributed random variables has a Gam(n, λ) distribution, also known as the Erlang-n distribution with parameter λ. The sum of independent normal random variables Using the addition rule you can show that the sum of two independent nor- mally distributed random variables is again a normally distributed random
  • 167. 158 11 More computations with more random variables variable. For instance, if X and Y are independent N(0, 1) distributed random variables, one has fX+Y (z) = ∞ −∞ fX(z − y)fY (y) dy = ∞ −∞ 1 √ 2π e− 1 2 (z−y)2 1 √ 2π e− 1 2 y2 dy = ∞ −∞ 1 √ 2π 2 e− 1 2 (2y2 −2yz+z2 ) dy. To prepare a change of variables, we subtract the term 1 2 z2 from 2y2 −2yz+z2 to complete the square in the exponent: 2y2 − 2yz + 1 2 z2 = √ 2 y − z 2 2 . In this way we find with changing integration variables t = √ 2(y − z/2): fX+Y (z) = 1 √ 2π e− 1 4 z2 ∞ −∞ 1 √ 2π e− 1 2 (2y2 −2yz+ 1 2 z2 ) dy = 1 √ 2π e− 1 4 z2 ∞ −∞ 1 √ 2π e− 1 2 [ √ 2(y−z/2)] 2 dy = 1 √ 2π e− 1 4 z2 1 √ 2 ∞ −∞ 1 √ 2π e− 1 2 t2 dt = 1 √ 4π e− 1 4 z2 ∞ −∞ φ(t) dt. Since φ is the probability density of the standard normal distribution, it in- tegrates to 1, so that fX+Y (z) = 1 √ 4π e− 1 4 z2 , which is the probability density of the N(0, 2) distribution. Thus, X + Y also has a normal distribution. This is more generally true. The sum of independent normal random variables. If X and Y are independent random variables with a normal distribution, then X + Y also has a normal distribution. Quick exercise 11.3 Let X and Y be independent random variables, where X has an N(3, 16) distribution, and Y an N(5, 9) distribution. Then X + Y is a normally distributed random variable. What are its parameters? Rather surprisingly, independence of X and Y is not a prerequisite, as can be seen in the following remark.
  • 168. 11.3 Product and quotient of two random variables 159 Remark 11.2 (Sums of dependent normal random variables). We say the pair X, Y is has a bivariate normal distribution if their joint prob- ability density equals 1 2πσX σY 1 − ρ2 exp − 1 2 1 (1 − ρ2) Q(x, y) , where Q(x, y) = x − µX σX 2 − 2ρ x − µX σX y − µY σY + y − µY σY 2 . Here µX and µY are the expectations of X and Y , σ2 X and σ2 Y are their variances, and ρ is the correlation coefficient of X and Y . If X and Y have such a bivariate normal distribution, then X has an N(µX, σ2 X ) and Y has an N(µY , σ2 Y ) distribution. Moreover, one can show that X + Y has an N(µX + µY , σ2 X + σ2 Y + 2ρσX σY ) distribution. An example of a bivariate normal probability density is displayed in Figure 9.2. This probability den- sity corresponds to parameters µX = µY = 0, σX = σY = 1/6, and ρ = 0.8. 11.3 Product and quotient of two random variables Recall from Chapter 7 the example of the architect who wants maximal vari- ety in the sizes of buildings. The architect wants more variety and therefore replaces the square buildings by rectangular buildings: the buildings should be of width X and depth Y , where X and Y are independent and uniformly distributed between 0 and 10 meters. Since X and Y are independent, the expected area of a building equals E[XY ] = E[X] E[Y ] = 5 · 5 = 25 m2 . But what can one say about the distribution of the area Z = XY of an arbitrary building? Let us calculate the distribution function of Z. Clearly FZ (a) = 0 if a 0 and FZ (a) = 1 if a 100. For a between 0 and 100 we can compute FZ (a) with the help of Figure 11.3. We find FZ (a) = P(Z ≤ a) = P(XY ≤ a) = area of the shaded region in Figure 11.3 area of [0, 10] × [0, 10] = 1 100 a 10 · 10 + 10 a/10 a x dx = 1 100 a + a ln x 10 a/10 = a(1 + 2 ln 10 − ln a) 100 . Hence the probability density function fZ of Z is given by
  • 169. 160 11 More computations with more random variables
[Figure 11.3: the square [0, 10] × [0, 10] with the hyperbola xy = a; the region G where xy ≤ a (to the left of x = a/10 and below y = a/x) is shaded.]
Fig. 11.3. The region G in the plane where xy ≤ a intersected with [0, 10] × [0, 10].
fZ(z) = d/dz FZ(z) = d/dz [ z(1 + 2 ln 10 − ln z)/100 ] = (ln 100 − ln z)/100 for 0 < z < 100 m². This computation can be generalized to arbitrary independent continuous random variables, and we obtain the following formula for the probability density function of the product of two random variables. Product of independent continuous random variables. Let X and Y be two independent continuous random variables with probability densities fX and fY. Then the probability density function fZ of Z = XY is given by fZ(z) = ∫_{−∞}^{∞} fY(z/x) fX(x) (1/|x|) dx for −∞ < z < ∞. For the quotient Z = X/Y of two independent random variables X and Y it is now fairly easy to derive the probability density function. Since the independence of X and Y implies that X and 1/Y are independent, the preceding rule yields fZ(z) = ∫_{−∞}^{∞} f_{1/Y}(z/x) fX(x) (1/|x|) dx. Recall from Section 8.2 that the probability density function of 1/Y is given by f_{1/Y}(y) = (1/y²) fY(1/y).
  • 170. 11.3 Product and quotient of two random variables 161 Substituting this in the integral, after changing the variable of integration, we find the following rule. Quotient of independent continuous random variables. Let X and Y be two independent continuous random variables with probability densities fX and fY . Then the probability density func- tion fZ of Z = X/Y is given by fZ(z) = ∞ −∞ fX(zx)fY (x)|x| dx for −∞ z ∞. The quotient of two independent normal random variables Let X and Y be independent random variables, both having a standard normal distribution. When we compute the quotient Z of X and Y , we find a so-called standard Cauchy distribution: fZ(z) = ∞ −∞ |x| 1 √ 2π e− 1 2 z2 x2 1 √ 2π e− 1 2 x2 dx = 1 2π ∞ −∞ |x|e− 1 2 (z2 +1)x2 dx = 2 · 1 2π ∞ 0 xe− 1 2 (z2 +1)x2 dx = 1 π −1 z2 + 1 e− 1 2 (z2 +1)x2 ∞ 0 = 1 π(z2 + 1) . This is the special case α = 0, β = 1 of the following family of distributions. Definition. A continuous random variable has a Cauchy distribu- tion with parameters α and β 0 if its probability density function f is given by f(x) = β π (β2 + (x − α)2) for − ∞ x ∞. We denote this distribution by Cau(α, β). By integrating, we find that the distribution function F of a Cauchy distri- bution is given by F(x) = 1 2 + 1 π arctan x − α β . The parameter α is the point of symmetry of the probability density func- tion f. Note that α is not the expected value of Z. As a matter of fact, it was shown in Remark 7.1 that the expected value does not exist! The probabil- ity density f is shown together with the distribution function F for the case α = 2, β = 5 in Figure 11.4.
  • 171. 162 11 More computations with more random variables
[Figure 11.4: two panels over the interval from −12 to 16, showing the probability density f (vertical axis from 0.00 to 0.06) and the distribution function F (vertical axis from 0 to 1) of the Cau(2, 5) distribution.]
Fig. 11.4. The graphs of f and F of the Cau(2, 5) distribution.
Quick exercise 11.4 Argue—without doing any calculations—that if Z has a standard Cauchy distribution, 1/Z also has a standard Cauchy distribution. 11.4 Solutions to the quick exercises 11.1 Using the addition rule we find P(S = 3) = Σ_{j=1}^{6} pX(3 − j)pY(j) = pX(2)pY(1) + pX(1)pY(2) + pX(0)pY(3) + pX(−1)pY(4) + pX(−2)pY(5) + pX(−3)pY(6) = 1/36 + 1/36 + 0 + 0 + 0 + 0 = 1/18 and P(S = 8) = Σ_{j=1}^{6} pX(8 − j)pY(j) = pX(7)pY(1) + pX(6)pY(2) + pX(5)pY(3) + pX(4)pY(4) + pX(3)pY(5) + pX(2)pY(6) = 0 + 1/36 + 1/36 + 1/36 + 1/36 + 1/36 = 5/36. 11.2 We have seen that X1 + X2 is a Bin(n1 + n2, p) distributed random variable. Viewing X1 + X2 + X3 as the sum of X1 + X2 and X3, it follows that X1 + X2 + X3 is a Bin(n1 + n2 + n3, p) distributed random variable.
  • 172. 11.5 Exercises 163 11.3 The sum rule for two normal random variables tells us that X + Y is a normally distributed random variable. Its parameters are expectation and variance of X + Y . Hence by linearity of expectations µX+Y = E[X + Y ] = E[X] + E[Y ] = µX + µY = 3 + 5 = 8, and by the rule for the variance of the sum σ2 X+Y = Var(X) + Var(Y ) + 2Cov(X, Y ) = σ2 X + σ2 Y = 16 + 9 = 25, using that Cov(X, Y ) = 0 due to independence of X and Y . 11.4 In the examples we have seen that the quotient X/Y of two independent standard normal random variables has a standard Cauchy distribution. Since Z = X/Y , the random variable 1/Z = Y/X. This is also the quotient of two independent standard normal random variables, and it has a standard Cauchy distribution. 11.5 Exercises 11.1 Let X and Y be independent random variables with a discrete uniform distribution, i.e., with probability mass functions pX(k) = pY (k) = 1 N , for k = 1, . . . , N. Use the addition rule for discrete random variables on page 152 to determine the probability mass function of Z = X + Y for the following two cases. a. Suppose N = 6, so that X and Y represent two throws with a die. Show that pZ(k) = P(X + Y = k) = ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ k − 1 36 for k = 2, . . . , 6, 13 − k 36 for k = 7, . . . , 12. You may check this with Quick exercise 11.1. b. Determine the expression for pZ (k) for general N. 11.2 Consider a discrete random variable X taking values k = 0, 1, 2, . . . with probabilities P(X = k) = µk k! e−µ , where µ 0. This is the Poisson distribution with parameter µ. We will learn more about this distribution in Chapter 12. This exercise illustrates that the sum of independent Poisson variables again has a Poisson distribution.
  • 173. 164 11 More computations with more random variables a. Let X and Y be independent random variables, each having a Poisson distribution with µ = 1. Show that for k = 0, 1, 2, . . . P(X + Y = k) = 2k k! e−2 , by using k =0 k = 2k . b. Let X and Y be independent random variables, each having a Poisson distribution with parameters λ and µ. Show that for k = 0, 1, 2, . . . P(X + Y = k) = (λ + µ)k k! e−(λ+µ) , by using k =0 k p (1 − p)k− = 1 for p = µ/(λ + µ). We conclude that X +Y has a Poisson distribution with parameter λ+µ. 11.3 Let X and Y be two independent random variables, where X has a Ber(p) distribution, and Y has a Ber(q) distribution. When p = q = r, we know that X + Y has a Bin(2, r) distribution. Suppose that p = 1/2 and q = 1/4. Determine P(X + Y = k), for k = 0, 1, 2, and conclude that X + Y does not have a binomial distribution. 11.4 Let X and Y be two independent random variables, where X has an N(2, 5) distribution and Y has an N(5, 9) distribution. Define Z = 3X−2Y +1. a. Compute E[Z] and Var(Z). b. What is the distribution of Z? c. Compute P(Z ≤ 6). 11.5 Let X and Y be two independent, U(0, 1) distributed random vari- ables. Use the rule on addition of independent continuous random variables on page 156 to show that the probability density function of X + Y is given by fZ(z) = ⎧ ⎪ ⎨ ⎪ ⎩ z for 0 ≤ z 1, 2 − z for 1 ≤ z ≤ 2, 0 otherwise. 11.6 Let X and Y be independent random variables with probability den- sities fX(x) = 1 4 xe−x/2 and fY (y) = 1 4 ye−y/2 . Use the rule on addition of independent continuous random variables to de- termine the probability density of Z = X + Y . 11.7 The two random variables in Exercise 11.6 are special cases of Gam(α, λ) variables, namely with α = 2 and λ = 1/2. More generally, let
X1, . . . , Xn be independent Gam(k, λ) distributed random variables, where λ > 0 and k is a positive integer. Argue—without doing any calculations—that X1 + · · · + Xn has a Gam(nk, λ) distribution.

11.8 We investigate the effect on the Cauchy distribution under a change of units.
a. Let X have a standard Cauchy distribution. What is the distribution of Y = rX + s?
b. Let X have a Cau(α, β) distribution. What is the distribution of the random variable (X − α)/β?

11.9 Let X and Y be independent random variables with a Par(α) and Par(β) distribution.
a. Take α = 3 and β = 1 and determine the probability density of Z = XY.
b. Determine the probability density of Z = XY for general α and β.

11.10 Let X and Y be independent random variables with a Par(α) and Par(β) distribution.
a. Take α = β = 2. Show that Z = X/Y has probability density
$$f_Z(z) = \begin{cases} z & \text{for } 0 < z < 1,\\ 1/z^3 & \text{for } 1 \le z < \infty.\end{cases}$$
b. For general α, β > 0, show that Z = X/Y has probability density
$$f_Z(z) = \begin{cases} \dfrac{\alpha\beta}{\alpha + \beta}\, z^{\beta - 1} & \text{for } 0 < z < 1,\\[1ex] \dfrac{\alpha\beta}{\alpha + \beta}\, \dfrac{1}{z^{\alpha + 1}} & \text{for } 1 \le z < \infty.\end{cases}$$

11.11 Let X1, X2, and X3 be three independent Geo(p) distributed random variables, and let Z = X1 + X2 + X3.
a. Show for k ≥ 3 that the probability mass function pZ of Z is given by
$$p_Z(k) = P(X_1 + X_2 + X_3 = k) = \tfrac{1}{2}(k - 2)(k - 1)\, p^3 (1 - p)^{k-3}.$$
b. Use the fact that $\sum_{k=3}^{\infty} p_Z(k) = 1$ to show that
$$p^2\left( \mathrm{E}\!\left[X_1^2\right] + \mathrm{E}[X_1] \right) = 2.$$
c. Use E[X1] = 1/p and part b to conclude that
$$\mathrm{E}\!\left[X_1^2\right] = \frac{2 - p}{p^2} \quad\text{and}\quad \mathrm{Var}(X_1) = \frac{1 - p}{p^2}.$$
11.12 Show that Γ(1) = 1, and use integration by parts to show that
$$\Gamma(x + 1) = x\,\Gamma(x) \quad\text{for } x > 0.$$
Use this last expression to show for n = 1, 2, . . . that Γ(n) = (n − 1)!

11.13 Let Zn have an Erlang-n distribution with parameter λ.
a. Use integration by parts to show that for a ≥ 0 and n ≥ 2:
$$P(Z_n \le a) = \int_0^a \frac{\lambda^n z^{n-1} e^{-\lambda z}}{(n - 1)!}\, dz = -\frac{(\lambda a)^{n-1}}{(n - 1)!}\, e^{-\lambda a} + P(Z_{n-1} \le a).$$
b. Use a to show that for a ≥ 0:
$$P(Z_n \le a) = -\sum_{i=1}^{n-1} \frac{(\lambda a)^i}{i!}\, e^{-\lambda a} + P(Z_1 \le a).$$
c. Conclude that for a ≥ 0:
$$P(Z_n \le a) = 1 - e^{-\lambda a} \sum_{i=0}^{n-1} \frac{(\lambda a)^i}{i!}.$$
12 The Poisson process

In many random phenomena we encounter, it is not just one or two random variables that play a role but a whole collection. In that case one often speaks of a random process. The Poisson process is a simple kind of random process, which models the occurrence of random points in time or space. There are numerous ways in which processes of random points arise: some examples are presented in the first section. The Poisson process describes in a certain sense the most random way to distribute points in time or space. This is made more precise with the notions of homogeneity and independence.

12.1 Random points

Typical examples of the occurrence of random time points are: arrival times of email messages at a server, the times at which asteroids hit the earth, arrival times of radioactive particles at a Geiger counter, times at which your computer crashes, the times at which electronic components fail, and arrival times of people at a pump in an oasis.
Examples of the occurrence of random points in space are: the locations of asteroid impacts with earth (2-dimensional), the locations of imperfections in a material (3-dimensional), and the locations of trees in a forest (2-dimensional).
Some of these phenomena are better modeled by the Poisson process than others. Loosely speaking, one might say that the Poisson process model often applies in situations where there is a very large population, and each member of the population has a very small probability to produce a point of the process. This is, for instance, well fulfilled in the Geiger counter example where, in a huge collection of atoms, just a few will emit a radioactive particle (see [28]). A property of the Poisson process—as we will see shortly—is that points may lie arbitrarily close together. Therefore the tree locations are not so well modeled by the Poisson process.
12.2 Taking a closer look at random arrivals

A well-known example that is usually modeled by the Poisson process is that of calls arriving at a telephone exchange—the exchange is connected to a large number of people who make phone calls now and then. This will be our leading example in this section. Telephone calls arrive at random times X1, X2, . . . at the telephone exchange during a time interval [0, t].

[Figure: the arrival times X1, . . . , X5 marked on the time axis between 0 and t.]

The two basic assumptions we make on these random arrivals are
1. (Homogeneity) The rate λ at which arrivals occur is constant over time: in a subinterval of length u the expectation of the number of telephone calls is λu.
2. (Independence) The numbers of arrivals in disjoint time intervals are independent random variables.
Homogeneity is also called weak stationarity. We denote the total number of calls in an interval I by N(I), abbreviating N([0, t]) to Nt. Homogeneity then implies that we require
$$E[N_t] = \lambda t.$$
To get hold of the distribution of Nt we divide the interval [0, t] into n intervals of length t/n. When n is large enough, every interval $I_{j,n} = \left((j-1)\tfrac{t}{n},\, j\tfrac{t}{n}\right]$ will contain either 0 or 1 arrival:

[Figure: the same arrival times, with [0, t] subdivided into n intervals of length t/n.]

For such a large n (which also satisfies n > λt), let Rj be the number of arrivals in the time interval Ij,n. Since Rj is 0 or 1, Rj has a Ber(pj) distribution for some pj. Recall that for a Bernoulli random variable
$$E[R_j] = 0 \cdot (1 - p_j) + 1 \cdot p_j = p_j.$$
By the homogeneity assumption, for each j
$$p_j = \lambda \cdot \text{length of } I_{j,n} = \frac{\lambda t}{n}.$$
Summing the number of calls in the intervals gives the total number of calls, hence Nt = R1 + R2 + · · · + Rn.
By the independence assumption, the Rj are independent random variables, therefore Nt has a Bin(n, p) distribution, with p = λt/n.

Remark 12.1 (About this approximation). The argument just given seems pretty convincing, but actually Rj does not have a Bernoulli distribution, whatever the value of n. A way to see this is the following. Every interval Ij,n is a union of the two intervals I2j−1,2n and I2j,2n. Hence the probability that Ij,n contains two calls is at least $(\lambda t/2n)^2 = \lambda^2 t^2/4n^2$, which is larger than zero. Note however, that the probability of having two arrivals is of smaller order than the probability that Rj takes the value 1. If we add a third assumption, namely that the probability of two or more calls arriving in an interval Ij,n tends to zero faster than 1/n, then the conclusion below on the distribution of Nt is valid.

We have found that (at least in first approximation)
$$P(N_t = k) = \binom{n}{k} \left(\frac{\lambda t}{n}\right)^k \left(1 - \frac{\lambda t}{n}\right)^{n-k} \quad\text{for } k = 0, \dots, n.$$
In this analysis n is a rather artificial parameter, of which we only know that it should not be “too small.” It therefore seems a good idea to get rid of n by letting n go to infinity, hoping that the probability distribution of Nt will settle down. Note that
$$\lim_{n\to\infty} \binom{n}{k} \frac{1}{n^k} = \lim_{n\to\infty} \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+1}{n} \cdot \frac{1}{k!} = \frac{1}{k!},$$
and from calculus we know that
$$\lim_{n\to\infty} \left(1 - \frac{\lambda t}{n}\right)^n = e^{-\lambda t}.$$
Since certainly
$$\lim_{n\to\infty} \left(1 - \frac{\lambda t}{n}\right)^{-k} = 1,$$
we obtain, combining these three limits, that
$$\lim_{n\to\infty} P(N_t = k) = \lim_{n\to\infty} \binom{n}{k} \frac{1}{n^k} \cdot (\lambda t)^k \cdot \left(1 - \frac{\lambda t}{n}\right)^n \cdot \left(1 - \frac{\lambda t}{n}\right)^{-k} = \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}.$$
Since
$$e^{-\lambda t} \sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!} = e^{-\lambda t}\, e^{\lambda t} = 1,$$
we have indeed run into a probability distribution on the numbers 0, 1, 2, . . . . Note that all these probabilities are determined by the single value λt. This motivates the following definition.
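This limit can also be checked numerically: for fixed λt, the Bin(n, λt/n) probabilities stabilize quickly as n grows. A minimal Python sketch (not from the book; the values of λ and t and the function names are arbitrary illustration choices):

```python
from math import comb, exp, factorial

lam, t = 2.0, 3.0          # intensity and interval length (arbitrary example values)
mu = lam * t               # parameter of the limiting Poisson distribution

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    """P(X = k) for X ~ Pois(mu)."""
    return mu**k / factorial(k) * exp(-mu)

# As n grows, the Bin(n, lam*t/n) probabilities approach the Pois(lam*t) ones.
for n in (10, 100, 1000, 10000):
    p = mu / n
    print(n, [round(binom_pmf(k, n, p), 4) for k in range(5)])
print("Pois", [round(poisson_pmf(k, mu), 4) for k in range(5)])
```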
Definition. A discrete random variable X has a Poisson distribution with parameter µ, where µ > 0, if its probability mass function p is given by
$$p(k) = P(X = k) = \frac{\mu^k}{k!}\, e^{-\mu} \quad\text{for } k = 0, 1, 2, \dots.$$
We denote this distribution by Pois(µ).

Figure 12.1 displays the graphs of the probability mass functions of the Poisson distribution with µ = 0.9 (left) and the Poisson distribution with µ = 5 (right).

Fig. 12.1. The probability mass functions of the Pois(0.9) and the Pois(5) distributions.

Quick exercise 12.1 Consider the event “exactly one call arrives in the interval [0, 2s].” The probability of this event is $P(N_{2s} = 1) = \lambda \cdot 2s \cdot e^{-\lambda\cdot 2s}$. But note that this event is the same as “there is exactly one call in the interval [0, s) and no calls in the interval [s, 2s], or no calls in [0, s) and exactly one call in [s, 2s].” Verify (using assumptions 1 and 2) that you get the same answer if you compute the probability of the event in this way.

We do have a hint¹ about what the expectation and variance of a Poisson random variable might be: since E[Nt] = λt for all n, we anticipate that the limiting Poisson distribution will have expectation λt. Similarly, since Nt has a Bin(n, λt/n) distribution, we anticipate that the variance will be

¹ This is really not more than a hint: there are simple examples where the distributions of random variables converge to a distribution whose expectation is different from the limit of the expectations of the distributions! (cf. Exercise 12.14).
$$\lim_{n\to\infty} \mathrm{Var}(N_t) = \lim_{n\to\infty} n \cdot \frac{\lambda t}{n} \cdot \left(1 - \frac{\lambda t}{n}\right) = \lambda t.$$
Actually, the expectation of a Poisson random variable X with parameter µ is easy to compute:
$$E[X] = \sum_{k=0}^{\infty} k\, \frac{\mu^k}{k!}\, e^{-\mu} = e^{-\mu} \sum_{k=1}^{\infty} \frac{\mu^k}{(k-1)!} = \mu\, e^{-\mu} \sum_{k=1}^{\infty} \frac{\mu^{k-1}}{(k-1)!} = \mu\, e^{-\mu} \sum_{j=0}^{\infty} \frac{\mu^j}{j!} = \mu.$$
In a similar way the variance can be determined (see Exercise 12.8), and we arrive at the following rule.

The expectation and variance of a Poisson distribution. Let X have a Poisson distribution with parameter µ; then
$$E[X] = \mu \quad\text{and}\quad \mathrm{Var}(X) = \mu.$$

12.3 The one-dimensional Poisson process

We will derive some properties of the sequence of random points X1, X2, . . . that we considered in the previous section. What we derived so far is that for any interval (s, s + t] the number N((s, s + t]) of points Xi in that interval is a random variable with a Pois(λt) distribution.

Interarrival times

The differences Ti = Xi − Xi−1 are called interarrival times. Here we define T1 = X1, the time of the first arrival. To determine the probability distribution of T1, we observe that the event {T1 > t} that the first call arrives after time t is the same as the event {Nt = 0} that no calls have been made in [0, t]. But this implies that
$$P(T_1 \le t) = 1 - P(T_1 > t) = 1 - P(N_t = 0) = 1 - e^{-\lambda t}.$$
Therefore T1 has an exponential distribution with parameter λ. To compute the joint distribution of T1 and T2, we consider the conditional probability that T2 > t, given that T1 = s, and use the property that arrivals in different intervals are independent:
$$P(T_2 > t \mid T_1 = s) = P(\text{no arrivals in } (s, s + t] \mid T_1 = s) = P(\text{no arrivals in } (s, s + t]) = P(N((s, s + t]) = 0) = e^{-\lambda t}.$$
Since this answer does not depend on s, we conclude that T1 and T2 are independent, and
$$P(T_2 > t) = e^{-\lambda t},$$
i.e., T2 also has an exponential distribution with parameter λ. Actually, although the conclusion is correct, the method to derive it is not, because we conditioned on the event {T1 = s}, which has zero probability. This problem could be circumvented by conditioning on the event that T1 lies in some small interval, but that will not be done here. Analogously, one can show that the Ti are independent and have an Exp(λ) distribution. This nice property allows us to give a simple definition of the one-dimensional Poisson process.

Definition. The one-dimensional Poisson process with intensity λ is a sequence X1, X2, X3, . . . of random variables having the property that the interarrival times X1, X2 − X1, X3 − X2, . . . are independent random variables, each with an Exp(λ) distribution.

Note that the connection with Nt is as follows: Nt is equal to the number of Xi that are smaller than (or equal to) t.

Quick exercise 12.2 We model the arrivals of email messages at a server as a Poisson process. Suppose that on average 330 messages arrive per minute. What would you choose for the intensity λ in messages per second? What is the expectation of the interarrival time?

An obvious question is: what is the distribution of Xi? This has already been answered in Chapter 11: since Xi is a sum of i independent exponentially distributed random variables, we have the following.

The points of the Poisson process. For i = 1, 2, . . . the random variable Xi has a Gam(i, λ) distribution.
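The definition also tells us how to simulate the process: add up independent Exp(λ) interarrival times. The following Python sketch (not from the book; the parameter values and the function name are arbitrary) generates arrivals this way and checks that the number of points in [0, t] indeed behaves like a Pois(λt) random variable, with mean and variance both close to λt.

```python
import random
from statistics import mean, variance

random.seed(2)
lam, t = 0.5, 10.0   # intensity and time horizon (arbitrary example values)

def number_of_points(lam, t):
    """Generate points by adding independent Exp(lam) interarrival times
    and count how many fall in [0, t]."""
    arrival, count = 0.0, 0
    while True:
        arrival += random.expovariate(lam)
        if arrival > t:
            return count
        count += 1

counts = [number_of_points(lam, t) for _ in range(5000)]
print(mean(counts), variance(counts))   # both should be close to lam * t = 5
```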
The distribution of points

Another interesting question is: if we know that n points are generated in an interval, where do these points lie? Since the distribution of the number of points only depends on the length of the interval, and not on its location, it suffices to determine this for an interval starting at 0. Let this interval be [0, a]. We start with the simplest case, where there is one point in [0, a]: suppose that N([0, a]) = 1. Then, for 0 < s < a:
$$P(X_1 \le s \mid N([0, a]) = 1) = \frac{P(X_1 \le s,\, N([0, a]) = 1)}{P(N([0, a]) = 1)} = \frac{P(N([0, s]) = 1,\, N((s, a]) = 0)}{P(N([0, a]) = 1)} = \frac{\lambda s e^{-\lambda s}\, e^{-\lambda(a - s)}}{\lambda a e^{-\lambda a}} = \frac{s}{a}.$$
We find that conditional on the event {N([0, a]) = 1}, the random variable X1 is uniformly distributed over the interval [0, a].
Now suppose that it is given that there are two points in [0, a]: N([0, a]) = 2. In a way similar to what we did for one point, we can show that (see Exercise 12.12)
$$P(X_1 \le s,\, X_2 \le t \mid N([0, a]) = 2) = \frac{t^2 - (t - s)^2}{a^2}.$$
Now recall the result of Exercise 9.17: if U1 and U2 are two independent random variables, both uniformly distributed over [0, a], then the joint distribution function of V = min(U1, U2) and Z = max(U1, U2) is given by
$$P(V \le s,\, Z \le t) = \frac{t^2 - (t - s)^2}{a^2} \quad\text{for } 0 \le s \le t \le a.$$
Thus we have found that, if we forget about their order, the two points in [0, a] are independent and uniformly distributed over [0, a]. With somewhat more work, this generalizes to an arbitrary number of points, and we arrive at the following property.

Location of the points, given their number. Given that the Poisson process has n points in the interval [a, b], the locations of these points are independently distributed, each with a uniform distribution on [a, b].

12.4 Higher-dimensional Poisson processes

Our definition of the one-dimensional Poisson process, starting with the interarrival times, does not generalize easily, because it is based on the ordering of the real numbers. However, we can easily extend the assumptions of independence, homogeneity, and the Poisson distribution property. To do this we need a higher-dimensional version of the concept of length. We denote the k-dimensional volume of a set A in k-dimensional space by m(A). For instance, in the plane m(A) is the area of A, and in space m(A) is the volume of A.
Definition. The k-dimensional Poisson process with intensity λ is a collection X1, X2, X3, . . . of random points having the property that if N(A) denotes the number of points in the set A, then
1. (Homogeneity) The random variable N(A) has a Poisson distribution with parameter λm(A).
2. (Independence) For disjoint sets A1, A2, . . . , An the random variables N(A1), N(A2), . . . , N(An) are independent.

Quick exercise 12.3 Suppose that the locations of defects in a certain type of material follow the two-dimensional Poisson process model. For this material it is known that it contains on average five defects per square meter. What is the probability that a strip of length 2 meters and width 5 cm will be without defects?

In Figure 7.4 the locations of the buildings the architect wanted to distribute over a 100-by-300-m terrain have been generated by a two-dimensional Poisson process. This has been done in the following way. One can again show that given the total number of points in a set, these points are uniformly distributed over the set. This leads to the following procedure: first one generates a value n from a Poisson distribution with the appropriate parameter (λ times the area), then one generates n times a point uniformly distributed over the 100-by-300 rectangle.
Actually one can generate a higher-dimensional Poisson process in a way that is very similar to the natural way this can be done for the one-dimensional process. Directly from the definition of the one-dimensional process we see that it can be obtained by consecutively generating points with exponentially distributed gaps. We will explain a similar procedure for dimension two.
For s > 0, let Ms = N(Cs), where Cs is the circular region of radius s, centered at the origin. Since Cs has area πs², Ms has a Poisson distribution with parameter λπs². Let Ri denote the distance of the ith closest point to the origin. This is illustrated in Figure 12.2. Note that Ri is the analogue of the ith arrival time for the one-dimensional Poisson process: we have in fact that Ri ≤ s if and only if Ms ≥ i. In particular, with i = 1 and s = √t,
$$P\!\left(R_1^2 \le t\right) = P\!\left(R_1 \le \sqrt{t}\right) = P\!\left(M_{\sqrt{t}} > 0\right) = 1 - e^{-\lambda\pi t}.$$
In other words: R1² is Exp(λπ) distributed. For general i, we can similarly write
$$P\!\left(R_i^2 \le t\right) = P\!\left(R_i \le \sqrt{t}\right) = P\!\left(M_{\sqrt{t}} \ge i\right).$$
Fig. 12.2. The Poisson process in the plane, with the two circles of the two points closest to the origin.

So
$$P\!\left(R_i^2 \le t\right) = 1 - e^{-\lambda\pi t} \sum_{j=0}^{i-1} \frac{(\lambda\pi t)^j}{j!},$$
which means that Ri² has a Gam(i, λπ) distribution—as we saw on page 157. Since gamma distributions arise as sums of independent exponential distributions, we can also write
$$R_i^2 = R_{i-1}^2 + T_i,$$
where the Ti are independent Exp(λπ) random variables (and where R0 = 0). Note that this is quite similar to the one-dimensional case. To simulate the two-dimensional Poisson process from a sequence U1, U2, . . . of independent U(0, 1) random variables, one can therefore proceed as follows (recall from Section 6.2 that −(1/λ) ln(Ui) has an Exp(λ) distribution): for i = 1, 2, . . . put
$$R_i = \sqrt{R_{i-1}^2 - \frac{1}{\lambda\pi} \ln(U_{2i})};$$
this gives the distance of the ith point to the origin, and then put the point on this circle according to an angle value generated by 2πU2i−1. This is the correct way to do it, because one can show that in polar coordinates the radius and the angle of a Poisson process point are independent of each other, and the angle is uniformly distributed over [0, 2π]. The latter is called the isotropy property of the Poisson process.
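A minimal Python sketch of this radial construction (not from the book; the intensity and the disc radius are arbitrary illustration values, and random.expovariate is used as a stand-in for the explicit −(1/λπ) ln(U) step described above):

```python
import math, random

def poisson_points_in_disc(lam, radius):
    """Radial construction: squared distances to the origin grow by independent
    Exp(lam*pi) steps; angles are independent and uniform on [0, 2*pi)."""
    points, r_sq = [], 0.0
    while True:
        r_sq += random.expovariate(lam * math.pi)   # same law as -(1/(lam*pi)) ln(U)
        r = math.sqrt(r_sq)
        if r > radius:          # stop once we leave the disc of interest
            return points
        theta = 2 * math.pi * random.random()
        points.append((r * math.cos(theta), r * math.sin(theta)))

random.seed(1)
lam, radius = 5.0, 1.0
pts = poisson_points_in_disc(lam, radius)
# The count should be roughly Pois(lam * pi * radius**2), i.e., expectation about 15.7.
print(len(pts))
```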
12.5 Solutions to the quick exercises

12.1 The probability of exactly one call in [0, s) and no calls in [s, 2s] equals
$$P(N([0, s)) = 1,\, N([s, 2s]) = 0) = P(N([0, s)) = 1)\, P(N([s, 2s]) = 0) = P(N([0, s)) = 1)\, P(N([0, s]) = 0) = \lambda s e^{-\lambda s} \cdot e^{-\lambda s},$$
because of independence and homogeneity. In the same way, the probability of exactly one call in [s, 2s] and no calls in [0, s) is equal to $e^{-\lambda s} \cdot \lambda s e^{-\lambda s}$. And indeed:
$$\lambda s e^{-\lambda s} \cdot e^{-\lambda s} + e^{-\lambda s} \cdot \lambda s e^{-\lambda s} = 2\lambda s e^{-\lambda\cdot 2s}.$$

12.2 Because there are 60 seconds in a minute, we have 60λ = 330. It follows that λ = 5.5. Since the interarrival times have an Exp(λ) distribution, the expected time between messages is 1/λ ≈ 0.18 second.

12.3 The intensity of this process is λ = 5 per m². The area of the strip is 2 · (1/20) = 1/10 m². Hence the probability that no defects occur in the strip is
$$e^{-\lambda\cdot(\text{area of strip})} = e^{-5\cdot(1/10)} = e^{-1/2} \approx 0.61.$$

12.6 Exercises

12.1 In each of the following examples, try to indicate whether the Poisson process would be a good model.
a. The times of bankruptcy of enterprises in the United States.
b. The times a chicken lays its eggs.
c. The times of airplane crashes in a worldwide registration.
d. The locations of wrongly spelled words in a book.
e. The times of traffic accidents at a crossroad.

12.2 The number of customers that visit a bank on a day is modeled by a Poisson distribution. It is known that the probability of no customers at all is 0.00001. What is the expected number of customers?

12.3 Let N have a Pois(4) distribution. What is P(N = 4)?

12.4 Let X have a Pois(2) distribution. What is P(X ≤ 1)?

12.5 The number of errors on a hard disk is modeled as a Poisson random variable with expectation one error in every Mb, that is, in every $2^{20}$ bytes.
a. What is the probability of at least one error in a sector of 512 bytes?
b. The hard disk is an 18.62-Gb disk drive with 39 054 015 sectors. What is the probability of at least one error on the hard disk?
12.6 A certain brand of copper wire has flaws about every 40 centimeters. Model the locations of the flaws as a Poisson process. What is the probability of two flaws in 1 meter of wire?

12.7 The Poisson model is sometimes used to study the flow of traffic ([15]). If the traffic can flow freely, it behaves like a Poisson process. A 20-minute time interval is divided into 10-second time slots. At a certain point along the highway the number of passing cars is registered for each 10-second time slot. Let nj be the number of slots in which j cars have passed for j = 0, . . . , 9. Suppose that one finds

j    0   1   2   3   4   5   6   7   8   9
nj   19  38  28  20  7   3   4   0   0   1

Note that the total number of cars passing in these 20 minutes is 230.
a. What would you choose for the intensity parameter λ?
b. Suppose one estimates the probability of 0 cars passing in a 10-second time slot by n0 divided by the total number of time slots. Does that (reasonably) agree with the value that follows from your answer in a?
c. What would you take for the probability that 10 cars pass in a 10-second time slot?

12.8 Let X be a Poisson random variable with parameter µ.
a. Compute E[X(X − 1)].
b. Compute Var(X), using that Var(X) = E[X(X − 1)] + E[X] − (E[X])².

12.9 Let Y1 and Y2 be independent Poisson random variables with parameter µ1, respectively µ2. Show that Y = Y1 + Y2 also has a Poisson distribution. Instead of using the addition rule in Section 11.1 as in Exercise 11.2, you can prove this without doing any computations by considering the number of points of a Poisson process (with intensity 1) in two disjoint intervals of length µ1 and µ2.

12.10 Let X be a random variable with a Pois(µ) distribution. Show the following. If µ < 1, then the probabilities P(X = k) are strictly decreasing in k. If µ > 1, then the probabilities P(X = k) are first increasing, then decreasing (cf. Figure 12.1). What happens if µ = 1?

12.11 Consider the one-dimensional Poisson process with intensity λ. Show that the number of points in [0, t], given that the number of points in [0, 2t] is equal to n, has a Bin(n, ½) distribution. Hint: write the event {N([0, s]) = k, N([0, 2s]) = n} as the intersection of the (independent!) events {N([0, s]) = k} and {N((s, 2s]) = n − k}.
12.12 We consider the one-dimensional Poisson process. Suppose for some a > 0 it is given that there are exactly two points in [0, a], or in other words: Na = 2. The goal of this exercise is to determine the joint distribution of X1 and X2, the locations of the two points, conditional on Na = 2.
a. Prove that for 0 < s < t < a
$$P(X_1 \le s,\, X_2 \le t,\, N_a = 2) = P(X_2 \le t,\, N_a = 2) - P(X_1 > s,\, X_2 \le t,\, N_a = 2).$$
b. Deduce from a that
$$P(X_1 \le s,\, X_2 \le t,\, N_a = 2) = e^{-\lambda a} \left( \frac{\lambda^2 t^2}{2!} - \frac{\lambda^2 (t - s)^2}{2!} \right).$$
c. Deduce from b that for 0 < s < t < a
$$P(X_1 \le s,\, X_2 \le t \mid N_a = 2) = \frac{t^2 - (t - s)^2}{a^2}.$$

12.13 Walking through a meadow we encounter two kinds of flowers, daisies and dandelions. As we walk in a straight line, we model the positions of the flowers we encounter with a one-dimensional Poisson process with intensity λ. It appears that about one in every four flowers is a daisy. Forgetting about the dandelions, what does the process of the daisies look like? This question will be answered with the following steps.
a. Let Nt be the total number of flowers, Xt the number of daisies, and Yt be the number of dandelions we encounter during the first t minutes of our walk. Note that Xt + Yt = Nt. Suppose that each flower is a daisy with probability 1/4, independent of the other flowers. Argue that
$$P(X_t = n,\, Y_t = m \mid N_t = n + m) = \binom{n + m}{n} \left(\frac{1}{4}\right)^{\!n} \left(\frac{3}{4}\right)^{\!m}.$$
b. Show that
$$P(X_t = n,\, Y_t = m) = \frac{1}{n!}\, \frac{1}{m!} \left(\frac{1}{4}\right)^{\!n} \left(\frac{3}{4}\right)^{\!m} e^{-\lambda t} (\lambda t)^{n + m},$$
by conditioning on Nt and using a.
c. By writing $e^{-\lambda t} = e^{-(\lambda/4)t}\, e^{-(3\lambda/4)t}$ and summing over m, show that
$$P(X_t = n) = \frac{1}{n!}\, e^{-(\lambda/4)t} \left(\frac{\lambda t}{4}\right)^{\!n}.$$
Since it is clear that the numbers of daisies that we encounter in disjoint time intervals are independent, we may conclude from c that the process (Xt) is again a Poisson process, with intensity λ/4. One often says that the process (Xt) is obtained by thinning the process (Nt). In our example this corresponds to picking all the dandelions.
12.14 In this exercise we look at a simple example of random variables Xn that have the property that their distributions converge to the distribution of a random variable X as n → ∞, while it is not true that their expectations converge to the expectation of X. Let for n = 1, 2, . . . the random variables Xn be defined by
$$P(X_n = 0) = 1 - \frac{1}{n} \quad\text{and}\quad P(X_n = 7n) = \frac{1}{n}.$$
a. Let X be the random variable that is equal to 0 with probability 1. Show that for all a the probability mass functions pXn(a) of the Xn converge to the probability mass function pX(a) of X as n → ∞. Note that E[X] = 0.
b. Show that nonetheless E[Xn] = 7 for all n.
13 The law of large numbers

For many experiments and observations concerning natural phenomena—such as measuring the speed of light—one finds that performing the procedure twice under (what seem) identical conditions results in two different outcomes. Uncontrollable factors cause “random” variation. In practice one tries to overcome this as follows: the experiment is repeated a number of times and the results are averaged in some way. In this chapter we will see why this works so well, using a model for repeated measurements. We view them as a sequence of independent random variables, each with the same unknown distribution. It is a probabilistic fact that from such a sequence—in principle—any feature of the distribution can be recovered. This is a consequence of the law of large numbers.

13.1 Averages vary less

Scientists and engineers involved in experimental work have known for centuries that more accurate answers are obtained when measurements or experiments are repeated a number of times and one averages the individual outcomes.¹ For example, if you read a description of A.A. Michelson’s work done in 1879 to determine the speed of light, you would find that for each value he collected, repeated measurements at several levels were performed. In an article in Statistical Science describing his work ([18]), R.J. MacKay and R.W. Oldford state: “It is clear that Michelson appreciated the power of averaging to reduce variability in measurement.” We shall see that we can understand this reduction using only what we have learned so far about probability in combination with a simple inequality called Chebyshev’s inequality.
Throughout this chapter we consider a sequence of random variables X1, X2, X3, . . . . You should think of Xi as the result of the ith repetition of a particular measurement or experiment. We confine ourselves to the situation where

¹ We leave the problem of systematic errors aside but will return to it in Chapter 19.
experimental conditions of subsequent experiments are identical, and the outcome of any one experiment does not influence the outcomes of others. Under those circumstances, the random variables of the sequence are independent, and all have the same distribution, and we therefore call X1, X2, X3, . . . an independent and identically distributed sequence. We shall denote the distribution function of each random variable Xi by F, its expectation by µ, and the standard deviation by σ.
The average of the first n random variables in the sequence is
$$\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n},$$
and using linearity of expectations we find:
$$E\!\left[\bar{X}_n\right] = \frac{1}{n}\, E[X_1 + X_2 + \cdots + X_n] = \frac{1}{n}(\mu + \mu + \cdots + \mu) = \mu.$$
By the variance-of-the-sum rule, using the independence of X1, . . . , Xn,
$$\mathrm{Var}\!\left(\bar{X}_n\right) = \frac{1}{n^2}\, \mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \frac{1}{n^2}(\sigma^2 + \sigma^2 + \cdots + \sigma^2) = \frac{\sigma^2}{n}.$$
This establishes the following rule.

Expectation and variance of an average. If $\bar{X}_n$ is the average of n independent random variables with the same expectation µ and variance σ², then
$$E\!\left[\bar{X}_n\right] = \mu \quad\text{and}\quad \mathrm{Var}\!\left(\bar{X}_n\right) = \frac{\sigma^2}{n}.$$

The expectation of $\bar{X}_n$ is again µ, and its standard deviation is less than that of a single Xi by a factor √n; the “typical distance” from µ is √n smaller. The latter property is what Michelson used to gain accuracy.
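This rule is easy to illustrate by simulation. The Python sketch below (not from the book; the sample sizes and the number of repetitions are arbitrary choices) generates many averages of n Gam(2, 1) variables, whose variance is σ² = 2, and compares their sample variance with σ²/n.

```python
import random
from statistics import mean, variance

random.seed(0)
mu, sigma2 = 2.0, 2.0            # expectation and variance of Gam(2, 1)

def average_of_sample(n):
    """One realization of the average of n independent Gam(2, 1) variables."""
    return mean(random.gammavariate(2, 1.0) for _ in range(n))

# The sample variance of many such averages should be close to sigma2 / n.
for n in (1, 4, 16, 100):
    averages = [average_of_sample(n) for _ in range(2000)]
    print(n, round(variance(averages), 3), "theory:", round(sigma2 / n, 3))
```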
To illustrate this, we analyze an example. Suppose the random variables X1, X2, . . . are continuous with a Gam(2, 1) distribution, so with probability density:
$$f(x) = x e^{-x} \quad\text{for } x \ge 0.$$
Recall from Section 11.2 that this means that each Xi is distributed as the sum of two independent Exp(1) random variables. Hence, Sn = X1 + · · · + Xn is distributed as the sum of 2n independent Exp(1) random variables, which has a Gam(2n, 1) distribution, with probability density
$$f_{S_n}(x) = \frac{x^{2n-1} e^{-x}}{(2n - 1)!} \quad\text{for } x \ge 0.$$
Because $\bar{X}_n = S_n/n$, we find by applying the change-of-units rule (page 106):
$$f_{\bar{X}_n}(x) = n f_{S_n}(nx) = n\, \frac{(nx)^{2n-1} e^{-nx}}{(2n - 1)!} \quad\text{for } x \ge 0.$$
This is the probability density of the Gam(2n, n) distribution. So we have determined the distribution of $\bar{X}_n$ explicitly and we can investigate what happens as n increases, for example, by plotting probability densities. In the left-hand column of Figure 13.1 you see plots of $f_{\bar{X}_n}$ for n = 1, 2, 4, 9, 16, and 400 (note that for n = 1 this is just f itself).
For comparison, we take as a second example a so-called bimodal density function: a density with two bumps, formally called modes. For the same values of n we determined the probability density function of $\bar{X}_n$ (unlike the previous example, we are not concerned with the computations, just with the results). The graphs of these densities are given side by side with the gamma densities in Figure 13.1. The graphs clearly show that, as n increases, there is “contraction” of the probability mass near the expected value µ (for the gamma densities this is 2, for the bimodal densities 2.625).

Quick exercise 13.1 Compare the probabilities that $\bar{X}_n$ is within 0.5 of its expected value for n = 1, 4, 16, and 400. Do this for the gamma case only by estimating the probabilities from the graphs in the left-hand column of Figure 13.1.

13.2 Chebyshev’s inequality

The contraction of probability mass near the expectation is a consequence of the fact that, for any probability distribution, most probability mass is within a few standard deviations from the expectation. To show this we will employ the following tool, which provides a bound for the probability that the random variable Y is outside the interval (E[Y] − a, E[Y] + a).

Chebyshev’s inequality. For an arbitrary random variable Y and any a > 0:
$$P(|Y - E[Y]| \ge a) \le \frac{1}{a^2}\, \mathrm{Var}(Y).$$

We shall derive this inequality for continuous Y (the discrete case is similar). Let fY be the probability density function of Y. Let µ denote E[Y]. Then:
$$\mathrm{Var}(Y) = \int_{-\infty}^{\infty} (y - \mu)^2 f_Y(y)\, dy \ge \int_{|y - \mu| \ge a} (y - \mu)^2 f_Y(y)\, dy \ge \int_{|y - \mu| \ge a} a^2 f_Y(y)\, dy = a^2\, P(|Y - \mu| \ge a).$$
Fig. 13.1. Densities of averages. Left column: from a gamma density; right column: from a bimodal density.
Dividing both sides of the resulting inequality by a², we obtain Chebyshev’s inequality.
Denote Var(Y) by σ² and consider the probability that Y is within a few standard deviations from its expectation µ:
$$P(|Y - \mu| < k\sigma) = 1 - P(|Y - \mu| \ge k\sigma),$$
where k is a small integer. Setting a = kσ in Chebyshev’s inequality, we find
$$P(|Y - \mu| < k\sigma) \ge 1 - \frac{\mathrm{Var}(Y)}{k^2\sigma^2} = 1 - \frac{1}{k^2}. \qquad (13.1)$$
For k = 2, 3, 4 the right-hand side is 3/4, 8/9, and 15/16, respectively. This suggests that with Chebyshev’s inequality we can make very strong statements. For most distributions, however, the actual value of P(|Y − µ| < kσ) is even higher than the lower bound (13.1). We summarize this as a somewhat loose rule.

The “µ ± a few σ” rule. Most of the probability mass of a random variable is within a few standard deviations from its expectation.

Quick exercise 13.2 Calculate P(|Y − µ| < kσ) exactly for k = 1, 2, 3, 4 when Y has an Exp(1) distribution and compare this with the bounds from Chebyshev’s inequality.
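As a rough numerical illustration of how conservative the bound (13.1) usually is (only a sketch in the spirit of Quick exercise 13.2, not the book's solution), one can compare the exact probabilities for the Exp(1) distribution, where µ = σ = 1, with the Chebyshev lower bounds:

```python
from math import exp

# Y ~ Exp(1): mu = sigma = 1 and F(y) = 1 - exp(-y) for y >= 0.
mu = sigma = 1.0
for k in (1, 2, 3, 4):
    lower, upper = mu - k * sigma, mu + k * sigma
    exact = (1 - exp(-upper)) - (1 - exp(-max(lower, 0.0)))  # P(|Y - mu| < k*sigma)
    chebyshev = max(0.0, 1 - 1 / k**2)                       # lower bound (13.1)
    print(k, round(exact, 4), ">=", chebyshev)
```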
13.3 The law of large numbers

We return to the independent and identically distributed sequence of random variables X1, X2, . . . with expectation µ and variance σ². We apply Chebyshev’s inequality to the average $\bar{X}_n$, where we use $E[\bar{X}_n] = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$, and where ε > 0:
$$P\!\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = P\!\left(\left|\bar{X}_n - E\!\left[\bar{X}_n\right]\right| > \varepsilon\right) \le \frac{1}{\varepsilon^2}\, \mathrm{Var}\!\left(\bar{X}_n\right) = \frac{\sigma^2}{n\varepsilon^2}.$$
The right-hand side vanishes as n goes to infinity, no matter how small ε is. This proves the following law.

The law of large numbers. If $\bar{X}_n$ is the average of n independent random variables with expectation µ and variance σ², then for any ε > 0:
$$\lim_{n\to\infty} P\!\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0.$$

A connection with experimental work

Let us try to interpret the law of large numbers from an experimenter’s perspective. Imagine you conduct a series of experiments. The experimental setup is complicated and your measurements vary quite a bit around the “true” value you are after. Suppose (unknown to you) your measurements have a gamma distribution, and its expectation is what you want to determine. You decide to do a certain number of measurements, say n, and to use their average as your estimate of the expectation.
We can simulate all this, and Figure 13.2 shows the results of a simulation, where we chose the same Gam(2, 1) distribution, i.e., with expectation µ = 2. We anticipated that you might want to do as many as 500 measurements, so we generated realizations for X1, X2, . . . , X500. For each n we computed the average of the first n values and plotted these averages against n in Figure 13.2.

Fig. 13.2. Averages of realizations of a sequence of gamma distributed random variables.

If your decision is to do 200 repetitions, you would find (in this simulation) a value of about 2.09 (slightly too high, but you wouldn’t know!), whereas with n = 400 you would be almost exactly correct with 1.99, and with n = 500 again a little farther away with 2.06. For another sequence of realizations, the details in the pattern that you see in Figure 13.2 would be different, but the general dampening of the oscillations would still be present. This follows from what we saw earlier, that as n is larger, the probability for the average to be within a certain distance of the expectation increases, in the limit even to 1. In practice it may happen that with a large number of repetitions your average is farther from the “true” value than with a smaller number of repetitions—if it is, then you had bad luck, because the odds are in your favor.
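A simulation in the spirit of Figure 13.2 takes only a few lines of Python (not from the book; the seed and the checkpoints are arbitrary choices):

```python
import random

random.seed(3)
# Running averages of realizations of independent Gam(2, 1) random variables.
running_sum, averages = 0.0, []
for n in range(1, 501):
    running_sum += random.gammavariate(2, 1.0)
    averages.append(running_sum / n)

for n in (10, 50, 200, 400, 500):
    print(n, round(averages[n - 1], 3))   # wanders around mu = 2 and gradually settles
```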
The averages may fail to converge

The law of large numbers is valid if the expectation of the distribution F is finite. This is not always the case. For example, the Cauchy and some Pareto distributions have heavy tails: their probability densities do go to 0 as x becomes large, but (too) slowly.² On the left in Figure 13.3 you see the result of a simulation with Cau(2, 1) random variables. As in the gamma case, the averages tend to go toward 2 (which is the point of symmetry of the Cau(2, 1) density), but once in a while a very large (positive or negative) realization of an Xi throws off the average.

Fig. 13.3. Averages of realizations of a sequence of Cauchy (at left) and Pareto (at right) distributed random variables.

On the right in Figure 13.3 the result of a simulation with a Par(0.99) distribution is shown. Its expectation is infinite. In the plot we see segments where the average “drifts downward,” separated by upward jumps, which correspond to Xi with extremely large values. The effect of the jumps dominates: it can be shown that $\bar{X}_n$ grows beyond any level.
You might think that these patterns are phenomena that occur because of the short length of the simulation and that in longer simulations they would disappear after some value of n.
However, the patterns as described will con- tinue to occur and the results of a longer simulation, say to n = 5000, would not look any “better.” Remark 13.1 (There is a stronger law of large numbers). Even though it is a strong statement, the law of large numbers in this paragraph is more accurately known as the weak law of large numbers. A stronger result holds, the strong law of large numbers, which says that: 2 They represent two separate cases: the Cauchy expectation does not exist (see Remark 7.1) and the Par(α)’s expectation is +∞ if α ≤ 1 (see Section 7.2).
P( lim n→∞ X̄n = µ ) = 1.

This is also expressed as "as n goes to infinity, X̄n converges to µ with probability 1." It is not easy to see, but it is true that the strong law is actually stronger than the weak law. The conditions for the law of large numbers as stated in this section suffice for both versions of the law, but they can be relaxed; if they are weakened far enough, the weak law still follows from them while the strong law no longer does: the strong law requires the stronger conditions.

13.4 Consequences of the law of large numbers

We continue with the sequence X1, X2, . . . of independent random variables with distribution function F. In the previous section we saw how we could recover the (unknown) expectation µ from a realization of the sequence. We shall see that in fact we can recover any feature of the probability distribution. In order to avoid unnecessary indices, as in E[X1] and P(X1 ∈ C), we introduce an additional random variable X that also has F as its distribution function.

Recovering the probability of an event

Suppose that, rather than being interested in µ = E[X], we want to know the probability of an event, for example,

p = P(X ∈ C), where C = (a, b] for some a < b.

If you do not know this probability p, you would probably estimate it from how often the event {Xi ∈ C} occurs in the sequence. You would use the relative frequency of Xi ∈ C among X1, . . . , Xn: the number of times the set C was hit divided by n. Define for each i:

Yi = 1 if Xi ∈ C, and Yi = 0 if Xi ∉ C.

The random variable Yi indicates whether the corresponding Xi hits the set C; it is called an indicator random variable. In general, an indicator random variable for an event A is a random variable that is 1 when A occurs and 0 when Aᶜ occurs. Using this terminology, Yi is the indicator random variable of the event {Xi ∈ C}. Its expectation is given by

E[Yi] = 1 · P(Xi ∈ C) + 0 · P(Xi ∉ C) = P(Xi ∈ C) = P(X ∈ C) = p.

Using the Yi, the relative frequency is expressed as (Y1 + Y2 + · · · + Yn)/n = Ȳn. Note that the random variables Y1, Y2, . . . are independent; the Xi form an independent sequence, and Yi is determined from Xi only (this is an application of the rule about propagation of independence; see page 126).
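As a small numerical illustration (not from the book), the following Python sketch computes the relative frequency Ȳn for a Gam(2, 1) sample and an arbitrarily chosen interval C = (1, 3]; the sample size and the seed are also arbitrary choices.

import random

random.seed(42)                  # arbitrary seed
a, b = 1.0, 3.0                  # C = (a, b], an arbitrary interval
n = 10_000                       # number of realizations, arbitrary

xs = [random.gammavariate(2.0, 1.0) for _ in range(n)]

# Y_i = 1 if X_i lands in C, 0 otherwise (indicator random variables)
ys = [1 if a < x <= b else 0 for x in xs]

# relative frequency Ȳn = (Y_1 + ... + Y_n)/n estimates p = P(X in C)
p_hat = sum(ys) / n
print(f"relative frequency of C = ({a}, {b}]: {p_hat:.3f}")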
The law of large numbers, with p in the role of µ, can now be applied to Ȳn; it is the average of n independent random variables with expectation p and variance p(1 − p), so

lim n→∞ P(|Ȳn − p| > ε) = 0     (13.2)

for any ε > 0. By reasoning along the same lines as in the previous section, we see that from a long sequence of realizations we can get an accurate estimate of the probability p.

Recovering the probability density function

Consider the continuous case, where f is the probability density function corresponding with F, and now choose C = (a − h, a + h], for some (small) positive h. By equation (13.2), for large n:

Ȳn ≈ p = P(X ∈ C) = ∫_{a−h}^{a+h} f(x) dx ≈ 2h·f(a).     (13.3)

This relationship suggests estimating the probability density at a as follows:

f(a) ≈ Ȳn/(2h) = (the number of times Xi ∈ C for i ≤ n) / (n · the length of C).

In Figure 13.4 we have done so for h = 0.25 and two values of a: 2 and 4. Rather than plotting the estimate in just one point, we use the same value for the whole interval (a − h, a + h]. This results in a vertical bar, whose area corresponds to Ȳn:

height · width = (Ȳn/(2h)) · 2h = Ȳn.

These estimates are based on the realizations of 500 independent Gam(2, 1) distributed random variables.

Fig. 13.4. Estimating the density at two points.
In order to be able to see how well things came out, the Gam(2, 1) density function is shown as well; near a = 2 the estimate is very accurate, but around a = 4 it is a little too low.

There really is no reason to derive estimated values around just a few points, as is done in Figure 13.4. We might as well cover the whole x-axis with a grid (with grid size 2h) and do the computation for each point in the grid, thus covering the axis with a series of bars. The resulting bar graph is called a histogram. Figure 13.5 shows the result for two sets of realizations.

Fig. 13.5. Recovering the density function by way of histograms.

The top graph is constructed from the same realizations as Figure 13.4 and the bottom graph is constructed from a new set of realizations. Both graphs match the general shape of the density, with some bumps and valleys that are particular for the corresponding set of realizations. In Chapters 15 and 17 we shall return to histograms and treat them more elaborately.
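The histogram construction just described can be carried out in a few lines. The sketch below (Python, not part of the book) computes the bar height Ȳn/(2h) with h = 0.25 at a grid point a from 500 simulated Gam(2, 1) realizations and compares it with the true density f(x) = x e^(−x); the seed and the chosen grid points are arbitrary.

import math
import random

random.seed(7)                   # arbitrary seed
n, h = 500, 0.25                 # sample size and half bin width, as in the text
xs = [random.gammavariate(2.0, 1.0) for _ in range(n)]

def bar_height(a):
    """Estimate f(a) by the relative frequency of (a-h, a+h] divided by 2h."""
    count = sum(1 for x in xs if a - h < x <= a + h)
    return (count / n) / (2 * h)

def gam21_density(x):
    """The Gam(2, 1) probability density f(x) = x e^(-x) for x > 0."""
    return x * math.exp(-x)

for a in (2.0, 4.0):
    print(f"a = {a}: estimate {bar_height(a):.3f}, true f(a) = {gam21_density(a):.3f}")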
Quick exercise 13.3 The height of the bar at x = 2 in the first histogram is 0.26. How many of the 500 realizations were between 1.75 and 2.25?
13.5 Solutions to the quick exercises

13.1 The answers you have found should be in the neighborhood of the following exact values:

n                        1     4     16    400
P(|X̄n − µ| < 0.5)       0.27  0.52  0.85  1.00

13.2 Because Y has an Exp(1) distribution, µ = 1 and Var(Y) = σ² = 1; we find for k ≥ 1:

P(|Y − µ| < kσ) = P(|Y − 1| < k) = P(1 − k < Y < k + 1) = P(Y < k + 1) = 1 − e^(−k−1).

Using this formula and (13.1) we obtain the following numbers:

k                            1      2      3      4
Lower bound from Chebyshev   0      0.750  0.889  0.938
P(|Y − 1| < k)               0.865  0.950  0.982  0.993

13.3 The value of Ȳn for this bar equals its area 0.26 · 0.5 = 0.13. The bar represents 13% of the values, or 0.13 · 500 = 65 realizations.

13.6 Exercises

13.1 Verify the "µ ± a few σ" rule as you did in Quick exercise 13.2 for the following distributions: U(−1, 1), U(−a, a), N(0, 1), N(µ, σ²), Par(3), Geo(1/2). Construct a table as in the answer to the quick exercise and enter a line for each distribution.

13.2 An accountant wants to simplify his bookkeeping by rounding amounts to the nearest integer, for example, rounding €99.53 and €100.46 both to €100. What is the cumulative effect of this if there are, say, 100 amounts? To study this we model the rounding errors by 100 independent U(−0.5, 0.5) random variables X1, X2, . . . , X100.

a. Compute the expectation and the variance of the Xi.
b. Use Chebyshev's inequality to compute an upper bound for the probability P(|X1 + X2 + · · · + X100| > 10) that the cumulative rounding error X1 + X2 + · · · + X100 exceeds €10.
13.3 Consider the situation of the previous exercise. A manager wants to know what happens to the mean absolute error (|X1| + |X2| + · · · + |Xn|)/n as n becomes large. What can you say about this, applying the law of large numbers?

13.4 Of the voters in Florida, a proportion p will vote for candidate G, and a proportion 1 − p will vote for candidate B. In an election poll a number of voters are asked for whom they will vote. Let Xi be the indicator random variable for the event "the ith person interviewed will vote for G." A model for the election poll is that the people to be interviewed are selected in such a way that the indicator random variables X1, X2, . . . are independent and have a Ber(p) distribution.

a. Suppose we use X̄n to predict p. According to Chebyshev's inequality, how large should n be (how many people should be interviewed) such that the probability that X̄n is within 0.2 of the "true" p is at least 0.9? Hint: solve this first for p = 1/2, and use that p(1 − p) ≤ 1/4 for all 0 ≤ p ≤ 1.
b. Answer the same question, but now X̄n should be within 0.1 of p.
c. Answer the question from part a, but now the probability should be at least 0.95.
d. If p > 1/2 candidate G wins; if X̄n > 1/2 you predict that G will win. Find an n (as small as you can) such that the probability that you predict correctly is at least 0.9, if in fact p = 0.6.

13.5 You are trying to determine the melting point of a new material, of which you have a large number of samples. For each sample that you measure you find a value close to the actual melting point c but corrupted with a measurement error. We model this with random variables: Mi = c + Ui, where Mi is the measured value in degrees Kelvin, and Ui is the occurring random error. It is known that E[Ui] = 0 and Var(Ui) = 3, for each i, and that we may consider the random variables M1, M2, . . . independent. According to Chebyshev's inequality, how many samples do you need to measure to be 90% sure that the average of the measurements is within half a degree of c?

13.6 The casino La bella Fortuna is for sale and you think you might want to buy it, but you want to know how much money you are going to make. All the present owner can tell you is that the roulette game Red or Black is played about 1000 times a night, 365 days a year. Each time it is played you have probability 19/37 of winning the player's bet of €1 and probability 18/37 of having to pay the player €1. Explain in detail why the law of large numbers can be used to determine the income of the casino, and determine how much it is.
13.7 Let X1, X2, . . . be a sequence of independent and identically distributed random variables with distribution function F. Define Fn as follows: for any a,

Fn(a) = (number of Xi in (−∞, a]) / n.

Consider a fixed a and introduce the appropriate indicator random variables (as in Section 13.4). Compute their expectation and variance and show that the law of large numbers tells us that

lim n→∞ P(|Fn(a) − F(a)| > ε) = 0.

13.8 In Section 13.4 we described how the probability density function could be recovered from a sequence X1, X2, X3, . . . . We consider the Gam(2, 1) probability density discussed in the main text and a histogram bar around the point a = 2. Then f(a) = f(2) = 2e^(−2) = 0.27, and the estimate for f(2) is Ȳn/(2h), where Ȳn is as in (13.3).

a. Express the standard deviation of Ȳn/(2h) in terms of n and h.
b. Choose h = 0.25. How large should n be (according to Chebyshev's inequality) so that the estimate is within 20% of the "true value", with probability 80%?

13.9 Let X1, X2, . . . be an independent sequence of U(−1, 1) random variables and let

Tn = (X1² + · · · + Xn²)/n.

It is claimed that for some a and any ε > 0

lim n→∞ P(|Tn − a| > ε) = 0.

a. Explain how this could be true.
b. Determine a.

13.10 Let Mn be the maximum of n independent U(0, 1) random variables.

a. Derive the exact expression for P(|Mn − 1| > ε). Hint: see Section 8.4.
b. Show that limn→∞ P(|Mn − 1| > ε) = 0. Can this be derived from Chebyshev's inequality or the law of large numbers?

13.11 For some t > 1, let X be a random variable taking the values 0 and t, with probabilities P(X = 0) = 1 − 1/t and P(X = t) = 1/t. Then E[X] = 1 and Var(X) = t − 1. Consider the probability P(|X − 1| ≥ a).

a. Verify the following: if t = 10 and a = 8, then P(|X − 1| ≥ a) = 1/10 and Chebyshev's inequality gives an upper bound for this probability of 9/64. The difference is 9/64 − 1/10 ≈ 0.04. We will say that for t = 10 the Chebyshev gap for X at a = 8 is 0.04.
b. Compute the Chebyshev gap for t = 10 at a = 5 and at a = 10.
c. Can you find a gap smaller than 0.01, smaller than 0.001, smaller than 0.0001?
d. Do you think one could improve Chebyshev's inequality, i.e., find an upper bound closer to the true probabilities?

13.12 (A more general law of large numbers). Let X1, X2, . . . be a sequence of independent random variables, with E[Xi] = µi and Var(Xi) = σi², for i = 1, 2, . . . . Suppose that 0 < σi² ≤ M for all i. Let a be an arbitrary positive number.

a. Apply Chebyshev's inequality to show that

P( |X̄n − (µ1 + · · · + µn)/n| ≥ a ) ≤ (Var(X1) + · · · + Var(Xn)) / (n²a²).

b. Conclude from a that

lim n→∞ P( |X̄n − (µ1 + · · · + µn)/n| ≥ a ) = 0.

Check that the law of large numbers is a special case of this result.
14 The central limit theorem

The central limit theorem is a refinement of the law of large numbers. For a large number of independent identically distributed random variables X1, . . . , Xn, with finite variance, the average X̄n approximately has a normal distribution, no matter what the distribution of the Xi is. In the first section we discuss the proper normalization of X̄n to obtain a normal distribution in the limit. In the second section we will use the central limit theorem to approximate probabilities of averages and sums of random variables.

14.1 Standardizing averages

In the previous chapter we saw that the law of large numbers guarantees the convergence to µ of the average X̄n of n independent random variables X1, . . . , Xn, all having the same expectation µ and variance σ². This convergence was illustrated by Figure 13.1. Closer examination of this figure suggests another phenomenon: for the two distributions considered (i.e., the Gam(2, 1) distribution and a bimodal distribution), the probability density function of X̄n seems to become symmetrical and bell shaped around the expected value µ as n becomes larger and larger. However, the bell collapses into a single spike at µ. Nevertheless, by a proper normalization it is possible to stabilize the bell shape, as we will see.

In order to let the distribution of X̄n settle down it seems to be a good idea to stabilize the expectation and variance. Since E[X̄n] = µ for all n, only the variance needs some special attention. In Figure 14.1 we depict the probability density function of the centered average X̄n − µ of Gam(2, 1) random variables, multiplied by three different powers of n. In the left column we display the density of n^(1/4)(X̄n − µ), in the middle column the density of n^(1/2)(X̄n − µ), and in the right column the density of n(X̄n − µ). These figures suggest that √n is the right factor to stabilize the bell shape.
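This suggestion can also be checked numerically. The sketch below (Python, not part of the book) simulates many averages X̄n of Gam(2, 1) samples and prints the sample standard deviation of n^(1/4)(X̄n − µ), √n(X̄n − µ), and n(X̄n − µ); only the middle one should stay close to σ = √2 ≈ 1.41 as n grows. The number of repetitions and the seed are arbitrary choices.

import random
import statistics

random.seed(1)              # arbitrary seed
mu, reps = 2.0, 2000        # Gam(2, 1) expectation; number of simulated averages (arbitrary)

def scaled_deviations(n, power):
    """Realizations of n**power * (X̄n - mu) for Gam(2, 1) samples of size n."""
    out = []
    for _ in range(reps):
        xbar = sum(random.gammavariate(2.0, 1.0) for _ in range(n)) / n
        out.append(n**power * (xbar - mu))
    return out

for n in (1, 4, 16, 100):
    sds = [statistics.stdev(scaled_deviations(n, p)) for p in (0.25, 0.5, 1.0)]
    print(f"n = {n:3d}: sd of n^(1/4)(X̄n-µ) = {sds[0]:.2f}, "
          f"√n(X̄n-µ) = {sds[1]:.2f}, n(X̄n-µ) = {sds[2]:.2f}")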
Fig. 14.1. Multiplying the difference X̄n − µ of n Gam(2, 1) random variables. Left column: n^(1/4)(X̄n − µ); middle column: √n(X̄n − µ); right column: n(X̄n − µ). (Rows show n = 1, 2, 4, 16, 100.)
Indeed, according to the rule for the variance of an average (see page 182), we have Var(X̄n) = σ²/n, and therefore for any number C:

Var(C(X̄n − µ)) = Var(CX̄n) = C² Var(X̄n) = C²σ²/n.

To stabilize the variance we therefore must choose C = √n. In fact, by choosing C = √n/σ, one standardizes the averages, i.e., the resulting random variable Zn, defined by

Zn = √n (X̄n − µ)/σ,     n = 1, 2, . . . ,

has expected value 0 and variance 1. What more can we say about the distribution of the random variables Zn?

In case X1, X2, . . . are independent N(µ, σ²) distributed random variables, we know from Section 11.2 and the rule on expectation and variance under change of units (see page 98) that Zn has an N(0, 1) distribution for all n. For the gamma and bimodal random variables from Section 13.1 we depicted the probability density function of Zn in Figure 14.2. For both examples we see that the probability density functions of the Zn seem to converge to the probability density function of the N(0, 1) distribution, indicated by the dotted line. The following amazing result states that this behavior generally occurs no matter what distribution we start with.

The central limit theorem. Let X1, X2, . . . be any sequence of independent identically distributed random variables with finite positive variance. Let µ be the expected value and σ² the variance of each of the Xi. For n ≥ 1, let Zn be defined by

Zn = √n (X̄n − µ)/σ;

then for any number a

lim n→∞ F_{Zn}(a) = Φ(a),

where Φ is the distribution function of the N(0, 1) distribution. In words: the distribution function of Zn converges to the distribution function Φ of the standard normal distribution.

Note that

Zn = (X̄n − E[X̄n]) / √(Var(X̄n)),

which is a more direct way to see that Zn is the average X̄n standardized.
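The convergence stated in the central limit theorem can be checked by simulation. The following sketch (Python, not part of the book) estimates F_{Zn}(a) = P(Zn ≤ a) for Gam(2, 1) summands and compares it with Φ(a), computed via the error function; the values of n, a, the number of repetitions, and the seed are arbitrary.

import math
import random

random.seed(3)                       # arbitrary seed
mu, sigma = 2.0, math.sqrt(2.0)      # expectation and standard deviation of Gam(2, 1)
n, reps = 100, 20_000                # sample size and number of simulated Zn values (arbitrary)

def phi(a):
    """Standard normal distribution function Φ(a)."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def simulate_zn():
    xbar = sum(random.gammavariate(2.0, 1.0) for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

zs = [simulate_zn() for _ in range(reps)]
for a in (-1.0, 0.0, 1.0, 2.0):
    empirical = sum(1 for z in zs if z <= a) / reps
    print(f"a = {a:+.1f}: estimated P(Zn <= a) = {empirical:.3f}, Φ(a) = {phi(a):.3f}")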
Fig. 14.2. Densities of standardized averages Zn. Left column: from a gamma density; right column: from a bimodal density. Dotted line: N(0, 1) probability density. (Rows show n = 1, 2, 4, 16, 100.)
One can also write Zn as a standardized sum

Zn = (X1 + · · · + Xn − nµ) / (σ√n).     (14.1)

In the next section we will see that this last representation of Zn is very helpful when one wants to approximate probabilities of sums of independent identically distributed random variables. Since X̄n = (σ/√n)Zn + µ, it follows that X̄n approximately has an N(µ, σ²/n) distribution; see the change-of-units rule for normal random variables on page 106. This explains the symmetrical bell shape of the probability densities in Figure 13.1.

Remark 14.1 (Some history). Originally, the central limit theorem was proved in 1733 by De Moivre for independent Ber(1/2) distributed random variables. Laplace extended De Moivre's result to Ber(p) random variables and later formulated the central limit theorem as stated above. Around 1901 a first rigorous proof of this result was given by Lyapunov. Several versions of the central limit theorem exist with weaker conditions than those presented here. For example, for applications it is interesting that it is not necessary that all Xi have the same distribution; see Ross [26], Section 8.3, or Feller [8], Section 8.4, and Billingsley [3], Section 27.

14.2 Applications of the central limit theorem

The central limit theorem provides a tool to approximate the probability distribution of the average or the sum of independent identically distributed random variables. This plays an important role in applications, for instance, see Sections 23.4, 24.1, 26.2, and 27.2. Here we will illustrate the use of the central limit theorem to approximate probabilities of averages and sums of random variables in three examples. The first example deals with an average; the other two concern sums of random variables.

Did we have bad luck?

In the example in Section 13.3 averages of independent Gam(2, 1) distributed random variables were simulated for n = 1, . . . , 500. In Figure 13.2 the realization of X̄n for n = 400 is 1.99, which is almost exactly equal to the expected value 2. For n = 500 the simulation was 2.06, a little bit farther away. Did we have bad luck, or is a value 2.06 or higher not unusual? To answer this question we want to compute

P(X̄n ≥ 2.06).

We will find an approximation of this probability using the central limit theorem.
Note that

P(X̄n ≥ 2.06) = P(X̄n − µ ≥ 2.06 − µ)
             = P( √n(X̄n − µ)/σ ≥ √n(2.06 − µ)/σ )
             = P( Zn ≥ √n(2.06 − µ)/σ ).

Since the Xi are Gam(2, 1) random variables, µ = E[Xi] = 2 and σ² = Var(Xi) = 2. We find for n = 500 that

P(X̄500 ≥ 2.06) = P( Z500 ≥ √500 (2.06 − 2)/√2 ) = P(Z500 ≥ 0.95) = 1 − P(Z500 < 0.95).

It now follows from the central limit theorem that

P(X̄500 ≥ 2.06) ≈ 1 − Φ(0.95) = 0.1711.

This is close to the exact answer 0.1710881, which was obtained using the probability density of X̄n as given in Section 13.1. Thus we see that there is about a 17% probability that the average X̄500 is at least 0.06 above 2. Since 17% is quite large, we conclude that the value 2.06 is not unusual. In other words, we did not have bad luck; n = 500 is simply not large enough to be that close. Would 2.06 be unusual if n = 5000?

Quick exercise 14.1 Show that P(X̄5000 ≥ 2.06) ≈ 0.0013, using the central limit theorem.

Rounding amounts to the nearest integer

In Exercise 13.2 an accountant wanted to simplify his bookkeeping by rounding amounts to the nearest integer, and you were asked to use Chebyshev's inequality to compute an upper bound for the probability

p = P(|X1 + X2 + · · · + X100| > 10)

that the cumulative rounding error X1 + X2 + · · · + X100 exceeds €10. This upper bound equals 1/12. In order to know the exact value of p one has to determine the distribution of the sum X1 + · · · + X100. This is difficult, but the central limit theorem is a handy tool to get an approximation of p. Clearly,

p = P(X1 + · · · + X100 < −10) + P(X1 + · · · + X100 > 10).

Standardizing as in (14.1), for the second probability we write, with n = 100,
P(X1 + · · · + Xn > 10) = P(X1 + · · · + Xn − nµ > 10 − nµ)
                       = P( (X1 + · · · + Xn − nµ)/(σ√n) > (10 − nµ)/(σ√n) )
                       = P( Zn > (10 − nµ)/(σ√n) ).

The Xi are U(−0.5, 0.5) random variables, µ = E[Xi] = 0, and σ² = Var(Xi) = 1/12, so that

P(X1 + · · · + X100 > 10) = P( Z100 > (10 − 100 · 0)/(√(1/12) · √100) ) = P(Z100 > 3.46).

It follows from the central limit theorem that

P(Z100 > 3.46) ≈ 1 − Φ(3.46) = 0.0003.

Similarly, P(X1 + · · · + X100 < −10) ≈ Φ(−3.46) = 0.0003. Thus we find that p = 0.0006.

Normal approximation of the binomial distribution

In Section 4.3 we considered the (fictitious) situation that you attend, completely unprepared, a multiple-choice exam consisting of 10 questions. We saw that the probability you will pass equals

P(X ≥ 6) = 0.0197,

where X, being the sum of 10 independent Ber(1/4) random variables, has a Bin(10, 1/4) distribution. As we saw in Chapter 4 it is rather easy, but tedious, to calculate P(X ≥ 6). Although n is small, we investigate what the central limit theorem will yield as an approximation of P(X ≥ 6). Recall that a random variable with a Bin(n, p) distribution can be written as the sum of n independent Ber(p) distributed random variables R1, . . . , Rn. Substituting n = 10, µ = p = 1/4, and σ² = p(1 − p) = 3/16, it follows from the central limit theorem that

P(X ≥ 6) = P(R1 + · · · + Rn ≥ 6)
         = P( (R1 + · · · + Rn − nµ)/(σ√n) ≥ (6 − nµ)/(σ√n) )
         = P( Z10 ≥ (6 − 2.5)/(√(3/16) · √10) )
         ≈ 1 − Φ(2.56) = 0.0052.
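This computation is easy to check numerically. The sketch below (Python, not part of the book) computes the exact Bin(10, 1/4) tail probability with the standard library and the central limit theorem approximation obtained by standardizing at the value 6; the function names are our own.

import math

n, p = 10, 0.25                                  # Bin(10, 1/4), as in the exam example
mu, sigma = n * p, math.sqrt(n * p * (1 - p))    # mean 2.5 and standard deviation of X

def phi(a):
    """Standard normal distribution function Φ(a), via the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

# exact tail probability P(X >= 6) of the Bin(10, 1/4) distribution
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(6, n + 1))

# central limit theorem approximation, standardizing at the value 6
approx = 1.0 - phi((6 - mu) / sigma)

print(f"exact   P(X >= 6) = {exact:.4f}")        # about 0.0197
print(f"CLT approximation = {approx:.4f}")       # about 0.005

Standardizing at other points, as the text does next, only changes the argument of Φ.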
The number 0.0052 is quite a poor approximation of the true value 0.0197. Note, however, that we could also argue that

P(X ≥ 6) = P(X > 5) = P(R1 + · · · + Rn > 5)
         = P( Z10 ≥ (5 − 2.5)/(√(3/16) · √10) )
         ≈ 1 − Φ(1.83) = 0.0336,

which gives an approximation that is too large! A better approach lies somewhere in the middle, as the following quick exercise illustrates.

Quick exercise 14.2 Apply the central limit theorem to find 0.0143 as an approximation to P(X ≥ 5.5). Since P(X ≥ 6) = P(X ≥ 5.5), this also provides an approximation of P(X ≥ 6).

How large should n be?

In view of the previous examples one might raise the question of how large n should be to have a good approximation when using the central limit theorem. In other words, how fast is the convergence to the normal distribution? This is a difficult question to answer in general. For instance, in the third example one might initially be tempted to think that the approximation was quite poor, but after taking into account the fact that we approximate a discrete distribution by a continuous one, we obtain a considerable improvement of the approximation, as was illustrated in Quick exercise 14.2. For another example, see Figure 14.2. Here we see that the convergence is slightly faster for the bimodal distribution than for the Gam(2, 1) distribution, which is due to the fact that the Gam(2, 1) density is rather asymmetric.

In general the approximation might be poor when n is small, when the distribution of the Xi is asymmetric, bimodal, or discrete, or when the value a in P(X̄n > a) is far from the center of the distribution of the Xi.

14.3 Solutions to the quick exercises

14.1 In the same way we approximated P(X̄n ≥ 2.06) using the central limit theorem, we have that

P(X̄n ≥ 2.06) = P( Zn ≥ √n(2.06 − µ)/σ ).
With µ = 2 and σ = √2, we find for n = 5000 that

P(X̄5000 ≥ 2.06) = P(Z5000 ≥ 3),

which is approximately equal to 1 − Φ(3) = 0.0013, thanks to the central limit theorem. Because we think that 0.13% is a small probability, to find 2.06 as a value for X̄5000 would mean that you really had bad luck!

14.2 Similar to the computation of P(X ≥ 6), we have

P(X ≥ 5.5) = P(R1 + · · · + R10 ≥ 5.5)
           = P( Z10 ≥ (5.5 − 2.5)/(√(3/16) · √10) )
           ≈ 1 − Φ(2.19) = 0.0143.

We have seen that using the central limit theorem to approximate P(X ≥ 6) gives an underestimate of this probability, while using the central limit theorem to approximate P(X > 5) gives an overestimate. Since 5.5 is "in the middle," the approximation will be better.

14.4 Exercises

14.1 Let X1, X2, . . . , X144 be independent identically distributed random variables, each with expected value µ = E[Xi] = 2 and variance σ² = Var(Xi) = 4. Approximate P(X1 + X2 + · · · + X144 < 144), using the central limit theorem.

14.2 Let X1, X2, . . . , X625 be independent identically distributed random variables, with probability density function f given by

f(x) = 3(1 − x)² for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.

Use the central limit theorem to approximate P(X1 + X2 + · · · + X625 < 170).

14.3 In Exercise 13.4 a you were asked to use Chebyshev's inequality to determine how large n should be (how many people should be interviewed) so that the probability that X̄n is within 0.2 of the "true" p is at least 0.9. Here p is the proportion of the voters in Florida who will vote for G (and 1 − p is the proportion of the voters who will vote for B). How large should n at least be according to the central limit theorem?
14.4 In the single-server queue model from Section 6.4, Ti is the time between the arrival of the (i − 1)th and ith customers. Furthermore, one of the model assumptions is that the Ti are independent, Exp(0.5) distributed random variables. In Section 11.2 we saw that the probability P(T1 + · · · + T30 ≤ 60) of the 30th customer arriving within an hour at the well is equal to 0.542. Find the normal approximation of this probability.

14.5 Let X be a Bin(n, p) distributed random variable. Show that the random variable

(X − np) / √(np(1 − p))

has a distribution that is approximately standard normal.

14.6 Again, as in the previous exercise, let X be a Bin(n, p) distributed random variable.

a. An exact computation yields that P(X ≤ 25) = 0.55347 when n = 100 and p = 1/4. Use the central limit theorem to give an approximation of P(X ≤ 25) and P(X < 26).
b. When n = 100 and p = 1/4, then P(X ≤ 2) = 1.87 · 10^(−10). Use the central limit theorem to give an approximation of this probability.

14.7 Let X1, X2, . . . , Xn be n independent random variables, each with expected value µ and finite positive variance σ². Use Chebyshev's inequality to show that for any a > 0 one has

P( n^(1/4) |X̄n − µ|/σ ≥ a ) ≤ 1/(a²√n).

Use this fact to explain the occurrence of a single spike in the left column of Figure 14.1.

14.8 Let X1, X2, . . . be a sequence of independent N(0, 1) distributed random variables. For n = 1, 2, . . . , let Yn be the random variable defined by

Yn = X1² + · · · + Xn².

a. Show that E[Xi²] = 1.
b. One can show, using integration by parts, that E[Xi⁴] = 3. Deduce from this that Var(Xi²) = 2.
c. Use the central limit theorem to approximate P(Y100 > 110).

14.9 A factory produces links for heavy metal chains. The research lab of the factory models the length (in cm) of a link by the random variable X, with expected value E[X] = 5 and variance Var(X) = 0.04. The length of a link is defined in such a way that the length of a chain is equal to the sum of
the lengths of its links. The factory sells chains of 50 meters; to be on the safe side 1002 links are used for such chains. The factory guarantees that the chain is not shorter than 50 meters. If by chance a chain is too short, the customer is reimbursed, and a new chain is given for free.

a. Give an estimate of the probability that for a chain of at least 50 meters more than 1002 links are needed. For what percentage of the chains does the factory have to reimburse clients and provide free chains?
b. The sales department of the factory notices that it has to hand out a lot of free chains and asks the research lab what is wrong. After further investigations the research lab reports to the sales department that the expected value 5 is incorrect, and that the correct value is 4.99 (cm). Do you think that it was necessary to report such a minor change of this value?

14.10 Chebyshev's inequality was used in Exercise 13.5 to determine how many times n one needs to measure a sample to be 90% sure that the average of the measurements is within half a degree of the actual melting point c of a new material.

a. Use the normal approximation to find a less conservative value for n.
b. Only in case the random errors Ui in the measurements have a normal distribution is the value of n from part a "exact"; in all other cases it is an approximation. Explain this.
15 Exploratory data analysis: graphical summaries

In the previous chapters we focused on probability models to describe random phenomena. Confronted with a new phenomenon, we want to learn about the randomness that is associated with it. It is common to conduct an experiment for this purpose and record observations concerning the phenomenon. The set of observations is called a dataset. By exploring the dataset we can gain insight into what probability model suits the phenomenon.

Frequently you will have to deal with a dataset that contains so many elements that it is necessary to condense the data for easy visual comprehension of general characteristics. In this chapter we present several graphical methods to do so. To graphically represent univariate datasets, consisting of repeated measurements of one particular quantity, we discuss the classical histogram, the more recently introduced kernel density estimates, and the empirical distribution function. To represent a bivariate dataset, which consists of repeated measurements of two quantities, we use the scatterplot.

15.1 Example: the Old Faithful data

The Old Faithful geyser at Yellowstone National Park, Wyoming, USA, was observed from August 1st to August 15th, 1985. During that time, data were collected on the duration of eruptions. There were 272 eruptions observed, of which the recorded durations are listed in Table 15.1. The data are given in seconds.

The variety in the lengths of the eruptions indicates that randomness is involved. By exploring the dataset we might learn about this randomness. For instance, we would like to know which durations are more likely to occur than others; is there something like “the typical duration of an eruption”; do the durations vary symmetrically around the center of the dataset; and so on. In order to retrieve this type of information, just listing the observed durations does not help us very much. Somehow we must summarize the observed data. We could
  • 215. 208 15 Exploratory data analysis: graphical summaries Table 15.1. Duration in seconds of 272 eruptions of the Old Faithful geyser. 216 108 200 137 272 173 282 216 117 261 110 235 252 105 282 130 105 288 96 255 108 105 207 184 272 216 118 245 231 266 258 268 202 242 230 121 112 290 110 287 261 113 274 105 272 199 230 126 278 120 288 283 110 290 104 293 223 100 274 259 134 270 105 288 109 264 250 282 124 282 242 118 270 240 119 304 121 274 233 216 248 260 246 158 244 296 237 271 130 240 132 260 112 289 110 258 280 225 112 294 149 262 126 270 243 112 282 107 291 221 284 138 294 265 102 278 139 276 109 265 157 244 255 118 276 226 115 270 136 279 112 250 168 260 110 263 113 296 122 224 254 134 272 289 260 119 278 121 306 108 302 240 144 276 214 240 270 245 108 238 132 249 120 230 210 275 142 300 116 277 115 125 275 200 250 260 270 145 240 250 113 275 255 226 122 266 245 110 265 131 288 110 288 246 238 254 210 262 135 280 126 261 248 112 276 107 262 231 116 270 143 282 112 230 205 254 144 288 120 249 112 256 105 269 240 247 245 256 235 273 245 145 251 133 267 113 111 257 237 140 249 141 296 174 275 230 125 262 128 261 132 267 214 270 249 229 235 267 120 257 286 272 111 255 119 135 285 247 129 265 109 268 Source: W. Härdle. Smoothing techniques with implementation in S. 1991; Table 3, page 201. Springer New York. start by computing the mean of the data, which is 209.3 for the Old Faithful data. However, this is a poor summary of the dataset, because there is a lot more information in the observed durations. How do we get hold of this? Just staring at the dataset for a while tells us very little. To see something, we have to rearrange the data somehow. The first thing we could do is order the data. The result is shown in Table 15.2. Putting the elements in order already provides more information. For instance, it is now immediately clear that all elements lie between 96 and 306. Quick exercise 15.1 Which two elements of the Old Faithful dataset split the dataset in three groups of equal size? A closer look at the ordered data shows that the two middle elements (the 136th and 137th elements in ascending order) are equal to 240, which is much closer to the maximum value 306 than to the minimum value 96. This seems to
  • 216. 15.2 Histograms 209 Table 15.2. Ordered durations of eruptions of the Old Faithful geyser. 96 100 102 104 105 105 105 105 105 105 107 107 108 108 108 108 109 109 109 110 110 110 110 110 110 110 111 111 112 112 112 112 112 112 112 112 113 113 113 113 115 115 116 116 117 118 118 118 119 119 119 120 120 120 120 121 121 121 122 122 124 125 125 126 126 126 128 129 130 130 131 132 132 132 133 134 134 135 135 136 137 138 139 140 141 142 143 144 144 145 145 149 157 158 168 173 174 184 199 200 200 202 205 207 210 210 214 214 216 216 216 216 221 223 224 225 226 226 229 230 230 230 230 230 231 231 233 235 235 235 237 237 238 238 240 240 240 240 240 240 242 242 243 244 244 245 245 245 245 245 246 246 247 247 248 248 249 249 249 249 250 250 250 250 251 252 254 254 254 255 255 255 255 256 256 257 257 258 258 259 260 260 260 260 260 261 261 261 261 262 262 262 262 263 264 265 265 265 265 266 266 267 267 267 268 268 269 270 270 270 270 270 270 270 270 271 272 272 272 272 272 273 274 274 274 275 275 275 275 276 276 276 276 277 278 278 278 279 280 280 282 282 282 282 282 282 283 284 285 286 287 288 288 288 288 288 288 289 289 290 290 291 293 294 294 296 296 296 300 302 304 306 indicate that the dataset is somewhat asymmetric, but even from the ordered dataset we cannot get a clear picture of this asymmetry. Also, geologists be- lieve that there are two different kinds of eruptions that play a role. Hence one would expect two separate values around which the elements of the dataset would accumulate, corresponding to the typical durations of the two types of eruptions. Again it is not clear, not even from the ordered dataset, what these two typical values are. It would be better to have a plot of the dataset that reflects symmetry or asymmetry of the data and from which we can easily see where the elements accumulate. In the following sections we will discuss two such methods. 15.2 Histograms The classical method to graphically represent data is the histogram, which probably dates from the mortality studies of John Graunt in 1662 (see West-
ergaard [39], p. 22). The term histogram appears to have been used first by Karl Pearson ([22]). Figure 15.1 displays a histogram of the Old Faithful data. The picture immediately reveals the asymmetry of the dataset and the fact that the elements accumulate somewhere near 120 and 270, which was not clear from Tables 15.1 and 15.2.

Fig. 15.1. Histogram of the Old Faithful data.

The construction of the histogram is as follows. Let us denote a generic (univariate) dataset of size n by x1, x2, . . . , xn and suppose we want to construct a histogram. We use the version of the histogram that is scaled in such a way that the total area under the curve is equal to one.¹ First we divide the range of the data into intervals. These intervals are called bins and are denoted by B1, B2, . . . , Bm. The length of an interval Bi is denoted by |Bi| and is called the bin width. The bins do not necessarily have the same width. In Figure 15.1 we have eight bins of equal bin width.

We want the area under the histogram on each bin Bi to reflect the number of elements in Bi. Since the total area 1 under the histogram then corresponds to the total number of elements n in the dataset, the area under the histogram on a bin Bi is equal to the proportion of elements in Bi:

(the number of xj in Bi) / n.

¹ The reason to scale the histogram so that the total area under the curve is equal to one is that if we view the data as being generated from some unknown probability density f (see Chapter 17), such a histogram can be used as a crude estimate of f.
The height of the histogram on bin Bi must then be equal to

(the number of xj in Bi) / (n |Bi|).

Quick exercise 15.2 Use Table 15.2 to count how many elements fall into each of the bins (90, 120], (120, 150], . . . , (300, 330] in Figure 15.1 and compute the height on each bin.

Choice of the bin width

Consider a histogram with bins of equal width. In that case the bins are of the form

Bi = (r + (i − 1)b, r + ib]   for i = 1, 2, . . . , m,

where r is some reference point smaller than the minimum of the dataset, and b denotes the bin width. In Figure 15.2, three histograms of the Old Faithful data of Table 15.2 are displayed with bin widths equal to 2, 30, and 90, respectively. Clearly, the choice of the bin width b, or the corresponding choice of the number of bins m, will determine what the resulting histogram will look like. Choosing the bin width too small will result in a chaotic figure with many isolated peaks. Choosing the bin width too large will result in a figure without much detail, at the risk of losing information about general characteristics. In Figure 15.2, bin width b = 2 is somewhat too small. Bin width b = 90 is clearly too large and produces a histogram that no longer captures the fact that the data show two separate modes near 120 and 270.

How does one go about choosing the bin width? In practice, this might boil down to picking the bin width by trial and error, continuing until the figure looks reasonable. Mathematical research, however, has provided some guidelines for a data-based choice for b or m. Formulas that may effectively be used are m = 1 + 3.3 log10(n) (see [34]) or b = 3.49 s n^(−1/3) (see [29]; see also Remark 15.1), where s is the sample standard deviation (see Section 16.2 for the definition of the sample standard deviation).

Fig. 15.2. Histograms of the Old Faithful data with different bin widths (b = 2, 30, and 90).
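To make the construction concrete, the following minimal Python sketch (ours, not the book's; the function names are illustrative) computes the density-scaled histogram heights and the two data-based rules of thumb just mentioned, assuming bins of the form (left, right] as in Quick exercise 15.2.

import math

def histogram_heights(data, bin_edges):
    # Height on bin (left, right] is (number of xj in the bin) / (n * bin width),
    # so that the total area under the histogram equals 1.
    n = len(data)
    heights = []
    for left, right in zip(bin_edges[:-1], bin_edges[1:]):
        count = sum(1 for x in data if left < x <= right)
        heights.append(count / (n * (right - left)))
    return heights

def rules_of_thumb(data):
    # Data-based choices from the text: m = 1 + 3.3 log10(n) bins,
    # or bin width b = 3.49 * s * n**(-1/3), with s the sample standard deviation.
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1 + 3.3 * math.log10(n), 3.49 * s * n ** (-1 / 3)

With equal-width bins one would round the suggested number of bins m up to an integer, or divide the range of the data by the suggested bin width b.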
Remark 15.1 (Normal reference method for histograms). Let Hn(x) denote the height of the histogram at x and suppose that we view our dataset as being generated from a probability distribution with density f. We would like to find the bin width that minimizes the difference between Hn and f, measured by the so-called mean integrated squared error (MISE)

E[ ∫_{−∞}^{∞} (Hn(x) − f(x))² dx ].

Under suitable smoothness conditions on f, the value of b that minimizes the MISE as n goes to infinity is given by b = C(f) n^(−1/3), where

C(f) = 6^(1/3) ( ∫_{−∞}^{∞} f′(x)² dx )^(−1/3)

(see for instance [29] or [12]). A simple data-based choice for b is obtained by estimating the constant C(f). The normal reference method takes f to be the density of an N(µ, σ²) distribution, in which case C(f) = (24√π)^(1/3) σ. Estimating σ by the sample standard deviation s (see Chapter 16 for a definition of s) would result in bin width b = (24√π)^(1/3) s n^(−1/3). For the Old Faithful data this would give b = 36.89.

Quick exercise 15.3 If we construct a histogram for the Old Faithful data with equal bin width b = 3.49 s n^(−1/3), how many bins will we need to cover the data if s = 68.48?

The main advantage of the histogram is that it is simple. Its disadvantage is the discrete character of the plot. In Figure 15.1 it is still somewhat unclear which two values correspond to the typical durations of the two types of eruptions. Another well-known artifact is that changing the bin width slightly, or keeping the bin width fixed and shifting the bins slightly, may result in a figure of a different nature. A method that produces a smoother figure and is less sensitive to these kinds of changes will be discussed in the next section.

15.3 Kernel density estimates

We can graphically represent data in a more variegated plot by a so-called kernel density estimate. The basic ideas of kernel density estimation first appeared in the early 1950s. Rosenblatt [25] and Parzen [21] provided the stimulus for further research on this topic. Although the method was introduced in the middle of the last century, until recently it remained unpopular as a tool for practitioners because of its computationally intensive nature.

Figure 15.3 displays a kernel density estimate of the Old Faithful data. Again the picture immediately reveals the asymmetry of the dataset, but it is much
smoother than the histogram in Figure 15.1.

Fig. 15.3. Kernel density estimate of the Old Faithful data.
Note that it is now easier to detect the two typical values around which the elements accumulate.

The idea behind the construction of the plot is to “put a pile of sand” around each element of the dataset. At places where the elements accumulate, the sand will pile up. The actual plot is constructed by choosing a kernel K and a bandwidth h. The kernel K reflects the shape of the piles of sand, whereas the bandwidth is a tuning parameter that determines how wide the piles of sand will be. Formally, a kernel K is a function K : R → R. Figure 15.4 displays several well-known kernels. A kernel K typically satisfies the following conditions:

(K1) K is a probability density, i.e., K(u) ≥ 0 and ∫_{−∞}^{∞} K(u) du = 1;
(K2) K is symmetric around zero, i.e., K(u) = K(−u);
(K3) K(u) = 0 for |u| > 1.

Examples are the Epanechnikov kernel,

K(u) = (3/4)(1 − u²) for −1 ≤ u ≤ 1, and K(u) = 0 elsewhere,

and the triweight kernel,

K(u) = (35/32)(1 − u²)³ for −1 ≤ u ≤ 1, and K(u) = 0 elsewhere.

Sometimes one uses kernels that do not satisfy condition (K3), for example, the normal kernel

K(u) = (1/√(2π)) e^(−u²/2) for −∞ < u < ∞.

Fig. 15.4. Examples of well-known kernels K: the triangular, cosine, Epanechnikov, biweight, triweight, and normal kernels.

Let us denote a kernel density estimate by fn,h, and suppose that we want to construct fn,h for a dataset x1, x2, . . . , xn. In Figure 15.5 the construction is
illustrated for a dataset containing five elements, where we use the Epanechnikov kernel and bandwidth h = 0.5. First we scale the kernel K (solid line) into the function

t → (1/h) K(t/h).

The scaled kernel (dotted line) is of the same type as the original kernel, with area 1 under the curve, but is positive on the interval [−h, h] instead of [−1, 1] and higher (lower) when h is smaller (larger) than 1. Next, we put a scaled kernel around each element xi in the dataset. This results in functions of the type

t → (1/h) K((t − xi)/h).

These shifted kernels (dotted lines) have the same shape as the transformed kernel, all with area 1 under the curve, but they are now symmetric around xi and positive on the interval [xi − h, xi + h].

Fig. 15.5. Construction of a kernel density estimate fn,h: kernel and scaled kernel, shifted kernels, and the resulting kernel density estimate.

We see that the graphs of the shifted kernels will overlap whenever xi and xj are close to each other, so that things will pile up more at places where more elements accumulate. The kernel density estimate fn,h is constructed by summing the scaled kernels and dividing them by n, in order to obtain area 1 under the curve:
fn,h(t) = (1/n) [ (1/h) K((t − x1)/h) + (1/h) K((t − x2)/h) + · · · + (1/h) K((t − xn)/h) ],

or briefly,

fn,h(t) = (1/(nh)) Σ_{i=1}^{n} K((t − xi)/h).   (15.1)

When computing fn,h(t), we assign higher weights to observations xi closer to t, in contrast to the histogram, where we simply count the number of observations in the bin that contains t. Note that as a consequence of condition (K1), fn,h itself is a probability density: fn,h(t) ≥ 0 and ∫_{−∞}^{∞} fn,h(t) dt = 1.

Quick exercise 15.4 Check that the total area under the kernel density estimate is equal to one, i.e., show that ∫_{−∞}^{∞} fn,h(t) dt = 1.

Note that computing fn,h is very computationally intensive.
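As a minimal illustration of formula (15.1), a direct translation into Python (our sketch, not the book's code; the names are illustrative) with the Epanechnikov kernel reads:

def epanechnikov(u):
    # K(u) = (3/4)(1 - u^2) for |u| <= 1, and 0 elsewhere
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def kde(t, data, h, K=epanechnikov):
    # f_{n,h}(t) = (1/(n h)) * sum over i of K((t - x_i)/h), as in (15.1)
    return sum(K((t - x) / h) for x in data) / (len(data) * h)

Evaluating kde on a grid of t values and plotting the result produces pictures such as Figure 15.3; the double loop over grid points and observations is exactly what makes the method computationally intensive for large datasets.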
Its common use nowadays is therefore a typical product of the recent developments in computer hardware, despite the fact that the method was introduced much earlier.

Choice of the bandwidth

The bandwidth h plays the same role for kernel density estimates as the bin width b does for histograms. In Figure 15.6 three kernel density estimates of the Old Faithful data are plotted with the triweight kernel and bandwidths 1.8, 18, and 180. It is clear that the choice of the bandwidth h determines largely what the resulting kernel density estimate will look like. Choosing the bandwidth too small will produce a curve with many isolated peaks. Choosing the bandwidth too large will produce a very smooth curve, at the risk of smoothing away important features of the data. In Figure 15.6 bandwidth
h = 1.8 is somewhat too small. Bandwidth h = 180 is clearly too large and produces an oversmoothed kernel density estimate that no longer captures the fact that the data show two separate modes.
Fig. 15.6. Kernel estimates of the Old Faithful data, with bandwidths 1.8, 18, and 180.

How does one go about choosing the bandwidth? Similar to histograms, in practice one could do this by trial and error and continue until one obtains a reasonable picture. Recent research, however, has provided some guidelines for a data-based choice of h. A formula that may effectively be used is h = 1.06 s n^(−1/5), where s denotes the sample standard deviation (see, for instance, [31]; see also Remark 15.2).

Remark 15.2 (Normal reference method for kernel estimates). Suppose we view our dataset as being generated from a probability distribution with density f. Let K be a fixed chosen kernel and let fn,h be the kernel density estimate. We would like to take the bandwidth that minimizes the difference between fn,h and f, measured by the so-called mean integrated squared error (MISE)

E[ ∫_{−∞}^{∞} (fn,h(x) − f(x))² dx ].

Under suitable smoothness conditions on f, the value of h that minimizes the MISE, as n goes to infinity, is given by h = C1(f) C2(K) n^(−1/5), where the constants C1(f) and C2(K) are given by

C1(f) = ( 1 / ∫_{−∞}^{∞} f″(x)² dx )^(1/5)   and   C2(K) = ( ∫_{−∞}^{∞} K(u)² du )^(1/5) / ( ∫_{−∞}^{∞} u²K(u) du )^(2/5).

After choosing the kernel K, one can compute the constant C2(K) to obtain a simple data-based choice for h by estimating the constant C1(f). For instance, for the normal kernel one finds C2(K) = (2√π)^(−1/5). As with
histograms (see Remark 15.1), the normal reference method takes f to be the density of an N(µ, σ²) distribution, in which case C1(f) = (8√π/3)^(1/5) σ. Estimating σ by the sample standard deviation s (see Chapter 16 for a definition of s) would result in bandwidth

h = (4/3)^(1/5) s n^(−1/5).

For the Old Faithful data, this would give h = 23.64.

Quick exercise 15.5 If we construct a kernel density estimate for the Old Faithful data with bandwidth h = 1.06 s n^(−1/5), then on what interval is fn,h strictly positive if s = 68.48?

Choice of the kernel

To construct a kernel density estimate, one has to choose a kernel K and a bandwidth h. The choice of kernel is less important. In Figure 15.7 we have plotted two kernel density estimates for the Old Faithful data of Table 15.1: one is constructed with the triweight kernel (solid line), and one with the Epanechnikov kernel (dotted line), both with the same bandwidth h = 24. As one can see, the graphs are very similar. If one wants to compare with the normal kernel, one should set the bandwidth of the normal kernel at about h/4. This has to do with the fact that the normal kernel is much more spread out than the two kernels mentioned here, which are zero outside [−1, 1].
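As a quick numerical check of the normal reference bandwidth of Remark 15.2 (a sketch of ours, with an illustrative function name), one can compute h = (4/3)^(1/5) s n^(−1/5) directly:

def normal_reference_bandwidth(s, n):
    # h = (4/3)**(1/5) * s * n**(-1/5) for the normal reference method (Remark 15.2)
    return (4 / 3) ** 0.2 * s * n ** (-0.2)

# For the Old Faithful data, s = 68.48 and n = 272 give approximately 23.6,
# in line with the value h = 23.64 reported above.
print(normal_reference_bandwidth(68.48, 272))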
Fig. 15.7. Kernel estimates of the Old Faithful data with different kernels: triweight (solid line) and Epanechnikov kernel (dotted), both with bandwidth h = 24.

Boundary kernels

In order to estimate the parameters of a software reliability model, failure data are collected. Usually the most desirable type of failure data results when the
failure times are recorded, or equivalently, the length of an interval between successive failures. The data in Table 15.3 are observed interfailure times in CPU seconds for a certain control software system.

Table 15.3. Interfailure times between successive failures.

30 113 81 115 9 2 91 112 15 138 50 77 24 108 88 670 120 26 114 325 55 242 68 422 180 10 1146 600 15 36 4 0 8 227 65 176 58 457 300 97 263 452 255 197 193 6 79 816 1351 148 21 233 134 357 193 236 31 369 748 0 232 330 365 1222 543 10 16 529 379 44 129 810 290 300 529 281 160 828 1011 445 296 1755 1064 1783 860 983 707 33 868 724 2323 2930 1461 843 12 261 1800 865 1435 30 143 108 0 3110 1247 943 700 875 245 729 1897 447 386 446 122 990 948 1082 22 75 482 5509 100 10 1071 371 790 6150 3321 1045 648 5485 1160 1864 4116

Source: J.D. Musa, A. Iannino, and K. Okumoto. Software reliability: measurement, prediction, application. McGraw-Hill, New York, 1987; Table on page 305.

On the left in Figure 15.8 a kernel density estimate of the observed interfailure times is plotted. Note that to the left of the origin, fn,h is positive. This is absurd, since it suggests that there are negative interfailure times.

This phenomenon is a consequence of the fact that one uses a symmetric kernel. In that case, the resulting kernel density estimate will always be positive on the interval [xi − h, xi + h] for every element xi in the dataset.
Fig. 15.8. Kernel density estimate of the software reliability data with symmetric and boundary kernel.
Hence, observations close to zero will cause the kernel density estimate fn,h to be positive to the left of zero. It is possible to improve the kernel density estimate in a neighborhood of zero by means of a so-called boundary kernel. Without going into detail about the construction of such an improvement, we will only show the result of this. On the right in Figure 15.8 the histogram of the interfailure times is plotted together with the kernel density estimate constructed with a symmetric kernel (dotted line) and with the boundary kernel density estimate (solid line). The boundary kernel density estimate is 0 to the left of the origin and is adjusted on the interval [0, h). On the interval [h, ∞) both kernel density estimates are the same.

15.4 The empirical distribution function

Another way to graphically represent a dataset is to plot the data in a cumulative manner. This can be done using the empirical cumulative distribution function of the data. It is denoted by Fn and is defined at a point x as the proportion of elements in the dataset that are less than or equal to x:

Fn(x) = (number of elements in the dataset ≤ x) / n.

To illustrate the construction of Fn, consider the dataset consisting of the elements

4 3 9 1 7.

The corresponding empirical distribution function is displayed in Figure 15.9. For x < 1, there are no elements less than or equal to x, so that Fn(x) = 0. For 1 ≤ x < 3, only the element 1 is less than or equal to x, so that Fn(x) = 1/5. For 3 ≤ x < 4, the elements 1 and 3 are less than or equal to x, so that Fn(x) = 2/5, and so on. In general, the graph of Fn has the form of a staircase, with Fn(x) = 0 for all x smaller than the minimum of the dataset and Fn(x) = 1 for all x greater than the maximum of the dataset. Between the minimum and maximum, Fn has a jump of size 1/n at each element of the dataset and is constant between successive elements. In Figure 15.9, the marks • and ◦ are added to the graph to emphasize the fact that, for instance, the value of Fn(x) at x = 3 is 0.4, not 0.2. Usually, we leave these out, and one might also connect the horizontal segments by vertical lines.

Fig. 15.9. Empirical distribution function.

In Figure 15.10 the empirical distribution functions are plotted for the Old Faithful data and the software reliability data. The fact that the Old Faithful data accumulate in the neighborhood of 120 and 270 is reflected in the graph of Fn by the fact that it is steeper at these places: the jumps of Fn succeed each other faster. In regions where the elements of the dataset are more stretched out, the graph of Fn is flatter.
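A direct translation of the definition of Fn into Python (our illustrative sketch, not the book's code) reproduces the values found above for the toy dataset 4, 3, 9, 1, 7:

def ecdf(data):
    # Fn(x) = (number of elements in the dataset <= x) / n
    n = len(data)
    return lambda x: sum(1 for v in data if v <= x) / n

Fn = ecdf([4, 3, 9, 1, 7])
print(Fn(0.5), Fn(1), Fn(3), Fn(9))   # 0.0 0.2 0.4 1.0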
Similar behavior can be seen for the software reliability data in the neighborhood of zero. The elements accumulate more close to zero, less as we move to the right. This is reflected by the empirical distribution function, which is very steep near zero and flattens out if we move to the right.

The graph of the empirical distribution function for the Old Faithful data agrees with the histogram in Figure 15.1, whose height is the largest on the bins (90, 120] and (240, 270]. In fact, there is a one-to-one relation between the two graphical summaries of the data: the area under the histogram on a single bin is equal to the relative frequency of elements that lie in that bin, which is also equal to the increase of Fn on that bin. For instance, the area under the histogram on bin (240, 270] for the Old Faithful data is equal to 30 · 0.0092 = 0.276 (see Quick exercise 15.2).
Fig. 15.10. Empirical distribution function of the Old Faithful data and the software reliability data.
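Computing Fn from data requires nothing more than the definition above. The following short Python sketch is ours, not part of the original text (the function and variable names are our own choices); it evaluates the empirical distribution function of the toy dataset 4, 3, 9, 1, 7 at a few points:

import numpy as np

def ecdf(data, x):
    # proportion of elements in the dataset that are less than or equal to x
    data = np.asarray(data)
    return np.mean(data <= x)

sample = [4, 3, 9, 1, 7]
for x in [0.5, 2, 3, 6.5, 9]:
    print(x, ecdf(sample, x))   # for instance, ecdf(sample, 3) gives 0.4, not 0.2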
On the other hand, Fn(270) = 215/272 ≈ 0.7904 and Fn(240) = 140/272 ≈ 0.5147, whose difference Fn(270) − Fn(240) is also equal to 0.276.

Quick exercise 15.6 Suppose that for a dataset consisting of 300 elements, the value of the empirical distribution function in the point 1.5 is equal to 0.7. How many elements in the dataset are strictly greater than 1.5?

Remark 15.3 (Fn as a discrete distribution function). Note that Fn satisfies the four properties of a distribution function: it is continuous from the right, Fn(x) → 0 as x → −∞, Fn(x) → 1 as x → ∞, and Fn is nondecreasing. This means that Fn itself is a distribution function of some random variable. Indeed, Fn is the distribution function of the discrete random variable that attains the values x1, x2, . . . , xn with equal probability 1/n.

15.5 Scatterplot

In some situations one wants to investigate the relationship between two or more variables. In the case of two variables x and y, the dataset consists of pairs of observations:

(x1, y1), (x2, y2), . . . , (xn, yn).

We call such a dataset a bivariate dataset in contrast to the univariate dataset, which consists of observations of one particular quantity. We often like to investigate whether the value of the variable y depends on the value of the variable x, and if so, whether we can describe the relation between the two variables. A first step is to take a look at the data, i.e., to plot the points (xi, yi) for i = 1, 2, . . . , n. Such a plot is called a scatterplot.

Drilling in rock

During a study about “dry” and “wet” drilling in rock, six holes were drilled, three corresponding to each process. In a dry hole one forces compressed air down the drill rods to flush the cutting and the drive hammer, whereas in a wet hole one forces water. As the hole gets deeper, one has to add a rod of 5 feet length to the drill. In each hole the time was recorded to advance each 5 feet, down to a total depth of 400 feet. The data in Table 15.4 are in 1/100 minute and are derived from the original data in [23]. The original data consisted of drill times for each of the six holes and contained missing observations and observations that were known to be too large. The data in Table 15.4 are the mean drill times of the bona fide observations at each depth for dry and wet drilling. One of the questions of interest is whether drill time depends on depth. To investigate this, we plot the mean drill time against depth. Figure 15.11 displays
  • 229. 222 15 Exploratory data analysis: graphical summaries Table 15.4. Mean drill time. Depth Dry Wet Depth Dry Wet 5 640.67 830.00 205 803.33 962.33 10 674.67 800.00 210 794.33 864.67 15 708.00 711.33 215 760.67 805.67 20 735.67 867.67 220 789.50 966.00 25 754.33 940.67 225 904.50 1010.33 30 723.33 941.33 230 940.50 936.33 35 664.33 924.33 235 882.00 915.67 40 727.67 873.00 240 783.50 956.33 45 658.67 874.67 245 843.50 936.00 50 658.00 843.33 250 813.50 803.67 55 705.67 885.67 255 658.00 697.33 60 700.00 881.67 260 702.50 795.67 65 720.67 822.00 265 623.50 1045.33 70 701.33 886.33 270 739.00 1029.67 75 716.67 842.50 275 907.50 977.00 80 649.67 874.67 280 846.00 1054.33 85 667.33 889.33 285 829.00 1001.33 90 612.67 870.67 290 975.50 1042.00 95 656.67 916.00 295 998.00 1200.67 100 614.00 888.33 300 1037.50 1172.67 105 584.00 835.33 305 984.00 1019.67 110 619.67 776.33 310 972.50 990.33 115 666.00 811.67 315 834.00 1173.33 120 695.00 874.67 320 675.00 1165.67 125 702.00 846.00 325 686.00 1142.00 130 739.67 920.67 330 963.00 1030.67 135 790.67 896.33 335 961.50 1089.67 140 730.33 810.33 340 932.00 1154.33 145 674.00 912.33 345 1054.00 1238.50 150 749.00 862.33 350 1038.00 1208.67 155 709.67 828.00 355 1238.00 1134.67 160 769.00 812.67 360 927.00 1088.00 165 663.00 795.67 365 850.00 1004.00 170 679.33 897.67 370 1066.00 1104.00 175 740.67 881.00 375 962.50 970.33 180 776.50 819.67 380 1025.50 1054.50 185 688.00 853.33 385 1205.50 1143.50 190 761.67 844.33 390 1168.00 1044.00 195 800.00 919.00 395 1032.50 978.33 200 845.50 933.33 400 1162.00 1104.00 Source: R. Penner and D.G. Watts. Mining information. The American Statistician, 45:4–9, 1991; Table 1 on page 6.
the resulting scatterplots for the dry and wet holes. The scatterplots seem to indicate that in the beginning the drill time hardly depends on depth, at least up to, let’s say, 250 feet. At greater depth, the drill time seems to vary over a larger range and increases somewhat with depth. A possible explanation for this is that the drill moved from softer to harder material. This was suggested by the fact that the drill hit an ore lens at about 250 feet and that the natural place for such ore lenses to occur is between two different materials (see [23] for details). A more important question is whether one can drill holes faster using dry drilling or wet drilling. The scatterplots seem to suggest that dry drilling might be faster. We will come back to this later.

Fig. 15.11. Scatterplots of mean drill time versus depth (left: dry holes, right: wet holes).

Predicting Janka hardness of Australian timber

The Janka hardness test is a standard test to measure the hardness of wood. It measures the force required to push a steel ball with a diameter of 11.28 millimeters (0.444 inch) into the wood to a depth of half the ball’s diameter. To measure Janka hardness directly is difficult. However, it is related to the density of the wood, which is comparatively easy to measure. In Table 15.5 a bivariate dataset is given of density (x) and Janka hardness (y) of 36 Australian eucalypt hardwoods. In order to get an impression of the relationship between hardness and density, we made a scatterplot of the bivariate dataset, which is displayed in Figure 15.12. It consists of all points (xi, yi) for i = 1, 2, . . . , 36. The scatterplot might provide suggestions for the formula that describes the relationship between the variables x and y. In this case, a linear relationship between the two variables does not seem unreasonable. Later (Chapter 22) we will discuss
Table 15.5. Density and hardness of Australian timber.

Density Hardness   Density Hardness   Density Hardness
  24.7     484       39.4    1210       53.4    1880
  24.8     427       39.9     989       56.0    1980
  27.3     413       40.3    1160       56.5    1820
  28.4     517       40.6    1010       57.3    2020
  28.4     549       40.7    1100       57.6    1980
  29.0     648       40.7    1130       59.2    2310
  30.3     587       42.9    1270       59.8    1940
  32.7     704       45.8    1180       66.0    3260
  35.6     979       46.9    1400       67.4    2700
  38.5     914       48.2    1760       68.8    2890
  38.8    1070       51.5    1710       69.1    2740
  39.3    1020       51.5    2010       69.1    3140

Source: E.J. Williams. Regression analysis. John Wiley & Sons Inc., New York, 1959; Table 3.1 on page 43.

how one can establish such a linear relationship by means of the observed pairs.

Quick exercise 15.7 Suppose we have a eucalypt hardwood tree with density 65. What would your prediction be for the corresponding Janka hardness?

Fig. 15.12. Scatterplot of Janka hardness versus density of wood.
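A linear relationship such as the one suggested by Figure 15.12 is treated properly in Chapter 22. Purely as an illustration of how one might explore it with software (this sketch is ours, not the book’s method), the following Python code draws the scatterplot and adds a least-squares line, which can then be used for a rough prediction at density 65:

import numpy as np
import matplotlib.pyplot as plt

density = np.array([24.7, 24.8, 27.3, 28.4, 28.4, 29.0, 30.3, 32.7, 35.6, 38.5, 38.8, 39.3,
                    39.4, 39.9, 40.3, 40.6, 40.7, 40.7, 42.9, 45.8, 46.9, 48.2, 51.5, 51.5,
                    53.4, 56.0, 56.5, 57.3, 57.6, 59.2, 59.8, 66.0, 67.4, 68.8, 69.1, 69.1])
hardness = np.array([484, 427, 413, 517, 549, 648, 587, 704, 979, 914, 1070, 1020,
                     1210, 989, 1160, 1010, 1100, 1130, 1270, 1180, 1400, 1760, 1710, 2010,
                     1880, 1980, 1820, 2020, 1980, 2310, 1940, 3260, 2700, 2890, 2740, 3140])

slope, intercept = np.polyfit(density, hardness, 1)   # least-squares straight line
print(intercept + slope * 65)                         # rough prediction for density 65

plt.scatter(density, hardness)
plt.xlabel("Wood density")
plt.ylabel("Hardness")
plt.show()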
15.6 Solutions to the quick exercises

15.1 There are 272 elements in the dataset. The 91st and 182nd elements of the ordered data divide the dataset in three groups, each consisting of 90 elements. From a closer look at Table 15.2 we find that these two elements are 145 and 260.

15.2 In Table 15.2 one can easily count the number of observations in each of the bins (90, 120], . . . , (300, 330]. The heights on each bin can be computed by dividing the number of observations in each bin by 272 · 30 = 8160. We get the following:

Bin         Count  Height    Bin         Count  Height
(90, 120]     55   0.0067    (210, 240]    34   0.0042
(120, 150]    37   0.0045    (240, 270]    75   0.0092
(150, 180]     5   0.0006    (270, 300]    54   0.0066
(180, 210]     9   0.0011    (300, 330]     3   0.0004

15.3 From Table 15.2 we see that we must cover an interval of length at least 306 − 96 = 210 with bins of width $b = 3.49 \cdot 68.48 \cdot 272^{-1/3} = 36.89$. Since 210/36.89 = 5.69, we need at least six bins to cover the whole dataset.

15.4 By means of formula (15.1), we can write
$\int_{-\infty}^{\infty} f_{n,h}(t)\,dt = \frac{1}{nh}\sum_{i=1}^{n}\int_{-\infty}^{\infty} K\!\left(\frac{t-x_i}{h}\right) dt.$
For any i = 1, . . . , n, we find by change of integration variables t = hu + xi that
$\int_{-\infty}^{\infty} K\!\left(\frac{t-x_i}{h}\right) dt = h\int_{-\infty}^{\infty} K(u)\,du = h,$
where we also use condition (K1). This directly yields
$\int_{-\infty}^{\infty} f_{n,h}(t)\,dt = \frac{1}{nh}\cdot n\cdot h = 1.$

15.5 The kernel density estimate will be strictly positive between the minimum minus h and the maximum plus h. The bandwidth equals $h = 1.06 \cdot 68.48 \cdot 272^{-1/5} = 23.66$. From Table 15.2, we see that this will be between 96 − 23.66 = 72.34 and 306 + 23.66 = 329.66.

15.6 By definition the number of elements less than or equal to 1.5 is F300(1.5) · 300 = 210. Hence 90 elements are strictly greater than 1.5.

15.7 Just by drawing a straight line that seems to fit the datapoints well, the authors predicted a Janka hardness of about 2700.
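The two reference rules used in the solutions to Quick exercises 15.3 and 15.5 (bin width 3.49 · s · n^(−1/3) and bandwidth 1.06 · s · n^(−1/5)) are easy to evaluate by computer. A minimal Python sketch of ours, plugging in the sample size and sample standard deviation of the Old Faithful data quoted above:

n = 272                       # sample size of the Old Faithful data
s = 68.48                     # its sample standard deviation, as used in the solutions
b = 3.49 * s * n ** (-1 / 3)  # reference bin width, approximately 36.89
h = 1.06 * s * n ** (-1 / 5)  # reference bandwidth, approximately 23.66
print(round(b, 2), round(h, 2))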
  • 233. 226 15 Exploratory data analysis: graphical summaries 15.7 Exercises 15.1 In [33] Stephen Stigler discusses data from the Edinburgh Medical and Surgical Journal (1817). These concern the chest circumference of 5732 Scot- tish soldiers, measured in inches. The following information is given about the histogram with bin width 1, the first bin starting at 32.5. Bin Count Bin Count (32.5, 33.5] 3 (40.5, 41.5] 935 (33.5, 34.5] 19 (41.5, 42.5] 646 (34.5, 35.5] 81 (42.5, 43.5] 313 (35.5, 36.5] 189 (43.5, 44.5] 168 (36.5, 37.5] 409 (44.5, 45.5] 50 (37.5, 38.5] 753 (45.5, 46.5] 18 (38.5, 39.5] 1062 (46.5, 47.5] 3 (39.5, 40.5] 1082 (47.5, 48.5] 1 Source: S.M. Stigler. The history of statistics – The measurement of uncer- tainty before 1900. Cambridge, Massachusetts, 1986. a. Compute the height of the histogram on each bin. b. Make a sketch of the histogram. Would you view the dataset as being symmetric or skewed? 15.2 Recall the example of the space shuttle Challenger in Section 1.4. The following list contains the launch temperatures in degrees Fahrenheit during previous takeoffs. 66 70 69 68 67 72 73 70 57 63 70 78 67 53 67 75 70 81 76 79 75 76 58 Source: Presidential commission on the space shuttle Challenger accident. Report on the space shuttle Challenger accident. Washington, DC, 1986; table on pages 129–131. a. Compute the heights of a histogram with bin width 5, the first bin starting at 50. b. On January 28, 1986, during the launch of the space shuttle Challenger, the temperature was 31 degrees Fahrenheit. Given the dataset of launch temperatures of previous takeoffs, would you consider 31 as a representa- tive launch temperature? 15.3 In an article in Biometrika, an example is discussed about mine dis- asters during the period from March 15, 1851, to March, 22, 1962. A dataset has been obtained of 190 recorded time intervals (in days) between successive coal mine disasters involving ten or more men killed. The ordered data are listed in Table 15.6.
  • 234. 15.7 Exercises 227 Table 15.6. Number of days between successive coal mine disasters. 0 1 1 2 2 3 4 4 4 6 7 10 11 12 12 12 13 15 15 16 16 16 17 17 18 19 19 19 20 20 22 23 24 25 27 28 29 29 29 31 31 32 33 34 34 36 36 37 40 41 41 42 43 45 47 48 49 50 53 54 54 55 56 59 59 61 61 65 66 66 70 72 75 78 78 78 80 80 81 88 91 92 93 93 95 95 96 96 97 99 101 108 110 112 113 114 120 120 123 123 124 124 125 127 129 131 134 137 139 143 144 145 151 154 156 157 176 182 186 187 188 189 190 193 194 197 202 203 208 215 216 217 217 217 218 224 225 228 232 233 250 255 275 275 275 276 286 292 307 307 312 312 315 324 326 326 329 330 336 345 348 354 361 364 368 378 388 420 431 456 462 467 498 517 536 538 566 632 644 745 806 826 871 952 1205 1312 1358 1630 1643 2366 Source: R.G. Jarrett. A note on the intervals between coal mining disasters. Biometrika, 66:191-193, 1979; by permission of the Biometrika Trustees. a. Compute the height on each bin of the histogram with bins [0, 250], (250, 500], . . ., (2250, 2500]. b. Make a sketch of the histogram. Would you view the dataset as being symmetric or skewed? 15.4 The ordered software data (see also Table 15.3) are given in the fol- lowing list. 0 0 0 2 4 6 8 9 10 10 10 12 15 15 16 21 22 24 26 30 30 31 33 36 44 50 55 58 65 68 75 77 79 81 88 91 97 100 108 108 112 113 114 115 120 122 129 134 138 143 148 160 176 180 193 193 197 227 232 233 236 242 245 255 261 263 281 290 296 300 300 325 330 357 365 369 371 379 386 422 445 446 447 452 457 482 529 529 543 600 648 670 700 707 724 729 748 790 810 816 828 843 860 865 868 875 943 948 983 990 1011 1045 1064 1071 1082 1146 1160 1222 1247 1351 1435 1461 1755 1783 1800 1864 1897 2323 2930 3110 3321 4116 5485 5509 6150
  • 235. 228 15 Exploratory data analysis: graphical summaries a. Compute the heights on each bin of the histogram with bins [0, 500], (500, 1000], and so on. b. Compute the value of the empirical distribution function in the endpoints of the bins. c. Check that the area under the histogram on bin (1000, 1500] is equal to the increase Fn(1500) − Fn(1000) of the empirical distribution function on this bin. Actually, this is true for each single bin (see Exercise 15.11). 15.5 Suppose we construct a histogram with bins [0,1], (1,3], (3,5], (5,8], (8,11], (11,14], and (14,18]. Given are the values of the empirical distribution function at the boundaries of the bins: t 0 1 3 5 8 11 14 18 Fn(t) 0 0.225 0.445 0.615 0.735 0.805 0.910 1.000 Compute the height of the histogram on each bin. 15.6 Given is the following information about a histogram: Bin Height (0,2] 0.245 (2,4] 0.130 (4,7] 0.050 (7,11] 0.020 (11,15] 0.005 Compute the value of the empirical distribution function in the point t = 7. 15.7 In Exercise 15.2 a histogram was constructed for the Challenger data. On which bin does the empirical distribution function have the largest increase? 15.8 Define a function K by K(u) = cos(πu) for − 1 ≤ u ≤ 1 and K(u) = 0 elsewhere. Check whether K satisfies the conditions (K1)–(K3) for a kernel function. 15.9 On the basis of the duration of an eruption of the Old Faithful geyser, park rangers try to predict the waiting time to the next eruption. In Fig- ure 15.13 a scatterplot is displayed of the duration and the time to the next eruption in seconds. a. Does the scatterplot give reason to believe that the duration of an eruption influences the time to the next eruption?
Fig. 15.13. Scatterplot of the Old Faithful data (duration versus waiting time).

b. Suppose you have just observed an eruption that lasted 250 seconds. What would you predict for the time to the next eruption?
c. The dataset of durations shows two modes, i.e., there are two places where the data accumulate (see, for instance, the histogram in Figure 15.1). How many modes does the dataset of waiting times show?

15.10 Figure 15.14 displays the graph of an empirical distribution function of a dataset consisting of 200 elements. How many modes does the dataset show?

Fig. 15.14. Empirical distribution function.
15.11 Given is a histogram and the empirical distribution function Fn of the same dataset. Show that the height of the histogram on a bin (a, b] is
equal to
$\frac{F_n(b) - F_n(a)}{b - a}.$

15.12 Let fn,h be a kernel estimate. As mentioned in Section 15.3, fn,h itself is a probability density.

a. Show that the corresponding expectation is equal to
$\int_{-\infty}^{\infty} t f_{n,h}(t)\,dt = \bar{x}_n.$
Hint: you might consult the solution to Quick exercise 15.4.
b. Show that the second moment corresponding to fn,h satisfies
$\int_{-\infty}^{\infty} t^2 f_{n,h}(t)\,dt = \frac{1}{n}\sum_{i=1}^{n} x_i^2 + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du.$
16 Exploratory data analysis: numerical summaries

The classical way to describe important features of a dataset is to give several numerical summaries. We discuss numerical summaries for the center of a dataset and for the amount of variability among the elements of a dataset, and then we introduce the notion of quantiles for a dataset. To distinguish these quantities from corresponding notions for probability distributions of random variables, we will often add the word sample or empirical; for instance, we will speak of the sample mean and empirical quantiles. We end this chapter with the boxplot, which combines some of the numerical summaries in a graphical display.

16.1 The center of a dataset

The best-known method to identify the center of a dataset is to compute the sample mean

$\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}. \qquad (16.1)$

For the sake of notational convenience we will sometimes drop the subscript n and write x̄ instead of x̄n. The following dataset consists of hourly temperatures in degrees Fahrenheit (rounded to the nearest integer), recorded at Wick in northern Scotland from 5 p.m. December 31, 1960, to 3 a.m. January 1, 1961. The sample mean of the 11 measurements is equal to 44.7.

43 43 41 41 41 42 43 58 58 41 41

Source: V. Barnett and T. Lewis. Outliers in statistical data. Third edition, 1994. John Wiley & Sons Limited. Reproduced with permission.

Another way to identify the center of a dataset is by means of the sample median, which we will denote by Med(x1, x2, . . . , xn) or briefly Medn. The sample median is defined as the middle element of the dataset when it is put in ascending order. When n is odd, it is clear what this means. When n is even,
  • 239. 232 16 Exploratory data analysis: numerical summaries we take the average of the two middle elements. For the Wick temperature data the sample median is equal to 42. Quick exercise 16.1 Compute the sample mean and sample median of the dataset 4.6 3.0 3.2 4.2 5.0. Both methods have pros and cons. The sample mean is the natural analogue for a dataset of what the expectation is for a probability distribution. However, it is very sensitive to outliers, by which we mean observations in the dataset that deviate a lot from the bulk of the data. To illustrate the sensitivity of the sample mean, consider the Wick tempera- ture data displayed in Figure 16.1. The values 58 and 58 recorded at midnight and 1 a.m. are clearly far from the bulk of the data and give grounds for concern whether they are genuine (58 degrees Fahrenheit seems very warm at midnight for New Year’s in northern Scotland). To investigate their effect on the sample mean we compute the average of the data, leaving out these measurements, which gives 41.8 (instead of 44.7). The sample median of the data is equal to 41 (instead of 42) when leaving out the measurements with value 58. The median is more robust in the sense that it is hardly affected by a few outliers. 17 p.m. 19 p.m. 21 p.m. 23 p.m. 1am 3am Time of day 40 45 50 55 60 Temperature · · · · · · · · · · · Fig. 16.1. The Wick temperature data. It should be emphasized that this discussion is only meant to illustrate the sensitivity of the sample mean and by no means is intended to suggest we leave out measurements that deviate a lot from the bulk of the data! It is important to be aware of the presence of an outlier. In that case, one could try to find out whether there is perhaps something suspicious about this measurement. This might lead to assigning a smaller weight to such a measurement or even to
removing it from the dataset. However, sometimes it is possible to reconstruct the exact circumstances and correct the measurement. For instance, after further inquiry in the temperature example it turned out that at midnight the meteorological office changed its recording unit from degrees Fahrenheit to 1/10th degree Celsius (so 58 and 41 should read 5.8◦C and 4.1◦C). The corrected values in degrees Fahrenheit (to the nearest integer) are

43 43 41 41 41 42 43 42 42 39 39.

For the corrected data the sample mean is 41.5 and the sample median is 42.

Quick exercise 16.2 Consider the same dataset as in Quick exercise 16.1. Suppose that someone misreads the dataset as 4.6 30 3.2 4.2 50. Compute the sample mean and sample median and compare these values with the ones you found in Quick exercise 16.1.

16.2 The amount of variability of a dataset

To quantify the amount of variability among the elements of a dataset, one often uses the sample variance defined by

$s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x}_n)^2.$

Up to a scaling factor this is equal to the average squared deviation from x̄n. At first sight, it seems more natural to define the sample variance by

$\tilde{s}_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}_n)^2.$

Why we choose the factor 1/(n − 1) instead of 1/n will be explained later (see Chapter 19). Because $s_n^2$ is in different units from the elements of the dataset, one often prefers the sample standard deviation

$s_n = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x}_n)^2},$

which is measured in the same units as the elements of the dataset itself. Just as the sample mean, the sample standard deviation is very sensitive to outliers. For the (uncorrected) Wick temperature data the sample standard deviation is 6.62, or 0.97 if we leave out the two measurements with value 58.
  • 241. 234 16 Exploratory data analysis: numerical summaries For the corrected data the standard deviation is 1.44. A more robust measure of variability is the median of absolute deviations or MAD, which is defined as follows. Consider the absolute deviation of every element xi with respect to the sample median: |xi − Med(x1, x2, . . . , xn)| or briefly |xi − Medn|. The MAD is obtained by taking the median of all these absolute deviations MAD(x1, x2, . . . , xn) = Med(|x1 − Medn|, . . . , |xn − Medn|). (16.2) Quick exercise 16.3 Compute the sample standard deviation for the dataset of Quick exercise 16.1 for which it is given that the values of xi − x̄n are: −1.0, 0.6, −0.8, 0.2, 1.0. Also compute the MAD for this dataset. Just as the sample median, the MAD is hardly affected by outliers. For the (uncorrected) Wick temperature data the MAD is 1 and equal to 0 if we leave out the two measurements with value 58 (the value 0 seems a bit strange, but is a consequence of the fact that the observations are given in degrees Fahrenheit rounded to the nearest integer). For the corrected data the MAD is 1. Quick exercise 16.4 Compute the sample standard deviation for the mis- read dataset of Quick exercise 16.2 for which it is given that the values of xi − x̄n are: 11.6, −13.8, −15.2, −14.2, 31.6. Also compute the MAD for this dataset and compare both values with the ones you found in Quick exercise 16.3. 16.3 Empirical quantiles, quartiles, and the IQR The sample median divides the dataset in two more or less equal parts: about half of the elements are less than the median and about half of the elements are greater than the median. More generally, we can divide the dataset in two parts in such a way that a proportion p is less than a certain number and a proportion 1 − p is greater than this number. Such a number is called the 100p empirical percentile or the pth empirical quantile and is denoted by qn(p). For a suitable introduction of empirical quantiles we need the notion of order statistics.
The order statistics consist of the same elements as in the original dataset x1, x2, . . . , xn, but in ascending order. Denote by x(k) the kth element in the ordered list. Then

x(1) ≤ x(2) ≤ · · · ≤ x(n)

are called the order statistics of x1, x2, . . . , xn.

The order statistics of the Wick temperature data are

41 41 41 41 41 42 43 43 43 58 58.

Note that by putting the elements in order, it is possible that successive order statistics are the same, for instance, x(1) = · · · = x(5) = 41. Another example is Table 15.2, which lists the order statistics of the Old Faithful dataset.

To compute empirical quantiles one linearly interpolates between order statistics of the dataset. Let 0 < p < 1, and suppose we want to compute the pth empirical quantile for a dataset x1, x2, . . . , xn. The following computation is based on requiring that the ith order statistic is the i/(n + 1) quantile. If we denote the integer part of a by ⌊a⌋, then the computation of qn(p) runs as follows:

$q_n(p) = x_{(k)} + \alpha\left(x_{(k+1)} - x_{(k)}\right)$ with $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$.

On the left in Figure 16.2 the relation between the pth quantile and the empirical distribution function is illustrated for the Old Faithful data.
Fig. 16.2. Empirical quantile and quartiles for the Old Faithful data.

Quick exercise 16.5 Compute the 55th empirical percentile for the Wick temperature data.
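The interpolation rule for qn(p) translates directly into code. The following Python sketch is ours, not part of the text; it computes the empirical quantile from the order statistics exactly as in the formula above:

import math

def empirical_quantile(data, p):
    # q_n(p) = x_(k) + alpha * (x_(k+1) - x_(k)),
    # with k = floor(p * (n + 1)) and alpha = p * (n + 1) - k; assumes 0 < p < 1
    x = sorted(data)              # the order statistics x_(1), ..., x_(n)
    n = len(x)
    k = math.floor(p * (n + 1))
    alpha = p * (n + 1) - k
    return x[k - 1] + alpha * (x[k] - x[k - 1])

wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
print(empirical_quantile(wick, 0.25),
      empirical_quantile(wick, 0.5),
      empirical_quantile(wick, 0.75))   # gives 41, 42, and 43 for the Wick data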
  • 243. 236 16 Exploratory data analysis: numerical summaries Lower and upper quartiles Instead of identifying only the center of the dataset, Tukey [35] suggested to give a five-number summary of the dataset: the minimum, the maximum, the sample median, and the 25th and 75th empirical percentiles. The 25th empirical percentile qn(0.25) is called the lower quartile and the 75th empirical percentile qn(0.75) is called the upper quartile. Together with the median, the lower and upper quartiles divide the dataset in four more or less equal parts consisting of about one quarter of the number of elements. The relation of the two quartiles and the median with the empirical distribution function is illustrated for the Old Faithful data on the right of Figure 16.2. The distance between the lower quartile and the median, relative to the distance between the upper quartile and the median, gives some indication on the skewness of the dataset. The distance between the upper and lower quartiles is called the interquartile range, or IQR: IQR = qn(0.75) − qn(0.25). The IQR specifies the range of the middle half of the dataset. It could also serve as a robust measure of the amount of variability among the elements of the dataset. For the Old Faithful data the five-number summary is Minimum Lower quartile Median Upper quartile Maximum 96 129.25 240 267.75 306 and the IQR is 138.5. Quick exercise 16.6 Compute the five-number summary for the (uncor- rected) Wick temperature data. 16.4 The box-and-whisker plot Tukey [35] also proposed visualizing the five-number summary discussed in the previous section by a so-called box-and-whisker plot, briefly boxplot. Fig- ure 16.3 displays a boxplot. The data are now on the vertical axis, where we left out the numbers on the axis in order to explain the construction of the figure. The horizontal width of the box is irrelevant. In the vertical direction the box extends from the lower to the upper quartile, so that the height of the box is precisely the IQR. The horizontal line inside the box corresponds to the sample median. Up from the upper quartile we measure out a distance of 1.5 times the IQR and draw a so-called whisker up to the largest observation that lies within this distance, where we put a horizontal line. Similarly, down from the lower quartile we measure out a distance of 1.5 times the IQR and draw a whisker to the smallest observation that lies within this distance, where we also put a horizontal line. All other observations beyond the whiskers are marked by ◦. Such an observation is called an outlier.
Fig. 16.3. A boxplot. (The figure labels the minimum, the lower quartile, the median, the upper quartile, the maximum, and the whisker limits 1.5·IQR below the lower quartile and above the upper quartile.)

In Figure 16.4 the boxplots of the Old Faithful data and of the software reliability data (see also Chapter 15) are displayed. The skewness of the software reliability data produces a boxplot with whiskers of very different length and with several observations beyond the upper quartile plus 1.5 times the IQR. The boxplot of the Old Faithful data illustrates one of the shortcomings of the boxplot; it does not capture the fact that the data show two separate peaks. However, the position of the sample median inside the box does suggest that the dataset is skewed.

Quick exercise 16.7 Suppose we want to construct a boxplot of the (uncorrected) Wick temperature data. What is the height of the box, the length of both whiskers, and which measurements fall outside the box and whiskers? Would you consider the two values 58 extreme outliers?

Fig. 16.4. Boxplot of the Old Faithful data and the software data.
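The ingredients of a boxplot can be computed with a few lines of code. The following Python sketch is ours (not the book’s); it combines the empirical quantile rule of Section 16.3 with the whisker convention described above and reports the five-number summary, the IQR, and the observations falling beyond the whiskers:

import math

def quantile(x, p):
    # empirical quantile rule of Section 16.3; x must already be sorted
    n = len(x)
    k = math.floor(p * (n + 1))
    alpha = p * (n + 1) - k
    return x[k - 1] + alpha * (x[k] - x[k - 1])

def boxplot_summary(data):
    x = sorted(data)
    q1, med, q3 = quantile(x, 0.25), quantile(x, 0.5), quantile(x, 0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # whisker limits
    outliers = [v for v in x if v < low or v > high]    # marked by a circle in the plot
    return (x[0], q1, med, q3, x[-1]), iqr, outliers

wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
print(boxplot_summary(wick))   # IQR 2; the two values 58 fall beyond the upper whisker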
Using boxplots to compare several datasets

Although the boxplot provides some information about the structure of the data, such as center, range, skewness or symmetry, it is a poor graphical display of the dataset. Graphical summaries such as the histogram and kernel density estimate are more informative displays of a single dataset. Boxplots become useful if we want to compare several sets of data in a simple graphical display. In Figure 16.5 boxplots are displayed of the average drill time for dry and wet drilling up to a depth of 250 feet for the drill data discussed in Section 15.5 (see also Table 15.4). It is clear that the boxplot corresponding to dry drilling differs from that corresponding to wet drilling. However, the question is whether this difference can still be attributed to chance or is caused by the drilling technique used. We will return to this type of question in Chapter 25.

Fig. 16.5. Boxplot of average drill times.

16.5 Solutions to the quick exercises

16.1 The average is
$\bar{x}_n = \frac{4.6 + 3.0 + 3.2 + 4.2 + 5.0}{5} = \frac{20}{5} = 4.$
The median is the middle element of 3.0, 3.2, 4.2, 4.6, and 5.0, which gives Medn = 4.2.

16.2 The average is
$\bar{x}_n = \frac{4.6 + 30 + 3.2 + 4.2 + 50}{5} = \frac{90}{5} = 18,$
which differs 14.4 from the average we found in Quick exercise 16.1. The median is the middle element of 3.2, 4.2, 4.6, 30, and 50. This gives Medn = 4.6, which only differs 0.4 from the median we found in Quick exercise 16.1. As one can see, the median is hardly affected by the two outliers.

16.3 The sample variance is
$s_n^2 = \frac{(-1)^2 + (0.6)^2 + (-0.8)^2 + (0.2)^2 + (1.0)^2}{5 - 1} = \frac{3.04}{4} = 0.76$
so that the sample standard deviation is $s_n = \sqrt{0.76} = 0.872$. The median is 4.2, so that the absolute deviations from the median are given by

0.4 1.2 1.0 0.0 0.8.

The MAD is the median of these numbers, which is 0.8.

16.4 The sample variance is
$s_n^2 = \frac{(11.6)^2 + (-13.8)^2 + (-15.2)^2 + (-14.2)^2 + (31.6)^2}{5 - 1} = \frac{1756.24}{4} = 439.06$
so that the sample standard deviation is $s_n = \sqrt{439.06} = 20.95$, which is a difference of 20.19 from the value we found in Quick exercise 16.3. The median is 4.6, so that the absolute deviations from the median are given by

0.0 25.4 1.4 0.4 45.4.

The MAD is the median of these numbers, which is 1.4. Just as the median, the MAD is hardly affected by the two outliers.

16.5 We have k = ⌊0.55 · 12⌋ = ⌊6.6⌋ = 6, so that α = 0.6. This gives
qn(0.55) = x(6) + 0.6 · (x(7) − x(6)) = 42 + 0.6 · (43 − 42) = 42.6.

16.6 From the order statistics of the Wick temperature data

41 41 41 41 41 42 43 43 43 58 58

it can be seen immediately that minimum, maximum, and median are given by 41, 58, and 42. For the lower quartile we have k = ⌊0.25 · 12⌋ = 3, so that α = 0 and qn(0.25) = x(3) = 41. For the upper quartile we have k = ⌊0.75 · 12⌋ = 9, so that again α = 0 and qn(0.75) = x(9) = 43. Hence for the Wick temperature data the five-number summary is

Minimum  Lower quartile  Median  Upper quartile  Maximum
   41          41           42          43          58
  • 247. 240 16 Exploratory data analysis: numerical summaries 16.7 From the five-number summary for the Wick temperature data (see Quick exercise 16.6), it follows immediately that the height of the box is the IQR: 43 − 41 = 2. If we measure out a distance of 1.5 times 2 down from the lower quartile 41, we see that the smallest observation within this range is 41, which means that the lower whisker has length zero. Similarly, the upper whisker has length zero. The two measurements with value 58 are outside the box and whiskers. The two values 58 are clearly far away from the bulk of the data and should be considered extreme outliers. 41 42 43 58 ◦ ◦ 16.6 Exercises 16.1 Use the order statistics of the software data as given in Exercise 15.4 to answer the following questions. a. Compute the sample median. b. Compute the lower and upper quartiles and the IQR. c. Compute the 37th empirical percentile. 16.2 Compute for the Old Faithful data the distance of the lower and upper quartiles to the median and explain the difference. 16.3 Recall the example about the space shuttle Challenger in Section 1.4. The following table lists the order statistics of launch temperatures during take-offs in degrees Fahrenheit, including the launch temperature on Jan- uary 28, 1986. 31 53 57 58 63 66 67 67 67 68 69 70 70 70 70 72 73 75 75 76 76 78 79 81 a. Find the sample median and the lower and upper quartiles. b. Sketch the boxplot of this dataset.
c. On January 28, 1986, the launch temperature was 31 degrees Fahrenheit. Comment on the value 31 with respect to the other data points.

16.4 The sample mean and sample median of the uncorrected Wick temperature data (in degrees Fahrenheit) are 44.7 and 42. We transform the data from degrees Fahrenheit (xi) to degrees Celsius (yi) by means of the formula yi = (5/9)(xi − 32), which gives the following dataset

55/9 55/9 5 5 5 50/9 55/9 130/9 130/9 5 5.

a. Check that ȳn = (5/9)(x̄n − 32).
b. Is it also true that Med(y1, . . . , yn) = (5/9)(Med(x1, . . . , xn) − 32)?
c. Suppose we have a dataset x1, x2, . . . , xn and construct y1, y2, . . . , yn, where yi = axi + b with a and b being real numbers. Do similar relations hold for the sample mean and sample median? If so, state them.

16.5 Consider the uncorrected Wick temperature data in degrees Fahrenheit (xi) and the corresponding temperatures in degrees Celsius (yi) as given in Exercise 16.4. The sample standard deviation and the MAD for the Wick data are 6.62 and 1.

a. Let sF and sC denote the sample standard deviations of x1, x2, . . . , xn and y1, y2, . . . , yn respectively. Check that sC = (5/9)sF.
b. Let MADF and MADC denote the MAD of x1, x2, . . . , xn and y1, y2, . . . , yn respectively. Is it also true that MADC = (5/9)MADF?
c. Suppose we have a dataset x1, x2, . . . , xn and construct y1, y2, . . . , yn, where yi = axi + b with a and b being real numbers. Do similar relations hold for the sample standard deviation and the MAD? If so, state them.

16.6 Consider two datasets: 1, 5, 9 and 2, 4, 6, 8.

a. Denote the sample means of the two datasets by x̄ and ȳ. Is it true that the average (x̄ + ȳ)/2 of x̄ and ȳ is equal to the sample mean of the combined dataset with 7 elements?
b. Suppose we have two other datasets: one of size n with sample mean x̄n and another dataset of size m with sample mean ȳm. Is it always true that the average (x̄n + ȳm)/2 of x̄n and ȳm is equal to the sample mean of the combined dataset with n + m elements? If no, then provide a counterexample. If yes, then explain this.
c. If m = n, is (x̄n + ȳm)/2 equal to the sample mean of the combined dataset with n + m elements?
16.7 Consider the two datasets from Exercise 16.6.

a. Denote the sample medians of the two datasets by Medx and Medy. Is it true that the sample median (Medx + Medy)/2 of the two sample medians is equal to the sample median of the combined dataset with 7 elements?
b. Suppose we have two other datasets: one of size n with sample median Medx and another dataset of size m with sample median Medy. Is it always true that the sample median (Medx + Medy)/2 of the two sample medians is equal to the sample median of the combined dataset with n + m elements? If no, then provide a counterexample. If yes, then explain this.
c. What if m = n?

16.8 Compute the MAD for the combined dataset of 7 elements from Exercise 16.6.

16.9 Consider a dataset x1, x2, . . . , xn with xi ≠ 0. We construct a second dataset y1, y2, . . . , yn, where yi = 1/xi.

a. Suppose dataset x1, x2, . . . , xn consists of −6, 1, 15. Is it true that ȳ3 = 1/x̄3?
b. Suppose that n is odd. Is it true that ȳn = 1/x̄n?
c. Suppose that n is odd and each xi > 0. Is it true that Med(y1, . . . , yn) = 1/Med(x1, . . . , xn)? What about when n is even?

16.10 A method to investigate the sensitivity of the sample mean and the sample median to extreme outliers is to replace one or more elements in a given dataset by a number y and investigate the effect when y goes to infinity. To illustrate this, consider the dataset from Quick exercise 16.1:

4.6 3.0 3.2 4.2 5.0

with sample mean 4 and sample median 4.2.

a. We replace the element 3.2 by some real number y. What happens with the sample mean and the sample median of this new dataset as y → ∞?
b. We replace a number of elements by some real number y. How many elements do we need to replace so that the sample median of the new dataset goes to infinity as y → ∞?
c. Suppose we have another dataset of size n. How many elements do we need to replace by some real number y, so that the sample mean of the new dataset goes to infinity as y → ∞? And how many elements do we need to replace, so that the sample median of the new dataset goes to infinity?
  • 250. 16.6 Exercises 243 16.11 Just as in Exercise 16.10 we investigate the sensitivity of the sample standard deviation and the MAD to extreme outliers, by considering the same dataset with sample standard deviation 0.872 and MAD equal to 0.8. Answer the same three questions for the sample standard deviation and the MAD instead of the sample mean and sample median. 16.12 Compute the sample mean and sample median for the dataset 1, 2, . . ., N in case N is odd and in case N is even. You may use the fact that 1 + 2 + · · · + N = N(N + 1) 2 . 16.13 Compute the sample standard deviation and MAD for the dataset −N, . . . , −1, 0, 1, . . ., N. You may use the fact that 12 + 22 + · · · + N2 = N(N + 1)(2N + 1) 6 . 16.14 Check that the 50th empirical percentile is the sample median. 16.15 The following rule is useful for the computation of the sample vari- ance (and standard deviation). Show that 1 n n i=1 (xi − x̄n)2 = 1 n n i=1 x2 i − (x̄n) 2 where x̄n = ( n i=1 xi)/n. 16.16 Recall Exercise 15.12, where we computed the mean and second mo- ment corresponding to a density estimate fn,h. Show that the variance corre- sponding to fn,h satisfies: ∞ −∞ t2 fn,h(t) dt− ∞ −∞ tfn,h(t) dt 2 = 1 n n i=1 (xi −x̄n)2 +h2 ∞ −∞ u2 K(u) du. 16.17 Suppose we have a dataset x1, x2, . . . , xn. Check that if p = i/(n + 1) the pth empirical quantile is the ith order statistic.
  • 251. 17 Basic statistical models In this chapter we introduce a common statistical model. It corresponds to the situation where the elements of the dataset are repeated measurements of the same quantity and where different measurements do not influence each other. Next, we discuss the probability distribution of the random variables that model the measurements and illustrate how sample statistics can help to select a suitable statistical model. Finally, we discuss the simple linear regression model that corresponds to the situation where the elements of the dataset are paired measurements. 17.1 Random samples and statistical models In Chapter 1 we briefly discussed Michelson’s experiment conducted between June 5 and July 2 in 1879, in which 100 measurements were obtained on the speed of light. The values are given in Table 17.1 and represent the speed of light in air in km/sec minus 299 000. The variation among the 100 values suggests that measuring the speed of light is subject to random influences. As we have seen before, we describe random phenomena by means of a probability model, i.e., we interpret the outcome of an experiment as a realization of some random variable. Hence the first measurement is modeled by a random variable X1 and the value 850 is interpreted as the realization of X1. Similarly, the second measurement is modeled by a random variable X2 and the value 740 is interpreted as the realization of X2. Since both measurements are obtained under the same experimental conditions, it is justified to assume that the probability distributions of X1 and X2 are the same. More generally, the 100 measurements are modeled by random variables X1, X2, . . . , X100 with the same probability distribution, and the values in Table 17.1 are inter- preted as realizations of X1, X2, . . . , X100. Moreover, because we believe that
  • 252. 246 17 Basic statistical models Table 17.1. Michelson data on the speed of light. 850 740 900 1070 930 850 950 980 980 880 1000 980 930 650 760 810 1000 1000 960 960 960 940 960 940 880 800 850 880 900 840 830 790 810 880 880 830 800 790 760 800 880 880 880 860 720 720 620 860 970 950 880 910 850 870 840 840 850 840 840 840 890 810 810 820 800 770 760 740 750 760 910 920 890 860 880 720 840 850 850 780 890 840 780 810 760 810 790 810 820 850 870 870 810 740 810 940 950 800 810 870 Source: E.N. Dorsey. The velocity of light. Transactions of the American Philosophical Society. 34(1):1-110, 1944; Table 22 on pages 60-61. Michelson took great care not to have the measurements influence each other, the random variables X1, X2, . . . , X100 are assumed to be mutually indepen- dent (see also Remark 3.1 about physical and stochastic independence). Such a collection of random variables is called a random sample or briefly, sample. Random sample. A random sample is a collection of random vari- ables X1, X2, . . . , Xn, that have the same probability distribution and are mutually independent. If F is the distribution function of each random variable Xi in a random sample, we speak of a random sample from F. Similarly, we speak of a random sample from a density f, a random sample from an N(µ, σ2 ) distribution, etc. Quick exercise 17.1 Suppose we have a random sample X1, X2 from a dis- tribution with variance 1. Compute the variance of X1 + X2. Properties that are inherent to the random phenomenon under study may provide additional knowledge about the distribution of the sample. Recall the software data discussed in Chapter 15. The data are observed lengths in CPU seconds between successive failures that occur during the execution of a certain real-time command. Typically, in a situation like this, in a small time interval, either 0 or 1 failure occurs. Moreover, failures occur with small probability and in disjoint time intervals failures occur independent of each other. In addition, let us assume that the rate at which the failures occur is constant over time. According to Chapter 12, this justifies the choice of a Poisson process to model the series of failures. From the properties of the Poisson process we know that the interfailure times are independent and have the same exponential distribution. Hence we model the software data as the realization of a random sample from an exponential distribution.
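Under this model the interfailure times are treated as realizations of independent random variables with one and the same exponential distribution. Purely as an illustration of what such a random sample looks like, here is a small Python sketch of ours; the failure rate and sample size below are arbitrary illustrative values, not estimates from the software data:

import numpy as np

rng = np.random.default_rng(1)
lam = 0.005                                         # hypothetical failure rate per CPU second
sample = rng.exponential(scale=1 / lam, size=100)   # one realization of a random sample
print(sample[:5].round(1))                          # five simulated interfailure times
print(round(sample.mean(), 1))                      # should be roughly 1 / lam = 200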
  • 253. 17.1 Random samples and statistical models 247 In some cases we may not be able to specify the type of distribution. Take, for instance, the Old Faithful data consisting of observed durations of eruptions of the Old Faithful geyser. Due to lack of specific geological knowledge about the subsurface and the mechanism that governs the eruptions, we prefer not to assume a particular type of distribution. However, we do model the durations as the realization of a random sample from a continuous distribution on (0, ∞). In each of the three examples the dataset was obtained from repeated mea- surements performed under the same experimental conditions. The basic sta- tistical model for such a dataset is to consider the measurements as a random sample and to interpret the dataset as the realization of the random sample. Knowledge about the phenomenon under study and the nature of the experi- ment may lead to partial specification of the probability distribution of each Xi in the sample. This should be included in the model. Statistical model for repeated measurements. A dataset consisting of values x1, x2, . . . , xn of repeated measurements of the same quantity is modeled as the realization of a random sample X1, X2, . . . , Xn. The model may include a partial specification of the probability distribution of each Xi. The probability distribution of each Xi is called the model distribution. Usu- ally it refers to a collection of distributions: in the Old Faithful example to the collection of all continuous distributions on (0, ∞), in the software ex- ample to the collection of all exponential distributions. In the latter case the parameter of the exponential distribution is called the model parameter. The unique distribution from which the sample actually originates is assumed to be one particular member of this collection and is called the “true” distribu- tion. Similarly, in the software example, the parameter corresponding to the “true” exponential distribution is called the “true” parameter. The word true is put between quotation marks because it does not refer to something in the real world, but only to a distribution (or parameter) in the statistical model, which is merely an approximation of the real situation. Quick exercise 17.2 We obtain a dataset of ten elements by tossing a coin ten times and recording the result of each toss. What is an appropriate sta- tistical model and corresponding model distribution for this dataset? Of course there are situations where the assumption of independence or identi- cal distributions is unrealistic. In that case a different statistical model would be more appropriate. However, we will restrict ourselves mainly to the case where the dataset can be modeled as the realization of a random sample. Once we have formulated a statistical model for our dataset, we can use the dataset to infer knowledge about the model distribution. Important questions about the corresponding model distribution are
• which feature of the model distribution represents the quantity of interest and how do we use our dataset to determine a value for this?
• which model distribution fits a particular dataset best?

These questions can be diverse, and answering them may be difficult. For instance, the Old Faithful data are modeled as a realization of a random sample from a continuous distribution. Suppose we are interested in a complete characterization of the “true” distribution, such as the distribution function F or the probability density f. Since there are no further specifications about the type of distribution, our problem would be to estimate the complete curve of F or f on the basis of our dataset. On the other hand, the software data are modeled as the realization of a random sample from an exponential distribution. In that case F and f are completely characterized by a single parameter λ:

F(x) = 1 − e^{−λx} and f(x) = λe^{−λx} for x ≥ 0.

Even if we are interested in the curves of F and f, our problem would reduce to estimating a single parameter on the basis of our dataset.

In other cases we may not be interested in the distribution as a whole, but only in a specific feature of the model distribution that represents the quantity of interest. For instance, in a physical experiment, such as the one performed by Michelson, one usually thinks of each measurement as

measurement = quantity of interest + measurement error.

The quantity of interest, in this case the speed of light, is thought of as being some (unknown) constant and the measurement error is some random fluctuation. In the absence of systematic error, the measurement error can be modeled by a random variable with zero expectation and finite variance. In that case the measurements are modeled by a random sample from a distribution with some unknown expectation and finite variance. The speed of light is represented by the expectation of the model distribution. Our problem would be to estimate the expectation of the model distribution on the basis of our dataset.

In the remaining chapters, we will develop several statistical methods to infer knowledge about the “true” distribution or about a specific feature of it, by means of a dataset. In the remainder of this chapter we will investigate how the graphical and numerical summaries of our dataset can serve as a first indication of what an appropriate choice would be for this distribution or for a specific feature, such as its expectation.

17.2 Distribution features and sample statistics

In Chapters 15 and 16 we have discussed several empirical summaries of datasets. They are examples of numbers, curves, and other objects that are a
function h(x1, x2, . . . , xn) of the dataset x1, x2, . . . , xn only. Since datasets are modeled as realizations of random samples X1, X2, . . . , Xn, an object h(x1, x2, . . . , xn) is a realization of the corresponding random object h(X1, X2, . . . , Xn). Such an object, which depends on the random sample X1, X2, . . . , Xn only, is called a sample statistic.

If a statistical model adequately describes the dataset at hand, then the sample statistics corresponding to the empirical summaries should somehow reflect corresponding features of the model distribution. We have already seen a mathematical justification for this in Chapter 13 for the sample statistic

  X̄n = (X1 + X2 + · · · + Xn)/n,

based on a sample X1, X2, . . . , Xn from a probability distribution with expectation µ. According to the law of large numbers,

  lim_{n→∞} P(|X̄n − µ| > ε) = 0   for every ε > 0.

This means that for large sample size n, the sample mean of most realizations of the random sample is close to the expectation of the corresponding distribution. In fact, all sample statistics discussed in Chapters 15 and 16 are close to corresponding distribution features. To illustrate this we generate an artificial dataset from a normal distribution with parameters µ = 5 and σ = 2, using a technique similar to the one described in Section 6.2. Next, we compare the sample statistics with corresponding features of this distribution.

The empirical distribution function

Let X1, X2, . . . , Xn be a random sample from distribution function F, and let

  Fn(a) = (number of Xi in (−∞, a]) / n

be the empirical distribution function of the sample. Another application of the law of large numbers (see Exercise 13.7) yields that for every ε > 0,

  lim_{n→∞} P(|Fn(a) − F(a)| > ε) = 0.

This means that for most realizations of the random sample the empirical distribution function Fn is close to F: Fn(a) ≈ F(a).
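This convergence is easy to see in a small simulation. The following Python sketch (using numpy and scipy; the seed, the sample sizes, and the evaluation point a = 5 are arbitrary choices) generates samples from the N(5, 4) distribution and compares Fn(a) with F(a):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)           # arbitrary seed, for reproducibility

def ecdf(sample, a):
    """Empirical distribution function Fn(a): fraction of observations <= a."""
    return np.mean(sample <= a)

a = 5.0                                  # arbitrary evaluation point
F_a = norm.cdf(a, loc=5, scale=2)        # true F(a) for the N(5, 4) distribution

for n in (20, 200, 2000):
    sample = rng.normal(loc=5, scale=2, size=n)   # simulated N(5, 4) sample
    print(n, ecdf(sample, a), F_a)
```

Because a = 5 is the point of symmetry of the N(5, 4) distribution, F(a) equals 0.5 here, and the printed values of Fn(a) should settle down near 0.5 as n grows.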
Fig. 17.1. Empirical distribution functions of normal samples.

Hence the empirical distribution function of the normal dataset should resemble the distribution function

  F(a) = ∫_{−∞}^{a} 1/(2√(2π)) e^{−½((x−5)/2)²} dx

of the N(5, 4) distribution, and the fit should become better as the sample size n increases. An illustration of this can be found in Figure 17.1. We displayed the empirical distribution functions of datasets generated from an N(5, 4) distribution together with the “true” distribution function F (dotted lines), for sample sizes n = 20 (left) and n = 200 (right).

The histogram and the kernel density estimate

Suppose the random sample X1, X2, . . . , Xn is generated from a continuous distribution with probability density f. In Section 13.4 we have seen yet another consequence of the law of large numbers:

  (number of Xi in (x − h, x + h]) / (2hn) ≈ f(x).

When (x − h, x + h] is a bin of a histogram of the random sample, this means that the height of the histogram approximates the value of f at the midpoint of the bin:

  height of the histogram on (x − h, x + h] ≈ f(x).

Similarly, the kernel density estimate of a random sample approximates the corresponding probability density f: fn,h(x) ≈ f(x).
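Both approximations can be checked numerically in the same spirit as before. A possible Python sketch (numpy and scipy assumed; the bin (x − h, x + h] with x = 5 and h = 0.5, and the Gaussian kernel, are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sample = rng.normal(loc=5, scale=2, size=200)     # simulated N(5, 4) sample

x, h = 5.0, 0.5                                   # bin midpoint and half bin width
in_bin = np.sum((sample > x - h) & (sample <= x + h))
hist_height = in_bin / (2 * h * len(sample))      # height of the histogram on (x-h, x+h]

# Kernel density estimate fn,h(x) with a Gaussian kernel and bandwidth h
kde_at_x = np.mean(norm.pdf((x - sample) / h)) / h

print(hist_height, kde_at_x, norm.pdf(x, loc=5, scale=2))   # both should be near f(x)
```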
Fig. 17.2. Histogram and kernel density estimate of a sample of size 200.

So the histogram and kernel density estimate of the normal dataset should resemble the graph of the probability density

  f(x) = 1/(2√(2π)) e^{−½((x−5)/2)²}

of the N(5, 4) distribution. This is illustrated in Figure 17.2, where we displayed a histogram and a kernel density estimate of our dataset consisting of 200 values generated from the N(5, 4) distribution. It should be noted that with a smaller dataset the similarity can be much worse. This is demonstrated in Figure 17.3, which is based on the dataset consisting of 20 values generated from the same distribution.

Fig. 17.3. Histogram and kernel density estimate of a sample of size 20.
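The effect of the sample size on the quality of such approximations can also be quantified, for instance by the largest distance between the empirical distribution function and F at the data points. A minimal sketch along these lines (arbitrary seed, same N(5, 4) model):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def max_ecdf_error(sample):
    """Largest distance between the empirical cdf and the N(5, 4) cdf at the data points."""
    x = np.sort(sample)
    n = len(x)
    Fn = np.arange(1, n + 1) / n      # values of the empirical cdf at the sorted data
    return np.max(np.abs(Fn - norm.cdf(x, loc=5, scale=2)))

for n in (20, 200):
    print(n, max_ecdf_error(rng.normal(loc=5, scale=2, size=n)))
```

In most runs the reported distance for n = 20 is noticeably larger than for n = 200, in line with Figures 17.2 and 17.3.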
Remark 17.1 (About the approximations). Let Hn be the height of the histogram on the interval (x − h, x + h], which is assumed to be a bin of the histogram. Direct application of the law of large numbers merely yields that Hn converges to

  (1/(2h)) ∫_{x−h}^{x+h} f(u) du.

Only for small h is this close to f(x). However, if we let h tend to 0 as n increases, a variation on the law of large numbers will guarantee that Hn converges to f(x): for every ε > 0,

  lim_{n→∞} P(|Hn − f(x)| > ε) = 0.

A possible choice is the optimal bin width mentioned in Remark 15.1. Similarly, direct application of the law of large numbers yields that a kernel density estimator with fixed bandwidth h converges to

  ∫_{−∞}^{∞} f(x + hu)K(u) du.

Once more, only for small h is this close to f(x), provided that K is symmetric and integrates to one. However, by letting the bandwidth h tend to 0 as n increases, yet another variation on the law of large numbers will guarantee that fn,h(x) converges to f(x): for every ε > 0,

  lim_{n→∞} P(|fn,h(x) − f(x)| > ε) = 0.

A possible choice is the optimal bandwidth mentioned in Remark 15.2.

The sample mean, the sample median, and empirical quantiles

As we saw in Section 5.5, the expectation of an N(µ, σ²) distribution is µ; so the N(5, 4) distribution has expectation 5. According to the law of large numbers: X̄n ≈ µ. This is illustrated by our dataset of 200 values generated from the N(5, 4) distribution, for which we find x̄200 = 5.012. For the sample median we find

  Med(x1, . . . , x200) = 5.018.

This illustrates the fact that the sample median of a random sample from F approximates the median q0.5 = Finv(0.5). In fact, we have the following general property for the pth empirical quantile:

  qn(p) ≈ Finv(p) = qp.

In the special case of the N(µ, σ²) distribution, the expectation and the median coincide, which explains why the sample mean and sample median of the normal dataset are so close to each other.
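A comparison of this kind is easy to redo in a simulation; the values will of course differ from 5.012 and 5.018, because every run produces a new realization. A possible Python sketch (numpy and scipy assumed, arbitrary seed):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
sample = rng.normal(loc=5, scale=2, size=200)    # a fresh N(5, 4) dataset of size 200

print(np.mean(sample))                   # sample mean, should be close to mu = 5
print(np.median(sample))                 # sample median, close to Finv(0.5) = 5
print(np.quantile(sample, 0.75))         # empirical 0.75 quantile qn(0.75)
print(norm.ppf(0.75, loc=5, scale=2))    # true 0.75 quantile of the N(5, 4) distribution
```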
The sample variance and standard deviation, and the MAD

As we saw in Section 5.5, the standard deviation and variance of an N(µ, σ²) distribution are σ and σ²; so for the N(5, 4) distribution these are 2 and 4. Another consequence of the law of large numbers is that

  Sn² ≈ σ²   and   Sn ≈ σ.

This is illustrated by our normal dataset of size 200, for which we find s200² = 4.761 and s200 = 2.182 for the sample variance and sample standard deviation. For the MAD of the dataset we find 1.334, which clearly differs from the standard deviation 2 of the N(5, 4) distribution. The reason is that

  MAD(X1, X2, . . . , Xn) ≈ Finv(0.75) − Finv(0.5),

for any distribution that is symmetric around its median Finv(0.5). For the N(5, 4) distribution

  Finv(0.75) − Finv(0.5) = 2Φinv(0.75) = 1.3490,

where Φ denotes the distribution function of the standard normal distribution (see Exercise 17.10).

Relative frequencies

For continuous distributions the histogram and kernel density estimates of a random sample approximate the corresponding probability density f. For discrete distributions we would like to have a sample statistic that approximates the probability mass function. In Section 13.4 we saw that, as a consequence of the law of large numbers, relative frequencies based on a random sample approximate corresponding probabilities. As a special case, for a random sample X1, X2, . . . , Xn from a discrete distribution with probability mass function p, one has that

  (number of Xi equal to a) / n ≈ p(a).

This means that the relative frequency of a's in the sample approximates the value of the probability mass function at a. Table 17.2 lists the sample statistics and the corresponding distribution features they approximate.

17.3 Estimating features of the “true” distribution

In the previous section we generated a dataset of 200 elements from a probability distribution, and we have seen that certain features of this distribution are approximated by corresponding sample statistics. In practice, the situation is reversed. In that case we have a dataset of n elements that is modeled as the realization of a random sample with a probability distribution that is unknown to us. Our goal is to use our dataset to estimate a certain feature of this distribution that represents the quantity of interest. In this section we will discuss a few examples.
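Before turning to the examples, the last approximation of Section 17.2, relative frequencies versus the probability mass function, can be illustrated in the same way as the others; in the sketch below the Pois(2) distribution is an arbitrary choice of discrete model distribution:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(5)
sample = rng.poisson(lam=2.0, size=500)          # random sample from a Pois(2) distribution

for a in range(5):
    rel_freq = np.mean(sample == a)              # relative frequency of the value a
    print(a, rel_freq, poisson.pmf(a, mu=2.0))   # compare with p(a)
```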
Table 17.2. Some sample statistics and corresponding distribution features.

  Sample statistic                                 Distribution feature

  Graphical
  Empirical distribution function Fn               Distribution function F
  Kernel density estimate fn,h and histogram       Probability density f
  (Number of Xi equal to a)/n                      Probability mass function p(a)

  Numerical
  Sample mean X̄n                                   Expectation µ
  Sample median Med(X1, X2, . . . , Xn)            Median q0.5 = Finv(0.5)
  pth empirical quantile qn(p)                     100pth percentile qp = Finv(p)
  Sample variance Sn²                              Variance σ²
  Sample standard deviation Sn                     Standard deviation σ
  MAD(X1, X2, . . . , Xn)                          Finv(0.75) − Finv(0.5), for symmetric F

The Old Faithful data

We stick to the assumptions of Section 17.1: for lack of knowledge about this phenomenon we prefer not to specify a particular parametric type of distribution, and we model the Old Faithful data as the realization of a random sample of size 272 from a continuous probability distribution. From the previous section we know that the kernel density estimate and the empirical distribution function of the dataset approximate the probability density f and the distribution function F of this distribution. In Figure 17.4 a kernel density estimate (left) and the empirical distribution function (right) are displayed. Indeed, neither graph resembles the probability density function or distribution function of any of the familiar parametric distributions. Instead of viewing both graphs only as graphical summaries of the data, we can also use both curves as estimates for f and F.

Fig. 17.4. Nonparametric estimates for f and F based on the Old Faithful data.
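In software, such nonparametric estimates take only a few lines. The sketch below assumes the 272 durations are available in a text file; the file name oldfaithful.txt is only a placeholder, and scipy's gaussian_kde is one possible kernel density estimator:

```python
import numpy as np
from scipy.stats import gaussian_kde

durations = np.loadtxt("oldfaithful.txt")    # placeholder for the 272 observed durations

f_hat = gaussian_kde(durations)              # nonparametric estimate of the density f

def F_hat(a):
    """Empirical distribution function: nonparametric estimate of F(a)."""
    return np.mean(durations <= a)

grid = np.linspace(durations.min(), durations.max(), 200)
density_values = f_hat(grid)                 # estimated density on a grid of points
cdf_values = [F_hat(a) for a in grid]        # estimated distribution function on the grid
```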
We estimate the model probability density f by means of the kernel density estimate and the model distribution function F by means of the empirical distribution function. Since neither estimate assumes a particular parametric model, they are called nonparametric estimates.

The software data

Next consider the software reliability data. As motivated in Section 17.1, we model interfailure times as the realization of a random sample from an exponential distribution. To see whether an exponential distribution is indeed a reasonable model, we plot a histogram and a kernel density estimate using a boundary kernel in Figure 17.5.

Fig. 17.5. Histogram and kernel density estimate for the software data.

Both seem to corroborate the assumption of an exponential distribution. Accepting this, we are left with estimating the parameter λ. Because for the exponential distribution E[X] = 1/λ, the law of large numbers suggests 1/x̄ as an estimate for λ. For our dataset x̄ = 656.88, which yields 1/x̄ = 0.0015. In Figure 17.6 we compare the estimated exponential density (left) and distribution function (right) with the corresponding nonparametric estimates. Note that the nonparametric estimates do not assume an exponential model for the data. But, if an exponential distribution were the right model, the kernel density estimate and empirical distribution function should resemble the estimated exponential density and distribution function. At first sight the fit seems reasonable, although near zero the data accumulate more than one might perhaps expect for a sample of size 135 from an exponential distribution, and the other way around at the other end of the data range. The question is whether this phenomenon can be attributed to chance or is caused by the fact that the exponential model is the wrong model. We will return to this type of question in Chapter 25 (see also Chapter 18).
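The computations behind this parametric fit are equally short. In the sketch below the file name software_times.txt is a placeholder for the 135 observed interfailure times, and the evaluation points are arbitrary; with the actual data the estimate of λ comes out as 1/656.88 ≈ 0.0015:

```python
import numpy as np

times = np.loadtxt("software_times.txt")   # placeholder for the 135 interfailure times

lam_hat = 1.0 / np.mean(times)             # estimate of lambda, since E[X] = 1/lambda

def F_exp(a):
    """Distribution function of the estimated exponential distribution."""
    return 1.0 - np.exp(-lam_hat * a)

def F_emp(a):
    """Empirical distribution function of the data."""
    return np.mean(times <= a)

for a in (250, 1000, 4000):                # a few arbitrary points on the time axis
    print(a, F_exp(a), F_emp(a))           # compare fitted and nonparametric estimates
```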
Fig. 17.6. Kernel density estimate and empirical cdf for software data (solid) compared to f and F of the estimated exponential distribution.

Michelson data

Consider the Michelson data on the speed of light. In this case we are not particularly interested in estimation of the “true” distribution, but solely in the expectation of this distribution, which represents the speed of light. The law of large numbers suggests estimating the expectation by the sample mean x̄, which equals 852.4.

17.4 The linear regression model

Recall the example about predicting Janka hardness of wood from the density of the wood in Section 15.5.
The idea is, of course, that Janka hardness is related to the density: the higher the density of the wood, the higher the value of Janka hardness. This suggests a relationship of the type

  hardness = g(density of timber)

for some increasing function g. This is supported by the scatterplot of the data in Figure 17.7. A closer look at the bivariate dataset in Table 15.5 suggests that randomness is also involved. For instance, for the value 51.5 of the density, different corresponding values of Janka hardness were observed. One way to model such a situation is by means of a regression model:

  hardness = g(density of timber) + random fluctuation.

The important question now is: what sort of function g fits the points in the scatterplot well? In general, this may be a difficult question to answer. We may have so little knowledge about the phenomenon under study, and the data points may be
scattered in such a way, that there is no reason to assume a specific type of function for g. However, for the Janka hardness data it makes sense to assume that g is increasing, but this still leaves us with many possibilities.

Fig. 17.7. Scatterplot of Janka hardness versus wood density.

Looking at the scatterplot, at first sight it does not seem unreasonable to assume that g is a straight line, i.e., Janka hardness depends linearly on the density of timber. The fact that the points are not exactly on a straight line is then modeled by a random fluctuation with respect to the straight line:

  hardness = α + β · (density of timber) + random fluctuation.

This is a loose description of a simple linear regression model. A more complete description is given below.

Simple linear regression model. In a simple linear regression model for a bivariate dataset (x1, y1), (x2, y2), . . . , (xn, yn), we assume that x1, x2, . . . , xn are nonrandom and that y1, y2, . . . , yn are realizations of random variables Y1, Y2, . . . , Yn satisfying

  Yi = α + βxi + Ui   for i = 1, 2, . . . , n,

where U1, . . . , Un are independent random variables with E[Ui] = 0 and Var(Ui) = σ².

The line y = α + βx is called the regression line. The parameters α and β represent the intercept and slope of the regression line. Usually, the x-variable is called the explanatory variable and the y-variable is called the response variable. One also refers to x and y as independent and dependent variables. The random variables U1, U2, . . . , Un are assumed to be independent when the different measurements do not influence each other. They are assumed to have
expectation zero, because the random fluctuation is considered to be around the regression line y = α + βx. Finally, because each random fluctuation is supposed to have the same amount of variability, we assume that all Ui have the same variance. Note that by the propagation of independence rule in Section 9.4, independence of the Ui implies independence of the Yi. However, Y1, Y2, . . . , Yn do not form a random sample. Indeed, the Yi have different distributions because every Yi has a different expectation

  E[Yi] = E[α + βxi + Ui] = α + βxi + E[Ui] = α + βxi.

Quick exercise 17.3 Consider the simple linear regression model as defined earlier. Compute the variance of Yi.

The parameters α and β are unknown and our task will be to estimate them on the basis of the data. We will come back to this in Chapter 22. In Figure 17.8 the scatterplot for the Janka hardness data is displayed with the estimated regression line y = −1160.5 + 57.51x.

Fig. 17.8. Estimated regression line for the Janka hardness data.

Taking a closer look at Figure 17.8, you might wonder whether y = α + βx + γx² would be a more appropriate model. By trying to answer this question we enter the area of multiple linear regression. We will not pursue this topic; we restrict ourselves to simple linear regression.
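To get a feeling for the simple linear regression model it may help to simulate from it and to see how well the line can be recovered from the simulated points. The sketch below does this with arbitrary parameter values; the least squares method used to recover α and β is only developed in Chapter 22, and np.polyfit is simply one way to compute it:

```python
import numpy as np

rng = np.random.default_rng(6)

alpha, beta, sigma = 2.0, 0.5, 1.0        # arbitrary "true" intercept, slope, and sd
x = np.linspace(1, 10, 30)                # nonrandom x-values
u = rng.normal(loc=0.0, scale=sigma, size=x.size)   # random fluctuations U_i
y = alpha + beta * x + u                  # realizations of Y_i = alpha + beta*x_i + U_i

beta_hat, alpha_hat = np.polyfit(x, y, deg=1)       # least squares line (see Chapter 22)
print(alpha_hat, beta_hat)                # should be in the neighborhood of 2.0 and 0.5
```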
17.5 Solutions to the quick exercises

17.1 Because X1, X2 form a random sample, they are independent. Using the rule about the variance of the sum of independent random variables, this means that Var(X1 + X2) = Var(X1) + Var(X2) = 1 + 1 = 2.

17.2 The result of each toss of a coin can be modeled by a Bernoulli random variable taking values 1 (heads) and 0 (tails). In the case when it is known that we are tossing a fair coin, heads and tails occur with equal probability. Since it is reasonable to assume that the tosses do not influence each other, the outcomes of the ten tosses are modeled as the realization of a random sample X1, . . . , X10 from a Bernoulli distribution with parameter p = 1/2. In this case the model distribution is completely specified and coincides with the “true” distribution: a Ber(1/2) distribution. In the case when we are dealing with a possibly unfair coin, the outcomes of the ten tosses are still modeled as the realization of a random sample X1, . . . , X10 from a Bernoulli distribution, but we cannot specify the value of the parameter p. The model distribution is a Bernoulli distribution. The “true” distribution is a Bernoulli distribution with one particular value for p, unknown to us.

17.3 Note that the xi are considered nonrandom. By the rules for the variance, we find Var(Yi) = Var(α + βxi + Ui) = Var(Ui) = σ².

17.6 Exercises

17.1 Figure 17.9 displays several histograms, kernel density estimates, and empirical distribution functions. It is known that all figures correspond to datasets of size 200 that are generated from normal distributions N(0, 1), N(0, 9), and N(3, 1), and from exponential distributions Exp(1) and Exp(1/3). Report for each figure from which distribution the dataset has been generated.

17.2 Figure 17.10 displays several boxplots. It is known that all figures correspond to datasets of size 200 that are generated from the same five distributions as in Exercise 17.1. Report for each boxplot from which distribution the dataset has been generated.

17.3 At a London underground station, the number of women was counted in each of 100 queues of length 10. In this way a dataset x1, x2, . . . , x100 was obtained, where xi denotes the observed number of women in the ith queue. The dataset is summarized in the following table and lists the number of queues with 0 women, 1 woman, 2 women, etc.
[Figure 17.9 contains 15 panels, labeled Dataset 1 through Dataset 15; each panel shows a histogram, a kernel density estimate, or an empirical distribution function of the corresponding dataset.]
Fig. 17.9. Graphical representations of different datasets from Exercise 17.1.
[Figure 17.10 contains 15 boxplots, labeled Boxplot 1 through Boxplot 15.]
Fig. 17.10. Boxplot of different datasets from Exercise 17.2.
  Count      0  1  2   3   4   5   6  7  8  9  10
  Frequency  1  3  4  23  25  19  18  5  1  1   0

Source: R.A. Jinkinson and M. Slater. Critical discussion of a graphical method for identifying discrete distributions. The Statistician, 30:239–248, 1981; Table 1 on page 240.

In the statistical model for this dataset, we assume that the observed counts are a realization of a random sample X1, X2, . . . , X100.

a. Assume that people line up in such a way that a man or woman in a certain position is independent of the other positions, and that in each position one has a woman with equal probability. What is an appropriate choice for the model distribution?
b. Use the table to find an estimate for the parameter(s) of the model distribution chosen in part a.

17.4 During the Second World War, London was hit by numerous flying bombs. The following data are from an area in South London of 36 square kilometers. The area was divided into 576 squares with sides of length 1/4 kilometer. For each of the 576 squares the number of hits was recorded. In this way we obtain a dataset x1, x2, . . . , x576, where xi denotes the number of hits in the ith square. The data are summarized in the following table, which lists the number of squares with no hits, 1 hit, 2 hits, etc.

  Number of hits       0    1   2   3  4  5  6  7
  Number of squares  229  211  93  35  7  0  0  1

Source: R.D. Clarke. An application of the Poisson distribution. Journal of the Institute of Actuaries, 72:48, 1946; Table 1 on page 481. Faculty and Institute of Actuaries.

An interesting question is whether London was hit in a completely random manner. In that case a Poisson distribution should fit the data.

a. If we model the dataset as the realization of a random sample from a Poisson distribution with parameter µ, then what would you choose as an estimate for µ?
b. Check the fit with a Poisson distribution by comparing some of the observed relative frequencies of 0’s, 1’s, 2’s, etc., with the corresponding probabilities for the Poisson distribution with µ estimated as in part a.

17.5 We return to the example concerning the number of menstrual cycles up to pregnancy, where the number of cycles was modeled by a geometric random variable (see Section 4.4). The original data concerned 100 smoking and 486 nonsmoking women. For 7 smokers and 12 nonsmokers, the exact number of cycles up to pregnancy was unknown. In the following tables we only
incorporated the 93 smokers and 474 nonsmokers, for which the exact number of cycles was observed. Another analysis, based on the complete dataset, is done in Section 21.1.

a. Consider the dataset x1, x2, . . . , x93 corresponding to the smoking women, where xi denotes the number of cycles for the ith smoking woman. The data are summarized in the following table.

  Cycles      1   2   3  4  5  6  7  8  9  10  11  12
  Frequency  29  16  17  4  3  9  4  5  1   1   1   3

Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution applied to comparative fecundability studies. Biometrics, 42(3):547–560, 1986.

The table lists the number of women that had to wait 1 cycle, 2 cycles, etc. If we model the dataset as the realization of a random sample from a geometric distribution with parameter p, then what would you choose as an estimate for p?

b. Also estimate the parameter p for the 474 nonsmoking women, which is also modeled as the realization of a random sample from a geometric distribution. The dataset y1, y2, . . . , y474, where yj denotes the number of cycles for the jth nonsmoking woman, is summarized here:

  Cycles       1    2   3   4   5   6  7  8  9  10  11  12
  Frequency  198  107  55  38  18  22  7  9  5   3   6   6

Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution applied to comparative fecundability studies. Biometrics, 42(3):547–560, 1986.

You may use that y1 + y2 + · · · + y474 = 1285.

c. Compare the estimates of the probability of becoming pregnant in three or fewer cycles for smoking and nonsmoking women.

17.6 Recall Exercise 15.1 about the chest circumference of 5732 Scottish soldiers, where we constructed the histogram displayed in Figure 17.11. The histogram suggests modeling the data as the realization of a random sample from a normal distribution.

a. Suppose that for the dataset Σ xi = 228377.2 and Σ xi² = 9124064. What would you choose as estimates for the parameters µ and σ of the N(µ, σ²) distribution? Hint: you may want to use the relation from Exercise 16.15.
b. Give an estimate for the probability that a Scottish soldier has a chest circumference between 38.5 and 42.5 inches.
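A sketch of how the computation in parts a and b might be organized, using the identity sn² = (Σ xi² − n·x̄n²)/(n − 1), which is presumably the relation meant in the hint, and scipy for the normal probabilities:

```python
import numpy as np
from scipy.stats import norm

n = 5732
sum_x = 228377.2          # given: sum of the chest circumferences
sum_x2 = 9124064.0        # given: sum of their squares

mu_hat = sum_x / n                              # estimate of mu (the sample mean)
var_hat = (sum_x2 - n * mu_hat**2) / (n - 1)    # sample variance via the identity above
sigma_hat = np.sqrt(var_hat)                    # estimate of sigma

# Estimated probability of a chest circumference between 38.5 and 42.5 inches
p_hat = (norm.cdf(42.5, loc=mu_hat, scale=sigma_hat)
         - norm.cdf(38.5, loc=mu_hat, scale=sigma_hat))
print(mu_hat, sigma_hat, p_hat)
```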
Fig. 17.11. Histogram of chest circumferences.

17.7 Recall Exercise 15.3 about time intervals between successive coal mine disasters. Let us assume that the rate at which the disasters occur is constant over time and that on a single day a disaster takes place with small probability independently of what happens on other days. According to Chapter 12 this suggests modeling the series of disasters with a Poisson process. Figure 17.12 displays a histogram and empirical distribution function of the observed time intervals.

a. In the statistical model for this dataset we model the 190 time intervals as the realization of a random sample. What would you choose for the model distribution?
b. The sum of the observed time intervals is 40 549 days. Give an estimate for the parameter(s) of the distribution chosen in part a.

Fig. 17.12. Histogram of time intervals between successive disasters.
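If one chooses an Exp(λ) distribution in part a (interarrival times of a Poisson process have an exponential distribution), the computation of part b only uses the two numbers given in the exercise; a sketch:

```python
n_intervals = 190          # number of observed time intervals
total_days = 40549         # given: sum of the observed time intervals, in days

x_bar = total_days / n_intervals     # sample mean interval length, in days
lam_hat = 1 / x_bar                  # estimate of lambda, about 0.0047 per day
print(x_bar, lam_hat)
```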
17.8 The following data represent the number of revolutions to failure (in millions) of 22 deep-groove ball-bearings.

   17.88   28.92   33.00   41.52   42.12   45.60
   48.48   51.84   51.96   54.12   55.56   67.80
   68.64   68.88   84.12   93.12   98.64  105.12
  105.84  127.92  128.04  173.40

Source: J. Lieblein and M. Zelen. Statistical investigation of the fatigue-life of deep-groove ball-bearings. Journal of Research, National Bureau of Standards, 57:273–316, 1956; specimen worksheet on page 286.

Lieblein and Zelen propose modeling the dataset as a realization of a random sample from a Weibull distribution, which has distribution function

  F(x) = 1 − e^{−(λx)^α}   for x ≥ 0,

and F(x) = 0 for x < 0, where α, λ > 0.

a. Suppose that X is a random variable with a Weibull distribution. Check that the random variable Y = X^α has an exponential distribution with parameter λ^α and conclude that E[X^α] = 1/λ^α.
b. Use part a to explain how one can use the data in the table to find an estimate for the parameter λ, if it is given that the parameter α is estimated by 2.102.

17.9 The volume (i.e., the effective wood production in cubic meters), height (in meters), and diameter (in meters) (measured at 1.37 meter above the ground) are recorded for 31 black cherry trees in the Allegheny National Forest in Pennsylvania. The data are listed in Table 17.3. They were collected to find an estimate for the volume of a tree (and therefore for the timber yield), given its height and diameter. For each tree the volume y and the value of x = d²h are recorded, where d and h are the diameter and height of the tree. The resulting points (x1, y1), . . . , (x31, y31) are displayed in the scatterplot in Figure 17.13. We model the data by the following linear regression model (without intercept)

  Yi = βxi + Ui   for i = 1, 2, . . . , 31.

a. What physical reasons justify the linear relationship between y and d²h? Hint: how does the volume of a cylinder relate to its diameter and height?
b. We want to find an estimate for the slope β of the line y = βx. Two natural candidates are the average slope z̄n, where zi = yi/xi, and the
slope of the averages ȳ/x̄. In Chapter 22 we will encounter the so-called least squares estimate:

  (Σ_{i=1}^{n} xi yi) / (Σ_{i=1}^{n} xi²).

Table 17.3. Measurements on black cherry trees.

  Diameter  Height  Volume
    0.21     21.3    0.29
    0.22     19.8    0.29
    0.22     19.2    0.29
    0.27     21.9    0.46
    0.27     24.7    0.53
    0.27     25.3    0.56
    0.28     20.1    0.44
    0.28     22.9    0.52
    0.28     24.4    0.64
    0.28     22.9    0.56
    0.29     24.1    0.69
    0.29     23.2    0.59
    0.29     23.2    0.61
    0.30     21.0    0.60
    0.30     22.9    0.54
    0.33     22.6    0.63
    0.33     25.9    0.96
    0.34     26.2    0.78
    0.35     21.6    0.73
    0.35     19.5    0.71
    0.36     23.8    0.98
    0.36     24.4    0.90
    0.37     22.6    1.03
    0.41     21.9    1.08
    0.41     23.5    1.21
    0.44     24.7    1.57
    0.44     25.0    1.58
    0.45     24.4    1.65
    0.46     24.4    1.46
    0.46     24.4    1.44
    0.52     26.5    2.18

Source: A.C. Atkinson. Regression diagnostics, trend formations and constructed variables (with discussion). Journal of the Royal Statistical Society, Series B, 44:1–36, 1982.
Fig. 17.13. Scatterplot of the black cherry tree data.

Compute all three estimates for the data in Table 17.3. You need at least 5 digits accuracy, and you may use that Σ xi = 87.456, Σ yi = 26.486, Σ (yi/xi) = 9.369, Σ xiyi = 95.498, and Σ xi² = 314.644.

17.10 Let X be a random variable with (continuous) distribution function F. Let m = q0.5 = Finv(0.5) be the median of F and define the random variable Y = |X − m|.

a. Show that Y has distribution function G, defined by G(y) = F(m + y) − F(m − y).
b. The MAD of F is the median of G. Show that if the density f corresponding to F is symmetric around its median m, then G(y) = 2F(m + y) − 1 and derive that Ginv(1/2) = Finv(3/4) − Finv(1/2).
c. Use b to conclude that the MAD of an N(µ, σ²) distribution is equal to σΦinv(3/4), where Φ is the distribution function of a standard normal distribution. Recall that the distribution function F of an N(µ, σ²) can be written as F(x) = Φ((x − µ)/σ). You might check that, as stated in Section 17.2, the MAD of the N(5, 4) distribution is equal to 2Φinv(3/4) = 1.3490.
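The numerical check suggested at the end of part c takes one line with scipy; norm.ppf plays the role of Φinv:

```python
from scipy.stats import norm

# MAD of the N(5, 4) distribution: sigma * Phi_inv(3/4) with sigma = 2
print(2 * norm.ppf(0.75))     # approximately 1.3490, as stated in Section 17.2
```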
17.11 In this exercise we compute the MAD of the Exp(λ) distribution.

a. Let X have an Exp(λ) distribution, with median m = (ln 2)/λ. Show that Y = |X − m| has distribution function G(y) = (1/2)(e^{λy} − e^{−λy}).
b. Argue that the MAD of the Exp(λ) distribution is a solution of the equation e^{2λy} − e^{λy} − 1 = 0.
c. Compute the MAD of the Exp(λ) distribution. Hint: put x = e^{λy} and first solve for x.
18 The bootstrap

In the forthcoming chapters we will develop statistical methods to infer knowledge about the model distribution and encounter several sample statistics to do this. In the previous chapter we have seen examples of sample statistics that can be used to estimate different model features, for instance, the empirical distribution function to estimate the model distribution function F, and the sample mean to estimate the expectation µ corresponding to F. One of the things we would like to know is how close a sample statistic is to the model feature it is supposed to estimate. For instance, what is the probability that the sample mean and µ differ by more than a given tolerance ε? For this we need to know the distribution of X̄n − µ. More generally, it is important to know how a sample statistic is distributed in relation to the corresponding model feature. For the distribution of the sample mean we saw a normal limit approximation in Chapter 14. In this chapter we discuss a simulation procedure that approximates the distribution of the sample mean for finite sample size. Moreover, the method is more generally applicable to sample statistics other than the sample mean.

18.1 The bootstrap principle

Consider the Old Faithful data introduced in Chapter 15, which we modeled as the realization of a random sample of size n = 272 from some distribution function F. The sample mean x̄n of the observed durations equals 209.3. What does this say about the expectation µ of F? As we saw in Chapter 17, the value 209.3 is a natural estimate for µ, but to conclude that µ is equal to 209.3 would be unwise. The reason is that if we were to observe a new dataset of durations, we would obtain a different sample mean as an estimate for µ. This should not come as a surprise. Since the dataset x1, x2, . . . , xn is just one possible realization of the random sample X1, X2, . . . , Xn, the observed sample mean is just one possible realization of the random variable
  • 276. 270 18 The bootstrap X̄n = X1 + X2 + · · · + Xn n . A new dataset is another realization of the random sample, and the cor- responding sample mean is another realization of the random variable X̄n. Hence, to infer something about µ, one should take into account how realiza- tions of X̄n vary. This variation is described by the probability distribution of X̄n. In principle1 it is possible to determine the distribution function of X̄n from the distribution function F of the random sample X1, X2, . . . , Xn. However, F is unknown. Nevertheless, in Chapter 17 we saw that the observed dataset reflects most features of the “true” probability distribution. Hence the natural thing to do is to compute an estimate F̂ for the distribution function F and then to consider a random sample from F̂ and the corresponding sample mean as substitutes for the random sample X1, X2, . . . , Xn from F and the random variable X̄n. A random sample from F̂ is called a bootstrap random sample, or briefly bootstrap sample, and is denoted by X∗ 1 , X∗ 2 , . . . , X∗ n to distinguish it from the random sample X1, X2, . . . , Xn from the “true” F. The corresponding average is called the bootstrapped sample mean, and this random variable is denoted by X̄∗ n = X∗ 1 + X∗ 2 + · · · + X∗ n n to distinguish it from the random variable X̄n. The idea is now to use the distribution of X̄∗ n to approximate the distribution of X̄n. The preceding procedure is called the bootstrap principle for the sample mean. Clearly, it can be applied to any sample statistic h(X1, X2, . . . , Xn) by approx- imating its probability distribution by that of the corresponding bootstrapped sample statistic h(X∗ 1 , X∗ 2 , . . . , X∗ n). Bootstrap principle. Use the dataset x1, x2, . . . , xn to com- pute an estimate F̂ for the “true” distribution function F. Replace the random sample X1, X2, . . . , Xn from F by a random sample X∗ 1 , X∗ 2 , . . . , X∗ n from F̂, and approximate the probability distribu- tion of h(X1, X2, . . . , Xn) by that of h(X∗ 1 , X∗ 2 , . . . , X∗ n). Returning to the sample mean, the first question that comes to mind is, of course, how well does the distribution of X̄∗ n approximate the distribution 1 In Section 11.1 we saw how the distribution of the sum of independent random variables can be computed. Together with the change-of-units rule (see page 106), the distribution of X̄n can be determined. See also Section 13.1, where this is done for independent Gam(2, 1) variables.
  • 277. 18.1 The bootstrap principle 271 of X̄n? Or more generally, how well does the distribution of a bootstrapped sample statistic h(X∗ 1 , X∗ 2 , . . . , X∗ n) approximate the distribution of the sam- ple statistic of interest h(X1, X2, . . . , Xn)? Applied in such a straightforward manner, the bootstrap approximation for the distribution of X̄n by that of X̄∗ n may not be so good (see Remark 18.1). The bootstrap approximation will improve if we approximate the distribution of the centered sample mean: X̄n − µ, where µ is the expectation corresponding to F. The bootstrapped version would be the random variable X̄∗ n − µ∗ , where µ∗ is the expectation corresponding to F̂. Often the bootstrap approx- imation of the distribution of a sample statistic will improve if we somehow normalize the sample statistic by relating it to a corresponding feature of the “true” distribution. An example is the centered sample median Med(X1, X2, . . . , Xn) − Finv (0.5), where we subtract the median Finv (0.5) of F. Another example is the nor- malized sample variance S2 n σ2 , where we divide by the variance σ2 of F. Quick exercise 18.1 Describe how the bootstrap principle should be applied to approximate the distribution of Med(X1, X2, . . . , Xn) − Finv (0.5). Remark 18.1 (The bootstrap for the sample mean). To see why the bootstrap approximation for X̄n may be bad, consider a dataset x1, x2, . . . , xn that is a realization of a random sample X1, X2, . . . , Xn from an N(µ, 1) distribution. In that case the corresponding sample mean X̄n has an N(µ, 1/n) distribution. We estimate µ by x̄n and replace the ran- dom sample from an N(µ, 1) distribution by a bootstrap random sample X∗ 1 , X∗ 2 , . . . , X∗ n from an N(x̄n, 1) distribution. The corresponding boot- strapped sample mean X̄∗ n has an N(x̄n, 1/n) distribution. Therefore the distribution functions Gn and G∗ n of the random variables X̄n and X̄∗ n can be determined: Gn(a) = Φ( √ n(a − µ)) and G∗ n(a) = Φ( √ n(a − x̄n)). In this case it turns out that the maximum distance between the two dis- tribution functions is equal to 2Φ 1 2 √ n|x̄n − µ| − 1.
  • 278. 272 18 The bootstrap Since X̄n has an N(µ, 1/n) distribution, this value is approximately equal to 2Φ (|z|/2)−1, where z is a realization of an N(0, 1) random variable Z. This only equals zero for z = 0, so that the distance between the distribution functions of X̄n and X̄∗ n will almost always be strictly positive, even for large n. The question that remains is what to take as an estimate F̂ for F. This will depend on how well F can be specified. For the Old Faithful data we cannot say anything about the type of distribution. However, for the software data it seems reasonable to model the dataset as a realization of a random sample from an Exp(λ) distribution and then we only have to estimate the parameter λ. Different assumptions about F give rise to different bootstrap procedures. We will discuss two of them in the next sections. 18.2 The empirical bootstrap Suppose we consider our dataset x1, x2, . . . , xn as a realization of a random sample from a distribution function F. When we cannot make any assumptions about the type of F, we can always estimate F by the empirical distribution function of the dataset: F̂(a) = Fn(a) = number of xi less than or equal to a n . Since we estimate F by the empirical distribution function, the corresponding bootstrap principle is called the empirical bootstrap. Applying this principle to the centered sample mean, the random sample X1, X2, . . . , Xn from F is replaced by a bootstrap random sample X∗ 1 , X∗ 2 , . . . , X∗ n from Fn, and the distribution of X̄n − µ is approximated by that of X̄∗ n − µ∗ , where µ∗ denotes the expectation corresponding to Fn. The question is, of course, how good this approximation is. A mathematical theorem tells us that the empirical bootstrap works for the centered sample mean, i.e., the distribution of X̄n −µ is well approximated by that of X̄∗ n−µ∗ (see Remark 18.2). On the other hand, there are (normalized) sample statistics for which the empirical bootstrap fails, such as 1 − maximum of X1, X2, . . . , Xn θ , based on a random sample X1, X2, . . . , Xn from a U(0, θ) distribution (see Exercise 18.12). Remark 18.2 (The empirical bootstrap for X̄n −µ). For the centered sample mean the bootstrap approximation works, even if we estimate F by the empirical distribution function Fn. If Gn denotes the distribution function of X̄n − µ and G∗ n the distribution function of its bootstrapped version X̄∗ n − µ∗ , then the maximum distance between G∗ n and Gn goes to zero with probability one:
  • 279. 18.2 The empirical bootstrap 273 P lim n→∞ sup t∈R |G∗ n(t) − Gn(t)| = 0 = 1 (see, for instance, Singh [32]). In fact, the empirical bootstrap approxima- tion can be improved by approximating the distribution of the standardized average √ n(X̄n −µ)/σ by its bootstrapped version √ n(X̄∗ n −µ∗ )/σ∗ , where σ and σ∗ denote the standard deviations of F and Fn. This approximation is even better than the normal approximation by the central limit theorem! See, for instance, Hall [14]. Let us continue with approximating the distribution of X̄n − µ by that of X̄∗ n −µ∗ . First note that the empirical distribution function Fn of the original dataset is the distribution function of a discrete random variable that attains the values x1, x2, . . . , xn, each with probability 1/n. This means that each of the bootstrap random variables X∗ i has expectation µ∗ = E[X∗ i ] = x1 · 1 n + x2 · 1 n + · · · + xn · 1 n = x̄n. Therefore, applying the empirical bootstrap to X̄n − µ means approximating its distribution by that of X̄∗ n − x̄n. In principle it would be possible to deter- mine the probability distribution of X̄∗ n −x̄n. Indeed, the random variable X̄∗ n is based on the random variables X∗ i , whose distribution we know precisely: it takes values x1, x2, . . . , xn with equal probability 1/n. Hence we could de- termine the possible values of X̄∗ n − x̄n and the corresponding probabilities. For small n this can be done (see Exercise 18.5), but for large n this becomes cumbersome. Therefore we invoke a second approximation. Recall the jury example in Section 6.3, where we investigated the variation of two different rules that a jury might use to assign grades. In terms of the present chapter, the jury example deals with a random sample from a U(−0.5, 0.5) distribution and two different sample statistics T and M, cor- responding to the two rules. To investigate the distribution of T and M, a simulation was carried out with one thousand runs, where in every run we generated a realization of a random sample from the U(−0.5, 0.5) distribution and computed the corresponding realization of T and M. The one thousand realizations give a good impression of how T and M vary around the deserved score (see Figure 6.4). Returning to the distribution of X̄∗ n −x̄n, the analogue would be to repeatedly generate a realization of the bootstrap random sample from Fn and every time compute the corresponding realization of X̄∗ n − x̄n. The resulting realizations would give a good impression about the distribution of X̄∗ n −x̄n. A realization of the bootstrap random sample is called a bootstrap dataset and is denoted by x∗ 1, x∗ 2, . . . , x∗ n to distinguish it from the original dataset x1, x2, . . . , xn. For the centered sample mean the simulation procedure is as follows.
  • 280. 274 18 The bootstrap Empirical bootstrap simulation (for X̄n−µ). Given a dataset x1, x2, . . . , xn, determine its empirical distribution function Fn as an estimate of F, and compute the expectation µ∗ = x̄n = x1 + x2 + · · · + xn n corresponding to Fn. 1. Generate a bootstrap dataset x∗ 1, x∗ 2, . . . , x∗ n from Fn. 2. Compute the centered sample mean for the bootstrap dataset: x̄∗ n − x̄n, where x̄∗ n = x∗ 1 + x∗ 2 + · · · + x∗ n n . Repeat steps 1 and 2 many times. Note that generating a value x∗ i from Fn is equivalent to choosing one of the elements x1, x2, . . . , xn of the original dataset with equal probability 1/n. The empirical bootstrap simulation is described for the centered sample mean, but clearly a similar simulation procedure can be formulated for any (normal- ized) sample statistic. Remark 18.3 (Some history). Although Efron [7] in 1979 drew attention to diverse applications of the empirical bootstrap simulation, it already existed before that time, but not as a unified widely applicable technique. See Hall [14] for references to earlier ideas along similar lines and to further development of the bootstrap. One of Efron’s contributions was to point out how to combine the bootstrap with modern computational power. In this way, the interest in this procedure is a typical consequence of the influence of computers on the development of statistics in the past decades. Efron also coined the term “bootstrap,” which is inspired by the American version of one of the tall stories of the Baron von Münchhausen, who claimed to have lifted himself out of a swamp by pulling the strap on his boot (in the European version he lifted himself by pulling his hair). Quick exercise 18.2 Describe the empirical bootstrap simulation for the centered sample median Med(X1, X2, . . . , Xn) − Finv (0.5). For the Old Faithful data we carried out the empirical bootstrap simulation for the centered sample mean with one thousand repetitions. In Figure 18.1 a histogram (left) and kernel density estimate (right) are displayed of one thousand centered bootstrap sample means x̄∗ n,1 − x̄n x̄∗ n,2 − x̄n · · · x̄∗ n,1000 − x̄n.
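Steps 1 and 2 above translate directly into code. The following is a minimal sketch, assuming Python with NumPy; the function name and the data file are hypothetical, and one would load the 272 observed Old Faithful durations before calling it.

    import numpy as np

    rng = np.random.default_rng(1)   # seed chosen arbitrarily

    def empirical_bootstrap_centered_means(data, repetitions=1000):
        """Empirical bootstrap simulation for the centered sample mean."""
        data = np.asarray(data, dtype=float)
        n = len(data)
        xbar = data.mean()                                 # mu* = x-bar_n, the expectation of F_n
        centered = np.empty(repetitions)
        for b in range(repetitions):
            boot = rng.choice(data, size=n, replace=True)  # step 1: bootstrap dataset from F_n
            centered[b] = boot.mean() - xbar               # step 2: x-bar*_n minus x-bar_n
        return centered

    # durations = np.loadtxt("oldfaithful.txt")            # hypothetical file with the 272 durations
    # centered_means = empirical_bootstrap_centered_means(durations)
    # np.mean(np.abs(centered_means) > 5)                  # used in the application discussed further on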
Fig. 18.1. Histogram and kernel density estimate of centered bootstrap sample means.

Since these are realizations of the random variable X̄∗n − x̄n, we know from Section 17.2 that they reflect the distribution of X̄∗n − x̄n. Hence, as the distribution of X̄∗n − x̄n approximates that of X̄n − µ, the centered bootstrap sample means also reflect the distribution of X̄n − µ. This leads to the following application.

An application of the empirical bootstrap

Let us return to our example about the Old Faithful data, which are modeled as a realization of a random sample from some F. Suppose we estimate the expectation µ corresponding to F by x̄n = 209.3. Can we say how far away 209.3 is from the “true” expectation µ? To be honest, the answer is no. . . (oops). In a situation like this, the measurements and their corresponding average are subject to randomness, so that we cannot say anything with absolute certainty about how far away the average will be from µ. One of the things we can say is how likely it is that the average is within a given distance from µ.
To get an impression of how close the average of a dataset of n = 272 observed durations of the Old Faithful geyser is to µ, we want to compute the probability that the sample mean deviates more than 5 from µ:

P(|X̄n − µ| > 5).

Direct computation of this probability is impossible, because we do not know the distribution of the random variable X̄n − µ. However, since the distribution of X̄∗n − x̄n approximates the distribution of X̄n − µ, we can approximate the probability as follows:

P(|X̄n − µ| > 5) ≈ P(|X̄∗n − x̄n| > 5) = P(|X̄∗n − 209.3| > 5),
where we have also used that for the Old Faithful data, x̄n = 209.3. As we mentioned before, in principle it is possible to compute the last probability exactly. Since this is too cumbersome, we approximate P(|X̄∗n − 209.3| > 5) by means of the one thousand centered bootstrap sample means obtained from the empirical bootstrap simulation:

x̄∗n,1 − 209.3, x̄∗n,2 − 209.3, . . . , x̄∗n,1000 − 209.3.

In view of Table 17.2, a natural estimate for P(|X̄∗n − 209.3| > 5) is the relative frequency of centered bootstrap sample means that are greater than 5 in absolute value:

(number of i with |x̄∗n,i − 209.3| greater than 5) / 1000.

For the centered bootstrap sample means of Figure 18.1, this relative frequency is 0.227. Hence, we obtain the following bootstrap approximation:

P(|X̄n − µ| > 5) ≈ P(|X̄∗n − 209.3| > 5) ≈ 0.227.

It should be emphasized that the second approximation can be made arbitrarily accurate by increasing the number of repetitions in the bootstrap procedure.

18.3 The parametric bootstrap

Suppose we consider our dataset as a realization of a random sample from a distribution of a specific parametric type. In that case the distribution function is completely determined by a parameter or vector of parameters θ: F = Fθ. Then we do not have to estimate the whole distribution function F; it suffices to estimate the parameter (vector) θ by θ̂ and estimate F by F̂ = Fθ̂. The corresponding bootstrap principle is called the parametric bootstrap. Let us investigate what this would mean for the centered sample mean. First we should realize that the expectation of Fθ is also determined by θ: µ = µθ. The parametric bootstrap for the centered sample mean now amounts to the following. The random sample X1, X2, . . . , Xn from the “true” distribution function Fθ is replaced by a bootstrap random sample X∗1, X∗2, . . . , X∗n from Fθ̂, and the probability distribution of X̄n − µθ is approximated by that of X̄∗n − µ∗, where µ∗ = µθ̂ denotes the expectation corresponding to Fθ̂. Often the parametric bootstrap approximation is better than the empirical bootstrap approximation, as illustrated in the next quick exercise.
  • 283. 18.3 The parametric bootstrap 277 Quick exercise 18.3 Suppose the dataset x1, x2, . . . , xn is a realization of a random sample X1, X2, . . . , Xn from an N(µ, 1) distribution. Estimate µ by x̄n and consider a bootstrap random sample X∗ 1 , X∗ 2 , . . . , X∗ n from an N(x̄n, 1) distribution. Check that the probability distributions of X̄n − µ and X̄∗ n − x̄n are the same: an N(0, 1/n) distribution. Once more, in principle it is possible to determine the distribution of X̄∗ n −µθ̂ exactly. However, in contrast with the situation considered in the previous quick exercise, in some cases this is still cumbersome. Again a simulation procedure may help us out. For the centered sample mean the procedure is as follows. Parametric bootstrap simulation (for X̄n − µ). Given a dataset x1, x2, . . . , xn, compute an estimate θ̂ for θ. Determine Fθ̂ as an estimate for Fθ, and compute the expectation µ∗ = µθ̂ corre- sponding to Fθ̂. 1. Generate a bootstrap dataset x∗ 1, x∗ 2, . . . , x∗ n from Fθ̂. 2. Compute the centered sample mean for the bootstrap dataset: x̄∗ n − µθ̂, where x̄∗ n = x∗ 1 + x∗ 2 + · · · + x∗ n n . Repeat steps 1 and 2 many times. As an application we will use the parametric bootstrap simulation to investi- gate whether the exponential distribution is a reasonable model for the soft- ware data. Are the software data exponential? Consider fitting an exponential distribution to the software data, as discussed in Section 17.3. At first sight, Figure 17.6 shows a reasonable fit with the ex- ponential distribution. One way to quantify the difference between the dataset and the exponential model is to compute the maximum distance between the empirical distribution function Fn of the dataset and the exponential distri- bution function Fλ̂ estimated from the dataset: tks = sup a∈R |Fn(a) − Fλ̂(a)|. Here Fλ̂(a) = 0 for a 0 and Fλ̂(a) = 1 − e−λ̂a for a ≥ 0, where λ̂ = 1/x̄n is estimated from the dataset. The quantity tks is called the Kolmogorov-Smirnov distance between Fn and Fλ̂.
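As an aside, tks is easy to compute in practice: because Fλ̂ is continuous and increasing, the supremum over a is attained just before or at one of the order statistics of the dataset. A sketch, assuming Python with NumPy; the function name and the data file are hypothetical.

    import numpy as np

    def ks_distance_exponential(data):
        """KS distance between the empirical distribution function of the data
        and the Exp(lambda-hat) distribution function with lambda-hat = 1/x-bar_n."""
        x = np.sort(np.asarray(data, dtype=float))
        n = len(x)
        lam = 1.0 / x.mean()
        cdf = 1.0 - np.exp(-lam * x)              # fitted exponential cdf at the order statistics
        upper = np.arange(1, n + 1) / n - cdf     # F_n just after each jump minus the cdf
        lower = cdf - np.arange(0, n) / n         # the cdf minus F_n just before each jump
        return max(upper.max(), lower.max())

    # software = np.loadtxt("software.txt")       # hypothetical file with the 135 interfailure times
    # print(ks_distance_exponential(software))    # the text reports 0.176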
  • 284. 278 18 The bootstrap The idea behind the use of this distance is the following. If F denotes the “true” distribution function, then according to Section 17.2 the empirical distribution function Fn will resemble F whether F equals the distribution function Fλ of some Exp(λ) distribution or not. On the other hand, if the “true” distribution function is Fλ, then the estimated exponential distribu- tion function Fλ̂ will resemble Fλ, because λ̂ = 1/x̄n is close to the “true” λ. Therefore, if F = Fλ, then both Fn and Fλ̂ will be close to the same distribu- tion function, so that tks is small; if F is different from Fλ, then Fn and Fλ̂ are close to two different distribution functions, so that tks is large. The value tks is always between 0 and 1, and the further away this value is from 0, the more it is an indication that the exponential model is inappropriate. For the software dataset we find λ̂ = 1/x̄n = 0.0015 and tks = 0.176. Does this speak against the believed exponential model? One way to investigate this is to find out whether, in the case when the data are truly a realization of an exponential random sample from Fλ, the value 0.176 is unusually large. To answer this question we consider the sample statistic that corresponds to tks. The estimate λ̂ = 1/x̄n is replaced by the random variable Λ̂ = 1/X̄n, and the empirical distribution function of the dataset is replaced by the empirical distribution function of the random sample X1, X2, . . . , Xn (again denoted by Fn): Fn(a) = number of Xi less than or equal to a n . In this way, tks is a realization of the sample statistic Tks = sup a∈R |Fn(a) − FΛ̂(a)|. To find out whether 0.176 is an exceptionally large value for the random vari- able Tks, we must determine the probability distribution of Tks. However, this is impossible because the parameter λ of the Exp(λ) distribution is unknown. We will approximate the distribution of Tks by a parametric bootstrap. We use the dataset to estimate λ by λ̂ = 1/x̄n = 0.0015 and replace the random sam- ple X1, X2, . . . , Xn from Fλ by a bootstrap random sample X∗ 1 , X∗ 2 , . . . , X∗ n from Fλ̂. Next we approximate the distribution of Tks by that of its boot- strapped version T ∗ ks = sup a∈R |F∗ n (a) − FΛ̂∗ (a)|, where F∗ n is the empirical distribution function of the bootstrap random sam- ple: F∗ n (a) = number of X∗ i less than or equal to a n , and Λ̂∗ = 1/X̄∗ n, with X̄∗ n being the average of the bootstrap random sample. The bootstrapped sample statistic T ∗ ks is too complicated to determine its probability distribution, and hence we perform a parametric bootstrap simu- lation:
1. We generate a bootstrap dataset x∗1, x∗2, . . . , x∗135 from an exponential distribution with parameter λ̂ = 0.0015.
2. We compute the bootstrapped KS distance

t∗ks = sup_{a∈R} |F∗n(a) − Fλ̂∗(a)|,

where F∗n denotes the empirical distribution function of the bootstrap dataset and Fλ̂∗ denotes the estimated exponential distribution function, where λ̂∗ = 1/x̄∗n is computed from the bootstrap dataset.

We repeat steps 1 and 2 one thousand times, which results in one thousand values of the bootstrapped KS distance. In Figure 18.2 we have displayed a histogram and kernel density estimate of the one thousand bootstrapped KS distances. It is clear that if the software data came from an exponential distribution, the value 0.176 of the KS distance would be very unlikely! This strongly suggests that the exponential distribution is not the right model for the software data. The reason for this is that the Poisson process is the wrong model for the series of failures. A closer inspection shows that the rate at which failures occur over time is not constant, as was assumed in Chapter 17, but decreases.

Fig. 18.2. One thousand bootstrapped KS distances.

18.4 Solutions to the quick exercises

18.1 You could have written something like the following:
“Use the dataset x1, x2, . . . , xn to compute an estimate F̂ for F. Replace the random sample X1, X2, . . . , Xn from F by a random sample X∗1, X∗2, . . . , X∗n from F̂, and approximate the probability distribution of
  • 286. 280 18 The bootstrap Med(X1, X2, . . . , Xn) − Finv (0.5) by that of Med(X∗ 1 , X∗ 2 , . . . , X∗ n) − F̂inv (0.5), where F̂inv (0.5) is the median of F̂.” 18.2 You could have written something like the following: “Given a dataset x1, x2, . . . , xn, determine its empirical distribution function Fn as an estimate of F, and the median Finv (0.5) of Fn. 1. Generate a bootstrap dataset x∗ 1, x∗ 2, . . . , x∗ n from Fn. 2. Compute the sample median for the bootstrap dataset: Med∗ n − Finv (0.5), where Med∗ n = sample median of x∗ 1, x∗ 2, . . . , x∗ n. Repeat steps 1 and 2 many times.” Note that if n is odd, then Finv (0.5) equals the sample median of the original dataset, but this is not necessarily so for n even. 18.3 According to Remark 11.2 about the sum of independent normal ran- dom variables, the sum of n independent N(µ, 1) distributed random variables has an N(nµ, n) distribution. Hence by the change-of-units rule for the normal distribution (see page 106), it follows that X̄n has an N(µ, 1/n) distribution, and that X̄n − µ has an N(0, 1/n) distribution. Similarly, the average X̄∗ n of n independent N(x̄n, 1) distributed bootstrap random variables has a nor- mal distribution N(x̄n, 1/n) distribution, and therefore X̄∗ n − x̄n again has an N(0, 1/n) distribution. 18.5 Exercises 18.1 We generate a bootstrap dataset x∗ 1, x∗ 2, . . . , x∗ 6 from the empirical distribution function of the dataset 2 1 1 4 6 3, i.e., we draw (with replacement) six values from these numbers with equal probability 1/6. How many different bootstrap datasets are possible? Are they all equally likely to occur? 18.2 We generate a bootstrap dataset x∗ 1, x∗ 2, x∗ 3, x∗ 4 from the empirical distri- bution function of the dataset 1 3 4 6. a. Compute the probability that the bootstrap sample mean is equal to 1.
  • 287. 18.5 Exercises 281 b. Compute the probability that the maximum of the bootstrap dataset is equal to 6. c. Compute the probability that exactly two elements in the bootstrap sam- ple are less than 2. 18.3 We generate a bootstrap dataset x∗ 1, x∗ 2, . . . , x∗ 10 from the empirical distribution function of the dataset 0.39 0.41 0.38 0.44 0.40 0.36 0.34 0.46 0.35 0.37. a. Compute the probability that the bootstrap dataset has exactly three elements equal to 0.35. b. Compute the probability that the bootstrap dataset has at most two ele- ments less than or equal to 0.38. c. Compute the probability that the bootstrap dataset has exactly two ele- ments less than or equal to 0.38 and all other elements greater than 0.42. 18.4 Consider the dataset from Exercise 18.3, with maximum 0.46. a. We generate a bootstrap random sample X∗ 1 , X∗ 2 , . . . , X∗ 10 from the empir- ical distribution function of the dataset. Compute P(M∗ 10 0.46), where M∗ 10 = max{X∗ 1 , X∗ 2 , . . . , X∗ 10}. b. The same question as in a, but now for a dataset with distinct elements x1, x2, . . . , xn and maximum mn. Compute P(M∗ n mn), where M∗ n is the maximum of a bootstrap random sample X∗ 1 , X∗ 2 , . . . , X∗ n generated from the empirical distribution function of the dataset. 18.5 Suppose we have a dataset 0 3 6, which is the realization of a random sample from a distribution function F. If we estimate F by the empirical distribution function, then according to the bootstrap principle applied to the centered sample mean X̄3 − µ, we must replace this random variable by its bootstrapped version X̄∗ 3 − x̄3. Determine the possible values for the bootstrap random variable X̄∗ 3 − x̄3 and the corre- sponding probabilities. 18.6 Suppose that the dataset x1, x2, . . . , xn is a realization of a random sample from an Exp(λ) distribution with distribution function Fλ, and that x̄n = 5. a. Check that the median of the Exp(λ) distribution is mλ = (ln 2)/λ (see also Exercise 5.11). b. Suppose we estimate λ by 1/x̄n. Describe the parametric bootstrap sim- ulation for Med(X1, X2, . . . , Xn) − mλ.
  • 288. 282 18 The bootstrap 18.7 To give an example in which the bootstrapped centered sample mean in the parametric and empirical bootstrap simulations may be different, con- sider the following situation. Suppose that the dataset x1, x2, . . . , xn is a re- alization of a random sample from a U(0, θ) distribution with expectation µ = θ/2. We estimate θ by θ̂ = n + 1 n mn, where mn = max{x1, x2, . . . , xn}. Describe the parametric bootstrap simula- tion for the centered sample mean X̄n − µ. 18.8 Here is an example in which the bootstrapped centered sample mean in the parametric and empirical bootstrap simulations are the same. Consider the software data with average x̄n = 656.8815 and median mn = 290, modeled as a realization of a random sample X1, X2, . . . , Xn from a distribution function F with expectation µ. By means of bootstrap simulation we like to get an impression of the distribution of X̄n − µ. a. Suppose that we assume nothing about the distribution of the interfailure times. Describe the appropriate bootstrap simulation procedure with one thousand repetitions. b. Suppose we assume that F is the distribution function of an Exp(λ) distri- bution, where λ is estimated by 1/x̄n = 0.0015. Describe the appropriate bootstrap simulation procedure with one thousand repetitions. c. Suppose we assume that F is the distribution function of an Exp(λ) dis- tribution, and that (as suggested by Exercise 18.6 a) the parameter λ is estimated by (ln 2)/mn = 0.0024. Describe the appropriate bootstrap simulation procedure with one thousand repetitions. 18.9 Consider the dataset from Exercises 15.1 and 17.6 consisting of mea- sured chest circumferences of Scottish soldiers with average x̄n = 39.85 and sample standard deviation sn = 2.09. The histogram in Figure 17.11 suggests modeling the data as the realization of a random sample X1, X2, . . . , Xn from an N(µ, σ2 ) distribution. We estimate µ by the sample mean and we are inter- ested in the probability that the sample mean deviates more than 1 from µ: P |X̄n − µ| 1 . Describe how one can use the bootstrap principle to approx- imate this probability, i.e., describe the distribution of the bootstrap random sample X∗ 1 , X∗ 2 , . . . , X∗ n and compute P |X̄∗ n − µ∗ | 1 . Note that one does not need a simulation to approximate this latter probability. 18.10 Consider the software data, with average x̄n = 656.8815, modeled as a realization of a random sample X1, X2, . . . , Xn from a distribution func- tion F. We estimate the expectation µ of F by the sample mean and we are interested in the probability that the sample mean deviates more than ten from µ: P |X̄n − µ| 10 .
  • 289. 18.5 Exercises 283 a. Suppose we assume nothing about the distribution of the interfailure times. Describe how one can obtain a bootstrap approximation for the probability, i.e., describe the appropriate bootstrap simulation procedure with one thousand repetitions and how the results of this simulation can be used to approximate the probability. b. Suppose we assume that F is the distribution function of an Exp(λ) dis- tribution. Describe how one can obtain a bootstrap approximation for the probability. 18.11 Consider the dataset of measured chest circumferences of 5732 Scottish soldiers (see Exercises 15.1, 17.6, and 18.9). The Kolmogorov-Smirnov distance between the empirical distribution function and the distribution function Fx̄n,sn of the normal distribution with estimated parameters µ̂ = x̄n = 39.85 and σ̂ = sn = 2.09 is equal to tks = sup a∈R |Fn(a) − Fx̄n,sn (a)| = 0.0987, where x̄n and sn denote sample mean and sample standard deviation of the dataset. Suppose we want to perform a bootstrap simulation with one thou- sand repetitions for the KS distance to investigate to which degree the value 0.0987 agrees with the assumed normality of the dataset. Describe the appro- priate bootstrap simulation that must be carried out. 18.12 To give an example where the empirical bootstrap fails, consider the following situation. Suppose our dataset x1, x2, . . . , xn is a realization of a random sample X1, X2, . . . , Xn from a U(0, θ) distribution. Consider the nor- malized sample statistic Tn = 1 − Mn θ , where Mn is the maximum of X1, X2, . . . , Xn. Let X∗ 1 , X∗ 2 , . . . , X∗ n be a boot- strap random sample from the empirical distribution function of our dataset, and let M∗ n be the corresponding bootstrap maximum. We are going to com- pare the distribution functions of Tn and its bootstrap counterpart T ∗ n = 1 − M∗ n mn , where mn is the maximum of x1, x2, . . . , xn. a. Check that P(Tn ≤ 0) = 0 and show that P(T ∗ n ≤ 0) = 1 − 1 − 1 n n . Hint: first argue that P(T ∗ n ≤ 0) = P(M∗ n = mn), and then use the result of Exercise 18.4.
  • 290. 284 18 The bootstrap b. Let Gn(t) = P(Tn ≤ t) be the distribution function of Tn, and similarly let G∗ n(t) = P(T ∗ n ≤ t) be the distribution function of the bootstrap statistic T ∗ n. Conclude from part a that the maximum distance between G∗ n and Gn can be bounded from below as follows: sup t∈R |G∗ n(t) − Gn(t)| ≥ 1 − 1 − 1 n n . c. Use part b to argue that for all n, the maximum distance between G∗ n and Gn is greater than 0.632: sup t∈R |G∗ n(t) − Gn(t)| ≥ 1 − e−1 = 0.632. Hint: you may use that e−x ≥ 1 − x for all x. We conclude that even for very large sample sizes the maximum distance between the distribution functions of Tn and its bootstrap counterpart T ∗ n is at least 0.632. 18.13 (Exercise 18.12 continued). In contrast to the empirical bootstrap, the parametric bootstrap for Tn does work. Suppose we estimate the parameter θ of the U(0, θ) distribution by θ̂ = n + 1 n mn, where mn = maximum of x1, x2, . . . , xn. Let now X∗ 1 , X∗ 2 , . . . , X∗ n be a bootstrap random sample from a U(0, θ̂) dis- tribution, and let M∗ n be the corresponding bootstrap maximum. Again, we are going to compare the distribution function Gn of Tn = 1 −Mn/θ with the distribution function G∗ n of its bootstrap counterpart T ∗ n = 1 − M∗ n/θ̂. a. Check that the distribution function Fθ of a U(0, θ) distribution is given by Fθ(a) = a θ for 0 ≤ a ≤ θ. b. Check that the distribution function of Tn is Gn(t) = P(Tn ≤ t) = 1 − (1 − t)n for 0 ≤ t ≤ 1. Hint: rewrite P(Tn ≤ t) as 1 − P(Mn ≤ θ(1 − t)) and use the rule on page 109 about the distribution function of the maximum. c. Show that T ∗ n has the same distribution function: G∗ n(t) = P(T ∗ n ≤ t) = 1 − (1 − t)n for 0 ≤ t ≤ 1. This means that, in contrast to the empirical bootstrap (see Exer- cise 18.12), the parametric bootstrap works perfectly in this situation.
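Exercises 18.12 and 18.13 can also be explored by simulation. The sketch below (assuming Python with NumPy; θ = 1, n = 50, and the number of repetitions are arbitrary choices, not values from the text) contrasts the atom that the empirical bootstrap places at 0 with the continuous distribution produced by the parametric bootstrap.

    import numpy as np

    rng = np.random.default_rng(2)
    theta, n, reps = 1.0, 50, 10_000

    data = rng.uniform(0, theta, size=n)
    m_n = data.max()
    theta_hat = (n + 1) / n * m_n

    # Empirical bootstrap: P(T*_n = 0) = P(M*_n = m_n) = 1 - (1 - 1/n)^n, an atom
    # that the distribution of T_n itself does not have (P(T_n <= 0) = 0).
    emp_atom = sum(rng.choice(data, size=n, replace=True).max() == m_n
                   for _ in range(reps)) / reps
    print(emp_atom, 1 - (1 - 1 / n) ** n)         # both around 0.64 for n = 50

    # Parametric bootstrap: M*_n is drawn from U(0, theta-hat), so T*_n = 1 - M*_n/theta-hat
    # is continuous with distribution function 1 - (1 - t)^n, exactly like T_n.
    par_atom = sum(rng.uniform(0, theta_hat, size=n).max() == m_n
                   for _ in range(reps)) / reps
    print(par_atom)                               # 0.0: no atom at 0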
  • 291. 19 Unbiased estimators In Chapter 17 we saw that a dataset can be modeled as a realization of a random sample from a probability distribution and that quantities of interest correspond to features of the model distribution. One of our tasks is to use the dataset to estimate a quantity of interest. We shall mainly deal with the situ- ation where it is modeled as one of the parameters of the model distribution or as a certain function of the parameters. We will first discuss what we mean exactly by an estimator and then introduce the notion of unbiasedness as a desirable property for estimators. We end the chapter by providing unbiased estimators for the expectation and variance of a model distribution. 19.1 Estimators Consider the arrivals of packages at a network server. One is interested in the intensity at which packages arrive on a generic day and in the percentage of minutes during which no packages arrive. If the arrivals occur completely at random in time, the arrival process can be modeled by a Poisson process. This would mean that the number of arrivals during one minute is modeled by a random variable having a Poisson distribution with (unknown) parameter µ. The intensity of the arrivals is then modeled by the parameter µ itself, and the percentage of minutes during which no packages arrive is modeled by the probability of zero arrivals: e−µ . Suppose one observes the arrival process for a while and gathers a dataset x1, x2, . . . , xn, where xi represents the number of arrivals in the ith minute. Our task will be to estimate, based on the dataset, the parameter µ and a function of the parameter: e−µ . This example is typical for the general situation in which our dataset is mod- eled as a realization of a random sample X1, X2, . . . , Xn from a probability distribution that is completely determined by one or more parameters. The parameters that determine the model distribution are called the model param- eters. We focus on the situation where the quantity of interest corresponds
  • 292. 286 19 Unbiased estimators to a feature of the model distribution that can be described by the model parameters themselves or by some function of the model parameters. This distribution feature is referred to as the parameter of interest. In discussing this general setup we shall denote the parameter of interest by the Greek letter θ. So, for instance, in our network server example, µ is the model pa- rameter. When we are interested in the arrival intensity, the role of θ is played by the parameter µ itself, and when we are interested in the percentage of idle minutes the role of θ is played by e−µ . Whatever method we use to estimate the parameter of interest θ, the result depends only on our dataset. Estimate. An estimate is a value t that only depends on the dataset x1, x2, . . . , xn, i.e., t is some function of the dataset only: t = h(x1, x2, . . . , xn). This description of estimate is a bit formal. The idea is, of course, that the value t, computed from our dataset x1, x2, . . . , xn, gives some indication of the “true” value of the parameter θ. We have already met several estimates in Chapter 17; see, for instance, Table 17.2. This table illustrates that the value of an estimate can be anything: a single number, a vector of numbers, even a complete curve. Let us return to our network server example in which our dataset x1, x2, . . . , xn is modeled as a realization of a random sample from a Pois(µ) distribution. The intensity at which packages arrive is then represented by the parameter µ. Since the parameter µ is the expectation of the model distribution, the law of large numbers suggests the sample mean x̄n as a natural estimate for µ. On the other hand, the parameter µ also represents the variance of the model distribution, so that by a similar reasoning another natural estimate is the sample variance s2 n. The percentage of idle minutes is modeled by the probability of zero arrivals. Similar to the reasoning in Section 13.4, a natural estimate is the relative frequency of zeros in the dataset: number of xi equal to zero n . On the other hand, the probability of zero arrivals can be expressed as a function of the model parameter: e−µ . Hence, if we estimate µ by x̄n, we could also estimate e−µ by e−x̄n . Quick exercise 19.1 Suppose we estimate the probability of zero arrivals e−µ by the relative frequency of xi equal to zero. Deduce an estimate for µ from this.
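As a small illustration of these two routes to an estimate of e^{−µ}, here is a sketch with made-up arrival counts (the numbers are purely hypothetical, not a dataset from the text), assuming Python with NumPy:

    import numpy as np

    # Hypothetical arrival counts per minute; not a dataset from the text.
    x = np.array([2, 0, 3, 1, 0, 2, 4, 1, 0, 2, 3, 1])

    p0_rel_freq = np.mean(x == 0)         # relative frequency of zero-arrival minutes
    p0_plug_in = np.exp(-x.mean())        # e^{-x-bar_n}, plugging x-bar_n into e^{-mu}
    mu_from_freq = -np.log(p0_rel_freq)   # the estimate of mu asked for in Quick exercise 19.1

    print(p0_rel_freq, round(p0_plug_in, 4), round(mu_from_freq, 4))   # 0.25  0.2053  1.3863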
  • 293. 19.2 Investigating the behavior of an estimator 287 The preceding examples illustrate that one can often think of several estimates for the parameter of interest. This raises questions like Ĺ When is one estimate better than another? Ĺ Does there exist a best possible estimate? For instance, can we say which of the values x̄n or s2 n computed from the dataset is closer to the “true” parameter µ? The answer is no. The measure- ments and the corresponding estimates are subject to randomness, so that we cannot say anything with certainty about which of the two is closer to µ. One of the things we can say for each of them is how likely it is that they are within a given distance from µ. To this end, we consider the random variables that correspond to the estimates. Because our dataset x1, x2, . . . , xn is mod- eled as a realization of a random sample X1, X2, . . . , Xn, the estimate t is a realization of a random variable T . Estimator. Let t = h(x1, x2, . . . , xn) be an estimate based on the dataset x1, x2, . . . , xn. Then t is a realization of the random variable T = h(X1, X2, . . . , Xn). The random variable T is called an estimator. The word estimator refers to the method or device for estimation. This is distinguished from estimate, which refers to the actual value computed from a dataset. Note that estimators are special cases of sample statistics. In the remainder of this chapter we will discuss the notion of unbiasedness that describes to some extent the behavior of estimators. 19.2 Investigating the behavior of an estimator Let us continue with our network server example. Suppose we have observed the network for 30 minutes and we have recorded the number of arrivals in each minute. The dataset is modeled as a realization of a random sample X1, X2, . . . , Xn of size n = 30 from a Pois(µ) distribution. Let us concentrate on estimating the probability p0 of zero arrivals, which is an unknown number between 0 and 1. As motivated in the previous section, we have the following possible estimators: S = number of Xi equal to zero n and T = e−X̄n . Our first estimator S can only attain the values 0, 1 30 , 2 30 , . . . , 1, so that in general it cannot give the exact value of p0. Similarly for our second estima- tor T , which can only attain the values 1, e−1/30 , e−2/30 , . . . . So clearly, we
cannot expect our estimators always to give the exact value of p0 on the basis of 30 observations. Well, then what can we expect from a reasonable estimator?
To get an idea of the behavior of both estimators, we pretend we know µ and we simulate the estimation process in the case of n = 30 observations. Let us choose µ = ln 10, so that p0 = e^{−µ} = 0.1. We draw 30 values from a Poisson distribution with parameter µ = ln 10 and compute the values of the estimators S and T. We repeat this 500 times, so that we have 500 values for each estimator. In Figure 19.1 a frequency histogram¹ of these values for estimator S is displayed on the left and for estimator T on the right. Clearly, the values of both estimators vary around the value 0.1, which they are supposed to estimate.

¹ In a frequency histogram the height of each vertical bar equals the frequency of values in the corresponding bin.

Fig. 19.1. Frequency histograms of 500 values for estimators S (left) and T (right) of p0 = 0.1.

19.3 The sampling distribution and unbiasedness

We have just seen that the values generated for estimator S fluctuate around p0 = 0.1. Although the value of this estimator is not always equal to 0.1, it is desirable that on average, S is on target, i.e., E[S] = 0.1. Moreover, it is desirable that this property holds no matter what the actual value of p0 is, i.e., E[S] = p0 irrespective of the value 0 < p0 < 1. In order to find out whether this is true, we need the probability distribution of the estimator S. Of course this
is simply the distribution of a random variable, but because estimators are constructed from a random sample X1, X2, . . . , Xn, we speak of the sampling distribution.

The sampling distribution. Let T = h(X1, X2, . . . , Xn) be an estimator based on a random sample X1, X2, . . . , Xn. The probability distribution of T is called the sampling distribution of T.

The sampling distribution of S can be found as follows. Write S = Y/n, where Y is the number of Xi equal to zero. If for each i we label Xi = 0 as a success, then Y is equal to the number of successes in n independent trials with p0 as the probability of success. Similar to Section 4.3, it follows that Y has a Bin(n, p0) distribution. Hence the sampling distribution of S is that of a Bin(n, p0) distributed random variable divided by n. This means that S is a discrete random variable that attains the values k/n, where k = 0, 1, . . . , n, with probabilities given by

pS(k/n) = P(S = k/n) = P(Y = k) = (n choose k) p0^k (1 − p0)^{n−k}.

The probability mass function of S for the case n = 30 and p0 = 0.1 is displayed in Figure 19.2.

Fig. 19.2. Probability mass function of S.

Since S = Y/n and Y has a Bin(n, p0) distribution, it follows that

E[S] = E[Y]/n = np0/n = p0.

So, indeed, the estimator S for p0 has the property E[S] = p0. This property reflects the fact that estimator S has no systematic tendency to produce
estimates that are larger than p0, and no systematic tendency to produce estimates that are smaller than p0. This is a desirable property for estimators, and estimators that have this property are called unbiased.

Definition. An estimator T is called an unbiased estimator for the parameter θ, if E[T] = θ irrespective of the value of θ. The difference E[T] − θ is called the bias of T; if this difference is nonzero, then T is called biased.

Let us return to our second estimator for the probability of zero arrivals in the network server example: T = e^{−X̄n}. The sampling distribution can be obtained as follows. Write T = e^{−Z/n}, where Z = X1 + X2 + · · · + Xn. From Exercise 12.9 we know that the random variable Z, being the sum of n independent Pois(µ) random variables, has a Pois(nµ) distribution. This means that T is a discrete random variable attaining the values e^{−k/n}, where k = 0, 1, . . . , and the probability mass function of T is given by

pT(e^{−k/n}) = P(T = e^{−k/n}) = P(Z = k) = e^{−nµ} (nµ)^k / k!.

The probability mass function of T for the case n = 30 and p0 = 0.1 is displayed in Figure 19.3.

Fig. 19.3. Probability mass function of T.

From the histogram in Figure 19.1 as well as from the probability mass function in Figure 19.3, you may get the impression that T is also an unbiased estimator. However, this is not the case, which follows immediately from an application of Jensen’s inequality:
E[T] = E[e^{−X̄n}] > e^{−E[X̄n]},

where we have a strict inequality because the function g(x) = e^{−x} is strictly convex (g''(x) = e^{−x} > 0). Recall that the parameter µ equals the expectation of the Pois(µ) model distribution, so that according to Section 13.1 we have E[X̄n] = µ. We find that

E[T] > e^{−µ} = p0,

which means that the estimator T for p0 has positive bias. In fact we can compute E[T] exactly (see Exercise 19.9):

E[T] = E[e^{−X̄n}] = e^{−nµ(1−e^{−1/n})}.

Note that n(1 − e^{−1/n}) → 1, so that

E[T] = e^{−nµ(1−e^{−1/n})} → e^{−µ} = p0

as n goes to infinity. Hence, although T has positive bias, the bias decreases to zero as the sample size becomes larger. In Figure 19.4 the expectation of T is displayed as a function of the sample size n for the case µ = ln(10). For n = 30 the difference between E[T] and p0 = 0.1 equals 0.0038.

Fig. 19.4. E[T] as a function of n.

Quick exercise 19.2 If we estimate p0 = e^{−µ} by the relative frequency of zeros S = Y/n, then we could estimate µ by U = − ln(S). Argue that U is a biased estimator for µ. Is the bias positive or negative?

We conclude this section by returning to the estimation of the parameter µ. Apart from the (biased) estimator in Quick exercise 19.2 we also considered
  • 298. 292 19 Unbiased estimators the sample mean X̄n and sample variance S2 n as possible estimators for µ. These are both unbiased estimators for the parameter µ. This is a direct consequence of a more general property of X̄n and S2 n, which is discussed in the next section. 19.4 Unbiased estimators for expectation and variance Sometimes the quantity of interest can be described by the expectation or variance of the model distribution, and is it irrelevant whether this distribution is of a parametric type. In this section we propose unbiased estimators for these distribution features. Unbiased estimators for expectation and variance. Sup- pose X1, X2, . . . , Xn is a random sample from a distribution with finite expectation µ and finite variance σ2 . Then X̄n = X1 + X2 + · · · + Xn n is an unbiased estimator for µ and S2 n = 1 n − 1 n i=1 (Xi − X̄n)2 is an unbiased estimator for σ2 . The first statement says that E X̄n = µ, which was shown in Section 13.1. The second statement says E S2 n = σ2 . To see this, use linearity of expecta- tions to write E S2 n = 1 n − 1 n i=1 E (Xi − X̄n)2 . Since E X̄n = µ, we have E Xi − X̄n = E[Xi]−E X̄n = 0. Now note that for any random variable Y with E[Y ] = 0, we have Var(Y ) = E Y 2 − (E[Y ])2 = E Y 2 . Applying this to Y = Xi − X̄n, it follows that E (Xi − X̄n)2 = Var Xi − X̄n . Note that we can write Xi − X̄n = n − 1 n Xi − 1 n j=i Xj.
Then from the rules concerning variances of sums of independent random variables we find that

Var(Xi − X̄n) = Var( ((n − 1)/n) Xi − (1/n) Σ_{j≠i} Xj )
             = ((n − 1)^2/n^2) Var(Xi) + (1/n^2) Σ_{j≠i} Var(Xj)
             = ( (n − 1)^2/n^2 + (n − 1)/n^2 ) σ^2
             = ((n − 1)/n) σ^2.

We conclude that

E[S_n^2] = (1/(n − 1)) Σ_{i=1}^n E[(Xi − X̄n)^2] = (1/(n − 1)) Σ_{i=1}^n Var(Xi − X̄n) = (1/(n − 1)) · n · ((n − 1)/n) σ^2 = σ^2.

This explains why we divide by n − 1 in the formula for S_n^2; only in this case S_n^2 is an unbiased estimator for the “true” variance σ^2. If we divided by n instead of n − 1, we would obtain an estimator with negative bias; it would systematically produce too-small estimates for σ^2.

Quick exercise 19.3 Consider the following estimator for σ^2:

V_n^2 = (1/n) Σ_{i=1}^n (Xi − X̄n)^2.

Compute the bias E[V_n^2] − σ^2 for this estimator, where you can keep computations simple by realizing that V_n^2 = (n − 1)S_n^2/n.

Unbiasedness does not always carry over

We have seen that S_n^2 is an unbiased estimator for the “true” variance σ^2. A natural question is whether Sn is again an unbiased estimator for σ. This is not the case. Since the function g(x) = x^2 is strictly convex, Jensen’s inequality yields that

σ^2 = E[S_n^2] > (E[Sn])^2,

which implies that E[Sn] < σ. Another example is the network arrivals, in which X̄n is an unbiased estimator for µ, whereas e^{−X̄n} is positively biased with respect to e^{−µ}. These examples illustrate a general fact: unbiasedness does not always carry over, i.e., if T is an unbiased estimator for a parameter θ, then g(T) does not have to be an unbiased estimator for g(θ).
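A small simulation makes the claims of this section concrete: dividing by n − 1 gives an unbiased estimator of σ^2, dividing by n gives negative bias, and taking the square root destroys unbiasedness. This is only a sketch, assuming Python with NumPy; the N(0, 4) model, the sample size, and the number of runs are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)
    sigma2, n, runs = 4.0, 10, 100_000

    samples = rng.normal(0.0, np.sqrt(sigma2), size=(runs, n))
    s2 = samples.var(axis=1, ddof=1)      # S_n^2: divide by n - 1
    v2 = samples.var(axis=1, ddof=0)      # V_n^2: divide by n

    print(s2.mean())                      # close to sigma^2 = 4: unbiased
    print(v2.mean())                      # close to (n - 1)/n * sigma^2 = 3.6: negative bias
    print(np.sqrt(s2).mean())             # below sigma = 2: S_n underestimates sigma on average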
  • 300. 294 19 Unbiased estimators However, there is one special case in which unbiasedness does carry over, namely if g(T ) = aT + b. Indeed, if T is unbiased for θ: E[T ] = θ, then by the change-of-units rule for expectations, E[aT + b] = aE[T ] + b = aθ + b, which means that aT + b is unbiased for aθ + b. 19.5 Solutions to the quick exercises 19.1 Write y for the number of xi equal to zero. Denote the probability of zero by p0, so that p0 = e−µ . This means that µ = − ln(p0). Hence if we estimate p0 by the relative frequency y/n, we can estimate µ by − ln(y/n). 19.2 The function g(x) = − ln(x) is strictly convex, since g (x) = 1/x2 0. Hence by Jensen’s inequality E[U] = E[− ln(S)] − ln(E[S]). Since we have seen that E[S] = p0 = e−µ , it follows that E[U] − ln(E[S]) = − ln(e−µ ) = µ. This means that U has positive bias. 19.3 Using that E S2 n = σ2 , we find that E V 2 n = E n − 1 n S2 n = n − 1 n E S2 n = n − 1 n σ2 . We conclude that the bias of V 2 n equals E V 2 n − σ2 = −σ2 /n 0. 19.6 Exercises 19.1 Suppose our dataset is a realization of a random sample X1, X2, . . . , Xn from a uniform distribution on the interval [−θ, θ], where θ is unknown. a. Show that T = 3 n (X2 1 + X2 2 + · · · + X2 n) is an unbiased estimator for θ2 . b. Is √ T also an unbiased estimator for θ? If not, argue whether it has positive or negative bias. 19.2 Suppose the random variables X1, X2, . . . , Xn have the same expecta- tion µ.
  • 301. 19.6 Exercises 295 a. Is S = 1 2 X1 + 1 3 X2 + 1 6 X3 an unbiased estimator for µ? b. Under what conditions on constants a1, a2, . . . , an is T = a1X1 + a2X2 + · · · + anXn an unbiased estimator for µ? 19.3 Suppose the random variables X1, X2, . . . , Xn have the same expec- tation µ. For which constants a and b is T = a(X1 + X2 + · · · + Xn) + b an unbiased estimator for µ? 19.4 Recall Exercise 17.5 about the number of cycles to pregnancy. Suppose the dataset corresponding to the table in Exercise 17.5 a is modeled as a realization of a random sample X1, X2, . . . , Xn from a Geo(p) distribution, where 0 p 1 is unknown. Motivated by the law of large numbers, a natural estimator for p is T = 1/X̄n. a. Check that T is a biased estimator for p and find out whether it has positive or negative bias. b. In Exercise 17.5 we discussed the estimation of the probability that a woman becomes pregnant within three or fewer cycles. One possible esti- mator for this probability is the relative frequency of women that became pregnant within three cycles S = number of Xi ≤ 3 n . Show that S is an unbiased estimator for this probability. 19.5 Suppose a dataset is modeled as a realization of a random sample X1, X2, . . . , Xn from an Exp(λ) distribution, where λ 0 is unknown. Let µ denote the corresponding expectation and let Mn denote the minimum of X1, X2, . . . , Xn. Recall from Exercise 8.18 that Mn has an Exp(nλ) distribu- tion. Find out for which constant c the estimator T = cMn is an unbiased estimator for µ. 19.6 Consider the following dataset of lifetimes of ball bearings in hours. 6278 3113 5236 11584 12628 7725 8604 14266 6125 9350 3212 9003 3523 12888 9460 13431 17809 2812 11825 2398 Source: J.E. Angus. Goodness-of-fit tests for exponentiality based on a loss- of-memory type functional equation. Journal of Statistical Planning and In- ference, 6:241-251, 1982; example 5 on page 249.
  • 302. 296 19 Unbiased estimators One is interested in estimating the minimum lifetime of this type of ball bear- ing. The dataset is modeled as a realization of a random sample X1, . . . , Xn. Each random variable Xi is represented as Xi = δ + Yi, where Yi has an Exp(λ) distribution and δ 0 is an unknown parameter that is supposed to model the minimum lifetime. The objective is to construct an unbiased estimator for δ. It is known that E[Mn] = δ + 1 nλ and E X̄n = δ + 1 λ , where Mn = minimum of X1, X2, . . . , Xn and X̄n = (X1 + X2 + · · · + Xn)/n. a. Check that T = n n − 1 X̄n − Mn is an unbiased estimator for 1/λ. b. Construct an unbiased estimator for δ. c. Use the dataset to compute an estimate for the minimum lifetime δ. You may use that the average lifetime of the data is 8563.5. 19.7 Leaves are divided into four different types: starchy-green, sugary-white, starchy-white, and sugary-green. According to genetic theory, the types occur with probabilities 1 4 (θ + 2), 1 4 θ, 1 4 (1 − θ), and 1 4 (1 − θ), respectively, where 0 θ 1. Suppose one has n leaves. Then the number of starchy-green leaves is modeled by a random variable N1 with a Bin(n, p1) distribution, where p1 = 1 4 (θ + 2), and the number of sugary-white leaves is modeled by a random variable N2 with a Bin(n, p2) distribution, where p2 = 1 4 θ. The following table lists the counts for the progeny of self-fertilized heterozygotes among 3839 leaves. Type Count Starchy-green 1997 Sugary-white 32 Starchy-white 906 Sugary-green 904 Source: R.A. Fisher. Statistical methods for research workers. Hafner, New York, 1958; Table 62 on page 299. Consider the following two estimators for θ: T1 = 4 n N1 − 2 and T2 = 4 n N2.
  • 303. 19.6 Exercises 297 a. Check that both T1 and T2 are unbiased estimators for θ. b. Compute the value of both estimators for θ. 19.8 Recall the black cherry trees example from Exercise 17.9, modeled by a linear regression model without intercept Yi = βxi + Ui for i = 1, 2, . . . , n, where U1, U2, . . . , Un are independent random variables with E[Ui] = 0 and Var(Ui) = σ2 . We discussed three estimators for the parameter β: B1 = 1 n Y1 x1 + · · · + Yn xn , B2 = Y1 + · · · + Yn x1 + · · · + xn , B3 = x1Y1 + · · · + xnYn x2 1 + · · · + x2 n . Show that all three estimators are unbiased for β. 19.9 Consider the network example where the dataset is modeled as a real- ization of a random sample X1, X2, . . . , Xn from a Pois(µ) distribution. We estimate the probability of zero arrivals e−µ by means of T = e−X̄n . Check that E[T ] = e−nµ(1−e−1/n ) . Hint: write T = e−Z/n , where Z = X1 + X2 + · · · + Xn has a Pois(nµ) distribution.
• 304. 20 Efficiency and mean squared error

In the previous chapter we introduced the notion of unbiasedness as a desirable property of an estimator. If several unbiased estimators for the same parameter of interest exist, we need a criterion for comparison of these estimators. A natural criterion is some measure of spread of the estimators around the parameter of interest. For unbiased estimators we will use variance. For arbitrary estimators we introduce the notion of mean squared error (MSE), which combines variance and bias.

20.1 Estimating the number of German tanks

In this section we come back to the problem of estimating German war production as discussed in Section 1.5. We consider serial numbers on tanks, recoded to numbers running from 1 to some unknown largest number N. Given is a subset of n numbers of this set. The objective is to estimate the total number of tanks N on the basis of the observed serial numbers. Denote the observed distinct serial numbers by x1, x2, . . . , xn. This dataset can be modeled as a realization of random variables X1, X2, . . . , Xn representing n draws without replacement from the numbers 1, 2, . . . , N with equal probability. Note that in this example our dataset is not a realization of a random sample, because the random variables X1, X2, . . . , Xn are dependent. We propose two unbiased estimators. The first one is based on the sample mean X̄n = (X1 + X2 + · · · + Xn)/n, and the second one is based on the sample maximum Mn = max{X1, X2, . . . , Xn}.
• 305. 300 20 Efficiency and mean squared error

An estimator based on the sample mean

To construct an unbiased estimator for N based on the sample mean, we start by computing the expectation of X̄n. The linearity-of-expectations rule also applies to dependent random variables, so that
E[X̄n] = (E[X1] + E[X2] + · · · + E[Xn])/n.
In Section 9.3 we saw that the marginal distribution of each Xi is the same:
P(Xi = k) = 1/N for k = 1, 2, . . . , N.
Therefore the expectation of each Xi is given by
$$\mathrm{E}[X_i] = 1\cdot\frac{1}{N} + 2\cdot\frac{1}{N} + \cdots + N\cdot\frac{1}{N} = \frac{1 + 2 + \cdots + N}{N} = \frac{\tfrac{1}{2}N(N+1)}{N} = \frac{N+1}{2}.$$
It follows that
E[X̄n] = (E[X1] + E[X2] + · · · + E[Xn])/n = (N + 1)/2.
This directly implies that T1 = 2X̄n − 1 is an unbiased estimator for N, since the change-of-units rule yields that
E[T1] = E[2X̄n − 1] = 2E[X̄n] − 1 = 2 · (N + 1)/2 − 1 = N.

Quick exercise 20.1 Suppose we have observed tanks with (recoded) serial numbers 61 19 56 24 16. Compute the value of the estimator T1 for the total number of tanks.

An estimator based on the sample maximum

To construct an unbiased estimator for N based on the maximum, we first compute the expectation of Mn. We start by computing the probability that Mn = k, where k takes the values n, . . . , N. Similar to the combinatorics used in Section 4.3 to derive the binomial distribution, the number of ways to draw n numbers without replacement from 1, 2, . . . , N is $\binom{N}{n}$. Hence each combination has probability $1/\binom{N}{n}$. In order to have Mn = k, we must have one number equal to k and choose the other n − 1 numbers out of the numbers 1, 2, . . . , k − 1. There are $\binom{k-1}{n-1}$ ways to do this. Hence for the possible values k = n, n + 1, . . . , N,
• 306. 20.1 Estimating the number of German tanks 301

$$P(M_n = k) = \frac{\binom{k-1}{n-1}}{\binom{N}{n}} = \frac{(k-1)!}{(k-n)!\,(n-1)!}\cdot\frac{(N-n)!\,n!}{N!} = n\cdot\frac{(k-1)!}{(k-n)!}\cdot\frac{(N-n)!}{N!}.$$
Thus the expectation of Mn is given by
$$\mathrm{E}[M_n] = \sum_{k=n}^{N} k\,P(M_n = k) = \sum_{k=n}^{N} k\cdot n\cdot\frac{(k-1)!}{(k-n)!}\cdot\frac{(N-n)!}{N!} = \sum_{k=n}^{N} n\cdot\frac{k!}{(k-n)!}\cdot\frac{(N-n)!}{N!} = n\cdot\frac{(N-n)!}{N!}\sum_{k=n}^{N}\frac{k!}{(k-n)!}.$$
How to continue the computation of E[Mn]? We use a trick: we start by rearranging
$$1 = \sum_{j=n}^{N} P(M_n = j) = \sum_{j=n}^{N} n\cdot\frac{(j-1)!}{(j-n)!}\cdot\frac{(N-n)!}{N!},$$
finding that
$$\sum_{j=n}^{N}\frac{(j-1)!}{(j-n)!} = \frac{N!}{n\,(N-n)!}. \qquad (20.1)$$
This holds for any N and any n ≤ N. In particular we could replace N by N + 1 and n by n + 1:
$$\sum_{j=n+1}^{N+1}\frac{(j-1)!}{(j-n-1)!} = \frac{(N+1)!}{(n+1)(N-n)!}.$$
Changing the summation variable to k = j − 1, we obtain
$$\sum_{k=n}^{N}\frac{k!}{(k-n)!} = \frac{(N+1)!}{(n+1)(N-n)!}. \qquad (20.2)$$
This is exactly what we need to finish the computation of E[Mn]. Substituting (20.2) in what we obtained earlier, we find
$$\mathrm{E}[M_n] = n\cdot\frac{(N-n)!}{N!}\sum_{k=n}^{N}\frac{k!}{(k-n)!} = n\cdot\frac{(N-n)!}{N!}\cdot\frac{(N+1)!}{(n+1)(N-n)!} = n\cdot\frac{N+1}{n+1}.$$
• 307. 302 20 Efficiency and mean squared error

Quick exercise 20.2 Choosing n = N in this formula yields E[MN] = N. Can you argue that this is the right answer without doing any computations?

With the formula for E[Mn] we can derive immediately that
T2 = ((n + 1)/n) Mn − 1
is an unbiased estimator for N, since by the change-of-units rule,
E[T2] = E[((n + 1)/n) Mn − 1] = ((n + 1)/n) E[Mn] − 1 = ((n + 1)/n) · n(N + 1)/(n + 1) − 1 = N.

Quick exercise 20.3 Compute the value of estimator T2 for the total number of tanks on the basis of the observed numbers from Quick exercise 20.1.

20.2 Variance of an estimator

In the previous section we saw that we can construct two completely different estimators for the total number of tanks N that are both unbiased. The obvious question is: which of the two is better? To answer this question, we investigate how both estimators vary around the parameter of interest N. Although we could in principle compute the distributions of T1 and T2, we carry out a small simulation study instead. Take N = 1000 and n = 10 fixed. We draw 10 numbers, without replacement, from 1, 2, . . . , 1000 and compute the value of the estimators T1 and T2. We repeat this two thousand times, so that we have 2000 values for both estimators. In Figure 20.1 we have displayed the histogram of the 2000 values for T1 on the left and the histogram of the 2000 values for T2 on the right.

[Figure 20.1: two histograms on the range 300 to 1600, with N = 1000 marked on the horizontal axis and a vertical scale running up to 0.008.]
Fig. 20.1. Histograms of two thousand values for T1 (left) and T2 (right).
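This simulation study is easy to redo. The following sketch is my own Python/NumPy code, not the book's (the seed is an arbitrary choice); it repeats the experiment 2000 times and prints the average and the sample variance of the values of T1 and T2.

    import numpy as np

    rng = np.random.default_rng(2)
    N, n, reps = 1000, 10, 2000

    t1_values, t2_values = [], []
    for _ in range(reps):
        sample = rng.choice(np.arange(1, N + 1), size=n, replace=False)
        t1_values.append(2 * sample.mean() - 1)             # T1 = 2*Xbar_n - 1
        t2_values.append((n + 1) / n * sample.max() - 1)     # T2 = (n+1)/n * M_n - 1

    t1_values, t2_values = np.array(t1_values), np.array(t2_values)
    print("T1: average", t1_values.mean(), " sample variance", t1_values.var(ddof=1))
    print("T2: average", t2_values.mean(), " sample variance", t2_values.var(ddof=1))

Both averages come out close to N = 1000, as unbiasedness suggests, while the spread of the T2 values is markedly smaller, which is exactly what the histograms show.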
• 308. 20.2 Variance of an estimator 303

From the histograms, which reflect the probability mass functions of both estimators, we see that the distributions of T1 and T2 are of completely different types. As can be expected from the fact that both estimators are unbiased, the values vary around the parameter of interest N = 1000. The most important difference between the histograms is that the variation in the values of T2 is less than the variation in the values of T1. This suggests that estimator T2 estimates the total number of tanks more efficiently than estimator T1, in the sense that it produces estimates that are more concentrated around the parameter of interest N than estimates produced by T1. Recall that the variance measures the spread of a random variable. Hence the previous discussion motivates the use of the variance of an estimator to evaluate its performance.

Efficiency. Let T1 and T2 be two unbiased estimators for the same parameter θ. Then estimator T2 is called more efficient than estimator T1 if Var(T2) < Var(T1), irrespective of the value of θ.

Let us compare T1 and T2 using this criterion. For T1 we have Var(T1) = Var(2X̄n − 1) = 4 Var(X̄n). Although the Xi are not independent, it is true that all pairs (Xi, Xj) with i ≠ j have the same distribution (this follows in the same way in which we showed on page 122 that all Xi have the same distribution). With the variance-of-the-sum rule for n random variables (see Exercise 10.17), we find that
Var(X1 + · · · + Xn) = n Var(X1) + n(n − 1) Cov(X1, X2).
In Exercises 9.18 and 10.18, we computed that
Var(X1) = (1/12)(N − 1)(N + 1) and Cov(X1, X2) = −(1/12)(N + 1).
We therefore find that
$$\mathrm{Var}(T_1) = 4\,\mathrm{Var}(\bar{X}_n) = \frac{4}{n^2}\,\mathrm{Var}(X_1 + \cdots + X_n) = \frac{4}{n^2}\Bigl[n\cdot\tfrac{1}{12}(N-1)(N+1) - n(n-1)\cdot\tfrac{1}{12}(N+1)\Bigr] = \frac{1}{3n}(N+1)\bigl[N - 1 - (n-1)\bigr] = \frac{(N+1)(N-n)}{3n}.$$
Obtaining the variance of T2 is a little more work. One can compute the variance of Mn in a way that is very similar to the way we obtained E[Mn]. The result is (see Remark 20.1 for details)
$$\mathrm{Var}(M_n) = \frac{n(N+1)(N-n)}{(n+2)(n+1)^2}.$$
• 309. 304 20 Efficiency and mean squared error

Remark 20.1 (How to compute this variance). The trick is to compute not E[Mn²] but E[Mn(Mn + 1)]. First we derive an identity from Equation (20.1) as before, this time replacing N by N + 2 and n by n + 2:
$$\sum_{j=n+2}^{N+2}\frac{(j-1)!}{(j-n-2)!} = \frac{(N+2)!}{(n+2)(N-n)!}.$$
Changing the summation variable to k = j − 2 yields
$$\sum_{k=n}^{N}\frac{(k+1)!}{(k-n)!} = \frac{(N+2)!}{(n+2)(N-n)!}.$$
With this formula one can obtain
$$\mathrm{E}[M_n(M_n+1)] = \sum_{k=n}^{N} k(k+1)\cdot n\,\frac{(k-1)!}{(k-n)!}\cdot\frac{(N-n)!}{N!} = \frac{n(N+1)(N+2)}{n+2}.$$
Since we know E[Mn], we can determine E[Mn²] from this, and subsequently the variance of Mn.

With the expression for the variance of Mn, we derive
$$\mathrm{Var}(T_2) = \mathrm{Var}\Bigl(\frac{n+1}{n}M_n - 1\Bigr) = \frac{(n+1)^2}{n^2}\,\mathrm{Var}(M_n) = \frac{(N+1)(N-n)}{n(n+2)}.$$
We see that Var(T2) < Var(T1) for all N and n ≥ 2. Hence T2 is always more efficient than T1, except when n = 1. In this case the variances are equal, simply because the estimators are the same: they both equal X1. The quotient Var(T1)/Var(T2) is called the relative efficiency of T2 with respect to T1. In our case the relative efficiency of T2 with respect to T1 equals
$$\frac{\mathrm{Var}(T_1)}{\mathrm{Var}(T_2)} = \frac{(N+1)(N-n)}{3n}\cdot\frac{n(n+2)}{(N+1)(N-n)} = \frac{n+2}{3}.$$
Surprisingly, this quotient does not depend on N, and we see clearly the advantage of T2 over T1 as the sample size n gets larger.

Quick exercise 20.4 Let n = 5, and let the sample be 7 3 10 45 15. Compute the value of the estimator T1 for N. Do you notice anything strange?

The self-contradictory behavior of T1 in Quick exercise 20.4 is not rare: this phenomenon will occur for up to 50% of the samples if n and N are large. This gives another reason to prefer T2 over T1.
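For concreteness, the variance formulas and the relative efficiency can be evaluated for the simulation setting used above (N = 1000, n = 10). This tiny Python check is my own addition, not part of the book:

    N, n = 1000, 10
    var_t1 = (N + 1) * (N - n) / (3 * n)          # (N+1)(N-n)/(3n)
    var_t2 = (N + 1) * (N - n) / (n * (n + 2))    # (N+1)(N-n)/(n(n+2))
    print(var_t1, var_t2, var_t1 / var_t2)        # 33033.0, 8258.25, ratio 4.0 = (n+2)/3

The sample variances from the simulation sketched earlier should land near these two numbers.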
• 310. 20.3 Mean squared error 305

Remark 20.2 (The Cramér-Rao inequality). Suppose we have a random sample from a continuous distribution with probability density function fθ, where θ is the parameter of interest. Under certain smoothness conditions on the density fθ, the variance of an unbiased estimator T for θ always has to be larger than or equal to a certain positive number, the so-called Cramér-Rao lower bound:
$$\mathrm{Var}(T) \;\ge\; \frac{1}{n\,\mathrm{E}\Bigl[\bigl(\tfrac{\partial}{\partial\theta}\ln f_\theta(X)\bigr)^2\Bigr]} \qquad\text{for all } \theta.$$
Here n is the size of the sample and X a random variable whose density function is fθ. In some cases we can find unbiased estimators attaining this bound. These are called minimum variance unbiased estimators. An example is the sample mean for the expectation of an exponential distribution. (We will consider this case in Exercise 20.3.)

20.3 Mean squared error

In the last section we compared two unbiased estimators by considering their spread around the value to be estimated, where the spread was measured by the variance. Although unbiasedness is a desirable property, the performance of an estimator should mainly be judged by the way it spreads around the parameter θ to be estimated. This leads to the following definition.

Definition. Let T be an estimator for a parameter θ. The mean squared error of T is the number MSE(T) = E[(T − θ)²].

According to this criterion, an estimator T1 performs better than an estimator T2 if MSE(T1) < MSE(T2). Note that
$$\mathrm{MSE}(T) = \mathrm{E}\bigl[(T-\theta)^2\bigr] = \mathrm{E}\bigl[(T - \mathrm{E}[T] + \mathrm{E}[T] - \theta)^2\bigr] = \mathrm{E}\bigl[(T-\mathrm{E}[T])^2\bigr] + 2\,\mathrm{E}\bigl[T-\mathrm{E}[T]\bigr]\bigl(\mathrm{E}[T]-\theta\bigr) + \bigl(\mathrm{E}[T]-\theta\bigr)^2 = \mathrm{Var}(T) + \bigl(\mathrm{E}[T]-\theta\bigr)^2,$$
where the middle term drops out because E[T − E[T]] = 0. So the MSE of T turns out to be the variance of T plus the square of the bias of T. In particular, when T is unbiased, the MSE of T is just the variance of T. This means that we already used mean squared errors to compare the estimators T1 and T2 in the previous section. We extend the notion of efficiency by saying that estimator T2 is more efficient than estimator T1 (for the same parameter of interest) if the MSE of T2 is smaller than the MSE of T1.

Unbiasedness and efficiency

A biased estimator with a small variance may be more useful than an unbiased estimator with a large variance. We illustrate this with the network server example from Section 19.2.
• 311. 306 20 Efficiency and mean squared error

Recall that our goal was to estimate the probability p0 = e^{−µ} of zero arrivals (of packages) in a minute. We did have two promising candidates as estimators:
S = (number of Xi equal to zero)/n and T = e^{−X̄n}.
In Figure 20.2 we depict histograms of one thousand simulations of the values of S and T computed for random samples of size n = 25 from a Pois(µ) distribution, where µ = 2.

[Figure 20.2: two histograms of values between 0 and 0.4, with the true value e^{−µ} marked on the horizontal axis.]
Fig. 20.2. Histograms of a thousand values for S (left) and T (right).

Considering the way the values of the (biased!) estimator T are more concentrated around the true value e^{−µ} = e^{−2} = 0.1353, we would be inclined to prefer T over S. This choice is strongly supported by the fact that T is more efficient than S: MSE(T) is always smaller than MSE(S), as illustrated in Figure 20.3.

[Figure 20.3: the curves MSE(S) and MSE(T) plotted as functions of µ for 0 ≤ µ ≤ 5.]
Fig. 20.3. MSEs of S and T as a function of µ.
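The comparison behind Figures 20.2 and 20.3 can be imitated with a short simulation. The sketch below is my own Python/NumPy code (the seed and the number of repetitions are arbitrary); for µ = 2 and n = 25 it estimates MSE(S) and MSE(T) by averaging squared deviations from e^{−µ}.

    import numpy as np

    rng = np.random.default_rng(3)
    mu, n, reps = 2.0, 25, 100_000
    target = np.exp(-mu)                      # the parameter of interest, e^{-mu}

    samples = rng.poisson(lam=mu, size=(reps, n))
    s = (samples == 0).mean(axis=1)           # S: relative frequency of zeros
    t = np.exp(-samples.mean(axis=1))         # T: e^{-Xbar_n}

    print("estimated MSE of S:", np.mean((s - target) ** 2))
    print("estimated MSE of T:", np.mean((t - target) ** 2))

For these values the estimated MSE of T comes out below that of S, in agreement with Figure 20.3.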
  • 312. 20.5 Exercises 307 20.4 Solutions to the quick exercises 20.1 We have x̄5 = (61 + 19 + 56 + 24 + 16)/5 = 176/5 = 35.2. Therefore t1 = 2 · 35.2 − 1 = 69.4. 20.2 When n = N, we have drawn all the numbers. But then the largest number MN is N, and so E[MN ] = N. 20.3 We have t2 = (6/5) · 61 − 1 = 72.2. 20.4 Since 45 is in the sample, N has to be at least 45. Adding the numbers yields 7 + 3 + 10 + 15 + 45 = 80. So t1 = 2x̄n − 1 = 2 · 16 − 1 = 31. What is strange about this is that the estimate for N is far smaller than the number 45 in the sample! 20.5 Exercises 20.1 Given is a random sample X1, X2, . . . , Xn from a distribution with finite variance σ2 . We estimate the expectation of the distribution with the sample mean X̄n. Argue that the larger our sample, the more efficient our estimator. What is the relative efficiency Var X̄n /Var X̄2n of X̄2n with respect to X̄n? 20.2 Given are two estimators S and T for a parameter θ. Furthermore it is known that Var(S) = 40 and Var(T ) = 4. a. Suppose that we know that E[S] = θ and E[T ] = θ + 3. Which estimator would you prefer, and why? b. Suppose that we know that E[S] = θ and E[T ] = θ + a for some positive number a. For each a, which estimator would you prefer, and why? 20.3 Suppose we have a random sample X1, . . . , Xn from an Exp(λ) distri- bution. Suppose we want to estimate the mean 1/λ. According to Section 19.4 the estimator T1 = X̄n = 1 n (X1 + X2 + · · · + Xn) is an unbiased estimator of 1/λ. Let Mn be the minimum of X1, X2, . . . , Xn. Recall from Exercise 8.18 that Mn has an Exp(nλ) distribution. In Exer- cise 19.5 you have determined that T2 = nMn is another unbiased estimator for 1/λ. Which of the estimators T1 and T2 would you choose for estimating the mean 1/λ? Substantiate your answer.
  • 313. 308 20 Efficiency and mean squared error 20.4 Consider the situation of this chapter, where we have to estimate the parameter N from a sample x1, . . . , xn drawn without replacement from the numbers {1, . . ., N}. To keep it simple, we consider n = 2. Let M = M2 be the maximum of X1 and X2. We have found that T2 = 3M/2 − 1 is a good unbiased estimator for N. We want to construct a new unbiased estimator T3 based on the minimum L of X1 and X2. In the following you may use that the random variable L has the same distribution as the random variable N + 1 − M (this follows from symmetry considerations). a. Show that T3 = 3L − 1 is an unbiased estimator for N. b. Compute Var(T3) using that Var(M) = (N + 1)(N − 2)/18. (The latter has been computed in Remark 20.1.) c. What is the relative efficiency of T2 with respect to T3? 20.5 Someone is proposing two unbiased estimators U and V , with the same variance Var(U) = Var(V ). It therefore appears that we would not prefer one estimator over the other. However, we could go for a third estimator, namely W = (U + V )/2. Note that W is unbiased. To judge the quality of W we want to compute its variance. Lacking information on the joint probability distribution of U and V , this is impossible. However, we should prefer W in any case! To see this, show by means of the variance-of-the-sum rule that the relative efficiency of U with respect to W is equal to Var((U + V )/2) Var(U) = 1 2 + 1 2 ρ(U, V ) . Here ρ(U, V ) is the correlation coefficient. Why does this result imply that we should use W instead of U (or V )? 20.6 A geodesic engineer measures the three unknown angles α1, α2, and α3 of a triangle. He models the uncertainty in the measurements by considering them as realizations of three independent random variables T1, T2, and T3 with expectations E[T1] = α1, E[T2] = α2, E[T3] = α3, and all three with the same variance σ2 . In order to make use of the fact that the three angles must add to π, he also considers new estimators U1, U2, and U3 defined by U1 =T1 + 1 3 (π − T1 − T2 − T3), U2 =T2 + 1 3 (π − T1 − T2 − T3), U3 =T3 + 1 3 (π − T1 − T2 − T3). (Note that the “deviation” π − T1 − T2 − T3 is evenly divided over the three measurements and that U1 + U2 + U3 = π.)
  • 314. 20.5 Exercises 309 a. Compute E[U1] and Var(U1) . b. What does he gain in efficiency when he uses U1 instead of T1 to estimate the angle α1? c. What kind of estimator would you choose for α1 if it is known that the triangle is isosceles (i.e., α1 = α2)? 20.7 (Exercise 19.7 continued.) Leaves are divided into four different types: starchy-green, sugary-white, starchy-white, and sugary-green. According to genetic theory, the types occur with probabilities 1 4 (θ + 2), 1 4 θ, 1 4 (1 − θ), and 1 4 (1 − θ), respectively, where 0 θ 1. Suppose one has n leaves. Then the number of starchy-green leaves is modeled by a random variable N1 with a Bin(n, p1) distribution, where p1 = 1 4 (θ + 2), and the number of sugary-white leaves is modeled by a random variable N2 with a Bin(n, p2) distribution, where p2 = 1 4 θ. Consider the following two estimators for θ: T1 = 4 n N1 − 2 and T2 = 4 n N2. In Exercise 19.7 you showed that both T1 and T2 are unbiased estimators for θ. Which estimator would you prefer? Motivate your answer. 20.8 Let X̄n and Ȳm be the sample means of two independent random samples of size n (resp. m) from the same distribution with mean µ. We combine these two estimators to a new estimator T by putting T = rX̄n + (1 − r)Ȳm, where r is some number between 0 and 1. a. Show that T is an unbiased estimator for the mean µ. b. Show that T is most efficient when r = n/(n + m). 20.9 Given is a random sample X1, X2, . . . , Xn from a Ber(p) distribution. One considers the estimators T1 = 1 n (X1 + · · · + Xn) and T2 = min{X1, . . . , Xn}. a. Are T1 and T2 unbiased estimators for p? b. Show that MSE(T1) = 1 n p(1 − p), MSE(T2) = pn − 2pn+1 + p2 . c. Which estimator is more efficient when n = 2? 20.10 Suppose we have a random sample X1, . . . , Xn from an Exp(λ) distri- bution. We want to estimate the expectation 1/λ. According to Section 19.4,
  • 315. 310 20 Efficiency and mean squared error X̄n = 1 n (X1 + X2 + · · · + Xn) is an unbiased estimator of 1/λ. Let us consider more generally estimators T of the form T = c · (X1 + X2 + · · · + Xn) , where c is a real number. We are interested in the MSE of these estimators and would like to know whether there are choices for c that yield a smaller MSE than the choice c = 1/n. a. Compute MSE(T ) for each c. b. For which c does the estimator perform best in the MSE sense? Compare this to the unbiased estimator X̄n that one obtains for c = 1/n. 20.11 In Exercise 17.9 we modeled diameters of black cherry trees with the linear regression model (without intercept) Yi = βxi + Ui for i = 1, 2, . . ., n. As usual, the Ui here are independent random variables with E[Ui]=0, and Var(Ui) = σ2 . We considered three estimators for the slope β of the line y = βx: the so- called least squares estimator T1 (which will be considered in Chapter 22), the average slope estimator T2, and the slope of the averages estimator T3. These estimators are defined by: T1 = n i=1 xiYi n i=1 x2 i , T2 = 1 n n i=1 Yi xi , T3 = n i=1 Yi n i=1 xi . In Exercise 19.8 it was shown that all three estimators are unbiased. Compute the MSE of all three estimators. Remark: it can be shown that T1 is always more efficient than T3, which in turn is more efficient than T2. To prove the first inequality one uses a famous inequality called the Cauchy Schwartz inequality; for the second inequality one uses Jensen’s inequality (can you see how?). 20.12 Let X1, X2, . . . , Xn represent n draws without replacement from the numbers 1, 2, . . . , N with equal probability. The goal of this exercise is to compute the distribution of Mn in a way other than by the combinatorial analysis we did in this chapter. a. Compute P(Mn ≤ k), by using, as in Section 8.4, that: P(Mn ≤ k) = P(X1 ≤ k, X2 ≤ k, . . . , Xn ≤ k) .
  • 316. 20.5 Exercises 311 b. Derive that P(Mn = n) = n!(N − n)! N! . c. Show that for k = n + 1, . . . , N P(Mn = k) = n · (k − 1)! (k − n)! (N − n)! N! .
• 317. 21 Maximum likelihood

In previous chapters we could easily construct estimators for various parameters of interest because these parameters had a natural sample analogue: expectation versus sample mean, probabilities versus relative frequencies, etc. However, in some situations such an analogue does not exist. In this chapter, a general principle to construct estimators is introduced, the so-called maximum likelihood principle. Maximum likelihood estimators have certain attractive properties that are discussed in the last section.

21.1 Why a general principle?

In Section 4.4 we modeled the number of cycles up to pregnancy by a random variable X with a geometric distribution with (unknown) parameter p. Weinberg and Gladen studied the effect of smoking on the number of cycles and obtained the data in Table 21.1 for 100 smokers and 486 nonsmokers.

Table 21.1. Observed numbers of cycles up to pregnancy.

Number of cycles    1    2   3   4   5   6   7   8   9  10  11  12  >12
Smokers            29   16  17   4   3   9   4   5   1   1   1   3    7
Nonsmokers        198  107  55  38  18  22   7   9   5   3   6   6   12

Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution applied to comparative fecundability studies. Biometrics, 42(3):547–560, 1986.

Is the parameter p, which equals the probability of becoming pregnant after one cycle, different for smokers and nonsmokers? Let us try to find out by estimating p in the two cases.
  • 318. 314 21 Maximum likelihood What would be reasonable ways to estimate p? Since p = P(X = 1), the law of large numbers (see Section 13.3) motivates use of S = number of Xi equal to 1 n as an estimator for p. This yields estimates p = 29/100 = 0.29 for smokers and p = 198/486 = 0.41 for nonsmokers. We know from Section 19.4 that S is an unbiased estimator for p. However, one cannot escape the feeling that S is a “bad” estimator: S does not use all the information in the table, i.e., the way the women are distributed over the numbers 2, 3, . . . of observed numbers of cycles is not used. One would like to have an estimator that incorporates all the available information. Due to the way the data are given, this seems to be difficult. For instance, estimators based on the average cannot be evaluated, because 7 smokers and 12 nonsmokers had an unknown number of cycles up to pregnancy (larger than 12). If one simply ignores the last column in Table 21.1 as we did in Exercise 17.5, the average can be computed and yields 1/x̄93 = 0.2809 as an estimate of p for smokers and 1/x̄474 = 0.3688 for nonsmokers. However, because we discard seven values larger than 12 in case of the smokers and twelve values larger than 12 in case of the nonsmokers, we overestimate p in both cases. In the next section we introduce a general principle to find an estimate for a parameter of interest, the maximum likelihood principle. This principle yields good estimators and will solve problems such as those stated earlier. 21.2 The maximum likelihood principle Suppose a dealer of computer chips is offered on the black market two batches of 10 000 chips each. According to the seller, in one batch about 50% of the chips are defective, while this percentage is about 10% in the other batch. Our dealer is only interested in this last batch. Unfortunately the seller cannot tell the two batches apart. To help him to make up his mind, the seller offers our dealer one batch, from which he is allowed to select and test 10 chips. After selecting 10 chips arbitrarily, it turns out that only the second one is defective. Our dealer at once decides to buy this batch. Is this a wise decision? With the batch where 50% of the chips are defective it is more likely that defective chips will appear, whereas with the other batch one would expect hardly any defective chip. Clearly, our dealer chooses the batch for which it is most likely that only one chip is defective. This is also the guiding idea behind the maximum likelihood principle. The maximum likelihood principle. Given a dataset, choose the parameter(s) of interest in such a way that the data are most likely.
• 319. 21.2 The maximum likelihood principle 315

Set Ri = 1 in case the ith tested chip was defective and Ri = 0 in case it was operational, where i = 1, . . . , 10. Then R1, . . . , R10 are ten independent Ber(p) distributed random variables, where p is the probability that a randomly selected chip is defective. The probability that the observed data occur is equal to
P(R1 = 0, R2 = 1, R3 = 0, . . . , R10 = 0) = p(1 − p)⁹.
For the batch where about 10% of the chips are defective we find that
P(R1 = 0, R2 = 1, R3 = 0, . . . , R10 = 0) = (1/10)(9/10)⁹ = 0.039,
whereas for the other batch
P(R1 = 0, R2 = 1, R3 = 0, . . . , R10 = 0) = (1/2)(1/2)⁹ = 0.00098.
So the probability for the batch with only 10% defective chips is about 40 times larger than the probability for the other batch. Given the data, our dealer made a sound decision.

Quick exercise 21.1 Which batch should the dealer choose if only the first three chips are defective?

Returning to the example of the number of cycles up to pregnancy, denoting Xi as the number of cycles up to pregnancy of the ith smoker, recall that
P(Xi = k) = (1 − p)^{k−1} p and P(Xi > 12) = P(no success in cycles 1 to 12) = (1 − p)¹²;
cf. Quick exercise 4.6. From Table 21.1 we see that there are 29 smokers for which Xi = 1, that there are 16 for which Xi = 2, etc. Since we model the data as a random sample from a geometric distribution, the probability of the data, as a function of p, is given by
$$L(p) = C\cdot P(X_i=1)^{29}\cdot P(X_i=2)^{16}\cdots P(X_i=12)^{3}\cdot P(X_i>12)^{7} = C\cdot p^{29}\cdot\bigl((1-p)p\bigr)^{16}\cdots\bigl((1-p)^{11}p\bigr)^{3}\cdot\bigl((1-p)^{12}\bigr)^{7} = C\cdot p^{93}\cdot(1-p)^{322}.$$
Here C is the number of ways we can assign 29 ones, 16 twos, . . . , 3 twelves, and 7 numbers larger than 12 to 100 smokers.¹ According to the maximum likelihood principle we now choose p, with 0 ≤ p ≤ 1, in such a way that L(p) is maximal.

¹ C = 311657028822819441451842682167854800096263625208359116504431153487280760832000000000.
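Before carrying out the differentiation that follows, one can already locate the maximizing p numerically. The sketch below is my own Python/SciPy code, not the book's; it drops the constant C (which does not depend on p) and maximizes the logarithm of p⁹³(1 − p)³²², which is numerically more stable than the product itself.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def neg_log_likelihood(p):
        # logarithm of p^93 * (1-p)^322 with the constant C omitted; sign flipped for minimization
        return -(93 * np.log(p) + 322 * np.log(1 - p))

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(result.x, 93 / 415)   # both are approximately 0.224

The numerical maximizer agrees with the value 93/415 derived analytically below.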
• 320. 316 21 Maximum likelihood

Since C does not depend on p, we do not need to know the value of C explicitly to find for which p the function L(p) is maximal. Differentiating L(p) with respect to p yields
$$L'(p) = C\bigl[93p^{92}(1-p)^{322} - 322p^{93}(1-p)^{321}\bigr] = Cp^{92}(1-p)^{321}\bigl[93(1-p) - 322p\bigr] = Cp^{92}(1-p)^{321}(93 - 415p).$$
Now L'(p) = 0 if p = 0, p = 1, or p = 93/415 = 0.224, and L(p) attains its unique maximum at this last point (check this!). We say that 93/415 = 0.224 is the maximum likelihood estimate of p for the smokers. Note that this estimate is quite a lot smaller than the estimate 0.29 for the smokers we found in the previous section, and the estimate 0.2809 you obtained in Exercise 17.5.

Quick exercise 21.2 Check that for the nonsmokers the probability of the data is given by L(p) = constant · p⁴⁷⁴(1 − p)⁹⁵⁵. Compute the maximum likelihood estimate for p.

Remark 21.1 (Some history). The method of maximum likelihood estimation was propounded by Ronald Aylmer Fisher in a highly influential paper. In fact, this paper does not contain the original statement of the method, which was published by Fisher in 1912 [9], nor does it contain the original definition of likelihood, which appeared in 1921 (see [10]). The roots of the maximum likelihood method date back as far as 1713, when Jacob Bernoulli's Ars Conjectandi ([1]) was posthumously published. In the eighteenth century other important contributions were by Daniel Bernoulli, Lambert, and Lagrange (see also [2], [16], and [17]). It is interesting to remark that another giant of statistics, Karl Pearson, had not understood Fisher's method. Fisher was hurt by Pearson's lack of understanding, which eventually led to a violent confrontation.

21.3 Likelihood and loglikelihood

Suppose we have a dataset x1, x2, . . . , xn, modeled as a realization of a random sample from a distribution characterized by a parameter θ. To stress the dependence of the distribution on θ, we write pθ(x) for the probability mass function in case we have a sample from a discrete distribution and fθ(x) for the probability density function when we have a sample from a continuous distribution.
• 321. 21.3 Likelihood and loglikelihood 317

For a dataset x1, x2, . . . , xn modeled as the realization of a random sample X1, . . . , Xn from a discrete distribution, the maximum likelihood principle now tells us to estimate θ by that value for which the function L(θ), given by
L(θ) = P(X1 = x1, . . . , Xn = xn) = pθ(x1) · · · pθ(xn),
is maximal. This value is called the maximum likelihood estimate of θ. The function L(θ) is called the likelihood function. This is a function of θ, determined by the numbers x1, x2, . . . , xn.

In case the sample is from a continuous distribution we clearly need to define the likelihood function L(θ) in a way different from the discrete case (if we defined L(θ) as in the discrete case, one would always have L(θ) = 0). For a reasonable definition of the likelihood function we have the following motivation. Let fθ be the probability density function of X, and let ε > 0 be some fixed, small number. It is sensible to choose θ in such a way that the probability
P(x1 − ε ≤ X1 ≤ x1 + ε, . . . , xn − ε ≤ Xn ≤ xn + ε)
is maximal. Since the Xi are independent, we find that
$$P(x_1-\varepsilon \le X_1 \le x_1+\varepsilon, \ldots, x_n-\varepsilon \le X_n \le x_n+\varepsilon) = P(x_1-\varepsilon \le X_1 \le x_1+\varepsilon)\cdots P(x_n-\varepsilon \le X_n \le x_n+\varepsilon) \approx f_\theta(x_1)f_\theta(x_2)\cdots f_\theta(x_n)\,(2\varepsilon)^n, \qquad (21.1)$$
where in the last step we used that (see also Equation (5.1))
$$P(x_i-\varepsilon \le X_i \le x_i+\varepsilon) = \int_{x_i-\varepsilon}^{x_i+\varepsilon} f_\theta(x)\,dx \approx 2\varepsilon f_\theta(x_i).$$
Note that the right-hand side of (21.1) is maximal whenever the function fθ(x1)fθ(x2) · · · fθ(xn) is maximal, irrespective of the value of ε. In view of this, given a dataset x1, x2, . . . , xn, the likelihood function L(θ) is defined by
L(θ) = fθ(x1)fθ(x2) · · · fθ(xn)
in the continuous case.

Maximum likelihood estimates. The maximum likelihood estimate of θ is the value t = h(x1, x2, . . . , xn) that maximizes the likelihood function L(θ). The corresponding random variable T = h(X1, X2, . . . , Xn) is called the maximum likelihood estimator for θ.
• 322. 318 21 Maximum likelihood

As an example, suppose we have a dataset x1, x2, . . . , xn modeled as a realization of a random sample from an Exp(λ) distribution, with probability density function given by fλ(x) = 0 if x < 0 and fλ(x) = λe^{−λx} for x ≥ 0. Then the likelihood is given by
$$L(\lambda) = f_\lambda(x_1)f_\lambda(x_2)\cdots f_\lambda(x_n) = \lambda e^{-\lambda x_1}\cdot\lambda e^{-\lambda x_2}\cdots\lambda e^{-\lambda x_n} = \lambda^n\cdot e^{-\lambda(x_1+x_2+\cdots+x_n)}.$$
To obtain the maximum likelihood estimate of λ it is enough to find the maximum of L(λ). To do so, we determine the derivative of L(λ):
$$\frac{d}{d\lambda}L(\lambda) = n\lambda^{n-1}e^{-\lambda\sum_{i=1}^{n}x_i} - \lambda^n\Bigl(\sum_{i=1}^{n}x_i\Bigr)e^{-\lambda\sum_{i=1}^{n}x_i} = n\,\lambda^{n-1}e^{-\lambda\sum_{i=1}^{n}x_i}\Bigl(1 - \frac{\lambda}{n}\sum_{i=1}^{n}x_i\Bigr).$$
We see that dL(λ)/dλ = 0 if and only if 1 − λx̄n = 0, i.e., if λ = 1/x̄n. Check that for this value of λ the likelihood function L(λ) attains a maximum! So the maximum likelihood estimator for λ is 1/X̄n.

In the example of the number of cycles up to pregnancy of smoking women, we have seen that L(p) = C · p⁹³ · (1 − p)³²². The maximum likelihood estimate of p was found by differentiating L(p). Differentiating is not always possible, as the following example shows.

Estimating the upper endpoint of a uniform distribution

Suppose the dataset x1 = 0.98, x2 = 1.57, and x3 = 0.31 is the realization of a random sample from a U(0, θ) distribution with θ > 0 unknown. The probability density function of each Xi is now given by fθ(x) = 0 if x is not in [0, θ] and fθ(x) = 1/θ for 0 ≤ x ≤ θ. The likelihood L(θ) is zero if θ is smaller than at least one of the xi, and equals 1/θ³ if θ is greater than or equal to each of the three xi, i.e.,
L(θ) = fθ(x1)fθ(x2)fθ(x3) = 1/θ³ if θ ≥ max(x1, x2, x3) = 1.57, and L(θ) = 0 if θ < max(x1, x2, x3) = 1.57.
• 323. 21.3 Likelihood and loglikelihood 319

[Figure 21.1 shows the graph of L(θ): it is zero up to θ = 1.57 and follows the curve L(θ) = 1/θ³ from there on; the observations 0.31, 0.98, and 1.57 are marked on the horizontal axis.]
Fig. 21.1. Likelihood function L(θ) of a sample from a U(0, θ) distribution.

Figure 21.1 depicts this likelihood function. One glance at this figure is enough to realize that L(θ) attains its maximum at max(x1, x2, x3) = 1.57. In general, given a dataset x1, x2, . . . , xn originating from a U(0, θ) distribution, we see that L(θ) = 0 if θ is smaller than at least one of the xi and that L(θ) = 1/θⁿ if θ is greater than or equal to the largest of the xi. We conclude that the maximum likelihood estimator of θ is given by max{X1, X2, . . . , Xn}.

Loglikelihood

In the preceding example it was easy to find the value of the parameter for which the likelihood is maximal. Usually one can find the maximum by differentiating the likelihood function L(θ). The calculation of the derivative of L(θ) may be tedious, because L(θ) is a product of terms, all involving θ (see also Quick exercise 21.3). To differentiate L(θ) we have to apply the product rule from calculus. Considering the logarithm of L(θ) changes the product of the terms involving θ into a sum of logarithms of these terms, which makes the process of differentiating easier. Moreover, because the logarithm is an increasing function, the likelihood function L(θ) and the loglikelihood function ℓ(θ), defined by ℓ(θ) = ln(L(θ)), attain their extreme values for the same values of θ. In particular, L(θ) is maximal if and only if ℓ(θ) is maximal. This is illustrated in Figure 21.2 by the likelihood function L(p) = Cp⁹³(1 − p)³²² and the loglikelihood function ℓ(p) = ln(C) + 93 ln(p) + 322 ln(1 − p) for the smokers.

In the situation that we have a dataset x1, x2, . . . , xn modeled as a realization of a random sample from an Exp(λ) distribution, we found as likelihood function L(λ) = λⁿ · e^{−λ(x1+x2+···+xn)}. Therefore, the loglikelihood function is given by
ℓ(λ) = n ln(λ) − λ(x1 + x2 + · · · + xn).
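As a numerical cross-check of the exponential example, one can maximize ℓ(λ) directly and compare the result with the closed-form estimate 1/x̄n. The sketch below is my own Python/SciPy code, not part of the book; the sample is simulated with an arbitrary λ and seed.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(4)
    data = rng.exponential(scale=1 / 0.7, size=200)   # simulated Exp(lambda) sample, lambda = 0.7

    def neg_loglikelihood(lam):
        # minus ell(lambda) = -(n ln(lambda) - lambda * sum of the observations)
        return -(len(data) * np.log(lam) - lam * data.sum())

    result = minimize_scalar(neg_loglikelihood, bounds=(1e-6, 100.0), method="bounded")
    print(result.x, 1 / data.mean())   # the numerical maximizer equals 1/xbar_n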
• 324. 320 21 Maximum likelihood

[Figure 21.2: the left panel shows L(p) for the smokers, with values of the order 10⁻¹³ and a peak at p = 93/415; the right panel shows ℓ(p), which rises to about −28.5 at p = 93/415.]
Fig. 21.2. The graphs of the likelihood function L(p) and the loglikelihood function ℓ(p) for the smokers.

Quick exercise 21.3 In this example, use the loglikelihood function ℓ(λ) to show that the maximum likelihood estimate of λ equals 1/x̄n.

Estimating the parameters of the normal distribution

Suppose that the dataset x1, x2, . . . , xn is a realization of a random sample from an N(µ, σ²) distribution, with µ and σ unknown. What are the maximum likelihood estimates for µ and σ? In this case θ is the vector (µ, σ), and therefore the likelihood function is a function of two variables:
L(µ, σ) = fµ,σ(x1)fµ,σ(x2) · · · fµ,σ(xn),
where each fµ,σ(x) is the N(µ, σ²) probability density function:
$$f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \qquad -\infty < x < \infty.$$
Since
$$\ln\bigl(f_{\mu,\sigma}(x)\bigr) = -\ln(\sigma) - \ln(\sqrt{2\pi}) - \frac{1}{2}\Bigl(\frac{x-\mu}{\sigma}\Bigr)^2,$$
one finds that
$$\ell(\mu,\sigma) = \ln\bigl(f_{\mu,\sigma}(x_1)\bigr) + \cdots + \ln\bigl(f_{\mu,\sigma}(x_n)\bigr) = -n\ln(\sigma) - n\ln(\sqrt{2\pi}) - \frac{1}{2\sigma^2}\Bigl[(x_1-\mu)^2 + \cdots + (x_n-\mu)^2\Bigr].$$
The partial derivatives of ℓ are
• 325. 21.4 Properties of maximum likelihood estimators 321

$$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\Bigl[(x_1-\mu) + (x_2-\mu) + \cdots + (x_n-\mu)\Bigr] = \frac{n}{\sigma^2}\bigl(\bar{x}_n - \mu\bigr),$$
$$\frac{\partial\ell}{\partial\sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\Bigl[(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_n-\mu)^2\Bigr] = -\frac{n}{\sigma^3}\Bigl[\sigma^2 - \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2\Bigr].$$
Solving ∂ℓ/∂µ = 0 and ∂ℓ/∂σ = 0 yields
$$\mu = \bar{x}_n \quad\text{and}\quad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x}_n)^2}.$$
It is not hard to show that for these values of µ and σ the likelihood function L(µ, σ) attains a maximum. We find that x̄n is the maximum likelihood estimate for µ and that √((1/n) Σᵢ (xi − x̄n)²) is the maximum likelihood estimate for σ (a small numerical check of these two estimates is sketched below).

21.4 Properties of maximum likelihood estimators

Apart from the fact that the maximum likelihood principle provides a general principle to construct estimators, one can also show that maximum likelihood estimators have several desirable properties.

Invariance principle

In the previous example, we saw that
$$D_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(X_i-\bar{X}_n\bigr)^2}$$
is the maximum likelihood estimator for the parameter σ of an N(µ, σ²) distribution. Does this imply that Dn² is the maximum likelihood estimator for σ²? This is indeed the case! In general one can show that if T is the maximum likelihood estimator of a parameter θ and g(θ) is an invertible function of θ, then g(T) is the maximum likelihood estimator for g(θ).
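As announced above, the closed-form estimates for the normal model can be compared with a direct numerical maximization of ℓ(µ, σ). The sketch is my own Python/SciPy code with a simulated sample (the true µ, σ, sample size, and seed are arbitrary choices), not part of the book.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(5)
    data = rng.normal(loc=3.0, scale=2.0, size=500)   # simulated N(mu, sigma^2) sample

    def neg_loglikelihood(params):
        mu, sigma = params
        # minus ell(mu, sigma), summed over the observations
        return -np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi)
                       - 0.5 * ((data - mu) / sigma) ** 2)

    result = minimize(neg_loglikelihood, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
    mu_hat, sigma_hat = result.x
    print(mu_hat, data.mean())                                      # both equal the sample mean
    print(sigma_hat, np.sqrt(np.mean((data - data.mean()) ** 2)))   # both equal the MLE for sigma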
  • 326. 322 21 Maximum likelihood Asymptotic unbiasedness The maximum likelihood estimator T may be biased. For example, because D2 n = n−1 n S2 n, for the previously mentioned maximum likelihood estimator D2 n of the parameter σ2 of an N(µ, σ2 ) distribution, it follows from Section 19.4 that E D2 n = E n − 1 n S2 n = n − 1 n E S2 n = n − 1 n σ2 . We see that D2 n is a biased estimator for σ2 , but also that as n goes to infinity, the expected value of D2 n converges to σ2 . This holds more generally. Under mild conditions on the distribution of the random variables Xi under consideration (see, e.g., [36]), one can show that asymptotically (that is, as the size n of the dataset goes to infinity) maximum likelihood estimators are unbiased. By this we mean that if Tn = h(X1, X2, . . . , Xn) is the maximum likelihood estimator for a parameter θ, then lim n→∞ E[Tn] = θ. Asymptotic minimum variance The variance of an unbiased estimator for a parameter θ is always larger than or equal to a certain positive number, known as the Cramér-Rao lower bound (see Remark 20.2). Again under mild conditions one can show that maxi- mum likelihood estimators have asymptotically the smallest variance among unbiased estimators. That is, asymptotically the variance of the maximum likelihood estimator for a parameter θ attains the Cramér-Rao lower bound. 21.5 Solutions to the quick exercises 21.1 In the case that only the first three chips are defective, the probability that the observed data occur is equal to P(R1 = 1, R2 = 1, R3 = 1, R4 = 0, . . . , R10 = 0) = p3 (1 − p)7 . For the batch where about 10% of the chips are defective we find that P(R1 = 1, R2 = 1, R3 = 1, R4 = 0, . . . , R10 = 0) = 1 10 3 9 10 7 = 0.00048, whereas for the other batch this probability is equal to 1 2 31 2 7 = 0.00098. So the probability for the batch with about 50% defective chips is about 2 times larger than the probability for the other batch. In view of this, it would be reasonable to choose the other batch, not the tested one.
  • 327. 21.6 Exercises 323 21.2 From Table 21.1 we derive L(p) = constant · P(Xi = 1)198 P(Xi = 2)107 · · · P(Xi = 12)6 P(Xi 12)12 = constant · p198 · [(1 − p)p] 107 · · · (1 − p)11 p 6 · (1 − p)12 12 = constant · p474 · (1 − p)955 . Here the constant is the number of ways we can assign 198 ones, 107 twos, . . . , 6 twelves, and 12 numbers larger than 12 to 486 nonsmokers. Differentiating L(p) with respect to p yields that L (p) = constant · 474p473 (1 − p)955 − 955p474 (1 − p)954 = constant · p473 (1 − p)954 [474(1 − p) − 955p] = constant · p473 (1 − p)954 (474 − 1429p). Now L (p) = 0 if p = 0, p = 1, or p = 474/1429 = 0.33, and L(p) attains its unique maximum in this last point. 21.3 The loglikelihood function L(λ) has derivative (λ) = n λ − (x1 + x2 + · · · + xn) = n 1 λ − x̄n . One finds that (λ) = 0 if and only if λ = 1/x̄n and that this is a maximum. The maximum likelihood estimate for λ is therefore 1/x̄n. 21.6 Exercises 21.1 Consider the following situation. Suppose we have two fair dice, D1 with 5 red sides and 1 white side and D2 with 1 red side and 5 white sides. We pick one of the dice randomly, and throw it repeatedly until red comes up for the first time. With the same die this experiment is repeated two more times. Suppose the following happens: First experiment: first red appears in 3rd throw Second experiment: first red appears in 5th throw Third experiment: first red appears in 4th throw. Show that for die D1 this happens with probability 5.7424 · 10−8 , and for die D2 the probability with which this happens is 8.9725 · 10−4 . Given these probabilities, which die do you think we picked? 21.2 We throw an unfair coin repeatedly until heads comes up for the first time. We repeat this experiment three times (with the same coin) and obtain the following data:
  • 328. 324 21 Maximum likelihood First experiment: heads first comes up in 3rd throw Second experiment: heads first comes up in 5th throw Third experiment: heads first comes up in 4th throw. Let p be the probability that heads comes up in a throw with this coin. Determine the maximum likelihood estimate p̂ of p. 21.3 In Exercise 17.4 we modeled the hits of London by flying bombs by a Poisson distribution with parameter µ. a. Use the data from Exercise 17.4 to find the maximum likelihood estimate of µ. b. Suppose the summarized data from Exercise 17.4 got corrupted in the following way: Number of hits 0 or 1 2 3 4 5 6 7 Number of squares 440 93 35 7 0 0 1 Using this new data, what is the maximum likelihood estimate of µ? 21.4 In Section 19.1, we considered the arrivals of packages at a network server, where we modeled the number of arrivals per minute by a Pois(µ) distribution. Let x1, x2, . . . , xn be a realization of a random sample from a Pois(µ) distribution. We saw on page 286 that a natural estimate of the probability of zeros in the dataset is given by number of xi equal to zero n . a. Show that the likelihood L(µ) is given by L(µ) = e−nµ x1! · · · xn! µx1+x2+···+xn . b. Determine the loglikelihood (µ) and the formula of the maximum likeli- hood estimate for µ. c. What is the maximum likelihood estimate for the probability e−µ of zero arrivals? 21.5 Suppose that x1, x2, . . . , xn is a dataset, which is a realization of a random sample from a normal distribution. a. Let the probability density of this normal distribution be given by fµ(x) = 1 √ 2π e− 1 2 (x−µ)2 for −∞ x ∞. Determine the maximum likelihood estimate for µ.
  • 329. 21.6 Exercises 325 b. Now suppose that the density of this normal distribution is given by fσ(x) = 1 σ √ 2π e− 1 2 x2 /σ2 for −∞ x ∞. Determine the maximum likelihood estimate for σ. 21.6 Let x1, x2, . . . , xn be a dataset that is a realization of a random sample from a distribution with probability density fδ(x) given by fδ(x) = e−(x−δ) for x ≥ δ 0 for x δ. a. Draw the likelihood L(δ). b. Determine the maximum likelihood estimate for δ. 21.7 Suppose that x1, x2, . . . , xn is a dataset, which is a realization of a ran- dom sample from a Rayleigh distribution, which is a continuous distribution with probability density function given by fθ(x) = x θ2 e− 1 2 x2 /θ2 for x ≥ 0. In this case what is the maximum likelihood estimate for θ? 21.8 (Exercises 19.7 and 20.7 continued) A certain type of plant can be di- vided into four types: starchy-green, starchy-white, sugary-green, and sugary- white. The following table lists the counts of the various types among 3839 leaves. Type Count Starchy-green 1997 Sugary-white 32 Starchy-white 906 Sugary-green 904 Setting X = ⎧ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎩ 1 if the observed leave is of type starchy-green 2 if the observed leave is of type sugary-white 3 if the observed leave is of type starchy-white 4 if the observed leave is of type sugary-green, the probability mass function p of X is given by a 1 2 3 4 p(a) 1 4 (2 + θ) 1 4 θ 1 4 (1 − θ) 1 4 (1 − θ)
  • 330. 326 21 Maximum likelihood and p(a) = 0 for all other a. Here 0 θ 1 is an unknown parameter, which was estimated in Exercise 19.7. We want to find a maximum likelihood estimate of θ. a. Use the data to find the likelihood L(θ) and the loglikelihood (θ). b. What is the maximum likelihood estimate of θ using the data from the preceding table? c. Suppose that we have the counts of n different leaves: n1 of type starchy- green, n2 of type sugary-white, n3 of type starchy-white, and n4 of type sugary-green (so n = n1 + n2 + n3 + n4). Determine the general formula for the maximum likelihood estimate of θ. 21.9 Let x1, x2, . . . , xn be a dataset that is a realization of a random sample from a U(α, β) distribution (with α and β unknown, α β). Determine the maximum likelihood estimates for α and β. 21.10 Let x1, x2, . . . , xn be a dataset, which is a realization of a random sample from a Par(α) distribution. What is the maximum likelihood estimate for α? 21.11 In Exercise 4.13 we considered the situation where we have a box containing an unknown number—say N—of identical bolts. In order to get an idea of the size of N we introduced three random variables X, Y , and Z. Here we will use X and Y , and in the next exercise Z, to find maximum likelihood estimates of N. a. Suppose that x1, x2, . . . , xn is a dataset, which is a realization of a random sample from a Geo(1/N) distribution. Determine the maximum likelihood estimate for N. b. Suppose that y1, y2, . . . , yn is a dataset, which is a realization of a random sample from a discrete uniform distribution on 1, 2, . . . , N. Determine the maximum likelihood estimate for N. 21.12 (Exercise 21.11 continued.) Suppose that m bolts in the box were marked and then r bolts were selected from the box; Z is the number of marked bolts in the sample. (Recall that it was shown in Exercise 4.13 c that Z has a hypergeometric distribution, with parameters m, N, and r.) Suppose that k bolts in the sample were marked. Show that the likelihood L(N) is given by L(N) = m k N−m r−k N r . Next show that L(N) increases for N mr/k and decreases for N mr/k, and conclude that mr/k is the maximum likelihood estimate for N. 21.13 Often one can model the times that customers arrive at a shop rather well by a Poisson process with (unknown) rate λ (customers/hour). On a certain day, one of the attendants noticed that between noon and 12.45 p.m.
  • 331. 21.6 Exercises 327 two customers arrived, and another attendant noticed that on the same day one customer arrived between 12.15 and 1 p.m. Use the observations of the attendants to determine the maximum likelihood estimate of λ. 21.14 A very inexperienced archer shoots n times an arrow at a disc of (un- known) radius θ. The disc is hit every time, but at completely random places. Let r1, r2, . . . , rn be the distances of the various hits to the center of the disc. Determine the maximum likelihood estimate for θ. 21.15 On January 28, 1986, the main fuel tank of the space shuttle Challenger exploded shortly after takeoff. Essential in this accident was the leakage of some of the six O-rings of the Challenger. In Section 1.4 the probability of failure of an O-ring is given by p(t) = ea+b·t 1 + ea+b·t , where t is the temperature at launch in degrees Fahrenheit. In Table 21.2 the temperature t (in ◦ F, rounded to the nearest integer) and the number of failures N for 23 missions are given, ordered according to increasing temper- atures. (See also Figure 1.3, where these data are graphically depicted.) Give the likelihood L(a, b) and the loglikelihood (a, b). Table 21.2. Space shuttle failure data of pre-Challenger missions. t 53 57 58 63 66 67 67 67 N 2 1 1 1 0 0 0 0 t 68 69 70 70 70 70 72 73 N 0 0 0 0 1 1 0 0 t 75 75 76 76 78 79 81 N 0 2 0 0 0 0 0 21.16 In the 18th century Georges-Louis Leclerc, Comte de Buffon (1707– 1788) found an amusing way to approximate the number π using probability theory and statistics. Buffon had the following idea: take a needle and a large sheet of paper, and draw horizontal lines that are a needle-length apart. Throw the needle a number of times (say n times) on the sheet, and count how often it hits one of the horizontal lines. Say this number is sn, then sn is the realization of a Bin(n, p) distributed random variable Sn. Here p is the probability that the needle hits one of the horizontal lines. In Exercise 9.20 you found that p = 2/π. Show that T = 2n Sn is the maximum likelihood estimator for π.
• 332. 22 The method of least squares

The maximum likelihood principle provides a way to estimate parameters. The applicability of the method is quite general but not universal. For example, in the simple linear regression model, introduced in Section 17.4, we need to know the distribution of the response variable in order to find the maximum likelihood estimates for the parameters involved. In this chapter we will see how these parameters can be estimated using the method of least squares. Furthermore, the relation between least squares and maximum likelihood will be investigated in the case of normally distributed errors.

22.1 Least squares estimation and regression

Recall from Section 17.4 the simple linear regression model for a bivariate dataset (x1, y1), (x2, y2), . . . , (xn, yn). In this model x1, x2, . . . , xn are nonrandom and y1, y2, . . . , yn are realizations of random variables Y1, Y2, . . . , Yn satisfying
Yi = α + βxi + Ui for i = 1, 2, . . . , n,
where U1, U2, . . . , Un are independent random variables with zero expectation and variance σ². How can one obtain estimates for the parameters α, β, and σ² in this model? Note that we cannot find maximum likelihood estimates for these parameters, simply because we have no further knowledge about the distribution of the Ui (and consequently of the Yi). We want to choose α and β in such a way that we obtain a line that fits the data best. A classical approach to do this is to consider the sum of squared distances between the observed values yi and the values α + βxi on the regression line y = α + βx. See Figure 22.1, where these distances are indicated. The method of least squares prescribes to choose α and β such that the sum of squares
$$S(\alpha,\beta) = \sum_{i=1}^{n}\bigl(y_i - \alpha - \beta x_i\bigr)^2$$
is minimal.
  • 333. 330 22 The method of least squares
Fig. 22.1. The observed value yi corresponding to xi and the value α + βxi on the regression line y = α + βx.
is minimal. The ith term in the sum is the squared distance in the vertical direction from (xi, yi) to the line y = α + βx. To find these so-called least squares estimates, we differentiate S(α, β) with respect to α and β, and we set the derivatives equal to 0:
∂S(α, β)/∂α = 0  ⇔  Σ_{i=1}^n (yi − α − βxi) = 0,
∂S(α, β)/∂β = 0  ⇔  Σ_{i=1}^n (yi − α − βxi) xi = 0.
This is equivalent to
nα + β Σ_{i=1}^n xi = Σ_{i=1}^n yi,
α Σ_{i=1}^n xi + β Σ_{i=1}^n xi² = Σ_{i=1}^n xiyi.
For example, for the timber data from Table 15.5 we would obtain
36 α + 1646.4 β = 52 901,
1646.4 α + 81750.02 β = 2 790 525.
These are two equations with two unknowns α and β. Solving for α and β yields the solutions α̂ = −1160.5 and β̂ = 57.51. In Figure 22.2 a scatterplot of the timber dataset, together with the estimated regression line y = −1160.5 + 57.51x, is depicted. Quick exercise 22.1 Suppose you are given a piece of Australian timber with density 65. What would you choose as an estimate for the Janka hardness?
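As a numerical aside (not part of the book), the two normal equations for the timber sums above can be solved with a few lines of NumPy; up to rounding this reproduces α̂ = −1160.5 and β̂ = 57.51.

```python
import numpy as np

# Normal equations for the timber data (sums quoted in the text):
#   n*alpha       + (sum x)*beta   = sum y
#   (sum x)*alpha + (sum x^2)*beta = sum xy
A = np.array([[36.0, 1646.4],
              [1646.4, 81750.02]])
b = np.array([52901.0, 2790525.0])

alpha_hat, beta_hat = np.linalg.solve(A, b)
print(alpha_hat, beta_hat)   # approximately -1160.5 and 57.51
```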
  • 334. 22.1 Least squares estimation and regression 331
Fig. 22.2. Scatterplot and estimated regression line for the timber data (wood density on the horizontal axis, hardness on the vertical axis).
In general, writing Σ instead of Σ_{i=1}^n, we find the following formulas for the estimates α̂ (the intercept) and β̂ (the slope):
β̂ = (n Σ xiyi − (Σ xi)(Σ yi)) / (n Σ xi² − (Σ xi)²)   (22.1)
α̂ = ȳn − β̂ x̄n.   (22.2)
Since S(α, β) is an elliptic paraboloid (a “vase”), it follows that (α̂, β̂) is the unique minimum of S(α, β) (except when all xi are equal). Quick exercise 22.2 Check that the line y = α̂ + β̂x always passes through the “center of gravity” (x̄n, ȳn). Least squares estimators are unbiased We denote the least squares estimates by α̂ and β̂. It is quite common to also denote the least squares estimators by α̂ and β̂:
α̂ = Ȳn − β̂ x̄n,
β̂ = (n Σ xiYi − (Σ xi)(Σ Yi)) / (n Σ xi² − (Σ xi)²).
In Exercise 22.12 it is shown that β̂ is an unbiased estimator for β. Using this and the fact that E[Yi] = α + βxi (see page 258), we find for α̂:
E[α̂] = E[Ȳn] − x̄n E[β̂] = (1/n) Σ_{i=1}^n E[Yi] − x̄n β = (1/n) Σ_{i=1}^n (α + βxi) − x̄n β = α + β x̄n − x̄n β = α.
We see that α̂ is an unbiased estimator for α.
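A minimal sketch (assuming NumPy; the small dataset is made up for illustration, not taken from the book) of formulas (22.1) and (22.2):

```python
import numpy as np

def least_squares(x, y):
    """Slope and intercept from formulas (22.1) and (22.2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    beta = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    alpha = np.mean(y) - beta * np.mean(x)
    return alpha, beta

# Illustrative (hypothetical) bivariate dataset:
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.8, 5.2, 5.9]
print(least_squares(x, y))
```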
  • 335. 332 22 The method of least squares An unbiased estimator for σ² In the simple linear regression model the assumptions imply that the random variables Yi are independent with variance σ². Unfortunately, one cannot apply the usual estimator (1/(n − 1)) Σ_{i=1}^n (Yi − Ȳn)² for the variance of the Yi (see Section 19.4), because different Yi have different expectations. What would be a reasonable estimator for σ²? The following quick exercise suggests a candidate. Quick exercise 22.3 Let U1, U2, . . . , Un be independent random variables, each with expected value zero and variance σ². Show that T = (1/n) Σ_{i=1}^n Ui² is an unbiased estimator for σ². At first sight one might be tempted to think that the unbiased estimator T from this quick exercise is a useful tool to estimate σ². Unfortunately, we only observe the xi and Yi, not the Ui. However, from the fact that Ui = Yi − α − βxi, it seems reasonable to try
(1/n) Σ_{i=1}^n (Yi − α̂ − β̂xi)²   (22.3)
as an estimator for σ². Tedious calculations show that the expected value of this random variable equals ((n − 2)/n) σ². But then we can easily turn it into an unbiased estimator for σ².
An unbiased estimator for σ². In the simple linear regression model the random variable
σ̂² = (1/(n − 2)) Σ_{i=1}^n (Yi − α̂ − β̂xi)²
is an unbiased estimator for σ².
22.2 Residuals A way to explore whether the simple linear regression model is appropriate to model a given bivariate dataset is to inspect a scatterplot of the so-called residuals ri against the xi. The ith residual ri is defined as the vertical distance between the ith point and the estimated regression line:
ri = yi − α̂ − β̂xi,  i = 1, 2, . . . , n.
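Continuing the illustrative sketch above (hypothetical data, not from the book), the residuals and the unbiased variance estimate σ̂² can be computed as follows; note that the residuals sum to (numerically) zero.

```python
import numpy as np

# Illustrative data; alpha_hat and beta_hat as in (22.1)-(22.2).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
n = len(x)
beta_hat = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
alpha_hat = y.mean() - beta_hat * x.mean()

r = y - alpha_hat - beta_hat * x      # residuals r_i
print(r.sum())                        # numerically zero
print(np.sum(r**2) / (n - 2))         # unbiased estimate of sigma^2
```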
  • 336. 22.2 Residuals 333 When a linear model is appropriate, the scatterplot of the residuals ri against the xi should show truly random fluctuations around zero, in the sense that it should not exhibit any trend or pattern. This seems to be the case in Figure 22.3, which shows the residuals for the black cherry tree data from Exercise 17.9.
Fig. 22.3. Scatterplot of ri versus xi for the black cherry tree data.
Quick exercise 22.4 Recall from Quick exercise 22.2 that (x̄n, ȳn) is on the regression line y = α̂ + β̂x, i.e., that ȳn = α̂ + β̂x̄n. Use this to show that Σ_{i=1}^n ri = 0, i.e., that the sum of the residuals is zero.
In Figure 22.4 we depicted ri versus xi for the timber dataset. In this case a slight parabolic pattern can be observed. Figures 22.2 and 22.4 suggest that
Fig. 22.4. Scatterplot of ri versus xi for the timber data with the simple linear regression model Yi = α + βxi + Ui.
  • 337. 334 22 The method of least squares for the timber dataset a better model might be
Yi = α + βxi + γxi² + Ui  for i = 1, 2, . . . , n.
In this new model the residuals are ri = yi − α̂ − β̂xi − γ̂xi², where α̂, β̂, and γ̂ are the least squares estimates obtained by minimizing
Σ_{i=1}^n (yi − α − βxi − γxi²)².
In Figure 22.5 we depicted ri versus xi. The residuals display no trend or pattern, except that they “fan out,” an example of a phenomenon called heteroscedasticity.
Fig. 22.5. Scatterplot of ri versus xi for the timber data with the model Yi = α + βxi + γxi² + Ui.
Heteroscedasticity The assumption of equal variance of the Ui (and therefore of the Yi) is called homoscedasticity. In case the variance of Yi depends on the value of xi, we speak of heteroscedasticity. For instance, heteroscedasticity occurs when Yi with a large expected value have a larger variance than those with small expected values. This produces a “fanning out” effect, which can be observed in Figure 22.5. This figure strongly suggests that the timber data are heteroscedastic. Possible ways out of this problem are a technique called weighted least squares or the use of variance-stabilizing transformations.
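A sketch of least squares for the quadratic model Yi = α + βxi + γxi² + Ui, assuming NumPy and using an explicit design matrix; the data values below are made up for illustration and are not the timber data.

```python
import numpy as np

# Least squares for y = alpha + beta*x + gamma*x^2 via a design matrix.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 3.5, 5.8, 8.4, 12.1])   # illustrative values

X = np.column_stack([np.ones_like(x), x, x**2])  # columns: 1, x, x^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizes the sum of squares
alpha_hat, beta_hat, gamma_hat = coef
print(alpha_hat, beta_hat, gamma_hat)
```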
  • 338. 22.3 Relation with maximum likelihood 335 22.3 Relation with maximum likelihood To apply the method of least squares no assumption is needed about the type of distribution of the Ui. In case the type of distribution of the Ui is known, the maximum likelihood principle can be applied. Consider, for instance, the classical situation where the Ui are independent with an N(0, σ²) distribution. What are the maximum likelihood estimates for α and β? In this case the Yi are independent, and Yi has an N(α + βxi, σ²) distribution. Under these assumptions and assuming that the linear model is appropriate to model a given bivariate dataset, the ri should look like the realization of a random sample from a normal distribution. As an example a histogram of the residuals ri of the cherry tree data of Exercise 17.9 is depicted in Figure 22.6.
Fig. 22.6. Histogram of the residuals ri for the black cherry tree data.
The data do not exhibit strong evidence against the assumption of normality. When Yi has an N(α + βxi, σ²) distribution, the probability density of Yi is given by
fi(y) = (1/(σ√(2π))) e^(−(y − α − βxi)²/(2σ²))  for −∞ < y < ∞.
Since ln(fi(yi)) = −ln(σ) − ln(√(2π)) − ½ ((yi − α − βxi)/σ)², the loglikelihood is:
ℓ(α, β, σ) = ln(f1(y1)) + · · · + ln(fn(yn)) = −n ln(σ) − n ln(√(2π)) − (1/(2σ²)) Σ_{i=1}^n (yi − α − βxi)².
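A small numerical check (assuming NumPy; data made up for illustration) of the point made next: for fixed σ, the loglikelihood ℓ(α, β, σ) is largest exactly where the sum of squares is smallest, so moving away from the least squares estimates lowers it.

```python
import numpy as np

def loglik(alpha, beta, sigma, x, y):
    """Normal-errors loglikelihood l(alpha, beta, sigma) as in the text."""
    rss = np.sum((y - alpha - beta * x)**2)
    n = len(x)
    return -n * np.log(sigma) - n * np.log(np.sqrt(2 * np.pi)) - rss / (2 * sigma**2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
n = len(x)
beta_hat = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
alpha_hat = y.mean() - beta_hat * x.mean()

print(loglik(alpha_hat, beta_hat, 1.0, x, y))        # largest value for sigma = 1
print(loglik(alpha_hat + 0.5, beta_hat, 1.0, x, y))  # strictly smaller
```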
  • 339. 336 22 The method of least squares Note that for any fixed σ > 0, the loglikelihood ℓ(α, β, σ) attains its maximum precisely when Σ_{i=1}^n (yi − α − βxi)² is minimal. Hence, in case the Ui are independent with an N(0, σ²) distribution, the maximum likelihood principle and the least squares method yield the same estimators. To find the maximum likelihood estimate of σ we differentiate ℓ(α, β, σ) with respect to σ:
∂ℓ(α, β, σ)/∂σ = −n/σ + (1/σ³) Σ_{i=1}^n (yi − α − βxi)².
It follows (from the invariance principle on page 321) that the maximum likelihood estimator of σ² is given by
(1/n) Σ_{i=1}^n (Yi − α̂ − β̂xi)²,
which is the estimator from (22.3).
22.4 Solutions to the quick exercises
22.1 We can use the estimated regression line y = −1160.5 + 57.51x to predict the Janka hardness. For density x = 65 we find as a prediction for the Janka hardness y = 2577.65.
22.2 Rewriting α̂ = ȳn − β̂x̄n, it follows that ȳn = α̂ + β̂x̄n, which means that (x̄n, ȳn) is a point on the estimated regression line y = α̂ + β̂x.
22.3 We need to show that E[T] = σ². Since E[Ui] = 0, Var(Ui) = E[Ui²], so that:
E[T] = E[(1/n) Σ_{i=1}^n Ui²] = (1/n) Σ_{i=1}^n E[Ui²] = (1/n) Σ_{i=1}^n Var(Ui) = σ².
22.4 Since ri = yi − (α̂ + β̂xi) for i = 1, 2, . . . , n, it follows that the sum of the residuals equals
Σ ri = Σ yi − (nα̂ + β̂ Σ xi) = nȳn − nα̂ − nβ̂x̄n = n(ȳn − (α̂ + β̂x̄n)) = 0,
because ȳn = α̂ + β̂x̄n, according to Quick exercise 22.2.
  • 340. 22.5 Exercises 337 22.5 Exercises 22.1 Consider the following bivariate dataset: (1, 2) (3, 1.8) (5, 1). a. Determine the least squares estimates α̂ and β̂ of the parameters of the regression line y = α + βx. b. Determine the residuals r1, r2, and r3 and check that they add up to 0. c. Draw in one figure the scatterplot of the data and the estimated regression line y = α̂ + β̂x. 22.2 Adding one point may dramatically change the estimates of α and β. Suppose one extra datapoint is added to the dataset of the previous exercise and that we have as dataset: (0, 0) (1, 2) (3, 1.8) (5, 1). Determine the least squares estimate of β̂. A point such as (0, 0), which dra- matically changes the estimates for α and β, is called a leverage point. 22.3 Suppose we have the following bivariate dataset: (1, 3.1) (1.7, 3.9) (2.1, 3.8) (2.5, 4.7) (2.7, 4.5). a. Determine the least squares estimates α̂ and β̂ of the parameters of the regression line y = α + βx. You may use that xi = 10, yi = 20, x2 i = 21.84, and xiyi = 41.61. b. Draw in one figure the scatterplot of the data and the estimated regression line y = α̂ + β̂x. 22.4 We are given a bivariate dataset (x1, y1), (x2, y2), . . . , (x100, y100). For this bivariate dataset it is known that xi = 231.7, x2 i = 2400.8, yi = 321, and xiyi = 5189. What are the least squares estimates α̂ and β̂ of the parameters of the regression line y = α + βx? 22.5 For the timber dataset it seems reasonable to leave out the intercept α (“no hardness without density”). The model then becomes Yi = βxi + Ui for i = 1, 2, . . . , n. Show that the least squares estimator β̂ of β is now given by β̂ = n i=1 xiYi n i=1 x2 i by minimizing the appropriate sum of squares.
  • 341. 338 22 The method of least squares 22.6 (Quick exercise 22.1 and Exercise 22.5 continued). Suppose we are given a piece of Australian timber with density 65. What would you choose as an estimate for the Janka hardness, based on the regression model with no intercept? Recall that xiyi = 2790525 and x2 i = 81750.02 (see also Section 22.1). 22.7 Consider the dataset (x1, y1), (x2, y2), . . . , (xn, yn), where x1, x2, . . . , xn are nonrandom and y1, y2, . . . , yn are realizations of ran- dom variables Y1, Y2, . . . , Yn, satisfying Yi = eα+βxi + Ui for i = 1, 2, . . . , n. Here U1, U2, . . . , Un are independent random variables with zero expectation and variance σ2 . What are the least squares estimates for the parameters α and β in this model? 22.8 Which simple regression model has the larger residual sum of squares n i=1 r2 i , the model with intercept or the one without? 22.9 For some datasets it seems reasonable to leave out the slope β. For example, in the jury example from Section 6.3 it was assumed that the score that juror i assigns when the performance deserves a score g is Yi = g + Zi, where Zi is a random variable with values around zero. In general, when the slope β is left out, the model becomes Yi = α + Ui for i = 1, 2, . . ., n. Show that Ȳn is the least squares estimator α̂ of α. 22.10 In the method of least squares we choose α and β in such a way that the sum of squared residuals S(α, β) is minimal. Since the ith term in this sum is the squared vertical distance from (xi, yi) to the regression line y = α + βx, one might also wonder whether it is a good idea to replace this squared distance simply by the distance. So, given a bivariate dataset (x1, y1), (x2, y2), . . . , (xn, yn), choose α and β in such a way that the sum A(α, β) = n i=1 |yi − α − βxi| is minimal. We will investigate this by a simple example. Consider the follow- ing bivariate dataset: (0, 2), (1, 2), (2, 0).
  • 342. 22.5 Exercises 339
a. Determine the least squares estimates α̂ and β̂, and draw in one figure the scatterplot of the data and the estimated regression line y = α̂ + β̂x. Finally, determine A(α̂, β̂).
b. One might wonder whether α̂ and β̂ also minimize A(α, β). To investigate this, choose β = −1 and find α’s for which A(α, −1) < A(α̂, β̂). For which α is A(α, −1) minimal?
c. Find α and β for which A(α, β) is minimal.
22.11 Consider the dataset (x1, y1), (x2, y2), . . . , (xn, yn), where the xi are nonrandom and the yi are realizations of random variables Y1, Y2, . . . , Yn satisfying
Yi = g(xi) + Ui  for i = 1, 2, . . . , n,
where U1, U2, . . . , Un are independent random variables with zero expectation and variance σ².
Fig. 22.7. Scatterplot of yi versus xi.
Visual inspection of the scatterplot of our dataset in Figure 22.7 suggests that we should model the Yi by
Yi = βxi + γxi² + Ui  for i = 1, 2, . . . , n.
a. Show that the least squares estimators β̂ and γ̂ satisfy
β Σ xi² + γ Σ xi³ = Σ xiyi,
β Σ xi³ + γ Σ xi⁴ = Σ xi²yi.
b. Infer from a (for instance, by using linear algebra) that the estimators β̂ and γ̂ are given by
β̂ = ((Σ xiYi)(Σ xi⁴) − (Σ xi³)(Σ xi²Yi)) / ((Σ xi²)(Σ xi⁴) − (Σ xi³)²)
  • 343. 340 22 The method of least squares and
γ̂ = ((Σ xi²)(Σ xi²Yi) − (Σ xi³)(Σ xiYi)) / ((Σ xi²)(Σ xi⁴) − (Σ xi³)²).
22.12 The least squares estimator β̂ from (22.1) is an unbiased estimator for β. You can show this in four steps.
a. First show that
E[β̂] = (n Σ xi E[Yi] − (Σ xi)(Σ E[Yi])) / (n Σ xi² − (Σ xi)²).
b. Next use that E[Yi] = α + βxi to obtain that
E[β̂] = (n Σ xi(α + βxi) − (Σ xi)(nα + β Σ xi)) / (n Σ xi² − (Σ xi)²).
c. Simplify this last expression to find
E[β̂] = (nα Σ xi + nβ Σ xi² − nα Σ xi − β(Σ xi)²) / (n Σ xi² − (Σ xi)²).
d. Finally, conclude that β̂ is an unbiased estimator for β.
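Exercise 22.12 establishes unbiasedness analytically. As an aside (not part of the book, assuming NumPy, with arbitrarily chosen "true" parameter values), a quick Monte Carlo experiment illustrates the same fact: the average of β̂ over many simulated samples comes out close to β.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, beta, sigma = 2.0, 0.5, 1.0          # arbitrary true parameter values
x = np.linspace(1, 10, 25)                  # fixed, nonrandom x_i
n = len(x)

estimates = []
for _ in range(10_000):
    y = alpha + beta * x + sigma * rng.standard_normal(n)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    estimates.append(b)

print(np.mean(estimates))                   # close to beta = 0.5
```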
  • 344. 23 Confidence intervals for the mean Sometimes, a range of plausible values for an unknown parameter is preferred to a single estimate. We shall discuss how to turn data into what are called confidence intervals and show that this can be done in such a manner that definite statements can be made about how confident we are that the true pa- rameter value is in the reported interval. This level of confidence is something you can choose. We start this chapter with the general principle of confidence intervals. We continue with confidence intervals for the mean, the common way to refer to confidence intervals made for the expected value of the model distribution. Depending on the situation, one of the four methods presented will apply. 23.1 General principle In previous chapters we have encountered sample statistics as estimators for distribution features. This started somewhat informally in Chapter 17, where it was claimed, for example, that the sample mean and the sample variance are usually close to µ and σ2 of the underlying distribution. Bias and MSE of estimators, discussed in Chapters 19 and 20, are used to judge the quality of estimators. If we have at our disposal an estimator T for an unknown parameter θ, we use its realization t as our estimate for θ. For example, when collecting data on the speed of light, as Michelson did (see Section 13.1), the unknown speed of light would be the parameter θ, our estimator T could be the sample mean, and Michelson’s data then yield an estimate t for θ of 299 852.4 km/sec. We call this number a point estimate: if we are required to select one number, this is it. Had the measurements started a day earlier, however, the whole experiment would in essence be the same, but the results might have been different. Hence, we cannot say that the estimate equals the speed of light but rather that it is close to the true speed of light. For example, we could say something like: “we have great confidence that the true speed of
  • 345. 342 23 Confidence intervals for the mean light is somewhere between . . . and . . . .” In addition to providing an interval of plausible values for θ we would want to add a specific statement about how confident we are that the true θ is among them. In this chapter we shall present methods to make confidence statements about unknown parameters, based on knowledge of the sampling distributions of corresponding estimators. To illustrate the main idea, suppose the estimator T is unbiased for the speed of light θ. For the moment, also suppose that T has standard deviation σT = 100 km/sec (we shall drop this unrealistic assumption shortly). Then, applying formula (13.1), which was derived from Chebyshev’s inequality (see Section 13.2), we find
P(|T − θ| < 2σT) ≥ 3/4.   (23.1)
In words this reads: with probability at least 75%, the estimator T is within 2σT = 200 of the true speed of light θ. We could rephrase this as
T ∈ (θ − 200, θ + 200) with probability at least 75%.
However, if I am near the city of Paris, then the city of Paris is near me: the statement “T is within 200 of θ” is the same as “θ is within 200 of T,” and we could equally well rephrase (23.1) as
θ ∈ (T − 200, T + 200) with probability at least 75%.
Note that of the last two equations the first is a statement about a random variable T being in a fixed interval, whereas in the second equation the interval is random and the statement is about the probability that the random interval covers the fixed but unknown θ. The interval (T − 200, T + 200) is sometimes called an interval estimator, and its realization is an interval estimate. Evaluating T for the Michelson data we find as its realization t = 299 852.4, and this yields the statement
θ ∈ (299 652.4, 300 052.4).   (23.2)
Because we substituted the realization for the random variable, we cannot claim that (23.2) holds with probability at least 75%: either the true speed of light θ belongs to the interval or it does not; the statement we make is either true or false, we just do not know which. However, because the procedure guarantees a probability of at least 75% of getting a “right” statement, we say:
θ ∈ (299 652.4, 300 052.4) with confidence at least 75%.   (23.3)
The construction of this confidence interval only involved an unbiased estimator and knowledge of its standard deviation. When more information on the sampling distribution of the estimator is available, more refined statements can be made, as we shall see shortly.
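For completeness, the arithmetic behind (23.2) in a few lines of Python (a trivial sketch; σT = 100 km/sec is the illustrative value assumed in the text):

```python
# Interval estimate (t - 2*sigma_T, t + 2*sigma_T) for the Michelson estimate.
t = 299852.4
sigma_T = 100.0
print((t - 2 * sigma_T, t + 2 * sigma_T))   # (299652.4, 300052.4)
```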
  • 346. 23.1 General principle 343 Quick exercise 23.1 Repeat the preceding derivation, starting from the statement P(|T − θ| < 3σT) ≥ 8/9 (check that this follows from Chebyshev’s inequality). What is the resulting confidence interval for the speed of light, and what is the corresponding confidence?
A general definition Many confidence intervals are of the form¹ (t − c · σT, t + c · σT) we just encountered, where c is a number near 2 or 3. The corresponding confidence is often much higher than in the preceding example. Because there are many other ways confidence intervals can (or have to) be constructed, the general definition looks a bit different.
Confidence intervals. Suppose a dataset x1, . . . , xn is given, modeled as realization of random variables X1, . . . , Xn. Let θ be the parameter of interest, and γ a number between 0 and 1. If there exist sample statistics Ln = g(X1, . . . , Xn) and Un = h(X1, . . . , Xn) such that
P(Ln < θ < Un) = γ
for every value of θ, then (ln, un), where ln = g(x1, . . . , xn) and un = h(x1, . . . , xn), is called a 100γ% confidence interval for θ. The number γ is called the confidence level.
Sometimes sample statistics Ln and Un as required in the definition do not exist, but one can find Ln and Un that satisfy P(Ln < θ < Un) ≥ γ. The resulting confidence interval (ln, un) is called a conservative 100γ% confidence interval for θ: the actual confidence level might be higher. For example, the interval in (23.2) is a conservative 75% confidence interval. Quick exercise 23.2 Why is the interval in (23.2) a conservative 75% confidence interval? There is no way of knowing whether an individual confidence interval is correct, in the sense that it indeed does cover θ. The procedure guarantees that each time we make a confidence interval we have probability γ of covering θ. What this means in practice can easily be illustrated with an example, using simulation:
¹ Another form is, for example, (c1t, c2t).
  • 347. 344 23 Confidence intervals for the mean Generate x1, . . . , x20 from an N(0, 1) distribution. Next, pretend that it is known that the data are from a normal distribution but that both µ and σ are unknown. Construct the 90% confidence interval for the expectation µ using the method described in the next section, which says to use (ln, un) with
ln = x̄20 − 1.729 · s20/√20,  un = x̄20 + 1.729 · s20/√20,
where x̄20 and s20 are the sample mean and standard deviation. Finally, check whether the “true µ,” in this case 0, is in the confidence interval.
We repeated the whole procedure 50 times, making 50 confidence intervals for µ. Each confidence interval is based on a fresh independently generated set of data. The 50 intervals are plotted in Figure 23.1 as horizontal line
Fig. 23.1. Fifty 90% confidence intervals for µ = 0.
  • 348. 23.2 Normal data 345 segments, and at µ (0!) a vertical line is drawn. We count 46 “hits”: only four intervals do not contain the true µ. Quick exercise 23.3 Suppose you were to make 40 confidence intervals with confidence level 95%. About how many of them should you expect to be “wrong”? Should you be surprised if 10 of them are wrong? In the remainder of this chapter we consider confidence intervals for the mean: confidence intervals for the unknown expectation µ of the distribution from which the sample originates. We start with the situation where it is known that the data originate from a normal distribution, first with known variance, then with unknown variance. Then we drop the normal assumption, first use the bootstrap, and finally show how, for very large samples, confidence intervals based on the central limit theorem are made. 23.2 Normal data Suppose the data can be seen as the realization of a sample X1, . . . , Xn from an N(µ, σ2 ) distribution and µ is the (unknown) parameter of interest. If the variance σ2 is known, confidence intervals are easily derived. Before we do this, some preparation has to be done. Critical values We shall need so-called critical values for the standard normal distribution. The critical value zp of an N(0, 1) distribution is the number that has right tail probability p. It is defined by P(Z ≥ zp) = p, where Z is an N(0, 1) random variable. For example, from Table B.1 we read P(Z ≥ 1.96) = 0.025, so z0.025 = 1.96. In fact, zp is the (1 − p)th quantile of the standard normal distribution: Φ(zp) = P(Z ≤ zp) = 1 − p. By the symmetry of the standard normal density, P(Z ≤ −zp) = P(Z ≥ zp) = p, so P(Z ≥ −zp) = 1 − p and therefore z1−p = −zp. For example, z0.975 = −z0.025 = −1.96. All this is illustrated in Figure 23.2. Quick exercise 23.4 Determine z0.01 and z0.95 from Table B.1.
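Critical values such as z0.025 can be read from Table B.1 or, if SciPy happens to be available, computed from the standard normal quantile function; a small sketch (not part of the book):

```python
from scipy import stats

# z_p is the (1 - p)th quantile of the standard normal distribution.
print(stats.norm.ppf(1 - 0.025))   # z_{0.025}, about 1.96
print(stats.norm.ppf(1 - 0.01))    # z_{0.01},  about 2.33
print(-stats.norm.ppf(1 - 0.05))   # z_{0.95} = -z_{0.05}, about -1.645
```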
  • 349. 346 23 Confidence intervals for the mean
Fig. 23.2. Critical values of the standard normal distribution. (Left panel: the density, with right tail area p above zp and left tail area p below z1−p. Right panel: the distribution function, with values p at z1−p and 1 − p at zp.)
Variance known If X1, . . . , Xn is a random sample from an N(µ, σ²) distribution, then X̄n has an N(µ, σ²/n) distribution, and from the properties of the normal distribution (see page 106), we know that
(X̄n − µ)/(σ/√n)
has an N(0, 1) distribution. If cl and cu are chosen such that P(cl < Z < cu) = γ for an N(0, 1) distributed random variable Z, then
γ = P(cl < (X̄n − µ)/(σ/√n) < cu) = P(cl · σ/√n < X̄n − µ < cu · σ/√n) = P(X̄n − cu · σ/√n < µ < X̄n − cl · σ/√n).
We have found that Ln = X̄n − cu · σ/√n and Un = X̄n − cl · σ/√n satisfy the confidence interval definition: the interval (Ln, Un) covers µ with probability γ. Therefore
( x̄n − cu · σ/√n , x̄n − cl · σ/√n )
is a 100γ% confidence interval for µ. A common choice is to divide α = 1 − γ evenly between the tails,² that is, solve cl and cu from
² Here this choice could be motivated by the fact that it leads to the shortest confidence interval; in other examples the shortest interval requires an asymmetric
  • 350. 23.2 Normal data 347 P(Z ≥ cu) = α/2 and P(Z ≤ cl) = α/2, so that cu = zα/2 and cl = z1−α/2 = −zα/2. Summarizing, the 100(1 − α)% confidence interval for µ is:
( x̄n − zα/2 · σ/√n , x̄n + zα/2 · σ/√n ).
For example, if α = 0.05, we use z0.025 = 1.96 and the 95% confidence interval is
( x̄n − 1.96 · σ/√n , x̄n + 1.96 · σ/√n ).
Example: gross calorific content of coal When a shipment of coal is traded, a number of its properties should be known accurately, because the value of the shipment is determined by them. An important example is the so-called gross calorific value, which characterizes the heat content and is a numerical value in megajoules per kilogram (MJ/kg). The International Organization of Standardization (ISO) issues standard procedures for the determination of these properties. For the gross calorific value, there is a method known as ISO 1928. When the procedure is carried out properly, resulting measurement errors are known to be approximately normal, with a standard deviation of about 0.1 MJ/kg. Laboratories that operate according to standard procedures receive ISO certificates. In Table 23.1, a number of such ISO 1928 measurements is given for a shipment of Osterfeld coal coded 262DE27.
Table 23.1. Gross calorific value measurements for Osterfeld 262DE27.
23.870 23.730 23.712 23.760 23.640 23.850 23.840 23.860
23.940 23.830 23.877 23.700 23.796 23.727 23.778 23.740
23.890 23.780 23.678 23.771 23.860 23.690 23.800
Source: A.M.H. van der Veen and A.J.M. Broos. Interlaboratory study programme “ILS coal characterization”—reported data. Technical report, NMi Van Swinden Laboratorium B.V., The Netherlands, 1996.
We want to combine these values into a confidence statement about the “true” gross calorific content of Osterfeld 262DE27. From the data, we compute x̄n = 23.788. Using the given σ = 0.1 and α = 0.05, we find the 95% confidence interval
(23.788 − 1.96 · 0.1/√23 , 23.788 + 1.96 · 0.1/√23) = (23.747, 23.829) MJ/kg.
division of α. If you are only concerned with the left or right boundary of the confidence interval, see the next chapter.
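The interval just computed can be checked numerically; a minimal sketch, assuming NumPy and SciPy are available, with the data taken from Table 23.1:

```python
import numpy as np
from scipy import stats

# Osterfeld 262DE27 gross calorific values (MJ/kg), Table 23.1
x = np.array([23.870, 23.730, 23.712, 23.760, 23.640, 23.850, 23.840, 23.860,
              23.940, 23.830, 23.877, 23.700, 23.796, 23.727, 23.778, 23.740,
              23.890, 23.780, 23.678, 23.771, 23.860, 23.690, 23.800])
sigma = 0.1                                   # known standard deviation (ISO 1928)
z = stats.norm.ppf(1 - 0.05 / 2)              # z_{0.025}, about 1.96
half = z * sigma / np.sqrt(len(x))
print(x.mean() - half, x.mean() + half)       # approximately (23.747, 23.829)
```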
  • 351. 348 23 Confidence intervals for the mean Variance unknown When σ is unknown, the fact that (X̄n − µ)/(σ/√n) has a standard normal distribution has become useless, as it involves this unknown σ, which would subsequently appear in the confidence interval. However, if we substitute the estimator Sn for σ, the resulting random variable
(X̄n − µ)/(Sn/√n)
has a distribution that only depends on n and not on µ or σ. Moreover, its density can be given explicitly.
Definition. A continuous random variable has a t-distribution with parameter m, where m ≥ 1 is an integer, if its probability density is given by
f(x) = km (1 + x²/m)^(−(m+1)/2)  for −∞ < x < ∞,
where km = Γ((m + 1)/2) / (Γ(m/2) √(mπ)). This distribution is denoted by t(m) and is referred to as the t-distribution with m degrees of freedom.
The normalizing constant km is given in terms of the gamma function, which was defined on page 157. For m = 1, it evaluates to k1 = 1/π, and the resulting density is that of the standard Cauchy distribution (see page 161). If X has a t(m) distribution, then E[X] = 0 for m ≥ 2 and Var(X) = m/(m − 2) for m ≥ 3. Densities of t-distributions look like that of the standard normal distribution: they are also symmetric around 0 and bell-shaped. As m goes to infinity the limit of the t(m) density is the standard normal density. The distinguishing feature is that densities of t-distributions have heavier tails: f(x) goes to zero as x goes to +∞ or −∞, but more slowly than the density φ(x) of the standard normal distribution. These properties are illustrated in Figure 23.3, which shows the densities and distribution functions of the t(1), t(2), and t(5) distribution as well as those of the standard normal. We will also need critical values for the t(m) distribution: the critical value tm,p is the number satisfying P(T ≥ tm,p) = p, where T is a t(m) distributed random variable. Because the t-distribution is symmetric around zero, using the same reasoning as for the critical values of the standard normal distribution, we find:
  • 352. 23.2 Normal data 349
Fig. 23.3. Three t-distributions and the standard normal distribution. The dotted line corresponds to the standard normal. The other distributions depicted are the t(1), t(2), and t(5), which in that order resemble the standard normal more and more.
tm,1−p = −tm,p.
For example, in Table B.2 we read t10,0.01 = 2.764, and from this we deduce that t10,0.99 = −2.764. Quick exercise 23.5 Determine t3,0.01 and t35,0.9975 from Table B.2. We now return to the distribution of (X̄n − µ)/(Sn/√n) and construct a confidence interval for µ.
The studentized mean of a normal random sample. For a random sample X1, . . . , Xn from an N(µ, σ²) distribution, the studentized mean
(X̄n − µ)/(Sn/√n)
has a t(n − 1) distribution, regardless of the values of µ and σ.
From this fact and using critical values of the t-distribution, we derive that
P(−tn−1,α/2 < (X̄n − µ)/(Sn/√n) < tn−1,α/2) = 1 − α,   (23.4)
and in the same way as when σ is known it now follows that a 100(1 − α)% confidence interval for µ is given by:
  • 353. 350 23 Confidence intervals for the mean
( x̄n − tn−1,α/2 · sn/√n , x̄n + tn−1,α/2 · sn/√n ).
Returning to the coal example, there was another shipment, of Daw Mill 258GB41 coal, where there were actually some doubts whether the stated accuracy of the ISO 1928 method was attained. We therefore prefer to consider σ unknown and estimate it from the data, which are given in Table 23.2.
Table 23.2. Gross calorific value measurements for Daw Mill 258GB41.
30.990 31.030 31.060 30.921 30.920 30.990 31.024 30.929
31.050 30.991 31.208 30.830 31.330 30.810 31.060 30.800
31.091 31.170 31.026 31.020 30.880 31.125
Source: A.M.H. van der Veen and A.J.M. Broos. Interlaboratory study programme “ILS coal characterization”—reported data. Technical report, NMi Van Swinden Laboratorium B.V., The Netherlands, 1996.
Doing this, we find x̄n = 31.012 and sn = 0.1294. Because n = 22, for a 95% confidence interval we use t21,0.025 = 2.080 and obtain
(31.012 − 2.080 · 0.1294/√22 , 31.012 + 2.080 · 0.1294/√22) = (30.954, 31.069).
Note that this confidence interval is (50%!) wider than the one we made for the Osterfeld coal, with almost the same sample size. There are two reasons for this; one is that σ = 0.1 is replaced by the (larger) estimate sn = 0.1294, and the second is that the critical value z0.025 = 1.96 is replaced by the larger t21,0.025 = 2.080. The differences in the method and the ingredients seem minor, but they matter, especially for small samples.
23.3 Bootstrap confidence intervals It is not uncommon that the methods of the previous section are used even when the normal distribution is not a good model for the data. In some cases this is not a big problem: with small deviations from normality the actual confidence level of a constructed confidence interval may deviate only a few percent from the intended confidence level. For large datasets the central limit theorem in fact ensures that this method provides confidence intervals with approximately correct confidence levels, as we shall see in the next section. If we doubt the normality of the data and we do not have a large sample, usually the best thing to do is to bootstrap. Suppose we have a dataset x1, . . . , xn, modeled as a realization of a random sample from some distribution F, and we want to construct a confidence interval for its (unknown) expectation µ.
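Looking back at the Daw Mill interval from the previous section, here is a minimal numerical check, assuming SciPy; it uses only the summary statistics x̄n = 31.012 and sn = 0.1294 quoted above.

```python
import numpy as np
from scipy import stats

# Daw Mill 258GB41: summary statistics quoted in Section 23.2.
n, xbar, s = 22, 31.012, 0.1294
t_crit = stats.t.ppf(1 - 0.025, df=n - 1)    # t_{21,0.025}, about 2.080
half = t_crit * s / np.sqrt(n)
print(xbar - half, xbar + half)              # approximately (30.954, 31.069)
```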
  • 354. 23.3 Bootstrap confidence intervals 351 In the previous section we saw that it suffices to find numbers cl and cu such that
P(cl < (X̄n − µ)/(Sn/√n) < cu) = 1 − α.
The 100(1 − α)% confidence interval would then be
( x̄n − cu · sn/√n , x̄n − cl · sn/√n ),
where, of course, x̄n and sn are the sample mean and the sample standard deviation. To find cl and cu we need to know the distribution of the studentized mean
T = (X̄n − µ)/(Sn/√n).
We apply the bootstrap principle. From the data x1, . . . , xn we determine an estimate F̂ of F. Let X∗1, . . . , X∗n be a random sample from F̂, with µ∗ = E[X∗i], and consider
T∗ = (X̄∗n − µ∗)/(S∗n/√n).
The distribution of T∗ is now used as an approximation to the distribution of T. If we use F̂ = Fn, we get the following.
Empirical bootstrap simulation for the studentized mean. Given a dataset x1, x2, . . . , xn, determine its empirical distribution function Fn as an estimate of F. The expectation corresponding to Fn is µ∗ = x̄n.
1. Generate a bootstrap dataset x∗1, x∗2, . . . , x∗n from Fn.
2. Compute the studentized mean for the bootstrap dataset:
t∗ = (x̄∗n − x̄n)/(s∗n/√n),
where x̄∗n and s∗n are the sample mean and sample standard deviation of x∗1, x∗2, . . . , x∗n.
Repeat steps 1 and 2 many times.
From the bootstrap experiment we can determine c∗l and c∗u such that
P(c∗l < (X̄∗n − µ∗)/(S∗n/√n) < c∗u) ≈ 1 − α.
By the bootstrap principle we may transfer this statement about the distribution of T∗ to the distribution of T. That is, we may use these estimated critical values as bootstrap approximations to cl and cu: cl ≈ c∗l and cu ≈ c∗u.
  • 355. 352 23 Confidence intervals for the mean Therefore, we call
( x̄n − c∗u · sn/√n , x̄n − c∗l · sn/√n )
a 100(1 − α)% bootstrap confidence interval for µ.
Example: the software data Recall the software data, a dataset of interfailure times (see Section 17.3). From the nature of the data (failure times are positive numbers) and the histogram (Figure 17.5), we know that they should not be modeled as a realization of a random sample from a normal distribution. From the data we know x̄n = 656.88, sn = 1037.3, and n = 135. We generate one thousand bootstrap datasets, and for each dataset we compute t∗ as in step 2 of the procedure. The histogram and empirical distribution function made from these one thousand values are estimates of the density and the distribution function, respectively, of the bootstrap sample statistic T∗; see Figure 23.4.
Fig. 23.4. Histogram and empirical distribution function of the studentized bootstrap simulation results for the software data.
We want to make a 90% bootstrap confidence interval, so we need c∗l and c∗u, or the 0.05th and 0.95th quantile from the empirical distribution function in Figure 23.4. The 50th order statistic of the one thousand t∗ values is −2.107. This means that 50 out of the one thousand values, or 5%, are smaller than or equal to this value, and so c∗l = −2.107. Similarly, from the 951st order statistic, 1.389, we obtain³ c∗u = 1.389. Inserting these values, we find the following 90% bootstrap confidence interval for µ:
³ These results deviate slightly from the definition of empirical quantiles as given in Section 16.3. That method is a little more accurate.
  • 356. 23.4 Large samples 353
(656.88 − 1.389 · 1037.3/√135 , 656.88 − (−2.107) · 1037.3/√135) = (532.9, 845.0).
Quick exercise 23.6 The 25th and 976th order statistic from the preceding bootstrap results are −2.443 and 1.713, respectively. Use these numbers to construct a confidence interval for µ. What is the corresponding confidence level?
Why the bootstrap may be better The reason to use the bootstrap is that it should lead to a more accurate approximation of the distribution of the studentized mean than the t(n − 1) distribution that follows from assuming normality. If, in the previous example, we would think we had normal data, we would use critical values from the t(134) distribution: t134,0.05 = 1.656. The result would be
(656.88 − 1.656 · 1037.3/√135 , 656.88 + 1.656 · 1037.3/√135) = (509.0, 804.7).
Comparing the intervals, we see that here the bootstrap interval is a little larger and, as opposed to the t-interval, not centered around the sample mean but skewed to the right side. This is one of the features of the bootstrap: if the distribution from which the data originate is skewed, this is reflected in the confidence interval. Looking at the histogram of the software data (Figure 17.5), we see that it is skewed to the right: it has a long tail on the right, but not on the left, so the same most likely holds for the distribution from which these data originate. The skewness is reflected in the confidence interval, which extends more to the right of x̄n than to the left. In some sense, the bootstrap adapts to the shape of the distribution, and in this way it leads to more accurate confidence statements than using the method for normal data. What we mean by this is that, for example, with the normal method only 90% of the 95% confidence statements would actually cover the true value, whereas for the bootstrap intervals this percentage would be close(r) to 95%.
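The bootstrap procedure described in this section is easy to simulate. Below is a minimal sketch, assuming NumPy; since the raw software data are not reproduced here, the array x is an illustrative stand-in (replace it with the actual interfailure times to reproduce the numbers in the text).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in dataset: an illustrative skewed sample of the same size as the
# software data; substitute the real interfailure times for x if available.
x = rng.exponential(scale=650.0, size=135)
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

B = 1000
t_star = np.empty(B)
for b in range(B):
    xs = rng.choice(x, size=n, replace=True)                   # sample from F_n
    t_star[b] = (xs.mean() - xbar) / (xs.std(ddof=1) / np.sqrt(n))

cl, cu = np.quantile(t_star, [0.05, 0.95])                     # c*_l and c*_u
print(xbar - cu * s / np.sqrt(n), xbar - cl * s / np.sqrt(n))  # 90% bootstrap CI
```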
  • 357. 354 23 Confidence intervals for the mean sample from some distribution F with expectation µ. If n is large enough, we may use
P(−zα/2 < (X̄n − µ)/(Sn/√n) < zα/2) ≈ 1 − α.   (23.5)
This implies that if x1, . . . , xn can be seen as a realization of a random sample from some unknown distribution with expectation µ and if n is large enough, then
( x̄n − zα/2 · sn/√n , x̄n + zα/2 · sn/√n )
is an approximate 100(1 − α)% confidence interval for µ. Just as earlier with the central limit theorem, a key question is “how big should n be?” Again, there is no easy answer. To give you some idea, we have listed in Table 23.3 the results of a small simulation experiment. For each of the distributions, sample sizes, and confidence levels listed, we constructed 10 000 confidence intervals with the large sample method; the numbers listed in the table are the confidence levels as estimated from the simulation, the coverage probabilities. The chosen Pareto distribution is very skewed, and this shows; the coverage probabilities for the exponential are just a few percent off.
Table 23.3. Estimated coverage probabilities for large sample confidence intervals for non-normal data.
Distribution   n    γ = 0.900   γ = 0.950
Exp(1)         20   0.851       0.899
Exp(1)        100   0.890       0.938
Par(2.1)       20   0.727       0.774
Par(2.1)      100   0.798       0.849
In the case of simulation one can often quite easily generate a very large number of independent repetitions, and then this question poses no problem. In other cases there may be nothing better to do than hope that the dataset is large enough. We give an example where (we believe!) this is definitely the case. In an article published in 1910 ([28]), Rutherford and Geiger reported their observations on the radioactive decay of the element polonium. Using a small disk coated with polonium they counted the number of emitted alpha-particles during 2608 intervals of 7.5 seconds each. The dataset consists of the counted number of alpha-particles for each of the 2608 intervals and can be summarized as in Table 23.4.
  • 358. 23.5 Solutions to the quick exercises 355
Table 23.4. Alpha-particle counts for 2608 intervals of 7.5 seconds.
Count      0   1   2   3   4
Frequency 57 203 383 525 532
Count      5   6   7   8   9
Frequency 408 273 139  45  27
Count     10  11  12  13  14
Frequency 10   4   0   1   1
Source: E. Rutherford and H. Geiger (with a note by H. Bateman), The probability variations in the distribution of α particles, Phil. Mag., 6: 698–704, 1910; the table on page 701.
The total number of counted alpha-particles is 10 097, the average number per interval is therefore 3.8715. The sample standard deviation can also be computed from the table; it is 1.9225. So we know of the actual data x1, x2, . . . , x2608 (where the counts xi are between 0 and 14) that x̄n = 3.8715 and sn = 1.9225. We construct a 98% confidence interval for the expected number of particles per interval. As z0.01 = 2.33 this results in
(3.8715 − 2.33 · 1.9225/√2608 , 3.8715 + 2.33 · 1.9225/√2608) = (3.784, 3.959).
23.5 Solutions to the quick exercises
23.1 From the probability statement, we derive, using σT = 100 and 8/9 = 0.889:
θ ∈ (T − 300, T + 300) with probability at least 88%.
With t = 299 852.4, this becomes θ ∈ (299 552.4, 300 152.4) with confidence at least 88%.
23.2 Chebyshev’s inequality only gives an upper bound. The actual value of P(|T − θ| < 2σT) could be higher than 3/4, depending on the distribution of T. For example, in Quick exercise 13.2 we saw that in case of an exponential distribution this probability is 0.865. For other distributions, even higher values are attained; see Exercise 13.1.
23.3 For each of the confidence intervals we have a 5% probability that it is wrong. Therefore, the number of wrong confidence intervals has a Bin(40, 0.05) distribution, and we would expect about 40 · 0.05 = 2 to be wrong. The standard deviation of this distribution is √(40 · 0.05 · 0.95) = 1.38. The outcome “10 confidence intervals wrong” is (10 − 2)/1.38 = 5.8 standard deviations from the expectation and would be a surprising outcome indeed. (The probability of 10 or more wrong is 0.00002.)
23.4 We need to solve P(Z ≥ a) = 0.01. In Table B.1 we find P(Z ≥ 2.33) = 0.0099 ≈ 0.01, so z0.01 ≈ 2.33. For z0.95 we need to solve P(Z ≥ a) = 0.95, and because this is in the left tail of the distribution, we use z0.95 = −z0.05. In the table we read P(Z ≥ 1.64) = 0.0505 and P(Z ≥ 1.65) = 0.0495, from which we conclude z0.05 ≈ (1.64 + 1.65)/2 = 1.645 and z0.95 ≈ −1.645.

23.5 In Table B.2 we find P(T3 ≥ 4.541) = 0.01, so t3,0.01 = 4.541. For t35,0.9975, we need to use t35,0.9975 = −t35,0.0025. In the table we find t30,0.0025 = 3.030 and t40,0.0025 = 2.971, and by interpolation t35,0.0025 ≈ (3.030 + 2.971)/2 = 3.0005. Hence, t35,0.9975 ≈ −3.000.

23.6 The order statistics are estimates for c∗0.025 and c∗0.975, respectively. So the corresponding α is 0.05, and the 95% bootstrap confidence interval for µ is:

( 656.88 − 1.713 · 1037.3/√135 , 656.88 − (−2.443) · 1037.3/√135 ) = (504.0, 875.0).

23.6 Exercises

23.1 A bottling machine is known to fill wine bottles with amounts that follow an N(µ, σ²) distribution, with σ = 5 (ml). In a sample of 16 bottles, x̄ = 743 (ml) was found. Construct a 95% confidence interval for µ.

23.2 You are given a dataset that may be considered a realization of a normal random sample. The size of the dataset is 34, the average is 3.54, and the sample standard deviation is 0.13. Construct a 98% confidence interval for the unknown expectation µ.

23.3 You have ordered 10 bags of cement, which are supposed to weigh 94 kg each. The average weight of the 10 bags is 93.5 kg. Assuming that the 10 weights can be viewed as a realization of a random sample from a normal distribution with unknown parameters, construct a 95% confidence interval for the expected weight of a bag. The sample standard deviation of the 10 weights is 0.75.

23.4 A new type of car tire is launched by a tire manufacturer. The automobile association performs a durability test on a random sample of 18 of these tires. For each tire the durability is expressed as a percentage: a score of 100 (%) means that the tire lasted exactly as long as the average standard tire, an accepted comparison standard. From the multitude of factors that influence the durability of individual tires the assumption is warranted that the durability of an arbitrary tire follows an N(µ, σ²) distribution. The parameters µ and σ² characterize the tire type, and µ could be called the durability index for this type of tire. The automobile association found for the tested tires: x̄18 = 195.3 and s18 = 16.7. Construct a 95% confidence interval for µ.
23.5 During the 2002 Winter Olympic Games in Salt Lake City a newspaper article mentioned the alleged advantage speed-skaters have in the 1500 m race if they start in the outer lane. In the men's 1500 m, there were 24 races, but in race 13 (really!) someone fell and did not finish. The results in seconds of the remaining 23 races are listed in Table 23.5. You should know that who races against whom, in which race, and who starts in the outer lane are all determined by a fair lottery.

Table 23.5. Speed-skating results in seconds, men's 1500 m (except race 13), 2002 Winter Olympic Games.

Race number   Inner lane   Outer lane   Difference
 1            107.04       105.98        1.06
 2            109.24       108.20        1.04
 3            111.02       108.40        2.62
 4            108.02       108.58       −0.56
 5            107.83       105.51        2.32
 6            109.50       112.01       −2.51
 7            111.81       112.87       −1.06
 8            111.02       106.40        4.62
 9            106.04       104.57        1.47
10            110.15       110.70       −0.55
11            109.42       109.45       −0.03
12            108.13       109.57       −1.44
14            105.86       105.97       −0.11
15            108.27       105.63        2.64
16            107.63       105.41        2.22
17            107.72       110.26       −2.54
18            106.38       105.82        0.56
19            107.78       106.29        1.49
20            108.57       107.26        1.31
21            106.99       103.95        3.04
22            107.21       106.00        1.21
23            105.34       105.26        0.08
24            108.76       106.75        2.01
Mean          108.25       107.43        0.82
St.dev.         1.70         2.42        1.78

a. As a consequence of the lottery and the fact that many different factors contribute to the actual time difference "inner lane minus outer lane" the assumption of a normal distribution for the difference is warranted. The numbers in the last column can be seen as realizations from an N(δ, σ²)
distribution, where δ is the expected outer lane advantage. Construct a 95% confidence interval for δ. N.B. n = 23, not 24!

b. You decide to make a bootstrap confidence interval instead. Describe the appropriate bootstrap experiment.

c. The bootstrap experiment was performed with one thousand repetitions. Part of the bootstrap outcomes are listed in the following table. From the ordered list of results, numbers 21 to 60 and 941 to 980 are given. Use these to construct a 95% bootstrap confidence interval for δ.

 21–25    −2.202  −2.164  −2.111  −2.109  −2.101
 26–30    −2.099  −2.006  −1.985  −1.967  −1.929
 31–35    −1.917  −1.898  −1.864  −1.830  −1.808
 36–40    −1.800  −1.799  −1.774  −1.773  −1.756
 41–45    −1.736  −1.732  −1.731  −1.717  −1.716
 46–50    −1.699  −1.692  −1.691  −1.683  −1.666
 51–55    −1.661  −1.644  −1.638  −1.637  −1.620
 56–60    −1.611  −1.611  −1.601  −1.600  −1.593
941–945    1.648   1.667   1.669   1.689   1.696
946–950    1.708   1.722   1.726   1.735   1.814
951–955    1.816   1.825   1.856   1.862   1.864
956–960    1.875   1.877   1.897   1.905   1.917
961–965    1.923   1.948   1.961   1.987   2.001
966–970    2.015   2.015   2.017   2.018   2.034
971–975    2.035   2.037   2.039   2.053   2.060
976–980    2.088   2.092   2.101   2.129   2.143

23.6 A dataset x1, x2, . . . , xn is given, modeled as realization of a sample X1, X2, . . . , Xn from an N(µ, 1) distribution. Suppose there are sample statistics Ln = g(X1, . . . , Xn) and Un = h(X1, . . . , Xn) such that

P(Ln < µ < Un) = 0.95

for every value of µ. Suppose that the corresponding 95% confidence interval derived from the data is (ln, un) = (−2, 5).

a. Suppose θ = 3µ + 7. Let L̃n = 3Ln + 7 and Ũn = 3Un + 7. Show that P(L̃n < θ < Ũn) = 0.95.

b. Write the 95% confidence interval for θ in terms of ln and un.

c. Suppose θ = 1 − µ. Again, find L̃n and Ũn, as well as the confidence interval for θ.

d. Suppose θ = µ². Can you construct a confidence interval for θ?
23.7 A 95% confidence interval for the parameter µ of a Pois(µ) distribution is given: (2, 3). Let X be a random variable with this distribution. Construct a 95% confidence interval for P(X = 0) = e^(−µ).

23.8 Suppose that in Exercise 23.1 the content of the bottles has to be determined by weighing. It is known that the wine bottles involved weigh on average 250 grams, with a standard deviation of 15 grams, and the weights follow a normal distribution. For a sample of 16 bottles, an average weight of 998 grams was found. You may assume that 1 ml of wine weighs 1 gram, and that the filling amount is independent of the bottle weight. Construct a 95% confidence interval for the expected amount of wine per bottle, µ.

23.9 Consider the alpha-particle counts discussed in Section 23.4; the data are given in Table 23.4. We want to bootstrap in order to make a bootstrap confidence interval for the expected number of particles in a 7.5-second interval.

a. Describe in detail how you would perform the bootstrap simulation.

b. The bootstrap experiment was performed with one thousand repetitions. Part of the (ordered) bootstrap t∗'s are given in the following table. Construct the 95% bootstrap confidence interval for the expected number of particles in a 7.5-second interval.

  1–5      −2.996  −2.942  −2.831  −2.663  −2.570
  6–10     −2.537  −2.505  −2.290  −2.273  −2.228
 11–15     −2.193  −2.112  −2.092  −2.086  −2.045
 16–20     −1.983  −1.980  −1.978  −1.950  −1.931
 21–25     −1.920  −1.910  −1.893  −1.889  −1.888
 26–30     −1.865  −1.864  −1.832  −1.817  −1.815
 31–35     −1.755  −1.751  −1.749  −1.746  −1.744
 36–40     −1.734  −1.723  −1.710  −1.708  −1.705
 41–45     −1.703  −1.700  −1.696  −1.692  −1.691
 46–50     −1.691  −1.675  −1.660  −1.656  −1.650
951–955     1.635   1.638   1.643   1.648   1.661
956–960     1.666   1.668   1.678   1.681   1.686
961–965     1.692   1.719   1.721   1.753   1.772
966–970     1.773   1.777   1.806   1.814   1.821
971–975     1.824   1.826   1.837   1.838   1.845
976–980     1.862   1.877   1.881   1.883   1.956
981–985     1.971   1.992   2.060   2.063   2.083
986–990     2.089   2.177   2.181   2.186   2.224
991–995     2.234   2.264   2.273   2.310   2.348
996–1000    2.483   2.556   2.870   2.890   3.546
c. Answer this without doing any calculations: if we made the 98% bootstrap confidence interval, would it be smaller or larger than the interval constructed in Section 23.4?

23.10 In a report you encounter a 95% confidence interval (1.6, 7.8) for the parameter µ of an N(µ, σ²) distribution. The interval is based on 16 observations, constructed according to the studentized mean procedure.

a. What is the mean of the (unknown) dataset?

b. You prefer to have a 99% confidence interval for µ. Construct it.

23.11 A 95% confidence interval for the unknown expectation of some distribution contains the number 0.

a. We construct the corresponding 98% confidence interval, using the same data. Will it contain the number 0?

b. The confidence interval in fact is a bootstrap confidence interval. We repeat the bootstrap experiment (using the same data) and construct a new 95% confidence interval based on the results. Will it contain the number 0?

c. We collect new data, resulting in a dataset of the same size. With this data, we construct a 95% confidence interval for the unknown expectation. Will the interval contain 0?

23.12 Let Z1, . . . , Zn be a random sample from an N(0, 1) distribution. Define Xi = µ + σZi for i = 1, . . . , n and σ > 0. Let Z̄, X̄ denote the sample averages and SZ and SX the sample standard deviations, of the Zi and Xi, respectively.

a. Show that X1, . . . , Xn is a random sample from an N(µ, σ²) distribution.

b. Express X̄ and SX in terms of Z̄, SZ, µ, and σ.

c. Verify that

(X̄ − µ)/(SX/√n) = Z̄/(SZ/√n),

and explain why this shows that the distribution of the studentized mean does not depend on µ and σ.
24 More on confidence intervals

While in Chapter 23 we were solely concerned with confidence intervals for expectations, in this chapter we treat a variety of topics. First, we focus on confidence intervals for the parameter p of the binomial distribution. Then, based on an example, we briefly discuss a general method to construct confidence intervals. One-sided confidence intervals, or upper and lower confidence bounds, are discussed next. At the end of the chapter we investigate the question of how to determine the sample size when a confidence interval of a certain width is desired.

24.1 The probability of success

A common situation is that we observe a random variable X with a Bin(n, p) distribution and use X to estimate p. For example, if we want to estimate the proportion of voters that support candidate G in an election, we take a sample from the voter population and determine the proportion in the sample that supports G. If n individuals are selected at random from the population, where a proportion p supports candidate G, the number of supporters X in the sample is modeled by a Bin(n, p) distribution; we count the supporters of candidate G as "successes." Usually, the sample proportion X/n is taken as an estimator for p.

If we want to make a confidence interval for p, based on the number of successes X in the sample, we need to find statistics L and U (see the definition of confidence intervals on page 343) such that

P(L < p < U) = 1 − α,

where L and U are to be based on X only. In general, this problem does not have a solution. However, the method for large n described next, sometimes called "the Wilson method" (see [40]), yields confidence intervals with
confidence level approximately 100(1 − α)%. (How close the true confidence level is to 100(1 − α)% depends on the (unknown) p, though it is known that for p near 0 and 1 it is too low. For some details and an alternative for this situation, see Remark 24.1.)

Recall the normal approximation to the binomial distribution, a consequence of the central limit theorem (see page 201 and Exercise 14.5): for large n, the distribution of X is approximately normal and

(X − np)/√(np(1 − p))

is approximately standard normal. By dividing by n in both the numerator and the denominator, we see that this equals:

(X/n − p)/√(p(1 − p)/n).

Therefore, for large n

P( −zα/2 < (X/n − p)/√(p(1 − p)/n) < zα/2 ) ≈ 1 − α.

Note that the event

−zα/2 < (X/n − p)/√(p(1 − p)/n) < zα/2

is the same as

( (X/n − p)/√(p(1 − p)/n) )² < (zα/2)²   or   (X/n − p)² − (zα/2)² · p(1 − p)/n < 0.

To derive expressions for L and U we can rewrite the inequality in this statement to obtain the form L < p < U, but the resulting formulas are rather awkward. To obtain the confidence interval, we instead substitute the data values directly and then solve for p, which yields the desired result.

Suppose, in a sample of 125 voters, 78 support one candidate. What is the 95% confidence interval for the population proportion p supporting that candidate? The realization of X is x = 78 and n = 125. We substitute this, together with zα/2 = z0.025 = 1.96, in the last inequality:

(78/125 − p)² − ((1.96)²/125) · p(1 − p) < 0,
or, working out squares and products and grouping terms:

1.0307 p² − 1.2787 p + 0.3894 < 0.

This quadratic form describes a parabola, which is depicted in Figure 24.1.

[Figure 24.1. The parabola 1.0307 p² − 1.2787 p + 0.3894 as a function of p; it crosses the horizontal axis at p = 0.54 and p = 0.70, the endpoints of the resulting confidence interval.]

Also, for other values of n and x there always results a quadratic inequality like this, with a positive coefficient for p² and a similar picture. For the confidence interval we need to find the values where the parabola intersects the horizontal axis. The solutions we find are:

p1,2 = ( −(−1.2787) ± √((−1.2787)² − 4 · 1.0307 · 0.3894) ) / (2 · 1.0307) = 0.6203 ± 0.0835;

hence, l = 0.54 and u = 0.70, so the resulting confidence interval is (0.54, 0.70).

Quick exercise 24.1 Suppose in another election we find 80 supporters in a sample of 200. Suppose we use α = 0.0456 for which zα/2 = 2. Construct the corresponding confidence interval for p.

Remark 24.1 (Coverage probabilities and an alternative method). Because of the discrete nature of the binomial distribution, the probability that the confidence interval covers the true parameter value depends on p. As a function of p it typically oscillates in a sawtooth-like manner around 1 − α, being too high for some values and too low for others. This is something that cannot be escaped from; the phenomenon is present in every method. In an average sense, the method treated in the text yields coverage probabilities close to 1 − α, though for arbitrarily high values of n it is possible to find p's for which the actual coverage is several percentage points too low. The low coverage occurs for p's near 0 and 1.
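To make the coverage behaviour described in this remark concrete, one can estimate the coverage of the Wilson interval by simulation. The sketch below is only an illustration (it assumes Python with NumPy; the function names wilson_interval and estimated_coverage are ours, not the book's): it solves the quadratic inequality from this section for each simulated sample and counts how often the true p is covered.

    import numpy as np

    def wilson_interval(x, n, z=1.96):
        # Solve (x/n - p)^2 - z^2 * p(1-p)/n < 0 for p, as in Section 24.1
        phat = x / n
        a = 1 + z**2 / n
        b = -(2 * phat + z**2 / n)
        c = phat**2
        disc = np.sqrt(b**2 - 4 * a * c)
        return (-b - disc) / (2 * a), (-b + disc) / (2 * a)

    def estimated_coverage(p, n, reps=10000, rng=np.random.default_rng(1)):
        x = rng.binomial(n, p, size=reps)
        lower, upper = wilson_interval(x, n)
        return np.mean((lower < p) & (p < upper))

    # The interval for 78 supporters out of 125 voters
    print(wilson_interval(78, 125))      # approximately (0.54, 0.70)

    # Estimated coverage oscillates with p; compare a few values for n = 125
    for p in [0.1, 0.5, 0.95]:
        print(p, estimated_coverage(p, 125))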
An alternative is the method proposed by Agresti and Coull, which overall is more conservative than the Wilson method (in fact, the Agresti-Coull interval contains the Wilson interval as a proper subset). Especially for p near 0 or 1 this method yields conservative confidence intervals. Define

X̃ = X + (zα/2)²/2,   ñ = n + (zα/2)²,   and   p̃ = X̃/ñ.

The approximate 100(1 − α)% confidence interval is then given by

( p̃ − zα/2 √(p̃(1 − p̃)/ñ) , p̃ + zα/2 √(p̃(1 − p̃)/ñ) ).

For a clear survey paper on confidence intervals for p we recommend Brown et al. [4].

24.2 Is there a general method?

We have now seen a number of examples of confidence intervals, and while it should be clear to you that in each of these cases the resulting intervals are valid confidence intervals, you may wonder how we go about finding confidence intervals in new situations. One could ask: is there a general method? We first consider an example.

A confidence interval for the minimum lifetime

Suppose we have a random sample X1, . . . , Xn from a shifted exponential distribution, that is, Xi = δ + Yi, where Y1, . . . , Yn are a random sample from an Exp(1) distribution. This type of random variable is sometimes used to model lifetimes; a minimum lifetime is guaranteed, but otherwise the lifetime has an exponential distribution. The unknown parameter δ represents the minimum lifetime, and the probability density of the Xi is positive only for values greater than δ. To derive information about δ it is natural to use the smallest observed value

T = min{X1, . . . , Xn}.

This is also the maximum likelihood estimator for δ; see Exercise 21.6. Writing

T = min{δ + Y1, . . . , δ + Yn} = δ + min{Y1, . . . , Yn}

and observing that M = min{Y1, . . . , Yn} has an Exp(n) distribution (see Exercise 8.18), we find for the distribution function of T: FT(a) = 0 for a < δ and

FT(a) = P(T ≤ a) = P(δ + M ≤ a) = P(M ≤ a − δ) = 1 − e^(−n(a−δ))   for a ≥ δ.    (24.1)

Next, we solve
P(cl < T < cu) = 1 − α

by requiring

P(T ≤ cl) = P(T ≥ cu) = α/2.

Using (24.1) we find the following equations:

1 − e^(−n(cl−δ)) = α/2   and   e^(−n(cu−δ)) = α/2,

whose solutions are

cl = δ − (1/n) ln(1 − α/2)   and   cu = δ − (1/n) ln(α/2).

Both cl and cu are values larger than δ, because the logarithms are negative. We have found that, whatever the value of δ:

P( δ − (1/n) ln(1 − α/2) < T < δ − (1/n) ln(α/2) ) = 1 − α.

By rearranging the inequalities, we see this is equivalent to

P( T + (1/n) ln(α/2) < δ < T + (1/n) ln(1 − α/2) ) = 1 − α,

and therefore a 100(1 − α)% confidence interval for δ is given by

( t + (1/n) ln(α/2) , t + (1/n) ln(1 − α/2) ).    (24.2)

For α = 0.05 this becomes:

( t − 3.69/n , t − 0.0253/n ).

Quick exercise 24.2 Suppose you have a dataset of size 15 from a shifted Exp(1) distribution, whose minimum value is 23.5. What is the 99% confidence interval for δ?

Looking back at the example, we see that the confidence interval could be constructed because we know that T − δ = M has an exponential distribution. There are many more examples of this type: some function g(T, θ) of a sample statistic T and the unknown parameter θ has a known distribution. However, this still does not cover all the ways to construct confidence intervals (see also the following remark).

Remark 24.2 (About a general method). Suppose X1, . . . , Xn is a random sample from some distribution depending on some unknown parameter θ and let T be a sample statistic. One possible choice is to select a T that is an estimator for θ, but this is not necessary. In each case, the
distribution of T depends on θ, just as that of X1, . . . , Xn does. In some cases it might be possible to find functions g(θ) and h(θ) such that

P( g(θ) < T < h(θ) ) = 1 − α   for every value of θ.    (24.3)

If this is so, then confidence statements about θ can be made. In more special cases, for example if g and h are strictly increasing, the inequalities g(θ) < T < h(θ) can be rewritten as

h⁻¹(T) < θ < g⁻¹(T),

and then (24.3) is equivalent to

P( h⁻¹(T) < θ < g⁻¹(T) ) = 1 − α   for every value of θ.

Checking with the confidence interval definition, we see that the last statement implies that (h⁻¹(t), g⁻¹(t)) is a 100(1 − α)% confidence interval for θ.

24.3 One-sided confidence intervals

Suppose you are in charge of a power plant that generates and sells electricity, and you are about to buy a shipment of coal, say a shipment of the Daw Mill coal identified as 258GB41 earlier. You plan to buy the shipment if you are confident that the gross calorific content exceeds 31.00 MJ/kg. At the end of Section 23.2 we obtained for the gross calorific content the 95% confidence interval (30.946, 31.067): based on the data we are 95% confident that the gross calorific content is higher than 30.946 and lower than 31.067. In the present situation, however, we are only interested in the lower bound: we would prefer a confidence statement of the type "we are 95% confident that the gross calorific content exceeds 31.00." Modifying equation (23.4) we find

P( (X̄n − µ)/(Sn/√n) < tn−1,α ) = 1 − α,

which is equivalent to

P( X̄n − tn−1,α · Sn/√n < µ ) = 1 − α.

We conclude that

( x̄n − tn−1,α · sn/√n , ∞ )

is a 100(1 − α)% one-sided confidence interval for µ. For the Daw Mill coal, using α = 0.05, with t21,0.05 = 1.721 this results in:

( 31.012 − 1.721 · 0.1294/√22 , ∞ ) = (30.964, ∞).
We see that because "all uncertainty may be put on one side," the lower bound in the one-sided interval is higher than that in the two-sided one, though still below 31.00.

Other situations may require a confidence upper bound. For example, if the calorific value is below a certain number you can try to negotiate a lower price. The definition of confidence intervals (page 343) can be extended to include one-sided confidence intervals as well. If we have a sample statistic Ln such that

P(Ln < θ) = γ

for every value of the parameter of interest θ, then (ln, ∞) is called a 100γ% one-sided confidence interval for θ. The number ln is sometimes called a 100γ% lower confidence bound for θ. Similarly, Un with

P(θ < Un) = γ

for every value of θ, yields the one-sided confidence interval (−∞, un), and un is called a 100γ% upper confidence bound.

Quick exercise 24.3 Determine the 99% upper confidence bound for the gross calorific value of the Daw Mill coal.

24.4 Determining the sample size

The narrower the confidence interval the better (why?). As a general principle, we know that more accurate statements can be made if we have more measurements. Sometimes, an accuracy requirement is set, even before data are collected, and the corresponding sample size is to be determined. We provide an example of how to do this and note that this generally can be done, but the actual computation varies with the type of confidence interval.

Consider the question of the calorific content of coal once more. We have a shipment of coal to test and we want to obtain a 95% confidence interval, but it should not be wider than 0.05 MJ/kg, i.e., the lower and upper bound should not differ more than 0.05. How many measurements do we need? We answer this question for the case when ISO method 1928 is used, whence we may assume that measurements are normally distributed with standard deviation σ = 0.1. When the desired confidence level is 1 − α, the width of the confidence interval will be

2 · zα/2 · σ/√n.

Requiring that this is at most w means finding the smallest n that satisfies

2 zα/2 σ/√n ≤ w
or

n ≥ ( 2 zα/2 σ / w )².

For the example: w = 0.05, σ = 0.1, and z0.025 = 1.96; so

n ≥ ( 2 · 1.96 · 0.1 / 0.05 )² = 61.4,

that is, we should perform at least 62 measurements.

In case σ is unknown, we somehow have to estimate it, and then the method can only give an indication of the required sample size. The standard deviation as we (afterwards) estimate it from the data may turn out to be quite different, and the obtained confidence interval may be smaller or larger than intended.

Quick exercise 24.4 What is the required sample size if we want the 99% confidence interval to be 0.05 MJ/kg wide?

24.5 Solutions to the quick exercises

24.1 We need to solve

(80/200 − p)² − ((2)²/200) · p(1 − p) < 0,   or   1.02 p² − 0.82 p + 0.16 < 0.

The solutions are:

p1,2 = ( −(−0.82) ± √((−0.82)² − 4 · 1.02 · 0.16) ) / (2 · 1.02) = 0.4020 ± 0.0686,

so the confidence interval is (0.33, 0.47).

24.2 We should substitute n = 15, t = 23.5, and α = 0.01 into

( t + (1/n) ln(α/2) , t + (1/n) ln(1 − α/2) ),

which yields

( 23.5 − 5.30/15 , 23.5 − 0.0050/15 ) = (23.1467, 23.4997).

24.3 The upper confidence bound is given by

un = x̄n + t21,0.01 · sn/√22,

where x̄n = 31.012, t21,0.01 = 2.518, and sn = 0.1294. Substitution yields un = 31.081.
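The interval formula (24.2) used in solution 24.2, and the sample size bound from Section 24.4, are both easy to check numerically. A minimal sketch (plain Python; the function names are ours and this is not part of the book):

    from math import log

    def shifted_exp_ci(t, n, alpha):
        # Confidence interval (24.2) for the minimum lifetime delta,
        # based on the sample minimum t of n shifted Exp(1) observations
        return (t + log(alpha / 2) / n, t + log(1 - alpha / 2) / n)

    print(shifted_exp_ci(23.5, 15, 0.01))   # approximately (23.1467, 23.4997)

    def required_n(z, sigma, w):
        # Smallest n with 2 * z * sigma / sqrt(n) <= w, before rounding up
        return (2 * z * sigma / w) ** 2

    print(required_n(1.96, 0.1, 0.05))      # 61.4..., so at least 62 measurements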
24.4 The confidence level changes to 99%, so we use z0.005 = 2.576 instead of 1.96 in the computation:

n ≥ ( 2 · 2.576 · 0.1 / 0.05 )² = 106.2,

so we need at least 107 measurements.

24.6 Exercises

24.1 Of a series of 100 (independent and identical) chemical experiments, 70 were concluded successfully. Construct a 90% confidence interval for the success probability of this type of experiment.

24.2 In January 2002 the Euro was introduced and soon after stories started to circulate that some of the Euro coins would not be fair coins, because the "national side" of some coins would be too heavy or too light (see, for example, the New Scientist of January 4, 2002, but also national newspapers of that date).

a. A French 1 Euro coin was tossed six times, resulting in 1 heads and 5 tails. Is it reasonable to use the Wilson method, introduced in Section 24.1, to construct a confidence interval for p?

b. A Belgian 1 Euro coin was tossed 250 times: 140 heads and 110 tails. Construct a 95% confidence interval for the probability of getting heads with this coin.

24.3 In Exercise 23.1, what sample size is needed if we want a 99% confidence interval for µ at most 1 ml wide?

24.4 Recall Exercise 23.3 and the 10 bags of cement that should each weigh 94 kg. The average weight was 93.5 kg, with sample standard deviation 0.75.

a. Based on these data, how many bags would you need to sample to make a 90% confidence interval that is 0.1 kg wide?

b. Suppose you actually do measure the required number of bags and construct a new confidence interval. Is it guaranteed to be at most 0.1 kg wide?

24.5 Suppose we want to make a 95% confidence interval for the probability of getting heads with a Dutch 1 Euro coin, and it should be at most 0.01 wide. To determine the required sample size, we note that the probability of getting heads is about 0.5. Furthermore, if X has a Bin(n, p) distribution, with n large and p ≈ 0.5, then
(X − np)/√(n/4)

is approximately standard normal.

a. Use this statement to derive that the width of the 95% confidence interval for p is approximately z0.025/√n. Use this width to determine how large n should be.

b. The coin is thrown the number of times just computed, resulting in 19 477 times heads. Construct the 95% confidence interval and check whether the required accuracy is attained.

24.6 Environmentalists have taken 16 samples from the wastewater of a chemical plant and measured the concentration of a certain carcinogenic substance. They found x̄16 = 2.24 (ppm) and s16² = 1.12, and want to use these data in a lawsuit against the plant. It may be assumed that the data are a realization of a normal random sample.

a. Construct the 97.5% one-sided confidence interval that the environmentalists made to convince the judge that the concentration exceeds legal limits.

b. The plant management uses the same data to construct a 97.5% one-sided confidence interval to show that concentrations are not too high. Construct this interval as well.

24.7 Consider once more the Rutherford-Geiger data as given in Section 23.4. Knowing that the number of α-particle emissions during an interval has a Poisson distribution, we may see the data as observations from a Pois(µ) distribution. The central limit theorem tells us that the average X̄n of a large number of independent Pois(µ) random variables approximately has a normal distribution and hence that

(X̄n − µ)/(√µ/√n)

has a distribution that is approximately N(0, 1).

a. Show that the large sample 95% confidence interval contains those values of µ for which

(x̄n − µ)² ≤ (1.96)² · µ/n.

b. Use the result from a to construct the large sample 95% confidence interval based on the Rutherford-Geiger data.

c. Compare the result with that of Exercise 23.9 b. Is this surprising?

24.8 Recall Exercise 23.5 about the 1500 m speed-skating results in the 2002 Winter Olympic Games. If there were no outer lane advantage, the number
out of the 23 completed races won by skaters starting in the outer lane would have a Bin(23, p) distribution with p = 1/2, because of the lane assignment by lottery.

a. Of the 23 races, 15 were won by the skater starting in the outer lane. Use this information to construct a 95% confidence interval for p by means of the Wilson method. If you think that n = 23 is probably too small to use a method based on the central limit theorem, we agree. We should be careful with conclusions we draw from this confidence interval.

b. The question posed earlier "Is there an outer lane advantage?" implies that a one-sided confidence interval is more suitable. Construct the appropriate 95% one-sided confidence interval for p by first constructing a 90% two-sided confidence interval.

24.9 Suppose we have a dataset x1, . . . , x12 that may be modeled as the realization of a random sample X1, . . . , X12 from a U(0, θ) distribution, with θ unknown. Let M = max{X1, . . . , X12}.

a. Show that for 0 ≤ t ≤ 1

P(M/θ ≤ t) = t¹².

b. Use α = 0.1 and solve

P(M/θ ≤ cl) = P(M/θ ≥ cu) = α/2.

c. Suppose the realization of M is m = 3. Construct the 90% confidence interval for θ.

d. Derive the general expression for a confidence interval of level 1 − α based on a sample of size n.

24.10 Suppose we have a dataset x1, . . . , xn that may be modeled as the realization of a random sample X1, . . . , Xn from an Exp(λ) distribution, where λ is unknown. Let Sn = X1 + · · · + Xn.

a. Check that λSn has a Gam(n, 1) distribution.

b. The following quantiles of the Gam(20, 1) distribution are given: q0.05 = 13.25 and q0.95 = 27.88. Use these to construct a 90% confidence interval for λ when n = 20.
25 Testing hypotheses: essentials

The statistical methods that we have discussed until now have been developed to infer knowledge about certain features of the model distribution that represent our quantities of interest. These inferences often take the form of numerical estimates, as either single numbers or confidence intervals. However, sometimes the conclusion to be drawn is not expressed numerically, but is concerned with choosing between two conflicting theories, or hypotheses. For instance, one has to assess whether the lifetime of a certain type of ball bearing deviates or does not deviate from the lifetime guaranteed by the manufacturer of the bearings; an engineer wants to know whether dry drilling is faster or the same as wet drilling; a gynecologist wants to find out whether smoking affects or does not affect the probability of getting pregnant; the Allied Forces want to know whether the German war production is equal to or smaller than what Allied intelligence agencies reported. The process of formulating the possible conclusions one can draw from an experiment and choosing between two alternatives is known as hypothesis testing. In this chapter we start to explore this statistical methodology.

25.1 Null hypothesis and test statistic

We will introduce the basic concepts of hypothesis testing with an example. Let us return to the analysis of German war equipment. During World War II the Allied Forces received reports by the Allied intelligence agencies on German war production. The numbers of produced tires, tanks, and other equipment, as claimed in these reports, were a lot higher than indicated by the observed serial numbers. The objective was to decide whether the actual produced quantities were smaller than the ones reported. For simplicity suppose that we have observed tanks with (recoded) serial numbers

61  19  56  24  16.
Furthermore, suppose that the Allied intelligence agencies report a production of 350 tanks.¹ This is a lot more than we would surmise from the observed data. We want to choose between the proposition that the total number of tanks is 350 and the proposition that the total number is smaller than 350. The two competing propositions are called null hypothesis, denoted by H0, and alternative hypothesis, denoted by H1.

The way we go about choosing between H0 and H1 is conceptually similar to the way a jury deliberates in a court trial. The null hypothesis corresponds to the position of the defendant: just as he is presumed to be innocent until proven guilty, so is the null hypothesis presumed to be true until the data provide convincing evidence against it. The alternative hypothesis corresponds to the charges brought against the defendant.

To decide whether H0 is false we use a statistical model. As argued in Chapter 20 the (recoded) serial numbers are modeled as a realization of random variables X1, X2, . . . , X5 representing five draws without replacement from the numbers 1, 2, . . . , N. The parameter N represents the total number of tanks. The two hypotheses in question are

H0 : N = 350    H1 : N < 350.

If we reject the null hypothesis we will accept H1; we speak of rejecting H0 in favor of H1. Usually, the alternative hypothesis represents the theory or belief that we would like to accept if we do reject H0. This means that we must carefully choose H1 in relation with our interests in the problem at hand. In our example we are particularly interested in whether the number of tanks is less than 350; so we test the null hypothesis against H1 : N < 350. If we would be interested in whether the number of tanks differs from 350, or is greater than 350, we would test against H1 : N ≠ 350 or H1 : N > 350.

Quick exercise 25.1 In the drilling example from Sections 15.5 and 16.4 the data on drill times for dry drilling are modeled as a realization of a random sample from a distribution with expectation µ1, and similarly the data for wet drilling correspond to a distribution with expectation µ2. We want to know whether dry drilling is faster than wet drilling. To this end we test the null hypothesis H0 : µ1 = µ2 (the drill time is the same for both methods). What would you choose for H1?

The next step is to select a criterion based on X1, X2, . . . , X5 that provides an indication about whether H0 is false. Such a criterion involves a test statistic.

¹ This may seem ridiculous. However, when after the war official German production statistics became available, the average monthly production of tanks during the period 1940–1943 was 342. During the war this number was estimated at 327, whereas Allied intelligence reported 1550! (see [27]).
Test Statistic. Suppose the dataset is modeled as the realization of random variables X1, X2, . . . , Xn. A test statistic is any sample statistic T = h(X1, X2, . . . , Xn), whose numerical value is used to decide whether we reject H0.

In the tank example we use the test statistic

T = max{X1, X2, . . . , X5}.

Having chosen a test statistic T, we investigate what sort of values T can attain. These values can be viewed on a credibility scale for H0, and we must determine which of these values provide evidence in favor of H0, and which provide evidence in favor of H1. First of all note that if we find a value of T larger than 350, we immediately know that H0 as well as H1 is false. If this happens, we actually should be considering another testing problem, but for the current problem of testing H0 : N = 350 against H1 : N < 350 such values are irrelevant. Hence the possible values of T that are of interest to us are the integers from 5 to 350.

If H0 is true, then what is a typical value for T and what is not? Remember from Section 20.1 that, because n = 5, the expectation of T is E[T] = (5/6)(N + 1). This means that the distribution of T is centered around (5/6)(N + 1). Hence, if H0 is true, then typical values of T are in the neighborhood of (5/6) · 351 = 292.5. Values of T that deviate a lot from 292.5 are evidence against H0. Values that are much greater than 292.5 are evidence against H0 but provide even stronger evidence against H1. For such values we will not reject H0 in favor of H1. Also values a little smaller than 292.5 are grounds not to reject H0, because we are committed to giving H0 the benefit of the doubt. On the other hand, values of T very close to 5 should be considered as strong evidence against the null hypothesis and are in favor of H1, hence they lead to a decision to reject H0. This is summarized in Figure 25.1.

[Figure 25.1. Values of the test statistic T: values near 5 are in favor of H1, values around 292.5 are in favor of H0, and values above 350 are against both H0 and H1.]

Quick exercise 25.2 Another possible test statistic would be X̄5. If we use its values as a credibility scale for H0, then what are the possible values of X̄5, which values of X̄5 are in favor of H1 : N < 350, and which values are in favor of H0 : N = 350?
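The claim that T is centered around (5/6)(N + 1) under H0 can be checked by simulation. The following sketch is only an illustration (it assumes Python with NumPy and is not part of the book): it repeatedly draws five serial numbers without replacement with N = 350 and looks at the resulting maxima.

    import numpy as np

    rng = np.random.default_rng(2)
    N, n, reps = 350, 5, 100_000

    # Simulate T = max of 5 serial numbers drawn without replacement from 1..N
    T = np.array([rng.choice(np.arange(1, N + 1), size=n, replace=False).max()
                  for _ in range(reps)])

    print(T.mean())            # close to (5/6) * (N + 1) = 292.5
    print(np.mean(T <= 200))   # close to 0.0596 (see Section 25.3)
    print(np.mean(T <= 61))    # roughly 0.0001; compare the exact 0.00014 in Section 25.2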
For the data we find t = max{61, 19, 56, 24, 16} = 61 as the realization of the test statistic. How do we use this to decide on H0?

25.2 Tail probabilities

As we have just seen, if H0 is true, then typical values of T are in the neighborhood of (5/6) · 351 = 292.5. In view of Figure 25.1, the more a value of T is to the left, the stronger evidence it provides in favor of H1. The value 61 is in the left region of Figure 25.1. Can we now reject H0 and conclude that N is smaller than 350, or can the fact that we observe 61 as maximum be attributed to chance? In courtroom terminology: can we reach the conclusion that the null hypothesis is false beyond reasonable doubt?

One way to investigate this is to examine how likely it is that one would observe a value of T that provides even stronger evidence against H0 than 61, in the situation that N = 350. If this is very unlikely, then 61 already bears strong evidence against H0. Values of T that provide stronger evidence against H0 than 61 are to the left of 61. Therefore we compute P(T ≤ 61). In the situation that N = 350, the test statistic T is the maximum of 5 numbers drawn without replacement from 1, 2, . . . , 350. We find that

P(T ≤ 61) = P(max{X1, X2, . . . , X5} ≤ 61) = (61/350) · (60/349) · · · (57/346) = 0.00014.

This probability is so small that we view the value 61 as strong evidence against the null hypothesis. Indeed, if the null hypothesis would be true, then values of T that would provide the same or even stronger evidence against H0 than 61 are very unlikely to occur, i.e., they occur with probability 0.00014! In other words, the observed value 61 is exceptionally small in case H0 is true. At this point we can do two things: either we believe that H0 is true and that something very unlikely has happened, or we believe that events with such a small probability do not happen in practice, so that T ≤ 61 could only have occurred because H0 is false. We choose to believe that things happening with probability 0.00014 are so exceptional that we reject the null hypothesis H0 : N = 350 in favor of the alternative hypothesis H1 : N < 350. In courtroom terminology: we say that a value of T smaller than or equal to 61 implies that the null hypothesis is false beyond reasonable doubt.

P-values

In our example, the more a value of T is to the left, the stronger evidence it provides against H0. For this reason we computed the left tail probability
P(T ≤ 61). In other situations, the direction in which values of T provide stronger evidence against H0 may be to the right of the observed value t, in which case one would compute a right tail probability P(T ≥ t). In both cases the tail probability expresses how likely it is to obtain a value of the test statistic T at least as extreme as the value t observed for the data. Such a probability is called a p-value. In a way, the size of the p-value reflects how much evidence the observed value t provides against H0. The smaller the p-value, the stronger evidence the observed value t bears against H0.

The phrase "at least as extreme as the observed value t" refers to a particular direction, namely the direction in which values of T provide stronger evidence against H0 and in favor of H1. In our example, this was to the left of 61, and the p-value corresponding to 61 was P(T ≤ 61) = 0.00014. In this case it is clear what is meant by "at least as extreme as t" and which tail probability corresponds to the p-value. However, in some testing problems one can deviate from H0 in both directions. In such cases it may not be clear what values of T are at least as extreme as the observed value, and it may be unclear how the p-value should be computed. One approach to a solution in this case is to simply compute the one-tailed p-value that corresponds to the direction in which t deviates from H0.

Quick exercise 25.3 Suppose that the Allied intelligence agencies had reported a production of 80 tanks, so that we would test H0 : N = 80 against H1 : N < 80. Compute the p-value corresponding to 61. Would you conclude H0 is false beyond reasonable doubt?

25.3 Type I and type II errors

Suppose that the maximum is 200 instead of 61. This is also to the left of the expected value 292.5 of T. Is it far enough to the left to reject the null hypothesis? In this case the p-value is equal to

P(T ≤ 200) = P(max{X1, X2, . . . , X5} ≤ 200) = (200/350) · (199/349) · · · (196/346) = 0.0596.

This means that if the total number of produced tanks is 350, then in 5.96% of all cases we would observe a value of T that is at least as extreme as the value 200. Before we decide whether 0.0596 is small enough to reject the null hypothesis let us explore in more detail what the preceding probability stands for. It is important to distinguish between (1) the true state of nature: H0 is true or H1 is true and (2) our decision: we reject or do not reject H0 on the basis of the data. In our example the possibilities for the true state of nature are:

• H0 is true, i.e., there are 350 tanks produced.
• H1 is true, i.e., the number of tanks produced is less than 350.
We do not know in which situation we are. There are two possible decisions:

• We reject H0 in favor of H1.
• We do not reject H0.

This leads to four possible situations, which are summarized in Figure 25.2.

                                        True state of nature
                                     H0 is true          H1 is true
Our decision on      Reject H0       Type I error        Correct decision
the basis of the     Not reject H0   Correct decision    Type II error
data

Fig. 25.2. Four situations when deciding about H0.

There are two situations in which the decision made on the basis of the data is wrong. The null hypothesis H0 may be true, whereas the data lead to rejection of H0. On the other hand, the alternative hypothesis H1 may be true, whereas we do not reject H0 on the basis of the data. These wrong decisions are called type I and type II errors.

Type I and II errors. A type I error occurs if we falsely reject H0. A type II error occurs if we falsely do not reject H0.

In courtroom terminology, a type I error corresponds to convicting an innocent defendant, whereas a type II error corresponds to acquitting a criminal. If H0 : N = 350 is true, then the decision to reject H0 is a type I error. We will never know whether we make a type I error. However, given a particular decision rule, we can say something about the probability of committing a type I error. Suppose the decision rule would be "reject H0 : N = 350 whenever T ≤ 200." With this decision rule the probability of committing a type I error is

P(T ≤ 200) = 0.0596.

If we are willing to run the risk of committing a type I error with probability 0.0596, we could adopt this decision rule. This would also mean that on the basis of an observed maximum of 200 we would reject H0 in favor of H1 : N < 350.

Quick exercise 25.4 Suppose we adopt the following decision rule about the null hypothesis: "reject H0 : N = 350 whenever T ≤ 250." Using this decision rule, what is the probability of committing a type I error?
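The type I error probabilities used in this section, and the one asked for in Quick exercise 25.4, are all left tail probabilities of the same form. A minimal sketch of the exact computation (plain Python; the function name is ours and this is not part of the book):

    def p_max_at_most(c, N=350, n=5):
        # P(max of n draws without replacement from 1..N is at most c)
        prob = 1.0
        for i in range(n):
            prob *= (c - i) / (N - i)
        return prob

    print(p_max_at_most(61))    # 0.00014, the p-value for the observed maximum
    print(p_max_at_most(200))   # 0.0596, decision rule "reject whenever T <= 200"
    print(p_max_at_most(250))   # 0.1838, the rule considered in Quick exercise 25.4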
The question remains what amount of risk one is willing to take to falsely reject H0, or in courtroom terminology: how small should the p-value be to reach a conclusion that is "beyond reasonable doubt"? In many situations, as a rule of thumb 0.05 is used as the level where reasonable doubt begins. Something happening with probability less than or equal to 0.05 is then viewed as being too exceptional. However, there is no general rule that specifies how small the p-value must be to reject H0. There is no way to argue that this probability should be below 0.10 or 0.18 or 0.009—or anything else.

A possible solution is to solely report the p-value corresponding to the observed value of the test statistic. This is objective and does not have the arbitrariness of a preselected level such as 0.05. An investigator who reports the p-value conveys the maximum amount of information contained in the dataset and permits all decision makers to choose their own level and make their own decision about the null hypothesis. This is especially important when there is no justifiable reason for preselecting a particular value for such a level.

25.4 Solutions to the quick exercises

25.1 One is interested in whether dry drilling is faster than wet drilling. Hence if we reject H0 : µ1 = µ2, we would like to conclude that the drill time is smaller for dry drilling than for wet drilling. Since µ1 and µ2 represent the drill time for dry and wet drilling, we should choose H1 : µ1 < µ2.
25.2 The value of X̄5 is at least 3 and if we find a value of X̄5 that is larger than 348, then at least one of the five numbers must be greater than 350, so that we immediately know that H0 as well as H1 is false. Hence the possible values of X̄5 that are relevant for our testing problem are between 3 and 348. We know from Section 20.1 that 2X̄5 − 1 is an unbiased estimator for N, no matter what the value of N is. This implies that values of X̄5 itself are centered around (N + 1)/2. Hence values close to 351/2 = 175.5 are in favor of H0, whereas values close to 3 are in favor of H1. Values close to 348 are against H0, but also against H1. See Figure 25.3.

[Figure 25.3. Values of the test statistic X̄5: values near 3 are in favor of H1, values around 175.5 are in favor of H0, and values near 348 are against both H0 and H1.]

25.3 The p-value corresponding to 61 is now equal to

P(T ≤ 61) = (61/80) · (60/79) · · · (57/76) = 0.2475.
If H0 is true, then in 24.75% of the time one will observe a value T less than or equal to 61. Such values are not exceptionally small for T under H0, and therefore the evidence that the value 61 bears against H0 is pretty weak. We cannot reject H0 beyond reasonable doubt.

25.4 The type I error associated with the decision rule occurs if N = 350 (H0 is true) and t ≤ 250 (reject H0). The probability that this happens is

P(T ≤ 250) = (250/350) · (249/349) · · · (246/346) = 0.1838.

25.5 Exercises

25.1 In a study about train delays in The Netherlands one was interested in whether arrival delays of trains exhibit more variation during rush hours than during quiet hours. The observed arrival delays during rush hours are modeled as realizations of a random sample from a distribution with variance σ1², and similarly the observed arrival delays during quiet hours correspond to a distribution with variance σ2². One tests the null hypothesis H0 : σ1 = σ2. What do you choose as the alternative hypothesis?

25.2 On average, the number of babies born in Cleveland, Ohio, in the month of September is 1472. On January 26, 1977, the city was immobilized by a blizzard. Nine months later, in September 1977, the recorded number of births was 1718. Can the increase of 246 be attributed to chance? To investigate this, the number of births in the month of September is modeled by a Poisson random variable with parameter µ, and we test H0 : µ = 1472. What would you choose as the alternative hypothesis?

25.3 Recall Exercise 17.9 about black cherry trees. The scatterplot of y (volume) versus x = d²h (squared diameter times height) seems to indicate that the regression line y = α + βx runs through the origin. One wants to investigate whether this is true by means of a testing problem. Formulate a null hypothesis and alternative hypothesis in terms of (one of) the parameters α and β.

25.4 Consider the example from Section 4.4 about the number of cycles up to pregnancy of smoking and nonsmoking women. Suppose the observed number of cycles are modeled as realizations of random samples from geometric distributions. Let p1 be the parameter of the geometric distribution corresponding to smoking women and p2 be the parameter for the nonsmoking women. We are interested in whether p1 is different from p2, and we investigate this by testing H0 : p1 = p2 against H1 : p1 ≠ p2.

a. If the data are as given in Exercise 17.5, what would you choose as a test statistic?
b. What would you choose as a test statistic, if you were given the extra knowledge as in Table 21.1?

c. Suppose we are interested in whether smoking women are less likely to get pregnant than nonsmoking women. What is the appropriate alternative hypothesis in this case?

25.5 Suppose a dataset is a realization of a random sample X1, X2, . . . , Xn from a uniform distribution on [0, θ], for some (unknown) θ > 0. We test H0 : θ = 5 versus H1 : θ ≠ 5.

a. We take T1 = max{X1, X2, . . . , Xn} as our test statistic. Specify what the (relevant) possible values are for T1 and which are in favor of H0 and which are in favor of H1. For instance, make a picture like Figure 25.1.

b. Same as a, but now for test statistic T2 = |2X̄n − 5|.

25.6 To test a certain null hypothesis H0 one uses a test statistic T with a continuous sampling distribution. One agrees that H0 is rejected if one observes a value t of the test statistic for which (under H0) the right tail probability P(T ≥ t) is smaller than or equal to 0.05. Given below are different values t and a corresponding left or right tail probability (under H0). Specify for each case what the p-value is, if possible, and whether we should reject H0.

a. t = 2.34 and P(T ≥ 2.34) = 0.23.
b. t = 2.34 and P(T ≤ 2.34) = 0.23.
c. t = 0.03 and P(T ≥ 0.03) = 0.968.
d. t = 1.07 and P(T ≤ 1.07) = 0.981.
e. t = 1.07 and P(T ≤ 2.34) = 0.01.
f. t = 2.34 and P(T ≤ 1.07) = 0.981.
g. t = 2.34 and P(T ≤ 1.07) = 0.800.

25.7 (Exercise 25.2 continued). The number of births in September is modeled by a Poisson random variable T with parameter µ, which represents the expected number of births. Suppose that one uses T to test the null hypothesis H0 : µ = 1472 and that one decides to reject H0 on the basis of observing the value t = 1718.

a. In which direction do values of T provide evidence against H0 (and in favor of H1)?

b. Compute the p-value corresponding to t = 1718, where you may use the fact that the distribution of T can be approximated by an N(µ, µ) distribution.

25.8 Suppose we want to test the null hypothesis that our dataset is a realization of a random sample from a standard normal distribution. As test statistic we use the Kolmogorov-Smirnov distance between the empirical distribution function Fn of the data and the distribution function Φ of the standard normal:

T = sup over a ∈ R of |Fn(a) − Φ(a)|.

What are the possible values of T and in which direction do values of T deviate from the null hypothesis?

25.9 Recall the example from Section 18.3, where we investigated whether the software data are exponential by means of the Kolmogorov-Smirnov distance between the empirical distribution function Fn of the data and the estimated exponential distribution function:

Tks = sup over a ∈ R of |Fn(a) − (1 − e^(−Λ̂a))|.

For the data we found tks = 0.176. By means of a new parametric bootstrap we simulated 100 000 realizations of Tks and found that all of them are smaller than 0.176. What can you say about the p-value corresponding to 0.176?

25.10 Consider the coal data from Table 23.1, where 23 gross calorific value measurements are listed for Osterfeld coal coded 262DE27. We modeled this dataset as a realization of a random sample from a normal distribution with expectation µ unknown and standard deviation 0.1 MJ/kg. We are planning to buy a shipment if the gross calorific value exceeds 23.75 MJ/kg. In order to decide whether this is sensible, we test the null hypothesis H0 : µ = 23.75 with test statistic X̄n.

a. What would you choose as the alternative hypothesis?

b. For the dataset x̄n is 23.788. Compute the corresponding p-value, using that X̄n has an N(23.75, (0.1)²/23) distribution under the null hypothesis.

25.11 One is given a number t, which is the realization of a random variable T with an N(µ, 1) distribution. To test H0 : µ = 0 against H1 : µ ≠ 0, one uses T as the test statistic. One decides to reject H0 in favor of H1 if |t| ≥ 2. Compute the probability of committing a type I error.
26 Testing hypotheses: elaboration

In the previous chapter we introduced the setup for testing a null hypothesis against an alternative hypothesis using a test statistic T. The notions of type I error and type II error were introduced. A type I error occurs when we falsely reject H0 on the basis of the observed value of T, whereas a type II error occurs when we falsely do not reject H0. The decision to reject H0 or not was based on the size of the p-value. In this chapter we continue the introduction of basic concepts of testing hypotheses, such as significance level and critical region, and investigate the probability of committing a type II error.

26.1 Significance level

As mentioned in the previous chapter, there is no general rule that specifies a level below which the p-value is considered exceptionally small. However, there are situations where this level is set a priori, and the question is: which values of the test statistic should then lead to rejection of H0? To illustrate this, consider the following example. The speed limit on freeways in The Netherlands is 120 kilometers per hour. A device next to freeway A2 between Amsterdam and Utrecht measures the speed of passing vehicles. Suppose that the device is designed in such a way that it conducts three measurements of the speed of a passing vehicle, modeled by a random sample X1, X2, X3. On the basis of the value of the average X̄3, the driver is either fined for speeding or not. For what values of X̄3 should we fine the driver, if we allow that 5% of the drivers are fined unjustly?

Let us rephrase things in terms of a testing problem. Each measurement can be thought of as

measurement = true speed + measurement error.

Suppose for the moment that the measuring device is carefully calibrated, so that the measurement error is modeled by a random variable with mean zero
and known variance σ², say σ² = 4. Moreover, in physical experiments such as this one, the measurement error is often modeled by a random variable with a normal distribution. In that case, the measurements X1, X2, X3 are modeled by a random sample from an N(µ, 4) distribution, where the parameter µ represents the true speed of the passing vehicle. Our testing problem can now be formulated as testing

H0 : µ = 120   against   H1 : µ > 120,

with test statistic

T = (X1 + X2 + X3)/3 = X̄3.

Since sums of independent normal random variables again have a normal distribution (see Remark 11.2), it follows that X̄3 has an N(µ, 4/3) distribution. In particular, the distribution of T = X̄3 is centered around µ no matter what the value of µ is. Values of T close to 120 are therefore in favor of H0. Values of T that are far from 120 are considered as strong evidence against H0. Values much larger than 120 suggest that µ > 120 and are therefore in favor of H1. Values much smaller than 120 suggest that µ < 120. They also constitute evidence against H0, but even stronger evidence against H1. Thus we reject H0 in favor of H1 only for values of T larger than 120. See also Figure 26.1.

[Figure 26.1. Possible values of T = X̄3: only values to the right of 120 are in favor of H1.]

Rejection of H0 in favor of H1 corresponds to fining the driver for speeding. Unjustly fining a driver corresponds to falsely rejecting H0, i.e., committing a type I error. Since we allow 5% of the drivers to be fined unjustly, we are dealing with a testing problem where the probability of committing a type I error is set a priori at 0.05. The question is: for which values of T should we reject H0? The decision rule for rejecting H0 should be such that the corresponding probability of committing a type I error is 0.05. The value 0.05 is called the significance level.

Significance level. The significance level is the largest acceptable probability of committing a type I error and is denoted by α, where 0 < α < 1.

We speak of "performing the test at level α," as well as "rejecting H0 in favor of H1 at level α." In our example we are testing H0 : µ = 120 against H1 : µ > 120 at level 0.05.
Quick exercise 26.1 Suppose that in the freeway example H0 : µ = 120 is rejected in favor of H1 : µ > 120 at level α = 0.05. Will it necessarily be rejected at level α = 0.01? On the other hand, suppose that H0 : µ = 120 is rejected in favor of H1 : µ > 120 at level α = 0.01. Will it necessarily be rejected at level α = 0.05?

Let us continue with our example and determine for which values of T = X̄3 we should reject H0 at level α = 0.05 in favor of H1 : µ > 120. Suppose we decide to fine each driver whose recorded average speed is 121 or more, i.e., we reject H0 whenever T ≥ 121. Then how large is the probability of a type I error P(T ≥ 121)? When H0 : µ = 120 is true, T = X̄3 has an N(120, 4/3) distribution, so that by the change-of-units rule for the normal distribution (see page 106), the random variable

Z = (T − 120)/(2/√3)

has an N(0, 1) distribution. This implies that

P(T ≥ 121) = P((T − 120)/(2/√3) ≥ (121 − 120)/(2/√3)) = P(Z ≥ 0.87).

From Table B.1, we find P(Z ≥ 0.87) = 0.1922, which means that the probability of a type I error is greater than the significance level α = 0.05. Since this level was defined as the largest acceptable probability of a type I error, we cannot use this decision rule. Similarly, if we decide to reject H0 whenever we record an average of 122 or more, the probability of a type I error equals 0.0416 (check this). This is smaller than α = 0.05, so this decision rule is allowed. The boundary case is the value c that satisfies P(T ≥ c) = 0.05. To find c, we must solve

P(Z ≥ (c − 120)/(2/√3)) = 0.05.

From Table B.2 we have that z0.05 = t∞,0.05 = 1.645, so that we find

(c − 120)/(2/√3) = 1.645,

which leads to

c = 120 + 1.645 · 2/√3 = 121.9.

Hence, if we set the significance level α at 0.05, we should reject H0 : µ = 120 in favor of H1 : µ > 120 whenever T ≥ 121.9. For our freeway example this means that if the average recorded speed of a passing vehicle is greater than or equal to 121.9, then the driver is fined for speeding. With this decision rule, at most 5% of the drivers get fined unjustly.
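The tail probabilities and the critical value in this derivation can be reproduced numerically. The following is a minimal sketch, not taken from the book, assuming SciPy is available.

```python
from math import sqrt
from scipy.stats import norm

se = 2 / sqrt(3)                  # standard deviation of X-bar_3 under H0: N(120, 4/3)

print(norm.sf(121, loc=120, scale=se))   # P(T >= 121), about 0.19: too large for level 0.05
print(norm.sf(122, loc=120, scale=se))   # P(T >= 122), about 0.0416: below 0.05

c = norm.ppf(0.95, loc=120, scale=se)    # boundary case: P(T >= c) = 0.05
print(c)                                 # about 121.9
```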
In connection with p-values: the significance level is the level below which the p-value is sufficiently small to reject H0. Indeed, for any observed value t ≥ 121.9 we reject H0, and the p-value for such a t is at most 0.05:

P(T ≥ t) ≤ P(T ≥ 121.9) = 0.05.

We will see more about this relation in the next section.

26.2 Critical region and critical values

In the freeway example the significance level 0.05 corresponds to the decision rule "reject H0 : µ = 120 in favor of H1 : µ > 120 whenever T ≥ 121.9." The set K = [121.9, ∞) consisting of values of the test statistic T for which we reject H0 is called the critical region. The value 121.9, which is the boundary case between rejecting and not rejecting H0, is called the critical value.

Critical region and critical values. Suppose we test H0 against H1 at significance level α by means of a test statistic T. The set K ⊂ R that corresponds to all values of T for which we reject H0 in favor of H1 is called the critical region. Values on the boundary of the critical region are called critical values.

The precise shape of the critical region depends on both the chosen significance level α and the test statistic T that is used. But it will always be such that the probability that T ∈ K satisfies

P(T ∈ K) ≤ α in the case that H0 is true.

At this point it becomes important to emphasize whether probabilities are computed under the assumption that H0 is true. With a slight abuse of notation, we briefly write P(T ∈ K | H0) for this probability.

Relation with p-values

If we record average speed t = 124, then this value falls in the critical region K = [121.9, ∞), so that H0 : µ = 120 is rejected in favor of H1 : µ > 120. On the other hand, we can also compute the p-value corresponding to the observed value 124. Since values of T to the right provide stronger evidence against H0, the p-value is the right tail probability

P(T ≥ 124 | H0) = P((T − 120)/(2/√3) ≥ (124 − 120)/(2/√3)) = P(Z ≥ 3.46) = 0.0003,

which is smaller than the significance level 0.05. This is no coincidence.
In general, suppose that we perform a test at level α using test statistic T and that we have observed t as the value of our test statistic. Then

t ∈ K ⇔ the p-value corresponding to t is less than or equal to α.

Figure 26.2 illustrates this for a testing problem where values of T to the right provide evidence against H0 and in favor of H1. In that case, the p-value corresponds to the right tail probability P(T ≥ t | H0). The shaded area to the right of cα corresponds to α = P(T ≥ cα | H0), whereas the more intensely shaded area to the right of t represents the p-value. We see that deciding whether to reject H0 at a given significance level α can be done by comparing either t with cα or the p-value with α. For this reason the p-value is sometimes called the observed significance level.

[Figure 26.2. P-value and critical value: the sampling distribution of T under H0, with critical region K = [cα, ∞) and the observed value t to the right of cα.]

The concepts of critical value and p-value have their own merit. The critical region and the corresponding critical values specify exactly what values of T lead to rejection of H0 at a given level α. This can be done even without obtaining a dataset and computing the value t of the test statistic. The p-value, on the other hand, represents the strength of the evidence the observed value t bears against H0.
But it does not specify all values of T that lead to rejection of H0 at a given level α. Quick exercise 26.2 In our freeway example, we have already computed the relevant tail probability to decide whether a person with recorded average speed t = 124 gets fined if we set the significance level at 0.05. Suppose the significance level is set at α = 0.01 (we allow 1% of the drivers to get fined unjustly). Determine whether a person with recorded average speed t = 124 gets fined (H0 : µ = 120 is rejected). Furthermore, determine the critical region in this case.
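The equivalence between the two ways of deciding can be checked directly for the freeway example at level 0.05: the observed value t = 124 lies in K = [121.9, ∞) and at the same time its p-value lies below 0.05. A minimal sketch, not part of the book and assuming SciPy is available:

```python
from math import sqrt
from scipy.stats import norm

se = 2 / sqrt(3)                              # standard deviation of T under H0
c_alpha = norm.ppf(0.95, loc=120, scale=se)   # critical value 121.9 at level 0.05

t_obs = 124.0
p_value = norm.sf(t_obs, loc=120, scale=se)   # right tail probability, about 0.0003

print(t_obs >= c_alpha)        # True: t falls in the critical region
print(p_value <= 0.05)         # True: p-value at most alpha, the same decision
```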
Sometimes the critical region K can be constructed such that P(T ∈ K | H0) is exactly equal to α, as in the freeway example. However, when the distribution of T is discrete, this is not always possible. This is illustrated by the next example.

After the introduction of the Euro, Polish mathematicians claimed that the Belgian 1 Euro coin is not a fair coin (see, for instance, the New Scientist, January 4, 2002). Suppose we put a 1 Euro coin to the test. We will throw it ten times and record X, the number of heads. Then X has a Bin(10, p) distribution, where p denotes the probability of heads. We would like to find out whether p differs from 1/2. Therefore we test

H0 : p = 1/2 (the coin is fair) against H1 : p ≠ 1/2 (the coin is not fair).

We use X as the test statistic. When we set the significance level α at 0.05, for what values of X will we reject H0 and conclude that the coin is not fair?

Let us first find out what values of X are in favor of H1. If H0 : p = 1/2 is true, then E[X] = 10 · 1/2 = 5, so that values of X close to 5 are in favor of H0. Values close to 10 suggest that p > 1/2 and values close to 0 suggest that p < 1/2. Hence, both values close to 0 and values close to 10 are in favor of H1 : p ≠ 1/2.

[Diagram: possible values of X between 0 and 10; values near 0 and values near 10 are in favor of H1.]

This means that we will reject H0 in favor of H1 whenever X ≤ cl or X ≥ cu. Therefore, the critical region is the set

K = {0, 1, . . . , cl} ∪ {cu, . . . , 9, 10}.

The boundary values cl and cu are called the left and right critical values. They must be chosen such that the critical region K is as large as possible and still satisfies

P(X ∈ K | H0) = P(X ≤ cl | p = 1/2) + P(X ≥ cu | p = 1/2) ≤ 0.05.

Here P(X ≥ cu | p = 1/2) denotes the probability P(X ≥ cu) computed with X having a Bin(10, 1/2) distribution. Since we have no preference for rejecting H0 for values close to 0 or close to 10, we divide 0.05 over the two sides, and we choose cl as large as possible and cu as small as possible such that

P(X ≤ cl | p = 1/2) ≤ 0.025 and P(X ≥ cu | p = 1/2) ≤ 0.025.
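These two boundary values can also be found numerically from the Bin(10, 1/2) tail probabilities. The sketch below is not part of the book and assumes SciPy; the table that follows lists the same left tail probabilities.

```python
from scipy.stats import binom

n, p0, alpha = 10, 0.5, 0.05

# largest cl with P(X <= cl) <= alpha/2 and smallest cu with P(X >= cu) <= alpha/2
cl = max(k for k in range(n + 1) if binom.cdf(k, n, p0) <= alpha / 2)
cu = min(k for k in range(n + 1) if binom.sf(k - 1, n, p0) <= alpha / 2)

print(cl, cu)                                           # 1 and 9
print(binom.cdf(cl, n, p0) + binom.sf(cu - 1, n, p0))   # P(X in K | H0), about 0.0215
```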
Table 26.1. Left tail probabilities P(X ≤ k) of the Bin(10, 1/2) distribution.

k   P(X ≤ k)    k   P(X ≤ k)
0   0.00098     6   0.82813
1   0.01074     7   0.94531
2   0.05469     8   0.98926
3   0.17188     9   0.99902
4   0.37696     10  1.00000
5   0.62305

The left tail probabilities of the Bin(10, 1/2) distribution are listed in Table 26.1. We immediately see that cl = 1 is the largest value such that P(X ≤ cl | p = 1/2) ≤ 0.025. Similarly, cu = 9 is the smallest value such that P(X ≥ cu | p = 1/2) ≤ 0.025. Indeed, when X has a Bin(10, 1/2) distribution,

P(X ≥ 9) = 1 − P(X ≤ 8) = 1 − 0.98926 = 0.01074,
P(X ≥ 8) = 1 − P(X ≤ 7) = 1 − 0.94531 = 0.05469.

Hence, if we test H0 : p = 1/2 against H1 : p ≠ 1/2 at level α = 0.05, the critical region is the set K = {0, 1, 9, 10}. The corresponding probability of a type I error is

P(X ∈ K) = P(X ≤ 1) + P(X ≥ 9) = 0.01074 + 0.01074 = 0.02148,

which is smaller than the significance level. You may perform ten throws with your favorite coin and see whether the number of heads falls in the critical region.

Quick exercise 26.3 Recall the tank example where we tested H0 : N = 350 against H1 : N < 350 by means of the test statistic T = max Xi. Suppose that we perform the test at level 0.05. Deduce the critical region K corresponding to level 0.05 from the left tail probabilities given here:

k               195     194     193     192     191
P(T ≤ k | H0)   0.0525  0.0511  0.0498  0.0485  0.0472

Is P(T ∈ K | H0) = 0.05?

One- and two-tailed p-values

In the Euro coin example, we deviate from H0 : p = 1/2 in two directions: values of X both far to the right and far to the left of 5 are evidence against H0. Suppose that in ten throws with the 1 Euro coin we recorded x heads. What would the p-value be corresponding to x? The problem is that the direction in which values of X are at least as extreme as the observed value x depends on whether x lies to the right or to the left of 5.
At this point there are two natural solutions. One may report the appropriate left or right tail probability, which corresponds to the direction in which x deviates from H0. For instance, if x lies to the right of 5, we compute P(X ≥ x | H0). This is called a one-tailed p-value. The disadvantage of one-tailed p-values is that they are somewhat misleading about how strong the evidence of the observed value x bears against H0. In view of the relation between rejection on the basis of critical values or on the basis of a p-value, the one-tailed p-value should be compared to α/2. On the other hand, since people are inclined to compare p-values with the significance level α itself, one could also double the one-tailed p-value and compare this with α. This double tail probability is called a two-tailed p-value. It doesn't make much of a difference, as long as one also reports whether the reported p-value is one-tailed or two-tailed.

Let us illustrate things by means of the findings of the Polish mathematicians. They performed 250 throws with a Belgian 1 Euro coin and recorded heads 140 times (see also Exercise 24.2). The question is whether this provides strong enough evidence against H0 : p = 1/2. The observed value 140 is to the right of 125, the value we would expect if H0 is true. Hence the one-tailed p-value is P(X ≥ 140), where now X has a Bin(250, 1/2) distribution. By means of the normal approximation (see page 201), we find

P(X ≥ 140) = P((X − 125)/√(250 · ¼) ≥ (140 − 125)/√(250 · ¼)) ≈ P(Z ≥ 1.90) = 1 − Φ(1.90) = 0.0287.

Therefore the two-tailed p-value is approximately 0.0574, which does not provide very strong evidence against H0. In fact, the exact two-tailed p-value, computed by means of statistical software, is 0.066, which is even larger.

Quick exercise 26.4 In a Dutch newspaper (De Telegraaf, January 3, 2002) it was reported that the Polish mathematicians recorded heads 150 times. What are the one- and two-tailed p-values in this case? Do they now have a case?

26.3 Type II error

As we have just seen, by setting a significance level α, we are able to control the probability of committing a type I error; it will be at most α. For instance, let us return to the freeway example and suppose that we adopt the decision rule to fine the driver for speeding if her average observed speed is at least 121.9, i.e., reject H0 : µ = 120 in favor of H1 : µ > 120 whenever T = X̄3 ≥ 121.9.
From Section 26.1 we know that with this decision rule, the probability of a type I error is 0.05. What is the probability of committing a type II error? This corresponds to the percentage of drivers whose true speed is above 120 but who do not get fined because their recorded average speed is below 121.9. For instance, suppose that a car passes at true speed µ = 125. A type II error occurs when T < 121.9, and since T = X̄3 has an N(125, 4/3) distribution, the probability that this happens is

P(T < 121.9 | µ = 125) = P((T − 125)/(2/√3) < (121.9 − 125)/(2/√3)) = Φ(−2.68) = 0.0036.

This looks promising, but now consider a vehicle passing at true speed µ = 123. The probability of committing a type II error in this case is

P(T < 121.9 | µ = 123) = P((T − 123)/(2/√3) < (121.9 − 123)/(2/√3)) = Φ(−0.95) = 0.1711.

Hence 17.11% of all drivers that pass at speed µ = 123 will not get fined. In Figure 26.3 the last situation is illustrated. The curve on the left represents the probability density of the N(120, 4/3) distribution, which is the distribution of T under the null hypothesis. The shaded area to the right of 121.9 represents the probability of committing a type I error

P(T ≥ 121.9 | µ = 120) = 0.05.

The curve on the right is the probability density of the N(123, 4/3) distribution, which is the distribution of T under the alternative µ = 123. The shaded area to the left of 121.9 represents the probability of a type II error

[Figure 26.3. Type I and type II errors in the freeway example: the sampling distributions of T when H0 is true and when µ = 123, separated at the critical value 121.9 into "do not reject H0" and "reject H0".]
P(T < 121.9 | µ = 123) = 0.1711.

Shifting µ further to the right will result in a smaller probability of a type II error. However, shifting µ toward the value 120 leads to a larger probability of a type II error. In fact it can be arbitrarily close to 0.95.

The previous example illustrates that the probability of committing a type II error depends on the actual value of µ in the alternative hypothesis H1 : µ > 120. The closer µ is to 120, the higher the probability of a type II error will be. In contrast with the probability of a type I error, which is always at most α, the probability of a type II error may be arbitrarily close to 1 − α. This is illustrated in the next quick exercise.

Quick exercise 26.5 What is the probability of a type II error in the freeway example if µ = 120.1?

26.4 Relation with confidence intervals

When testing H0 : µ = 120 against H1 : µ > 120 at level 0.05 in the freeway example, the critical value was obtained by the formula

c0.05 = 120 + 1.645 · 2/√3.

On the other hand, using that X̄3 has an N(µ, 4/3) distribution, a 95% lower confidence bound for µ in this case can be derived from

ln = x̄3 − 1.645 · 2/√3.

Although, at first sight, testing hypotheses and constructing confidence intervals seem to be two separate statistical procedures, they are in fact intimately related. In the freeway example, observe that for a given dataset x1, x2, x3, we reject H0 : µ = 120 in favor of H1 : µ > 120 at level 0.05

⇔ x̄3 ≥ 120 + 1.645 · 2/√3
⇔ x̄3 − 1.645 · 2/√3 ≥ 120
⇔ 120 is not in the 95% one-sided confidence interval for µ.

This is not a coincidence. In general, the following applies.

Suppose that for some parameter θ we test H0 : θ = θ0. Then we reject H0 : θ = θ0 in favor of H1 : θ > θ0 at level α if and only if θ0 is not in the 100(1 − α)% one-sided confidence interval for θ.
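This equivalence is easy to check numerically for the freeway example. The sketch below is not from the book; it assumes NumPy is available and applies both decision rules to the same simulated datasets, confirming that they always agree.

```python
import numpy as np

rng = np.random.default_rng(2)
z, se = 1.645, 2 / np.sqrt(3)       # z_0.05 and the standard deviation of X-bar_3

for _ in range(5):
    x = rng.normal(123, 2, size=3)             # three measurements, true speed 123
    xbar = x.mean()
    reject = xbar >= 120 + z * se              # test: T >= 121.9
    lower_bound = xbar - z * se                # 95% lower confidence bound l_n
    outside_ci = 120 <= lower_bound            # 120 not in (l_n, infinity)
    print(round(xbar, 2), reject, outside_ci)  # the two decisions coincide
```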
The same relation holds for testing against H1 : θ < θ0, and a similar relation holds between testing against H1 : θ ≠ θ0 and two-sided confidence intervals: we reject H0 : θ = θ0 in favor of H1 : θ ≠ θ0 at level α if and only if θ0 is not in the 100(1 − α)% two-sided confidence region for θ. In fact, one could use these facts to define the 100(1 − α)% confidence region for a parameter θ as the set of values θ0 for which the null hypothesis H0 : θ = θ0 is not rejected at level α.

It should be emphasized that these relations only hold if the random variable that is used to construct the confidence interval relates appropriately to the test statistic. For instance, the preceding relations do not hold if, on the one hand, we construct a confidence interval for the parameter µ of an N(µ, σ²) distribution by means of the studentized mean (X̄n − µ)/(Sn/√n), and on the other hand, use the sample median Medn to test a null hypothesis for µ.

26.5 Solutions to the quick exercises

26.1 In the first situation, we reject at significance level α = 0.05, which means that the probability of committing a type I error is at most 0.05. This does not necessarily mean that this probability will also be less than or equal to 0.01. Therefore with this information we cannot know whether we also reject at level α = 0.01. In the reversed situation, if we reject at level α = 0.01, then the probability of committing a type I error is at most 0.01, and is therefore also smaller than 0.05. This means that we also reject at level α = 0.05.

26.2 To decide whether we should reject H0 : µ = 120 at level 0.01, we could compute P(T ≥ 124 | H0) and compare this with 0.01. We have already seen that P(T ≥ 124 | H0) = 0.0003. This is (much) smaller than the significance level α = 0.01, so we should reject. The critical region is K = [c, ∞), where we must solve c from

P(Z ≥ (c − 120)/(2/√3)) = 0.01.

Since z0.01 = 2.326, this means that c = 120 + 2.326 · (2/√3) = 122.7.

26.3 The critical region is of the form K = {5, 6, . . . , c}, where the critical value c is the largest value for which P(T ≤ c | H0) is still less than or equal to 0.05. From the table we immediately see that c = 193 and that P(T ∈ K | H0) = P(T ≤ 193 | H0) = 0.0498, which is not equal to 0.05.
  • 396. 394 26 Testing hypotheses: elaboration 26.4 By means of the normal approximation, for the one-tailed p-value we find P(X ≥ 150) = P ⎛ ⎝ X − 125 1 4 √ 250 ≥ 150 − 125 1 4 √ 250 ⎞ ⎠ = P(Zn ≥ 3.16) ≈ 1 − Φ(3.16) = 0.0008. The two-tailed p-value is 0.0016. This is a lot smaller than the two-tailed p- value 0.0574, corresponding to 140 heads. It seems that with 150 heads the mathematicians would have a case; the Belgian Euro coin would then appear not to be fair. 26.5 The probability of a type II error is P(T 121.9 | µ = 120.1) = P T − 120.1 2/ √ 3 121.9 − 120.1 2/ √ 3 = Φ(1.56) = 0.9406. 26.6 Exercises 26.1 Polygraphs that are used in criminal investigations are supposed to in- dicate whether a person is lying or telling the truth. However the procedure is not infallible, as is illustrated by the following example. An experienced polygraph examiner was asked to make an overall judgment for each of a total 280 records, of which 140 were from guilty suspects and 140 from inno- cent suspects. The results are listed in Table 26.2. We view each judgment as a problem of hypothesis testing, with the null hypothesis corresponding to “suspect is innocent” and the alternative hypothesis to “suspect is guilty.” Estimate the probabilities of a type I error and a type II error that apply to this polygraph method on the basis of Table 26.2. 26.2 Consider the testing problem in Exercise 25.11. Compute the probability of committing a type II error if the true value of µ is 1. 26.3 One generates a number x from a uniform distribution on the interval [0, θ]. One decides to test H0 : θ = 2 against H1 : θ = 2 by rejecting H0 if x ≤ 0.1 or x ≥ 1.9. a. Compute the probability of committing a type I error. b. Compute the probability of committing a type II error if the true value of θ is 2.5. 26.4 To investigate the hypothesis that a horse’s chances of winning an eight- horse race on a circular track are affected by its position in the starting lineup,
  • 397. 26.6 Exercises 395 Table 26.2. Examiners and suspects. Suspect’s true status Innocent Guilty Acquitted 131 15 Examiner’s assesment Convicted 9 125 Source: F.S. Horvath and J.E. Reid. The reliability of polygraph examiner diagnosis of truth and deception. Journal of Criminal Law, Criminology, and Police Science, 62(2):276–281, 1971. the starting position of each of 144 winners was recorded ([30]). It turned out that 29 of these winners had starting position one (closest to the rail on the inside track). We model the number of winners with starting position one by a random variable T with a Bin(144, p) distribution. We test the hypothesis H0 : p = 1/8 against H1 : p 1/8 at level α = 0.01 with T as test statistic. a. Argue whether the test procedure involves a right critical value, a left critical value, or both. b. Use the normal approximation to compute the critical value(s) correspond- ing to α = 0.01, determine the critical region, and report your conclusion about the null hypothesis. 26.5 Recall Exercises 23.5 and 24.8 about the 1500 m speed-skating results in the 2002 Winter Olympic Games. The number of races won by skaters starting in the outer lane is modeled by a random variable X with a Bin(23, p) distribution. The question of whether there is an outer lane advantage was investigated in Exercise 24.8 by means of constructing confidence intervals using the normal approximation. In this exercise we examine this question by testing the null hypothesis H0 : p = 1/2 against H1 : p 1/2 using X as the test statistic. The distribution of X under H0 is given in Table 26.3. Out of 23 completed races, 15 were won by skaters starting in the outer lane. a. Compute the p-value corresponding to x = 15 and report your conclusion if we perform the test at level 0.05. Does your conclusion agree with the confidence interval you found for p in Exercise 24.8 b? b. Determine the critical region corresponding to significance level α = 0.05. c. Compute the probability of committing a type I error if we base our decision rule on the critical region determined in b.
  • 398. 396 26 Testing hypotheses: elaboration Table 26.3. Left tail probabilities for the Bin(23, 1 2 ) distribution. k P(X ≤ k) k P(X ≤ k) k P(X ≤ k) 0 0.0000 8 0.1050 16 0.9827 1 0.0000 9 0.2024 17 0.9947 2 0.0000 10 0.3388 18 0.9987 3 0.0002 11 0.5000 19 0.9998 4 0.0013 12 0.6612 20 1.0000 5 0.0053 13 0.7976 21 1.0000 6 0.0173 14 0.8950 22 1.0000 7 0.0466 15 0.9534 23 1.0000 d. Use the normal approximation to determine the probability of committing a type II error for the case p = 0.6, if we base our decision rule on the critical region determined in b. 26.6 Consider Exercises 25.2 and 25.7. One decides to test H0 : µ = 1472 against H1 : µ 1472 at level α = 0.05 on the basis of the recorded value 1718 of the test statistic T . a. Argue whether the test procedure involves a right critical value, a left critical value, or both. b. Use the fact that the distribution of T can be approximated by an N(µ, µ) distribution to determine the critical value(s) and the critical region, and report your conclusion about the null hypothesis. 26.7 A random sample X1, X2 is drawn from a uniform distribution on the interval [0, θ]. We wish to test H0 : θ = 1 against H1 : θ 1 by rejecting if X1 + X2 ≤ c. Find the value of c and the critical region that correspond to a level of significance 0.05. Hint: use Exercise 11.5. 26.8 This exercise is meant to illustrate that the shape of the critical region is not necessarily similar to the type of alternative hypothesis. The type of alternative hypothesis and the test statistic used determine the shape of the critical region. Suppose that X1, X2, . . . , Xn form a random sample from an Exp(λ) distri- bution, and we test H0 : λ = 1 with test statistics T = X̄n and T = e−X̄n . a. Suppose we test the null hypothesis against H1 : λ 1. Determine for both test procedures whether they involve a right critical value, a left critical value, or both. b. Same question as in part a, but now test against H1 : λ = 1.
  • 399. 26.6 Exercises 397 26.9 Similar to Exercise 26.8, but with a random sample X1, X2, . . . , Xn from an N(µ, 1) distribution. We test H0 : µ = 0 with test statistics T = (X̄n)2 and T = 1/X̄n. a. Suppose that we test the null hypothesis against H1 : µ = 0. Determine the shape of the critical region for both test procedures. b. Same question as in part a, but now test against H1 : µ 0.
  • 400. 27 The t-test In many applications the quantity of interest can be represented by the ex- pectation of the model distribution. In some of these applications one wants to know whether this expectation deviates from some a priori specified value. This can be investigated by means of a statistical test, known as the t-test. We consider this test both under the assumption that the model distribution is normal and without the assumption of normality. Furthermore, we discuss a similar test for the slope and the intercept in a simple linear regression model. 27.1 Monitoring the production of ball bearings A production line in a large industrial corporation are set to produce a spe- cific type of steel ball bearing with a diameter of 1 millimeter. In order to check the performance of the production lines, a number of ball bearings are picked at the end of the day and their diameters are measured. Suppose we ob- serve 20 diameters of ball bearings from the production lines, which are listed in Table 27.1. The average diameter is x̄20 = 1.03 millimeter. This clearly deviates from the target value 1, but the question is whether the difference can be attributed to chance or whether it is large enough to conclude that the production line is producing ball bearings with a wrong diameter. To an- swer this question, we model the dataset as a realization of a random sample X1, X2, . . . , X20 from a probability distribution with expected value µ. The parameter µ represents the diameter of ball bearings produced by the produc- Table 27.1. Diameters of ball bearings. 1.018 1.009 1.042 1.053 0.969 1.002 0.988 1.019 1.062 1.032 1.072 0.977 1.062 1.044 1.069 1.029 0.979 1.096 1.079 0.999
  • 401. 400 27 The t-test tion lines. In order to investigate whether this diameter deviates from 1, we test the null hypothesis H0 : µ = 1 against H1 : µ = 1. This example illustrates a situation that often occurs: the data x1, x2, . . . , xn are a realization of a random sample X1, X2, . . . , Xn from a distribution with expectation µ, and we want to test whether µ equals an a priori specified value, say µ0. According to the law of large numbers, X̄n is close to µ for large n. This suggests a test statistic based on X̄n − µ0; realizations of X̄n − µ0 close to zero are in favor of the null hypothesis. Does X̄n − µ0 suffice as a test statistic? In our example, x̄n − µ0 = 1.03 − 1 = 0.03. Should we interpret this as small? First, note that under the null hypothesis E X̄n − µ0 = µ − µ0 = 0. Now, if X̄n − µ0 would have standard deviation 1, then the value 0.03 is within one standard deviation of E X̄n − µ0 . The “µ ± a few σ” rule on page 185 then suggests that the value 0.03 is not exceptional; it must be seen as a small deviation. On the other hand, if X̄n − µ0 has standard deviation 0.001, then the value 0.03 is 30 standard deviations away from E X̄n − µ0 . According to the “µ ± a few σ” rule this is very exceptional; the value 0.03 must be seen as a large deviation. The next quick exercise provides a concrete example. Quick exercise 27.1 Suppose that X̄n is a normal random variable with expectation 1 and variance 1. Determine P X̄n − 1 ≥ 0.03 . Find the same probability, but for the case where the variance is (0.01)2 . This discussion illustrates that we must standardize X̄n − µ0 to incorporate its variation. Recall that Var X̄n − µ0 = Var X̄n = σ2 n , where σ2 is the variance of each Xi. Hence, standardizing X̄n − µ0 means that we should divide by σ/ √ n. Since σ is unknown, we substitute the sample standard deviation Sn for σ. This leads to the following test statistic for the null hypothesis H0 : µ = µ0: T = X̄n − µ0 Sn/ √ n . Values of T close to zero are in favor of H0 : µ = µ0. Large positive values of T suggest that µ µ0 and large negative values suggest that µ µ0; both are evidence against H0. For the ball bearing data one finds that sn = 0.0372, so that t = x̄n − µ0 sn/ √ n = 1.03 − 1 0.0372/ √ 20 = 3.607. This is clearly different from zero, but the question is whether this difference is large enough to reject H0 : µ = 1. To answer this question, we need to know
  • 402. 27.2 The one-sample t-test 401 the probability distribution of T under the null hypothesis. Note that under the null hypothesis H0 : µ = µ0, the test statistic T = X̄n − µ0 Sn/ √ n is the studentized mean (see also Chapter 23) X̄n − µ Sn/ √ n . Hence, under the null hypothesis, the probability distribution of T is the same as that of the studentized mean. 27.2 The one-sample t-test The classical assumption is that the dataset is a realization of a random sample from an N(µ, σ2 ) distribution. In that case our test statistic T turns out to have a t-distribution under the null hypothesis, as we will see later. For this reason, the test for the null hypothesis H0 : µ = µ0 is called the (one-sample) t-test. Without the assumption of normality, we will use the bootstrap to approximate the distribution of T . For large sample sizes, this distribution can be approximated by means of the central limit theorem. We start with the first case. Normal data Suppose that the dataset x1, x2, . . . , xn is a realization of a random sample X1, X2, . . . , Xn from an N(µ, σ2 ) distribution. Then, according to the rule on page 349, the studentized mean has a t(n − 1) distribution. An immediate consequence is that, under the null hypothesis H0 : µ = µ0, also our test statistic T has a t(n − 1) distribution. Therefore, if we test H0 : µ = µ0 against H1 : µ = µ0 at level α, then we must reject the null hypothesis in favor of H1 : µ = µ0, if T ≤ −tn−1,α/2 or T ≥ tn−1,α/2. Similar decision rules apply to alternatives H1 : µ µ0 and H1 : µ µ0. Suppose that in the ball bearing example we test H0 : µ = 1 against H1 : µ = 1 at level α = 0.05. From Table B.2 we find t19,0.025 = 2.093. Hence, we must reject if T ≤ −2.093 or T ≥ 2.093. For the ball bearing data we found t = 3.607, which means we reject the null hypothesis at level α = 0.05. Alternatively, one might report the one-tailed p-value corresponding to the observed value t and compare this with α/2. The one-tailed p-value is ei- ther a right or a left tail probability, which must be computed by means
  • 403. 402 27 The t-test of the t(n − 1) distribution. In our ball bearing example the one-tailed p- value is the right tail probability P(T ≥ 3.607). From Table B.2 we see that this probability is between 0.0005 and 0.0010, which is smaller than α/2 = 0.025 (to be precise, by means of a statistical software package we found P(T ≥ 3.607) = 0.00094). The data provide strong enough evidence against the null hypothesis, so that it seems sensible to adjust the settings of the production line. Quick exercise 27.2 Suppose that the data in Table 27.1 are from two separate production lines. The first ten measurements have average 1.0194 and standard deviation 0.0290, whereas the last ten measurements have average 1.0406 and standard deviation 0.0428. Perform the t-test H0 : µ = 1 against H1 : µ = 1 at level α = 0.01 for both datasets separately, assuming normality. Nonnormal data Draw a rectangle with height h and width w (let us agree that w h), and within this rectangle draw a square with sides of length h (see Figure 27.1). This creates another (smaller) rectangle with horizontal and vertical sides of ↑ | | | | | | h | | | | | | ↓ ← − − − − − − − − − − − − − − − − − w − − − − − − − − − − − − − − − − − → ← − − − w − h − − − → ↑ | | | | | | h | | | | | | ↓ Fig. 27.1. Rectangle with square within. lengths w − h and h. A large rectangle with a vertical-to-horizontal ratio that is equal to the horizontal-to-vertical ratio for the small rectangle, i.e., h w = w − h h , was called a “golden rectangle” by the ancient Greeks, who often used these in their architecture. After solving for h/w, we obtain that the height-to-width
ratio h/w is equal to the "golden number" (√5 − 1)/2 ≈ 0.618. The data in Table 27.2 represent corresponding h/w ratios for rectangles used by Shoshoni Indians to decorate their leather goods. Is it reasonable to assume that they were also using golden rectangles? We examine this by means of a t-test.

Table 27.2. Ratios for Shoshoni rectangles.
0.693 0.749 0.654 0.670 0.662
0.672 0.615 0.606 0.690 0.628
0.668 0.611 0.606 0.609 0.601
0.553 0.570 0.844 0.576 0.933
Source: C. Dubois (ed.). Lowie's selected papers in anthropology, 1960. The Regents of the University of California.

The observed ratios are modeled as a realization of a random sample from a distribution with expectation µ, where the parameter µ represents the true esthetic preference for height-to-width ratios of the Shoshoni Indians. We want to test H0 : µ = 0.618 against H1 : µ ≠ 0.618. For the Shoshoni ratios, x̄n = 0.6605 and sn = 0.0925, so that the value of the test statistic is

t = (x̄n − 0.618)/(sn/√n) = (0.6605 − 0.618)/(0.0925/√20) = 2.055.

Closer examination of the data indicates that the normal distribution is not the right model. For instance, by definition the height-to-width ratios h/w are always between 0 and 1. Because some of the data points are also close to the right boundary 1, the normal distribution is inappropriate. If we cannot assume a normal model distribution, we can no longer conclude that our test statistic has a t(n − 1) distribution under the null hypothesis.

Since there is no reason to assume any other particular type of distribution to model the data, we approximate the distribution of T under the null hypothesis. Recall that this distribution is the same as that of the studentized mean (see the end of Section 27.1). To approximate its distribution, we use the empirical bootstrap simulation for the studentized mean, as described on page 351. We generate 10 000 bootstrap datasets and for each bootstrap dataset x∗1, x∗2, . . . , x∗n, we compute

t∗ = (x̄∗n − 0.6605)/(s∗n/√n).

In Figure 27.2 the kernel density estimate and empirical distribution function are displayed for the 10 000 bootstrap values t∗. Suppose we test H0 : µ = 0.618 against H1 : µ ≠ 0.618 at level α = 0.05. In the same way as in Section 23.3, we find the following bootstrap approximations for the critical values: c∗l = −3.334 and c∗u = 1.644.
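The empirical bootstrap simulation described here can be carried out in a few lines. The following sketch is not taken from the book; it assumes NumPy, uses the ddof=1 sample standard deviation, and resamples the Shoshoni ratios to obtain the bootstrap values t∗ and the two critical values.

```python
import numpy as np

ratios = np.array([0.693, 0.749, 0.654, 0.670, 0.662, 0.672, 0.615, 0.606,
                   0.690, 0.628, 0.668, 0.611, 0.606, 0.609, 0.601, 0.553,
                   0.570, 0.844, 0.576, 0.933])
n, xbar = len(ratios), ratios.mean()

rng = np.random.default_rng(3)
tstar = np.empty(10_000)
for b in range(tstar.size):
    boot = rng.choice(ratios, size=n, replace=True)          # bootstrap dataset
    tstar[b] = (boot.mean() - xbar) / (boot.std(ddof=1) / np.sqrt(n))

# bootstrap critical values for a two-sided test at level 0.05
cl, cu = np.quantile(tstar, [0.025, 0.975])
print(cl, cu)        # roughly -3.3 and 1.6, close to the values reported above

t = (xbar - 0.618) / (ratios.std(ddof=1) / np.sqrt(n))
print(t)             # about 2.06, which exceeds the upper critical value
```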
[Figure 27.2. Kernel density estimate and empirical distribution function of the 10 000 bootstrap values t∗, with the bootstrap critical values −3.334 and 1.644 marked at the levels 0.025 and 0.975.]

Since for the Shoshoni data the value 2.055 of the test statistic is greater than 1.644, we reject the null hypothesis at level 0.05. Alternatively, we can also compute a bootstrap approximation of the one-tailed p-value corresponding to 2.055, which is the right tail probability P(T ≥ 2.055). The bootstrap approximation for this probability is

(number of t∗ values greater than or equal to 2.055)/10 000 = 0.0067.

Hence P(T ≥ 2.055) ≈ 0.0067, which is smaller than α/2 = 0.025. The value 2.055 should be considered as exceptionally large, and we reject the null hypothesis.
The esthetic preference for height-to-width ratios of the Shoshoni Indians differs from that of the ancient Greeks. Large samples For large sample sizes the distribution of the studentized mean can be ap- proximated by a standard normal distribution (see Section 23.4). This means that for large sample sizes the distribution of the t-test statistic under the null hypothesis can also be approximated by a standard normal distribution. To illustrate this, recall the Old Faithful data. Park rangers in Yellowstone National Park inform the public about the behavior of the geyser, such as the expected time between successive eruptions and the length of the duration of an eruption. Suppose they claim that the expected length of an eruption is 4 minutes (240 seconds). Does this seem likely on the basis of the data from Section 15.1? We investigate this by testing H0 : µ = 240 against H1 : µ = 240 at level α = 0.001, where µ is the expectation of the model distribution. The value of the test statistic is t = x̄n − 240 sn/ √ n = 209.3 − 240 68.48/ √ 272 = −7.39.
  • 406. 27.3 The t-test in a regression setting 405 The one-tailed p-value P(T ≤ −7.39) can be approximated by P(Z ≤ −7.39), where Z has an N(0, 1) distribution. From Table B.1 we see that this probabil- ity is smaller than P(Z ≤ −3.49) = 0.0002. This is smaller than α/2 = 0.0005, so we reject the null hypothesis at level 0.001. In fact the p-value is much smaller: a statistical software package gives P(Z ≤ −7.39) = 7.5 · 10−14 . The data provide overwhelming evidence against H0 : µ = 240, so that we conclude that the expected length of an eruption is different from 4 minutes. Quick exercise 27.3 Compute the critical region K for the test, using the normal approximation, and check that t = −7.39 falls in K. In fact, if we would test H0 : µ = 240 against H1 : µ 240, the p-value corresponding to t = −7.39 is the left tail probability P(T ≤ −7.39). This probability is very small, so that we also reject the null hypothesis in favor of this alternative and conclude that the expected length of an eruption is smaller than 4 minutes. 27.3 The t-test in a regression setting Is calcium in your drinking water good for your health? In England and Wales, an investigation of environmental causes of disease was conducted. The annual mortality rate (percentage of deaths) and the calcium concentration in the drinking water supply were recorded for 61 large towns. The data in Table 27.3 represent the annual mortality rate averaged over the years 1958–1964, and the calcium concentration in parts per million. In Figure 27.3 the 61 paired measurements are displayed in a scatterplot. The scatterplot shows a slight downward trend, which suggests that higher concentrations of calcium lead to lower mortality rates. The question is whether this is really the case or if the slight downward trend should be attributed to chance. To investigate this question we model the mortality data by means of a simple linear regression model with normally distributed errors, with the mortality rate as the dependent variable y and the calcium concentration as the inde- pendent variable x: Yi = α + βxi + Ui for i = 1, 2, . . . , 61, where U1, U2, . . . , U61 is a random sample from an N(0, σ2 ) distribution. The parameter β represents the change of the mortality rate if we increase the calcium concentration by one unit. We test the null hypothesis H0 : β = 0 (calcium has no effect on the mortality rate) against H1 : β 0 (higher concentration of calcium reduces the mortality rate). This example illustrates the general situation, where the dataset (x1, y1), (x2, y2), . . . , (xn, yn)
  • 407. 406 27 The t-test Table 27.3. Mortality data. Rate Calcium Rate Calcium Rate Calcium Rate Calcium 1247 105 1466 5 1299 78 1359 84 1392 73 1307 78 1254 96 1318 122 1260 21 1096 138 1402 37 1309 59 1259 133 1175 107 1486 5 1456 90 1236 101 1369 68 1257 50 1527 60 1627 53 1486 122 1485 81 1519 21 1581 14 1625 13 1668 17 1800 14 1609 18 1558 10 1807 15 1637 10 1755 12 1491 20 1555 39 1428 39 1723 44 1379 94 1742 8 1574 9 1569 91 1591 16 1772 15 1828 8 1704 26 1702 44 1427 27 1724 6 1696 6 1711 13 1444 14 1591 49 1987 8 1495 14 1587 75 1713 71 1557 13 1640 57 1709 71 1625 20 1378 71 Source: M. Hills and the M345 Course Team. M345 Statistical Methods, Units 3: Examining Straight-line Data, 1986, Milton Keynes: Open Uni- versity, 28. Data provided by Professor M.J.Gardner, Medical Research Coun- cil Environmental Epidemiology Research Unit, Southampton. 0 20 40 60 80 100 120 140 Calcium concentration (ppm) 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Mortality rate (%) · · ·· ·· · · · · · · · · · · · · · · · · · · · ·· ·· ·· ·· · · · · · · · · · · · · · · · · · · · · · · · · · · · · Fig. 27.3. Scatterplot mortality data. is modeled by a simple linear regression model, and one wants to test a null hypothesis of the form H0 : α = α0 or H0 : β = β0. Similar to the one-sample t-test we will construct a test statistic for each of these null hypotheses. With normally distributed errors, these test statistics have a t-distribution under the null hypothesis. For this reason, for both null hypotheses the test is called a t-test.
The t-test for the slope

For the null hypothesis H0 : β = β0, we use as test statistic

Tb = (β̂ − β0)/Sb,

where β̂ is the least squares estimator for β (see Chapter 22) and

S²b = n σ̂² / (n Σ xᵢ² − (Σ xᵢ)²).

In this expression,

σ̂² = (Σ (Yᵢ − α̂ − β̂xᵢ)²) / (n − 2), with the sum running over i = 1, 2, . . . , n,

is the estimator for σ² as introduced on page 332. It can be shown that

Var(β̂ − β0) = n σ² / (n Σ xᵢ² − (Σ xᵢ)²),

so that the random variable S²b is an estimator for the variance of β̂ − β0. Hence, similar to the test statistic for the one-sample t-test, the test statistic Tb compares the estimator β̂ with the value β0 and standardizes by dividing by an estimator for the standard deviation of β̂ − β0. Values of Tb close to zero are in favor of the null hypothesis H0 : β = β0. Large positive values of Tb suggest that β > β0, whereas large negative values of Tb suggest that β < β0.

Recall that in the case of normal random samples the one-sample t-test statistic has a t(n − 1) distribution under the null hypothesis. For the same reason, in the case of normally distributed errors the test statistic Tb has a t(n − 2) distribution under the null hypothesis H0 : β = β0.

In our mortality example we want to test H0 : β = 0 against H1 : β < 0. For the data we find β̂ = −3.2261 and sb = 0.4847, so that the value of Tb is

tb = −3.2261/0.4847 = −6.656.

If we test at level α = 0.05, then we must compare this value with the left critical value −t59,0.05. This value is not in Table B.2, but we do have −1.676 = −t50,0.05 < −t59,0.05. This means that tb is much smaller than −t59,0.05, so that we reject the null hypothesis at level 0.05. How much evidence the value tb = −6.656 bears against the null hypothesis is expressed by the one-tailed p-value P(Tb ≤ −6.656). From Table B.2 we can only see that this probability is smaller than 0.0005. By means of a statistical package we find P(Tb ≤ −6.656) = 5.2 · 10⁻⁹. The data provide overwhelming evidence against the null hypothesis. We conclude that higher concentrations of calcium correspond to lower mortality rates.
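As an illustration of how such a slope test can be carried out numerically, here is a minimal sketch, not from the book and using a small made-up dataset; it assumes NumPy and SciPy. It computes β̂, sb, the test statistic tb, and the one-tailed p-value from the t(n − 2) distribution.

```python
import numpy as np
from scipy.stats import t as t_dist

# hypothetical (x, y) pairs playing the role of calcium concentration and mortality rate
x = np.array([5.0, 20.0, 40.0, 60.0, 80.0, 100.0, 120.0, 140.0])
y = np.array([1.70, 1.62, 1.55, 1.48, 1.40, 1.35, 1.28, 1.25])
n = len(x)

beta_hat, alpha_hat = np.polyfit(x, y, 1)              # least squares slope and intercept
residuals = y - alpha_hat - beta_hat * x
sigma2_hat = np.sum(residuals**2) / (n - 2)            # estimator of sigma^2

s_b = np.sqrt(n * sigma2_hat / (n * np.sum(x**2) - np.sum(x)**2))
t_b = beta_hat / s_b                                   # test statistic for H0: beta = 0

p_one_tailed = t_dist.cdf(t_b, df=n - 2)               # left tail probability, for H1: beta < 0
print(beta_hat, s_b, t_b, p_one_tailed)
```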
Quick exercise 27.4 The data in Table 27.3 can be separated into measurements for towns at least as far north as Derby and towns south of Derby. For the data corresponding to the 35 towns at least as far north as Derby, one finds β̂ = −1.9313 and sb = 0.8479. Test H0 : β = 0 against H1 : β < 0 at level 0.01, i.e., compute the value of the test statistic and report your conclusion about the null hypothesis.

The t-test for the intercept

We test the null hypothesis H0 : α = α0 with test statistic

\[
T_a = \frac{\hat\alpha - \alpha_0}{S_a}, \qquad (27.1)
\]

where α̂ is the least squares estimator for α and

\[
S_a^2 = \frac{\sum x_i^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\hat\sigma^2 ,
\]

with σ̂² defined as before. The random variable Sa² is an estimator for the variance

\[
\operatorname{Var}\bigl(\hat\alpha - \alpha_0\bigr) = \frac{\sum x_i^2}{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sigma^2 .
\]

Again, we compare the estimator α̂ with the value α0 and standardize by dividing by an estimator for the standard deviation of α̂ − α0. Values of Ta close to zero are in favor of the null hypothesis H0 : α = α0. Large positive values of Ta suggest that α > α0, whereas large negative values of Ta suggest that α < α0. Like Tb, in the case of normal errors the test statistic Ta has a t(n − 2) distribution under the null hypothesis H0 : α = α0.

As an illustration, recall Exercise 17.9, where we modeled the volume y of black cherry trees by means of a linear model without intercept, with independent variable x = d²h, where d and h are the diameter and height of the trees. The scatterplot of the pairs (x1, y1), (x2, y2), . . . , (x31, y31) is displayed in Figure 27.4.

Fig. 27.4. Scatterplot of the black cherry tree data.

As mentioned in Exercise 17.9, there are physical reasons to leave out the intercept. We want to investigate whether this is confirmed by the data. To this end, we model the data by a simple linear regression model with intercept

Yi = α + βxi + Ui  for i = 1, 2, . . . , 31,

where U1, U2, . . . , U31 are a random sample from an N(0, σ²) distribution, and we test H0 : α = 0 against H1 : α ≠ 0 at level 0.10. The value of the test statistic is

\[
t_a = \frac{-0.2977}{0.9636} = -0.3089,
\]

and the left critical value is −t29,0.05 = −1.699. This means we cannot reject the null hypothesis. The data do not provide sufficient evidence against H0 : α = 0, which is confirmed by the one-tailed p-value P(Ta ≤ −0.3089) = 0.3798 (computed by means of a statistical package). We conclude that the intercept does not contribute significantly to the model.
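A companion sketch for the intercept (again our own illustrative code, not the book's) differs from the slope version only in the variance formula Sa².

```python
import numpy as np
from scipy import stats

def intercept_t_test(x, y, alpha0=0.0):
    """t-test for H0: alpha = alpha0 in the model Y_i = alpha + beta * x_i + U_i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha_hat = y.mean() - beta_hat * x.mean()
    sigma2_hat = np.sum((y - alpha_hat - beta_hat * x) ** 2) / (n - 2)
    # S_a^2 = sum(x^2) / (n * sum(x^2) - (sum x)^2) * sigma2_hat
    sa2 = np.sum(x ** 2) / (n * np.sum(x ** 2) - np.sum(x) ** 2) * sigma2_hat
    ta = (alpha_hat - alpha0) / np.sqrt(sa2)
    p_two_sided = 2 * stats.t.cdf(-abs(ta), df=n - 2)  # for H1: alpha != alpha0
    return alpha_hat, np.sqrt(sa2), ta, p_two_sided
```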
27.4 Solutions to the quick exercises

27.1 If Y has an N(1, 1) distribution, then Y − 1 has an N(0, 1) distribution. Therefore, from Table B.1: P(Y − 1 ≥ 0.03) = 0.4880. If Y has an N(1, (0.01)²) distribution, then (Y − 1)/0.01 has an N(0, 1) distribution. In that case,

\[
P(Y - 1 \ge 0.03) = P\Bigl(\frac{Y-1}{0.01} \ge 3\Bigr) = 0.0013.
\]

27.2 For the first and last ten measurements the values of the test statistic are

\[
t = \frac{1.0194 - 1}{0.0290/\sqrt{10}} = 2.115
\qquad\text{and}\qquad
t = \frac{1.0406 - 1}{0.0428/\sqrt{10}} = 3.000.
\]

The critical value is t9,0.025 = 2.262, which means we reject the null hypothesis for the second production line, but not for the first production line.

27.3 The critical region is of the form K = (−∞, cl] ∪ [cu, ∞). The right critical value cu is approximated by z0.0005 = t∞,0.0005 = 3.291, which can be found in Table B.2. By symmetry of the normal distribution, the left critical value cl is approximated by −z0.0005 = −3.291. Clearly, t = −7.39 < −3.291, so that it falls in K.

27.4 The value of the test statistic is

\[
t_b = \frac{-1.9313}{0.8479} = -2.2778.
\]

The left critical value is equal to −t33,0.01, which is not in Table B.2, but we see that −t33,0.01 < −t40,0.01 = −2.423. This means that −t33,0.01 < tb, so that we cannot reject H0 : β = 0 against H1 : β < 0 at level 0.01.
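The one-sample statistics in Solution 27.2 are easy to check numerically. The sketch below is ours (not from the book) and uses only the summary values quoted above.

```python
import numpy as np
from scipy import stats

def one_sample_t(xbar, s, n, mu0):
    """One-sample t statistic (xbar - mu0) / (s / sqrt(n))."""
    return (xbar - mu0) / (s / np.sqrt(n))

t_first = one_sample_t(1.0194, 0.0290, 10, 1.0)   # about 2.115
t_second = one_sample_t(1.0406, 0.0428, 10, 1.0)  # about 3.000
critical = stats.t.ppf(1 - 0.025, df=9)           # t_{9,0.025}, about 2.262
print(t_first, t_second, critical)
```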
27.5 Exercises

27.1 We perform a t-test for the null hypothesis H0 : µ = 10 by means of a dataset consisting of n = 16 elements with sample mean 11 and sample variance 4. We use significance level 0.05.
a. Should we reject the null hypothesis in favor of H1 : µ ≠ 10?
b. What if we test against H1 : µ > 10?

27.2 The Cleveland Casting Plant is a large, highly automated producer of gray and nodular iron automotive castings for Ford Motor Company. One process variable of interest to Cleveland Casting is the pouring temperature of molten iron. The pouring temperatures (in degrees Fahrenheit) of ten crankshafts are given in Table 27.4. The target setting for the pouring temperature is 2550 degrees. One wants to conduct a test at level α = 0.01 to determine whether the pouring temperature differs from the target setting.

Table 27.4. Pouring temperatures of ten crankshafts.

2543 2541 2544 2620 2560 2559 2562 2553 2552 2553

Source: B. Price and B. Barth. A structural model relating process inputs and final product characteristics. Quality Engineering, Vol. 7, No. 4, pp. 693–704, 1995. Reproduced by permission of Taylor & Francis, Inc., http://www.taylorandfrancis.com

a. Formulate the appropriate null hypothesis and alternative hypothesis.
b. Compute the value of the test statistic and report your conclusion. You may assume a normal model distribution and use that the sample variance is 517.34.

27.3 Table 27.5 lists the results of tensile adhesion tests on 22 U-700 alloy specimens. The data are loads at failure in MPa. The sample mean is 13.71 and the sample standard deviation is 3.55. You may assume that the data originated from a normal distribution with expectation µ. One is interested in whether the load at failure exceeds 10 MPa. We investigate this by means of a t-test for the null hypothesis H0 : µ = 10.
a. What do you choose as the alternative hypothesis?
b. Compute the value of the test statistic and report your conclusion, when performing the test at level 0.05.
Table 27.5. Loads at failure of U-700 specimens.

19.8 18.5 17.6 16.7 15.8 15.4 14.1 13.6 11.9 11.4 11.4
 8.8  7.5 15.4 15.4 19.5 14.9 12.7 11.9 11.4 10.1  7.9

Source: C.C. Berndt. Instrumented tensile adhesion tests on plasma sprayed thermal barrier coatings. Journal of Materials Engineering, 11(4):275–282, December 1989. Springer-Verlag New York Inc.

27.4 Consider the coal data from Table 23.2, where 22 gross calorific value measurements are listed for Daw Mill coal coded 258GB41. We modeled this dataset as a realization of a random sample from an N(µ, σ²) distribution with µ and σ unknown. We are planning to buy a shipment if the gross calorific value exceeds 31.00 MJ/kg. The sample mean and sample standard deviation of the data are x̄n = 31.012 and sn = 0.1294. Perform a t-test for the null hypothesis H0 : µ = 31.00 against H1 : µ > 31.00 using significance level 0.01, i.e., compute the value of the test statistic, the critical value of the test, and report your conclusion.

27.5 In the November 1988 issue of Science a study was reported on the inbreeding of tropical swarm-founding wasps. Each member of a sample of 197 wasps was captured, frozen, and subjected to a series of genetic tests, from which an inbreeding coefficient was determined. The sample mean and the sample standard deviation of the coefficients are x̄197 = 0.044 and s197 = 0.884. If a species does not have the tendency to inbreed, its true inbreeding coefficient is 0. Determine by means of a test whether the inbreeding coefficient for this species of wasp exceeds 0.
a. Formulate the appropriate null hypothesis and alternative hypothesis and compute the value of the test statistic.
b. Compute the p-value corresponding to the value of the test statistic and report your conclusion about the null hypothesis.

27.6 The stopping distance of an automobile is related to its speed. The data in Table 27.6 give the stopping distance in feet and the speed in miles per hour of an automobile. The data are modeled by means of a simple linear regression model with normally distributed errors, with the square root of the stopping distance as dependent variable y and the speed as independent variable x:

Yi = α + βxi + Ui,  for i = 1, . . . , 7.

For the dataset we find α̂ = 5.388, β̂ = 4.252, sa = 1.874, sb = 0.242.
Table 27.6. Speed and stopping distance of automobiles.

Speed     20.5  20.5  30.5  30.5  40.5  48.8  57.8
Distance  15.4  13.3  33.9  27.0  73.1 113.0 142.6

Source: K.A. Brownlee. Statistical theory and methodology in science and engineering. Wiley, New York, 1960; Table II.9 on page 372.

One would expect that the intercept can be taken equal to 0, since zero speed would yield zero stopping distance. Investigate whether this is confirmed by the data by performing the appropriate test at level 0.10. Formulate the proper null and alternative hypotheses, compute the value of the test statistic, and report your conclusion.

27.7 In a study about the effect of wall insulation, the weekly gas consumption (in units of 1000 cubic feet) and the average outside temperature (in degrees Celsius) were measured for a certain house in southeast England, for 26 weeks before and 30 weeks after cavity-wall insulation had been installed. The house thermostat was set at 20 degrees throughout. The data are listed in Table 27.7. We model the data before insulation by means of a simple linear regression model with normally distributed errors and gas consumption as the response variable. A similar model was used for the data after insulation. Given are

Before insulation: α̂ = 6.8538, β̂ = −0.3932, sa = 0.1184, sb = 0.0196;
After insulation:  α̂ = 4.7238, β̂ = −0.2779, sa = 0.1297, sb = 0.0252.

a. Use the data before insulation to investigate whether smaller outside temperatures lead to higher gas consumption. Formulate the proper null and alternative hypotheses, compute the value of the test statistic, and report your conclusion, using significance level 0.05.
b. Do the same for the data after insulation.
Table 27.7. Temperature and gas consumption.

Before insulation              After insulation
Temperature  Gas consumption   Temperature  Gas consumption
   −0.8          7.2              −0.7          4.8
   −0.7          6.9               0.8          4.6
    0.4          6.4               1.0          4.7
    2.5          6.0               1.4          4.0
    2.9          5.8               1.5          4.2
    3.2          5.8               1.6          4.2
    3.6          5.6               2.3          4.1
    3.9          4.7               2.5          4.0
    4.2          5.8               2.5          3.5
    4.3          5.2               3.1          3.2
    5.4          4.9               3.9          3.9
    6.0          4.9               4.0          3.5
    6.0          4.3               4.0          3.7
    6.0          4.4               4.2          3.5
    6.2          4.5               4.3          3.5
    6.3          4.6               4.6          3.7
    6.9          3.7               4.7          3.5
    7.0          3.9               4.9          3.4
    7.4          4.2               4.9          3.7
    7.5          4.0               4.9          4.0
    7.5          3.9               5.0          3.6
    7.6          3.5               5.3          3.7
    8.0          4.0               6.2          2.8
    8.5          3.6               7.1          3.0
    9.1          3.1               7.2          2.8
   10.2          2.6               7.5          2.6
                                   8.0          2.7
                                   8.7          2.8
                                   8.8          1.3
                                   9.7          1.5

Source: MDST242 Statistics in Society, Unit 45: Review, 2nd edition, 1984, Milton Keynes: The Open University, Figures 2.5 and 2.6.
28 Comparing two samples

Many applications are concerned with two groups of observations of the same kind that originate from two possibly different model distributions, and the question is whether these distributions have different expectations. We describe a test for equality of expectations, where we consider normal and nonnormal model distributions and equal and unequal variances of the model distributions.

28.1 Is dry drilling faster than wet drilling?

Recall the drilling example from Sections 15.5 and 16.4. The question was whether dry drilling is faster than wet drilling. The scatterplots in Figure 15.11 seem to suggest that up to a depth of 250 feet the drill time does not depend on depth. Therefore, for a first investigation of a possible difference between dry and wet drilling we only consider the (mean) drill times up to this depth. A more thorough study can be found in [23].

The boxplots of the drill times for both types of drilling are displayed in Figure 28.1.

Fig. 28.1. Boxplot of drill times for dry and wet drilling.

Clearly, the boxplot for dry drilling is positioned lower than the
one for wet drilling. However, the question is whether this difference can be attributed to chance or whether it is large enough to conclude that the dry drill time is shorter than the wet drill time.

To answer this question, we model the datasets of dry and wet drill times as realizations of random samples from two distribution functions F and G, one with expected value µ1 and the other with expected value µ2. The parameters µ1 and µ2 represent the drill times of dry drilling and wet drilling, respectively. We test H0 : µ1 = µ2 against H1 : µ1 < µ2.

This example illustrates a general situation where we compare two datasets x1, x2, . . . , xn and y1, y2, . . . , ym, which are realizations of independent random samples X1, X2, . . . , Xn and Y1, Y2, . . . , Ym from two distributions, and we want to test whether the expectations of both distributions are the same. Both the variance σX² of the Xi and the variance σY² of the Yj are unknown. Note that the null hypothesis is equivalent to the statement µ1 − µ2 = 0. For this reason, similar to Chapter 27, the test statistic for the null hypothesis H0 : µ1 = µ2 is based on an estimator X̄n − Ȳm for the difference µ1 − µ2. As before, we standardize X̄n − Ȳm by an estimator for its variance

\[
\operatorname{Var}\bigl(\bar X_n - \bar Y_m\bigr) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}.
\]

Recall that the sample variances SX² and SY² of the Xi and the Yj are unbiased estimators for σX² and σY². We will use a combination of SX² and SY² to construct an estimator for Var(X̄n − Ȳm). The actual standardization of X̄n − Ȳm depends on whether the variances of the Xi and Yj are the same. We distinguish between the two cases σX² = σY² and σX² ≠ σY². In the next section we consider the case of equal variances.

Quick exercise 28.1 Looking at the boxplots in Figure 28.1, does the assumption σX² = σY² seem reasonable to you? Can you think of a way to quantify your belief?

28.2 Two samples with equal variances

Suppose that the samples originate from distributions with the same (but unknown) variance: σX² = σY² = σ². In this case we can pool the sample variances SX² and SY² by constructing a linear combination aSX² + bSY² that is an unbiased estimator for σ². One particular choice is the weighted average
\[
\frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}.
\]

It has the property that for normally distributed samples it has the smallest variance among all unbiased linear combinations of SX² and SY² (see Exercise 28.5). Moreover, the weights depend on the sample sizes. This is appropriate, since if one sample is much larger than the other, the estimate of σ² from that sample is more reliable and should receive greater weight. We find that the pooled variance

\[
S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)
\]

is an unbiased estimator for

\[
\operatorname{Var}\bigl(\bar X_n - \bar Y_m\bigr) = \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right).
\]

This leads to the following test statistic for the null hypothesis H0 : µ1 = µ2:

\[
T_p = \frac{\bar X_n - \bar Y_m}{S_p}.
\]

As before, we compare the estimator X̄n − Ȳm with 0 (the value of µ1 − µ2 under the null hypothesis), and we standardize by dividing by the estimator Sp for the standard deviation of X̄n − Ȳm. Values of Tp close to zero are in favor of the null hypothesis H0 : µ1 = µ2. Large positive values of Tp suggest that µ1 > µ2, whereas large negative values suggest that µ1 < µ2.

The next step is to determine the distribution of Tp. Note that under the null hypothesis H0 : µ1 = µ2, the test statistic Tp is the pooled studentized mean difference

\[
\frac{(\bar X_n - \bar Y_m) - (\mu_1 - \mu_2)}{S_p}.
\]

Hence, under the null hypothesis, the probability distribution of Tp is the same as that of the pooled studentized mean difference. To determine its distribution, we distinguish between normal and nonnormal data.

Normal samples

In the same way as the studentized mean of a single normal sample has a t(n − 1) distribution (see page 349), it is also a fact that if two independent samples originate from normal distributions, i.e.,

X1, X2, . . . , Xn is a random sample from an N(µ1, σ²) distribution,
Y1, Y2, . . . , Ym is a random sample from an N(µ2, σ²) distribution,

then the pooled studentized mean difference has a t(n + m − 2) distribution. Hence, under the null hypothesis, the test statistic Tp has a t(n + m − 2) distribution. For this reason, a test for the null hypothesis H0 : µ1 = µ2 is called a two-sample t-test.
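A compact implementation of Sp and Tp, with the t(n + m − 2) left tail probability for the one-sided alternative H1 : µ1 < µ2. This is our own sketch, not code from the book, using NumPy and SciPy.

```python
import numpy as np
from scipy import stats

def pooled_t_test(x, y):
    """Two-sample t-test with pooled variance for H0: mu1 = mu2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)        # sample variances
    sp2 = ((n - 1) * sx2 + (m - 1) * sy2) / (n + m - 2) * (1 / n + 1 / m)
    tp = (x.mean() - y.mean()) / np.sqrt(sp2)
    p_left = stats.t.cdf(tp, df=n + m - 2)  # one-tailed p-value for H1: mu1 < mu2
    return tp, p_left
```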
Suppose that in our drilling example we model our datasets as realizations of random samples of sizes n = m = 50 from two normal distributions with equal variances, and we test H0 : µ1 = µ2 against H1 : µ1 < µ2 at level 0.05. For the data we find x̄50 = 727.78, ȳ50 = 873.02, and sp = 13.62, so that

\[
t_p = \frac{727.78 - 873.02}{13.62} = -10.66.
\]

We compare this with the left critical value −t98,0.05. This value is not in Table B.2, but −1.676 = −t50,0.05 < −t98,0.05. This means that tp < −t98,0.05, so that we reject H0 : µ1 = µ2 in favor of H1 : µ1 < µ2 at level 0.05. The p-value corresponding to tp = −10.66 is the left tail probability P(Tp ≤ −10.66). From Table B.2 we can only see that this is smaller than 0.0005 (a statistical software package gives P(Tp ≤ −10.66) = 2.25 · 10⁻¹⁸). The data provide overwhelming evidence against the null hypothesis, so that we conclude that dry drilling is faster than wet drilling.

Quick exercise 28.2 Suppose that in the ball bearing example of Quick exercise 27.2 we test H0 : µ1 = µ2 against H1 : µ1 ≠ µ2, where µ1 and µ2 represent the diameters of a ball bearing from the first and second production line. What are the critical values corresponding to level α = 0.01?

Nonnormal samples

Similar to the one-sample t-test, if we cannot assume normal model distributions, then we can no longer conclude that our test statistic has a t(n + m − 2) distribution under the null hypothesis. Recall that under the null hypothesis, the distribution of our test statistic is the same as that of the pooled studentized mean difference (see page 417). To approximate its distribution, we use the empirical bootstrap simulation for the pooled studentized mean difference

\[
\frac{(\bar X_n - \bar Y_m) - (\mu_1 - \mu_2)}{S_p}.
\]

Given datasets x1, x2, . . . , xn and y1, y2, . . . , ym, determine their empirical distribution functions Fn and Gm as estimates for F and G. The expectations corresponding to Fn and Gm are µ*1 = x̄n and µ*2 = ȳm. Then repeat the following two steps many times:

1. Generate a bootstrap dataset x*1, x*2, . . . , x*n from Fn and a bootstrap dataset y*1, y*2, . . . , y*m from Gm.
2. Compute the pooled studentized mean difference for the bootstrap data:

\[
t_p^* = \frac{(\bar x_n^* - \bar y_m^*) - (\bar x_n - \bar y_m)}{s_p^*},
\]
where x̄*n and ȳ*m are the sample means of the bootstrap datasets, and

\[
(s_p^*)^2 = \frac{(n-1)(s_X^*)^2 + (m-1)(s_Y^*)^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right),
\]

with (s*X)² and (s*Y)² the sample variances of the bootstrap datasets.

The reason that in each iteration we subtract x̄n − ȳm is that µ1 − µ2 is the difference of the expectations of the two model distributions. Therefore, according to the bootstrap principle we should replace this by the difference x̄n − ȳm of the expectations corresponding to the two empirical distribution functions.

We carried out this bootstrap simulation for the drill times. The result of this simulation can be seen in Figure 28.2, where a histogram and the empirical distribution function are displayed for one thousand bootstrap values of t*p. Suppose that we test H0 : µ1 = µ2 against H1 : µ1 < µ2 at level 0.05. The bootstrap approximation for the left critical value is c*l = −1.659. The value of tp = −10.66, computed from the data, is much smaller. Hence, also on the basis of the bootstrap simulation we reject the null hypothesis and conclude that the dry drill time is shorter than the wet drill time.

Fig. 28.2. Histogram and empirical distribution function of 1000 bootstrap values for T*p.
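The two bootstrap steps translate directly into code. The sketch below is ours (the number of iterations, the random seed, and the names dry_times and wet_times for the two datasets are illustrative choices, not the book's); the left critical value is approximated by the 0.05 empirical quantile of the bootstrap values.

```python
import numpy as np

def bootstrap_pooled(x, y, iterations=1000, seed=1):
    """Bootstrap values t*_p of the pooled studentized mean difference."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    diff = x.mean() - y.mean()            # plays the role of mu1 - mu2
    tstar = np.empty(iterations)
    for i in range(iterations):
        xs = rng.choice(x, size=n, replace=True)   # bootstrap dataset from Fn
        ys = rng.choice(y, size=m, replace=True)   # bootstrap dataset from Gm
        sp2 = (((n - 1) * np.var(xs, ddof=1) + (m - 1) * np.var(ys, ddof=1))
               / (n + m - 2) * (1 / n + 1 / m))
        tstar[i] = (xs.mean() - ys.mean() - diff) / np.sqrt(sp2)
    return tstar

# Left critical value at level 0.05, roughly -1.66 for the drill time data:
# c_l_star = np.quantile(bootstrap_pooled(dry_times, wet_times), 0.05)
```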
28.3 Two samples with unequal variances

During an investigation about weather modification, a series of experiments was conducted in southern Florida from 1968 to 1972. These experiments were designed to investigate the use of massive silver-iodide seeding. It was
hypothesized that under specified conditions this leads to invigorated cumulus growth and prolonged lifetimes, thereby causing increased precipitation. In these experiments, 52 isolated cumulus clouds were observed, of which 26 were selected at random and injected with silver-iodide smoke. Rainfall amounts (in acre-feet) were recorded for all clouds. They are listed in Table 28.1.

Table 28.1. Rainfall data.

Unseeded: 1202.6  830.1  372.4  345.5  321.2  244.3  163.0  147.8   95.0   87.0
            81.2   68.5   47.3   41.1   36.6   29.0   28.6   26.3   26.1   24.4
            21.7   17.3   11.5    4.9    4.9    1.0

Seeded:   2745.6 1697.8 1656.0  978.0  703.4  489.1  430.0  334.1  302.8  274.7
           274.7  255.0  242.5  200.7  198.6  129.6  119.0  118.3  115.3   92.4
            40.6   32.7   31.4   17.5    7.7    4.1

Source: J. Simpson, A. Olsen, and J.C. Eden. A Bayesian analysis of a multiplicative treatment effect in weather modification. Technometrics, 17:161–166, 1975; Table 1 on page 162.

To investigate whether seeding leads to increased rainfall, we test H0 : µ1 = µ2 against H1 : µ1 < µ2, where µ1 and µ2 represent the rainfall for unseeded and seeded clouds. In Figure 28.3 the boxplots of both datasets are displayed. From this we see that the assumption of equal variances may not be realistic. Indeed, this is confirmed by the values sX² = 77 521 and sY² = 423 524 of the sample variances of the datasets. This means that we need to test H0 : µ1 = µ2 without the assumption of equal variances.

As before, the test statistic will be a standardized version of X̄n − Ȳm, but Sp² is no longer an unbiased estimator for

\[
\operatorname{Var}\bigl(\bar X_n - \bar Y_m\bigr) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}.
\]

However, if we estimate σX² and σY² by SX² and SY², then the nonpooled variance

\[
S_d^2 = \frac{S_X^2}{n} + \frac{S_Y^2}{m}
\]

is an unbiased estimator for Var(X̄n − Ȳm). This leads to the test statistic

\[
T_d = \frac{\bar X_n - \bar Y_m}{S_d}.
\]
Fig. 28.3. Boxplots of rainfall for unseeded and seeded clouds.

Again, we compare the estimator X̄n − Ȳm with zero and standardize by dividing by an estimator for the standard deviation of X̄n − Ȳm. Values of Td close to zero are in favor of the null hypothesis H0 : µ1 = µ2.

Quick exercise 28.3 Consider the ball bearing example from Quick exercise 27.2. Compute the value of Td for this example.

Under the null hypothesis H0 : µ1 = µ2, the test statistic

\[
T_d = \frac{\bar X_n - \bar Y_m}{S_d}
\]

is equal to the nonpooled studentized mean difference

\[
\frac{(\bar X_n - \bar Y_m) - (\mu_1 - \mu_2)}{S_d}.
\]

Therefore, the distribution of Td under the null hypothesis is the same as that of the nonpooled studentized mean difference. Unfortunately, its distribution is not a t-distribution, not even in the case of normal samples. This means that we have to approximate this distribution. Similar to the previous section, we use the empirical bootstrap simulation for the nonpooled studentized mean difference. The only difference with the procedure outlined in the previous section is that now in each iteration we compute the nonpooled studentized mean difference for the bootstrap datasets:

\[
t_d^* = \frac{(\bar x_n^* - \bar y_m^*) - (\bar x_n - \bar y_m)}{s_d^*},
\]

where x̄*n and ȳ*m are the sample means of the bootstrap datasets, and

\[
(s_d^*)^2 = \frac{(s_X^*)^2}{n} + \frac{(s_Y^*)^2}{m},
\]

with (s*X)² and (s*Y)² the sample variances of the bootstrap datasets.
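Only the standardization changes relative to the pooled bootstrap sketch given earlier; the following is again our own illustrative code, not the book's.

```python
import numpy as np

def nonpooled_t(x, y):
    """Nonpooled statistic Td = (xbar - ybar) / sqrt(sx2/n + sy2/m)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sd2 = np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y)
    return (x.mean() - y.mean()) / np.sqrt(sd2)

def bootstrap_nonpooled(x, y, iterations=1000, seed=1):
    """Bootstrap values t*_d of the nonpooled studentized mean difference."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    diff = x.mean() - y.mean()
    tstar = np.empty(iterations)
    for i in range(iterations):
        xs = rng.choice(x, size=len(x), replace=True)
        ys = rng.choice(y, size=len(y), replace=True)
        sd2 = np.var(xs, ddof=1) / len(xs) + np.var(ys, ddof=1) / len(ys)
        tstar[i] = (xs.mean() - ys.mean() - diff) / np.sqrt(sd2)
    return tstar
```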
We carried out this bootstrap simulation for the cloud seeding data. The result of this simulation can be seen in Figure 28.4, where a histogram and the empirical distribution function are displayed for one thousand values of t*d.

Fig. 28.4. Histogram and empirical distribution function of 1000 bootstrap values of T*d.

The bootstrap approximation for the left critical value corresponding to level 0.05 is c*l = −1.405. For the data we find the value

\[
t_d = \frac{164.59 - 441.98}{138.92} = -1.998.
\]

This is smaller than c*l, so we reject the null hypothesis. Although the evidence against the null hypothesis is not overwhelming, there is some indication that seeding clouds leads to more rainfall.

28.4 Large samples

Variants of the central limit theorem state that as n and m both tend to infinity, the distributions of the pooled studentized mean difference

\[
\frac{(\bar X_n - \bar Y_m) - (\mu_1 - \mu_2)}{S_p}
\]

and the nonpooled studentized mean difference

\[
\frac{(\bar X_n - \bar Y_m) - (\mu_1 - \mu_2)}{S_d}
\]

both approach the standard normal distribution. This fact can be used to approximate the distribution of the test statistics Tp and Td under the null hypothesis by a standard normal distribution.
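In code, the large-sample approach amounts to nothing more than replacing the t or bootstrap critical values by standard normal ones. A small sketch of the right-tailed normal-approximation p-value (ours; se_diff stands for the estimated standard deviation of X̄n − Ȳm, pooled or nonpooled):

```python
from scipy import stats

def normal_approx_right_p(xbar, ybar, se_diff):
    """Right-tailed p-value for H1: mu1 > mu2, using the N(0,1) approximation."""
    t_value = (xbar - ybar) / se_diff
    return t_value, stats.norm.sf(t_value)   # sf(z) = P(Z >= z)
```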
We illustrate this by means of the following example. To investigate whether a restricted diet promotes longevity, two groups of randomly selected rats were put on different diets. One group of n = 106 rats was put on a restricted diet, the other group of m = 89 rats on an ad libitum diet (i.e., unrestricted eating). The data in Table 28.2 represent the remaining lifetime in days of the two groups of rats after they were put on their diets. The average lifetimes are x̄n = 968.75 and ȳm = 684.01 days. To investigate whether a restricted diet promotes longevity, we test H0 : µ1 = µ2 against H1 : µ1 > µ2, where µ1 and µ2 represent the lifetime of a rat on a restricted diet and on an ad libitum diet, respectively. If we may assume equal variances, we compute

\[
t_p = \frac{968.75 - 684.01}{32.88} = 8.66.
\]

This value is larger than the right critical value z0.0005 = 3.291, which means that we would reject H0 : µ1 = µ2 in favor of H1 : µ1 > µ2 at level α = 0.0005.

Table 28.2. Rat data.

Restricted:
105 193 211 236 302 363 389 390 391 403 530 604 605 630 716 718 727 731 749 769
770 789 804 810 811 833 868 871 875 893 897 901 906 907 919 923 931 940 957 958
961 962 974 979 982 1001 1008 1010 1011 1012 1014 1017 1032 1039 1045 1046 1047
1057 1063 1070 1073 1076 1085 1090 1094 1099 1107 1119 1120 1128 1129 1131 1133
1136 1138 1144 1149 1160 1166 1170 1173 1181 1183 1188 1190 1203 1206 1209 1218
1220 1221 1228 1230 1231 1233 1239 1244 1258 1268 1294 1316 1327 1328 1369 1393
1435

Ad libitum:
89 104 387 465 479 494 496 514 532 536 545 547 548 582 606 609 619 620 621 630
635 639 648 652 653 654 660 665 667 668 670 675 677 678 678 681 684 688 694 695
697 698 702 704 710 711 712 715 716 717 720 721 730 731 732 733 735 736 738 739
741 743 746 749 751 753 764 765 768 770 773 777 779 780 788 791 794 796 799 801
806 807 815 836 838 850 859 894 963

Source: R.L. Berger, D.D. Boos, and F.M. Guess. Tests and confidence sets for comparing two mean residual life functions. Biometrics, 44:103–115, 1988.
The p-value is the right tail probability P(Tp ≥ 8.66), which we approximate by P(Z ≥ 8.66), where Z has an N(0, 1) distribution. From Table B.1 we see that this probability is smaller than P(Z ≥ 3.49) = 0.0002. By means of a statistical package we find P(Z ≥ 8.66) = 2.4 · 10⁻¹⁶. If we repeat the test without the assumption of equal variances, we compute

\[
t_d = \frac{968.75 - 684.01}{31.08} = 9.16,
\]

which also leads to rejection of the null hypothesis. In this case the p-value P(Td ≥ 9.16) ≈ P(Z ≥ 9.16) is even smaller, since 9.16 > 8.66 (a statistical package gives P(Z ≥ 9.16) = 2.6 · 10⁻¹⁸). The data provide overwhelming evidence against the null hypothesis, and we conclude that a restricted diet promotes longevity.

28.5 Solutions to the quick exercises

28.1 Just by looking at the boxplots, the authors believe that the assumption σX² = σY² is reasonable. The lengths of the boxplots and their IQRs are almost the same. However, the boxplots do not reveal how the elements of the dataset vary around the center. One way of quantifying our belief would be to compare the sample variances of the datasets: a ratio of the two sample variances close to one would support our belief in equal variances (in the case of normal samples, this is a standard test called the F-test).

28.2 In this case we have a right and a left critical value. From Quick exercise 27.2 we know that n = m = 10, so that the right critical value is t18,0.005 = 2.878 and the left critical value is −t18,0.005 = −2.878.

28.3 We first compute sd² = (0.0290)²/10 + (0.0428)²/10 = 0.000267 and then td = (1.0194 − 1.0406)/√0.000267 = −1.297.

28.6 Exercises

28.1 The data in Table 28.3 represent salaries (in pounds sterling) in 72 randomly selected advertisements in The Guardian (April 6, 1992). When a range was given in the advertisement, the midpoint of the range is reproduced in the table. The data are salaries corresponding to two kinds of occupations (n = m = 72): (1) creative, media, and marketing and (2) education. The sample mean and sample variance of the two datasets are, respectively:
(1) x̄72 = 17 410 and sx² = 41 258 741,
(2) ȳ72 = 19 818 and sy² = 50 744 521.
Table 28.3. Salaries in two kinds of occupations.

      Occupation (1)             Occupation (2)
17703 13796 12000          25899 17378 19236
42000 22958 22900          21676 15594 18780
18780 10750 13440          15053 17375 12459
15723 13552 17574          19461 20111 22700
13179 21000 22149          22485 16799 35750
37500 18245 17547          17378 12587 20539
22955 19358  9500          15053 24102 13115
13000 22000 25000          10998 12755 13605
13500 12000 15723          18360 35000 20539
13000 16820 12300          22533 20500 16629
11000 17709 10750          23008 13000 27500
12500 23065 11000          24260 18066 17378
13000 18693 19000          25899 35403 15053
10500 14472 13500          18021 17378 20594
12285 12000 32000          17970 14855  9866
13000 20000 17783          21074 21074 21074
16000 18900 16600          15053 19401 25598
15000 14481 18000          20739 15053 15053
13944 35000 11406          15053 15083 31530
23960 18000 23000          30800 10294 16799
11389 30000 15379          37000 11389 15053
12587 12548 21458          48000 11389 14359
17000 17048 21262          16000 26544 15344
 9000 13349 20000          20147 14274 31000

Source: D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski. Small data sets. Chapman and Hall, London, 1994; dataset 385. Data collected by D.J. Hand.

Suppose that the datasets are modeled as realizations of normal distributions with expectations µ1 and µ2, which represent the salaries for occupations (1) and (2).
a. Test the null hypothesis that the salary for both occupations is the same at level α = 0.05 under the assumption of equal variances. Formulate the proper null and alternative hypotheses, compute the value of the test statistic, and report your conclusion.
b. Do the same without the assumption of equal variances.
c. As a comparison, one carries out an empirical bootstrap simulation for the nonpooled studentized mean difference. The bootstrap approximations for the critical values are c*l = −2.004 and c*u = 2.133. Report your conclusion about the salaries on the basis of the bootstrap results.
28.2 The data in Table 28.4 represent the duration of pregnancy for 1669 women who gave birth in a maternity hospital in Newcastle-upon-Tyne, England, in 1954.

Table 28.4. Durations of pregnancy.

Duration  Medical  Emergency  Social
11         1
15         1
17         1
20         1
22         1   2
24         1   3
25         2   1
26         1
27         2   2   1
28         1   2   1
29         3   1
30         3   5   1
31         4   5   2
32        10   9   2
33         6   6   2
34        12   7  10
35        23  11   4
36        26  13  19
37        54  16  30
38        68  35  72
39       159  38 115
40       197  32 155
41       111  27 128
42        55  25  64
43        29   8  16
44         4   5   3
45         3   1   6
46         1   1   1
47         1
56         1

Source: D.J. Newell. Statistical aspects of the demand for maternity beds. Journal of the Royal Statistical Society, Series A, 127:1–33, 1964.

The durati