Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott

MULTIVARIATE DENSITY
ESTIMATION

WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice,
Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott,
Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane,
Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.

MULTIVARIATE DENSITY
ESTIMATION
Theory, Practice, and Visualization
Second Edition
DAVID W. SCOTT
Rice University
Houston, Texas

Copyright © 2015 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax
(978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be suitable
for your situation. You should consult with a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other commercial damages, including but not limited to
special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at
(317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic formats. For more information about Wiley products, visit our web site at
www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Scott, David W., 1950–
Multivariate density estimation : theory, practice, and visualization / David W. Scott. – Second edition.
pages cm
Includes bibliographical references and index.
ISBN 978-0-471-69755-8 (cloth)
1. Estimation theory. 2. Multivariate analysis. I. Title.
QA276.8.S28 2014
519.535–dc23
2014043897
Set in 10/12pts Times Lt Std by SPi Publisher Services, Pondicherry, India
Printed in the United States of America

To Jean, Hilary,
Elizabeth, Warren,
and my parents, John
and Nancy Scott

CONTENTS
PREFACE TO SECOND EDITION xv
PREFACE TO FIRST EDITION xvii
1 Representation and Geometry of Multivariate Data 1
1.1 Introduction, 1
1.2 Historical Perspective, 4
1.3 Graphical Display of Multivariate Data Points, 5
1.3.1 Multivariate Scatter Diagrams, 5
1.3.2 Chernoff Faces, 11
1.3.3 Andrews’ Curves and Parallel Coordinate Curves, 12
1.3.4 Limitations, 14
1.4 Graphical Display of Multivariate Functionals, 16
1.4.1 Scatterplot Smoothing by Density Function, 16
1.4.2 Scatterplot Smoothing by Regression Function, 18
1.4.3 Visualization of Multivariate Functions, 19
1.4.3.1 Visualizing Multivariate Regression Functions, 24
1.4.4 Overview of Contouring and Surface Display, 26
1.5 Geometry of Higher Dimensions, 28
1.5.1 Polar Coordinates in d Dimensions, 28
1.5.2 Content of Hypersphere, 29
1.5.3 Some Interesting Consequences, 30
1.5.3.1 Sphere Inscribed in Hypercube, 30
1.5.3.2 Hypervolume of a Thin Shell, 30
1.5.3.3 Tail Probabilities of Multivariate Normal, 31
1.5.3.4 Diagonals in Hyperspace, 31
1.5.3.5 Data Aggregate Around Shell, 32
1.5.3.6 Nearest Neighbor Distances, 32
Problems, 33
2 Nonparametric Estimation Criteria 36
2.1 Estimation of the Cumulative Distribution Function, 37
2.2 Direct Nonparametric Estimation of the Density, 39
2.3 Error Criteria for Density Estimates, 40
2.3.1 MISE for Parametric Estimators, 42
2.3.1.1 Uniform Density Example, 42
2.3.1.2 General Parametric MISE Method with Gaussian Application, 43
2.3.2 The L1 Criterion, 44
2.3.2.1 L1 versus L2, 44
2.3.2.2 Three Useful Properties of the L1 Criterion, 44
2.3.3 Data-Based Parametric Estimation Criteria, 46
2.4 Nonparametric Families of Distributions, 48
2.4.1 Pearson Family of Distributions, 48
2.4.2 When Is an Estimator Nonparametric?, 49
Problems, 50
3 Histograms: Theory and Practice 51
3.1 Sturges’ Rule for Histogram Bin-Width Selection, 51
3.2 The L2 Theory of Univariate Histograms, 53
3.2.1 Pointwise Mean Squared Error and Consistency, 53
3.2.2 Global L2 Histogram Error, 56
3.2.3 Normal Density Reference Rule, 59
3.2.3.1 Comparison of Bandwidth Rules, 59
3.2.3.2 Adjustments for Skewness and Kurtosis, 60
3.2.4 Equivalent Sample Sizes, 62
3.2.5 Sensitivity of MISE to Bin Width, 63
3.2.5.1 Asymptotic Case, 63
3.2.5.2 Large-Sample and Small-Sample Simulations, 64
3.2.6 Exact MISE versus Asymptotic MISE, 65
3.2.6.1 Normal Density, 66
3.2.6.2 Lognormal Density, 68
3.2.7 Influence of Bin Edge Location on MISE, 69
3.2.7.1 General Case, 69
3.2.7.2 Boundary Discontinuities in the Density, 69
3.2.8 Optimally Adaptive Histogram Meshes, 70
3.2.8.1 Bounds on MISE Improvement for Adaptive Histograms, 71
3.2.8.2 Some Optimal Meshes, 72
3.2.8.3 Null Space of Adaptive Densities, 72
3.2.8.4 Percentile Meshes or Adaptive Histograms with Equal Bin Counts, 73
3.2.8.5 Using Adaptive Meshes versus Transformation, 74
3.2.8.6 Remarks, 75
3.3 Practical Data-Based Bin Width Rules, 76
3.3.1 Oversmoothed Bin Widths, 76
3.3.1.1 Lower Bounds on the Number of Bins, 76
3.3.1.2 Upper Bounds on Bin Widths, 78
3.3.2 Biased and Unbiased CV, 79
3.3.2.1 Biased CV, 79
3.3.2.2 Unbiased CV, 80
3.3.2.3 End Problems with BCV and UCV, 81
3.3.2.4 Applications, 81
3.4 L2 Theory for Multivariate Histograms, 83
3.4.1 Curse of Dimensionality, 85
3.4.2 A Special Case: d = 2 with Nonzero Correlation, 87
3.4.3 Optimal Regular Bivariate Meshes, 88
3.5 Modes and Bumps in a Histogram, 89
3.5.1 Properties of Histogram “Modes”, 91
3.5.2 Noise in Optimal Histograms, 92
3.5.3 Optimal Histogram Bandwidths for Modes, 93
3.5.4 A Useful Bimodal Mixture Density, 95
3.6 Other Error Criteria: L1, L4, L6, L8, and L∞, 96
3.6.1 Optimal L1 Histograms, 96
3.6.2 Other Lp Criteria, 97
Problems, 97
4 Frequency Polygons 100
4.1 Univariate Frequency Polygons, 101
4.1.1 Mean Integrated Squared Error, 101
4.1.2 Practical FP Bin Width Rules, 104
4.1.3 Optimally Adaptive Meshes, 107
4.1.4 Modes and Bumps in a Frequency Polygon, 109
4.2 Multivariate Frequency Polygons, 110
4.3 Bin Edge Problems, 113
4.4 Other Modifications of Histograms, 114
4.4.1 Bin Count Adjustments, 114
4.4.1.1 Linear Binning, 114
4.4.1.2 Adjusting FP Bin Counts to Match Histogram Areas, 117
4.4.2 Polynomial Histograms, 117
4.4.3 How Much Information Is There in a Few Bins?, 120
Problems, 122
5 Averaged Shifted Histograms 125
5.1 Construction, 126
5.2 Asymptotic Properties, 128
5.3 The Limiting ASH as a Kernel Estimator, 133
Problems, 135
6 Kernel Density Estimators 137
6.1 Motivation for Kernel Estimators, 138
6.1.1 Numerical Analysis and Finite Differences, 138
6.1.2 Smoothing by Convolution, 139
6.1.3 Orthogonal Series Approximations, 140
6.2 Theoretical Properties: Univariate Case, 142
6.2.1 MISE Analysis, 142
6.2.2 Estimation of Derivatives, 144
6.2.3 Choice of Kernel, 145
6.2.3.1 Higher Order Kernels, 145
6.2.3.2 Optimal Kernels, 151
6.2.3.3 Equivalent Kernels, 153
6.2.3.4 Higher Order Kernels and Kernel Design, 155
6.2.3.5 Boundary Kernels, 157
6.3 Theoretical Properties: Multivariate Case, 161
6.3.1 Product Kernels, 162
6.3.2 General Multivariate Kernel MISE, 164
6.3.3 Boundary Kernels for Irregular Regions, 167
6.4 Generality of the Kernel Method, 167
6.4.1 Delta Methods, 167
6.4.2 General Kernel Theorem, 168
6.4.2.1 Proof of General Kernel Result, 168
6.4.2.2 Characterization of a Nonparametric Estimator, 169
6.4.2.3 Equivalent Kernels of Parametric Estimators, 171
6.5 Cross-Validation, 172
6.5.1 Univariate Data, 172
6.5.1.1 Early Efforts in Bandwidth Selection, 173
6.5.1.2 Oversmoothing, 176
6.5.1.3 Unbiased and Biased Cross-Validation, 177
6.5.1.4 Bootstrapping Cross-Validation, 181
6.5.1.5 Faster Rates and PI Cross-Validation, 184
6.5.1.6 Constrained Oversmoothing, 187
6.5.2 Multivariate Data, 190
6.5.2.1 Multivariate Cross-Validation, 190
6.5.2.2 Multivariate Oversmoothing Bandwidths, 191
6.5.2.3 Asymptotics of Multivariate Cross-Validation, 192
6.6 Adaptive Smoothing, 193
6.6.1 Variable Kernel Introduction, 193
6.6.2 Univariate Adaptive Smoothing, 195
6.6.2.1 Bounds on Improvement, 195
6.6.2.2 Nearest-Neighbor Estimators, 197
6.6.2.3 Sample-Point Adaptive Estimators, 198
6.6.2.4 Data Sharpening, 200
6.6.3 Multivariate Adaptive Procedures, 202
6.6.3.1 Pointwise Adapting, 202
6.6.3.2 Global Adapting, 203
6.6.4 Practical Adaptive Algorithms, 204
6.6.4.1 Zero-Bias Bandwidths for Tail Estimation, 204
6.6.4.2 UCV for Adaptive Estimators, 208
6.7 Aspects of Computation, 209
6.7.1 Finite Kernel Support and Rounding of Data, 210
6.7.2 Convolution and Fourier Transforms, 210
6.7.2.1 Application to Kernel Density Estimators, 211
6.7.2.2 FFTs, 212
6.7.2.3 Discussion, 212
6.8 Summary, 213
Problems, 213
7 The Curse of Dimensionality and Dimension Reduction 217
7.1 Introduction, 217
7.2 Curse of Dimensionality, 220
7.2.1 Equivalent Sample Sizes, 220
7.2.2 Multivariate L1 Kernel Error, 222
7.2.3 Examples and Discussion, 224
7.3 Dimension Reduction, 229
7.3.1 Principal Components, 229
7.3.2 Projection Pursuit, 231
7.3.3 Informative Components Analysis, 234
7.3.4 Model-Based Nonlinear Projection, 239
Problems, 240
8 Nonparametric Regression and Additive Models 241
8.1 Nonparametric Kernel Regression, 242
8.1.1 The Nadaraya–Watson Estimator, 242
8.1.2 Local Least-Squares Polynomial Estimators, 243
8.1.2.1 Local Constant Fitting, 243
8.1.2.2 Local Polynomial Fitting, 244
8.1.3 Pointwise Mean Squared Error, 244
8.1.4 Bandwidth Selection, 247
8.1.5 Adaptive Smoothing, 247
8.2 General Linear Nonparametric Estimation, 248
8.2.1 Local Polynomial Regression, 248
8.2.2 Spline Smoothing, 250
8.2.3 Equivalent Kernels, 252
8.3 Robustness, 253
8.3.1 Resistant Estimators, 254
8.3.2 Modal Regression, 254
8.3.3 L1 Regression, 257
8.4 Regression in Several Dimensions, 259
8.4.1 Kernel Smoothing and WARPing, 259
8.4.2 Additive Modeling, 261
8.4.3 The Curse of Dimensionality, 262
8.5 Summary, 265
Problems, 266
9 Other Applications 267
9.1 Classification, Discrimination, and Likelihood Ratios, 267
9.2 Modes and Bump Hunting, 273
9.2.1 Confidence Intervals, 273
9.2.2 Oversmoothing for Derivatives, 275
9.2.3 Critical Bandwidth Testing, 275
9.2.4 Clustering via Mixture Models and Modes, 277
9.2.4.1 Gaussian Mixture Modeling, 277
9.2.4.2 Modes for Clustering, 280
9.3 Specialized Topics, 286
9.3.1 Bootstrapping, 286
9.3.2 Confidence Intervals, 287
9.3.3 Survival Analysis, 289
9.3.4 High-Dimensional Holes, 290
9.3.5 Image Enhancement, 292
9.3.6 Nonparametric Inference, 292
9.3.7 Final Vignettes, 293
9.3.7.1 Principal Curves and Density Ridges, 293
9.3.7.2 Time Series Data, 294
9.3.7.3 Inverse Problems and Deconvolution, 294
9.3.7.4 Densities on the Sphere, 294
Problems, 294
APPENDIX A Computer Graphics in ℝ³ 296
A.1 Bivariate and Trivariate Contouring Display, 296
A.1.1 Bivariate Contouring, 296
A.1.2 Trivariate Contouring, 299
A.2 Drawing 3-D Objects on the Computer, 300
APPENDIX B DataSets 302
B.1 US Economic Variables Dataset, 302
B.2 University Dataset, 304
B.3 Blood Fat Concentration Dataset, 305
B.4 Penny Thickness Dataset, 306
B.5 Gas Meter Accuracy Dataset, 307
B.6 Old Faithful Dataset, 309
B.7 Silica Dataset, 309
B.8 LRL Dataset, 310
B.9 Buffalo Snowfall Dataset, 310
APPENDIX C Notation and Abbreviations 311
C.1 General Mathematical and Probability Notation, 311
C.2 Density Abbreviations, 312
C.3 Error Measure Abbreviations, 313
C.4 Smoothing Parameter Abbreviations, 313
REFERENCES 315
AUTHOR INDEX 334
SUBJECT INDEX 339

PREFACE TO SECOND EDITION
The past 25 years have seen confirmation of the importance of density estimation
and nonparametric methods in modern data analysis, in this era of “big data.” This
updated version retains its focus on fostering an intuitive understanding of the under-
lying methodology and supporting theory. I have sought to retain as much of the
original material as possible and, in particular, the point of view of its development
from the histogram. In every chapter, new material has been added to highlight chal-
lenges presented by massive datasets, or to clarify theoretical opportunities and new
algorithms. However, no claim to comprehensive coverage is professed.
I have benefitted greatly from interactions with a number of gifted doctoral
students who worked in this field—Lynette Factor, Donna Nezames, Rod Jee,
Ferdie Wang, Michael Minnotte, Steve Sain, Keith Baggerly, John Salch, Will
Wojciechowski, H.-G. Sung, Alena Oetting, Galen Papkov, Eric Chi, Jonathan Lane,
Justin Silver, Jaime Ramos, and Yeshaya Adler—their work is represented here. In
addition, contributions were made by many students taking my courses. I would
also like to thank my colleagues and collaborators, especially my co-advisor Jim
Thompson and my frequent co-authors George Terrell (VPI), Bill Szewczyk (DoD)
and Masahiko Sagae (Kanazawa University). They have made the lifetime of learn-
ing, teaching, and discovery especially delightful and satisfying. I especially wish to
acknowledge the able help of Robert Kosar in assembling the final versions of the
color figures and reviewing new material.
Not a few mistakes have been corrected. For example, the constant in the expres-
sion for the asymptotic mean integrated squared error for the multivariate histogram
in Theorem 3.5 is now correct. The content of Tables 3.6 and 3.7 has been mod-
ified accordingly, and the effect of dimension on sample size is seen to be even
more dramatic in the corrected version. Any mistakes remain the responsibility of the
author, who would appreciate hearing of such. All will be recorded in an appropriate
repository.
Steve Quigley of John Wiley & Sons was infinitely patient awaiting this second
edition until his retirement, and Kathryn Sharples completed the project. Steve made
a freshly minted LaTeX version available as a starting point. All figures in S-Plus have
been re-engineered into R. Figures in color or using color have been transformed to
gray scale for the printed version, but the original figures will also be available in the
same repository. In the original edition, I also neglected to properly acknowledge the
generous support of the ARO (DAAL-03-88-G-0074 through my colleague James
Thompson) and the ONR (N00014-90-J-1176).
As with the original edition, this revision would not have been possible without the
tireless and enthusiastic support of my wife, Jean, and family. Thanks for everything.
David W. Scott
Houston, Texas
August, 2014

PREFACE TO FIRST EDITION
With the revolution in computing in recent years, access to data of unprecedented
complexity has become commonplace. More variables are being measured, and the
sheer volume of data is growing. At the same time, advancements in the perfor-
mance of graphical workstations have given new power to the data analyst. With
these changes has come an increasing demand for tools that can detect and summa-
rize the multivariate structure in difficult data. Density estimation is now recognized
as a tool useful with univariate and bivariate data; my purpose is to demonstrate that
it is also a powerful tool in higher dimensions, with particular emphasis on trivari-
ate and quadrivariate data. I have written this book for the reader interested in the
theoretical aspects of nonparametric estimation as well as for the reader interested in
the application of these methods to multivariate data. It is my hope that the book can
serve as an introductory textbook and also as a general reference.
I have chosen to introduce major ideas in the context of the classical histogram,
which remains the most widely applied and most intuitive nonparametric estimator.
I have found it instructive to develop the links between the histogram and more statis-
tically efficient methods. This approach greatly simplifies the treatment of advanced
estimators, as much of the novelty of the theoretical context has been moved to the
familiar histogram setting.
The nonparametric world is more complex than its parametric counterpart. I have
selected material that is representative of the broad spectrum of theoretical results
available, with an eye on the potential user, based on my assessments of usefulness,
prevalence, and tutorial value. Theory particularly relevant to application or under-
standing is covered, but a loose standard of rigor is adopted in order to emphasize the
methodological and application topics. Rather than present a cookbook of techniques,
I have adopted a hierarchical approach that emphasizes the similarities among the
different estimators. I have tried to present new ideas and practical advice, together
with numerous examples and problems, with a graphical emphasis.
Visualization is a key aspect of effective multivariate nonparametric analysis, and
I have attempted to provide a wide array of graphic illustrations. All of the figures
in this book were composed using S, S-PLUS, Exponent Graphics from IMSL, and
Mathematica. The color plates were derived from S-based software. The color graph-
ics with transparency were composed by displaying the S output using the MinneView
program developed at the Minnesota Geometry Project and printed on hardware under
development by the 3M Corporation. I have not included a great deal of computer
code. A collection of software, primarily Fortran-based with interfaces to the S lan-
guage, is available by electronic mail at scottdw@rice.edu. Comments and other
feedback are welcomed.
I would like to thank many colleagues for their generous support over the past
20 years, particularly Jim Thompson, Richard Tapia, and Tony Gorry. I have espe-
cially drawn on my collaboration with George Terrell, and I gratefully acknowledge
his major contributions and influence in this book. The initial support for the high-
dimensional graphics came from Richard Heydorn of NASA. This work has been
generously supported by the Office of Naval Research under grant N00014-90-J-
1176 as well as the Army Research Office. Allan Wilks collaborated on the creation
of many of the color figures while we were visiting the Geometry Project, directed by
Al Marden and assisted by Charlie Gunn, at the Minnesota Supercomputer Center.
I have taught much of this material in graduate courses not only at Rice but also
during a summer course in 1985 at Stanford and during an ASA short course in
1986 in Chicago with Bernard Silverman. Previous Rice students Lynette Factor,
Donna Nezames, Rod Jee, and Ferdie Wang all made contributions through their
theses. I am especially grateful for the able assistance given during the final phases
of preparation by Tim Dunne and Keith Baggerly, as well as Steve Sain, Monnie
McGee, and Michael Minnotte. Many colleagues have influenced this work, includ-
ing Edward Wegman, Dan Carr, Grace Wahba, Wolfgang Härdle, Matthew Wand,
Simon Sheather, Steve Marron, Peter Hall, Robert Launer, Yasuo Amemiya, Nils
Hjort, Linda Davis, Bernhard Flury, Will Gersch, Charles Taylor, Imke Janssen,
Steve Boswell, I.J. Good, Iain Johnstone, Ingram Olkin, Jerry Friedman, David
Donoho, Leo Breiman, Naomi Altman, Mark Matthews, Tim Hesterberg, Hal Stern,
Michael Trosset, Richard Byrd, John Bennett, Heinz-Peter Schmidt, Manny Parzen,
and Michael Tarter. Finally, this book could not have been written without the patience
and encouragement of my family.
David W. Scott
Houston, Texas
February, 1992

1
REPRESENTATION AND GEOMETRY
OF MULTIVARIATE DATA
A complete analysis of multidimensional data requires the application of an array of
statistical tools—parametric, nonparametric, and graphical. Parametric analysis is the
most powerful. Nonparametric analysis is the most flexible. And graphical analysis
provides the vehicle for discovering the unexpected.
This chapter introduces some graphical tools for visualizing structure in multidi-
mensional data. One set of tools focuses on depicting the data points themselves,
while another set of tools relies on displaying functions estimated from those
points. Visualization and contouring of functions in more than two dimensions are
introduced. Some mathematical aspects of the geometry of higher dimensions are
reviewed. These results have consequences for nonparametric data analysis.
1.1 INTRODUCTION
Classical linear multivariate statistical models rely primarily on analysis of the covari-
ance matrix. So powerful are these techniques that analysis is almost routine for
datasets with hundreds of variables. While the theoretical basis of parametric mod-
els lies with the multivariate normal density, these models are applied in practice
to many kinds of data. Parametric studies provide neat inferential summaries and
parsimonious representation of the data.
For many problems second-order information is inadequate. Advanced model-
ing or simple variable transformations may provide a solution. When no simple
Multivariate Density Estimation, First Edition. David W. Scott.
© 2015 John Wiley & Sons, Inc. Published 2015 by John Wiley & Sons, Inc.
parametric model is forthcoming, many researchers have opted for fully “unpara-
metric” methods that may be loosely collected under the heading of exploratory data
analysis. Such analyses are highly graphical; but in a complex non-normal setting, a
graph may provide a more concise representation than a parametric model, because
a parametric model of adequate complexity may involve hundreds of parameters.
There are some significant differences between parametric and nonparametric
modeling. The focus on optimality in parametric modeling does not translate well
to the nonparametric world. For example, the histogram might be proved to be an
inadmissible estimator, but that theoretical fact should not be taken to suggest his-
tograms should not be used. Quite to the contrary, some methods that are theoretically
superior are almost never used in practice. The reason is that the ordering of algo-
rithms is not absolute, but is dependent not only on the unknown density but also on
the sample size. Thus the histogram is generally superior for small samples regard-
less of its asymptotic properties. The exploratory school is at the other extreme,
rejecting probabilistic models, whose existence provides the framework for defining
optimality.
In this book, an intermediate point of view is adopted regarding statistical effi-
cacy. No nonparametric estimate is considered wrong; only different components of
the solution are emphasized. Much effort will be devoted to the data-based calibra-
tion problem, but nonparametric estimates can be reasonably calibrated in practice
without too much difficulty. The “curse of optimality” might suggest that this is
an illogical point of view. However, if the notion that optimality is all important is
adopted, then the focus becomes matching the theoretical properties of an estimator
to the assumed properties of the density function. Is it a gross inefficiency to use a
procedure that requires only two continuous derivatives when the curve in fact has six
continuous derivatives? This attitude may have some formal basis but should be dis-
couraged as too heavy-handed for nonparametric thinking. A more relaxed attitude
is required. Furthermore, many “optimal” nonparametric procedures are unstable in
a manner that slightly inefficient procedures are not. In practice, when faced with the
application of a procedure that requires six derivatives, or some other assumption that
cannot be proved in practice, it is more important to be able to recognize the signs
of estimator failure than to worry too much about assumptions. Detecting failure at
the level of a discontinuous fourth derivative is a bit extreme, but certainly the effects
of simple discontinuities should be well understood. Thus only for the purposes of
illustration are the best assumptions given.
The notions of efficiency and admissibility are related to the choice of a criterion,
which can only imperfectly measure the quality of a nonparametric estimate. Unlike
optimal parametric estimates that are useful for many purposes, nonparametric esti-
mates must be optimized for each application. The extra work is justified by the extra
flexibility. As the choice of criterion is imperfect, so then is the notion of a single
optimal estimator. This attitude reflects not sloppy thinking, but rather the imperfect
relationship between the practical and theoretical aspects of our methods. Too rigid a
point of view leads one to a minimax view of the world where nonparametric methods
should be abandoned because there exist difficult problems.
Visualization is an important component of nonparametric data analysis. Data
visualization is the focus of exploratory methods, ranging from simple scatterplots
to sophisticated dynamic interactive displays. Function visualization is a significant
component of nonparametric function estimation, and can draw on the relevant lit-
erature in the fields of scientific visualization and computer graphics. The focus of
multivariate data analysis on points and scatterplots has meant that the full impact
of scientific visualization has not yet been realized. With the new emphasis on
smooth functions estimated nonparametrically, the fruits of visualization will be
attained. Banchoff (1986) has been a pioneer in the visualization of higher dimen-
sional mathematical surfaces. Curiously, the surfaces of interest to mathematicians
contain singularities and discontinuities, all producing striking pictures when pro-
jected to the plane. In statistics, visualization of the smooth density surface in four,
five, and six dimensions cannot rely on projection, as projections of smooth surfaces
to the plane show nothing. Instead, the emphasis is on contouring in three dimensions
and slicing of surfaces beyond. The focus on three and four dimensions is natural
because one and two are so well understood. Beyond four dimensions, the ability to
explore surfaces carefully decreases rapidly due to the curse of dimensionality. For-
tunately, statistical data seldom display structure in more than five dimensions, so
guided projection to those dimensions may be adequate. It is these threshold dimen-
sions from three to five that are and deserve to be the focus of our visualization
efforts.
There is a natural flow among the parametric, exploratory, and nonparametric pro-
cedures that represents a rational approach to statistical data analysis. Begin with a
fully exploratory point of view in order to obtain an overview of the data. If a prob-
abilistic structure is present, estimate that structure nonparametrically and explore
it visually. Finally, if a linear model appears adequate, adopt a fully parametric
approach. Each step conceptually represents a willingness to more strongly smooth
the raw data, finally reducing the dimension of the solution to a handful of interest-
ing parameters. With the assumption of normality, the mind’s eye can easily imagine
the d-dimensional egg-shaped elliptical data clusters. Some statisticians may prefer
to work in the reverse order, progressing to exploratory methodology as a diagnostic
tool for evaluating the adequacy of a parametric model fit.
There are many excellent references that complement and expand on this sub-
ject. In exploratory data analysis, references include Tukey (1977), Tukey and Tukey
(1981), Cleveland and McGill (1988), and Wang (1978).
In density estimation, the classic texts of Tapia and Thompson (1978), Wertz
(1978), and Thompson and Tapia (1990) first indicated the power of the nonpara-
metric approach for univariate and bivariate data. Silverman (1986) has provided a
further look at applications in this setting. Prakasa Rao (1983) has provided a the-
oretical survey with a lengthy bibliography. Other texts are more specialized, some
focusing on regression (Müller, 1988; Härdle, 1990), some on a specific error cri-
terion (Devroye and Györfi, 1985; Devroye, 1987), and some on particular solution
classes such as splines (Eubank, 1988; Wahba, 1990). A discussion of additive models
may be found in Hastie and Tibshirani (1990).

“9780471697558c01” — 2015/2/25 — 16:16 — page 4 — #4
1.2 HISTORICAL PERSPECTIVE
One of the roots of modern statistical thought can be traced to the empirical discov-
ery of correlation by Galton in 1886 (Stigler, 1986). Galton’s ideas quickly reached
Karl Pearson. Although best remembered for his methodological contributions such
as goodness-of-fit tests, frequency curves, and biometry, Pearson was a strong pro-
ponent of the geometrical representation of statistics. In a series of lectures a century
ago in November 1891 at Gresham College in London, Pearson spoke on a wide-
ranging set of topics (Pearson, 1938). He discussed the foundations of the science
of pure statistics and its many divisions. He discussed the collection of observations.
He described the classification and representation of data using both numerical and
geometrical descriptors. Finally, he emphasized statistical methodology and discov-
ery of statistical laws. The syllabus for his lecture of November 11, 1891, includes
this cryptic note:
Erroneous opinion that Geometry is only a means of popular representation: it is a
fundamental method of investigating and analysing statistical material. (his italics)
In that lecture Pearson described 10 methods of geometrical data representation.
The most familiar is a representation “by columns,” which he called the “histogram.”
(Pearson is usually given credit for coining the word “histogram” later in
an 1894 paper.) Other familiar-sounding names include “diagrams,” “chartograms,”
“topograms,” and “stereograms.” Unfamiliar names include “stigmograms,” “euthy-
grams,” “epipedograms,” “radiograms,” and “hormograms.”
Beginning 21 years later, Fisher advanced the numerically descriptive portion of
statistics with the method of maximum likelihood, from which he progressed on to the
analysis of variance and other contributions that focused on the optimal use of data
in parametric modeling and inference. In Statistical Methods for Research Workers,
Fisher (1932) devotes a chapter titled “Diagrams” to graphical tools. He begins the
chapter with this statement:
The preliminary examination of most data is facilitated by the use of diagrams.
Diagrams prove nothing, but bring outstanding features readily to the eye; they are
therefore no substitute for such critical tests as may be applied to the data, but are
valuable in suggesting such tests, and in explaining the conclusions founded upon
them.
An emphasis on optimization and the efficiency of statistical procedures has been
a hallmark of mathematical statistics ever since. Ironically, Fisher was criticized
by mathematical statisticians for relying too heavily upon geometrical arguments in
proofs of his results.
Modern statistics has experienced a strong resurgence of geometrical and graphi-
cal statistics in the form of exploratory data analysis (Tukey, 1977). Given the para-
metric emphasis on optimization, the more relaxed philosophy of exploratory data
analysis has been refreshing. The revolution has been fueled by the low cost of graph-
ical workstations and microcomputers. These machines have enabled current work on
statistics in motion (Scott, 1990), that is, the use of animation and kinematic display
for visualization of data structure, statistical analysis, and algorithm performance. No
longer are static displays sufficient for comprehensive analysis.
All of these events were anticipated by Pearson and his visionary statistical computing
laboratory. In his lecture of April 14, 1891, titled “The Geometry of Motion,”
he spoke of the “ultimate elements of sensations we represent as motions in space
and time.” In 1918, after his many efforts during World War I, he reminisced about
the excitement created by wartime work of his statistical laboratory:
The work has been so urgent and of such value that the Ministry of Munitions has
placed eight to ten computers and draughtsmen at my disposal ... (Pearson, 1938,
p. 165).
These workers produced hundreds of statistical graphs, ranging from detailed maps of
worker availability across England (chartograms) to figures for sighting antiaircraft
guns (diagrams). The use of stereograms allowed for representation of data with three
variables. His “computers,” of course, were not electronic but human. Later, Fisher
would be frustrated because Pearson would not agree to allocate his “computers” to
the task of tabulating percentiles of the t-distribution. But Pearson’s capabilities for
producing high-quality graphics were far superior to those of most modern statisti-
cians prior to 1980. Given Pearson’s joint interests in graphics and kinematics, it is
tantalizing to speculate on how he would have utilized modern computers.
1.3 GRAPHICAL DISPLAY OF MULTIVARIATE DATA POINTS
The modern challenge in data analysis is to be able to cope with whatever complexi-
ties may be intrinsic to the data. The data may, for example, be strongly non-normal,
fall onto a nonlinear subspace, exhibit multiple modes, or be asymmetric. Dealing
with these features becomes exponentially more difficult as the dimensionality of the
data increases, a phenomenon known as the curse of dimensionality. In fact, datasets
with hundreds of variables and millions of observations are routinely compiled that
exhibit all of these features. Examples abound in such diverse fields as remote sens-
ing, the US Census, geological exploration, speech recognition, and medical research.
The expense of collecting and managing these large datasets is often so great that no
funds are left for serious data analysis. The role of statistics is clear, but too often
no statisticians are involved in large projects and no creative statistical thinking is
applied. The goal of statistical data analysis is to extract the maximum information
from the data, and to present a product that is as accurate and as useful as possible.
1.3.1 Multivariate Scatter Diagrams
The presentation of multivariate data is often accomplished in tabular form, par-
ticularly for small datasets with named or labeled objects. For example, Table B.1
contains economic data spanning the depression years of the 1930s, and Table B.2
contains information on a selected sample of American universities. It is easy enough
to scan an individual column in these tables, to make comparisons of library size,
for example, and to draw conclusions one variable at a time (see Tufte (1983) and
Wang (1978)). However, variable-by-variable examination of multivariate data can
be overwhelming and tiring, and cannot reveal any relationships among the variables.
Looking at all pairwise scatterplots provides an improvement (Chambers et al., 1983).
Data on four variables of three species of Iris are displayed in Figure 1.1. (A listing
of the Fisher–Anderson Iris data, one of the few familiar four-dimensional datasets,
may be found in several references and is provided with the S package (Becker et al.,
1988).) What multivariate structure is apparent from this figure? The setosa variety
does not overlap the other two varieties. The versicolor and virginica varieties are not
as well separated, although a close examination reveals that they are almost nonover-
lapping. If the 150 observations were unlabeled and plotted with the same symbol,
it is likely that only two clusters would be observed. Even if it were known a priori
that there were three clusters, it would still be unlikely that all three clusters would be
properly identified. These alternative presentations reflect the two related problems
of discrimination and clustering, respectively.
If the observations from different categories overlap substantially or have differ-
ent sample sizes, scatter diagrams become much more difficult to interpret properly.
The data in Figure 1.2 come from a study of 371 males suffering from chest pain
(Scott et al., 1978): 320 had demonstrated coronary artery disease (occlusion or nar-
rowing of the heart’s own arteries) while 51 had none (see Table B.3). The blood fat
concentrations of plasma cholesterol and triglyceride are predictive of heart disease,
although the correlation is low. It is difficult to estimate the predictive power of these
variables in this setting solely from the scatter diagram. A nonparametric analysis
will reveal some interesting nonlinear interactions (see Chapters 5 and 9).
FIGURE 1.1 Pairwise scatter diagrams of the Iris data with the three species labeled.
1, setosa; 2, versicolor; 3, virginica. (Variables: sepal length, sepal width, petal length, petal width.)
FIGURE 1.2 Scatter diagrams of blood lipid concentrations for 320 diseased and 51
nondiseased males. (Panels: no disease, n=51; with disease, n=320. Axes: cholesterol (mg/dl) and triglyceride (mg/dl).)
An easily overlooked practical aspect of scatter diagrams is illustrated by these
data, which are integer valued. To avoid problems of overplotting, the data have been
jittered or blurred (Chambers et al., 1983); that is, uniform U(−0.5,0.5) noise is
added to each element of the original data. This trick should be regularly employed
for data recorded with three or fewer significant digits (with an appropriate range on
the added uniform noise). Jittering reduces visual miscues that result from the vertical
and horizontal synchronization of regularly spaced data.
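The jittering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `jitter`, the `half_width` parameter, and the cholesterol-like values are hypothetical and are not taken from Table B.3.

```python
import numpy as np

def jitter(values, half_width=0.5, seed=None):
    """Add independent U(-half_width, half_width) noise to each element."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-half_width, half_width, size=values.shape)

# Hypothetical integer-valued measurements with many ties.
chol = np.array([200, 200, 201, 200, 199, 201])
jittered = jitter(chol, half_width=0.5, seed=0)
```

For data recorded to the nearest integer, a half width of 0.5 keeps every jittered value within the rounding interval of its original, so rounding recovers the recorded data. A different recording precision would call for a different half width.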
The visual perception system can easily be overwhelmed if the number of points
is more than several thousand. Figure 1.3 displays three pairwise scatterplots derived
from measurements taken in 1977 by the Landsat remote sensing system over a 5 mile
by 6 mile agricultural region in North Dakota with n = 22,932 = 117 × 196 pixels
or picture elements, each corresponding to an area approximately 1.1 acres in size
(Scott and Thompson, 1983; Scott and Jee, 1984). The Landsat instrument mea-
sures the intensity of light in four spectral bands reflected from the surface of the
earth. A principal components transformation gives two variables that are commonly
referred to as the “brightness” and “greenness” of each pixel. Every pixel is mea-
sured at regular intervals of approximately 3 weeks. During the summer of 1977, six
useful replications were obtained, giving 24 measurements on each pixel. Using an
agronometric growth model for crops, Badhwar et al. (1982) nonlinearly transformed
this 24-dimensional data to three dimensions. Badhwar described these synthetic vari-
ables, (x1,x2,x3), as (1) the calendar time at which peak greenness is observed, (2) the
length of crop ripening, and (3) the peak greenness value, respectively. The scatter
diagrams in Figure 1.3 have also been enhanced by jittering, as the raw data are
integers from 0 to 255. The use of integers allows compression to eight bits of
computer memory. Only structure in the boundary and tails is readily seen. The over-
plotting problem is apparent and the blackened areas include over 95% of the data.
Other techniques to enhance scatter diagrams are needed to see structure in the bulk
of the data cloud, such as plotting random subsets (see Tukey and Tukey (1981)).
FIGURE 1.3 Pairwise scatter diagram of transformed Landsat data from 22,932 pixels over
a 5 by 6 nautical mile region. The range on all the axes is (0, 255). (Variables: peak time, peak width, peak value.)
Pairwise scatter diagrams lack one important property necessary for identifying
more than two-dimensional features: strong interplot linkage among the plots. In
principle, it should be possible to locate the same point in each figure, assuming
the data are free of ties. But it is not practical to do so for samples of any size. For
quadrivariate data, Diaconis and Friedman (1983) proposed drawing lines between
corresponding points in the scatterplots of (x1,x2) and (x3,x4) (see Problem 1.2). But a
more powerful dynamic technique that takes full advantage of computer graphics has
been developed by several research groups (McDonald, 1982; Becker and Cleveland,
1987; see the many references in Cleveland and McGill, 1988). The method is called
brushing or painting a scatterplot matrix. Using a pointing device such as a mouse,
a subset of the points in one scatter diagram is selected and the corresponding points
are simultaneously highlighted in the other scatter diagrams. Conceptually, a subset
of points in $\Re^d$ is tagged, for example, by painting the points red or making the points
blink synchronously, and that characteristic is inherited by the linked points in all the
“linked” graphs, including not only scatterplots but also histograms and regression
plots as well. The Iris example in Figure 1.1 illustrates the flavor of brushing with
three tags. Usually the color of points is changed rather than the symbol type. Brushing
is an excellent tool for identifying outliers and following well-defined clusters. It
is well suited for conditioning on some variable, for example, 1 < x3 < 3.
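Stripped of the interactive graphics, brushing amounts to a boolean tag per observation that every linked panel shares. The following sketch shows that idea only, under stated assumptions: the data are simulated, the function name `brush` and the rectangular brush are hypothetical, and the strictness of the inequalities in the conditioning example is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 4, size=(200, 4))   # hypothetical data, columns x1..x4

def brush(X, col_a, col_b, a_range, b_range):
    """Boolean tag for points inside a rectangle drawn in the (col_a, col_b) panel."""
    a, b = X[:, col_a], X[:, col_b]
    return ((a >= a_range[0]) & (a <= a_range[1]) &
            (b >= b_range[0]) & (b <= b_range[1]))

tag = brush(X, 0, 1, (0.0, 1.0), (0.0, 1.0))   # rectangle brushed in the (x1, x2) panel
linked = X[tag][:, [2, 3]]                     # the same points as seen in the (x3, x4) panel
cond = (X[:, 2] > 1) & (X[:, 2] < 3)           # conditioning on x3 between 1 and 3
```

Because the tag is indexed by observation rather than by panel, any display built from the same rows (a histogram or a regression plot, say) can inherit it.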
These ideas are illustrated in Figure 1.4 for the PRIM4 dataset (Friedman and
Tukey, 1974; the data summarize 500 high-energy particle physics scattering exper-
iments) provided in the S language. Using the brushing tool in S-PLUS (1990), the
left cluster in the 1–2 scatterplot was brushed, and then the left cluster in the 2–4
scatterplot was brushed with a different symbol. Try to imagine linking the clusters
throughout the scatterplot matrix without any highlighting.

FIGURE 1.4 Pairwise scatterplots of the transformed PRIM4s data using the ggobi visual-
ization system. Two clumps of points are highlighted by brushing.
There are limitations to the brushing technique. The number of pairwise scatterplots
is $\binom{d}{2}$, so viewing more than 5 or 10 variables at once is impractical.
Furthermore, the physical size of each scatter diagram is reduced as more variables
are added, so that fewer distinct data points can be plotted. If there are more than
a few variables, the eye cannot follow many of the dynamic changes in the pattern
of points during brushing, except with the simplest of structure. It is, however, an
open question as to the number of dimensions of structure that can be perceived by
this method of linkage. Brushing remains an important and well-used tool that has
proven successful in real data analysis.
If a 2-D array of bivariate scatter diagrams is useful, then why not construct a
3-D array of trivariate scatter diagrams? Navigating the collection of $\binom{d}{3}$ trivariate
scatterplots is difficult even with modest values of d. But a single 3-D scatterplot
can easily be rotated in real time with significant perceptual gain compared to three
bivariate diagrams in the scatterplot matrix. Many statistical packages now provide
this capability. The program MacSpin (Donoho et al., 1988) was the first widely used
software of this type. The top middle panel in Figure 1.4 displays a particular ori-
entation of a rotating 3-D scatterplot. The kinds of structure available in 3-D data
are more complex (and hence more interesting) than in 2-D data. Furthermore, the
overplotting problem is reduced as more data points can be resolved in a rotating 3-D
scatterplot than in a static 2-D view (although this is resolution dependent—a 2-D
view printed by a laser device can display significantly more points than is possible
on a computer monitor). Density information is still relatively difficult to perceive,
however, and the sample size definitely influences perception.
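To make the growth in the number of panels concrete, the counts of bivariate and trivariate scatterplots discussed above can be tabulated directly. This is a small sketch using the standard library; the function name `panel_counts` is hypothetical.

```python
from math import comb

def panel_counts(d):
    """Return (number of pairwise, number of trivariate) scatterplots for d variables."""
    return comb(d, 2), comb(d, 3)

for d in (4, 5, 10):
    pairs, triples = panel_counts(d)
    print(f"d={d}: {pairs} pairwise, {triples} trivariate")
```

With d = 10 there are already 45 pairwise panels and 120 trivariate ones, which illustrates why a scatterplot matrix becomes hard to read well before the number of variables is large.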
Beyond three dimensions, many novel ideas are being pursued (see Tukey and
Tukey (1981)). Six-dimensional data could be viewed with two rotating 3-D scat-
ter diagrams linked by brushing. Carr and Nicholson (1988) have actively pursued
using stereography as an alternative and adjunct to rotation. Some workers report
that stereo viewing of static data can be more precise than viewing dynamic rotation
alone. Unfortunately, many individuals suffer from color blindness and various depth
perception limitations, rendering some techniques useless. Nevertheless, it is clear
that there is no limit to the possible combinations of ideas one might consider imple-
menting. Such efforts can easily take many months to program without any fancy
interface. This state of affairs would be discouraging but for the fact that a LISP-
based system for easily prototyping such ideas is now available using object-oriented
concepts (see Tierney (1990)). RStudio has made the shiny app available for this purpose
as well: see http://shiny.rstudio.com. A collection of articles is devoted to the
general topic of animation (Cleveland and McGill, 1988).
The idea of displaying 2- or 3-D arrays of 2- or 3-D scatter diagrams is perhaps
too closely tied to the Euclidean coordinate system. It might be better to examine
many 2- or 3-D projections of the data. An orderly way to do approximately just
that is the “grand tour” discussed by Asimov (1985). Let P be a d × 2 projection
matrix, which takes the d-dimensional data down to a plane. The author proposed
examining a sequence of scatterplots obtained by a smoothly changing sequence of
projection matrices. The resulting kinematic display shows the n data points mov-
ing in a continuous (and sometimes seemingly random) fashion. It may be hoped
that most interesting projections will be displayed at some point during the first sev-
eral minutes of the grand tour, although for even 10 variables several hours may be
required (Huber, 1985).
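The idea of a smoothly changing sequence of projection matrices can be sketched as follows. This is not Asimov's construction: as a simplifying assumption, the sketch blends linearly between randomly drawn d × 2 frames and re-orthonormalizes each blend with a QR factorization. The names `orthonormalize` and `tour_frames`, and the simulated data, are hypothetical.

```python
import numpy as np

def orthonormalize(a):
    """Orthonormalize the columns of a (QR, with signs fixed so the map is continuous)."""
    q, r = np.linalg.qr(a)
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs

def tour_frames(d, n_targets=3, steps=20, seed=0):
    """Yield a slowly changing sequence of d x 2 matrices with orthonormal columns."""
    rng = np.random.default_rng(seed)
    current = orthonormalize(rng.standard_normal((d, 2)))
    for _ in range(n_targets):
        target = orthonormalize(rng.standard_normal((d, 2)))
        for s in range(1, steps + 1):
            w = s / steps
            # Blend toward the target frame, then restore orthonormal columns.
            yield orthonormalize((1.0 - w) * current + w * target)
        current = target

# Hypothetical data: n = 100 points in d = 5 dimensions.
X = np.random.default_rng(1).standard_normal((100, 5))
views = [X @ P for P in tour_frames(5)]   # each view is an n x 2 scatterplot
```

Displaying the views in order gives a kinematic display of the kind described. A linear blend does not move at constant speed on the space of planes, so a serious implementation would likely interpolate along geodesics instead.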
Special attention should be drawn to representing multivariate data in the bivariate
scatter diagram with points replaced by glyphs, which are special symbols whose
shapes are determined by the remaining data variables (x3,...,xd). Figure 1.5 displays
the Iris data in such a form following Carr et al. (1986). The length and angle of the
glyph are determined by the sepal length and width, respectively. Careful examination
of the glyphs shows that there is no gap in 4-D between the versicolor and virginica
species, as the angles and lengths of the glyphs are similar near the boundary.
FIGURE 1.5 Glyph scatter diagram of the Iris data. (Axes: petal length and petal width.
Glyph (length, angle) = (sepal length, sepal width). Legend: setosa, versicolor, virginica.)
FIGURE 1.6 A three-dimensional scatter diagram of the Fisher–Anderson Iris data, omitting
the sepal length variable. From left to right, the 50 points for each of the three varieties of
setosa, versicolor, and virginica are distinguished by symbol type (square, diamond, triangle),
respectively. The symbol is required to indicate the presence of three clusters rather than only
two. The same basic picture results from any choice of three variables from the full set of four
variables. (Axes: petal length, sepal width, petal width.)
A second glyph representation shown in Figure 1.6 is a 3-D scatterplot omitting
sepal length, one of the four variables. This figure clearly depicts the structure in
these data. Plotting glyphs in 3-D scatter diagrams with stereography is a more pow-
erful visual tool (Carr and Nicholson, 1988). The glyph technique does not treat
variables “symmetrically” and all variable–glyph combinations could be considered.
This complaint affects most multivariate procedures (with a few exceptions).
All of these techniques are an outgrowth of a powerful system devised to analyze
data in up to nine dimensions called PRIM-9 (Fisherkeller et al., 1974; reprinted in
Cleveland and McGill, 1988). The PRIM-9 system contained many of the capabilities
of current systems. The letters are an acronym for “Picturing, Rotation, Isolation, and
Masking.” The latter two serve to identify and select subsets of the multivariate data.
The “picturing” feature was implemented by pressing two buttons that cycled through
all of the $\binom{9}{2} = 36$ pairwise scatter diagrams in current coordinates. An IBM 360 mainframe
was specially modified to drive the custom display system.
1.3.2 Chernoff Faces
Chernoff (1973) proposed a special glyph that associates variables to facial features,
such as the size and shape of the eyes, nose, mouth, hair, ears, chin, and facial out-
line. Certainly, humans are able to discriminate among nearly identical faces very
well. Chernoff has suggested that most other multivariate point methods “seem to be
less valuable in producing an emotional response” (Wang, 1978, p. 6). Whether an
emotional response is desired is debatable. Chernoff faces for the time series dataset
in Table B.1 are displayed in Figure 1.7. (The variable–feature associations are listed
in the table.) By carefully studying an individual facial feature such as the smile over
the sequence of all the faces, simple trends can be recognized. But it is the overall
multivariate impression that makes Chernoff faces so powerful. Variables should be
carefully assigned to features. For example, Chernoff faces of the colleges’ data in
Table B.2 might logically assign variables relating to the library to the eyes rather
than to the mouth (see Problem 1.3). Such subjective judgments should not prejudice
our use of this procedure.
FIGURE 1.7 Chernoff faces of the economic dataset spanning 1925–1939. (One face per year, 1925 through 1939.)
One early application not in a statistics journal was constructed by Hiebert-Dodd
(1982), who had examined the performance of several optimization algorithms on a
suite of test problems. She reported that several referees felt this method of presenta-
tion was too frivolous. Comparing the endless tables in the paper as it appeared to the
Chernoff faces displayed in the original technical report, one might easily conclude
the referees were too cautious. On the other hand, when Rice University administra-
tors were shown Chernoff faces of the colleges’ dataset, they were quite open to its
suggestions and enjoyed the exercise. The practical fact is that repetitious viewing of
large tables of data is tedious and haphazard, and broad-brush displays such as faces
can significantly improve data digestion. Several researchers have noted that Chernoff
faces contain redundant information because of symmetry. Flury and Riedwyl (1981)
have proposed using asymmetrical faces, as did Turner and Tidmore (1980), although
Chernoff has stated he believes the additional gain does not justify such nonrealistic
figures.
1.3.3 Andrews’ Curves and Parallel Coordinate Curves
FIGURE 1.8 Star diagram for 4 years of the economic dataset shown in Figure 1.7. (Panels: 1929, 1930, 1931, 1932.)
Three intriguing proposals display not the data points themselves but rather a unique
curve determined by the data vector x. Andrews (1972) proposed representing
high-dimensional data by replacing each point in $\Re^d$ with a curve s(t) for $|t| \le \pi$,
where
$$s(t \mid x_1,\ldots,x_d) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots,$$
the so-called Fourier series representation. This mapping provides the first “com-
plete” continuous view of high-dimensional points on the plane, because, in principle,
the original multivariate data point can be recovered from this curve. Clearly, an
Andrews’ curve is dominated by the variables placed on the low-frequency terms,
so care should be taken to put the most interesting variables early in the expansion
(see Problem 1.4).
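The Fourier series representation above is straightforward to evaluate numerically. The sketch below is an illustration under stated assumptions: the function name `andrews_curve` is hypothetical, and the four-variable point is used purely as an example input.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate s(t | x) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ..."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    s = np.full(t.shape, x[0] / np.sqrt(2.0))
    for j in range(1, x.size):
        k = (j + 1) // 2                       # frequencies 1, 1, 2, 2, 3, ...
        trig = np.sin if j % 2 == 1 else np.cos
        s = s + x[j] * trig(k * t)
    return s

t = np.linspace(-np.pi, np.pi, 201)
x = [5.1, 3.5, 1.4, 0.2]                       # illustrative four-variable observation
curve = andrews_curve(x, t)
```

Reordering the entries of `x` changes which variables sit on the low-frequency terms, which is one way to see how strongly the curve depends on the ordering of the variables.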
A simple graphical device that treats the d variables symmetrically is the star dia-
gram, which is discussed by Fienberg (1979). The d axes are drawn as spokes on a
wheel. The coordinate data values are plotted on those axes and connected as shown
in Figure 1.8.
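The geometry of a star diagram can be sketched as follows: spoke k of d is drawn at angle 2πk/d, and the k-th value is plotted at that distance along it. The function name `star_vertices` is hypothetical, and the sketch assumes the values are nonnegative and already scaled to comparable ranges; the text does not specify a scaling.

```python
import numpy as np

def star_vertices(x):
    """Polygon vertices of a star diagram: value x[k] plotted on spoke k of d."""
    x = np.asarray(x, dtype=float)
    angles = 2.0 * np.pi * np.arange(x.size) / x.size
    return np.column_stack((x * np.cos(angles), x * np.sin(angles)))

# Four equal values give a square whose corners lie on the four spokes.
square = star_vertices([1.0, 1.0, 1.0, 1.0])
```

Connecting consecutive vertices, and closing the polygon back to the first, gives the outline of one star.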
Another novel multivariate approach that treats variables in a symmetric fashion is
the parallel coordinates plot, introduced by Inselberg (1985) in a mathematical set-
ting and extended by Wegman (1990) to the analysis of stochastic data. Cartesian
coordinates are abandoned in favor of d axes drawn parallel and equally spaced.
Each multivariate point $x \in \Re^d$ is plotted as a piecewise linear curve connecting
the d points on the parallel axes. For reasons shown by Inselberg and Wegman,
there are advantages to simply drawing piecewise linear line segments, rather than
a smoother line such as a spline. The disadvantage of this choice is that points
that have identical values in any coordinate dimension cannot be distinguished in
parallel coordinates. However, with this choice a duality may be deduced between
points and lines in Euclidean and parallel coordinates. In the left frame of Figure 1.9,
six points that fall on a straight line with negative slope are plotted. The right frame
shows those same points in parallel coordinates. Thus a scatter diagram of highly
correlated normal points displays a nearly common point of intersection in parallel
coordinates. However, if the correlation is positive, that point is not “between” the
parallel axes (see Problem 1.6). The location of the point where the lines all intersect
can be used to recover the equation of the line back in Euclidean coordinates (see
Problem 1.8).
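The duality can be checked numerically in the bivariate case. Assume, as a convention not stated in the text, that the x1 axis sits at horizontal position u = 0 and the x2 axis at u = 1. A point (x1, x2) then becomes the segment of height x1 + u(x2 − x1). For points on the line x2 = m x1 + b with m ≠ 1, that height is free of x1 exactly when u = 1/(1 − m), where it equals b/(1 − m). The function names and the particular slope and intercept below are hypothetical and are not the values used in Figure 1.9.

```python
import numpy as np

def intersection_point(m, b):
    """Common crossing point (u, y) of the parallel-coordinate segments for points
    on x2 = m*x1 + b, with the x1 axis at u = 0 and the x2 axis at u = 1 (m != 1)."""
    u = 1.0 / (1.0 - m)
    return u, b * u

def segment_height(x1, x2, u):
    """Height at horizontal position u of the line through (0, x1) and (1, x2)."""
    return x1 + u * (x2 - x1)

m, b = -1.0, 1.5                       # hypothetical line with negative slope
x1 = np.linspace(0.25, 1.25, 6)        # six collinear points
x2 = m * x1 + b
u, y = intersection_point(m, b)
heights = segment_height(x1, x2, u)    # all six segments pass through (u, y)
```

Under this convention a negative slope gives 0 < u < 1, so the crossing lies between the axes, while a positive slope other than 1 puts it outside them. Inverting the two formulas recovers m and b from the crossing point.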
FIGURE 1.9 Example of duality of points and lines between Euclidean and parallel
coordinates. The points are labeled 1 to 6 in both coordinate systems. (Left frame: Euclidean axes x1 and x2; right frame: parallel axes x1 and x2.)
A variety of other properties with potential applications are explored by Inselberg
and Wegman. One result is a graphical means of deciding if a point $x \in \Re^d$ is on the
inside or the outside of a convex closed hypersurface. If all the points on the hypersurface
are plotted in parallel coordinates, then a well-defined geometrical outline
will appear on the plane. If a portion of the line segments defining the point x in parallel
coordinates falls outside the outline, then x is not inside the hypersurface, and
vice versa. One of the more fascinating extensions developed by Wegman is a grand
tour of all variables displayed in parallel coordinates. The advantage of parallel coor-
dinates is that all d of the rotating variables are visible simultaneously, whereas in
the usual presentation, only two of the grand tour variables are visible in a bivariate
scatterplot.
Figure 1.10 displays parallel coordinate plots of the Iris and earthquake data. The
earthquake dataset represents the epicenters of 473 tremors beneath the Mount St.
Helens volcano in the several months preceding its March 1982 eruption (Weaver
et al., 1983). Clearly, the tremors are mostly small in magnitude, increasing in fre-
quency over time, and clustered near the surface, although depth is clearly a bimodal
variable. The longitude and latitude variables are least effective on this plot, because
their natural spatial structure is lost.
1.3.4 Limitations
Tools such as Chernoff faces and scatter diagram glyphs tend to be most valuable
with small datasets where individual points are “identifiable” or interesting. Such
individualistic exploratory tools can easily generate “too much ink” (Tufte, 1983)
and produce figures with black splotches, which convey little information. Parallel
coordinates and Andrews’ curves generate much ink. One obvious remedy is to plot
only a subset of the data in a process known as “thinning.” However, plotting random
subsets no longer makes optimal use of all the data and does not result in precisely
reproducible interpretations. Point-oriented methods typically have a range of sample
sizes that is most appropriate: n < 200 for faces; n < 2000 for scatter diagrams.
FIGURE 1.10 Parallel coordinate plot of the earthquake dataset. (Panel axes: Sepal.length, Sepal.width,
Petal.length, Petal.width for the Iris data; Longitude, Latitude, Depth, Day, Intensity for the earthquake data.)
Since none of these displays is truly d-dimensional, each has limitations. All pair-
wise scatterplots can detect distinct clusters and some two-dimensional structure (if
perhaps in a rotated coordinate system). In the latter case, an interactive supplement
such as brushing may be necessary to confirm the nature of the links among the scat-
terplots (not really providing any higher dimensional information). On the positive
side, variables are treated symmetrically in the scatterplot matrix. But many different
and highly dissimilar d-dimensional datasets can give rise to visually similar scatter-
plot matrix diagrams; hence the need for brushing. However, with increasing number
of variables, individual scatterplots physically decrease in size and fill up with ink
ever faster. Scatter diagrams provide a highly subjective view of data, with poor
density perception and greatest emphasis on the tails of the data.

1.4 GRAPHICAL DISPLAY OF MULTIVARIATE FUNCTIONALS
1.4.1 Scatterplot Smoothing by Density Function
As graphical exploratory tools, each of the point-based procedures has significant
value. However, each suffers from the problem of too much ink, as the number of
objects (and hence the amount of ink) is linear in the sample size n. To mix metaphors,
point-based graphs cannot provide a consistent picture of the data as n → ∞. As Scott
and Thompson (1983) wrote,
the scatter diagram points to the bivariate density function.
In other words, the raw data points need to be smoothed if a consistent view is to be
obtained.
A histogram is the simplest example of a scatterplot smoother. The amount of
smoothness is controlled by the bin width. For univariate data, the histogram with
bin width narrower than min |xi −xj| is precisely a univariate scatter diagram plotted
with glyphs that are tall, thin rectangles. For bivariate data, the glyph is a beam with a
square base. Increasing the bin width, the histogram represents a count per unit area,
which is precisely the unit of a probability density. In Chapter 3, the histogram will
be shown to provide a consistent estimate of the density function in any dimension.
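As a minimal sketch of the histogram viewed as a density estimate, the following Python function bins univariate data and divides each count by n times the bin width, so that the bar areas sum to one. The function name `histogram_density`, the `origin` argument, and the toy data are illustrative assumptions and do not come from the text.

```python
import math

def histogram_density(data, bin_width, origin=0.0):
    """Return {bin_index: density}, where density = count / (n * bin_width).

    Bin k covers the interval [origin + k*h, origin + (k+1)*h).
    """
    n = len(data)
    counts = {}
    for x in data:
        k = math.floor((x - origin) / bin_width)
        counts[k] = counts.get(k, 0) + 1
    return {k: c / (n * bin_width) for k, c in sorted(counts.items())}

# Toy data (hypothetical), chosen only to show the effect of the bin width.
data = [0.2, 0.7, 1.1, 1.3, 1.4, 2.6, 2.9, 3.5]

# Bin width narrower than min |xi - xj| (here 0.1): every occupied bin
# holds exactly one point, so the histogram is a scatter diagram drawn
# with tall, thin rectangles.
narrow = histogram_density(data, bin_width=0.05)

# Wider bins: the heights are counts per unit length, scaled to integrate to one.
wide = histogram_density(data, bin_width=1.0)

print(len(narrow), "occupied bins for", len(data), "points")
print(wide)
```

Increasing `bin_width` trades the spiky one-point-per-bin picture for a smoother estimate, which is the sense in which the bin width controls the amount of smoothing.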
Histograms can provide a wealth of information for large datasets, even well-
known ones. For example, consider the 1979–1981 decennial life table published
by the U.S. Bureau of the Census (1987). Certain relevant summary statistics are
well-known: life expectancy, infant mortality, and certain conditional life expectan-
cies. But what additional information can be gleaned by examining the mortality
histogram itself? In Figure 1.11, the histogram of age of death for individuals is
depicted. Not surprisingly, the histogram is skewed with a short tail for older ages.
Not as well-known perhaps is the observation that the most common age of death is
85! The absolute and relative magnitude of mortality in the first year of life is made
strikingly clear.
Careful examination reveals two other general features of interest. The first feature
is the small but prominent bump in the curve between the ages of 13 and 27 years.
This “excess mortality” is due to an increase in a variety of risky activities, the most
notable being obtaining a driver’s license. In the right frame of Figure 1.11, compar-
ison of the 1959–1961 (Gross and Clark, 1975) and 1979–1981 histograms shows an
impressive reduction of death in all preadolescent years. Particularly striking is the
60% decline in mortality in the first year and the 3-year difference in the locations of
the modes.
These facts are remarkable when placed in the context of the mortality histogram
constructed by John Graunt from the Bills of Mortality during the plague years.
Graunt (1662) estimated that 36% of individuals died before attaining their sixth birth-
day! Graunt was a contemporary of the better-known William Petty, to whom some
credit for these ideas is variously ascribed, probably without cause. The circumstantial
evidence that Graunt actually invented the histogram while looking at these mortality
data seems quite strong, although there is reason to infer that Galileo had used
histogram-like diagrams earlier. Hald (1990) recounts a portion of Galileo’s Dialogo,
published in 1632, in which Galileo summarized his observations on the star that
appeared in 1572. According to Hald, Galileo noted the symmetry of the “observation
errors” and the more frequent occurrence of small errors than large errors. Both
points suggest Galileo had constructed a frequency diagram to draw those conclusions.
FIGURE 1.11 Histogram of the U.S. mortality data in 1960. Rootgrams (histograms plotted
on a square-root scale) of the mortality data for 1960, 1980, and 1997. (Horizontal axis: age of death, 0 to 100; vertical axes: number per 100,000 in the left frame and its square root in the right frame.)
Many large datasets are in fact collected in binned or histogram form. For
example, elementary particles in high-energy physics scattering experiments are man-
ifested by small bumps in the frequency curve. Good and Gaskins (1980) considered
such a large dataset (n = 25,752) from the Lawrence Radiation Laboratory (LRL)
(see Figure 1.12). The authors devised an ingenious algorithm for estimating the
odds that a bump observed in the frequency curve was real. This topic is covered
in Chapter 9.
Multivariate scatterplot smoothing of time series data is also easily accomplished
with histograms. Consider a univariate time series and smooth both the raw data {xt}
as well as the lagged data {xt,xt+1}. Any strong elliptical structure present in the
smoothed lagged-data diagram provides a graphical version of the first-order auto-
correlation coefficient. Consider the Old Faithful geyser dataset listed in Table B.6.
These data are the durations in minutes of 107 eruptions of the Old Faithful geyser
(Weisberg, 1985). As there was a gap in the recording of data between midnight
and 6 a.m., there are only 99 pairs {xt,xt+1} available. The univariate histogram
in Figure 1.13 reveals a simple bimodal structure—short and long eruption dura-
tions. The most notable feature in the bivariate (smoothed) histogram is the missing
fourth bump corresponding to the short-short duration sequence. Clearly, graphs of
f̂(xt+1|xt) would be useful for improved prediction compared to a regression estimate.
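The lagged-pair construction can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names `lagged_pairs` and `bivariate_counts` are hypothetical, `None` is used here to mark a gap in the recording, and the durations are made-up values rather than the Table B.6 data.

```python
import math

def lagged_pairs(series):
    """Return consecutive pairs (x_t, x_{t+1}), skipping any pair that touches a gap."""
    pairs = []
    for a, b in zip(series, series[1:]):
        if a is not None and b is not None:
            pairs.append((a, b))
    return pairs

def bivariate_counts(pairs, bin_width, origin=0.0):
    """Bin the pairs on a square mesh; keys are (x-bin, y-bin) indices."""
    counts = {}
    for x, y in pairs:
        key = (math.floor((x - origin) / bin_width),
               math.floor((y - origin) / bin_width))
        counts[key] = counts.get(key, 0) + 1
    return counts

# Hypothetical eruption durations in minutes; None marks an overnight gap.
durations = [4.4, 1.9, 4.3, 4.6, 2.0, 4.5, None, 1.8, 4.7, 4.1, 1.7, 4.9]

pairs = lagged_pairs(durations)

# A deliberately coarse 3-minute bin: index 0 is "short", index 1 is "long".
counts = bivariate_counts(pairs, bin_width=3.0)
print(len(pairs), "pairs from", len(durations), "entries")
print(counts)
```

Each gap removes two candidate pairs, which is why the number of lagged pairs is smaller than the number of eruptions minus one. In this toy series the (short, short) cell is empty, mimicking the missing fourth bump described above.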
FIGURE 1.12 Histogram of LRL dataset (bin count, 0 to 600, versus energy in MeV, 500 to 2000).
FIGURE 1.13 Histogram of {xt} for the Old Faithful geyser dataset, and a bivariate
histogram of the lagged data (xt,xt+1). (Univariate frame: bin count versus eruption duration in minutes; bivariate frame: axes X(t) and X(t+1), each from 1 to 5.5.)
For more than two dimensions, only slices are available for viewing with histogram
surfaces. Consider the Landsat data again. Divide the (jittered) data into four pieces
using quartiles of x1, which is the time of peak greenness. Examining a series of
bivariate pictures of (x2,x3) for each quartile slice provides a crude approximation
of the four-dimensional surface f̂(x1,x2,x3) (see Figure 1.14). The histograms are
all constructed on the subinterval [−5,100]×[−5,100]. Compare this representation
of the Landsat data to that in Figure 1.3. From Figure 1.3, it is clear that most of
the outliers are in the last quartile of x1. How well can the relative density levels
be determined from the scatter diagrams? Visualization of a smoothed histogram of
these data will be considered in Section 1.4.3.
FIGURE 1.14 Bivariate histogram slices of the trivariate Landsat data. Slicing was
performed at the quartiles of variable x1. (Six panels, with x1 ranging over 5.2 to 82.7, 82.7 to 85.2, 85.2 to 87.4, 87.4 to 93.8, 93.8 to 97.2, and 97.2 to 249.5; axes x2, from 0 to 60, and x3, from 0 to 115.)
1.4.2 Scatterplot Smoothing by Regression Function
The term scatterplot smoother is most often applied to regression data. For bivariate
data, either a nonparametric regression line can be superimposed upon the data, or
the points themselves can be moved toward the regression line. Tukey (1977) presents
the “3R” smoother as an example of the latter. Suppose that the n data points, {xt}, are
measured on a fixed time scale. The 3R smoothing algorithm replaces each point {xt}
with the median of the three points {xt−1,xt,xt+1} recursively until no changes occur.
This algorithm is a powerful filter that removes isolated outliers effectively. The 3R
smoother may be applied to unequally spaced data or repeated data. Tukey also pro-
poses applying a Hanning filter, by which x̃t ← 0.25×(xt−1 +2xt +xt+1). This filter
may be applied several times as necessary. In Figure 1.15, the Tukey smoother (S
function smooth) is applied to the gas flow dataset given in Table B.5. Observe
how the single potential outlier at x = 187 is totally ignored. The least-squares fit is
shown for reference.
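The two filters just described can be sketched as follows. This is not the S function smooth itself: the function names are hypothetical, the end values are simply copied rather than treated with Tukey's end-point rule, and the data are a toy sequence with one isolated outlier.

```python
from statistics import median

def running_median3(x):
    """One pass of running medians of three; the two end values are copied."""
    if len(x) < 3:
        return list(x)
    inner = [median(x[i - 1:i + 2]) for i in range(1, len(x) - 1)]
    return [x[0]] + inner + [x[-1]]

def smooth_3r(x):
    """Repeat running medians of three until the sequence stops changing."""
    current = list(x)
    while True:
        nxt = running_median3(current)
        if nxt == current:
            return current
        current = nxt

def hanning(x):
    """One pass of the Hanning filter 0.25 * (x[t-1] + 2*x[t] + x[t+1]); ends copied."""
    if len(x) < 3:
        return list(x)
    inner = [0.25 * (x[i - 1] + 2 * x[i] + x[i + 1]) for i in range(1, len(x) - 1)]
    return [x[0]] + inner + [x[-1]]

# Toy sequence (hypothetical) with an isolated outlier at position 3.
y = [1, 2, 3, 50, 5, 6, 7]

smoothed = smooth_3r(y)
print(smoothed)           # the outlier 50 no longer appears
print(hanning(smoothed))  # optional linear smoothing of the median-filtered values
```

The median step is what makes the outlier disappear entirely; the Hanning step is linear, so applying it directly to `y` would spread the outlier over its neighbors instead of removing it.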
The simplest nonparametric regression estimator is the regressogram. The x-axis
is binned and the sample averages of the responses are computed and plotted over the
intervals. The regressogram for the gas flow dataset is also shown in Figure 1.15. The
Hanning filter and regressogram are special cases of nonparametric kernel regression,
which is discussed in Chapter 8.
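A regressogram takes only a few lines to compute. The sketch below assumes equal-width bins anchored at a hypothetical `origin`; the function name and the toy data are illustrative and are not the gas flow measurements.

```python
import math

def regressogram(x, y, bin_width, origin=0.0):
    """Return {bin_index: mean response} for each x-bin that contains data."""
    sums, counts = {}, {}
    for xi, yi in zip(x, y):
        k = math.floor((xi - origin) / bin_width)
        sums[k] = sums.get(k, 0.0) + yi
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}

# Toy design points and responses (hypothetical).
x = [0.5, 1.5, 1.8, 2.2, 2.9, 3.1]
y = [10.0, 12.0, 14.0, 20.0, 22.0, 30.0]

fit = regressogram(x, y, bin_width=1.0)
print(fit)  # piecewise-constant estimate: one average per occupied bin
```

Bins with no design points are simply absent from the result, which is one practical drawback of the estimator when the design is sparse.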
The gas flow dataset is part of a larger collection taken at seven different pressures.
A stick-pin plot of the complete dataset is shown in Figure 1.16 (the 74.6 psia data
are second from the right). Clearly, the accuracy is affected by the flow rate, while
the effect of psia seems small. These data will be revisited in Chapter 8.
1.4.3 Visualization of Multivariate Functions
Visualization of functions of more than two variables has not been common in statis-
tics. The Landsat example in Figure 1.14 hints at the potential that visualization of
4-D surfaces would bring to the data analyst. In this section, effective visualization
of surfaces in more than three dimensions is introduced.

FIGURE 1.15 Accuracy of a natural gas meter as a function of the flow rate through the
valve at 74.6 psia. The raw data (n = 33) are shown by the filled points. The three smooths
(least squares, Tukey’s 3R, and Tukey’s regressogram) are superimposed. (Axes: flow rate, 50 to 4000, versus percentage of actual flow, 97 to 101.)
FIGURE 1.16 Complete 3-D view of the gas flow dataset. (Axes: log10 flow, 1.30 to 3.60; log10 psia, 1.60 to 2.80; accuracy, 96.00 to 100.00.)
Displaying a three-dimensional perspective plot of the surface f(x, y) of a bivariate
function requires one more dimension than the corresponding bivariate contour rep-
resentation (see Figure 1.17). There are trade-offs. The contour representation lacks
the exact detail and visual impact available in a perspective plot; however, perspective
plots usually have portions obscured by peaks and present less precise height infor-
mation. One way of expressing the difference is to say that a contour plot displays,
loosely speaking, about 2.6–2.9 dimensions of the entire 3-D surface (more, as more
contour lines are drawn). Some authors claim that one or the other representation is
superior, but it seems clear that both can be useful for complicated surfaces.

FIGURE 1.17 Perspective plot of bivariate normal density with a “floating” representation
of the corresponding contours.
The visualization advantage afforded by a contour representation is that it lives
in the same dimension as the data, whereas a perspective plot requires an additional
dimension. Hence with trivariate data, the third dimension can be used to present a
3-D contour. In the case of a density function, the corresponding 3-D contour plot
comprises one or more α-level contour surfaces, which are defined for x ∈ ℝ^d by
$$
\alpha\text{-Contour}: \quad S_\alpha = \{x : f(x) = \alpha f_{\max}\}, \qquad 0 \le \alpha \le 1,
$$
where fmax is the maximum or modal value of the density function.
For normal data, the general contour surfaces are hyper-ellipses defined by the
easily verified equation (see Problem 1.14):
$$
(x-\mu)^T \Sigma^{-1} (x-\mu) = -2\log\alpha. \tag{1.1}
$$
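Equation (1.1) can be checked numerically. The sketch below, restricted to two dimensions for simplicity, builds a point on the α-contour and confirms that the density there is α times the modal value. The mean, covariance, level, and direction are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math

def mahalanobis_sq(x, mu, sigma):
    """(x - mu)^T Sigma^{-1} (x - mu) for a 2x2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def normal_density(x, mu, sigma):
    """Bivariate normal density evaluated at x."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    return math.exp(-0.5 * mahalanobis_sq(x, mu, sigma)) / (2 * math.pi * math.sqrt(det))

mu = (1.0, -2.0)                     # arbitrary mean
sigma = ((2.0, 0.6), (0.6, 1.0))     # arbitrary positive definite covariance
alpha = 0.3                          # contour level

# A point on the alpha-contour: x = mu + sqrt(-2 log alpha) * L u,
# where L is the Cholesky factor of sigma and u is any unit vector.
l11 = math.sqrt(sigma[0][0])
l21 = sigma[1][0] / l11
l22 = math.sqrt(sigma[1][1] - l21 ** 2)
radius = math.sqrt(-2.0 * math.log(alpha))
u = (math.cos(0.7), math.sin(0.7))
x = (mu[0] + radius * l11 * u[0],
     mu[1] + radius * (l21 * u[0] + l22 * u[1]))

q = mahalanobis_sq(x, mu, sigma)
ratio = normal_density(x, mu, sigma) / normal_density(mu, mu, sigma)
print(q, -2.0 * math.log(alpha))   # the two sides of Equation (1.1)
print(ratio, alpha)                # f(x) / fmax should equal alpha
```

The check works because the normalizing constant cancels in the ratio f(x)/fmax, leaving exp(−q/2) = α, which rearranges to q = −2 log α.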
A trivariate contour plot of f(x1,x2,x3) would generally contain several “nested”
surfaces, {S0.1,S0.3,S0.5,S0.7,S0.9}, for example. For the independent standard nor-
mal density, the contours would be nested hyperspheres centered on the mode. In
Figure 1.18, three contours of the trivariate standard normal density are shown in
stereo. Many, if not most, readers will have difficulty crossing their eyes to obtain
the stereo effect. But even without the stereo effect, the three spherical contours are
well-represented.
How effective is this in practice? Consider a smoothed histogram f̂(x,y,z) of 1000
trivariate normal points with Σ = I3. Figure 1.19 shows surfaces of nine equally
spaced bivariate slices of the trivariate estimate. Each slice is approximately bivari-
ate normal but without rescaling. Of course, the surfaces are not precisely bivariate
normal, due to the finite size of the sample.
FIGURE 1.18 Stereo representation of three α-contours of a trivariate normal density.
Gently crossing your eyes should allow the two frames to fuse in the middle.
FIGURE 1.19 Sequence of bivariate slices of a trivariate smoothed histogram. (Recoverable panel labels: z = −1.8, −1.2, −0.6, 0, 0.6, 1.2.)
A natural question to pose is: Why not plot the corresponding sequence of conditional
densities, f̂(x,y|z = z0), rather than the slices, f̂(x,y,z0)? If this were done,
all the surfaces in Figure 1.19 would be nearly identical. (Theoretically, the conditional
densities are all exactly N(0₂, I₂).) If the goal is to understand the 4-D density surface,
then the sequence of conditional densities overemphasizes the (visual) importance
of the tails and obscures information about the location of the “center” of the data.
Furthermore, as nonparametric estimates in the tail will be relatively noisy, the esti-
mates will be especially rough upon normalization (see Figure 1.20). For these
reasons, it seems best to look at slices and to reserve normalization for looking at
conditional densities that are particularly interesting.
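The difference between a slice and a normalized (conditional) estimate can be seen with a small simulation. The sketch below uses an ordinary trivariate histogram rather than the smoothed histogram of the text, and the function names, bin width, seed, and choice of z-bins are all illustrative assumptions.

```python
import math
import random

def histogram3d(points, h):
    """Count points in cubical bins of side h; keys are integer bin indices."""
    counts = {}
    for x, y, z in points:
        key = (math.floor(x / h), math.floor(y / h), math.floor(z / h))
        counts[key] = counts.get(key, 0) + 1
    return counts

def slice_and_conditional(counts, n, h, k):
    """For z-bin k, return (slice estimate, conditional estimate, points in the bin)."""
    in_bin = {(i, j): c for (i, j, kk), c in counts.items() if kk == k}
    n_k = sum(in_bin.values())
    # Slice of the trivariate histogram: count / (n h^3).
    slice_est = {ij: c / (n * h ** 3) for ij, c in in_bin.items()}
    # Conditional estimate: the slice divided by the marginal estimate n_k / (n h).
    conditional = {ij: c / (n_k * h ** 2) for ij, c in in_bin.items()}
    return slice_est, conditional, n_k

random.seed(0)  # arbitrary seed, for repeatability only
n, h = 1000, 0.5
points = [(random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1))
          for _ in range(n)]
counts = histogram3d(points, h)

# Central z-bin [0, 0.5) versus a left-tail z-bin [-2.5, -2.0).
center_slice, center_cond, n_center = slice_and_conditional(counts, n, h, k=0)
tail_slice, tail_cond, n_tail = slice_and_conditional(counts, n, h, k=-5)
print("points in central z-bin:", n_center, " in tail z-bin:", n_tail)
```

Normalizing divides by the number of points in the z-bin, so a tail slice supported by only a handful of points is inflated to the same total mass as the central slice, and each individual point then contributes a large, rough bump.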
Several trivariate contour surfaces of the same estimated density are displayed
in Figure 1.21. Clearly, the trivariate contours give an improved “big picture”—just
as a rotating trivariate scatter diagram improves on three static bivariate scatter dia-
grams. The complete density estimate is a 4-D surface, and the trivariate contour view
in the final frame of Figure 1.21 may present only 3.5 dimensions, while the series
of bivariate slices may yield a bit more, perhaps 3.75 dimensions, but without the
visual impact. Examine the 3-D contour view for the Landsat data in the first frame
of Figure 7.8 in comparison to Figures 1.3 and 1.14. The structure is quite complex.

FIGURE 1.20 Normalized slices in the left tail of the smoothed histogram (panels at z = −3, −2.6, and −2.2).
The presentation of clusters is stunning and shows multiple modes and multiple
clusters. This detailed structure is not apparent in the scatterplot in Figure 1.3.
Depending on the nature of the variables, slicing can be attempted with four-,
five-, or six-dimensional data. Of special importance is the 5-D surface generated by
4-D data, for example, space–time variables such as the Mount St. Helens data in
Figure 1.10. These higher dimensional estimates can be animated in a fashion similar
to Figure 1.19 (see Scott and Wilks (1990)).
In the 4-D case, the α-level contours of interest are based on the slices:
$$
S_{\alpha,t} = \{(x,y,z) : f(x,y,z,t) = \alpha f_{\max}\},
$$
where fmax is the global maximum over the 5-D surface. For a fixed choice of α,
as the slice value t changes continuously, the contour shells will expand or contract
smoothly, finally vanishing for extreme values of t. For example, a single theoretical
contour of the N(0,I4) density would vanish outside a symmetric interval around the
origin, but within that interval, the contour shell would be a sphere centered on the
origin with greatest diameter when t = 0. With several α-shells displayed simultane-
ously, the contours would be nested spheres of different radii, appearing at different
values of t, but of greatest diameter when t = 0.
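For the N(0,I4) example, the behavior of the shells follows directly from the contour equation: f = αfmax is equivalent to x² + y² + z² + t² = −2 log α, so the shell in the slice at t is a sphere of squared radius −2 log α − t². A minimal sketch, with a hypothetical function name:

```python
import math

def shell_radius(alpha, t):
    """Radius of the alpha-contour of N(0, I4) in the slice at t.

    Returns None when the contour has vanished, that is, when
    t^2 exceeds -2 log(alpha).
    """
    r_squared = -2.0 * math.log(alpha) - t * t
    return math.sqrt(r_squared) if r_squared >= 0 else None

for alpha in (0.1, 0.5, 0.9):
    t_max = math.sqrt(-2.0 * math.log(alpha))  # shell exists only for |t| <= t_max
    radii = [shell_radius(alpha, t) for t in (0.0, 0.5 * t_max, 1.1 * t_max)]
    print(alpha, round(t_max, 3), radii)
```

Smaller values of α give larger shells that persist over a wider interval of t, which is why nested shells appear and disappear at different slice values while all reach their greatest diameter at t = 0.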
One particularly interesting slice of the smoothed 5-D histogram estimate of the
entire Iris dataset is shown in Figure 1.22. The α = 4% contour surface reveals two
well-separated clusters. However, the α = 10% contour surface is trimodal, revealing
the true structure in this dataset even with only 150 points. The virginica and versicolor
data may not be separated in the point cloud but apparently can be separated in the
density cloud.
The 3-D contour slices in Figure 1.22 were assembled from a 2-D contouring algo-
rithm, then projected into the plane. The sequence of 2-D contour slices is shown in
Figure 1.23. Study these two diagrams and think about the possibilities for exploring
the entire five-dimensional surface.
To emphasize the potential value of additional variables, we conclude this vignette
by examining the Iris data excluding the sepal width variable. Figure 1.24 displays a
3-D scatterplot, as well as contours of the smoothed histogram at levels α = 0.17 and
α = 0.44. A little study supports the speculation that the data might contain a hybrid
of the versicolor and virginica species. With such a small sample, that may
be an embellishment.
FIGURE 1.21 Trivariate normal examples.
With more than four variables, the most appropriate sequence of slicing is not
clear. With five variables, bivariate contours of (x4,x5) may be drawn; then a sequence
of trivariate slices may be examined tracing along one of these bivariate contours.
With more than five or six variables, deciding where to slice at all is a diffi-
cult problem because the number of possibilities grows exponentially. That is why
projection-based methods are so important (see Chapter 7).
1.4.3.1 Visualizing Multivariate Regression Functions The same graphical rep-
resentation can be applied to regression surfaces. However, the interpretation can
be more difficult. For example, if the regression surface is monotone, the α-level
contours of the surface will not be “closed” and will appear to “float” in space. If
the regression surface is a simple linear function such as ax + by + cz, then a set of
trivariate α-contours will simply be a set of parallel planes. Practical questions arise
that do not appear for density surfaces. In particular, what is the natural extent of the
regression surface; that is, for what region in the design space should the surface be
plotted? Perhaps one answer is to limit the plot to regions where there is sufficient
data, that is, where the density of design points is above a certain threshold.
FIGURE 1.22 Two α-level contour surfaces from a slice of a five-dimensional averaged
shifted histogram estimate, based on all 150 Iris data points. The displayed variables x, y, and
z are sepal length, petal length and width, respectively, with the sepal width variable sliced at
t = 3.4 cm. The (outer) darker α = 4% contour reveals only two clusters, while the (inner)
lighter α = 10% contour reveals the three clusters. (Cluster labels: setosa, versicolor, virginica.)
FIGURE 1.23 A detailed breakdown of the 3-D contours shown in Figure 1.22 taken from
the ASH estimate f̂(x,y,z,t = 3.4) as the sepal length, x, ranges from 4.00 to 7.45 cm. (Twenty-four panels, from x = 4 to x = 7.45 in steps of 0.15.)
FIGURE 1.24 Analysis of three of the four Iris variables, omitting sepal width entirely,
which should be compared to the slice shown in Figure 1.22. The middle contour (α = 0.17)
is superimposed upon the contour (α = 0.44) in the right frame to help locate the shells. (Axes: Sepal.length, Petal.length, Petal.width.)
FIGURE 1.25 A portion of a bivariate contour at the α = 0 level of a smooth function
measured on a regular grid and using linear interpolation (dotted lines). The signs at the grid points are:
+ + + + +
− + + + −
− − − − −
1.4.4 Overview of Contouring and Surface Display
Suppose that a general bivariate function f(x,y) (taking on positive and negative
values) is sampled on a regular grid, and the α = 0 contour S0 is desired; that is,
S0 = {(x,y) : f(x,y) = 0}. Label the values of the grid as +, 0, or − depending on
whether f > 0, f = 0, or f < 0, respectively. Then the desired contour is shown in
Figure 1.25. The piecewise linear approximation and the true contour do not match
along the bin boundaries since the interpolation is not exact.
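The linear interpolation step can be sketched for a single grid cell. This is a simplified illustration, not the algorithm of Section A.1: the function names are hypothetical, and a grid value of exactly zero is treated as a crossing at that corner.

```python
def zero_crossing(p0, f0, p1, f1):
    """Linearly interpolate the point where f = 0 on the segment from p0 to p1.

    Returns None when f0 and f1 have the same strict sign (no crossing).
    """
    if f0 == 0:
        return p0
    if f1 == 0:
        return p1
    if (f0 > 0) == (f1 > 0):
        return None
    w = f0 / (f0 - f1)  # fraction of the way from p0 to p1
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))

def cell_crossings(x0, y0, h, f):
    """Zero crossings on the four edges of the cell with lower-left corner (x0, y0)."""
    corners = [(x0, y0), (x0 + h, y0), (x0 + h, y0 + h), (x0, y0 + h)]
    values = [f(x, y) for x, y in corners]
    points = []
    for i in range(4):
        j = (i + 1) % 4
        p = zero_crossing(corners[i], values[i], corners[j], values[j])
        if p is not None and p not in points:
            points.append(p)
    return points

# Hypothetical test function whose zero contour is the line y = 0.25 + 0.5 x.
def f(x, y):
    return y - 0.25 - 0.5 * x

crossings = cell_crossings(0.0, 0.0, 1.0, f)
print(crossings)  # the contour segment in this cell joins these two points
```

For this linear test function the interpolated crossings lie exactly on the true contour; for a curved function they are exact only to first order, which is the source of the mismatch noted above.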
However, bivariate contouring is not as simple a task as one might imagine. Usu-
ally, the function is sampled on a rectangular mesh, with no gradient information
or possibility for further refinement of the mesh. If too coarse a mesh is chosen,
then small local bumps or dips may be missed, or two distinct contours at the same
level may be inadvertently joined. For speed and simplicity, one wants to avoid hav-
ing to do any global analysis before drawing contours. A local contouring algorithm
avoids multiple passes over the data. In any case, global analysis is based on certain
smoothness assumptions and may fail. The difficulties and details of contouring are
described more fully in Section A.1.
FIGURE 1.26 Simple stereo representation of four 3-D nested shells of the earthquake data.
There are several varieties of 3-D contouring algorithms. It is assumed that the
function has been sampled on a lattice, which can be taken to be cubical without loss
of generality. One simple trick is to display a set of 2-D contour slices that result
from intersecting the 3-D contour shell with a set of parallel planes along the lattice
of the data, as was done in Figures 1.18 and 1.22. In this representation, a single
spherical shell becomes a set of circular contours (Figure 1.26). This approach has
the advantage of providing a shell representation that is “transparent” so that multiple
α-level contours may be visualized. Different colors can be used for different
contour levels (see Scott (1983, 1984, 1991a), Scott and Thompson (1983), Härdle
and Scott (1988), and Scott and Hall (1989)).
More visually pleasing surfaces can be drawn using the marching cubes algorithm
(Lorensen and Cline, 1987). The overall contour surface is represented by a large
number of connected triangular planar sections, which are computed for each cubical
bin and then displayed. Depending on the pattern of signs on the eight vertices of each
cube in the data lattice, up to six triangular patches are drawn within each cube (see
Figure 1.27). In general, there are 2^8 = 256
cases (each corner of the cube being either above
or below the contour level). Taking into consideration certain symmetries reduces this
number. By scanning through all the cubes in the data lattice, a collection of triangles
is found that defines the contour shell. Each triangle has an inner and outer surface,
depending on the gradient of the density function. The inner and outer surfaces may
be distinguished by color shading. A convenient choice is various shades of red for
surfaces pointing toward regions of higher (hotter) density, and shades of blue toward
regions of lower (cooler) density; see the cover jacket of this book for an example.
Each contour is a patchwork of several thousand triangles. Smoother surfaces may be
obtained by using higher-order splines, but the underlying bin structure information
would be lost.
FIGURE 1.27 Examples of marching cube contouring algorithm. The corners with values
above the contour level are labeled with a + symbol.
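The sign pattern on the eight corners of a cube is conveniently packed into a single integer, which is how the 2^8 cases are usually enumerated. The sketch below shows only this classification step, not the triangle tables of the marching cubes algorithm; the function names and the corner ordering are illustrative assumptions.

```python
def cube_case_index(corner_values, level):
    """Pack the above/below pattern of the eight corners into an integer 0..255.

    Bit i is set when corner i lies above the contour level. Any corner
    ordering may be used, provided it is used consistently.
    """
    if len(corner_values) != 8:
        raise ValueError("a cube has eight corners")
    index = 0
    for i, value in enumerate(corner_values):
        if value > level:
            index |= 1 << i
    return index

def cube_is_crossed(corner_values, level):
    """A cube contributes triangles only if its corners are not all on one side."""
    return cube_case_index(corner_values, level) not in (0, 255)

# Hypothetical density values at the eight corners of one bin.
corners = [0.2, 0.1, 0.3, 0.9, 0.2, 0.1, 0.4, 0.3]
case = cube_case_index(corners, level=0.5)
print(case, cube_is_crossed(corners, level=0.5))

# Swapping "above" and "below" at every corner maps case k to 255 - k,
# one of the symmetries that reduces the number of distinct patterns.
print(255 - case)
```

Scanning the lattice then amounts to computing this index for every cube and skipping the cubes whose index is 0 or 255.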
In summary, visualizing trivariate functions directly is a powerful adjunct to data
analysis. The gain of an additional dimension of visible structure without resort to
slices greatly improves the ability of a data analyst to perceive structure. The same
visualization applies to slices of density functions with more than three variables.
A demonstration tape that displays 4-D animation of Sα,t contours as α and t vary
is available (Scott and Wilks, 1990).
1.5 GEOMETRY OF HIGHER DIMENSIONS
The geometry of higher dimensions provides a few surprises. In this section, a few
standard figures are considered. This material is available in scattered references (see
Kendall (1961), for example).
1.5.1 Polar Coordinates in d Dimensions
In d dimensions, a point x can be expressed in spherical polar coordinates by a
radius r, a base angle θd−1 ranging over (0,2π), and d − 2 angles θ1,...,θd−2 each
ranging over (−π/2,π/2) (see Figure 1.28). Let sk = sinθk and ck = cosθk. Then the
transformation back to Euclidean coordinates is given by
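$$
\begin{aligned}
x_1 &= r\, c_1 c_2 \cdots c_{d-3}\, c_{d-2}\, c_{d-1} \\
x_2 &= r\, c_1 c_2 \cdots c_{d-3}\, c_{d-2}\, s_{d-1} \\
x_3 &= r\, c_1 c_2 \cdots c_{d-3}\, s_{d-2} \\
&\;\;\vdots \\
x_j &= r\, c_1 \cdots c_{d-j}\, s_{d-j+1} \\
&\;\;\vdots \\
x_d &= r\, s_1 .
\end{aligned}
$$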

FIGURE 1.28 Polar coordinates (r,θ1,θ2) of a point P in ℝ³.
After some work (see Problem 1.11), the Jacobian of this transformation may be
shown to be
$$
J = r^{d-1} c_1^{d-2} c_2^{d-3} \cdots c_{d-2}. \tag{1.2}
$$
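The transformation and the Jacobian formula translate directly into code. The following is a sketch with hypothetical function names; angles are passed as a list (θ1, ..., θd−1), and `math.prod` requires Python 3.8 or later.

```python
import math

def polar_to_euclidean(r, thetas):
    """Map (r, theta_1, ..., theta_{d-1}) to (x_1, ..., x_d).

    x_1 = r c_1 ... c_{d-1} and, for j >= 2, x_j = r c_1 ... c_{d-j} s_{d-j+1}.
    """
    d = len(thetas) + 1
    c = [math.cos(t) for t in thetas]
    s = [math.sin(t) for t in thetas]
    x = [r * math.prod(c)]  # x_1
    for j in range(2, d + 1):
        # c[:d - j] holds c_1..c_{d-j}; s[d - j] is s_{d-j+1} in 0-based indexing.
        x.append(r * math.prod(c[:d - j]) * s[d - j])
    return x

def jacobian(r, thetas):
    """J = r^(d-1) c_1^(d-2) c_2^(d-3) ... c_(d-2), as in Equation (1.2)."""
    d = len(thetas) + 1
    value = r ** (d - 1)
    for k in range(1, d - 1):  # k = 1, ..., d-2
        value *= math.cos(thetas[k - 1]) ** (d - 1 - k)
    return value

# A point in four dimensions (arbitrary illustrative values).
x = polar_to_euclidean(2.0, [0.3, -0.4, 1.2])
print(x, math.sqrt(sum(v * v for v in x)))  # the norm recovers r

# In three dimensions the Jacobian reduces to r^2 cos(theta_1).
print(jacobian(2.0, [0.3, 1.2]), 4.0 * math.cos(0.3))
```

Checking that the Euclidean norm of the result equals r is a quick way to catch an off-by-one error in the indexing of the sines and cosines.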
1.5.2 Content of Hypersphere
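The volume of the d-dimensional hypersphere $\{x : \sum_{i=1}^{d} x_i^2 \le a^2\}$ is given by
$$
\begin{aligned}
V_d(a) &= \int_{\sum_{i=1}^{d} x_i^2 \le a^2} 1 \, dx \\
&= \int_0^a dr \int_{-\pi/2}^{\pi/2} d\theta_1 \int_{-\pi/2}^{\pi/2} d\theta_2 \cdots \int_0^{2\pi} d\theta_{d-1} \; r^{d-1} c_1^{d-2} c_2^{d-3} \cdots c_{d-2} .
\end{aligned}
$$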
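This can be simplified using the identity
$$
\int_{-\pi/2}^{\pi/2} \cos^k\theta \, d\theta
= 2\int_0^{\pi/2} \cos^k\theta \, d\theta
= 2\int_0^{\pi/2} \cos^k\theta \, \frac{d(\cos^2\theta)}{-2\cos\theta\sin\theta},
$$
which, using the change of variables $u = \cos^2\theta$,
$$
= \int_0^1 \frac{u^{k/2}\, du}{u^{1/2}(1-u)^{1/2}}
= B\!\left(\tfrac{1}{2}, \tfrac{k+1}{2}\right)
= \frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{k+1}{2}\right)}{\Gamma\!\left(\tfrac{k+2}{2}\right)} .
$$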

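As $\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}$,
$$
\begin{aligned}
V_d(a) &= 2\pi \, \frac{a^d}{d} \cdot
\frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{d-1}{2}\right)}{\Gamma\!\left(\tfrac{d}{2}\right)} \cdot
\frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{d-2}{2}\right)}{\Gamma\!\left(\tfrac{d-1}{2}\right)} \cdots
\frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma(1)}{\Gamma\!\left(\tfrac{3}{2}\right)} \\
&= \frac{a^d \pi^{d/2}}{\tfrac{d}{2}\,\Gamma\!\left(\tfrac{d}{2}\right)}
= \frac{a^d \pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2}+1\right)} .
\end{aligned}
\tag{1.3}
$$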
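Equation (1.3) is easy to evaluate with the gamma function from the Python standard library. The function name below is a hypothetical choice; the familiar low-dimensional cases serve as a check.

```python
import math

def hypersphere_volume(d, a=1.0):
    """V_d(a) = a^d * pi^(d/2) / Gamma(d/2 + 1), as in Equation (1.3)."""
    return a ** d * math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# Familiar special cases: an interval of length 2a, a disk, and a ball.
print(hypersphere_volume(1, 2.0), 2 * 2.0)
print(hypersphere_volume(2, 2.0), math.pi * 2.0 ** 2)
print(hypersphere_volume(3, 2.0), 4.0 / 3.0 * math.pi * 2.0 ** 3)

# Volume of the unit hypersphere in dimensions 1 through 7.
for d in range(1, 8):
    print(d, round(hypersphere_volume(d), 4))
```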
1.5.3 Some Interesting Consequences
1.5.3.1 Sphere Inscribed in Hypercube Consider the hypercube $[-a,a]^d$ and an
inscribed hypersphere with radius r = a. Then using (1.3), the fraction of the volume
of the cube contained in the hypersphere is given by
$$
f_d = \frac{\text{Volume sphere}}{\text{Volume cube}}
= \frac{a^d \pi^{d/2} / \Gamma\!\left(\tfrac{d}{2}+1\right)}{(2a)^d}
= \frac{\pi^{d/2}}{2^d \, \Gamma\!\left(\tfrac{d}{2}+1\right)} .
$$
For lower dimensions, the fraction fd is as shown in Table 1.1. It is clear that the center
of the cube becomes less important. As the dimension increases, the volume of the
hypercube concentrates in its corners. This distortion of space (at least to our three-
dimensional way of thinking) has many potential consequences for data analysis.
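The fractions in Table 1.1 can be reproduced from the closed form for f_d. The function name is hypothetical; note that the side length a cancels, so it does not appear.

```python
import math

def inscribed_fraction(d):
    """f_d = pi^(d/2) / (2^d * Gamma(d/2 + 1))."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

fractions = [round(inscribed_fraction(d), 3) for d in range(1, 8)]
print(fractions)

# The decay continues rapidly in higher dimensions.
for d in (10, 20):
    print(d, inscribed_fraction(d))
```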
1.5.3.2 Hypervolume of a Thin Shell Wegman (1990) demonstrates the distortion
of space in another setting. Consider two spheres centered on the origin, one with
radius r and the other with slightly smaller radius r − ε. Consider the fraction of the
volume of the larger sphere in between the spheres. By Equation (1.3),
$$
\frac{V_d(r) - V_d(r-\epsilon)}{V_d(r)}
= \frac{r^d - (r-\epsilon)^d}{r^d}
= 1 - \left(1 - \frac{\epsilon}{r}\right)^{d}
\xrightarrow[d \to \infty]{} 1 .
$$
Hence, virtually all of the content of a hypersphere is concentrated close to its surface,
which is only a (d − 1)-dimensional manifold. Thus for data distributed uniformly
over both the hypersphere and the hypercube, most of the data fall near the boundary
and edges of the volume. Most statistical techniques exhibit peculiar behavior if the
data fall in a lower dimensional subspace. This example illustrates one important
aspect of the curse of dimensionality, which is discussed in Chapter 7.
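A short calculation shows how quickly the shell fraction approaches one. The function name and the 1% shell thickness are illustrative assumptions.

```python
def shell_fraction(d, eps_over_r):
    """Fraction of a d-dimensional hypersphere lying within eps of its surface.

    Computed as 1 - (1 - eps/r)^d, with eps_over_r = eps / r.
    """
    return 1.0 - (1.0 - eps_over_r) ** d

# Shell whose thickness is 1% of the radius.
for d in (1, 2, 3, 10, 100, 1000):
    print(d, shell_fraction(d, 0.01))
```

With a shell only 1% of the radius thick, the fraction is about 1% in one dimension but exceeds one half by d = 100 and is essentially one by d = 1000.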
TABLE 1.1 Fraction of the Volume of a Hypercube Lying in the Inscribed Hypersphere
Dimension (d)          1      2      3      4      5      6      7
Fraction volume (fd)   1      0.785  0.524  0.308  0.164  0.081  0.037
