Multivariate Density Estimation Theory Practice
And Visualization 2nd Edition David W Scott
MULTIVARIATE DENSITY
ESTIMATION
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice,
Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott,
Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane,
Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.
MULTIVARIATE DENSITY
ESTIMATION
Theory, Practice, and Visualization
Second Edition
DAVID W. SCOTT
Rice University
Houston, Texas
Copyright © 2015 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax
(978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be suitable
for your situation. You should consult with a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other commercial damages, including but not limited to
special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at
(317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic formats. For more information about Wiley products, visit our web site at
www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Scott, David W., 1950–
Multivariate density estimation : theory, practice, and visualization / David W. Scott. – Second edition.
pages cm
Includes bibliographical references and index.
ISBN 978-0-471-69755-8 (cloth)
1. Estimation theory. 2. Multivariate analysis. I. Title.
QA276.8.S28 2014
519.535–dc23
2014043897
Set in 10/12pts Times Lt Std by SPi Publisher Services, Pondicherry, India
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
1 2015
To Jean, Hilary,
Elizabeth, Warren,
and my parents, John
and Nancy Scott
CONTENTS
PREFACE TO SECOND EDITION xv
PREFACE TO FIRST EDITION xvii
1 Representation and Geometry of Multivariate Data 1
1.1 Introduction, 1
1.2 Historical Perspective, 4
1.3 Graphical Display of Multivariate Data Points, 5
1.3.1 Multivariate Scatter Diagrams, 5
1.3.2 Chernoff Faces, 11
1.3.3 Andrews’ Curves and Parallel Coordinate Curves, 12
1.3.4 Limitations, 14
1.4 Graphical Display of Multivariate Functionals, 16
1.4.1 Scatterplot Smoothing by Density Function, 16
1.4.2 Scatterplot Smoothing by Regression Function, 18
1.4.3 Visualization of Multivariate Functions, 19
1.4.3.1 Visualizing Multivariate Regression Functions, 24
1.4.4 Overview of Contouring and Surface Display, 26
1.5 Geometry of Higher Dimensions, 28
1.5.1 Polar Coordinates in d Dimensions, 28
1.5.2 Content of Hypersphere, 29
1.5.3 Some Interesting Consequences, 30
1.5.3.1 Sphere Inscribed in Hypercube, 30
1.5.3.2 Hypervolume of a Thin Shell, 30
1.5.3.3 Tail Probabilities of Multivariate Normal, 31
1.5.3.4 Diagonals in Hyperspace, 31
1.5.3.5 Data Aggregate Around Shell, 32
1.5.3.6 Nearest Neighbor Distances, 32
Problems, 33
2 Nonparametric Estimation Criteria 36
2.1 Estimation of the Cumulative Distribution Function, 37
2.2 Direct Nonparametric Estimation of the Density, 39
2.3 Error Criteria for Density Estimates, 40
2.3.1 MISE for Parametric Estimators, 42
2.3.1.1 Uniform Density Example, 42
2.3.1.2 General Parametric MISE Method with Gaussian Application, 43
2.3.2 The L1 Criterion, 44
2.3.2.1 L1 versus L2, 44
2.3.2.2 Three Useful Properties of the L1 Criterion, 44
2.3.3 Data-Based Parametric Estimation Criteria, 46
2.4 Nonparametric Families of Distributions, 48
2.4.1 Pearson Family of Distributions, 48
2.4.2 When Is an Estimator Nonparametric?, 49
Problems, 50
3 Histograms: Theory and Practice 51
3.1 Sturges’ Rule for Histogram Bin-Width Selection, 51
3.2 The L2 Theory of Univariate Histograms, 53
3.2.1 Pointwise Mean Squared Error and Consistency, 53
3.2.2 Global L2 Histogram Error, 56
3.2.3 Normal Density Reference Rule, 59
3.2.3.1 Comparison of Bandwidth Rules, 59
3.2.3.2 Adjustments for Skewness and Kurtosis, 60
3.2.4 Equivalent Sample Sizes, 62
3.2.5 Sensitivity of MISE to Bin Width, 63
3.2.5.1 Asymptotic Case, 63
3.2.5.2 Large-Sample and Small-Sample Simulations, 64
3.2.6 Exact MISE versus Asymptotic MISE, 65
3.2.6.1 Normal Density, 66
3.2.6.2 Lognormal Density, 68
3.2.7 Influence of Bin Edge Location on MISE, 69
3.2.7.1 General Case, 69
3.2.7.2 Boundary Discontinuities in the Density, 69
3.2.8 Optimally Adaptive Histogram Meshes, 70
3.2.8.1 Bounds on MISE Improvement for Adaptive Histograms, 71
3.2.8.2 Some Optimal Meshes, 72
3.2.8.3 Null Space of Adaptive Densities, 72
3.2.8.4 Percentile Meshes or Adaptive Histograms with Equal Bin Counts, 73
3.2.8.5 Using Adaptive Meshes versus Transformation, 74
3.2.8.6 Remarks, 75
3.3 Practical Data-Based Bin Width Rules, 76
3.3.1 Oversmoothed Bin Widths, 76
3.3.1.1 Lower Bounds on the Number of Bins, 76
3.3.1.2 Upper Bounds on Bin Widths, 78
3.3.2 Biased and Unbiased CV, 79
3.3.2.1 Biased CV, 79
3.3.2.2 Unbiased CV, 80
3.3.2.3 End Problems with BCV and UCV, 81
3.3.2.4 Applications, 81
3.4 L2 Theory for Multivariate Histograms, 83
3.4.1 Curse of Dimensionality, 85
3.4.2 A Special Case: d = 2 with Nonzero Correlation, 87
3.4.3 Optimal Regular Bivariate Meshes, 88
3.5 Modes and Bumps in a Histogram, 89
3.5.1 Properties of Histogram “Modes”, 91
3.5.2 Noise in Optimal Histograms, 92
3.5.3 Optimal Histogram Bandwidths for Modes, 93
3.5.4 A Useful Bimodal Mixture Density, 95
3.6 Other Error Criteria: L1, L4, L6, L8, and L∞, 96
3.6.1 Optimal L1 Histograms, 96
3.6.2 Other Lp Criteria, 97
Problems, 97
4 Frequency Polygons 100
4.1 Univariate Frequency Polygons, 101
4.1.1 Mean Integrated Squared Error, 101
4.1.2 Practical FP Bin Width Rules, 104
4.1.3 Optimally Adaptive Meshes, 107
4.1.4 Modes and Bumps in a Frequency Polygon, 109
4.2 Multivariate Frequency Polygons, 110
4.3 Bin Edge Problems, 113
4.4 Other Modifications of Histograms, 114
4.4.1 Bin Count Adjustments, 114
4.4.1.1 Linear Binning, 114
4.4.1.2 Adjusting FP Bin Counts to Match Histogram Areas, 117
4.4.2 Polynomial Histograms, 117
4.4.3 How Much Information Is There in a Few Bins?, 120
Problems, 122
5 Averaged Shifted Histograms 125
5.1 Construction, 126
5.2 Asymptotic Properties, 128
5.3 The Limiting ASH as a Kernel Estimator, 133
Problems, 135
6 Kernel Density Estimators 137
6.1 Motivation for Kernel Estimators, 138
6.1.1 Numerical Analysis and Finite Differences, 138
6.1.2 Smoothing by Convolution, 139
6.1.3 Orthogonal Series Approximations, 140
6.2 Theoretical Properties: Univariate Case, 142
6.2.1 MISE Analysis, 142
6.2.2 Estimation of Derivatives, 144
6.2.3 Choice of Kernel, 145
6.2.3.1 Higher Order Kernels, 145
6.2.3.2 Optimal Kernels, 151
6.2.3.3 Equivalent Kernels, 153
6.2.3.4 Higher Order Kernels and Kernel Design, 155
6.2.3.5 Boundary Kernels, 157
6.3 Theoretical Properties: Multivariate Case, 161
6.3.1 Product Kernels, 162
6.3.2 General Multivariate Kernel MISE, 164
6.3.3 Boundary Kernels for Irregular Regions, 167
6.4 Generality of the Kernel Method, 167
6.4.1 Delta Methods, 167
6.4.2 General Kernel Theorem, 168
6.4.2.1 Proof of General Kernel Result, 168
6.4.2.2 Characterization of a Nonparametric Estimator, 169
6.4.2.3 Equivalent Kernels of Parametric Estimators, 171
6.5 Cross-Validation, 172
6.5.1 Univariate Data, 172
6.5.1.1 Early Efforts in Bandwidth Selection, 173
6.5.1.2 Oversmoothing, 176
6.5.1.3 Unbiased and Biased Cross-Validation, 177
6.5.1.4 Bootstrapping Cross-Validation, 181
6.5.1.5 Faster Rates and PI Cross-Validation, 184
6.5.1.6 Constrained Oversmoothing, 187
6.5.2 Multivariate Data, 190
6.5.2.1 Multivariate Cross-Validation, 190
6.5.2.2 Multivariate Oversmoothing Bandwidths, 191
6.5.2.3 Asymptotics of Multivariate Cross-Validation, 192
6.6 Adaptive Smoothing, 193
6.6.1 Variable Kernel Introduction, 193
6.6.2 Univariate Adaptive Smoothing, 195
6.6.2.1 Bounds on Improvement, 195
6.6.2.2 Nearest-Neighbor Estimators, 197
6.6.2.3 Sample-Point Adaptive Estimators, 198
6.6.2.4 Data Sharpening, 200
6.6.3 Multivariate Adaptive Procedures, 202
6.6.3.1 Pointwise Adapting, 202
6.6.3.2 Global Adapting, 203
6.6.4 Practical Adaptive Algorithms, 204
6.6.4.1 Zero-Bias Bandwidths for Tail Estimation, 204
6.6.4.2 UCV for Adaptive Estimators, 208
6.7 Aspects of Computation, 209
6.7.1 Finite Kernel Support and Rounding of Data, 210
6.7.2 Convolution and Fourier Transforms, 210
6.7.2.1 Application to Kernel Density Estimators, 211
6.7.2.2 FFTs, 212
6.7.2.3 Discussion, 212
6.8 Summary, 213
Problems, 213
7 The Curse of Dimensionality and Dimension Reduction 217
7.1 Introduction, 217
7.2 Curse of Dimensionality, 220
7.2.1 Equivalent Sample Sizes, 220
7.2.2 Multivariate L1 Kernel Error, 222
7.2.3 Examples and Discussion, 224
7.3 Dimension Reduction, 229
7.3.1 Principal Components, 229
7.3.2 Projection Pursuit, 231
7.3.3 Informative Components Analysis, 234
7.3.4 Model-Based Nonlinear Projection, 239
Problems, 240
8 Nonparametric Regression and Additive Models 241
8.1 Nonparametric Kernel Regression, 242
8.1.1 The Nadaraya–Watson Estimator, 242
8.1.2 Local Least-Squares Polynomial Estimators, 243
8.1.2.1 Local Constant Fitting, 243
8.1.2.2 Local Polynomial Fitting, 244
8.1.3 Pointwise Mean Squared Error, 244
8.1.4 Bandwidth Selection, 247
8.1.5 Adaptive Smoothing, 247
8.2 General Linear Nonparametric Estimation, 248
8.2.1 Local Polynomial Regression, 248
8.2.2 Spline Smoothing, 250
8.2.3 Equivalent Kernels, 252
8.3 Robustness, 253
8.3.1 Resistant Estimators, 254
8.3.2 Modal Regression, 254
8.3.3 L1 Regression, 257
8.4 Regression in Several Dimensions, 259
8.4.1 Kernel Smoothing and WARPing, 259
8.4.2 Additive Modeling, 261
8.4.3 The Curse of Dimensionality, 262
8.5 Summary, 265
Problems, 266
9 Other Applications 267
9.1 Classification, Discrimination, and Likelihood Ratios, 267
9.2 Modes and Bump Hunting, 273
9.2.1 Confidence Intervals, 273
9.2.2 Oversmoothing for Derivatives, 275
9.2.3 Critical Bandwidth Testing, 275
9.2.4 Clustering via Mixture Models and Modes, 277
9.2.4.1 Gaussian Mixture Modeling, 277
9.2.4.2 Modes for Clustering, 280
9.3 Specialized Topics, 286
9.3.1 Bootstrapping, 286
9.3.2 Confidence Intervals, 287
9.3.3 Survival Analysis, 289
9.3.4 High-Dimensional Holes, 290
9.3.5 Image Enhancement, 292
9.3.6 Nonparametric Inference, 292
9.3.7 Final Vignettes, 293
9.3.7.1 Principal Curves and Density Ridges, 293
9.3.7.2 Time Series Data, 294
9.3.7.3 Inverse Problems and Deconvolution, 294
9.3.7.4 Densities on the Sphere, 294
Problems, 294
APPENDIX A Computer Graphics in ℜ³ 296
A.1 Bivariate and Trivariate Contouring Display, 296
A.1.1 Bivariate Contouring, 296
A.1.2 Trivariate Contouring, 299
A.2 Drawing 3-D Objects on the Computer, 300
APPENDIX B Data Sets 302
B.1 US Economic Variables Dataset, 302
B.2 University Dataset, 304
B.3 Blood Fat Concentration Dataset, 305
B.4 Penny Thickness Dataset, 306
B.5 Gas Meter Accuracy Dataset, 307
B.6 Old Faithful Dataset, 309
B.7 Silica Dataset, 309
B.8 LRL Dataset, 310
B.9 Buffalo Snowfall Dataset, 310
APPENDIX C Notation and Abbreviations 311
C.1 General Mathematical and Probability Notation, 311
C.2 Density Abbreviations, 312
C.3 Error Measure Abbreviations, 313
C.4 Smoothing Parameter Abbreviations, 313
REFERENCES 315
AUTHOR INDEX 334
SUBJECT INDEX 339
PREFACE TO SECOND EDITION
The past 25 years have seen confirmation of the importance of density estimation
and nonparametric methods in modern data analysis, in this era of “big data.” This
updated version retains its focus on fostering an intuitive understanding of the
underlying methodology and supporting theory. I have sought to retain as much of the
original material as possible and, in particular, the point of view of its development
from the histogram. In every chapter, new material has been added to highlight
challenges presented by massive datasets, or to clarify theoretical opportunities and new
algorithms. However, no claim to comprehensive coverage is professed.
I have benefitted greatly from interactions with a number of gifted doctoral
students who worked in this field—Lynette Factor, Donna Nezames, Rod Jee,
Ferdie Wang, Michael Minnotte, Steve Sain, Keith Baggerly, John Salch, Will
Wojciechowski, H.-G. Sung, Alena Oetting, Galen Papkov, Eric Chi, Jonathan Lane,
Justin Silver, Jaime Ramos, and Yeshaya Adler—their work is represented here. In
addition, contributions were made by many students taking my courses. I would
also like to thank my colleagues and collaborators, especially my co-advisor Jim
Thompson and my frequent co-authors George Terrell (VPI), Bill Szewczyk (DoD)
and Masahiko Sagae (Kanazawa University). They have made the lifetime of learning,
teaching, and discovery especially delightful and satisfying. I especially wish to
acknowledge the able help of Robert Kosar in assembling the final versions of the
color figures and reviewing new material.
Not a few mistakes have been corrected. For example, the constant in the expression
for the asymptotic mean integrated squared error for the multivariate histogram
in Theorem 3.5 is now correct. The content of Tables 3.6 and 3.7 has been modified
accordingly, and the effect of dimension on sample size is seen to be even
more dramatic in the corrected version. Any mistakes remain the responsibility of the
author, who would appreciate hearing of such. All will be recorded in an appropriate
repository.
Steve Quigley of John Wiley & Sons was infinitely patient awaiting this second
edition until his retirement, and Kathryn Sharples completed the project. Steve made
a freshly minted LaTeX version available as a starting point. All figures in S-Plus have
been re-engineered into R. Figures in color or using color have been transformed to
gray scale for the printed version, but the original figures will also be available in the
same repository. In the original edition, I also neglected to properly acknowledge the
generous support of the ARO (DAAL-03-88-G-0074 through my colleague James
Thompson) and the ONR (N00014-90-J-1176).
As with the original edition, this revision would not have been possible without the
tireless and enthusiastic support of my wife, Jean, and family. Thanks for everything.
David W. Scott
Houston, Texas
August, 2014
PREFACE TO FIRST EDITION
With the revolution in computing in recent years, access to data of unprecedented
complexity has become commonplace. More variables are being measured, and the
sheer volume of data is growing. At the same time, advancements in the performance
of graphical workstations have given new power to the data analyst. With
these changes has come an increasing demand for tools that can detect and summarize
the multivariate structure in difficult data. Density estimation is now recognized
as a tool useful with univariate and bivariate data; my purpose is to demonstrate that
it is also a powerful tool in higher dimensions, with particular emphasis on trivariate
and quadrivariate data. I have written this book for the reader interested in the
theoretical aspects of nonparametric estimation as well as for the reader interested in
the application of these methods to multivariate data. It is my hope that the book can
serve as an introductory textbook and also as a general reference.
I have chosen to introduce major ideas in the context of the classical histogram,
which remains the most widely applied and most intuitive nonparametric estimator.
I have found it instructive to develop the links between the histogram and more statistically
efficient methods. This approach greatly simplifies the treatment of advanced
estimators, as much of the novelty of the theoretical context has been moved to the
familiar histogram setting.
The nonparametric world is more complex than its parametric counterpart. I have
selected material that is representative of the broad spectrum of theoretical results
available, with an eye on the potential user, based on my assessments of usefulness,
prevalence, and tutorial value. Theory particularly relevant to application or understanding
is covered, but a loose standard of rigor is adopted in order to emphasize the
methodological and application topics. Rather than present a cookbook of techniques,
I have adopted a hierarchical approach that emphasizes the similarities among the
different estimators. I have tried to present new ideas and practical advice, together
with numerous examples and problems, with a graphical emphasis.
Visualization is a key aspect of effective multivariate nonparametric analysis, and
I have attempted to provide a wide array of graphic illustrations. All of the figures
in this book were composed using S, S-PLUS, Exponent Graphics from IMSL, and
Mathematica. The color plates were derived from S-based software. The color graphics
with transparency were composed by displaying the S output using the MinneView
program developed at the Minnesota Geometry Project and printed on hardware under
development by the 3M Corporation. I have not included a great deal of computer
code. A collection of software, primarily Fortran-based with interfaces to the S language,
is available by electronic mail at scottdw@rice.edu. Comments and other
feedback are welcomed.
I would like to thank many colleagues for their generous support over the past
20 years, particularly Jim Thompson, Richard Tapia, and Tony Gorry. I have especially
drawn on my collaboration with George Terrell, and I gratefully acknowledge
his major contributions and influence in this book. The initial support for the
high-dimensional graphics came from Richard Heydorn of NASA. This work has been
generously supported by the Office of Naval Research under grant N00014-90-J-1176
as well as the Army Research Office. Allan Wilks collaborated on the creation
of many of the color figures while we were visiting the Geometry Project, directed by
Al Marden and assisted by Charlie Gunn, at the Minnesota Supercomputer Center.
I have taught much of this material in graduate courses not only at Rice but also
during a summer course in 1985 at Stanford and during an ASA short course in
1986 in Chicago with Bernard Silverman. Previous Rice students Lynette Factor,
Donna Nezames, Rod Jee, and Ferdie Wang all made contributions through their
theses. I am especially grateful for the able assistance given during the final phases
of preparation by Tim Dunne and Keith Baggerly, as well as Steve Sain, Monnie
McGee, and Michael Minnotte. Many colleagues have influenced this work, including
Edward Wegman, Dan Carr, Grace Wahba, Wolfgang Härdle, Matthew Wand,
Simon Sheather, Steve Marron, Peter Hall, Robert Launer, Yasuo Amemiya, Nils
Hjort, Linda Davis, Bernhard Flury, Will Gersch, Charles Taylor, Imke Janssen,
Steve Boswell, I.J. Good, Iain Johnstone, Ingram Olkin, Jerry Friedman, David
Donoho, Leo Breiman, Naomi Altman, Mark Matthews, Tim Hesterberg, Hal Stern,
Michael Trosset, Richard Byrd, John Bennett, Heinz-Peter Schmidt, Manny Parzen,
and Michael Tarter. Finally, this book could not have been written without the patience
and encouragement of my family.
David W. Scott
Houston, Texas
February, 1992
1
REPRESENTATION AND GEOMETRY
OF MULTIVARIATE DATA
A complete analysis of multidimensional data requires the application of an array of
statistical tools—parametric, nonparametric, and graphical. Parametric analysis is the
most powerful. Nonparametric analysis is the most flexible. And graphical analysis
provides the vehicle for discovering the unexpected.
This chapter introduces some graphical tools for visualizing structure in multidimensional
data. One set of tools focuses on depicting the data points themselves,
while another set of tools relies on displaying functions estimated from those
points. Visualization and contouring of functions in more than two dimensions are
introduced. Some mathematical aspects of the geometry of higher dimensions are
reviewed. These results have consequences for nonparametric data analysis.
1.1 INTRODUCTION
Classical linear multivariate statistical models rely primarily on analysis of the covariance
matrix. So powerful are these techniques that analysis is almost routine for
datasets with hundreds of variables. While the theoretical basis of parametric models
lies with the multivariate normal density, these models are applied in practice
to many kinds of data. Parametric studies provide neat inferential summaries and
parsimonious representation of the data.
For many problems second-order information is inadequate. Advanced modeling
or simple variable transformations may provide a solution. When no simple
parametric model is forthcoming, many researchers have opted for fully “unparametric”
methods that may be loosely collected under the heading of exploratory data
analysis. Such analyses are highly graphical; but in a complex non-normal setting, a
graph may provide a more concise representation than a parametric model, because
a parametric model of adequate complexity may involve hundreds of parameters.
There are some significant differences between parametric and nonparametric
modeling. The focus on optimality in parametric modeling does not translate well
to the nonparametric world. For example, the histogram might be proved to be an
inadmissible estimator, but that theoretical fact should not be taken to suggest histograms
should not be used. Quite to the contrary, some methods that are theoretically
superior are almost never used in practice. The reason is that the ordering of algorithms
is not absolute, but is dependent not only on the unknown density but also on
the sample size. Thus the histogram is generally superior for small samples regardless
of its asymptotic properties. The exploratory school is at the other extreme,
rejecting probabilistic models, whose existence provides the framework for defining
optimality.
In this book, an intermediate point of view is adopted regarding statistical efficacy.
No nonparametric estimate is considered wrong; only different components of
the solution are emphasized. Much effort will be devoted to the data-based calibration
problem, but nonparametric estimates can be reasonably calibrated in practice
without too much difficulty. The “curse of optimality” might suggest that this is
an illogical point of view. However, if the notion that optimality is all important is
adopted, then the focus becomes matching the theoretical properties of an estimator
to the assumed properties of the density function. Is it a gross inefficiency to use a
procedure that requires only two continuous derivatives when the curve in fact has six
continuous derivatives? This attitude may have some formal basis but should be discouraged
as too heavy-handed for nonparametric thinking. A more relaxed attitude
is required. Furthermore, many “optimal” nonparametric procedures are unstable in
a manner that slightly inefficient procedures are not. In practice, when faced with the
application of a procedure that requires six derivatives, or some other assumption that
cannot be proved in practice, it is more important to be able to recognize the signs
of estimator failure than to worry too much about assumptions. Detecting failure at
the level of a discontinuous fourth derivative is a bit extreme, but certainly the effects
of simple discontinuities should be well understood. Thus only for the purposes of
illustration are the best assumptions given.
The notions of efficiency and admissibility are related to the choice of a criterion,
which can only imperfectly measure the quality of a nonparametric estimate. Unlike
optimal parametric estimates that are useful for many purposes, nonparametric estimates
must be optimized for each application. The extra work is justified by the extra
flexibility. As the choice of criterion is imperfect, so then is the notion of a single
optimal estimator. This attitude reflects not sloppy thinking, but rather the imperfect
relationship between the practical and theoretical aspects of our methods. Too rigid a
point of view leads one to a minimax view of the world where nonparametric methods
should be abandoned because there exist difficult problems.
Visualization is an important component of nonparametric data analysis. Data
visualization is the focus of exploratory methods, ranging from simple scatterplots
to sophisticated dynamic interactive displays. Function visualization is a significant
component of nonparametric function estimation, and can draw on the relevant literature
in the fields of scientific visualization and computer graphics. The focus of
multivariate data analysis on points and scatterplots has meant that the full impact
of scientific visualization has not yet been realized. With the new emphasis on
smooth functions estimated nonparametrically, the fruits of visualization will be
attained. Banchoff (1986) has been a pioneer in the visualization of higher dimensional
mathematical surfaces. Curiously, the surfaces of interest to mathematicians
contain singularities and discontinuities, all producing striking pictures when projected
to the plane. In statistics, visualization of the smooth density surface in four,
five, and six dimensions cannot rely on projection, as projections of smooth surfaces
to the plane show nothing. Instead, the emphasis is on contouring in three dimensions
and slicing of surfaces beyond. The focus on three and four dimensions is natural
because one and two are so well understood. Beyond four dimensions, the ability to
explore surfaces carefully decreases rapidly due to the curse of dimensionality. Fortunately,
statistical data seldom display structure in more than five dimensions, so
guided projection to those dimensions may be adequate. It is these threshold dimensions
from three to five that are and deserve to be the focus of our visualization
efforts.
There is a natural flow among the parametric, exploratory, and nonparametric procedures
that represents a rational approach to statistical data analysis. Begin with a
fully exploratory point of view in order to obtain an overview of the data. If a probabilistic
structure is present, estimate that structure nonparametrically and explore
it visually. Finally, if a linear model appears adequate, adopt a fully parametric
approach. Each step conceptually represents a willingness to more strongly smooth
the raw data, finally reducing the dimension of the solution to a handful of interesting
parameters. With the assumption of normality, the mind’s eye can easily imagine
the d-dimensional egg-shaped elliptical data clusters. Some statisticians may prefer
to work in the reverse order, progressing to exploratory methodology as a diagnostic
tool for evaluating the adequacy of a parametric model fit.
There are many excellent references that complement and expand on this sub-
ject. In exploratory data analysis, references include Tukey (1977), Tukey and Tukey
(1981), Cleveland and McGill (1988), and Wang (1978).
In density estimation, the classic texts of Tapia and Thompson (1978), Wertz
(1978), and Thompson and Tapia (1990) first indicated the power of the nonpara-
metric approach for univariate and bivariate data. Silverman (1986) has provided a
further look at applications in this setting. Prakasa Rao (1983) has provided a the-
oretical survey with a lengthy bibliography. Other texts are more specialized, some
focusing on regression (Müller, 1988; Härdle, 1990), some on a specific error cri-
terion (Devroye and Györfi, 1985; Devroye, 1987), and some on particular solution
classes such as splines (Eubank, 1988; Wahba, 1990). A discussion of additive models
may be found in Hastie and Tibshirani (1990).
“9780471697558c01” — 2015/2/25 — 16:16 — page 4 — #4
4 REPRESENTATION AND GEOMETRY OF MULTIVARIATE DATA
1.2 HISTORICAL PERSPECTIVE
One of the roots of modern statistical thought can be traced to the empirical discov-
ery of correlation by Galton in 1886 (Stigler, 1986). Galton’s ideas quickly reached
Karl Pearson. Although best remembered for his methodological contributions such
as goodness-of-fit tests, frequency curves, and biometry, Pearson was a strong pro-
ponent of the geometrical representation of statistics. In a series of lectures a century
ago in November 1891 at Gresham College in London, Pearson spoke on a wide-
ranging set of topics (Pearson, 1938). He discussed the foundations of the science
of pure statistics and its many divisions. He discussed the collection of observations.
He described the classification and representation of data using both numerical and
geometrical descriptors. Finally, he emphasized statistical methodology and discov-
ery of statistical laws. The syllabus for his lecture of November 11, 1891, includes
this cryptic note:
Erroneous opinion that Geometry is only a means of popular representation: it is a
fundamental method of investigating and analysing statistical material. (his italics)
In that lecture Pearson described 10 methods of geometrical data representation.
The most familiar is a representation “by columns,” which he called the “histogram.”
(Pearson is usually given credit for coining the word “histogram” later in
an 1894 paper.) Other familiar-sounding names include “diagrams,” “chartograms,”
“topograms,” and “stereograms.” Unfamiliar names include “stigmograms,” “euthy-
grams,” “epipedograms,” “radiograms,” and “hormograms.”
Beginning 21 years later, Fisher advanced the numerically descriptive portion of
statistics with the method of maximum likelihood, from which he progressed on to the
analysis of variance and other contributions that focused on the optimal use of data
in parametric modeling and inference. In Statistical Methods for Research Workers,
Fisher (1932) devotes a chapter titled “Diagrams” to graphical tools. He begins the
chapter with this statement:
The preliminary examination of most data is facilitated by the use of diagrams.
Diagrams prove nothing, but bring outstanding features readily to the eye; they are
therefore no substitute for such critical tests as may be applied to the data, but are
valuable in suggesting such tests, and in explaining the conclusions founded upon
them.
An emphasis on optimization and the efficiency of statistical procedures has been
a hallmark of mathematical statistics ever since. Ironically, Fisher was criticized
by mathematical statisticians for relying too heavily upon geometrical arguments in
proofs of his results.
Modern statistics has experienced a strong resurgence of geometrical and graphi-
cal statistics in the form of exploratory data analysis (Tukey, 1977). Given the para-
metric emphasis on optimization, the more relaxed philosophy of exploratory data
analysis has been refreshing. The revolution has been fueled by the low cost of graph-
ical workstations and microcomputers. These machines have enabled current work on
statistics in motion (Scott, 1990), that is, the use of animation and kinematic display
for visualization of data structure, statistical analysis, and algorithm performance. No
longer are static displays sufficient for comprehensive analysis.
All of these events were anticipated by Pearson and his visionary statistical computing
laboratory. In his lecture of April 14, 1891, titled “The Geometry of Motion,”
he spoke of the “ultimate elements of sensations we represent as motions in space
and time.” In 1918, after his many efforts during World War I, he reminisced about
the excitement created by wartime work of his statistical laboratory:
The work has been so urgent and of such value that the Ministry of Munitions has
placed eight to ten computers and draughtsmen at my disposal ... (Pearson, 1938,
p. 165).
These workers produced hundreds of statistical graphs, ranging from detailed maps of
worker availability across England (chartograms) to figures for sighting antiaircraft
guns (diagrams). The use of stereograms allowed for representation of data with three
variables. His “computers,” of course, were not electronic but human. Later, Fisher
would be frustrated because Pearson would not agree to allocate his “computers” to
the task of tabulating percentiles of the t-distribution. But Pearson’s capabilities for
producing high-quality graphics were far superior to those of most modern statisti-
cians prior to 1980. Given Pearson’s joint interests in graphics and kinematics, it is
tantalizing to speculate on how he would have utilized modern computers.
1.3 GRAPHICAL DISPLAY OF MULTIVARIATE DATA POINTS
The modern challenge in data analysis is to be able to cope with whatever complexi-
ties may be intrinsic to the data. The data may, for example, be strongly non-normal,
fall onto a nonlinear subspace, exhibit multiple modes, or be asymmetric. Dealing
with these features becomes exponentially more difficult as the dimensionality of the
data increases, a phenomenon known as the curse of dimensionality. In fact, datasets
with hundreds of variables and millions of observations are routinely compiled that
exhibit all of these features. Examples abound in such diverse fields as remote sens-
ing, the US Census, geological exploration, speech recognition, and medical research.
The expense of collecting and managing these large datasets is often so great that no
funds are left for serious data analysis. The role of statistics is clear, but too often
no statisticians are involved in large projects and no creative statistical thinking is
applied. The goal of statistical data analysis is to extract the maximum information
from the data, and to present a product that is as accurate and as useful as possible.
1.3.1 Multivariate Scatter Diagrams
The presentation of multivariate data is often accomplished in tabular form, par-
ticularly for small datasets with named or labeled objects. For example, Table B.1
contains economic data spanning the depression years of the 1930s, and Table B.2
contains information on a selected sample of American universities. It is easy enough
to scan an individual column in these tables, to make comparisons of library size,
for example, and to draw conclusions one variable at a time (see Tufte (1983) and
Wang (1978)). However, variable-by-variable examination of multivariate data can
be overwhelming and tiring, and cannot reveal any relationships among the variables.
Looking at all pairwise scatterplots provides an improvement (Chambers et al., 1983).
Data on four variables of three species of Iris are displayed in Figure 1.1. (A listing
of the Fisher–Anderson Iris data, one of the few familiar four-dimensional datasets,
may be found in several references and is provided with the S package (Becker et al.,
1988).) What multivariate structure is apparent from this figure? The setosa variety
does not overlap the other two varieties. The versicolor and virginica varieties are not
as well separated, although a close examination reveals that they are almost nonover-
lapping. If the 150 observations were unlabeled and plotted with the same symbol,
it is likely that only two clusters would be observed. Even if it were known a priori
that there were three clusters, it would still be unlikely that all three clusters would be
properly identified. These alternative presentations reflect the two related problems
of discrimination and clustering, respectively.
If the observations from different categories overlap substantially or have differ-
ent sample sizes, scatter diagrams become much more difficult to interpret properly.
The data in Figure 1.2 come from a study of 371 males suffering from chest pain
(Scott et al., 1978): 320 had demonstrated coronary artery disease (occlusion or nar-
rowing of the heart’s own arteries) while 51 had none (see Table B.3). The blood fat
concentrations of plasma cholesterol and triglyceride are predictive of heart disease,
although the correlation is low. It is difficult to estimate the predictive power of these
variables in this setting solely from the scatter diagram. A nonparametric analysis
will reveal some interesting nonlinear interactions (see Chapters 5 and 9).
FIGURE 1.1 Pairwise scatter diagrams of the Iris data with the three species labeled.
1, setosa; 2, versicolor; 3, virginica. (Variables: sepal length, sepal width, petal length, petal width.)

FIGURE 1.2 Scatter diagrams of blood lipid concentrations for 320 diseased and 51
nondiseased males. (Panels: no disease, n = 51; with disease, n = 320. Axes: cholesterol (mg/dl) and triglyceride (mg/dl).)

An easily overlooked practical aspect of scatter diagrams is illustrated by these
data, which are integer valued. To avoid problems of overplotting, the data have been
jittered or blurred (Chambers et al., 1983); that is, uniform U(−0.5,0.5) noise is
added to each element of the original data. This trick should be regularly employed
for data recorded with three or fewer significant digits (with an appropriate range on
the added uniform noise). Jittering reduces visual miscues that result from the vertical
and horizontal synchronization of regularly spaced data.
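The jittering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `jitter`, its `half_width` parameter, and the small integer-valued array are hypothetical and are not taken from the datasets in the text.

```python
import numpy as np

def jitter(data, half_width=0.5, seed=None):
    """Return a copy of `data` with independent U(-half_width, half_width) noise added."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    return data + rng.uniform(-half_width, half_width, size=data.shape)

# Hypothetical integer-valued bivariate measurements containing exact ties.
raw = np.array([[200, 150], [200, 150], [201, 150], [200, 151]])
blurred = jitter(raw, seed=0)
```

With a half-width of 0.5 the blurred values still round back to the recorded integers, so no information is lost; for data recorded on a coarser grid the half-width would be set to half the recording interval.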
The visual perception system can easily be overwhelmed if the number of points
is more than several thousand. Figure 1.3 displays three pairwise scatterplots derived
from measurements taken in 1977 by the Landsat remote sensing system over a 5 mile
by 6 mile agricultural region in North Dakota with n = 22,932 = 117 × 196 pixels
or picture elements, each corresponding to an area approximately 1.1 acres in size
(Scott and Thompson, 1983; Scott and Jee, 1984). The Landsat instrument mea-
sures the intensity of light in four spectral bands reflected from the surface of the
earth. A principal components transformation gives two variables that are commonly
referred to as the “brightness” and “greenness” of each pixel. Every pixel is mea-
sured at regular intervals of approximately 3 weeks. During the summer of 1977, six
useful replications were obtained, giving 24 measurements on each pixel. Using an
agronometric growth model for crops, Badhwar et al. (1982) nonlinearly transformed
this 24-dimensional data to three dimensions. Badhwar described these synthetic vari-
ables, (x1,x2,x3), as (1) the calendar time at which peak greenness is observed, (2) the
length of crop ripening, and (3) the peak greenness value, respectively. The scat-
ter diagrams in Figure 1.3 have also been enhanced by jittering, as the raw data are
integers between 0 and 255. The use of integers allows compression to eight bits of
computer memory. Only structure in the boundary and tails is readily seen. The over-
plotting problem is apparent and the blackened areas include over 95% of the data.
Other techniques to enhance scatter diagrams are needed to see structure in the bulk
of the data cloud, such as plotting random subsets (see Tukey and Tukey (1981)).
FIGURE 1.3 Pairwise scatter diagram of transformed Landsat data from 22,932 pixels over
a 5 by 6 nautical mile region. The range on all the axes is (0, 255). (Variables: peak time, peak width, peak value.)

Pairwise scatter diagrams lack one important property necessary for identifying
more than two-dimensional features: strong interplot linkage among the plots. In
principle, it should be possible to locate the same point in each figure, assuming
the data are free of ties. But it is not practical to do so for samples of any size. For
quadrivariate data, Diaconis and Friedman (1983) proposed drawing lines between
corresponding points in the scatterplots of (x1,x2) and (x3,x4) (see Problem 1.2). But a
more powerful dynamic technique that takes full advantage of computer graphics has
been developed by several research groups (McDonald, 1982; Becker and Cleveland,
1987; see the many references in Cleveland and McGill, 1988). The method is called
brushing or painting a scatterplot matrix. Using a pointing device such as a mouse,
a subset of the points in one scatter diagram is selected and the corresponding points
are simultaneously highlighted in the other scatter diagrams. Conceptually, a subset
of points in $\Re^d$ is tagged, for example, by painting the points red or making the points
blink synchronously, and that characteristic is inherited by the corresponding points in all the
“linked” graphs, including not only scatterplots but also histograms and regression
plots. The Iris example in Figure 1.1 illustrates the flavor of brushing with
three tags. Usually the color of points is changed rather than the symbol type. Brush-
ing is an excellent tool for identifying outliers and following well-defined clusters. It
is well-suited for conditioning on some variable, for example, $1 < x_3 < 3$.
These ideas are illustrated in Figure 1.4 for the PRIM4 dataset (Friedman and
Tukey, 1974; the data summarize 500 high-energy particle physics scattering exper-
iments) provided in the S language. Using the brushing tool in S-PLUS (1990), the
left cluster in the 1–2 scatterplot was brushed, and then the left cluster in the 2–4
scatterplot was brushed with a different symbol. Try to imagine linking the clusters
throughout the scatterplot matrix without any highlighting.
FIGURE 1.4 Pairwise scatterplots of the transformed PRIM4s data using the ggobi visual-
ization system. Two clumps of points are highlighted by brushing.
There are limitations to the brushing technique. The number of pairwise scatterplots
is $\binom{d}{2}$, so viewing more than 5 or 10 variables at once is impractical.
Furthermore, the physical size of each scatter diagram is reduced as more variables
are added, so that fewer distinct data points can be plotted. If there are more than
a few variables, the eye cannot follow many of the dynamic changes in the pattern
of points during brushing, except with the simplest of structure. It is, however, an
open question as to the number of dimensions of structure that can be perceived by
this method of linkage. Brushing remains an important and well-used tool that has
proven successful in real data analysis.
If a 2-D array of bivariate scatter diagrams is useful, then why not construct a
3-D array of trivariate scatter diagrams? Navigating the collection of $\binom{d}{3}$ trivariate
scatterplots is difficult even with modest values of d. But a single 3-D scatterplot
can easily be rotated in real time with significant perceptual gain compared to three
bivariate diagrams in the scatterplot matrix. Many statistical packages now provide
this capability. The program MacSpin (Donoho et al., 1988) was the first widely used
software of this type. The top middle panel in Figure 1.4 displays a particular ori-
entation of a rotating 3-D scatterplot. The kinds of structure available in 3-D data
are more complex (and hence more interesting) than in 2-D data. Furthermore, the
overplotting problem is reduced as more data points can be resolved in a rotating 3-D
scatterplot than in a static 2-D view (although this is resolution dependent—a 2-D
view printed by a laser device can display significantly more points than is possible
on a computer monitor). Density information is still relatively difficult to perceive,
however, and the sample size definitely influences perception.
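To make the growth in the number of panels concrete, the short sketch below tabulates the binomial counts of bivariate and trivariate scatterplots for a few values of d. The function name `panel_counts` is hypothetical and serves only this illustration.

```python
from math import comb

def panel_counts(d):
    """Number of distinct bivariate and trivariate scatterplots for d variables."""
    return comb(d, 2), comb(d, 3)

for d in (4, 5, 10):
    pairs, triples = panel_counts(d)
    print(f"d={d}: {pairs} pairwise plots, {triples} trivariate plots")
```

For d = 10 there are already 45 pairwise and 120 trivariate views, which is consistent with the practical limit of 5 to 10 variables mentioned above.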
Beyond three dimensions, many novel ideas are being pursued (see Tukey and
Tukey (1981)). Six-dimensional data could be viewed with two rotating 3-D scat-
ter diagrams linked by brushing. Carr and Nicholson (1988) have actively pursued
using stereography as an alternative and adjunct to rotation. Some workers report
that stereo viewing of static data can be more precise than viewing dynamic rotation
alone. Unfortunately, many individuals suffer from color blindness and various depth
perception limitations, rendering some techniques useless. Nevertheless, it is clear
that there is no limit to the possible combinations of ideas one might consider imple-
menting. Such efforts can easily take many months to program without any fancy
interface. This state of affairs would be discouraging but for the fact that a LISP-
based system for easily prototyping such ideas is now available using object-oriented
concepts (see Tierney (1990)). RStudio has made the shiny app available for this purpose
as well: see http://shiny.rstudio.com. A collection of articles is devoted to the
general topic of animation (Cleveland and McGill, 1988).
The idea of displaying 2- or 3-D arrays of 2- or 3-D scatter diagrams is perhaps
too closely tied to the Euclidean coordinate system. It might be better to examine
many 2- or 3-D projections of the data. An orderly way to do approximately just
that is the “grand tour” discussed by Asimov (1985). Let P be a d × 2 projection
matrix, which takes the d-dimensional data down to a plane. The author proposed
examining a sequence of scatterplots obtained by a smoothly changing sequence of
projection matrices. The resulting kinematic display shows the n data points mov-
ing in a continuous (and sometimes seemingly random) fashion. It may be hoped
that most interesting projections will be displayed at some point during the first sev-
eral minutes of the grand tour, although for even 10 variables several hours may be
required (Huber, 1985).
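The flavor of a smoothly changing sequence of d × 2 projection matrices can be conveyed by the following sketch. It is not Asimov's construction: as a simplifying assumption, it blends linearly between randomly chosen orthonormal frames and re-orthonormalizes each blend with a QR decomposition. The names `orthonormalize`, `random_frame`, and `tour_frames` are hypothetical.

```python
import numpy as np

def orthonormalize(a):
    """Orthonormalize the columns of a (QR with positive diagonal of R, so the
    result varies continuously with a as long as a has full column rank)."""
    q, r = np.linalg.qr(a)
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs

def random_frame(d, rng):
    """Random d x 2 matrix with orthonormal columns."""
    return orthonormalize(rng.standard_normal((d, 2)))

def tour_frames(d, n_targets=5, steps=20, seed=0):
    """Yield d x 2 orthonormal projection matrices that change gradually."""
    rng = np.random.default_rng(seed)
    current = random_frame(d, rng)
    for _ in range(n_targets):
        target = random_frame(d, rng)
        for s in range(1, steps + 1):
            w = s / steps
            # Blend the two frames, then restore orthonormal columns.
            yield orthonormalize((1.0 - w) * current + w * target)
        current = target
```

For an n × d data matrix `X`, each frame `P` gives the n × 2 array `X @ P` of scatterplot coordinates; displaying these arrays in sequence produces the kinematic effect described above. A faithful grand tour would choose the path so that the planes visited become dense in the set of all 2-D planes, which this sketch does not attempt to guarantee.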
Special attention should be drawn to representing multivariate data in the bivariate
scatter diagram with points replaced by glyphs, which are special symbols whose
shapes are determined by the remaining data variables (x3,...,xd). Figure 1.5 displays
the Iris data in such a form following Carr et al. (1986). The length and angle of the
glyph are determined by the sepal length and width, respectively. Careful examination
of the glyphs shows that there is no gap in 4-D between the versicolor and virginica
species, as the angles and lengths of the glyphs are similar near the boundary.
FIGURE 1.5 Glyph scatter diagram of the Iris data. (Axes: petal length and petal width; glyph (length, angle) = (sepal length, sepal width); species shown: setosa, versicolor, virginica.)

FIGURE 1.6 A three-dimensional scatter diagram of the Fisher–Anderson Iris data, omitting
the sepal length variable. From left to right, the 50 points for each of the three varieties of
setosa, versicolor, and virginica are distinguished by symbol type (square, diamond, triangle),
respectively. The symbol is required to indicate the presence of three clusters rather than only
two. The same basic picture results from any choice of three variables from the full set of four
variables. (Axes: petal length, sepal width, petal width.)
A second glyph representation shown in Figure 1.6 is a 3-D scatterplot omitting
sepal length, one of the four variables. This figure clearly depicts the structure in
these data. Plotting glyphs in 3-D scatter diagrams with stereography is a more pow-
erful visual tool (Carr and Nicholson, 1988). The glyph technique does not treat
variables “symmetrically” and all variable–glyph combinations could be considered.
This complaint affects most multivariate procedures (with a few exceptions).
All of these techniques are an outgrowth of a powerful system devised to analyze
data in up to nine dimensions called PRIM-9 (Fisherkeller et al., 1974; reprinted in
Cleveland and McGill, 1988). The PRIM-9 system contained many of the capabilities
of current systems. The letters are an acronym for “Picturing, Rotation, Isolation, and
Masking.” The latter two serve to identify and select subsets of the multivariate data.
The “picturing” feature was implemented by pressing two buttons that cycled through
all of the $\binom{9}{2}$ pairwise scatter diagrams in current coordinates. An IBM 360 mainframe
was specially modified to drive the custom display system.
1.3.2 Chernoff Faces
Chernoff (1973) proposed a special glyph that associates variables to facial features,
such as the size and shape of the eyes, nose, mouth, hair, ears, chin, and facial out-
line. Certainly, humans are able to discriminate among nearly identical faces very
well. Chernoff has suggested that most other multivariate point methods “seem to be
less valuable in producing an emotional response” (Wang, 1978, p. 6). Whether an
emotional response is desired is debatable. Chernoff faces for the time series dataset
in Table B.1 are displayed in Figure 1.7. (The variable–feature associations are listed
in the table.) By carefully studying an individual facial feature such as the smile over
the sequence of all the faces, simple trends can be recognized. But it is the overall
multivariate impression that makes Chernoff faces so powerful. Variables should be
carefully assigned to features. For example, Chernoff faces of the colleges’ data in
Table B.2 might logically assign variables relating to the library to the eyes rather
than to the mouth (see Problem 1.3). Such subjective judgments should not prejudice
our use of this procedure.

FIGURE 1.7 Chernoff faces of the economic dataset spanning 1925–1939. (One face per year, 1925 through 1939.)
One early application not in a statistics journal was constructed by Hiebert-Dodd
(1982), who had examined the performance of several optimization algorithms on a
suite of test problems. She reported that several referees felt this method of presenta-
tion was too frivolous. Comparing the endless tables in the paper as it appeared to the
Chernoff faces displayed in the original technical report, one might easily conclude
the referees were too cautious. On the other hand, when Rice University administra-
tors were shown Chernoff faces of the colleges’ dataset, they were quite open to its
suggestions and enjoyed the exercise. The practical fact is that repetitious viewing of
large tables of data is tedious and haphazard, and broad-brush displays such as faces
can significantly improve data digestion. Several researchers have noted that Chernoff
faces contain redundant information because of symmetry. Flury and Riedwyl (1981)
have proposed using asymmetrical faces, as did Turner and Tidmore (1980), although
Chernoff has stated he believes the additional gain does not justify such nonrealistic
figures.
1.3.3 Andrews’ Curves and Parallel Coordinate Curves
FIGURE 1.8 Star diagram for 4 years of the economic dataset shown in Figure 1.7. (Panels: 1929, 1930, 1931, 1932.)

Three intriguing proposals display not the data points themselves but rather a unique
curve determined by the data vector x. Andrews (1972) proposed representing
high-dimensional data by replacing each point in $\Re^d$ with a curve $s(t)$ for $|t| < \pi$,
where
$$s(t \mid x_1,\ldots,x_d) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots ,$$
the so-called Fourier series representation. This mapping provides the first “com-
plete” continuous view of high-dimensional points on the plane, because, in principle,
the original multivariate data point can be recovered from this curve. Clearly, an
Andrews’ curve is dominated by the variables placed on the low-frequency terms,
so care should be taken to put the most interesting variables early in the expansion
(see Problem 1.4).
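The Andrews mapping is simple to evaluate numerically. The sketch below computes the curve on a grid of t values for a single data vector; the function name `andrews_curve` and the example vector are hypothetical, chosen only for illustration.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate s(t | x) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + ... at t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    s = np.full_like(t, x[0] / np.sqrt(2.0))
    for j, coef in enumerate(x[1:], start=1):
        freq = (j + 1) // 2                      # frequencies 1, 1, 2, 2, 3, ...
        basis = np.sin if j % 2 == 1 else np.cos  # sine first, then cosine
        s = s + coef * basis(freq * t)
    return s

# One curve for one hypothetical four-dimensional observation.
t_grid = np.linspace(-np.pi, np.pi, 201)
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t_grid)
```

Because the first variables multiply the low-frequency terms, permuting the entries of the data vector changes the appearance of the curve, which is the ordering issue raised above.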
A simple graphical device that treats the d variables symmetrically is the star dia-
gram, which is discussed by Fienberg (1979). The d axes are drawn as spokes on a
wheel. The coordinate data values are plotted on those axes and connected as shown
in Figure 1.8.
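A star diagram reduces to computing one polygon vertex per variable. The following sketch assumes that each variable is first rescaled to the unit interval using lower and upper limits supplied by the analyst; that rescaling, the function name `star_vertices`, and its arguments are assumptions made for illustration and are not prescribed by the text.

```python
import numpy as np

def star_vertices(values, lower, upper):
    """(x, y) vertices of the star polygon for one observation.

    Variable k is rescaled to [0, 1] using lower[k] and upper[k] and is then
    plotted at that radius along a spoke at angle 2*pi*k/d.
    """
    values = np.asarray(values, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    radii = (values - lower) / (upper - lower)
    angles = 2.0 * np.pi * np.arange(values.size) / values.size
    return np.column_stack((radii * np.cos(angles), radii * np.sin(angles)))
```

Connecting the returned vertices in order, and closing the polygon back to the first vertex, gives the outline of one star.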
Another novel multivariate approach that treats variables in a symmetric fashion is
the parallel coordinates plot, introduced by Inselberg (1985) in a mathematical set-
ting and extended by Wegman (1990) to the analysis of stochastic data. Cartesian
coordinates are abandoned in favor of d axes drawn parallel and equally spaced.
Each multivariate point $x \in \Re^d$ is plotted as a piecewise linear curve connecting
the d points on the parallel axes. For reasons shown by Inselberg and Wegman,
there are advantages to simply drawing piecewise linear line segments, rather than
a smoother line such as a spline. The disadvantage of this choice is that points
that have identical values in any coordinate dimension cannot be distinguished in
parallel coordinates. However, with this choice a duality may be deduced between
points and lines in Euclidean and parallel coordinates. In the left frame of Figure 1.9,
six points that fall on a straight line with negative slope are plotted. The right frame
shows those same points in parallel coordinates. Thus a scatter diagram of highly
correlated normal points displays a nearly common point of intersection in parallel
coordinates. However, if the correlation is positive, that point is not “between” the
parallel axes (see Problem 1.6). The location of the point where the lines all intersect
can be used to recover the equation of the line back in Euclidean coordinates (see
Problem 1.8).
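The duality can be checked numerically under one assumed drawing convention, which the text does not fix: the x1 axis is placed at horizontal position 0 and the x2 axis at position 1. A point (x1, x2) then becomes the segment joining (0, x1) and (1, x2). If x2 = m x1 + b, the height of that segment at position u is x1[1 + u(m − 1)] + ub, which is free of x1 exactly when u = 1/(1 − m). The function names below are hypothetical.

```python
import numpy as np

def dual_point(slope, intercept):
    """Common intersection point of the parallel-coordinate segments of all
    points on the line x2 = slope * x1 + intercept (axes at positions 0 and 1)."""
    if np.isclose(slope, 1.0):
        raise ValueError("segments are parallel when the slope equals 1")
    u = 1.0 / (1.0 - slope)
    return u, intercept * u

def segment_height(x1, x2, u):
    """Height at horizontal position u of the segment joining (0, x1) and (1, x2)."""
    return x1 + u * (x2 - x1)

# Six hypothetical points on a line with negative slope.
x1 = np.linspace(0.2, 1.3, 6)
x2 = 1.5 - x1
u, v = dual_point(-1.0, 1.5)
heights = segment_height(x1, x2, u)   # all equal to v
```

Under this convention a negative slope gives 0 < u < 1, so the intersection lies between the axes, while a positive slope other than 1 places it outside them. Given an observed intersection (u, v), the line is recovered from slope = 1 − 1/u and intercept = v/u.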
FIGURE 1.9 Example of duality of points and lines between Euclidean and parallel
coordinates. The points are labeled 1 to 6 in both coordinate systems. (Left frame: x1 and x2 in Euclidean coordinates; right frame: parallel axes x1 and x2.)

A variety of other properties with potential applications are explored by Inselberg
and Wegman. One result is a graphical means of deciding if a point $x \in \Re^d$ is on the
inside or the outside of a convex closed hypersurface. If all the points on the hypersurface
are plotted in parallel coordinates, then a well-defined geometrical outline
will appear on the plane. If a portion of the line segments defining the point x in par-
allel coordinates fall outside the outline, then x is not inside the hypersurface, and
vice versa. One of the more fascinating extensions developed by Wegman is a grand
tour of all variables displayed in parallel coordinates. The advantage of parallel coor-
dinates is that all d of the rotating variables are visible simultaneously, whereas in
the usual presentation, only two of the grand tour variables are visible in a bivariate
scatterplot.
Figure 1.10 displays parallel coordinate plots of the Iris and earthquake data. The
earthquake dataset represents the epicenters of 473 tremors beneath the Mount St.
Helens volcano in the several months preceding its March 1982 eruption (Weaver
et al., 1983). The tremors are mostly small in magnitude, increasing in frequency
over time, and clustered near the surface, although depth is clearly a bimodal
variable. The longitude and latitude variables are least effective on this plot, because
their natural spatial structure is lost.
1.3.4 Limitations
Tools such as Chernoff faces and scatter diagram glyphs tend to be most valuable
with small datasets where individual points are “identifiable” or interesting. Such
individualistic exploratory tools can easily generate “too much ink” (Tufte, 1983)
and produce figures with black splotches, which convey little information. Parallel
coordinates and Andrews’ curves generate much ink. One obvious remedy is to plot
only a subset of the data in a process known as “thinning.” However, plotting random
subsets no longer makes optimal use of all the data and does not result in precisely
reproducible interpretations. Point-oriented methods typically have a range of sample
sizes that is most appropriate: $n < 200$ for faces; $n < 2000$ for scatter diagrams.

FIGURE 1.10 Parallel coordinate plot of the earthquake dataset. (Parallel axes, Iris panel: Sepal.length, Sepal.width, Petal.length, Petal.width; earthquake panel: Longitude, Latitude, Depth, Day, Intensity.)
Since none of these displays is truly d-dimensional, each has limitations. All pair-
wise scatterplots can detect distinct clusters and some two-dimensional structure (if
perhaps in a rotated coordinate system). In the latter case, an interactive supplement
such as brushing may be necessary to confirm the nature of the links among the scat-
terplots (not really providing any higher dimensional information). On the positive
side, variables are treated symmetrically in the scatterplot matrix. But many different
and highly dissimilar d-dimensional datasets can give rise to visually similar scatter-
plot matrix diagrams; hence the need for brushing. However, with increasing number
of variables, individual scatterplots physically decrease in size and fill up with ink
ever faster. Scatter diagrams provide a highly subjective view of data, with poor
density perception and greatest emphasis on the tails of the data.
1.4 GRAPHICAL DISPLAY OF MULTIVARIATE FUNCTIONALS
1.4.1 Scatterplot Smoothing by Density Function
As graphical exploratory tools, each of the point-based procedures has significant
value. However, each suffers from the problem of too much ink, as the number of
objects (and hence the amount of ink) is linear in the sample size n. To mix metaphors,
point-based graphs cannot provide a consistent picture of the data as n → ∞. As Scott
and Thompson (1983) wrote,
the scatter diagram points to the bivariate density function.
In other words, the raw data points need to be smoothed if a consistent view is to be
obtained.
A histogram is the simplest example of a scatterplot smoother. The amount of
smoothness is controlled by the bin width. For univariate data, the histogram with
bin width narrower than min |xi −xj| is precisely a univariate scatter diagram plotted
with glyphs that are tall, thin rectangles. For bivariate data, the glyph is a beam with a
square base. Increasing the bin width, the histogram represents a count per unit area,
which is precisely the unit of a probability density. In Chapter 3, the histogram will
be shown to provide a consistent estimate of the density function in any dimension.
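The scaling described above can be sketched in a few lines. This is a minimal illustration, not the book's code: the function name `histogram_density`, the `origin` parameter, and the four sample values are assumptions made for the example. It divides each bin count by n times the bin width, so the heights are in density units, and it shows that a bin width below the smallest gap between points reduces the histogram to one spike per observation.

```python
import numpy as np

def histogram_density(x, bin_width, origin=0.0):
    """Histogram heights scaled to integrate to one: count / (n * bin_width)."""
    x = np.asarray(x, dtype=float)
    # Left edge of the mesh: the largest origin + k * bin_width not exceeding min(x).
    lo = origin + np.floor((x.min() - origin) / bin_width) * bin_width
    n_bins = int(np.floor((x.max() - lo) / bin_width)) + 1
    edges = lo + bin_width * np.arange(n_bins + 1)
    counts, _ = np.histogram(x, bins=edges)
    return edges, counts / (x.size * bin_width)

# Hypothetical data: the smallest gap between points is 0.39, so a bin width
# of 0.1 places every observation in its own bin.
x = np.array([0.13, 0.52, 0.91, 1.74])
edges, heights = histogram_density(x, bin_width=0.1)
print(heights[heights > 0])   # four equal spikes of height 1 / (n * bin_width)
```

Widening `bin_width` merges the spikes into bars whose heights still integrate to one, which is the sense in which the bin width controls the amount of smoothing.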
Histograms can provide a wealth of information for large datasets, even well-
known ones. For example, consider the 1979–1981 decennial life table published
by the U.S. Bureau of the Census (1987). Certain relevant summary statistics are
well-known: life expectancy, infant mortality, and certain conditional life expectan-
cies. But what additional information can be gleaned by examining the mortality
histogram itself? In Figure 1.11, the histogram of age of death for individuals is
depicted. Not surprisingly, the histogram is skewed with a short tail for older ages.
Not as well-known perhaps is the observation that the most common age of death is
85! The absolute and relative magnitude of mortality in the first year of life is made
strikingly clear.
Careful examination reveals two other general features of interest. The first feature
is the small but prominent bump in the curve between the ages of 13 and 27 years.
This “excess mortality” is due to an increase in a variety of risky activities, the most
notable being obtaining a driver’s license. In the right frame of Figure 1.11, compar-
ison of the 1959–1961 (Gross and Clark, 1975) and 1979–1981 histograms shows an
impressive reduction of death in all preadolescent years. Particularly striking is the
60% decline in mortality in the first year and the 3-year difference in the locations of
the modes.
These facts are remarkable when placed in the context of the mortality histogram
constructed by John Graunt from the Bills of Mortality during the plague years.
Graunt (1662) estimated that 36% of individuals died before attaining their sixth birth-
day! Graunt was a contemporary of the better-known William Petty, to whom some
credit for these ideas is variously ascribed, probably without cause. The circumstantial
evidence that Graunt actually invented the histogram while looking at these mortality
data seems quite strong, although there is reason to infer that Galileo had used
histogram-like diagrams earlier. Hald (1990) recounts a portion of Galileo’s Dialogo,
published in 1632, in which Galileo summarized his observations on the star that
appeared in 1572. According to Hald, Galileo noted the symmetry of the “observation
errors” and the more frequent occurrence of small errors than large errors. Both
points suggest Galileo had constructed a frequency diagram to draw those conclusions.
FIGURE 1.11 Histogram of the U.S. mortality data in 1960. Rootgrams (histograms plotted
on a square-root scale) of the mortality data for 1960, 1980, and 1997. (Left frame axes:
age of death, number per 100,000. Right frame axes: age of death, square root of number
per 100,000. Legend entries visible in the figure: 1960, 1980, 1997, 2009.)
Many large datasets are in fact collected in binned or histogram form. For
example, elementary particles in high-energy physics scattering experiments are man-
ifested by small bumps in the frequency curve. Good and Gaskins (1980) considered
such a large dataset (n = 25,752) from the Lawrence Radiation Laboratory (LRL)
(see Figure 1.12). The authors devised an ingenious algorithm for estimating the
odds that a bump observed in the frequency curve was real. This topic is covered
in Chapter 9.
Multivariate scatterplot smoothing of time series data is also easily accomplished
with histograms. Consider a univariate time series and smooth both the raw data {x_t}
as well as the lagged data {x_t, x_{t+1}}. Any strong elliptical structure present in the
smoothed lagged-data diagram provides a graphical version of the first-order
autocorrelation coefficient. Consider the Old Faithful geyser dataset listed in Table B.6.
These data are the durations in minutes of 107 eruptions of the Old Faithful geyser
(Weisberg, 1985). As there was a gap in the recording of data between midnight
and 6 a.m., there are only 99 pairs {x_t, x_{t+1}} available. The univariate histogram
in Figure 1.13 reveals a simple bimodal structure: short and long eruption durations.
The most notable feature in the bivariate (smoothed) histogram is the missing
fourth bump corresponding to the short-short duration sequence. Clearly, graphs of
f̂(x_{t+1} | x_t) would be useful for improved prediction compared to a regression estimate.
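The construction of the lagged pairs can be sketched as follows. This is an illustrative outline only: the helper names `lagged_pairs` and `lag_histogram`, and the `breaks` argument marking where a recording gap follows an observation, are assumptions for the example and not part of the original analysis. Pairs that straddle a gap are dropped before binning, which is why a series of 107 values with 7 gaps yields 99 pairs.

```python
import numpy as np

def lagged_pairs(x, breaks=()):
    """Return the (x_t, x_{t+1}) pairs, skipping pairs that straddle a gap.

    `breaks` lists the 0-based indices t after which recording was interrupted
    (a hypothetical input format chosen for this sketch).
    """
    x = np.asarray(x, dtype=float)
    keep = np.ones(x.size - 1, dtype=bool)
    keep[np.asarray(breaks, dtype=int)] = False
    return np.column_stack([x[:-1][keep], x[1:][keep]])

def lag_histogram(x, bins=10, breaks=()):
    """Bivariate bin counts of the lagged pairs."""
    pairs = lagged_pairs(x, breaks)
    counts, xedges, yedges = np.histogram2d(pairs[:, 0], pairs[:, 1], bins=bins)
    return counts, xedges, yedges

# Placeholder series of length 107 with 7 hypothetical gaps: 106 - 7 = 99 pairs.
demo = np.arange(107.0)
pairs = lagged_pairs(demo, breaks=[10, 25, 40, 55, 70, 85, 100])
print(pairs.shape)
```

The raw counts returned by `lag_histogram` would still need smoothing to resemble the bivariate display discussed in the text.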
FIGURE 1.12 Histogram of LRL dataset. (Axes: Mev, bin count.)
FIGURE 1.13 Histogram of {x_t} for the Old Faithful geyser dataset, and a bivariate
histogram of the lagged data (x_t, x_{t+1}). (Univariate axes: eruption duration in
minutes, bin count. Bivariate axes: X(t) and X(t+1), each from 1 to 5.5.)
For more than two dimensions, only slices are available for viewing with histogram
surfaces. Consider the Landsat data again. Divide the (jittered) data into four pieces
using quartiles of x_1, which is the time of peak greenness. Examining a series of
bivariate pictures of (x_2, x_3) for each quartile slice provides a crude approximation
of the four-dimensional surface f̂(x_1, x_2, x_3) (see Figure 1.14). The histograms are
all constructed on the subinterval [−5,100]×[−5,100]. Compare this representation
of the Landsat data to that in Figure 1.3. From Figure 1.3, it is clear that most of
the outliers are in the last quartile of x_1. How well can the relative density levels
be determined from the scatter diagrams? Visualization of a smoothed histogram of
these data will be considered in Section 1.4.3.
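A rough sketch of the slicing step, assuming the data arrive as an n × 3 array with columns (x_1, x_2, x_3): the function name `quartile_slices` and the default of 20 bins per axis are illustrative choices, and the Landsat data themselves are not reproduced here. Each observation is assigned to a quartile group of x_1, and a separate bivariate histogram of (x_2, x_3) is formed within each group on a common mesh.

```python
import numpy as np

def quartile_slices(data, bins=20, extent=(-5.0, 100.0)):
    """Bivariate histograms of (x2, x3) within each quartile group of x1."""
    data = np.asarray(data, dtype=float)
    x1 = data[:, 0]
    # Interior cut points: the three quartiles of x1.
    cuts = np.quantile(x1, [0.25, 0.5, 0.75])
    group = np.digitize(x1, cuts)          # group labels 0, 1, 2, 3
    edges = np.linspace(extent[0], extent[1], bins + 1)
    slices = []
    for g in range(4):
        rows = data[group == g]
        counts, _, _ = np.histogram2d(rows[:, 1], rows[:, 2], bins=[edges, edges])
        slices.append(counts)
    return slices
```

Using one shared mesh for all four slices keeps the bin counts directly comparable across slices; points falling outside `extent` are silently dropped by `np.histogram2d`.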
FIGURE 1.14 Bivariate histogram slices of the trivariate Landsat data. Slicing was
performed at the quartiles of variable x_1. (Panel ranges of x_1 recoverable from the
figure: 5.2 to 82.7, 82.7 to 85.2, 85.2 to 87.4, 87.4 to 93.8, 93.8 to 97.2, and 97.2 to
249.5; axes: x_2, x_3.)
1.4.2 Scatterplot Smoothing by Regression Function
The term scatterplot smoother is most often applied to regression data. For bivariate
data, either a nonparametric regression line can be superimposed upon the data, or
the points themselves can be moved toward the regression line. Tukey (1977) presents
the “3R” smoother as an example of the latter. Suppose that the n data points, {x_t}, are
measured on a fixed time scale. The 3R smoothing algorithm replaces each point x_t
with the median of the three points {x_{t−1}, x_t, x_{t+1}} recursively until no changes occur.
This algorithm is a powerful filter that removes isolated outliers effectively. The 3R
smoother may be applied to unequally spaced data or repeated data. Tukey also
proposes applying a Hanning filter, by which x̃_t ← 0.25 × (x_{t−1} + 2x_t + x_{t+1}). This filter
may be applied several times as necessary. In Figure 1.15, the Tukey smoother (S
function smooth) is applied to the gas flow dataset given in Table B.5. Observe
how the single potential outlier at x = 187 is totally ignored. The least-squares fit is
shown for reference.
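The two filters just described can be sketched as below. This is a simplified illustration rather than a reimplementation of the S function smooth: the function names are invented for the example, and the two end points are simply copied unchanged, which is an assumption (Tukey's own end-point rule is not reproduced here).

```python
import numpy as np

def running_median3(x):
    """One pass of running medians of three; end points are copied unchanged."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    if x.size >= 3:
        y[1:-1] = np.median(np.column_stack([x[:-2], x[1:-1], x[2:]]), axis=1)
    return y

def smooth_3r(x, max_iter=1000):
    """Repeat running medians of three until no value changes ("3R")."""
    y = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        z = running_median3(y)
        if np.array_equal(z, y):
            break
        y = z
    return y

def hanning(x):
    """One pass of the Hanning filter 0.25 * (x[t-1] + 2 * x[t] + x[t+1])."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    if x.size >= 3:
        y[1:-1] = 0.25 * (x[:-2] + 2.0 * x[1:-1] + x[2:])
    return y

# A made-up series with one isolated outlier (the value 100).
print(smooth_3r([1, 2, 3, 100, 5, 6, 7]))
```

In the made-up series, the isolated value 100 is replaced by a neighboring median on the first pass and the second pass changes nothing, so the iteration stops.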
The simplest nonparametric regression estimator is the regressogram. The x-axis
is binned and the sample averages of the responses are computed and plotted over the
intervals. The regressogram for the gas flow dataset is also shown in Figure 1.15. The
Hanning filter and regressogram are special cases of nonparametric kernel regression,
which is discussed in Chapter 8.
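A regressogram along the lines described above might be computed as follows. The function name, the choice to report NaN for empty bins, and the small data set are assumptions made for illustration.

```python
import numpy as np

def regressogram(x, y, edges):
    """Mean response within each bin [edges[k], edges[k+1]); NaN where a bin is empty."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.asarray(edges, dtype=float)
    n_bins = edges.size - 1
    idx = np.digitize(x, edges) - 1                 # bin index of each x
    inside = (idx >= 0) & (idx < n_bins)            # drop points outside the mesh
    sums = np.bincount(idx[inside], weights=y[inside], minlength=n_bins)
    counts = np.bincount(idx[inside], minlength=n_bins)
    means = np.full(n_bins, np.nan)
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty]
    return means

# Hypothetical data: three bins of width one.
print(regressogram([0.5, 1.5, 1.7, 2.5], [1.0, 2.0, 4.0, 10.0], [0, 1, 2, 3]))
```

Plotting each mean as a horizontal segment over its bin gives the step-function appearance of a regressogram; as with the histogram, the bin width plays the role of the smoothing parameter.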
The gas flow dataset is part of a larger collection taken at seven different pressures.
A stick-pin plot of the complete dataset is shown in Figure 1.16 (the 74.6 psia data
are second from the right). Clearly, the accuracy is affected by the flow rate, while
the effect of psia seems small. These data will be revisited in Chapter 8.
1.4.3 Visualization of Multivariate Functions
Visualization of functions of more than two variables has not been common in statis-
tics. The Landsat example in Figure 1.14 hints at the potential that visualization of
4-D surfaces would bring to the data analyst. In this section, effective visualization
of surfaces in more than three dimensions is introduced.
FIGURE 1.15 Accuracy of a natural gas meter as a function of the flow rate through the
valve at 74.6 psia. The raw data (n = 33) are shown by the filled points. The three smooths
(least squares, Tukey’s 3R, and Tukey’s regressogram) are superimposed. (Axes: flow rate,
percentage of actual flow.)
FIGURE 1.16 Complete 3-D view of the gas flow dataset. (Axes: log10 flow, log10 psia,
accuracy.)
Displaying a three-dimensional perspective plot of the surface f(x, y) of a bivariate
function requires one more dimension than the corresponding bivariate contour rep-
resentation (see Figure 1.17). There are trade-offs. The contour representation lacks
the exact detail and visual impact available in a perspective plot; however, perspective
plots usually have portions obscured by peaks and present less precise height infor-
mation. One way of expressing the difference is to say that a contour plot displays,
loosely speaking, about 2.6–2.9 dimensions of the entire 3-D surface (more, as more
contour lines are drawn). Some authors claim that one or the other representation is
superior, but it seems clear that both can be useful for complicated surfaces.
FIGURE 1.17 Perspective plot of bivariate normal density with a “floating” representation
of the corresponding contours. (Axes: X, Y, Z.)
The visualization advantage afforded by a contour representation is that it lives
in the same dimension as the data, whereas a perspective plot requires an additional
dimension. Hence with trivariate data, the third dimension can be used to present a
3-D contour. In the case of a density function, the corresponding 3-D contour plot
comprises one or more α-level contour surfaces, which are defined for x ∈ ℝ^d by

$$\alpha\text{-Contour}: \quad S_\alpha = \{x : f(x) = \alpha f_{\max}\}, \qquad 0 \le \alpha \le 1,$$

where f_max is the maximum or modal value of the density function.
For normal data, the general contour surfaces are hyper-ellipses defined by the
easily verified equation (see Problem 1.14):

$$(x-\mu)^T \Sigma^{-1} (x-\mu) = -2\log\alpha. \tag{1.1}$$
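One way to check (1.1) is sketched here, assuming f is the N(μ, Σ) density in d dimensions (the full argument is the subject of Problem 1.14). The density and its modal value are

$$f(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\!\left\{-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right\}, \qquad f_{\max} = f(\mu) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2}.$$

Setting f(x) = α f_max cancels the normalizing constant and leaves

$$\exp\!\left\{-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right\} = \alpha \quad\Longleftrightarrow\quad (x-\mu)^T \Sigma^{-1} (x-\mu) = -2\log\alpha .$$

Since 0 < α ≤ 1, the right-hand side is nonnegative, and smaller values of α correspond to larger ellipsoids.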
A trivariate contour plot of f(x1,x2,x3) would generally contain several “nested”
surfaces, {S0.1,S0.3,S0.5,S0.7,S0.9}, for example. For the independent standard nor-
mal density, the contours would be nested hyperspheres centered on the mode. In
Figure 1.18, three contours of the trivariate standard normal density are shown in
stereo. Many, if not most, readers will have difficulty crossing their eyes to obtain
the stereo effect. But even without the stereo effect, the three spherical contours are
well-represented.
How effective is this in practice? Consider a smoothed histogram f̂(x,y,z) of 1000
trivariate normal points with Σ = I3. Figure 1.19 shows surfaces of nine equally
spaced bivariate slices of the trivariate estimate. Each slice is approximately bivari-
ate normal but without rescaling. Of course, the surfaces are not precisely bivariate
normal, due to the finite size of the sample.
A natural question to pose is: Why not plot the corresponding sequence of
conditional densities, f̂(x, y | z = z_0), rather than the slices, f̂(x, y, z_0)? If this were done,
all the surfaces in Figure 1.19 would be nearly identical. (Theoretically, the conditional
densities are all exactly N(0_2, I_2).) If the goal is to understand the 4-D density surface,
then the sequence of conditional densities overemphasizes the (visual) importance
of the tails and obscures information about the location of the “center” of the data.
Furthermore, as nonparametric estimates in the tail will be relatively noisy, the
estimates will be especially rough upon normalization (see Figure 1.20). For these
reasons, it seems best to look at slices and to reserve normalization for looking at
conditional densities that are particularly interesting.
FIGURE 1.18 Stereo representation of three α-contours of a trivariate normal density.
Gently crossing your eyes should allow the two frames to fuse in the middle. (Axes in
each frame: X, Y, Z.)
FIGURE 1.19 Sequence of bivariate slices of a trivariate smoothed histogram. (Recoverable
panel labels: z = −1.8, −1.2, −0.6, 0, 0.6, 1.2.)
FIGURE 1.20 Normalized slices in the left tail of the smoothed histogram. (Panel labels:
z = −3, −2.6, −2.2.)
Several trivariate contour surfaces of the same estimated density are displayed
in Figure 1.21. Clearly, the trivariate contours give an improved “big picture,” just
as a rotating trivariate scatter diagram improves on three static bivariate scatter
diagrams. The complete density estimate is a 4-D surface, and the trivariate contour view
in the final frame of Figure 1.21 may present only 3.5 dimensions, while the series
of bivariate slices may yield a bit more, perhaps 3.75 dimensions, but without the
visual impact. Examine the 3-D contour view for the Landsat data in the first frame
of Figure 7.8 in comparison to Figures 1.3 and 1.14. The structure is quite complex.
The presentation of clusters is stunning and shows multiple modes and multiple
clusters. This detailed structure is not apparent in the scatterplot in Figure 1.3.
Depending on the nature of the variables, slicing can be attempted with four-,
five-, or six-dimensional data. Of special importance is the 5-D surface generated by
4-D data, for example, space–time variables such as the Mount St. Helens data in
Figure 1.10. These higher dimensional estimates can be animated in a fashion similar
to Figure 1.19 (see Scott and Wilks (1990)).
In the 4-D case, the α-level contours of interest are based on the slices:

$$S_{\alpha,t} = \{(x,y,z) : f(x,y,z,t) = \alpha f_{\max}\},$$

where f_max is the global maximum over the 5-D surface. For a fixed choice of α,
as the slice value t changes continuously, the contour shells will expand or contract
smoothly, finally vanishing for extreme values of t. For example, a single theoretical
contour of the N(0,I4) density would vanish outside a symmetric interval around the
origin, but within that interval, the contour shell would be a sphere centered on the
origin with greatest diameter when t = 0. With several α-shells displayed simultane-
ously, the contours would be nested spheres of different radii, appearing at different
values of t, but of greatest diameter when t = 0.
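For the N(0, I_4) example, the behavior of the shell can be made explicit. Setting f(x, y, z, t) = α f_max gives x² + y² + z² = −2 log α − t², so the slice at t is a sphere of radius sqrt(−2 log α − t²) that exists only when |t| ≤ sqrt(−2 log α). The short sketch below evaluates that radius; the function name `shell_radius` is an illustrative choice.

```python
import math

def shell_radius(alpha, t):
    """Radius of the alpha-level contour of N(0, I_4) in the slice at t.

    Returns None when the slice at t does not intersect the contour.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    r_squared = -2.0 * math.log(alpha) - t * t
    return math.sqrt(r_squared) if r_squared >= 0.0 else None

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, shell_radius(0.5, t))
```

For α = 0.5 the radius is largest at t = 0 and the shell has vanished by t = 2, matching the verbal description of shells that contract and then disappear as |t| grows.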
One particularly interesting slice of the smoothed 5-D histogram estimate of the
entire Iris dataset is shown in Figure 1.22. The α = 4% contour surface reveals two
well-separated clusters. However, the α = 10% contour surface is trimodal, revealing
the true structure in this dataset even with only 150 points. The virginica and versicolor
data may not be separated in the point cloud but apparently can be separated in the
density cloud.
The 3-D contour slices in Figure 1.22 were assembled from a 2-D contouring algo-
rithm, then projected into the plane. The sequence of 2-D contour slices is shown in
Figure 1.23. Study these two diagrams and think about the possibilities for exploring
the entire five-dimensional surface.
To emphasize the potential value of additional variables, we conclude this vignette
by examining the Iris data excluding the sepal width variable. Figure 1.24 displays a
3-D scatterplot, as well as contours of the smoothed histogram at levels α = 0.17 and
α = 0.44. A little study supports the speculation that the data might contain a hybrid
species of the versicolor and virginica species. With such a small sample, that may
be an embellishment.
FIGURE 1.21 Trivariate normal examples.
With more than four variables, the most appropriate sequence of slicing is not
clear. With five variables, bivariate contours of (x4,x5) may be drawn; then a sequence
of trivariate slices may be examined tracing along one of these bivariate contours.
With more than five or six variables, deciding where to slice at all is a diffi-
cult problem because the number of possibilities grows exponentially. That is why
projection-based methods are so important (see Chapter 7).
1.4.3.1 Visualizing Multivariate Regression Functions The same graphical
representation can be applied to regression surfaces. However, the interpretation can
be more difficult. For example, if the regression surface is monotone, the α-level
contours of the surface will not be “closed” and will appear to “float” in space. If
the regression surface is a simple linear function such as ax + by + cz, then a set of
trivariate α-contours will simply be a set of parallel planes. Practical questions arise
that do not appear for density surfaces. In particular, what is the natural extent of the
regression surface; that is, for what region in the design space should the surface be
plotted? Perhaps one answer is to limit the plot to regions where there is sufficient
data, that is, where the density of design points is above a certain threshold.
FIGURE 1.22 Two α-level contour surfaces from a slice of a five-dimensional averaged
shifted histogram estimate, based on all 150 Iris data points. The displayed variables x, y, and
z are sepal length, petal length and width, respectively, with the sepal width variable sliced at
t = 3.4 cm. The (outer) darker α = 4% contour reveals only two clusters, while the (inner)
lighter α = 10% contour reveals the three clusters. (Labeled regions: setosa, versicolor,
virginica.)
FIGURE 1.23 A detailed breakdown of the 3-D contours shown in Figure 1.22 taken from
the ASH estimate f̂(x, y, z, t = 3.4) as the sepal length, x, ranges from 4.00 to 7.45 cm.
(Panels are labeled x = 4, 4.15, ..., 7.45, in steps of 0.15.)
FIGURE 1.24 Analysis of three of the four Iris variables, omitting sepal width entirely,
which should be compared to the slice shown in Figure 1.22. The middle contour (α = 0.17)
is superimposed upon the contour (α = 0.44) in the right frame to help locate the shells.
(Axes: Sepal.length, Petal.length, Petal.width.)
FIGURE 1.25 A portion of a bivariate contour at the α = 0 level of a smooth function
measured on a regular grid and using linear interpolation (dotted lines). (Grid signs
recoverable from the figure, by row: + + + + + / − + + + − / − − − − −.)
1.4.4 Overview of Contouring and Surface Display
Suppose that a general bivariate function f(x,y) (taking on positive and negative
values) is sampled on a regular grid, and the α = 0 contour S_0 is desired; that is,
S_0 = {(x,y) : f(x,y) = 0}. Label the values of the grid as +, 0, or − depending on
whether f > 0, f = 0, or f < 0, respectively. Then the desired contour is shown in
Figure 1.25. The piecewise linear approximation and the true contour do not match
along the bin boundaries since the interpolation is not exact.
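The linear interpolation step can be illustrated on a single grid edge. The sketch below is a generic illustration, not the algorithm of Section A.1: the function name `edge_crossing` is invented, and an edge whose two end values both equal the level is reported as having no single crossing point, which is a simplifying assumption.

```python
def edge_crossing(p0, f0, p1, f1, level=0.0):
    """Point on the segment p0-p1 where the linear interpolant of f equals `level`.

    Returns None if the end values lie strictly on the same side of the level,
    or if both end values equal the level (no single crossing point).
    """
    d0, d1 = f0 - level, f1 - level
    if d0 == d1 or d0 * d1 > 0:
        return None
    w = d0 / (d0 - d1)          # fraction of the way from p0 to p1
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))

# A grid edge from (0, 0) to (1, 0) with f = -1 at one end and f = 3 at the other.
print(edge_crossing((0.0, 0.0), -1.0, (1.0, 0.0), 3.0))
```

Joining such crossing points within each grid cell produces the piecewise linear contour; because f is generally not linear along an edge, the interpolated crossing only approximates the true one.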
However, bivariate contouring is not as simple a task as one might imagine.
Usually, the function is sampled on a rectangular mesh, with no gradient information
or possibility for further refinement of the mesh. If too coarse a mesh is chosen,
then small local bumps or dips may be missed, or two distinct contours at the same
level may be inadvertently joined. For speed and simplicity, one wants to avoid
having to do any global analysis before drawing contours. A local contouring algorithm
avoids multiple passes over the data. In any case, global analysis is based on certain
smoothness assumptions and may fail. The difficulties and details of contouring are
described more fully in Section A.1.
FIGURE 1.26 Simple stereo representation of four 3-D nested shells of the earthquake data.
There are several varieties of 3-D contouring algorithms. It is assumed that the
function has been sampled on a lattice, which can be taken to be cubical without loss
of generality. One simple trick is to display a set of 2-D contour slices that result
from intersecting the 3-D contour shell with a set of parallel planes along the lattice
of the data, as was done in Figures 1.18 and 1.22. In this representation, a single
spherical shell becomes a set of circular contours (Figure 1.26). This approach has
the advantage of providing a shell representation that is “transparent” so that multiple
α-level contours may be visualized. Different colors can be used for different
contour levels (see Scott (1983, 1984, 1991a), Scott and Thompson (1983), Härdle
and Scott (1988), and Scott and Hall (1989)).
More visually pleasing surfaces can be drawn using the marching cubes algorithm
(Lorensen and Cline, 1987). The overall contour surface is represented by a large
number of connected triangular planar sections, which are computed for each cubical
bin and then displayed. Depending on the pattern of signs on the eight vertices of each
cube in the data lattice, up to six triangular patches are drawn within each cube (see
Figure 1.27). In general, there are 2^8 = 256 cases (each corner of the cube being either above
or below the contour level). Taking into consideration certain symmetries reduces this
number. By scanning through all the cubes in the data lattice, a collection of triangles
is found that defines the contour shell. Each triangle has an inner and outer surface,
depending on the gradient of the density function. The inner and outer surfaces may
be distinguished by color shading. A convenient choice is various shades of red for
surfaces pointing toward regions of higher (hotter) density, and shades of blue toward
regions of lower (cooler) density; see the cover jacket of this book for an example.
Each contour is a patchwork of several thousand triangles. Smoother surfaces may be
obtained by using higher-order splines, but the underlying bin structure information
would be lost.
FIGURE 1.27 Examples of marching cube contouring algorithm. The corners with values
above the contour level are labeled with a + symbol.
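The first step of such a scan, classifying each cube by the signs at its eight corners, can be sketched as follows. Only the classification is shown; the lookup from case number to triangles is omitted. The function name and the particular bit ordering of the corners are assumptions made for this illustration and need not match the ordering used by Lorensen and Cline.

```python
import numpy as np

def cube_case_index(f, level):
    """Case number (0..255) for every cubical bin of a 3-D lattice f.

    Bit 4*di + 2*dj + dk is set when the corner at offset (di, dj, dk)
    of the cube has a value above `level`.
    """
    above = np.asarray(f) > level
    nx, ny, nz = above.shape
    index = np.zeros((nx - 1, ny - 1, nz - 1), dtype=np.uint8)
    bit = 0
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                corner = above[di:nx - 1 + di, dj:ny - 1 + dj, dk:nz - 1 + dk]
                index += corner.astype(np.uint8) * np.uint8(1 << bit)
                bit += 1
    return index

# A single cube with exactly one corner above the level.
f = np.zeros((2, 2, 2))
f[1, 1, 1] = 1.0
print(cube_case_index(f, 0.5))
```

Cubes whose case number is 0 or 255 lie entirely on one side of the contour level and contribute no triangles, so in this sketch only the remaining cubes would need to be passed on to the triangulation step.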
In summary, visualizing trivariate functions directly is a powerful adjunct to data
analysis. The gain of an additional dimension of visible structure without resort to
slices greatly improves the ability of a data analyst to perceive structure. The same
visualization applies to slices of density functions with more than three variables.
A demonstration tape that displays 4-D animation of Sα,t contours as α and t vary
is available (Scott and Wilks, 1990).
1.5 GEOMETRY OF HIGHER DIMENSIONS
The geometry of higher dimensions provides a few surprises. In this section, a few
standard figures are considered. This material is available in scattered references (see
Kendall (1961), for example).
1.5.1 Polar Coordinates in d Dimensions
In d dimensions, a point x can be expressed in spherical polar coordinates by a
radius r, a base angle θ_{d−1} ranging over (0, 2π), and d − 2 angles θ_1, ..., θ_{d−2} each
ranging over (−π/2, π/2) (see Figure 1.28). Let s_k = sin θ_k and c_k = cos θ_k. Then the
transformation back to Euclidean coordinates is given by

$$
\begin{aligned}
x_1 &= r\,c_1 c_2 \cdots c_{d-3}\, c_{d-2}\, c_{d-1} \\
x_2 &= r\,c_1 c_2 \cdots c_{d-3}\, c_{d-2}\, s_{d-1} \\
x_3 &= r\,c_1 c_2 \cdots c_{d-3}\, s_{d-2} \\
&\;\;\vdots \\
x_j &= r\,c_1 \cdots c_{d-j}\, s_{d-j+1} \\
&\;\;\vdots \\
x_d &= r\,s_1 .
\end{aligned}
$$

FIGURE 1.28 Polar coordinates (r, θ_1, θ_2) of a point P in ℝ^3. (Axes: x_1, x_2, x_3.)

After some work (see Problem 1.11), the Jacobian of this transformation may be
shown to be

$$J = r^{d-1}\, c_1^{d-2}\, c_2^{d-3} \cdots c_{d-2}. \tag{1.2}$$
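The transformation back to Euclidean coordinates can be checked numerically with a short sketch. The function name `polar_to_cartesian` and the convention of passing the angles as one array (θ_1, ..., θ_{d−1}) are illustrative assumptions.

```python
import numpy as np

def polar_to_cartesian(r, theta):
    """Map (r, theta_1, ..., theta_{d-1}) to Euclidean coordinates (x_1, ..., x_d)."""
    theta = np.asarray(theta, dtype=float)
    d = theta.size + 1
    c, s = np.cos(theta), np.sin(theta)
    x = np.empty(d)
    x[0] = r * np.prod(c)                       # x_1 = r c_1 ... c_{d-1}
    for j in range(2, d + 1):                   # x_j = r c_1 ... c_{d-j} s_{d-j+1}
        x[j - 1] = r * np.prod(c[:d - j]) * s[d - j]
    return x

# Four dimensions: the Euclidean norm of the result should equal r.
x = polar_to_cartesian(2.0, [0.3, -0.4, 1.1])
print(x, np.linalg.norm(x))
```

Recovering the radius from the norm is a quick consistency check on the indexing, since the squared coordinates telescope to r² through repeated use of c_k² + s_k² = 1.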
1.5.2 Content of Hypersphere
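The volume of the d-dimensional hypersphere {x : Σ_{i=1}^d x_i² ≤ a²} is given by

$$
\begin{aligned}
V_d(a) &= \int_{\sum_{i=1}^{d} x_i^2 \le a^2} 1 \, dx \\
&= \int_0^a dr \int_{-\pi/2}^{\pi/2} d\theta_1 \int_{-\pi/2}^{\pi/2} d\theta_2 \cdots \int_0^{2\pi} d\theta_{d-1} \; r^{d-1}\, c_1^{d-2}\, c_2^{d-3} \cdots c_{d-2} .
\end{aligned}
$$

This can be simplified using the identity

$$
\int_{-\pi/2}^{\pi/2} \cos^k\theta \, d\theta
= 2\int_0^{\pi/2} \cos^k\theta \, d\theta
= 2\int_{\theta=0}^{\pi/2} \cos^k\theta \, \frac{d(\cos^2\theta)}{-2\cos\theta\,\sin\theta},
$$

which, using the change of variables u = cos²θ,

$$
= \int_0^1 \frac{u^{k/2}\, du}{u^{1/2}(1-u)^{1/2}}
= B\!\left(\tfrac{1}{2}, \tfrac{k+1}{2}\right)
= \frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{k+1}{2}\right)}{\Gamma\!\left(\tfrac{k+2}{2}\right)} .
$$

As Γ(1/2) = √π,

$$
\begin{aligned}
V_d(a) &= 2\pi \cdot \frac{a^d}{d} \cdot
\frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{d-1}{2}\right)}{\Gamma\!\left(\tfrac{d}{2}\right)} \cdot
\frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{d-2}{2}\right)}{\Gamma\!\left(\tfrac{d-1}{2}\right)} \cdots
\frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma(1)}{\Gamma\!\left(\tfrac{3}{2}\right)} \\
&= \frac{a^d\, \pi^{d/2}}{\tfrac{d}{2}\,\Gamma\!\left(\tfrac{d}{2}\right)}
= \frac{a^d\, \pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2}+1\right)} .
\end{aligned}
\tag{1.3}
$$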
1.5.3 Some Interesting Consequences
1.5.3.1 Sphere Inscribed in Hypercube Consider the hypercube [−a,a]^d and an
inscribed hypersphere with radius r = a. Then using (1.3), the fraction of the volume
of the cube contained in the hypersphere is given by

$$
f_d = \frac{\text{Volume sphere}}{\text{Volume cube}}
= \frac{a^d\, \pi^{d/2} / \Gamma\!\left(\tfrac{d}{2}+1\right)}{(2a)^d}
= \frac{\pi^{d/2}}{2^d\, \Gamma\!\left(\tfrac{d}{2}+1\right)} .
$$
For lower dimensions, the fraction fd is as shown in Table 1.1. It is clear that the center
of the cube becomes less important. As the dimension increases, the volume of the
hypercube concentrates in its corners. This distortion of space (at least to our three-
dimensional way of thinking) has many potential consequences for data analysis.
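Both the volume formula (1.3) and the fraction f_d are easy to evaluate numerically. The sketch below uses only the Python standard library; the function names are illustrative. For d = 1, ..., 7 the printed values should agree with those reported in Table 1.1.

```python
import math

def hypersphere_volume(d, a=1.0):
    """Equation (1.3): a^d * pi^(d/2) / Gamma(d/2 + 1)."""
    return a ** d * math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def inscribed_fraction(d, a=1.0):
    """Share of the hypercube [-a, a]^d occupied by the inscribed hypersphere."""
    return hypersphere_volume(d, a) / (2.0 * a) ** d

for d in range(1, 8):
    print(d, round(inscribed_fraction(d), 3))
```

The fraction does not depend on a, since both volumes scale as a^d; the argument is kept only to mirror the notation of (1.3).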
1.5.3.2 Hypervolume of a Thin Shell Wegman (1990) demonstrates the distortion
of space in another setting. Consider two spheres centered on the origin, one with
radius r and the other with slightly smaller radius r − ε. Consider the fraction of the
volume of the larger sphere in between the spheres. By Equation (1.3),

$$
\frac{V_d(r) - V_d(r-\epsilon)}{V_d(r)}
= \frac{r^d - (r-\epsilon)^d}{r^d}
= 1 - \left(1 - \frac{\epsilon}{r}\right)^{d}
\;\xrightarrow[\,d \to \infty\,]{}\; 1 .
$$
Hence, virtually all of the content of a hypersphere is concentrated close to its surface,
which is only a (d − 1)-dimensional manifold. Thus for data distributed uniformly
over both the hypersphere and the hypercube, most of the data fall near the boundary
and edges of the volume. Most statistical techniques exhibit peculiar behavior if the
data fall in a lower dimensional subspace. This example illustrates one important
aspect of the curse of dimensionality, which is discussed in Chapter 7.
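A few numbers make the concentration near the surface concrete. The sketch below evaluates the shell fraction 1 − (1 − ε/r)^d; the function name `shell_fraction` and the choice ε = 0.05 with r = 1 are illustrative assumptions.

```python
def shell_fraction(d, eps, r=1.0):
    """Fraction of a d-dimensional ball of radius r lying within eps of its surface."""
    if not 0.0 <= eps <= r:
        raise ValueError("eps must lie in [0, r]")
    return 1.0 - (1.0 - eps / r) ** d

for d in (1, 2, 10, 50, 100):
    print(d, shell_fraction(d, eps=0.05))
```

With a shell only 5% of the radius thick, the fraction is 0.05 in one dimension but exceeds 0.99 by d = 100.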
TABLE 1.1 Fraction of the Volume of a Hypercube Lying in the Inscribed Hypersphere

Dimension (d)          1      2      3      4      5      6      7
Fraction volume (f_d)  1      0.785  0.524  0.308  0.164  0.081  0.037
Random documents with unrelated
content Scribd suggests to you:
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
back
Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.
More than just a book-buying platform, we strive to be a bridge
connecting you with timeless cultural and intellectual values. With an
elegant, user-friendly interface and a smart search system, you can
quickly find the books that best suit your interests. Additionally,
our special promotions and home delivery services help you save time
and fully enjoy the joy of reading.
Join us on a journey of knowledge exploration, passion nurturing, and
personal growth every day!
ebookbell.com

More Related Content

PDF
Biomedical Image Understanding Methods And Applications 1st Edition Joohwee Lim
PDF
Risk Assessment In Geotechnical Engineering Gordon A Fenton And D V Griffiths
PDF
Techniques And Methods In Urban Remote Sensing Qihao Weng
PDF
Analytics And Dynamic Customer Strategy Big Profits From Big Data 1st Edition...
PDF
Largescale Distributed Systems And Energy Efficiency A Holistic View 1st Edit...
PDF
Critical Infrastructure Protection In Homeland Security Defending A Networked...
PDF
Time Series Analysis Fourth Edition George E P Box Gwilym M Jenkins
DOCX
QUANTITATIVEINVESTMENTANALYSISSecond Edition.docx
Biomedical Image Understanding Methods And Applications 1st Edition Joohwee Lim
Risk Assessment In Geotechnical Engineering Gordon A Fenton And D V Griffiths
Techniques And Methods In Urban Remote Sensing Qihao Weng
Analytics And Dynamic Customer Strategy Big Profits From Big Data 1st Edition...
Largescale Distributed Systems And Energy Efficiency A Holistic View 1st Edit...
Critical Infrastructure Protection In Homeland Security Defending A Networked...
Time Series Analysis Fourth Edition George E P Box Gwilym M Jenkins
QUANTITATIVEINVESTMENTANALYSISSecond Edition.docx

Multivariate Density Estimation Theory Practice And Visualization 2nd Edition David W Scott

  • 6. “9780471697558pre” — 2015/2/11 — 17:32 — page vi — #6
  • 7. MULTIVARIATE DENSITY ESTIMATION
  • 8. WILEY SERIES IN PROBABILITY AND STATISTICS. Established by WALTER A. SHEWHART and SAMUEL S. WILKS. Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg. Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels. A complete list of the titles in this series appears at the end of this volume.
  • 9. MULTIVARIATE DENSITY ESTIMATION: Theory, Practice, and Visualization. Second Edition. DAVID W. SCOTT, Rice University, Houston, Texas
  • 10. Copyright © 2015 by John Wiley & Sons, Inc. All rights reserved. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data: Scott, David W., 1950– Multivariate density estimation : theory, practice, and visualization / David W. Scott. – Second edition. pages cm Includes bibliographical references and index. ISBN 978-0-471-69755-8 (cloth) 1. Estimation theory. 2. Multivariate analysis. I. Title. QA276.8.S28 2014 519.535–dc23 2014043897 Set in 10/12pts Times Lt Std by SPi Publisher Services, Pondicherry, India Printed in the United States of America 10 9 8 7 6 5 4 3 2 1 1 2015
  • 11. To Jean, Hilary, Elizabeth, Warren, and my parents, John and Nancy Scott
  • 13. CONTENTS
      PREFACE TO SECOND EDITION
      PREFACE TO FIRST EDITION
      1 Representation and Geometry of Multivariate Data
        1.1 Introduction
        1.2 Historical Perspective
        1.3 Graphical Display of Multivariate Data Points
          1.3.1 Multivariate Scatter Diagrams
          1.3.2 Chernoff Faces
          1.3.3 Andrews’ Curves and Parallel Coordinate Curves
          1.3.4 Limitations
        1.4 Graphical Display of Multivariate Functionals
          1.4.1 Scatterplot Smoothing by Density Function
          1.4.2 Scatterplot Smoothing by Regression Function
          1.4.3 Visualization of Multivariate Functions
            1.4.3.1 Visualizing Multivariate Regression Functions
          1.4.4 Overview of Contouring and Surface Display
        1.5 Geometry of Higher Dimensions
          1.5.1 Polar Coordinates in d Dimensions
          1.5.2 Content of Hypersphere
          1.5.3 Some Interesting Consequences
            1.5.3.1 Sphere Inscribed in Hypercube
            1.5.3.2 Hypervolume of a Thin Shell
            1.5.3.3 Tail Probabilities of Multivariate Normal
  • 14. CONTENTS (continued)
            1.5.3.4 Diagonals in Hyperspace
            1.5.3.5 Data Aggregate Around Shell
            1.5.3.6 Nearest Neighbor Distances
        Problems
      2 Nonparametric Estimation Criteria
        2.1 Estimation of the Cumulative Distribution Function
        2.2 Direct Nonparametric Estimation of the Density
        2.3 Error Criteria for Density Estimates
          2.3.1 MISE for Parametric Estimators
            2.3.1.1 Uniform Density Example
            2.3.1.2 General Parametric MISE Method with Gaussian Application
          2.3.2 The L1 Criterion
            2.3.2.1 L1 versus L2
            2.3.2.2 Three Useful Properties of the L1 Criterion
          2.3.3 Data-Based Parametric Estimation Criteria
        2.4 Nonparametric Families of Distributions
          2.4.1 Pearson Family of Distributions
          2.4.2 When Is an Estimator Nonparametric?
        Problems
      3 Histograms: Theory and Practice
        3.1 Sturges’ Rule for Histogram Bin-Width Selection
        3.2 The L2 Theory of Univariate Histograms
          3.2.1 Pointwise Mean Squared Error and Consistency
          3.2.2 Global L2 Histogram Error
          3.2.3 Normal Density Reference Rule
            3.2.3.1 Comparison of Bandwidth Rules
            3.2.3.2 Adjustments for Skewness and Kurtosis
          3.2.4 Equivalent Sample Sizes
          3.2.5 Sensitivity of MISE to Bin Width
            3.2.5.1 Asymptotic Case
            3.2.5.2 Large-Sample and Small-Sample Simulations
          3.2.6 Exact MISE versus Asymptotic MISE
            3.2.6.1 Normal Density
            3.2.6.2 Lognormal Density
          3.2.7 Influence of Bin Edge Location on MISE
            3.2.7.1 General Case
            3.2.7.2 Boundary Discontinuities in the Density
          3.2.8 Optimally Adaptive Histogram Meshes
            3.2.8.1 Bounds on MISE Improvement for Adaptive Histograms
            3.2.8.2 Some Optimal Meshes
  • 15. CONTENTS (continued)
            3.2.8.3 Null Space of Adaptive Densities
            3.2.8.4 Percentile Meshes or Adaptive Histograms with Equal Bin Counts
            3.2.8.5 Using Adaptive Meshes versus Transformation
            3.2.8.6 Remarks
        3.3 Practical Data-Based Bin Width Rules
          3.3.1 Oversmoothed Bin Widths
            3.3.1.1 Lower Bounds on the Number of Bins
            3.3.1.2 Upper Bounds on Bin Widths
          3.3.2 Biased and Unbiased CV
            3.3.2.1 Biased CV
            3.3.2.2 Unbiased CV
            3.3.2.3 End Problems with BCV and UCV
            3.3.2.4 Applications
        3.4 L2 Theory for Multivariate Histograms
          3.4.1 Curse of Dimensionality
          3.4.2 A Special Case: d = 2 with Nonzero Correlation
          3.4.3 Optimal Regular Bivariate Meshes
        3.5 Modes and Bumps in a Histogram
          3.5.1 Properties of Histogram “Modes”
          3.5.2 Noise in Optimal Histograms
          3.5.3 Optimal Histogram Bandwidths for Modes
          3.5.4 A Useful Bimodal Mixture Density
        3.6 Other Error Criteria: L1, L4, L6, L8, and L∞
          3.6.1 Optimal L1 Histograms
          3.6.2 Other Lp Criteria
        Problems
      4 Frequency Polygons
        4.1 Univariate Frequency Polygons
          4.1.1 Mean Integrated Squared Error
          4.1.2 Practical FP Bin Width Rules
          4.1.3 Optimally Adaptive Meshes
          4.1.4 Modes and Bumps in a Frequency Polygon
        4.2 Multivariate Frequency Polygons
        4.3 Bin Edge Problems
        4.4 Other Modifications of Histograms
          4.4.1 Bin Count Adjustments
            4.4.1.1 Linear Binning
            4.4.1.2 Adjusting FP Bin Counts to Match Histogram Areas
          4.4.2 Polynomial Histograms
          4.4.3 How Much Information Is There in a Few Bins?
        Problems
  • 16. CONTENTS (continued)
      5 Averaged Shifted Histograms
        5.1 Construction
        5.2 Asymptotic Properties
        5.3 The Limiting ASH as a Kernel Estimator
        Problems
      6 Kernel Density Estimators
        6.1 Motivation for Kernel Estimators
          6.1.1 Numerical Analysis and Finite Differences
          6.1.2 Smoothing by Convolution
          6.1.3 Orthogonal Series Approximations
        6.2 Theoretical Properties: Univariate Case
          6.2.1 MISE Analysis
          6.2.2 Estimation of Derivatives
          6.2.3 Choice of Kernel
            6.2.3.1 Higher Order Kernels
            6.2.3.2 Optimal Kernels
            6.2.3.3 Equivalent Kernels
            6.2.3.4 Higher Order Kernels and Kernel Design
            6.2.3.5 Boundary Kernels
        6.3 Theoretical Properties: Multivariate Case
          6.3.1 Product Kernels
          6.3.2 General Multivariate Kernel MISE
          6.3.3 Boundary Kernels for Irregular Regions
        6.4 Generality of the Kernel Method
          6.4.1 Delta Methods
          6.4.2 General Kernel Theorem
            6.4.2.1 Proof of General Kernel Result
            6.4.2.2 Characterization of a Nonparametric Estimator
            6.4.2.3 Equivalent Kernels of Parametric Estimators
        6.5 Cross-Validation
          6.5.1 Univariate Data
            6.5.1.1 Early Efforts in Bandwidth Selection
            6.5.1.2 Oversmoothing
            6.5.1.3 Unbiased and Biased Cross-Validation
            6.5.1.4 Bootstrapping Cross-Validation
            6.5.1.5 Faster Rates and PI Cross-Validation
            6.5.1.6 Constrained Oversmoothing
          6.5.2 Multivariate Data
            6.5.2.1 Multivariate Cross-Validation
            6.5.2.2 Multivariate Oversmoothing Bandwidths
            6.5.2.3 Asymptotics of Multivariate Cross-Validation
        6.6 Adaptive Smoothing
          6.6.1 Variable Kernel Introduction
  • 17. CONTENTS (continued)
          6.6.2 Univariate Adaptive Smoothing
            6.6.2.1 Bounds on Improvement
            6.6.2.2 Nearest-Neighbor Estimators
            6.6.2.3 Sample-Point Adaptive Estimators
            6.6.2.4 Data Sharpening
          6.6.3 Multivariate Adaptive Procedures
            6.6.3.1 Pointwise Adapting
            6.6.3.2 Global Adapting
          6.6.4 Practical Adaptive Algorithms
            6.6.4.1 Zero-Bias Bandwidths for Tail Estimation
            6.6.4.2 UCV for Adaptive Estimators
        6.7 Aspects of Computation
          6.7.1 Finite Kernel Support and Rounding of Data
          6.7.2 Convolution and Fourier Transforms
            6.7.2.1 Application to Kernel Density Estimators
            6.7.2.2 FFTs
            6.7.2.3 Discussion
        6.8 Summary
        Problems
      7 The Curse of Dimensionality and Dimension Reduction
        7.1 Introduction
        7.2 Curse of Dimensionality
          7.2.1 Equivalent Sample Sizes
          7.2.2 Multivariate L1 Kernel Error
          7.2.3 Examples and Discussion
        7.3 Dimension Reduction
          7.3.1 Principal Components
          7.3.2 Projection Pursuit
          7.3.3 Informative Components Analysis
          7.3.4 Model-Based Nonlinear Projection
        Problems
      8 Nonparametric Regression and Additive Models
        8.1 Nonparametric Kernel Regression
          8.1.1 The Nadaraya–Watson Estimator
          8.1.2 Local Least-Squares Polynomial Estimators
            8.1.2.1 Local Constant Fitting
            8.1.2.2 Local Polynomial Fitting
          8.1.3 Pointwise Mean Squared Error
          8.1.4 Bandwidth Selection
          8.1.5 Adaptive Smoothing
        8.2 General Linear Nonparametric Estimation
          8.2.1 Local Polynomial Regression
  • 18. CONTENTS (continued)
          8.2.2 Spline Smoothing
          8.2.3 Equivalent Kernels
        8.3 Robustness
          8.3.1 Resistant Estimators
          8.3.2 Modal Regression
          8.3.3 L1 Regression
        8.4 Regression in Several Dimensions
          8.4.1 Kernel Smoothing and WARPing
          8.4.2 Additive Modeling
          8.4.3 The Curse of Dimensionality
        8.5 Summary
        Problems
      9 Other Applications
        9.1 Classification, Discrimination, and Likelihood Ratios
        9.2 Modes and Bump Hunting
          9.2.1 Confidence Intervals
          9.2.2 Oversmoothing for Derivatives
          9.2.3 Critical Bandwidth Testing
          9.2.4 Clustering via Mixture Models and Modes
            9.2.4.1 Gaussian Mixture Modeling
            9.2.4.2 Modes for Clustering
        9.3 Specialized Topics
          9.3.1 Bootstrapping
          9.3.2 Confidence Intervals
          9.3.3 Survival Analysis
          9.3.4 High-Dimensional Holes
          9.3.5 Image Enhancement
          9.3.6 Nonparametric Inference
          9.3.7 Final Vignettes
            9.3.7.1 Principal Curves and Density Ridges
            9.3.7.2 Time Series Data
            9.3.7.3 Inverse Problems and Deconvolution
            9.3.7.4 Densities on the Sphere
        Problems
      APPENDIX A Computer Graphics in ℝ³
        A.1 Bivariate and Trivariate Contouring Display
          A.1.1 Bivariate Contouring
          A.1.2 Trivariate Contouring
        A.2 Drawing 3-D Objects on the Computer
  • 19. CONTENTS (continued)
      APPENDIX B Datasets
        B.1 US Economic Variables Dataset
        B.2 University Dataset
        B.3 Blood Fat Concentration Dataset
        B.4 Penny Thickness Dataset
        B.5 Gas Meter Accuracy Dataset
        B.6 Old Faithful Dataset
        B.7 Silica Dataset
        B.8 LRL Dataset
        B.9 Buffalo Snowfall Dataset
      APPENDIX C Notation and Abbreviations
        C.1 General Mathematical and Probability Notation
        C.2 Density Abbreviations
        C.3 Error Measure Abbreviations
        C.4 Smoothing Parameter Abbreviations
      REFERENCES
      AUTHOR INDEX
      SUBJECT INDEX
  • 21. PREFACE TO SECOND EDITION The past 25 years have seen confirmation of the importance of density estimation and nonparametric methods in modern data analysis, in this era of “big data.” This updated version retains its focus on fostering an intuitive understanding of the underlying methodology and supporting theory. I have sought to retain as much of the original material as possible and, in particular, the point of view of its development from the histogram. In every chapter, new material has been added to highlight challenges presented by massive datasets, or to clarify theoretical opportunities and new algorithms. However, no claim to comprehensive coverage is professed. I have benefitted greatly from interactions with a number of gifted doctoral students who worked in this field (Lynette Factor, Donna Nezames, Rod Jee, Ferdie Wang, Michael Minnotte, Steve Sain, Keith Baggerly, John Salch, Will Wojciechowski, H.-G. Sung, Alena Oetting, Galen Papkov, Eric Chi, Jonathan Lane, Justin Silver, Jaime Ramos, and Yeshaya Adler); their work is represented here. In addition, contributions were made by many students taking my courses. I would also like to thank my colleagues and collaborators, especially my co-advisor Jim Thompson and my frequent co-authors George Terrell (VPI), Bill Szewczyk (DoD) and Masahiko Sagae (Kanazawa University). They have made the lifetime of learning, teaching, and discovery especially delightful and satisfying. I especially wish to acknowledge the able help of Robert Kosar in assembling the final versions of the color figures and reviewing new material. Not a few mistakes have been corrected. For example, the constant in the expression for the asymptotic mean integrated squared error for the multivariate histogram in Theorem 3.5 is now correct.
The content of Tables 3.6 and 3.7 has been modified accordingly, and the effect of dimension on sample size is seen to be even more dramatic in the corrected version. Any mistakes remain the responsibility of the
  • 22. author, who would appreciate hearing of such. All will be recorded in an appropriate repository. Steve Quigley of John Wiley & Sons was infinitely patient awaiting this second edition until his retirement, and Kathryn Sharples completed the project. Steve made a freshly minted LaTeX version available as a starting point. All figures in S-Plus have been re-engineered into R. Figures in color or using color have been transformed to gray scale for the printed version, but the original figures will also be available in the same repository. In the original edition, I also neglected to properly acknowledge the generous support of the ARO (DAAL-03-88-G-0074 through my colleague James Thompson) and the ONR (N00014-90-J-1176). As with the original edition, this revision would not have been possible without the tireless and enthusiastic support of my wife, Jean, and family. Thanks for everything. David W. Scott Houston, Texas August, 2014
  • 23. PREFACE TO FIRST EDITION With the revolution in computing in recent years, access to data of unprecedented complexity has become commonplace. More variables are being measured, and the sheer volume of data is growing. At the same time, advancements in the performance of graphical workstations have given new power to the data analyst. With these changes has come an increasing demand for tools that can detect and summarize the multivariate structure in difficult data. Density estimation is now recognized as a tool useful with univariate and bivariate data; my purpose is to demonstrate that it is also a powerful tool in higher dimensions, with particular emphasis on trivariate and quadrivariate data. I have written this book for the reader interested in the theoretical aspects of nonparametric estimation as well as for the reader interested in the application of these methods to multivariate data. It is my hope that the book can serve as an introductory textbook and also as a general reference. I have chosen to introduce major ideas in the context of the classical histogram, which remains the most widely applied and most intuitive nonparametric estimator. I have found it instructive to develop the links between the histogram and more statistically efficient methods. This approach greatly simplifies the treatment of advanced estimators, as much of the novelty of the theoretical context has been moved to the familiar histogram setting. The nonparametric world is more complex than its parametric counterpart. I have selected material that is representative of the broad spectrum of theoretical results available, with an eye on the potential user, based on my assessments of usefulness, prevalence, and tutorial value. Theory particularly relevant to application or understanding is covered, but a loose standard of rigor is adopted in order to emphasize the methodological and application topics.
Rather than present a cookbook of techniques, I have adopted a hierarchical approach that emphasizes the similarities among the
  • 24. different estimators. I have tried to present new ideas and practical advice, together with numerous examples and problems, with a graphical emphasis. Visualization is a key aspect of effective multivariate nonparametric analysis, and I have attempted to provide a wide array of graphic illustrations. All of the figures in this book were composed using S, S-PLUS, Exponent Graphics from IMSL, and Mathematica. The color plates were derived from S-based software. The color graphics with transparency were composed by displaying the S output using the MinneView program developed at the Minnesota Geometry Project and printed on hardware under development by the 3M Corporation. I have not included a great deal of computer code. A collection of software, primarily Fortran-based with interfaces to the S language, is available by electronic mail at scottdw@rice.edu. Comments and other feedback are welcomed. I would like to thank many colleagues for their generous support over the past 20 years, particularly Jim Thompson, Richard Tapia, and Tony Gorry. I have especially drawn on my collaboration with George Terrell, and I gratefully acknowledge his major contributions and influence in this book. The initial support for the high-dimensional graphics came from Richard Heydorn of NASA. This work has been generously supported by the Office of Naval Research under grant N00014-90-J-1176 as well as the Army Research Office. Allan Wilks collaborated on the creation of many of the color figures while we were visiting the Geometry Project, directed by Al Marden and assisted by Charlie Gunn, at the Minnesota Supercomputer Center. I have taught much of this material in graduate courses not only at Rice but also during a summer course in 1985 at Stanford and during an ASA short course in 1986 in Chicago with Bernard Silverman.
Previous Rice students Lynette Factor, Donna Nezames, Rod Jee, and Ferdie Wang all made contributions through their theses. I am especially grateful for the able assistance given during the final phases of preparation by Tim Dunne and Keith Baggerly, as well as Steve Sain, Monnie McGee, and Michael Minnotte. Many colleagues have influenced this work, including Edward Wegman, Dan Carr, Grace Wahba, Wolfgang Härdle, Matthew Wand, Simon Sheather, Steve Marron, Peter Hall, Robert Launer, Yasuo Amemiya, Nils Hjort, Linda Davis, Bernhard Flury, Will Gersch, Charles Taylor, Imke Janssen, Steve Boswell, I.J. Good, Iain Johnstone, Ingram Olkin, Jerry Friedman, David Donoho, Leo Breiman, Naomi Altman, Mark Matthews, Tim Hesterberg, Hal Stern, Michael Trosset, Richard Byrd, John Bennett, Heinz-Peter Schmidt, Manny Parzen, and Michael Tarter. Finally, this book could not have been written without the patience and encouragement of my family. David W. Scott Houston, Texas February, 1992
  • 25. 1 REPRESENTATION AND GEOMETRY OF MULTIVARIATE DATA [Multivariate Density Estimation, First Edition. David W. Scott. © 2015 John Wiley & Sons, Inc. Published 2015 by John Wiley & Sons, Inc.] A complete analysis of multidimensional data requires the application of an array of statistical tools: parametric, nonparametric, and graphical. Parametric analysis is the most powerful. Nonparametric analysis is the most flexible. And graphical analysis provides the vehicle for discovering the unexpected. This chapter introduces some graphical tools for visualizing structure in multidimensional data. One set of tools focuses on depicting the data points themselves, while another set of tools relies on displaying functions estimated from those points. Visualization and contouring of functions in more than two dimensions is introduced. Some mathematical aspects of the geometry of higher dimensions are reviewed. These results have consequences for nonparametric data analysis. 1.1 INTRODUCTION Classical linear multivariate statistical models rely primarily on analysis of the covariance matrix. So powerful are these techniques that analysis is almost routine for datasets with hundreds of variables. While the theoretical basis of parametric models lies with the multivariate normal density, these models are applied in practice to many kinds of data. Parametric studies provide neat inferential summaries and parsimonious representation of the data. For many problems second-order information is inadequate. Advanced modeling or simple variable transformations may provide a solution. When no simple
  • 26. “9780471697558c01” — 2015/2/25 — 16:16 — page 2 — #2 2 REPRESENTATION AND GEOMETRY OF MULTIVARIATE DATA parametric model is forthcoming, many researchers have opted for fully “unpara- metric” methods that may be loosely collected under the heading of exploratory data analysis. Such analyses are highly graphical; but in a complex non-normal setting, a graph may provide a more concise representation than a parametric model, because a parametric model of adequate complexity may involve hundreds of parameters. There are some significant differences between parametric and nonparametric modeling. The focus on optimality in parametric modeling does not translate well to the nonparametric world. For example, the histogram might be proved to be an inadmissible estimator, but that theoretical fact should not be taken to suggest his- tograms should not be used. Quite to the contrary, some methods that are theoretically superior are almost never used in practice. The reason is that the ordering of algo- rithms is not absolute, but is dependent not only on the unknown density but also on the sample size. Thus the histogram is generally superior for small samples regard- less of its asymptotic properties. The exploratory school is at the other extreme, rejecting probabilistic models, whose existence provides the framework for defining optimality. In this book, an intermediate point of view is adopted regarding statistical effi- cacy. No nonparametric estimate is considered wrong; only different components of the solution are emphasized. Much effort will be devoted to the data-based calibra- tion problem, but nonparametric estimates can be reasonably calibrated in practice without too much difficulty. The “curse of optimality” might suggest that this is an illogical point of view. However, if the notion that optimality is all important is adopted, then the focus becomes matching the theoretical properties of an estimator to the assumed properties of the density function. 
Is it a gross inefficiency to use a procedure that requires only two continuous derivatives when the curve in fact has six continuous derivatives? This attitude may have some formal basis but should be discouraged as too heavy-handed for nonparametric thinking. A more relaxed attitude is required. Furthermore, many “optimal” nonparametric procedures are unstable in a manner that slightly inefficient procedures are not. In practice, when faced with the application of a procedure that requires six derivatives, or some other assumption that cannot be proved in practice, it is more important to be able to recognize the signs of estimator failure than to worry too much about assumptions. Detecting failure at the level of a discontinuous fourth derivative is a bit extreme, but certainly the effects of simple discontinuities should be well understood. Thus only for the purposes of illustration are the best assumptions given.
The notions of efficiency and admissibility are related to the choice of a criterion, which can only imperfectly measure the quality of a nonparametric estimate. Unlike optimal parametric estimates that are useful for many purposes, nonparametric estimates must be optimized for each application. The extra work is justified by the extra flexibility. As the choice of criterion is imperfect, so then is the notion of a single optimal estimator. This attitude reflects not sloppy thinking, but rather the imperfect relationship between the practical and theoretical aspects of our methods. Too rigid a point of view leads one to a minimax view of the world where nonparametric methods should be abandoned because there exist difficult problems.
  • 27. Visualization is an important component of nonparametric data analysis. Data visualization is the focus of exploratory methods, ranging from simple scatterplots to sophisticated dynamic interactive displays. Function visualization is a significant component of nonparametric function estimation, and can draw on the relevant literature in the fields of scientific visualization and computer graphics. The focus of multivariate data analysis on points and scatterplots has meant that the full impact of scientific visualization has not yet been realized. With the new emphasis on smooth functions estimated nonparametrically, the fruits of visualization will be attained. Banchoff (1986) has been a pioneer in the visualization of higher dimensional mathematical surfaces. Curiously, the surfaces of interest to mathematicians contain singularities and discontinuities, all producing striking pictures when projected to the plane. In statistics, visualization of the smooth density surface in four, five, and six dimensions cannot rely on projection, as projections of smooth surfaces to the plane show nothing. Instead, the emphasis is on contouring in three dimensions and slicing of surfaces beyond. The focus on three and four dimensions is natural because one and two are so well understood. Beyond four dimensions, the ability to explore surfaces carefully decreases rapidly due to the curse of dimensionality. Fortunately, statistical data seldom display structure in more than five dimensions, so guided projection to those dimensions may be adequate. It is these threshold dimensions from three to five that are and deserve to be the focus of our visualization efforts.
There is a natural flow among the parametric, exploratory, and nonparametric procedures that represents a rational approach to statistical data analysis.
Begin with a fully exploratory point of view in order to obtain an overview of the data. If a probabilistic structure is present, estimate that structure nonparametrically and explore it visually. Finally, if a linear model appears adequate, adopt a fully parametric approach. Each step conceptually represents a willingness to more strongly smooth the raw data, finally reducing the dimension of the solution to a handful of interesting parameters. With the assumption of normality, the mind’s eye can easily imagine the d-dimensional egg-shaped elliptical data clusters. Some statisticians may prefer to work in the reverse order, progressing to exploratory methodology as a diagnostic tool for evaluating the adequacy of a parametric model fit.
There are many excellent references that complement and expand on this subject. In exploratory data analysis, references include Tukey (1977), Tukey and Tukey (1981), Cleveland and McGill (1988), and Wang (1978).
In density estimation, the classic texts of Tapia and Thompson (1978), Wertz (1978), and Thompson and Tapia (1990) first indicated the power of the nonparametric approach for univariate and bivariate data. Silverman (1986) has provided a further look at applications in this setting. Prakasa Rao (1983) has provided a theoretical survey with a lengthy bibliography. Other texts are more specialized, some focusing on regression (Müller, 1988; Härdle, 1990), some on a specific error criterion (Devroye and Györfi, 1985; Devroye, 1987), and some on particular solution classes such as splines (Eubank, 1988; Wahba, 1990). A discussion of additive models may be found in Hastie and Tibshirani (1990).
  • 28. 1.2 HISTORICAL PERSPECTIVE
One of the roots of modern statistical thought can be traced to the empirical discovery of correlation by Galton in 1886 (Stigler, 1986). Galton’s ideas quickly reached Karl Pearson. Although best remembered for his methodological contributions such as goodness-of-fit tests, frequency curves, and biometry, Pearson was a strong proponent of the geometrical representation of statistics. In a series of lectures a century ago in November 1891 at Gresham College in London, Pearson spoke on a wide-ranging set of topics (Pearson, 1938). He discussed the foundations of the science of pure statistics and its many divisions. He discussed the collection of observations. He described the classification and representation of data using both numerical and geometrical descriptors. Finally, he emphasized statistical methodology and discovery of statistical laws. The syllabus for his lecture of November 11, 1891, includes this cryptic note:
Erroneous opinion that Geometry is only a means of popular representation: it is a fundamental method of investigating and analysing statistical material. (his italics)
In that lecture Pearson described 10 methods of geometrical data representation. The most familiar is a representation “by columns,” which he called the “histogram.” (Pearson is usually given credit for coining the word “histogram” later in an 1894 paper.)
Other familiar-sounding names include “diagrams,” “chartograms,” “topograms,” and “stereograms.” Unfamiliar names include “stigmograms,” “euthygrams,” “epipedograms,” “radiograms,” and “hormograms.”
Beginning 21 years later, Fisher advanced the numerically descriptive portion of statistics with the method of maximum likelihood, from which he progressed on to the analysis of variance and other contributions that focused on the optimal use of data in parametric modeling and inference. In Statistical Methods for Research Workers, Fisher (1932) devotes a chapter titled “Diagrams” to graphical tools. He begins the chapter with this statement:
The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitute for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them.
An emphasis on optimization and the efficiency of statistical procedures has been a hallmark of mathematical statistics ever since. Ironically, Fisher was criticized by mathematical statisticians for relying too heavily upon geometrical arguments in proofs of his results.
Modern statistics has experienced a strong resurgence of geometrical and graphical statistics in the form of exploratory data analysis (Tukey, 1977). Given the parametric emphasis on optimization, the more relaxed philosophy of exploratory data analysis has been refreshing. The revolution has been fueled by the low cost of graphical workstations and microcomputers. These machines have enabled current work on statistics in motion (Scott, 1990), that is, the use of animation and kinematic display
  • 29. for visualization of data structure, statistical analysis, and algorithm performance. No longer are static displays sufficient for comprehensive analysis.
All of these events were anticipated by Pearson and his visionary statistical computing laboratory. In his lecture of April 14, 1891, titled “The Geometry of Motion,” he spoke of the “ultimate elements of sensations we represent as motions in space and time.” In 1918, after his many efforts during World War I, he reminisced about the excitement created by wartime work of his statistical laboratory:
The work has been so urgent and of such value that the Ministry of Munitions has placed eight to ten computers and draughtsmen at my disposal ... (Pearson, 1938, p. 165).
These workers produced hundreds of statistical graphs, ranging from detailed maps of worker availability across England (chartograms) to figures for sighting antiaircraft guns (diagrams). The use of stereograms allowed for representation of data with three variables. His “computers,” of course, were not electronic but human. Later, Fisher would be frustrated because Pearson would not agree to allocate his “computers” to the task of tabulating percentiles of the t-distribution. But Pearson’s capabilities for producing high-quality graphics were far superior to those of most modern statisticians prior to 1980. Given Pearson’s joint interests in graphics and kinematics, it is tantalizing to speculate on how he would have utilized modern computers.
1.3 GRAPHICAL DISPLAY OF MULTIVARIATE DATA POINTS
The modern challenge in data analysis is to be able to cope with whatever complexities may be intrinsic to the data. The data may, for example, be strongly non-normal, fall onto a nonlinear subspace, exhibit multiple modes, or be asymmetric.
Dealing with these features becomes exponentially more difficult as the dimensionality of the data increases, a phenomenon known as the curse of dimensionality. In fact, datasets with hundreds of variables and millions of observations are routinely compiled that exhibit all of these features. Examples abound in such diverse fields as remote sensing, the US Census, geological exploration, speech recognition, and medical research. The expense of collecting and managing these large datasets is often so great that no funds are left for serious data analysis. The role of statistics is clear, but too often no statisticians are involved in large projects and no creative statistical thinking is applied. The goal of statistical data analysis is to extract the maximum information from the data, and to present a product that is as accurate and as useful as possible.
1.3.1 Multivariate Scatter Diagrams
The presentation of multivariate data is often accomplished in tabular form, particularly for small datasets with named or labeled objects. For example, Table B.1 contains economic data spanning the depression years of the 1930s, and Table B.2 contains information on a selected sample of American universities. It is easy enough to scan an individual column in these tables, to make comparisons of library size,
  • 30. for example, and to draw conclusions one variable at a time (see Tufte (1983) and Wang (1978)). However, variable-by-variable examination of multivariate data can be overwhelming and tiring, and cannot reveal any relationships among the variables. Looking at all pairwise scatterplots provides an improvement (Chambers et al., 1983). Data on four variables of three species of Iris are displayed in Figure 1.1. (A listing of the Fisher–Anderson Iris data, one of the few familiar four-dimensional datasets, may be found in several references and is provided with the S package (Becker et al., 1988).)
What multivariate structure is apparent from this figure? The setosa variety does not overlap the other two varieties. The versicolor and virginica varieties are not as well separated, although a close examination reveals that they are almost nonoverlapping. If the 150 observations were unlabeled and plotted with the same symbol, it is likely that only two clusters would be observed. Even if it were known a priori that there were three clusters, it would still be unlikely that all three clusters would be properly identified. These alternative presentations reflect the two related problems of discrimination and clustering, respectively.
If the observations from different categories overlap substantially or have different sample sizes, scatter diagrams become much more difficult to interpret properly. The data in Figure 1.2 come from a study of 371 males suffering from chest pain (Scott et al., 1978): 320 had demonstrated coronary artery disease (occlusion or narrowing of the heart’s own arteries) while 51 had none (see Table B.3). The blood fat concentrations of plasma cholesterol and triglyceride are predictive of heart disease, although the correlation is low.
It is difficult to estimate the predictive power of these variables in this setting solely from the scatter diagram. A nonparametric analysis will reveal some interesting nonlinear interactions (see Chapters 5 and 9).
FIGURE 1.1 Pairwise scatter diagrams of the Iris data with the three species labeled. 1, setosa; 2, versicolor; 3, virginica. (Panel axes: sepal length, sepal width, petal length, petal width.)
An easily overlooked practical aspect of scatter diagrams is illustrated by these data, which are integer valued. To avoid problems of overplotting, the data have been jittered or blurred (Chambers et al., 1983); that is, uniform U(−0.5,0.5) noise is
  • 31. added to each element of the original data. This trick should be regularly employed for data recorded with three or fewer significant digits (with an appropriate range on the added uniform noise). Jittering reduces visual miscues that result from the vertical and horizontal synchronization of regularly spaced data.
FIGURE 1.2 Scatter diagrams of blood lipid concentrations for 320 diseased and 51 nondiseased males. (Panels: no disease, n = 51; with disease, n = 320. Axes: cholesterol (mg/dl), triglyceride (mg/dl).)
The visual perception system can easily be overwhelmed if the number of points is more than several thousand. Figure 1.3 displays three pairwise scatterplots derived from measurements taken in 1977 by the Landsat remote sensing system over a 5 mile by 6 mile agricultural region in North Dakota with n = 22,932 = 117 × 196 pixels or picture elements, each corresponding to an area approximately 1.1 acres in size (Scott and Thompson, 1983; Scott and Jee, 1984). The Landsat instrument measures the intensity of light in four spectral bands reflected from the surface of the earth. A principal components transformation gives two variables that are commonly referred to as the “brightness” and “greenness” of each pixel. Every pixel is measured at regular intervals of approximately 3 weeks. During the summer of 1977, six useful replications were obtained, giving 24 measurements on each pixel. Using an agronometric growth model for crops, Badhwar et al. (1982) nonlinearly transformed this 24-dimensional data to three dimensions. Badhwar described these synthetic variables, (x1,x2,x3), as (1) the calendar time at which peak greenness is observed, (2) the length of crop ripening, and (3) the peak greenness value, respectively. The scatter diagrams in Figure 1.3 have also been enhanced by jittering, as the raw data are integers between (0,255). The use of integers allows compression to eight bits of computer memory. Only structure in the boundary and tails is readily seen. The overplotting problem is apparent and the blackened areas include over 95% of the data. Other techniques to enhance scatter diagrams are needed to see structure in the bulk of the data cloud, such as plotting random subsets (see Tukey and Tukey (1981)).
Pairwise scatter diagrams lack one important property necessary for identifying more than two-dimensional features: strong interplot linkage among the plots. In
  • 32. principle, it should be possible to locate the same point in each figure, assuming the data are free of ties. But it is not practical to do so for samples of any size. For quadrivariate data, Diaconis and Friedman (1983) proposed drawing lines between corresponding points in the scatterplots of (x1,x2) and (x3,x4) (see Problem 1.2). But a more powerful dynamic technique that takes full advantage of computer graphics has been developed by several research groups (McDonald, 1982; Becker and Cleveland, 1987; see the many references in Cleveland and McGill, 1988). The method is called brushing or painting a scatterplot matrix. Using a pointing device such as a mouse, a subset of the points in one scatter diagram is selected and the corresponding points are simultaneously highlighted in the other scatter diagrams. Conceptually, a subset of points in $\mathbb{R}^d$ is tagged, for example, by painting the points red or making the points blink synchronously, and that characteristic is inherited by the linked points in all the “linked” graphs, including not only scatterplots but also histograms and regression plots. The Iris example in Figure 1.1 illustrates the flavor of brushing with three tags. Usually the color of points is changed rather than the symbol type. Brushing is an excellent tool for identifying outliers and following well-defined clusters. It is well-suited for conditioning on some variable, for example, 1 < x3 < 3.
FIGURE 1.3 Pairwise scatter diagram of transformed Landsat data from 22,932 pixels over a 5 by 6 nautical mile region. The range on all the axes is (0, 255). (Axes: peak time, peak width, peak value.)
These ideas are illustrated in Figure 1.4 for the PRIM4 dataset (Friedman and Tukey, 1974; the data summarize 500 high-energy particle physics scattering experiments) provided in the S language.
Using the brushing tool in S-PLUS (1990), the left cluster in the 1–2 scatterplot was brushed, and then the left cluster in the 2–4 scatterplot was brushed with a different symbol. Try to imagine linking the clusters throughout the scatterplot matrix without any highlighting.
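
The two scatterplot enhancements described in this section, jittering integer-valued data and brushing a subset that is then highlighted in every linked panel, can be sketched numerically. The following is only an illustration with NumPy, not the S-PLUS or ggobi tooling mentioned in the text; the function names `jitter` and `brush`, the made-up sample, and the choice of brushing interval are all assumptions.

```python
import numpy as np


def jitter(data, half_width=0.5, rng=None):
    """Add independent U(-half_width, half_width) noise to every element."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    return data + rng.uniform(-half_width, half_width, size=data.shape)


def brush(data, column, low, high):
    """Tag rows whose value in `column` lies strictly between low and high."""
    values = np.asarray(data)[:, column]
    return (values > low) & (values < high)


rng = np.random.default_rng(0)

# Made-up integer-valued sample: 200 observations on 4 variables.
raw = rng.integers(0, 10, size=(200, 4))

# Jittered copy used only for plotting; the raw integers stay untouched.
jittered = jitter(raw, rng=rng)

# Brush on the third variable (column index 2): tag rows with 1 < x3 < 3.
tags = brush(raw, column=2, low=1, high=3)

# The same boolean tags select the highlighted points in every pairwise panel.
panels = {
    (i, j): jittered[tags][:, [i, j]]
    for i in range(4)
    for j in range(i + 1, 4)
}
```

The design point is that the tag is a property of the observation (one boolean per row), not of any single panel, which is what lets the highlighting carry over to all linked displays.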
  • 33. FIGURE 1.4 Pairwise scatterplots of the transformed PRIM4s data using the ggobi visualization system. Two clumps of points are highlighted by brushing.
There are limitations to the brushing technique. The number of pairwise scatterplots is $\binom{d}{2} = d(d-1)/2$ (already 45 panels for d = 10), so viewing more than 5 or 10 variables at once is impractical. Furthermore, the physical size of each scatter diagram is reduced as more variables are added, so that fewer distinct data points can be plotted. If there are more than a few variables, the eye cannot follow many of the dynamic changes in the pattern of points during brushing, except with the simplest of structure. It is, however, an open question as to the number of dimensions of structure that can be perceived by this method of linkage. Brushing remains an important and well-used tool that has proven successful in real data analysis.
If a 2-D array of bivariate scatter diagrams is useful, then why not construct a 3-D array of trivariate scatter diagrams? Navigating the collection of $\binom{d}{3}$ trivariate scatterplots is difficult even with modest values of d. But a single 3-D scatterplot can easily be rotated in real time with significant perceptual gain compared to three bivariate diagrams in the scatterplot matrix. Many statistical packages now provide this capability. The program MacSpin (Donoho et al., 1988) was the first widely used software of this type. The top middle panel in Figure 1.4 displays a particular orientation of a rotating 3-D scatterplot. The kinds of structure available in 3-D data are more complex (and hence more interesting) than in 2-D data.
Furthermore, the overplotting problem is reduced as more data points can be resolved in a rotating 3-D scatterplot than in a static 2-D view (although this is resolution dependent: a 2-D view printed by a laser device can display significantly more points than is possible on a computer monitor). Density information is still relatively difficult to perceive, however, and the sample size definitely influences perception.
Beyond three dimensions, many novel ideas are being pursued (see Tukey and Tukey (1981)). Six-dimensional data could be viewed with two rotating 3-D scatter diagrams linked by brushing. Carr and Nicholson (1988) have actively pursued using stereography as an alternative and adjunct to rotation. Some workers report
  • 34. that stereo viewing of static data can be more precise than viewing dynamic rotation alone. Unfortunately, many individuals suffer from color blindness and various depth perception limitations, rendering some techniques useless. Nevertheless, it is clear that there is no limit to the possible combinations of ideas one might consider implementing. Such efforts can easily take many months to program without any fancy interface. This state of affairs would be discouraging but for the fact that a LISP-based system for easily prototyping such ideas is now available using object-oriented concepts (see Tierney (1990)). RStudio has made the shiny app available for this purpose as well: see http://shiny.rstudio.com. A collection of articles is devoted to the general topic of animation (Cleveland and McGill, 1988).
The idea of displaying 2- or 3-D arrays of 2- or 3-D scatter diagrams is perhaps too closely tied to the Euclidean coordinate system. It might be better to examine many 2- or 3-D projections of the data. An orderly way to do approximately just that is the “grand tour” discussed by Asimov (1985). Let P be a d × 2 projection matrix, which takes the d-dimensional data down to a plane. The author proposed examining a sequence of scatterplots obtained by a smoothly changing sequence of projection matrices. The resulting kinematic display shows the n data points moving in a continuous (and sometimes seemingly random) fashion. It may be hoped that most interesting projections will be displayed at some point during the first several minutes of the grand tour, although for even 10 variables several hours may be required (Huber, 1985).
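
To make the idea of a smoothly changing sequence of d × 2 projection matrices concrete, here is a rough sketch. It is not Asimov's construction: it simply interpolates between randomly chosen orthonormal frames and re-orthonormalizes each blend with a QR factorization, which is an assumption made for brevity. The names `orthonormalize` and `tour_frames`, the step counts, and the data are hypothetical.

```python
import numpy as np


def orthonormalize(m):
    """Orthonormalize the columns of m by QR, fixing the signs so that the
    result varies continuously with m (as long as m has full column rank)."""
    q, r = np.linalg.qr(m)
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs


def tour_frames(d, n_targets=3, steps=25, seed=0):
    """Yield d x 2 projection matrices that move gradually between
    randomly chosen target planes."""
    rng = np.random.default_rng(seed)
    current = orthonormalize(rng.standard_normal((d, 2)))
    for _ in range(n_targets):
        target = orthonormalize(rng.standard_normal((d, 2)))
        for step in range(1, steps + 1):
            w = step / steps
            # Blend the two frames, then restore orthonormal columns.
            frame = orthonormalize((1 - w) * current + w * target)
            yield frame
        current = frame


# Made-up data: n = 100 points in d = 5 dimensions.
data = np.random.default_rng(1).standard_normal((100, 5))

# Each view is an n x 2 array, i.e. one scatterplot in the kinematic sequence.
views = [data @ p for p in tour_frames(d=5)]
```

Displaying the successive `views` as animation frames gives the moving point cloud described above. A random walk over target planes like this one gives no guarantee of covering all projections in reasonable time, which is consistent with the caution about tour length quoted from Huber (1985).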
Special attention should be drawn to representing multivariate data in the bivariate scatter diagram with points replaced by glyphs, which are special symbols whose shapes are determined by the remaining data variables (x3,...,xd). Figure 1.5 displays the Iris data in such a form following Carr et al. (1986). The length and angle of the glyph are determined by the sepal length and width, respectively. Careful examination of the glyphs shows that there is no gap in 4-D between the versicolor and virginica species, as the angles and lengths of the glyphs are similar near the boundary.
FIGURE 1.5 Glyph scatter diagram of the Iris data. (Axes: petal length, petal width. Glyph (length, angle) = (sepal length, sepal width). Species shown: setosa, versicolor, virginica.)
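
A glyph of the kind just described, a ray whose length and angle encode two extra variables, reduces to computing one line segment per observation. The sketch below shows that mapping under assumed scaling choices (lengths between a quarter of `max_len` and `max_len`, angles between 0 and π); these ranges, the function names, and the sample values are illustrative assumptions rather than the settings of Carr et al. (1986).

```python
import numpy as np


def rescale(values, low, high):
    """Map values linearly onto [low, high] (midpoint if all values are equal)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.full_like(values, (low + high) / 2)
    return low + (high - low) * (values - values.min()) / span


def ray_glyphs(x, y, length_var, angle_var, max_len=0.3):
    """Return one row (x0, y0, x1, y1) per observation: a ray anchored at
    (x, y) whose length and angle encode two further variables."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    length = rescale(length_var, 0.25 * max_len, max_len)  # assumed length range
    angle = rescale(angle_var, 0.0, np.pi)                 # assumed angle range
    x_end = x + length * np.cos(angle)
    y_end = y + length * np.sin(angle)
    return np.column_stack([x, y, x_end, y_end])


# Made-up measurements in Iris-like units (not taken from the dataset).
petal_length = [1.5, 4.5, 6.0]
petal_width = [0.2, 1.4, 2.4]
sepal_length = [5.0, 6.0, 7.0]
sepal_width = [3.5, 2.8, 3.1]

segments = ray_glyphs(petal_length, petal_width, sepal_length, sepal_width)
```

Each row of `segments` can be handed to any line-drawing routine. Swapping which variables drive position, length, and angle gives a different picture of the same data, so the assignment is itself a modeling choice.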
  • 35. FIGURE 1.6 A three-dimensional scatter diagram of the Fisher–Anderson Iris data, omitting the sepal length variable. From left to right, the 50 points for each of the three varieties of setosa, versicolor, and virginica are distinguished by symbol type (square, diamond, triangle), respectively. The symbol is required to indicate the presence of three clusters rather than only two. The same basic picture results from any choice of three variables from the full set of four variables. (Axes: petal length, sepal width, petal width.)
A second glyph representation shown in Figure 1.6 is a 3-D scatterplot omitting sepal length, one of the four variables. This figure clearly depicts the structure in these data. Plotting glyphs in 3-D scatter diagrams with stereography is a more powerful visual tool (Carr and Nicholson, 1988). The glyph technique does not treat variables “symmetrically” and all variable–glyph combinations could be considered. This complaint affects most multivariate procedures (with a few exceptions).
All of these techniques are an outgrowth of a powerful system devised to analyze data in up to nine dimensions called PRIM-9 (Fisherkeller et al., 1974; reprinted in Cleveland and McGill, 1988). The PRIM-9 system contained many of the capabilities of current systems. The letters are an acronym for “Picturing, Rotation, Isolation, and Masking.” The latter two serve to identify and select subsets of the multivariate data. The “picturing” feature was implemented by pressing two buttons that cycled through all of the $\binom{9}{2} = 36$ pairwise scatter diagrams in current coordinates. An IBM 360 mainframe was specially modified to drive the custom display system.
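
The panel counts that recur in this section, d choose 2 pairwise scatterplots, d choose 3 trivariate ones, and the pairwise views PRIM-9 cycled through for nine variables, are easy to check directly. The helper name `panels` below is an illustrative assumption; only the Python standard library is used.

```python
from itertools import combinations
from math import comb


def panels(d, k):
    """All k-variable views of d variables, as tuples of 1-based indices."""
    return list(combinations(range(1, d + 1), k))


# Number of bivariate and trivariate panels for a few dimensions.
for d in (4, 9, 10):
    print(d, comb(d, 2), comb(d, 3))

# The pairwise views for nine variables, in the order (1, 2), (1, 3), ...
prim9_views = panels(9, 2)
```

The growth is quadratic for pairs and cubic for triples, which is the arithmetic behind the remark that viewing more than 5 or 10 variables at once becomes impractical.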
1.3.2 Chernoff Faces
Chernoff (1973) proposed a special glyph that associates variables to facial features, such as the size and shape of the eyes, nose, mouth, hair, ears, chin, and facial outline. Certainly, humans are able to discriminate among nearly identical faces very well. Chernoff has suggested that most other multivariate point methods “seem to be
  • 36. less valuable in producing an emotional response” (Wang, 1978, p. 6). Whether an emotional response is desired is debatable. Chernoff faces for the time series dataset in Table B.1 are displayed in Figure 1.7. (The variable–feature associations are listed in the table.) By carefully studying an individual facial feature such as the smile over the sequence of all the faces, simple trends can be recognized. But it is the overall multivariate impression that makes Chernoff faces so powerful. Variables should be carefully assigned to features. For example, Chernoff faces of the colleges’ data in Table B.2 might logically assign variables relating to the library to the eyes rather than to the mouth (see Problem 1.3). Such subjective judgments should not prejudice our use of this procedure.
FIGURE 1.7 Chernoff faces of the economic dataset spanning 1925–1939. (One face per year.)
One early application not in a statistics journal was constructed by Hiebert-Dodd (1982), who had examined the performance of several optimization algorithms on a suite of test problems. She reported that several referees felt this method of presentation was too frivolous. Comparing the endless tables in the paper as it appeared to the Chernoff faces displayed in the original technical report, one might easily conclude the referees were too cautious. On the other hand, when Rice University administrators were shown Chernoff faces of the colleges’ dataset, they were quite open to its suggestions and enjoyed the exercise. The practical fact is that repetitious viewing of large tables of data is tedious and haphazard, and broad-brush displays such as faces can significantly improve data digestion.
Several researchers have noted that Chernoff faces contain redundant information because of symmetry.
Flury and Riedwyl (1981) have proposed using asymmetrical faces, as did Turner and Tidmore (1980), although Chernoff has stated he believes the additional gain does not justify such nonrealistic figures.
1.3.3 Andrews’ Curves and Parallel Coordinate Curves
Three intriguing proposals display not the data points themselves but rather a unique curve determined by the data vector x. Andrews (1972) proposed representing
  • 37. high-dimensional data by replacing each point in $\mathbb{R}^d$ with a curve $s(t)$ for $|t| < \pi$, where
$$s(t \mid x_1,\ldots,x_d) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots,$$
the so-called Fourier series representation. This mapping provides the first “complete” continuous view of high-dimensional points on the plane, because, in principle, the original multivariate data point can be recovered from this curve. Clearly, an Andrews’ curve is dominated by the variables placed on the low-frequency terms, so care should be taken to put the most interesting variables early in the expansion (see Problem 1.4).
A simple graphical device that treats the d variables symmetrically is the star diagram, which is discussed by Fienberg (1979). The d axes are drawn as spokes on a wheel. The coordinate data values are plotted on those axes and connected as shown in Figure 1.8.
FIGURE 1.8 Star diagram for 4 years of the economic dataset shown in Figure 1.7. (Panels: 1929, 1930, 1931, 1932.)
Another novel multivariate approach that treats variables in a symmetric fashion is the parallel coordinates plot, introduced by Inselberg (1985) in a mathematical setting and extended by Wegman (1990) to the analysis of stochastic data. Cartesian coordinates are abandoned in favor of d axes drawn parallel and equally spaced. Each multivariate point $x \in \mathbb{R}^d$ is plotted as a piecewise linear curve connecting the d points on the parallel axes. For reasons shown by Inselberg and Wegman, there are advantages to simply drawing piecewise linear line segments, rather than a smoother line such as a spline. The disadvantage of this choice is that points that have identical values in any coordinate dimension cannot be distinguished in parallel coordinates. However, with this choice a duality may be deduced between points and lines in Euclidean and parallel coordinates.
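
The point-line duality can be checked with a few lines of arithmetic. Assume, purely for illustration, that the two parallel axes are drawn vertically at horizontal positions 0 and 1, so the point (x1, x2) becomes the segment from (0, x1) to (1, x2), whose height at position u is (1 − u)·x1 + u·x2. For points on the line x2 = m·x1 + b this height equals x1·(1 − u + u·m) + u·b, which no longer depends on x1 when u = 1/(1 − m). The function names and the sample line below are hypothetical.

```python
import numpy as np


def segment_height(x1, x2, u):
    """Height at horizontal position u of the segment joining (0, x1) on the
    first axis to (1, x2) on the second axis."""
    return (1 - u) * x1 + u * x2


def dual_point(slope, intercept):
    """Common point (u, v) of the segments for all points on
    x2 = slope * x1 + intercept; undefined when slope == 1."""
    if np.isclose(slope, 1.0):
        raise ValueError("slope 1 gives parallel segments: no common point")
    u = 1.0 / (1.0 - slope)
    return u, intercept * u


def line_from_dual(u, v):
    """Invert dual_point: recover (slope, intercept) from the common point."""
    return 1.0 - 1.0 / u, v / u


# Six made-up points on the line x2 = 1.5 - x1 (negative slope).
x1 = np.linspace(0.2, 1.2, 6)
x2 = 1.5 - x1

u, v = dual_point(slope=-1.0, intercept=1.5)
heights = segment_height(x1, x2, u)  # all six segments pass through (u, v)
```

Under these assumptions a negative slope gives 0 < u < 1, so the common point lies between the axes, while a positive slope puts it outside them (for example, `dual_point(0.5, 0.0)` has u = 2). `line_from_dual` shows how the intersection point determines the original line.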
In the left frame of Figure 1.9, six points that fall on a straight line with negative slope are plotted. The right frame shows those same points in parallel coordinates. Thus a scatter diagram of highly correlated normal points displays a nearly common point of intersection in parallel coordinates. However, if the correlation is positive, that point is not “between” the parallel axes (see Problem 1.6). The location of the point where the lines all intersect can be used to recover the equation of the line back in Euclidean coordinates (see Problem 1.8). A variety of other properties with potential applications are explored by Inselberg and Wegman. One result is a graphical means of deciding if a point $x \in \mathbb{R}^d$ is on the
inside or the outside of a convex closed hypersurface. If all the points on the hypersurface are plotted in parallel coordinates, then a well-defined geometrical outline will appear on the plane. If a portion of the line segments defining the point x in parallel coordinates falls outside the outline, then x is not inside the hypersurface, and vice versa.

FIGURE 1.9 Example of duality of points and lines between Euclidean and parallel coordinates. The points are labeled 1 to 6 in both coordinate systems.

One of the more fascinating extensions developed by Wegman is a grand tour of all variables displayed in parallel coordinates. The advantage of parallel coordinates is that all d of the rotating variables are visible simultaneously, whereas in the usual presentation, only two of the grand tour variables are visible in a bivariate scatterplot.

Figure 1.10 displays parallel coordinate plots of the Iris and earthquake data. The earthquake dataset represents the epicenters of 473 tremors beneath the Mount St. Helens volcano in the several months preceding its March 1982 eruption (Weaver et al., 1983). Clearly, the tremors are mostly small in magnitude, increasing in frequency over time, and clustered near the surface, although depth is clearly a bimodal variable. The longitude and latitude variables are least effective on this plot, because their natural spatial structure is lost.

1.3.4 Limitations

Tools such as Chernoff faces and scatter diagram glyphs tend to be most valuable with small datasets where individual points are “identifiable” or interesting. Such individualistic exploratory tools can easily generate “too much ink” (Tufte, 1983) and produce figures with black splotches, which convey little information.
Parallel coordinates and Andrews’ curves generate much ink. One obvious remedy is to plot
only a subset of the data in a process known as “thinning.” However, plotting random subsets no longer makes optimal use of all the data and does not result in precisely reproducible interpretations. Point-oriented methods typically have a range of sample sizes that is most appropriate: n < 200 for faces; n < 2000 for scatter diagrams.

FIGURE 1.10 Parallel coordinate plot of the earthquake dataset. [Axes shown: Sepal.length, Sepal.width, Petal.length, Petal.width; Longitude, Latitude, Depth, Day, Intensity.]

Since none of these displays is truly d-dimensional, each has limitations. All pairwise scatterplots can detect distinct clusters and some two-dimensional structure (if perhaps in a rotated coordinate system). In the latter case, an interactive supplement such as brushing may be necessary to confirm the nature of the links among the scatterplots (not really providing any higher dimensional information). On the positive side, variables are treated symmetrically in the scatterplot matrix. But many different and highly dissimilar d-dimensional datasets can give rise to visually similar scatterplot matrix diagrams; hence the need for brushing. However, with increasing number of variables, individual scatterplots physically decrease in size and fill up with ink ever faster. Scatter diagrams provide a highly subjective view of data, with poor density perception and greatest emphasis on the tails of the data.
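Looking back at the point-line duality of Section 1.3.3, a small numerical check may help. The sketch below assumes the two parallel axes sit at horizontal positions u = 0 and u = 1 (a convention chosen here, not stated in the text), so that the point (x1, x2) becomes the segment joining (0, x1) and (1, x2). Under that assumption, all points on the line x2 = m x1 + b with m ≠ 1 pass through the common point u = 1/(1 − m), y = b/(1 − m). The function names are hypothetical.

```python
def parallel_segment(x1, x2):
    """Return y(u), the line through (0, x1) and (1, x2) in parallel coordinates."""
    return lambda u: x1 + u * (x2 - x1)

def dual_point(m, b):
    """Common intersection (u, y) of all segments from points on x2 = m*x1 + b.

    Requires m != 1; for m == 1 the segments are parallel and never meet.
    """
    u = 1.0 / (1.0 - m)
    return u, b * u

# Six points on a line with negative slope, loosely in the spirit of Figure 1.9.
m, b = -0.5, 1.5
u_star, y_star = dual_point(m, b)
heights = [parallel_segment(x1, m * x1 + b)(u_star)
           for x1 in (0.0, 0.3, 0.6, 0.9, 1.2, 1.5)]
```

With m < 0 the value of u lies strictly between 0 and 1, so the intersection falls between the axes; with 0 < m < 1 it lies to the right of the second axis, which agrees with the remark about positive correlation.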
1.4 GRAPHICAL DISPLAY OF MULTIVARIATE FUNCTIONALS

1.4.1 Scatterplot Smoothing by Density Function

As graphical exploratory tools, each of the point-based procedures has significant value. However, each suffers from the problem of too much ink, as the number of objects (and hence the amount of ink) is linear in the sample size n. To mix metaphors, point-based graphs cannot provide a consistent picture of the data as n → ∞. As Scott and Thompson (1983) wrote, the scatter diagram points to the bivariate density function. In other words, the raw data points need to be smoothed if a consistent view is to be obtained.

A histogram is the simplest example of a scatterplot smoother. The amount of smoothness is controlled by the bin width. For univariate data, the histogram with bin width narrower than $\min |x_i - x_j|$ is precisely a univariate scatter diagram plotted with glyphs that are tall, thin rectangles. For bivariate data, the glyph is a beam with a square base. Increasing the bin width, the histogram represents a count per unit area, which is precisely the unit of a probability density. In Chapter 3, the histogram will be shown to provide a consistent estimate of the density function in any dimension.

Histograms can provide a wealth of information for large datasets, even well-known ones. For example, consider the 1979–1981 decennial life table published by the U.S. Bureau of the Census (1987). Certain relevant summary statistics are well-known: life expectancy, infant mortality, and certain conditional life expectancies. But what additional information can be gleaned by examining the mortality histogram itself? In Figure 1.11, the histogram of age of death for individuals is depicted. Not surprisingly, the histogram is skewed with a short tail for older ages. Not as well-known perhaps is the observation that the most common age of death is 85!
The absolute and relative magnitude of mortality in the first year of life is made strikingly clear. Careful examination reveals two other general features of interest. The first feature is the small but prominent bump in the curve between the ages of 13 and 27 years. This “excess mortality” is due to an increase in a variety of risky activities, the most notable being obtaining a driver’s license. In the right frame of Figure 1.11, comparison of the 1959–1961 (Gross and Clark, 1975) and 1979–1981 histograms shows an impressive reduction of death in all preadolescent years. Particularly striking is the 60% decline in mortality in the first year and the 3-year difference in the locations of the modes.

These facts are remarkable when placed in the context of the mortality histogram constructed by John Graunt from the Bills of Mortality during the plague years. Graunt (1662) estimated that 36% of individuals died before attaining their sixth birthday! Graunt was a contemporary of the better-known William Petty, to whom some credit for these ideas is variously ascribed, probably without cause. The circumstantial evidence that Graunt actually invented the histogram while looking at these mortality data seems quite strong, although there is reason to infer that Galileo had used
histogram-like diagrams earlier. Hald (1990) recounts a portion of Galileo’s Dialogo, published in 1632, in which Galileo summarized his observations on the star that appeared in 1572. According to Hald, Galileo noted the symmetry of the “observation errors” and the more frequent occurrence of small errors than large errors. Both points suggest Galileo had constructed a frequency diagram to draw those conclusions.

FIGURE 1.11 Histogram of the U.S. mortality data in 1960. Rootgrams (histograms plotted on a square-root scale) of the mortality data for 1960, 1980, and 1997. [Axes: age of death versus number per 100,000, and age of death versus square root of number per 100,000.]

Many large datasets are in fact collected in binned or histogram form. For example, elementary particles in high-energy physics scattering experiments are manifested by small bumps in the frequency curve. Good and Gaskins (1980) considered such a large dataset (n = 25,752) from the Lawrence Radiation Laboratory (LRL) (see Figure 1.12). The authors devised an ingenious algorithm for estimating the odds that a bump observed in the frequency curve was real. This topic is covered in Chapter 9.

Multivariate scatterplot smoothing of time series data is also easily accomplished with histograms. Consider a univariate time series and smooth both the raw data $\{x_t\}$ as well as the lagged data $\{x_t, x_{t+1}\}$. Any strong elliptical structure present in the smoothed lagged-data diagram provides a graphical version of the first-order autocorrelation coefficient. Consider the Old Faithful geyser dataset listed in Table B.6. These data are the durations in minutes of 107 eruptions of the Old Faithful geyser (Weisberg, 1985).
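Before turning to the details of that dataset, the lagged-pair construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the bins are square with a common width h, and the toy series is invented to alternate between “short” and “long” values; it is not the Old Faithful data of Table B.6.

```python
def lagged_pairs(series):
    """Return the consecutive pairs (x_t, x_{t+1}) of a univariate series."""
    return list(zip(series[:-1], series[1:]))

def bivariate_histogram(pairs, origin, h, nbins):
    """Bin pairs on an nbins x nbins grid of square bins of width h.

    Returns density values, that is, counts divided by (n * h**2), so the
    result is a count per unit area scaled to integrate to one.
    """
    n = len(pairs)
    counts = [[0] * nbins for _ in range(nbins)]
    for x, y in pairs:
        i = int((x - origin) // h)
        j = int((y - origin) // h)
        if 0 <= i < nbins and 0 <= j < nbins:
            counts[i][j] += 1
    return [[c / (n * h * h) for c in row] for row in counts]

# Invented toy series (NOT the Old Faithful data): short ~2, long ~4.5.
series = [2.0, 4.5, 1.9, 4.4, 4.6, 2.1, 4.3, 4.7, 1.8, 4.5]
pairs = lagged_pairs(series)
density = bivariate_histogram(pairs, origin=1.0, h=1.5, nbins=3)
```

In this toy series a short value is never followed by another short value, so the bin for the short-short sequence stays empty while the other three corner bins are occupied.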
As there was a gap in the recording of data between midnight and 6 a.m., there are only 99 pairs $\{x_t, x_{t+1}\}$ available. The univariate histogram in Figure 1.13 reveals a simple bimodal structure: short and long eruption durations. The most notable feature in the bivariate (smoothed) histogram is the missing fourth bump corresponding to the short-short duration sequence. Clearly, graphs of $\hat{f}(x_{t+1} \mid x_t)$ would be useful for improved prediction compared to a regression estimate.

For more than two dimensions, only slices are available for viewing with histogram surfaces. Consider the Landsat data again. Divide the (jittered) data into four pieces using quartiles of $x_1$, which is the time of peak greenness. Examining a series of
bivariate pictures of $(x_2, x_3)$ for each quartile slice provides a crude approximation of the four-dimensional surface $\hat{f}(x_1, x_2, x_3)$ (see Figure 1.14). The histograms are all constructed on the subinterval [−5,100] × [−5,100]. Compare this representation of the Landsat data to that in Figure 1.3. From Figure 1.3, it is clear that most of the outliers are in the last quartile of $x_1$. How well can the relative density levels be determined from the scatter diagrams? Visualization of a smoothed histogram of these data will be considered in Section 1.4.3.

FIGURE 1.12 Histogram of LRL dataset. [Axes: MeV versus bin count.]

FIGURE 1.13 Histogram of $\{x_t\}$ for the Old Faithful geyser dataset, and a bivariate histogram of the lagged data $(x_t, x_{t+1})$.

1.4.2 Scatterplot Smoothing by Regression Function

The term scatterplot smoother is most often applied to regression data. For bivariate data, either a nonparametric regression line can be superimposed upon the data, or the points themselves can be moved toward the regression line. Tukey (1977) presents
the “3R” smoother as an example of the latter. Suppose that the n data points, $\{x_t\}$, are measured on a fixed time scale. The 3R smoothing algorithm replaces each point $x_t$ with the median of the three points $\{x_{t-1}, x_t, x_{t+1}\}$ recursively until no changes occur. This algorithm is a powerful filter that removes isolated outliers effectively. The 3R smoother may be applied to unequally spaced data or repeated data. Tukey also proposes applying a Hanning filter, by which

$$\tilde{x}_t \leftarrow 0.25 \times (x_{t-1} + 2x_t + x_{t+1}).$$

This filter may be applied several times as necessary. In Figure 1.15, the Tukey smoother (S function smooth) is applied to the gas flow dataset given in Table B.5. Observe how the single potential outlier at x = 187 is totally ignored. The least-squares fit is shown for reference.

FIGURE 1.14 Bivariate histogram slices of the trivariate Landsat data. Slicing was performed at the quartiles of variable $x_1$.

The simplest nonparametric regression estimator is the regressogram. The x-axis is binned and the sample averages of the responses are computed and plotted over the intervals. The regressogram for the gas flow dataset is also shown in Figure 1.15. The Hanning filter and regressogram are special cases of nonparametric kernel regression, which is discussed in Chapter 8.

The gas flow dataset is part of a larger collection taken at seven different pressures. A stick-pin plot of the complete dataset is shown in Figure 1.16 (the 74.6 psia data are second from the right). Clearly, the accuracy is affected by the flow rate, while the effect of psia seems small. These data will be revisited in Chapter 8.

1.4.3 Visualization of Multivariate Functions

Visualization of functions of more than two variables has not been common in statistics.
The Landsat example in Figure 1.14 hints at the potential that visualization of 4-D surfaces would bring to the data analyst. In this section, effective visualization of surfaces in more than three dimensions is introduced.
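Returning briefly to Section 1.4.2, the 3R and Hanning smoothers are simple enough to sketch directly. The Python below is a minimal illustration, not the S function smooth mentioned in the text: it assumes the two end points are simply copied unchanged (Tukey’s own end-point rule is more elaborate), and the function names are hypothetical.

```python
def running_median3(x):
    """One pass of running medians of 3; end points are copied unchanged."""
    y = list(x)
    for t in range(1, len(x) - 1):
        y[t] = sorted(x[t - 1:t + 2])[1]
    return y

def smooth_3r(x):
    """Repeat running medians of 3 until the sequence stops changing."""
    current = list(x)
    while True:
        nxt = running_median3(current)
        if nxt == current:
            return current
        current = nxt

def hanning(x):
    """One pass of the Hanning filter 0.25 * (x[t-1] + 2*x[t] + x[t+1])."""
    y = list(x)
    for t in range(1, len(x) - 1):
        y[t] = 0.25 * (x[t - 1] + 2 * x[t] + x[t + 1])
    return y

# An isolated outlier (100) is removed by 3R; Hanning then spreads a spike.
smoothed = smooth_3r([1, 2, 3, 100, 5, 6, 7])
spread = hanning([0, 0, 4, 0, 0])
```

The median step discards the isolated value entirely, whereas the Hanning filter, being a weighted average, only attenuates it. That difference is why the two are typically applied in sequence.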
FIGURE 1.15 Accuracy of a natural gas meter as a function of the flow rate through the valve at 74.6 psia. The raw data (n = 33) are shown by the filled points. The three smooths (least squares, Tukey’s 3R, and Tukey’s regressogram) are superimposed.

FIGURE 1.16 Complete 3-D view of the gas flow dataset. [Axes: log10 flow, log10 psia, and accuracy.]

Displaying a three-dimensional perspective plot of the surface f(x, y) of a bivariate function requires one more dimension than the corresponding bivariate contour representation (see Figure 1.17). There are trade-offs. The contour representation lacks the exact detail and visual impact available in a perspective plot; however, perspective plots usually have portions obscured by peaks and present less precise height information. One way of expressing the difference is to say that a contour plot displays, loosely speaking, about 2.6–2.9 dimensions of the entire 3-D surface (more, as more contour lines are drawn). Some authors claim that one or the other representation is superior, but it seems clear that both can be useful for complicated surfaces.
FIGURE 1.17 Perspective plot of bivariate normal density with a “floating” representation of the corresponding contours.

The visualization advantage afforded by a contour representation is that it lives in the same dimension as the data, whereas a perspective plot requires an additional dimension. Hence with trivariate data, the third dimension can be used to present a 3-D contour. In the case of a density function, the corresponding 3-D contour plot comprises one or more α-level contour surfaces, which are defined for $x \in \mathbb{R}^d$ by

$$\alpha\text{-Contour}: \quad S_\alpha = \{x : f(x) = \alpha f_{\max}\}, \qquad 0 \le \alpha \le 1,$$

where $f_{\max}$ is the maximum or modal value of the density function. For normal data, the general contour surfaces are hyper-ellipses defined by the easily verified equation (see Problem 1.14):

$$(x-\mu)^T \Sigma^{-1} (x-\mu) = -2\log\alpha. \qquad (1.1)$$

A trivariate contour plot of $f(x_1, x_2, x_3)$ would generally contain several “nested” surfaces, $\{S_{0.1}, S_{0.3}, S_{0.5}, S_{0.7}, S_{0.9}\}$, for example. For the independent standard normal density, the contours would be nested hyperspheres centered on the mode. In Figure 1.18, three contours of the trivariate standard normal density are shown in stereo. Many, if not most, readers will have difficulty crossing their eyes to obtain the stereo effect. But even without the stereo effect, the three spherical contours are well-represented.

How effective is this in practice? Consider a smoothed histogram $\hat{f}(x,y,z)$ of 1000 trivariate normal points with $\Sigma = I_3$. Figure 1.19 shows surfaces of nine equally spaced bivariate slices of the trivariate estimate. Each slice is approximately bivariate normal but without rescaling. Of course, the surfaces are not precisely bivariate normal, due to the finite size of the sample.

A natural question to pose is: Why not plot the corresponding sequence of conditional densities, $\hat{f}(x,y \mid z = z_0)$, rather than the slices, $\hat{f}(x,y,z_0)$?
If this were done, all the surfaces in Figure 1.19 would be nearly identical. (Theoretically, the conditional
densities are all exactly $N(0_2, I_2)$.) If the goal is to understand the 4-D density surface, then the sequence of conditional densities overemphasizes the (visual) importance of the tails and obscures information about the location of the “center” of the data. Furthermore, as nonparametric estimates in the tail will be relatively noisy, the estimates will be especially rough upon normalization (see Figure 1.20). For these reasons, it seems best to look at slices and to reserve normalization for looking at conditional densities that are particularly interesting.

FIGURE 1.18 Stereo representation of three α-contours of a trivariate normal density. Gently crossing your eyes should allow the two frames to fuse in the middle.

FIGURE 1.19 Sequence of bivariate slices of a trivariate smoothed histogram.

Several trivariate contour surfaces of the same estimated density are displayed in Figure 1.21. Clearly, the trivariate contours give an improved “big picture,” just as a rotating trivariate scatter diagram improves on three static bivariate scatter diagrams. The complete density estimate is a 4-D surface, and the trivariate contour view in the final frame of Figure 1.21 may present only 3.5 dimensions, while the series of bivariate slices may yield a bit more, perhaps 3.75 dimensions, but without the visual impact. Examine the 3-D contour view for the Landsat data in the first frame of Figure 7.8 in comparison to Figures 1.3 and 1.14. The structure is quite complex.
The presentation of clusters is stunning and shows multiple modes and multiple clusters. This detailed structure is not apparent in the scatterplot in Figure 1.3.

FIGURE 1.20 Normalized slices in the left tail of the smoothed histogram. [Panels: z = −3, z = −2.6, z = −2.2.]

Depending on the nature of the variables, slicing can be attempted with four-, five-, or six-dimensional data. Of special importance is the 5-D surface generated by 4-D data, for example, space–time variables such as the Mount St. Helens data in Figure 1.10. These higher dimensional estimates can be animated in a fashion similar to Figure 1.19 (see Scott and Wilks (1990)). In the 4-D case, the α-level contours of interest are based on the slices:

$$S_{\alpha,t} = \{(x,y,z) : f(x,y,z,t) = \alpha f_{\max}\},$$

where $f_{\max}$ is the global maximum over the 5-D surface. For a fixed choice of α, as the slice value t changes continuously, the contour shells will expand or contract smoothly, finally vanishing for extreme values of t. For example, a single theoretical contour of the $N(0, I_4)$ density would vanish outside a symmetric interval around the origin, but within that interval, the contour shell would be a sphere centered on the origin with greatest diameter when t = 0. With several α-shells displayed simultaneously, the contours would be nested spheres of different radii, appearing at different values of t, but of greatest diameter when t = 0.

One particularly interesting slice of the smoothed 5-D histogram estimate of the entire Iris dataset is shown in Figure 1.22. The α = 4% contour surface reveals two well-separated clusters. However, the α = 10% contour surface is trimodal, revealing the true structure in this dataset even with only 150 points. The virginica and versicolor data may not be separated in the point cloud but apparently can be separated in the density cloud.
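The behavior of the theoretical $N(0, I_4)$ contour shells described above follows directly from Equation (1.1) with μ = 0 and Σ = I₄, which reads x² + y² + z² + t² = −2 log α. The Python sketch below (hypothetical function names, written here only as a check) computes the shell radius in the slice at t and confirms that a point on the shell has density ratio α.

```python
import math

def shell_radius(alpha, t):
    """Radius of the alpha-contour of N(0, I_4) in the slice at t.

    Returns None when the shell has vanished, that is, when
    t**2 >= -2*log(alpha). Assumes 0 < alpha <= 1.
    """
    r2 = -2.0 * math.log(alpha) - t * t
    return math.sqrt(r2) if r2 > 0 else None

def density_ratio(point):
    """f(x) / f_max for a standard normal density in any dimension."""
    return math.exp(-0.5 * sum(v * v for v in point))

alpha = 0.1
r0 = shell_radius(alpha, 0.0)      # largest shell, at t = 0
r1 = shell_radius(alpha, 1.0)      # smaller shell at t = 1
gone = shell_radius(alpha, 3.0)    # None: outside the interval |t| < sqrt(-2 log alpha)
```

For α = 0.1 the shell exists only for |t| below roughly 2.15, and shells for larger α are nested inside it and vanish sooner.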
The 3-D contour slices in Figure 1.22 were assembled from a 2-D contouring algorithm, then projected into the plane. The sequence of 2-D contour slices is shown in Figure 1.23. Study these two diagrams and think about the possibilities for exploring the entire five-dimensional surface.

To emphasize the potential value of additional variables, we conclude this vignette by examining the Iris data excluding the sepal width variable. Figure 1.24 displays a 3-D scatterplot, as well as contours of the smoothed histogram at levels α = 0.17 and α = 0.44. A little study supports the speculation that the data might contain a hybrid
species of the versicolor and virginica species. With such a small sample, that may be an embellishment.

FIGURE 1.21 Trivariate normal examples.

With more than four variables, the most appropriate sequence of slicing is not clear. With five variables, bivariate contours of $(x_4, x_5)$ may be drawn; then a sequence of trivariate slices may be examined tracing along one of these bivariate contours. With more than five or six variables, deciding where to slice at all is a difficult problem because the number of possibilities grows exponentially. That is why projection-based methods are so important (see Chapter 7).

1.4.3.1 Visualizing Multivariate Regression Functions

The same graphical representation can be applied to regression surfaces. However, the interpretation can be more difficult. For example, if the regression surface is monotone, the α-level contours of the surface will not be “closed” and will appear to “float” in space. If the regression surface is a simple linear function such as ax + by + cz, then a set of trivariate α-contours will simply be a set of parallel planes. Practical questions arise that do not appear for density surfaces. In particular, what is the natural extent of the regression surface; that is, for what region in the design space should the surface be
plotted? Perhaps one answer is to limit the plot to regions where there is sufficient data, that is, where the density of design points is above a certain threshold.

FIGURE 1.22 Two α-level contour surfaces from a slice of a five-dimensional averaged shifted histogram estimate, based on all 150 Iris data points. The displayed variables x, y, and z are sepal length, petal length and width, respectively, with the sepal width variable sliced at t = 3.4 cm. The (outer) darker α = 4% contour reveals only two clusters, while the (inner) lighter α = 10% contour reveals the three clusters.

FIGURE 1.23 A detailed breakdown of the 3-D contours shown in Figure 1.22 taken from the ASH estimate $\hat{f}(x,y,z,t = 3.4)$ as the sepal length, x, ranges from 4.00 to 7.45 cm. [Panels: x = 4 to x = 7.45 in steps of 0.15.]

FIGURE 1.24 Analysis of three of the four Iris variables, omitting sepal width entirely, which should be compared to the slice shown in Figure 1.22. The middle contour (α = 0.17) is superimposed upon the contour (α = 0.44) in the right frame to help locate the shells.

1.4.4 Overview of Contouring and Surface Display

Suppose that a general bivariate function f(x,y) (taking on positive and negative values) is sampled on a regular grid, and the α = 0 contour $S_0$ is desired; that is, $S_0 = \{(x,y) : f(x,y) = 0\}$. Label the values of the grid as +, 0, or − depending on whether f > 0, f = 0, or f < 0, respectively. Then the desired contour is shown in Figure 1.25. The piecewise linear approximation and the true contour do not match along the bin boundaries since the interpolation is not exact.

FIGURE 1.25 A portion of a bivariate contour at the α = 0 level of a smooth function measured on a regular grid and using linear interpolation (dotted lines).

However, bivariate contouring is not as simple a task as one might imagine. Usually, the function is sampled on a rectangular mesh, with no gradient information or possibility for further refinement of the mesh. If too coarse a mesh is chosen, then small local bumps or dips may be missed, or two distinct contours at the same level may be inadvertently joined. For speed and simplicity, one wants to avoid having to do any global analysis before drawing contours. A local contouring algorithm avoids multiple passes over the data. In any case, global analysis is based on certain
smoothness assumptions and may fail. The difficulties and details of contouring are described more fully in Section A.1.

There are several varieties of 3-D contouring algorithms. It is assumed that the function has been sampled on a lattice, which can be taken to be cubical without loss of generality. One simple trick is to display a set of 2-D contour slices that result from intersecting the 3-D contour shell with a set of parallel planes along the lattice of the data, as was done in Figures 1.18 and 1.22. In this representation, a single spherical shell becomes a set of circular contours (Figure 1.26). This approach has the advantage of providing a shell representation that is “transparent” so that multiple α-level contour levels may be visualized. Different colors can be used for different contour levels (see Scott (1983, 1984, 1991a), Scott and Thompson (1983), Härdle and Scott (1988), and Scott and Hall (1989)).

FIGURE 1.26 Simple stereo representation of four 3-D nested shells of the earthquake data.

More visually pleasing surfaces can be drawn using the marching cubes algorithm (Lorensen and Cline, 1987). The overall contour surface is represented by a large number of connected triangular planar sections, which are computed for each cubical bin and then displayed. Depending on the pattern of signs on the eight vertices of each cube in the data lattice, up to six triangular patches are drawn within each cube (see Figure 1.27). In general, there are $2^8 = 256$ cases (each corner of the cube being either above or below the contour level). Taking into consideration certain symmetries reduces this number. By scanning through all the cubes in the data lattice, a collection of triangles is found that defines the contour shell. Each triangle has an inner and outer surface, depending on the gradient of the density function.
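As an aside on the local, edge-by-edge idea that underlies both the bivariate contouring of Figure 1.25 and marching cubes, the following Python sketch locates α = 0 crossings on the edges of a regular grid by linear interpolation. It is a simplified illustration written under stated assumptions, not the algorithm of Section A.1 or of Lorensen and Cline: the function names are hypothetical, crossings are only collected (not linked into curves), and grid points where f is exactly zero are skipped.

```python
def zero_crossing(p0, p1, f0, f1):
    """Linearly interpolate the point on edge p0-p1 where f changes sign."""
    w = f0 / (f0 - f1)                 # fraction of the way from p0 to p1
    return (p0[0] + w * (p1[0] - p0[0]), p0[1] + w * (p1[1] - p0[1]))

def contour_points(f, xs, ys):
    """Collect interpolated zero crossings along all edges of the grid."""
    pts = []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            f00 = f(x, y)
            if i + 1 < len(xs):        # edge toward the next x grid value
                f10 = f(xs[i + 1], y)
                if f00 * f10 < 0:
                    pts.append(zero_crossing((x, y), (xs[i + 1], y), f00, f10))
            if j + 1 < len(ys):        # edge toward the next y grid value
                f01 = f(x, ys[j + 1])
                if f00 * f01 < 0:
                    pts.append(zero_crossing((x, y), (x, ys[j + 1]), f00, f01))
    return pts

# Test function whose zero contour is the unit circle, on a coarse grid.
grid = [-1.5 + 0.25 * k for k in range(13)]
points = contour_points(lambda x, y: x * x + y * y - 1.0, grid, grid)
```

On this grid the interpolated points fall slightly inside the true circle, a small instance of the mismatch between the piecewise linear approximation and the true contour noted in the text.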
The inner and outer surfaces may be distinguished by color shading. A convenient choice is various shades of red for surfaces pointing toward regions of higher (hotter) density, and shades of blue toward regions of lower (cooler) density; see the cover jacket of this book for an example. Each contour is a patchwork of several thousand triangles. Smoother surfaces may be
obtained by using higher-order splines, but the underlying bin structure information would be lost.

FIGURE 1.27 Examples of marching cube contouring algorithm. The corners with values above the contour level are labeled with a + symbol.

In summary, visualizing trivariate functions directly is a powerful adjunct to data analysis. The gain of an additional dimension of visible structure without resort to slices greatly improves the ability of a data analyst to perceive structure. The same visualization applies to slices of density function with more than three variables. A demonstration tape that displays 4-D animation of $S_{\alpha,t}$ contours as α and t vary is available (Scott and Wilks, 1990).

1.5 GEOMETRY OF HIGHER DIMENSIONS

The geometry of higher dimensions provides a few surprises. In this section, a few standard figures are considered. This material is available in scattered references (see Kendall (1961), for example).

1.5.1 Polar Coordinates in d Dimensions

In d dimensions, a point x can be expressed in spherical polar coordinates by a radius r, a base angle $\theta_{d-1}$ ranging over (0, 2π), and d − 2 angles $\theta_1, \ldots, \theta_{d-2}$ each ranging over (−π/2, π/2) (see Figure 1.28). Let $s_k = \sin\theta_k$ and $c_k = \cos\theta_k$. Then the transformation back to Euclidean coordinates is given by

$$\begin{aligned}
x_1 &= r\, c_1 c_2 \cdots c_{d-3}\, c_{d-2}\, c_{d-1} \\
x_2 &= r\, c_1 c_2 \cdots c_{d-3}\, c_{d-2}\, s_{d-1} \\
x_3 &= r\, c_1 c_2 \cdots c_{d-3}\, s_{d-2} \\
&\;\;\vdots \\
x_j &= r\, c_1 \cdots c_{d-j}\, s_{d-j+1} \\
&\;\;\vdots \\
x_d &= r\, s_1 .
\end{aligned}$$
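The transformation above is easy to get wrong by one index, so a direct transcription may be useful. The Python sketch below (the function name and the 0-based list layout are choices made here) builds $x_1, \ldots, x_d$ from r and the angles, and the accompanying check confirms that the Euclidean norm of the result equals r.

```python
import math

def polar_to_euclidean(r, theta):
    """Map (r, theta_1, ..., theta_{d-1}) to Euclidean (x_1, ..., x_d).

    theta[k-1] holds theta_k; the returned list holds x_j at index j-1.
    """
    d = len(theta) + 1
    x = [0.0] * d
    prod = r                           # running product r * c_1 * ... * c_{k-1}
    for k in range(1, d):              # k = 1, ..., d-1
        s = math.sin(theta[k - 1])
        c = math.cos(theta[k - 1])
        x[d - k] = prod * s            # x_{d-k+1} = r c_1 ... c_{k-1} s_k
        prod *= c
    x[0] = prod                        # x_1 = r c_1 ... c_{d-1}
    return x

# A point in five dimensions; its length should equal r = 2.5.
point = polar_to_euclidean(2.5, [0.3, -0.7, 1.1, 4.0])
length = math.sqrt(sum(v * v for v in point))
```

The norm is preserved because the squared coordinates telescope: each pair $c_k^2 + s_k^2$ collapses to one, leaving $r^2$.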
FIGURE 1.28 Polar coordinates $(r, \theta_1, \theta_2)$ of a point P in $\mathbb{R}^3$.

After some work (see Problem 1.11), the Jacobian of this transformation may be shown to be

$$J = r^{d-1} c_1^{d-2} c_2^{d-3} \cdots c_{d-2}. \qquad (1.2)$$

1.5.2 Content of Hypersphere

The volume of the d-dimensional hypersphere $\{x : \sum_{i=1}^d x_i^2 \le a^2\}$ is given by

$$V_d(a) = \int_{\sum_{i=1}^d x_i^2 \le a^2} 1 \, dx = \int_0^a dr \int_{-\pi/2}^{\pi/2} d\theta_1 \int_{-\pi/2}^{\pi/2} d\theta_2 \cdots \int_0^{2\pi} d\theta_{d-1} \; r^{d-1} c_1^{d-2} c_2^{d-3} \cdots c_{d-2}.$$

This can be simplified using the identity

$$\int_{-\pi/2}^{\pi/2} \cos^k\theta \, d\theta = 2\int_0^{\pi/2} \cos^k\theta \, d\theta = 2\int_0^{\pi/2} \cos^k\theta \, \frac{d(\cos^2\theta)}{-2\cos\theta\sin\theta},$$

which, using the change of variables $u = \cos^2\theta$,

$$= \int_0^1 \frac{u^{k/2}\, du}{u^{1/2}(1-u)^{1/2}} = B\!\left(\tfrac{1}{2}, \tfrac{k+1}{2}\right) = \frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{k+1}{2}\right)}{\Gamma\!\left(\tfrac{k+2}{2}\right)}.$$
As $\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}$,

$$V_d(a) = \frac{2\pi a^d}{d} \cdot \frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{d-1}{2}\right)}{\Gamma\!\left(\tfrac{d}{2}\right)} \cdot \frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma\!\left(\tfrac{d-2}{2}\right)}{\Gamma\!\left(\tfrac{d-1}{2}\right)} \cdots \frac{\Gamma\!\left(\tfrac{1}{2}\right)\Gamma(1)}{\Gamma\!\left(\tfrac{3}{2}\right)} = \frac{a^d \pi^{d/2}}{\tfrac{d}{2}\,\Gamma\!\left(\tfrac{d}{2}\right)} = \frac{a^d \pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2}+1\right)}. \qquad (1.3)$$

1.5.3 Some Interesting Consequences

1.5.3.1 Sphere Inscribed in Hypercube

Consider the hypercube $[-a,a]^d$ and an inscribed hypersphere with radius r = a. Then using (1.3), the fraction of the volume of the cube contained in the hypersphere is given by

$$f_d = \frac{\text{Volume sphere}}{\text{Volume cube}} = \frac{a^d \pi^{d/2} / \Gamma\!\left(\tfrac{d}{2}+1\right)}{(2a)^d} = \frac{\pi^{d/2}}{2^d\, \Gamma\!\left(\tfrac{d}{2}+1\right)}.$$

For lower dimensions, the fraction $f_d$ is as shown in Table 1.1. It is clear that the center of the cube becomes less important. As the dimension increases, the volume of the hypercube concentrates in its corners. This distortion of space (at least to our three-dimensional way of thinking) has many potential consequences for data analysis.

1.5.3.2 Hypervolume of a Thin Shell

Wegman (1990) demonstrates the distortion of space in another setting. Consider two spheres centered on the origin, one with radius r and the other with slightly smaller radius r − ε. Consider the fraction of the volume of the larger sphere in between the spheres. By Equation (1.3),

$$\frac{V_d(r) - V_d(r-\epsilon)}{V_d(r)} = \frac{r^d - (r-\epsilon)^d}{r^d} = 1 - \left(1 - \frac{\epsilon}{r}\right)^d \xrightarrow[d\to\infty]{} 1.$$

Hence, virtually all of the content of a hypersphere is concentrated close to its surface, which is only a (d − 1)-dimensional manifold. Thus for data distributed uniformly over both the hypersphere and the hypercube, most of the data fall near the boundary and edges of the volume. Most statistical techniques exhibit peculiar behavior if the data fall in a lower dimensional subspace. This example illustrates one important aspect of the curse of dimensionality, which is discussed in Chapter 7.

TABLE 1.1 Fraction of the Volume of a Hypercube Lying in the Inscribed Hypersphere

Dimension (d)             1      2      3      4      5      6      7
Fraction volume (f_d)     1      0.785  0.524  0.308  0.164  0.081  0.037
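The entries of Table 1.1 and the thin-shell limit can be reproduced from Equation (1.3) with the gamma function in Python’s standard library. The function names below are hypothetical, and the particular choice ε/r = 0.05 with d = 100 is only an example picked here for illustration.

```python
import math

def hypersphere_volume(d, a=1.0):
    """V_d(a) = a^d * pi^(d/2) / Gamma(d/2 + 1), Equation (1.3)."""
    return a ** d * math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def inscribed_fraction(d):
    """Fraction f_d of the hypercube [-a, a]^d inside the inscribed hypersphere."""
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

def shell_fraction(d, r, eps):
    """Fraction of V_d(r) lying between radius r - eps and radius r."""
    return 1.0 - (1.0 - eps / r) ** d

fractions = [inscribed_fraction(d) for d in range(1, 8)]
thin_shell = shell_fraction(100, 1.0, 0.05)
```

In this example, a shell only 5% of the radius thick already holds more than 99% of the volume when d = 100, which makes the concentration near the surface concrete.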