Functional Data Analysis with R
Emerging technologies generate data sets of increased size and complexity that require new or updated statistical inferential methods and scalable, reproducible software. These data sets often involve measurements of a continuous underlying process, and benefit from a functional data perspective. Functional Data Analysis with R presents many ideas for handling functional data including dimension reduction techniques, smoothing, functional regression, structured decompositions of curves, and clustering. The idea is for the reader to be able to immediately reproduce the results in the book, implement these methods, and potentially design new methods and software that may be inspired by these approaches.
Features:
• Functional regression models receive a modern treatment that allows extensions to many practical scenarios
and development of state-of-the-art software.
• The connection between functional regression, penalized smoothing, and mixed effects models is used as the
cornerstone for inference.
• Multilevel, longitudinal, and structured functional data are discussed with emphasis on emerging functional
data structures.
• Methods for clustering functional data before and after smoothing are discussed.
• Multiple new functional data sets with dense and sparse sampling designs from various application areas are
presented, including the NHANES linked accelerometry and mortality data, COVID-19 mortality data, CD4
counts data, and the CONTENT child growth study.
• Step-by-step software implementations are included, along with a supplementary website (www.FunctionalDataAnalysis.com) featuring software, data, and tutorials.
• More than 100 plots for visualization of functional data are presented.
Functional Data Analysis with R is primarily aimed at undergraduate, master’s, and PhD students, as well as data scientists and researchers working on functional data analysis. The book can be read at different levels and combines state-of-the-art software, methods, and inference. It can be used for self-learning, teaching, and research, and will particularly appeal to anyone who is interested in practical methods for hands-on, problem-forward functional data analysis. The reader should have some basic coding experience, but expertise in R is not required.
Ciprian M. Crainiceanu is Professor of Biostatistics at Johns Hopkins University working on wearable and implantable technology (WIT), signal processing, and clinical neuroimaging. He has extensive experience in mixed effects modeling, semiparametric regression, and functional data analysis with application to data generated by emerging technologies.
Jeff Goldsmith is Associate Dean for Data Science and Associate Professor of Biostatistics at the Columbia University Mailman School of Public Health. His work in functional data analysis includes methodological and computational advances with applications in reaching kinematics, wearable devices, and neuroimaging.
Andrew Leroux is an Assistant Professor of Biostatistics and Informatics at the University of Colorado. His
interests include the development of methodology in functional data analysis, particularly related to wearable
technologies and intensive longitudinal data.
Erjia Cui is an Assistant Professor of Biostatistics at the University of Minnesota. His research interests include
developing functional data analysis methods and semiparametric regression models with reproducible software,
with applications in wearable devices, mobile health, and imaging.
MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY
Editors: F. Bunea, R. Henderson, L. Levina, N. Meinshausen, R. Smith
Recently Published Titles
Multistate Models for the Analysis of Life History Data
Richard J. Cook and Jerald F. Lawless 158
Nonparametric Models for Longitudinal Data
with Implementation in R
Colin O. Wu and Xin Tian 159
Multivariate Kernel Smoothing and Its Applications
José E. Chacón and Tarn Duong 160
Sufficient Dimension Reduction
Methods and Applications with R
Bing Li 161
Large Covariance and Autocovariance Matrices
Arup Bose and Monika Bhattacharjee 162
The Statistical Analysis of Multivariate Failure Time Data: A Marginal Modeling Approach
Ross L. Prentice and Shanshan Zhao 163
Dynamic Treatment Regimes
Statistical Methods for Precision Medicine
Anastasios A. Tsiatis, Marie Davidian, Shannon T. Holloway, and Eric B. Laber 164
Sequential Change Detection and Hypothesis Testing
General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules
Alexander Tartakovsky 165
Introduction to Time Series Modeling
Genshiro Kitagawa 166
Replication and Evidence Factors in Observational Studies
Paul R. Rosenbaum 167
Introduction to High-Dimensional Statistics, Second Edition
Christophe Giraud 168
Object Oriented Data Analysis
J.S. Marron and Ian L. Dryden 169
Martingale Methods in Statistics
Yoichi Nishiyama 170
The Energy of Data and Distance Correlation
Gábor J. Székely and Maria L. Rizzo 171
Sparse Graphical Modeling for High Dimensional Data
Faming Liang and Bochao Jia 172
Bayesian Nonparametric Methods for Missing Data and Causal Inference
Michael J. Daniels, Antonio Linero, and Jason Roy 173
Functional Data Analysis with R
Ciprian M. Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia Cui 174
For more information about this series please visit: https://www.crcpress.com/Chapman--HallCRC-Monographs-on-Statistics--Applied-Probability/book-series/CHMONSTAAPP
Functional Data Analysis with R
Ciprian M. Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia Cui
First edition published 2024
by CRC Press
2385 Executive Center Drive, Suite 320, Boca Raton, FL 33431, U.S.A.
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2024 Ciprian M. Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia Cui
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact mpkbookspermissions@tandf.co.uk
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
ISBN: 978-1-032-24471-6 (hbk)
ISBN: 978-1-032-24472-3 (pbk)
ISBN: 978-1-003-27872-6 (ebk)
DOI: 10.1201/9781003278726
Typeset in CMR10
by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Library of Congress Cataloging-in-Publication Data
Names: Crainiceanu, Ciprian, author. | Goldsmith, Jeff, author. | Leroux, Andrew, author. | Cui, Erjia,
author.
Title: Functional data analysis with R / Ciprian Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia
Cui.
Description: First edition. | Boca Raton : CRC Press, 2024. |
Series: CRC monographs on statistics and applied probability | Includes bibliographical references and
index. | Summary: “Functional Data Analysis with R is primarily aimed at undergraduate, masters, and
PhD students, as well as data scientists and researchers working on functional data analysis. The book
can be read at different levels and combines state-of-the-art software, methods, and inference. It can be
used for self-learning, teaching, and research, and will particularly appeal to anyone who is interested in
practical methods for hands-on, problem-forward functional data analysis. The reader should have some
basic coding experience, but expertise in R is not required”-- Provided by publisher.
Identifiers: LCCN 2023041843 (print) | LCCN 2023041844 (ebook) | ISBN 9781032244716 (hbk) | ISBN
9781032244723 (pbk) | ISBN 9781003278726 (ebk)
Subjects: LCSH: Multivariate analysis. | Statistical functionals. | Functional analysis. | R (Computer
program language)
Classification: LCC QA278 .C73 2024 (print) | LCC QA278 (ebook) | DDC
519.5/35--dc23/eng/20231221
LC record available at https://lccn.loc.gov/2023041843
LC ebook record available at https://lccn.loc.gov/2023041844
To Bianca, Julia, and Adina,
may your life be as beautiful as you made mine.
Ciprian
To my family and friends, for your unfailing support and
encouragement.
Jeff
To Tushar, mom, and dad, thank you for all you do to keep me
centered and sane.
To Sarina and Nikhil, you’re all a parent could ever ask for.
Never stop shining your light on the world.
Andrew
To my family, especially my mom, for your unconditional love.
Erjia
Contents

Preface

1 Basic Concepts
1.1 Introduction
1.2 Examples
1.2.1 NHANES 2011–2014 Accelerometry Data
1.2.2 COVID-19 US Mortality Data
1.2.3 CD4 Counts Data
1.2.4 The CONTENT Child Growth Study
1.3 Notation and Methodological Challenges
1.4 R Data Structures for Functional Observations
1.5 Notation

2 Key Methodological Concepts
2.1 Dimension Reduction
2.1.1 The Linear Algebra of SVD
2.1.2 The Link between SVD and PCA
2.1.3 SVD and PCA for High-Dimensional FDA
2.1.4 SVD for US Excess Mortality
2.2 Gaussian Processes
2.3 Semiparametric Smoothing
2.3.1 Regression Splines
2.3.1.1 Univariate Regression Splines
2.3.1.2 Regression Splines with Multiple Covariates
2.3.1.3 Multivariate Regression Splines
2.3.2 Penalized Splines
2.3.3 Smoothing as Mixed Effects Modeling
2.3.4 Penalized Spline Smoothing in NHANES
2.3.4.1 Mean PA among Deceased and Alive Individuals
2.3.4.2 Regression of Mean PA
2.4 Correlation and Multiplicity Adjusted (CMA) Confidence Intervals and Testing
2.4.1 CMA Confidence Intervals Based on Multivariate Normality
2.4.2 CMA Confidence Intervals Based on Parameter Simulations
2.4.3 CMA Confidence Intervals Based on the Nonparametric Bootstrap of the Max Absolute Statistic
2.4.4 Pointwise and Global Correlation and Multiplicity Adjusted (CMA) p-values
2.4.5 The Origins of CMA Inference Ideas in this Book
2.5 Covariance Smoothing
2.5.1 Types of Covariance Smoothing
2.5.1.1 Covariance Smoothing for Dense Functional Data
2.5.1.2 Covariance Smoothing for Sparse Functional Data
2.5.2 Covariance Smoothing in NHANES
2.5.3 Covariance Smoothing for CD4 Counts

3 Functional Principal Components Analysis
3.1 Defining FPCA and Connections to PCA
3.1.1 A Simulated Example
3.1.1.1 Code for Generating Data
3.1.1.2 Data Visualization
3.1.1.3 Raw PCA versus FPCA Results
3.1.1.4 Functional PCA with Missing Data
3.1.2 Application to NHANES
3.1.2.1 Data Description
3.1.2.2 Results
3.2 Generalized FPCA for Non-Gaussian Functional Data
3.2.1 Conceptual Framework
3.2.2 Fast GFPCA Using Local Mixed Effects
3.2.3 Binary PCA Using Exact EM
3.2.4 Functional Additive Mixed Models
3.2.5 Comparison of Approaches
3.2.6 Recommendations
3.3 Sparse/Irregular FPCA
3.3.1 CONTENT Child Growth Data
3.3.2 Data Structure
3.3.3 Implementation
3.3.4 About the Methodology for Fast Sparse FPCA
3.4 When PCA Fails

4 Scalar-on-Function Regression
4.1 Motivation and EDA
4.2 “Simple” Linear Scalar-on-Function Regression
4.2.1 Model Specification and Interpretation
4.2.2 Parametric Estimation of the Coefficient Function
4.2.3 Penalized Spline Estimation
4.2.4 Data-Driven Basis Expansion
4.3 Inference in “Simple” Linear Scalar-on-Function Regression
4.3.1 Unadjusted Inference for Functional Predictors
4.4 Extensions of Scalar-on-Function Regression
4.4.1 Adding Scalar Covariates
4.4.2 Multiple Functional Coefficients
4.4.3 Exponential Family Outcomes
4.4.4 Other Scalar-on-Function Regression Models
4.5 Estimation and Inference Using mgcv
4.5.1 Unadjusted Pointwise Inference for SoFR Using mgcv
4.5.2 Correlation and Multiplicity Adjusted (CMA) Inference for SoFR

5 Function-on-Scalar Regression
5.1 Motivation and Exploratory Analysis of MIMS Profiles
5.1.1 Regressions Using Binned Data
5.2 Linear Function-on-Scalar Regression
5.2.1 Estimation of Fixed Effects
5.2.1.1 Estimation Using Ordinary Least Squares
5.2.1.2 Estimation Using Smoothness Penalties
5.2.2 Accounting for Error Correlation
5.2.2.1 Modeling Residuals Using FPCA
5.2.2.2 Modeling Residuals Using Splines
5.2.2.3 A Bayesian Perspective on Model Fitting
5.3 A Scalable Approach Based on Epoch-Level Regressions

6 Function-on-Function Regression
6.1 Examples
6.1.1 Association between Patterns of Excess Mortality
6.1.2 Predicting Future Growth of Children from Past Observations
6.2 Linear Function-on-Function Regression
6.2.1 Penalized Spline Estimation of FoFR
6.2.2 Model Fit and Prediction Using FoFR
6.2.3 Missing and Sparse Data
6.3 Fitting FoFR Using pffr in refund
6.3.1 Model Fit
6.3.2 Additional Features of pffr
6.3.3 An Example of pffr in the CONTENT Study
6.4 Fitting FoFR Using mgcv
6.5 Inference for FoFR
6.5.1 Unadjusted Pointwise Inference for FoFR
6.5.2 Correlation and Multiplicity Adjusted Inference for FoFR

7 Survival Analysis with Functional Predictors
7.1 Introduction to Survival Analysis
7.2 Exploratory Data Analysis of the Survival Data in NHANES
7.2.1 Data Structure
7.2.1.1 Traditional Survival Analysis
7.2.1.2 Survival Analysis with Functional Predictors
7.2.2 Kaplan-Meier Estimators
7.2.3 Results for the Standard Cox Models
7.3 Cox Regression with Baseline Functional Predictors
7.3.1 Linear Functional Cox Model
7.3.1.1 Estimation
7.3.1.2 Inference on the Functional Coefficient
7.3.1.3 Survival Curve Prediction
7.3.2 Smooth Effects of Traditional and Functional Predictors
7.3.3 Additive Functional Cox Model
7.4 Simulating Survival Data with Functional Predictors

8 Multilevel Functional Data Analysis
8.1 Data Structure in NHANES
8.2 Multilevel Functional Principal Component Analysis
8.2.1 Two-Level Functional Principal Component Analysis
8.2.1.1 Two-Level FPCA Model
8.2.1.2 Estimation of the Two-Level FPCA Model
8.2.1.3 Implementation in R
8.2.1.4 NHANES Application Results
8.2.2 Structured Functional PCA
8.2.2.1 Two-Way Crossed Design
8.2.2.2 Three-Way Nested Design
8.3 Multilevel Functional Mixed Models
8.3.1 Functional Additive Mixed Models
8.3.2 Fast Univariate Inference
8.3.3 NHANES Case Study
8.4 Multilevel Scalar-on-Function Regression
8.4.1 Generalized Multilevel Functional Regression
8.4.2 Longitudinal Penalized Functional Regression

9 Clustering of Functional Data
9.1 Basic Concepts and Examples
9.2 Some Clustering Approaches
9.2.1 K-means
9.2.1.1 Clustering States Using K-means
9.2.1.2 Background on K-means
9.2.2 Hierarchical Clustering
9.2.2.1 Hierarchical Clustering of States
9.2.2.2 Background on Hierarchical Clustering
9.2.3 Distributional Clustering
9.2.3.1 Distributional Clustering of States
9.2.3.2 Background on Distributional Clustering
9.3 Smoothing and Clustering
9.3.1 FPCA Smoothing and Clustering
9.3.2 FPCA Smoothing and Clustering with Noisy Data
9.3.3 FPCA Smoothing and Clustering with Sparse Data
9.3.4 Clustering NHANES Data

Bibliography

Index
Preface
Around the year 2000, several major areas of statistics were witnessing rapid changes:
functional data analysis, semiparametric regression, mixed effects models, and software
development. While none of these areas was new, they were all becoming more mature, and
their complementary ideas were setting the stage for new and rapid advancements. These
developments were the result of the work of thousands of statisticians, whose collective
achievements cannot be fully recognized in one monograph. We will try to describe some
of the watershed moments that directly influenced our work and this book. We will also
identify and contextualize our contributions to functional data analysis.
The Functional Data Analysis (FDA) book of Ramsay and Silverman [244, 245] was first
published in 1997 and, without a doubt, defined the field. It considered functions as the basic
unit of observation, and introduced new data structures, new methods, and new definitions.
This amplified the interest in FDA, especially with the emergence of new, larger, and more
complex data sets in the early 2000s. Around the same time, and largely independent of
the FDA literature, nonparametric modeling was subject to massive structural changes.
Starting in the early 1970s, the seminal papers of Grace Wahba and collaborators [54, 150, 303] were setting the stage for smoothing spline regression. Likely influenced by these ideas, in 1986, Finbarr O’Sullivan [221] published the first paper on penalized splines (B-splines with a smaller number of knots and a penalty on the roughness of the regression function). In 1996, Eilers and Marx [71] published a seminal paper on P-splines (similar to O’Sullivan’s approach, but using a different penalty structure) and followed it up in 2002 by showing that these ideas can be extended to Generalized Additive Models (GAM) [72]. In 1999,
Brumback, Ruppert, and Wand [26] pointed out that regression models incorporating splines
with coefficient penalties can be viewed as particular cases of Generalized Linear Mixed
Models (GLMM). This idea was expanded upon in a series of papers that led to the highly
influential Semiparametric Regression book by Ruppert, Wand, and Carroll [258], which
was published in 2003. The book showed that semiparametric models could incorporate
additional covariates, random effects, and nonparametric smoothing components in a unified
mixed effects inferential framework. It also demonstrated how to implement these models
in existing mixed effects software. Simon Wood and his collaborators, in a series of papers
that culminated with the 2006 Generalized Additive Models book [315], set the current
standards for methods and software integration for GAM. The substantially updated 2017
second edition of this book [319] is now a classic reference for GAM.
In the early 2000s, the connection between functional data analysis, semiparametric regression, and mixed effects models was not yet apparent, though some early cross-pollination work was starting to appear. In 1999, Marx and Eilers [192] introduced the idea of P-splines
for signal regression, which is closely related to the Functional Linear Model with a scalar
outcome and functional predictors described by Ramsay and Silverman; see also extensions
in the early 2000s [72, 171, 193]. In 2007, Reiss and Ogden [252] introduced a version of the
method proposed by Marx and Eilers [192] using a different penalty structure, described
methods for principal component regression (FPCR) and functional partial least squares
(FPLS), and noted the connection with the mixed effects model representation of penalized
splines described in [258]. In spite of these crucial advancements, by 2008 there was still
no reliable FDA software for implementing these methods. In 2008, Wood gave a Royal Statistical Society (RSS) talk (https://rb.gy/o1zg5), where he showed how to use mgcv to fit scalar-on-function regression (SoFR) models using “linear functional terms.” This talk
clarified the conceptual and practical connections between functional and semiparametric
regression; see pages 17–20 of his presentation. In a personal note, Wood mentioned that
his work was influenced by that of Eilers, Marx, Reiss, and Ogden, though he points to
Wahba’s 1990 book [304] and Tikhonov, 1963 [294] as his primary sources of inspiration. In
his words: “[Grace Wahba’s equation] (8.1.4), from Tikhonov, 1963, is essentially the signal
regression problem. It just took me a long time to think up the summation convention idea
that mgcv uses to implement this.” In 2011, Wood published the idea of penalized spline
estimation for the functional coefficient in the SoFR context; see Section 5.2 in his paper,
where methods are extended to incorporate non-Gaussian errors with multiple penalties.
Our methods and philosophy were also informed by many sources, including the now
classical references discussed above. However, we were heavily influenced by the mixed effects representation of semiparametric models introduced by Ruppert, Wand, and Carroll [258]. Also, we were interested in the practical implementation and scalability of a variety of
FDA models beyond the SoFR model. The 2010 paper by Crainiceanu and Goldsmith [48]
and the 2011 paper led by Goldsmith and Bobb [102] outlined the philosophy and practice
underlying much of the functional regression chapters of this book: (1) where necessary,
project observed functions on a functional principal component basis to account for noise,
irregular observation grids, and/or missing data; (2) use rich-basis spline expansions for
functional coefficients and induce smoothing using penalties on the spline coefficients; (3)
identify the mixed effects models that correspond to the specific functional regression; and
(4) use existing mixed effects model software (in their case WinBUGS [187] and nlme [230],
respectively) to fit the model and conduct inference. Regardless of the underlying software
platform, one of our main contributions was to recognize the deep connections between
functional regression, penalized spline smoothing, and mixed effects inference. This allowed
extensions that incorporated multiple scalar covariates, random effects, and multiple functional observations with or without noise, with dense or sparse sampling patterns, and
complete or missing data. Over time, the inferential approach was extended to scalar-on-
function regression (SoFR), function-on-scalar regression (FoSR), and function-on-function
regression (FoFR). We have also contributed to increasing awareness of new data structures
and the need for validated and supported inferential software.
Around 2010–2011, Philip Reiss and Crainiceanu initiated a project to assemble existing
R functions for FDA. It was started as the package refund [105] for “REgression with
FUNctional Data,” though it never provided any refund, it was not only about regression,
and was not particularly easy to find on Google. However, it did bring together a group of
statisticians who were passionate about developing FDA software for a wide audience. We
would like to thank all of these contributors for their dedication and vision. The refund
package is currently maintained by Julia Wrobel.
Fabian Scheipl, Sonja Greven, and collaborators have led a series of transformative papers [128, 262, 263] that started to appear in 2015 and expanded functional regression in many new directions. The 2015 paper by Ivanescu, Staicu, Scheipl, and Greven [128] showed
how to conduct function-on-function regression (FoFR) using the philosophy outlined by
Goldsmith, Bobb, and Crainiceanu [48, 102]. The paper made the connection to the “linear functional terms” implementation in mgcv, which merged previously disparate lines of work in FDA. This series of papers led to substantial revisions of the refund package and
the addition of the powerful function pffr(), which provides a functional interface based
on the mgcv package. The function pfr(), initially developed by Goldsmith, was updated to
the same standard. Scheipl’s contributions to refund were transformative and set a new bar
for FDA software. Finally, the ideas came together and showed how functional regression
can be modeled semiparametrically using splines, smoothness can be induced via specific
penalties on parameters, and penalized models can be treated as mixed effects models, which
can be fit using modern software. This body of work provides much of the infrastructure of
Chapters 4, 5, and 6 of this book.
To address the larger and increasingly complex data applications, new methods were required for Functional Principal Components Analysis (FPCA). To the best of our knowledge, in 2010 there was no working software for smoothing covariance matrices for functional data with more than 300 observations per function. Luo Xiao was one of the main contributors who introduced the FAst Covariance Estimation (FACE), a principled method for nonparametric smoothing of covariance operators for high and ultra-high dimensional functional data. Methods use “sandwich estimators” of covariance matrices that are guaranteed to be symmetric and positive definite and were deployed in the refund::fpca.face()
function [331]. Xiao’s subsequent work on sparse and multivariate sparse FPCA
was deployed as the standalone functions face::face.sparse() [328, 329] and
mfaces::mface.sparse() [172, 173]. During the writing of this book, it became apparent that methods were also needed for FPCA-like methods for non-Gaussian functional data. Andrew Leroux and Wrobel led a paper on fast generalized FPCA (fastGFPCA) [167]
using local mixed effects models and deployed the accompanying fastGFPCA package [324].
These developments are highlighted in Chapters 2 and 3 of this book.
Much less work has been dedicated to survival analysis with functional predictors and, especially, to extending the semiparametric regression ideas to this context. In 2015, Jonathan Gellar introduced the Penalized Functional Cox Regression [94], where the effect of the functional predictor on the log-hazard was modeled using penalized splines. However, methods
were not immediately deployed in mgcv because this option only became available in 2016
[322]. In subsequent publications, Leroux [164, 166] and Erjia Cui [55, 56] made clear the
connection to the “linear functional terms” in mgcv and substantially enlarged the range of
applications of survival analysis with functional predictors. This work provides the infrastructure for Chapter 7 of this book.
In 2009, Chongzhi Di, Crainiceanu, and collaborators introduced the concept of Multilevel Functional Principal Component Analysis (MFPCA) for functional data observed at multiple visits (e.g., electroencephalograms taken every 30 seconds during sleep at two visits several years apart). They developed and deployed the refund::mfpca.sc() function. A much improved version of the software was deployed recently in the refund::mfpca.face() function
based on a paper led by Cui and Ruonan Li [58]. Much work has been dedicated to extending ideas to structured functional data [272, 273], led by Haochang Shou, longitudinal functional data [109], led by Greven, and ultra-high dimensional data [345, 346], led by
Vadim Zipunnikov. Many others have provided contributions, including Ana-Maria Staicu,
Goldsmith, and Lei Huang. Fast methods for fixed effects inference in this context were
developed, among others, by Staicu [223] and Cui [57]. These methods required specialized
software to deal with the size and complexity of new data sets. This work forms the basis
of Chapter 8 of this book.
As we were writing this book we realized just how many open problems still remain.
Some of these problems have been addressed along the way; some are still left open. In the
end, we have tried to provide a set of coherent analytic tools based on statistically principled approaches. The core set of ideas is to model functional coefficients parametrically or nonparametrically using splines, penalize the spline coefficients, and conduct inference in the
resulting mixed effects model. The book is accompanied by detailed software and a website
http://www.FunctionalDataAnalysis.com that will continue to be updated.
We hope that you enjoy reading this book as much as we enjoyed writing it.
1 Basic Concepts
Our goal is to create the most useful book for the widest possible audience without theoretical, methodological, or computational compromise.
Our approach to statistics is to identify important scientific problems and meaningfully contribute to solving them through timely engagement with data. The development of general-purpose methodology is motivated by this process, and must be accompanied by computational tools that facilitate reproducibility and transparency. This “problem forward” approach is critical as technological advances rapidly increase the precision and volume of traditional measurements, produce completely new types of measurements, and open new areas of scientific research.
Our experience in public health and medical research provides numerous examples of new technologies that reshape scientific questions. For example, heart rate and blood pressure used to be measured once a year during an annual medical exam. Wearable devices can now measure them continuously, including during the night, for weeks or months at a time. The resulting data provide insights into blood pressure, hypertension, and health outcomes and open completely new areas of research. New types of measurements are continuously emerging, including physical activity measured by accelerometers, brain imaging, ecological momentary assessments (EMA) via smartphone apps, daily infection and mortality counts during the COVID-19 pandemic, or CD4 counts from the time of sero-conversion. These examples and many others involve measurements of a continuous underlying process, and benefit from a functional data perspective.
1.1 Introduction
Functional Data Analysis (FDA) provides a conceptual framework for analyzing functions instead of, or in addition to, scalar measurements. For example, physical activity is a continuous process over the course of the day and can be observed for each individual; FDA considers the complete physical activity trajectory in the analysis instead of reducing it to a single scalar summary, such as the total daily activity. In this book we denote the observed functions by Wi : S → R, where S is an interval (e.g., [0, 1] in R or [0, 1]^M in R^M), i is the basic experimental unit (e.g., study participant), and Wi(s) is the functional observation for unit i at s ∈ S. In general, the domain S does not need to be an interval, but for the purposes of this book we will work under this assumption.
We often assume that Wi(s) = Xi(s) + εi(s), where Xi : S → R is the true functional process and the εi(s) are independent noise variables. We will see various generalizations of this definition, but for illustration purposes we use this notation. We briefly summarize the properties of functional data that can be used to better target the associated analytic methods:
• Continuity is the property of the observed functions, Wi(s), and true functional processes, Xi(s), which allows them to be sampled at a higher or lower resolution within S.
• Ordering is the property of the functional domain, S, which can be ordered and has a distance.
• Self-consistency is the property of the observed functions, Wi(s), and true functional processes, Xi(s), to be on the same scale and have the same interpretation for all experimental units, i, and functional arguments, s.
• Smoothness is the property of the true functional process, Xi(s), which is not expected to change substantially for small changes in the functional argument, s.
• Colocalization is the property of the functional argument, s, which has the same interpretation for all observed functions, Wi(s), and true functional processes, Xi(s).
These properties differentiate functional from multivariate data. As the functional argument, s ∈ S, is often time or space, FDA can be used for modeling temporal and/or spatial processes. However, there is a fundamental difference between FDA and spatio-temporal processes. Indeed, FDA assumes that the observed functions, Wi(s), and true functional processes, Xi(s), depend on and are indexed by the experimental unit i. This means that there are many repetitions of the time series or spatial processes, which is not the case for time series or spatial analysis.
The FDA framework serves to guide methods development, interpretation, and exploratory analysis. We emphasize that the concept of continuously observed functions differs from the practical reality that functions are observed over discrete grids that can be dense or sparse, regularly spaced or irregular, and common or unique across functional observations. Put differently, in practice, functional data are multivariate data with specific properties. Tools for understanding functional data must bridge the conceptual and practical to produce useful insights that reflect the data-generating and observation processes.
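The following minimal sketch (ours, not the book’s) makes this concrete by simulating noisy functional observations Wi(s) = Xi(s) + εi(s) for n participants on a common, equally spaced grid; all names and parameter values are illustrative.
#Simulate noisy functional data observed on a common grid
set.seed(1)
n <- 50                           #number of study participants
s <- seq(0, 1, length.out = 100)  #common sampling grid in S = [0, 1]
#True processes Xi(s): a smooth mean plus participant-specific variation
X <- t(sapply(1:n, function(i) sin(2 * pi * s) + rnorm(1, sd = 0.5) * cos(2 * pi * s)))
#Observed functions Wi(s): true processes plus independent noise
W <- X + matrix(rnorm(n * length(s), sd = 0.25), n, length(s))
#Spaghetti plot of the observed curves
matplot(s, t(W), type = "l", lty = 1, col = gray(0.2, 0.3), xlab = "s", ylab = "W(s)")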
FDA has a long and rich tradition. Its beginnings can be traced at least to a paper by C.R. Rao [247], who proposed to use Principal Component Analysis (PCA), a multivariate method, to analyze growth curves. Several monographs on FDA already exist, including [86, 153, 242, 245]. In addition, several survey papers provide insights into current developments [154, 205, 250, 299, 307]. This book is designed to complement the existing literature by focusing on methods that (1) combine parametric, nonparametric, and mixed effects components; (2) provide statistically principled approaches for estimation and inference; (3) allow users to seamlessly add or remove model components; (4) are associated with high-quality, fast, and easy-to-modify R software; and (5) are intuitive and friendly to scientific applications.
This book provides an introduction to FDA with R [240]. Two packages will be used throughout the book: (1) refund [105], which contains a large number of FDA models and many of the data sets used for illustration in this book; and (2) mgcv [317, 319], a powerful inferential software package developed for semiparametric inference. We will show how this software, originally developed for semiparametric regression, can be adapted to FDA. This is a crucial contribution of the book, which is built around the idea of providing tools that can be readily used in practice. The book is accompanied by the web page http://www.FunctionalDataAnalysis.com, which contains vignettes and R software for each chapter of this book. All vignettes use the refund and mgcv packages, which are available from CRAN and can be loaded into R [240] as follows.
#Load the refund package (FDA methods and example data sets)
library(refund)
#Load the mgcv package (semiparametric smoothing and inference)
library(mgcv)
General-purpose, stable, and fast software is the key to increasing the popularity of FDA methods. The book will present the current version of the software, while acknowledging that software is changing much faster than methodology. Thus, the book will change slowly, while the web page http://www.FunctionalDataAnalysis.com and accompanying vignettes will be adapted to the latest developments.
1.2 Examples
We now introduce several examples that illustrate the ubiquity and complexity of functional
data in modern research, and that will be revisited throughout the book. These examples
highlight various types of functional data sampling, including dense, regularly-spaced grids
that are common across participants, and sparse, irregular observations for each participant.
1.2.1 NHANES 2011–2014 Accelerometry Data
The National Health and Nutrition Examination Survey (NHANES) is a large, ongoing,
cross-sectional study of the non-institutionalized US population conducted by the Centers
for Disease Control and Prevention (CDC) in two-year waves using a multi-stage stratified
sampling scheme. NHANES collects a vast array of demographic, socioeconomic, lifestyle
and medical data, though the exact data collected and population samples vary from year
to year. The wrist-worn accelerometry data collected in the NHANES 2011–2012 and 2013–
2014 waves were released in December 2020. This data set is of particular interest because
(1) it is publicly available and linked to the National Death Index (NDI) by the National
Center for Health Statistics (NCHS); (2) it was collected from the wrist and processed
into “monitor-independent movement summary” (MIMS) units [138] using an open-source, reproducible algorithm (https://bit.ly/3cDnRBF); and (3) the protocol required 24-hour continuous wear of the wrist accelerometers, including during sleep, for multiple days for
each study participant.
In total there were 14,693 study participants who agreed to wear an accelerometer. To ensure the quality of the accelerometry data for each subject, we excluded study participants who had fewer than three good days (2,083 study participants excluded), where a good day is defined as having at least 95% “good data.” “Good data” is defined as data that was flagged as “wear” (PAXPREDM ∈ {1, 2, 4}) and did not have a quality problem flag (PAXFLGSM = “”) in the NHANES data set. The final data set has 12,610 study participants with an average age of 36.90 years and 51.23% females. Note that the variable name “gender” used in the data set and elsewhere is taken directly from the framing of the questions in NHANES, and is not intended to conflate sex and gender. The proportions of Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, Mexican American, Other Hispanic and Other Race were 35.17%, 24.81%, 11.01%, 15.16%, 9.87%, and 3.98%, respectively. The data set is too large to be made available through refund, but it is available from the website http://www.FunctionalDataAnalysis.com associated with this book.
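As a rough illustration of this inclusion rule, the sketch below computes day-level quality flags; the input minute_df and its column names (SEQN, day, PAXPREDM, PAXFLGSM) are hypothetical stand-ins for the processed NHANES files, not the book’s code.
#Flag good minutes, good days, and participants with at least 3 good days
library(dplyr)
good_days <- minute_df %>%
  mutate(good_minute = PAXPREDM %in% c(1, 2, 4) & PAXFLGSM == "") %>%
  group_by(SEQN, day) %>%
  summarize(prop_good = mean(good_minute), .groups = "drop") %>%
  filter(prop_good >= 0.95)       #a good day: at least 95% good data
keep_ids <- good_days %>%
  count(SEQN) %>%
  filter(n >= 3) %>%              #at least three good days per person
  pull(SEQN)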
Figure 1.1 displays objective physical activity data measured in MIMS for three study participants in the NHANES 2011–2014. Each panel column corresponds to one study participant and each panel row corresponds to a day of the week. The first study participant (labeled SEQN 75111) had seven days of data labeled Sunday through Saturday. The second study participant (labeled SEQN 77936) had five days of data labeled Tuesday through Saturday. The third study participant (labeled SEQN 82410) had six days of data that included all days of the week except Friday. This happened because the data recorded on
Friday had less than 95% of “good data” and were therefore excluded. The x-axis for each panel is time in one-minute increments from midnight (beginning of the day) to midnight (end of the day). The y-axis is MIMS, a measure of physical activity intensity.
FIGURE 1.1: Physical activity data measured in MIMS for three study participants in the NHANES 2011–2014 summarized at every minute of the day. Each study participant is shown in one column and each row corresponds to a day of the week. The x-axis in each panel is time in one-minute increments from midnight to midnight.
Some features of the data become apparent during visual inspection of Figure 1.1. First,
activity during the night (0–6 AM) is reduced for the first two study participants, but
not for the third. Indeed, study participant SEQN 82410 has clearly more activity during
the night than during the day (note the consistent dip in activity between 12 PM and 6
PM). Second, there is substantial heterogeneity of the data from one minute to another
both within and between days. Third, data are positive and exhibit substantial skewness.
Fourth, the patterns of activity of study participant SEQN 75111 on Saturday and Sunday
are quite different from their pattern of activity on the other days of the week. Fifth, there
seems to be some day-to-day within-individual consistency of observations.
Having multiple days of minute-level physical activity for the same individual increases
the complexity and size of the data. A potential solution is to take averages at the same
time of the day within study participants. This is equivalent to averaging the curves in
Figure 1.1 by column at the same time of the day. This reduces the data to one function per
study participant, but ignores the visit-to-visit variability around the person-specific mean.
To illustrate the population-level data structure, Figure 1.2 displays the smooth means of several groups within NHANES. Data were smoothed for visualization purposes; technical details on smoothing are discussed in Section 2.3. The left panel displays the average physical activity data for individuals who died (blue line) and survived (red line). Mortality indicators were based on the NHANES mortality release file that included events up to December 31, 2019. Mortality information was available for 8,713 of the 12,610 study participants. There were 832 deceased individuals and 7,881 who were still alive on December 31, 2019. The plot indicates that individuals who did not die had, on average, higher physical activity throughout the day, with larger differences between 8 AM and 11 PM. This result is consistent with the published literature on the association between physical activity and mortality; see, for example, [64, 65, 136, 170, 259, 275, 292].
FIGURE 1.2: Average physical activity data (expressed in MIMS) in NHANES 2011–2014 as a function of the minute of the day in different groups. Left panel: deceased (blue line) and alive individuals (red line) as of December 31, 2019. Right panel: females (dashed lines) and males (solid lines) within age groups [18, 35] (red), (35, 50] (orange), (50, 65] (light blue), and (65, 80] (dark blue).
The right panel in Figure 1.2 displays the smooth average curves for groups stratified by age and gender. For illustration purposes, four age groups (in years) were used and identified by a different color: [18, 35] (red), (35, 50] (orange), (50, 65] (light blue), and (65, 80] (dark blue). Within each age group, data for females is shown as dashed lines and for males as solid lines. In all subgroups physical activity averages are lower at night, increase sharply in the morning, and remain high during the day. The averages for the (50, 65] and (65, 80] age groups exhibit a steady decrease during the day. This pattern is not apparent in the younger age groups. These findings are consistent with the activity patterns described in [265, 327].
In addition, for every age group, the average activity during the day is higher for females
compared to males. During the night, females have the same or slightly lower activity than
males. These results contradict the widely cited literature [296] which indicated that “Males
are more physically active than females.” However, they are consistent with [327], which
found that women are more active than men, especially among older individuals.
Rich, complex data as displayed in Figures 1.1 and 1.2 suggest multiple scientific problems, including (1) quantifying the association between physical activity patterns and health outcomes (e.g., prevalent diabetes or stroke) with or without adjustment for other covariates (e.g., age, gender, body mass index); (2) identifying which specific components of physical activity data are most predictive of future health outcomes (e.g., incident mortality or cardiovascular events); (3) visualizing the directions of variation in the data; (4) investigating whether clusters exist and if they are scientifically meaningful; (5) evaluating transformations of the data that may provide complementary information; (6) developing prediction methods for missing observations (e.g., one hour of missing data for a person); (7) quantifying whether the timing or fragmentation of physical activity provides additional information above and beyond summary statistics (e.g., mean, standard deviation over the day); (8) studying how much data are needed to identify a particular study participant; (9) predicting the activity for the rest of the day given data up to a particular time and day (e.g., 12 PM on Sunday); (10) determining what levels of data aggregation (e.g., minute, hour, day) may be most useful for specific scientific questions; and (11) proposing data generating mechanisms that could produce data similar to the observed data.
The daily physical activity curves have all the properties that define functional data:
continuity, ordering, self-consistency, smoothness, and colocalization. The measured process
is continuous, as physical activity is continuous. While MIMS were summarized at the
minute level, data aggregation could have been done at a finer (e.g., ten-, or one-second
intervals) or coarser (e.g., one- or two-hour intervals) scale. The functional data have the
ordering property, because the functional argument is time during the day, which is both
ordered and has a well-defined distance. The data and the measured process have the self-
consistency property because all observations are expressed in MIMS at the minute level.
The true functional process can be assumed to have the smoothness property, as one does not
expect physical activity to change substantially over short periods of time (e.g., one second).
The functional argument has the colocalization property, as the time when physical activity
is measured (e.g., 12:00 PM) has the same interpretation for every study participant and
day of measurement.
The observed data can be denoted as a function Wim : S → R+, where Wim(s) is the MIMS measurement at minute s ∈ S = {1, . . . , 1440} and day m = 1, . . . , Mi, where Mi is the number of days with high-quality physical activity data for study participant i. Data complexity could be reduced by taking the average Wi(s) = (1/Mi) Σ_{m=1}^{Mi} Wim(s) at every minute s, or the average over days and minutes Wi = {1/(Mi |S|)} Σ_{s=1}^{|S|} Σ_{m=1}^{Mi} Wim(s), where |S| denotes the number of elements in the domain S. Such reductions in complexity improve interpretability and make analyses easier, though some information may be lost. Deciding at what level to summarize the data without losing crucial information is an important goal of FDA.
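A minimal sketch of these two reductions (ours, not the book’s): assume mims is an Mi × 1440 numeric matrix for one participant, with rows indexing days and columns indexing minutes.
#Average over days at each minute: the function Wi(s), s = 1, ..., 1440
W_i_s <- colMeans(mims, na.rm = TRUE)
#Average over days and minutes: the scalar summary Wi
W_i <- mean(mims, na.rm = TRUE)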
Here we have identified the domain of the functions Wim(·) as S = {1, . . . , 1440}, which
is a finite set in R and does not satisfy the basic requirement that S is an interval. This
could be a major limitation as basic concepts such as continuity or smoothness of the
functions cannot be defined on the sampling domain S = {1, . . . , 1440}. This is due to
practical limitations of sampling that can only be done at a finite number of points. Here the
theoretical domain is [0, 1440] minutes, or [0, 24] hours, or [0, 1] days, depending on how we
normalize the domain. Recall that the functions have the continuity property, which assumes
that the function could be measured anywhere within this theoretical domain. While not
formally correct, we will refer to both of these domains as S to simplify the exposition;
whenever necessary we will indicate more precisely when we refer to the theoretical (e.g.,
S = [0, 1440]) or sampling (e.g., S = {1, . . . , 1440}) domain. This slight abuse of notation
will be used throughout the book and clarifications will be added, as needed.
1.2.2 COVID-19 US Mortality Data
COVID-19 is an infectious disease caused by the SARS-Cov-2 virus that was first identi-
fied in Wuhan, China in 2019. The virus spreads primarily via airborne mechanisms. In
COVID-19, “CO” stands for corona, “VI” for virus, “D” for disease, and 19 for 2019, the
first year the virus was identified in humans. According to the World Health Organization,
COVID-19 has become a world pandemic with more than 767 million confirmed infections
and almost 7 million confirmed deaths in virtually every country of the world by June 6,
2023 (https://covid19.who.int/). Here we focus on mortality data collected in the US
before and during the pandemic. The COVID-19 data used in this book can be loaded using
the following lines of code.
#Load refund
library(refund)
#Load the COVID-19 data
data(COVID19)
Among other variables, this data set contains the US weekly number of all-cause deaths,
weekly number of deaths due to COVID-19 (as assessed on the death certificate), and
population size in the 50 US states plus Puerto Rico and District of Columbia as of July
1, 2020. Figure 1.3 displays the total weekly number of deaths in the US between the week
ending on January 14, 2017 and the week ending on December 12, 2020 for a total of
205 weeks. The original data source is the National Center for Health Statistics (NCHS)
and the data set link is called National and State Estimates of Excess Deaths. It can be
accessed from https://bit.ly/3wjMQBY. The file can be downloaded directly from https://bit.ly/3pMAAaA. The data stored in the COVID19 data set in the refund package contains an analytic version of these data as the variable US_weekly_mort.
In Figure 1.3, each dot corresponds to one week and the number of deaths is expressed in
thousands. For example, there were 61,114 deaths in the US in the week ending on January
14, 2017. Here we are interested in excess mortality in the first 52 weeks of 2020 compared
to the first 52 weeks of 2019. The first week of 2020 is the one ending on January 4, 2020
and the 52nd week is the one ending on December 26, 2020. There were 3,348,951 total
deaths in the US in the first 52 weeks of 2020 (red shaded area in Figure 1.3) and 2,852,747
deaths in the first 52 weeks of 2019 (blue shaded area in Figure 1.3). Thus, there were
496,204 more deaths in the US in the first 52 weeks of 2020 than in the first 52 weeks
of 2019. This is called the (raw) excess mortality in the first 52 weeks of the year. Here
we use this intuitive definition (number of deaths in 2020 minus the number of deaths in
2019), though slightly different definitions can be used. Indeed, note that the population
size increases from 2019 to 2020 and some additional deaths can be due to the increase in
FIGURE 1.3: Total weekly number of deaths in the US between January 14, 2017 and
December 12, 2020. The COVID-19 epidemic is thought to have started in the US sometime
between January and March 2020.
population. For example, the US population was 330,024,493 on December 26, 2020 and
329,147,064 on December 26, 2019 for an increase of 877,429. Using the mortality rate in
2019 of 0.0087 (number of deaths divided by the total population), the expected increase
in number of deaths due to increase in the population would be 7,634. Thus, the number of
deaths associated with the natural increase in population is about 1.5% of the total excess
all-cause deaths in 2020 compared to 2019.
Figure 1.3 displays a higher mortality peak at the end of 2017 and beginning of 2018,
which is likely due to a severe flu season. The CDC estimates that in the 2017–2018 flu
season in the US there were “an estimated 35.5 million people getting sick with influenza,
16.5 million people going to a health care provider for their illness, 490,600 hospitalizations,
and 34,200 deaths from influenza” (https://bit.ly/3H8fa1b).
As indicated in Figure 1.3, the excess mortality can be calculated for every week from the
beginning of 2020. The blue dots in Figure 1.4 display this weekly excess all-cause mortality
as a function of time from January 2020. Excess mortality is positive in every week with an
average of 9,542 excess deaths per week for a total of 496,204 excess deaths in the first 52
weeks. Excess mortality is not a constant function over the year. For example, there were an
average of 1,066 all-cause excess deaths per week between January 1, 2020 and March 14,
2020. In contrast, there were an average of 14,948 all-cause excess deaths per week between
March 28, 2020 and June 23, 2020.
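The raw excess mortality calculation can be sketched in R as below. The week indices used to align the first 52 weeks of 2019 and 2020 are illustrative assumptions and should be checked against the dates stored in the COVID19 data set.
#Sketch of the raw excess mortality calculation; the indices below are
#illustrative and should be checked against the dates in the COVID19 data
deaths_2019 <- COVID19$US_weekly_mort[105:156] #assumed first 52 weeks of 2019
deaths_2020 <- COVID19$US_weekly_mort[157:208] #assumed first 52 weeks of 2020
weekly_excess <- deaths_2020 - deaths_2019     #blue dots in Figure 1.4
sum(weekly_excess)                             #raw excess mortality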
One of the most watched indicators of the severity of the pandemic in the US was
the number of deaths attributed to COVID-19. The data are made available by the US Centers for Disease Control and Prevention (CDC) and can be downloaded directly from
FIGURE 1.4: Total weekly number of deaths attributed to COVID-19 and excess mor-
tality in the US. The x-axis is time expressed in weeks from the first week in 2020. Red
dots correspond to weekly number of deaths attributed to COVID-19. Blue dots indicate
the difference in the total number of deaths between a particular week in 2020 and the
corresponding week in 2019.
https://bit.ly/3iE2xjo. The data stored in the COVID19 data set in the refund package contains an analytic version of these data as the variable US_weekly_mort_CV19. The red
dots in Figure 1.4 represent the weekly mortality attributed to COVID-19 according to the
death certificate. Visually, COVID-19 and all-cause excess mortality have a similar pattern
during the year with some important differences: (1) all-cause excess mortality is larger
than COVID-19 mortality every week; (2) the main association does not seem to be delayed
(lagged) in either direction; and (3) the difference between all-cause excess and COVID-19
mortality as a proportion of COVID-19 mortality is highest in the summer.
Figure 1.4 indicates that there were more excess deaths than COVID-19 attributed
deaths in each week of 2020. In fact, the total US all-cause excess deaths in the first 52 weeks
of 2020 was 496,204 compared to 365,122 deaths attributed to COVID-19. The difference is
131,082 deaths, or 35.9% more excess deaths than COVID-19 attributed deaths. So, what
are some potential sources for this discrepancy? In some cases, viral infection did occur
and caused death, though the primary cause of death was recorded as something else (e.g.,
cardiac or pulmonary failure). This could happen if death occurred after the infection had
already passed, infection was present and not detected, or infection was present but not
adjudicated as the primary cause of death. In other cases, viral infection did not occur, but
the person died due to mental or physical health stresses, isolation, or deferred health care.
There could also be other reasons that are not immediately apparent.
FIGURE 1.5: Each line represents the cumulative weekly all-cause excess mortality per
million for each US state plus Puerto Rico and District of Columbia. Five states are empha-
sized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California
(plum).
In addition to data aggregated at the US national level, the COVID19 data contains similar data for each state plus Puerto Rico and the District of Columbia. The all-cause weekly excess mortality data for each state in the US is stored as the variable States_excess_mortality in the COVID19 data set.
Figure 1.5 displays the total cumulative all-cause excess mortality per million in every
state in the US, Puerto Rico and District of Columbia. For each state, the weekly excess
mortality was obtained as described for the US in Figures 1.3 and 1.4. For every week, the
cumulative excess mortality was calculated by adding the excess mortality for every week
up to and including the current week. To make data comparable across states, cumulative
excess mortality was then divided by the estimated population of the state or territory on
July 1, 2020 and multiplied by 1,000,000. Every line represents a state or territory with the
trajectory for five states being emphasized: New Jersey (green), Louisiana (red), Maryland
(blue), Texas (salmon), and California (plum). For example, New Jersey had 1,916 excess
all-cause deaths per one million residents by April 30, 2020. This corresponds to a total of
17,019 excess all-cause deaths by April 30, 2020 because the population of New Jersey was
8,882,371 as of July 1, 2020 (the reference date for the population size).
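The standardization used in Figure 1.5 can be sketched in R as below. We assume that States_excess_mortality is a matrix with states in rows and weeks in columns; the name States_population for the July 1, 2020 population sizes is also an assumption that should be checked against the COVID19 documentation.
#Sketch of the cumulative excess mortality per million; the matrix
#orientation and the name States_population are assumptions
excess <- COVID19$States_excess_mortality
pop <- COVID19$States_population
#Cumulative sums over weeks within each state, standardized per million
cum_excess_pm <- t(apply(excess, 1, cumsum)) / pop * 1e6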
The trajectories for individual states exhibit substantial heterogeneity. For example,
New Jersey had the largest number of excess deaths per million in the US. Most of these
excess deaths were accumulated in the April–June period, with fewer between June and
November, and another increase in December. In contrast, California had a much lower
cumulative excess number of deaths per million, with a roughly constant increase during
FIGURE 1.6: Each line represents the cumulative COVID-19 mortality for each US state
plus Puerto Rico and District of Columbia in 2020. Cumulative means that numbers are
added as weeks go by. Five states are emphasized: New Jersey (green), Louisiana (red),
Maryland (blue), Texas (salmon), and California (plum).
2020. Maryland had about a third the number of excess deaths per million of New Jersey
at the end of June and about half by the end of December.
We now investigate the number of weekly deaths attributed to COVID-19 for each state
in the US, which is stored as the variable States_CV19_mortality in the COVID19 data set.
Figure 1.6 is similar to Figure 1.5, but it displays the cumulative number of deaths attributed
to COVID-19 for each state per million residents. Each line corresponds to a state and a
few states are emphasized using the same color scheme as in Figure 1.5. The y-axis was kept
the same as in Figure 1.5 to illustrate that, in general, the number of cumulative COVID-19
deaths tends to be lower than the excess all-cause mortality. However, the main patterns
exhibit substantial similarities.
There are many scientific and methodological problems that arise from such a data set.
Here are a few examples: (1) quantifying the all-cause and COVID-19 mortality at the state
level as a function of time; (2) identifying whether the observed trajectories are affected
by geography, population characteristics, weather, mitigating interventions, or intervention
compliance; (3) investigating whether the strength of the association between reported
COVID-19 and all-cause excess mortality varies with time; (4) identifying which states are
the largest contributors to the observed excess mortality in the January–March period; (5)
quantifying the main directions of variation and clusters of state-specific mortality patterns;
(6) evaluating the distribution of the difference between all-cause excess and COVID-19
deaths as a function of state and time; (7) predicting the number of COVID-19 deaths
and infections based on the excess number of deaths; (8) evaluating dynamic prediction
models for mortality trajectories; (9) comparing different data transformations for analysis,
visualization, and communication of results; and (10) using data from countries with good
health statistics systems to estimate the burden of COVID-19 in other countries using
all-cause excess mortality.
In the COVID-19 example it is not immediately clear that data could be viewed as
functional. However, the partitioning of the data by state suggests that such an approach
could be useful, at least for visualization purposes. Note that data in Figures 1.5 and 1.6
are curves evaluated at every week of 2020. Thus, the measured process is continuous, as
observations could have been taken at a much finer (e.g., days or hours) or coarser (e.g.,
every month) time scale. Data are ordered by calendar time and are self-consistent because
the number or proportion of deaths has the same interpretation for each state and every
week. Moreover, one can assume that the true number of deaths is a smooth process as the
number of deaths is not expected to change substantially for small changes in time (e.g.,
one hour). Data are also colocalized, as calendar time has the same interpretation for each
state and territory.
The observed data can be denoted as functions Wim : S → R+, where Wim(s) is the
number or cumulative number of deaths in state i per one million residents at time s ∈
S = {1, . . . , 52}. Here m ∈ {1, 2} denotes all-cause excess mortality (m = 1) and COVID-19 attributed mortality (m = 2), respectively. Because each m refers to different types of
measurements on the same unit (in this case, US state), this type of data is referred to
as “multivariate” functional data. Observations can be modeled as scalars by focusing, for
example, on Wim(s) at one s at a time or on the average of Wim(s) over s for one m. FDA
focuses on analyzing the entire function or combination of functions, extracting information
using fewer assumptions, and suggesting functional summaries that may not be immediately
evident. Most importantly, FDA provides techniques for data visualization and exploratory
data analysis (EDA) in the original or a transformed data space.
Just as in the case of NHANES physical activity data, the domain of the functions Wim(·)
is S = {1, . . . , 52} expressed in weeks, which is a finite set that is not an interval. This is due
to practical limitations of sampling that can only be done at a finite number of points. Here
the theoretical domain is [0, 52] weeks, or [0, 12] months, or [0, 1] years, depending on how
we normalize the domain. Recall that the functions have the continuity property, which
assumes that the function could be measured anywhere within this theoretical domain.
While not formally correct, we will refer to both of these domains as S to simplify the
exposition.
1.2.3 CD4 Counts Data
Human immunodeficiency virus (HIV) attacks CD4 cells, which are an essential part of the
human immune system. This reduces the concentration of CD4 cells in the blood, which
affects their ability to signal other types of immune cells. Ultimately, this compromises the
immune system and substantially reduces the human body’s ability to fight off infections.
Therefore, the CD4 cell count per milliliter of blood is a widely used measure of HIV pro-
gression. The CD4 counts data used in this book can be loaded as follows.
#Load refund
library(refund)
#Load the CD4 data
data(cd4)
This data contains the CD4 cell counts for 366 HIV-infected individuals from the Multicenter AIDS Cohort Study (MACS) [66, 144]. We would like to thank Professor Peter Diggle
for making this important de-identified data publicly available on his website and for giving
FIGURE 1.7: Each line represents the log CD4 count as a function of month, where month
zero corresponds to seroconversion. Five study participants are identified using colors: green,
red, blue, salmon, and plum.
us the permission to use it in this book. We would also like to thank the participants in
this MACS sub-study. Figure 1.7 displays the log CD4 count for up to 18 months before
and 42 months after sero-conversion. Each line represents the log CD4 count for one study
participant as a function of month, where month zero corresponds to sero-conversion.
There are a total of 1,888 data points, with between 1 and 11 (median 5) observations
per study participant. Five study participants are highlighted using colors: green, red, blue,
salmon, and plum. Some of the characteristics of these data include (1) there are few obser-
vations per curve; (2) the time of observations is not synchronized across individuals; and (3)
there is substantial visit-to-visit variation in log CD4 counts before and after seroconversion.
Figure 1.8 displays the same data as Figure 1.7 together with the raw (cyan dots)
and smooth (dark red line) estimator of the mean. The raw mean is the average of log CD4
counts of study participants who had a visit at that time. The raw mean exhibits substantial
variation and has a missing observation at time t = 0. The smooth mean estimator captures
the general shape of the raw estimator, but provides a more interpretable summary. For
example, the smooth estimator is relatively constant before seroconversion, declines rapidly
in the first 10–15 months after seroconversion, and continues to decline, but much slower
after month 15. These characteristics are not immediately apparent in the raw mean or in the person-specific log CD4 trajectories displayed in Figure 1.7.
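A sketch of how the two mean estimators in Figure 1.8 could be computed is shown below. It assumes that the cd4 matrix stores raw counts on the common monthly grid −18, . . . , 42; the penalized spline from the mgcv package is one reasonable smoother, not necessarily the one used to produce the figure.
#Sketch of the raw and smooth mean estimators in Figure 1.8
library(refund)
library(mgcv)
data(cd4)
months <- -18:42
#Raw mean: average the available log CD4 counts at each month
raw_mean <- colMeans(log(cd4), na.rm = TRUE)
#Smooth mean: penalized spline fit to all points, pooled across participants
df <- data.frame(y = as.vector(log(cd4)),
                 month = rep(months, each = nrow(cd4)))
fit <- gam(y ~ s(month), data = df) #rows with missing y are dropped
smooth_mean <- predict(fit, newdata = data.frame(month = months))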
There are many scientific and methodological problems suggested by the CD4 data
set. Here we identify a few: (1) estimating the time-varying mean, standard deviation and
FIGURE 1.8: Each gray line represents the log CD4 count as a function of month, where
month zero corresponds to seroconversion. The point-wise raw mean is shown as cyan dots.
The smooth estimator of the mean is shown as a dark red line.
quantiles of the log CD4 counts as a function of time; (2) producing confidence intervals for
these time-varying population parameters; (3) identifying whether there are specific sub-
groups that have different patterns over time; (4) designing analytic methods that work
with sparse data (few observations per curve that are not synchronized across individuals);
(5) predicting log CD4 observations for each individual at months when measurements were
not taken; (6) predicting the future observations for one individual given observations up to
a certain point (e.g., 10 months after seroconversion); (7) constructing confidence intervals
for these predictions; (8) quantifying the month-to-month measurement error (fluctuations
along the long-term trend); (9) studying whether the month-to-month measurement error
depends on person-specific characteristics, including average log CD4 count; and (10) de-
signing realistic simulation studies that mimic the observed data structure to evaluate the
performance of analytic methods.
Data displayed in Figures 1.7 and 1.8 are observed at discrete time points and with
substantial visit-to-visit variability. We leave it as an exercise to argue that the CD4 data
has the characteristics of functional data: continuity, ordering, self-consistency, smoothness,
and colocalization.
The observed data has the structure {sij, Wi(sij)}, where Wi(sij) is the log CD4 count
at time sij ∈ S = {−18, −17, . . . , 42}. Here i = 1, . . . , n is study participant, j = 1, . . . , pi
is the observation number, and pi is the number of observations for study participant
i. In statistics, this data structure is often encountered in longitudinal studies and is
typically modeled using linear mixed effects (LME) models [66, 87, 161, 196]. LMEs use
a pre-specified, typically parsimonious, structure of random effects (e.g., random inter-
cepts and slopes) to capture the person-specific curves. Functional data analysis comple-
ments LMEs by considering more complex and/or data-dependent designs of random effects
[134, 254, 255, 283, 328, 334, 336]. It is worth noting that this data structure and problem
are equivalent to the matrix completion problem [29, 30, 214, 312]. Statistical approaches
can handle different levels of measurement error in the matrix entries, and produce both
point estimators and the associated uncertainty for each matrix entry.
In this example, one could think about the sampling domain as being S =
{−18, −17, . . . , 42} expressed in months. This is a finite set that is not an interval. The
theoretical domain is [−18, 42] in months from seroconversion, though the interval could
be normalized to [0, 1]. The difference from the NHANES and COVID-19 data sets is that
observations are not available at every point in S = {−18, −17, . . . , 42} for each individual.
Indeed, the minimum number of observations per individual is 1 and the maximum is 11,
with a median number of observations of 5, which is 100×5/(42+19) = 8.2% of the number
of possible time points between −18 and 42. This type of data is referred to in statistics as
“sparse functional data.” In strict mathematical terms this is a misnomer, as the sampling
domain S = {−18, −17, . . . , 42} is itself mathematically sparse in R. Here we will use the
definition that sparse functional data are observed functions Wi(sij) where j = 1, . . . , pi, pi
is small (at most 20) at sampling points sij that are not identical across study participants.
Note that this is a property of the observed data Wi(sij) and not of the true underlying
process, Xi(s), which could be observed/sampled at any point in [−18, 42]. While this defi-
nition is imprecise, it should be intuitive enough for the intents and purposes of this book.
We acknowledge that there may be other definitions and also that there is a continuum of
scientific examples between “dense, equally spaced functional data” and “sparse, unequally
spaced functional data.”
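These sparsity summaries can be verified directly from the cd4 matrix, as in the short sketch below.
#Sketch summarizing the sparsity of the CD4 data; NA marks months
#without a visit
p_i <- rowSums(!is.na(cd4)) #number of observations per study participant
range(p_i)                  #between 1 and 11
median(p_i)                 #median of 5
sum(p_i)                    #1,888 data points in total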
1.2.4 The CONTENT Child Growth Study
The CONTENT child growth study (referred to in this book as the CONTENT study) was
funded by the Sixth Framework Programme of the European Union, Project CONTENT
(INCO-DEV-3-032136) and was led by Dr. William Checkley. The study was conducted
between May 2007 and February 2011 in Las Pampas de San Juan Miraflores and Nuevo
Paraíso, two peri-urban shanty towns with high population density located on the southern
edge of Lima city in Peru. The shanty towns had approximately 40,000 residents with 25%
of the population under the age of 5 [38, 39]. A simple census was conducted to identify
pregnant women and children less than 3 months of age. Eligible newborns and pregnant
women were randomly selected from the census and invited to participate in the study.
Only one newborn was recruited per household. Written informed consent was required
from parents or guardians before enrollment. The study design was that of a longitudinal
cohort study with the primary objective to assess if infection with Helicobacter pylori (H.
pylori) increases the risk of diarrhea, which, in turn, could adversely affect the growth in
children less than 2 years of age [131].
Anthropometric data were obtained longitudinally on 197 children weekly until the child
was 3 months of age, every two weeks between three and 11 months of age, and once monthly
thereafter for the remainder of follow-up up to age 2. Here we will focus on child length
and weight, both measured at the same visits. Even if visits were designed to be equally
spaced, they were obtained within different days of each sampling period. For example, the
observation on week four for a child could be on day 22 or 25, depending on the availability
of the contact person, day of the week, or on the researchers who conducted the visit.
FIGURE 1.9: Longitudinal observations of z-score for length (zlen, first column) and z-score
for weight (zwei, second column) shown for males (first row) and females (second row) as a
function of day from birth. Data for two boys (shown as light and dark shades of red) and
two girls (shown as light and dark shades of blue) are highlighted. The same shade of color
identifies the same individual.
Moreover, not all planned visits were completed, which provided the data a quasi-sparse
structure, as observations are not temporally synchronized across children.
We would like to thank Dr. William Checkley for making this important de-identified
data publicly available and to the members of the communities of Pampas de San Juan
de Miraflores and Nuevo Paraíso who participated in this study. The data can be loaded
directly using the refund R package as follows.
#Load refund
library(refund)
#Load the CONTENT data
data(content)
Figure 1.9 provides an illustration of the z-score for length (zlen) and z-score for weight
(zwei) variables collected in the CONTENT study. Data are also available on the origi-
nal scale, though for illustration purposes here we display these normalized measures. For
example, the zlen measurement is obtained by subtracting the mean and dividing by the
standard deviation of height for a given age of children as provided by the World Health
Organization (WHO) growth charts.
Even though the study was designed to collect data up to age 2 (24 months), for visu-
alization purposes, observations are displayed only through day 600, as data become very
FIGURE 1.10: Histogram of the number of days from birth in the CONTENT study. There
are a total of 4,405 observations for 197 children.
sparse thereafter. Data for every individual are shown as a light gray line and four different
panels display the zlen (first column) and zwei (second column) variables as a function of
day from birth separately for males (first row) and females (second row). Data for two boys are highlighted in the first row of panels in red. The lighter and darker shades of red are used
to identify the same individual in the two panels. A similar strategy is used to highlight
two girls using lighter and darker shades of blue. Note, for example, that both girls who
are highlighted start at about the same length and weight z-score, but their trajectories
are substantially different. The z-scores increase for both height and weight for the first girl
(data shown in darker blue) and decrease for the second girl (data shown in light blue).
Moreover, after day 250 the second girl seems to reverse the downward trend in the z-score
for weight, though that does not happen with her z-score for height, which continues to
decrease.
These data were analyzed in [127, 169] to dynamically predict the growth patterns of
children at any time point given the data up to that particular time. Figure 1.10 displays
the histogram of the number of days from birth in the CONTENT study. There are a total
of 4,405 observations for 197 children, out of which 2,006 (45.5% of total) are in the first 100
days and 3,299 (74.9% of total) are in the first 200 days from birth. Observations become
sparser after that, which can also be observed in Figure 1.9.
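The counts above can be recomputed with the sketch below; agedays is the variable used later in this section, while id is an assumed child identifier that should be checked against the content documentation in refund.
#Sketch reproducing the visit counts; id is an assumed identifier
library(refund)
data(content)
length(content$agedays)                   #4,405 observations in total
length(unique(content$id))                #197 children
round(100 * mean(content$agedays <= 100)) #percent of visits in first 100 days
round(100 * mean(content$agedays <= 200)) #percent of visits in first 200 days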
There are several problems suggested by the CONTENT growth study including (1)
estimating the marginal mean, standard deviation and quantiles of anthropometric mea-
surements as a function of time; (2) producing pointwise and joint confidence intervals for
these time-varying parameters; (3) identifying whether there are particular subgroups or in-
dividuals that have distinct patterns or individual observations; (4) conducting estimation
and inference on the individual growth trajectories; (5) quantifying the contemporaneous
and lagged correlations between various anthropometric measures; (6) estimating anthropo-
metric measures when observations were missing; (7) predicting future observations for one
individual given observations up to a certain point (e.g., 6 months after birth); (8) quan-
tifying the month-to-month measurement error and study whether it is differential among
children; (9) developing methods that are designed for multivariate sparse data (few obser-
vations per curve) with the amount of sparsity varying along the observation domain; (10)
identifying outlying observations or patterns of growth that could be used as early warn-
ings of growth stunting; (11) developing methods for studying the longitudinal association
between multivariate growth outcomes and time-dependent exposures, such as infections;
and (12) designing realistic simulation scenarios that mimic the observed data structure to
evaluate the performance of analytic methods.
Data displayed in Figure 1.9 are observed at discrete time points and with substantial
visit-to-visit and participant-to-participant variability. These data have all the characteris-
tics of functional data: continuity, ordering, self-consistency, smoothness, and colocalization.
Indeed, data are continuous because growth curves could be sampled at any time point at
both higher and lower resolutions. The choice for the particular sampling resolution was a
balance between available resources and knowledge about the growth process of humans.
Data are also ordered as observations are sampled in time. That is, we know that a measure-
ment at week 3 was taken before a measurement at month 5 and we know exactly how far
apart the two measurements were taken. Also, the observed and true functional processes
have the self-consistency property as they are expressed in the same units of measurement.
For example, height is always measured in centimeters or is transformed into normalized
measures, such as zlen. Data are also smooth, as the growth process is expected to be grad-
ual and not have large minute-to-minute or even day-to-day fluctuations. Even potential
growth spurts are smooth processes characterized by faster growth but small day-to-day
variation. Observations are also colocalized, as the functional argument, time from birth,
has the same interpretation for all functions. For example, one month from birth means the
same thing for each baby.
The observed functional data in CONTENT has the structure {sij, Wim(sij)}, where
Wim : S → R is the mth anthropometric measurement at time s ∈ S ⊂ [0, 24] (expressed in
months from birth) for study participant i. Here the time of the observations, sij, depends
on the study participant, i, and visit number, j, but not the anthropometric measure, m.
The reason is that if a visit was completed, all anthropometric measures were collected.
However, this may not be the case for all studies and observations may depend on m in
other studies. Each such variation on sampling requires special attention and methods de-
velopment. In this example it is difficult to enumerate the entire sampling domain because
it is too large and observations are not equally spaced. One way to obtain the sampling domain in R is with the following code.
#Find all unique observations
sampling_S <- sort(unique(content$agedays))
A similar notation, Wim(s), was used to describe the NHANES data structure in Sec-
tion 1.2.1. In NHANES m referred to the day number from initiating the accelerometry
study. However, in the CONTENT study, m refers to the type of anthropometric measure.
Thus, while in NHANES functions indexed by m measure the same thing every day (e.g.,
physical activity at 12 PM), in CONTENT each function measures something different (e.g.,
zlen and zwei at month 2). In FDA one typically refers to the NHANES structure as mul-
tilevel and to the CONTENT structure as multivariate functional data. Another difference
is that data are not equally spaced within individuals and are not synchronized across individuals. Thus, the CONTENT data is multivariate (multiple types of measurement), functional (has all the characteristics of functional data), sparse (few observations per curve that are not synchronized across individuals), and unequally spaced (observations were not taken at equal intervals within study participants). The CONTENT data is highly complex and contains additional time-invariant (e.g., sex) and time-varying observations (e.g., bacterial infections).
As with the CD4 counts data presented in Section 1.2.3, the CONTENT data is at the interface between traditional linear mixed effects (LME) models and functional data. While
both approaches can be used, this is an example when FDA approaches are more reasonable,
at least as an exploratory tool to understand the potential hidden complexity of individual
trajectories. In these situations, one also starts to question or even test the standard residual
dependence assumptions in traditional LMEs. In the end, we will show that every FDA is a
form of LME, but this will require some finesse and substantial methodological development.
1.3 Notation and Methodological Challenges
In all examples in Section 1.2, the data consist of functions Wi : S → R, though in the CONTENT example, one could argue that the vector Wi(·) = {Wi1(·), Wi2(·)}, where Wi1(·) and Wi2(·) are the z-scores for length and weight, respectively, takes values in R². Here, the conceptual and practical framing of functional data should be noted: conceptually, the theoretical domain S (where functional data could be observed) is an interval in R or R^M; practically, the sampling domain S (where functional data is actually observed) is a finite subset of points of the theoretical domain. We will, at times, be specific about our use of a particular framing, but frequently the distinction can be elided (or at least inferred from context) without detracting from the clarity of our discussion.
Continuity is an important property of functional data, indicating that measurements
could, in principle, have been taken at any point in the interval spanned by the sampling
domain S. For example, in the NHANES study, data are summarized at every minute of
the day, which results in 1,440 observations per day. However, data could be summarized
at a much finer or coarser resolution. Thus, the domain of the function is considered to be
an interval and, without loss of generality, the [0, 1] interval. In NHANES the start of the
day (midnight or 12:00 AM) would correspond to 0, the end of the day (11:59 PM) would
correspond to 1 and minute s of the day would correspond to (s − 1)/1439.
Most common functional data are of the type Wi : [0, 1] → R, though many variations
exist. An important assumption is that there exists an underlying, true process, Xi : [0, 1] →
R, and Wi(s) provides proxy measurements of Xi(s) at the points where Wi(·) is observed. The observed function is Wi(s) = Xi(s) + εi(s), where the εi(s) are independent noise variables, which could be Gaussian, but could refer to binary, Poisson, or other types of errors.
Thus, FDA assumes that there exists an infinite-dimensional data generating process,
Xi(·), for every study participant, while information is accumulated at a finite number of
points via the measured process, Wi(s), where s ∈ S and S is the sampling domain. This
inferential problem is addressed by a combination of smoothing and simplifying (modeling)
assumptions. The sampling location (s points where Wi(·) are measured), measurement
type (exactly what is measured), and underlying signal structure (the distribution of Xi(·))
raise important methodological problems that need to be addressed to bridge the theoretical
assumption of continuity with the reality of sampling at a finite number of points.
First, connecting the continuity of Xi(·) to the discrete measurement Wi(·) needs to be
done through explicit modeling and assumptions.
Second, the density and number of observations at the study participant level could
vary substantially. Indeed, there could be as few as two or three to as many as hundreds
of millions of observations per study participant. Moreover, observations can be equally or
unequally spaced within and between study participants as well as when aggregated across
study participants. Each of these scenarios raises its own specific set of challenges.
Third, the complexity of individual and population trajectories is a priori unknown. Ex-
tracting information is thus a balancing act between model assumptions and signal structure
often in the presence of substantial noise. As shown in the examples in this chapter, func-
tional data are seldom linear and often non-stationary.
Fourth, the covariance structure within experimental units (e.g., study participants) re-
quires a new set of assumptions that cannot be directly extended from traditional statistical
models. For example, the independence and exchangeability assumptions from longitudinal
data analysis are, at best, suspect in many high-resolution FDA applications. The autoregressive assumption is probably far too restrictive as well, because it implies stationarity of residuals and an exponential decrease of correlation as a function of distance. Moreover,
as sampling points are getting closer together (higher resolution) the structure of correlation
may change substantially. The unstructured correlation assumption is more appropriate for
FDA, but it requires the estimation of a very large dimensional correlation matrix. This
can raise computational challenges for moderate to high-dimensional functions.
Fifth, observed data may be non-Gaussian with high skewness and thicker than normal
tails. While much is known about univariate modeling of such data, much more needs to
be done when the marginal distributions of functional data exhibit such behavior. Binary
or Poisson functional data raise their own specific sets of challenges.
To understand the richness of FDA, one could think of all problems in traditional data
analysis where some of the scalar observations are replaced with functional observations.
This requires new modeling and computational tools to accommodate the change of all
or some measurements from scalars to high-dimensional, highly structured multivariate
vectors, matrices or arrays. The goal of this book is to address these problems by providing
a class of self-contained, coherent analytic methods that are computationally friendly. To
achieve this goal, we need three important components: dimensionality reduction, penalized
smoothing, and unified regression modeling via mixed effects models inference. Chapter 2
will introduce these ideas and principles.
1.4 R Data Structures for Functional Observations
As the preceding text makes clear, there is a contrast between the conceptual and practical
formulations of functional data: conceptually, functional data are continuous and infinite
dimensional, but practically they are observed over discrete grids. This book relies on both
formulations to provide interpretable model structures with concrete software implemen-
tations. We will use a variety of data structures for the storage, manipulation, and use of
functional observations, and discuss these briefly now.
In perhaps the simplest case, functional data are observed over the same equally spaced
grid for each participant or unit of observation. Physical activity is measured at each minute
of the day for each participant in the NHANES data set, and deaths due to COVID-19
are recorded weekly in each state in the COVID-19 US mortality data. A natural way of
representing such data is a matrix in which rows correspond to participants and columns
to the grid over which data are observed.
For illustration purposes, we display below the “wide format” data structure of
the NHANES physical activity data. This is stored in the variable MIMS of the data set nhanes_fda_with_r. This NHANES data consists of a 12,610 × 1,440 matrix, with columns containing MIMS measurements from 12:00 AM to 11:59 PM. Here we approximated the
MIMS up to the second decimal for illustration purposes, so the actual data may vary
slightly upon closer inspection. This data structure is familiar to many statisticians, and
can be useful in the implementation of specific methods, such as Functional Principal Com-
ponent Analysis (FPCA).
#Storage format for the accelerometry data in NHANES data set
nhanes_fda_with_r$MIMS
MIN0001 MIN0002 MIN0003 MIN0004 ... MIN1439 MIN1440
62161 1.11 3.12 1.47 0.94 ... 1.38 1.53
62163 25.15 19.16 17.84 20.33 ... 7.38 15.93
62164 1.92 1.67 2.38 0.93 ... 3.03 4.46
62165 3.98 3.00 1.91 0.89 ... 2.18 0.31
... ... ... ... ... ... ... ...
83730 1.50 2.11 1.34 0.16 ... 1.07 1.14
83731 0.09 0.01 0.49 0.10 ... 0.86 0.46
It is possible to use matrices for data that are somewhat less simple, although care is
required. When data can be observed over the same grid but are sparse for each subject,
a matrix with missing entries can be used. For the CD4 data, observations are recorded
at months before or after seroconversion. The observation grid is integers from −18 to 42,
but any specific participant is measured only at a subset of these values. Data like these
can be stored in a relatively sparse matrix, again with rows for study units and columns
for elements of the observation grid. Our data examples focus on equally spaced grids, but
this is not required for functional data in general or for the use of matrices to store these
observations.
For illustration purposes, we display the CD4 count data in the same “wide format” used
for NHANES. The structure is similar to that of NHANES data, where each row corresponds
to an individual and each column corresponds to a potential sampling point, in this case a
month from seroconversion. However, in the CD4 data example most observations are not
available, as indicated by the NA fields. Indeed, as we discussed, only 1,888 data points are
available out of the 366 × 61 = 22,326 entries of the matrix, or 8.5%. One look at the data matrix, knowing that less than 10% of the entries are observed, immediately suggests that the matrix and the data are “sparse.” Note, however, that this concept refers
to the percent of non-missing entries into a matrix and not to the mathematical concept
of sparsity. In most of the book, “sparsity” will refer to matrix sparsity and not to the
mathematical concept of sparsity of a set.
#Storage format for CD4 data in refund
CD4
-18 -17 -16 -15 -14 -13 -12 -11 -10 -9 ... 41 42
[1,] NA NA NA NA NA NA NA NA NA 548 ... NA NA
[2,] NA NA NA NA NA NA NA NA NA NA ... NA NA
[3,] NA NA NA 846 NA NA NA NA NA 1102 ... NA NA
... ... ... ... ... ... ... ... ... ... ... ... ... ...
[363,] NA NA NA NA NA NA NA NA 1661 NA ... NA NA
[364,] NA NA NA 646 NA NA NA 882 NA NA ... NA NA
[365,] NA NA NA NA NA NA NA NA NA NA ... 294 NA
[366,] NA NA NA NA NA NA NA NA NA NA ... 462 NA
Storing the CD4 data in wide format is not a problem because the matrix is relatively small
and does not take that much memory. However, this format is not efficient and could
be extremely cumbersome when data matrices increase both in terms of number of rows
or columns. The number of columns can increase very quickly when the observations are
irregular across subjects and the union of sampling points across study participants is very
large. In the extreme, but commonly encountered, case when no two observations are taken
at exactly the same time, the number of columns of the matrix would be equal to the total
number of observations for all individuals. Additionally, observation grid values are not
directly accessible, and must be stored as column names or in a separate vector.
Using the “long format” for sparse functional data can address some disadvantages that
are associated with the “wide format.” In particular, a data matrix or frame with columns
for study unit ID, observation grid point, and measurement value can be used for dense or
sparse data and for regular or irregular observation grids, and makes the observation grid
explicit. Below we show the CD4 counts data in “long format,” where all the missing data
are no longer included. The price to pay is that we add the column ID, which contains many
repetitions, while the column time also contains some repetitions to explicitly indicate the
month where the sample was taken.
The long format of the data is much more memory efficient when data are sparse,
though these advantages can disappear or become disadvantages when data become denser.
For example, when the observation grid is common across subjects and there are many
observations for each study participant, the ID and time columns require substantial additional memory without providing additional information. Long format data may also repeat
subject-level covariates for each element of the observation grid, which further exacerbates
memory requirements. Moreover, complexity and memory allocation can increase substan-
tially when multiple functional variables are observed on different observation grids. From
a practical perspective, different software implementations require different data structures,
which can be a reason for frustration. In general refund tends to use the wide format of
the data, whereas our implementation of FDA in mgcv often uses the long format.
#CD4 count data in long format
CD4
CD4 count time ID
548 -9 1
... ... ...
846 -15 3
1102 -9 3
... ... ...
1661 -10 363
... ... ...
646 -15 364
882 -11 364
... ... ...
294 41 365
... ... ...
462 41 366
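As an illustration, the base R sketch below converts the wide-format cd4 matrix into the long format displayed above; the column names ID, time, and CD4_count are chosen for illustration.
#Sketch converting the wide-format cd4 matrix to long format
cd4_long <- data.frame(
  ID        = rep(seq_len(nrow(cd4)), times = ncol(cd4)),
  time      = rep(-18:42, each = nrow(cd4)),
  CD4_count = as.vector(cd4)
)
#Drop the missing entries, keeping the 1,888 observed data points
cd4_long <- cd4_long[!is.na(cd4_long$CD4_count), ]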
Given these considerations, we will use both the wide and long formats and we will
discuss when and how we make the transition between these formats. We recognize the
increased popularity of the tidyverse for visualization and exploratory data analysis, which
prefers the long format of the data. Over the last several years, many R users have gravitated
toward data frames for data storage. This shift has been facilitated by (and arguably is
attributable to) the development of the tidyverse collection of packages, which implement
general-purpose tools for data manipulation, visualization, and analysis.
The tidyfun [261] R package was developed to address issues that arise in the storage,
manipulation, and visualization of functional data. Beginning from the conceptual perspec-
tive that a complete curve is the basic unit of analysis, tidyfun introduces a data type
(tf) that represents and operates on functional data in a way that is analogous to nu-
meric data. This allows functional data to easily sit alongside other (scalar or functional)
observations in a data frame in a way that is integrated with a tidyverse-centric approach to
manipulation, exploratory analysis, and visualization. Where possible, tidyfun conserves
memory by avoiding data duplication.
We will use both the tidyverse and the usualverse (completely made up word) and
we will point out the various approaches to handling the data. In the end, it is a personal
choice of what tools to use, as long as the main inferential engine works.
One can reasonably ask why a book of methods places such an emphasis on data struc-
tures? The reason is that this is a book on “functional data analysis with R” and not a book
on “functional data analysis without R.” Thus, in addition to methods and inference we
emphasize the practical implementation of methods and the combination of data structures,
code, and methods that is amenable to software development.
1.5 Notation
Throughout the book we will attempt to use notation that is consistent across chapters.
This will not be easy or perfect, as functional data analysis can test the limits of reasonable
notation. Indeed, the Latin and Greek alphabet using lower- and uppercase, bold and regular
font were heavily tested by the data structures discussed in this book. To provide some order
ahead of starting the book in earnest we introduce the following notation.
• n: number of study participants
• i: the index for the study participant, i = 1, . . . , n
• S: the sampling or theoretical domain of the observed functions; this will depend on the
context
• Yi: scalar outcome for study participant i
• Wi(sj): observed functional measurement for study participant i and location sj ∈ S,
for j = 1, . . . , p when data are observed on the same grid (dense, equal grid)
• Wi(sij): observed functional measurement for study participant i and location sij ∈ S,
for j = 1, . . . , pi when data are observed on different grids across study participants
(sparse, different grid)
• Wim(·): observed functional measurement for multivariate or multilevel data. For multivariate data m = 1, . . . , M, whereas for multilevel data m = 1, . . . , Mi, though in some instances Mi = M for all i
• Xi(sj), Xi(sij), Xim(·): same as Wi(sj), Wi(sij), Wim(·), but for the underlying, unob-
served, functional process
• Zi: column vector of additional scalar covariates
• vectors: defined as columns and referred to using bold, typically lower case, font
• matrices: referred to using bold, typically upper case, font
2 Key Methodological Concepts
In this chapter we introduce some of the key methodological concepts that will be used
extensively throughout the book. Each method is important in itself, but it is the specific
combination of these methods that provides a coherent infrastructure for FDA inference
and software development. Understanding the details of each approach is not essential for
the application of these methods. Readers who are less interested in a deep dive into these
methods and more interested in applying them can skip this chapter for now.
2.1 Dimension Reduction
Consider the case when functional data are of the form Wraw,i(s) for i = 1, . . . , n and
s ∈ S = {s1, . . . , sp}, where p = |S| is the number of observations in S. Assume that all
functions are measured at the same values, sj, j = 1, . . . , p, and that there are no missing
observations. The centered and normalized functional data is
$$W_i(s_j) = \frac{1}{\sqrt{np}} \left\{ W_{\mathrm{raw},i}(s_j) - \overline{W}_{\mathrm{raw}}(s_j) \right\}\,,$$
where $\overline{W}_{\mathrm{raw}}(s_j) = \frac{1}{n} \sum_{i=1}^{n} W_{\mathrm{raw},i}(s_j)$ is the average of functional observations over study participants at $s_j$. This transformation is not strictly necessary, but will simplify the connection between the discrete observed measurement process and the theoretical underlying continuous process. In particular, dividing by $\sqrt{np}$ will keep measures of data variation comparable when the number of rows (study participants) or columns (data sampling resolution) change.
The data can be organized in an n×p dimensional matrix, W, where the ith row contains
the observations {Wi(sj) : j = 1, . . . , p}. Each row in W corresponds to a study participant
and each column has mean zero. The dimension of the problem refers to p and dimension reduction refers to finding a smaller set of functions, K0 ≪ p, that contains most of the information in the functions {Wi(sj) : j = 1, . . . , p}.
There are many approaches to dimension reduction. Here we focus on two closely re-
lated techniques: Singular Value Decomposition (SVD) and Principal Component Analysis
(PCA). While the linear algebra will get slightly involved, SVD and PCA are essential ana-
lytic tools for high-dimensional FDA. Moreover, the SVD and PCA of any n×p dimensional
matrix can easily be computed in R [240] as described below.
#SVD of matrix W
SVD_of_W <- svd(W)
#PCA of matrix W
PCA_of_W <- princomp(W)
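A short sketch of how the svd output can be used to choose the number of components is given below; it anticipates equations (2.2) and (2.3) in Section 2.1.1 and assumes that W is the centered and normalized data matrix defined above.
#Sketch: proportion of variance explained by the singular vectors
SVD_of_W <- svd(W)
d2 <- SVD_of_W$d^2              #squared singular values
prop_var <- d2 / sum(d2)        #fraction explained by each singular vector
cum_var <- cumsum(prop_var)     #fraction explained by the first K0 vectors
K0 <- which(cum_var >= 0.95)[1] #smallest K0 explaining 95% of the variance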
2.1.1 The Linear Algebra of SVD
The SVD of W is the decomposition $W = U \Sigma V^t$, where $U$ is an $n \times n$ dimensional matrix with the property $U^t U = I_n$, $\Sigma$ is an $n \times p$ dimensional diagonal matrix, and $V$ is a $p \times p$ dimensional matrix with the property $V^t V = I_p$. Here $I_n$ and $I_p$ are the identity matrices of size $n$ and $p$, respectively. The diagonal entries $d_k$ of $\Sigma$, $k = 1, \ldots, K = \min(n, p)$, are called the singular values of W. The columns of $U$, $u_k = \{u_{ik} : i = 1, \ldots, n\}$, and of $V$, $v_k = \{v_k(s_j) : j = 1, \ldots, p\}$, for $k = 1, \ldots, K$, are the left and right singular vectors of W, respectively. The matrix form of the SVD decomposition can be written in entry-wise form for every $s \in S$ as
$$W_i(s) = \sum_{k=1}^{K} d_k u_{ik} v_k(s)\,. \tag{2.1}$$
This provides an explicit linear decomposition of the data in terms of the functions $\{v_k(s_j) : j = 1, \ldots, p\}$, which are the columns of $V$ and form an orthonormal basis in $\mathbb{R}^p$. These right singular vectors are often referred to as the main directions of variation in the functional space. Because the $v_k$ are orthonormal, the coefficients of this decomposition can be obtained as
$$d_k u_{ik} = \sum_{j=1}^{p} W_i(s_j) v_k(s_j)\,.$$
Thus, $d_k u_{ik}$ is the inner product between the $i$th row of W (the data for study participant $i$) and the $k$th column of $V$ (the $k$th principal direction of variation in functional space).
We will show that $\{d_k^2 : k = 1, \ldots, K\}$ quantify the variability of the observed data explained by the vectors $\{v_k(s_j) : j = 1, \ldots, p\}$ for $k = 1, \ldots, K$. The total variance of the original data is
$$\frac{1}{np} \sum_{i=1}^{n} \sum_{j=1}^{p} \{W_{\mathrm{raw},i}(s_j) - \overline{W}_{\mathrm{raw}}(s_j)\}^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} W_i^2(s_j)\,,$$
which is equal to $\mathrm{tr}(W^t W) = \mathrm{tr}(V \Sigma^t U^t U \Sigma V^t)$, where $\mathrm{tr}(A)$ denotes the trace of matrix $A$. As $U^t U = I_n$, $\mathrm{tr}(W^t W) = \mathrm{tr}(V \Sigma^t \Sigma V^t) = \mathrm{tr}(\Sigma^t \Sigma V^t V)$, where we used the property that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for $A = V$ and $B = \Sigma^t \Sigma V^t$. As $V^t V = I_p$ and $\mathrm{tr}(\Sigma^t \Sigma) = \sum_{k=1}^{K} d_k^2$, it follows that
$$\sum_{i=1}^{n} \sum_{j=1}^{p} W_i^2(s_j) = \sum_{k=1}^{K} d_k^2\,, \tag{2.2}$$
indicating that the total variance is equal to the sum of squares of the singular values.
In practice, for every $s \in S$, $W_i(s)$ is often approximated by $\sum_{k=1}^{K_0} d_k u_{ik} v_k(s)$, that is, by the first $K_0$ right singular vectors, where $0 \le K_0 \le K$. We now quantify the variance explained by these $K_0$ right singular vectors. Denote by $V = [V_{K_0} | V_{-K_0}]$ the partition of $V$ into the $p \times K_0$ dimensional sub-matrix $V_{K_0}$ and the $p \times (p - K_0)$ dimensional sub-matrix $V_{-K_0}$ containing the first $K_0$ and the last $(p - K_0)$ columns of $V$, respectively. Similarly, denote by $\Sigma_{K_0}$ and $\Sigma_{-K_0}$ the sub-matrices of $\Sigma$ that correspond to the first $K_0$ and last $(K - K_0)$ singular values, respectively. With this notation, $W = U \Sigma_{K_0} V_{K_0}^t + U \Sigma_{-K_0} V_{-K_0}^t$ or, equivalently, $W - U \Sigma_{K_0} V_{K_0}^t = U \Sigma_{-K_0} V_{-K_0}^t$. Using a similar argument to the one for the decomposition of the total variation, we obtain $\mathrm{tr}(V_{-K_0} \Sigma_{-K_0}^t U^t U \Sigma_{-K_0} V_{-K_0}^t) = \sum_{k=K_0+1}^{K} d_k^2$. Therefore,
$$\mathrm{tr}\{(W - U \Sigma_{K_0} V_{K_0}^t)^t (W - U \Sigma_{K_0} V_{K_0}^t)\} = \sum_{k=K_0+1}^{K} d_k^2\,.$$
Changing from matrix to entry-wise notation, this equality becomes
$$\sum_{i=1}^{n} \sum_{j=1}^{p} \Big\{W_i(s_j) - \sum_{k=1}^{K_0} d_k u_{ik} v_k(s_j)\Big\}^2 = \sum_{k=K_0+1}^{K} d_k^2\,. \tag{2.3}$$
Equations (2.2) and (2.3) indicate that the first $K_0$ right singular vectors of W explain $\sum_{k=1}^{K_0} d_k^2$ of the total variance of the data, or a fraction equal to $\sum_{k=1}^{K_0} d_k^2 / \sum_{k=1}^{K} d_k^2$. In many applications the $d_k^2$ decrease quickly with $k$, indicating that only a few $v_k(\cdot)$ functions are enough to capture the variability in the observed data.
It can also be shown that for every $K_0 = 1, \ldots, K$,
$$\sum_{i=1}^{n} \sum_{j=1}^{p} \Big\{W_i(s_j) - \sum_{k \ne K_0} d_k u_{ik} v_k(s_j)\Big\}^2 = d_{K_0}^2\,, \tag{2.4}$$
where the sum over $k \ne K_0$ is over all $k = 1, \ldots, K$ except $K_0$. Thus, the $K_0$th right singular vector explains $d_{K_0}^2$ of the total variance, or a fraction equal to $d_{K_0}^2 / \sum_{k=1}^{K} d_k^2$. The proof is similar to the one for equation (2.3), but partitions the matrix $V$ into a sub-matrix that contains its $K_0$th column vector and a sub-matrix that contains all its other columns.
In summary, equation (2.1) can be rewritten for every $s \in S$ as
$$W_i(s) = \sum_{k=1}^{K_0} d_k u_{ik} v_k(s) + \sum_{k=K_0+1}^{K} d_k u_{ik} v_k(s)\,, \tag{2.5}$$
where $\sum_{k=1}^{K_0} d_k u_{ik} v_k(s)$ is the approximation of $W_i(s)$ and $\sum_{k=K_0+1}^{K} d_k u_{ik} v_k(s)$ is the approximation error with variance equal to $\sum_{k=K_0+1}^{K} d_k^2$. The number $K_0$ is typically chosen to explain a given fraction of the total variance of the data, but other criteria could be used.
We now provide the matrix equivalent of the approximation in equation (2.5). Recall that $W_i(s_j)$ is the $(i,j)$th entry of the matrix W. If $u_k$ and $v_k$ denote the left and right singular vectors of W, the $(i,j)$ entry of the matrix $u_k v_k^t$ is equal to $u_{ik} v_k(s_j)$. Therefore, the matrix format of equation (2.5) is
$$W = \sum_{k=1}^{K_0} d_k u_k v_k^t + \sum_{k=K_0+1}^{K} d_k u_k v_k^t\,. \tag{2.6}$$
The matrix $\sum_{k=1}^{K_0} d_k u_k v_k^t$ is called the rank $K_0$ approximation of W.
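A sketch of the rank K0 approximation and its error is given below; K0 = 3 is an illustrative choice and W is again the centered and normalized data matrix.
#Sketch of the rank K0 approximation in equation (2.6)
SVD_of_W <- svd(W)
K0 <- 3 #illustrative choice
W_K0 <- SVD_of_W$u[, 1:K0] %*%
  diag(SVD_of_W$d[1:K0], nrow = K0) %*% t(SVD_of_W$v[, 1:K0])
#The squared approximation error equals the sum of the remaining d_k^2
sum((W - W_K0)^2)
sum(SVD_of_W$d[-(1:K0)]^2)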
2.1.2 The Link between SVD and PCA
The PCA [140, 229] of W is the decomposition $W^t W = V \Lambda V^t$, where $V$ is the $p \times p$ dimensional matrix with the property $V^t V = I_p$ and $\Lambda$ is a $p \times p$ diagonal matrix with positive elements on the diagonal, $\lambda_1 \ge \ldots \ge \lambda_p \ge 0$. PCA is also known as the discrete Karhunen-Loève transform [143, 184]. Denote by $v_k$, $k = 1, \ldots, K = \min(n, p)$, the $p \times 1$ dimensional column vectors of the matrix $V$. The vector $v_k$ is the $k$th eigenvector of the matrix $W^t W$, corresponds to the eigenvalue $\lambda_k$, and has the property that $W^t W v_k = \lambda_k v_k$. In FDA the $v_k$ vectors are referred to as eigenfunctions. In image analysis the term eigenimages is used instead.
Just as with SVD, the $v_k$ form a set of orthonormal vectors in $\mathbb{R}^p$. It can be shown that every $v_{k+1}$ explains the most residual variability in the data matrix, W, after accounting for the eigenvectors $v_1, \ldots, v_k$. We will show this for $v_1$ first. Note that $W^t W = \sum_{k=1}^{K} \lambda_k v_k v_k^t$.
If $v$ is any $p \times 1$ dimensional vector such that $v^t v = 1$, the variance of $Wv$ is $v^t W^t W v = \sum_{k=1}^{K} \lambda_k v^t v_k v_k^t v$. Denote by $v = \sum_{l=1}^{K} a_l v_l$ the expansion of $v$ in the basis $\{v_l : l = 1, \ldots, K\}$. Because the $v_k$ are orthonormal, $v_k^t v = \sum_{l=1}^{K} a_l v_k^t v_l = a_k$ and $v^t v = \sum_{l=1}^{K} a_l^2 = 1$. Therefore, $v^t W^t W v = \sum_{k=1}^{K} \lambda_k a_k^2 \le \lambda_1 \sum_{k=1}^{K} a_k^2 = \lambda_1$. Equality can be achieved only when $a_1 = 1$ and $a_k = 0$ for $k = 2, \ldots, K$, that is, when $v = v_1$. Thus, $v_1$ is the solution to the problem
$$v_1 = \arg\max_{\|v\| = 1} v^t W^t W v\,. \tag{2.7}$$
Once $v_1$ is known, the projection of the data matrix on $v_1$ is $A_1 v_1^t$ and the residual variation in the data is $W - A_1 v_1^t$, where $A_1$ is an $n \times 1$ dimensional vector. Because the $v_k$ are orthonormal, it can be shown that $A_1 = W v_1$ and the unexplained variation is $(W - W v_1 v_1^t)^t (W - W v_1 v_1^t) = \sum_{k=2}^{K} \lambda_k v_k v_k^t$. Iterating with $W - W v_1 v_1^t$ instead of W, we obtain that the second eigenfunction, $v_2$, maximizes the residual variance after accounting for $v_1$. The process is then iterated.
PCA and SVD are closely connected, as $\mathbf{W}^t\mathbf{W} = \mathbf{V}\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t$. Thus, if the $d_k^2$ are ordered such that $d_1^2 \geq \ldots \geq d_K^2 \geq 0$, the $k$th right singular vector of $\mathbf{W}$ is equal to the $k$th eigenvector of $\mathbf{W}^t\mathbf{W}$ and corresponds to the $k$th eigenvalue $\lambda_k = d_k^2$. Similarly, $\mathbf{W}\mathbf{W}^t = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^t\mathbf{U}^t$, indicating that the $k$th left singular vector of $\mathbf{W}$ is equal to the $k$th eigenvector of $\mathbf{W}\mathbf{W}^t$ and corresponds to the $k$th eigenvalue $\lambda_k = d_k^2$.
SVD and PCA have been developed for multivariate data and can be applied to functional data. There are, however, some specific considerations that apply to FDA: (1) the data $W_i(s)$ are functions of $s$ and are expressed in the same units for all $s$; (2) the mean function, $\bar{W}(s)$, and the main directions of variation in the functional space, $\mathbf{v}_k = \{v_k(s_j), j = 1, \ldots, p\}$, are functions of $s \in S$; (3) these functions inherit and abide by the rules induced by the organization of the space $S$ (e.g., they do not change too much for small variations in $s$); (4) the correlation structure between $W_i(s)$ and $W_i(s')$ may depend on $(s, s')$; and (5) the data may be observed with noise, which may substantially affect the calculation and interpretation of $\{v_k(s_j), j = 1, \ldots, p\}$. For these reasons, FDA often uses smoothing assumptions on $W_i(\cdot)$, $\bar{W}(\cdot)$, and $v_k(\cdot)$. These smoothing assumptions provide a different flavor to PCA and SVD and give rise to functional PCA (FPCA) and functional SVD (FSVD). While FPCA is better known in FDA, FSVD is a powerful technique that is indispensable for higher dimensional (large $p$) applications. A more in-depth look at smoothing in FDA is provided in Section 2.3.
2.1.3 SVD and PCA for High-Dimensional FDA
When data are high-dimensional (large $p$) the $n \times p$ dimensional matrix $\mathbf{W}$ cannot be loaded into memory and SVD cannot be performed. Things are more difficult for PCA, which uses the $p \times p$ dimensional matrix $\mathbf{W}^t\mathbf{W}$. In this situation, feasible computational alternatives are needed.

Consider the case when $p$ is very large but $n$ is small to moderate. It can be shown that $\mathbf{W}\mathbf{W}^t = \sum_{j=1}^{p} \mathbf{w}_j \mathbf{w}_j^t$, where $\mathbf{w}_j$ is the $j$th column of the matrix $\mathbf{W}$. The advantage of this formulation is that it can be computed sequentially. For example, if $\mathbf{C}_k = \sum_{j=1}^{k} \mathbf{w}_j \mathbf{w}_j^t$, then $\mathbf{C}_1 = \mathbf{w}_1\mathbf{w}_1^t$, $\mathbf{C}_{k+1} = \mathbf{C}_k + \mathbf{w}_{k+1}\mathbf{w}_{k+1}^t$, and $\mathbf{C}_p = \mathbf{W}\mathbf{W}^t$. It takes $O(n^2)$ operations to calculate $\mathbf{C}_1$ because it requires the multiplication of an $n \times 1$ by a $1 \times n$ dimensional matrix. At every step, $k + 1$, only the $n \times n$ dimensional matrix $\mathbf{C}_k$ and the $n \times 1$ vector $\mathbf{w}_{k+1}$ need to be loaded in memory. This avoids loading the complete data matrix. Thus, the matrix $\mathbf{W}\mathbf{W}^t$ can be calculated in $O(n^2 p)$ operations without ever loading the complete matrix, $\mathbf{W}$. The PCA decomposition $\mathbf{W}\mathbf{W}^t = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^t\mathbf{U}^t$ yields the matrices $\mathbf{U}$ and $\boldsymbol{\Sigma}$. The matrix $\mathbf{V}$ can then be obtained as $\mathbf{V} = \mathbf{W}^t\mathbf{U}\boldsymbol{\Sigma}^{-1}$. Thus, each column of $\mathbf{V}$ is obtained by multiplying $\mathbf{W}^t$ with the corresponding column of $\mathbf{U}\boldsymbol{\Sigma}^{-1}$. This requires $O(n^2 p)$ operations. As, in general, we are only interested in the first $K_0$ columns of $\mathbf{V}$, the total number of operations is of the order $O(n^2 p K_0)$. Moreover, the operations do not require loading the entire data set in the computer memory. Indeed, $\mathbf{W}^t\mathbf{U}\boldsymbol{\Sigma}^{-1}$ can be computed by loading one $1 \times n$ dimensional row of $\mathbf{W}^t$ at a time.
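A minimal R sketch of this sequential computation follows. For illustration the full matrix is simulated and held in memory; in practice each column $\mathbf{w}_j$ would be read from disk one at a time (the data and dimensions here are our own choices):

#Hypothetical stand-in data; in practice w_j would be read from disk
set.seed(3)
n <- 10; p <- 1000
W <- matrix(rnorm(n * p), nrow = n)
#Accumulate C = W W^t one column at a time
C <- matrix(0, n, n)
for (j in 1:p) {
  C <- C + tcrossprod(W[, j])
}
#Diagonalize the small n x n matrix
ev <- eigen(C)
U <- ev$vectors
d <- sqrt(pmax(ev$values, 0))
#Recover the first K0 right singular vectors: V = W^t U Sigma^{-1}
K0 <- 3
V <- t(W) %*% U[, 1:K0] %*% diag(1 / d[1:K0])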
The essential idea of this computational trick is to replace the diagonalization of the large $p \times p$ dimensional matrix $\mathbf{W}^t\mathbf{W}$ with the diagonalization of the much smaller $n \times n$ dimensional matrix $\mathbf{W}\mathbf{W}^t$. When $n$ is also large, this trick does not work. A simple solution to address this problem is to sub-sample the rows of the matrix $\mathbf{W}$ to a tractable sample size, say 2000. Sub-sampling can be repeated and right singular vectors can be averaged across sub-samples (see the sketch after this paragraph). Other solutions include incremental, or streaming, approaches [133, 203, 219, 285] and the power method [67, 158].
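A minimal sketch of the sub-sampling approach is shown below; the number of sub-samples, the sub-sample size, and the simulated data are our own choices. Because singular vectors are only identified up to sign, signs are aligned before averaging:

#Hypothetical stand-in for a data matrix with many rows
set.seed(6)
W <- matrix(rnorm(5000 * 100), nrow = 5000)
B <- 10      #number of sub-samples
m <- 2000    #rows per sub-sample
K0 <- 3
V_avg <- matrix(0, ncol(W), K0)
for (b in 1:B) {
  idx <- sample(nrow(W), m)
  Vb <- svd(W[idx, ], nu = 0, nv = K0)$v
  if (b == 1) V_ref <- Vb
  #Align the sign of each column with the first sub-sample
  signs <- sign(colSums(Vb * V_ref))
  V_avg <- V_avg + sweep(Vb, 2, signs, `*`) / B
}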
The incremental, or streaming, approaches start with a number of rows of $\mathbf{W}$ that can be handled computationally. Then covariance operators, eigenvectors, and eigenvalues are updated as new rows are added to the matrix $\mathbf{W}$. The power method starts with the $n \times n$ dimensional matrix $\mathbf{A} = \mathbf{W}\mathbf{W}^t$ and an $n \times 1$ dimensional random normal vector $\mathbf{u}_0$, which is normalized, $\mathbf{u}_0 \leftarrow \mathbf{u}_0 / \|\mathbf{u}_0\|$. Here $\|\mathbf{a}\| = (\mathbf{a}^t\mathbf{a})^{1/2}$ is the norm induced by the inner product in $\mathbb{R}^n$. The power method consists of calculating the updates $\mathbf{u}_{r+1} \leftarrow \mathbf{A}\mathbf{u}_r$ and $\mathbf{u}_{r+1} \leftarrow \mathbf{u}_{r+1} / \|\mathbf{u}_{r+1}\|$. Under mild conditions, this approach yields the first eigenvector of $\mathbf{A}$, that is, the first left singular vector $\mathbf{u}_1$; its contribution can be subtracted (deflation) and the method can be iterated to obtain the subsequent eigenvectors, with the corresponding right singular vectors recovered as described above. The computational trick here is that the diagonalization of matrices is replaced by matrix multiplications, which are much more computationally efficient.
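A minimal R sketch of the power method with deflation, again on simulated data of our own choosing:

#Power method for the leading eigenvector of A = W W^t (hypothetical stand-in data)
set.seed(4)
n <- 10; p <- 200
W <- matrix(rnorm(n * p), nrow = n)
A <- W %*% t(W)
u <- rnorm(n)
u <- u / sqrt(sum(u^2))
for (r in 1:200) {
  u <- as.vector(A %*% u)
  u <- u / sqrt(sum(u^2))
}
#Leading eigenvalue; compare with eigen(A)$values[1]
lambda1 <- sum(u * (A %*% u))
#Deflation: subtract the first component before iterating for the next one
A2 <- A - lambda1 * tcrossprod(u)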
We have found that sampling is a very powerful, easy-to-use method and we recommend it as a first-line approach in cases when both $n$ and $p$ are very large.
2.1.4 SVD for US Excess Mortality
We show how SVD can be used to visualize and analyze the cumulative all-cause excess mor-
tality data in 50 states and 2 territories (District of Columbia and Puerto Rico). Figure 2.1
displays these functions for each of the first 52 weeks of 2020. For each state or territory, i,
the data are $W_i(s_j)$, where $s_j = j \in \{1, \ldots, p = 52\}$. The mean $\bar{W}(s_j)$ is obtained by averaging observations across states ($i$) for every week of 2020 ($s_j = j$). The R implementation is
#Calculate the mean of Wr, the un-centered data matrix
mW <- colMeans(Wr)
#Construct a matrix with the mean repeated on each row
mW_mat <- matrix(rep(mW, each = nrow(Wr)), ncol = ncol(Wr))
#Center the data
W <- Wr - mW_mat
Here mW is the R notation for the mean vector that contains $\bar{W}(s_j)$, $j = 1, \ldots, p$. We have not divided by $\sqrt{np}$, as results are identical and it is more intuitive to work on the original
scale of the data. Figure 2.1 displays the cumulative excess mortality per one million people
in each state of the US and two territories in 2020 (light gray lines). This is the same data
as in Figure 1.5 without emphasizing the mortality patterns for specific states. Instead, the
dark red line is the average of these curves and corresponds to the mW variable. Figure 2.2
displays the same data as Figure 2.1 after centering the data (removing the mean at every
time point). These data are stored as rows in the matrix W (in R notation) and have been
denoted as W (in statistical notation). Five states are emphasized to provide examples of
trajectories.
FIGURE 2.1: Each line represents the cumulative excess mortality for each state and two territories in the US. The mean cumulative excess mortality in the US per one million residents is shown as a dark red line.

FIGURE 2.2: Each line represents the centered cumulative excess mortality for each state in the US. Centered means that the average at every time point is equal to zero. Five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).

The centered data matrix $\mathbf{W}$ (W in R) is decomposed using the SVD. The left singular vectors, $\mathbf{U}$, are stored as columns in the matrix U, the singular values, $d$, are stored in the vector d, and the right singular vectors, $\mathbf{V}$, are stored as columns in the matrix V.
#Calculate the SVD of W
SVD_of_W <- svd(W)
#Left singular vectors stored by columns
U <- SVD_of_W$u
#Singular values
d <- SVD_of_W$d
#Right singular vectors stored by columns
V <- SVD_of_W$v
The individual and cumulative variance explained can be calculated from the vector of
singular values, d. In R this is implemented as
#Calculate the eigenvalues
lambda <- SVD_of_W$d^2
#Individual proportion of variation
propor_var <- round(100 * lambda / sum(lambda), digits = 1)
#Cumulative proportion of variation
cumsum_var <- cumsum(propor_var)
Table 2.1 presents the individual and cumulative percent variance explained by the first five right singular vectors. The first two right singular vectors explain 84% and 11.9% of the variance, respectively, for a total of 95.9%. The first five right singular vectors explain a cumulative 99.7%, indicating that dimension reduction is quite effective in this particular example. Recall that the right singular vectors are the functional principal components.

TABLE 2.1: All-cause cumulative excess mortality in 50 US states plus Puerto Rico and District of Columbia. Individual and cumulative percent variance explained by the first five right singular vectors (principal components).

                     Right singular vectors
  Variance           1       2       3       4       5
  Individual (%)   84.0    11.9     2.9     0.6     0.3
  Cumulative (%)   84.0    95.9    98.8    99.4    99.7

The next step is to visualize the first two right singular vectors, which together explain 95.9% of the variability. These are the vectors V[,1] and V[,2] in R notation and $\mathbf{v}_1$ and $\mathbf{v}_2$ in statistical notation. Figure 2.3 displays the first (light coral) and second (dark coral) right singular vectors. The interpretation of the first right singular vector is that the mortality data for a state that has a positive coefficient (score) tends to (1) be closer to the US mean between January and April; (2) have a sharp increase above the US mean between April and June; and (3) be larger, with a constant difference from the US mean, between July and December. The mortality data for a state that has a positive coefficient on the second right singular vector tends to (1) have an even sharper increase between April and June relative to the US average; and (2) exhibit a decreased difference from the US mean as time progresses from July to December. Of course, things are more complex, as the mean and right singular vectors can compensate for one another at specific times of the year.

FIGURE 2.3: First two right singular vectors (principal components) for all-cause weekly excess US mortality data in 2020. First right singular vector: light coral. Second right singular vector: dark coral.
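A plot similar to Figure 2.3 can be produced directly from the matrix V; a minimal sketch, where the colors and labels are our own choices:

#Plot the first two right singular vectors across the 52 weeks of 2020
plot(1:52, V[, 1], type = "l", col = "lightcoral",
     xlab = "Week of 2020", ylab = "Right singular vectors")
lines(1:52, V[, 2], col = "coral4")
legend("topleft", legend = c("v1", "v2"),
       col = c("lightcoral", "coral4"), lty = 1)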
Individual state mortality data can be reconstructed for all states simultaneously. A rank $K_0 = 2$ reconstruction of the data can be obtained as
#Set the reconstruction rank
K0 <- 2
#Reconstruct the centered data using the rank K0 approximation
rec <- SVD_of_W$u[, 1:K0] %*% diag(SVD_of_W$d[1:K0]) %*% t(V[, 1:K0])
#Add the mean to the rank K0 approximation of W
WK0 <- mW_mat + rec
The matrices W and WK0 contain the original and reconstructed data, where each state is recorded by rows. Figure 2.4 displays the original (solid lines) and reconstructed data (dashed lines of matching color) for five states: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum). Even though the reconstructions are not perfect, they do capture the main features of the data for each of the five states. Better approximations can be obtained by increasing $K_0$, though at the expense of using additional right singular vectors.

FIGURE 2.4: All-cause excess mortality (solid lines) and predictions based on rank 2 SVD (dashed lines) for five states in the US: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).

Consider, for example, the mortality data from New Jersey. The rank $K_0 = 2$ reconstruction of the data is

$$W_{\text{NJ}}(s) = \bar{W}_{\text{US}}(s) + 0.49\, v_1(s) + 0.25\, v_2(s) ,$$

where the coefficients 0.49 and 0.25 correspond to $u_{i1}$ and $u_{i2}$, the $(i, 1)$ and $(i, 2)$ entries of the matrix $\mathbf{U}$ (U in R), where $i$ corresponds to New Jersey. These values can be calculated in R as

U[states == "New Jersey", 1:2]

where states is the vector containing the names of US states and territories. We have used the notation $\bar{W}_{\text{US}}(s)$ instead of $\bar{W}(s)$ and $W_{\text{NJ}}(s)$ instead of $W_i(s)$ to improve the precision of notation. Both coefficients, for $v_1(\cdot)$ and $v_2(\cdot)$, are positive, indicating that for New Jersey there was a strong increase in mortality between April and June, a much slower increase between June and November, and a further larger increase in December. Even though neither of the two components contained information about the increase in mortality in December, the effect was accounted for by the mean; see, for example, the increase in the November-December period in the mean in Figure 2.1.
All the coefficients, also known as scores, are stored in the matrix U. It is customary to
display these scores using scatter plots. For example,
plot(U[,1], U[,2])
produces a plot similar to the one shown in Figure 2.5. Every point in this graph represents a state, and the same five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum). Note that New Jersey is the point with the largest score on the first right singular vector and the third largest score on the second right singular vector. Louisiana has the third largest score on the first right singular vector, which is consistent with being among the states with the highest all-cause mortality. In contrast to New Jersey, the score for Louisiana on the second right singular vector is negative, indicating that its cumulative mortality data continues to increase away from the US mean between May and November; see Figure 2.2.

FIGURE 2.5: Scores on the first versus second right singular vectors for all-cause weekly excess mortality in the US. Each dot is a state, Puerto Rico, or Washington DC. Five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).
2.2 Gaussian Processes
While all the data we observe are sampled at discrete time points, observed functional data are thought of as realizations of an underlying continuous process. Here we provide some theoretical concepts that will help with the interpretation of the analytic methods. A Gaussian Process (GP) is a collection of random variables $\{W(s), s \in S\}$ such that every finite collection of random variables $\{W(s_1), \ldots, W(s_p)\}$, $s_j \in S$ for every $j = 1, \ldots, p$ and every $p$, has a multivariate Gaussian distribution. For convenience, we consider $S = [0, 1]$ and interpret it as time, but Gaussian Processes can be defined over space as well. A Gaussian Process is completely characterized by its mean $\mu(s)$ and covariance operator $K_W : S \times S \to \mathbb{R}$, where $K_W(s_1, s_2) = \mathrm{Cov}\{W(s_1), W(s_2)\}$.
Assume now that the mean of the process is 0. By Mercer's theorem [199] there exists a set of eigenvalues and eigenfunctions $\lambda_k, \phi_k(s)$, where $\lambda_k \geq 0$, the $\phi_k : S \to \mathbb{R}$ form an orthonormal basis in $L^2([0, 1])$, $\int_S K_W(s, t)\phi_k(t)\,dt = \lambda_k \phi_k(s)$ for every $s \in S$ and $k = 1, 2, \ldots$, and

$$K_W(s_1, s_2) = \sum_{k=1}^{\infty} \lambda_k \phi_k(s_1) \phi_k(s_2) .$$
The Kosambi-Karhunen-Loève (KKL) [143, 157, 184] theorem provides the explicit decomposition of the process $W(s)$. Because the $\phi_k(t)$ form an orthonormal basis, the Gaussian Process can be expanded as

$$W(s) = \sum_{k=1}^{\infty} \xi_k \phi_k(s) ,$$

where $\xi_k = \int_0^1 W(s)\phi_k(s)\,ds$, which does not depend on $s$. It is easy to show that $E(\xi_k) = 0$, as

$$E(\xi_k) = E\Big\{ \int_0^1 W(s)\phi_k(s)\,ds \Big\} = \int_0^1 E\{W(s)\}\phi_k(s)\,ds = 0 .$$
We can also show that $\mathrm{Cov}(\xi_k, \xi_l) = E(\xi_k \xi_l) = 0$ for $k \neq l$ and $\mathrm{Var}(\xi_k) = \lambda_k$. The proof is shown below:

$$
\begin{aligned}
E(\xi_k \xi_l) &= E\Big\{ \int_0^1 \int_0^1 W(s) W(t) \phi_k(t) \phi_l(s)\, dt\, ds \Big\} \\
&= \int_0^1 \int_0^1 E\{W(s) W(t)\} \phi_k(t) \phi_l(s)\, dt\, ds \\
&= \int_0^1 \Big\{ \int_0^1 K_W(s, t) \phi_k(t)\, dt \Big\} \phi_l(s)\, ds \\
&= \lambda_k \int_0^1 \phi_k(s) \phi_l(s)\, ds \\
&= \lambda_k \delta_{kl} ,
\end{aligned}
\qquad (2.8)
$$

where $\delta_{kl} = 1$ if $k = l$ and $0$ otherwise. The second equality holds because of the change of order of integrals (expectations), the third equality holds because of the definition of $K_W(s, t)$, the fourth equality holds because $\phi_k(s)$ is the eigenfunction of $K_W(\cdot, \cdot)$ corresponding to the eigenvalue $\lambda_k$, and the fifth equality holds because of the orthonormality of the $\phi_k(s)$ functions. These results hold for any $L^2[0, 1]$ integrable process and do not require Gaussianity of the scores.
However, if the process is Gaussian, it can be shown that any finite collection $\{\xi_{k_1}, \ldots, \xi_{k_l}\}$ is jointly Gaussian. Because the individual entries are uncorrelated and mean-zero, the scores are independent Gaussian random variables. One could reasonably ask why one should care about all these properties and whether this theory has any practical implications. Below we identify some of the practical implications.
The expression "Gaussian Process" is quite intimidating, the definition is relatively technical, and it is not clear from the definition that such objects even exist. However, these results show how to generate Gaussian Processes relatively easily. Indeed, the only ingredients we need are a set of orthonormal functions $\phi_k(\cdot)$ in $L^2[0, 1]$ and a set of positive numbers $\lambda_1 \geq \lambda_2 \geq \ldots$. For example, if $\phi_1(s) = \sqrt{2}\sin(2\pi s)$, $\phi_2(s) = \sqrt{2}\cos(2\pi s)$, $\lambda_1 = 4$, and $\lambda_2 = 1$, then $W(s) = \xi_1 \phi_1(s) + \xi_2 \phi_2(s)$, with independent scores $\xi_1 \sim N(0, 4)$ and $\xi_2 \sim N(0, 1)$, is a zero-mean Gaussian Process.
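A minimal R sketch of this construction, where the grid, the number of curves, and the plotting choices are our own:

#Simulate realizations of the Gaussian Process defined above
set.seed(5)
s <- seq(0, 1, length.out = 101)
phi1 <- sqrt(2) * sin(2 * pi * s)
phi2 <- sqrt(2) * cos(2 * pi * s)
n <- 20
#Independent scores with Var = lambda1 = 4 and Var = lambda2 = 1
xi1 <- rnorm(n, mean = 0, sd = 2)
xi2 <- rnorm(n, mean = 0, sd = 1)
#Each row of Wsim is one realization W_i(s) = xi_1 phi_1(s) + xi_2 phi_2(s)
Wsim <- outer(xi1, phi1) + outer(xi2, phi2)
matplot(s, t(Wsim), type = "l", lty = 1, col = "gray60",
        xlab = "s", ylab = "W(s)")

Every finite vector $\{W(s_1), \ldots, W(s_p)\}$ generated this way is a linear combination of independent Gaussian scores and is therefore multivariate Gaussian, which is exactly the defining property of a Gaussian Process.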
difficult. In January last there were 278 teachers in the State,
instructing 19,000 pupils in 143 schools. The expenses, $20,000 a
month, were defrayed by the Bureau from the proceeds of rents of
abandoned and confiscated estates. But this source of revenue had
nearly failed, in consequence of the indiscriminate pardoning of
Rebel owners and the restoration of their property. In New Orleans,
for example, the rents of Rebel estates had dwindled, in October,
1865, to $8,000; in December, to $1,500; and they were still rapidly
diminishing. The result was, it had been necessary to order the
discontinuance of all the schools in the State at the end of January,
the funds in the treasury of the Bureau being barely sufficient to
hold out until that time. It was hoped, however, that they would
soon be reëstablished on a permanent basis, by a tax upon the
freedmen themselves. For this purpose, the Assistant Commissioner
had ordered that five per cent. of their wages should be paid by
their employers to the agents of the Bureau. The freedmen’s schools
in New Orleans were not in session at the time I was there; but I
heard them highly praised by those who had visited them. Here is
Mr. Superintendent Warren’s account of them:—
“From the infant which must learn to count its fingers, to the scholar who can read
and understand blank-verse, we have grades and departments adapted and free
to all. Examinations, promotions, and gradations are had at stated seasons. The
city is divided into districts; each district has its school, and each school the
several departments of primary, intermediate, and grammar. A principal is
appointed to each school, with the requisite number of assistants. Our teachers
are mostly from the North, with a few Southerners, who have heroically dared the
storm of prejudice to do good and right. The normal method of teaching is
adopted, and object teaching is a specialty.
“There are eight schools in the city, with from two to eight hundred pupils each,
which, with those in the suburbs, amount to sixteen schools with nearly six
Functional Data Analysis with R

• The connection between functional regression, penalized smoothing, and mixed effects models is used as the cornerstone for inference.
• Multilevel, longitudinal, and structured functional data are discussed with emphasis on emerging functional data structures.
• Methods for clustering functional data before and after smoothing are discussed.
• Multiple new functional data sets with dense and sparse sampling designs from various application areas are presented, including the NHANES linked accelerometry and mortality data, COVID-19 mortality data, CD4 counts data, and the CONTENT child growth study.
• Step-by-step software implementations are included, along with a supplementary website (www.FunctionalDataAnalysis.com) featuring software, data, and tutorials.
• More than 100 plots for visualization of functional data are presented.

Functional Data Analysis with R is primarily aimed at undergraduate, master's, and PhD students, as well as data scientists and researchers working on functional data analysis. The book can be read at different levels and combines state-of-the-art software, methods, and inference. It can be used for self-learning, teaching, and research, and will particularly appeal to anyone who is interested in practical methods for hands-on, problem-forward functional data analysis. The reader should have some basic coding experience, but expertise in R is not required.

Ciprian M. Crainiceanu is Professor of Biostatistics at Johns Hopkins University working on wearable and implantable technology (WIT), signal processing, and clinical neuroimaging. He has extensive experience in mixed effects modeling, semiparametric regression, and functional data analysis with application to data generated by emerging technologies.

Jeff Goldsmith is Associate Dean for Data Science and Associate Professor of Biostatistics at the Columbia University Mailman School of Public Health. His work in functional data analysis includes methodological and computational advances with applications in reaching kinematics, wearable devices, and neuroimaging.

Andrew Leroux is an Assistant Professor of Biostatistics and Informatics at the University of Colorado. His interests include the development of methodology in functional data analysis, particularly related to wearable technologies and intensive longitudinal data.

Erjia Cui is an Assistant Professor of Biostatistics at the University of Minnesota. His research interests include developing functional data analysis methods and semiparametric regression models with reproducible software, with applications in wearable devices, mobile health, and imaging.
MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY
Editors: F. Bunea, R. Henderson, L. Levina, N. Meinshausen, R. Smith,

Recently Published Titles

158. Multistate Models for the Analysis of Life History Data, Richard J. Cook and Jerald F. Lawless
159. Nonparametric Models for Longitudinal Data with Implementation in R, Colin O. Wu and Xin Tian
160. Multivariate Kernel Smoothing and Its Applications, José E. Chacón and Tarn Duong
161. Sufficient Dimension Reduction: Methods and Applications with R, Bing Li
162. Large Covariance and Autocovariance Matrices, Arup Bose and Monika Bhattacharjee
163. The Statistical Analysis of Multivariate Failure Time Data: A Marginal Modeling Approach, Ross L. Prentice and Shanshan Zhao
164. Dynamic Treatment Regimes: Statistical Methods for Precision Medicine, Anastasios A. Tsiatis, Marie Davidian, Shannon T. Holloway, and Eric B. Laber
165. Sequential Change Detection and Hypothesis Testing: General Non-i.i.d. Stochastic Models and Asymptotically Optimal Rules, Alexander Tartakovsky
166. Introduction to Time Series Modeling, Genshiro Kitagawa
167. Replication and Evidence Factors in Observational Studies, Paul R. Rosenbaum
168. Introduction to High-Dimensional Statistics, Second Edition, Christophe Giraud
169. Object Oriented Data Analysis, J.S. Marron and Ian L. Dryden
170. Martingale Methods in Statistics, Yoichi Nishiyama
171. The Energy of Data and Distance Correlation, Gabor J. Szekely and Maria L. Rizzo
172. Sparse Graphical Modeling for High Dimensional Data, Faming Liang and Bochao Jia
173. Bayesian Nonparametric Methods for Missing Data and Causal Inference, Michael J. Daniels, Antonio Linero, and Jason Roy
174. Functional Data Analysis with R, Ciprian M. Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia Cui

For more information about this series please visit: https://www.crcpress.com/Chapman--HallCRC-Monographs-on-Statistics--Applied-Probability/book-series/CHMONSTAAPP
Functional Data Analysis with R

Ciprian M. Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia Cui
First edition published 2024 by CRC Press, 2385 Executive Center Drive, Suite 320, Boca Raton, FL 33431, U.S.A., and by CRC Press, 4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN.

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2024 Ciprian M. Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia Cui

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 978-1-032-24471-6 (hbk)
ISBN: 978-1-032-24472-3 (pbk)
ISBN: 978-1-003-27872-6 (ebk)

DOI: 10.1201/9781003278726

Typeset in CMR10 by KnowledgeWorks Global Ltd.

Publisher's note: This book has been prepared from camera-ready copy provided by the authors.

Library of Congress Cataloging-in-Publication Data
Names: Crainiceanu, Ciprian, author. | Goldsmith, Jeff, author. | Leroux, Andrew, author. | Cui, Erjia, author.
Title: Functional data analysis with R / Ciprian Crainiceanu, Jeff Goldsmith, Andrew Leroux, and Erjia Cui.
Description: First edition. | Boca Raton : CRC Press, 2024. | Series: CRC monographs on statistics and applied probability | Includes bibliographical references and index. | Summary: "Functional Data Analysis with R is primarily aimed at undergraduate, masters, and PhD students, as well as data scientists and researchers working on functional data analysis. The book can be read at different levels and combines state-of-the-art software, methods, and inference. It can be used for self-learning, teaching, and research, and will particularly appeal to anyone who is interested in practical methods for hands-on, problem-forward functional data analysis. The reader should have some basic coding experience, but expertise in R is not required"-- Provided by publisher.
Identifiers: LCCN 2023041843 (print) | LCCN 2023041844 (ebook) | ISBN 9781032244716 (hbk) | ISBN 9781032244723 (pbk) | ISBN 9781003278726 (ebk)
Subjects: LCSH: Multivariate analysis. | Statistical functionals. | Functional analysis. | R (Computer program language)
Classification: LCC QA278 .C73 2024 (print) | LCC QA278 (ebook) | DDC 519.5/35--dc23/eng/20231221
LC record available at https://lccn.loc.gov/2023041843
LC ebook record available at https://lccn.loc.gov/2023041844
To Bianca, Julia, and Adina, may your life be as beautiful as you made mine.
Ciprian

To my family and friends, for your unfailing support and encouragement.
Jeff

To Tushar, mom, and dad, thank you for all you do to keep me centered and sane. To Sarina and Nikhil, you're all a parent could ever ask for. Never stop shining your light on the world.
Andrew

To my family, especially my mom, for your unconditional love.
Erjia
Contents

Preface

1 Basic Concepts
  1.1 Introduction
  1.2 Examples
    1.2.1 NHANES 2011–2014 Accelerometry Data
    1.2.2 COVID-19 US Mortality Data
    1.2.3 CD4 Counts Data
    1.2.4 The CONTENT Child Growth Study
  1.3 Notation and Methodological Challenges
  1.4 R Data Structures for Functional Observations
  1.5 Notation

2 Key Methodological Concepts
  2.1 Dimension Reduction
    2.1.1 The Linear Algebra of SVD
    2.1.2 The Link between SVD and PCA
    2.1.3 SVD and PCA for High-Dimensional FDA
    2.1.4 SVD for US Excess Mortality
  2.2 Gaussian Processes
  2.3 Semiparametric Smoothing
    2.3.1 Regression Splines
      2.3.1.1 Univariate Regression Splines
      2.3.1.2 Regression Splines with Multiple Covariates
      2.3.1.3 Multivariate Regression Splines
    2.3.2 Penalized Splines
    2.3.3 Smoothing as Mixed Effects Modeling
    2.3.4 Penalized Spline Smoothing in NHANES
      2.3.4.1 Mean PA among Deceased and Alive Individuals
      2.3.4.2 Regression of Mean PA
  2.4 Correlation and Multiplicity Adjusted (CMA) Confidence Intervals and Testing
    2.4.1 CMA Confidence Intervals Based on Multivariate Normality
    2.4.2 CMA Confidence Intervals Based on Parameter Simulations
    2.4.3 CMA Confidence Intervals Based on the Nonparametric Bootstrap of the Max Absolute Statistic
    2.4.4 Pointwise and Global Correlation and Multiplicity Adjusted (CMA) p-values
    2.4.5 The Origins of CMA Inference Ideas in this Book
  2.5 Covariance Smoothing
    2.5.1 Types of Covariance Smoothing
      2.5.1.1 Covariance Smoothing for Dense Functional Data
      2.5.1.2 Covariance Smoothing for Sparse Functional Data
    2.5.2 Covariance Smoothing in NHANES
    2.5.3 Covariance Smoothing for CD4 Counts

3 Functional Principal Components Analysis
  3.1 Defining FPCA and Connections to PCA
    3.1.1 A Simulated Example
      3.1.1.1 Code for Generating Data
      3.1.1.2 Data Visualization
      3.1.1.3 Raw PCA versus FPCA Results
      3.1.1.4 Functional PCA with Missing Data
    3.1.2 Application to NHANES
      3.1.2.1 Data Description
      3.1.2.2 Results
  3.2 Generalized FPCA for Non-Gaussian Functional Data
    3.2.1 Conceptual Framework
    3.2.2 Fast GFPCA Using Local Mixed Effects
    3.2.3 Binary PCA Using Exact EM
    3.2.4 Functional Additive Mixed Models
    3.2.5 Comparison of Approaches
    3.2.6 Recommendations
  3.3 Sparse/Irregular FPCA
    3.3.1 CONTENT Child Growth Data
    3.3.2 Data Structure
    3.3.3 Implementation
    3.3.4 About the Methodology for Fast Sparse FPCA
  3.4 When PCA Fails

4 Scalar-on-Function Regression
  4.1 Motivation and EDA
  4.2 "Simple" Linear Scalar-on-Function Regression
    4.2.1 Model Specification and Interpretation
    4.2.2 Parametric Estimation of the Coefficient Function
    4.2.3 Penalized Spline Estimation
    4.2.4 Data-Driven Basis Expansion
  4.3 Inference in "Simple" Linear Scalar-on-Function Regression
    4.3.1 Unadjusted Inference for Functional Predictors
  4.4 Extensions of Scalar-on-Function Regression
    4.4.1 Adding Scalar Covariates
    4.4.2 Multiple Functional Coefficients
    4.4.3 Exponential Family Outcomes
    4.4.4 Other Scalar-on-Function Regression Models
  4.5 Estimation and Inference Using mgcv
    4.5.1 Unadjusted Pointwise Inference for SoFR Using mgcv
    4.5.2 Correlation and Multiplicity Adjusted (CMA) Inference for SoFR

5 Function-on-Scalar Regression
  5.1 Motivation and Exploratory Analysis of MIMS Profiles
    5.1.1 Regressions Using Binned Data
  5.2 Linear Function-on-Scalar Regression
    5.2.1 Estimation of Fixed Effects
      5.2.1.1 Estimation Using Ordinary Least Squares
      5.2.1.2 Estimation Using Smoothness Penalties
    5.2.2 Accounting for Error Correlation
      5.2.2.1 Modeling Residuals Using FPCA
      5.2.2.2 Modeling Residuals Using Splines
      5.2.2.3 A Bayesian Perspective on Model Fitting
  5.3 A Scalable Approach Based on Epoch-Level Regressions

6 Function-on-Function Regression
  6.1 Examples
    6.1.1 Association between Patterns of Excess Mortality
    6.1.2 Predicting Future Growth of Children from Past Observations
  6.2 Linear Function-on-Function Regression
    6.2.1 Penalized Spline Estimation of FoFR
    6.2.2 Model Fit and Prediction Using FoFR
    6.2.3 Missing and Sparse Data
  6.3 Fitting FoFR Using pffr in refund
    6.3.1 Model Fit
    6.3.2 Additional Features of pffr
    6.3.3 An Example of pffr in the CONTENT Study
  6.4 Fitting FoFR Using mgcv
  6.5 Inference for FoFR
    6.5.1 Unadjusted Pointwise Inference for FoFR
    6.5.2 Correlation and Multiplicity Adjusted Inference for FoFR

7 Survival Analysis with Functional Predictors
  7.1 Introduction to Survival Analysis
  7.2 Exploratory Data Analysis of the Survival Data in NHANES
    7.2.1 Data Structure
      7.2.1.1 Traditional Survival Analysis
      7.2.1.2 Survival Analysis with Functional Predictors
    7.2.2 Kaplan-Meier Estimators
    7.2.3 Results for the Standard Cox Models
  7.3 Cox Regression with Baseline Functional Predictors
    7.3.1 Linear Functional Cox Model
      7.3.1.1 Estimation
      7.3.1.2 Inference on the Functional Coefficient
      7.3.1.3 Survival Curve Prediction
    7.3.2 Smooth Effects of Traditional and Functional Predictors
    7.3.3 Additive Functional Cox Model
  7.4 Simulating Survival Data with Functional Predictors

8 Multilevel Functional Data Analysis
  8.1 Data Structure in NHANES
  8.2 Multilevel Functional Principal Component Analysis
    8.2.1 Two-Level Functional Principal Component Analysis
      8.2.1.1 Two-Level FPCA Model
      8.2.1.2 Estimation of the Two-Level FPCA Model
      8.2.1.3 Implementation in R
      8.2.1.4 NHANES Application Results
    8.2.2 Structured Functional PCA
      8.2.2.1 Two-Way Crossed Design
      8.2.2.2 Three-Way Nested Design
  8.3 Multilevel Functional Mixed Models
    8.3.1 Functional Additive Mixed Models
    8.3.2 Fast Univariate Inference
    8.3.3 NHANES Case Study
  8.4 Multilevel Scalar-on-Function Regression
    8.4.1 Generalized Multilevel Functional Regression
    8.4.2 Longitudinal Penalized Functional Regression

9 Clustering of Functional Data
  9.1 Basic Concepts and Examples
  9.2 Some Clustering Approaches
    9.2.1 K-means
      9.2.1.1 Clustering States Using K-means
      9.2.1.2 Background on K-means
    9.2.2 Hierarchical Clustering
      9.2.2.1 Hierarchical Clustering of States
      9.2.2.2 Background on Hierarchical Clustering
    9.2.3 Distributional Clustering
      9.2.3.1 Distributional Clustering of States
      9.2.3.2 Background on Distributional Clustering
  9.3 Smoothing and Clustering
    9.3.1 FPCA Smoothing and Clustering
    9.3.2 FPCA Smoothing and Clustering with Noisy Data
    9.3.3 FPCA Smoothing and Clustering with Sparse Data
    9.3.4 Clustering NHANES Data

Bibliography

Index
Preface

Around the year 2000, several major areas of statistics were witnessing rapid changes: functional data analysis, semiparametric regression, mixed effects models, and software development. While none of these areas was new, they were all becoming more mature, and their complementary ideas were setting the stage for new and rapid advancements. These developments were the result of the work of thousands of statisticians, whose collective achievements cannot be fully recognized in one monograph. We will try to describe some of the watershed moments that directly influenced our work and this book. We will also identify and contextualize our contributions to functional data analysis.

The Functional Data Analysis (FDA) book of Ramsay and Silverman [244, 245] was first published in 1997 and, without a doubt, defined the field. It considered functions as the basic unit of observation, and introduced new data structures, new methods, and new definitions. This amplified the interest in FDA, especially with the emergence of new, larger, and more complex data sets in the early 2000s. Around the same time, and largely independent of the FDA literature, nonparametric modeling was subject to massive structural changes. Starting in the early 1970s, the seminal papers of Grace Wahba and collaborators [54, 150, 303] were setting the stage for smoothing spline regression. Likely influenced by these ideas, in 1986, Finbarr O'Sullivan [221] published the first paper on penalized splines (B-splines with a smaller number of knots and a penalty on the roughness of the regression function). In 1996, Marx and Eilers [71] published a seminal paper on P-splines (similar to O'Sullivan's approach, but using a different penalty structure) and followed it up in 2002 by showing that the ideas can be extended to Generalized Additive Models (GAM) [72]. In 1999, Brumback, Ruppert, and Wand [26] pointed out that regression models incorporating splines with coefficient penalties can be viewed as particular cases of Generalized Linear Mixed Models (GLMM). This idea was expanded upon in a series of papers that led to the highly influential Semiparametric Regression book by Ruppert, Wand, and Carroll [258], which was published in 2003. The book showed that semiparametric models could incorporate additional covariates, random effects, and nonparametric smoothing components in a unified mixed effects inferential framework. It also demonstrated how to implement these models in existing mixed effects software. Simon Wood and his collaborators, in a series of papers that culminated with the 2006 Generalized Additive Models book [315], set the current standards for methods and software integration for GAM. The substantially updated 2017 second edition of this book [319] is now a classic reference for GAM.

In the early 2000s, the connection between functional data analysis, semiparametric regression, and mixed effects models was not yet apparent, though some early cross-pollination work was starting to appear. In 1999, Marx and Eilers [192] introduced the idea of P-splines for signal regression, which is closely related to the Functional Linear Model with a scalar outcome and functional predictors described by Ramsay and Silverman; see also extensions in the early 2000s [72, 171, 193]. In 2007, Reiss and Ogden [252] introduced a version of the method proposed by Marx and Eilers [192] using a different penalty structure, described methods for functional principal component regression (FPCR) and functional partial least squares (FPLS), and noted the connection with the mixed effects model representation of penalized splines described in [258]. In spite of these crucial advancements, by 2008 there was still
no reliable FDA software for implementing these methods. In 2008, Wood gave a Royal Statistical Society (RSS) talk (https://rb.gy/o1zg5), where he showed how to use mgcv to fit scalar-on-function regression (SoFR) models using "linear functional terms." This talk clarified the conceptual and practical connections between functional and semiparametric regression; see pages 17–20 of his presentation. In a personal note, Wood mentioned that his work was influenced by that of Eilers, Marx, Reiss, and Ogden, though he points to Wahba's 1990 book [304] and Tikhonov, 1963 [294] as his primary sources of inspiration. In his words: "[Grace Wahba's equation] (8.1.4), from Tikhonov, 1963, is essentially the signal regression problem. It just took me a long time to think up the summation convention idea that mgcv uses to implement this." In 2011, Wood published the idea of penalized spline estimation for the functional coefficient in the SoFR context; see Section 5.2 in his paper, where methods are extended to incorporate non-Gaussian errors with multiple penalties.

Our methods and philosophy were also informed by many sources, including the now classical references discussed above. However, we were heavily influenced by the mixed effects representation of semiparametric models introduced by Ruppert, Wand, and Carroll [258]. Also, we were interested in the practical implementation and scalability of a variety of FDA models beyond the SoFR model. The 2010 paper by Crainiceanu and Goldsmith [48] and the 2011 paper led by Goldsmith and Bobb [102] outlined the philosophy and practice underlying much of the functional regression chapters of this book: (1) where necessary, project observed functions on a functional principal component basis to account for noise, irregular observation grids, and/or missing data; (2) use rich-basis spline expansions for functional coefficients and induce smoothing using penalties on the spline coefficients; (3) identify the mixed effects models that correspond to the specific functional regression; and (4) use existing mixed effects model software (in their case WinBUGS [187] and nlme [230], respectively) to fit the model and conduct inference. Regardless of the underlying software platform, one of our main contributions was to recognize the deep connections between functional regression, penalized spline smoothing, and mixed effects inference. This allowed extensions that incorporated multiple scalar covariates, random effects, and multiple functional observations with or without noise, with dense or sparse sampling patterns, and complete or missing data. Over time, the inferential approach was extended to scalar-on-function regression (SoFR), function-on-scalar regression (FoSR), and function-on-function regression (FoFR).

We have also contributed to increasing awareness of new data structures and the need for validated and supported inferential software. Around 2010–2011, Philip Reiss and Crainiceanu initiated a project to assemble existing R functions for FDA. It was started as the package refund [105] for "REgression with FUNctional Data," though it never provided any refund, it was not only about regression, and was not particularly easy to find on Google. However, it did bring together a group of statisticians who were passionate about developing FDA software for a wide audience. We would like to thank all of these contributors for their dedication and vision. The refund package is currently maintained by Julia Wrobel.
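To fix ideas, here is a minimal sketch of the "linear functional terms" idea described above, using simulated data; the simulation and the object names (N, p, S, Lw) are illustrative and not taken from the book.

library(mgcv)

set.seed(1)
N <- 250                     # number of curves (study participants)
p <- 100                     # number of sampling points per curve
s <- seq(0, 1, length = p)   # common, equally spaced grid on [0, 1]

# Simulated functional predictors and a true coefficient function beta(s)
W <- matrix(rnorm(N * p), N, p)
beta <- sin(2 * pi * s)

# Scalar outcomes from the functional linear model, using a Riemann sum
# to approximate the integral of W_i(s) * beta(s)
y <- 0.5 + as.vector(W %*% beta) / p + rnorm(N, sd = 0.5)

# Matrix arguments to s() trigger mgcv's summation convention, so the term
# below represents sum_j Lw[i, j] * f(S[i, j]), i.e., the integral above
S <- matrix(s, N, p, byrow = TRUE)   # grid points, one row per curve
Lw <- W / p                          # curve values times quadrature weights

fit <- gam(y ~ s(S, by = Lw, bs = "ps", k = 20), method = "REML")
plot(fit)   # estimated coefficient function, to be compared with beta(s)

This is a construction similar to what refund::pfr() builds internally on top of mgcv.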
Fabian Scheipl, Sonja Greven, and collaborators have led a series of transformative papers [128, 262, 263] that started to appear in 2015 and expanded functional regression in many new directions. The 2015 paper by Ivanescu, Staicu, Scheipl, and Greven [128] showed how to conduct function-on-function regression (FoFR) using the philosophy outlined by Goldsmith, Bobb, and Crainiceanu [48, 102]. The paper made the connection to the "linear functional terms" implementation in mgcv, which merged previously disparate lines of work in FDA. This series of papers led to substantial revisions of the refund package and the addition of the powerful function pffr(), which provides a functional interface based on the mgcv package. The function pfr(), initially developed by Goldsmith, was updated to the same standard. Scheipl's contributions to refund were transformative and set a new bar for FDA software. Finally, the ideas came together and showed how functional regression can be modeled semiparametrically using splines, smoothness can be induced via specific penalties on parameters, and penalized models can be treated as mixed effects models, which can be fit using modern software. This body of work provides much of the infrastructure of Chapters 4, 5, and 6 of this book.

To address the larger and increasingly complex data applications, new methods were required for Functional Principal Components Analysis (FPCA). To the best of our knowledge, in 2010 there was no working software for smoothing covariance matrices for functional data with more than 300 observations per function. Luo Xiao was one of the main contributors who introduced FAst Covariance Estimation (FACE), a principled method for nonparametric smoothing of covariance operators for high and ultra-high dimensional functional data. Methods use "sandwich estimators" of covariance matrices that are guaranteed to be symmetric and positive definite and were deployed in the refund::fpca.face() function [331]. Xiao's subsequent work on sparse and multivariate sparse FPCA was deployed as the standalone functions face::face.sparse() [328, 329] and mfaces::mface.sparse() [172, 173]. During the writing of this book, it became apparent that methods were also needed for FPCA-like methods for non-Gaussian functional data. Andrew Leroux and Wrobel led a paper on fast generalized FPCA (fastGFPCA) [167] using local mixed effects models and deployed the accompanying fastGFPCA package [324]. These developments are highlighted in Chapters 2 and 3 of this book.

Much less work has been dedicated to survival analysis with functional predictors and, especially, to extending the semiparametric regression ideas to this context. In 2015, Jonathan Gellar introduced the Penalized Functional Cox Regression [94], where the effect of the functional predictor on the log-hazard was modeled using penalized splines. However, methods were not immediately deployed in mgcv because this option only became available in 2016 [322]. In subsequent publications, Leroux [164, 166] and Erjia Cui [55, 56] made clear the connection to the "linear functional terms" in mgcv and substantially enlarged the range of applications of survival analysis with functional predictors. This work provides the infrastructure for Chapter 7 of this book.

In 2009, Chongzhi Di, Crainiceanu, and collaborators introduced the concept of Multilevel Functional Principal Component Analysis (MFPCA) for functional data observed at multiple visits (e.g., electroencephalograms at every 30 seconds during sleep at two visits several years apart). They developed and deployed the refund::mfpca.sc() function. A much improved version of the software was deployed recently in the refund::mfpca.face() function based on a paper led by Cui and Ruonan Li [58]. Much work has been dedicated to extending ideas to structured functional data [272, 273], led by Haochang Shou, longitudinal functional data [109], led by Greven, and ultra-high dimensional data [345, 346], led by Vadim Zipunnikov. Many others have provided contributions, including Ana-Maria Staicu, Goldsmith, and Lei Huang. Fast methods for fixed effects inference in this context were developed, among others, by Staicu [223] and Cui [57]. These methods required specialized software to deal with the size and complexity of new data sets. This work forms the basis of Chapter 8 of this book.

As we were writing this book we realized just how many open problems still remain. Some of these problems have been addressed along the way; some are still left open. In the end, we have tried to provide a set of coherent analytic tools based on statistically principled approaches. The core set of ideas is to model functional coefficients parametrically or nonparametrically using splines, penalize the spline coefficients, and conduct inference in the resulting mixed effects model. The book is accompanied by detailed software and a website http://www.FunctionalDataAnalysis.com that will continue to be updated. We hope that you enjoy reading this book as much as we enjoyed writing it.
1 Basic Concepts

Our goal is to create the most useful book for the widest possible audience without theoretical, methodological, or computational compromise.

Our approach to statistics is to identify important scientific problems and meaningfully contribute to solving them through timely engagement with data. The development of general-purpose methodology is motivated by this process, and must be accompanied by computational tools that facilitate reproducibility and transparency. This "problem forward" approach is critical as technological advances rapidly increase the precision and volume of traditional measurements, produce completely new types of measurements, and open new areas of scientific research.

Our experience in public health and medical research provides numerous examples of new technologies that reshape scientific questions. For example, heart rate and blood pressure used to be measured once a year during an annual medical exam. Wearable devices can now measure them continuously, including during the night, for weeks or months at a time. The resulting data provide insights into blood pressure, hypertension, and health outcomes and open completely new areas of research. New types of measurements are continuously emerging, including physical activity measured by accelerometers, brain imaging, ecological momentary assessments (EMA) via smart phone apps, daily infection and mortality during the COVID-19 pandemic, or CD4 counts from the time of sero-conversion. These examples and many others involve measurements of a continuous underlying process, and benefit from a functional data perspective.

1.1 Introduction

Functional Data Analysis (FDA) provides a conceptual framework for analyzing functions instead of or in addition to scalar measurements. For example, physical activity is a continuous process over the course of the day and can be observed for each individual; FDA considers the complete physical activity trajectory in the analysis instead of reducing it to a single scalar summary, such as the total daily activity. In this book we denote the observed functions by W_i : S → R, where S is an interval (e.g., [0, 1] in R or [0, 1]^M in R^M), i is the basic experimental unit (e.g., study participant), and W_i(s) is the functional observation for unit i at s ∈ S. In general, the domain S does not need to be an interval, but for the purposes of this book we will work under this assumption. We often assume that W_i(s) = X_i(s) + ε_i(s), where X_i : S → R is the true functional process and the ε_i(s) are independent noise variables. We will see various generalizations of this definition, but for illustration purposes we use this notation.

We briefly summarize the properties of functional data that can be used to better target the associated analytic methods:
• Continuity is the property of the observed functions, W_i(s), and true functional processes, X_i(s), which allows them to be sampled at a higher or lower resolution within S.
• Ordering is the property of the functional domain, S, which can be ordered and has a distance.
• Self-consistency is the property of the observed functions, W_i(s), and true functional processes, X_i(s), to be on the same scale and have the same interpretation for all experimental units, i, and functional arguments, s.
• Smoothness is the property of the true functional process, X_i(s), which is not expected to change substantially for small changes in the functional argument, s.
• Colocalization is the property of the functional argument, s, which has the same interpretation for all observed functions, W_i(s), and true functional processes, X_i(s).

These properties differentiate functional from multivariate data. As the functional argument, s ∈ S, is often time or space, FDA can be used for modeling temporal and/or spatial processes. However, there is a fundamental difference between FDA and spatio-temporal processes. Indeed, FDA assumes that the observed functions, W_i(s), and true functional processes, X_i(s), depend on and are indexed by the experimental unit i. This means that there are many repetitions of the time series or spatial processes, which is not the case for time series or spatial analysis.

The FDA framework serves to guide methods development, interpretation, and exploratory analysis. We emphasize that the concept of continuously observed functions differs from the practical reality that functions are observed over discrete grids that can be dense or sparse, regularly spaced or irregular, and common or unique across functional observations. Put differently, in practice, functional data are multivariate data with specific properties. Tools for understanding functional data must bridge the conceptual and practical to produce useful insights that reflect the data-generating and observation processes.

FDA has a long and rich tradition. Its beginnings can be traced at least to a paper by C.R. Rao [247], who proposed to use Principal Component Analysis (PCA), a multivariate method, to analyze growth curves. Several monographs on FDA already exist, including [86, 153, 242, 245]. In addition, several survey papers provide insights into current developments [154, 205, 250, 299, 307]. This book is designed to complement the existing literature by focusing on methods that (1) combine parametric, nonparametric, and mixed effects components; (2) provide statistically principled approaches for estimation and inference; (3) allow users to seamlessly add or remove model components; (4) are associated with high-quality, fast, and easy-to-modify R software; and (5) are intuitive and friendly to scientific applications.

This book provides an introduction to FDA with R [240]. Two packages will be used throughout the book: (1) refund [105], which contains a large number of FDA models and many of the data sets used for illustration in this book; and (2) mgcv [317, 319], a powerful inferential software developed for semiparametric inference. We will show how this software, originally developed for semiparametric regression, can be adapted to FDA. This is a crucial contribution of the book, which is built around the idea of providing tools that can be readily used in practice.
The book is accompanied by the web page http://www.FunctionalDataAnalysis.com, which contains vignettes and R software for each chapter of this book. All vignettes use the refund and mgcv packages, which are available from CRAN and can be loaded into R [240] as follows.

library(refund)
library(mgcv)
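As a quick illustration of the notation of Section 1.1, the following minimal sketch (simulated data; all names are illustrative) generates noisy curves W_i(s) = X_i(s) + ε_i(s) on a common grid and stores them in the N × p matrix format used for densely observed functional data, with one row per study participant.

set.seed(2024)
N <- 50                      # experimental units (study participants)
p <- 200                     # sampling points in S = [0, 1]
s <- seq(0, 1, length = p)

# True smooth processes X_i(s) with participant-specific amplitudes
X <- t(sapply(1:N, function(i) sin(2 * pi * s) + rnorm(1) * cos(2 * pi * s)))

# Observed functions W_i(s) = X_i(s) + eps_i(s), stored as an N x p matrix
W <- X + matrix(rnorm(N * p, sd = 0.3), N, p)

# One curve per row; plot all curves against the common grid
matplot(s, t(W), type = "l", lty = 1, col = rgb(0, 0, 0, 0.25),
        xlab = "s", ylab = "W_i(s)")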
General-purpose, stable, and fast software is the key to increasing the popularity of FDA methods. The book will present the current version of the software, while acknowledging that software is changing much faster than methodology. Thus, the book will change slowly, while the web page http://www.FunctionalDataAnalysis.com and accompanying vignettes will be adapted to the latest developments.

1.2 Examples

We now introduce several examples that illustrate the ubiquity and complexity of functional data in modern research, and that will be revisited throughout the book. These examples highlight various types of functional data sampling, including dense, regularly-spaced grids that are common across participants, and sparse, irregular observations for each participant.

1.2.1 NHANES 2011–2014 Accelerometry Data

The National Health and Nutrition Examination Survey (NHANES) is a large, ongoing, cross-sectional study of the non-institutionalized US population conducted by the Centers for Disease Control and Prevention (CDC) in two-year waves using a multi-stage stratified sampling scheme. NHANES collects a vast array of demographic, socioeconomic, lifestyle and medical data, though the exact data collected and population samples vary from year to year. The wrist-worn accelerometry data collected in the NHANES 2011–2012 and 2013–2014 waves were released in December 2020. This data set is of particular interest because (1) it is publicly available and linked to the National Death Index (NDI) by the National Center for Health Statistics (NCHS); (2) it was collected from the wrist and processed into "monitor-independent movement summary" (MIMS) units [138] using an open source, reproducible algorithm (https://bit.ly/3cDnRBF); and (3) the protocol required 24-hour continuous wear of the wrist accelerometers, including during sleep, for multiple days for each study participant.

In total there were 14,693 study participants who agreed to wear an accelerometer. To ensure the quality of the accelerometry data for each subject, we excluded study participants who had less than three good days (2,083 study participants excluded), where a good day is defined as having at least 95% "good data." "Good data" is defined as data that was flagged as "wear" (PAXPREDM ∈ {1, 2, 4}) and did not have a quality problem flag (PAXFLGSM = "") in the NHANES data set. The final data set has 12,610 study participants with an average age of 36.90 years and 51.23% females. Note that the variable name "gender" used in the data set and elsewhere is taken directly from the framing of the questions in NHANES, and is not intended to conflate sex and gender. The proportions of Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, Mexican American, Other Hispanic and Other Race were 35.17%, 24.81%, 11.01%, 15.16%, 9.87%, and 3.98%, respectively. The data set is too large to be made available through refund, but it is available from the website http://www.FunctionalDataAnalysis.com associated with this book.
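The exclusion rule above can be sketched as follows. This is an illustrative reconstruction of the logic only, assuming a hypothetical minute-level data frame nhanes_min with columns SEQN (participant identifier), day, PAXPREDM, and PAXFLGSM; it is not the code used to build the released data set.

library(dplyr)

# Flag "good" minutes: wear predictions 1, 2, or 4 and no quality problem flag
good_days <- nhanes_min %>%
  mutate(good_min = PAXPREDM %in% c(1, 2, 4) & PAXFLGSM == "") %>%
  group_by(SEQN, day) %>%
  summarize(prop_good = mean(good_min), .groups = "drop") %>%
  filter(prop_good >= 0.95)          # a "good day" has at least 95% good data

# Keep participants with at least three good days
keep_ids <- good_days %>%
  count(SEQN) %>%
  filter(n >= 3) %>%
  pull(SEQN)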
Figure 1.1 displays objective physical activity data measured in MIMS for three study participants in the NHANES 2011–2014. Each panel column corresponds to one study participant and each panel row corresponds to a day of the week. The first study participant (labeled SEQN 75111) had seven days of data labeled Sunday through Saturday. The second study participant (labeled SEQN 77936) had five days of data labeled Tuesday through Saturday. The third study participant (labeled SEQN 82410) had six days of data that included all days of the week except Friday. This happened because the data recorded on Friday had less than 95% of “good data” and were therefore excluded. The x-axis for each panel is time in one-minute increments from midnight (beginning of the day) to midnight (end of the day). The y-axis is MIMS, a measure of physical activity intensity.

FIGURE 1.1: Physical activity data measured in MIMS for three study participants in the NHANES 2011–2014 summarized at every minute of the day. Each study participant is shown in one column and each row corresponds to a day of the week. The x-axis in each panel is time in one-minute increments from midnight to midnight.

Some features of the data become apparent during visual inspection of Figure 1.1. First, activity during the night (0–6 AM) is reduced for the first two study participants, but not for the third. Indeed, study participant SEQN 82410 has clearly more activity during the night than during the day (note the consistent dip in activity between 12 PM and 6 PM). Second, there is substantial heterogeneity of the data from one minute to another both within and between days. Third, data are positive and exhibit substantial skewness. Fourth, the patterns of activity of study participant SEQN 75111 on Saturday and Sunday are quite different from their pattern of activity on the other days of the week. Fifth, there seems to be some day-to-day within-individual consistency of observations.

Having multiple days of minute-level physical activity for the same individual increases the complexity and size of the data. A potential solution is to take averages at the same time of the day within study participants. This is equivalent to averaging the curves in Figure 1.1 by column at the same time of the day. This reduces the data to one function per study participant, but ignores the visit-to-visit variability around the person-specific mean.

To illustrate the population-level data structure, Figure 1.2 displays the smooth means of several groups within NHANES. Data were smoothed for visualization purposes; technical details on smoothing are discussed in Section 2.3. The left panel displays the average physical activity data for individuals who died (blue line) and survived (red line). Mortality indicators were based on the NHANES mortality release file that included events up to December 31, 2019. Mortality information was available for 8,713 of the 12,610 study participants. There were 832 deceased individuals and 7,881 who were still alive on December 31, 2019. The plot indicates that individuals who did not die had, on average, higher physical activity throughout the day, with larger differences between 8 AM and 11 PM. This result is consistent with the published literature on the association between physical activity and mortality; see, for example, [64, 65, 136, 170, 259, 275, 292].

FIGURE 1.2: Average physical activity data (expressed in MIMS) in NHANES 2011–2014 as a function of the minute of the day in different groups. Left panel: deceased (blue line) and alive individuals (red line) as of December 31, 2019. Right panel: females (dashed lines) and males (solid lines) within age groups [18, 35] (red), (35, 50] (orange), (50, 65] (light blue), and (65, 80] (dark blue).

The right panel in Figure 1.2 displays the smooth average curves for groups stratified by age and gender. For illustration purposes, four age groups (in years) were used and identified
by a different color: [18, 35] (red), (35, 50] (orange), (50, 65] (light blue), and (65, 80] (dark blue). Within each age group, data for females is shown as dashed lines and for males as solid lines. In all subgroups physical activity averages are lower at night, increase sharply in the morning and remain high during the day. The averages for the (50, 65] and (65, 80] age groups exhibit a steady decrease during the day. This pattern is not apparent in the younger age groups. These findings are consistent with the activity patterns described in [265, 327]. In addition, for every age group, the average activity during the day is higher for females compared to males. During the night, females have the same or slightly lower activity than males. These results contradict the widely cited literature [296] which indicated that “Males are more physically active than females.” However, they are consistent with [327], which found that women are more active than men, especially among older individuals.

Rich, complex data as displayed in Figures 1.1 and 1.2 suggest multiple scientific problems, including (1) quantifying the association between physical activity patterns and health outcomes (e.g., prevalent diabetes or stroke) with or without adjustment for other covariates (e.g., age, gender, body mass index); (2) identifying which specific components of physical activity data are most predictive of future health outcomes (e.g., incident mortality or cardiovascular events); (3) visualizing the directions of variation in the data; (4) investigating whether clusters exist and if they are scientifically meaningful; (5) evaluating transformations of the data that may provide complementary information; (6) developing prediction methods for missing observations (e.g., one hour of missing data for a person); (7) quantifying whether the timing or fragmentation of physical activity provides additional information above and beyond summary statistics (e.g., mean, standard deviation over the day); (8) studying how much data are needed to identify a particular study participant; (9) predicting the activity for the rest of the day given data up to a particular time and day (e.g., 12 PM on Sunday); (10) determining what levels of data aggregation (e.g., minute, hour, day) may be most useful for specific scientific questions; and (11) proposing data generating mechanisms that could produce data similar to the observed data.

The daily physical activity curves have all the properties that define functional data: continuity, ordering, self-consistency, smoothness, and colocalization. The measured process is continuous, as physical activity is continuous. While MIMS were summarized at the minute level, data aggregation could have been done at a finer (e.g., ten- or one-second intervals) or coarser (e.g., one- or two-hour intervals) scale. The functional data have the ordering property, because the functional argument is time during the day, which is both ordered and has a well-defined distance. The data and the measured process have the self-consistency property because all observations are expressed in MIMS at the minute level. The true functional process can be assumed to have the smoothness property, as one does not expect physical activity to change substantially over short periods of time (e.g., one second).
The functional argument has the colocalization property, as the time when physical activity is measured (e.g., 12:00 PM) has the same interpretation for every study participant and day of measurement.

The observed data can be denoted as a function Wim : S → R+, where Wim(s) is the MIMS measurement at minute s ∈ S = {1, . . . , 1440} and day m = 1, . . . , Mi, where Mi is the number of days with high-quality physical activity data for study participant i. Data complexity could be reduced by taking the average
$$\overline{W}_i(s) = \frac{1}{M_i}\sum_{m=1}^{M_i} W_{im}(s)$$
at every minute s, or the average over days and minutes
$$\overline{W}_i = \frac{1}{M_i |S|}\sum_{s=1}^{|S|}\sum_{m=1}^{M_i} W_{im}(s)\;,$$
where |S| denotes the number of elements in the domain S. Such reductions in complexity improve interpretability and make analyses easier, though some information may be lost. Deciding at what level to summarize the data without losing crucial information is an important goal of FDA.
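As a sketch of these reductions, assume the day-level curves are stored in a matrix mims_day with one row per participant-day and 1,440 columns, and that id is a vector giving the participant for each row; both names are assumptions for illustration.

#Participant-specific mean curve: average W_im(s) over days m at each minute s
W_bar <- rowsum(mims_day, group = id) / as.vector(table(id))
#Participant-specific scalar summary: average over days and minutes
W_bar_scalar <- rowMeans(W_bar)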
Here we have identified the domain of the functions Wim(·) as S = {1, . . . , 1440}, which is a finite set in R and does not satisfy the basic requirement that S is an interval. This could be a major limitation as basic concepts such as continuity or smoothness of the functions cannot be defined on the sampling domain S = {1, . . . , 1440}. This is due to practical limitations of sampling that can only be done at a finite number of points. Here the theoretical domain is [0, 1440] minutes, or [0, 24] hours, or [0, 1] days, depending on how we normalize the domain. Recall that the functions have the continuity property, which assumes that the function could be measured anywhere within this theoretical domain. While not formally correct, we will refer to both of these domains as S to simplify the exposition; whenever necessary we will indicate more precisely when we refer to the theoretical (e.g., S = [0, 1440]) or sampling (e.g., S = {1, . . . , 1440}) domain. This slight abuse of notation will be used throughout the book and clarifications will be added, as needed.

1.2.2 COVID-19 US Mortality Data

COVID-19 is an infectious disease caused by the SARS-CoV-2 virus that was first identified in Wuhan, China in 2019. The virus spreads primarily via airborne mechanisms. In COVID-19, “CO” stands for corona, “VI” for virus, “D” for disease, and 19 for 2019, the first year the virus was identified in humans. According to the World Health Organization, COVID-19 has become a world pandemic with more than 767 million confirmed infections and almost 7 million confirmed deaths in virtually every country of the world by June 6, 2023 (https://covid19.who.int/). Here we focus on mortality data collected in the US before and during the pandemic. The COVID-19 data used in this book can be loaded using the following lines of code.

#Load refund
library(refund)
#Load the COVID-19 data
data(COVID19)

Among other variables, this data set contains the US weekly number of all-cause deaths, weekly number of deaths due to COVID-19 (as assessed on the death certificate), and population size in the 50 US states plus Puerto Rico and District of Columbia as of July 1, 2020. Figure 1.3 displays the total weekly number of deaths in the US between the week ending on January 14, 2017 and the week ending on December 12, 2020 for a total of 205 weeks. The original data source is the National Center for Health Statistics (NCHS) and the data set link is called National and State Estimates of Excess Deaths. It can be accessed from https://bit.ly/3wjMQBY. The file can be downloaded directly from https://bit.ly/3pMAAaA. The data stored in the COVID19 data set in the refund package contains an analytic version of these data as the variable US_weekly_mort. In Figure 1.3, each dot corresponds to one week and the number of deaths is expressed in thousands. For example, there were 61,114 deaths in the US in the week ending on January 14, 2017. Here we are interested in excess mortality in the first 52 weeks of 2020 compared to the first 52 weeks of 2019. The first week of 2020 is the one ending on January 4, 2020 and the 52nd week is the one ending on December 26, 2020. There were 3,348,951 total deaths in the US in the first 52 weeks of 2020 (red shaded area in Figure 1.3) and 2,852,747 deaths in the first 52 weeks of 2019 (blue shaded area in Figure 1.3). Thus, there were 496,204 more deaths in the US in the first 52 weeks of 2020 than in the first 52 weeks of 2019; a sketch of this computation is shown below.
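This is a minimal sketch, assuming that US_weekly_mort is the numeric vector of weekly all-cause deaths described above, that the series is complete and consecutive starting with the week ending January 14, 2017, and that it extends through the 52nd week of 2020; the week indices below follow from those assumptions.

#Load refund and the COVID-19 data
library(refund)
data(COVID19)
w <- COVID19$US_weekly_mort
#Assumed indices of the weeks ending January 5, 2019 and January 4, 2020
i2019 <- 104
i2020 <- 156
#Raw excess mortality: first 52 weeks of 2020 minus first 52 weeks of 2019
sum(w[i2020 + 0:51]) - sum(w[i2019 + 0:51])  #should match the 496,204 above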
This is called the (raw) excess mortality in the first 52 weeks of the year. Here we use this intuitive definition (number of deaths in 2020 minus the number of deaths in 2019), though slightly different definitions can be used. Indeed, note that the population size increases from 2019 to 2020 and some additional deaths can be due to the increase in
population. For example, the US population was 330,024,493 on December 26, 2020 and 329,147,064 on December 26, 2019 for an increase of 877,429. Using the mortality rate in 2019 of 0.0087 (number of deaths divided by the total population), the expected increase in number of deaths due to the increase in the population would be 7,634. Thus, the number of deaths associated with the natural increase in population is about 1.5% of the total excess all-cause deaths in 2020 compared to 2019.

FIGURE 1.3: Total weekly number of deaths in the US between January 14, 2017 and December 12, 2020. The COVID-19 epidemic is thought to have started in the US sometime between January and March 2020.

Figure 1.3 displays a higher mortality peak at the end of 2017 and beginning of 2018, which is likely due to a severe flu season. The CDC estimates that in the 2017–2018 flu season in the US there were “an estimated 35.5 million people getting sick with influenza, 16.5 million people going to a health care provider for their illness, 490,600 hospitalizations, and 34,200 deaths from influenza” (https://bit.ly/3H8fa1b). As indicated in Figure 1.3, the excess mortality can be calculated for every week from the beginning of 2020. The blue dots in Figure 1.4 display this weekly excess all-cause mortality as a function of time from January 2020. Excess mortality is positive in every week with an average of 9,542 excess deaths per week for a total of 496,204 excess deaths in the first 52 weeks. Excess mortality is not a constant function over the year. For example, there were an average of 1,066 all-cause excess deaths per week between January 1, 2020 and March 14, 2020. In contrast, there were an average of 14,948 all-cause excess deaths per week between March 28, 2020 and June 23, 2020. One of the most watched indicators of the severity of the pandemic in the US was the number of deaths attributed to COVID-19. The data is made available by the US Centers for Disease Control and Prevention (CDC) and can be downloaded directly from
https://bit.ly/3iE2xjo. The data stored in the COVID19 data set in the refund package contains an analytic version of these data as the variable US_weekly_mort_CV19.

FIGURE 1.4: Total weekly number of deaths attributed to COVID-19 and excess mortality in the US. The x-axis is time expressed in weeks from the first week in 2020. Red dots correspond to weekly number of deaths attributed to COVID-19. Blue dots indicate the difference in the total number of deaths between a particular week in 2020 and the corresponding week in 2019.

The red dots in Figure 1.4 represent the weekly mortality attributed to COVID-19 according to the death certificate. Visually, COVID-19 and all-cause excess mortality have a similar pattern during the year with some important differences: (1) all-cause excess mortality is larger than COVID-19 mortality every week; (2) the main association does not seem to be delayed (lagged) in either direction; and (3) the difference between all-cause excess and COVID-19 mortality as a proportion of COVID-19 mortality is highest in the summer. Figure 1.4 indicates that there were more excess deaths than COVID-19 attributed deaths in each week of 2020. In fact, the total US all-cause excess deaths in the first 52 weeks of 2020 was 496,204 compared to 365,122 deaths attributed to COVID-19. The difference is 131,082 deaths, or 35.9% more excess deaths than COVID-19 attributed deaths. So, what are some potential sources for this discrepancy? In some cases, viral infection did occur and caused death, though the primary cause of death was recorded as something else (e.g., cardiac or pulmonary failure). This could happen if death occurred after the infection had already passed, infection was present and not detected, or infection was present but not adjudicated as the primary cause of death. In other cases, viral infection did not occur, but the person died due to mental or physical health stresses, isolation, or deferred health care. There could also be other reasons that are not immediately apparent.
FIGURE 1.5: Each line represents the cumulative weekly all-cause excess mortality per million for each US state plus Puerto Rico and District of Columbia. Five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).

In addition to data aggregated at the US national level, the COVID19 data contains similar data for each state and two territories, Puerto Rico and Washington DC. The all-cause weekly excess mortality data for each state in the US is stored as the variable States_excess_mortality in the COVID19 data set. Figure 1.5 displays the total cumulative all-cause excess mortality per million in every state in the US, Puerto Rico and District of Columbia. For each state, the weekly excess mortality was obtained as described for the US in Figures 1.3 and 1.4. For every week, the cumulative excess mortality was calculated by adding the excess mortality for every week up to and including the current week. To make data comparable across states, cumulative excess mortality was then divided by the estimated population of the state or territory on July 1, 2020 and multiplied by 1,000,000; a sketch of this computation is shown below. Every line represents a state or territory with the trajectory for five states being emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum). For example, New Jersey had 1,916 excess all-cause deaths per one million residents by April 30, 2020. This corresponds to a total of 17,019 excess all-cause deaths by April 30, 2020 because the population of New Jersey was 8,882,371 as of July 1, 2020 (the reference date for the population size). The trajectories for individual states exhibit substantial heterogeneity. For example, New Jersey had the largest number of excess deaths per million in the US. Most of these excess deaths were accumulated in the April–June period, with fewer between June and November, and another increase in December. In contrast, California had a much lower cumulative excess number of deaths per million, with a roughly constant increase during 2020.
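The sketch below illustrates this calculation, assuming that States_excess_mortality is a matrix with one row per state or territory and one column per week of 2020, and that a vector of July 1, 2020 population sizes, here called state_pop, is available; the population object name is a hypothetical placeholder.

#Cumulative all-cause excess deaths per one million residents, by state
mat <- COVID19$States_excess_mortality  #assumed: states x weeks of 2020
cum_excess <- t(apply(mat, 1, cumsum))  #add excess deaths week by week
cum_per_million <- cum_excess / state_pop * 1e6
#One line per state, as in Figure 1.5
matplot(t(cum_per_million), type = "l", lty = 1,
        xlab = "Week of 2020", ylab = "Cumulative excess deaths per million")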
Maryland had about a third the number of excess deaths per million of New Jersey at the end of June and about half by the end of December. We now investigate the number of weekly deaths attributed to COVID-19 for each state in the US, which is stored as the variable States_CV19_mortality in the COVID19 data set.

FIGURE 1.6: Each line represents the cumulative COVID-19 mortality for each US state plus Puerto Rico and District of Columbia in 2020. Cumulative means that numbers are added as weeks go by. Five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).

Figure 1.6 is similar to Figure 1.5, but it displays the cumulative number of deaths attributed to COVID-19 for each state per million residents. Each line corresponds to a state and a few states are emphasized using the same color scheme as in Figure 1.5. The y-axis was kept the same as in Figure 1.5 to illustrate that, in general, the number of cumulative COVID-19 deaths tends to be lower than the excess all-cause mortality. However, the main patterns exhibit substantial similarities. There are many scientific and methodological problems that arise from such a data set. Here are a few examples: (1) quantifying the all-cause and COVID-19 mortality at the state level as a function of time; (2) identifying whether the observed trajectories are affected by geography, population characteristics, weather, mitigating interventions, or intervention compliance; (3) investigating whether the strength of the association between reported COVID-19 and all-cause excess mortality varies with time; (4) identifying which states are the largest contributors to the observed excess mortality in the January–March period; (5) quantifying the main directions of variation and clusters of state-specific mortality patterns; (6) evaluating the distribution of the difference between all-cause excess and COVID-19 deaths as a function of state and time; (7) predicting the number of COVID-19 deaths and infections based on the excess number of deaths; (8) evaluating dynamic prediction
models for mortality trajectories; (9) comparing different data transformations for analysis, visualization, and communication of results; and (10) using data from countries with good health statistics systems to estimate the burden of COVID-19 in other countries using all-cause excess mortality.

In the COVID-19 example it is not immediately clear that data could be viewed as functional. However, the partitioning of the data by state suggests that such an approach could be useful, at least for visualization purposes. Note that data in Figures 1.5 and 1.6 are curves evaluated at every week of 2020. Thus, the measured process is continuous, as observations could have been taken at a much finer (e.g., days or hours) or coarser (e.g., every month) time scale. Data are ordered by calendar time and are self-consistent because the number or proportion of deaths has the same interpretation for each state and every week. Moreover, one can assume that the true number of deaths is a smooth process as the number of deaths is not expected to change substantially for small changes in time (e.g., one hour). Data are also colocalized, as calendar time has the same interpretation for each state and territory. The observed data can be denoted as functions Wim : S → R+, where Wim(s) is the number or cumulative number of deaths in state i per one million residents at time s ∈ S = {1, . . . , 52}. Here m ∈ {1, 2} denotes all-cause excess mortality (m = 1) and COVID-19 attributed mortality (m = 2), respectively. Because each m refers to different types of measurements on the same unit (in this case, US state), this type of data is referred to as “multivariate” functional data. Observations can be modeled as scalars by focusing, for example, on Wim(s) at one s at a time or on the average of Wim(s) over s for one m. FDA focuses on analyzing the entire function or combination of functions, extracting information using fewer assumptions, and suggesting functional summaries that may not be immediately evident. Most importantly, FDA provides techniques for data visualization and exploratory data analysis (EDA) in the original or a transformed data space. Just as in the case of NHANES physical activity data, the domain of the functions Wim(·) is S = {1, . . . , 52} expressed in weeks, which is a finite set that is not an interval. This is due to practical limitations of sampling that can only be done at a finite number of points. Here the theoretical domain is [0, 52] weeks, or [0, 12] months, or [0, 1] years, depending on how we normalize the domain. Recall that the functions have the continuity property, which assumes that the function could be measured anywhere within this theoretical domain. While not formally correct, we will refer to both of these domains as S to simplify the exposition.

1.2.3 CD4 Counts Data

Human immunodeficiency virus (HIV) attacks CD4 cells, which are an essential part of the human immune system. This reduces the concentration of CD4 cells in the blood, which affects their ability to signal other types of immune cells. Ultimately, this compromises the immune system and substantially reduces the human body’s ability to fight off infections. Therefore, the CD4 cell count per milliliter of blood is a widely used measure of HIV progression. The CD4 counts data used in this book can be loaded as follows.
#Load refund
library(refund)
#Load the CD4 data
data(cd4)

This data contains the CD4 cell counts for 366 HIV-infected individuals from the Multicenter AIDS Cohort Study (MACS) [66, 144]. We would like to thank Professor Peter Diggle for making this important de-identified data publicly available on his website and for giving
us the permission to use it in this book. We would also like to thank the participants in this MACS sub-study.

FIGURE 1.7: Each line represents the log CD4 count as a function of month, where month zero corresponds to seroconversion. Five study participants are identified using colors: green, red, blue, salmon, and plum.

Figure 1.7 displays the log CD4 count for up to 18 months before and 42 months after seroconversion. Each line represents the log CD4 count for one study participant as a function of month, where month zero corresponds to seroconversion. There are a total of 1,888 data points, with between 1 and 11 (median 5) observations per study participant. Five study participants are highlighted using colors: green, red, blue, salmon, and plum. Some of the characteristics of these data include (1) there are few observations per curve; (2) the time of observations is not synchronized across individuals; and (3) there is substantial visit-to-visit variation in log CD4 counts before and after seroconversion. Figure 1.8 displays the same data as Figure 1.7 together with the raw (cyan dots) and smooth (dark red line) estimator of the mean. The raw mean is the average of log CD4 counts of study participants who had a visit at that time. The raw mean exhibits substantial variation and has a missing observation at time t = 0. The smooth mean estimator captures the general shape of the raw estimator, but provides a more interpretable summary. For example, the smooth estimator is relatively constant before seroconversion, declines rapidly in the first 10–15 months after seroconversion, and continues to decline, but much more slowly, after month 15. These characteristics are not immediately apparent in the raw mean or in the person-specific log CD4 trajectories displayed in Figure 1.7. A sketch of how these estimators can be computed is shown below.
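In this sketch, the raw pointwise mean is computed directly from the cd4 matrix, and the smooth estimator is obtained with a penalized spline fit in mgcv, one of several reasonable smoothers; the book's exact estimator may differ, and smoothing is discussed in detail in Section 2.3.

#Load packages and data
library(refund)
library(mgcv)
data(cd4)
month <- -18:42                              #observation grid in months
raw_mean <- colMeans(log(cd4), na.rm = TRUE) #raw pointwise mean of log CD4
#Penalized spline smooth of the raw mean (month 0 has no data and is dropped)
fit <- gam(raw_mean ~ s(month))
smooth_mean <- predict(fit, newdata = data.frame(month = month))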
FIGURE 1.8: Each gray line represents the log CD4 count as a function of month, where month zero corresponds to seroconversion. The point-wise raw mean is shown as cyan dots. The smooth estimator of the mean is shown as a dark red line.

There are many scientific and methodological problems suggested by the CD4 data set. Here we identify a few: (1) estimating the time-varying mean, standard deviation and quantiles of the log CD4 counts as a function of time; (2) producing confidence intervals for these time-varying population parameters; (3) identifying whether there are specific subgroups that have different patterns over time; (4) designing analytic methods that work with sparse data (few observations per curve that are not synchronized across individuals); (5) predicting log CD4 observations for each individual at months when measurements were not taken; (6) predicting the future observations for one individual given observations up to a certain point (e.g., 10 months after seroconversion); (7) constructing confidence intervals for these predictions; (8) quantifying the month-to-month measurement error (fluctuations along the long-term trend); (9) studying whether the month-to-month measurement error depends on person-specific characteristics, including average log CD4 count; and (10) designing realistic simulation studies that mimic the observed data structure to evaluate the performance of analytic methods.

Data displayed in Figures 1.7 and 1.8 are observed at discrete time points and with substantial visit-to-visit variability. We leave it as an exercise to argue that the CD4 data has the characteristics of functional data: continuity, ordering, self-consistency, smoothness, and colocalization. The observed data has the structure {sij, Wi(sij)}, where Wi(sij) is the log CD4 count at time sij ∈ S = {−18, −17, . . . , 42}. Here i = 1, . . . , n is the study participant, j = 1, . . . , pi is the observation number, and pi is the number of observations for study participant i. In statistics, this data structure is often encountered in longitudinal studies and is
typically modeled using linear mixed effects (LME) models [66, 87, 161, 196]. LMEs use a pre-specified, typically parsimonious, structure of random effects (e.g., random intercepts and slopes) to capture the person-specific curves. Functional data analysis complements LMEs by considering more complex and/or data-dependent designs of random effects [134, 254, 255, 283, 328, 334, 336]. It is worth noting that this data structure and problem are equivalent to the matrix completion problem [29, 30, 214, 312]. Statistical approaches can handle different levels of measurement error in the matrix entries, and produce both point estimators and the associated uncertainty for each matrix entry.

In this example, one could think about the sampling domain as being S = {−18, −17, . . . , 42} expressed in months. This is a finite set that is not an interval. The theoretical domain is [−18, 42] in months from seroconversion, though the interval could be normalized to [0, 1]. The difference from the NHANES and COVID-19 data sets is that observations are not available at every point in S = {−18, −17, . . . , 42} for each individual. Indeed, the minimum number of observations per individual is 1 and the maximum is 11, with a median number of observations of 5, which is 100 × 5/(42 + 19) = 8.2% of the number of possible time points between −18 and 42. This type of data is referred to in statistics as “sparse functional data.” In strict mathematical terms this is a misnomer, as the sampling domain S = {−18, −17, . . . , 42} is itself mathematically sparse in R. Here we will use the definition that sparse functional data are observed functions Wi(sij), where j = 1, . . . , pi, pi is small (at most 20), at sampling points sij that are not identical across study participants. Note that this is a property of the observed data Wi(sij) and not of the true underlying process, Xi(s), which could be observed/sampled at any point in [−18, 42]. While this definition is imprecise, it should be intuitive enough for the intents and purposes of this book. We acknowledge that there may be other definitions and also that there is a continuum of scientific examples between “dense, equally spaced functional data” and “sparse, unequally spaced functional data.”
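These counts can be verified directly from the wide-format cd4 matrix loaded above, in which missing visits are stored as NA:

#Number of observations per study participant
p_i <- rowSums(!is.na(cd4))
summary(p_i)  #minimum 1, median 5, maximum 11, as described above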
1.2.4 The CONTENT Child Growth Study

The CONTENT child growth study (referred to in this book as the CONTENT study) was funded by the Sixth Framework Programme of the European Union, Project CONTENT (INCO-DEV-3-032136), and was led by Dr. William Checkley. The study was conducted between May 2007 and February 2011 in Las Pampas de San Juan Miraflores and Nuevo Paraíso, two peri-urban shanty towns with high population density located on the southern edge of Lima city in Peru. The shanty towns had approximately 40,000 residents with 25% of the population under the age of 5 [38, 39]. A simple census was conducted to identify pregnant women and children less than 3 months of age. Eligible newborns and pregnant women were randomly selected from the census and invited to participate in the study. Only one newborn was recruited per household. Written informed consent was required from parents or guardians before enrollment. The study design was that of a longitudinal cohort study with the primary objective to assess whether infection with Helicobacter pylori (H. pylori) increases the risk of diarrhea, which, in turn, could adversely affect growth in children less than 2 years of age [131]. Anthropometric data were obtained longitudinally on 197 children: weekly until the child was 3 months of age, every two weeks between three and 11 months of age, and once monthly thereafter for the remainder of follow-up, up to age 2. Here we will focus on child length and weight, both measured at the same visits. Even though visits were designed to be equally spaced, they occurred on different days within each sampling period. For example, the observation on week four for a child could be on day 22 or 25, depending on the availability of the contact person, the day of the week, or the researchers who conducted the visit.
Moreover, not all planned visits were completed, which gave the data a quasi-sparse structure, as observations are not temporally synchronized across children. We would like to thank Dr. William Checkley for making this important de-identified data publicly available and the members of the communities of Pampas de San Juan de Miraflores and Nuevo Paraíso who participated in this study. The data can be loaded directly using the refund R package as follows.

#Load refund
library(refund)
#Load the CONTENT data
data(content)

FIGURE 1.9: Longitudinal observations of z-score for length (zlen, first column) and z-score for weight (zwei, second column) shown for males (first row) and females (second row) as a function of day from birth. Data for two boys (shown as light and dark shades of red) and two girls (shown as light and dark shades of blue) are highlighted. The same shade of color identifies the same individual.

Figure 1.9 provides an illustration of the z-score for length (zlen) and z-score for weight (zwei) variables collected in the CONTENT study. Data are also available on the original scale, though for illustration purposes here we display these normalized measures. For example, the zlen measurement is obtained by subtracting the mean and dividing by the standard deviation of height for a given age of children as provided by the World Health Organization (WHO) growth charts. Even though the study was designed to collect data up to age 2 (24 months), for visualization purposes, observations are displayed only through day 600, as data become very
sparse thereafter. Data for every individual is shown as a light gray line and four different panels display the zlen (first column) and zwei (second column) variables as a function of day from birth separately for males (first row) and females (second row). Data for two boys is highlighted in the first row of panels in red. The lighter and darker shades of red are used to identify the same individual in the two panels. A similar strategy is used to highlight two girls using lighter and darker shades of blue. Note, for example, that both girls who are highlighted start at about the same length and weight z-score, but their trajectories are substantially different. The z-scores increase for both height and weight for the first girl (data shown in darker blue) and decrease for the second girl (data shown in light blue). Moreover, after day 250 the second girl seems to reverse the downward trend in the z-score for weight, though that does not happen with her z-score for height, which continues to decrease. These data were analyzed in [127, 169] to dynamically predict the growth patterns of children at any time point given the data up to that particular time.

FIGURE 1.10: Histogram of the number of days from birth in the CONTENT study. There are a total of 4,405 observations for 197 children.

Figure 1.10 displays the histogram of the number of days from birth in the CONTENT study. There are a total of 4,405 observations for 197 children, out of which 2,006 (45.5% of total) are in the first 100 days and 3,299 (74.9% of total) are in the first 200 days from birth. Observations become sparser after that, which can also be observed in Figure 1.9. There are several problems suggested by the CONTENT growth study, including (1) estimating the marginal mean, standard deviation and quantiles of anthropometric measurements as a function of time; (2) producing pointwise and joint confidence intervals for these time-varying parameters; (3) identifying whether there are particular subgroups or individuals that have distinct patterns or individual observations; (4) conducting estimation and inference on the individual growth trajectories; (5) quantifying the contemporaneous and lagged correlations between various anthropometric measures; (6) estimating
anthropometric measures when observations were missing; (7) predicting future observations for one individual given observations up to a certain point (e.g., 6 months after birth); (8) quantifying the month-to-month measurement error and studying whether it is differential among children; (9) developing methods that are designed for multivariate sparse data (few observations per curve) with the amount of sparsity varying along the observation domain; (10) identifying outlying observations or patterns of growth that could be used as early warnings of growth stunting; (11) developing methods for studying the longitudinal association between multivariate growth outcomes and time-dependent exposures, such as infections; and (12) designing realistic simulation scenarios that mimic the observed data structure to evaluate the performance of analytic methods.

Data displayed in Figure 1.9 are observed at discrete time points and with substantial visit-to-visit and participant-to-participant variability. These data have all the characteristics of functional data: continuity, ordering, self-consistency, smoothness, and colocalization. Indeed, data are continuous because growth curves could be sampled at any time point at both higher and lower resolutions. The choice of the particular sampling resolution was a balance between available resources and knowledge about the growth process of humans. Data are also ordered, as observations are sampled in time. That is, we know that a measurement at week 3 was taken before a measurement at month 5 and we know exactly how far apart the two measurements were taken. Also, the observed and true functional processes have the self-consistency property as they are expressed in the same units of measurement. For example, height is always measured in centimeters or is transformed into normalized measures, such as zlen. Data are also smooth, as the growth process is expected to be gradual and not have large minute-to-minute or even day-to-day fluctuations. Even potential growth spurts are smooth processes characterized by faster growth but small day-to-day variation. Observations are also colocalized, as the functional argument, time from birth, has the same interpretation for all functions. For example, one month from birth means the same thing for each baby.

The observed functional data in CONTENT has the structure {sij, Wim(sij)}, where Wim : S → R is the mth anthropometric measurement at time s ∈ S ⊂ [0, 24] (expressed in months from birth) for study participant i. Here the time of the observations, sij, depends on the study participant, i, and the visit number, j, but not on the anthropometric measure, m. The reason is that if a visit was completed, all anthropometric measures were collected. However, this may not be the case for all studies, and observations may depend on m in other studies. Each such variation on sampling requires special attention and methods development. In this example it is difficult to enumerate the entire sampling domain because it is too large and observations are not equally spaced. One way to obtain this space in R is using the function

#Find all unique observation sampling points
S <- sort(unique(content$agedays))

A similar notation, Wim(s), was used to describe the NHANES data structure in Section 1.2.1. In NHANES, m referred to the day number from initiating the accelerometry study. However, in the CONTENT study, m refers to the type of anthropometric measure.
Thus, while in NHANES functions indexed by m measure the same thing every day (e.g., physical activity at 12 PM), in CONTENT each function measures something different (e.g., zlen and zwei at month 2). In FDA one typically refers to the NHANES structure as multilevel and to the CONTENT structure as multivariate functional data. Another difference is that data are not equally spaced within individuals and are not synchronized across individuals. Thus, the CONTENT data has a multivariate (multiple types of measurement), functional (has all characteristics of functional data), sparse (few observations per curve that are not synchronized across individuals), and unequally spaced (observations were not taken at equal intervals within study participants) structure; the sketch below illustrates these sampling characteristics.
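For this illustration, assume the child identifier in the content data frame is the column id (an assumption for illustration; agedays appears in the code above).

#Load the CONTENT data
library(refund)
data(content)
#Number of children and number of visits per child (assumed identifier: id)
length(unique(content$id))
visits_per_child <- as.numeric(table(content$id))
summary(visits_per_child)
#Pooled observation days are numerous and unequally spaced
length(unique(content$agedays))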
The CONTENT data is highly complex and contains additional time-invariant (e.g., sex) and time-varying observations (e.g., bacterial infections). Like the CD4 counts data presented in Section 1.2.3, the CONTENT data is at the interface between traditional linear mixed effects (LME) models and functional data. While both approaches can be used, this is an example when FDA approaches are more reasonable, at least as an exploratory tool to understand the potential hidden complexity of individual trajectories. In these situations, one also starts to question or even test the standard residual dependence assumptions in traditional LMEs. In the end, we will show that every FDA is a form of LME, but this will require some finesse and substantial methodological development.

1.3 Notation and Methodological Challenges

In all examples in Section 1.2, the data are comprised of functions Wi : S → R, though in the CONTENT example, one could argue that the vector Wi(·) = {Wi1(·), Wi2(·)}, where Wi1(·) and Wi2(·) are the z-scores for length and weight, respectively, takes values in R^2. Here, the conceptual and practical framing of functional data should be noted: conceptually, the theoretical domain S (where functional data could be observed) is an interval in R or R^M; practically, the sampling domain S (where functional data is actually observed) is a finite subset of points of the theoretical domain. We will, at times, be specific about our use of a particular framing, but frequently the distinction can be elided (or at least inferred from context) without detracting from the clarity of our discussion. Continuity is an important property of functional data, indicating that measurements could, in principle, have been taken at any point in the interval spanned by the sampling domain S. For example, in the NHANES study, data are summarized at every minute of the day, which results in 1,440 observations per day. However, data could be summarized at a much finer or coarser resolution. Thus, the domain of the function is considered to be an interval and, without loss of generality, the [0, 1] interval. In NHANES the start of the day (midnight or 12:00 AM) would correspond to 0, the end of the day (11:59 PM) would correspond to 1, and minute s of the day would correspond to (s − 1)/1439. Most common functional data are of the type Wi : [0, 1] → R, though many variations exist. An important assumption is that there exists an underlying, true process, Xi : [0, 1] → R, and Wi(s) provides proxy measurements of Xi(s) at the points where Wi(·) is observed. The observed function is Wi(s) = Xi(s) + εi(s), where εi(s) are independent noise variables, which could be Gaussian, but could refer to binary, Poisson, or other types of errors. Thus, FDA assumes that there exists an infinite-dimensional data generating process, Xi(·), for every study participant, while information is accumulated at a finite number of points via the measured process, Wi(s), where s ∈ S and S is the sampling domain. This inferential problem is addressed by a combination of smoothing and simplifying (modeling) assumptions.
The sampling locations (the points s where Wi(·) is measured), the measurement type (exactly what is measured), and the underlying signal structure (the distribution of Xi(·)) raise important methodological problems that need to be addressed to bridge the theoretical assumption of continuity with the reality of sampling at a finite number of points. First, connecting the continuity of Xi(·) to the discrete measurement Wi(·) needs to be done through explicit modeling and assumptions.
Second, the density and number of observations at the study participant level could vary substantially. Indeed, there could be as few as two or three to as many as hundreds of millions of observations per study participant. Moreover, observations can be equally or unequally spaced within and between study participants as well as when aggregated across study participants. Each of these scenarios raises its own specific set of challenges.

Third, the complexity of individual and population trajectories is a priori unknown. Extracting information is thus a balancing act between model assumptions and signal structure, often in the presence of substantial noise. As shown in the examples in this chapter, functional data are seldom linear and often non-stationary.

Fourth, the covariance structure within experimental units (e.g., study participants) requires a new set of assumptions that cannot be directly extended from traditional statistical models. For example, the independence and exchangeability assumptions from longitudinal data analysis are, at best, suspect in many high-resolution FDA applications. The auto-regressive assumption is probably way too restrictive, as well, because it implies stationarity of residuals and an exponential decrease of correlation as a function of distance. Moreover, as sampling points are getting closer together (higher resolution) the structure of correlation may change substantially. The unstructured correlation assumption is more appropriate for FDA, but it requires the estimation of a very large dimensional correlation matrix. This can raise computational challenges for moderate to high-dimensional functions.

Fifth, observed data may be non-Gaussian with high skewness and thicker than normal tails. While much is known about univariate modeling of such data, much more needs to be done when the marginal distributions of functional data exhibit such behavior. Binary or Poisson functional data raise their own specific sets of challenges.

To understand the richness of FDA, one could think of all problems in traditional data analysis where some of the scalar observations are replaced with functional observations. This requires new modeling and computational tools to accommodate the change of all or some measurements from scalars to high-dimensional, highly structured multivariate vectors, matrices or arrays. The goal of this book is to address these problems by providing a class of self-contained, coherent analytic methods that are computationally friendly. To achieve this goal, we need three important components: dimensionality reduction, penalized smoothing, and unified regression modeling via mixed effects models inference. Chapter 2 will introduce these ideas and principles.

1.4 R Data Structures for Functional Observations

As the preceding text makes clear, there is a contrast between the conceptual and practical formulations of functional data: conceptually, functional data are continuous and infinite dimensional, but practically they are observed over discrete grids. This book relies on both formulations to provide interpretable model structures with concrete software implementations. We will use a variety of data structures for the storage, manipulation, and use of functional observations, and discuss these briefly now. In perhaps the simplest case, functional data are observed over the same equally spaced grid for each participant or unit of observation.
Physical activity is measured at each minute of the day for each participant in the NHANES data set, and deaths due to COVID-19 are recorded weekly in each state in the COVID-19 US mortality data. A natural way of representing such data is a matrix in which rows correspond to participants and columns to the grid over which data are observed.
For illustration purposes, we display below the “wide format” data structure of the NHANES physical activity data. This is stored in the variable MIMS of the data nhanes_fda_with_r. This NHANES data consists of a 12,610 × 1,440 matrix, with columns containing MIMS measurements from 12:00 AM to 11:59 PM. Here we rounded the MIMS values to two decimals for illustration purposes, so the actual data may vary slightly upon closer inspection. This data structure is familiar to many statisticians, and can be useful in the implementation of specific methods, such as Functional Principal Component Analysis (FPCA).

#Storage format for the accelerometry data in the NHANES data set
nhanes_fda_with_r$MIMS
      MIN0001 MIN0002 MIN0003 MIN0004 ... MIN1439 MIN1440
62161    1.11    3.12    1.47    0.94 ...    1.38    1.53
62163   25.15   19.16   17.84   20.33 ...    7.38   15.93
62164    1.92    1.67    2.38    0.93 ...    3.03    4.46
62165    3.98    3.00    1.91    0.89 ...    2.18    0.31
  ...     ...     ...     ...     ... ...     ...     ...
83730    1.50    2.11    1.34    0.16 ...    1.07    1.14
83731    0.09    0.01    0.49    0.10 ...    0.86    0.46

It is possible to use matrices for data that are somewhat less simple, although care is required. When data can be observed over the same grid but are sparse for each subject, a matrix with missing entries can be used. For the CD4 data, observations are recorded at months before or after seroconversion. The observation grid is the integers from −18 to 42, but any specific participant is measured only at a subset of these values. Data like these can be stored in a relatively sparse matrix, again with rows for study units and columns for elements of the observation grid. Our data examples focus on equally spaced grids, but this is not required for functional data in general or for the use of matrices to store these observations. For illustration purposes, we display the CD4 count data in the same “wide format” used for NHANES. The structure is similar to that of the NHANES data, where each row corresponds to an individual and each column corresponds to a potential sampling point, in this case a month from seroconversion. However, in the CD4 data example most observations are not available, as indicated by the NA fields. Indeed, as we discussed, only 1,888 data points are available out of the 366 × 61 = 22,326 entries of the matrix, or 8.5%. Having one look at the data matrix and knowing that less than 10% of the entries are known immediately creates the idea that the matrix and the data are “sparse.” Note, however, that this concept refers to the percent of non-missing entries in a matrix and not to the mathematical concept of sparsity. In most of the book, “sparsity” will refer to matrix sparsity and not to the mathematical concept of sparsity of a set.

#Storage format for CD4 data in refund
cd4
       -18 -17 -16 -15 -14 -13 -12 -11  -10   -9 ...  41 42
  [1,]  NA  NA  NA  NA  NA  NA  NA  NA   NA  548 ...  NA NA
  [2,]  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA ...  NA NA
  [3,]  NA  NA  NA 846  NA  NA  NA  NA   NA 1102 ...  NA NA
   ...  ... ... ... ... ... ... ... ...  ...  ... ... ... ...
[363,]  NA  NA  NA  NA  NA  NA  NA  NA 1661   NA ...  NA NA
[364,]  NA  NA  NA 646  NA  NA  NA 882   NA   NA ...  NA NA
[365,]  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA ... 294 NA
[366,]  NA  NA  NA  NA  NA  NA  NA  NA   NA   NA ... 462 NA
Storing the CD4 data in wide format is not a problem because the matrix is relatively small and does not take up much memory. However, this format is not efficient and could become extremely cumbersome as data matrices increase in the number of rows or columns. The number of columns can increase very quickly when the observations are irregular across subjects and the union of sampling points across study participants is very large. In the extreme, but commonly encountered, case when no two observations are taken at exactly the same time, the number of columns of the matrix would be equal to the total number of observations for all individuals. Additionally, observation grid values are not directly accessible, and must be stored as column names or in a separate vector. Using the “long format” for sparse functional data can address some disadvantages that are associated with the “wide format.” In particular, a data matrix or frame with columns for study unit ID, observation grid point, and measurement value can be used for dense or sparse data and for regular or irregular observation grids, and makes the observation grid explicit. Below we show the CD4 counts data in “long format,” where the missing data are no longer included. The price to pay is that we add the column ID, which contains many repetitions, while the column time also contains some repetitions to explicitly indicate the month when the sample was taken.

#CD4 count data in long format
 CD4_count time  ID
       548   -9   1
       ...  ...  ...
       846  -15   3
      1102   -9   3
       ...  ...  ...
      1661  -10 363
       ...  ...  ...
       646  -15 364
       882  -11 364
       ...  ...  ...
       294   41 365
       ...  ...  ...
       462   41 366

The long format of the data is much more memory efficient when data are sparse, though these advantages can disappear or become disadvantages when data become denser. For example, when the observation grid is common across subjects and there are many observations for each study participant, the ID and time columns require substantial additional memory without providing additional information. Long format data may also repeat subject-level covariates for each element of the observation grid, which further exacerbates memory requirements. Moreover, complexity and memory allocation can increase substantially when multiple functional variables are observed on different observation grids. From a practical perspective, different software implementations require different data structures, which can be a reason for frustration. In general, refund tends to use the wide format of the data, whereas our implementation of FDA in mgcv often uses the long format. Given these considerations, we will use both the wide and long formats and we will discuss when and how we make the transition between these formats; a sketch of such a conversion is shown below. We recognize the increased popularity of the tidyverse for visualization and exploratory data analysis, which prefers the long format of the data. Over the last several years, many R users have gravitated toward data frames for data storage. This shift has been facilitated by (and arguably is attributable to) the development of the tidyverse collection of packages, which implement general-purpose tools for data manipulation, visualization, and analysis.
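As an illustration of moving between the two formats, the sketch below converts the cd4 matrix from wide to long format in base R, assuming the observation grid is the integers from −18 to 42 as described above; the column names mirror the long-format display.

#Convert the wide cd4 matrix to long format
months <- -18:42
cd4_long <- data.frame(
  ID = rep(seq_len(nrow(cd4)), times = ncol(cd4)),
  time = rep(months, each = nrow(cd4)),
  CD4_count = as.vector(cd4)
)
#Drop missing entries to obtain the sparse long representation
cd4_long <- cd4_long[!is.na(cd4_long$CD4_count), ]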
The tidyfun [261] R package was developed to address issues that arise in the storage, manipulation, and visualization of functional data. Beginning from the conceptual perspective that a complete curve is the basic unit of analysis, tidyfun introduces a data type (tf) that represents and operates on functional data in a way that is analogous to numeric data. This allows functional data to easily sit alongside other (scalar or functional) observations in a data frame in a way that is integrated with a tidyverse-centric approach to manipulation, exploratory analysis, and visualization. Where possible, tidyfun conserves memory by avoiding data duplication.
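As a brief sketch of this data type, and assuming that the tfd() constructor accepts a numeric matrix with one row per curve and the observation grid supplied through the arg argument (with NA entries treated as unobserved points), the CD4 curves could be stored alongside scalar covariates as follows.

#A sketch using tidyfun (assumptions noted above)
library(tidyfun)
cd4_tf <- tfd(log(cd4), arg = -18:42)   #one tf entry per study participant
cd4_df <- data.frame(ID = seq_len(nrow(cd4)))
cd4_df$log_cd4 <- cd4_tf                #functional column next to scalar ID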
We will use both the tidyverse and the usualverse (a completely made up word) and we will point out the various approaches to handling the data. In the end, it is a personal choice of what tools to use, as long as the main inferential engine works. One can reasonably ask why a book of methods places such an emphasis on data structures. The reason is that this is a book on “functional data analysis with R” and not a book on “functional data analysis without R.” Thus, in addition to methods and inference we emphasize the practical implementation of methods and the combination of data structures, code, and methods that is amenable to software development.

1.5 Notation

Throughout the book we will attempt to use notation that is consistent across chapters. This will not be easy or perfect, as functional data analysis can test the limits of reasonable notation. Indeed, the Latin and Greek alphabets using lower- and uppercase, bold and regular fonts were heavily tested by the data structures discussed in this book. To provide some order ahead of starting the book in earnest we introduce the following notation.

• n: number of study participants
• i: the index for the study participant, i = 1, . . . , n
• S: the sampling or theoretical domain of the observed functions; this will depend on the context
• Yi: scalar outcome for study participant i
• Wi(sj): observed functional measurement for study participant i and location sj ∈ S, for j = 1, . . . , p, when data are observed on the same grid (dense, equal grid)
• Wi(sij): observed functional measurement for study participant i and location sij ∈ S, for j = 1, . . . , pi, when data are observed on different grids across study participants (sparse, different grid)
• Wim(·): observed functional measurement for multivariate or multilevel data. For multivariate data m = 1, . . . , M, whereas for multilevel data m = 1, . . . , Mi, though in some instances Mi = M for all i
• Xi(sj), Xi(sij), Xim(·): same as Wi(sj), Wi(sij), Wim(·), but for the underlying, unobserved, functional process
• Zi: column vector of additional scalar covariates
• vectors: defined as columns and referred to using bold, typically lowercase, font
• matrices: referred to using bold, typically uppercase, font
2 Key Methodological Concepts

In this chapter we introduce some of the key methodological concepts that will be used extensively throughout the book. Each method is important in itself, but it is the specific combination of these methods that provides a coherent infrastructure for FDA inference and software development. Understanding the details of each approach is not essential for the application of these methods. Readers who are less interested in a deep dive into these methods and more interested in applying them can skip this chapter for now.

2.1 Dimension Reduction

Consider the case when functional data are of the form $W_{\text{raw},i}(s)$ for $i = 1, \ldots, n$ and $s \in S = \{s_1, \ldots, s_p\}$, where $p = |S|$ is the number of observations in $S$. Assume that all functions are measured at the same values, $s_j$, $j = 1, \ldots, p$, and that there are no missing observations. The centered and normalized functional data are

$$W_i(s_j) = \frac{1}{\sqrt{np}} \{W_{\text{raw},i}(s_j) - \bar{W}_{\text{raw}}(s_j)\}\;,$$

where $\bar{W}_{\text{raw}}(s_j) = \frac{1}{n}\sum_{i=1}^{n} W_{\text{raw},i}(s_j)$ is the average of functional observations over study participants at $s_j$. This transformation is not strictly necessary, but will simplify the connection between the discrete observed measurement process and the theoretical underlying continuous process. In particular, dividing by $\sqrt{np}$ will keep measures of data variation comparable when the number of rows (study participants) or columns (data sampling resolution) changes. The data can be organized in an $n \times p$ dimensional matrix, $\mathbf{W}$, where the $i$th row contains the observations $\{W_i(s_j) : j = 1, \ldots, p\}$. Each row in $\mathbf{W}$ corresponds to a study participant and each column has mean zero. The dimension of the problem refers to $p$, and dimension reduction refers to finding a smaller set of $K_0 \ll p$ functions that contains most of the information in the functions $\{W_i(s_j) : j = 1, \ldots, p\}$.

There are many approaches to dimension reduction. Here we focus on two closely related techniques: Singular Value Decomposition (SVD) and Principal Component Analysis (PCA). While the linear algebra will get slightly involved, SVD and PCA are essential analytic tools for high-dimensional FDA. Moreover, the SVD and PCA of any $n \times p$ dimensional matrix can easily be computed in R [240] as described below.

#SVD of matrix W
SVD_of_W <- svd(W)
#PCA of matrix W
PCA_of_W <- princomp(W)
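Before applying either decomposition, the centering and scaling described above can be implemented in a few lines. This is a minimal sketch with our own object names, where W_raw plays the role of the raw data matrix.

#A minimal sketch of the centering and normalization step; W_raw is a
#hypothetical n x p matrix of raw functional observations
n <- nrow(W_raw)
p <- ncol(W_raw)
#Subtract the pointwise mean function and divide by sqrt(n * p)
W <- sweep(W_raw, 2, colMeans(W_raw)) / sqrt(n * p)
#Each column of W now has mean zero (up to numerical error)
max(abs(colMeans(W)))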
2.1.1 The Linear Algebra of SVD

The SVD of $\mathbf{W}$ is the decomposition $\mathbf{W} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^t$, where $\mathbf{U}$ is an $n \times n$ dimensional matrix with the property $\mathbf{U}^t\mathbf{U} = \mathbf{I}_n$, $\boldsymbol{\Sigma}$ is an $n \times p$ dimensional diagonal matrix, and $\mathbf{V}$ is a $p \times p$ dimensional matrix with the property $\mathbf{V}^t\mathbf{V} = \mathbf{I}_p$. Here $\mathbf{I}_n$ and $\mathbf{I}_p$ are the identity matrices of size $n$ and $p$, respectively. The diagonal entries $d_k$ of $\boldsymbol{\Sigma}$, $k = 1, \ldots, K = \min(n, p)$, are called the singular values of $\mathbf{W}$. The columns of $\mathbf{U}$, $\mathbf{u}_k = \{u_{ik} : i = 1, \ldots, n\}$, and of $\mathbf{V}$, $\mathbf{v}_k = \{v_k(s_j) : j = 1, \ldots, p\}$, for $k = 1, \ldots, K$, are the left and right singular vectors of $\mathbf{W}$, respectively. The matrix form of the SVD decomposition can be written in entry-wise form for every $s \in S$ as

$$W_i(s) = \sum_{k=1}^{K} d_k u_{ik} v_k(s)\;. \qquad (2.1)$$

This provides an explicit linear decomposition of the data in terms of the functions $\{v_k(s_j) : j = 1, \ldots, p\}$, which are the columns of $\mathbf{V}$ and form an orthonormal basis in $\mathbb{R}^p$. These right singular vectors are often referred to as the main directions of variation in the functional space. Because the $\mathbf{v}_k$ are orthonormal, the coefficients of this decomposition can be obtained as

$$d_k u_{ik} = \sum_{j=1}^{p} W_i(s_j) v_k(s_j)\;.$$

Thus, $d_k u_{ik}$ is the inner product between the $i$th row of $\mathbf{W}$ (the data for study participant $i$) and the $k$th column of $\mathbf{V}$ (the $k$th principal direction of variation in functional space). We will show that $\{d_k^2 : k = 1, \ldots, K\}$ quantify the variability of the observed data explained by the vectors $\{v_k(s_j) : j = 1, \ldots, p\}$ for $k = 1, \ldots, K$. The total variance of the original data is

$$\frac{1}{np} \sum_{i=1}^{n} \sum_{j=1}^{p} \{W_{\text{raw},i}(s_j) - \bar{W}_{\text{raw}}(s_j)\}^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} W_i^2(s_j)\;,$$

which is equal to $\text{tr}(\mathbf{W}^t\mathbf{W}) = \text{tr}(\mathbf{V}\boldsymbol{\Sigma}^t\mathbf{U}^t\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^t)$, where $\text{tr}(\mathbf{A})$ denotes the trace of matrix $\mathbf{A}$. As $\mathbf{U}^t\mathbf{U} = \mathbf{I}_n$, $\text{tr}(\mathbf{W}^t\mathbf{W}) = \text{tr}(\mathbf{V}\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t) = \text{tr}(\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t\mathbf{V})$, where we used the property that $\text{tr}(\mathbf{A}\mathbf{B}) = \text{tr}(\mathbf{B}\mathbf{A})$ for $\mathbf{A} = \mathbf{V}$ and $\mathbf{B} = \boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t$. As $\mathbf{V}^t\mathbf{V} = \mathbf{I}_p$ and $\text{tr}(\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}) = \sum_{k=1}^{K} d_k^2$, it follows that

$$\sum_{i=1}^{n} \sum_{j=1}^{p} W_i^2(s_j) = \sum_{k=1}^{K} d_k^2\;, \qquad (2.2)$$

indicating that the total variance is equal to the sum of squares of the singular values.

In practice, for every $s \in S$, $W_i(s)$ is often approximated by $\sum_{k=1}^{K_0} d_k u_{ik} v_k(s)$, that is, by the first $K_0$ right singular vectors, where $0 \le K_0 \le K$. We now quantify the variance explained by these $K_0$ right singular vectors. Denote by $\mathbf{V} = [\mathbf{V}_{K_0} | \mathbf{V}_{-K_0}]$ the partition of $\mathbf{V}$ into the $p \times K_0$ dimensional sub-matrix $\mathbf{V}_{K_0}$ and the $p \times (p - K_0)$ dimensional sub-matrix $\mathbf{V}_{-K_0}$ containing the first $K_0$ and the last $(p - K_0)$ columns of $\mathbf{V}$, respectively. Similarly, denote by $\boldsymbol{\Sigma}_{K_0}$ and $\boldsymbol{\Sigma}_{-K_0}$ the sub-matrices of $\boldsymbol{\Sigma}$ that correspond to the first $K_0$ and last $(K - K_0)$ singular values, respectively. With this notation, $\mathbf{W} = \mathbf{U}\boldsymbol{\Sigma}_{K_0}\mathbf{V}_{K_0}^t + \mathbf{U}\boldsymbol{\Sigma}_{-K_0}\mathbf{V}_{-K_0}^t$ or, equivalently, $\mathbf{W} - \mathbf{U}\boldsymbol{\Sigma}_{K_0}\mathbf{V}_{K_0}^t = \mathbf{U}\boldsymbol{\Sigma}_{-K_0}\mathbf{V}_{-K_0}^t$. Using a similar argument to the one for the decomposition of the total variation, we obtain $\text{tr}(\mathbf{V}_{-K_0}\boldsymbol{\Sigma}_{-K_0}^t\mathbf{U}^t\mathbf{U}\boldsymbol{\Sigma}_{-K_0}\mathbf{V}_{-K_0}^t) = \sum_{k=K_0+1}^{K} d_k^2$. Therefore,

$$\text{tr}\left\{(\mathbf{W} - \mathbf{U}\boldsymbol{\Sigma}_{K_0}\mathbf{V}_{K_0}^t)^t(\mathbf{W} - \mathbf{U}\boldsymbol{\Sigma}_{K_0}\mathbf{V}_{K_0}^t)\right\} = \sum_{k=K_0+1}^{K} d_k^2\;.$$
Changing from matrix to entry-wise notation, this equality becomes

$$\sum_{i=1}^{n} \sum_{j=1}^{p} \Big\{W_i(s_j) - \sum_{k=1}^{K_0} d_k u_{ik} v_k(s_j)\Big\}^2 = \sum_{k=K_0+1}^{K} d_k^2\;. \qquad (2.3)$$

Equations (2.2) and (2.3) indicate that the first $K_0$ right singular vectors of $\mathbf{W}$ explain $\sum_{k=1}^{K_0} d_k^2$ of the total variance of the data, or a fraction equal to $\sum_{k=1}^{K_0} d_k^2 / \sum_{k=1}^{K} d_k^2$. In many applications the $d_k^2$ decrease quickly with $k$, indicating that only a few $v_k(\cdot)$ functions are enough to capture the variability in the observed data. It can also be shown that for every $K_0 = 1, \ldots, K$,

$$\sum_{i=1}^{n} \sum_{j=1}^{p} \Big\{W_i(s_j) - \sum_{k \ne K_0} d_k u_{ik} v_k(s_j)\Big\}^2 = d_{K_0}^2\;, \qquad (2.4)$$

where the sum over $k \ne K_0$ is over all $k = 1, \ldots, K$ except $K_0$. Thus, the $K_0$th right singular vector explains $d_{K_0}^2$ of the total variance, or a fraction equal to $d_{K_0}^2 / \sum_{k=1}^{K} d_k^2$. The proof is similar to the one for equation (2.3), but partitions the matrix $\mathbf{V}$ into a sub-matrix that contains its $K_0$th column vector and a sub-matrix that contains all its other columns.

In summary, equation (2.1) can be rewritten for every $s \in S$ as

$$W_i(s) = \sum_{k=1}^{K_0} d_k u_{ik} v_k(s) + \sum_{k=K_0+1}^{K} d_k u_{ik} v_k(s)\;, \qquad (2.5)$$

where $\sum_{k=1}^{K_0} d_k u_{ik} v_k(s)$ is the approximation of $W_i(s)$ and $\sum_{k=K_0+1}^{K} d_k u_{ik} v_k(s)$ is the approximation error with variance equal to $\sum_{k=K_0+1}^{K} d_k^2$. The number $K_0$ is typically chosen to explain a given fraction of the total variance of the data, but other criteria could be used.

We now provide the matrix equivalent of the approximation in equation (2.5). Recall that $W_i(s_j)$ is the $(i, j)$th entry of the matrix $\mathbf{W}$. If $\mathbf{u}_k$ and $\mathbf{v}_k$ denote the left and right singular vectors of $\mathbf{W}$, the $(i, j)$ entry of the matrix $\mathbf{u}_k\mathbf{v}_k^t$ is equal to $u_{ik} v_k(s_j)$. Therefore, the matrix format of equation (2.5) is

$$\mathbf{W} = \sum_{k=1}^{K_0} d_k \mathbf{u}_k \mathbf{v}_k^t + \sum_{k=K_0+1}^{K} d_k \mathbf{u}_k \mathbf{v}_k^t\;. \qquad (2.6)$$

The matrix $\sum_{k=1}^{K_0} d_k \mathbf{u}_k \mathbf{v}_k^t$ is called the rank $K_0$ approximation of $\mathbf{W}$.
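Equations (2.3) and (2.6) are easy to verify numerically. The sketch below, on a small simulated matrix of our own choosing, checks that the squared Frobenius norm of the rank-$K_0$ residual equals the sum of the trailing squared singular values.

#Numerical check of equations (2.3) and (2.6) on a small simulated matrix
set.seed(1)
n <- 20; p <- 50
W <- matrix(rnorm(n * p), n, p)
W <- sweep(W, 2, colMeans(W))            #center the columns, as in the text

sv <- svd(W)
K0 <- 3
#Rank K0 approximation: U Sigma_K0 V_K0^t
WK0 <- sv$u[, 1:K0] %*% diag(sv$d[1:K0]) %*% t(sv$v[, 1:K0])
#Residual sum of squares vs. sum of trailing squared singular values
c(sum((W - WK0)^2), sum(sv$d[-(1:K0)]^2))   #the two numbers agree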
2.1.2 The Link between SVD and PCA

The PCA [140, 229] of $\mathbf{W}$ is the decomposition $\mathbf{W}^t\mathbf{W} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^t$, where $\mathbf{V}$ is a $p \times p$ dimensional matrix with the property $\mathbf{V}^t\mathbf{V} = \mathbf{I}_p$ and $\boldsymbol{\Lambda}$ is a $p \times p$ diagonal matrix with non-negative elements on the diagonal, $\lambda_1 \ge \ldots \ge \lambda_p \ge 0$. PCA is also known as the discrete Karhunen-Loève transform [143, 184]. Denote by $\mathbf{v}_k$, $k = 1, \ldots, K = \min(n, p)$, the $p \times 1$ dimensional column vectors of the matrix $\mathbf{V}$. The vector $\mathbf{v}_k$ is the $k$th eigenvector of the matrix $\mathbf{W}^t\mathbf{W}$, corresponds to the eigenvalue $\lambda_k$, and has the property that $\mathbf{W}^t\mathbf{W}\mathbf{v}_k = \lambda_k\mathbf{v}_k$. In FDA the $\mathbf{v}_k$ vectors are referred to as eigenfunctions. In image analysis the term eigenimages is used instead. Just as with SVD, the $\mathbf{v}_k$ form a set of orthonormal vectors in $\mathbb{R}^p$.

It can be shown that every $\mathbf{v}_{k+1}$ explains the most residual variability in the data matrix, $\mathbf{W}$, after accounting for the eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$. We will show this for $\mathbf{v}_1$ first. Note that $\mathbf{W}^t\mathbf{W} = \sum_{k=1}^{K} \lambda_k \mathbf{v}_k\mathbf{v}_k^t$. If $\mathbf{v}$ is any $p \times 1$ dimensional vector such that $\mathbf{v}^t\mathbf{v} = 1$, the variance of $\mathbf{W}\mathbf{v}$ is $\mathbf{v}^t\mathbf{W}^t\mathbf{W}\mathbf{v} = \sum_{k=1}^{K} \lambda_k \mathbf{v}^t\mathbf{v}_k\mathbf{v}_k^t\mathbf{v}$. Denote by $\mathbf{v} = \sum_{l=1}^{K} a_l \mathbf{v}_l$ the expansion of $\mathbf{v}$ in the basis $\{\mathbf{v}_l : l = 1, \ldots, K\}$. Because the $\mathbf{v}_k$ are orthonormal, $\mathbf{v}_k^t\mathbf{v} = \sum_{l=1}^{K} a_l \mathbf{v}_k^t\mathbf{v}_l = a_k$ and $\mathbf{v}^t\mathbf{v} = \sum_{l=1}^{K} a_l^2 = 1$. Therefore, $\mathbf{v}^t\mathbf{W}^t\mathbf{W}\mathbf{v} = \sum_{k=1}^{K} \lambda_k a_k^2 \le \lambda_1 \sum_{k=1}^{K} a_k^2 = \lambda_1$. Equality can be achieved only when $a_1 = 1$ and $a_k = 0$ for $k = 2, \ldots, K$, that is, when $\mathbf{v} = \mathbf{v}_1$. Thus, $\mathbf{v}_1$ is the solution to the problem

$$\mathbf{v}_1 = \arg\max_{||\mathbf{v}|| = 1} \mathbf{v}^t\mathbf{W}^t\mathbf{W}\mathbf{v}\;. \qquad (2.7)$$

Once $\mathbf{v}_1$ is known, the projection of the data matrix on $\mathbf{v}_1$ is $\mathbf{A}_1\mathbf{v}_1^t$ and the residual variation in the data is $\mathbf{W} - \mathbf{A}_1\mathbf{v}_1^t$, an $n \times p$ dimensional matrix, where $\mathbf{A}_1$ is the $n \times 1$ dimensional vector of projection coefficients. Because the $\mathbf{v}_k$ are orthonormal, it can be shown that $\mathbf{A}_1 = \mathbf{W}\mathbf{v}_1$ and that the unexplained variation satisfies $(\mathbf{W} - \mathbf{W}\mathbf{v}_1\mathbf{v}_1^t)^t(\mathbf{W} - \mathbf{W}\mathbf{v}_1\mathbf{v}_1^t) = \sum_{k=2}^{K} \lambda_k\mathbf{v}_k\mathbf{v}_k^t$. Iterating with $\mathbf{W} - \mathbf{W}\mathbf{v}_1\mathbf{v}_1^t$ instead of $\mathbf{W}$, we obtain that the second eigenfunction, $\mathbf{v}_2$, maximizes the residual variance after accounting for $\mathbf{v}_1$. The process is then iterated.

PCA and SVD are closely connected, as $\mathbf{W}^t\mathbf{W} = \mathbf{V}\boldsymbol{\Sigma}^t\boldsymbol{\Sigma}\mathbf{V}^t$. Thus, if the $d_k^2$ are ordered such that $d_1^2 \ge \ldots \ge d_K^2 \ge 0$, the $k$th right singular vector of $\mathbf{W}$ is equal to the $k$th eigenvector of $\mathbf{W}^t\mathbf{W}$ and corresponds to the $k$th eigenvalue $\lambda_k = d_k^2$. Similarly, $\mathbf{W}\mathbf{W}^t = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^t\mathbf{U}^t$, indicating that the $k$th left singular vector of $\mathbf{W}$ is equal to the $k$th eigenvector of $\mathbf{W}\mathbf{W}^t$ and corresponds to the $k$th eigenvalue $\lambda_k = d_k^2$.

SVD and PCA have been developed for multivariate data and can be applied to functional data. There are, however, some specific considerations that apply to FDA: (1) the data $W_i(s)$ are functions of $s$ and are expressed in the same units for all $s$; (2) the mean function, $\bar{W}(s)$, and the main directions of variation in the functional space, $\mathbf{v}_k = \{v_k(s_j) : j = 1, \ldots, p\}$, are functions of $s \in S$; (3) these functions inherit and abide by the rules induced by the organization of the space in $S$ (e.g., they do not change too much for small variations in $s$); (4) the correlation structure between $W_i(s)$ and $W_i(s')$ may depend on $(s, s')$; and (5) the data may be observed with noise, which may substantially affect the calculation and interpretation of $\{v_k(s_j) : j = 1, \ldots, p\}$. For these reasons, FDA often uses smoothing assumptions on $W_i(\cdot)$, $\bar{W}(\cdot)$, and $v_k(\cdot)$. These smoothing assumptions provide a different flavor to PCA and SVD and give rise to functional PCA (FPCA) and functional SVD (FSVD). While FPCA is better known in FDA, FSVD is a powerful technique that is indispensable for higher dimensional (large $p$) applications. A more in-depth look at smoothing in FDA is provided in Section 2.3.
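The equivalence between the SVD of $\mathbf{W}$ and the eigendecomposition of $\mathbf{W}^t\mathbf{W}$ is easy to verify numerically. The sketch below uses a small simulated matrix of our own choosing; since eigenvectors and singular vectors are only defined up to sign, we compare absolute values.

#Numerical check of the SVD/PCA link on a small simulated matrix
set.seed(2)
W <- matrix(rnorm(20 * 30), nrow = 20, ncol = 30)
sv <- svd(W)
ev <- eigen(crossprod(W))            #eigendecomposition of W^t W
#Eigenvalues of W^t W equal the squared singular values of W
max(abs(ev$values[1:20] - sv$d^2))
#First eigenvector equals the first right singular vector up to sign
max(abs(abs(ev$vectors[, 1]) - abs(sv$v[, 1])))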
2.1.3 SVD and PCA for High-Dimensional FDA

When data are high-dimensional (large $p$), the $n \times p$ dimensional matrix $\mathbf{W}$ cannot be loaded into memory and the SVD cannot be performed. Things are even more difficult for PCA, which uses the $p \times p$ dimensional matrix $\mathbf{W}^t\mathbf{W}$. In this situation, feasible computational alternatives are needed.

Consider the case when $p$ is very large but $n$ is small to moderate. It can be shown that $\mathbf{W}\mathbf{W}^t = \sum_{j=1}^{p} \mathbf{w}_j\mathbf{w}_j^t$, where $\mathbf{w}_j$ is the $j$th column of matrix $\mathbf{W}$. The advantage of this formulation is that it can be computed sequentially. For example, if $\mathbf{C}_k = \sum_{j=1}^{k} \mathbf{w}_j\mathbf{w}_j^t$, then $\mathbf{C}_1 = \mathbf{w}_1\mathbf{w}_1^t$, $\mathbf{C}_{k+1} = \mathbf{C}_k + \mathbf{w}_{k+1}\mathbf{w}_{k+1}^t$, and $\mathbf{C}_p = \mathbf{W}\mathbf{W}^t$. It takes $O(n^2)$ operations to calculate $\mathbf{C}_1$ because it requires the multiplication of an $n \times 1$ by a $1 \times n$ dimensional matrix. At every step, $k + 1$, only the $n \times n$ dimensional matrix $\mathbf{C}_k$ and the $n \times 1$ vector $\mathbf{w}_{k+1}$ need to be loaded in memory. This avoids loading the complete data matrix. Thus, the matrix $\mathbf{W}\mathbf{W}^t$ can be calculated in $O(n^2 p)$ operations without ever loading the complete matrix, $\mathbf{W}$. The PCA decomposition $\mathbf{W}\mathbf{W}^t = \mathbf{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^t\mathbf{U}^t$ yields the matrices $\mathbf{U}$ and $\boldsymbol{\Sigma}$. The matrix $\mathbf{V}$ can then be obtained as $\mathbf{V} = \mathbf{W}^t\mathbf{U}\boldsymbol{\Sigma}^{-1}$. Thus, each column of $\mathbf{V}$ is obtained by multiplying $\mathbf{W}^t$ with the corresponding column of $\mathbf{U}\boldsymbol{\Sigma}^{-1}$. This requires $O(n^2 p)$ operations. As, in general, we are only interested in the first $K_0$ columns of $\mathbf{V}$, the total number of operations is of the order $O(n^2 p K_0)$. Moreover, the operations do not require loading the entire data set into computer memory. Indeed, $\mathbf{W}^t\mathbf{U}\boldsymbol{\Sigma}^{-1}$ can be computed by loading one $1 \times n$ dimensional row of $\mathbf{W}^t$ at a time. The essential idea of this computational trick is to replace the diagonalization of the large $p \times p$ dimensional matrix $\mathbf{W}^t\mathbf{W}$ with the diagonalization of the much smaller $n \times n$ dimensional matrix $\mathbf{W}\mathbf{W}^t$.

When $n$ is also large, this trick does not work. A simple solution to address this problem is to sub-sample the rows of the matrix $\mathbf{W}$ to a tractable sample size, say 2000. Sub-sampling can be repeated and right singular vectors can be averaged across sub-samples. Other solutions include incremental, or streaming, approaches [133, 203, 219, 285] and the power method [67, 158]. The incremental, or streaming, approaches start with a number of rows of $\mathbf{W}$ that can be handled computationally. Then covariance operators, eigenvectors, and eigenvalues are updated as new rows are added to the matrix $\mathbf{W}$. The power method starts with the $n \times n$ dimensional matrix $\mathbf{A} = \mathbf{W}\mathbf{W}^t$ and an $n \times 1$ dimensional random normal vector $\mathbf{u}_0$, which is normalized, $\mathbf{u}_0 \leftarrow \mathbf{u}_0 / ||\mathbf{u}_0||$. Here $||\mathbf{a}|| = (\mathbf{a}^t\mathbf{a})^{1/2}$ is the norm induced by the inner product in $\mathbb{R}^n$. The power method consists of calculating the updates $\mathbf{u}_{r+1} \leftarrow \mathbf{A}\mathbf{u}_r$ and $\mathbf{u}_{r+1} \leftarrow \mathbf{u}_{r+1} / ||\mathbf{u}_{r+1}||$. Under mild conditions, this approach yields the first eigenvector of $\mathbf{A}$ (from which the first right singular vector $\mathbf{v}_1 = \mathbf{W}^t\mathbf{u}_1 / d_1$ can be obtained), which can be subtracted out and the method iterated to obtain the subsequent eigenvectors. The computational trick here is that diagonalization of matrices is replaced by matrix multiplications, which are much more computationally efficient. We have found that sub-sampling is a very powerful, easy-to-use method and we recommend it as a first-line approach in cases when both $n$ and $p$ are very large.
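The sketch below illustrates the large-$p$ trick and the power method on simulated data. In a real application the columns of W would be read from disk one at a time, which is simulated here by indexing an in-memory matrix; all object names are ours.

#Sketch of the large-p trick: accumulate C = W W^t one column at a time,
#then recover the first K0 right singular vectors as V = W^t U Sigma^{-1}
set.seed(3)
n <- 50; p <- 5000; K0 <- 4
W <- matrix(rnorm(n * p), n, p)       #stand-in for data too large to load

C <- matrix(0, n, n)
for (j in 1:p) {                      #in practice, read column j from disk
  C <- C + tcrossprod(W[, j])         #C_{k+1} = C_k + w_j w_j^t
}
eig <- eigen(C)
U <- eig$vectors[, 1:K0]
d <- sqrt(eig$values[1:K0])           #singular values of W
V <- t(W) %*% U %*% diag(1 / d)       #one row of W^t at a time, in practice
max(abs(d - svd(W)$d[1:K0]))          #agrees with the direct SVD

#Power method for the leading eigenvector of A = W W^t, using C from above
u <- rnorm(n); u <- u / sqrt(sum(u^2))
for (r in 1:100) { u <- C %*% u; u <- u / sqrt(sum(u^2)) }
max(abs(abs(u) - abs(U[, 1])))        #matches the first column of U up to sign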
2.1.4 SVD for US Excess Mortality

We show how SVD can be used to visualize and analyze the cumulative all-cause excess mortality data in 50 states and 2 territories (District of Columbia and Puerto Rico). Figure 2.1 displays these functions for each of the first 52 weeks of 2020. For each state or territory, $i$, the data are $W_i(s_j)$, where $s_j = j \in \{1, \ldots, p = 52\}$. The mean $\bar{W}(s_j)$ is obtained by averaging observations across states ($i$) for every week of 2020 ($s_j = j$). The R implementation is

#Calculate the mean of Wr, the un-centered data matrix
mW <- colMeans(Wr)
#Construct a matrix with the mean repeated on each row
mW_mat <- matrix(rep(mW, each = nrow(Wr)), ncol = ncol(Wr))
#Center the data
W <- Wr - mW_mat

Here mW is the R notation for the mean vector that contains $\bar{W}(s_j)$, $j = 1, \ldots, p$. We have not divided by $\sqrt{np}$, as the results are identical and it is more intuitive to work on the original scale of the data.

Figure 2.1 displays the cumulative excess mortality per one million people in each state of the US and two territories in 2020 (light gray lines). This is the same data as in Figure 1.5, without emphasizing the mortality patterns for specific states. Instead, the dark red line is the average of these curves and corresponds to the mW variable. Figure 2.2 displays the same data as Figure 2.1 after centering (removing the mean at every time point). These data are stored as rows in the matrix W (in R notation) and have been denoted as $\mathbf{W}$ (in statistical notation).
Five states are emphasized to provide examples of trajectories.
FIGURE 2.1: Each line represents the cumulative excess mortality for each state and two territories in the US. The mean cumulative excess mortality in the US per one million residents is shown as a dark red line.

The centered data matrix W (in R notation), denoted $\mathbf{W}$ in statistical notation, is decomposed using the SVD. The left singular vectors are stored as columns in the matrix U, the singular values are stored in the vector d, and the right singular vectors are stored as columns in the matrix V.

#Calculate the SVD of W
SVD_of_W <- svd(W)
#Left singular vectors stored by columns
U <- SVD_of_W$u
#Singular values
d <- SVD_of_W$d
#Right singular vectors stored by columns
V <- SVD_of_W$v

The individual and cumulative variance explained can be calculated from the vector of singular values, d. In R this is implemented as

#Calculate the eigenvalues
lambda <- SVD_of_W$d^2
#Individual proportion of variation explained
propor_var <- round(100 * lambda / sum(lambda), digits = 1)
#Cumulative proportion of variation explained
cumsum_var <- cumsum(propor_var)

Table 2.1 presents the individual and cumulative percent variance explained by the first five right singular vectors. The first two right singular vectors explain 84% and 11.9% of the variance, respectively, for a total of 95.9%.
FIGURE 2.2: Each line represents the centered cumulative excess mortality for each state in the US. Centered means that the average at every time point is equal to zero. Five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).

The first five right singular vectors explain a cumulative 99.7%, indicating that dimension reduction is quite effective in this particular example. Recall that the right singular vectors are the functional principal components.

TABLE 2.1: All-cause cumulative excess mortality in 50 US states plus Puerto Rico and District of Columbia. Individual and cumulative percent variance explained by the first five right singular vectors (principal components).

                    Right singular vectors
  Variance          1      2      3      4      5
  Individual (%)  84.0   11.9    2.9    0.6    0.3
  Cumulative (%)  84.0   95.9   98.8   99.4   99.7

The next step is to visualize the first two right singular vectors, which together explain 95.9% of the variability. These are the vectors V[,1] and V[,2] in R notation and $\mathbf{v}_1$ and $\mathbf{v}_2$ in statistical notation. Figure 2.3 displays the first (light coral) and second (dark coral) right singular vectors. The interpretation of the first right singular vector is that the mortality data for a state that has a positive coefficient (score) tends to (1) be closer to the US mean between January and April; (2) have a sharp increase above the US mean between April and June; and (3) be larger with a constant difference from the US mean between July and December.
FIGURE 2.3: First two right singular vectors (principal components) for all-cause weekly excess US mortality data in 2020. First right singular vector: light coral. Second right singular vector: dark coral.

The mortality data for a state that has a positive coefficient on the second right singular vector tends to (1) have an even sharper increase between April and June relative to the US average; and (2) exhibit a decreased difference from the US mean as time progresses from July to December. Of course, things are more complex, as the mean and right singular vectors can compensate for one another at specific times of the year.

Individual state mortality data can be reconstructed for all states simultaneously. A rank $K_0 = 2$ reconstruction of the data can be obtained as

#Set the reconstruction rank
K0 <- 2
#Reconstruct the centered data using the rank K0 approximation
rec <- SVD_of_W$u[, 1:K0] %*% diag(SVD_of_W$d[1:K0]) %*% t(V[, 1:K0])
#Add the mean to the rank K0 approximation of W
WK0 <- mW_mat + rec

The matrices W and WK0 contain the original and reconstructed data, where each state is recorded by rows. Figure 2.4 displays the original (solid lines) and reconstructed data (dashed lines of matching color) for five states: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum). Even though the reconstructions are not perfect, they do capture the main features of the data for each of the five states. Better approximations can be obtained by increasing $K_0$, though at the expense of using additional right singular vectors.
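To choose $K_0$ in practice, it is useful to track how fast the reconstruction error decays with the rank. The sketch below, our addition rather than the book's code, computes the relative error of the rank-$K_0$ approximation for several ranks using the objects defined above; its output should mirror the cumulative percentages in Table 2.1.

#Relative reconstruction error of the rank K0 approximation of the
#centered data W, for K0 = 1, ..., 10
rel_err <- sapply(1:10, function(K0) {
  rec <- SVD_of_W$u[, 1:K0, drop = FALSE] %*%
         diag(SVD_of_W$d[1:K0], nrow = K0) %*%
         t(SVD_of_W$v[, 1:K0, drop = FALSE])
  sum((W - rec)^2) / sum(W^2)
})
round(100 * (1 - rel_err), 1)   #matches the cumulative variance in Table 2.1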
FIGURE 2.4: All-cause excess mortality (solid lines) and predictions based on the rank 2 SVD (dashed lines) for five states in the US: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).

Consider, for example, the mortality data from New Jersey. The rank $K_0 = 2$ reconstruction of the data is

$$W_{\text{NJ}}(s) = \bar{W}_{\text{US}}(s) + 0.49\, v_1(s) + 0.25\, v_2(s)\;,$$

where the coefficients 0.49 and 0.25 correspond to $u_{i1}$ and $u_{i2}$, the $(i, 1)$ and $(i, 2)$ entries of the matrix $\mathbf{U}$ (U in R), where $i$ corresponds to New Jersey. These values can be calculated in R as

U[states == "New Jersey", 1:2]

where states is the vector containing the names of US states and territories. We have used the notation $\bar{W}_{\text{US}}(s)$ instead of $\bar{W}(s)$ and $W_{\text{NJ}}(s)$ instead of $W_i(s)$ to improve the precision of notation. Both coefficients, for $v_1(\cdot)$ and $v_2(\cdot)$, are positive, indicating that for New Jersey there was a strong increase in mortality between April and June, a much slower increase between June and November, and a further larger increase in December. Even though neither of the two components contained information about the increase in mortality in December, the effect was accounted for by the mean; see, for example, the increase during the November-December period in the mean in Figure 2.1.

All the coefficients, also known as scores, are stored in the matrix U. It is customary to display these scores using scatter plots. For example,

plot(U[,1], U[,2])

produces a plot similar to the one shown in Figure 2.5. Every point in this graph represents a state, and the same five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).
FIGURE 2.5: Scores on the first versus second right singular vectors for all-cause weekly excess mortality in the US. Each dot is a state, Puerto Rico, or Washington DC. Five states are emphasized: New Jersey (green), Louisiana (red), Maryland (blue), Texas (salmon), and California (plum).

Note that New Jersey is the point with the largest score on the first right singular vector and the third largest score on the second right singular vector. Louisiana has the third largest score on the first right singular vector, which is consistent with being among the states with the highest all-cause mortality. In contrast to New Jersey, the score for Louisiana on the second right singular vector is negative, indicating that its cumulative mortality data continues to increase away from the US mean between May and November; see Figure 2.2.

2.2 Gaussian Processes

While all the data we observe are sampled at discrete time points, observed functional data are thought of as realizations of an underlying continuous process. Here we provide some theoretical concepts that will help with the interpretation of the analytic methods. A Gaussian Process (GP) is a collection of random variables $\{W(s), s \in S\}$ such that every finite collection of random variables $\{W(s_1), \ldots, W(s_p)\}$, with $s_j \in S$ for every $j = 1, \ldots, p$ and every $p$, has a multivariate Gaussian distribution. For convenience, we consider $S = [0, 1]$ and interpret it as time, but Gaussian Processes can be defined over space as well. A Gaussian Process
is completely characterized by its mean $\mu(s)$ and covariance operator $K_W : S \times S \to \mathbb{R}$, where $K_W(s_1, s_2) = \text{Cov}\{W(s_1), W(s_2)\}$. Assume now that the mean of the process is 0. By Mercer's theorem [199] there exists a set of eigenvalues and eigenfunctions $\lambda_k, \phi_k(s)$, where $\lambda_k \ge 0$, the $\phi_k : S \to \mathbb{R}$ form an orthonormal basis in $L^2([0, 1])$,

$$\int_0^1 K_W(s, t)\,\phi_k(t)\,dt = \lambda_k \phi_k(s)$$

for every $s \in S$ and $k = 1, 2, \ldots$, and

$$K_W(s_1, s_2) = \sum_{k=1}^{\infty} \lambda_k \phi_k(s_1)\phi_k(s_2)\;.$$

The Kosambi-Karhunen-Loève (KKL) [143, 157, 184] theorem provides the explicit decomposition of the process $W(s)$. Because the $\phi_k(t)$ form an orthonormal basis, the Gaussian Process can be expanded as

$$W(s) = \sum_{k=1}^{\infty} \xi_k \phi_k(s)\;,$$

where $\xi_k = \int_0^1 W(s)\phi_k(s)\,ds$, which does not depend on $s$. It is easy to show that $E(\xi_k) = 0$, as

$$E(\xi_k) = E\Big\{\int_0^1 W(s)\phi_k(s)\,ds\Big\} = \int_0^1 E\{W(s)\}\phi_k(s)\,ds = 0\;.$$

We can also show that $\text{Cov}(\xi_k, \xi_l) = E(\xi_k\xi_l) = 0$ for $k \ne l$ and $\text{Var}(\xi_k) = \lambda_k$. The proof is shown below:

$$E(\xi_k\xi_l) = E\Big\{\int_0^1\!\!\int_0^1 W(s)W(t)\phi_k(t)\phi_l(s)\,dt\,ds\Big\} = \int_0^1\!\!\int_0^1 E\{W(s)W(t)\}\phi_k(t)\phi_l(s)\,dt\,ds$$
$$= \int_0^1 \Big\{\int_0^1 K_W(s, t)\phi_k(t)\,dt\Big\}\phi_l(s)\,ds = \lambda_k \int_0^1 \phi_k(s)\phi_l(s)\,ds = \lambda_k \delta_{kl}\;, \qquad (2.8)$$

where $\delta_{kl} = 1$ if $k = l$ and 0 otherwise. The second equality holds because of the change of order of integrals (expectations), the third equality holds because of the definition of $K_W(s, t)$, the fourth equality holds because $\phi_k(s)$ is the eigenfunction of $K_W(\cdot, \cdot)$ corresponding to the eigenvalue $\lambda_k$, and the fifth equality holds because of the orthonormality of the $\phi_k(s)$ functions.

These results hold for any process with square-integrable sample paths on $[0, 1]$ and do not require Gaussianity of the scores. However, if the process is Gaussian, it can be shown that any finite collection $\{\xi_{k_1}, \ldots, \xi_{k_l}\}$ is jointly Gaussian. Because the individual entries are uncorrelated and mean-zero, the scores are independent Gaussian random variables. One could reasonably ask why one should care about all these properties and whether this theory has any practical implications. Below we identify some of the practical implications.

The expression "Gaussian Process" is quite intimidating, the definition is relatively technical, and it is not clear from the definition that such objects even exist. However, these results show how to generate Gaussian Processes relatively easily. Indeed, the only ingredients we need are a set of orthonormal functions $\phi_k(\cdot)$ in $L^2[0, 1]$ and a set of positive numbers $\lambda_1 \ge \lambda_2 \ge \ldots$. For example, if $\phi_1(s) = \sqrt{2}\sin(2\pi s)$, $\phi_2(s) = \sqrt{2}\cos(2\pi s)$, $\lambda_1 = 4$, and $\lambda_2 = 1$, then $W(s) = \xi_1\phi_1(s) + \xi_2\phi_2(s)$, where $\xi_1 \sim N(0, \lambda_1)$ and $\xi_2 \sim N(0, \lambda_2)$ are independent, is a Gaussian Process.
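To make the construction concrete, the following sketch simulates curves from this two-component process on an equally spaced grid. The grid size, seed, and number of curves are our own choices for illustration.

#A minimal sketch: simulate the two-component Gaussian Process above
set.seed(42)
s <- seq(0, 1, length.out = 101)           #observation grid on [0, 1]
phi1 <- sqrt(2) * sin(2 * pi * s)          #first eigenfunction
phi2 <- sqrt(2) * cos(2 * pi * s)          #second eigenfunction
lambda <- c(4, 1)                          #eigenvalues
n <- 10                                    #number of simulated curves
#Scores: xi_k ~ N(0, lambda_k), independent across k and across curves
xi <- cbind(rnorm(n, sd = sqrt(lambda[1])), rnorm(n, sd = sqrt(lambda[2])))
#Each row of W is one realization W_i(s) = xi_1 phi_1(s) + xi_2 phi_2(s)
W <- xi %*% rbind(phi1, phi2)
matplot(s, t(W), type = "l", lty = 1, xlab = "s", ylab = "W(s)")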
  • 55. Random documents with unrelated content Scribd suggests to you:
  • 56. could pick; then I required that of him every day, or I docked his wages.” As we were talking, the mate of the “Quitman” took up an oyster- shell and threw it at the head of one of the deck-hands, who did not handle the cotton to suit him. It did not hurt the negro’s head much, but it hurt his feelings. “Out on the plantations,” observed my friend the overseer, “it would cost him fifty dollars to hit a nigger that way. It cost me a hundred and fifty dollars just for knocking down three niggers lately,—fifty dollars a piece, by ——!” He thought the negroes were going to be crowded out by the Germans; and went on to say, with true Southern consistency,— “The Germans want twenty dollars a month, and we can hire the niggers for ten and fifteen. The Germans will die in our swamps. Then as soon as they get money enough to buy a cart and mule, and an acre of land somewhar, whar they can plant a grape-vine, they’ll go in for themselves.”
  • 57. CHAPTER LV. THE LOWER MISSISSIPPI. We were nearly all night at Natchez loading cotton. The next day, I noticed that the men worked languidly, and that the mate was plying them with whiskey. I took an opportunity to talk with him about them. He said,— “We have a hundred and eighty hands aboard, all told. Thar’s sixty deck-hands. That a’n’t enough. We ought to have reliefs, when we’re shipping freight day and night as we are now.” I remarked: “A gentleman who came up to Vicksburg in the ‘Fashion,’ stated, as an excuse for the long trip she made, that the niggers wouldn’t work,—that the mates couldn’t make them work.” He replied: “I reckon the hands on board the ‘Fashion’ are about in the condition these are. These men are used up. They ha’n’t had no sleep for four days and nights. I’ve seen a man go to sleep many a time, standing up, with a box on his shoulder. We pay sixty dollars a month,—more’n almost any other boat, the work is so hard. But we get rid of paying a heap of ’em. When a man gets so used up he can’t stand no more, he quits. He don’t dare to ask for wages, for he knows he’ll get none, without he sticks by to the end of the trip.” While we were talking, a young fellow, not more than twenty years old, came up, looking very much exhausted, and told the mate he was sick. “Ye a’n’t sick neither!” roared the mate at him, fiercely. “You’re lazy! If you won’t work, go ashore.”
  • 58. The fellow limped away again, and went ashore at the next landing. “Is he sick or lazy?” I asked. “Neither. He’s used up. He was as smart a hand as I had when he came aboard. But they can’t stand it.” “Was it always so?” “No; before the war we had men trained for this work. We had some niggers, but more white men. We couldn’t git all the niggers we wanted; a fifteen hundred dollar man wore out too quick.” “The whites were the best, I suppose.” “The niggers was the best. They was more active getting down bales. They liked the fun. They stand it better than white men. Business stopped, and that set of hands all dropped off,—went into the war, the most of ’em. Now we have to take raw hands. These are all plantation niggers. Not one of ’m’ll ship for another trip; they’ve had enough of it. Thar’s no compellin’ ’em. You can’t hit a nigger now, but these d——d Yankee sons of b——s have you up and make you pay for it.” I told him if that was the case, I didn’t think I should hit one. “They’ve never had me up,” he resumed. “When I tackle a nigger, it’ll be whar thar an’t no witnesses, and it’ll be the last of him. That’s what ought to be done with ’em,—kill ’em all off. I like a nigger in his place, and that’s a servant, if thar’s any truth in the Bible.” This allusion to Scripture, from lips hot with words of wrath and wrong, was especially edifying. The “Quitman” was a fine boat, and passengers, if not deck-hands, fared sumptuously on board of her. The table was equal to that of the best hotels. An excellent quality of claret wine was furnished, as a part of the regular dinner fare, after the French fashion, which appears to have been introduced into this country by the Creoles, and which is to be met with, I believe, only on the steamboats of the Lower Mississippi.
  • 59. On the “Quitman,” as on the boat from Memphis to Vicksburg, I made the acquaintance of all sorts of Southern people. The conversation of some of them is worth recording. One, a Mississippi planter, learning that I was a Northern man, took me aside, and with much emotion, asked if I thought there was “any chance of the government paying us for our niggers.” “What niggers?” “The niggers you’ve set free by this abolition war.” “This abolition war you brought upon yourselves; and paying you for your slaves would be like paying a burglar for a pistol lost on your premises. No, my friend, believe me, you will never get the first cent, as long as this government lasts.” He looked deeply anxious. But he still cherished a hope. “I’ve been told by a heap of our people that we shall get our pay. Some are talking about buying nigger claims. They expect, when our representatives get into Congress, there’ll be an appropriation made.” He went on: “I did one mighty bad thing. To save my niggers, I run ’em off into Texas. It cost me a heap of money. I came back without a dollar, and found the Yankees had taken all my stock, and everything, and my niggers was free, after all.” Jim B——, from Warren County, ten miles from Vicksburg, was a Mississippi planter of a different type,—jovial, generous, extravagant in his speech, and, in his habits of living, fast. “My niggers are all with me yet, and you can’t get ’em to leave me. The other day my boy Dan drove me into town; when we got thar, I says to him, ‘Dan, ye want any money?’ ‘Yes, master, I’d like a little?’ I took out a ten- dollar bill and give him. Another nigger says to him, ‘Dan, what did that man give you money for?’ ‘That man?’ says Dan; ‘I belongs to him.’ ‘No, you don’t belong to nobody now; you’re free.’ ‘Well,’ says Dan, ‘he provides for me, and gives me money, and he’s my master, any way.’ I give my boys a heap more money than I should if I just hired ’em. We go right on like we always did, and I pole ’em if they
  • 60. don’t do right. This year I says to ’em, ‘Boys, I’m going to make a bargain with you. I’ll roll out the ploughs and the mules and the feed, and you shall do the work; we’ll make a crop of cotton, and you shall have half. I’ll provide for ye, give ye quarters, treat ye well, and when ye won’t work, pole ye like I always have. They agreed to it, and I put it into the contract that I was to whoop ’em when I pleased.” Jim was very enthusiastic about a girl that belonged to him. “She’s a perfect mountain-spout of a woman!” (if anybody knows what that is.) “When the Yankees took me prisoner, she froze to a trunk of mine, and got it out of the way with fifty thousand dollars Confederate money in it.” He never wearied of praising her fine qualities. “She’s black outside, but she’s white inside, shore!” And he spoke of a son of hers, then twelve years old, with an interest and affection which led me to inquire about the child’s father. “Well,” said Jim, with a smile, “he’s a perfect little image of me, only a shade blacker.” An Arkansas planter said: “I’ve a large plantation near Pine Bluff. I furnish everything but clothes, and give my freedmen one third of the crop they make. On twenty plantations around me, there are ten different styles of contracts. Niggers are working well; but you can’t get only about two thirds as much out of ’em now as you could when they were slaves” (which I suppose is about all that ought to be got out of them). “The nigger is fated: he can’t live with the white race, now he’s free. I don’t know one I’d trust with fifty dollars, or to manage a crop and control the proceeds. It will be generations before we can feel friendly towards the Northern people.” I remarked: “I have travelled months in the South, and expressed my sentiments freely, and met with better treatment than I could have expected five years ago.” “That’s true; if you had expressed abolition sentiments then, you’d have woke up some morning and found yourself hanging from some
  • 61. limb.” Of the war he said: “Slavery was really what we were fighting for, although the leaders didn’t talk that to the people. They saw the slave interest was losing power in the Union, and trying to straighten it up, they tipped it over.” A Louisiana planter, from Lake Providence,—and a very intelligent, well-bred gentleman,—said: “Negroes do best when they have a share of the crop; the idea of working for themselves stimulates them. Planters are afraid to trust them to manage; but it’s a great mistake. I know an old negro who, with three children, made twenty-five bales of cotton this year on abandoned land. Another, with two women and a blind mule, made twenty-seven bales. A gang of fifty made three hundred bales,—all without any advice or assistance from white men. I was always in favor of educating and elevating the black race. The laws were against it, but I taught all my slaves to read the Bible. Each race has its peculiarities: the negro has his, and it remains to be seen what can be done with him. Men talk about his stealing: no doubt he’ll steal: but circumstances have cultivated that habit. Some of my neighbors couldn’t have a pig, but their niggers would steal it. But mine never stole from me, because they had enough without stealing. Giving them the elective franchise just now is absurd; but when they are prepared for it, and they will be some day, I shall advocate it.” Another Louisianian, agent of the Hope Estate, near Water-Proof, in Tensas Parish, said: “I manage five thousand acres,—fourteen hundred under cultivation. I always fed my niggers well, and rarely found one that would steal. My neighbors’ niggers, half-fed, hard- worked, they’d steal, and I never blamed ’em. Nearly all mine stay with me. They’ve done about two thirds the work this year they used to, for one seventh of the crops. Heap of niggers around me have never received anything; they’re only just beginning to learn that they’re free. Many planters keep stores for niggers, and sell ’em flour, prints, jewelry and trinkets, and charge two or three prices for everything. I think God intended the niggers to be slaves; we have
  • 62. the Bible for that:” always the Bible. “Now since man has deranged God’s plan, I think the best we can do is to keep ’em as near a state of bondage as possible. I don’t believe in educating ’em.” “Why not?” “One reason, schooling would enable them to compete with white mechanics.” “And why not?” “It would be a disadvantage to the whites,” he replied,—as if that was the only thing to be considered by men with the Bible in their mouths! “In Mississippi, opposite Water-Proof, there’s a minister collecting money to buy plantations in a white man’s name, to be divided in little farms of ten and fifteen acres for the niggers. He couldn’t do that thing in my parish: he’d soon be dangling from some tree. There isn’t a freedman taught in our parish; not a school; it wouldn’t be allowed.” He admitted that the war was brought on by the Southern leaders, but thought the North “ought to be lenient and give them all their rights.” Adding: “What we want chiefly is to legislate for the freedmen. Another thing: the Confederate debt ought to be assumed by the government. We shall try hard for that. If we can’t get it, if the North continues to treat us as a subjugated people, the thing will have to be tried over again,”—meaning the war. “We must be left to manage the nigger. He can’t be made to work without force.” (He had just said his niggers did two thirds as much work as formerly.) “My theory is, feed ’em well, clothe ’em well, and then, if they won’t work, d—n ’em, whip ’em well!” I did not neglect the deck-passengers. These were all negroes, except a family of white refugees from Arkansas, who had been burnt out twice during the war, once near Little Rock, and again in Tennessee, near Memphis. With the little remnant of their possessions they were now going to seek their fortunes elsewhere,— ill-clad, starved-looking, sleeping on deck in the rain, coiled around the smoke-pipe, and covered with ragged bedclothes.
  • 63. The talk of the negroes was always entertaining. Here is a sample, from the lips of a stout old black woman:— “De best ting de Yankees done was to break de slavery chain. I shouldn’t be here to-day if dey hadn’t. I’m going to see my mother.” “Your mother must be very old.” “You may know she’s dat, for I’m one of her baby chil’n, and I’s got ’leven of my own. I’ve a heap better time now ’n I had when I was in bondage. I had to nus’ my chil’n four times a day and pick two hundred pounds cotton besides. My third husband went off to de Yankees. My first was sold away from me. Now I have my second husband again; I was sold away from him, but I found him again, after I’d lived with my third husband thirteen years.” I asked if he was willing to take her back. “He was willing to have me again on any terms”—emphatically—“for he knowed I was Number One!” Several native French inhabitants took passage at various points along the river, below the Mississippi line. All spoke very good French, and a few conversed well in English. One, from Point Coupée Parish, said: “Before the war, there were over seventeen thousand inhabitants in our parish.” (In Louisiana a county is called a parish.) “Nearly thirteen thousand were slaves. Many of the free inhabitants were colored; so that there were about four colored persons to one white. We made yearly between eight and nine thousand hogsheads of sugar, and fifteen hundred bales of cotton. The war has left us only three thousand inhabitants. We sent fifteen hundred men into the Confederate army. All the French population were in favor of secession. The white inhabitants of these parishes are mostly French Creoles. We treated our slaves better than the Americans treated theirs. We didn’t work them so hard; and there was more familiarity and kindly feeling between us and our servants. The children were raised together; and a white child learned the negroes’ patois before he learned French. The patois is curious: a negro says ‘Moi pas connais’ for ‘Je ne sais pas’ (I do not know); and they use a great
  • 64. many African words which you would not understand. Our slaves were never sold except to settle an estate. Besides these two classes there was a third, quite separate, which did not associate with either of the others. They were the free colored, of French-African descent, some almost or quite white, with many large property holders and slave-owners among them; a very respectable class, forming a society of their own.” The villages and plantation dwellings along here, with their low roofs and sunny verandas, on the level river bank, had a peculiarly foreign and tropical appearance. The levees of Louisiana form a much more extensive and complete system than those of Mississippi. In the latter State there is much hilly land that does not need their protection, and much swamp land not worth protecting; and there is, I believe, no law regarding them. In the low and level State of Louisiana, however, a large and fertile part of which lies considerably below the level of high water, there is very strict legislation on the subject, compelling every land-owner on the river to keep up his levees. This year the State itself had undertaken to repair them, issuing eight per cent. bonds to the amount of a million dollars for the purpose,—the expense of the work to be defrayed eventually by the planters. For a long distance the Lower Mississippi, at high water, appears to be flowing upon a ridge. The river has built up its own banks higher than the country which lies back of them; and the levees have raised them still higher. Behind this fertile strip there are extensive swamps, containing a soil of unsurpassed depth and richness, but unavailable for want of drainage. Three methods are proposed for bringing them under cultivation. First, to surround them by levees, ditch them, and pump the water out by steam. Second, to cut a canal through them to the Gulf. Third, to turn the Mississippi into them, and fill them with its alluvial deposit. This last method is no doubt the one Nature intended to employ; and it is the opinion of many that man, confining the flow of the stream within artificial limits, attempted the settlement of this country several centuries too soon.
  • 65. A remarkable feature of Louisiana scenery is its forests of cypress- trees growing out of the water, heavy, sombre, and shaggy with moss. The complexion of the river water is a light mud-color, which it derives from the turbid Missouri,—the Upper Mississippi being a clear stream. Pour off a glass of it after it has been standing a short time, and a sediment of dark mud appears at the bottom. Notwithstanding this unpleasant peculiarity, it is used altogether for cooking and drinking purposes on board the steamboats, and I found New Orleans supplied with it. A curious fact has been suggested with regard to this wonderful river,—that it runs up hill. Its mouth is said to be two and a half miles higher—or farther from the earth’s centre—than its source. When we consider that the earth is a spheroid, with an axis shorter by twenty-six miles than its equatorial diameter; and that the same centrifugal motion which has caused the equatorial protuberance tends still to heap up the waters of the globe where that motion is greatest; the seeming impossibility appears possible,—just as we see a revolving grindstone send the water on its surface to the rim. Stop the grindstone, and the water flows down its sides. Stop the earth’s revolution, and immediately you will see the Mississippi River turn and flow the other way. Some years ago I made a voyage of several days on the Upper Mississippi, to the head of navigation. It was difficult to realize that this was the same stream on which I was now sailing day after day in an opposite direction,—six days in all, from Memphis to New Orleans. From St. Anthony’s Falls to the Gulf, the Mississippi is navigable twenty-two hundred miles. Its entire length is three thousand miles. Its great tributary, the Missouri, is alone three thousand miles in length: measured from its head-waters to the Gulf, it is four thousand five hundred miles. Consider also the Ohio, the Arkansas, the Red River, and the hundred lesser streams that fall into it, and well may we call it by its Indian name, Michi-Sepe, the Father of Waters.
  • 66. CHAPTER LVI. THE CRESCENT CITY. On the morning of January 1st, 1866, I arrived at New Orleans. It was midwinter; but the mild sunny weather that followed the first chill days of rain, made me fancy it May. The gardens of the city were verdant with tropical plants. White roses in full bloom climbed upon trellises or the verandas of houses. Oleander trees, bananas with their broad drooping leaves six feet long, and Japan plums that ripen in February, grew side by side in the open air. There were orange-trees whose golden fruit could be picked from the balconies which they half concealed. Magnolias, gray-oaks and live-oaks, some heavily hung with moss that swung in the breeze like waving hair, shaded the yards and streets. I found the roadsides of the suburbs green with grass, and the vegetable gardens checkered and striped with delicately contrasting rows of lettuce, cabbages, carrots, beets, onions, and peas in blossom. The French quarter of the city impresses you as a foreign town transplanted to the banks of the Mississippi. Many of the houses are very ancient, with low, moss-covered roofs projecting over the first story, like slouched hat-brims over quaint old faces. The more modern houses are often very elegant, and not less picturesque. The names of the streets are Pagan, foreign, and strange. The gods and muses of mythology, the saints of the Church, the Christian virtues, and modern heroes, are all here. You have streets of “Good Children,” of “Piety,” of “Apollo,” of “St. Paul,” of “Euterpe,” and all their relations. The shop-signs are in French, or in French and English. The people you meet have a foreign air and speak a foreign
  • 67. tongue. Their complexions range through all hues, from the dark Creole to the ebon African. The anomalous third class of Louisiana— the respectable free colored people of French-African descent—are largely represented. Dressed in silks, accompanied by their servants, and speaking good French,—for many of them are well educated,— the ladies and children of this class enter the street cars, which they enliven with the Parisian vivacity of their conversation. The mingling of foreign and American elements has given to New Orleans a great variety of styles of architecture; and the whole city has a light, picturesque, and agreeable appearance. It is built upon an almost level strip of land bordering upon the left bank of the river, and falling back from the levee with an imperceptible slope to the cypress and alligator swamps in the rear. The houses have no cellars. I noticed that the surface drainage of the city flowed back from the river into the Bayou St. John, a navigable inlet of Lake Ponchartrain. The old city front lay upon a curve of the Mississippi, which gave it a crescent shape: hence its poetic soubriquet. The modern city has a river front seven miles in extent, bent like the letter S. The broad levee, lined with wharves on one side and belted by busy streets on the other, crowded with merchandise, and thronged with merchants, boatmen, and laborers, presents always a lively and entertaining spectacle. Steam and sailing crafts of every description, arriving, departing, loading, unloading, and fringing the city with their long array of smoke-pipes and masts, give you some idea of the commerce of New Orleans. Here is the great cotton market of the world. In looking over the cotton statistics of the past thirty years, I found that nearly one half the crop of the United States had passed through this port. In 1855– 1856 (the mercantile cotton year beginning September 1st and ending August 31st) 1,795,023 bales were shipped from New Orleans,—986,622 to Great Britain (chiefly to Liverpool); 214,814 to France (chiefly to Havre); 162,657 to the North of Europe; 178,812 to the South of Europe, Mexico, c.; and 222,100 coastwise,—
  • 68. 151,469 going to Boston and 51,340 to New York. In 1859–1860, 2,214,296 bales were exported, 1,426,966 to Great Britain, 313,291 to France, and 208,634 coastwise,—131,648 going to Boston, 62,936 to New York, and 5,717 to Providence. This, it will be remembered, was the great cotton year, the crop amounting to near 5,000,000 bales. One is interested to learn how much cotton left this port during the war. In 1860–1861, 1,915,852 bales were shipped, nearly all before hostilities began; in 1861–1862, 27,627 bales; in 1862–1863, 23,750; in 1863–1864, 128,130; in 1864–1865, 192,351. The total receipts during this last year were 271,015 bales. From September 1st, 1865, to January 1st, 1866, the receipts were 375,000 bales; and cotton was still coming. The warehouses on the lower tributaries of the Mississippi were said to be full of it, waiting for high water to send it down. There had been far more concealed in the country than was supposed: it made its appearance where least looked for; and such was the supply that experienced traders believed that prices would thenceforth be steadily on the decline. A first-class Liverpool steamer is calculated to take out 3000 500- pound bales, the freight on which is 7–8ths of a penny per pound,— not quite two cents. The freight to New York and Boston is 1 1–4th cents by steamers, and 7–8ths of a cent by sailing-vessels. I put up at the St. Charles, famous before the war as a hotel, and during the war as the head-quarters of General Butler. It is a conspicuous edifice, with white-pillared porticos, and a spacious Rotunda, thronged nightly with a crowd which strikes a stranger with astonishment. It is a sort of social evening exchange, where merchants, planters, travellers, river-men, army men, (principally Rebels,) manufacturing and jobbing agents, showmen, overseers, idlers, sharpers, gamblers, foreigners, Yankees, Southern men, the well dressed and the prosperous, the rough and the seedy, congregate together, some leaning against the pillars, and a few sitting about the stoves, which are almost hidden from sight by the concourse of people standing or moving about in the great central
  • 69. space. Numbers of citizens regularly spend their evenings here, as at a club-room. One, an old plantation overseer of the better class, told me that for years he had not missed going to the Rotunda a single night, except when absent from the city. The character he gave the crowd was not complimentary. “They are all trying to get money without earning it. Each is doing his best to shave the rest. If they ever make anything, I don’t know it. I’ve been here two thousand nights, and never made a cent yet.” I inquired what brought him here. “For company; to kill time. I never was married, and never had a home. When I was young, the girls said I smelt like a wet dog; that’s because I was poor. Since I’ve got rich, I’m too old to get married.” What he was thinking of now was a fortune to be made out of labor- saving machinery to be used on the plantations: “I wish I could get hold of a half-crazy feller, to fix up a cotton planter, cotton-picker, cane-cutter, and a thing to hill up some.” He talked cynically of the planters. “They’re a helpless set. They’re all confused. They don’t know what they’re going to do. They never did know much else but to get drunk. If a man has a plantation to rent or sell, he can’t tell anything about it; you can’t get any proposition out of him.” He complained that Northern capital lodged in the cotton belt; but little of it getting through to the sugar country. He did not know any lands let to Northern men. “They hav’n’t got sugar on the brain; it’s cotton they’re all crazy after.” He used to oversee for fifteen hundred dollars a year: he was now offered five thousand. He was a well-dressed, rather intelligent, capable man; and I noticed that the planters treated him with respect. But his manner toward them was cool and independent: he could not forget old times. “I never was thought anything of by these men, till I got rich. Then they began to say ‘Dick P—— is a mighty clever feller;’ and by-and-by it got to be ‘Mr. P——.’ Now they
  • 70. all come to me, because I know about business, and they don’t know a thing.” Like everybody else, he had much to say of the niggers. “A heap of the planters wants ’em all killed off. But I believe in the nigger. He’ll work, if they’ll only let him alone. They fool him, and tell him such lies, he’s no confidence. I’ve worked free niggers and white men, and always found the niggers worked the best. But no nigger, nor anybody else, will work like a slave works with the whip behind him. You can’t make ’em. I was brought up to work alongside o’ niggers, and soon as I got out of it, nothing, no money, could induce me to work so again.” Speaking of other overseers, he said: “I admit I was about as tight on the nigger as a man ought to be. If I’d been a slave, I shouldn’t have wanted to work under a master that was tighter than I was. But I wa’n’t a priming to some. You see that red-faced feller with his right hand behind him, talking with two men? He’s an overseer. I know of his killing two niggers, and torturing another so that he died in a few days.” (I omit the shocking details of the punishment said to have been applied.) “The other night he came here to kill me because I told about him. He pulled out his pistol, and says he, ‘Dick P——, did you tell so-and-so I killed three niggers on Clark’s plantation?’ ‘Yes,’ I says, ‘I said so, and can prove it; and if there’s any shooting to be done, I can shoot as fast as you can.’ After that he bullied around here some, then went off, and I hav’n’t heard anything about shooting since.” Among the earliest acquaintances I made at New Orleans was General Phil. Sheridan, perhaps the most brilliant and popular fighting man of the war. I found him in command of the Military Division of the Gulf, comprising the States of Louisiana, Texas, and Florida. In Florida he had at that time seven thousand troops; in Louisiana, nine thousand; and in Texas, twenty thousand, embracing ten thousand colored troops at Corpus Christi and on the Rio Grande, watching the French movements.
  • 71. It was Sheridan’s opinion that the Rebellion would never be ended until Maximilian was driven from Mexico. Such a government on our borders cherished the seeds of ambition and discontent in the minds of the late Confederates. Many were emigrating to Mexico, and there was danger of their uniting either with the Liberals or the Imperialists, and forming a government inimical to the United States. To prevent such a possibility, he had used military and diplomatic strategy. Three thousand Rebels having collected in Monterey, he induced the Liberals to arrest and disarm them. Then in order that they should not be received by the Imperialists, he made hostile demonstrations, sending a pontoon train to Brownsville, and six thousand cavalry to San Antonio, establishing military posts, and making extensive inquiries for forage. Under such circumstances, Maximilian did not feel inclined to welcome the Rebel refugees. It is even probable that, had our government at that time required the withdrawal of the French from Mexico, the demand, emphasized by these and similar demonstrations, would have been complied with. Maximilian is very weak in his position. Nineteen twentieths of the people are opposed to him. There is no regular, legitimate taxation for the support of his government, but he levies contributions upon merchants for a small part of the funds he requires, and draws upon France for the rest. His “government” consists merely of an armed occupation of the country; with long lines of communication between military posts, which could be easily cut off and captured one after another by a comparatively small force. The Southern country, in the General’s opinion, was fast becoming “Northernized.” It was very poor, and going to be poorer. The planters had no enterprise, no recuperative energy: they were entirely dependent upon Northern capital and Northern spirit. He thought the freedmen’s affairs required no legislation, but that the State should leave them to be regulated by the natural law of supply and demand. Phil. Sheridan is a man of small stature, compactly and somewhat massively built, with great toughness of constitutional fibre, and an alert countenance, expressive of remarkable energy and force. I
  • 72. inquired if he experienced no reaction after the long strain upon his mental and bodily powers occasioned by the war. “Only a pleasant one,” he replied. “During my Western campaigns, when I was continually in the saddle, I weighed but a hundred and fifteen pounds. My flesh was hard as iron. Now my weight is a hundred and forty-five.” He went over with me to the City Hall, to which the Executive department of the State had been removed, and introduced me to Governor Wells, a plain, elderly man, affable, and loyal in his speech. I remember his saying that the action of the President, in pardoning Governor Humphreys, of Mississippi, after he had been elected by the people on account of his services in the Confederate cause, was doing great harm throughout the South, encouraging Rebels and discouraging Union men. “Everything is being conceded to traitors,” said he, “before they have been made to feel the Federal power.” He spoke of the strong Rebel element in the Legislature which he was combating; and gave me copies of two veto messages which he had returned to it with bills that were passed for the especial benefit of traitors. The new serf code, similar to that of Mississippi, engineered through the Legislature by a member of the late Confederate Congress, he had also disapproved. After this, I was surprised to hear from other sources how faithfully he had been carrying out the very policy which he professed to condemn,—even going beyond the President, in removing from office Union men appointed by Governor Hahn and appointing Secessionists and Rebels in their place; and advocating the Southern doctrine that the Government must pay for the slaves it had emancipated. Such discrepancies between deeds and professions require no comment. Governor Wells is not the only one, nor the highest, among public officers, who, wishing to reconcile the irreconcilable, and to stand well before the country whilst they were strengthening the hands and gaining the favor of its enemies, have suffered their loyal protestations to be put to some confusion by acts of doubtful patriotism.
At the Governor’s room I had the good fortune to meet the Mayor of the city, Mr. Hugh Kennedy, whom I afterwards called upon by appointment. By birth a Scotchman, he had been thirty years a citizen of New Orleans, and, from the beginning of the Secession troubles, had shown himself a stanch patriot. He was appointed to the mayoralty by President Lincoln; General Banks removed him, but he was afterwards reinstated.

I found him an almost enthusiastic believer in the future greatness of New Orleans. “It is certain,” he said, “to double its population in ten years. Its prosperity dates from the day of the abolition of slavery. Men who formerly lived upon the proceeds of slave-labor are now stimulated to enterprise. A dozen industrial occupations will spring up where there was one before. Manufactures are already taking a start. We have two new cotton-mills just gone into operation. The effect upon the whole country will be similar. Formerly planters went or sent to New York and Boston and laid in their supplies; for this reason there were no villages in the South. But now that men work for wages, which they will wish to spend near home, villages will everywhere spring up.”

Living, in New Orleans, he said, was very cheap. The fertile soil produces, with little labor, an abundance of vegetables the year round. Cattle are brought from the extensive prairies of the State, and from the vast pastures of Texas; and contractors had engaged to supply the charitable institutions of the city with the rumps and rounds of beef at six cents a pound.

The street railroads promised to yield a considerable revenue to the city. The original company paid only $130,000 for the privilege of laying down its rails, and an exclusive right to the track for twenty-five years. But two new roads had been started, one of which had stipulated to pay to the city government eleven and a half per cent. of its gross proceeds, and the other twenty-two and a half per cent. “In two or three years an annual income from that source will not be less than $200,000.”
From Mr. Kennedy I learned that free people of color owned property in New Orleans to the amount of $15,000,000. He was delighted with the working of the free-labor system. “I thought it an indication of progress when the white laborers and negroes on the levees the other day made a strike for higher wages. They were receiving two dollars and a half and three dollars a day, and they struck for five and seven dollars. They marched up the levee in a long procession, white and black together. I gave orders that they should not be interfered with as long as they interfered with nobody else; but when they undertook by force to prevent other laborers from working, the police promptly put a stop to their proceedings.”
CHAPTER LVII. POLITICS, FREE LABOR, AND SUGAR.

Through the courtesy of the Mayor I became acquainted with some of the radical Union men of New Orleans. Like the same class in Richmond and elsewhere, I found them extremely dissatisfied with the political situation and prospects. “Everything,” they said, “has been given up to traitors. The President is trying to help the nation out of its difficulty by restoring to power the very men who created the difficulty. To have been a good Rebel is now in a man’s favor; and to have stood by the government through all its trials is against him. If an original secessionist, or a time-serving, half-and-half Union man, ready to make any concession for the convenience of the moment, goes to Washington, he gets the ear of the administration, and comes away full of encouragement for the worst enemies the government ever had. If a man of principle goes to Washington, he gets nothing but plausible words which amount to nothing, if he isn’t actually insulted for his trouble.”

I heard everywhere the same complaints from this class. And here I may state that they were among the saddest things I had to endure in the South. Whatever may be thought of the intrinsic merits of any measures, we cannot but feel misgivings when we see our late enemies made jubilant by them, and loyal men dismayed.

The Union men of New Orleans were severe in their strictures on General Banks. “It was he,” they said, “who precipitated the organization of the State government on a Rebel basis. Read his General Orders No. 35, issued March 11th, 1864, concerning the election of delegates to the Convention.
Rebels who have taken the amnesty oath are admitted to the polls, and loyal colored men are excluded. Section 4th reads, ‘Every free white man,’ &c. Since his return to Massachusetts he has been making speeches in favor of negro suffrage. He is in favor of it there, where it is popular as an abstraction, and a man gets into Congress on the strength of it; but he was not in favor of it here, where there was a chance of making it practical. His excuse was, that if black men voted white men would take offence, and keep away from the polls. Very likely some white men would, but loyal white men wouldn’t. That he had the power to extend the franchise to the blacks, or at least thought he had, may be seen by his apology for not doing so, in which he says: ‘I did not decide upon this subject without very long and serious consideration,’ and so forth. So he let the great, the golden opportunity slip, of organizing the State government on a loyal basis,—of demonstrating the capacity of the colored man for self-government, and of setting an example to the other Rebel States.”

Being one day in the office of Mr. Durant, a prominent lawyer and Union man, I was much struck by the language and bearing of a gentleman who called upon him, and carried on a long conversation in French. Having understood that the Creoles were nearly all secessionists, I was surprised to hear this man give utterance to the most enlightened Republican sentiments. After he had gone out, I expressed my gratification at having met him. “That,” said Mr. Durant, “is one of the ablest and wealthiest business men in New Orleans. He was educated in Paris. But there is one thing about him you do not seem to have suspected. He belongs to that class of Union men the government has made up its mind to leave politically bound in the hands of the Rebels. That man, whom you thought refined and intelligent, has not the right which the most ignorant, Yankee-hating, negro-hating Confederate soldier has. He is a colored man, and has no vote.”

There were six daily newspapers published in New Orleans,—five in English and one in English and French,—besides several weeklies. There was but one loyal sheet among them, and that was a “nigger paper,” the Tribune, not sold by any newsboy, and, I believe, by but one news-dealer.
I called on General T. W. Sherman, in command of the Eastern District of Louisiana, who told me that, in order to please the people, our troops had been withdrawn from the interior, and that the militia, consisting mostly of Rebel soldiers, many of whom still wore the Rebel uniform, had been organized to fill their place. The negroes, whom they treated tyrannically, had been made to believe that it was the United States, and not the State government, that had thus set their enemies to keep guard over them. Both Governor Wells and General Sherman had received piles of letters from “prominent parties” expressing fears of negro insurrections. The most serious indications of bloody retribution preparing for the white race had been reported in the Teche country, where regiments of black cavalry were said to be organized and drilled. The General, on visiting the spot, and investigating the truth of the story, learned that it had its foundation in the fact that some negro boys had been playing soldier with wooden swords. No wonder the Rebel militia was thought necessary!

From General Baird, Assistant-Commissioner, and General Gregg, Inspecting-Agent of the Freedmen’s Bureau, I obtained official information regarding the condition of free labor in Louisiana. A detailed account of it would be but a recapitulation, with slight variations, of what I have said of free labor in other States. The whites were as ignorant of the true nature of the system as the blacks. Capitalists did not understand how they could secure labor without owning it, or how men could be induced to work without the whip. It was thought necessary to make a serf of him who was no longer a slave. To this end the Legislature had passed a code of black laws even more objectionable than that enacted by the Legislature of Mississippi. By its provisions freedmen were to be arrested as vagrants who had not, on the 10th of January, 1866, entered into contracts for the year. They were thus left little choice as to employers, and none as to terms. They were also subjected to a harsh system of fines and punishments for loss of time and the infraction of contracts; and made responsible for all losses of stock on the plantation, until they should be able to prove that they had not killed it.
Although these laws had not been approved by the Governor, there was no doubt but they would be approved and enforced as soon as the national troops were removed.

A majority of the Southern planters clamored for the withdrawal of the troops and the Freedmen’s Bureau. But Northern planters settled in the State as earnestly opposed the measure. “If the government’s protection goes, we must go too. It would be impossible for us to live here without it. Planters would come to us and say, ‘Here, you’ve got a nigger that belongs to us;’ they would claim him, under the State laws, and compel him to go and work for them. Not a first-class laborer could we be sure of.”

Here, as elsewhere, the fact that the freedmen had no independent homes, but lived in negro-quarters at the will of the owner, placed them under great disadvantages, which the presence of the Bureau was necessary to counteract. The planters desired nothing so much as to be left to manage the negroes with or without the help of State laws. “With that privilege,” they said, “we can make more out of them than ever. The government must take care of the old and worthless niggers it has set free, and we will put through the able-bodied ones.” The disposition to keep the freedmen in debt by furnishing their supplies at dishonest prices, and to impose upon their helplessness and ignorance in various other ways, was very general.

Fortunately there was a great demand for labor, and the freedmen, with the aid of the Bureau, were making favorable contracts with their employers. When encouraged by just treatment and fair wages, they were working well. But they were observed to be always happier, thriftier, and more comfortable, living in little homes of their own and working land on their own account, than in any other condition. “I believe,” said General Gregg, “the best thing philanthropic Northern capitalists can do both for the freedmen and for themselves, is to buy up tracts of land, which can be had in some of the most fertile sections of Louisiana at two, three, and five dollars an acre, to be leased to the freedmen.”
The more enlightened planters were in favor of educating the blacks. But the majority were opposed to it; so that in many parishes it was impossible to establish schools, while in others it had been very difficult. In January last there were 278 teachers in the State, instructing 19,000 pupils in 143 schools. The expenses, $20,000 a month, were defrayed by the Bureau from the proceeds of rents of abandoned and confiscated estates. But this source of revenue had nearly failed, in consequence of the indiscriminate pardoning of Rebel owners and the restoration of their property. In New Orleans, for example, the rents of Rebel estates had dwindled, in October, 1865, to $8,000; in December, to $1,500; and they were still rapidly diminishing. The result was, it had been necessary to order the discontinuance of all the schools in the State at the end of January, the funds in the treasury of the Bureau being barely sufficient to hold out until that time. It was hoped, however, that they would soon be reëstablished on a permanent basis, by a tax upon the freedmen themselves. For this purpose, the Assistant Commissioner had ordered that five per cent. of their wages should be paid by their employers to the agents of the Bureau.

The freedmen’s schools in New Orleans were not in session at the time I was there; but I heard them highly praised by those who had visited them. Here is Mr. Superintendent Warren’s account of them:—

“From the infant which must learn to count its fingers, to the scholar who can read and understand blank-verse, we have grades and departments adapted and free to all. Examinations, promotions, and gradations are had at stated seasons. The city is divided into districts; each district has its school, and each school the several departments of primary, intermediate, and grammar. A principal is appointed to each school, with the requisite number of assistants. Our teachers are mostly from the North, with a few Southerners, who have heroically dared the storm of prejudice to do good and right. The normal method of teaching is adopted, and object teaching is a specialty.

“There are eight schools in the city, with from two to eight hundred pupils each, which, with those in the suburbs, amount to sixteen schools with nearly six