Microeconometrics
This book provides a comprehensive treatment of microeconometrics, the analysis of
individual-level data on the economic behavior of individuals or firms using regres-
sion methods applied to cross-section and panel data. The book is oriented to the prac-
titioner. A good understanding of the linear regression model with matrix algebra is
assumed. The text can be used for Ph.D. courses in microeconometrics, in applied
econometrics, or in data-oriented microeconomics sub-disciplines; and as a reference
work for graduate students and applied researchers who wish to fill in gaps in their
tool kit. Distinguishing features include emphasis on nonlinear models and robust
inference, as well as chapter-length treatments of GMM estimation, nonparametric
regression, simulation-based estimation, bootstrap methods, Bayesian methods, strati-
fied and clustered samples, treatment evaluation, measurement error, and missing data.
The book makes frequent use of empirical illustrations, many based on seven large and
rich data sets.
A. Colin Cameron is Professor of Economics at the University of California, Davis. He
currently serves as Director of that university’s Center on Quantitative Social Science
Research. He has also taught at The Ohio State University and has held short-term
visiting positions at Indiana University at Bloomington and at a number of Australian
and European universities. His research in microeconometrics has appeared in leading
econometrics and economics journals. He is coauthor with Pravin Trivedi of Regres-
sion Analysis of Count Data.
Pravin K. Trivedi is John H. Rudy Professor of Economics at Indiana University at
Bloomington. He has also taught at The Australian National University and University
of Southampton and has held short-term visiting positions at a number of European
universities. His research in microeconometrics has appeared in most leading econo-
metrics and health economics journals. He coauthored Regression Analysis of Count
Data with A. Colin Cameron and is on the editorial boards of the Econometrics Journal
and the Journal of Applied Econometrics.
Microeconometrics
Methods and Applications
A. Colin Cameron, University of California, Davis
Pravin K. Trivedi, Indiana University
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press
The Edinburgh Building, Cambridge, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521848053
© A. Colin Cameron and Pravin K. Trivedi 2005
First published in print format 2005
This publication is in copyright. Subject to statutory exception and to the provision of
relevant collective licensing agreements, no reproduction of any part may take place
without the written permission of Cambridge University Press.
Cambridge University Press has no responsibility for the persistence or accuracy of URLs
for external or third-party internet websites referred to in this publication, and does not
guarantee that any content on such websites is, or will remain, accurate or appropriate.
To
my mother and the memory of my father
the memory of my parents
Contents
List of Figures page xv
List of Tables xvii
Preface xxi
I Preliminaries
1 Overview 3
1.1 Introduction 3
1.2 Distinctive Aspects of Microeconometrics 5
1.3 Book Outline 10
1.4 How to Use This Book 14
1.5 Software 15
1.6 Notation and Conventions 16
2 Causal and Noncausal Models 18
2.1 Introduction 18
2.2 Structural Models 20
2.3 Exogeneity 22
2.4 Linear Simultaneous Equations Model 23
2.5 Identification Concepts 29
2.6 Single-Equation Models 31
2.7 Potential Outcome Model 31
2.8 Causal Modeling and Estimation Strategies 35
2.9 Bibliographic Notes 38
3 Microeconomic Data Structures 39
3.1 Introduction 39
3.2 Observational Data 40
3.3 Data from Social Experiments 48
3.4 Data from Natural Experiments 54
3.5 Practical Considerations 58
3.6 Bibliographic Notes 61
II Core Methods
4 Linear Models 65
4.1 Introduction 65
4.2 Regressions and Loss Functions 66
4.3 Example: Returns to Schooling 69
4.4 Ordinary Least Squares 70
4.5 Weighted Least Squares 81
4.6 Median and Quantile Regression 85
4.7 Model Misspecification 90
4.8 Instrumental Variables 95
4.9 Instrumental Variables in Practice 103
4.10 Practical Considerations 112
4.11 Bibliographic Notes 112
5 Maximum Likelihood and Nonlinear Least-Squares Estimation 116
5.1 Introduction 116
5.2 Overview of Nonlinear Estimators 117
5.3 Extremum Estimators 124
5.4 Estimating Equations 133
5.5 Statistical Inference 135
5.6 Maximum Likelihood 139
5.7 Quasi-Maximum Likelihood 146
5.8 Nonlinear Least Squares 150
5.9 Example: ML and NLS Estimation 159
5.10 Practical Considerations 163
5.11 Bibliographic Notes 163
6 Generalized Method of Moments and Systems Estimation 166
6.1 Introduction 166
6.2 Examples 167
6.3 Generalized Method of Moments 172
6.4 Linear Instrumental Variables 183
6.5 Nonlinear Instrumental Variables 192
6.6 Sequential Two-Step m-Estimation 200
6.7 Minimum Distance Estimation 202
6.8 Empirical Likelihood 203
6.9 Linear Systems of Equations 206
6.10 Nonlinear Sets of Equations 214
6.11 Practical Considerations 219
6.12 Bibliographic Notes 220
7 Hypothesis Tests 223
7.1 Introduction 223
7.2 Wald Test 224
7.3 Likelihood-Based Tests 233
7.4 Example: Likelihood-Based Hypothesis Tests 241
7.5 Tests in Non-ML Settings 243
7.6 Power and Size of Tests 246
7.7 Monte Carlo Studies 250
7.8 Bootstrap Example 254
7.9 Practical Considerations 256
7.10 Bibliographic Notes 257
8 Specification Tests and Model Selection 259
8.1 Introduction 259
8.2 m-Tests 260
8.3 Hausman Test 271
8.4 Tests for Some Common Misspecifications 274
8.5 Discriminating between Nonnested Models 278
8.6 Consequences of Testing 285
8.7 Model Diagnostics 287
8.8 Practical Considerations 291
8.9 Bibliographic Notes 292
9 Semiparametric Methods 294
9.1 Introduction 294
9.2 Nonparametric Example: Hourly Wage 295
9.3 Kernel Density Estimation 298
9.4 Nonparametric Local Regression 307
9.5 Kernel Regression 311
9.6 Alternative Nonparametric Regression Estimators 319
9.7 Semiparametric Regression 322
9.8 Derivations of Mean and Variance of Kernel Estimators 330
9.9 Practical Considerations 333
9.10 Bibliographic Notes 333
10 Numerical Optimization 336
10.1 Introduction 336
10.2 General Considerations 336
10.3 Specific Methods 341
10.4 Practical Considerations 348
10.5 Bibliographic Notes 352
III Simulation-Based Methods
11 Bootstrap Methods 357
11.1 Introduction 357
11.2 Bootstrap Summary 358
11.3 Bootstrap Example 366
11.4 Bootstrap Theory 368
11.5 Bootstrap Extensions 373
11.6 Bootstrap Applications 376
11.7 Practical Considerations 382
11.8 Bibliographic Notes 382
12 Simulation-Based Methods 384
12.1 Introduction 384
12.2 Examples 385
12.3 Basics of Computing Integrals 387
12.4 Maximum Simulated Likelihood Estimation 393
12.5 Moment-Based Simulation Estimation 398
12.6 Indirect Inference 404
12.7 Simulators 406
12.8 Methods of Drawing Random Variates 410
12.9 Bibliographic Notes 416
13 Bayesian Methods 419
13.1 Introduction 419
13.2 Bayesian Approach 420
13.3 Bayesian Analysis of Linear Regression 435
13.4 Monte Carlo Integration 443
13.5 Markov Chain Monte Carlo Simulation 445
13.6 MCMC Example: Gibbs Sampler for SUR 452
13.7 Data Augmentation 454
13.8 Bayesian Model Selection 456
13.9 Practical Considerations 458
13.10 Bibliographic Notes 458
IV Models for Cross-Section Data
14 Binary Outcome Models 463
14.1 Introduction 463
14.2 Binary Outcome Example: Fishing Mode Choice 464
14.3 Logit and Probit Models 465
14.4 Latent Variable Models 475
14.5 Choice-Based Samples 478
14.6 Grouped and Aggregate Data 480
14.7 Semiparametric Estimation 482
14.8 Derivation of Logit from Type I Extreme Value 486
14.9 Practical Considerations 487
14.10 Bibliographic Notes 487
15 Multinomial Models 490
15.1 Introduction 490
15.2 Example: Choice of Fishing Mode 491
15.3 General Results 495
15.4 Multinomial Logit 500
15.5 Additive Random Utility Models 504
15.6 Nested Logit 507
15.7 Random Parameters Logit 512
15.8 Multinomial Probit 516
15.9 Ordered, Sequential, and Ranked Outcomes 519
15.10 Multivariate Discrete Outcomes 521
15.11 Semiparametric Estimation 523
15.12 Derivations for MNL, CL, and NL Models 524
15.13 Practical Considerations 527
15.14 Bibliographic Notes 528
16 Tobit and Selection Models 529
16.1 Introduction 529
16.2 Censored and Truncated Models 530
16.3 Tobit Model 536
16.4 Two-Part Model 544
16.5 Sample Selection Models 546
16.6 Selection Example: Health Expenditures 553
16.7 Roy Model 555
16.8 Structural Models 558
16.9 Semiparametric Estimation 562
16.10 Derivations for the Tobit Model 566
16.11 Practical Considerations 568
16.12 Bibliographic Notes 569
17 Transition Data: Survival Analysis 573
17.1 Introduction 573
17.2 Example: Duration of Strikes 574
17.3 Basic Concepts 576
17.4 Censoring 579
17.5 Nonparametric Models 580
17.6 Parametric Regression Models 584
17.7 Some Important Duration Models 591
17.8 Cox PH Model 592
17.9 Time-Varying Regressors 597
17.10 Discrete-Time Proportional Hazards 600
17.11 Duration Example: Unemployment Duration 603
17.12 Practical Considerations 608
17.13 Bibliographic Notes 608
18 Mixture Models and Unobserved Heterogeneity 611
18.1 Introduction 611
18.2 Unobserved Heterogeneity and Dispersion 612
18.3 Identification in Mixture Models 618
18.4 Specification of the Heterogeneity Distribution 620
18.5 Discrete Heterogeneity and Latent Class Analysis 621
18.6 Stock and Flow Sampling 625
18.7 Specification Testing 628
18.8 Unobserved Heterogeneity Example: Unemployment Duration 632
18.9 Practical Considerations 637
18.10 Bibliographic Notes 637
19 Models of Multiple Hazards 640
19.1 Introduction 640
19.2 Competing Risks 642
19.3 Joint Duration Distributions 648
19.4 Multiple Spells 655
19.5 Competing Risks Example: Unemployment Duration 658
19.6 Practical Considerations 662
19.7 Bibliographic Notes 663
20 Models of Count Data 665
20.1 Introduction 665
20.2 Basic Count Data Regression 666
20.3 Count Example: Contacts with Medical Doctor 671
20.4 Parametric Count Regression Models 674
20.5 Partially Parametric Models 682
20.6 Multivariate Counts and Endogenous Regressors 685
20.7 Count Example: Further Analysis 690
20.8 Practical Considerations 690
20.9 Bibliographic Notes 691
V Models for Panel Data
21 Linear Panel Models: Basics 697
21.1 Introduction 697
21.2 Overview of Models and Estimators 698
21.3 Linear Panel Example: Hours and Wages 708
21.4 Fixed Effects versus Random Effects Models 715
21.5 Pooled Models 720
21.6 Fixed Effects Model 726
21.7 Random Effects Model 734
21.8 Modeling Issues 737
21.9 Practical Considerations 740
21.10 Bibliographic Notes 740
22 Linear Panel Models: Extensions 743
22.1 Introduction 743
22.2 GMM Estimation of Linear Panel Models 744
22.3 Panel GMM Example: Hours and Wages 754
22.4 Random and Fixed Effects Panel GMM 756
22.5 Dynamic Models 763
22.6 Difference-in-Differences Estimator 768
22.7 Repeated Cross Sections and Pseudo Panels 770
22.8 Mixed Linear Models 774
22.9 Practical Considerations 776
22.10 Bibliographic Notes 777
23 Nonlinear Panel Models 779
23.1 Introduction 779
23.2 General Results 779
23.3 Nonlinear Panel Example: Patents and R&D 792
23.4 Binary Outcome Data 795
23.5 Tobit and Selection Models 800
23.6 Transition Data 801
23.7 Count Data 802
23.8 Semiparametric Estimation 808
23.9 Practical Considerations 808
23.10 Bibliographic Notes 809
VI Further Topics
24 Stratified and Clustered Samples 813
24.1 Introduction 813
24.2 Survey Sampling 814
24.3 Weighting 817
24.4 Endogenous Stratification 822
24.5 Clustering 829
24.6 Hierarchical Linear Models 845
24.7 Clustering Example: Vietnam Health Care Use 848
24.8 Complex Surveys 853
24.9 Practical Considerations 857
24.10 Bibliographic Notes 857
25 Treatment Evaluation 860
25.1 Introduction 860
25.2 Setup and Assumptions 862
25.3 Treatment Effects and Selection Bias 865
25.4 Matching and Propensity Score Estimators 871
25.5 Differences-in-Differences Estimators 878
25.6 Regression Discontinuity Design 879
25.7 Instrumental Variable Methods 883
25.8 Example: The Effect of Training on Earnings 889
25.9 Bibliographic Notes 896
26 Measurement Error Models 899
26.1 Introduction 899
26.2 Measurement Error in Linear Regression 900
26.3 Identification Strategies 905
26.4 Measurement Errors in Nonlinear Models 911
26.5 Attenuation Bias Simulation Examples 919
26.6 Bibliographic Notes 920
27 Missing Data and Imputation 923
27.1 Introduction 923
27.2 Missing Data Assumptions 925
27.3 Handling Missing Data without Models 928
27.4 Observed-Data Likelihood 929
27.5 Regression-Based Imputation 930
27.6 Data Augmentation and MCMC 932
27.7 Multiple Imputation 934
27.8 Missing Data MCMC Imputation Example 935
27.9 Practical Considerations 939
27.10 Bibliographic Notes 940
A Asymptotic Theory 943
A.1 Introduction 943
A.2 Convergence in Probability 944
A.3 Laws of Large Numbers 947
A.4 Convergence in Distribution 948
A.5 Central Limit Theorems 949
A.6 Multivariate Normal Limit Distributions 951
A.7 Stochastic Order of Magnitude 954
A.8 Other Results 955
A.9 Bibliographic Notes 956
B Making Pseudo-Random Draws 957
References 961
Index 999
List of Figures
3.1 Social experiment with random assignment page 50
4.1 Quantile regression estimates of slope coefficient 89
4.2 Quantile regression estimated lines 90
7.1 Power of Wald chi-square test 249
7.2 Density of Wald test on slope coefficient 253
9.1 Histogram for log wage 296
9.2 Kernel density estimates for log wage 296
9.3 Nonparametric regression of log wage on education 297
9.4 Kernel density estimates using different kernels 300
9.5 k-nearest neighbors regression 309
9.6 Nonparametric regression using Lowess 310
9.7 Nonparametric estimate of derivative of y with respect to x 317
11.1 Bootstrap estimate of the density of t-test statistic 368
12.1 Halton sequence draws compared to pseudo-random draws 411
12.2 Inverse transformation method for unit exponential draws 413
12.3 Accept–reject method for random draws 414
13.1 Bayesian analysis for mean parameter of normal density 424
14.1 Charter boat fishing: probit and logit predictions 466
15.1 Generalized random utility model 516
16.1 Tobit regression example 531
16.2 Inverse Mills ratio as censoring point c increases 540
17.1 Strike duration: Kaplan–Meier survival function 575
17.2 Weibull distribution: density, survivor, hazard, and cumulative hazard functions 585
17.3 Unemployment duration: Kaplan–Meier survival function 604
17.4 Unemployment duration: survival functions by unemployment insurance 605
17.5 Unemployment duration: Nelson–Aalen cumulated hazard function 606
17.6 Unemployment duration: cumulative hazard function by unemployment insurance 606
18.1 Length-biased sampling under stock sampling: example 627
18.2 Unemployment duration: exponential model generalized residuals 633
18.3 Unemployment duration: exponential-gamma model generalized residuals 633
18.4 Unemployment duration: Weibull model generalized residuals 635
18.5 Unemployment duration: Weibull-IG model generalized residuals 636
19.1 Unemployment duration: Cox CR baseline survival functions 661
19.2 Unemployment duration: Cox CR baseline cumulative hazards 662
21.1 Hours and wages: pooled (overall) regression 712
21.2 Hours and wages: between regression 713
21.3 Hours and wages: within (fixed effects) regression 713
21.4 Hours and wages: first differences regression 714
23.1 Patents and R&D: pooled (overall) regression 793
25.1 Regression-discontinuity design: example 880
25.2 RD design: treatment assignment in sharp and fuzzy designs 883
25.3 Training impact: earnings against propensity score by treatment 892
27.1 Missing data: examples of missing regressors 924
List of Tables
1.1 Book Outline page 11
1.2 Outline of a 20-Lecture 10-Week Course 15
1.3 Commonly Used Acronyms and Abbreviations 17
3.1 Features of Some Selected Social Experiments 51
3.2 Features of Some Selected Natural Experiments 54
4.1 Loss Functions and Corresponding Optimal Predictors 67
4.2 Least Squares Estimators and Their Asymptotic Variance 83
4.3 Least Squares: Example with Conditionally Heteroskedastic Errors 84
4.4 Instrumental Variables Example 103
4.5 Returns to Schooling: Instrumental Variables Estimates 111
5.1 Asymptotic Properties of M-Estimators 121
5.2 Marginal Effect: Three Different Estimates 122
5.3 Maximum Likelihood: Commonly Used Densities 140
5.4 Linear Exponential Family Densities: Leading Examples 148
5.5 Nonlinear Least Squares: Common Examples 151
5.6 Nonlinear Least-Squares Estimators and Their Asymptotic Variance 156
5.7 Exponential Example: Least-Squares and ML Estimates 161
6.1 Generalized Method of Moments: Examples 172
6.2 GMM Estimators in Linear IV Model and Their Asymptotic Variance 186
6.3 GMM Estimators in Nonlinear IV Model and Their Asymptotic Variance 195
6.4 Nonlinear Two-Stage Least-Squares Example 199
7.1 Test Statistics for Poisson Regression Example 242
7.2 Wald Test Size and Power for Probit Regression Example 253
8.1 Specification m-Tests for Poisson Regression Example 270
8.2 Nonnested Model Comparisons for Poisson Regression Example 284
8.3 Pseudo R²s: Poisson Regression Example 291
9.1 Kernel Functions: Commonly Used Examples 300
9.2 Semiparametric Models: Leading Examples 323
10.1 Gradient Method Results 339
10.2 Computational Difficulties: A Partial Checklist 350
11.1 Bootstrap Statistical Inference on a Slope Coefficient: Example 367
11.2 Bootstrap Theory Notation 369
12.1 Monte Carlo Integration: Example for x Standard Normal 392
12.2 Maximum Simulated Likelihood Estimation: Example 398
12.3 Method of Simulated Moments Estimation: Example 404
13.1 Bayesian Analysis: Essential Components 425
13.2 Conjugate Families: Leading Examples 428
13.3 Gibbs Sampling: Seemingly Unrelated Regressions Example 454
13.4 Interpretation of Bayes Factors 457
14.1 Fishing Mode Choice: Data Summary 464
14.2 Fishing Mode Choice: Logit and Probit Estimates 465
14.3 Binary Outcome Data: Commonly Used Models 467
15.1 Fishing Mode Multinomial Choice: Data Summary 492
15.2 Fishing Mode Multinomial Choice: Logit Estimates 493
15.3 Fishing Mode Choice: Marginal Effects for Conditional Logit Model 493
16.1 Health Expenditure Data: Two-Part and Selection Models 554
17.1 Survival Analysis: Definitions of Key Concepts 577
17.2 Hazard Rate and Survivor Function Computation: Example 582
17.3 Strike Duration: Kaplan–Meier Survivor Function Estimates 583
17.4 Exponential and Weibull Distributions: pdf, cdf, Survivor Function, Hazard, Cumulative Hazard, Mean, and Variance 584
17.5 Standard Parametric Models and Their Hazard and Survivor Functions 585
17.6 Unemployment Duration: Description of Variables 603
17.7 Unemployment Duration: Kaplan–Meier Survival and Nelson–Aalen Cumulated Hazard Functions 605
17.8 Unemployment Duration: Estimated Parameters from Four Parametric Models 607
17.9 Unemployment Duration: Estimated Hazard Ratios from Four Parametric Models 608
18.1 Unemployment Duration: Exponential Model with Gamma and IG Heterogeneity 634
18.2 Unemployment Duration: Weibull Model with and without Heterogeneity 635
19.1 Some Standard Copula Functions 654
19.2 Unemployment Duration: Competing and Independent Risk Estimates of Exponential Model with and without IG Frailty 659
19.3 Unemployment Duration: Competing and Independent Risk Estimates of Weibull Model with and without IG Frailty 660
20.1 Proportion of Zero Counts in Selected Empirical Studies 666
20.2 Summary of Data Sets Used in Recent Patent–R&D Studies 667
20.3 Contacts with Medical Doctor: Frequency Distribution 672
20.4 Contacts with Medical Doctor: Variable Descriptions 672
20.5 Contacts with Medical Doctor: Count Model Estimates 673
20.6 Contacts with Medical Doctor: Observed and Fitted Frequencies 674
21.1 Linear Panel Model: Common Estimators and Models 699
21.2 Hours and Wages: Standard Linear Panel Model Estimators 710
21.3 Hours and Wages: Autocorrelations of Pooled OLS Residuals 714
21.4 Hours and Wages: Autocorrelations of Within Regression Residuals 715
21.5 Pooled Least-Squares Estimators and Their Asymptotic Variances 721
21.6 Variances of Pooled OLS Estimator with Equicorrelated Errors 724
21.7 Hours and Wages: Pooled OLS and GLS Estimates 725
22.1 Panel Exogeneity Assumptions and Resulting Instruments 752
22.2 Hours and Wages: GMM-IV Linear Panel Model Estimators 755
23.1 Patents and R&D Spending: Nonlinear Panel Model Estimators 794
24.1 Stratification Schemes with Random Sampling within Strata 823
24.2 Properties of Estimators for Different Clustering Models 832
24.3 Vietnam Health Care Use: Data Description 850
24.4 Vietnam Health Care Use: FE and RE Models for Positive Expenditure 851
24.5 Vietnam Health Care Use: Frequencies for Pharmacy Visits 852
24.6 Vietnam Health Care Use: RE and FE Models for Pharmacy Visits 852
25.1 Treatment Effects Framework 865
25.2 Treatment Effects Measures: ATE and ATET 868
25.3 Training Impact: Sample Means in Treated and Control Samples 890
25.4 Training Impact: Various Estimates of Treatment Effect 891
25.5 Training Impact: Distribution of Propensity Score for Treated and Control Units Using DW (1999) Specification 894
25.6 Training Impact: Estimates of ATET 895
25.7 Training Evaluation: DW (2002) Estimates of ATET 896
26.1 Attenuation Bias in a Logit Regression with Measurement Error 919
26.2 Attenuation Bias in a Nonlinear Regression with Additive Measurement Error 920
27.1 Relative Efficiency of Multiple Imputation 935
27.2 Missing Data Imputation: Linear Regression Estimates with 10% Missing Data and High Correlation Using MCMC Algorithm 936
27.3 Missing Data Imputation: Linear Regression Estimates with 25% Missing Data and High Correlation Using MCMC Algorithm 937
27.4 Missing Data Imputation: Linear Regression Estimates with 10% Missing Data and Low Correlation Using MCMC Algorithm 937
27.5 Missing Data Imputation: Logistic Regression Estimates with 10% Missing Data and High Correlation Using MCMC Algorithm 938
27.6 Missing Data Imputation: Logistic Regression Estimates with 25% Missing Data and Low Correlation Using MCMC Algorithm 939
A.1 Asymptotic Theory: Definitions and Theorems 944
B.1 Continuous Random Variable Densities and Moments 957
B.2 Continuous Random Variable Generators 958
B.3 Discrete Random Variable Probability Mass Functions and Moments 959
B.4 Discrete Random Variable Generators 959
Preface
This book provides a detailed treatment of microeconometric analysis, the analysis of
individual-level data on the economic behavior of individuals or firms. This type of
analysis usually entails applying regression methods to cross-section and panel data.
The book aims at providing the practitioner with a comprehensive coverage of sta-
tistical methods and their application in modern applied microeconometrics research.
These methods include nonlinear modeling, inference under minimal distributional
assumptions, identifying and measuring causation rather than mere association, and
correcting departures from simple random sampling. Many of these features are of
relevance to individual-level data analysis throughout the social sciences.
The ambitious agenda has determined the characteristics of this book. First, al-
though oriented to the practitioner, the book is relatively advanced in places. A cook-
book approach is inadequate because when two or more complications occur simulta-
neously – a common situation – the practitioner must know enough to be able to adapt
available methods. Second, the book provides considerable coverage of practical data
problems (see especially the last three chapters). Third, the book includes substantial
empirical examples in many chapters to illustrate some of the methods covered. Fi-
nally, the book is unusually long. Despite this length we have been space-constrained.
We had intended to include even more empirical examples, and abbreviated presen-
tations will at times fail to recognize the accomplishments of researchers who have
made substantive contributions.
The book assumes a good understanding of the linear regression model with matrix
algebra. It is written at the mathematical level of the first-year economics Ph.D. se-
quence, comparable to Greene (2003). We have two types of readers in mind. First, the
book can be used as a course text for a microeconometrics course, typically taught in
the second year of the Ph.D., or for data-oriented microeconomics field courses such
as labor economics, public economics, and industrial organization. Second, the book
can be used as a reference work for graduate students and applied researchers who
despite training in microeconometrics will inevitably have gaps that they wish to fill.
For instructors using this book as an econometrics course text it is best to introduce
the basic nonlinear cross-section and linear panel data models as early as possible,
initially skipping many of the methods chapters. The key methods chapter (Chapter 5)
covers maximum-likelihood and nonlinear least-squares estimation. Knowledge of
maximum likelihood and nonlinear least-squares estimators provides adequate back-
ground for the most commonly used nonlinear cross-section models (Chapters 14–17
and 20), basic linear panel data models (Chapter 21), and treatment evaluation meth-
ods (Chapter 25). Generalized method of moments estimation (Chapter 6) is needed
especially for advanced linear panel data methods (Chapter 22).
For readers using this book as a reference work, the chapters have been written to be
as self-contained as possible. The notable exception is that some command of general
estimation results in Chapter 5, and occasionally Chapter 6, will be necessary. Most
chapters on models are structured to begin with a discussion and example that is acces-
sible to a wide audience.
The Web site www.econ.ucdavis.edu/faculty/cameron provides all the data and
computer programs used in this book and related materials useful for instructional
purposes.
This project has been long and arduous, and at times seemingly without an end. Its
completion has been greatly aided by our colleagues, friends, and graduate students.
We thank especially the following for reading and commenting on specific chapters:
Bijan Borah, Kurt Brännäs, Pian Chen, Tim Cogley, Partha Deb, Massimiliano De
Santis, David Drukker, Jeff Gill, Tue Gorgens, Shiferaw Gurmu, Lu Ji, Oscar Jorda,
Roger Koenker, Chenghui Li, Tong Li, Doug Miller, Murat Munkin, Jim Prieger,
Ahmed Rahmen, Sunil Sapra, Haruki Seitani, Yacheng Sun, Xiaoyong Zheng, and
David Zimmer. Pian Chen gave detailed comments on most of the book. We thank
Rajeev Dehejia, Bronwyn Hall, Cathy Kling, Jeffrey Kling, Will Manning, Brian
McCall, and Jim Ziliak for making their data available for empirical illustrations. We
thank our respective departments for facilitating our collaboration and for the produc-
tion and distribution of the draft manuscript at various stages. We benefited from the
comments of two anonymous reviewers. Guidance, advice, and encouragement from
our Cambridge editor, Scott Parris, have been invaluable.
Our interest in econometrics owes much to the training and environments we en-
countered as students and in the initial stages of our academic careers. The first author
thanks The Australian National University; Stanford University, especially Takeshi
Amemiya and Tom MaCurdy; and The Ohio State University. The second author thanks
the London School of Economics and The Australian National University.
Our interest in writing a book oriented to the practitioner owes much to our exposure
to the research of graduate students and colleagues at our respective institutions, UC-
Davis and IU-Bloomington.
Finally, we thank our families for their patience and understanding without which
completion of this project would not have been possible.
A. Colin Cameron
Davis, California
Pravin K. Trivedi
Bloomington, Indiana
PART ONE
Preliminaries
Part 1 covers the essential components of microeconometric analysis – an economic
specification, a statistical model and a data set.
Chapter 1 discusses the distinctive aspects of microeconometrics, and provides an
outline of the book. It emphasizes that discreteness of data, and nonlinearity and het-
erogeneity of behavioral relationships are key aspects of individual-level microecono-
metric models. It concludes by presenting the notation and conventions used through-
out the book.
Chapters 2 and 3 set the scene for the remainder of the book by introducing the
reader to key model and data concepts that shape the analyses of later chapters.
A key distinction in econometrics is between essentially descriptive models and
data summaries at various levels of statistical sophistication and models that go be-
yond associations and attempt to estimate causal parameters. The classic definitions
of causality in econometrics derive from the Cowles Commission simultaneous equa-
tions models that draw sharp distinctions between exogenous and endogenous vari-
ables, and between structural and reduced form parameters. Although reduced form
models are very useful for some purposes, knowledge of structural or causal parame-
ters is essential for policy analyses. Identification of structural parameters within the
simultaneous equations framework poses numerous conceptual and practical difficul-
ties. An increasingly-used alternative approach based on the potential outcome model,
also attempts to identify causal parameters but it does so by posing limited questions
within a more manageable framework. Chapter 2 attempts to provide an overview of
the fundamental issues that arise in these and other alternative frameworks. Readers
who initially find this material challenging should return to this chapter after gaining
greater familiarity with specific models covered later in the book.
The empirical researcher’s ability to identify causal parameters depends not only
on the statistical tools and models but also on the type of data available. An experi-
mental framework provides a standard for establishing causal connections. However,
observational, not experimental, data form the basis of much of econometric inference.
Chapter 3 surveys the pros and cons of three main types of data: observational data,
data from social experiments, and data from natural experiments. The strengths and
weaknesses of conducting causal inference based on each type of data are reviewed.
Chapter 1
Overview
1.1. Introduction
This book provides a detailed treatment of microeconometric analysis, the analysis
of individual-level data on the economic behavior of individuals or firms. A broader
definition would also include grouped data. Usually regression methods are applied to
cross-section or panel data.
Analysis of individual data has a long history. Ernst Engel (1857) was among the
earliest quantitative investigators of household budgets. Allen and Bowley (1935),
Houthakker (1957), and Prais and Houthakker (1955) made important contributions
following the same research and modeling tradition. Other landmark studies that were
also influential in stimulating the development of microeconometrics, even though
they did not always use individual-level information, include those by Marschak and
Andrews (1944) in production theory and by Wold and Jureen (1953), Stone (1953),
and Tobin (1958) in consumer demand.
Important as this earlier work on household budgets and demand analysis is, the
material covered in this book has stronger connections with the work on
discrete choice analysis and censored and truncated variable models that saw their first
serious econometric applications in the work of McFadden (1973, 1984) and Heckman
(1974, 1979), respectively. These works involved a major departure from the over-
whelming reliance on linear models that characterized earlier work. Subsequently, they
have led to significant methodological innovations in econometrics. Among the earlier
textbook-level treatments of this material (and more) are the works of Maddala (1983)
and Amemiya (1985). As emphasized by Heckman (2001), McFadden (2001), and oth-
ers, many of the fundamental issues that dominated earlier work based on market data
remain important, especially concerning the conditions necessary for identifiability of
causal economic relations. Nonetheless, the style of microeconometrics is sufficiently
distinct to justify writing a text that is exclusively devoted to it.
Modern microeconometrics based on individual-, household-, and establishment-
level data owes a great deal to the greater availability of data from cross-section
and longitudinal sample surveys and census data. In the past two decades, with the
expansion of electronic recording and collection of data at the individual level, data
volume has grown explosively. So too has the available computing power for analyzing
large and complex data sets. In many cases event-level data are available; for example,
marketing science often deals with purchase data collected by electronic scanners in
supermarkets, and industrial organization literature contains econometric analyses of
airline travel data collected by online booking systems. There are now new branches of
economics, such as social experimentation and experimental economics, that generate
“experimental” data. These developments have created many new modeling opportu-
nities that are absent when only aggregated market-level data are available. Meanwhile
the explosive growth in the volume and types of data has also given rise to numerous
methodological issues. Processing and econometric analysis of such large microdata-
bases, with the objective of uncovering patterns of economic behavior, constitutes the
core of microeconometrics. Econometric analysis of such data is the subject matter of
this book.
Key precursors of this book are the books by Maddala (1983) and Amemiya (1985).
Like them it covers topics that are presented only briefly, or not at all, in undergraduate
and first-year graduate econometrics courses. Especially compared to Amemiya (1985)
this book is more oriented to the practitioner. The level of presentation is nonetheless
advanced in places, especially for applied researchers in disciplines that are less math-
ematically oriented than economics.
A relatively advanced presentation is needed for several reasons. First, the data are
often discrete or censored, in which case nonlinear methods such as logit, probit,
and Tobit models are used. This leads to statistical inference based on more difficult
asymptotic theory.
Second, distributional assumptions for such data become critically important. One
response is to develop highly parametric models that are sufficiently detailed to capture
the complexities of data, but these models can be challenging to estimate. A more com-
mon response is to minimize parametric assumptions and perform statistical inference
based on standard errors that are “robust” to complications such as heteroskedasticity
and clustering. In such cases considerable knowledge can be needed to ensure valid
statistical inference even if a standard regression package is used.
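To make the idea of robust inference concrete, the following minimal sketch computes heteroskedasticity-robust (White) standard errors for OLS using NumPy. The function name and the simulated data-generating values are purely illustrative and are not taken from the book's examples.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with heteroskedasticity-robust (White) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                     # OLS estimate
    resid = y - X @ beta
    # Sandwich variance: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
    meat = (X * resid[:, None] ** 2).T @ X
    V_robust = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V_robust))

# Illustrative use with simulated conditionally heteroskedastic errors
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=n)   # error variance depends on x
beta_hat, robust_se = ols_with_robust_se(X, y)
print(beta_hat, robust_se)
```

Conventional (nonrobust) OLS standard errors would be invalid in this setting because the error variance varies with x, whereas the sandwich estimator above remains valid under relatively weak assumptions.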
Third, economic studies often aim to determine causation rather than merely mea-
sure correlation, despite access to observational rather than experimental data. This
leads to methods to isolate causation such as instrumental variables, simultaneous
equations, measurement error correction, selection bias correction, panel data fixed
effects, and differences-in-differences.
Fourth, microeconomic data are typically collected using cross-section and panel
surveys, censuses, or social experiments. Survey data collected using these methods
are subject to problems of complex survey methodology, departures from simple ran-
dom sampling assumptions, and problems of sample selection, measurement errors,
and incomplete and/or missing data. Dealing with such issues in a way that can sup-
port valid population inferences from the estimated econometric models requires the
use of advanced methods.
Finally, it is not unusual that two or more complications occur simultaneously,
such as endogeneity in a logit model with panel data. Then a cookbook approach
becomes very difficult to implement. Instead, considerable understanding of the the-
ory underlying the methods is needed, as the researcher may need to read econometrics
journal articles and adapt standard econometrics software.
1.2. Distinctive Aspects of Microeconometrics
We now consider several advantages of microeconometrics that derive from its distinc-
tive features.
1.2.1. Discreteness and Nonlinearity
The first and most obvious point is that microeconometric data are usually at a low
level of aggregation. This has a major consequence for the functional forms used to
analyze the variables of interest. In many, if not most, cases linear functional forms
turn out to be simply inappropriate. More fundamentally, disaggregation brings to the
forefront heterogeneity of individuals, firms, and organizations that should be prop-
erly controlled (modeled) if one wants to make valid inferences about the underlying
relationships. We discuss these issues in greater detail in the following sections.
Although aggregation is not entirely absent in microdata, as for example when
household- or establishment-level data are collected, the level of aggregation is usu-
ally orders of magnitude lower than is common in macro analyses. In the latter case the
process of aggregation leads to smoothing, with many of the movements in opposite
directions canceling in the course of summation. The aggregated variables often show
smoother behavior than their components, and the relationships between the aggre-
gates frequently show greater smoothness than the components. For example, a rela-
tion between two variables at a micro level may be piecewise linear with many nodes.
After aggregation the relationship is likely to be well approximated by a smooth func-
tion. Hence an immediate consequence of disaggregation is the absence of features of
continuity and smoothness both of the variables themselves and of the relationships
between them.
Usually individual- and firm-level data cover a huge range of variation, both in the
cross-section and time-series dimensions. For example, average weekly consumption
of (say) beef is highly likely to be positive and smoothly varying, whereas that of an in-
dividual household in a given week may be frequently zero and may also switch to pos-
itive values from time to time. The average number of hours worked by female workers
is unlikely to be zero, but many individual females have zero market hours of work
(corner solutions), switching to positive values at other times in the course of their la-
bor market history. Average household expenditure on vacations is usually positive, but
many individual households may have zero expenditure on vacations in any given year.
Average per capita consumption of tobacco products will usually be positive, but many
individuals in the population have never consumed these products and never will, irre-
spective of price and income considerations. As Pudney (1989) has observed, micro-
data exhibit “holes, kinks and corners.” The holes correspond to nonparticipation in the
activity of interest, kinks correspond to the switching behavior, and corners correspond
to the incidence of nonconsumption or nonparticipation at specific points of time.
That is, discreteness and nonlinearity of response are intrinsic to microeconometrics.
An important class of nonlinear models in microeconometrics deals with limited
dependent variables (Maddala, 1983). This class includes many models that provide
suitable frameworks for analyzing discrete responses and responses with limited range
of variation. Such tools of analyses are of course also available for analyzing macro-
data, if required. The point is that they are indispensable in microeconometrics and
give it its distinctive feature.
1.2.2. Greater Realism
Macroeconometrics is sometimes based on strong assumptions; the representative
agent assumption is a leading example. A frequent appeal is made to microeconomic
reasoning to justify certain specifications and interpretations of empirical results. How-
ever, it is rarely possible to say explicitly how these are affected by aggregation over
time and micro units. Alternatively, very extreme aggregation assumptions are made.
For example, aggregates are said to reflect the behavior of a hypothetical representative
agent. Such assumptions also are not credible.
From the viewpoint of microeconomic theory, quantitative analysis founded on
microdata may be regarded as more realistic than that based on aggregated data. There
are three justifications for this claim. First, the measurement of the variables involved
in such hypotheses is often more direct (though not necessarily free from measurement
error) and has greater correspondence to the theory being tested. Second, hypotheses
about economic behavior are usually developed from theories of individual behavior. If
these hypotheses are tested using aggregated data, then many approximations and sim-
plifying assumptions have to be made. The simplifying assumption of a representative
agent causes a great loss of information and severely limits the scope of an empirical
investigation. Because such assumptions can be avoided in microeconometrics, and
usually are, in principle the microdata provide a more realistic framework for testing
microeconomic hypotheses. This is not a claim that the promise of microdata is nec-
essarily achieved in empirical work. Such a claim must be assessed on a case-by-case
basis. Finally, a realistic portrayal of economic activity should accommodate a broad
range of outcomes and responses that are a consequence of individual heterogeneity
and that are predicted by underlying theory. In this sense microeconomic data sets can
support more realistic models.
Microeconometric data are often derived from household or firm surveys, typically
encompassing a wide range of behavior, with many of the behavioral outcomes tak-
ing the form of discrete or categorical responses. Such data sets have many awkward
features that call for special tools in the formulation and analysis that, although not
entirely absent from macroeconometric work, nevertheless are less widely used.
1.2.3. Greater Information Content
The potential advantages of microdata sets can be realized if such data are informa-
tive. Because sample surveys often provide independent observations on thousands of
cross-sectional units, such data are thought to be more informative than the standard,
usually highly serially correlated, macro time series typically consisting of at most a
few hundred observations.
As will be explained in the next chapter, in practice the situation is not so clear-cut
because the microdata may be quite noisy. At the individual level many (idiosyncratic)
factors may play a large role in determining responses. Often these cannot be observed,
leading one to treat them under the heading of a random component, which can be a
very large part of observed variation. In this sense randomness plays a larger role in
microdata than in macrodata. Of course, this affects measures of goodness of fit of the
regressions. Students whose initial exposure to econometrics comes through aggregate
time-series analysis are often conditioned to see large R² values. When encountering
cross-section regressions for the first time, they express disappointment or even alarm
at the “low explanatory power” of the regression equation. Nevertheless, there remains
a strong presumption that, at least in certain dimensions, large microdata sets are highly
informative.
Another qualification is that when one is dealing with purely cross-section data,
very little can be said about the intertemporal aspects of relationships under study.
This particular aspect of behavior can be studied using panel and transition data.
In many cases one is interested in the behavioral responses of a specific group of
economic agents under some specified economic environment. One example is the
impact of unemployment insurance on the job search behavior of young unemployed
persons. Another example is the labor supply responses of low-income individuals
receiving income support. Unless microdata are used such issues cannot be addressed
directly in empirical work.
1.2.4. Microeconomic Foundations
Econometric models vary in the explicit role given to economic theory. At one end of
the spectrum there are models in which the a priori theorizing may play a dominant
role in the specification of the model and in the choice of an estimation procedure. At
the other end of the spectrum are empirical investigations that make much less use of
economic theory.
The goal of the analysis in the first case is to identify and estimate fundamental
parameters, sometimes called deep parameters, that characterize individual taste and
preferences and/or technological relationships. As a shorthand designation, we call
this the structural approach. Its hallmark is a heavy dependence on economic theory
and emphasis on causal inference. Such models may require many assumptions, such
as the precise specification of a cost or production function or specification of the
distribution of error terms. The empirical conclusions of such an exercise may not
be robust with respect to the departures from the assumptions. In Section 2.4.4 we
shall say more about this approach. At the present stage we simply emphasize that if
the structural approach is implemented with aggregated data, it will yield estimates
of the fundamental parameters only under very stringent (and possibly unrealistic)
conditions. Microdata sets provide a more promising environment for the structural
approach, essentially because they permit greater flexibility in model specification.
The goal of the analysis in the second case is to model relationship(s) between re-
sponse variables of interest conditionally on variables the researcher takes as given, or
exogenous. More formal definitions of endogeneity and exogeneity are given in Chap-
ter 2. As a shorthand designation, we call this a reduced form approach. The essential
point is that reduced form analysis does not always take into account all causal inter-
dependencies. A regression model in which the focus is on the prediction of y given
regressors x, and not on the causal interpretation of the regression parameters, is often
referred to as a reduced form regression. As will be explained in Chapter 2, in general
the parameters of the reduced form model are functions of structural parameters. They
may not be interpretable without some information about the structural parameters.
1.2.5. Disaggregation and Heterogeneity
It is sometimes said that many problems and issues of macroeconometrics arise from
serial correlation of macro time series, and those of microeconometrics arise from
heteroskedasticity of individual-level data. Although this is a useful characterization of
the modeling effort in many microeconometric analyses, it needs amplification and is
subject to important qualifications. In a range of microeconometric models, modeling
of dynamic dependence may be an important issue.
The benefits of disaggregation, which were emphasized earlier in this section, come
at a cost: As the data become more disaggregated the importance of controlling for
interindividual heterogeneity increases. Heterogeneity, or more precisely unobserved
heterogeneity, plays a very important role in microeconometrics. Obviously, many
variables that reflect interindividual heterogeneity, such as gender, race, educational
background, and social and demographic factors, are directly observed and hence can
be controlled for. In contrast, differences in individual motivation, ability, intelligence,
and so forth are either not observed or, at best, imperfectly observed.
The simplest response is to ignore such heterogeneity, that is, to absorb it into the
regression disturbance. After all this is how one treats the myriad small unobserved
factors. This step of course increases the unexplained part of the variation. More seri-
ously, ignoring persistent interindividual differences leads to confounding with other
factors that are also sources of persistent interindividual differences. Confounding is
said to occur when the individual contributions of different regressors (predictor vari-
ables) to the variation in the variable of interest cannot be statistically separated. Sup-
pose, for example, that the factor x1 (schooling) is said to be the source of variation in
y (earnings), when another variable x2 (ability), which is another source of variation,
does not appear in the model. Then that part of total variation that is attributable to
the second variable may be incorrectly attributed to the first variable. Intuitively, their
relative importances are confounded. A leading source of confounding bias is the in-
correct omission of regressors from the model and the inclusion of other variables that
are proxies for the omitted variable.
Consider, for example, the case in which a program participation (0/1 dummy)
variable D is included in the regression mean function with a vector of regressors x,
y = x′β + αD + u,   (1.1)
where u is an error term. The term “treatment” is used in biological and experimental
sciences to refer to an administered regimen involving participants in some trial. In
econometrics it commonly refers to participation in some activity that may impact an
outcome of interest. This activity may be randomly assigned to the participants or may
be self-selected by the participant. Thus, although it is acknowledged that individuals
choose their years of schooling, one still thinks of years of schooling as a “treatment”
variable. Suppose that program participation is taken to be a discrete variable. The
coefficient α of the “treatment variable” measures the average impact of the program
participation (D = 1), conditional on covariates. If one does not control for unob-
served heterogeneity, then a potential ambiguity affects the interpretation of the results.
If D is found to have a significant impact, then the following question arises: Is α sig-
nificantly different from zero because D is correlated with some unobserved variable
that affects y or because there is a causal relationship between D and y? For example,
if the program considered is university education, and the covariates do not include a
measure of ability, giving a fully causal interpretation becomes questionable. Because
the issue is important, more attention should be given to how to control for unobserved
heterogeneity.
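To illustrate the confounding problem just described, the following small simulation (our own sketch; the parameter values and the "ability" construction are purely illustrative) generates data in which participation D is self-selected by high-ability individuals. OLS that omits ability attributes part of the ability effect to D, whereas controlling for ability recovers the true α.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
ability = rng.normal(size=n)                 # unobserved heterogeneity
# Participation is more likely for high-ability individuals (self-selection)
D = (0.8 * ability + rng.normal(size=n) > 0).astype(float)
x = rng.normal(size=n)                       # observed covariate
alpha_true = 1.0
y = 2.0 + 0.5 * x + alpha_true * D + 1.5 * ability + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(np.column_stack([np.ones(n), x, D]), y)          # ability omitted
b_long = ols(np.column_stack([np.ones(n), x, D, ability]), y)  # ability controlled for
print("alpha estimate, ability omitted:   ", b_short[2])       # biased upward
print("alpha estimate, ability controlled:", b_long[2])        # close to 1.0
```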
In some cases where dynamic considerations are involved the type of data available
may put restrictions on how one can control for heterogeneity. Consider the example
of two households, identical in all relevant respects except that one exhibits a sys-
tematically higher preference for consuming good A. One could control for this by
allowing individual utility functions to include a heterogeneity parameter that reflects
their different preferences. Suppose now that there is a theory of consumer behavior
that claims that consumers become addicted to good A, in the sense that the more they
consume of it in one period, the greater is the probability that they will consume more
of it in the future. This theory provides another explanation of persistent interindi-
vidual differences in the consumption of good A. By controlling for heterogeneous
preferences it becomes possible to test which source of persistence in consumption –
preference heterogeneity or addiction – accounts for different consumption patterns.
This type of problem arises whenever some dynamic element generates persistence
in the observed outcomes. Several examples of this type of problem arise in various
places in the book.
A variety of approaches for modeling heterogeneity coexist in microeconometrics.
A brief mention of some of these follows, with details postponed until later.
An extreme solution is to ignore all unobserved interindividual differences. If unob-
served heterogeneity is uncorrelated with observed heterogeneity, and if the outcome
being studied has no intertemporal dependence, then the aforementioned problems will
not arise. Of course, these are strong assumptions and even with these assumptions not
all econometric difficulties disappear.
One approach for handling heterogeneity is to treat it as a fixed effect and to esti-
mate it as a coefficient of an individual specific 0/1 dummy variable. For example, in
a cross-section regression, each micro unit is allowed its own dummy variable (inter-
cept). This leads to an extreme proliferation of parameters because when a new individ-
ual is added to the sample, a new intercept parameter is also added. Thus this approach
will not work if our data are cross sectional. The availability of multiple observations
per individual unit, most commonly in the form of panel data with T time-series ob-
servations for each of the N cross-section units, makes it possible to either estimate
or eliminate the fixed effect, for example by first differencing if the model is linear
and the fixed effect is additive. If the model is nonlinear, as is often the case, the fixed
effect will usually not be additive and other approaches will need to be considered.
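As a concrete sketch of the first-differencing idea (with simulated data and illustrative values; this is not an example from the book), suppose the fixed effect is additive in a linear panel model and is correlated with the regressor. Pooled OLS is then inconsistent, but differencing the data over time removes the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta_true = 2000, 5, 1.0
alpha_i = rng.normal(size=(N, 1))                   # individual-specific fixed effect
x = 0.7 * alpha_i + rng.normal(size=(N, T))         # regressor correlated with the effect
y = beta_true * x + alpha_i + rng.normal(size=(N, T))

# Pooled OLS absorbs alpha_i into the error and is biased because x is correlated with it
b_pooled = np.cov(x.ravel(), y.ravel())[0, 1] / np.var(x.ravel(), ddof=1)

# First differencing: y_it - y_i,t-1 = beta (x_it - x_i,t-1) + (e_it - e_i,t-1)
dx, dy = np.diff(x, axis=1).ravel(), np.diff(y, axis=1).ravel()
b_fd = np.cov(dx, dy)[0, 1] / np.var(dx, ddof=1)

print(f"pooled OLS: {b_pooled:.2f}   first differences: {b_fd:.2f}")  # biased vs. ~1.0
```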
A second approach to modeling unobserved heterogeneity is through a random ef-
fects model. There are a number of ways in which the random effects model can be
formulated. One popular formulation assumes that one or more regression parameters,
often just the regression intercept, varies randomly across the cross section. In another
formulation the regression error is given a component structure, with an individual
specific random component. The random effects model then attempts to estimate the
parameters of the distribution from which the random component is drawn. In some
cases, such as demand analysis, the random term can be interpreted as random prefer-
ence variation. Random effects models can be estimated using either cross-section or
panel data.
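For instance, in the error-components formulation just mentioned the model can be written, in notation anticipating Chapter 21, as

```latex
y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it},
\qquad \alpha_i \sim \mathrm{iid}\,(0,\sigma_\alpha^2), \quad
\varepsilon_{it} \sim \mathrm{iid}\,(0,\sigma_\varepsilon^2),
```

where the individual-specific component α_i is treated as a random draw uncorrelated with the regressors, and the parameters of its distribution, here σ_α², are estimated along with β.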
1.2.6. Dynamics
A very common assumption in cross-section analysis is the absence of intertempo-
ral dependence, that is, an absence of dynamics. Thus, implicitly it is assumed that
the observations correspond to a stochastic equilibrium, with the deviation from the
equilibrium being represented by serially independent random disturbances. Even in
microeconometrics for some data situations such an assumption may be too strong.
For example, it is inconsistent with the presence of serially correlated unobserved het-
erogeneity. Dependence on lagged dependent variables also violates this assumption.
The foregoing discussion illustrates some of the potential limitations of a single
cross-section analysis. Some limitations may be overcome if repeated cross sections
are available. However, if there is dynamic dependence, the least problematic approach
might well be to use panel data.
1.3. Book Outline
The book is split into six parts. Part 1 presents the issues involved in microeconometric
modeling. Parts 2 and 3 present general theory for estimation and statistical inference
for nonlinear regression models. Parts 4 and 5 specialize to the core models used in
applied microeconometrics for, respectively, cross-section and panel data. Part 6 covers
broader topics that make considerable use of material presented in the earlier chapters.
The book outline is summarized in Table 1.1. The remainder of this section details
each part in turn.
1.3.1. Part 1: Preliminaries
Chapters 2 and 3 expand on the special features of the microeconometric approach
to modeling and microeconomic data structures within the more general statistical
arena of regression analysis. Many of the issues raised in these chapters are pursued
throughout the book as the reader develops the necessary tools.

Table 1.1. Book Outline

Part and Chapter | Background(a) | Example

1. Preliminaries
   1. Overview | – |
   2. Causal and Noncausal Models | – | Simultaneous equations models
   3. Microeconomic Data Structures | – | Observational data
2. Core Methods
   4. Linear Models | – | Ordinary least squares
   5. Maximum Likelihood and Nonlinear Least-Squares Estimation | – | m-estimation or extremum estimation
   6. Generalized Method of Moments and Systems Estimation | 5 | Instrumental variables
   7. Hypothesis Tests | 5 | Wald, score, and likelihood ratio tests
   8. Specification Tests and Model Selection | 5, 7 | Conditional moment test
   9. Semiparametric Methods | – | Kernel regression
   10. Numerical Optimization | 5 | Newton–Raphson iterative method
3. Simulation-Based Methods
   11. Bootstrap Methods | 7 | Percentile t-method
   12. Simulation-Based Methods | 5 | Maximum simulated likelihood
   13. Bayesian Methods | – | Markov chain Monte Carlo
4. Models for Cross-Section Data
   14. Binary Outcome Models | 5 | Logit, probit for y = 0, 1
   15. Multinomial Models | 5, 14 | Multinomial logit for y = 1, . . . , m
   16. Tobit and Selection Models | 5, 14 | Tobit for y = max(y∗, 0)
   17. Transition Data: Survival Analysis | 5 | Cox proportional hazards for y = min(y∗, c)
   18. Mixture Models and Unobserved Heterogeneity | 5, 17 | Unobserved heterogeneity
   19. Models for Multiple Hazards | 5, 17 | Multiple hazards
   20. Models of Count Data | 5 | Poisson for y = 0, 1, 2, . . .
5. Models for Panel Data
   21. Linear Panel Models: Basics | – | Fixed and random effects
   22. Linear Panel Models: Extensions | 6, 21 | Dynamic and endogenous regressors
   23. Nonlinear Panel Models | 5, 6, 21, 22 | Panel logit, Tobit, and Poisson
6. Further Topics
   24. Stratified and Clustered Samples | 5 | Data (yij, xij) correlated over j
   25. Treatment Evaluation | 5, 21 | Regressor d = 1 if in program
   26. Measurement Error Models | 5 | Logit model with measurement errors
   27. Missing Data and Imputation | 5 | Regression with missing observations

(a) The background gives the essential chapter needed in addition to the treatment of ordinary and weighted LS in Chapter 4. Note that the first panel data chapter (Chapter 21) requires only Chapter 4.
1.3.2. Part 2: Core Methods
Chapters 4–10 detail the main general methods used in classical estimation and sta-
tistical inference. The results given in Chapter 5 in particular are extensively used
throughout the book.
Chapter 4 presents some results for the linear regression model, emphasizing those
issues and methods that are most relevant for the rest of the book. Analysis is relatively
straightforward as there is an explicit expression for linear model estimators such as
ordinary least squares.
Chapters 5 and 6 present estimation theory that can be applied to nonlinear models
for which there is usually no explicit solution for the estimator. Asymptotic theory
is used to obtain the distribution of estimators, with emphasis on obtaining robust
standard error estimates that rely on relatively weak distributional assumptions. A quite
general treatment of estimation, along with specialization to nonlinear least-squares
and maximum likelihood estimation, is presented in Chapter 5. The more challenging
generalized method of moments estimator and specialization to instrumental variables
estimation are given separate treatment in Chapter 6.
Chapter 7 presents classical hypothesis testing when estimators are nonlinear and
the hypothesis being tested is possibly nonlinear in parameters. Specification tests in
addition to hypothesis tests are the subject of Chapter 8.
Chapter 9 presents semiparametric estimation methods such as kernel regression.
The leading example is flexible modeling of the conditional mean. For the patents ex-
ample, the nonparametric regression model is E[y|x] = g(x), where the function g(·)
is unspecified and is instead estimated. Then estimation has an infinite-dimensional
component g(·) leading to a nonstandard asymptotic theory. With additional regres-
sors some further structure is needed and the methods are called semiparametric or
seminonparametric.
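The idea can be illustrated in a few lines of code. The following is a minimal Nadaraya–Watson kernel regression sketch in Python; the simulated data, the Gaussian kernel, and the bandwidth h are illustrative assumptions, not taken from the book's examples or data sets.

```python
import numpy as np

def kernel_regression(x0, x, y, h=0.5):
    """Estimate E[y|x = x0] as a kernel-weighted average of the observed y."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)          # Nadaraya-Watson estimator

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)   # true g(x) = sin(x)
print(kernel_regression(5.0, x, y))               # local estimate of g(5)
```

No functional form for g(·) is specified; the estimate at each evaluation point is a weighted average of nearby observations, with the bandwidth controlling the degree of smoothing.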
Chapter 10 presents the computational methods used to compute a parameter esti-
mate when the estimator is defined implicitly, usually as the solution to some first-order
conditions.
1.3.3. Part 3: Simulation-Based Methods
Chapters 11–13 consider methods of estimation and inference that rely on simulation.
These methods are generally more computationally intensive and, currently, less uti-
lized than the methods presented in Part 2.
Chapter 11 presents the bootstrap method for statistical inference. This yields the
empirical distribution of an estimator by obtaining new samples by simulation, such
as by repeated resampling with replacement from the original sample. The bootstrap
can provide a simple way to obtain standard errors when the formulas from asymp-
totic theory are complex, as is the case for some two-step estimators. Furthermore, if
implemented appropriately, the bootstrap can lead to better statistical inference in
small samples.
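As a rough illustration of the resampling idea, the following sketch computes a pairs-bootstrap standard error for an OLS slope coefficient; the data are simulated for the example and the number of replications is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def ols_slope(y, x):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

B = 999
slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)          # resample (y_i, x_i) pairs with replacement
    slopes[b] = ols_slope(y[idx], x[idx])

print("bootstrap standard error of the slope:", slopes.std(ddof=1))
```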
Chapter 12 presents simulation-based estimation methods, developed for models
that involve an integral over a probability distribution for which there is no closed-
form solution. Estimation is still possible by making multiple draws from the relevant
distribution and averaging.
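A minimal sketch of the simulation step is given below for a hypothetical Poisson model with lognormal unobserved heterogeneity: the likelihood contribution is an integral with no closed form and is approximated by averaging over S draws. The model, parameter values, and number of draws are illustrative assumptions, not the book's.

```python
import numpy as np
from scipy.stats import poisson

def simulated_likelihood(y_i, x_i, beta, sigma, draws):
    """Approximate the integral of Poisson(y_i | exp(x_i'beta + sigma*eps)) over eps ~ N(0,1)."""
    mu = np.exp(x_i @ beta + sigma * draws)   # one conditional mean per simulated draw
    return poisson.pmf(y_i, mu).mean()        # average over the S draws

rng = np.random.default_rng(0)
draws = rng.normal(size=500)                  # S = 500 draws, held fixed across evaluations
print(simulated_likelihood(2, np.array([1.0, 0.5]), np.array([0.1, 0.3]), 0.8, draws))
```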
Chapter 13 presents Bayesian methods, which combine a distribution for the ob-
served data with a specified prior distribution for parameters to obtain a posterior dis-
tribution of the parameters that is the basis for estimation. Recent advances make com-
putation possible even if there is no closed-form solution for the posterior distribution.
Bayesian analysis can provide an approach to estimation and inference that is quite dif-
ferent from the classical approach. However, in many cases only the Bayesian tool kit
is adopted to permit classical estimation and inference for problems that are otherwise
intractable.
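The following is a minimal random-walk Metropolis sketch for a deliberately simple case, the posterior of a normal mean with known variance; the prior, proposal scale, and burn-in length are illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=50)       # observed data (simulated)

def log_posterior(mu):
    log_lik = -0.5 * np.sum((y - mu) ** 2)        # normal likelihood with sigma = 1
    log_prior = -0.5 * (mu / 10.0) ** 2           # diffuse N(0, 10^2) prior
    return log_lik + log_prior

mu, chain = 0.0, []
for _ in range(5000):
    proposal = mu + rng.normal(scale=0.3)         # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal                             # accept the move
    chain.append(mu)

print("posterior mean estimate:", np.mean(chain[1000:]))   # discard burn-in draws
```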
1.3.4. Part 4: Models for Cross-Section Data
Chapters 14–20 present the main nonlinear models for cross-section data. This part is
the heart of the book and presents advanced topics such as models for limited depen-
dent variables and sample selection. The classes of models are defined by the range of
values taken by the dependent variable.
Binary data models, for a dependent variable that can take only two possible values,
say y = 0 or y = 1, are presented in Chapter 14. In Chapter 15 an extension is made to
multinomial models, for a dependent variable that takes several discrete values. Examples
include employment status (employed, unemployed, and out of the labor force)
and mode of transportation to work (car, bus, or train). Linear models can be informa-
tive but are not appropriate, as they can lead to predicted probabilities outside the unit
interval. Instead logit, probit, and related models are used.
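As a small illustration, the sketch below fits a logit model to simulated binary data using the statsmodels package; the data-generating values are hypothetical, and a probit fit would differ only in the estimator called.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * x)))   # true P(y = 1 | x)
y = rng.binomial(1, p)

X = sm.add_constant(x)
logit_res = sm.Logit(y, X).fit(disp=0)        # probit alternative: sm.Probit(y, X)
print(logit_res.params)                        # fitted probabilities stay inside (0, 1)
```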
Chapter 16 presents models with censoring, truncation, and sample selection. Examples
include annual hours of work, conditional on choosing to work, and hospital expenditures,
conditional on being hospitalized. In these cases the data are incompletely
observed, with a bunching of observations at y = 0 and with the remaining y > 0.
The model for the observed data can be shown to be nonlinear even if the underlying
process is linear, and linear regression on the observed data can be very misleading.
Simple corrections for censoring, truncation, or sample selection such as the Tobit
model exist, but these are very dependent on distributional assumptions.
Models for duration data are presented in Chapters 17–19. An example is length
of unemployment spell. Standard regression models include the exponential, Weibull,
and Cox proportional hazards model. Additionally, as in Chapter 16, the dependent
variable is often incompletely observed. For example, the data may be on the length of
a current spell that is incomplete, rather than the length of a completed spell.
Chapter 20 presents count data models. Examples include various measures of
health utilization such as number of doctor visits and number of days hospitalized.
Again the model is nonlinear, as counts and hence the conditional mean are nonnega-
tive. Leading parametric models include the Poisson and negative binomial.
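A corresponding sketch for count data fits a Poisson regression to simulated counts with statsmodels; the data are hypothetical and serve only to show that the exponential conditional mean keeps fitted values nonnegative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.3 * x))        # counts with log-linear conditional mean

X = sm.add_constant(x)
pois_res = sm.Poisson(y, X).fit(disp=0)
print(pois_res.params)                         # exp(x'b) guarantees a nonnegative fitted mean
```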
1.3.5. Part 5: Models for Panel Data
Chapters 21–23 present methods for panel data. Here the data are observed in several
time periods for each of the many individuals in the sample, so the dependent variable
and regressors are indexed by both individual and time. Any analysis needs to control
for the likely positive correlation of error terms in different time periods for a given in-
dividual. Additionally, panel data can provide sufficient data to control for unobserved
time-invariant individual-specific effects, permitting identification of causation under
weaker assumptions than those needed if only cross-section data are available.
The basic linear panel data model is presented in Chapter 21, with emphasis on
fixed effects and random effects models. Extensions of linear models to permit lagged
dependent variables and endogenous regressors are presented in Chapter 22. Panel
methods for the nonlinear models of Part 4 are presented in Chapter 23.
The panel data methods are placed late in the book to permit a unified self-contained
treatment. Chapter 21 could have been placed immediately after Chapter 4 and is writ-
ten in an accessible manner that relies on little more than knowledge of least-squares
estimation.
1.3.6. Part 6: Further Topics
This part considers important topics that can generally relate to any and all models
covered in Parts 4 and 5. Chapter 24 deals with modeling of clustered data in sev-
eral different models. Chapter 25 discusses treatment evaluation. Treatment evaluation
is a general term that can cover a wide variety of models in which the focus is on
measuring the impact of some “treatment” that is either exogenously or randomly as-
signed to an individual on some measure of interest, denoted an “outcome variable.”
Chapter 26 deals with the consequences of measurement errors in outcome and/or
regressor variables, with emphasis on some leading nonlinear models. Chapter 27
considers some methods of handling missing data in linear and nonlinear regression
models.
1.4. How to Use This Book
The book assumes a basic understanding of the linear regression model with matrix
algebra. It is written at the mathematical level of the first-year economics Ph.D. se-
quence, comparable to Greene (2003).
Although some of the material in this book is covered in a first-year sequence,
most of it appears in second-year econometrics Ph.D. courses or in data-oriented mi-
croeconomics field courses such as labor economics, public economics, or industrial
organization. This book is intended to be used as both an econometrics text and as an
adjunct for such field courses. More generally, the book is intended to be useful as a
reference work for applied researchers in economics, in related social sciences such as
sociology and political science, and in epidemiology.
For readers using this book as a reference work, the models chapters have been
written to be as self-contained as possible. For the specific models presented in Parts 4
and 5 it will generally be sufficient to read the relevant chapter in isolation, except
that some command of the general estimation results in Chapter 5 and in some cases
Chapter 6 will be necessary. Most chapters are structured to begin with a discussion
and example that is accessible to a wide audience.

Table 1.2. Outline of a 20-Lecture 10-Week Course

Lectures | Chapter | Topic
1–3 | 4, Appx. A | Review of linear models and asymptotic theory
4–7 | 5 | Estimation: m-estimation, ML, and NLS
8 | 10 | Estimation: numerical optimization
9–11 | 14, 15 | Models: binary and multinomial
12–14 | 16 | Models: censored and truncated
15 | 6 | Estimation: GMM
16 | 7 | Testing: hypothesis tests
17–19 | 21 | Models: basic linear panel
20 | 9 | Estimation: semiparametric
For instructors using this book as a course text it is best to introduce the basic non-
linear cross-section and linear panel data models as early as possible, skipping many
of the methods chapters. The most commonly used nonlinear cross-section models
are presented in Chapters 14–16; these require knowledge of maximum likelihood
and least-squares estimation, presented in Chapter 5. Chapter 21 on linear panel data
models requires even less preparation, essentially just Chapter 4.
Table 1.2 provides an outline for a one-quarter second-year graduate course taught
at the University of California, Davis, immediately following the required first-year
statistics and econometrics sequence. A quarter provides sufficient time to cover the
basic results given in the first half of the chapters in this outline. With additional time
one can go into further detail or cover a subset of Chapters 11–13 on computation-
ally intensive estimation methods (simulation-based estimation, the bootstrap, which
is also briefly presented in Chapter 7, and Bayesian methods); additional cross-section
models (durations and counts) presented in Chapters 17–20; and additional panel data
models (linear model extensions and nonlinear models) given in Chapters 22 and 23.
At Indiana University, Bloomington, a 15-week semester-long field course in mi-
croeconometrics is based on material in most of Parts 4 and 5. The prerequisite courses
for this course cover material similar to that in Part 2.
Some exercises are provided at the end of each chapter after the first three intro-
ductory chapters. These exercises are usually learning-by-doing exercises; some are
purely methodological whereas others entail analysis of generated or actual data. The
level of difficulty of the questions is mostly related to the level of difficulty of the topic.
1.5. Software
There are many software packages available for data analysis. Popular packages with
strong microeconometric capabilities include LIMDEP, SAS, and STATA, all of which
offer an impressive range of canned routines and additionally support user-defined pro-
cedures using a matrix programming language. Other packages that are also widely
used include EVIEWS, PCGIVE, and TSP. Despite their time-series orientation, these
can support some cross-section data analysis. Users who wish to do their own pro-
gramming also have available a variety of options including GAUSS, MATLAB, OX,
and SAS/IML. The latest detailed information about these packages and many others
can be efficiently located via an Internet browser and a search engine.
1.6. Notation and Conventions
Vector and matrix algebra are used extensively.
Vectors are defined as column vectors and represented using lowercase bold. For
example, for linear regression the regressor vector x is a K × 1 column vector with jth
entry xj and the parameter vector β is a K × 1 column vector with jth entry βj , so
    x = (x1, . . . , xK)′   and   β = (β1, . . . , βK)′.

Then the linear regression model y = β1x1 + β2x2 + · · · + βK xK + u is expressed as
y = x′β + u. At times a subscript i is added to denote the typical ith observation. The
linear regression equation for the ith observation is then

    yi = xi′β + ui.
The sample is one of N observations, {(yi , xi ), i = 1, . . . , N}. In this book observa-
tions are usually assumed to be independent over i.
Matrices are represented using uppercase bold. In matrix notation the sample is
(y, X), where y is an N × 1 vector with ith entry yi and X is an N × dim(x) matrix with
ith row xi′, so

    y = (y1, . . . , yN)′   and   X = [x1 · · · xN]′.
The linear regression model upon stacking all N observations is then
y = Xβ + u,
where u is an N × 1 column vector with ith entry ui .
Matrix notation is compact but at times it is clearer to write products of matrices
as summations of products of vectors. For example, the OLS estimator can be equiva-
lently written in either of the following ways:

    β̂ = (X′X)⁻¹X′y = ( Σi xi xi′ )⁻¹ Σi xi yi,

where the sums run over i = 1, . . . , N.
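The equivalence of the matrix and summation forms is easy to verify numerically; the following sketch uses simulated data and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=N)

b_matrix = np.linalg.solve(X.T @ X, X.T @ y)                     # (X'X)^(-1) X'y
b_sum = np.linalg.solve(sum(np.outer(x, x) for x in X),          # sum_i x_i x_i'
                        sum(x * yi for x, yi in zip(X, y)))      # sum_i x_i y_i
print(np.allclose(b_matrix, b_sum))                              # True
```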
Table 1.3. Commonly Used Acronyms and Abbreviations

Linear
  OLS     Ordinary least squares
  GLS     Generalized least squares
  FGLS    Feasible generalized least squares
  IV      Instrumental variables
  2SLS    Two-stage least squares
  3SLS    Three-stage least squares

Nonlinear
  NLS     Nonlinear least squares
  FGNLS   Feasible generalized nonlinear least squares
  NIV     Nonlinear instrumental variables
  NL2SLS  Nonlinear two-stage least squares
  NL3SLS  Nonlinear three-stage least squares

General
  LS      Least squares
  ML      Maximum likelihood
  QML     Quasi-maximum likelihood
  GMM     Generalized method of moments
  GEE     Generalized estimating equations
Generic notation for a parameter is the q × 1 vector θ. The regression parameters
are represented by the K × 1 vector β, which may equal θ or may be a subset of θ
depending on the context.
The book uses many abbreviations and acronyms. Table 1.3 summarizes abbrevia-
tions used for some common estimation methods, ordered by whether the estimator is
developed for linear or nonlinear regression models. We also use the following: dgp
(data-generating process), iid (independently and identically distributed), pdf (prob-
ability density function), cdf (cumulative distribution function), L (likelihood), ln L
(log-likelihood), FE (fixed effects), and RE (random effects).
Chapter 2
Causal and Noncausal Models
2.1. Introduction
Microeconometrics deals with the theory and applications of methods of data analysis
developed for microdata pertaining to individuals, households, and firms. A broader
definition might also include regional- and state-level data. Microdata are usually
either cross sectional, in which case they refer to conditions at the same point in
time, or longitudinal (panel) in which case they refer to the same observational units
over several periods. Such observations are generated from both nonexperimental
setups, such as censuses and surveys, and quasi-experimental or experimental setups,
such as social experiments implemented by governments with the participation of
volunteers.
A microeconometric model may be a full specification of the probability distribu-
tion of a set of microeconomic observations; it may also be a partial specification of
some distributional properties, such as moments, of a subset of variables. The mean of
a single dependent variable conditional on regressors is of particular interest.
There are several objectives of microeconometrics. They include both data descrip-
tion and causal inference. The first can be defined broadly to include moment prop-
erties of response variables, or regression equations that highlight associations rather
than causal relations. The second category includes causal relationships that aim at
measurement and/or empirical confirmation or refutation of conjectures and proposi-
tions regarding microeconomic behavior. The type and style of empirical investigations
therefore span a wide spectrum. At one end of the spectrum can be found very highly
structured models, derived from detailed specification of the underlying economic be-
havior, that analyze causal (behavioral) or structural relationships for interdependent
microeconomic variables. At the other end are reduced form studies that aim to un-
cover correlations and associations among variables, without necessarily relying on
a detailed specification of all relevant interdependencies. Both approaches share the
common goal of uncovering important and striking relationships that could be helpful
in understanding microeconomic behavior, but they differ in the extent to which they
rely on economic theory to guide their empirical investigations.
As a subdiscipline microeconometrics is newer than macroeconometrics, which is
concerned with modeling of market and aggregate data. A great deal of the early
work in applied econometrics was based on aggregate time-series data collected by
government agencies. Much of the early work on statistical demand analysis up until
about 1940 used market rather than individual or household data (Hendry and Morgan,
1996). Morgan’s (1990) book on the history of econometric ideas makes no reference
to microeconometric work before the 1940s, with one important exception. That ex-
ception is the work on household budget data that was instigated by concern with the
living standards of the less well-off in many countries. This led to the collection of
household budget data that provided the raw material for some of the earlier microe-
conometric studies such as those pioneered by Allen and Bowley (1935). Nevertheless,
it is only since the 1950s that microeconometrics has emerged as a distinctive and rec-
ognized subdiscipline. Even into the 1960s the core of microeconometrics consisted
of demand analyses based on household surveys.
With the award of the year 2000 Nobel Prize in Economics to James Heckman
and Daniel McFadden for their contributions to microeconometrics, the subject area
has achieved clear recognition as a distinct subdiscipline. The award cited Heckman
“for his development of theory and methods for analyzing selective samples” and
McFadden “for his development of theory and methods for analyzing discrete choice.”
Examples of the type of topics that microeconometrics deals with were also men-
tioned in the citation: “ . . . what factors determine whether an individual decides to
work and, if so, how many hours? How do economic incentives affect individual
choices regarding education, occupation or place of residence? What are the effects
of different labor-market and educational programs on an individual’s income and
employment?”
Applications of microeconometric methods can be found not only in every area of
microeconomics but also in other cognate social sciences such as political science,
sociology, and geography.
Beginning with the 1970s and especially within the past two decades revolution-
ary advances in our capacity for handling large data sets and associated computations
have taken place. These, together with the accompanying explosion in the availability
of large microeconomic data sets, have greatly expanded the scope of microecono-
metrics. As a result, although empirical demand analysis continues to be one of the
most important areas of application for microeconometric methods, its style and con-
tent have been heavily influenced by newer methods and models. Further, applications
in economic development, finance, health, industrial organization, labor and public
economics, and applied microeconomics generally are now commonplace, and these
applications will be encountered at various places in this book.
The primary focus of this book is on the newer material that has emerged in the
past three decades. Our goal is to survey concepts, models, and methods that we re-
gard as standard components of a modern microeconometrician’s tool kit. Of course,
the notion of standard methods and models is inevitably both subjective and elastic,
being a function of the presumed clientele of this book as well as the authors’ own
backgrounds. There may also be topics we regard as too advanced for an introductory
book such as this that others would place in a different category.
Microeconometrics focuses on the complications of nonlinear models and on ob-
taining estimates that can be given a structural interpretation. Much of this book, es-
pecially Parts 2–4, presents methods for nonlinear models. These nonlinear methods
overlap with many areas of applied statistics including biostatistics. By contrast, the
distinguishing feature of econometrics is the emphasis placed on causal modeling.
This chapter introduces the key concepts related to causal (and noncausal) modeling,
concepts that are germane to both linear and nonlinear models.
Sections 2.2 and 2.3 introduce the key concepts of structure and exogeneity.
Section 2.4 uses the linear simultaneous equations model as a specific illustration
of a structural model and connects it with the other important concepts of reduced
form models. Identification definitions are given in Section 2.5. Section 2.6 considers
single-equation structural models. Section 2.7 introduces the potential outcome model
and compares the causal parameters and interpretations in the potential outcome model
with those in the simultaneous equations model. Section 2.8 provides a brief discus-
sion of modeling and estimation strategies designed to handle computational and data
challenges.
2.2. Structural Models
Structure consists of
1. a set of variables W (“data”) partitioned for convenience as [Y Z];
2. a joint probability distribution of W, F(W);
3. an a priori ordering of W according to hypothetical cause-and-effect relationships and
specification of a priori restrictions on the hypothesized model; and
4. a parametric, semiparametric, or nonparametric specification of functional forms and
the restrictions on the parameters of the model.
This general description of a structural model is consistent with a well-established
Cowles Commission definition of a structure. For example, Sargan (1988, p. 27) states:
A model is the specification of the probability distribution for a set of observations.
A structure is the specification of the parameters of that distribution. Therefore, a
structure is a model in which all the parameters are assigned numerical values.
We consider the case in which the modeling objective is to explain the values of an
observable vector-valued variable y, y′ = (y1, . . . , yG). Each element of y is a function
of some other elements of y and of explanatory variables z and a purely random
disturbance u. Note that the variables y are assumed to be interdependent. By contrast,
interdependence between the zi is not modeled. The ith observation satisfies the set of
implicit equations

    g(yi, zi, ui | θ) = 0,    (2.1)
where g is a known function. We refer to this as the structural model, and we refer to
θ as structural parameters. This corresponds to property 4 given earlier in this section.
Assume that there is a unique solution for yi for every (zi , ui ). Then we can write
the equations in an explicit form with y as function of (z, u):
yi = f (zi , ui |π) . (2.2)
This is referred to as the reduced form of the structural model, where π is a vector
of reduced form parameters. The reduced form is obtained by solving the structural
model for the endogenous variables yi, given (zi, ui), and the reduced form parameters
π are functions of the structural parameters θ.
If the objective of modeling is inference about elements of θ, then (2.1) provides a
direct route. This involves estimation of the structural model. However, because ele-
ments of π are functions of θ, (2.2) also provides an indirect route to inference on θ.
If f(zi , ui |π) has a known functional form, and if it is additively separable in zi and ui ,
such that we can write
yi = g (zi |π) + ui = E [yi |zi ] + ui , (2.3)
then the regression of y on z is a natural prediction function for y given z. In this
sense the reduced form equation has a useful role for making conditional predictions
of yi given (zi , ui ). To generate predictions of the left-hand-side variable for assigned
values of the right-hand-side variables of (2.2) requires estimates of π, which may be
computationally simpler.
An important extension of (2.3) is the transformation model, which for scalar y
takes the form

    λ(y) = z′π + u,    (2.4)

where λ(y) is a transformation function (e.g., λ(y) = ln(y) or λ(y) = y^(1/2)). In some
cases the transformation function may depend on unknown parameters. A transformation
model is distinct from a regression, but it too can be used to make estimates
of E[y|z]. An important example is the accelerated failure time model analyzed in
Chapter 17.
One of the most important, and potentially controversial, steps in the specification
of the structural model is property 3, in which an a priori ordering of variables into
causes and effects is assigned. In essence this involves drawing a distinction between
those variables whose variation the model is designed to explain and those whose
variation is externally determined and hence lies outside the scope of investigation. In
microeconometrics, examples of the former are years of schooling and hours worked;
examples of the latter are gender, ethnicity, age, and similar demographic variables.
The former, denoted y, are referred to as endogenous and the latter, denoted z, are
called exogenous variables.
Exogeneity of a variable is an important simplification because in essence it jus-
tifies the decision to treat that variable as ancillary, and not to model that variable
because the parameters of that relationship have no direct bearing on the variable
under study. This important notion needs a more formal definition, which we now
provide.
2.3. Exogeneity
We begin by considering the representation of a general finite dimensional parametric
case in which the joint distribution of W, with parameters θ partitioned as (θ1 θ2), is
factored into the conditional density of Y given Z, and the marginal distribution of Z,
giving
fJ (W|θ) = fC (Y|Z, θ) × fM (Z|θ) . (2.5)
A special case of this result occurs if

    fJ(W|θ) = fC(Y|Z, θ1) × fM(Z|θ2),

where θ1 and θ2 are functionally independent. Then we say that Z is exogenous with
respect to θ1; this means that knowledge of fM(Z|θ2) is not required for inference on
θ1, and hence we can validly condition the distribution of Y on Z.
Models can always be reparameterized. So next consider the case in which the
model is reparameterized in terms of parameters ϕ, with one-to-one transformation
of θ, say ϕ = h(θ), where ϕ is partitioned into (ϕ1, ϕ2). This reparametrization may
be of interest if, for example, ϕ1 is structurally invariant to a class of policy interven-
tions. Suppose ϕ1 is the parameter of interest. In such a case one is interested in the
exogeneity of Z with respect to ϕ1. Then, the condition for exogeneity is that

    fJ(W|ϕ) = fC(Y|Z, ϕ1) × fM(Z|ϕ2),    (2.6)

where ϕ1 is independent of ϕ2.

Finally consider the case in which the interest is in a parameter λ that is a function
of ϕ, say h(ϕ). Then for exogeneity of Z with respect to λ, we need two conditions:
(i) λ depends only on ϕ1, i.e., λ = h(ϕ1), and hence only the conditional distribution is
of interest; and (ii) ϕ1 and ϕ2 are "variation free," which means that the parameters of
the joint distribution are not subject to cross-restrictions, i.e., (ϕ1, ϕ2) ∈ Φ1 × Φ2 =
{ϕ1 ∈ Φ1, ϕ2 ∈ Φ2}.
The factorization in (2.5)-(2.6) plays an important role in the development of the
exogeneity concept. Of special interest in this book are the following three con-
cepts related to exogeneity: (1) weak exogeneity; (2) Granger noncausality; (3) strong
exogeneity.
Definition 2.1 (Weak Exogeneity): Z is weakly exogenous for λ if (i) and (ii)
hold.
If the marginal model parameters are uninformative for inference on λ, then infer-
ence on λ can proceed on the basis of the conditional distribution f (Y|Z, ϕ1) alone.
The operational implication is that weakly exogenous variables can be taken as given
if one’s main interest is in inference on λ or ϕ1. This does not mean that there is no
statistical model for Z; it means that the parameters of that model play no role in the
inference on ϕ1, and hence are irrelevant.
2.3.1. Conditional Independence
Originally, the Granger causality concept was defined in the context of prediction in a
time-series environment. More generally, it can be interpreted as a form of conditional
independence (Holland, 1986, p. 957).
Partition z into two subsets z1 and z2; let W = [y, z1, z2] be the matrices of vari-
ables of interest. Then z1 and y are conditionally independent given z2 if
f (y|z1, z2) = f (y|z2) . (2.8)
This is stronger than the mean independence assumption, which would imply
E [y|z1, z2] = E [y|z2] . (2.9)
Then z1 has no predictive value for y, after conditioning on z2. In a predictive sense
this means that z1 does not Granger-cause y.
In a time-series context, z1 and z2 would be mutually exclusive lagged values of
subsets of y.
Definition 2.2 (Strong Exogeneity): z1 is strongly exogenous for ϕ if it is
weakly exogenous for ϕ and does not Granger-cause y so (2.8) holds.
2.3.2. Exogenizing Variables
Exogeneity is a strong assumption. It is a property of random variables relative to
parameters of interest. Hence a variable may be validly treated as exogenous in one
structural model but not in another; the key issue is the parameters that are the subject
of inference. Arbitrary imposition of this property will have some undesirable conse-
quences that will be discussed in Section 2.4.
The exogeneity assumption may be justified by a priori theorizing, in which case it
is a part of the maintained hypothesis of the model. It may in some cases be justified
as a valid approximation, in which case it may be subject to testing, as discussed in
Section 8.4.3. In cross-section analysis it may be justified as being a consequence of
a natural experiment or a quasi-experiment in which the value of the variable is de-
termined by an external intervention; for example, government or regulatory authority
may determine the setting of a tax rate or a policy parameter. Of special interest is the
case in which an external intervention results in a change in the value of an impor-
tant policy variable. Such a natural experiment is tantamount to exogenization of some
variable. As we shall see in Chapter 3, this creates a quasi-experimental opportunity to
study the impact of a variable in the absence of other complicating factors.
2.4. Linear Simultaneous Equations Model
An important special case of the general structural model specified in (2.1) is the linear
simultaneous equation model developed by the Cowles Commission econometricians.
Comprehensive treatment of this model is available in many textbooks (e.g., Sargan,
1988). The treatment here is brief and selective; also see Section 6.9.6. The objective is
to bring into the discussion several key ideas and concepts that have more general rele-
vance. Although the analysis is restricted to linear models, many insights are routinely
applied to nonlinear models.
2.4.1. The SEM Setup
The linear simultaneous equations model (SEM) setup is as follows:
    y1i β11 + · · · + yGi β1G + z1i γ11 + · · · + zKi γ1K = u1i
    ⋮
    y1i βG1 + · · · + yGi βGG + z1i γG1 + · · · + zKi γGK = uGi,
where i is the observation subscript.
A clear a priori distinction or preordering is made between endogenous variables,
yi′ = (y1i, . . ., yGi), and exogenous variables, zi′ = (z1i, . . ., zKi). By definition the exogenous
variables are uncorrelated with the purely random disturbances (u1i, . . ., uGi).
In its unrestricted form every variable enters every equation.
In matrix notation, the G-equation SEM for the ith observation is written as

    yi′B + zi′Γ = ui′,    (2.10)

where yi, B, zi, Γ, and ui have dimensions G × 1, G × G, K × 1, K × G, and G × 1,
respectively. For specified values of (B, Γ) and (zi, ui) the G linear simultaneous equations
can in principle be solved for yi.
The standard assumptions of SEM are as follows:
1. B is nonsingular and has rank G.
2. rank[Z] = K. The N × K matrix Z is formed by stacking zi′, i = 1, . . ., N.
3. plim N⁻¹Z′Z = Σzz is a symmetric K × K positive definite matrix.
4. ui ∼ N[0, Σ]; that is, E[ui] = 0 and E[ui ui′] = Σ = [σij], where Σ is a symmetric G × G positive definite matrix.
5. The errors in each equation are serially independent.
In this model the structure (or structural parameters) consists of (B, Γ, Σ). Writing

    Y = [y1 · · · yN]′,   Z = [z1 · · · zN]′,   U = [u1 · · · uN]′
allows us to express the structural model more compactly as
YB + ZΓ = U, (2.11)
where the arrays Y, B, Z, Γ, and U have dimensions N × G, G × G, N × K, K ×
G, and N × G, respectively. Solving for all the endogenous variables in terms of all
the exogenous variables, we obtain the reduced form of the SEM:
    Y + ZΓB⁻¹ = UB⁻¹,
    Y = ZΠ + V,    (2.12)

where Π = −ΓB⁻¹ and V = UB⁻¹. Given Assumption 4, vi ∼ N[0, (B⁻¹)′ΣB⁻¹].
In the SEM framework the structural model has primacy for several reasons. First,
the equations themselves have interpretations as economic relationships such as de-
mand or supply relations, production functions, and so forth, and they are subject to
restrictions of economic theory. Consequently, B and Γ are parameters that describe
economic behavior. Hence a priori theory can be invoked to form expectations about
the sign and size of individual coefficients. By contrast, the unrestricted reduced form
parameters are potentially complicated functions of the structural parameters, and as
such it may be difficult to evaluate them postestimation. This consideration may have
little weight if the goal of econometric modeling is prediction rather than inference on
parameters with behavioral interpretation.
Consider, without loss of generality, the first equation in the model (2.11), with y1
as the dependent variable. In addition, some of the remaining G − 1 endogenous vari-
ables and K − 1 exogenous variables may be absent from this equation. From (2.12)
we see that in general the endogenous variables Y depend stochastically on V, which
in turn is a function of the structural errors U. Therefore, in general plim N⁻¹Y′U ≠ 0.
Generally, the application of the least-squares estimator in the simultaneous equation
setting yields inconsistent estimates. This is a well-known and basic result from the si-
multaneous equations literature, often referred to as the “simultaneous equations bias”
problem. The vast literature on simultaneous equations models deals with identifica-
tion and consistent estimation when the least-squares approach fails; see Sargan (1988)
and Schmidt (1976), and Section 6.9.6.
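The inconsistency is easy to reproduce by simulation. The sketch below generates a single endogenous regressor whose error is correlated with the structural error; OLS is centered away from the true coefficient, whereas a simple instrumental variables estimator using the exogenous variable z is not. All numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta = 100_000, 0.5
z = rng.normal(size=N)
u = rng.normal(size=N)
v = 0.8 * u + rng.normal(size=N)          # error correlation induces endogeneity
y2 = 1.0 * z + v                           # reduced form for the endogenous regressor
y1 = beta * y2 + u                         # structural equation of interest

b_ols = (y2 @ y1) / (y2 @ y2)              # inconsistent: plim differs from beta
b_iv = (z @ y1) / (z @ y2)                 # consistent: z serves as the instrument
print(b_ols, b_iv)                         # roughly 0.80 versus 0.50
```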
The reduced form of SEM expresses every endogenous variable as a linear function
of all exogenous variables and all structural disturbances. The reduced form distur-
bances are linear combinations of the structural disturbances. From the reduced form
for the ith observation,

    E[yi|zi] = zi′Π,    (2.13)
    V[yi|zi] = Ω ≡ (B⁻¹)′ΣB⁻¹.    (2.14)
The reduced form parameters Π are derived parameters defined as functions of the
structural parameters. If Π can be consistently estimated then the reduced form can
be used to make predictive statements about variations in Y due to exogenous changes
in Z. This is possible even if B and Γ are not known. Given the exogeneity of Z,
the full set of reduced form regressions is a multivariate regression model that can be
estimated consistently by least squares. The reduced form provides a basis for making
conditional predictions of Y given Z.
The restricted reduced form is the unrestricted reduced form model subject to re-
strictions. If these are the same restrictions as those that apply to the structure, then
structural information can be recovered from the reduced form.
In the SEM framework, the unknown structural parameters, the nonzero elements
of B, Γ, and Σ, play a key role because they reflect the causal structure of the
model. The interdependence between endogenous variables is described by B, and
the responses of endogenous variables to exogenous shocks in Z are reflected in the
parameter matrix Γ. In this setup the causal parameters of interest are those that
measure the direct marginal impact of a change in an explanatory variable, yj or
zk, on the outcome of interest yl, l ≠ j, and functions of such parameters and data.
The elements of Σ describe the dispersion and dependence properties of the ran-
dom disturbances, and hence they measure some properties of the way the data are
generated.
2.4.2. Causal Interpretation in SEM
A simple example will illustrate the causal interpretation of parameters in SEM. The
structural model has two continuous endogenous variables y1 and y2, a single con-
tinuous exogenous variable z1, one stochastic relationship linking y1 and y2, and one
definitional identity linking all three variables in the model:
    y1 = γ1 + β1 y2 + u1,   0 < β1 < 1,
    y2 = y1 + z1.
In this model u1 is a stochastic disturbance, independent of z1, with a well-defined
distribution. The parameter β1 is subject to an inequality constraint that is also a part
of the model specification. The variable z1 is exogenous and therefore its variation is
induced by external sources that we may regard as interventions. These interventions
have a direct impact on y2 through the identity and also an indirect one through the
first equation. The impact is measured by the reduced form of the model, which is
    y1 = γ1/(1 − β1) + [β1/(1 − β1)] z1 + [1/(1 − β1)] u1 = E[y1|z1] + v1,
    y2 = γ1/(1 − β1) + [1/(1 − β1)] z1 + [1/(1 − β1)] u1 = E[y2|z1] + v1,
where v1 = u1/(1 − β1). The reduced form coefficients β1/(1 − β1) and 1/(1 − β1)
have a causal interpretation. Any externally induced unit change in z1 will cause the
value of y1 and y2 to change by these amounts. Note that in this model y1 and y2 also
respond to u1. In order not to confound the impact of the two sources of variation we
require that z1 and u1 are independent.
Also note that

    ∂y1/∂y2 = β1 = [β1/(1 − β1)] ÷ [1/(1 − β1)] = (∂y1/∂z1) ÷ (∂y2/∂z1).
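The reduced form coefficients above can be verified symbolically; the following sympy sketch simply solves the two structural equations for (y1, y2).

```python
import sympy as sp

y1, y2, z1, u1, gamma1, beta1 = sp.symbols('y1 y2 z1 u1 gamma1 beta1')
sol = sp.solve([sp.Eq(y1, gamma1 + beta1 * y2 + u1),   # structural equation
                sp.Eq(y2, y1 + z1)],                   # definitional identity
               [y1, y2], dict=True)[0]
print(sp.simplify(sol[y1]))   # (gamma1 + beta1*z1 + u1) / (1 - beta1)
print(sp.simplify(sol[y2]))   # (gamma1 + z1 + u1) / (1 - beta1)
```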
In what sense does β1 measure the causal effect of y2 on y1? To see a possible diffi-
culty, observe that y1 and y2 are interdependent or jointly determined, so it is unclear
in what sense y2 “causes” y1. Although z1 (and u1) is the ultimate cause of changes
in the reduced form sense, y2 is a proximate or an intermediate cause of y1. That is,
the first structural equation provides a snapshot of the impact of y2 on y1, whereas
the reduced form gives the (equilibrium) impact after allowing for all interactions be-
tween the endogenous variables to work themselves out. In a SEM framework even
endogenous variables are viewed as causal variables, and their coefficients as causal
parameters. This approach can cause puzzlement for those who view causality in an
experimental setting where independent sources of variation are the causal variables.
The SEM approach makes sense if y2 has an independent and exogenous source of
variation, which in this model is z1. Hence the marginal response coefficient β1 is a
function of how y1 and y2 respond to a change in z1, as the immediately preceding
equation makes clear.
Of course this model is but a special case. More generally, we may ask under what
conditions will the SEM parameters have a meaningful causal interpretation. We return
to this issue when discussing identification concepts in Section 2.5.
2.4.3. Extensions to Nonlinear and Latent Variable Models
If the simultaneous model is nonlinear in parameters only, the structural model can
be written as
YB(θ) + ZΓ(θ) = U, (2.15)
where B(θ) and Γ(θ) are matrices whose elements are functions of the structural pa-
rameters θ. An explicit reduced form can be derived as before.
If nonlinearity is instead in variables then an explicit (analytical) reduced form
may not be possible, although linearized approximations or numerical solutions of the
dependent variables, given (z, u), can usually be obtained.
Many microeconometric models involve latent or unobserved variables as well as
observed endogenous variables. For example, search and auction theory models use the
concept of reservation wage or reservation price, choice models invoke indirect utility,
and so forth. In the case of such models the structural model (2.1) may be replaced by
    g(y∗i, zi, ui | θ) = 0,    (2.16)

where the latent variables y∗i replace the observed variables yi. The corresponding
reduced form solves for y∗i in terms of (zi, ui), yielding

    y∗i = f(zi, ui | π).    (2.17)

This reduced form has limited usefulness as y∗i is not fully observed. However, if we
have functions yi = h(y∗i) that relate the observable to the latent counterparts of yi, then
the reduced form in terms of observables is

    yi = h(f(zi, ui | π)).    (2.18)
See Section 16.8.2 for further details.
When the structural model involves nonlinearities in variables, or when latent vari-
ables are involved, an explicit derivation of the functional form of this reduced form
may be difficult to obtain. In such cases practitioners use approximations. By citing
mathematical or computational convenience, a specific functional form may be used
to relate an endogenous variable to all exogenous variables, and the result would be
referred to as a “reduced form type relationship.”
2.4.4. Interpretations of Structural Relationships
Marschak (1953, p. 26) in an influential essay gave the following definition of a
structure:
Structure was defined as a set of conditions which did not change while observations
were being made but which might change in future. If a specified change of struc-
ture is expected or intended, prediction of variables of interest to the policy maker
requires some knowledge of past structure. . . . In economics, the conditions that con-
stitute a structure are (1) a set of relations describing human behavior and institutions
as well as technological laws and involving, in general, nonobservable random dis-
turbances and nonobservable random errors of measurement; (2) the joint probability
distribution of these random quantities.
Marschak argued that the structure was fundamental for a quantitative evaluation or
tests of economic theory and that the choice of the best policy requires knowledge of
the structure.
In the SEM literature a structural model refers to “autonomous” (not “derived”)
relationships. There are other closely related concepts of a structure. One such concept
refers to “deep parameters,” by which is meant technology and preference parameters
that are invariant to interventions.
In recent years an alternative usage of the term structure has emerged, one that refers
to econometric models based on the hypothesis of dynamic stochastic optimization by
rational agents. In this approach the starting point for any structural estimation prob-
lem is the first-order necessary conditions that define the agent’s optimizing behavior.
For example, in a standard problem of maximizing utility subject to constraints, the
behavioral relations are the deterministic first-order marginal utility conditions. If the
relevant functional forms are explicitly stated, and stochastic errors of optimization are
introduced, then the first-order conditions define a behavioral model whose parameters
characterize the utility function – the so-called deep or policy-invariant parameters.
Examples are given in Sections 6.2.7 and 16.8.1.
Two features of this highly structural approach should be mentioned. First, they
rely on a priori economic theory in a serious manner. Economic theory is not used
simply to generate a list of relevant variables that one can use in a more or less arbi-
trarily specified functional form. Rather, the underlying economic theory has a major
(but not exclusive) role in specification, estimation, and inference. The second feature
is that identification, specification, and estimation of the resulting model can be very
complicated, because the agent’s optimization problem is potentially very complex,
especially if dynamic optimization under uncertainty is postulated and discreteness
and discontinuities are present; see Rust (1994).
2.5. Identification Concepts
The goal of the SEM approach is to consistently estimate (B, Γ, Σ) and conduct statis-
tical inference. An important precondition for consistent estimation is that the model
should be identified. We briefly discuss the important twin concepts of observational
equivalence and identifiability in the context of parametric models.
Identification is concerned with determination of a parameter given sufficient ob-
servations. In this sense, it is an asymptotic concept. Statistical uncertainty necessarily
affects any inference based on a finite number of observations. By hypothetically
considering the possibility that a sufficient number of observations is available, one
can ask whether it is logically possible to determine a parameter of interest either
in the sense of its point value or in the sense of determining the set of which
the parameter is an element. Therefore, identification is a fundamental consideration
and logically occurs prior to and is separate from statistical estimation. A great deal of
econometric literature on identification focuses on point identification. This is also the
emphasis of this section. However, set identification, or bounds identification, is an
important approach that will be used in selected places in this book (e.g., Chapters 25
and 27; see Manski, 1995).
Definition 2.3 (Observational Equivalence): Two structures of a model, defined
as the joint probability distribution function Pr[x|θ], x ∈ W, θ ∈ Θ, are observationally
equivalent if Pr[x|θ1] = Pr[x|θ2] ∀ x ∈ W.
Less formally, if, given the data, two structural models imply identical joint proba-
bility distributions of the variables, then the two structures are observationally equiva-
lent. The existence of multiple observationally equivalent structures implies the failure
of identification.
Definition 2.4 (Identification): A structure θ0
is identified if there is no other
observationally equivalent structure in Θ.
A simple example of nonidentification occurs when there is perfect collinearity between
regressors in the linear regression y = Xβ + u. Then we can identify the linear
combination Cβ, where rank[C] < rank[β], but we cannot identify β itself.
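A two-line numerical illustration, with hypothetical data: under perfect collinearity X′X is singular, so no unique least-squares solution for β exists.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, 2.0 * x1])           # second regressor is exactly twice the first
print(np.linalg.matrix_rank(X.T @ X))          # 1 < 2, so (X'X)^(-1) does not exist
```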
This definition concerns uniqueness of the structure. In the context of the SEM
we have given, this definition means that identification requires that there is a unique
triple (B, Γ, Σ) consistent with the observed data. In SEM, as in other cases, identi-
fication involves being able to obtain unique estimates of structural parameters given
the sample moments of the data. For example, in the case of the reduced form (2.12),
under the stated assumptions the least-squares estimator provides unique estimates of
Π, that is, Π̂ = [Z′Z]⁻¹Z′Y, and identification of B, Γ requires that there is a solution
for the unknown elements of Γ and B from the equations Π + ΓB⁻¹ = 0, given a
priori restrictions on the model. A unique solution implies just identification of the
model.
A complete model is said to be identified if all the model parameters are identified.
It is possible that for some models only a subset of parameters is identified. In some
situations it may be important to be able to identify some function of parameters, and
not necessarily all the individual parameters. Identification of a function of parameters
means that function can be recovered uniquely from F(W|Θ).
How does one ensure that the structures of alternative model specifications can be
“ruled out”? In SEM the solution to this problem depends on augmenting the sample
information by a priori restrictions on (B, Γ, Σ). The a priori restrictions must intro-
duce sufficient additional information into the model to rule out the existence of other
observationally equivalent structures.
The need for a priori restrictions is demonstrated by the following argument. First
note that given the assumptions of Section 2.4.1 the reduced form, defined by (Π, Ω),
is always unique. Initially suppose there are no restrictions on (B, Γ, Σ). Next suppose
that there are two observationally equivalent structures (B1, Γ1, Σ1) and (B2, Γ2, Σ2).
Then

    Π = −Γ1B1⁻¹ = −Γ2B2⁻¹.    (2.19)

Let H be a G × G nonsingular matrix. Then Γ1B1⁻¹ = Γ1HH⁻¹B1⁻¹ = Γ2B2⁻¹, which
means that Γ2 = Γ1H and B2 = B1H. Thus the second structure is a linear transformation
of the first.
The SEM solution to this problem is to introduce restrictions on (B, Γ, Σ) such
that we can rule out the existence of linear transformations that lead to observation-
ally equivalent structures. In other words, the restrictions on (B, Γ, Σ) must be such
that there is no matrix H that would yield another structure with the same reduced
form; given (Π, Ω) there will be a unique solution to the equations Π = −ΓB−1
and
Ω ≡ (B−1
)
ΣB−1
.
In practice a variety of restrictions can be imposed including (1) normalizations,
such as setting diagonal elements of B equal to 1, (2) zero (exclusion) and linear ho-
mogeneous and inhomogeneous restrictions, and (3) covariance and inequality restric-
tions. Details of the necessary and sufficient conditions for identification in linear and
nonlinear models can be found in many texts including Sargan (1988).
Meaningful imposition of identifying restrictions requires that the a priori restric-
tions imposed should be valid a posteriori. This idea is pursued further in several chap-
ters where identification issues are considered (e.g., Section 6.9).
Exclusion restrictions essentially state that the model contains some variables that
have zero impact on some endogenous variables. That is, certain directions of causa-
tion are ruled out a priori. This makes it possible to identify other directions of cau-
sation. For example, in the simple two-variable example given earlier, z1 did not enter
the y1-equation, making it possible to identify the direct impact of y2 on y1. Although
exclusion restrictions are the simplest to apply, in parametric models identification can
also be secured by inequality restrictions and covariance restrictions.
If there are no restrictions on Σ, and the diagonal elements of B are normalized to
1, then a necessary condition for identification is the order condition, which states
that the number of excluded exogenous variables must at least equal the number of
included endogenous variables. A sufficient condition is the rank condition given in
many texts that ensures for the jth equation parameters ΠΓj = −Bj yields a unique
solution for (Γj , Bj ) given Π.
Given identification, the term just (exact) identification refers to the case when
the order condition is exactly satisfied; overidentification refers to the case when the
number of restrictions on the system exceeds that required for exact identification.
Identification in nonlinear SEM has been discussed in Sargan (1988), who also
gives references to earlier related work.
2.6. Single-Equation Models
Without loss of generality consider the first equation of a linear SEM subject to the
normalization β11 = 1. Let y = y1, let the vector y1 denote the endogenous components
of y other than y1, and let z1 denote the exogenous components of z, with

    y = y1′α + z1′γ + u.    (2.20)
Many studies skip the formal steps involved in going from a system to a single equation
and begin by writing the regression equation
    y = x′β + u,
where some components of x are endogenous (implicitly y1) and others are exogenous
(implicitly z1). The focus lies then on estimating the impact of changes in key regres-
sor(s) that may be endogenous or exogenous, depending on the assumptions. Instru-
mental variable or two-stage least-squares estimation is the most obvious estimation
strategy (see Sections 4.8, 6.4, and 6.5).
In the SEM approach it is natural to specify at least some of the remaining equa-
tions in the model, even if they are not the focus of inquiry. Suppose the endogenous
vector y1 has dimension 1. Then the first possibility is to specify the structural equation for y1 and for
the other endogenous variables that may appear in this structural equation for y1.
A second possibility is to specify the reduced form equation for y1. This will show
exogenous variables that affect y1 but do not directly affect y. An advantage is that
in such a setting instrumental variables emerge naturally. However, in recent empir-
ical work using instrumental variables in a single-equation setting, even the formal
step of writing down a reduced form for the endogenous right-hand-side variable is
avoided.
2.7. Potential Outcome Model
Motivation for causal inference in econometric models is especially strong when the
focus is on the impact of public policy and/or private decision variables on some
specific outcomes. Specific examples include the impact of transfer payments on labor
supply, the impact of class size on student learning, and the impact of health insurance
on utilization of health care. In many cases the causal variables themselves reflect
individual decisions and hence are potentially endogenous. When, as is usually the
case, econometric estimation and inference are based on observational data, iden-
tification of and inference on causal parameters pose many challenges. These chal-
lenges can become potentially less serious if the causal issues are addressed using
data from a controlled social experiment with a proper statistical design. Although
such experiments have been implemented (see Section 3.3 for examples and details)
they are generally expensive to organize and run. Therefore, it is more attractive
to implement causal modeling using data generated by a natural experiment or in
a quasi-experimental setting. Section 3.4 discusses the pros and cons of these data
structures; but for present purposes one should think of a natural or quasi experi-
ment as a setting in which some causal variable changes exogenously and indepen-
dently of other explanatory variables, making it relatively easier to identify causal
parameters.
A major obstacle for causality modeling stems from the fundamental problem of
causal inference (Holland, 1986). Let X be the hypothesized cause and Y the outcome.
By manipulating the value of X we can change the value of Y. Suppose the value of X
is changed from x1 to x2. Then a measure of the causal impact of the change on Y is
formed by comparing the two values of Y: y2, which results from the change, and y1,
which would have resulted had no change in x occurred. However, if X did change,
then the value of Y, in the absence of the change, would not be observed. Hence noth-
ing more can be said about causal impact without some hypothesis about what value
Y would have assumed in the absence of the change in X. The latter is referred to
as a counterfactual, that is, a hypothetical unobserved value. Briefly stated, all
causal inference involves comparison of a factual with a counterfactual outcome. In
the conventional econometric model (e.g., SEM) a counterfactual does not need to be
explicitly stated.
A relatively newer strand in the microeconometric literature – program evalua-
tion or treatment evaluation – provides a statistical framework for the estimation
of causal parameters. In the statistical literature this framework is also known as the
Rubin causal model (RCM) in recognition of a key early contribution by Rubin
(1974, 1978), who in turn cites R.A. Fisher as originator of the approach. Al-
though, following recent convention, we refer to this as the Rubin causal model,
Neyman (Splawa-Neyman) also proposed a similar statistical model in an article
published in Polish in 1923; see Neyman (1990). Models involving counterfactuals
have been independently developed in econometrics following the seminal work of
Roy (1951). In the remainder of this section the salient features of RCM will be
analyzed.
Causal parameters based on counterfactuals provide statistically meaningful and
operational definitions of causality that in some respects differ from the traditional
Cowles foundation definition. First, in ideal settings this framework leads to consider-
able simplicity of econometric methods. Second, this framework typically focuses on
the smaller set of causal parameters thought to be most relevant to the policy issues being examined. This contrasts with the traditional econometric approach that focuses
simultaneously on all structural parameters. Third, the approach provides additional
insights into the properties of causal parameters estimated by the standard structural
methods.
2.7.1. The Rubin Causal Model
The term “treatment” is used interchangeably with “cause.” In medical studies of new
drug evaluation, involving groups of those who receive the treatment and those who
do not, the drug response of the treated is compared with that of the untreated. A mea-
sure of causal impact is the average difference in the outcomes of the treated and the
nontreated groups. In economics, the term treatment is used very broadly. Essentially
it covers variables whose impact on some outcome is the object of study. Examples of
treatment–outcome pairs include schooling and wages, class size and scholastic per-
formance, and job training and earnings. Note that a treatment need not be exogenous,
and in many situations it is an endogenous (choice) variable.
Within the framework of a potential outcome model (POM), which assumes that
every element of the target population is potentially exposed to the treatment, the triple
(y1i , y0i , Di ), i = 1, . . . , N, forms the basis of treatment evaluation. The categorical
variable D takes the values 1 and 0, respectively, when treatment is or is not received;
y1i measures the response for individual i receiving treatment, and y0i measures that
when not receiving treatment. That is,
yi = y1i if Di = 1, and yi = y0i if Di = 0. (2.21)
Since the receipt and nonreceipt of treatment are mutually exclusive states for indi-
vidual i, only one of the two measures is available for any given i, the unavailable
measure being the counterfactual. The effect of the cause D on outcome of individual
i is measured by (y1i − y0i ). The average causal effect of Di = 1, relative to Di = 0,
is measured by the average treatment effect (ATE):
ATE = E[y|D = 1] − E[y|D = 0], (2.22)
where expectations are with respect to the probability distribution over the target pop-
ulation. Unlike the conventional structural model that emphasizes marginal effects, the
POM framework emphasizes ATE and parameters related to it.
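As a rough illustration of (2.21) and (2.22), the following simulation (not from the text) generates both potential outcomes, reveals only one of them according to a randomly assigned D, and recovers ATE from the difference in group means; all numbers are hypothetical.

```python
# Simulated potential-outcome model: under random assignment the difference in
# sample means of the treated and untreated estimates ATE = E[y1] - E[y0].
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)                      # an observed attribute
y0 = 1.0 + 0.5 * x + rng.normal(size=n)     # outcome without treatment
y1 = y0 + 2.0 + 0.3 * x                     # outcome with treatment (effect varies with x)
true_ate = np.mean(y1 - y0)

D = rng.integers(0, 2, size=n)              # random assignment, independent of (y0, y1)
y = np.where(D == 1, y1, y0)                # only one potential outcome is observed

ate_hat = y[D == 1].mean() - y[D == 0].mean()
print("true ATE:", round(true_ate, 3), " estimated ATE:", round(ate_hat, 3))
```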
The experimental approach to the estimation of ATE-type parameters involves a
random assignment of treatment followed by a comparison of the outcomes of the treated with those of a set of nontreated cases that serve as controls. Such an experimental design is explained
in greater detail in Chapter 3. Random assignment implies that individuals exposed to
treatment are chosen randomly, and hence the treatment assignment does not depend
on the outcome and is uncorrelated with the attributes of treated subjects. Two ma-
jor simplifications follow. The treatment variable can be treated as exogenous and its
coefficient in a linear regression will not suffer from omitted variable bias even if some
relevant variables are unavoidably omitted from the regression. Under certain condi-
tions, discussed at greater length in Chapters 3 and 25, the mean difference between
the outcomes of the treated and the control groups will provide an estimate of ATE.
The payoff to the well-designed experiment is the relative simplicity with which causal
statements can be made. Of course, to ensure high statistical precision for the treatment
effect estimate, one should still control for those attributes that also independently in-
fluence the outcomes.
Because random assignment of treatment is generally not feasible in economics,
estimation of ATE-type parameters must be based on observational data generated
under nonrandom treatment assignment. Then the consistent estimation of ATE will
be threatened by several complications that include, for example, possible correlation
between the outcomes and treatment, omitted variables, and endogeneity of the treat-
ment variable. Some econometricians have suggested that the absence of randomiza-
tion constitutes the major impediment to convincing statistical inference about causal
relationships.
The potential outcome model can lead to causal statements if the counterfactual can
be clearly stated and made operational. An explicit statement of the counterfactual,
with a clear implication of what should be compared, is an important feature of this
model. If, as may be the case with observational data, there is a lack of a clear distinc-
tion between observed and counterfactual quantities, then the answer to the question
of who is affected by the treatment remains unclear. ATE is a measure that weights and
combines marginal responses of specific subpopulations. Specific assumptions are re-
quired to operationalize the counterfactual. Information on both treated and untreated
units that can be observed is needed to estimate ATE. For example, it is necessary to
identify the untreated group that proxies the treated group if the treatment were not
applied. It is not necessarily true that this step can always be implemented. The exact
way in which the treated are selected involves issues of sampling design that are also
discussed in Chapters 3 and 25.
A second useful feature of the POM is that it identifies opportunities for causal
modeling created by natural or quasi-experiments. When data are generated in such
settings, and provided certain other conditions are satisfied, causal modeling can occur
without the full complexities of the SEM framework. This issue is analyzed further in
Chapters 3 and 25.
Third, unlike the structural form of the SEM where all variables other than that be-
ing explained can be labeled as “causes,” in the POM not all explanatory variables can
be regarded as causal. Many are simply attributes of the units that must be controlled
for in regression analysis, and attributes are not causes (Holland, 1986). Causal param-
eters must relate to variables that are actually or potentially, and directly or indirectly,
subject to intervention.
Finally, identifiability of the ATE parameter may be an easier research goal and
hence feasible in situations where the identifiability of a full SEM may not be (Angrist,
2001). Whether this is so has to be determined on a case-by-case basis. However,
many available applications of the POM typically employ a limited, rather than full,
information framework. That said, even within the SEM framework the use of a limited-information framework is also feasible, as was previously discussed.
2.8. Causal Modeling and Estimation Strategies
In this section we briefly sketch some of the ways in which econometricians approach
the modeling of causal relationships. These approaches can be used within both SEM
and POM frameworks, but they are typically identified with the former.
2.8.1. Identification Frameworks
Full-Information Structural Models
One variant of this approach is based on the parametric specification of the joint distri-
bution of endogenous variables conditional on exogenous variables. The relationships
are not necessarily derived from an optimizing model of behavior. Parametric restric-
tions are placed to ensure identification of the model parameters that are the target
of statistical inference. The entire model is estimated simultaneously using maximum
likelihood or moments-based estimation. We call this approach the full-information
structural approach. For well-specified models this is an attractive approach but in
general its potential limitation is that it may contain some equations that are poorly
specified. Under joint estimation the effects of localized misspecification may also
affect other estimates.
Statistically we may interpret the full-information approach as one in which the
joint probability distribution of endogenous variables, given the exogenous variables,
forms the basis of inference about causality. The jointness may derive from contem-
poraneous or dynamic interdependence between endogenous variables and/or the dis-
turbances on the equations.
Limited-Information Structural Models
By contrast, when the central object of statistical inference is estimation of one or two
key parameters, a limited-information approach may be used. A feature of this ap-
proach is that, although one equation is the focus of inference, the joint dependence
between it and other endogenous variables is exploited. This requires that explicit as-
sumptions are made about some features of the model that are not the main object of
inference. Instrumental variable methods, sequential multistep methods, and limited
information maximum likelihood methods are specific examples of this approach. To
implement the approach one typically works with one (or more) structural equations
and some implicitly or explicitly stated reduced form equations. This contrasts with the
full-information approach where all equations are structural. The limited-information
approach is often computationally more tractable than the full-information one.
Statistically we may interpret the limited-information approach as one in which the
joint distribution is factored into the product of a conditional model for the endogenous
variable(s) of interest, say y1, and a marginal model for other endogenous variables,
say y2, which are in the set of the conditioning variables, as in
f (y|x, θ) = g(y1|x, y2, θ1)h(y2|x, θ2), θ ∈ Θ. (2.23)
Modeling may be based on the component g(y1|x, y2, θ1) with minimal attention to
h(y2|x, θ2) if θ2 are regarded as nuisance parameters. Of course, such a factorization
is not unique, and hence the limited-information approach can have several variants.
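A minimal numerical sketch of the factorization in (2.23) is given below, under the assumption that the conditional component g is a correctly specified linear regression of y1 on (x, y2) with errors independent of the marginal model for y2; variable names and coefficients are hypothetical.

```python
# Sketch of the factorization f(y1, y2 | x) = g(y1 | x, y2) * h(y2 | x): the
# conditional component g can be fit on its own (a regression of y1 on x and y2),
# leaving the marginal model h for y2 aside as a nuisance component.
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
y2 = 0.7 * x + rng.normal(size=n)                    # marginal model h(y2 | x)
y1 = 1.0 + 0.4 * x + 0.9 * y2 + rng.normal(size=n)   # conditional model g(y1 | x, y2)

# Limited-information step: estimate theta1 from g alone by least squares,
# treating the parameters of h as nuisance parameters.
W = np.column_stack([np.ones(n), x, y2])
theta1_hat = np.linalg.lstsq(W, y1, rcond=None)[0]
print("estimates of (const, x, y2) in g:", np.round(theta1_hat, 3))  # ~ (1.0, 0.4, 0.9)
```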
Identified Reduced Forms
A third variant of the SEM approach works with an identified reduced form. Here too
one is interested in structural parameters. However, it may be convenient to estimate
these from the reduced form subject to restrictions. In time series the identified vector
autoregressions provide an example.
2.8.2. Identification Strategies
There are numerous potential ways in which the identification of key model parameters
can be jeopardized. Omitted variables, functional form misspecifications, measure-
ment errors in explanatory variables, using data unrepresentative of the population, and
ignoring endogeneity of explanatory variables are leading examples. Microeconomet-
rics contains many specific examples of how these challenges can be tackled. Angrist
and Krueger (1999) provide a comprehensive survey of popular identification strate-
gies in labor economics, with emphasis on the POM framework. Most of the issues are
developed elsewhere in the book, but a brief mention is made here.
Exogenization
Data are sometimes generated by natural experiments and quasi-experiments. The idea
here is simply that a policy variable may exogenously change for some subpopulation
while it remains the same for other subpopulations. For example, minimum wage laws
in one state may change while they remain unchanged in a neighboring state. Such
events naturally create treatment and control groups. If the natural experiment ap-
proximates a randomized treatment assignment, then exploiting such data to estimate
structural parameters can be simpler than estimation of a larger simultaneous equa-
tions model with endogenous treatment variables. It is also possible that the treatment
variable in a natural experiment can be regarded as exogenous, but the treatment itself
is not randomly assigned.
Elimination of Nuisance Parameters
Identification may be threatened by the presence of a large number of nuisance param-
eters. For example, in a cross-section regression model the conditional mean function
E[yi |xi ] may involve an individual specific fixed effect αi , assumed to be correlated
with the regression error. This effect cannot be identified without many observations
on each individual (i.e., panel data). However, with just a short panel it could be elim-
inated by a transformation of the model. Another example is the presence of time-invariant unobserved exogenous variables that may be common to groups of individuals.
An example of a transformation that eliminates fixed effects is taking differences and
working with the differences-in-differences form of the model.
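A minimal sketch of the differences-in-differences idea on simulated two-group, two-period data follows; the group effects, trend, and treatment effect are hypothetical values chosen only for illustration.

```python
# Differences-in-differences on simulated two-period, two-group data: group fixed
# effects and a common time trend are removed by differencing, isolating the
# treatment effect received by the treated group in the second period.
import numpy as np

rng = np.random.default_rng(3)
n = 4000                                   # individuals per group
alpha_treat, alpha_ctrl = 2.0, 0.5         # group fixed effects (possibly confounded)
trend, effect = 1.0, 1.5                   # common time trend and true treatment effect

def outcomes(alpha, treated):
    y_before = alpha + rng.normal(size=n)
    y_after = alpha + trend + (effect if treated else 0.0) + rng.normal(size=n)
    return y_before, y_after

yb_t, ya_t = outcomes(alpha_treat, treated=True)
yb_c, ya_c = outcomes(alpha_ctrl, treated=False)

did = (ya_t.mean() - yb_t.mean()) - (ya_c.mean() - yb_c.mean())
print("difference-in-differences estimate:", round(did, 3), "(true effect 1.5)")
```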
Controlling for Confounders
When variables are omitted from a regression, and when omitted factors are correlated
with the included variables, a confounding bias results. For example, in a regression
with earnings as a dependent variable and schooling as an explanatory variable, indi-
vidual ability may be regarded as an omitted variable because only imperfect proxies
for it are typically available. This means that potentially the coefficient of the school-
ing variable may not be identified. One possible strategy is to introduce control vari-
ables in the model; the general approach is called the control function approach.
These variables are an attempt to approximate the influence of the omitted variables.
For example, various types of scholastic achievement scores may serve as controls for
ability.
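The following sketch (not from the text) simulates the earnings–schooling example: ability is omitted, and a noisy test score is used as a proxy control; all variable names and coefficients are hypothetical.

```python
# Omitted-variable bias and a proxy control: schooling is correlated with
# unobserved ability; including a noisy test score as a control variable
# moves the schooling coefficient back toward its true value.
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
ability = rng.normal(size=n)                          # unobserved confounder
school = 12 + 2.0 * ability + rng.normal(size=n)      # schooling depends on ability
earnings = 1.0 + 0.10 * school + 0.50 * ability + rng.normal(size=n)
score = ability + 0.3 * rng.normal(size=n)            # imperfect proxy for ability

X_short = np.column_stack([np.ones(n), school])            # omits ability
X_proxy = np.column_stack([np.ones(n), school, score])     # adds the proxy control
b_short = np.linalg.lstsq(X_short, earnings, rcond=None)[0]
b_proxy = np.linalg.lstsq(X_proxy, earnings, rcond=None)[0]
print("schooling coefficient, no control :", round(b_short[1], 3))
print("schooling coefficient, with proxy :", round(b_proxy[1], 3), "(true value 0.10)")
```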
Creating Synthetic Samples
Within the POM framework a causal parameter may be unidentified because no suit-
able comparison or control group can provide the benchmark for estimation. A poten-
tial solution is to create a synthetic sample that includes a comparison group that are
proxies for controls. Such a sample is created by matching (discussed in Chapter 25).
If treated samples can be augmented by well-matched controls, then identification of
causal parameters can be achieved in the sense that a parameter related to ATE can be
estimated.
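As one hedged illustration of matching, the sketch below pairs each treated unit with the nearest untreated unit on a single covariate and averages the matched differences; the data-generating process and names are hypothetical, and Chapter 25 treats matching estimators properly.

```python
# Nearest-neighbor matching sketch: each treated unit is paired with the untreated
# unit whose covariate value is closest, and the treatment effect on the treated is
# estimated by the mean of the matched outcome differences.
import numpy as np

rng = np.random.default_rng(5)
n = 3000
x = rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-x))                 # treatment more likely for high x
D = rng.uniform(size=n) < p_treat
y0 = 1.0 + 2.0 * x + rng.normal(size=n)
y1 = y0 + 1.0                                  # true effect of treatment is 1.0
y = np.where(D, y1, y0)

x_t, y_t = x[D], y[D]
x_c, y_c = x[~D], y[~D]
# For each treated unit find the closest control on x (matching with replacement).
idx = np.abs(x_t[:, None] - x_c[None, :]).argmin(axis=1)
att_matched = np.mean(y_t - y_c[idx])
att_naive = y_t.mean() - y_c.mean()
print("naive difference in means:", round(att_naive, 3))
print("matched estimate of ATT  :", round(att_matched, 3), "(true effect 1.0)")
```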
Instrumental Variables
If identification is jeopardized because the treatment variable is endogenous, then a
standard solution is to use valid instrumental variables. This is easier said than done.
The choice of the instrumental variable as well as the interpretation of the results
obtained must be done carefully because the results may be sensitive to the choice of
instruments. The approach is analyzed in Sections 4.8, 4.9, 6.4, 6.5, and 25.7, as well
as in several other places in the book as the need arises. Again a natural experiment
may provide a valid instrument.
Reweighting Samples
Sample-based inferences about the population are only valid if the sample data are
representative of the population. The problem of sample selection or biased sampling
arises when the sample data are not representative, in which case the population param-
eters are not identified. This problem can be approached as one that requires correction
for sample selection (Chapter 16) or one that requires reweighting of the sample infor-
mation (Chapter 24).
2.9. Bibliographic Notes
2.1 The 2001 Nobel lectures by Heckman and McFadden are excellent sources for both his-
torical and current information about the developments in microeconometrics. Heckman’s
lecture is remarkable for its comprehensive scope and offers numerous insights into many
aspects of microeconometrics. His discussion of heterogeneity has many points of contact
with several topics covered in this book.
2.2 Marschak (1953) gives a classic statement of the primacy of structural modeling for policy
evaluation. He makes an early mention of the idea of parameter invariance.
2.3 Engle, Hendry, and Richard (1983) provide definitions of weak and strong exogeneity in
terms of the distribution of observable variables. They make links with previous literature
on exogeneity concepts.
2.4 and 2.5 The term “identification” was used by Koopmans (1949). Point identification in
linear parametric models is covered in most textbooks including those by Sargan (1988)
who gives a comprehensive and succinct treatment, Davidson and MacKinnon (2004), and
Greene (2003). Gouriéroux and Monfort (1989, chapter 3.4) provide a different perspective
using Fisher and Kullback information measures. Bounds identification in several leading
cases is developed in Manski (1995).
2.6 Heckman (2000) provides a historical overview and modern interpretations of causality in
the traditional econometric model. Causality concepts within the POM framework are care-
fully and incisively analyzed by Holland (1986), who also relates them to other definitions.
A sample of the statisticians’ viewpoints of causality from a historical perspective can be
found in Freedman (1999). Pearl (2000) gives insightful schematic exposition of the idea
of “treating causation as a summary of behavior under interventions,” as well as numerous
problems of inferring causality in a nonexperimental situation.
2.7 Angrist and Krueger (1999) survey solutions to identification pitfalls with examples from
labor economics.
Chapter 3
Microeconomic Data Structures
3.1. Introduction
This chapter surveys issues concerning the potential usefulness and limitations of dif-
ferent types of microeconomic data. By far the most common data structure used in
microeconometrics is survey or census data. These data are usually called observa-
tional data to distinguish them from experimental data.
This chapter discusses the potential limitations of the aforementioned data struc-
tures. The inherent limitations of observational data may be further compounded by
the manner in which the data are collected, that is, by the sample frame (the way the
sample is generated), sample design (simple random sample versus stratified random
sample), and sample scope (cross-section versus longitudinal data). Hence we also
discuss sampling issues in connection with the use of observational data. Some of this
terminology is new at this stage but will be explained later in this chapter.
Microeconometrics goes beyond the analysis of survey data under the assumptions
of simple random sampling. This chapter considers extensions. Section 3.2 outlines
the structure of multistage sample surveys and some common forms of departure from
random sampling; a more detailed analysis of their statistical implications is provided
in later chapters. It also considers some commonly occurring complications that result
in the data not being necessarily representative of the population. Given the deficien-
cies of observational data in estimating causal parameters, there has been an increased
attempt at exploiting experimental and quasi-experimental data and frameworks. Sec-
tion 3.3 examines the potential of data from social experiments. Section 3.4 considers
the modeling opportunities arising from a special type of observational data, generated
under quasi-experimental conditions, that naturally provide treated and untreated sub-
jects and hence are called natural experiments. Section 3.5 covers practical issues of
microdata management.
3.2. Observational Data
The major source of microeconomic observational data is surveys of households, firms,
and government administrative data. Census data can also be used to generate samples.
Many other samples are often generated at points of contact between transacting par-
ties. For example, marketing data may be generated at the point of sale and/or surveys
among (actual or potential) purchasers. The Internet (e.g., online auctions) is also a
source of data.
There is a huge literature on sample surveys from the viewpoint of both survey
statisticians and users of survey data. The first discusses how to sample from the pop-
ulation and the results from different sampling designs, and the second deals with the
issues of estimation and inference that arise when survey data are collected using dif-
ferent sampling designs. A key issue is how well the sample represents the population.
This chapter deals with both strands of the literature in an introductory fashion. Many
additional details are given in Chapter 24.
3.2.1. Nature of Survey Data
The term observational data usually refers to survey data collected by sampling the
relevant population of subjects without any attempt to control the characteristics of
the sampled data. Let t denote the time subscript, let w denote a set of variables
of interest. In the present context t can be a point in time or time interval. Let
St denote a sample drawn from the population probability distribution F(wt |θt ), where θt is a parameter vector. The population should be thought
of as a set of points with characteristics of interest, and for simplicity we assume
that the form of the probability distribution F is known. A simple random sam-
pling scheme allows every element of the population to have an equal probability of
being included in the sample. More complex sampling schemes will be considered
later.
The abstract concept of a stationary population provides a useful benchmark. If
the moments of the characteristics of the population are constant, then we can write
θt = θ, for all t. This is a strong assumption because it implies that the moments of
the characteristics of the population are time-invariant. For example, the age–sex dis-
tribution should be constant. More realistically, some population characteristics would
not be constant. To handle such a possibility, (the parameters of) each population may
be regarded as a draw from a superpopulation with constant characteristics. Specif-
ically, we think of each θt as a draw from a probability distribution with constant
(hyper)parameter θ. The terms superpopulation and hyperparameters occur frequently
in the literature on hierarchical models discussed in Chapter 24. Additional complica-
tions arise if θt has an evolutionary component, for example through dependence on
t, or if successive values are interdependent. Using hierarchical models, discussed in
Chapters 13 and 26, provides one approach for modeling the relation between hyper-
parameters and subpopulation characteristics.
3.2.2. Simple Random Samples
As a benchmark for subsequent discussion, consider simple random sampling in which
the probability of sampling unit i from a population of size N, with N large, is 1/N for
all i. Partition w as [y : x]. Suppose our interest is in modeling y, a possibly vector-
valued outcome variable, conditional on the exogenous covariate vector x. The joint distribution of (y, x) is denoted fJ (y, x); this can always be factored as the product of the
conditional distribution fC (y|x, θ) and the marginal distribution fM (x):
fJ (y, x) = fC (y|x, θ) fM (x). (3.1)
Simple random sampling involves drawing the (y, x) combinations uniformly from
the entire population.
3.2.3. Multistage Surveys
One alternative is a stratified multistage cluster sampling, also referred to as a com-
plex survey method. Large-scale surveys like the Current Population Survey (CPS)
and the Panel Study of Income Dynamics (PSID) take this approach. Section 24.2
provides additional detail on the structure of the CPS.
The complex survey design has advantages. It is more cost effective because it
reduces geographical dispersion, and it becomes possible to sample certain subpop-
ulations more intensively. For example, “oversampling” of small subpopulations ex-
hibiting some relevant characteristic becomes feasible whereas a random sample of the
population would produce too few observations to support reliable results. A disadvan-
tage is that stratified sampling will reduce interindividual variation, which is essential
for greater precision.
The sample survey literature focuses on multistage surveys that sequentially parti-
tion the population into the following categories:
1. Strata: Nonoverlapping subpopulations that exhaust the population.
2. Primary sampling units (PSUs): Nonoverlapping subsets of the strata.
3. Secondary sampling units (SSUs): Sub-units of the PSU, which may in turn be parti-
tioned, and so on.
4. Ultimate sampling unit (USU): The final unit chosen for interview, which could be a
household or a collection of households (a segment).
As an example, the strata may be the various states or provinces in a country, the
PSU may be regions within the state or province, and the USU may be a small cluster
of households in the same neighborhood.
Usually all strata are surveyed so that, for example, all states will be included in
the sample with certainty. But not all of the PSUs and their subdivisions are surveyed,
and they may be sampled at different rates. In two-stage sampling the surveyed PSUs
are drawn at random and the USU is then drawn at random from the selected PSUs. In
multistage sampling intermediate sampling units such as SSUs also appear.
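A minimal sketch of drawing a two-stage sample of this kind, with all strata surveyed, a random subset of PSUs per stratum, and households drawn within each sampled PSU, is given below; the sizes and sampling rates are hypothetical.

```python
# Two-stage sampling sketch: strata are all included, a random subset of PSUs is
# drawn within each stratum, and households (USUs) are then drawn within each
# sampled PSU.
import numpy as np

rng = np.random.default_rng(6)
n_strata, psu_per_stratum, hh_per_psu = 4, 10, 200
psus_sampled, hh_sampled = 3, 25

sample = []
for stratum in range(n_strata):                           # every stratum is surveyed
    chosen_psus = rng.choice(psu_per_stratum, size=psus_sampled, replace=False)
    for psu in chosen_psus:                               # second stage: households
        households = rng.choice(hh_per_psu, size=hh_sampled, replace=False)
        sample += [(stratum, psu, hh) for hh in households]

# Probability that a given household is sampled (same design in every stratum):
prob = (psus_sampled / psu_per_stratum) * (hh_sampled / hh_per_psu)
print("sampled households:", len(sample), " inclusion probability:", prob)
```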
A consequence of these sampling methods is that different households will have
different probabilities of being sampled. The sample is then unrepresentative of the
population. Many surveys provide sampling weights that are intended to be inversely
proportional to the probability of being sampled, in which case these weights can be
used to obtain unbiased estimators of population characteristics.
Survey data may be clustered due to, for example, sampling of many households
in the same small neighborhood. Observations in the same cluster are likely to be de-
pendent or correlated because they may depend on some observable or unobservable
factor that could affect all observations in a stratum. For example, a suburb may be
dominated by high-income households or by households that are relatively homoge-
neous in some dimension of their preferences. Data from these households will tend
to be correlated, at least unconditionally, though it is possible that such correlation
is negligible after conditioning on observable characteristics of the households. Sta-
tistical inference ignoring correlation between sampled observations yields erroneous
estimates of variances that are smaller than those from the correct formula. These is-
sues are covered in greater depth in Section 24.5. Two-stage and multistage samples
potentially further complicate the computation of standard errors.
In summary, (1) stratification with different sampling rates within strata means that
the sample is unrepresentative of the population; (2) sampling weights inversely pro-
portional to the probability of being sampled can be used to obtain unbiased estimation
of population characteristics; and (3) clustering may lead to correlation of observations
and understatement of the true standard errors of estimators unless appropriate adjust-
ments are made.
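The role of sampling weights in point (2) can be illustrated with a small simulation (not from the text) in which one stratum is oversampled; the strata sizes, means, and sampling rates are hypothetical.

```python
# Inverse-probability weighting: when one stratum is oversampled, the unweighted
# sample mean is biased for the population mean, but weighting each observation
# by the inverse of its sampling probability restores (approximate) unbiasedness.
import numpy as np

rng = np.random.default_rng(7)
N_a, N_b = 90_000, 10_000                   # population sizes of two strata
y_a = rng.normal(1.0, 1.0, size=N_a)
y_b = rng.normal(3.0, 1.0, size=N_b)        # small stratum has a higher mean
pop_mean = np.concatenate([y_a, y_b]).mean()

p_a, p_b = 0.01, 0.10                       # stratum b is oversampled
s_a = y_a[rng.uniform(size=N_a) < p_a]
s_b = y_b[rng.uniform(size=N_b) < p_b]
y_s = np.concatenate([s_a, s_b])
w = np.concatenate([np.full(s_a.size, 1 / p_a), np.full(s_b.size, 1 / p_b)])

print("population mean       :", round(pop_mean, 3))
print("unweighted sample mean:", round(y_s.mean(), 3))
print("weighted sample mean  :", round(np.average(y_s, weights=w), 3))
```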
3.2.4. Biased Samples
If a random sample is drawn then the probability distribution for the data is the same
as the population distribution. Certain departures from random sampling cause a di-
vergence between the two; this is referred to as biased sampling. The data distribution
differs from the population distribution in a manner that depends on the nature of the
deviation from random sampling. Deviation from random sampling occurs because it
is sometimes more convenient or cost effective to obtain the data from a subpopulation
even though it is not representative of the entire population. We now consider several
examples of such departures, beginning with a case in which there is no departure from
randomness.
Exogenous Sampling
Exogenous sampling from survey data occurs if the analyst segments the available
sample into subsamples based only on a set of exogenous variables x, but not on the
response variable. For example, in a study of hospitalizations in Germany, Geil et al.
(1997) segmented the data into two categories, those with and without chronic condi-
tions. Classification by income categories is also common. Perhaps it is more accurate
to depict this type of sampling as exogenous subsampling because it is done by ref-
erence to an existing sample that has already been collected. Segmenting an existing
sample by gender, health, or socioeconomic status is very common. Under the assump-
tions of exogenous sampling the probability distribution of the exogenous variables
is independent of y and contains no information about the population parameters of
interest, θ. Therefore, one may ignore the marginal distribution of the exogenous vari-
ables and simply base estimation on the conditional distribution f (y|x, θ). Of course,
the assumption may be wrong and the observed distribution of the outcome variable
may depend on the selected segmenting variable, which may be correlated with the
outcome, thus causing departure from exogenous sampling.
Response-Based Sampling
Response-based sampling occurs if the probability of an individual being included
in the sample depends on the responses or choices made by that individual. In this
case sample selection proceeds in terms of rules defined in terms of the endogenous
variable under study.
Three examples are as follows: (1) In a study of the effect of negative income tax or
Aid to Families with Dependent Children (AFDC) on labor supply only those below
the poverty line are surveyed. (2) In a study of determinants of public transport modal
choice, only users of public transport (a subpopulation) are surveyed. (3) In a study of
the determinants of number of visits to a recreational site, only those with at least one
visit are included.
Lower survey costs provide an important motivation for using choice-based samples
in preference to simple random samples. It would require a very large random sample
to generate enough observations (information) about a relatively infrequent outcome
or choice, and hence it is cheaper to collect a sample from those who have actually
made the choice.
The practical significance of this is that consistent estimation of population param-
eters θ can no longer be carried out using the conditional population density f (y|x)
alone. The effect of the sampling scheme must also be taken into account. This topic
is discussed further in Section 24.4.
Length-Biased Sampling
Length-biased sampling illustrates how biases may result from sampling one popu-
lation to make inferences about a different population. Strictly speaking, it is not so
much an example of departure from randomness in sampling as one of sampling the
“wrong” population.
Econometric studies of transitions model the time spent in origin state j by indi-
vidual i before transiting to another destination state s. An example is when j cor-
responds to unemployment and s to employment. The data used in such studies can
come from one of several possible sources. One source is sampling individuals who
are unemployed on a particular date, another is to sample those who are in the labor
force regardless of their current state, and a third is to sample individuals who are ei-
ther entering or leaving unemployment during a specified period of time. Each type
of sampling scheme is based on a different concept of the relevant population. In the
first case the relevant population is the stock of unemployed individuals, in the second
the labor force, and in the third individuals with transitioning employment status. This
topic is discussed further in Section 18.6.
Suppose that the purpose of the survey is to calculate a measure of the average
duration of unemployment. This is the average length of time a randomly chosen indi-
vidual will spend in unemployment if he or she becomes unemployed. The answer to
this apparently straightforward question may vary depending on how the sample data
are obtained. The flow distribution of completed durations is in general quite differ-
ent from the stock distribution. When we sample the stock, the probability of being in
the sample is higher for individuals with longer durations. When we sample the flow
out of the state, the probability does not depend on the time spent in the state. This
is the well-known example of length-biased sampling in which the estimate obtained
by sampling the stock is a biased estimate of the average length of an unemployment
spell of a random entrant to unemployment.
The following simple schematic diagram may clarify the point:

Entry flow: ◦ •   →   Stock: • • • ◦   →   Exit flow: ◦ ◦ •
Here we use the symbol • to denote slow movers and the symbol ◦ to denote fast
movers. Suppose the two types are equally represented in the flow, but the slow movers
stay in the stock longer than the fast movers. Then the stock population has a higher
proportion of slow movers. Finally, the exit population has a higher proportion of fast
movers. The argument will generalize to other types of heterogeneity.
The point of this example is not that flow sampling is a better thing to do than stock
sampling. Rather, it is that, depending on what the question is, stock sampling may not
yield a random sample of the relevant population.
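A small simulation (not from the text) of the slow-mover/fast-mover example makes the point numerically; the mixture proportions and mean spell lengths are hypothetical.

```python
# Length-biased sampling sketch: spells are drawn from a mixture of fast and slow
# movers.  Sampling the entry flow reproduces the population mix, whereas sampling
# the stock at a point in time over-represents long (slow-mover) spells.
import numpy as np

rng = np.random.default_rng(8)
n = 200_000
slow = rng.uniform(size=n) < 0.5                       # half the entrants are slow movers
duration = np.where(slow, rng.exponential(10.0, n),    # mean spell 10 for slow movers
                          rng.exponential(2.0, n))     # mean spell 2 for fast movers

# Flow sample: every entrant is equally likely to be drawn.
flow_mean = duration.mean()

# Stock sample: the chance of being in the stock on a random date is proportional
# to spell length, so draw spells with probability proportional to duration.
idx = rng.choice(n, size=50_000, p=duration / duration.sum())
stock_mean = duration[idx].mean()

print("mean completed duration, flow sample:", round(flow_mean, 2))
print("mean duration of spells in the stock:", round(stock_mean, 2))
```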
3.2.5. Bias due to Sample Selection
Consider the following problem. A researcher is interested in measuring the effect of
training, denoted D (treatment), on posttraining wages, denoted y (outcome), given the worker’s characteristics, denoted x. The variable D takes the value 1 if the worker has received training and is 0 otherwise. Observations are available on (x, D) for all work-
ers but on y only for those who received training (D = 1). One would like to make
inferences about the average impact of training on the posttraining wage of a ran-
domly chosen worker with known characteristics who is currently untrained (D = 0).
The problem of sample selection concerns the difficulty of making such an inference.
Manski (1995), who views this as a problem of identification, defines the selection
problem formally as follows:
This is the problem of identifying conditional probability distributions from random
sample data in which the realizations of the conditioning variables are always ob-
served but realizations of the outcomes are censored.
Suppose y is the outcome to be predicted, and the conditioning variables are denoted
by x. The variable D is a censoring indicator that takes the value 1 if the outcome y is
observed and 0 otherwise. Because the variables (D, x) are always observed, but y is
observed only when D = 1, Manski views this as a censored sampling process. The
censored sampling process does not identify Pr[y|x], as can be seen from
Pr[y|x] = Pr[y|x, D = 1] Pr[D = 1|x] + Pr[y|x, D = 0] Pr[D = 0|x]. (3.2)
The sampling process can identify three of the four terms on the right-hand side,
but provides no information about the term Pr[y|x, D = 0]. Because
E[y|x] = E[y|x, D = 1] · Pr[D = 1|x] + E[y|x, D = 0] · Pr[D = 0|x],
whenever the censoring probability Pr[D = 0|x] is positive, the available empirical
evidence places no restrictions on E[y|x]. Consequently, the censored sampling process identifies Pr[y|x] only up to the unknown component Pr[y|x, D = 0]. To learn anything about E[y|x], restrictions will need to be placed on Pr[y|x].
The alternative approaches for solving this problem are discussed in Section 16.5.
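As one hedged illustration of placing restrictions, if the outcome is known to lie in a bounded interval the unknown term in (3.2) can be replaced by its logical extremes, giving worst-case bounds in the spirit of Manski (1995); the sketch below uses simulated data and hypothetical names.

```python
# Worst-case bounds on E[y] when y is only observed for D = 1 and y is known to
# lie in [0, 1]: replace the unobserved component E[y | D = 0] by its logical
# extremes (0 and 1) in the law of total expectation.
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
y = rng.uniform(size=n)                       # outcome, bounded in [0, 1]
D = rng.uniform(size=n) < (0.3 + 0.5 * y)     # observation is more likely for high y

p1 = D.mean()                                 # Pr[D = 1]
m1 = y[D].mean()                              # E[y | D = 1], estimable from the data

lower = m1 * p1 + 0.0 * (1 - p1)              # assume unobserved outcomes are all 0
upper = m1 * p1 + 1.0 * (1 - p1)              # assume unobserved outcomes are all 1
print("identified bounds on E[y]: [%.3f, %.3f]" % (lower, upper))
print("true E[y] (known here only because the data are simulated):", round(y.mean(), 3))
```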
3.2.6. Quality of Survey Data
The quality of sample data depends not only on the sample design and the survey
instrument but also on the survey responses. This observation applies especially to
observational data. We consider several ways in which the quality of the sample data
may be compromised. Some of the problems (e.g., attrition) can also occur with other
types of data. This topic overlaps with that of biased sampling.
Problem of Survey Nonresponse
Surveys are normally voluntary, and the incentive to participate may vary systematically
according to household characteristics and type of question asked. Individuals may
refuse to answer some questions. If there is a systematic relationship between refusal
to answer a question and the characteristics of the individual, then the issue of the
representativeness of a survey after allowing for nonresponse arises. If nonresponse
is ignored, and if the analysis is carried out using the data from respondents only, how
will the estimation of parameters of interest be affected?
Survey nonresponse is a special case of the selection problem mentioned in the
preceding section. Both involve biased samples. To illustrate how it leads to distorted
inference consider the following model:

(y1, y2)′ | x, z ∼ N[ (x′β, z′γ)′, Σ ],  Σ = [σ1² σ12; σ12 σ2²], (3.3)
where y1 is a continuous random variable of interest (e.g., expenditure) that depends
on x, and y2 is a latent variable that measures the “propensity to participate” in a survey
and depends on z. The individual participates if y2 > 0; otherwise the individual does
not. The variables x and z are assumed to be exogenous. The formulation allows y1
and y2 to be correlated.
Suppose we estimate β from the data supplied by participants by least squares.
Is this estimator unbiased in the presence of nonparticipation? The answer is that if
nonparticipation is random and independent of y1, the variable of interest, then there
is no bias, but otherwise there will be.
The argument is as follows:

β̂ = (X′X)⁻¹X′y1,
E[β̂ − β] = E[(X′X)⁻¹X′E[y1 − Xβ | X, Z, y2 > 0]],
where the first line gives the least-squares formula for the estimates of β and the second
line gives its bias. If y1 and y2 are independent conditional on X and Z, so that σ12 = 0, then

E[y1 − Xβ | X, Z, y2 > 0] = E[y1 − Xβ | X, Z] = 0,

and there is no bias.
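A simulation sketch of model (3.3) makes this argument concrete; note that, purely for illustration, the participation index below is allowed to depend on x as well as z so that the bias shows up in the slope coefficient, and all names and parameter values are hypothetical.

```python
# Nonresponse bias sketch in the spirit of model (3.3): y1 is observed only when a
# latent participation index y2 is positive.  With correlated disturbances
# (sigma_12 != 0) least squares on the participants is biased; with sigma_12 = 0 it is not.
import numpy as np

def ols_slope_on_participants(sigma_12, n=200_000, seed=10):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = rng.normal(size=n)
    cov = [[1.0, sigma_12], [sigma_12, 1.0]]
    u = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y1 = 1.0 + 2.0 * x + u[:, 0]              # outcome equation, true slope 2.0
    y2 = 0.8 * x + 0.5 * z + u[:, 1]          # latent participation propensity
    keep = y2 > 0                             # respond only if y2 > 0
    X = np.column_stack([np.ones(keep.sum()), x[keep]])
    return np.linalg.lstsq(X, y1[keep], rcond=None)[0][1]

print("slope with sigma_12 = 0.8:", round(ols_slope_on_participants(0.8), 3))
print("slope with sigma_12 = 0.0:", round(ols_slope_on_participants(0.0), 3), "(true 2.0)")
```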
Missing and Mismeasured Data
Survey respondents dealing with an extensive questionnaire will not necessarily an-
swer every question and even if they do, the answers may be deliberately or fortu-
itously false. Suppose that the sample survey attempts to obtain a vector of responses
denoted as xi = (xi1, . . . , xiK ) from N individuals, i = 1, . . . , N. Suppose now that
if an individual fails to provide information on any one or more elements of xi , then
the entire vector is discarded. The first problem resulting from missing data is that the
sample size is reduced. The second, potentially more serious, problem is that missing data can lead to biases similar to selection bias. If the data are missing
in a systematic manner, then the sample that is left to analyze may not be represen-
tative of the population. A form of selection bias may be induced by any systematic
pattern of nonresponse. For example, high-income respondents may systematically not
respond to questions about income. Conversely, if the data are missing completely at
random then discarding incomplete observations will reduce precision but not gen-
erate biases. Chapter 27 discusses the missing-data problem and solutions in greater
depth.
Measurement errors in survey responses are a pervasive problem. They can arise
from a variety of causes, including incorrect responses arising from carelessness, de-
liberate misreporting, faulty recall of past events, incorrect interpretation of questions,
and data-processing errors. A deeper source of measurement error is due to the mea-
sured variable being at best an imperfect proxy for the relevant theoretical concept.
The consequences of such measurement errors are a major topic and are discussed in Chapter 26.
Sample Attrition
In panel data situations the survey involves repeated observations on a set of individu-
als. In this case we can have
• full response in all periods (full participation),
• nonresponse in the first period and in all subsequent periods (nonparticipation), or
• partial response in the sense of response in the initial periods but nonresponse in later periods (incomplete participation) – a situation referred to as sample attrition.
Sample attrition leads to missing data, and the presence of any nonrandom pattern
of “missingness” will lead to the sample selection type problems already mentioned.
This can be interpreted as a special case of the sample selection problem. Sample
attrition is discussed briefly in Sections 21.8.5 and 23.5.2.
3.2.7. Types of Observational Data
Cross-section data are obtained by observing w, for the sample St for some t. Al-
though it is usually impractical to sample all households at the same point of time,
cross-section data are still a snapshot of characteristics of each element of a subset of
the population that will be used to make inferences about the population. If the pop-
ulation is stationary, then inferences made about θt using St may also be valid for t′ ≠ t. If there is significant dependence between past and current behavior, then lon-
gitudinal data are required to identify the relationship of interest. For example, past
decisions may affect current outcomes; inertia or habit persistence may account for
current purchases, but such dependence cannot be modeled if the history of purchases
is not available. This is one of the limitations imposed by cross-section data.
Repeated cross-section data are obtained by a sequence of independent samples
St taken from F(wt |θt ), t = 1, . . . , T. Because the sample design does not attempt to
retain the same units in the sample, information about dynamic dependence in behavior
is lost. If the population is stationary then repeated cross-section data are obtained by
a sampling process somewhat akin to sampling with replacement from the constant
population. If the population is nonstationary, repeated cross sections are related in a
manner that depends on how the population is changing over time. In such a case the
objective is to make inferences about the underlying constant (hyper)parameters. The
analysis of repeated cross sections is discussed in Section 22.7.
Panel or longitudinal data are obtained by initially selecting a sample S and
then collecting observations for a sequence of time periods, t = 1, . . . , T. This can
be achieved by interviewing subjects and collecting both present and past data at the
same time, or by tracking the subjects once they have been inducted into the survey.
This produces a sequence of data vectors {w1, . . . , wT } that are used to make infer-
ences about either the behavior of the population or that of the particular sample of
individuals. The appropriate methodology in each case may not be the same. If the
data are drawn from a nonstationary population, the appropriate objective should be
inference on (hyper)parameters of the superpopulation.
Some limitations of these types of data are immediately obvious. Cross-section
samples and repeated cross-sections do not in general provide suitable data for mod-
eling intertemporal dependence in outcomes. Such data are only suitable for modeling
static relationships. In contrast, longitudinal data, especially if they span a sufficiently
long time period, are suitable for modeling both static and dynamic relationships.
Longitudinal data are not free from problems. The first issue is representativeness of
the panel. Problems of inference regarding population behavior using longitudinal data
become more difficult if the population is not stationary. For analyzing dynamics of be-
havior, retaining original households in the panel for as long as possible is an attractive
option. In practice, longitudinal data sets suffer from the problem of “sample attrition,”
perhaps due to “sample fatigue.” This simply means that survey respondents do not
continue to provide responses to questionnaires. This creates two problems: (1) The
panel becomes unbalanced and (2) there is the danger that the retained household may
not be “typical” and that the sample becomes unrepresentative of the population. When
the available sample data are not a random draw from the population, results based on
different types of data will be susceptible to biases to different degrees. The problem
of “sample fatigue” arises because over time it becomes more difficult to retain in-
dividuals within the panel or they may be “lost” (censored) for some other reason,
such as a change of location. These issues are dealt with later in the book. Analysis
of longitudinal data may nevertheless provide information about some aspects of the
behavior of the sampled units, although extrapolation to population behavior may not
be straightforward.
3.3. Data from Social Experiments
Observational and experimental data are distinct because an experimental environment
can in principle be closely monitored and controlled. This makes it possible to vary
a causal variable of interest, holding other covariates at controlled settings. In con-
trast, observational data are generated in an uncontrolled environment, leaving open
the possibility that the presence of confounding factors will make it more difficult to
identify the causal relationship of interest. For example, when one attempts to study
the earnings–schooling relationship using observational data, one must accept that the
years of schooling of an individual are themselves an outcome of the individual’s decision-
making process, and hence one cannot regard the level of schooling as if it had been
set by a hypothetical experimenter.
In social sciences, data analogous to experimental data come from either social
experiments, defined and described in greater detail in the following, or from “labo-
ratory” experiments on small groups of voluntary participants that mimic the behavior
of economic agents in the real-life counterpart of the experiment. Social experiments
are relatively uncommon, and yet experimental concepts, methods, and data serve as a
benchmark for evaluating econometric studies based on observational data.
This section provides a brief account of the methodology of social experiments, the
nature of the data emanating from them, and some problems and issues of econometric
methodology that they generate.
The central feature of the experimental methodology involves a comparison be-
tween the outcomes of the randomly selected experimental group that is subjected to a
“treatment” with those of a control (comparison) group. In a good experiment consid-
erable care is exercised in matching the control and experimental (“treated”) groups,
and in avoiding potential biases in outcomes. Such conditions may not be realized
in observational environments, thereby leading to a possible lack of identification of
causal parameters of interest. Sometimes, however, experimental conditions may be
approximately replicated in observational data. Consider, for example, two contigu-
ous regions or states, one of which pursues a different minimum-wage policy from the
other, creating the conditions of a natural experiment in which observations from the
“treated” state can be compared with those from the “control” state. The data structure
of a natural experiment has also attracted attention in econometrics.
A social experiment involves exogenous variations in the economic environment
facing the set of experimental subjects, which is partitioned into one subset that re-
ceives the experimental treatment and another that serves as a control group. In con-
trast to observational studies in which changes in exogenous and endogenous factors
are often confounded, a well-designed social experiment aims to isolate the role of
treatment variables. In some experimental designs there may be no explicit control
group, but varying levels of the treatment are applied, in which case it becomes pos-
sible in principle to estimate the entire response surface of experimental outcomes.
The primary object of a social experiment is to estimate the impact of an actual
or potential social program. The potential outcome model of Section 2.7 provides a
relevant background for modeling the impact of social experiments. Several alternative
measures of impact have been proposed and these will be discussed in the chapter on
program evaluation (Chapter 25).
Burtless (1995) summarizes the case for social experiments, while noting some
potential limitations. In a companion article Heckman and Smith (1995) focus on
limitations of actual social experiments that have been implemented. The remaining
discussion in this section borrows significantly from these papers.
3.3.1. Leading Features of Social Experiments
Social experiments are motivated by policy issues about how subjects would react to a
type of policy that has never been tried and hence one for which no observed response
data exist. The idea of a social experiment is to enlist a group of willing participants,
some of whom are randomly assigned to a treatment group and the rest to a control
group. The difference between the responses of those in the treatment group, subjected
to the policy change, and those in the control group, who are not, is the estimated
effect of the policy. Schematically the standard experimental design is as depicted in
Figure 3.1.
The term “experimentals” refers to the group receiving treatments, “controls” to the
group not receiving treatment, and “random assignment” to the process of assigning
individuals to the two groups.
[Figure 3.1: Social experiment with random assignment. An eligible subject is invited to participate; a subject who agrees is randomized and assigned either to the treatment group or to the control group, while a subject who declines is dropped from the study.]

Randomized trials were introduced in statistics by R. A. Fisher (1928) and his co-workers. A typical agricultural experiment would consist of a trial in which a new
treatment such as fertilizer application would be applied to plants growing on ran-
domly chosen blocks of land and then the responses would be compared with those
of a control group of plants, similar to the experimentals in all relevant respects but
not given experimental treatment. If the effect of all other differences between the ex-
perimental and control groups can be eliminated, the estimated difference between the
two sets of responses can be attributed to the treatment. In the simplest situation one
can concentrate on a comparison of the mean outcome of the treated group and of the
untreated group.
Although in agricultural and biomedical sciences, the randomized experiments
methodology has been long established, in economics and social sciences it is new.
It is attractive for studying responses to policy changes for which no observational
data exist, perhaps because the policy changes of interest have never occurred. Ran-
domized experiments also permit a greater variation in policy variables and parameters
than are present in observational data, thereby making it easier to identify and study
responses to policy changes. In many cases the social experiment may try out a pol-
icy that has never been tried, so the observational data remain completely silent on its
potential impact.
Social experiments are still rather rare outside the United States, partly because
they are expensive to run. In the United States a number of such experiments have
taken place since the early 1970s. Table 3.1 summarizes features of some relatively
well-known examples; for a more extensive coverage see Burtless (1995).
An experiment may produce either cross-section or longitudinal data, although cost
considerations will usually limit the time dimension well below what is typical in ob-
servational data. When an experiment lasts several years and has multiple stages and/or
geographical locations, as in the case of RHIE, interim analyses based on “incomplete”
data are not uncommon (Newhouse et al., 1993).
3.3.2. Advantages of Social Experiments
Burtless (1995) surveys the advantages of social experiments with great clarity.
The key advantage stems from randomized trials that remove any correlation be-
tween the observed and unobserved characteristics of program participants. Hence the
contribution of the treatment to the outcome difference between the treated and control groups can be estimated without confounding bias even if one cannot control for the confounding variables. The presence of correlation between treatment and confounding variables often plagues observational studies and complicates causal inference. By contrast, an experimental study conducted under ideal circumstances can produce a consistent estimate of the average difference in outcomes of the treated and nontreated groups without much computational complexity.

Table 3.1. Features of Some Selected Social Experiments

Experiment: Rand Health Insurance Experiment (RHIE), 1974–1982
Tested Treatments: Health insurance plans with varying copayment rate and differing levels of maximum out-of-pocket expenses
Target Population: Low- and moderate-level income persons and families

Experiment: Negative Income Tax (NIT), 1968–1978
Tested Treatments: NIT plans with alternative income guarantees and tax rates
Target Population: Low- and moderate-level income persons and families with nonaged head of household

Experiment: Job Training Partnership Act (JTPA), 1986–1994
Tested Treatments: Job search assistance, on-the-job training, classroom training financed under JTPA
Target Population: Out-of-school youths and disadvantaged adults
If, however, an outcome depends on treatment as well as other observable fac-
tors, then controlling for the latter will in general improve the precision of the impact
estimate.
Even if observational data are available, the generation and use of experimental data have great appeal because they offer the possibility of exogenizing a policy variable, and
randomization of treatments can potentially lead to great simplification of statistical
analysis. Conclusions based on observational data often lack generality because they
are based on a nonrandom sample from the population – the problem of selection bias.
An example is the aforementioned RHIE study whose major focus is on the price re-
sponsiveness of the demand for health services. Availability of health insurance affects
the user price of health services and thereby its use. An important policy issue is the ex-
tent to which “overutilization” of health services would result from subsidized health
insurance. One can, of course, use observational data to model the relation between
the demand for health services and the level of insurance. However, such analyses are
subject to the criticism that the level of health insurance should not be treated as ex-
ogenous. Theoretical analyses show that the demand for health insurance and health
care are jointly determined, so causation is not unidirectional. This fact can potentially
make it difficult to identify the role of health insurance. Treating health insurance as
exogenous biases the estimate of price responsiveness. However, in an experimental
setup the participating households could be assigned an insurance policy, making it an
exogenous variable. The role of insurance is then identifiable. Once the key variable
of interest is exogenized, the direction of causation becomes clear and the impact of
the treatment can be studied unambiguously. Furthermore, if the experiment is free
from some of the problems that we mention in the following, this greatly simplifies
statistical analysis relative to what is often necessary in survey data.
3.3.3. Limitations of Social Experiments
The application of a nonhuman methodology, that is, one initially developed for and applied to nonhuman subjects, to human subjects has generated a lively debate in the
literature. See especially Heckman and Smith (1995), who argue that many social ex-
periments may suffer from limitations that apply to observational studies. These is-
sues concern general points such as the merits of experimental versus observational
methodology, as well as specific issues concerning the biases and problems inherent
in the use of human subjects. Several of the issues are covered in more detail in later
chapters but a brief overview follows.
Social experiments are very costly to run. Sometimes, perhaps often, they do not
correspond to “clean” randomized trials. Hence the results from such experiments are
not always unambiguous and easily interpretable, or free from biases. If the treatment
variable has many alternative settings of interest, or if extrapolation is an important
objective, then a very large sample must be collected to ensure sufficient data variation
and to precisely gauge the effect of treatment variation. In that case the cost of the
experiment will also increase. If the cost factor prevents a large enough experiment, its
utility relative to observational studies may be questionable; see the papers by Rosen
and Stafford in Hausman and Wise (1985).
Unfortunately the design of some social experiments is flawed. Hausman and Wise
(1985) argue that the data from the New Jersey negative income tax experiment was
subject to endogenous stratification, which they describe as follows:
. . . [T]he reason for an experiment is, by randomization, to eliminate correlation
between the treatment variable and other determinants of the response variable that
is under study. In each of the income-maintenance experiments, however, the exper-
imental sample was selected in part on the basis of the dependent variable, and the
assignment to treatment versus control group was based in part on the dependent
variable as well. In general, the group eligible for selection – based on family status,
race, age of family head, etc. – was stratified on the basis of income (and other vari-
ables) and persons were selected from within the strata. (Hausman and Wise, 1985,
pp. 190–191)
The authors conclude that, in the presence of endogenous stratification, unbiased es-
timation of treatment effects is not straightforward. Unfortunately, a fully randomized
trial in which treatment assignment within a randomly selected experimental group
from the population is independent of income would be much more costly and may
not be feasible.
There are several other issues that detract from the ideal simplicity of a random-
ized experiment. First, if experimental sites are selected randomly, cooperation of
administrators and potential participants at that site would be required. If this is not
forthcoming, then alternative treatment sites where such cooperation is obtainable
will be substituted, thereby compromising the random assignment principle; see Hotz
(1992).
A second problem is that of sample selection, which is relevant because participa-
tion is voluntary. For ethical reasons there are many experiments that simply cannot
be done (e.g., random assignment of students to years of education). Unlike medical
experiments that can achieve the gold standard of a double-blind protocol, in social
experiments experimenters and subjects know whether they are in treatment or con-
trol groups. Furthermore, those in control groups may obtain treatment (e.g., training)
from alternative sources. If the decision to participate is uncorrelated with either x or
ε, the analysis of the experimental data is simplified.
A third problem is sample attrition caused by subjects dropping out of the experi-
ment after it has started. Even if the initial sample was random the effect of nonran-
dom attrition may well lead to a problem similar to the attrition bias in panels. Finally,
there is the problem of the Hawthorne effect. The term originates in social psychology
research conducted jointly by the Harvard Graduate School of Business Administra-
tion and the management of the Western Electric Company at the latter’s Hawthorne
works in Chicago from 1926 to 1932. Human subjects, unlike inanimate objects, may
change or adapt their behavior while participating in the experiment. In this case the
variation in the response observed under experimental conditions cannot be attributed
solely to treatment.
Heckman and Smith (1995) mention several other difficulties in implementing a
randomized treatment. Because the administration of a social experiment involves a
bureaucracy, there is a potential for biases. Randomization bias occurs if the assignment process introduces a systematic difference between participants in the experiment and participants in the program during its normal operation. Heckman and Smith document the possibilities
of such bias in actual experiments. Another type of bias, called substitution bias, is
introduced when the controls may be receiving some form of treatment that substitutes
for the experimental treatment. Finally, analysis of social experiments is inevitably of
a partial equilibrium nature. One cannot reliably extrapolate the treatment effects to
the entire population because the ceteris paribus assumption will not hold when the
entire population is involved.
Specifically, the key issue is whether one can extrapolate the results from the exper-
iment to the population at large. If the experiment is conducted as a pilot program on a
small scale, but the intention is to predict the impact of policies that are more broadly
applied, then the obvious limitation is that the pilot program cannot incorporate the
broader impact of the treatment. A broadly applied treatment may change the eco-
nomic environment sufficiently to invalidate the predictions from a partial equilibrium
setup. So the treatment will not be like the actual policy that it mimics.
In summary, social experiments, in principle, could yield data that are easier to an-
alyze and to understand in terms of cause and effect than observational data. Whether
this promise is realized depends on the experimental design. A poor experimen-
tal design generates its own statistical complications, which affect the precision of
the conclusions. Social experiments differ fundamentally from those in biology and
agriculture because human subjects and treatment administrators tend to be both
active and forward-looking individuals with personal preferences, rather than
passive administrators of a standard protocol or willing recipients of randomly assigned treatment.

Table 3.2. Features of Some Selected Natural Experiments

Outcomes for identical twins with different schooling levels. Treatments studied: differences in returns to schooling through correlation between schooling and wages. Reference: Ashenfelter and Krueger (1994).

Transition to National Health Insurance (NHI) in Canada as Saskatchewan moves to NHI and other provinces follow several years later. Treatments studied: labor market effects of NHI based on comparison of provinces with and without NHI. Reference: Gruber and Hanratty (1995).

New Jersey increases minimum wage while neighboring Pennsylvania does not. Treatments studied: minimum wage effects on employment. Reference: Card and Krueger (1994).
3.4. Data from Natural Experiments
Sometimes, however, a researcher may have available data from a “natural experi-
ment.” A natural experiment occurs when a subset of the population is subjected to
an exogenous variation in a variable, perhaps as a result of a policy shift, that would
ordinarily be subject to endogenous variation. Ideally, the source of the variation is
well understood.
In microeconometrics there are broadly two ways in which the idea of a natural
experiment is exploited. For concreteness consider the simple regression model
y = β1 + β2x + u, (3.4)
where x is an endogenous treatment variable correlated with u.
Suppose that there is an exogenous intervention that changes x. Examples of such
external intervention are administrative rules, unanticipated legislation, natural events
such as twin births, weather-related shocks, and geographical variation; see Table 3.2
for examples. Exogenous intervention creates an opportunity for evaluating its im-
pact by comparing the behavior of the impacted group both pre- and postintervention,
or with that of a nonimpacted group postintervention. That is, “natural” comparison
groups are generated by the event, which facilitates estimation of β2. Estimation is
simplified because x can be treated as exogenous.
The second way in which a natural experiment can assist inference is by generating
natural instrumental variables. Suppose z is a variable that is correlated with x, or
perhaps causally related to x, and uncorrelated with u. Then an instrumental variable
estimator of β2, expressed in terms of sample covariances, is

β2 =
Cov[z, y]
Cov[z, x]
(3.5)
54
3.4. DATA FROM NATURAL EXPERIMENTS
(see Section 4.8.5). In an observational data setup an instrumental variable with the
right properties may be difficult to find, but it could arise naturally in a favorable
natural experiment. Then estimation would be simplified. We consider the first case
in the next section; the topic of naturally generated instruments will be covered in
Chapter 25.
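As an illustration of (3.5), the sketch below uses simulated data (all values are invented for the illustration; numpy only) to compute the instrumental variables estimate of β2 from sample covariances and to compare it with the OLS slope, which is inconsistent here because x is correlated with u.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                          # instrument: drives x, independent of u
u = rng.normal(size=n)                          # structural error
x = 0.8 * z + 0.5 * u + rng.normal(size=n)      # endogenous regressor, Cov[x, u] = 0.5
y = 1.0 + 2.0 * x + u                           # true beta2 = 2

beta2_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]    # sample-covariance form of (3.5)
beta2_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)    # OLS slope, inconsistent since Cov[x, u] != 0
print(beta2_iv, beta2_ols)                            # close to 2.0 versus roughly 2.26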
3.4.1. Natural Exogenous Interventions
Such data are less expensive to collect and they also allow the researcher to evaluate the
role of some specific factor in isolation, as in a controlled experiment, because “nature”
holds constant variations attributed to other factors that are not of direct interest. Such
natural experiments are attractive because they generate treatment and control groups
inexpensively and in a real-world setting. Whether a natural experiment can support
convincing inference depends, in part, on whether the supposed natural intervention
is genuinely exogenous, whether its impact is sufficiently large to be measurable, and
whether there are good treatment and control groups. Just because a change is legis-
lated, for example, does not mean that it is an exogenous intervention. However, in
appropriate cases, opportunistic exploitation of such data sets can yield valuable em-
pirical insights.
Investigations based on natural experiments have several potential limitations
whose importance in any given study can only be assessed through a careful con-
sideration of the relevant theory, facts, and institutional setting. Following Campbell
(1969) and Meyer (1995), these are grouped into limitations that affect a study’s inter-
nal validity (i.e., the inferences about policy impact drawn from the study) and those
that affect a study’s external validity (i.e., the generalization of the conclusions to other
members of the population).
Consider an investigation of a policy change in which conclusions are drawn from
a comparison of pre- and postintervention data, using the regression method briefly
described in the following and in greater detail in Chapter 25. In any study there will
be omitted variables that may have also changed in the time interval between policy
change and its impact. The characteristics of sampled individuals such as age, health
status, and their actual or anticipated economic environment may also change. These
omitted factors will directly affect the measured impact of the policy change. Whether
the results can be generalized to other members of the population will depend on the
absence of bias due to nonrandom sampling, existence of significant interaction effects
between the policy change and its setting, and an absence of the role of historical
factors that would cause the impact to vary from one situation to another. Of course,
these considerations are not unique to data from natural experiments; rather, the point
is that the latter are not necessarily free from these problems.
3.4.2. Differences in Differences
One simple regression method is based on a comparison of outcomes in one group
before and after a policy intervention. For example, consider
yit = α + βDt + εit , i = 1, . . . , N, t = 0, 1,
where Dt = 1 in period 1 (postintervention), Dt = 0 in period 0 (preintervention), and
yit measures the outcome. The regression estimated from the pooled data will yield an
estimate of policy impact parameter β. This is easily shown to be equal to the average
difference in the pre- and postintervention outcome,

β̂ = N⁻¹ Σi (yi1 − yi0) = ȳ1 − ȳ0.
The one-group before and after design makes the strong assumption that the group
remains comparable over time. This is required for identifiability of β. If, for exam-
ple, we allowed α to vary between the two periods, β would no longer be identified.
Changes in α are confounded with the policy impact.
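A minimal numerical check of this equivalence, with simulated data and invented parameter values: the coefficient on the period dummy Dt in the pooled regression equals the difference in period means.

import numpy as np

rng = np.random.default_rng(1)
N = 500
y0 = 1.0 + rng.normal(size=N)                   # preintervention outcomes (t = 0)
y1 = 1.0 + 0.7 + rng.normal(size=N)             # postintervention outcomes, true beta = 0.7

y = np.concatenate([y0, y1])                    # pooled data
D = np.concatenate([np.zeros(N), np.ones(N)])   # period dummy
X = np.column_stack([np.ones(2 * N), D])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat, y1.mean() - y0.mean())          # identical up to floating-point error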
One way to improve on the previous design is to include an additional untreated
comparison group, that is, one not impacted by policy, and for which the data are avail-
able in both periods. Using Meyer’s (1995) notation, the relevant regression now is
yʲit = α + α1 Dt + α¹ Dʲ + β Dʲt + εʲit, i = 1, . . . , N, t = 0, 1,

where j is the group superscript, Dʲ = 1 if j equals 1 and Dʲ = 0 otherwise, Dʲt = 1 if both j and t equal 1 and Dʲt = 0 otherwise, and ε is a zero-mean constant-variance error term. The equation does not include covariates but they can be added, and those
that do not vary are already subsumed under α. This relation implies that, for the
treated group, we have preintervention
y¹i0 = α + α¹D¹ + ε¹i0

and postintervention

y¹i1 = α + α1 + α¹D¹ + β + ε¹i1.

The impact is therefore

y¹i1 − y¹i0 = α1 + β + ε¹i1 − ε¹i0. (3.6)
The corresponding equations for the untreated group are
y⁰i0 = α + ε⁰i0,
y⁰i1 = α + α1 + ε⁰i1,

and hence the difference is

y⁰i1 − y⁰i0 = α1 + ε⁰i1 − ε⁰i0. (3.7)
Both the first-difference equations include the period-1 specific effect α1, which can
be eliminated by taking the difference between Equations (3.6) and (3.7):

(y¹i1 − y¹i0) − (y⁰i1 − y⁰i0) = β + (ε¹i1 − ε¹i0) − (ε⁰i1 − ε⁰i0). (3.8)

Assuming that E[(ε¹i1 − ε¹i0) − (ε⁰i1 − ε⁰i0)] equals zero, we can obtain an unbiased estimate of β by the sample average of (y¹i1 − y¹i0) − (y⁰i1 − y⁰i0). This method uses
differences in differences. If time-varying covariates are present, they can be
included in the relevant equations and their differences will appear in the regression
equation (3.8).
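To make the algebra concrete, the following sketch simulates the two-group, two-period design (all parameter values are made up) and computes the differences-in-differences estimate both as the double difference of sample means in (3.8) and as the coefficient on the period-by-group interaction dummy in a pooled regression.

import numpy as np

rng = np.random.default_rng(2)
N = 1_000
alpha, alpha1, alpha_g, beta = 1.0, 0.5, 0.3, 0.7     # beta is the treatment effect

# Outcomes for the treated group (j = 1) and control group (j = 0) in periods t = 0, 1.
y1_0 = alpha + alpha_g + rng.normal(size=N)
y1_1 = alpha + alpha1 + alpha_g + beta + rng.normal(size=N)
y0_0 = alpha + rng.normal(size=N)
y0_1 = alpha + alpha1 + rng.normal(size=N)

# Double difference of sample means, as in (3.8).
did = (y1_1.mean() - y1_0.mean()) - (y0_1.mean() - y0_0.mean())

# Equivalent pooled regression: constant, period dummy, group dummy, interaction.
y = np.concatenate([y1_0, y1_1, y0_0, y0_1])
t = np.concatenate([np.zeros(N), np.ones(N), np.zeros(N), np.ones(N)])
g = np.concatenate([np.ones(2 * N), np.zeros(2 * N)])
X = np.column_stack([np.ones(4 * N), t, g, t * g])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print(did, coef[3])                                   # both estimate beta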
For simplicity our analysis ignored the possibility that there remain observable dif-
ferences in the distribution of characteristics between the treatment and control groups.
If so, then such differences must be controlled for. The standard solution is to include
such controlling variables in the regression.
An example of a study based on a natural experiment is that of Ashenfelter and
Krueger (1994). They estimate the returns to schooling by contrasting the wage rates
of identical twins with different schooling levels. In this case running a regular exper-
iment in which individuals are exogenously assigned different levels of schooling is
simply not feasible. Nonetheless, some experimental-type controls are needed. As the
authors explain:
Our goal is to ensure that the correlation we observe between schooling and wage
rates is not due to a correlation between schooling and a worker’s ability or other
characteristics. We do this by taking advantage of the fact that monozygotic twins
are genetically identical and have similar family backgrounds.
Data on twins have served as a basis for a number of other econometric studies
(Rosenzweig and Wolpin, 1980; Bronars and Grogger, 1994). Since the twinning prob-
ability in the population is not high, an important issue is generating a sufficiently
large representative sample, allowing for some nonresponse. One source of such data
is the census. Another source is the “twins festivals” that are held in the United States.
Ashenfelter and Krueger (1994, p. 1158) report that their data were obtained from in-
terviews conducted at the 16th Annual Twins Day Festival, Twinsburg, Ohio, August
1991, which is the largest gathering of twins, triplets, and quadruplets in the world.
The attraction of using the twins data is that the presence of common effects from
both observable and unobservable factors can be eliminated by modeling the differ-
ences between the outcomes of the twins. For example, Ashenfelter and Krueger esti-
mate a regression model of the difference in the log of wage rates between the first and
the second twin. The first differencing operation eliminates the effects of age, gender,
ethnicity, and so forth. The remaining explanatory variables are the difference in schooling levels, which is the variable of main interest, and variables such as differences in years of tenure and marital status.
3.4.3. Identification through Natural Experiments
The natural experiments school has had a useful impact on econometric practice. By
encouraging the opportunistic exploitation of quasi-experimental data, and by using
modeling frameworks such as the POM of Chapter 2, econometric practice bridges the
gap between observational and experimental data. The notions of parameter identifica-
tion rooted in the SEM framework are broadened to include identification of measures
that are interesting from a policy viewpoint. The main advantage of using data from a
natural experiment is that a policy variable of interest might be validly treated as ex-
ogenous. However, in using data from natural experiments, as in the case of social
experiments, the choice of control groups plays a critical role in determining the
reliability of the conclusions. Several potential problems that affect a social experi-
ment, such as selectivity and attrition bias, will also remain potential problems in the
case of natural experiments. Only a subset of interesting policy problems may lend
themselves to analysis within the natural experiment framework. The experiment may
apply only to a small part of the population, and the conditions under which it occurs
may not replicate themselves easily. An example given in Section 22.6 illustrates this
point in the context of difference in differences.
3.5. Practical Considerations
Although there has been an explosion in the number and type of microdata sets that
are available, certain well-established databases have supported numerous studies. We
provide a very partial list of some very well-known U.S. micro databases. For fur-
ther details, see the respective Web sites for these data sets or the data clearinghouses
mentioned in the following. Many of these allow you to download the data directly.
3.5.1. Some Sources of Microdata
Panel Study of Income Dynamics (PSID): Based at the Survey Research Center at
the University of Michigan, PSID is a national survey that has been running since
1968. Today it covers over 40,000 individuals and collects economic and demo-
graphic data. These data have been used to support a wide variety of microecono-
metric analyses. Brown, Duncan and Stafford (1996) summarize recent develop-
ments in PSID data.
Current Population Survey (CPS): This is a monthly national survey of about 50,000
households that provides information on labor force characteristics. The survey has
been conducted for more than 50 years. Major revisions in the sample have fol-
lowed each of the decennial censuses. For additional details about this survey see
Section 24.2. It is the basis of many federal government statistics on earnings and
unemployment. It is also an important source of microdata that have supported nu-
merous studies especially of labor markets. The survey was redesigned in 1994
(Polivka, 1996).
National Longitudinal Survey (NLS): The NLS has four original cohorts: NLS Older
Men, NLS Young Men, NLS Mature Women, and NLS Young Women. Each of
the original cohorts is a national yearly survey of over 5,000 individuals who have
been repeatedly interviewed since the mid-1960s. Surveys collect information on
each respondent’s work experiences, education, training, family income, household
composition, marital status, and health. Supplementary data on age, sex, etc. are
available.
National Longitudinal Surveys of Youth (NLSY): The NLSY is a national annual
survey of 12,686 young men and young women who were 14 to 22 years of age
when they were first surveyed in 1979. It contains three subsamples. The data
provide a unique opportunity to study the life-course experiences of a large sam-
ple of young adults who are representative of American men and women born in
the late 1950s and early 1960s. A second NLSY began in 1997.
Survey of Income and Program Participation (SIPP): SIPP is a longitudinal survey
of around 8,000 housing units per month. It covers income sources, participation in
entitlement programs, correlation between these items, and individual attachments
to the job market over time. It is a multipanel survey with a new panel being intro-
duced at the beginning of each calendar year. The first panel of SIPP was initiated
in October 1983. Compared with CPS, SIPP has fewer employed and more unem-
ployed persons.
Health and Retirement Study (HRS): The HRS is a longitudinal national study.
The baseline consists of interviews with members of 7,600 households in 1992
(respondents aged from 51 to 61) with follow-ups every two years for 12 years. The
data contain a wealth of economic, demographic, and health information.
World Bank’s Living Standards Measurement Study (LSMS): The World Bank’s
LSMS household surveys collect data “on many dimensions of household well-
being that can be used to assess household welfare, understand household behavior,
and evaluate the effects of various government policies on the living conditions of
the population” in many developing countries. Many examples of the use of these
data can be found in Deaton (1997) and in the economic development literature.
Grosh and Glewwe (1998) outline the nature of the data and provide references to
research studies that have used them.
Data clearinghouses: The Interuniversity Consortium for Political and Social Re-
search (ICPSR) provides access to many data sets, including the PSID, CPS, NLS,
SIPP, National Medical Expenditure Survey (NMES), and many others. The U.S.
Bureau of Labor Statistics handles the CPS and NLS surveys. The U.S. Bureau of
Census handles the SIPP. The U.S. National Center for Health Statistics provides
access to many health data sets. A useful gateway to European data archives is
the Council of European Social Science Data Archives (CESSDA), which provides
links to several European national data archives.
Journal data archives: For some purposes, such as replication of published results
for classroom work, you can get the data from journal archives. Two archives in
particular have well-established procedures for data uploads and downloads using
an Internet browser. The Journal of Business and Economic Statistics archives data
used in most but not all articles published in that journal. The Journal of Applied
Econometrics data archive is also organized along similar lines and contains data
pertaining to most articles published since 1994.
3.5.2. Handling Microdata
Microeconomic data sets tend to be quite large. Samples of several hundreds or thou-
sands are common and even those of tens of thousands are not unusual. The distribu-
tions of outcomes of interest are often nonnormal, in part because one is often dealing
with discrete data such as binary outcomes, or with data that have limited variation
such as proportions or shares, or with truncated or censored continuous outcomes.
Handling large nonnormal data sets poses some problems of summarizing and report-
ing the important features of data. Often it is useful to use one computing environment
(program) for data extraction, reduction, and preparation and a different one for model
estimation.
3.5.3. Data Preparation
The most basic feature of microeconometric analysis is that the process of arriving at
the sample finally used in the econometric investigation is likely to be a long one. It
is important to accurately document decisions and choices made by the investigator in
the process of “cleaning up” the data. Let us consider some specific examples.
One of the most common features of sample survey data is nonresponse or par-
tial response. The problems of nonresponse have already been discussed. Partial res-
ponse usually means that some parts of survey questionnaires were not answered. If
this means that some of the required information is not available, the observations in
question are deleted. This is called listwise deletion. If this problem occurs in a sig-
nificant number of cases, it should be properly analyzed and reported because it could
lead to an unrepresentative sample and biases in estimation. The issue is analyzed in
Chapter 27. For example, consider a question in a household survey to which high-
income households do not respond, leading to a sample in which these households are
underrepresented. Hence the end effect is no different from one in which there is a full
response but the sample is not representative.
A second problem is measurement error in reported data. Microeconomic data are
typically noisy. The extent, type, and seriousness of measurement error depend on the type of survey (cross section or panel), the individual who responds to the survey, and the variable about which information is sought. For example, self-reported income data
from panel surveys are strongly suspected to have serially correlated measurement er-
ror. In contrast, reported expenditure magnitudes are usually thought to have a smaller
measurement error. Deaton (1997) surveys some of the sources of measurement er-
ror with special reference to the World Bank’s Living Standards Measurement Survey,
although several of the issues raised have wider relevance. The biases from measure-
ment error depend on what is done to the data in terms of transformations (e.g., first
differencing) and the estimator used. Hence to make informative statements about the
seriousness of biases from measurement error, one must analyze well-defined mod-
els. Later chapters will give examples of the impact of measurement error in specific
contexts.
3.5.4. Checking Data
In large data sets it is easy to have erroneous data resulting from keyboard and cod-
ing errors. One should therefore apply some elementary checks that would reveal the
existence of problems. One can check the data before analyzing it by examining some
descriptive statistics. The following techniques are useful. First, use summary statistics
(min, max, mean, and median) to make sure that the data are in the proper interval and
on the proper scale. For instance, categorical variables should be between zero and
one, counts should be greater than or equal to zero. Sometimes missing data are coded
as −999, or some other integer, so take care not to treat these entries as data. Second,
one should know whether changes are fractional or on a percentage scale. Third, use
box and whisker plots to identify problematic observations. For instance, using box
and whisker plots one researcher found a country that had negative population growth
(owing to a war) and another country that had recorded investment as more than GDP
(because foreign aid had been excluded from the GDP calculation). Checking observa-
tions before proceeding with estimation may also suggest normalizing transformations
and/or distributional assumptions with features appropriate for modeling a particular
data set. Fourth, screening data may suggest appropriate data transforms. For example,
box and whisker plots and histograms could suggest which variables might be better
modeled via a log or power transform. Finally, it may be important to check the scales
of measurement. For some purposes, such as the use of nonlinear estimators, it may
be desirable to scale variables so that they have roughly similar scale. Summary statis-
tics can be used to check that the means, variances, and covariances of the variables
indicate proper scaling.
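None of these checks requires specialized software. The sketch below (pandas; the variable names and the −999 missing-data code are hypothetical) scripts a few of them: recoding the missing-data code, inspecting min, max, mean, and quartiles, and verifying admissible ranges.

import numpy as np
import pandas as pd

# Hypothetical raw extract: a binary indicator, a count, and income with -999 for missing.
df = pd.DataFrame({
    "employed": [0, 1, 1, 0, 1],
    "visits":   [2, 0, 5, 1, 3],
    "income":   [41000.0, 28500.0, -999.0, 60300.0, 35250.0],
})

df = df.replace(-999.0, np.nan)        # treat the missing-data code as missing, not as data
print(df.describe())                   # count, mean, min, max, and quartiles for each variable

# Range checks: the indicator should be 0/1 and counts should be nonnegative.
assert df["employed"].dropna().isin([0, 1]).all()
assert (df["visits"].dropna() >= 0).all()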
3.5.5. Presenting Descriptive Statistics
Because microdata sets are usually large, it is essential to provide the reader with an
initial table of descriptive statistics, usually mean, standard deviation, minimum, and
maximum for every variable. In some cases unexpectedly large or small values may
reveal the presence of a gross recording error or erroneous inclusion of an incorrect
data point. Two-way scatter diagrams are usually not helpful, but tabulation of cate-
gorical variables (contingency tables) can be. For discrete variables histograms can be
useful and for continuous variables density plots can be informative.
3.6. Bibliographic Notes
3.2 Deaton (1997) provides an introduction to sample surveys especially for developing
economies. Several specific references to complex surveys are provided in Chapter 24.
Becketti et al. (1988) investigate the importance of the issue of representativeness of the
PSID.
3.3 The collective volume edited by Hausman and Wise (1985) contains several papers on indi-
vidual social experiments including the RHIE, NIT, and Time-of-Use pricing experiments.
Several studies question the usefulness of the experimental data and there is extensive dis-
cussion of the flaws in experimental designs that preclude clear conclusions. Pros and cons
of social experiments versus observational data are discussed in an excellent pair of papers
by Burtless (1995) and Heckman and Smith (1995).
3.4 A special issue of the Journal of Business and Economic Statistics (1995) carries a number
of articles that use the methodology of quasi- or natural experiments. The collection in-
cludes an article by Meyer who surveys the issues in and the methodology of econometric
studies that use data from natural experiments. He also provides a valuable set of guidelines
on the credible use of natural variation in making inferences about the impact of economic
policies, partly based on the work of Campbell (1969). Kim and Singal (1993) study the
impact of changes in market concentration on price using the data generated by airline
mergers. Rosenzweig and Wolpin (2000) review an extensive literature based on natural
experiments such as identical twins. Isacsson (1999) uses the twins approach to study re-
turns to schooling using Swedish data. Angrist and Lavy (1999) study the impact of class
size on test scores using data from schools that are subject to “Maimonides’ Rule” (briefly
reviewed in Section 25.6), which states that class size should not exceed 40. The rule gen-
erates an instrument.
PART TWO
Core Methods
Part 2 presents the core estimation methods – least squares, maximum likelihood and
method of moments – and associated methods of inference for nonlinear regression
models that are central in microeconometrics. The material also includes modern top-
ics such as quantile regression, sequential estimation, empirical likelihood, semipara-
metric and nonparametric regression, and statistical inference based on the bootstrap.
In general the discussion is at a level intended to provide enough background and
detail to enable the practitioner to read and comprehend articles in the leading econo-
metrics journals and, where needed, subsequent chapters of this book. We presume
prior familiarity with linear regression analysis.
The essential estimation theory is presented in three chapters. Chapter 4 begins with
the linear regression model. It then covers at an introductory level quantile regression,
which models distributional features other than the conditional mean. It provides a
lengthy expository treatment of instrumental variables estimation, a major method of
causal inference. Chapter 5 presents the most commonly-used estimation methods for
nonlinear models, beginning with the topic of m-estimation, before specialization to
maximum likelihood and nonlinear least squares regression. Chapter 6 provides a com-
prehensive treatment of generalized method of moments, which is a quite general esti-
mation framework that is applicable for linear and nonlinear models in single-equation
and multi-equation settings. The chapter emphasizes the special case of instrumental
variables estimation.
We then turn to model testing. Chapter 7 covers both the classical and bootstrap
approaches to hypothesis testing, while Chapter 8 presents relatively more modern
methods of model selection and specification analysis. Because of their importance
the computationally-intensive bootstrap methods are also the subject of a more de-
tailed chapter, Chapter 11 in Part 3. A distinctive feature of this book is that, as much
as possible, testing procedures are presented in a unified manner in just these three
chapters. The procedures are then illustrated in specific applications throughout the
book.
Chapter 9 is a stand-alone chapter that presents nonparametric and semiparametric
estimation methods that place a flexible structure on the econometric model.
Chapter 10 presents the computational methods used to compute the nonlinear esti-
mators presented in chapters 5 and 6. This material becomes especially relevant to the
practitioner if an estimator is not automatically computed by an econometrics package,
or if numerical difficulties are encountered in model estimation.
CHAPTER 4
Linear Models
4.1. Introduction
A great deal of empirical microeconometrics research uses linear regression and its
various extensions. Before moving to nonlinear models, the emphasis of this book,
we provide a summary of some important results for the single-equation linear regres-
sion model with cross-section data. Several different estimators in the linear regression
model are presented.
Ordinary least-squares (OLS) estimation is especially popular. For typical microe-
conometric cross-section data the model error terms are likely to be heteroskedas-
tic. Then statistical inference should be robust to heteroskedastic errors and efficiency
gains are possible by use of weighted rather than ordinary least squares.
The OLS estimator minimizes the sum of squared residuals. One alternative is to
minimize the sum of the absolute value of residuals, leading to the least absolute de-
viations estimator. This estimator is also presented, along with extension to quantile
regression.
Various model misspecifications can lead to inconsistency of least-squares estima-
tors. In such cases inference about economically interesting parameters may require
more advanced procedures and these are pursued at considerable length and depth else-
where in the book. One commonly used procedure is instrumental variables regression.
The current chapter provides an introductory treatment of this important method and
additionally addresses the complication of weak instruments.
Section 4.2 provides a definition of regression and presents various loss functions
that lead to different estimators for the regression function. An example is introduced
in Section 4.3. Some leading estimation procedures, specifically ordinary least squares,
weighted least squares, and quantile regression, are presented in, respectively, Sec-
tions 4.4, 4.5, and 4.6. Model misspecification is considered in Section 4.7. Instru-
mental variables regression is presented in Sections 4.8 and 4.9. Sections 4.3–4.5, 4.7,
and 4.8 cover standard material in introductory courses, whereas Sections 4.2, 4.6, and
4.9 introduce more advanced material.
4.2. Regressions and Loss Functions
In modern microeconometrics the term regression refers to a bewildering range of
procedures for studying the relationship between an outcome variable y and a set of
regressors x. It is helpful, therefore, to state at the beginning the motivation and justi-
fication for some of the leading types of regressions.
For exposition it is convenient to think of the purpose of regression to be condi-
tional prediction of y given x. In practice, regression models are also used for other
purposes, most notably causal inference. Even then a prediction function constitutes a
useful data summary and is still of interest. In particular, see Section 4.2.3 for the dis-
tinction between linear prediction and causal inference based on a linear causal mean.
4.2.1. Loss Functions
Let ŷ denote the predictor defined as a function of x. Let e ≡ y − ŷ denote the prediction error, and let

L(e) = L(y − ŷ) (4.1)
denote the loss associated with the error e. As in decision analysis we assume that the
predictor forms the basis of some decision, and the prediction error leads to disutility
on the part of the decision maker that is captured by L(e), whose precise functional
form is a choice of the decision maker. The loss function has the property that it is
increasing in |e|.
Treating (y, ŷ) as random, the decision maker minimizes the expected value of the loss function, denoted E[L(e)]. If the predictor depends on x, a K-dimensional vector, then expected loss is expressed as

E[L((y − ŷ)|x)]. (4.2)
The choice of the loss function should depend in a substantive way on the losses
associated with prediction errors. In some situations, such as weather forecasting, there
may be a sound basis for choosing one loss function over another.
In econometrics, there is often no clear guide and the convention is to specify
quadratic loss. Then (4.1) specializes to L(e) = e² and by (4.2) the optimal predictor minimizes the expected loss E[L(e|x)] = E[e²|x]. It follows that in this case the
minimum mean-squared prediction error criterion is used to compare predictors.
4.2.2. Optimal Prediction
The decision theory approach to choosing the optimal predictor is framed in terms of
minimizing expected loss,
minŷ E[L((y − ŷ)|x)].
Thus the optimality property is relative to the loss function of the decision maker.
Table 4.1. Loss Functions and Corresponding Optimal Predictors
Squared error loss: L(e) = e²; optimal predictor E[y|x].
Absolute error loss: L(e) = |e|; optimal predictor med[y|x].
Asymmetric absolute loss: L(e) = (1 − α)|e| if e ≤ 0 and α|e| if e > 0; optimal predictor qα[y|x].
Step loss: L(e) = 0 if e ≤ 0 and 1 if e > 0; optimal predictor mod[y|x].
Four leading examples of loss function, and the associated optimal predictor func-
tion, are given in Table 4.1. We provide a brief presentation for each in turn. A detailed
analysis is given in Manski (1988a).
The most well known loss function is the squared error loss (or mean-square loss)
function. Then the optimal predictor of y is the conditional mean function, E[y|x]. In
the most general case no structure is placed on E[y|x] and estimation is by nonpara-
metric regression (see Chapter 9). More often a model for E[y|x] is specified, with
E[y|x] = g(x, β), where g(·) is a specified function and β is a finite-dimensional vec-
tor of parameters that needs to be estimated. The optimal prediction is ŷ = g(x, β̂), where β̂ is chosen to minimize the in-sample loss

Σi L(ei) = Σi ei² = Σi (yi − g(xi, β))².
The loss function is the sum of squared residuals, so estimation is by nonlinear least
squares (see Section 5.8). If the conditional mean function g(·) is restricted to be linear
in x and β, so that E[y|x] = x'β, then the optimal predictor is ŷ = x'β̂, where β̂ is the ordinary least-squares estimator detailed in Section 4.4.
If the loss criterion is absolute error loss, then the optimal predictor is the con-
ditional median, denoted med[y|x]. If the conditional median function is linear, so
that med[y|x] = x'β, then the optimal predictor is ŷ = x'β̂, where β̂ is the least absolute deviations estimator that minimizes Σi |yi − x'iβ|. This estimator is presented in Section 4.6.
Both the squared error and absolute error loss functions are symmetric, so the same
penalty is imposed for prediction error of a given magnitude regardless of the direc-
tion of the prediction error. Asymmetric absolute error loss instead places a penalty
of (1 − α) |e| on overprediction and a different penalty α |e| on underprediction. The
asymmetry parameter α is specified. It lies in the interval (0, 1) with symmetry when
α = 0.5 and increasing asymmetry as α approaches 0 or 1. The optimal predictor can
be shown to be the conditional quantile, denoted qα [y|x]; a special case is the condi-
tional median when α = 0.5. Conditional quantiles are defined in Section 4.6, which
presents quantile regression (Koenker and Bassett, 1978).
The last loss function given in Table 4.1 is step loss, which bases the loss simply on
the sign of the prediction error regardless of the magnitude. The optimal predictor is the
conditional mode, denoted mod[y|x]. This provides motivation for mode regression
(Lee, 1989).
Maximum likelihood does not fall as easily into the prediction framework of this
section. It can, however, be given an expected loss interpretation in terms of predicting
the density and minimizing Kullback–Leibler information (see Section 5.7).
The results just stated imply that the econometrician interested in estimating a pre-
diction function from the data (y, x) should choose the prediction function according
to the loss function. The use of the popular linear regression implies, at least implicitly,
that the decision maker has a quadratic loss function and believes that the conditional
mean function is linear. However, if one of the other three loss functions is specified,
then the optimal predictor will be based on one of the three other types of regressions.
In practice there may be no clear reason for preferring a particular loss function.
Regressions are often used as data summaries, rather than for prediction per se.
Then it can be useful to consider a range of estimators, as alternative estimators may
provide useful information about the sensitivity of estimates. Manski (1988a, 1991)
has pointed out that the quadratic and absolute error loss functions are both convex. If
the conditional distribution of y|x is symmetric then the conditional mean and median
estimators are both consistent and can be expected to be quite close. Furthermore, if
one avoids assumptions about the distribution of y|x, then differences in alternative
estimators provide a way of learning about the data distribution.
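The correspondences in Table 4.1 are easy to verify numerically. In the sketch below (simulated right-skewed data and no regressors, so the optimal predictor is a constant) each empirical loss is minimized over a grid of candidate predictors and, up to grid resolution, returns the sample mean, median, and α-quantile; step loss is omitted because the mode of continuous data requires smoothing.

import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=2.0, size=5_000)        # right-skewed outcome, no regressors
grid = np.linspace(y.min(), y.max(), 4_001)       # candidate constant predictors

def argmin_loss(loss):
    # Average loss of each candidate constant predictor; return the minimizer.
    losses = np.array([loss(y - c).mean() for c in grid])
    return grid[losses.argmin()]

alpha = 0.25
best_sq = argmin_loss(lambda e: e ** 2)                                           # -> sample mean
best_abs = argmin_loss(np.abs)                                                    # -> sample median
best_asym = argmin_loss(lambda e: np.where(e > 0, alpha, 1 - alpha) * np.abs(e))  # -> alpha-quantile

print(best_sq, y.mean())
print(best_abs, np.median(y))
print(best_asym, np.quantile(y, alpha))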
4.2.3. Linear Prediction
The optimal predictor under squared error loss is the conditional mean E[y|x]. If this
conditional mean is linear in x, so that E[y|x] = x'β, the parameter β has a structural or causal interpretation and consistent estimation of β by OLS implies consistent estimation of E[y|x] = x'β. This permits meaningful policy analysis of effects of changes in regressors on the conditional mean.

If instead the conditional mean is nonlinear in x, so that E[y|x] ≠ x'β, the structural interpretation of OLS disappears. However, it is still possible to interpret β as the best linear predictor under squared error loss. Differentiation of the expected loss E[(y − x'β)²] with respect to β yields first-order conditions −2E[x(y − x'β)] = 0, so the optimal linear predictor is β = (E[xx'])⁻¹E[xy], with sample analogue the OLS estimator.

Usually we specialize to models with intercept. In a change of notation we define x to denote regressors excluding the intercept, and we replace x'β by α + x'γ. The first-order conditions with respect to α and γ are that −2E[u] = 0 and −2E[xu] = 0, where u = y − (α + x'γ). These imply that E[u] = 0 and Cov[x,u] = 0. Solving yields

γ = (V[x])⁻¹Cov[x, y], (4.3)
α = E[y] − E[x']γ;

see, for example, Goldberger (1991, p. 52).
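The formulas in (4.3) can be checked directly on simulated data (a sketch with invented values). Even though the conditional mean below is nonlinear, the moment-based coefficients coincide with the OLS coefficients from a regression with an intercept, as the linear projection interpretation implies.

import numpy as np

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=(n, 2))
y = 1.0 + x[:, 0] + 0.5 * x[:, 1] ** 2 + rng.normal(size=n)   # E[y|x] is nonlinear in x

# Sample analogues of (4.3): gamma = V[x]^{-1} Cov[x, y], alpha = E[y] - E[x]'gamma.
gamma = np.linalg.solve(np.cov(x, rowvar=False), np.cov(x, y, rowvar=False)[:2, 2])
alpha = y.mean() - x.mean(axis=0) @ gamma

# OLS with an intercept returns the same linear projection coefficients.
X = np.column_stack([np.ones(n), x])
print(alpha, gamma)
print(np.linalg.lstsq(X, y, rcond=None)[0])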
From the derivation of (4.3) it should be clear that for data (y, x) we can always
write a linear regression model
y = α + x'γ + u, (4.4)

where the parameters α and γ are defined in (4.3) and the error term u satisfies E[u] = 0 and Cov[x,u] = 0.

A linear regression model can therefore always be given the nonstructural or reduced-form interpretation as the best linear prediction (or linear projection) under squared error loss. However, for the conditional mean to be linear in x, so that E[y|x] = α + x'γ, requires the assumption that E[u|x] = 0, in addition to E[u] = 0 and Cov[x,u] = 0.

This distinction is of practical importance. For example, if E[u|x] = 0, so that E[y|x] = α + x'γ, then the probability limit of a least-squares (LS) estimator γ̂ is γ regardless of whether the LS estimator is weighted or unweighted, or whether the sample is obtained by simple random sampling or by exogenous stratified sampling. If instead E[y|x] ≠ α + x'γ then these different LS estimators may have different probability limits. This example is discussed further in Section 24.3.
A structural interpretation of OLS requires that the conditional mean of the error
term, given regressors, equals zero.
4.3. Example: Returns to Schooling
A leading linear regression application from labor economics concerns measuring the
impact of education on wages or earnings.
A typical returns to schooling model specifies
ln wi = αsi + x'2iβ + ui, i = 1, . . . , N, (4.5)
where w denotes hourly wage or annual earnings, s denotes years of completed school-
ing, and x2 denotes control variables such as work experience, gender, and family
background. The subscript i denotes the ith person in the sample. Since the dependent
variable is log wage, the model is a log-linear model and the coefficient α measures
the proportionate change in earnings associated with a one-year increase in education.
Estimation of this model is most often by ordinary least squares. The transforma-
tion to ln w in practice ensures that errors are approximately homoskedastic, but it
is still best to obtain heteroskedastic consistent standard errors as detailed in Sec-
tion 4.4. Estimation can also be by quantile regression (see Section 4.6), if interest
lies in distributional issues such as behavior in the lower quartile.
The regression (4.5) can be used immediately in a descriptive manner. For exam-
ple, if α̂ = 0.10 then a one-year increase in schooling is associated with 10% higher earnings, controlling for all the factors included in x2. It is important to add the last qualifier as in this example the estimate α̂ usually becomes smaller as x2 is expanded
to include additional controls likely to influence earnings.
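As an aside on interpretation, the 10% statement uses the usual small-coefficient approximation for log-linear models: the exact proportionate change in w from a one-unit increase in s is exp(α̂) − 1, and the gap between the two matters for larger coefficients. A quick check (the coefficient values are illustrative only):

import numpy as np

for alpha_hat in (0.05, 0.10, 0.50):
    # Approximate effect (alpha_hat) versus exact proportionate change exp(alpha_hat) - 1.
    print(alpha_hat, np.expm1(alpha_hat))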
Policy interest lies in determining the impact of an exogenous change in schooling
on earnings. However, schooling is not randomly assigned; rather, it is an outcome that
depends on choices made by the individual. Human capital theory treats schooling as
investment by individuals in themselves, and α is interpreted as a measure of return to
human capital. The regression (4.5) is then a regression of one endogenous variable,
y, on another, s, and so does not measure the causal impact of an exogenous change
in s. The conditional mean function here is not causally meaningful because one is
conditioning on a factor, schooling, that is endogenous. Indeed, unless we can argue
that s is itself a function of variables at least one of which can vary independently of
u, it is unclear just what it means to regard α as a causal parameter.
Such concern about endogenous regressors with observational data on individuals
pervades microeconometric analysis. The standard assumptions of the linear regres-
sion model given in Section 4.4 are that regressors are exogenous. The consequences
of endogenous regressors are considered in Section 4.7. One method to control for
endogenous regressors, instrumental variables, is detailed in Section 4.8. A recent ex-
tensive review of ways to control for endogeneity in this wage–schooling example is
given in Angrist and Krueger (1999). These methods are summarized in Section 2.8
and presented throughout this book.
4.4. Ordinary Least Squares
The simplest example of regression is the OLS estimator in the linear regression model.
After first defining the model and estimator, a quite detailed presentation of the
asymptotic distribution of the OLS estimator is given. The exposition presumes pre-
vious exposure to a more introductory treatment. The model assumptions made here
permit stochastic regressors and heteroskedastic errors and accommodate data that are
obtained by exogenous stratified sampling.
The key result of how to obtain heteroskedastic-robust standard errors of the OLS
estimator is given in Section 4.4.5.
4.4.1. Linear Regression Model
In a standard cross-section regression model with N observations on a scalar
dependent variable and several regressors, the data are specified as (y, X), where y
denotes observations on the dependent variable and X denotes a matrix of explanatory
variables.
The general regression model with additive errors is written in vector notation as
y = E [y|X] + u, (4.6)
where E[y|X] denotes the conditional expectation of the random variable y given X,
and u denotes a vector of unobserved random errors or disturbances. The right-hand
side of this equation decomposes y into two components, one that is deterministic
given the regressors and one that is attributed to random variation or noise. We think
of E[y|X] as a conditional prediction function that yields the average value, or more
formally the expected value, of y given X.
A linear regression model is obtained when E[y|X] is specified to be a linear func-
tion of X. Notation for this model has been presented in detail in Section 1.6. In vector
notation the ith observation is
yi = x'iβ + ui, (4.7)
where xi is a K × 1 regressor vector and β is a K × 1 parameter vector. At times
it is simpler to drop the subscript i and write the model for a typical observation as y = x'β + u. In matrix notation the N observations are stacked by row to yield
y = Xβ + u, (4.8)
where y is an N × 1 vector of dependent variables, X is an N × K regression ma-
trix, and u is an N × 1 error vector.
Equations (4.7) and (4.8) are equivalent expressions for the linear regression model
and will be used interchangeably. The latter is more concise and is usually the most
convenient representation.
In this setting y is referred to as the dependent variable or endogenous variable
whose variation we wish to study in terms of variation in x and u; u is referred to as
the error term or disturbance term; and x is referred to as regressors or predictors
or covariates. If Assumption 4 in Section 4.4.6 holds, then all components of x are
exogenous variables or independent variables.
4.4.2. OLS Estimator
The OLS estimator is defined to be the estimator that minimizes the sum of squared
errors
Σi ui² = u'u = (y − Xβ)'(y − Xβ). (4.9)

Setting the derivative with respect to β equal to 0 and solving for β yields the OLS estimator

β̂OLS = (X'X)⁻¹X'y, (4.10)

where it is assumed that the matrix inverse of X'X exists (see Exercise 4.5 for a more general result). If X'X is of less than full rank, the inverse can be replaced by a generalized inverse. Then OLS estimation still yields the optimal linear predictor of y given x if squared error loss is used, but many different linear combinations of x will yield this optimal predictor.
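As a computational aside, (4.10) is a one-line calculation. The sketch below (simulated data with arbitrary parameter values) obtains β̂OLS by solving the normal equations X'X b = X'y, which is numerically preferable to forming the explicit inverse.

import numpy as np

rng = np.random.default_rng(5)
N, K = 1_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])   # intercept plus two regressors
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=N)

# beta_OLS = (X'X)^{-1} X'y, computed by solving the normal equations.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)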
4.4.3. Identification
The OLS estimator can always be computed, provided that X'X is nonsingular. The more interesting issue is what β̂OLS tells us about the data.
We focus on the ability of the OLS estimator to permit identification (see Section
2.5) of the conditional mean E[y|X]. For the linear model the parameter β is identified
if
1. E[y|X] = Xβ and
2. Xβ(1) = Xβ(2) if and only if β(1) = β(2).
The first condition that the conditional mean is correctly specified ensures that β is
of intrinsic interest; the second assumption implies that X'X is nonsingular, which is
the same condition needed to compute the unique OLS estimate (4.10).
4.4.4. Distribution of the OLS Estimator
We focus on the asymptotic properties of the OLS estimator. Consistency is estab-
lished and then the limit distribution is obtained by rescaling the OLS estimator.
Statistical inference then requires consistent estimation of the variance matrix of the
estimator. The analysis makes extensive use of asymptotic theory, which is summa-
rized in Appendix A.
Consistency
The properties of an estimator depend on the process that actually generated the data,
the data generating process (dgp). We assume the dgp is y = Xβ + u, so that the
model (4.8) is correctly specified. In some places, notably Chapters 5 and 6 and Ap-
pendix A, the subscript 0 is added to β, so the dgp is y = Xβ0 + u. See Section 5.2.3
for discussion.
Then

β̂OLS = (X'X)⁻¹X'y
= (X'X)⁻¹X'(Xβ + u)
= (X'X)⁻¹X'Xβ + (X'X)⁻¹X'u,

and the OLS estimator can be expressed as

β̂OLS = β + (X'X)⁻¹X'u. (4.11)
To prove consistency we rewrite (4.11) as

β̂OLS = β + (N⁻¹X'X)⁻¹N⁻¹X'u. (4.12)

The reason for renormalization in the right-hand side is that N⁻¹X'X = N⁻¹Σi xi x'i is an average that converges in probability to a finite nonzero matrix if xi satisfies assumptions that permit a law of large numbers to be applied to xi x'i (see Section 4.4.8 for detail). Then
plim β̂OLS = β + (plim N⁻¹X'X)⁻¹(plim N⁻¹X'u),

using Slutsky's Theorem (Theorem A.3). The OLS estimator is consistent for β (i.e., plim β̂OLS = β) if

plim N⁻¹X'u = 0. (4.13)

If a law of large numbers can be applied to the average N⁻¹X'u = N⁻¹Σi xi ui then a necessary condition for (4.13) to hold is that E[xi ui] = 0.
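The role of the condition E[xi ui] = 0 can be seen in a small simulation (a sketch with made-up designs): when the regressor is uncorrelated with the error the OLS slope settles down at the true value as N grows, whereas with Cov[x, u] ≠ 0 it stabilizes at the wrong limit.

import numpy as np

rng = np.random.default_rng(6)

def ols_slope(N, endogenous):
    u = rng.normal(size=N)
    x = rng.normal(size=N) + (0.5 * u if endogenous else 0.0)   # E[x u] is 0.5 or 0
    y = 1.0 + 2.0 * x + u                                       # true slope is 2
    X = np.column_stack([np.ones(N), x])
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

for N in (100, 10_000, 1_000_000):
    print(N, ols_slope(N, endogenous=False), ols_slope(N, endogenous=True))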
Limit Distribution
Given consistency, the limit distribution of β̂OLS is degenerate with all the mass at β. To obtain the limit distribution we multiply β̂OLS by √N, as this rescaling leads to a random variable that under standard cross-section assumptions has nonzero yet finite variance asymptotically. Then (4.11) becomes

√N(β̂OLS − β) = (N⁻¹X'X)⁻¹N⁻¹/²X'u. (4.14)

The proof of consistency assumed that plim N⁻¹X'X exists and is finite and nonzero. We assume that a central limit theorem can be applied to N⁻¹/²X'u to yield a multivariate normal limit distribution with finite, nonsingular covariance matrix. Applying the product rule for limit normal distributions (Theorem A.17) implies that the product in the right-hand side of (4.14) has a limit normal distribution. Details are provided in Section 4.4.8.
This leads to the following proposition, which permits regressors to be stochastic
and does not restrict model errors to be homoskedastic and uncorrelated.
Proposition 4.1 (Distribution of OLS Estimator). Make the following assump-
tions:
(i) The dgp is model (4.8), that is, y = Xβ + u.
(ii) Data are independent over i with E[u|X] = 0 and E[uu'|X] = Ω = Diag[σi²].
(iii) The matrix X is of full rank so that Xβ(1) = Xβ(2) iff β(1) = β(2).
(iv) The K × K matrix

Mxx = plim N⁻¹X'X = plim N⁻¹ Σi xi x'i = lim N⁻¹ Σi E[xi x'i] (4.15)

exists and is finite nonsingular.
(v) The K × 1 vector N⁻¹/²X'u = N⁻¹/² Σi xi ui d→ N[0, MxΩx], where

MxΩx = plim N⁻¹X'uu'X = plim N⁻¹ Σi ui² xi x'i = lim N⁻¹ Σi E[ui² xi x'i]. (4.16)

Then the OLS estimator β̂OLS defined in (4.10) is consistent for β and

√N(β̂OLS − β) d→ N[0, M⁻¹xx MxΩx M⁻¹xx]. (4.17)

Assumption (i) is used to obtain (4.11). Assumption (ii) ensures E[y|X] = Xβ and permits heteroskedastic errors with variance σi², more general than the homoskedastic uncorrelated errors that restrict Ω = σ²I. Assumption (iii) rules out perfect collinearity among the regressors. Assumption (iv) leads to the rescaling of X'X by N⁻¹ in (4.12) and (4.14). Note that by a law of large numbers plim = lim E (see Appendix Section A.3).

The essential condition for consistency is (4.13). Rather than directly assume this we have used the stronger assumption (v), which is needed to obtain result (4.17).
Given that $N^{-1/2}X'u$ has a limit distribution with zero mean and finite variance, multiplication by $N^{-1/2}$ yields a random variable that converges in probability to zero, so (4.13) holds as desired. Assumption (v) is required, along with assumption (iv), to obtain the limit normal result (4.17), which by Theorem A.17 then follows immediately from (4.14). More primitive assumptions on $u_i$ and $x_i$ that ensure (iv) and (v) are satisfied are given in Section 4.4.6, with a formal proof in Section 4.4.8.
Asymptotic Distribution
Proposition 4.1 gives the limit distribution of $\sqrt{N}(\hat{\beta}_{\mathrm{OLS}} - \beta)$, a rescaling of $\hat{\beta}_{\mathrm{OLS}}$. Many practitioners prefer to see asymptotic results written directly in terms of the distribution of $\hat{\beta}_{\mathrm{OLS}}$, in which case the distribution is called an asymptotic distribution. This asymptotic distribution is interpreted as being applicable in large samples, meaning samples large enough for the limit distribution to be a good approximation but not so large that $\hat{\beta}_{\mathrm{OLS}} \stackrel{p}{\to} \beta$, as then its asymptotic distribution would be degenerate. The discussion mirrors that in Appendix A.6.4.
The asymptotic distribution is obtained from (4.17) by division by $\sqrt{N}$ and addition of β. This yields the asymptotic distribution
$$\hat{\beta}_{\mathrm{OLS}} \stackrel{a}{\sim} \mathcal{N}\left[\beta,\; N^{-1} M_{xx}^{-1} M_{x\Omega x} M_{xx}^{-1}\right], \tag{4.18}$$
where the symbol $\stackrel{a}{\sim}$ means "is asymptotically distributed as." The variance matrix in (4.18) is called the asymptotic variance matrix of $\hat{\beta}_{\mathrm{OLS}}$ and is denoted $V[\hat{\beta}_{\mathrm{OLS}}]$. Even simpler notation drops the limits and expectations in the definitions of $M_{xx}$ and $M_{x\Omega x}$, and the asymptotic distribution is denoted
$$\hat{\beta}_{\mathrm{OLS}} \stackrel{a}{\sim} \mathcal{N}\left[\beta,\; (X'X)^{-1}X'\Omega X(X'X)^{-1}\right], \tag{4.19}$$
and $V[\hat{\beta}_{\mathrm{OLS}}]$ is defined to be the variance matrix in (4.19).
We use both (4.18) and (4.19) to represent the asymptotic distribution in later chap-
ters. Their use is for convenience of presentation. Formal asymptotic results for statisti-
cal inference are based on the limit distribution rather than the asymptotic distribution.
For implementation, the matrices $M_{xx}$ and $M_{x\Omega x}$ in (4.17) or (4.18) are replaced by consistent estimates $\widehat{M}_{xx}$ and $\widehat{M}_{x\Omega x}$. Then the estimated asymptotic variance matrix of $\hat{\beta}_{\mathrm{OLS}}$ is
$$\widehat{V}[\hat{\beta}_{\mathrm{OLS}}] = N^{-1}\widehat{M}_{xx}^{-1}\widehat{M}_{x\Omega x}\widehat{M}_{xx}^{-1}. \tag{4.20}$$
This estimate is called a sandwich estimate, with $\widehat{M}_{x\Omega x}$ sandwiched between $\widehat{M}_{xx}^{-1}$ and $\widehat{M}_{xx}^{-1}$.
4.4.5. Heteroskedasticity-Robust Standard Errors for OLS
The obvious choice for $\widehat{M}_{xx}$ in (4.20) is $N^{-1}X'X$. Estimation of $M_{x\Omega x}$ defined in (4.16) depends on assumptions made about the error term.
In microeconometrics applications the model errors are often conditionally heteroskedastic, with $V[u_i|x_i] = E[u_i^2|x_i] = \sigma_i^2$ varying over i. White (1980a) proposed using $\widehat{M}_{x\Omega x} = N^{-1}\sum_i \hat{u}_i^2 x_i x_i'$. This estimate requires additional assumptions given in Section 4.4.8.
Combining these estimates $\widehat{M}_{xx}$ and $\widehat{M}_{x\Omega x}$ and simplifying yields the estimated asymptotic variance matrix
$$\widehat{V}[\hat{\beta}_{\mathrm{OLS}}] = (X'X)^{-1}X'\widehat{\Omega}X(X'X)^{-1} = \left(\sum_{i=1}^{N} x_i x_i'\right)^{-1}\left(\sum_{i=1}^{N}\hat{u}_i^2 x_i x_i'\right)\left(\sum_{i=1}^{N} x_i x_i'\right)^{-1}, \tag{4.21}$$
where $\widehat{\Omega} = \mathrm{Diag}[\hat{u}_i^2]$ and $\hat{u}_i = y_i - x_i'\hat{\beta}$ is the OLS residual. This estimate, due to White (1980a), is called the heteroskedasticity-consistent estimate of the asymptotic variance matrix of the OLS estimator, and it leads to standard errors that are called heteroskedasticity-robust standard errors, or more simply robust standard errors. It provides a consistent estimate of $V[\hat{\beta}_{\mathrm{OLS}}]$ even though $\hat{u}_i^2$ is not consistent for $\sigma_i^2$.
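As an illustration (ours, not from the text), the sandwich formula (4.21) and the default estimate (4.22) introduced below can be computed directly with NumPy for simulated data; the dgp and variable names are hypothetical, and the optional N/(N − K) scaling discussed in the next paragraphs is also shown.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 500, 2
X = np.column_stack([np.ones(N), rng.normal(0, 5, N)])  # intercept and one regressor
u = X[:, 1] * rng.normal(0, 2, N)                        # conditionally heteroskedastic error
y = X @ np.array([1.0, 1.0]) + u

b = np.linalg.solve(X.T @ X, X.T @ y)                    # OLS estimate
uhat = y - X @ b                                         # OLS residuals

XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * uhat[:, None] ** 2).T @ X                    # sum_i uhat_i^2 x_i x_i'
V_white = XtX_inv @ meat @ XtX_inv                       # sandwich estimate (4.21)
V_default = (uhat @ uhat / (N - K)) * XtX_inv            # default s^2 (X'X)^{-1} of (4.22)

print("robust se :", np.sqrt(np.diag(V_white)))
print("default se:", np.sqrt(np.diag(V_default)))
print("robust se, df-adjusted:", np.sqrt(np.diag(V_white * N / (N - K))))
```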
In introductory courses the errors are restricted to be homoskedastic. Then $\Omega = \sigma^2 I$, so that $X'\Omega X = \sigma^2 X'X$ and hence $M_{x\Omega x} = \sigma^2 M_{xx}$. The limit distribution variance matrix in (4.17) simplifies to $\sigma^2 M_{xx}^{-1}$, and many computer packages instead use what is sometimes called the default OLS variance estimate
$$\widehat{V}[\hat{\beta}_{\mathrm{OLS}}] = s^2(X'X)^{-1}, \tag{4.22}$$
where $s^2 = (N - K)^{-1}\sum_i \hat{u}_i^2$.
Inference based on (4.22) rather than (4.21) is invalid unless the errors are homoskedastic and uncorrelated. In general the erroneous use of (4.22) when errors are heteroskedastic, as is often the case for cross-section data, can lead to reported standard errors that either overstate or understate the true standard errors.
In practice $\widehat{M}_{x\Omega x}$ is calculated using division by (N − K), rather than by N, to be consistent with the similar division in forming $s^2$ in the homoskedastic case. Then $\widehat{V}[\hat{\beta}_{\mathrm{OLS}}]$ in (4.21) is multiplied by N/(N − K). With heteroskedastic errors there is no theoretical basis for this degrees-of-freedom adjustment, but some simulation studies provide support for it (see MacKinnon and White, 1985, and Long and Ervin, 2000).
Microeconometric analysis uses robust standard errors wherever possible. Here the
errors are robust to heteroskedasticity. Guarding against other misspecifications may
also be warranted. In particular, when data are clustered the standard errors should
additionally be robust to clustering; see Sections 21.2.3 and 24.5.
4.4.6. Assumptions for Cross-Section Regression
Proposition 4.1 is a quite generic theorem that relies on assumptions about $N^{-1}X'X$ and $N^{-1/2}X'u$. In practice these assumptions are verified by application of laws of large numbers and central limit theorems to averages of $x_i x_i'$ and $x_i u_i$. These in turn require assumptions about how the observations $x_i$ and errors $u_i$ are generated, and
consequently how yi defined in (4.7) is generated. The assumptions are referred to
collectively as assumptions regarding the data-generating process (dgp). A simple
pedagogical example is given in Exercise 4.4.
Our objective at this stage is to make assumptions that are appropriate in many ap-
plied settings where cross-section data are used. The assumptions are those in White (1980a) and include three important departures from those in introductory treatments.
First, the regressors may be stochastic (Assumptions 1 and 3 that follow), so assump-
tions on the error are made conditional on regressors. Second, the conditional variance
of the error may vary across observations (Assumption 5). Third, the errors are not
restricted to be normally distributed.
Here are the assumptions:

1. The data $(y_i, x_i)$ are independent and not identically distributed (inid) over i.

2. The model is correctly specified, so that $y_i = x_i'\beta + u_i$.

3. The regressor vector $x_i$ is possibly stochastic with finite second moment; additionally, $E[|x_{ij}x_{ik}|^{1+\delta}] < \infty$ for all $j, k = 1, \ldots, K$ for some $\delta > 0$, and the matrix $M_{xx}$ defined in (4.15) exists and is a finite positive definite matrix of rank K. Also, X has rank K in the sample being analyzed.

4. The errors have zero mean, conditional on regressors:
$$E[u_i|x_i] = 0.$$

5. The errors are heteroskedastic, conditional on regressors, with
$$\sigma_i^2 = E\left[u_i^2|x_i\right], \qquad \Omega = E\left[uu'|X\right] = \mathrm{Diag}\left[\sigma_i^2\right], \tag{4.23}$$
where Ω is an N × N positive definite matrix. Also, for some $\delta > 0$, $E[|u_i^2|^{1+\delta}] < \infty$.

6. The matrix $M_{x\Omega x}$ defined in (4.16) exists and is a finite positive definite matrix of rank K, where $M_{x\Omega x} = \operatorname{plim} N^{-1}\sum_i u_i^2 x_i x_i'$ given independence over i. Also, for some $\delta > 0$, $E[|u_i^2 x_{ij} x_{ik}|^{1+\delta}] < \infty$ for all $j, k = 1, \ldots, K$.
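The asymptotic result these assumptions deliver can be checked by simulation. The sketch below (our own illustration, with a hypothetical heteroskedastic dgp) draws repeated samples, computes $\sqrt{N}(\hat\beta - \beta)$, and compares its Monte Carlo covariance matrix with the limit variance $M_{xx}^{-1}M_{x\Omega x}M_{xx}^{-1}$ of (4.17).

```python
import numpy as np

rng = np.random.default_rng(1)
N, R = 200, 2000
beta = np.array([1.0, 0.5])

draws = []
for _ in range(R):
    x = rng.normal(0, 1, N)
    X = np.column_stack([np.ones(N), x])
    u = np.sqrt(0.5 + x**2) * rng.normal(0, 1, N)   # V[u|x] = 0.5 + x^2, heteroskedastic
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    draws.append(np.sqrt(N) * (b - beta))
draws = np.asarray(draws)

# Approximate the population limits Mxx = E[x x'] and MxOx = E[u^2 x x'] by large-sample averages
xs = rng.normal(0, 1, 1_000_000)
Xs = np.column_stack([np.ones_like(xs), xs])
Mxx = Xs.T @ Xs / xs.size
MxOx = (Xs * (0.5 + xs**2)[:, None]).T @ Xs / xs.size
V_limit = np.linalg.inv(Mxx) @ MxOx @ np.linalg.inv(Mxx)

print("Monte Carlo variance of sqrt(N)(b - beta):\n", np.cov(draws.T))
print("Limit variance Mxx^-1 MxOx Mxx^-1:\n", V_limit)
```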
4.4.7. Remarks on Assumptions
For completeness we provide a detailed discussion of each assumption, before proving
the key results in the following section.
Stratified Random Sampling
Assumption 1 is one that is often implicitly made for cross-section data. Here we make
it explicit. It restricts (yi , xi ) to be independent over i, but permits the distribution to
differ over i. Many microeconometrics data sets come from stratified random sam-
pling (see Section 3.2). Then the population is partitioned into strata and random draws
are made within strata, but some strata are oversampled with the consequence that the
sampled (yi , xi ) are inid rather than iid. If instead the data come from simple ran-
dom sampling then (yi , xi ) are iid, a stronger assumption that is a special case of inid.
Many introductory treatments assumed that regressors are fixed in repeated samples.
Then (yi , xi ) are inid since only yi is random with a value that depends on the value of
xi . The fixed regressors assumption is rarely appropriate for microeconometrics data,
which are usually observational data. It is used instead for experimental data, where x
is the treatment level.
These different assumptions on the distribution of (yi , xi ) affect the particular laws
of large numbers and central limit theorems used to obtain the asymptotic properties
of the OLS estimator. Note that even if (yi , xi ) are iid, yi given xi is not iid since, for
example, $E[y_i|x_i] = x_i'\beta$ varies with $x_i$.
Assumption 1 rules out most time-series data since they are dependent over obser-
vations. It will also be violated if the sampling scheme involves clustering of observa-
tions. The OLS estimator can still be consistent in these cases, provided Assumptions
2–4 hold, but usually it has a variance matrix different from that presented in this
chapter.
Correctly Specified Model
Assumption 2 seems very obvious as it is an essential ingredient in the derivation of
the OLS estimator. It still needs to be made explicit, however, since $\hat\beta = (X'X)^{-1}X'y$ is a function of y and so its properties depend on y.
If Assumption 2 holds then it is being assumed that the regression model is linear in
x, rather than nonlinear, that there are no omitted variables in the regression, and that
there is no measurement error in the regressors, as the regressors x used to calculate $\hat\beta$ are the same regressors x that are in the dgp. Also, the parameters β are the same
across individuals, ruling out random parameter models.
If Assumption 2 fails then OLS can only be interpreted as an optimal linear predic-
tor; see Section 4.2.3.
Stochastic Regressors
Assumption 3 permits the regressors to be stochastic, as is usually the case
when survey data rather than experimental data are used. It is assumed that in the limit
the sample second-moment matrix is constant and nonsingular.
If the regressors are iid, as is assumed under simple random sampling, then $M_{xx} = E[xx']$ and Assumption 3 can be reduced to an assumption that the second moment exists. If the regressors are stochastic but inid, as is the case for stratified random sampling, then we need the stronger Assumption 3, which permits application of the Markov LLN to obtain plim $N^{-1}X'X$. If the regressors are fixed in repeated samples, the common but less satisfactory assumption made in introductory courses, then $M_{xx} = \lim N^{-1}X'X$ and Assumption 3 becomes the assumption that this limit exists.
Weakly Exogenous Regressors
Assumption 4 of zero conditional mean errors is crucial because when combined
with Assumption 2 it implies that E[y|X] = Xβ, so that the conditional mean is indeed
Xβ.
The assumption that E[u|x] = 0 implies that Cov[x,u] = 0, so that the error is un-
correlated with regressors. This follows as Cov[x,u] =E[xu]−E[x]E[u] and E[u|x] =
0 implies E[xu] = 0 and E[u] = 0 by the law of iterated expectations. The weaker
assumption that Cov[x,u] = 0 can be sufficient for consistency of OLS, whereas the
stronger assumption that E[u|x] = 0 is needed for unbiasedness of OLS.
The economic meaning of Assumption 4 is that the error term represents all the
excluded factors that are assumed to be uncorrelated with X and these have, on av-
erage, zero impact on y. This is a key assumption that was referred to in Section 2.3
as the weak exogeneity assumption. Essentially this means that the knowledge of the
data-generating process for X variables does not contribute useful information for es-
timating β. When the assumption fails, one or more of the K regressor variables is
said to be jointly dependent with y, or simply endogenous. A general term for cor-
relation of regressors with errors is endogeneity or endogenous regressors, where
the term “endogenous” means caused by factors inside the system. As we will show
in Section 4.7, the violation of weak exogeneity may lead to inconsistent estimation.
There are many ways in which weak exogeneity can be violated, but one of the most
common involves a variable in x that is a choice or a decision variable that is related
to y in a larger model. Ignoring these other relationships, and treating xi as if it were
randomly assigned to observation i, and hence uncorrelated with ui , will have non-
trivial consequences. Endogenous sampling is ruled out by Assumption 4. Instead,
if data are collected by stratified random sampling it must be exogenous stratified
sampling.
Conditionally Heteroskedastic Errors
Independent regression errors uncorrelated with regressors are assumed, a conse-
quence of Assumptions 1, 2, and 4. Introductory courses usually further restrict at-
tention to errors that are homoskedastic with homogeneous or constant variances, in which case $\sigma_i^2 = \sigma^2$ for all i. Then the errors are iid $(0, \sigma^2)$ and are called spherical errors since $\Omega = \sigma^2 I$.
Assumption 5 is instead one of conditionally heteroskedastic regression errors,
where heteroskedastic means heterogeneous variances or different variances. The as-
sumption is stated in terms of the second moment $E[u^2|x]$, but this equals the variance $V[u|x]$ since $E[u|x] = 0$ by Assumption 4. This more general assumption of het-
eroskedastic errors is made because empirically this is often the case for cross-section
regression. Furthermore, relaxing the homoskedasticity assumption is not costly as it
is possible to obtain valid standard errors for the OLS estimator even if the functional
form for the heteroskedasticity is unknown.
The term conditionally heteroskedastic is used for the following reason. Even if
(yi , xi ) are iid, as is the case for simple random sampling, once we condition on xi
the conditional mean and conditional variance can vary with xi . Similarly, the errors
$u_i = y_i - x_i'\beta$ are iid under simple random sampling, and they are therefore uncon-
ditionally homoskedastic. Once we condition on xi , and consider the distribution of
ui conditional on xi , the variance of this conditional distribution is permitted to vary
with xi .
Limit Variance Matrix of $N^{-1/2}X'u$
Assumption 6 is needed to obtain the limit variance matrix of $N^{-1/2}X'u$. If the regressors are independent of the errors, a stronger assumption than that made in Assumption 4, then Assumption 5, that $E[|u_i^2|^{1+\delta}] < \infty$, and Assumption 3, that $E[|x_{ij}x_{ik}|^{1+\delta}] < \infty$, imply the Assumption 6 condition that $E[|u_i^2 x_{ij} x_{ik}|^{1+\delta}] < \infty$.
We have deliberately not made a seventh assumption, that the error u is normally
distributed conditional on X. An assumption such as normality is needed to obtain the
exact small-sample distribution of the OLS estimator. However, we focus on asymp-
totic methods throughout this book, because exact small-sample distributional results
are rarely available for the estimators used in microeconometrics, and then the normal-
ity assumption is no longer needed.
4.4.8. Derivations for the OLS Estimator
Here we present both small-sample and limit distributions of the OLS estimator and
justify White’s estimator of the variance matrix of the OLS estimator under Assump-
tions 1–6.
Small-Sample Distribution
The parameter β is identified under Assumptions 1–4 since then E[y|X] = Xβ and X
has rank K.
In small samples the OLS estimator is unbiased under Assumptions 1–4 and its vari-
ance matrix is easily obtained given Assumption 5. These results are obtained by using
the law of iterated expectations to first take expectation with respect to u conditional
on X and then take the unconditional expectation. Then from (4.11)
$$\begin{aligned} E[\hat\beta_{\mathrm{OLS}}] &= \beta + E_{X,u}\left[(X'X)^{-1}X'u\right] \\ &= \beta + E_{X}\left[E_{u|X}\left[(X'X)^{-1}X'u \,\middle|\, X\right]\right] \\ &= \beta + E_{X}\left[(X'X)^{-1}X'\,E_{u|X}[u|X]\right] \\ &= \beta, \end{aligned} \tag{4.24}$$
using the law of iterated expectations (Theorem A.23) and given Assumptions 1 and
4, which together imply that E[u|X] = 0. Similarly, (4.11) yields
$$V[\hat\beta_{\mathrm{OLS}}] = E_X\left[(X'X)^{-1}X'\Omega X(X'X)^{-1}\right], \tag{4.25}$$
given Assumption 5, where $E[uu'|X] = \Omega$ and we use Theorem A.23, which tells us that in general
$$V_{X,u}[g(X,u)] = E_X\left[V_{u|X}[g(X,u)]\right] + V_X\left[E_{u|X}[g(X,u)]\right].$$
This simplifies here as the second term is zero, since $E_{u|X}[(X'X)^{-1}X'u] = 0$.
The OLS estimator is therefore unbiased if E[u|X] = 0. This valuable property
generally does not extend to nonlinear estimators. Most nonlinear estimators, such
as nonlinear least squares, are biased and even linear estimators such as instrumental
variables estimators can be biased. The OLS estimator is inefficient, as its variance
is not the smallest possible variance matrix among linear unbiased estimators, unless
$\Omega = \sigma^2 I$. The inefficiency of OLS provides motivation for more efficient estimators
such as generalized least squares, though the efficiency loss of OLS is not necessarily
great. Under the additional assumption of normality of the errors conditional on X, an
assumption not usually made in microeconometrics applications, the OLS estimator is
normally distributed conditional on X.
Consistency
The term $\operatorname{plim}\left(N^{-1}X'X\right)^{-1} = M_{xx}^{-1}$ since $\operatorname{plim} N^{-1}X'X = M_{xx}$ by Assumption 3. Consistency then requires that condition (4.13) holds. This is established using a law of large numbers applied to the average $N^{-1}X'u = N^{-1}\sum_i x_i u_i$, which converges in probability to zero if $E[x_i u_i] = 0$. Given Assumptions 1 and 2, the $x_i u_i$ are inid, and Assumptions 1–5 permit use of the Markov LLN (Theorem A.9). If Assumption 1 is simplified to $(y_i, x_i)$ iid, then the $x_i u_i$ are iid and Assumptions 1–4 permit simpler use of the Kolmogorov LLN (Theorem A.8).
Limit Distribution
By Assumption 3, $\operatorname{plim}\left(N^{-1}X'X\right)^{-1} = M_{xx}^{-1}$. The key is to obtain the limit distribution of $N^{-1/2}X'u = N^{-1/2}\sum_i x_i u_i$ by application of a central limit theorem. Given Assumptions 1 and 2, the $x_i u_i$ are inid, and Assumptions 1–6 permit use of the Liapounov CLT (Theorem A.15). If Assumption 1 is strengthened to $(y_i, x_i)$ iid, then the $x_i u_i$ are iid and Assumptions 1–5 permit simpler use of the Lindeberg–Levy CLT (Theorem A.14).
This yields
$$N^{-1/2}X'u \stackrel{d}{\to} \mathcal{N}[0, M_{x\Omega x}], \tag{4.26}$$
where $M_{x\Omega x} = \operatorname{plim} N^{-1}X'uu'X = \operatorname{plim} N^{-1}\sum_i u_i^2 x_i x_i'$ given independence over i. Application of a law of large numbers yields $M_{x\Omega x} = \lim N^{-1}\sum_i E_{x_i}[\sigma_i^2 x_i x_i']$, using $E_{u_i,x_i}[u_i^2 x_i x_i'] = E_{x_i}[E[u_i^2|x_i]\, x_i x_i']$ and $\sigma_i^2 = E[u_i^2|x_i]$. It follows that $M_{x\Omega x} = \lim N^{-1}E[X'\Omega X]$, where $\Omega = \mathrm{Diag}[\sigma_i^2]$ and the expectation is with respect to X only, rather than both X and u.
The presentation here assumes independence over i. More generally we can permit correlated observations. Then $M_{x\Omega x} = \operatorname{plim} N^{-1}\sum_i\sum_j u_i u_j x_i x_j'$ and Ω has ijth entry $\sigma_{ij} = \operatorname{Cov}[u_i, u_j]$. This complication is deferred to the treatment of the nonlinear LS estimator in Section 5.8.
Heteroskedasticity-Robust Standard Errors
We consider the key step of consistent estimation of $M_{x\Omega x}$. Beginning with the original definition $M_{x\Omega x} = \operatorname{plim} N^{-1}\sum_{i=1}^{N} u_i^2 x_i x_i'$, we replace $u_i$ by $\hat{u}_i = y_i - x_i'\hat\beta$, where asymptotically $\hat{u}_i \stackrel{p}{\to} u_i$ since $\hat\beta \stackrel{p}{\to} \beta$. This yields the consistent estimate
$$\widehat{M}_{x\Omega x} = \frac{1}{N}\sum_{i=1}^{N} \hat{u}_i^2 x_i x_i' = N^{-1}X'\widehat{\Omega}X, \tag{4.27}$$
where $\widehat{\Omega} = \mathrm{Diag}[\hat{u}_i^2]$. The additional assumption that $E[|x_{ij}^2 x_{ik} x_{il}|^{1+\delta}] < \Delta$ for positive constants δ and Δ and $j, k, l = 1, \ldots, K$ is needed, as $\hat{u}_i^2 x_i x_i' = (u_i - x_i'(\hat\beta - \beta))^2 x_i x_i'$ involves up to the fourth power of $x_i$ (see White, 1980a).
Note that $\widehat{\Omega}$ does not converge to the N × N matrix Ω, a seemingly impossible task without additional structure as there are N variances $\sigma_i^2$ to be estimated. But all that is needed is that $N^{-1}X'\widehat{\Omega}X$ converges to the K × K matrix $\operatorname{plim} N^{-1}X'\Omega X = \operatorname{plim} N^{-1}\sum_i \sigma_i^2 x_i x_i'$. This is easier to achieve because the number of regressors K is fixed. To understand White's estimator, consider OLS estimation of the intercept-only model $y_i = \beta + u_i$ with heteroskedastic error. Then in our notation we can show that $\hat\beta = \bar{y}$, $M_{xx} = \lim N^{-1}\sum_i 1 = 1$, and $M_{x\Omega x} = \lim N^{-1}\sum_i E[u_i^2]$. An obvious estimator for $M_{x\Omega x}$ is $\widehat{M}_{x\Omega x} = N^{-1}\sum_i \hat{u}_i^2$, where $\hat{u}_i = y_i - \hat\beta$. To obtain the probability limit of this estimate, it is enough to consider $N^{-1}\sum_i u_i^2$, since $\hat{u}_i - u_i \stackrel{p}{\to} 0$ given $\hat\beta \stackrel{p}{\to} \beta$. If a law of large numbers can be applied, this average converges to the limit of its expected value, so $\operatorname{plim} N^{-1}\sum_i u_i^2 = \lim N^{-1}\sum_i E[u_i^2] = M_{x\Omega x}$ as desired. Eicker (1967) gave the formal conditions for this example.
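A minimal numerical sketch (ours) of this intercept-only case: with $x_i = 1$ the sandwich collapses to $\widehat{V}[\hat\beta] = N^{-1}\widehat{M}_{x\Omega x} = N^{-2}\sum_i \hat{u}_i^2$, the familiar robust estimate of the variance of a sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
sigma_i = rng.uniform(0.5, 3.0, N)        # heteroskedastic standard deviations
y = 2.0 + sigma_i * rng.normal(size=N)    # intercept-only model y_i = beta + u_i

beta_hat = y.mean()                       # OLS in the intercept-only model is the sample mean
uhat = y - beta_hat
MxOx_hat = np.mean(uhat**2)               # N^{-1} sum uhat_i^2, White's estimate with x_i = 1
se_robust = np.sqrt(MxOx_hat / N)         # sandwich: Mxx = 1, so V-hat = MxOx_hat / N

print("beta_hat:", beta_hat, " robust se:", se_robust)
```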
4.5. Weighted Least Squares
If robust standard errors need to be used, efficiency gains are usually possible. For
example, if heteroskedasticity is present then the feasible generalized least-squares
(GLS) estimator is more efficient than the OLS estimator.
In this section we present the feasible GLS estimator, an estimator that makes
stronger distributional assumptions about the variance of the error term. It is nonethe-
less possible to obtain standard errors of the feasible GLS estimator that are robust to
misspecification of the error variance, just as in the OLS case.
Many studies in microeconometrics do not take advantage of the potential efficiency
gains of GLS, for reasons of convenience and because the efficiency gains may be felt
to be relatively small. Instead, it is common to use less efficient weighted least-squares
estimators, most notably OLS, with robust estimates of the standard errors.
4.5.1. GLS and Feasible GLS
By the Gauss–Markov theorem, presented in introductory texts, the OLS estimator is
efficient among linear unbiased estimators if the linear regression model errors are
independent and homoskedastic.
Instead, we assume that the error variance matrix $\Omega \neq \sigma^2 I$. If Ω is known and nonsingular, we can premultiply the linear regression model (4.8) by $\Omega^{-1/2}$, where $\Omega^{1/2}\Omega^{1/2} = \Omega$, to yield
$$\Omega^{-1/2}y = \Omega^{-1/2}X\beta + \Omega^{-1/2}u.$$
Some algebra yields $V[\Omega^{-1/2}u] = E[(\Omega^{-1/2}u)(\Omega^{-1/2}u)'|X] = I$. The errors in this transformed model are therefore zero mean, uncorrelated, and homoskedastic. So β can be efficiently estimated by OLS regression of $\Omega^{-1/2}y$ on $\Omega^{-1/2}X$.
This argument yields the generalized least-squares estimator
$$\hat{\beta}_{\mathrm{GLS}} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y. \tag{4.28}$$
The GLS estimator cannot be directly implemented because in practice Ω is not known. Instead, we specify $\Omega = \Omega(\gamma)$, where γ is a finite-dimensional parameter vector, obtain a consistent estimate $\hat\gamma$ of γ, and form $\widehat{\Omega} = \Omega(\hat\gamma)$. For example, if errors are heteroskedastic then specify $V[u|x] = \exp(z'\gamma)$, where z is a subset of x and the exponential function is used to ensure a positive variance. Then γ can be consistently estimated by nonlinear least-squares regression (see Section 5.8) of the squared OLS residual $\hat{u}_i^2 = (y_i - x_i'\hat\beta_{\mathrm{OLS}})^2$ on $\exp(z_i'\gamma)$. This estimate $\widehat{\Omega}$ can be used in place of Ω in (4.28). Note that we cannot replace Ω in (4.28) by $\widehat{\Omega} = \mathrm{Diag}[\hat{u}_i^2]$ as this yields an inconsistent estimator (see Section 5.8.6).
The feasible generalized least-squares (FGLS) estimator is
$$\hat{\beta}_{\mathrm{FGLS}} = (X'\widehat{\Omega}^{-1}X)^{-1}X'\widehat{\Omega}^{-1}y. \tag{4.29}$$
If Assumptions 1–6 hold, Ω(γ) is correctly specified (a strong assumption that is relaxed in the following), and $\hat\gamma$ is consistent for γ, it can be shown that
$$\sqrt{N}(\hat{\beta}_{\mathrm{FGLS}} - \beta) \stackrel{d}{\to} \mathcal{N}\left[0,\;\left(\operatorname{plim} N^{-1}X'\Omega^{-1}X\right)^{-1}\right]. \tag{4.30}$$
The FGLS estimator has the same limiting variance matrix as the GLS estimator and so is second-moment efficient. For implementation replace Ω by $\widehat{\Omega}$ in (4.30).
It can be shown that the GLS estimator minimizes $u'\Omega^{-1}u$ (see Exercise 4.5), which simplifies to $\sum_i u_i^2/\sigma_i^2$ if errors are heteroskedastic but uncorrelated. The motivation provided for GLS was efficient estimation of β. In terms of the Section 4.2 discussion of loss functions and optimal prediction, with heteroskedastic errors the loss function is $L(e) = e^2/\sigma^2$. Compared to OLS with $L(e) = e^2$, the GLS loss function places a relatively smaller penalty on the prediction error for observations with large conditional error variance.
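As a sketch of the feasible GLS procedure described above (our own simulated example, using the exponential skedastic specification $V[u|x] = \exp(z'\gamma)$ mentioned in the text; the data and variable names are hypothetical):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
N = 1000
x = rng.normal(0, 1, N)
X = np.column_stack([np.ones(N), x])
Z = X                                                   # skedastic function uses the same variables here
gamma_true = np.array([-0.5, 1.0])
u = np.exp(0.5 * Z @ gamma_true) * rng.normal(size=N)   # so V[u|x] = exp(z'gamma)
y = X @ np.array([1.0, 2.0]) + u

# Step 1: OLS and squared residuals
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
uhat2 = (y - X @ b_ols) ** 2

# Step 2: NLS of uhat^2 on exp(z'gamma) to estimate the skedastic function
res = least_squares(lambda g: uhat2 - np.exp(Z @ g), x0=np.zeros(2))
sigma2_hat = np.exp(Z @ res.x)                          # fitted variances, positive by construction

# Step 3: FGLS = OLS on data weighted by 1/sigma_hat, equivalent to (4.29) with diagonal Omega-hat
w = 1.0 / np.sqrt(sigma2_hat)
Xw, yw = X * w[:, None], y * w
b_fgls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
V_fgls = np.linalg.inv(Xw.T @ Xw)                       # (X' Omega-hat^{-1} X)^{-1}, as in (4.30)

print("OLS:", b_ols, " FGLS:", b_fgls, " FGLS se:", np.sqrt(np.diag(V_fgls)))
```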
4.5.2. Weighted Least Squares
The result in (4.30) assumes correct specification of the error variance matrix Ω(γ).
If instead Ω(γ) is misspecified then the FGLS estimator is still consistent, but (4.30)
gives the wrong variance. Fortunately, a robust estimate of the variance of the GLS
estimator can be found even if Ω(γ) is misspecified.
Specifically, define Σ = Σ(γ) to be a working variance matrix that does not nec-
essarily equal the true variance matrix $\Omega = E[uu'|X]$. Form an estimate $\widehat\Sigma = \Sigma(\hat\gamma)$, where $\hat\gamma$ is an estimate of γ. Then use weighted least squares with weighting matrix $\widehat\Sigma^{-1}$.

Table 4.2. Least-Squares Estimators and Their Asymptotic Variance

| Estimator^a | Definition | Estimated Asymptotic Variance |
|---|---|---|
| OLS | $\hat\beta = (X'X)^{-1}X'y$ | $(X'X)^{-1}X'\widehat\Omega X(X'X)^{-1}$ |
| FGLS | $\hat\beta = (X'\widehat\Omega^{-1}X)^{-1}X'\widehat\Omega^{-1}y$ | $(X'\widehat\Omega^{-1}X)^{-1}$ |
| WLS | $\hat\beta = (X'\widehat\Sigma^{-1}X)^{-1}X'\widehat\Sigma^{-1}y$ | $(X'\widehat\Sigma^{-1}X)^{-1}X'\widehat\Sigma^{-1}\widehat\Omega\widehat\Sigma^{-1}X(X'\widehat\Sigma^{-1}X)^{-1}$ |

^a Estimators are for the linear regression model with error conditional variance matrix $\Omega$. For FGLS it is assumed that $\widehat\Omega$ is consistent for $\Omega$. For OLS and WLS the heteroskedasticity-robust variance matrix of $\hat\beta$ uses $\widehat\Omega$ equal to a diagonal matrix with squared residuals on the diagonal.
This yields the weighted least-squares (WLS) estimator
$$\hat{\beta}_{\mathrm{WLS}} = (X'\widehat\Sigma^{-1}X)^{-1}X'\widehat\Sigma^{-1}y. \tag{4.31}$$
Statistical inference is then done without the assumption that Σ = Ω, the true variance
matrix of the error term. In the statistics literature this approach is referred to as a
working matrix approach. We call it weighted least squares, but be aware that others
instead use weighted least squares to mean GLS or FGLS in the special case that $\Omega^{-1}$ is diagonal. Here there is no presumption that the weighting matrix $\Sigma^{-1} = \Omega^{-1}$.
Similar algebra to that for OLS given in Section 4.4.5 yields the estimated asymptotic variance matrix
$$\widehat{V}[\hat{\beta}_{\mathrm{WLS}}] = (X'\widehat\Sigma^{-1}X)^{-1}X'\widehat\Sigma^{-1}\widehat\Omega\widehat\Sigma^{-1}X(X'\widehat\Sigma^{-1}X)^{-1}, \tag{4.32}$$
where $\widehat\Omega$ is such that
$$\operatorname{plim} N^{-1}X'\widehat\Sigma^{-1}\widehat\Omega\widehat\Sigma^{-1}X = \operatorname{plim} N^{-1}X'\Sigma^{-1}\Omega\Sigma^{-1}X.$$
In the heteroskedastic case $\widehat\Omega = \mathrm{Diag}[\hat{u}_i^{*2}]$, where $\hat{u}_i^* = y_i - x_i'\hat\beta_{\mathrm{WLS}}$.
For heteroskedastic errors the basic approach is to choose a simple model for het-
eroskedasticity such as error variance depending on only one or two key regressors. For
example, in a linear regression model of the level of wages as a function of schooling
and other variables, the heteroskedasticity might be modeled as a function of school-
ing alone. Suppose this model yields $\widehat\Sigma = \mathrm{Diag}[\hat\sigma_i^2]$. Then OLS regression of $y_i/\hat\sigma_i$ on $x_i/\hat\sigma_i$ (with the no-constant option) yields $\hat\beta_{\mathrm{WLS}}$, and the White robust standard errors from this regression can be shown to equal those based on (4.32).
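A sketch of this working-matrix calculation (ours, on simulated data with a deliberately misspecified working variance; names are hypothetical): the WLS estimator (4.31) is obtained by weighted regression and its robust variance by the sandwich (4.32).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2000
x = rng.normal(0, 5, N)
X = np.column_stack([np.ones(N), x])
u = x * rng.normal(0, 2, N)                    # true conditional variance 4 x^2
y = X @ np.array([1.0, 1.0]) + u

# Working model: assume V[u|x] = sigma^2 |x| (misspecified), so Sigma-hat = Diag[|x_i|]
sig = np.sqrt(np.abs(x) + 1e-12)               # sigma_i-hat; small constant guards tiny |x|
Xw, yw = X / sig[:, None], y / sig             # divide y, the intercept, and x by sigma_i-hat
b_wls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)  # WLS estimator (4.31) via transformed OLS

# Robust variance (4.32): sandwich with Omega-hat = Diag[u*_i^2], u*_i the WLS residual
ustar = y - X @ b_wls
A_inv = np.linalg.inv(Xw.T @ Xw)                    # (X' Sigma-hat^{-1} X)^{-1}
meat = (Xw * (ustar / sig)[:, None] ** 2).T @ Xw    # X' Sigma-hat^{-1} Omega-hat Sigma-hat^{-1} X
V_wls = A_inv @ meat @ A_inv

print("WLS slope:", b_wls[1], " robust se:", np.sqrt(V_wls[1, 1]))
```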
The weighted least-squares or working matrix approach is especially convenient
when there is more than one complication. For example, in the random effects panel
data model of Chapter 21 the errors may be viewed as both correlated over time for a
given individual and heteroskedastic. One may use the random effects estimator, which
controls only for the first complication, but then compute heteroskedastic-consistent
standard errors for this estimator.
The various least-squares estimators are summarized in Table 4.2.
Table 4.3. Least Squares: Example with Conditionally Heteroskedastic Errors^a

|          | OLS     | WLS     | GLS     |
|----------|---------|---------|---------|
| Constant | 2.213   | 1.060   | 0.996   |
|          | (0.823) | (0.150) | (0.007) |
|          | [0.820] | [0.051] | [0.006] |
| x        | 0.979   | 0.957   | 0.952   |
|          | (0.178) | (0.190) | (0.209) |
|          | [0.275] | [0.232] | [0.208] |
| $R^2$    | 0.236   | 0.205   | 0.174   |

^a Generated data for a sample size of 100. OLS, WLS, and GLS are all consistent but OLS and WLS are inefficient. Two different standard errors are given: default standard errors assuming homoskedastic errors in parentheses and heteroskedasticity-robust standard errors in square brackets. The data-generating process is given in the text.
4.5.3. Robust Standard Errors for LS Example
As an example of robust standard error estimation, consider estimation of the standard
error of least-squares estimates of the slope coefficient for a dgp with multiplicative
heteroskedasticity
y = 1 + 1 × x + u,
u = xε,
where the scalar regressor x ∼ N[0, 25] and ε ∼ N[0, 4].
The errors are conditionally heteroskedastic, since $V[u|x] = V[x\varepsilon|x] = x^2 V[\varepsilon|x] = 4x^2$, which depends on the regressor x. This differs from the unconditional variance, where $V[u] = V[x\varepsilon] = E[(x\varepsilon)^2] - (E[x\varepsilon])^2 = E[x^2]E[\varepsilon^2] = V[x]V[\varepsilon] = 100$, given that x and ε are independent under this particular dgp.
Standard errors for the OLS estimator should be calculated using the
heteroskedastic-consistent or robust variance estimate (4.21). Since OLS is not fully
efficient, WLS may provide efficiency gains. GLS will definitely provide efficiency
gains, and in this simulated-data example we have the advantage of knowing that $V[u|x] = 4x^2$. All estimation methods yield consistent estimates of the intercept and slope coefficients.
Various least-squares estimates and associated standard errors from a generated data
sample of size 100 are given in Table 4.3. We focus on the slope coefficient.
The OLS slope coefficient estimate is 0.979. Two standard error estimates are re-
ported, with the correct heteroskedasticity-robust standard error of 0.275 using (4.21)
much larger here than the incorrect estimate of 0.177 that uses $s^2(X'X)^{-1}$. Such a large difference in standard error estimates could lead to quite different conclusions in statistical inference. In general the bias in the standard errors could go in either direction. For this example it can be shown theoretically that, in the limit, the robust standard errors are $\sqrt{3}$ times larger than the incorrect ones. Specifically, for this dgp
and for sample size N the correct and incorrect standard errors of the OLS estimate of the slope coefficient converge to, respectively, $\sqrt{12/N}$ and $\sqrt{4/N}$.
As an example of the WLS estimator, assume that $u = \sqrt{|x|}\,\varepsilon$ rather than $u = x\varepsilon$, so that it is assumed that $V[u|x] = \sigma^2|x|$. The WLS estimator can be computed by OLS regression after dividing y, the intercept, and x by $\sqrt{|x|}$. Since this is the wrong model for the heteroskedastic error, the correct standard error for the slope coefficient is the robust estimate of 0.232, computed using (4.32).
The GLS estimator for this dgp can be computed by OLS regression after dividing y, the intercept, and x by |x|, since the transformed error is then homoskedastic. The usual and robust standard errors for the slope coefficient are similar (0.209 and 0.208). This is expected as both are asymptotically correct because the GLS estimator here uses the correct model for heteroskedasticity. It can be shown theoretically that for this dgp the standard error of the GLS estimate of the slope coefficient converges to $\sqrt{4/N}$.
Both OLS and WLS are less efficient than GLS, as expected, with standard errors for the slope coefficient of, respectively, 0.275 > 0.232 > 0.208.
The setup in this example is a standard one used in estimation theory for cross-
section data. Both y and x are stochastic random variables. The pair (yi , xi ) are inde-
pendent over i and identically distributed, as is the case under random sampling. The
conditional distribution of yi |xi differs over i, however, since the conditional mean and
variance of yi depend on xi .
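The generated-data example of this section can be reproduced along the following lines (a sketch; since the random draws differ, the estimates and standard errors will differ somewhat from those reported in Table 4.3).

```python
import numpy as np

def ols(X, y):
    """OLS with both default and White heteroskedasticity-robust standard errors."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ b
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    V_default = (u @ u / (N - K)) * XtX_inv                       # default s^2 (X'X)^{-1}
    V_robust = XtX_inv @ ((X * u[:, None] ** 2).T @ X) @ XtX_inv  # White sandwich (4.21)
    return b, np.sqrt(np.diag(V_default)), np.sqrt(np.diag(V_robust))

rng = np.random.default_rng(5)
N = 100
x = rng.normal(0, 5, N)                    # x ~ N[0, 25]
eps = rng.normal(0, 2, N)                  # eps ~ N[0, 4]
y = 1 + 1 * x + x * eps                    # dgp of Section 4.5.3
X = np.column_stack([np.ones(N), x])

for label, w in [("OLS", np.ones(N)),
                 ("WLS", np.sqrt(np.abs(x))),   # assumes V[u|x] = sigma^2 |x| (wrong model)
                 ("GLS", np.abs(x))]:           # correct model, V[u|x] = 4 x^2
    b, se_def, se_rob = ols(X / w[:, None], y / w)
    print(f"{label}: slope {b[1]:.3f}  default se {se_def[1]:.3f}  robust se {se_rob[1]:.3f}")
```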
4.6. Median and Quantile Regression
In an intercept-only model, summary statistics for the sample distribution include
quantiles, such as the median, lower and upper quartiles, and percentiles, in addition
to the sample mean.
In the regression context we might similarly be interested in conditional quantiles.
For example, interest may lie in how the percentiles of the earnings distribution for less-educated workers are much more compressed than those for highly educated workers. In this simple example one can just do separate computations for less-educated workers and for highly educated workers. However, this approach becomes
infeasible if there are several regressors taking several values. Instead, quantile regres-
sion methods are needed to estimate the quantiles of the conditional distribution of y
given x.
From Table 4.1, quantile regression corresponds to use of asymmetric absolute loss,
whereas the special case of median regression uses absolute error loss. These methods
provide an alternative to OLS, which uses squared error loss.
Quantile regression methods have advantages beyond providing a richer charac-
terization of the data. Median regression is more robust to outliers than least-squares
regression. Moreover, quantile regression estimators can be consistent under weaker
stochastic assumptions than possible with least-squares estimation. Leading examples
are the maximum score estimator of Manski (1975) for binary outcome models (see
Section 14.6) and the censored least absolute deviations estimator of Powell (1984) for
censored models (see Section 16.6).
We begin with a brief explanation of population quantiles before turning to estima-
tion of sample quantiles.
4.6.1. Population Quantiles
For a continuous random variable y, the population qth quantile is that value µq such
that y is less than or equal to µq with probability q. Thus
q = Pr[y ≤ µq] = Fy(µq),
where Fy is the cumulative distribution function (cdf) of y. For example, if µ0.75 = 3
then the probability that y ≤ 3 equals 0.75. It follows that
$$\mu_q = F_y^{-1}(q).$$
Leading examples are the median, q = 0.5, the upper quartile, q = 0.75, and the lower
quartile, q = 0.25. For the standard normal distribution µ0.5 = 0.0, µ0.95 = 1.645, and
µ0.975 = 1.960. The 100qth percentile is the qth quantile.
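For example, assuming SciPy is available, the standard normal quantiles just quoted are recovered directly from the inverse cdf $F_y^{-1}(q)$:

```python
from scipy.stats import norm

for q in (0.5, 0.95, 0.975):
    print(q, norm.ppf(q))   # mu_q = F^{-1}(q): 0.0, 1.645, 1.960
```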
For the regression model, the population qth quantile of y conditional on x is
that function µq(x) such that y conditional on x is less than or equal to µq(x) with
probability q, where the probability is evaluated using the conditional distribution of
y given x. It follows that
$$\mu_q(x) = F_{y|x}^{-1}(q), \tag{4.33}$$
where $F_{y|x}$ is the conditional cdf of y given x and we have suppressed the role of the
parameters of this distribution.
It is insightful to derive the quantile function $\mu_q(x)$ if the dgp is assumed to be the linear model with multiplicative heteroskedasticity
$$y = x'\beta + u, \qquad u = x'\alpha \times \varepsilon, \qquad \varepsilon \sim \text{iid}\,[0, \sigma^2],$$
where it is assumed that $x'\alpha > 0$. Then the population qth quantile of y conditional on x is that function $\mu_q(x, \beta, \alpha)$ such that
$$\begin{aligned} q &= \Pr[y \le \mu_q(x,\beta,\alpha)] \\ &= \Pr\left[u \le \mu_q(x,\beta,\alpha) - x'\beta\right] \\ &= \Pr\left[\varepsilon \le \left(\mu_q(x,\beta,\alpha) - x'\beta\right)/x'\alpha\right] \\ &= F_\varepsilon\left(\left(\mu_q(x,\beta,\alpha) - x'\beta\right)/x'\alpha\right), \end{aligned}$$
where we use $u = y - x'\beta$ and $\varepsilon = u/x'\alpha$, and $F_\varepsilon$ is the cdf of ε. It follows that $\left(\mu_q(x,\beta,\alpha) - x'\beta\right)/x'\alpha = F_\varepsilon^{-1}(q)$, so that
$$\mu_q(x,\beta,\alpha) = x'\beta + x'\alpha \times F_\varepsilon^{-1}(q) = x'\left(\beta + \alpha \times F_\varepsilon^{-1}(q)\right).$$
Thus for the linear model with multiplicative heteroskedasticity of the form $u = x'\alpha \times \varepsilon$ the conditional quantiles are linear in x. In the special case of homoskedasticity, $x'\alpha$ equals a constant and all conditional quantiles have the same slope and differ only in their intercept, which becomes larger as q increases.
In more general examples the quantile function may be nonlinear in x, owing to
other forms of heteroskedasticity such as u = h(x, α) × ε, where h(·) is nonlinear in
x, or because the regression function itself is of nonlinear form g(x, β). It is standard
to still estimate quantile functions that are linear and interpret them as the best lin-
ear predictor under the quantile regression loss function given in (4.34) in the next
section.
4.6.2. Sample Quantiles
For a univariate random variable y the usual way to obtain the sample quantile estimate is to first order the sample. Then $\hat\mu_q$ equals the [Nq]th smallest value, where N is the
sample size and [Nq] denotes Nq rounded up to the nearest integer. For example, if
N = 97, the lower quartile is the 25th observation since [97 × 0.25] = [24.25] = 25.
Koenker and Bassett (1978) observed that the sample qth quantile $\hat\mu_q$ can equivalently be expressed as the solution to the optimization problem of minimizing with respect to β
$$\sum_{i: y_i \ge \beta} q\,|y_i - \beta| + \sum_{i: y_i < \beta} (1-q)\,|y_i - \beta|.$$
This result is not obvious. To gain some understanding, consider the median, where q = 0.5. Then the median is the minimizer of $\sum_i |y_i - \beta|$. Suppose in a sample of 99 observations that the 50th smallest observation, the median, equals 10 and the 51st smallest observation equals 12. If we let β equal 12 rather than 10, then $\sum_i |y_i - \beta|$ will increase by 2 for each of the first 50 ordered observations and decrease by 2 for each of the remaining 49 observations, leading to an overall net increase of $50 \times 2 - 49 \times 2 = 2$. So the 51st smallest observation is a worse choice than the 50th. Similarly, the 49th smallest observation can be shown to be a worse choice than the 50th observation.
This objective function is readily extended to the linear regression case, so that the qth quantile regression estimator $\hat\beta_q$ minimizes over $\beta_q$
$$Q_N(\beta_q) = \sum_{i: y_i \ge x_i'\beta_q} q\,|y_i - x_i'\beta_q| + \sum_{i: y_i < x_i'\beta_q} (1-q)\,|y_i - x_i'\beta_q|, \tag{4.34}$$
where we use $\beta_q$ rather than β to make clear that different choices of q estimate different values of β. Note that this is the asymmetric absolute loss function given in Table 4.1, where $\hat{y}$ is restricted to be linear in x so that $e = y - x'\beta_q$. The special case q = 0.5 is called the median regression estimator or the least absolute deviations estimator.
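As an illustrative sketch (ours), the objective (4.34) can be coded directly and evaluated at candidate coefficient vectors; the fitting itself is done here with the QuantReg class in statsmodels, which implements the Koenker–Bassett quantile regression estimator. The dgp and names below are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def check_loss(beta, y, X, q):
    """Objective Q_N(beta_q) in (4.34): asymmetric absolute loss."""
    e = y - X @ beta
    return np.sum(np.where(e >= 0, q * np.abs(e), (1 - q) * np.abs(e)))

rng = np.random.default_rng(6)
N = 2000
x = rng.uniform(0, 10, N)
X = np.column_stack([np.ones(N), x])
y = 1 + 0.5 * x + (1 + 0.3 * x) * rng.normal(size=N)   # multiplicative heteroskedasticity

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
for q in (0.25, 0.5, 0.75):
    b_q = np.asarray(sm.QuantReg(y, X).fit(q=q).params)
    print(f"q={q}: quantile slope {b_q[1]:.3f}; "
          f"Q_N at quantile fit {check_loss(b_q, y, X, q):.1f}, "
          f"at OLS fit {check_loss(b_ols, y, X, q):.1f}")
```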
4.6.3. Properties of Quantile Regression Estimators
The objective function (4.34) is not differentiable, and so the gradient optimization methods presented in Chapter 10 are not applicable. Fortunately, linear programming methods can be used, and these provide relatively fast computation of $\hat\beta_q$.
Since there is no explicit solution for $\hat\beta_q$, the asymptotic distribution of $\hat\beta_q$ cannot be obtained using the approach of Section 4.4 for OLS. The methods of Chapter 5 also require adaptation, as the objective function is nondifferentiable. It can be shown that
$$\sqrt{N}(\hat\beta_q - \beta_q) \stackrel{d}{\to} \mathcal{N}\left[0,\; A^{-1}BA^{-1}\right], \tag{4.35}$$
(see, for example, Buchinsky, 1998, p. 85), where
$$A = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N} f_{u_q}(0|x_i)\, x_i x_i', \qquad B = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N} q(1-q)\, x_i x_i', \tag{4.36}$$
and $f_{u_q}(0|x)$ is the conditional density of the error term $u_q = y - x'\beta_q$ evaluated at $u_q = 0$. Estimation of the variance of $\hat\beta_q$ is complicated by the need to estimate $f_{u_q}(0|x)$. It is easier to instead obtain standard errors for $\hat\beta_q$ using the bootstrap pairs procedure of Chapter 11.
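A sketch of the bootstrap-pairs procedure for the standard errors of $\hat\beta_q$ (our own implementation on simulated data, again using statsmodels' QuantReg): resample the pairs $(y_i, x_i)$ with replacement, re-estimate, and take the standard deviation across bootstrap replications.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
N, B, q = 500, 200, 0.5
x = rng.uniform(0, 10, N)
X = np.column_stack([np.ones(N), x])
y = 1 + 0.5 * x + (1 + 0.3 * x) * rng.normal(size=N)

b_q = np.asarray(sm.QuantReg(y, X).fit(q=q).params)

# Bootstrap pairs: resample (y_i, x_i) jointly with replacement and re-estimate each time
boot = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, N, N)
    boot[b] = np.asarray(sm.QuantReg(y[idx], X[idx]).fit(q=q).params)

se_boot = boot.std(axis=0, ddof=1)
print("median-regression coefficients :", b_q)
print("bootstrap-pairs standard errors:", se_boot)
```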
4.6.4. Quantile Regression Example
In this section we perform conditional quantile estimation and compare it with the
usual conditional mean estimation using OLS regression. The application involves En-
gel curve estimation for household annual medical expenditure. More specifically, we
consider the regression relationship between the log of medical expenditure and the
log of total household expenditure. This regression yields an estimate of the (constant)
elasticity of medical expenditure with respect to total expenditure.
The data are from the World Bank’s 1997 Vietnam Living Standards Survey. The
sample consists of 5,006 households that have a positive level of medical expenditures, after dropping the 16.6% of the sample with zero expenditures to permit taking the
natural logarithm. Zero values can be handled using the censored quantile regression
methods of Powell (1986a), presented in Section 16.9.2. For simplicity we simply
dropped observations with zero expenditures. The largest component of medical ex-
penditure, especially at low levels of income, consists of medications purchased from
pharmacies. Although several household characteristic variables are available, for sim-
plicity we only consider one regressor, the log of total household expenditure, to serve
as a proxy for household income.
The linear least-squares regression yields an elasticity estimate of 0.57. This esti-
mate would usually be interpreted to mean that medicines are a "necessity" and hence
their demand is income inelastic. This estimate is not very surprising, but before ac-
cepting it at face value we should acknowledge that there may be considerable hetero-
geneity in the elasticity across different income groups.
Figure 4.1: Quantile regression estimates of slope coefficient for q = 0.05, 0.10, . . . ,
0.90, 0.95 and associated 95% confidence bands plotted against q from regression of the
natural logarithm of medical expenditure on the natural logarithm of total expenditure.
Quantile regression is a useful tool for studying such heterogeneity, as emphasized
by Koenker and Hallock (2001). We minimize the quantity (4.34), where y is log of
medical expenditure and x
β = β1 + β2x, where x is log of total household expendi-
ture. This is done for the nineteen quantile values q = {0.05, 0.10, ..., 0.95} , where
q = 0.5 is the median. In each case the standard errors were estimated using the boot-
strap method with 50 resamples. The results of this exercise are condensed into Fig-
ures 4.1 and 4.2.
Figure 4.1 plots the slope coefficient $\hat\beta_{2,q}$ for the different values of q, along with the associated 95% confidence interval. This shows how the quantile estimates of the elasticity vary with the quantile value q. The elasticity estimate increases systematically
with the level of household income, rising from 0.15 for q = 0.05 to a maximum of
0.80 for q = 0.85. The least-squares slope estimate of 0.57 is also presented as a hori-
zontal line that does not vary with quantile. The elasticity estimates at lower and higher
quantiles are clearly statistically significantly different from each other and from the
OLS estimate, which has standard error 0.032. It seems that the aggregate elasticity es-
timate will vary according to changes in the underlying income distribution. This graph
supports the observation of Mosteller and Tukey (1977, p. 236), quoted by Koenker
and Hallock (2001), that by focusing only on the conditional mean function the least-
squares regression gives an incomplete summary of the joint distribution of dependent
and explanatory variables.
Figure 4.2 superimposes three estimated quantile regression lines $\hat{y}_q = \hat\beta_{1,q} + \hat\beta_{2,q}x$, for q = 0.1, 0.5, and 0.9, on a plot of the data. The OLS regression line, not graphed, is similar to the median (q = 0.5) regression line. There is a fanning out
of the quantile regression lines in Figure 4.2. This is not surprising given the increase
[Figure 4.2 appears here: a scatterplot titled "Regression Lines as Quantile Varies" of log household medical expenditure against log household total expenditure, with the actual data and the fitted 10th percentile, median, and 90th percentile regression lines superimposed.]
Figure 4.2: Quantile regression estimated lines for q = 0.1, q = 0.5 and q = 0.9 from re-
gression of natural logarithm of medical expenditure on natural logarithm of total expenditure.
Data for 5006 Vietnamese households with positive medical expenditures in 1997.
in estimated slopes as q increases as evident in Figure 4.1. Koenker and Bassett (1982)
developed quantile regression as a means to test for heteroskedastic errors when the
dgp is the linear model. For such a case a fanning out of the quantile regression lines
is interpreted as evidence of heteroskedasticity. Another interpretation is that the con-
ditional mean is nonlinear in x with increasing slope and this leads to quantile slope
coefficients that increase with quantile q.
More detailed illustrations of quantile regression are given in Buchinsky (1994) and
Koenker and Hallock (2001).
4.7. Model Misspecification
The term “model misspecification” in its broadest sense means that one or more of the
assumptions made on the data generating process are incorrect. Misspecifications may
occur individually or in combination, but analysis is simpler if only the consequences
of a single misspecification are considered.
In the following discussion we emphasize misspecifications that lead to inconsis-
tency of the least-squares estimator and loss of identifiability of parameters of inter-
est. The least-squares estimator may nonetheless continue to have a meaningful inter-
pretation, only one different from that intended under the assumption of a correctly
specified model. Specifically, the estimator may converge asymptotically to a param-
eter that differs from the true population value, a concept defined in Section 4.7.5 as
the pseudo-true value.
The issues raised here for consistency of OLS are relevant to other estimators in
other models. Consistency can then require stronger assumptions than those needed
for consistency of OLS, so that inconsistency resulting from model misspecification is
more likely.
4.7.1. Inconsistency of OLS
The most serious consequence of model misspecification is inconsistent estimation of the regression parameters β. From Section 4.4, the two key conditions needed to demonstrate consistency of the OLS estimator are that (1) the dgp is $y = X\beta + u$ and (2) the dgp is such that $\operatorname{plim} N^{-1}X'u = 0$. Then
$$\hat\beta_{\mathrm{OLS}} = \beta + \left(N^{-1}X'X\right)^{-1}N^{-1}X'u \stackrel{p}{\to} \beta, \tag{4.37}$$
where the equality follows if $y = X\beta + u$ (see (4.12)) and the convergence in probability uses $\operatorname{plim} N^{-1}X'u = 0$.
The OLS estimator is likely to be inconsistent if model misspecification leads to
either specification of the wrong model for y, so that condition 1 is violated, or corre-
lation of regressors with the error, so that condition 2 is violated.
4.7.2. Functional Form Misspecification
A linear specification of the conditional mean function is merely an approximation in
$\mathbb{R}^K$ to the true unknown conditional mean function in a parameter space of indeterminate
dimension. Even if the correct regressors are chosen, it is possible that the conditional
mean is incorrectly specified.
Suppose the dgp is one with a nonlinear regression function
y = g(x) + v,
where the dependence of g(x) on unknown parameters is suppressed, and assume
E[v|x] = 0. The linear regression model
$$y = x'\beta + u$$
is erroneously specified. The question is whether the OLS estimator can be given any
meaningful interpretation, even though the dgp is in fact nonlinear.
The usual way to interpret regression coefficients is through the true micro relation-
ship, which here is
E[yi |xi ] = g(xi ).
In this case $\hat\beta_{\mathrm{OLS}}$ does not measure the micro response of $E[y_i|x_i]$ to a change in $x_i$, as it does not converge to $\partial g(x_i)/\partial x_i$. So the usual interpretation of $\hat\beta_{\mathrm{OLS}}$ is not possible.
White (1980b) showed that the OLS estimator converges to the value of β that minimizes the mean-squared prediction error
$$E_x\left[(g(x) - x'\beta)^2\right].$$
Hence prediction from OLS is the best linear predictor of the nonlinear regression
function if the mean-squared error is used as the loss function. This useful property
has already been noted in Section 4.2.3, but it adds little to the interpretation of $\hat\beta_{\mathrm{OLS}}$.
In summary, if the true regression function is nonlinear, OLS is not useful for indi-
vidual prediction. OLS can still be useful for prediction of aggregate changes, giving
the sample average change in E[y|x] due to change in x (see Stoker, 1982). However,
microeconometric analyses usually seek models that are meaningful at the individual
level.
Much of this book presents alternatives to the linear model that are more likely to
be correctly specified. For example, Chapter 14 on binary outcomes presents model
specifications that ensure that predicted probabilities are restricted to lie between 0
and 1. Also, models and methods that rely on minimal distributional assumptions are
preferred because there is then less scope for misspecification.
4.7.3. Endogeneity
Endogeneity is formally defined in Section 2.3. A broad definition is that a regressor
is endogenous when it is correlated with the error term. If any one regressor is en-
dogenous then in general OLS estimates of all regression parameters are inconsistent
(unless the exogenous regressor is uncorrelated with the endogenous regressor).
Leading examples of endogeneity, dealt with extensively in this book in both linear
and nonlinear model settings, include simultaneous equations bias (Section 2.4), omit-
ted variable bias (Section 4.7.4), sample selection bias (Section 16.5), and measure-
ment error bias (Chapter 26). Endogeneity is quite likely to occur when cross-section
observational data are used, and economists are very concerned with this complication.
A quite general approach to control for endogeneity is the instrumental variables
method, presented in Sections 4.8 and 4.9 and in Sections 6.4 and 6.5. This method
cannot always be applied, however, as necessary instruments may not be available.
Other methods to control for endogeneity, reviewed in Section 2.8, include con-
trol for confounding variables, differences in differences if repeated cross-section or
panel data are available (see Chapter 21), fixed effects if panel data are available and
endogeneity arises owing to a time-invariant omitted variable (see Section 21.6), and
regression-discontinuity design (see Section 25.6).
4.7.4. Omitted Variables
Omission of a variable in a linear regression equation is often the first example of
inconsistency of OLS presented in introductory courses. Such omission may be the
consequence of an erroneous exclusion of a variable for which data are available or of
exclusion of a variable that is not directly observed. For example, omission of ability in
a regression of earnings (or more usually its natural logarithm) on schooling is usually
due to unavailability of a comprehensive measure of ability.
Let the true dgp be
$$y = x'\beta + z\alpha + v, \tag{4.38}$$
where x and z are regressors, with z a scalar regressor for simplicity, and v is an error
term that is assumed to be uncorrelated with the regressors x and z. OLS estimation of
y on x and z will yield consistent parameter estimates of β and α.
Suppose instead that y is regressed on x alone, with z omitted owing to unavailabil-
ity. Then the term zα is moved into the error term. The estimated model is
$$y = x'\beta + (z\alpha + v), \tag{4.39}$$
where the error term is now (zα + v). As before v is uncorrelated with x, but if z is
correlated with x the error term (zα + v) will be correlated with the regressors x. The
OLS estimator will be inconsistent for β if z is correlated with x.
There is enough structure in this example to determine the direction of the inconsis-
tency. Stacking all observations in an obvious manner gives the dgp y = Xβ + zα + v.
Substituting this into $\hat\beta_{\mathrm{OLS}} = (X'X)^{-1}X'y$ yields
$$\hat\beta_{\mathrm{OLS}} = \beta + \left(N^{-1}X'X\right)^{-1}\left(N^{-1}X'z\right)\alpha + \left(N^{-1}X'X\right)^{-1}\left(N^{-1}X'v\right).$$
Under the usual assumption that X is uncorrelated with v, the final term has probability limit zero. X is correlated with z, however, and
$$\operatorname{plim}\hat\beta_{\mathrm{OLS}} = \beta + \delta\alpha, \tag{4.40}$$
where
$$\delta = \operatorname{plim}\left[\left(N^{-1}X'X\right)^{-1}\left(N^{-1}X'z\right)\right]$$
is the probability limit of the OLS estimator in the regression of the omitted regressor (z) on the included regressors (X).
This inconsistency is called omitted variables bias, where common terminology states that various misspecifications lead to bias even though formally they lead to inconsistency. The inconsistency exists as long as $\delta \neq 0$, that is, as long as the omitted variable is correlated with the included regressors. In general the inconsistency could be positive or negative and could even lead to a sign reversal of the OLS coefficient.
For the returns-to-schooling example, the correlation between schooling and ability is expected to be positive, so $\delta > 0$, and the return to ability is expected to be positive, so $\alpha > 0$. It follows that $\delta\alpha > 0$, so the omitted variables bias is positive in this example. OLS of earnings on schooling alone will overstate the effect of education on earnings.
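The result $\operatorname{plim}\hat\beta_{\mathrm{OLS}} = \beta + \delta\alpha$ in (4.40) is easily verified by simulation. The sketch below (ours, with hypothetical parameter values) mimics the schooling–ability example: the omitted regressor is positively correlated with the included one and has a positive coefficient, so the short regression overstates β.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 200_000
beta, alpha = 0.06, 0.10                          # hypothetical returns to schooling and to ability

ability = rng.normal(0, 1, N)                     # omitted regressor z
school = 12 + 2 * ability + rng.normal(0, 2, N)   # schooling correlated with ability
v = rng.normal(0, 0.5, N)
y = beta * school + alpha * ability + v           # true dgp (4.38)

X = np.column_stack([np.ones(N), school])
b_short = np.linalg.solve(X.T @ X, X.T @ y)[1]    # OLS slope omitting ability

delta = np.linalg.solve(X.T @ X, X.T @ ability)[1]  # slope of regression of omitted z on included x
print("short-regression slope:", round(b_short, 4))
print("beta + delta*alpha    :", round(beta + delta * alpha, 4))
```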
A related form of misspecification is inclusion of irrelevant regressors. For ex-
ample, the regression may be of y on x and z, even though the dgp is more simply
y = x
β + v. In this case it is straightforward to show that OLS is consistent, but there
is a loss of efficiency.
Controlling for omitted variables bias is necessary if parameter estimates are to be
given a causal interpretation. Since too many regressors cause little harm, but too few
regressors can lead to inconsistency, microeconometric models estimated from large
data sets tend to include many regressors. If omitted variables are still present then one
of the methods given at the end of Section 4.7.3 is needed.
4.7.5. Pseudo-True Value
In the omitted variables example the least-squares estimator is subject to confounding
in the sense that it does not estimate β, but instead estimates a function of β, δ, and α.
The OLS estimate cannot be used as an estimate of β, which, for example, measures
the effect of an exogenous change in a regressor x such as schooling holding all other
regressors including ability constant.
From (4.40), however, $\hat\beta_{\mathrm{OLS}}$ is a consistent estimator of the function $(\beta + \delta\alpha)$ and has a meaningful interpretation. The probability limit $\beta^* = (\beta + \delta\alpha)$ of $\hat\beta_{\mathrm{OLS}}$ is referred to as the pseudo-true value corresponding to $\hat\beta_{\mathrm{OLS}}$ (see Section 5.7.1 for a formal definition).
Furthermore, one can obtain the distribution of $\hat\beta_{\mathrm{OLS}}$ even though it is inconsistent for β. The estimated asymptotic variance of $\hat\beta_{\mathrm{OLS}}$ measures dispersion around $(\beta + \delta\alpha)$ and is given by the usual estimator, for example by $s^2(X'X)^{-1}$ if the error in (4.38) is homoskedastic.
4.7.6. Parameter Heterogeneity
The presentation to date has permitted regressors and error terms to vary across indi-
viduals but has restricted the regression parameters β to be the same across individuals.
Instead, suppose that the dgp is
$$y_i = x_i'\beta_i + u_i, \tag{4.41}$$
with subscript i on the parameters. This is an example of parameter heterogeneity,
where the marginal effect $\partial E[y_i|x_i]/\partial x_i = \beta_i$ is now permitted to differ across individuals.
The random coefficients model or random parameters model specifies βi to be
independently and identically distributed over i with distribution that does not depend
on the observables xi . Let the common mean of βi be denoted β. The dgp can be
rewritten as
$$y_i = x_i'\beta + \left(u_i + x_i'(\beta_i - \beta)\right),$$
and enough assumptions have been made to ensure that the regressors $x_i$ are uncorrelated with the error term $u_i + x_i'(\beta_i - \beta)$. OLS regression of y on x will therefore consistently estimate β, though note that the error is heteroskedastic even if $u_i$ is homoskedastic.
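A brief simulation sketch (ours) of this random coefficients case: OLS recovers the mean coefficient β, while the composite error is visibly heteroskedastic.

```python
import numpy as np

rng = np.random.default_rng(9)
N = 100_000
x = rng.normal(2, 1, N)
beta_i = 1.0 + rng.normal(0, 0.5, N)        # slope varies over i, mean slope beta = 1
u = rng.normal(0, 1, N)                     # homoskedastic idiosyncratic error
y = 0.5 + beta_i * x + u

X = np.column_stack([np.ones(N), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
print("OLS estimates (intercept, mean slope):", b)
# The composite error variance rises with x^2, so robust standard errors are appropriate
print("residual variance for |x|<1 vs |x|>3:",
      e[np.abs(x) < 1].var(), e[np.abs(x) > 3].var())
```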
For panel data a standard model is the random effects model (see Section 21.7) that
lets the intercept vary across individuals while the slope coefficients are not random.
For nonlinear models a similar result need not hold, and random parameter models
can be preferred as they permit a richer parameterization. Random parameter models
are consistent with existence of heterogeneous responses of individuals to changes in
x. A leading example is random parameters logit in Section 15.7.
More serious complications can arise when the regression parameters βi for an
individual are related to observed individual characteristics. Then OLS estimation can
lead to inconsistent parameter estimation. An example is the fixed effects model for
panel data (see Section 21.6) for which OLS estimation of y on x is inconsistent. In
this example, but not in all such examples, alternative consistent estimators for a subset
of the regression parameters are available.
4.8. Instrumental Variables
A major complication that is emphasized in microeconometrics is the possibility of
inconsistent parameter estimation caused by endogenous regressors. Then regression
estimates measure only the magnitude of association, rather than the magnitude and
direction of causation, both of which are needed for policy analysis.
The instrumental variables estimator provides a way to nonetheless obtain consis-
tent parameter estimates. This method, widely used in econometrics and rarely used
elsewhere, is conceptually difficult and easily misused.
We provide a lengthy expository treatment that defines an instrumental variable and
explains how the instrumental variables method works in a simple setting.
4.8.1. Inconsistency of OLS
Consider the scalar regression model with dependent variable y and single regressor x.
The goal of regression analysis is to estimate the conditional mean function E[y|x]. A
linear conditional mean model, without intercept for notational convenience, specifies
E[y|x] = βx. (4.42)
This model without intercept subsumes the model with intercept if dependent and
regressor variables are deviations from their respective means. Interest lies in obtaining
a consistent estimate of β as this gives the change in the conditional mean given an
exogenous change in x. For example, interest may lie in the effect on earnings caused by an increase in schooling attributable to exogenous reasons, such as an increase in the minimum age at which students leave school, that are not a choice of the individual.
The OLS regression model specifies
y = βx + u, (4.43)
where u is an error term. Regression of y on x yields the OLS estimate $\hat\beta$ of β.
Standard regression results make the assumption that the regressors are uncorrelated
with the errors in the model (4.43). Then the only effect of x on y is a direct effect via
the term βx. We have the following path analysis diagram:
x −→ y
     ↑
     u
where there is no association between x and u. So x and u are independent causes
of y.
However, in some situations there may be an association between regressors and
errors. For example, consider regression of log-earnings (y) on years of schooling (x).
The error term u embodies all factors other than schooling that determine earnings,
such as ability. Suppose a person has a high level of u, as a result of high (unobserved)
ability. This increases earnings, since y = βx + u, but it may also lead to higher lev-
els of x, since schooling is likely to be higher for those with high ability. A more
appropriate path diagram is then the following:
x −→ y
↑   ↗
  u
where now there is an association between x and u.
What are the consequences of this correlation between x and u? Now higher levels
of x have two effects on y. From (4.43) there is both a direct effect via βx and an
indirect effect via u affecting x, which in turn affects y. The goal of regression is
to estimate only the first effect, yielding an estimate of β. The OLS estimate will
instead combine these two effects, giving β̂ > β in this example where both effects are positive. Using calculus, we have y = βx + u(x) with total derivative

dy/dx = β + du/dx. (4.44)
The data give information on dy/dx, so OLS estimates the total effect β + du/dx
rather than β alone. The OLS estimator is therefore biased and inconsistent for β,
unless there is no association between x and u.
A more formal treatment of the linear regression model with K regressors leads to
the same conclusion. From Section 4.7.1 a necessary condition for consistency of OLS
is that plim N⁻¹X′u = 0. Consistency requires that the regressors are asymptotically uncorrelated with the errors. From (4.37) the magnitude of the inconsistency of OLS is (X′X)⁻¹X′u, the OLS coefficient from regression of u on x. This is just the OLS estimate of du/dx, confirming the intuitive result in (4.44).
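To make the algebra concrete, a small numpy sketch with a constructed dgp (the numbers are illustrative assumptions) shows that the OLS slope equals β plus the slope from regressing u on x, a quantity that is of course infeasible in practice because u is unobserved.

import numpy as np

rng = np.random.default_rng(1)
N = 200_000
ability = rng.normal(size=N)
x = ability + rng.normal(size=N)       # schooling-like regressor, depends on ability
u = ability + rng.normal(size=N)       # error contains ability, so Cov[x, u] > 0
beta = 0.5
y = beta * x + u

b_ols = (x @ y) / (x @ x)              # OLS of y on x (variables have mean close to zero)
du_dx = (x @ u) / (x @ x)              # slope from "regression of u on x"
print(b_ols, beta + du_dx)             # identical by construction; both near 0.5 + 0.5 = 1.0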
4.8.2. Instrumental Variable
The inconsistency of OLS is due to endogeneity of x, meaning that changes in x are
associated not only with changes in y but also with changes in the error u. What is needed
is a method to generate only exogenous variation in x. An obvious way is through a
randomized experiment, but for most economics applications such experiments are too
expensive or even infeasible.
Definition of an Instrument
A crude experimental or treatment approach is still possible using observational data,
provided there exists an instrument z that has the property that changes in z are asso-
ciated with changes in x but do not lead to change in y (aside from the indirect route
via x). This leads to the following path diagram:
z −→ x −→ y
     ↑   ↗
       u
which introduces a variable z that is causally associated with x but not u. It is still
the case that z and y will be correlated, but the only source of such correlation is the
indirect path of z being correlated with x, which in turn determines y. The more direct
path of z being a regressor in the model for y is ruled out.
More formally, a variable z is called an instrument or instrumental variable for
the regressor x in the scalar regression model y = βx + u if (1) z is uncorrelated with
the error u and (2) z is correlated with the regressor x.
The first assumption excludes the instrument z from being a regressor in the model
for y, since if instead y depended on both x and z, and y is regressed on x alone, then
z is being absorbed into the error so that z will then be correlated with the error. The
second assumption requires that there is some association between the instrument and
the variable being instrumented.
Examples of an Instrument
In many microeconometric applications it is difficult to find legitimate instruments.
Here we provide two examples.
Suppose we want to estimate the response of market demand to exogenous changes
in market price. Quantity demanded clearly depends on price, but prices are not ex-
ogenously given since they are determined in part by market demand. A suitable in-
strument for price is a variable that is correlated with price but does not directly affect
quantity demanded. An obvious candidate is a variable that affects market supply, since
this also affects prices but is not a direct determinant of demand. An example is a mea-
sure of favorable growing conditions if an agricultural product is being modeled. The
choice of instrument here is uncontroversial, provided favorable growing conditions
do not directly affect demand, and is helped greatly by the formal economic model of
supply and demand.
Next suppose we want to estimate the returns to exogenous changes in schooling.
Most observational data sets lack measures of individual ability, so regression of earn-
ings on schooling has error that includes unobserved ability and hence is correlated
with the regressor schooling. We need an instrument z that is correlated with school-
ing, uncorrelated with ability, and more generally uncorrelated with the error term,
which means that it cannot directly determine earnings.
One popular candidate for z is proximity to a college or university (Card, 1995).
This clearly satisfies condition 2 because, for example, people whose home is a long
way from a community college or state university are less likely to attend college. It most likely satisfies condition 1, though because people who live a long way from a college may be more likely to be in low-wage labor markets, one needs to estimate a multiple regression for y that includes additional regressors such as indicators for nonmetropolitan area.
A second candidate for the instrument is month of birth (Angrist and Krueger,
1991). This clearly satisfies condition 1 as there is no reason to believe that month
of birth has a direct effect on earnings if the regression includes age in years. Surpris-
ingly condition 2 may also be satisfied, as birth month determines age of first entry
into school in the USA, which in turn may affect years of schooling since laws often
specify a minimum school-leaving age. Bound, Jaeger, and Baker (1995) provide a
critique of this instrument.
The consequences of choosing poor instruments are considered in detail in Sec-
tion 4.9.
4.8.3. Instrumental Variables Estimator
For regression with scalar regressor x and scalar instrument z, the instrumental vari-
ables (IV) estimator is defined as

β̂IV = (z′x)⁻¹z′y, (4.45)

where, in the scalar regressor case, z, x, and y are N × 1 vectors. This estimator provides
a consistent estimator for the slope coefficient β in the linear model y = βx + u if z
is correlated with x and uncorrelated with the error term.
There are several ways to derive (4.45). We provide an intuitive derivation, one that
differs from derivations usually presented such as that in Section 6.2.5.
Return to the earnings–schooling example. Suppose a one-unit change in the in-
strument z is associated with 0.2 more years of schooling and with a $500 increase
in annual earnings. This increase in earnings is a consequence of the indirect effect
that increase in z led to increase in schooling, which in turn increases income. Then it
follows that 0.2 years additional schooling is associated with a $500 increase in earn-
ings, so that a one-year increase in schooling is associated with a $500/0.2 = $2,500
increase in earnings. The causal estimate of β is therefore 2,500. In mathematical
notation we have estimated the changes dx/dz and dy/dz and calculated the causal
estimator as
βIV = (dy/dz) / (dx/dz). (4.46)
This approach to identification of the causal parameter β is given in Heckman (2000,
p. 58); see also the example in Section 2.4.2.
All that remains is consistent estimation of dy/dz and dx/dz. The obvious way to
estimate dy/dz is by OLS regression of y on z, with slope estimate (z′z)⁻¹z′y. Similarly, estimate dx/dz by OLS regression of x on z, with slope estimate (z′z)⁻¹z′x. Then

β̂IV = [(z′z)⁻¹z′y] / [(z′z)⁻¹z′x] = (z′x)⁻¹z′y. (4.47)
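A short numpy sketch of the two equivalent calculations in (4.46) and (4.47); the dgp and parameter values below are illustrative assumptions chosen so that the causal slope is 2,500, echoing the earnings example.

import numpy as np

rng = np.random.default_rng(2)
N = 100_000
z = rng.normal(size=N)                        # instrument
u = rng.normal(size=N)
x = 0.2 * z + 0.8 * u + rng.normal(size=N)    # x is endogenous (correlated with u) and driven by z
y = 2500.0 * x + u

dy_dz = (z @ y) / (z @ z)                     # OLS slope of y on z
dx_dz = (z @ x) / (z @ z)                     # OLS slope of x on z
b_iv_ratio = dy_dz / dx_dz                    # equation (4.46)
b_iv_direct = (z @ y) / (z @ x)               # equation (4.45)/(4.47)
print(b_iv_ratio, b_iv_direct)                # identical, and close to 2500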
4.8.4. Wald Estimator
A leading simple example of IV is one where the instrument z is a binary instru-
ment. Denote the subsample averages of y and x by ȳ1 and x̄1, respectively, when
z = 1 and by ȳ0 and x̄0, respectively, when z = 0. Then ∆y/∆z = (ȳ1 − ȳ0) and ∆x/∆z = (x̄1 − x̄0), and (4.46) yields

β̂Wald = (ȳ1 − ȳ0) / (x̄1 − x̄0). (4.48)
This estimator is called the Wald estimator, after Wald (1940), or the grouping esti-
mator.
The Wald estimator can also be obtained from the formula (4.45). For the no-
intercept model variables are measured in deviations from means, so z′y = Σi (zi − z̄)(yi − ȳ). For binary z this yields z′y = N1(ȳ1 − ȳ) = N1N0(ȳ1 − ȳ0)/N, where N0 and N1 are the numbers of observations for which z = 0 and z = 1. This result uses ȳ1 − ȳ = (N0 ȳ1 + N1 ȳ1)/N − (N0 ȳ0 + N1 ȳ1)/N = N0(ȳ1 − ȳ0)/N. Similarly, z′x = N1N0(x̄1 − x̄0)/N. Combining these results, we have that (4.45) yields (4.48).
For the earnings–schooling example it is being assumed that we can define two
groups where group membership does not directly determine earnings, though it does
affect level of schooling and hence indirectly affects earnings. Then the IV estimate is
the difference in average earnings across the two groups divided by the difference in
average schooling across the two groups.
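A matching sketch for the Wald estimator (4.48) with a binary instrument; again the dgp is a constructed illustration.

import numpy as np

rng = np.random.default_rng(3)
N = 100_000
z = rng.integers(0, 2, N)                     # binary instrument, e.g. group membership
u = rng.normal(size=N)
x = 1.0 * z + 0.8 * u + rng.normal(size=N)    # x shifted up when z = 1, and endogenous
y = 0.5 * x + u

b_wald = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())
zd, xd, yd = z - z.mean(), x - x.mean(), y - y.mean()
b_iv = (zd @ yd) / (zd @ xd)                  # formula (4.45) in deviation-from-means form
print(b_wald, b_iv)                           # equal, and close to 0.5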
4.8.5. Sample Covariance and Correlation Analysis
The IV estimator can also be interpreted in terms of covariances or correlations.
For sample covariances we have directly from (4.45) that

β̂IV = Cov[z, y] / Cov[z, x], (4.49)

where here Cov[·] is being used to denote the sample covariance.
For sample correlations, note that the OLS estimator for the model (4.43) can be
written as β̂OLS = rxy √(y′y)/√(x′x), where rxy = x′y/√((x′x)(y′y)) is the sample correlation between x and y. This leads to the interpretation of the OLS estimator as implying that a one standard deviation change in x is associated with an rxy standard deviation change in y. The problem is that the correlation rxy is contaminated by correlation between x and u. An alternative approach is to measure the correlation between x and y indirectly by the correlation between z and y divided by the correlation between z and x. Then

β̂IV = (rzy/rzx) × √(y′y)/√(x′x), (4.50)

which can be shown to equal β̂IV in (4.45).
4.8.6. IV Estimation for Multiple Regression
Now consider the multiple regression model with typical observation
y = x′β + u,
with K regressor variables, so that x and β are K × 1 vectors.
Instruments
Assume the existence of an r × 1 vector of instruments z, with r ≥ K, satisfying the
following:
1. z is uncorrelated with the error u.
2. z is correlated with the regressor vector x.
3. z is strongly correlated, rather than weakly correlated, with the regressor vector x.
The first two properties are necessary for consistency and were presented earlier in
the scalar case. The third property, defined in Section 4.9.1, is a strengthening of the
second to ensure good finite-sample performance of the IV estimator.
In the multiple regression case z and x may share some common components.
Some components of x, called exogenous regressors, may be uncorrelated with u.
These components are clearly suitable instruments as they satisfy conditions 1 and
2. Other components of x, called endogenous regressors, may be correlated with u.
These components lead to inconsistency of OLS and are also clearly unsuitable in-
struments as they do not satisfy condition 1. Partition x into x = [x′1 x′2]′, where x1 contains the endogenous regressors and x2 contains the exogenous regressors. Then a valid instrument is z = [z′1 x′2]′, where x2 can be an instrument for itself, but we need to find at least as many instruments z1 as there are endogenous variables in x1.
Identification
Identification in a simultaneous equations model was presented in Section 2.5. Here we
have a single equation. The order condition requires that the number of instruments
must at least equal the number of independent endogenous components, so that r ≥ K.
The model is said to be just-identified if r = K and overidentified if r > K.
In many multiple regression applications there is only one endogenous regressor.
For example, the earnings on schooling regression will include many other regressors
such as age, geographic location, and family background. Interest lies in the coefficient
on schooling, but this is an endogenous variable most likely correlated with the error
because ability is unobserved. Possible candidates for the necessary single instrument
for schooling have already been given in Section 4.8.2.
If an instrument fails the first condition the instrument is an invalid instrument. If
an instrument fails the second condition the instrument is an irrelevant instrument,
and the model may be unidentified if too few instruments are relevant. The third con-
dition fails when very low correlation exists between the instrument and the endoge-
nous variable being instrumented. The model is said to be weakly identified and the
instrument is called a weak instrument.
Instrumental Variables Estimator
When the model is just-identified, so that r = K, the instrumental variables estima-
tor is the obvious matrix generalization of (4.45),

β̂IV = (Z′X)⁻¹Z′y, (4.51)

where Z is an N × K matrix with ith row z′i. Substituting the regression model y = Xβ + u for y in (4.51) yields

β̂IV = (Z′X)⁻¹Z′[Xβ + u]
    = β + (Z′X)⁻¹Z′u
    = β + (N⁻¹Z′X)⁻¹N⁻¹Z′u.

It follows immediately that the IV estimator is consistent if

plim N⁻¹Z′u = 0

and

plim N⁻¹Z′X ≠ 0.

These are essentially conditions 1 and 2 that z is uncorrelated with u and correlated with x. To ensure that the inverse of N⁻¹Z′X exists it is assumed that Z′X is of full rank K, a stronger assumption than the order condition that r = K.
With heteroskedastic errors the IV estimator is asymptotically normal with mean β
and variance matrix consistently estimated by

V̂[β̂IV] = (Z′X)⁻¹Z′Ω̂Z(X′Z)⁻¹, (4.52)

where Ω̂ = Diag[ûᵢ²]. This result is obtained in a manner similar to that for OLS given
in Section 4.4.4.
The IV estimator, although consistent, leads to a loss of efficiency that can be very
large in practice. Intuitively IV will not work well if the instrument z has low correla-
tion with the regressor x (see Section 4.9.3).
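A minimal numpy implementation of (4.51) together with a robust variance estimate along the lines of (4.52) may be useful as a reference; y, X, and Z are assumed to be arrays with X and Z of the same dimension N × K, with any exogenous regressors (and a constant) included in Z as their own instruments.

import numpy as np

def iv_just_identified(y, X, Z):
    # IV estimator (4.51) with a heteroskedasticity-robust sandwich variance
    ZX = Z.T @ X
    b = np.linalg.solve(ZX, Z.T @ y)              # (Z'X)^{-1} Z'y
    u = y - X @ b                                 # residuals
    ZOZ = (Z * u[:, None] ** 2).T @ Z             # Z' diag(u_i^2) Z
    ZX_inv = np.linalg.inv(ZX)
    V = ZX_inv @ ZOZ @ ZX_inv.T                   # sandwich estimate of the variance matrix
    return b, np.sqrt(np.diag(V))                 # coefficients and robust standard errors

For the generated-data example of Section 4.8.8 below, a call such as iv_just_identified(y, np.column_stack([np.ones(N), x]), np.column_stack([np.ones(N), z])) gives estimates comparable to the IV column of Table 4.4, up to simulation noise.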
4.8.7. Two-Stage Least Squares
The IV estimator in (4.51) requires that the number of instruments equals the number
of regressors. For overidentified models the IV estimator can be used, by discarding
some of the instruments so that the model is just-identified. However, an asymptotic
efficiency loss can occur when discarding these instruments.
Instead, a common procedure is to use the two-stage least-squares (2SLS) estima-
tor

β̂2SLS = [X′Z(Z′Z)⁻¹Z′X]⁻¹ X′Z(Z′Z)⁻¹Z′y, (4.53)
presented and motivated in Section 6.4.
The 2SLS estimator is an IV estimator. In a just-identified model it simplifies to
the IV estimator given in (4.51) with instruments Z. In an overidentified model the
2SLS estimator equals the IV estimator given in (4.51) if the instruments are X̂, where X̂ = Z(Z′Z)⁻¹Z′X is the predicted value of x from OLS regression of x on z.
The 2SLS estimator gets its name from the result that it can be obtained by two consecutive OLS regressions: OLS regression of x on z to get x̂, followed by OLS regression of y on x̂, which gives β̂2SLS. This interpretation does not necessarily generalize to nonlinear regressions; see Section 6.5.6.
The 2SLS estimator is often expressed more compactly as

β̂2SLS = (X′PZX)⁻¹ X′PZy, (4.54)

where

PZ = Z(Z′Z)⁻¹Z′

is an idempotent projection matrix that satisfies PZ = P′Z, PZP′Z = PZ, and PZZ = Z.
The 2SLS estimator can be shown to be asymptotically normally distributed with estimated asymptotic variance

V̂[β̂2SLS] = N (X′PZX)⁻¹ X′Z(Z′Z)⁻¹ Ŝ (Z′Z)⁻¹Z′X (X′PZX)⁻¹, (4.55)

where in the usual case of heteroskedastic errors Ŝ = N⁻¹ Σi ûᵢ² zi z′i and ûi = yi − x′iβ̂2SLS. A commonly used small-sample adjustment is to divide by N − K rather than N in the formula for Ŝ.
In the special case that errors are homoskedastic, simplification occurs and V̂[β̂2SLS] = s²[X′PZX]⁻¹. This latter result is given in many introductory treatments, but the more general formula (4.55) is preferred as the modern approach is to treat errors as potentially heteroskedastic.
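A numpy sketch of (4.53)–(4.55), written as a minimal reference implementation rather than a substitute for a full econometrics package; y, X, and Z are assumed to be an N-vector, an N × K regressor matrix, and an N × r instrument matrix with r ≥ K.

import numpy as np

def two_sls(y, X, Z):
    # 2SLS estimator (4.54) with the heteroskedasticity-robust variance (4.55)
    N = len(y)
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)      # P_Z X, the first-stage fitted values
    XPX = X.T @ Xhat                                  # X' P_Z X (P_Z symmetric and idempotent)
    b = np.linalg.solve(XPX, Xhat.T @ y)              # (X' P_Z X)^{-1} X' P_Z y
    u = y - X @ b                                     # residuals use X, not Xhat
    S = (Z * u[:, None] ** 2).T @ Z / N               # S_hat = N^{-1} sum_i u_i^2 z_i z_i'
    ZZ_inv = np.linalg.inv(Z.T @ Z)
    A = np.linalg.solve(XPX, X.T @ Z @ ZZ_inv)        # (X'P_Z X)^{-1} X'Z (Z'Z)^{-1}
    V = N * A @ S @ A.T                               # equation (4.55)
    return b, np.sqrt(np.diag(V))

In a just-identified model this returns the same coefficients as the IV sketch given earlier. Note that the residuals are formed with X rather than with the first-stage fitted values; using the fitted values instead yields the incorrect second-stage OLS standard errors discussed in the example that follows and in Section 6.4.5.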
For overidentified models with heteroskedastic errors an estimator that White
(1982) calls the two-stage instrumental variables estimator is more efficient than
2SLS. Moreover, some commonly used model specification tests require estimation
by this estimator rather than 2SLS. For details see Section 6.4.2.
4.8.8. IV Example
As an example of IV estimation, consider estimation of the slope coefficient of x for
the dgp
y = 0 + 0.5x + u,
x = 0 + z + v,
where z ∼ N[2, 1] and (u, v) are joint normal with means 0, variances 1, and correla-
tion 0.8.
OLS of y on x yields inconsistent estimates as x is correlated with u since by
construction x is correlated with v, which in turn is correlated with u. IV estimation
yields consistent estimates. The variable z is a valid instrument since by construction it is uncorrelated with u but is correlated with x. Transformations of z, such as z³, are also valid instruments.
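The stated dgp is straightforward to simulate; in the sketch below the sample size matches the text while the seed is arbitrary, so the resulting estimates will be close to, but not exactly equal to, those in Table 4.4.

import numpy as np

rng = np.random.default_rng(4)
N = 10_000
z = rng.normal(2.0, 1.0, N)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])           # unit variances, correlation 0.8
u, v = rng.multivariate_normal([0.0, 0.0], cov, N).T
x = z + v
y = 0.5 * x + u

X = np.column_stack([np.ones(N), x])
Zmat = np.column_stack([np.ones(N), z])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)          # slope biased toward 0.9
b_iv = np.linalg.solve(Zmat.T @ X, Zmat.T @ y)     # slope close to 0.5
print(b_ols[1], b_iv[1])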
Various estimates and associated standard errors from a generated data sample of
size 10,000 are given in Table 4.4. We focus on the slope coefficient.
The OLS estimator is inconsistent, with slope coefficient estimate of 0.902 being
more than 50 standard errors from the true value of 0.5. The remaining estimates are
consistent and are all within two standard errors of 0.5.
There are several ways to compute the IV estimator. The slope coefficient from
OLS regression of y on z is 0.5168 and from OLS regression of x on z it is 1.0124,
Table 4.4. Instrumental Variables Example^a

              OLS        IV        2SLS      IV (z³)
Constant     −0.804     −0.017    −0.017    −0.014
             (0.014)    (0.022)   (0.032)   (0.025)
x             0.902      0.510     0.510     0.509
             (0.006)    (0.010)   (0.014)   (0.012)
R²            0.709      0.576     0.576     0.574

^a Generated data for a sample size of 10,000. OLS is inconsistent and the other estimators are consistent. Robust standard errors are reported though they are unnecessary here as the errors are homoskedastic. The 2SLS standard errors are incorrect. The data-generating process is given in the text.
yielding an IV estimate of 0.5168/1.0124 = 0.510 using (4.47). In practice one instead
directly computes the IV estimator using (4.45) or (4.51), with z used as the instrument
for x and standard errors computed using (4.52). The 2SLS estimator (see (4.54))
can be computed by OLS regression of y on x̂, where x̂ is the prediction from OLS regression of x on z. The 2SLS estimates exactly equal the IV estimates in this just-identified model, though the standard errors from this OLS regression of y on x̂ are incorrect, as will be explained in Section 6.4.5.
The final column uses z³ rather than z as the instrument for x. This alternative IV estimator is consistent, since z³ is uncorrelated with u and correlated with x. However,
it is less efficient for this particular dgp, and the standard error of the slope coefficient
rises from 0.010 to 0.012.
There is an efficiency loss in IV estimation compared to OLS estimation; see (4.61) for a general result for the case of a single regressor and a single instrument. Here r²x,z = 0.510, not given in Table 4.4, is high, so the loss is not great and the standard error of the slope coefficient increases only somewhat, from 0.006 to 0.010. In practice the efficiency loss can be much greater than this.
4.9. Instrumental Variables in Practice
Important practical issues include determining whether IV methods are necessary and,
if necessary, determining whether the instruments are valid. The relevant specification
tests are presented in Section 8.4. Unfortunately, the usefulness of these tests is limited: they require the assumption that in a just-identified model the instruments are valid and test only the overidentifying restrictions.
Although IV estimators are consistent given valid instruments, as detailed in the
following, IV estimators can be much less efficient than the OLS estimator and can
have a finite-sample distribution that for usual finite-sample sizes differs greatly from
the asymptotic distribution. These problems are greatly magnified if instruments are
weakly correlated with the variables being instrumented. One way that weak instru-
ments can arise is if there are many more instruments than needed. This is simply
dealt with by dropping some of the instruments (see also Donald and Newey, 2001). A
more fundamental problem arises when even with the minimal number of instruments
one or more of the instruments is weak.
This section focuses on the problem of weak instruments.
4.9.1. Weak Instruments
There is no single definition of a weak instrument. Many authors use the following
signals of a weak instrument, presented here for progressively more complex models.
• Scalar regressor x and scalar instrument z: A weak instrument is one for which r²x,z is small.
• Scalar regressor x and vector of instruments z: The instruments are weak if the R² from regression of x on z, denoted R²x,z, is small or if the F-statistic for a test of overall fit in this regression is small.
• Multiple regressors x with only one endogenous: A weak instrument is one for which the partial R² is low or the partial F-statistic is small, where these partial statistics are defined toward the end of Section 4.9.1.
• Multiple regressors x with several endogenous: There are several measures.
R² Measures
Consider a single equation
y = β1x1 + x′2β2 + u, (4.56)

where just one regressor, x1, is endogenous and the remaining regressors in the vector x2 are exogenous. Assume that the instrument vector z includes the exogenous instruments x2, as well as at least one other instrument.
One possible R² measure is the usual R² from regression of x1 on z. However, this could be high simply because x1 is highly correlated with x2, whereas intuitively we really need x1 to be highly correlated with the instrument(s) other than x2.
Bound, Jaeger, and Baker (1995) therefore proposed use of a partial R², denoted R²p, that purges the effect of x2. R²p is obtained as the R² from the regression

(x1 − x̂1) = (z − ẑ)′γ + v, (4.57)

where x̂1 and ẑ are the fitted values from regressions of x1 on x2 and of z on x2. In the just-identified case z − ẑ will reduce to z1 − ẑ1, where z1 is the single instrument other than x2 and ẑ1 is the fitted value from regression of z1 on x2.
It is not unusual for R²p to be much lower than R²x1,z. The formula for R²p simplifies to R²x,z when there is only one regressor and it is endogenous. It further simplifies to the squared correlation between x and z when there is additionally only one instrument.
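A sketch of the partial R² calculation based on (4.57); x1 is an N-vector for the single endogenous regressor, X2 an N × k matrix of exogenous regressors including a constant, and Z the instrument matrix. These array names are placeholders, and columns of Z that coincide with X2 simply residualize to zero.

import numpy as np

def partial_r2(x1, X2, Z):
    # Bound-Jaeger-Baker partial R^2: R^2 from regressing residualized x1 on residualized Z
    resid = lambda a, B: a - B @ np.linalg.lstsq(B, a, rcond=None)[0]   # OLS residual maker
    x1_t = resid(x1, X2)                      # x1 purged of X2
    Z_t = resid(Z, X2)                        # each instrument purged of X2
    fitted = Z_t @ np.linalg.lstsq(Z_t, x1_t, rcond=None)[0]
    return 1.0 - ((x1_t - fitted) ** 2).sum() / (x1_t ** 2).sum()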
When there is more than one endogenous variable, analysis is less straightforward
as a number of generalizations of R2
p have been proposed.
Consider a single-equation model with more than one endogenous variable, and focus on estimation of the coefficient of the first endogenous variable. Then in (4.56)
x1 is endogenous and additionally some of the variables in x2 are also endogenous.
Several alternative measures replace the right-hand side of (4.57) with a residual that
controls for the presence of other endogenous regressors. Shea (1997) proposed a partial R², say R*²p, that is computed as the squared sample correlation between two residuals: the residual from the regression of x1 on x2, as in (4.57), and the residual from the regression of x̃1 on x̃2, where x̃1 and x̃2 denote the fitted values from regressions of x1 and x2, respectively, on z. Poskitt and Skeels (2002) proposed an alternative partial R², which, like Shea's R*²p, simplifies to R²p when there is only one endogenous regressor. Hall, Rudebusch, and Wilcox (1996) instead proposed use of canonical correlations.
These measures for the coefficient for the first endogenous variable can be repeated
for the other endogenous variables. Poskitt and Skeels (2002) additionally consider an
R2
measure that applies jointly to instrumentation of all the endogenous variables.
The problems of inconsistency of estimators and loss of precision are magnified
as the partial R2
measures fall, as detailed in Sections 4.9.2 and 4.9.3. See especially
(4.60) and (4.62).
Partial F-Statistics
As a diagnostic for poor finite-sample performance, considered in Section 4.9.4, it is common to use a related measure, the F-statistic for whether the coefficients are zero in the regression of the endogenous regressor on the instruments.
For a single regressor that is endogenous we use the usual overall F-statistic, for a test of π = 0 in the regression x = z′π + v of the endogenous regressor on the instruments. This F-statistic is a function of R²x,z.
More commonly, some exogenous regressors also appear in the model, and in model
(4.56) with single endogenous regressor x1 we use the F-statistic for a test of π1 = 0
in the regression
x1 = z′1π1 + x′2π2 + v, (4.58)

where z1 are the instruments other than the exogenous regressors and x2 are the ex-
ogenous regressors. This is the first-stage regression in the two-stage least-squares
interpretation of IV.
This statistic is used as a signal of potential finite-sample bias in the IV estimator.
In Section 4.9.4 we explain results of Staiger and Stock (1997) which suggest that a value less than 10 is problematic and that a value of 5 or less is a sign of extreme finite-sample bias; there we also consider the extension to more than one endogenous regressor.
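The first-stage F-statistic for a test of π1 = 0 in (4.58) can be computed directly; the sketch below uses the classical (homoskedastic) F formula, with x1 the endogenous regressor, X2 the exogenous regressors including a constant, and Z1 the additional instruments, all hypothetical array names.

import numpy as np

def first_stage_F(x1, X2, Z1):
    # classical F-test of pi1 = 0 in the regression x1 = Z1*pi1 + X2*pi2 + v
    N = len(x1)
    ssr = lambda B: ((x1 - B @ np.linalg.lstsq(B, x1, rcond=None)[0]) ** 2).sum()
    full = np.column_stack([Z1, X2])
    ssr_u, ssr_r = ssr(full), ssr(X2)          # unrestricted and restricted sums of squares
    q, k = Z1.shape[1], full.shape[1]
    return ((ssr_r - ssr_u) / q) / (ssr_u / (N - k))

By the rules of thumb discussed in Section 4.9.4, a value below 10 (or, less strictly, below 5) would flag weak instruments.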
4.9.2. Inconsistency of IV Estimators
The essential condition for consistency of IV is condition 1 in Section 4.8.6, that
the instrument should be uncorrelated with the error term. No test is possible in the
just-identified case. In the overidentified case a test of the overidentifying assump-
tions is possible (see Section 6.4.3). Rejection then could be due to either instrument
endogeneity or model failure. Thus condition 1 is difficult to test directly and deter-
mining whether an instrument is exogenous is usually a subjective decision, albeit one
often guided by economic theory.
It is always possible to create an exogenous instrument through functional form
restrictions. For example, suppose there are two regressors so that y = β1x1 + β2x2 +
u, with x1 uncorrelated with u and x2 correlated with u. Note that throughout this
section all variables are assumed to be measured in departures from means, so that
without loss of generality the intercept term can be omitted. Then OLS is inconsistent,
as x2 is endogenous. A seemingly good instrument for x2 is x1², since x1² is likely to
be uncorrelated with u because x1 is uncorrelated with u. However, the validity of
this instrument requires the functional form restriction on the conditional mean that
x1 only enters the model linearly and not quadratically. In practice one should view a
linear model as only an approximation, and obtaining instruments in such an artificial
way can be easily criticized.
A better way to create a valid instrument is through alternative exclusion restric-
tions that do not rely so heavily on choice of functional form. Some practical examples
have been given in Section 4.8.2.
Structural models such as the classical linear simultaneous equations model (see
Sections 2.4 and 6.10.6) make such exclusion restrictions very explicit. Even then the
restrictions can often be criticized for being too ad hoc, unless compelling economic
theory supports the restrictions.
For panel data applications it may be reasonable to assume that only current data
may belong in the equation of interest – an exclusion restriction permitting past data
to be used as instruments under the assumption that errors are serially uncorrelated
(see Section 22.2.4). Similarly, in models of decision making under uncertainty (see
Section 6.2.7), lagged variables can be used as instruments as they are part of the
information set.
There is no formal test of instrument exogeneity that does not additionally test
whether the regression equation is correctly specified. Instrument exogeneity in-
evitably relies on a priori information, such as that from economic or statistical theory.
The evaluation by Bound et al. (1995, pp. 446–447) of the validity of the instruments
used by Angrist and Krueger (1991) provides an insightful example of the subtleties
involved in determining instrument exogeneity.
It is especially important that an instrument be exogenous if an instrument is weak,
because with weak instruments even very mild endogeneity of the instrument can lead
to IV parameter estimates that are much more inconsistent than the already inconsistent
OLS parameter estimates.
For simplicity consider linear regression with one regressor and one instrument;
hence y = βx + u. Then performing some algebra, left as an exercise, yields
(plim β̂IV − β) / (plim β̂OLS − β) = (Cor[z, u] / Cor[x, u]) × (1 / Cor[z, x]). (4.59)
Thus with an invalid instrument and low correlation between the instrument and the
regressor, the IV estimator can be even more inconsistent than OLS. For example,
suppose the correlation between z and x is 0.1, which is not unusual for cross-section
data. Then IV becomes more inconsistent than OLS as soon as the correlation coeffi-
cient between z and u exceeds a mere 0.1 times the correlation coefficient between x
and u.
Result (4.59) can be extended to the model (4.56) with one endogenous regressor
and several exogenous regressors, iid errors, and instruments that include all the ex-
ogenous regressors. Then
(plim β̂1,2SLS − β1) / (plim β̂1,OLS − β1) = (Cor[x̂, u] / Cor[x, u]) × (1 / R²p), (4.60)

where R²p is defined after (4.56). For extension to more than one endogenous regressor
see Shea (1997).
These results, emphasized by Bound et al. (1995), have profound implications for
the use of IV. If instruments are weak then even mild instrument endogeneity can lead
to IV being even more inconsistent than OLS. Perhaps because the conclusion is so
negative, the literature has neglected this aspect of weak instruments. A notable recent
exception is Hahn and Hausman (2003a).
Most of the literature assumes that condition 1 is satisfied, so that IV is consistent,
and focuses on other complications attributable to weak instruments.
4.9.3. Low Precision
Although IV estimation can lead to consistent estimation when OLS is inconsistent, it
also leads to a loss in precision. Intuitively, from Section 4.8.2 the instrument z is a
treatment that leads to exogenous movement in x but does so with considerable noise.
The loss in precision increases, and standard errors increase, with weaker instru-
ments. This is easily seen in the simplest case of a single endogenous regressor and
single instrument with iid errors. Then the asymptotic variance is

V[β̂IV] = σ²(x′z)⁻¹z′z(z′x)⁻¹ (4.61)
       = [σ²/x′x] / [(z′x)²/((z′z)(x′x))]
       = V[β̂OLS]/r²x,z.

For example, if the squared sample correlation coefficient between z and x equals 0.1, then the IV variance is 10 times that of OLS, so the IV standard errors are more than three times those of OLS. Moreover, the IV estimator has a larger variance than the OLS estimator unless Cor[z, x] = 1.
Result (4.61) can be extended to the model (4.56) with one endogenous regressor
and several exogenous regressors, iid errors, and instruments that include all the ex-
ogenous regressors. Then
se[β̂1,2SLS] = se[β̂1,OLS]/Rp, (4.62)

where se[·] denotes asymptotic standard error and R²p is defined after (4.56). For extension to more than one endogenous regressor this R²p is replaced by the R*²p proposed by Shea (1997). This provided the motivation for Shea's test statistic.
The poor precision is concentrated on the coefficients for endogenous variables. For
exogenous variables the standard errors for 2SLS coefficient estimates are similar to
those for OLS. Intuitively, exogenous variables are being instrumented by themselves,
so they have a very strong instrument.
For the coefficients of an endogenous regressor it is a low partial R2
, rather than R2
,
that leads to a loss of estimator precision. This explains why 2SLS standard errors can
be much higher than OLS standard errors despite the high raw correlation between the
endogenous variable and the instruments. Going the other way, 2SLS standard errors
for coefficients of endogenous variables that are much larger than OLS standard errors
provide a clear signal that instruments are weak.
Statistics used to detect low precision of IV caused by weak instruments are called
measures of instrument relevance. To some extent they are unnecessary as the prob-
lem is easily detected if IV standard errors are much larger than OLS standard errors.
4.9.4. Finite-Sample Bias
This section summarizes a relatively challenging and as yet unfinished literature on
“weak instruments” that focuses on the practical problem that even in “large” samples
asymptotic theory can provide a poor approximation to the distribution of the IV esti-
mator. In particular the IV estimator is biased in finite samples even if asymptotically
consistent. The bias can be especially pronounced when instruments are weak.
This bias of IV, which is toward the inconsistent OLS estimator, can be remark-
ably large, as demonstrated in a simple Monte Carlo experiment by Nelson and Startz
(1990), and by a real data application involving several hundred thousand observations
but very weak instruments by Bound et al. (1995). Moreover, the standard errors can
also be very biased, as also demonstrated by Nelson and Startz (1990).
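This bias toward OLS is easy to reproduce in a small Monte Carlo in the spirit of Nelson and Startz (1990); the design below (weak instrument, small sample, number of replications) is an illustrative choice rather than their exact experiment, and the median is reported because, as discussed later in this section, the mean of the just-identified IV estimator need not exist.

import numpy as np

rng = np.random.default_rng(5)
R, N, beta = 5000, 50, 0.5
b_iv = np.empty(R)
for r in range(R):
    u = rng.normal(size=N)
    z = rng.normal(size=N)
    x = 0.05 * z + u + rng.normal(size=N)      # very weak instrument; x endogenous
    y = beta * x + u
    b_iv[r] = (z @ y) / (z @ x)                # just-identified IV, no intercept
print(np.median(b_iv))                         # well above 0.5, pulled toward the OLS plim of about 1.0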
The theoretical literature entails quite specialized and advanced econometric theory,
as it is actually difficult to obtain the sample mean of the IV estimator. To see this,
consider adapting to the IV estimator the usual proof of unbiasedness of the OLS
estimator given in Section 4.4.8. For β̂IV defined in (4.51) for the just-identified case this yields

E[β̂IV] = β + E_{Z,X,u}[(Z′X)⁻¹Z′u]
       = β + E_{Z,X}[(Z′X)⁻¹Z′ × E[u|Z, X]],

where the unconditional expectation with respect to all stochastic variables, Z, X, and u, is obtained by first taking the expectation with respect to u conditional on Z and X, using the law of iterated expectations (see Section A.8). An obvious suf-
ficient condition for the IV estimator to have mean β is that E[u|Z, X] = 0. This
assumption is too strong, however, because it implies E[u|X] = 0, in which case
there would be no need to instrument in the first place. So there is no simple way
to obtain E[β̂IV]. A similar problem does not arise in establishing consistency. Then β̂IV = β + (N⁻¹Z′X)⁻¹N⁻¹Z′u, where the term N⁻¹Z′u can be considered in isolation from X and the assumption E[u|Z] = 0 leads to plim N⁻¹Z′u = 0.
Therefore we need to use alternative methods to obtain the mean of the IV estimator.
Here we merely summarize key results.
Initial research made the strong assumption of joint normality of variables and ho-
moskedastic errors. Then the IV estimator has a Wishart distribution (defined in Chap-
ter 13). Surprisingly, the mean of the IV estimator does not even exist in the just-
identified case, a signal that there may be finite-sample problems. The mean does exist
if there is at least one overidentifying restriction, and the variance exists if there are at
least two overidentifying restrictions. Even when the mean exists the IV estimator is
biased, with bias in the direction of OLS. With more overidentifying restrictions the
bias increases, eventually equaling the bias of the OLS estimator. A detailed discussion
is given in Davidson and MacKinnon (1993, pp. 221–224). Approximations based on
power-series expansions have also been used.
What determines the size of the finite-sample bias? For regression with a single
regressor x that is endogenous and is related to the instruments z by the reduced form model x = z′π + v, the concentration parameter τ² is defined as τ² = π′Z′Zπ/σv². The bias of IV can be shown to be a decreasing function of τ². The quantity τ²/K, where K is the number of instruments, is the population analogue of the F-statistic for a test of whether π = 0. The statistic F − 1, where F is the actual F-statistic in the first-stage reduced form model, can be shown to be an approximately unbiased estimate of τ²/K. This leads to tests for finite-sample bias being based on the F-statistic given in Section 4.9.1.
Staiger and Stock (1997) obtained results under weaker distributional assumptions.
In particular, normality is no longer needed. Their approach uses weak instrument
asymptotics that find the limit distribution of IV estimators for a sequence of models
with τ²/K held constant as N → ∞. In a simple model 1/F provides an approximate estimate of the finite-sample bias of the IV estimator relative to OLS. More generally, the extent of the bias for given F varies with the number of endogenous regressors and the number of instruments. Simulations show that to ensure that the maximal bias in IV is no more than 10% that of OLS we need F > 10. This threshold is widely cited but falls to around 6.5, for example, if one is comfortable with bias in IV of 20% of that for OLS. So a less strict rule of thumb is F > 5. Shea (1997) demonstrated that low partial R² is also associated with finite-sample bias, but there is no similar rule of thumb for use of the partial R² as a diagnostic for finite-sample bias.
For models with more than one endogenous regressor, separate F-statistics can be
computed for each endogenous regressor. For a joint statistic Stock, Wright and Yogo
(2002) propose using the minimum eigenvalue of a matrix analogue of the first-stage
test F-statistic. Stock and Yogo (2003) present relevant critical values for this eigen-
value as the desired degree of bias, the number of endogenous variables, and the num-
ber of overidentifying restrictions vary. These tables include the single endogenous
regressor as a special case and presume at least two overidentifying restrictions, so
they do not apply to just-identified models.
Finite-sample bias problems arise not only for the IV estimate but also for IV stan-
dard errors and test statistics. Stock et al. (2002) present a similar approach to Wald
tests whereby a test of β = β0 at a nominal level of 5% is to have actual size of, say,
no more than 15%. Stock and Yogo (2003) also present detailed tables taking this size
distortion approach that include just-identified models.
4.9.5. Responses to Weak Instruments
What can the practitioner do in the face of weak instruments?
As already noted one approach is to limit the number of instruments used. This can
be done by dropping instruments or by combining instruments.
If finite-sample bias is a concern then alternative estimators may have better small-
sample properties than 2SLS. A number of alternatives, many variants of IV, are pre-
sented in Section 6.4.4.
Despite the emphasis on finite-sample bias the other problems created by weak
instruments may be of greater importance in applications. It is possible with a large
enough sample for the first-stage reduced form F-statistic to be large enough that
finite-sample bias is not a problem. Meanwhile, the partial R2
may be very small,
leading to fragility to even slight correlation between the model error and instrument.
This is difficult to test for and to overcome.
There also can be great loss in estimator precision, as detailed in Sections 4.9.3
and 4.9.4. In such cases either larger samples are needed or alternative approaches to
estimating causal marginal effects must be used. These methods are summarized in
Section 2.8 and presented elsewhere in this book.
4.9.6. IV Application
Kling (2001) analyzed in detail the use of college proximity as an instrument for
schooling. Here we use the same data from the NLS young men’s cohort on 3,010
males aged 24 to 34 years old in 1976 as used to produce Table 1 of Kling (2001) and
originally used by Card (1995). The model estimated is
ln wi = α + β1 si + β2 ei + β3 ei² + x′2i γ + ui,

where s denotes years of schooling, e denotes years of work experience, e² denotes experience squared, and x2 is a vector of 26 control variables that are mainly geographic indicators and measures of parental education.
The schooling variable is considered endogenous, owing to lack of data on ability.
Additionally, the two work experience variables are endogenous, since work experi-
ence is calculated as age minus years of schooling minus six, as is common in this
literature, and schooling is endogenous. At least three instruments are needed.
Here exactly three instruments are used, so the model is just-identified. The first
instrument is col4, an indicator for whether a four-year college is nearby. This instru-
ment has already been discussed in Section 4.8.2. The other two instruments are age
and age squared. These are highly correlated with experience and experience squared,
yet it is believed they can be omitted from the model for log-wage since it is work
experience that matters. The remaining regressor vector x2 is used as an instrument for
itself.
Although age is clearly exogenous, some unobservables such as social skills may be
correlated with both age and wage. Then the use of age and age squared as instruments
can be questioned. This illustrates the general point that there can be disagreement on
assumptions of instrument validity.
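As a sketch of how these estimates might be computed, suppose the extract were available as a plain-text file whose columns hold log wage, schooling, experience, experience squared, the 26 controls, col4, age, and age squared; the file name and column layout below are hypothetical assumptions, not part of Kling (2001) or Card (1995). The just-identified IV estimate then follows from the matrix formula (4.51), for instance by reusing the iv_just_identified sketch given earlier.

import numpy as np

# hypothetical layout: 0=lnwage, 1=s, 2=e, 3=e^2, 4..29=controls, 30=col4, 31=age, 32=age^2
data = np.loadtxt("nls_card_extract.txt")                 # hypothetical file name
y = data[:, 0]
const = np.ones(len(y))
controls = data[:, 4:30]
X = np.column_stack([const, data[:, 1:4], controls])      # s, e, e^2 treated as endogenous
Z = np.column_stack([const, data[:, 30:33], controls])    # col4, age, age^2 as instruments
b, se = iv_just_identified(y, X, Z)                       # from the sketch in Section 4.8.6
print(b[1], se[1])                                        # schooling coefficient and standard error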
Table 4.5. Returns to Schooling: Instrumental Variables Estimates^a

                                  OLS       IV
Schooling (s)                     0.073     0.132
                                 (0.004)   (0.049)
R²                                0.304     0.207
Shea's partial R²                   –       0.006
First-stage F-statistic for s       –       8.07

^a Sample of 3,010 young males. Dependent variable is log hourly wage. The coefficient and standard error for schooling are given; estimates for experience, experience squared, 26 control variables, and an intercept are not reported. For the three endogenous regressors – schooling (s), experience (e), and experience squared (e²) – the three instruments are an indicator for whether a four-year college (col4) is nearby, age, and age squared. The partial R² and first-stage F-statistic are weak instrument diagnostics explained in the text.
Results are given in Table 4.5. The OLS estimate of β1 is 0.073, so that wages rise by 7.6% (= 100 × (e^0.073 − 1)) on average with each extra year of schooling. This estimate is an inconsistent estimate of β1 given omitted ability. The IV estimate, or equivalently the 2SLS estimate since the model is just-identified, is 0.132. An extra year of schooling is estimated to lead to a 14.1% (= 100 × (e^0.132 − 1)) increase in the wage.
The IV estimator is much less efficient than OLS. A formal test does not reject ho-
moskedasticity and we follow Kling (2001) and use the usual standard errors, which
are very close to the heteroskedastic-robust standard errors. The standard error of β̂1,OLS is 0.004 whereas that for β̂1,IV is 0.049, over 10 times larger. The standard errors for the other two endogenous regressors are about 4 times larger and the standard errors for the exogenous regressors are about 1.2 times larger. The R² falls from 0.304 to 0.207.
R² measures confirm that the instruments are not very relevant for schooling. A simple test is to note that the regression (4.58) of schooling on all of the instruments yields R² = 0.297, which falls only a little, to R² = 0.291, if the three additional instruments are dropped. More formally, Shea's partial R² here equals 0.0064 = 0.08², which from (4.62) predicts that the standard error of β̂1,IV will be inflated by a multiple of 12.5 = 1/0.08, very close to the inflation observed here. This reduces the t-statistic on
schooling from 19.64 to 2.68. In many applications such a reduction would lead to sta-
tistical insignificance. In addition, from Section 4.9.2 even slight correlation between
the instrument col4i and the error term ui will lead to inconsistency of IV.
To see whether finite-sample bias may also be a problem we run the regression
(4.58) of schooling on all of the instruments. Testing the joint significance of the three
additional instruments yields an F-statistic of 8.07, suggesting that the bias of IV may
be 10 or 20% that of OLS. A similar regression for the other two endogenous variables
yields much higher F-statistics since, for example, age is a good additional instrument
for experience. Given that there are three endogenous regressors it is actually bet-
ter to use the method of Stock et al. (2002) discussed in Section 4.9.4, though here the
problem is restricted to schooling since for experience and experience squared, respec-
tively, Shea's partial R² equals 0.0876 and 0.0138, whereas the first-stage F-statistics are 1,772 and 1,542.
If additional instruments are available then the model becomes overidentified and
standard procedure is to additionally perform a test of overidentifying restrictions (see
Section 8.4.4).
4.10. Practical Considerations
The estimation procedures in this chapter are implemented in all standard economet-
rics packages for cross-section data, except that not all packages implement quantile
regression. Most provide robust standard errors as an option rather than the default.
The most difficult estimator to apply can be the instrumental variables estimator, as
in many potential applications it can be difficult to obtain instruments that are uncor-
related with the error yet reasonably correlated with the regressor or regressors being
instrumented. Such instruments can be obtained through specification of a complete
structural model, such as a simultaneous equations system. Current applied research
emphasizes alternative approaches such as natural experiments.
4.11. Bibliographic Notes
The results in this chapter are presented in many first-year graduate texts, such as those by
Davidson and MacKinnon (2004), Greene (2003), Hayashi (2000), Johnston and diNardo
(1997), Mittelhammer, Judge, and Miller (2000), and Ruud (2000). We have emphasized re-
gression with stochastic regressors, robust standard errors, quantile regression, endogeneity,
and instrumental variables.
4.2 Manski (1991) has a nice discussion of regression in a general setting that includes discus-
sion of the loss functions given in Section 4.2.
4.3 The returns to schooling example is well studied. Angrist and Krueger (1999) and Card
(1999) provide recent surveys.
4.4 For a history of least squares see Stigler (1986). The method was introduced by Legendre
in 1805. Gauss in 1810 applied least squares to the linear model with normally distributed
error and proposed the elimination method for computation, and in later work he proposed
the theorem now called the Gauss–Markov theorem. Galton introduced the concept of re-
gression, meaning mean-reversion in the context of inheritance of family traits, in 1887.
For an early “modern” treatment with application to pauperism and welfare availability see
Yule (1897). Statistical inference based on least-squares estimates of the linear regression
model was developed most notably by Fisher. The heteroskedastic-consistent estimate of
the variance matrix of the OLS estimator, due to White (1980a) building on earlier work
by Eicker (1963), has had a profound impact on statistical inference in microeconometrics
and has been extended to many settings.
4.6 Boscovich in 1757 proposed a least absolute deviations estimator that predates least
squares; see Stigler (1986). A review of quantile regression, introduced by Koenker and
Bassett (1978), is given in Buchinsky (1994). A more elementary exposition is given in
Koenker and Hallock (2001).
4.7 The earliest known use of instrumental variables estimation to secure identification in a
simultaneous equations setting was by Wright (1928). Another oft-cited early reference is
Reiersol (1941), who used instrumental variables methods to control for measurement error
in the regressors. Sargan (1958) gives a classic early treatment of IV estimation. Stock and
Trebbi (2003) provide additional early references.
4.8 Instrumental variables estimation is presented in econometrics texts, with emphasis on al-
gebra but not necessarily intuition. The method is widely used in econometrics because of
the desirability of obtaining estimates with a causal interpretation.
4.9 The problem of weak instruments was drawn to the attention of applied researchers by
Nelson and Startz (1990) and Bound et al. (1995). There are a number of theoretical an-
tecedents, most notably the work of Nagar (1959). The problem has dampened enthusiasm
for IV estimation, and small-sample bias owing to weak instruments is currently a very
active research topic. Results often assume iid normal errors and restrict analysis to one
endogenous regressor. The survey by Stock et al. (2002) provides many references with
emphasis on weak instrument asymptotics. It also briefly considers extensions to nonlinear
models. The survey by Hahn and Hausman (2003b) presents additional methods and results
that we have not reviewed here. For recent work on bias in standard errors see Bond and
Windmeijer (2002). For a careful application see C.-I. Lee (2001).
Exercises
4–1 Consider the linear regression model yi = x′iβ + ui with nonstochastic regressors xi and error ui that has mean zero but is correlated as follows: E[ui uj] = σ² if i = j, E[ui uj] = ρσ² if |i − j| = 1, and E[ui uj] = 0 if |i − j| > 1. Thus errors for immediately adjacent observations are correlated whereas errors are otherwise uncorrelated. In matrix notation we have y = Xβ + u, where Ω = E[uu′]. For this model answer each of the following questions using results given in Section 4.4.
(a) Verify that Ω is a band matrix with nonzero terms only on the diagonal and on the first off-diagonal, and give these nonzero terms.
(b) Obtain the asymptotic distribution of β̂OLS using (4.19).
(c) State how to obtain a consistent estimate of V[β̂OLS] that does not depend on unknown parameters.
(d) Is the usual OLS output estimate s²(X′X)⁻¹ a consistent estimate of V[β̂OLS]?
(e) Is White's heteroskedasticity robust estimate of V[β̂OLS] consistent here?
4–2 Suppose we estimate the model yi = µ + ui, where ui ∼ N[0, σ²i].
(a) Show that the OLS estimator of µ simplifies to µ̂ = ȳ.
(b) Hence directly obtain the variance of ȳ. Show that this equals White's heteroskedastic-consistent estimate of the variance given in (4.21).
4–3 Suppose the dgp is yi = β0xi + ui, ui = xi εi, xi ∼ N[0, 1], and εi ∼ N[0, 1]. Assume that data are independent over i and that xi is independent of εi. Note that the first four central moments of N[0, σ²] are 0, σ², 0, and 3σ⁴.
(a) Show that the error term ui is conditionally heteroskedastic.
(b) Obtain plim N⁻¹X′X. [Hint: Obtain E[xi²] and apply a law of large numbers.]
(c) Obtain σ0² = V[ui], where the expectation is with respect to all stochastic variables in the model.
(d) Obtain plim N⁻¹X′Ω0X = lim N⁻¹E[X′Ω0X], where Ω0 = Diag[V[ui|xi]].
(e) Using answers to the preceding parts give the default OLS result (4.22) for the variance matrix in the limit distribution of √N(β̂OLS − β0), ignoring potential heteroskedasticity. Your ultimate answer should be numerical.
(f) Now give the variance in the limit distribution of √N(β̂OLS − β0), taking account of any heteroskedasticity. Your ultimate answer should be numerical.
(g) Do any differences between answers to parts (e) and (f) accord with your prior beliefs?
4–4 Consider the linear regression model with scalar regressor, yi = βxi + ui, with data (yi, xi) iid over i, though the error may be conditionally heteroskedastic.
(a) Show that (β̂OLS − β) = (N⁻¹Σi xi²)⁻¹ N⁻¹Σi xi ui.
(b) Apply the Kolmogorov law of large numbers (Theorem A.8) to the averages of xi² and xi ui to show that β̂OLS →p β. State any additional assumptions made on the dgp for xi and ui.
(c) Apply the Lindeberg–Levy central limit theorem (Theorem A.14) to the averages of xi ui to show that N⁻¹Σi xi ui / √(N⁻²Σi E[ui²xi²]) →d N[0, 1]. State any additional assumptions made on the dgp for xi and ui.
(d) Use the product limit normal rule (Theorem A.17) to show that part (c) implies (1/√N) Σi xi ui →d N[0, lim N⁻¹Σi E[ui²xi²]]. State any assumptions made on the dgp for xi and ui.
(e) Combine results using (2.14) and the product limit normal rule (Theorem A.17) to obtain the limit distribution of β̂OLS.
4–5 Consider the linear regression model y = Xβ + u.
(a) Obtain the formula for β̂ that minimizes Q(β) = u′Wu, where W is of full rank. [Hint: The chain rule for matrix differentiation for column vectors x and z is ∂f(x)/∂x = (∂z′/∂x) × (∂f(z)/∂z), for f(x) = f(g(x)) = f(z) where z = g(x).]
(b) Show that this simplifies to the OLS estimator if W = I.
(c) Show that this gives the GLS estimator if W = Ω⁻¹.
(d) Show that this gives the 2SLS estimator if W = Z(Z′Z)⁻¹Z′.
4–6 Consider IV estimation (Section 4.8) of the model y = x′β + u using instruments z in the just-identified case with Z an N × K matrix of full rank.
(a) What essential assumptions must z satisfy for the IV estimator to be consistent for β? Explain.
(b) Show that given just-identification the 2SLS estimator defined in (4.53) reduces to the IV estimator given in (4.51).
(c) Give a real-world example of a situation where IV estimation is needed because of inconsistency of OLS, and specify suitable instruments.
4–7 (Adapted from Nelson and Startz, 1990.) Consider the three-equation model, y = βx + u; x = λu + ε; z = γε + v, where the mutually independent errors u, ε, and v are iid normal with mean 0 and variances, respectively, σu², σε², and σv².
(a) Show that plim(β̂OLS − β) = λσu²/(λ²σu² + σε²).
(b) Show that ρ²xz = γ²σε⁴/[(λ²σu² + σε²)(γ²σε² + σv²)].
(c) Show that β̂IV = mzy/mzx = β + mzu/(λmzu + mzε), where, for example, mzy = Σi zi yi.
(d) Show that β̂IV − β → 1/λ as γ (or ρxz) → 0.
(e) Show that β̂IV − β → ∞ as mzu → −γσε²/λ.
(f) What do the last two results imply regarding finite-sample biases and the moments of β̂IV − β when the instruments are poor?
4–8 Select a 50% random subsample of the Section 4.6.4 data on log health expen-
diture (y) and log total expenditure (x).
(a) Obtain OLS estimates and contrast usual and White standard errors for the
slope coefficient.
(b) Obtain median regression estimates and compare these to the OLS esti-
mates.
(c) Obtain quantile regression estimates for q = 0.25 and q = 0.75.
(d) Reproduce Figure 4.2 using your answers from parts (a)–(c).
4–9 Select a 50% random subsample of the Section 4.9.6 data on earnings and edu-
cation, and reproduce as much of Table 4.5 as possible and provide appropriate
interpretation.
CHAPTER 5

Maximum Likelihood and Nonlinear Least-Squares Estimation
5.1. Introduction
A nonlinear estimator is one that is a nonlinear function of the dependent variable.
Most estimators used in microeconometrics, aside from the OLS and IV estimators in
the linear regression model presented in Chapter 4, are nonlinear estimators. Nonlin-
earity can arise in many ways. The conditional mean may be nonlinear in parameters.
The loss function may lead to a nonlinear estimator even if the conditional mean is
linear in parameters. Censoring and truncation also lead to nonlinear estimators even
if the original model has conditional mean that is linear in parameters.
Here we present the essential statistical inference results for nonlinear estimation.
Very limited small-sample results are available for nonlinear estimators. Statistical in-
ference is instead based on asymptotic theory that is applicable for large samples. The
estimators commonly used in microeconometrics are consistent and asymptotically
normal.
The asymptotic theory entails two major departures from the treatment of the linear
regression model given in an introductory graduate course. First, alternative methods
of proof are needed since there is no direct formula for most nonlinear estimators.
Second, the asymptotic distribution is generally obtained under the weakest distri-
butional assumptions possible. This departure was introduced in Section 4.4 to permit
heteroskedasticity-robust inference for the OLS estimator. Under such weaker assump-
tions the default standard errors reported by a simple regression program are invalid.
Some care is needed, however, as these weaker assumptions can lead to inconsistency
of the estimator itself, a much more fundamental problem.
As much as possible the presentation here is expository. Definitions of conver-
gence in probability and distribution, laws of large numbers (LLN), and central limit
theorems (CLT) are presented in many texts, and here these topics are relegated to
Appendix A. Applied researchers rarely aim to formally prove consistency and asymp-
totic normality. It is not unusual, however, to encounter data applications with estima-
tion problems sufficiently recent or complex as to demand reading recent econometric
journal articles. Then familiarity with proofs of consistency and asymptotic normality
is very helpful, especially to obtain a good idea in advance of the likely form of the
variance matrix of the estimator.
Section 5.2 provides an overview of key results. A more formal treatment of
extremum estimators that maximize or minimize any objective function is given in Sec-
tion 5.3. Estimators based on estimating equations are defined and presented in Sec-
tion 5.4. Statistical inference based on robust standard errors is presented briefly in
Section 5.5, with complete treatment deferred to Chapter 7. Maximum likelihood es-
timation and quasi-maximum likelihood estimation are presented in Sections 5.6 and
5.7. Nonlinear least-squares estimation is given in Section 5.8. Section 5.9 presents a
detailed example.
The remaining leading parametric estimation procedures – generalized method
of moments and nonlinear instrumental variables – are given separate treatment in
Chapter 6.
5.2. Overview of Nonlinear Estimators
This section provides a summary of asymptotic properties of nonlinear estimators,
given more rigorously in Section 5.3, and presents ways to interpret regression co-
efficients in nonlinear models. The material is essential for understanding use of the
cross-section and panel data models presented in later chapters.
5.2.1. Poisson Regression Example
It is helpful to introduce a specific example of nonlinear estimation. Here we consider
Poisson regression, analyzed in more detail in Chapter 20.
The Poisson distribution is appropriate for a dependent variable y that takes only
nonnegative integer values 0, 1, 2, . . . . It can be used to model the number of occur-
rences of an event, such as number of patent applications by a firm and number of
doctor visits by an individual.
The Poisson density, or more formally the Poisson probability mass function, with
rate parameter λ is
f(y|λ) = e^{−λ} λ^y / y!,   y = 0, 1, 2, . . . ,
where it can be shown that E[y] = λ and V[y] = λ.
A regression model specifies the parameter λ to vary across individuals according
to a specific function of regressor vector x and parameter vector β. The usual Poisson
specification is
λ = exp(x'β),
which has the advantage of ensuring that the mean λ > 0. The density of the Poisson
regression model for a single observation is therefore
f(y|x, β) = e^{−exp(x'β)} exp(x'β)^y / y!.   (5.1)
Consider maximum likelihood estimation based on the sample {(y_i, x_i), i = 1, . . . , N}. The maximum likelihood (ML) estimator maximizes the log-likelihood function (see Section 5.6). The likelihood function is the joint density, which given independent observations is the product Π_i f(y_i|x_i, β) of the individual densities, where we have conditioned on the regressors. The log-likelihood function is then the log of a product, which equals the sum of logs, or Σ_i ln f(y_i|x_i, β).
For the Poisson density (5.1), the log-density for the ith observation is
ln f(y_i|x_i, β) = −exp(x_i'β) + y_i x_i'β − ln y_i!.
So the Poisson ML estimator β̂ maximizes
Q_N(β) = (1/N) Σ_{i=1}^N [ −exp(x_i'β) + y_i x_i'β − ln y_i! ],   (5.2)
where the scale factor 1/N is included so that QN (β) remains finite as N → ∞. The
Poisson ML estimator is the solution to the first-order conditions ∂Q_N(β)/∂β|_β̂ = 0, or
(1/N) Σ_{i=1}^N (y_i − exp(x_i'β)) x_i |_β̂ = 0.   (5.3)
There is no explicit solution for β̂ in (5.3). Numerical methods to compute β̂ are given in Chapter 10. In this chapter we instead focus on the statistical properties of the resulting estimate β̂.
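As an informal illustration, the following short sketch (ours, not code from the text; all names are illustrative) solves the first-order conditions (5.3) by Newton–Raphson on simulated data. Chapter 10 treats such numerical methods in detail.

```python
import numpy as np

def poisson_ml(y, X, tol=1e-10, max_iter=100):
    """Solve the Poisson ML first-order conditions (5.3) by Newton-Raphson.

    y : (N,) array of nonnegative integer counts
    X : (N, q) array of regressors (include a column of ones for an intercept)
    """
    N, q = X.shape
    beta = np.zeros(q)                        # starting value
    for _ in range(max_iter):
        mu = np.exp(X @ beta)                 # exp(x_i'beta)
        grad = X.T @ (y - mu) / N             # gradient of Q_N(beta), as in (5.3)
        hess = -(X * mu[:, None]).T @ X / N   # Hessian of Q_N(beta)
        beta = beta - np.linalg.solve(hess, grad)
        if np.max(np.abs(grad)) < tol:
            break
    return beta

# Simulated illustration: intercept plus one regressor
rng = np.random.default_rng(0)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta0))
print(poisson_ml(y, X))   # close to beta0 in large samples
```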
5.2.2. m-Estimators
More generally, we define an m-estimator θ̂ of the q × 1 parameter vector θ as an estimator that maximizes an objective function that is a sum or average of N subfunctions
Q_N(θ) = (1/N) Σ_{i=1}^N q(y_i, x_i, θ),   (5.4)
where q(·) is a scalar function, yi is the dependent variable, xi is a regressor vector,
and the results in this section assume independence over i.
For simplicity yi is written as a scalar, but the results extend to vector yi and so
cover multivariate and panel data and systems of equations. The objective function is
subscripted by N to denote that it depends on the sample data. Throughout the book
q is used to denote the dimension of θ. Note that here q is additionally being used to
denote the subfunction q(·) in (5.4).
Many econometrics estimators and models are m-estimators, corresponding to spe-
cific functional forms for q(y, x, θ). Leading examples are maximum likelihood (see
(5.39) later) and nonlinear least squares (NLS) (see (5.67) later). The Poisson ML
estimator that maximizes (5.2) is an example of (5.4) with θ = β and q(y, x, β) =
−exp(x'β) + yx'β − ln y!.
We focus attention on the estimator θ̂ that is computed as the solution to the associated first-order conditions ∂Q_N(θ)/∂θ|_θ̂ = 0, or equivalently
(1/N) Σ_{i=1}^N ∂q(y_i, x_i, θ)/∂θ |_θ̂ = 0.   (5.5)
This is a system of q equations in q unknowns that generally has no explicit solution
for θ̂.
The term m-estimator, attributed to Huber (1967), is interpreted as an abbrevia-
tion for maximum-likelihood-like estimator. Many econometrics authors, including
Amemiya (1985, p. 105), Greene (2003, p. 461), and Wooldridge (2002, p. 344), define
an m-estimator as optimizing over a sum of terms, as in (5.4). Other authors, including
Serfling (1980), define an m-estimator as solutions of equations such as (5.5). Huber
(1967) considered both cases and Huber (1981, p. 43) explicitly defined an m-estimator
in both ways. In this book we call the former type of estimator an m-estimator
and the latter an estimating equations estimator (which will be treated separately in
Section 5.4).
5.2.3. Asymptotic Properties of m-Estimators
The key desirable asymptotic properties of an estimator are that it be consistent and
that it have an asymptotic distribution to permit statistical inference at least in large
samples.
Consistency
The first step in determining the properties of θ̂ is to define exactly what θ̂ is intended
to estimate. We suppose that there is a unique value of θ, denoted θ0 and called the
true parameter value, that generates the data. This identification condition (see Sec-
tion 2.5) requires both correct specification of the component of the dgp of interest and
uniqueness of this representation. Thus for the Poisson example it may be assumed that
the dgp is one with Poisson parameter exp(x'β_0) and x is such that x'β^(1) = x'β^(2) if and only if β^(1) = β^(2).
The formal notation with subscript 0 for the true parameter value is used extensively
in Chapters 5 to 8. The motivation is that θ can take many different values, but interest
lies in two particular values – the true value θ_0 and the estimated value θ̂.
The estimate θ̂ will never exactly equal θ_0, even in large samples, because of the intrinsic randomness of a sample. Instead, we require θ̂ to be consistent for θ_0 (see Definition A.2 in Appendix A), meaning that θ̂ must converge in probability to θ_0, denoted θ̂ →p θ_0.
Rigorously establishing consistency of m-estimators is difficult. Formal results are
given in Section 5.3.2 and a useful informal condition is given in Section 5.3.7. Spe-
cializations to ML and NLS estimators are given in later sections.
Limit Normal Distribution
Given consistency, as N → ∞ the estimator θ̂ has a distribution with all mass at θ_0. As for OLS, we magnify or rescale θ̂ by multiplication by √N to obtain a random variable that has nondegenerate distribution as N → ∞. Statistical inference is then conducted assuming N is large enough for asymptotic theory to provide a good approximation, but not so large that θ̂ collapses on θ_0.
We therefore consider the behavior of √N(θ̂ − θ_0). For most estimators this has a finite-sample distribution that is too complicated to use for inference. Instead, asymptotic theory is used to obtain the limit of this distribution as N → ∞. For most microeconometrics estimators this limit is the multivariate normal distribution. More formally, √N(θ̂ − θ_0) converges in distribution to the multivariate normal, where convergence in distribution is defined in Appendix A.
Recall from Section 4.4 that the OLS estimator can be expressed as
√N(β̂ − β_0) = [ (1/N) Σ_{i=1}^N x_i x_i' ]^{-1} (1/√N) Σ_{i=1}^N x_i u_i,
and the limit distribution was derived by obtaining the probability limit of the first term
on the right-hand side and the limit normal distribution of the second term. The limit
distribution of an m-estimator is obtained in a similar way. In Section 5.3.3 we show
that for an estimator that solves (5.5) we can always write
√N(θ̂ − θ_0) = −[ (1/N) Σ_{i=1}^N ∂²q_i(θ)/∂θ∂θ' |_{θ+} ]^{-1} (1/√N) Σ_{i=1}^N ∂q_i(θ)/∂θ |_{θ_0},   (5.6)
where q_i(θ) = q(y_i, x_i, θ), for some θ+ between θ̂ and θ_0, provided second derivatives
and the inverse exist. This result is obtained by a Taylor series expansion.
Under appropriate assumptions this yields the following limit distribution of an
m-estimator:
√N(θ̂ − θ_0) →d N[0, A_0^{-1}B_0A_0^{-1}],   (5.7)
where A_0^{-1} is the probability limit of the inverted matrix in the first term on the right-hand side of (5.6), and the second term is assumed to converge to the N[0, B_0] distribution. The expressions for A_0 and B_0 are given in Table 5.1.
Asymptotic Normality
To obtain the distribution of θ̂ from the limit distribution result (5.7), divide the left-hand side of (5.7) by √N and hence divide the variance by N. Then
θ̂ ~a N[θ_0, V[θ̂]],   (5.8)
where ~a means “is asymptotically distributed as,” and V[θ̂] denotes the asymptotic variance of θ̂ with
V[θ̂] = N^{-1} A_0^{-1} B_0 A_0^{-1}.   (5.9)
A complete discussion of the term asymptotic distribution has already been given in Section 4.4.4, and is also given in Section A.6.4.
The result (5.9) depends on the unknown true parameter θ_0. It is implemented by computing the estimated asymptotic variance
V̂[θ̂] = N^{-1} Â^{-1} B̂ Â^{-1},   (5.10)
where Â and B̂ are consistent estimates of A_0 and B_0.
Table 5.1. Asymptotic Properties of m-Estimators
Property^a: Algebraic Formula
Objective function: Q_N(θ) = N^{-1} Σ_i q(y_i, x_i, θ) is maximized wrt θ.
Examples: ML: q_i = ln f(y_i|x_i, θ) is the log-density; NLS: q_i = −(y_i − g(x_i, θ))² is minus the squared error.
First-order conditions: ∂Q_N(θ)/∂θ |_θ̂ = N^{-1} Σ_{i=1}^N ∂q(y_i, x_i, θ)/∂θ |_θ̂ = 0.
Consistency: Is plim Q_N(θ) maximized at θ = θ_0?
Consistency (informal): Does E[ ∂q(y_i, x_i, θ)/∂θ |_{θ_0} ] = 0?
Limit distribution: √N(θ̂ − θ_0) →d N[0, A_0^{-1}B_0A_0^{-1}], where A_0 = plim N^{-1} Σ_{i=1}^N ∂²q_i(θ)/∂θ∂θ' |_{θ_0} and B_0 = plim N^{-1} Σ_{i=1}^N ∂q_i/∂θ × ∂q_i/∂θ' |_{θ_0}.
Asymptotic distribution: θ̂ ~a N[θ_0, N^{-1}Â^{-1}B̂Â^{-1}], where Â = N^{-1} Σ_{i=1}^N ∂²q_i(θ)/∂θ∂θ' |_θ̂ and B̂ = N^{-1} Σ_{i=1}^N ∂q_i/∂θ × ∂q_i/∂θ' |_θ̂.
^a The limit distribution variance and asymptotic variance estimate are robust sandwich forms that assume independence over i. See Section 5.5.2 for other variance estimates.
The default output for many econometrics packages instead often uses a simpler
estimate V̂[θ̂] = −N^{-1}Â^{-1} that is only valid in some special cases. See Section 5.5
for further discussion, including various ways to estimate A0 and B0 and then perform
hypothesis tests.
The two leading examples of m-estimators are the ML and the NLS estimators.
Formal results for these estimators are given in, respectively, Propositions 5.5 and 5.6.
Simpler representations of the asymptotic distributions of these estimators are given
in, respectively, (5.48) and (5.77).
Poisson ML Example
Like other ML estimators, the Poisson ML estimator is consistent if the density is
correctly specified. However, applying (5.25) from Section 5.3.7 to (5.3) reveals that
the essential condition for consistency is actually the weaker condition that E[y|x] = exp(x'β_0), that is, correct specification of the mean. Similar robustness of the ML
estimator to partial misspecification of the distribution holds for some other special
cases detailed in Section 5.7.
For the Poisson ML estimator ∂q(β)/∂β = (y − exp(x'β_0))x, leading to
A_0 = −plim N^{-1} Σ_i exp(x_i'β_0) x_i x_i'
and
B_0 = plim N^{-1} Σ_i V[y_i|x_i] x_i x_i'.
Then β̂ ~a N[β_0, N^{-1}Â^{-1}B̂Â^{-1}], where Â = −N^{-1} Σ_i exp(x_i'β̂) x_i x_i' and B̂ = N^{-1} Σ_i (y_i − exp(x_i'β̂))² x_i x_i'.
Table 5.2. Marginal Effect: Three Different Estimates
Formula                                    Description
N^{-1} Σ_i ∂E[y_i|x_i]/∂x_i                Average response of all individuals
∂E[y|x]/∂x evaluated at x = x̄              Response of the average individual
∂E[y|x]/∂x evaluated at x = x*             Response of a representative individual with x = x*
If the data are actually Poisson distributed, then V[y|x] = E[y|x] = exp(x'β_0), leading to possible simplification since A_0 = −B_0 so that A_0^{-1}B_0A_0^{-1} = −A_0^{-1}. However, in most applications with count data V[y|x] > E[y|x], so it is best not to impose this restriction.
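The following sketch (ours, with illustrative names) fits a Poisson regression to simulated overdispersed counts and compares the default ML variance estimate −N^{-1}Â^{-1} with the robust sandwich estimate N^{-1}Â^{-1}B̂Â^{-1} built from the formulas above; statsmodels is used only as a convenient fitting routine.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([0.5, 1.0])
mu0 = np.exp(X @ beta0)
# Overdispersed counts: correct mean exp(x'beta0) but V[y|x] > E[y|x]
y = rng.poisson(mu0 * rng.gamma(shape=0.5, scale=2.0, size=N))

res = sm.Poisson(y, X).fit(disp=0)      # Poisson MLE (here a quasi-ML estimator)
b = res.params
mu = np.exp(X @ b)

A_hat = -(X * mu[:, None]).T @ X / N                  # Hessian estimate of A_0
B_hat = (X * ((y - mu) ** 2)[:, None]).T @ X / N      # outer-product estimate of B_0
V_robust = np.linalg.inv(A_hat) @ B_hat @ np.linalg.inv(A_hat) / N
V_default = -np.linalg.inv(A_hat) / N                 # valid only if V[y|x] = E[y|x]

print("default ML s.e.:     ", np.sqrt(np.diag(V_default)))
print("robust sandwich s.e.:", np.sqrt(np.diag(V_robust)))
```

With overdispersed data the robust standard errors are noticeably larger than the default ones, which is exactly the point of not imposing V[y|x] = E[y|x].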
5.2.4. Coefficient Interpretation in Nonlinear Regression
An important goal of estimation is often prediction, rather than testing the statistical
significance of regressors.
Marginal Effects
Interest often lies in measuring marginal effects, the change in the conditional mean
of y when regressors x change by one unit.
For the linear regression model, E[y|x] = x'β implies ∂E[y|x]/∂x = β so that the coefficient has a direct interpretation as the marginal effect. For nonlinear regression models, this interpretation is no longer possible. For example, if E[y|x] = exp(x'β), then ∂E[y|x]/∂x = exp(x'β)β is a function of both parameters and regressors, and the
size of the marginal effect depends on x in addition to β.
General Regression Function
For a general regression function
E[y|x] = g(x, β),
the marginal effect varies with the value of x at which it is evaluated.
It is customary to present one of the estimates of the marginal effect given in
Table 5.2. The first estimate averages the marginal effects for all individuals. The sec-
ond estimate evaluates the marginal effect at x = x̄. The third estimate evaluates at
specific characteristics x = x*. For example, x* may represent a person who is female
with 12 years of schooling and so on. More than one representative individual might be
considered.
These three measures differ in nonlinear models, whereas in the linear model they
all equal β. Even the sign of the effect may be unrelated to the sign of the pa-
rameter, with ∂E[y|x]/∂xj positive for some values of x and negative for other val-
ues of x. Considerable care must be taken in interpreting coefficients in nonlinear
models.
Computer programs and applied studies often report the second of these measures.
This can be useful in getting a sense for the magnitude of the marginal effect, but
policy interest usually lies in the overall effect, the first measure, or the effect on a
representative individual or group, the third measure. The first measure tends to change
relatively little across different choices of functional form g(·), whereas the other two
measures can change considerably. One can also present the full distribution of the
marginal effects using a histogram or nonparametric density estimate.
Single-Index Models
Direct interpretation of regression coefficients is possible for single-index models that
specify
E[y|x] = g(x'β),   (5.11)
so that the data and parameters enter the nonlinear mean function g(·) by way of the
single index x'β. Then nonlinearity is of the mild form that the mean is a nonlinear
function of a linear combination of the regressors and parameters. For single-index
models the effect on the conditional mean of a change in the jth regressor using cal-
culus methods is
∂E[y|x]/∂x_j = g'(x'β) β_j,
where g'(z) = ∂g(z)/∂z. It follows that the relative effects of changes in regressors are given by the ratio of the coefficients since
(∂E[y|x]/∂x_j) / (∂E[y|x]/∂x_k) = β_j / β_k,
because the common factor g'(x'β) cancels. Thus if β_j is two times β_k then a one-
unit change in xj has twice the effect as a one-unit change in xk. If g(·) is additionally
monotonic then it follows that the signs of the coefficients give the signs of the effects,
for all possible x.
Single-index models are advantageous owing to their simple interpretation. Many
standard nonlinear models such as logit, probit, and Tobit are of single-index form.
Moreover, some choices of function g(·) permit additional interpretation, notably the
exponential function considered later in this section and the logistic cdf analyzed in
Section 14.3.4.
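A small numerical check (ours, with hypothetical coefficient values) that for a logistic single-index mean the ratio of calculus marginal effects equals the ratio of coefficients:

```python
import numpy as np

beta = np.array([0.8, -0.4, 1.2])     # hypothetical coefficients
x = np.array([1.0, 2.0, -0.5])        # one evaluation point

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# E[y|x] = g(x'beta) with g the logistic cdf, so g'(z) = g(z)(1 - g(z))
z = x @ beta
gprime = logistic(z) * (1.0 - logistic(z))
marginal_effects = gprime * beta      # dE[y|x]/dx_j = g'(x'beta) * beta_j

print(marginal_effects[0] / marginal_effects[1])   # equals beta_1/beta_2 = -2.0
print(beta[0] / beta[1])
```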
Finite-Difference Method
We have emphasized the use of calculus methods. The finite-difference method in-
stead computes the marginal effect by comparing the conditional mean when xj is
increased by one unit with the value before the increase. Thus
ΔE[y|x]/Δx_j = g(x + e_j, β) − g(x, β),
where e_j is a vector with jth entry one and other entries zero.
For the linear model finite-difference and calculus methods lead to the same es-
timated effects, since
ΔE[y|x]/Δx_j = (x'β + β_j) − x'β = β_j. For nonlinear models,
however, the two approaches give different estimates of the marginal effect, unless the
change in xj is infinitesimally small.
Often calculus methods are used for continuous regressors and finite-difference
methods are used for integer-valued regressors, such as a (0, 1) indicator variable.
Exponential Conditional Mean
As an example, consider coefficient interpretation for an exponential conditional mean
function, so that E[y|x] = exp(x
β). Many count and duration models use the expo-
nential form.
A little algebra yields ∂E[y|x]/∂xj = E[y|x] × βj . So the parameters can be inter-
preted as semi-elasticities, with a one-unit change in xj increasing the conditional
mean by the multiple βj . For example, if βj = 0.2 then a one-unit change in xj
is predicted to lead to a 0.2 times proportionate increase in E[y|x], or an increase
of 20%.
If instead the finite-difference method is used, the marginal effect is computed as
ΔE[y|x]/Δx_j = exp(x'β + β_j) − exp(x'β) = exp(x'β)(e^{β_j} − 1). This differs from the calculus result, unless β_j is small so that e^{β_j} ≈ 1 + β_j. For example, if β_j = 0.2 the increase is 22.14% rather than 20%.
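A brief sketch (ours; the fitted coefficient values and data are made up) of the three marginal-effect estimates of Table 5.2, plus the finite-difference version, for an exponential conditional mean:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_hat = np.array([0.3, 0.2])        # pretend this is the fitted coefficient vector
j = 1                                  # regressor of interest

mu = np.exp(X @ beta_hat)              # E[y|x] = exp(x'beta)

# Calculus-based marginal effects: dE[y|x]/dx_j = exp(x'beta) * beta_j
ame = np.mean(mu * beta_hat[j])                          # average response of all individuals
mem = np.exp(X.mean(axis=0) @ beta_hat) * beta_hat[j]    # response of the average individual
x_star = np.array([1.0, 1.5])                            # a representative individual
me_rep = np.exp(x_star @ beta_hat) * beta_hat[j]

# Finite-difference version: exp(x'beta) * (e^{beta_j} - 1)
ame_fd = np.mean(mu * (np.exp(beta_hat[j]) - 1.0))

print(ame, mem, me_rep, ame_fd)
```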
5.3. Extremum Estimators
This section is intended for use in an advanced graduate course in microeconomet-
rics. It presents the key results on consistency and asymptotic normality of extremum
estimators, a very general class of estimators that minimize or maximize an objective
function. The presentation is very condensed. A more complete understanding requires
an advanced treatment such as that in Amemiya (1985), the basis of the treatment here,
or in Newey and McFadden (1994).
5.3.1. Extremum Estimators
For cross-section analysis of a single dependent variable the sample is one of N ob-
servations, {(yi , xi ), i = 1, . . . , N}, on a dependent variable yi , and a column vector
xi of regressors. In matrix notation the sample is (y, X), where y is an N × 1 vector
with ith entry y_i and X is a matrix with ith row x_i', as defined more completely in Section 1.6.
Interest lies in estimating the q × 1 parameter vector θ = [θ_1, . . . , θ_q]'. The value
θ0, termed the true parameter value, is the particular value of θ in the process that
generated the data, called the data-generating process.
We consider estimators θ̂ that maximize over θ ∈ Θ the stochastic objective func-
tion QN (θ) = QN (y, X, θ), where for notational simplicity the dependence of QN (θ)
on the data is indicated only via the subscript N. Such estimators are called extremum
estimators, since they solve a maximization or minimization problem.
The extremum estimator may be a global maximum, so
θ̂ = arg max_{θ∈Θ} Q_N(θ).   (5.12)
Usually the extremum estimator is a local maximum, computed as the solution to the associated first-order conditions
∂Q_N(θ)/∂θ |_θ̂ = 0,   (5.13)
where ∂ QN (θ)/∂θ is a q × 1 column vector with kth entry ∂ QN (θ)/∂θk. The lo-
cal maximum is emphasized because it is the local maximum that may be asymptotically normally distributed. The local and global maxima coincide if Q_N(θ) is globally
concave.
There are two leading examples of extremum estimators. For m-estimators consid-
ered in this chapter, notably ML and NLS estimators, QN (θ) is a sample average such
as average of squared residuals. For the generalized method of moments estimator (see
Section 6.3) QN (θ) is a quadratic form in sample averages.
For concreteness the discussion focuses on single-equation cross-section regression.
But the results are quite general and apply to any estimator based on optimization that
satisfies properties given in this section. In particular there is no restriction to a scalar
dependent variable and several authors use the notation zi in place of (yi , xi ). Then
QN (θ) equals QN (Z, θ) rather than QN (y, X, θ).
5.3.2. Formal Consistency Theorems
We first consider parameter identification, introduced in Section 2.5. Intuitively the
parameter θ0 is identified if the distribution of the data, or feature of the distribution of
interest, is determined by θ0 whereas any other value of θ leads to a different distribu-
tion. For example, in linear regression we required E[y|X] = Xβ_0 and Xβ^(1) = Xβ^(2) if and only if β^(1) = β^(2).
An estimation procedure may not identify θ0. For example, this is the case if the es-
timation procedure omits some relevant regressors. We say that an estimation method
identifies θ0 if the probability limit of the objective function, taken with respect to
the dgp with parameter θ = θ0, is maximized uniquely at θ = θ0. This identification
condition is an asymptotic one. Practical estimation problems that can arise in a finite
sample are discussed in Chapter 10.
Consistency is established in the following manner. As N → ∞ the stochastic ob-
jective function QN (θ), an average in the case of m-estimation, may converge in prob-
ability to a limit function, denoted Q0(θ), that in the simplest case is nonstochas-
tic. The corresponding maxima (global or local) of QN (θ) and Q0(θ) should then
occur for values of θ close to each other. Since the maximum of Q_N(θ) is θ̂ by definition, it follows that θ̂ converges in probability to θ_0 provided θ_0 maximizes
Q0(θ).
Clearly, consistency and identification are closely related, and Amemiya (1985,
p. 230) states that a simple approach is to view identification to mean existence of a
consistent estimator. For further discussion see Newey and McFadden (1994, p. 2124)
and Deistler and Seifert (1978).
Key applications of this approach include Jennrich (1969) and Amemiya (1973).
Amemiya (1985) and Newey and McFadden (1994) present quite general theorems.
These theorems require several assumptions, including smoothness (continuity) and
existence of necessary derivatives of the objective function, assumptions on the dgp
to ensure convergence of QN (θ) to Q0(θ), and maximization of Q0(θ) at θ = θ0.
Different consistency theorems use slightly different assumptions.
We present two consistency theorems due to Amemiya (1985), one for a global
maximum and one for a local maximum. The notation in Amemiya’s theorems has
been modified as Amemiya (1985) defines the objective function without the normal-
ization 1/N present in, for example, (5.4).
Theorem 5.1 (Consistency of Global Maximum) (Amemiya, 1985, Theo-
rem 4.1.1): Make the following assumptions:
(i) The parameter space Θ is a compact subset of R^q.
(ii) The objective function QN (θ) is a measurable function of the data for all θ ∈
Θ, and QN (θ) is continuous in θ ∈ Θ.
(iii) QN (θ) converges uniformly in probability to a nonstochastic function Q0(θ),
and Q0(θ) attains a unique global maximum at θ0.
Then the estimator θ̂ = arg max_{θ∈Θ} Q_N(θ) is consistent for θ_0, that is, θ̂ →p θ_0.
Uniform convergence in probability of QN (θ) to
Q_0(θ) = plim Q_N(θ)   (5.14)
in condition (iii) means that sup_{θ∈Θ} |Q_N(θ) − Q_0(θ)| →p 0.
For a local maximum, first derivatives need to exist, but one need then only consider
the behavior of QN (θ) and its derivative in the neighborhood of θ0.
Theorem 5.2 (Consistency of Local Maximum) (Amemiya, 1985, Theo-
rem 4.1.2): Make the following assumptions:
(i) The parameter space Θ is an open subset of R^q.
(ii) QN (θ) is a measurable function of the data for all θ ∈ Θ, and ∂ QN (θ)/∂θ
exists and is continuous in an open neighborhood of θ0.
(iii) The objective function QN (θ) converges uniformly in probability to Q0(θ) in
an open neighborhood of θ0, and Q0(θ) attains a unique local maximum at θ0.
Then one of the solutions to ∂ QN (θ)/∂θ = 0 is consistent for θ0.
An example of use of Theorem 5.2 is given later in Section 5.3.4.
Condition (i) in Theorem 5.1 permits a global maximum to be at the boundary of the
parameter space, whereas in Theorem 5.2 a local maximum has to be in the interior of
the parameter space. Condition (ii) in Theorem 5.2 also implies continuity of QN (θ)
in the open neighborhood of θ0, where a neighborhood N(θ0) of θ0 is open if and
only if there exists a ball with center θ0 entirely contained in N(θ0). In both theorems
condition (iii) is the essential condition. The maximum, global or local, of Q0(θ) must
occur at θ = θ0. The second part of (iii) provides the identification condition that θ0
has a meaningful interpretation and is unique.
For a local maximum, analysis is straightforward if there is only one local maxi-
mum. Then θ̂ is uniquely defined by ∂Q_N(θ)/∂θ|_θ̂ = 0. When there is more than one
local maximum, the theorem simply says that one of the local maxima is consistent,
but no guidance is given as to which one is consistent. It is best in such cases to con-
sider the global maximum and apply Theorem 5.1. See Newey and McFadden (1994,
p. 2117) for a discussion.
An important distinction is made between model specification, reflected in the
choice of objective function QN (θ), and the actual dgp of (y, X) used in obtaining
Q0(θ) in (5.14). For some dgps an estimator may be consistent, whereas for other dgps
an estimator may be inconsistent. In some cases, such as the Poisson ML and OLS es-
timators, consistency arises under a wide range of dgps provided the conditional mean
is correctly specified. In other cases consistency requires stronger assumptions on the
dgp such as correct specification of the density.
5.3.3. Asymptotic Normality
Results on asymptotic normality are usually restricted to the local maximum of QN (θ).
Then θ̂ solves (5.13), which in general is nonlinear in θ̂ and has no explicit solution for θ̂. Instead, we replace the left-hand side of this equation by a linear function of θ̂, by use of a Taylor series expansion, and then solve for θ̂.
The most often used version of Taylor’s theorem is an approximation with a re-
mainder term. Here we instead consider an exact first-order Taylor expansion. For
the differentiable function f (·) there always exists a point x+
between x and x0 such
that
f (x) = f (x0) + f 
(x+
)(x − x0),
where f 
(x) = ∂ f (x)/∂x is the derivative of f (x). This result is also known as the
mean value theorem.
Application to the current setting requires several changes. The scalar function f (·)
is replaced by a vector function f(·) and the scalar arguments x, x_0, and x+ are replaced by the vectors θ̂, θ_0, and θ+. Then
f(θ̂) = f(θ_0) + ∂f(θ)/∂θ' |_{θ+} (θ̂ − θ_0),   (5.15)
where ∂f(θ)/∂θ' is a matrix, for some unknown θ+ between θ̂ and θ_0, and formally θ+ differs for each row of this matrix (see Newey and McFadden, 1994, p. 2141).
For the local extremum estimator the function f(θ) = ∂ QN (θ)/∂θ is already a first
derivative. Then an exact first-order Taylor series expansion around θ0 yields
∂Q_N(θ)/∂θ |_θ̂ = ∂Q_N(θ)/∂θ |_{θ_0} + ∂²Q_N(θ)/∂θ∂θ' |_{θ+} (θ̂ − θ_0),   (5.16)
where ∂²Q_N(θ)/∂θ∂θ' is a q × q matrix with (j, k)th entry ∂²Q_N(θ)/∂θ_j∂θ_k, and θ+ is a point between θ̂ and θ_0.
The first-order conditions set the left-hand side of (5.16) to zero. Setting the right-
hand side to 0 and solving for (θ̂ − θ_0) yields
√N(θ̂ − θ_0) = −[ ∂²Q_N(θ)/∂θ∂θ' |_{θ+} ]^{-1} √N ∂Q_N(θ)/∂θ |_{θ_0},   (5.17)
where we rescale by √N to ensure a nondegenerate limit distribution (discussed further in the following).
Result (5.17) provides a solution for θ̂. It is of no use for numerical computation of θ̂, since it depends on θ_0 and θ+, both of which are unknown, but it is fine for theoretical analysis. In particular, if it has been established that θ̂ is consistent for θ_0 then the unknown θ+ converges in probability to θ_0, because it lies between θ̂ and θ_0 and by consistency θ̂ converges in probability to θ_0.
The result (5.17) expresses √N(θ̂ − θ_0) in a form similar to that used to obtain the
limit distribution of the OLS estimator (see Section 5.2.3). All we need do is assume
a probability limit for the first term on the right-hand side of (5.17) and a limit normal
distribution for the second term.
This leads to the following theorem, from Amemiya (1985), for an extremum esti-
mator satisfying a local maximum. Again note that Amemiya (1985) defines the ob-
jective function without the normalization 1/N. Also, Amemiya defines A0 and B0 in
terms of limE rather than plim.
Theorem 5.3 (Limit Distribution of Local Maximum) (Amemiya, 1985, The-
orem 4.1.3): In addition to the assumptions of the preceding theorem for consis-
tency of the local maximum make the following assumptions:
(i) ∂²Q_N(θ)/∂θ∂θ' exists and is continuous in an open convex neighborhood of θ_0.
(ii) ∂²Q_N(θ)/∂θ∂θ' |_{θ+} converges in probability to the finite nonsingular matrix
A_0 = plim ∂²Q_N(θ)/∂θ∂θ' |_{θ_0}   (5.18)
for any sequence θ+ such that θ+ →p θ_0.
(iii) √N ∂Q_N(θ)/∂θ |_{θ_0} →d N[0, B_0], where
B_0 = plim [ N ∂Q_N(θ)/∂θ × ∂Q_N(θ)/∂θ' |_{θ_0} ].   (5.19)
Then the limit distribution of the extremum estimator is
√N(θ̂ − θ_0) →d N[0, A_0^{-1}B_0A_0^{-1}],   (5.20)
where the estimator θ̂ is the consistent solution to ∂Q_N(θ)/∂θ = 0.
The proof follows directly from the Limit Normal Product Rule (Theorem A.17)
applied to (5.17). Note that the proof assumes that consistency of θ̂ has already been established. The expressions for A_0 and B_0 given in Table 5.1 are specializations to the case Q_N(θ) = N^{-1} Σ_i q_i(θ) with independence over i.
The probability limits in (5.18) and (5.19) are obtained with respect to the dgp
for (y, X). In some applications the regressors are assumed to be nonstochastic and
the expectation is with respect to y only. In other cases the regressors are treated as
stochastic and the expectations are then with respect to both y and X.
5.3.4. Poisson ML Estimator Asymptotic Properties Example
We formally prove consistency and asymptotic normality of the Poisson ML estimator,
under exogenous stratified sampling with stochastic regressors so that (yi , xi ) are inid,
without necessarily assuming that yi is Poisson distributed.
The key step to prove consistency is to obtain Q0(β) = plim QN (β) and verify that
Q_0(β) attains a maximum at β = β_0. For Q_N(β) defined in (5.2), we have
Q_0(β) = plim N^{-1} Σ_i [ −e^{x_i'β} + y_i x_i'β − ln y_i! ]
       = lim N^{-1} Σ_i [ −E[e^{x_i'β}] + E[y_i x_i'β] − E[ln y_i!] ]
       = lim N^{-1} Σ_i [ −E[e^{x_i'β}] + E[e^{x_i'β_0} x_i'β] − E[ln y_i!] ].
The second equality assumes a law of large numbers can be applied to each term. Since
(yi , xi ) are inid, the Markov LLN (Theorem A.8) can be applied if each of the expected
values given in the second line exists and additionally the corresponding (1 + δ)th
absolute moment exists for some δ > 0 and the side condition given in Theorem A.8
is satisfied. For example, set δ = 1 so that second moments are used. The third line
requires the assumption that the dgp is such that E[y|x] = exp(x'β_0). The first two
expectations in the third line are with respect to x, which is stochastic. Note that Q0(β)
depends on both β and β0. Differentiating with respect to β, and assuming that limits,
derivatives, and expectations can be interchanged, we get
∂Q_0(β)/∂β = −lim N^{-1} Σ_i E[e^{x_i'β} x_i] + lim N^{-1} Σ_i E[e^{x_i'β_0} x_i],
where the derivative of E[ln y!] with respect to β is zero since E[ln y!] will depend on β_0, the true parameter value in the dgp, but not on β. Clearly, ∂Q_0(β)/∂β = 0 at β = β_0 and ∂²Q_0(β)/∂β∂β' = −lim N^{-1} Σ_i E[exp(x_i'β) x_i x_i'] is negative definite, so
Q0(β) attains a local maximum at β = β0 and the Poisson ML estimator is consistent
by Theorem 5.2. Since here QN (β) is globally concave the local maximum equals the
global maximum and consistency can also be established using Theorem 5.1.
For asymptotic normality of the Poisson ML estimator, the exact first-order Taylor series expansion of the Poisson ML estimator first-order conditions (5.3) yields
√N(β̂ − β_0) = −[ −N^{-1} Σ_i e^{x_i'β+} x_i x_i' ]^{-1} N^{-1/2} Σ_i (y_i − e^{x_i'β_0}) x_i,   (5.21)
for some unknown β+ between β̂ and β_0. Making sufficient assumptions on regressors x so that the Markov LLN can be applied to the first term, and using β+ →p β_0 since β̂ →p β_0, we have
−N^{-1} Σ_i e^{x_i'β+} x_i x_i'  →p  A_0 = −lim N^{-1} Σ_i E[e^{x_i'β_0} x_i x_i'].   (5.22)
For the second term in (5.21) begin by assuming scalar regressor x. Then X = (y − exp(xβ_0))x has mean E[X] = 0, as E[y|x] = exp(xβ_0) has already been assumed for consistency, and variance V[X] = E[V[y|x]x²]. The Liapounov CLT (Theorem A.15) can be applied if the side condition involving a (2 + δ)th absolute moment of (y − exp(xβ_0))x is satisfied. For this example with y ≥ 0 it is sufficient to assume that the third moment of y exists, that is, δ = 1, and x is bounded. Applying the CLT gives
Z_N = Σ_i (y_i − e^{β_0 x_i}) x_i / √( Σ_i E[V[y_i|x_i] x_i²] )  →d  N[0, 1],
so
N^{-1/2} Σ_i (y_i − e^{β_0 x_i}) x_i  →d  N[ 0, lim N^{-1} Σ_i E[V[y_i|x_i] x_i²] ],
assuming the limit in the expression for the asymptotic variance exists. This result can be extended to the vector regressor case using the Cramer–Wold device (see Theorem A.16). Then
N^{-1/2} Σ_i (y_i − e^{x_i'β_0}) x_i  →d  N[ 0, B_0 = lim N^{-1} Σ_i E[V[y_i|x_i] x_i x_i'] ].   (5.23)
Thus (5.21) yields √N(β̂ − β_0) →d N[0, A_0^{-1}B_0A_0^{-1}], where A_0 is defined in (5.22) and B_0 is defined in (5.23).
Note that for this particular example y|x need not be Poisson distributed for the
Poisson ML estimator to be consistent and asymptotically normal. The essential as-
sumption for consistency of the Poisson ML estimator is that the dgp is such that
E[y|x] = exp(x'β_0).
For asymptotic normality the essential assumption is that V[y|x] exists, though additional assumptions on existence of higher moments are needed to permit use of LLN and CLT. If in fact V[y|x] = exp(x'β_0) then A_0 = −B_0 and more simply √N(β̂ − β_0) →d N[0, −A_0^{-1}]. The results for this ML example extend to the LEF class of densities defined in Section 5.7.3.
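A compact Monte Carlo sketch (ours, with arbitrary parameter values) of this robustness result: y|x is drawn from a negative binomial with mean exp(x'β_0), so it is overdispersed and not Poisson, yet the Poisson ML slope estimates center on the true value.

```python
import numpy as np

rng = np.random.default_rng(3)
N, R = 500, 2000
beta0 = np.array([0.2, 0.7])

def poisson_ml(b_dim_y, X, iters=30):
    # Newton-Raphson on the Poisson first-order conditions (5.3)
    y = b_dim_y
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ b)
        b = b + np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
    return b

slopes = np.empty(R)
for r in range(R):
    x = rng.uniform(-1, 1, size=N)
    X = np.column_stack([np.ones(N), x])
    mu = np.exp(X @ beta0)
    nb_size = 2.0
    # Negative binomial with mean mu: not Poisson, V[y|x] > E[y|x]
    y = rng.negative_binomial(nb_size, nb_size / (nb_size + mu))
    slopes[r] = poisson_ml(y, X)[1]

print("mean of slope estimates:", slopes.mean())   # close to beta0[1] = 0.7
print("Monte Carlo s.d. of slope:", slopes.std())  # comparable to sandwich-based s.e.
```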
5.3.5. Proofs of Consistency and Asymptotic Normality
The assumptions made in Theorems 5.1–5.3 are quite general and need not hold in
every application. These assumptions need to be verified on a case-by-case basis, in a
manner similar to the preceding Poisson ML estimator example. Here we sketch out
details for m-estimators.
For consistency, the key step is to obtain the probability limit of QN (θ). This is
done by application of an LLN because for an m-estimator QN (θ) is the average
N^{-1} Σ_i q_i(θ). Different assumptions on the dgp lead to the use of different LLNs
and more substantively to different expressions for Q0(θ).
Asymptotic normality requires assumptions on the dgp in addition to those required
for consistency. Specifically, we need assumptions on the dgp to enable application of
an LLN to obtain A0 and to enable application of a CLT to obtain B0.
For an m-estimator an LLN is likely to verify condition (ii) of Theorem 5.3 as each
entry in the matrix ∂²Q_N(θ)/∂θ∂θ' is an average since Q_N(θ) is an average. A CLT is likely to yield condition (iii) of Theorem 5.3, since √N ∂Q_N(θ)/∂θ |_{θ_0} has mean 0 from the informal consistency condition (5.24) in Section 5.3.7 and finite variance E[N ∂Q_N(θ)/∂θ × ∂Q_N(θ)/∂θ' |_{θ_0}].
The particular CLT and LLN used to obtain the limit distribution of the estimator
vary with assumptions about the dgp for (y, X). In all cases the dependent variable is
stochastic. However, the regressors may be fixed or stochastic, and in the latter case
they may exhibit time-series dependence. These issues have already been considered
for OLS in Section 4.4.7.
The common microeconometrics assumption is that regressors are stochastic with
independence across observations, which is reasonable for cross-section data from na-
tional surveys. For simple random sampling, the data (yi , xi ) are iid and Kolmogorov
LLN and Lindeberg–Levy CLT (Theorems A.8 and A.14) can be used. Furthermore,
under simple random sampling (5.18) and (5.19) then simplify to
A_0 = E[ ∂²q(y, x, θ)/∂θ∂θ' |_{θ_0} ]
and
B_0 = E[ ∂q(y, x, θ)/∂θ × ∂q(y, x, θ)/∂θ' |_{θ_0} ],
where (y, x) denotes a single observation and expectations are with respect to the joint
distribution of (y, x). This simpler notation is used in several texts.
For stratified random sampling and for fixed regressors the data (yi , xi ) are inid and
Markov LLN and Liapounov CLT (Theorems A.9 and A.15) need to be used. These
require moment assumptions additional to those made in the iid case. In the stochastic
regressors case, expectations are with respect to the joint distribution of (y, x), whereas
in the fixed regressors case, such as in a controlled experiment where the level of x can
be set, the expectations in (5.18) and (5.19) are with respect to y only.
For time-series data the regressors are assumed to be stochastic, but they are also
assumed to be dependent across observations, a necessary framework to accommo-
date lagged dependent variables. Hamilton (1994) focuses on this case, which is also
studied extensively by White (2001a). The simplest treatments restrict the random vari-
ables (y, x) to have stationary distribution. If instead the data are nonstationary with
unit roots then rates of convergence may no longer be √N and the limit distributions
may be nonnormal.
Despite these important conceptual and theoretical differences about the stochastic
nature of (y, x), however, for cross-section regression the eventual limit theorem is
usually of the general form given in Theorem 5.3.
5.3.6. Discussion
The form of the variance matrix in (5.20) is called the sandwich form, with B0 sand-
wiched between A_0^{-1} and A_0^{-1}. The sandwich form, introduced in Section 4.4.4, will
be discussed in more detail in Section 5.5.2.
The asymptotic results can be extended to inconsistent estimators. Then θ0 is re-
placed by the pseudo-true value θ*, defined to be that value of θ that yields the local
maximum of Q0(θ). This is considered in further detail for quasi-ML estimation in
Section 5.7.1. In most cases, however, the estimator is consistent and in later chapters
the subscript 0 is often dropped to simplify notation.
In the preceding results the objective function QN (θ) is initially defined with nor-
malization by 1/N, the first derivative of Q_N(θ) is then normalized by √N, and the second derivative is not normalized, leading to a √N-consistent estimator. In some
cases alternative normalizations may be needed, most notably time series with nonsta-
tionary trend.
The results assume that QN (θ) is a continuous differentiable function. This
excludes some estimators such as least absolute deviations, for which Q_N(θ) = N^{-1} Σ_i |y_i − x_i'β|. One way to proceed in this case is to obtain a differentiable approximating function Q*_N(θ) such that Q*_N(θ) − Q_N(θ) →p 0 and apply the preceding theorem to Q*_N(θ).
The key component to obtaining the limit distribution is linearization using a Taylor
series expansion. Taylor series expansions can be a poor global approximation to a
function. They work well in the statistical application here as the approximation is
asymptotically a local one, since consistency implies that for large sample sizes θ̂ is
close to the point of expansion θ0. More refined asymptotic theory is possible using the
Edgeworth expansion (see Section 11.4.3). The bootstrap (see Chapter 11) is a method
to empirically implement an Edgeworth expansion.
5.3.7. Informal Approach to Consistency of an m-Estimator
For the practitioner the limit normal result of Theorem 5.3 is much easier to prove than
formal proof of consistency using Theorem 5.1 or 5.2. Here we present an informal
approach to determining the nature and strength of distributional assumptions needed
for an m-estimator to be consistent.
For an m-estimator that is a local maximum, the first-order conditions (5.4) imply
that θ̂ is chosen so that the average of ∂q_i(θ)/∂θ|_θ̂ equals zero. Intuitively, a necessary condition for this to yield a consistent estimator for θ_0 is that in the limit the average of ∂q(θ)/∂θ|_{θ_0} goes to 0, or that
∂ QN (θ)
∂θ




θ0
= lim
1
N
N
	
i=1
E

∂qi (θ)
∂θ




θ0
'
= 0, (5.24)
where the first equality requires the assumption that a law of large numbers can be
applied and expectation in (5.24) is taken with respect to the population dgp for (y, X).
The limit is used as the equality need not be exact, provided any departure from zero
disappears as N → ∞. For example, consistency should hold if the expectation equals
1/N. The condition (5.24) provides a very useful check for the practitioner. An infor-
mal approach to consistency is to look at the first-order conditions for the estimator θ̂ and determine whether in the limit these have expectation zero when evaluated at
θ = θ0.
Even less formally, if we consider the components in the sum, the essential condi-
tion for consistency is whether for the typical observation
E[ ∂q(θ)/∂θ |_{θ_0} ] = 0.   (5.25)
This condition can provide a very useful guide to the practitioner. However, it is neither
a necessary nor a sufficient condition. If the expectation in (5.25) equals 1/N then it
is still likely that the probability limit in (5.24) equals zero, so the condition (5.25) is
not necessary. To see that it is not sufficient, consider y iid with mean µ0 estimated
using just one observation, say the first observation y_1. Then µ̂ solves y_1 − µ = 0 and (5.25) is satisfied. But clearly y_1 does not converge in probability to µ_0, as the single observation y_1 has a variance that
does not go to zero. The problem is that here the plim in (5.24) does not equal limE.
Formal proof of consistency requires use of theorems such as Theorem 5.1 or 5.2.
For Poisson regression use of (5.25) reveals that the essential condition for consis-
tency is correct specification of the conditional mean of y|x (see Section 5.2.3). Simi-
larly, the OLS estimator solves N^{-1} Σ_i x_i(y_i − x_i'β) = 0, so from (5.25) consistency essentially requires that E[x(y − x'β_0)] = 0. This condition fails if E[y|x] ≠ x'β_0,
which can happen for many reasons, as given in Section 4.7. In other examples use
of (5.25) can indicate that consistency will require considerably more parametric as-
sumptions than correct specification of the conditional mean.
To link use of (5.24) to condition (iii) in Theorem 5.2, note the following:
∂ Q0(θ)/∂θ = 0 (condition (iii) in Theorem 5.2)
⇒ ∂(plim QN (θ))/∂θ = 0 (from definition of Q0(θ))
⇒ ∂(lim E[QN (θ)])/∂θ = 0 (as an LLN ⇒ Q0 = plimQN = lim E[QN ])
⇒ lim ∂E[QN (θ)]/∂θ = 0 (interchanging limits and differentiation), and
⇒ lim E[∂ QN (θ)/∂θ] = 0 (interchanging differentiation and expectation).
The last line is the informal condition (5.24). However, obtaining this result re-
quires additional assumptions, including restriction to local maximum, application
of a law of large numbers, interchangeability of limits and differentiation, and in-
terchangeability of differentiation and expectation (i.e., integration). In the scalar
case a sufficient condition for interchanging differentiation and limits is lim_{h→0} (E[Q_N(θ + h)] − E[Q_N(θ)])/h = dE[Q_N(θ)]/dθ uniformly in θ.
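A quick simulation sketch (ours) of the informal check (5.25) for OLS: the sample analogue of E[x(y − x'β_0)] is near zero when the regressor is independent of the error but not when the two are correlated, signalling inconsistency of OLS.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100_000
beta0 = 1.0

u = rng.normal(size=N)
x_exog = rng.normal(size=N)              # independent of u
x_endog = rng.normal(size=N) + 0.5 * u   # correlated with u

for name, x in [("exogenous", x_exog), ("endogenous", x_endog)]:
    y = beta0 * x + u
    moment = np.mean(x * (y - x * beta0))   # sample analogue of E[x(y - x beta_0)]
    print(name, round(moment, 3))
# The exogenous case gives roughly 0; the endogenous case gives roughly 0.5 (= cov(x, u)).
```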
5.4. Estimating Equations
The derivation of the limit distribution given in Section 5.3.3 can be extended from a
local extremum estimator to estimators defined as being the solution of an estimating
equation that sets an average to zero. Several examples are given in Chapter 6.
5.4.1. Estimating Equations Estimator
Let θ̂ be defined as the solution to the system of q estimating equations
h_N(θ̂) = (1/N) Σ_{i=1}^N h(y_i, x_i, θ̂) = 0,   (5.26)
where h(·) is a q × 1 vector, and independence over i is assumed. Examples of h(·) are
given later in Section 5.4.2.
Since θ̂ is chosen so that the sample average of h(y, x, θ̂) equals zero, we expect that θ̂ →p θ_0 if in the limit the average of h(y, x, θ_0) goes to zero, that is, if plim h_N(θ_0) = 0. If an LLN can be applied this requires that lim E[h_N(θ_0)] = 0, or more loosely that for the ith observation
E[h(y_i, x_i, θ_0)] = 0.   (5.27)
The easiest way to formally establish consistency is actually to derive (5.26) as the
first-order conditions for an m-estimator.
Assuming consistency, the limit distribution of the estimating equations estimator
can be obtained in the same manner as in Section 5.3.3 for the extremum estimator.
Take an exact first-order Taylor series expansion of hN (θ) around θ0, as in (5.15) with
f(θ) = hN (θ), and set the right-hand side to 0 and solve. Then
√N(θ̂ − θ_0) = −[ ∂h_N(θ)/∂θ' |_{θ+} ]^{-1} √N h_N(θ_0).   (5.28)
This leads to the following theorem.
Theorem 5.4 (Limit Distribution of Estimating Equations Estimator):
Assume that the estimating equations estimator that solves (5.26) is consistent
for θ0 and make the following assumptions:
(i) ∂h_N(θ)/∂θ' exists and is continuous in an open convex neighborhood of θ_0.
(ii) ∂h_N(θ)/∂θ' |_{θ+} converges in probability to the finite nonsingular matrix
A_0 = plim ∂h_N(θ)/∂θ' |_{θ_0} = plim (1/N) Σ_{i=1}^N ∂h_i(θ)/∂θ' |_{θ_0},   (5.29)
for any sequence θ+ such that θ+ →p θ_0.
(iii) √N h_N(θ_0) →d N[0, B_0], where
B_0 = plim N h_N(θ_0)h_N(θ_0)' = plim (1/N) Σ_{i=1}^N Σ_{j=1}^N h_i(θ_0)h_j(θ_0)'.   (5.30)
Then the limit distribution of the estimating equations estimator is
√N(θ̂ − θ_0) →d N[0, A_0^{-1}B_0A_0^{-1}],   (5.31)
where, unlike for the extremum estimator, the matrix A_0 may not be symmetric since it is no longer necessarily a Hessian matrix.
This theorem can be proved by adaptation of Amemiya’s proof of Theorem 5.3.
Note that Theorem 5.4 assumes that consistency has already been established.
Godambe (1960) showed that for analysis conditional on regressors the most effi-
cient estimating equations estimator sets hi (θ) = ∂ ln f (yi |xi , θ)/∂θ. Then (5.26) are
the first-order conditions for the ML estimator.
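As a sketch (ours, not an example from the text), an estimating equations estimator can be computed by handing (5.26) directly to a general-purpose root finder. The estimating function below, h_i(β) = x_i(y_i − exp(x_i'β))/exp(x_i'β), is an illustrative choice satisfying E[h_i(β_0)] = 0 whenever E[y|x] = exp(x'β_0).

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(5)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ beta0))

def h_N(beta):
    """Sample estimating equations h_N(beta) = N^{-1} sum_i h_i(beta), as in (5.26)."""
    mu = np.exp(X @ beta)
    return X.T @ ((y - mu) / mu) / N

sol = root(h_N, x0=np.zeros(2))
print(sol.x)          # close to beta0 in large samples
print(h_N(sol.x))     # approximately zero at the solution
```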
5.4.2. Analogy Principle
The analogy principle uses population conditions to motivate estimators. The book
by Manski (1988a) emphasizes the importance of the analogy principle as a unify-
ing theme for estimation. Manski (1988a, p. xi) provides the following quote from
Goldberger (1968, p. 4):
The analogy principle of estimation . . . proposes that population parameters be
estimated by sample statistics which have the same property in the sample as the
parameters do in the population.
Analogue estimators are estimators obtained by application of the analogy prin-
ciple. Population moment conditions suggest as estimator the solution to the corre-
sponding sample moment condition.
Extremum estimator examples of application of the analogy principle have been
given in Section 4.2. For instance, if the goal of prediction is to minimize expected
loss in the population and squared error loss is used, then the regression parameters β
are estimated by minimizing the sample sum of squared errors.
Method of moments estimators are also examples. For instance, in the iid case if
E[y_i − µ] = 0 in the population then we use as estimator µ̂ that solves the corresponding sample moment conditions N^{-1} Σ_i (y_i − µ) = 0, leading to µ̂ = ȳ, the sample
mean.
An estimating equations estimator may be motivated as an analogue estimator. If
(5.27) holds in the population then estimate θ by solving the corresponding sample
moment condition (5.26).
Estimating equations estimators are extensively used in microeconometrics. The
relevant theory can be subsumed within that for generalized method of moments,
presented in the next chapter, which is an extension that permits there to be more
moment conditions than parameters. In applied statistics the approach is used in the
context of generalized estimating equations.
5.5. Statistical Inference
A detailed treatment of hypothesis tests and confidence intervals is given in Chapter 7.
Here we outline how to test linear restrictions, including exclusion restrictions, using
the most common method, the Wald test for estimators that may be nonlinear. Asymp-
totic theory is used, so formal results lead to chi-square and normal distributions rather
than the small sample F- and t-distributions from linear regression under normality.
Moreover, there are several ways to consistently estimate the variance matrix of an
extremum estimator, leading to alternative estimates of standard errors and associated
test statistics and p-values.
5.5.1. Wald Hypothesis Tests of Linear Restrictions
Consider testing h linearly independent restrictions, say H0 against Ha, where
H_0: Rθ_0 − r = 0,
H_a: Rθ_0 − r ≠ 0,
with R an h × q matrix of constants and r an h × 1 vector of constants. For example,
if θ = [θ_1, θ_2, θ_3]' then to test whether θ_10 − θ_20 = 2, set R = [1, −1, 0] and r = 2.
The Wald test rejects H_0 if Rθ̂ − r, the sample estimate of Rθ_0 − r, is significantly different from 0. This requires knowledge of the distribution of Rθ̂ − r. Suppose √N(θ̂ − θ_0) →d N[0, C_0], where C_0 = A_0^{-1}B_0A_0^{-1} from (5.20). Then
θ̂ ~a N[θ_0, N^{-1}C_0],
so that under H_0 the linear combination
Rθ̂ − r ~a N[0, R(N^{-1}C_0)R'],
where the mean is zero since Rθ0 − r = 0 under H0.
Chi-Square Tests
It is convenient to move from the multivariate normal distribution to the chi-square
distribution by taking the quadratic form. This yields the Wald statistic
W = (Rθ̂ − r)' [R(N^{-1}Ĉ)R']^{-1} (Rθ̂ − r)  →d  χ²(h)   (5.32)
under H_0, where R(N^{-1}C_0)R' is of full rank h under the assumption of linearly independent restrictions, and Ĉ is a consistent estimator of C_0. Large values of W lead to rejection, and H_0 is rejected at level α if W > χ²_α(h) and is not rejected otherwise.
Practitioners frequently instead use the F-statistic F = W/h. Inference is then based
on the F(h, N − q) distribution in the hope that this might provide a better finite sam-
ple approximation. Note that h times the F(h, N) distribution converges to the χ²(h) distribution as N → ∞.
The replacement of C_0 by Ĉ in obtaining (5.32) makes no difference asymptotically, but in finite samples different Ĉ will lead to different values of W. In the case of classical linear regression this step corresponds to replacing σ² by s². Then W/h is
exactly F distributed if the errors are normally distributed (see Section 7.2.1).
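A minimal sketch (ours, with made-up numbers for θ̂ and its estimated variance) of computing W in (5.32) and its chi-square p-value for two exclusion restrictions:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical estimates: theta_hat and its estimated asymptotic variance N^{-1} A^{-1} B A^{-1}
theta_hat = np.array([1.2, 0.05, -0.30])
V_hat = np.array([[0.040, 0.002, 0.001],
                  [0.002, 0.010, 0.000],
                  [0.001, 0.000, 0.025]])

# Test H0: theta_2 = 0 and theta_3 = 0 (two exclusion restrictions, so h = 2)
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.zeros(2)

diff = R @ theta_hat - r
W = diff @ np.linalg.solve(R @ V_hat @ R.T, diff)   # Wald statistic (5.32)
p_value = chi2.sf(W, df=R.shape[0])
print(W, p_value)
```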
Tests of a Single Coefficient
Often attention is focused on testing difference from zero of a single coefficient, say the
jth coefficient. Then Rθ − r = θ_j and W = θ̂_j² / (N^{-1}ĉ_jj), where ĉ_jj is the jth diagonal
element in Ĉ. Taking the square root of W yields
t = θ̂_j / se[θ̂_j]  →d  N[0, 1]   (5.33)
under H_0, where se[θ̂_j] = √(N^{-1}ĉ_jj) is the asymptotic standard error of θ̂_j. Large val-
ues of t lead to rejection, and unlike W the statistic t can be used for one-sided tests.
Formally, √W is an asymptotic z-statistic, but we use the notation t as it yields
the usual “t-statistic,” the estimate divided by its standard error. In finite samples,
some statistical packages use the standard normal distribution whereas others use the
t-distribution to compute critical values, p-values, and confidence intervals. Neither is
exactly correct in finite samples, except in the very special case of linear regression
with errors assumed to be normally distributed, in which case the t-distribution is
exact. Both lead to the same results in infinitely large samples as the t-distribution
then collapses to the standard normal.
5.5.2. Variance Matrix Estimation
There are many possible ways to estimate A_0^{-1}B_0A_0^{-1}, because there are many ways to
consistently estimate A0 and B0. Thus different econometrics programs should give the
same coefficient estimates but, quite reasonably, can give standard errors, t-statistics,
and p-values that differ in finite samples. It is up to the practitioner to determine the
method used and the strength of the associated distributional assumptions on the dgp.
Sandwich Estimate of the Variance Matrix
The limit distribution of √N(θ̂ − θ_0) has variance matrix A_0^{-1}B_0A_0^{-1}. It follows that θ̂ has asymptotic variance matrix N^{-1}A_0^{-1}B_0A_0^{-1}, where division by N arises because we are considering θ̂ rather than √N(θ̂ − θ_0).
A sandwich estimate of the asymptotic variance of θ̂ is any estimate of the form
V̂[θ̂] = N^{-1} Â^{-1} B̂ Â^{-1},   (5.34)
where Â is consistent for A_0 and B̂ is consistent for B_0. This is called the sandwich form since B̂ is sandwiched between Â^{-1} and Â^{-1}. For many estimators Â is a Hessian matrix so Â^{-1} is symmetric, but this need not always be the case.
A robust sandwich estimate is a sandwich estimate where the estimate B̂ is con-
sistent for B0 under relatively weak assumptions. It leads to what are termed robust
standard errors. A leading example is White’s heteroskedastic-consistent estimate of
the variance matrix of the OLS estimator (see Section 4.4.5). In various specific con-
texts, detailed in later sections, robust sandwich estimates are called Huber estimates,
after Huber (1967); Eicker–White estimates, after Eicker (1967) and White (1980a,b,
1982); and in stationary time-series applications Newey–West estimates, after Newey
and West (1987b).
Estimation of A and B
Here we present different estimators for A0 and B0 for both the estimating equa-
tions estimator that solves h_N(θ̂) = 0 and the local extremum estimator that solves ∂Q_N(θ)/∂θ|_θ̂ = 0.
Two standard estimates of A_0 in (5.29) and (5.18) are the Hessian estimate
Â_H = ∂h_N(θ)/∂θ' |_θ̂ = ∂²Q_N(θ)/∂θ∂θ' |_θ̂,   (5.35)
where the second equality explains the use of the term Hessian, and the expected Hessian estimate
Â_EH = E[ ∂h_N(θ)/∂θ' ] |_θ̂ = E[ ∂²Q_N(θ)/∂θ∂θ' ] |_θ̂.   (5.36)
The first is analytically simpler and potentially relies on fewer distributional assump-
tions; the latter is more likely to be negative definite and invertible.
For B_0 in (5.30) or (5.19) it is not possible to use the obvious estimate N h_N(θ̂)h_N(θ̂)', since this equals zero as θ̂ is defined to satisfy h_N(θ̂) = 0. One estimate is to make potentially strong distributional assumptions to get
B̂_E = E[ N h_N(θ)h_N(θ)' ] |_θ̂ = E[ N ∂Q_N(θ)/∂θ × ∂Q_N(θ)/∂θ' ] |_θ̂.   (5.37)
Weaker assumptions are possible for m-estimators and estimating equations estimators
with data independent over i. Then (5.30) simplifies to
B_0 = E[ (1/N) Σ_{i=1}^N h_i(θ)h_i(θ)' ],
since independence implies that, for i ≠ j, E[h_i h_j'] = E[h_i]E[h_j'], which in turn equals zero given E[h_i(θ)] = 0. This leads to the outer product (OP) estimate or BHHH estimate (after Berndt, Hall, Hall, and Hausman, 1974)
B̂_OP = (1/N) Σ_{i=1}^N h_i(θ̂)h_i(θ̂)' = (1/N) Σ_{i=1}^N ∂q_i(θ)/∂θ |_θ̂ × ∂q_i(θ)/∂θ' |_θ̂.   (5.38)
B̂_OP requires fewer assumptions than B̂_E.
In practice a degrees of freedom adjustment is often used in estimating B0, with
division in (5.38) for B̂_OP by (N − q) rather than N, and similar multiplication of B̂_E in (5.37) by N/(N − q). There is no theoretical justification for this adjustment in nonlinear models, but in some simulation studies this adjustment leads to better finite-sample performance and it does coincide with the degrees of freedom adjustment made for OLS with homoskedastic errors. No similar adjustment is made for Â_H or Â_EH.
Simplification occurs in some special cases with A0 = − B0. Leading examples are
OLS or NLS with homoskedastic errors (see Section 5.8.3) and maximum likelihood
with correctly specified distribution (see Section 5.6.4). Then either −Â^{-1} or B̂^{-1} may be used to estimate the variance of √N(θ̂ − θ_0). These estimates are less robust to misspecification of the dgp than those using the sandwich form. Misspecification of
the dgp, however, may additionally lead to inconsistency of 
θ, in which case even
inference based on the robust sandwich estimate will be invalid.
For the Poisson example of Section 5.2, 
AH = 
AEH = −N−1

i exp(x
i

β)xi x
i and

BOP = (N − q)−1

i (yi − exp(x
i

β))2
xi x
i . If V[y|x] = exp(x
β0), the case if y|x is
actually Poisson distributed, then 
BE = −[N/(N − q)]
AEH and simplification occurs.
5.6. Maximum Likelihood
The ML estimator holds a special place among estimators. It is the most efficient estimator among consistent asymptotically normal estimators. It is also important pedagogically, as many methods for nonlinear regression, such as m-estimation, can be viewed as extensions and adaptations of results first obtained for ML estimation.
5.6.1. Likelihood Function
The Likelihood Principle
The likelihood principle, due to R. A. Fisher (1922), is to choose as estimator of the
parameter vector θ0 that value of θ that maximizes the likelihood of observing the ac-
tual sample. In the discrete case this likelihood is the probability obtained from the
probability mass function; in the continuous case this is the density. Consider the dis-
crete case. If one value of θ implies that the probability of the observed data occurring
is .0012, whereas a second value of θ gives a higher probability of .0014, then the
second value of θ is a better estimator.
The joint probability mass function or density $f(\mathbf{y}, \mathbf{X}|\theta)$ is viewed here as a function of $\theta$ given the data $(\mathbf{y}, \mathbf{X})$. This is called the likelihood function and is denoted by $L_N(\theta|\mathbf{y}, \mathbf{X})$. Maximizing $L_N(\theta)$ is equivalent to maximizing the log-likelihood function
$$
\mathcal{L}_N(\theta) = \ln L_N(\theta).
$$
We take the natural logarithm because in application this leads to an objective function that is the sum rather than the product of $N$ terms.
Conditional Likelihood
The likelihood function $L_N(\theta) = f(\mathbf{y}, \mathbf{X}|\theta) = f(\mathbf{y}|\mathbf{X}, \theta)f(\mathbf{X}|\theta)$ requires specification of both the conditional density of $\mathbf{y}$ given $\mathbf{X}$ and the marginal density of $\mathbf{X}$.
Instead, estimation is usually based on the conditional likelihood function $L_N(\theta) = f(\mathbf{y}|\mathbf{X}, \theta)$, since the goal of regression is to model the behavior of $\mathbf{y}$ given $\mathbf{X}$. This is not a restriction if $f(\mathbf{y}|\mathbf{X})$ and $f(\mathbf{X})$ depend on mutually exclusive sets of parameters. When this is the case it is common terminology to drop the adjective conditional. For rare exceptions such as endogenous sampling (see Chapters 3 and 24), consistent estimation requires that estimation be based on the full joint density $f(\mathbf{y}, \mathbf{X}|\theta)$ rather than the conditional density $f(\mathbf{y}|\mathbf{X}, \theta)$.
Table 5.3. Maximum Likelihood: Commonly Used Densities

Model        | Range of y    | Density f(y)                             | Common Parameterization
Normal       | (−∞, ∞)       | [2πσ²]^{-1/2} exp{−(y−µ)²/(2σ²)}         | µ = x'β, σ² = σ²
Bernoulli    | 0 or 1        | p^y (1−p)^{1−y}                          | Logit: p = e^{x'β}/(1 + e^{x'β})
Exponential  | (0, ∞)        | λ e^{−λy}                                | λ = e^{x'β} or 1/λ = e^{x'β}
Poisson      | 0, 1, 2, ...  | e^{−λ} λ^y / y!                          | λ = e^{x'β}
For cross-section data the observations $(y_i, x_i)$ are independent over $i$ with conditional density function $f(y_i|x_i, \theta)$. Then by independence the joint conditional density is $f(\mathbf{y}|\mathbf{X}, \theta) = \prod_{i=1}^{N}f(y_i|x_i, \theta)$, leading to the (conditional) log-likelihood function
$$
Q_N(\theta) = N^{-1}\mathcal{L}_N(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ln f(y_i|x_i, \theta),  \qquad (5.39)
$$
where we divide by $N$ so that the objective function is an average.
Results extend to multivariate data, systems of equations, and panel data by replacing the scalar $y_i$ by a vector $\mathbf{y}_i$ and letting $f(\mathbf{y}_i|x_i, \theta)$ be the joint density of $\mathbf{y}_i$ conditional on $x_i$. See also Section 5.7.5.
Examples
Across a wide range of data types the following method is used to generate fully
parametric cross-section regression models. First choose the one-parameter or two-
parameter (or in some rare cases three-parameter) distribution that would be used for
the dependent variable y in the iid case studied in a basic statistics course. Then pa-
rameterize the one or two underlying parameters in terms of regressors x and para-
meters θ.
Some commonly used distributions and parameterizations are given in Table 5.3.
Additional distributions are given in Appendix B, which also presents methods to draw
pseudo-random variates.
For continuous data on $(-\infty, \infty)$, the normal is the standard distribution. The classical linear regression model sets $\mu = x'\beta$ and assumes $\sigma^2$ is constant.
For discrete binary data taking values 0 or 1, the density is always the Bernoulli, a special case of the binomial with one trial. The usual parameterizations for the Bernoulli probability lead to the logit model, given in Table 5.3, and the probit model with $p = \Phi(x'\beta)$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function. These models are analyzed in Chapter 14.
For positive continuous data on $(0, \infty)$, notably duration data considered in Chapters 17–19, the richer Weibull, gamma, and log-normal models are often used in addition to the exponential given in Table 5.3.
For integer-valued count data taking values 0, 1, 2, ... (see Chapter 20) the richer negative binomial is often used in addition to the Poisson presented in Section 5.2.1. Setting $\lambda = \exp(x'\beta)$ ensures a positive conditional mean.
For incompletely observed data, censored or truncated variants of these distributions
may be used. The most common example is the censored normal, which is called the
Tobit model and is presented in Section 16.3.
Standard likelihood-based models are rarely specified by making assumptions on the distribution of an error term. They are instead defined directly in terms of the distribution of the dependent variable. In the special case that $y \sim \mathcal{N}[x'\beta, \sigma^2]$ we can equivalently define $y = x'\beta + u$, where the error term $u \sim \mathcal{N}[0, \sigma^2]$. However, this relies on an additive property of the normal shared by few other distributions. For example, if $y$ is Poisson distributed with mean $\exp(x'\beta)$ we can always write $y = \exp(x'\beta) + u$, but the error $u$ no longer has a familiar distribution.
5.6.2. Maximum Likelihood Estimator
The maximum likelihood estimator (MLE) is the estimator that maximizes the (conditional) log-likelihood function and is clearly an extremum estimator. Usually the MLE is the local maximum that solves the first-order conditions
$$
\frac{1}{N}\frac{\partial\mathcal{L}_N(\theta)}{\partial\theta} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial\ln f(y_i|x_i, \theta)}{\partial\theta} = 0.  \qquad (5.40)
$$
More formally this estimator is the conditional MLE, as it is based on the conditional density of $y$ given $x$, but it is common practice to use the simpler term MLE.
The gradient vector $\partial\mathcal{L}_N(\theta)/\partial\theta$ is called the score vector, as it sums the first derivatives of the log density, and when evaluated at $\theta_0$ it is called the efficient score.
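In practice the first-order conditions (5.40) rarely have a closed-form solution, so the MLE is computed numerically (iterative methods are discussed in Chapter 10). As a minimal sketch, and not the authors' code, the Python function below maximizes the average log-likelihood for the exponential model of Table 5.3 with $\lambda = \exp(x'\beta)$ by minimizing its negative with a quasi-Newton routine; the data arrays y and X are assumed given.

import numpy as np
from scipy.optimize import minimize

def fit_exponential_mle(y, X):
    """Conditional MLE for an exponential model with rate lambda_i = exp(x_i'beta)."""
    def neg_avg_loglik(beta):
        xb = X @ beta
        # log f(y|x) = x'beta - y*exp(x'beta)
        return -np.mean(xb - y * np.exp(xb))
    def neg_avg_score(beta):
        xb = X @ beta
        # score of the log density: (1 - y*exp(x'beta)) * x
        return -(X * (1.0 - y * np.exp(xb))[:, None]).mean(axis=0)
    beta_start = np.zeros(X.shape[1])
    res = minimize(neg_avg_loglik, beta_start, jac=neg_avg_score, method="BFGS")
    return res.x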
5.6.3. Information Matrix Equality
The results of Section 5.3 simplify for the MLE, provided the density is correctly
specified and is one for which the range of y does not depend on θ.
Regularity Conditions
The ML regularity conditions are that
$$
\mathrm{E}_f\!\left[\frac{\partial\ln f(y|x, \theta)}{\partial\theta}\right] = \int\frac{\partial\ln f(y|x, \theta)}{\partial\theta}f(y|x, \theta)\,dy = 0  \qquad (5.41)
$$
and
$$
-\mathrm{E}_f\!\left[\frac{\partial^2\ln f(y|x, \theta)}{\partial\theta\,\partial\theta'}\right] = \mathrm{E}_f\!\left[\frac{\partial\ln f(y|x, \theta)}{\partial\theta}\frac{\partial\ln f(y|x, \theta)}{\partial\theta'}\right],  \qquad (5.42)
$$
where the notation $\mathrm{E}_f[\cdot]$ is used to make explicit that the expectation is with respect to the specified density $f(y|x, \theta)$. Result (5.41) implies that the score vector has expected value zero, and (5.42) yields (5.44).
The derivation given in Section 5.6.7 requires that the range of $y$ does not depend on $\theta$, so that integration and differentiation can be interchanged.
Information Matrix Equality
The information matrix is the expectation of the outer product of the score vector,
$$
\mathcal{I} = \mathrm{E}\!\left[\frac{\partial\mathcal{L}_N(\theta)}{\partial\theta}\frac{\partial\mathcal{L}_N(\theta)}{\partial\theta'}\right].  \qquad (5.43)
$$
The terminology information matrix is used as $\mathcal{I}$ is the variance of $\partial\mathcal{L}_N(\theta)/\partial\theta$, since by (5.41) $\partial\mathcal{L}_N(\theta)/\partial\theta$ has mean zero. Then large values of $\mathcal{I}$ mean that small changes in $\theta$ lead to large changes in the log-likelihood, which accordingly contains considerable information about $\theta$. The quantity $\mathcal{I}$ is more precisely called Fisher information, as there are alternative information measures.
For log-likelihood function (5.39), the regularity condition (5.42) implies that
$$
-\mathrm{E}_f\!\left[\left.\frac{\partial^2\mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'}\right|_{\theta_0}\right] = \mathrm{E}_f\!\left[\left.\frac{\partial\mathcal{L}_N(\theta)}{\partial\theta}\frac{\partial\mathcal{L}_N(\theta)}{\partial\theta'}\right|_{\theta_0}\right],  \qquad (5.44)
$$
if the expectation is with respect to $f(\mathbf{y}|\mathbf{X}, \theta_0)$. The relationship (5.44) is called the information matrix (IM) equality and implies that the information matrix also equals $-\mathrm{E}[\partial^2\mathcal{L}_N(\theta)/\partial\theta\,\partial\theta']$. The IM equality (5.44) implies that $-A_0 = B_0$, where $A_0$ and $B_0$ are defined in (5.18) and (5.19). Theorem 5.3 then simplifies since $A_0^{-1}B_0A_0^{-1} = -A_0^{-1} = B_0^{-1}$.
The equality (5.42) is in turn a special case of the generalized information matrix equality
$$
\mathrm{E}_f\!\left[\frac{\partial m(y, \theta)}{\partial\theta'}\right] = -\mathrm{E}_f\!\left[m(y, \theta)\frac{\partial\ln f(y|\theta)}{\partial\theta'}\right],  \qquad (5.45)
$$
where $m(\cdot)$ is a vector moment function with $\mathrm{E}_f[m(y, \theta)] = 0$ and expectations are with respect to the density $f(y|\theta)$. This result, also obtained in Section 5.6.7, is used in Chapters 7 and 8 to obtain simpler forms of some test statistics.
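The IM equality can also be checked numerically by simulation. The short Python sketch below is our own illustration, not from the book: it draws Poisson data with a correctly specified mean and compares the average negative Hessian with the average outer product of scores; the two matrices agree up to simulation error, and the agreement breaks down if the dgp is instead overdispersed.

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
beta0 = np.array([1.0, 0.5])
mu = np.exp(X @ beta0)
y = rng.poisson(mu)                        # correctly specified Poisson dgp

score = X * (y - mu)[:, None]              # score of the Poisson log density at beta0
outer_product = score.T @ score / N        # estimate of B0
neg_hessian = (X * mu[:, None]).T @ X / N  # estimate of -A0
print(np.round(outer_product, 3))
print(np.round(neg_hessian, 3))            # approximately equal: IM equality holds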
5.6.4. Distribution of the ML Estimator
The regularity conditions (5.41) and (5.42) lead to simplification of the general results
of Section 5.3.
The essential consistency condition (5.25) is that E[∂ ln f (y|x, θ)/∂θ|θ0
] = 0. This
holds by the regularity condition (5.41), provided the expectation is with respect to
f (y|x, θ0). Thus if the dgp is f (y|x, θ0), that is, the density has been correctly speci-
fied, the MLE is consistent for θ0.
For the asymptotic distribution, simplification occurs since −A0 = B0 by the IM
equality, which again assumes that the density is correctly specified.
These results can be collected into the following proposition.
Proposition 5.5 (Distribution of ML Estimator): Make the following assumptions:
(i) The dgp is the conditional density $f(y_i|x_i, \theta_0)$ used to define the likelihood function.
(ii) The density function $f(\cdot)$ satisfies $f(y, \theta^{(1)}) = f(y, \theta^{(2)})$ iff $\theta^{(1)} = \theta^{(2)}$.
(iii) The matrix
$$
A_0 = \operatorname{plim}\left.\frac{1}{N}\frac{\partial^2\mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'}\right|_{\theta_0}  \qquad (5.46)
$$
exists and is finite nonsingular.
(iv) The order of differentiation and integration of the log-likelihood can be reversed.
Then the ML estimator $\hat{\theta}_{ML}$, defined to be a solution of the first-order conditions $\partial N^{-1}\mathcal{L}_N(\theta)/\partial\theta = 0$, is consistent for $\theta_0$, and
$$
\sqrt{N}(\hat{\theta}_{ML} - \theta_0)\ \xrightarrow{d}\ \mathcal{N}\!\left[0,\ -A_0^{-1}\right].  \qquad (5.47)
$$
Condition (i) states that the conditional density is correctly specified; conditions (i) and (ii) ensure that $\theta_0$ is identified; condition (iii) is analogous to the assumption on $\operatorname{plim}N^{-1}\mathbf{X}'\mathbf{X}$ in the case of OLS estimation; and condition (iv) is necessary for the regularity conditions to hold. As in the general case, probability limits and expectations are with respect to the dgp for $(\mathbf{y}, \mathbf{X})$, or with respect to just $\mathbf{y}$ if regressors are assumed to be nonstochastic or analysis is conditional on $\mathbf{X}$.
Relaxation of condition (i) is considered in detail in Section 5.7. Most ML examples satisfy condition (iv), but it does rule out some models such as $y$ uniformly distributed on the interval $[0, \theta]$, since in this case the range of $y$ varies with $\theta$. Then not only does $A_0 \neq -B_0$ but the global MLE converges at a rate other than $\sqrt{N}$ and has a limit distribution that is nonnormal. See, for example, Hirano and Porter (2003).
Given Proposition 5.5, the resulting asymptotic distribution of the MLE is often expressed as
$$
\hat{\theta}_{ML}\ \overset{a}{\sim}\ \mathcal{N}\!\left[\theta,\ -\left(\mathrm{E}\!\left[\frac{\partial^2\mathcal{L}_N(\theta)}{\partial\theta\,\partial\theta'}\right]\right)^{-1}\right],  \qquad (5.48)
$$
where for notational simplicity the evaluation at $\theta_0$ is suppressed and we assume that an LLN applies, so that the plim operator in the definition of $A_0$ is replaced by $\lim\mathrm{E}$, and then drop the limit. This notation is often used in later chapters.
The right-hand side of (5.48) is the Cramer–Rao lower bound (CRLB), which from basic statistics courses is the lower bound of the variance of unbiased estimators in small samples. For large samples, considered here, the CRLB is the lower bound for the variance matrix of consistent asymptotically normal (CAN) estimators with convergence to normality of $\sqrt{N}(\hat{\theta} - \theta_0)$ uniform in compact intervals of $\theta_0$ (see Rao, 1973, pp. 344–351). Loosely speaking, the MLE has the strong attraction of having the smallest asymptotic variance among root-$N$ consistent estimators. This result requires the strong assumption of correct specification of the conditional density.
5.6.5. Weibull Regression Example
As an example, consider regression based on the Weibull distribution, which is used to
model duration data such as length of unemployment spell (see Chapter 17).
The density for the Weibull distribution is $f(y) = \gamma\alpha y^{\alpha-1}\exp(-\gamma y^{\alpha})$, where $y > 0$ and the parameters $\alpha > 0$ and $\gamma > 0$. It can be shown that $\mathrm{E}[y] = \gamma^{-1/\alpha}\Gamma(\alpha^{-1} + 1)$, where $\Gamma(\cdot)$ is the gamma function. The standard Weibull regression model is obtained by specifying $\gamma = \exp(x'\beta)$, in which case $\mathrm{E}[y|x] = \exp(-x'\beta/\alpha)\Gamma(\alpha^{-1} + 1)$. Given independence over $i$ the log-likelihood function is
$$
N^{-1}\mathcal{L}_N(\theta) = N^{-1}\sum_i\{x_i'\beta + \ln\alpha + (\alpha - 1)\ln y_i - \exp(x_i'\beta)y_i^{\alpha}\}.
$$
Differentiation with respect to $\beta$ and $\alpha$ leads to the first-order conditions
$$
N^{-1}\sum_i\{1 - \exp(x_i'\beta)y_i^{\alpha}\}x_i = 0,
$$
$$
N^{-1}\sum_i\left\{\frac{1}{\alpha} + \ln y_i - \exp(x_i'\beta)y_i^{\alpha}\ln y_i\right\} = 0.
$$
Unlike the Poisson example, consistency essentially requires correct specification of the distribution. To see this, consider the first-order conditions for $\beta$. The informal condition (5.25) that $\mathrm{E}[\{1 - \exp(x'\beta)y^{\alpha}\}x] = 0$ requires that $\mathrm{E}[y^{\alpha}|x] = \exp(-x'\beta)$, where the power $\alpha$ is not restricted to be an integer. The first-order conditions for $\alpha$ lead to an even more esoteric moment condition on $y$.
So we need to proceed on the assumption that the density is indeed Weibull with $\gamma = \exp(x'\beta_0)$ and $\alpha = \alpha_0$. Proposition 5.5 can be applied as the range of $y$ does not depend on the parameters. Then, from (5.48), the Weibull MLE is asymptotically normal with asymptotic variance
$$
V\!\begin{bmatrix}\hat{\beta}\\ \hat{\alpha}\end{bmatrix} = \left(-\mathrm{E}\!\begin{bmatrix}\sum_i -e^{x_i'\beta_0}y_i^{\alpha_0}x_ix_i' & \sum_i -e^{x_i'\beta_0}y_i^{\alpha_0}\ln(y_i)x_i\\ \sum_i -e^{x_i'\beta_0}y_i^{\alpha_0}\ln(y_i)x_i' & \sum_i d_i\end{bmatrix}\right)^{-1},  \qquad (5.49)
$$
where $d_i = -(1/\alpha_0^2) - e^{x_i'\beta_0}y_i^{\alpha_0}(\ln y_i)^2$. The matrix inverse in (5.49) needs to be obtained by partitioned inversion because the off-diagonal term $\partial^2\mathcal{L}_N(\beta, \alpha)/\partial\beta\,\partial\alpha$ does not have expected value zero. Simplification occurs in models with zero expected cross-derivative, $\mathrm{E}[\partial^2\mathcal{L}_N(\beta, \alpha)/\partial\beta\,\partial\alpha'] = 0$, such as regression with normally distributed errors, in which case the information matrix is said to be block diagonal in $\beta$ and $\alpha$.
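For concreteness, the Python sketch below, which is our own illustration rather than the authors' code, maximizes the Weibull log-likelihood above numerically; it assumes data arrays y and X are available and parameterizes the shape as $\alpha = \exp(a)$ so that it stays positive.

import numpy as np
from scipy.optimize import minimize

def fit_weibull_mle(y, X):
    """Weibull regression MLE with gamma_i = exp(x_i'beta) and shape alpha = exp(a)."""
    def neg_avg_loglik(params):
        beta, a = params[:-1], params[-1]
        alpha = np.exp(a)                      # enforce alpha > 0
        xb = X @ beta
        # per-observation log-likelihood from the expression above
        ll = xb + np.log(alpha) + (alpha - 1.0) * np.log(y) - np.exp(xb) * y**alpha
        return -np.mean(ll)
    start = np.concatenate([np.zeros(X.shape[1]), [0.0]])
    res = minimize(neg_avg_loglik, start, method="BFGS")
    beta_hat, alpha_hat = res.x[:-1], np.exp(res.x[-1])
    return beta_hat, alpha_hat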
5.6.6. Variance Matrix Estimation for MLE
There are several ways to consistently estimate the variance matrix of an extremum estimator, as already noted in Section 5.5.2. For the MLE additional possibilities arise if the information matrix equality is assumed to hold. Then $A_0^{-1}B_0A_0^{-1}$, $-A_0^{-1}$, and $B_0^{-1}$ are all asymptotically equivalent, as are the corresponding consistent estimates of these quantities. A detailed discussion for the MLE is given in Davidson and MacKinnon (1993, chapter 18).
The sandwich estimate $\hat{A}^{-1}\hat{B}\hat{A}^{-1}$ is called the Huber estimate, after Huber (1967), or White estimate, after White (1982), who considered the distribution of the MLE without imposing the information matrix equality. The sandwich estimate is in theory more robust than $-\hat{A}^{-1}$ or $\hat{B}^{-1}$. It is important to note, however, that the cause of failure of the information matrix equality may additionally lead to the more fundamental complication of inconsistency of $\hat{\theta}_{ML}$. This is the subject of Section 5.7.
5.6.7. Derivation of ML Regularity Conditions
We now formally derive the regularity conditions stated in Section 5.6.3. For notational
simplicity the subscript i and the regressor vector are suppressed.
Begin by deriving the first condition (5.41). The density integrates to one, that is,
$$
\int f(y|\theta)\,dy = 1.
$$
Differentiating both sides with respect to $\theta$ yields $\frac{\partial}{\partial\theta}\int f(y|\theta)\,dy = 0$. If the range of integration (the range of $y$) does not depend on $\theta$ this implies
$$
\int\frac{\partial f(y|\theta)}{\partial\theta}\,dy = 0.  \qquad (5.50)
$$
Now $\partial\ln f(y|\theta)/\partial\theta = [\partial f(y|\theta)/\partial\theta]/[f(y|\theta)]$, which implies
$$
\frac{\partial f(y|\theta)}{\partial\theta} = \frac{\partial\ln f(y|\theta)}{\partial\theta}f(y|\theta).  \qquad (5.51)
$$
Substituting (5.51) in (5.50) yields
$$
\int\frac{\partial\ln f(y|\theta)}{\partial\theta}f(y|\theta)\,dy = 0,  \qquad (5.52)
$$
which is (5.41) provided the expectation is with respect to the density $f(y|\theta)$.
Now consider the second condition (5.42), initially deriving a more general result. Suppose
$$
\mathrm{E}[m(y, \theta)] = 0,
$$
for some (possibly vector) function $m(\cdot)$. Then when the expectation is taken with respect to the density $f(y|\theta)$,
$$
\int m(y, \theta)f(y|\theta)\,dy = 0.  \qquad (5.53)
$$
Differentiating both sides with respect to $\theta'$ and assuming differentiation and integration are interchangeable yields
$$
\int\left[\frac{\partial m(y, \theta)}{\partial\theta'}f(y|\theta) + m(y, \theta)\frac{\partial f(y|\theta)}{\partial\theta'}\right]dy = 0.  \qquad (5.54)
$$
Substituting (5.51) in (5.54) yields
$$
\int\left[\frac{\partial m(y, \theta)}{\partial\theta'}f(y|\theta) + m(y, \theta)\frac{\partial\ln f(y|\theta)}{\partial\theta'}f(y|\theta)\right]dy = 0,  \qquad (5.55)
$$
or
$$
\mathrm{E}\!\left[\frac{\partial m(y, \theta)}{\partial\theta'}\right] = -\mathrm{E}\!\left[m(y, \theta)\frac{\partial\ln f(y|\theta)}{\partial\theta'}\right],  \qquad (5.56)
$$
when the expectation is taken with respect to the density $f(y|\theta)$. The regularity condition (5.42) is the special case $m(y, \theta) = \partial\ln f(y|\theta)/\partial\theta$ and leads to the IM equality (5.44). The more general result (5.56) leads to the generalized IM equality (5.45).
What happens when integration and differentiation cannot be interchanged? The starting point (5.50) no longer holds, as by the fundamental theorem of calculus the derivative with respect to $\theta$ of $\int f(y|\theta)\,dy$ includes an additional term reflecting the presence of a function of $\theta$ in the range of the integral. Then $\mathrm{E}[\partial\ln f(y|\theta)/\partial\theta] \neq 0$.
What happens when the density is misspecified? Then (5.52) still holds, but it does not necessarily imply (5.41), since in (5.41) the expectation will no longer be with respect to the specified density $f(y|\theta)$.
5.7. Quasi-Maximum Likelihood
The quasi-MLE $\hat{\theta}_{QML}$ is defined to be the estimator that maximizes a log-likelihood function that is misspecified, as the result of specification of the wrong density. Generally such misspecification leads to inconsistent estimation.
In this section general properties of the quasi-MLE are presented, followed by some special cases where the quasi-MLE retains consistency.
5.7.1. Pseudo-True Value
In principle any misspecification of the density may lead to inconsistency, as then the expectation in evaluation of $\mathrm{E}[\partial\ln f(y|x, \theta)/\partial\theta|_{\theta_0}]$ (see Section 5.6.4) is no longer with respect to $f(y|x, \theta_0)$.
By adaptation of the general consistency proof in Section 5.3.2, the quasi-MLE $\hat{\theta}_{QML}$ converges in probability to the pseudo-true value $\theta_*$ defined as
$$
\theta_* = \arg\max_{\theta\in\Theta}\ \operatorname{plim}N^{-1}\mathcal{L}_N(\theta).  \qquad (5.57)
$$
The probability limit is taken with respect to the true dgp. If the true dgp differs from the assumed density $f(y|x, \theta)$ used to form $\mathcal{L}_N(\theta)$, then usually $\theta_* \neq \theta_0$ and the quasi-MLE is inconsistent.
Huber (1967) and White (1982) showed that the asymptotic distribution of the quasi-MLE is similar to that for the MLE, except that it is centered around $\theta_*$ and the IM equality no longer holds. Then
$$
\sqrt{N}(\hat{\theta}_{QML} - \theta_*)\ \xrightarrow{d}\ \mathcal{N}\!\left[0,\ A_*^{-1}B_*A_*^{-1}\right],  \qquad (5.58)
$$
where $A_*$ and $B_*$ are as defined in (5.18) and (5.19) except that probability limits are taken with respect to the unknown true dgp and are evaluated at $\theta_*$. Consistent estimates $\hat{A}_*$ and $\hat{B}_*$ can be obtained as in Section 5.5.2, with evaluation at $\hat{\theta}_{QML}$.
This distributional result is used for statistical inference if the quasi-MLE retains
consistency. If the quasi-MLE is inconsistent then usually θ∗
has no simple interpre-
tation, aside from that given in the next section. However, (5.58) may still be useful if
nonetheless there is interest in knowing the precision of estimation. The result (5.58)
also provides motivation for White’s information matrix test (see Section 8.2.8) and
for Vuong’s test for discriminating between parametric models (see Section 8.5.3).
5.7.2. Kullback–Leibler Distance
Recall from Section 4.2.3 that even if $\mathrm{E}[y|x] \neq x'\beta_0$ the OLS estimator can still be interpreted as the best linear predictor of $\mathrm{E}[y|x]$ under squared error loss. White (1982) proposed a qualitatively similar interpretation for the quasi-MLE.
Let $f(\mathbf{y}|\theta)$ denote the assumed joint density of $y_1, \ldots, y_N$ and let $h(\mathbf{y})$ denote the true density, which is unknown, where for simplicity dependence on regressors is suppressed. Define the Kullback–Leibler information criterion (KLIC)
$$
\mathrm{KLIC} = \mathrm{E}\!\left[\ln\frac{h(\mathbf{y})}{f(\mathbf{y}|\theta)}\right],  \qquad (5.59)
$$
where expectation is with respect to $h(\mathbf{y})$. KLIC takes a minimum value of 0 when there is a $\theta_0$ such that $h(\mathbf{y}) = f(\mathbf{y}|\theta_0)$, that is, the density is correctly specified, and larger values of KLIC indicate greater ignorance about the true density.
Then the quasi-MLE $\hat{\theta}_{QML}$ minimizes the distance between $f(\mathbf{y}|\theta)$ and $h(\mathbf{y})$, where distance is measured using KLIC. To obtain this result, note that under suitable assumptions $\operatorname{plim}N^{-1}\mathcal{L}_N(\theta) = \mathrm{E}[\ln f(\mathbf{y}|\theta)]$, so $\hat{\theta}_{QML}$ converges to $\theta_*$ that maximizes $\mathrm{E}[\ln f(\mathbf{y}|\theta)]$. However, this is equivalent to minimizing KLIC, since $\mathrm{KLIC} = \mathrm{E}[\ln h(\mathbf{y})] - \mathrm{E}[\ln f(\mathbf{y}|\theta)]$ and the first term does not depend on $\theta$ as the expectation is with respect to $h(\mathbf{y})$.
5.7.3. Linear Exponential Family
In some special cases the quasi-MLE is consistent even when the density is partially
misspecified. One well-known example is that the quasi-MLE for the linear regres-
sion model with normality is consistent even if the errors are nonnormal, provided
E[y|x] = x
β0. The Poisson MLE provides a second example (see Section 5.3.4).
Similar robustness to misspecification is enjoyed by other models based on densities
in the linear exponential family (LEF). An LEF density can be expressed as
$$
f(y|\mu) = \exp\{a(\mu) + b(y) + c(\mu)y\},  \qquad (5.60)
$$
where we have given the mean parameterization of the LEF, so that $\mu = \mathrm{E}[y]$. It can be shown that for this density $\mathrm{E}[y] = -[c'(\mu)]^{-1}a'(\mu)$ and $\mathrm{V}[y] = [c'(\mu)]^{-1}$, where $c'(\mu) = \partial c(\mu)/\partial\mu$ and $a'(\mu) = \partial a(\mu)/\partial\mu$. Different functions $a(\cdot)$ and $c(\cdot)$ lead to different densities in the family. The term $b(y)$ in (5.60) is a normalizing constant that ensures probabilities sum or integrate to one. The remainder of the density, $\exp\{a(\mu) + c(\mu)y\}$, is an exponential function that is linear in $y$, hence explaining the term linear exponential.
Most densities cannot be expressed in this form. Several important densities are
LEF densities, however, including those given in Table 5.4. These densities, already
presented in Table 5.3, are reexpressed in Table 5.4 in the form (5.60). Other LEF
densities are the binomial with number of trials known (the Bernoulli being a special
case), some negative binomial models (the geometric and the Poisson being special cases), and the one-parameter gamma (the exponential being a special case).
Table 5.4. Linear Exponential Family Densities: Leading Examples

Distribution        | f(y) = exp{a(·) + b(y) + c(·)y}                       | E[y]    | V[y] = [c'(µ)]^{-1}
Normal (σ² known)   | exp{−µ²/(2σ²) − (1/2)ln(2πσ²) − y²/(2σ²) + (µ/σ²)y}   | µ       | σ²
Bernoulli           | exp{ln(1 − p) + ln[p/(1 − p)]y}                       | µ = p   | µ(1 − µ)
Exponential         | exp{ln λ − λy}                                        | µ = 1/λ | µ²
Poisson             | exp{−λ − ln y! + y ln λ}                              | µ = λ   | µ
For regression the parameter $\mu = \mathrm{E}[y|x]$ is modeled as
$$
\mu = g(x, \beta),  \qquad (5.61)
$$
for specified function $g(\cdot)$ that varies across models (see Section 5.7.4), depending in part on restrictions on the range of $y$ and hence $\mu$. The LEF log-likelihood is then
$$
\mathcal{L}_N(\beta) = \sum_{i=1}^{N}\{a(g(x_i, \beta)) + b(y_i) + c(g(x_i, \beta))y_i\},  \qquad (5.62)
$$
with first-order conditions that can be reexpressed, using the aforementioned information on the first two moments of $y$, as
$$
\frac{\partial\mathcal{L}_N(\beta)}{\partial\beta} = \sum_{i=1}^{N}\frac{y_i - g(x_i, \beta)}{\sigma_i^2}\times\frac{\partial g(x_i, \beta)}{\partial\beta} = 0,  \qquad (5.63)
$$
where $\sigma_i^2 = [c'(g(x_i, \beta))]^{-1}$ is the assumed variance function corresponding to the particular LEF density. For example, for Bernoulli, exponential, and Poisson, $\sigma_i^2$ equals, respectively, $g_i(1 - g_i)$, $1/g_i^2$, and $g_i$, where $g_i = g(x_i, \beta)$.
The quasi-MLE solves these equations, but it is no longer assumed that the LEF
density is correctly specified. Gouriéroux, Monfort, and Trognon (1984a) proved that
the quasi-MLE 
βQML is consistent provided E[y|x] = g(x, β0). This is clear from
taking the expected value of the first-order conditions (5.63), which evaluated at
β = β0 are a weighted sum of errors y − g(x, β0) with expected value equal to zero
if E[y|x] = g(x, β0).
Thus the quasi-MLE based on an LEF density is consistent provided only that the
conditional mean of y given x is correctly specified. Note that the actual dgp for y
need not be LEF. It is the specified density, potentially incorrectly specified, that is
LEF.
Even with correct conditional mean, however, adjustment of default ML output for variance, standard errors, and t-statistics based on $-A_0^{-1}$ is warranted. In general the sandwich form $A_0^{-1}B_0A_0^{-1}$ should be used, unless the conditional variance of $y$ given $x$ is also correctly specified, in which case $A_0 = -B_0$. For Bernoulli models, however, $A_0 = -B_0$ always. Consistent standard errors can be obtained using (5.36) and (5.38).
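The practical implication can be seen in the following Python sketch, which is our own illustration rather than the book's: it generates overdispersed counts with a correctly specified exponential mean, fits the Poisson quasi-MLE, and compares nonrobust and sandwich standard errors. The mean parameters are estimated consistently, but only the sandwich standard errors are appropriate.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N = 20_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([0.5, 1.0])
mu = np.exp(X @ beta0)
# mixed Poisson: gamma multiplier has mean 1, so E[y|x] = mu but V[y|x] > mu
y = rng.poisson(mu * rng.gamma(shape=0.5, scale=2.0, size=N))

negll = lambda b: -np.mean(y * (X @ b) - np.exp(X @ b))   # Poisson quasi-log-likelihood
b_hat = minimize(negll, np.zeros(2), method="BFGS").x

mu_hat = np.exp(X @ b_hat)
A_hat = -(X * mu_hat[:, None]).T @ X / N
B_hat = (X * ((y - mu_hat) ** 2)[:, None]).T @ X / N
Ainv = np.linalg.inv(A_hat)
se_nonrobust = np.sqrt(np.diag(-Ainv) / N)
se_robust = np.sqrt(np.diag(Ainv @ B_hat @ Ainv) / N)
print(b_hat, se_nonrobust, se_robust)   # robust standard errors exceed nonrobust ones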
The LEF is a very special case. In general, misspecification of any aspect of the
density leads to inconsistency of the MLE. Even in the LEF case the quasi-MLE can
be used only to predict the conditional mean whereas with a correctly specified density
one can predict the conditional distribution.
5.7.4. Generalized Linear Models
Models based on an assumed LEF density are called generalized linear models
(GLMs) in the statistics literature (see the book with this title by McCullagh and
Nelder, 1989). The class of generalized linear models is the most widely used frame-
work in applied statistics for nonlinear cross-section regression, as from Table 5.3 it
includes nonlinear least squares, Poisson, geometric, probit, logit, binomial (known
number of trials), gamma, and exponential regression models. We provide a short
overview that introduces standard GLM terminology.
Standard GLMs specify the conditional mean $g(x, \beta)$ in (5.61) to be of the simpler single-index form, so that $\mu = g(x'\beta)$. Then $g^{-1}(\mu) = x'\beta$, and the function $g^{-1}(\cdot)$ is called the link function. For example, the usual specification for the Poisson model corresponds to the log-link function since if $\mu = \exp(x'\beta)$ then $\ln\mu = x'\beta$.
The first-order conditions (5.63) become $\sum_i[(y_i - g_i)/\sigma_i^2]g_i'x_i = 0$, where $g_i = g(x_i'\beta)$, $g_i' = g'(x_i'\beta)$, and $\sigma_i^2 = [c'(g_i)]^{-1}$. There are computational advantages in choosing the link function so that $g'(x'\beta) = \sigma^2$, since then these first-order conditions reduce to $\sum_i(y_i - g_i)x_i = 0$; that is, the error $(y_i - g_i)$ is orthogonal to the regressors. The canonical link function is defined to be the function $g^{-1}(\cdot)$ that leads to this simplification; it varies with $c(\mu)$ and hence with the GLM. The canonical link function leads to $\mu = x'\beta$ for normal, $\mu = \exp(x'\beta)$ for Poisson, and $\mu = \exp(x'\beta)/[1 + \exp(x'\beta)]$ for binary data. The last of these is the logit form given earlier in Table 5.3.
Two times the difference between the maximum achievable log-likelihood and the
fitted log-likelihood is called the deviance, a measure that generalizes the residual sum
of squares in linear regression to other LEF regression models.
Models based on the LEF are very restrictive as all moments depend on just one underlying parameter, $\mu = g(x'\beta)$. The GLM literature places some additional structure by making the convenient assumption that the LEF variance is potentially misspecified by a scalar multiple $\alpha$, so that $\mathrm{V}[y|x] = \alpha\times[c'(g(x, \beta))]^{-1}$, where $\alpha$ need not equal 1. For example, for the Poisson model let $\mathrm{V}[y|x] = \alpha g(x, \beta)$ rather than $g(x, \beta)$. Given such variance misspecification it can be shown that $B_0 = -\alpha A_0$, so the variance matrix of the quasi-MLE is $-\alpha A_0^{-1}$, which requires only a rescaling of the nonsandwich ML variance matrix $-A_0^{-1}$ by multiplication by $\alpha$. A commonly used consistent estimate for $\alpha$ is $\hat{\alpha} = (N - K)^{-1}\sum_i(y_i - \hat{g}_i)^2/\hat{\sigma}_i^2$, where $\hat{g}_i = g(x_i, \hat{\beta}_{QML})$ and $\hat{\sigma}_i^2 = [c'(\hat{g}_i)]^{-1}$; division by $(N - K)$ rather than $N$ is felt to provide a better estimate in small samples. See the preceding references and Cameron and Trivedi (1986, 1998) for further details.
Many statistical packages include a GLM module that as a default gives standard errors that are correct provided $\mathrm{V}[y|x] = \alpha[c'(g(x, \beta))]^{-1}$. Alternatively, one can estimate using ML, with standard errors obtained using the robust sandwich formula $A_0^{-1}B_0A_0^{-1}$. In practice the sandwich standard errors are similar to those obtained using the simple GLM correction. Yet another way to estimate a GLM is by weighted nonlinear least squares, as detailed at the end of Section 5.8.6.
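The GLM correction is easy to compute by hand. The sketch below (Python, our own illustration; it assumes data y, X and a Poisson quasi-ML estimate b_hat) forms the dispersion estimate $\hat\alpha$ described above and rescales the nonrobust variance matrix accordingly.

import numpy as np

def glm_dispersion_se(y, X, b_hat):
    """Dispersion-corrected standard errors for a Poisson-type GLM with mean exp(x'b)."""
    N, K = X.shape
    g = np.exp(X @ b_hat)                       # fitted means
    sigma2 = g                                  # LEF (Poisson) variance function
    alpha_hat = np.sum((y - g) ** 2 / sigma2) / (N - K)
    neg_A = (X * g[:, None]).T @ X / N          # -A_hat for the Poisson quasi-likelihood
    V_ml = np.linalg.inv(neg_A) / N             # nonrobust ML variance
    V_glm = alpha_hat * V_ml                    # GLM correction: -alpha * A0^{-1}
    return np.sqrt(np.diag(V_glm)), alpha_hat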
5.7.5. Quasi-MLE for Multivariate Dependent Variables
This chapter has focused on scalar dependent variables, but the theory applies also to
the multivariate case. Suppose the dependent variable y is an m × 1 vector, and the data
(yi , xi ), i = 1, . . . , N, are independent over i. Examples given in later chapters include
seemingly unrelated equations, panel data with m observations for the ith individual
on the same dependent variable, and clustered data where data for the i jth observation
are correlated over m possible values of j.
Given specification of f (y|x, θ), the joint density of y =(y1, . . . , ym) conditional on
x, the fully efficient MLE maximizes N−1

i ln f (yi |xi , θ) as noted after (5.39). How-
ever, in multivariate applications the joint density of y can be complicated. A simpler
estimator is possible given knowledge only of the m univariate densities f j (yj |x, θ),
j = 1, . . . , m, where yj is the jth component of y. For example, for multivariate count
data one might work with m independent univariate negative binomial densities for
each count rather than a richer multivariate count model that permits correlation.
Consider then the quasi-MLE $\hat{\theta}_{QML}$ based on the product of the univariate densities, $\prod_j f_j(y_j|x, \theta)$, that maximizes
$$
Q_N(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\ln f(y_{ij}|x_i, \theta).  \qquad (5.64)
$$
Wooldridge (2002) calls this estimator the partial MLE, since the density has been
only partially specified.
The partial MLE is an m-estimator with $q_i = \sum_j\ln f(y_{ij}|x_i, \theta)$. The essential consistency condition (5.25) requires that $\mathrm{E}[\sum_j\partial\ln f(y_{ij}|x_i, \theta)/\partial\theta|_{\theta_0}] = 0$. This condition holds if the marginal densities $f(y_{ij}|x_i, \theta_0)$ are correctly specified, since then $\mathrm{E}[\partial\ln f(y_{ij}|x_i, \theta)/\partial\theta|_{\theta_0}] = 0$ by the regularity condition (5.41).
Thus the partial MLE is consistent provided the univariate densities $f_j(y_j|x, \theta)$ are correctly specified. Consistency does not require that $f(\mathbf{y}|x, \theta) = \prod_j f_j(y_j|x, \theta)$. Dependence of $y_1, \ldots, y_m$ will lead to failure of the information matrix equality, however, so standard errors should be computed using the sandwich form for the variance matrix with
$$
A_0 = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\left.\frac{\partial^2\ln f_{ij}}{\partial\theta\,\partial\theta'}\right|_{\theta_0},  \qquad (5.65)
$$
$$
B_0 = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\sum_{k=1}^{m}\left.\frac{\partial\ln f_{ij}}{\partial\theta}\right|_{\theta_0}\left.\frac{\partial\ln f_{ik}}{\partial\theta'}\right|_{\theta_0},
$$
where $f_{ij} = f(y_{ij}|x_i, \theta)$. Furthermore, the partial MLE is inefficient compared to the MLE based on the joint density. Further discussion is given in Sections 6.9 and 6.10.
5.8. Nonlinear Least Squares
The NLS estimator is the natural extension of LS estimation for the linear model to the nonlinear model with $\mathrm{E}[y|x] = g(x, \beta)$, where $g(\cdot)$ is nonlinear in $\beta$. The analysis and results are essentially the same as for linear least squares, with the single change that in the formulas for variance matrices the regressor vector $x$ is replaced by $\partial g(x, \beta)/\partial\beta|_{\hat{\beta}}$, the derivative of the conditional mean function evaluated at $\beta = \hat{\beta}$.
For microeconometric analysis, controlling for heteroskedastic errors may be necessary, as in the linear case. The NLS estimator and extensions that model heteroskedastic errors are generally less efficient than the MLE, but they are widely used in microeconometrics because they rely on weaker distributional assumptions.

Table 5.5. Nonlinear Least Squares: Common Examples

Model                     | Regression Function g(x, β)
Exponential               | exp(β_1 x_1 + β_2 x_2 + β_3 x_3)
Regressor raised to power | β_1 x_1 + β_2 x_2^{β_3}
Cobb–Douglas production   | β_1 x_1^{β_2} x_2^{β_3}
CES production            | [β_1 x_1^{β_3} + β_2 x_2^{β_3}]^{1/β_3}
Nonlinear restrictions    | β_1 x_1 + β_2 x_2 + β_3 x_3, where β_3 = −β_2 β_1
5.8.1. Nonlinear Regression Model
The nonlinear regression model defines the scalar dependent variable $y$ to have conditional mean
$$
\mathrm{E}[y_i|x_i] = g(x_i, \beta),  \qquad (5.66)
$$
where $g(\cdot)$ is a specified function, $x$ is a vector of explanatory variables, and $\beta$ is a $K\times 1$ vector of parameters. The linear regression model of Chapter 4 is the special case $g(x, \beta) = x'\beta$.
Common reasons for specifying a nonlinear function for $\mathrm{E}[y|x]$ include range restriction (e.g., to ensure that $\mathrm{E}[y|x] > 0$) and specification of supply or demand or cost or expenditure models that satisfy restrictions from producer or consumer theory. Some commonly used nonlinear regression models are given in Table 5.5.
5.8.2. NLS Estimator
The error term is defined to be the difference between the dependent variable and its conditional mean, $y_i - g(x_i, \beta)$. The nonlinear least-squares estimator $\hat{\beta}_{NLS}$ minimizes the sum of squared residuals, $\sum_i(y_i - g(x_i, \beta))^2$, or equivalently maximizes
$$
Q_N(\beta) = -\frac{1}{2N}\sum_{i=1}^{N}(y_i - g(x_i, \beta))^2,  \qquad (5.67)
$$
where the scale factor $1/2$ simplifies the subsequent analysis.
Differentiation leads to the NLS first-order conditions
$$
\frac{\partial Q_N(\beta)}{\partial\beta} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}(y_i - g_i) = 0,  \qquad (5.68)
$$
where $g_i = g(x_i, \beta)$. These conditions restrict the residual $(y - g)$ to be orthogonal to $\partial g/\partial\beta$, rather than to $x$ as in the linear case. There is no explicit solution for $\hat{\beta}_{NLS}$, which instead is computed using iterative methods (given in Chapter 10).
The nonlinear regression model can be more compactly represented in matrix notation. Stacking observations yields
$$
\begin{bmatrix}y_1\\ \vdots\\ y_N\end{bmatrix} = \begin{bmatrix}g_1\\ \vdots\\ g_N\end{bmatrix} + \begin{bmatrix}u_1\\ \vdots\\ u_N\end{bmatrix},  \qquad (5.69)
$$
where $g_i = g(x_i, \beta)$, or equivalently
$$
\mathbf{y} = \mathbf{g} + \mathbf{u},  \qquad (5.70)
$$
where $\mathbf{y}$, $\mathbf{g}$, and $\mathbf{u}$ are $N\times 1$ vectors with $i$th entries of, respectively, $y_i$, $g_i$, and $u_i$. Then
$$
Q_N(\beta) = -\frac{1}{2N}(\mathbf{y} - \mathbf{g})'(\mathbf{y} - \mathbf{g})
$$
and
$$
\frac{\partial Q_N(\beta)}{\partial\beta} = \frac{1}{N}\frac{\partial\mathbf{g}'}{\partial\beta}(\mathbf{y} - \mathbf{g}),  \qquad (5.71)
$$
where
$$
\frac{\partial\mathbf{g}'}{\partial\beta} = \begin{bmatrix}\dfrac{\partial g_1}{\partial\beta_1} & \cdots & \dfrac{\partial g_N}{\partial\beta_1}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial g_1}{\partial\beta_K} & \cdots & \dfrac{\partial g_N}{\partial\beta_K}\end{bmatrix}  \qquad (5.72)
$$
is the $K\times N$ matrix of partial derivatives of $\mathbf{g}(x, \beta)'$ with respect to $\beta$.
5.8.3. Distribution of the NLS Estimator
The distribution of the NLS estimator will vary with the dgp. The dgp can always be
written as
yi = g(xi , β0) + ui , (5.73)
a nonlinear regression model with additive error u. The conditional mean is correctly
specified if E[y|x] = g(x, β0) in the dgp. Then the error must satisfy E[u|x] = 0.
Given the NLS first-order conditions (5.68), the essential consistency condition (5.25) becomes
$$
\mathrm{E}\!\left[\left.\frac{\partial g(x, \beta)}{\partial\beta}\right|_{\beta_0}\times(y - g(x, \beta_0))\right] = 0.
$$
Equivalently, given (5.73), we need $\mathrm{E}[\partial g(x, \beta)/\partial\beta|_{\beta_0}\times u] = 0$. This holds if $\mathrm{E}[u|x] = 0$, so consistency requires correct specification of the conditional mean as in the linear case. If instead $\mathrm{E}[u|x] \neq 0$ then consistent estimation requires nonlinear instrumental variables methods (which are presented in Section 6.5).
The limit distribution of $\sqrt{N}(\hat{\beta}_{NLS} - \beta_0)$ is obtained using an exact first-order Taylor series expansion of the first-order conditions (5.68). This yields
$$
\sqrt{N}(\hat{\beta}_{NLS} - \beta_0) = -\left[\left.\left(\frac{-1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}\frac{\partial g_i}{\partial\beta'} + \frac{1}{N}\sum_{i=1}^{N}\frac{\partial^2 g_i}{\partial\beta\,\partial\beta'}(y_i - g_i)\right)\right|_{\beta^{+}}\right]^{-1}\times\left.\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}u_i\right|_{\beta_0},
$$
for some $\beta^{+}$ between $\hat{\beta}_{NLS}$ and $\beta_0$. For $A_0$ in (5.18) simplification occurs because the term involving $\partial^2 g/\partial\beta\,\partial\beta'$ drops out since $\mathrm{E}[u|x] = 0$. Thus asymptotically we need consider only
$$
\sqrt{N}(\hat{\beta}_{NLS} - \beta_0) = \left[\left.\frac{1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}\frac{\partial g_i}{\partial\beta'}\right|_{\beta_0}\right]^{-1}\left.\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}u_i\right|_{\beta_0},
$$
which is exactly the same as for OLS (see Section 4.4.4), except that $x_i$ is replaced by $\partial g_i/\partial\beta|_{\beta_0}$.
This yields the following proposition, analogous to Proposition 4.1 for the OLS
estimator.
Proposition 5.6 (Distribution of NLS Estimator): Make the following assumptions:
(i) The model is (5.73); that is, $y_i = g(x_i, \beta_0) + u_i$.
(ii) In the dgp $\mathrm{E}[u_i|x_i] = 0$ and $\mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \Omega_0$, where $\Omega_{0,ij} = \sigma_{ij}$.
(iii) The mean function $g(\cdot)$ satisfies $g(x, \beta^{(1)}) = g(x, \beta^{(2)})$ iff $\beta^{(1)} = \beta^{(2)}$.
(iv) The matrix
$$
A_0 = \operatorname{plim}\left.\frac{1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\beta}\frac{\partial g_i}{\partial\beta'}\right|_{\beta_0} = \operatorname{plim}\left.\frac{1}{N}\frac{\partial\mathbf{g}'}{\partial\beta}\frac{\partial\mathbf{g}}{\partial\beta'}\right|_{\beta_0}  \qquad (5.74)
$$
exists and is finite nonsingular.
(v) $N^{-1/2}\sum_{i=1}^{N}\partial g_i/\partial\beta\times u_i|_{\beta_0}\xrightarrow{d}\mathcal{N}[0, B_0]$, where
$$
B_0 = \operatorname{plim}\left.\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_{ij}\frac{\partial g_i}{\partial\beta}\frac{\partial g_j}{\partial\beta'}\right|_{\beta_0} = \operatorname{plim}\left.\frac{1}{N}\frac{\partial\mathbf{g}'}{\partial\beta}\Omega_0\frac{\partial\mathbf{g}}{\partial\beta'}\right|_{\beta_0}.  \qquad (5.75)
$$
Then the NLS estimator $\hat{\beta}_{NLS}$, defined to be a root of the first-order conditions $\partial Q_N(\beta)/\partial\beta = 0$, is consistent for $\beta_0$ and
$$
\sqrt{N}(\hat{\beta}_{NLS} - \beta_0)\ \xrightarrow{d}\ \mathcal{N}\!\left[0,\ A_0^{-1}B_0A_0^{-1}\right].  \qquad (5.76)
$$
Conditions (i) to (iii) imply that the regression function is correctly specified and
the regressors are uncorrelated with the errors and that β0 is identified. The errors can
be heteroskedastic and correlated over i. Conditions (iv) and (v) assume the relevant
limit results necessary for application of Theorem 5.3. For condition (v) to be satisfied
some restrictions will need to be placed on the error correlation over i. The probability
limits in (5.74) and (5.75) are with respect to the dgp for X; they become regular limits
if X is nonstochastic.
The matrices $A_0$ and $B_0$ in Proposition 5.6 are the same as the matrices $M_{xx}$ and $M_{x\Omega x}$ in Section 4.4.4 for the OLS estimator with $x_i$ replaced by $\partial g_i/\partial\beta|_{\beta_0}$. The asymptotic theory for NLS is the same as that for OLS, with this single change.
In the special case of spherical errors, $\Omega_0 = \sigma_0^2\mathbf{I}$, so $B_0 = \sigma_0^2A_0$ and $\mathrm{V}[\hat{\beta}_{NLS}] = \sigma_0^2A_0^{-1}$. Nonlinear least squares is then asymptotically efficient among LS estimators. However, cross-section data errors are not necessarily homoskedastic, so this special case may not apply.
Given Proposition 5.6, the resulting asymptotic distribution of the NLS estimator can be expressed as
$$
\hat{\beta}_{NLS}\ \overset{a}{\sim}\ \mathcal{N}\!\left[\beta,\ (\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'\Omega_0\mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\right],  \qquad (5.77)
$$
where the derivative matrix $\mathbf{D} = \partial\mathbf{g}/\partial\beta'|_{\beta_0}$ has $i$th row $\partial g_i/\partial\beta'|_{\beta_0}$ (see (5.72)), for notational simplicity the evaluation at $\beta_0$ is suppressed, and we assume that an LLN applies, so that the plim operators in the definitions of $A_0$ and $B_0$ are replaced by $\lim\mathrm{E}$, and then drop the limit. This notation is often used in later chapters.
5.8.4. Variance Matrix Estimation for NLS
We consider statistical inference for the usual microeconometrics situation of independent errors with heteroskedasticity of unknown functional form. This requires a consistent estimate of $A_0^{-1}B_0A_0^{-1}$ defined in Proposition 5.6.
For $A_0$ defined in (5.74) it is straightforward to use the obvious estimator
$$
\hat{A} = \frac{1}{N}\sum_{i=1}^{N}\left.\frac{\partial g_i}{\partial\beta}\right|_{\hat{\beta}}\left.\frac{\partial g_i}{\partial\beta'}\right|_{\hat{\beta}},  \qquad (5.78)
$$
as $A_0$ does not involve moments of the errors.
Given independence over $i$ the double sum in $B_0$ defined in (5.75) simplifies to the single sum
$$
B_0 = \operatorname{plim}\left.\frac{1}{N}\sum_{i=1}^{N}\sigma_i^2\frac{\partial g_i}{\partial\beta}\frac{\partial g_i}{\partial\beta'}\right|_{\beta_0}.
$$
As for the OLS estimator (see Section 4.4.5) it is only necessary to consistently estimate the $K\times K$ matrix sum $B_0$. This does not require consistent estimation of $\sigma_i^2$, the $N$ individual components in the sum.
White (1980b) gave conditions under which
$$
\hat{B} = \frac{1}{N}\sum_{i=1}^{N}\hat{u}_i^2\left.\frac{\partial g_i}{\partial\beta}\right|_{\hat{\beta}}\left.\frac{\partial g_i}{\partial\beta'}\right|_{\hat{\beta}} = \frac{1}{N}\left.\frac{\partial\mathbf{g}'}{\partial\beta}\right|_{\hat{\beta}}\hat{\Omega}\left.\frac{\partial\mathbf{g}}{\partial\beta'}\right|_{\hat{\beta}}  \qquad (5.79)
$$
is consistent for $B_0$, where $\hat{u}_i = y_i - g(x_i, \hat{\beta})$, $\hat{\beta}$ is consistent for $\beta_0$, and
$$
\hat{\Omega} = \mathrm{Diag}[\hat{u}_i^2].  \qquad (5.80)
$$
This leads to the following heteroskedastic-consistent estimate of the asymptotic variance matrix of the NLS estimator:
$$
\hat{V}[\hat{\beta}_{NLS}] = (\hat{\mathbf{D}}'\hat{\mathbf{D}})^{-1}\hat{\mathbf{D}}'\hat{\Omega}\hat{\mathbf{D}}(\hat{\mathbf{D}}'\hat{\mathbf{D}})^{-1},  \qquad (5.81)
$$
where $\hat{\mathbf{D}} = \partial\mathbf{g}/\partial\beta'|_{\hat{\beta}}$. This equation is the same as the OLS result in Section 4.4.5, with the regressor matrix $\mathbf{X}$ replaced by $\hat{\mathbf{D}}$. In practice, a degrees of freedom correction may be used, so that $\hat{B}$ in (5.79) is computed using division by $(N - K)$ rather than by $N$. Then the right-hand side in (5.81) should be multiplied by $N/(N - K)$.
Generalization to errors correlated over i is given in Section 5.8.7.
5.8.5. Exponential Regression Example
As an example, suppose that $y$ given $x$ has exponential conditional mean, so that $\mathrm{E}[y|x] = \exp(x'\beta)$. The model can be expressed as a nonlinear regression with
$$
y = \exp(x'\beta) + u,
$$
where the error term $u$ has $\mathrm{E}[u|x] = 0$ and the error is potentially heteroskedastic.
The NLS estimator has first-order conditions
$$
N^{-1}\sum_i\left(y_i - \exp(x_i'\beta)\right)\exp(x_i'\beta)x_i = 0,  \qquad (5.82)
$$
so consistency of $\hat{\beta}_{NLS}$ requires only that the conditional mean be correctly specified with $\mathrm{E}[y|x] = \exp(x'\beta_0)$. Here $\partial g/\partial\beta = \exp(x'\beta)x$, so the general NLS result (5.81) yields the heteroskedastic-robust estimate
$$
\hat{V}[\hat{\beta}_{NLS}] = \left[\sum_i e^{2x_i'\hat{\beta}}x_ix_i'\right]^{-1}\left[\sum_i\hat{u}_i^2e^{2x_i'\hat{\beta}}x_ix_i'\right]\left[\sum_i e^{2x_i'\hat{\beta}}x_ix_i'\right]^{-1},  \qquad (5.83)
$$
where $\hat{u}_i = y_i - \exp(x_i'\hat{\beta}_{NLS})$.
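As a minimal illustration of (5.82) and (5.83), not taken from the book and with hypothetical variable names, the Python sketch below fits the exponential-mean model by NLS and forms the heteroskedasticity-robust variance estimate.

import numpy as np
from scipy.optimize import minimize

def nls_exponential_mean(y, X):
    """NLS for E[y|x] = exp(x'beta) with the robust variance estimate (5.83)."""
    obj = lambda b: 0.5 * np.mean((y - np.exp(X @ b)) ** 2)   # minus Q_N(beta) in (5.67)
    b_nls = minimize(obj, np.zeros(X.shape[1]), method="BFGS").x
    g = np.exp(X @ b_nls)
    u = y - g
    D = X * g[:, None]                     # i-th row is dg_i/dbeta' = exp(x_i'b) x_i'
    bread = np.linalg.inv(D.T @ D)
    meat = (D * (u**2)[:, None]).T @ D     # D' Diag[u_i^2] D
    V_robust = bread @ meat @ bread        # (D'D)^{-1} D' Omega_hat D (D'D)^{-1}
    return b_nls, V_robust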
5.8.6. Weighted NLS and FGNLS
For cross-section data the errors are often heteroskedastic. Then feasible generalized
NLS that controls for the heteroskedasticity is more efficient than NLS.
Feasible generalized nonlinear least squares (FGNLS) is still generally less efficient
than ML. The notable exception is that FGNLS is asymptotically equivalent to the
MLE when the conditional density for y is an LEF density. A special case is that FGLS
is asymptotically equivalent to the MLE in the linear regression under normality.
Table 5.6. Nonlinear Least-Squares Estimators and Their Asymptotic Variance^a

NLS:   objective $Q_N(\beta) = -\frac{1}{2N}\mathbf{u}'\mathbf{u}$; estimated asymptotic variance $(\hat{\mathbf{D}}'\hat{\mathbf{D}})^{-1}\hat{\mathbf{D}}'\hat{\Omega}\hat{\mathbf{D}}(\hat{\mathbf{D}}'\hat{\mathbf{D}})^{-1}$.
FGNLS: objective $Q_N(\beta) = -\frac{1}{2N}\mathbf{u}'\Omega(\hat{\gamma})^{-1}\mathbf{u}$; estimated asymptotic variance $(\hat{\mathbf{D}}'\hat{\Omega}^{-1}\hat{\mathbf{D}})^{-1}$.
WNLS:  objective $Q_N(\beta) = -\frac{1}{2N}\mathbf{u}'\hat{\Sigma}^{-1}\mathbf{u}$; estimated asymptotic variance $(\hat{\mathbf{D}}'\hat{\Sigma}^{-1}\hat{\mathbf{D}})^{-1}\hat{\mathbf{D}}'\hat{\Sigma}^{-1}\hat{\Omega}\hat{\Sigma}^{-1}\hat{\mathbf{D}}(\hat{\mathbf{D}}'\hat{\Sigma}^{-1}\hat{\mathbf{D}})^{-1}$.

^a Functions are for a nonlinear regression model with error $\mathbf{u} = \mathbf{y} - \mathbf{g}$ defined in (5.70) and error conditional variance matrix $\Omega$. $\hat{\mathbf{D}}$ is the derivative of the conditional mean vector with respect to $\beta'$ evaluated at $\hat{\beta}$. For FGNLS it is assumed that $\Omega(\hat{\gamma})$ is consistent for $\Omega$. For NLS and WNLS the heteroskedastic-robust variance matrix uses $\hat{\Omega}$ equal to a diagonal matrix with squared residuals on the diagonal, an estimate that need not be consistent for $\Omega$.
If heteroskedasticity is incorrectly modeled then the FGNLS estimator retains con-
sistency but one should then obtain standard errors that are robust to misspecification
of the model for heteroskedasticity. The analysis is very similar to that for the linear
model given in Section 4.5.
Feasible Generalized Nonlinear Least Squares
The feasible generalized nonlinear least-squares estimator $\hat{\beta}_{FGNLS}$ maximizes
$$
Q_N(\beta) = -\frac{1}{2N}(\mathbf{y} - \mathbf{g})'\Omega(\hat{\gamma})^{-1}(\mathbf{y} - \mathbf{g}),  \qquad (5.84)
$$
where it is assumed that $\mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \Omega(\gamma_0)$ and $\hat{\gamma}$ is a consistent estimate of $\gamma_0$.
If the assumptions made for the NLS estimator are satisfied and in fact $\Omega_0 = \Omega(\gamma_0)$, then the FGNLS estimator is consistent and asymptotically normal with estimated asymptotic variance matrix given in Table 5.6. The variance matrix estimate is similar to that for linear FGLS, $[\mathbf{X}'\Omega(\hat{\gamma})^{-1}\mathbf{X}]^{-1}$, except that $\mathbf{X}$ is replaced by $\hat{\mathbf{D}} = \partial\mathbf{g}/\partial\beta'|_{\hat{\beta}}$.
The FGNLS estimator is the most efficient consistent estimator that minimizes quadratic loss functions of the form $(\mathbf{y} - \mathbf{g})'\mathbf{V}(\mathbf{y} - \mathbf{g})$, where $\mathbf{V}$ is a weighting matrix.
In general, implementation of FGNLS requires inversion of the $N\times N$ matrix $\Omega(\hat{\gamma})$. This may be computationally impossible for large $N$, but in practice $\Omega(\hat{\gamma})$ usually has a structure, such as diagonality, that leads to an analytical solution for the inverse.
Weighted NLS
The FGNLS approach is fully efficient but leads to invalid standard error estimates if
the model for Ω0 is misspecified. Here we consider an approach between NLS and
FGNLS that specifies a model for the variance matrix of the errors but then obtains
robust standard errors. The discussion mirrors that in Section 4.5.2.
The weighted nonlinear least squares (WNLS) estimator $\hat{\beta}_{WNLS}$ maximizes
$$
Q_N(\beta) = -\frac{1}{2N}(\mathbf{y} - \mathbf{g})'\hat{\Sigma}^{-1}(\mathbf{y} - \mathbf{g}),  \qquad (5.85)
$$
where $\Sigma = \Sigma(\gamma)$ is a working error variance matrix, $\hat{\Sigma} = \Sigma(\hat{\gamma})$, where $\hat{\gamma}$ is an estimate of $\gamma$, and, in a departure from FGNLS, possibly $\Sigma \neq \Omega_0$.
Under assumptions similar to those for the NLS estimator and assuming that $\Sigma_0 = \operatorname{plim}\hat{\Sigma}$, the WNLS estimator is consistent and asymptotically normal with estimated asymptotic variance matrix given in Table 5.6.
This estimator is called WNLS to distinguish it from FGNLS, which assumed that $\Sigma = \Omega_0$. The WNLS estimator hopefully lies between NLS and FGNLS in terms of efficiency, though it may be less efficient than NLS if a poor model of the error variance matrix is chosen. The NLS and OLS estimators are special cases of WNLS with $\Sigma = \sigma^2\mathbf{I}$.
Heteroskedastic Errors
An obvious working model for heteroskedasticity is $\sigma_i^2 = \mathrm{E}[u_i^2|x_i] = \exp(z_i'\gamma_0)$, where the vector $z$ is a specified function of $x$ (such as selected subcomponents of $x$) and using the exponential ensures a positive variance.
Then $\Sigma = \mathrm{Diag}[\exp(z_i'\gamma)]$ and $\hat{\Sigma} = \mathrm{Diag}[\exp(z_i'\hat{\gamma})]$, where $\hat{\gamma}$ can be obtained by nonlinear regression of squared NLS residuals $(y_i - g(x_i, \hat{\beta}_{NLS}))^2$ on $\exp(z_i'\gamma)$. Since $\Sigma$ is diagonal, $\Sigma^{-1} = \mathrm{Diag}[1/\sigma_i^2]$. Then (5.85) simplifies and the WNLS estimator maximizes
$$
Q_N(\beta) = -\frac{1}{2N}\sum_{i=1}^{N}\frac{(y_i - g(x_i, \beta))^2}{\hat{\sigma}_i^2}.  \qquad (5.86)
$$
The variance matrix of the WNLS estimator given in Table 5.6 yields
$$
\hat{V}[\hat{\beta}_{WNLS}] = \left[\sum_{i=1}^{N}\frac{1}{\hat{\sigma}_i^2}\hat{d}_i\hat{d}_i'\right]^{-1}\left[\sum_{i=1}^{N}\hat{u}_i^2\frac{1}{\hat{\sigma}_i^4}\hat{d}_i\hat{d}_i'\right]\left[\sum_{i=1}^{N}\frac{1}{\hat{\sigma}_i^2}\hat{d}_i\hat{d}_i'\right]^{-1},  \qquad (5.87)
$$
where $\hat{d}_i = \partial g(x_i, \beta)/\partial\beta|_{\hat{\beta}}$ and $\hat{u}_i = y_i - g(x_i, \hat{\beta}_{WNLS})$ is the residual. In practice a degrees of freedom correction may be used, so that the right-hand side of (5.87) is multiplied by $N/(N - K)$. If the stronger assumption is made that $\Sigma = \Omega_0$, then WNLS becomes FGNLS and
$$
\hat{V}[\hat{\beta}_{FGNLS}] = \left[\sum_{i=1}^{N}\frac{1}{\hat{\sigma}_i^2}\hat{d}_i\hat{d}_i'\right]^{-1}.  \qquad (5.88)
$$
The WNLS and FGNLS estimators can be implemented using an NLS program. First, do NLS regression of $y_i$ on $g(x_i, \beta)$. Second, obtain $\hat{\gamma}$ by, for example, NLS regression of $(y_i - g(x_i, \hat{\beta}_{NLS}))^2$ on $\exp(z_i'\gamma)$ if $\sigma_i^2 = \exp(z_i'\gamma)$. Third, perform an NLS regression of $y_i/\hat{\sigma}_i$ on $g(x_i, \beta)/\hat{\sigma}_i$, where $\hat{\sigma}_i^2 = \exp(z_i'\hat{\gamma})$. This is equivalent to maximizing (5.86). White robust sandwich standard errors from this transformed regression give robust WNLS standard errors based on (5.87). The usual nonrobust standard errors from this transformed regression give FGNLS standard errors based on (5.88).
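The three steps are straightforward to code directly. The Python sketch below is our own minimal illustration, not the authors' implementation, of the two-step WNLS procedure for an exponential conditional mean with working variance $\exp(z'\gamma)$; function and variable names are hypothetical.

import numpy as np
from scipy.optimize import minimize

def wnls_exponential_mean(y, X, Z):
    """Two-step weighted NLS for E[y|x] = exp(x'beta) with working variance exp(z'gamma)."""
    sse = lambda b, w: np.mean(w * (y - np.exp(X @ b)) ** 2)
    # Step 1: ordinary NLS (unit weights)
    b_nls = minimize(sse, np.zeros(X.shape[1]), args=(np.ones(len(y)),), method="BFGS").x
    # Step 2: model the squared residuals to get working variances
    u2 = (y - np.exp(X @ b_nls)) ** 2
    gfit = lambda g: np.mean((u2 - np.exp(Z @ g)) ** 2)
    g_hat = minimize(gfit, np.zeros(Z.shape[1]), method="BFGS").x
    sig2 = np.exp(Z @ g_hat)
    # Step 3: weighted NLS, equivalent to maximizing (5.86)
    b_wnls = minimize(sse, b_nls, args=(1.0 / sig2,), method="BFGS").x
    # Robust variance matrix as in (5.87)
    mu = np.exp(X @ b_wnls)
    d = X * mu[:, None]                                  # d_i = exp(x_i'b) x_i
    u = y - mu
    A = (d / sig2[:, None]).T @ d
    B = (d * (u**2 / sig2**2)[:, None]).T @ d
    V = np.linalg.inv(A) @ B @ np.linalg.inv(A)
    return b_wnls, V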
With heteroskedastic errors it is very tempting to go one step further and attempt FGNLS using $\hat{\Omega} = \mathrm{Diag}[\hat{u}_i^2]$. This will give inconsistent parameter estimates of $\beta_0$, however, as FGNLS regression of $y_i$ on $g(x_i, \beta)$ then reduces to NLS regression of $y_i/|\hat{u}_i|$ on $g(x_i, \beta)/|\hat{u}_i|$. The technique suffers from the fundamental problem of correlation between regressors and error term. Alternative semiparametric methods that enable an estimator as efficient as feasible GLS, without specifying a functional form for $\Omega_0$, are presented in Section 9.7.6.
Generalized Linear Models
Implementation of the weighted NLS approach requires a reasonable specification for the working matrix. A somewhat ad hoc approach, already presented, is to let $\sigma_i^2 = \exp(z_i'\gamma)$, where $z$ is often a subset of $x$. For example, in regression of earnings on schooling and other control variables we might model heteroskedasticity more simply as being a function of just a few of the regressors, most notably schooling.
Some types of cross-section data provide a natural model for heteroskedasticity that is very parsimonious. For example, for count data the Poisson density specifies that the variance equals the mean, so $\sigma_i^2 = g(x_i, \beta)$. This provides a working model for heteroskedasticity that introduces no further parameters than those already used in modeling the conditional mean.
This approach of letting the working model for the variance be a function of the mean arises naturally for generalized linear models, introduced in Sections 5.7.3 and 5.7.4. From (5.63) the first-order conditions for the quasi-MLE based on an LEF density are of the form
$$
\sum_{i=1}^{N}\frac{y_i - g(x_i, \beta)}{\sigma_i^2}\times\frac{\partial g(x_i, \beta)}{\partial\beta} = 0,
$$
where $\sigma_i^2 = [c'(g(x_i, \beta))]^{-1}$ is the assumed variance function corresponding to the particular GLM (see (5.60)). For example, for Poisson, Bernoulli, and exponential distributions $\sigma_i^2$ equals, respectively, $g_i$, $g_i(1 - g_i)$, and $1/g_i^2$, where $g_i = g(x_i, \beta)$.
These first-order conditions can be solved for $\beta$ in one step that allows for dependence of $\sigma_i^2$ on $\beta$. In a simpler two-step method one computes $\hat{\sigma}_i^2 = [c'(g(x_i, \hat{\beta}))]^{-1}$ given an initial NLS estimate $\hat{\beta}$ and then does a weighted NLS regression of $y_i/\hat{\sigma}_i$ on $g(x_i, \beta)/\hat{\sigma}_i$. The resulting estimator of $\beta$ is asymptotically equivalent to the quasi-MLE that directly solves (5.63) (see Gouriéroux, Monfort, and Trognon, 1984a, or Cameron and Trivedi, 1986). Thus FGNLS is asymptotically equivalent to ML estimation when the density is an LEF density. To guard against misspecification of $\sigma_i^2$, inference is based on robust sandwich standard errors, or one lets $\hat{\sigma}_i^2 = \hat{\alpha}[c'(g(x_i, \hat{\beta}))]^{-1}$, where the estimate $\hat{\alpha}$ is given in Section 5.7.4.
5.8.7. Time Series
The general NLS result in Proposition 5.6 applies to all types of data, including time-
series data. The subsequent results on variance matrix estimation focused on the cross-
section case of heteroskedastic errors, but they are easily adapted to the case of time-
series data with serially correlated errors. Indeed, results on robust variance matrix
estimation using spectral methods for the time-series case preceded those for the cross-
section case.
The time-series nonlinear regression model is
$$
y_t = g(x_t, \beta) + u_t,\qquad t = 1, \ldots, T.
$$
If the error $u_t$ is serially correlated it is common to use the autoregressive moving average or ARMA($p$, $q$) model
$$
u_t = \rho_1u_{t-1} + \cdots + \rho_pu_{t-p} + \varepsilon_t + \alpha_1\varepsilon_{t-1} + \cdots + \alpha_q\varepsilon_{t-q},
$$
where $\varepsilon_t$ is iid with mean 0 and variance $\sigma^2$, and restrictions may be placed on the ARMA model parameters to ensure stationarity and invertibility. The ARMA error model implies a particular structure for the error variance matrix $\Omega_0 = \Omega(\rho, \alpha)$.
The ARMA model provides a good model for Ω0 in the time-series case. In con-
trast, in the cross-section case, it is more difficult to correctly model heteroskedasticity,
leading to greater emphasis on robust inference that does not require specification of a
model for Ω0.
What if errors are both heteroskedastic and serially correlated? The NLS estimator is consistent though inefficient if errors are serially correlated, provided $x_t$ does not include lagged dependent variables, in which case it becomes inconsistent. White and Domowitz (1984) generalized (5.79) to obtain a robust estimate of the variance matrix of the NLS estimator given heteroskedasticity and serial correlation of unknown functional form, assuming serial correlation of no more than, say, $l$ lags. In practice a minor refinement due to Newey and West (1987b) is used. This refinement is a rescaling that ensures that the variance matrix estimate is positive semidefinite. Several other refinements have also been proposed, and the assumption of fixed lag length has been relaxed so that it is possible for $l\to\infty$ at a sufficiently slower rate than $N\to\infty$. This permits an AR component for the error.
5.9. Example: ML and NLS Estimation
Maximum likelihood and NLS estimation, standard error calculation, and coefficient
interpretation are illustrated using simulation data.
5.9.1. Model and Estimators
The exponential distribution is used for continuous positive data, notably duration data studied in Chapter 17. The exponential density is
$$
f(y) = \lambda e^{-\lambda y},\qquad y > 0,\ \lambda > 0,
$$
with mean $1/\lambda$ and variance $1/\lambda^2$. We introduce regressors into this model by setting
$$
\lambda = \exp(x'\beta),
$$
which ensures $\lambda > 0$. Note that this implies that
$$
\mathrm{E}[y|x] = \exp(-x'\beta).
$$
An alternative parameterization instead specifies $\mathrm{E}[y|x] = \exp(x'\beta)$, so that $\lambda = \exp(-x'\beta)$. Note that the exponential is used in two different ways: for the density and for the conditional mean.
The OLS estimator from regression of y on x is inconsistent, since it fits a straight
line when the regression function is in fact an exponential curve.
The MLE is easily obtained. The log-density is $\ln f(y|x) = x'\beta - y\exp(x'\beta)$, leading to ML first-order conditions $N^{-1}\sum_i(1 - y_i\exp(x_i'\beta))x_i = 0$, or
$$
N^{-1}\sum_i\frac{y_i - \exp(-x_i'\beta)}{\exp(-x_i'\beta)}x_i = 0.
$$
To perform NLS regression, note that the model can also be expressed as a nonlinear regression with
$$
y = \exp(-x'\beta) + u,
$$
where the error term u has E[u|x] = 0, though it is heteroskedastic. The first-order
conditions for an exponential conditional mean for this model, aside from a sign rever-
sal, have already been given in (5.82) and clearly lead to an estimator that differs from
the MLE.
As an example of weighted NLS we suppose that the error variance is proportional to the mean. Then the working variance is $\mathrm{V}[y] = \mathrm{E}[y]$ and weighted least squares can be implemented by NLS regression of $y_i/\hat{\sigma}_i$ on $\exp(-x_i'\beta)/\hat{\sigma}_i$, where $\hat{\sigma}_i^2 = \exp(-x_i'\hat{\beta}_{NLS})$. This estimator is less efficient than the MLE and may or may not be more efficient than NLS.
Feasible generalized NLS can be implemented here, since we know the dgp. Since $\mathrm{V}[y] = 1/\lambda^2$ for the exponential density, so that the variance equals the mean squared, it follows that $\mathrm{V}[u|x] = [\exp(-x'\beta)]^2$. The FGNLS estimator estimates $\sigma_i^2$ by $\hat{\sigma}_i^2 = [\exp(-x_i'\hat{\beta}_{NLS})]^2$ and can be implemented by NLS regression of $y_i/\hat{\sigma}_i$ on $\exp(-x_i'\beta)/\hat{\sigma}_i$. In general FGNLS is less efficient than the MLE. In this example it is actually fully efficient as the exponential density is an LEF density (see the discussion at the end of Section 5.8.6).
5.9.2. Simulation and Results
For simplicity we consider regression on an intercept and a single regressor. The data-generating process is
$$
y|x \sim \text{exponential}[\lambda],\qquad \lambda = \exp(\beta_1 + \beta_2x),
$$
where $x \sim \mathcal{N}[1, 1^2]$ and $(\beta_1, \beta_2) = (2, -1)$. A large sample of size 10,000 was drawn to minimize differences in estimates, particularly standard errors, arising from sampling variability. For the particular sample of 10,000 drawn here the sample mean of $y$ is 0.62 and the sample standard deviation of $y$ is 1.29.
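A dgp of this form is straightforward to simulate. The following Python lines are our own sketch rather than the authors' code (the seed, and hence the realized sample moments, will differ from those reported in the text); they draw one such sample and compute the ML and NLS estimates of $(\beta_1, \beta_2)$.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(123)
N = 10_000
x = rng.normal(loc=1.0, scale=1.0, size=N)
X = np.column_stack([np.ones(N), x])
lam = np.exp(2.0 - 1.0 * x)                    # lambda = exp(beta1 + beta2*x)
y = rng.exponential(scale=1.0 / lam)           # exponential draws with mean 1/lambda

# MLE: maximize the average of x'b - y*exp(x'b)
negll = lambda b: -np.mean(X @ b - y * np.exp(X @ b))
b_ml = minimize(negll, np.zeros(2), method="BFGS").x

# NLS: minimize the average of (y - exp(-x'b))^2
ssr = lambda b: np.mean((y - np.exp(-(X @ b))) ** 2)
b_nls = minimize(ssr, b_ml, method="BFGS").x
print(b_ml, b_nls)                             # both close to (2, -1)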
Table 5.7 presents OLS, ML, NLS, WNLS, and FGNLS estimates. Up to three
different standard error estimates are also given. The default regression output yields
nonrobust standard errors, given in parentheses. For OLS and NLS estimators these
Table 5.7. Exponential Example: Least-Squares and ML Estimates^a

Variable  | OLS      | ML       | NLS      | WNLS     | FGNLS
Constant  | −0.0093  | 1.9829   | 1.8876   | 1.9906   | 1.9840
          | (0.0161) | (0.0141) | (0.0307) | (0.0225) | (0.0148)
          | [0.0172] | [0.0144] | [0.1421] | [0.0359] | [0.0146]
          |          |          | {0.2110} |          |
x         | 0.6198   | −0.9896  | −0.9575  | −0.9961  | −0.9907
          | (0.0113) | (0.0099) | (0.0097) | (0.0098) | (0.0100)
          | [0.0254] | [0.0099] | [0.0612] | [0.0224] | [0.0101]
          |          |          | {0.0880} |          |
ln L      | –        | −208.71  | −232.98  | −208.93  | −208.72
R²        | 0.2326   | 0.3906   | 0.3913   | 0.3902   | 0.3906

^a All estimators are consistent, aside from OLS. Up to three alternative standard error estimates are given: nonrobust in parentheses, robust outer product in square brackets, and an alternative robust estimate for NLS in braces. The conditional dgp is an exponential distribution with intercept 2 and slope parameter −1. Sample size N = 10,000.
assume iid errors, an erroneous assumption here, and for the MLE these impose the IM equality, a valid assumption here since the assumed density is the dgp. The robust standard errors, given in square brackets, use the robust sandwich variance estimate $N^{-1}\hat{A}_{H}^{-1}\hat{B}_{OP}\hat{A}_{H}^{-1}$, where $\hat{B}_{OP}$ is the outer product estimate given in (5.38). These estimates are heteroskedastic consistent. For standard errors of the NLS estimator an alternative better estimate is given in braces (and is explained in the next section). The standard error estimates presented here use numerical rather than analytical derivatives in computing $\hat{A}$ and $\hat{B}$.
5.9.3. Comparison of Estimates and Standard Errors
The OLS estimator is inconsistent, yielding estimates unrelated to $(\beta_1, \beta_2)$ in the exponential dgp.
The remaining estimators are consistent, and the ML, NLS, WNLS, and FGNLS estimates are within two standard errors of the true parameter values of $(2, -1)$, where the robust standard errors need to be used for NLS. The FGNLS estimates are quite close to the ML estimates, a consequence of using a dgp in the LEF.
For the MLE the nonrobust and robust ML standard errors are quite similar. This is expected as they are asymptotically equivalent (since the information matrix equality holds if the MLE is based on the true density) and the sample size here is large.
For NLS the nonrobust standard errors are invalid, because the dgp has heteroskedastic errors, and greatly overstate the precision of the NLS estimates. The formula for the robust variance matrix estimate for NLS is given in (5.81), where $\hat{\Omega} = \mathrm{Diag}[\hat{u}_i^2]$. An alternative that uses $\hat{\Omega} = \mathrm{Diag}[\hat{\mathrm{E}}[u_i^2]]$, where $\hat{\mathrm{E}}[u_i^2] = [\exp(-x_i'\hat{\beta})]^2$, is given in braces. The two estimates differ: 0.0612 compared to 0.0880 for the slope coefficient. The difference arises because $\hat{u}_i^2 = (y_i - \exp(-x_i'\hat{\beta}))^2$ differs from $[\exp(-x_i'\hat{\beta})]^2$. More generally, standard errors estimated using the outer product (see Section 5.5.2) can be biased even in quite large samples. NLS is considerably less efficient than MLE, with standard errors many times those of the MLE using the preferred estimates in braces.
The WNLS estimator does not use the correct model for heteroskedasticity, so the nonrobust and robust standard errors again differ. Using the robust standard errors the WNLS estimator is more efficient than NLS and less efficient than the MLE.
In this example the FGNLS estimator is as efficient as the MLE, a consequence of the known dgp being in the LEF. The results indicate this, with coefficients and standard errors very close to those for the MLE. The robust and nonrobust standard errors for the FGNLS estimator are essentially the same, as expected since here the model for heteroskedasticity is correctly specified.
Table 5.7 also reports the estimated log-likelihood, $\ln L = \sum_i[x_i'\hat{\beta} - y_i\exp(x_i'\hat{\beta})]$, and an R-squared measure, $R^2 = 1 - \sum_i(y_i - \hat{y}_i)^2/\sum_i(y_i - \bar{y})^2$, where $\hat{y}_i = \exp(-x_i'\hat{\beta})$, evaluated at the ML, NLS, WNLS, and FGNLS estimates. The $R^2$ differs little across models and is highest for the NLS estimator, as expected since NLS minimizes $\sum_i(y_i - \hat{y}_i)^2$. The log-likelihood is maximized by the MLE, as expected, and is considerably lower for the NLS estimator.
5.9.4. Coefficient Interpretation
Interest lies in changes in E[y|x] when x changes. We consider the ML estimates of

β2 = −0.99 given in Table 5.7.
The conditional mean exp(−β1 − β2x) is of single-index form, so that if an ad-
ditional regressor z with coefficient β3 were included, then the marginal effect of a
one-unit change in z would be 
β3/
β2 times that of a one-unit change in x (see Sec-
tion 5.2.4).
The conditional mean is monotonically decreasing in x, so the sign of 
β2 is the re-
verse of the marginal effect (see Section 5.2.4). Here the marginal effect of an increase
in x is an increase in the conditional mean, since 
β2 is negative.
We now consider the magnitude of the marginal effect of changes in x using cal-
culus methods. Here ∂E[y|x]/∂x = −β2 exp(−x
β) varies with the evaluation point
x and ranges from 0.01 to 19.09 in the sample. The sample-average response is
0.99N−1

i exp(x
i

β) = 0.61. The response evaluated at the sample mean of x,
0.99 exp(x̄
β) = 0.37, is considerably smaller. Since ∂E[y|x]/∂x = −β2E[y|x], yet
another estimate of the marginal effect is 0.99ȳ = 0.61.
Finite-difference methods lead to a different estimated marginal effect. For
x = 1
we obtain
E[y|x] = (eβ2
− 1) exp(−x
β) (see Section 5.2.4). This yields an average
response over the sample of 1.04, rather than 0.61. The finite-difference and calculus
methods coincide, however, if
x is small.
The preceding marginal effects are additive. For the exponential conditional mean
we can also consider multiplicative or proportionate marginal effects (see Sec-
tion 5.2.4). For example, a 0.1-unit change in x is predicted to lead to a proportionate
increase in E[y|x] of 0.1 × 0.99 or a 9.9% increase. Again a finite-difference approach
will yield a different estimate.
162
5.11. BIBLIOGRAPHIC NOTES
Which of these measures is most useful? The restriction to single-index form is
very useful as the relative impact of regressors can be immediately calculated. For the
magnitude of the response it is most accurate to compute the average response across
the sample, using noncalculus methods, of a c-unit change in the regressor, where
the magnitude of c is a meaningful amount such as a one standard deviation change
in x.
Similar calculations can be done for the NLS, WNLS, and FGNLS estimates, with
similar results. For the OLS estimator, note that the coefficient of x can be interpreted
as giving the sample-average marginal effect of a change in x (see Section 4.7.2). Here
the OLS estimate
β2 = 0.61 equals to two decimal places the sample-average response
computed earlier using the exponential MLE. Here OLS provides a good estimate of
the sample-average marginal response, even though it can provide a very poor estimate
of the marginal response for any particular value of x.
5.10. Practical Considerations
Most econometrics packages provide simple commands to obtain the maximum like-
lihood estimators for the standard models introduced in Section 5.6.1. For other den-
sities many packages provide an ML routine to which the user provides the equation
for the density and possibly first derivatives or even second derivatives. Similarly, for
NLS one provides the equation for the conditional mean to an NLS routine. For some
nonlinear models and data sets the ML and NLS routines provided in packages can en-
counter computational difficulties in obtaining estimates. In such circumstances it may
be necessary to use more robust optimization routines provided as add-on modules to
Gauss, Matlab and OX. Gauss, Matlab and OX are better tools for nonlinear modeling,
but require a higher initial learning investment.
For cross-section data it is becoming standard to use standard errors based on the
sandwich form of the variance matrix. These are often provided as a command option.
For LS estimators this gives heteroskedastic-consistent standard errors. For maximum
likelihood one should be aware that misspecification of the density can lead to incon-
sistency in addition to requiring the use of sandwich errors.
The parameters of nonlinear models are usually not directly interpretable, and it is
good practice to additionally compute the implied marginal effects caused by changes
in regressors (see Section 5.2.4). Some packages do this automatically; for others sev-
eral lines of postestimation code using saved regression coefficients may be needed.
5.11. Bibliographic Notes
A brief history of the development of asymptotic theory results for extremum estimators is
given in Newey and McFadden (1994, p. 2115). A major econometrics advance was made
by Amemiya (1973), who developed quite general theorems that were applied to the Tobit
model MLE. Useful book-length treatments include those by Gallant (1987), Gallant and White
(1987), Bierens (1993), and White (1994, 2001a). Statistical foundations are given in many
books, including Amemiya (1985, Chapter 3), Davidson and MacKinnon (1993, Chapter 4),
163
MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION
Greene (2003, appendix D), Davidson (1994), and Zaman (1996).
5.3 The presentation of general extremum estimation results draws heavily on Amemiya (1985,
Chapter 4), and to a lesser extent on Newey and McFadden (1994). The latter reference is
very comprehensive.
5.4 The estimating equations approach is used in the generalized linear models literature (see
McCullagh and Nelder, 1989). Econometricians subsume this in generalized method of
moments (see Chapter 6).
5.5 Statistical inference is presented in detail in Chapter 7.
5.6 See the pioneering article by Fisher (1922) for general results for ML estimation, including
efficiency, and for comparison of the likelihood approach with the inverse-probability or
Bayesian approach and with method of moments estimation.
5.7 Modern applications frequently use the quasi-ML framework and sandwich estimates of
the variance matrix (see White, 1982, 1994). In statistics the approach is called generalized
linear models, with McCullagh and Nelder (1989) a standard reference.
5.8 Similarly for NLS estimation, sandwich estimates of the variance matrix are used that re-
quire relatively weak assumptions on the error process. The papers by White (1980a,c) had
a big impact on statistical inference in econometrics. Generalization and a detailed review
of the asymptotic theory is given in White and Domowitz (1984). Amemiya (1983) has
extensively surveyed methods for nonlinear regression.
Exercises
5–1 Suppose we obtain model estimates that yield predicted conditional mean

E[y|x] = exp(1 + 0.01x)/[1 + exp(1 + 0.01x)]. Suppose the sample is of size 100
and x takes integer values 1, 2, . . . , 100. Obtain the following estimates of the
estimated marginal effect ∂
E[y|x]/∂x.
(a) The average marginal effect over all observations.
(b) The marginal effect of the average observation.
(c) The marginal effect when x = 90.
(d) The marginal effect of a one-unit change when x = 90, computed using the
finite-difference method.
5–2 Consider the following special one-parameter case of the gamma distribution,
f (y) = (y/λ2
) exp (−y/λ), y  0, λ  0. For this distribution it can be shown that
E[y] = 2λ and V[y] = 2λ2
. Here we introduce regressors and suppose that in the
true model the parameter λ depends on regressors according to λi = exp(x
i β)/2.
Thus E[yi |xi ] = exp(x
i β) and V[yi |xi ] = [exp(x
i β)]2
/2. Assume the data are inde-
pendent over i and xi is nonstochastic and β = β0 in the dgp.
(a) Show that the log-likelihood function (scaled by N−1
) for this gamma model
is QN(β) = N−1

i
!
ln yi − 2x
i β + 2 ln 2 − 2yi exp(−x
i β)

.
(b) Obtain plim QN(β). You can assume that assumptions for any LLN used are
satisfied. [Hint: E[ln yi ] depends on β0 but not β.]
(c) Prove that 
β that is the local maximum of QN(β) is consistent for β0. State
any assumptions made.
(d) Now state what LLN you would use to verify part (b) and what additional
information, if any, is needed to apply this law. A brief answer will do. There
is no need for a formal proof.
164
5.11. BIBLIOGRAPHIC NOTES
5–3 Continue with the gamma model of Exercise 5–2.
(a) Show that ∂ QN(β)/∂β = N−1

i 2[(yi − exp(x
i β))/ exp(x
i β)]xi .
(b) What essential condition indicated by the first-order conditions needs to be
satisfied for 
β to be consistent?
(c) Apply a central limit theorem to obtain the limit distribution of
√
N∂ QN/∂β|β0
.
Here you can assume that the assumptions necessary for a CLT are satisfied.
(d) State what CLT you would use to verify part (c) and what additional informa-
tion, if any, is needed to apply this law. A brief answer will do. There is no
need for a formal proof.
(e) Obtain the probability limit of ∂2
QN/∂β∂β
|β0
.
(f) Combine the previous results to obtain the limit distribution of
√
N(
β − β0).
(g) Given part (f), state how to test H0 : β0 j ≥ β∗
j against Ha : β0 j  β∗
j at level
0.05, where βj is the j th component of β.
5–4 A nonnegative integer variable y that is geometric distributed has density (or
more formally probability mass function) f (y) = (y + 1)(2λ)y
(1 + 2λ)−(y+0.5)
, y =
0, 1, 2, . . . , λ  0. Then E[y] = λ and V[y] = λ(1 + 2λ). Introduce regressors and
suppose γi = exp(x
i β). Assume the data are independent over i and xi is non-
stochastic and β = β0 in the dgp.
(a) Repeat Exercise 5–2 for this model.
(b) Repeat Exercise 5–3 for this model.
5–5 Suppose a sample yields estimates
θ1 = 5,
θ2 = 3, se[
θ1] = 2, and se[
θ2] = 1 and
the correlation coefficient between 
θ1 and 
θ2 equals 0.5. Perform the following
tests at level 0.05, assuming asymptotic normality of the parameter estimates.
(a) Test H0 : θ1 = 0 against Ha : θ1 = 0.
(b) Test H0 : θ1 = 2θ2 against Ha : θ1 = 2θ2.
(c) Test H0 : θ1 = 0, θ2 = 0 against Ha : at least one of θ1, θ2 = 0.
5–6 Consider the nonlinear regression model y = exp (x
β)/[1 + exp (x
β)] + u, where
the error term is possibly heteroskedastic.
(a) Within what range does this restrict E[y|x] to lie?
(b) Give the first-order conditions for the NLS estimator.
(c) Obtain the asymptotic distribution of the NLS estimator using result (5.77).
5–7 This question presumes access to software that allows NLS and ML estimation.
Consider the gamma regression model of Exercise 5–2. An appropriate gamma
variate can be generated using y = −λ lnr1 − λ lnr2, where λ = exp (x
β)/2 and
r1 and r2 are random draws from Uniform[0, 1]. Let x
β = β1 + β2x. Generate a
sample of size 1,000 when β1 = −1.0 and β2 = 1 and x ∼N[0, 1].
(a) Obtain estimates of β1 and β2 from NLS regression of y on exp(β1 + β2x).
(b) Should sandwich standard errors be used here?
(c) Obtain ML estimates of β1 and β2 from NLS regression of y on exp(β1 + β2x).
(d) Should sandwich standard errors be used here?
165
C H A P T E R 6
Generalized Method of Moments
and Systems Estimation
6.1. Introduction
The previous chapter focused on m-estimation, including ML and NLS estimation.
Now we consider a much broader class of extremum estimators, those based on method
of moments (MM) and generalized method of moments (GMM).
The basis of MM and GMM is specification of a set of population moment condi-
tions involving data and unknown parameters. The MM estimator solves the sample
moment conditions that correspond to the population moment conditions. For exam-
ple, the sample mean is the MM estimator of the population mean. In some cases there
may be no explicit analytical solution for the MM estimator, but numerical solution
may still be possible. Then the estimator is an example of the estimating equations
estimator introduced briefly in Section 5.4.
In some situations, however, MM estimation may be infeasible because there are
more moment conditions and hence equations to solve than there are parameters. A
leading example is IV estimation in an overidentified model. The GMM estimator, due
to Hansen (1982), extends the MM approach to accommodate this case.
The GMM estimator defines a class of estimators, with different GMM estimators
obtained by using different population moment conditions, just as different specified
densities lead to different ML estimators. We emphasize this moment-based approach
to estimation, even in cases where alternative presentations are possible, as it provides
a unified approach to estimation and can provide an obvious way to extend methods
from linear to nonlinear models.
The basics of GMM estimation are given in Sections 6.2 and 6.3, which present,
respectively, expository examples and asymptotic results for statistical inference. The
remainder of the chapter details more specialized estimators. Instrumental variables
estimators are presented in Sections 6.4 and 6.5. For linear models the treatment in
Sections 4.8 and 4.9 may be sufficient, but extension to nonlinear models uses the
GMM approach. Section 6.6 covers methods to compute standard errors of sequential
two-step m-estimators. Sections 6.7 and 6.8 present the minimum distance estimator,
a variant of GMM, and the empirical likelihood estimator, an alternative estimator to
166
6.2. EXAMPLES
GMM. Systems estimation methods, used in a relatively small fraction of microecono-
metrics studies, are discussed in Sections 6.9 and 6.10.
This chapter reviews many estimation methods from a GMM perspective. Applica-
tions of these methods to actual data include a linear IV application in Section 4.9.6
and a linear panel GMM application in Section 22.3.
6.2. Examples
GMM estimators are based on the analogy principle (see Section 5.4.2) that population
moment conditions lead to sample moment conditions that can be used to estimate
parameters. This section provides several leading applications of this principle, with
properties of the resulting estimator deferred to Section 6.3.
6.2.1. Linear Regression
A classic example of method of moments is estimation of the population mean when
y is iid with mean µ. In the population
E[y − µ] = 0.
Replacing the expectations operator E[·] for the population by the average operator
N−1
N
i=1(·) for the sample yields the corresponding sample moment
1
N
N
	
i=1
(yi − µ) = 0.
Solving for µ leads to the estimator 
µMM = N−1

i yi = ȳ. The MM estimate of the
population mean is the sample mean.
This approach can be extended to the linear regression model y = x
β + u, where
x and β are K × 1 vectors. Suppose the error term u has zero mean conditional on
regressors. The single conditional moment restriction E[u|x] = 0 leads to K uncondi-
tional moment conditions E[xu] = 0, since
E[xu] = Ex[E[xu|x]] = Ex[xE[u|x]] = Ex[x·0] = 0, (6.1)
using the law of iterated expectations (see Section A.8) and the assumption that
E[u|x] = 0. Thus
E[x(y − x
β)] = 0,
if the error has conditional mean zero. The MM estimator is the solution to the corre-
sponding sample moment condition
1
N
N
	
i=1
xi (yi − x
i β) = 0.
This yields 
βMM = (

i xi x
i )−1

i xi yi .
167
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
The OLS estimator is therefore a special case of MM estimation. The MM deriva-
tion of the OLS estimator, however, differs significantly from the usual one of mini-
mization of a sum of squared residuals.
6.2.2. Nonlinear Regression
For nonlinear regression the method of moments approach reduces to NLS if regres-
sion errors are additive. For more general nonlinear regression with nonadditive errors
(defined in the following) method of moments yields a consistent estimator whereas
NLS is inconsistent.
From Section 5.8.3 the nonlinear regression model with additive error is a model
that specifies
y = g(x, β) + u.
A moment approach similar to that for the linear model yields that E[u|x] = 0 im-
plies that E[h(x)(y − x
β)] = 0, where h(x) is any function of x. The particular choice
h(x) = ∂g(x, β)/∂β, motivated in Section 6.3.7, leads to corresponding sample mo-
ment condition that equals the first-order conditions for the NLS estimator given in
Section 5.8.2.
The more general nonlinear regression model with nonadditive error specifies
u = r(y, x, β),
where again E[u|x] = 0 but now y is no longer restricted to being an additive func-
tion of u. For example, in Poisson regression one may define the standardized error
u = [y − exp (x
β)]/[exp (x
β)]1/2
that has E[u|x] = 0 and V[u|x] = 1 since y has
conditional mean and variance equal to exp (x
β).
The NLS estimator is inconsistent given nonadditive error. Minimizing
N−1

i ui
2
= N−1

i r(yi , xi , β)2
leads to first-order conditions
1
N
N
	
i=1
∂r(yi , xi , β)
∂β
r(yi , xi , β) = 0.
Here yi appears in both terms in the product and there is no guarantee that this prod-
uct has expected value of zero even if E[r(·)|x] = 0. This inconsistency did not arise
with additive errors r(·) = y − g(x, β), as then ∂r(·)/∂β = −∂g(x, β)/∂β, so only
the second term in the product depended on y.
A moment-based approach yields a consistent estimator. The assumption that
E[u|x] = 0 implies
E[h(x)r(y, x, β)] = 0,
where h(x) is a function of x. If dim[h(x)] = K then the corresponding sample mo-
ment
1
N
N
	
i=1
h(xi )r(yi , xi , β) = 0
yields a consistent estimate of β, where solution is by numerical methods.
168
6.2. EXAMPLES
6.2.3. Maximum Likelihood
The Kullback–Leibler information criterion was defined in Section 5.7.2. From
this definition, a local maximum of KLIC occurs if E[s(θ)]= 0, where s(θ) =
∂ ln f (y|x, θ)/∂θ and f (y|x, θ) is the conditional density.
Replacing population moments by sample moments yields an estimator 
θ that
solves N−1

i si (θ) = 0. These are the ML first-order conditions, so the MLE can
be motivated as an MM estimator.
6.2.4. Additional Moment Restrictions
Using additional moments can improve the efficiency of estimation but requires adap-
tation of regular method of moments if there are more moment conditions than param-
eters to estimate.
A simple example of an inefficient estimator is the sample mean. This is an ineffi-
cient estimator of the population mean unless the data are a random sample from the
normal distribution or some other member of the exponential family of distributions.
One way to improve efficiency is to use alternative estimators. The sample median,
consistent for µ if the distribution is symmetric, may be more efficient. Obviously the
MLE could be used if the distribution is fully specified, but here we instead improve
efficiency by using additional moment restrictions.
Consider estimation of β in the linear regression model. The OLS estimator is in-
efficient even assuming homoskedastic errors, unless errors are normally distributed.
From Section 6.2.1, the OLS estimator is an MM estimator based on E[xu] = 0. Now
make the additional moment assumption that errors are conditionally symmetric, so
that E[u3
|x] = 0 and hence E[xu3
] = 0. Then estimation of β may be based on the
2K moment conditions

E[x(y − x
β)]
E[x(y − x
β)3
]

=

0
0

.
The MM estimator would attempt to estimate β as the solution to the corresponding
sample moment conditions N−1

i xi (yi − x
i β) = 0 and N−1

i xi (yi − x
i β)3
= 0.
However, with 2K equations and only K unknown parameters β, it is not possible for
all of these sample moment conditions to be satisfied.
The GMM estimator instead sets the sample moments as close to zero as possible
using quadratic loss. Then 
βGMM minimizes
QN (β) =
 1
N

i xi ui
1
N

i xi u3
i

WN
 1
N

i xi ui
1
N

i xi u3
i

, (6.2)
where ui = yi − x
i β and WN is a 2K × 2K weighting matrix. For some choices
of WN this estimator is more efficient than OLS. This example is analyzed in Sec-
tion 6.3.6.
169
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
6.2.5. Instrumental Variables Regression
Instrumental variables estimation is a leading example of generalized method of mo-
ments estimation.
Consider the linear regression model y = x
β + u, with the complication that some
components of x are correlated with the error term so that OLS is inconsistent for β.
Assume the existence of instruments z (introduced in Section 4.8) that are correlated
with x but satisfy E[u|z] = 0. Then E[y − x
β|z] = 0. Using algebra similar to that
used to obtain (6.1) for the OLS example, we multiply by z to get the K unconditional
population moment conditions
E[z(y − x
β)] = 0. (6.3)
The method of moments estimator solves the corresponding sample moment condition
1
N
N
	
i=1
zi (yi − x
i β) = 0.
If dim(z) = K this yields 
βMM = (

i zi x
i )−1

i zi yi , which is the linear IV estimator
introduced in Section 4.8.6.
No unique solution exists if there are more potential instruments than regressors,
since then dim(z)  K and there are more equations than unknowns. One possibility
is to use just K instruments, but there is then an efficiency loss. The GMM estimator
instead chooses 
β to make the vector N−1

i zi (yi − x
i β) as small as possible using
quadratic loss, so that 
βGMM minimizes
QN (β) =

1
N
N
	
i=1
zi (yi − x
i β)
'
WN

1
N
N
	
i=1
zi (yi − x
i β)
'
, (6.4)
where WN is a dim(z) × dim(z) weighting matrix. The 2SLS estimator (see Sec-
tion 4.8.6) corresponds to a particular choice of WN .
Instrumental variables methods for linear models are presented in considerable de-
tail in Section 6.4. An advantage of the GMM approach is that it provides a way to
specify the optimal choice of weighting matrix WN , leading to an estimator more effi-
cient than 2SLS.
Section 6.5 covers IV methods for nonlinear models. One advantage of the GMM
approach is that generalization to nonlinear regression is straightforward. Then we
simply replace y − x
β in the preceding expression for QN (β) by the nonlinear model
error u = y − g(x
β) or u = r(y, x, β).
6.2.6. Panel Data
Another leading application of GMM and related estimation methods is to panel data
regression.
As an example, suppose yit = x
it β+uit , where i denotes individual and t denotes
time. From Section 6.2.1, pooled OLS regression of yit on xit is an MM estimator
based on the condition E[xit uit ] = 0. Suppose it is additionally assumed that the er-
ror uit is uncorrelated with regressors in periods other than the current period. Then
170
6.2. EXAMPLES
E[xisuit ] = 0 for s = t provides additional moment conditions that can be used to ob-
tain more efficient estimators.
Chapters 22 and 23 provide many applications of GMM methods to panel data.
6.2.7. Moment Conditions from Economic Theory
Economic theory can generate moment conditions that can be used as the basis for
estimation.
Begin with the model
yt = E[yt |xt , β] + ut ,
where the first term on the right-hand side measures the “anticipated” component of
y conditional on x and the second component measures the “unanticipated” compo-
nent. As examples, y may denote return on an asset or the rate of inflation. Under the
twin assumptions of rational expectations and market clearing or market efficiency,
we may obtain the result that the unanticipated component is unpredictable using any
information that was available at time t for determining E[y|x]. Then
E[(yt − E[yt |xt , β])|It ] = 0,
where It denotes information available at time t.
By the law of iterated expectations, E[zt (yt −E[yt |xt , β])] = 0, where zt is formed
from any subset of It . Since any part of the information set can be used as an instru-
ment, this provides many moment conditions that can be the basis of estimation. If
time-series data are available then GMM minimizes the quadratic form
QT (β) =

1
T
	T
t=1
zt ut

WT

1
T
	T
t=1
zt ut

,
where ut = yt − E[yt |xt , β]. If cross-section data are available at a single time point t
then GMM minimizes the quadratic form
QN (β) =

1
N
	N
i=1
zi ui

WN

1
N
	N
i=1
zi ui

,
where ui = yi − E[yi |xi , β] and the subscript t can be dropped as only one time period
is analyzed.
This approach is not restricted to the additive structure used in motivation. All
that is needed is an error ut with the property that E[ut |It ] = 0. Such conditions
arise from the Euler conditions from intertemporal models of decision making un-
der certainty. For example, Hansen and Singleton (1982) present a model of maxi-
mization of expected lifetime utility that leads to the Euler condition E[ut |It ] = 0,
where ut = βgα
t+1rt+1 − 1, gt+1 = ct+1/ct is the ratio of consumption in two periods,
and rt+1 is asset return. The parameters β and α, the intertemporal discount rate and
the coefficient of relative risk aversion, respectively, can be estimated by GMM using
either time-series or cross-section data as was done previously, with this new defini-
tion of ut . Hansen (1982) and Hansen and Singleton (1982) consider time-series data;
MaCurdy (1983) modeled both consumption and labor supply using panel data.
171
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
Table 6.1. Generalized Method of Moments: Examples
Moment Function h(·) Estimation Method
y − µ Method of moments for population mean
x(y − x
β) Ordinary least-squares regression
z(y − x
β) Instrumental variables regression
∂ ln f (y|x, θ)/∂θ Maximum likelihood estimation
6.3. Generalized Method of Moments
This section presents the general theory of GMM estimation. Generalized method of
moments defines a class of estimators. Different choice of moment condition and
weighting matrix lead to different GMM estimators, just as different choices of dis-
tribution lead to different ML estimators. We address these issues, in addition to pre-
senting the usual properties of consistency and asymptotic normality and methods to
estimate the variance matrix of the GMM estimator.
6.3.1. Method of Moments Estimator
The starting point is to assume the existence of r moment conditions for q parameters,
E[h(wi , θ0)] = 0, (6.5)
where θ is a q × 1 vector, h(·) is an r × 1 vector function with r ≥ q, and θ0 denotes
the value of θ in the dgp. The vector w includes all observables including, where
relevant, a dependent variable y, potentially endogenous regressors x, and instrumental
variables z. The dependent variable y may be a vector, so that applications with systems
of equations or with panel data are subsumed. The expectation is with respect to all
stochastic components of w and hence y, x, and z.
The choice of functional form for h(·) is qualitatively similar to the choice of model
and will vary with application. Table 6.1 summarizes some single-equation examples
of h(w) = h(y, x, z, θ) already presented in Section 6.2.
If r = q then method of moments can be applied. Equality to zero of the population
moment is replaced by equality to zero of the corresponding sample moment, and the
method of moments estimator 
θMM is defined to be the solution to
1
N
N
	
i=1
h(wi ,
θ) = 0. (6.6)
This is an estimating equations estimator that equivalently minimizes
QN (θ) =

1
N
N
	
i=1
h(wi , θ)
' 
1
N
N
	
i=1
h(wi , θ)
'
,
with asymptotic distribution presented in Section 5.4 and reproduced in (6.13) in Sec-
tion 6.3.3.
172
6.3. GENERALIZED METHOD OF MOMENTS
6.3.2. GMM Estimator
The GMM estimator is based on r independent moment conditions (6.5) while q pa-
rameters are estimated.
If r = q the model is said to be just-identified and the MM estimator in (6.6) can be
used. More formally r = q is only a necessary condition for just-identification and we
additionally require that G0 in Proposition 5.1 is of rank q. Identification is addressed
in Section 6.3.9.
If r  q the model is said to be overidentified and (6.6) has no solution for 
θ as
there are more equations (r) than unknowns (q). Instead,
θ is chosen so that a quadratic
form in N−1

i h(wi ,
θ) is as close to zero as possible. Specifically, the generalized
methods of moments estimator 
θGMM minimizes the objective function
QN (θ) =

1
N
N
	
i=1
h(wi , θ)
'
WN

1
N
N
	
i=1
h(wi , θ)
'
, (6.7)
where the r × r weighting matrix WN is symmetric positive definite, possibly stochas-
tic with finite probability limit, and does not depend on θ. The subscript N on WN is
used to indicate that its value may depend on the sample. The dimension r of WN ,
however, is fixed as N → ∞. The objective function can also be expressed in matrix
notation as QN (θ) = N−1
l
H(θ) × WN × N−1
H(θ)
l, where l is an N × 1 vector of
ones and H(θ) is an N × r matrix with ith row h(yi , xi , θ)
.
Different choices of weighting matrix WN lead to different estimators that, although
consistent, have different variances if r  q. A simple choice, though often a poor
choice, is to let WN be the identity matrix. Then QN (θ) = h̄2
1 + h̄2
2 + · · · + h̄2
r is the
sum of r squared sample averages, where h̄ j = N−1

i h j (wi , θ) and h j (·) is the jth
component of h(·). The optimal choice of WN is given in Section 6.3.5.
Differentiating QN (θ) in (6.7) with respect to θ yields the GMM first-order
conditions

1
N
N
	
i=1
∂hi (
θ)
∂θ





θ
'
× WN ×

1
N
N
	
i=1
hi (
θ)
'
= 0, (6.8)
where hi (θ) = hi (wi , θ) and we have multiplied by the scaling factor 1/2. These equa-
tions will generally be nonlinear in 
θ and can be quite complicated to solve as 
θ may
appear in both the first and third terms. Numerical solution methods are presented in
Chapter 10.
6.3.3. Distribution of GMM Estimator
The asymptotic distribution of the GMM estimator is given in the following proposi-
tion, derived in Section 6.3.9.
Proposition 6.1 (Distribution of GMM Estimator): Make the following as-
sumptions:
(i) The dgp imposes the moment condition (6.5); that is, E[h(w, θ0)] = 0.
(ii) The r × 1 vector function h(·) satisfies h(w, θ(1)
) = h(w, θ(2)
) iff θ(1)
= θ(2)
.
173
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
(iii) The following r × q matrix exists and is finite with rank q:
G0 = plim
1
N
N
	
i=1

∂hi
∂θ




θ0
'
. (6.9)
(iv) WN
p
→ W0, where W0 is finite symmetric positive definite.
(v) N−1/2
N
i=1 hi |θ0
d
→ N [0, S(θ0)], where
S0 = plimN−1
N
	
i=1
N
	
j=1

hi h
j


θ0

. (6.10)
Then the GMM estimator 
θGMM, defined to be a root of the first-order conditions
∂ QN (θ)/∂θ = 0 given in (6.8), is consistent for θ0 and
√
N(
θGMM − θ0)
d
→ N

0, (G
0W0G0)−1
(G
0W0S0W0G0)(G
0W0G0)−1

. (6.11)
Some leading specializations are the following.
First, in microeconometric analysis data are usually assumed to be independent over
i, so (6.10) simplifies to
S0 = plim
1
N
N
	
i=1

hi h
i


θ0

. (6.12)
If additionally the data are assumed to be identically distributed then (6.9) and
(6.10) simplify to G0 = E[∂h/∂θ


θ0
] and S0 = E[hh


θ0
], a notation used by many
authors.
Second, in the just-identified case that r = q, the situation for many estimators
including ML and LS, the results simplify to those already presented in Section 5.4 for
the estimating equations estimator. To see this note that when r = q the matrices G0,
W0, and S0 are square matrices that are invertible, so (G
0W0G0)−1
= G−1
0 W−1
0 (G
0)−1
and the variance matrix in (6.11) simplifies. It follows that, for the MM estimator in
(6.6),
√
N(
θMM − θ0)
d
→ N

0, G−1
0 S0(G
0)−1

. (6.13)
An MM estimator can always be computed as a GMM estimator and will be invariant
to the choice of full rank weighting matrix.
Third, the best choice of matrix WN is one such that W0 = S−1
0 . Then the variance
matrix in (6.11) simplifies to (G
0S−1
0 G0)−1
. This is expanded on in Section 6.3.5.
6.3.4. Variance Matrix Estimation
Statistical inference for the GMM estimator is possible given consistent estimates 
G
of G0, 
W of W0, and 
S of S0 in (6.11). Consistent estimates are easily obtained under
relatively weak distributional assumptions.
174
6.3. GENERALIZED METHOD OF MOMENTS
For G0 the obvious estimator is

G =
1
N
N
	
i=1
∂hi
∂θ




θ
. (6.14)
For W0 the sample weighting matrix WN is used. The estimator for the r × r matrix S0
varies with the stochastic assumptions made about the dgp. Microeconometric analysis
usually assumes independence over i, so that S0 is of the simpler form (6.12). An
obvious estimator is then

S =
1
N
N
	
i=1
hi (
θ)hi (
θ)
. (6.15)
Since h(·) is r × 1, there are at most a finite number of r(r + 1)/2 unique entries in S0
to be estimated. So 
S is consistent as N → ∞ without need to parameterize the
variance E[hi h
i ], assumed to exist, to depend on fewer parameters. All that is re-
quired are some mild additional assumptions to ensure that plim N−1

i

hi

h
i =
plim N−1

i hi h
i . For example, if 
hi = xi
ui , where 
ui is the OLS residual, we know
from Section 4.4 that existence of fourth moments of the regressors needs to be
assumed.
Combining these results, we have that the GMM estimator is asymptotically nor-
mally distributed with mean θ0 and estimated asymptotic variance

V[
θGMM] =
1
N


G
WN

G
−1

G
WN

SWN

G


G
WN

G
−1
. (6.16)
This variance matrix estimator is a robust estimator that is an extension of the Eicker–
White heteroskedastic-consistent estimator for least-squares estimators.
One can also take expectations and use 
GE = N−1

i E[∂hi /∂θ
]


θ
for G0 and

SE = N−1

i E[hi h
i ]


θ
for S0. However, this usually requires additional distribu-
tional assumptions to take the expectation, and the variance matrix estimate will not
be as robust to distributional misspecification.
In the time-series case ht is subscripted by time t, and asymptotic theory is based
on the number of time periods T → ∞. For time-series data, with ht a vector
MA(q) process, the usual estimator of V[
θGMM] is one proposed by Newey and
West (1987b) that uses (6.16) with
S = 
Ω0 +
q
j=1(1 − j
q+1
)(
Ωj + 
Ω
j ), where 
Ωj =
T −1
T
t= j+1

ht

h
t− j . This permits time-series correlation in ht in addition to contem-
poraneous correlation. Further details on covariance matrix estimation, including im-
provements in the time-series case, are given in Davidson and MacKinnon (1993, Sec-
tion 17.5), Hamilton (1994), and Haan and Levin (1997).
6.3.5. Optimal Weighting Matrix
Application of GMM requires specification of moment function h(·) and weighting
matrix WN in (6.7).
The easy part is choosing WN to obtain the GMM estimator with the smallest
asymptotic variance given a specified function h(·). This is often called optimal GMM
175
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
even though it is a limited form of optimality since a poor choice of h(·) could still lead
to a very inefficient estimator.
For just-identified models the same estimator (the MM estimator) is obtained for
any full rank weighting matrix, so one might just as well set WN = Iq.
For overidentified models with r  q, and S0 known, the most efficient GMM es-
timator is obtained by choosing the weighting matrix WN = S−1
0 . Then the variance
matrix given in the proposition simplifies and
√
N(
θGMM − θ0)
d
→ N

0, (G
0S−1
0 G0)−1

, (6.17)
a result due to Hansen (1982).
This result can be obtained using matrix arguments similar to those that establish
that GLS is the most efficient WLS estimator in the linear model. Even more simply,
one can work directly with the objective function. For LS estimators that minimize the
quadratic form u
Wu the most efficient estimator is GLS that sets W = Σ−1
= V[u]−1
.
The GMM objective function in (6.7) is of this quadratic form with u = N−1

i hi (θ)
and so the optimal W = (V[N−1

i hi (θ)])−1
= S−1
0 . The optimal GMM estimator
weights by the inverse of the variance matrix of the sample moment conditions.
Optimal GMM
In practice S0 is unknown and we let WN = 
S−1
, where 
S is consistent for S0. The
optimal GMM estimator can be obtained using a two-step procedure. At the first step
a GMM estimator is obtained using a suboptimal choice of WN , such as WN = Ir
for simplicity. From this first step, form estimate 
S using (6.15). At the second step
perform an optimal GMM estimator with optimal weighting matrix WN = 
S−1
.
Then the optimal GMM estimator or two-step GMM estimator 
θOGMM based on
hi (θ) minimizes
QN (θ) =

1
N
N
	
i=1
hi (θ)
'

S−1

1
N
N
	
i=1
hi (θ)
'
. (6.18)
The limit distribution is given in (6.17). The optimal GMM estimator is asymptoti-
cally normally distributed with mean θ0 and estimated asymptotic variance with the
relatively simple formula
V[
θOGMM] = N−1
(
G
S−1
G)−1
. (6.19)
Usually evaluation of 
G and
S is at
θOGMM, so
S uses the same formula as
S except that
evaluation is at 
θOGMM. An alternative is to continue to evaluate (6.19) at the first-step
estimator, as any consistent estimate of θ0 can be used.
Remarkably, the optimal GMM estimator in (6.18) requires no additional stochastic
assumptions beyond those needed to permit use of (6.16) to estimate the variance
matrix of suboptimal GMM. In both cases
S needs to be consistent for S0 and from the
discussion after (6.15) this requires few additional assumptions. This stands in stark
contrast to the additional assumptions needed for GLS to be more efficient than OLS
when errors are heteroskedastic. Heteroskedasticity in the errors will affect the optimal
choice of hi (θ), however (see Section 6.3.7).
176
6.3. GENERALIZED METHOD OF MOMENTS
Small-Sample Bias of Two-Step GMM
Theory suggests that for overidentified models it is best to use optimal GMM. In imple-
mentation, however, the theoretical optimal weighting matrix WN = S−1
0 needs to be
replaced by a consistent estimate 
S−1
. This replacement makes no difference asymp-
totically, but it will make a difference in finite samples. In particular, individual obser-
vations that increase hi (θ) in (6.18) are likely to increase 
S = N−1

i

hi

h
i in (6.18),
leading to correlation between N−1

i hi (θ) and
S. Note that S0 = plim N−1

i hi h
i
is not similarly affected because the probability limit is taken.
Altonji and Segal (1996) demonstrated this problem in estimation of covariance
structure models using panel data (see Section 22.5). They used the related minimum
distance estimator (see Section 6.7) but in the literature their results are intrepreted as
being relevant to GMM estimation with cross-section data or short panels. In simula-
tions the optimal estimator was more efficient than a one-step estimator, as expected.
However, the optimal estimator had finite-sample bias so large that its root mean-
squared error was much larger than that for the one-step estimator.
Altonji and Segal (1996) also proposed a variant, an independently weighted op-
timal estimator that forms the weighting matrix using observations other than used to
construct the sample moments. They split the sample into G groups, with G = 2 an
obvious choice, and minimize
QN (θ) =
1
G
	
g
hg(θ)
S−1
(−g)hg(θ), (6.20)
where hg(θ) is computed for the gth group and
S(−g) is computed using all but the gth
group. This estimator is less biased, since the weighting matrix
S−1
(−g) is by construction
independent of hg(θ). However, splitting the sample leads to efficiency loss. Horowitz
(1998a) instead used the bootstrap (see Section 11.6.4).
In the Altonji and Segal (1996) example hi involves second moments, so
S involves
fourth moments. Finite-sample problems for the optimal estimator may not be as sig-
nificant in other examples where hi involves only first moments. Nonetheless, Altonji
and Segal’s results do suggest caution in using optimal GMM and that differences
between one-step GMM and optimal GMM estimates may indicate problems of finite-
sample bias in optimal GMM.
Number of Moment Restrictions
In general adding further moment restrictions improves asymptotic efficiency, as it
reduces the limit variance (G
0S−1
0 G0)−1
of the optimal GMM estimator or at worst
leaves it unchanged.
The benefits of adding further moment conditions vary with the application. For ex-
ample, if the estimator is the MLE then there is no gain since the MLE is already fully
efficient. The literature has focused on IV estimation where gains may be considerable
because the variable being instrumented may be much more highly correlated with a
combination of many instruments than with a single instrument.
There is a limit, however, as the number of moment restrictions cannot exceed
the number of observations. Moreover, adding more moment conditions increases the
177
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
likelihood of finite-sample bias and related problems similar to those of weak instru-
ments in linear models (see Section 4.9). Stock et al. (2002) briefly consider weak
instruments in nonlinear models.
6.3.6. Regression with Symmetric Error Example
To demonstrate the GMM asymptotic results we return to the additional moment re-
strictions example introduced in Section 6.2.4. For this example the objective function
for 
βGMM has already been given in (6.2). All that is required is specification of WN ,
such as WN = I.
To obtain the distribution of this estimator we use the general notation of Section
6.3. The function h(·) in (6.5) specializes to
h(y, x, β) =

x(y − x
β)
x(y − x
β)3

⇒
∂h(y, x, β)
∂β =

−xx
−3xx
(y − x
β)2

.
These expressions lead directly to expressions for G0 and S0 using (6.9) and (6.12), so
that (6.14) and (6.15) then yield consistent estimates

G =

− 1
N

i xi x
i
− 1
N

i 3
u2
i xi x
i

(6.21)
and

S =
 1
N

i 
u2
i xi x
i
1
N

i 
u4
i xi x
i
1
N

i 
u4
i xi x
i
1
N

i 
u6
i xi x
i

, (6.22)
where 
ui = y − x
i

β. Alternative estimates can be obtained by first evaluating the ex-
pectations in G0 and S0, but this will require assumptions on E[u2
|x], E[u4
|x], and
E[u6
|x]. Substituting 
G, 
S, and WN into (6.16) gives the estimated asymptotic vari-
ance matrix for 
βGMM.
Now consider GMM with an optimal weighting matrix. This again minimizes (6.2),
but from (6.18) now WN = 
S−1
, where 
S is defined in (6.22). Computation of 
S re-
quires first-step consistent estimates 
β. An obvious choice is GMM with WN = I.
In this example the OLS estimator is also consistent and could instead be used.
Using (6.19) gives this two-step estimator an estimated asymptotic variance matrix

V[
βOGMM] equal to
 
i 
ui xi x
i

i 
u3
i xi x
i
 
i 
u2
i xi x
i

i 
u4
i xi x
i

i 
u4
i xi x
i

i 
u6
i xi x
i
−1  
i 
ui xi x
i

i 
u3
i xi x
i

−1
,
where 
ui = yi − x
i

βOGMM and the various divisions by N have canceled out.
Analytical results for the efficiency gain of optimal GMM in this example are eas-
ily obtained by specialization to the nonregression case where y is iid with mean µ.
Furthermore, assume that y is Laplace distributed with scale parameter equal to unity,
in which case the density is f (y) = (1/2) × exp{−|y − µ|} with E[y] = µ, V[y] = 2,
and higher central moments E[(y − µ)r
] equal to zero for r odd and equal to r! for
r even. The sample median is fully efficient as it is the MLE, and it can be shown to
178
6.3. GENERALIZED METHOD OF MOMENTS
have asymptotic variance 1/N. The sample mean ȳ is inefficient with variance V[ȳ] =
V[y]/N = 2/N. The optimal GMM estimator 
µopt
based on the two moment condi-
tions E[(y − µ)] = 0 and E[(y − µ)3
] = 0 has weighting matrix that places much less
weight on the second moment condition, because it has relatively high variance, and
has negative off-diagonal entries. The optimal GMM estimator 
µOGMM can be shown
to have asymptotic variance 1.7143/N (see Exercise 6.3). It is therefore more efficient
than the sample mean (variance 2/N), though is still considerably less efficient than
the sample median.
For this example the identity matrix is an exceptionally poor choice of weighting
matrix. It places too much weight on the second moment condition, yielding a sub-
optimal GMM estimator of µ with asymptotic variance 19.14/N that is many times
greater than even V[ȳ] = 2/N. For details see Exercise 6.3.
6.3.7. Optimal Moment Condition
Section 6.3.5 gives the surprising result that optimal GMM requires essentially no
more assumptions than does GMM without an optimal weighting matrix. However,
this optimality is very limited as it is conditional on the choice of moment function
h(·) in (6.5) or (6.18).
The GMM defines a class of estimators, with different choice of h(·) correspond-
ing to different members of the class. Some choices of h(·) are better than others, de-
pending on additional stochastic assumptions. For example, hi = xi ui yields the OLS
estimator whereas hi = xi ui /V[ui |xi ] yields the GLS estimator when errors are het-
eroskedastic. This multitude of potential choices for h(·) can make any particular
GMM estimator appear ad hoc. However, qualitatively similar decisions have to be
made in m-estimation in choosing, for example, to minimize the sum of squared errors
rather than the weighted sum of squared errors or the sum of absolute deviations of
errors.
If complete distributional assumptions are made the most efficient estimator is the
MLE. Thus the optimal choice of h(·) in (6.5) is
h(w, θ) =
∂ ln f (w, θ)
∂θ
,
where f (w, θ) is the joint density of w. For regression with dependent variable(s) y
and regressors x this is the unconditional MLE based on the unconditional joint den-
sity f (y, x, θ) of y and x. In many applications f (y, x, θ) = f (y|x, θ)g(x), where the
(suppressed) parameters of the marginal density of x do not depend on the parameters
of interest θ. Then it is just as efficient to use the conditional MLE based on the con-
ditional density f (y|x, θ). This can be used as the basis for MM estimation, or GMM
estimation with weighting matrix WN = Iq, though any full-rank matrix WN will also
give the MLE. This result is of limited practical use, however, as the purpose of GMM
estimation is to avoid making a full set of distributional assumptions.
When incomplete distributional assumptions are made, a common starting point is
specification of a conditional moment condition, where conditioning is on exoge-
nous variables. This is usually a low-order moment condition for the model error such
179
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
as E[u|x] = 0 or E[u|z] = 0. This conditional moment condition can lead to many
unconditional moment conditions that might be the basis for GMM estimation, such
as E[zu] = 0. Newey (1990a, 1993) obtained results on the optimal choice of uncon-
ditional moment condition for data independent over i.
Specifically, begin with s conditional moment condition restrictions
E[r(y, x, θ0)|z] = 0, (6.23)
where r(·) is a residual-type s × 1 vector function introduced in Section 6.2.2. A scalar
example is E[y − x
θ0|z] = 0. The instrumental variables notation is being used where
x are regressors, some potentially endogenous, and z are instruments that include the
exogenous components of x. In simpler models without endogeneity z = x.
GMM estimation of the q parameters θ based on (6.23) is not possible, as typically
there are only a few conditional moment restrictions, and often just one, so s ≤ q.
Instead, we introduce an r × s matrix function of the instruments D(z), where r ≥ q,
and note that by the law of iterated expectations E[D(z)r(y, x, θ0)] = 0, which can be
used as the basis for GMM estimation. The optimal instruments or optimal choice of
matrix function D(z) can be shown to be the q × s matrix
D∗
(z, θ0) = E

∂r(y, x, θ0)
∂θ
|z

{V [r(y, x, θ0)|z]}−1
. (6.24)
A derivation is given in, for example, Davidson and MacKinnon (1993, p. 604). The
optimal instrument matrix D∗
(z) is a q × s matrix, so the unconditional moment con-
dition E[D∗
(z)r(y, x, θ0)] = 0 yields exactly as many moment conditions as param-
eters. The optimal GMM estimator simply solves the corresponding sample moment
conditions
1
N
N
	
i=1
D∗
(zi , θ)r(yi , xi , θ) = 0. (6.25)
The optimal estimator requires additional assumptions, namely the expectations
used in forming D∗
(z, θ0) in (6.24), and implementation requires replacing unknown
parameters by known parameters so that generated regressors 
D are used.
For example, if r(y, x, θ) = y − exp(x
θ) then ∂r/∂θ = − exp(x
θ)x and (6.24)
requires specification of E[exp(x
θ0)x|z] and V[y − exp(x
θ)|z]. One possibility is
to assume E[exp(x
θ0)x|z] is a low-order polynomial in z, in which case there will
be more moment conditions than parameters and so estimation is by GMM rather
than simply by solving (6.25), and to assume errors are homoskedastic. If these addi-
tional assumptions are wrong then the estimator is still consistent, provided (6.23) is
valid, and consistent standard errors can be obtained using the robust form of the vari-
ance matrix in (6.16). It is common to more simply use z rather than D∗
(z, θ) as the
instrument.
Optimal Moment Condition for Nonlinear Regression Example
The result (6.24) is useful in some cases, especially those where z = x. Here we con-
firm that GLS is the most efficient GMM estimator based on E[u|x] = 0.
180
6.3. GENERALIZED METHOD OF MOMENTS
Consider the nonlinear regression model y = g(x, β) + u. If the starting point is
the conditional moment restriction E[u|x] = 0, or E[y − g(x, β)|x] = 0, then z = x in
(6.23), and (6.24) yields
D∗
(x, β) = E

∂
∂β
(y − g(x, β0))|x

!
V

y − g(x, β0)|x
−1
= −
∂g(x, β0)
∂β
×
1
V [u|x]
,
which requires only specification of V[u|x]. From (6.25) the optimal GMM estimator
directly solves the corresponding sample moment conditions
1
N
N
	
i=1
−
∂g(xi , β)
∂β
×
(yi − g(xi , β))
σ2
i
= 0,
where σ2
i = V[ui |xi ] is functionally independent of β. These are the first-order condi-
tions for generalized NLS when the error is heteroskedastic. Implementation is possi-
ble using a consistent estimate 
σ2
i of σ2
i , in which case GMM estimation is the same
as FGNLS. One can obtain standard errors robust to misspecification of σ2
i as detailed
in Section 5.8.
Specializing to the linear model, g(x, β) = x
β and the optimal GMM estimator
based on E[u|x] = 0 is GLS, and specializing further to the case of homoskedastic
errors, the optimal GMM estimator based on E[u|x] = 0 is OLS. As already seen in
the example in Section 6.3.6, more efficient estimation may be possible if additional
conditional moment conditions are used.
6.3.8. Tests of Overidentifying Restrictions
Hypothesis tests on θ can be performed using the Wald test (see Section 5.5), or with
other methods given in Section 7.5.
In addition there is a quite general model specification test that can be used for over-
identified models with more moment conditions (r) than parameters (q). The test is one
of the closeness of N−1

i

hi to 0, where
hi = h(wi ,
θ). This is an obvious test of H0:
E[h(w, θ0)] = 0, the initial population moment conditions. For just-identified models,
estimation imposes N−1

i

hi = 0 and the test is not possible. For over-identified
models, however, the first-order conditions (6.8) set a q × r matrix times N−1

i

hi
to zero, where q  r, so

i

hi = 0.
In the special case that θ is estimated by 
θOGMM defined in (6.18), Hansen (1982)
showed that the overidentifying restrictions (OIR) test statistic
OIR =

N−1
	
i

hi


S−1

N−1
	
i

hi

(6.26)
is asymptotically distributed as χ2
(r − q) under H0 :E[h(w, θ0)] = 0. Note that OIR
equals the GMM objective function (6.18) evaluated at 
θOGMM. If OIR is large then
the population moment conditions are rejected and the GMM estimator is inconsistent
for θ.
181
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
It is not obvious a priori that the particular quadratic form in N−1

i

hi given in
(6.26) is χ2
(r − q) distributed under H0. A formal derivation is given in the next
section and an intuitive explanation in the case of linear IV estimation is provided
in Section 8.4.4.
A classic application is to life-cycle models of consumption (see Section 6.2.7), in
which case the orthogonality conditions are Euler conditions. A large chi-square test
statistic is then often stated to mean rejection of the life-cycle hypothesis. However, it
should instead be more narrowly interpreted as rejection of the particular specification
of utility function and set of stochastic assumptions used in the study.
6.3.9. Derivations for the GMM Estimator
The algebra is simplified by introducing a more compact notation. The GMM estimator
minimizes
QN (θ) = gN (θ)
WN gN (θ), (6.27)
where gN (θ) = N−1

i hi (θ). Then the GMM first-order conditions (6.8) are
GN (
θ)
WN gN (
θ) = 0, (6.28)
where GN (θ) = ∂gN (θ)/∂θ
= N−1

i ∂hi (θ)/∂θ
.
For consistency we consider the informal condition that the probability limit of
∂ QN (θ)/∂θ|θ0
equals zero. From (6.28) this will be the case as GN (θ0) and WN
have finite probability limits, by assumptions (iii) and (iv) of Proposition 6.1, and
plim gN (θ0) = 0 as a consequence of assumption (v). More intuitively, gN (θ0) =
N−1

i hi (θ0) has probability limit zero if a law of large numbers can be applied
and E[hi (θ0)] = 0, which was assumed at the outset in (6.5).
The parameter θ0 is identified by the key assumption (ii) and additionally assump-
tions (iii) and (iv), which restrict the probability limits of GN (θ0) and WN to be full-
rank matrices. The assumption that G0 = plim GN (θ0) is a full-rank matrix is called
the rank condition for identification. A weaker necessary condition for identification
is the order condition that r ≥ q.
For asymptotic normality, a more general theory is needed than that for an m-
estimator based on an objective function QN (β) =N−1

i q(wi , θ) that involves just
one sum. We rescale (6.28) by multiplication by
√
N, so that
GN (
θ)
WN
√
NgN (
θ) = 0. (6.29)
The approach of the general Theorem 5.3 is to take a Taylor series expansion around
θ0 of the entire left-hand side of (6.28). Since 
θ appears in both the first and third
terms this is complicated and requires existence of first derivatives of GN (θ) and hence
second derivatives of gN (θ). Since GN (
θ) and WN have finite probability limits it is
sufficient to more simply take an exact Taylor series expansion of only
√
NgN (
θ). This
yields an expression similar to that in the Chapter 5 discussion of m-estimation, with
√
NgN (
θ) =
√
NgN (θ0) + GN (θ+
)
√
N(
θ − θ0), (6.30)
182
6.4. LINEAR INSTRUMENTAL VARIABLES
recalling that GN (θ) = ∂gN (θ)/∂θ
, where θ+
is a point between θ0 and 
θ. Substitut-
ing (6.30) back into (6.29) yields
GN (
θ)
WN
√
NgN (θ0) + GN (θ+
)
√
N(
θ − θ0)

= 0.
Solving for
√
N(
θ − θ0) yields
√
N(
θ − θ0) = −

GN (
θ)
WN GN (θ+
)
−1
GN (
θ)
WN
√
NgN (θ0). (6.31)
Equation (6.31) is the key result for obtaining the limit distribution of the GMM
estimator. We obtain the probability limits of each of the first five terms using
θ
p
→ θ0,
given consistency, in which case θ+ p
→ θ0. The last term on the right-hand side of
(6.31) has a limit normal distribution by assumption (v). Thus
√
N(
θ − θ0)
d
→ −(G
0W0G0)−1
G
0W0 × N[0, S0],
where G0, W0, and S0 have been defined in Proposition 6.1. Applying the limit normal
product rule (Theorem A.17) yields (6.11).
This derivation treats the GMM first-order conditions as being q linear combina-
tions of the r sample moments gN (
θ), since GN (
θ)
WN is a q × r matrix. The MM
estimator is the special case q = r, since then GN (
θ)
WN is a full-rank square matrix,
so GN (
θ)
WN gN (
θ) = 0 implies that gN (
θ) = 0.
To derive the distribution of the OIR test statistic in (6.26), begin with a first-order
Taylor series expansion of
√
NgN (
θ) around θ0 to obtain
√
NgN (
θOGMM) =
√
NgN (θ0) + GN (θ+
)
√
N(
θOGMM − θ0)
=
√
NgN (θ0) − G0(G
0S−1
0 G0)−1
G
0S−1
0
√
NgN (θ0) + op(1)
= [I − M0S−1
0 ]
√
NgN (θ0) + op(1),
where the second equality uses (6.31) with WN consistent for S−1
0 , M0 =
G0(G
0S−1
0 G0)−1
G
0, and op(1) is defined in Definition A.22. It follows that
S
−1/2
0
√
NgN (
θOGMM) = S
−1/2
0 [I − M0S−1
0 ]
√
NgN (θ0) + op(1) (6.32)
= [I − S
−1/2
0 M0S
−1/2
0 ]S
−1/2
0
√
NgN (θ0) + op(1).
Now [I − S
−1/2
0 M0S
−1/2
0 ] = [I − S
−1/2
0 G0(G
0S−1
0 G0)−1
G
0S
−1/2
0 ] is an idempotent
matrix of rank (r − q), and S
−1/2
0
√
NgN (θ0)
d
→ N[0, I] given
√
NgN (θ0)
d
→
N[0, S0]. From standard results for quadratic forms of normal variables it follows
that the inner product
τN = (S
−1/2
0
√
NgN (
θOGMM))
(S
−1/2
0
√
NgN (
θOGMM))
converges to the χ2
(r − q) distribution.
6.4. Linear Instrumental Variables
Correlation of regressors with the error term leads to inconsistency of least-
squares methods. Examples of such failure include omitted variables, simultaneity,
183
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
measurement error in the regressors, and sample selection bias. Instrumental variables
methods provide a general approach that can handle any of these problems, provided
suitable instruments exist.
Instrumental variables methods fall naturally into the GMM framework as a surplus
of instruments leads to an excess of moment conditions that can be used for estimation.
Many IV results are most easily obtained using the GMM framework.
Linear IV is important enough to appear in many places in this book. An introduc-
tion was given in Sections 4.8 and 4.9. This section presents single-equation linear IV
as a particular application of GMM. For completeness the section also presents the
earlier literature on a special case, the two-stage least-squares estimator. Systems lin-
ear IV estimation is summarized in Section 6.9.5. Tests of endogeneity and tests of
overidentifying restrictions for linear models are detailed in Section 8.4. Chapter 22
presents linear IV estimation with panel data.
6.4.1. Linear GMM with Instruments
Consider the linear regression model
$$ y_i = x_i'\beta + u_i, \qquad (6.33) $$
where each component of x is viewed as being an exogenous regressor if it is uncor-
related with the error in model (6.33) or an endogenous regressor if it is correlated.
If all regressors are exogenous then LS estimators can be used, but if any components
of x are endogenous then LS estimators are inconsistent for β.
From Section 4.8, consistent estimates can be obtained by IV estimation. The key
assumption is the existence of an r × 1 vector of instruments z that satisfies
$$ E[u_i|z_i] = 0. \qquad (6.34) $$
Exogenous regressors can be instrumented by themselves. As there must be at least as many instruments as regressors, the challenge is to find at least as many additional instruments as there are endogenous variables in the model. Some examples of such instruments have been given in Section 4.8.2.
Linear GMM Estimator
From Section 6.2.5, the conditional moment restriction (6.34) and model (6.33) imply
the unconditional moment restriction
$$ E[z_i(y_i - x_i'\beta)] = 0, \qquad (6.35) $$
where for notational simplicity the following analysis uses β rather than the more
formal β0 to denote the true parameter value. A quadratic form in the corresponding
sample moments leads to the GMM objective function QN (β) given in (6.4).
In matrix notation define $y = X\beta + u$ as usual and let Z denote the N × r matrix of instruments with ith row $z_i'$. Then $\sum_i z_i(y_i - x_i'\beta) = Z'u$ and (6.4) becomes
$$ Q_N(\beta) = \left[\frac{1}{N}(y - X\beta)'Z\right]W_N\left[\frac{1}{N}Z'(y - X\beta)\right], \qquad (6.36) $$
where WN is an r × r full-rank symmetric weighting matrix with leading examples
given at the end of this section. The first-order conditions
$$ \frac{\partial Q_N(\beta)}{\partial\beta} = -2\left[\frac{1}{N}X'Z\right]W_N\left[\frac{1}{N}Z'(y - X\beta)\right] = 0 $$
can actually be solved for β in this special case of GMM, leading to the GMM estimator in the linear IV model
$$ \hat\beta_{GMM} = \left[X'ZW_NZ'X\right]^{-1}X'ZW_NZ'y, \qquad (6.37) $$
where the divisions by N have canceled out.
Distribution of Linear GMM Estimator
The general results of Section 6.3 can be used to derive the asymptotic distribution. Alternatively, since an explicit solution for $\hat\beta_{GMM}$ exists, the analysis for OLS given in Section 4.4 can be adapted. Substituting $y = X\beta + u$ into (6.37) yields
$$ \hat\beta_{GMM} = \beta + \left[\left(N^{-1}X'Z\right)W_N\left(N^{-1}Z'X\right)\right]^{-1}\left[\left(N^{-1}X'Z\right)W_N\left(N^{-1}Z'u\right)\right]. \qquad (6.38) $$
From the last term, consistency of the GMM estimator essentially requires that plim $N^{-1}Z'u = 0$. Under pure random sampling this requires that (6.35) holds, whereas under other common sampling schemes (see Section 24.3) the stronger assumption (6.34) is needed.
Additionally, the rank condition for identification of β, that plim $N^{-1}Z'X$ is of rank K, ensures that the inverse in the right-hand side exists, provided $W_N$ is of full rank. A weaker order condition is that r ≥ K.
The limit distribution is based on the expression for $\sqrt{N}(\hat\beta_{GMM} - \beta)$ obtained by simple manipulation of (6.38). This yields an asymptotic normal distribution for $\hat\beta_{GMM}$ with mean β and estimated asymptotic variance
$$ \hat V[\hat\beta_{GMM}] = N\left[X'ZW_NZ'X\right]^{-1}\left[X'ZW_N\hat SW_NZ'X\right]\left[X'ZW_NZ'X\right]^{-1}, \qquad (6.39) $$
where $\hat S$ is a consistent estimate of
$$ S = \lim\frac{1}{N}\sum_{i=1}^{N}E\left[u_i^2\,z_iz_i'\right], $$
given the usual cross-section assumption of independence over i. The essential additional assumption needed for (6.39) is that $N^{-1/2}Z'u \xrightarrow{d} N[0, S]$. Result (6.39) also follows from Proposition 6.1 with $h(\cdot) = z(y - x'\beta)$ and hence $\partial h/\partial\beta' = -zx'$.
For cross-section data with heteroskedastic errors, S is consistently estimated by
$$ \hat S = \frac{1}{N}\sum_{i=1}^{N}\hat u_i^2\,z_iz_i' = Z'DZ/N, \qquad (6.40) $$
where $\hat u_i = y_i - x_i'\hat\beta_{GMM}$ is the GMM residual and D is an N × N diagonal matrix with entries $\hat u_i^2$. A commonly used small-sample adjustment is to divide by N − K rather than N in the formula for $\hat S$. In the more restrictive case of homoskedastic errors, $E[u_i^2|z_i] = \sigma^2$ and so $S = \lim N^{-1}\sum_i\sigma^2E[z_iz_i']$, leading to the estimate
$$ \hat S = s^2\,Z'Z/N, \qquad (6.41) $$
where $s^2 = (N - K)^{-1}\sum_{i=1}^{N}\hat u_i^2$ is consistent for $\sigma^2$. These results mimic similar results for OLS presented in Section 4.4.5.

Table 6.2. GMM Estimators in Linear IV Model and Their Asymptotic Variance$^a$

GMM (general $W_N$):
  $\hat\beta_{GMM} = [X'ZW_NZ'X]^{-1}X'ZW_NZ'y$
  $\hat V[\hat\beta] = N[X'ZW_NZ'X]^{-1}[X'ZW_N\hat SW_NZ'X][X'ZW_NZ'X]^{-1}$

Optimal GMM ($W_N = \hat S^{-1}$):
  $\hat\beta_{OGMM} = [X'Z\hat S^{-1}Z'X]^{-1}X'Z\hat S^{-1}Z'y$
  $\hat V[\hat\beta] = N[X'Z\hat S^{-1}Z'X]^{-1}$

2SLS ($W_N = [N^{-1}Z'Z]^{-1}$):
  $\hat\beta_{2SLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'y$
  $\hat V[\hat\beta] = N[X'Z(Z'Z)^{-1}Z'X]^{-1}[X'Z(Z'Z)^{-1}\hat S(Z'Z)^{-1}Z'X][X'Z(Z'Z)^{-1}Z'X]^{-1}$
  $\hat V[\hat\beta] = s^2[X'Z(Z'Z)^{-1}Z'X]^{-1}$ if homoskedastic errors

IV (just-identified):
  $\hat\beta_{IV} = [Z'X]^{-1}Z'y$
  $\hat V[\hat\beta] = N(Z'X)^{-1}\hat S(X'Z)^{-1}$

$^a$ Equations are based on a linear regression model with dependent variable y, regressors X, and instruments Z. $\hat S$ is defined in (6.40) and $s^2$ is defined after (6.41). All variance matrix estimates assume errors that are independent across observations and heteroskedastic, aside from the simplification for homoskedastic errors given for the 2SLS estimator. Optimal GMM uses the optimal weighting matrix.
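As a concrete illustration of (6.37)-(6.40), the following minimal sketch (not from the text; it uses simulated data and a hypothetical helper function `linear_gmm`) computes the linear GMM estimator for a given weighting matrix together with the heteroskedasticity-robust variance estimate (6.39).

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """GMM estimator (6.37) and robust variance estimate (6.39) for the linear IV model."""
    N = len(y)
    XZ, ZX = X.T @ Z, Z.T @ X
    A = np.linalg.inv(XZ @ W @ ZX)           # [X'Z W Z'X]^{-1}
    b = A @ XZ @ W @ (Z.T @ y)               # GMM estimate
    u = y - X @ b                            # residuals
    S = (Z * u[:, None] ** 2).T @ Z / N      # S-hat = N^{-1} sum u_i^2 z_i z_i'  (6.40)
    V = N * A @ (XZ @ W @ S @ W @ ZX) @ A    # sandwich variance (6.39)
    return b, V

# Simulated example: one endogenous regressor, two instruments (overidentified).
rng = np.random.default_rng(0)
N = 500
z = rng.normal(size=(N, 2))
v = rng.normal(size=N)
u = 0.8 * v + rng.normal(size=N)             # error correlated with x through v
x = z @ np.array([1.0, 0.5]) + v             # first-stage relation
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z])
y = X @ np.array([1.0, 2.0]) + u

W = np.linalg.inv(Z.T @ Z / N)               # 2SLS weighting matrix
b, V = linear_gmm(y, X, Z, W)
print(b, np.sqrt(np.diag(V)))                # estimates and robust standard errors
```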
6.4.2. Different Linear GMM Estimators
Implementation of the results of Section 6.4.1 requires specification of the weighting
matrix WN . For just-identified models all choices of WN lead to the same estima-
tor. For overidentified models there are two common choices of WN , given in the
following.
Table 6.2 summarizes these estimators and gives the appropriate specialization of
the estimated variance matrix formula given in (6.39), assuming independent het-
eroskedastic errors.
Instrumental Variables Estimator
In the just-identified case r = K and $X'Z$ is a square matrix that is invertible. Then $[X'ZW_NZ'X]^{-1} = (Z'X)^{-1}W_N^{-1}(X'Z)^{-1}$ and (6.37) simplifies to the instrumental variables estimator
$$ \hat\beta_{IV} = (Z'X)^{-1}Z'y, \qquad (6.42) $$
introduced in Section 4.8.6. For just-identified models the GMM estimator for any choice of $W_N$ equals the IV estimator.
The simple IV estimator can also be used in overidentified models, by discarding
some of the instruments so that the model is just-identified, but this results in an effi-
ciency loss compared to using all the instruments.
Optimal-Weighted GMM
From Section 6.3.5, for overidentified models the most efficient GMM estimator, meaning GMM with optimal choice of weighting matrix, sets $W_N = \hat S^{-1}$ in (6.37). The optimal GMM estimator or two-step GMM estimator in the linear IV model is
$$ \hat\beta_{OGMM} = \left[(X'Z)\hat S^{-1}(Z'X)\right]^{-1}(X'Z)\hat S^{-1}(Z'y). \qquad (6.43) $$
For heteroskedastic errors, $\hat S$ is computed using (6.40) based on a consistent first-step estimate $\hat\beta$ such as the 2SLS estimator defined in (6.44). White (1982) called this estimator a two-stage IV estimator, since both steps entail IV estimation.
The estimated asymptotic variance matrix for optimal GMM given in Table 6.2 is of relatively simple form as (6.39) simplifies when $W_N = \hat S^{-1}$. In computing the estimated variance one can use $\hat S$ as presented in Table 6.2, but it is more common to instead use another estimate of S, say, that is also computed using (6.40) but evaluates the residual at the optimal GMM estimator rather than at the first-step estimate used to form $\hat S$ in (6.43).
Two-Stage Least Squares
If errors are homoskedastic rather than heteroskedastic, $\hat S^{-1} = [s^2N^{-1}Z'Z]^{-1}$ from (6.41). Then $W_N = (N^{-1}Z'Z)^{-1}$ in (6.37), leading to the two-stage least-squares estimator, introduced in Section 4.8.7, that can be expressed compactly as
$$ \hat\beta_{2SLS} = \left[X'P_ZX\right]^{-1}X'P_Zy, \qquad (6.44) $$
where $P_Z = Z(Z'Z)^{-1}Z'$. The basis of the term two-stage least squares is presented in the next section. The 2SLS estimator is also called the generalized instrumental variables (GIV) estimator as it generalizes the IV estimator to the overidentified case of more instruments than regressors. It is also called the one-step GMM estimator because (6.44) can be calculated in one step, whereas optimal GMM requires two steps.
The 2SLS estimator is asymptotically normally distributed with estimated asymptotic variance given in Table 6.2. The general form should be used if one wishes to guard against heteroskedastic errors, whereas the simpler form, presented in many introductory textbooks, is consistent only if errors are indeed homoskedastic.
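The two-step logic is equally short in code. The sketch below (simulated data again, not the authors' code) computes 2SLS as the first step, forms $\hat S$ from the 2SLS residuals as in (6.40), and reweights with $\hat S^{-1}$ to obtain the optimal GMM estimator and the simpler variance $N[X'Z\hat S^{-1}Z'X]^{-1}$ of Table 6.2.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
z = rng.normal(size=(N, 2))                       # two instruments
v = rng.normal(size=N)
x = z @ np.array([1.0, 0.5]) + v                  # endogenous regressor
u = (0.8 * v + rng.normal(size=N)) * (1 + 0.5 * np.abs(z[:, 0]))  # heteroskedastic error
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z])
y = X @ np.array([1.0, 2.0]) + u

# Step 1: 2SLS, i.e., GMM with W = (Z'Z/N)^{-1}.
W1 = np.linalg.inv(Z.T @ Z / N)
A1 = np.linalg.inv(X.T @ Z @ W1 @ Z.T @ X)
b_2sls = A1 @ X.T @ Z @ W1 @ Z.T @ y

# Step 2: form S-hat from first-step residuals, then optimal GMM with W = S-hat^{-1}.
u1 = y - X @ b_2sls
S = (Z * u1[:, None] ** 2).T @ Z / N              # (6.40) evaluated at the 2SLS residuals
W2 = np.linalg.inv(S)
A2 = np.linalg.inv(X.T @ Z @ W2 @ Z.T @ X)
b_ogmm = A2 @ X.T @ Z @ W2 @ Z.T @ y
V_ogmm = N * A2                                   # optimal-GMM variance from Table 6.2
print(b_2sls, b_ogmm, np.sqrt(np.diag(V_ogmm)))
```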
Optimal GMM versus 2SLS
Both the optimal GMM and the 2SLS estimator lead to efficiency gains in overiden-
tified models. Optimal GMM has the advantage of being more efficient than 2SLS,
if errors are heteroskedastic, though the efficiency gain need not be great. Some of
the GMM testing procedures given in Section 7.5 and Chapter 8 assume estimation
using the optimal weighting matrix. Optimal GMM has the disadvantage of requiring
additional computation compared to 2SLS. Moreover, as discussed in Section 6.3.5,
asymptotic theory may provide a poor small-sample approximation to the distribution
of the optimal GMM estimator.
In cross-section applications it is common to use the less efficient 2SLS, though
with inference based on heteroskedastic robust standard errors.
Even More Efficient GMM Estimation
The estimator $\hat\beta_{OGMM}$ is the most efficient estimator based on the unconditional moment condition $E[z_iu_i] = 0$, where $u_i = y_i - x_i'\beta$. However, this is not the best moment condition to use if the starting point is the conditional moment condition $E[u_i|z_i] = 0$ and errors are heteroskedastic, meaning $V[u_i|z_i]$ varies with $z_i$.
Applying the general results of Section 6.3.7, we can write the optimal moment condition for GMM estimation based on $E[u_i|z_i] = 0$ as
$$ E\left[\,E[x_i|z_i]\,u_i/V[u_i|z_i]\,\right] = 0. \qquad (6.45) $$
As with the LS regression example in Section 6.3.7, one should divide by the error variance V[u|z]. Implementation is more difficult than in the LS case, however, as a model for E[x|z] needs to be specified in addition to one for V[u|z]. This may be possible with additional structure. In particular, for a linear simultaneous equations system $E[x_i|z_i]$ is linear in z, so that estimation is based on $E[x_iu_i/V[u_i|z_i]] = 0$.
For linear models the GMM estimator is usually based on the simpler condition $E[z_iu_i] = 0$. Given this condition, the optimal GMM estimator defined in (6.43) is the most efficient GMM estimator.
6.4.3. Alternative Derivations of Two-Stage Least Squares
The 2SLS estimator, the standard IV estimator for overidentified models, was derived
in Section 6.4.2 as a GMM estimator.
Here we present three other derivations of the 2SLS estimator. One of these deriva-
tions, due to Theil, provided the original motivation for 2SLS, which predates GMM.
Theil’s interpretation is emphasized in introductory treatments. However, it does not
generalize to nonlinear models, whereas the GMM interpretation does.
We consider the linear model
$$ y = X\beta + u, \qquad (6.46) $$
with $E[u|Z] = 0$ and additionally $V[u|Z] = \sigma^2\mathbf{I}$.
GLS in a Transformed Model
Premultiplication of (6.46) by the instruments $Z'$ yields the transformed model
$$ Z'y = Z'X\beta + Z'u. \qquad (6.47) $$
This transformed model is often used as motivation for the IV estimator when r = K: ignoring $Z'u$, since $N^{-1}Z'u \to 0$, and solving yields $\hat\beta = (Z'X)^{-1}Z'y$.
Here instead we consider the overidentified case. Conditional on Z the error $Z'u$ has mean zero and variance $\sigma^2Z'Z$ given the assumptions after (6.46). The efficient GLS estimator of β based on the transformed model (6.47) is then
$$ \hat\beta = \left[X'Z(\sigma^2Z'Z)^{-1}Z'X\right]^{-1}X'Z(\sigma^2Z'Z)^{-1}Z'y, \qquad (6.48) $$
which equals the 2SLS estimator in (6.44) since the multipliers $\sigma^2$ cancel out. More generally, note that if the transformed model (6.47) is instead estimated by WLS with weighting matrix $W_N$ then the more general estimator (6.37) is obtained.
Theil’s Interpretation
Theil (1953) proposed estimation by OLS regression of the original model (6.46), except that the regressors X are replaced by a prediction $\hat X$ that is asymptotically uncorrelated with the error term.
Suppose that in the reduced form model the regressors X are a linear combination of the instruments plus some error, so that
$$ X = Z\Pi + v, \qquad (6.49) $$
where Π is an r × K matrix. Multivariate OLS regression of X on Z yields the estimator $\hat\Pi = (Z'Z)^{-1}Z'X$ and OLS predictions $\hat X = Z\hat\Pi$, or
$$ \hat X = P_ZX, $$
where $P_Z = Z(Z'Z)^{-1}Z'$. OLS regression of y on $\hat X$ rather than of y on X yields the estimator
$$ \hat\beta_{Theil} = (\hat X'\hat X)^{-1}\hat X'y. \qquad (6.50) $$
Theil's interpretation permits computation by two OLS regressions, with the first-stage OLS giving $\hat X$ and the second-stage OLS giving $\hat\beta$, leading to the term two-stage least-squares estimator.
To establish consistency of this estimator reexpress the linear model (6.46) as
$$ y = \hat X\beta + (X - \hat X)\beta + u. $$
The second-stage OLS regression of y on $\hat X$ yields a consistent estimator of β if the regressor $\hat X$ is asymptotically uncorrelated with the composite error term $(X - \hat X)\beta + u$. If $\hat X$ were any proxy variable there is no reason for this to hold; however, here $\hat X$ is uncorrelated with $(X - \hat X)$ as an OLS prediction is orthogonal to the OLS residual. Thus plim $N^{-1}\hat X'(X - \hat X)\beta = 0$. Also,
$$ N^{-1}\hat X'u = N^{-1}X'P_Zu = N^{-1}X'Z\,(N^{-1}Z'Z)^{-1}\,N^{-1}Z'u. $$
Then $\hat X$ is asymptotically uncorrelated with u provided Z is a valid instrument so that plim $N^{-1}Z'u = 0$. This consistency result for $\hat\beta_{Theil}$ depends heavily on the linearity of the model and does not generalize to nonlinear models.
Theil's estimator in (6.50) equals the 2SLS estimator defined earlier in (6.44). We have
$$ \hat\beta_{Theil} = (\hat X'\hat X)^{-1}\hat X'y = (X'P_Z'P_ZX)^{-1}X'P_Zy = (X'P_ZX)^{-1}X'P_Zy, $$
the 2SLS estimator, using $P_Z'P_Z = P_Z$ in the final equality.
Care is needed in implementing 2SLS using Theil's method. The second-stage OLS will give the wrong standard errors, even if errors are homoskedastic, as it will estimate $\sigma^2$ using the second-stage OLS regression residuals $(y - \hat X\hat\beta)$ rather than the actual residuals $(y - X\hat\beta)$. In practice one may also make adjustment for heteroskedastic errors. It is much easier to use a program that offers 2SLS as an option and directly computes (6.44) and the associated variance matrix given in Table 6.2.
The 2SLS interpretation does not always carry over to nonlinear models, as detailed in Section 6.5.4. The GMM interpretation does, and for this reason it is emphasized here more than Theil's original derivation of linear 2SLS.
Theil actually considered a model where only some of the regressors X are endoge-
nous and the remaining are exogenous. The preceding analysis still applies, provided
all the exogenous components of X are included in the instruments Z. Then the first-
stage OLS regression of the exogenous regressors on the instruments fits perfectly and
the predictions of the exogenous regressors equal their actual values. So in practice at
the first-stage just the endogenous variables are regressed on the instruments, and the
second-stage regression is of y on the exogenous regressors and the first-stage predic-
tions of the endogenous regressors.
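Theil's interpretation is easy to verify numerically. The sketch below (simulated data, not from the text) computes 2SLS directly from (6.44) and via the two OLS regressions, confirms that the two coincide, and shows that the second-stage residuals $(y - \hat X\hat\beta)$ differ from the residuals $(y - X\hat\beta)$ that valid standard errors require.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 400
z = rng.normal(size=(N, 3))                      # three instruments
v = rng.normal(size=N)
u = 0.7 * v + rng.normal(size=N)
x_endog = z @ np.array([1.0, 0.5, -0.5]) + v
X = np.column_stack([np.ones(N), x_endog])       # constant plus endogenous regressor
Z = np.column_stack([np.ones(N), z])             # constant is its own instrument
y = X @ np.array([0.5, 1.5]) + u

# Direct 2SLS: (X'P_Z X)^{-1} X'P_Z y with P_Z = Z(Z'Z)^{-1}Z'.
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
b_2sls = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)

# Theil's two OLS regressions: X-hat = P_Z X, then OLS of y on X-hat.
Xhat = PZ @ X
b_theil = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)

print(np.allclose(b_2sls, b_theil))              # True: the two estimators coincide
# Residuals a naive second stage would use versus the correct residuals:
print(np.std(y - Xhat @ b_theil), np.std(y - X @ b_theil))
```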
Basmann’s Interpretation
Basmann (1957) proposed using as instruments the OLS reduced form predictions $\hat X = P_ZX$ for the simple IV estimator in the just-identified case, since there are then exactly as many instruments $\hat X$ as regressors X. This yields
$$ \hat\beta_{Basmann} = (\hat X'X)^{-1}\hat X'y. \qquad (6.51) $$
This is consistent since plim $N^{-1}\hat X'u = 0$, as already shown for Theil's estimator. The estimator (6.51) actually equals the 2SLS estimator defined in (6.44), since $\hat X' = X'P_Z$.
This IV approach will lead to correct standard errors and can be extended to nonlinear settings.
6.4.4. Alternatives to Standard IV Estimators
The IV-based optimal GMM and 2SLS estimators presented in Section 6.4.2 are the
standard estimators used when regressors are endogenous. Chernozhukov and Hansen
(2005) present an IV estimator for quantile regression.
Here we briefly discuss leading alternative estimators that have received renewed
interest given the poor finite-sample properties of 2SLS with weak instruments detailed
in Section 4.9. We focus on single-equation linear models. At this stage there is no
method that is relatively efficient yet has small bias in small samples.
Limited-Information Maximum Likelihood
The limited-information maximum likelihood (LIML) estimator is obtained by
joint ML estimation of the single equation (6.46) plus the reduced form for the en-
dogenous regressors in the right-hand side of (6.46) assuming homoskedastic normal
errors. For details see Greene (2003, p. 402) or Davidson and MacKinnon (1993,
pp. 644–651). More generally the k class of estimators (see, for example, Greene,
2003, p. 403) includes LIML, 2SLS, and OLS.
The LIML estimator due to Anderson and Rubin (1949) predates the 2SLS esti-
mator. Unlike 2SLS, the LIML estimator is invariant to the normalization used in a
simultaneous equations system. Moreover, LIML and 2SLS are asymptotically equiv-
alent given homoskedastic errors. Yet LIML is rarely used as it is more difficult to
implement and harder to explain than 2SLS. Bekker (1994) presents small-sample re-
sults for LIML and a generalization of LIML. See also Hahn and Hausman (2002).
Split-Sample IV
Begin with Basmann's interpretation of 2SLS as an IV estimator given in (6.51). Substituting for y from (6.46) yields
$$ \hat\beta = \beta + (\hat X'X)^{-1}\hat X'u. $$
By assumption plim $N^{-1}Z'u = 0$, so plim $N^{-1}\hat X'u = 0$ and $\hat\beta$ is consistent. However, correlation between X and u, the reason for IV estimation, means that $\hat X = P_ZX$ is correlated with u. Thus $E[\hat X'u] \neq 0$, which leads to bias in the IV estimator. This bias arises from using $\hat X = Z\hat\Pi$ rather than $Z\Pi$ as the instrument.
An alternative is to instead use as instrument predictions $\tilde X$, which have the property that $E[\tilde X'u] = 0$ in addition to plim $N^{-1}\tilde X'u = 0$, and use the estimator
$$ \hat\beta = (\tilde X'X)^{-1}\tilde X'y. $$
Since $E[\tilde X'u] = 0$ does not imply $E[(\tilde X'X)^{-1}\tilde X'u] = 0$, this estimator will still be biased, but the bias may be reduced.
Angrist and Krueger (1995) proposed obtaining such instruments by splitting the sample into two subsamples $(y_1, X_1, Z_1)$ and $(y_2, X_2, Z_2)$. The first sample is used to obtain the estimate $\hat\Pi_1$ from regression of $X_1$ on $Z_1$. The second sample is used to obtain the IV estimator where the instrument $\tilde X_2 = Z_2\hat\Pi_1$ uses $\hat\Pi_1$ obtained from the separate first sample. Angrist and Krueger (1995) define the unbiased split-sample IV estimator as
$$ \hat\beta_{USSIV} = (\tilde X_2'X_2)^{-1}\tilde X_2'y_2. $$
The split-sample IV estimator $\hat\beta_{SSIV} = (\tilde X_2'\tilde X_2)^{-1}\tilde X_2'y_2$ is a variant based on Theil's interpretation of 2SLS. These estimators have finite-sample bias toward zero, unlike 2SLS, which is biased toward OLS. However, considerable efficiency loss occurs because only half the sample is used at the final stage.
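The following sketch (simulated data and a half-half split chosen only for illustration, not from the text) implements the unbiased split-sample IV estimator: Π is estimated on the first half of the sample and the IV step on the second half uses the instrument $\tilde X_2 = Z_2\hat\Pi_1$.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
z = rng.normal(size=(N, 4))                       # several (possibly weak) instruments
v = rng.normal(size=N)
u = 0.9 * v + rng.normal(size=N)
x = z @ np.array([0.3, 0.2, 0.1, 0.1]) + v
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z])
y = X @ np.array([1.0, 1.0]) + u

half = N // 2
X1, Z1 = X[:half], Z[:half]                       # first subsample: estimate Pi
X2, Z2, y2 = X[half:], Z[half:], y[half:]         # second subsample: IV step

Pi1 = np.linalg.solve(Z1.T @ Z1, Z1.T @ X1)       # reduced-form coefficients from sample 1
Xtilde2 = Z2 @ Pi1                                # instruments for sample 2
b_ussiv = np.linalg.solve(Xtilde2.T @ X2, Xtilde2.T @ y2)  # (Xtilde2'X2)^{-1} Xtilde2'y2

# Full-sample 2SLS for comparison.
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
b_2sls = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)
print(b_ussiv, b_2sls)
```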
Jackknife IV
A more efficient variant of this estimator implements a similar procedure but generates instruments observation by observation.
Let the subscript (−i) denote the leave-one-out operation that drops the ith observation. Then for the ith observation we obtain the estimate $\hat\Pi_i$ from regression of $X_{(-i)}$ on $Z_{(-i)}$ and use as instrument $\tilde x_i' = z_i'\hat\Pi_i$. Repeating this N times gives an instrument matrix denoted $\tilde X_{(-i)}$ with ith row $\tilde x_i'$. This leads to the jackknife IV estimator
$$ \hat\beta_{JIV} = (\tilde X_{(-i)}'X)^{-1}\tilde X_{(-i)}'y. $$
This estimator was originally proposed by Phillips and Hale (1977). Angrist, Imbens, and Krueger (1999) and Blomquist and Dahlberg (1999) called it a jackknife estimator since the jackknife (see Section 11.5.5) is a leave-one-out method for bias reduction. The computational burden of obtaining the N jackknife predicted values $\tilde x_i$ is modest by use of the recursive formula given in Section 11.5.5. The Monte Carlo evidence given in the two more recent papers is mixed, however, indicating a potential for bias reduction but also an increase in the variance. So the jackknife version may not be better than the conventional version in terms of mean-square error. The earlier paper by Phillips and Hale (1977) presents analytical results showing that the finite-sample bias of the JIV estimator is smaller than that of 2SLS only for appreciably overidentified models with r > 2(K + 1). See also Hahn, Hausman, and Kuersteiner (2001).
Independently Weighted 2SLS
A related method to split-sample IV is the independently weighted GMM estimator of Altonji and Segal (1996) given in Section 6.3.5. Splitting the sample into G groups and specializing to linear IV yields the independently weighted IV estimator
$$ \hat\beta_{IWIV} = \frac{1}{G}\sum_{g=1}^{G}\left[X_g'Z_g\hat S_{(-g)}^{-1}Z_g'X_g\right]^{-1}X_g'Z_g\hat S_{(-g)}^{-1}Z_g'y_g, $$
where $\hat S_{(-g)}$ is computed using $\hat S$ defined in (6.40) except that observations from the gth group are excluded. In a panel application Ziliak (1997) found that the independently weighted IV estimator performed much better than the unbiased split-sample IV estimator.
6.5. Nonlinear Instrumental Variables
Nonlinear IV methods, notably nonlinear 2SLS proposed by Amemiya (1974), permit consistent estimation of nonlinear regression models in situations where the NLS estimator is inconsistent because regressors are correlated with the error term. We
present these methods as a straightforward extension of the GMM approach for linear
models.
Unlike the linear case the estimators have no explicit formula, but the asymptotic
distribution can be obtained as a special case of the Section 6.3 results. This section
presents single-equation results, with systems results given in Section 6.10.4. A fun-
damentally important result is that a natural extension of Theil’s 2SLS method for
linear models to nonlinear models can lead to inconsistent parameter estimates (see
Section 6.5.4). Instead, the GMM approach should be used.
An alternative nonlinearity can arise when the model for the dependent variable is
a linear model, but the reduced form for the endogenous regressor(s) is a nonlinear
model owing to special features of the dependent variable. For example, the endoge-
nous regressor may be a count or a binary outcome. In that case the linear methods
of the previous section still apply. One approach is to ignore the special nature of the
endogenous regressor and just do regular linear 2SLS or optimal GMM. Alternatively,
obtain fitted values for the endogenous regressor by appropriate nonlinear regression,
such as Poisson regression on all the instruments if the endogenous regressor is a count,
and then do regular linear IV using this fitted value as the instrument for the count, fol-
lowing Basmann’s approach. Both estimators are consistent, though they have different
asymptotic distributions. The first simpler approach is the usual procedure.
6.5.1. Nonlinear GMM with Instruments
Consider the quite general nonlinear regression model where the error term may be additive or nonadditive (see Section 6.2.2). Thus
$$ u_i = r(y_i, x_i, \beta), \qquad (6.52) $$
where the nonlinear model with additive error is the special case
$$ u_i = y_i - g(x_i, \beta), \qquad (6.53) $$
where g(·) is a specified function. The estimators given in Section 6.2.2 are inconsistent if $E[u_i|x_i] \neq 0$.
Assume the existence of r instruments z, where r ≥ K, that satisfy
$$ E[u_i|z_i] = 0. \qquad (6.54) $$
This is the same conditional moment condition as in the linear case, except that $u_i = r(y_i, x_i, \beta)$ rather than $u_i = y_i - x_i'\beta$.
Nonlinear GMM Estimator
By the law of iterated expectations, (6.54) leads to
$$ E[z_iu_i] = 0. \qquad (6.55) $$
The GMM estimator minimizes the quadratic form in the corresponding sample moment condition.
In matrix notation let u denote the N × 1 error vector with ith entry $u_i$ given in (6.52) and let Z be the N × r matrix of instruments with ith row $z_i'$. Then $\sum_i z_iu_i = Z'u$ and the GMM estimator in the nonlinear IV model $\hat\beta_{GMM}$ minimizes
$$ Q_N(\beta) = \left[\frac{1}{N}u'Z\right]W_N\left[\frac{1}{N}Z'u\right], \qquad (6.56) $$
where $W_N$ is an r × r weighting matrix. Unlike linear GMM, the first-order conditions do not lead to a closed-form solution for $\hat\beta_{GMM}$.
Distribution of Nonlinear GMM Estimator
The GMM estimator is consistent for β given (6.54) and asymptotically normally distributed with estimated asymptotic variance
$$ \hat V[\hat\beta_{GMM}] = N\left[\hat D'ZW_NZ'\hat D\right]^{-1}\left[\hat D'ZW_N\hat SW_NZ'\hat D\right]\left[\hat D'ZW_NZ'\hat D\right]^{-1}, \qquad (6.57) $$
using the results from Section 6.3.3 with h(·) = zu, where $\hat S$ is given in the following and $\hat D$ is an N × K matrix of derivatives of the error term
$$ \hat D = \left.\frac{\partial u}{\partial\beta'}\right|_{\hat\beta_{GMM}}. \qquad (6.58) $$
With nonadditive errors, $\hat D$ has ith row $\partial r(y_i, x_i, \beta)/\partial\beta'\big|_{\hat\beta}$. With additive errors, $\hat D$ has ith row $\partial g(x_i, \beta)/\partial\beta'\big|_{\hat\beta}$, ignoring the minus sign that cancels out in (6.57).
For independent heteroskedastic errors,
$$ \hat S = N^{-1}\sum_i\hat u_i^2\,z_iz_i', \qquad (6.59) $$
similar to the linear case except that now $\hat u_i = r(y_i, x_i, \hat\beta)$ or $\hat u_i = y_i - g(x_i, \hat\beta)$.
The asymptotic variance of the GMM estimator in the nonlinear model is therefore the same as that in the linear case given in (6.39), with the change that the regressor matrix X is replaced by the derivative $\partial u/\partial\beta'\big|_{\hat\beta}$. This is exactly the same change as observed in Section 5.8 in going from linear to nonlinear least squares. By analogy with linear IV, the rank condition for identification is that plim $N^{-1}Z'\,\partial u/\partial\beta'\big|_{\beta_0}$ is of rank K and the weaker order condition is that r ≥ K.
6.5.2. Different Nonlinear GMM Estimators
Two leading specializations of the GMM estimator, which differ in the choice of weighting matrix, are optimal GMM, which sets $W_N = \hat S^{-1}$, and nonlinear two-stage least squares (NL2SLS), which sets $W_N = (Z'Z)^{-1}$. Table 6.3 summarizes these estimators and their associated variance matrices, assuming independent heteroskedastic errors, and gives results for general $W_N$ and results for nonlinear IV in the just-identified model.
Table 6.3. GMM Estimators in Nonlinear IV Model and Their Asymptotic Variance$^a$

GMM (general $W_N$):
  $Q_{GMM}(\beta) = u'ZW_NZ'u$
  $\hat V[\hat\beta] = N[\hat D'ZW_NZ'\hat D]^{-1}[\hat D'ZW_N\hat SW_NZ'\hat D][\hat D'ZW_NZ'\hat D]^{-1}$

Optimal GMM ($W_N = \hat S^{-1}$):
  $Q_{OGMM}(\beta) = u'Z\hat S^{-1}Z'u$
  $\hat V[\hat\beta] = N[\hat D'Z\hat S^{-1}Z'\hat D]^{-1}$

NL2SLS ($W_N = [N^{-1}Z'Z]^{-1}$):
  $Q_{NL2SLS}(\beta) = u'Z(Z'Z)^{-1}Z'u$
  $\hat V[\hat\beta] = N[\hat D'Z(Z'Z)^{-1}Z'\hat D]^{-1}[\hat D'Z(Z'Z)^{-1}\hat S(Z'Z)^{-1}Z'\hat D][\hat D'Z(Z'Z)^{-1}Z'\hat D]^{-1}$
  $\hat V[\hat\beta] = s^2[\hat D'Z(Z'Z)^{-1}Z'\hat D]^{-1}$ if homoskedastic errors

NLIV (just-identified):
  $\hat\beta_{NLIV}$ solves $Z'u = 0$
  $\hat V[\hat\beta] = N(Z'\hat D)^{-1}\hat S(\hat D'Z)^{-1}$

$^a$ Equations are for a nonlinear regression model with error u defined in (6.53) or (6.52) and instruments Z. $\hat D$ is the derivative of the error vector with respect to $\beta'$ evaluated at $\hat\beta$ and simplifies for models with additive error to the derivative of the conditional mean function with respect to $\beta'$ evaluated at $\hat\beta$. $\hat S$ is defined in (6.59). All variance matrix estimates assume errors that are independent across observations and heteroskedastic, aside from the simplification for homoskedastic errors given for the NL2SLS estimator.
Nonlinear Instrumental Variables
In the just-identified case one can directly use the sample moment conditions corresponding to (6.55). This yields the method of moments estimator in the nonlinear IV model, $\hat\beta_{NLIV}$, that solves
$$ \frac{1}{N}\sum_{i=1}^{N}z_iu_i = 0, \qquad (6.60) $$
or equivalently $Z'u = 0$, with asymptotic variance matrix given in Table 6.3.
Nonlinear estimators are often computed using iterative methods that obtain an optimum of an objective function rather than solve nonlinear systems of estimating equations. For the just-identified case $\hat\beta_{NLIV}$ can be computed as a GMM estimator minimizing (6.56) with any choice of weighting matrix, most simply $W_N = \mathbf{I}$, leading to the same estimate.
Optimal Nonlinear GMM
For overidentified models the optimal GMM estimator uses weighting matrix $W_N = \hat S^{-1}$. The optimal GMM estimator in the nonlinear IV model, $\hat\beta_{OGMM}$, therefore minimizes
$$ Q_N(\beta) = \left[\frac{1}{N}u'Z\right]\hat S^{-1}\left[\frac{1}{N}Z'u\right]. \qquad (6.61) $$
The estimated asymptotic variance matrix given in Table 6.3 is of relatively simple form as (6.57) simplifies when $W_N = \hat S^{-1}$.
As in the linear case the optimal GMM estimator is a two-step estimator when errors are heteroskedastic. In computing the estimated variance one can use $\hat S$ as presented in Table 6.3, but it is more common to instead use another estimate of S, say, that is also computed using (6.59) but evaluates the residual at the optimal GMM estimator rather than at the first-step estimate used to form $\hat S$ in (6.61).
Nonlinear 2SLS
A special case of the GMM estimator with instruments sets $W_N = (N^{-1}Z'Z)^{-1}$ in (6.56). This gives the nonlinear two-stage least-squares estimator $\hat\beta_{NL2SLS}$ that minimizes
$$ Q_N(\beta) = \frac{1}{N}u'Z(Z'Z)^{-1}Z'u. \qquad (6.62) $$
This estimator has the attraction of being the optimal GMM estimator if errors are homoskedastic, as then $\hat S = s^2Z'Z/N$, where $s^2$ is a consistent estimate of the constant V[u|z], so $\hat S^{-1}$ is a multiple of $(Z'Z)^{-1}$.
With homoskedastic errors this estimator has the simpler estimated asymptotic variance given in Table 6.3, a result often given in textbooks. However, in microeconometrics applications it is common to permit heteroskedastic errors and use the more complicated robust estimate also given in Table 6.3.
The NL2SLS estimator, proposed by Amemiya (1974), was an important precursor to GMM. The estimator can be motivated along similar lines to the first motivation for linear 2SLS given in Section 6.4.3. Thus premultiply the model error u by the instruments $Z'$ to obtain $Z'u$, where $E[Z'u] = 0$ since $E[u|Z] = 0$. Then do nonlinear GLS regression. Assuming homoskedastic errors this minimizes
$$ Q_N(\beta) = u'Z\left[\sigma^2Z'Z\right]^{-1}Z'u, $$
as $V[u|Z] = \sigma^2\mathbf{I}$ implies $V[Z'u|Z] = \sigma^2Z'Z$. This objective function is just a scalar multiple of (6.62).
The Theil two-stage interpretation of linear 2SLS does not always carry over to nonlinear models (see Section 6.5.4). Moreover, NL2SLS is clearly a one-step estimator. Amemiya chose the name NL2SLS because, as in the linear case, it permits consistent estimation using instrumental variables. The name should not be taken literally, and clearer terms are nonlinear IV or nonlinear generalized IV estimation.
Instrument Choice in Nonlinear Models
The preceding estimators presume the existence of instruments such that E[u|z] = 0, with estimation based on the unconditional moment condition E[zu] = 0. Consider the nonlinear model with additive error, so that u = y − g(x, β). To be relevant the instrument must be correlated with the regressors x; yet to be valid it cannot be a direct causal variable for y. From the variance matrix given in (6.57) it is actually correlation of z with ∂g/∂β rather than with x itself that matters, to ensure that $\hat D'Z$ is large. Weak instrument concerns are just as relevant here as in the linear case studied in Section 4.9.
Given likely heteroskedasticity the optimal moment condition on which to base estimation, given E[u|z] = 0, is not E[zu] = 0. From Section 6.3.7, however, the optimal moment condition requires additional moment assumptions that are difficult to make, so it is standard to use E[zu] = 0, as has been done here.
An alternative way to control for heteroskedasticity is to base GMM estimation on an error term defined to be close to homoskedastic. For example, with count data rather than use $u = y - \exp(x'\beta)$, work with the standardized error $u^* = u/\sqrt{\exp(x'\beta)}$ (see Section 6.2.2). Note, however, that $E[u^*|z] = 0$ and $E[u|z] = 0$ are different assumptions.
Often just one component of x is correlated with u. Then, as in the linear case, the
exogenous components can be used as instruments for themselves and the challenge is
to find an additional instrument that is uncorrelated with u. There are some nonlinear
applications that arise from formal economic models as in Section 6.2.7, in which case
the many subcomponents of the information set are available as instruments.
6.5.3. Poisson IV Example
The Poisson regression model with exogenous regressors specifies $E[y|x] = \exp(x'\beta)$. This can be viewed as a model with additive error $u = y - \exp(x'\beta)$. If regressors are endogenous then $E[u|x] \neq 0$ and the Poisson MLE will then be inconsistent. Consistent estimation assumes the existence of instruments z that satisfy $E[u|z] = 0$ or, equivalently,
$$ E\left[y - \exp(x'\beta)\,|\,z\right] = 0. $$
The preceding results can be directly applied. The objective function is
$$ Q_N(\beta) = \left[N^{-1}\sum_i z_iu_i\right]'W_N\left[N^{-1}\sum_i z_iu_i\right], $$
where $u_i = y_i - \exp(x_i'\beta)$. The first-order conditions are then
$$ \left[\sum_i\exp(x_i'\beta)\,x_iz_i'\right]W_N\left[\sum_i z_i(y_i - \exp(x_i'\beta))\right] = 0. $$
The asymptotic distribution is given in Table 6.3, with $\hat D'Z = \sum_i e^{x_i'\hat\beta}x_iz_i'$ since $\partial g/\partial\beta = \exp(x'\beta)x$, and $\hat S$ defined in (6.59) with $\hat u_i = y_i - \exp(x_i'\hat\beta)$. The optimal GMM and NL2SLS estimators differ in whether the weighting matrix is $\hat S^{-1}$ or $(N^{-1}Z'Z)^{-1}$, where $Z'Z = \sum_i z_iz_i'$.
An alternative consistent estimator follows the Basmann approach. First, estimate by OLS the reduced form $x_i = \Pi z_i + v_i$, giving K predictions $\hat x_i = \hat\Pi z_i$. Second, estimate by nonlinear IV as in (6.60) with instruments $\hat x_i$ rather than $z_i$. Given the OLS formula for $\hat\Pi$ this estimator solves
$$ \left[\sum_i x_iz_i'\right]\left[\sum_i z_iz_i'\right]^{-1}\left[\sum_i z_i(y_i - \exp(x_i'\beta))\right] = 0. $$
This estimator differs from the NL2SLS estimator because the first term on the left-hand side differs. Potential problems with instead generalizing Theil's method for linear models are detailed in the next section.
Similar issues arise in nonlinear models other than Poisson regression, such as models for binary data.
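A minimal numerical sketch of these moment conditions follows (not from the text; it uses simulated data with an additive error so that E[u|z] = 0 holds exactly, and scipy's general-purpose minimizer in place of a dedicated GMM routine). It computes the NL2SLS estimator by minimizing the quadratic form (6.56) with $W_N = (N^{-1}Z'Z)^{-1}$ and $u_i = y_i - \exp(x_i'\beta)$, and contrasts it with the inconsistent estimator that uses x itself in place of the instruments.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N = 2000
z = rng.normal(size=(N, 2))
v = rng.normal(size=N)
x_endog = 0.5 * z[:, 0] + 0.5 * z[:, 1] + v
X = np.column_stack([np.ones(N), x_endog])
Z = np.column_stack([np.ones(N), z])
beta0 = np.array([0.5, 0.6])
u0 = 0.8 * v + rng.normal(size=N)              # additive error, correlated with x through v
y = np.exp(X @ beta0) + u0                     # exponential conditional mean; E[u|z] = 0

W = np.linalg.inv(Z.T @ Z / N)                 # NL2SLS weighting matrix

def Q(beta):
    u = y - np.exp(X @ beta)                   # additive-error residual
    g = Z.T @ u / N                            # sample moments N^{-1} sum_i z_i u_i
    return g @ W @ g                           # GMM objective (6.56)/(6.62) up to scale

b_nl2sls = minimize(Q, x0=np.zeros(2), method="Nelder-Mead").x

def Q_naive(beta):                             # moments with z replaced by x: ignores endogeneity
    u = y - np.exp(X @ beta)
    g = X.T @ u / N
    return g @ g

b_naive = minimize(Q_naive, x0=np.zeros(2), method="Nelder-Mead").x
print(b_naive, b_nl2sls)                       # the naive estimator is inconsistent; NL2SLS should be close to beta0
```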
6.5.4. Two-Stage Estimation in Nonlinear Models
The usual interpretation of linear 2SLS can fail in nonlinear models. Thus suppose y has mean g(x, β) and there are instruments z for the regressors x. Then OLS regression of x on the instruments z to get fitted values $\hat x$, followed by NLS regression of y on $g(\hat x, \beta)$, can lead to inconsistent parameter estimates of β, as we now demonstrate. Instead, one needs to use the NL2SLS estimator presented in the previous section.
Consider the following simple model, based on one presented in Amemiya (1984), that is nonlinear in variables though still linear in parameters. Let
$$ y = \beta x^2 + u, \qquad (6.63) $$
$$ x = \pi z + v, $$
where the zero-mean errors u and v are correlated. The regressor $x^2$ is endogenous, since x is a function of v and by assumption u and v are correlated. As a result the OLS estimator of β is inconsistent. If z is generated independently of the other random variables in the model it is a valid instrument, as it is clearly then independent of u but correlated with x.
The IV estimator is $\hat\beta_{IV} = \left(\sum_i z_ix_i^2\right)^{-1}\sum_i z_iy_i$. This can be implemented by a regular IV regression of y on $x^2$ with instrument z. Some algebra shows that, as expected, $\hat\beta_{IV}$ equals the nonlinear IV estimator defined in (6.60).
Suppose instead we perform the following two-stage least-squares estimation. First, regress x on z to get $\hat x = \hat\pi z$ and then regress y on $\hat x^2$. Then $\hat\beta_{2SLS} = \left(\sum_i\hat x_i^2\hat x_i^2\right)^{-1}\sum_i\hat x_i^2y_i$, where $\hat x_i^2$ is the square of the prediction $\hat x_i$ obtained from OLS regression of x on z. This yields an inconsistent estimate. Adapting the proof for the linear case in Section 6.4.3 we have
$$ y_i = \beta x_i^2 + u_i = \beta\hat x_i^2 + w_i, $$
where $w_i = \beta(x_i^2 - \hat x_i^2) + u_i$. An OLS regression of $y_i$ on $\hat x_i^2$ is inconsistent for β because the regressor $\hat x_i^2$ is asymptotically correlated with the composite error term $w_i$. Formally, $(x_i^2 - \hat x_i^2) = (\pi z_i + v_i)^2 - (\hat\pi z_i)^2 = \pi^2z_i^2 + 2\pi z_iv_i + v_i^2 - \hat\pi^2z_i^2$ implies, using plim $\hat\pi = \pi$ and some algebra, that plim $N^{-1}\sum_i\hat x_i^2(x_i^2 - \hat x_i^2) = \operatorname{plim}N^{-1}\sum_i\pi^2z_i^2v_i^2 \neq 0$ even if $z_i$ and $v_i$ are independent. Hence plim $N^{-1}\sum_i\hat x_i^2w_i = \beta\,\operatorname{plim}N^{-1}\sum_i\hat x_i^2(x_i^2 - \hat x_i^2) \neq 0$.
A variation that is consistent, however, is to regress $x^2$ rather than x on z at the first stage and use the prediction $\widehat{x^2}$ (which differs from $(\hat x)^2$) at the second stage. It can be shown that this equals $\hat\beta_{IV}$. The instrument for $x^2$ needs to be the fitted value for $x^2$ rather than the square of the fitted value for x.
This example generalizes to other nonlinear models where the nonlinearity is in regressors only, so that
$$ y = g(x)'\beta + u, $$
where g(x) is a nonlinear function of x. Common examples are use of powers and natural logarithm. Suppose E[u|z] = 0. Inconsistent estimates are obtained by regressing x on z to get predictions $\hat x$, and then regressing y on $g(\hat x)$. Consistent estimates can be obtained by instead regressing g(x) on z to get predictions $\widehat{g(x)}$, and then regressing y on $\widehat{g(x)}$ at the second stage. We use $\widehat{g(x)}$ rather than $g(\hat x)$ as instrument for g(x). Even then the second-stage regression gives invalid standard errors, as OLS output will use residuals $\hat u = y - \widehat{g(x)}'\hat\beta$ rather than $\hat u = y - g(x)'\hat\beta$. It is best to directly use a GMM or NL2SLS command.
More generally, models may be nonlinear in both variables and parameters. Consider a single-index model with additive error, so that
$$ y = g(x'\beta) + u. $$
Inconsistent estimates may be obtained by OLS of x on z to get predictions $\hat x$, and then NLS regression of y on $g(\hat x'\beta)$. Either GMM or NL2SLS needs to be used. Essentially, for consistency we want $\widehat{g(x'\beta)}$, not $g(\hat x'\beta)$.
NL2SLS Example
We consider NL2SLS estimation in a model with a simple nonlinearity resulting from the square of an endogenous variable appearing as a regressor, as in the previous section.
The dgp is (6.63), so $y = \beta x^2 + u$ and $x = \pi z + v$, where β = 1 and π = 1, z = 1 for all observations, and (u, v) are joint normal with means 0, variances 1, and correlation 0.8. A sample of size 200 is drawn. Results are shown in Table 6.4.

Table 6.4. Nonlinear Two-Stage Least-Squares Example$^a$

Coefficient of $x^2$ (standard error) and $R^2$ by estimator:
  OLS:       1.189 (0.025), $R^2$ = 0.88
  NL2SLS:    0.960 (0.046), $R^2$ = 0.85
  Two-Stage: 1.642 (0.172), $R^2$ = 0.80

$^a$ The dgp given in the text has true coefficient equal to one. The sample size is N = 200.

The nonlinearity here is quite mild, with the square of x rather than x appearing as regressor. Interest lies in estimating its coefficient β. The OLS estimator is inconsistent, whereas NL2SLS is consistent. The two-stage method, where first an OLS regression of x on z is used to form $\hat x$ and then an OLS regression of y on $(\hat x)^2$ is performed, yields an estimate that is more than two standard errors from the true value of β = 1. The simulation also indicates a loss in goodness of fit and precision, with larger standard errors and lower $R^2$, similar to linear IV.
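The comparison in Table 6.4 is straightforward to reproduce in outline. The sketch below (not the authors' code; the reported numbers will differ because the random draws differ) simulates the dgp (6.63) and computes the three estimators of β: OLS of y on $x^2$, the IV/NL2SLS estimator with instrument z, and the inconsistent two-stage estimator that squares the first-stage fitted value.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200
beta, pi = 1.0, 1.0
z = np.ones(N)                                    # z = 1 for all observations
# (u, v) joint normal, means 0, variances 1, correlation 0.8
uv = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=N)
u, v = uv[:, 0], uv[:, 1]
x = pi * z + v
y = beta * x**2 + u

x2 = x**2
b_ols = (x2 @ y) / (x2 @ x2)                      # OLS of y on x^2 (inconsistent)
b_iv = (z @ y) / (z @ x2)                         # IV/NL2SLS with instrument z (consistent)
xhat = z * ((z @ x) / (z @ z))                    # first stage: regress x on z
b_twostage = (xhat**2 @ y) / (xhat**2 @ xhat**2)  # OLS of y on (xhat)^2 (inconsistent)
print(b_ols, b_iv, b_twostage)
```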
6.6. Sequential Two-Step m-Estimation
Sequential two-step estimation procedures are estimation procedures in which the estimate of a parameter of ultimate interest is based on initial estimation of an unknown parameter. An example is feasible GLS when the error has conditional variance $\exp(z'\gamma)$. Given an estimate $\hat\gamma$ of γ, the FGLS estimator $\hat\beta$ minimizes $\sum_{i=1}^{N}(y_i - x_i'\beta)^2/\exp(z_i'\hat\gamma)$. A second example is the Heckman two-step estimator given in Section 16.10.2.
These estimators are attractive as they can provide a relatively simple way to obtain consistent parameter estimates. However, for valid statistical inference it may be necessary to adjust the asymptotic variance of the second-step estimator to allow for the first-step estimation. We present results for the special case where the estimating equations for both the first- and second-step estimators set a sample average to zero, which is the case for m-estimators, method of moments estimators, and estimating equations estimators.
Partition the parameter vector θ into $\theta_1$ and $\theta_2$, with ultimate interest in $\theta_2$. The model is estimated sequentially by first obtaining $\hat\theta_1$ that solves $\sum_{i=1}^{N}h_{1i}(\hat\theta_1) = 0$ and then, given $\hat\theta_1$, obtaining $\hat\theta_2$ that solves $N^{-1}\sum_{i=1}^{N}h_{2i}(\hat\theta_1, \hat\theta_2) = 0$. In general the distribution of $\hat\theta_2$ given estimation of $\hat\theta_1$ differs from, and is more complicated than, the distribution of $\hat\theta_2$ if $\theta_1$ is known. Statistical inference is invalid if it fails to take into account this complication, except in some special cases given at the end of this section.
The following derivation is given in Newey (1984), with similar results obtained by Murphy and Topel (1985) and Pagan (1986). The two-step estimator can be rewritten as a one-step estimator where $(\hat\theta_1, \hat\theta_2)$ jointly solve the equations
$$ N^{-1}\sum_{i=1}^{N}h_1(w_i, \hat\theta_1) = 0, \qquad (6.64) $$
$$ N^{-1}\sum_{i=1}^{N}h_2(w_i, \hat\theta_1, \hat\theta_2) = 0. $$
Defining $\theta = (\theta_1'\ \theta_2')'$ and $h_i = (h_{1i}'\ h_{2i}')'$, we can write the equations as
$$ N^{-1}\sum_{i=1}^{N}h(w_i, \hat\theta) = 0. $$
In this setup it is assumed that dim($h_1$) = dim($\theta_1$) and dim($h_2$) = dim($\theta_2$), so that the number of estimating equations equals the number of parameters. Then (6.64) is an estimating equations estimator or MM estimator.
Consistency requires that plim $N^{-1}\sum_i h(w_i, \theta_0) = 0$, where $\theta_0 = [\theta_{10}',\ \theta_{20}']'$. This condition should be satisfied if $\hat\theta_1$ is consistent for $\theta_{10}$ in the first step, and if second-step estimation of $\hat\theta_2$ with $\theta_{10}$ known (rather than estimated by $\hat\theta_1$) would lead to a consistent estimate of $\theta_{20}$. Within a method of moments framework we require $E[h_{1i}(\theta_1)] = 0$ and $E[h_{2i}(\theta_1, \theta_2)] = 0$. We assume that consistency is established.
For the asymptotic distribution we apply the general result that
$$ \sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N\left[0,\ G_0^{-1}S_0(G_0^{-1})'\right], $$
where $G_0$ and $S_0$ are defined in Proposition 6.1. Partition $G_0$ and $S_0$ in a similar way to the partitioning of θ and $h_i$. Then
$$ G_0 = \lim\frac{1}{N}\sum_{i=1}^{N}E\begin{bmatrix} \partial h_{1i}/\partial\theta_1' & 0 \\ \partial h_{2i}/\partial\theta_1' & \partial h_{2i}/\partial\theta_2' \end{bmatrix} = \begin{bmatrix} G_{11} & 0 \\ G_{21} & G_{22} \end{bmatrix}, $$
using $\partial h_{1i}(\theta)/\partial\theta_2' = 0$ since $h_{1i}(\theta)$ is not a function of $\theta_2$ from (6.64). Since $G_0$, $G_{11}$, and $G_{22}$ are square matrices
$$ G_0^{-1} = \begin{bmatrix} G_{11}^{-1} & 0 \\ -G_{22}^{-1}G_{21}G_{11}^{-1} & G_{22}^{-1} \end{bmatrix}. $$
Clearly,
$$ S_0 = \lim\frac{1}{N}\sum_{i=1}^{N}E\begin{bmatrix} h_{1i}h_{1i}' & h_{1i}h_{2i}' \\ h_{2i}h_{1i}' & h_{2i}h_{2i}' \end{bmatrix} = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}. $$
The asymptotic variance of $\hat\theta_2$ is the (2, 2) submatrix of the variance matrix of $\hat\theta$. After some algebra, we get
$$ V[\hat\theta_2] = G_{22}^{-1}\Big( S_{22} + G_{21}\big[G_{11}^{-1}S_{11}(G_{11}^{-1})'\big]G_{21}' - G_{21}G_{11}^{-1}S_{12} - S_{21}(G_{11}^{-1})'G_{21}' \Big)(G_{22}^{-1})'. \qquad (6.65) $$
The usual computer output yields standard errors that are incorrect and understate the true standard errors, since $V[\hat\theta_2]$ is then assumed to be $G_{22}^{-1}S_{22}(G_{22}^{-1})'$, which can be shown to be smaller than the true variance given in (6.65).
There is no need to account for additional variability in the second step caused by estimation in the first step in the special case that $E[\partial h_{2i}(\theta)/\partial\theta_1] = 0$, as then $G_{21} = 0$ and $V[\hat\theta_2]$ in (6.65) reduces to $G_{22}^{-1}S_{22}(G_{22}^{-1})'$.
A well-known example of $G_{21} = 0$ is FGLS. Then for heteroskedastic errors
$$ h_{2i}(\theta) = \frac{x_i(y_i - x_i'\theta_2)}{\sigma(x_i, \theta_1)}, $$
where $V[y_i|x_i] = \sigma^2(x_i, \theta_1)$, and
$$ E\left[\partial h_{2i}(\theta)/\partial\theta_1\right] = E\left[-x_i\,\frac{(y_i - x_i'\theta_2)}{\sigma(x_i, \theta_1)^2}\,\frac{\partial\sigma(x_i, \theta_1)}{\partial\theta_1}\right], $$
which equals zero since $E[y_i|x_i] = x_i'\theta_2$. Furthermore, for FGLS consistency of $\hat\theta_2$ does not require that $\hat\theta_1$ be consistent, since $E[h_{2i}(\theta)] = 0$ just requires that $E[y_i|x_i] = x_i'\theta_2$, which does not depend on $\theta_1$.
A second example of $G_{21} = 0$ is ML estimation with a block-diagonal information matrix, so that $E[\partial^2L(\theta)/\partial\theta_1\partial\theta_2'] = 0$. This is the case, for example, for regression under normality, where $\theta_1$ are the variance parameters and $\theta_2$ are the regression parameters.
In other examples, however, $G_{21} \neq 0$ and the more cumbersome expression (6.65) needs to be used. This is done automatically by computer packages for some standard two-step estimators, most notably Heckman's two-step estimator of the sample selection model given in Section 16.5.4. Otherwise, $V[\hat\theta_2]$ needs to be computed manually. Many of the components come from earlier estimation. In particular, $G_{11}^{-1}S_{11}(G_{11}^{-1})'$ is the robust variance matrix of $\hat\theta_1$ and $G_{22}^{-1}S_{22}(G_{22}^{-1})'$ is the robust variance matrix estimate of $\hat\theta_2$ that incorrectly ignores the estimation error in $\hat\theta_1$. For data independent over i the subcomponents of the $S_0$ submatrix are consistently estimated by $\hat S_{jk} = N^{-1}\sum_i\hat h_{ji}\hat h_{ki}'$, $j, k = 1, 2$. This leaves computation of $\hat G_{21} = N^{-1}\sum_i\partial h_{2i}/\partial\theta_1'\big|_{\hat\theta}$ as the main challenge.
A recommended simpler approach is to obtain bootstrap standard errors (see Section 16.2.5), or directly jointly estimate $\theta_1$ and $\theta_2$ in the combined model (6.64), assuming access to a GMM routine.
These simpler approaches can also be applied to sequential estimators that are GMM estimators rather than m-estimators. Then combining the two estimators will lead to a set of conditions more complicated than (6.64) and we no longer get (6.65). However, one can still bootstrap or estimate jointly rather than sequentially.
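As an illustration of the bootstrap route, the sketch below (not from the text) applies a pairs bootstrap to a two-step FGLS estimator with conditional variance exp(z'γ); the first step here estimates γ by regressing log squared OLS residuals on z, which is one simple choice rather than the only one. Because each bootstrap replication repeats both steps, the resulting standard errors reflect any first-step estimation error automatically.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 400
x = np.column_stack([np.ones(N), rng.normal(size=N)])
z = x                                            # variance depends on the same variables here
gamma0, beta0 = np.array([-0.5, 0.8]), np.array([1.0, 2.0])
y = x @ beta0 + np.exp(0.5 * (z @ gamma0)) * rng.normal(size=N)

def two_step_fgls(y, x, z):
    """Step 1: estimate gamma from an OLS regression of log squared OLS residuals on z.
    Step 2: weighted least squares with weights 1/exp(z'gamma-hat)."""
    b_ols = np.linalg.solve(x.T @ x, x.T @ y)
    e = y - x @ b_ols
    gam = np.linalg.solve(z.T @ z, z.T @ np.log(e**2 + 1e-12))
    w = 1.0 / np.exp(z @ gam)
    xw = x * w[:, None]
    return np.linalg.solve(xw.T @ x, xw.T @ y)

b_hat = two_step_fgls(y, x, z)

B = 200                                          # bootstrap replications (pairs bootstrap)
draws = np.empty((B, x.shape[1]))
for b in range(B):
    idx = rng.integers(0, N, size=N)             # resample observations with replacement
    draws[b] = two_step_fgls(y[idx], x[idx], z[idx])
print(b_hat, draws.std(axis=0, ddof=1))          # estimate and bootstrap standard errors
```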
6.7. Minimum Distance Estimation
Minimum distance estimation provides a way to estimate structural parameters θ that are a specified function of reduced form parameters π, given a consistent estimate $\hat\pi$ of π.
A standard reference is Ferguson (1958). Rothenberg (1973) applied this method to linear simultaneous equations models, though the alternative methods given in Section 6.9.6 are the standard methods used. Minimum distance estimation is most often used in panel data analysis. In the initial work by Chamberlain (1982, 1984) (see Section 22.2.7) he lets $\hat\pi$ be OLS estimates from linear regression of the current-period dependent variable on regressors in all periods. Subsequent applications to covariance structures (see Section 22.5.4) let $\hat\pi$ be estimated variances and autocovariances of the panel data. See also the indirect inference method (Section 12.6).
Suppose that the relationship between the q structural parameters and the r ≥ q reduced form parameters is that $\pi_0 = g(\theta_0)$. Further suppose that we have a consistent estimate $\hat\pi$ of the reduced form parameters. An obvious estimator is $\hat\theta$ such that $\hat\pi = g(\hat\theta)$, but this is infeasible when q < r. Instead, the minimum distance (MD) estimator $\hat\theta_{MD}$ minimizes with respect to θ the objective function
$$ Q_N(\theta) = (\hat\pi - g(\theta))'\,W_N\,(\hat\pi - g(\theta)), \qquad (6.66) $$
where $W_N$ is an r × r weighting matrix.
If $\hat\pi \xrightarrow{p} \pi_0$ and $W_N \xrightarrow{p} W_0$, where $W_0$ is finite positive semidefinite, then $Q_N(\theta) \xrightarrow{p} Q_0(\theta) = (\pi_0 - g(\theta))'W_0(\pi_0 - g(\theta))$. It follows that $\theta_0$ is locally identified if Rank$[W_0 \times \partial g(\theta)/\partial\theta'] = q$, while consistency essentially requires that $\pi_0 = g(\theta_0)$.
For the MD estimator $\sqrt{N}(\hat\theta_{MD} - \theta_0) \xrightarrow{d} N[0, V[\hat\theta_{MD}]]$, where
$$ V[\hat\theta_{MD}] = (G_0'W_0G_0)^{-1}\,(G_0'W_0\,V[\hat\pi]\,W_0G_0)\,(G_0'W_0G_0)^{-1}, \qquad (6.67) $$
$G_0 = \partial g(\theta)/\partial\theta'\big|_{\theta_0}$, and it is assumed that the reduced form parameters $\hat\pi$ have limit distribution $\sqrt{N}(\hat\pi - \pi_0) \xrightarrow{d} N[0, V[\hat\pi]]$. More efficient reduced form estimators lead to more efficient MD estimators, since smaller $V[\hat\pi]$ leads to smaller $V[\hat\theta_{MD}]$ in (6.67).
To obtain the result (6.67), begin with the following rescaling of the first-order conditions for the MD estimator:
$$ G_N(\hat\theta)'\,W_N\,\sqrt{N}(\hat\pi - g(\hat\theta)) = 0, \qquad (6.68) $$
where $G_N(\theta) = \partial g(\theta)/\partial\theta'$. An exact first-order Taylor series expansion about $\theta_0$ yields
$$ \sqrt{N}(\hat\pi - g(\hat\theta)) = \sqrt{N}(\hat\pi - \pi_0) - G_N(\theta^+)\sqrt{N}(\hat\theta - \theta_0), \qquad (6.69) $$
where $\theta^+$ lies between $\hat\theta$ and $\theta_0$ and we have used $g(\theta_0) = \pi_0$. Substituting (6.69) back into (6.68) and solving for $\sqrt{N}(\hat\theta - \theta_0)$ yields
$$ \sqrt{N}(\hat\theta - \theta_0) = \left[G_N(\hat\theta)'W_NG_N(\theta^+)\right]^{-1}G_N(\hat\theta)'W_N\sqrt{N}(\hat\pi - \pi_0), \qquad (6.70) $$
which leads directly to (6.67).
For given reduced form estimator $\hat\pi$, the most efficient MD estimator uses weighting matrix $W_N = \hat V[\hat\pi]^{-1}$ in (6.66). This estimator is called the optimal MD (OMD) estimator, and sometimes the minimum chi-square estimator following Ferguson (1958).
A common alternative special case is the equally weighted minimum distance (EWMD) estimator, which sets $W_N = \mathbf{I}$. This is less efficient than the OMD estimator, but it does not have the finite-sample bias problems analogous to those discussed in Section 6.3.5 that arise when the optimal weighting matrix is used. The EWMD estimator can be simply obtained by NLS regression of $\hat\pi_j$ on $g_j(\theta)$, $j = 1, \ldots, r$, since minimizing $(\hat\pi - g(\theta))'(\hat\pi - g(\theta))$ yields the same first-order conditions as those in (6.68) with $W_N = \mathbf{I}$.
The maximized value of the objective function for the OMD is chi-squared distributed. Specifically,
$$ (\hat\pi - g(\hat\theta_{OMD}))'\,\hat V[\hat\pi]^{-1}\,(\hat\pi - g(\hat\theta_{OMD})) \qquad (6.71) $$
is asymptotically distributed as $\chi^2(r - q)$ under $H_0: g(\theta_0) = \pi_0$. This provides a model specification test analogous to the OIR test of Section 6.3.8.
The MD estimator is qualitatively similar to the GMM estimator. The GMM framework is the standard one employed. MD estimation is most often used in panel studies of covariance structures, since then $\hat\pi$ comprises easily estimated sample moments (variances and covariances) that can then be used to obtain $\hat\theta$.
6.8. Empirical Likelihood
The MM and GMM approaches do not require complete specification of the con-
ditional density. Instead, estimation is based on moment conditions of the form
E[h(y, x, θ)] = 0. The empirical likelihood approach, due to Owen (1988), is an alter-
native estimation procedure based on the same moment condition.
An attraction of the empirical likelihood estimator is that, although it is asymptoti-
cally equivalent to the GMM estimator, it has different finite-sample properties, and in
some examples it outperforms the GMM estimator.
6.8.1. Empirical Likelihood Estimation of Population Mean
We begin with empirical likelihood in the case of a scalar iid random variable y with density f(y) and sample likelihood function $\prod_i f(y_i)$. The complication considered here is that the density f(y) is not specified, so the usual ML approach is not possible.
A completely nonparametric approach seeks to estimate the density f(y) evaluated at each of the sample values of y. Let $\pi_i = f(y_i)$ denote the probability that the ith observation on y takes the realized value $y_i$. Then the goal is to maximize the so-called empirical likelihood function $\prod_i\pi_i$, or equivalently to maximize the empirical log-likelihood function $N^{-1}\sum_i\ln\pi_i$, which is a multinomial model with no structure placed on the $\pi_i$. This log-likelihood is unbounded, unless a constraint is placed on the range of values taken by the $\pi_i$. The normalization used is that $\sum_i\pi_i = 1$. This yields the standard estimate of the cumulative distribution function in the fully nonparametric case, as we now demonstrate.
The empirical likelihood estimator maximizes with respect to π and η the Lagrangian
$$ L_{EL}(\pi, \eta) = \frac{1}{N}\sum_{i=1}^{N}\ln\pi_i - \eta\left(\sum_{i=1}^{N}\pi_i - 1\right), \qquad (6.72) $$
where $\pi = [\pi_1 \ldots \pi_N]'$ and η is a Lagrange multiplier. Although the data $y_i$ do not explicitly appear in (6.72) they appear implicitly as $\pi_i = f(y_i)$. Setting the derivatives with respect to $\pi_i$ ($i = 1, \ldots, N$) and η to zero and solving yields $\hat\pi_i = 1/N$ and $\hat\eta = 1$. Thus the estimated density function $\hat f(y)$ has mass 1/N at each of the realized values $y_i$, $i = 1, \ldots, N$. The resulting distribution function is $\hat F(y) = N^{-1}\sum_{i=1}^{N}\mathbf{1}(y_i \leq y)$, where $\mathbf{1}(A) = 1$ if event A occurs and 0 otherwise. $\hat F(y)$ is just the usual empirical distribution function.
Now introduce parameters. As a simple example, suppose we introduce the moment restriction $E[y - \mu] = 0$, where μ is the unknown population mean. In the empirical likelihood context this population moment is replaced by a sample moment, where the sample moment weights sample values by the probabilities $\pi_i$. Thus we introduce the constraint that $\sum_i\pi_i(y_i - \mu) = 0$. The Lagrangian for the maximum empirical likelihood estimator is
$$ L_{EL}(\pi, \eta, \lambda, \mu) = \frac{1}{N}\sum_{i=1}^{N}\ln\pi_i - \eta\left(\sum_{i=1}^{N}\pi_i - 1\right) - \lambda\sum_{i=1}^{N}\pi_i(y_i - \mu), \qquad (6.73) $$
where η and λ are Lagrange multipliers.
Begin by differentiating the Lagrangian with respect to $\pi_i$ ($i = 1, \ldots, N$), η, and λ but not μ. Setting these derivatives to zero yields equations that are functions of μ. Solving leads to the solution $\pi_i = \pi_i(\mu)$ and hence an empirical likelihood $N^{-1}\sum_i\ln\pi_i(\mu)$ that is then maximized with respect to μ. This solution method leads to nonlinear equations that need to be solved numerically.
For this particular problem an easier way to solve for μ is to note that the maximized value of $L(\pi, \eta, \lambda, \mu)$ must be less than or equal to $N^{-1}\sum_i\ln N^{-1}$, since this is the maximized value without the last constraint. However, $L(\pi, \eta, \lambda, \mu)$ equals $N^{-1}\sum_i\ln N^{-1}$ if $\pi_i = 1/N$ and $\hat\mu = N^{-1}\sum_i y_i = \bar y$. So the maximum empirical likelihood estimator of the population mean is the sample mean.
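These profiling steps are easy to carry out numerically. The sketch below (not from the text) solves the scalar equation for the multiplier λ(μ) by bracketed root finding, forms the weights $\pi_i(\mu)$ as in (6.77), and confirms that the profile empirical log-likelihood is maximized at (up to the grid spacing) the sample mean.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(8)
y = rng.exponential(scale=2.0, size=100)          # iid sample, population mean 2

def el_loglik(mu):
    """Profile empirical log-likelihood N^{-1} sum log pi_i(mu) for the mean."""
    d = y - mu
    lo = -1.0 / d.max() + 1e-6                    # keep 1 + lam*d_i > 0 for all i
    hi = -1.0 / d.min() - 1e-6
    lam = brentq(lambda l: np.sum(d / (1.0 + l * d)), lo, hi)  # solves sum pi_i (y_i - mu) = 0
    pi = 1.0 / (len(y) * (1.0 + lam * d))                      # weights as in (6.77)
    return np.mean(np.log(pi))

grid = np.linspace(y.mean() - 0.5, y.mean() + 0.5, 201)
vals = [el_loglik(m) for m in grid]
print(grid[int(np.argmax(vals))], y.mean())       # the EL maximizer is (close to) the sample mean
```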
6.8.2. Empirical Likelihood Estimation of Regression Parameters
Now consider regression data that are iid over i. The only structure placed on the model are r moment conditions
$$ E[h(w_i, \theta)] = 0, \qquad (6.74) $$
where h(·) and $w_i$ are defined in Section 6.3.1. For example, $h(w, \theta) = x(y - x'\theta)$ for OLS estimation and $h(y, x, \theta) = (\partial g/\partial\theta)(y - g(x, \theta))$ for NLS estimation.
The empirical likelihood approach maximizes the empirical likelihood function $N^{-1}\sum_i\ln\pi_i$ subject to the constraint $\sum_i\pi_i = 1$ (see (6.72)) and the additional sample constraint, based on the population moment condition (6.74), that
$$ \sum_{i=1}^{N}\pi_i h(w_i, \theta) = 0. \qquad (6.75) $$
Thus we maximize with respect to π, η, λ, and θ
$$ L_{EL}(\pi, \eta, \lambda, \theta) = \frac{1}{N}\sum_{i=1}^{N}\ln\pi_i - \eta\left(\sum_{i=1}^{N}\pi_i - 1\right) - \lambda'\sum_{i=1}^{N}\pi_i h(w_i, \theta), \qquad (6.76) $$
where the Lagrange multipliers are a scalar η and a column vector λ of the same dimension as h(·).
First, concentrate out the N parameters $\pi_1, \ldots, \pi_N$. Differentiating $L(\pi, \eta, \lambda, \theta)$ with respect to $\pi_i$ yields $1/(N\pi_i) - \eta - \lambda'h_i = 0$. Then we obtain $\eta = 1$ by multiplying by $\pi_i$, summing over i, and using $\sum_i\pi_ih_i = 0$. It follows that
$$ \pi_i(\theta, \lambda) = \frac{1}{N(1 + \lambda'h(w_i, \theta))}. \qquad (6.77) $$
The problem is now reduced to a maximization problem with respect to the (r + q) variables λ and θ: the Lagrange multipliers associated with the r moment conditions (6.74) and the q parameters θ.
Solution at this stage requires numerical methods, even for just-identified models. One can maximize with respect to θ and λ the function $N^{-1}\sum_i\ln\left[1/\{N(1 + \lambda'h(w_i, \theta))\}\right]$.
Alternatively, first concentrate out λ. Differentiating $L(\pi(\theta, \lambda), \eta, \lambda)$ with respect to λ yields $\sum_i\pi_ih_i = 0$. Define λ(θ) to be the implicit solution to the system of dim(λ) equations
$$ \sum_{i=1}^{N}\frac{1}{N(1 + \lambda'h(w_i, \theta))}\,h(w_i, \theta) = 0. $$
In implementation numerical methods are needed to obtain λ(θ). Then (6.77) becomes
$$ \pi_i(\theta) = \frac{1}{N(1 + \lambda(\theta)'h(w_i, \theta))}. \qquad (6.78) $$
205
GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION
By substituting (6.78) into the empirical likelihood function N−1

i ln πi , the empir-
ical log-likelihood function evaluated at θ becomes
LEL(θ) = −N−1
N
	
i=1
ln[N(1 + λ(θ)
h(wi , θ))].
The maximum empirical likelihood (MEL) estimator
θMEL maximizes this function
with respect to θ.
Qin and Lawless (1994) show that
√
N(
θMEL − θ0)
d
→ N[0, A(θ0)−1
B(θ0)A(θ0)−1
],
where A(θ0) = plimE[∂h(θ)/∂θ
|θ0
] and B(θ0) = plimE[h(θ)h(θ)
|θ0
]. This is the
same limit distribution as the method of moments (see (6.13)). In finite samples 
θMEL
differs from 
θGMM, however, and inference is based on sample estimates

A =
	N
i=1

πi
∂h
i
∂θ




θ
,

B =
	N
i=1

πi hi (
θ)hi (
θ)
that weight by the estimated probabilities 
πi rather than the proportions 1/N.
Imbens (2002) provides a recent survey of empirical likelihood that contrasts em-
pirical likelihood with GMM. Variations include replacing N−1

i ln πi in (6.26)
by N−1

i πi ln πi . Empirical likelihood is computationally more burdensome; see
Imbens (2002) for a discussion. The advantage is that the asymptotic theory provides
a better finite-sample approximation to the distribution of the empirical likelihood es-
timator than it does to that for the GMM estimator. This is pursued further in Sec-
tion 11.6.4.
6.9. Linear Systems of Equations
The preceding estimation theory covers single-equation estimation methods used in
the majority of applied studies. We now consider joint estimation of several equations.
Equations linear in parameters with an additive error are presented in this section, with
extensions to nonlinear systems given in the subsequent section.
The main advantage of joint estimation is the gain in efficiency that results from
incorporation of correlation in unobservables across equations for a given individual.
Additionally, joint estimation may be necessary if there are restrictions on parameters
across equations. With exogenous regressors systems estimation is a minor extension
of single-equation OLS and GLS estimation, whereas with endogenous regressors it is
single-equation IV methods that are adapted.
One leading example is systems of equations such as those for observed demand of
several commodities at a point in time for many individuals. For seemingly unrelated
regression all regressors are exogenous whereas for simultaneous equations models
some regressors are endogenous.
A second leading example is panel data, where a single equation is observed at
several points in time for many individuals, and each time period is treated as a separate
equation. By viewing a panel data model as an example of a system it is possible to
improve efficiency, obtain panel robust standard errors, and derive instruments when
some regressors are endogenous.
Many econometrics texts provide lengthy presentations of linear systems. The treat-
ment here is very brief. It is mainly directed toward generalization to nonlinear systems
(see Section 6.10) and application to panel data (see Chapters 21–23).
6.9.1. Linear Systems of Equations
The single-equation linear model is given by yi = xi'β + ui, where yi and ui are scalars
and xi and β are column vectors. The multiple-equation linear model, or multivari-
ate linear model, with G dependent variables is given by
yi = Xi β + ui , i = 1, . . . , N, (6.79)
where yi and ui are G × 1 vectors, Xi is a G × K matrix, and β is a K × 1 column
vector.
Throughout this section we make the cross-section assumption that the error vector
ui is independent over i, so E[ui uj'] = 0 for i ≠ j. However, components of ui for given i may be correlated and have variances and covariances that vary over i, leading to the conditional error variance matrix for the ith individual
Ωi = E[ui ui'|Xi].   (6.80)
There are various ways that a multiple-equation model may arise. At one extreme
the seemingly unrelated equations model combines G equations, such as demands for
different consumer goods, where parameters vary across equations and regressors may
or may not vary across equations. At the other extreme the linear panel data model combines
G periods of data for the same equation, with parameters that are constant across
periods and regressors that may or may not vary across periods. These two cases are
presented in detail in Sections 6.9.3 and 6.9.4.
Stacking (6.79) over N individuals gives
[y1' · · · yN']' = [X1' · · · XN']' β + [u1' · · · uN']',   (6.81)
or
y = Xβ + u, (6.82)
where y and u are NG × 1 vectors and X is an NG × K matrix.
The results given in the following can be obtained by treating the stacked model
(6.82) in the same way as in the single-equation case. Thus the OLS estimator is β̂ = (X'X)^{-1}X'y, and in the just-identified case with instrument matrix Z the IV estimator is β̂ = (Z'X)^{-1}Z'y. The only real change is that the usual cross-section assumption of
a diagonal error variance matrix is replaced by assumption of a block-diagonal error
matrix. This block-diagonality needs to be accommodated in computing the estimated
variance matrix of a systems estimator and in forming feasible GLS estimators and
efficient GMM estimators.
6.9.2. Systems OLS and FGLS Estimation
An OLS estimation of the system (6.82) yields the systems OLS estimator
(X'X)^{-1}X'y. Using (6.81) it follows immediately that
β̂SOLS = [Σ_{i=1}^N Xi'Xi]^{-1} Σ_{i=1}^N Xi'yi.   (6.83)
The estimator is asymptotically normal and, assuming the data are independent over i,
the usual robust sandwich result applies and
V̂[β̂SOLS] = [Σ_{i=1}^N Xi'Xi]^{-1} [Σ_{i=1}^N Xi'ûiûi'Xi] [Σ_{i=1}^N Xi'Xi]^{-1},   (6.84)
where ûi = yi − Xi β̂. This variance matrix estimate permits conditional variances and
covariances of the errors to differ across individuals.
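As a concrete illustration, the sketch below computes (6.83) and the robust sandwich variance (6.84); the array shapes (y of dimension N × G, X of dimension N × G × K) are assumptions of this sketch rather than notation from the text.

```python
# Minimal sketch of systems OLS (6.83) with the robust sandwich variance (6.84).
import numpy as np

def systems_ols(y, X):
    # y: (N, G) stacked dependent variables; X: (N, G, K) regressor matrices X_i
    N, G, K = X.shape
    XtX = sum(X[i].T @ X[i] for i in range(N))           # sum_i X_i' X_i
    Xty = sum(X[i].T @ y[i] for i in range(N))           # sum_i X_i' y_i
    beta = np.linalg.solve(XtX, Xty)
    meat = np.zeros((K, K))
    for i in range(N):
        u = y[i] - X[i] @ beta                           # G x 1 residual vector
        meat += X[i].T @ np.outer(u, u) @ X[i]           # permits Omega_i to vary over i
    bread = np.linalg.inv(XtX)
    V = bread @ meat @ bread                             # robust sandwich variance
    return beta, V
```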
Given correlation of the components of the error vector for a given individual,
more efficient estimation is possible by GLS or FGLS. If observations are indepen-
dent over i, the systems GLS estimator is systems OLS applied to the transformed
system
Ωi^{-1/2} yi = Ωi^{-1/2} Xi β + Ωi^{-1/2} ui,   (6.85)
where Ωi is the error variance matrix defined in (6.80). The transformed error Ωi^{-1/2} ui
has mean zero and variance
E[(Ωi^{-1/2} ui)(Ωi^{-1/2} ui)'|Xi] = Ωi^{-1/2} E[ui ui'|Xi] Ωi^{-1/2} = Ωi^{-1/2} Ωi Ωi^{-1/2} = IG.
So the transformed system has errors that are homoskedastic and uncorrelated over G
equations and OLS is efficient.
To implement this estimator, a model for Ωi needs to be specified, say Ωi = Ωi (γ).
Then perform systems OLS estimation in the transformed system where Ωi is replaced
by Ωi(γ̂), where γ̂ is a consistent estimate of γ. This yields the systems feasible GLS
(SFGLS) estimator
β̂SFGLS = [Σ_{i=1}^N Xi'Ω̂i^{-1}Xi]^{-1} Σ_{i=1}^N Xi'Ω̂i^{-1}yi.   (6.86)
This estimator is asymptotically normal and to guard against possible misspecification
of Ωi(γ) we can use the robust sandwich estimate of the variance matrix
V̂[β̂SFGLS] = [Σ_{i=1}^N Xi'Ω̂i^{-1}Xi]^{-1} [Σ_{i=1}^N Xi'Ω̂i^{-1}ûiûi'Ω̂i^{-1}Xi] [Σ_{i=1}^N Xi'Ω̂i^{-1}Xi]^{-1},   (6.87)
where Ω̂i = Ωi(γ̂).
The most common specification used for Ωi is to assume that it does not vary over
i. Then Ωi = Ω is a G × G matrix that can be consistently estimated for finite G and
N → ∞ by
Ω̂ = (1/N) Σ_{i=1}^N ûiûi',   (6.88)
where ûi = yi − Xi β̂SOLS. Then the SFGLS estimator is (6.86) with Ω̂ instead of Ω̂i, and after some algebra the SFGLS estimator can also be written as
β̂SFGLS = [X'(Ω̂^{-1} ⊗ IN)X]^{-1} X'(Ω̂^{-1} ⊗ IN)y,   (6.89)
where ⊗ denotes the Kronecker product. The assumption that Ωi = Ω rules out, for
example, heteroskedasticity over i. This is a strong assumption, and in many applica-
tions it is best to use robust standard errors calculated using (6.87), which gives correct
standard errors even if Ωi does vary over i.
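A corresponding sketch of the SFGLS estimator (6.86) with the common-Ω specification (6.88) and the robust variance (6.87) follows; it reuses the hypothetical systems_ols helper and array shapes assumed in the previous sketch.

```python
# Minimal sketch of SFGLS (6.86) with Omega estimated by (6.88) from first-step
# systems OLS residuals, and the robust variance (6.87).
import numpy as np

def systems_fgls(y, X):
    N, G, K = X.shape
    beta_ols, _ = systems_ols(y, X)                      # first step: SOLS
    U = y - np.einsum("igk,k->ig", X, beta_ols)          # residuals u_i = y_i - X_i b
    Oinv = np.linalg.inv(U.T @ U / N)                    # inverse of Omega_hat in (6.88)
    A = sum(X[i].T @ Oinv @ X[i] for i in range(N))
    b = sum(X[i].T @ Oinv @ y[i] for i in range(N))
    beta = np.linalg.solve(A, b)
    meat = np.zeros((K, K))                              # robust middle term of (6.87)
    for i in range(N):
        u = y[i] - X[i] @ beta
        meat += X[i].T @ Oinv @ np.outer(u, u) @ Oinv @ X[i]
    Ainv = np.linalg.inv(A)
    return beta, Ainv @ meat @ Ainv
```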
6.9.3. Seemingly Unrelated Regressions
The seemingly unrelated regressions (SUR) model specifies the gth of G equations
for the ith of N individuals to be given by
yig = xig'βg + uig,   g = 1, . . . , G,  i = 1, . . . , N,   (6.90)
where xig are regressors that are assumed to be exogenous and βg are Kg × 1 param-
eter vectors. For example, for demand data on G goods for N individuals, yig may
be the ith individual’s expenditure on good g or budget share for good g. In all that
follows G is assumed fixed and reasonably small while N → ∞. Note that we use the
subscript order yig as results then transfer easily to panel data with variable yit (see
Section 6.9.4). Other authors use the reverse order ygi .
The SUR model was proposed by Zellner (1962). The term seemingly unrelated
regressions is deceptive, as clearly the equations are related if the errors uig in different
equations are correlated. For the SUR model the relationship between yig and yih is
indirect; it comes through correlation in the errors across different equations.
Estimation combines observations over both equations and individuals. For microe-
conometrics applications, where independence over i is assumed, it is most convenient
to first stack all equations for a given individual. Stacking all G equations for the ith
individual we get
[yi1 · · · yiG]' = diag(xi1', . . . , xiG') [β1' · · · βG']' + [ui1 · · · uiG]',   (6.91)
which is of the form yi = Xi β + ui in (6.79), where yi and ui are G × 1 vectors
with gth entries yig and uig, Xi is a G × K matrix with gth row [0 · · · xig' · · · 0], and β = [β1' . . . βG']' is a K × 1 vector where K = K1 + · · · + KG. Some authors instead
first stack all individuals for a given equation, leading to different algebraic expressions
for the same estimators.
Given the definitions of Xi and yi it is easy to show that β̂SOLS in (6.83) is
β̂SOLS = [β̂1' · · · β̂G']',  where β̂g = (Σ_{i=1}^N xig xig')^{-1} Σ_{i=1}^N xig yig,
so that systems OLS is the same as separate equation-by-equation OLS. As might be
expected a priori, if the only link across equations is the error and the errors are treated
as being uncorrelated then joint estimation reduces to single-equation estimation.
A better estimator is the feasible GLS estimator defined in (6.86) using Ω̂ in (6.88)
and statistical inference based on the asymptotic variance given in (6.87). This estima-
tor is generally more efficient than systems OLS, though it can be shown to collapse
to OLS if the errors are uncorrelated across equations or if exactly the same regressors
appear in each equation.
Seemingly unrelated regression models may impose cross-equation parameter
restrictions. For example, a symmetry restriction may imply that the coefficient of
the second regressor in the first equation equals the coefficient of the first regressor
in the second equation. If such restrictions are equality restrictions one can easily
estimate the model by appropriate redefinition of Xi and β given in (6.79). For ex-
ample, if there are two equations and the restriction is that β2 = −β1 then define
Xi = [xi1 −xi2]' and β = β1. Alternatively, one can estimate using systems exten-
sions of single-equation OLS and GLS with linear restrictions on the parameters.
Also, in systems of equations it is possible that the variance matrix of the error
vector ui is singular, as a result of adding-up constraints. For example, suppose yig
is the gth budget share for individual i, and the model is yig = αg + zi'βg + uig, where the same regressors appear in each equation. Then Σg yig = 1 since budget shares sum to one, which requires Σg αg = 1, Σg βg = 0, and Σg uig = 0. The last restriction means
Ωi is singular and hence noninvertible. One can eliminate one equation, say the last,
and estimate the model by systems estimation applied to the remaining G − 1 equa-
tions. Then the parameter estimates for the Gth equation can be obtained using the
adding-up constraint. For example, α̂G = 1 − (α̂1 + · · · + α̂G−1). It is also possible
to impose equality restrictions on the parameters in this setup. A literature exists on
methods that ensure that estimates obtained are invariant to the equation deleted; see,
for example, Berndt and Savin (1975).
6.9.4. Panel Data
Another leading application of systems GLS methods is to panel data, where a scalar
dependent variable is observed in each of T time periods for N individuals. Panel data
can be viewed as a system of equations, either T equations for N individuals or N
equations for T time periods. In microeconometrics we assume a short panel, with T
small and N → ∞ so it is natural to set it up as a scalar dependent variable yit , where
the gth equation in the preceding discussion is now interpreted as the tth time period
and G = T .
A simple panel data model is
yit = xit'β + uit,   t = 1, . . . , T,  i = 1, . . . , N,   (6.92)
a specialization of (6.90) with β now constant. Then in (6.79) the regressor matrix becomes Xi = [xi1 · · · xiT]'. After some algebra the systems OLS estimator defined in
(6.83) can be reexpressed as
β̂POLS = [Σ_{i=1}^N Σ_{t=1}^T xit xit']^{-1} Σ_{i=1}^N Σ_{t=1}^T xit yit.   (6.93)
This estimator is called the pooled OLS estimator as it pools or combines the cross-
section and time-series aspects of the data.
The pooled estimator is obtained simply by OLS estimation of yit on xit . However,
if uit are correlated over t for given i, the default OLS standard errors that assume
independence of the error over both i and t are invalid and can be greatly downward
biased. Instead, statistical inference should be based on the robust form of the co-
variance matrix given in (6.84). This is detailed in Section 21.2.3. In practice models
more complicated than (6.92) that include individual specific effects are estimated (see
Section 21.2).
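The pooled OLS estimator and the panel-robust variance can be computed directly from data in long format; the sketch below is illustrative, with the inputs (y, x, and an array of individual identifiers) assumed rather than taken from the text.

```python
# Minimal sketch of pooled OLS (6.93) with panel-robust (clustered over i)
# standard errors based on the sandwich form (6.84).
import numpy as np

def pooled_ols_cluster(y, x, ids):
    # y: (NT,), x: (NT, K), ids: (NT,) individual identifiers
    XtX = x.T @ x
    beta = np.linalg.solve(XtX, x.T @ y)
    u = y - x @ beta
    K = x.shape[1]
    meat = np.zeros((K, K))
    for i in np.unique(ids):
        sel = ids == i
        s = x[sel].T @ u[sel]                 # sum_t x_it u_it for individual i
        meat += np.outer(s, s)
    bread = np.linalg.inv(XtX)
    V = bread @ meat @ bread                  # cluster-robust variance
    return beta, np.sqrt(np.diag(V))          # estimates and robust standard errors
```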
6.9.5. Systems IV Estimation
Estimation of a single linear equation with endogenous regressors was presented
in Section 6.4. Now we extend this to the multivariate linear model (6.79) when
E[ui|Xi] ≠ 0. Brundy and Jorgenson (1971) considered IV estimation applied to the
system of equations to produce estimates that are both consistent and efficient.
We assume the existence of a G × r matrix of instruments Zi that satisfy E[ui |Zi ] =
0 and hence
E[Zi'(yi − Xi β)] = 0.   (6.94)
These instruments can be used to obtain consistent parameter estimates using single-
equation IV methods, but joint equation estimation can improve efficiency. The sys-
tems GMM estimator minimizes
QN(β) = [Σ_{i=1}^N Zi'(yi − Xi β)]' WN [Σ_{i=1}^N Zi'(yi − Xi β)],   (6.95)
where WN is an r × r weighting matrix. Performing some algebra yields
β̂SGMM = [X'Z WN Z'X]^{-1} X'Z WN Z'y,   (6.96)
where X is an NG × K matrix obtained by stacking X1, . . . , XN (see (6.81)) and Z
is an NG × r matrix obtained by similarly stacking Z1, . . . , ZN . The systems GMM
estimator has exactly the same form as (6.37), and the asymptotic variance matrix is
that given in (6.39). It follows that a robust estimate of the variance matrix is
V̂[β̂SGMM] = N [X'Z WN Z'X]^{-1} X'Z WN Ŝ WN Z'X [X'Z WN Z'X]^{-1},   (6.97)
where, in the systems case and assuming independence over i,
Ŝ = (1/N) Σ_{i=1}^N Zi'ûiûi'Zi.   (6.98)
Several choices of weighting matrix receive particular attention.
First, the optimal systems GMM estimator is (6.96) with WN = Ŝ^{-1}, where Ŝ is defined in (6.98). The variance matrix then simplifies to
V̂[β̂OSGMM] = N [X'Z Ŝ^{-1} Z'X]^{-1}.
This estimator is the most efficient GMM estimator based on moment conditions
(6.94). The efficiency gain arises from two factors: (1) systems estimation, which per-
mits errors in different equations to be correlated, so that V[ui |Zi ] is not restricted to
being block diagonal, and (2) an allowance for quite general heteroskedasticity and
correlation, so that Ωi can vary over i.
Second, the systems 2SLS estimator arises when WN = (N^{-1}Z'Z)^{-1}. Consider
the SUR model defined in (6.91), with some of the regressors xig now endogenous.
Then systems 2SLS reduces to equation-by-equation 2SLS, with instruments zg for
the gth equation, if we define the instrument matrix to be
Zi = diag(zi1', . . . , ziG').   (6.99)
In many applications z1 = z2 = · · · = zG, so that a common set of instruments is used in all equations, but we need not restrict analysis to this case. For the panel data model (6.92) systems 2SLS reduces to pooled 2SLS if we define Zi = [zi1 · · · ziT]'.
Third, suppose that V[ui |Zi ] does not vary over i, so that V[ui |Zi ] = Ω. This is a
systems analogue of the single-equation assumption of homoskedasticity. Then as with
(6.88) a consistent estimate of Ω is Ω̂ = N^{-1} Σ_i ûiûi', where ûi are residuals based on a consistent IV estimator such as systems 2SLS. Then the optimal GMM estimator is (6.96) with WN = IN ⊗ Ω̂. This estimator should be contrasted with the three-stage
least-squares estimator presented at the end of the next section.
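The following sketch computes the systems 2SLS estimator and then the two-step optimal systems GMM estimator of (6.96) with WN = Ŝ^{-1}; the array shapes (y of dimension N × G, X of dimension N × G × K, Z of dimension N × G × r) are assumptions carried over from the earlier sketches.

```python
# Minimal sketch of systems 2SLS and two-step optimal systems GMM, (6.96)-(6.98).
import numpy as np

def systems_gmm(y, X, Z):
    N = len(y)
    r = Z.shape[2]
    ZX = sum(Z[i].T @ X[i] for i in range(N))             # Z'X, r x K
    Zy = sum(Z[i].T @ y[i] for i in range(N))             # Z'y, r x 1
    def gmm(W):
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)
    W_2sls = np.linalg.inv(sum(Z[i].T @ Z[i] for i in range(N)) / N)
    beta_2sls = gmm(W_2sls)                               # systems 2SLS estimator
    S = np.zeros((r, r))                                  # S_hat in (6.98)
    for i in range(N):
        u = y[i] - X[i] @ beta_2sls
        S += Z[i].T @ np.outer(u, u) @ Z[i]
    S /= N
    Sinv = np.linalg.inv(S)
    beta_gmm = gmm(Sinv)                                  # optimal two-step GMM
    V = N * np.linalg.inv(ZX.T @ Sinv @ ZX)               # simplified variance
    return beta_2sls, beta_gmm, V
```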
6.9.6. Linear Simultaneous Equations Systems
The linear simultaneous equations model, introduced in Section 2.4, is a very impor-
tant model that is often presented at considerable length in introductory graduate-level
econometrics courses. In this section we provide a very brief self-contained summary.
The discussion of identification overlaps with that in Chapter 2. Due to the presence
of endogenous variables OLS and SUR estimators are inconsistent. Consistent estima-
tion methods are placed in the context of GMM estimation, even though the standard
methods were developed well before GMM.
The linear simultaneous equations model specifies the gth of G equations for the
ith of N individuals to be given by
yig = zig'γg + Yig'βg + uig,   g = 1, . . . , G,   (6.100)
where the order of subscripts is that of Section 6.9 rather than Section 2.4, zg is
a vector of exogenous regressors that are assumed to be uncorrelated with the er-
ror term ug and Yg is a vector that contains a subset of the dependent variables
y1, . . . , yg−1, yg+1, . . . , yG of the other G − 1 equations. Yg is endogenous as it is
correlated with model errors. The model for the ith individual can equivalently be
written as
yi'B + zi'Γ = ui',   (6.101)
where yi = [yi1 . . . yiG]' is a G × 1 vector of endogenous variables, zi is an r × 1 vector of exogenous variables that is the union of zi1, . . . , ziG, ui = [ui1 . . . uiG]' is a G × 1 error vector, B is a G × G parameter matrix with diagonal entries unity, Γ is an r × G parameter matrix, and some of the entries in B and Γ are constrained to be zero. It is assumed that ui is iid over i with mean 0 and variance matrix Σ.
The model (6.101) is called the structural form with different restrictions on B
and Γ corresponding to different structures. Solving for the endogenous variables as a
function of the exogenous variables yields the reduced form
yi' = −zi'ΓB^{-1} + ui'B^{-1}   (6.102)
    = zi'Π + vi',
where Π = −ΓB^{-1} is the r × G matrix of reduced form parameters and vi' = ui'B^{-1} is the reduced form error vector with variance Ω = (B^{-1})'ΣB^{-1}.
The reduced form can be consistently estimated by OLS, yielding estimates of
Π = −ΓB^{-1} and Ω = (B^{-1})'ΣB^{-1}. The problem of identification (see Section 2.5)
is one of whether these lead to unique estimates of the structural form parameters B,
Γ and Σ. This requires some parameter restrictions since without restrictions B, Γ,
and Σ contain G^2 more parameters than Π and Ω. A necessary condition for identi-
fication of parameters in the gth equation is the order condition that the number of
exogenous variables excluded from the gth equation must be at least equal to the num-
ber of endogenous variables included. This is the same as the order condition given
in Section 6.4.1. For example, if Yig in (6.100) has one component, so there is one
endogenous variable in the equation, then at least one of the components of zi must
not be included. This will ensure that there are as many instruments as regressors.
A sufficient condition for identification is the stronger rank condition. This is given
in many books such as Greene’s (2003) and for brevity is not given here. Other restric-
tions, such as covariance restrictions, may also lead to identification.
Given identification, the structural model parameters can be consistently estimated
by separate estimation of each equation by two-stage least squares defined in (6.44).
The same set of instruments zi is used for each equation. In the gth equation the sub-
component zig is used as instrument for itself and the remainder of zi is used as instru-
ment for Yig.
More efficient systems estimates are obtained using the three-stage least-squares
(3SLS) estimator of Zellner and Theil (1962), which assumes errors are homoskedas-
tic but are correlated across equations. First, estimate the reduced form coefficients Π
in (6.102) by OLS regression of y on z. Second, obtain the 2SLS estimates by OLS re-
gression of (6.100), where Yg is replaced by the reduced form predictions Ŷg = zΠ̂G. This is OLS regression of yg on Ŷg and zg, or equivalently of yg on x̂g, where x̂g are the predictions of Yg and zg from OLS regression on z. Third, obtain the 3SLS estimates by systems OLS regression of yg on x̂g, g = 1, . . . , G. Then from (6.89)
θ̂3SLS = [X̂'(Ω̂^{-1} ⊗ IN)X̂]^{-1} X̂'(Ω̂^{-1} ⊗ IN)y,
where X̂ is obtained by first forming a block-diagonal matrix X̂i with diagonal blocks x̂i1, . . . , x̂iG and then stacking X̂1, . . . , X̂N, and Ω̂ = N^{-1} Σ_i ûiûi' with ûi the residual vectors calculated using the 2SLS estimates.
This estimator coincides with the systems GMM estimator with WN = IN ⊗ Ω̂ in
the case that the systems GMM estimator uses the same instruments in every equation.
Otherwise, 3SLS and systems GMM differ, though both yield consistent estimates if
E[ui |zi ] = 0.
6.9.7. Linear Systems ML Estimation
The systems estimators for the linear model are essentially LS or IV estimators with in-
ference based on robust standard errors. Now additionally assume normally distributed
iid errors, so that ui ∼ N[0, Ω].
For systems with exogenous regressors the resulting MLE is asymptotically equiva-
lent to the GLS estimator. These estimators do use different estimators of Ω and hence
β, however, so that there are small-sample differences between the MLE and the GLS
estimator. For example, see Chapter 21 for the random effects panel data model.
For the linear SEM (6.101), the limited information maximum likelihood es-
timator, a single-equation ML estimator, is asymptotically equivalent to 2SLS. The
full information maximum likelihood estimator, the systems MLE, is asymptotically
equivalent to 3SLS. See, for example, Schmidt (1976) and Greene (2003).
6.10. Nonlinear Sets of Equations
We now consider systems of equations that are nonlinear in parameters. For example,
demand equation systems obtained from a specified direct or indirect utility function may be
nonlinear in parameters. More generally, if a nonlinear model is appropriate for a de-
pendent variable studied in isolation, for example a logit or Poisson model, then any
joint model for two or more such variables will necessarily be nonlinear.
We begin with a discussion of fully parametric joint modeling, before focusing on
partially parametric modeling. As in the linear case we present models with exogenous
regressors before considering the complication of endogenous regressors.
6.10.1. Nonlinear Systems ML Estimation
Maximum likelihood estimation for a single dependent variable was presented in Sec-
tion 5.6. These results can be immediately applied to joint models of several dependent
variables, with the very minor change that the single dependent variable conditional
density f (yi |xi , θ) becomes f (yi |Xi , θ), where yi denotes the vector of dependent
variables, Xi denotes all the regressors, and θ denotes all the parameters.
For example, if y1 ∼ N[exp(x1'β1), σ1^2] and y2 ∼ N[exp(x2'β2), σ2^2] then a suitable joint model may be to assume that (y1, y2) are bivariate normal with means exp(x1'β1) and exp(x2'β2), variances σ1^2 and σ2^2, and correlation ρ.
For data that are not normally distributed there can be challenges in specifying and
selecting a sufficiently flexible joint distribution. For example, for univariate counts
a standard starting model is the negative binomial (see Chapter 20). However, in ex-
tending this to a bivariate or multivariate model for counts there are several alternative
bivariate negative binomial models to choose from. These might differ, for example,
as to whether the univariate conditional distribution or the univariate marginal distri-
bution is negative binomial. In contrast the multivariate normal distribution has condi-
tional and marginal distributions that are both normal. All of these multivariate nega-
tive binomial distributions place some restrictions on the range of correlation such as
restricting to positive correlation, whereas for the multivariate normal there is no such
restriction.
Fortunately, modern computational advances permit richer models to be specified.
For example, a reasonably flexible model for correlated bivariate counts is to assume
that, conditional on unobservables ε1 and ε2, y1 is Poisson with mean exp(x1'β1 + ε1) and y2 is Poisson with mean exp(x2'β2 + ε2). An estimable bivariate distribution can
be obtained by assuming that the unobservables ε1 and ε2 are bivariate normal and in-
tegrating them out. There is no closed-form solution for this bivariate distribution, but
the parameters can nonetheless be estimated using the method of maximum simulated
likelihood presented in Section 12.4.
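A minimal sketch of such a maximum simulated likelihood calculation for the bivariate count model just described follows; the data-generating step, the number of simulation draws, and the transformations used to keep the variance parameters in range are illustrative assumptions, not the text's empirical examples.

```python
# Minimal sketch of maximum simulated likelihood for bivariate counts:
# y1, y2 Poisson with means exp(x1'b1 + e1), exp(x2'b2 + e2), (e1, e2) bivariate
# normal and integrated out by averaging over simulation draws.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
N, S = 300, 200
x1 = np.column_stack([np.ones(N), rng.normal(size=N)])
x2 = np.column_stack([np.ones(N), rng.normal(size=N)])
e = rng.multivariate_normal([0, 0], [[0.3, 0.2], [0.2, 0.3]], size=N)
y1 = rng.poisson(np.exp(x1 @ [0.5, 0.5] + e[:, 0]))
y2 = rng.poisson(np.exp(x2 @ [0.5, -0.5] + e[:, 1]))
draws = rng.multivariate_normal([0, 0], np.eye(2), size=S)   # common standard normal draws

def neg_simulated_loglik(params):
    b1, b2 = params[0:2], params[2:4]
    s1, s2, rho = np.exp(params[4]), np.exp(params[5]), np.tanh(params[6])
    # Transform standard normal draws to the assumed (e1, e2) distribution
    e1 = s1 * draws[:, 0]
    e2 = s2 * (rho * draws[:, 0] + np.sqrt(1 - rho**2) * draws[:, 1])
    mu1 = np.exp(x1 @ b1)[:, None] * np.exp(e1)[None, :]     # N x S conditional means
    mu2 = np.exp(x2 @ b2)[:, None] * np.exp(e2)[None, :]
    # Simulated likelihood: average over draws of the product of Poisson densities
    f = stats.poisson.pmf(y1[:, None], mu1) * stats.poisson.pmf(y2[:, None], mu2)
    return -np.sum(np.log(f.mean(axis=1) + 1e-300))

res = optimize.minimize(neg_simulated_loglik, np.zeros(7), method="BFGS")
print(res.x[:4])    # slope estimates; compare with (0.5, 0.5, 0.5, -0.5)
```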
A number of examples of nonlinear joint models are given throughout Part 4 of the
book. The simplest joint models can be inflexible, so consistency can rely on distribu-
tional assumptions that are too restrictive. However, there is generally no theoretical
impediment to specifying more flexible models that can be estimated using computa-
tionally intensive methods.
In particular, two leading methods for generating rich multivariate parametric mod-
els are presented in detail in Section 19.3. These methods are given in the context of
duration data models, but they have much wider applicability. First, one can introduce
correlated unobserved heterogeneity, as in the bivariate count example just given.
Second, one can use copulas, which provide a way to generate a joint distribution
given specified univariate marginals.
For ML estimation a simpler though less efficient quasi-ML approach is to specify
separate parametric models for y1 and y2 and obtain ML estimates assuming inde-
pendence of y1 and y2 but then do statistical inference permitting y1 and y2 to be
correlated. This has been presented in Section 5.7.5. In the remainder of this section
we consider such partially parametric approaches.
The challenges become greater if there is endogeneity, so that a dependent variable
in one equation appears as a regressor in another equation. Few models for nonlinear
simultaneous equations exist, aside from nonlinear regression models with additive
errors that are normally distributed.
6.10.2. Nonlinear Systems of Equations
For linear regression the movement from single equation to multiple equations is clear
as the starting point is the linear model y = x'β + u and estimation is by least squares.
Efficient systems estimation is then by systems GLS estimation. For nonlinear models
there can be much more variety in the starting point and estimation method.
We define the multivariate nonlinear model with G dependent variables to be
r(yi , Xi , β) = ui , (6.103)
where yi and ui are G × 1 vectors, r(yi , Xi , β) is a G × 1 vector function, Xi is a
G × L matrix, and β is a K × 1 column vector. Throughout this section we make the
cross-section assumption that the error vector ui is independent over i, but components
of ui for given i may be correlated with variances and covariances that vary over i.
One example of (6.103) is a nonlinear seemingly unrelated regression model.
Then the gth of G equations for the ith of N individuals is given by
rg(yig, xig, βg) = uig, g = 1, . . . , G. (6.104)
For example, uig = yig − exp(xig'βg). Then ui and r(·) in (6.103) are G × 1 vectors
with gth entries uig and rg(·), Xi is the same block-diagonal matrix as that defined in
(6.91), and β is obtained by stacking β1 to βG.
A second example is a nonlinear panel data model. Then for individual i in
period t
r(yit , xit , β) = uit , t = 1, . . . , T. (6.105)
Then ui and r(·) in (6.103) are T × 1 vectors, so G = T , with tth entries uit and
r(yit , xit , β). The panel model differs from the SUR model by having the same func-
tion r(·) and parameters β in each period.
6.10.3. Nonlinear Systems Estimation
When the regressors Xi in the model (6.103) are exogenous
E[ui |Xi ] = 0, (6.106)
where ui is the error term defined in (6.103). We assume that the error term is inde-
pendent over i, and the variance matrix is
Ωi = E[ui ui'|Xi].   (6.107)
Additive Errors
Systems estimation is a straightforward adaptation of systems OLS and FGLS estima-
tion of the linear models when the nonlinear model is additive in the error term, so that
(6.103) specializes to
ui = yi − g(Xi , β). (6.108)
Then the systems NLS estimator minimizes the sum of squared residuals Σ_i ui'ui, whereas the systems FGNLS estimator minimizes
QN(β) = Σ_i ui'Ω̂i^{-1}ui,   (6.109)
where we specify a model Ωi(γ) for Ωi and Ω̂i = Ωi(γ̂). To guard against possible misspecification of Ωi one can use robust standard errors that essentially require only that ui is independent and satisfies (6.106). Then the estimated variance of the systems FGNLS estimator is the same as that for the linear systems FGLS estimator in (6.87), with Xi replaced by ∂g(Xi, β)/∂β'|β̂ and now ûi = yi − g(Xi, β̂). The estimated variance of the simpler systems NLS estimator is obtained by additionally replacing Ω̂i by IG.
The main challenge can be specifying a useful model for Ωi . As an example, sup-
pose we wish to jointly model two count data variables. In Chapter 20 we show
that a standard model for counts, a little more general than the Poisson model,
specifies the conditional mean to be exp(x'β) and the conditional variance to be a multiple of exp(x'β). Then a joint model might specify u = [u1 u2]', where u1 = y1 − exp(x1'β1) and u2 = y2 − exp(x2'β2). The variance matrix Ωi then has diagonal entries α1 exp(xi1'β1) and α2 exp(xi2'β2), and one possible parameterization for the covariance is α3[exp(xi1'β1) exp(xi2'β2)]^{1/2}. The estimate Ω̂i then requires estimates of β1, β2, α1, α2, and α3 that may be obtained from first-step single-equation estimation.
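The parameterization just described translates directly into code; the sketch below builds Ωi(γ) for the two-count example and evaluates the FGNLS objective (6.109), with all inputs (the data arrays and first-step estimates of β1, β2, α1, α2, α3) assumed for illustration.

```python
# Minimal sketch of Omega_i(gamma) for two count variables and of the FGNLS
# objective (6.109) with an additive error u_i = y_i - g(X_i, beta).
import numpy as np

def omega_i(x1i, x2i, b1, b2, a1, a2, a3):
    m1, m2 = np.exp(x1i @ b1), np.exp(x2i @ b2)
    cov = a3 * np.sqrt(m1 * m2)                       # assumed covariance parameterization
    return np.array([[a1 * m1, cov], [cov, a2 * m2]])

def fgnls_objective(beta, y1, y2, x1, x2, b1, b2, a1, a2, a3):
    # Q_N(beta) = sum_i u_i' Omega_i^{-1} u_i, with Omega_i held at first-step estimates
    k1 = x1.shape[1]
    bb1, bb2 = beta[:k1], beta[k1:]
    Q = 0.0
    for i in range(len(y1)):
        u = np.array([y1[i] - np.exp(x1[i] @ bb1), y2[i] - np.exp(x2[i] @ bb2)])
        Q += u @ np.linalg.solve(omega_i(x1[i], x2[i], b1, b2, a1, a2, a3), u)
    return Q
# The objective can then be minimized with, e.g., scipy.optimize.minimize.
```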
Nonadditive Errors
With nonadditive errors least-squares regression is no longer appropriate, as shown
in the single-equation case in Section 6.2.2. Wooldridge (2002) presents consistent
method of moments estimation.
The conditional moment restriction (6.106) leads to many possible unconditional
moment conditions that can be used for estimation. The obvious starting point is to
base estimation on the moment conditions E[Xi'ui] = 0. However, other moment con-
ditions may be used. We more generally consider estimation based on K moment
conditions
E[R(Xi, β)'ui] = 0,   (6.110)
where R(Xi , β) is a K × G matrix of functions of Xi and β. The specification of
R(Xi , β) and possible dependence on β are discussed in the following.
By construction there are as many moment conditions as parameters. The sys-
tems method of moments estimator β̂SMM solves the corresponding sample moment conditions
N^{-1} Σ_{i=1}^N R(Xi, β)'r(yi, Xi, β̂SMM) = 0,   (6.111)
where in practice R(Xi, β) is evaluated at a first-step estimate β̂. This estimator is
asymptotically normal with variance matrix
V̂[β̂SMM] = [Σ_{i=1}^N D̂i'R̂i]^{-1} [Σ_{i=1}^N R̂i'ûiûi'R̂i] [Σ_{i=1}^N R̂i'D̂i]^{-1},   (6.112)
where D̂i = ∂ri/∂β'|β̂, R̂i = R(Xi, β̂), and ûi = r(yi, Xi, β̂SMM).
The main issue is specification of R(X, β) in (6.110). From Section 6.3.7, the most
efficient estimator based on (6.106) specifies
R*(Xi, β) = E[∂r(yi, Xi, β)'/∂β | Xi] Ωi^{-1}.   (6.113)
In general the first expectation on the right-hand side requires strong distributional
assumptions, making optimal estimation difficult.
Simplification does occur, however, if the nonlinear model is one with additive er-
ror defined in (6.108). Then R*(Xi, β) = ∂g(Xi, β)'/∂β × Ωi^{-1}, and the estimating equations (6.110) become
N^{-1} Σ_{i=1}^N [∂g(Xi, β)'/∂β] Ωi^{-1} (yi − g(Xi, β̂SMM)) = 0.
This estimator is asymptotically equivalent to the systems FGNLS estimator that min-
imizes (6.109).
6.10.4. Nonlinear Systems IV Estimation
When the regressors Xi in the model (6.103) are endogenous, so that E[ui|Xi] ≠ 0, we
assume the existence of a G × r matrix of instruments Zi such that
E[ui |Zi ] = 0, (6.114)
where ui is the error term defined in (6.103). We assume that the error term is indepen-
dent over i, and the variance matrix is Ωi = E[ui ui'|Zi]. For the nonlinear SUR model
Zi is as defined in (6.99).
The approach is similar to that used in the preceding section for the systems MM
estimator, with the additional complication that now there may be a surplus of instru-
ments, leading to a need for GMM estimation rather than just MM estimation. The conditional moment restriction (6.114) leads to many possible unconditional moment condi-
tions that can be used for estimation. Here we follow many others in basing estimation
on the moment conditions E[Zi'ui] = 0. Then a systems GMM estimator minimizes
QN(β) = [Σ_{i=1}^N Zi'r(yi, Xi, β)]' WN [Σ_{i=1}^N Zi'r(yi, Xi, β)].   (6.115)
This estimator is asymptotically normal with estimated variance
V̂[β̂SGMM] = N [D̂'Z WN Z'D̂]^{-1} D̂'Z WN Ŝ WN Z'D̂ [D̂'Z WN Z'D̂]^{-1},   (6.116)
where D̂'Z = Σ_i ∂ri'/∂β|β̂ Zi, Ŝ = N^{-1} Σ_i Zi'ûiûi'Zi, and we assume ui is independent over i with variance matrix V[ui|Zi] = Ωi.
The choice WN = [N^{-1} Σ_i Zi'Zi]^{-1} corresponds to NL2SLS in the case that r(yi, Xi, β) is obtained from a nonlinear SUR model. The choice WN = [N^{-1} Σ_i Zi'Ω̂Zi]^{-1}, where Ω̂ = N^{-1} Σ_i ûiûi', is called nonlinear 3SLS (NL3SLS) and is the most efficient estimator based on the moment condition E[Zi'ui] = 0 in the special case that Ωi = Ω. The choice WN = Ŝ^{-1} gives the most efficient estimator under the more general assumption that Ωi may vary with i. As usual, however, moment conditions other than E[Zi'ui] = 0 may lead to more efficient estimators.
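A minimal sketch of two-step estimation based on (6.115) is given below, shown for an assumed exponential-mean residual function; the function names, data shapes, and choice of first-step weighting matrix are illustrative rather than taken from the text.

```python
# Minimal sketch of nonlinear systems GMM based on (6.115): first a 2SLS-type
# weighting matrix, then the two-step estimator with W_N = S^{-1}.
import numpy as np
from scipy import optimize

def resid(yi, Xi, beta):
    return yi - np.exp(Xi @ beta)             # assumed r(y_i, X_i, beta): exponential mean

def gmm_objective(beta, y, X, Z, W):
    # y: (N, G), X: (N, G, K), Z: (N, G, r), W: (r, r) weighting matrix
    m = sum(Z[i].T @ resid(y[i], X[i], beta) for i in range(len(y)))
    return m @ W @ m

def nonlinear_systems_gmm(y, X, Z, beta_start):
    N, r = len(y), Z.shape[2]
    W1 = np.linalg.inv(sum(Z[i].T @ Z[i] for i in range(N)) / N)
    b1 = optimize.minimize(gmm_objective, beta_start, args=(y, X, Z, W1)).x
    S = np.zeros((r, r))
    for i in range(N):
        m_i = Z[i].T @ resid(y[i], X[i], b1)
        S += np.outer(m_i, m_i)
    S /= N
    return optimize.minimize(gmm_objective, beta_start,
                             args=(y, X, Z, np.linalg.inv(S))).x
```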
6.10.5. Nonlinear Simultaneous Equations Systems
The nonlinear simultaneous equations model specifies that the gth of G equations
for the ith of N individuals is given by
uig = rg(yi , xig, βg), g = 1, . . . , G. (6.117)
This is the nonlinear SUR model with regressors that now include dependent variables
from other equations. Unlike the linear SEM, there are few practically useful results to
help ensure that a nonlinear SEM is identified.
Given identification, consistent estimates can be obtained using the GMM estima-
tors presented in the previous section. Alternatively, we can assume that ui ∼ N[0, Ω]
and obtain the nonlinear full-information maximum likelihood estimator. In a de-
parture from the linear SEM, the nonlinear full-information MLE in general has an
asymptotic distribution that differs from NL3SLS, and consistency of the nonlinear
full-information MLE requires that the errors are actually normally distributed. For
details see Amemiya (1985).
Handling endogeneity in nonlinear models can be complicated. Section 16.8 con-
siders simultaneity in Tobit models, where analysis is simpler when the model is linear
in the latent variables. Section 20.6.2 considers a more highly nonlinear example, en-
dogenous regressors in count data models.
6.11. Practical Considerations
Ideally GMM could be implemented using an econometrics package, requiring little
more difficulty and knowledge than that needed, say, for nonlinear least-squares esti-
mation with heteroskedastic errors. However, not all leading econometrics packages
provide a broad GMM module. Depending on the specific application, GMM estima-
tion may require a switch to a more suitable package or use of a matrix programming
language along with familiarity with the algebra of GMM.
A common application of GMM is IV estimation. Most econometrics packages in-
clude linear IV but not all include nonlinear IV estimators. The default standard errors
may assume homoskedastic errors rather than being heteroskedastic-robust. As already
emphasized in Chapter 4, it can be difficult to obtain instruments that are uncorrelated
with the error yet reasonably correlated with the regressor or, in the nonlinear case, the
appropriate derivative of the error with respect to parameters.
Econometrics packages usually include linear systems but not nonlinear systems.
Again, default standard errors may not be robust to heteroskedasticity.
6.12. Bibliographic Notes
Textbook treatments of GMM include chapters by Davidson and MacKinnon (1993, 2004),
Hamilton (1994), and Greene (2003). The more recent books by Hayashi (2000) and
Wooldridge (2002) place considerable emphasis on GMM estimation. Bera and Bilias (2002)
provide a synthesis and history of many of the estimators presented in Chapters 5 and 6.
6.3 The original reference for GMM is Hansen (1982). A good explanation of optimal mo-
ments for GMM is given in the appendix of Arellano (2003). The October 2002 issue of
Journal of Business and Economic Statistics is devoted to GMM estimation.
6.4 The classic treatment of linear IV estimation by Sargan (1958) is a key precursor to GMM.
6.5 The nonlinear 2SLS estimator introduced by Amemiya (1974) generalizes easily to the
GMM estimator.
6.6 Standard references for sequential two-step estimation are Newey (1984), Murphy and
Topel (1985), and Pagan (1986).
6.7 A standard reference for minimum distance estimation is Chamberlain (1982).
6.8 A good overview of empirical likelihood is provided by Mittelhammer, Judge, and Miller
(2000) and key references are Owen (1988, 2001) and Qin and Lawless (1994). Imbens
(2002) provides a review and application of this relatively new method.
6.9 Texts such as Greene’s (2003) provide a more detailed coverage of systems estimation
than that provided here, especially for linear seemingly unrelated regressions and linear
simultaneous equations models.
6.10 Amemiya (1985) presents nonlinear simultaneous equations in detail.
Exercises
6–1 For the gamma regression model of Exercise 5.2, E[y|x] = exp(x'β) and V[y|x] = (exp(x'β))^2/2.
(a) Show that these conditions imply that E[x{(y − exp(x'β))^2 − (exp(x'β))^2/2}] = 0.
(b) Use the moment condition in part (a) to form a method of moments estimator β̂MM.
(c) Give the asymptotic distribution of β̂MM using result (6.13).
(d) Suppose we use the moment condition E[x(y − exp(x'β))] = 0 in addition to that
in part (a). Give the objective function for a GMM estimator of β.
6–2 Consider the linear regression model for data independent over i with yi =
xi'β + ui. Suppose E[ui|xi] ≠ 0 but there are available instruments zi with E[ui|zi] = 0 and V[ui|zi] = σi^2, where dim(z) ≥ dim(x). We consider the GMM estimator β̂ that minimizes
QN(β) = [N^{-1} Σ_i zi(yi − xi'β)]' WN [N^{-1} Σ_i zi(yi − xi'β)].
(a) Derive the limit distribution of √N(β̂ − β0) using the general GMM result (6.11).
(b) State how to obtain a consistent estimate of the asymptotic variance of β̂.
(c) If errors are homoskedastic what choice of WN would you use? Explain your
answer.
(d) If errors are heteroskedastic what choice of WN would you use? Explain your
answer.
6–3 Consider the Laplace intercept-only example at the end of Section 6.3.6, so
y = µ + u. Then GMM estimation is based on E[h(µ)] = 0, where h(µ) = [(y − µ), (y − µ)^3]'.
(a) Using knowledge of the central moments of y given in Section 6.3.6, show
that G0 = E[∂h/∂µ] = [−1, −6]' and that S0 = E[hh'] has diagonal entries 2
and 720 and off-diagonal entries 24.
(b) Hence show that G0'S0^{-1}G0 = 252/432.
(c) Hence show that µ̂OGMM has asymptotic variance 1.7143/N.
(d) Show that the GMM estimator of µ with W = I2 has asymptotic variance
19.14/N.
6–4 This question uses the probit model but requires little knowledge of the model.
Let y denote a binary variable that takes value 0 or 1 according to whether or
not an event occurs, let x denote a regressor vector, and assume independent
observations.
(a) Suppose E[y|x] = Φ(x'β), where Φ(·) is the standard normal cdf. Show that E[(y − Φ(x'β))x] = 0. Hence give the estimating equations for a method of
moments estimator for β.
(b) Will this estimator yield the same estimates as the probit MLE? [For just this
part you need to read Section 14.3.]
(c) Give a GMM objective function corresponding to the estimator in part (a).
That is, give an objective function that yields the same first-order conditions,
up to a full-rank matrix transformation, as those obtained in part (a).
(d) Now suppose that because of endogeneity in some of the components
E[y|x] ≠ Φ(x'β). Assume there exists a vector z, dim[z] ≥ dim[x], such that E[y − Φ(x'β)|z] = 0. Give the objective function for a consistent estimator of
β. The estimator need not be fully efficient.
(e) For your estimator in part (d) give the asymptotic distribution of the estimator.
State clearly any assumptions made on the dgp to obtain this result.
(f) Give the weighting matrix, and a way to calculate it, for the optimal GMM
estimator in part (d).
(g) Give a real-world example of part (d). That is, give a meaningful example of
a probit model with endogenous regressor(s) and valid instrument(s). State
the dependent variable, the endogenous regressor(s), and the instrument(s)
used to permit consistent estimation. [This part is surprisingly difficult.]
6–5 Suppose we impose the constraint that E[wi] = g(θ), where dim[w] ≥ dim[θ].
(a) Obtain the objective function for the GMM estimator.
(b) Obtain the objective function for the minimum distance estimator (see Sec-
tion 6.7) with π = E[wi] and π̂ = w̄.
(c) Show that MD and GMM are equivalent in this example.
6–6 The MD estimator (see Section 6.7) uses the restriction π − g(θ) = 0. Suppose
more generally that the restriction is h(θ, π) = 0 and we estimate using the gen-
eralized MD estimator that minimizes QN(θ) = h(θ, π̂)'WN h(θ, π̂). Adapt (6.68)–(6.70) to show that (6.67) holds with G0 = ∂h(θ, π)/∂θ'|θ0,π0 and V[π̂] replaced by H0'V[π̂]H0, where H0 = ∂h(θ, π)/∂π'|θ0,π0.
6–7 For data generated from the dgp given in Section 6.6.4 with N = 1,000, obtain
NL2SLS estimates and compare these to the two-stage estimates.
C H A P T E R 7
Hypothesis Tests
7.1. Introduction
In this chapter we consider tests of hypotheses, possibly nonlinear in the parameters,
using estimators appropriate for nonlinear models.
The distribution of test statistics can be obtained using the same statistical theory as
that used for estimators, since test statistics like estimators are statistics, that is, func-
tions of the sample. Given appropriate linearization of estimators and hypotheses, the
results closely resemble those for testing linear restrictions in the linear regression
model. The results rely on asymptotic theory, however, and exact t- and F-distributed
test statistics for the linear model under normality are replaced by test statistics that
are asymptotically standard normal distributed (z-tests) or chi-square distributed.
There are two main practical concerns in hypothesis testing. First, tests may have
the wrong size, so that in testing at a nominal significance level of, say, 5%, the ac-
tual probability of rejection of the null hypothesis may be much more or less than
5%. Such a wrong size is almost certain to arise in moderate size samples as the un-
derlying asymptotic distribution theory is only an approximation. One remedy is the
bootstrap method, introduced in this chapter but sufficiently important and broad to be
treated separately in Chapter 11. Second, tests may have low power, so that there is low
probability of rejecting the null hypothesis when it should be rejected. This potential
weakness of tests is often neglected. Size and power are given more prominence here
than in most textbook treatments of testing.
The Wald test, the most widely used testing procedure, is defined in Section 7.2.
Section 7.3 additionally presents the likelihood ratio test and score or Lagrange mul-
tiplier tests, applicable when estimation is by ML. The various tests are illustrated in
Section 7.4. Section 7.5 extends these tests to estimators other than ML, including ro-
bust forms of tests. Sections 7.6, 7.7, and 7.8 present, respectively, test power, Monte
Carlo simulation methods, and the bootstrap.
Methods for determining model specification and selection, rather than hypothesis
tests per se, are given separate treatment in Chapter 8.
7.2. Wald Test
The Wald test, due to Wald (1943), is the preeminent hypothesis test in microecono-
metrics. It requires estimation of the unrestricted model, that is, the model without
imposition of the restrictions of the null hypothesis. The Wald test is widely used be-
cause modern software usually permits estimation of the unrestricted model even if
it is more complicated than the restricted model, and modern software increasingly
provides robust variance matrix estimates that permit Wald tests under relatively weak
distributional assumptions. The usual statistics for tests of statistical significance of
regressors reported by computer packages are examples of Wald test statistics.
This section presents the Wald test of nonlinear hypotheses in considerable detail,
presenting both theory and examples. The closely related delta method, used to form
confidence intervals or regions for nonlinear functions of parameters, is also presented.
A weakness of the Wald test – its lack of invariance to algebraically equivalent param-
eterizations of the null hypothesis – is detailed at the end of the section.
7.2.1. Linear Hypotheses in Linear Models
We first review standard linear model results, as the Wald test is a generalization of the
usual test for linear restrictions in the linear regression model.
The null and alternative hypotheses for a two-sided test of linear restrictions on the
regression parameters in the linear regression model y = Xβ + u are
H0 : Rβ0 − r = 0,
Ha : Rβ0 − r ≠ 0,   (7.1)
where in the notation used here there are h restrictions, R is an h × K matrix of con-
stants of full rank h, β is the K × 1 parameter vector, r is an h × 1 vector of constants,
and h ≤ K.
For example, a joint test that β1 = 1 and β2 − β3 = 2 when K = 4 can be expressed
as (7.1) with
R = [1 0 0 0; 0 1 −1 0],   r = [1 2]'.
The Wald test of Rβ0 − r = 0 is a test of closeness to zero of the sample analogue
Rβ̂ − r, where β̂ is the unrestricted OLS estimator. Under the strong assumption that u ∼ N[0, σ0^2 I], the estimator β̂ ∼ N[β0, σ0^2(X'X)^{-1}] and so
Rβ̂ − r ∼ N[0, σ0^2 R(X'X)^{-1}R'],
under H0, where Rβ0 − r = 0 has led to simplification to a mean of 0. Taking the
quadratic form leads to the test statistic
W1 = (Rβ̂ − r)'[σ0^2 R(X'X)^{-1}R']^{-1}(Rβ̂ − r),
which is exactly χ^2(h) distributed under H0. In practice the test statistic W1 cannot be calculated, however, as σ0^2 is not known.
In large samples replacing σ0^2 by its estimate s^2 does not affect the limit distribution of W1, since this is equivalent to premultiplication of W1 by σ0^2/s^2 and plim(σ0^2/s^2) = 1 (see the Transformation Theorem A.12). Thus
W2 = (Rβ̂ − r)'[s^2 R(X'X)^{-1}R']^{-1}(Rβ̂ − r)   (7.2)
converges to the χ^2(h) distribution under H0.
The test statistic W2 is chi-square distributed only asymptotically. In this linear
example with normal errors an alternative exact small-sample result can be obtained.
A standard result derived in many introductory texts is that
W3 = W2/h
is exactly F(h, N − K) distributed under H0, if s^2 = (N − K)^{-1} Σ_i ûi^2, where ûi is the OLS residual. This is the familiar F-test statistic, which is often reexpressed in
terms of sums of squared residuals.
Exact results such as that for W3 are not possible in nonlinear models, and even in
linear models they require very strong assumptions. Instead, the nonlinear analogue of
W2 is employed, with distributional results that are asymptotic only.
7.2.2. Nonlinear Hypotheses
We consider hypothesis tests of h restrictions, possibly nonlinear in parameters, on
the q × 1 parameter vector θ, where h ≤ q. For linear regression θ = β and q = K.
The null and alternative hypotheses for a two-sided test are
H0 : h(θ0) = 0,
Ha : h(θ0) ≠ 0,   (7.3)
where h(·) is an h × 1 vector function of θ. Note that h(θ) in this chapter is used to
denote the restrictions of the null hypothesis. This should not be confused with the use
of h(w, θ) in the previous chapter to denote the moment conditions used to form an
MM or GMM estimator.
Familiar linear examples include tests of statistical significance of a single coeffi-
cient, h(θ) = θj = 0, and tests of subsets of coefficients, h(θ) = θ2 = 0. A nonlinear
example of a single restriction is h(θ) = θ1/θ2 − 1 = 0. These examples are studied
in later sections.
It is assumed that h(θ) is such that the h × q matrix
R(θ) = ∂h(θ)/∂θ'   (7.4)
is of full rank h when evaluated at θ = θ0. This assumption is equivalent to linear inde-
pendence of restrictions in the linear model, in which case R(θ) = R does not depend
on θ and has rank h. It is also assumed that the parameters are not at the boundary
of the parameter space under the null hypothesis. This rules out, for example, testing
H0 : θ1 = 0 if the model requires θ1 ≥ 0.
7.2.3. Wald Test Statistic
The intuition behind the Wald test is very simple. The obvious test of whether h(θ0) = 0 is to obtain estimate θ̂ without imposing the restrictions and see whether h(θ̂) ≈ 0. If h(θ̂) ~a N[0, V[h(θ̂)]] under H0 then the test statistic
W = h(θ̂)'[V[h(θ̂)]]^{-1}h(θ̂) ~a χ^2(h).
The only complication is finding V[h(θ̂)], which will depend on the restrictions h(·) and the estimator θ̂.
By a first-order Taylor series expansion (see Section 7.2.4) under the null hypothesis, h(θ̂) has the same limit distribution as R(θ0)(θ̂ − θ0), where R(θ) is defined in (7.4). Then h(θ̂) is asymptotically normal under H0 with mean zero and variance matrix R(θ0)V[θ̂]R(θ0)'. A consistent estimate is R̂N^{-1}ĈR̂', where R̂ = R(θ̂) and it is assumed that the estimator θ̂ is root-N consistent with
√N(θ̂ − θ0) →d N[0, C0],   (7.5)
and Ĉ is any consistent estimate of C0.
Common Versions of the Wald Test
The preceding discussion leads to the Wald test statistic
W = N ĥ'[R̂ĈR̂']^{-1}ĥ,   (7.6)
where ĥ = h(θ̂) and R̂ = ∂h(θ)/∂θ'|θ̂. An equivalent expression is W = ĥ'[R̂V̂[θ̂]R̂']^{-1}ĥ, where V̂[θ̂] = N^{-1}Ĉ is the estimated asymptotic variance of θ̂.
The test statistic W is asymptotically χ^2(h) distributed under H0. So H0 is rejected against Ha at significance level α if W > χ^2_α(h) and is not rejected otherwise. Equivalently, H0 is rejected at level α if the p-value, which equals Pr[χ^2(h) > W], is less
than α.
One can also implement the Wald test statistic as an F−test. The Wald asymptotic
F-statistic
F = W/h (7.7)
is asymptotically F(h, N − q) distributed. This yields the same p-value as W in (7.6)
as N → ∞ though in finite samples the p-values will differ. For nonlinear models it
is most common to report W, though F is also used in the hope that it might provide a
better approximation in small samples.
For a test of just one restriction, the square root of the Wald chi-square test is a
standard normal test statistic. This result is useful as it permits testing a one-sided
hypothesis. Specifically, for scalar h(θ) the Wald z-test statistic is
Wz = ĥ / √(r̂N^{-1}Ĉr̂'),   (7.8)
where ĥ = h(θ̂) and r̂ = ∂h(θ)/∂θ'|θ̂ is a 1 × q vector. Result (7.6) implies that
Wz is asymptotically standard normal distributed under H0. Equivalently, Wz is
asymptotically t distributed with (N − q) degrees of freedom, since the t goes to the
normal as N → ∞. So Wz can also be a Wald t-test statistic.
Discussion
The Wald test statistic (7.6) for the nonlinear case has the same form as the linear
model statistic W2 given in (7.2). The estimated deviation from the null hypothesis is
h(θ̂) rather than (Rβ̂ − r). The matrix R is replaced by the estimated derivative matrix R̂, and the assumption that R is of full rank is replaced by the assumption that R0 is of full rank. Finally, the estimated asymptotic variance of the estimator is N^{-1}Ĉ rather than s^2(X'X)^{-1}.
There is a range of possible consistent estimates of C0 (see Section 5.5.2), leading in practice to different computed values of W or F or Wz that are asymptotically equivalent. In particular, C0 is often of the sandwich form A0^{-1}B0A0^{-1}, consistently estimated by a robust estimate Â^{-1}B̂Â^{-1}. An advantage of the Wald test is that it is easy
to robustify to ensure valid statistical inference under relatively weak distributional
assumptions, such as potentially heteroskedastic errors.
Rejection of H0 is more likely the larger is W or F or, for two-sided tests, Wz.
This happens the further h(θ̂) is from the null hypothesis value 0; the more efficient the estimator θ̂, since then Ĉ is small; and the larger the sample size, since then N^{-1}
is small. The last result is a consequence of testing at unchanged significance level
α as sample size increases. In principle one could decrease α as the sample size is
increased. Such penalties for fully parametric models are presented in Section 8.5.1.
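The computation in (7.6) is mechanical once θ̂ and its estimated variance are available; the sketch below is illustrative, with a finite-difference Jacobian standing in for an analytical R̂ and with the numerical values of θ̂ and V̂[θ̂] assumed.

```python
# Minimal sketch of the Wald statistic (7.6) for a possibly nonlinear hypothesis
# h(theta) = 0, using a forward finite-difference estimate of R(theta) = dh/dtheta'.
import numpy as np
from scipy import stats

def num_jacobian(h, theta, eps=1e-6):
    h0 = np.atleast_1d(h(theta))
    J = np.zeros((h0.size, theta.size))
    for j in range(theta.size):
        tp = theta.copy()
        tp[j] += eps
        J[:, j] = (np.atleast_1d(h(tp)) - h0) / eps
    return J

def wald_test(h, theta_hat, V_hat):
    # V_hat is the estimated asymptotic variance of theta_hat, i.e. N^{-1} C_hat
    h_hat = np.atleast_1d(h(theta_hat))
    R_hat = num_jacobian(h, theta_hat)                       # h x q
    W = h_hat @ np.linalg.solve(R_hat @ V_hat @ R_hat.T, h_hat)
    return W, stats.chi2.sf(W, df=h_hat.size)                # statistic and p-value

# Example with hypothetical estimates: test H0: theta1/theta2 - 1 = 0.
theta_hat = np.array([1.2, 1.0, 0.3])
V_hat = np.diag([0.04, 0.03, 0.01])
print(wald_test(lambda t: np.array([t[0] / t[1] - 1.0]), theta_hat, V_hat))
```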
7.2.4. Derivation of the Wald Statistic
By an exact first-order Taylor series expansion around θ0
h(θ̂) = h(θ0) + ∂h/∂θ'|θ+ (θ̂ − θ0),
for some θ+ between θ̂ and θ0. It follows that
√N(h(θ̂) − h(θ0)) = R(θ+)√N(θ̂ − θ0),
where R(θ) is defined in (7.4), which implies that
√N(h(θ̂) − h(θ0)) →d N[0, R0C0R0']   (7.9)
by direct application of the limit normal product rule (Theorem A.7), as R(θ+) →p R0 = R(θ0), and using the limit distribution for √N(θ̂ − θ0) given in (7.5).
Under the null hypothesis (7.9) simplifies since h(θ0) = 0, and hence
√N h(θ̂) →d N[0, R0C0R0']   (7.10)
under H0. One could in theory use this multivariate normal distribution to define a
rejection region, but it is much simpler to transform to a chi-square distribution. Re-
call that z ∼ N[0, Ω] with Ω of full rank implies z'Ω^{-1}z ∼ χ^2(dim(Ω)). Then (7.10)
implies that
N h(θ̂)'[R0C0R0']^{-1}h(θ̂) →d χ^2(h),
under H0, where the matrix inverse in this expression exists by the assumptions that R0
and C0 are of full rank. The Wald statistic defined in (7.6) is obtained upon replacing
R0 and C0 by consistent estimates.
7.2.5. Wald Test Examples
The most common tests are tests of one or more exclusion restrictions. We also provide
an example of a test of a nonlinear hypothesis.
Tests of Exclusion Restrictions
Consider the exclusion restrictions that the last h components of θ are equal to zero.
Then h(θ) = θ2 = 0, where we partition θ = (θ1', θ2')'. It follows that
R(θ) = ∂h(θ)/∂θ' = [∂θ2/∂θ1'  ∂θ2/∂θ2'] = [0 Ih],
where 0 is an h × (q − h) matrix of zeros and Ih is an h × h identity matrix, so
R(θ)C(θ)R(θ)' = [0 Ih] [C11 C12; C21 C22] [0 Ih]' = C22.
The Wald test statistic for exclusion restrictions is therefore
W = θ̂2'[N^{-1}Ĉ22]^{-1}θ̂2,   (7.11)
where N^{-1}Ĉ22 = V̂[θ̂2], and is asymptotically distributed as χ^2(h) under H0.
This test statistic is a generalization of the test of subsets of regressors in the linear
regression model. In that case small-sample results are available if errors are normally
distributed and the related F-test is instead used.
Tests of Statistical Significance
Tests of significance of a single coefficient are tests of whether or not θj , the jth
component of θ, differs from zero. Then h(θ) = θj and r(θ) = ∂h/∂θ' is a vector of zeros except for a jth entry of 1, so (7.8) simplifies to
Wz = θ̂j / se[θ̂j],   (7.12)
where se[θ̂j] = √(N^{-1}ĉjj) is the standard error of θ̂j and ĉjj is the jth diagonal entry in Ĉ.
The test statistic Wz in (7.12) is often called a “t-statistic”, owing to results for
the linear regression model under normality, but strictly speaking it is an asymptotic
“z-statistic.”
For a two-sided test of H0 : θj0 = 0 against Ha : θj0 ≠ 0, H0 is rejected at significance level α if |Wz| > zα/2 and is not rejected otherwise. This yields exactly the same results as the Wald chi-square test, since Wz^2 = W, where W is defined in (7.6), and z^2_{α/2} = χ^2_α(1).
Often there is prior information about the sign of θj. Then one should use a one-sided hypothesis test. For example, suppose it is felt based on economic reasoning or past studies that θj > 0. It makes a difference whether θj > 0 is specified to be the null or the alternative hypothesis. For one-sided tests it is customary to specify the claim made as the alternative hypothesis, as it can be shown that then stronger evidence is required to support the claim. Here H0 : θj0 ≤ 0 is rejected against Ha : θj0 > 0 at significance level α if Wz > zα. Similarly, for a claim that θj < 0, test H0 : θj0 ≥ 0 against Ha : θj0 < 0 and reject H0 at significance level α if Wz < −zα.
Computer output usually gives the p-value for a two-sided test, but in many cases
it is more appropriate to use a one-sided test. If θ̂j has the “correct” sign then the
p-value for the one-sided test is half that reported for a two-sided test.
Tests of Nonlinear Restriction
Consider a test of the single nonlinear restriction
H0 : h(θ) = θ1/θ2 − 1 = 0.
Then R(θ) is a 1 × q vector with first element ∂h/∂θ1 = 1/θ2, second element
∂h/∂θ2 = −θ1/θ2^2, and remaining elements zero. By letting ĉjk denote the jkth element of Ĉ, (7.6) becomes
W = N (θ̂1/θ̂2 − 1)^2 { [1/θ̂2  −θ̂1/θ̂2^2  0] Ĉ [1/θ̂2  −θ̂1/θ̂2^2  0]' }^{-1},
where 0 is a 1 × (q − 2) vector of zeros, yielding
W = N[θ̂2(θ̂1 − θ̂2)]^2 (θ̂2^2 ĉ11 − 2θ̂1θ̂2 ĉ12 + θ̂1^2 ĉ22)^{-1},   (7.13)
which is asymptotically χ^2(1) distributed under H0. Equivalently, √W is asymptoti-
cally standard normal distributed.
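As a check on the algebra, the following snippet evaluates (7.13) and the general form (7.6) at the same assumed estimates and variance matrix; the numerical values are hypothetical.

```python
# Numerical check that the closed form (7.13) equals the general Wald form (7.6)
# for H0: theta1/theta2 - 1 = 0; c11, c12, c22 are entries of C_hat = N * V_hat.
import numpy as np

N = 500
theta = np.array([1.2, 1.0])                      # assumed estimates (theta1, theta2)
V = np.array([[0.04, 0.01], [0.01, 0.03]])        # assumed estimated variance of theta_hat
C = N * V                                         # so that V_hat = N^{-1} C_hat
t1, t2 = theta
c11, c12, c22 = C[0, 0], C[0, 1], C[1, 1]
W_713 = N * (t2 * (t1 - t2))**2 / (t2**2 * c11 - 2 * t1 * t2 * c12 + t1**2 * c22)
R = np.array([1 / t2, -t1 / t2**2])               # derivative vector R_hat
W_76 = (t1 / t2 - 1)**2 / (R @ V @ R)             # general form (7.6), scalar case
print(W_713, W_76)                                # the two expressions coincide
```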
7.2.6. Tests in Misspecified Models
Most treatments of hypothesis testing, including that given in Chapters 7 and 8 of
this book, assume that the null hypothesis model is correctly specified, aside from
relatively minor misspecification that does not affect estimator consistency but requires
robustification of standard errors.
In practice this is a considerable oversimplification. For example, in testing for het-
eroskedastic errors it is assumed that this is the only respect in which the regression
is deficient. However, if the conditional mean is misspecified then the true size of
the test will differ from the nominal size, even asymptotically. Moreover, asymptotic
equivalence of tests, such as that for the Wald, likelihood ratio, and Lagrange mul-
tiplier tests, will no longer hold. The better specified the model, however, the more
useful are the tests.
Also, note that tests often have some power against hypotheses other than the ex-
plicitly stated alternative hypothesis. For example, suppose the null hypothesis model
is y = β1 + β2x + u, where u is homoskedastic. A test of whether to also include z as
a regressor will also have some power against the alternative that the model is nonlin-
ear in x, for example y = β1 + β2x + β3x² + u, if x and z are correlated. Similarly, a
test against heteroskedastic errors will also have some power against nonlinearity in x.
Rejection of the null hypothesis does not mean that the alternative hypothesis model
is the only possible model.
7.2.7. Joint Versus Separate Tests
In applied work one often wants to know which coefficients out of a set of coefficients
are “significant.” When there are several hypotheses under test, one can either do a
joint test or simultaneous test of all hypotheses of interest or perform separate tests
of the hypotheses.
A leading example in linear regression concerns the use of separate t-tests for test-
ing the null hypotheses H10 : β1 = 0 and H20 : β2 = 0 versus using an F-test of the
joint hypothesis H0 : β1 = β2 = 0, where throughout the alternative is that at least
one of the parameters does not equal zero. The F-test is an explicit joint test, with
rejection of H0 if the estimated point (β̂1, β̂2) falls outside an elliptical probability
contour. Alternatively, the two separate t-tests can be conducted. This procedure is an
implicit joint test, called an induced test (Savin, 1984). The separate tests reject H0 if
either H10 or H20 is rejected, which occurs if (β̂1, β̂2) falls outside a rectangle whose
boundaries are the critical values of the two test statistics. Even if the same signifi-
cance level is used to test H0, so that the ellipse and rectangles have the same area,
the rejection regions for the joint and separate tests differ and there is a potential for a
conflict between them. For example, (β̂1, β̂2) may lie within the ellipse but outside the
rectangle.
Let e1 and e2 denote the event of a type I error (see Section 7.6.1) in the two separate
tests, and let eI = e1 ∪ e2 denote the event of a type I error in the induced joint test.
Then Pr[eI] = Pr[e1] + Pr[e2] − Pr[e1 ∩ e2], which implies that

    αI ≤ α1 + α2,    (7.14)

where αI, α1, and α2 denote the sizes of, respectively, the induced joint test, the first
separate test, and the second separate test. In the special case where the separate tests
are statistically independent, Pr[e1 ∩ e2] = Pr[e1] Pr[e2] = α1α2 and hence αI = α1 +
α2 − α1α2. For typically low values of α1 and α2, such as .05 or .01, α1α2 is very
small and the upper bound (7.14) is a good indicator of the size of the test.
A substantial literature on induced tests examines the problem of choosing critical
values for the separate tests such that the induced test has a known size. We do not pur-
sue this issue at length but mention the Bonferroni t-test as an example. The critical
values of this test have been tabulated; see Savin (1984).
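To make the bound concrete, a short Python check (a sketch under the independence assumption just discussed, with hypothetical sizes) compares the exact induced size with the Bonferroni bound (7.14):

    # Exact size of the induced test under independence vs the Bonferroni bound (7.14)
    alpha1, alpha2 = 0.05, 0.05
    exact = alpha1 + alpha2 - alpha1 * alpha2   # Pr[e1 or e2] when the tests are independent
    bound = alpha1 + alpha2
    print(exact, bound)                         # 0.0975 versus 0.10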
Statistically independent tests arise in linear regression with orthogonal regressors
and in likelihood-based testing (see Section 7.3) if relevant parts of the information
matrix are diagonal. Then the induced joint test statistic is based on the two statistically
independent separate test statistics, whereas the explicit joint null test statistic is the
sum of the two separate test statistics. The joint null may be rejected because either
one component or both components of the null are rejected. The use of separate tests
will reveal which situation applies.
In the more general case of correlated regressors or a nondiagonal information ma-
trix, the explicit joint test suffers from the disadvantage that the rejection of the null
does not indicate the source of the rejection. If the induced joint test is used then set-
ting the size of the test requires some variant of the Bonferroni test or approximation
using the upper bound in (7.14). Similar issues also arise when separate tests are ap-
plied sequentially, with each stage conditioned on the outcome of the previous stage.
Section 18.7.1 presents an example with discussion of a joint test of two hypotheses
where the two components of the test are correlated.
7.2.8. Delta Method for Confidence Intervals
The method used to derive the Wald test statistic is called the delta method, as Taylor
series approximation of h(θ̂) entails taking the derivative of h(θ). This method can
also be used to obtain the distribution of a nonlinear combination of parameters and
hence form confidence intervals or regions.
One example is estimating the ratio θ1/θ2 by θ̂1/θ̂2. A second example is prediction
of the conditional mean g(x′β), say, using g(x′β̂). A third example is the estimated
elasticity with respect to change in one component of x.
Confidence Intervals
Consider inference on the parameter vector γ = h(θ) that is estimated by

    γ̂ = h(θ̂),    (7.15)

where the limit distribution of √N(θ̂ − θ0) is that given in (7.5). Then direct ap-
plication of (7.9) yields √N(γ̂ − γ0) →d N[0, R0C0R0′], where R(θ) is defined in
(7.4). Equivalently, we say that γ̂ is asymptotically normally distributed with estimated
asymptotic variance matrix

    V̂[γ̂] = R̂ N⁻¹ Ĉ R̂′,    (7.16)

a result that can be used to form confidence intervals or regions.
In particular, a 100(1 − α)% confidence interval for the scalar parameter γ is

    γ ∈ γ̂ ± zα/2 se[γ̂],    (7.17)

where

    se[γ̂] = √( r̂′ N⁻¹ Ĉ r̂ ),    (7.18)

where r̂ = r(θ̂) and r(θ) = ∂γ/∂θ = ∂h(θ)/∂θ.
Confidence Interval Examples
As an example, suppose that E[y|x] = exp(x′β) and we wish to obtain a confidence
interval for the predicted conditional mean when x = xp. Then h(β) = exp(xp′β), so
∂h/∂β = exp(xp′β)xp and (7.18) yields

    se[exp(xp′β̂)] = exp(xp′β̂) √( xp′ N⁻¹ Ĉ xp ),

where Ĉ is a consistent estimate of the variance matrix in the limit distribution of
√N(β̂ − β0).
As a second example, suppose we wish to obtain a confidence interval for e^β rather
than for β, a scalar coefficient. Then h(β) = e^β, so ∂h/∂β = e^β and (7.18) yields
se[e^β̂] = e^β̂ se[β̂]. This yields a 95% confidence interval for e^β of e^β̂ ± 1.96 e^β̂ se[β̂].
The delta method is not always the best method to obtain a confidence interval,
because it restricts the confidence interval to being symmetric about γ̂. Moreover, in
the preceding example the confidence interval can include negative values even though
e^β > 0. An alternative confidence interval is obtained by exponentiation of the terms
in the confidence interval for β. Then

    Pr[ β̂ − 1.96 se[β̂] < β < β̂ + 1.96 se[β̂] ] = 0.95
    ⇒ Pr[ exp(β̂ − 1.96 se[β̂]) < e^β < exp(β̂ + 1.96 se[β̂]) ] = 0.95.

This confidence interval has the advantage of being asymmetric and including only
positive values. This transformation is often used for confidence intervals for slope
parameters in binary outcome models and in duration models. The approach can be
generalized to other transformations γ = h(θ), provided h(·) is monotonic.
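The following Python fragment contrasts the two intervals for e^β; it is an illustrative sketch in which the estimate and standard error are hypothetical numbers, not values from the text:

    import numpy as np

    beta_hat, se_beta = 0.5, 0.3      # hypothetical estimate and standard error of beta
    z = 1.96                          # critical value for a 95% interval

    # Delta method: se[exp(beta_hat)] = exp(beta_hat) * se[beta_hat]
    g_hat = np.exp(beta_hat)
    se_g = g_hat * se_beta
    delta_ci = (g_hat - z * se_g, g_hat + z * se_g)

    # Transformation method: exponentiate the endpoints of the interval for beta
    transf_ci = (np.exp(beta_hat - z * se_beta), np.exp(beta_hat + z * se_beta))

    print(delta_ci)    # symmetric about exp(beta_hat); can include negative values
    print(transf_ci)   # asymmetric and always positive

The transformed interval is the one usually reported for odds ratios and hazard ratios, which is the binary outcome and duration model usage noted above.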
7.2.9. Lack of Invariance of the Wald Test
The Wald test statistic is easily obtained, provided estimates of the unrestricted model
can be obtained, and is no less powerful than other possible test procedures, as dis-
cussed in later sections. For these reasons it is the most commonly used test procedure.
However, the Wald test has a fundamental problem: It is not invariant to alge-
braically equivalent parameterizations of the null hypothesis. For example, consider
the example of Section 7.2.5. Then H0 : θ1/θ2 − 1 = 0 can equivalently be expressed
as H0 : θ1 − θ2 = 0, leading to Wald chi-square test statistic

    W* = N(θ̂1 − θ̂2)² (ĉ11 − 2ĉ12 + ĉ22)⁻¹,    (7.19)

which differs from W in (7.13). The statistics W and W* can differ substantially in
finite samples, even though asymptotically they are equivalent. The small-sample dif-
ference can be quite substantial, as demonstrated in a Monte Carlo exercise by Gregory
and Veall (1985), who considered a very similar example. For tests with nominal size
0.05, one variant of the Wald test had actual size between 0.04 and 0.06 across all sim-
ulations, so asymptotic theory provided a good small-sample approximation, whereas
an alternative asymptotically equivalent variant of the Wald test had actual size that in
some simulations exceeded 0.20.
Phillips and Park (1988) explained the differences by showing that, although differ-
ent representations of the null hypothesis restrictions have the same chi-square distri-
bution using conventional asymptotic methods, they have different asymptotic distri-
butions using a more refined asymptotic theory based on Edgeworth expansions (see
Section 11.4.3). Furthermore, in particular settings such as the previous example, the
Edgeworth expansions can be used to indicate parameterizations of H0 and regions
of the parameter space where the usual asymptotic theory is likely to provide a poor
small-sample approximation.
The lesson is that care is needed when nonlinear restrictions are being tested. As
a robustness check one can perform several Wald tests using different algebraically
equivalent representations of the null hypothesis restrictions. If these lead to substan-
tially different conclusions there may be a problem. One solution is to perform a boot-
strap version of the Wald test. This can provide better small-sample performance and
eliminate much of the difference between Wald tests that use different representations
of H0, because from Section 11.4.4 the bootstrap essentially implements an Edgeworth
expansion. A second solution is to use other testing methods, given in the next section,
that are invariant to different representations of H0.
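The lack of invariance is easy to see numerically. The following Python sketch evaluates the two algebraically equivalent Wald statistics (7.13) and (7.19) at the same estimates; the sample size, estimates, and covariance matrix below are hypothetical values chosen purely for illustration:

    import numpy as np

    N = 50
    theta1, theta2 = 1.3, 0.4              # hypothetical unrestricted estimates
    C = np.array([[2.0, 0.5],              # hypothetical estimate of C in (7.5)
                  [0.5, 1.5]])
    c11, c12, c22 = C[0, 0], C[0, 1], C[1, 1]

    # Wald statistic (7.13) for H0: theta1/theta2 - 1 = 0
    W_ratio = N * (theta2 * (theta1 - theta2)) ** 2 / (
        theta2 ** 2 * c11 - 2 * theta1 * theta2 * c12 + theta1 ** 2 * c22)

    # Wald statistic (7.19) for the algebraically equivalent H0: theta1 - theta2 = 0
    W_diff = N * (theta1 - theta2) ** 2 / (c11 - 2 * c12 + c22)

    print(W_ratio, W_diff)   # asymptotically equivalent, yet unequal in any finite sample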
7.3. Likelihood-Based Tests
In this section we consider hypothesis testing when the likelihood function is known,
that is, the distribution is fully specified. There are then three classical statistical tech-
niques for testing hypotheses – the Wald test, the likelihood ratio (LR) test, and the
Lagrange multiplier (LM) test. A fourth test, the C(α) test, due to Neyman (1959), is
less commonly used and is not presented here; see Davidson and MacKinnon (1993).
All four tests are asymptotically equivalent, so one chooses among them based on ease
of computation and on finite-sample performance. We also do not cover the smooth
test of Neyman (1937), which Bera and Ghosh (2002) argue is optimal and is as fun-
damental as the other tests.
These results assume correct specification of the likelihood function. Extension to
tests based on quasi-ML estimators, as well as on m-estimators and efficient GMM
estimators, is given in Section 7.5.
7.3.1. Wald, Likelihood Ratio, and Lagrange Multiplier (Score) Tests
Let L(θ) denote the likelihood function, the joint conditional density of y given X and
parameters θ. We wish to test the null hypothesis given in (7.3) that h(θ0) = 0.
Tests other than the Wald test require estimation that imposes the restrictions of the
null hypothesis. Define the estimators

    θ̂u (unrestricted MLE),    θ̂r (restricted MLE).    (7.20)

The unrestricted MLE θ̂u maximizes ln L(θ); it was more simply denoted θ̂ in ear-
lier discussion of the Wald test. The restricted MLE θ̂r maximizes the Lagrangian
ln L(θ) − λ′h(θ), where λ is an h × 1 vector of Lagrangian multipliers. In the simple
case of exclusion restrictions h(θ) = θ2 = 0, where θ = (θ1′, θ2′)′, the restricted MLE
is θ̂r = (θ̂1r′, 0′)′, where θ̂1r is obtained simply as the maximum with respect to θ1 of
the restricted likelihood ln L(θ1, 0) and 0 is a (q − h) × 1 vector of zeros.
We motivate and define the three test statistics here, with derivation deferred to
Section 7.3.3. All three test statistics converge in distribution to χ²(h) under H0. So
H0 is rejected at significance level α if the computed test statistic exceeds χ²α(h).
Equivalently, reject H0 at level α if p ≤ α, where p = Pr[χ²(h) > t] is the p-value
and t is the computed value of the test statistic.
Likelihood Ratio Test
The motivation for the LR test statistic is that if H0 is true, the unconstrained and
constrained maxima of the log-likelihood function should be the same. This suggests
using a function of the difference between ln L(θ̂u) and ln L(θ̂r).
Implementation requires obtaining the limit distribution of this difference. It can be
shown that twice the difference is asymptotically chi-square distributed under H0. This
leads immediately to the likelihood ratio test statistic

    LR = −2[ ln L(θ̂r) − ln L(θ̂u) ].    (7.21)
Wald Test
The motivation for the Wald test is that if H0 is true, the unrestricted MLE θ̂u should
satisfy the restrictions of H0, so h(θ̂u) should be close to zero.
Implementation requires obtaining the asymptotic distribution of h(θ̂u). The general
form of the Wald test is given in (7.6). Specialization occurs for the MLE because by
the IM equality V[θ̂u] = −N⁻¹A0⁻¹, where

    A0 = plim N⁻¹ ∂² ln L/∂θ∂θ′ |θ0.    (7.22)

This leads to the Wald test statistic

    W = −N ĥ′ [ R̂ Â⁻¹ R̂′ ]⁻¹ ĥ,    (7.23)

where ĥ = h(θ̂u), R̂ = R(θ̂u), R(θ) = ∂h(θ)/∂θ′, and Â is a consistent estimate of A0.
The minus sign appears since A0 is negative definite.
Lagrange Multiplier Test or Score Test
One motivation for the LM test statistic is that the gradient ∂ ln L/∂θ|θ̂u = 0 at the
maximum of the likelihood function. If H0 is true, then this maximum should also
occur at the restricted MLE (i.e., ∂ ln L/∂θ|θ̂r ≈ 0) because imposing the constraint
will have little impact on the estimated value of θ. Using this motivation LM is called
the score test because ∂ ln L/∂θ is the score vector.
An alternative motivation is to measure the closeness to zero of the Lagrange mul-
tipliers of the constrained optimization problem for the restricted MLE. Maximizing
ln L(θ) − λ′h(θ) with respect to θ implies that

    ∂ ln L/∂θ |θ̂r = ∂h(θ)′/∂θ |θ̂r × λ̂r.    (7.24)

It follows that tests based on the estimated Lagrange multipliers λ̂r are equivalent to
tests based on the score ∂ ln L/∂θ|θ̂r, since ∂h/∂θ′ is assumed to be of full rank.
Implementation requires obtaining the asymptotic distribution of ∂ ln L/∂θ|θ̂r. This
leads to the Lagrange multiplier test or score test statistic

    LM = −N⁻¹ (∂ ln L/∂θ |θ̂r)′ Â⁻¹ (∂ ln L/∂θ |θ̂r),    (7.25)

where Â is a consistent estimate of A0 in (7.22) evaluated at θ̂r rather than θ̂u.
The LM test, due to Aitchison and Silvey (1958) and Silvey (1959), is equivalent to
the score test, due to Rao (1947). The test statistic LM is usually derived by obtaining
an analytical expression for the score rather than the Lagrange multipliers. Econome-
tricians usually call the test an LM test, even though a clearer terminology is to call it
a score test.
Discussion
Good intuition is provided by the expository graphical treatment of the three tests by
Buse (1982) that views all three tests as measuring the change in the log-likelihood.
Here we provide a verbal summary.
Consider a scalar parameter θ and a Wald test of whether θ0 − θ* = 0. Then a given
departure of θ̂u from θ* will translate into a larger change in ln L, the more curved
is the log-likelihood function. A natural measure of curvature is the second derivative
H(θ) = ∂² ln L/∂θ². This suggests W = −(θ̂u − θ*)² H(θ̂u). The statistic W in (7.23)
can be viewed as a generalization to vector θ and more general restrictions h(θ0) with
NÂ measuring the curvature.
For the score test Buse shows that a given value of ∂ ln L/∂θ|θ̂r translates into a
larger change in ln L, the less curved is the log-likelihood function. This leads to use
of (NÂ)⁻¹ in (7.25). And the statistic LR directly compares the log-likelihoods.
An Illustration
To illustrate the three tests consider an iid example with yi ∼ N[μ0, 1] and test of
H0 : μ0 = μ*. Then μ̂u = ȳ and μ̂r = μ*.
For the LR test, ln L(μ) = −(N/2) ln 2π − (1/2) Σi (yi − μ)², and some algebra yields

    LR = 2[ln L(ȳ) − ln L(μ*)] = N(ȳ − μ*)².

The Wald test is based on whether ȳ − μ* ≈ 0. Here it is easy to show that ȳ −
μ* ∼ N[0, 1/N] under H0, leading to the quadratic form

    W = (ȳ − μ*)[1/N]⁻¹(ȳ − μ*).

This simplifies to N(ȳ − μ*)² and so here W = LR.
The LM test is based on closeness to zero of ∂ ln L(μ)/∂μ|μ* = Σi (yi − μ)|μ* =
N(ȳ − μ*). This is just a rescaling of (ȳ − μ*) so LM = W. More formally, Â(μ*) = −1
since ∂² ln L(μ)/∂μ² = −N and (7.25) yields

    LM = N⁻¹(N(ȳ − μ*))[1]⁻¹(N(ȳ − μ*)).

This also simplifies to N(ȳ − μ*)² and verifies that LM = W = LR.
Despite their quite different motivations, the three test statistics are equivalent here.
This exact equivalence is special to this example with constant curvature owing to a
log-likelihood quadratic in μ. More generally the three test statistics differ in finite
samples but are equivalent asymptotically (see Section 7.3.4).
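A quick numerical verification of this equivalence (an illustrative sketch using simulated data, with the seed and sample size chosen arbitrarily) is straightforward:

    import numpy as np

    rng = np.random.default_rng(0)
    N, mu_star = 40, 0.0
    y = rng.normal(loc=mu_star, scale=1.0, size=N)   # data generated under H0
    ybar = y.mean()

    def loglik(mu):
        return -N / 2 * np.log(2 * np.pi) - 0.5 * np.sum((y - mu) ** 2)

    LR = 2 * (loglik(ybar) - loglik(mu_star))
    W = N * (ybar - mu_star) ** 2
    LM = (N * (ybar - mu_star)) ** 2 / N
    print(LR, W, LM)    # identical here because ln L is exactly quadratic in mu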
7.3.2. Poisson Regression Example
Consider testing exclusion restrictions in the Poisson regression model introduced in
Section 5.2. This example is mainly pedagogical as in practice one should perform
statistical inference for count data under weaker distributional assumptions than those
of the Poisson model (see Chapter 20).
If y given x is Poisson distributed with conditional mean exp(x′β) then the log-
likelihood function is

    ln L(β) = Σi [ −exp(xi′β) + yi xi′β − ln yi! ].    (7.26)

For h exclusion restrictions the null hypothesis is H0 : h(β) = β2 = 0, where β =
(β1′, β2′)′.
The unrestricted MLE β̂ maximizes (7.26) with respect to β and has first-order
conditions Σi (yi − exp(xi′β))xi = 0. The limit variance matrix is −A⁻¹, where

    A = −plim N⁻¹ Σi exp(xi′β) xi xi′.

The restricted MLE is β̃ = (β̃1′, 0′)′, where β̃1 maximizes (7.26) with respect to β1,
with xi′β replaced by x1i′β1 since β2 = 0. Thus β̃1 solves the first-order conditions
Σi (yi − exp(x1i′β1))x1i = 0.
The LR test statistic (7.21) is easily calculated from the fitted log-likelihoods of the
restricted and unrestricted models.
The Wald test statistic for exclusion restrictions from Section 7.2.5 is W =
−N β̂2′ (Â²²)⁻¹ β̂2, where Â²² is the (2,2) block of Â⁻¹ and Â = −N⁻¹ Σi exp(xi′β̂)xi xi′.
The LM test is based on ∂ ln L(β)/∂β = Σi xi (yi − exp(xi′β)). At the restricted
MLE this equals Σi xi ũi, where ũi = yi − exp(x1i′β̃1) is the residual from estimation
of the restricted model. The LM test statistic (7.25) is

    LM = [Σi xi ũi]′ [Σi exp(x1i′β̃1) xi xi′]⁻¹ [Σi xi ũi].    (7.27)

Some further simplification is possible since Σi x1i ũi = 0 from the first-order condi-
tions for the restricted MLE given earlier. The LM test here is based on the correlation
between the omitted regressors and the residual, a result that is extended to other ex-
amples in Section 7.3.5.
In general it can be difficult to obtain an algebraic expression for the LM test. For
standard applications of the LM test this has been done and is incorporated into com-
puter packages. Computation by auxiliary regression may also be possible (see Sec-
tion 7.3.5).
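As a concrete but purely illustrative sketch, the fragment below fits unrestricted and restricted Poisson regressions by Newton's method on simulated data and then forms the LR statistic (7.21) and the LM statistic (7.27). The simulated design, the Newton routine, and all variable names are assumptions made for the sketch, not material from the text:

    import numpy as np
    from scipy.special import gammaln

    rng = np.random.default_rng(1)
    N = 200
    x1 = np.column_stack([np.ones(N), rng.normal(size=N)])  # regressors kept under H0
    x2 = rng.normal(size=(N, 1))                            # regressor excluded under H0
    X = np.column_stack([x1, x2])
    y = rng.poisson(np.exp(X @ np.array([0.0, 0.1, 0.0])))  # dgp satisfies H0

    def poisson_mle(y, X, iters=25):
        # Newton-Raphson for the Poisson MLE of (7.26)
        b = np.zeros(X.shape[1])
        for _ in range(iters):
            mu = np.exp(X @ b)
            b += np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
        mu = np.exp(X @ b)
        return b, np.sum(-mu + y * (X @ b) - gammaln(y + 1))

    b_u, ll_u = poisson_mle(y, X)        # unrestricted MLE
    b_r1, ll_r = poisson_mle(y, x1)      # restricted MLE (coefficient on x2 set to 0)

    LR = -2 * (ll_r - ll_u)              # equation (7.21)

    u_r = y - np.exp(x1 @ b_r1)          # restricted-model residuals
    s = X.T @ u_r                        # score of the unrestricted model at the restricted MLE
    H = (X * np.exp(x1 @ b_r1)[:, None]).T @ X
    LM = s @ np.linalg.solve(H, s)       # equation (7.27)
    print(LR, LM)                        # both asymptotically chi-square(1) under H0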
7.3.3. Derivation of Tests
The distribution of the Wald test was formally derived in Section 7.2.4. Proofs for the
likelihood ratio and Lagrange multiplier tests are more complicated and we merely
sketch them here.
Likelihood Ratio Test
For simplicity consider the special case where the null hypothesis is θ = θ*, so that
there is no estimation error in θ̂r = θ*. Taking a second-order Taylor series expansion
of ln L(θ) about ln L(θ̂u) yields

    ln L(θ) = ln L(θ̂u) + ∂ ln L/∂θ′ |θ̂u (θ − θ̂u)
              + (1/2)(θ − θ̂u)′ ∂² ln L/∂θ∂θ′ |θ̂u (θ − θ̂u) + R,

where R is a remainder term. Since ∂ ln L/∂θ|θ̂u = 0 by the first-order conditions, this
implies upon rearrangement that

    −2[ ln L(θ) − ln L(θ̂u) ] = −(θ − θ̂u)′ ∂² ln L/∂θ∂θ′ |θ̂u (θ − θ̂u) + R.    (7.28)

The right-hand side of (7.28) is χ²(h) under H0 : θ = θ* since by standard results
√N(θ̂u − θ) →d N[0, −(plim N⁻¹ ∂² ln L/∂θ∂θ′)⁻¹]. For derivation of the limit dis-
tribution of LR in the general case see, for example, Amemiya (1985, p. 143).
A reason for preferring LR is that by the Neyman–Pearson (1933) lemma the uni-
formly most powerful test for testing a simple null hypothesis versus simple alternative
hypothesis is a function of the likelihood ratio L(θ̂r)/L(θ̂u), though not necessarily the
specific function −2 ln(L(θ̂r)/L(θ̂u)) that equals LR given in (7.21) and gives the test
statistic its name.
LM or Score Test
By a first-order Taylor series expansion

    (1/√N) ∂ ln L/∂θ |θ̂r = (1/√N) ∂ ln L/∂θ |θ0 + (1/N) ∂² ln L/∂θ∂θ′ √N(θ̂r − θ0),

and both terms in the right-hand side contribute to the limit distribution. Then the
χ²(h) distribution of LM defined in (7.25) follows since it can be shown that

    R0 A0⁻¹ (1/√N) ∂ ln L/∂θ |θ̂r →d N[0, R0 A0⁻¹ B0 A0⁻¹ R0′],    (7.29)
where details are provided in Wooldridge (2002, p. 365), for example, and R0 and A0
are defined in (7.4) and (7.22) and

    B0 = plim N⁻¹ ∂ ln L/∂θ × ∂ ln L/∂θ′ |θ0.    (7.30)

Result (7.29) leads to a chi-square statistic that is much more complicated
than (7.25), but simplification to (7.25) then occurs by the information matrix
equality.
7.3.4. Which Test?
Choice of test procedure is usually made based on existence of robust versions, finite-
sample performance, and ease of computation.
Asymptotic Equivalence
All three test statistics are asymptotically distributed as χ²(h) under H0. Further-
more, all three can be shown to be noncentral χ²(h; λ) distributed with the same
noncentrality parameter under local alternatives. Details are provided for the Wald
test in Section 7.6.3. So the tests all have the same asymptotic power against local
alternatives.
The finite-sample distributions of the three statistics differ. In the linear regression
model with normality, a variant of the Wald test statistic for h linear restrictions on
θ exactly equals the F(h, N − K) statistic (see Section 7.2.1) whereas no analytical
results exist for the LR and LM statistics. More generally, in nonlinear models exact
small-sample results do not exist.
In some cases an ordering of the values taken by the three test statistics can be
obtained. In particular for tests of linear restrictions in the linear regression model
under normality, Berndt and Savin (1977) showed that Wald ≥ LR ≥ LM. This result
is of little theoretical consequence, as the test least likely to reject under the null will
have the smallest actual size but also the smallest power. However, it is of practical
consequence for the linear model, as it means when testing at fixed nominal size α
that the Wald test will always reject H0 more often than the LR, which in turn will
reject more often than the LM test. The Wald test would be preferred by a researcher
determined to reject H0. This result is restricted to linear models.
Invariance to Reparameterization
The Wald test is not invariant to algebraically equivalent parameterizations of the null
hypothesis (see Section 7.2.9) whereas the LR test is invariant. Some but not all ver-
sions of the LM test are invariant. The LM test is generally invariant if the expected
Hessian (see Section 5.5.2) is used to estimate A0 and not invariant if the Hessian is
used. The test LM∗
defined later in (7.34) is invariant. The lack of invariance for the
Wald test is a major weakness.
Robust Versions
In some cases with misspecified density the quasi-MLE (see Section 5.7) remains con-
sistent. The Wald test is then easily robustified (see Section 7.2). The LM test can be
robustified with more difficulty; see (7.38) in Section 7.5.1 for a general result for m-
estimators and Section 8.4 for some robust LM test examples. The LR test is no longer
chi-square distributed, except in a special case given later in (7.39). Instead, the LR
test is a mixture of chi-squares (see Section 8.5.3).
Convenience
Convenience in computation is also a consideration. LR requires estimation of the
model twice, once with and once without the restrictions of the null hypothesis. If
done by a package, it is easily implemented as one need only read off the printed log-
likelihood routinely printed out, subtract, and multiply by 2. Wald requires estimation
only under Ha and is best to use when the unrestricted model is easy to estimate. For
example, this is the case for restrictions on the parameters of the conditional mean
in nonlinear models such as NLS, probit, Tobit, and logit. The LM statistic requires
estimation only under H0 and is best to use when the restricted model is easy to esti-
mate. Examples are tests for autocorrelation and heteroskedasticity, where it is easiest
to estimate the null hypothesis model that does not have these complications.
The Wald test is often used for tests of statistical significance whereas the LM test
is often used for tests of correct model specification.
7.3.5. Interpretation and Computation of the LM test
Lagrange multiplier tests have the additional advantages of simple interpretation in
some leading examples and computation by auxiliary regression.
In this section attention is restricted to the usual cross-section data case of a scalar
dependent variable independent over i, so that ∂ ln L(θ)/∂θ = Σi si(θ), where

    si(θ) = ∂ ln f(yi|xi, θ)/∂θ    (7.31)

is the contribution of the ith observation to the score vector of the unrestricted model.
From (7.25) the LM test is a test of the closeness to zero of Σi si(θ̂r).
Simple Interpretation of the LM Test
Suppose that the density is such that s(θ) factorizes as

    s(θ) = g(x, θ) r(y, x, θ)    (7.32)

for some q × 1 vector function g(·) and scalar function r(y, x, θ), the latter of which
may be interpreted as a generalized residual because y appears in r(·) but not g(·). For
example, for Poisson regression ∂ ln f/∂θ = x(y − exp(x′β)).
Given (7.32) and independence over i, ∂ ln L/∂θ|θ̂r = Σi g̃i r̃i, where g̃i =
g(xi, θ̂r) and r̃i = r(yi, xi, θ̂r). The LM test can therefore be simply interpreted as
a score test of the correlation between g̃i and the residual r̃i. This interpretation was
given in Section 7.3.2 for the LM test with Poisson regression, where g̃i = xi and
r̃i = yi − exp(x1i′β̃1).
The partition (7.32) will arise whenever f(y) is based on a one-parameter den-
sity. In particular, many common likelihood models are based on one-parameter LEF
densities, with parameter μ then modeled as a function of x and β. In the LEF case
r(y, x, θ) = (y − E[y|x]) (see Section 5.7.3), so the generalized residual r(·) in (7.32)
is then the usual residual.
More generally a partition similar to (7.32) will also arise when f(y) is based on a
two-parameter density, the information matrix is block diagonal in the two parameters,
and the two parameters in turn depend on regressors and parameter vectors β and α
that are distinct. Then LM tests on β are tests of correlation of g̃βi and r̃βi, where
s(β) = gβ(x, θ)rβ(y, x, θ), with similar interpretation for LM tests on α.
A leading example is linear regression under normality with two parameters μ and
σ² modeled as μ = x′β and σ² = α or σ² = σ²(z, α). For exclusion restrictions in lin-
ear regression under normality, si(β) = xi(yi − xi′β) and the LM test is one of correla-
tion between regressors xi and the restricted model residual ũi = yi − x1i′β̃1. For tests
of heteroskedasticity with σi² = exp(α1 + zi′α2), si(α) = (1/2) zi [(yi − xi′β)²/σi² − 1],
and the LM test is one of correlation between zi and the squared residual ûi² =
(yi − xi′β̂)², since σi² is constant under the null hypothesis that α2 = 0.
Outer Product of the Gradient Versions of the LM Test
Now return to the general si(θ) defined in (7.31). We show in the following that an
asymptotically equivalent version of the LM test statistic (7.25) can be obtained by
running the auxiliary regression or artificial regression

    1 = s̃i′γ + vi,    (7.33)

where s̃i = si(θ̂r), and computing

    LM* = N Ru²,    (7.34)

where Ru² is the uncentered R² defined after (7.36). LM* is asymptotically χ²(h) under
H0. Equivalently, LM* equals ESSu, the uncentered explained sum of squares (the sum
of squares of the fitted values), or equals N − RSS, where RSS is the residual sum of
squares, from regression (7.33).
This result can be easy to implement as in many applications it can be quite simple
to analytically obtain si(θ), generate data for the q components s̃1i, . . . , s̃qi, and regress
1 on s̃1i, . . . , s̃qi. Note that here f(yi|xi, θ) in (7.31) is the density of the unrestricted
model.
For the exclusion restrictions in the Poisson model example in Section 7.3.2,
si(β) = (yi − exp(xi′β))xi and xi′β̃r = x1i′β̃1r. It follows that LM* can be computed
as N Ru² from regressing 1 on (yi − exp(x1i′β̃1r))xi, where xi contains both x1i and x2i,
and β̃1r is obtained from Poisson regression of yi on x1i alone.
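As a small self-contained Python sketch (simulated data and variable names are assumptions, not from the text), the OPG form takes only a few lines when the restricted model is an intercept-only Poisson regression, whose MLE has the closed form exp(β̃1) = ȳ:

    import numpy as np

    rng = np.random.default_rng(4)
    N = 200
    x2 = rng.normal(size=N)
    y = rng.poisson(np.exp(0.0 + 0.0 * x2))          # dgp satisfies H0: coefficient on x2 is 0

    # Restricted Poisson MLE (intercept only) is b1 = log(mean(y)),
    # so the restricted-model residual is y_i - mean(y).
    u = y - y.mean()
    S = np.column_stack([np.ones(N), x2]) * u[:, None]   # rows are scores s_i at the restricted MLE

    # Auxiliary regression (7.33): regress 1 on the scores; LM* = N * uncentered R^2 = ESS_u
    ones = np.ones(N)
    coef, *_ = np.linalg.lstsq(S, ones, rcond=None)
    LM_star = (S @ coef) @ (S @ coef)                    # uncentered explained sum of squares
    print(LM_star)                                        # asymptotically chi-square(1) under H0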
Equations (7.33) and (7.34) require only independence over i. Other auxiliary re-
gressions are possible if further structure is assumed. In particular, specialize to cases
where s(θ) factorizes as in (7.32), and define r(y, x, θ) so that V[r(y, x, θ)] = 1. Then
an alternative asymptotically equivalent version of the LM test is N Ru² from regression
of r̃i on g̃i. This includes LM tests for linear regression under normality, such as the
Breusch–Pagan LM test for heteroskedasticity.
These alternative versions of the LM test are called outer-product-of-the-gradient
versions of the LM test, as they replace −A0 in (7.22) by an outer-product-of-the-
gradient (OPG) estimate or BHHH estimate of B0. Although they are easily computed,
OPG variants of LM tests can have poor small-sample properties with large size distor-
tions. This has discouraged use of the OPG form of the LM test. These small-sample
problems can be greatly reduced by bootstrapping (see Section 11.6.3). Davidson and
MacKinnon (1984) propose double-length auxiliary regressions that also perform bet-
ter in finite samples.
Derivation of the OPG Version
To derive LM*, first note that in (7.25), ∂ ln L(θ)/∂θ|θ̂r = Σi s̃i. Second, by the
information matrix equality A0 = −B0 and, from Section 5.5.2, B0 can be consis-
tently estimated under H0 by the OPG estimate or BHHH estimate N⁻¹ Σi s̃i s̃i′. Com-
bining these results gives an asymptotically equivalent version of the LM test sta-
tistic (7.25):

    LM* = (Σi s̃i)′ (Σi s̃i s̃i′)⁻¹ (Σi s̃i).    (7.35)

This statistic can be computed from an auxiliary regression of 1 on s̃i as follows.
Define S̃ to be the N × q matrix with ith row s̃i′, and define l to be the N × 1 vector of
ones. Then

    LM* = l′S̃[S̃′S̃]⁻¹S̃′l = ESSu = N Ru².    (7.36)

In general for regression of y on X the uncentered explained sum of squares (ESSu)
is y′X(X′X)⁻¹X′y, which is exactly of the form (7.36), whereas the uncentered R² is
Ru² = y′X(X′X)⁻¹X′y/y′y, which here is (7.36) divided by l′l = N. The term uncen-
tered is used because in Ru² division is by the sum of squared deviations of y around
zero rather than around the sample mean.
7.4. Example: Likelihood-Based Hypothesis Tests
The various test procedures – Wald, LR, and LM – are illustrated using generated data
from the dgp y|x Poisson distributed with mean exp(β1 + β2x2 + β3x3 + β4x4), where
β1 = 0 and β2 = β3 = β4 = 0.1 and the three regressors are iid draws from N[0, 1].
Table 7.1. Test Statistics for Poisson Regression Example^a

    Null Hypothesis             Wald      LR       LM       LM*      ln L       Result at level 0.05
    H10: β3 = 0                 5.904     5.754    5.916    6.218    −241.648   Reject
                                (0.015)   (0.016)  (0.015)  (0.013)
    H20: β3 = 0, β4 = 0         8.570     8.302    8.575    9.186    −242.922   Reject
                                (0.014)   (0.016)  (0.014)  (0.010)
    H30: β3 − β4 = 0            0.293     0.293    0.293    0.315    −238.918   Do not reject
                                (0.588)   (0.589)  (0.588)  (0.575)
    H40: β3/β4 − 1 = 0          0.158     0.293    0.293    0.315    −238.918   Do not reject
                                (0.691)   (0.589)  (0.588)  (0.575)

a The dgp for y is the Poisson distribution with parameter exp(0.0 + 0.1x2 + 0.1x3 + 0.1x4) and sample size
N = 200. Test statistics are given with associated p-values in parentheses. Tests of the second hypothesis are
χ²(2) and the other tests are χ²(1) distributed. Log-likelihoods for restricted ML estimation are also given; the
log-likelihood in the unrestricted model is −238.772.
Poisson regression of y on an intercept, x2, x3, and x4 for a generated sample of size
200 yielded unrestricted MLE

    Ê[y|x] = exp(−0.165 − 0.028 x2 + 0.163 x3 + 0.103 x4),
             (−2.14)   (−0.36)    (2.43)     (0.08)

where associated t-statistics are given in parentheses and the unrestricted log-
likelihood is −238.772.
The analysis tests four different hypotheses, detailed in the first column of Table 7.1.
The estimator is nonlinear, whereas the hypotheses are examples of, respectively, sin-
gle exclusion restriction, multiple exclusion restriction, linear restrictions, and nonlin-
ear restrictions. The remainder of the table gives four asymptotically equivalent test
statistics of these hypotheses and their associated p-values. For this sample all tests re-
ject the first two hypotheses and do not reject the remaining two, at significance level
0.05.
The Wald test statistic is computed using (7.23). This requires estimation of the un-
restricted model, given previously, to obtain the variance matrix estimate of the unre-
stricted MLE. Wald tests of different hypotheses then require computation of different
h and R and simplify in some cases. The Wald chi-square test of the single exclu-
sion restriction is just the square of the usual t-test, with 2.43² ≈ 5.90. The Wald test
statistic of the joint exclusion restrictions is detailed in Section 7.2.5. Here x3 is sta-
tistically significant and x4 is statistically insignificant, whereas jointly x3 and x4 are
statistically significant at level 0.05. The Wald test for the third hypothesis is given in
(7.19) and leads to nonrejection. The third and fourth hypotheses are equivalent, since
β3/β4 − 1 = 0 implies β3 = β4, but the Wald test statistic for the fourth hypothesis,
given in (7.13), differs from (7.19). The statistic (7.13) was calculated using matrix
operations, as most packages will at best calculate Wald tests of linear hypotheses.
The LR test statistic is especially easy to compute, using (7.21), given estima-
tion of the restricted model. For the first three hypotheses the restricted model is
estimated by Poisson regression of y on, respectively, regressors (1, x2, x4), (1, x2), and
(1, x2, x3 + x4), where the third regression uses β3x3 + β4x4 = β3(x3 + x4) if β3 = β4.
As an example of the LR test, for the second hypothesis LR = −2[−242.922 −
(−238.772)] = 8.30. The fourth restricted model in theory requires ML estimation
subject to nonlinear constraints on the parameters, which few packages do. However,
constrained ML estimation is invariant to the way the restrictions are expressed, so
here the same estimates are obtained as for the third restricted model, leading to the
same LR test statistic.
The LM test statistic is computed using (7.25), which for the Poisson model spe-
cializes to (7.27). This statistic is computed using matrix commands, with different
restrictions leading to the different restricted MLE estimates β̃. As for the LR test,
the LM test is invariant to transformations, so the LM tests of the third and fourth
hypotheses are equivalent.
An asymptotically equivalent version of the LM test statistic is the statistic LM*
given in (7.35). This can be computed as the explained sum of squares from the
auxiliary regression (7.33). For the Poisson model sji = ∂ ln f(yi)/∂βj =
(yi − exp(xi′β))xji, with evaluation at the appropriate restricted MLE for the hypothe-
sis under consideration. The statistic LM* is simpler to compute than LM, though like
LM it requires restricted ML estimates.
In this example with generated data the various test statistics are very similar. This
is not always the case. In particular, the test statistic LM∗
can have poorer finite-sample
size properties than LM, even if the dgp is known. Also, in applications with real data
the dgp is unlikely to be perfectly specified, leading to divergence of the various test
statistics even in infinitely large samples.
7.5. Tests in Non-ML Settings
The Wald test is the standard test to use in non-ML settings. From Section 7.2 it is a
general testing procedure that can always be implemented, using an appropriate sand-
wich estimator of the variance matrix of the parameter estimates. The only limitation
is that in some applications unrestricted estimation may be much more difficult to
perform than restricted estimation.
The LM or score test, based on departures from zero of the gradient vector of the
unrestricted model evaluated at the restricted estimates, can also be generalized to
non-ML estimators. The form of the LM test, however, is usually considerably more
complicated than in the ML case. Moreover, the simplest forms of the LM test statistic
based on auxiliary regressions are usually not robust to distributional misspecification.
The LR test is based on the difference between the maximized values of the objec-
tive function with and without restrictions imposed. This usually does not generalize
to objective functions other than the likelihood function, as this difference is usually
not chi-square distributed.
For completeness we provide a condensed presentation of extension of the ML tests
to m-estimators and to efficient GMM estimators. As already noted, in most applica-
tions use of the simpler Wald test is sufficient.
7.5.1. Tests Based on m-Estimators
Tests for m-estimators are straightforward extensions of those for ML estimators, ex-
cept that it is no longer possible to use the information matrix equality to simplify the
test statistics and the LR test generalizes in only very special cases. The resulting test
statistics are asymptotically χ2
(h) distributed under H0 : h(θ) = 0 and have the same
noncentral chi-square distribution under local alternatives.
Consider m-estimators that maximize QN(θ) = N⁻¹ Σi qi(θ) with first-order con-
ditions N⁻¹ Σi si(θ) = 0. Define the q × q matrices A(θ) = N⁻¹ Σi ∂si(θ)/∂θ′ and
B(θ) = N⁻¹ Σi si(θ)si(θ)′ and the h × q matrix R(θ) = ∂h(θ)/∂θ′. Let θ̂u and
θ̂r denote unrestricted and restricted estimators, respectively, and let Â = A(θ̂u)
and Ã = A(θ̂r), with similar notation for B and R. Finally, let ĥ = h(θ̂u) and s̃i =
si(θ̂r).
The Wald test statistic is based on closeness of ĥ to zero. Here

    W = ĥ′ [ R̂ N⁻¹ Â⁻¹ B̂ Â⁻¹ R̂′ ]⁻¹ ĥ,    (7.37)

since from Section 5.5.1 the robust variance matrix estimate for θ̂u is N⁻¹Â⁻¹B̂Â⁻¹.
Packages with the option of robust standard errors use this more general form to com-
pute Wald tests of statistical significance.
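The structure of (7.37) is easy to code. The following Python sketch treats OLS as the m-estimator, with si(θ) = xi(yi − xi′θ), and forms a heteroskedasticity-robust (sandwich) Wald test of a single exclusion restriction; the simulated design and variable names are assumptions for illustration only:

    import numpy as np

    rng = np.random.default_rng(2)
    N = 200
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=N) * (1 + 0.5 * np.abs(X[:, 1]))

    # OLS as an m-estimator: s_i(theta) = x_i (y_i - x_i' theta)
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ theta_hat
    A_hat = -(X.T @ X) / N                        # N^-1 sum of d s_i / d theta'
    B_hat = (X * (u ** 2)[:, None]).T @ X / N     # N^-1 sum of s_i s_i'
    V_hat = np.linalg.inv(A_hat) @ B_hat @ np.linalg.inv(A_hat) / N   # sandwich variance

    # H0: last coefficient equals zero, so h(theta) = theta_3 and R = [0, 0, 1]
    R = np.array([[0.0, 0.0, 1.0]])
    h = R @ theta_hat
    W = float(h @ np.linalg.solve(R @ V_hat @ R.T, h))   # robust Wald statistic of the form (7.37)
    print(W)                                              # compare with a chi-square(1) critical value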
Let g(θ) = ∂QN(θ)/∂θ denote the gradient vector, and let g̃ = g(θ̂r) = N⁻¹ Σi s̃i.
The LM test statistic is based on the closeness of g̃ to 0 and is given by

    LM = N g̃′ Ã⁻¹ R̃′ [ R̃ Ã⁻¹ B̃ Ã⁻¹ R̃′ ]⁻¹ R̃ Ã⁻¹ g̃,    (7.38)

a result obtained by forming a chi-square test statistic based on (7.29), where Ng̃ re-
places ∂ ln L/∂θ|θ̂r. This test is clearly not as simple to implement as a robust Wald
test. Some examples of computation of the robust form of LM tests are given in Sec-
tion 8.4. The standard implementations of LM tests in computer packages are often
not robust versions of the LM test.
The LR test does not generalize easily. It does generalize to m-estimators if
B0 = −αA0 for some scalar α, a weaker version of the IM equality. In such special
cases the quasi-likelihood ratio (QLR) test statistic is

    QLR = −2N[ QN(θ̂r) − QN(θ̂u) ]/α̂u,    (7.39)

where α̂u is a consistent estimate of α obtained from unrestricted estimation (see
Wooldridge, 2002, p. 370). The condition B0 = −αA0 holds for generalized linear
models (see Section 5.7.4). Then the statistic QLR is equivalent to the difference of de-
viances for the restricted and unrestricted models, a generalization of the F-test based
on the difference between restricted and unrestricted sum of squared residuals for OLS
and NLS estimation with homoskedastic errors. For general quasi-ML estimation, with
B0 ≠ −αA0, the LR test statistic can be distributed as a weighted sum of chi-squares
(see Section 8.5.3).
7.5.2. Tests Based on Efficient GMM Estimators
For GMM the various test statistics are simplest for efficient GMM, meaning GMM
estimation using the optimal weighting matrix. This poses no great practical restriction
as the optimal weighting matrix can always be estimated, as detailed in Section 6.3.5.
Consider GMM estimation based on the moment condition E[mi(θ)] = 0. (Note
the change in notation from Chapter 6: h(θ) is being used in the current chapter to
denote the restrictions under H0.) Using the notation introduced in Section 6.3.5, the
efficient unrestricted GMM estimator θ̂u minimizes QN(θ) = gN(θ)′ S_N⁻¹ gN(θ), where
gN(θ) = N⁻¹ Σi mi(θ) and SN is consistent for S0 = V[gN(θ)]. The restricted GMM
estimator θ̃r is assumed to minimize QN(θ) with the same weighting matrix S_N⁻¹,
subject to the restriction h(θ) = 0.
The three following test statistics, summarized by Newey and West (1987a) are
asymptotically χ2
(h) distributed under H0 : h(θ) = 0 and have the same noncentral
chi-square distribution under local alternatives.
The Wald test statistic as usual is based on closeness of ĥ to zero. This yields

    W = ĥ′ [ R̂ N⁻¹ (Ĝ′Ŝ⁻¹Ĝ)⁻¹ R̂′ ]⁻¹ ĥ,    (7.40)

since the variance of the efficient GMM estimator is N⁻¹(Ĝ′Ŝ⁻¹Ĝ)⁻¹ from Section
6.3.5, where GN(θ) = ∂gN(θ)/∂θ′ and the carat denotes evaluation at θ̂u.
The first-order conditions of efficient GMM are Ĝ′Ŝ⁻¹ĝ = 0. The LM statistic tests
whether this gradient vector is close to zero when instead evaluated at θ̃r, leading to

    LM = N g̃′ S̃⁻¹ G̃ (G̃′S̃⁻¹G̃)⁻¹ G̃′ S̃⁻¹ g̃,    (7.41)

where the tilde denotes evaluation at θ̃r and we use the Section 6.3.3 assumption that
√N gN(θ0) →d N[0, S0], so

    √N G̃′S̃⁻¹g̃ →d N[0, plim N⁻¹ G′S⁻¹G].

For the efficient GMM estimator the difference in maximized values of the objective
function can also be compared, leading to the difference test statistic

    D = N[ QN(θ̃r) − QN(θ̂u) ].    (7.42)

Like W and LM, the statistic D is asymptotically χ²(h) distributed under H0 :
h(θ) = 0.
Even in the likelihood case, this last statistic differs from the LR statistic be-
cause it uses a different objective function. The MLE minimizes QN(θ) = −N⁻¹
Σi ln f(yi|θ). From Section 6.3.7, the asymptotically equivalent efficient GMM es-
timator instead minimizes the quadratic form QN(θ) = [N⁻¹ Σi si(θ)]′ S_N⁻¹ [N⁻¹ Σi si(θ)],
where si(θ) = ∂ ln f(yi|θ)/∂θ. The statistic D can be used in general, provided the
GMM estimator used is the efficient GMM estimator, whereas the LR test can only be
generalized for some special cases of m-estimators mentioned after (7.39).
For MM estimators, that is, in the just-identified GMM model, D = LM =
N QN(θ̃r), so the LM and difference tests are equivalent. For D this simplification oc-
curs because gN(θ̂u) = 0 and so QN(θ̂u) = 0. For LM simplification occurs in (7.41)
as then G̃N is invertible.
7.6. Power and Size of Tests
The remaining sections of this chapter study two limitations in using the usual com-
puter output to test hypotheses.
First, a test can have little ability to discriminate between the null and alternative
hypotheses. Then the test has low power, meaning there is a low probability of rejecting
the null hypothesis when it is false. Standard computer output does not calculate test
power, but it can be evaluated using asymptotic methods (see this section) or finite-
sample Monte Carlo methods (see Section 7.7). If a major contribution of an empirical
paper is the rejection or nonrejection of a particular hypothesis, there is no reason for
the paper not to additionally present the power of the test against some meaningful
alternative hypothesis.
Second, the true size of the test may differ substantially from the nominal size of
the test obtained from asymptotic theory. The rule of thumb that sample size N ≥ 30
is sufficient for asymptotic theory to provide a good approximation for inference on a
single variable does not extend to models with regressors. Poor approximation is most
likely in the tails of the approximating distribution, but the tails are used to obtain
critical values of tests at common significance levels such as 5%. In practice the critical
value for a test statistic obtained from large-sample approximation is often smaller
than the correct critical value based on the unknown true distribution. Small-sample
refinements are attempts to get closer to the exact critical value. For linear regression
under normality exact critical values can be obtained, using the t rather than z and the
F rather than χ2
distribution, but similar results are not exact for nonlinear regression.
Instead, small-sample refinements may be obtained through Monte Carlo methods (see
Section 7.7) or by use of the bootstrap (see Section 7.8 and Chapter 11).
With modern computers it is relatively easy to correct the size and investigate the
power of tests used in an applied study. We present this neglected topic in some
detail.
7.6.1. Test Size and Power
Hypothesis tests lead to either rejection or nonrejection of the null hypothesis. Correct
decisions are made if H0 is rejected when H0 is false or if H0 is not rejected when H0
is true.
There are also two possible incorrect decisions: (1) rejecting H0 when H0 is true,
called a type I error, and (2) nonrejection of H0 when H0 is false, called a type II
error. Ideally the probabilities of both errors will be low, but in practice decreasing
the probability of one type of error comes at the expense of increasing the probability
of the other. The classical hypothesis testing solution is to fix the probability of a type
I error at a particular level, usually 0.05, while leaving the probability of a type II error
unspecified.
Define the size of a test or significance level

    α = Pr[type I error] = Pr[reject H0 | H0 true],    (7.43)
with common choices of α being 0.01, 0.05, or 0.10. A hypothesis is rejected if the test
statistic falls into a rejection region defined so that the test significance level equals the
specified value of α. A closely related equivalent method computes the p-value of a
test, the marginal significance level at which the null hypothesis is just rejected, and
rejects H0 if the p-value is less than the specified value of α. Both methods require only
knowledge of the distribution of the test statistic under the null hypothesis, presented
in Section 7.2 for the Wald test statistic.
Consideration should also be given to the probability of a type II error. The power
of a test is defined to be

    Power = Pr[reject H0 | Ha true]
          = 1 − Pr[accept H0 | Ha true]
          = 1 − Pr[type II error].    (7.44)
Ideally, test power is close to one since then the probability of a type II error is close to
zero. Determining the power requires knowledge of the distribution of the test statistic
under Ha.
Analysis of test power is typically ignored in empirical work, except that test proce-
dures are usually chosen to be ones that are known theoretically to have power that, for
given level α, is high relative to other alternative test statistics. Ideally, the uniformly
most powerful (UMP) test is used. This is the test that has the greatest power, for given
level α, for all alternative hypotheses. UMP tests do exist when testing a simple null
hypothesis against a simple alternative hypothesis. Then the Neyman–Pearson lemma
gives the result that the UMP test is a function of the likelihood ratio. For more gen-
eral testing situations involving composite hypotheses there is usually no UMP test,
and further restrictions are placed such as UMP one-sided tests. In practice, power
considerations are left to theoretical econometricians who use theory and simulations
applied to various testing procedures to suggest which testing procedures are the most
powerful.
It is nonetheless possible to determine test power in any given application. In the
following we detail how to compute the asymptotic power of the Wald test, which
equals that of the LR and LM tests in the fully parametric case.
7.6.2. Local Alternative Hypotheses
Since power is the probability of rejecting H0 when Ha is true, the computation
of power requires obtaining the distribution of the test statistic under the alterna-
tive hypothesis. For a Wald chi-square test at significance level α the power equals
Pr[W χ2
α(h)|Ha]. Calculation of this probability requires specification of a particular
alternative hypothesis, because Ha : h(θ) = 0 is very broad.
The obvious choice is the fixed alternative h(θ) = δ, where δ is an h × 1 finite
vector of nonzero constants. The quantity δ is sometimes referred to as the hypoth-
esis error, and larger hypothesis errors lead to greater power. For a fixed alternative
the Wald test statistic asymptotically has power one as it rejects the null hypothesis
all the time. To see this note that if h(θ) = δ then the Wald test statistic becomes
infinite, since

    W = ĥ′ (R̂ N⁻¹ Ĉ R̂′)⁻¹ ĥ →p δ′ [R0 N⁻¹ C0 R0′]⁻¹ δ,

using θ̂ →p θ0, so ĥ = h(θ̂u) →p h(θ) = δ, and Ĉ →p C0. It follows that W →p ∞ since
all the terms except N are finite and nonzero. This infinite value leads to H0 being
always rejected, as it should be, and hence having perfect power of one.
The Wald test statistic is therefore a consistent test statistic, that is, one whose
power goes to one as N → ∞. Many test statistics are consistent, just as many estima-
tors are consistent. More stringent criteria are needed to discriminate among the test
statistics, just as relative efficiency is used to choose among estimators.
For estimators that are root-N consistent, we consider a sequence of local alter-
natives

    Ha : h(θ) = δ/√N,    (7.45)

where δ is a vector of fixed constants with δ ≠ 0. This sequence of alternative hy-
potheses, called Pitman drift, gets closer to the null hypothesis value of zero as the
sample size gets larger, at the same rate √N as used to scale up θ̂ to get a nonde-
generate distribution for the consistent estimator. The alternative hypothesis value of
h(θ) therefore moves toward zero at a rate that negates any improved efficiency with
increased sample size. For a much more detailed account of local alternatives and re-
lated literatures see McManus (1991).
7.6.3. Asymptotic Power of the Wald Test
Under the sequence of local alternatives (7.45) the Wald test statistic has a nondegen-
erate distribution, the noncentral chi-square distribution. This permits determination
of the power of the Wald test.
Specifically, as is shown in Section 7.6.4, under Ha the Wald statistic W defined in
(7.6) is asymptotically χ²(h; λ) distributed, where χ²(h; λ) denotes the noncentral
chi-square distribution with noncentrality parameter

    λ = (1/2) δ′ [R0 C0 R0′]⁻¹ δ,    (7.46)

and R0 and C0 are defined in (7.4) and (7.5). The power of the Wald test, the proba-
bility of rejecting H0 given the local alternative Ha is true, is therefore

    Power = Pr[W > χ²α(h) | W ∼ χ²(h; λ)].    (7.47)
Figure 7.1 plots power against λ for tests of a scalar hypothesis (h = 1) at the com-
monly used sizes or significance levels of 10%, 5%, and 1%. For λ close to zero the
power equals the size, and for large λ the power goes to one.
These features hold also for h > 1. In particular power is monotonically increasing
in the noncentrality parameter λ defined in (7.46). Several general results follow.
First, power is increasing in the distance between the null and alternative hypo-
theses, as then δ and hence λ increase.
Figure 7.1: Power of Wald chi-square test with one degree of freedom for three different
test sizes (0.10, 0.05, and 0.01) as the noncentrality parameter ranges from 0 to 20.
Second, for given alternative δ power increases with efficiency of the estimator θ̂,
as then C0 is smaller and hence λ is larger.
Third, as the size of the test increases power increases and the probability of a type II
error decreases.
Fourth, if several different test statistics are all χ²(h) under the null hypothesis
and noncentral χ²(h) under the alternative, the preferred test statistic is that with the
highest noncentrality parameter λ since then power is the highest. Furthermore, two
tests that have the same noncentrality parameter are asymptotically equivalent under
local alternatives.
Finally, in actual applications one can calculate the power as a function of δ. Speci-
fically, for a specified alternative δ, an estimated noncentrality parameter λ̂ can be
computed using (7.46) with parameter estimate θ̂ and associated estimates R̂ and Ĉ.
Such power calculations are illustrated in Section 7.6.5.
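The calculation in (7.46)-(7.47) is easy to automate. The Python sketch below uses scipy; note that scipy parameterizes the noncentral chi-square by δ′(R0C0R0′)⁻¹δ, which is twice the λ of (7.46), and the numbers plugged in at the end (R, C, δ) are purely hypothetical:

    import numpy as np
    from scipy.stats import chi2, ncx2

    def wald_power(delta, R, C, alpha=0.05):
        # Asymptotic power of the Wald test under the local alternative h(theta) = delta/sqrt(N)
        delta = np.atleast_1d(delta)
        h = len(delta)
        omega = R @ C @ R.T                                  # R0 C0 R0'
        lam = 0.5 * delta @ np.linalg.solve(omega, delta)    # noncentrality lambda in (7.46)
        crit = chi2.ppf(1 - alpha, h)                        # chi-square critical value
        return 1 - ncx2.cdf(crit, h, 2 * lam)                # scipy's noncentrality equals 2*lambda

    # Hypothetical single-restriction example: R = [1], C = [4], delta = 3
    print(wald_power(delta=[3.0], R=np.array([[1.0]]), C=np.array([[4.0]])))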
7.6.4. Derivation of Asymptotic Power
To obtain the distribution of W under Ha, begin with the Taylor series expansion result
(7.9). This simplifies to

    √N h(θ̂) →d N[δ, R0 C0 R0′],    (7.48)

under Ha, since then √N h(θ) = δ. Thus a quadratic form centered at δ would be
chi-square distributed under Ha.
The Wald test statistic W defined in (7.6) instead forms a quadratic form centered
at 0 and is no longer chi-square distributed under Ha. In general if z ∼ N[μ, Ω],
where rank(Ω) = h, then z′Ω⁻¹z ∼ χ²(h; λ), where χ²(h; λ) denotes the noncentral
chi-square distribution with noncentrality parameter λ = (1/2) μ′Ω⁻¹μ. Applying this re-
sult to (7.48) yields

    N h(θ̂)′ (R0 C0 R0′)⁻¹ h(θ̂) →d χ²(h; λ),    (7.49)

under Ha, where λ is defined in (7.46).
7.6.5. Calculation of Asymptotic Power
To shed light on how power changes with δ, consider tests of coefficient significance
in the scalar case. Then the noncentrality parameter defined in (7.46) is

    λ = δ²/(2c) ≈ (δ/√N)² / (2(se[θ̂])²),    (7.50)

where the approximation arises because of estimation of c, the limit variance of
√N(θ̂ − θ), by N(se[θ̂])², where se[θ̂] is the standard error of θ̂.
Consider a Wald chi-square test of H0 : θ = 0 against the alternative hypothesis that
θ is within a standard errors of zero, that is, against

    Ha : θ = a × se[θ̂],

where se[θ̂] is treated here as a constant. Then δ/√N in (7.45) equals a × se[θ̂], so
that (7.50) simplifies to λ = a²/2. Thus the Wald test is asymptotically χ²(1; λ) under
Ha where λ = a²/2.
From Figure 7.1 it is clear for the common case of significance level tests at 5% that
if a = 2 the power is well below 0.5, if a = 4 the power is around 0.5, and if a = 6 the
power is still below 0.9. A borderline test of statistical significance can therefore have
low power against alternatives that are many standard errors from zero. Intuitively, if
θ̂ = 2 se[θ̂] then a test of θ = 0 against θ = 4 se[θ̂] has power of approximately 0.5,
because a 95% confidence interval for θ is approximately (0, 4 se[θ̂]), implying that
values of θ = 0 or θ = 4 se[θ̂] are just as likely.
As a more concrete example, suppose θ measures the percentage increase in wage
resulting from a training program, and that a study finds θ̂ = 6 with se[θ̂] = 4. Then
the Wald test at 5% significance level leads to nonrejection of H0, since W = (6/4)² =
2.25 < χ².05(1) = 3.84. The conclusion of such a study will often state that the training
program is not statistically significant. One should not interpret this as meaning that
there is a high probability that the training program has no effect, however, as this test
has low power. For example, the preceding analysis indicates that a test of H0 : θ = 0
against Ha : θ = 16, a relatively large training effect, has power of only 0.5, since
4 × se[θ̂] = 16. Reasons for low power include small sample size, large model error
variance, and small spread in the regressors.
In simple cases, solving the inverse problem of estimating the minimum sample size
needed to achieve a given desired level of power is possible. This is especially popular
in medical studies.
Andrews (1989) gives a more formal treatment of using the noncentrality parameter
to determine regions of the parameter space against which a test in an empirical setting
is likely to have low power. He provides many applied examples where it is easy to
determine that tests have low power against meaningful alternatives.
7.7. Monte Carlo Studies
Our discussion of statistical inference has so far relied on asymptotic results. For small
samples analytical results are rarely available, aside from tests of linear restrictions in
the linear regression model under normality. Small-sample results can nonetheless be
obtained by performing a Monte Carlo study.
7.7.1. Overview
An example of a Monte Carlo study of the small-sample properties of a test statistic is
the following. Set the sample size N to 40, say, and randomly generate 10,000 samples
of size 40 under the H0 model. For each replication (sample) form the test statistic of
interest and test H0, rejecting H0 if the test statistic falls in the rejection region, usually
determined by asymptotic results.
The true size or actual size of the test statistic is simply the fraction of replications
for which the test statistic falls in the rejection region. Ideally, this is close to the
nominal size, which is the chosen significance level of the test. For example, if testing
at 5% the nominal test size is 0.05 and the true size is hopefully close to 0.05.
Determining test power in small samples requires additional simulation, with samples
generated under one or more particular specifications of the models that lie in the
composite alternative hypothesis Ha. The power is calculated as the fraction of
replications for which the null hypothesis is rejected, using either the same test as used
in determining the true size, or a size-corrected version of the test that uses a rejection
region such that the nominal size equals the true size.
Monte Carlo studies are simple to implement, but there are many subtleties involved
in designing a good Monte Carlo study. For an excellent discussion see Davidson and
MacKinnon (1993).
7.7.2. Monte Carlo Details
As an example of a Monte Carlo study we consider statistical inference on the slope
coefficient in a probit model. The following analysis does not rely on knowledge of
the probit model.
The data-generating process is a probit model, with binary dependent variable y equal
to one with probability

    Pr[y = 1|x] = Φ(β1 + β2x),

where Φ(·) is the standard normal cdf, x ∼ N[0, 1], and (β1, β2) = (0, 1).
The data (y, x) are easily generated for this dgp. The regressor x is first obtained as
a random draw from the standard normal distribution. Then, from Section 14.4.2, the
dependent variable y is set equal to 1 if x + u > 0 and is set to 0 otherwise, where u
is a random draw from the standard normal. For this dgp y = 1 roughly half the time
and y = 0 the other half.
In each simulation N new observations of both x and y are drawn, and the MLE
from probit regression of y on x is obtained. An alternative is to use the same N draws
of the regressor x in each simulation and only redraw y. The former setup corresponds
to simple random sampling and the latter corresponds to analysis conditional on x or
“fixed in repeated trials”; see Section 4.4.7.
Monte Carlo studies often consider a range of sample sizes. Here we simply
set N = 40. Programs can be checked by also setting a very large value of N,
say N = 10,000, as then Monte Carlo results should be very close to asymptotic
results.
Numerous simulations are needed to determine the actual test size, because this depends
on behavior in the tails of the distribution rather than the center. If S simulations
are run for a test of true size α, then the proportion of times the null hypothesis is
rejected is an outcome from S binomial trials with mean α and variance α(1 − α)/S.
So 95% of Monte Carlo studies will estimate the test size to be in the interval
α ± 1.96√(α(1 − α)/S). A mere 100 simulations is not enough since, for example,
this interval is (0.007, 0.093) when α = 0.05. For 10,000 simulations the 95% inter-
val is much more precise, equalling (0.008, 0.012), (0.046, 0.054), (0.094, 0.106), and
(0.192, 0.208) for α equal to, respectively, 0.01, 0.05, 0.10, and 0.20. Here S = 10,000
simulations are used.
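As a quick check of these intervals, the formula α ± 1.96√(α(1 − α)/S) is easily evaluated directly. The short sketch below uses Python with numpy; the language and the helper name size_interval are illustrative choices, not something prescribed by the text.

```python
import numpy as np

def size_interval(alpha, S):
    """95% interval for the estimated test size from S Monte Carlo replications."""
    half_width = 1.96 * np.sqrt(alpha * (1 - alpha) / S)
    return alpha - half_width, alpha + half_width

print(size_interval(0.05, 100))            # roughly (0.007, 0.093)
for a in (0.01, 0.05, 0.10, 0.20):
    lo, hi = size_interval(a, 10_000)
    print(f"alpha={a:.2f}: ({lo:.3f}, {hi:.3f})")
```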
A problem that can arise in Monte Carlo simulations is that for some simulation
samples the model may not be estimable. For example, consider linear regression on
just an intercept and an indicator variable. If the indicator variable happens to always
take the same value, say 0, in a simulation sample then its coefficient cannot be sepa-
rately identified from that for the intercept. A similar problem arises in the probit and
other binary outcome models, if all ys are 0 or all ys are 1 in a simulation sample. The
standard procedure, which can be criticized, is to drop such simulation samples, and to
write computer code that permits the simulation loop to continue when such a problem
arises. In this example the problem did not arise with N = 40, but it did for N = 30.
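The design just described can be coded in a few lines. The sketch below is a minimal illustration in Python using numpy and statsmodels (an assumed, illustrative software choice); it draws S samples of size N from the probit dgp, skips inestimable samples, and stores the slope estimate, its standard error, and the Wald statistic z = (β̂2 − 1)/se[β̂2] for the size, power, and bias calculations that follow. Note that statsmodels reports Hessian-based standard errors by default, which may differ slightly from the expected-Hessian form used in the text.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12345)      # illustrative seed
N, S = 40, 10_000
beta1, beta2 = 0.0, 1.0                 # dgp values; set beta2 = 2.0 when studying power

b2_hat, se_hat, z_stats = [], [], []
while len(z_stats) < S:
    x = rng.standard_normal(N)
    u = rng.standard_normal(N)
    y = (beta1 + beta2 * x + u > 0).astype(float)   # probit dgp: y = 1 if x'beta + u > 0
    if y.min() == y.max():              # all 0s or all 1s: coefficients not identified
        continue
    try:
        res = sm.Probit(y, sm.add_constant(x)).fit(disp=0)
    except Exception:                   # let the simulation loop continue on failures
        continue
    b2, se2 = res.params[1], res.bse[1]
    b2_hat.append(b2)
    se_hat.append(se2)
    z_stats.append((b2 - 1.0) / se2)    # Wald statistic for H0: beta2 = 1

z_stats = np.array(z_stats)
print("mean of b2_hat:", np.mean(b2_hat))   # upward bias in small samples
print("sd of b2_hat:  ", np.std(b2_hat, ddof=1))
print("mean of se:    ", np.mean(se_hat))   # downward bias of the standard errors
```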
7.7.3. Small-Sample Bias
Before moving to testing we look at the small-sample properties of the MLE β̂2 and
its estimated standard error se[β̂2].
Across the 10,000 simulations β̂2 had mean 1.201 and standard deviation 0.452,
whereas se[β̂2] had mean 0.359. The MLE is therefore biased upward in small samples,
as the average of β̂2 is considerably greater than β2 = 1. The standard errors are
biased downward in small samples, since the average of se[β̂2] is considerably smaller
than the standard deviation of β̂2.
7.7.4. Test Size
We consider a two-sided test of H0 : β2 = 1 against Ha : β2 ≠ 1, using the Wald test
statistic

    z = (β̂2 − 1)/se[β̂2],

where se[β̂2] is the standard error of the MLE estimated using the variance matrix
given in Section 14.3.2, which is minus the inverse of the expected Hessian. Given the
dgp, asymptotically z is standard normal distributed and z² is chi-square distributed.
The goal is to find how well this approximates the small-sample distribution.
Figure 7.2 gives the density for the S = 10,000 computed values of z, where the den-
sity is plotted using the kernel density estimate of Chapter 9 rather than a histogram.
This is superimposed on the standard normal density. Clearly the asymptotic result is
not exact, especially in the upper tail, where the difference is clearly large enough to
lead to size distortions when testing at, say, 5%. Also, across the simulations z has
mean 0.114 ≠ 0 and standard deviation 0.956 ≠ 1.

Table 7.2. Wald Test Size and Power for Probit Regression Exampleᵃ

Nominal Size (α)   Actual Size   Actual Power   Asymptotic Power
0.01               0.005         0.007          0.272
0.05               0.029         0.226          0.504
0.10               0.081         0.608          0.628
0.20               0.192         0.858          0.755

ᵃ The dgp for y is the probit with Pr[y = 1] = Φ(0 + β2x) and sample size N = 40. The test is a two-
sided Wald test of whether or not the slope coefficient equals 1. Actual size is calculated from S =
10,000 simulations with β2 = 1 and power is calculated from 10,000 simulations with β2 = 2.
The first two columns of Table 7.2 give the nominal size and the actual size of
the Wald test for nominal sizes α = 0.01, 0.05, 0.10, and 0.20. The actual size is the
proportion of the 10,000 simulations in which |z| > z_α/2, or equivalently in which
z² > χ²_α(1). Clearly the actual size of the test is much less than the nominal size for
α ≤ 0.10. An ad hoc small-sample correction is to instead assume that z is t-distributed
with 38 degrees of freedom, and reject if |z| > t_α/2(38). However, this leads to even
smaller actual size, since t_α/2(38) > z_α/2.
The Monte Carlo simulations can also be used to obtain size-corrected critical values.
Thus the lower and upper 2.5 percentiles of the 10,000 simulated values of z are
−1.905 and 2.003. It follows that an asymmetric rejection region with actual size 0.05
is z < −1.905 or z > 2.003, a larger rejection region than |z| > 1.960.
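Continuing the sketch above (the array z_stats holds the S simulated values of z), the actual size at each nominal level and the size-corrected 5% critical values are one-liners; the particular numbers obtained will of course depend on the random seed and covariance estimator used.

```python
import numpy as np
from scipy import stats

for alpha in (0.01, 0.05, 0.10, 0.20):
    crit = stats.norm.ppf(1 - alpha / 2)            # z_{alpha/2}
    print(alpha, np.mean(np.abs(z_stats) > crit))   # actual size = rejection rate under H0

# Size-corrected asymmetric 5% rejection region (compare -1.905 and 2.003 in the text)
print(np.percentile(z_stats, [2.5, 97.5]))
```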
7.7.5. Test Power
We consider power of the Wald test under Ha : β2 = 2. We would expect the power to
be reasonable because this value of β2 lies two to three standard errors away from the
null hypothesis value of β2 = 1, given that se[β̂2] has average value 0.359. The actual
and nominal power of the Wald test are given in the last two columns of Table 7.2.

[Figure 7.2: Density of the Wald test statistic that the slope coefficient equals one,
computed by Monte Carlo simulation, with the standard normal density also plotted for
comparison. Data are generated from a probit regression model.]
The actual power is obtained in the same way as the actual size, being the proportion
of the 10,000 simulations in which |z| > z_α/2. The only change is that, in generating y
in the simulation, β2 = 2 rather than 1. The actual power is very low for α = 0.01 and
0.05, cases where the actual size is much less than the nominal size.
The nominal power of the Wald test is determined using the asymptotic noncentral
χ²(1, λ) distribution under Ha, where from (7.50)

    λ = (1/2)(δ/√N)²/se[β̂2]² = (1/2) × 1²/0.359² ≃ 3.88,

since the local alternative is Ha : β2 − 1 = δ/√N, so δ/√N = 1 for β2 = 2. The
asymptotic result is not exact, but it does provide a useful estimate of the power for
α = 0.10 and 0.20, cases where the true size closely matches the nominal size.
7.7.6. Monte Carlo in Practice
The preceding discussion has emphasized use of the Monte Carlo analysis to calculate
test power and size. A Monte Carlo analysis can also be very useful for determining
small-sample bias in an estimator and, by setting N large, for determining that an
estimator is actually consistent. Such Monte Carlo routines are very simple to run
using current computer packages.
A Monte Carlo analysis can be applied to real data if the conditional distribution
of y given x is fully parametrized. For example, consider a probit model estimated
with real data. In each simulation the regressors are set at their sample values, if the
sampling framework is one of fixed regressors in repeated samples, while a new set of
values for the binary dependent variable y needs to be generated. This will depend on
what values of the parameters β are used. Let β̂1, . . . , β̂K denote the probit estimates
from the original sample and consider a Wald test of H0 : βj = 0. To calculate test size,
generate S simulation samples by setting βk = β̂k for k ≠ j and setting βj = 0, and
then calculate the proportion of simulations in which H0 : βj = 0 is rejected. To estimate
the power of the Wald test against a specific alternative Ha : βj = 1, say, instead
generate y with βk = β̂k for k ≠ j and βj = 1, and calculate the proportion of
simulations in which H0 : βj = 0 is rejected.
In practice much microeconometric analysis is based on estimators that do not rest
on fully parametric models. Then additional distributional assumptions are needed to
perform a Monte Carlo analysis.
Alternatively, power can be calculated using asymptotic methods rather than finite-
sample methods. Additionally, the bootstrap, presented next, can be used to obtain tests
whose size is based on a more refined asymptotic theory.
7.8. Bootstrap Example
The bootstrap is a variant of Monte Carlo simulation that has the attraction of being
implementable with fewer parametric assumptions and with little additional program
code beyond that required to estimate the model in the first place. Essential ingredients
for the bootstrap to be valid are that the estimator actually has a limit distribution and
that the bootstrap resamples quantities that are iid.
The bootstrap has two general uses. First, it can be used as an alternative way to
compute statistics without asymptotic refinement. This is particularly useful for com-
puting standard errors when analytical formulas are complex. Second, it can be used
to implement a refinement of the usual asymptotic theory that may provide a better
finite-sample approximation to the distribution of test statistics.
We illustrate the bootstrap to implement a Wald test, ahead of a complete treatment
in Chapter 11.
7.8.1. Inference Using Standard Asymptotics
Consider again a probit example with binary dependent variable y equal to one with
probability p = Φ(γ + βx), where Φ(·) is the standard normal cdf. Interest lies in
testing H0 : β = 1 against Ha : β ≠ 1 at significance level 0.05. The analysis here does
not require knowledge of the probit model.
One sample of size N = 30 is generated. Probit ML estimation yields β̂ = 0.817
and s_β̂ = 0.294, where the standard error is based on −Â⁻¹, so the test statistic is
z = (β̂ − 1)/s_β̂ = (0.817 − 1)/0.294 = −0.623.
Using standard asymptotic theory we obtain 5% critical values of −1.96 and 1.96,
since z.025 = 1.96, and H0 is not rejected.
7.8.2. Bootstrap without Asymptotic Refinement
The departure point of the bootstrap method is to resample from an approximation to
the population; see Section 11.2.1. The paired bootstrap does so by resampling from
the original sample.
Thus form B pseudo-samples of size N by drawing with replacement from the original
data {(yi, xi), i = 1, . . . , N}. For example, the first pseudo-sample of size 30 may
have (y1, x1) once, (y2, x2) not at all, (y3, x3) twice, and so on. This yields B estimates
β̂∗1, . . . , β̂∗B of the parameter of interest β that can be used to estimate features of the
distribution of the original estimate β̂.
For example, suppose the computer program used to estimate a probit model reports
β̂ but not the standard error s_β̂. The bootstrap solves this problem, since we can use
the estimated standard deviation s_β̂,boot of β̂∗1, . . . , β̂∗B from the B bootstrap pseudo-
samples. Given this standard error estimate it is possible to perform a Wald hypothesis
test on β.
For the probit Wald test example, the resulting bootstrap estimate of the standard
error of β̂ is 0.376, leading to z = (0.817 − 1)/0.376 = −0.487. Since −0.487 lies in
(−1.96, 1.96) we do not reject H0 at 5%.
This use of the bootstrap to test hypotheses does not lead to size improvements in
small samples. However, it can lead to great time savings in many applications if it is
difficult to otherwise obtain the standard errors for an estimator.
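A minimal sketch of the paired bootstrap standard error is given below, again in Python with numpy and statsmodels as an illustrative software choice; y and x are assumed to hold the original sample of size N = 30, and the function name is hypothetical.

```python
import numpy as np
import statsmodels.api as sm

def paired_bootstrap_se(y, x, B=999, seed=1):
    """Paired-bootstrap estimate of the standard error of the probit slope."""
    rng = np.random.default_rng(seed)
    n, slopes = len(y), []
    while len(slopes) < B:
        idx = rng.integers(0, n, size=n)        # resample (y_i, x_i) pairs with replacement
        yb, xb = y[idx], x[idx]
        if yb.min() == yb.max():                # skip degenerate resamples
            continue
        try:
            res = sm.Probit(yb, sm.add_constant(xb)).fit(disp=0)
        except Exception:
            continue
        slopes.append(res.params[1])
    return np.std(slopes, ddof=1)               # s_(beta,boot)

# Wald test of H0: beta = 1 using the bootstrap standard error:
#   z = (beta_hat - 1) / paired_bootstrap_se(y, x); reject at 5% if |z| > 1.96.
```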
7.8.3. Bootstrap with Asymptotic Refinement
Some bootstraps can lead to a better asymptotic approximation to the distribution of
z. This is likely to lead to finite-sample critical values that are better in the sense that
the actual size is likely to be closer to the nominal size of 0.05. Details are provided in
Chapter 11. Here we illustrate the method.
Again form B pseudo-samples of size N by drawing with replacement from the
original data. Estimate the probit model in each pseudo-sample, and for the bth
pseudo-sample compute z∗b = (β̂∗b − β̂)/s_β̂∗b, where β̂ is the original estimate. The
bootstrap distribution for the original test statistic z is then the empirical distribution
of z∗1, . . . , z∗B rather than the standard normal. The lower and upper 2.5 percentiles of
this empirical distribution give the bootstrap critical values.
For the example here with B = 1,000 the lower and upper 2.5 percentiles of the
empirical bootstrap distribution of z were found to be −2.62 and 1.83. The bootstrap
critical values for testing at 5% are then −2.62 and 1.83, rather than the usual ±1.96.
Since the initial sample test statistic z = −0.623 lies in (−2.62, 1.83) we do not reject
H0 : β = 1. A bootstrap p-value can also be computed.
Unlike the bootstrap in the previous section, an asymptotic improvement occurs
here because the studentized test statistic z is asymptotically pivotal (see Section
11.2.3) whereas the estimator β̂ is not.
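The refinement changes only what is resampled and compared: each pseudo-sample now yields a studentized statistic centered at the original estimate β̂, and the 2.5 and 97.5 percentiles of these statistics replace ±1.96. A sketch under the same assumptions (and with the same hypothetical helper style) as the previous block:

```python
import numpy as np
import statsmodels.api as sm

def bootstrap_t_critical_values(y, x, beta_hat, B=999, seed=1):
    """2.5 and 97.5 percentiles of z*_b = (beta*_b - beta_hat) / se*_b."""
    rng = np.random.default_rng(seed)
    n, zstars = len(y), []
    while len(zstars) < B:
        idx = rng.integers(0, n, size=n)
        yb, xb = y[idx], x[idx]
        if yb.min() == yb.max():
            continue
        try:
            res = sm.Probit(yb, sm.add_constant(xb)).fit(disp=0)
        except Exception:
            continue
        zstars.append((res.params[1] - beta_hat) / res.bse[1])
    return np.percentile(zstars, [2.5, 97.5])

# Reject H0: beta = 1 at 5% if the original z = (beta_hat - 1)/se_hat falls outside
# these critical values; they play the role of -2.62 and 1.83 in the text.
```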
7.9. Practical Considerations
Microeconometrics research places emphasis on statistical inference based on min-
imal distributional assumptions, using robust estimates of the variance matrix of an
estimator. There is no sense in robust inference, however, if failure of distributional
assumptions leads to the more serious complication of estimator inconsistency as can
happen for some though not all ML estimators.
Many packages provide a “robust” standard errors option in estimator commands.
In microeconometrics packages robust often means heteroskedasticity consistent, and
it does not guard against other complications, such as clustering (see Section 24.5),
that can also lead to invalid statistical inference.
Robust inference is usually implemented using a Wald test. The Wald test has the
weakness of lack of invariance to reparameterization of nonlinear hypotheses, though
this may be diminished by performing an appropriate bootstrap. Standard auxiliary
regressions for the LM test and implementations of LM tests in computer packages
are usually not robustified, though in some cases relatively simple robustification of
the LM test is possible (see Section 8.4).
The power of tests can be weak. Ideally, power against some meaningful alternative
would be reported. Failing this, as Section 7.6 indicates, one should be careful about
overstating the conclusions from a hypothesis test unless parameters are very precisely
estimated.
The finite sample size of tests derived from asymptotic theory is also an issue. The
bootstrap method, detailed in Chapter 11, has the potential to yield hypothesis tests
and confidence intervals with much better finite-sample properties.
Statistical inference can be quite fragile, so these issues are of importance to the
practitioner. Consider a two-tailed Wald test of statistical significance when θ̂ = 1.96,
and assume the test statistic is indeed standard normal distributed. If s_θ̂ = 1.0 then
t = 1.96 and the p-value is 0.050. However, the true p-value is a much higher 0.117
if the standard error was underestimated by 20% (so the correct t = 1.57), and a much
lower 0.019 if the standard error was overestimated by 20% (so t = 2.35).
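The arithmetic in this example follows directly from the standard normal distribution; a two-line check in Python with scipy (purely illustrative):

```python
from scipy import stats

for t in (1.96, 1.57, 2.35):
    print(t, 2 * stats.norm.sf(t))   # two-sided p-values: about 0.050, 0.116, 0.019
```

(The 0.117 quoted in the text corresponds to the unrounded t = 1.96 × 0.8 = 1.568.)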
7.10. Bibliographic Notes
The econometrics texts by Gouriéroux and Monfort (1989) and Davidson and MacKinnon
(1993) give quite lengthy treatment of hypothesis testing. The presentation here considers only
equality restrictions. For tests of inequality restrictions see Gouriéroux, Holly, and Monfort
(1982) for the linear case and Wolak (1991) for the nonlinear case. For hypothesis testing when
the parameters are at the boundary of the parameter space under the null hypothesis the tests
can break down; see Andrews (2001).
7.3 A useful graphical treatment of the three classical test procedures is given by Buse (1982).
7.5 Newey and West (1987a) present extension of the classical tests to GMM estimation.
7.6 Davidson and MacKinnon (1993) give considerable discussion of power and explain the
distinction between explicit and implicit null and alternative hypotheses.
7.7 For Monte Carlo studies see Davidson and MacKinnon (1993) and Hendry (1984).
7.8 The bootstrap method due to Efron (1979) is detailed in Chapter 11.
Exercises
7–1 Suppose a sample yields estimates θ̂1 = 5, θ̂2 = 3 with asymptotic variance
estimates 4 and 2, and the correlation coefficient between θ̂1 and θ̂2 equals 0.5.
Assume asymptotic normality of the parameter estimates.
(a) Test H0 : θ1e^θ2 = 100 against Ha : θ1e^θ2 ≠ 100 at level 0.05.
(b) Obtain a 95% confidence interval for γ = θ1e^θ2.
7–2 Consider NLS regression for the model y = exp(α + βx) + ε, where α, β, and
x are scalars and ε ∼ N[0, 1]. Note that for simplicity σ²_ε = 1 and need not be
estimated. We want to test H0 : β = 0 against Ha : β ≠ 0.
(a) Give the first-order conditions for the unrestricted MLE of α and β.
(b) Give the asymptotic variance matrix for the unrestricted MLE of α and β.
(c) Give the explicit solution for the restricted MLE of α and β.
(d) Give the auxiliary regression to compute the OPG form of the LM test.
(e) Give the complete expression for the original form of the LM test. Note that
it involves derivatives of the unrestricted log-likelihood evaluated at the re-
stricted MLE of α and β. [This is more difficult than parts (a)–(d).]
7–3 Suppose we wish to choose between two nested parametric models. The relation-
ship between the densities of the two models is that g(y|x,β,α = 0) = f (y|x,β),
where for simplicity both β and α are scalars. If g is the correct density then the
MLE of β based on density f is inconsistent. A test of model f against model
g is a test of H0 : α = 0 against Ha : α ≠ 0. Suppose ML estimation yields the
following results: (1) model f : β̂ = 5.0, se[β̂] = 0.5, and ln L = −106; (2) model
g: β̂ = 3.0, se[β̂] = 1.0, α̂ = 2.5, se[α̂] = 1.0, and ln L = −103. Not all of the
following tests are possible given the preceding information. If there is enough
information, perform the tests and state your conclusions. If there is not enough
information, then state this.
(a) Perform a Wald test of H0 at level 0.05.
(b) Perform a Lagrange multiplier test of H0 at level 0.05.
(c) Perform a likelihood ratio test of H0 at level 0.05.
(d) Perform a Hausman test of H0 at level 0.05.
7–4 Consider a test of H0 : µ = 0 against Ha : µ ≠ 0 at nominal size 0.05 when the
dgp is y ∼ N[µ, 100], so the standard deviation is 10, and the sample size is
N = 10. The test statistic is the usual t-test statistic t = µ̂/(s/√10), where
s² = (1/9)Σi(yi − ȳ)². Perform 1,000 simulations to answer the following.
. Perform 1,000 simulations to answer the following.
(a) Obtain the actual size of the t-test if the correct finite-sample critical values
±t.025(8) = ±2.306 are used. Is there size distortion?
(b) Obtain the actual size of the t-test if the asymptotic approximation critical
values ±z.025 = ±1.960 are used. Is there size distortion?
(c) Obtain the power of the t-test against the alternative Ha : µ = 1, when the
critical values ±t.025(8) = ±2.306 are used. Is the test powerful against this
particular alternative?
7–5 Use the health expenditure data of Section 16.6. The model is a probit regression
of DMED, an indicator variable for positive health expenditures, against the 17
regressors listed in the second paragraph of Section 16.6. You should obtain the
estimates given in the first column of Table 16.1. Consider joint test of the statisti-
cal significance of the self-rated health indicators HLTHG, HLTHF, and HLTHP at
level 0.05.
(a) Perform a Wald test.
(b) Perform a likelihood ratio test.
(c) Perform an auxiliary regression to implement an LM test. [This will require
some additional coding.]
Chapter 8
Specification Tests and Model
Selection
8.1. Introduction
Two important practical aspects of microeconometric modeling are determining
whether a model is correctly specified and selecting from alternative models. For these
purposes it is often possible to use the hypothesis testing methods presented in the pre-
vious chapter, especially when models are nested. In this chapter we present several
other methods.
First, m-tests such as conditional moment tests are tests of whether moment con-
ditions imposed by a model are satisfied. The approach is similar in spirit to GMM,
except that the moment conditions are not imposed in estimation and are instead used
for testing. Such tests are conceptually very different from the hypothesis tests of
Chapter 7, as there is no explicit statement of an alternative hypothesis model.
Second, Hausman tests are tests of the difference between two estimators that are
both consistent if the model is correctly specified but diverge if the model is incorrectly
specified.
Third, tests of nonnested models require special methods because the usual hypoth-
esis testing approach can only be applied when one model is nested within another.
Finally, it can be useful to compute and report statistics of model adequacy that are
not test statistics. For example, an analogue of R2
may be used to measure the good-
ness of fit of a nonlinear model.
Ideally, these methods are used in a cycle of model specification, estimation, testing,
and evaluation. This cycle can move from a general model toward a more specific
model, or from a specific model toward a more general one that is felt to capture the
most important features of the data.
Section 8.2 presents m-tests, including conditional moment tests, the information
matrix test, and chi-square goodness of fit tests. The Hausman test is presented in
Section 8.3. Tests for several common misspecifications are discussed in Section 8.4.
Discrimination between nonnested models is the focus of Section 8.5. Commonly used
convenient implementations of the tests of Sections 8.2–8.5 can rely on strong
distributional assumptions and/or perform poorly in finite samples. These concerns have
discouraged use
of some of these tests, but such concerns are outdated because in many cases the boot-
strap methods presented in Chapter 11 can correct for these weaknesses. Section 8.6
considers the consequences of testing a model on subsequent inference. Model diag-
nostics are presented in the stand-alone Section 8.7.
8.2. m-Tests
m-Tests, such as conditional moment tests, are a general specification testing proce-
dure that encompasses many common specification tests. The tests are easily imple-
mented using auxiliary regressions when estimation is by ML, a situation where tests
of model assumptions are especially desirable. Implementation is usually more diffi-
cult when estimators are instead based on minimal distributional assumptions.
We first introduce the test statistic and computational methods, followed by leading
examples and an illustration of the tests.
8.2.1. m-Test Statistic
Suppose a model implies the population moment condition
H0 : E[mi (wi , θ)] = 0, (8.1)
where w is a vector of observables, usually the dependent variable y and regressors
x and sometimes additional variables z, θ is a q × 1 vector of parameters, and mi (·)
is an h × 1 vector. A simple example is that E[(y − x
β)z] = 0 if z can be omitted in
the linear model y = x
β + u. Especially for fully parametric models there are many
candidates for mi (·).
An m-test is a test of the closeness to zero of the corresponding sample moment

    m̂N(θ̂) = N⁻¹ Σi mi(wi, θ̂),  i = 1, . . . , N.    (8.2)

This approach is similar to that for the Wald test, where h(θ) = 0 is tested by testing
the closeness to zero of h(θ̂).
A test statistic is obtained by a method similar to that detailed in Section 7.2.4 for
the Wald test. In Section 8.2.3 it is shown that if (8.1) holds then

    √N m̂N(θ̂) →d N[0, Vm],    (8.3)

where Vm, defined later in (8.10), is more complicated than in the case of the Wald test
because mi(wi, θ̂) has two sources of stochastic variation, as both wi and θ̂ are random.
A chi-square test statistic can then be obtained by taking the corresponding
quadratic form. Thus the m-test statistic for (8.1) is

    M = N m̂N(θ̂)′ V̂m⁻¹ m̂N(θ̂),    (8.4)

which is asymptotically χ²(rank[Vm]) distributed if the moment conditions (8.1) are
correct. An m-test rejects the moment conditions (8.1) at significance level α if M >
χ²_α(h) and does not reject otherwise.
A complication is that Vm may not be of full rank h. For example, this is the case
if the estimator θ̂ itself sets a linear combination of components of m̂N(θ̂) to 0. In
some cases, such as the OIR test, V̂m is still of full rank and M can be computed, but
the chi-square test statistic has only rank[Vm] degrees of freedom. In other cases V̂m
itself is not of full rank. Then it is simplest to drop (h − rank[Vm]) of the moment
conditions and perform an m-test using just this subset of the moment conditions.
Alternatively, the full set of moment conditions can be used, but V̂m⁻¹ in (8.4) is replaced
by V̂m⁻, the generalized inverse of V̂m. The Moore–Penrose generalized inverse V⁻
of a matrix V satisfies VV⁻V = V, V⁻VV⁻ = V⁻, (VV⁻)′ = VV⁻, and (V⁻V)′ =
V⁻V. When Vm is less than full rank then strictly speaking (8.3) no longer holds,
since the multivariate normal requires full rank Vm, but (8.4) still holds given these
adjustments.
The m-test approach is conceptually very simple. The moment restriction (8.1) is
rejected if a quadratic form in the sample estimate (8.2) is far enough from zero. The
challenges are in calculating M, since V̂m can be quite complex (see Section 8.2.2);
selecting moments m(·) to test (see Sections 8.2.4–8.2.7 for leading examples); and
interpreting reasons for rejection of (8.1) (see Section 8.2.8).
8.2.2. Computation of the m-Statistic
There are several ways to compute the m-statistic.
First, one can always directly compute V̂m, and hence M, using the consistent
estimates of the components of Vm given in Section 8.2.3. Most practitioners shy away
from this approach as it entails matrix computations.
Second, the bootstrap can always be used (see Section 11.6.3), since the bootstrap
can provide an estimate of Vm that controls for all sources of variation in m̂N(θ̂) =
N⁻¹Σi mi(wi, θ̂).
Third, in some cases auxiliary regressions similar to those for the LM test given
in Section 7.3.5 can be run to compute asymptotically equivalent versions of M that
do not require computation of V̂m. These auxiliary regressions may in turn be
bootstrapped to obtain an asymptotic refinement (see Section 11.6.3). We present several
leading auxiliary regressions.
Auxiliary Regressions Using the ML Estimator
Model specification tests are especially desirable when inference is done within the
likelihood framework, as in general any misspecification of the density can lead to in-
consistency of the MLE. Fortunately, an m-test is easily implemented when estimation
is by maximum likelihood.
Specifically, when θ̂ is the MLE, generalizing the LM test result of Section 7.3.5
(see Section 8.2.3) shows that an asymptotically equivalent version of the m-test is
obtained from the auxiliary regression

    1 = m̂i′δ + ŝi′γ + ui,    (8.5)
where m̂i = mi(yi, xi, θ̂ML), ŝi = ∂ln f(yi|xi, θ)/∂θ|θ̂ML is the contribution of the
ith observation to the score, and f(yi|xi, θ) is the conditional density function, by
calculating

    M∗ = N R²u,    (8.6)

where R²u is the uncentered R² defined at the end of Section 7.3.5. Equivalently, M∗
equals ESSu, the uncentered explained sum of squares (the sum of squares of the fitted
values) from regression (8.5), or M∗ equals N − RSS, where RSS is the residual sum
of squares from regression (8.5). M∗ is asymptotically χ²(h) distributed under H0.
The test statistic M∗
is called the outer product of the gradient form of the m-test,
and it is a generalization of the auxiliary regression for the LM test (see Section 7.3.5).
Although the OPG form can be easily computed, it has poor small-sample properties
with large size distortions. Similar to the LM test, however, these small-sample prob-
lems can be greatly reduced by using bootstrap methods (see Section 11.6.3).
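The OPG auxiliary regression (8.5)–(8.6) needs only the N × h matrix of moment contributions and the N × q matrix of score contributions, both evaluated at the MLE. A minimal numpy sketch follows (an illustration, not the book's code; the function name opg_m_test is hypothetical).

```python
import numpy as np
from scipy import stats

def opg_m_test(m, s):
    """
    OPG form of the m-test: regress a vector of ones on [m s] and return
    M* = N * (uncentered R^2) together with its chi-square(h) p-value.
    m : (N, h) array of moment contributions evaluated at the MLE
    s : (N, q) array of score contributions evaluated at the MLE
    """
    N, h = m.shape
    X = np.column_stack([m, s])
    coef, *_ = np.linalg.lstsq(X, np.ones(N), rcond=None)
    fitted = X @ coef
    M_star = fitted @ fitted     # uncentered ESS; equals N * R^2_u since TSS_u = N
    return M_star, stats.chi2.sf(M_star, df=h)
```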
The test statistic M∗ may also be appropriate in some non-ML settings. The auxiliary
regression is applicable whenever E[∂m/∂θ′] = −E[m s′] (see Section 8.2.3). By
the generalized IM equality (see Section 5.6.3), this condition holds for the MLE when
the expectation is with respect to the specified density f(·). It can also hold under weaker
distributional assumptions in some cases.
Auxiliary Regressions When E[∂m/∂θ′] = 0
In some applications mi(wi, θ) satisfies

    E[ ∂mi(wi, θ)/∂θ′ |θ0 ] = 0,    (8.7)

in addition to (8.1).
Then it can be shown that the asymptotic distribution of √N m̂N(θ̂) is the same
as that of √N mN(θ0), so Vm = plim N⁻¹Σi mi0mi0′, which can be consistently
estimated by V̂m = N⁻¹Σi m̂im̂i′. The test statistic can then be computed in a similar
manner to (8.5), except that the auxiliary regression is more simply

    1 = m̂i′δ + ui,    (8.8)

with test statistic M∗∗ equal to N times the uncentered R².
This auxiliary regression is valid for any root-N consistent estimator θ̂, not just
the MLE, provided (8.7) holds. The condition (8.7) is met in only a few settings;
Section 8.2.9 gives an example.
Even if (8.7) does not hold, the simpler regression (8.8) might still be run as a guide,
as it places a lower bound on the correct value of M, the m-test statistic. If this simpler
regression leads to rejection then (8.1) is certainly rejected.
Other Auxiliary Regressions
Alternative auxiliary regressions to (8.5) and (8.8) are possible if m(y, x, θ) and
s(y, x, θ) can be appropriately factorized.
First, if s(y, x, θ) = g(x, θ)r(y, x, θ) and m(y, x, θ) = h(x, θ)r(y, x, θ) for some
common scalar function r(·) with V[r(y, x, θ)] = 1 and estimation is by ML, then an
asymptotically equivalent version of (8.5) is N R²u from the regression of r̂i on ĝi and ĥi.
Second, if m(y, x, θ) = h(x, θ)v(y, x, θ) for some scalar function v(·) with
V[v(y, x, θ)] = 1 and E[∂m/∂θ′] = 0, then an asymptotically equivalent version
of (8.8) is N R²u from the regression of v̂i on ĥi. For further details see Wooldridge (1991).
Additional auxiliary regressions exist in special settings. Examples are given in
Section 8.4, and White (1994) gives a quite general treatment.
8.2.3. Derivations for the m-Test Statistic
To avoid the need to compute Vm, the variance matrix in (8.3), m-tests are usually
implemented using auxiliary regressions or bootstrap methods. For completeness this
section derives the actual expression for Vm and provides justification for the auxiliary
regressions (8.5) and (8.8).
The key is obtaining the distribution of m̂N(θ̂) defined in (8.2). This is complicated
because m̂N(θ̂) is stochastic for two reasons: the random variables wi and evaluation
at the estimator θ̂.
Assume that θ̂ is an m-estimator or estimating equations estimator that solves

    N⁻¹ Σi si(wi, θ̂) = 0,    (8.9)

for some function s(·), here not necessarily ∂ln f(y|x, θ)/∂θ, and make the usual
cross-section assumption that the data are independent over i. Then we shall show that
√N m̂N(θ̂) →d N[0, Vm], as in (8.3), where

    Vm = H0J0H0′,    (8.10)

with the h × (h + q) matrix

    H0 = [Ih   −C0A0⁻¹],    (8.11)

where C0 = plim N⁻¹Σi ∂mi0/∂θ′ and A0 = plim N⁻¹Σi ∂si0/∂θ′, and with the
(h + q) × (h + q) matrix

    J0 = plim N⁻¹ [ Σi mi0mi0′   Σi mi0si0′
                    Σi si0mi0′   Σi si0si0′ ],    (8.12)

where mi0 = mi(wi, θ0) and si0 = si(wi, θ0).
To derive (8.10), take a first-order Taylor series expansion around θ0 to obtain

    √N m̂N(θ̂) = √N mN(θ0) + [∂mN(θ0)/∂θ′] √N(θ̂ − θ0) + op(1).    (8.13)

For θ̂ defined in (8.9) this implies that

    √N m̂N(θ̂) = (1/√N) Σi mi(θ0) − C0A0⁻¹ (1/√N) Σi si0 + op(1),    (8.14)
where we use mN = N⁻¹Σi mi, ∂mN/∂θ′ = N⁻¹Σi ∂mi/∂θ′ →p C0, and the result
that √N(θ̂ − θ0) has the same limit distribution as −A0⁻¹(1/√N)Σi si0, obtained by
applying the usual first-order Taylor series expansion to (8.9). Equation (8.14) can be
written as

    √N m̂N(θ̂) = [Ih   −C0A0⁻¹] [ (1/√N) Σi mi0
                                  (1/√N) Σi si0 ] + op(1).    (8.15)

Equation (8.10) follows by application of the limit normal product rule (Theorem
A.17), as the second term in the product in (8.15) has a limit normal distribution
under H0 with mean 0 and variance J0.
To compute M in (8.4), a consistent estimate V̂m of Vm can be obtained by replacing
each component of Vm by a consistent estimate. For example, C0 can be consistently
estimated by Ĉ = N⁻¹Σi ∂mi/∂θ′|θ̂, and so on. Although this can always be
done, using auxiliary regressions is easier when they are available.
First, consider the auxiliary regression (8.5) when θ̂ is the MLE. By the generalized
IM equality (see Section 5.6.3), E[∂mi0/∂θ′] = −E[mi0si0′], where for the MLE we
specialize to si = ∂ln f(yi|xi, θ)/∂θ. Considerable simplification occurs since then
C0 = −plim N⁻¹Σi mi0si0′ and A0 = −plim N⁻¹Σi si0si0′, which also appear in the
J0 matrix. This leads to the OPG form of the test. For further details see Newey (1985)
or Pagan and Vella (1989).
Second, for the auxiliary regression (8.8), note that if E[∂mi0/∂θ′] = 0 then C0 =
0, so H0 = [Ih  0] and hence H0J0H0′ = plim N⁻¹Σi mi0mi0′.
8.2.4. Conditional Moment Tests
Conditional moment tests, due to Newey (1985) and Tauchen (1985), are m-tests of
unconditional moment restrictions that are obtained from an underlying conditional
moment restriction.
As an example, consider the linear regression model y = x′β + u. A standard
assumption for consistency of the OLS estimator is that the error has conditional mean
zero, or equivalently the conditional moment restriction

    E[y − x′β|x] = 0.    (8.16)

In Chapter 6 we considered using some of the implied unconditional moment
restrictions as the basis of MM or GMM estimation. In particular, (8.16) implies that
E[x(y − x′β)] = 0. Solving the corresponding sample moment condition
Σi xi(yi − xi′β) = 0 leads to the OLS estimator for β. However, (8.16) implies many
other moment conditions that are not used in estimation. Consider the unconditional
moment restriction

    E[g(x)(y − x′β)] = 0,

where the vector g(x) should differ from x, which is already used in OLS estimation.
For example, g(x) may contain the squares and cross-products of the components of
the regressor vector x. This suggests a test based on whether or not the corresponding
sample moment m̂N(β̂) = N⁻¹Σi g(xi)(yi − xi′β̂) is close to zero.
More generally, consider the conditional moment restriction

    E[r(y, x, θ)|x] = 0,    (8.17)

for some scalar function r(·). The conditional moment (CM) test is an m-test based
on the implied unconditional moment restrictions

    E[g(x)r(y, x, θ)] = 0,    (8.18)

where g(x) and/or r(y, x, θ) are chosen so that these restrictions are not already used
in estimation.
Likelihood-based models lead to many potential restrictions. For less than fully
parametric models, examples of r(y, x, θ) include y − µ(x, θ), where µ(·) is the
specified conditional mean function, and (y − µ(x, θ))² − σ²(x, θ), where σ²(x, θ) is a
specified conditional variance function.
8.2.5. White’s Information Matrix Test
For ML estimation the information matrix equality implies moment restrictions that
may be used in an m-test, as they are usually not imposed in obtaining the MLE.
Specifically, from Section 5.6.3 the IM equality implies

    E[Vech[Di(yi, xi, θ0)]] = 0,    (8.19)

where the q × q matrix Di is given by

    Di(yi, xi, θ0) = ∂²ln fi/∂θ∂θ′ + (∂ln fi/∂θ)(∂ln fi/∂θ′),    (8.20)

and the expectation is taken with respect to the assumed conditional density fi =
f(yi|xi, θ). Here Vech is the vector-half operator that stacks the columns of the matrix
Di in the same way as the Vec operator, except that only the q(q + 1)/2 unique
elements of the symmetric matrix Di are stacked.
White (1982) proposed the information matrix test of whether the corresponding
sample moment

    d̂N(θ̂) = N⁻¹ Σi Vech[Di(yi, xi, θ̂ML)]    (8.21)

is close to zero. Using (8.4) the IM test statistic is

    IM = N d̂N(θ̂)′ V̂⁻¹ d̂N(θ̂),    (8.22)

where the expression for V̂ given in White (1982) is quite complicated. A much easier
way to implement the test, due to Lancaster (1984) and Chesher (1984), is to use the
auxiliary regression (8.5), which is applicable since the MLE is used in (8.21).
The IM test can also be applied to a subset of the restrictions in (8.19). This should
be done if q is large as then the number of restrictions q(q + 1)/2 being tested is very
large.
Large values of the IM test statistic lead to rejection of the restrictions of the
IM equality and the conclusion that the density is incorrectly specified. In general
this means that the ML estimator is inconsistent. In some special cases, detailed in
Section 5.7, the MLE may still be consistent though standard errors need then to be
based on the sandwich form of the variance matrix.
8.2.6. Chi-Square Goodness-of-Fit Test
A useful specification test for fully parametric models is to compare predicted prob-
abilities with sample relative frequencies. The model is a poor one if these differ
considerably.
Begin with a discrete iid random variable y that can take one of J possible values
with probabilities p1, p2, . . . , pJ, where Σj pj = 1. The correct specification of the
probabilities can be tested by testing the equality of the theoretical frequencies Npj to
the observed frequencies N p̄j, where p̄j is the fraction of the sample that takes the jth
possible value. The Pearson chi-square goodness-of-fit (PCGF) test statistic is

    PCGF = Σj (N p̄j − Npj)² / (Npj),  summed over j = 1, . . . , J.    (8.23)

This statistic is asymptotically χ²(J − 1) distributed under the null hypothesis that the
probabilities p1, p2, . . . , pJ are correct. The test can be extended to permit the
probabilities to be predicted from a regression model (see Exercise 8.2). Consider a
multinomial model for discrete y with probabilities pij = pij(xi, θ). Then pj in (8.23) is
replaced by p̂j = N⁻¹Σi Fj(xi, θ̂), and if θ̂ is the multinomial MLE we again get a
chi-square distribution, but with a reduced number of degrees of freedom
(J − dim(θ) − 1) resulting from the estimation of θ (see Andrews, 1988a).
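To make (8.23) concrete, the following sketch (Python with numpy and scipy, illustrative only, and with made-up counts) computes PCGF and its χ²(J − 1) p-value for the case of known probabilities.

```python
import numpy as np
from scipy import stats

def pearson_gof(counts, probs):
    """Pearson chi-square goodness-of-fit statistic (8.23) and its p-value."""
    counts = np.asarray(counts, dtype=float)
    expected = counts.sum() * np.asarray(probs, dtype=float)
    pcgf = np.sum((counts - expected) ** 2 / expected)
    return pcgf, stats.chi2.sf(pcgf, df=len(counts) - 1)

# Hypothetical example: 120 rolls of a die, testing equal probabilities of 1/6 each.
print(pearson_gof([25, 17, 19, 22, 18, 19], np.full(6, 1 / 6)))
```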
For regression models other than multinomial models, the statistic PCGF in (8.23)
can be computed by grouping y into cells, but the statistic PCGF is then no longer
chi-square distributed. Instead, a closely related m-test statistic is used. To derive this
statistic, break the range of y into J mutually exclusive cells, where the J cells span
all possible values of y. Let dij(yi) be an indicator variable equal to one if yi ∈ cell
j and equal to zero otherwise. Let pij(xi, θ) = ∫_{yi ∈ cell j} f(yi|xi, θ) dyi be the predicted
probability that observation i falls in cell j, where f(y|x, θ) is the conditional density
of y and to begin with we assume the parameter vector θ is known. If the conditional
density is correctly specified, then

    E[dij(yi) − pij(xi, θ)] = 0,  j = 1, . . . , J.    (8.24)
Stacking all J moments in obvious vector notation, we have

    E[di(yi) − pi(xi, θ)] = 0,    (8.25)

where di and pi are J × 1 vectors with jth entries dij and pij. This suggests an m-test
of the closeness to zero of the corresponding sample moment

    d̂pN(θ̂) = N⁻¹ Σi (di(yi) − pi(xi, θ̂)),    (8.26)

which is the difference between the vector of sample relative frequencies N⁻¹Σi di
and the vector of predicted frequencies N⁻¹Σi p̂i. Using (8.5) we obtain the
chi-square goodness-of-fit (CGF) test statistic of Andrews (1988a, 1988b):

    CGF = N d̂pN(θ̂)′ V̂⁻¹ d̂pN(θ̂),    (8.27)

where the expression for V̂ is quite complicated. The CGF test statistic is easily
computed using the auxiliary regression (8.5), with m̂i = di − p̂i. This auxiliary regression
is appropriate here because a fully parametric model is being tested and so θ̂ will be
the MLE.
One of the categories needs to be dropped because of the restriction that probabil-
ities sum to one, yielding a test statistic that is asymptotically χ2
(J − 1) under the
null hypothesis that f (y|x, θ) is correctly specified. Further categories may need to
be dropped in some special cases, such as the multinomial example already discussed
after (8.23). In addition to reporting the calculated test statistic it can be informative to
report the components of N−1

i di and N−1

i 
pi .
The relevant asymptotic theory is provided by Andrews (1988a), with a simpler
presentation and several applications given in Andrews (1988b). For simplicity we
presented cells determined by the range of y, but the partitioning can be on both y
and x. Cells should be chosen so that no cell has only a few observations. For further
details and a history of this test see these articles.
For a continuous random variable y in the iid case a more general test than the CGF
test is the Kolmogorov test; this uses the entire distribution of y, not just cells formed
from y. Andrews (1997) presents a regression version of the Kolmogorov test, but it is
much more difficult to implement than the CGF test.
8.2.7. Test of Overidentifying Restrictions
Tests of overidentifying restrictions (see Section 6.3.8) are examples of m-tests.
In the notation of Chapter 6, the GMM estimator is based on the assumption that
E[h(wi, θ0)] = 0. If the model is overidentified, then only q linear combinations of
these moment restrictions are effectively imposed in estimation, leaving (r − q)
linearly independent orthogonality conditions, where r = dim[h(·)], that can be used
to form an m-test. Then we use M in (8.4), where m̂N = N⁻¹Σi h(wi, θ̂). As shown
in Section 6.3.9, if θ̂ is the optimal GMM estimator then N m̂N(θ̂)′ŜN⁻¹m̂N(θ̂), where
ŜN = N⁻¹Σi ĥiĥi′, is asymptotically χ²(r − q) distributed. A more intuitive linear IV
example is given in Section 8.4.4.
8.2.8. Power and Consistency of Conditional Moment Tests
Because there is no explicit alternative hypothesis, m-tests differ from the tests of
Chapter 7.
Several authors have given examples where the IM test can be shown to be equiv-
alent to a conventional LM test of null against alternative hypotheses. Chesher (1984)
interpreted the IM test as a test for random parameter heterogeneity. For the linear
model under normality, A. Hall (1987) showed that subcomponents of the IM test
correspond to LM tests of heteroskedasticity, symmetry, and kurtosis. Cameron and
Trivedi (1998) give some additional examples and reference to results for the linear
exponential family.
More generally, m-tests can be interpreted in a conditional moment framework
as follows. Begin with an added-variable test in a linear regression model. Suppose
we want to test whether β2 = 0 in the model y = x1′β1 + x2′β2 + u. This is a test of
H0 : E[y − x1′β1|x] = 0 against Ha : E[y − x1′β1|x] = x2′β2. The most powerful test
of H0 : β2 = 0 in the regression of y − x1′β1 on x2 is based on the efficient GLS
estimator

    β̂2 = ( Σi x2ix2i′/σi² )⁻¹ Σi x2i(yi − x1i′β1)/σi²,

where σi² = V[yi|xi] under H0 and independence over i is assumed. This test is
equivalent to a test based on the second sum alone, which is an m-test of

    E[ x2i(yi − x1i′β1)/σi² ] = 0.    (8.28)
Reversing the process, we can interpret an m-test based on (8.28) as a CM test of
H0 : E[y − x1′β1|x] = 0 against Ha : E[y − x1′β1|x] = x2′β2. Also, an m-test based
on E[x2(y − x1′β1)] = 0 can be interpreted as a CM test of H0 : E[y − x1′β1|x] = 0
against Ha : E[y − x1′β1|x] = σ²y|x x2′β2, where σ²y|x = V[y|x] under H0.
More generally, suppose we start with the conditional moment restriction

    E[r(yi, xi, θ)|xi] = 0,    (8.29)

for some scalar function r(·). Then an m-test based on the unconditional moment
restriction

    E[g(xi)r(yi, xi, θ)] = 0    (8.30)

can be interpreted as a CM test with null and alternative hypotheses

    H0 : E[r(yi, xi, θ)|xi] = 0,    (8.31)
    Ha : E[r(yi, xi, θ)|xi] = σi² g(xi)′γ,

where σi² = V[r(yi, xi, θ)|xi] under H0.
This approach gives a guide to the directions in which a CM test has power.
Although (8.30) suggests power is in the general direction of g(x), from (8.31) a more
precise statement is that it is instead in the direction of g(x) multiplied by the variance
of r(y, x, θ). The distinction is important because in many cross-section applications
this variance is not constant across observations. For further details and references see
Cameron and Trivedi (1998), who call this a regression-based CM test. The approach
generalizes to vector r(·), though with more cumbersome algebra.
An m-test is a test of a finite number of moment conditions. It is therefore possible to
construct a dgp for which the underlying conditional moment condition, such as that in
(8.29), is false yet the moment conditions are satisfied. Then the CM test is inconsistent
as it fails to reject with probability one as N → ∞. Bierens (1990) proposed a way
to specify g(x) in (8.30) that ensures a consistent conditional moment test, for tests
of functional form in the nonlinear regression model where r(y, x, θ) = y − f (x, θ).
Ensuring the consistency of the test does not, however, ensure that it will have high
power against particular alternatives.
8.2.9. m-Tests Example
To illustrate various m-tests we consider the Poisson regression model introduced in
Section 5.2, with Poisson density f(y) = e^(−µ)µ^y/y! and µ = exp(x′β).
We wish to test

    H0 : E[m(y, x, β)] = 0,

for various choices of m(·). This test will be conducted under the assumption that the
dgp is indeed the specified Poisson density.
Auxiliary Regressions
Since estimation is by ML we can use the m-test statistic M∗, computed as N times the
uncentered R² from the auxiliary regression (8.5), which here is

    1 = m̂(yi, xi, β̂)′δ + (yi − exp(xi′β̂))xi′γ + ui,    (8.32)

since ŝ = ∂ln f(y)/∂β|β̂ = (y − exp(x′β̂))x and β̂ is the MLE. Under H0 the test is
χ²(dim(m)) distributed.
An alternative is the M∗∗ statistic from the auxiliary regression

    1 = m̂(y, x, z, β̂)′δ + u.    (8.33)

This test is asymptotically equivalent to M∗ if m(·) is such that E[∂m/∂β] = 0, but
otherwise it is not chi-square distributed.
Moments Tested
Correct specification of the conditional mean function, that is, E[y − exp(x′β)|x] = 0,
can be tested by an m-test of

    E[(y − exp(x′β))z] = 0,

where z may be a function of x. For the Poisson and other LEF models, z cannot
equal x because the first-order conditions for β̂ML impose the restriction that Σi(yi −
exp(xi′β̂))xi = 0, leading to M = 0 if z = x. Instead, z could include squares and cross-
products of the regressors.
Correct specification of the variance may also be tested, as the Poisson distribution
implies conditional mean–variance equality. Since V[y|x] − E[y|x] = 0, with E[y|x] =
exp(x′β), this suggests an m-test of

    E[{(y − exp(x′β))² − exp(x′β)}x] = 0.

A variation instead tests

    E[{(y − exp(x′β))² − y}x] = 0,

as E[y|x] = exp(x′β). Then m(β) = {(y − exp(x′β))² − y}x has the property that
E[∂m/∂β] = 0, so (8.7) holds and the alternative regression (8.33) yields an
asymptotically equivalent test to the regression (8.32).
A standard specification test for parametric models is the IM test. For the Poisson
density, D defined in (8.19) becomes D(y, x, β) = {(y − exp(x′β))² − y}xx′, and we
test

    E[{(y − exp(x′β))² − y}Vech[xx′]] = 0.

Clearly for the Poisson example the IM test is a test of the first and second moment
conditions implied by the Poisson model, a result that holds more generally for LEF
models. The test statistic M∗∗ is asymptotically equivalent to M∗ since here
E[∂m/∂β] = 0.
The Poisson assumption can also be tested using a chi-square goodness-of-fit test.
For example, since few counts exceed three in the subsequent simulation example,
form four cells corresponding to y = 0, 1, 2, and 3 or more, where in implementing
the test the cell with y = 3 or more is dropped because the probabilities sum to one.
So for j = 0, . . . , 2 compute the indicator dij = 1 if yi = j and dij = 0 otherwise and
compute the predicted probability p̂ij = e^(−µ̂i)µ̂i^j/j!, where µ̂i = exp(xi′β̂). Then test

    E[d − p] = 0,

where di = [di0, di1, di2]′ and pi = [pi0, pi1, pi2]′, by the auxiliary regression (8.32)
with m̂i = di − p̂i.
Simulation Results
Data were generated from a Poisson model with mean E[y|x] = exp(β1 + β2x2),
where x2 ∼ N[0, 1] and (β1, β2) = (0, 1). Poisson ML regression of y on x for a
sample of size 200 yielded

    Ê[y|x] = exp(−0.165 + 1.124 x2),
                 (0.089)   (0.069)

where the associated standard errors are given in parentheses.
The results of the various m-tests are given in Table 8.1.
Table 8.1. Specification m-Tests for Poisson Regression Exampleᵃ

Test Type               H0 where µ = exp(x′β)                     M∗     dof   p-value   M∗∗
1. Correct mean         E[(y − µ)x2²] = 0                         3.27   1     0.07      0.44
2. Variance = mean      E[{(y − µ)² − µ}x] = 0                    2.43   2     0.30      1.89
3. Variance = mean      E[{(y − µ)² − y}x] = 0                    2.43   2     0.30      2.41
4. Information matrix   E[{(y − µ)² − y}Vech[xx′]] = 0            2.95   3     0.40      2.73
5. Chi-square GOF       E[d − p] = 0                              2.50   3     0.48      0.75

ᵃ The dgp for y is the Poisson distribution with mean parameter exp(0 + x2) and sample size N = 200. The
m-test statistic M∗ is chi-square distributed with degrees of freedom given in the dof column and p-value given
in the p-value column. The alternative test statistic M∗∗ is valid for tests 3 and 4 only.
As an example of computation of M∗ using (8.32), consider the IM test. Since x =
[1, x2]′ and Vech[xx′] = [1, x2, x2²]′, the auxiliary regression is of 1 on {(y − µ̂)² − y},
{(y − µ̂)² − y}x2, {(y − µ̂)² − y}x2², (y − µ̂), and (y − µ̂)x2 and yields uncentered
R² = 0.01473 with N = 200, leading to M∗ = 2.95. The same value of M∗ is obtained
directly from the uncentered explained sum of squares of 2.95, and indirectly as N
minus 197.05, the residual sum of squares from this regression. The test statistic is
χ²(3) distributed with p = 0.40, so the null hypothesis is not rejected at significance
level 0.05.
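This computation is easily replicated. The sketch below (Python with numpy and statsmodels, an illustrative software choice) generates data from the same design, fits the Poisson MLE, and forms M∗ for the IM test from the auxiliary regression of 1 on the moment and score contributions; the value obtained will differ from 2.95 because the simulated sample differs.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(10)               # illustrative seed
N = 200
x2 = rng.standard_normal(N)
y = rng.poisson(np.exp(0.0 + 1.0 * x2))       # dgp: E[y|x] = exp(0 + x2)

X = sm.add_constant(x2)                       # x = (1, x2)
res = sm.Poisson(y, X).fit(disp=0)
mu_hat = res.predict(X)

w = (y - mu_hat) ** 2 - y                     # IM-test weight {(y - mu)^2 - y}
m = w[:, None] * np.column_stack([np.ones(N), x2, x2 ** 2])   # times Vech[x x']
s = (y - mu_hat)[:, None] * X                 # Poisson score contributions (y - mu) x

Z = np.column_stack([m, s])
coef, *_ = np.linalg.lstsq(Z, np.ones(N), rcond=None)
M_star = (Z @ coef) @ (Z @ coef)              # N * uncentered R^2
print("M* =", M_star, "p-value =", stats.chi2.sf(M_star, df=3))
```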
For the chi-square goodness-of-fit test the actual frequencies are, respectively,
0.435, 0.255, and 0.110, and the corresponding predicted frequencies are 0.429, 0.241,
and 0.124. This yields PCGF = 0.47 using (8.23), but this statistic is not chi-square
distributed as it does not control for the error in estimating β. The auxiliary regression
for the correct statistic CGF in (8.27) leads to M∗ = 2.50, which is chi-square distributed.
In this simulation none of the five moment conditions is rejected at level 0.05, since
the p-value for M∗ exceeds 0.05 in every case. This is as expected, as the data in this
simulation example are generated from the specified density, so that tests at level 0.05
should reject only 5% of the time. The alternative statistic M∗∗ is valid only for tests 3
and 4, since only then does E[∂m/∂β] = 0; otherwise, it only provides a lower bound
for M.
8.3. Hausman Test
Tests based on comparisons between two different estimators are called Hausman tests,
after Hausman (1978), or Wu–Hausman tests or even Durbin–Wu–Hausman tests after
Wu (1973) and Durbin (1954) who proposed similar tests.
8.3.1. Hausman Test
Consider a test for endogeneity of a regressor in a single equation. Two alternative es-
timators are the OLS and 2SLS estimators, where the 2SLS estimator uses instruments
to control for possible endogeneity of the regressor. If there is endogeneity then OLS
is inconsistent, so the two estimators will have different probability limits. If there is no
endogeneity both estimators are consistent, so the two estimators have the same
probability limit. This suggests testing for endogeneity by testing for a difference between
the OLS and 2SLS estimators; see Section 8.4.3 for further discussion.
More generally, consider two estimators θ̂ and θ̃. We consider the testing situation
where

    H0 : plim(θ̂ − θ̃) = 0,
    Ha : plim(θ̂ − θ̃) ≠ 0.    (8.34)

Assume the difference between the two root-N consistent estimators is also root-N
consistent under H0, with mean 0 and a limit normal distribution, so that

    √N(θ̂ − θ̃) →d N[0, VH],
where VH denotes the variance matrix in the limiting distribution. Then the Hausman
test statistic

    H = (θ̂ − θ̃)′(N⁻¹V̂H)⁻¹(θ̂ − θ̃)    (8.35)

is asymptotically χ²(q) distributed under H0. We reject H0 at level α if H > χ²_α(q).
In some applications, such as tests of endogeneity, V[θ̂ − θ̃] is of less than full rank.
Then the generalized inverse is used in (8.35) and the chi-square test has degrees of
freedom equal to the rank of V[θ̂ − θ̃].
The Hausman test can be applied to just a subset of the parameters. For example,
interest may lie solely in the coefficient of the possibly endogenous regressor and
whether it changes in moving from OLS to 2SLS. Then just one component of θ is
used and the test statistic is χ2
(1) distributed. As in other settings, this test on a subset
of parameters can lead to a conclusion different from that of a test on all parameters.
8.3.2. Computation of the Hausman Test
Computing the Hausman test is easy in principle but difficult in practice owing to the
need to obtain a consistent estimate of VH, the limit variance matrix of √N(θ̂ − θ̃). In
general

    N⁻¹VH = V[θ̂ − θ̃] = V[θ̂] + V[θ̃] − 2Cov[θ̂, θ̃].    (8.36)

The first two quantities are readily computed from the usual output, but the third is
not.
Computation for Fully Efficient Estimator under the Null Hypothesis
Although the essential null and alternative hypotheses of the Hausman test are as in
(8.34), in applications there is usually a specific null hypothesis model and alternative
hypothesis in mind. For example, in comparing OLS and 2SLS estimators the null hy-
pothesis model has all regressors exogenous whereas the alternative hypothesis model
permits some regressors to be endogenous.
If θ̃ is the efficient estimator in the null hypothesis model, then Cov[θ̂, θ̃] = V[θ̃].
For a proof see Exercise 8.3. This implies that V[θ̂ − θ̃] = V[θ̂] − V[θ̃], so

    H = (θ̂ − θ̃)′( V̂[θ̂] − V̂[θ̃] )⁻¹(θ̂ − θ̃).    (8.37)

This statistic has the considerable advantage of requiring only the estimated asymptotic
variance matrices of the parameter estimates θ̂ and θ̃. It is helpful to use a program
that permits saving parameter and variance matrix estimates and computation using
matrix commands.
For example, this simplification can be applied to endogeneity tests in a linear
regression model if the errors are assumed to be homoskedastic. Then θ̃ is the OLS
estimator, which is fully efficient under the null hypothesis of no endogeneity, and θ̂ is
the 2SLS estimator. Care is needed, however, to ensure that the consistent estimates of the
variance matrices are such that V̂[θ̂] − V̂[θ̃] is positive definite (see Ruud, 1984). In
the OLS–2SLS comparison the variance matrix estimators V̂[θ̂] and V̂[θ̃] should use
the same estimate of the error variance σ².
Version (8.37) of the Hausman test is especially easy to calculate by hand if θ is a
scalar, or if only one component of the parameter vector is tested. Then

    H = (θ̂ − θ̃)²/(ŝ² − s̃²)

is χ²(1) distributed, where ŝ and s̃ are the reported standard errors of θ̂ and θ̃.
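If the parameter estimates and variance matrices can be saved, (8.37) is a few lines of matrix algebra. A hedged numpy sketch follows; the function name is hypothetical, and a generalized inverse is used so that the rank-deficient case discussed after (8.35) is handled as well.

```python
import numpy as np
from scipy import stats

def hausman(theta_hat, theta_tilde, V_hat, V_tilde):
    """
    Hausman statistic (8.37): theta_tilde is efficient under H0 (e.g., OLS),
    theta_hat is consistent under H0 and Ha (e.g., 2SLS), with estimated
    variance matrices V_hat and V_tilde computed on a comparable basis.
    """
    d = theta_hat - theta_tilde
    Vdiff = V_hat - V_tilde
    dof = np.linalg.matrix_rank(Vdiff)
    H = d @ np.linalg.pinv(Vdiff) @ d      # generalized inverse for rank deficiency
    return H, dof, stats.chi2.sf(H, dof)

# Scalar shortcut for one coefficient: H = (b_hat - b_tilde)**2 / (se_hat**2 - se_tilde**2).
```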
Auxiliary Regressions
In some leading cases the Hausman test can be more simply computed as a standard
test for the significance of a subset of regressors in an augmented OLS regression,
derived under the assumption that θ̃ is fully efficient. Examples are given in Section
8.4.3 and in Section 21.4.3.
Robust Hausman Tests
The simpler version (8.37) of the Hausman test, and the standard auxiliary regressions, require the strong distributional assumption that $\tilde{\theta}$ is fully efficient. This is counter to the approach of performing robust inference under relatively weak distributional assumptions.
Direct estimation of $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}]$ and hence $\mathbf{V}_H$ is in principle possible. Suppose $\hat{\theta}$ and $\tilde{\theta}$ are m-estimators that solve $\sum_i \mathbf{h}_{1i}(\hat{\theta}) = \mathbf{0}$ and $\sum_i \mathbf{h}_{2i}(\tilde{\theta}) = \mathbf{0}$. Define $\hat{\delta}' = [\hat{\theta}', \tilde{\theta}']$. Then $V[\hat{\delta}] = \mathbf{G}_0^{-1}\mathbf{S}_0(\mathbf{G}_0^{-1})'$, where $\mathbf{G}_0$ and $\mathbf{S}_0$ are defined in Section 6.6, with the simplification that here $\mathbf{G}_{12} = \mathbf{0}$. The desired $V[\hat{\theta} - \tilde{\theta}] = \mathbf{R}V[\hat{\delta}]\mathbf{R}'$, where $\mathbf{R} = [\mathbf{I}_q, -\mathbf{I}_q]$. Implementation can require additional coding that may be application specific.
A simpler approach is to bootstrap (see Section 11.6.3), though care is needed in
some applications to ensure use of the correct degrees of freedom in the chi-square
test.
Another possible approach for less than fully efficient $\tilde{\theta}$ is to use an auxiliary regression that is appropriate in the efficient case but to perform the subset-of-regressors test using robust standard errors. This robust test is simple to implement and will have power in testing the misspecification of interest, though it need not be equivalent to the Hausman test that uses the more general form of $H$ given in (8.35). An example is given in Section 21.4.3.
Finally, bounds can be calculated that do not require computation of $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}]$. For scalar random variables, $\mathrm{Cov}[x, y] \le s_x s_y$. For the scalar case this suggests an upper bound for $H$ of $N(\hat{\theta} - \tilde{\theta})^2/(\hat{s}^2 + \tilde{s}^2 - 2\hat{s}\tilde{s})$, where $\hat{s}^2 = \widehat{V}[\hat{\theta}]$ and $\tilde{s}^2 = \widehat{V}[\tilde{\theta}]$. A lower bound for $H$ is $N(\hat{\theta} - \tilde{\theta})^2/(\hat{s}^2 + \tilde{s}^2)$, under the assumption that $\hat{\theta}$ and $\tilde{\theta}$ are positively correlated. In practice, however, these bounds are quite wide.
8.3.3. Power of the Hausman Test
The Hausman test is a quite general procedure that does not explicitly state an alterna-
tive hypothesis and therefore need not have high power against particular alternatives.
For example, consider tests of exclusion restrictions in fully parametric models. Denote the null hypothesis $H_0: \theta_2 = \mathbf{0}$, where $\theta$ is partitioned as $(\theta_1', \theta_2')'$. An obvious specification test is a Hausman test of the difference $\hat{\theta}_1 - \tilde{\theta}_1$, where $(\hat{\theta}_1, \hat{\theta}_2)$ is the unrestricted MLE and $(\tilde{\theta}_1, \mathbf{0})$ is the restricted MLE of $\theta$. Holly (1982) showed that this Hausman test coincides with a classical test (Wald, LR, or LM) of $H_0: \mathcal{I}_{11}^{-1}\mathcal{I}_{12}\theta_2 = \mathbf{0}$, where $\mathcal{I}_{ij} = \mathrm{E}\left[\partial^2 L(\theta_1, \theta_2)/\partial\theta_i\partial\theta_j'\right]$, rather than of $H_0: \theta_2 = \mathbf{0}$. The two tests coincide if $\mathcal{I}_{12}$ is of full column rank and $\dim(\theta_1) \ge \dim(\theta_2)$, as then $\mathcal{I}_{11}^{-1}\mathcal{I}_{12}\theta_2 = \mathbf{0}$ iff $\theta_2 = \mathbf{0}$. Otherwise, they can differ. Clearly, the Hausman test will have no power to detect departures from $H_0$ if the information matrix is block diagonal, as then $\mathcal{I}_{12} = \mathbf{0}$. Holly (1987) extended the analysis to nonlinear hypotheses.
8.4. Tests for Some Common Misspecifications
In this section we present tests for some common model misspecifications. Attention
is focused on test statistics that can be computed using auxiliary regressions, using
minimal assumptions to permit inference robust to heteroskedastic errors.
8.4.1. Tests for Omitted Variables
Omitted variables usually lead to inconsistent parameter estimates, except for special
cases such as an omitted regressor in the linear model that is uncorrelated with the
other regressors. It is therefore important to test for potential omitted variables.
The Wald test is most often used as it is usually no more difficult to estimate the
model with omitted variables included than to estimate the restricted model with omit-
ted variables excluded. Furthermore, this test can use robust sandwich standard errors,
though this really only makes sense if the estimator retains consistency in situations
where robust sandwich errors are necessary.
If attention is restricted to ML estimation an alternative is to estimate models with
and without the potentially irrelevant regressors and perform an LR test.
Robust forms of the LM test can be easily computed in some settings. For example,
consider a test of $H_0: \beta_2 = \mathbf{0}$ in the Poisson model with mean $\exp(\mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2)$. The LM test statistic is based on the score statistic $\sum_i \mathbf{x}_i\hat{u}_i$, where $\hat{u}_i = y_i - \exp(\mathbf{x}_{1i}'\hat{\beta}_1)$ (see Section 7.3.2). Now a heteroskedastic-robust estimate for the variance of $N^{-1/2}\sum_i \mathbf{x}_i u_i$, where $u_i = y_i - \mathrm{E}[y_i|\mathbf{x}_i]$, is $N^{-1}\sum_i \hat{u}_i^2\mathbf{x}_i\mathbf{x}_i'$, and it can be shown that
$$\mathrm{LM}^{+} = \left(\sum_{i=1}^{N}\mathbf{x}_i\hat{u}_i\right)'\left(\sum_{i=1}^{N}\hat{u}_i^2\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{N}\mathbf{x}_i\hat{u}_i\right)$$
is a robust LM test statistic that does not require the Poisson restriction that $V[u_i|\mathbf{x}_i] = \exp(\mathbf{x}_{1i}'\beta_1)$ under $H_0$. This can be computed as $N$ times the uncentered $R^2$ from regression of 1 on $\mathbf{x}_{1i}\hat{u}_i$ and $\mathbf{x}_{2i}\hat{u}_i$. Such robust LM tests are possible more generally for assumed models in the linear exponential family, as the score statistic in such models is again a weighted average of a residual $\hat{u}_i$ (see Wooldridge, 1991). This class includes OLS, and adaptations are also possible when estimation is by 2SLS or by NLS; see Wooldridge (2002).
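To make the auxiliary-regression computation concrete, here is a small sketch (my own illustration, with hypothetical input names) of the robust LM statistic for the Poisson case: fit the restricted Poisson model, form the residuals, and compute $N$ times the uncentered $R^2$ from regressing 1 on $\mathbf{x}_{1i}\hat{u}_i$ and $\mathbf{x}_{2i}\hat{u}_i$. It assumes statsmodels and numpy are available.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def robust_lm_poisson(y, X1, X2):
    """Robust LM test of H0: beta2 = 0 in a Poisson model with mean exp(X1 b1 + X2 b2).
    X1 should include the intercept column; X2 holds the regressors being tested."""
    fit = sm.GLM(y, X1, family=sm.families.Poisson()).fit()   # restricted fit (beta2 = 0)
    u = y - fit.fittedvalues                                  # raw residuals under H0
    Z = np.column_stack([X1, X2]) * u[:, None]                # regressors x_i * u_i
    ones = np.ones(len(y))
    bhat, *_ = np.linalg.lstsq(Z, ones, rcond=None)           # regression of 1 on Z
    r2_uncentered = 1 - np.sum((ones - Z @ bhat) ** 2) / np.sum(ones ** 2)
    LM = len(y) * r2_uncentered                               # N times uncentered R^2
    df = X2.shape[1]
    return LM, 1 - stats.chi2.cdf(LM, df)
```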
8.4.2. Tests for Heteroskedasticity
Parameter estimates in linear or nonlinear regression models of the conditional mean
estimated by LS or IV methods retain their consistency in the presence of het-
eroskedasticity. The only correction needed is to the standard errors of these estimates.
This does not require modeling heteroskedasticity, as heteroskedastic-robust standard
errors can be computed under minimal distributional assumptions using the result of
White (1980). So there is little need to test for heteroskedasticity, unless estimator
efficiency is of great concern. Nonetheless, we summarize some results on tests for
heteroskedasticity.
We begin with LS estimation of the linear regression model $y = \mathbf{x}'\beta + u$. Suppose heteroskedasticity is modeled by $V[u|\mathbf{x}] = g(\alpha_1 + \mathbf{z}'\alpha_2)$, where $\mathbf{z}$ is usually a subset of $\mathbf{x}$ and $g(\cdot)$ is often the exponential function. The literature focuses on tests of $H_0: \alpha_2 = \mathbf{0}$ using the LM approach because, unlike Wald and LR tests, these require only OLS estimation of $\beta$. The standard LM test of Breusch and Pagan (1979) depends heavily on the assumption of normally distributed errors, as it uses the restriction that $\mathrm{E}[u^4|\mathbf{x}] = 3\sigma^4$ under $H_0$. Koenker (1981) proposed a more robust version of the LM test, $NR^2$ from regression of $\hat{u}_i^2$ on 1 and $\mathbf{z}_i$, where $\hat{u}_i$ is the OLS residual. This test requires the weaker assumption that $\mathrm{E}[u^4|\mathbf{x}]$ is constant. Like the Breusch–Pagan test it is invariant to choice of the function $g(\cdot)$. The White (1980a) test for heteroskedasticity is equivalent to this LM test, with $\mathbf{z} = \mathrm{Vech}[\mathbf{x}\mathbf{x}']$. The test can be further generalized to let $\mathrm{E}[u^4|\mathbf{x}]$ vary with $\mathbf{x}$, though constancy may be a reasonable assumption for the test since $H_0$ already specifies that $\mathrm{E}[u^2|\mathbf{x}]$ is constant.
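A minimal sketch of Koenker's version of this test (my illustration, under the assumption that the OLS residuals and the candidate variables z are already available as numpy arrays):

```python
import numpy as np
from scipy import stats

def koenker_lm_het_test(u_ols, Z):
    """Koenker (1981) heteroskedasticity test: N * R^2 from the OLS regression
    of squared residuals on an intercept and the variables in Z (an N x m array)."""
    N = len(u_ols)
    y = u_ols ** 2
    X = np.column_stack([np.ones(N), Z])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)   # centered R^2
    LM = N * r2
    df = Z.shape[1]
    return LM, 1 - stats.chi2.cdf(LM, df)
```

Setting Z to the distinct elements of $\mathbf{x}\mathbf{x}'$ gives the White-test variant described above.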
Qualitatively similar results carry over to nonlinear models of the conditional mean that assume a particular form of heteroskedasticity that may be tested for misspecification. For example, the Poisson regression model sets $V[y|\mathbf{x}] = \exp(\mathbf{x}'\beta)$. More
generally, for models in the linear exponential family, the quasi-MLE is consistent
despite misspecified heteroskedasticity and qualitatively similar results to those here
apply. Then valid inference is possible even if the model for heteroskedasticity is mis-
specified, provided the robust standard errors presented in Section 5.7.4 are used. If
one still wishes to test for correct specification of heteroskedasticity then robust LM
tests are possible (see Wooldridge, 1991).
Heteroskedasticity can lead to the more serious consequence of inconsistency of pa-
rameter estimates in some nonlinear models. A leading example is the Tobit model (see
Chapter 16), a linear regression model with normal homoskedastic errors that becomes
nonlinear as the result of censoring or truncation. Then testing for heteroskedasticity
becomes more important. A model for V[u|x] can be specified and Wald, LR, or LM
tests can be performed or m-tests for heteroskedasticity can be used (see Pagan and
Vella, 1989).
8.4.3. Hausman Tests for Endogeneity
Instrumental variables estimators should only be used where there is a need for them,
since LS estimators are more efficient if all regressors are exogenous and from Sec-
tion 4.9 this loss of efficiency can be substantial. It can therefore be useful to test
whether IV methods are needed. A test for endogeneity of regressors compares IV
estimates with LS estimates. If regressors are endogenous then in the limit these esti-
mates will differ, whereas if regressors are exogenous the two estimators will not differ.
Thus large differences between LS and IV estimates can be interpreted as evidence of
endogeneity.
This example provides the original motivation for the Hausman test. Consider the linear regression model
$$y = \mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2 + u, \qquad (8.38)$$
where $\mathbf{x}_1$ is potentially endogenous and $\mathbf{x}_2$ is exogenous. Let $\tilde{\beta}$ be the OLS estimator and $\hat{\beta}$ be the 2SLS estimator in (8.38). Assuming homoskedastic errors so that OLS is efficient under the null hypothesis of no endogeneity, a Hausman test of endogeneity of $\mathbf{x}_1$ can be calculated using the test statistic $H$ defined in (8.37). Because $V[\hat{\beta}] - V[\tilde{\beta}]$ can be shown to be not of full rank, however, a generalized inverse is needed and the degrees of freedom are $\dim(\beta_1)$ rather than $\dim(\beta)$.
Hausman (1978) showed that the test can more simply be implemented by a test of $\gamma = \mathbf{0}$ in the augmented OLS regression
$$y = \mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2 + \hat{\mathbf{x}}_1'\gamma + u,$$
where $\hat{\mathbf{x}}_1$ is the predicted value of the endogenous regressors $\mathbf{x}_1$ from reduced-form multivariate regression of $\mathbf{x}_1$ on the instruments $\mathbf{z}$. Equivalently, we can test $\gamma = \mathbf{0}$ in the augmented OLS regression
$$y = \mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2 + \hat{\mathbf{v}}_1'\gamma + u,$$
where $\hat{\mathbf{v}}_1$ is the residual from the reduced-form multivariate regression of $\mathbf{x}_1$ on the instruments $\mathbf{z}$. The intuition for these tests is that if $u$ in (8.38) is uncorrelated with $\mathbf{x}_1$ and $\mathbf{x}_2$, then $\gamma = \mathbf{0}$. If instead $u$ is correlated with $\mathbf{x}_1$, then this will be picked up by significance of additional transformations of $\mathbf{x}_1$ such as $\hat{\mathbf{x}}_1$ and $\hat{\mathbf{v}}_1$.
For cross-section data it is customary to presume heteroskedastic errors. Then the OLS estimator $\tilde{\beta}$ is inefficient in (8.38) and the simpler version (8.37) of the Hausman test cannot be used. However, the preceding augmented OLS regressions can still be used, provided $\gamma = \mathbf{0}$ is tested using the heteroskedastic-consistent estimate of the variance matrix. This should actually be equivalent to the Hausman test, as from Davidson and MacKinnon (1993, p. 239) $\hat{\gamma}_{\mathrm{OLS}}$ in these augmented regressions equals $\mathbf{A}_N(\tilde{\beta} - \hat{\beta})$, where $\mathbf{A}_N$ is a full-rank matrix with finite probability limit.
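A sketch of the residual-augmented version with heteroskedasticity-robust inference follows (my own illustration, not the authors' code). It assumes numpy arrays y, x1 (the potentially endogenous regressors, shaped N by k1), x2 (exogenous regressors including the intercept), and instruments z, and that statsmodels is available.

```python
import numpy as np
import statsmodels.api as sm

def endogeneity_test(y, x1, x2, z):
    """Test endogeneity of x1 by adding first-stage residuals to the OLS regression
    and testing their coefficients with heteroskedasticity-robust (HC0) standard errors."""
    Zfull = np.column_stack([x2, z])               # reduced-form regressors
    v1 = np.empty_like(x1, dtype=float)
    for j in range(x1.shape[1]):                   # reduced-form regression for each column of x1
        v1[:, j] = sm.OLS(x1[:, j], Zfull).fit().resid
    X_aug = np.column_stack([x1, x2, v1])
    fit = sm.OLS(y, X_aug).fit(cov_type="HC0")     # robust variance matrix
    k = x1.shape[1] + x2.shape[1]
    R = np.zeros((v1.shape[1], X_aug.shape[1]))
    R[:, k:] = np.eye(v1.shape[1])                 # restrictions: coefficients on v1 equal zero
    res = fit.wald_test(R, use_f=False)
    return float(np.squeeze(res.statistic)), float(np.squeeze(res.pvalue))
```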
Additional Hausman tests for endogeneity are possible. Suppose $y = \mathbf{x}_1'\beta_1 + \mathbf{x}_2'\beta_2 + \mathbf{x}_3'\beta_3 + u$, where $\mathbf{x}_1$ is potentially endogenous, $\mathbf{x}_2$ is assumed to be endogenous, and $\mathbf{x}_3$ is assumed to be exogenous. Then endogeneity of $\mathbf{x}_1$ can be tested by comparing the 2SLS estimator with just $\mathbf{x}_2$ instrumented to the 2SLS estimator with both $\mathbf{x}_1$ and $\mathbf{x}_2$ instrumented. The Hausman test can also be generalized to nonlinear regression models, with OLS replaced by NLS and 2SLS replaced by NL2SLS. Davidson and MacKinnon (1993) present augmented regressions that can be used to compute the relevant Hausman test, assuming homoskedastic errors. Mroz (1987) provides a good application of endogeneity tests including examples of computation of $V[\hat{\theta} - \tilde{\theta}]$ when $\tilde{\theta}$ is not efficient.
8.4.4. OIR Tests for Exogeneity
If an IV estimator is used then the instruments must be exogenous for the IV estimator
to be consistent. For just-identified models it is not possible to test for instrument
exogeneity. Instead, a priori arguments need to be used to justify instrument validity.
Some examples are given in Section 4.8.2. For overidentified models, however, a test
for exogeneity of instruments is possible.
We begin with linear regression. Then $y = \mathbf{x}'\beta + u$ and instruments $\mathbf{z}$ are valid if $\mathrm{E}[u|\mathbf{z}] = 0$ or if $\mathrm{E}[\mathbf{z}u] = \mathbf{0}$. An obvious test of $H_0: \mathrm{E}[\mathbf{z}u] = \mathbf{0}$ is based on departures of $N^{-1}\sum_i \mathbf{z}_i\hat{u}_i$ from zero. In the just-identified case the IV estimator solves $N^{-1}\sum_i \mathbf{z}_i\hat{u}_i = \mathbf{0}$, so this test is not useful. In the overidentified case the overidentifying restrictions test presented in Section 6.3.8 is
$$\mathrm{OIR} = \hat{\mathbf{u}}'\mathbf{Z}\widehat{\mathbf{S}}^{-1}\mathbf{Z}'\hat{\mathbf{u}}, \qquad (8.39)$$
where $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\beta}$, $\hat{\beta}$ is the optimal GMM estimator that minimizes $\mathbf{u}'\mathbf{Z}\widehat{\mathbf{S}}^{-1}\mathbf{Z}'\mathbf{u}$, and $\widehat{\mathbf{S}}$ is consistent for $\mathrm{plim}\;N^{-1}\sum_i u_i^2\mathbf{z}_i\mathbf{z}_i'$. The OIR test of Hansen (1982) is an extension of a test proposed by Sargan (1958) for linear IV, and the test statistic (8.39) is often called a Sargan test. If OIR is large then the moment conditions are rejected and the IV estimator is inconsistent. Rejection of $H_0$ is usually interpreted as evidence that the instruments $\mathbf{z}$ are endogenous, but it could also be evidence of model misspecification, so that in fact $y \ne \mathbf{x}'\beta + u$. In either case rejection indicates problems for the IV estimator.
As formally derived in Section 6.3.9, OIR is distributed as $\chi^2(r - K)$ under $H_0$, where $(r - K)$ is the number of overidentifying restrictions. To gain some intuition for this result it is useful to specialize to homoskedastic errors. Then $\widehat{\mathbf{S}} = \hat{\sigma}^2\mathbf{Z}'\mathbf{Z}$, where $\hat{\sigma}^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}/(N - K)$, so
$$\mathrm{OIR} = \frac{\hat{\mathbf{u}}'\mathbf{P}_Z\hat{\mathbf{u}}}{\hat{\mathbf{u}}'\hat{\mathbf{u}}/(N - K)},$$
where $\mathbf{P}_Z = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$. Thus OIR is a ratio of quadratic forms in $\hat{\mathbf{u}}$. Under $H_0$ the numerator has probability limit $\sigma^2(r - K)$ and the denominator has plim $\hat{\sigma}^2 = \sigma^2$, so the ratio is centered on $r - K$, which is the mean of a $\chi^2(r - K)$ random variable.
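A minimal sketch of this homoskedastic Sargan form (my illustration, assuming numpy arrays y, X, and instruments Z with more columns than X):

```python
import numpy as np
from scipy import stats

def sargan_oir(y, X, Z):
    """Sargan test of overidentifying restrictions for linear IV with homoskedastic errors.
    Computes u'P_Z u / [u'u/(N-K)] using 2SLS residuals; chi-square with r-K df."""
    N, K = X.shape
    r = Z.shape[1]
    PZ = Z @ np.linalg.inv(Z.T @ Z) @ Z.T          # projection onto the instrument space
    Xhat = PZ @ X
    b_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
    u = y - X @ b_2sls                             # 2SLS residuals
    oir = (u @ PZ @ u) / (u @ u / (N - K))
    df = r - K
    return oir, 1 - stats.chi2.cdf(oir, df)
```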
The test statistic in (8.39) extends immediately to nonlinear regression, by simply defining $u_i = y_i - g(\mathbf{x}_i, \beta)$ or $\mathbf{u} = \mathbf{r}(\mathbf{y}, \mathbf{X}, \beta)$ as in Section 6.5, and to linear systems and panel estimators by appropriate definition of $\mathbf{u}$ (see Sections 6.9 and 6.10).
For linear IV with homoskedastic errors alternative OIR tests to (8.39) have been proposed. Magdalinos (1988) contrasts a number of these tests. One can also use incremental OIR tests of a subset of overidentifying restrictions.
8.4.5. RESET Test
A common functional form misspecification may involve neglected nonlinearity in some of the regressors. Consider the regression $y = \mathbf{x}'\beta + u$, where we assume that the regressors enter linearly and are asymptotically uncorrelated with the error $u$. To test for nonlinearity one straightforward approach is to enter power functions of exogenous variables, most commonly squares, as additional independent regressors and test the statistical significance of these additional variables using a Wald test or an F-test. This requires the investigator to have specific reasons for considering nonlinearity, and clearly the technique will not work for categorical $\mathbf{x}$ variables.
Ramsey (1969) suggested a test of omitted variables from the regression that can be formulated as a test of functional form. The proposal is to fit the initial regression and generate new regressors that are functions of fitted values $\hat{y} = \mathbf{x}'\hat{\beta}$, such as $\mathbf{w} = [(\mathbf{x}'\hat{\beta})^2, (\mathbf{x}'\hat{\beta})^3, \ldots, (\mathbf{x}'\hat{\beta})^p]$. Then estimate the model $y = \mathbf{x}'\beta + \mathbf{w}'\gamma + u$, and the test of nonlinearity is the Wald test of $p$ restrictions, $H_0: \gamma = \mathbf{0}$ against $H_a: \gamma \ne \mathbf{0}$. Typically a low value of $p$ such as 2 or 3 is used. This test can be made robust to heteroskedasticity.
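The following sketch (mine, with illustrative names) implements a heteroskedasticity-robust RESET-type test with powers 2 and 3 of the fitted values, assuming statsmodels:

```python
import numpy as np
import statsmodels.api as sm

def reset_test(y, X, max_power=3):
    """RESET-type test: add powers 2..max_power of the OLS fitted values and test their
    joint significance using heteroskedasticity-robust (HC0) standard errors."""
    fit0 = sm.OLS(y, X).fit()
    yhat = fit0.fittedvalues
    W = np.column_stack([yhat ** p for p in range(2, max_power + 1)])
    fit1 = sm.OLS(y, np.column_stack([X, W])).fit(cov_type="HC0")
    k = X.shape[1]
    R = np.zeros((W.shape[1], k + W.shape[1]))
    R[:, k:] = np.eye(W.shape[1])                 # test the coefficients on the added powers
    res = fit1.wald_test(R, use_f=False)
    return float(np.squeeze(res.statistic)), float(np.squeeze(res.pvalue))
```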
8.5. Discriminating between Nonnested Models
Two models are nested if one is a special case of the other; they are nonnested if
neither can be represented as a special case of the other. Discriminating between nested
models is possible using a standard hypothesis test of the parametric restrictions that
reduce one model to the other. In the nonnested case, however, alternative methods
need to be developed.
The presentation focuses on nonnested model discrimination within the likelihood
framework, where results are well developed. A brief discussion of the nonlikelihood
case is given in Section 8.5.4. Bayesian methods for model discrimination are pre-
sented in Section 13.8.
8.5.1. Information Criteria
Information criteria are log-likelihood criteria with degrees of freedom adjustment.
The model with the smallest information criterion is preferred.
The essential intuition is that there exists a tension between model fit, as measured
by the maximized log-likelihood value, and the principle of parsimony that favors a
simple model. The fit of the model can be improved by increasing model complexity.
However, parameters are only added if the resulting improvement in fit sufficiently
compensates for loss of parsimony. Note that in this viewpoint it is not necessary
that the set of models under consideration should include the “true dgp.” Different
information criteria vary in how steeply they penalize model complexity.
Akaike (1973) originally proposed the Akaike information criterion
AIC = −2 ln L + 2q, (8.40)
where q is the number of parameters, with the model with lowest AIC preferred. The
term information criterion is used because the underlying theory, presented more sim-
ply in Amemiya (1980), discriminates among models using the Kullback–Liebler in-
formation criterion (KLIC).
A considerable number of modifications to AIC have been proposed, all of the form $-2\ln L + g(q, N)$ for a specified penalty function $g(\cdot)$ that exceeds $2q$. The most popular
variation is the Bayesian information criterion
$$\mathrm{BIC} = -2\ln L + (\ln N)q, \qquad (8.41)$$
proposed by Schwarz (1978). Schwarz assumed $y$ has density in the exponential family with parameter $\theta$, the $j$th model has parameter $\theta_j$ with $\dim[\theta_j] = q_j < \dim[\theta]$, and the prior across models is a weighted sum of the prior for each $\theta_j$. He showed that under these assumptions maximizing the posterior probability (see Chapter 13) is asymptotically equivalent to choosing the model for which $\ln L - (\ln N)q_j/2$ is largest. Since this is equivalent to minimizing (8.41), the procedure of Schwarz has been labeled the Bayesian information criterion. A refinement of AIC based on minimization of KLIC that is similar to BIC is the consistent AIC, $\mathrm{CAIC} = -2\ln L + (1 + \ln N)q$. Some authors define criteria such as AIC and BIC by additionally dividing by $N$ in the right-hand sides of (8.40) and (8.41).
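A small helper (my own sketch) computing AIC, BIC, and CAIC as defined in (8.40) and (8.41) from a maximized log-likelihood:

```python
import numpy as np

def information_criteria(loglik, q, N):
    """AIC, BIC, and CAIC as in (8.40)-(8.41); smaller values are preferred."""
    aic = -2.0 * loglik + 2.0 * q
    bic = -2.0 * loglik + np.log(N) * q
    caic = -2.0 * loglik + (1.0 + np.log(N)) * q
    return {"AIC": aic, "BIC": bic, "CAIC": caic}

# Check against Model 2 of Table 8.2 below (q = 3, N = 100, -2 ln L = 352.18):
# information_criteria(loglik=-352.18 / 2, q=3, N=100) gives AIC = 358.18, BIC = 366.00.
```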
If model parsimony is important, then BIC is more widely used as the model-size penalty for AIC is relatively low. Consider two nested models with $q_1$ and $q_2$ parameters, respectively, where $q_2 = q_1 + h$. An LR test is then possible and favors the larger model at significance level 5% if $2\ln L$ increases by more than $\chi^2_{.05}(h)$. AIC favors the larger model if $2\ln L$ increases by more than $2h$, a lesser penalty for model size than the LR test if $h < 7$. In particular for $h = 1$, that is, one restriction, the LR test uses a 5% critical value of 3.84 whereas AIC uses a much lower value of 2. The BIC favors the larger model if $2\ln L$ increases by more than $h\ln N$, a much larger penalty than either AIC or an LR test of size 0.05 (unless $N$ is exceptionally small).
The Bayesian information criterion increases the penalty as sample size increases, whereas traditional hypothesis tests at a significance level such as 5% do not. For nested models with $q_2 = q_1 + 1$ choosing the larger model on the basis of lower BIC is equivalent to using a two-sided t-test critical value of $\sqrt{\ln N}$, which equals 2.15, 3.03, and 3.72, respectively, for $N = 10^2$, $10^4$, and $10^6$. By comparison traditional hypothesis tests with size 0.05 use an unchanging critical value of 1.96. More generally, for a $\chi^2(h)$ distributed test statistic the BIC suggests using a critical value of $h\ln N$ rather than the customary $\chi^2_{.05}(h)$.
Given their simplicity, penalized likelihood criteria are often used for selecting “the
best model.” However, there is no clear answer as to which criterion, if any, should
be preferred. Considerable approximation is involved in deriving the formulas for AIC
and related measures, and loss functions other than minimization of KLIC, or max-
imization of the posterior probability in the case of BIC, might be much more ap-
propriate. From a decision-theoretic viewpoint, the choice of the model from a set of
models should depend on the intended use of that model. For example, the purpose of
the model may be to summarize the main features of a complex reality, or to predict
some outcome, or to test some important hypothesis. In applied work it is quite rare to
see an explicit statement of the intended use of an econometric model.
8.5.2. Cox Likelihood Ratio Test of Nonnested Models
Consider choosing between two parametric models. Let model Fθ have density
f (y|x, θ) and model Gγ have density g(y|x, γ).
A likelihood ratio test of the model $F_\theta$ against $G_\gamma$ is based on
$$\mathrm{LR}(\hat{\theta}, \hat{\gamma}) \equiv L_f(\hat{\theta}) - L_g(\hat{\gamma}) = \sum_{i=1}^{N}\ln\frac{f(y_i|\mathbf{x}_i, \hat{\theta})}{g(y_i|\mathbf{x}_i, \hat{\gamma})}. \qquad (8.42)$$
If $G_\gamma$ is nested in $F_\theta$ then, from Section 7.3.1, $2\mathrm{LR}(\hat{\theta}, \hat{\gamma})$ is chi-square distributed under the null hypothesis that $F_\theta = G_\gamma$. However, this result no longer holds if the models are nonnested.
Cox (1961, 1962b) proposed solving this problem in the special case that $F_\theta$ is the true model but the models are not nested, by applying a central limit theorem under the assumption that $F_\theta$ is the true model.
This approach is computationally awkward to implement if one cannot analytically obtain $\mathrm{E}_f[\ln(f(y|\mathbf{x}, \theta)/g(y|\mathbf{x}, \gamma))]$, where $\mathrm{E}_f$ denotes expectation with respect to the density $f(y|\mathbf{x}, \theta)$. Furthermore, if a similar test statistic is obtained with the roles of $F_\theta$ and $G_\gamma$ reversed, it is possible to find both that model $F_\theta$ is rejected in favor of $G_\gamma$ and that model $G_\gamma$ is rejected in favor of $F_\theta$. The test is therefore not necessarily one of model selection, as it does not necessarily select one model or the other; instead it is a model specification test that zero, one, or two of the models can pass.
The Cox statistic has been obtained analytically in some cases. For nonnested linear regression models $y = \mathbf{x}'\beta + u$ and $y = \mathbf{z}'\gamma + v$ with homoskedastic normally distributed errors, see Pesaran (1974). For nonnested transformation models $h(y) = \mathbf{x}'\beta + u$ and $g(y) = \mathbf{z}'\gamma + v$, where $h(y)$ and $g(y)$ are known transformations, see Pesaran and Pesaran (1995), who use a simulation-based approach. This permits, for example, discrimination between linear and log-linear parametric models, with $h(\cdot)$ the identity transformation and $g(\cdot)$ the log transformation. Pesaran and Pesaran (1995) apply the idea to choosing between logit and probit models presented in Chapter 14.
8.5.3. Vuong Likelihood Ratio Test of Nonnested Models
Vuong (1989) provided a very general distribution theory for the LR test statistic that
covers both nested and nonnested models and more remarkably permits the dgp to be
an unknown density that differs from both f (·) and g(·).
The asymptotic results of Vuong, presented here to aid understanding of the variety
of tests presented in Vuong’s paper, are relatively complex as in some cases the test
statistic is a weighted sum of chi-squares with weights that can be difficult to compute.
Vuong proposed a test of
$$H_0: \mathrm{E}_0\left[\ln\frac{f(y|\mathbf{x}, \theta)}{g(y|\mathbf{x}, \gamma)}\right] = 0, \qquad (8.43)$$
where $\mathrm{E}_0$ denotes expectation with respect to the true dgp $h(y|\mathbf{x})$, which may be unknown. This is equivalent to testing $\mathrm{E}_h[\ln(h/g)] - \mathrm{E}_h[\ln(h/f)] = 0$, or testing whether the two densities $f$ and $g$ have the same Kullback–Liebler information criterion (see Section 5.7.2). One-sided alternatives are possible with $H_f: \mathrm{E}_0[\ln(f/g)] > 0$ and $H_g: \mathrm{E}_0[\ln(f/g)] < 0$.
An obvious test of $H_0$ is an m-test of whether the sample analogue $\mathrm{LR}(\hat{\theta}, \hat{\gamma})$ defined in (8.42) differs from zero. Here the distribution of the test statistic is to be obtained with possibly unknown dgp. This is possible because from Section 5.7.1 the quasi-MLE $\hat{\theta}$ converges to the pseudo-true value $\theta_*$ and $\sqrt{N}(\hat{\theta} - \theta_*)$ has a limit normal distribution, with a similar result for the quasi-MLE $\hat{\gamma}$.
General Result
The resulting distribution of $\mathrm{LR}(\hat{\theta}, \hat{\gamma})$ varies according to whether or not the two models, both possibly incorrect, are equivalent in the sense that $f(y|\mathbf{x}, \theta_*) = g(y|\mathbf{x}, \gamma_*)$, where $\theta_*$ and $\gamma_*$ are the pseudo-true values of $\theta$ and $\gamma$.
If $f(y|\mathbf{x}, \theta_*) = g(y|\mathbf{x}, \gamma_*)$ then
$$2\mathrm{LR}(\hat{\theta}, \hat{\gamma}) \;\overset{d}{\to}\; M_{p+q}(\lambda_*), \qquad (8.44)$$
where $p$ and $q$ are the dimensions of $\theta$ and $\gamma$ and $M_{p+q}(\lambda_*)$ denotes the cdf of the weighted sum of chi-squared variables $\sum_{j=1}^{p+q}\lambda_{*j}Z_j^2$. The $Z_j^2$ are iid $\chi^2(1)$ and the $\lambda_{*j}$ are the eigenvalues of the $(p+q)\times(p+q)$ matrix
$$\mathbf{W} = \begin{bmatrix} -\mathbf{B}_f(\theta_*)\mathbf{A}_f(\theta_*)^{-1} & -\mathbf{B}_{fg}(\theta_*, \gamma_*)\mathbf{A}_g(\gamma_*)^{-1} \\ -\mathbf{B}_{gf}(\gamma_*, \theta_*)\mathbf{A}_f(\theta_*)^{-1} & -\mathbf{B}_g(\gamma_*)\mathbf{A}_g(\gamma_*)^{-1} \end{bmatrix}, \qquad (8.45)$$
where $\mathbf{A}_f(\theta_*) = \mathrm{E}_0[\partial^2\ln f/\partial\theta\partial\theta']$, $\mathbf{B}_f(\theta_*) = \mathrm{E}_0[(\partial\ln f/\partial\theta)(\partial\ln f/\partial\theta')]$, the matrices $\mathbf{A}_g(\gamma_*)$ and $\mathbf{B}_g(\gamma_*)$ are similarly defined for the density $g(\cdot)$, the cross-matrix $\mathbf{B}_{fg}(\theta_*, \gamma_*) = \mathrm{E}_0[(\partial\ln f/\partial\theta)(\partial\ln g/\partial\gamma')]$, and expectations are with respect to the true dgp. For explanation and derivation of these results see Vuong (1989).
If instead $f(y|\mathbf{x}, \theta_*) \ne g(y|\mathbf{x}, \gamma_*)$, then under $H_0$
$$N^{-1/2}\mathrm{LR}(\hat{\theta}, \hat{\gamma}) \;\overset{d}{\to}\; \mathcal{N}[0, \omega_*^2], \qquad (8.46)$$
where
$$\omega_*^2 = V_0\left[\ln\frac{f(y|\mathbf{x}, \theta_*)}{g(y|\mathbf{x}, \gamma_*)}\right], \qquad (8.47)$$
and the variance is with respect to the true dgp. For derivation again see Vuong (1989).
Use of these results varies with whether or not one model is assumed to be correctly
specified and with the nesting relationship between the two models.
Vuong differentiated among three types of model comparisons. The models $F_\theta$ and $G_\gamma$ are (1) nested, with $G_\gamma$ nested in $F_\theta$, if $G_\gamma \subset F_\theta$; (2) strictly nonnested if and only if $F_\theta \cap G_\gamma = \emptyset$, so that neither model can specialize to the other; and (3) overlapping if $F_\theta \cap G_\gamma \ne \emptyset$ and $F_\theta \not\subset G_\gamma$ and $G_\gamma \not\subset F_\theta$. Similar distinctions are made by Pesaran and Pesaran (1995).
Both (2) and (3) are nonnested models, but they require different testing procedures. Examples of strictly nonnested models are linear models with different error distributions and nonlinear regression models with the same error distribution but different functional forms for the conditional mean. For overlapping models some specializations of the two models are equal. An example is linear models with some regressors in common and some regressors not in common.
Nested Models
For nested models it is necessarily the case that $f(y|\mathbf{x}, \theta_*) = g(y|\mathbf{x}, \gamma_*)$. For $G_\gamma$ nested in $F_\theta$, $H_0$ is tested against $H_f: \mathrm{E}_0[\ln(f/g)] > 0$.
For density possibly misspecified the weighted chi-square result (8.44) is appropriate, using the eigenvalues $\hat{\lambda}_j$ of the sample analogue of $\mathbf{W}$ in (8.45). Alternatively, one can use the eigenvalues $\tilde{\lambda}_j$ of the sample analogue of the smaller matrix
$$\widetilde{\mathbf{W}} = \mathbf{B}_f(\theta_*)\left[\mathbf{D}(\gamma_*)\mathbf{A}_g(\gamma_*)^{-1}\mathbf{D}(\gamma_*)' - \mathbf{A}_f(\theta_*)^{-1}\right],$$
where $\mathbf{D}(\gamma_*) = \partial\phi(\gamma_*)/\partial\gamma'$ and the constrained quasi-MLE $\tilde{\theta} = \phi(\hat{\gamma})$; see Vuong (1989). This result provides a robustified version of the standard LR test for nested models.
If the density $f(\cdot)$ is actually correctly specified, or more generally satisfies the IM equality, we get the expected result that $2\mathrm{LR}(\hat{\theta}, \hat{\gamma}) \overset{d}{\to} \chi^2(p - q)$, as then $(p - q)$ of the eigenvalues of $\mathbf{W}$ or $\widetilde{\mathbf{W}}$ equal one whereas the others equal zero.
Strictly Nonnested Models
For strictly nonnested models it is necessarily the case that $f(y|\mathbf{x}, \theta_*) \ne g(y|\mathbf{x}, \gamma_*)$. The normal distribution result (8.46) is applicable, and a consistent estimate of $\omega_*^2$ is
$$\hat{\omega}^2 = \frac{1}{N}\sum_{i=1}^{N}\left[\ln\frac{f(y_i|\mathbf{x}_i, \hat{\theta})}{g(y_i|\mathbf{x}_i, \hat{\gamma})}\right]^2 - \left[\frac{1}{N}\sum_{i=1}^{N}\ln\frac{f(y_i|\mathbf{x}_i, \hat{\theta})}{g(y_i|\mathbf{x}_i, \hat{\gamma})}\right]^2. \qquad (8.48)$$
Thus form
$$T_{\mathrm{LR}} = \frac{N^{-1/2}\mathrm{LR}(\hat{\theta}, \hat{\gamma})}{\hat{\omega}} \;\overset{d}{\to}\; \mathcal{N}[0, 1]. \qquad (8.49)$$
For tests with critical value $c$, $H_0$ is rejected in favor of $H_f: \mathrm{E}_0[\ln(f/g)] > 0$ if $T_{\mathrm{LR}} > c$, $H_0$ is rejected in favor of $H_g: \mathrm{E}_0[\ln(f/g)] < 0$ if $T_{\mathrm{LR}} < -c$, and discrimination between the two models is not possible if $|T_{\mathrm{LR}}| \le c$. The test can be modified to permit log-likelihood penalties similar to AIC and BIC; see Vuong (1989, p. 316). An asymptotically equivalent statistic to (8.49) replaces $\hat{\omega}^2$ by $\tilde{\omega}^2$, equal to just the first term on the right-hand side of (8.48).
This test assumes that both models are misspecified. If instead one of the models is
assumed to be correctly specified, the Cox test approach of Section 8.5.2 needs to be
used.
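As an illustration (my own sketch), the strictly nonnested statistic (8.49) can be computed from the two models' per-observation log-likelihood contributions, here assumed to be available as numpy arrays logf and logg evaluated at the fitted parameters:

```python
import numpy as np
from scipy import stats

def vuong_strictly_nonnested(logf, logg):
    """Vuong test statistic (8.49) for strictly nonnested models.
    logf, logg: per-observation log-densities ln f(y_i|x_i, theta_hat), ln g(y_i|x_i, gamma_hat)."""
    d = logf - logg                              # log-likelihood ratio contributions
    N = len(d)
    lr = d.sum()                                 # LR(theta_hat, gamma_hat) as in (8.42)
    omega2 = np.mean(d ** 2) - np.mean(d) ** 2   # variance estimate (8.48)
    t_lr = lr / (np.sqrt(N) * np.sqrt(omega2))   # N^{-1/2} LR / omega_hat
    pval = 2 * (1 - stats.norm.cdf(abs(t_lr)))   # two-sided p-value
    return t_lr, pval
```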
Overlapping Models
For overlapping models it is not clear a priori whether or not $f(y|\mathbf{x}, \theta_*) = g(y|\mathbf{x}, \gamma_*)$, and one needs to first test this condition.
Vuong (1989) proposes testing whether or not the variance $\omega_*^2$ defined in (8.47) equals zero, since $\omega_*^2 = 0$ if and only if $f(\cdot) = g(\cdot)$. Thus compute $\hat{\omega}^2$ in (8.48). Under $H_0^{\omega}: \omega_*^2 = 0$,
$$N\hat{\omega}^2 \;\overset{d}{\to}\; M_{p+q}(\lambda_*), \qquad (8.50)$$
where the $M_{p+q}(\lambda_*)$ distribution is defined after (8.44). Hypothesis $H_0^{\omega}$ is rejected at level $\alpha$ if $N\hat{\omega}^2$ exceeds the upper $\alpha$ percentile of the $M_{p+q}(\hat{\lambda})$ distribution, using the eigenvalues $\hat{\lambda}_j$ of the sample analogue of $\mathbf{W}$ in (8.45). Alternatively, and more simply, one can test the conditions that $\theta_*$ and $\gamma_*$ must satisfy for $f(\cdot) = g(\cdot)$. Examples are given in Lien and Vuong (1987).
If $H_0^{\omega}$ is not rejected, or the conditions for $f(\cdot) = g(\cdot)$ are not rejected, conclude that it is not possible to discriminate between the two models given the data. If $H_0^{\omega}$ is rejected, or the conditions for $f(\cdot) = g(\cdot)$ are rejected, then test $H_0$ against $H_f$ or $H_g$ using $T_{\mathrm{LR}}$ as detailed in the strictly nonnested case. In this latter case the significance level is at most the maximum of the significance levels for each of the two tests.
This test assumes that both models are misspecified. If instead one of the models is assumed to be correctly specified, then the other model must also be correctly specified for the two models to be equivalent. Thus $f(y|\mathbf{x}, \theta_*) = g(y|\mathbf{x}, \gamma_*)$ under $H_0$, and one can directly move to the LR test using the weighted chi-square result (8.44). Let $c_1$ and $c_2$ be upper-tail and lower-tail critical values, respectively. If $2\mathrm{LR}(\hat{\theta}, \hat{\gamma}) > c_1$ then $H_0$ is rejected in favor of $H_f$; if $2\mathrm{LR}(\hat{\theta}, \hat{\gamma}) < c_2$ then $H_0$ is rejected in favor of $H_g$; and the test is otherwise inconclusive.
8.5.4. Other Nonnested Model Comparisons
The preceding methods are restricted to fully parametric models. Methods for discrim-
inating between models that are only partially parameterized, such as linear regression
without the assumption of normality, are less clear-cut.
The information criteria of Section 8.5.1 can be replaced by criteria developed using
loss functions other than KLIC. A variety of measures corresponding to different loss
functions are presented in Amemiya (1980). These measures are often motivated for
nested models but may also be applicable to nonnested models.
A simple approach is to compare predictive ability, selecting the model with the lowest value of the mean-squared error $(N - q)^{-1}\sum_i(y_i - \hat{y}_i)^2$. For linear regression this is equivalent to choosing the model with the highest adjusted $R^2$, which is generally viewed as providing too small a penalty for model complexity. An adaptation for nonparametric regression is leave-one-out cross-validation (see Section 9.5.3).
Formal tests to discriminate between nonnested models in the nonlikelihood case
often take one of two approaches. Artificial nesting, proposed by Davidson and
MacKinnon (1984), embeds the two nonnested models into a more general artificial
model and leads to so-called J tests and P tests and related tests. The encompassing
principle, proposed by Mizon and Richard (1986), leads to a quite general framework
for testing one model against a competing nonnested model. White (1994) links this
approach with CM tests. For a summary of this literature see Davidson and MacKinnon
(1993, chapter 11).
8.5.5. Nonnested Models Example
A sample of 100 observations is generated from a Poisson model with mean $\mathrm{E}[y|\mathbf{x}] = \exp(\beta_1 + \beta_2 x_2 + \beta_3 x_3)$, where $x_2, x_3 \sim \mathcal{N}[0, 1]$ and $(\beta_1, \beta_2, \beta_3) = (0.5, 0.5, 0.5)$.

Table 8.2. Nonnested Model Comparisons for Poisson Regression Example^a

| Test Type | Model 1 | Model 2 | Conclusion |
|---|---|---|---|
| $-2\ln L$ | 366.86 | 352.18 | Model 2 preferred |
| AIC | 370.86 | 358.18 | Model 2 preferred |
| BIC | 376.07 | 366.00 | Model 2 preferred |
| $N\hat{\omega}^2$ | 7.84 with $p = 0.000$ | | Can discriminate |
| $T_{\mathrm{LR}} = N^{-1/2}\mathrm{LR}/\hat{\omega}$ | $-0.883$ with $p = 0.377$ | | No model favored |

^a $N = 100$. Model 1 is Poisson regression of $y$ on intercept and $x_2$. Model 2 is Poisson regression of $y$ on intercept, $x_3$, and $x_3^2$. The final two rows are for the Vuong test for overlapping models (see the text).

The dependent variable $y$ has sample mean 1.92 and standard deviation 1.84. Two incorrect nonnested models were estimated by Poisson regression:

Model 1: $\widehat{\mathrm{E}}[y|\mathbf{x}] = \exp\bigl(\underset{(8.08)}{0.608} + \underset{(4.03)}{0.291}\,x_2\bigr)$,

Model 2: $\widehat{\mathrm{E}}[y|\mathbf{x}] = \exp\bigl(\underset{(5.14)}{0.493} + \underset{(5.10)}{0.359}\,x_3 + \underset{(1.78)}{0.091}\,x_3^2\bigr)$,

where t-statistics are given in parentheses.
The first three rows of Table 8.2 give various information criteria, with the model
with smallest value preferred. The first does not penalize number of parameters and
favors model 2. The second and third measures defined in (8.40) and (8.41) give larger
penalty to model 2, which has an additional parameter, but still lead to the larger model
2 being favored.
The final two rows of Table 8.2 summarize Vuong's test, here a test of overlapping models.
First, test the condition of equality of the densities when evaluated at the pseudo-true values. The statistic $\hat{\omega}^2$ in (8.48) is easily computed given expressions for the densities. The difficult part is computing an estimate of the matrix $\mathbf{W}$ in (8.45). For the Poisson density we can use $\widehat{\mathbf{A}}$ and $\widehat{\mathbf{B}}$ defined at the end of Section 5.2.3 and $\widehat{\mathbf{B}}_{fg} = N^{-1}\sum_i(y_i - \hat{\mu}_{fi})\mathbf{x}_{fi} \times (y_i - \hat{\mu}_{gi})\mathbf{x}_{gi}'$. The eigenvalues of $\widehat{\mathbf{W}}$ are $\lambda_1 = 0.29$, $\lambda_2 = 1.00$, $\lambda_3 = 1.06$, $\lambda_4 = 1.48$, and $\lambda_5 = 2.75$. The p-value for the test statistic $N\hat{\omega}^2$ with distribution given in (8.44) is obtained as the proportion of draws of $\sum_{j=1}^{5}\lambda_j z_j^2$, say 10,000 draws, which exceed $N\hat{\omega}^2 = 69.14$. Here $p = 0.000 < 0.05$ and we conclude that it is possible to discriminate between the models. The critical value at level 0.05 in this example equals 16.10, quite a bit higher than $\chi^2_{.05}(5) = 11.07$.
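The simulated p-value can be sketched as follows (my illustration), drawing from the weighted sum of chi-squared variables using the estimated eigenvalues:

```python
import numpy as np

def weighted_chi2_pvalue(stat, eigenvalues, ndraws=10_000, seed=0):
    """p-value for a statistic with limit distribution sum_j lambda_j * chi2(1),
    computed as the proportion of simulated draws exceeding the observed statistic."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(eigenvalues)
    z2 = rng.chisquare(df=1, size=(ndraws, len(lam)))   # iid chi-square(1) draws
    draws = z2 @ lam                                    # weighted sums
    return np.mean(draws > stat)

# With the eigenvalues reported in the text:
# weighted_chi2_pvalue(69.14, [0.29, 1.00, 1.06, 1.48, 2.75])
```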
Given that discrimination is possible, the second test can be applied. Here $T_{\mathrm{LR}} = -0.883$ favors the second model, since it is negative. However, using a standard normal two-tail test at 5% the difference is not statistically significant. In this example $\hat{\omega}^2$ is quite large, which means the first test statistic $N\hat{\omega}^2$ is large but the second test statistic $N^{-1/2}\mathrm{LR}(\hat{\theta}, \hat{\gamma})/\hat{\omega}$ is small.
8.6. Consequences of Testing
In practice more than one test is performed before one reaches a preferred model. This
leads to several complications that practitioners usually ignore.
8.6.1. Pretest Estimation
The use of specification tests to choose a model complicates the distribution of an estimator. For example, suppose we choose between two estimators $\hat{\theta}$ and $\tilde{\theta}$ on the basis of a statistical test at 5%. For instance, $\hat{\theta}$ and $\tilde{\theta}$ may be estimators in unrestricted and restricted models. Then the actual estimator is $\theta^{+} = w\hat{\theta} + (1 - w)\tilde{\theta}$, where the random variable $w$ takes value 1 if the test favors $\hat{\theta}$ and 0 if the test favors $\tilde{\theta}$. In short, the estimator depends on the restricted and unrestricted estimators and on a random variable $w$, which in turn depends on the significance level of the test. Hence $\theta^{+}$ is an estimator with complex properties. This is called a pretest estimator, as the estimator is based on an initial test. The distribution of $\theta^{+}$ has been obtained for the linear regression model under normality and is nonstandard.
In theory statistical inference should be based on the distribution of $\theta^{+}$. In practice inference is based on the distribution of $\hat{\theta}$ if $w = 1$ or of $\tilde{\theta}$ if $w = 0$, ignoring the randomness in $w$. This is done for simplicity, as even in the simplest models the distribution of the estimator becomes intractable when several such tests are performed.
8.6.2. Order of Testing
Different conclusions can be drawn according to the order in which tests are con-
ducted.
One possible ordering is from general to specific model. For example, one may
estimate a general model for demand before testing restrictions from consumer de-
mand theory such as homogeneity and symmetry. Or the cycle may go from specific
to general model, with regressors added as needed and additional complications such
as endogeneity controlled for if present. Such orderings are natural when choosing
which regressors to include in a model, but when specification tests are also being
performed it is not uncommon to use both general to specific and specific to general
orderings in the same study.
A related issue is that of joint versus separate tests. For example, the significance
of two regressors can be tested by either two individual t−tests of significance or a
joint F−test or χ2
(2) test of significance. A general discussion was given in Sec-
tion 7.2.7 and an example is given later in Section 18.7.
8.6.3. Data Mining
Taken to its extreme, the extensive use of tests to select a model has been called data
mining (Lovell, 1983). For example, one may search among several hundred possible
predictors of y and choose just those predictors that are significant at 5% on a two-
sided test. Computer programs exist that automate such searches and are commonly
used in some branches of applied statistics. Unfortunately, such broad searches will
lead to discovery of spurious relationships, since a test with size 0.05 leads to er-
roneous findings of statistical significance 5% of the time. Lovell pointed out that
the application of such a methodology tends to overestimate the goodness-of-fit mea-
sures (e.g., R2
) and underestimate the sampling variances of regression coefficients,
even when it succeeds in uncovering the variables that feature in the data-generating
process. Using standard tests and reporting p-values without taking account of the
model-search procedure is misleading because nominal and actual p-values are not
the same. White (2001b) and Sullivan, Timmermann, and White (2001) show how to
use bootstrap methods to calculate the true statistical significance of regressors. See
also P. Hansen (2003).
The motivation for data mining is sometimes to conserve degrees of freedom or
to avoid overparameterization (“clutter”). More importantly, many aspects of speci-
fication, such as the functional form of covariates, are left unresolved by underlying
theory. Given specification uncertainty, justification exists for specification searching
(Sargan, 2001). However, care needs to be taken especially if small samples are an-
alyzed and the number of specification searches is large relative to the sample size.
When the specification search is sequential, with a large number of steps, and with
each step determined by a previous test outcome, the statistical properties of the pro-
cedure as a whole are complex and analytically intractable.
8.6.4. A Practical Approach
Applied microeconometrics research generally minimizes the problem of pretest esti-
mation by making judicious use of hypothesis tests. Economic theory is used to guide
the selection of regressors, to greatly reduce the number of potential regressors. If the
sample size is large there is little purpose served by dropping “insignificant” variables.
Final results often use regressions that include statistically insignificant regressors for
control variables, such as region, industry, and occupation dummies in an earnings
regression. Clutter can be avoided by not reporting unimportant coefficients in a full
model specification but noting that fact in an appropriate place. This can lead to some
loss of precision in estimating the key regressors of interest, such as years of school-
ing in an earnings regression, but guards against bias caused by erroneously dropping
variables that should be included.
Good practice is to use only part of the sample (“training sample”) for specification
searches and model selection, and then report results using the preferred model esti-
mated using a completely separate part of the sample (“estimation sample”). In such
circumstances pretesting does not affect the distribution of the estimator, if the sub-
samples are independent. This procedure is usually only implemented when sample
sizes are very large, because using less than the full sample in final estimation leads to
a loss in estimator precision.
8.7. Model Diagnostics
In this section we discuss goodness-of-fit measures and definitions of residuals in non-
linear models. Useful measures are those that reveal model deficiency in some partic-
ular dimension.
8.7.1. Pseudo-R2
Measures
Goodness of fit is interpreted as closeness of fitted values to sample values of the
dependent variable.
For linear models with $K$ regressors the most direct measure is the standard error of the regression, which is the estimated standard deviation of the error term,
$$s = \left[\frac{1}{N - K}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2\right]^{1/2}.$$
For example, a standard error of the regression of 0.10 in a log-earnings regression means that approximately 95% of the fitted values are within 0.20 of the actual value of log-earnings, or within 22% of actual earnings using $e^{0.2} \simeq 1.22$. This measure is the same as the in-sample root mean squared error where $\hat{y}_i$ is viewed as a forecast of $y_i$, aside from a degrees of freedom correction. Alternatively, one can use the mean absolute error $(N - K)^{-1}\sum_i|y_i - \hat{y}_i|$. The same measures can be used for nonlinear regression models, provided the nonlinear models lead to a predicted value $\hat{y}_i$ of the dependent variable.
A related measure in linear models is $R^2$, the coefficient of multiple determination. This measures the fraction of the variation of the dependent variable that is explained by the regressors. The statistic $R^2$ is more commonly reported than $s$, even though $s$ may be more informative in evaluating the goodness of fit.
A pseudo-$R^2$ is an extension of $R^2$ to nonlinear regression models. There are several interpretations of $R^2$ in the linear model. These lead to several possible pseudo-$R^2$ measures that in nonlinear models differ and do not necessarily have the properties of lying between zero and one and increasing as regressors are added. We present several of these measures that, for simplicity, are not adjusted for degrees of freedom.
One approach bases $R^2$ on decomposition of the total sum of squares (TSS), with
$$\sum_i(y_i - \bar{y})^2 = \sum_i(y_i - \hat{y}_i)^2 + \sum_i(\hat{y}_i - \bar{y})^2 + 2\sum_i(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}).$$
The first sum on the right-hand side is the residual sum of squares (RSS) and the second term is the explained sum of squares (ESS). This leads to two possible measures:
$$R^2_{\mathrm{RES}} = 1 - \mathrm{RSS}/\mathrm{TSS},$$
$$R^2_{\mathrm{EXP}} = \mathrm{ESS}/\mathrm{TSS}.$$
For OLS regression in the linear model with intercept the third sum equals zero, so $R^2_{\mathrm{RES}} = R^2_{\mathrm{EXP}}$. However, this simplification does not occur in other models and in general $R^2_{\mathrm{RES}} \ne R^2_{\mathrm{EXP}}$ in nonlinear models. The measure $R^2_{\mathrm{RES}}$ can be less than zero, $R^2_{\mathrm{EXP}}$ can exceed one, and both measures may decrease as regressors are added, though $R^2_{\mathrm{RES}}$ will increase for NLS regression of the nonlinear model as then the estimator is minimizing RSS.
A closely related measure uses
$$R^2_{\mathrm{COR}} = \widehat{\mathrm{Cor}}^2[y_i, \hat{y}_i],$$
the squared correlation between actual and fitted values. The measure $R^2_{\mathrm{COR}}$ lies between zero and one and equals $R^2$ in OLS regression for the linear model with intercept. In nonlinear models $R^2_{\mathrm{COR}}$ can decrease as regressors are added.
A third approach uses weighted sums of squares that control for the intrinsic heteroskedasticity of cross-section data. Let $\hat{\sigma}_i^2$ be the fitted conditional variance of $y_i$, where it is assumed that heteroskedasticity is explicitly modeled, as is the case for FGLS and for models such as logit and Poisson. Then we can use
$$R^2_{\mathrm{WSS}} = 1 - \mathrm{WRSS}/\mathrm{WTSS},$$
where the weighted residual sum of squares $\mathrm{WRSS} = \sum_i(y_i - \hat{y}_i)^2/\hat{\sigma}_i^2$, $\mathrm{WTSS} = \sum_i(y_i - \hat{\mu})^2/\hat{\sigma}^2$, and $\hat{\mu}$ and $\hat{\sigma}^2$ are the estimated mean and variance in the intercept-only model. This can be called a Pearson $R^2$ because WRSS equals the Pearson statistic, which, aside from any finite-sample corrections, should equal $N$ if heteroskedasticity is correctly modeled. Note that $R^2_{\mathrm{WSS}}$ can be less than zero and decrease as regressors are added.
A fourth approach is a generalization of $R^2$ to objective functions other than the sum of squared residuals. Let $Q_N(\theta)$ denote the objective function being maximized, $Q_0$ denote its value in the intercept-only model, $Q_{\mathrm{fit}}$ denote the value in the fitted model, and $Q_{\max}$ denote the largest possible value of $Q_N(\theta)$. Then the maximum potential gain in the objective function resulting from inclusion of regressors is $Q_{\max} - Q_0$ and the actual gain is $Q_{\mathrm{fit}} - Q_0$. This suggests the measure
$$R^2_{\mathrm{RG}} = \frac{Q_{\mathrm{fit}} - Q_0}{Q_{\max} - Q_0} = 1 - \frac{Q_{\max} - Q_{\mathrm{fit}}}{Q_{\max} - Q_0},$$
where the subscript RG means relative gain. For least-squares estimation the objective function maximized is minus the residual sum of squares. Then $Q_0 = -\mathrm{TSS}$, $Q_{\mathrm{fit}} = -\mathrm{RSS}$, and $Q_{\max} = 0$, so $R^2_{\mathrm{RG}} = \mathrm{ESS}/\mathrm{TSS}$ for OLS or NLS regression. The measure $R^2_{\mathrm{RG}}$ has the advantage of lying between zero and one and increasing as regressors are added. For ML estimation the objective function is $Q_N(\theta) = \ln L_N(\theta)$. Then $R^2_{\mathrm{RG}}$ cannot always be used, as in some models there may be no bound on $Q_{\max}$. For example, for the linear model under normality $L_N(\beta, \sigma^2) \to \infty$ as $\sigma^2 \to 0$. For ML and quasi-ML estimation of linear exponential family models, such as logit and Poisson, $Q_{\max}$ is usually known and $R^2_{\mathrm{RG}}$ can be shown to be an $R^2$ based on the deviance residuals defined in the next section.
A related measure to $R^2_{\mathrm{RG}}$ is $R^2_{Q} = 1 - Q_{\mathrm{fit}}/Q_0$. This measure increases as regressors are added. It equals $R^2_{\mathrm{RG}}$ if $Q_{\max} = 0$, which is the case for OLS regression and for binary and multinomial models. Otherwise, for discrete data this measure may have an upper bound less than one, whereas for continuous data the measure may not be bounded between zero and one, as the log-likelihood can be negative or positive. For example, for ML estimation with a continuous density it is possible that $Q_0 = 1$ and $Q_{\mathrm{fit}} = 4$, leading to $R^2_Q = -3$, or that $Q_0 = -1$ and $Q_{\mathrm{fit}} = 4$, leading to $R^2_Q = 5$.
For nonlinear models there is therefore no universal pseudo-$R^2$. The most useful measures may be $R^2_{\mathrm{COR}}$, as correlation coefficients are easily interpreted, and $R^2_{\mathrm{RG}}$ in the special cases where $Q_{\max}$ is known. Cameron and Windmeijer (1997) analyze many of the measures and Cameron and Windmeijer (1996) apply these measures to count data models.
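The following sketch (my own, with hypothetical inputs y, yhat, and fitted variances) computes several of these measures for a model with an explicitly specified conditional variance, such as the Poisson:

```python
import numpy as np

def pseudo_r2(y, yhat, sig2=None, sig2_0=None):
    """Several pseudo-R^2 measures: R2_RES, R2_EXP, R2_COR, and (if variances given) R2_WSS.
    sig2: fitted conditional variances; sig2_0: fitted variance in the intercept-only model
    (for Poisson this is simply the sample mean of y)."""
    ybar = y.mean()
    tss = np.sum((y - ybar) ** 2)
    rss = np.sum((y - yhat) ** 2)
    ess = np.sum((yhat - ybar) ** 2)
    out = {
        "R2_RES": 1 - rss / tss,
        "R2_EXP": ess / tss,
        "R2_COR": np.corrcoef(y, yhat)[0, 1] ** 2,
    }
    if sig2 is not None and sig2_0 is not None:   # weighted (Pearson) version
        wrss = np.sum((y - yhat) ** 2 / sig2)
        wtss = np.sum((y - ybar) ** 2 / sig2_0)
        out["R2_WSS"] = 1 - wrss / wtss
    return out
```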
8.7.2. Residual Analysis
Microeconometrics analysis actually places little emphasis on residual analysis, com-
pared to some other areas of statistics. If data sets are small then there is concern that
residual analysis may lead to overfitting of the model. If the data set is large then
there is a belief that residual analysis may be unnecessary as a single observation will
have little impact on the analysis. We therefore give a brief summary. A more exten-
sive discussion is given in, for example, McCullagh and Nelder (1989) and Cameron
and Trivedi (1998, chapter 5). Econometricians have had particular interest in defining
residuals in censored and truncated models.
A wide range of residuals have been proposed for nonlinear regression models. Consider a scalar dependent variable $y_i$ with fitted value $\hat{y}_i = \hat{\mu}_i = \mu(\mathbf{x}_i, \hat{\theta})$. The raw residual is $r_i = y_i - \hat{\mu}_i$. The Pearson residual is the obvious correction for heteroskedasticity, $p_i = (y_i - \hat{\mu}_i)/\hat{\sigma}_i$, where $\hat{\sigma}_i$ is an estimate of the conditional standard deviation of $y_i$. This requires a specification of the variance of $y_i$, which is done for models such as the Poisson. For an LEF density (see Section 5.7.3) the deviance residual is $d_i = \mathrm{sign}(y_i - \hat{\mu}_i)\sqrt{2[l(y_i) - l(\hat{\mu}_i)]}$, where $l(y)$ denotes the log-density of $y|\mu$ evaluated at $\mu = y$ and $l(\hat{\mu})$ denotes evaluation at $\mu = \hat{\mu}$. A motivation for the deviance residual is that the sum of squares of these residuals is the deviance statistic, which is the generalization for LEF models of the sum of squared raw residuals in the linear model. The Anscombe residual is defined to be the transformation of $y$ that is closest to normality, then standardized to mean zero and variance 1. This transformation has been obtained for LEF densities.
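A short sketch (mine) computing raw, Pearson, and deviance residuals for a fitted Poisson regression, where mu is the array of fitted means:

```python
import numpy as np

def poisson_residuals(y, mu):
    """Raw, Pearson, and deviance residuals for Poisson regression with fitted means mu."""
    raw = y - mu
    pearson = raw / np.sqrt(mu)                 # Poisson: variance equals the mean
    # Deviance residuals: sign(y - mu) * sqrt(2 [l(y) - l(mu)]), using y ln y = 0 at y = 0
    with np.errstate(divide="ignore", invalid="ignore"):
        ylny = np.where(y > 0, y * np.log(y / mu), 0.0)
    deviance = np.sign(raw) * np.sqrt(2.0 * (ylny - raw))
    return raw, pearson, deviance
```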
Small-sample corrections to residuals have been proposed to account for estimation error in $\hat{\mu}_i$. For the linear model this entails division of residuals by $\sqrt{1 - h_{ii}}$, where $h_{ii}$ is the $i$th diagonal entry in the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. These residuals are felt to have better finite-sample performance. Since $\mathbf{H}$ has rank $K$, the number of regressors, the average value of $h_{ii}$ is $K/N$ and values of $h_{ii}$ in excess of $2K/N$ are viewed as having high leverage. These results extend to LEF models with $\mathbf{H} = \mathbf{W}^{1/2}\mathbf{X}(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}^{1/2}$, where $\mathbf{W} = \mathrm{Diag}[w_{ii}]$ and $w_{ii} = g'(\mathbf{x}_i'\beta)/\sigma_i^2$ with $g(\mathbf{x}_i'\beta)$ and $\sigma_i^2$ the specified conditional mean and variance, respectively. McCullagh and Nelder (1989) provide a summary.
More generally, Cox and Snell (1968) define a generalized residual to be any scalar function $r_i = r(y_i, \mathbf{x}_i, \hat{\theta})$ that satisfies some relatively weak conditions. One way that such residuals arise is that many estimators have first-order conditions of the form $\sum_i \mathbf{g}(\mathbf{x}_i, \hat{\theta})r(y_i, \mathbf{x}_i, \hat{\theta}) = \mathbf{0}$, where $y_i$ appears in the scalar $r(\cdot)$ but not in the vector $\mathbf{g}(\cdot)$. See also White (1994).
For regression models based on a normal latent variable (see Chapters 14 and 16) Chesher and Irish (1987) propose using $\mathrm{E}[\varepsilon_i^*|y_i]$ as the residual, where $y_i^* = \mu_i + \varepsilon_i^*$ is the unobserved latent variable and $y_i = g(y_i^*)$ is the observed dependent variable. Particular choices of $g(\cdot)$ correspond to the probit and Tobit models. Gouriéroux et al. (1987) generalize this approach to LEF densities. A natural approach in this context is to treat residuals as missing data, along the lines of the expectation maximization algorithm in Section 10.3.
A common use of residuals is in plots against other variables of interest. Plots of residuals against fitted values can reveal poor model fit; plots of residuals against omitted variables can suggest further regressors to include in the model; and plots of residuals against included regressors can suggest the need for a different functional form. It can be helpful to include a nonparametric regression line in such plots (see Chapter 9). If data take only a few discrete values the plots can be difficult to interpret because of clustering at just a few values, and it can be helpful to use a so-called jitter feature that adds some random noise to the data to reduce the clustering.
Some parametric models imply that an appropriately defined residual should be normally distributed. This can be checked by a normal scores plot that orders the residuals $r_i$ from smallest to largest and plots them against the values predicted if the residuals were exactly normally distributed. Thus plot the ordered $r_i$ against $\bar{r} + s_r\Phi^{-1}((i - 0.5)/N)$, where $\bar{r}$ and $s_r$ are the sample mean and standard deviation of $r$ and $\Phi^{-1}(\cdot)$ is the inverse of the standard normal cdf.
8.7.3. Diagnostics Example
Table 8.3 uses the same data-generating process as in Section 8.5.5. The dependent variable $y$ has sample mean 1.92 and standard deviation 1.84. Poisson regression of $y$ on $x_3$ and of $y$ on $x_3$ and $x_3^2$ yields

Model 1: $\widehat{\mathrm{E}}[y|\mathbf{x}] = \exp\bigl(\underset{(5.20)}{0.586} + \underset{(7.60)}{0.389}\,x_3\bigr)$,

Model 2: $\widehat{\mathrm{E}}[y|\mathbf{x}] = \exp\bigl(\underset{(5.14)}{0.493} + \underset{(5.10)}{0.359}\,x_3 + \underset{(1.78)}{0.091}\,x_3^2\bigr)$,
where t-statistics are given in parentheses.
In this example all $R^2$ measures increase with the addition of $x_3^2$ as a regressor, though by quite different amounts, given that in this example all but the last $R^2$ have similar values. More generally the first three $R^2$ are scaled similarly and $R^2_{\mathrm{RES}}$ and $R^2_{\mathrm{COR}}$ can be quite close, but the remaining three measures are scaled quite differently. Only the last two $R^2$ measures are guaranteed to increase as a regressor is added, unless the objective function is the sum of squared errors. The measure $R^2_{\mathrm{RG}}$ can be constructed here, as the Poisson log-likelihood is maximized if the fitted mean $\hat{\mu}_i = y_i$ for all $i$, leading to $Q_{\max} = \sum_i[y_i\ln y_i - y_i - \ln y_i!]$, where $y\ln y = 0$ when $y = 0$.
Table 8.3. Pseudo-$R^2$s: Poisson Regression Example^a

| Diagnostic | Model 1 | Model 2 | Difference |
|---|---|---|---|
| $s$, where $s^2 = \mathrm{RSS}/(N - K)$ | 0.1662 | 0.1661 | 0.0001 |
| $R^2_{\mathrm{RES}} = 1 - \mathrm{RSS/TSS}$ | 0.1885 | 0.1962 | +0.0077 |
| $R^2_{\mathrm{EXP}} = \mathrm{ESS/TSS}$ | 0.1667 | 0.2087 | +0.0402 |
| $R^2_{\mathrm{COR}} = \widehat{\mathrm{Cor}}^2[y_i, \hat{y}_i]$ | 0.1893 | 0.1964 | +0.0067 |
| $R^2_{\mathrm{WSS}} = 1 - \mathrm{WRSS/WTSS}$ | 0.1562 | 0.1695 | +0.0233 |
| $R^2_{\mathrm{RG}} = (Q_{\mathrm{fit}} - Q_0)/(Q_{\max} - Q_0)$ | 0.1552 | 0.1712 | +0.0160 |
| $R^2_{Q} = 1 - Q_{\mathrm{fit}}/Q_0$ | 0.0733 | 0.0808 | +0.0075 |

^a $N = 100$. Model 1 is Poisson regression of $y$ on intercept and $x_3$. Model 2 is Poisson regression of $y$ on intercept, $x_3$, and $x_3^2$. RSS is residual sum of squares (SS), ESS is explained SS, TSS is total SS, WRSS is weighted RSS, WTSS is weighted TSS, $Q_{\mathrm{fit}}$ is the fitted value of the objective function, $Q_0$ is the fitted value in the intercept-only model, and $Q_{\max}$ is the maximum possible value of the objective function given the data and exists only for some objective functions.

Additionally, three residuals were calculated for the second model. The sample mean and standard deviation of the residuals were, respectively, 0 and 1.65 for the raw residuals, 0.01 and 1.97 for the Pearson residuals, and $-0.21$ and 1.22 for the deviance residuals. The zero mean for the raw residual is a property of Poisson regression with intercept included that is shared by very few other models. The larger standard deviation of the raw residuals reflects the lack of scaling and the fact that here the standard deviation of $y$ exceeds 1. The correlations between pairs of these residuals all exceed 0.96. This is likely to happen when $R^2$ is low so that $\hat{y}_i \approx \bar{y}$.
8.8. Practical Considerations
m-Tests and Hausman tests are most easily implemented by use of auxiliary regres-
sions. One should be aware that these auxiliary regressions may be valid only under
distributional assumptions that are stronger than those made to obtain the usual robust
standard errors of regression coefficients. Some robust tests have been presented in
Section 8.4.
With a large enough data set and fixed significance level such as 5%, the sample moment conditions implied by a model will be rejected, except in the unrealistic case that all aspects of the model (functional form, regressors, and distribution) are correctly specified. In classical testing situations this is often a desired result. In particular, with
a large enough sample, regression coefficients will always be significantly different
from zero and many studies seek such a result. However, for specification tests the
desire is usually to not reject, so that one can say that the model is correctly specified.
Perhaps for this reason specification tests are under-utilized.
As an illustration, consider tests of correct specification of life-cycle models of consumption. Unless samples are small, a dedicated specification tester is likely to reject the model at 5%. For example, suppose a model specification test statistic that is $\chi^2(12)$ distributed has a p-value of 0.02 when applied to a sample with $N = 3{,}000$. It is not clear that the life-cycle model is providing a poor explanation of the data, even though it would be formally rejected at the 5% significance level. One possibility is to increase the critical value as the sample size increases using BIC (see Section 8.5.1).
Another reason for underutilization of specification tests is difficulty in computation
and poor size property of tests when more convenient auxiliary regressions are used
to implement an asymptotically equivalent version of a test. These drawbacks can be
greatly reduced by use of the bootstrap. Chapter 11 presents bootstrap methods to
implement many of the tests given in this chapter.
8.9. Bibliographic Notes
8.2 The conditional moment test, due to Newey (1985) and Tauchen (1985), is a generalization
of the information matrix test of White (1982). For ML estimation, the computation of the
m-test by auxiliary regression generalizes methods of Lancaster (1984) and Chesher (1984)
for the IM test. A good overview of m-tests is given in Pagan and Vella (1989). The m-test
provides a very general framework for viewing testing. It can be shown to nest all tests,
such as Wald, LM, LR, and Hausman tests. This unifying element is emphasized in White
(1994).
8.3 The Hausman test was proposed by Hausman (1978), with earlier references already given
in Section 8.3 and a good survey provided by Ruud (1984).
8.4 The econometrics texts by Greene (2003), Davidson and MacKinnon (1993), and Wooldridge (2002) present many of the standard specification tests.
8.5 Pesaran and Pesaran (1993) discuss how the Cox (1961, 1962b) nonnested test can be
implemented when an analytical expression for the expectation of the log-likelihood is not
available. Alternatively, the test of Vuong (1989) can be used.
8.7 Model diagnostics for nonlinear models are often obtained by extension of results for the
linear regression model to generalized linear models such as logit and Poisson models. A
detailed discussion with references to the literature is given in Cameron and Trivedi (1998,
Chapter 5).
Exercises
8–1 Suppose $y = \mathbf{x}'\beta + u$, where $u \sim \mathcal{N}[0, \sigma^2]$, with parameter vector $\theta = [\beta', \sigma^2]'$ and density $f(y|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp[-(y - \mathbf{x}'\beta)^2/2\sigma^2]$. We have a sample of $N$ independent observations.
(a) Explain why a test of the moment condition $\mathrm{E}[\mathbf{x}(y - \mathbf{x}'\beta)^3] = \mathbf{0}$ is a test of the assumption of normally distributed errors.
(b) Give the expressions for $\hat{\mathbf{m}}_i$ and $\hat{\mathbf{s}}_i$ given in (8.5) necessary to implement the m-test based on the moment condition in part (a).
(c) Suppose $\dim[\mathbf{x}] = 10$, $N = 100$, and the auxiliary regression in (8.5) yields an uncentered $R^2$ of 0.2. What do you conclude at level 0.05?
(d) For this example give the moment conditions tested by White's information matrix test.
8–2 Consider the multinomial version of the PCGF test given in (8.23) with pj replaced
by p̂j = N⁻¹ Σi Fj(xi, θ̂). Show that PCGF can be expressed as CGF in (8.27)
with V̂ = Diag[N p̂j]. [Conclude that in the multinomial case Andrews's test statistic
simplifies to Pearson's statistic.]
8–3 (Adapted from Amemiya, 1985). For the Hausman test given in Section 8.4.1 let
V11 = V[θ̂], V22 = V[θ̃], and V12 = Cov[θ̂, θ̃].
(a) Show that the estimator θ̄ = θ̂ + [V11 + V22 − 2V12]⁻¹(θ̃ − θ̂) has asymptotic
variance matrix V[θ̄] = V11 − [V11 − V12][V11 + V22 − 2V12]⁻¹[V11 − V12].
(b) Hence show that V[θ̄] is less than V[θ̂] in the matrix sense unless Cov[θ̂, θ̃] =
V[θ̂].
(c) Now suppose that θ̂ is fully efficient. Can V[θ̄] be less than V[θ̂]? What do
you conclude?
8–4 Suppose that two models are nonnested and there are N = 200 observations.
For model 1, the number of parameters q = 10 and ln L = −400. For model 2,
q = 10 and ln L = −380.
(a) Which model is favored using AIC?
(b) Which model is favored using BIC?
(c) Which model would be favored if the models were actually nested and we
used a likelihood ratio test at level 0.05?
8–5 Use the health expenditure data of Section 16.6. The model is a probit regres-
sion of DMED, an indicator variable for positive health expenditures, against the
17 regressors listed in the second paragraph of Section 16.6. You should obtain
the estimates given in the first column of Table 16.1.
(a) Test the joint statistical significance of the self-rated health indicators HLTHG,
HLTHF, and HLTHP at level 0.05 using a Hausman test. [This may require
some additional coding, depending on the package used.]
(b) Is the Hausman test the best test to use here?
(c) Does an information matrix test at level 0.05 support the restrictions of this
model? [This will require some additional coding.]
(d) Discriminate between a model that drops HLTHG, HLTHF, and HLTHP and a
model that drops LC, IDP, and LPI on the basis of R²_RES, R²_EXP, R²_COR, and
R²_RG.
C H A P T E R 9
Semiparametric Methods
9.1. Introduction
In this chapter we present methods for data analysis that require less model specifica-
tion than the methods of the preceding chapters.
We begin with nonparametric estimation. This makes very minimal assumptions
regarding the process that generated the data. One leading example is estimation of
a continuous density using a kernel density estimate. This has the attraction of pro-
viding a smoother estimate than the familiar histogram. A second leading example
is nonparametric regression, such as kernel regression, on a scalar regressor. This
places a flexible curve on an (x, y) scatterplot with no parametric restrictions on the
form of the curve. Nonparametric estimates have numerous uses, including data de-
scription, exploratory analysis of data and of fitted residuals from a regression model,
and summary across simulations of parameter estimates obtained from a Monte Carlo
study.
Econometric analysis emphasizes multivariate regression of a scalar y on a vector
of regressors x. However, nonparametric methods, although theoretically possible with
an infinitely large sample, break down in practice because the data need to be sliced in
several dimensions, leading to too few data points in each slice.
As a result econometricians have focused on semiparametric methods. These com-
bine a parametric component, greatly reducing the dimensionality, with a nonpara-
metric component. One important application is to permit more flexible models of the
conditional mean. For example, the conditional mean E[y|x] may be parameterized to
be of the single-index form g(x′β), where the functional form for g(·) is not specified
but is instead nonparametrically estimated, along with the unknown parameters β. An-
other important application relaxes distributional assumptions that if misspecified lead
to inconsistent parameter estimates. For example, we may wish to obtain consistent
estimates of β in a linear regression model y = x′β + ε when data on y are trun-
cated or censored (see Chapter 16), without having to correctly specify the particular
distribution of the error term ε.
The asymptotic theory for nonparametric methods differs from that for more para-
metric methods. Estimates are obtained by cutting the data into ever smaller slices as
N → ∞ and estimating local behavior within each slice. Since less than N observa-
tions are being used in estimating each slice the convergence rate is slower than that
obtained in the preceding chapters. Nonetheless, in the simplest cases nonparamet-
ric estimates are still asymptotically normally distributed. In some leading cases of
semiparametric regression the estimators of parameters β have the usual property of
converging at rate N^(−1/2), so that scaling by √N leads to a limit normal distribution,
whereas the nonparametric component of the model converges at a slower rate N^(−r),
r < 1/2.
Because nonparametric methods are local averaging methods, different choices of
localness lead to different finite-sample results. In some restrictive cases there are rules
or methods to determine the bandwidth or window width used in local averaging, just
as there are rules for determining the number of bins in a histogram given the number
of observations. In addition, it is common practice to use the nonscientific method of
choosing the bandwidth that gives a graph that to the eye looks reasonably smooth yet
is still capable of picking up details in the relationship of interest.
Nonparametric methods form the bulk of this chapter, both because they are of
intrinsic interest and because they are an essential input for semiparametric methods,
presented most notably in the chapters on discrete and censored dependent-variable
models. Kernel methods are emphasized as they are relatively simple to present and
because “It is argued that all smoothing methods are in an asymptotic sense essentially
equivalent to kernel smoothing” (Härdle, 1990, p. xi).
Section 9.2 provides examples of nonparametric density estimation and nonpara-
metric regression applied to data. Kernel density estimation is presented in Section
9.3. Local regression is discussed in Section 9.4, to provide motivation for the formal
treatment of kernel regression given in Section 9.5. Section 9.6 presents nonparamet-
ric regression methods other than kernel methods. The vast topic of semiparametric
regression is then introduced in Section 9.7.
9.2. Nonparametric Example: Hourly Wage
As an example we consider the hourly wage and education for 175 women aged
36 years who worked in 1993. The data are from the Michigan Panel Study of In-
come Dynamics. It is easily established that the distribution of the hourly wage is
right-skewed and so we model ln wage, the natural logarithm of the hourly wage.
We give just one example of nonparametric density estimation and one of nonpara-
metric regression and illustrate the important role of bandwidth selection. Sections 9.3
to 9.6 then provide the underlying theory.
9.2.1. Nonparametric Density Estimate
A histogram of the natural logarithm of wage is given in Figure 9.1. To provide detail
the bin width is chosen so that there are 30 bins, each of width about 0.20. This is an
Figure 9.1: Histogram for natural logarithm of hourly wage. Data for 175 U.S. women aged
36 years who worked in 1993.
unusually narrow bin width for only 175 observations, but many details are lost with
a larger bin width. The log-wage data seem to be reasonably symmetric, though they
are possibly slightly left-skewed.
The standard smoothed nonparametric density estimate is the kernel density esti-
mate defined in (9.3). Here we use the Epanechnikov kernel defined in Table 9.1.
The essential decision in implementation is the choice of bandwidth. For this ex-
ample Silverman’s plug-in estimate defined in (9.13) yields bandwidth of h = 0.545.
Then the kernel estimate is a weighted average of those observations that have log
wage within 0.21 units of the log wage at the current point of evaluation, with great-
est weight placed on data closest to the current point of evaluation. Figure 9.2 presents
three kernel density estimates, with bandwidths of 0.273, 0.545 and 1.091, respectively
corresponding to one-half the plug-in, the plug-in, and two times the plug-in band-
width. Clearly the smallest bandwidth is too small as it leads to too jagged a density
estimate. The largest bandwidth oversmooths the data. The middle bandwidth, the plug-
in value of 0.545, seems the best choice. It gives a reasonably smooth density estimate.
Figure 9.2: Kernel density estimates for log wage for three different bandwidths using the
Epanechnikov kernel. The plug-in bandwidth is h = 0.545. Same data as Figure 9.1.
What might we do with this kernel density estimate? One possibility is to compare
the density to the normal, by superimposing a normal density with mean equal to the
sample mean and variance equal to the sample variance. The graph is not reproduced
here but reveals that the kernel density estimate with preferred bandwidth 0.545 is con-
siderably more peaked than the normal. A second possibility is to compare log-wage
kernel density estimates for different subgroups, such as by educational attainment or
by full-time or part-time work status.
9.2.2. Nonparametric Regression
We consider the relationship between log wage and education. The nonparametric
method used here is the Lowess local regression method, a local weighted average
estimator (see Equation (9.16) and Section 9.6.2).
A local weighted regression line at each point x is fitted using centered subsets that
include the closest 0.8N observations, the program default, where N is the sample
size, and the weights decline as we move away from x. For values of x near the end
points, smaller uncentered subsets are used.
Figure 9.3 gives a scatter plot of log wage against education and three Lowess
regression curves for bandwidths of 0.8, 0.4 and 0.1. The first two bandwidths give
similar curves. The relationship appears to be quadratic, but this may be speculative as
the data are relatively sparse at low education levels, with less than 10% of the sample
having less than 10 years of schooling. For the majority of the data a linear relationship
may also work well. For simplicity we have not presented 95% confidence intervals or
bands that might also be provided.
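For readers who wish to reproduce such a figure, the following minimal Python sketch uses the lowess smoother in the statsmodels library. The arrays logwage and educ are hypothetical placeholders for the PSID variables described above (filled here with artificial values so the snippet runs stand-alone), and frac plays the role of the bandwidth.

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical placeholders for the 175 PSID observations described in the text;
# replace with the actual log-wage and years-of-schooling data.
rng = np.random.default_rng(0)
educ = rng.integers(6, 18, size=175).astype(float)
logwage = 0.6 + 0.10 * educ + rng.normal(scale=0.7, size=175)

# frac is the share of the sample used in each local regression; frac = 0.8
# mimics the default bandwidth of 0.8 discussed in the text.
curves = {frac: lowess(logwage, educ, frac=frac) for frac in (0.8, 0.4, 0.1)}
# Each entry is an array of (educ, fitted value) pairs sorted by educ,
# suitable for plotting against the scatter of the raw data.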
Figure 9.3: Nonparametric regression of log wage on education for three different band-
widths using Lowess regression. Same sample as Figure 9.1.
9.3. Kernel Density Estimation
Nonparametric density estimates are useful for comparison across different groups and
for comparison to a benchmark density such as the normal. Compared to a histogram
they have the advantage of providing a smoother density estimate. A key decision,
analogous to choosing the number of bins in a histogram, is bandwidth choice. We
focus on the standard nonparametric density estimator, the kernel density estimator. A
detailed presentation is given because results that are also relevant for regression are
more simply obtained for density estimation.
9.3.1. Histogram
A histogram is an estimate of the density formed by splitting the range of x into
equally spaced intervals and calculating the fraction of the sample in each interval.
We give a more formal presentation of the histogram, one that extends naturally to
the smoother kernel density estimator. Consider estimation of the density f (x0) of a
scalar continuous random variable x evaluated at x0. Since the density is the derivative
of the cdf F(x0) (i.e., f (x0) = dF(x0)/dx) we have
   f(x0) = lim_{h→0} [F(x0 + h) − F(x0 − h)]/(2h) = lim_{h→0} Pr[x0 − h < x < x0 + h]/(2h).
For a sample {xi, i = 1, . . . , N} of size N, this suggests using the estimator

   f̂_HIST(x0) = (1/N) Σ_{i=1}^N 1(x0 − h < xi < x0 + h)/(2h),   (9.1)

where the indicator function 1(A) equals 1 if event A occurs and 0 otherwise.
The estimator f̂_HIST(x0) is a histogram estimate centered at x0 with bin width 2h, since
it equals the fraction of the sample that lies between x0 − h and x0 + h divided by the
bin width 2h. If f̂_HIST is evaluated over the range of x at equally spaced values of x,
each 2h units apart, it yields a histogram.
The estimator f̂_HIST(x0) gives all observations in x0 ± h equal weight, as is clear
from rewriting (9.1) as

   f̂_HIST(x0) = (1/(Nh)) Σ_{i=1}^N (1/2) × 1(|xi − x0|/h ≤ 1).   (9.2)
This leads to a density estimate that is a step function, even if the underlying density
is continuous. Smoother estimates can be obtained by using weighting functions other
than the indicator function chosen here.
9.3.2. Kernel Density Estimator
The kernel density estimator, introduced by Rosenblatt (1956), generalizes the his-
togram estimate (9.2) by using an alternative weighting function, so

   f̂(x0) = (1/(Nh)) Σ_{i=1}^N K((xi − x0)/h).   (9.3)

The weighting function K(·) is called a kernel function and satisfies restrictions given
in the next section. The parameter h is a smoothing parameter called the bandwidth,
and two times h is the window width. The density is estimated by evaluating f̂(x0) at
a wider range of values of x0 than used in forming a histogram; usually evaluation is
at the sample values x1, . . . , xN. This also helps provide a density estimate smoother
than a histogram.
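As an illustration only, the estimator (9.3) can be coded in a few lines. The following Python sketch (the function names are ours) uses the Epanechnikov kernel of Table 9.1 and evaluates f̂ at an arbitrary grid of points; x is any one-dimensional data array and h a chosen bandwidth.

import numpy as np

def epanechnikov(z):
    # K(z) = 0.75 (1 - z^2) on |z| < 1, zero otherwise (Table 9.1)
    return 0.75 * (1.0 - z**2) * (np.abs(z) < 1.0)

def kernel_density(x, eval_points, h, kernel=epanechnikov):
    # Kernel density estimate (9.3): fhat(x0) = (1/(N h)) sum_i K((x_i - x0)/h)
    x = np.asarray(x, dtype=float)
    x0 = np.atleast_1d(np.asarray(eval_points, dtype=float))
    z = (x[None, :] - x0[:, None]) / h
    return kernel(z).sum(axis=1) / (x.size * h)

# Example: fhat = kernel_density(x, np.linspace(x.min(), x.max(), 100), h=0.545)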
9.3.3. Kernel Functions
The kernel function K(·) is a continuous function, symmetric around zero, that inte-
grates to unity and satisfies additional boundedness conditions. Following Lee (1996)
we assume that the kernel satisfies the following conditions:
(i) K(z) is symmetric around 0 and is continuous.
(ii) ∫K(z)dz = 1, ∫zK(z)dz = 0, and ∫|K(z)|dz < ∞.
(iii) Either (a) K(z) = 0 if |z| ≥ z0 for some z0, or (b) |z|K(z) → 0 as |z| → ∞.
(iv) ∫z²K(z)dz = κ, where κ is a constant.
In practice kernel functions work better if they satisfy condition (iiia) rather than
just the weaker condition (iiib). Then restricting attention to the interval [−1, 1] rather
than [−z0, z0] is simply a normalization for convenience, and usually K(z) is restricted
to z ∈ [−1, 1].
Some commonly used kernel functions are given in Table 9.1. The uniform kernel
uses the same weights as a histogram of bin width 2h, except that it produces a running
histogram that is evaluated at a series of points x0 rather than using fixed bins. The
Gaussian kernel satisfies (iiib) rather than (iiia) because it does not restrict z ∈ [−1, 1].
A pth-order kernel is one whose first nonzero moment is the pth moment. The first
seven kernels are of second order and satisfy the second condition in (ii). The last
two kernels are fourth-order kernels. Such higher order kernels can increase rates of
convergence if f (x) is more than twice differentiable (see Section 9.3.10), though they
can take negative values. Table 9.1 also gives the parameter δ, defined in (9.11) and
used in Section 9.3.6 to aid bandwidth choice, for some of the kernels.
Given K(·) and h the estimator is very simple to implement. If the kernel estimator
is evaluated at r distinct values of x0 then computation of the kernel estimator requires
at most Nr operations, when the kernel has unbounded support. Considerable compu-
tational savings on this are possible; see, for example, Härdle (1990, p. 35).
Table 9.1. Kernel Functions: Commonly Used Examples^a

Kernel                              Kernel Function K(z)                        δ
Uniform (or box or rectangular)     (1/2) × 1(|z| ≤ 1)                          1.3510
Triangular (or triangle)            (1 − |z|) × 1(|z| ≤ 1)                      –
Epanechnikov (or quadratic)         (3/4)(1 − z²) × 1(|z| ≤ 1)                  1.7188
Quartic (or biweight)               (15/16)(1 − z²)² × 1(|z| ≤ 1)               2.0362
Triweight                           (35/32)(1 − z²)³ × 1(|z| ≤ 1)               2.3122
Tricubic                            (70/81)(1 − |z|³)³ × 1(|z| ≤ 1)             –
Gaussian (or normal)                (2π)^(−1/2) exp(−z²/2)                      0.7764
Fourth-order Gaussian               (1/2)(3 − z²)(2π)^(−1/2) exp(−z²/2)         –
Fourth-order quartic                (15/32)(3 − 10z² + 7z⁴) × 1(|z| ≤ 1)        –

^a The constant δ is defined in (9.11) and is used to obtain Silverman's plug-in estimate given in (9.13).
9.3.4. Kernel Density Example
The key choice of bandwidth h has already been illustrated in Figure 9.2.
Here we illustrate the choice of kernel using generated data, a random sample of
size 100 drawn from the N[0, 25²] distribution. For the particular sample drawn the
sample mean is 2.81 and the sample standard deviation is 25.27.
Figure 9.4 shows the effect of using different kernels. For Epanechnikov, Gaussian,
quartic, and uniform kernels, Silverman's plug-in estimate given in (9.13) yields band-
widths of, respectively, 0.545, 0.246, 0.646, and 0.214. The resulting kernel density
estimates are very similar, even for the uniform kernel, which produces a running
histogram. The variation in density estimate with kernel choice is much less than the
variation with bandwidth choice evident in Figure 9.2.
Figure 9.4: Kernel density estimates for log wage for four different kernels using the corre-
sponding Silverman’s plug-in estimate for bandwidth. Same data as Figure 9.1.
9.3.5. Statistical Inference
We present the distribution of the kernel density estimator f̂(x) for given choice of
K(·) and h, assuming the data x are iid. The estimate f̂(x) is biased. This bias goes to
zero asymptotically if the bandwidth h → 0 as N → ∞, so f̂(x) is consistent. How-
ever, the bias term does not necessarily disappear in the asymptotic normal distribution
for f̂(x), complicating statistical inference.
Mean and Variance
The mean and variance of f̂(x0) are obtained in Section 9.8.1, assuming that the second
derivative of f(x) exists and is bounded and that the kernel satisfies ∫zK(z)dz = 0, as
assumed in property (ii) of Section 9.3.3.
The kernel density estimator is biased, with bias term b(x0) that depends on the
bandwidth, the curvature of the true density, and the kernel used, according to

   b(x0) = E[f̂(x0)] − f(x0) = (1/2) h² f″(x0) ∫z²K(z)dz.   (9.4)

The kernel estimator is biased of size O(h²), where we use the order of magnitude
notation that a function a(h) is O(h^k) if a(h)/h^k is finite. The bias disappears asymp-
totically if h → 0 as N → ∞.
Assuming h → 0 and N → ∞, the variance of the kernel density estimator is

   V[f̂(x0)] = (1/(Nh)) f(x0) ∫K(z)²dz + o(1/(Nh)),   (9.5)

where a function a(h) is o(h^k) if a(h)/h^k → 0. The variance depends on the sample
size, bandwidth, the true density, and the kernel. The variance disappears if Nh → ∞,
which requires that while h → 0 it must do so at a slower rate than N → ∞.
Consistency
The kernel estimator is pointwise consistent, that is, consistent at a particular point
x = x0, if both the bias and variance disappear. This is the case if h → 0 and
Nh → ∞.
For estimation of f(x) at all values of x the stronger condition of uniform conver-
gence, that is, sup_{x0} |f̂(x0) − f(x0)| →p 0, can be shown to occur if Nh/ln N → ∞.
This requires h larger than for pointwise convergence.
Asymptotic Normality
The preceding results show that asymptotically f̂(x0) has mean f(x0) + b(x0) and
variance (Nh)⁻¹ f(x0) ∫K(z)²dz. It follows that if a central limit theorem can be
applied, the kernel density estimator has limit distribution

   √(Nh) (f̂(x0) − f(x0) − b(x0)) →d N[0, f(x0) ∫K(z)²dz].   (9.6)
The central limit theorem applied is a nonstandard one and requires condition (iv); see,
for example, Lee (1996, p. 139) or Pagan and Ullah (1999, p. 40).
It is important to note the presence of the bias term b(x0), defined in (9.4). For
typical choices of bandwidth this term does not disappear, complicating computation
of confidence intervals (presented in Section 9.3.7).
9.3.6. Bandwidth Choice
The choice of bandwidth h is much more important than choice of kernel function
K(·). There is a tension between setting h small to reduce bias and setting h large to
ensure smoothness. A natural metric to use is therefore mean-squared error (MSE),
the sum of bias squared and variance.
From (9.4) the bias is O(h²) and from (9.5) the variance is O((Nh)⁻¹). Intu-
itively, MSE is minimized by choosing h so that bias squared and variance are of the
same order, so h⁴ = (Nh)⁻¹, which implies the optimal bandwidth h = O(N^(−0.2)) and
√(Nh) = O(N^(0.4)). We now give a more formal treatment that includes a practical plug-
in estimate for h.
Mean Integrated Squared Error
A local measure of the performance of the kernel density estimate at x0 is the MSE

   MSE[f̂(x0)] = E[(f̂(x0) − f(x0))²],   (9.7)

where the expectation is with respect to the density f(x). Since MSE equals variance
plus squared bias, (9.4) and (9.5) yield the MSE of the kernel density estimate

   MSE[f̂(x0)] ≈ (1/(Nh)) f(x0) ∫K(z)²dz + [(1/2) h² f″(x0) ∫z²K(z)dz]².   (9.8)
To obtain a global measure of performance at all values of x0 we begin by defining
the integrated squared error (ISE)

   ISE(h) = ∫(f̂(x0) − f(x0))² dx0,   (9.9)

the continuous analogue of summing squared error over all x0 in the discrete case.
This is written as a function of h to emphasize dependence on the bandwidth. We then
eliminate the dependence of f̂(x0) on x values other than x0 by taking the expected
value of the ISE with respect to the density f(x). This yields the mean integrated
squared error (MISE),

   MISE(h) = E[ISE(h)] = E[∫(f̂(x0) − f(x0))² dx0] = ∫E[(f̂(x0) − f(x0))²] dx0 = ∫MSE[f̂(x0)] dx0,

where MSE[f̂(x0)] is defined in (9.8). From the preceding algebra MISE equals the
integrated mean-squared error (IMSE).
Optimal Bandwidth
The optimal bandwidth minimizes MISE. Differentiating MISE(h) with respect to h
and setting the derivative to zero yields the optimal bandwidth

   h* = δ [∫f″(x0)² dx0]^(−0.2) N^(−0.2),   (9.10)

where δ depends on the kernel function used, with

   δ = [∫K(z)²dz / (∫z²K(z)dz)²]^(0.2).   (9.11)

This result is due to Silverman (1986).
Since h* = O(N^(−0.2)), we have h* → 0 as N → ∞ and Nh* = O(N^(0.8)) → ∞,
as required for consistency. The bias in f̂(x0) is O(h*²) = O(N^(−0.4)), which disap-
pears as N → ∞. For a histogram estimate it can be shown that h* = O(N^(−1/3))
and MISE(h*) = O(N^(−2/3)), inferior to MISE(h*) = O(N^(−4/5)) for the kernel density
estimate.
The optimal bandwidth depends on the curvature of the density, with h* lower if
f(x) is highly variable.
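The constant δ in (9.11) is easily verified numerically. The short Python check below (our own code, using scipy quadrature) should reproduce the Table 9.1 values of approximately 1.7188 for the Epanechnikov kernel and 0.7764 for the Gaussian.

import numpy as np
from scipy.integrate import quad

def delta(kernel, lower=-1.0, upper=1.0):
    # delta = [ int K(z)^2 dz / ( int z^2 K(z) dz )^2 ]^(1/5), as in (9.11)
    int_K2, _ = quad(lambda z: kernel(z) ** 2, lower, upper)
    int_z2K, _ = quad(lambda z: z**2 * kernel(z), lower, upper)
    return (int_K2 / int_z2K**2) ** 0.2

epanechnikov = lambda z: 0.75 * (1.0 - z**2) * (abs(z) < 1.0)
gaussian = lambda z: np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

print(delta(epanechnikov))               # approximately 1.7188
print(delta(gaussian, -np.inf, np.inf))  # approximately 0.7764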
Optimal Kernel
The optimal bandwidth varies with the kernel (see (9.10) and (9.11)). It can be shown
that MISE(h∗
) varies little across kernels, provided different optimal h∗
are used for
different kernels (Figure 9.4 provides an illustration). It can be shown that the optimal
kernel is the Epanechnikov, though this advantage is slight.
Bandwidth choice is much more important than kernel choice, and from (9.10) the
optimal bandwidth itself varies with the kernel.
Plug-in Bandwidth Estimate
A plug-in estimate for the bandwidth is a simple formula for h that depends on the
sample size N and the sample standard deviation s.
A useful starting point is to assume that the data are normally distributed. Then
∫f″(x0)²dx0 = 3/(8√π σ⁵) = 0.2116/σ⁵, in which case (9.10) specializes to

   h* = 1.3643 δ N^(−0.2) s,   (9.12)

where s is the sample standard deviation of x and δ is given in Table 9.1 for several
kernels. For the Epanechnikov kernel h* = 2.345 N^(−0.2) s, and for the Gaussian kernel
h* = 1.059 N^(−0.2) s. The considerably lower bandwidth for the normal kernel arises
because, unlike most kernels, the normal kernel gives some weight to xi even if
|xi − x0| > h. In practice one uses Silverman's plug-in estimate

   h* = 1.3643 δ N^(−0.2) min(s, iqr/1.349),   (9.13)

where iqr is the sample interquartile range. This uses iqr/1.349 as an alternative
estimate of σ that protects against outliers, which can increase s and lead to too large
an h.
These plug-in estimates for h work well in practice, especially for symmetric uni-
modal densities, even if f (x) is not the normal density. Nonetheless, one should also
check by using variations such as twice and half the plug-in estimate.
For the example in Figures 9.2 and 9.4 we have 177^(−0.2) = 0.3551, s = 0.8282, and
iqr/1.349 = 0.6459, so (9.13) yields h* = 0.3173δ. For the Epanechnikov kernel, for
example, this yields h* = 0.545 since δ = 1.7188 from Table 9.1.
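A direct implementation of the plug-in rule (9.13) follows (Python, our own function name). With the sample statistics just quoted and δ = 1.7188 it returns a bandwidth of roughly 0.54, close to the 0.545 reported above; halving and doubling the result, as in Figure 9.2, remains a sensible robustness check.

import numpy as np

def silverman_bandwidth(x, delta=1.7188):
    # Silverman's plug-in rule (9.13): h* = 1.3643 delta N^(-1/5) min(s, iqr/1.349)
    x = np.asarray(x, dtype=float)
    s = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 1.3643 * delta * x.size ** (-0.2) * min(s, iqr / 1.349)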
Cross-Validation
From (9.9), ISE(h) = ∫f̂(x0)²dx0 − 2∫f̂(x0)f(x0)dx0 + ∫f(x0)²dx0. The third
term does not depend on h. An alternative data-driven approach estimates the first
two terms in ISE(h) by

   CV(h) = (1/(N²h)) Σ_i Σ_j K^(2)((xi − xj)/h) − (2/N) Σ_{i=1}^N f̂_{−i}(xi),   (9.14)

where K^(2)(u) = ∫K(u − t)K(t)dt is the convolution of K with itself, and f̂_{−i}(xi) is
the leave-one-out kernel estimator of f(xi). See Lee (1996, p. 137) or Pagan and Ullah
(1999, p. 51) for a derivation. The cross-validation estimate h_CV is chosen to mini-
mize CV(h). It can be shown that h_CV →p h* as N → ∞, but the rate of convergence
is very slow.
Obtaining h_CV is computationally burdensome because CV(h) needs to be com-
puted for a range of values of h. It is often not necessary to cross-validate for kernel
density estimation as the plug-in estimate usually provides a good starting point.
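To show the logic of (9.14), the brute-force Python sketch below codes the criterion for the Gaussian kernel, for which the convolution K^(2) is simply the N[0, 2] density; the O(N²) double sum is acceptable for moderate N. The leave-one-out estimates here divide by N − 1, a common but not unique convention.

import numpy as np

def cv_density(x, h):
    # Least-squares cross-validation criterion (9.14), Gaussian kernel.
    x = np.asarray(x, dtype=float)
    n = x.size
    d = (x[:, None] - x[None, :]) / h
    # K^(2)(u) = convolution of two standard normal densities = N[0,2] density
    first = (np.exp(-0.25 * d**2) / np.sqrt(4.0 * np.pi)).sum() / (n**2 * h)
    k = np.exp(-0.5 * d**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(k, 0.0)
    f_loo = k.sum(axis=1) / ((n - 1) * h)  # leave-one-out estimates fhat_{-i}(x_i)
    return first - 2.0 * f_loo.mean()

# h_cv: value on a grid minimizing cv_density, for example
# grid = np.linspace(0.1, 1.5, 50); h_cv = grid[np.argmin([cv_density(x, h) for h in grid])]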
9.3.7. Confidence Intervals
Kernel density estimates are usually presented without confidence intervals, but it is
possible to construct pointwise confidence intervals for f (x0), where pointwise means
evaluated at a particular value of x0. A simple procedure is to obtain confidence inter-
vals at a small number of evaluation points x0, say 10, that are evenly distributed over
the range of x and plot these along with the estimated density curves.
The result (9.6) yields the following 95% confidence interval for f(x0):

   f(x0) ∈ f̂(x0) − b(x0) ± 1.96 × √[(1/(Nh)) f̂(x0) ∫K(z)²dz].

For most kernels ∫K(z)²dz is easily obtained by analytical methods.
The situation is complicated by the bias term, which should not be ignored in finite
samples, even though asymptotically b(x0) →p 0. This is because with optimal band-
width h* = O(N^(−0.2)) the bias of the rescaled random variable √(Nh)(f̂(x0) − f(x0))
given in (9.6) does not disappear, since √(Nh*) times O(h*²) is O(1). The bias can be
estimated using (9.4) and a kernel estimate of f″(x0), but in practice the estimate of
f″(x0) is noisy. Instead, the usual method is to reduce the bias in computing the confi-
dence interval, but not f̂(x0) itself, by undersmoothing, that is, by choosing h < h* so
that h = o(N^(−0.2)). Other approaches include using a higher order kernel, such as the
fourth-order kernels given in Table 9.1, or bootstrapping (see Section 11.6.5).
One can also compute confidence bands for f (x) over all possible values of x.
These are wider than the pointwise confidence intervals for each value x0.
9.3.8. Estimation of Derivatives of a Density
In some cases estimates of the derivatives of a density are needed. For example,
estimation of the bias term of f̂(x0) given in (9.4) requires an estimate of f″(x0).
For simplicity we present estimates of the first derivative. A finite-difference
approach uses f̂′(x0) = [f̂(x0 + Δ) − f̂(x0 − Δ)]/(2Δ). A calculus approach instead
takes the first derivative of f̂(x0) in (9.3), yielding f̂′(x0) = −(Nh²)⁻¹ Σi K′((xi − x0)/h).
Intuitively, a larger bandwidth should be used to estimate derivatives, which can be
more variable than f(x0). The bias of f̂^(s)(x0) is as before but the variance converges
more slowly, leading to optimal bandwidth h* = O(N^(−1/(2s+2p+1))) if f(x0) is p times
differentiable. For kernel estimation of the first derivative we need p ≥ 3.
9.3.9. Multivariate Kernel Density Estimate
The preceding discussion considered kernel density estimation for scalar x. For the
density of the k-dimensional random variable x, the multivariate kernel density esti-
mator is

   f̂(x0) = (1/(Nh^k)) Σ_{i=1}^N K((xi − x0)/h),
where K(·) is now a k-dimensional kernel. Usually K(·) is a product kernel, the prod-
uct of one-dimensional kernels. Multivariate kernels such as the multivariate normal
density or spherical kernels proportional to K(z′z) can also be used. The kernel K(·)
satisfies properties similar to those given in the one-dimensional case; see Lee
(1996, p. 125).
The analytical results and expressions are similar to those before, except that the
variance of f̂(x0) is of order O((Nh^k)⁻¹), which for k > 1 goes to zero more slowly than
the O((Nh)⁻¹) variance in the one-dimensional case. Then

   √(Nh^k) (f̂(x0) − f(x0) − b(x0)) →d N[0, f(x0) ∫K(z)²dz].

The optimal bandwidth choice is h = O(N^(−1/(k+4))), which is larger than O(N^(−0.2)) in
the one-dimensional case, and implies √(Nh^k) = O(N^(2/(4+k))). The plug-in and cross-
validation methods can be extended to the multivariate case. For the product normal
kernel Scott's plug-in estimate for the jth component of x is hj = N^(−1/(k+4)) sj, where
sj is the sample standard deviation of xj.
Problems of sparseness of data are more likely to arise with a multivariate kernel.
There is a curse of dimensionality, as fewer observations in the vicinity of x0 receive
substantial weight when x is of higher dimension. Even when this is not a problem,
plotting even a bivariate kernel density estimate requires a three-dimensional plot that
can be difficult to read and interpret.
One use of a multivariate kernel density estimate is to permit estimation of a
conditional density. Since f(y|x) = f(x, y)/f(x), an obvious estimator is f̂(y|x) =
f̂(x, y)/f̂(x), where f̂(x, y) and f̂(x) are bivariate and univariate kernel density
estimates.
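A minimal sketch of these two ideas, with Gaussian product kernels and separate bandwidths hx and hy chosen by the user, is the following (the function names are ours):

import numpy as np

def gauss(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def bivariate_density(x, y, x0, y0, hx, hy):
    # Product-kernel estimate fhat(x0, y0) = (1/(N hx hy)) sum_i K((x_i-x0)/hx) K((y_i-y0)/hy)
    kx = gauss((np.asarray(x, dtype=float) - x0) / hx)
    ky = gauss((np.asarray(y, dtype=float) - y0) / hy)
    return (kx * ky).mean() / (hx * hy)

def conditional_density(x, y, x0, y0, hx, hy):
    # fhat(y0 | x0) = fhat(x0, y0) / fhat(x0)
    fx = gauss((np.asarray(x, dtype=float) - x0) / hx).mean() / hx
    return bivariate_density(x, y, x0, y0, hx, hy) / fx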
9.3.10. Higher Order Kernels
The preceding analysis assumes f (x) is twice differentiable, a necessary assumption to
obtain the bias term in (9.4). If f (x) is more than twice differentiable then using higher
order kernels (see Section 9.3.3 for fourth-order examples) reduces the order of the
bias, leading to smaller h∗
and faster rates of convergence. A general statement is that
if x is k dimensional and f (x) is p times differentiable and a pth-order kernel is used,
then the kernel estimate f̂(x0) of f(x) has optimal rate of convergence N^(−p/(2p+k))
when h* = O(N^(−1/(2p+k))).
9.3.11. Alternative Nonparametric Density Estimates
The kernel density estimate is the standard nonparametric estimate. Other density es-
timates are presented, for example, in Pagan and Ullah (1999). These often use ap-
proaches such as nearest-neighbors methods that are more commonly used in non-
parametric regression and are presented briefly in Section 9.6.
9.4. Nonparametric Local Regression
We consider regression of scalar dependent variable y on a scalar regressor variable x.
The regression model is

   yi = m(xi) + εi,   εi ∼ iid [0, σ²ε],   i = 1, . . . , N.   (9.15)
The complication is that the functional form m(·) is not specified, so NLS estimation
is not possible.
This section provides a simple general treatment of nonparametric regression us-
ing local weighted averages. Specialization to kernel regression is given in Section 9.5
and other commonly used local weighted methods are presented in Section 9.6.
9.4.1. Local Weighted Averages
Suppose that for a distinct value of the regressor, say x0, there are multiple obser-
vations on y, say N0 observations. Then an obvious simple estimator for m(x0) is
the sample average of these N0 values of y, which we denote m̂(x0). It follows that
m̂(x0) ∼ [m(x0), N0⁻¹ σ²ε], since it is the average of N0 observations that by (9.15) are
iid with mean m(x0) and variance σ²ε.
The estimator m̂(x0) is unbiased but not necessarily consistent. Consistency requires
N0 → ∞ as N → ∞, so that V[m̂(x0)] → 0. With discrete regressors this estimator
may be very noisy in finite samples because N0 may be small. Even worse, for con-
tinuous regressors there may be only one observation for which xi takes the particular
value x0, even as N → ∞.
The problem of sparseness in data can be overcome by averaging observed values
of y when x is close to x0, in addition to when x exactly equals x0. We begin by noting
that the estimator 
m(x0) can be expressed as a weighted average of the dependent
variable, with 
m(x0) =

i wi0 yi , where the weights wi0 equal 1/N0 if xi = x0 and
equal 0 if xi = x0. Thus the weights vary with both the evaluation point x0 and the
sample values of the regressors.
More generally we consider the local weighted average estimator

   m̂(x0) = Σ_{i=1}^N wi0,h yi,   (9.16)

where the weights wi0,h = w(xi, x0, h) sum to one, so Σi wi0,h = 1. The weights are
specified to increase as xi becomes closer to x0.
The additional parameter h is generic notation for a window width parameter. It
is defined so that smaller values of h lead to a smaller window and more weight being
placed on those observations with xi close to x0. In the specific example of kernel
regression, h is the bandwidth. Other methods given in Section 9.6 have alternative
smoothing parameters that play a similar role to h here. As h becomes smaller m̂(x0)
becomes less biased, as only observations close to x0 are being used, but more variable,
as fewer observations are being used.
The OLS predictor for the linear regression model is a weighted average of yi, since
some algebra yields

   m̂_OLS(x0) = Σ_{i=1}^N [1/N + (x0 − x̄)(xi − x̄)/Σ_j(xj − x̄)²] yi.

The OLS weights, however, can actually increase with increasing distance between x0
and xi if, for example, xi < x0 < x̄. Local regression instead uses weights that are
decreasing in |xi − x0|.
9.4.2. K-Nearest Neighbors Example
We consider a simple example, the unweighted average of the y values correspond-
ing to the closest (k − 1)/2 observations on x less than x0 and the closest (k − 1)/2
observations on x greater than x0.
Order the observations by increasing x values. Then evaluation at x0 = xi yields

   m̂k(xi) = (1/k)(y_{i−(k−1)/2} + · · · + y_{i+(k−1)/2}),

where for simplicity k is odd, and potential modifications caused by ties and by values of
x0 close to the end points x1 or xN are ignored. This estimator can be expressed as a
special case of (9.16) with weight

   wi0,k = (1/k) × 1(|i − 0| ≤ (k − 1)/2),   x1 < x2 < · · · < x0 < · · · < xN,

where the index 0 denotes the position of x0 in the ordered sample.
This estimator has many names. We refer to it as a (symmetrized) k–nearest neigh-
bors estimator (k−NN), defined in Section 9.6.1. It is also a standard local running
average or running mean or moving average of length k centered at x0 that is used,
for example, to plot a time series y against time x. The parameter k plays the role of
the window width h in Section 9.4.1, with small k corresponding to small h.
As an illustration, consider data generated from the model

   yi = 150 + 6.5xi − 0.15xi² + 0.001xi³ + εi,   xi = i,   εi ∼ N[0, 25²],   i = 1, . . . , 100.   (9.17)
The mean of y is a cubic in x, with x taking values 1, 2, . . . , 100, with turning points
at x = 20 and x = 80. To this is added a normally distributed error term with standard
deviation 25.
Figure 9.5 plots the symmetrized k–NN estimator with k = 5 and 25. Both moving
averages suggest a cubic relationship. The second is smoother than the first but is still
quite jagged despite one-quarter of the sample being used to form the average. The
OLS regression line is also given on the diagram.
Figure 9.5: k-nearest neighbors regression curve for two different choices of k, as well as
OLS regression line. The data are generated from a cubic polynomial model.
The slope of m̂k(x) is flatter at the end points when k = 25 rather than k = 5. This
illustrates a boundary problem in estimating m(x) at the end points. For example,
for the smallest regressor value x1 there are no lower valued observations on x
to be included, and the average becomes a one-sided average m̂k(x1) = (y1 + · · · +
y_{1+(k−1)/2})/[(k + 1)/2]. Since for these data m(x) is increasing in x in this region,
this leads to m̂k(x1) being an overestimate, and the overstatement is increasing in k.
Such boundary problems are reduced by instead using methods given in Section 9.6.2.
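The dgp (9.17) and the symmetrized k-NN estimator are easily replicated. The Python sketch below generates one such sample and computes the running mean for k = 5 and k = 25, shrinking the window to a one-sided average near the end points as described above; the seed and exact draws will of course differ from those behind Figure 9.5.

import numpy as np

rng = np.random.default_rng(12345)
x = np.arange(1.0, 101.0)                                    # x_i = i, i = 1, ..., 100
y = 150 + 6.5 * x - 0.15 * x**2 + 0.001 * x**3 + rng.normal(scale=25, size=x.size)

def knn_running_mean(y, k):
    # Symmetrized k-nearest-neighbors (moving average) estimate at each ordered x_i.
    half = (k - 1) // 2
    mhat = np.empty(y.size)
    for i in range(y.size):
        lo, hi = max(0, i - half), min(y.size, i + half + 1)  # one-sided near the ends
        mhat[i] = y[lo:hi].mean()
    return mhat

m5, m25 = knn_running_mean(y, 5), knn_running_mean(y, 25)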
9.4.3. Lowess Regression Example
Using alternative weights to those used to form the symmetrized k–NN estimator can
lead to better estimates of m(x).
An example is the Lowess estimator, defined in Section 9.6.2. This provides a
smoother estimate of m(x) as it uses kernel weights rather than an indicator func-
tion, analogous to a kernel density estimate being smoother than a running histogram.
It also has smaller bias (see Section 9.6.2), which is especially beneficial in estimating
m(x) at the end points.
Figure 9.6 plots, for data generated by (9.17), the Lowess estimate with k = 25. This
local regression estimate is quite close to the true cubic conditional mean function,
which is also drawn. Comparing Figure 9.6 to Figure 9.5 for symmetrized k–NN with
k = 25, we see that Lowess regression leads to a much smoother regression function
estimate and more precise estimation at the boundaries.
9.4.4. Statistical Inference
When the error term is normally distributed and analysis is conditional on x1, . . . , xN ,
the exact small-sample distribution of 
m(x0) in (9.16) can be easily obtained.
Lowess Nonparametric Regression
Figure 9.6: Nonparametric regression curve using Lowess, as well as a cubic regression
curve. Same generated data as Figure 9.5.
Substituting yi = m(xi) + εi into the definition of m̂(x0) leads directly to

   m̂(x0) − Σ_{i=1}^N wi0,h m(xi) = Σ_{i=1}^N wi0,h εi,

which implies, with fixed regressors and εi iid N[0, σ²ε], that

   m̂(x0) ∼ N[Σ_{i=1}^N wi0,h m(xi), σ²ε Σ_{i=1}^N w²i0,h].   (9.18)

Note that in general m̂(x0) is biased and the distribution is not necessarily centered
around m(x0).
With stochastic regressors and nonnormal errors, we condition on x1, . . . , xN and
apply a central limit theorem for U-statistics that is appropriate for double summations
(see, for example, Pagan and Ullah, 1999, p. 359). Then for εi iid [0, σ²ε],

   c(N) Σ_{i=1}^N wi0,h εi →d N[0, σ²ε lim c(N)² Σ_{i=1}^N w²i0,h],   (9.19)

where c(N) is a function of the sample size, of order at most N^(1/2), that can vary with
the local estimator. For example, c(N) = √(Nh) for kernel regression and c(N) = N^(0.4)
for kernel regression with optimal bandwidth. Then

   c(N) (m̂(x0) − m(x0) − b(x0)) →d N[0, σ²ε lim c(N)² Σ_{i=1}^N w²i0,h],   (9.20)

where b(x0) = Σi wi0,h m(xi) − m(x0). Note that (9.20) yields (9.18) for the asymp-
totic distribution of m̂(x0).
Clearly, the distribution of m̂(x0), a simple weighted average, can be obtained un-
der alternative distributional assumptions. For example, for heteroskedastic errors
the variance in (9.19) and (9.20) is replaced by lim c(N)² Σi σ²ε,i w²i0,h, which can be
consistently estimated by replacing σ²ε,i by the squared residual (yi − m̂(xi))². Alter-
natively, one can bootstrap (see Section 11.6.5).
9.4.5. Bandwidth Choice
Throughout this chapter we follow the nonparametric terminology that an estimator
θ̂ of θ0 has convergence rate N^(−r) if θ̂ = θ0 + Op(N^(−r)), so that N^r(θ̂ − θ0) = Op(1)
and ideally N^r(θ̂ − θ0) has a limit normal distribution. Note in particular that an esti-
mator that is commonly called a √N-consistent estimator is converging at rate N^(−1/2).
Nonparametric estimators typically have a slower rate of convergence than this, with
r < 1/2, because a small bandwidth h is needed to eliminate bias but then fewer than N
observations are being used to estimate m(x0).
As an example, consider the k–NN example of Section 9.4.2. Suppose k = N^(4/5), so
that, for example, k = 251 if N = 1,000. Then the estimator is consistent as the moving
average uses N^(4/5)/N = N^(−1/5) of the sample and is therefore collapsing around x0 as
N → ∞. Using (9.18), the variance of the moving average estimator is σ²ε Σi w²i0,k =
σ²ε × k × (1/k)² = σ²ε × 1/k = σ²ε N^(−4/5), so in (9.19) c(N) = √k = √(N^(4/5)) = N^(0.4),
which is less than N^(1/2). Other values of k also ensure consistency, provided k < O(N).
More generally, a range of values of the bandwidth parameter eliminates asymptotic
bias, but smaller bandwidth increases variability. In this literature this trade-off is ac-
counted for by minimizing mean-squared error, the sum of variance and bias squared.
Stone (1980) showed that if x is k-dimensional and m(x) is p times differentiable
then the fastest possible rate of convergence for a nonparametric estimator of an sth-
order derivative of m(x) is N^(−r), where r = (p − s)/(2p + k). This rate decreases as
the order of the derivative increases and as the dimension of x increases. It increases the
more differentiable m(x) is assumed to be, approaching N^(−1/2) if m(x) has derivatives
of order approaching infinity. For scalar regression estimation of m(x) it is customary
to assume existence of m″(x), in which case r = 2/5 and the fastest convergence rate
is N^(−0.4).
9.5. Kernel Regression
Kernel regression is a weighted average estimator using kernel weights. Issues such as
bias and choice of bandwidth presented for kernel density estimation are also relevant
here. However, there is less guidance for choice of bandwidth than in the density
estimation case. Also, while we present kernel regression for pedagogical reasons, other
local regression estimators are often used in practice (see Section 9.6).
9.5.1. Kernel Regression Estimator
The goal in kernel regression is to estimate the regression function m(x) in the model
y = m(x) + ε defined in (9.15).
From Section 9.4.1, an obvious estimator of m(x0) is the average of the sample
values yi of the dependent variable corresponding to the xi s close to x0. A variation
on this is to find the average of the yi s for all observations with xi within distance h of
x0. This can be formally expressed as

   m̂(x0) ≡ Σ_{i=1}^N 1(|xi − x0|/h ≤ 1) yi / Σ_{i=1}^N 1(|xi − x0|/h ≤ 1),
where as before 1(A) = 1 if event A occurs and equals 0 otherwise. The numerator
sums the y values and the denominator gives the number of y values that are summed.
This expression gives equal weights to all observations close to x0, but it may be
preferable to give the greatest weight at x0 and decrease the weight as we move away.
Thus more generally we consider a kernel weighting function K(·), introduced in Sec-
tion 9.3.2. This yields the kernel regression estimator

   m̂(x0) ≡ [(1/(Nh)) Σ_{i=1}^N K((xi − x0)/h) yi] / [(1/(Nh)) Σ_{i=1}^N K((xi − x0)/h)].   (9.21)
Several common kernel functions – uniform, Gaussian, Epanechnikov, and quartic –
have already been given in Table 9.1.
The constant h is called the bandwidth, and 2h is called the window width. The
bandwidth plays the same role as k in the k–NN example of Section 9.4.2.
The estimator (9.21) was proposed by Nadaraya (1964) and Watson (1964),
who gave an alternative derivation. The conditional mean m(x) = ∫y f(y|x)dy =
∫y [f(y, x)/f(x)]dy, which can be estimated by m̂(x) = ∫y [f̂(y, x)/f̂(x)]dy, where
f̂(y, x) and f̂(x) are bivariate and univariate kernel density estimators. It can be shown
that this equals the estimator in (9.21). The statistics literature also considers kernel re-
gression in the fixed design or fixed regressors case where f(x) is known and need not
be estimated, whereas we consider only the case of stochastic regressors that arises
with observational data.
The kernel regression estimator is a special case of the weighted average (9.16),
with weights

   wi0,h = (1/(Nh)) K((xi − x0)/h) / [(1/(Nh)) Σ_{i=1}^N K((xi − x0)/h)],   (9.22)
which by construction sum over i to one. The general results of Section 9.4 are relevant,
but we give a more detailed analysis.
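For concreteness, a direct Python implementation of (9.21) follows; the weights are exactly those in (9.22), and x0 may be a single point or an array of evaluation points. The function name is ours.

import numpy as np

def epanechnikov(z):
    return 0.75 * (1.0 - z**2) * (np.abs(z) < 1.0)

def kernel_regression(x, y, x0, h, kernel=epanechnikov):
    # Nadaraya-Watson estimator (9.21) of m(x0) = E[y | x = x0].
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    z = (x[None, :] - np.atleast_1d(np.asarray(x0, dtype=float))[:, None]) / h
    k = kernel(z)             # rows index evaluation points, columns index observations
    # If no observation lies within h of an evaluation point the denominator is zero;
    # see the discussion of trimming in Section 9.5.3.
    return (k * y).sum(axis=1) / k.sum(axis=1)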
9.5.2. Statistical Inference
We present the distribution of the kernel regression estimator m̂(x) for given choice
of K(·) and h, assuming the data x are iid. We implicitly assume that regressors are
continuous. With discrete regressors m̂(x0) will still collapse on m(x0), and both m̂(x0)
in the limit and m(x0) are step functions.
Consistency
Consistency of m̂(x0) for the conditional mean function m(x0) requires h → 0, so that
substantial weight is given only to xi very close to x0. At the same time we need many
xi close to x0, so that many observations are used in forming the weighted average.
Formally, m̂(x0) →p m(x0) if h → 0 and Nh → ∞ as N → ∞.
Bias
The kernel regression estimator is biased of size O(h²), with bias term

   b(x0) = h² [m′(x0) f′(x0)/f(x0) + (1/2) m″(x0)] ∫z²K(z)dz   (9.23)

(see Section 9.8.2), assuming m(x) is twice differentiable. As for kernel density estima-
tion, the bias varies with the kernel function used. More importantly, the bias depends
on the slope and curvature of the regression function m(x0) and the slope of the density
f(x0) of the regressors, whereas for density estimation the bias depended only on the
second derivative of f(x0). The bias can be particularly large at the end points, as
illustrated in Section 9.4.2.
The bias can be reduced by using higher order kernels, defined in Section 9.3.3, and
boundary modifications such as specific boundary kernels. Local polynomial regres-
sion and modifications such as Lowess (see Section 9.6.2) have the attraction that the
term in (9.23) depending on m′(x0) drops out, and they perform well at the boundaries.
Asymptotic Normality
In Section 9.8.2 it is shown that, for xi iid with density f(xi), the kernel regression
estimator has limit distribution

   √(Nh) (m̂(x0) − m(x0) − b(x0)) →d N[0, (σ²ε/f(x0)) ∫K(z)²dz].   (9.24)

The variance term in (9.24) is larger for small f(x0), so as expected the variance of
m̂(x0) is larger in regions where x is sparse.
9.5.3. Bandwidth Choice
Incorporating values of yi for which xi ≠ x0 into the weighted average introduces bias,
since E[yi|xi] = m(xi) ≠ m(x0) for xi ≠ x0. However, using these additional points
reduces the variance of the estimator, since we are averaging over more data. The opti-
mal bandwidth balances the trade-off between increased bias and decreased variance,
using squared error loss. Unlike kernel density estimation, plug-in approaches are im-
practical and cross-validation is used more extensively.
For simplicity most studies focus on choosing one bandwidth for all values of x0.
Some methods with variable bandwidths, notably k–NN and Lowess, are given in
Section 9.6.
Mean Integrated Squared Error
The local performance of m̂(·) at x0 is measured by the mean-squared error, given by

   MSE[m̂(x0)] = E[(m̂(x0) − m(x0))²],

where the expectation eliminates dependence of m̂(x0) on x. Since MSE equals vari-
ance plus squared bias, the MSE can be obtained using (9.23) and (9.24).
Similar to Section 9.3.6, the integrated squared error is

   ISE(h) = ∫(m̂(x0) − m(x0))² f(x0) dx0,

where f(x) denotes the density of the regressors x, and the mean integrated squared
error, or equivalently the integrated mean-squared error, is

   MISE(h) = ∫MSE[m̂(x0)] f(x0) dx0.
Optimal Bandwidth
The optimal bandwidth h* minimizes MISE(h). This yields h* = O(N^(−0.2)) since
the bias is O(h²) from (9.23); the variance is O((Nh)⁻¹) from (9.24), since an O(1)
variance is obtained after scaling m̂(x0) by √(Nh); and for bias squared and variance to
be of the same order (h²)² = (Nh)⁻¹, or h = N^(−0.2). The kernel estimate then converges
to m(x0) at rate (Nh*)^(−1/2) = N^(−0.4) rather than the usual N^(−0.5) for parametric analysis.
Plug-in Bandwidth Estimate
One can obtain an exact expression for h∗
that minimizes MISE(h), using calculus
methods similar to those in Section 9.3.5 for the kernel density estimator. Then h∗
depends on the bias and variance expressions in (9.23) and (9.24).
A plug-in approach calculates h* using estimates of these unknowns. However,
estimation of m″(x), for example, requires nonparametric methods that in turn require
an initial bandwidth choice, but h* also depends on unknowns such as m″(x). Given
these complications one should be wary of plug-in estimates. More common is to use
cross-validation, presented in the following.
It can also be shown that MISE(h*) is minimized if the Epanechnikov kernel is
used (see Härdle, 1990, p. 186, or Härdle and Linton, 1994, p. 2321), though as in the
kernel density case MISE(h*) is not much larger for other kernels. The key issue is
determination of h*, which will vary with the kernel and the data.
Cross-Validation
An empirical estimate of the optimal h can be obtained by the leave-one-out cross-
validation procedure. This chooses ĥ* that minimizes

   CV(h) = Σ_{i=1}^N (yi − m̂_{−i}(xi))² π(xi),   (9.25)

where π(xi) is a weighting function (discussed in the following) and

   m̂_{−i}(xi) = Σ_{j≠i} wji,h yj / Σ_{j≠i} wji,h   (9.26)
is a leave-one-out estimate of m(xi ) obtained by the kernel formula (9.21), or more
generally by a weighted procedure (9.16), with the modification that yi is dropped.
Cross-validation is not as computationally intensive as it first appears. It can be
shown that

   yi − m̂_{−i}(xi) = (yi − m̂(xi)) / (1 − wii,h/Σ_j wji,h),   (9.27)

so that for each value of h cross-validation requires only one computation of the
weighted averages m̂(xi), i = 1, . . . , N.
The weights π(xi) are introduced to potentially downweight the end points, which
otherwise may receive too much importance since local weighted estimates can be
quite highly biased at the end points, as illustrated in Section 9.4.2. For example, ob-
servations with xi outside the 5th to 95th percentiles may not be used in calculating
CV(h), in which case π(xi) = 0 for these observations and π(xi) = 1 otherwise. The
term cross-validation is used as it validates the ability to predict the ith observation us-
ing all the other observations in the data set. The ith observation is dropped because if
instead it was additionally used in the prediction, then CV(h) would be trivially mini-
mized when m̂h(xi) = yi, i = 1, . . . , N. CV(h) is also called the estimated prediction
error.
Härdle and Marron (1985) showed that minimizing CV(h) is asymptotically equiv-
alent to minimizing a modification of ISE(h) and MISE(h). The modification includes
the weight function π(x0) in the integrand, as well as the averaged squared error (ASE)
N⁻¹ Σi (m̂(xi) − m(xi))² π(xi), which is a discrete sample approximation to ISE(h).
The measure CV(h) converges at the slow rate of O(N^(−0.1)), however, so CV(h) can be
quite variable in finite samples.
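The shortcut (9.27) makes leave-one-out cross-validation cheap to code: for each trial h only one pass over the data is needed. A brute-force Python sketch, with π(xi) = 1 for all observations, is the following; as noted above, observations near the end points may instead be given zero weight.

import numpy as np

def cv_regression(x, y, h, kernel):
    # Leave-one-out CV criterion (9.25), computed via the shortcut (9.27).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = kernel((x[:, None] - x[None, :]) / h)    # k[i, j] proportional to w_{ji,h}
    denom = k.sum(axis=1)
    mhat = (k * y[None, :]).sum(axis=1) / denom  # mhat(x_i) using all observations
    loo_resid = (y - mhat) / (1.0 - np.diag(k) / denom)
    return np.sum(loo_resid**2)

# h_cv: grid value minimizing cv_regression(x, y, h, epanechnikov), for example.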
Generalized Cross-Validation
An alternative to leave-one-out cross-validation is to use a measure similar to CV(h)
but one that more simply uses m̂(xi) rather than m̂_{−i}(xi) and then adds a model com-
plexity penalty that increases as the bandwidth h decreases. This leads to

   PV(h) = Σ_{i=1}^N (yi − m̂(xi))² π(xi) p(wii,h),

where p(·) is the penalty function and wii,h is the weight given to the ith observation
in m̂(xi) = Σ_j wji,h yj.
A popular example is the generalized cross-validation measure that uses the
penalty function p(wii,h) = (1 − wii,h)⁻². Other penalties are given in Härdle (1990,
p. 167) and Härdle and Linton (1994, p. 2323).
Cross-Validation Example
For the local running average example in Section 9.4.2, CV(k) = 54,811, 56,666,
63,456, 65,605, and 69,939 for k = 3, 5, 7, 9, and 25, respectively. In this case all
observations were used to calculate CV(k), with π(xi ) = 1, despite possible end-point
problems. There is no real gain after k = 5, though from Figure 9.5 this value pro-
duced too rough an estimate and in practice one would choose a higher value of k to
get a smoother curve.
More generally cross-validation is by no means perfect and it is common to “eye-
ball” fitted nonparametric curves to select h to achieve a desired degree of smoothness.
Trimming
The denominator of the kernel estimator in (9.21) is f̂(x0), the kernel estimate of the
density of the regressor at x0. At some evaluation points f̂(xi) can be very small,
leading to a very large estimate m̂(xi). Trimming eliminates or greatly downweights
all points with f̂(xi) < b, say, where b → 0 at an appropriate rate as N → ∞. Such
problems are most likely to occur in the tails of the distribution. For nonparametric
estimation one can just focus on estimation of m(xi) for more central values of xi, and
values in the tails may be downweighted in cross-validation. However, the semipara-
metric methods of Section 9.7 can entail computation of m̂(xi) at all values of xi, in
which case it is not unusual to trim. Ideally, the trimming function should make no
difference asymptotically, though it will make a difference in finite samples.
9.5.4. Confidence Intervals
Kernel regression estimates should generally be presented with pointwise confidence
intervals. A simple procedure is to present pointwise confidence intervals for m(x0)
evaluated at, for example, x0 equal to the first through ninth deciles of x.
If the bias b(x0) in m̂(x0) is ignored, (9.24) yields the following 95% confidence
interval:

   m(x0) ∈ m̂(x0) ± 1.96 √[(1/(Nh)) (σ̂²ε/f̂(x0)) ∫K(z)²dz],

where σ̂²ε = Σi wi0,h ε̂²i, wi0,h is defined in (9.22), and f̂(x0) is the kernel density
estimate at x0. This estimate assumes homoskedastic errors, though it is likely to be
somewhat robust to heteroskedasticity since observations close to x0 are given the
greatest weight. Alternatively, from the discussion after (9.20) a heteroskedasticity-robust
95% confidence interval is m̂(x0) ± 1.96 ŝ0, where ŝ²0 = Σi w²i0,h ε̂²i.
As in the kernel density case, the bias in m̂(x0) should not be ignored. As already
noted, estimation of the bias is difficult. Instead, the standard procedure is to under-
smooth, with smaller bandwidth h satisfying h = o(N^(−0.2)) rather than the optimal
h* = O(N^(−0.2)).
Härdle (1990) gives a detailed presentation of confidence intervals, including uni-
form confidence bands rather than pointwise intervals, and the bootstrap methods given
in Section 11.6.5.
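A minimal Python sketch of the homoskedastic pointwise interval given above follows. It ignores the bias term, so h should be undersmoothed, and it uses ∫K(z)²dz = 3/5 for the Epanechnikov kernel; the residuals are formed about m̂(x0) for simplicity, a small departure from the weighted residuals ε̂i = yi − m̂(xi) in the text. The function name is ours.

import numpy as np

def pointwise_ci(x, y, x0, h, kernel, int_K2=0.6):
    # Approximate 95% CI for m(x0), homoskedastic errors, bias ignored.
    # int_K2 = integral of K(z)^2 dz; 0.6 corresponds to the Epanechnikov kernel.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = kernel((x - x0) / h)
    w = k / k.sum()
    mhat = w @ y
    sigma2 = w @ (y - mhat) ** 2       # locally weighted error variance estimate
    fhat = k.sum() / (x.size * h)      # kernel density estimate at x0
    half_width = 1.96 * np.sqrt(sigma2 * int_K2 / (x.size * h * fhat))
    return mhat - half_width, mhat + half_width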
9.5.5. Derivative Estimation
In regression we are often interested in how the conditional mean of y changes with
changes in x, the marginal effect, rather than the conditional mean per se.
Kernel estimates can be easily used to form the derivative. The general result is that
the sth derivative of the kernel regression estimate, m̂^(s)(x0), is consistent for m^(s)(x0),
the sth derivative of the conditional mean m(x0). Either calculus or finite-difference
approaches can be taken.
approaches can be taken.
As an example, consider estimation of the first derivative in the generated-data
example of the previous section. Let z1, . . . , zN denote the ordered points at which
the kernel regression function is evaluated and m̂(z1), . . . , m̂(zN) denote the estimates
at these points. A finite-difference estimate is m̂′(zi) = [m̂(zi) − m̂(zi−1)]/[zi − zi−1].
This is plotted in Figure 9.7, along with the true derivative, which for the dgp given
in (9.17) is the quadratic m′(zi) = 6.5 − 0.30zi + 0.003zi². As expected the derivative
estimate is somewhat noisy, but it picks up the essentials. Derivative estimates should
be based on oversmoothed estimates of the conditional mean. For further details see
Pagan and Ullah (1999, chapter 4). Härdle (1990, p. 160) presents adaptation of cross-
validation to derivative estimation.
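Once the (over)smoothed values m̂(z1), . . . , m̂(zN) are in hand, the finite-difference derivative estimate is a one-liner; a sketch, with z and mhat the ordered evaluation points and fitted values:

import numpy as np

def derivative_estimate(z, mhat):
    # Finite differences: mhat'(z_i) = [mhat(z_i) - mhat(z_{i-1})] / (z_i - z_{i-1}),
    # returned for i = 2, ..., N and aligned with z[1:].
    z = np.asarray(z, dtype=float)
    mhat = np.asarray(mhat, dtype=float)
    return np.diff(mhat) / np.diff(z)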
In addition to the local derivative m′(x0) we may also be interested in the average
derivative E[m′(x)]. The average derivative estimator given in Section 9.7.4 provides
a √N-consistent and asymptotically normal estimate of E[m′(x)].
9.5.6. Conditional Moment Estimation
The kernel regression methods for the conditional mean E[y|x] = m(x) can be ex-
tended to nonparametric estimation of other conditional moments.
[Figure 9.7 appears here: derivative estimates plotted against regressor x (vertical axis: dependent variable y), with curves labeled "From Lowess (k=25)" and "From OLS Cubic Regression"; panel title "Nonparametric Derivative Estimation".]
Figure 9.7: Nonparametric derivative estimate using previously estimated Lowess regression curve, as well as using a cubic regression curve. Same generated data as Figure 9.5.
For raw conditional moments such as $\mathrm{E}[y^k|x]$ we use the weighted average
$$\widehat{\mathrm{E}}[y^k|x_0] = \sum_{i=1}^{N} w_{i0,h}\, y_i^k, \qquad (9.28)$$
where the weights $w_{i0,h}$ may be the same weights as used for estimation of $m(x_0)$.
Central conditional moments can then be computed by reexpressing them as weighted sums of raw moments. For example, since $\mathrm{V}[y|x] = \mathrm{E}[y^2|x] - (\mathrm{E}[y|x])^2$, the conditional variance can be estimated by $\widehat{\mathrm{E}}[y^2|x_0] - \hat{m}(x_0)^2$. One expects that higher order conditional moments will be estimated with more noise than will be the conditional mean.
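For example, a minimal sketch of the conditional variance estimate $\widehat{\mathrm{E}}[y^2|x_0] - \hat{m}(x_0)^2$, reusing Gaussian kernel weights (illustrative code with our own helper names):

```python
# Sketch: nonparametric conditional variance via weighted raw moments,
# reusing the kernel weights w_{i0,h}.
import numpy as np

def kernel_weights(x, x0, h):
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return k / k.sum()

def cond_variance(x, y, x0, h):
    w = kernel_weights(x, x0, h)
    m1 = w @ y            # E_hat[y | x0]
    m2 = w @ y**2         # E_hat[y^2 | x0]
    return m2 - m1**2     # V_hat[y | x0]

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 500)
y = x + (1 + 0.5 * x) * rng.normal(size=500)     # variance increasing in x
print([round(cond_variance(x, y, x0, h=0.8), 2) for x0 in (2, 5, 8)])
```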
9.5.7. Multivariate Kernel Regression
We have focused on kernel regression on a single regressor. For regression of scalar y
on k-dimensional vector x, that is, yi = m(xi ) + εi = m(x1i , . . . , xki ) + εi , the kernel
estimator of $m(x_0)$ becomes
$$\hat{m}(x_0) \equiv \frac{\frac{1}{Nh^k}\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right) y_i}{\frac{1}{Nh^k}\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)},$$
where K(·) is now a multivariate kernel. Often K(·) is the product of k one-
dimensional kernels, though multivariate kernels such as the multivariate normal den-
sity can be used.
If a product kernel is used the regressors should be transformed to a common scale
by dividing by the standard deviation. Then the cross-validation measure (9.25) can
be used to determine a common optimal bandwidth h∗
, though determining which xi
should be downweighted as the result of closeness to the end points is more compli-
cated when x is multivariate. Alternatively, regressors need not be rescaled, but then
different bandwidths should be used for each regressor.
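A minimal sketch of multivariate kernel regression with a product Gaussian kernel, rescaling each regressor by its standard deviation so that a single bandwidth h can be used (illustrative code, not from the text):

```python
# Sketch: Nadaraya-Watson regression with a product Gaussian kernel for
# k-dimensional regressors; bandwidth and dgp are illustrative choices.
import numpy as np

def nw_multivariate(X, y, x0, h):
    """Kernel regression estimate of m(x0) for k-dimensional regressors."""
    s = X.std(axis=0)
    Z = (X - x0) / (s * h)                     # scale regressors, then bandwidth h
    K = np.exp(-0.5 * Z**2).prod(axis=1)       # product of one-dimensional kernels
    return K @ y / K.sum()

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(0, 0.3, 400)
print(nw_multivariate(X, y, x0=np.array([0.5, 0.5]), h=0.4))
```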
The asymptotic results and expressions are similar to those considered before, as the
estimate is again a local average of the $y_i$. The bias $b(x_0)$ is again $O(h^2)$ as before, but the variance of $\hat{m}(x_0)$ declines at rate $O(Nh^k)$, slower than in the one-dimensional case since essentially a smaller fraction of the sample is being used to form $\hat{m}(x_0)$.
Then
$$\sqrt{Nh^k}\left(\hat{m}(x_0) - m(x_0) - b(x_0)\right) \stackrel{d}{\rightarrow} \mathcal{N}\!\left[0,\ \frac{\sigma^2_{\varepsilon}}{f(x_0)}\int K(z)^2\,dz\right].$$
The optimal bandwidth choice is $h^* = O(N^{-1/(k+4)})$, which is larger than $O(N^{-0.2})$ in the one-dimensional case. The corresponding optimal rate of convergence of $\hat{m}(x_0)$ is $N^{-2/(k+4)}$.
This result and the earlier scalar result assume that m(x) is twice differentiable, a
necessary assumption to obtain the bias term in (9.23). If m(x) is instead p times dif-
ferentiable then kernel estimation using a pth order kernel (see Section 9.3.3) reduces
the order of the bias, leading to smaller h∗
and faster rates of convergence that attain
Stone’s bound given in Section 9.4.5; see Härdle (1990, p. 93) for further details. Other
nonparametric estimators given in the next section can also attain Stone’s bound.
The convergence rate decreases as the number of regressors increases, approaching $N^0$ as the number of regressors approaches infinity. This curse of dimensionality
greatly restricts the use of nonparametric methods in regression models with several
regressors. Semiparametric models (see Section 9.7) place additional structure so that
the nonparametric components are of low dimension.
9.5.8. Tests of Parametric Models
An obvious test of correct specification of a parametric model of the conditional mean
is to compare the fitted mean with that obtained from a nonparametric model.
Let $\hat{m}_\theta(x)$ denote a parametric estimator of $\mathrm{E}[y|x]$ and $\hat{m}_h(x)$ denote a nonparametric estimator such as a kernel estimator. One approach is to compare $\hat{m}_\theta(x)$ with $\hat{m}_h(x)$ at a range of values of $x$. This is complicated by the need to correct for asymptotic bias in $\hat{m}_h(x)$ (see Härdle and Mammen, 1993). A second approach is to consider conditional moment tests of the form $N^{-1}\sum_i w_i(y_i - \hat{m}_\theta(x_i))$, where different weights, based in part on kernel regression, test failure of $\mathrm{E}[y|x] = m_\theta(x)$ in different directions. For example, Horowitz and Härdle (1994) use $w_i = \hat{m}_h(x_i) - \hat{m}_\theta(x_i)$. Pagan and Ullah (1999, pp. 141–150) and Yatchew (2003, pp. 119–124) survey some of the methods used.
9.6. Alternative Nonparametric Regression Estimators
Section 9.4 introduced local regression methods that estimate the regression function
$m(x_0)$ by a local weighted average $\hat{m}(x_0) = \sum_i w_{i0,h}\, y_i$, where the weights $w_{i0,h} = w(x_i, x_0, h)$ differ with the point of evaluation $x_0$ and the sample value of $x_i$. Section
9.5 presented detailed results when the weights are kernel weights.
Here we consider other commonly used local estimators that correspond to other
weights. Many of the results of Section 9.5 carry through, with similar optimal rates
of convergence and use of cross-validation for bandwidth selection, though the exact
expressions for bias and variance differ from those in (9.23) and (9.24). The estimators
given in Section 9.6.2 are especially popular.
9.6.1. Nearest Neighbors Estimator
The k–nearest neighbor estimator is the equally weighted average of the y values for
the k observations of xi closest to x0. Define Nk(x0) to be the set of k observations of
$x_i$ closest to $x_0$. Then
$$\hat{m}_{k\text{-}NN}(x_0) = \frac{1}{k}\sum_{i=1}^{N}\mathbf{1}(x_i \in N_k(x_0))\, y_i. \qquad (9.29)$$
This estimator is a kernel estimator with uniform weights (see Table 9.1) except that the bandwidth is variable. Here the bandwidth $h_0$ at $x_0$ equals the distance between $x_0$ and the furthest of the $k$ nearest neighbors; more formally $h_0 \simeq k/(2N f(x_0))$.
The quantity k/N is called the span. Smoother curves can be obtained by using kernel
weights in (9.29).
The estimator has the attraction of providing a simple rule for variable bandwidth
selection. It is computationally faster to use a symmetrized version that uses the k/2
nearest neighbors to the left and a similar number to the right, which is the local run-
ning average method used in Section 9.4.2. Then one can use an updating formula on
observations ordered by increasing xi , as then one observation leaves the data and one
enters as x0 increases.
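A minimal sketch of the k-nearest neighbors estimator (9.29) for scalar x (illustrative code; the choice k = 25 and the dgp are arbitrary):

```python
# Sketch: k-NN regression estimate, equally weighting the y values of the
# k observations closest to x0.
import numpy as np

def knn_regression(x, y, x0, k):
    idx = np.argsort(np.abs(x - x0))[:k]   # the set N_k(x0)
    return y[idx].mean()                   # (1/k) sum of y_i for x_i in N_k(x0)

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.2, 200)
print([round(knn_regression(x, y, x0, k=25), 3) for x0 in (2.0, 5.0, 8.0)])
```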
9.6.2. Local Linear Regression and Lowess
The kernel regression estimator is a local constant estimator because it assumes that
m(x) equals a constant in the local neighborhood of x0. Instead, one can let m(x) be
linear in the neighborhood of x0, so that m(x) = a0 + b0(x − x0) in the neighborhood
of x0.
To implement this idea, note that the kernel regression estimator $\hat{m}(x_0)$ can be obtained by minimizing $\sum_i K((x_i - x_0)/h)(y_i - m_0)^2$ with respect to $m_0$. The local
linear regression estimator minimizes
$$\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)\left(y_i - a_0 - b_0(x_i - x_0)\right)^2, \qquad (9.30)$$
with respect to $a_0$ and $b_0$, where $K(\cdot)$ is a kernel weighting function. Then $\hat{m}(x) = \hat{a}_0 + \hat{b}_0(x - x_0)$ in the neighborhood of $x_0$. The estimate at exactly $x_0$ is then $\hat{m}(x_0) = \hat{a}_0$, and $\hat{b}_0$ provides an estimate of the first derivative $\hat{m}'(x_0)$. More generally, a local polynomial estimator of degree $p$ minimizes
$$\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)\left(y_i - a_{0,0} - a_{0,1}(x_i - x_0) - \cdots - a_{0,p}\frac{(x_i - x_0)^p}{p!}\right)^2, \qquad (9.31)$$
yielding $\hat{m}^{(s)}(x_0) = \hat{a}_{0,s}$.
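Since local linear estimation is just weighted least squares at each evaluation point, it is straightforward to code. The sketch below (illustrative only, with a Gaussian kernel and our own dgp) returns both $\hat{m}(x_0)$ and the derivative estimate $\hat{b}_0$.

```python
# Sketch: local linear regression (9.30) by weighted least squares at x0;
# a0_hat estimates m(x0) and b0_hat estimates m'(x0).
import numpy as np

def local_linear(x, y, x0, h):
    K = np.exp(-0.5 * ((x - x0) / h) ** 2)          # kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])  # regressors [1, x_i - x0]
    a0, b0 = np.linalg.solve(X.T @ (K[:, None] * X), X.T @ (K * y))
    return a0, b0                                   # (m_hat(x0), m_hat'(x0))

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300)
y = 2 + 3 * x - 0.2 * x**2 + rng.normal(0, 0.5, 300)
print(local_linear(x, y, x0=5.0, h=0.8))            # true values: (12.0, 1.0)
```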
Fan and Gijbels (1996) list many properties and attractions of this method. Esti-
mation entails only weighted least-squares regression at each evaluation point x0. The
estimators can be expressed as a weighted average of yi , since they are LS estimators.
The local linear estimator has bias term $b(x_0) = h^2\,\frac{1}{2}\,m''(x_0)\int z^2K(z)\,dz$, which, unlike the bias for kernel regression given in (9.23), does not depend on $m'(x_0)$. This
is especially beneficial for overcoming the boundary problems illustrated in Section
9.4.2. For estimating an sth-order derivative a good choice of p is p = s + 1 so that,
for example, one uses a local quadratic estimator to estimate the first derivative.
A standard local regression estimator is the locally weighted scatterplot smoothing
or Lowess estimator of Cleveland (1979). This is a variant of local polynomial estima-
tion that in (9.31) uses a variable bandwidth h0,k determined by the distance from x0 to
its $k$th nearest neighbor; uses the tricubic kernel $K(z) = (70/81)(1 - |z|^3)^3\,\mathbf{1}(|z| \le 1)$; and downweights observations with large residuals $y_i - \hat{m}(x_i)$, which requires passing
through the data N times. For a summary see Fan and Gijbels (1996, p. 24). Lowess
is attractive compared to kernel regression as it uses a variable bandwidth, robustifies
against outliers, and uses a local polynomial estimator to minimize boundary prob-
lems. However, it is computationally intensive.
Another popular variation is the supersmoother of Friedman (1984) (see Härdle,
1990, p. 181). The starting point is symmetrized k–NN, using local linear fit rather than
local constant fit for better fit at the boundary. Rather than use a fixed span or fixed
k, however, the supersmoother is a variable span smoother where the variable span is
determined by local cross-validation that entails nine passes over the data. Compared
to Lowess the supersmoother does not robustify against outliers, but it permits the span
to vary and is fast to compute.
9.6.3. Smoothing Spline Estimator
The cubic smoothing spline estimator $\hat{m}_\lambda(x)$ minimizes the penalized residual sum of squares
$$\mathrm{PRSS}(\lambda) = \sum_{i=1}^{N}(y_i - m(x_i))^2 + \lambda\int (m''(x))^2\,dx, \qquad (9.32)$$
where λ is a smoothing parameter. As elsewhere in this chapter squared error loss is
used. The first term alone leads to a very rough fit since then 
m(xi ) = yi . The second
term is introduced to penalize roughness. The cross-validation methods of Section
9.5.3 can be used to determine λ, with larger values of λ leading to a smoother curve.
Härdle (1990, pp. 56–65) shows that 
mλ(x) is a cubic polynomial between succes-
sive x-values and that the estimator can be expressed as a local weighted average of
the ys and is asymptotically equivalent to a kernel estimator with a particular variable
kernel. In microeconometrics smoothing splines are used less frequently than the other
methods presented here. The approach can be adapted to other roughness penalties and
other loss functions.
9.6.4. Series Estimators
Series estimators approximate a regression function by a weighted sum of K functions
$z_1(x), \ldots, z_K(x)$,
$$\hat{m}_K(x) = \sum_{j=1}^{K}\hat{\beta}_j z_j(x), \qquad (9.33)$$
where the coefficients $\hat{\beta}_1, \ldots, \hat{\beta}_K$ are simply obtained by OLS regression of $y$ on
z1(x), . . . , zK (x). The functions z1(x), . . . , zK (x) form a truncated series. Examples
include a $(K-1)$th-order polynomial approximation or power series with $z_j(x) = x^{j-1}$, $j = 1, \ldots, K$; orthogonal and orthonormal polynomial variants (see Section 12.3.1); truncated Fourier series where the regressor is rescaled so that $x \in [0, 2\pi]$; the Fourier flexible functional form of Gallant (1981), which is a truncated Fourier series plus the terms $x$ and $x^2$; and regression splines that approximate the regres-
sion function m(x) by polynomial functions between a given number of knots that are
joined at the knots.
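As an illustration, a power-series variant of (9.33) amounts to OLS of y on the regressors $1, x, \ldots, x^{K-1}$. The following sketch is illustrative only; the choice K = 5 and the dgp are our own.

```python
# Sketch: power-series (polynomial) series estimator, i.e. OLS of y on
# z_j(x) = x^{j-1}, j = 1,...,K.
import numpy as np

def series_fit(x, y, K):
    Z = np.vander(x, N=K, increasing=True)       # columns 1, x, ..., x^{K-1}
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None) # OLS coefficients
    return lambda x_new: np.vander(x_new, N=K, increasing=True) @ beta

rng = np.random.default_rng(6)
x = rng.uniform(-2, 2, 400)
y = np.exp(x) + rng.normal(0, 0.3, 400)
m_K = series_fit(x, y, K=5)
print(m_K(np.array([-1.0, 0.0, 1.0])))           # compare with exp(-1), 1, exp(1)
```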
The approach differs from that in Section 9.4 as it is a global approximation ap-
proach to estimation of m(x), rather than a local approach to estimation of m(x0).
Nonetheless, $\hat{m}_K(x_0) \stackrel{p}{\rightarrow} m(x_0)$ if $K \rightarrow \infty$ at an appropriate rate as $N \rightarrow \infty$. From Newey (1997), if $x$ is $k$-dimensional and $m(x)$ is $p$ times differentiable the mean integrated squared error (see Section 9.5.3) is $O(K^{-2p/k} + K/N)$, where the first term reflects bias and the second term variance. Equating these gives the optimal $K^* = N^{k/(2p+k)}$, so $K$ grows but at a slower rate than the sample size. The convergence rate of $\hat{m}_{K^*}(x)$ equals the fastest possible rate of Stone (1980), given in Section 9.4.5.
Intuitively, series estimators may not be robust as outliers may have a global rather
than merely local impact on 
m(x), but this conjecture is not tested in typical examples
given in texts.
Andrews (1991) and Newey (1997) give a very general treatment that includes
the multivariate case, estimation of functionals other than the conditional mean,
and extensions to semiparametric models where series methods are most often
used.
9.7. Semiparametric Regression
The preceding analysis has emphasized regression models without any structure. In
microeconometrics some structure is usually placed on the regression model.
First, economic theory may place some structure, such as symmetry and homo-
geneity restrictions, in a demand function. Such information may be incorporated into
nonparametric regression; see, for example, Matzkin (1994).
Second, and more frequently, econometric models include so many potential regres-
sors that the curse of dimensionality makes fully nonparametric analysis impractical.
Instead, it is common to estimate a semiparametric model that loosely speaking com-
bines a parametric component with a nonparametric component; see Powell (1994) for
a careful discussion of the term semiparametric.
There are many different semiparametric models and myriad methods are often
available to consistently estimate these models. In this section we present just a few
leading examples. Applications are given elsewhere in this book, including the binary
outcome models and censored regression models given in Chapters 14 and 16.
9.7.1. Examples
Table 9.2 presents several leading examples of semiparametric regression. The first
two examples, detailed in the following, generalize the linear model $x'\beta$ by adding an unspecified component $\lambda(z)$ or by permitting an unspecified transformation $g(x'\beta)$,
whereas the third combines the first two. The next three models, used more in ap-
plied statistics than econometrics, reduce the dimensionality by assuming additivity
or separability of the regressors but are otherwise nonparametric. We detail the gen-
eralized additive model. Related to these are neural network models; see Kuan and
White (1994). The last example, also detailed in the following, is a flexible model of
the conditional variance. Care needs to be taken to ensure that semiparametric models
Table 9.2. Semiparametric Models: Leading Examples

Name                         Model                                            Parametric   Nonparametric
Partially linear             E[y|x, z] = x′β + λ(z)                           β            λ(·)
Single index                 E[y|x] = g(x′β)                                  β            g(·)
Generalized partial linear   E[y|x, z] = g(x′β + λ(z))                        β            g(·), λ(·)
Generalized additive         E[y|x] = c + Σ_{j=1}^k g_j(x_j)                  –            g_j(·)
Partial additive             E[y|x, z] = x′β + c + Σ_{j=1}^k g_j(z_j)         β            g_j(·)
Projection pursuit           E[y|x] = Σ_{j=1}^M g_j(x_j′β_j)                  β_j          g_j(·)
Heteroskedastic linear       E[y|x] = x′β;  V[y|x] = σ²(x)                    β            σ²(·)
are identified. For example, see the discussion of single-index models. In addition to
estimation of β, interest also lies in the marginal effects such as ∂E[y|x, z]/∂x.
9.7.2. Efficiency of Semiparametric Estimators
We consider loss of efficiency in estimating by semiparametric rather than parametric
methods, ahead of presenting results for several leading semiparametric models.
Our summary follows Robinson (1988b), who considers a semiparametric model
with parametric component denoted β and nonparametric component denoted G that
depends on infinitely many nuisance parameters. Examples of G include the shape of
the distribution of a symmetrically distributed iid error and the single-index function
$g(\cdot)$ given in (9.37) in Section 9.7.4. The estimator $\hat{\beta} = \beta(\hat{G})$, where $\hat{G}$ is a nonparametric estimator of $G$.
Ideally, the estimator $\hat{\beta}$ is adaptive, meaning that there is no efficiency loss in having to estimate $G$ by nonparametric methods, so that
$$\sqrt{N}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N}[0, V_G],$$
where VG is the covariance matrix for any shape function G in the particular class be-
ing considered. Within the likelihood framework VG is the Cramer–Rao lower bound.
In the second-moment context VG is given by the Gauss–Markov theorem or a gener-
alization such as to GMM. A leading example of an adaptive estimator is estimation
with specified conditional mean function but with unknown functional form for het-
eroskedasticity (see Section 9.7.6).
If the estimator $\hat{\beta}$ is not adaptive then the next best optimality property is for the estimator to attain the semiparametric efficiency bound $V^*_G$, so that
$$\sqrt{N}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N}[0, V^*_G],$$
where $V^*_G$ is a generalization of the Cramer–Rao lower bound or its second-moment analogue that provides the smallest variance matrix possible given the specified semiparametric model. For an adaptive estimator $V^*_G = V_G$, but usually $V^*_G$ exceeds $V_G$. Semiparametric efficiency bounds are introduced in Section 9.7.8. They can be
obtained only in some semiparametric settings, and even when they are known no
estimator may exist that attains the bound. An example that attains the bound is the
binary choice model estimator of Klein and Spady (1993) (see Section 14.7.4).
If the semiparametric efficiency bound is not attained or is not known, then the next best property is that $\sqrt{N}(\hat{\beta} - \beta) \stackrel{d}{\rightarrow} \mathcal{N}[0, V^{**}_G]$ for $V^{**}_G$ greater than $V^*_G$, which permits the usual statistical inference. More generally, $\sqrt{N}(\hat{\beta} - \beta) = O_p(1)$ but is not necessarily normally distributed. Finally, consistent but less than $\sqrt{N}$-consistent estimators have the property that $N^r(\hat{\beta} - \beta) = O_p(1)$, where $r < 0.5$. Often asymptotic normality cannot be established. This often arises when the parametric and nonparametric
parts are treated equally, so that maximization occurs jointly over β and G. There are
many examples, particularly in discrete and truncated choice models.
Despite their potential inefficiency, semiparametric estimators are attractive because
they can retain consistency in settings where a fully parametric estimator is inconsis-
tent. Powell (1994, p. 2513) presents a table that summarizes the existence of consis-
tent and
√
N-consistent asymptotic normal estimators for a range of semiparametric
models.
9.7.3. Partially Linear Model
The partially linear model specifies the conditional mean to be the usual linear re-
gression function plus an unspecified nonlinear component, so
$$\mathrm{E}[y|x, z] = x'\beta + \lambda(z), \qquad (9.34)$$
where the scalar function λ(·) is unspecified.
An example is the estimation of a demand function for electricity, where z reflects
time-of-day or weather indicators such as temperature. A second example is the sample
selection model given in Section 16.5. Ignoring λ(z) leads to inconsistent β owing to
omitted variables bias, unless Cov[x, λ(z)] = 0. In applications interest may lie in β,
λ(z) or both. Fully nonparametric estimation of E[y|x, z] is possible but leads to less
than $\sqrt{N}$-consistent estimation of $\beta$.
Robinson Difference Estimator
Instead, Robinson (1988a) proposed the following method. The regression model
implies
$$y = x'\beta + \lambda(z) + u,$$
where the error $u = y - \mathrm{E}[y|x, z]$. This in turn implies
$$\mathrm{E}[y|z] = \mathrm{E}[x|z]'\beta + \lambda(z),$$
since $\mathrm{E}[u|x, z] = 0$ implies $\mathrm{E}[u|z] = 0$. Subtracting the two equations yields
$$y - \mathrm{E}[y|z] = (x - \mathrm{E}[x|z])'\beta + u. \qquad (9.35)$$
The conditional moments in (9.35) are unknown, but they can be replaced by nonpara-
metric estimates.
Thus Robinson proposed OLS estimation of the regression
$$y_i - \hat{m}_{yi} = (x_i - \hat{m}_{xi})'\beta + v, \qquad (9.36)$$
where $\hat{m}_{yi}$ and $\hat{m}_{xi}$ are predictions from nonparametric regression of, respectively, $y_i$ and $x_i$ on $z_i$. Given independence over $i$, the OLS estimator of $\beta$ in (9.36) is $\sqrt{N}$-consistent and asymptotically normal with
$$\sqrt{N}(\hat{\beta}_{PL} - \beta) \stackrel{d}{\rightarrow} \mathcal{N}\!\left[0,\ \sigma^2\left(\mathrm{plim}\,\frac{1}{N}\sum_{i=1}^{N}(x_i - \mathrm{E}[x_i|z_i])(x_i - \mathrm{E}[x_i|z_i])'\right)^{-1}\right],$$
assuming $u_i$ is iid $[0, \sigma^2]$. Not specifying $\lambda(z)$ generally leads to an efficiency loss,
though there is no loss if $\mathrm{E}[x|z]$ is linear in $z$. To estimate $\mathrm{V}[\hat{\beta}_{PL}]$ simply replace $(x_i - \mathrm{E}[x_i|z_i])$ by $(x_i - \hat{m}_{xi})$. The asymptotic result generalizes to heteroskedastic errors, in which case one just uses the usual Eicker–White standard errors from the OLS regression (9.36). Since $\lambda(z) = \mathrm{E}[y|z] - \mathrm{E}[x|z]'\beta$, it can be consistently estimated by $\hat{\lambda}(z_i) = \hat{m}_{yi} - \hat{m}_{xi}'\hat{\beta}$.
A variety of nonparametric estimators $\hat{m}_{yi}$ and $\hat{m}_{xi}$ can be used. Robinson (1988a) used kernel estimates that require convergence at rate no slower than $N^{-1/4}$, so that
oversmoothing or higher order kernels are needed if the dimension of z is large; see
Pagan and Ullah (1999, p. 205). Note also that the kernel estimators may be trimmed
(see Section 9.5.3).
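A minimal sketch of the Robinson difference estimator for scalar z, using simple (not leave-one-out) kernel regressions and an arbitrary bandwidth, purely for illustration:

```python
# Sketch: Robinson's difference estimator for y = x'beta + lambda(z) + u.
import numpy as np

def nw(z, t, z0, h):
    k = np.exp(-0.5 * ((z - z0) / h) ** 2)
    return k @ t / k.sum()

rng = np.random.default_rng(7)
N = 500
z = rng.uniform(0, 3, N)
x = np.column_stack([np.sin(z) + rng.normal(size=N), rng.normal(size=N)])
beta = np.array([1.0, -0.5])
y = x @ beta + np.exp(z) + rng.normal(0, 0.5, N)        # lambda(z) = exp(z)

h = 0.3
m_y = np.array([nw(z, y, zi, h) for zi in z])           # E_hat[y | z_i]
m_x = np.column_stack([[nw(z, x[:, j], zi, h) for zi in z] for j in range(2)])
ytil, xtil = y - m_y, x - m_x
beta_pl = np.linalg.lstsq(xtil, ytil, rcond=None)[0]    # OLS on differenced data
print(beta_pl)                                          # close to (1.0, -0.5)
```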
Other Estimators
Several other methods lead to
√
N-consistent estimates of β in the partially linear
model. Speckman (1988) also used kernels. Engle et al. (1986) used a generalization
of the cubic smoothing spline estimator. Andrews (1991) presented regression of y on
x and a series approximation for λ(z) given in Section 9.6.4. Yatchew (1997) presents
a simple differencing estimator.
9.7.4. Single-Index Models
A single-index model specifies the conditional mean to be an unknown scalar function
of a linear combination of the regressors, with
$$\mathrm{E}[y|x] = g(x'\beta), \qquad (9.37)$$
where the scalar function g(·) is unspecified. The advantages of single-index models
have been presented in Section 5.2.4. Here the function g(·) is obtained from the data,
whereas previous examples specified, for example, $\mathrm{E}[y|x] = \exp(x'\beta)$.
Identification
Ichimura (1993) presents identification conditions for the single-index model. For
unknown function g(·) the single-index model β is only identified up to location and
scale. To see this note that for scalar $v$ the function $g^*(a + bv)$ can always be expressed as $g(v)$, so the function $g^*(a + bx'\beta)$ is equivalent to $g(x'\beta)$. Additionally, $g(\cdot)$ must
be differentiable. In the simplest case all regressors are continuous. If instead some
regressors are discrete, then at least one regressor must be continuous and if g(·) is
monotonic then bounds can be obtained for β.
Average Derivative Estimator
For continuous regressors, Stoker (1986) observed that if the conditional mean is single
index then the vector of average derivatives of the conditional mean determines β up
to scale, since for $m(x_i) = g(x_i'\beta)$
$$\delta \equiv \mathrm{E}\!\left[\frac{\partial m(x)}{\partial x}\right] = \mathrm{E}[g'(x'\beta)]\beta, \qquad (9.38)$$
and $\mathrm{E}[g'(x_i'\beta)]$ is a scalar. Furthermore, by the generalized information matrix equality given in Section 5.6.3, for any function $h(x)$, $\mathrm{E}[\partial h(x)/\partial x] = -\mathrm{E}[h(x)s(x)]$, where $s(x) = \partial\ln f(x)/\partial x = f'(x)/f(x)$ and $f(x)$ is the density of $x$. Thus
$$\delta = -\mathrm{E}[m(x)s(x)] = -\mathrm{E}[\mathrm{E}[y|x]s(x)]. \qquad (9.39)$$
It follows that $\delta$, and hence $\beta$ up to scale, can be estimated by the average derivative (AD) estimator
$$\hat{\delta}_{AD} = -\frac{1}{N}\sum_{i=1}^{N} y_i\,\hat{s}(x_i), \qquad (9.40)$$
where $\hat{s}(x_i) = \hat{f}'(x_i)/\hat{f}(x_i)$ can be obtained by kernel estimation of the density of $x_i$ and its first derivative. The estimator $\hat{\delta}$ is $\sqrt{N}$-consistent and its asymptotic normal distribution was derived by Härdle and Stoker (1989). The function $g(\cdot)$ can be estimated by nonparametric regression of $y_i$ on $x_i'\hat{\delta}$. Note that $\hat{\delta}_{AD}$ provides an estimate of $\mathrm{E}[m'(x)]$ regardless of whether a single-index model is relevant.
A weakness of $\hat{\delta}_{AD}$ is that $\hat{s}(x_i)$ can be very large if $\hat{f}(x_i)$ is small. One possibility is to trim when $\hat{f}(x_i)$ is small. Powell, Stock, and Stoker (1989) instead observed that the result (9.38) extends to weighted derivatives with $\delta \equiv \mathrm{E}[w(x)m'(x)]$. Especially convenient is to choose $w(x) = f(x)$, which yields the density weighted average derivative (DWAD) estimator
$$\hat{\delta}_{DWAD} = -\frac{1}{N}\sum_{i=1}^{N} y_i\,\hat{f}'(x_i), \qquad (9.41)$$
which no longer divides by $\hat{f}(x_i)$. This yields a $\sqrt{N}$-consistent and asymptotically normal estimate of $\beta$ up to scale. For example, if the first component of $\beta$ is normalized to one then $\hat{\beta}_1 = 1$ and $\hat{\beta}_j = \hat{\delta}_j/\hat{\delta}_1$ for $j > 1$.
These methods require continuous regressors so that the derivatives exist. Horowitz
and Härdle (1996) present extension to discrete regressors.
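A minimal sketch of the DWAD estimator (9.41) with a product Gaussian kernel; the bandwidth, dgp, and helper names are our own illustrative choices, not code from the text.

```python
# Sketch: density-weighted average derivative estimator.
import numpy as np

def density_gradient(X, x0, h):
    """Kernel estimate of the density gradient f'(x0), product Gaussian kernel."""
    N, k = X.shape
    Z = (X - x0) / h
    Kprod = np.exp(-0.5 * (Z**2).sum(axis=1)) / (2 * np.pi) ** (k / 2)
    # gradient of (N h^k)^{-1} sum_i K((x_i - x0)/h) with respect to x0
    return (Z * Kprod[:, None]).sum(axis=0) / (N * h ** (k + 1))

rng = np.random.default_rng(8)
N = 1000
X = rng.normal(size=(N, 2))
beta = np.array([1.0, 2.0])
index = X @ beta
y = np.exp(index) / (1 + np.exp(index)) + 0.1 * rng.normal(size=N)  # single index

h = 0.5
fprime = np.array([density_gradient(X, X[i], h) for i in range(N)])
delta = -(y[:, None] * fprime).mean(axis=0)        # delta_hat_DWAD
print(delta / delta[0])                            # roughly proportional to (1, 2)
```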
Semiparametric Least Squares
An alternative estimator of the single-index model was proposed by Ichimura (1993).
Begin by assuming that g(·) is known, in which case the WLS estimator of β
minimizes
$$S_N(\beta) = \frac{1}{N}\sum_{i=1}^{N} w_i(x)\left(y_i - g(x_i'\beta)\right)^2.$$
For unknown $g(\cdot)$ Ichimura proposed replacing $g(x_i'\beta)$ by a nonparametric estimate $\hat{g}(x_i'\beta)$, leading to the weighted semiparametric least-squares (WSLS) estimator $\hat{\beta}_{WSLS}$ that minimizes
$$Q_N(\beta) = \frac{1}{N}\sum_{i=1}^{N}\pi(x_i)\,w_i(x)\left(y_i - \hat{g}(x_i'\beta)\right)^2,$$
where $\pi(x_i)$ is a trimming function that drops observations if the kernel regression estimate of the scalar $x_i'\beta$ is small, and $\hat{g}(x_i'\beta)$ is a leave-one-out kernel estimator from regression of $y_i$ on $x_i'\beta$. This is a $\sqrt{N}$-consistent and asymptotically normal estimate of $\beta$ up to scale that is generally more efficient than the DWAD estimator. For heteroskedastic data the most efficient estimator is the analogue of feasible GLS that uses estimated weight function $\hat{w}_i(x) = 1/\hat{\sigma}^2_i$, where $\hat{\sigma}^2_i$ is the kernel estimate given in (9.43) of Section 9.7.6 and where $\hat{u}_i = y_i - \hat{g}(x_i'\hat{\beta})$ and $\hat{\beta}$ is obtained from initial minimization of $Q_N(\beta)$ with $w_i(x) = 1$.
The WSLS estimator is computed by iterative methods. Begin with an initial estimator $\hat{\beta}^{(1)}$, such as the DWAD estimator with first component normalized to one. Form the kernel estimate $\hat{g}(x_i'\hat{\beta}^{(1)})$ and hence $Q_N(\hat{\beta}^{(1)})$, perturb $\hat{\beta}^{(1)}$ to obtain the gradient $g_N(\hat{\beta}^{(1)}) = \partial Q_N(\beta)/\partial\beta|_{\hat{\beta}^{(1)}}$ and hence an update $\hat{\beta}^{(2)} = \hat{\beta}^{(1)} + A_N g_N(\hat{\beta}^{(1)})$, and so on. This estimator is considerably more difficult to calculate than the DWAD estimator, especially as $Q_N(\beta)$ can be nonconvex and multimodal.
9.7.5. Generalized Additive Models
Generalized additive models specify $\mathrm{E}[y|x] = g_1(x_1) + \cdots + g_k(x_k)$, a specialization of the fully nonparametric model $\mathrm{E}[y|x] = g(x_1, \ldots, x_k)$. This specialization results in the estimated subfunctions $\hat{g}_j(x_j)$ converging at the rate for a one-dimensional nonparametric regression rather than the slower rate of a $k$-dimensional nonparametric regression.
A well-developed methodology exists for estimating such models (see Hastie and
Tibshirani, 1990). This is automated in some statistical packages such as S-Plus. Plots of the estimated subfunctions $\hat{g}_j(x_j)$ against $x_j$ trace out the marginal effects of $x_j$ on
E[y|x], so the additive model can provide a useful tool for exploratory data analy-
sis. The model sees little use in microeconometrics in part because many applications
such as censoring, truncation, and discrete outcomes lead naturally to single-index and
partially linear models.
9.7.6. Heteroskedastic Linear Model
The heteroskedastic linear model specifies
$$\mathrm{E}[y|x] = x'\beta, \qquad \mathrm{V}[y|x] = \sigma^2(x),$$
where the variance function $\sigma^2(\cdot)$ is unspecified.
The assumption that errors are heteroskedastic is the standard cross-section data assumption in modern microeconometrics. One can obtain consistent but inefficient estimates of $\beta$ by doing OLS and using the Eicker–White heteroskedasticity-consistent estimate of the variance matrix of the OLS estimator. Cragg (1983) and Amemiya (1983) proposed an IV estimator that is more efficient than OLS but still not fully efficient. Feasible GLS provides a fully efficient second-moment estimator but is not attractive as it requires specification of a functional form for $\sigma^2(x)$, such as $\sigma^2(x) = \exp(x'\gamma)$.
Robinson (1987) proposed a variant of FGLS using a nonparametric estimator of $\sigma^2_i = \sigma^2(x_i)$. Then
$$\hat{\beta}_{HLM} = \left(\sum_{i=1}^{N}\hat{\sigma}^{-2}_i x_i x_i'\right)^{-1}\left(\sum_{i=1}^{N}\hat{\sigma}^{-2}_i x_i y_i\right), \qquad (9.42)$$
where Robinson (1987) used a $k$-NN estimator of $\sigma^2_i$ with uniform weight, so
$$\hat{\sigma}^2_i = \frac{1}{k}\sum_{j=1}^{N}\mathbf{1}(x_j \in N_k(x_i))\,\hat{u}^2_j, \qquad (9.43)$$
where $\hat{u}_i = y_i - x_i'\hat{\beta}_{OLS}$ is the residual from first-stage OLS regression of $y_i$ on $x_i$ and $N_k(x_i)$ is the set of $k$ observations of $x_j$ closest to $x_i$ in weighted Euclidean norm. Then
$$\sqrt{N}(\hat{\beta}_{HLM} - \beta) \stackrel{d}{\rightarrow} \mathcal{N}\!\left[0,\ \left(\mathrm{plim}\,\frac{1}{N}\sum_{i=1}^{N}\sigma^{-2}(x_i)x_i x_i'\right)^{-1}\right],$$
assuming $u_i$ is iid $[0, \sigma^2(x_i)]$. This estimator is adaptive as it attains the Gauss–Markov bound and so is as efficient as the GLS estimator when $\sigma^2_i$ is known. The variance matrix is consistently estimated by $\left[N^{-1}\sum_i\hat{\sigma}^{-2}_i x_i x_i'\right]^{-1}$.
In principle other nonparametric estimators of σ2
(xi ) might be used, but Carroll
(1982) and others originally proposed use of a kernel estimator of σ2
i and found that
proof of efficiency was possible only under very restrictive assumptions on xi . The
Robinson method extends to models with nonlinear mean function.
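A minimal sketch of this FGLS procedure, using a k-NN variance estimate based on simple distance in the single nonconstant regressor; the choices of k and dgp are illustrative only.

```python
# Sketch: FGLS with a k-NN nonparametric variance estimate (cf. (9.42)-(9.43)).
import numpy as np

rng = np.random.default_rng(9)
N, k = 800, 20
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 2.0])
u = np.abs(X[:, 1]) * rng.normal(size=N)            # heteroskedastic errors
y = X @ beta + u

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]        # first-stage OLS
uhat2 = (y - X @ b_ols) ** 2

# k-NN estimate of sigma^2_i: mean squared residual over the k nearest x_j
order = np.argsort(np.abs(X[:, 1][:, None] - X[:, 1][None, :]), axis=1)[:, :k]
sig2 = uhat2[order].mean(axis=1)

W = 1.0 / sig2                                      # weights sigma_hat_i^{-2}
A = (X * W[:, None]).T @ X
b_hlm = np.linalg.solve(A, (X * W[:, None]).T @ y)
se = np.sqrt(np.diag(np.linalg.inv(A)))             # standard errors from [sum w_i x_i x_i']^{-1}
print(b_hlm, se)
```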
9.7.7. Seminonparametric MLE
Suppose yi is iid with specified density f (yi |xi , β). In general, misspecification of the
density leads to inconsistent parameter estimates. Gallant and Nychka (1987) proposed
approximating the unknown true density by a power-series expansion around the den-
sity f (y|x, β). To ensure a positive density they actually use a squared power-series
expansion around f (y|x, β), yielding
$$h_p(y|x, \beta, \alpha) = \frac{(p(y|\alpha))^2 f(y|x, \beta)}{\int (p(z|\alpha))^2 f(z|x, \beta)\,dz}, \qquad (9.44)$$
where $p(y|\alpha)$ is a $p$th-order polynomial in $y$, $\alpha$ is the vector of coefficients of the polynomial, and division by the denominator ensures that probabilities integrate or sum to one. The estimators of $\beta$ and $\alpha$ maximize the log-likelihood $\sum_{i=1}^{N}\ln h_p(y_i|x_i, \beta, \alpha)$.
The approach generalizes immediately to multivariate yi . The estimator is called the
seminonparametric maximum likelihood estimator because it is a nonparametric
estimator that can be estimated in the same way as a maximum likelihood estimator.
Gallant and Nychka (1987) showed that under fairly general conditions the estimator
yields consistent estimates of the density if the order p of the polynomial increases
with sample size N at an appropriate rate.
This result provides a strong basis for using (9.44) to obtain a class of flexible dis-
tributions for any particular data. The method is particularly simple if the polynomial
series p(y|α) is the orthogonal or orthonormal polynomial series (see Section 12.3.1)
for the baseline density f (y|x, β), as then the normalizing factor in the denominator
can be simply constructed. The order of the polynomial can be chosen using infor-
mation criteria, with measures that penalize model complexity more than AIC used in
practice. Regular ML statistical inference is possible if one ignores the data-dependent
selection of the polynomial order and assumes that the resulting density hp(y|x, β, α)
is correctly specified. An example of this approach for count data regression is given
in Cameron and Johansson (1997).
9.7.8. Semiparametric Efficiency Bounds
Semiparametric efficiency bounds extend efficiency bounds such as Cramer–Rao or
the Gauss–Markov theorem to cases where the dgp has a nonparametric component.
The best semiparametric methods achieve this efficiency bound.
We use β to denote parameters we wish to estimate, which may include variance
components such as σ2
, and η to denote nuisance parameters. For simplicity we con-
sider ML estimation with a nonparametric component.
We begin with the fully parametric case. The MLE $(\hat{\beta}, \hat{\eta})$ maximizes the log-likelihood $\mathcal{L}(\beta, \eta) = \ln L(\beta, \eta)$. Let $\theta = (\beta, \eta)$ and let $I_{\theta\theta}$ be the information matrix defined in (5.43). Then $\sqrt{N}(\hat{\theta} - \theta) \stackrel{d}{\rightarrow} \mathcal{N}[0, I^{-1}_{\theta\theta}]$. For $\sqrt{N}(\hat{\beta} - \beta)$, partitioned inversion of $I_{\theta\theta}$ leads to
$$V^* = (I_{\beta\beta} - I_{\beta\eta}I^{-1}_{\eta\eta}I_{\eta\beta})^{-1} \qquad (9.45)$$
as the efficiency bound for estimation of $\beta$ when $\eta$ is unknown. There is an efficiency loss when $\eta$ is unknown, unless the information matrix is block diagonal so that $I_{\beta\eta} = 0$ and the variance reduces to $I^{-1}_{\beta\beta}$.
Now consider extension to the nonparametric case. Suppose we have a paramet-
ric submodel, say L0(β), that involves β alone. Consider the family of all possible
parametric models L(β, η) that nest L0(β) for some value of η. The semiparametric
efficiency bound is the largest value of V∗
given in (9.45) over all possible parametric
models L(β, η), but this is difficult to obtain.
Simplification is possible by considering
$$\tilde{s}_\beta = s_\beta - \mathrm{E}[s_\beta|s_\eta],$$
where $s_\theta$ denotes the score $\partial\mathcal{L}/\partial\theta$, and $\tilde{s}_\beta$ is the score for $\beta$ after concentrating out $\eta$. For finite-dimensional $\eta$ it can be shown that $\left(\mathrm{E}[N^{-1}\tilde{s}_\beta\tilde{s}_\beta']\right)^{-1} = V^*$. Here $\eta$ is instead infinite dimensional. Assume iid data and let $s_{\theta i}$ denote the $i$th component in the sum that leads to the score $s_\theta$. Begun et al. (1983) define the tangent set to be the set of all linear combinations of $s_{\eta i}$. When this tangent set is linear and closed the largest value of $V^*$ in (9.45) equals
$$\Omega = \left(\mathrm{plim}\; N^{-1}\sum_i\tilde{s}_{\beta i}\tilde{s}_{\beta i}'\right)^{-1} = \left(\mathrm{E}[\tilde{s}_{\beta i}\tilde{s}_{\beta i}']\right)^{-1}.$$
The matrix $\Omega$ is then the semiparametric efficiency bound.
The matrix Ω is then the semiparametric efficiency bound.
In applications one first obtains $s_\eta = \sum_i s_{\eta i}$. Then obtain $\mathrm{E}[s_{\beta i}|s_{\eta i}]$, which may entail assumptions such as symmetry of errors that place restrictions on the class of semiparametric models being considered. This yields $\tilde{s}_{\beta i}$ and hence $\Omega$. For more de-
tails and applications see Newey (1990b), Pagan and Ullah (1999), and Severini and
Tripathi (2001).
9.8. Derivations of Mean and Variance of Kernel Estimators
Nonparametric estimation entails a balance between smoothness (variance) and bias
(mean). Here we derive the mean and variance of kernel density and kernel regression
estimators. The derivations follow those of M. J. Lee (1996).
9.8.1. Mean and Variance of Kernel Density Estimator
Since $x_i$ are iid each term in the summation has the same expected value and
$$\mathrm{E}[\hat{f}(x_0)] = \mathrm{E}\left[\frac{1}{h}K\!\left(\frac{x - x_0}{h}\right)\right] = \int\frac{1}{h}K\!\left(\frac{x - x_0}{h}\right) f(x)\,dx.$$
By change of variable to $z = (x - x_0)/h$ so that $x = x_0 + hz$ and $dx/dz = h$ we obtain
$$\mathrm{E}[\hat{f}(x_0)] = \int K(z)\, f(x_0 + hz)\,dz.$$
A second-order Taylor series expansion of $f(x_0 + hz)$ around $f(x_0)$ yields
$$\mathrm{E}[\hat{f}(x_0)] = \int K(z)\{ f(x_0) + f'(x_0)hz + \tfrac{1}{2} f''(x_0)(hz)^2\}\,dz
= f(x_0)\int K(z)\,dz + h f'(x_0)\int zK(z)\,dz + \tfrac{1}{2}h^2 f''(x_0)\int z^2K(z)\,dz.$$
Since the kernel $K(z)$ integrates to unity this simplifies to
$$\mathrm{E}[\hat{f}(x_0)] - f(x_0) = h f'(x_0)\int zK(z)\,dz + \frac{1}{2}h^2 f''(x_0)\int z^2K(z)\,dz.$$
If additionally the kernel satisfies $\int zK(z)\,dz = 0$, assumed in condition (ii) in Section 9.3.3, and second derivatives of $f$ are bounded, then the first term on the right-hand side disappears, yielding $\mathrm{E}[\hat{f}(x_0)] - f(x_0) = b(x_0)$, where $b(x_0)$ is defined in (9.4).
To obtain the variance of $\hat{f}(x_0)$, begin by noting that if $y_i$ are iid then $\mathrm{V}[\bar{y}] = N^{-1}\mathrm{V}[y] = N^{-1}\mathrm{E}[y^2] - N^{-1}(\mathrm{E}[y])^2$. Thus
$$\mathrm{V}[\hat{f}(x_0)] = \frac{1}{N}\mathrm{E}\left[\left(\frac{1}{h}K\!\left(\frac{x - x_0}{h}\right)\right)^2\right] - \frac{1}{N}\left(\mathrm{E}\left[\frac{1}{h}K\!\left(\frac{x - x_0}{h}\right)\right]\right)^2.$$
Now by change of variables and first-order Taylor series expansion
$$\mathrm{E}\left[\left(\frac{1}{h}K\!\left(\frac{x - x_0}{h}\right)\right)^2\right] = \int\frac{1}{h}K(z)^2\{ f(x_0) + f'(x_0)hz\}\,dz
= \frac{1}{h} f(x_0)\int K(z)^2\,dz + f'(x_0)\int zK(z)^2\,dz.$$
It follows that
$$\mathrm{V}[\hat{f}(x_0)] = \frac{1}{Nh} f(x_0)\int K(z)^2\,dz + \frac{1}{N} f'(x_0)\int zK(z)^2\,dz - \frac{1}{N}\left[ f(x_0) + \frac{h^2}{2} f''(x_0)\int z^2K(z)\,dz\right]^2.$$
For $h \rightarrow 0$ and $N \rightarrow \infty$ this is dominated by the first term, leading to Equation (9.5).
9.8.2. Distribution of Kernel Regression Estimator
We obtain the distribution for regressors xi that are iid with density f (x). From Section
9.5.1 the kernel estimator is a weighted average $\hat{m}(x_0) = \sum_i w_{i0,h}\, y_i$, where the kernel weights $w_{i0,h}$ are given in (9.22). Since the weights sum to unity we have $\hat{m}(x_0) - m(x_0) = \sum_i w_{i0,h}(y_i - m(x_0))$. Substituting (9.15) for $y_i$, and normalizing by $\sqrt{Nh}$ as in the kernel density estimator case we have
$$\sqrt{Nh}\,(\hat{m}(x_0) - m(x_0)) = \sqrt{Nh}\sum_{i=1}^{N} w_{i0,h}(m(x_i) - m(x_0) + \varepsilon_i). \qquad (9.46)$$
One approach to obtaining the limit distribution of (9.46) is to take a second-order
Taylor series expansion of m(xi ) around x0. This approach is not always taken be-
cause the weights wi0,h are complicated by the normalization that they sum to one (see
(9.22)).
Instead, we take the approach of Lee (1996, pp. 148–151) following Bierens (1987,
pp. 106–108). Note that the denominator of the weight function is the kernel estimate
of the density of $x_0$, since $\hat{f}(x_0) = (Nh)^{-1}\sum_i K((x_i - x_0)/h)$. Then (9.46) yields
$$\sqrt{Nh}\,(\hat{m}(x_0) - m(x_0)) = \left.\frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)(m(x_i) - m(x_0) + \varepsilon_i)\ \right/\ \hat{f}(x_0). \qquad (9.47)$$
We apply the Transformation Theorem (Theorem A.12) to (9.47), using $\hat{f}(x_0) \stackrel{p}{\rightarrow} f(x_0)$ for the denominator, while several steps are needed to obtain a limit normal distribution for the numerator:
$$\frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)(m(x_i) - m(x_0) + \varepsilon_i) \qquad (9.48)$$
$$= \frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)(m(x_i) - m(x_0)) + \frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)\varepsilon_i.$$
Consider the first sum in (9.48); if a law of large numbers can be applied it converges in probability to its mean
$$\mathrm{E}\left[\frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)(m(x_i) - m(x_0))\right] \qquad (9.49)$$
$$= \frac{\sqrt{N}}{\sqrt{h}}\int K\!\left(\frac{x - x_0}{h}\right)(m(x) - m(x_0))\, f(x)\,dx$$
$$= \sqrt{Nh}\int K(z)(m(x_0 + hz) - m(x_0))\, f(x_0 + hz)\,dz$$
$$= \sqrt{Nh}\int K(z)\left[hz\,m'(x_0) + \frac{1}{2}h^2z^2 m''(x_0)\right]\left[ f(x_0) + hz\, f'(x_0)\right]dz$$
$$= \sqrt{Nh}\left[\int K(z)h^2z^2 m'(x_0) f'(x_0)\,dz + \int K(z)\frac{1}{2}h^2z^2 m''(x_0) f(x_0)\,dz\right]$$
$$= \sqrt{Nh}\,h^2\left[m'(x_0) f'(x_0) + \frac{1}{2}m''(x_0) f(x_0)\right]\int z^2K(z)\,dz$$
$$= \sqrt{Nh}\, f(x_0)\,b(x_0),$$
where b(x0) is defined in (9.23). The first equality uses xi iid; the second equality is
change of variables to z = (x − x0)/h; the third equality applies a second-order Taylor
series expansion to m(x0 + hz) and a first-order Taylor series expansion to f (x0 + hz);
the fourth equality follows because upon expanding the product to four terms, the two
terms given dominate the others (see, e.g., Lee, 1996, p. 150).
Now consider the second sum in (9.48); the terms in the sum clearly have mean zero, and the variance of each term, dropping subscript $i$, is
$$\mathrm{V}\left[K\!\left(\frac{x - x_0}{h}\right)\varepsilon\right] = \mathrm{E}\left[K^2\!\left(\frac{x - x_0}{h}\right)\varepsilon^2\right] \qquad (9.50)$$
$$= \int K^2\!\left(\frac{x - x_0}{h}\right)\mathrm{V}[\varepsilon|x]\, f(x)\,dx$$
$$= h\int K^2(z)\,\mathrm{V}[\varepsilon|x_0 + hz]\, f(x_0 + hz)\,dz$$
$$= h\,\mathrm{V}[\varepsilon|x_0]\, f(x_0)\int K^2(z)\,dz,$$
by change of variables to $z = (x - x_0)/h$ with $dx = h\,dz$ in the third-line term, and letting $h \rightarrow 0$ to get the last line. It follows upon applying a central limit theorem that
$$\frac{1}{\sqrt{Nh}}\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)\varepsilon_i \stackrel{d}{\rightarrow} \mathcal{N}\!\left[0,\ \mathrm{V}[\varepsilon|x_0]\, f(x_0)\int K^2(z)\,dz\right]. \qquad (9.51)$$
Combining (9.49) and (9.51), we have that $\sqrt{Nh}\,(\hat{m}(x_0) - m(x_0))$ defined in (9.47) converges to $1/f(x_0)$ times $\mathcal{N}\!\left[\sqrt{Nh}\, f(x_0)b(x_0),\ \mathrm{V}[\varepsilon|x_0]\, f(x_0)\int K^2(z)\,dz\right]$. Division of the mean by $f(x_0)$ and the variance by $f(x_0)^2$ leads to the limit distribution given in (9.24).
9.9. Practical Considerations
All-purpose regression packages increasingly offer adequate methods for univariate
nonparametric density estimation and regression. The programming language XPlore
emphasizes nonparametric and graphical methods; details on many of the methods are
provided at its Web site.
Nonparametric univariate density estimation is straightforward, using a kernel den-
sity estimate based on a kernel such as the Gaussian or Epanechnikov. Easily computed
plug-in estimates for the bandwidth provide a useful starting point that one may then,
say, halve or double to see if there is an improvement.
Nonparametric univariate regression is also straightforward, aside from bandwidth
selection. If relatively unbiased estimates of the regression function at the end points
are desired, then local linear regression or Lowess estimates are better than kernel
regression. Plug-in estimates for the bandwidth are more difficult to obtain and cross-
validation is instead used (see Section 9.5.3) along with eyeballing the scatterplot with
a fitted line. The degree of desired smoothness can vary with application. For nonpara-
metric multivariate regression such eyeballing may be impossible.
Semiparametric regression is more complicated. It can entail subtleties such as trim-
ming and undersmoothing the nonparametric component since typically estimation
of the parametric component involves averaging the nonparametric component. For
such purposes one generally uses specialized code written in languages such as Gauss,
Matlab, Splus, or XPlore. For the nonparametric estimation component considerable
computational savings can be obtained through use of fast computing algorithms such
as binning and updating; see, for example, Fan and Gijbels (1996) and Härdle and
Linton (1994).
All methods require at some stage specification of a bandwidth or window width.
Different choices lead to different estimates in finite samples, and the differences can
be quite large as illustrated in many of the figures in this chapter. By contrast, within
a fully parametric framework different researchers estimating the same model by ML
will all obtain the same parameter estimates. This indeterminacy is a drawback of nonparametric methods, though the hope is that in semiparametric methods at least the spillover effects to the parametric component of the model may be small.
9.10. Bibliographic Notes
Nonparametric estimation is well presented in many statistics texts, including Fan and Gijbels
(1996). Ruppert, Wand, and Carroll (2003) present application of many semiparametric meth-
ods. The econometrics books by Härdle (1990), M. J. Lee (1996), Horowitz (1998b), Pagan and
Ullah (1999), and Yatchew (2003) cover both nonparametric and semiparametric estimation.
Pagan and Ullah (1999) is particularly comprehensive. Yatchew (2003) is oriented to the ap-
plied econometrician. He emphasizes the partial linear and single-index models and practical
aspects of their implementation such as computation of confidence intervals.
9.3 Key early references for kernel density estimation are Rosenblatt (1956) and Parzen (1962).
Silverman (1986) is a classic book on nonparametric density estimation.
9.4 A quite general statement of optimal rates of convergence for nonparametric estimators is
given in Stone (1980).
9.5 Kernel regression estimation was proposed by Nadaraya (1964) and Watson (1964). A
very helpful and relatively simple survey of kernel and nearest-neighbors regression is by
Altman (1992). There are many other surveys in the statistics literature. Härdle (1990, chap-
ter 5) has a lengthy discussion of bandwidth choice and confidence intervals.
9.6 Many approaches to nonparametric local regression are contained in Stone (1977). For
series estimators see Andrews (1991) and Newey (1997).
9.6 For semiparametric efficiency bounds see the survey by Newey (1990b) and the more recent
paper by Severini and Tripathi (2001). An early econometrics application was given by
Chamberlain (1987).
9.7 The econometrics literature focuses on semiparametric regression. Survey papers include
those by Powell (1994), Robinson (1988b), and, at a more introductory level, Yatchew
(1998). Additional references are given elsewhere in this book, notably in Sections 14.7,
15.11, 16.9, 20.5, and 23.8. The applied study by Bellemare, Melenberg, and Van Soest
(2002) illustrates several semiparametric methods.
Exercises
9–1 Suppose we obtain a kernel density estimate using the uniform kernel (see
Table 9.1) with h = 1 and a sample of size N = 100. Suppose in fact the data
x ∼ N[0, 1].
(a) Calculate the bias of the kernel density estimate at x0 = 1 using (9.4).
(b) Is the bias large relative to the true value φ(1), where φ(·) is the standard
normal pdf?
(c) Calculate the variance of the kernel density estimate at x0 = 1 using (9.5).
(d) Which is making a bigger contribution to MSE at x0 = 1, variance or bias
squared?
(e) Using results in Section 9.3.7, give a 95% confidence interval for the density
at x0 = 1 based on the kernel density estimate 
f (1).
(f) For this example, what is the optimal bandwidth h∗
from (9.10).
9–2 Suppose we obtain a kernel regression estimate using a uniform kernel (see
Table 9.1) with h = 1 and a sample of size N = 100. Suppose in fact the data
x ∼ N[0, 1] and the conditional mean function is m(x) = x2
.
(a) Calculate the bias of the kernel regression estimate at x0 = 1 using (9.23).
(b) Is the bias large relative to the true value m(1) = 1?
(c) Calculate the variance of the kernel regression estimate at x0 = 1 using
(9.24).
(d) Which is making a bigger contribution to MSE at x0 = 1, variance or bias
squared?
(e) Using results in Section 9.5.4, give a 95% confidence interval for E[y|x0 = 1]
based on the kernel regression estimate 
m(1).
9–3 This question assumes access to a nonparametric density estimation program.
Use the Section 4.6.4 data on health expenditure. Use a kernel density estimate
with Gaussian kernel (if available).
(a) Obtain the kernel density estimate for health expenditure, choosing a suitable
bandwidth by eyeballing and trial and error. State the bandwidth chosen.
(b) Obtain the kernel density estimate for natural logarithm of health expenditure,
choosing a suitable bandwidth by eyeballing and trial and error. State the
bandwidth chosen.
(c) Compare your answer in part (b) to an appropriate histogram.
(d) If possible superimpose a fitted normal density on the same graph as the
kernel density estimate from part (b). Do health expenditures appear to be
log-normally distributed?
9–4 This question assumes access to a kernel regression program or other non-
parametric smoother. Use the complete sample of the Section 4.6.4 data
on natural logarithm of health expenditure (y) and natural logarithm of total
expenditure (x).
(a) Obtain the kernel regression estimate for health expenditure, choos-
ing a good bandwidth by eyeballing and trial and error. State the bandwidth
chosen.
(b) Given part (a), does health appear to be a normal good?
(c) Given part (a), does health appear to be a superior good?
(d) Compare your nonparametric estimates with predictions from linear and
quadratic regression.
Chapter 10
Numerical Optimization
10.1. Introduction
Theoretical results on consistency and the asymptotic distribution of an estimator de-
fined as the solution to an optimization problem were presented in Chapters 5 and 6.
The more practical issue of how to numerically obtain the optimum, that is, how to
calculate the parameter estimates, when there is no explicit formula for the estimator,
comprises the subject of this chapter.
For the applied researcher estimation of standard nonlinear models, such as logit,
probit, Tobit, proportional hazards, and Poisson, is seemingly no different from es-
timation of an OLS model. A statistical package obtains the estimates and reports
coefficients, standard errors, t-statistics, and p-values. Computational problems gen-
erally only arise for the same reasons that OLS may fail, such as multicollinearity or
incorrect data input.
Estimation of less standard nonlinear models, including minor variants of a standard
model, may require writing a program. This may be possible within a standard statisti-
cal package. If not, then a programming language is used. Especially in the latter case
a knowledge of optimization methods becomes necessary.
General considerations for optimization are presented in Section 10.2. Various iter-
ative methods, including the Newton–Raphson and Gauss–Newton gradient methods,
are described in Section 10.3. Practical issues, including some common pitfalls, are
presented in Section 10.4. These issues become especially relevant when the opti-
mization method fails to produce parameter estimates.
10.2. General Considerations
Microeconometric analysis is often based on an estimator 
θ that maximizes a stochas-
tic objective function QN (θ), where usually 
θ solves the first-order conditions
∂ QN (θ)/∂θ = 0. A minimization problem can be recast as a maximization by mul-
tiplying the objective function by minus one. In nonlinear applications there will
generally be no explicit solution to the first-order conditions, a nonlinear system of
q equations in the q unknowns θ.
A grid search procedure is usually impractical and iterative methods, usually gradi-
ent methods, are employed.
10.2.1. Grid Search
In grid search methods, the procedure is to select many values of θ along a grid,
compute QN (θ) for each of these values, and choose as the estimator 
θ the value that
provides the largest (locally or globally depending on the application) value of QN (θ).
If a fine enough grid can be chosen this method will always work. It is generally
impractical, however, to choose a fine enough grid without further restrictions. For
example, if 10 parameters need to be estimated and the grid evaluates each parameter
at just 10 points, a very sparse grid, there are $10^{10}$ or 10 billion evaluations.
Grid search methods are nonetheless useful in applications where the grid search
need only be performed among a subset of the parameters. They also permit viewing
the response surface to verify that in using iterative methods one need not be concerned
about multiple maxima. For example, many time-series packages do this for the scalar
AR(1) coefficient in a regression model with AR(1) error. A second example is doing a
grid search for the scalar inclusive parameter in a nested logit model (see Section 15.6).
Of course, grid search methods may have to be used if nothing else works.
10.2.2. Iterative Methods
Virtually all microeconometric applications instead use iterative methods. These
update the current estimate of $\theta$ using a particular rule. Given an $s$th-round estimate $\hat{\theta}_s$ the iterative method provides a rule that yields a new estimate $\hat{\theta}_{s+1}$, where $\hat{\theta}_s$ denotes the $s$th-round estimate rather than the $s$th component of $\hat{\theta}$. Ideally, the new estimate is a move toward the maximum, so that $Q_N(\hat{\theta}_{s+1}) > Q_N(\hat{\theta}_s)$, but in general this cannot
be guaranteed. Also, gradient estimates may find a local maximum but not necessarily
the global maximum.
10.2.3. Gradient Methods
Most iterative methods are gradient methods that change $\hat{\theta}_s$ in a direction determined by the gradient. The update formula is a matrix weighted average of the gradient
$$\hat{\theta}_{s+1} = \hat{\theta}_s + A_s g_s, \quad s = 1, \ldots, S, \qquad (10.1)$$
where $A_s$ is a $q \times q$ matrix that depends on $\hat{\theta}_s$, and
$$g_s = \left.\frac{\partial Q_N(\theta)}{\partial\theta}\right|_{\hat{\theta}_s} \qquad (10.2)$$
is the $q \times 1$ gradient vector evaluated at $\hat{\theta}_s$. Different gradient methods use different matrices $A_s$, detailed in Section 10.3. A leading example is the Newton–Raphson method, which sets $A_s = -H_s^{-1}$, where $H_s$ is the Hessian matrix defined later in (10.6).
Note that in this chapter A and g denote quantities that differ from those in other chap-
ters. Here A is not the matrix that appears in the limit distribution of an estimator and
g is not the conditional mean of y in the nonlinear regression model.
Ideally, the matrix $A_s$ is positive definite for a maximum (or negative definite for a minimum), as then it is likely that $Q_N(\hat{\theta}_{s+1}) > Q_N(\hat{\theta}_s)$. This follows from the first-order Taylor series expansion $Q_N(\hat{\theta}_{s+1}) = Q_N(\hat{\theta}_s) + g_s'(\hat{\theta}_{s+1} - \hat{\theta}_s) + R$, where $R$ is a remainder. Substituting in the update formula (10.1) yields
$$Q_N(\hat{\theta}_{s+1}) - Q_N(\hat{\theta}_s) = g_s'A_sg_s + R,$$
which is greater than zero if $A_s$ is positive definite and the remainder $R$ is sufficiently small, since for a positive definite square matrix $A$ the quadratic form $x'Ax > 0$ for all column vectors $x \neq 0$. Too small a value of $A_s$ leads to an iterative procedure that is too slow; however, too large a value of $A_s$ may lead to overshooting, even if $A_s$ is positive definite, as the remainder term cannot be ignored for large changes.
A common modification to gradient methods is to add a step-size adjustment to
prevent possible overshooting or undershooting, so
$$\hat{\theta}_{s+1} = \hat{\theta}_s + \hat{\lambda}_s A_s g_s, \qquad (10.3)$$
where the stepsize $\hat{\lambda}_s$ is a scalar chosen to maximize $Q_N(\hat{\theta}_{s+1})$. At the $s$th round first calculate $A_s g_s$, which may involve considerable computation. Then calculate $Q_N(\hat{\theta})$, where $\hat{\theta} = \hat{\theta}_s + \lambda A_s g_s$, for a range of values of $\lambda$ (called a line search), and choose $\hat{\lambda}_s$ as that $\lambda$ that maximizes $Q_N(\hat{\theta})$. Considerable computational savings are possible because the gradient and $A_s$ are not recomputed along the line search.
A second modification is sometimes made when the matrix $A_s$ is defined as the inverse of a matrix $B_s$, say, so that $A_s = B_s^{-1}$. Then if $B_s$ is close to singular a matrix of constants, say $C$, is added or subtracted to permit inversion, so $A_s = (B_s + C)^{-1}$.
Similar adjustments can be made if As is not positive definite. Further discussion of
computation of As is given in Section 10.3.
Gradient methods are most likely to converge to the local maximum nearest the
starting values. If the objective function has multiple local optima then a range of
starting values should be used to increase the chance of finding the global maximum.
10.2.4. Gradient Method Example
Consider calculation of the NLS estimator in the exponential regression model when the only regressor is the intercept. Then $\mathrm{E}[y] = e^{\beta}$ and a little algebra yields the gradient $g = N^{-1}\sum_i(y_i - e^{\beta})e^{\beta} = (\bar{y} - e^{\beta})e^{\beta}$. Suppose in (10.1) we use $A_s = e^{-2\hat{\beta}_s}$, which corresponds to the method of scoring variant of the Newton–Raphson algorithm presented later in Section 10.3.2. The iterative method simplifies to $\hat{\beta}_{s+1} = \hat{\beta}_s + (\bar{y} - e^{\hat{\beta}_s})/e^{\hat{\beta}_s}$.
As an example of the performance of this algorithm, suppose ȳ = 2 and the starting
value is 
β1 = 0. This leads to the iterations listed in Table 10.1. There is very rapid
convergence to the NLS estimate, which for this simple example can be analytically
obtained as 
β = ln ȳ = ln 2 = 0.693147. The objective function increases throughout,
Table 10.1. Gradient Method Results

Round $s$   Estimate $\hat{\beta}_s$   Gradient $g_s$   Objective Function $Q_N(\hat{\beta}_s) = -\frac{1}{2N}\sum_i(y_i - e^{\beta})^2$
1           0.000000                   1.000000         $1.500000 - \sum_i y_i^2/2N$
2           1.000000                   −1.952492        $1.742036 - \sum_i y_i^2/2N$
3           0.735758                   −0.181711        $1.996210 - \sum_i y_i^2/2N$
4           0.694042                   −0.003585        $1.999998 - \sum_i y_i^2/2N$
5           0.693147                   −0.000002        $2.000000 - \sum_i y_i^2/2N$
a consequence of use of the NR algorithm with globally concave objective function.
Note that overshooting occurs in the first iteration, from 
β1 = 0.0 to 
β2 = 1.0, greater
than 
β = 0.693.
Quick convergence usually occurs when the NR algorithm is used and the objective
function is globally concave. The challenge in practice is that nonstandard nonlinear
models often have objective functions that are not globally concave.
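The iterations in Table 10.1 are easily reproduced. The short sketch below (illustrative code, not from the text) applies the update $\hat{\beta}_{s+1} = \hat{\beta}_s + (\bar{y} - e^{\hat{\beta}_s})/e^{\hat{\beta}_s}$ starting from $\hat{\beta}_1 = 0$ with $\bar{y} = 2$.

```python
# Sketch reproducing Table 10.1: method-of-scoring updates for NLS in the
# intercept-only exponential regression model with y_bar = 2.
import numpy as np

ybar, b = 2.0, 0.0                              # starting value beta_1 = 0
for s in range(1, 6):
    g = (ybar - np.exp(b)) * np.exp(b)          # gradient (y_bar - e^b) e^b
    print(s, round(b, 6), round(g, 6))
    b = b + (ybar - np.exp(b)) / np.exp(b)      # update with A_s = exp(-2 b_s)
print("NLS estimate:", np.log(ybar))            # ln(2) = 0.693147
```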
10.2.5. Method of Moments and GMM Estimators
For m-estimators $Q_N(\theta) = N^{-1}\sum_i q_i(\theta)$ and the gradient $g(\theta) = N^{-1}\sum_i\partial q_i(\theta)/\partial\theta$. For GMM estimators $Q_N(\theta)$ is a quadratic form (see Section 6.3.2) and the gradient takes the more complicated form
$$g(\theta) = \left[N^{-1}\sum_i\partial h_i(\theta)'/\partial\theta\right] \times W_N \times \left[N^{-1}\sum_i h_i(\theta)\right].$$
Some gradient methods can then no longer be used as they work only for averages. Methods given in Section 10.3 that can still be used include Newton–Raphson, steepest ascent, DFP, BFGS, and simulated annealing.
Method of moments and estimating equations estimators are defined as solving a system of equations, but they can be converted to a numerical optimization problem similar to GMM. The estimator $\hat{\theta}$ that solves the $q$ equations $N^{-1}\sum_i h_i(\theta) = 0$ can be obtained by minimizing $Q_N(\theta) = \left[N^{-1}\sum_i h_i(\theta)\right]'\left[N^{-1}\sum_i h_i(\theta)\right]$.
10.2.6. Convergence Criteria
Iterations continue until there is virtually no change. Programs ideally stop when all
of the following occur: (1) A small relative change occurs in the objective function
QN (
θs); (2) a small change of the gradient vector gs occurs relative to the Hessian;
and (3) a small relative change occurs in the parameter estimates 
θs. Statistical pack-
ages typically choose default threshold values for these three changes, called conver-
gence criteria. These values can often be changed by the user. A conservative value
is $10^{-6}$.
In addition there is usually a maximum number of iterations that will be
attempted. If this maximum is reached estimates are typically reported. The estimates
should not be used, however, unless convergence has been achieved.
If convergence is achieved then a local maximum has been obtained. However, there
is no guarantee that the global maximum is obtained, unless the objective function is
globally concave.
10.2.7. Starting Values
The number of iterations is considerably reduced if the initial starting values 
θ1 are
close to
θ. Consistent parameter estimates are obviously good estimates to use as start-
ing values. A poor choice of starting values can lead to failure of iterative methods. In
particular, for some estimators and gradient methods it may not be possible to compute
g1 or A1 if the starting value is 
θ1 = 0.
If the objective function is not globally concave it is good practice to use a range of
starting values to increase the chance of obtaining a global maximum.
10.2.8. Numerical and Analytical Derivatives
Any gradient method by definition uses derivatives of the objective function. Either
numerical derivatives or analytical derivatives may be used.
Numerical derivatives are computed using
$$\frac{\Delta Q_N(\hat{\theta}_s)}{\Delta\theta_j} = \frac{Q_N(\hat{\theta}_s + he_j) - Q_N(\hat{\theta}_s - he_j)}{2h}, \quad j = 1, \ldots, q, \qquad (10.4)$$
where $h$ is small and $e_j = (0 \ldots 0\ 1\ 0 \ldots 0)'$ is a vector with unity in the $j$th row and zeros elsewhere.
In theory $h$ should be very small, as formally $\partial Q_N(\theta)/\partial\theta_j$ equals the limit of $\Delta Q_N(\theta)/\Delta\theta_j$ as $h \rightarrow 0$. In practice too small a value of $h$ leads to inaccuracy ow-
ing to rounding error. For this reason calculations using numerical derivatives should
always be done in double precision or quadruple precision rather than single precision.
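A minimal sketch of the central-difference formula (10.4) for an arbitrary objective function (the step size and the toy objective below are illustrative choices):

```python
# Sketch: central-difference numerical gradient of Q_N(theta), as in (10.4).
import numpy as np

def numerical_gradient(Q, theta, h=1e-6):
    theta = np.asarray(theta, dtype=float)
    q = theta.size
    grad = np.empty(q)
    for j in range(q):
        e = np.zeros(q); e[j] = 1.0                      # unit vector e_j
        grad[j] = (Q(theta + h * e) - Q(theta - h * e)) / (2 * h)
    return grad

Q = lambda t: -((t[0] - 1) ** 2 + 2 * (t[1] + 0.5) ** 2)   # toy objective
print(numerical_gradient(Q, [0.0, 0.0]))                   # analytical: [2, -2]
```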
Although a program may use a default value such as $h = 10^{-6}$, other values will be
better for any particular problem. For example, a smaller value of h is appropriate if the
dependent variable y in NLS regression is measured in thousands of dollars rather than
dollars (with regressors not rescaled), since then θ will be one-thousandth the size.
A drawback of using numerical derivatives is that these derivatives have to be com-
puted many times – for each of the q parameters, for each of the N observations, and
for each of the S iterations. This requires 2qN S evaluations of the objective function,
where each evaluation itself may be computationally burdensome.
An alternative is to use analytical derivatives. These will be more accurate than
numerical derivatives and may be much quicker to compute, especially if the analytical
derivatives are simpler than the objective function itself. Moreover, only qN S function
evaluations are needed.
For methods that additionally require calculation of second derivatives to form As
there is even greater benefit to providing analytical derivatives. Even if just analyt-
ical first derivatives are given, the second derivative may then be more quickly and
340
10.3. SPECIFIC METHODS
accurately obtained as the numerical first derivative of the analytical first derivative.
Statistical packages often provide the user with the option of providing analytical first
and second derivatives.
Numerical derivatives have the advantage of requiring no coding beyond providing
the objective function. This saves coding time and eliminates one possible source of
user error, though some packages have the ability to take analytical derivatives.
If computational time is a factor or if there is concern about accuracy of calcula-
tions, however, it is worthwhile going to the trouble of providing analytical derivatives.
It is still good practice then to check that the analytical derivatives have been correctly
coded by obtaining parameter estimates using numerical derivatives, with starting val-
ues the estimates obtained using analytical derivatives.
10.2.9. Nongradient Methods
Gradient methods presume the objective function is sufficiently smooth to ensure ex-
istence of the gradient. For some examples, notably least absolute deviations (LAD),
quantile regression, and maximum score estimation, there is no gradient and alterna-
tive iterative methods are used.
For example, for LAD the objective function QN (θs) = N−1

i |yi − xi β| has no
derivative and linear programming methods are used in place of gradient methods.
Such examples are sufficiently rare in microeconometrics that we focus almost exclu-
sively on gradient methods.
For objective functions that are difficult to maximize, particularly because of multi-
ple local optima, use can be made of nongradient methods such as simulated annealing
(presented in Section 10.3.8) and genetic algorithms (see Dorsey and Mayer, 1995).
10.3. Specific Methods
The leading method for obtaining a globally concave objective function is the Newton–
Raphson iterative method. The other methods, such as steepest descent and DFP, are
usually learnt and employed when the Newton–Raphson method fails. Another com-
mon method is the Gauss–Newton method for the NLS estimator. This method is
not as universal as the Newton–Raphson method, as it is applicable only to least-
squares problems, and it can be obtained as a minor adaptation of the Newton–Raphson
method. These various methods are designed to obtain a local optimum given some
starting values for the parameters.
This section also presents the expectation method, which is particularly useful in
missing data problems, and the method of simulated annealing, which is an example of
a nongradient method and is more likely to yield a global rather than local maximum.
10.3.1. Newton–Raphson Method
The Newton–Raphson (NR) method is a popular gradient method that works espe-
cially well if the objective function is globally concave in θ. In this method

θs+1 = 
θs − H−1
s gs, (10.5)
341
NUMERICAL OPTIMIZATION
where gs is defined in (10.2) and
Hs =
∂2
QN (θ)
∂θ∂θ




θs
(10.6)
is the q × q Hessian matrix evaluated at 
θs. These formulas apply to both maximiza-
tion and minimization of QN (θ) since premultiplying QN (θ) by minus one changes
the sign of both H−1
s and gs.
To motivate the NR method, begin with the sth-round estimate 
θs for θ. Then by
second-order Taylor series expansion around 
θs
QN (θ) = QN (
θs) +
∂ QN (θ )
∂θ




θs
(θ −
θs) +
1
2
(θ −
θs) ∂2
QN (θ)
∂θ∂θ




θs
(θ −
θs) + R.
Ignoring the remainder term R and using more compact notation, we approximate
QN (θ) by
Q∗
N (θ) = QN (
θs) + g
s(θ −
θs) +
1
2
(θ −
θs)
Hs(θ −
θs),
where gs and Hs are defined in (10.2) and (10.6). To maximize the approxima-
tion Q∗
N (θ) with respect to θ we set the derivative to zero. Then gs + Hs(θ −
θs) = 0,
and solving for θ yields
θs+1 = 
θs − H−1
s gs, which is (10.5). The NR update therefore
maximizes a second-order Taylor series approximation to QN (θ) evaluated at 
θs.
To see whether NR iterations will necessarily increase QN (θ), substitute the
(s + 1)th-round estimate back into the Taylor series approximation to obtain
QN (
θs+1) = QN (
θs) −
1
2
(
θs+1 −
θs)
Hs(
θs+1 −
θs) + R.
Ignoring the remainder term, we see that this increases (or decreases) if Hs is negative
(or positive) definite. At a local maximum the Hessian is negative semi-definite, but
away from the maximum this may not be the case even for well-defined problems. If
the NR method strays into such territory it may not necessarily move toward the max-
imum. Furthermore the Hessian is then singular, in which case H−1
s in (10.5) cannot
be computed. Clearly, the NR method works best for maximization (or minimization)
problems if the objective function is globally concave (or convex), as then Hs is al-
ways negative (or positive) definite. In such cases convergence often occurs within
10 iterations.
An additional attraction of the NR method arises if the starting value 
θ1 is root-N
consistent, that is, if
√
N(
θ1 − θ0) has a proper limiting distribution. Then the second-
round estimator 
θ2 can be shown to have the same asymptotic distribution as the es-
timator obtained by iterating to convergence. There is therefore no theoretical gain to
further iteration. An example is feasible GLS, where initial OLS leads to consistent
regression parameter estimates, and these in turn are used to obtain consistent variance
parameter estimates, which are then used to obtain efficient GLS. A second example
is use of easily obtained consistent estimates as starting values before maximizing a
complicated likelihood function. Although there is no need to iterate further, in practice
most researchers still prefer to iterate to convergence unless this is computationally too
342
10.3. SPECIFIC METHODS
time consuming. One advantage of iterating to convergence is that different researchers
should obtain the same parameter estimates, whereas different initial root-N consistent
estimates lead to second-round parameter estimates that will differ even though they
are asymptotically equivalent.
10.3.2. Method of Scoring
A common modification of the NR method is the method of scoring (MS). In this
method the Hessian matrix is replaced by its expected value
HMS,s = E

∂2
QN (θ)
∂θ∂θ




θs
. (10.7)
This substitution is especially advantageous when applied to the MLE (i.e., QN (θ) =
N−1
LN (θ)), because the expected value should be negative definite, since by the infor-
mation matrix equality (see Section 5.6.3), HMS,s = E

∂LN /∂θ × ∂LN /∂θ

, which
is positive definite since it is a covariance matrix. Obtaining the expectation in (10.7)
is possible only for m-estimators and even then may be analytically difficult.
The method of scoring algorithm for the MLE of generalized linear models, such
as the Poisson, probit, and logit, can be shown to be implementable using iteratively
reweighted least squares (see McCullagh and Nelder, 1989). This was advantageous to
early adopters of these models who only had access to an OLS program.
The method of scoring can also be applied to m-estimators other than the MLE,
though then HMS,s may not be negative definite.
10.3.3. BHHH Method
The BHHH method of Berndt, Hall, Hall, and Hausman (1974) uses (10.1) with
weighting matrix As = −H−1
BHHH,s where the matrix
HBHHH,s = −
N
	
i=1
∂qi (θ)
∂θ
∂qi (θ)
∂θ





θs
, (10.8)
and QN (θ) =

i qi (θ). Compared to NR, this has the advantage of requiring evalua-
tion of first derivatives only, offering considerable computational savings.
To justify this method, begin with the method of scoring for the MLE, in which case
QN (θ) =

i ln fi (θ), where fi (θ) is the log-density. The information matrix equality
can be expressed as
E

∂2
LN (θ)
∂θ∂θ

= −E

N
	
i=1
∂ ln fi (θ)
∂θ
N
	
j=1
∂ ln f j (θ)
∂θ
'
,
and independence over i implies
E

∂2
LN (θ)
∂θ∂θ

= −
N
	
i=1
E

∂ ln fi (θ)
∂θ
∂ ln fi (θ)
∂θ

.
Dropping the expectation leads to (10.8).
343
NUMERICAL OPTIMIZATION
The BHHH method can also be applied to estimators other than the MLE, in which
case it is viewed as simply another choice of matrix As in (10.1) rather than as an
estimate of the Hessian matrix Hs.
The BHHH method is used for many cross-section m-estimators as it can work well
and requires only first derivatives.
10.3.4. Method of Steepest Ascent
The method of steepest ascent sets As = Iq, the simplest choice of weighting matrix.
A line search is then done (see (10.3)) to scale Iq by a constant λs.
The line search can be down manually. In practice it is common to use the optimal
λ for the line search, which can be shown to be λs = −g
sgs/g
sHsgs, where Hs is the
Hessian matrix. This optimal λs requires computation of the Hessian, in which case
one might instead use NR. The advantage of steepest ascent rather than NR is that Hs
can be singular, though Hs still needs to be negative definite to ensure λs  0 so that
λsIq is negative definite.
10.3.5. DFP and BFGS Methods
The DFP algorithm due to Davidon, Fletcher, and Powell is a gradient method with
weighting matrix As that is positive definite and requires computation of only first
derivatives, unlike NR, which requires computation of the Hessian. Here the method
is presented without derivation.
The weighting matrix As is computed by the recursion
As = As−1 +
δs−1δ
s−1
δ
s−1γs−1
+
As−1γs−1γ
s−1As−1
γ
s−1As−1γs−1
, (10.9)
where δs−1 = As−1gs−1 and γs−1 = gs − gs−1. By inspection of the right-hand side
of (10.9), As will be positive definite provided the initial A0 is positive definite (e.g.,
A0 = Iq).
The procedure converges quite well in many statistical applications. Eventually As
goes to the theoretically preferred −H−1
s . In principle this method can also provide
an approximate estimate of the inverse of the Hessian for use in computation of stan-
dard errors, without needing either second derivatives or matrix inversion. In practice,
however, this estimate can be a poor one.
A refinement of the DFP algorithm is the BFGS algorithm of Boyden, Fletcher,
Goldfarb, and Shannon with
As = As−1 +
δs−1δ
s−1
δ
s−1γs−1
+
As−1γs−1γ
s−1As−1
γ
s−1As−1γs−1
− (γ
s−1As−1γs−1)ηs−1η
s−1, (10.10)
where ηs−1 = (δs−1/δ
s−1γs−1) − (As−1γs−1/γ
s−1As−1γs−1).
344
10.3. SPECIFIC METHODS
10.3.6. Gauss–Newton Method
The Gauss–Newton (GN) method is an iterative method for the NLS estimator that
can be implemented by iterative OLS.
Specifically, for NLS with conditional mean function g(xi , β), the GN method sets
the parameter change vector (
βs+1 − 
βs) equal to the OLS coefficient estimates from
the artificial regression
yi − g(xi ,
βs) =
∂gi
∂β




βs
β + vi . (10.11)
Equivalently, 
βs+1 equals the OLS coefficient estimates from the artificial regression
yi − g(xi ,
βs) −
∂gi
∂β




βs

βs =
∂gi
∂β




βs
β + vi . (10.12)
To derive this method, let 
βs be a starting value, approximate g(xi , β) by a first-
order Taylor series expansion
g(xi , β) = g(xi ,
βs) +
∂gi
∂β




βs
(β − 
βs),
and substitute this in the least-squares objective function QN (β) to obtain the
approximation
Q∗
N (β) =
	N
i=1

yi − g(xi ,
βs) −
∂gi
∂β




βs
(β − 
βs)

2
.
But this is the sum of squared residuals for OLS regression of yi − g(xi ,
βs) on
∂gi /∂β


βs
with parameter vector (β − 
βs), leading to (10.11). More formally,

βs+1 = 
βs +

	
i
∂gi
∂β




βs
∂gi
∂β




βs
'−1
	
i
∂gi
∂β




βs
(yi − g(xi ,
βs)). (10.13)
This is the gradient method (10.1) with vector gs =

i ∂gi /∂β|
βs
(yi − g(xi ,
βs))
weighted by matrix As = [

i ∂gi /∂β×∂gi /∂β
|
βs
]−1
.
The iterative method (10.13) equals the method of scoring variant of the Newton–
Raphson algorithm for NLS estimation since, from Section 5.8, the second sum on the
right-hand side is the gradient vector and the first sum is minus the expected value
of the Hessian (see also Section 10.3.9). The Gauss–Newton algorithm is therefore a
special case of the Newton–Raphson, and NR is emphasized more here as it can be
applied to a much wider range of problems than can GN.
10.3.7. Expectation Maximization
There are a number of data and model formulations considered in this book that can be
thought of as involving incomplete or missing data. For example, outcome variables of
interest (e.g., expenditure or the length of a spell in some state) may be right-censored.
That is, for some cases we may observe the actual expenditure or spell length, whereas
345
NUMERICAL OPTIMIZATION
in other cases we may only know that the outcome exceeded some specific value, say
c∗
. A second example involves a multiple regression in which the data matrix looks as
follows:

y1 X1
? X2

,
where ? stands for missing data. Here we envisage a situation in which we wish to
estimate a linear regression model y = Xβ + u, where y
=

y1 ?

, X
=

X1 X2

,
but a subset of variables y is missing. A third example involves estimating the parame-
ters (θ1, θ2, . . . , θC , π1, . . . , πC ) of a C-component mixture distribution, also called a
latent class model, h (y|X) =
C
j=1 πj f j

yj |Xj , θj

, where f j

yj |Xj , θj

are well-
defined pdfs. Here πj ( j = 1, . . . , C) are unknown sampling fractions corresponding
to the C latent densities from which the observations are sampled. It is convenient to
think of this problem also as a missing data problem in the sense that if the sampling
fractions were known constants then estimation would be simpler.
The expectation maximization (EM) framework provides a unifying framework
for developing algorithms for problems that can be interpreted as involving miss-
ing data. Although particular solutions to this type of estimation problem have long
been found in the literature, Dempster, Laird, and Rubin (1977) provided a definitive
treatment.
Let y denote the vector dependent variable of interest, determined by the under-
lying latent variable vector y∗
. Let f ∗
(y∗
|X, θ) denote the joint density of the latent
variables, conditional on regressors X, and let f (y|X, θ) denote the joint density of
the observed variables. Let there be a many-to-one mapping from the sample space
of y to that of y∗
; that is, the value of the latent variable y∗
uniquely determines
y, but the value of y does not uniquely determine y∗
. It follows that f (y|X, θ) =
f ∗
(y∗
|X, θ)/f (y∗
|y, X, θ), since from Bayes rule the conditional density f (y∗
|y) =
f (y, y∗
)/ f (y) = f ∗
(y∗
)/ f (y), where the final equality uses f (y∗
, y) = f ∗
(y∗
) as y∗
uniquely determines y. Rearranging gives f (y) = f ∗
(y∗
)/f (y∗
|y).
The MLE maximizes
QN (θ) =
1
N
LN (θ) =
1
N
ln f ∗
(y∗
|X, θ) −
1
N
ln f (y∗
|y, X, θ). (10.14)
Because y∗
is unobserved the first term in the log-likelihood is ignored. The second
term is replaced by its expected value, which will not involve y∗
, where at the sth
round this expectation is evaluated at θ =
θs.
The expectation (E) part of the EM algorithm calculates
QN (θ|
θs) = −E

1
N
ln f (y∗
|y, X, θ)|y, X,
θs

, (10.15)
where expectation is with respect to the density f (y∗
|y, X,
θs). The maximization (M)
part of the EM algorithm maximizes QN (θ|
θs) to obtain 
θs+1.
The full EM algorithm is iterative. The likelihood is maximized, given the expected
value of the latent variable; the expected value is evaluated afresh given the current
value of θ. The iterative process continues until convergence is achieved. The EM
algorithm has the advantage of always leading to an increase or constancy in QN (θ);
346
10.3. SPECIFIC METHODS
see Amemiya (1985, p. 376). The EM algorithm is applied to a latent class model in
Section 18.5.3 and to missing data in Section 27.5.
There is a very extensive literature on situations where the EM algorithm can be
usefully applied, even though it can be applied to only a subset of optimization prob-
lems. The EM algorithm is easy to program in many cases and its use was further en-
couraged by considerations of limited computing power and storage that are no longer
paramount. Despite these attractions, for censored data models and latent class models
direct estimation using Newton–Raphson type iterative procedures is often found to be
faster and more efficient computationally.
10.3.8. Simulated Annealing
Simulated annealing (SA) is a nongradient iterative method reviewed by Goffe,
Ferrier, and Rogers (1994). It differs from gradient methods in permitting movements
that decrease rather than increase the objective function to be maximized, so that one
is not locked in to moving steadily toward one particular local maximum.
Given a value 
θs at the sth round we perturb the jth component of 
θs to obtain a
new trial value of
θ∗
s = 
θs +

0 · · · 0 (λjrj ) 0 · · · 0

, (10.16)
where λj is a prespecified step length and rj is a draw from a uniform distribution on
(−1, 1). The new trial value is used, that is, the method sets 
θs+1 = θ∗
s , if it increases
the objective function, or if it does not increase the value of the objective function but
does pass the Metropolis criterion that
exp

(QN (θ∗
s ) − QN (
θs))/Ts

 u, (10.17)
where u is a drawing from a uniform (0, 1) distribution and Ts is a scaling parameter
called the temperature. Thus not only uphill moves are accepted, but downhill moves
are also accepted with a probability that decreases with the difference between QN (θ∗
s )
and QN (
θs) and that increases with the temperature. The terms simulated annealing
and temperature come from analogy with minimizing thermal energy by slowly cool-
ing (annealing) a molten metal.
The user needs to set the step-size parameter λj . Goffe et al. (1994) suggest period-
ically adjusting λj so that 50% of all moves over a number of iterations are accepted.
The temperature also needs to be chosen and reduced during the course of iterations.
Then the algorithm initially is searching over a wide range of parameter values before
steadily locking in on a particular region.
Fast simulated annealing (FSA), proposed by Szu and Hartley (1987), is a faster
method. It replaces the uniform (−1, 1) random number rj by a Cauchy random vari-
able rj scaled by the temperature and permits a fixed step length vj . The method also
uses a simpler adjustment of the temperature over iterations with Ts equal to the ini-
tial temperature divided by the number of FSA iterations, where one iteration is a full
cycle over the q components of θ.
Cameron and Johansson (1997) discuss and use simulated annealing, following the
methods of Horowitz (1992). This begins with FSA but on grounds of computational
347
NUMERICAL OPTIMIZATION
savings switches to gradient methods (BFGS) when relatively little change in QN (·)
occurs over a number of iterations or after many (250) FSA iterations. In a simulation
they find that NR with a number of different starting values offers a considerable im-
provement over NR with just one set of starting values, but even better is FSA with a
number of different starting values.
10.3.9. Example: Exponential Regression
Consider the nonlinear regression model with exponential conditional mean
E[yi |xi ] = exp(x
i β), (10.18)
where xi and β are K × 1 vectors. The NLS estimator 
β minimizes
QN (β) =
	
i
(yi − exp(x
i β))2
, (10.19)
where for notational simplicity scaling by 2/N is ignored. The first-order conditions
are nonlinear in β and there is no explicit solution for β. Instead, gradient methods
need to be used.
For this example the gradient and Hessian are, respectively,
g = −2
	
i
(yi − ex
i β
)ex
i β
xi (10.20)
and
H = 2
	
i
#
ex
i β
ex
i β
xi x
i − 2(yi − ex
i β
)ex
i β
xi x
i
$
. (10.21)
The NR iterative method (10.5) uses gs and Hs equal to (10.20) and (10.21) evaluated
at 
βs.
A simpler method of scoring variation of NR notes that (10.18) implies
E[H] = 2
	
i
ex
i β
ex
i β
xi x
i . (10.22)
Using E[Hs] in place of Hs yields

βs+1 − 
βs =

	
i
ex
i

βs
ex
i

βs
xi x
i
'−1
	
i
ex
i

βs
xi (yi − ex
i

βs
).
It follows that 
βs+1 − 
βs can be computed from OLS regression of (yi − ex
i

βs ) on
ex
i

βs xi . This is also the Gauss–Newton regression (10.11), since ∂g(xi , β)/∂β =
exp(x
i

βs)xi for the exponential conditional mean (10.18). Specialization to
exp(x
i β) = exp(β) gives the iterative procedure presented in Section 10.2.4.
10.4. Practical Considerations
Some practical issues have already been presented in Section 10.2, notably conver-
gence criteria, modifications such as step-size adjustment, and the use of numerical
rather than analytical derivatives. In this section a brief overview of statistical packages
348
10.4. PRACTICAL CONSIDERATIONS
is given, followed by a discussion of common pitfalls that can arise in computation of
a nonlinear estimator.
10.4.1. Statistical Packages
All standard microeconometric packages such as Limdep, Stata, PCTSP, and SAS have
built-in procedures to estimate basic nonlinear models such as logit and probit. These
packages are simple to use, requiring no knowledge of iterative methods or even of the
model being used. For example, the command for logit regression might be “logit y
x” rather than the command “ols y x” for OLS. Nonlinear least squares requires some
code to convey to the package the particular functional form for g(x, β) one wishes
to specify. Estimation should be quick and accurate as the program should exploit the
structure of the particular model. For example, if the objective function is globally
concave then the method of scoring might be used.
If a statistical package does not contain a particular model then one needs to write
one’s own code. This situation can arise with even minor variation of standard mod-
els, such as imposing restrictions on parameters or using parameterizations that are
not of single-index form. The code may be written using one’s own favorite statistical
package or using other more specialized programming languages. Possibilities include
(1) built-in optimization procedures within the statistical package that require spec-
ification of the objective function and possibly its derivatives; (2) matrix commands
within the statistical package to compute As and gs and iterate; (3) a matrix program-
ming language such as Gauss, Matlab, OX, SAS/IML, or S-Plus, and possibly add-on
optimization routines; (4) a programming language such as Fortran or C++; and (5) an
optimization package such as those in GAMS, GQOPT, or NAGLIB.
The first and second methods are attractive because they do not force the user to
learn a new program. The first method is particularly simple for m-estimation as it can
require merely specification of the subfunction qi (θ) for the ith observation rather than
specification of QN (θ). In practice, however, the optimization procedures for user-
defined functions in the standard packages are more likely to encounter numerical
problems than if more specialized programs are used. Moreover, for some packages
the second method can require learning arcane forms of matrix commands.
For nonlinear problems, the third method is the best, although this might require the
user to learn a matrix programming language from scratch. One then is set up to han-
dle virtually any econometric problem encountered, and the optimization routines that
come with matrix programming languages are usually adequate. Also, many authors
make available the code used in specific papers.
The fourth and fifth methods generally require a higher level of programming so-
phistication than the third method. The fourth method can lead to much faster compu-
tation and the fifth method can solve the most numerically challenging optimization
problems.
Other practical issues include cost of software; the software used by colleagues; and
whether the software has clear error messages and useful debugging features, such as a
trace program that tracks line-by-line program execution. The value of using software
similar to that used by other colleagues cannot be underestimated.
349
NUMERICAL OPTIMIZATION
Table 10.2. Computational Difficulties: A Partial Checklist
Problem Check
Data read incorrectly Print full descriptive statistics.
Imprecise calculation Use analytical derivatives or numerical with different
step size h.
Multicollinearity Check condition number of X
X. Try subset of regressors.
Singular matrix in iterations Try method not requiring matrix inversion such as DFP.
Poor starting values Try a range of different starting values.
Model not identified Difficult to check. Obvious are dummy variable traps.
Strange parameter values Constant included/excluded? Iterations actually
converged?
Different standard errors Which method was used to calculate variance matrix?
10.4.2. Computational Difficulties
Computational difficulties are, in practice, situations where it is not possible to obtain
an estimate of the parameters. For example, an error message may indicate that the
estimator cannot be calculated because the Hessian is singular. There are many possi-
ble reasons for this, as detailed in the following and summarized in Table 10.2. These
reasons may also provide explanation for another common situation of parameter esti-
mates that are obtained but are seemingly in error.
First, the data may not have been read in correctly. This is a remarkably common
oversight. With large data sets it is not practical to print out all the data. However, at a
minimum one should always obtain descriptive statistics and check for anomilies such
as incorrect range for a variable, unusually large or small sample mean, and unusu-
ally large or small standard deviation (including a value of zero, which indicates no
variation). See Section 3.5.4 for further details.
Second, there may be calculation errors. To minimize these all calculations should
be done in double precision or even quadruple precision rather then single precision.
It is helpful to rescale the data so that the regressors have similar means and variances.
For example, it may be better to use annual income in thousands of dollars rather than
in dollars. If numerical derivatives are used it may be necessary to alter the change
value h in (10.4). Care needs to be paid to how functions are evaluated. For example,
the function ln Γ(y), where Γ(·) is the gamma function, is best evaluated using the
log-gamma function. If instead one evaluates the gamma function followed by the log
function considerable numerical error arises even for moderate sized y.
Third, multicollinearity may be a problem. In single-index models (see Sec-
tion 5.2.4) the usual checks for multicollinearity will carry over. The correlation matrix
for the regressors can be printed, though this only considers pairwise correlation. Bet-
ter is to use the condition number of X
X, that is, the square root of the ratio of the
largest to smallest eigenvalue of X
X. If this exceeds 100 then problems may arise. For
more highly nonlinear models than single-index ones it is possible to have problems
even if the condition number is not large. If one suspects multicollinearity is causing
350
10.4. PRACTICAL CONSIDERATIONS
numerical problems then see whether it is possible to estimate the model with a subset
of the variables that are less likely to be collinear.
Fourth, a noninvertible Hessian during iterations does not necessarily imply singu-
larity at the true maximum. It is worthwhile trying a range of iterative methods such
as steepest ascent with line search and DFP, not just Newton–Raphson. This problem
may also result from multicollinearity.
Fifth, try different starting values. The iterative gradient methods are designed to
obtain a local maximum rather than the global maximum. One way to guard against
this is to begin iterations at a wide range of starting values. A second way is to per-
form a grid search. Both of these approaches theoretically require evaluations at many
different points if the dimension of θ is large, but it may be sufficient to do a detailed
analysis for a stripped-down version of the model that includes just the few regressors
thought to be most statistically significant.
Lastly, the model may not be identified. Indeed a standard necessary condition for
model identification is that the Hessian be invertible. As with linear models, sim-
ple checks include avoiding dummy variable traps and, if a subset of data is being
used in initial analysis, determining that all variables in the subset of the data have
some variation. For example, if data are ordered by gender or by age or by region
then problems can arise if these appear as indicator variables and the chosen subset
is of individuals of a particular gender, age, or region. For nonlinear models it can
be difficult to theoretically determine that the model is not identified. Often one first
eliminates all other potential causes before returning to a careful analysis of model
identification.
Even after parameter estimates are successfully obtained computational problems
can still arise, as it may not be possible to obtain estimates of the variance matrix
A−1
BA−1
. This situation can arise when the iterative method used, such as DFP, does
not use the Hessian matrix A−1
as the weighting matrix in the iterations. First check
that the iterative method has indeed converged rather than, for example, stopping at
a default maximum number of iterations. If convergence has occurred, try alternative
estimates of A, using the expected Hessian or using more accurate numerical com-
putations by, for example, using analytical rather than numerical derivatives. If such
solutions still fail it is possible that the model is not identified, with this nonidentifica-
tion being finessed at the parameter estimation stage by using an iterative method that
did not compute the Hessian.
Other perceived computational problems are parameter and variance estimates that
do not accord with prior beliefs. For parameter estimates obvious checks include en-
suring correct treatment of an intercept term (inclusion or exclusion, depending on the
context), that convergence has been achieved, and that a global maximum is obtained
(by trying a range of starting values). If standard errors of parameter estimates dif-
fer across statistical packages that give the same parameter estimates, the most likely
cause is that a different method has been used to construct the variance matrix estimate
(see Section 5.5.2).
A good computational strategy is to start with a small subset of the data and regres-
sors, say one regressor and 100 observations. This simplifies detailed tracing of the
program either manually, such as by printing out key output along the way, or using
351
NUMERICAL OPTIMIZATION
a built-in trace facility if the program has one. If the program passes this test then
computational problems with the full model and data are less likely to be due to in-
correct data input or coding errors and are more likely due to genuine computational
difficulties such as multicollinearity or poor starting values.
A good way to test program validity is to construct a simulated data set where the
true parameters are known. For a large sample size, say N = 10,000, the estimated
parameter values should be close to the true values.
Finally, note that obtaining reasonable computational results from estimation of a
nonlinear model does not guarantee correct results. For example, many early pub-
lished applications of multinomial probit models reported apparently sensible results,
yet the models estimated have subsequently been determined to be not identified (see
Section 15.8.1).
10.5. Bibliographic Notes
Numerical problems can arise even in linear models, and it is instructive to read Davidson and
MacKinnon (1993, Section 1.5) and Greene (2003, appendix E). Standard references for statis-
tical computation are Kennedy and Gentle (1980) and especially Press et al. (1993) and related
co-authored books by Press. For evaluation of functions the standard reference is Abramowitz
and Stegun (1971). Quandt (1983) presents many computational issues, including optimization.
5.3 Summaries of iterative methods are given in Amemiya (1985, Section 4.4), Davidson and
MacKinnon (1993, Section 6.7), Maddala (1977, Section 9.8), and especially Greene (2003,
appendix E.6). Harvey (1990) gives many applications of the GN algorithm, which, owing
to its simplicity, is the usual iterative method for NLS estimation. For the EM algorithm see
especially Amemiya (1985, pp. 375–378). For SA see Goffe et al. (1994).
Exercises
10–1 Consider calculation of the MLE in the logit regression model when the only re-
gressor is the intercept. Then E[y ] = 1/(1 + e−β
) and the gradient of the scaled
log-likelihood function g(β) = (y − 1/(1 + e−β
)). Suppose a sample yields ȳ =
0.8 and the starting value is β = 0.0.
(a) Calculate β for the first six iterations of the Newton–Raphson algorithm.
(b) Calculate the first six iterations of a gradient algorithm that sets As = 1 in
(10.1), so 
βs+1 = 
βs + gs.
(c) Compare the performance of the methods in parts (a) and (b).
10–2 Consider the nonlinear regression model y = αx1 + γ/(x2 − δ) + u, where x1
and x2 are exogenous regressors independent of the iid error u ∼ N[0, σ2
].
(a) Derive the equation for the Gauss–Newton algorithm for estimating (α, γ, δ).
(b) Derive the equation for the Newton–Raphson algorithm for estimating
(α, γ, δ).
(c) Explain the importance of not arbitrarily choosing the starting values of the
algorithm.
10–3 Suppose that the pdf of y has a C-component mixture form, f (y|π) =
C
j =1 π j f j (y), where π = (π1, . . . , πC), πj  0,
C
j =1 π j = 1. The π j are
352
10.5. BIBLIOGRAPHIC NOTES
unknown mixing proportions whereas the parameters of the densities f j (y) are
presumed known.
(a) Given a random sample on yi , i = 1, . . . , N, write the general log-likelihood
function and obtain the first-order conditions for 
πML. Verify that there is no
explicit solution for 
πML.
(b) Let zi be a C × 1 vector of latent categorical variables, i = 1, . . . , N, such
that zji = 1 if y comes from the j th component of the mixture and zji = 0
otherwise. Write down the likelihood function in terms of the observed and
latent variables as if the latent variable were observed.
(c) Devise an EM algorithm for estimating π. [Hint: If zji were observable the
MLE of 
π j = N−1

i zji . The E step requires calculation of E[zji |yi ]; the M
step requires replacing zji by E[zji |yi ] and then solving for π.]
10–4 Let (y1i , y2i ), i = 1, . . . , N, have a bivariate normal distribution with mean
(µ1, µ2) and covariance parameters (σ11, σ12, σ22) and correlation coefficient
ρ. Suppose that all N observations on y1 are available but there are m  N
missing observations on y2. Using the fact that the marginal distribution of yj
is N[µj , σj j ], and that conditionally y2|y1 ∼ N[µ2.1, σ22.1], where µ2.1 = µ2 +
σ12/σ22(y1 − µ1), σ22.1 = (1 − ρ2
)σ22, devise an EM algorithm for imputing the
missing observations on y1.
353
Microeconometrics. Methods and applications by A. Colin Cameron, Pravin K. Trivedi
PART THREE
Simulation-Based
Methods
Part 1 emphasized that microeconometric models are frequently nonlinear models es-
timated using large and heterogeneous data sets drawn from surveys that are complex
and subject to a variety of sampling biases. A realistic depiction of the economic phe-
nomena in such settings often requires the use of models for which estimation and
subsequent statistical inference are difficult. Advances in computing hardware and
software now make it feasible to tackle such tasks. Part 3 presents modern, computer-
intensive, simulation-based methods of estimation and inference that mitigate some of
these difficulties. The background required to cover this material varies somewhat with
the chapter, but the essential base is least squares and maximum likelihood estimation.
Chapter 11 presents bootstrap methods for statistical inference. These methods have
the attraction of providing a simple way to obtain standard errors when the formulae
from asymptotic theory are complex, as is the case, for example, for some two-step
estimators. Furthermore, if implemented appropriately, a bootstrap can lead to a more
refined asymptotic theory that may then lead to better statistical inference in small
samples.
Chapter 12 presents simulation-based estimation methods. These methods permit
estimation in situations where standard computational methods may not permit calcu-
lation of an estimator, because of the presence of an integral over a probability distri-
bution that leads to no closed-form solution.
Chapter 13 surveys Bayesian methods that provide an approach to estimation and
inference that is quite different from the classical approach used in other chapters
of this book. Despite this different approach, in practice in large sample settings the
Bayesian approach produces similar results to those from classical methods. Further,
they often do so in a computationally more efficient manner.
355
Microeconometrics. Methods and applications by A. Colin Cameron, Pravin K. Trivedi
C H A P T E R 11
Bootstrap Methods
11.1. Introduction
Exact finite-sample results are unavailable for most microeconometrics estimators
and related test statistics. The statistical inference methods presented in preceding
chapters rely on asymptotic theory that usually leads to limit normal and chi-square
distributions.
An alternative approximation is provided by the bootstrap, due to Efron (1979,
1982). This approximates the distribution of a statistic by a Monte Carlo simulation,
with sampling done from the empirical distribution or the fitted distribution of the ob-
served data. The additional computation required is usually feasible given advances
in computing power. Like conventional methods, however, bootstrap methods rely on
asymptotic theory and are only exact in infinitely large samples.
The wide range of bootstrap methods can be classified into two broad approaches.
First, the simplest bootstrap methods can permit statistical inference when conven-
tional methods such as standard error computation are difficult to implement. Second,
more complicated bootstraps can have the additional advantage of providing asymp-
totic refinements that can lead to a better approximation in-finite samples. Applied
researchers are most often interested in the first aspect of the bootstrap. Theoreticians
emphasize the second, especially in settings where the usual asymptotic methods work
poorly in finite samples.
The econometrics literature focuses on use of the bootstrap in hypothesis test-
ing, which relies on approximation of probabilities in the tails of the distributions
of statistics. Other applications are to confidence intervals, estimation of standard er-
rors, and bias reduction. The bootstrap is straightforward to implement for smooth
√
N-consistent estimators based on iid samples, though bootstraps with asymptotic re-
finements are underutilized. Caution is needed in other settings, including nonsmooth
estimators such as the median, nonparametric estimators, and inference for data that
are not iid.
A reasonably self-contained summary of the bootstrap is provided in Section 11.2,
an example is given in Section 11.3, and some theory is provided in Section 11.4.
357
BOOTSTRAP METHODS
Further variations of the bootstrap are presented in Section 11.5. Section 11.6 presents
use of the bootstrap for specific types of data and specific methods used often in
microeconometrics.
11.2. Bootstrap Summary
We summarize key bootstrap methods for estimator 
θ and associated statistics based
on an iid sample {w1, . . . , wN }, where usually wi = (yi , xi ) and 
θ is a smooth esti-
mator that is
√
N consistent and asymptotically normally distributed. For notational
simplicity we generally present results for scalar θ. For vector θ in most instances
replace θ by θj , the jth component of θ.
Statistics of interest include the usual regression output: the estimate
θ; standard er-
rors s
θ ; t-statistic t = (
θ − θ0)/s
θ , where θ0 is the null hypothesis value; the associated
critical value or p-value for this statistic; and a confidence interval.
This section presents bootstraps for each of these statistics. Some motivation is also
provided, with the underlying theory sketched in Section 11.4.
11.2.1. Bootstrap without Refinement
Consider estimation of the variance of the sample mean 
µ = ȳ = N−1
N
i=1 yi , where
the scalar random variable yi is iid [µ, σ2
], when it is not known that V[
µ] = σ2
/N.
The variance of 
µ could be obtained by obtaining S such samples of size N from the
population, leading to S sample means and hence S estimates 
µs = ȳs, s = 1, . . . , S.
Then we could estimate V[
µ] by (S − 1)−1
S
s=1(
µs − 
µ)2
, where 
µ = S−1
S
s=1 
µs.
Of course this approach is not possible, as we only have one sample. A bootstrap
can implement this approach by viewing the sample as the population. Then the finite
population is now the actual data y1, . . . , yN . The distribution of 
µ can be obtained
by drawing B bootstrap samples from this population of size N, where each bootstrap
sample of size N is obtained by sampling from y1, . . . , yN with replacement. This
leads to B sample means and hence B estimates 
µb = ȳb, b = 1, . . . , B. Then esti-
mate V[
µ] by (B − 1)−1
B
b=1(
µb − 
µ)2
, where 
µ = B−1
B
b=1 
µb. Sampling with
replacement may seem to be a departure from usual sampling methods, but in fact
standard sampling theory assumes sampling with replacement rather than without re-
placement (see Section 24.2.2).
With additional information other ways to obtain bootstrap samples may be possi-
ble. For example, if it is known that yi ∼ N[µ, σ2
] then we could obtain B bootstrap
samples of size N by drawing from the N[
µ, s2
] distribution. This bootstrap is an
example of a parametric bootstrap, whereas the preceding bootstrap was from the em-
pirical distribution.
More generally, for estimator 
θ similar bootstraps can be used to, for example,
estimate V[
θ] and hence standard errors when analytical formulas for V[
θ] are com-
plex. Such bootstraps are usually valid for observations wi that are iid over i, and they
have similar properties to estimates obtained using the usual asymptotic theory.
358
11.2. BOOTSTRAP SUMMARY
11.2.2. Asymptotic Refinements
In some settings it is possible to improve on the preceding bootstrap and obtain es-
timates that are equivalent to those obtained using a more refined asymptotic theory
that may better approximate the finite-sample distribution of 
θ. Much of this chapter
is directed to such asymptotic refinements.
Usual asymptotic theory uses the result that
√
N(
θ − θ0)
d
→ N[0, σ2
]. Thus
Pr[
√
N(
θ − θ0)/σ ≤ z] = Φ(z) + R1, (11.1)
where Φ(·) is the standard normal cdf and R1 is a remainder term that disappears as
N → ∞.
This result is based on asymptotic theory detailed in Section 5.3 that includes ap-
plication of a central limit theorem. The CLT is based on a truncated power-series
expansion. The Edgeworth expansion, detailed in Section 11.4.3, includes additional
terms in the expansion. With one extra term this yields
Pr[
√
N(
θ − θ0)/σ ≤ z] = Φ(z) +
g1(z)φ(z)
√
N
+ R2, (11.2)
where φ(·) is the standard normal density, g1(·) is a bounded function given after
(11.13) in Section 11.4.3 and R2 is a remainder term that disappears as N → ∞.
The Edgeworth expansion is difficult to implement theoretically as the function
g1(·) is data dependent in a complicated way. A bootstrap with asymptotic refinement
provides a simple computational method to implement the Edgeworth expansion. The
theory is given in Section 11.4.4.
Since R1 = O(N−1/2
) and R2 = O(N−1
), asymptotically R2  R1, leading to a
better approximation as N → ∞. However, in finite samples it is possible that R2 
R1. A bootstrap with asymptotic refinement provides a better approximation asymptot-
ically that hopefully leads to a better approximation in samples of the finite sizes typ-
ically used. Nevertheless, there is no guarantee and simulation studies are frequently
used to verify that finite-sample gains do indeed occur.
11.2.3. Asymptotically Pivotal Statistic
For asymptotic refinement to occur, the statistic being bootstrapped must be an asymp-
totically pivotal statistic, meaning a statistic whose limit distribution does not depend
on unknown parameters. This result is explained in Section 11.4.4.
As an example, consider sampling from yi ∼ [µ, σ2
]. Then the estimate 
µ = ȳ
a
∼
N[µ, σ2
/N] is not asymptotically pivotal even given a null hypothesis value µ = µ0
since its distribution depends on the unknown parameter σ2
. However, the studentized
statistic t = (
µ − µ0)/s
µ
a
∼ N[0, 1] is asymptotically pivotal.
Estimators are usually not asymptotically pivotal. However, conventional asymp-
totically standard normal or chi-squared distributed test statistics, including Wald,
Lagrange multiplier, and likelihood ratio tests, and related confidence intervals, are
asymptotically pivotal.
359
BOOTSTRAP METHODS
11.2.4. The Bootstrap
In this section we provide a broad description of the bootstrap, with further details
given in subsequent sections.
Bootstrap Algorithm
A general bootstrap algorithm is as follows:
1. Given data w1, . . . , wN , draw a bootstrap sample of size N using a method given in the
following and denote this new sample w∗
1, . . . , w∗
N .
2. Calculate an appropriate statistic using the bootstrap sample. Examples include (a) the
estimate 
θ
∗
of θ, (b) the standard error s
θ
∗ of the estimate 
θ
∗
, and (c) a t-statistic
t∗
= (
θ
∗
−
θ)/s
θ
∗ centered at the original estimate
θ. Here
θ
∗
and s
θ
∗ are calculated in
the usual way but using the new bootstrap sample rather than the original sample.
3. Repeat steps 1 and 2 B independent times, where B is a large number, obtaining B
bootstrap replications of the statistic of interest, such as
θ
∗
1 , . . . ,
θ
∗
B or t∗
1 , . . . , t∗
B.
4. Use these B bootstrap replications to obtain a bootstrapped version of the statistic, as
detailed in the following subsections.
Implementation can vary according to how bootstrap samples are obtained, how
many bootstraps are performed, what statistic is being bootstrapped, and whether or
not that statistic is asymptotically pivotal.
Bootstrap Sampling Methods
The bootstrap dgp in step 1 is used to approximate the true unknown dgp.
The simplest bootstrapping method is to use the empirical distribution of the data,
which treats the sample as being the population. Then w∗
1, . . . , w∗
N are obtained by
sampling with replacement from w1, . . . , wN . In each bootstrap sample so obtained,
some of the original data points will appear multiple times whereas others will not
appear at all. This method is an empirical distribution function (EDF) bootstrap
or nonparametric bootstrap. It is also called a paired bootstrap since in single-
equation regression models wi = (yi , xi ), so here both yi and xi are resampled.
Suppose the conditional distribution of the data is specified, say y|x ∼ F(x, θ0), and
an estimate
θ
p
→ θ0 is available. Then in step 1 we can instead form a bootstrap sample
by using the original xi while generating yi by random draws from F(xi ,
θ). This
corresponds to regressors fixed in repeated samples (see Section 4.4.5). Alternatively,
we may first resample x∗
i from x1, . . . , xN and then generate yi from F(x∗
i ,
θ), i =
1, . . . , N. Both are examples of a parametric bootstrap that can be applied in fully
parametric models.
For regression model with additive iid error, say yi = g(xi , β) + ui , we can form
fitted residuals 
u1, . . . ,
uN , where 
ui = yi − g(xi ,
β). Then in step 1 bootstrap from
these residuals to get a new draw of residuals, say (
u∗
1 , . . . ,
u∗
N ), leading to a bootstrap
sample (y∗
1 , x1), . . . , (y∗
N , xN ), where y∗
i = g(xi ,
β) + u∗
i . This bootstrap is called a
360
11.2. BOOTSTRAP SUMMARY
residual bootstrap. It uses information intermediate between the nonparametric and
parametric bootstrap. It can be applied if the error term has distribution that does not
depend on unknown parameters.
We emphasize the paired bootstrap on grounds of its simplicity, applicability to
a wide range of nonlinear models, and reliance on weak distributional assumptions.
However, the other bootstraps generally provide a better approximation (see Horowitz,
2001, p. 3185) and should be used if the stronger model assumptions they entail are
warranted.
The Number of Bootstraps
The bootstrap asymptotics rely on N → ∞ and so the bootstrap can be asymptotically
valid even for low B. However, clearly the bootstrap is more accurate as B → ∞. A
sufficiently large value of B varies with one’s tolerance for bootstrap-induced simula-
tion error and with the purpose of the bootstrap.
Andrews and Buchinsky (2000) present an application-specific numerical method
to determine the number of replications B needed to ensure a given level of accuracy
or, equivalently, the level of accuracy obtained for a given value of B. Let λ denote
the quantity of interest, such as a standard error or a critical value,
λ∞ denote the ideal
bootstrap estimate with B = ∞, and 
λB denote the estimate with B bootstraps. Then
Andrews and Buchinsky (2000) show that
√
B(
λB −
λ∞)/
λ∞
d
→ N[0, ω],
where ω varies with the application and is defined in Table III of Andrews and Buchin-
sky (2000). It follows that Pr[δ ≤ zτ/2
√
ω/B] = 1 − τ, where δ = |
λB −
λ∞|/
λ∞
denotes the relative discrepancy caused by only B replications. Thus B ≥ ωz2
τ/2/δ2
ensures the relative discrepancy is less than δ with probability at least 1 − τ. Alterna-
tively, given B replications the relative discrepancy is less than δ = zτ/2
√
ω/B.
To provide concrete guidelines we propose the rule of thumb that
B = 384ω.
This ensures that the relative discrepancy is less than 10% with probability at least
0.95, since z2
.025/0.12
= 384. The only difficult part in implementation is estimation of
ω, which varies with the application.
For standard error estimation ω = (2 + γ4)/4, where γ4 is the coefficient of excess
kurtosis for the bootstrap estimator
θ
∗
. Intuitively, fatter tails in the distribution of the
estimator mean outliers are more likely, contaminating standard error estimation. It
follows that B = 384 × (1/2) = 192 is enough if γ4 = 0 whereas B = 960 is needed
if γ4 = 8. These values are higher than those proposed by Efron and Tibsharani (1993,
p. 52), who state that B = 200 is almost always enough.
For a symmetric two-sided test or confidence interval at level α, ω = α(1 −
α)/[2zα/2φ(zα/2)]2
. This leads to B = 348 for α = 0.05 and B = 685 for α = 0.01.
As expected more bootstraps are needed the further one goes into the tails of the
distribution.
361
BOOTSTRAP METHODS
For a one-sided test or nonsymmetric two-sided test or confidence interval at level
α, ω = α(1 − α)/[zαφ(zα)]2
. This leads to B = 634 for α = 0.05 and B = 989 for
α = 0.01. More bootstraps are needed when testing in one tail. For chi-squared tests
with h degrees of freedom ω = α(1 − α)/[χ2
α(h) f (χ2
α(h))]2
, where f (·) is the χ2
(h)
density.
For test p-values ω = (1 − p)/p. For example, if p = 0.05 then ω = 19 and B =
7,296. Many more bootstraps are needed for precise calculation of the test p-value
compared to hypothesis rejection if a critical value is exceeded.
For bias-corrected estimation of θ a simple rule uses 
ω = 
σ2
/
θ
2
, where the esti-
mator 
θ has standard error 
σ. For example, if the usual t-statistic t = 
θ/
σ = 2 then

ω = 1/4 and B = 96. Andrews and Buchinsky (2000) provide many more details and
refinements of these results.
For hypothesis testing, Davidson and MacKinnon (2000) provide an alternative
approach. They focus on the loss of power caused by bootstrapping with finite B.
(Note that there is no power loss if B = ∞.) On the basis of simulations they recom-
mend at least B = 399 for tests at level 0.05, and at least B = 1,499 for tests at level
0.01. They argue that for testing their approach is superior to that of Andrews and
Buchinsky.
Several other papers by Davidson and MacKinnon, summarized in MacKinnon
(2002), emphasize practical considerations in bootstrap inference. For hypothesis test-
ing at level α choose B so that α(B + 1) is an integer. For example, at α = 0.05 let
B = 399 rather than 400. If instead B = 400 it is unclear on an upper one-sided al-
ternative test whether the 20th or 21st largest bootstrap t-statistic is the critical value.
For nonlinear models computation can be reduced by performing only a few Newton–
Raphson iterations in each bootstrap sample from starting values equal to the initial
parameter estimates.
11.2.5. Standard Error Estimation
The bootstrap estimate of variance of an estimator is the usual formula for estimating
a variance, applied to the B bootstrap replications
θ
∗
1 , . . . ,
θ
∗
B:
s2

θ,Boot
=
1
B − 1
B
	
b=1
(
θ
∗
b −
θ
∗
)2
, (11.3)
where

θ
∗
= B−1
B
	
b=1

θ
∗
b . (11.4)
Taking the square root yields s
θ,Boot, the bootstrap estimate of the standard error.
This bootstrap provides no asymptotic refinement. Nonetheless, it can be ex-
traordinarily useful when it is difficult to obtain standard errors using conventional
methods. There are many examples. The estimate 
θ may be a sequential two-step
m-estimator whose standard error is difficult to compute using the results given in
Secttion 6.8. The estimate 
θ may be a 2SLS estimator estimated using a package that
362
11.2. BOOTSTRAP SUMMARY
only reports standard errors assuming homoskedastic errors but the errors are actu-
ally heteroskedastic. The estimate
θ may be a function of other parameters that are
actually estimated, for example, 
θ = 
α/
β, and the bootstrap can be used instead of
the delta method. For clustered data with many small clusters, such as short panels,
cluster-robust standard errors can be obtained by resampling the clusters.
Since the bootstrap estimate s
θ,Boot is consistent, it can be used in place of s
θ in
the usual asymptotic formula to form confidence intervals and hypothesis tests that
are asymptotically valid. Thus asymptotic statistical inference is possible in settings
where it is difficult to obtain standard errors by other methods. However, there will be
no improvement in finite-sample performance. To obtain an asymptotic refinement
the methods of the next section are needed.
11.2.6. Hypothesis Testing
Here we consider tests on an individual coefficient, denoted θ. The test may be either
an upper one-tailed alternative of H0 : θ ≤ θ0 against Ha : θ  θ0 or a two-sided test
of H0 : θ = θ0 against Ha : θ = θ0. Other tests are deferred to Section 11.6.3.
Tests with Asymptotic Refinement
The usual test statistic TN = (
θ − θ0)/s
θ provides the potential for asymptotic refine-
ment, as it is asymptotically pivotal since its asymptotic standard normal distribution
does not depend on unknown parameters. We perform B bootstrap replications pro-
ducing B test statistics t∗
1 , . . . , t∗
B, where
t∗
b = (
θ
∗
b −
θ)/s
θ
∗
b
. (11.5)
The estimates t∗
b are centered around the original estimate 
θ since resampling is
from a distribution centered around 
θ. The empirical distribution of t∗
1 , . . . , t∗
B, or-
dered from smallest to largest, is then used to approximate the distribution of TN as
follows.
For an upper one-tailed alternative test the bootstrap critical value (at level α)
is the upper α quantile of the B ordered test statistics. For example, if B = 999 and
α = 0.05 then the critical value is the 950th highest value of t∗
, since then (B + 1)(1 −
α) = 950. For a similar lower tail one-sided test the critical value is the 50th smallest
value of t∗
.
One can also compute a bootstrap p-value in the obvious way. For example, if the
original statstistic t lies between the 914th and 915th largest values of 999 bootstrap
replicates then the p-value for a upper one-tailed alternative test is 1 − 914/(B + 1) =
0.086.
For a two-sided test a distinction needs to be made between symmetrical and
nonsymmetrical tests. For a nonsymmetrical test or equal-tailed test the bootstrap
critical values (at level α) are the lower α/2 and upper α/2 quantiles of the ordered
test statistics t∗
, and the null hypothesis is rejected at level α if the original t-statistic
lies outside this range. For a symmetrical test we instead order |t∗
| and the bootstrap
363
BOOTSTRAP METHODS
critical value (at level α) is the upper α quantile of the ordered |t∗
|. The null hypoth-
esis is rejected at level α if |t| exceeds this critical value.
These tests, using the percentile-t method, provide asymptotic refinements. For a one-sided t-test and for a nonsymmetrical two-sided t-test the true size of the test is $\alpha + O(N^{-1/2})$ with standard asymptotic critical values and $\alpha + O(N^{-1})$ with bootstrap critical values. For a two-sided symmetrical t-test or for an asymptotic chi-square test the asymptotic approximations work better, and the true size of the test is $\alpha + O(N^{-1})$ using standard asymptotic critical values and $\alpha + O(N^{-2})$ using bootstrap critical values.
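To make the mechanics concrete, the following minimal Python (NumPy) sketch computes percentile-t bootstrap critical values and a p-value. It is ours rather than the text's, the function name and defaults are illustrative, and it assumes the B bootstrap estimates and their standard errors have already been obtained by one of the resampling schemes of Section 11.2.4.

```python
import numpy as np

def percentile_t_test(theta_hat, se_hat, theta_star, se_star,
                      theta0=0.0, alpha=0.05):
    """Bootstrap tests of H0: theta = theta0 using the percentile-t method.

    theta_hat, se_hat   : estimate and standard error from the original sample
    theta_star, se_star : length-B arrays of bootstrap estimates and standard errors
    """
    theta_star = np.asarray(theta_star)
    se_star = np.asarray(se_star)
    B = theta_star.size

    # Bootstrap t-statistics, centered on the original estimate as in (11.5)
    t_star = (theta_star - theta_hat) / se_star
    t_obs = (theta_hat - theta0) / se_hat

    # Nonsymmetrical (equal-tailed) test: lower and upper alpha/2 quantiles of t*
    crit_lo, crit_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    reject_equal_tailed = (t_obs < crit_lo) or (t_obs > crit_hi)

    # Symmetrical test: upper alpha quantile of |t*|
    crit_sym = np.quantile(np.abs(t_star), 1 - alpha)
    reject_symmetric = abs(t_obs) > crit_sym

    # Upper one-tailed bootstrap p-value (one common convention)
    p_upper = (1 + np.sum(t_star >= t_obs)) / (B + 1)

    return {"t": t_obs,
            "crit_equal_tailed": (crit_lo, crit_hi),
            "reject_equal_tailed": reject_equal_tailed,
            "crit_symmetric": crit_sym,
            "reject_symmetric": reject_symmetric,
            "p_upper_one_tailed": p_upper}
```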
Tests without Asymptotic Refinement
Alternative bootstrap methods can be used that, although asymptotically valid, do not provide an asymptotic refinement.
One approach, already mentioned at the end of Section 11.2.5, is to compute $t = (\hat\theta - \theta_0)/s_{\hat\theta,\text{boot}}$, where the bootstrap estimate $s_{\hat\theta,\text{boot}}$ given in (11.3) replaces the usual estimate $s_{\hat\theta}$, and compare this test statistic to critical values from the standard normal distribution.
A second approach, exposited here for a two-sided test of $H_0: \theta = \theta_0$ against $H_a: \theta \neq \theta_0$, finds the lower α/2 and upper α/2 quantiles of the bootstrap estimates $\hat\theta_1^*, \ldots, \hat\theta_B^*$ and rejects $H_0$ if $\theta_0$ falls outside this region. This is called the percentile method. Asymptotic refinement is instead obtained by using $t_b^*$ in (11.5), which centers on $\hat\theta$ rather than $\theta_0$ and uses a different standard error $s_{\hat\theta^*}$ in each bootstrap.
These two bootstraps have the attraction of not requiring computation of $s_{\hat\theta}$, the usual standard error estimate based on asymptotic theory.
11.2.7. Confidence Intervals
Much of the statistics literature considers confidence interval estimation rather than its
flip side of hypothesis tests. Here instead we began with hypothesis tests, so only a
brief presentation of confidence intervals is necessary.
An asymptotic refinement is based on the t-statistic, which is asymptotically pivotal. Thus from steps 1–3 in Section 11.2.4 we obtain bootstrap replication t-statistics $t_1^*, \ldots, t_B^*$. Then let $t^*_{[1-\alpha/2]}$ and $t^*_{[\alpha/2]}$ denote the lower and upper α/2 quantiles of these t-statistics. The percentile-t method 100(1 − α) percent confidence interval is
$$\left(\hat\theta - t^*_{[1-\alpha/2]} \times s_{\hat\theta},\; \hat\theta + t^*_{[\alpha/2]} \times s_{\hat\theta}\right), \qquad (11.6)$$
where $\hat\theta$ and $s_{\hat\theta}$ are the estimate and standard error from the original sample.
An alternative is the bias-corrected and accelerated (BCa) method detailed in
Efron (1987). This offers an asymptotic refinement in a wider class of problems than
the percentile-t method.
Other methods provide an asymptotically valid confidence interval, but without asymptotic refinement. First, one can use the bootstrap estimate of the standard error in the usual confidence interval formula, leading to the interval $(\hat\theta - z_{[1-\alpha/2]} \times s_{\hat\theta,\text{boot}},\; \hat\theta + z_{[\alpha/2]} \times s_{\hat\theta,\text{boot}})$. Second, the percentile method confidence interval is the interval between the lower α/2 and upper α/2 quantiles of the B bootstrap estimates $\hat\theta_1^*, \ldots, \hat\theta_B^*$ of θ.
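The following Python (NumPy) sketch, which is ours and not from the text, returns all three intervals. The equal-tailed percentile-t form follows the convention used in the numerical illustration of Section 11.3, and the default z = 1.96 is the standard normal critical value for α = 0.05.

```python
import numpy as np

def bootstrap_confidence_intervals(theta_hat, se_hat, theta_star, se_star,
                                   alpha=0.05, z=1.96):
    """Three bootstrap confidence intervals for theta at level 1 - alpha."""
    theta_star = np.asarray(theta_star)
    t_star = (theta_star - theta_hat) / np.asarray(se_star)
    q_lo, q_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])

    # Percentile-t interval (asymptotic refinement), equal-tailed form as in
    # the worked example of Section 11.3
    ci_percentile_t = (theta_hat + q_lo * se_hat, theta_hat + q_hi * se_hat)

    # Normal interval using the bootstrap standard error (no refinement)
    se_boot = np.std(theta_star, ddof=1)
    ci_boot_se = (theta_hat - z * se_boot, theta_hat + z * se_boot)

    # Percentile interval: quantiles of the bootstrap estimates (no refinement)
    ci_percentile = tuple(np.quantile(theta_star, [alpha / 2, 1 - alpha / 2]))

    return ci_percentile_t, ci_boot_se, ci_percentile
```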
11.2.8. Bias Reduction
Nonlinear estimators are usually biased in finite samples, though this bias goes to zero asymptotically if the estimator is consistent. For example, if $\mu^3$ is estimated by $\hat\theta = \bar y^3$, where $y_i$ is iid $[\mu, \sigma^2]$, then $E[\hat\theta - \mu^3] = 3\mu\sigma^2/N + E[(y-\mu)^3]/N^2$.
More generally, for a $\sqrt N$-consistent estimator
$$E[\hat\theta - \theta_0] = \frac{a_N}{N} + \frac{b_N}{N^2} + \frac{c_N}{N^3} + \cdots, \qquad (11.7)$$
where $a_N$, $b_N$, and $c_N$ are bounded constants that vary with the data and estimator (see Hall, 1992, p. 53). An alternative estimator $\tilde\theta$ provides an asymptotic refinement if
$$E[\tilde\theta - \theta_0] = \frac{B_N}{N^2} + \frac{C_N}{N^3} + \cdots, \qquad (11.8)$$
where $B_N$ and $C_N$ are bounded constants. For both estimators the bias disappears as $N \to \infty$. The latter estimator has the attraction that the bias goes to zero at a faster rate, and hence it is an asymptotic refinement, though in finite samples it is possible that $(B_N/N^2) > (a_N/N + b_N/N^2)$.
We wish to estimate the bias $E[\hat\theta] - \theta$. This is the distance between the expected value, or population average value, of the estimator and the parameter value generating the data. The bootstrap replaces the population with the sample, so that the bootstrap samples are generated by the parameter $\hat\theta$, which has average value $\hat\theta^*$ over the bootstraps. The bootstrap estimate of the bias is then
$$\widehat{\text{Bias}}_{\hat\theta} = (\hat\theta^* - \hat\theta), \qquad (11.9)$$
where $\hat\theta^*$ is defined in (11.4).
Suppose, for example, that $\hat\theta = 4$ and $\hat\theta^* = 5$. Then the estimated bias is $(5 - 4) = 1$, an upward bias of 1. Since $\hat\theta$ overestimates by 1, bias correction requires subtracting 1 from $\hat\theta$, giving a bias-corrected estimate of 3. More generally, the bootstrap bias-corrected estimator of θ is
$$\hat\theta_{\text{Boot}} = \hat\theta - (\hat\theta^* - \hat\theta) = 2\hat\theta - \hat\theta^*. \qquad (11.10)$$
Note that $\hat\theta^*$ itself is not the bias-corrected estimate. For more details on the direction of the correction, which may seem puzzling, see Efron and Tibshirani (1993, p. 138). For typical $\sqrt N$-consistent estimators the asymptotic bias of $\hat\theta$ is $O(N^{-1})$ whereas the asymptotic bias of $\hat\theta_{\text{Boot}}$ is instead $O(N^{-2})$.
In practice bias correction is seldom used for $\sqrt N$-consistent estimators, as the bootstrap estimate can be more variable than the original estimate $\hat\theta$ and the bias is often small relative to the standard error of the estimate. Bootstrap bias correction is used for estimators that converge at rate less than $\sqrt N$, notably nonparametric regression and density estimators.
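A minimal Python (NumPy) sketch of the bias correction in (11.9)–(11.10); the function name is ours.

```python
import numpy as np

def bootstrap_bias_corrected(theta_hat, theta_star):
    """Bootstrap bias-corrected estimate (11.10): 2*theta_hat - mean(theta_star)."""
    bias = np.mean(theta_star) - theta_hat   # estimated bias, equation (11.9)
    return theta_hat - bias                  # equals 2*theta_hat - np.mean(theta_star)

# Example from the text: theta_hat = 4 with bootstrap mean 5 gives estimated
# bias 1 and bias-corrected estimate 3.
```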
11.3. Bootstrap Example
As a bootstrap example, consider the exponential regression model introduced in Section 5.9. Here the data are generated from an exponential distribution with an exponential mean function of two regressors:
$$y_i|\mathbf{x}_i \sim \text{exponential}(\lambda_i), \quad i = 1, \ldots, 50,$$
$$\lambda_i = \exp(\beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i}),$$
$$(x_{2i}, x_{3i}) \sim N[0.1,\, 0.1;\; 0.1^2,\, 0.1^2,\, 0.005],$$
$$(\beta_1, \beta_2, \beta_3) = (-2, 2, 2).$$
Maximum likelihood estimation on a sample of 50 observations yields $\hat\beta_1 = -2.192$; $\hat\beta_2 = 0.267$, $s_2 = 1.417$, and $t_2 = 0.188$; and $\hat\beta_3 = 4.664$, $s_3 = 1.741$, and $t_3 = 2.679$. For this ML example the standard errors were based on $-\hat{\mathbf{A}}^{-1}$, minus the inverse of the estimated Hessian matrix.
We concentrate on statistical inference for β3 and demonstrate the bootstrap for
standard error computation, test of statistical significance, confidence intervals, and
bias correction. The differences between bootstrap and usual asymptotic estimates are
relatively small in this example and can be much larger in other examples.
The results reported here are based on the paired bootstrap (see Section 11.2.4) with $(y_i, x_{2i}, x_{3i})$ jointly resampled with replacement $B = 999$ times. From Table 11.1, the 999 bootstrap replication estimates $\hat\beta_{3,b}^*$, $b = 1, \ldots, 999$, had mean 4.716 and standard deviation 1.939. Table 11.1 also gives key percentiles for $\hat\beta_3^*$ and $t_3^*$ (defined in the following).
A parametric bootstrap could have been used instead. Then bootstrap samples would be obtained by drawing $y_i$ from the exponential distribution with parameter $\exp(\hat\beta_1 + \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i})$. In the case of tests of $H_0: \beta_3 = 0$ the exponential parameter could instead be $\exp(\hat\beta_1 + \hat\beta_2 x_{2i})$, where $\hat\beta_1$ and $\hat\beta_2$ are then the restricted ML estimates from the original sample.
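The Python (NumPy) sketch below, which is ours, generates one sample from a dgp of this form and runs the paired bootstrap. It assumes the rate parametrization of the exponential density, $f(y|\lambda_i) = \lambda_i e^{-\lambda_i y}$, reads the covariance matrix entries above as $0.1^2$, and uses Hessian-based standard errors, so with a different seed or parametrization the numbers will not reproduce Table 11.1 exactly.

```python
import numpy as np

rng = np.random.default_rng(12345)
N, B = 50, 999
beta_true = np.array([-2.0, 2.0, 2.0])

# Regressors: bivariate normal, means 0.1, variances 0.1**2, covariance 0.005
x23 = rng.multivariate_normal([0.1, 0.1], [[0.01, 0.005], [0.005, 0.01]], size=N)
X = np.column_stack([np.ones(N), x23])
lam = np.exp(X @ beta_true)              # rate parameter (our assumption)
y = rng.exponential(scale=1.0 / lam)     # numpy's scale is 1/rate

def exp_mle(y, X, tol=1e-10, max_iter=100):
    """Newton-Raphson ML for exponential regression with rate exp(X @ beta).
    Standard errors come from minus the inverse of the estimated Hessian."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        lam = np.exp(X @ beta)
        grad = X.T @ (1.0 - y * lam)                # score of the log-likelihood
        hess = -(X * (y * lam)[:, None]).T @ X      # Hessian of the log-likelihood
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(-hess)))
    return beta, se

beta_hat, se_hat = exp_mle(y, X)

# Paired bootstrap: resample (y_i, x_2i, x_3i) jointly with replacement
b3_star, t3_star = np.empty(B), np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)
    bb, sb = exp_mle(y[idx], X[idx])
    b3_star[b] = bb[2]
    t3_star[b] = (bb[2] - beta_hat[2]) / sb[2]      # centered on the original estimate

print("beta3_hat, asymptotic se :", beta_hat[2], se_hat[2])
print("bootstrap se             :", b3_star.std(ddof=1))
print("percentile-t 2.5%/97.5%  :", np.quantile(t3_star, [0.025, 0.975]))
print("bias-corrected beta3     :", 2 * beta_hat[2] - b3_star.mean())
```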
Standard errors: From (11.3) the bootstrap estimate of standard error is computed
using the usual standard deviation formula for the 999 bootstrap replication esti-
mates of β3. This yields estimate 1.939 compared to the usual asymptotic standard
error estimate of 1.741. Note that this bootstrap offers no refinement and would
only be used as a check or if finding the standard error by other means proved
difficult.
Hypothesis testing with asymptotic refinement: We consider a test of $H_0: \beta_3 = 0$ against $H_a: \beta_3 \neq 0$ at level 0.05. A test with asymptotic refinement is based on the t-statistic, which is asymptotically pivotal. From Section 11.2.6, for each bootstrap resample we compute $t_3^* = (\hat\beta_3^* - 4.664)/s_{\hat\beta_3^*}$, which is centered on the estimate $\hat\beta_3 = 4.664$ from the original sample. For a nonsymmetrical test the bootstrap critical values equal the lower and upper 2.5 percentiles of the 999 values of $t_3^*$, the 25th lowest and 25th highest values. From Table 11.1 these are −2.183 and 2.066. Since the t-statistic computed from the original sample, $t_3 = (4.664 - 0)/1.741 = 2.679$, exceeds 2.066, the null hypothesis is rejected. A symmetrical test that instead uses the upper 5th percentile of $|t_3^*|$ yields bootstrap critical value 2.078, which again leads to rejection of $H_0$ at level 0.05.

Table 11.1. Bootstrap Statistical Inference on a Slope Coefficient: Example(a)

            $\hat\beta_3^*$   $t_3^*$    $z = t(\infty)$   $t(47)$
Mean          4.716            0.026      1.021             1.000
SD(b)         1.939            1.047      1.000             1.021
1%           -0.336           -2.664     -2.326            -2.408
2.5%          0.501           -2.183     -1.960            -2.012
5%            1.545           -1.728     -1.645            -1.678
25%           3.570           -0.621     -0.675            -0.680
50%           4.772            0.062      0.000             0.000
75%           5.971            0.703      0.675             0.680
95%           7.811            1.706      1.645             1.678
97.5%         8.484            2.066      1.960             2.012
99.0%         9.427            2.529      2.326             2.408

(a) Summary statistics and percentiles based on 999 paired bootstrap resamples for (1) the estimate $\hat\beta_3^*$; (2) the associated statistic $t_3^* = (\hat\beta_3^* - \hat\beta_3)/s_{\hat\beta_3^*}$; (3) the Student t-distribution with 47 degrees of freedom; (4) the standard normal distribution. The original dgp is one draw from the exponential distribution given in the text; the sample size is 50.
(b) SD, standard deviation.
The bootstrap critical values in this example exceed in absolute value those using the asymptotic approximation of either the standard normal or the t(47) distribution, the latter an ad hoc finite-sample adjustment motivated by the exact result for linear regression under normality. So the usual asymptotic results in this example lead to overrejection and have actual size that exceeds the nominal size. For example, at 5% the standard normal critical values of (−1.960, 1.960) are smaller in absolute value than the bootstrap critical values of (−2.183, 2.066). Figure 11.1 plots the bootstrap estimate of the density of the t-statistic, based on $t_3^*$ and smoothed using kernel methods, and compares it to the standard normal. The two densities appear close, though the left tail is notably fatter for the bootstrap estimate. Table 11.1 makes clearer the difference in the tails.
Hypothesis testing without asymptotic refinement: Alternative bootstrap testing methods can be used but do not offer an asymptotic refinement. First, using the bootstrap standard error estimate of 1.939, rather than the asymptotic standard error estimate of 1.741, yields $t_3 = (4.664 - 0)/1.939 = 2.405$. This leads to rejection at level 0.05 using either standard normal or t(47) critical values. Second, from Table 11.1, 95% of the bootstrap estimates $\hat\beta_3^*$ lie in the range (0.501, 8.484), which does not include the hypothesized value of 0, so again we reject $H_0: \beta_3 = 0$.
Figure 11.1: Bootstrap density of t-test statistic for slope equal to zero obtained from 999 bootstrap replications, with standard normal density plotted for comparison. Data are generated from an exponential distribution regression model.
Confidence intervals: An asymptotic refinement is obtained using the 95% percentile-
t confidence interval. Applying (11.6) yields (4.664 − 2.183 × 1.741, 4.664 +
2.066 × 1.741) or (0.864, 8.260). This compares to a conventional 95% asymptotic
confidence interval of 4.664 ± 1.960 × 1.741 or (1.25, 8.08).
Other confidence intervals can be constructed, but these do not have an asymp-
totic refinement. Using the bootstrap standard error estimate leads to a 95% con-
fidence interval 4.664 ± 1.960 × 1.939 = (0.864, 8.464). The percentile method
uses the lower and upper 2.5 percentiles of the 999 bootstrap coefficient estimates,
leading to a 95% confidence interval of (0.501, 8.484).
Bias correction: The mean of the 999 bootstrap replication estimates of β3 is
4.716, compared to the original estimate of 4.664. The estimated bias of (4.716 −
4.664) = 0.052 is quite small, especially compared to the standard error of s3 =
1.741. The estimated bias is upward and (11.10) yields a bias-corrected estimate of
β3 equal to 4.664 − 0.052 = 4.612.
The bootstrap relies on asymptotic theory and may actually provide a finite-
sample approximation worse than that of conventional methods. To determine that
the bootstrap is really an improvement here we need a full Monte Carlo analysis
with, say, 1,000 samples of size 50 drawn from the exponential dgp, with each of
these samples then bootstrapped, say, 999 times.
11.4. Bootstrap Theory
The exposition here follows the comprehensive survey of Horowitz (2001). Key results
are consistency of the bootstrap and, if the bootstrap is applied to an asymptotically
pivotal statistic, asymptotic refinement.
11.4.1. The Bootstrap
We use $X_1, \ldots, X_N$ as generic notation for the data, where for notational simplicity bold is not used for $X_i$ even though it is usually a vector, such as $(y_i, \mathbf{x}_i)$. The data are assumed to be independent draws from a distribution with cdf $F_0(x) = \Pr[X \le x]$. In the simplest applications $F_0$ is in a finite-dimensional family, with $F_0 = F_0(x, \theta_0)$.
The statistic being considered is denoted $T_N = T_N(X_1, \ldots, X_N)$. The exact finite-sample distribution of $T_N$ is $G_N = G_N(t, F_0) = \Pr[T_N \le t]$. The problem is to find a good approximation to $G_N$.
Conventional asymptotic theory uses the asymptotic distribution of $T_N$, denoted $G_\infty = G_\infty(t, F_0)$. This may depend on the unknown $F_0$, in which case we use a consistent estimate of $F_0$; for example, use $\hat F_0 = F_0(\cdot, \hat\theta)$, where $\hat\theta$ is consistent for $\theta_0$.
The empirical bootstrap takes a quite different approach to approximating $G_N(\cdot, F_0)$. Rather than replace $G_N$ by $G_\infty$, the population cdf $F_0$ is replaced by a consistent estimator $F_N$ of $F_0$, such as the empirical distribution of the sample.
$G_N(\cdot, F_N)$ cannot be determined analytically but can be approximated by bootstrapping. One bootstrap resample with replacement yields the statistic $T_N^* = T_N(X_1^*, \ldots, X_N^*)$. Repeating this step B independent times yields replications $T_{N,1}^*, \ldots, T_{N,B}^*$. The empirical cdf of $T_{N,1}^*, \ldots, T_{N,B}^*$ is the bootstrap estimate of the distribution of $T_N$, yielding
$$\hat G_N(t, F_N) = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}(T_{N,b}^* \le t), \qquad (11.11)$$
where $\mathbf{1}(A)$ equals one if event A occurs and equals zero otherwise. This is just the proportion of the bootstrap resamples for which the realized $T_N^* \le t$.
The notation is summarized in Table 11.2.
Table 11.2. Bootstrap Theory Notation

Quantity                      Notation
Sample (iid)                  $X_1, \ldots, X_N$, where $X_i$ is usually a vector
Population cdf of X           $F_0 = F_0(x, \theta_0) = \Pr[X \le x]$
Statistic of interest         $T_N = T_N(X_1, \ldots, X_N)$
Finite-sample cdf of $T_N$    $G_N = G_N(t, F_0) = \Pr[T_N \le t]$
Limit cdf of $T_N$            $G_\infty = G_\infty(t, F_0)$
Asymptotic cdf of $T_N$       $\hat G_\infty = G_\infty(t, \hat F_0)$, where $\hat F_0 = F_0(x, \hat\theta)$
Bootstrap cdf of $T_N$        $\hat G_N(t, F_N) = B^{-1}\sum_{b=1}^{B} \mathbf{1}(T_{N,b}^* \le t)$

11.4.2. Consistency of the Bootstrap

The bootstrap estimate $\hat G_N(t, F_N)$ clearly converges to $G_N(t, F_N)$ as the number of bootstraps $B \to \infty$. Consistency of the bootstrap estimate $\hat G_N(t, F_N)$ for $G_N(t, F_0)$ therefore requires that
$$G_N(t, F_N) \overset{p}{\to} G_N(t, F_0),$$
uniformly in the statistic t and for all F0 in the space of permitted cdfs.
Clearly, FN must be consistent for F0. Additionally, smoothness in the dgp F0(x) is
needed, so that FN (x) and F0(x) are close to each other uniformly in the observations
x for large N. Moreover, smoothness in GN (·, F), the cdf of the statistic considered as
a functional of F, is required so that GN (·, FN ) is close to GN (·, F0) when N is large.
Horowitz (2001, pp. 3166–3168) gives two formal theorems, one general and one
for iid data, and provides examples of potential failure of the bootstrap, including
estimation of the median and estimation with boundary constraints on parameters.
Subject to consistency of FN for F0 and smoothness requirements on F0 and GN ,
the bootstrap leads to consistent estimates and asymptotically valid inference. The
bootstrap is consistent in a very wide range of settings.
11.4.3. Edgeworth Expansions
An additional attraction of the bootstrap is that it allows for asymptotic refinement.
Singh (1981) provided a proof using Edgeworth expansions, which we now introduce.
Consider the asymptotic behavior of $Z_N = \sum_i X_i/\sqrt N$, where for simplicity the $X_i$ are standardized scalar random variables that are iid [0, 1]. Then application of a central limit theorem leads to a limit standard normal distribution for $Z_N$. More precisely, $Z_N$ has cdf
$$G_N(z) = \Pr[Z_N \le z] = \Phi(z) + O(N^{-1/2}), \qquad (11.12)$$
where $\Phi(\cdot)$ is the standard normal cdf. The remainder term is ignored and regular asymptotic theory approximates $G_N(z)$ by $G_\infty(z) = \Phi(z)$.
The CLT leading to (11.12) is formally derived by a simple approximation of the characteristic function of $Z_N$, $E[e^{isZ_N}]$, where $i = \sqrt{-1}$. A better approximation expands this characteristic function in powers of $N^{-1/2}$. The usual Edgeworth expansion adds two additional terms, leading to
$$G_N(z) = \Pr[Z_N \le z] = \Phi(z) + \frac{g_1(z)}{\sqrt N} + \frac{g_2(z)}{N} + O(N^{-3/2}), \qquad (11.13)$$
where $g_1(z) = -(z^2 - 1)\phi(z)\kappa_3/6$, $\phi(\cdot)$ denotes the standard normal density, $\kappa_3$ is the third cumulant of $Z_N$, and the lengthy expression for $g_2(\cdot)$ is given in Rothenberg (1984, p. 895) or Amemiya (1985, p. 93). In general the rth cumulant $\kappa_r$ is the rth coefficient in the series expansion $\ln(E[e^{isZ_N}]) = \sum_{r=0}^{\infty} \kappa_r (is)^r/r!$ of the log characteristic function or cumulant generating function.
The remainder term in (11.13) is ignored and an Edgeworth expansion approximates $G_N(z, F_0)$ by $G_\infty(z, F_0) = \Phi(z) + N^{-1/2}g_1(z) + N^{-1}g_2(z)$. If $Z_N$ is a test statistic this can be used to compute p-values and critical values. Alternatively, (11.13) can be inverted to
$$\Pr\!\left[Z_N + \frac{h_1(z)}{\sqrt N} + \frac{h_2(z)}{N} \le z\right] \simeq \Phi(z), \qquad (11.14)$$
for functions $h_1(z)$ and $h_2(z)$ given in Rothenberg (1984, p. 895). The left-hand side gives a modified statistic that will be better approximated by the standard normal than the original statistic $Z_N$.
The problem in application is that the cumulants of ZN are needed to evaluate the
functions g1(z) and g2(z) or h1(z) and h2(z). It can be very difficult to obtain analytical
expressions for these cumulants (e.g., Sargan, 1980, and Phillips, 1983). The bootstrap
provides a numerical method to implement the Edgeworth expansion without the need
to calculate cumulants, as shown in the following.
11.4.4. Asymptotic Refinement via Bootstrap
We now return to the more general setting of Section 11.4.1, with the additional assumption that $T_N$ has a limit normal distribution and the usual $\sqrt N$ asymptotics apply.
Conventional asymptotic methods use the limit cdf $G_\infty(t, F_0)$ as an approximation to the true cdf $G_N(t, F_0)$. For $\sqrt N$-consistent asymptotically normal estimators this has an error that in the limit behaves as a multiple of $N^{-1/2}$. We write this as
$$G_N(t, F_0) = G_\infty(t, F_0) + O(N^{-1/2}), \qquad (11.15)$$
where in our example $G_\infty(t, F_0) = \Phi(t)$.
A better approximation is possible using an Edgeworth expansion. Then
$$G_N(t, F_0) = G_\infty(t, F_0) + \frac{g_1(t, F_0)}{\sqrt N} + \frac{g_2(t, F_0)}{N} + O(N^{-3/2}). \qquad (11.16)$$
Unfortunately, as already noted, the functions g1(·) and g2(·) on the right-hand side
can be difficult to construct.
Now consider the bootstrap estimator $G_N(t, F_N)$. An Edgeworth expansion yields
$$G_N(t, F_N) = G_\infty(t, F_N) + \frac{g_1(t, F_N)}{\sqrt N} + \frac{g_2(t, F_N)}{N} + O(N^{-3/2}); \qquad (11.17)$$
see Hall (1992) for details. The bootstrap estimator $G_N(t, F_N)$ is used to approximate the finite-sample cdf $G_N(t, F_0)$. Subtracting (11.16) from (11.17), we get
$$G_N(t, F_N) - G_N(t, F_0) = [G_\infty(t, F_N) - G_\infty(t, F_0)] + \frac{[g_1(t, F_N) - g_1(t, F_0)]}{\sqrt N} + O(N^{-1}). \qquad (11.18)$$
Assume that $F_N$ is $\sqrt N$-consistent for the true cdf $F_0$, so that $F_N - F_0 = O(N^{-1/2})$. For continuous $G_\infty$ the first term on the right-hand side of (11.18), $[G_\infty(t, F_N) - G_\infty(t, F_0)]$, is therefore $O(N^{-1/2})$, so $G_N(t, F_N) - G_N(t, F_0) = O(N^{-1/2})$.
The bootstrap approximation GN (t, FN ) is therefore in general no closer asymptot-
ically to GN (t, F0) than is the usual asymptotic approximation G∞(t, F0); see (11.15).
Now suppose the statistic $T_N$ is asymptotically pivotal, so that its asymptotic distribution $G_\infty$ does not depend on unknown parameters. Here this is the case if $T_N$ is standardized so that its limit distribution is the standard normal. Then $G_\infty(t, F_N) = G_\infty(t, F_0)$, so (11.18) simplifies to
$$G_N(t, F_N) - G_N(t, F_0) = N^{-1/2}[g_1(t, F_N) - g_1(t, F_0)] + O(N^{-1}). \qquad (11.19)$$
However, because $F_N - F_0 = O(N^{-1/2})$ we have that $[g_1(t, F_N) - g_1(t, F_0)] = O(N^{-1/2})$ for $g_1$ continuous in F. It follows upon simplification that $G_N(t, F_N) = G_N(t, F_0) + O(N^{-1})$. The bootstrap approximation $G_N(t, F_N)$ is now a better asymptotic approximation to $G_N(t, F_0)$, as the error is now $O(N^{-1})$.
In summary, for a bootstrap on an asymptotically pivotal statistic we have
$$G_N(t, F_0) = G_N(t, F_N) + O(N^{-1}), \qquad (11.20)$$
an improvement on the conventional approximation $G_N(t, F_0) = G_\infty(t, F_0) + O(N^{-1/2})$.
The bootstrap on an asymptotically pivotal statistic therefore leads to improved small-sample performance in the following sense. Let α be the nominal size for a test procedure. Usual asymptotic theory produces t-tests with actual size $\alpha + O(N^{-1/2})$, whereas the bootstrap produces t-tests with actual size $\alpha + O(N^{-1})$.
For symmetric two-sided hypothesis tests and confidence intervals the bootstrap on an asymptotically pivotal statistic can be shown to have approximation error $O(N^{-3/2})$ compared to error $O(N^{-1})$ using usual asymptotic theory.
The preceding results are restricted to asymptotically normal statistics. For chi-
squared distributed test statistics the asymptotic gains are similar to those for sym-
metric two-sided hypothesis tests. For proof of bias reduction by bootstrapping, see
Horowitz (2001, p. 3172).
The theoretical analysis leads to the following points. The bootstrap should be from
distribution FN consistent for F0. The bootstrap requires smoothness and continuity in
F0 and GN , so that a modification of the standard bootstrap is needed if, for example,
there is a discontinuity because of a boundary constraint on the parameters such as
θ ≥ 0. The bootstrap assumes existence of low-order moments, as low-order cumu-
lants appear in the function g1 in the Edgeworth expansions. Asymptotic refinement
requires use of an asymptotically pivotal statistic. The bootstrap refinement presented
assumes iid data, so that modification is needed even for heteroskedastic errors. For
more complete discussion see Horowitz (2001).
11.4.5. Power of Bootstrapped Tests
The analysis of the bootstrap has focused on getting tests with correct size in small
samples. The size correction of the bootstrap will lead to changes in the power of tests,
as will any size correction.
Intuitively, if the actual size of a test using first-order asymptotics exceeds the nom-
inal size, then bootstrapping with asymptotic refinement will not only reduce the size
toward the nominal size but, because of less frequent rejection, will also reduce the
power of the test. Conversely, if the actual size is less than the nominal size then
bootstrapping will increase test power. This is observed in the simulation exercise of
Horowitz (1994, p. 409). Interestingly, in his simulation he finds that although boot-
strapping first-order asymptotically equivalent tests leads to tests with similar actual
size (essentially equal to the nominal size) there can be considerable difference in test
power across the bootstrapped tests.
11.5. Bootstrap Extensions
The bootstrap methods presented so far emphasize smooth $\sqrt N$-consistent asymptotically normal estimators based on iid data. The following extensions of the bootstrap permit, for a wider range of applications, a consistent bootstrap (Sections 11.5.1 and 11.5.2) or a consistent bootstrap with asymptotic refinement (Sections 11.5.3–11.5.5). The presentation of these more advanced methods is brief. Some are used in Section 11.6.
11.5.1. Subsampling Method
The subsampling method uses a sample of size m that is substantially smaller than the sample size N. The subsampling may be with replacement (Bickel, Götze, and van Zwet, 1997) or without replacement (Politis and Romano, 1994).
Replacement subsampling provides subsamples that are random samples of the population, rather than random samples of an estimate of the distribution, such as the sample itself in the case of a paired bootstrap. Replacement subsampling can then be consistent when failure of the smoothness conditions discussed in Section 11.4.2 leads to inconsistency of a full-sample bootstrap. The associated asymptotic error for testing or confidence intervals, however, is of higher order of magnitude than the usual $O(N^{-1/2})$ obtained when a full-sample bootstrap without refinement can be used.
Subsample bootstraps are useful when full-sample bootstraps are invalid, or as a way to verify that a full-sample bootstrap is valid. Results will differ with the choice of subsample size, and there is a considerable increase in sampling error because a smaller fraction of the sample is being used. Indeed, we should have $(m/N) \to 0$ and $N \to \infty$.
Politis, Romano, and Wolf (1999) and Horowitz (2001) provide further details.
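A minimal Python (NumPy) sketch of m-out-of-N subsampling; it is ours, `statistic` is any user-supplied function of the data, and the choice of m is left to the user subject to m/N → 0.

```python
import numpy as np

def subsample_distribution(data, statistic, m, S=999, replace=False, seed=0):
    """Approximate the distribution of `statistic` using S subsamples of size m.

    data      : array with observations in rows
    statistic : callable mapping a data array to a scalar
    replace   : False for subsampling without replacement, True for the
                m-out-of-N bootstrap with replacement
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    N = data.shape[0]
    stats = np.empty(S)
    for s in range(S):
        idx = rng.choice(N, size=m, replace=replace)
        stats[s] = statistic(data[idx])
    return stats

# Example (illustrative choice of m, not a recommendation):
# y = np.random.default_rng(1).normal(size=200)
# draws = subsample_distribution(y, np.mean, m=int(np.sqrt(200)))
```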
11.5.2. Moving Blocks Bootstrap
The moving blocks bootstrap is used for data that are dependent rather than independent. It splits the sample into r nonoverlapping blocks of length l, where $rl \simeq N$. First, one samples with replacement from these blocks to give r new blocks, which will have a different temporal ordering from the original r blocks. Then one estimates the parameters using this bootstrap sample.
The moving blocks method treats the randomly drawn blocks as being independent of each other, but allows dependence within the blocks. A similar blocking was actually used by Anderson (1971) to derive a central limit theorem for an m-dependent process. The moving blocks process requires $r \to \infty$ as $N \to \infty$ to ensure that we are likely to draw consecutive blocks uncorrelated with each other. It also requires the block length $l \to \infty$ as $N \to \infty$. See, for example, Götze and Künsch (1996).
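A sketch of one block-bootstrap resample in Python (NumPy); it is ours, and it draws block starting points over all overlapping positions (the "moving" variant), whereas restricting the starts to multiples of the block length gives the nonoverlapping scheme described above.

```python
import numpy as np

def moving_blocks_resample(data, block_length, rng=None):
    """One block-bootstrap resample of a time series (rows are time periods).

    Blocks of `block_length` consecutive observations are drawn with
    replacement and concatenated until N observations are obtained.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    N = data.shape[0]
    n_blocks = int(np.ceil(N / block_length))
    # Starting points over all overlapping blocks
    starts = rng.integers(0, N - block_length + 1, size=n_blocks)
    pieces = [data[s:s + block_length] for s in starts]
    return np.concatenate(pieces, axis=0)[:N]
```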
11.5.3. Nested Bootstrap
A nested bootstrap, introduced by Hall (1986), Beran (1987), and Loh (1987), is a bootstrap within a bootstrap. This method is especially useful if the bootstrap is on a statistic that is not asymptotically pivotal. In particular, if the standard error of the estimate is difficult to compute, one can bootstrap the current bootstrap sample to obtain a bootstrap standard error estimate $s_{\hat\theta^*,\text{Boot}}$ and form $t^* = (\hat\theta^* - \hat\theta)/s_{\hat\theta^*,\text{Boot}}$, and then apply the percentile-t method to the bootstrap replications $t_1^*, \ldots, t_B^*$. This permits asymptotic refinements where a single round of bootstrap would not.
More generally, iterated bootstrapping is a way to improve the performance of the bootstrap by estimating the errors (i.e., bias) that arise from a single pass of the bootstrap, and correcting for these errors. In general each further iteration of the bootstrap reduces bias by a factor $N^{-1}$ if the statistic is asymptotically pivotal and by a factor $N^{-1/2}$ otherwise. For a good exposition see Hall and Martin (1988). If B bootstraps are performed at each iteration then $B^k$ bootstraps need to be performed if there are k iterations. For this reason at most two iterations, called a double bootstrap or calibrated bootstrap, are done.
Davison, Hinkley, and Schechtman (1986) proposed balanced bootstrapping. This
method ensures that each sample observation is reused exactly the same number of
times over all B bootstraps, leading to better bootstrap estimates. For implementation
see Gleason (1988), whose algorithms add little to computational time compared to
the usual unbalanced bootstrap.
11.5.4. Recentering and Rescaling
To yield an asymptotic refinement the bootstrap should be based on an estimate $\hat F$ of the dgp $F_0$ that imposes all the conditions of the model under consideration. A leading example arises with the residual bootstrap.
Least-squares residuals do not sum to zero in nonlinear models, or even in linear models if there is no intercept. The residual bootstrap (see Section 11.2.4) based on least-squares residuals will then fail to impose the restriction that $E[u_i] = 0$. The residual bootstrap should instead bootstrap the recentered residual $\hat u_i - \bar u$, where $\bar u = N^{-1}\sum_{i=1}^{N}\hat u_i$. Similar recentering should be done for paired bootstraps of GMM estimators in overidentified models (see Section 11.6.4).
Rescaling of residuals can also be useful. For example, in the linear regression model with iid errors one can resample from $(N/(N-K))^{1/2}\hat u_i$, since these have variance $s^2$. Other adjustments include using the standardized residual $\hat u_i/\sqrt{(1 - h_{ii})s^2}$, where $h_{ii}$ is the ith diagonal entry in the projection matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.
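A minimal Python (NumPy) sketch, ours, of a residual bootstrap for OLS that recenters and rescales the residuals before resampling:

```python
import numpy as np

def residual_bootstrap_betas(y, X, B=999, seed=None):
    """Residual bootstrap for OLS with recentered and rescaled residuals.

    Returns the B bootstrap coefficient vectors. Residuals are recentered to
    mean zero and rescaled by (N/(N-K))**0.5 before resampling.
    """
    rng = np.random.default_rng(seed)
    y, X = np.asarray(y, dtype=float), np.asarray(X, dtype=float)
    N, K = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    u_hat = y - X @ beta_hat
    u_adj = (u_hat - u_hat.mean()) * np.sqrt(N / (N - K))  # recenter and rescale

    betas = np.empty((B, K))
    for b in range(B):
        u_star = rng.choice(u_adj, size=N, replace=True)
        y_star = X @ beta_hat + u_star
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return betas
```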
11.5.5. The Jackknife
The bootstrap can be used for bias correction (see Section 11.2.8). An alternative resampling method is the jackknife, a precursor of the bootstrap. The jackknife uses N deterministically defined subsamples of size N − 1, obtained by dropping in turn each of the N observations and recomputing the estimator.
To see how the jackknife works, let $\hat\theta_N$ denote the estimate of θ using all N observations, and let $\hat\theta_{N-1}$ denote the estimate of θ using the first $(N-1)$ observations. If (11.7) holds then $E[\hat\theta_N] = \theta + a_N/N + b_N/N^2 + O(N^{-3})$ and $E[\hat\theta_{N-1}] = \theta + a_N/(N-1) + b_N/(N-1)^2 + O(N^{-3})$, which implies $E[N\hat\theta_N - (N-1)\hat\theta_{N-1}] = \theta + O(N^{-2})$. Thus $N\hat\theta_N - (N-1)\hat\theta_{N-1}$ has smaller bias than $\hat\theta_N$.
The estimator can be more variable, however, as it uses less of the data. As an extreme example, if $\hat\theta = \bar y$ then the new estimator is simply $y_N$, the Nth observation. The variation can be reduced by dropping each observation in turn and averaging.
More formally then, consider the estimator $\hat\theta$ of a parameter vector θ based on a sample of size N from iid data. For $i = 1, \ldots, N$ sequentially delete the ith observation and obtain N jackknife replication estimates $\hat\theta_{(-i)}$ from the N jackknife resamples of size $(N-1)$. The jackknife estimate of the bias of $\hat\theta$ is $(N-1)(\bar\theta - \hat\theta)$, where $\bar\theta = N^{-1}\sum_i \hat\theta_{(-i)}$ is the average of the N jackknife replications $\hat\theta_{(-i)}$. The bias appears large because of the multiplication by $(N-1)$, but the differences $(\hat\theta_{(-i)} - \hat\theta)$ are much smaller than in the bootstrap case since a jackknife resample differs from the original sample in only one observation.
This leads to the bias-corrected jackknife estimate of θ:
$$\hat\theta_{\text{Jack}} = \hat\theta - (N-1)(\bar\theta - \hat\theta) = N\hat\theta - (N-1)\bar\theta. \qquad (11.21)$$
This reduces the bias from $O(N^{-1})$ to $O(N^{-2})$, which is the same order of bias reduction as for the bootstrap. It is assumed that, as for the bootstrap, the estimator is a smooth $\sqrt N$-consistent estimator. The jackknife estimate can have increased variance compared with $\hat\theta$, and examples where the jackknife fails are given in Miller (1974).
A simple example is estimation of $\sigma^2$ from an iid sample with $y_i \sim [\mu, \sigma^2]$. The estimate $\hat\sigma^2 = N^{-1}\sum_i (y_i - \bar y)^2$, the MLE under normality, has $E[\hat\sigma^2] = \sigma^2(N-1)/N$, so that the bias equals $-\sigma^2/N$, which is $O(N^{-1})$. In this example the jackknife estimate can be shown to simplify to $\hat\sigma^2_{\text{Jack}} = (N-1)^{-1}\sum_i (y_i - \bar y)^2$, so one does not need to compute N separate estimates $\hat\sigma^2_{(-i)}$. This is an unbiased estimate of $\sigma^2$, so the bias is actually zero rather than the general result of $O(N^{-2})$.
The jackknife is due to Quenouille (1956). Tukey (1958) considered application to a wider range of statistics. In particular, the jackknife estimate of the standard error of an estimator $\hat\theta$ is
$$\widehat{\text{se}}_{\text{Jack}}[\hat\theta] = \left[\frac{N-1}{N}\sum_{i=1}^{N}\left(\hat\theta_{(-i)} - \bar\theta\right)^2\right]^{1/2}. \qquad (11.22)$$
Tukey proposed the term jackknife by analogy to a Boy Scout jackknife that solves a variety of problems, each of which could be solved more efficiently by a specially constructed tool. The jackknife is a “rough and ready” method for bias reduction in many situations, but it is not the ideal method in any. The jackknife can be viewed as a linear approximation of the bootstrap (Efron and Tibshirani, 1993, p. 146). It requires less computation than the bootstrap in small samples, as then $N < B$ is likely, but is outperformed by the bootstrap as $B \to \infty$.
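A minimal Python (NumPy) sketch, ours, of the leave-one-out jackknife bias correction (11.21) and standard error (11.22) for a user-supplied scalar estimator:

```python
import numpy as np

def jackknife(data, estimator):
    """Leave-one-out jackknife: bias-corrected estimate and standard error.

    data      : array with observations in rows
    estimator : callable mapping a data array to a scalar estimate
    """
    data = np.asarray(data)
    N = data.shape[0]
    theta_hat = estimator(data)
    theta_loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(N)])
    theta_bar = theta_loo.mean()

    bias = (N - 1) * (theta_bar - theta_hat)                 # jackknife bias estimate
    theta_jack = N * theta_hat - (N - 1) * theta_bar         # bias-corrected, (11.21)
    se_jack = np.sqrt((N - 1) / N * np.sum((theta_loo - theta_bar) ** 2))  # (11.22)
    return theta_jack, bias, se_jack

# For estimator(y) = np.var(y) (the MLE of sigma^2 under normality) the
# jackknife estimate equals the usual unbiased variance estimator, as noted above.
```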
Consider the linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}$, with $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. An example of a biased estimator from OLS regression is a time-series model with a lagged dependent variable as regressor. The regression estimator based on the ith jackknife sample $(\mathbf{X}_{(-i)}, \mathbf{y}_{(-i)})$ is given by
$$\hat{\boldsymbol\beta}_{(-i)} = [\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1}\mathbf{X}_{(-i)}'\mathbf{y}_{(-i)} = [\mathbf{X}'\mathbf{X} - \mathbf{x}_i\mathbf{x}_i']^{-1}(\mathbf{X}'\mathbf{y} - \mathbf{x}_i y_i) = \hat{\boldsymbol\beta} - [\mathbf{X}'\mathbf{X}]^{-1}\mathbf{x}_i\left(y_i - \mathbf{x}_i'\hat{\boldsymbol\beta}_{(-i)}\right).$$
The third equality avoids the need to invert $\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}$ for each i and is obtained using
$$[\mathbf{X}'\mathbf{X}]^{-1} = [\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1} - \frac{[\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1}\mathbf{x}_i\mathbf{x}_i'[\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1}}{1 + \mathbf{x}_i'[\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1}\mathbf{x}_i}.$$
Here the pseudo-values are given by $N\hat{\boldsymbol\beta} - (N-1)\hat{\boldsymbol\beta}_{(-i)}$, and the jackknife estimator of $\boldsymbol\beta$ is given by
$$\hat{\boldsymbol\beta}_{\text{Jack}} = N\hat{\boldsymbol\beta} - (N-1)\,\frac{1}{N}\sum_{i=1}^{N}\hat{\boldsymbol\beta}_{(-i)}. \qquad (11.23)$$
An interesting application of the jackknife to bias reduction is the jackknife IV
estimator (see Section 6.4.4).
11.6. Bootstrap Applications
We consider application of the bootstrap taking into account typical microeconometric
complications such as heteroskedasticity and clustering and more complicated estima-
tors that can lead to failure of simple bootstraps.
11.6.1. Heteroskedastic Errors
For least squares in models with additive errors that are heteroskedastic, the standard
procedure is to use White’s heteroskedastic-consistent covariance matrix estimator
(HCCME). This is well known to perform poorly in small samples. When done cor-
rectly, the bootstrap can provide an improvement.
The paired bootstrap leads to valid inference, since the essential assumption that
(yi , xi ) is iid still permits V[ui |xi ] to vary with xi (see Section 4.4.7). However, it
does not offer an asymptotic refinement because it does not impose the condition that
E[ui |xi ] = 0.
The usual residual bootstrap actually leads to invalid inference, since it assumes that $u_i|\mathbf{x}_i$ is iid and hence erroneously imposes the condition of homoskedastic errors. In terms of the theory of Section 11.4, $\hat F$ is then inconsistent for F. One can specify a formal model for the heteroskedasticity, say $u_i = \exp(\mathbf{z}_i'\boldsymbol\alpha)\varepsilon_i$, where the $\varepsilon_i$ are iid, obtain the estimate $\exp(\mathbf{z}_i'\hat{\boldsymbol\alpha})$, and then bootstrap the implied residuals $\hat\varepsilon_i$. Consistency and asymptotic refinement of this bootstrap require correct specification of the functional form for the heteroskedasticity.
The wild bootstrap, introduced by Wu (1986) and Liu (1988) and studied further by Mammen (1993), provides asymptotic refinement without imposing such structure on the heteroskedasticity. This bootstrap replaces the OLS residual $\hat u_i$ by the following residual:
$$\hat u_i^* = \begin{cases} \dfrac{1-\sqrt 5}{2}\,\hat u_i \simeq -0.6180\,\hat u_i & \text{with probability } \dfrac{1+\sqrt 5}{2\sqrt 5} \simeq 0.7236, \\[1.5ex] \left[1 - \dfrac{1-\sqrt 5}{2}\right]\hat u_i \simeq 1.6180\,\hat u_i & \text{with probability } 1 - \dfrac{1+\sqrt 5}{2\sqrt 5} \simeq 0.2764. \end{cases}$$
Taking expectations with respect to only this two-point distribution and performing some algebra yields $E[\hat u_i^*] = 0$, $E[\hat u_i^{*2}] = \hat u_i^2$, and $E[\hat u_i^{*3}] = \hat u_i^3$. Thus $\hat u_i^*$ leads to a residual with zero conditional mean as desired, since $E[\hat u_i^*|\hat u_i, \mathbf{x}_i] = 0$ implies $E[\hat u_i^*|\mathbf{x}_i] = 0$, while the second and third moments are unchanged.
The wild bootstrap resamples have ith observation $(y_i^*, \mathbf{x}_i)$, where $y_i^* = \mathbf{x}_i'\hat{\boldsymbol\beta} + \hat u_i^*$. The resamples vary because of different realizations of $\hat u_i^*$. Simulations by Horowitz (1997, 2001) show that this bootstrap works much better than a paired bootstrap when there is heteroskedasticity and works well compared to other bootstrap methods even if there is no heteroskedasticity.
It seems surprising that this bootstrap should work, because for the ith observation it draws from only two possible values for the residual, $-0.6180\,\hat u_i$ or $1.6180\,\hat u_i$. However, a similar draw is being made over all N observations and over B bootstrap iterations. Recall also that White's estimator replaces $E[u_i^2]$ by $\hat u_i^2$, which, although incorrect for one observation, is valid when averaged over the sample. The wild bootstrap is instead drawing from a two-point distribution with mean 0 and variance $\hat u_i^2$.
11.6.2. Panel Data and Clustered Data
Consider a linear panel regression model
$$\tilde y_{it} = \tilde{\mathbf{w}}_{it}'\boldsymbol\theta + \tilde u_{it},$$
where i denotes the individual and t denotes the time period. Following the notation of Section 21.2.3, the tilde is added because the original data $y_{it}$ and $\mathbf{x}_{it}$ may first be transformed, for example to eliminate fixed effects. We assume that the errors $\tilde u_{it}$ are independent over i, though they may be heteroskedastic and correlated over t for given i.
If the panel is short, so that T is finite and asymptotic theory relies on $N \to \infty$, then consistent standard errors for $\hat{\boldsymbol\theta}$ can be obtained by a paired or EDF bootstrap that resamples over i but does not resample over t. In the preceding presentation $\mathbf{w}_i$ becomes $[y_{i1}, \mathbf{x}_{i1}, \ldots, y_{iT}, \mathbf{x}_{iT}]$, and we resample over i and obtain all T observations for each chosen i.
This panel bootstrap, also called a block bootstrap, can also be applied to the nonlinear panel models of Chapter 23. The key assumptions are that the panel is short and the data are independent over i. More generally, this bootstrap can be applied whenever data are clustered (see Section 24.5), provided cluster size is finite and the number of clusters goes to infinity.
The panel bootstrap produces standard errors that are asymptotically equivalent to
panel robust sandwich standard errors (see Section 21.2.3). It does not provide an
asymptotic refinement. However, it is quite simple to implement and is practically very
useful as many packages do not automatically provide panel robust standard errors
even for quite basic panel estimators such as the fixed effects estimator. Depending
on the application, other bootstraps such as parametric and residual bootstraps may be
possible, provided again that resampling is over i only.
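A minimal Python (NumPy) sketch, ours, of pairs cluster-bootstrap standard errors for OLS, resampling entire clusters (all T observations for an individual i) with replacement:

```python
import numpy as np

def cluster_bootstrap_se(y, X, cluster_id, B=399, seed=None):
    """Pairs cluster bootstrap standard errors for OLS coefficients."""
    rng = np.random.default_rng(seed)
    y, X = np.asarray(y, dtype=float), np.asarray(X, dtype=float)
    cluster_id = np.asarray(cluster_id)
    ids = np.unique(cluster_id)
    groups = {g: np.flatnonzero(cluster_id == g) for g in ids}

    betas = np.empty((B, X.shape[1]))
    for b in range(B):
        drawn = rng.choice(ids, size=len(ids), replace=True)   # resample clusters
        idx = np.concatenate([groups[g] for g in drawn])       # stack their rows
        betas[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return betas.std(axis=0, ddof=1)
```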
Asymptotic refinement is straightforward if the errors are iid. More realistically, however, $\tilde u_{it}$ will be heteroskedastic and correlated over t for given i. The wild bootstrap (see Section 11.6.1) should provide an asymptotic refinement in a linear model if the panel is short. Then wild bootstrap resamples have (i, t)th observation $(\tilde y_{it}^*, \tilde{\mathbf{w}}_{it})$, where $\tilde y_{it}^* = \tilde{\mathbf{w}}_{it}'\hat{\boldsymbol\theta} + \tilde u_{it}^*$, $\tilde u_{it} = \tilde y_{it} - \tilde{\mathbf{w}}_{it}'\hat{\boldsymbol\theta}$, and $\tilde u_{it}^*$ is a draw from the two-point distribution given in Section 11.6.1.
11.6.3. Hypothesis and Specification Tests
Section 11.2.6 focused on tests of the hypothesis θ = θ0. Here we consider more gen-
eral tests. As in Section 11.2.6, the bootstrap can be used to perform hypothesis tests
with or without asymptotic refinement.
Tests without Asymptotic Refinement
A leading example of the usefulness of the bootstrap is the Hausman test (see Section 8.3). Standard implementation of this test requires estimation of $V[\hat\theta - \tilde\theta]$, where $\hat\theta$ and $\tilde\theta$ are the two estimators being contrasted. Obtaining this estimate can be difficult unless the strong assumption is made that one of the estimators is fully efficient under $H_0$. The paired bootstrap can be used instead, leading to the consistent estimate
$$\hat V_{\text{Boot}}[\hat\theta - \tilde\theta] = \frac{1}{B-1}\sum_{b=1}^{B}\left[(\hat\theta_b^* - \tilde\theta_b^*) - (\bar{\hat\theta}^* - \bar{\tilde\theta}^*)\right]\left[(\hat\theta_b^* - \tilde\theta_b^*) - (\bar{\hat\theta}^* - \bar{\tilde\theta}^*)\right]',$$
where $\bar{\hat\theta}^* = B^{-1}\sum_b \hat\theta_b^*$ and $\bar{\tilde\theta}^* = B^{-1}\sum_b \tilde\theta_b^*$. Then compute
$$H = (\hat\theta - \tilde\theta)'\left[\hat V_{\text{Boot}}[\hat\theta - \tilde\theta]\right]^{-1}(\hat\theta - \tilde\theta) \qquad (11.24)$$
and compare to chi-square critical values. As mentioned in Chapter 8, a generalized
inverse may need to be used and care may be needed to ensure chi-square critical
values are obtained using the correct degrees of freedom.
More generally, this approach can be used for any standard normal test or chi-square
distributed test where implementation is difficult because a variance matrix must be
estimated. Examples include hypothesis tests based on a two-step estimator and the
m-tests of Chapter 8.
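A minimal Python (NumPy) sketch, ours, of the bootstrap Hausman statistic (11.24), given the B paired-bootstrap replications of the two estimators computed from the same resamples:

```python
import numpy as np

def bootstrap_hausman(theta_hat, theta_tilde, theta_star, theta_tilde_star):
    """Hausman statistic with a paired-bootstrap estimate of V[theta_hat - theta_tilde].

    theta_star, theta_tilde_star : (B, k) arrays of bootstrap replications.
    Returns the statistic and the rank of V_Boot (chi-square degrees of freedom).
    """
    d_star = np.asarray(theta_star) - np.asarray(theta_tilde_star)  # contrasts, B x k
    d_bar = d_star.mean(axis=0)
    B = d_star.shape[0]
    V_boot = (d_star - d_bar).T @ (d_star - d_bar) / (B - 1)

    d = np.asarray(theta_hat) - np.asarray(theta_tilde)
    H = d @ np.linalg.pinv(V_boot) @ d        # generalized inverse, as noted above
    rank = np.linalg.matrix_rank(V_boot)
    return H, rank
```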
Tests with Asymptotic Refinement
Many tests, especially those for fully parametric models such as the LM test and IM test, can be simply implemented using an auxiliary regression (see Sections 7.3.5 and 8.2.2). The resulting test statistics, however, perform poorly in finite samples, as documented in many Monte Carlo studies. Such test statistics are easily computed and are asymptotically pivotal, as the chi-square distribution does not depend on unknown parameters. They are therefore prime candidates for asymptotic refinement by the bootstrap.
Consider the m-test of $H_0: E[\mathbf{m}_i(y_i|\mathbf{x}_i, \boldsymbol\theta)] = \mathbf{0}$ against $H_a: E[\mathbf{m}_i(y_i|\mathbf{x}_i, \boldsymbol\theta)] \neq \mathbf{0}$ (see Section 8.2). From the original data estimate $\boldsymbol\theta$ by ML, giving $\hat{\boldsymbol\theta}$, and calculate the test statistic M. Using a parametric bootstrap, resample $y_i^*$ from the fitted conditional density $f(y_i|\mathbf{x}_i, \hat{\boldsymbol\theta})$, for fixed regressors in repeated samples, or from $f(y_i|\mathbf{x}_i^*, \hat{\boldsymbol\theta})$. Compute $M_b^*$, $b = 1, \ldots, B$, in the bootstrap resamples. Reject $H_0$ at level α if the original calculated statistic M exceeds the upper α quantile of $M_b^*$, $b = 1, \ldots, B$.
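A generic Python (NumPy) sketch, ours, of this parametric bootstrap; the callables `simulate_y`, `fit_ml`, and `m_statistic` are placeholders to be supplied for the particular model and test.

```python
import numpy as np

def parametric_bootstrap_m_test(M_obs, X, theta_hat, simulate_y, fit_ml,
                                m_statistic, B=499, alpha=0.05, seed=None):
    """Bootstrap critical value and p-value for an m-test, regressors held fixed.

    simulate_y(X, theta, rng) : draws y* from the fitted density f(y | x, theta)
    fit_ml(y, X)              : returns the ML estimate on a sample
    m_statistic(y, X, theta)  : returns the m-test statistic
    """
    rng = np.random.default_rng(seed)
    M_star = np.empty(B)
    for b in range(B):
        y_star = simulate_y(X, theta_hat, rng)
        theta_star = fit_ml(y_star, X)
        M_star[b] = m_statistic(y_star, X, theta_star)
    crit = np.quantile(M_star, 1 - alpha)                 # upper alpha quantile of M*
    p_value = (1 + np.sum(M_star >= M_obs)) / (B + 1)
    return M_obs > crit, crit, p_value
```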
Horowitz (1994) presented this bootstrap for the IM test and demonstrated with
simulation examples that there are substantial finite-sample gains to this bootstrap. A
detailed application by Drukker (2002) to specification tests for the tobit model sug-
gests that conditional moment specification tests can be easily applied to fully para-
metric models, since any size distortion in the auxiliary regressions can be corrected
through bootstrap.
Note that bootstrap tests without asymptotic refinement, such as the Hausman test
given here, can be refined by use of the nested bootstrap given in Section 11.5.3.
11.6.4. GMM, Minimum Distance, and Empirical Likelihood in
Overidentified Models
The GMM estimator is based on population moment conditions $E[\mathbf{h}(\mathbf{w}_i, \boldsymbol\theta)] = \mathbf{0}$ (see Section 6.3.1). In a just-identified model a consistent estimator simply solves $N^{-1}\sum_i \mathbf{h}(\mathbf{w}_i, \hat{\boldsymbol\theta}) = \mathbf{0}$. In overidentified models this estimator is no longer feasible. Instead, the GMM estimator is used (see Section 6.3.2).
Now consider bootstrapping, using the paired or EDF bootstrap. For GMM in an overidentified model $N^{-1}\sum_i \mathbf{h}(\mathbf{w}_i, \hat{\boldsymbol\theta}) \neq \mathbf{0}$, so this bootstrap does not impose on the bootstrap resamples the original population restriction that $E[\mathbf{h}(\mathbf{w}_i, \boldsymbol\theta)] = \mathbf{0}$. As a result, even if the asymptotically pivotal t-statistic is used there is no longer a bootstrap refinement, though bootstraps on $\hat{\boldsymbol\theta}$ and related confidence intervals and t-test statistics remain consistent. More fundamentally, the bootstrap of the OIR test (see Section 6.3.8) can be shown to be inconsistent. We focus on cross-section data, but similar issues arise for panel GMM estimators (see Chapter 22) in overidentified models.
Hall and Horowitz (1996) propose correcting this by recentering. Then the bootstrap is based on $\mathbf{h}^*(\mathbf{w}_i, \hat{\boldsymbol\theta}) = \mathbf{h}(\mathbf{w}_i, \hat{\boldsymbol\theta}) - N^{-1}\sum_i \mathbf{h}(\mathbf{w}_i, \hat{\boldsymbol\theta})$, and asymptotic refinements can be obtained for statistics based on $\hat{\boldsymbol\theta}$, including the OIR test.
Horowitz (1998) does similar recentering for the minimum distance estimator (see
Section 6.7). He then applies the bootstrap to the covariance structure example of
Altonji and Segal (1996) discussed in Section 6.3.5.
An alternative adjustment, proposed by Brown and Newey (2002), is to not recenter but instead to resample the observations $\mathbf{w}_i$ with probabilities that vary across observations rather than with equal weights 1/N. Specifically, let $\Pr[\mathbf{w}^* = \mathbf{w}_i] = \hat\pi_i$, where $\hat\pi_i = N^{-1}(1 + \hat{\boldsymbol\lambda}'\hat{\mathbf{h}}_i)^{-1}$, $\hat{\mathbf{h}}_i = \mathbf{h}(\mathbf{w}_i, \hat{\boldsymbol\theta})$, and $\hat{\boldsymbol\lambda}$ maximizes $\sum_i \ln(1 + \boldsymbol\lambda'\hat{\mathbf{h}}_i)$. The motivation is that the probabilities $\hat\pi_i$ equivalently are the solution to an empirical likelihood (EL) problem (see Section 6.8.2) of maximizing $\sum_i \ln \pi_i$ with respect to $\pi_1, \ldots, \pi_N$ subject to the constraints $\sum_i \pi_i\hat{\mathbf{h}}_i = \mathbf{0}$ and $\sum_i \pi_i = 1$. This empirical likelihood bootstrap of the GMM estimator therefore imposes the constraint $\sum_i \hat\pi_i\hat{\mathbf{h}}_i = \mathbf{0}$.
One could instead work directly with EL from the beginning, letting $\hat{\boldsymbol\theta}$ be the EL estimator rather than the GMM estimator. The advantage of the Brown and Newey (2002) approach is that it avoids the more challenging computation of the EL estimator. Instead, one needs only the GMM estimator and the solution of the concave programming problem of maximizing $\sum_i \ln(1 + \boldsymbol\lambda'\hat{\mathbf{h}}_i)$.
11.6.5. Nonparametric Regression
Nonparametric density and regression estimators converge at rate less than $\sqrt N$ and are asymptotically biased. This complicates inference such as confidence intervals (see Sections 9.3.7 and 9.5.4).
We consider the kernel regression estimator $\hat m(x_0)$ of $m(x_0) = E[y|x = x_0]$ for observations (y, x) that are iid, though conditional heteroskedasticity is permitted. From Horowitz (2001, p. 3204), an asymptotically pivotal statistic is
$$t = \frac{\hat m(x_0) - m(x_0)}{s_{\hat m(x_0)}},$$
where $\hat m(x_0)$ is an undersmoothed kernel regression estimator with bandwidth $h = o(N^{-1/3})$ rather than the optimal $h^* = O(N^{-1/5})$, and
$$s^2_{\hat m(x_0)} = \frac{1}{Nh[\hat f(x_0)]^2}\sum_{i=1}^{N}\left(y_i - \hat m(x_i)\right)^2 K\!\left(\frac{x_i - x_0}{h}\right)^2,$$
where $\hat f(x_0)$ is a kernel estimate of the density $f(x)$ at $x = x_0$. A paired bootstrap resamples $(y^*, x^*)$ and forms $t_b^* = [\hat m_b^*(x_0) - \hat m(x_0)]/s^*_{\hat m(x_0),b}$, where $s^*_{\hat m(x_0),b}$ is computed using bootstrap-sample kernel estimates $\hat m_b^*(x_i)$ and $\hat f_b^*(x_0)$. The percentile-t confidence interval of Section 11.2.7 then provides an asymptotic refinement. For a symmetrical confidence interval or symmetrical test at level α the error is $o((Nh)^{-1})$ rather than the $O((Nh)^{-1})$ obtained using the first-order asymptotic approximation.
Several variations on this bootstrap are possible. Rather than using undersmoothing, bias can be eliminated by directly estimating the bias term given in Section 9.5.2. Also, rather than using $s^2_{\hat m(x_0)}$, the variance term given in Section 9.5.2 can be directly estimated.
Yatchew (2003) provides considerable detail on implementing the bootstrap in non-
parametric and semiparametric regression.
11.6.6. Nonsmooth Estimators
From Section 11.4.2 the bootstrap assumes smoothness in estimators and statistics.
Otherwise the bootstrap may not offer an asymptotic refinement and may even be
invalid.
As an illustration we consider the LAD estimator and an extension to binary data. The LAD estimator (see Section 4.6.2) has objective function $\sum_i |y_i - \mathbf{x}_i'\boldsymbol\beta|$, which has discontinuous first derivative. A bootstrap can provide a valid asymptotic approximation but does not provide an asymptotic refinement. For binary outcomes, the LAD estimator extends to the maximum score estimator of Manski (1975) (see Section 14.7.2). For this estimator the bootstrap is not even consistent.
In these examples bootstraps with asymptotic refinements can be obtained by us-
ing a smoothed version of the original objective function for the estimator. For ex-
ample, the smoothed maximum score estimator of Horowitz (1992) is presented in
Section 14.7.2.
11.6.7. Time Series
The bootstrap relies on resampling from an iid distribution. Time-series data therefore
present obvious problems as the result of dependence.
The bootstrap is straightforward in the linear model with an ARMA error structure, by resampling the underlying white noise error. As an example, suppose $y_t = \beta x_t + u_t$, where $u_t = \rho u_{t-1} + \varepsilon_t$ and $\varepsilon_t$ is white noise. Then given estimates $\hat\beta$ and $\hat\rho$ we can recursively compute residuals as
$$\hat\varepsilon_t = \hat u_t - \hat\rho\hat u_{t-1} = y_t - x_t\hat\beta - \hat\rho(y_{t-1} - x_{t-1}\hat\beta).$$
Bootstrapping these residuals to give $\hat\varepsilon_t^*$, $t = 1, \ldots, T$, we can then recursively compute $\hat u_t^* = \hat\rho\hat u_{t-1}^* + \hat\varepsilon_t^*$ and hence $y_t^* = \hat\beta x_t + \hat u_t^*$. Then regress $y_t^*$ on $x_t$ with AR(1) error. An early example was presented by Freedman (1984), who bootstrapped a dynamic linear simultaneous equations regression model estimated by 2SLS. Given linearity, simultaneity adds little difficulty. The dynamic nature of the model is handled by recursively constructing $y_t^* = f(y_{t-1}^*, \mathbf{x}_t, u_t^*)$, where the $u_t^*$ are obtained by resampling from the 2SLS structural equation residuals and $y_0^* = y_0$. Then perform 2SLS on each bootstrap sample.
This method assumes the underlying error is iid. For general dependent data without
an ARMA specification, for example, nonstationary data, the moving blocks bootstrap
presented in Section 11.5.2 can be used.
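A minimal Python (NumPy) sketch, ours, of the recursive residual bootstrap for the AR(1)-error regression above; each returned y* series would then be re-estimated by the original method.

```python
import numpy as np

def ar1_residual_bootstrap(y, x, beta_hat, rho_hat, B=999, seed=None):
    """Recursive residual bootstrap for y_t = beta*x_t + u_t, u_t = rho*u_{t-1} + e_t."""
    rng = np.random.default_rng(seed)
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    T = y.size
    u_hat = y - beta_hat * x
    eps_hat = u_hat[1:] - rho_hat * u_hat[:-1]   # estimated innovations, t = 2..T
    eps_hat = eps_hat - eps_hat.mean()           # recenter (Section 11.5.4)

    y_stars = np.empty((B, T))
    for b in range(B):
        eps_star = rng.choice(eps_hat, size=T, replace=True)
        u_star = np.empty(T)
        u_star[0] = u_hat[0]                     # initialize at the sample value
        for t in range(1, T):
            u_star[t] = rho_hat * u_star[t - 1] + eps_star[t]
        y_stars[b] = beta_hat * x + u_star
    return y_stars
```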
For testing unit roots or cointegration special care is needed in applying the boot-
strap as the behavior of the test statistic changes discontinuously at the unit root.
See, for example, Li and Maddala (1997). Although it is possible to implement a
valid bootstrap in this situation, to date these bootstraps do not provide an asymptotic
refinement.
11.7. Practical Considerations
The bootstrap without asymptotic refinement can be a very useful tool for the applied
researcher in situations where it is difficult to perform inference by other means. This
need can vary with available software and the practitioner’s tool kit. The most common
application of the bootstrap to date is computation of standard errors needed to conduct
a Wald hypothesis test. Examples include heteroskedasticity-robust and panel-robust
inference, inference for two-step estimators, and inference on transformations of es-
timators. Other potential applications include computation of m-test statistics such as
the Hausman test.
The bootstrap can additionally provide an asymptotic refinement. Many Monte
Carlo studies show that quite standard procedures can perform poorly in finite sam-
ples. There appears to be great potential for use of bootstrap refinements, currently
unrealized. In some cases this could improve existing inference, such as use of the
wild bootstrap in models with additive errors that are heteroskedastic. In other cases it
should encourage increased use of methods that are currently under-utilized. In partic-
ular, model specification tests with good small-sample properties can be implemented
by bootstrapping easily computed auxiliary regressions.
There are two barriers to the use of the bootstrap. First, the bootstrap is not always
built into statistical packages. This will change over time, and for now constructing
code for a bootstrap is not too difficult provided the package includes looping and the
ability to save regression output. Second, there are subtleties involved. Asymptotic re-
finement requires use of an asymptotically pivotal statistic and the simplest bootstraps
presume iid data and smoothness of estimators and statistics. This covers a wide class
of applications but not all applications.
11.8. Bibliographic Notes
The bootstrap was proposed by Efron (1979) for the iid case. Singh (1981) and Bickel and
Freedman (1981) provided early theory. A good introductory statistics treatment is by Efron and Tibshirani (1993), and a more advanced treatment is by Hall (1992). Extensions to
the regression case were considered early on; see, for example, Freedman (1984). Most of
the work by econometricians has occurred in the past 10 years. The survey of Horowitz
(2001) is very comprehensive and is well complemented by the survey of Brownstone and
Kazimi (1998), which considers many econometrics applications, and the paper by MacKinnon
(2002).
Exercises
11–1 Consider the model $y = \alpha + \beta x + \varepsilon$, where α, β, and x are scalars and $\varepsilon \sim N[0, \sigma^2]$. Generate a sample of size N = 20 with α = 2, β = 1, and $\sigma^2 = 1$ and suppose that $x \sim N[2, 2]$. We wish to test $H_0: \beta = 1$ against $H_a: \beta \neq 1$ at level 0.05 using the t-statistic $t = (\hat\beta - 1)/\text{se}[\hat\beta]$. Do as much of the following as your software permits. Use B = 499 bootstrap replications.
(a) Estimate the model by OLS, giving slope estimate $\hat\beta$.
(b) Use a paired bootstrap to compute the standard error and compare this to
the original sample estimate. Use the bootstrap standard error to test H0.
(c) Use a paired bootstrap with asymptotic refinement to test H0.
(d) Use a residual bootstrap to compute the standard error and compare this to
the original sample estimate. Use the bootstrap standard error to test H0.
(e) Use a residual bootstrap with asymptotic refinement to test H0.
11–2 Generate a sample of size 20 according to the following dgp. The two regressors are generated by $x_1 \sim \chi^2(4) - 4$ and $x_2 \sim 3.5 + U[1, 2]$; the error is from a mixture of normals, with $u \sim N[0, 25]$ with probability 0.3 and $u \sim N[0, 5]$ with probability 0.7; and the dependent variable is $y = 1.3x_1 + 0.7x_2 + 0.5u$.
(a) Estimate by OLS the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$.
(b) Suppose we are interested in estimating the quantity $\gamma = \beta_1 + \beta_2^2$ from the data. Use the least-squares estimates to estimate this quantity. Use the delta method to obtain an approximate standard error for this function.
(c) Then estimate the standard error of $\hat\gamma$ using a paired bootstrap. Compare this to $\text{se}[\hat\gamma]$ from part (b) and explain the difference. For the bootstrap use B = 25 and B = 200.
(d) Now test $H_0: \gamma = 1.0$ at level 0.05 using a paired bootstrap with B = 999. Perform bootstrap tests without and with asymptotic refinement.
11–3 Use 200 observations from the Section 4.6.4 data on natural logarithm of health
expenditure (y) and natural logarithm of total expenditure (x). Obtain OLS esti-
mates of the model y = α + βx + u. Use the paired bootstrap with B = 999.
(a) Obtain a bootstrap estimate of the standard error of $\hat\beta$.
(b) Use this standard error estimate to test $H_0: \beta = 1$ against $H_a: \beta \neq 1$.
(c) Do a bootstrap test with refinement of $H_0: \beta = 1$ against $H_a: \beta \neq 1$ under the assumption that u is homoskedastic.
(d) If u is heteroskedastic what happens to your method in (c)? Is the test still
asymptotically valid, and if so does it offer an asymptotic refinement?
(e) Do a bootstrap to obtain a bias-corrected estimate of β.
C H A P T E R 12
Simulation-Based Methods
12.1. Introduction
The nonlinear methods presented in the preceding chapters do not require closed-form
solutions for the estimator. Nonetheless, they rely considerably on analytical tractabil-
ity. In particular, the objective function for the estimator has been assumed to have a
closed-form expression, and the asymptotic distribution of the estimator is based on a
linearization of the estimating equations.
In the current chapter we present simulation-based estimation method


To my mother and the memory of my father.
To the memory of my parents.
Contents

List of Figures
List of Tables
Preface

I Preliminaries
1 Overview
  1.1 Introduction
  1.2 Distinctive Aspects of Microeconometrics
  1.3 Book Outline
  1.4 How to Use This Book
  1.5 Software
  1.6 Notation and Conventions
2 Causal and Noncausal Models
  2.1 Introduction
  2.2 Structural Models
  2.3 Exogeneity
  2.4 Linear Simultaneous Equations Model
  2.5 Identification Concepts
  2.6 Single-Equation Models
  2.7 Potential Outcome Model
  2.8 Causal Modeling and Estimation Strategies
  2.9 Bibliographic Notes
3 Microeconomic Data Structures
  3.1 Introduction
  3.2 Observational Data
  3.3 Data from Social Experiments
  3.4 Data from Natural Experiments
  3.5 Practical Considerations
  3.6 Bibliographic Notes

II Core Methods
4 Linear Models
  4.1 Introduction
  4.2 Regressions and Loss Functions
  4.3 Example: Returns to Schooling
  4.4 Ordinary Least Squares
  4.5 Weighted Least Squares
  4.6 Median and Quantile Regression
  4.7 Model Misspecification
  4.8 Instrumental Variables
  4.9 Instrumental Variables in Practice
  4.10 Practical Considerations
  4.11 Bibliographic Notes
5 Maximum Likelihood and Nonlinear Least-Squares Estimation
  5.1 Introduction
  5.2 Overview of Nonlinear Estimators
  5.3 Extremum Estimators
  5.4 Estimating Equations
  5.5 Statistical Inference
  5.6 Maximum Likelihood
  5.7 Quasi-Maximum Likelihood
  5.8 Nonlinear Least Squares
  5.9 Example: ML and NLS Estimation
  5.10 Practical Considerations
  5.11 Bibliographic Notes
6 Generalized Method of Moments and Systems Estimation
  6.1 Introduction
  6.2 Examples
  6.3 Generalized Method of Moments
  6.4 Linear Instrumental Variables
  6.5 Nonlinear Instrumental Variables
  6.6 Sequential Two-Step m-Estimation
  6.7 Minimum Distance Estimation
  6.8 Empirical Likelihood
  6.9 Linear Systems of Equations
  6.10 Nonlinear Sets of Equations
  6.11 Practical Considerations
  6.12 Bibliographic Notes
7 Hypothesis Tests
  7.1 Introduction
  7.2 Wald Test
  7.3 Likelihood-Based Tests
  7.4 Example: Likelihood-Based Hypothesis Tests
  7.5 Tests in Non-ML Settings
  7.6 Power and Size of Tests
  7.7 Monte Carlo Studies
  7.8 Bootstrap Example
  7.9 Practical Considerations
  7.10 Bibliographic Notes
8 Specification Tests and Model Selection
  8.1 Introduction
  8.2 m-Tests
  8.3 Hausman Test
  8.4 Tests for Some Common Misspecifications
  8.5 Discriminating between Nonnested Models
  8.6 Consequences of Testing
  8.7 Model Diagnostics
  8.8 Practical Considerations
  8.9 Bibliographic Notes
9 Semiparametric Methods
  9.1 Introduction
  9.2 Nonparametric Example: Hourly Wage
  9.3 Kernel Density Estimation
  9.4 Nonparametric Local Regression
  9.5 Kernel Regression
  9.6 Alternative Nonparametric Regression Estimators
  9.7 Semiparametric Regression
  9.8 Derivations of Mean and Variance of Kernel Estimators
  9.9 Practical Considerations
  9.10 Bibliographic Notes
10 Numerical Optimization
  10.1 Introduction
  10.2 General Considerations
  10.3 Specific Methods
  10.4 Practical Considerations
  10.5 Bibliographic Notes

III Simulation-Based Methods
11 Bootstrap Methods
  11.1 Introduction
  11.2 Bootstrap Summary
  11.3 Bootstrap Example
  11.4 Bootstrap Theory
  11.5 Bootstrap Extensions
  11.6 Bootstrap Applications
  11.7 Practical Considerations
  11.8 Bibliographic Notes
12 Simulation-Based Methods
  12.1 Introduction
  12.2 Examples
  12.3 Basics of Computing Integrals
  12.4 Maximum Simulated Likelihood Estimation
  12.5 Moment-Based Simulation Estimation
  12.6 Indirect Inference
  12.7 Simulators
  12.8 Methods of Drawing Random Variates
  12.9 Bibliographic Notes
13 Bayesian Methods
  13.1 Introduction
  13.2 Bayesian Approach
  13.3 Bayesian Analysis of Linear Regression
  13.4 Monte Carlo Integration
  13.5 Markov Chain Monte Carlo Simulation
  13.6 MCMC Example: Gibbs Sampler for SUR
  13.7 Data Augmentation
  13.8 Bayesian Model Selection
  13.9 Practical Considerations
  13.10 Bibliographic Notes

IV Models for Cross-Section Data
14 Binary Outcome Models
  14.1 Introduction
  14.2 Binary Outcome Example: Fishing Mode Choice
  14.3 Logit and Probit Models
  14.4 Latent Variable Models
  14.5 Choice-Based Samples
  14.6 Grouped and Aggregate Data
  14.7 Semiparametric Estimation
  14.8 Derivation of Logit from Type I Extreme Value
  14.9 Practical Considerations
  14.10 Bibliographic Notes
15 Multinomial Models
  15.1 Introduction
  15.2 Example: Choice of Fishing Mode
  15.3 General Results
  15.4 Multinomial Logit
  15.5 Additive Random Utility Models
  15.6 Nested Logit
  15.7 Random Parameters Logit
  15.8 Multinomial Probit
  15.9 Ordered, Sequential, and Ranked Outcomes
  15.10 Multivariate Discrete Outcomes
  15.11 Semiparametric Estimation
  15.12 Derivations for MNL, CL, and NL Models
  15.13 Practical Considerations
  15.14 Bibliographic Notes
16 Tobit and Selection Models
  16.1 Introduction
  16.2 Censored and Truncated Models
  16.3 Tobit Model
  16.4 Two-Part Model
  16.5 Sample Selection Models
  16.6 Selection Example: Health Expenditures
  16.7 Roy Model
  16.8 Structural Models
  16.9 Semiparametric Estimation
  16.10 Derivations for the Tobit Model
  16.11 Practical Considerations
  16.12 Bibliographic Notes
17 Transition Data: Survival Analysis
  17.1 Introduction
  17.2 Example: Duration of Strikes
  17.3 Basic Concepts
  17.4 Censoring
  17.5 Nonparametric Models
  17.6 Parametric Regression Models
  17.7 Some Important Duration Models
  17.8 Cox PH Model
  17.9 Time-Varying Regressors
  17.10 Discrete-Time Proportional Hazards
  17.11 Duration Example: Unemployment Duration
  17.12 Practical Considerations
  17.13 Bibliographic Notes
18 Mixture Models and Unobserved Heterogeneity
  18.1 Introduction
  18.2 Unobserved Heterogeneity and Dispersion
  18.3 Identification in Mixture Models
  18.4 Specification of the Heterogeneity Distribution
  18.5 Discrete Heterogeneity and Latent Class Analysis
  18.6 Stock and Flow Sampling
  18.7 Specification Testing
  18.8 Unobserved Heterogeneity Example: Unemployment Duration
  18.9 Practical Considerations
  18.10 Bibliographic Notes
19 Models of Multiple Hazards
  19.1 Introduction
  19.2 Competing Risks
  19.3 Joint Duration Distributions
  19.4 Multiple Spells
  19.5 Competing Risks Example: Unemployment Duration
  19.6 Practical Considerations
  19.7 Bibliographic Notes
20 Models of Count Data
  20.1 Introduction
  20.2 Basic Count Data Regression
  20.3 Count Example: Contacts with Medical Doctor
  20.4 Parametric Count Regression Models
  20.5 Partially Parametric Models
  20.6 Multivariate Counts and Endogenous Regressors
  20.7 Count Example: Further Analysis
  20.8 Practical Considerations
  20.9 Bibliographic Notes

V Models for Panel Data
21 Linear Panel Models: Basics
  21.1 Introduction
  21.2 Overview of Models and Estimators
  21.3 Linear Panel Example: Hours and Wages
  21.4 Fixed Effects versus Random Effects Models
  21.5 Pooled Models
  21.6 Fixed Effects Model
  21.7 Random Effects Model
  21.8 Modeling Issues
  21.9 Practical Considerations
  21.10 Bibliographic Notes
22 Linear Panel Models: Extensions
  22.1 Introduction
  22.2 GMM Estimation of Linear Panel Models
  22.3 Panel GMM Example: Hours and Wages
  22.4 Random and Fixed Effects Panel GMM
  22.5 Dynamic Models
  22.6 Difference-in-Differences Estimator
  22.7 Repeated Cross Sections and Pseudo Panels
  22.8 Mixed Linear Models
  22.9 Practical Considerations
  22.10 Bibliographic Notes
23 Nonlinear Panel Models
  23.1 Introduction
  23.2 General Results
  23.3 Nonlinear Panel Example: Patents and R&D
  23.4 Binary Outcome Data
  23.5 Tobit and Selection Models
  23.6 Transition Data
  23.7 Count Data
  23.8 Semiparametric Estimation
  23.9 Practical Considerations
  23.10 Bibliographic Notes

VI Further Topics
24 Stratified and Clustered Samples
  24.1 Introduction
  24.2 Survey Sampling
  24.3 Weighting
  24.4 Endogenous Stratification
  24.5 Clustering
  24.6 Hierarchical Linear Models
  24.7 Clustering Example: Vietnam Health Care Use
  24.8 Complex Surveys
  24.9 Practical Considerations
  24.10 Bibliographic Notes
25 Treatment Evaluation
  25.1 Introduction
  25.2 Setup and Assumptions
  25.3 Treatment Effects and Selection Bias
  25.4 Matching and Propensity Score Estimators
  25.5 Differences-in-Differences Estimators
  25.6 Regression Discontinuity Design
  25.7 Instrumental Variable Methods
  25.8 Example: The Effect of Training on Earnings
  25.9 Bibliographic Notes
26 Measurement Error Models
  26.1 Introduction
  26.2 Measurement Error in Linear Regression
  26.3 Identification Strategies
  26.4 Measurement Errors in Nonlinear Models
  26.5 Attenuation Bias Simulation Examples
  26.6 Bibliographic Notes
27 Missing Data and Imputation
  27.1 Introduction
  27.2 Missing Data Assumptions
  27.3 Handling Missing Data without Models
  27.4 Observed-Data Likelihood
  27.5 Regression-Based Imputation
  27.6 Data Augmentation and MCMC
  27.7 Multiple Imputation
  27.8 Missing Data MCMC Imputation Example
  27.9 Practical Considerations
  27.10 Bibliographic Notes

A Asymptotic Theory
  A.1 Introduction
  A.2 Convergence in Probability
  A.3 Laws of Large Numbers
  A.4 Convergence in Distribution
  A.5 Central Limit Theorems
  A.6 Multivariate Normal Limit Distributions
  A.7 Stochastic Order of Magnitude
  A.8 Other Results
  A.9 Bibliographic Notes
B Making Pseudo-Random Draws

References
Index
List of Figures

3.1 Social experiment with random assignment
4.1 Quantile regression estimates of slope coefficient
4.2 Quantile regression estimated lines
7.1 Power of Wald chi-square test
7.2 Density of Wald test on slope coefficient
9.1 Histogram for log wage
9.2 Kernel density estimates for log wage
9.3 Nonparametric regression of log wage on education
9.4 Kernel density estimates using different kernels
9.5 k-nearest neighbors regression
9.6 Nonparametric regression using Lowess
9.7 Nonparametric estimate of derivative of y with respect to x
11.1 Bootstrap estimate of the density of t-test statistic
12.1 Halton sequence draws compared to pseudo-random draws
12.2 Inverse transformation method for unit exponential draws
12.3 Accept–reject method for random draws
13.1 Bayesian analysis for mean parameter of normal density
14.1 Charter boat fishing: probit and logit predictions
15.1 Generalized random utility model
16.1 Tobit regression example
16.2 Inverse Mills ratio as censoring point c increases
17.1 Strike duration: Kaplan–Meier survival function
17.2 Weibull distribution: density, survivor, hazard, and cumulative hazard functions
17.3 Unemployment duration: Kaplan–Meier survival function
17.4 Unemployment duration: survival functions by unemployment insurance
17.5 Unemployment duration: Nelson–Aalen cumulated hazard function
17.6 Unemployment duration: cumulative hazard function by unemployment insurance
18.1 Length-biased sampling under stock sampling: example
18.2 Unemployment duration: exponential model generalized residuals
18.3 Unemployment duration: exponential-gamma model generalized residuals
18.4 Unemployment duration: Weibull model generalized residuals
18.5 Unemployment duration: Weibull-IG model generalized residuals
19.1 Unemployment duration: Cox CR baseline survival functions
19.2 Unemployment duration: Cox CR baseline cumulative hazards
21.1 Hours and wages: pooled (overall) regression
21.2 Hours and wages: between regression
21.3 Hours and wages: within (fixed effects) regression
21.4 Hours and wages: first differences regression
23.1 Patents and R&D: pooled (overall) regression
25.1 Regression-discontinuity design: example
25.2 RD design: treatment assignment in sharp and fuzzy designs
25.3 Training impact: earnings against propensity score by treatment
27.1 Missing data: examples of missing regressors
List of Tables

1.1 Book Outline
1.2 Outline of a 20-Lecture 10-Week Course
1.3 Commonly Used Acronyms and Abbreviations
3.1 Features of Some Selected Social Experiments
3.2 Features of Some Selected Natural Experiments
4.1 Loss Functions and Corresponding Optimal Predictors
4.2 Least Squares Estimators and Their Asymptotic Variance
4.3 Least Squares: Example with Conditionally Heteroskedastic Errors
4.4 Instrumental Variables Example
4.5 Returns to Schooling: Instrumental Variables Estimates
5.1 Asymptotic Properties of M-Estimators
5.2 Marginal Effect: Three Different Estimates
5.3 Maximum Likelihood: Commonly Used Densities
5.4 Linear Exponential Family Densities: Leading Examples
5.5 Nonlinear Least Squares: Common Examples
5.6 Nonlinear Least-Squares Estimators and Their Asymptotic Variance
5.7 Exponential Example: Least-Squares and ML Estimates
6.1 Generalized Method of Moments: Examples
6.2 GMM Estimators in Linear IV Model and Their Asymptotic Variance
6.3 GMM Estimators in Nonlinear IV Model and Their Asymptotic Variance
6.4 Nonlinear Two-Stage Least-Squares Example
7.1 Test Statistics for Poisson Regression Example
7.2 Wald Test Size and Power for Probit Regression Example
8.1 Specification m-Tests for Poisson Regression Example
8.2 Nonnested Model Comparisons for Poisson Regression Example
8.3 Pseudo R2s: Poisson Regression Example
9.1 Kernel Functions: Commonly Used Examples
9.2 Semiparametric Models: Leading Examples
10.1 Gradient Method Results
10.2 Computational Difficulties: A Partial Checklist
11.1 Bootstrap Statistical Inference on a Slope Coefficient: Example
11.2 Bootstrap Theory Notation
12.1 Monte Carlo Integration: Example for x Standard Normal
12.2 Maximum Simulated Likelihood Estimation: Example
12.3 Method of Simulated Moments Estimation: Example
13.1 Bayesian Analysis: Essential Components
13.2 Conjugate Families: Leading Examples
13.3 Gibbs Sampling: Seemingly Unrelated Regressions Example
13.4 Interpretation of Bayes Factors
14.1 Fishing Mode Choice: Data Summary
14.2 Fishing Mode Choice: Logit and Probit Estimates
14.3 Binary Outcome Data: Commonly Used Models
15.1 Fishing Mode Multinomial Choice: Data Summary
15.2 Fishing Mode Multinomial Choice: Logit Estimates
15.3 Fishing Mode Choice: Marginal Effects for Conditional Logit Model
16.1 Health Expenditure Data: Two-Part and Selection Models
17.1 Survival Analysis: Definitions of Key Concepts
17.2 Hazard Rate and Survivor Function Computation: Example
17.3 Strike Duration: Kaplan–Meier Survivor Function Estimates
17.4 Exponential and Weibull Distributions: pdf, cdf, Survivor Function, Hazard, Cumulative Hazard, Mean, and Variance
17.5 Standard Parametric Models and Their Hazard and Survivor Functions
17.6 Unemployment Duration: Description of Variables
17.7 Unemployment Duration: Kaplan–Meier Survival and Nelson–Aalen Cumulated Hazard Functions
17.8 Unemployment Duration: Estimated Parameters from Four Parametric Models
17.9 Unemployment Duration: Estimated Hazard Ratios from Four Parametric Models
18.1 Unemployment Duration: Exponential Model with Gamma and IG Heterogeneity
18.2 Unemployment Duration: Weibull Model with and without Heterogeneity
19.1 Some Standard Copula Functions
19.2 Unemployment Duration: Competing and Independent Risk Estimates of Exponential Model with and without IG Frailty
19.3 Unemployment Duration: Competing and Independent Risk Estimates of Weibull Model with and without IG Frailty
20.1 Proportion of Zero Counts in Selected Empirical Studies
20.2 Summary of Data Sets Used in Recent Patent–R&D Studies
20.3 Contacts with Medical Doctor: Frequency Distribution
20.4 Contacts with Medical Doctor: Variable Descriptions
20.5 Contacts with Medical Doctor: Count Model Estimates
20.6 Contacts with Medical Doctor: Observed and Fitted Frequencies
21.1 Linear Panel Model: Common Estimators and Models
21.2 Hours and Wages: Standard Linear Panel Model Estimators
21.3 Hours and Wages: Autocorrelations of Pooled OLS Residuals
21.4 Hours and Wages: Autocorrelations of Within Regression Residuals
21.5 Pooled Least-Squares Estimators and Their Asymptotic Variances
21.6 Variances of Pooled OLS Estimator with Equicorrelated Errors
21.7 Hours and Wages: Pooled OLS and GLS Estimates
22.1 Panel Exogeneity Assumptions and Resulting Instruments
22.2 Hours and Wages: GMM-IV Linear Panel Model Estimators
23.1 Patents and R&D Spending: Nonlinear Panel Model Estimators
24.1 Stratification Schemes with Random Sampling within Strata
24.2 Properties of Estimators for Different Clustering Models
24.3 Vietnam Health Care Use: Data Description
24.4 Vietnam Health Care Use: FE and RE Models for Positive Expenditure
24.5 Vietnam Health Care Use: Frequencies for Pharmacy Visits
24.6 Vietnam Health Care Use: RE and FE Models for Pharmacy Visits
25.1 Treatment Effects Framework
25.2 Treatment Effects Measures: ATE and ATET
25.3 Training Impact: Sample Means in Treated and Control Samples
25.4 Training Impact: Various Estimates of Treatment Effect
25.5 Training Impact: Distribution of Propensity Score for Treated and Control Units Using DW (1999) Specification
25.6 Training Impact: Estimates of ATET
25.7 Training Evaluation: DW (2002) Estimates of ATET
26.1 Attenuation Bias in a Logit Regression with Measurement Error
26.2 Attenuation Bias in a Nonlinear Regression with Additive Measurement Error
27.1 Relative Efficiency of Multiple Imputation
27.2 Missing Data Imputation: Linear Regression Estimates with 10% Missing Data and High Correlation Using MCMC Algorithm
27.3 Missing Data Imputation: Linear Regression Estimates with 25% Missing Data and High Correlation Using MCMC Algorithm
27.4 Missing Data Imputation: Linear Regression Estimates with 10% Missing Data and Low Correlation Using MCMC Algorithm
27.5 Missing Data Imputation: Logistic Regression Estimates with 10% Missing Data and High Correlation Using MCMC Algorithm
27.6 Missing Data Imputation: Logistic Regression Estimates with 25% Missing Data and Low Correlation Using MCMC Algorithm
A.1 Asymptotic Theory: Definitions and Theorems
B.1 Continuous Random Variable Densities and Moments
B.2 Continuous Random Variable Generators
B.3 Discrete Random Variable Probability Mass Functions and Moments
B.4 Discrete Random Variable Generators
Preface

This book provides a detailed treatment of microeconometric analysis, the analysis of individual-level data on the economic behavior of individuals or firms. This type of analysis usually entails applying regression methods to cross-section and panel data.

The book aims to provide the practitioner with comprehensive coverage of statistical methods and their application in modern applied microeconometrics research. These methods include nonlinear modeling, inference under minimal distributional assumptions, identifying and measuring causation rather than mere association, and correcting departures from simple random sampling. Many of these features are of relevance to individual-level data analysis throughout the social sciences.

This ambitious agenda has determined the characteristics of the book. First, although oriented to the practitioner, the book is relatively advanced in places. A cookbook approach is inadequate because when two or more complications occur simultaneously – a common situation – the practitioner must know enough to be able to adapt available methods. Second, the book provides considerable coverage of practical data problems (see especially the last three chapters). Third, the book includes substantial empirical examples in many chapters to illustrate some of the methods covered. Finally, the book is unusually long. Despite this length we have been space-constrained. We had intended to include even more empirical examples, and abbreviated presentations will at times fail to recognize the accomplishments of researchers who have made substantive contributions.

The book assumes a good understanding of the linear regression model with matrix algebra. It is written at the mathematical level of the first-year economics Ph.D. sequence, comparable to Greene (2003). We have two types of readers in mind. First, the book can be used as a course text for a microeconometrics course, typically taught in the second year of the Ph.D., or for data-oriented microeconomics field courses such as labor economics, public economics, and industrial organization. Second, the book can be used as a reference work for graduate students and applied researchers who despite training in microeconometrics will inevitably have gaps that they wish to fill.

For instructors using this book as an econometrics course text it is best to introduce the basic nonlinear cross-section and linear panel data models as early as possible, initially skipping many of the methods chapters.
The key methods chapter (Chapter 5) covers maximum likelihood and nonlinear least-squares estimation. Knowledge of maximum likelihood and nonlinear least-squares estimators provides adequate background for the most commonly used nonlinear cross-section models (Chapters 14–17 and 20), basic linear panel data models (Chapter 21), and treatment evaluation methods (Chapter 25). Generalized method of moments estimation (Chapter 6) is needed especially for advanced linear panel data methods (Chapter 22).

For readers using this book as a reference work, the chapters have been written to be as self-contained as possible. The notable exception is that some command of general estimation results in Chapter 5, and occasionally Chapter 6, will be necessary. Most chapters on models are structured to begin with a discussion and example that is accessible to a wide audience.

The Web site www.econ.ucdavis.edu/faculty/cameron provides all the data and computer programs used in this book and related materials useful for instructional purposes.

This project has been long and arduous, and at times seemingly without an end. Its completion has been greatly aided by our colleagues, friends, and graduate students. We thank especially the following for reading and commenting on specific chapters: Bijan Borah, Kurt Brännäs, Pian Chen, Tim Cogley, Partha Deb, Massimiliano De Santis, David Drukker, Jeff Gill, Tue Gorgens, Shiferaw Gurmu, Lu Ji, Oscar Jorda, Roger Koenker, Chenghui Li, Tong Li, Doug Miller, Murat Munkin, Jim Prieger, Ahmed Rahmen, Sunil Sapra, Haruki Seitani, Yacheng Sun, Xiaoyong Zheng, and David Zimmer. Pian Chen gave detailed comments on most of the book. We thank Rajeev Dehejia, Bronwyn Hall, Cathy Kling, Jeffrey Kling, Will Manning, Brian McCall, and Jim Ziliak for making their data available for empirical illustrations. We thank our respective departments for facilitating our collaboration and for the production and distribution of the draft manuscript at various stages. We benefited from the comments of two anonymous reviewers. Guidance, advice, and encouragement from our Cambridge editor, Scott Parris, have been invaluable.

Our interest in econometrics owes much to the training and environments we encountered as students and in the initial stages of our academic careers. The first author thanks The Australian National University; Stanford University, especially Takeshi Amemiya and Tom MaCurdy; and The Ohio State University. The second author thanks the London School of Economics and The Australian National University. Our interest in writing a book oriented to the practitioner owes much to our exposure to the research of graduate students and colleagues at our respective institutions, UC-Davis and IU-Bloomington.

Finally, we thank our families for their patience and understanding without which completion of this project would not have been possible.

A. Colin Cameron
Davis, California

Pravin K. Trivedi
Bloomington, Indiana
PART ONE

Preliminaries

Part 1 covers the essential components of microeconometric analysis: an economic specification, a statistical model, and a data set.

Chapter 1 discusses the distinctive aspects of microeconometrics and provides an outline of the book. It emphasizes that discreteness of data, and nonlinearity and heterogeneity of behavioral relationships, are key aspects of individual-level microeconometric models. It concludes by presenting the notation and conventions used throughout the book.

Chapters 2 and 3 set the scene for the remainder of the book by introducing the reader to key model and data concepts that shape the analyses of later chapters.

A key distinction in econometrics is between essentially descriptive models and data summaries at various levels of statistical sophistication and models that go beyond associations and attempt to estimate causal parameters. The classic definitions of causality in econometrics derive from the Cowles Commission simultaneous equations models that draw sharp distinctions between exogenous and endogenous variables, and between structural and reduced form parameters. Although reduced form models are very useful for some purposes, knowledge of structural or causal parameters is essential for policy analysis. Identification of structural parameters within the simultaneous equations framework poses numerous conceptual and practical difficulties. An increasingly used alternative approach, based on the potential outcome model, also attempts to identify causal parameters but does so by posing limited questions within a more manageable framework. Chapter 2 provides an overview of the fundamental issues that arise in these and other alternative frameworks. Readers who initially find this material challenging should return to this chapter after gaining greater familiarity with specific models covered later in the book.

The empirical researcher's ability to identify causal parameters depends not only on the statistical tools and models but also on the type of data available. An experimental framework provides a standard for establishing causal connections. However, observational, not experimental, data form the basis of much of econometric inference. Chapter 3 surveys the pros and cons of three main types of data: observational data, data from social experiments, and data from natural experiments. The strengths and weaknesses of conducting causal inference based on each type of data are reviewed.
CHAPTER 1

Overview

1.1. Introduction

This book provides a detailed treatment of microeconometric analysis, the analysis of individual-level data on the economic behavior of individuals or firms. A broader definition would also include grouped data. Usually regression methods are applied to cross-section or panel data.

Analysis of individual data has a long history. Ernst Engel (1857) was among the earliest quantitative investigators of household budgets. Allen and Bowley (1935), Houthakker (1957), and Prais and Houthakker (1955) made important contributions following the same research and modeling tradition. Other landmark studies that were also influential in stimulating the development of microeconometrics, even though they did not always use individual-level information, include those by Marschak and Andrews (1944) in production theory and by Wold and Jureen (1953), Stone (1953), and Tobin (1958) in consumer demand.

As important as this earlier work on household budgets and demand analysis is, the material covered in this book has stronger connections with the work on discrete choice analysis and censored and truncated variable models that saw their first serious econometric applications in the work of McFadden (1973, 1984) and Heckman (1974, 1979), respectively. These works involved a major departure from the overwhelming reliance on linear models that characterized earlier work. Subsequently, they have led to significant methodological innovations in econometrics. Among the earlier textbook-level treatments of this material (and more) are the works of Maddala (1983) and Amemiya (1985). As emphasized by Heckman (2001), McFadden (2001), and others, many of the fundamental issues that dominated earlier work based on market data remain important, especially concerning the conditions necessary for identifiability of causal economic relations. Nonetheless, the style of microeconometrics is sufficiently distinct to justify writing a text that is exclusively devoted to it.

Modern microeconometrics based on individual-, household-, and establishment-level data owes a great deal to the greater availability of data from cross-section and longitudinal sample surveys and census data.
In the past two decades, with the expansion of electronic recording and collection of data at the individual level, data volume has grown explosively. So too has the available computing power for analyzing large and complex data sets. In many cases event-level data are available; for example, marketing science often deals with purchase data collected by electronic scanners in supermarkets, and the industrial organization literature contains econometric analyses of airline travel data collected by online booking systems. There are now new branches of economics, such as social experimentation and experimental economics, that generate "experimental" data. These developments have created many new modeling opportunities that are absent when only aggregated market-level data are available. Meanwhile the explosive growth in the volume and types of data has also given rise to numerous methodological issues. Processing and econometric analysis of such large microdatabases, with the objective of uncovering patterns of economic behavior, constitutes the core of microeconometrics. Econometric analysis of such data is the subject matter of this book.

Key precursors of this book are the books by Maddala (1983) and Amemiya (1985). Like them it covers topics that are presented only briefly, or not at all, in undergraduate and first-year graduate econometrics courses. Especially compared to Amemiya (1985), this book is more oriented to the practitioner. The level of presentation is nonetheless advanced in places, especially for applied researchers in disciplines that are less mathematically oriented than economics. A relatively advanced presentation is needed for several reasons.

First, the data are often discrete or censored, in which case nonlinear methods such as logit, probit, and Tobit models are used. This leads to statistical inference based on more difficult asymptotic theory.

Second, distributional assumptions for such data become critically important. One response is to develop highly parametric models that are sufficiently detailed to capture the complexities of the data, but these models can be challenging to estimate. A more common response is to minimize parametric assumptions and perform statistical inference based on standard errors that are "robust" to complications such as heteroskedasticity and clustering. In such cases considerable knowledge can be needed to ensure valid statistical inference even if a standard regression package is used.

Third, economic studies often aim to determine causation rather than merely measure correlation, despite access to observational rather than experimental data. This leads to methods to isolate causation such as instrumental variables, simultaneous equations, measurement error correction, selection bias correction, panel data fixed effects, and differences-in-differences.

Fourth, microeconomic data are typically collected using cross-section and panel surveys, censuses, or social experiments. Survey data collected using these methods are subject to problems of complex survey methodology, departures from simple random sampling assumptions, and problems of sample selection, measurement error, and incomplete and/or missing data. Dealing with such issues in a way that can support valid population inferences from the estimated econometric models requires the use of advanced methods.

Finally, it is not unusual that two or more complications occur simultaneously, such as endogeneity in a logit model with panel data. Then a cookbook approach becomes very difficult to implement. Instead, considerable understanding of the theory underlying the methods is needed, as the researcher may need to read econometrics journal articles and adapt standard econometrics software.
1.2. Distinctive Aspects of Microeconometrics

We now consider several advantages of microeconometrics that derive from its distinctive features.

1.2.1. Discreteness and Nonlinearity

The first and most obvious point is that microeconometric data are usually at a low level of aggregation. This has a major consequence for the functional forms used to analyze the variables of interest. In many, if not most, cases linear functional forms turn out to be simply inappropriate. More fundamentally, disaggregation brings to the forefront heterogeneity of individuals, firms, and organizations that should be properly controlled for (modeled) if one wants to make valid inferences about the underlying relationships. We discuss these issues in greater detail in the following sections.

Although aggregation is not entirely absent in microdata, as for example when household- or establishment-level data are collected, the level of aggregation is usually orders of magnitude lower than is common in macro analyses. In the latter case the process of aggregation leads to smoothing, with many of the movements in opposite directions canceling in the course of summation. The aggregated variables often show smoother behavior than their components, and the relationships between the aggregates frequently show greater smoothness than the components. For example, a relation between two variables at a micro level may be piecewise linear with many nodes. After aggregation the relationship is likely to be well approximated by a smooth function. Hence an immediate consequence of disaggregation is the absence of features of continuity and smoothness, both of the variables themselves and of the relationships between them.

Usually individual- and firm-level data cover a huge range of variation, both in the cross-section and time-series dimensions. For example, average weekly consumption of (say) beef is highly likely to be positive and smoothly varying, whereas that of an individual household in a given week may be frequently zero and may also switch to positive values from time to time. The average number of hours worked by female workers is unlikely to be zero, but many individual females have zero market hours of work (corner solutions), switching to positive values at other times in the course of their labor market history. Average household expenditure on vacations is usually positive, but many individual households may have zero expenditure on vacations in any given year. Average per capita consumption of tobacco products will usually be positive, but many individuals in the population have never consumed these products and never will, irrespective of price and income considerations. As Pudney (1989) has observed, microdata exhibit "holes, kinks and corners." The holes correspond to nonparticipation in the activity of interest, kinks correspond to switching behavior, and corners correspond to the incidence of nonconsumption or nonparticipation at specific points of time. That is, discreteness and nonlinearity of response are intrinsic to microeconometrics.
An important class of nonlinear models in microeconometrics deals with limited dependent variables (Maddala, 1983). This class includes many models that provide suitable frameworks for analyzing discrete responses and responses with a limited range of variation. Such tools of analysis are of course also available for analyzing macrodata, if required. The point is that they are indispensable in microeconometrics and give it its distinctive feature.

1.2.2. Greater Realism

Macroeconometrics is sometimes based on strong assumptions; the representative agent assumption is a leading example. A frequent appeal is made to microeconomic reasoning to justify certain specifications and interpretations of empirical results. However, it is rarely possible to say explicitly how these are affected by aggregation over time and micro units. Alternatively, very extreme aggregation assumptions are made. For example, aggregates are said to reflect the behavior of a hypothetical representative agent. Such assumptions are also not credible.

From the viewpoint of microeconomic theory, quantitative analysis founded on microdata may be regarded as more realistic than that based on aggregated data. There are three justifications for this claim. First, the measurement of the variables involved in such hypotheses is often more direct (though not necessarily free from measurement error) and has greater correspondence to the theory being tested. Second, hypotheses about economic behavior are usually developed from theories of individual behavior. If these hypotheses are tested using aggregated data, then many approximations and simplifying assumptions have to be made. The simplifying assumption of a representative agent causes a great loss of information and severely limits the scope of an empirical investigation. Because such assumptions can be avoided in microeconometrics, and usually are, in principle microdata provide a more realistic framework for testing microeconomic hypotheses. This is not a claim that the promise of microdata is necessarily achieved in empirical work; such a claim must be assessed on a case-by-case basis. Finally, a realistic portrayal of economic activity should accommodate a broad range of outcomes and responses that are a consequence of individual heterogeneity and that are predicted by underlying theory. In this sense microeconomic data sets can support more realistic models.

Microeconometric data are often derived from household or firm surveys, typically encompassing a wide range of behavior, with many of the behavioral outcomes taking the form of discrete or categorical responses. Such data sets have many awkward features that call for special tools of formulation and analysis that, although not entirely absent from macroeconometric work, nevertheless are less widely used.
1.2.3. Greater Information Content

The potential advantages of microdata sets can be realized if such data are informative. Because sample surveys often provide independent observations on thousands of cross-sectional units, such data are thought to be more informative than the standard, usually highly serially correlated, macro time series typically consisting of at most a few hundred observations. As will be explained in the next chapter, in practice the situation is not so clear-cut because the microdata may be quite noisy. At the individual level many (idiosyncratic) factors may play a large role in determining responses. Often these cannot be observed, leading one to treat them under the heading of a random component, which can be a very large part of observed variation. In this sense randomness plays a larger role in microdata than in macrodata. Of course, this affects measures of goodness of fit of the regressions. Students whose initial exposure to econometrics comes through aggregate time-series analysis are often conditioned to see large R2 values. When encountering cross-section regressions for the first time, they express disappointment or even alarm at the "low explanatory power" of the regression equation. Nevertheless, there remains a strong presumption that, at least in certain dimensions, large microdata sets are highly informative.

Another qualification is that when one is dealing with purely cross-section data, very little can be said about the intertemporal aspects of relationships under study. This particular aspect of behavior can be studied using panel and transition data.

In many cases one is interested in the behavioral responses of a specific group of economic agents under some specified economic environment. One example is the impact of unemployment insurance on the job search behavior of young unemployed persons. Another example is the labor supply responses of low-income individuals receiving income support. Unless microdata are used, such issues cannot be addressed directly in empirical work.

1.2.4. Microeconomic Foundations

Econometric models vary in the explicit role given to economic theory. At one end of the spectrum there are models in which a priori theorizing may play a dominant role in the specification of the model and in the choice of an estimation procedure. At the other end of the spectrum are empirical investigations that make much less use of economic theory.

The goal of the analysis in the first case is to identify and estimate fundamental parameters, sometimes called deep parameters, that characterize individual tastes and preferences and/or technological relationships. As a shorthand designation, we call this the structural approach. Its hallmark is a heavy dependence on economic theory and an emphasis on causal inference. Such models may require many assumptions, such as the precise specification of a cost or production function or specification of the distribution of error terms. The empirical conclusions of such an exercise may not be robust to departures from the assumptions. In Section 2.4.4 we shall say more about this approach. At the present stage we simply emphasize that if the structural approach is implemented with aggregated data, it will yield estimates of the fundamental parameters only under very stringent (and possibly unrealistic) conditions. Microdata sets provide a more promising environment for the structural approach, essentially because they permit greater flexibility in model specification.
The goal of the analysis in the second case is to model the relationship(s) between response variables of interest conditionally on variables the researcher takes as given, or exogenous. More formal definitions of endogeneity and exogeneity are given in Chapter 2. As a shorthand designation, we call this a reduced form approach. The essential point is that reduced form analysis does not always take into account all causal interdependencies. A regression model in which the focus is on the prediction of y given regressors x, and not on the causal interpretation of the regression parameters, is often referred to as a reduced form regression. As will be explained in Chapter 2, in general the parameters of the reduced form model are functions of structural parameters. They may not be interpretable without some information about the structural parameters.

1.2.5. Disaggregation and Heterogeneity

It is sometimes said that many problems and issues of macroeconometrics arise from serial correlation of macro time series, and those of microeconometrics arise from heteroskedasticity of individual-level data. Although this is a useful characterization of the modeling effort in many microeconometric analyses, it needs amplification and is subject to important qualifications. In a range of microeconometric models, modeling of dynamic dependence may be an important issue.

The benefits of disaggregation, which were emphasized earlier in this section, come at a cost: As the data become more disaggregated, the importance of controlling for interindividual heterogeneity increases. Heterogeneity, or more precisely unobserved heterogeneity, plays a very important role in microeconometrics. Obviously, many variables that reflect interindividual heterogeneity, such as gender, race, educational background, and social and demographic factors, are directly observed and hence can be controlled for. In contrast, differences in individual motivation, ability, intelligence, and so forth are either not observed or, at best, imperfectly observed.

The simplest response is to ignore such heterogeneity, that is, to absorb it into the regression disturbance. After all, this is how one treats the myriad small unobserved factors. This step of course increases the unexplained part of the variation. More seriously, ignoring persistent interindividual differences leads to confounding with other factors that are also sources of persistent interindividual differences. Confounding is said to occur when the individual contributions of different regressors (predictor variables) to the variation in the variable of interest cannot be statistically separated. Suppose, for example, that the factor x1 (schooling) is said to be the source of variation in y (earnings), when another variable x2 (ability), which is another source of variation, does not appear in the model. Then that part of total variation that is attributable to the second variable may be incorrectly attributed to the first variable. Intuitively, their relative importances are confounded. A leading source of confounding bias is the incorrect omission of regressors from the model and the inclusion of other variables that are proxies for the omitted variable.

Consider, for example, the case in which a program participation (0/1 dummy) variable D is included in the regression mean function with a vector of regressors x,

    y = x'β + αD + u,    (1.1)

where u is an error term. The term "treatment" is used in the biological and experimental sciences to refer to an administered regimen involving participants in some trial. In econometrics it commonly refers to participation in some activity that may impact an outcome of interest. This activity may be randomly assigned to the participants or may be self-selected by the participant. Thus, although it is acknowledged that individuals choose their years of schooling, one still thinks of years of schooling as a "treatment" variable. Suppose that program participation is taken to be a discrete variable. The coefficient α of the "treatment variable" measures the average impact of program participation (D = 1), conditional on covariates. If one does not control for unobserved heterogeneity, then a potential ambiguity affects the interpretation of the results. If D is found to have a significant impact, then the following question arises: Is α significantly different from zero because D is correlated with some unobserved variable that affects y, or because there is a causal relationship between D and y? For example, if the program considered is university education, and the covariates do not include a measure of ability, giving a fully causal interpretation becomes questionable. Because the issue is important, more attention should be given to how to control for unobserved heterogeneity.
In some cases where dynamic considerations are involved, the type of data available may put restrictions on how one can control for heterogeneity. Consider the example of two households, identical in all relevant respects except that one exhibits a systematically higher preference for consuming good A. One could control for this by allowing individual utility functions to include a heterogeneity parameter that reflects their different preferences. Suppose now that there is a theory of consumer behavior that claims that consumers become addicted to good A, in the sense that the more they consume of it in one period, the greater is the probability that they will consume more of it in the future. This theory provides another explanation of persistent interindividual differences in the consumption of good A. By controlling for heterogeneous preferences it becomes possible to test which source of persistence in consumption – preference heterogeneity or addiction – accounts for different consumption patterns. This type of problem arises whenever some dynamic element generates persistence in the observed outcomes. Several examples of this type of problem arise in various places in the book.

A variety of approaches for modeling heterogeneity coexist in microeconometrics. A brief mention of some of these follows, with details postponed until later.

An extreme solution is to ignore all unobserved interindividual differences. If unobserved heterogeneity is uncorrelated with observed heterogeneity, and if the outcome being studied has no intertemporal dependence, then the aforementioned problems will not arise. Of course, these are strong assumptions, and even with these assumptions not all econometric difficulties disappear.

One approach for handling heterogeneity is to treat it as a fixed effect and to estimate it as the coefficient of an individual-specific 0/1 dummy variable. For example, in a cross-section regression, each micro unit is allowed its own dummy variable (intercept). This leads to an extreme proliferation of parameters because when a new individual is added to the sample, a new intercept parameter is also added. Thus this approach will not work if our data are cross sectional. The availability of multiple observations per individual unit, most commonly in the form of panel data with T time-series observations for each of the N cross-section units, makes it possible to either estimate or eliminate the fixed effect, for example by first differencing if the model is linear and the fixed effect is additive. If the model is nonlinear, as is often the case, the fixed effect will usually not be additive and other approaches will need to be considered.
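A minimal sketch of this fixed-effects logic, under assumed parameter values that are not taken from the book: pooled OLS is biased when the individual effect is correlated with the regressor, whereas first differencing (or within demeaning) removes an additive fixed effect and recovers the true slope.

    import numpy as np

    rng = np.random.default_rng(1)
    N, T, beta = 2000, 5, 1.0

    alpha_i = rng.normal(size=(N, 1))                  # individual fixed effects
    x = 0.8 * alpha_i + rng.normal(size=(N, T))        # regressor correlated with the effect
    y = alpha_i + beta * x + rng.normal(size=(N, T))

    # Pooled OLS ignores alpha_i and is biased because Cov(x, alpha_i) != 0.
    b_pooled = np.polyfit(x.ravel(), y.ravel(), 1)[0]

    # First differencing (or demeaning within i) eliminates the additive fixed effect.
    dx, dy = np.diff(x, axis=1).ravel(), np.diff(y, axis=1).ravel()
    b_fd = dx @ dy / (dx @ dx)

    print("true beta           :", beta)
    print("pooled OLS slope    :", round(b_pooled, 3))   # noticeably above 1
    print("first-difference    :", round(b_fd, 3))       # close to 1

The within (demeaning) estimator gives the same protection in this linear additive-effect case; Chapters 21–23 develop these estimators, and their nonlinear counterparts, formally.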
A second approach to modeling unobserved heterogeneity is through a random effects model. There are a number of ways in which the random effects model can be formulated. One popular formulation assumes that one or more regression parameters, often just the regression intercept, vary randomly across the cross section. In another formulation the regression error is given a component structure, with an individual-specific random component. The random effects model then attempts to estimate the parameters of the distribution from which the random component is drawn. In some cases, such as demand analysis, the random term can be interpreted as random preference variation. Random effects models can be estimated using either cross-section or panel data.

1.2.6. Dynamics

A very common assumption in cross-section analysis is the absence of intertemporal dependence, that is, an absence of dynamics. Thus, implicitly it is assumed that the observations correspond to a stochastic equilibrium, with the deviation from the equilibrium being represented by serially independent random disturbances. Even in microeconometrics such an assumption may be too strong for some data situations. For example, it is inconsistent with the presence of serially correlated unobserved heterogeneity. Dependence on lagged dependent variables also violates this assumption.

The foregoing discussion illustrates some of the potential limitations of a single cross-section analysis. Some limitations may be overcome if repeated cross sections are available. However, if there is dynamic dependence, the least problematic approach might well be to use panel data.

1.3. Book Outline

The book is split into six parts. Part 1 presents the issues involved in microeconometric modeling. Parts 2 and 3 present general theory for estimation and statistical inference for nonlinear regression models. Parts 4 and 5 specialize to the core models used in applied microeconometrics for, respectively, cross-section and panel data. Part 6 covers broader topics that make considerable use of material presented in the earlier chapters. The book outline is summarized in Table 1.1. The remainder of this section details each part in turn.

1.3.1. Part 1: Preliminaries

Chapters 2 and 3 expand on the special features of the microeconometric approach to modeling and microeconomic data structures within the more general statistical arena of regression analysis. Many of the issues raised in these chapters are pursued throughout the book as the reader develops the necessary tools.
Table 1.1. Book Outline

Part and Chapter | Background(a) | Example
1. Preliminaries
  1. Overview | – | –
  2. Causal and Noncausal Models | – | Simultaneous equations models
  3. Microeconomic Data Structures | – | Observational data
2. Core Methods
  4. Linear Models | – | Ordinary least squares
  5. Maximum Likelihood and Nonlinear Least-Squares Estimation | – | m-estimation or extremum estimation
  6. Generalized Method of Moments and Systems Estimation | 5 | Instrumental variables
  7. Hypothesis Tests | 5 | Wald, score, and likelihood ratio tests
  8. Specification Tests and Model Selection | 5, 7 | Conditional moment test
  9. Semiparametric Methods | – | Kernel regression
  10. Numerical Optimization | 5 | Newton–Raphson iterative method
3. Simulation-Based Methods
  11. Bootstrap Methods | 7 | Percentile t-method
  12. Simulation-Based Methods | 5 | Maximum simulated likelihood
  13. Bayesian Methods | – | Markov chain Monte Carlo
4. Models for Cross-Section Data
  14. Binary Outcome Models | 5 | Logit, probit for y = 0, 1
  15. Multinomial Models | 5, 14 | Multinomial logit for y = 1, . . . , m
  16. Tobit and Selection Models | 5, 14 | Tobit for y = max(y∗, 0)
  17. Transition Data: Survival Analysis | 5 | Cox proportional hazards for y = min(y∗, c)
  18. Mixture Models and Unobserved Heterogeneity | 5, 17 | Unobserved heterogeneity
  19. Models of Multiple Hazards | 5, 17 | Multiple hazards
  20. Models of Count Data | 5 | Poisson for y = 0, 1, 2, . . .
5. Models for Panel Data
  21. Linear Panel Models: Basics | – | Fixed and random effects
  22. Linear Panel Models: Extensions | 6, 21 | Dynamic and endogenous regressors
  23. Nonlinear Panel Models | 5, 6, 21, 22 | Panel logit, Tobit, and Poisson
6. Further Topics
  24. Stratified and Clustered Samples | 5 | Data (y_ij, x_ij) correlated over j
  25. Treatment Evaluation | 5, 21 | Regressor d = 1 if in program
  26. Measurement Error Models | 5 | Logit model with measurement errors
  27. Missing Data and Imputation | 5 | Regression with missing observations

(a) The background gives the essential chapter needed in addition to the treatment of ordinary and weighted LS in Chapter 4. Note that the first panel data chapter (Chapter 21) requires only Chapter 4.
1.3.2. Part 2: Core Methods

Chapters 4–10 detail the main general methods used in classical estimation and statistical inference. The results given in Chapter 5 in particular are extensively used throughout the book.

Chapter 4 presents some results for the linear regression model, emphasizing those issues and methods that are most relevant for the rest of the book. Analysis is relatively straightforward as there is an explicit expression for linear model estimators such as ordinary least squares.

Chapters 5 and 6 present estimation theory that can be applied to nonlinear models, for which there is usually no explicit solution for the estimator. Asymptotic theory is used to obtain the distribution of estimators, with emphasis on obtaining robust standard error estimates that rely on relatively weak distributional assumptions. A quite general treatment of estimation, along with specialization to nonlinear least-squares and maximum likelihood estimation, is presented in Chapter 5. The more challenging generalized method of moments estimator, and its specialization to instrumental variables estimation, are given separate treatment in Chapter 6.

Chapter 7 presents classical hypothesis testing when estimators are nonlinear and the hypothesis being tested is possibly nonlinear in parameters. Specification tests in addition to hypothesis tests are the subject of Chapter 8.

Chapter 9 presents semiparametric estimation methods such as kernel regression. The leading example is flexible modeling of the conditional mean. For the patents example, the nonparametric regression model is E[y|x] = g(x), where the function g(·) is unspecified and is instead estimated. Then estimation has an infinite-dimensional component g(·), leading to a nonstandard asymptotic theory. With additional regressors some further structure is needed and the methods are called semiparametric or seminonparametric.

Chapter 10 presents the computational methods used to compute a parameter estimate when the estimator is defined implicitly, usually as the solution to some first-order conditions.

1.3.3. Part 3: Simulation-Based Methods

Chapters 11–13 consider methods of estimation and inference that rely on simulation. These methods are generally more computationally intensive and, currently, less utilized than the methods presented in Part 2.

Chapter 11 presents the bootstrap method for statistical inference. This yields the empirical distribution of an estimator by obtaining new samples by simulation, such as by repeated resampling with replacement from the original sample. The bootstrap can provide a simple way to obtain standard errors when the formulas from asymptotic theory are complex, as is the case for some two-step estimators. Furthermore, if implemented appropriately, the bootstrap can lead to better statistical inference in small samples.
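A minimal sketch of the pairs bootstrap just described, with an assumed data-generating process and an arbitrary number of replications (neither taken from the book): the slope is re-estimated on samples drawn with replacement from the original observations, and the standard deviation of the replicates estimates the standard error.

    import numpy as np

    rng = np.random.default_rng(2)
    n, B = 200, 999                        # sample size, bootstrap replications

    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n) * (1 + 0.5 * np.abs(x))   # heteroskedastic errors

    def ols_slope(y, x):
        X = np.column_stack([np.ones_like(x), x])
        return np.linalg.lstsq(X, y, rcond=None)[0][1]

    slope = ols_slope(y, x)

    # Pairs (nonparametric) bootstrap: resample (y_i, x_i) jointly with replacement.
    boot = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        boot[b] = ols_slope(y[idx], x[idx])

    print("OLS slope            :", round(slope, 3))
    print("bootstrap std. error :", round(boot.std(ddof=1), 3))

The same replicates can also be used to form percentile or percentile-t confidence intervals, the refinement emphasized in Chapter 11.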
  • 37. 1.3. BOOK OUTLINE implemented appropriately, the bootstrap can lead to better statistical inference in small samples. Chapter 12 presents simulation-based estimation methods, developed for models that involve an integral over a probability distribution for which there is no closed- form solution. Estimation is still possible by making multiple draws from the relevant distribution and averaging. Chapter 13 presents Bayesian methods, which combine a distribution for the ob- served data with a specified prior distribution for parameters to obtain a posterior dis- tribution of the parameters that is the basis for estimation. Recent advances make com- putation possible even if there is no closed-form solution for the posterior distribution. Bayesian analysis can provide an approach to estimation and inference that is quite dif- ferent from the classical approach. However, in many cases only the Bayesian tool kit is adopted to permit classical estimation and inference for problems that are otherwise intractable. 1.3.4. Part 4: Models for Cross-Section Data Chapters 14–20 present the main nonlinear models for cross-section data. This part is the heart of the book and presents advanced topics such as models for limited depen- dent variables and sample selection. The classes of models are defined by the range of values taken by the dependent variable. Binary data models for dependent variable that can take only two possible values, say y = 0 or y = 1, are presented in Chapter 14. In Chapter 15 an extension is made to multinomial models, for dependent variable that takes several discrete values. Exam- ples include employment status (employed, unemployed, and out of the labor force) and mode of transportation to work (car, bus, or train). Linear models can be informa- tive but are not appropriate, as they can lead to predicted probabilities outside the unit interval. Instead logit, probit, and related models are used. Chapter 16 presents models with censoring, truncation, sample selection. Exam- ples include annual hours of work, conditional on choosing to work, and hospital ex- penditures, conditional on being hospitalized. In these cases the data are incompletely observed with a bunching of observations at y = 0 and with the remaining y 0. The model for the observed data can be shown to be nonlinear even if the underlying process is linear, and linear regression on the observed data can be very misleading. Simple corrections for censoring, truncation, or sample selection such as the Tobit model exist, but these are very dependent on distributional assumptions. Models for duration data are presented in Chapters 17–19. An example is length of unemployment spell. Standard regression models include the exponential, Weibull, and Cox proportional hazards model. Additionally, as in Chapter 16, the dependent variable is often incompletely observed. For example, the data may be on the length of a current spell that is incomplete, rather than the length of a completed spell. Chapter 20 presents count data models. Examples include various measures of health utilization such as number of doctor visits and number of days hospitalized. Again the model is nonlinear, as counts and hence the conditional mean are nonnega- tive. Leading parametric models include the Poisson and negative binomial. 13
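The models of Part 4 are typically fit by maximum likelihood, with iterative methods such as Newton–Raphson (Chapter 10) doing the numerical work. The following is a minimal sketch of ours, not code from the book, that fits a Poisson regression with conditional mean exp(x'β) to simulated data; the sample size and parameter values are arbitrary.

```python
import numpy as np

# Minimal sketch: Poisson regression fit by Newton-Raphson.
# Model: y_i ~ Poisson(mu_i) with mu_i = exp(x_i' beta).
rng = np.random.default_rng(0)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept plus one regressor
beta_true = np.array([1.0, 0.5])                        # arbitrary illustrative values
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(X.shape[1])                             # starting values
for _ in range(25):
    mu = np.exp(X @ beta)
    score = X.T @ (y - mu)                              # gradient of the log-likelihood
    hessian = -(X * mu[:, None]).T @ X                  # matrix of second derivatives
    step = np.linalg.solve(hessian, score)
    beta = beta - step                                  # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:
        break

print("MLE:", beta)  # close to beta_true in large samples
```

The same iteration, with the appropriate score and Hessian, underlies the logit, probit, and duration models listed above.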
  • 38. OVERVIEW 1.3.5. Part 5: Models for Panel Data Chapters 21–23 present methods for panel data. Here the data are observed in several time periods for each of the many individuals in the sample, so the dependent variable and regressors are indexed by both individual and time. Any analysis needs to control for the likely positive correlation of error terms in different time periods for a given in- dividual. Additionally, panel data can provide sufficient data to control for unobserved time-invariant individual-specific effects, permitting identification of causation under weaker assumptions than those needed if only cross-section data are available. The basic linear panel data model is presented in Chapter 21, with emphasis on fixed effects and random effects models. Extensions of linear models to permit lagged dependent variables and endogenous regressors are presented in Chapter 22. Panel methods for the nonlinear models of Part 4 are presented in Chapter 23. The panel data methods are placed late in the book to permit a unified self-contained treatment. Chapter 21 could have been placed immediately after Chapter 4 and is writ- ten in an accessible manner that relies on little more than knowledge of least-squares estimation. 1.3.6. Part 6: Further Topics This part considers important topics that can generally relate to any and all models covered in Parts 4 and 5. Chapter 24 deals with modeling of clustered data in sev- eral different models. Chapter 25 discusses treatment evaluation. Treatment evaluation is a general term that can cover a wide variety of models in which the focus is on measuring the impact of some “treatment” that is either exogenously or randomly as- signed to an individual on some measure of interest, denoted an “outcome variable.” Chapter 26 deals with the consequences of measurement errors in outcome and/or regressor variables, with emphasis on some leading nonlinear models. Chapter 27 considers some methods of handling missing data in linear and nonlinear regression models. 1.4. How to Use This Book The book assumes a basic understanding of the linear regression model with matrix algebra. It is written at the mathematical level of the first-year economics Ph.D. se- quence, comparable to Greene (2003). Although some of the material in this book is covered in a first-year sequence, most of it appears in second-year econometrics Ph.D. courses or in data-oriented mi- croeconomics field courses such as labor economics, public economics, or industrial organization. This book is intended to be used as both an econometrics text and as an adjunct for such field courses. More generally, the book is intended to be useful as a reference work for applied researchers in economics, in related social sciences such as sociology and political science, and in epidemiology. For readers using this book as a reference work, the models chapters have been written to be as self-contained as possible. For the specific models presented in Parts 4 14
and 5 it will generally be sufficient to read the relevant chapter in isolation, except that some command of the general estimation results in Chapter 5 and in some cases Chapter 6 will be necessary. Most chapters are structured to begin with a discussion and example that is accessible to a wide audience.
For instructors using this book as a course text it is best to introduce the basic nonlinear cross-section and linear panel data models as early as possible, skipping many of the methods chapters. The most commonly used nonlinear cross-section models are presented in Chapters 14–16; these require knowledge of maximum likelihood and least-squares estimation, presented in Chapter 5. Chapter 21 on linear panel data models requires even less preparation, essentially just Chapter 4.
Table 1.2 provides an outline for a one-quarter second-year graduate course taught at the University of California, Davis, immediately following the required first-year statistics and econometrics sequence.

Table 1.2. Outline of a 20-Lecture 10-Week Course

| Lectures | Chapter | Topic |
|---|---|---|
| 1–3 | 4, Appx. A | Review of linear models and asymptotic theory |
| 4–7 | 5 | Estimation: m-estimation, ML, and NLS |
| 8 | 10 | Estimation: numerical optimization |
| 9–11 | 14, 15 | Models: binary and multinomial |
| 12–14 | 16 | Models: censored and truncated |
| 15 | 6 | Estimation: GMM |
| 16 | 7 | Testing: hypothesis tests |
| 17–19 | 21 | Models: basic linear panel |
| 20 | 9 | Estimation: semiparametric |

A quarter provides sufficient time to cover the basic results given in the first half of the chapters in this outline. With additional time one can go into further detail or cover a subset of Chapters 11–13 on computationally intensive estimation methods (simulation-based estimation, the bootstrap, which is also briefly presented in Chapter 7, and Bayesian methods); additional cross-section models (durations and counts) presented in Chapters 17–20; and additional panel data models (linear model extensions and nonlinear models) given in Chapters 22 and 23.
At Indiana University, Bloomington, a 15-week semester-long field course in microeconometrics is based on material in most of Parts 4 and 5. The prerequisite courses for this course cover material similar to that in Part 2.
Some exercises are provided at the end of each chapter after the first three introductory chapters. These exercises are usually learning-by-doing exercises; some are purely methodological whereas others entail analysis of generated or actual data. The level of difficulty of the questions is mostly related to the level of difficulty of the topic.

1.5. Software

There are many software packages available for data analysis. Popular packages with strong microeconometric capabilities include LIMDEP, SAS, and STATA, all of which
offer an impressive range of canned routines and additionally support user-defined procedures using a matrix programming language. Other packages that are also widely used include EVIEWS, PCGIVE, and TSP. Despite their time-series orientation, these can support some cross-section data analysis. Users who wish to do their own programming also have available a variety of options including GAUSS, MATLAB, OX, and SAS/IML. The latest detailed information about these packages and many others can be efficiently located via an Internet browser and a search engine.

1.6. Notation and Conventions

Vector and matrix algebra are used extensively. Vectors are defined as column vectors and represented using lowercase bold. For example, for linear regression the regressor vector x is a K × 1 column vector with jth entry x_j and the parameter vector β is a K × 1 column vector with jth entry β_j, so

$$
\mathbf{x}_{(K \times 1)} = \begin{bmatrix} x_1 \\ \vdots \\ x_K \end{bmatrix}
\quad \text{and} \quad
\boldsymbol{\beta}_{(K \times 1)} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}.
$$

Then the linear regression model y = β_1 x_1 + β_2 x_2 + · · · + β_K x_K + u is expressed as y = x'β + u. At times a subscript i is added to denote the typical ith observation. The linear regression equation for the ith observation is then y_i = x_i'β + u_i. The sample is one of N observations, {(y_i, x_i), i = 1, . . . , N}. In this book observations are usually assumed to be independent over i.
Matrices are represented using uppercase bold. In matrix notation the sample is (y, X), where y is an N × 1 vector with ith entry y_i and X is a matrix with ith row x_i', so

$$
\mathbf{y}_{(N \times 1)} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
\quad \text{and} \quad
\mathbf{X}_{(N \times \dim(\mathbf{x}))} = \begin{bmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_N' \end{bmatrix}.
$$

The linear regression model upon stacking all N observations is then y = Xβ + u, where u is an N × 1 column vector with ith entry u_i.
Matrix notation is compact, but at times it is clearer to write products of matrices as summations of products of vectors. For example, the OLS estimator can be equivalently written in either of the following ways:

$$
\widehat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
= \left( \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \sum_{i=1}^{N} \mathbf{x}_i y_i .
$$
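The two expressions are algebraically identical, which a few lines of code confirm numerically; the snippet below is purely illustrative and uses arbitrary simulated data.

```python
import numpy as np

# Illustrative check (not from the book): the matrix form and the summation
# form of the OLS estimator give the same answer.
rng = np.random.default_rng(1)
N, K = 200, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

beta_matrix = np.linalg.solve(X.T @ X, X.T @ y)        # (X'X)^{-1} X'y

xx = sum(np.outer(x_i, x_i) for x_i in X)              # sum_i x_i x_i'
xy = sum(x_i * y_i for x_i, y_i in zip(X, y))          # sum_i x_i y_i
beta_sum = np.linalg.solve(xx, xy)

print(np.allclose(beta_matrix, beta_sum))              # True
```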
Generic notation for a parameter is the q × 1 vector θ. The regression parameters are represented by the K × 1 vector β, which may equal θ or may be a subset of θ depending on the context.
The book uses many abbreviations and acronyms. Table 1.3 summarizes abbreviations used for some common estimation methods, ordered by whether the estimator is developed for linear or nonlinear regression models.

Table 1.3. Commonly Used Acronyms and Abbreviations

| | Acronym | Meaning |
|---|---|---|
| Linear | OLS | Ordinary least squares |
| | GLS | Generalized least squares |
| | FGLS | Feasible generalized least squares |
| | IV | Instrumental variables |
| | 2SLS | Two-stage least squares |
| | 3SLS | Three-stage least squares |
| Nonlinear | NLS | Nonlinear least squares |
| | FGNLS | Feasible generalized nonlinear least squares |
| | NIV | Nonlinear instrumental variables |
| | NL2SLS | Nonlinear two-stage least squares |
| | NL3SLS | Nonlinear three-stage least squares |
| General | LS | Least squares |
| | ML | Maximum likelihood |
| | QML | Quasi-maximum likelihood |
| | GMM | Generalized method of moments |
| | GEE | Generalized estimating equations |

We also use the following: dgp (data-generating process), iid (independently and identically distributed), pdf (probability density function), cdf (cumulative distribution function), L (likelihood), ln L (log-likelihood), FE (fixed effects), and RE (random effects).
  • 42. C H A P T E R 2 Causal and Noncausal Models 2.1. Introduction Microeconometrics deals with the theory and applications of methods of data analysis developed for microdata pertaining to individuals, households, and firms. A broader definition might also include regional- and state-level data. Microdata are usually either cross sectional, in which case they refer to conditions at the same point in time, or longitudinal (panel) in which case they refer to the same observational units over several periods. Such observations are generated from both nonexperimental setups, such as censuses and surveys, and quasi-experimental or experimental setups, such as social experiments implemented by governments with the participation of volunteers. A microeconometric model may be a full specification of the probability distribu- tion of a set of microeconomic observations; it may also be a partial specification of some distributional properties, such as moments, of a subset of variables. The mean of a single dependent variable conditional on regressors is of particular interest. There are several objectives of microeconometrics. They include both data descrip- tion and causal inference. The first can be defined broadly to include moment prop- erties of response variables, or regression equations that highlight associations rather than causal relations. The second category includes causal relationships that aim at measurement and/or empirical confirmation or refutation of conjectures and proposi- tions regarding microeconomic behavior. The type and style of empirical investigations therefore span a wide spectrum. At one end of the spectrum can be found very highly structured models, derived from detailed specification of the underlying economic be- havior, that analyze causal (behavioral) or structural relationships for interdependent microeconomic variables. At the other end are reduced form studies that aim to un- cover correlations and associations among variables, without necessarily relying on a detailed specification of all relevant interdependencies. Both approaches share the common goal of uncovering important and striking relationships that could be helpful in understanding microeconomic behavior, but they differ in the extent to which they rely on economic theory to guide their empirical investigations. 18
  • 43. 2.1. INTRODUCTION As a subdiscipline microeconometrics is newer than macroeconometrics, which is concerned with modeling of market and aggregate data. A great deal of the early work in applied econometrics was based on aggregate time-series data collected by government agencies. Much of the early work on statistical demand analysis up until about 1940 used market rather than individual or household data (Hendry and Morgan, 1996). Morgan’s (1990) book on the history of econometric ideas makes no reference to microeconometric work before the 1940s, with one important exception. That ex- ception is the work on household budget data that was instigated by concern with the living standards of the less well-off in many countries. This led to the collection of household budget data that provided the raw material for some of the earlier microe- conometric studies such as those pioneered by Allen and Bowley (1935). Nevertheless, it is only since the 1950s that microeconometrics has emerged as a distinctive and rec- ognized subdiscipline. Even into the 1960s the core of microeconometrics consisted of demand analyses based on household surveys. With the award of the year 2000 Nobel Prize in Economics to James Heckman and Daniel McFadden for their contributions to microeconometrics, the subject area has achieved clear recognition as a distinct subdiscipline. The award cited Heckman “for his development of theory and methods for analyzing selective samples” and McFadden “for his development of theory and methods for analyzing discrete choice.” Examples of the type of topics that microeconometrics deals with were also men- tioned in the citation: “ . . . what factors determine whether an individual decides to work and, if so, how many hours? How do economic incentives affect individual choices regarding education, occupation or place of residence? What are the effects of different labor-market and educational programs on an individual’s income and employment?” Applications of microeconometric methods can be found not only in every area of microeconomics but also in other cognate social sciences such as political science, sociology, and geography. Beginning with the 1970s and especially within the past two decades revolution- ary advances in our capacity for handling large data sets and associated computations have taken place. These, together with the accompanying explosion in the availability of large microeconomic data sets, have greatly expanded the scope of microecono- metrics. As a result, although empirical demand analysis continues to be one of the most important areas of application for microeconometric methods, its style and con- tent have been heavily influenced by newer methods and models. Further, applications in economic development, finance, health, industrial organization, labor and public economics, and applied microeconomics generally are now commonplace, and these applications will be encountered at various places in this book. The primary focus of this book is on the newer material that has emerged in the past three decades. Our goal is to survey concepts, models, and methods that we re- gard as standard components of a modern microeconometrician’s tool kit. Of course, the notion of standard methods and models is inevitably both subjective and elastic, being a function of the presumed clientele of this book as well as the authors’ own backgrounds. 
There may also be topics we regard as too advanced for an introductory book such as this that others would place in a different category. 19
Microeconometrics focuses on the complications of nonlinear models and on obtaining estimates that can be given a structural interpretation. Much of this book, especially Parts 2–4, presents methods for nonlinear models. These nonlinear methods overlap with many areas of applied statistics including biostatistics. By contrast, the distinguishing feature of econometrics is the emphasis placed on causal modeling.
This chapter introduces the key concepts related to causal (and noncausal) modeling, concepts that are germane to both linear and nonlinear models. Sections 2.2 and 2.3 introduce the key concepts of structure and exogeneity. Section 2.4 uses the linear simultaneous equations model as a specific illustration of a structural model and connects it with the other important concepts of reduced form models. Identification definitions are given in Section 2.5. Section 2.6 considers single-equation structural models. Section 2.7 introduces the potential outcome model and compares the causal parameters and interpretations in the potential outcome model with those in the simultaneous equations model. Section 2.8 provides a brief discussion of modeling and estimation strategies designed to handle computational and data challenges.

2.2. Structural Models

Structure consists of
1. a set of variables W ("data") partitioned for convenience as [Y Z];
2. a joint probability distribution of W, F(W);
3. an a priori ordering of W according to hypothetical cause-and-effect relationships and specification of a priori restrictions on the hypothesized model; and
4. a parametric, semiparametric, or nonparametric specification of functional forms and the restrictions on the parameters of the model.
This general description of a structural model is consistent with a well-established Cowles Commission definition of a structure. For example, Sargan (1988, p. 27) states:
    A model is the specification of the probability distribution for a set of observations. A structure is the specification of the parameters of that distribution. Therefore, a structure is a model in which all the parameters are assigned numerical values.
We consider the case in which the modeling objective is to explain the values of the observable vector-valued variable y, y = (y_1, . . . , y_G). Each element of y is a function of some other elements of y and of explanatory variables z and a purely random disturbance u. Note that the variables y are assumed to be interdependent. By contrast, interdependence between the z_i is not modeled. The ith observation satisfies the set of implicit equations

$$
g(\mathbf{y}_i, \mathbf{z}_i, \mathbf{u}_i \,|\, \boldsymbol{\theta}) = \mathbf{0}, \tag{2.1}
$$

where g is a known function. We refer to this as the structural model, and we refer to θ as structural parameters. This corresponds to property 4 given earlier in this section.
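To make (2.1) concrete, here is a small illustration of our own (it does not appear in the text): a linear market model in which quantity q_i and price p_i are jointly determined, with a hypothetical income shifter inc_i in the demand equation and a cost shifter c_i in the supply equation,

$$
g(\mathbf{y}_i, \mathbf{z}_i, \mathbf{u}_i \,|\, \boldsymbol{\theta}) =
\begin{bmatrix}
q_i - \alpha_1 - \alpha_2 p_i - \alpha_3\, inc_i - u_{1i} \\
q_i - \delta_1 - \delta_2 p_i - \delta_3\, c_i - u_{2i}
\end{bmatrix} = \mathbf{0},
$$

with y_i = (q_i, p_i), z_i = (inc_i, c_i), u_i = (u_{1i}, u_{2i}), and θ collecting the α's and δ's. Provided α_2 ≠ δ_2, the two equations can be solved for (q_i, p_i) in terms of (z_i, u_i), giving the explicit form developed next.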
  • 45. 2.2. STRUCTURAL MODELS Assume that there is a unique solution for yi for every (zi , ui ). Then we can write the equations in an explicit form with y as function of (z, u): yi = f (zi , ui |π) . (2.2) This is referred to as the reduced form of the structural model, where π is a vector of reduced form parameters that are functions of θ. The reduced form is obtained by solving the structural model for the endogenous variables yi , given (zi , ui ). The reduced form parameters π are functions of θ. If the objective of modeling is inference about elements of θ, then (2.1) provides a direct route. This involves estimation of the structural model. However, because ele- ments of π are functions of θ, (2.2) also provides an indirect route to inference on θ. If f(zi , ui |π) has a known functional form, and if it is additively separable in zi and ui , such that we can write yi = g (zi |π) + ui = E [yi |zi ] + ui , (2.3) then the regression of y on z is a natural prediction function for y given z. In this sense the reduced form equation has a useful role for making conditional predictions of yi given (zi , ui ). To generate predictions of the left-hand-side variable for assigned values of the right-hand-side variables of (2.2) requires estimates of π, which may be computationally simpler. An important extension of (2.3) is the transformation model, which for scalar y takes the form (y) = z π + u, (2.4) where (y) is a transformation function (e.g., (y) = ln(y) or (y) = y1/2 ). In some cases the transformation function may depend on unknown parameters. A transfor- mation model is distinct from a regression, but it too can be used to make estimates of E [y|z]. An important example is the accelerated failure time model analyzed in Chapter 17. One of the most important, and potentially controversial, steps in the specification of the structural model is property 3, in which an a priori ordering of variables into causes and effects is assigned. In essence this involves drawing a distinction between those variables whose variation the model is designed to explain and those whose variation is externally determined and hence lie outside the scope of investigation. In microeconometrics, examples of the former are years of schooling and hours worked; examples of the latter are gender, ethnicity, age, and similar demographic variables. The former, denoted y, are referred to as endogenous and the latter, denoted z, are called exogenous variables. Exogeneity of a variable is an important simplification because in essence it jus- tifies the decision to treat that variable as ancillary, and not to model that variable because the parameters of that relationship have no direct bearing on the variable under study. This important notion needs a more formal definition, which we now provide. 21
  • 46. CAUSAL AND NONCAUSAL MODELS 2.3. Exogeneity We begin by considering the representation of a general finite dimensional parametric case in which the joint distribution of W, with parameters θ partitioned as (θ1 θ2), is factored into the conditional density of Y given Z, and the marginal distribution of Z, giving fJ (W|θ) = fC (Y|Z, θ) × fM (Z|θ) . (2.5) A special case of this result occurs if fJ (W|θ) = fC (Y|Z, θ1) × fM Z|θ2 , where θ1 and θ2 are functionally independent. Then we say that Z is exogenous with respect to θ1; this means that knowledge of fM (Z|θ2) is not required for inference on θ1, and hence we can validly condition the distribution of Y on Z. Models can always be reparameterized. So next consider the case in which the model is reparameterized in terms of parameters ϕ, with one-to-one transformation of θ, say ϕ = h(θ), where ϕ is partitioned into (ϕ1, ϕ2). This reparametrization may be of interest if, for example, ϕ1 is structurally invariant to a class of policy interven- tions. Suppose ϕ1 is the parameter of interest. In such a case one is interested in the exogeneity of Z with respect to ϕ1. Then, the condition for exogeneity is that fJ (W|ϕ) = fC Y|Z, ϕ1 × fM Z|ϕ2 , (2.6) where ϕ1 is independent of ϕ2. Finally consider the case in which the interest is in a parameter λ that is a function of ϕ, say h(ϕ). Then for exogeneity of Z with respect to λ, we need two conditions: (i) λ depends only on ϕ1, i.e., λ = h(ϕ1), and hence only the conditional distribution is of interest; and (ii) ϕ1 and ϕ2 are “variation free” which means that the parameters of the joint distribution are not subject to cross-restrictions, i.e. (ϕ1, ϕ2) ∈ Φ1 × Φ2 = {ϕ1 ∈ 1, ϕ2 ∈ 2}. The factorization in (2.5)-(2.6) plays an important role in the development of the exogeneity concept. Of special interest in this book are the following three con- cepts related to exogeneity: (1) weak exogeneity; (2) Granger noncausality; (3) strong exogeneity. Definition 2.1 (Weak Exogeneity): Z is weakly exogenous for λ if (i) and (ii) hold. If the marginal model parameters are uninformative for inference on λ, then infer- ence on λ can proceed on the basis of the conditional distribution f (Y|Z, ϕ1) alone. The operational implication is that weakly exogenous variables can be taken as given if one’s main interest is in inference on λ or ϕ1. This does not mean that there is no statistical model for Z; it means that the parameters of that model play no role in the inference on ϕ1, and hence are irrelevant. 22
  • 47. 2.4. LINEAR SIMULTANEOUS EQUATIONS MODEL 2.3.1. Conditional Independence Originally, the Granger causality concept was defined in the context of prediction in a time-series environment. More generally, it can be interpreted as a form of conditional independence (Holland, 1986, p. 957). Partition z into two subsets z1 and z2; let W = [y, z1, z2] be the matrices of vari- ables of interest. Then z1 and y are conditionally independent given z2 if f (y|z1, z2) = f (y|z2) . (2.8) This is stronger than the mean independence assumption, which would imply E [y|z1, z2] = E [y|z2] . (2.9) Then z1 has no predictive value for y, after conditioning on z2. In a predictive sense this means that z1 does not Granger-cause y. In a time-series context, z1 and z2 would be mutually exclusive lagged values of subsets of y. Definition 2.2 (Strong Exogeneity): z1 is strongly exogenous for ϕ if it is weakly exogenous for ϕ and does not Granger-cause y so (2.8) holds. 2.3.2. Exogenizing Variables Exogeneity is a strong assumption. It is a property of random variables relative to parameters of interest. Hence a variable may be validly treated as exogenous in one structural model but not in another; the key issue is the parameters that are the subject of inference. Arbitrary imposition of this property will have some undesirable conse- quences that will be discussed in Section 2.4. The exogeneity assumption may be justified by a priori theorizing, in which case it is a part of the maintained hypothesis of the model. It may in some cases be justified as a valid approximation, in which case it may be subject to testing, as discussed in Section 8.4.3. In cross-section analysis it may be justified as being a consequence of a natural experiment or a quasi-experiment in which the value of the variable is de- termined by an external intervention; for example, government or regulatory authority may determine the setting of a tax rate or a policy parameter. Of special interest is the case in which an external intervention results in a change in the value of an impor- tant policy variable. Such a natural experiment is tantamount to exogenization of some variable. As we shall see in Chapter 3, this creates a quasi-experimental opportunity to study the impact of a variable in the absence of other complicating factors. 2.4. Linear Simultaneous Equations Model An important special case of the general structural model specified in (2.1) is the linear simultaneous equation model developed by the Cowles Commission econometricians. Comprehensive treatment of this model is available in many textbooks (e.g., Sargan, 23
  • 48. CAUSAL AND NONCAUSAL MODELS 1988). The treatment here is brief and selective; also see Section 6.9.6. The objective is to bring into the discussion several key ideas and concepts that have more general rele- vance. Although the analysis is restricted to linear models, many insights are routinely applied to nonlinear models. 2.4.1. The SEM Setup The linear simultaneous equations model (SEM) setup is as follows: y1i β11 + · · · + yGi β1G + z1i γ11 + · · · + zKi γ1K = u1i . . . . . . = . . . y1i βG1 + · · · + yGi βGG + z1i γG1 + · · · + zKi γGK = uGi , where i is the observation subscript. A clear a priori distinction or preordering is made between endogenous variables, y i = (y1i , . . ., yGi ), and exogenous variables, z i = (z1i , . . ., zKi ). By definition the ex- ogenous variables are uncorrelated with the purely random disturbances (u1i , . . ., uGi ). In its unrestricted form every variable enters every equation. In matrix notation, the G-equation SEM for the ith equation is written as y i B + z i Γ = u i , (2.10) where yi , B, zi , Γ, and ui have dimensions G × 1, G × G, K × 1, K × G, and G × 1, respectively. For specified values of (B, Γ) and (zi , ui ) G linear simultaneous equa- tions can in principle be solved for yi . The standard assumptions of SEM are as follows: 1. B is nonsingular and has rank G. 2. rank[Z] = K. The N × K matrix Z is formed by stacking z i , i = 1, . . ., N. 3. plim N−1 Z Z = Σzz is a symmetric K × K positive definite matrix. 4. ui ∼ N[0, Σ]; that is, E[ui ] = 0 and E[ui u i ] = =[σi j ], where Σ is a symmetric G × G positive definite matrix. 5. The errors in each equation are serially independent. In this model the structure (or structural parameters) consists of (B, Γ, Σ). Writing Y =     y 1 · · y N     , Z =     z 1 · · z N     , U =        u 1 · · · u N        allows us to express the structural model more compactly as YB + ZΓ = U, (2.11) where the arrays Y, B, Z, Γ, and U have dimensions N × G, G × G, N × K, K × G, and N × G, respectively. Solving for all the endogenous variables in terms of all 24
  • 49. 2.4. LINEAR SIMULTANEOUS EQUATIONS MODEL the exogenous variables, we obtain the reduced form of the SEM: Y + ZΓB−1 = UB−1 , Y = ZΠ + V, (2.12) where Π = −ΓB−1 and V = UB−1 . Given Assumption 4, vi ∼ N[0, B−1 ΣB−1 ]. In the SEM framework the structural model has primacy for several reasons. First, the equations themselves have interpretations as economic relationships such as de- mand or supply relations, production functions, and so forth, and they are subject to restrictions of economic theory. Consequently, B and Γ are parameters that describe economic behavior. Hence a priori theory can be invoked to form expectations about the sign and size of individual coefficients. By contrast, the unrestricted reduced form parameters are potentially complicated functions of the structural parameters, and as such it may be difficult to evaluate them postestimation. This consideration may have little weight if the goal of econometric modeling is prediction rather than inference on parameters with behavioral interpretation. Consider, without loss of generality, the first equation in the model (2.11), with y1 as the dependent variable. In addition, some of the remaining G − 1 endogenous vari- ables and K − 1 exogenous variables may be absent from this equation. From (2.12) we see that in general the endogenous variables Y depend stochastically on V, which in turn is a function of the structural errors U. Therefore, in general plim N−1 Y U = 0. Generally, the application of the least-squares estimator in the simultaneous equation setting yields inconsistent estimates. This is a well-known and basic result from the si- multaneous equations literature, often referred to as the “simultaneous equations bias” problem. The vast literature on simultaneous equations models deals with identifica- tion and consistent estimation when the least-squares approach fails; see Sargan (1988) and Schmidt (1976), and Section 6.9.6. The reduced form of SEM expresses every endogenous variable as a linear function of all exogenous variables and all structural disturbances. The reduced form distur- bances are linear combinations of the structural disturbances. From the reduced form for the ith observation E [yi |zi ] = z i Π, (2.13) V [yi |zi ] = Ω ≡ B−1 ΣB−1 . (2.14) The reduced form parameters Π are derived parameters defined as functions of the structural parameters. If Π can be consistently estimated then the reduced form can be used to make predictive statements about variations in Y due to exogenous changes in Z. This is possible even if B and Γ are not known. Given the exogeneity of Z, the full set of reduced form regressions is a multivariate regression model that can be estimated consistently by least squares. The reduced form provides a basis for making conditional predictions of Y given Z. The restricted reduced form is the unrestricted reduced form model subject to re- strictions. If these are the same restrictions as those that apply to the structure, then structural information can be recovered from the reduced form. 25
  • 50. CAUSAL AND NONCAUSAL MODELS In the SEM framework, the unknown structural parameters, the nonzero elements of B, Γ, and Σ, play a key role because they reflect the causal structure of the model. The interdependence between endogenous variables is described by B, and the responses of endogenous variables to exogenous shocks in Z is reflected in the parameter matrix Γ. In this setup the causal parameters of interest are those that measure the direct marginal impact of a change in an explanatory variable, yj or zk on the outcome of interest yl, l = j, and functions of such parameters and data. The elements of Σ describe the dispersion and dependence properties of the ran- dom disturbances, and hence they measure some properties of the way the data are generated. 2.4.2. Causal Interpretation in SEM A simple example will illustrate the causal interpretation of parameters in SEM. The structural model has two continuous endogenous variables y1 and y2, a single con- tinuous exogenous variable z1, one stochastic relationship linking y1 and y2, and one definitional identity linking all three variables in the model: y1 = γ1 + β1 y2 + u1, 0 β1 1, y2 = y1 + z1. In this model u1 is a stochastic disturbance, independent of z1, with a well-defined distribution. The parameter β1 is subject to an inequality constraint that is also a part of the model specification. The variable z1 is exogenous and therefore its variation is induced by external sources that we may regard as interventions. These interventions have a direct impact on y2 through the identity and also an indirect one through the first equation. The impact is measured by the reduced form of the model, which is y1 = γ1 1 − β1 + β1 1 − β1 z1 + 1 1 − β1 u1 = E[y1|z1] + v1, y2 = γ1 1 − β1 + 1 1 − β1 z1 + 1 1 − β1 u1 = E[y2|z1] + v1, where v1 = u1/(1 − β1). The reduced form coefficients β1/(1 − β1) and 1/(1 − β1) have a causal interpretation. Any externally induced unit change in z1 will cause the value of y1 and y2 to change by these amounts. Note that in this model y1 and y2 also respond to u1. In order not to confound the impact of the two sources of variation we require that z1 and u1 are independent. Also note that ∂y1 ∂y2 = β1 = β1 1 − β1 ÷ 1 1 − β1 = ∂y1 ∂z1 ÷ ∂y2 ∂z1 . 26
  • 51. 2.4. LINEAR SIMULTANEOUS EQUATIONS MODEL In what sense does β1 measure the causal effect of y2 on y1? To see a possible diffi- culty, observe that y1 and y2 are interdependent or jointly determined, so it is unclear in what sense y2 “causes” y1. Although z1 (and u1) is the ultimate cause of changes in the reduced form sense, y2 is a proximate or an intermediate cause of y1. That is, the first structural equation provides a snapshot of the impact of y2 on y1, whereas the reduced form gives the (equilibrium) impact after allowing for all interactions be- tween the endogenous variables to work themselves out. In a SEM framework even endogenous variables are viewed as causal variables, and their coefficients as causal parameters. This approach can cause puzzlement for those who view causality in an experimental setting where independent sources of variation are the causal variables. The SEM approach makes sense if y2 has an independent and exogenous source of variation, which in this model is z1. Hence the marginal response coefficient β1 is a function of how y1 and y2 respond to a change in z1, as the immediately preceding equation makes clear. Of course this model is but a special case. More generally, we may ask under what conditions will the SEM parameters have a meaningful causal interpretation. We return to this issue when discussing identification concepts in Section 2.5. 2.4.3. Extensions to Nonlinear and Latent Variable Models If the simultaneous model is nonlinear in parameters only, the structural model can be written as YB(θ) + ZΓ(θ) = U, (2.15) where B(θ) and Γ(θ) are matrices whose elements are functions of the structural pa- rameters θ. An explicit reduced form can be derived as before. If nonlinearity is instead in variables then an explicit (analytical) reduced form may not be possible, although linearized approximations or numerical solutions of the dependent variables, given (z, u), can usually be obtained. Many microeconometric models involve latent or unobserved variables as well as observed endogenous variables. For example, search and auction theory models use the concept of reservation wage or reservation price, choice models invoke indirect utility, and so forth. In the case of such models the structural model (2.1) may be replaced by g y∗ i , zi , ui |θ = 0, (2.16) where the latent variables y∗ i replace the observed variables yi . The corresponding reduced form solves for y∗ i in terms of (zi , ui ), yielding y∗ i = f (zi , ui |π) . (2.17) This reduced form has limited usefulness as y∗ i is not fully observed. However, if we have functions yi = h(y∗ i ) that relate observable with latent counterparts of yi , then the reduced form in terms of observables is yi = h (f (zi , ui |π)) . (2.18) See Section 16.8.2 for further details. 27
  • 52. CAUSAL AND NONCAUSAL MODELS When the structural model involves nonlinearities in variables, or when latent vari- ables are involved, an explicit derivation of the functional form of this reduced form may be difficult to obtain. In such cases practitioners use approximations. By citing mathematical or computational convenience, a specific functional form may be used to relate an endogenous variable to all exogenous variables, and the result would be referred to as a “reduced form type relationship.” 2.4.4. Interpretations of Structural Relationships Marschak (1953, p. 26) in an influential essay gave the following definition of a structure: Structure was defined as a set of conditions which did not change while observations were being made but which might change in future. If a specified change of struc- ture is expected or intended, prediction of variables of interest to the policy maker requires some knowledge of past structure.. . . In economics, the conditions that con- stitute a structure are (1) a set of relations describing human behavior and institutions as well as technological laws and involving, in general, nonobservable random dis- turbances and nonobservable random errors of measurement; (2) the joint probability distribution of these random quantities. Marschak argued that the structure was fundamental for a quantitative evaluation or tests of economic theory and that the choice of the best policy requires knowledge of the structure. In the SEM literature a structural model refers to “autonomous” (not “derived”) relationships. There are other closely related concepts of a structure. One such concept refers to “deep parameters,” by which is meant technology and preference parameters that are invariant to interventions. In recent years an alternative usage of the term structure has emerged, one that refers to econometric models based on the hypothesis of dynamic stochastic optimization by rational agents. In this approach the starting point for any structural estimation prob- lem is the first-order necessary conditions that define the agent’s optimizing behavior. For example, in a standard problem of maximizing utility subject to constraints, the behavioral relations are the deterministic first-order marginal utility conditions. If the relevant functional forms are explicitly stated, and stochastic errors of optimization are introduced, then the first-order conditions define a behavioral model whose parameters characterize the utility function – the so-called deep or policy-invariant parameters. Examples are given in Sections 6.2.7 and 16.8.1. Two features of this highly structural approach should be mentioned. First, they rely on a priori economic theory in a serious manner. Economic theory is not used simply to generate a list of relevant variables that one can use in a more or less arbi- trarily specified functional form. Rather, the underlying economic theory has a major (but not exclusive) role in specification, estimation, and inference. The second feature is that identification, specification, and estimation of the resulting model can be very complicated, because the agent’s optimization problem is potentially very complex, 28
  • 53. 2.5. IDENTIFICATION CONCEPTS especially if dynamic optimization under uncertainty is postulated and discreteness and discontinuities are present; see Rust (1994). 2.5. Identification Concepts The goal of the SEM approach is to consistently estimate (B, Γ, Σ) and conduct statis- tical inference. An important precondition for consistent estimation is that the model should be identified. We briefly discuss the important twin concepts of observational equivalence and identifiability in the context of parametric models. Identification is concerned with determination of a parameter given sufficient ob- servations. In this sense, it is an asymptotic concept. Statistical uncertainty necessarily affects any inference based on a finite number of observations. By hypothetically con- sidering the possibility that sufficient number of observations are available, it is pos- sible to consider whether it is logically possible to determine a parameter of interest either in the sense of its point value or in the sense of determining the set of which the parameter is an element. Therefore, identification is a fundamental consideration and logically occurs prior to and is separate from statistical estimation. A great deal of econometric literature on identification focuses on point identification. This is also the emphasis of this section. However, set identification, or bounds identification, is an important approach that will be used in selected places in this book (e.g., Chapters 25 and 27; see Manski, 1995). Definition 2.3 (Observational Equivalence): Two structures of a model defined as joint probability distribution function Pr[x|θ], x ∈ W, θ ∈ Θ, are observa- tionally equivalent if Pr[x|θ1 ] = Pr[x|θ2 ] ∀ x ∈ W. Less formally, if, given the data, two structural models imply identical joint proba- bility distributions of the variables, then the two structures are observationally equiva- lent. The existence of multiple observationally equivalent structures implies the failure of identification. Definition 2.4 (Identification): A structure θ0 is identified if there is no other observationally equivalent structure in Θ. A simple example of nonidentification occurs when there is perfect collinearity be- tween regressors in the linear regression y = Xβ + u. Then we can identify the linear combination Cβ, where rank[C] rank[β], but we cannot identify β itself. This definition concerns uniqueness of the structure. In the context of the SEM we have given, this definition means that identification requires that there is a unique triple (B, Γ, Σ) consistent with the observed data. In SEM, as in other cases, identi- fication involves being able to obtain unique estimates of structural parameters given the sample moments of the data. For example, in the case of the reduced form (2.12), under the stated assumptions the least-squares estimator provides unique estimates of Π, that is, Π = [Z Z]−1 Z Y, and identification of B, Γ requires that there is a solution 29
  • 54. CAUSAL AND NONCAUSAL MODELS for the unknown elements of Γ and B from the equations Π + ΓB−1 = 0, given a priori restrictions on the model. A unique solution implies just identification of the model. A complete model is said to be identified if all the model parameters are identified. It is possible that for some models only a subset of parameters is identified. In some situations it may be important to be able to identify some function of parameters, and not necessarily all the individual parameters. Identification of a function of parameters means that function can be recovered uniquely from F(W|Θ). How does one ensure that the structures of alternative model specifications can be “ruled out”? In SEM the solution to this problem depends on augmenting the sample information by a priori restrictions on (B, Γ, Σ). The a priori restrictions must intro- duce sufficient additional information into the model to rule out the existence of other observationally equivalent structures. The need for a priori restrictions is demonstrated by the following argument. First note that given the assumptions of Section 2.4.1 the reduced form, defined by (Π, Ω), is always unique. Initially suppose there are no restrictions on (B, Γ, Σ). Next suppose that there are two observationally equivalent structures (B1, Γ1, Σ1) and (B2, Γ2, Σ2). Then Π = −Γ1B−1 1 = −Γ2B−1 2 . (2.19) Let H be a G × G nonsingular matrix. Then Γ1B−1 1 = Γ1HH−1 B−1 1 = Γ2B−1 2 , which means that Γ2 = Γ1H, B2 = B1H. Thus the second structure is a linear transformation of the first. The SEM solution to this problem is to introduce restrictions on (B, Γ, Σ) such that we can rule out the existence of linear transformations that lead to observation- ally equivalent structures. In other words, the restrictions on (B, Γ, Σ) must be such that there is no matrix H that would yield another structure with the same reduced form; given (Π, Ω) there will be a unique solution to the equations Π = −ΓB−1 and Ω ≡ (B−1 ) ΣB−1 . In practice a variety of restrictions can be imposed including (1) normalizations, such as setting diagonal elements of B equal to 1, (2) zero (exclusion) and linear ho- mogeneous and inhomogeneous restrictions, and (3) covariance and inequality restric- tions. Details of the necessary and sufficient conditions for identification in linear and nonlinear models can be found in many texts including Sargan (1988). Meaningful imposition of identifying restrictions requires that the a priori restric- tions imposed should be valid a posteriori. This idea is pursued further in several chap- ters where identification issues are considered (e.g., Section 6.9). Exclusion restrictions essentially state that the model contains some variables that have zero impact on some endogenous variables. That is, certain directions of causa- tion are ruled out a priori. This makes it possible to identify other directions of cau- sation. For example, in the simple two-variable example given earlier, z1 did not enter the y1-equation, making it possible to identify the direct impact of y2 on y1. Although exclusion restrictions are the simplest to apply, in parametric models identification can also be secured by inequality restrictions and covariance restrictions. 30
  • 55. 2.7. POTENTIAL OUTCOME MODEL If there are no restrictions on Σ, and the diagonal elements of B are normalized to 1, then a necessary condition for identification is the order condition, which states that the number of excluded exogenous variables must at least equal the number of included endogenous variables. A sufficient condition is the rank condition given in many texts that ensures for the jth equation parameters ΠΓj = −Bj yields a unique solution for (Γj , Bj ) given Π. Given identification, the term just (exact) identification refers to the case when the order condition is exactly satisfied; overidentification refers to the case when the number of restrictions on the system exceeds that required for exact identification. Identification in nonlinear SEM has been discussed in Sargan (1988), who also gives references to earlier related work. 2.6. Single-Equation Models Without loss of generality consider the first equation of a linear SEM subject to nor- malization β11 = 1. Let y = y1, let y1 denote the endogenous components of y other than y1, and let z1 denote the exogenous components of z with y = y 1α + z 1γ + u. (2.20) Many studies skip the formal steps involved in going from a system to a single equation and begin by writing the regression equation y = x β + u, where some components of x are endogenous (implicitly y1) and others are exogenous (implicitly z1). The focus lies then on estimating the impact of changes in key regres- sor(s) that may be endogenous or exogenous, depending on the assumptions. Instru- mental variable or two-stage least-squares estimation is the most obvious estimation strategy (see Sections 4.8, 6.4, and 6.5). In the SEM approach it is natural to specify at least some of the remaining equa- tions in the model, even if they are not the focus of inquiry. Suppose y1 has dimen- sion 1. Then the first possibility is to specify the structural equation for y1 and for the other endogenous variables that may appear in this structural equation for y1. A second possibility is to specify the reduced form equation for y1. This will show exogenous variables that affect y1 but do not directly affect y. An advantage is that in such a setting instrumental variables emerge naturally. However, in recent empir- ical work using instrumental variables in a single-equation setting, even the formal step of writing down a reduced form for the endogenous right-hand-side variable is avoided. 2.7. Potential Outcome Model Motivation for causal inference in econometric models is especially strong when the focus is on the impact of public policy and/or private decision variables on some 31
  • 56. CAUSAL AND NONCAUSAL MODELS specific outcomes. Specific examples include the impact of transfer payments on labor supply, the impact of class size on student learning, and the impact of health insurance on utilization of health care. In many cases the causal variables themselves reflect individual decisions and hence are potentially endogenous. When, as is usually the case, econometric estimation and inference are based on observational data, iden- tification of and inference on causal parameters pose many challenges. These chal- lenges can become potentially less serious if the causal issues are addressed using data from a controlled social experiment with a proper statistical design. Although such experiments have been implemented (see Section 3.3 for examples and details) they are generally expensive to organize and run. Therefore, it is more attractive to implement causal modeling using data generated by a natural experiment or in a quasi-experimental setting. Section 3.4 discusses the pros and cons of these data structures; but for present purposes one should think of a natural or quasi experi- ment as a setting in which some causal variable changes exogenously and indepen- dently of other explanatory variables, making it relatively easier to identify causal parameters. A major obstacle for causality modeling stems from the fundamental problem of causal inference (Holland, 1986). Let X be the hypothesized cause and Y the outcome. By manipulating the value of X we can change the value of Y. Suppose the value of X is changed from x1 to x2. Then a measure of the causal impact of the change on Y is formed by comparing the two values of Y: y2, which results from the change, and y1, which would have resulted had no change in x occurred. However, if X did change, then the value of Y, in the absence of the change, would not be observed. Hence noth- ing more can be said about causal impact without some hypothesis about what value Y would have assumed in the absence of the change in X. The latter is referred to as a counterfactual, which means hypothetical unobserved value. Briefly stated, all causal inference involves comparison of a factual with a counterfactual outcome. In the conventional econometric model (e.g., SEM) a counterfactual does not need to be explicitly stated. A relatively newer strand in the microeconometric literature – program evalua- tion or treatment evaluation – provides a statistical framework for the estimation of causal parameters. In the statistical literature this framework is also known as the Rubin causal model (RCM) in recognition of a key early contribution by Rubin (1974, 1978), who in turn cites R.A. Fisher as originator of the approach. Al- though, following recent convention, we refer to this as the Rubin causal model, Neyman (Splawa-Neyman) also proposed a similar statistical model in an article published in Polish in 1923; see Neyman (1990). Models involving counterfactuals have been independently developed in econometrics following the seminal work of Roy (1951). In the remainder of this section the salient features of RCM will be analyzed. Causal parameters based on counterfactuals provide statistically meaningful and operational definitions of causality that in some respects differ from the traditional Cowles foundation definition. First, in ideal settings this framework leads to consider- able simplicity of econometric methods. Second, this framework typically focuses on 32
  • 57. 2.7. POTENTIAL OUTCOME MODEL the fewer causal parameters that are thought to be most relevant to policy issues that are examined. This contrasts with the traditional econometric approach that focuses simultaneously on all structural parameters. Third, the approach provides additional insights into the properties of causal parameters estimated by the standard structural methods. 2.7.1. The Rubin Causal Model The term “treatment” is used interchangeably with “cause.” In medical studies of new drug evaluation, involving groups of those who receive the treatment and those who do not, the drug response of the treated is compared with that of the untreated. A mea- sure of causal impact is the average difference in the outcomes of the treated and the nontreated groups. In economics, the term treatment is used very broadly. Essentially it covers variables whose impact on some outcome is the object of study. Examples of treatment–outcome pairs include schooling and wages, class size and scholastic per- formance, and job training and earnings. Note that a treatment need not be exogenous, and in many situations it is an endogenous (choice) variable. Within the framework of a potential outcome model (POM), which assumes that every element of the target population is potentially exposed to the treatment, the triple (y1i , y0i , Di ), i = 1, . . . , N, forms the basis of treatment evaluation. The categorical variable D takes the values 1 and 0, respectively, when treatment is or is not received; y1i measures the response for individual i receiving treatment, and y0i measures that when not receiving treatment. That is, yi = y1i if Di = 1, y0i if Di = 0. (2.21) Since the receipt and nonreceipt of treatment are mutually exclusive states for indi- vidual i, only one of the two measures is available for any given i, the unavailable measure being the counterfactual. The effect of the cause D on outcome of individual i is measured by (y1i − y0i ). The average causal effect of Di = 1, relative to Di = 0, is measured by the average treatment effect (ATE): ATE = E[y|D = 1] − E[y|D = 0], (2.22) where expectations are with respect to the probability distribution over the target pop- ulation. Unlike the conventional structural model that emphasizes marginal effects, the POM framework emphasizes ATE and parameters related to it. The experimental approach to the estimation of ATE-type parameters involves a random assignment of treatment followed by a comparison of the outcomes with a set of nontreated cases that serve as controls. Such an experimental design is explained in greater detail in Chapter 3. Random assignment implies that individuals exposed to treatment are chosen randomly, and hence the treatment assignment does not depend on the outcome and is uncorrelated with the attributes of treated subjects. Two ma- jor simplifications follow. The treatment variable can be treated as exogenous and its coefficient in a linear regression will not suffer from omitted variable bias if some 33
  • 58. CAUSAL AND NONCAUSAL MODELS relevant variables are unavoidably omitted from the regression. Under certain condi- tions, discussed at greater length in Chapters 3 and 25, the mean difference between the outcomes of the treated and the control groups will provide an estimate of ATE. The payoff to the well-designed experiment is the relative simplicity with which causal statements can be made. Of course, to ensure high statistical precision for the treatment effect estimate, one should still control for those attributes that also independently in- fluence the outcomes. Because random assignment of treatment is generally not feasible in economics, estimation of ATE-type parameters must be based on observational data generated under nonrandom treatment assignment. Then the consistent estimation of ATE will be threatened by several complications that include, for example, possible correlation between the outcomes and treatment, omitted variables, and endogeneity of the treat- ment variable. Some econometricians have suggested that the absence of randomiza- tion comprises the major impediment to convincing statistical inference about causal relationships. The potential outcome model can lead to causal statements if the counterfactual can be clearly stated and made operational. An explicit statement of the counterfactual, with a clear implication of what should be compared, is an important feature of this model. If, as may be the case with observational data, there is lack of a clear distinc- tion between observed and counterfactual quantities, then the answer to the question of who is affected by the treatment remains unclear. ATE is a measure that weights and combines marginal responses of specific subpopulations. Specific assumptions are re- quired to operationalize the counterfactual. Information on both treated and untreated units that can be observed is needed to estimate ATE. For example, it is necessary to identify the untreated group that proxies the treated group if the treatment were not applied. It is not necessarily true that this step can always be implemented. The exact way in which the treated are selected involves issues of sampling design that are also discussed in Chapters 3 and 25. A second useful feature of the POM is that it identifies opportunities for causal modeling created by natural or quasi-experiments. When data are generated in such settings, and provided certain other conditions are satisfied, causal modeling can occur without the full complexities of the SEM framework. This issue is analyzed further in Chapters 3 and 25. Third, unlike the structural form of the SEM where all variables other than that be- ing explained can be labeled as “causes,” in the POM not all explanatory variables can be regarded as causal. Many are simply attributes of the units that must be controlled for in regression analysis, and attributes are not causes (Holland, 1986). Causal param- eters must relate to variables that are actually or potentially, and directly or indirectly, subject to intervention. Finally, identifiability of the ATE parameter may be an easier research goal and hence feasible in situations where the identifiability of a full SEM may not be (Angrist, 2001). Whether this is so has to be determined on a case-by-case basis. However, many available applications of the POM typically employ a limited, rather than full, information framework. However, even within the SEM framework the use of a limited information framework is also feasible, as was previously discussed. 
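To make the potential outcome framework of (2.21)–(2.22) concrete, consider the following small simulation sketch. It generates both potential outcomes for every unit, reveals only one of them according to the treatment indicator, and compares the mean-difference estimate of ATE under random assignment with that under self-selection. The data-generating process is purely illustrative: a constant treatment effect of 2 and a single unobserved attribute, labeled ability, that raises both the outcome and the probability of taking the treatment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes for every unit: y0 if untreated, y1 if treated.
# True ATE = E[y1 - y0] = 2 by construction (an illustrative assumption).
ability = rng.normal(size=n)                 # unobserved attribute
y0 = 1.0 + ability + rng.normal(size=n)
y1 = y0 + 2.0

# (a) Random assignment: D is independent of (y0, y1).
d_rand = rng.binomial(1, 0.5, size=n)
y_rand = np.where(d_rand == 1, y1, y0)       # only one outcome observed, as in (2.21)
ate_rand = y_rand[d_rand == 1].mean() - y_rand[d_rand == 0].mean()

# (b) Self-selection: high-ability units are more likely to take the treatment.
p_sel = 1 / (1 + np.exp(-2 * ability))
d_sel = rng.binomial(1, p_sel)
y_sel = np.where(d_sel == 1, y1, y0)
ate_sel = y_sel[d_sel == 1].mean() - y_sel[d_sel == 0].mean()

print(f"ATE estimate, random assignment: {ate_rand:.2f}")   # close to 2
print(f"ATE estimate, self-selection:    {ate_sel:.2f}")    # overstates the effect
```

Under random assignment the difference in group means is close to the true ATE of 2. Under self-selection the treated group has systematically higher values of the unobserved attribute, so the same comparison of group means confounds the treatment effect with that difference.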
  • 59. 2.8. CAUSAL MODELING AND ESTIMATION STRATEGIES 2.8. Causal Modeling and Estimation Strategies In this section we briefly sketch some of the ways in which econometricians approach the modeling of causal relationships. These approaches can be used within both SEM and POM frameworks, but they are typically identified with the former. 2.8.1. Identification Frameworks Full-Information Structural Models One variant of this approach is based on the parametric specification of the joint distri- bution of endogenous variables conditional on exogenous variables. The relationships are not necessarily derived from an optimizing model of behavior. Parametric restric- tions are placed to ensure identification of the model parameters that are the target of statistical inference. The entire model is estimated simultaneously using maximum likelihood or moments-based estimation. We call this approach the full-information structural approach. For well-specified models this is an attractive approach but in general its potential limitation is that it may contain some equations that are poorly specified. Under joint estimation the effects of localized misspecification may also affect other estimates. Statistically we may interpret the full-information approach as one in which the joint probability distribution of endogenous variables, given the exogenous variables, forms the basis of inference about causality. The jointness may derive from contem- poraneous or dynamic interdependence between endogenous variables and/or the dis- turbances on the equations. Limited-Information Structural Models By contrast, when the central object of statistical inference is estimation of one or two key parameters, a limited-information approach may be used. A feature of this ap- proach is that, although one equation is the focus of inference, the joint dependence between it and other endogenous variables is exploited. This requires that explicit as- sumptions are made about some features of the model that are not the main object of inference. Instrumental variable methods, sequential multistep methods, and limited information maximum likelihood methods are specific examples of this approach. To implement the approach one typically works with one (or more) structural equations and some implicitly or explicitly stated reduced form equations. This contrasts with the full-information approach where all equations are structural. The limited-information approach is often computationally more tractable than the full-information one. Statistically we may interpret the limited-information approach as one in which the joint distribution is factored into the product of a conditional model for the endogenous variable(s) of interest, say y1, and a marginal model for other endogenous variables, say y2, which are in the set of the conditioning variables, as in f (y|x, θ) = g(y1|x, y2, θ1)h(y2|x, θ2), θ ∈ Θ. (2.23) 35
  • 60. CAUSAL AND NONCAUSAL MODELS Modeling may be based on the component g(y1|x, y2, θ1) with minimal attention to h(y2|x, θ2) if θ2 are regarded as nuisance parameters. Of course, such a factorization is not unique, and hence the limited-information approach can have several variants. Identified Reduced Forms A third variant of the SEM approach works with an identified reduced form. Here too one is interested in structural parameters. However, it may be convenient to estimate these from the reduced form subject to restrictions. In time series the identified vector autoregressions provide an example. 2.8.2. Identification Strategies There are numerous potential ways in which the identification of key model parameters can be jeopardized. Omitted variables, functional form misspecifications, measure- ment errors in explanatory variables, using data unrepresentative of the population, and ignoring endogeneity of explanatory variables are leading examples. Microeconomet- rics contains many specific examples of how these challenges can be tackled. Angrist and Krueger (2000) provide a comprehensive survey of popular identification strate- gies in labor economics, with emphasis on the POM framework. Most of the issues are developed elsewhere in the book, but a brief mention is made here. Exogenization Data are sometimes generated by natural experiments and quasi-experiments. The idea here is simply that a policy variable may exogenously change for some subpopulation while it remains the same for other subpopulations. For example, minimum wage laws in one state may change while they remain unchanged in a neighboring state. Such events naturally create treatment and control groups. If the natural experiment ap- proximates a randomized treatment assignment, then exploiting such data to estimate structural parameters can be simpler than estimation of a larger simultaneous equa- tions model with endogenous treatment variables. It is also possible that the treatment variable in a natural experiment can be regarded as exogenous, but the treatment itself is not randomly assigned. Elimination of Nuisance Parameters Identification may be threatened by the presence of a large number of nuisance param- eters. For example, in a cross-section regression model the conditional mean function E[yi |xi ] may involve an individual specific fixed effect αi , assumed to be correlated with the regression error. This effect cannot be identified without many observations on each individual (i.e., panel data). However, with just a short panel it could be elim- inated by a transformation of the model. Another example is the presence of timein- variant unobserved exogenous variables that may be common to groups of individuals. 36
  • 61. 2.8. CAUSAL MODELING AND ESTIMATION STRATEGIES An example of a transformation that eliminates fixed effects is taking differences and working with the differences-in-differences form of the model. Controlling for Confounders When variables are omitted from a regression, and when omitted factors are correlated with the included variables, a confounding bias results. For example, in a regression with earnings as a dependent variable and schooling as an explanatory variable, indi- vidual ability may be regarded as an omitted variable because only imperfect proxies for it are typically available. This means that potentially the coefficient of the school- ing variable may not be identified. One possible strategy is to introduce control vari- ables in the model; the general approach is called the control function approach. These variables are an attempt to approximate the influence of the omitted variables. For example, various types of scholastic achievement scores may serve as controls for ability. Creating Synthetic Samples Within the POM framework a causal parameter may be unidentified because no suit- able comparison or control group can provide the benchmark for estimation. A poten- tial solution is to create a synthetic sample that includes a comparison group that are proxies for controls. Such a sample is created by matching (discussed in Chapter 25). If treated samples can be augmented by well-matched controls, then identification of causal parameters can be achieved in the sense that a parameter related to ATE can be estimated. Instrumental Variables If identification is jeopardized because the treatment variable is endogenous, then a standard solution is to use valid instrumental variables. This is easier said than done. The choice of the instrumental variable as well as the interpretation of the results obtained must be done carefully because the results may be sensitive to the choice of instruments. The approach is analyzed in Sections 4.8, 4.9, 6.4, 6.5, and 25.7, as well as in several other places in the book as the need arises. Again a natural experiment may provide a valid instrument. Reweighting Samples Sample-based inferences about the population are only valid if the sample data are representative of the population. The problem of sample selection or biased sampling arises when the sample data are not representative, in which case the population param- eters are not identified. This problem can be approached as one that requires correction for sample selection (Chapter 16) or one that requires reweighting of the sample infor- mation (Chapter 24). 37
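To make the fixed-effect-eliminating transformation mentioned under Elimination of Nuisance Parameters concrete, consider the following two-period sketch with purely illustrative values: an individual effect that is correlated with treatment-group membership, a treatment effect of 1.5, and a common time trend of 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Two-period panel with an individual fixed effect alpha_i that is
# correlated with treatment-group membership (illustrative assumption).
alpha = rng.normal(size=n)
treated = (alpha + rng.normal(size=n) > 0).astype(float)   # selection on alpha
beta = 1.5                                                  # true treatment effect
trend = 0.5                                                 # common time trend

y0 = alpha + rng.normal(scale=0.5, size=n)                            # period 0
y1 = alpha + trend + beta * treated + rng.normal(scale=0.5, size=n)   # period 1

# Naive cross-section comparison in period 1 is contaminated by alpha.
naive = y1[treated == 1].mean() - y1[treated == 0].mean()

# Differencing eliminates alpha; comparing the differenced outcomes across
# groups (differences in differences) removes the common trend as well.
dy = y1 - y0
did = dy[treated == 1].mean() - dy[treated == 0].mean()

print(f"naive period-1 comparison:  {naive:.2f}")   # biased away from 1.5
print(f"differences in differences: {did:.2f}")     # close to 1.5
```

The naive period-1 comparison mixes the treatment effect with the difference in the fixed effect across groups. Differencing removes the fixed effect, and comparing the differenced outcomes of the two groups removes the common trend, leaving an estimate close to the true effect.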
  • 62. CAUSAL AND NONCAUSAL MODELS 2.9. Bibliographic Notes 2.1 The 2001 Nobel lectures by Heckman and McFadden are excellent sources for both his- torical and current information about the developments in microeconometrics. Heckman’s lecture is remarkable for its comprehensive scope and offers numerous insights into many aspects of microeconometrics. His discussion of heterogeneity has many points of contact with several topics covered in this book. 2.2 Marschak (1953) gives a classic statement of the primacy of structural modeling for policy evaluation. He makes an early mention of the idea of parameter invariance. 2.3 Engle, Hendry, and Richard (1983) provide definitions of weak and strong exogeneity in terms of the distribution of observable variables. They make links with previous literature on exogeneity concepts. 2.4 and 2.5 The term “identification” was used by Koopmans (1949). Point identification in linear parametric models is covered in most textbooks including those by Sargan (1988) who gives a comprehensive and succint treatment, Davidson and MacKinnon (2004), and Greene (2003). Gouriéroux and Monfort (1989, chapter 3.4) provide a different perspective using Fisher and Kullback information measures. Bounds identification in several leading cases is developed in Manski (1995). 2.6 Heckman (2000) provides a historical overview and modern interpretations of causality in the traditional econometric model. Causality concepts within the POM framework are care- fully and incisively analyzed by Holland (1986), who also relates them to other definitions. A sample of the statisticians’ viewpoints of causality from a historical perspective can be found in Freedman (1999). Pearl (2000) gives insightful schematic exposition of the idea of “treating causation as a summary of behavior under interventions,” as well as numerous problems of inferring causality in a nonexperimental situation. 2.7 Angrist and Krueger (1999) survey solutions to identification pitfalls with examples from labor economics. 38
  • 63. C H A P T E R 3 Microeconomic Data Structures 3.1. Introduction This chapter surveys issues concerning the potential usefulness and limitations of dif- ferent types of microeconomic data. By far the most common data structure used in microeconometrics is survey or census data. These data are usually called observa- tional data to distinguish them from experimental data. This chapter discusses the potential limitation of the aforementioned data struc- tures. The inherent limitations of observational data may be further compounded by the manner in which the data are collected, that is, by the sample frame (the way the sample is generated), sample design (simple random sample versus stratified random sample), and sample scope (cross-section versus longitudinal data). Hence we also discuss sampling issues in connection with the use of observational data. Some of this terminology is new at this stage but will be explained later in this chapter. Microeconometrics goes beyond the analysis of survey data under the assumptions of simple random sampling. This chapter considers extensions. Section 3.2 outlines the structure of multistage sample surveys and some common forms of departure from random sampling; a more detailed analysis of their statistical implications is provided in later chapters. It also considers some commonly occurring complications that result in the data not being necessarily representative of the population. Given the deficien- cies of observational data in estimating causal parameters, there has been an increased attempt at exploiting experimental and quasi-experimental data and frameworks. Sec- tion 3.3 examines the potential of data from social experiments. Section 3.4 considers the modeling opportunities arising from a special type of observational data, generated under quasi-experimental conditions, that naturally provide treated and untreated sub- jects and hence are called natural experiments. Section 3.5 covers practical issues of microdata management. 39
  • 64. MICROECONOMIC DATA STRUCTURES 3.2. Observational Data The major source of microeconomic observational data is surveys of households, firms, and government administrative data. Census data can also be used to generate samples. Many other samples are often generated at points of contact between transacting par- ties. For example, marketing data may be generated at the point of sale and/or surveys among (actual or potential) purchasers. The Internet (e.g., online auctions) is also a source of data. There is a huge literature on sample surveys from the viewpoint of both survey statisticians and users of survey data. The first discusses how to sample from the pop- ulation and the results from different sampling designs, and the second deals with the issues of estimation and inference that arise when survey data are collected using dif- ferent sampling designs. A key issue is how well the sample represents the population. This chapter deals with both strands of the literature in an introductory fashion. Many additional details are given in Chapter 24. 3.2.1. Nature of Survey Data The term observational data usually refers to survey data collected by sampling the relevant population of subjects without any attempt to control the characteristics of the sampled data. Let t denote the time subscript, let w denote a set of variables of interest. In the present context t can be a point in time or time interval. Let St denote a sample from population probability distribution F(wt |θt ); St is a draw from F(wt |θt ), where θ is a parameter vector. The population should be thought of as a set of points with characteristics of interest, and for simplicity we assume that the form of the probability distribution F is known. A simple random sam- pling scheme allows every element of the population to have an equal probability of being included in the sample. More complex sampling schemes will be considered later. The abstract concept of a stationary population provides a useful benchmark. If the moments of the characteristics of the population are constant, then we can write θt = θ, for all t. This is a strong assumption because it implies that the moments of the characteristics of the population are time-invariant. For example, the age–sex dis- tribution should be constant. More realistically, some population characteristics would not be constant. To handle such a possibility, (the parameters of) each population may be regarded as a draw from a superpopulation with constant characteristics. Specif- ically, we think of each θt as a draw from a probability distribution with constant (hyper)parameter θ. The terms superpopulation and hyperparameters occur frequently in the literature on hierarchical models discussed in Chapter 24. Additional complica- tions arise if θt has an evolutionary component, for example through dependence on t, or if successive values are interdependent. Using hierarchical models, discussed in Chapters 13 and 26, provides one approach for modeling the relation between hyper- parameters and subpopulation characteristics. 40
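A minimal sketch of the superpopulation idea, with purely illustrative values, is the following: each period's population parameter is drawn from a distribution with constant hyperparameters, and the period-t sample is then drawn conditional on that parameter.

```python
import numpy as np

rng = np.random.default_rng(2)

theta_bar, tau = 10.0, 2.0       # constant hyperparameters of the superpopulation
T, n = 5, 1_000                  # number of periods and sample size per period

for t in range(T):
    theta_t = rng.normal(theta_bar, tau)          # period-t population parameter
    sample_t = rng.normal(theta_t, 1.0, size=n)   # cross-section draw S_t from F(w_t | theta_t)
    print(f"t={t}: theta_t={theta_t:.2f}, sample mean={sample_t.mean():.2f}")
```

Each period's sample mean estimates its own parameter; inference about the constant hyperparameters requires combining information across periods, which is the task of the hierarchical models referred to above.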
  • 65. 3.2. OBSERVATIONAL DATA 3.2.2. Simple Random Samples As a benchmark for subsequent discussion, consider simple random sampling in which the probability of sampling unit i from a population of size N, with N large, is 1/N for all i. Partition w as [y : x]. Suppose our interest is in modeling y, a possibly vector- valued outcome variable, conditional on the exogenous covariate vector x, whose joint distribution is denoted fJ (y, x). This can be always be factored as the product of the conditional distribution fC (y|x, θ) and the marginal distribution fM (x): fJ (y, x) = fC (y|x, θ) fM (x). (3.1) Simple random sampling involves drawing the (y, x) combinations uniformly from the entire population. 3.2.3. Multistage Surveys One alternative is a stratified multistage cluster sampling, also referred to as a com- plex survey method. Large-scale surveys like the Current Population Survey (CPS) and the Panel Survey of Income Dynamics (PSID) take this approach. Section 24.2 provides additional detail on the structure of the CPS. The complex survey design has advantages. It is more cost effective because it reduces geographical dispersion, and it becomes possible to sample certain subpop- ulations more intensively. For example, “oversampling” of small subpopulations ex- hibiting some relevant characteristic becomes feasible whereas a random sample of the population would produce too few observations to support reliable results. A disadvan- tage is that stratified sampling will reduce interindividual variation, which is essential for greater precision. The sample survey literature focuses on multistage surveys that sequentially parti- tion the population into the following categories: 1. Strata: Nonoverlapping subpopulations that exhaust the population. 2. Primary sampling units (PSUs): Nonoverlapping subsets of the strata. 3. Secondary sampling units (SSUs): Sub-units of the PSU, which may in turn be parti- tioned, and so on. 4. Ultimate sampling unit (USU): The final unit chosen for interview, which could be a household or a collection of households (a segment). As an example, the strata may be the various states or provinces in a country, the PSU may be regions within the state or province, and the USU may be a small cluster of households in the same neighborhood. Usually all strata are surveyed so that, for example, all states will be included in the sample with certainty. But not all of the PSUs and their subdivisions are surveyed, and they may be sampled at different rates. In two-stage sampling the surveyed PSUs are drawn at random and the USU is then drawn at random from the selected PSUs. In multistage sampling intermediate sampling units such as SSUs also appear. 41
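The following sketch mimics a two-stage design with purely illustrative numbers: PSUs are drawn at random at the first stage and households within the selected PSUs at the second, with households in the same PSU sharing an unobserved cluster-level factor.

```python
import numpy as np

rng = np.random.default_rng(3)

# Population: 500 PSUs, each with its own unobserved cluster effect (illustrative).
n_psu = 500
psu_effect = rng.normal(scale=1.0, size=n_psu)

# Stage 1: draw 20 PSUs at random; Stage 2: draw 10 households within each PSU.
sampled_psus = rng.choice(n_psu, size=20, replace=False)
samples = []
for j in sampled_psus:
    # Households in PSU j share psu_effect[j], so their outcomes are correlated.
    y_j = 5.0 + psu_effect[j] + rng.normal(scale=1.0, size=10)
    samples.append(y_j)

y = np.concatenate(samples)
print(f"two-stage sample: n = {y.size}, mean = {y.mean():.2f}")
```

Although 200 households are interviewed, they share only 20 cluster effects, so treating the observations as independent draws would understate the variance of the sample mean; this is the clustering issue taken up below and in Section 24.5.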
  • 66. MICROECONOMIC DATA STRUCTURES A consequence of these sampling methods is that different households will have different probabilities of being sampled. The sample is then unrepresentative of the population. Many surveys provide sampling weights that are intended to be inversely proportional to the probability of being sampled, in which case these weights can be used to obtain unbiased estimators of population characteristics. Survey data may be clustered due to, for example, sampling of many households in the same small neighborhood. Observations in the same cluster are likely to be de- pendent or correlated because they may depend on some observable or unobservable factor that could affect all observations in a stratum. For example, a suburb may be dominated by high-income households or by households that are relatively homoge- neous in some dimension of their preferences. Data from these households will tend to be correlated, at least unconditionally, though it is possible that such correlation is negligible after conditioning on observable characteristics of the households. Sta- tistical inference ignoring correlation between sampled observations yields erroneous estimates of variances that are smaller than those from the correct formula. These is- sues are covered in greater depth in Section 24.5. Two-stage and multistage samples potentially further complicate the computation of standard errors. In summary, (1) stratification with different sampling rates within strata means that the sample is unrepresentative of the population; (2) sampling weights inversely pro- portional to the probability of being sampled can be used to obtain unbiased estimation of population characteristics; and (3) clustering may lead to correlation of observations and understatement of the true standard errors of estimators unless appropriate adjust- ments are made. 3.2.4. Biased Samples If a random sample is drawn then the probability distribution for the data is the same as the population distribution. Certain departures from random sampling cause a di- vergence between the two; this is referred to as biased sampling. The data distribution differs from the population distribution in a manner that depends on the nature of the deviation from random sampling. Deviation from random sampling occurs because it is sometimes more convenient or cost effective to obtain the data from a subpopulation even though it is not representative of the entire population. We now consider several examples of such departures, beginning with a case in which there is no departure from randomness. Exogenous Sampling Exogenous sampling from survey data occurs if the analyst segments the available sample into subsamples based only on a set of exogenous variables x, but not on the response variable. For example, in a study of hospitalizations in Germany, Geil et al. (1997) segmented the data into two categories, those with and without chronic condi- tions. Classification by income categories is also common. Perhaps it is more accurate to depict this type of sampling as exogenous subsampling because it is done by ref- erence to an existing sample that has already been collected. Segmenting an existing 42
  • 67. 3.2. OBSERVATIONAL DATA sample by gender, health, or socioeconomic status is very common. Under the assump- tions of exogenous sampling the probability distribution of the exogenous variables is independent of y and contains no information about the population parameters of interest, θ. Therefore, one may ignore the marginal distribution of the exogenous vari- ables and simply base estimation on the conditional distribution f (y|x, θ). Of course, the assumption may be wrong and the observed distribution of the outcome variable may depend on the selected segmenting variable, which may be correlated with the outcome, thus causing departure from exogenous sampling. Response-Based Sampling Response-based sampling occurs if the probability of an individual being included in the sample depends on the responses or choices made by that individual. In this case sample selection proceeds in terms of rules defined in terms of the endogenous variable under study. Three examples are as follows: (1) In a study of the effect of negative income tax or Aid to Families with Dependent Children (AFDC) on labor supply only those below the poverty line are surveyed. (2) In a study of determinants of public transport modal choice, only users of public transport (a subpopulation) are surveyed. (3) In a study of the determinants of number of visits to a recreational site, only those with at least one visit are included. Lower survey costs provide an important motivation for using choice-based samples in preference to simple random samples. It would require a very large random sample to generate enough observations (information) about a relatively infrequent outcome or choice, and hence it is cheaper to collect a sample from those who have actually made the choice. The practical significance of this is that consistent estimation of population param- eters θ can no longer be carried out using the conditional population density f (y|x) alone. The effect of the sampling scheme must also be taken into account. This topic is discussed further in Section 24.4. Length-Biased Sampling Length-biased sampling illustrates how biases may result from sampling one popu- lation to make inferences about a different population. Strictly speaking, it is not so much an example of departure from randomness in sampling as one of sampling the “wrong” population. Econometric studies of transitions model the time spent in origin state j by indi- vidual i before transiting to another destination state s. An example is when j cor- responds to unemployment and s to employment. The data used in such studies can come from one of several possible sources. One source is sampling individuals who are unemployed on a particular date, another is to sample those who are in the labor force regardless of their current state, and a third is to sample individuals who are ei- ther entering or leaving unemployment during a specified period of time. Each type of sampling scheme is based on a different concept of the relevant population. In the 43
  • 68. MICROECONOMIC DATA STRUCTURES first case the relevant population is the stock of unemployed individuals, in the second the labor force, and in the third individuals with transitioning employment status. This topic is discussed further in Section 18.6. Suppose that the purpose of the survey is to calculate a measure of the average duration of unemployment. This is the average length of time a randomly chosen indi- vidual will spend in unemployment if he or she becomes unemployed. The answer to this apparently straightforward question may vary depending on how the sample data are obtained. The flow distribution of completed durations is in general quite differ- ent from the stock distribution. When we sample the stock, the probability of being in the sample is higher for individuals with longer durations. When we sample the flow out of the state, the probability does not depend on the time spent in the state. This is the well-known example of length-biased sampling in which the estimate obtained by sampling the stock is a biased estimate of the average length of an unemployment spell of a random entrant to unemployment. The following simple schematic diagram may clarify the point: ◦ • Entry f low → • • • ◦ Stock → ◦ ◦ • Exit f low Here we use the symbol • to denote slow movers and the symbol ◦ to denote fast movers. Suppose the two types are equally represented in the flow, but the slow movers stay in the stock longer than the fast movers. Then the stock population has a higher proportion of slow movers. Finally, the exit population has a higher proportion of fast movers. The argument will generalize to other types of heterogeneity. The point of this example is not that flow sampling is a better thing to do than stock sampling. Rather, it is that, depending on what the question is, stock sampling may not yield a random sample of the relevant population. 3.2.5. Bias due to Sample Selection Consider the following problem. A researcher is interested in measuring the effect of training, denoted z (treatment), on posttraining wages, denoted y (outcome), given the worker’s characteristics, denoted x. The variable z takes the value 1 if the worker has received training and is 0 otherwise. Observations are available on (x, D) for all work- ers but on y only for those who received training (D = 1). One would like to make inferences about the average impact of training on the posttraining wage of a ran- domly chosen worker with known characteristics who is currently untrained (D = 0). The problem of sample selection concerns the difficulty of making such an inference. Manski (1995), who views this as a problem of identification, defines the selection problem formally as follows: This is the problem of identifying conditional probability distributions from random sample data in which the realizations of the conditioning variables are always ob- served but realizations of the outcomes are censored. 44
  • 69. 3.2. OBSERVATIONAL DATA Suppose y is the outcome to be predicted, and the conditioning variables are denoted by x. The variable z is a censoring indicator that takes the value 1 if the outcome y is observed and 0 otherwise. Because the variables (D, x) are always observed, but y is observed only when D = 1, Manski views this as a censored sampling process. The censored sampling process does not identify Pr[y|x], as can be seen from Pr[y|x] = Pr[y|x, D = 1] Pr[D = 1|x] + Pr[y|x, D = 0] Pr[D = 0|x]. (3.2) The sampling process can identify three of the four terms on the right-hand side, but provides no information about the term Pr[y|x, D = 0]. Because E[y|x] = E[y|x, D = 1] · Pr[D = 1|x] + E[y|x, D = 0] · Pr[D = 0|x], whenever the censoring probability Pr[D = 0|x] is positive, the available empirical evidence places no restrictions on E[y|x]. Consequently, the censored-sampling pro- cess can identify Pr[y|x] only for some unknown value of Pr[y|x, D = 0]. To learn anything about the E[y|x], restrictions will need to be placed on Pr[y|x]. The alternative approaches for solving this problem are discussed in Section 16.5. 3.2.6. Quality of Survey Data The quality of sample data depends not only on the sample design and the survey instrument but also on the survey responses. This observation applies especially to observational data. We consider several ways in which the quality of the sample data may be compromised. Some of the problems (e.g., attrition) can also occur with other types of data. This topic overlaps with that of biased sampling. Problem of Survey Nonresponse Surveys are normally voluntary, and incentive to participate may vary systematically according to household characteristics and type of question asked. Individuals may refuse to answer some questions. If there is a systematic relationship between refusal to answer a question and the characteristics of the individual, then the issue of the representativeness of a survey after allowing for nonresponse arises. If nonresponse is ignored, and if the analysis is carried out using the data from respondents only, how will the estimation of parameters of interest be affected? Survey nonresponse is a special case of the selection problem mentioned in the preceding section. Both involve biased samples. To illustrate how it leads to distorted inference consider the following model: y1 y2 x, z ∼ N x β z γ , σ2 1 σ12 σ12 σ2 2 , (3.3) where y1 is a continuous random variable of interest (e.g., expenditure) that depends on x, and y2 is a latent variable that measures the “propensity to participate” in a survey 45
  • 70. MICROECONOMIC DATA STRUCTURES and depends on z. The individual participates if y2 0; otherwise the individual does not. The variables x and z are assumed to be exogenous. The formulation allows y1 and y2 to be correlated. Suppose we estimate β from the data supplied by participants by least squares. Is this estimator unbiased in the presence of nonparticipation? The answer is that if nonparticipation is random and independent of y1, the variable of interest, then there is no bias, but otherwise there will be. The argument is as follows: β = X X −1 X y1, E[ β − β] = E X X −1 X E[y1 − Xβ|X, Z, y2 0 , where the first line gives the least-squares formula for the estimates of β and the second line gives its bias. If y1 and y2 are independent, conditional on X and Z, σ12 = 0, then E[y1 − Xβ|X, Z, y2 0] = E[y1 − Xβ|X, Z] = 0, and there is no bias. Missing and Mismeasured Data Survey respondents dealing with an extensive questionnaire will not necessarily an- swer every question and even if they do, the answers may be deliberately or fortu- itously false. Suppose that the sample survey attempts to obtain a vector of responses denoted as xi =(xi1, . . . ., xi K ) from N individuals, i = 1, . . . , N. Suppose now that if an individual fails to provide information on any one or more elements of xi , then the entire vector is discarded. The first problem resulting from missing data is that the sample size is reduced. The second potentially more serious problem is that missing data can potentially lead to biases similar to the selection bias. If the data are missing in a systematic manner, then the sample that is left to analyze may not be represen- tative of the population. A form of selection bias may be induced by any systematic pattern of nonresponse. For example, high-income respondents may systematically not respond to questions about income. Conversely, if the data are missing completely at random then discarding incomplete observations will reduce precision but not gen- erate biases. Chapter 27 discusses the missing-data problem and solutions in greater depth. Measurement errors in survey responses are a pervasive problem. They can arise from a variety of causes, including incorrect responses arising from carelessness, de- liberate misreporting, faulty recall of past events, incorrect interpretation of questions, and data-processing errors. A deeper source of measurement error is due to the mea- sured variable being at best an imperfect proxy for the relevant theoretical concept. The consequences of such measurement errors is a major topic and is discussed in Chapter 26. 46
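A small simulation sketch along the lines of (3.3), with purely illustrative values, shows the consequence of correlated errors. For a stark illustration the participation propensity is made to depend on the same regressor x that enters the outcome equation, so that the selection effect is correlated with the regressor; when the two errors are independent the least-squares slope computed from respondents alone is close to the true coefficient, and when they are correlated it is not.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta, gamma = 1.0, 1.0

x = rng.normal(size=n)

def slope_on_respondents(rho):
    # Errors of the outcome (y1) and participation (y2) equations with correlation rho.
    cov = [[1.0, rho], [rho, 1.0]]
    u1, u2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    y1 = 2.0 + beta * x + u1          # outcome of interest
    y2 = gamma * x + u2               # latent propensity to respond (depends on x here)
    respond = y2 > 0                  # y1 is observed only for respondents
    X = np.column_stack([np.ones(respond.sum()), x[respond]])
    coef, *_ = np.linalg.lstsq(X, y1[respond], rcond=None)
    return coef[1]                    # least-squares slope using respondents only

print(f"slope with rho = 0.0: {slope_on_respondents(0.0):.3f}")  # near beta = 1.0
print(f"slope with rho = 0.8: {slope_on_respondents(0.8):.3f}")  # biased
```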
  • 71. 3.2. OBSERVATIONAL DATA Sample Attrition In panel data situations the survey involves repeated observations on a set of individu- als. In this case we can have r full response in all periods (full participation), r nonresponse in the first period and in all subsequent periods (nonparticipation), or r partial response in the sense of response in the initial periods but nonresponse in later periods (incomplete participation) – a situation referred to as sample attrition. Sample attrition leads to missing data, and the presence of any nonrandom pattern of “missingness” will lead to the sample selection type problems already mentioned. This can be interpreted as a special case of the sample selection problem. Sample attrition is discussed briefly in Sections 21.8.5 and 23.5.2. 3.2.7. Types of Observational Data Cross-section data are obtained by observing w, for the sample St for some t. Al- though it is usually impractical to sample all households at the same point of time, cross-section data are still a snapshot of characteristics of each element of a subset of the population that will be used to make inferences about the population. If the pop- ulation is stationary, then inferences made about θt using St may be valid also for t = t. If there is significant dependence between past and current behavior, then lon- gitudinal data are required to identify the relationship of interest. For example, past decisions may affect current outcomes; inertia or habit persistence may account for current purchases, but such dependence cannot be modeled if the history of purchases is not available. This is one of the limitations imposed by cross-section data. Repeated cross-section data are obtained by a sequence of independent samples St taken from F(wt |θt ), t = 1, . . . , T. Because the sample design does not attempt to retain the same units in the sample, information about dynamic dependence in behavior is lost. If the population is stationary then repeated cross-section data are obtained by a sampling process somewhat akin to sampling with replacement from the constant population. If the population is nonstationary, repeated cross sections are related in a manner that depends on how the population is changing over time. In such a case the objective is to make inferences about the underlying constant (hyper)parameters. The analysis of repeated cross sections is discussed in Section 22.7. Panel or longitudinal data are obtained by initially selecting a sample S and then collecting observations for a sequence of time periods, t = 1, . . . , T. This can be achieved by interviewing subjects and collecting both present and past data at the same time, or by tracking the subjects once they have been inducted into the survey. This produces a sequence of data vectors {w1, . . . , wT } that are used to make infer- ences about either the behavior of the population or that of the particular sample of individuals. The appropriate methodology in each case may not be the same. If the data are drawn from a nonstationary population, the appropriate objective should be inference on (hyper)parameters of the superpopulation. 47
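The contrast between panel data and repeated cross sections can be illustrated with a small sketch, again with purely illustrative values: the same two periods of outcomes are either linked at the individual level or sampled independently in each period.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000

# Outcomes in two periods with genuine dynamic dependence (illustrative AR(1)).
y1 = rng.normal(size=n)
y2 = 0.7 * y1 + rng.normal(scale=0.5, size=n)

# Panel data: the same units are observed twice, so the dynamics are visible.
print(f"corr(y1, y2) from panel data: {np.corrcoef(y1, y2)[0, 1]:.2f}")

# Repeated cross sections: independent samples in each period. The marginal
# distributions are still estimable, but the within-person link is lost.
sample_t1 = rng.choice(y1, size=2_000, replace=False)
sample_t2 = rng.choice(y2, size=2_000, replace=False)
print(f"period means: {sample_t1.mean():.2f}, {sample_t2.mean():.2f}")
# Pairing unrelated individuals across the two samples says nothing about dynamics.
print(f"corr of unrelated pairs: {np.corrcoef(sample_t1, sample_t2)[0, 1]:.2f}")  # near 0
```

The repeated cross sections still identify the marginal distribution of the outcome in each period, but they carry no information about the within-individual dependence that the panel reveals.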
  • 72. MICROECONOMIC DATA STRUCTURES Some limitations of these types of data are immediately obvious. Cross-section samples and repeated cross-sections do not in general provide suitable data for mod- eling intertemporal dependence in outcomes. Such data are only suitable for modeling static relationships. In contrast, longitudinal data, especially if they span a sufficiently long time period, are suitable for modeling both static and dynamic relationships. Longitudinal data are not free from problems. The first issue is representativeness of the panel. Problems of inference regarding population behavior using longitudinal data become more difficult if the population is not stationary. For analyzing dynamics of be- havior, retaining original households in the panel for as long as possible is an attractive option. In practice, longitudinal data sets suffer from the problem of “sample attrition,” perhaps due to “sample fatigue.” This simply means that survey respondents do not continue to provide responses to questionnaires. This creates two problems: (1) The panel becomes unbalanced and (2) there is the danger that the retained household may not be “typical” and that the sample becomes unrepresentative of the population. When the available sample data are not a random draw from the population, results based on different types of data will be susceptible to biases to different degrees. The problem of “sample fatigue” arises because over time it becomes more difficult to retain in- dividuals within the panel or they may be “lost” (censored) for some other reason, such as a change of location. These issues are dealt with later in the book. Analysis of longitudinal data may nevertheless provide information about some aspects of the behavior of the sampled units, although extrapolation to population behavior may not be straightforward. 3.3. Data from Social Experiments Observational and experimental data are distinct because an experimental environment can in principle be closely monitored and controlled. This makes it possible to vary a causal variable of interest, holding other covariates at controlled settings. In con- trast, observational data are generated in an uncontrolled environment, leaving open the possibility that the presence of confounding factors will make it more difficult to identify the causal relationship of interest. For example, when one attempts to study the earnings–schooling relationship using observational data, one must accept that the years of schooling of an individual is itself an outcome of an individual’s decision- making process, and hence one cannot regard the level of schooling as if it had been set by a hypothetical experimenter. In social sciences, data analogous to experimental data come from either social experiments, defined and described in greater detail in the following, or from “labo- ratory” experiments on small groups of voluntary participants that mimic the behavior of economic agents in the real-life counterpart of the experiment. Social experiments are relatively uncommon, and yet experimental concepts, methods, and data serve as a benchmark for evaluating econometric studies based on observational data. This section provides a brief account of the methodology of social experiments, the nature of the data emanating from them, and some problems and issues of econometric methodology that they generate. 48
  • 73. 3.3. DATA FROM SOCIAL EXPERIMENTS The central feature of the experimental methodology involves a comparison be- tween the outcomes of the randomly selected experimental group that is subjected to a “treatment”with those of a control (comparison) group. In a good experiment consid- erable care is exercised in matching the control and experimental (“treated”) groups, and in avoiding potential biases in outcomes. Such conditions may not be realized in observational environments, thereby leading to a possible lack of identification of causal parameters of interest. Sometimes, however, experimental conditions may be approximately replicated in observational data. Consider, for example, two contigu- ous regions or states, one of which pursues a different minimum-wage policy from the other, creating the conditions of a natural experiment in which observations from the “treated” state can be compared with those from the “control” state. The data structure of a natural experiment has also attracted attention in econometrics. A social experiment involves exogenous variations in the economic environment facing the set of experimental subjects, which is partitioned into one subset that re- ceives the experimental treatment and another that serves as a control group. In con- trast to observational studies in which changes in exogenous and endogenous factors are often confounded, a well-designed social experiment aims to isolate the role of treatment variables. In some experimental designs there may be no explicit control group, but varying levels of the treatment are applied, in which case it becomes pos- sible in principle to estimate the entire response surface of experimental outcomes. The primary object of a social experiment is to estimate the impact of an actual or potential social program. The potential outcome model of Section 2.7 provides a relevant background for modeling the impact of social experiments. Several alternative measures of impact have been proposed and these will be discussed in the chapter on program evaluation (Chapter 25). Burtless (1995) summarizes the case for social experiments, while noting some potential limitations. In a companion article Heckman and Smith (1995) focus on limitations of actual social experiments that have been implemented. The remaining discussion in this section borrows significantly from these papers. 3.3.1. Leading Features of Social Experiments Social experiments are motivated by policy issues about how subjects would react to a type of policy that has never been tried and hence one for which no observed response data exist. The idea of a social experiment is to enlist a group of willing participants, some of whom are randomly assigned to a treatment group and the rest to a control group. The difference between the responses of those in the treatment group, subjected to the policy change, and those in the control group, who are not, is the estimated effect of the policy. Schematically the standard experimental design is as depicted in Figure 3.1. The term “experimentals” refers to the group receiving treatments, “controls” to the group not receiving treatment, and “random assignment” to the process of assigning individuals to the two groups. Randomized trials were introduced in statistics by R. A. Fisher (1928) and his co-workers. A typical agricultural experiment would consist of a trial in which a new 49
  • 74. MICROECONOMIC DATA STRUCTURES Eligible subject invited to participate Agrees to participate? Yes No Randomize Drop from study Assign to treatment Assign to control Figure 3.1: Social experiment with random assignment. treatment such as fertilizer application would be applied to plants growing on ran- domly chosen blocks of land and then the responses would be compared with those of a control group of plants, similar to the experimentals in all relevant respects but not given experimental treatment. If the effect of all other differences between the ex- perimental and control groups can be eliminated, the estimated difference between the two sets of responses can be attributed to the treatment. In the simplest situation one can concentrate on a comparison of the mean outcome of the treated group and of the untreated group. Although in agricultural and biomedical sciences, the randomized experiments methodology has been long established, in economics and social sciences it is new. It is attractive for studying responses to policy changes for which no observational data exist, perhaps because the policy changes of interest have never occurred. Ran- domized experiments also permit a greater variation in policy variables and parameters than are present in observational data, thereby making it easier to identify and study responses to policy changes. In many cases the social experiment may try out a pol- icy that has never been tried, so the observational data remain completely silent on its potential impact. Social experiments are still rather rare outside the United States, partly because they are expensive to run. In the United States a number of such experiments have taken place since the early 1970s. Table 3.1 summarizes features of some relatively well-known examples; for a more extensive coverage see Burtless (1995). An experiment may produce either cross-section or longitudinal data, although cost considerations will usually limit the time dimension well below what is typical in ob- servational data. When an experiment lasts several years and has multiple stages and/or geographical locations, as in the case of RHIE, interim analyses based on “incomplete” data are not uncommon (Newhouse et al., 1993). 3.3.2. Advantages of Social Experiments Burtless (1995) surveys the advantages of social experiments with great clarity. The key advantage stems from randomized trials that remove any correlation be- tween the observed and unobserved characteristics of program participants. Hence the 50
  • 75. 3.3. DATA FROM SOCIAL EXPERIMENTS Table 3.1. Features of Some Selected Social Experiments Experiment Tested Treatments Target Population Rand Health Health insurance plans with Low- and moderate-level Insurance Experiment varying copayment rate and income persons and families (RHIE), 1974–1982 differing levels of maximum out-of-pocket expenses Negative Income Tax NIT plans with alternative Low- and moderate-level (NIT), 1968–1978 income guarantees and income persons and families tax rates with nonaged head of household Job Training Job search assistance, Out-of-school youths and Partnership Act (JTPA), on-the-job training, classroom disadvantaged adults (1986–1994) training financed under JTPA contribution of the treatment to the outcome difference between the treated and control groups can be estimated without confounding bias even if one cannot control for the confounding variables. The presence of correlation between treatment and confound- ing variables often plagues observational studies and complicates causal inference. By contrast, an experimental study conducted under ideal circumstances can produce a consistent estimate of the average difference in outcomes of the treated and nontreated groups without much computational complexity. If, however, an outcome depends on treatment as well as other observable fac- tors, then controlling for the latter will in general improve the precision of the impact estimate. Even if observational data are available, the generation and use of experimental data has great appeal because it offers the possibility of exogenizing a policy variable, and randomization of treatments can potentially lead to great simplification of statistical analysis. Conclusions based on observational data often lack generality because they are based on a nonrandom sample from the population – the problem of selection bias. An example is the aforementioned RHIE study whose major focus is on the price re- sponsiveness of the demand for health services. Availability of health insurance affects the user price of health services and thereby its use. An important policy issue is the ex- tent to which “overutilization” of health services would result from subsidized health insurance. One can, of course, use observational data to model the relation between the demand for health services and the level of insurance. However, such analyses are subject to the criticism that the level of health insurance should not be treated as ex- ogenous. Theoretical analyses show that the demand for health insurance and health care are jointly determined, so causation is not unidirectional. This fact can potentially make it difficult to identify the role of health insurance. Treating health insurance as exogenous biases the estimate of price responsiveness. However, in an experimental setup the participating households could be assigned an insurance policy, making it an exogenous variable. The role of insurance is then identifiable. Once the key variable of interest is exogenized, the direction of causation becomes clear and the impact of 51
  • 76. MICROECONOMIC DATA STRUCTURES the treatment can be studied unambiguously. Furthermore, if the experiment is free from some of the problems that we mention in the following, this greatly simplifies statistical analysis relative to what is often necessary in survey data. 3.3.3. Limitations of Social Experiments The application of a nonhuman methodology, initially that is, one developed for and applied to nonhuman subjects, to human subjects has generated a lively debate in the literature. See especially Heckman and Smith (1995), who argue that many social ex- periments may suffer from limitations that apply to observational studies. These is- sues concern general points such as the merits of experimental versus observational methodology, as well as specific issues concerning the biases and problems inherent in the use of human subjects. Several of the issues are covered in more detail in later chapters but a brief overview follows. Social experiments are very costly to run. Sometimes, perhaps often, they do not correspond to “clean” randomized trials. Hence the results from such experiments are not always unambiguous and easily interpretable, or free from biases. If the treatment variable has many alternative settings of interest, or if extrapolation is an important objective, then a very large sample must be collected to ensure sufficient data variation and to precisely gauge the effect of treatment variation. In that case the cost of the experiment will also increase. If the cost factor prevents a large enough experiment, its utility relative to observational studies may be questionable; see the papers by Rosen and Stafford in Hausman and Wise (1985). Unfortunately the design of some social experiments is flawed. Hausman and Wise (1985) argue that the data from the New Jersey negative income tax experiment was subject to endogenous stratification, which they describe as follows: . . . [T]he reason for an experiment is, by randomization, to eliminate correlation between the treatment variable and other determinants of the response variable that is under study. In each of the income-maintenance experiments, however, the exper- imental sample was selected in part on the basis of the dependent variable, and the assignment to treatment versus control group was based in part on the dependent variable as well. In general, the group eligible for selection – based on family status, race, age of family head, etc. – was stratified on the basis of income (and other vari- ables) and persons were selected from within the strata. (Hausman and Wise, 1985, pp. 190–191) The authors conclude that, in the presence of endogenous stratification, unbiased es- timation of treatment effects is not straightforward. Unfortunately, a fully randomized trial in which treatment assignment within a randomly selected experimental group from the population is independent of income would be much more costly and may not be feasible. There are several other issues that detract from the ideal simplicity of a random- ized experiment. First, if experimental sites are selected randomly, cooperation of administrators and potential participants at that site would be required. If this is not forthcoming, then alternative treatment sites where such cooperation is obtainable 52
  • 77. 3.3. DATA FROM SOCIAL EXPERIMENTS will be substituted, thereby compromising the random assignment principle; see Hotz (1992). A second problem is that of sample selection, which is relevant because participa- tion is voluntary. For ethical reasons there are many experiments that simply cannot be done (e.g., random assignment of students to years of education). Unlike medical experiments that can achieve the gold standard of a double-blind protocol, in social experiments experimenters and subjects know whether they are in treatment or con- trol groups. Furthermore, those in control groups may obtain treatment, (e.g., training) from alternative sources. If the decision to participate is uncorrelated with either x or ε, the analysis of the experimental data is simplified. A third problem is sample attrition caused by subjects dropping out of the experi- ment after it has started. Even if the initial sample was random the effect of nonran- dom attrition may well lead to a problem similar to the attrition bias in panels. Finally, there is the problem of Hawthorne effect. The term originates in social psychology research conducted jointly by the Harvard Graduate School of Business Administra- tion and the management of the Western Electric Company at the latter’s Hawthorne works in Chicago from 1926 to 1932. Human subjects, unlike inanimate objects, may change or adapt their behavior while participating in the experiment. In this case the variation in the response observed under experimental conditions cannot be attributed solely to treatment. Heckman and Smith (1995) mention several other difficulties in implementing a randomized treatment. Because the administration of a social experiment involves a bureaucracy, there is a potential for biases. Randomization bias occurs if the assign- ment introduces a systematic difference between the experimental participant and the participant during its normal operation. Heckman and Smith document the possibilities of such bias in actual experiments. Another type of bias, called substitution bias, is introduced when the controls may be receiving some form of treatment that substitutes for the experimental treatment. Finally, analysis of social experiments is inevitably of a partial equilibrium nature. One cannot reliably extrapolate the treatment effects to the entire population because the ceteris paribus assumption will not hold when the entire population is involved. Specifically, the key issue is whether one can extrapolate the results from the exper- iment to the population at large. If the experiment is conducted as a pilot program on a small scale, but the intention is to predict the impact of policies that are more broadly applied, then the obvious limitation is that the pilot program cannot incorporate the broader impact of the treatment. A broadly applied treatment may change the eco- nomic environment sufficiently to invalidate the predictions from a partial equilibrium setup. So the treatment will not be like the actual policy that it mimics. In summary, social experiments, in principle, could yield data that are easier to an- alyze and to understand in terms of cause and effect than observational data. Whether this promise is realized depends on the experimental design. A poor experimen- tal design generates its own statistical complications, which affect the precision of the conclusions. 
Social experiments differ fundamentally from those in biology and agriculture because human subjects and treatment administrators tend to be both active and forward-looking individuals with personal preferences, rather than 53
Table 3.2. Features of Some Selected Natural Experiments

  Experiment: Outcomes for identical twins with different schooling levels.
  Treatments studied: Differences in returns to schooling through correlation between schooling and wages.
  Reference: Ashenfelter and Krueger (1994).

  Experiment: Transition to National Health Insurance (NHI) in Canada, as Saskatchewan moves to NHI and other provinces follow several years later.
  Treatments studied: Labor market effects of NHI based on comparison of provinces with and without NHI.
  Reference: Gruber and Hanratty (1995).

  Experiment: New Jersey increases the minimum wage while neighboring Pennsylvania does not.
  Treatments studied: Minimum wage effects on employment.
  Reference: Card and Krueger (1994).

passive administrators of a standard protocol or willing recipients of randomly assigned treatment.

3.4. Data from Natural Experiments

Sometimes, however, a researcher may have available data from a "natural experiment." A natural experiment occurs when a subset of the population is subjected to an exogenous variation in a variable, perhaps as a result of a policy shift, that would ordinarily be subject to endogenous variation. Ideally, the source of the variation is well understood. In microeconometrics there are broadly two ways in which the idea of a natural experiment is exploited. For concreteness consider the simple regression model

$$ y = \beta_1 + \beta_2 x + u, \qquad (3.4) $$

where x is an endogenous treatment variable correlated with u.

Suppose that there is an exogenous intervention that changes x. Examples of such external intervention are administrative rules, unanticipated legislation, natural events such as twin births, weather-related shocks, and geographical variation; see Table 3.2 for examples. Exogenous intervention creates an opportunity for evaluating its impact by comparing the behavior of the impacted group both pre- and postintervention, or with that of a nonimpacted group postintervention. That is, "natural" comparison groups are generated by the event, which facilitates estimation of β2. Estimation is simplified because x can be treated as exogenous.

The second way in which a natural experiment can assist inference is by generating natural instrumental variables. Suppose z is a variable that is correlated with x, or perhaps causally related to x, and uncorrelated with u. Then an instrumental variable estimator of β2, expressed in terms of sample covariances, is

$$ \widehat{\beta}_2 = \frac{\widehat{\operatorname{Cov}}[z, y]}{\widehat{\operatorname{Cov}}[z, x]} \qquad (3.5) $$

(see Section 4.8.5).
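The ratio in (3.5) is straightforward to compute from sample moments. The following sketch (Python with NumPy; the data-generating process, parameter values, and variable names are hypothetical and purely illustrative) contrasts the instrumental variables estimate with OLS when x is correlated with u.

```python
import numpy as np

# Illustration of the IV estimator in (3.5): beta2_hat = Cov[z, y] / Cov[z, x].
# All parameter values below are hypothetical.
rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                         # instrument: related to x, unrelated to u
u = rng.normal(size=n)                         # structural error
x = 0.8 * z + 0.5 * u + rng.normal(size=n)     # endogenous regressor, correlated with u
y = 1.0 + 2.0 * x + u                          # beta1 = 1, beta2 = 2

cov = np.cov(np.vstack([z, x, y]))             # 3 x 3 sample covariance matrix of (z, x, y)
beta2_iv = cov[0, 2] / cov[0, 1]               # Cov[z, y] / Cov[z, x]
beta2_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"IV estimate:  {beta2_iv:.3f}")         # close to 2
print(f"OLS estimate: {beta2_ols:.3f}")        # biased away from 2 because Cov[x, u] > 0
```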
In an observational data setup an instrumental variable with the right properties may be difficult to find, but it could arise naturally in a favorable natural experiment. Then estimation would be simplified. We consider the first case in the next section; the topic of naturally generated instruments is covered in Chapter 25.

3.4.1. Natural Exogenous Interventions

Such data are less expensive to collect, and they also allow the researcher to evaluate the role of some specific factor in isolation, as in a controlled experiment, because "nature" holds constant variation attributed to other factors that are not of direct interest. Such natural experiments are attractive because they generate treatment and control groups inexpensively and in a real-world setting. Whether a natural experiment can support convincing inference depends, in part, on whether the supposed natural intervention is genuinely exogenous, whether its impact is sufficiently large to be measurable, and whether there are good treatment and control groups. Just because a change is legislated, for example, does not mean that it is an exogenous intervention. However, in appropriate cases, opportunistic exploitation of such data sets can yield valuable empirical insights.

Investigations based on natural experiments have several potential limitations whose importance in any given study can only be assessed through careful consideration of the relevant theory, facts, and institutional setting. Following Campbell (1969) and Meyer (1995), these are grouped into limitations that affect a study's internal validity (i.e., the inferences about policy impact drawn from the study) and those that affect a study's external validity (i.e., the generalization of the conclusions to other members of the population). Consider an investigation of a policy change in which conclusions are drawn from a comparison of pre- and postintervention data, using the regression method briefly described in the following and in greater detail in Chapter 25. In any study there will be omitted variables that may have also changed in the time interval between the policy change and its impact. The characteristics of sampled individuals, such as age, health status, and their actual or anticipated economic environment, may also change. These omitted factors will directly affect the measured impact of the policy change. Whether the results can be generalized to other members of the population will depend on the absence of bias due to nonrandom sampling, the absence of significant interaction effects between the policy change and its setting, and the absence of historical factors that would cause the impact to vary from one situation to another. Of course, these considerations are not unique to data from natural experiments; rather, the point is that the latter are not necessarily free from these problems.

3.4.2. Differences in Differences

One simple regression method is based on a comparison of outcomes in one group before and after a policy intervention. For example, consider

$$ y_{it} = \alpha + \beta D_t + \varepsilon_{it}, \qquad i = 1, \ldots, N, \; t = 0, 1, $$
where D_t = 1 in period 1 (postintervention), D_t = 0 in period 0 (preintervention), and y_it measures the outcome. The regression estimated from the pooled data will yield an estimate of the policy impact parameter β. This is easily shown to equal the average difference in the pre- and postintervention outcomes,

$$ \widehat{\beta} = N^{-1} \sum_{i} (y_{i1} - y_{i0}) = \bar{y}_1 - \bar{y}_0. $$

The one-group before-and-after design makes the strong assumption that the group remains comparable over time. This is required for identifiability of β. If, for example, we allowed α to vary between the two periods, β would no longer be identified: changes in α are confounded with the policy impact.

One way to improve on the previous design is to include an additional untreated comparison group, that is, one not impacted by the policy, for which data are available in both periods. Using Meyer's (1995) notation, the relevant regression now is

$$ y_{it}^{j} = \alpha + \alpha_1 D_t + \alpha^1 D^j + \beta D_t^j + \varepsilon_{it}^{j}, \qquad i = 1, \ldots, N, \; t = 0, 1, $$

where j is the group superscript, D^j = 1 if j = 1 and D^j = 0 otherwise, D_t^j = 1 if both j and t equal 1 and D_t^j = 0 otherwise, and ε is a zero-mean constant-variance error term. The equation does not include covariates, but they can be added, and those that do not vary are already subsumed under α. This relation implies that, for the treated group, we have preintervention

$$ y_{i0}^{1} = \alpha + \alpha^1 D^1 + \varepsilon_{i0}^{1} $$

and postintervention

$$ y_{i1}^{1} = \alpha + \alpha_1 + \alpha^1 D^1 + \beta + \varepsilon_{i1}^{1}. $$

The impact is therefore

$$ y_{i1}^{1} - y_{i0}^{1} = \alpha_1 + \beta + \varepsilon_{i1}^{1} - \varepsilon_{i0}^{1}. \qquad (3.6) $$

The corresponding equations for the untreated group are

$$ y_{i0}^{0} = \alpha + \varepsilon_{i0}^{0}, \qquad y_{i1}^{0} = \alpha + \alpha_1 + \varepsilon_{i1}^{0}, $$

and hence the difference is

$$ y_{i1}^{0} - y_{i0}^{0} = \alpha_1 + \varepsilon_{i1}^{0} - \varepsilon_{i0}^{0}. \qquad (3.7) $$

Both first-difference equations include the period-1 specific effect α_1, which can be eliminated by taking the difference between Equations (3.6) and (3.7):

$$ \left(y_{i1}^{1} - y_{i0}^{1}\right) - \left(y_{i1}^{0} - y_{i0}^{0}\right) = \beta + \left(\varepsilon_{i1}^{1} - \varepsilon_{i0}^{1}\right) - \left(\varepsilon_{i1}^{0} - \varepsilon_{i0}^{0}\right). \qquad (3.8) $$

Assuming that E[(ε¹_{i1} − ε¹_{i0}) − (ε⁰_{i1} − ε⁰_{i0})] equals zero, we can obtain an unbiased estimate of β as the sample average of (y¹_{i1} − y¹_{i0}) − (y⁰_{i1} − y⁰_{i0}). This method uses differences in differences.
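The calculation in (3.8) amounts to differencing group-level before–after means, or equivalently to a pooled regression with time, group, and interaction dummies. A minimal numerical sketch (Python with NumPy; the simulated data, group sizes, and parameter values are hypothetical) illustrates both routes.

```python
import numpy as np

# Simulated two-period, two-group data for the differences-in-differences
# estimator in (3.8). All parameter values are hypothetical.
rng = np.random.default_rng(1)
n = 5_000
alpha, alpha1, alpha_grp, beta = 1.0, 0.3, 0.5, 2.0   # beta is the policy impact

d_group = rng.integers(0, 2, size=n)                  # D^j: 1 = treated group, 0 = control
y0 = alpha + alpha_grp * d_group + rng.normal(size=n)                                 # period 0
y1 = alpha + alpha1 + alpha_grp * d_group + beta * d_group + rng.normal(size=n)       # period 1

# Difference in differences of sample means, as in (3.8).
treated, control = d_group == 1, d_group == 0
did = ((y1[treated].mean() - y0[treated].mean())
       - (y1[control].mean() - y0[control].mean()))
print(f"DiD estimate of beta: {did:.3f}")             # close to 2.0

# Equivalent pooled regression: y on a constant, the time dummy, the group
# dummy, and their interaction; the interaction coefficient estimates beta.
y = np.concatenate([y0, y1])
d_t = np.concatenate([np.zeros(n), np.ones(n)])
d_j = np.concatenate([d_group, d_group]).astype(float)
X = np.column_stack([np.ones(2 * n), d_t, d_j, d_t * d_j])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"Regression estimate of beta: {coef[3]:.3f}")
```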
If time-varying covariates are present, they can be included in the relevant equations and their differences will appear in the regression equation (3.8). For simplicity our analysis ignored the possibility that there remain observable differences in the distribution of characteristics between the treatment and control groups. If so, then such differences must be controlled for. The standard solution is to include such controlling variables in the regression.

An example of a study based on a natural experiment is that of Ashenfelter and Krueger (1994). They estimate the returns to schooling by contrasting the wage rates of identical twins with different schooling levels. In this case running a regular experiment in which individuals are exogenously assigned different levels of schooling is simply not feasible. Nonetheless, some experimental-type controls are needed. As the authors explain:

  Our goal is to ensure that the correlation we observe between schooling and wage rates is not due to a correlation between schooling and a worker's ability or other characteristics. We do this by taking advantage of the fact that monozygotic twins are genetically identical and have similar family backgrounds.

Data on twins have served as a basis for a number of other econometric studies (Rosenzweig and Wolpin, 1980; Bronars and Grogger, 1994). Since the twinning probability in the population is not high, an important issue is generating a sufficiently large representative sample, allowing for some nonresponse. One source of such data is the census. Another source is the "twins festivals" that are held in the United States. Ashenfelter and Krueger (1994, p. 1158) report that their data were obtained from interviews conducted at the 16th Annual Twins Day Festival, Twinsburg, Ohio, August 1991, which is the largest gathering of twins, triplets, and quadruplets in the world.

The attraction of using the twins data is that the presence of common effects from both observable and unobservable factors can be eliminated by modeling the differences between the outcomes of the twins. For example, Ashenfelter and Krueger estimate a regression model of the difference in the log of wage rates between the first and the second twin. The first-differencing operation eliminates the effects of age, gender, ethnicity, and so forth. The remaining explanatory variables are the difference between schooling levels, which is the variable of main interest, and variables such as differences in years of tenure and marital status.

3.4.3. Identification through Natural Experiments

The natural experiments school has had a useful impact on econometric practice. By encouraging the opportunistic exploitation of quasi-experimental data, and by using modeling frameworks such as the POM of Chapter 2, econometric practice bridges the gap between observational and experimental data. The notions of parameter identification rooted in the SEM framework are broadened to include identification of measures that are interesting from a policy viewpoint. The main advantage of using data from a natural experiment is that a policy variable of interest might be validly treated as exogenous. However, in using data from natural experiments, as in the case of social
  • 82. MICROECONOMIC DATA STRUCTURES experiments, the choice of control groups plays a critical role in determining the reliability of the conclusions. Several potential problems that affect a social experi- ment, such as selectivity and attrition bias, will also remain potential problems in the case of natural experiments. Only a subset of interesting policy problems may lend themselves to analysis within the natural experiment framework. The experiment may apply only to a small part of the population, and the conditions under which it occurs may not replicate themselves easily. An example given in Section 22.6 illustrates this point in the context of difference in differences. 3.5. Practical Considerations Although there has been an explosion in the number and type of microdata sets that are available, certain well-established databases have supported numerous studies. We provide a very partial list of some of very well known U.S. micro databases. For fur- ther details, see the respective Web sites for these data sets or the data clearinghouses mentioned in the following. Many of these allow you to download the data directly. 3.5.1. Some Sources of Microdata Panel Study in Income Dynamics (PSID): Based at the Survey Research Center at the University of Michigan, PSID is a national survey that has been running since 1968. Today it covers over 40,000 individuals and collects economic and demo- graphic data. These data have been used to support a wide variety of microecono- metric analyses. Brown, Duncan and Stafford (1996) summarize recent develop- ments in PSID data. Current Population Survey (CPS): This is a monthly national survey of about 50,000 households that provides information on labor force characteristics. The survey has been conducted for more than 50 years. Major revisions in the sample have fol- lowed each of the decennial censuses. For additional details about this survey see Section 24.2. It is the basis of many federal government statistics on earnings and unemployment. It is also an important source of microdata that have supported nu- merous studies especially of labor markets. The survey was redesigned in 1994 (Polivka, 1996). National Longitudinal Survey (NLS): The NLS has four original cohorts: NLS Older Men, NLS Young Men, NLS Mature Women, and NLS Young Women. Each of the original cohorts is a national yearly survey of over 5,000 individuals who have been repeatedly interviewed since the mid-1960s. Surveys collect information on each respondent’s work experiences, education, training, family income, household composition, marital status, and health. Supplementary data on age, sex, etc. are available. National Longitudinal Surveys of Youth (NLSY): The NLSY is a national annual survey of 12,686 young men and young women who where 14 to 22 years of age when they were first surveyed in 1979. It contains three subsamples. The data 58
  • 83. 3.5. PRACTICAL CONSIDERATIONS provide a unique opportunity to study the life-course experiences of a large sam- ple of young adults who are representative of American men and women born in the late 1950s and early 1960s. A second NLSY began in 1997. Survey of Income and Program Participation (SIPP): SIPP is a longitudinal survey of around 8,000 housing units per month. It covers income sources, participation in entitlement programs, correlation between these items, and individual attachments to the job market over time. It is a multipanel survey with a new panel being intro- duced at the beginning of each calendar year. The first panel of SIPP was initiated in October 1983. Compared with CPS, SIPP has fewer employed and more unem- ployed persons. Health and Retirement Study (HRS): The HRS is a longitudinal national study. The baseline consists of interviews with members of 7,600 households in 1992 (respondents aged from 51 to 61) with follow-ups every two years for 12 years. The data contain a wealth of economic, demographic, and health information. World Bank’s Living Standards Measurement Study (LSMS): The World Bank’s LSMS household surveys collect data “on many dimensions of household well- being that can be used to assess household welfare, understand household behavior, and evaluate the effects of various government policies on the living conditions of the population” in many developing countries. Many examples of the use of these data can be found in Deaton (1997) and in the economic development literature. Grosh and Glewwe (1998) outline the nature of the data and provide references to research studies that have used them. Data clearinghouses: The Interuniversity Consortium for Political and Social Re- search (ICPSR) provides access to many data sets, including the PSID, CPS, NLS, SIPP, National Medical Expenditure Survey (NMES), and many others. The U.S. Bureau of Labor Statistics handles the CPS and NLS surveys. The U.S. Bureau of Census handles the SIPP. The U.S. National Center for Health Statistics provides access to many health data sets. A useful gateway to European data archives is the Council of European Social Science Data Archives (CESSDA), which provides links to several European national data archives. Journal data archives: For some purposes, such as replication of published results for classroom work, you can get the data from journal archives. Two archives in particular have well-established procedures for data uploads and downloads using an Internet browser. The Journal of Business and Economic Statistics archives data used in most but not all articles published in that journal. The Journal of Applied Econometrics data archive is also organized along similar lines and contains data pertaining to most articles published since 1994. 3.5.2. Handling Microdata Microeconomic data sets tend to be quite large. Samples of several hundreds or thou- sands are common and even those of tens of thousands are not unusual. The distribu- tions of outcomes of interest are often nonnormal, in part because one is often dealing 59
  • 84. MICROECONOMIC DATA STRUCTURES with discrete data such as binary outcomes, or with data that have limited variation such as proportions or shares, or with truncated or censored continuous outcomes. Handling large nonnormal data sets poses some problems of summarizing and report- ing the important features of data. Often it is useful to use one computing environment (program) for data extraction, reduction, and preparation and a different one for model estimation. 3.5.3. Data Preparation The most basic feature of microeconometric analysis is that the process of arriving at the sample finally used in the econometric investigation is likely to be a long one. It is important to accurately document decisions and choices made by the investigator in the process of “cleaning up” the data. Let us consider some specific examples. One of the most common features of sample survey data is nonresponse or par- tial response. The problems of nonresponse have already been discussed. Partial res- ponse usually means that some parts of survey questionnaires were not answered. If this means that some of the required information is not available, the observations in question are deleted. This is called listwise deletion. If this problem occurs in a sig- nificant number of cases, it should be properly analyzed and reported because it could lead to an unrepresentative sample and biases in estimation. The issue is analyzed in Chapter 27. For example, consider a question in a household survey to which high- income households do not respond, leading to a sample in which these households are underrepresented. Hence the end effect is no different from one in which there is a full response but the sample is not representative. A second problem is measurement error in reported data. Microeconomic data are typically noisy. The extent, type, and seriousness of measurement error depends on the type of survey cross section or panel, the individual who responds to the survey, and the variable about which information is sought. For example, self-reported income data from panel surveys are strongly suspected to have serially correlated measurement er- ror. In contrast, reported expenditure magnitudes are usually thought to have a smaller measurement error. Deaton (1997) surveys some of the sources of measurement er- ror with special reference to the World Bank’s Living Standards Measurement Survey, although several of the issues raised have wider relevance. The biases from measure- ment error depend on what is done to the data in terms of transformations (e.g., first differencing) and the estimator used. Hence to make informative statements about the seriousness of biases from measurement error, one must analyze well-defined mod- els. Later chapters will give examples of the impact of measurement error in specific contexts. 3.5.4. Checking Data In large data sets it is easy to have erroneous data resulting from keyboard and cod- ing errors. One should therefore apply some elementary checks that would reveal the existence of problems. One can check the data before analyzing it by examining some 60
  • 85. 3.6. BIBLIOGRAPHIC NOTES descriptive statistics. The following techniques are useful. First, use summary statistics (min, max, mean, and median) to make sure that the data are in the proper interval and on the proper scale. For instance, categorical variables should be between zero and one, counts should be greater than or equal to zero. Sometimes missing data are coded as −999, or some other integer, so take care not to treat these entries as data. Second, one should know whether changes are fractional or on a percentage scale. Third, use box and whisker plots to identify problematic observations. For instance, using box and whisker plots one researcher found a country that had negative population growth (owing to a war) and another country that had recorded investment as more than GDP (because foreign aid had been excluded from the GDP calculation). Checking observa- tions before proceeding with estimation may also suggest normalizing transformations and/or distributional assumptions with features appropriate for modeling a particular data set. Third, screening data may suggest appropriate data transforms. For example, box and whisker plots and histograms could suggest which variables might be better modeled via a log or power transform. Finally, it may be important to check the scales of measurement. For some purposes, such as the use of nonlinear estimators, it may be desirable to scale variables so that they have roughly similar scale. Summary statis- tics can be used to check that the means, variances, and covariances of the variables indicate proper scaling. 3.5.5. Presenting Descriptive Statistics Because microdata sets are usually large, it is essential to provide the reader with an initial table of descriptive statistics, usually mean, standard deviation, minimum, and maximum for every variable. In some cases unexpectedly large or small values may reveal the presence of a gross recording error or erroneous inclusion of an incorrect data point. Two-way scatter diagrams are usually not helpful, but tabulation of cate- gorical variables (contingency tables) can be. For discrete variables histograms can be useful and for continuous variables density plots can be informative. 3.6. Bibliographic Notes 3.2 Deaton (1997) provides an introduction to sample surveys especially for developing economies. Several specific references to complex surveys are provided in Chapter 24. Becketti et al. (1988) investigate the importance of the issue of representativeness of the PSID. 3.3 The collective volume edited by Hausman and Wise (1985) contains several papers on indi- vidual social experiments including the RHIE, NIT, and Time-of-Use pricing experiments. Several studies question the usefulness of the experimental data and there is extensive dis- cussion of the flaws in experimental designs that preclude clear conclusions. Pros and cons of social experiments versus observational data are discussed in an excellent pair of papers by Burtless (1995) and Heckman and Smith (1995). 3.4 A special issue of the Journal of Business and Economic Statistics (1995) carries a number of articles that use the methodology of quasi- or natural experiments. The collection in- cludes an article by Meyer who surveys the issues in and the methodology of econometric 61
  • 86. MICROECONOMIC DATA STRUCTURES studies that use data from natural experiments. He also provides a valuable set of guidelines on the credible use of natural variation in making inferences about the impact of economic policies, partly based on the work of Campbell (1969). Kim and Singal (1993) study the impact of changes in market concentration on price using the data generated by a airline mergers. Rosenzweig and Wolpin (2000) review an extensive literature based on natural experiments such as identical twins. Isacsson (1999) uses the twins approach to study re- turns to schooling using Swedish data. Angrist and Lavy (1999) study the impact of class size on test scores using data from schools that are subject to “Maimonides’ Rule” (briefly reviewed in Section 25.6), which states that class size should not exceed 40. The rule gen- erates an instrument. 62
  • 87. PART TWO Core Methods Part 2 presents the core estimation methods – least squares, maximum likelihood and method of moments – and associated methods of inference for nonlinear regression models that are central in microeconometrics. The material also includes modern top- ics such as quantile regression, sequential estimation, empirical likelihood, semipara- metric and nonparametric regression, and statistical inference based on the bootstrap. In general the discussion is at a level intended to provide enough background and detail to enable the practitioner to read and comprehend articles in the leading econo- metrics journals and, where needed, subsequent chapters of this book. We presume prior familiarity with linear regression analysis. The essential estimation theory is presented in three chapters. Chapter 4 begins with the linear regression model. It then covers at an introductory level quantile regression, which models distributional features other than the conditional mean. It provides a lengthy expository treatment of instrumental variables estimation, a major method of causal inference. Chapter 5 presents the most commonly-used estimation methods for nonlinear models, beginning with the topic of m-estimation, before specialization to maximum likelihood and nonlinear least squares regression. Chapter 6 provides a com- prehensive treatment of generalized method of moments, which is a quite general esti- mation framework that is applicable for linear and nonlinear models in single-equation and multi-equation settings. The chapter emphasizes the special case of instrumental variables estimation. We then turn to model testing. Chapter 7 covers both the classical and bootstrap approaches to hypothesis testing, while Chapter 8 presents relatively more modern methods of model selection and specification analysis. Because of their importance the computationally-intensive bootstrap methods are also the subject of a more de- tailed chapter, Chapter 11 in Part 3. A distinctive feature of this book is that, as much as possible, testing procedures are presented in a unified manner in just these three chapters. The procedures are then illustrated in specific applications throughout the book. Chapter 9 is a stand-alone chapter that presents nonparametric and semiparametric estimation methods that place a flexible structure on the econometric model. 63
  • 88. PART TWO: CORE METHODS Chapter 10 presents the computational methods used to compute the nonlinear esti- mators presented in chapters 5 and 6. This material becomes especially relevant to the practitioner if an estimator is not automatically computed by an econometrics package, or if numerical difficulties are encountered in model estimation. 64
  • 89. C H A P T E R 4 Linear Models 4.1. Introduction A great deal of empirical microeconometrics research uses linear regression and its various extensions. Before moving to nonlinear models, the emphasis of this book, we provide a summary of some important results for the single-equation linear regres- sion model with cross-section data. Several different estimators in the linear regression model are presented. Ordinary least-squares (OLS) estimation is especially popular. For typical microe- conometric cross-section data the model error terms are likely to be heteroskedas- tic. Then statistical inference should be robust to heteroskedastic errors and efficiency gains are possible by use of weighted rather than ordinary least squares. The OLS estimator minimizes the sum of squared residuals. One alternative is to minimize the sum of the absolute value of residuals, leading to the least absolute de- viations estimator. This estimator is also presented, along with extension to quantile regression. Various model misspecifications can lead to inconsistency of least-squares estima- tors. In such cases inference about economically interesting parameters may require more advanced procedures and these are pursued at considerable length and depth else- where in the book. One commonly used procedure is instrumental variables regression. The current chapter provides an introductory treatment of this important method and additionally addresses the complication of weak instruments. Section 4.2 provides a definition of regression and presents various loss functions that lead to different estimators for the regression function. An example is introduced in Section 4.3. Some leading estimation procedures, specifically ordinary least squares, weighted least squares, and quantile regression, are presented in, respectively, Sec- tions 4.4, 4.5, and 4.6. Model misspecification is considered in Section 4.7. Instru- mental variables regression is presented in Sections 4.8 and 4.9. Sections 4.3–4.5, 4.7, and 4.8 cover standard material in introductory courses, whereas Sections 4.2, 4.6, and 4.9 introduce more advanced material. 65
4.2. Regressions and Loss Functions

In modern microeconometrics the term regression refers to a bewildering range of procedures for studying the relationship between an outcome variable y and a set of regressors x. It is helpful, therefore, to state at the beginning the motivation and justification for some of the leading types of regressions.

For exposition it is convenient to think of the purpose of regression as conditional prediction of y given x. In practice, regression models are also used for other purposes, most notably causal inference. Even then a prediction function constitutes a useful data summary and is still of interest. In particular, see Section 4.2.3 for the distinction between linear prediction and causal inference based on a linear causal mean.

4.2.1. Loss Functions

Let ŷ denote the predictor, defined as a function of x. Let e ≡ y − ŷ denote the prediction error, and let

$$ L(e) = L(y - \widehat{y}) \qquad (4.1) $$

denote the loss associated with the error e. As in decision analysis we assume that the predictor forms the basis of some decision, and the prediction error leads to disutility on the part of the decision maker that is captured by L(e), whose precise functional form is a choice of the decision maker. The loss function has the property that it is increasing in |e|.

Treating (y, ŷ) as random, the decision maker minimizes the expected value of the loss function, denoted E[L(e)]. If the predictor depends on x, a K-dimensional vector, then expected loss is expressed as

$$ \mathrm{E}\left[L(y - \widehat{y}) \,|\, \mathbf{x}\right]. \qquad (4.2) $$

The choice of the loss function should depend in a substantive way on the losses associated with prediction errors. In some situations, such as weather forecasting, there may be a sound basis for choosing one loss function over another. In econometrics, there is often no clear guide and the convention is to specify quadratic loss. Then (4.1) specializes to L(e) = e² and by (4.2) the optimal predictor minimizes the expected loss E[L(e)|x] = E[e²|x]. It follows that in this case the minimum mean-squared prediction error criterion is used to compare predictors.

4.2.2. Optimal Prediction

The decision theory approach to choosing the optimal predictor is framed in terms of minimizing expected loss,

$$ \min_{\widehat{y}} \; \mathrm{E}\left[L(y - \widehat{y}) \,|\, \mathbf{x}\right]. $$

Thus the optimality property is relative to the loss function of the decision maker.
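To preview the loss functions catalogued in Table 4.1 below: minimizing average squared error yields the mean, minimizing average absolute error yields the median, and minimizing an asymmetric absolute loss yields a quantile. The small numerical sketch that follows (Python with NumPy; the simulated data and the crude grid search are purely illustrative, not part of the text) makes this concrete for constant predictors.

```python
import numpy as np

# Illustration only: for a skewed sample, minimize the average loss over a grid
# of constant predictors and recover the mean, median, and 0.9 quantile.
rng = np.random.default_rng(2)
y = rng.exponential(scale=2.0, size=20_000)      # skewed outcome, no regressors
grid = np.linspace(0.0, 10.0, 1_001)             # candidate predictors y_hat

def best_predictor(loss):
    # average loss of each candidate; return the candidate with smallest loss
    return grid[np.argmin([loss(y - g).mean() for g in grid])]

sq = best_predictor(lambda e: e ** 2)                              # squared error -> mean
ab = best_predictor(lambda e: np.abs(e))                           # absolute error -> median
q9 = best_predictor(lambda e: np.where(e > 0, 0.9, 0.1) * np.abs(e))  # alpha = 0.9 -> 0.9 quantile

print(sq, y.mean())            # approximately equal
print(ab, np.median(y))        # approximately equal
print(q9, np.quantile(y, 0.9)) # approximately equal
```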
Four leading examples of loss functions, and the associated optimal predictor functions, are given in Table 4.1.

Table 4.1. Loss Functions and Corresponding Optimal Predictors

  Squared error loss:        L(e) = e²                                        Optimal predictor: E[y|x]
  Absolute error loss:       L(e) = |e|                                       Optimal predictor: med[y|x]
  Asymmetric absolute loss:  L(e) = (1 − α)|e| if e ≤ 0;  α|e| if e > 0       Optimal predictor: q_α[y|x]
  Step loss:                 L(e) = 0 if e = 0;  1 if e ≠ 0                   Optimal predictor: mod[y|x]

We provide a brief presentation of each in turn. A detailed analysis is given in Manski (1988a).

The most well known loss function is the squared error loss (or mean-square loss) function. Then the optimal predictor of y is the conditional mean function, E[y|x]. In the most general case no structure is placed on E[y|x] and estimation is by nonparametric regression (see Chapter 9). More often a model for E[y|x] is specified, with E[y|x] = g(x, β), where g(·) is a specified function and β is a finite-dimensional vector of parameters that needs to be estimated. The optimal prediction is ŷ = g(x, β̂), where β̂ is chosen to minimize the in-sample loss

$$ \sum_{i=1}^{N} L(e_i) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} \left(y_i - g(\mathbf{x}_i, \boldsymbol{\beta})\right)^2. $$

The loss function is the sum of squared residuals, so estimation is by nonlinear least squares (see Section 5.8). If the conditional mean function g(·) is restricted to be linear in x and β, so that E[y|x] = x'β, then the optimal predictor is ŷ = x'β̂, where β̂ is the ordinary least-squares estimator detailed in Section 4.4.

If the loss criterion is absolute error loss, then the optimal predictor is the conditional median, denoted med[y|x]. If the conditional median function is linear, so that med[y|x] = x'β, then the optimal predictor is ŷ = x'β̂, where β̂ is the least absolute deviations estimator that minimizes Σᵢ |yᵢ − xᵢ'β|. This estimator is presented in Section 4.6.

Both the squared error and absolute error loss functions are symmetric, so the same penalty is imposed for a prediction error of a given magnitude regardless of the direction of the prediction error. Asymmetric absolute error loss instead places a penalty of (1 − α)|e| on overprediction and a different penalty α|e| on underprediction. The asymmetry parameter α is specified. It lies in the interval (0, 1), with symmetry when α = 0.5 and increasing asymmetry as α approaches 0 or 1. The optimal predictor can be shown to be the conditional quantile, denoted q_α[y|x]; a special case is the conditional median when α = 0.5. Conditional quantiles are defined in Section 4.6, which presents quantile regression (Koenker and Bassett, 1978).

The last loss function given in Table 4.1 is step loss, which bases the loss simply on whether a prediction error is made, regardless of its magnitude. The optimal predictor is the
conditional mode, denoted mod[y|x]. This provides motivation for mode regression (Lee, 1989).

Maximum likelihood does not fall as easily into the prediction framework of this section. It can, however, be given an expected loss interpretation in terms of predicting the density and minimizing Kullback–Leibler information (see Section 5.7).

The results just stated imply that the econometrician interested in estimating a prediction function from the data (y, x) should choose the prediction function according to the loss function. The use of the popular linear regression implies, at least implicitly, that the decision maker has a quadratic loss function and believes that the conditional mean function is linear. However, if one of the other three loss functions is specified, then the optimal predictor will be based on one of the three other types of regressions. In practice there can be no clear reason for preferring a particular loss function. Regressions are often used as data summaries, rather than for prediction per se. Then it can be useful to consider a range of estimators, as alternative estimators may provide useful information about the sensitivity of estimates. Manski (1988a, 1991) has pointed out that the quadratic and absolute error loss functions are both convex. If the conditional distribution of y|x is symmetric then the conditional mean and median estimators are both consistent and can be expected to be quite close. Furthermore, if one avoids assumptions about the distribution of y|x, then differences in alternative estimators provide a way of learning about the data distribution.

4.2.3. Linear Prediction

The optimal predictor under squared error loss is the conditional mean E[y|x]. If this conditional mean is linear in x, so that E[y|x] = x'β, the parameter β has a structural or causal interpretation, and consistent estimation of β by OLS implies consistent estimation of E[y|x] = x'β. This permits meaningful policy analysis of the effects of changes in regressors on the conditional mean.

If instead the conditional mean is nonlinear in x, so that E[y|x] ≠ x'β, the structural interpretation of OLS disappears. However, it is still possible to interpret β as the best linear predictor under squared error loss. Differentiation of the expected loss E[(y − x'β)²] with respect to β yields the first-order conditions −2E[x(y − x'β)] = 0, so the optimal linear predictor is β = (E[xx'])⁻¹E[xy], with sample analogue the OLS estimator.

Usually we specialize to models with an intercept. In a change of notation we define x to denote the regressors excluding the intercept, and we replace x'β by α + x'γ. The first-order conditions with respect to α and γ are −2E[u] = 0 and −2E[xu] = 0, where u = y − (α + x'γ). These imply that E[u] = 0 and Cov[x, u] = 0. Solving yields

$$ \gamma = \left(\mathrm{V}[\mathbf{x}]\right)^{-1} \operatorname{Cov}[\mathbf{x}, y], \qquad \alpha = \mathrm{E}[y] - \mathrm{E}[\mathbf{x}]'\gamma; \qquad (4.3) $$

see, for example, Goldberger (1991, p. 52). From the derivation of (4.3) it should be clear that for data (y, x) we can always write a linear regression model

$$ y = \alpha + \mathbf{x}'\gamma + u, \qquad (4.4) $$
where the parameters α and γ are defined in (4.3) and the error term u satisfies E[u] = 0 and Cov[x, u] = 0.

A linear regression model can therefore always be given the nonstructural or reduced form interpretation as the best linear prediction (or linear projection) under squared error loss. However, for the conditional mean to be linear in x, so that E[y|x] = α + x'γ, requires the assumption that E[u|x] = 0, in addition to E[u] = 0 and Cov[x, u] = 0.

This distinction is of practical importance. For example, if E[u|x] = 0, so that E[y|x] = α + x'γ, then the probability limit of a least-squares (LS) estimator γ̂ is γ regardless of whether the LS estimator is weighted or unweighted, or whether the sample is obtained by simple random sampling or by exogenous stratified sampling. If instead E[y|x] ≠ α + x'γ, then these different LS estimators may have different probability limits. This example is discussed further in Section 24.3. A structural interpretation of OLS requires that the conditional mean of the error term, given regressors, equals zero.

4.3. Example: Returns to Schooling

A leading linear regression application from labor economics concerns measuring the impact of education on wages or earnings. A typical returns-to-schooling model specifies

$$ \ln w_i = \alpha s_i + \mathbf{x}_{2i}'\boldsymbol{\beta} + u_i, \qquad i = 1, \ldots, N, \qquad (4.5) $$

where w denotes hourly wage or annual earnings, s denotes years of completed schooling, and x₂ denotes control variables such as work experience, gender, and family background. The subscript i denotes the ith person in the sample. Since the dependent variable is the log wage, the model is a log-linear model and the coefficient α measures the proportionate change in earnings associated with a one-year increase in education.

Estimation of this model is most often by ordinary least squares. The transformation to ln w in practice ensures that errors are approximately homoskedastic, but it is still best to obtain heteroskedasticity-consistent standard errors, as detailed in Section 4.4. Estimation can also be by quantile regression (see Section 4.6) if interest lies in distributional issues such as behavior in the lower quartile.

The regression (4.5) can be used immediately in a descriptive manner. For example, if α̂ = 0.10 then a one-year increase in schooling is associated with 10% higher earnings, controlling for all the factors included in x₂. It is important to add the last qualifier, as in this example the estimate α̂ usually becomes smaller as x₂ is expanded to include additional controls likely to influence earnings.

Policy interest lies in determining the impact of an exogenous change in schooling on earnings. However, schooling is not randomly assigned; rather, it is an outcome that depends on choices made by the individual. Human capital theory treats schooling as investment by individuals in themselves, and α is interpreted as a measure of the return to human capital. The regression (4.5) is then a regression of one endogenous variable, y, on another, s, and so does not measure the causal impact of an exogenous change
  • 94. LINEAR MODELS in s. The conditional mean function here is not causally meaningful because one is conditioning on a factor, schooling, that is endogenous. Indeed, unless we can argue that s is itself a function of variables at least one of which can vary independently of u, it is unclear just what it means to regard α as a causal parameter. Such concern about endogenous regressors with observational data on individuals pervades microeconometric analysis. The standard assumptions of the linear regres- sion model given in Section 4.4 are that regressors are exogenous. The consequences of endogenous regressors are considered in Section 4.7. One method to control for endogenous regressors, instrumental variables, is detailed in Section 4.8. A recent ex- tensive review of ways to control for endogeneity in this wage–schooling example is given in Angrist and Krueger (1999). These methods are summarized in Section 2.8 and presented throughout this book. 4.4. Ordinary Least Squares The simplest example of regression is the OLS estimator in the linear regression model. After first defining the model and estimator, a quite detailed presentation of the asymptotic distribution of the OLS estimator is given. The exposition presumes pre- vious exposure to a more introductory treatment. The model assumptions made here permit stochastic regressors and heteroskedastic errors and accommodate data that are obtained by exogenous stratified sampling. The key result of how to obtain heteroskedastic-robust standard errors of the OLS estimator is given in Section 4.4.5. 4.4.1. Linear Regression Model In a standard cross-section regression model with N observations on a scalar dependent variable and several regressors, the data are specified as (y, X), where y denotes observations on the dependent variable and X denotes a matrix of explanatory variables. The general regression model with additive errors is written in vector notation as y = E [y|X] + u, (4.6) where E[y|X] denotes the conditional expectation of the random variable y given X, and u denotes a vector of unobserved random errors or disturbances. The right-hand side of this equation decomposes y into two components, one that is deterministic given the regressors and one that is attributed to random variation or noise. We think of E[y|X] as a conditional prediction function that yields the average value, or more formally the expected value, of y given X. A linear regression model is obtained when E[y|X] is specified to be a linear func- tion of X. Notation for this model has been presented in detail in Section 1.6. In vector notation the ith observation is yi = x i β+ui , (4.7) 70
where xᵢ is a K × 1 regressor vector and β is a K × 1 parameter vector. At times it is simpler to drop the subscript i and write the model for the typical observation as y = x'β + u. In matrix notation the N observations are stacked by row to yield

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \qquad (4.8) $$

where y is an N × 1 vector of dependent variables, X is an N × K regression matrix, and u is an N × 1 error vector. Equations (4.7) and (4.8) are equivalent expressions for the linear regression model and will be used interchangeably. The latter is more concise and is usually the most convenient representation.

In this setting y is referred to as the dependent variable or endogenous variable whose variation we wish to study in terms of variation in x and u; u is referred to as the error term or disturbance term; and x is referred to as regressors or predictors or covariates. If Assumption 4 in Section 4.4.6 holds, then all components of x are exogenous variables or independent variables.

4.4.2. OLS Estimator

The OLS estimator is defined to be the estimator that minimizes the sum of squared errors

$$ \sum_{i=1}^{N} u_i^2 = \mathbf{u}'\mathbf{u} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \qquad (4.9) $$

Setting the derivative with respect to β equal to 0 and solving for β yields the OLS estimator

$$ \widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}; \qquad (4.10) $$

see Exercise 4.5 for a more general result. It is assumed here that the matrix inverse of X'X exists. If X'X is of less than full rank, the inverse can be replaced by a generalized inverse. Then OLS estimation still yields the optimal linear predictor of y given x if squared error loss is used, but many different linear combinations of x will yield this optimal predictor.

4.4.3. Identification

The OLS estimator can always be computed, provided that X'X is nonsingular. The more interesting issue is what β̂_OLS tells us about the data. We focus on the ability of the OLS estimator to permit identification (see Section 2.5) of the conditional mean E[y|X]. For the linear model the parameter β is identified if

1. E[y|X] = Xβ and
2. Xβ⁽¹⁾ = Xβ⁽²⁾ if and only if β⁽¹⁾ = β⁽²⁾.
The first condition, that the conditional mean is correctly specified, ensures that β is of intrinsic interest; the second assumption implies that X'X is nonsingular, which is the same condition needed to compute the unique OLS estimate (4.10).

4.4.4. Distribution of the OLS Estimator

We focus on the asymptotic properties of the OLS estimator. Consistency is established and then the limit distribution is obtained by rescaling the OLS estimator. Statistical inference then requires consistent estimation of the variance matrix of the estimator. The analysis makes extensive use of asymptotic theory, which is summarized in Appendix A.

Consistency

The properties of an estimator depend on the process that actually generated the data, the data generating process (dgp). We assume the dgp is y = Xβ + u, so that the model (4.8) is correctly specified. In some places, notably Chapters 5 and 6 and Appendix A, the subscript 0 is added to β, so the dgp is y = Xβ₀ + u; see Section 5.2.3 for discussion. Then

$$ \widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}, $$

and the OLS estimator can be expressed as

$$ \widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}. \qquad (4.11) $$

To prove consistency we rewrite (4.11) as

$$ \widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \boldsymbol{\beta} + \left(N^{-1}\mathbf{X}'\mathbf{X}\right)^{-1} N^{-1}\mathbf{X}'\mathbf{u}. \qquad (4.12) $$

The reason for the renormalization on the right-hand side is that N⁻¹X'X = N⁻¹Σᵢ xᵢxᵢ' is an average that converges in probability to a finite nonzero matrix if xᵢ satisfies assumptions that permit a law of large numbers to be applied to xᵢxᵢ' (see Section 4.4.8 for detail). Then

$$ \operatorname{plim} \widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \boldsymbol{\beta} + \left(\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X}\right)^{-1} \left(\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{u}\right), $$

using Slutsky's Theorem (Theorem A.3). The OLS estimator is consistent for β (i.e., plim β̂_OLS = β) if

$$ \operatorname{plim} N^{-1}\mathbf{X}'\mathbf{u} = \mathbf{0}. \qquad (4.13) $$

If a law of large numbers can be applied to the average N⁻¹X'u = N⁻¹Σᵢ xᵢuᵢ, then a necessary condition for (4.13) to hold is that E[xᵢuᵢ] = 0.
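The algebra above can be mirrored numerically. The sketch below (Python with NumPy; the simulated design and parameter values are hypothetical) computes β̂_OLS = (X'X)⁻¹X'y and checks that the OLS residuals are orthogonal to the regressors in sample, the finite-sample analogue of the condition E[xᵢuᵢ] = 0 just discussed.

```python
import numpy as np

# Compute the OLS estimator (4.10) on simulated data and verify the sample
# orthogonality condition X'u_hat = 0, the analogue of E[x_i u_i] = 0.
rng = np.random.default_rng(3)
n, k = 1_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # intercept plus two regressors
beta = np.array([1.0, 2.0, -0.5])                                # hypothetical true parameters
u = rng.normal(size=n) * (1.0 + np.abs(X[:, 1]))                 # conditionally heteroskedastic errors
y = X @ beta + u

# beta_hat = (X'X)^{-1} X'y; solving the normal equations avoids an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

print(beta_hat)        # close to (1, 2, -0.5)
print(X.T @ u_hat)     # numerically zero: residuals orthogonal to regressors by construction
```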
Limit Distribution

Given consistency, the limit distribution of β̂_OLS is degenerate with all the mass at β. To obtain the limit distribution we multiply β̂_OLS by √N, as this rescaling leads to a random variable that under standard cross-section assumptions has nonzero yet finite variance asymptotically. Then (4.11) becomes

$$ \sqrt{N}\left(\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} - \boldsymbol{\beta}\right) = \left(N^{-1}\mathbf{X}'\mathbf{X}\right)^{-1} N^{-1/2}\mathbf{X}'\mathbf{u}. \qquad (4.14) $$

The proof of consistency assumed that plim N⁻¹X'X exists and is finite and nonzero. We assume that a central limit theorem can be applied to N^{-1/2}X'u to yield a multivariate normal limit distribution with finite, nonsingular covariance matrix. Applying the product rule for limit normal distributions (Theorem A.17) implies that the product on the right-hand side of (4.14) has a limit normal distribution. Details are provided in Section 4.4.8. This leads to the following proposition, which permits regressors to be stochastic and does not restrict model errors to be homoskedastic and uncorrelated.

Proposition 4.1 (Distribution of OLS Estimator). Make the following assumptions:

(i) The dgp is model (4.8), that is, y = Xβ + u.
(ii) Data are independent over i with E[u|X] = 0 and E[uu'|X] = Ω = Diag[σᵢ²].
(iii) The matrix X is of full rank, so that Xβ⁽¹⁾ = Xβ⁽²⁾ iff β⁽¹⁾ = β⁽²⁾.
(iv) The K × K matrix

$$ \mathbf{M}_{xx} = \operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X} = \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\mathbf{x}_i' = \lim \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}[\mathbf{x}_i\mathbf{x}_i'] \qquad (4.15) $$

exists and is finite nonsingular.
(v) The K × 1 vector N^{-1/2}X'u = N^{-1/2}Σᵢ xᵢuᵢ converges in distribution to N[0, M_{xΩx}], where

$$ \mathbf{M}_{x\Omega x} = \operatorname{plim} N^{-1}\mathbf{X}'\mathbf{u}\mathbf{u}'\mathbf{X} = \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} u_i^2\mathbf{x}_i\mathbf{x}_i' = \lim \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}[u_i^2\mathbf{x}_i\mathbf{x}_i']. \qquad (4.16) $$

Then the OLS estimator β̂_OLS defined in (4.10) is consistent for β and

$$ \sqrt{N}\left(\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} - \boldsymbol{\beta}\right) \overset{d}{\to} \mathcal{N}\left[\mathbf{0},\; \mathbf{M}_{xx}^{-1}\mathbf{M}_{x\Omega x}\mathbf{M}_{xx}^{-1}\right]. \qquad (4.17) $$

Assumption (i) is used to obtain (4.11). Assumption (ii) ensures E[y|X] = Xβ and permits heteroskedastic errors with variance σᵢ², more general than the homoskedastic uncorrelated errors that restrict Ω = σ²I. Assumption (iii) rules out perfect collinearity among the regressors. Assumption (iv) leads to the rescaling of X'X by N⁻¹ in (4.12) and (4.14). Note that by a law of large numbers plim = lim E (see Appendix Section A.3). The essential condition for consistency is (4.13). Rather than directly assume this, we have used the stronger assumption (v), which is needed to obtain result (4.17).
Given that N^{-1/2}X'u has a limit distribution with zero mean and finite variance, multiplication by N^{-1/2} yields a random variable that converges in probability to zero, and so (4.13) holds as desired. Assumption (v) is required, along with assumption (iv), to obtain the limit normal result (4.17), which by Theorem A.17 then follows immediately from (4.14). More primitive assumptions on uᵢ and xᵢ that ensure (iv) and (v) are satisfied are given in Section 4.4.6, with formal proof in Section 4.4.8.

Asymptotic Distribution

Proposition 4.1 gives the limit distribution of √N(β̂_OLS − β), a rescaling of β̂_OLS. Many practitioners prefer to see asymptotic results written directly in terms of the distribution of β̂_OLS, in which case the distribution is called an asymptotic distribution. This asymptotic distribution is interpreted as being applicable in large samples, meaning samples large enough for the limit distribution to be a good approximation but not so large that β̂_OLS has converged in probability to β, as then its asymptotic distribution would be degenerate. The discussion mirrors that in Appendix A.6.4.

The asymptotic distribution is obtained from (4.17) by division by √N and addition of β. This yields the asymptotic distribution

$$ \widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} \overset{a}{\sim} \mathcal{N}\left[\boldsymbol{\beta},\; N^{-1}\mathbf{M}_{xx}^{-1}\mathbf{M}_{x\Omega x}\mathbf{M}_{xx}^{-1}\right], \qquad (4.18) $$

where the symbol ∼ᵃ means "is asymptotically distributed as." The variance matrix in (4.18) is called the asymptotic variance matrix of β̂_OLS and is denoted V[β̂_OLS]. Even simpler notation drops the limits and expectations in the definitions of M_xx and M_{xΩx}, and the asymptotic distribution is denoted

$$ \widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} \overset{a}{\sim} \mathcal{N}\left[\boldsymbol{\beta},\; (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\right], \qquad (4.19) $$

and V[β̂_OLS] is defined to be the variance matrix in (4.19).

We use both (4.18) and (4.19) to represent the asymptotic distribution in later chapters. Their use is for convenience of presentation. Formal asymptotic results for statistical inference are based on the limit distribution rather than the asymptotic distribution.

For implementation, the matrices M_xx and M_{xΩx} in (4.17) or (4.18) are replaced by consistent estimates M̂_xx and M̂_{xΩx}. Then the estimated asymptotic variance matrix of β̂_OLS is

$$ \widehat{\mathrm{V}}\!\left[\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}\right] = N^{-1}\widehat{\mathbf{M}}_{xx}^{-1}\widehat{\mathbf{M}}_{x\Omega x}\widehat{\mathbf{M}}_{xx}^{-1}. \qquad (4.20) $$

This estimate is called a sandwich estimate, with M̂_{xΩx} sandwiched between M̂_xx⁻¹ and M̂_xx⁻¹.

4.4.5. Heteroskedasticity-Robust Standard Errors for OLS

The obvious choice for M̂_xx in (4.20) is N⁻¹X'X. Estimation of M_{xΩx} defined in (4.16) depends on assumptions made about the error term. In microeconometrics applications the model errors are often conditionally heteroskedastic, with V[uᵢ|xᵢ] = E[uᵢ²|xᵢ] = σᵢ² varying over i. White (1980a) proposed
using M̂_{xΩx} = N⁻¹Σᵢ ûᵢ²xᵢxᵢ'. This estimate requires additional assumptions given in Section 4.4.8. Combining these estimates M̂_xx and M̂_{xΩx} and simplifying yields the estimated asymptotic variance matrix

$$ \widehat{\mathrm{V}}\!\left[\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}\right] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\widehat{\boldsymbol{\Omega}}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \left(\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{N}\widehat{u}_i^2\mathbf{x}_i\mathbf{x}_i'\right)\left(\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}, \qquad (4.21) $$

where Ω̂ = Diag[ûᵢ²] and ûᵢ = yᵢ − xᵢ'β̂ is the OLS residual. This estimate, due to White (1980a), is called the heteroskedastic-consistent estimate of the asymptotic variance matrix of the OLS estimator, and it leads to standard errors that are called heteroskedasticity-robust standard errors, or even more simply robust standard errors. It provides a consistent estimate of V[β̂_OLS] even though ûᵢ² is not consistent for σᵢ².

In introductory courses the errors are restricted to be homoskedastic. Then Ω = σ²I, so that X'ΩX = σ²X'X and hence M_{xΩx} = σ²M_xx. The limit distribution variance matrix in (4.17) simplifies to σ²M_xx⁻¹, and many computer packages instead use what is sometimes called the default OLS variance estimate

$$ \widehat{\mathrm{V}}\!\left[\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}\right] = s^2(\mathbf{X}'\mathbf{X})^{-1}, \qquad (4.22) $$

where s² = (N − K)⁻¹Σᵢ ûᵢ². Inference based on (4.22) rather than (4.21) is invalid unless errors are homoskedastic and uncorrelated. In general the erroneous use of (4.22) when errors are heteroskedastic, as is often the case for cross-section data, can lead to either inflation or deflation of the true standard errors.

In practice M̂_{xΩx} is calculated using division by (N − K), rather than by N, to be consistent with the similar division in forming s² in the homoskedastic case. Then V̂[β̂_OLS] in (4.21) is multiplied by N/(N − K). With heteroskedastic errors there is no theoretical basis for this adjustment for degrees of freedom, but some simulation studies provide support (see MacKinnon and White, 1985, and Long and Ervin, 2000).

Microeconometric analysis uses robust standard errors wherever possible. Here the standard errors are robust to heteroskedasticity. Guarding against other misspecifications may also be warranted. In particular, when data are clustered the standard errors should additionally be robust to clustering; see Sections 21.2.3 and 24.5.

4.4.6. Assumptions for Cross-Section Regression

Proposition 4.1 is a quite generic theorem that relies on assumptions about N⁻¹X'X and N^{-1/2}X'u. In practice these assumptions are verified by application of laws of large numbers and central limit theorems to averages of xᵢxᵢ' and xᵢuᵢ. These in turn require assumptions about how the observations xᵢ and errors uᵢ are generated, and consequently how yᵢ defined in (4.7) is generated. The assumptions are referred to collectively as assumptions regarding the data-generating process (dgp). A simple pedagogical example is given in Exercise 4.4.
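A minimal implementation of the sandwich estimate (4.21), including the N/(N − K) degrees-of-freedom adjustment discussed above, might look as follows (Python with NumPy; the function name is ours and the simulated data and parameter values are hypothetical). It also reports the default estimate (4.22) for comparison.

```python
import numpy as np

def ols_robust_se(X, y):
    """OLS with White heteroskedasticity-robust standard errors, following (4.21)
    with the N / (N - K) degrees-of-freedom adjustment, and the default (4.22)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    u_hat = y - X @ beta_hat
    meat = (X * u_hat[:, None] ** 2).T @ X            # sum_i u_hat_i^2 x_i x_i'
    V_robust = n / (n - k) * XtX_inv @ meat @ XtX_inv # sandwich estimate (4.21), adjusted
    V_default = (u_hat @ u_hat) / (n - k) * XtX_inv   # s^2 (X'X)^{-1}, valid only if homoskedastic
    return beta_hat, np.sqrt(np.diag(V_robust)), np.sqrt(np.diag(V_default))

# Simulated example in which the error variance rises with the regressor.
rng = np.random.default_rng(4)
n = 2_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * (0.5 + np.abs(x))
b, se_robust, se_default = ols_robust_se(X, y)
print(b)            # close to (1, 2)
print(se_robust)    # heteroskedasticity-robust standard errors
print(se_default)   # default standard errors, misleading here
```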
  • 100. LINEAR MODELS Our objective at this stage is to make assumptions that are appropriate in many ap- plied settings where cross-section data are used. The assumptions, are those in White (1980a), and include three important departures from those in introductory treatments. First, the regressors may be stochastic (Assumptions 1 and 3 that follow), so assump- tions on the error are made conditional on regressors. Second, the conditional variance of the error may vary across observations (Assumption 5). Third, the errors are not restricted to be normally distributed. Here are the assumptions: 1. The data (yi , xi ) are independent and not identically distributed (inid) over i. 2. The model is correctly specified so that yi = x i β+ui . 3. The regressor vector xi is possibly stochastic with finite second moment, additionally E[|xi j xik|1+δ ] ≤ ∞ for all j, k = 1, . . . , K for some δ 0, and the matrix Mxx defined in (4.15) exists and is a finite positive definite matrix of rank K. Also, X has rank K in the sample being analyzed. 4. The errors have zero mean, conditional on regressors E [ui |xi ] = 0. 5. The errors are heteroskedastic, conditional on regressors, with σ2 i = E u2 i |xi , Ω = E uu |X = Diag σ2 i , (4.23) where Ω is an N × N positive definite matrix. Also, for some δ 0, E[|u2 i |1+δ ] ∞. 6. The matrix MxΩx defined in (4.16) exists and is a finite positive definite matrix of rank K, where MxΩx = plim N−1 i u2 i xi x i given independence over i. Also, for some δ 0, E[|u2 i xi j xik|1+δ ] ∞ for all j, k = 1, . . . , K. 4.4.7. Remarks on Assumptions For completeness we provide a detailed discussion of each assumption, before proving the key results in the following section. Stratified Random Sampling Assumption 1 is one that is often implicitly made for cross-section data. Here we make it explicit. It restricts (yi , xi ) to be independent over i, but permits the distribution to differ over i. Many microeconometrics data sets come from stratified random sam- pling (see Section 3.2). Then the population is partitioned into strata and random draws are made within strata, but some strata are oversampled with the consequence that the sampled (yi , xi ) are inid rather than iid. If instead the data come from simple ran- dom sampling then (yi , xi ) are iid, a stronger assumption that is a special case of inid. Many introductory treatments assumed that regressors are fixed in repeated samples. 76
  • 101. 4.4. ORDINARY LEAST SQUARES Then (yi , xi ) are inid since only yi is random with a value that depends on the value of xi . The fixed regressors assumption is rarely appropriate for microeconometrics data, which are usually observational data. It is used instead for experimental data, where x is the treatment level. These different assumptions on the distribution of (yi , xi ) affect the particular laws of large numbers and central limit theorems used to obtain the asymptotic properties of the OLS estimator. Note that even if (yi , xi ) are iid, yi given xi is not iid since, for example, E[yi |xi ] = x i β varies with xi . Assumption 1 rules out most time-series data since they are dependent over obser- vations. It will also be violated if the sampling scheme involves clustering of observa- tions. The OLS estimator can still be consistent in these cases, provided Assumptions 2–4 hold, but usually it has a variance matrix different from that presented in this chapter. Correctly Specified Model Assumption 2 seems very obvious as it is an essential ingredient in the derivation of the OLS estimator. It still needs to be made explicitly, however, since β = (X X)−1 X y is a function of y and so its properties depend on y. If Assumption 2 holds then it is being assumed that the regression model is linear in x, rather than nonlinear, that there are no omitted variables in the regression, and that there is no measurement error in the regressors, as the regressors x used to calculate β are the same regressors x that are in the dgp. Also, the parameters β are the same across individuals, ruling out random parameter models. If Assumption 2 fails then OLS can only be interpreted as an optimal linear predic- tor; see Section 4.2.3. Stochastic Regressors Assumption 3 permits regressors to be stochastic regressors, as is usually the case when survey data rather than experimental data are used. It is assumed that in the limit the sample second-moment matrix is constant and nonsingular. If the regressors are iid, as is assumed under simple random sampling, then Mxx =E[xx ] and Assumption 3 can be reduced to an assumption that the second moment exists. If the regressors are stochastic but inid, as is the case for stratified random sampling, then we need the stronger Assumption 3, which permits applica- tion of the Markov LLN to obtain plim N−1 X X. If the regressors are fixed in repeated samples, the common less-satisfactory assumption made in introductory courses, then Mxx = lim N−1 X X and Assumption 3 becomes assumption that this limit exists. Weakly Exogenous Regressors Assumption 4 of zero conditional mean errors is crucial because when combined with Assumption 2 it implies that E[y|X] = Xβ, so that the conditional mean is indeed Xβ. 77
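The role of the zero conditional mean assumption can be seen in a small Monte Carlo experiment; the design below is entirely illustrative. With E[u|x] = 0 and a correctly specified linear mean, the OLS estimates are centered on β across replications even though the regressors are stochastic and the errors are conditionally heteroskedastic.

```python
import numpy as np

# Monte Carlo check: E[u|x] = 0 with a correctly specified linear mean delivers
# an OLS estimator centered on beta. All design choices here are arbitrary.
rng = np.random.default_rng(13)
R, N = 2000, 50
beta = np.array([1.0, 2.0])
estimates = np.empty((R, 2))

for r in range(R):
    x = rng.exponential(scale=1.0, size=N)          # stochastic, non-normal regressor
    X = np.column_stack([np.ones(N), x])
    u = (0.5 + x) * rng.normal(size=N)              # E[u|x] = 0, V[u|x] varies with x
    y = X @ beta + u
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print("mean of OLS estimates over replications:", estimates.mean(axis=0))  # close to beta
```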
  • 102. LINEAR MODELS The assumption that E[u|x] = 0 implies that Cov[x,u] = 0, so that the error is un- correlated with regressors. This follows as Cov[x,u] =E[xu]−E[x]E[u] and E[u|x] = 0 implies E[xu] = 0 and E[u] = 0 by the law of iterated expectations. The weaker assumption that Cov[x,u] = 0 can be sufficient for consistency of OLS, whereas the stronger assumption that E[u|x] = 0 is needed for unbiasedness of OLS. The economic meaning of Assumption 4 is that the error term represents all the excluded factors that are assumed to be uncorrelated with X and these have, on av- erage, zero impact on y. This is a key assumption that was referred to in Section 2.3 as the weak exogeneity assumption. Essentially this means that the knowledge of the data-generating process for X variables does not contribute useful information for es- timating β. When the assumption fails, one or more of the K regressor variables is said to be jointly dependent with y, or simply endogenous. A general term for cor- relation of regressors with errors is endogeneity or endogenous regressors, where the term “endogenous” means caused by factors inside the system. As we will show in Section 4.7, the violation of weak exogeneity may lead to inconsistent estimation. There are many ways in which weak exogeneity can be violated, but one of the most common involves a variable in x that is a choice or a decision variable that is related to y in a larger model. Ignoring these other relationships, and treating xi as if it were randomly assigned to observation i, and hence uncorrelated with ui , will have non- trivial consequences. Endogenous sampling is ruled out by Assumption 4. Instead, if data are collected by stratified random sampling it must be exogenous stratified sampling. Conditionally Heteroskedastic Errors Independent regression errors uncorrelated with regressors are assumed, a conse- quence of Assumptions 1, 2, and 4. Introductory courses usually further restrict at- tention to errors that are homoskedastic with homogeneous or constant variances, in which case σ2 i = σ2 for all i. Then the errors are iid (0, σ2 ) and are called spherical errors since Ω = σ2 I. Assumption 5 is instead one of conditionally heteroskedastic regression errors, where heteroskedastic means heterogeneous variances or different variances. The as- sumption is stated in terms of the second moment E[u2 |x], but this equals the vari- ance V[u|x] since E[u|x] = 0 by Assumption 4. This more general assumption of het- eroskedastic errors is made because empirically this is often the case for cross-section regression. Furthermore, relaxing the homoskedasticity assumption is not costly as it is possible to obtain valid standard errors for the OLS estimator even if the functional form for the heteroskedasticity is unknown. The term conditionally heteroskedastic is used for the following reason. Even if (yi , xi ) are iid, as is the case for simple random sampling, once we condition on xi the conditional mean and conditional variance can vary with xi . Similarly, the errors ui = yi − x i β are iid under simple random sampling, and they are therefore uncon- ditionally homoskedastic. Once we condition on xi , and consider the distribution of ui conditional on xi , the variance of this conditional distribution is permitted to vary with xi . 78
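The distinction between conditional and unconditional heteroskedasticity can be seen in a short simulation. In the sketch below, which uses an artificial dgp chosen only for illustration, the error u = xε is identically distributed across observations, and hence unconditionally homoskedastic, yet its variance conditional on x clearly varies with x.

```python
import numpy as np

# u = x * eps is iid across observations (unconditionally homoskedastic),
# but V[u | x] = x^2 V[eps] varies with x. Numbers are illustrative only.
rng = np.random.default_rng(2)
N = 200_000
x = rng.normal(0, 5, size=N)
eps = rng.normal(0, 2, size=N)
u = x * eps

print("unconditional Var[u]:", u.var())              # approximately 25 * 4 = 100
print("Var[u | |x| < 1]    :", u[np.abs(x) < 1].var())   # small
print("Var[u | |x| > 9]    :", u[np.abs(x) > 9].var())   # large
```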
  • 103. 4.4. ORDINARY LEAST SQUARES Limit Variance Matrix of N−1/2 X u Assumption 6 is needed to obtain the limit variance matrix of N−1/2 X u. If regressors are independent of the errors, a stronger assumption than that made in Assumption 4, then Assumption 5 that E[|u2 i |1+δ ] ∞ and Assumption 3 that E[|xi j xik|1+δ ] ∞ imply the Assumption 6 condition that E[|u2 i xi j xik|1+δ ] ∞. We have deliberately not made a seventh assumption, that the error u is normally distributed conditional on X. An assumption such as normality is needed to obtain the exact small-sample distribution of the OLS estimator. However, we focus on asymp- totic methods throughout this book, because exact small-sample distributional results are rarely available for the estimators used in microeconometrics, and then the normal- ity assumption is no longer needed. 4.4.8. Derivations for the OLS Estimator Here we present both small-sample and limit distributions of the OLS estimator and justify White’s estimator of the variance matrix of the OLS estimator under Assump- tions 1–6. Small-Sample Distribution The parameter β is identified under Assumptions 1–4 since then E[y|X] = Xβ and X has rank K. In small samples the OLS estimator is unbiased under Assumptions 1–4 and its vari- ance matrix is easily obtained given Assumption 5. These results are obtained by using the law of iterated expectations to first take expectation with respect to u conditional on X and then take the unconditional expectation. Then from (4.11) E[ βOLS] = β + EX,u (X X)−1 X u (4.24) = β + EX Eu|X (X X)−1 X u|X = β + EX (X X)−1 X Eu|X[u|X] = β, using the law of iterated expectations (Theorem A.23) and given Assumptions 1 and 4, which together imply that E[u|X] = 0. Similarly, (4.11) yields V[ βOLS] = EX[(X X)−1 X ΩX(X X) −1 ], (4.25) given Assumption 5, where E uu |X = Ω and we use Theorem A.23, which tells us that in general VX,u[g(X, u)] = EX[Vu|X[g(X, u)]] + VX[Eu|X[g(X, u)]]. This simplifies here as the second term is zero since Eu|X[(X X)−1 X u] = 0. The OLS estimator is therefore unbiased if E[u|X] = 0. This valuable property generally does not extend to nonlinear estimators. Most nonlinear estimators, such as nonlinear least squares, are biased and even linear estimators such as instrumental 79
  • 104. LINEAR MODELS variables estimators can be biased. The OLS estimator is inefficient, as its variance is not the smallest possible variance matrix among linear unbiased estimators, unless Ω = σ2 I. The inefficiency of OLS provides motivation for more efficient estimators such as generalized least squares, though the efficiency loss of OLS is not necessarily great. Under the additional assumption of normality of the errors conditional on X, an assumption not usually made in microeconometrics applications, the OLS estimator is normally distributed conditional on X. Consistency The term plim N−1 X X −1 = M−1 xx since plim N−1 X X = Mxx by Assumption 3. Consistency then requires that condition (4.13) holds. This is established using a law of large numbers applied to the average N−1 X u =N−1 i xi ui , which converges in probability to zero if E[xi ui ] = 0. Given Assumptions 1 and 2, the xi ui are inid and Assumptions 1–5 permit use of the Markov LLN (Theorem A.9). If Assumption 1 is simplified to (yi , xi ) iid then xi ui are iid and Assumptions 1–4 permit simpler use of the Kolmogorov LLN (Theorem A.8). Limit Distribution By Assumption 3, plim N−1 X X −1 = M−1 xx . The key is to obtain the limit distribu- tion of N−1/2 X u = N−1/2 i xi ui by application of a central limit theorem. Given Assumptions 1 and 2, the xi ui are inid and Assumptions 1–6 permit use of the Lia- pounov CLT (Theorem A.15). If assumption 1 is strengthened to (yi , xi ) iid then xi ui are iid and Assumptions 1–5 permit simpler use of the Lindeberg–Levy CLT (Theo- rem A.14). This yields 1 √ N X u d → N [0, MxΩx] , (4.26) where MxΩx = plim N−1 X uu X = plim N−1 i u2 i xi x i given independence over i. Application of a law of large numbers yields MxΩx = lim N−1 i Exi [σ2 i xi x i ], us- ing Eui ,xi [u2 i xi x i ] = Exi [E[u2 i |xi ]xi x i ] and σ2 i = E[u2 i |xi ]. It follows that MxΩx = lim N−1 E[X ΩX], where Ω = Diag[σ2 i ] and the expectation is with respect to only X, rather than both X and u. The presentation here assumes independence over i. More generally we can permit correlated observations. Then MxΩx = plim N−1 i j ui u j xi x j and Ω has i jth en- try σi j = Cov[ui , u j ]. This complication is deferred to treatment of the nonlinear LS estimator in Section 5.8. Heteroskedasticity-Robust Standard Errors We consider the key step of consistent estimation of MxΩx. Beginning with the original definition of MxΩx = plim N−1 N i=1 u2 i xi x i , we replace ui by ui = yi − x i β, where 80
asymptotically ûᵢ →p uᵢ since β̂ →p β. This yields the consistent estimate

M̂xΩx = N⁻¹ ∑ᵢ ûᵢ² xᵢxᵢ' = N⁻¹ X'Ω̂X,                        (4.27)

where Ω̂ = Diag[ûᵢ²]. The additional assumption that E[|x²ᵢⱼ xᵢₖ xᵢₗ|^(1+δ)] < ∞ for some δ > 0
  • 107. and j, k,l = 1, . . . , K is needed, as u2 i xi x i = (ui − x i ( β − β))2 xi x i involves up to the fourth power of xi (see White (1980a)). Note that Ω does not converge to the N × N matrix Ω, a seemingly impos- sible task without additional structure as there are N variances σ2 i to be esti- mated. But all that is needed is that N−1 X ΩX converges to the K × K matrix plim N−1 X ΩX =N−1 plim i σ2 i xi x i . This is easier to achieve because the number of regressors K is fixed. To understand White’s estimator, consider OLS estimation of the intercept-only model yi = β + ui with heteroskedastic error. Then in our notation we can show that β = ȳ, Mxx = lim N−1 i 1 = 1, and MxΩx = lim N−1 i E[u2 i ]. An obvious estimator for MxΩx is MxΩx = N−1 i u2 i , where ui = yi − β. To obtain the probability limit of this estimate, it is enough to consider N−1 i u2 i , since ui − ui p → 0 given β p → β. If a law of large numbers can be applied this average converges to the limit of its expected value, so plim N−1 i u2 i = lim N−1 i E[u2 i ] = MxΩx as desired. Eicker (1967) gave the formal conditions for this example. 4.5. Weighted Least Squares If robust standard errors need to be used efficiency gains are usually possible. For example, if heteroskedasticity is present then the feasible generalized least-squares (GLS) estimator is more efficient than the OLS estimator. In this section we present the feasible GLS estimator, an estimator that makes stronger distributional assumptions about the variance of the error term. It is nonethe- less possible to obtain standard errors of the feasible GLS estimator that are robust to misspecification of the error variance, just as in the OLS case. Many studies in microeconometrics do not take advantage of the potential efficiency gains of GLS, for reasons of convenience and because the efficiency gains may be felt to be relatively small. Instead, it is common to use less efficient weighted least-squares estimators, most notably OLS, with robust estimates of the standard errors. 4.5.1. GLS and Feasible GLS By the Gauss–Markov theorem, presented in introductory texts, the OLS estimator is efficient among linear unbiased estimators if the linear regression model errors are independent and homoskedastic. Instead, we assume that the error variance matrix Ω = σ2 I. If Ω is known and nonsingular, we can premultiply the linear regression model (4.8) by Ω−1/2 , where 81
  • 108. LINEAR MODELS Ω1/2 Ω1/2 = Ω, to yield Ω−1/2 y = Ω−1/2 Xβ + Ω−1/2 u. Some algebra yields V[Ω−1/2 u] = E[(Ω−1/2 u)(Ω−1/2 u) |X] = I. The errors in this transformed model are therefore zero mean, uncorrelated, and homoskedastic. So β can be efficiently estimated by OLS regression of Ω−1/2 y on Ω−1/2 X. This argument yields the generalized least-squares estimator βGLS = (X Ω−1 X)−1 X Ω−1 y. (4.28) The GLS estimator cannot be directly implemented because in practice Ω is not known. Instead, we specify that Ω = Ω(γ), where γ is a finite-dimensional parameter vector, obtain a consistent estimate γ of γ, and form Ω = Ω( γ). For example, if errors are heteroskedastic then specify V[u|x] = exp(z γ), where z is a subset of x and the exponential function is used to ensure a positive variance. Then γ can be consistently estimated by nonlinear least-squares regression (see Section 5.8) of the squared OLS residual u2 i = (y − x βOLS)2 on exp(z γ). This estimate Ω can be used in place of Ω in (4.28). Note that we cannot replace Ω in (4.28) by Ω = Diag[ u2 i ] as this yields an inconsistent estimator (see Section 5.8.6). The feasible generalized least-squares (FGLS) estimator is βFGLS = (X Ω −1 X)−1 X Ω −1 y. (4.29) If Assumptions 1–6 hold and Ω(γ) is correctly specified, a strong assumption that is relaxed in the following, and γ is consistent for γ, it can be shown that √ N( βFGLS − β) d → N 0, plim N−1 X Ω −1 X −1 . (4.30) The FGLS estimator has the same limiting variance matrix as the GLS estimator and so is second-moment efficient. For implementation replace Ω by Ω in (4.30). It can be shown that the GLS estimator minimizes u Ω−1 u, see Exercise 4.5, which simplifies to i u2 i /σ2 i if errors are heteroskedastic but uncorrelated. The motivation provided for GLS was efficient estimation of β. In terms of the Section 4.2 discussion of loss function and optimal prediction, with heteroskedastic errors the loss function is L(e) = e2 /σ2 . Compared to OLS with L(e) = e2 , the GLS loss function places a rel- atively smaller penalty on the prediction error for observations with large conditional error variance. 4.5.2. Weighted Least Squares The result in (4.30) assumes correct specification of the error variance matrix Ω(γ). If instead Ω(γ) is misspecified then the FGLS estimator is still consistent, but (4.30) gives the wrong variance. Fortunately, a robust estimate of the variance of the GLS estimator can be found even if Ω(γ) is misspecified. Specifically, define Σ = Σ(γ) to be a working variance matrix that does not nec- essarily equal the true variance matrix Ω = E[uu |X]. Form an estimate Σ = Σ( γ), 82
  • 109. 4.5. WEIGHTED LEAST SQUARES Table 4.2. Least-Squares Estimators and Their Asymptotic Variance Estimatora Definition Estimated Asymptotic Variance OLS β = X X −1 X y X X −1 X ΩX X X −1 FGLS β = (X Ω −1 X)−1 X Ω −1 y (X Ω −1 X)−1 WLS β = (X Σ −1 X)−1 X Σ −1 y (X Σ −1 X)−1 X Σ −1 Ω Σ −1 X(X Σ −1 X)−1 . a Estimators are for linear regression model with error conditional variance matrix . For FGLS it is assumed that is consistent for . For OLS and WLS the heteroskedastic robust variance matrix of β uses equal to a diagonal matrix with squared residuals on the diagonals. where γ is an estimate of γ. Then use weighted least squares with weighting ma- trix Σ −1 . This yields the weighted least-squares (WLS) estimator βWLS = (X Σ −1 X)−1 X Σ −1 y. (4.31) Statistical inference is then done without the assumption that Σ = Ω, the true variance matrix of the error term. In the statistics literature this approach is referred to as a working matrix approach. We call it weighted least squares, but be aware that others instead use weighted least squares to mean GLS or FGLS in the special case that Ω−1 is diagonal. Here there is no presumption that the weighting matrix Σ−1 = Ω−1 . Similar algebra to that for OLS given in Section 4.4.5 yields the estimated asymp- totic variance matrix V[ βWLS] = (X Σ −1 X)−1 X Σ −1 Ω Σ −1 X(X Σ −1 X)−1 , (4.32) where Ω is such that plim N−1 X Σ −1 Ω Σ −1 X = plim N−1 X Σ−1 ΩΣ−1 X. In the heteroskedastic case Ω = Diag[ u∗2 i ], where u∗ i = yi − x i βWLS. For heteroskedastic errors the basic approach is to choose a simple model for het- eroskedasticity such as error variance depending on only one or two key regressors. For example, in a linear regression model of the level of wages as a function of schooling and other variables, the heteroskedasticity might be modeled as a function of school- ing alone. Suppose this model yields Σ = Diag[ σ2 i ]. Then OLS regression of yi / σi on xi / σi (with the no-constant option) yields βWLS and the White robust standard errors from this regression can be shown to equal those based on (4.32). The weighted least-squares or working matrix approach is especially convenient when there is more than one complication. For example, in the random effects panel data model of Chapter 21 the errors may be viewed as both correlated over time for a given individual and heteroskedastic. One may use the random effects estimator, which controls only for the first complication, but then compute heteroskedastic-consistent standard errors for this estimator. The various least-squares estimators are summarized in Table 4.2. 83
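The WLS row of Table 4.2 can be implemented by the dividing-by-σ̂ᵢ device described above, with the robust sandwich variance (4.32) computed from residuals on the original scale. The following sketch is illustrative only: it uses a deliberately misspecified working model (σᵢ² proportional to |xᵢ|) for a dgp whose true conditional error variance is 4xᵢ², and the function and variable names are arbitrary.

```python
import numpy as np

def wls_robust(y, X, sigma2_work):
    """Weighted LS with working variances sigma2_work (need not be correct)
    and the robust sandwich variance estimate of equation (4.32)."""
    w = 1.0 / np.sqrt(sigma2_work)
    Xs, ys = X * w[:, None], y * w                 # divide each observation by sigma_i
    A_inv = np.linalg.inv(Xs.T @ Xs)               # (X' Sigma^{-1} X)^{-1}
    beta = A_inv @ Xs.T @ ys
    u = y - X @ beta                               # residuals on the original scale
    B = (Xs * ((u * w)[:, None] ** 2)).T @ Xs      # X' Sigma^{-1} diag(u^2) Sigma^{-1} X
    return beta, A_inv @ B @ A_inv

# Hypothetical design: true V[u|x] = 4 x^2, but the working model is sigma_i^2 = |x_i|.
rng = np.random.default_rng(3)
N = 100
x = rng.normal(0, 5, size=N)
y = 1 + x + x * rng.normal(0, 2, size=N)
X = np.column_stack([np.ones(N), x])

sigma2_work = np.maximum(np.abs(x), 1e-6)          # guard against near-zero weights
beta, V = wls_robust(y, X, sigma2_work)
print("WLS slope:", beta[1], " robust se:", np.sqrt(V[1, 1]))
```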
  • 110. LINEAR MODELS Table 4.3. Least Squares: Example with Conditionally Heteroskedastic Errorsa OLS WLS GLS Constant 2.213 1.060 0.996 (0.823) (0.150) (0.007) [0.820] [0.051] [0.006] x 0.979 0.957 0.952 (0.178) (0.190) (0.209) [0.275] [0.232] [0.208] R2 0.236 0.205 0.174 a Generated data for sample size of 100. OLS, WLS, and GLS are all consistent but OLS and WLS are inefficient. Two differ- ent standard errors are given: default standard errors assuming homoskedastic errors in parentheses and heteroskedastic-robust standard errors in square brackets. The data-generating process is given in the text. 4.5.3. Robust Standard Errors for LS Example As an example of robust standard error estimation, consider estimation of the standard error of least-squares estimates of the slope coefficient for a dgp with multiplicative heteroskedasticity y = 1 + 1 × x + u, u = xε, where the scalar regressor x ∼ N[0, 25] and ε ∼ N[0, 4]. The errors are conditionally heteroskedastic, since V[u|x] = V[xε|x] = x2 V[ε|x] = 4x2 , which depends on the regressor x. This differs from the unconditional variance, where V[u] = V[xε] = E[(xε)2 ] − (E[xε])2 = E[x2 ]E[ε2 ] = V[x]V[ε] = 100, given x and ε independent and the particular dgp here. Standard errors for the OLS estimator should be calculated using the heteroskedastic-consistent or robust variance estimate (4.21). Since OLS is not fully efficient, WLS may provide efficiency gains. GLS will definitely provide efficiency gains and in this simulated data example we have the advantage of knowing that V[u|x] = 4x2 . All estimation methods yield a consistent estimate of the intercept and slope coefficients. Various least-squares estimates and associated standard errors from a generated data sample of size 100 are given in Table 4.3. We focus on the slope coefficient. The OLS slope coefficient estimate is 0.979. Two standard error estimates are re- ported, with the correct heteroskedasticity-robust standard error of 0.275 using (4.21) much larger here than the incorrect estimate of 0.177 that uses s2 (X X)−1 . Such a large difference in standard error estimates could lead to quite different conclusions in statis- tical inference. In general the direction of bias in the standard errors could be in either direction. For this example it can be shown theoretically that, in the limit, the robust standard errors are √ 3 times larger than the incorrect one. Specifically, for this dgp 84
  • 111. 4.6. MEDIAN AND QUANTILE REGRESSION and for sample size N the correct and incorrect standard errors of the OLS estimate of the slope coefficient converge to, respectively, √ 12/N and √ 4/N. As an example of the WLS estimator, assume that u = √ |x|ε rather than u = xε, so that it is assumed that V[u] = σ2 |x|. The WLS estimator can be computed by OLS regression after dividing y, the intercept, and x by √ |x|. Since this is the wrong model for the heteroskedastic error the correct standard error for the slope coefficient is the robust estimate of 0.232, computed using (4.32). The GLS estimator for this dgp can be computed by OLS regression after dividing y, the intercept, and x by |x|, since the transformed error is then homoskedastic. The usual and robust standard errors for the slope coefficient are similar (0.209 and 0.208). This is expected as both are asymptotically correct because the GLS estimator here uses the correct model for heteroskedasticity. It can be shown theoretically that for this dgp the standard error of the GLS estimate of the slope coefficient converges to √ 4/N. Both OLS and WLS are less efficient than GLS, as expected, with standard errors for the slope coefficient of, respectively, 0.275 0.232 0.208. The setup in this example is a standard one used in estimation theory for cross- section data. Both y and x are stochastic random variables. The pair (yi , xi ) are inde- pendent over i and identically distributed, as is the case under random sampling. The conditional distribution of yi |xi differs over i, however, since the conditional mean and variance of yi depend on xi . 4.6. Median and Quantile Regression In an intercept-only model, summary statistics for the sample distribution include quantiles, such as the median, lower and upper quartiles, and percentiles, in addition to the sample mean. In the regression context we might similarly be interested in conditional quantiles. For example, interest may lie in how the percentiles of the earnings distribution for lowly educated workers are much more compressed than those for highly educated workers. In this simple example one can just do separate computations for lowly ed- ucated workers and for highly educated workers. However, this approach becomes infeasible if there are several regressors taking several values. Instead, quantile regres- sion methods are needed to estimate the quantiles of the conditional distribution of y given x. From Table 4.1, quantile regression corresponds to use of asymmetric absolute loss, whereas the special case of median regression uses absolute error loss. These methods provide an alternative to OLS, which uses squared error loss. Quantile regression methods have advantages beyond providing a richer charac- terization of the data. Median regression is more robust to outliers than least-squares regression. Moreover, quantile regression estimators can be consistent under weaker stochastic assumptions than possible with least-squares estimation. Leading examples are the maximum score estimator of Manski (1975) for binary outcome models (see Section 14.6) and the censored least absolute deviations estimator of Powell (1984) for censored models (see Section 16.6). 85
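The greater robustness of the median to outliers noted above is easily illustrated in the intercept-only case with simulated data.

```python
import numpy as np

# A single gross outlier moves the sample mean substantially but barely moves the median.
rng = np.random.default_rng(12)
y = rng.normal(size=99)
y_out = np.append(y, 1000.0)          # add one gross outlier

print("mean  :", y.mean(), "->", y_out.mean())
print("median:", np.median(y), "->", np.median(y_out))
```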
  • 112. LINEAR MODELS We begin with a brief explanation of population quantiles before turning to estima- tion of sample quantiles. 4.6.1. Population Quantiles For a continuous random variable y, the population qth quantile is that value µq such that y is less than or equal to µq with probability q. Thus q = Pr[y ≤ µq] = Fy(µq), where Fy is the cumulative distribution function (cdf) of y. For example, if µ0.75 = 3 then the probability that y ≤ 3 equals 0.75. It follows that µq = F−1 y (q). Leading examples are the median, q = 0.5, the upper quartile, q = 0.75, and the lower quartile, q = 0.25. For the standard normal distribution µ0.5 = 0.0, µ0.95 = 1.645, and µ0.975 = 1.960. The 100qth percentile is the qth quantile. For the regression model, the population qth quantile of y conditional on x is that function µq(x) such that y conditional on x is less than or equal to µq(x) with probability q, where the probability is evaluated using the conditional distribution of y given x. It follows that µq(x) = F−1 y|x (q), (4.33) where Fy|x is the conditional cdf of y given x and we have suppressed the role of the parameters of this distribution. It is insightful to derive the quantile function µq(x) if the dgp is assumed to be the linear model with multiplicative heteroskedasticity y = x β + u, u = x α × ε, ε ∼ iid [0, σ2 ], where it is assumed that x α 0. Then the population qth quantile of y conditional on x is that function µq(x, β, α) such that q = Pr[y ≤ µq(x, β, α)] = Pr u ≤ µq(x, β, α) − x β = Pr ε ≤ [µq(x, β, α) − x β]/x α = Fε [µq(x, β, α) − x β]/x α , where we use u = y − x β and ε = u/x α, and Fε is the cdf of ε. It follows that [µq(x, β, α) − x β]/x α = F−1 ε (q) so that µq(x, β, α) = x β + x α × F−1 ε (q) = x β + α × F−1 ε (q) . 86
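This formula can be checked numerically. The sketch below uses hypothetical parameter values and takes ε to be standard normal, so that F⁻¹ε is the standard normal quantile function; it compares the simulated conditional quantile of y at a fixed value of x with the analytic expression just derived.

```python
import numpy as np
from scipy.stats import norm

# Check mu_q(x) = x'beta + x'alpha * F_eps^{-1}(q) by simulation for a
# hypothetical dgp with multiplicative heteroskedasticity.
rng = np.random.default_rng(5)
beta0, beta1 = 1.0, 2.0          # x'beta with x = (1, x1)
alpha0, alpha1 = 1.0, 0.5        # x'alpha > 0 is required
q = 0.75
x1 = 2.0                         # evaluate the conditional quantile at x1 = 2

eps = rng.normal(0, 1, size=1_000_000)
y = (beta0 + beta1 * x1) + (alpha0 + alpha1 * x1) * eps

analytic = (beta0 + beta1 * x1) + (alpha0 + alpha1 * x1) * norm.ppf(q)
print("simulated q-quantile:", np.quantile(y, q))
print("analytic  mu_q(x)   :", analytic)
```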
  • 113. 4.6. MEDIAN AND QUANTILE REGRESSION Thus for the linear model with multiplicative heteroskedasticity of the form u = x α × ε the conditional quantiles are linear in x. In the special case of homoskedasticity, x α equals a constant and all conditional quantiles have the same slope and differ only in their intercept, which becomes larger as q increases. In more general examples the quantile function may be nonlinear in x, owing to other forms of heteroskedasticity such as u = h(x, α) × ε, where h(·) is nonlinear in x, or because the regression function itself is of nonlinear form g(x, β). It is standard to still estimate quantile functions that are linear and interpret them as the best lin- ear predictor under the quantile regression loss function given in (4.34) in the next section. 4.6.2. Sample Quantiles For univariate random variable y the usual way to obtain the sample quantile estimate is to first order the sample. Then µq equals the [Nq]th smallest value, where N is the sample size and [Nq] denotes Nq rounded up to the nearest integer. For example, if N = 97, the lower quartile is the 25th observation since [97 × 0.25] = [24.25] = 25. Koenker and Bassett (1978) observed that the sample qth quantile µq can equiv- alently be expressed as the solution to the optimization problem of minimizing with respect to β N i:yi ≥β q|yi − β| + N i:yi β (1 − q)|yi − β|. This result is not obvious. To gain some understanding, consider the median, where q = 0.5. Then the median is the minimum of i |yi − β|. Suppose in a sample of 99 observations that the 50th smallest observation, the median, equals 10 and the 51st smallest observation equals 12. If we let β equal 12 rather than 10, then i |yi − β| will increase by 2 for the first 50 ordered observations and decrease by 2 for the remaining 49 observations, leading to an overall net increase of 50 × 2 − 49 × 2 = 2. So the 51st smallest observation is a worse choice than the 50th. Simi- larly the 49th smallest observation can be shown to be a worse choice than the 50th observation. This objective function is then readily expanded to the linear regression case, so that the qth quantile regression estimator βq minimizes over βq QN (βq) = N i:yi ≥x i β q|yi − x i βq| + N i:yi x i β (1 − q)|yi − x i βq|, (4.34) where we use βq rather than β to make clear that different choices of q estimate different values of β. Note that this is the asymmetric absolute loss function given in Table 4.1, where y is restricted to be linear in x so that e = y − x βq. The special case q = 0.5 is called the median regression estimator or the least absolute deviations estimator. 87
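The Koenker and Bassett result can be verified numerically: in the intercept-only case the minimizer of the asymmetric absolute loss coincides with the [Nq]th order statistic. The short sketch below uses simulated data with N = 97 and q = 0.25, matching the example above.

```python
import numpy as np

def check_loss(y, b, q):
    """Asymmetric absolute (check) loss summed over the sample,
    as in (4.34) for an intercept-only model."""
    e = y - b
    return np.sum(np.where(e >= 0, q * e, (q - 1) * e))

rng = np.random.default_rng(6)
y = rng.normal(size=97)
q = 0.25

# The loss is piecewise linear and convex, so its minimum is attained at a data point.
candidates = np.sort(y)
losses = [check_loss(y, b, q) for b in candidates]
b_star = candidates[int(np.argmin(losses))]

print("check-loss minimizer  :", b_star)
print("[Nq]-th smallest value:", candidates[int(np.ceil(len(y) * q)) - 1])  # 25th of 97
```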
  • 114. LINEAR MODELS 4.6.3. Properties of Quantile Regression Estimators The objective function (4.34) is not differentiable and so the gradient optimization methods presented in Chapter 10 are not applicable. Fortunately, linear programming methods can be used and these provide relatively fast computation of βq. Since there is no explicit solution for βq the asymptotic distribution of βq cannot be obtained using the approach of Section 4.4 for OLS. The methods of Chapter 5 also require adaptation, as the objective function is nondifferentiable. It can be shown that √ N( βq − βq) d → N 0, A−1 BA−1 , (4.35) (see, for example, Buchinsky, 1998, p. 85), where A = plim 1 N N i=1 fuq (0|xi )xi x i , (4.36) B = plim 1 N N i=1 q(1 − q)xi x i , and fuq (0|x) is the conditional density of the error term uq = y − x βq evaluated at uq = 0. Estimation of the variance of βq is complicated by the need to estimate fuq (0|x). It is easier to instead obtain standard errors for βq using the bootstrap pairs procedure of Chapter 11. 4.6.4. Quantile Regression Example In this section we perform conditional quantile estimation and compare it with the usual conditional mean estimation using OLS regression. The application involves En- gel curve estimation for household annual medical expenditure. More specifically, we consider the regression relationship between the log of medical expenditure and the log of total household expenditure. This regression yields an estimate of the (constant) elasticity of medical expenditure with respect to total expenditure. The data are from the World Bank’s 1997 Vietnam Living Standards Survey. The sample consists of 5,006 households that have positive level of medical expenditures, after dropping 16.6% of the sample that has zero expenditures to permit taking the natural logarithm. Zero values can be handled using the censored quantile regression methods of Powell (1986a), presented in Section 16.9.2. For simplicity we simply dropped observations with zero expenditures. The largest component of medical ex- penditure, especially at low levels of income, consists of medications purchased from pharmacies. Although several household characteristic variables are available, for sim- plicity we only consider one regressor, the log of total household expenditure, to serve as a proxy for household income. The linear least-squares regression yields an elasticity estimate of 0.57. This esti- mate would be usually interpreted to mean that medicines are a “necessity” and hence their demand is income inelastic. This estimate is not very surprising, but before ac- cepting it at face value we should acknowledge that there may be considerable hetero- geneity in the elasticity across different income groups. 88
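Quantile regression estimates and bootstrap-pairs standard errors of the kind used in this example can be computed as in the following sketch. It uses simulated data rather than the Vietnam survey, and statsmodels' QuantReg is only one readily available implementation of the estimator.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
N = 1000
x = rng.normal(size=N)
y = 1 + 0.5 * x + (1 + 0.3 * x**2) * rng.normal(size=N)   # simulated, not the Vietnam data
X = sm.add_constant(x)

def qreg_slope(y, X, q):
    return sm.QuantReg(y, X).fit(q=q).params[1]

q = 0.5
slope = qreg_slope(y, X, q)

# Bootstrap-pairs standard error: resample (y_i, x_i) jointly and re-estimate.
B = 50
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)
    boot[b] = qreg_slope(y[idx], X[idx], q)

print(f"q={q}: slope = {slope:.3f}, bootstrap se = {boot.std(ddof=1):.3f}")
```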
  • 115. 4.6. MEDIAN AND QUANTILE REGRESSION Figure 4.1: Quantile regression estimates of slope coefficient for q = 0.05, 0.10, . . . , 0.90, 0.95 and associated 95% confidence bands plotted against q from regression of the natural logarithm of medical expenditure on the natural logarithm of total expenditure. Quantile regression is a useful tool for studying such heterogeneity, as emphasized by Koenker and Hallock (2001). We minimize the quantity (4.34), where y is log of medical expenditure and x β = β1 + β2x, where x is log of total household expendi- ture. This is done for the nineteen quantile values q = {0.05, 0.10, ..., 0.95} , where q = 0.5 is the median. In each case the standard errors were estimated using the boot- strap method with 50 resamples. The results of this exercise are condensed into Fig- ures 4.1 and 4.2. Figure 4.1 plots the slope coefficient β2,q for the different values of q, along with the associated 95% confidence interval. This shows how the quantile estimates of the elasticity varies with quantile value q. The elasticity estimate increases systematically with the level of household income, rising from 0.15 for q = 0.05 to a maximum of 0.80 for q = 0.85. The least-squares slope estimate of 0.57 is also presented as a hori- zontal line that does not vary with quantile. The elasticity estimates at lower and higher quantiles are clearly statistically significantly different from each other and from the OLS estimate, which has standard error 0.032. It seems that the aggregate elasticity es- timate will vary according to changes in the underlying income distribution. This graph supports the observation of Mosteller and Tukey (1977, p. 236), quoted by Koenker and Hallock (2001), that by focusing only on the conditional mean function the least- squares regression gives an incomplete summary of the joint distribution of dependent and explanatory variables. Figure 4.2 superimposes three estimated quantile regression lines yq = β1,q + β2,q x for q = 0.1, 0.2, . . . , 0.9 and the OLS regression line. The OLS regression line, not graphed, is similar to the median (q = 0.5) regression line. There is a fanning out of the quantile regression lines in Figure 4.2. This is not surprising given the increase 89
in estimated slopes as q increases as evident in Figure 4.1. Koenker and Bassett (1982) developed quantile regression as a means to test for heteroskedastic errors when the dgp is the linear model. For such a case a fanning out of the quantile regression lines is interpreted as evidence of heteroskedasticity. Another interpretation is that the conditional mean is nonlinear in x with increasing slope and this leads to quantile slope coefficients that increase with quantile q. More detailed illustrations of quantile regression are given in Buchinsky (1994) and Koenker and Hallock (2001).

Figure 4.2: Quantile regression estimated lines for q = 0.1, q = 0.5, and q = 0.9 from regression of natural logarithm of medical expenditure on natural logarithm of total expenditure. Data for 5,006 Vietnamese households with positive medical expenditures in 1997.

4.7. Model Misspecification

The term "model misspecification" in its broadest sense means that one or more of the assumptions made on the data-generating process are incorrect. Misspecifications may occur individually or in combination, but analysis is simpler if only the consequences of a single misspecification are considered.

In the following discussion we emphasize misspecifications that lead to inconsistency of the least-squares estimator and loss of identifiability of parameters of interest. The least-squares estimator may nonetheless continue to have a meaningful interpretation, only one different from that intended under the assumption of a correctly specified model. Specifically, the estimator may converge asymptotically to a parameter that differs from the true population value, a concept defined in Section 4.7.5 as the pseudo-true value.

The issues raised here for consistency of OLS are relevant to other estimators in other models. Consistency can then require stronger assumptions than those needed
  • 117. 4.7. MODEL MISSPECIFICATION for consistency of OLS, so that inconsistency resulting from model misspecification is more likely. 4.7.1. Inconsistency of OLS The most serious consequence of a model misspecification is inconsistent estimation of the regression parameters β. From Section 4.4, the two key conditions needed to demonstrate consistency of the OLS estimator are (1) the dgp is y = Xβ + u and (2) the dgp is such that plim N−1 X u = 0. Then βOLS = β + N−1 X X −1 N−1 X u p → β, (4.37) where the first equality follows if y = Xβ + u (see (4.12)) and the second line uses plim N−1 X u = 0. The OLS estimator is likely to be inconsistent if model misspecification leads to either specification of the wrong model for y, so that condition 1 is violated, or corre- lation of regressors with the error, so that condition 2 is violated. 4.7.2. Functional Form Misspecification A linear specification of the conditional mean function is merely an approximation in RK to the true unknown conditional mean function in parameter space of indeterminate dimension. Even if the correct regressors are chosen, it is possible that the conditional mean is incorrectly specified. Suppose the dgp is one with a nonlinear regression function y = g(x) + v, where the dependence of g(x) on unknown parameters is suppressed, and assume E[v|x] = 0. The linear regression model y = x β + u is erroneously specified. The question is whether the OLS estimator can be given any meaningful interpretation, even though the dgp is in fact nonlinear. The usual way to interpret regression coefficients is through the true micro relation- ship, which here is E[yi |xi ] = g(xi ). In this case βOLS does not measure the micro response of E[yi |xi ] to a change in xi , as it does not converge to ∂g(xi )/∂xi . So the usual interpretation of βOLS is not possible. White (1980b) showed that the OLS estimator converges to that value of β that minimizes the mean-squared prediction error Ex[(g(x) − x β)2 ]. 91
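This property is easy to see in a simulation. In the sketch below, which uses a hypothetical nonlinear dgp with g(x) = exp(x) chosen only for illustration, the OLS slope approaches Cov[x, g(x)]/V[x], the slope of the best linear predictor, which differs from the pointwise derivative dg(x)/dx.

```python
import numpy as np

# If the true conditional mean is nonlinear, g(x) = exp(x) here, OLS of y on (1, x)
# estimates the best linear predictor of g(x), whose slope is Cov[x, g(x)] / V[x],
# not the pointwise derivative dg(x)/dx.
rng = np.random.default_rng(8)
N = 200_000
x = rng.normal(size=N)
y = np.exp(x) + rng.normal(size=N)                # E[y|x] = g(x) = exp(x)

X = np.column_stack([np.ones(N), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

blp_slope = np.cov(x, np.exp(x))[0, 1] / np.var(x, ddof=1)
print("OLS slope             :", b_ols[1])
print("best-linear-pred slope:", blp_slope)
print("dg/dx at x = 1        :", np.exp(1.0))     # pointwise derivative, differs from the above
```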
  • 118. LINEAR MODELS Hence prediction from OLS is the best linear predictor of the nonlinear regression function if the mean-squared error is used as the loss function. This useful property has already been noted in Section 4.2.3, but it adds little in interpretation of βOLS. In summary, if the true regression function is nonlinear, OLS is not useful for indi- vidual prediction. OLS can still be useful for prediction of aggregate changes, giving the sample average change in E[y|x] due to change in x (see Stoker, 1982). However, microeconometric analyses usually seek models that are meaningful at the individual level. Much of this book presents alternatives to the linear model that are more likely to be correctly specified. For example, Chapter 14 on binary outcomes presents model specifications that ensure that predicted probabilities are restricted to lie between 0 and 1. Also, models and methods that rely on minimal distributional assumptions are preferred because there is then less scope for misspecification. 4.7.3. Endogeneity Endogeneity is formally defined in Section 2.3. A broad definition is that a regressor is endogenous when it is correlated with the error term. If any one regressor is en- dogenous then in general OLS estimates of all regression parameters are inconsistent (unless the exogenous regressor is uncorrelated with the endogenous regressor). Leading examples of endogeneity, dealt with extensively in this book in both linear and nonlinear model settings, include simultaneous equations bias (Section 2.4), omit- ted variable bias (Section 4.7.4), sample selection bias (Section 16.5), and measure- ment error bias (Chapter 26). Endogeneity is quite likely to occur when cross-section observational data are used, and economists are very concerned with this complication. A quite general approach to control for endogeneity is the instrumental variables method, presented in Sections 4.8 and 4.9 and in Sections 6.4 and 6.5. This method cannot always be applied, however, as necessary instruments may not be available. Other methods to control for endogeneity, reviewed in Section 2.8, include con- trol for confounding variables, differences in differences if repeated cross-section or panel data are available (see Chapter 21), fixed effects if panel data are available and endogeneity arises owing to a time-invariant omitted variable (see Section 21.6), and regression-discontinuity design (see Section 25.6). 4.7.4. Omitted Variables Omission of a variable in a linear regression equation is often the first example of inconsistency of OLS presented in introductory courses. Such omission may be the consequence of an erroneous exclusion of a variable for which data are available or of exclusion of a variable that is not directly observed. For example, omission of ability in a regression of earnings (or more usually its natural logarithm) on schooling is usually due to unavailability of a comprehensive measure of ability. Let the true dgp be y = x β + zα + v, (4.38) 92
  • 119. 4.7. MODEL MISSPECIFICATION where x and z are regressors, with z a scalar regressor for simplicity, and v is an error term that is assumed to be uncorrelated with the regressors x and z. OLS estimation of y on x and z will yield consistent parameter estimates of β and α. Suppose instead that y is regressed on x alone, with z omitted owing to unavailabil- ity. Then the term zα is moved into the error term. The estimated model is y = x β + (zα + v), (4.39) where the error term is now (zα + v). As before v is uncorrelated with x, but if z is correlated with x the error term (zα + v) will be correlated with the regressors x. The OLS estimator will be inconsistent for β if z is correlated with x. There is enough structure in this example to determine the direction of the inconsis- tency. Stacking all observations in an obvious manner gives the dgp y = Xβ + zα + v. Substituting this into βOLS = (X X)−1 X y yields βOLS=β+ N−1 X X −1 N−1 X z α+ N−1 X X −1 N−1 X v . Under the usual assumption that X is uncorrelated with v, the final term has probability limit zero. X is correlated with z, however, and plim βOLS = β+δα, (4.40) where δ = plim (N−1 X X)−1 N−1 X z is the probability limit of the OLS estimator in regression of the omitted regressor (z) on the included regressors (X). This inconsistency is called omitted variables bias, where common terminology states that various misspecifications lead to bias even though formally they lead to inconsistency. The inconsistency exists as long as δ = 0, that is, as long as the omitted variable is correlated with the included regressors. In general the inconsistency could be positive or negative and could even lead to a sign reversal of the OLS coefficient. For the returns to schooling example, the correlation between schooling and ability is expected to be positive, so δ 0, and the return to ability is expected to be positive, so α 0. It follows that δα 0, so the omitted variables bias is positive in this ex- ample. OLS of earnings on schooling alone will overstate the effect of education on earnings. A related form of misspecification is inclusion of irrelevant regressors. For ex- ample, the regression may be of y on x and z, even though the dgp is more simply y = x β + v. In this case it is straightforward to show that OLS is consistent, but there is a loss of efficiency. Controlling for omitted variables bias is necessary if parameter estimates are to be given a causal interpretation. Since too many regressors cause little harm, but too few regressors can lead to inconsistency, microeconometric models estimated from large data sets tend to include many regressors. If omitted variables are still present then one of the methods given at the end of Section 4.7.3 is needed. 93
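The result in (4.40) can be illustrated by simulation. The sketch below uses arbitrary coefficient values; it estimates the short regression of y on x alone and compares the slope with β + δα, where δ is the slope from regressing the omitted variable z on x.

```python
import numpy as np

# Omitted-variables bias: y = x*beta + z*alpha + v with z correlated with x.
# Regressing y on x alone gives a slope near beta + delta*alpha, where delta is
# the slope from regressing z on x, as in (4.40). Parameter values are invented.
rng = np.random.default_rng(9)
N = 500_000
beta, alpha = 1.0, 2.0

x = rng.normal(size=N)
z = 0.5 * x + rng.normal(size=N)        # omitted regressor, correlated with x
v = rng.normal(size=N)
y = beta * x + alpha * z + v

slope = lambda a, b: np.cov(a, b)[0, 1] / np.var(a, ddof=1)
b_short = slope(x, y)                   # OLS of y on x alone (in deviations from means)
delta = slope(x, z)                     # OLS of z on x

print("short-regression slope:", b_short)               # approximately 2.0
print("beta + delta*alpha    :", beta + delta * alpha)  # 1 + 0.5*2 = 2.0
```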
  • 120. LINEAR MODELS 4.7.5. Pseudo-True Value In the omitted variables example the least-squares estimator is subject to confounding in the sense that it does not estimate β, but instead estimates a function of β, δ, and α. The OLS estimate cannot be used as an estimate of β, which, for example, measures the effect of an exogenous change in a regressor x such as schooling holding all other regressors including ability constant. From (4.40), however, βOLS is a consistent estimator of the function (β + δα) and has a meaningful interpretation. The probability limit of βOLS of β∗ = (β + δα) is referred to as the pseudo-true value, see Section 5.7.1 for a formal definition, corre- sponding to βOLS. Furthermore, one can obtain the distribution of βOLS even though it is inconsis- tent for β. The estimated asymptotic variance of βOLS measures dispersion around (β + δα) and is given by the usual estimator, for example by s2 (X X)−1 if the error in (4.38) is homoskedastic. 4.7.6. Parameter Heterogeneity The presentation to date has permitted regressors and error terms to vary across indi- viduals but has restricted the regression parameters β to be the same across individuals. Instead, suppose that the dgp is yi = x i βi +ui , (4.41) with subscript i on the parameters. This is an example of parameter heterogeneity, where the marginal effect E[yi |xi ] = βi is now permitted to differ across individuals. The random coefficients model or random parameters model specifies βi to be independently and identically distributed over i with distribution that does not depend on the observables xi . Let the common mean of βi be denoted β. The dgp can be rewritten as yi = x i β + (ui + x i (βi − β)), and enough assumptions have been made to ensure that the regressors xi are uncorre- lated with the error term (ui + x i (βi − β)). OLS regression of y on x will therefore consistently estimate β, though note that the error is heteroskedastic even if ui is ho- moskedastic. For panel data a standard model is the random effects model (see Section 21.7) that lets the intercept vary across individuals while the slope coefficients are not random. For nonlinear models a similar result need not hold, and random parameter models can be preferred as they permit a richer parameterization. Random parameter models are consistent with existence of heterogeneous responses of individuals to changes in x. A leading example is random parameters logit in Section 15.7. More serious complications can arise when the regression parameters βi for an individual are related to observed individual characteristics. Then OLS estimation can lead to inconsistent parameter estimation. An example is the fixed effects model for panel data (see Section 21.6) for which OLS estimation of y on x is inconsistent. In 94
  • 121. 4.8. INSTRUMENTAL VARIABLES this example, but not in all such examples, alternative consistent estimators for a subset of the regression parameters are available. 4.8. Instrumental Variables A major complication that is emphasized in microeconometrics is the possibility of inconsistent parameter estimation caused by endogenous regressors. Then regression estimates measure only the magnitude of association, rather than the magnitude and direction of causation, both of which are needed for policy analysis. The instrumental variables estimator provides a way to nonetheless obtain consis- tent parameter estimates. This method, widely used in econometrics and rarely used elsewhere, is conceptually difficult and easily misused. We provide a lengthy expository treatment that defines an instrumental variable and explains how the instrumental variables method works in a simple setting. 4.8.1. Inconsistency of OLS Consider the scalar regression model with dependent variable y and single regressor x. The goal of regression analysis is to estimate the conditional mean function E[y|x]. A linear conditional mean model, without intercept for notational convenience, specifies E[y|x] = βx. (4.42) This model without intercept subsumes the model with intercept if dependent and regressor variables are deviations from their respective means. Interest lies in obtaining a consistent estimate of β as this gives the change in the conditional mean given an exogenous change in x. For example, interest may lie in the effect in earnings caused by an increase in schooling attributed to exogenous reasons, such as an increase in the minimum age at which students leave school, that are not a choice of the individual. The OLS regression model specifies y = βx + u, (4.43) where u is an error term. Regression of y on x yields OLS estimate β of β. Standard regression results make the assumption that the regressors are uncorrelated with the errors in the model (4.43). Then the only effect of x on y is a direct effect via the term βx. We have the following path analysis diagram: x −→ y u where there is no association between x and u. So x and u are independent causes of y. However, in some situations there may be an association between regressors and errors. For example, consider regression of log-earnings (y) on years of schooling (x). The error term u embodies all factors other than schooling that determine earnings, 95
  • 122. LINEAR MODELS such as ability. Suppose a person has a high level of u, as a result of high (unobserved) ability. This increases earnings, since y = βx + u, but it may also lead to higher lev- els of x, since schooling is likely to be higher for those with high ability. A more appropriate path diagram is then the following: x −→ y ↑ u where now there is an association between x and u. What are the consequences of this correlation between x and u? Now higher levels of x have two effects on y. From (4.43) there is both a direct effect via βx and an indirect effect via u affecting x, which in turn affects y. The goal of regression is to estimate only the first effect, yielding an estimate of β. The OLS estimate will instead combine these two effects, giving β β in this example where both effects are positive. Using calculus, we have y = βx + u(x) with total derivative dy dx = β + du dx . (4.44) The data give information on dy/dx, so OLS estimates the total effect β + du/dx rather than β alone. The OLS estimator is therefore biased and inconsistent for β, unless there is no association between x and u. A more formal treatment of the linear regression model with K regressors leads to the same conclusion. From Section 4.7.1 a necessary condition for consistency of OLS is that plim N−1 X u = 0. Consistency requires that the regressors are asymptotically uncorrelated with the errors. From (4.37) the magnitude of the inconsistency of OLS is X X −1 X u, the OLS coefficient from regression of u on x. This is just the OLS estimate of du/dx, confirming the intuitive result in (4.44). 4.8.2. Instrumental Variable The inconsistency of OLS is due to endogeneity of x, meaning that changes in x are associated not only with changes in y but also changes in the error u. What is needed is a method to generate only exogenous variation in x. An obvious way is through a randomized experiment, but for most economics applications such experiments are too expensive or even infeasible. Definition of an Instrument A crude experimental or treatment approach is still possible using observational data, provided there exists an instrument z that has the property that changes in z are asso- ciated with changes in x but do not lead to change in y (aside from the indirect route via x). This leads to the following path diagram: z −→ x −→ y ↑ u 96
  • 123. 4.8. INSTRUMENTAL VARIABLES which introduces a variable z that is causally associated with x but not u. It is still the case that z and y will be correlated, but the only source of such correlation is the indirect path of z being correlated with x, which in turn determines y. The more direct path of z being a regressor in the model for y is ruled out. More formally, a variable z is called an instrument or instrumental variable for the regressor x in the scalar regression model y = βx + u if (1) z is uncorrelated with the error u and (2) z is correlated with the regressor x. The first assumption excludes the instrument z from being a regressor in the model for y, since if instead y depended on both x and z, and y is regressed on x alone, then z is being absorbed into the error so that z will then be correlated with the error. The second assumption requires that there is some association between the instrument and the variable being instrumented. Examples of an Instrument In many microeconometric applications it is difficult to find legitimate instruments. Here we provide two examples. Suppose we want to estimate the response of market demand to exogenous changes in market price. Quantity demanded clearly depends on price, but prices are not ex- ogenously given since they are determined in part by market demand. A suitable in- strument for price is a variable that is correlated with price but does not directly affect quantity demanded. An obvious candidate is a variable that affects market supply, since this also affect prices, but is not a direct determinant of demand. An example is a mea- sure of favorable growing conditions if an agricultural product is being modeled. The choice of instrument here is uncontroversial, provided favorable growing conditions do not directly affect demand, and is helped greatly by the formal economic model of supply and demand. Next suppose we want to estimate the returns to exogenous changes in schooling. Most observational data sets lack measures of individual ability, so regression of earn- ings on schooling has error that includes unobserved ability and hence is correlated with the regressor schooling. We need an instrument z that is correlated with school- ing, uncorrelated with ability, and more generally uncorrelated with the error term, which means that it cannot directly determine earnings. One popular candidate for z is proximity to a college or university (Card, 1995). This clearly satisfies condition 2 because, for example, people whose home is a long way from a community college or state university are less likely to attend college. It most likely satisfies 1, though since it can be argued that people who live a long way from a college are more likely to be in low-wage labor markets one needs to estimate a multiple regression for y that includes additional regressors such as indicators for nonmetropolitan area. A second candidate for the instrument is month of birth (Angrist and Krueger, 1991). This clearly satisfies condition 1 as there is no reason to believe that month of birth has a direct effect on earnings if the regression includes age in years. Surpris- ingly condition 2 may also be satisfied, as birth month determines age of first entry 97
  • 124. LINEAR MODELS into school in the USA, which in turn may affect years of schooling since laws often specify a minimum school-leaving age. Bound, Jaeger, and Baker (1995) provide a critique of this instrument. The consequences of choosing poor instruments are considered in detail in Sec- tion 4.9. 4.8.3. Instrumental Variables Estimator For regression with scalar regressor x and scalar instrument z, the instrumental vari- ables (IV) estimator is defined as βIV = (z x)−1 z y, (4.45) where, in the scalar regressor case z, x and y are N × 1 vectors. This estimator provides a consistent estimator for the slope coefficient β in the linear model y = βx + u if z is correlated with x and uncorrelated with the error term. There are several ways to derive (4.45). We provide an intuitive derivation, one that differs from derivations usually presented such as that in Section 6.2.5. Return to the earnings–schooling example. Suppose a one-unit change in the in- strument z is associated with 0.2 more years of schooling and with a $500 increase in annual earnings. This increase in earnings is a consequence of the indirect effect that increase in z led to increase in schooling, which in turn increases income. Then it follows that 0.2 years additional schooling is associated with a $500 increase in earn- ings, so that a one-year increase in schooling is associated with a $500/0.2 = $2,500 increase in earnings. The causal estimate of β is therefore 2,500. In mathematical notation we have estimated the changes dx/dz and dy/dz and calculated the causal estimator as βIV = dy/dz dx/dz . (4.46) This approach to identification of the causal parameter β is given in Heckman (2000, p. 58); see also the example in Section 2.4.2. All that remains is consistent estimation of dy/dz and dx/dz. The obvious way to estimate dy/dz is by OLS regression of y on z with slope estimate (z z)−1 z y. Sim- ilarly, estimate dx/dz by OLS regression of x on z with slope estimate (z z)−1 z x. Then βIV = (z z)−1 z y (zz)−1 zx = (z x)−1 z y. (4.47) 4.8.4. Wald Estimator A leading simple example of IV is one where the instrument z is a binary instru- ment. Denote the subsample averages of y and x by ȳ1 and x̄1, respectively, when z = 1 and by ȳ0 and x̄0, respectively, when z = 0. Then
Δy/Δz = (ȳ1 − ȳ0) and Δx/Δ
  • 129. z = (x̄1 − x̄0), and (4.46) yields βWald = (ȳ1 − ȳ0) (x̄1 − x̄0) . (4.48) This estimator is called the Wald estimator, after Wald (1940), or the grouping esti- mator. The Wald estimator can also be obtained from the formula (4.45). For the no- intercept model variables are measured in deviations from means, so z y = i (zi − z) (yi − ȳ). For binary z this yields z y = N1(ȳ1 − ȳ) = N1 N0(ȳ1 − ȳ0)/N, where N0 and N1 are the number of observations for which z = 0 and z = 1. This result uses ȳ1 − ȳ = (N0 ȳ1 + N1 ȳ1)/N − (N0 ȳ0 + N1 ȳ1)/N = N0(ȳ1 − ȳ0)/N. Similarly, z x = N1 N0(x̄1 − x̄0)/N. Combining these results, we have that (4.45) yields (4.48). For the earnings–schooling example it is being assumed that we can define two groups where group membership does not directly determine earnings, though it does affect level of schooling and hence indirectly affects earnings. Then the IV estimate is the difference in average earnings across the two groups divided by the difference in average schooling across the two groups. 4.8.5. Sample Covariance and Correlation Analysis The IV estimator can also be interpreted in terms of covariances or correlations. For sample covariances we have directly from (4.45) that βIV = Cov[z, y] Cov[z, x] , (4.49) where here Cov[ ] is being used to denote sample covariance. For sample correlations, note that the OLS estimator for the model (4.43) can be written as βOLS = rxy √ yy/ √ xx, where rxy = x y/ (xx)(y y) is the sample correla- tion between x and y. This leads to the interpretation of the OLS estimator as implying that a one standard deviation change in x is associated with an rxy standard deviation change in y. The problem is that the correlation rxy is contaminated by correlation between x and u. An alternative approach is to measure the correlation between x and y indirectly by the correlation between z and y divided by the correlation between z and x. Then βIV = rzy rzx √ yy √ xx , (4.50) which can be shown to equal βIV in (4.45). 4.8.6. IV Estimation for Multiple Regression Now consider the multiple regression model with typical observation y = x β + u, with K regressor variables, so that x and β are K × 1 vectors. 99
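Before turning to the multiple-regression case in detail, the scalar-instrument results above can be checked numerically. The sketch below uses an illustrative data-generating process (not the one in the text) with a binary instrument, and verifies that the IV formula (4.45) applied to deviations from means, the Wald estimator (4.48), and the covariance ratio (4.49) coincide, while OLS is inconsistent.

```python
# Minimal simulation sketch; dgp, coefficients, and sample size are illustrative.
import numpy as np

rng = np.random.default_rng(12345)
N = 100_000
z = rng.integers(0, 2, N)                    # binary instrument
u = rng.normal(size=N)
v = 0.8 * u + rng.normal(size=N)             # correlation with u makes x endogenous
x = 0.5 * z + v                              # instrument shifts x
y = 2.0 * x + u                              # true slope beta = 2.0

xd, yd, zd = x - x.mean(), y - y.mean(), z - z.mean()   # deviations from means

beta_iv   = (zd @ yd) / (zd @ xd)                              # (4.45)
beta_wald = (y[z == 1].mean() - y[z == 0].mean()) / (
            x[z == 1].mean() - x[z == 0].mean())               # (4.48)
beta_cov  = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]            # (4.49)
beta_ols  = (xd @ yd) / (xd @ xd)                              # inconsistent here

print(beta_iv, beta_wald, beta_cov, beta_ols)
# The first three agree; OLS is pulled away from 2.0 by the endogeneity of x.
```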
  • 130. LINEAR MODELS Instruments Assume the existence of an r × 1 vector of instruments z, with r ≥ K, satisfying the following: 1. z is uncorrelated with the error u. 2. z is correlated with the regressor vector x. 3. z is strongly correlated, rather than weakly correlated, with the regressor vector x. The first two properties are necessary for consistency and were presented earlier in the scalar case. The third property, defined in Section 4.9.1, is a strengthening of the second to ensure good finite-sample performance of the IV estimator. In the multiple regression case z and x may share some common components. Some components of x, called exogenous regressors, may be uncorrelated with u. These components are clearly suitable instruments as they satisfy conditions 1 and 2. Other components of x, called endogenous regressors, may be correlated with u. These components lead to inconsistency of OLS and are also clearly unsuitable in- struments as they do not satisfy condition 1. Partition x into x = [x 1 x 2] , where x1 contains endogenous regressors and x2 contains exogenous regressors. Then a valid instrument is z = [z 1 x 2] , where x2 can be an instrument for itself, but we need to find at least as many instruments z1 as there are endogenous variables x1. Identification Identification in a simultaneous equations model was presented in Section 2.5. Here we have a single equation. The order condition requires that the number of instruments must at least equal the number of independent endogenous components, so that r ≥ K. The model is said to be just-identified if r = K and overidentified if r K. In many multiple regression applications there is only one endogenous regressor. For example, the earnings on schooling regression will include many other regressors such as age, geographic location, and family background. Interest lies in the coefficient on schooling, but this is an endogenous variable most likely correlated with the error because ability is unobserved. Possible candidates for the necessary single instrument for schooling have already been given in Section 4.8.2. If an instrument fails the first condition the instrument is an invalid instrument. If an instrument fails the second condition the instrument is an irrelevant instrument, and the model may be unidentified if too few instruments are relevant. The third con- dition fails when very low correlation exists between the instrument and the endoge- nous variable being instrumented. The model is said to be weakly identified and the instrument is called a weak instrument. Instrumental Variables Estimator When the model is just-identified, so that r = K, the instrumental variables estima- tor is the obvious matrix generalization of (4.45) βIV = Z X −1 Z y, (4.51) 100
  • 131. 4.8. INSTRUMENTAL VARIABLES where Z is an N × K matrix with ith row z i . Substituting the regression model y = Xβ + u for y in (4.51) yields βIV = Z X −1 Z [Xβ + u] = β + Z X −1 Z u = β + N−1 Z X −1 N−1 Z u. It follows immediately that the IV estimator is consistent if plim N−1 Z u = 0 and plim N−1 Z X = 0. These are essentially conditions 1 and 2 that z is uncorrelated with u and correlated with x. To ensure that the inverse of N−1 Z X exists it is assumed that Z X is of full rank K, a stronger assumption than the order condition that r = K. With heteroskedastic errors the IV estimator is asymptotically normal with mean β and variance matrix consistently estimated by V[ βIV] = (Z X)−1 Z ΩZ(Z X)−1 , (4.52) where Ω = Diag[ u2 i ]. This result is obtained in a manner similar to that for OLS given in Section 4.4.4. The IV estimator, although consistent, leads to a loss of efficiency that can be very large in practice. Intuitively IV will not work well if the instrument z has low correla- tion with the regressor x (see Section 4.9.3). 4.8.7. Two-Stage Least Squares The IV estimator in (4.51) requires that the number of instruments equals the number of regressors. For overidentified models the IV estimator can be used, by discarding some of the instruments so that the model is just-identified. However, an asymptotic efficiency loss can occur when discarding these instruments. Instead, a common procedure is to use the two-stage least-squares (2SLS) estima- tor β2SLS = X Z(Z Z)−1 Z X −1 X Z(Z Z)−1 Z y , (4.53) presented and motivated in Section 6.4. The 2SLS estimator is an IV estimator. In a just-identified model it simplifies to the IV estimator given in (4.51) with instruments Z. In an overidentified model the 2SLS estimator equals the IV estimator given in (4.51) if the instruments are X, where X = Z(Z Z)−1 Z X is the predicted value of x from OLS regression of x on z. The 2SLS estimator gets its name from the result that it can be obtained by two consecutive OLS regressions: OLS regression of x on z to get x followed by OLS of y on x, which gives β2SLS. This interpretation does not necessarily generalize to nonlinear regressions; see Section 6.5.6. 101
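The algebra of 2SLS can be illustrated with a short simulation. The sketch below uses an illustrative data-generating process with one endogenous regressor, one exogenous regressor, and two outside instruments (all names and parameter values are assumptions, not taken from the text). It computes the 2SLS estimator (4.53) directly, confirms that it equals OLS of y on the first-stage fitted values, and forms a heteroskedasticity-robust sandwich variance of the type given in (4.55) below.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000
z1, z2 = rng.normal(size=(2, N))                 # outside instruments
u = rng.normal(size=N)
x_exog = rng.normal(size=N)                      # exogenous regressor (its own instrument)
x_endog = 1.0 * z1 + 0.5 * z2 + 0.7 * u + rng.normal(size=N)
y = 1.0 + 0.5 * x_endog + 1.5 * x_exog + u       # true coefficients (1.0, 0.5, 1.5)

X = np.column_stack([np.ones(N), x_endog, x_exog])    # K = 3 regressors
Z = np.column_stack([np.ones(N), z1, z2, x_exog])     # r = 4 > K: overidentified

# First-stage fitted values Xhat = Z (Z'Z)^{-1} Z'X
ZZinv_ZX = np.linalg.solve(Z.T @ Z, Z.T @ X)
Xhat = Z @ ZZinv_ZX

# 2SLS by the formula (4.53) and by OLS of y on Xhat; the two coincide
b_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
b_twostep = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)
print(b_2sls, b_twostep)                          # both approximately (1.0, 0.5, 1.5)

# Heteroskedasticity-robust sandwich variance
uhat = y - X @ b_2sls
S = (Z * (uhat**2)[:, None]).T @ Z / N            # N^{-1} sum u_i^2 z_i z_i'
A = np.linalg.inv(Xhat.T @ X)                     # (X' P_Z X)^{-1}
bread = ZZinv_ZX.T                                # X'Z (Z'Z)^{-1}
V = N * A @ bread @ S @ bread.T @ A
print(np.sqrt(np.diag(V)))                        # robust standard errors
```

The standard errors from the second-stage OLS regression itself would be wrong, as noted above, which is why the sandwich variance is built from the residuals y − Xb rather than y − Xhat b.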
  • 132. LINEAR MODELS The 2SLS estimator is often expressed more compactly as β2SLS = X PZX −1 X PZy , (4.54) where PZ = Z(Z Z)−1 Z is an idempotent projection matrix that satisfies PZ = P Z, PZP Z = PZ, and PZZ = Z. The 2SLS estimator can be shown to be asymptotically normal distributed with esti- mated asymptotic variance V[ β2SLS] = N X PZX −1 X Z(Z Z) −1 S(Z Z)−1 Z X X PZX −1 , (4.55) where in the usual case of heteroskedastic errors S = N−1 i u2 i zi z i and ui = yi − x i β2SLS. A commonly used small-sample adjustment is to divide by N − K rather than N in the formula for S. In the special case that errors are homoskedastic, simplification occurs and V[ β2SLS] = s2 [X PZX]−1 . This latter result is given in many introductory treatments, but the more general formula (4.55) is preferred as the modern approach is to treat errors as potentially heteroskedastic. For overidentified models with heteroskedastic errors an estimator that White (1982) calls the two-stage instrumental variables estimator is more efficient than 2SLS. Moreover, some commonly used model specification tests require estimation by this estimator rather than 2SLS. For details see Section 6.4.2. 4.8.8. IV Example As an example of IV estimation, consider estimation of the slope coefficient of x for the dgp y = 0 + 0.5x + u, x = 0 + z + v, where z ∼ N[2, 1] and (u, v) are joint normal with means 0, variances 1, and correla- tion 0.8. OLS of y on x yields inconsistent estimates as x is correlated with u since by construction x is correlated with v, which in turn is correlated with u. IV estimation yields consistent estimates. The variable z is a valid instrument since by construction is uncorrelated with u but is correlated with x. Transformations of z, such as z3 , are also valid instruments. Various estimates and associated standard errors from a generated data sample of size 10,000 are given in Table 4.4. We focus on the slope coefficient. The OLS estimator is inconsistent, with slope coefficient estimate of 0.902 being more than 50 standard errors from the true value of 0.5. The remaining estimates are consistent and are all within two standard errors of 0.5. There are several ways to compute the IV estimator. The slope coefficient from OLS regression of y on z is 0.5168 and from OLS regression of x on z it is 1.0124, 102
  • 133. 4.9. INSTRUMENTAL VARIABLES IN PRACTICE Table 4.4. Instrumental Variables Examplea OLS IV 2SLS IV (z3 ) Constant −0.804 −0.017 −0.017 −0.014 (0.014) (0.022) (0.032) (0.025) x 0.902 0.510 0.510 0.509 (0.006) (0.010) (0.014) (0.012) R2 0.709 0.576 0.576 0.574 a Generated data for a sample size of 10,000. OLS is inconsistent and other esti- mators are consistent. Robust standard errors are reported though they are unnec- essary here as errors are homoskedastic. The 2SLS standard errors are incorrect. The data-generating process is given in the text. yielding an IV estimate of 0.5168/1.0124 = 0.510 using (4.47). In practice one instead directly computes the IV estimator using (4.45) or (4.51), with z used as the instrument for x and standard errors computed using (4.52). The 2SLS estimator (see (4.54)) can be computed by OLS regression of y on x, where x is the prediction from OLS regression of x on z. The 2SLS estimates exactly equal the IV estimates in this just- identified model, though the standard errors from this OLS regression of y on x are incorrect as will be explained in Section 6.4.5. The final column uses z3 rather than z as the instrument for x. This alternative IV estimator is consistent, since z3 is uncorrelated with u and correlated with x. However, it is less efficient for this particular dgp, and the standard error of the slope coefficient rises from 0.010 to 0.012. There is an efficiency loss in IV estimation compared to OLS estimation, see (4.61) for a general result for the case of single regressor and single instrument. Here r2 x,z = 0.510, not given in Table 4.4, is high so the loss is not great and the standard error of the slope coefficient increases somewhat from 0.006 to 0.010. In practice the efficiency loss can be much greater than this. 4.9. Instrumental Variables in Practice Important practical issues include determining whether IV methods are necessary and, if necessary, determining whether the instruments are valid. The relevant specification tests are presented in Section 8.4. Unfortunately, the validity of tests are limited. They require the assumption that in a just-identified model the instruments are valid and test only overidentifying restrictions. Although IV estimators are consistent given valid instruments, as detailed in the following, IV estimators can be much less efficient than the OLS estimator and can have a finite-sample distribution that for usual finite-sample sizes differs greatly from the asymptotic distribution. These problems are greatly magnified if instruments are weakly correlated with the variables being instrumented. One way that weak instru- ments can arise is if there are many more instruments than needed. This is simply dealt with by dropping some of the instruments (see also Donald and Newey, 2001). A 103
  • 134. LINEAR MODELS more fundamental problem arises when even with the minimal number of instruments one or more of the instruments is weak. This section focuses on the problem of weak instruments. 4.9.1. Weak Instruments There is no single definition of a weak instrument. Many authors use the following signals of a weak instrument, presented here for progressively more complex models. r Scalar regressor x and scalar instrument z: A weak instrument is one for which r2 x,z is small. r Scalar regressor x and vector of instruments z: The instruments are weak if the R2 from regression of x on z, denoted R2 x,z, is small or if the F-statistic for test of overall fit in this regression is small. r Multiple regressors x with only one endogenous: A weak instrument is one for which the partial R2 is low or the partial F-statistic is small, where these partial statistics are defined toward the end of Section 4.9.1. r Multiple regressors x with several endogenous: There are several measures. R2 Measures Consider a single equation y = β1x1 + x 2β2 + u, (4.56) where just one regressor x1 is endogenous and the remaining regressors in the vector x2 are exogenous. Assume that the instrument vector z includes the exogenous instru- ments x2, as well as least one other instrument. One possible R2 measure is the usual R2 from regression of x1 on z. However, this could be high only because x1 is highly correlated with x2 whereas intuitively we really need x1 to be highly correlated with the instrument(s) other than x2. Bound, Jaeger, and Baker (1995) therefore proposed use of a partial R2 , denoted R2 p, that purges the effect of x2. R2 p is obtained as R2 from the regression (x1 − x1) = (z− z) γ + v, (4.57) where x1 and z are the fitted values from regressions of x1 on x2 and z on x2. In the just-identified case z − z will reduce to z1 − z1, where z1 is the single instrument other than x2 and z1 is the fitted value from regression of z1 on x2. It is not unusual for R2 p to be much lower than R2 x1,z. The formula for R2 p simplifies to r2 x z when there is only one regressor and it is endogenous. It further simplifies to Cor[x, z] when there is only one instrument. When there is more than one endogenous variable, analysis is less straightforward as a number of generalizations of R2 p have been proposed. Consider a single equation with more than one endogenous variable model and fo- cus on estimation of the coefficient of the first endogenous variable. Then in (4.56) 104
  • 135. 4.9. INSTRUMENTAL VARIABLES IN PRACTICE x1 is endogenous and additionally some of the variables in x2 are also endogenous. Several alternative measures replace the right-hand side of (4.57) with a residual that controls for the presence of other endogenous regressors. Shea (1997) proposed a par- tial R2 , say R∗2 p , that is computed as the squared sample correlation between (x1 − x1) and ( x1 − x1). Here (x1 − x1) is again the residual from regression of x1 on x2, whereas ( x1 − x1) is the residual from regression of x1 (the fitted value from regression of x1 on z) on x2 (the fitted value from regression of x2 on z). Poskitt and Skeels (2002) pro- posed an alternative partial R2 , which, like Shea’s R∗2 p , simplifies to R2 p when there is only one endogenous regressor. Hall, Rudebusch, and Wilcox (1996) instead proposed use of canonical correlations. These measures for the coefficient for the first endogenous variable can be repeated for the other endogenous variables. Poskitt and Skeels (2002) additionally consider an R2 measure that applies jointly to instrumentation of all the endogenous variables. The problems of inconsistency of estimators and loss of precision are magnified as the partial R2 measures fall, as detailed in Sections 4.9.2 and 4.9.3. See especially (4.60) and (4.62). Partial F-Statistics For poor finite-sample performance, considered in Section 4.9.4, it is common to use a related measure, the F-statistic for whether coefficients are zero in regression of the endogenous regressor on instruments. For a single regressor that is endogenous we use the usual overall F-statistic, for a test of π = 0 in the regression x = z π + v of the endogenous regressor on the instru- ments. This F-statistic is a function of R2 x,z. More commonly, some exogenous regressors also appear in the model, and in model (4.56) with single endogenous regressor x1 we use the F-statistic for a test of π1 = 0 in the regression x = z 1π1 + x 2π2 + v, (4.58) where z1 are the instruments other than the exogenous regressors and x2 are the ex- ogenous regressors. This is the first-stage regression in the two-stage least-squares interpretation of IV. This statistic is used as a signal of potential finite-sample bias in the IV estimator. In Section 4.9.4 we explain results of Staiger and Stock (1997) that suggest a value less than 10 is problematic and a value of 5 or less is a sign of extreme finite-sample bias and we consider extension to more than one endogenous regressor. 4.9.2. Inconsistency of IV Estimators The essential condition for consistency of IV is condition 1 in Section 4.8.6, that the instrument should be uncorrelated with the error term. No test is possible in the just-identified case. In the overidentified case a test of the overidentifying assump- tions is possible (see Section 6.4.3). Rejection then could be due to either instrument 105
  • 136. LINEAR MODELS endogeneity or model failure. Thus condition 1 is difficult to test directly and deter- mining whether an instrument is exogenous is usually a subjective decision, albeit one often guided by economic theory. It is always possible to create an exogenous instrument through functional form restrictions. For example, suppose there are two regressors so that y = β1x1 + β2x2 + u, with x1 uncorrelated with u and x2 correlated with u. Note that throughout this section all variables are assumed to be measured in departures from means, so that without loss of generality the intercept term can be omitted. Then OLS is inconsistent, as x2 is endogenous. A seemingly good instrument for x2 is x2 1 , since x2 1 is likely to be uncorrelated with u because x1 is uncorrelated with u. However, the validity of this instrument requires the functional form restriction on the conditional mean that x1 only enters the model linearly and not quadratically. In practice one should view a linear model as only an approximation, and obtaining instruments in such an artificial way can be easily criticized. A better way to create a valid instrument is through alternative exclusion restric- tions that do not rely so heavily on choice of functional form. Some practical examples have been given in Section 4.8.2. Structural models such as the classical linear simultaneous equations model (see Sections 2.4 and 6.10.6) make such exclusion restrictions very explicit. Even then the restrictions can often be criticized for being too ad hoc, unless compelling economic theory supports the restrictions. For panel data applications it may be reasonable to assume that only current data may belong in the equation of interest – an exclusion restriction permitting past data to be used as instruments under the assumption that errors are serially uncorrelated (see Section 22.2.4). Similarly, in models of decision making under uncertainty (see Section 6.2.7), lagged variables can be used as instruments as they are part of the information set. There is no formal test of instrument exogeneity that does not additionally test whether the regression equation is correctly specified. Instrument exogeneity in- evitably relies on a priori information, such as that from economic or statistical theory. The evaluation by Bound et al. (1995, pp. 446–447) of the validity of the instruments used by Angrist and Krueger (1991) provides an insightful example of the subtleties involved in determining instrument exogeneity. It is especially important that an instrument be exogenous if an instrument is weak, because with weak instruments even very mild endogeneity of the instrument can lead to IV parameter estimates that are much more inconsistent than the already inconsistent OLS parameter estimates. For simplicity consider linear regression with one regressor and one instrument; hence y = βx + u. Then performing some algebra, left as an exercise, yields plim βIV − β plim βOLS − β = Cor[z, u] Cor[x, u] × 1 Cor[z, x] . (4.59) Thus with an invalid instrument and low correlation between the instrument and the regressor, the IV estimator can be even more inconsistent than OLS. For example, suppose the correlation between z and x is 0.1, which is not unusual for cross-section 106
  • 137. 4.9. INSTRUMENTAL VARIABLES IN PRACTICE data. Then IV becomes more inconsistent than OLS as soon as the correlation coeffi- cient between z and u exceeds a mere 0.1 times the correlation coefficient between x and u. Result (4.59) can be extended to the model (4.56) with one endogenous regressor and several exogenous regressors, iid errors, and instruments that include all the ex- ogenous regressors. Then plim β1,2SLS − β1 plim β1,OLS − β1 = Cor[ x, u] Cor[x, u] × 1 R2 p , (4.60) where R2 p is defined after (4.56). For extension to more than one endogenous regressor see Shea (1997). These results, emphasized by Bound et al. (1995), have profound implications for the use of IV. If instruments are weak then even mild instrument endogeneity can lead to IV being even more inconsistent than OLS. Perhaps because the conclusion is so negative, the literature has neglected this aspect of weak instruments. A notable recent exception is Hahn and Hausman (2003a). Most of the literature assumes that condition 1 is satisfied, so that IV is consistent, and focuses on other complications attributable to weak instruments. 4.9.3. Low Precision Although IV estimation can lead to consistent estimation when OLS is inconsistent, it also leads to a loss in precision. Intuitively, from Section 4.8.2 the instrument z is a treatment that leads to exogenous movement in x but does so with considerable noise. The loss in precision increases, and standard errors increase, with weaker instru- ments. This is easily seen in the simplest case of a single endogenous regressor and single instrument with iid errors. Then the asymptotic variance is V[ βIV] = σ2 (x z)−1 z z(z x)−1 (4.61) = [σ2 /x x]/[(z x)2 /(z z)(x x)] = V[ βOLS]/r2 x z. For example, if the squared sample correlation coefficient between z and x equals 0.1, then IV standard errors are 10 times those of OLS. Moreover, the IV estimator has larger variance than the OLS estimator unless Cor[z, x] = 1. Result (4.61) can be extended to the model (4.56) with one endogenous regressor and several exogenous regressors, iid errors, and instruments that include all the ex- ogenous regressors. Then se[ β1,2SLS] = se[ β1,OLS]/Rp, (4.62) where se[·] denotes asymptotic standard error and R2 p is defined after (4.56). For exten- sion to more than one endogenous regressor this R2 p is replaced by the R∗2 p proposed by Shea (1997). This provided the motivation for Shea’s test statistic. The poor precision is concentrated on the coefficients for endogenous variables. For exogenous variables the standard errors for 2SLS coefficient estimates are similar to 107
  • 138. LINEAR MODELS those for OLS. Intuitively, exogenous variables are being instrumented by themselves, so they have a very strong instrument. For the coefficients of an endogenous regressor it is a low partial R2 , rather than R2 , that leads to a loss of estimator precision. This explains why 2SLS standard errors can be much higher than OLS standard errors despite the high raw correlation between the endogenous variable and the instruments. Going the other way, 2SLS standard errors for coefficients of endogenous variables that are much larger than OLS standard errors provide a clear signal that instruments are weak. Statistics used to detect low precision of IV caused by weak instruments are called measures of instrument relevance. To some extent they are unnecessary as the prob- lem is easily detected if IV standard errors are much larger than OLS standard errors. 4.9.4. Finite-Sample Bias This section summarizes a relatively challenging and as yet unfinished literature on “weak instruments” that focuses on the practical problem that even in “large” samples asymptotic theory can provide a poor approximation to the distribution of the IV esti- mator. In particular the IV estimator is biased in finite samples even if asymptotically consistent. The bias can be especially pronounced when instruments are weak. This bias of IV, which is toward the inconsistent OLS estimator, can be remark- ably large, as demonstrated in a simple Monte Carlo experiment by Nelson and Startz (1990), and by a real data application involving several hundred thousand observations but very weak instruments by Bound et al. (1995). Moreover, the standard errors can also be very biased, as also demonstrated by Nelson and Startz (1990). The theoretical literature entails quite specialized and advanced econometric theory, as it is actually difficult to obtain the sample mean of the IV estimator. To see this, consider adapting to the IV estimator the usual proof of unbiasedness of the OLS estimator given in Section 4.4.8. For βIV defined in (4.51) for the just-identified case this yields E[ βIV] = β + EZ,X,u[(Z X)−1 Z u] = β + EZ,X (Z X)−1 Z × [E[u|Z, X] , where the unconditional expectation with respect to all stochastic variables, Z, X, and u, is obtained by first taking expectation with respect to u conditional on Z and X, using the law of Iterated Expectations (see Section A.8.). An obvious suf- ficient condition for the IV estimator to have mean β is that E[u|Z, X] = 0. This assumption is too strong, however, because it implies E[u|X] = 0, in which case there would be no need to instrument in the first place. So there is no simple way to obtain E[ βIV]. A similar problem does not arise in establishing consistency. Then βIV = β + N−1 Z X −1 N−1 Z u, where the term N−1 Z u can be considered in isola- tion of X and the assumption E[u|Z] = 0 leads to plim N−1 Z u = 0. Therefore we need to use alternative methods to obtain the mean of the IV estimator. Here we merely summarize key results. 108
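A small Monte Carlo in the spirit of the Nelson and Startz (1990) experiment mentioned above conveys the flavor of these results. The design below (sample size, first-stage coefficient, and error correlation) is an illustrative choice rather than theirs: the instrument is deliberately very weak, and the IV estimator, though consistent, is centered well away from the true value and toward the OLS probability limit.

```python
import numpy as np

rng = np.random.default_rng(1)
N, R = 100, 5_000              # small sample, many replications
beta = 0.5
pi = 0.05                      # weak first stage: x = pi*z + v
b_ols, b_iv = np.empty(R), np.empty(R)

for r in range(R):
    z = rng.normal(size=N)
    u = rng.normal(size=N)
    v = 0.8 * u + 0.6 * rng.normal(size=N)   # corr(u, v) makes x endogenous
    x = pi * z + v
    y = beta * x + u
    b_ols[r] = (x @ y) / (x @ x)
    b_iv[r] = (z @ y) / (z @ x)

print("true beta          :", beta)
print("mean OLS estimate  :", b_ols.mean())     # inconsistent: plim is about 1.3 here
print("median IV estimate :", np.median(b_iv))  # pulled toward the OLS value
# The median is reported for IV because, with a single instrument, the mean of
# the IV estimator does not exist, as discussed in the text.
```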
  • 139. 4.9. INSTRUMENTAL VARIABLES IN PRACTICE Initial research made the strong assumption of joint normality of variables and ho- moskedastic errors. Then the IV estimator has a Wishart distribution (defined in Chap- ter 13). Surprisingly, the mean of the IV estimator does not even exist in the just- identified case, a signal that there may be finite-sample problems. The mean does exist if there is at least one overidentifying restriction, and the variance exists if there are at least two overidentifying restrictions. Even when the mean exists the IV estimator is biased, with bias in the direction of OLS. With more overidentifying restrictions the bias increases, eventually equaling the bias of the OLS estimator. A detailed discussion is given in Davidson and MacKinnon (1993, pp. 221–224). Approximations based on power-series expansions have also been used. What determines the size of the finite-sample bias? For regression with a single regressor x that is endogenous and is related to the instruments z by the reduced form model x = zπ + v, the concentration parameter τ2 is defined as τ2 = π ZZ π/σ2 v . The bias of IV can be shown to be an increasing function of τ2 . The quantity τ2 /K, where K is the number of instruments, is the population analogue of the F-statistic for a test of whether π = 0. The statistic F − 1, where F is the actual F-statistic in the first-stage reduced form model, can be shown to be an approximately unbiased estimate of τ2 /K. This leads to tests for finite-sample bias being based on the F- statistic given in Section 4.9.2. Staiger and Stock (1997) obtained results under weaker distributional assumptions. In particular, normality is no longer needed. Their approach uses weak instrument asymptotics that find the limit distribution of IV estimators for a sequence of models with τ2 /K held constant as N → ∞. In a simple model 1/F provides an approximate estimate of the finite-sample bias of the IV estimator relative to OLS. More generally, the extent of the bias for given F varies with the number of endogenous regressors and the number of instruments. Simulations show that to ensure that the maximal bias in IV is no more than 10% that of OLS we need F 10. This threshold is widely cited but falls to around 6.5, for example, if one is comfortable with bias in IV of 20% of that for OLS. So a less strict rule of thumb is F 5. Shea (1997) demonstrated that low partial R2 is also associated with finite-sample bias but there is no similar rule of thumb for use of partial R2 as a diagnostic for finite-sample bias. For models with more than one endogenous regressor, separate F-statistics can be computed for each endogenous regressor. For a joint statistic Stock, Wright and Yogo (2002) propose using the minimum eigenvalue of a matrix analogue of the first-stage test F-statistic. Stock and Yogo (2003) present relevant critical values for this eigen- value as the desired degree of bias, the number of endogenous variables, and the num- ber of overidentifying restrictions vary. These tables include the single endogenous regressor as a special case and presume at least two overidentifying restrictions, so they do not apply to just-identified models. Finite-sample bias problems arise not only for the IV estimate but also for IV stan- dard errors and test statistics. Stock et al. (2002) present a similar approach to Wald tests whereby a test of β = β0 at a nominal level of 5% is to have actual size of, say, no more than 15%. 
Stock and Yogo (2003) also present detailed tables taking this size distortion approach that include just-identified models. 109
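The F-statistic diagnostic discussed in this section is simple to compute directly. The sketch below is a minimal implementation for the case of a single endogenous regressor, based on regression (4.58); the simulated data, variable names, and the deliberately small first-stage coefficient are illustrative assumptions rather than anything taken from the text.

```python
import numpy as np

def first_stage_F(x1, X2, Z1):
    """F-statistic for pi1 = 0 in the first-stage regression (4.58):
    x1 = Z1*pi1 + X2*pi2 + v."""
    N = x1.shape[0]
    W_unres = np.column_stack([Z1, X2])          # unrestricted regressors
    W_res = X2                                    # restricted: drop the instruments
    def ssr(W):
        b = np.linalg.lstsq(W, x1, rcond=None)[0]
        e = x1 - W @ b
        return e @ e
    ssr_u, ssr_r = ssr(W_unres), ssr(W_res)
    q = Z1.shape[1]                               # number of restrictions tested
    k = W_unres.shape[1]
    return ((ssr_r - ssr_u) / q) / (ssr_u / (N - k))

# Example with simulated data and a weak outside instrument
rng = np.random.default_rng(2)
N = 1_000
X2 = np.column_stack([np.ones(N), rng.normal(size=N)])   # exogenous regressors
Z1 = rng.normal(size=(N, 1))                              # one outside instrument
x1 = X2 @ np.array([1.0, 0.3]) + 0.08 * Z1[:, 0] + rng.normal(size=N)

print(first_stage_F(x1, X2, Z1))
# Compare with the thresholds of 10 and 5 discussed in the text.
```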
  • 140. LINEAR MODELS 4.9.5. Responses to Weak Instruments What can the practitioner do in the face of weak instruments? As already noted one approach is to limit the number of instruments used. This can be done by dropping instruments or by combining instruments. If finite-sample bias is a concern then alternative estimators may have better small- sample properties than 2SLS. A number of alternatives, many variants of IV, are pre- sented in Section 6.4.4. Despite the emphasis on finite-sample bias the other problems created by weak instruments may be of greater importance in applications. It is possible with a large enough sample for the first-stage reduced form F-statistic to be large enough that finite-sample bias is not a problem. Meanwhile, the partial R2 may be very small, leading to fragility to even slight correlation between the model error and instrument. This is difficult to test for and to overcome. There also can be great loss in estimator precision, as detailed in Sections 4.9.3 and 4.9.4. In such cases either larger samples are needed or alternative approaches to estimating causal marginal effects must be used. These methods are summarized in Section 2.8 and presented elsewhere in this book. 4.9.6. IV Application Kling (2001) analyzed in detail the use of college proximity as an instrument for schooling. Here we use the same data from the NLS young men’s cohort on 3,010 males aged 24 to 34 years old in 1976 as used to produce Table 1 of Kling (2001) and originally used by Card (1995). The model estimated is ln wi = α + β1si + β2ei + β3e2 i + x 2i γ + ui , where s denotes years of schooling, e denotes years of work experience, e2 denotes ex- perience squared, and x2 is a vector of 26 control variables that are mainly geographic indicators and measure of parental education. The schooling variable is considered endogenous, owing to lack of data on ability. Additionally, the two work experience variables are endogenous, since work experi- ence is calculated as age minus years of schooling minus six, as is common in this literature, and schooling is endogenous. At least three instruments are needed. Here exactly three instruments are used, so the model is just-identified. The first instrument is col4, an indicator for whether a four-year college is nearby. This instru- ment has already been discussed in Section 4.8.2. The other two instruments are age and age squared. These are highly correlated with experience and experience squared, yet it is believed they can be omitted from the model for log-wage since it is work experience that matters. The remaining regressor vector x2 is used as an instrument for itself. Although age is clearly exogenous, some unobservables such as social skills may be correlated with both age and wage. Then the use of age and age squared as instruments can be questioned. This illustrates the general point that there can be disagreement on assumptions of instrument validity. 110
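For readers who wish to produce estimates like those reported below, the following sketch gives the just-identified IV estimator (4.51) with the robust variance (4.52) in a reusable form. The data used in this application are not reproduced here, so the assembly of the design matrices at the bottom is left as hypothetical comments; all variable names there are placeholders.

```python
import numpy as np

def iv_just_identified(y, X, Z):
    """IV estimator (4.51) with heteroskedasticity-robust variance (4.52).
    X and Z must have the same number of columns (just-identified case)."""
    b = np.linalg.solve(Z.T @ X, Z.T @ y)
    u = y - X @ b
    ZX_inv = np.linalg.inv(Z.T @ X)
    meat = (Z * (u**2)[:, None]).T @ Z            # Z' diag(u_i^2) Z
    V = ZX_inv @ meat @ ZX_inv.T
    return b, np.sqrt(np.diag(V))

# Hypothetical assembly of the design matrices (placeholder names):
# X = [1, s, e, e^2, x2 controls]   schooling s, experience e and e squared
# Z = [1, col4, age, age^2, x2 controls]   proximity plus age and age squared
# b, se = iv_just_identified(lnwage, X, Z)
```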
  • 141. 4.9. INSTRUMENTAL VARIABLES IN PRACTICE Table 4.5. Returns to Schooling: Instrumental Variables Estimatesa OLS IV Schooling (s) 0.073 0.132 (0.004) (0.049) R2 0.304 0.207 Shea’s partial R2 – 0.006 First-stage F-statistic for s – 8.07 a Sample of 3,010 young males. Dependent variable is log hourly wage. Coefficient and standard error for schooling given; esti- mates for experience, experience squared, 26 control variables, and an intercept are not reported. For the three endogenous re- gressors – schooling (s), experience (e), and experience squared (e2) – the three instruments are an indicator for whether a four- year college (col) is nearby, age, and age squared. The partial R2 and first-stage F-statistic are weak instruments diagnostics explained in the test. Results are given in Table 4.5. The OLS estimate of β1 is 0.073, so that wages rise by 7.6% (= 100 × (e.073 − 1)) on average with each extra year of schooling. This estimate is an inconsistent estimate of β1 given omitted ability. The IV estimate, or equivalently the 2SLS estimate since the model is just-identified, is 0.132. An extra year of schooling is estimated to lead to a 14.1% (= 100 × (e.132 − 1)) increase in wage. The IV estimator is much less efficient than OLS. A formal test does not reject ho- moskedasticity and we follow Kling (2001) and use the usual standard errors, which are very close to the heteroskedastic-robust standard errors. The standard error of β1,OLS is 0.004 whereas that for β1,IV is 0.049, over 10 times larger. The standard errors for the other two endogenous regressors are about 4 times larger and the stan- dard errors for the exogenous regressors are about 1.2 times larger. The R2 falls from 0.304 to 0.207. R2 measures confirm that the instruments are not very relevant for schooling. A simple test is to note that the regression (4.58) of schooling on all of the instruments yields R2 = 0.297, which only falls a little to R2 = 0.291 if the three additional in- struments are dropped. More formally, Shea’s partial R2 here equals 0.0064 = 0.082 , which from (4.62) predicts that the standard error of β1,IV will be inflated by a multiple 12.5 = 1/0.08, very close to the inflation observed here. This reduces the t-statistic on schooling from 19.64 to 2.68. In many applications such a reduction would lead to sta- tistical insignificance. In addition, from Section 4.9.2 even slight correlation between the instrument col4i and the error term ui will lead to inconsistency of IV. To see whether finite-sample bias may also be a problem we run the regression (4.58) of schooling on all of the instruments. Testing the joint significance of the three additional instruments yields an F-statistic of 8.07, suggesting that the bias of IV may be 10 or 20% that of OLS. A similar regression for the other two endogenous variables yields much higher F-statistics since, for example, age is a good additional instrument 111
  • 142. LINEAR MODELS for experience. Given that there are three endogenous regressors it is actually bet- ter to use the method of Stock et al. (2002) discussed in Section 4.9.4, though here the problem is restricted to schooling since for experience and experience squared, respec- tively, Shea’s partial R2 equals 0.0876 and 0.0138, whereas the first-stage F-statistics are 1,772 and 1,542. If additional instruments are available then the model becomes overidentified and standard procedure is to additionally perform a test of overidentifying restrictions (see Section 8.4.4). 4.10. Practical Considerations The estimation procedures in this chapter are implemented in all standard economet- rics packages for cross-section data, except that not all packages implement quantile regression. Most provide robust standard errors as an option rather than the default. The most difficult estimator to apply can be the instrumental variables estimator, as in many potential applications it can be difficult to obtain instruments that are uncor- related with the error yet reasonably correlated with the regressor or regressors being instrumented. Such instruments can be obtained through specification of a complete structural model, such as a simultaneous equations system. Current applied research emphasizes alternative approaches such as natural experiments. 4.11. Bibliographic Notes The results in this chapter are presented in many first-year graduate texts, such as those by Davidson and MacKinnon (2004), Greene (2003), Hayashi (2000), Johnston and diNardo (1997), Mittelhammer, Judge, and Miller (2000), and Ruud (2000). We have emphasized re- gression with stochastic regressors, robust standard errors, quantile regression, endogeneity, and instrumental variables. 4.2 Manski (1991) has a nice discussion of regression in a general setting that includes discus- sion of the loss functions given in Section 4.2. 4.3 The returns to schooling example is well studied. Angrist and Krueger (1999) and Card (1999) provide recent surveys. 4.4 For a history of least squares see Stigler (1986). The method was introduced by Legendre in 1805. Gauss in 1810 applied least squares to the linear model with normally distributed error and proposed the elimination method for computation, and in later work he proposed the theorem now called the Gauss–Markov theorem. Galton introduced the concept of re- gression, meaning mean-reversion in the context of inheritance of family traits, in 1887. For an early “modern” treatment with application to pauperism and welfare availability see Yule (1897). Statistical inference based on least-squares estimates of the linear regression model was developed most notably by Fisher. The heteroskedastic-consistent estimate of the variance matrix of the OLS estimator, due to White (1980a) building on earlier work by Eicker (1963), has had a profound impact on statistical inference in microeconometrics and has been extended to many settings. 4.6 Boscovich in 1757 proposed a least absolute deviations estimator that predates least squares; see Stigler (1986). A review of quantile regression, introduced by Koenker and 112
  • 143. 4.11. BIBLIOGRAPHIC NOTES Bassett (1978), is given in Buchinsky (1994). A more elementary exposition is given in Koenker and Hallock (2001). 4.7 The earliest known use of instrumental variables estimation to secure identification in a simultaneous equations setting was by Wright (1928). Another oft-cited early reference is Reiersol (1941), who used instrumental variables methods to control for measurement error in the regressors. Sargan (1958) gives a classic early treatment of IV estimation. Stock and Trebbi (2003) provide additional early references. 4.8 Instrumental variables estimation is presented in econometrics texts, with emphasis on al- gebra but not necessarily intuition. The method is widely used in econometrics because of the desirability of obtaining estimates with a causal interpretation. 4.9 The problem of weak instruments was drawn to the attention of applied researchers by Nelson and Startz (1990) and Bound et al. (1995). There are a number of theoretical an- tecedents, most notably the work of Nagar (1959). The problem has dampened enthusiasm for IV estimation, and small-sample bias owing to weak instruments is currently a very active research topic. Results often assume iid normal errors and restrict analysis to one endogenous regressor. The survey by Stock et al. (2002) provides many references with emphasis on weak instrument asymptotics. It also briefly considers extensions to nonlinear models. The survey by Hahn and Hausman (2003b) presents additional methods and results that we have not reviewed here. For recent work on bias in standard errors see Bond and Windmeijer (2002). For a careful application see C.-I. Lee (2001). Exercises 4–1 Consider the linear regression model yi = x i β + ui with nonstochastic regressors xi and error ui that has mean zero but is correlated as follows: E[ui uj ] = σ2 if i = j , E[ui uj ] = ρσ2 if |i − j | = 1, and E[ui uj ] = 0 if |i − j | 1. Thus errors for immediately adjacent observations are correlated whereas errors are otherwise uncorrelated. In matrix notation we have y = Xβ + u, where Ω = E[uu ]. For this model answer each of the following questions using results given in Section 4.4. (a) Verify that Ω is a band matrix with nonzero terms only on the diagonal and on the first off–diagonal; and give these nonzero terms. (b) Obtain the asymptotic distribution of βOLS using (4.19). (c) State how to obtain a consistent estimate of V[ βOLS] that does not depend on unknown parameters. (d) Is the usual OLS output estimate s2 (X X)−1 a consistent estimate of V[ βOLS]? (e) Is White’s heteroskedasticity robust estimate of V[ βOLS] consistent here? 4–2 Suppose we estimate the model yi = µ + ui , where ui ∼ N[0, σ2 i ]. (a) Show that the OLS estimator of µ simplifies to µ = y. (b) Hence directly obtain the variance of y. Show that this equals White’s het- eroskedastic consistent estimate of the variance given in (4.21). 4–3 Suppose the dgp is yi = β0xi + ui , ui = xi εi , xi ∼ N[0, 1], and εi ∼ N[0, 1]. As- sume that data are independent over i and that xi is independent of εi . Note that the first four central moments of N[0, σ2 ] are 0, σ2 , 0, and 3σ4 . (a) Show that the error term ui is conditionally heteroskedastic. (b) Obtain plim N−1 X X. [Hint: Obtain E[x2 i ] and apply a law of large numbers.] 113
  • 144. LINEAR MODELS (c) Obtain σ2 0 = V[ui ], where the expectation is with respect to all stochastic vari- ables in the model. (d) Obtain plim N−1 X Ω0X = lim N−1 E[X Ω0X], where Ω0 = Diag[V[ui |xi ]]. (e) Using answers to the preceding parts give the default OLS result (4.22) for the variance matrix in the limit distribution of √ N( βOLS − β0), ignoring potential heteroskedasticity. Your ultimate answer should be numerical. (f) Now give the variance in the limit distribution of √ N( βOLS − β0), taking ac- count of any heteroskedasticity. Your ultimate answer should be numerical. (g) Do any differences between answers to parts (e) and (f) accord with your prior beliefs? 4–4 Consider the linear regression model with scalar regressor yi = βxi + ui with data (yi , xi ) iid over i though the error may be conditionally heteroskedastic. (a) Show that ( βOLS − β) = (N−1 i x2 i )−1 N−1 i xi ui . (b) Apply Kolmogorov law of large numbers (Theorem A.8) to the averages of x2 i and xi ui to show that βOLS p → β. State any additional assumptions made on the dgp for xi and ui . (c) Apply the Lindeberg-Levy central limit theorem (Theorem A.14) to the aver- ages of xi ui to show that N−1 i xi ui /N−2 i E[u2 i x2 i ] p → N[0, 1]. State any additional assumptions made on the dgp for xi and ui . (d) Use the product limit normal rule (Theorem A.17) to show that part (c) implies N−1/2 i xi ui p → N[0, limN−1 i E[u2 i x2 i ]]. State any assumptions made on the dgp for xi and ui . (e) Combine results using (2.14) and the product limit normal rule (Theorem A.17) to obtain the limit distribution of β. 4–5 Consider the linear regression model y = Xβ + u. (a) Obtain the formula for β that minimizes Q(β) = u Wu, where W is of full rank. [Hint: The chain rule for matrix differentiation for column vectors x and z is ∂ f (x)/∂x = (∂z /∂x) × (∂ f (z)/∂z), for f (x) = f (g(x)) = f (z) where z =g(x)]. (b) Show that this simplifies to the OLS estimator if W = I. (c) Show that this gives the GLS estimator if W = Ω−1 . (d) Show that this gives the 2SLS estimator if W = Z(Z Z)−1 Z . 4–6 Consider IV estimation (Section 4.8) of the model y = x β + u using instruments z in the just-identified case with Z an N × K matrix of full rank. (a) What essential assumptions must z satisfy for the IV estimator to be consis- tent for β? Explain. (b) Show that given just identification the 2SLS estimator defined in (4.53) re- duces to the IV estimator given in (4.51). (c) Give a real-world example of a situation where IV estimation is needed be- cause of inconsistency of OLS, and specify suitable instruments. 4–7 (Adapted from Nelson and Startz, 1990.) Consider the three-equation model, y = βx + u; x = λu + ε; z = γ ε + v, where the mutually independent errors u, ε, and v are iid normal with mean 0 and variances, respectively, σ2 u , σ2 ε , and σ2 v . (a) Show that plim( βOLS − β) = λσ2 u / λ2 σ2 u + σ2 ε . (b) Show that ρ2 XZ = γ σ2 ε /(λ2 σ2 u + σ2 ε )(γ 2 σ2 ε + σ2 v ). (c) Show that βIV = mzy/mzx = β + mzu/ (λmzu + mzε), where, for example, mzy = i zi yi . 114
  • 145. 4.11. BIBLIOGRAPHIC NOTES (d) Show that βIV − β → 1/λ as γ (or ρxz) → 0. (e) Show that βIV − β → ∞ as mzu → −γ σ2 ε /λ. (f) What do the last two results imply regarding finite-sample biases and the moments of βIV − β when the instruments are poor? 4–8 Select a 50% random subsample of the Section 4.6.4 data on log health expen- diture (y) and log total expenditure (x). (a) Obtain OLS estimates and contrast usual and White standard errors for the slope coefficient. (b) Obtain median regression estimates and compare these to the OLS esti- mates. (c) Obtain quantile regression estimates for q = 0.25 and q = 0.75. (d) Reproduce Figure 4.2 using your answers from parts (a)–(c). 4–9 Select a 50% random subsample of the Section 4.9.6 data on earnings and edu- cation, and reproduce as much of Table 4.5 as possible and provide appropriate interpretation. 115
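A possible starting point for Exercise 4-8(a) is sketched below: OLS coefficients together with both the usual and the White heteroskedasticity-robust standard errors. The arrays y and x are assumed to hold the 50% subsample of log health expenditure and log total expenditure; the sampling and data-loading steps are not shown and the variable names are placeholders.

```python
import numpy as np

def ols_with_robust_se(y, X):
    """OLS with default and White heteroskedasticity-robust standard errors."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    u = y - X @ b
    s2 = (u @ u) / (N - K)
    V_default = s2 * XtX_inv                              # usual OLS variance
    meat = (X * (u**2)[:, None]).T @ X                    # X' diag(u_i^2) X
    V_white = XtX_inv @ meat @ XtX_inv                    # White-type robust variance
    return b, np.sqrt(np.diag(V_default)), np.sqrt(np.diag(V_white))

# Usage (hypothetical names):
# X = np.column_stack([np.ones(len(x)), x])
# b, se_default, se_white = ols_with_robust_se(y, X)
```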
  • 146. C H A P T E R 5 Maximum Likelihood and Nonlinear Least-Squares Estimation 5.1. Introduction A nonlinear estimator is one that is a nonlinear function of the dependent variable. Most estimators used in microeconometrics, aside from the OLS and IV estimators in the linear regression model presented in Chapter 4, are nonlinear estimators. Nonlin- earity can arise in many ways. The conditional mean may be nonlinear in parameters. The loss function may lead to a nonlinear estimator even if the conditional mean is linear in parameters. Censoring and truncation also lead to nonlinear estimators even if the original model has conditional mean that is linear in parameters. Here we present the essential statistical inference results for nonlinear estimation. Very limited small-sample results are available for nonlinear estimators. Statistical in- ference is instead based on asymptotic theory that is applicable for large samples. The estimators commonly used in microeconometrics are consistent and asymptotically normal. The asymptotic theory entails two major departures from the treatment of the linear regression model given in an introductory graduate course. First, alternative methods of proof are needed since there is no direct formula for most nonlinear estimators. Second, the asymptotic distribution is generally obtained under the weakest distri- butional assumptions possible. This departure was introduced in Section 4.4 to permit heteroskedasticity-robust inference for the OLS estimator. Under such weaker assump- tions the default standard errors reported by a simple regression program are invalid. Some care is needed, however, as these weaker assumptions can lead to inconsistency of the estimator itself, a much more fundamental problem. As much as possible the presentation here is expository. Definitions of conver- gence in probability and distribution, laws of large numbers (LLN), and central limit theorems (CLT) are presented in many texts, and here these topics are relegated to Appendix A. Applied researchers rarely aim to formally prove consistency and asymp- totic normality. It is not unusual, however, to encounter data applications with estima- tion problems sufficiently recent or complex as to demand reading recent econometric journal articles. Then familiarity with proofs of consistency and asymptotic normality 116
  • 147. 5.2. OVERVIEW OF NONLINEAR ESTIMATORS is very helpful, especially to obtain a good idea in advance of the likely form of the variance matrix of the estimator. Section 5.2 provides an overview of key results. A more formal treatment of extremum estimators that maximize or minimize any objective function is given in Sec- tion 5.3. Estimators based on estimating equations are defined and presented in Sec- tion 5.4. Statistical inference based on robust standard errors is presented briefly in Section 5.5, with complete treatment deferred to Chapter 7. Maximum likelihood es- timation and quasi-maximum likelihood estimation are presented in Sections 5.6 and 5.7. Nonlinear least-squares estimation is given in Section 5.8. Section 5.9 presents a detailed example. The remaining leading parametric estimation procedures – generalized method of moments and nonlinear instrumental variables – are given separate treatment in Chapter 6. 5.2. Overview of Nonlinear Estimators This section provides a summary of asymptotic properties of nonlinear estimators, given more rigorously in Section 5.3, and presents ways to interpret regression co- efficients in nonlinear models. The material is essential for understanding use of the cross-section and panel data models presented in later chapters. 5.2.1. Poisson Regression Example It is helpful to introduce a specific example of nonlinear estimation. Here we consider Poisson regression, analyzed in more detail in Chapter 20. The Poisson distribution is appropriate for a dependent variable y that takes only nonnegative integer values 0, 1, 2, . . . . It can be used to model the number of occur- rences of an event, such as number of patent applications by a firm and number of doctor visits by an individual. The Poisson density, or more formally the Poisson probability mass function, with rate parameter λ is f (y|λ) = e−λ λy /y!, y = 0, 1, 2, . . . , where it can be shown that E[y] = λ and V[y] = λ. A regression model specifies the parameter λ to vary across individuals according to a specific function of regressor vector x and parameter vector β. The usual Poisson specification is λ = exp(x β), which has the advantage of ensuring that the mean λ 0. The density of the Poisson regression model for a single observation is therefore f (y|x, β) = e− exp(x β) exp(x β)y /y!. (5.1) 117
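A short simulation helps fix ideas about the density (5.1). The sketch below, with illustrative parameter values and regressor design, draws y with conditional mean exp(x'beta) and checks two implications: the unconditional mean of y equals the mean of lambda, and the unconditional variance of y exceeds its mean once lambda varies with x.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000
beta = np.array([0.2, 0.5])                    # illustrative coefficients
X = np.column_stack([np.ones(N), rng.normal(size=N)])
lam = np.exp(X @ beta)                         # lambda_i = exp(x_i'beta)
y = rng.poisson(lam)                           # draws from the density (5.1)

print(y.mean(), lam.mean())                    # close: E[y] = E[lambda]
print(y.var(), lam.mean() + lam.var())         # Var[y] = E[lambda] + Var[lambda]
```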
  • 148. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Consider maximum likelihood estimation based on the sample {(yi , xi ),i = 1, . . . , N}. The maximum likelihood (ML) estimator maximizes the log-likelihood function (see Section 5.6). The likelihood function is the joint density, which given independent observations is the product i f (yi |xi , β) of the individual densities, where we have conditioned on the regressors. The log-likelihood function is then the log of a product, which equals the sum of logs, or i ln f (yi |xi , β). For the Poisson density (5.1), the log-density for the ith observation is ln f (yi |xi , β) = − exp(x i β) + yi x i β − ln yi !. So the Poisson ML estimator β maximizes QN (β) = 1 N N i=1 ! − exp(x i β) + yi x i β − ln yi ! , (5.2) where the scale factor 1/N is included so that QN (β) remains finite as N → ∞. The Poisson ML estimator is the solution to the first-order conditions ∂ QN (β)/∂β| β = 0, or 1 N N i=1 (yi − exp(x i β))xi β = 0. (5.3) There is no explicit solution for β in (5.3). Numerical methods to compute β are given in Chapter 10. In this chapter we instead focus on the statistical properties of the resulting estimate β. 5.2.2. m-Estimators More generally, we define an m-estimator θ of the q × 1 parameter vector θ as an esti- mator that maximizes an objective function that is a sum or average of N subfunctions QN (θ) = 1 N N i=1 q(yi , xi , θ), (5.4) where q(·) is a scalar function, yi is the dependent variable, xi is a regressor vector, and the results in this section assume independence over i. For simplicity yi is written as a scalar, but the results extend to vector yi and so cover multivariate and panel data and systems of equations. The objective function is subscripted by N to denote that it depends on the sample data. Throughout the book q is used to denote the dimension of θ. Note that here q is additionally being used to denote the subfunction q(·) in (5.4). Many econometrics estimators and models are m-estimators, corresponding to spe- cific functional forms for q(y, x, θ). Leading examples are maximum likelihood (see (5.39) later) and nonlinear least squares (NLS) (see (5.67) later). The Poisson ML estimator that maximizes (5.2) is an example of (5.4) with θ = β and q(y, x, β) = − exp(x β) + yx β − ln y!. We focus attention on the estimator θ that is computed as the solution to the asso- ciated first-order conditions ∂ QN (θ)/∂θ| θ = 0, or equivalently 1 N N i=1 ∂q(yi , xi , θ) ∂θ θ = 0. (5.5) 118
  • 149. 5.2. OVERVIEW OF NONLINEAR ESTIMATORS This is a system of q equations in q unknowns that generally has no explicit solution for θ. The term m-estimator, attributed to Huber (1967), is interpreted as an abbrevia- tion for maximum-likelihood-like estimator. Many econometrics authors, including Amemiya (1985, p. 105), Greene (2003, p. 461), and Wooldridge (2002, p. 344), define an m-estimator as optimizing over a sum of terms, as in (5.4). Other authors, including Serfling (1980), define an m-estimator as solutions of equations such as (5.5). Huber (1967) considered both cases and Huber (1981, p. 43) explicitly defined an m-estimator in both ways. In this book we call the former type of estimator an m-estimator and the latter an estimating equations estimator (which will be treated separately in Section 5.4). 5.2.3. Asymptotic Properties of m-Estimators The key desirable asymptotic properties of an estimator are that it be consistent and that it have an asymptotic distribution to permit statistical inference at least in large samples. Consistency The first step in determining the properties of θ is to define exactly what θ is intended to estimate. We suppose that there is a unique value of θ, denoted θ0 and called the true parameter value, that generates the data. This identification condition (see Sec- tion 2.5) requires both correct specification of the component of the dgp of interest and uniqueness of this representation. Thus for the Poisson example it may be assumed that the dgp is one with Poisson parameter exp(x β0) and x is such that x β(1) = x β(2) if and only if β(1) = β(2) . The formal notation with subscript 0 for the true parameter value is used extensively in Chapters 5 to 8. The motivation is that θ can take many different values, but interest lies in two particular values – the true value θ0 and the estimated value θ. The estimate θ will never exactly equal θ0, even in large samples, because of the intrinsic randomness of a sample. Instead, we require θ to be consistent for θ0 (see Definition A.2 in Appendix A), meaning that θ must converge in probability to θ0, denoted θ p → θ0. Rigorously establishing consistency of m-estimators is difficult. Formal results are given in Section 5.3.2 and a useful informal condition is given in Section 5.3.7. Spe- cializations to ML and NLS estimators are given in later sections. Limit Normal Distribution Given consistency, as N → ∞ the estimator θ has a distribution with all mass at θ0. As for OLS, we magnify or rescale θ by multiplication by √ N to obtain a random variable that has nondegenerate distribution as N → ∞. Statistical inference is then conducted assuming N is large enough for asymptotic theory to provide a good approximation, but not so large that θ collapses on θ0. 119
  • 150. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION We therefore consider the behavior of √ N( θ − θ0). For most estimators this has a finite-sample distribution that is too complicated to use for inference. Instead, asymp- totic theory is used to obtain the limit of this distribution as N → ∞. For most microe- conometrics estimators this limit is the multivariate normal distribution. More formally √ N( θ − θ0) converges in distribution to the multivariate normal, where convergence in distribution is defined in Appendix A. Recall from Section 4.4 that the OLS estimator can be expressed as √ N( β − β0) = 1 N N i=1 xi x i −1 1 √ N N i=1 xi ui , and the limit distribution was derived by obtaining the probability limit of the first term on the right-hand side and the limit normal distribution of the second term. The limit distribution of an m-estimator is obtained in a similar way. In Section 5.3.3 we show that for an estimator that solves (5.5) we can always write √ N( θ − θ0) = − 1 N N i=1 ∂2 qi (θ) ∂θ∂θ θ+ −1 1 √ N N i=1 ∂qi (θ) ∂θ θ0 , (5.6) where qi (θ) = q(yi , xi , θ), for some θ+ between θ and θ0, provided second derivatives and the inverse exist. This result is obtained by a Taylor series expansion. Under appropriate assumptions this yields the following limit distribution of an m-estimator: √ N( θ − θ0) d → N[0, A−1 0 B0A−1 0 ], (5.7) where A−1 0 is the probability limit of the first term in the right-hand side of (5.6), and the second term is assumed to converge to the N[0, B0] distribution. The expressions for A0 and B0 are given in Table 5.1. Asymptotic Normality To obtain the distribution of θ from the limit distribution result (5.7), divide the left- hand side of (5.7) by √ N and hence divide the variance by N. Then θ a ∼ N θ0,V[ θ] , (5.8) where a ∼ means “is asymptotically distributed as,” and V[ θ] denotes the asymptotic variance of θ with V[ θ] = N−1 A−1 0 B0A−1 0 . (5.9) A complete discussion of the term asymptotic distribution has already been given in Section 4.4.4, and is also given in Section A.6.4. The result (5.9) depends on the unknown true parameter θ0. It is implemented by computing the estimated asymptotic variance V[ θ]=N−1 A −1 B A−1 , (5.10) where A and B are consistent estimates of A0 and B0. 120
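To make the preceding formulas concrete, the following sketch estimates a Poisson regression by Newton-Raphson applied to the first-order conditions (5.3) and then forms the robust sandwich variance from the estimates of A0 and B0 given above. The data-generating process, starting values, and convergence tolerance are illustrative choices; computation is discussed in detail in Chapter 10.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10_000
beta0 = np.array([0.5, -0.8])                  # true parameters (illustrative)
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.poisson(np.exp(X @ beta0))

b = np.zeros(2)                                # starting value
for _ in range(50):                            # Newton-Raphson on (5.3)
    mu = np.exp(X @ b)
    grad = X.T @ (y - mu)                      # N times the gradient of Q_N
    hess = -(X * mu[:, None]).T @ X            # N times the Hessian of Q_N
    step = np.linalg.solve(hess, grad)
    b = b - step
    if np.max(np.abs(step)) < 1e-10:
        break

mu = np.exp(X @ b)
A_hat = -(X * mu[:, None]).T @ X / N
B_hat = (X * ((y - mu)**2)[:, None]).T @ X / N
A_inv = np.linalg.inv(A_hat)
V_robust = A_inv @ B_hat @ A_inv / N           # N^{-1} A^{-1} B A^{-1}
print(b)                                        # close to beta0
print(np.sqrt(np.diag(V_robust)))               # robust standard errors
```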
Table 5.1. Asymptotic Properties of m-Estimators (a)

Property: Algebraic Formula
Objective function: QN(θ) = N^{-1} Σi q(yi, xi, θ) is maximized wrt θ
Examples: ML: qi = ln f(yi|xi, θ) is the log-density; NLS: qi = −(yi − g(xi, θ))^2 is minus the squared error
First-order conditions: ∂QN(θ)/∂θ = N^{-1} Σi ∂q(yi, xi, θ)/∂θ|θ̂ = 0
Consistency: Is plim QN(θ) maximized at θ = θ0?
Consistency (informal): Does E[∂q(yi, xi, θ)/∂θ|θ0] = 0?
Limit distribution: √N(θ̂ − θ0) →d N[0, A0^{-1}B0A0^{-1}], where A0 = plim N^{-1} Σi ∂²qi(θ)/∂θ∂θ′|θ0 and B0 = plim N^{-1} Σi ∂qi/∂θ × ∂qi/∂θ′|θ0
Asymptotic distribution: θ̂ ~a N[θ0, N^{-1}Â^{-1}B̂Â^{-1}], where Â = N^{-1} Σi ∂²qi(θ)/∂θ∂θ′|θ̂ and B̂ = N^{-1} Σi ∂qi/∂θ × ∂qi/∂θ′|θ̂

(a) The limit distribution variance and asymptotic variance estimate are robust sandwich forms that assume independence over i. See Section 5.5.2 for other variance estimates.

The default output for many econometrics packages instead often uses a simpler estimate V̂[θ̂] = −N^{-1}Â^{-1} that is only valid in some special cases. See Section 5.5 for further discussion, including various ways to estimate A0 and B0 and then perform hypothesis tests.

The two leading examples of m-estimators are the ML and the NLS estimators. Formal results for these estimators are given in, respectively, Propositions 5.5 and 5.6. Simpler representations of the asymptotic distributions of these estimators are given in, respectively, (5.48) and (5.77).

Poisson ML Example

Like other ML estimators, the Poisson ML estimator is consistent if the density is correctly specified. However, applying (5.25) from Section 5.3.7 to (5.3) reveals that the essential condition for consistency is actually the weaker condition that E[y|x] = exp(x′β0), that is, correct specification of the mean. Similar robustness of the ML estimator to partial misspecification of the distribution holds for some other special cases detailed in Section 5.7.

For the Poisson ML estimator ∂q(β)/∂β = (y − exp(x′β))x, leading to A0 = −plim N^{-1} Σi exp(xi′β0) xi xi′ and B0 = plim N^{-1} Σi V[yi|xi] xi xi′. Then β̂ ~a N[β0, N^{-1}Â^{-1}B̂Â^{-1}], where Â = −N^{-1} Σi exp(xi′β̂) xi xi′ and B̂ = N^{-1} Σi (yi − exp(xi′β̂))^2 xi xi′.
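The formulas in Table 5.1 and in the Poisson example above map directly into a few lines of code. The following sketch is not code from the text; it is a minimal Python/NumPy illustration with simulated data and an arbitrarily chosen true β0. It fits the Poisson regression by Newton-Raphson and then forms both the simpler estimate −N^{-1}Â^{-1} and the robust sandwich estimate N^{-1}Â^{-1}B̂Â^{-1}.

import numpy as np

rng = np.random.default_rng(0)

# Simulated data with an exponential conditional mean (hypothetical dgp for illustration).
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept plus one regressor
beta0 = np.array([1.0, 0.5])
y = rng.poisson(np.exp(X @ beta0))

# Newton-Raphson on the Poisson objective Q_N(b), which is globally concave.
b = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ b)
    grad = X.T @ (y - mu) / N                # N^{-1} sum of dq_i/dbeta
    hess = -(X * mu[:, None]).T @ X / N      # N^{-1} sum of d2q_i/dbeta dbeta'
    step = np.linalg.solve(hess, grad)
    b = b - step
    if np.max(np.abs(step)) < 1e-10:
        break

# A-hat and B-hat as in Table 5.1, evaluated at the estimate b.
mu = np.exp(X @ b)
A_hat = -(X * mu[:, None]).T @ X / N
B_hat = (X * ((y - mu) ** 2)[:, None]).T @ X / N

V_default = -np.linalg.inv(A_hat) / N        # valid only in special cases with A0 = -B0
Ainv = np.linalg.inv(A_hat)
V_robust = Ainv @ B_hat @ Ainv / N           # robust sandwich estimate

print("beta_hat          :", b)
print("default std errors:", np.sqrt(np.diag(V_default)))
print("robust std errors :", np.sqrt(np.diag(V_robust)))

With genuinely Poisson data the two sets of standard errors should be close; with overdispersed count data only the sandwich version remains appropriate, which is the point made in the next paragraph.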
If the data are actually Poisson distributed, then V[y|x] = E[y|x] = exp(x′β0), leading to possible simplification since A0 = −B0 so that A0^{-1}B0A0^{-1} = −A0^{-1}. However, in most applications with count data V[y|x] > E[y|x], so it is best not to impose this restriction.

5.2.4. Coefficient Interpretation in Nonlinear Regression

An important goal of estimation is often prediction, rather than testing the statistical significance of regressors.

Marginal Effects

Interest often lies in measuring marginal effects, the change in the conditional mean of y when regressors x change by one unit. For the linear regression model, E[y|x] = x′β implies ∂E[y|x]/∂x = β, so the coefficient has a direct interpretation as the marginal effect. For nonlinear regression models this interpretation is no longer possible. For example, if E[y|x] = exp(x′β), then ∂E[y|x]/∂x = exp(x′β)β is a function of both parameters and regressors, and the size of the marginal effect depends on x in addition to β.

General Regression Function

For a general regression function E[y|x] = g(x, β), the marginal effect varies with the evaluation value of x. It is customary to present one of the estimates of the marginal effect given in Table 5.2.

Table 5.2. Marginal Effect: Three Different Estimates

Formula: Description
N^{-1} Σi ∂E[yi|xi]/∂xi : Average response of all individuals
∂E[y|x]/∂x evaluated at x = x̄ : Response of the average individual
∂E[y|x]/∂x evaluated at x = x* : Response of a representative individual with x = x*

The first estimate averages the marginal effects for all individuals. The second estimate evaluates the marginal effect at x = x̄. The third estimate evaluates at specific characteristics x = x*. For example, x* may represent a person who is female with 12 years of schooling and so on. More than one representative individual might be considered.

These three measures differ in nonlinear models, whereas in the linear model they all equal β. Even the sign of the effect may be unrelated to the sign of the parameter, with ∂E[y|x]/∂xj positive for some values of x and negative for other values of x. Considerable care must be taken in interpreting coefficients in nonlinear models.
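Once β̂ and the data are in hand, the three estimates in Table 5.2 each take one line of code. The sketch below is an illustration only: it assumes an exponential conditional mean, takes the fitted coefficient vector b as given, and uses a made-up representative individual x_star.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fitted model: E[y|x] = exp(b0 + b1*x1 + b2*x2), with b taken as given.
b = np.array([0.5, 0.3, -0.8])
X = np.column_stack([np.ones(500), rng.normal(size=500), rng.binomial(1, 0.4, 500)])

def dEdx(X, b):
    """Calculus marginal effects dE[y|x]/dx = exp(x'b) * b, one row per observation."""
    return np.exp(X @ b)[:, None] * b

# 1. Average response of all individuals (average marginal effect).
ame = dEdx(X, b).mean(axis=0)

# 2. Response of the average individual (evaluate at x = x-bar).
mem = np.exp(X.mean(axis=0) @ b) * b

# 3. Response of a representative individual with x = x* (characteristics chosen arbitrarily).
x_star = np.array([1.0, 1.0, 0.0])
mer = np.exp(x_star @ b) * b

print("Average marginal effect:", ame)
print("Effect at sample means :", mem)
print("Representative effect  :", mer)

In the linear model all three print statements would return the same vector b; here they differ because the marginal effect depends on where x is evaluated.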
  • 153. 5.2. OVERVIEW OF NONLINEAR ESTIMATORS Computer programs and applied studies often report the second of these measures. This can be useful in getting a sense for the magnitude of the marginal effect, but policy interest usually lies in the overall effect, the first measure, or the effect on a representative individual or group, the third measure. The first measure tends to change relatively little across different choices of functional form g(·), whereas the other two measures can change considerably. One can also present the full distribution of the marginal effects using a histogram or nonparametric density estimate. Single-Index Models Direct interpretation of regression coefficients is possible for single-index models that specify E[y|x] = g(x β), (5.11) so that the data and parameters enter the nonlinear mean function g(·) by way of the single index x β. Then nonlinearity is of the mild form that the mean is a nonlinear function of a linear combination of the regressors and parameters. For single-index models the effect on the conditional mean of a change in the jth regressor using cal- culus methods is ∂E[y|x] ∂xj = g (x β)βj , where g (z) = ∂g(z)/∂z. It follows that the relative effects of changes in regressors are given by the ratio of the coefficients since ∂E[y|x]/∂xj ∂E[y|x]/∂xk = βj βk , because the common factor g (x β) cancels. Thus if βj is two times βk then a one- unit change in xj has twice the effect as a one-unit change in xk. If g(·) is additionally monotonic then it follows that the signs of the coefficients give the signs of the effects, for all possible x. Single-index models are advantageous owing to their simple interpretation. Many standard nonlinear models such as logit, probit, and Tobit are of single-index form. Moreover, some choices of function g(·) permit additional interpretation, notably the exponential function considered later in this section and the logistic cdf analyzed in Section 14.3.4. Finite-Difference Method We have emphasized the use of calculus methods. The finite-difference method in- stead computes the marginal effect by comparing the conditional mean when xj is increased by one unit with the value before the increase. Thus
ΔE[y|x]/Δxj = g(x + ej, β) − g(x, β), where ej is a vector with jth entry one and other entries zero.
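For a concrete numerical comparison of the calculus and finite-difference measures, a short sketch follows. It is illustrative only, uses the exponential conditional mean E[y|x] = exp(x′β) discussed later in this section, and picks arbitrary values for β and the evaluation point x.

import numpy as np

# Exponential conditional mean E[y|x] = exp(x'b); compare dE[y|x]/dx_j with the
# one-unit finite difference E[y|x + e_j] - E[y|x] at an illustrative point.
b = np.array([1.0, 0.2])          # hypothetical coefficients; b_j = 0.2
x = np.array([1.0, 1.5])
j = 1

calculus = np.exp(x @ b) * b[j]                       # exp(x'b) * b_j
e_j = np.zeros_like(x); e_j[j] = 1.0
finite_diff = np.exp((x + e_j) @ b) - np.exp(x @ b)   # exp(x'b) * (e^{b_j} - 1)

print("calculus effect        :", calculus)
print("finite-difference effect:", finite_diff)
print("proportionate change    :", b[j], "vs", np.expm1(b[j]))   # 0.20 vs 0.2214

The two agree only when the change in xj is small; for a full one-unit change the finite-difference measure reflects the exact change in the conditional mean.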
  • 156. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION For the linear model finite-difference and calculus methods lead to the same es- timated effects, since
ΔE[y|x]/Δxj = (x′β + βj) − x′β = βj.

For nonlinear models, however, the two approaches give different estimates of the marginal effect, unless the change in xj is infinitesimally small. Often calculus methods are used for continuous regressors and finite-difference methods are used for integer-valued regressors, such as a (0, 1) indicator variable.

Exponential Conditional Mean

As an example, consider coefficient interpretation for an exponential conditional mean function, so that E[y|x] = exp(x′β). Many count and duration models use the exponential form.

A little algebra yields ∂E[y|x]/∂xj = E[y|x] × βj. So the parameters can be interpreted as semi-elasticities, with a one-unit change in xj increasing the conditional mean by the multiple βj. For example, if βj = 0.2 then a one-unit change in xj is predicted to lead to a 0.2 times proportionate increase in E[y|x], or an increase of 20%. If instead the finite-difference method is used, the marginal effect is computed as

ΔE[y|x]/Δxj = exp(x′β + βj) − exp(x′β) = exp(x′β)(e^βj − 1).

This differs from the calculus result unless βj is small, so that e^βj ≈ 1 + βj. For example, if βj = 0.2 the increase is 22.14% rather than 20%.

5.3. Extremum Estimators

This section is intended for use in an advanced graduate course in microeconometrics. It presents the key results on consistency and asymptotic normality of extremum estimators, a very general class of estimators that minimize or maximize an objective function. The presentation is very condensed. A more complete understanding requires an advanced treatment such as that in Amemiya (1985), the basis of the treatment here, or in Newey and McFadden (1994).

5.3.1. Extremum Estimators

For cross-section analysis of a single dependent variable the sample is one of N observations, {(yi, xi), i = 1, ..., N}, on a dependent variable yi and a column vector xi of regressors. In matrix notation the sample is (y, X), where y is an N × 1 vector with ith entry yi and X is a matrix with ith row xi′, as defined more completely in Section 1.6. Interest lies in estimating the q × 1 parameter vector θ = [θ1, ..., θq]′. The value θ0, termed the true parameter value, is the particular value of θ in the process that generated the data, called the data-generating process.

We consider estimators θ̂ that maximize over θ ∈ Θ the stochastic objective function QN(θ) = QN(y, X, θ), where for notational simplicity the dependence of QN(θ)
  • 161. 5.3. EXTREMUM ESTIMATORS on the data is indicated only via the subscript N. Such estimators are called extremum estimators, since they solve a maximization or minimization problem. The extremum estimator may be a global maximum, so θ = arg max θ∈Θ QN (θ). (5.12) Usually the extremum estimator is a local maximum, computed as the solution to the associated first-order conditions ∂ QN (θ) ∂θ θ = 0, (5.13) where ∂ QN (θ)/∂θ is a q × 1 column vector with kth entry ∂ QN (θ)/∂θk. The lo- cal maximum is emphasized because it is the local maximum that may be asymp- totic normal distributed. The local and global maxima coincide if QN (θ) is globally concave. There are two leading examples of extremum estimators. For m-estimators consid- ered in this chapter, notably ML and NLS estimators, QN (θ) is a sample average such as average of squared residuals. For the generalized method of moments estimator (see Section 6.3) QN (θ) is a quadratic form in sample averages. For concreteness the discussion focuses on single-equation cross-section regression. But the results are quite general and apply to any estimator based on optimization that satisfies properties given in this section. In particular there is no restriction to a scalar dependent variable and several authors use the notation zi in place of (yi , xi ). Then QN (θ) equals QN (Z, θ) rather than QN (y, X, θ). 5.3.2. Formal Consistency Theorems We first consider parameter identification, introduced in Section 2.5. Intuitively the parameter θ0 is identified if the distribution of the data, or feature of the distribution of interest, is determined by θ0 whereas any other value of θ leads to a different distribu- tion. For example, in linear regression we required E[y|X] = Xβ0 and Xβ(1) = Xβ(2) if and only if β(1) = β(2) . An estimation procedure may not identify θ0. For example, this is the case if the es- timation procedure omits some relevant regressors. We say that an estimation method identifies θ0 if the probability limit of the objective function, taken with respect to the dgp with parameter θ = θ0, is maximized uniquely at θ = θ0. This identification condition is an asymptotic one. Practical estimation problems that can arise in a finite sample are discussed in Chapter 10. Consistency is established in the following manner. As N → ∞ the stochastic ob- jective function QN (θ), an average in the case of m-estimation, may converge in prob- ability to a limit function, denoted Q0(θ), that in the simplest case is nonstochas- tic. The corresponding maxima (global or local) of QN (θ) and Q0(θ) should then occur for values of θ close to each other. Since the maximum of QN (θ) is θ by definition, it follows that θ converges in probability to θ0 provided θ0 maximizes Q0(θ). 125
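The informal argument just given, that the maximizer of QN(θ) should settle near the maximizer of Q0(θ) as N grows, can be visualized by simulation. The sketch below is a rough numerical illustration rather than part of the formal argument; the scalar Poisson design, the parameter grid, and the sample sizes are arbitrary choices made for the illustration.

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
theta0 = 0.5
grid = np.linspace(-0.5, 1.5, 201)

def Q_N(y, x, theta_grid):
    """Poisson objective Q_N(theta) = N^{-1} sum { -e^{x*theta} + y*x*theta - ln y! }."""
    xt = np.outer(x, theta_grid)                          # N x G matrix of x_i * theta
    ll = -np.exp(xt) + y[:, None] * xt - gammaln(y + 1)[:, None]
    return ll.mean(axis=0)

for N in (100, 1000, 10000):
    x = rng.normal(size=N)
    y = rng.poisson(np.exp(theta0 * x))
    theta_max = grid[np.argmax(Q_N(y, x, grid))]
    print(f"N = {N:6d}   argmax of Q_N over the grid = {theta_max:+.3f}   (theta0 = {theta0})")

As N increases the grid maximizer of QN(θ) settles near θ0, the maximizer of the limit function Q0(θ), which is the content of condition (iii) in the theorems that follow.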
  • 162. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Clearly, consistency and identification are closely related, and Amemiya (1985, p. 230) states that a simple approach is to view identification to mean existence of a consistent estimator. For further discussion see Newey and McFadden (1994, p. 2124) and Deistler and Seifert (1978). Key applications of this approach include Jennrich (1969) and Amemiya (1973). Amemiya (1985) and Newey and McFadden (1994) present quite general theorems. These theorems require several assumptions, including smoothness (continuity) and existence of necessary derivatives of the objective function, assumptions on the dgp to ensure convergence of QN (θ) to Q0(θ), and maximization of Q0(θ) at θ = θ0. Different consistency theorems use slightly different assumptions. We present two consistency theorems due to Amemiya (1985), one for a global maximum and one for a local maximum. The notation in Amemiya’s theorems has been modified as Amemiya (1985) defines the objective function without the normal- ization 1/N present in, for example, (5.4). Theorem 5.1 (Consistency of Global Maximum) (Amemiya, 1985, Theo- rem 4.1.1): Make the following assumptions: (i) The parameter space Θ is a compact subset of Rq . (ii) The objective function QN (θ) is a measurable function of the data for all θ ∈ Θ, and QN (θ) is continuous in θ ∈ Θ. (iii) QN (θ) converges uniformly in probability to a nonstochastic function Q0(θ), and Q0(θ) attains a unique global maximum at θ0. Then the estimator θ = arg maxθ∈Θ QN (θ) is consistent for θ0, that is, θ p → θ0. Uniform convergence in probability of QN (θ) to Q0(θ) = plim QN (θ) (5.14) in condition (iii) means that supθ∈Θ |QN (θ) − Q0(θ)| p → 0. For a local maximum, first derivatives need to exist, but one need then only consider the behavior of QN (θ) and its derivative in the neighborhood of θ0. Theorem 5.2 (Consistency of Local Maximum) (Amemiya, 1985, Theo- rem 4.1.2): Make the following assumptions: (i) The parameter space Θ is an open subset of Rq . (ii) QN (θ) is a measurable function of the data for all θ ∈ Θ, and ∂ QN (θ)/∂θ exists and is continuous in an open neighborhood of θ0. (iii) The objective function QN (θ) converges uniformly in probability to Q0(θ) in an open neighborhood of θ0, and Q0(θ) attains a unique local maximum at θ0. Then one of the solutions to ∂ QN (θ)/∂θ = 0 is consistent for θ0. An example of use of Theorem 5.2 is given later in Section 5.3.4. 126
  • 163. 5.3. EXTREMUM ESTIMATORS Condition (i) in Theorem 5.1 permits a global maximum to be at the boundary of the parameter space, whereas in Theorem 5.2 a local maximum has to be in the interior of the parameter space. Condition (ii) in Theorem 5.2 also implies continuity of QN (θ) in the open neighborhood of θ0, where a neighborhood N(θ0) of θ0 is open if and only if there exists a ball with center θ0 entirely contained in N(θ0). In both theorems condition (iii) is the essential condition. The maximum, global or local, of Q0(θ) must occur at θ = θ0. The second part of (iii) provides the identification condition that θ0 has a meaningful interpretation and is unique. For a local maximum, analysis is straightforward if there is only one local maxi- mum. Then θ is uniquely defined by ∂ QN (θ)/∂θ| θ = 0. When there is more than one local maximum, the theorem simply says that one of the local maxima is consistent, but no guidance is given as to which one is consistent. It is best in such cases to con- sider the global maximum and apply Theorem 5.1. See Newey and McFadden (1994, p. 2117) for a discussion. An important distinction is made between model specification, reflected in the choice of objective function QN (θ), and the actual dgp of (y, X) used in obtaining Q0(θ) in (5.14). For some dgps an estimator may be consistent, whereas for other dgps an estimator may be inconsistent. In some cases, such as the Poisson ML and OLS es- timators, consistency arises under a wide range of dgps provided the conditional mean is correctly specified. In other cases consistency requires stronger assumptions on the dgp such as correct specification of the density. 5.3.3. Asymptotic Normality Results on asymptotic normality are usually restricted to the local maximum of QN (θ). Then θ solves (5.13), which in general is nonlinear in θ and has no explicit solution for θ. Instead, we replace the left-hand side of this equation by a linear function of θ, by use of a Taylor series expansion, and then solve for θ. The most often used version of Taylor’s theorem is an approximation with a re- mainder term. Here we instead consider an exact first-order Taylor expansion. For the differentiable function f (·) there always exists a point x+ between x and x0 such that f (x) = f (x0) + f (x+ )(x − x0), where f (x) = ∂ f (x)/∂x is the derivative of f (x). This result is also known as the mean value theorem. Application to the current setting requires several changes. The scalar function f (·) is replaced by a vector function f(·) and the scalar arguments x, x0, and x+ are replaced by the vectors θ, θ0, and θ+ . Then f( θ) = f(θ0) + ∂f(θ) ∂θ θ+ ( θ − θ0), (5.15) where ∂f(θ)/∂θ is a matrix, for some unknown θ+ between θ and θ0, and formally θ+ differs for each row of this matrix (see Newey and McFadden, 1994, p. 2141). For the local extremum estimator the function f(θ) = ∂ QN (θ)/∂θ is already a first 127
  • 164. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION derivative. Then an exact first-order Taylor series expansion around θ0 yields ∂ QN (θ) ∂θ θ = ∂ QN (θ) ∂θ θ0 + ∂2 QN (θ) ∂θ∂θ θ+ ( θ − θ0), (5.16) where ∂2 QN (θ)/∂θ∂θ is a q × q matrix with ( j, k)th entry ∂2 QN (θ)/∂θj ∂θk, and θ+ is a point between θ and θ0. The first-order conditions set the left-hand side of (5.16) to zero. Setting the right- hand side to 0 and solving for ( θ − θ0) yields √ N( θ − θ0) = − ∂2 QN (θ) ∂θ∂θ θ+ −1 √ N ∂ QN (θ) ∂θ θ0 , (5.17) where we rescale by √ N to ensure a nondegenerate limit distribution (discussed fur- ther in the following). Result (5.17) provides a solution for θ. It is of no use for numerical computation of θ, since it depends on θ0 and θ+ , both of which are unknown, but it is fine for theoretical analysis. In particular, if it has been established that θ is consistent for θ0 then the unknown θ+ converges in probability to θ0, because it lies between θ and θ0 and by consistency θ converges in probability to θ0. The result (5.17) expresses √ N( θ − θ0) in a form similar to that used to obtain the limit distribution of the OLS estimator (see Section 5.2.3). All we need do is assume a probability limit for the first term on the right-hand side of (5.17) and a limit normal distribution for the second term. This leads to the following theorem, from Amemiya (1985), for an extremum esti- mator satisfying a local maximum. Again note that Amemiya (1985) defines the ob- jective function without the normalization 1/N. Also, Amemiya defines A0 and B0 in terms of limE rather than plim. Theorem 5.3 (Limit Distribution of Local Maximum) (Amemiya, 1985, The- orem 4.1.3): In addition to the assumptions of the preceding theorem for consis- tency of the local maximum make the following assumptions: (i) ∂2 QN (θ)/∂θ∂θ exists and is continuous in an open convex neighborhood of θ0. (ii) ∂2 QN (θ)/∂θ∂θ θ+ converges in probability to the finite nonsingular matrix A0 = plim ∂2 QN (θ)/∂θ∂θ θ0 (5.18) for any sequence θ+ such that θ+ p → θ0. (iii) √ N ∂ QN (θ)/∂θ|θ0 d → N[0, B0], where B0 = plim N ∂ QN (θ)/∂θ × ∂ QN (θ)/∂θ θ0 . (5.19) Then the limit distribution of the extremum estimator is √ N( θ − θ0) d → N[0, A−1 0 B0A−1 0 ], (5.20) where the estimator θ is the consistent solution to ∂ QN (θ)/∂θ = 0. 128
  • 165. 5.3. EXTREMUM ESTIMATORS The proof follows directly from the Limit Normal Product Rule (Theorem A.17) applied to (5.17). Note that the proof assumes that consistency of θ has already been established. The expressions for A0 and B0 given in Table 5.1 are specializations to the case QN (θ) = N−1 i qi (θ) with independence over i. The probability limits in (5.18) and (5.19) are obtained with respect to the dgp for (y, X). In some applications the regressors are assumed to be nonstochastic and the expectation is with respect to y only. In other cases the regressors are treated as stochastic and the expectations are then with respect to both y and X. 5.3.4. Poisson ML Estimator Asymptotic Properties Example We formally prove consistency and asymptotic normality of the Poisson ML estimator, under exogenous stratified sampling with stochastic regressors so that (yi , xi ) are inid, without necessarily assuming that yi is Poisson distributed. The key step to prove consistency is to obtain Q0(β) = plim QN (β) and verify that Q0(β) attains a maximum at β = β0. For QN (β) defined in (5.1), we have Q0(β) = plimN−1 i # −ex i β + yi x i β − ln yi ! $ = lim N−1 i # −E ex i β + E[yi x i β] − E [ln yi !] $ = lim N−1 i # −E ex i β + E ex i β0 x i β − E [ln yi !] $ . The second equality assumes a law of large numbers can be applied to each term. Since (yi , xi ) are inid, the Markov LLN (Theorem A.8) can be applied if each of the expected values given in the second line exists and additionally the corresponding (1 + δ)th absolute moment exists for some δ 0 and the side condition given in Theorem A.8 is satisfied. For example, set δ = 1 so that second moments are used. The third line requires the assumption that the dgp is such that E[y|x] = exp(x β0). The first two expectations in the third line are with respect to x, which is stochastic. Note that Q0(β) depends on both β and β0. Differentiating with respect to β, and assuming that limits, derivatives, and expectations can be interchanged, we get ∂ Q0(β) ∂β = − lim N−1 i E ex i β xi + lim N−1 i E ex i β0 xi , where the derivative of E[ln y!] with respect to β is zero since E[ln y!] will depend on β0, the true parameter value in the dgp, but not on β. Clearly, ∂ Q0(β)/∂β = 0 at β = β0 and ∂2 Q0(β)/∂β∂β = − lim N−1 i E exp(x i β)xi x i is negative definite, so Q0(β) attains a local maximum at β = β0 and the Poisson ML estimator is consistent by Theorem 5.2. Since here QN (β) is globally concave the local maximum equals the global maximum and consistency can also be established using Theorem 5.1. For asymptotic normality of the Poisson ML estimator, the exact first-order Taylor series expansion of the Poisson ML estimator first-order conditions (5.3) yields √ N( β − β0) = − −N−1 i ex i β+ xi x i −1 N−1/2 i (yi − ex i β0 )xi , (5.21) 129
  • 166. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION for some unknown β+ between β and β0. Making sufficient assumptions on regressors x so that the Markov LLN can be applied to the first term, and using β+ p → β0 since β p → β0, we have −N−1 i ex i β+ xi x i p → A0 = − lim N−1 i E[ex i β0 xi x i ]. (5.22) For the second term in (5.21) begin by assuming scalar regressor x. Then X = (y − exp(xβ0))x has mean E[X] = 0, as E[y|x] = exp(xβ0) has already been assumed for consistency, and variance V[X] =E V[y|x]x2 . The Liapounov CLT (Theorem A.15) can be applied if the side condition involving a (2 + δ)th absolute moment of y − exp(xβ0))x is satisfied. For this example with y ≥ 0 it is sufficient to assume that the third moment of y exists, that is, δ = 1, and x is bounded. Applying the CLT gives ZN = i (yi − eβ0xi )xi % i E V[yi |xi ]x2 i d → N[0, 1], so N−1/2 i (yi − eβ0xi )xi d → N 0, lim N−1 i E V[yi |xi ]xi 2 , assuming the limit in the expression for the asymptotic variance exists. This result can be extended to the vector regressor case using the Cramer–Wold device (see Theo- rem A.16). Then N−1/2 i (yi − ex i β0 )xi d → N 0, B0 = lim N−1 i E V[yi |xi ]xi x i . (5.23) Thus (5.21) yields √ N( β − β0) d → N[0, A−1 0 B0A−1 0 ], where A0 is defined in (5.22) and B0 is defined in (5.23). Note that for this particular example y|x need not be Poisson distributed for the Poisson ML estimator to be consistent and asymptotically normal. The essential as- sumption for consistency of the Poisson ML estimator is that the dgp is such that E[y|x] = exp(x β0). For asymptotic normality the essential assumption is that V[y|x] exists, though additional assumptions on existence of higher moments are needed to permit use of LLN and CLT. If in fact V[y|x] = exp(x β0) then A0= −B0 and more simply √ N( β − β0) d → N[0, −A−1 0 ]. The results for this ML example extend to the LEF class of densities defined in Section 5.7.3. 5.3.5. Proofs of Consistency and Asymptotic Normality The assumptions made in Theorems 5.1–5.3 are quite general and need not hold in every application. These assumptions need to be verified on a case-by-case basis, in a manner similar to the preceding Poisson ML estimator example. Here we sketch out details for m-estimators. For consistency, the key step is to obtain the probability limit of QN (θ). This is done by application of an LLN because for an m-estimator QN (θ) is the average 130
  • 167. 5.3. EXTREMUM ESTIMATORS N−1 i qi (θ). Different assumptions on the dgp lead to the use of different LLNs and more substantively to different expressions for Q0(θ). Asymptotic normality requires assumptions on the dgp in addition to those required for consistency. Specifically, we need assumptions on the dgp to enable application of an LLN to obtain A0 and to enable application of a CLT to obtain B0. For an m-estimator an LLN is likely to verify condition (ii) of Theorem 5.3 as each entry in the matrix ∂2 QN (θ)/∂θ∂θ is an average since QN (θ) is an average. A CLT is likely to yield condition (iii) of Theorem 5.3, since √ N ∂ QN (θ)/∂θ|θ0 has mean 0 from the informal consistency condition (5.24) in Section 5.3.7 and finite variance E[N ∂ QN (θ)/∂θ × ∂ QN (θ)/∂θ θ0 ]. The particular CLT and LLN used to obtain the limit distribution of the estimator vary with assumptions about the dgp for (y, X). In all cases the dependent variable is stochastic. However, the regressors may be fixed or stochastic, and in the latter case they may exhibit time-series dependence. These issues have already been considered for OLS in Section 4.4.7. The common microeconometrics assumption is that regressors are stochastic with independence across observations, which is reasonable for cross-section data from na- tional surveys. For simple random sampling, the data (yi , xi ) are iid and Kolmogorov LLN and Lindeberg–Levy CLT (Theorems A.8 and A.14) can be used. Furthermore, under simple random sampling (5.18) and (5.19) then simplify to A0 = E ∂2 q(y, x, θ) ∂θ∂θ θ0 ' and B0 = E ∂q(y, x, θ) ∂θ ∂q(y, x, θ) ∂θ θ0 ' , where (y, x) denotes a single observation and expectations are with respect to the joint distribution of (y, x). This simpler notation is used in several texts. For stratified random sampling and for fixed regressors the data (yi , xi ) are inid and Markov LLN and Liapounov CLT (Theorems A.9 and A.15) need to be used. These require moment assumptions additional to those made in the iid case. In the stochastic regressors case, expectations are with respect to the joint distribution of (y, x), whereas in the fixed regressors case, such as in a controlled experiment where the level of x can be set, the expectations in (5.18) and (5.19) are with respect to y only. For time-series data the regressors are assumed to be stochastic, but they are also assumed to be dependent across observations, a necessary framework to accommo- date lagged dependent variables. Hamilton (1994) focuses on this case, which is also studied extensively by White (2001a). The simplest treatments restrict the random vari- ables (y, x) to have stationary distribution. If instead the data are nonstationary with unit roots then rates of convergence may no longer be √ N and the limit distributions may be nonnormal. Despite these important conceptual and theoretical differences about the stochastic nature of (y, x), however, for cross-section regression the eventual limit theorem is usually of the general form given in Theorem 5.3. 131
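The claim in the Poisson example of Section 5.3.4, that only correct specification of E[y|x] = exp(x′β0) is needed for consistency and limit normality with sandwich variance A0^{-1}B0A0^{-1}, can be checked by Monte Carlo. The sketch below is illustrative only: the overdispersed dgp (a negative binomial with the same conditional mean), the sample size, and the number of replications are choices made for the illustration, not anything prescribed in the text.

import numpy as np

rng = np.random.default_rng(3)
beta0, N, R = 0.5, 1000, 500

def fit_poisson(y, x):
    """Scalar Poisson ML for E[y|x] = exp(x*b), by Newton-Raphson."""
    b = 0.0
    for _ in range(50):
        mu = np.exp(x * b)
        step = np.sum((y - mu) * x) / np.sum(mu * x * x)
        b += step
        if abs(step) < 1e-12:
            break
    return b

b_hat = np.empty(R)
se_rob = np.empty(R)
for r in range(R):
    x = rng.normal(size=N)
    mu = np.exp(beta0 * x)
    # Overdispersed dgp: negative binomial with mean mu and variance mu + mu^2/k, not Poisson.
    k = 2.0
    y = rng.negative_binomial(k, k / (k + mu))
    b = fit_poisson(y, x)
    mu_hat = np.exp(x * b)
    A = -np.mean(mu_hat * x * x)
    B = np.mean((y - mu_hat) ** 2 * x * x)
    b_hat[r] = b
    se_rob[r] = np.sqrt(B / (A * A) / N)     # scalar sandwich: N^{-1} A^{-2} B

print("mean of b_hat across replications:", b_hat.mean(), " (beta0 =", beta0, ")")
print("Monte Carlo sd of b_hat          :", b_hat.std())
print("average robust standard error    :", se_rob.mean())

Even though the data are not Poisson, the estimates center on β0 and the Monte Carlo standard deviation is close to the average sandwich standard error, consistent with the limit result of Theorem 5.3.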
  • 168. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION 5.3.6. Discussion The form of the variance matrix in (5.20) is called the sandwich form, with B0 sand- wiched between A−1 0 and A−1 0 . The sandwich form, introduced in Section 4.4.4, will be discussed in more detail in Section 5.5.2. The asymptotic results can be extended to inconsistent estimators. Then θ0 is re- placed by the pseudo-true value θ∗ , defined to be that value of θ that yields the local maximum of Q0(θ). This is considered in further detail for quasi-ML estimation in Section 5.7.1. In most cases, however, the estimator is consistent and in later chapters the subscript 0 is often dropped to simplify notation. In the preceding results the objective function QN (θ) is initially defined with nor- malization by 1/N, the first derivative of QN (θ) is then normalized by √ N, and the second derivative is not normalized, leading to a √ N-consistent estimator. In some cases alternative normalizations may be needed, most notably time series with nonsta- tionary trend. The results assume that QN (θ) is a continuous differentiable function. This excludes some estimators such as least absolute deviations, for which QN (θ) = N−1 i |yi − x i β|. One way to proceed in this case is to obtain a differentiable ap- proximating function Q∗ N (θ) such that Q∗ N (θ) − QN (θ) p → 0 and apply the preceding theorem to Q∗ N (θ). The key component to obtaining the limit distribution is linearization using a Taylor series expansion. Taylor series expansions can be a poor global approximation to a function. They work well in the statistical application here as the approximation is asymptotically a local one, since consistency implies that for large sample sizes θ is close to the point of expansion θ0. More refined asymptotic theory is possible using the Edgeworth expansion (see Section 11.4.3). The bootstrap (see Chapter 11) is a method to empirically implement an Edgeworth expansion. 5.3.7. Informal Approach to Consistency of an m-Estimator For the practitioner the limit normal result of Theorem 5.3 is much easier to prove than formal proof of consistency using Theorem 5.1 or 5.2. Here we present an informal approach to determining the nature and strength of distributional assumptions needed for an m-estimator to be consistent. For an m-estimator that is a local maximum, the first-order conditions (5.4) imply that θ is chosen so that the average of ∂qi (θ)/∂θ| θ equals zero. Intuitively, a necessary condition for this to yield a consistent estimator for θ0 is that in the limit the average of ∂q(θ)/∂θ|θ0 goes to 0, or that plim ∂ QN (θ) ∂θ θ0 = lim 1 N N i=1 E ∂qi (θ) ∂θ θ0 ' = 0, (5.24) where the first equality requires the assumption that a law of large numbers can be applied and expectation in (5.24) is taken with respect to the population dgp for (y, X). The limit is used as the equality need not be exact, provided any departure from zero disappears as N → ∞. For example, consistency should hold if the expectation equals 132
  • 169. 5.4. ESTIMATING EQUATIONS 1/N. The condition (5.24) provides a very useful check for the practitioner. An infor- mal approach to consistency is to look at the first-order conditions for the estimator θ and determine whether in the limit these have expectation zero when evaluated at θ = θ0. Even less formally, if we consider the components in the sum, the essential condi- tion for consistency is whether for the typical observation E ∂q(θ)/∂θ|θ0 = 0. (5.25) This condition can provide a very useful guide to the practitioner. However, it is neither a necessary nor a sufficient condition. If the expectation in (5.25) equals 1/N then it is still likely that the probability limit in (5.24) equals zero, so the condition (5.25) is not necessary. To see that it is not sufficient, consider y iid with mean µ0 estimated using just one observation, say the first observation y1. Then µ solves y1 − µ = 0 and (5.25) is satisfied. But clearly y1 p µ0 as the single observation y1 has a variance that does not go to zero. The problem is that here the plim in (5.24) does not equal limE. Formal proof of consistency requires use of theorems such as Theorem 5.1 or 5.2. For Poisson regression use of (5.25) reveals that the essential condition for consis- tency is correct specification of the conditional mean of y|x (see Section 5.2.3). Simi- larly, the OLS estimator solves N−1 i xi (yi − x i β) = 0, so from (5.25) consistency essentially requires that E x(y − x β0) = 0. This condition fails if E[y|x] = x β0, which can happen for many reasons, as given in Section 4.7. In other examples use of (5.25) can indicate that consistency will require considerably more parametric as- sumptions than correct specification of the conditional mean. To link use of (5.24) to condition (iii) in Theorem 5.2, note the following: ∂ Q0(θ)/∂θ = 0 (condition (iii) in Theorem 5.2) ⇒ ∂(plim QN (θ))/∂θ = 0 (from definition of Q0(θ)) ⇒ ∂(lim E[QN (θ)])/∂θ = 0 (as an LLN ⇒ Q0 = plimQN = lim E[QN ]) ⇒ lim ∂E[QN (θ)]/∂θ = 0 (interchanging limits and differentiation), and ⇒ lim E[∂ QN (θ)/∂θ] = 0 (interchanging differentiation and expectation). The last line is the informal condition (5.24). However, obtaining this result re- quires additional assumptions, including restriction to local maximum, application of a law of large numbers, interchangeability of limits and differentiation, and in- terchangeability of differentiation and expectation (i.e., integration). In the scalar case a sufficient condition for interchanging differentiation and limits is limh→0 (E[QN (θ + h)] − E[QN (θ)]) /h = dE[QN (θ)]/dθ uniformly in θ. 5.4. Estimating Equations The derivation of the limit distribution given in Section 5.3.3 can be extended from a local extremum estimator to estimators defined as being the solution of an estimating equation that sets an average to zero. Several examples are given in Chapter 6. 133
  • 170. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION 5.4.1. Estimating Equations Estimator Let θ be defined as the solution to the system of q estimating equations hN ( θ) = 1 N N i=1 h(yi , xi , θ) = 0, (5.26) where h(·) is a q × 1 vector, and independence over i is assumed. Examples of h(·) are given later in Section 5.4.2. Since θ is chosen so that the sample average of h(y, x, θ) equals zero, we expect that θ p → θ0 if in the limit the average of h(y, x, θ0) goes to zero, that is, if plim hN (θ0) = 0. If an LLN can be applied this requires that limE[hN (θ0)] = 0, or more loosely that for the ith observation E[h(yi , xi , θ0)] = 0. (5.27) The easiest way to formally establish consistency is actually to derive (5.26) as the first-order conditions for an m-estimator. Assuming consistency, the limit distribution of the estimating equations estimator can be obtained in the same manner as in Section 5.3.3 for the extremum estimator. Take an exact first-order Taylor series expansion of hN (θ) around θ0, as in (5.15) with f(θ) = hN (θ), and set the right-hand side to 0 and solve. Then √ N( θ − θ0) = − ∂hN (θ) ∂θ θ+ −1 √ NhN (θ0). (5.28) This leads to the following theorem. Theorem 5.4 (Limit Distribution of Estimating Equations Estimator): Assume that the estimating equations estimator that solves (5.26) is consistent for θ0 and make the following assumptions: (i) ∂hN (θ)/∂θ exists and is continuous in an open convex neighborhood of θ0. (ii) ∂hN (θ)/∂θ θ+ converges in probability to the finite nonsingular matrix A0 = plim ∂hN (θ) ∂θ θ0 = plim 1 N N i=1 ∂hi (θ) ∂θ θ0 , (5.29) for any sequence θ+ such that θ+ p → θ0. (iii) √ NhN (θ0) d → N[0, B0], where B0 = plimNhN (θ0)hN (θ0) = plim 1 N N i=1 N j=1 hi (θ0)hj (θ0) . (5.30) Then the limit distribution of the estimating equations estimator is √ N( θ − θ0) d → N[0, A−1 0 B0A−1 0 ], (5.31) where, unlike for the extremum estimator, the matrix A0 may not be symmetric since it is no longer necessarily a Hessian matrix. 134
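Theorem 5.4 suggests a generic numerical recipe: solve hN(θ̂) = 0 with a root finder, estimate A0 by the Jacobian of hN at θ̂ and B0 by the average outer product of the hi, and form the sandwich. The following sketch is illustrative only; the moment function used (the first-order condition of nonlinear least squares with an exponential conditional mean), the finite-difference Jacobian, and the use of scipy.optimize.root are choices made for the illustration rather than anything specified in the text.

import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(4)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([0.5, 0.3])
y = np.exp(X @ beta0) + rng.normal(scale=1.0, size=N)    # additive error, exponential mean

def h_i(b):
    """Per-observation moment h_i(b) = (y_i - exp(x_i'b)) * exp(x_i'b) * x_i (NLS first-order condition)."""
    mu = np.exp(X @ b)
    return ((y - mu) * mu)[:, None] * X                  # N x q array

def h_N(b):
    return h_i(b).mean(axis=0)                           # q-vector of sample averages

sol = root(h_N, x0=np.zeros(2), method="hybr")
b_hat = sol.x
print("converged:", sol.success)

# A-hat: numerical Jacobian of h_N at b_hat;  B-hat: average outer product of the h_i.
eps = 1e-6
A_hat = np.column_stack([(h_N(b_hat + eps * e) - h_N(b_hat - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
H = h_i(b_hat)
B_hat = H.T @ H / N

Ainv = np.linalg.inv(A_hat)
V = Ainv @ B_hat @ Ainv.T / N                            # sandwich of Theorem 5.4
print("b_hat     :", b_hat)
print("std errors:", np.sqrt(np.diag(V)))

Because A_hat here is obtained as a Jacobian rather than as a Hessian, the transpose is kept explicit in the sandwich, which matters in applications where A0 is not symmetric.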
  • 171. 5.5. STATISTICAL INFERENCE This theorem can be proved by adaptation of Amemiya’s proof of Theorem 5.3. Note that Theorem 5.4 assumes that consistency has already been established. Godambe (1960) showed that for analysis conditional on regressors the most effi- cient estimating equations estimator sets hi (θ) = ∂ ln f (yi |xi , θ)/∂θ. Then (5.26) are the first-order conditions for the ML estimator. 5.4.2. Analogy Principle The analogy principle uses population conditions to motivate estimators. The book by Manski (1988a) emphasizes the importance of the analogy principle as a unify- ing theme for estimation. Manski (1988a, p. xi) provides the following quote from Goldberger (1968, p. 4): The analogy principle of estimation . . . proposes that population parameters be estimated by sample statistics which have the same property in the sample as the parameters do in the population. Analogue estimators are estimators obtained by application of the analogy prin- ciple. Population moment conditions suggest as estimator the solution to the corre- sponding sample moment condition. Extremum estimator examples of application of the analogy principle have been given in Section 4.2. For instance, if the goal of prediction is to minimize expected loss in the population and squared error loss is used, then the regression parameters β are estimated by minimizing the sample sum of squared errors. Method of moments estimators are also examples. For instance, in the iid case if E[yi − µ] = 0 in the population then we use as estimator µ that solves the correspond- ing sample moment conditions N−1 i (yi − µ) = 0, leading to µ = ȳ, the sample mean. An estimating equations estimator may be motivated as an analogue estimator. If (5.27) holds in the population then estimate θ by solving the corresponding sample moment condition (5.26). Estimating equations estimators are extensively used in microeconometrics. The relevant theory can be subsumed within that for generalized method of moments, presented in the next chapter, which is an extension that permits there to be more moment conditions than parameters. In applied statistics the approach is used in the context of generalized estimating equations. 5.5. Statistical Inference A detailed treatment of hypothesis tests and confidence intervals is given in Chapter 7. Here we outline how to test linear restrictions, including exclusion restrictions, using the most common method, the Wald test for estimators that may be nonlinear. Asymp- totic theory is used, so formal results lead to chi-square and normal distributions rather than the small sample F- and t-distributions from linear regression under normality. Moreover, there are several ways to consistently estimate the variance matrix of an 135
  • 172. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION extremum estimator, leading to alternative estimates of standard errors and associated test statistics and p-values. 5.5.1. Wald Hypothesis Tests of Linear Restrictions Consider testing h linearly independent restrictions, say H0 against Ha, where H0 : Rθ0 − r = 0, Ha : Rθ0 − r = 0, with R an h × q matrix of constants and r an h × 1 vector of constants. For example, if θ = [θ1, θ2, θ3] then to test whether θ10 − θ20 = 2, R = [1, −1, 0] and r = −2. The Wald test rejects H0 if R θ − r, the sample estimate of Rθ0 − r, is signifi- cantly different from 0. This requires knowledge of the distribution of R θ − r. Sup- pose √ N( θ − θ0) d → N[0, C0], where C0= A−1 0 B0A−1 0 from (5.20). Then θ a ∼ N θ0,N−1 C0 , so that under H0 the linear combination R θ − r a ∼ N 0, R(N−1 C0)R , where the mean is zero since Rθ0 − r = 0 under H0. Chi-Square Tests It is convenient to move from the multivariate normal distribution to the chi-square distribution by taking the quadratic form. This yields the Wald statistic W= (R θ − r) R(N−1 C)R −1 (R θ − r) d → χ2 (h) (5.32) under H0, where R(N−1 C0)R is of full rank h under the assumption of linearly inde- pendent restrictions, and C is a consistent estimator of C0. Large values of W lead to rejection, and H0 is rejected at level α if W χ2 α(h) and is not rejected otherwise. Practitioners frequently instead use the F-statistic F = W/h. Inference is then based on the F(h, N − q) distribution in the hope that this might provide a better finite sam- ple approximation. Note that h times the F(h, N) distribution converges to the χ2 (h) distribution as N → ∞. The replacement of C0 by C in obtaining (5.32) makes no difference asymptotically, but in finite samples different C will lead to different values of W. In the case of classical linear regression this step corresponds to replacing σ2 by s2 . Then W/h is exactly F distributed if the errors are normally distributed (see Section 7.2.1). Tests of a Single Coefficient Often attention is focused on testing difference from zero of a single coefficient, say the jth coefficient. Then Rθ − r = θj and W = θ 2 j /(N−1 cj j ), where cj j is the jth diagonal 136
  • 173. 5.5. STATISTICAL INFERENCE element in C. Taking the square root of W yields t = θ j se[ θ j ] d → N[0, 1] (5.33) under H0, where se[ θ j ] = N−1 cj j is the asymptotic standard error of θ j . Large val- ues of t lead to rejection, and unlike W the statistic t can be used for one-sided tests. Formally √ W is an asymptotic z-statistic, but we use the notation t as it yields the usual “t-statistic,” the estimate divided by its standard error. In finite samples, some statistical packages use the standard normal distribution whereas others use the t-distribution to compute critical values, p-values, and confidence intervals. Neither is exactly correct in finite samples, except in the very special case of linear regression with errors assumed to be normally distributed, in which case the t-distribution is exact. Both lead to the same results in infinitely large samples as the t-distribution then collapses to the standard normal. 5.5.2. Variance Matrix Estimation There are many possible ways to estimate A−1 0 B0A−1 0 , because there are many ways to consistently estimate A0 and B0. Thus different econometrics programs should give the same coefficient estimates but, quite reasonably, can give standard errors, t-statistics, and p-values that differ in finite samples. It is up to the practitioner to determine the method used and the strength of the associated distributional assumptions on the dgp. Sandwich Estimate of the Variance Matrix The limit distribution of √ N( θ − θ0) has variance matrix A−1 0 B0A−1 0 . It follows that θ has asymptotic variance matrix N−1 A−1 0 B0A−1 0 , where division by N arises because we are considering θ rather than √ N( θ − θ0). A sandwich estimate of the asymptotic variance of θ is any estimate of the form V[ θ] = N−1 A−1 B A−1 , (5.34) where A is consistent for A0 and B is consistent for B0. This is called the sandwich form since B is sandwiched between A −1 and A −1 . For many estimators A is a Hessian matrix so A −1 is symmetric, but this need not always be the case. A robust sandwich estimate is a sandwich estimate where the estimate B is con- sistent for B0 under relatively weak assumptions. It leads to what are termed robust standard errors. A leading example is White’s heteroskedastic-consistent estimate of the variance matrix of the OLS estimator (see Section 4.4.5). In various specific con- texts, detailed in later sections, robust sandwich estimates are called Huber estimates, after Huber (1967); Eicker–White estimates, after Eicker (1967) and White (1980a,b, 1982); and in stationary time-series applications Newey–West estimates, after Newey and West (1987b). 137
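Given any consistent estimate V̂[θ̂] of the asymptotic variance, the Wald statistic (5.32) and the t-statistic (5.33) are a few lines of linear algebra. The sketch below is purely illustrative: the values of θ̂ and V̂[θ̂] are hypothetical numbers, not estimates from any data set used in the book, and the restriction tested (θ1 = θ2) is chosen only as an example of a linear restriction Rθ0 = r.

import numpy as np
from scipy.stats import chi2, norm

# Hypothetical estimates: theta_hat and its estimated asymptotic variance V_hat = N^{-1} C_hat.
theta_hat = np.array([1.4, 3.1, -0.7])
V_hat = np.array([[0.040, 0.010, 0.000],
                  [0.010, 0.090, 0.005],
                  [0.000, 0.005, 0.025]])

# Wald test of H0: theta_1 = theta_2, written as R theta - r = 0 with R = [1, -1, 0], r = 0.
R = np.array([[1.0, -1.0, 0.0]])
r = np.array([0.0])
diff = R @ theta_hat - r
W = float(diff @ np.linalg.solve(R @ V_hat @ R.T, diff))
h = R.shape[0]
print("W =", W, "  p-value =", chi2.sf(W, df=h))

# t-statistic for H0: theta_3 = 0, asymptotically standard normal under H0.
j = 2
t = theta_hat[j] / np.sqrt(V_hat[j, j])
print("t =", t, "  two-sided p-value =", 2 * norm.sf(abs(t)))

Whether robust or non-robust standard errors are reported comes down entirely to which V̂[θ̂] is plugged into these two formulas, which is why the choice among the estimates of A0 and B0 discussed next matters in practice.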
  • 174. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Estimation of A and B Here we present different estimators for A0 and B0 for both the estimating equa- tions estimator that solves hN ( θ) = 0 and the local extremum estimator that solves ∂ QN (θ)/∂θ| θ = 0. Two standard estimates of A0 in (5.29) and (5.18) are the Hessian estimate AH = ∂hN (θ) ∂θ θ = ∂2 QN (θ) ∂θ∂θ θ , (5.35) where the second equality explains the use of the term Hessian, and the expected Hessian estimate AEH = E ∂hN (θ) ∂θ θ = E ∂2 QN (θ) ∂θ∂θ θ . (5.36) The first is analytically simpler and potentially relies on fewer distributional assump- tions; the latter is more likely to be negative definite and invertible. For B0 in (5.30) or (5.19) it is not possible to use the obvious estimate NhN ( θ)hN ( θ) , since this equals zero as θ is defined to satisfy hN ( θ) = 0. One es- timate is to make potentially strong distributional assumptions to get BE = E NhN (θ)hN (θ) θ = E N ∂ QN (θ) ∂θ ∂ QN (θ) ∂θ θ . (5.37) Weaker assumptions are possible for m-estimators and estimating equations estimators with data independent over i. Then (5.30) simplifies to B0 = E 1 N N i=1 hi (θ)hi (θ) , since independence implies that, for i = j, E hi hj = E[hi ]E[hj ], which in turn equals zero given E[hi (θ)] = 0. This leads to the outer product (OP) estimate or BHHH estimate (after Berndt, Hall, Hall, and Hausman, 1974) BOP = 1 N N i=1 hi ( θ)hi ( θ) = 1 N N i=1 ∂qi (θ) ∂θ θ ∂qi (θ) ∂θ θ . (5.38) BOP requires fewer assumptions than BE. In practice a degrees of freedom adjustment is often used in estimating B0, with division in (5.38) for BOP by (N − q) rather than N, and similar multiplication of BE in (5.37) by N/(N − q). There is no theoretical justification for this adjustment in nonlinear models, but in some simulation studies this adjustment leads to better finite- sample performance and it does coincide with the degrees of freedom adjustment made for OLS with homoskedastic errors. No similar adjustment is made for AH or AEH. Simplification occurs in some special cases with A0 = − B0. Leading examples are OLS or NLS with homoskedastic errors (see Section 5.8.3) and maximum likelihood with correctly specified distribution (see Section 5.6.4). Then either − A−1 or B−1 may be used to estimate the variance of √ N( θ − θ0). These estimates are less robust to misspecification of the dgp than those using the sandwich form. Misspecification of 138
  • 175. 5.6. MAXIMUM LIKELIHOOD the dgp, however, may additionally lead to inconsistency of θ, in which case even inference based on the robust sandwich estimate will be invalid. For the Poisson example of Section 5.2, AH = AEH = −N−1 i exp(x i β)xi x i and BOP = (N − q)−1 i (yi − exp(x i β))2 xi x i . If V[y|x] = exp(x β0), the case if y|x is actually Poisson distributed, then BE = −[N/(N − q)] AEH and simplification occurs. 5.6. Maximum Likelihood The ML estimator holds special place among estimators. It is the most efficient estima- tor among consistent asymptotically normal estimators. It is also important pedagog- ically, as many methods for nonlinear regression such as m-estimation can be viewed as extensions and adaptations of results first obtained for ML estimation. 5.6.1. Likelihood Function The Likelihood Principle The likelihood principle, due to R. A. Fisher (1922), is to choose as estimator of the parameter vector θ0 that value of θ that maximizes the likelihood of observing the ac- tual sample. In the discrete case this likelihood is the probability obtained from the probability mass function; in the continuous case this is the density. Consider the dis- crete case. If one value of θ implies that the probability of the observed data occurring is .0012, whereas a second value of θ gives a higher probability of .0014, then the second value of θ is a better estimator. The joint probability mass function or density f (y, X|θ) is viewed here as a func- tion of θ given the data (y, X). This is called the likelihood function and is denoted by LN (θ|y, X). Maximizing LN (θ) is equivalent to maximizing the log-likelihood function LN (θ) = ln LN (θ). We take the natural logarithm because in application this leads to an objective function that is the sum rather than the product of N terms. Conditional Likelihood The likelihood function LN (θ) = f (y, X|θ) = f (y|X, θ) f (X|θ) requires specification of both the conditional density of y given X and the marginal density of X. Instead, estimation is usually based on the conditional likelihood function LN (θ) = f (y|X, θ), since the goal of regression is to model the behavior of y given X. This is not a restriction if f (y|X) and f (X) depend on mutually exclusive sets of parameters. When this is the case it is common terminology to drop the adjective conditional. For rare exceptions such as endogenous sampling (see Chapters 3 and 24) consistent estimation requires that estimation is based on the full joint density f (y, X|θ) rather than the conditional density f (y|X, θ). 139
  • 176. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Table 5.3. Maximum Likelihood: Commonly Used Densities Model Range of y Density f (y) Common Parameterization Normal (−∞, ∞) [2πσ2 ]−1/2 e−(y−µ)2 /2σ2 µ = x β, σ2 = σ2 Bernoulli 0 or 1 py (1 − p)1−y Logit p = ex β /(1 + ex β ) Exponential (0, ∞) λe−λy λ = ex β or 1/λ = ex β Poisson 0, 1, 2, . . . e−λ λy /y! λ = ex β For cross-section data the observations (yi , xi ) are independent over i with condi- tional density function f (yi |xi , θ). Then by independence the joint conditional density f (y|X, θ) = N i=1 f (yi |xi , θ), leading to the (conditional) log-likelihood function QN (θ) = N−1 LN (θ) = 1 N N i=1 ln f (yi |xi , θ), (5.39) where we divide by N so that the objective function is an average. Results extend to multivariate data, systems of equations, and panel data by re- placing the scalar yi by vector yi and letting f (yi |xi , θ) be the joint density of yi conditional on xi . See also Section 5.7.5. Examples Across a wide range of data types the following method is used to generate fully parametric cross-section regression models. First choose the one-parameter or two- parameter (or in some rare cases three-parameter) distribution that would be used for the dependent variable y in the iid case studied in a basic statistics course. Then pa- rameterize the one or two underlying parameters in terms of regressors x and para- meters θ. Some commonly used distributions and parameterizations are given in Table 5.3. Additional distributions are given in Appendix B, which also presents methods to draw pseudo-random variates. For continuous data on (−∞, ∞), the normal is the standard distribution. The clas- sical linear regression model sets µ = x β and assumes σ2 is constant. For discrete binary data taking values 0 or 1, the density is always the Bernoulli, a special case of the binomial with one trial. The usual parameterizations for the Bernoulli probability lead to the logit model, given in Table 5.3, and the probit model with p = Φ(x β), where Φ(·) is the standard normal cumulative distribution function. These models are analyzed in Chapter 14. For positive continuous data on (0, ∞), notably duration data considered in Chap- ters 17–19, the richer Weibull, gamma, and log-normal models are often used in addi- tion to the exponential given in Table 5.3. For integer-valued count data taking values 0, 1, 2, . . . (see Chapter 20) the richer negative binomial is often used in addition to the Poisson presented in Section 5.2.1. Setting λ = exp(x β) ensures a positive conditional mean. 140
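The recipe just described, take a standard one- or two-parameter density and let its parameter depend on x through x′β, translates directly into a log-likelihood that can be handed to a numerical optimizer. The sketch below is an illustration using the exponential-density row of Table 5.3 with λ = exp(x′β); the simulated data and the use of scipy's BFGS routine are choices made for the illustration.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N = 3000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([0.2, -0.5])
lam0 = np.exp(X @ beta0)
y = rng.exponential(1.0 / lam0)              # exponential durations with rate lambda_i

def neg_avg_loglik(b):
    """-Q_N(beta) for the exponential model: ln f(y|x) = x'b - exp(x'b) * y."""
    xb = X @ b
    return -np.mean(xb - np.exp(xb) * y)

res = minimize(neg_avg_loglik, x0=np.zeros(2), method="BFGS")
print("beta_hat:", res.x)                    # close to beta0 in large samples

The same pattern, write ln f(yi|xi, θ) from the chosen row of Table 5.3 and average over i, applies to the Bernoulli (logit or probit) and Poisson entries as well.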
  • 177. 5.6. MAXIMUM LIKELIHOOD For incompletely observed data, censored or truncated variants of these distributions may be used. The most common example is the censored normal, which is called the Tobit model and is presented in Section 16.3. Standard likelihood-based models are rarely specified by making assumptions on the distribution of an error term. They are instead defined directly in terms of the distribution of the dependent variable. In the special case that y ∼ N[x β,σ2 ] we can equivalently define y = x β + u, where the error term u ∼ N[0,σ2 ]. However, this relies on an additive property of the normal shared by few other distributions. For example, if y is Poisson distributed with mean exp(x β) we can always write y = exp(x β) + u, but the error u no longer has a familiar distribution. 5.6.2. Maximum Likelihood Estimator The maximum likelihood estimator (MLE) is the estimator that maximizes the (con- ditional) log-likelihood function and is clearly an extremum estimator. Usually the MLE is the local maximum that solves the first-order conditions 1 N ∂LN (θ) ∂θ = 1 N N i=1 ∂ ln f (yi |xi , θ) ∂θ = 0. (5.40) More formally this estimator is the conditional MLE, as it is based on the conditional density of y given x, but it is common practice to use the simpler term MLE. The gradient vector ∂LN (θ)/∂θ is called the score vector, as it sums the first deriva- tives of the log density, and when evaluated at θ0 it is called the efficient score. 5.6.3. Information Matrix Equality The results of Section 5.3 simplify for the MLE, provided the density is correctly specified and is one for which the range of y does not depend on θ. Regularity Conditions The ML regularity conditions are that E f ∂ ln f (y|x, θ) ∂θ = ( ∂ ln f (y|x, θ) ∂θ f (y|x, θ) = 0 (5.41) and −E f ∂2 ln f (y|x, θ) ∂θ∂θ = E f ∂ ln f (y|x, θ) ∂θ ∂ ln f (y|x, θ) ∂θ , (5.42) where the notation E f [·] is used to make explicit that the expectation is with respect to the specified density f (y|x, θ). Result (5.41) implies that the score vector has expected value zero, and (5.42) yields (5.44). Derivation given in Section 5.6.7 requires that the range of y does not depend on θ so that integration and differentiation can be interchanged. 141
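For any fully specified density the two regularity conditions can be checked numerically by simulating from f(y|x, θ0) and averaging. The sketch below is illustrative only; it uses a scalar Poisson density with parameter λ = exp(θ) and Monte Carlo averaging in place of the analytic expectations.

import numpy as np

rng = np.random.default_rng(6)
theta0 = 0.7
lam0 = np.exp(theta0)
y = rng.poisson(lam0, size=1_000_000)

# Poisson log-density with lambda = exp(theta): ln f = -e^theta + y*theta - ln y!
score = y - lam0                  # d ln f / d theta, evaluated at theta0
hess = -lam0 * np.ones_like(y)    # d2 ln f / d theta2, evaluated at theta0

print("E[score]     :", score.mean())          # approximately 0, as in (5.41)
print("-E[hessian]  :", -hess.mean())          # approximately exp(theta0)
print("E[score^2]   :", (score ** 2).mean())   # approximately exp(theta0), as in (5.42)

The first average is close to zero and the last two averages are close to each other, a numerical version of the information matrix equality derived below.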
  • 178. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Information Matrix Equality The information matrix is the expectation of the outer product of the score vector, I = E ∂LN (θ) ∂θ ∂LN (θ) ∂θ . (5.43) The terminology information matrix is used as I is the variance of ∂LN (θ)/∂θ, since by (5.41) ∂LN (θ)/∂θ has mean zero. Then large values of I mean that small changes in θ lead to large changes in the log-likelihood, which accordingly contains consider- able information about θ. The quantity I is more precisely called Fisher Information, as there are alternative information measures. For log-likelihood function (5.39), the regularity condition (5.42) implies that −E f ∂2 LN (θ) ∂θ∂θ θ0 ' = E f ∂LN (θ) ∂θ ∂LN (θ) ∂θ θ0 ' , (5.44) if the expectation is with respect to f (y|x, θ0). The relationship (5.44) is called the information matrix (IM) equality and implies that the information matrix also equals −E[∂2 LN (θ)/∂θ∂θ ]. The IM equality (5.44) implies that −A0 = B0, where A0 and B0 are defined in (5.18) and (5.19). Theorem 5.3 then simplifies since A−1 0 B0A−1 0 = −A−1 0 = B−1 0 . The equality (5.42) is in turn a special case of the generalized information matrix equality E f ∂m(y, θ) ∂θ = −E f m(y, θ) ∂ ln f (y|θ) ∂θ , (5.45) where m(·) is a vector moment function with E f [m(y, θ)] = 0 and expectations are with respect to the density f (y|θ). This result, also obtained in Section 5.6.7, is used in Chapters 7 and 8 to obtain simpler forms of some test statistics. 5.6.4. Distribution of the ML Estimator The regularity conditions (5.41) and (5.42) lead to simplification of the general results of Section 5.3. The essential consistency condition (5.25) is that E[∂ ln f (y|x, θ)/∂θ|θ0 ] = 0. This holds by the regularity condition (5.41), provided the expectation is with respect to f (y|x, θ0). Thus if the dgp is f (y|x, θ0), that is, the density has been correctly speci- fied, the MLE is consistent for θ0. For the asymptotic distribution, simplification occurs since −A0 = B0 by the IM equality, which again assumes that the density is correctly specified. These results can be collected into the following proposition. Proposition 5.5 (Distribution of ML Estimator): Make the following assump- tions: (i) The dgp is the conditional density f (yi |xi , θ0) used to define the likelihood function. 142
  • 179. 5.6. MAXIMUM LIKELIHOOD (ii) The density function f (·) satisfies f (y, θ(1) ) = f (y, θ(2) ) iff θ(1) = θ(2) (iii) The matrix A0 = plim 1 N ∂2 LN (θ) ∂θ∂θ θ0 (5.46) exists and is finite nonsingular. (iv) The order of differentiation and integration of the log-likelihood can be re- versed. Then the ML estimator θML, defined to be a solution of the first-order conditions ∂N−1 LN (θ)/∂θ = 0, is consistent for θ0, and √ N( θML − θ0) d → N 0, −A−1 0 . (5.47) Condition (i) states that the conditional density is correctly specified; conditions (i) and (ii) ensure that θ0 is identified; condition (iii) is analogous to the assumption on plim N−1 X X in the case of OLS estimation; and condition (iv) is necessary for the regularity conditions to hold. As in the general case probability limits and expectations are with respect to the dgp for (y, X), or with respect to just y if regressors are assumed to be nonstochastic or analysis is conditional on X. Relaxation of condition (i) is considered in detail in Section 5.7. Most ML examples satisfy condition (iv), but it does rule out some models such as y uniformly distributed on the interval [0, θ] since in this case the range of y varies with θ. Then not only does A0 = −B0 but the global MLE converges at a rate other than √ N and has limit distribution that is nonnormal. See, for example, Hirano and Porter (2003). Given Proposition 5.5, the resulting asymptotic distribution of the MLE is often expressed as θML a ∼ N θ, − E ∂2 LN (θ) ∂θ∂θ −1 ' , (5.48) where for notational simplicity the evaluation at θ0 is suppressed and we assume that an LLN applies so that the plim operator in the definition of A0 is replaced by limE and then drop the limit. This notation is often used in later chapters. The right-hand side of (5.48) is the Cramer–Rao lower bound (CRLB), which from basic statistics courses is the lower bound of the variance of unbiased estimators in small samples. For large samples, considered here, the CRLB is the lower bound for the variance matrix of consistent asymptotically normal (CAN) estimators with con- vergence to normality of √ N( θ − θ0) uniform in compact intervals of θ0 (see Rao, 1973, pp. 344–351). Loosely speaking the MLE has the strong attraction of having the smallest asymptotic variance among root−N consistent estimators. This result re- quires the strong assumption of correct specification of the conditional density. 5.6.5. Weibull Regression Example As an example, consider regression based on the Weibull distribution, which is used to model duration data such as length of unemployment spell (see Chapter 17). 143
  • 180. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION The density for the Weibull distribution is f (y) = γ αyα−1 exp(−γ yα ), where y 0 and the parameters α 0 and γ 0. It can be shown that E[y] = γ −1/α Γ(α−1 + 1), where Γ(·) is the gamma function. The standard Weibull regression model is obtained by specifying γ = exp(x β), in which case E[y|x] = exp(−x β/α)Γ(α−1 + 1). Given independence over i the log-likelihood function is N−1 LN (θ) = N−1 i {x i β + ln α + (α − 1) ln yi − exp(x i β)yα i }. Differentiation with respect to β and α leads to the first-order conditions N−1 i {1 − exp(x i β)yα i }xi = 0, N−1 i { 1 α + ln yi − exp(x i β)yα i ln yi } = 0. Unlike the Poisson example, consistency essentially requires correct specification of the distribution. To see this, consider the first-order conditions for β. The informal condition (5.25) that E[{1 − exp(x β)yα }x] = 0 requires that E[yα |x] = exp(−x β), where the power α is not restricted to be an integer. The first-order conditions for α lead to an even more esoteric moment condition on y. So we need to proceed on the assumption that the density is indeed Weibull with γ = exp(x β0) and α = α0. Theorem 5.5 can be applied as the range of y does not de- pend on the parameters. Then, from (5.48), the Weibull MLE is asymptotically normal with asymptotic variance V β α = −E i −ex i β0 yα0 i xi x i i −ex i β0 yα0 i ln(yi )xi i −ex i β0 yα0 i ln(yi )x i i di −1 , (5.49) where di = −(1/α2 0) − ex i β0 yα0 i (ln yi )2 . The matrix inverse in (5.49) needs to be ob- tained by partitioned inversion because the off-diagonal term ∂2 LN (β,α)/∂β∂α does not have expected value zero. Simplification occurs in models with zero expected cross-derivative E[∂2 LN (β,α)/∂β∂α ] = 0, such as regression with normally dis- tributed errors, in which case the information matrix is said to be block diagonal in β and α. 5.6.6. Variance Matrix Estimation for MLE There are several ways to consistently estimate the variance matrix of an extremum estimator, as already noted in Section 5.5.2. For the MLE additional possibilities arise if the information matrix equality is assumed to hold. Then A−1 0 B0A−1 0 , −A−1 0 , and B−1 0 are all asymptotically equivalent, as are the corresponding consistent estimates of these quantities. A detailed discussion for the MLE is given in Davidson and MacKinnon (1993, chapter 18). The sandwich estimate A−1 B A−1 is called the Huber estimate, after Huber (1967), or White estimate, after White (1982), who considered the distribution of the MLE without imposing the information matrix equality. The sandwich estimate is in theory more robust than − A−1 or B−1 . It is important to note, however, that the cause of fail- ure of the information matrix equality may additionally lead to the more fundamental complication of inconsistency of θML. This is the subject of Section 5.7. 144
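The following Python sketch (not part of the book) puts Sections 5.6.5 and 5.6.6 together: it simulates data from a Weibull regression with γ = exp(x′β), obtains the MLE by numerical optimization, and then computes both the nonrobust variance −Â−1 and the Huber/White sandwich Â−1B̂Â−1. The data-generating values, the reparameterization α = exp(a), and the finite-difference Hessian are illustrative choices, not the authors' code.

```python
# Weibull regression MLE (Section 5.6.5) with nonrobust and sandwich variance
# estimates (Section 5.6.6). Simulated data; parameter values are arbitrary.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true, alpha_true = np.array([0.5, -1.0]), 1.5
gam = np.exp(X @ beta_true)                                 # gamma_i = exp(x_i'beta)
y = (rng.exponential(size=N) / gam) ** (1.0 / alpha_true)   # y^alpha ~ Exp(rate gamma)

def negloglike(theta):
    b, a = theta[:-1], np.exp(theta[-1])     # alpha = exp(a) > 0
    xb = X @ b
    return -np.sum(xb + np.log(a) + (a - 1.0) * np.log(y) - np.exp(xb) * y**a)

def scores(theta):
    # N x (K+1) matrix of per-observation scores for (beta, a), a = ln(alpha)
    b, a = theta[:-1], np.exp(theta[-1])
    lam, ly = np.exp(X @ b) * y**a, np.log(y)
    s_beta = (1.0 - lam)[:, None] * X        # d ln f / d beta
    s_a = a * (1.0 / a + ly - lam * ly)      # chain rule through alpha = exp(a)
    return np.column_stack([s_beta, s_a])

theta = minimize(negloglike, np.zeros(X.shape[1] + 1),
                 jac=lambda t: -scores(t).sum(axis=0), method="BFGS").x

S = scores(theta)                            # A_hat by differencing the mean score
K, eps = len(theta), 1e-5
A_hat = np.zeros((K, K))
for j in range(K):
    tp = theta.copy(); tp[j] += eps
    A_hat[:, j] = (scores(tp).mean(axis=0) - S.mean(axis=0)) / eps
B_hat = S.T @ S / N                          # outer product of scores

V_nonrobust = -np.linalg.inv(A_hat) / N
V_sandwich = np.linalg.inv(A_hat) @ B_hat @ np.linalg.inv(A_hat) / N
print("estimates (beta, ln alpha):", theta)
print("nonrobust se:", np.sqrt(np.diag(V_nonrobust)))
print("sandwich  se:", np.sqrt(np.diag(V_sandwich)))
```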
  • 181. 5.6. MAXIMUM LIKELIHOOD 5.6.7. Derivation of ML Regularity Conditions We now formally derive the regularity conditions stated in Section 5.6.3. For notational simplicity the subscript i and the regressor vector are suppressed. Begin by deriving the first condition (5.41). The density integrates to one, that is, ( f (y|θ)dy = 1. Differentiating both sides with respect to θ yields ∂ ∂θ ) f (y|θ)dy = 0. If the range of integration (the range of y) does not depend on θ this implies ( ∂ f (y|θ) ∂θ dy = 0. (5.50) Now ∂ ln f (y|θ)/∂θ = [∂ f (y|θ)/∂θ]/[ f (y|θ)], which implies ∂ f (y|θ) ∂θ = ∂ ln f (y|θ) ∂θ f (y|θ). (5.51) Substituting (5.51) in (5.50) yields ( ∂ ln f (y|θ) ∂θ f (y|θ)dy = 0, (5.52) which is (5.41) provided the expectation is with respect to the density f (y|θ). Now consider the second condition (5.42), initially deriving a more general result. Suppose E[m(y, θ)] = 0, for some (possibly vector) function m(·). Then when the expectation is taken with respect to the density f (y|θ) ( m(y, θ) f (y|θ)dy = 0. (5.53) Differentiating both sides with respect to θ and assuming differentiation and integra- tion are interchangeable yields ( ∂m(y, θ) ∂θ f (y|θ) + m(y, θ) ∂ f (y|θ) ∂θ dy = 0. (5.54) Substituting (5.51) in (5.54) yields ( ∂m(y, θ) ∂θ f (y|θ) + m(y, θ) ∂ ln f (y|θ) ∂θ f (y|θ) dy = 0, (5.55) or E ∂m(y, θ) ∂θ = −E m(y, θ) ∂ ln f (y|θ) ∂θ , (5.56) when the expectation is taken with respect to the density f (y|θ). The regularity con- dition (5.42) is the special case m(y, θ) = ∂ ln f (y|θ)/∂θ and leads to the IM equality (5.44). The more general result (5.56) leads to the generalized IM equality (5.45). 145
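As a small numerical complement to this derivation (not from the book), the following sketch checks conditions (5.41) and (5.44) by simulation for a Poisson density with a known parameter: the sample average of the score is near zero, and the average squared score matches minus the average second derivative. The parameter value and sample size are arbitrary.

```python
# Numerical check of the regularity condition (5.41) and the information
# matrix equality (5.44) for a Poisson density with known parameter.
import numpy as np

rng = np.random.default_rng(12345)
lam0 = 2.5
y = rng.poisson(lam0, size=200_000)

# ln f(y|lam) = -lam + y*ln(lam) - ln(y!), so:
score = -1.0 + y / lam0            # d ln f / d lam
hess = -y / lam0**2                # d^2 ln f / d lam^2

print("mean score    :", score.mean())        # ~ 0, condition (5.41)
print("mean score^2  :", np.mean(score**2))   # ~ 1/lam0 = 0.4
print("-mean Hessian :", -hess.mean())        # ~ 1/lam0 = 0.4, IM equality (5.44)
```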
  • 182. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION What happens when integration and differentiation cannot be interchanged? The starting point (5.50) no longer holds, as by the fundamental theorem of calculus the derivative with respect to θ of ∫ f(y|θ)dy includes an additional term reflecting the presence of a function of θ in the range of the integral. Then E[∂ ln f(y|θ)/∂θ] ≠ 0. What happens when the density is misspecified? Then (5.52) still holds, but it does not necessarily imply (5.41), since in (5.41) the expectation will no longer be with respect to the specified density f(y|θ). 5.7. Quasi-Maximum Likelihood The quasi-MLE θ̂QML is defined to be the estimator that maximizes a log-likelihood function that is misspecified, as the result of specification of the wrong density. Generally such misspecification leads to inconsistent estimation. In this section general properties of the quasi-MLE are presented, followed by some special cases where the quasi-MLE retains consistency. 5.7.1. Pseudo-True Value In principle any misspecification of the density may lead to inconsistency, as then the expectation in evaluation of E[∂ ln f(y|x, θ)/∂θ|θ0] (see Section 5.6.4) is no longer with respect to f(y|x, θ0). By adaptation of the general consistency proof in Section 5.3.2, the quasi-MLE θ̂QML converges in probability to the pseudo-true value θ∗ defined as θ∗ = arg max θ∈Θ (plim N−1LN(θ)). (5.57) The probability limit is taken with respect to the true dgp. If the true dgp differs from the assumed density f(y|x, θ) used to form LN(θ), then usually θ∗ ≠ θ0 and the quasi-MLE is inconsistent. Huber (1967) and White (1982) showed that the asymptotic distribution of the quasi-MLE is similar to that for the MLE, except that it is centered around θ∗ and the IM equality no longer holds. Then √N(θ̂QML − θ∗) →d N[0, A∗−1 B∗ A∗−1], (5.58) where A∗ and B∗ are as defined in (5.18) and (5.19) except that probability limits are taken with respect to the unknown true dgp and are evaluated at θ∗. Consistent estimates Â∗ and B̂∗ can be obtained as in Section 5.5.2, with evaluation at θ̂QML. This distributional result is used for statistical inference if the quasi-MLE retains consistency. If the quasi-MLE is inconsistent then usually θ∗ has no simple interpretation, aside from that given in the next section. However, (5.58) may still be useful if nonetheless there is interest in knowing the precision of estimation. The result (5.58) also provides motivation for White's information matrix test (see Section 8.2.8) and for Vuong's test for discriminating between parametric models (see Section 8.5.3). 146
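A minimal sketch of these results, assuming a deliberately wrong density: an exponential log-likelihood is maximized on gamma-distributed data. The quasi-MLE converges to the pseudo-true value λ∗ = 1/E[y], and the sandwich form in (5.58) gives a different, here more appropriate, standard error than the one that imposes the IM equality. The shape and scale values are illustrative; this is not the authors' code.

```python
# Quasi-ML with a wrong density (Section 5.7.1): an exponential log-likelihood
# ln(lam) - lam*y is maximized on gamma data.
import numpy as np

rng = np.random.default_rng(1)
N, shape, scale = 50_000, 4.0, 0.5            # gamma dgp: mean 2, variance 1
y = rng.gamma(shape, scale, size=N)

lam_hat = 1.0 / y.mean()                      # exponential quasi-MLE
print("pseudo-true lam* = 1/E[y]:", 1.0 / (shape * scale))
print("quasi-MLE lam_hat        :", lam_hat)

s = 1.0 / lam_hat - y                         # per-observation score at lam_hat
A_hat = -1.0 / lam_hat**2                     # average second derivative
B_hat = np.mean(s**2)                         # average squared score
print("se imposing IM equality:", np.sqrt(-1.0 / A_hat / N))
print("sandwich se (5.58)     :", np.sqrt(B_hat / A_hat**2 / N))
```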
  • 183. 5.7. QUASI-MAXIMUM LIKELIHOOD 5.7.2. Kullback–Leibler Distance Recall from Section 4.2.3 that if E[y|x] ≠ x′β0 then the OLS estimator can still be interpreted as the best linear predictor of E[y|x] under squared error loss. White (1982) proposed a qualitatively similar interpretation for the quasi-MLE. Let f(y|θ) denote the assumed joint density of y1, . . . , yN and let h(y) denote the true density, which is unknown, where for simplicity dependence on regressors is suppressed. Define the Kullback–Leibler information criterion (KLIC) KLIC = E[ln(h(y)/f(y|θ))], (5.59) where the expectation is with respect to h(y). KLIC takes a minimum value of 0 when there is a θ0 such that h(y) = f(y|θ0), that is, the density is correctly specified, and larger values of KLIC indicate greater ignorance about the true density. Then the quasi-MLE θ̂QML minimizes the distance between f(y|θ) and h(y), where distance is measured using KLIC. To obtain this result, note that under suitable assumptions plim N−1LN(θ) = E[ln f(y|θ)], so θ̂QML converges to the θ∗ that maximizes E[ln f(y|θ)]. However, this is equivalent to minimizing KLIC, since KLIC = E[ln h(y)] − E[ln f(y|θ)] and the first term does not depend on θ as the expectation is with respect to h(y). 5.7.3. Linear Exponential Family In some special cases the quasi-MLE is consistent even when the density is partially misspecified. One well-known example is that the quasi-MLE for the linear regression model with normality is consistent even if the errors are nonnormal, provided E[y|x] = x′β0. The Poisson MLE provides a second example (see Section 5.3.4). Similar robustness to misspecification is enjoyed by other models based on densities in the linear exponential family (LEF). An LEF density can be expressed as f(y|µ) = exp{a(µ) + b(y) + c(µ)y}, (5.60) where we have given the mean parameterization of the LEF, so that µ = E[y]. It can be shown that for this density E[y] = −[c′(µ)]−1a′(µ) and V[y] = [c′(µ)]−1, where c′(µ) = ∂c(µ)/∂µ and a′(µ) = ∂a(µ)/∂µ. Different functions a(·) and c(·) lead to different densities in the family. The term b(y) in (5.60) is a normalizing constant that ensures probabilities sum or integrate to one. The remainder of the density, exp{a(µ) + c(µ)y}, is an exponential function that is linear in y, hence explaining the term linear exponential. Most densities cannot be expressed in this form. Several important densities are LEF densities, however, including those given in Table 5.4. These densities, already presented in Table 5.3, are reexpressed in Table 5.4 in the form (5.60). Other LEF densities are the binomial with number of trials known (the Bernoulli being a special case), some negative binomial models (the geometric and the Poisson being special cases), and the one-parameter gamma (the exponential being a special case). 147
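The following sketch illustrates the LEF robustness result developed in this section and the next (the Gouriéroux–Monfort–Trognon consistency result and the GLM dispersion correction are discussed just below): a Poisson quasi-MLE is applied to overdispersed counts whose conditional mean exp(x′β) is correctly specified but whose variance is roughly twice the mean. The slope estimates remain close to the truth, while the default ML standard errors need either the sandwich correction or a GLM-style rescaling. Simulated data and parameter values are illustrative assumptions.

```python
# Poisson quasi-MLE on overdispersed counts with V[y|x] = 2*E[y|x]:
# consistent mean parameters, but default ML standard errors are too small.
import numpy as np

rng = np.random.default_rng(2)
N, beta_true = 20_000, np.array([1.0, 0.5])
X = np.column_stack([np.ones(N), rng.normal(size=N)])
mu_true = np.exp(X @ beta_true)
y = rng.negative_binomial(mu_true, 0.5)        # mean mu_true, variance 2*mu_true

beta = np.zeros(2)                             # Newton-Raphson for the Poisson MLE
for _ in range(25):
    mu = np.exp(X @ beta)
    beta += np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))

mu = np.exp(X @ beta)
u = y - mu
A_inv = np.linalg.inv((X * mu[:, None]).T @ X)            # default ML variance
V_sandwich = A_inv @ ((X * u[:, None] ** 2).T @ X) @ A_inv
alpha_hat = np.sum(u**2 / mu) / (N - X.shape[1])          # GLM dispersion estimate
print("beta_hat:", beta)
print("default    se:", np.sqrt(np.diag(A_inv)))
print("sandwich   se:", np.sqrt(np.diag(V_sandwich)))
print("GLM-scaled se:", np.sqrt(np.diag(alpha_hat * A_inv)))
```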
  • 184. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Table 5.4. Linear Exponential Family Densities: Leading Examples Distribution f (y) = exp{a(·) + b(y) + c(·)y} E[y] V[y] = [c (µ)]−1 Normal (σ2 known) exp{−µ2 2σ2 − 1 2 ln(2πσ2 ) − y2 2σ2 + µ σ2 y} µ σ2 Bernoulli exp{ln(1 − p) + ln[p/(1 − p)]y} µ = p µ(1 − µ) Exponential exp{ln λ − λy} µ = 1/λ µ2 Poisson exp{−λ − ln y! + y ln λ} µ = λ µ For regression the parameter µ = E[y|x] is modeled as µ = g(x, β), (5.61) for specified function g(·) that varies across models (see Section 5.7.4) depending in part on restrictions on the range of y and hence µ. The LEF log-likelihood is then LN (β) = N i=1 {a(g(xi , β)) + b(yi ) + c(g(xi , β))yi }, (5.62) with first-order conditions that can be reexpressed, using the aforementioned informa- tion on the first-two moments of y, as ∂LN (β) ∂β = N i=1 yi − g(xi , β) σ2 i × ∂g(xi , β) ∂β = 0, (5.63) where σ2 i = [c (g(xi , β))]−1 is the assumed variance function corresponding to the par- ticular LEF density. For example, for Bernoulli, exponential, and Poisson, σ2 i equals, respectively, gi (1 − gi ), 1/g2 i , and gi , where gi = g(xi , β). The quasi-MLE solves these equations, but it is no longer assumed that the LEF density is correctly specified. Gouriéroux, Monfort, and Trognon (1984a) proved that the quasi-MLE βQML is consistent provided E[y|x] = g(x, β0). This is clear from taking the expected value of the first-order conditions (5.63), which evaluated at β = β0 are a weighted sum of errors y − g(x, β0) with expected value equal to zero if E[y|x] = g(x, β0). Thus the quasi-MLE based on an LEF density is consistent provided only that the conditional mean of y given x is correctly specified. Note that the actual dgp for y need not be LEF. It is the specified density, potentially incorrectly specified, that is LEF. Even with correct conditional mean, however, adjustment of default ML output for variance, standard errors, and t-statistics based on −A−1 0 is warranted. In general the sandwich form A−1 0 B0A−1 0 should be used, unless the conditional variance of y given x is also correctly specified, in which case A0 = −B0. For Bernoulli models, how- ever, A0 = −B0 always. Consistent standard errors can be obtained using (5.36) and (5.38). The LEF is a very special case. In general, misspecification of any aspect of the density leads to inconsistency of the MLE. Even in the LEF case the quasi-MLE can 148
  • 185. 5.7. QUASI-MAXIMUM LIKELIHOOD be used only to predict the conditional mean whereas with a correctly specified density one can predict the conditional distribution. 5.7.4. Generalized Linear Models Models based on an assumed LEF density are called generalized linear models (GLMs) in the statistics literature (see the book with this title by McCullagh and Nelder, 1989). The class of generalized linear models is the most widely used frame- work in applied statistics for nonlinear cross-section regression, as from Table 5.3 it includes nonlinear least squares, Poisson, geometric, probit, logit, binomial (known number of trials), gamma, and exponential regression models. We provide a short overview that introduces standard GLM terminology. Standard GLMs specify the conditional mean g(x, β) in (5.61) to be of the simpler single-index form, so that µ = g(x β). Then g−1 (µ) = x β, and the function g−1 (·) is called the link function. For example, the usual specification for the Poisson model corresponds to the log-link function since if µ = exp(x β) then ln µ = x β. The first-order conditions (5.63) become i [(yi − gi )/c (gi )]g i xi = 0, where gi = g(x i β) and g i = g (x i β). There are computational advantages in choosing the link function so that c (g(µ)) = g (µ), since then these first-order conditions reduce to i (yi − gi )xi = 0, or the error (yi − gi ) is orthogonal to the regressors. The canonical link function is defined to be that function g−1 (·) which leads to c (g(µ)) = g (µ) and varies with c(µ) and hence the GLM. The canonical link function leads to µ = x β for normal, µ = exp(x β) for Poisson, and µ = exp(x β)/[1 + exp(x β)] for binary data. The last of these is the logit form given earlier in Table 5.3. Two times the difference between the maximum achievable log-likelihood and the fitted log-likelihood is called the deviance, a measure that generalizes the residual sum of squares in linear regression to other LEF regression models. Models based on the LEF are very restrictive as all moments depend on just one un- derlying parameter, µ = g(x β). The GLM literature places some additional structure by making the convenient assumption that the LEF variance is potentially misspecified by a scalar multiple α, so that V[y|x] = α × [c (g(x, β)]−1 , where α = 1 necessarily. For example, for the Poisson model let V[y|x] = αg(x, β) rather than g(x, β). Given such variance misspecification it can be shown that B0 = −αA0, so the variance matrix of the quasi-MLE is −αA−1 0 , which requires only a rescaling of the nonsandwich ML variance matrix −A−1 0 by multiplication by α. A commonly used consistent estimate for α is α = (N − K)−1 i (yi − gi )2 / σ2 i , where gi = g(xi , βQML), σ2 i = [c ( gi )]−1 , and division is by (N − K) rather than N is felt to provide a better estimate in small samples. See the preceding references and Cameron and Trivedi (1986, 1998) for fur- ther details. Many statistical packages include a GLM module that as a default gives standard errors that are correct provided V[y|x] = α[c (g(x, β))]−1 . Alternatively, one can es- timate using ML, with standard errors obtained using the robust sandwich formula A−1 0 B0A−1 0 . In practice the sandwich standard errors are similar to those obtained us- ing the simple GLM correction. Yet another way to estimate a GLM is by weighted nonlinear least squares, as detailed at the end of Section 5.8.6. 149
  • 186. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION 5.7.5. Quasi-MLE for Multivariate Dependent Variables This chapter has focused on scalar dependent variables, but the theory applies also to the multivariate case. Suppose the dependent variable y is an m × 1 vector, and the data (yi , xi ), i = 1, . . . , N, are independent over i. Examples given in later chapters include seemingly unrelated equations, panel data with m observations for the ith individual on the same dependent variable, and clustered data where data for the i jth observation are correlated over m possible values of j. Given specification of f (y|x, θ), the joint density of y =(y1, . . . , ym) conditional on x, the fully efficient MLE maximizes N−1 i ln f (yi |xi , θ) as noted after (5.39). How- ever, in multivariate applications the joint density of y can be complicated. A simpler estimator is possible given knowledge only of the m univariate densities f j (yj |x, θ), j = 1, . . . , m, where yj is the jth component of y. For example, for multivariate count data one might work with m independent univariate negative binomial densities for each count rather than a richer multivariate count model that permits correlation. Consider then the quasi-MLE θQML based on the product of the univariate densities, j f j (yj |x, θ), that maximizes QN (θ) = 1 N N i=1 m j=1 ln f (yi j |xi , θ). (5.64) Wooldridge (2002) calls this estimator the partial MLE, since the density has been only partially specified. The partial MLE is an m-estimator with qi = j ln f (yi j |xi , θ). The essential con- sistency condition (5.25) requires that E[ j ∂ f (yi j |xi , θ)/∂θ θ0 ] = 0. This condi- tion holds if the marginal densities f (yi j |xi , θ0) are correctly specified, since then E[∂ f (yi j |xi , θ)/∂θ θ0 ] = 0 by the regularity condition (5.41). Thus the partial MLE is consistent provided the univariate densities f j (yj |x, θ) are correctly specified. Consistency does not require that f (y|x, θ) = j f j (yj |x, θ). De- pendence of y1, . . . , ym will lead to failure of the information matrix equality, however, so standard errors should be computed using the sandwich form for the variance matrix with A0 = 1 N N i=1 m j=1 ∂2 ln fi j ∂θ∂θ θ0 , (5.65) B0 = 1 N N i=1 m j=1 m k=1 ∂ ln fi j ∂θ θ0 ∂ ln fik ∂θ θ0 where fi j = f (yi j |xi , θ). Furthermore, the partial MLE is inefficient compared to the MLE based on the joint density. Further discussion is given in Sections 6.9 and 6.10. 5.8. Nonlinear Least Squares The NLS estimator is the natural extension of LS estimation for the linear model to the nonlinear model with E[y|x] = g(x, β), where g(·) is nonlinear in β. The analysis and results are essentially the same as for linear least squares, with the single change that in 150
  • 187. 5.8. NONLINEAR LEAST SQUARES Table 5.5. Nonlinear Least Squares: Common Examples Model Regression Function g(x, β) Exponential exp(β1x1 + β2x2 + β3x3) Regressor raised to power β1x1 + β2x β3 2 Cobb–Douglas production β1x β2 1 x β3 2 CES production [β1x β3 1 + β2x β3 2 ]1/β3 Nonlinear restrictions β1x1 + β2x2 + β3x3, where β3 = −β2β1 the formulas for variance matrices the regressor vector x is replaced by ∂g(x, β)/∂β| β, the derivative of the conditional mean function evaluated at β = β. For microeconometric analysis, controlling for heteroskedastic errors may be neces- sary, as in the linear case. The NLS estimator and extensions that model heteroskedas- tic errors are generally less efficient than the MLE, but they are widely used in microe- conometrics because they rely on weaker distributional assumptions. 5.8.1. Nonlinear Regression Model The nonlinear regression model defines the scalar dependent variable y to have con- ditional mean E[yi |xi ] = g(xi , β), (5.66) where g(·) is a specified function, x is a vector of explanatory variables, and β is a K × 1 vector of parameters. The linear regression model of Chapter 4 is the special case g(x, β) = x β. Common reasons for specifying a nonlinear function for E[y|x] include range re- striction (e.g., to ensure that E[y|x] 0) and specification of supply or demand or cost or expenditure models that satisfy restrictions from producer or consumer theory. Some commonly used nonlinear regression models are given in Table 5.5. 5.8.2. NLS Estimator The error term is defined to be the difference between the dependent variable and its conditional mean, yi − g(xi , β). The nonlinear least-squares estimator βNLS minimizes the sum of squared residuals, i (yi − g(xi , β))2 , or equivalently maximizes QN (β) = − 1 2N N i=1 (yi − g(xi , β))2 , (5.67) where the scale factor 1/2 simplifies the subsequent analysis. 151
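As an illustration of minimizing (5.67) in practice (a sketch, not the book's code), the following fits the "regressor raised to power" model of Table 5.5 by NLS with scipy, and then checks numerically that the residuals are orthogonal to ∂g/∂β at the solution, as required by the first-order conditions derived in the next subsection. The simulated design and starting values are illustrative assumptions.

```python
# NLS for g(x, beta) = beta1*x1 + beta2*x2^beta3 by minimizing the sum of
# squared residuals; the orthogonality conditions are verified at the solution.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(4)
N = 5000
x1, x2 = rng.normal(size=N), rng.uniform(0.5, 3.0, size=N)
beta_true = np.array([1.0, 2.0, 0.5])
g = lambda b: b[0] * x1 + b[1] * x2 ** b[2]
y = g(beta_true) + rng.normal(scale=0.5, size=N)          # additive error

b_nls = least_squares(lambda b: y - g(b), x0=np.array([0.5, 1.0, 1.0])).x
print("beta_NLS:", b_nls)

# dg/dbeta has columns (x1, x2^b3, b2 * x2^b3 * ln x2); the residual should be
# orthogonal to each column at the NLS solution
D = np.column_stack([x1, x2 ** b_nls[2], b_nls[1] * x2 ** b_nls[2] * np.log(x2)])
print("sample first-order conditions:", D.T @ (y - g(b_nls)) / N)   # ~ 0
```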
  • 188. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Differentiation leads to the NLS first-order conditions ∂ QN (β) ∂β = 1 N N i=1 ∂gi ∂β (yi − gi ) = 0, (5.68) where gi = g(xi , β). These conditions restrict the residual (y − g) to be orthogonal to ∂g/∂β, rather than to x as in the linear case. There is no explicit solution for βNLS, which instead is computed using iterative methods (given in Chapter 10). The nonlinear regression model can be more compactly represented in matrix nota- tion. Stacking observations yields    y1 . . . yN    =    g1 . . . gN    +    u1 . . . uN    , (5.69) where gi = g(xi , β), or equivalently y = g + u, (5.70) where y, g, and u are N × 1 vectors with ith entries of, respectively, yi , gi , and ui . Then QN (β) = − 1 2N (y − g) (y − g) and ∂ QN (β) ∂β = 1 N ∂g ∂β (y − g), (5.71) where ∂g ∂β =     ∂g1 ∂β1 · · · ∂gN ∂β1 . . . . . . ∂g1 ∂βK · · · ∂gN ∂βK     (5.72) is the K × N matrix of partial derivatives of g(x, β) with respect to β. 5.8.3. Distribution of the NLS Estimator The distribution of the NLS estimator will vary with the dgp. The dgp can always be written as yi = g(xi , β0) + ui , (5.73) a nonlinear regression model with additive error u. The conditional mean is correctly specified if E[y|x] = g(x, β0) in the dgp. Then the error must satisfy E[u|x] = 0. Given the NLS first-order conditions (5.68), the essential consistency condition (5.25) becomes E[∂g(x, β)/∂β|β0 × (y − g(xi , β0))] = 0. 152
  • 189. 5.8. NONLINEAR LEAST SQUARES Equivalently, given (5.73), we need E[∂g(x, β)/∂β|β0 × u] = 0. This holds if E[u|x] = 0, so consistency requires correct specification of the conditional mean as in the linear case. If instead E[u|x] = 0 then consistent estimation requires nonlinear instrumental methods (which are presented in Section 6.5). The limit distribution of √ N( βNLS − β0) is obtained using an exact first-order Taylor series expansion of the first-order conditions (5.68). This yields √ N( βNLS − β0) = −  −1 N N i=1 ∂gi ∂β ∂gi ∂β + 1 N N i=1 ∂2 gi ∂β∂β (yi − gi ) β+   −1 × 1 √ N N i=1 ∂gi ∂β ui β0 , for some β+ between βNLS and β0. For A0 in (5.18) simplification occurs because the term involving ∂2 g/∂β∂β drops out since E[u|x] = 0. Thus asymptotically we need consider only √ N( βNLS − β0) =   1 N N i=1 ∂gi ∂β ∂gi ∂β β0   −1 1 √ N N i=1 ∂gi ∂β ui β0 , which is exactly the same as OLS, see Section 4.4.4, except xi is replaced by ∂gi /∂β β0 . This yields the following proposition, analogous to Proposition 4.1 for the OLS estimator. Proposition 5.6 (Distribution of NLS Estimator): Make the following assumptions: (i) The model is (5.73); that is, yi = g(xi , β0) + ui . (ii) In the dgp E[ui |xi ] = 0 and E[uu |X] = Ω0, where Ω0,i j = σi j . (iii) The mean function g(·) satisfies g(x, β(1) ) = g(x, β(2) ) iff β(1) = β(2) . (iv) The matrix A0 = plim 1 N N i=1 ∂gi ∂β ∂gi ∂β β0 = plim 1 N ∂g ∂β ∂g ∂β β0 (5.74) exists and is finite nonsingular. (v) N−1/2 N i=1 ∂gi /∂β×ui |β0 d → N [0, B0], where B0 = plim 1 N N i=1 N j=1 σi j ∂gi ∂β ∂gj ∂β β0 = plim 1 N ∂g ∂β Ω0 ∂g ∂β β0 . (5.75) Then the NLS estimator βNLS, defined to be a root of the first-order conditions ∂N−1 QN (β)/∂β = 0, is consistent for β0 and √ N( βNLS − β0) d → N 0, A−1 0 B0A−1 0 . (5.76) 153
  • 190. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Conditions (i) to (iii) imply that the regression function is correctly specified and the regressors are uncorrelated with the errors and that β0 is identified. The errors can be heteroskedastic and correlated over i. Conditions (iv) and (v) assume the relevant limit results necessary for application of Theorem 5.3. For condition (v) to be satisfied some restrictions will need to be placed on the error correlation over i. The probability limits in (5.74) and (5.75) are with respect to the dgp for X; they become regular limits if X is nonstochastic. The matrices A0 and B0 in Proposition 5.6 are the same as the matrices Mxx and MxΩx in Section 4.4.4 for the OLS estimator with xi replaced by ∂gi /∂β|β0 . The asymptotic theory for NLS is the same as that for OLS, with this single change. In the special case of spherical errors, Ω0 = σ2 0 I, so B0 = σ2 0 A0 and V[ βNLS] = σ2 0 A−1 0 . Nonlinear least squares is then asymptotically efficient among LS estimators. However, cross-section data errors are not necessarily heteroskedastic. Given Proposition 5.6, the resulting asymptotic distribution of the NLS estimator can be expressed as βNLS a ∼ N β,(D D)−1 D Ω0D(D D) −1 , (5.77) where the derivative matrix D = ∂g/∂β β0 has ith row ∂gi /∂β β0 (see (5.72)), for notational simplicity the evaluation at β0 is suppressed, and we assume that an LLN applies, so that the plim operator in the definitions of A0 and B0 are replaced by limE, and then drop the limit. This notation is often used in later chapters. 5.8.4. Variance Matrix Estimation for NLS We consider statistical inference for the usual microeconometrics situation of inde- pendent errors with heteroskedasticity of unknown functional form. This requires a consistent estimate of A−1 0 B0A−1 0 defined in Proposition 5.6. For A0 defined in (5.74) it is straightforward to use the obvious estimator A = 1 N N i=1 ∂gi ∂β β ∂gi ∂β β , (5.78) as A0 does not involve moments of the errors. Given independence over i the double sum in B0 defined in (5.75) simplifies to the single sum B0 = plim 1 N N i=1 σ2 i ∂gi ∂β ∂gi ∂β β0 . As for the OLS estimator (see Section 4.4.5) it is only necessary to consistently esti- mate the K × K matrix sum B0. This does not require consistent estimation of σ2 i , the N individual components in the sum. 154
  • 191. 5.8. NONLINEAR LEAST SQUARES White (1980b) gave conditions under which B = 1 N N i=1 u2 i ∂gi ∂β ∂gi ∂β β = 1 N ∂g ∂β β Ω ∂g ∂β β (5.79) is consistent for B0, where ui = yi − g(xi , β), β is consistent for β0, and Ω = Diag[ u2 i ]. (5.80) This leads to the following heteroskedastic-consistent estimate of the asymptotic variance matrix of the NLS estimator: V[ βNLS] = ( D D)−1 D Ω D( D D)−1 , (5.81) where D = ∂g/∂β β . This equation is the same as the OLS result in Section 4.4.5, with the regressor matrix X replaced by D. In practice, a degrees of freedom correction may be used, so that B in (5.79) is computed using division by (N − K) rather than by N. Then the right-hand side in (5.81) should be multiplied by N/(N − K). Generalization to errors correlated over i is given in Section 5.8.7. 5.8.5. Exponential Regression Example As an example, suppose that y given x has exponential conditional mean, so that E[y|x] = exp(x β). The model can be expressed as a nonlinear regression with y = exp(x β) + u, where the error term u has E[u|x] = 0 and the error is potentially heteroskedastic. The NLS estimator has first-order conditions N−1 i yi − exp(x i β) exp(x i β)xi = 0, (5.82) so consistency of βNLS requires only that the conditional mean be correctly specified with E[y|x] = exp(x β0). Here ∂g/∂β = exp(x β)x, so the general NLS result (5.81) yields the heteroskedastic-robust estimate V[ βNLS] = i e2x i β xi x i −1 i u2 i e2x i β xi x i i e2x i β xi x i −1 , (5.83) where ui = yi − exp(x i βNLS). 5.8.6. Weighted NLS and FGNLS For cross-section data the errors are often heteroskedastic. Then feasible generalized NLS that controls for the heteroskedasticity is more efficient than NLS. Feasible generalized nonlinear least squares (FGNLS) is still generally less efficient than ML. The notable exception is that FGNLS is asymptotically equivalent to the MLE when the conditional density for y is an LEF density. A special case is that FGLS is asymptotically equivalent to the MLE in the linear regression under normality. 155
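A short sketch of the heteroskedasticity-robust variance computation in (5.81) and (5.83), for NLS with an exponential conditional mean; the simulated design, in which the error standard deviation is proportional to the mean, is an illustrative assumption rather than the book's own example data.

```python
# Heteroskedasticity-robust NLS variance (D'D)^{-1} D'Omega_hat D (D'D)^{-1}
# with Omega_hat = Diag[u_i^2], for the exponential-mean model of Section 5.8.5.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(5)
N, beta_true = 10_000, np.array([1.0, -0.5])
X = np.column_stack([np.ones(N), rng.normal(size=N)])
mu_true = np.exp(X @ beta_true)
y = mu_true * (1.0 + rng.normal(size=N))        # error sd proportional to the mean

g = lambda b: np.exp(X @ b)
b = least_squares(lambda bb: y - g(bb), x0=np.zeros(2)).x

u = y - g(b)
D = g(b)[:, None] * X                           # rows exp(x_i'b) x_i' = dg/dbeta'
DDinv = np.linalg.inv(D.T @ D)
V_robust = DDinv @ (D * u[:, None] ** 2).T @ D @ DDinv
V_naive = u.var() * DDinv                       # valid only under homoskedasticity
print("beta_NLS :", b)
print("robust se:", np.sqrt(np.diag(V_robust)))
print("naive  se:", np.sqrt(np.diag(V_naive)))
```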
  • 192. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Table 5.6. Nonlinear Least-Squares Estimators and Their Asymptotic Variancea Estimator Objective Function Estimated Asymptotic Variance NLS QN (β) = −1 2N u u ( D D)−1 D Ω D( D D)−1 FGNLS QN (β) = −1 2N u Ω( γ)−1 u ( D Ω −1 D)−1 WNLS QN (β) = −1 2N u Σ −1 u ( D Σ −1 D)−1 D Σ −1 Ω Σ −1 D( D Σ −1 D)−1 . a Functions are for a nonlinear regression model with error u = y − g defined in (5.70) and error conditional vari- ance matrix . D is the derivative of the conditional mean vector with respect to β evaluated at β. For FGNLS it is assumed that is consistent for . For NLS and WNLS the heteroskedastic robust variance matrix uses equal to a diagonal matrix with squared residuals on the diagonals, an estimate that need not be consistent for . If heteroskedasticity is incorrectly modeled then the FGNLS estimator retains con- sistency but one should then obtain standard errors that are robust to misspecification of the model for heteroskedasticity. The analysis is very similar to that for the linear model given in Section 4.5. Feasible Generalized Nonlinear Least Squares The feasible generalized nonlinear least-squares estimator βFGNLS maximizes QN (β) = − 1 2N (y − g) Ω( γ)−1 (y − g), (5.84) where it is assumed that E[uu |X] = Ω(γ0) and γ is a consistent estimate γ of γ0. If the assumptions made for the NLS estimator are satisfied and in fact Ω0 = Ω(γ0), then the FGNLS estimator is consistent and asymptotically normal with estimated asymptotic variance matrix given in Table 5.6. The variance matrix estimate is similar to that for linear FGLS, X Ω( γ)−1 X −1 , except that X is replaced by D = ∂g/∂β β . The FGNLS estimator is the most efficient consistent estimator that minimizes quadratic loss functions of the form (y − g) V(y − g), where V is a weighting matrix. In general, implementation of FGNLS requires inversion of the N × N matrix Ω( γ). This may be computationally impossible for large N, but in practice Ω( γ) usu- ally has a structure, such as diagonality, that leads to an analytical solution for the inverse. Weighted NLS The FGNLS approach is fully efficient but leads to invalid standard error estimates if the model for Ω0 is misspecified. Here we consider an approach between NLS and FGNLS that specifies a model for the variance matrix of the errors but then obtains robust standard errors. The discussion mirrors that in Section 4.5.2. The weighted nonlinear least squares (WNLS) estimator βWNLS maximizes QN (β) = − 1 2N (y − g) Σ −1 (y − g), (5.85) where Σ = Σ(γ) is a working error variance matrix, Σ = Σ( γ), where γ is an estimate of γ, and, in a departure from FGNLS, Σ = Ω0. 156
  • 193. 5.8. NONLINEAR LEAST SQUARES Under assumptions similar to those for the NLS estimator and assuming that Σ0 = plim Σ, the WNLS estimator is consistent and asymptotically normal with estimated asymptotic variance matrix given in Table 5.6. This estimator is called WNLS to distinguish it from FGNLS, which assumed that Σ = Ω0. The WNLS estimator hopefully lies between NLS and FGNLS in terms of efficiency, though it may be less efficient than NLS if a poor model of the error vari- ance matrix is chosen. The NLS and OLS estimators are special cases of WNLS with Σ = σ2 I. Heteroskedastic Errors An obvious working model for heteroskedasticity is σ2 i = E[u2 i |xi ] = exp(z i γ0), where the vector z is a specified function of x (such as selected subcomponents of x) and using the exponential ensures a positive variance. Then Σ = Diag[exp(z i γ)] and Σ = Diag[exp(z i γ)], where γ can be obtained by nonlinear regression of squared NLS residuals (yi − g(xi , βNLS))2 on exp(z i γ). Since Σ is diagonal, Σ−1 = Diag[1/σ2 i ]. Then (5.84) simplifies and the WNLS estimator maximizes QN (β) = − 1 2N N i−1 (yi − g(xi , β))2 σ2 i . (5.86) The variance matrix of the WNLS estimator given in Table 5.6 yields V[ βWNLS] = N i=1 1 σ2 i di d i −1 N i=1 u2 i 1 σ4 i di d i N i=1 1 σ2 i di d i −1 , (5.87) where di = ∂g(xi , β)/∂β| β and ui = yi − g(xi , βWNLS) is the residual. In practice a degrees of freedom correction may be used, so that the right-hand side of (5.87) is multiplied by N/(N − K). If the stronger assumption is made that Σ = Ω0, then WNLS becomes FGNLS and V[ βFGNLS] = N i=1 1 σ2 i di d i −1 . (5.88) The WNLS and FGNLS estimators can be implemented using an NLS program. First, do NLS regression of yi on g(xi , β). Second, obtain γ by, for example, NLS re- gression of (yi − g(xi , βNLS))2 on exp(z i γ) if σ2 i = exp(z i γ). Third, perform an NLS regression of yi / σi on g(xi , β)/ σi , where σ2 i = exp(z i γ). This is equivalent to max- imizing (5.86). White robust sandwich standard errors from this transformed regres- sion give robust WNLS standard errors based on (5.87). The usual nonrobust stan- dard errors from this transformed regression give FGNLS standard errors based on (5.88). With heteroskedastic errors it is very tempting to go one step further and attempt FGNLS using Ω = Diag[ u2 i ]. This will give inconsistent parameter estimates of β0, however, as FGNLS regression of yi on g(xi , β) then reduces to NLS regression of yi /| ui | on g(xi , β)/| ui |. The technique suffers from the fundamental problem of 157
  • 194. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION correlation between regressors and error term. Alternative semiparametric methods that enable an estimator as efficient as feasible GLS, without specifying a functional form for Ω0, are presented in Section 9.7.6. Generalized Linear Models Implementation of the weighted NLS approach requires a reasonable specification for the working matrix. A somewhat ad-hoc approach, already presented, is to let σ2 i = exp(z i γ), where z is often a subset of x. For example, in regression of earnings on schooling and other control variables we might model heteroskedasticity more simply as being a function of just a few of the regressors, most notably schooling. Some types of cross-section data provide a natural model for heteroskedasticity that is very parsimonious. For example, for count data the Poisson density specifies that the variance equals the mean, so σ2 i = g(xi , β). This provides a working model for heteroskedasticity that introduces no further parameters than those already used in modeling the conditional mean. This approach of letting the working model for the variance be a function of the mean arises naturally for generalized linear models, introduced in Sections 5.7.3 and 5.7.4. From (5.63) the first-order conditions for the quasi-MLE based on an LEF den- sity are of the form N i=1 yi − g(xi , β) σ2 i × ∂g(xi , β) ∂β = 0, where σ2 i = [c (g(xi , β))]−1 is the assumed variance function corresponding to the particular GLM (see (5.60)). For example, for Poisson, Bernoulli, and exponential distributions σ2 i equals, respectively, gi , gi (1 − gi ), and 1/g2 i , where gi = g(xi , β). These first-order conditions can be solved for β in one step that allows for depen- dence of σ2 i on β. In a simpler two-step method one computes σ2 i = c (g(xi , β)) given an initial NLS estimate of β and then does a weighted NLS regression of yi / σi on g(xi , β)/ σi . The resulting estimator of β is asymptotically equivalent to the quasi- MLE that directly solves (5.63) (see Gouriéroux, Monfort, and Trognan 1984a, or Cameron and Trivedi, 1986). Thus FGNLS is asymptotically equivalent to ML estima- tion when the density is an LEF density. To guard against misspecification of σ2 i infer- ence is based on robust sandwich standard errors, or one lets σ2 i = α[c (g(xi , β))]−1 , where the estimate α is given in Section 5.7.4. 5.8.7. Time Series The general NLS result in Proposition 5.6 applies to all types of data, including time- series data. The subsequent results on variance matrix estimation focused on the cross- section case of heteroskedastic errors, but they are easily adapted to the case of time- series data with serially correlated errors. Indeed, results on robust variance matrix estimation using spectral methods for the time-series case preceded those for the cross- section case. 158
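Returning to the cross-section case, the following sketch implements the weighted NLS recipe of Section 5.8.6: a first-step NLS fit, a working variance model σi² = exp(zi′γ) fitted to squared residuals, a weighted regression of yi/σ̂i on g(xi, β)/σ̂i, and the robust variance (5.87). The exponential-mean dgp with V[y|x] = E[y|x]², the choice z = x, and the starting values are illustrative assumptions, not the authors' implementation.

```python
# Two/three-step weighted NLS with a working variance model and the robust
# sandwich variance of (5.87).
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(6)
N, beta_true = 10_000, np.array([1.0, -0.5])
X = np.column_stack([np.ones(N), rng.normal(size=N)])
mu_true = np.exp(X @ beta_true)
y = mu_true * rng.exponential(size=N)            # E[y|x]=mu, V[y|x]=mu^2
g = lambda b: np.exp(X @ b)

b_nls = least_squares(lambda b: y - g(b), np.zeros(2)).x          # step 1: NLS
u2 = (y - g(b_nls)) ** 2
# step 2: sigma_i^2 = exp(x_i'gamma); 2*b_nls is a sensible start since the
# true variance here is exp(2x'beta)
gam = least_squares(lambda c: u2 - np.exp(X @ c), 2 * b_nls).x
sig = np.sqrt(np.exp(X @ gam))
b_wnls = least_squares(lambda b: (y - g(b)) / sig, b_nls).x       # step 3: WNLS

d = g(b_wnls)[:, None] * X                       # d_i = dg/dbeta at b_wnls
u = y - g(b_wnls)
M = np.linalg.inv((d / sig[:, None] ** 2).T @ d)
V_wnls = M @ ((d * (u / sig**2)[:, None] ** 2).T @ d) @ M         # eq. (5.87)
print("NLS :", b_nls)
print("WNLS:", b_wnls)
print("robust WNLS se:", np.sqrt(np.diag(V_wnls)))
```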
  • 195. 5.9. EXAMPLE: ML AND NLS ESTIMATION The time-series nonlinear regression model is yt = g(xt, β) + ut, t = 1, . . . , T. If the error ut is serially correlated it is common to use the autoregressive moving average or ARMA(p, q) model ut = ρ1ut−1 + · · · + ρput−p + εt + α1εt−1 + · · · + αqεt−q, where εt is iid with mean 0 and variance σ², and restrictions may be placed on the ARMA model parameters to ensure stationarity and invertibility. The ARMA error model implies a particular structure for the error variance matrix Ω0 = Ω(ρ, α). The ARMA model provides a good model for Ω0 in the time-series case. In contrast, in the cross-section case, it is more difficult to correctly model heteroskedasticity, leading to greater emphasis on robust inference that does not require specification of a model for Ω0. What if errors are both heteroskedastic and serially correlated? The NLS estimator is consistent though inefficient if errors are serially correlated, provided xt does not include lagged dependent variables; if it does, the NLS estimator becomes inconsistent. White and Domowitz (1984) generalized (5.79) to obtain a robust estimate of the variance matrix of the NLS estimator given heteroskedasticity and serial correlation of unknown functional form, assuming serial correlation of no more than, say, l lags. In practice a minor refinement due to Newey and West (1987b) is used. This refinement is a rescaling that ensures that the variance matrix estimate is positive semidefinite. Several other refinements have also been proposed and the assumption of fixed lag length has been relaxed so that it is possible for l → ∞ at a sufficiently slower rate than N → ∞. This permits an AR component for the error. 5.9. Example: ML and NLS Estimation Maximum likelihood and NLS estimation, standard error calculation, and coefficient interpretation are illustrated using simulation data. 5.9.1. Model and Estimators The exponential distribution is used for continuous positive data, notably duration data studied in Chapter 17. The exponential density is f(y) = λe^(−λy), y > 0, λ > 0, with mean 1/λ and variance 1/λ². We introduce regressors into this model by setting λ = exp(x′β), which ensures λ > 0. Note that this implies that E[y|x] = exp(−x′β). 159
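The following sketch (not the book's own code) fits OLS, the exponential MLE, and NLS to data generated from the design described in Section 5.9.2 below, with (β1, β2) = (2, −1), x ~ N[1, 1], and N = 10,000; because the draws differ, the estimates only approximate those reported in Table 5.7.

```python
# Simulation in the spirit of Section 5.9.2: y|x exponential with
# lambda = exp(beta1 + beta2*x), so E[y|x] = exp(-beta1 - beta2*x).
import numpy as np
from scipy.optimize import minimize, least_squares

rng = np.random.default_rng(7)
N, b0 = 10_000, np.array([2.0, -1.0])
x = rng.normal(loc=1.0, scale=1.0, size=N)
X = np.column_stack([np.ones(N), x])
y = rng.exponential(np.exp(-X @ b0))          # scale = 1/lambda = E[y|x]

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # inconsistent for (2, -1)

# Exponential MLE: log-density x'b - y*exp(x'b)
negll = lambda b: -np.sum(X @ b - y * np.exp(X @ b))
grad = lambda b: -X.T @ (1.0 - y * np.exp(X @ b))
b_ml = minimize(negll, np.zeros(2), jac=grad, method="BFGS").x

# NLS on the conditional mean exp(-x'b)
b_nls = least_squares(lambda b: y - np.exp(-X @ b), b_ml).x

print("OLS:", b_ols, "\nMLE:", b_ml, "\nNLS:", b_nls)
```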
  • 196. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION An alternative parameterization instead specifies E[y|x] = exp(x β), so that λ = exp(−x β). Note that the exponential is used in two different ways: for the density and for the conditional mean. The OLS estimator from regression of y on x is inconsistent, since it fits a straight line when the regression function is in fact an exponential curve. The MLE is easily obtained. The log-density is ln f (y|x) = x β − y exp(x β), lead- ing to ML first-order conditions N−1 i (1 − yi exp(x i β))xi = 0, or N−1 i yi − exp(−x β) exp(−xβ) xi = 0. To perform NLS regression, note that the model can also be expressed as a nonlinear regression with y = exp(−x β) + u, where the error term u has E[u|x] = 0, though it is heteroskedastic. The first-order conditions for an exponential conditional mean for this model, aside from a sign rever- sal, have already been given in (5.82) and clearly lead to an estimator that differs from the MLE. As an example of weighted NLS we suppose that the error variance is propor- tional to the mean. Then the working variance is V[y] = E[y] and weighted least squares can be implemented by NLS regression of yi / σi on exp(−x i β)/ σi , where σ2 i = exp(−x i βNLS). This estimator is less efficient than the MLE and may or may not be more efficient than NLS. Feasible generalized NLS can be implemented here, since we know the dgp. Since V[y] = 1/λ2 for the exponential density, so the variance equals the mean squared, it follows that V[u|x] = [exp(−x β)]2 . The FGNLS estimator estimates σ2 i by σ2 i = [exp(−x i βNLS)]2 and can be implemented by NLS regression of yi / σi on exp(−x i β)/ σi . In general FGNLS is less efficient than the MLE. In this example it is actually fully efficient as the exponential density is an LEF density (see the discussion at the end of Section 5.8.6). 5.9.2. Simulation and Results For simplicity we consider regression on an intercept and a regressor. The data- generating process is y|x ∼ exponential[λ], λ = exp(β1 + β2x), where x ∼ N[1, 12 ] and (β1, β2) = (2, −1). A large sample of size 10,000 was drawn to minimize differences in estimates, particularly standard errors, arising from sam- pling variability. For the particular sample of 10,000 drawn here the sample mean of y is 0.62 and the sample standard deviation of y is 1.29. Table 5.7 presents OLS, ML, NLS, WNLS, and FGNLS estimates. Up to three different standard error estimates are also given. The default regression output yields nonrobust standard errors, given in parentheses. For OLS and NLS estimators these 160
  • 197. 5.9. EXAMPLE: ML AND NLS ESTIMATION Table 5.7. Exponential Example: Least-Squares and ML Estimatesa Estimator Variable OLS ML NLS WNLS FGNLS Constant −0.0093 1.9829 1.8876 1.9906 1.9840 (0.0161) (0.0141) (0.0307) (0.0225) (0.0148) [0.0172] [0.0144] [0.1421] [0.0359] [0.0146] {0.2110} x 0.6198 −0.9896 −0.9575 −0.9961 −0.9907 (0.0113) (0.0099) (0.0097) (0.0098) (0.0100) [0.0254] [0.0099] [0.0612] [0.0224] [0.0101] {0.0880} lnL – −208.71 −232.98 −208.93 −208.72 R2 0.2326 0.3906 0.3913 0.3902 0.3906 a All estimators are consistent, aside from OLS. Up to three alternative standard error estimates are given: nonrobust in parentheses, robust outer product in square brackets, and an alternative robust estimate for NLS in braces. The conditional dgp is an exponential distribution with intercept 2 and slope parameter −1. Sample size N = 10,000. assume iid errors, an erroneous assumption here, and for the MLE these impose the IM equality, a valid assumption here since the assumed density is the dgp. The robust standard errors, given in square brackets, use the robust sandwich variance estimate N−1 A−1 H BOP A−1 H , where BOP is the outer product estimated given in (5.38). These estimates are heteroskedastic consistent. For standard errors of the NLS estimator an alternative better estimate is given in braces (and is explained in the next section). The standard error estimates presented here use numerical rather than analytical derivatives in computing A and B. 5.9.3. Comparison of Estimates and Standard Errors The OLS estimator is inconsistent, yielding estimates unrelated to (β1, β2) in the ex- ponential dgp. The remaining estimators are consistent, and the ML, NLS, WNLS, and FGNLS estimators are within two standard errors of the true parameter values of (2, −1), where the robust standard errors need to be used for NLS. The FGNLS estimates are quite close to the ML estimates, a consequence of using a dgp in the LEF. For the MLE the nonrobust and robust ML standard errors are quite similar. This is expected as they are asymptotically equivalent (since the information matrix equality holds if the MLE is based on the true density) and the sample size here is large. For NLS the nonrobust standard errors are invalid, because the dgp has het- eroskedastic errors, and greatly overstate the precision of the NLS estimates. The for- mula for the robust variance matrix estimate for NLS is given in (5.81), where Ω = Diag[ u2 i ]. An alternative that uses Ω = Diag[ E u2 i ], where E u2 i = [exp(−x i β)]2 , is given in braces. The two estimates differ: 0.0612 compared to 0.0880 for the slope coefficient. The difference arises because u2 i = (yi − exp(x i β))2 differs from 161
  • 198. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION [exp(−x i β)]2 . More generally standard errors estimated using the outer product (see Section 5.5.2) can be biased even in quite large samples. NLS is considerably less effi- cient than MLE, with standard errors many times those of the MLE using the preferred estimates in braces. The WNLS estimator does not use the correct model for heteroskedasticity, so the nonrobust and robust standard errors again differ. Using the robust standard errors the WNLS estimator is more efficient than NLS and less efficient than the MLE. In this example the FGNLS estimator is as efficient as the MLE, a consequence of the known dgp being in the LEF. The results indicate this, with coefficients and standard errors very close to those for the MLE. The robust and nonrobust standard errors for the FGNLS estimator are essentially the same, as expected since here the model for heteroskedasticity is correctly specified. Table 5.7 also reports the estimated log-likelihood, ln L = i [x i β − exp(−x i β)yi ], and an R-squared measure, R2 = 1 − i (yi − yi )2 / i (yi − ȳ)2 , where yi = exp(−x i β), evaluated at the ML, NLS, WNLS, and FGNLS estimates. The R2 differs little across models and is lowest for the NLS estimator, as expected since NLS minimizes i (yi − yi )2 . The log-likelihood is maximized by the MLE, as expected, and is considerably lower for the NLS estimator. 5.9.4. Coefficient Interpretation Interest lies in changes in E[y|x] when x changes. We consider the ML estimates of β2 = −0.99 given in Table 5.7. The conditional mean exp(−β1 − β2x) is of single-index form, so that if an ad- ditional regressor z with coefficient β3 were included, then the marginal effect of a one-unit change in z would be β3/ β2 times that of a one-unit change in x (see Sec- tion 5.2.4). The conditional mean is monotonically decreasing in x, so the sign of β2 is the re- verse of the marginal effect (see Section 5.2.4). Here the marginal effect of an increase in x is an increase in the conditional mean, since β2 is negative. We now consider the magnitude of the marginal effect of changes in x using cal- culus methods. Here ∂E[y|x]/∂x = −β2 exp(−x β) varies with the evaluation point x and ranges from 0.01 to 19.09 in the sample. The sample-average response is 0.99N−1 i exp(x i β) = 0.61. The response evaluated at the sample mean of x, 0.99 exp(x̄ β) = 0.37, is considerably smaller. Since ∂E[y|x]/∂x = −β2E[y|x], yet another estimate of the marginal effect is 0.99ȳ = 0.61. Finite-difference methods lead to a different estimated marginal effect. For
  • 199. Δx = 1 we obtain
  • 200. ΔE[y|x] = (exp(−β̂2) − 1) exp(−x′β̂) (see Section 5.2.4). This yields an average response over the sample of 1.04, rather than 0.61. The finite-difference and calculus methods coincide, however, if
  • 201. Δx is small. The preceding marginal effects are additive. For the exponential conditional mean we can also consider multiplicative or proportionate marginal effects (see Section 5.2.4). For example, a 0.1-unit change in x is predicted to lead to a proportionate increase in E[y|x] of 0.1 × 0.99 or a 9.9% increase. Again a finite-difference approach will yield a different estimate. 162
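A sketch of the marginal-effect calculations just described, for the exponential conditional mean E[y|x] = exp(−β1 − β2x). The coefficients are re-estimated on freshly simulated data, so the results only approximate the values quoted in the text (roughly 0.61, 0.37, 0.61, and 1.04).

```python
# Marginal effects for E[y|x] = exp(-b1 - b2*x): average calculus effect,
# effect at the mean of x, ybar-based effect, and average finite-difference
# effect of a one-unit change in x.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
N, b0 = 10_000, np.array([2.0, -1.0])
x = rng.normal(1.0, 1.0, size=N)
X = np.column_stack([np.ones(N), x])
y = rng.exponential(np.exp(-X @ b0))

negll = lambda b: -np.sum(X @ b - y * np.exp(X @ b))       # exponential MLE
grad = lambda b: -X.T @ (1.0 - y * np.exp(X @ b))
b = minimize(negll, np.zeros(2), jac=grad, method="BFGS").x

Ey = np.exp(-X @ b)                            # fitted conditional mean
print("average marginal effect        :", np.mean(-b[1] * Ey))
print("marginal effect at mean of x   :", -b[1] * np.exp(-b[0] - b[1] * x.mean()))
print("ybar-based marginal effect     :", -b[1] * y.mean())
print("finite difference, delta x = 1 :", np.mean((np.exp(-b[1]) - 1.0) * Ey))
```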
  • 202. 5.11. BIBLIOGRAPHIC NOTES Which of these measures is most useful? The restriction to single-index form is very useful as the relative impact of regressors can be immediately calculated. For the magnitude of the response it is most accurate to compute the average response across the sample, using noncalculus methods, of a c-unit change in the regressor, where the magnitude of c is a meaningful amount such as a one standard deviation change in x. Similar calculations can be done for the NLS, WNLS, and FGNLS estimates, with similar results. For the OLS estimator, note that the coefficient of x can be interpreted as giving the sample-average marginal effect of a change in x (see Section 4.7.2). Here the OLS estimate β2 = 0.61 equals to two decimal places the sample-average response computed earlier using the exponential MLE. Here OLS provides a good estimate of the sample-average marginal response, even though it can provide a very poor estimate of the marginal response for any particular value of x. 5.10. Practical Considerations Most econometrics packages provide simple commands to obtain the maximum like- lihood estimators for the standard models introduced in Section 5.6.1. For other den- sities many packages provide an ML routine to which the user provides the equation for the density and possibly first derivatives or even second derivatives. Similarly, for NLS one provides the equation for the conditional mean to an NLS routine. For some nonlinear models and data sets the ML and NLS routines provided in packages can en- counter computational difficulties in obtaining estimates. In such circumstances it may be necessary to use more robust optimization routines provided as add-on modules to Gauss, Matlab and OX. Gauss, Matlab and OX are better tools for nonlinear modeling, but require a higher initial learning investment. For cross-section data it is becoming standard to use standard errors based on the sandwich form of the variance matrix. These are often provided as a command option. For LS estimators this gives heteroskedastic-consistent standard errors. For maximum likelihood one should be aware that misspecification of the density can lead to incon- sistency in addition to requiring the use of sandwich errors. The parameters of nonlinear models are usually not directly interpretable, and it is good practice to additionally compute the implied marginal effects caused by changes in regressors (see Section 5.2.4). Some packages do this automatically; for others sev- eral lines of postestimation code using saved regression coefficients may be needed. 5.11. Bibliographic Notes A brief history of the development of asymptotic theory results for extremum estimators is given in Newey and McFadden (1994, p. 2115). A major econometrics advance was made by Amemiya (1973), who developed quite general theorems that were applied to the Tobit model MLE. Useful book-length treatments include those by Gallant (1987), Gallant and White (1987), Bierens (1993), and White (1994, 2001a). Statistical foundations are given in many books, including Amemiya (1985, Chapter 3), Davidson and MacKinnon (1993, Chapter 4), 163
  • 203. MAXIMUM LIKELIHOOD AND NONLINEAR LEAST-SQUARES ESTIMATION Greene (2003, appendix D), Davidson (1994), and Zaman (1996). 5.3 The presentation of general extremum estimation results draws heavily on Amemiya (1985, Chapter 4), and to a lesser extent on Newey and McFadden (1994). The latter reference is very comprehensive. 5.4 The estimating equations approach is used in the generalized linear models literature (see McCullagh and Nelder, 1989). Econometricians subsume this in generalized method of moments (see Chapter 6). 5.5 Statistical inference is presented in detail in Chapter 7. 5.6 See the pioneering article by Fisher (1922) for general results for ML estimation, including efficiency, and for comparison of the likelihood approach with the inverse-probability or Bayesian approach and with method of moments estimation. 5.7 Modern applications frequently use the quasi-ML framework and sandwich estimates of the variance matrix (see White, 1982, 1994). In statistics the approach is called generalized linear models, with McCullagh and Nelder (1989) a standard reference. 5.8 Similarly for NLS estimation, sandwich estimates of the variance matrix are used that re- quire relatively weak assumptions on the error process. The papers by White (1980a,c) had a big impact on statistical inference in econometrics. Generalization and a detailed review of the asymptotic theory is given in White and Domowitz (1984). Amemiya (1983) has extensively surveyed methods for nonlinear regression. Exercises 5–1 Suppose we obtain model estimates that yield predicted conditional mean E[y|x] = exp(1 + 0.01x)/[1 + exp(1 + 0.01x)]. Suppose the sample is of size 100 and x takes integer values 1, 2, . . . , 100. Obtain the following estimates of the estimated marginal effect ∂ E[y|x]/∂x. (a) The average marginal effect over all observations. (b) The marginal effect of the average observation. (c) The marginal effect when x = 90. (d) The marginal effect of a one-unit change when x = 90, computed using the finite-difference method. 5–2 Consider the following special one-parameter case of the gamma distribution, f (y) = (y/λ2 ) exp (−y/λ), y 0, λ 0. For this distribution it can be shown that E[y] = 2λ and V[y] = 2λ2 . Here we introduce regressors and suppose that in the true model the parameter λ depends on regressors according to λi = exp(x i β)/2. Thus E[yi |xi ] = exp(x i β) and V[yi |xi ] = [exp(x i β)]2 /2. Assume the data are inde- pendent over i and xi is nonstochastic and β = β0 in the dgp. (a) Show that the log-likelihood function (scaled by N−1 ) for this gamma model is QN(β) = N−1 i ! ln yi − 2x i β + 2 ln 2 − 2yi exp(−x i β) . (b) Obtain plim QN(β). You can assume that assumptions for any LLN used are satisfied. [Hint: E[ln yi ] depends on β0 but not β.] (c) Prove that β that is the local maximum of QN(β) is consistent for β0. State any assumptions made. (d) Now state what LLN you would use to verify part (b) and what additional information, if any, is needed to apply this law. A brief answer will do. There is no need for a formal proof. 164
  • 204. 5.11. BIBLIOGRAPHIC NOTES 5–3 Continue with the gamma model of Exercise 5–2. (a) Show that ∂ QN(β)/∂β = N−1 i 2[(yi − exp(x i β))/ exp(x i β)]xi . (b) What essential condition indicated by the first-order conditions needs to be satisfied for β to be consistent? (c) Apply a central limit theorem to obtain the limit distribution of √ N∂ QN/∂β|β0 . Here you can assume that the assumptions necessary for a CLT are satisfied. (d) State what CLT you would use to verify part (c) and what additional informa- tion, if any, is needed to apply this law. A brief answer will do. There is no need for a formal proof. (e) Obtain the probability limit of ∂2 QN/∂β∂β |β0 . (f) Combine the previous results to obtain the limit distribution of √ N( β − β0). (g) Given part (f), state how to test H0 : β0 j ≥ β∗ j against Ha : β0 j β∗ j at level 0.05, where βj is the j th component of β. 5–4 A nonnegative integer variable y that is geometric distributed has density (or more formally probability mass function) f (y) = (y + 1)(2λ)y (1 + 2λ)−(y+0.5) , y = 0, 1, 2, . . . , λ 0. Then E[y] = λ and V[y] = λ(1 + 2λ). Introduce regressors and suppose γi = exp(x i β). Assume the data are independent over i and xi is non- stochastic and β = β0 in the dgp. (a) Repeat Exercise 5–2 for this model. (b) Repeat Exercise 5–3 for this model. 5–5 Suppose a sample yields estimates θ1 = 5, θ2 = 3, se[ θ1] = 2, and se[ θ2] = 1 and the correlation coefficient between θ1 and θ2 equals 0.5. Perform the following tests at level 0.05, assuming asymptotic normality of the parameter estimates. (a) Test H0 : θ1 = 0 against Ha : θ1 = 0. (b) Test H0 : θ1 = 2θ2 against Ha : θ1 = 2θ2. (c) Test H0 : θ1 = 0, θ2 = 0 against Ha : at least one of θ1, θ2 = 0. 5–6 Consider the nonlinear regression model y = exp (x β)/[1 + exp (x β)] + u, where the error term is possibly heteroskedastic. (a) Within what range does this restrict E[y|x] to lie? (b) Give the first-order conditions for the NLS estimator. (c) Obtain the asymptotic distribution of the NLS estimator using result (5.77). 5–7 This question presumes access to software that allows NLS and ML estimation. Consider the gamma regression model of Exercise 5–2. An appropriate gamma variate can be generated using y = −λ lnr1 − λ lnr2, where λ = exp (x β)/2 and r1 and r2 are random draws from Uniform[0, 1]. Let x β = β1 + β2x. Generate a sample of size 1,000 when β1 = −1.0 and β2 = 1 and x ∼N[0, 1]. (a) Obtain estimates of β1 and β2 from NLS regression of y on exp(β1 + β2x). (b) Should sandwich standard errors be used here? (c) Obtain ML estimates of β1 and β2 from NLS regression of y on exp(β1 + β2x). (d) Should sandwich standard errors be used here? 165
  • 205. C H A P T E R 6 Generalized Method of Moments and Systems Estimation 6.1. Introduction The previous chapter focused on m-estimation, including ML and NLS estimation. Now we consider a much broader class of extremum estimators, those based on method of moments (MM) and generalized method of moments (GMM). The basis of MM and GMM is specification of a set of population moment condi- tions involving data and unknown parameters. The MM estimator solves the sample moment conditions that correspond to the population moment conditions. For exam- ple, the sample mean is the MM estimator of the population mean. In some cases there may be no explicit analytical solution for the MM estimator, but numerical solution may still be possible. Then the estimator is an example of the estimating equations estimator introduced briefly in Section 5.4. In some situations, however, MM estimation may be infeasible because there are more moment conditions and hence equations to solve than there are parameters. A leading example is IV estimation in an overidentified model. The GMM estimator, due to Hansen (1982), extends the MM approach to accommodate this case. The GMM estimator defines a class of estimators, with different GMM estimators obtained by using different population moment conditions, just as different specified densities lead to different ML estimators. We emphasize this moment-based approach to estimation, even in cases where alternative presentations are possible, as it provides a unified approach to estimation and can provide an obvious way to extend methods from linear to nonlinear models. The basics of GMM estimation are given in Sections 6.2 and 6.3, which present, respectively, expository examples and asymptotic results for statistical inference. The remainder of the chapter details more specialized estimators. Instrumental variables estimators are presented in Sections 6.4 and 6.5. For linear models the treatment in Sections 4.8 and 4.9 may be sufficient, but extension to nonlinear models uses the GMM approach. Section 6.6 covers methods to compute standard errors of sequential two-step m-estimators. Sections 6.7 and 6.8 present the minimum distance estimator, a variant of GMM, and the empirical likelihood estimator, an alternative estimator to 166
  • 206. 6.2. EXAMPLES GMM. Systems estimation methods, used in a relatively small fraction of microecono- metrics studies, are discussed in Sections 6.9 and 6.10. This chapter reviews many estimation methods from a GMM perspective. Applica- tions of these methods to actual data include a linear IV application in Section 4.9.6 and a linear panel GMM application in Section 22.3. 6.2. Examples GMM estimators are based on the analogy principle (see Section 5.4.2) that population moment conditions lead to sample moment conditions that can be used to estimate parameters. This section provides several leading applications of this principle, with properties of the resulting estimator deferred to Section 6.3. 6.2.1. Linear Regression A classic example of method of moments is estimation of the population mean when y is iid with mean µ. In the population E[y − µ] = 0. Replacing the expectations operator E[·] for the population by the average operator N−1 N i=1(·) for the sample yields the corresponding sample moment 1 N N i=1 (yi − µ) = 0. Solving for µ leads to the estimator µMM = N−1 i yi = ȳ. The MM estimate of the population mean is the sample mean. This approach can be extended to the linear regression model y = x β + u, where x and β are K × 1 vectors. Suppose the error term u has zero mean conditional on regressors. The single conditional moment restriction E[u|x] = 0 leads to K uncondi- tional moment conditions E[xu] = 0, since E[xu] = Ex[E[xu|x]] = Ex[xE[u|x]] = Ex[x·0] = 0, (6.1) using the law of iterated expectations (see Section A.8) and the assumption that E[u|x] = 0. Thus E[x(y − x β)] = 0, if the error has conditional mean zero. The MM estimator is the solution to the corre- sponding sample moment condition 1 N N i=1 xi (yi − x i β) = 0. This yields βMM = ( i xi x i )−1 i xi yi . 167
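As an added illustrative sketch (not from the original text), the two MM examples just given can be computed directly from their sample moment conditions in Python/NumPy; the simulated dgp and names below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=N)

# MM for the population mean: solve (1/N) sum_i (y_i - mu) = 0
mu_mm = y.mean()

# MM for the linear model: solve the K sample moments (1/N) sum_i x_i (y_i - x_i'b) = 0,
# which gives the OLS formula (X'X)^{-1} X'y
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)
```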
The OLS estimator is therefore a special case of MM estimation. The MM derivation of the OLS estimator, however, differs significantly from the usual one of minimization of a sum of squared residuals.

6.2.2. Nonlinear Regression

For nonlinear regression the method of moments approach reduces to NLS if regression errors are additive. For more general nonlinear regression with nonadditive errors (defined in the following) method of moments yields a consistent estimator whereas NLS is inconsistent.

From Section 5.8.3 the nonlinear regression model with additive error is a model that specifies y = g(x, β) + u. A moment approach similar to that for the linear model yields that E[u|x] = 0 implies that E[h(x)(y − g(x, β))] = 0, where h(x) is any function of x. The particular choice h(x) = ∂g(x, β)/∂β, motivated in Section 6.3.7, leads to a corresponding sample moment condition that equals the first-order conditions for the NLS estimator given in Section 5.8.2.

The more general nonlinear regression model with nonadditive error specifies

u = r(y, x, β),

where again E[u|x] = 0 but now y is no longer restricted to being an additive function of u. For example, in Poisson regression one may define the standardized error u = [y − exp(x'β)]/[exp(x'β)]^{1/2}, which has E[u|x] = 0 and V[u|x] = 1 since y has conditional mean and variance equal to exp(x'β).

The NLS estimator is inconsistent given nonadditive error. Minimizing N^{-1} Σ_i u_i^2 = N^{-1} Σ_i r(y_i, x_i, β)^2 leads to first-order conditions

(1/N) Σ_{i=1}^{N} [∂r(y_i, x_i, β)/∂β] r(y_i, x_i, β) = 0.

Here y_i appears in both terms of the product and there is no guarantee that this product has expected value zero even if E[r(·)|x] = 0. This inconsistency did not arise with additive errors r(·) = y − g(x, β), as then ∂r(·)/∂β = −∂g(x, β)/∂β, so only the second term in the product depended on y.

A moment-based approach yields a consistent estimator. The assumption that E[u|x] = 0 implies

E[h(x)r(y, x, β)] = 0,

where h(x) is a function of x. If dim[h(x)] = K then the corresponding sample moment condition

(1/N) Σ_{i=1}^{N} h(x_i) r(y_i, x_i, β) = 0

yields a consistent estimate of β, with the solution obtained by numerical methods.
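The following added sketch (illustrative, with a simulated Poisson dgp of our choosing) applies this moment-based approach to the standardized-error example above, taking h(x) = x and solving the K sample moment equations numerically with a root finder.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))

# Nonadditive error: r(y, x, b) = (y - exp(x'b)) / exp(x'b)**0.5, with h(x) = x
def sample_moments(b):
    u = (y - np.exp(X @ b)) / np.exp(X @ b) ** 0.5
    return X.T @ u / N          # K equations in the K unknowns b

sol = root(sample_moments, x0=np.zeros(2))
print("MM estimate:", sol.x)
```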
  • 208. 6.2. EXAMPLES 6.2.3. Maximum Likelihood The Kullback–Leibler information criterion was defined in Section 5.7.2. From this definition, a local maximum of KLIC occurs if E[s(θ)]= 0, where s(θ) = ∂ ln f (y|x, θ)/∂θ and f (y|x, θ) is the conditional density. Replacing population moments by sample moments yields an estimator θ that solves N−1 i si (θ) = 0. These are the ML first-order conditions, so the MLE can be motivated as an MM estimator. 6.2.4. Additional Moment Restrictions Using additional moments can improve the efficiency of estimation but requires adap- tation of regular method of moments if there are more moment conditions than param- eters to estimate. A simple example of an inefficient estimator is the sample mean. This is an ineffi- cient estimator of the population mean unless the data are a random sample from the normal distribution or some other member of the exponential family of distributions. One way to improve efficiency is to use alternative estimators. The sample median, consistent for µ if the distribution is symmetric, may be more efficient. Obviously the MLE could be used if the distribution is fully specified, but here we instead improve efficiency by using additional moment restrictions. Consider estimation of β in the linear regression model. The OLS estimator is in- efficient even assuming homoskedastic errors, unless errors are normally distributed. From Section 6.2.1, the OLS estimator is an MM estimator based on E[xu] = 0. Now make the additional moment assumption that errors are conditionally symmetric, so that E[u3 |x] = 0 and hence E[xu3 ] = 0. Then estimation of β may be based on the 2K moment conditions E[x(y − x β)] E[x(y − x β)3 ] = 0 0 . The MM estimator would attempt to estimate β as the solution to the corresponding sample moment conditions N−1 i xi (yi − x i β) = 0 and N−1 i xi (yi − x i β)3 = 0. However, with 2K equations and only K unknown parameters β, it is not possible for all of these sample moment conditions to be satisfied. The GMM estimator instead sets the sample moments as close to zero as possible using quadratic loss. Then βGMM minimizes QN (β) = 1 N i xi ui 1 N i xi u3 i WN 1 N i xi ui 1 N i xi u3 i , (6.2) where ui = yi − x i β and WN is a 2K × 2K weighting matrix. For some choices of WN this estimator is more efficient than OLS. This example is analyzed in Sec- tion 6.3.6. 169
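As an added numerical sketch of the objective in (6.2) (not from the original text; the identity weighting matrix and the simulated symmetric-error dgp are our illustrative choices), the 2K stacked sample moments can be minimized in quadratic form as follows.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0])
u = rng.laplace(size=N)          # symmetric errors, so E[u^3|x] = 0 as well as E[u|x] = 0
y = X @ beta_true + u

def gbar(b):
    e = y - X @ b
    return np.concatenate([X.T @ e, X.T @ e**3]) / N   # 2K stacked sample moments

W = np.eye(4)                                          # simple (generally suboptimal) weight
Q = lambda b: gbar(b) @ W @ gbar(b)
beta_gmm = minimize(Q, x0=np.zeros(2), method="BFGS").x
```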
  • 209. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION 6.2.5. Instrumental Variables Regression Instrumental variables estimation is a leading example of generalized method of mo- ments estimation. Consider the linear regression model y = x β + u, with the complication that some components of x are correlated with the error term so that OLS is inconsistent for β. Assume the existence of instruments z (introduced in Section 4.8) that are correlated with x but satisfy E[u|z] = 0. Then E[y − x β|z] = 0. Using algebra similar to that used to obtain (6.1) for the OLS example, we multiply by z to get the K unconditional population moment conditions E[z(y − x β)] = 0. (6.3) The method of moments estimator solves the corresponding sample moment condition 1 N N i=1 zi (yi − x i β) = 0. If dim(z) = K this yields βMM = ( i zi x i )−1 i zi yi , which is the linear IV estimator introduced in Section 4.8.6. No unique solution exists if there are more potential instruments than regressors, since then dim(z) K and there are more equations than unknowns. One possibility is to use just K instruments, but there is then an efficiency loss. The GMM estimator instead chooses β to make the vector N−1 i zi (yi − x i β) as small as possible using quadratic loss, so that βGMM minimizes QN (β) = 1 N N i=1 zi (yi − x i β) ' WN 1 N N i=1 zi (yi − x i β) ' , (6.4) where WN is a dim(z) × dim(z) weighting matrix. The 2SLS estimator (see Sec- tion 4.8.6) corresponds to a particular choice of WN . Instrumental variables methods for linear models are presented in considerable de- tail in Section 6.4. An advantage of the GMM approach is that it provides a way to specify the optimal choice of weighting matrix WN , leading to an estimator more effi- cient than 2SLS. Section 6.5 covers IV methods for nonlinear models. One advantage of the GMM approach is that generalization to nonlinear regression is straightforward. Then we simply replace y − x β in the preceding expression for QN (β) by the nonlinear model error u = y − g(x β) or u = r(y, x, β). 6.2.6. Panel Data Another leading application of GMM and related estimation methods is to panel data regression. As an example, suppose yit = x it β+uit , where i denotes individual and t denotes time. From Section 6.2.1, pooled OLS regression of yit on xit is an MM estimator based on the condition E[xit uit ] = 0. Suppose it is additionally assumed that the er- ror uit is uncorrelated with regressors in periods other than the current period. Then 170
E[x_is u_it] = 0 for s ≠ t provides additional moment conditions that can be used to obtain more efficient estimators. Chapters 22 and 23 provide many applications of GMM methods to panel data.

6.2.7. Moment Conditions from Economic Theory

Economic theory can generate moment conditions that can be used as the basis for estimation. Begin with the model

y_t = E[y_t | x_t, β] + u_t,

where the first term on the right-hand side measures the "anticipated" component of y conditional on x and the second component measures the "unanticipated" component. As examples, y may denote the return on an asset or the rate of inflation. Under the twin assumptions of rational expectations and market clearing or market efficiency, we may obtain the result that the unanticipated component is unpredictable using any information that was available at time t for determining E[y|x]. Then

E[(y_t − E[y_t | x_t, β]) | I_t] = 0,

where I_t denotes the information available at time t.

By the law of iterated expectations, E[z_t(y_t − E[y_t | x_t, β])] = 0, where z_t is formed from any subset of I_t. Since any part of the information set can be used as an instrument, this provides many moment conditions that can be the basis of estimation. If time-series data are available then GMM minimizes the quadratic form

Q_T(β) = [ (1/T) Σ_{t=1}^{T} z_t u_t ]' W_T [ (1/T) Σ_{t=1}^{T} z_t u_t ],

where u_t = y_t − E[y_t | x_t, β]. If cross-section data are available at a single time point t then GMM minimizes the quadratic form

Q_N(β) = [ (1/N) Σ_{i=1}^{N} z_i u_i ]' W_N [ (1/N) Σ_{i=1}^{N} z_i u_i ],

where u_i = y_i − E[y_i | x_i, β] and the subscript t can be dropped as only one time period is analyzed.

This approach is not restricted to the additive structure used in the motivation. All that is needed is an error u_t with the property that E[u_t | I_t] = 0. Such conditions arise from the Euler conditions of intertemporal models of decision making under uncertainty. For example, Hansen and Singleton (1982) present a model of maximization of expected lifetime utility that leads to the Euler condition E[u_t | I_t] = 0, where u_t = β g_{t+1}^α r_{t+1} − 1, g_{t+1} = c_{t+1}/c_t is the ratio of consumption in two periods, and r_{t+1} is the asset return. The parameters β and α, respectively the intertemporal discount rate and the coefficient of relative risk aversion, can be estimated by GMM using either time-series or cross-section data as before, with this new definition of u_t. Hansen (1982) and Hansen and Singleton (1982) consider time-series data; MaCurdy (1983) modeled both consumption and labor supply using panel data.
Table 6.1. Generalized Method of Moments: Examples

  Moment Function h(·)             Estimation Method
  y − µ                            Method of moments for population mean
  x(y − x'β)                       Ordinary least-squares regression
  z(y − x'β)                       Instrumental variables regression
  ∂ ln f(y|x, θ)/∂θ                Maximum likelihood estimation

6.3. Generalized Method of Moments

This section presents the general theory of GMM estimation. Generalized method of moments defines a class of estimators. Different choices of moment condition and weighting matrix lead to different GMM estimators, just as different choices of distribution lead to different ML estimators. We address these issues, in addition to presenting the usual properties of consistency and asymptotic normality and methods to estimate the variance matrix of the GMM estimator.

6.3.1. Method of Moments Estimator

The starting point is to assume the existence of r moment conditions for q parameters,

E[h(w_i, θ_0)] = 0,   (6.5)

where θ is a q × 1 vector, h(·) is an r × 1 vector function with r ≥ q, and θ_0 denotes the value of θ in the dgp. The vector w includes all observables including, where relevant, a dependent variable y, potentially endogenous regressors x, and instrumental variables z. The dependent variable y may be a vector, so that applications with systems of equations or with panel data are subsumed. The expectation is with respect to all stochastic components of w and hence y, x, and z.

The choice of functional form for h(·) is qualitatively similar to the choice of model and will vary with the application. Table 6.1 summarizes some single-equation examples of h(w) = h(y, x, z, θ) already presented in Section 6.2.

If r = q then method of moments can be applied. Equality to zero of the population moment is replaced by equality to zero of the corresponding sample moment, and the method of moments estimator θ̂_MM is defined to be the solution to

(1/N) Σ_{i=1}^{N} h(w_i, θ) = 0.   (6.6)

This is an estimating equations estimator that equivalently minimizes

Q_N(θ) = [ (1/N) Σ_{i=1}^{N} h(w_i, θ) ]' [ (1/N) Σ_{i=1}^{N} h(w_i, θ) ],

with asymptotic distribution presented in Section 5.4 and reproduced in (6.13) in Section 6.3.3.
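A generic numerical implementation of this idea is straightforward. The added sketch below (not from the original text; function and argument names are ours) minimizes the quadratic form just displayed for a user-supplied moment function; passing a weighting matrix other than the identity gives the weighted objective introduced in the next subsection. In the commented usage example, X and y are assumed to be data arrays already in memory.

```python
import numpy as np
from scipy.optimize import minimize

def moment_based_estimate(moments, theta_start, W=None):
    """Minimize [N^{-1} sum_i h(w_i, theta)]' W [N^{-1} sum_i h(w_i, theta)].
    `moments(theta)` must return the N x r matrix with rows h(w_i, theta)."""
    r = moments(np.asarray(theta_start)).shape[1]
    W = np.eye(r) if W is None else W

    def Q(theta):
        g = moments(theta).mean(axis=0)   # r-vector of sample moments
        return g @ W @ g

    return minimize(Q, theta_start, method="BFGS").x

# Usage example: OLS recovered as an MM estimator with h(w_i, theta) = x_i (y_i - x_i'theta)
# beta_mm = moment_based_estimate(lambda b: X * (y - X @ b)[:, None], np.zeros(X.shape[1]))
```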
  • 212. 6.3. GENERALIZED METHOD OF MOMENTS 6.3.2. GMM Estimator The GMM estimator is based on r independent moment conditions (6.5) while q pa- rameters are estimated. If r = q the model is said to be just-identified and the MM estimator in (6.6) can be used. More formally r = q is only a necessary condition for just-identification and we additionally require that G0 in Proposition 5.1 is of rank q. Identification is addressed in Section 6.3.9. If r q the model is said to be overidentified and (6.6) has no solution for θ as there are more equations (r) than unknowns (q). Instead, θ is chosen so that a quadratic form in N−1 i h(wi , θ) is as close to zero as possible. Specifically, the generalized methods of moments estimator θGMM minimizes the objective function QN (θ) = 1 N N i=1 h(wi , θ) ' WN 1 N N i=1 h(wi , θ) ' , (6.7) where the r × r weighting matrix WN is symmetric positive definite, possibly stochas- tic with finite probability limit, and does not depend on θ. The subscript N on WN is used to indicate that its value may depend on the sample. The dimension r of WN , however, is fixed as N → ∞. The objective function can also be expressed in matrix notation as QN (θ) = N−1 l H(θ) × WN × N−1 H(θ) l, where l is an N × 1 vector of ones and H(θ) is an N × r matrix with ith row h(yi , xi , θ) . Different choices of weighting matrix WN lead to different estimators that, although consistent, have different variances if r q. A simple choice, though often a poor choice, is to let WN be the identity matrix. Then QN (θ) = h̄2 1 + h̄2 2 + · · · + h̄2 r is the sum of r squared sample averages, where h̄ j = N−1 i h j (wi , θ) and h j (·) is the jth component of h(·). The optimal choice of WN is given in Section 6.3.5. Differentiating QN (θ) in (6.7) with respect to θ yields the GMM first-order conditions 1 N N i=1 ∂hi ( θ) ∂θ θ ' × WN × 1 N N i=1 hi ( θ) ' = 0, (6.8) where hi (θ) = hi (wi , θ) and we have multiplied by the scaling factor 1/2. These equa- tions will generally be nonlinear in θ and can be quite complicated to solve as θ may appear in both the first and third terms. Numerical solution methods are presented in Chapter 10. 6.3.3. Distribution of GMM Estimator The asymptotic distribution of the GMM estimator is given in the following proposi- tion, derived in Section 6.3.9. Proposition 6.1 (Distribution of GMM Estimator): Make the following as- sumptions: (i) The dgp imposes the moment condition (6.5); that is, E[h(w, θ0)] = 0. (ii) The r × 1 vector function h(·) satisfies h(w, θ(1) ) = h(w, θ(2) ) iff θ(1) = θ(2) . 173
  • 213. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION (iii) The following r × q matrix exists and is finite with rank q: G0 = plim 1 N N i=1 ∂hi ∂θ θ0 ' . (6.9) (iv) WN p → W0, where W0 is finite symmetric positive definite. (v) N−1/2 N i=1 hi |θ0 d → N [0, S(θ0)], where S0 = plimN−1 N i=1 N j=1 hi h j θ0 . (6.10) Then the GMM estimator θGMM, defined to be a root of the first-order conditions ∂ QN (θ)/∂θ = 0 given in (6.8), is consistent for θ0 and √ N( θGMM − θ0) d → N 0, (G 0W0G0)−1 (G 0W0S0W0G0)(G 0W0G0)−1 . (6.11) Some leading specializations are the following. First, in microeconometric analysis data are usually assumed to be independent over i, so (6.10) simplifies to S0 = plim 1 N N i=1 hi h i θ0 . (6.12) If additionally the data are assumed to be identically distributed then (6.9) and (6.10) simplify to G0 = E[∂h/∂θ θ0 ] and S0 = E[hh θ0 ], a notation used by many authors. Second, in the just-identified case that r = q, the situation for many estimators including ML and LS, the results simplify to those already presented in Section 5.4 for the estimating equations estimator. To see this note that when r = q the matrices G0, W0, and S0 are square matrices that are invertible, so (G 0W0G0)−1 = G−1 0 W−1 0 (G 0)−1 and the variance matrix in (6.11) simplifies. It follows that, for the MM estimator in (6.6), √ N( θMM − θ0) d → N 0, G−1 0 S0(G 0)−1 . (6.13) An MM estimator can always be computed as a GMM estimator and will be invariant to the choice of full rank weighting matrix. Third, the best choice of matrix WN is one such that W0 = S−1 0 . Then the variance matrix in (6.11) simplifies to (G 0S−1 0 G0)−1 . This is expanded on in Section 6.3.5. 6.3.4. Variance Matrix Estimation Statistical inference for the GMM estimator is possible given consistent estimates G of G0, W of W0, and S of S0 in (6.11). Consistent estimates are easily obtained under relatively weak distributional assumptions. 174
  • 214. 6.3. GENERALIZED METHOD OF MOMENTS For G0 the obvious estimator is G = 1 N N i=1 ∂hi ∂θ θ . (6.14) For W0 the sample weighting matrix WN is used. The estimator for the r × r matrix S0 varies with the stochastic assumptions made about the dgp. Microeconometric analysis usually assumes independence over i, so that S0 is of the simpler form (6.12). An obvious estimator is then S = 1 N N i=1 hi ( θ)hi ( θ) . (6.15) Since h(·) is r × 1, there are at most a finite number of r(r + 1)/2 unique entries in S0 to be estimated. So S is consistent as N → ∞ without need to parameterize the variance E[hi h i ], assumed to exist, to depend on fewer parameters. All that is re- quired are some mild additional assumptions to ensure that plim N−1 i hi h i = plim N−1 i hi h i . For example, if hi = xi ui , where ui is the OLS residual, we know from Section 4.4 that existence of fourth moments of the regressors needs to be assumed. Combining these results, we have that the GMM estimator is asymptotically nor- mally distributed with mean θ0 and estimated asymptotic variance V[ θGMM] = 1 N G WN G −1 G WN SWN G G WN G −1 . (6.16) This variance matrix estimator is a robust estimator that is an extension of the Eicker– White heteroskedastic-consistent estimator for least-squares estimators. One can also take expectations and use GE = N−1 i E[∂hi /∂θ ] θ for G0 and SE = N−1 i E[hi h i ] θ for S0. However, this usually requires additional distribu- tional assumptions to take the expectation, and the variance matrix estimate will not be as robust to distributional misspecification. In the time-series case ht is subscripted by time t, and asymptotic theory is based on the number of time periods T → ∞. For time-series data, with ht a vector MA(q) process, the usual estimator of V[ θGMM] is one proposed by Newey and West (1987b) that uses (6.16) with S = Ω0 + q j=1(1 − j q+1 )( Ωj + Ω j ), where Ωj = T −1 T t= j+1 ht h t− j . This permits time-series correlation in ht in addition to contem- poraneous correlation. Further details on covariance matrix estimation, including im- provements in the time-series case, are given in Davidson and MacKinnon (1993, Sec- tion 17.5), Hamilton (1994), and Haan and Levin (1997). 6.3.5. Optimal Weighting Matrix Application of GMM requires specification of moment function h(·) and weighting matrix WN in (6.7). The easy part is choosing WN to obtain the GMM estimator with the smallest asymptotic variance given a specified function h(·). This is often called optimal GMM 175
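For concreteness (an added sketch, not from the original text; the helper name is ours), the sandwich variance in (6.16) can be computed as follows once Ĝ, Ŝ, and W_N are available. The commented lines indicate how these would be formed for the linear IV moment h_i = z_i(y_i − x_i'β) under independence and heteroskedasticity.

```python
import numpy as np

def gmm_sandwich_variance(G, S, W, N):
    """Estimated asymptotic variance (6.16):
    N^{-1} (G'WG)^{-1} (G'W S W G) (G'WG)^{-1}."""
    A = np.linalg.inv(G.T @ W @ G)
    B = G.T @ W @ S @ W @ G
    return A @ B @ A / N

# Example for linear IV moments h_i = z_i (y_i - x_i'beta), with estimate beta_hat:
#   u = y - X @ beta_hat
#   G = -Z.T @ X / N                       # (6.14) with dh/dbeta' = -z x'
#   S = (Z * (u**2)[:, None]).T @ Z / N    # (6.15) under independence over i
#   V = gmm_sandwich_variance(G, S, W, N); se = np.sqrt(np.diag(V))
```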
  • 215. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION even though it is a limited form of optimality since a poor choice of h(·) could still lead to a very inefficient estimator. For just-identified models the same estimator (the MM estimator) is obtained for any full rank weighting matrix, so one might just as well set WN = Iq. For overidentified models with r q, and S0 known, the most efficient GMM es- timator is obtained by choosing the weighting matrix WN = S−1 0 . Then the variance matrix given in the proposition simplifies and √ N( θGMM − θ0) d → N 0, (G 0S−1 0 G0)−1 , (6.17) a result due to Hansen (1982). This result can be obtained using matrix arguments similar to those that establish that GLS is the most efficient WLS estimator in the linear model. Even more simply, one can work directly with the objective function. For LS estimators that minimize the quadratic form u Wu the most efficient estimator is GLS that sets W = Σ−1 = V[u]−1 . The GMM objective function in (6.7) is of this quadratic form with u = N−1 i hi (θ) and so the optimal W = (V[N−1 i hi (θ)])−1 = S−1 0 . The optimal GMM estimator weights by the inverse of the variance matrix of the sample moment conditions. Optimal GMM In practice S0 is unknown and we let WN = S−1 , where S is consistent for S0. The optimal GMM estimator can be obtained using a two-step procedure. At the first step a GMM estimator is obtained using a suboptimal choice of WN , such as WN = Ir for simplicity. From this first step, form estimate S using (6.15). At the second step perform an optimal GMM estimator with optimal weighting matrix WN = S−1 . Then the optimal GMM estimator or two-step GMM estimator θOGMM based on hi (θ) minimizes QN (θ) = 1 N N i=1 hi (θ) ' S−1 1 N N i=1 hi (θ) ' . (6.18) The limit distribution is given in (6.17). The optimal GMM estimator is asymptoti- cally normally distributed with mean θ0 and estimated asymptotic variance with the relatively simple formula V[ θOGMM] = N−1 ( G S−1 G)−1 . (6.19) Usually evaluation of G and S is at θOGMM, so S uses the same formula as S except that evaluation is at θOGMM. An alternative is to continue to evaluate (6.19) at the first-step estimator, as any consistent estimate of θ0 can be used. Remarkably, the optimal GMM estimator in (6.18) requires no additional stochastic assumptions beyond those needed to permit use of (6.16) to estimate the variance matrix of suboptimal GMM. In both cases S needs to be consistent for S0 and from the discussion after (6.15) this requires few additional assumptions. This stands in stark contrast to the additional assumptions needed for GLS to be more efficient than OLS when errors are heteroskedastic. Heteroskedasticity in the errors will affect the optimal choice of hi (θ), however (see Section 6.3.7). 176
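The two-step procedure just described is easy to code. The added sketch below (ours, for illustration) implements it for the linear IV moment h_i(β) = z_i(y_i − x_i'β), using a heteroskedasticity-robust Ŝ as in (6.15) and the variance formula (6.19). The first-step weighting matrix (Z'Z/N)^{-1} is one common suboptimal choice, and for simplicity Ŝ is not re-evaluated at the second-step estimate; as noted above, any consistent estimate of θ_0 can be used.

```python
import numpy as np

def two_step_gmm_iv(y, X, Z):
    """Two-step (optimal) GMM for the linear IV model with robust S-hat."""
    N = len(y)
    # Step 1: a consistent estimator under a suboptimal weighting matrix
    W1 = np.linalg.inv(Z.T @ Z / N)
    b1 = np.linalg.solve(X.T @ Z @ W1 @ Z.T @ X, X.T @ Z @ W1 @ Z.T @ y)
    # Step 2: estimate S from first-step residuals and re-weight with W = S^{-1}
    u = y - X @ b1
    S = (Z * (u**2)[:, None]).T @ Z / N
    W2 = np.linalg.inv(S)
    b2 = np.linalg.solve(X.T @ Z @ W2 @ Z.T @ X, X.T @ Z @ W2 @ Z.T @ y)
    V = N * np.linalg.inv(X.T @ Z @ W2 @ Z.T @ X)   # (6.19): N [X'Z S^{-1} Z'X]^{-1}
    return b2, V
```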
Small-Sample Bias of Two-Step GMM

Theory suggests that for overidentified models it is best to use optimal GMM. In implementation, however, the theoretical optimal weighting matrix W_N = S_0^{-1} needs to be replaced by a consistent estimate Ŝ^{-1}. This replacement makes no difference asymptotically, but it will make a difference in finite samples. In particular, individual observations that increase h_i(θ) in (6.18) are likely to also increase Ŝ = N^{-1} Σ_i h_i h_i' in (6.18), leading to correlation between N^{-1} Σ_i h_i(θ) and Ŝ. Note that S_0 = plim N^{-1} Σ_i h_i h_i' is not similarly affected because the probability limit is taken.

Altonji and Segal (1996) demonstrated this problem in estimation of covariance structure models using panel data (see Section 22.5). They used the related minimum distance estimator (see Section 6.7), but in the literature their results are interpreted as being relevant to GMM estimation with cross-section data or short panels. In simulations the optimal estimator was more efficient than a one-step estimator, as expected. However, the optimal estimator had finite-sample bias so large that its root mean-squared error was much larger than that for the one-step estimator.

Altonji and Segal (1996) also proposed a variant, an independently weighted optimal estimator that forms the weighting matrix using observations other than those used to construct the sample moments. They split the sample into G groups, with G = 2 an obvious choice, and minimize

Q_N(θ) = (1/G) Σ_g h_g(θ)' Ŝ_{(−g)}^{-1} h_g(θ),   (6.20)

where h_g(θ) is computed for the gth group and Ŝ_{(−g)} is computed using all but the gth group. This estimator is less biased, since the weighting matrix Ŝ_{(−g)}^{-1} is by construction independent of h_g(θ). However, splitting the sample leads to an efficiency loss. Horowitz (1998a) instead used the bootstrap (see Section 11.6.4).

In the Altonji and Segal (1996) example h_i involves second moments, so Ŝ involves fourth moments. Finite-sample problems for the optimal estimator may not be as significant in other examples where h_i involves only first moments. Nonetheless, Altonji and Segal's results do suggest caution in using optimal GMM, and that differences between one-step GMM and optimal GMM estimates may indicate problems of finite-sample bias in optimal GMM.

Number of Moment Restrictions

In general adding further moment restrictions improves asymptotic efficiency, as it reduces the limit variance (G_0' S_0^{-1} G_0)^{-1} of the optimal GMM estimator or at worst leaves it unchanged.

The benefits of adding further moment conditions vary with the application. For example, if the estimator is the MLE then there is no gain, since the MLE is already fully efficient. The literature has focused on IV estimation, where the gains may be considerable because the variable being instrumented may be much more highly correlated with a combination of many instruments than with a single instrument.

There is a limit, however, as the number of moment restrictions cannot exceed the number of observations. Moreover, adding more moment conditions increases the
  • 217. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION likelihood of finite-sample bias and related problems similar to those of weak instru- ments in linear models (see Section 4.9). Stock et al. (2002) briefly consider weak instruments in nonlinear models. 6.3.6. Regression with Symmetric Error Example To demonstrate the GMM asymptotic results we return to the additional moment re- strictions example introduced in Section 6.2.4. For this example the objective function for βGMM has already been given in (6.2). All that is required is specification of WN , such as WN = I. To obtain the distribution of this estimator we use the general notation of Section 6.3. The function h(·) in (6.5) specializes to h(y, x, β) = x(y − x β) x(y − x β)3 ⇒ ∂h(y, x, β) ∂β = −xx −3xx (y − x β)2 . These expressions lead directly to expressions for G0 and S0 using (6.9) and (6.12), so that (6.14) and (6.15) then yield consistent estimates G = − 1 N i xi x i − 1 N i 3 u2 i xi x i (6.21) and S = 1 N i u2 i xi x i 1 N i u4 i xi x i 1 N i u4 i xi x i 1 N i u6 i xi x i , (6.22) where ui = y − x i β. Alternative estimates can be obtained by first evaluating the ex- pectations in G0 and S0, but this will require assumptions on E[u2 |x], E[u4 |x], and E[u6 |x]. Substituting G, S, and WN into (6.16) gives the estimated asymptotic vari- ance matrix for βGMM. Now consider GMM with an optimal weighting matrix. This again minimizes (6.2), but from (6.18) now WN = S−1 , where S is defined in (6.22). Computation of S re- quires first-step consistent estimates β. An obvious choice is GMM with WN = I. In this example the OLS estimator is also consistent and could instead be used. Using (6.19) gives this two-step estimator an estimated asymptotic variance matrix V[ βOGMM] equal to i ui xi x i i u3 i xi x i i u2 i xi x i i u4 i xi x i i u4 i xi x i i u6 i xi x i −1 i ui xi x i i u3 i xi x i −1 , where ui = yi − x i βOGMM and the various divisions by N have canceled out. Analytical results for the efficiency gain of optimal GMM in this example are eas- ily obtained by specialization to the nonregression case where y is iid with mean µ. Furthermore, assume that y is Laplace distributed with scale parameter equal to unity, in which case the density is f (y) = (1/2) × exp{−|y − µ|} with E[y] = µ, V[y] = 2, and higher central moments E[(y − µ)r ] equal to zero for r odd and equal to r! for r even. The sample median is fully efficient as it is the MLE, and it can be shown to 178
  • 218. 6.3. GENERALIZED METHOD OF MOMENTS have asymptotic variance 1/N. The sample mean ȳ is inefficient with variance V[ȳ] = V[y]/N = 2/N. The optimal GMM estimator µopt based on the two moment condi- tions E[(y − µ)] = 0 and E[(y − µ)3 ] = 0 has weighting matrix that places much less weight on the second moment condition, because it has relatively high variance, and has negative off-diagonal entries. The optimal GMM estimator µOGMM can be shown to have asymptotic variance 1.7143/N (see Exercise 6.3). It is therefore more efficient than the sample mean (variance 2/N), though is still considerably less efficient than the sample median. For this example the identity matrix is an exceptionally poor choice of weighting matrix. It places too much weight on the second moment condition, yielding a sub- optimal GMM estimator of µ with asymptotic variance 19.14/N that is many times greater than even V[ȳ] = 2/N. For details see Exercise 6.3. 6.3.7. Optimal Moment Condition Section 6.3.5 gives the surprising result that optimal GMM requires essentially no more assumptions than does GMM without an optimal weighting matrix. However, this optimality is very limited as it is conditional on the choice of moment function h(·) in (6.5) or (6.18). The GMM defines a class of estimators, with different choice of h(·) correspond- ing to different members of the class. Some choices of h(·) are better than others, de- pending on additional stochastic assumptions. For example, hi = xi ui yields the OLS estimator whereas hi = xi ui /V[ui |xi ] yields the GLS estimator when errors are het- eroskedastic. This multitude of potential choices for h(·) can make any particular GMM estimator appear ad hoc. However, qualitatively similar decisions have to be made in m-estimation in choosing, for example, to minimize the sum of squared errors rather than the weighted sum of squared errors or the sum of absolute deviations of errors. If complete distributional assumptions are made the most efficient estimator is the MLE. Thus the optimal choice of h(·) in (6.5) is h(w, θ) = ∂ ln f (w, θ) ∂θ , where f (w, θ) is the joint density of w. For regression with dependent variable(s) y and regressors x this is the unconditional MLE based on the unconditional joint den- sity f (y, x, θ) of y and x. In many applications f (y, x, θ) = f (y|x, θ)g(x), where the (suppressed) parameters of the marginal density of x do not depend on the parameters of interest θ. Then it is just as efficient to use the conditional MLE based on the con- ditional density f (y|x, θ). This can be used as the basis for MM estimation, or GMM estimation with weighting matrix WN = Iq, though any full-rank matrix WN will also give the MLE. This result is of limited practical use, however, as the purpose of GMM estimation is to avoid making a full set of distributional assumptions. When incomplete distributional assumptions are made, a common starting point is specification of a conditional moment condition, where conditioning is on exoge- nous variables. This is usually a low-order moment condition for the model error such 179
  • 219. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION as E[u|x] = 0 or E[u|z] = 0. This conditional moment condition can lead to many unconditional moment conditions that might be the basis for GMM estimation, such as E[zu] = 0. Newey (1990a, 1993) obtained results on the optimal choice of uncon- ditional moment condition for data independent over i. Specifically, begin with s conditional moment condition restrictions E[r(y, x, θ0)|z] = 0, (6.23) where r(·) is a residual-type s × 1 vector function introduced in Section 6.2.2. A scalar example is E[y − x θ0|z] = 0. The instrumental variables notation is being used where x are regressors, some potentially endogenous, and z are instruments that include the exogenous components of x. In simpler models without endogeneity z = x. GMM estimation of the q parameters θ based on (6.23) is not possible, as typically there are only a few conditional moment restrictions, and often just one, so s ≤ q. Instead, we introduce an r × s matrix function of the instruments D(z), where r ≥ q, and note that by the law of iterated expectations E[D(z)r(y, x, θ0)] = 0, which can be used as the basis for GMM estimation. The optimal instruments or optimal choice of matrix function D(z) can be shown to be the q × s matrix D∗ (z, θ0) = E ∂r(y, x, θ0) ∂θ |z {V [r(y, x, θ0)|z]}−1 . (6.24) A derivation is given in, for example, Davidson and MacKinnon (1993, p. 604). The optimal instrument matrix D∗ (z) is a q × s matrix, so the unconditional moment con- dition E[D∗ (z)r(y, x, θ0)] = 0 yields exactly as many moment conditions as param- eters. The optimal GMM estimator simply solves the corresponding sample moment conditions 1 N N i=1 D∗ (zi , θ)r(yi , xi , θ) = 0. (6.25) The optimal estimator requires additional assumptions, namely the expectations used in forming D∗ (z, θ0) in (6.24), and implementation requires replacing unknown parameters by known parameters so that generated regressors D are used. For example, if r(y, x, θ) = y − exp(x θ) then ∂r/∂θ = − exp(x θ)x and (6.24) requires specification of E[exp(x θ0)x|z] and V[y − exp(x θ)|z]. One possibility is to assume E[exp(x θ0)x|z] is a low-order polynomial in z, in which case there will be more moment conditions than parameters and so estimation is by GMM rather than simply by solving (6.25), and to assume errors are homoskedastic. If these addi- tional assumptions are wrong then the estimator is still consistent, provided (6.23) is valid, and consistent standard errors can be obtained using the robust form of the vari- ance matrix in (6.16). It is common to more simply use z rather than D∗ (z, θ) as the instrument. Optimal Moment Condition for Nonlinear Regression Example The result (6.24) is useful in some cases, especially those where z = x. Here we con- firm that GLS is the most efficient GMM estimator based on E[u|x] = 0. 180
Consider the nonlinear regression model y = g(x, β) + u. If the starting point is the conditional moment restriction E[u|x] = 0, or E[y − g(x, β)|x] = 0, then z = x in (6.23), and (6.24) yields

D*(x, β) = E[ ∂(y − g(x, β_0))/∂β | x ] {V[y − g(x, β_0)|x]}^{-1}
         = −[∂g(x, β_0)/∂β] × 1/V[u|x],

which requires only specification of V[u|x]. From (6.25) the optimal GMM estimator directly solves the corresponding sample moment conditions

(1/N) Σ_{i=1}^{N} −[∂g(x_i, β)/∂β] × (y_i − g(x_i, β))/σ_i^2 = 0,

where σ_i^2 = V[u_i|x_i] is functionally independent of β. These are the first-order conditions for generalized NLS when the error is heteroskedastic. Implementation is possible using a consistent estimate σ̂_i^2 of σ_i^2, in which case GMM estimation is the same as FGNLS. One can obtain standard errors robust to misspecification of σ_i^2 as detailed in Section 5.8.

Specializing to the linear model, g(x, β) = x'β and the optimal GMM estimator based on E[u|x] = 0 is GLS; specializing further to the case of homoskedastic errors, the optimal GMM estimator based on E[u|x] = 0 is OLS. As already seen in the example in Section 6.3.6, more efficient estimation may be possible if additional conditional moment conditions are used.

6.3.8. Tests of Overidentifying Restrictions

Hypothesis tests on θ can be performed using the Wald test (see Section 5.5), or with other methods given in Section 7.5.

In addition there is a quite general model specification test that can be used for overidentified models with more moment conditions (r) than parameters (q). The test is one of the closeness of N^{-1} Σ_i ĥ_i to 0, where ĥ_i = h(w_i, θ̂). This is an obvious test of H_0: E[h(w, θ_0)] = 0, the initial population moment conditions.

For just-identified models, estimation imposes N^{-1} Σ_i ĥ_i = 0 and the test is not possible. For overidentified models, however, the first-order conditions (6.8) set a q × r matrix times N^{-1} Σ_i ĥ_i to zero, where q < r, so Σ_i ĥ_i ≠ 0. In the special case that θ is estimated by θ̂_OGMM defined in (6.18), Hansen (1982) showed that the overidentifying restrictions (OIR) test statistic

OIR = [ N^{-1/2} Σ_i ĥ_i ]' Ŝ^{-1} [ N^{-1/2} Σ_i ĥ_i ]   (6.26)

is asymptotically distributed as χ²(r − q) under H_0: E[h(w, θ_0)] = 0. Note that OIR is simply N times the GMM objective function (6.18) evaluated at θ̂_OGMM. If OIR is large then the population moment conditions are rejected and the GMM estimator is inconsistent for θ.
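An added sketch of the OIR test as written above (assuming SciPy; the helper name is ours): given the r sample moments and Ŝ evaluated at the optimal GMM estimate, the statistic and its χ²(r − q) p-value are

```python
import numpy as np
from scipy.stats import chi2

def oir_test(hbar, S, N, q):
    """Hansen's OIR statistic: N * hbar' S^{-1} hbar, chi2(r - q) under H0.
    `hbar` is the r-vector of sample moments at the optimal GMM estimate."""
    r = len(hbar)
    stat = N * hbar @ np.linalg.inv(S) @ hbar
    pval = chi2.sf(stat, df=r - q)
    return stat, pval
```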
  • 221. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION It is not obvious a priori that the particular quadratic form in N−1 i hi given in (6.26) is χ2 (r − q) distributed under H0. A formal derivation is given in the next section and an intuitive explanation in the case of linear IV estimation is provided in Section 8.4.4. A classic application is to life-cycle models of consumption (see Section 6.2.7), in which case the orthogonality conditions are Euler conditions. A large chi-square test statistic is then often stated to mean rejection of the life-cycle hypothesis. However, it should instead be more narrowly interpreted as rejection of the particular specification of utility function and set of stochastic assumptions used in the study. 6.3.9. Derivations for the GMM Estimator The algebra is simplified by introducing a more compact notation. The GMM estimator minimizes QN (θ) = gN (θ) WN gN (θ), (6.27) where gN (θ) = N−1 i hi (θ). Then the GMM first-order conditions (6.8) are GN ( θ) WN gN ( θ) = 0, (6.28) where GN (θ) = ∂gN (θ)/∂θ = N−1 i ∂hi (θ)/∂θ . For consistency we consider the informal condition that the probability limit of ∂ QN (θ)/∂θ|θ0 equals zero. From (6.28) this will be the case as GN (θ0) and WN have finite probability limits, by assumptions (iii) and (iv) of Proposition 6.1, and plim gN (θ0) = 0 as a consequence of assumption (v). More intuitively, gN (θ0) = N−1 i hi (θ0) has probability limit zero if a law of large numbers can be applied and E[hi (θ0)] = 0, which was assumed at the outset in (6.5). The parameter θ0 is identified by the key assumption (ii) and additionally assump- tions (iii) and (iv), which restrict the probability limits of GN (θ0) and WN to be full- rank matrices. The assumption that G0 = plim GN (θ0) is a full-rank matrix is called the rank condition for identification. A weaker necessary condition for identification is the order condition that r ≥ q. For asymptotic normality, a more general theory is needed than that for an m- estimator based on an objective function QN (β) =N−1 i q(wi , θ) that involves just one sum. We rescale (6.28) by multiplication by √ N, so that GN ( θ) WN √ NgN ( θ) = 0. (6.29) The approach of the general Theorem 5.3 is to take a Taylor series expansion around θ0 of the entire left-hand side of (6.28). Since θ appears in both the first and third terms this is complicated and requires existence of first derivatives of GN (θ) and hence second derivatives of gN (θ). Since GN ( θ) and WN have finite probability limits it is sufficient to more simply take an exact Taylor series expansion of only √ NgN ( θ). This yields an expression similar to that in the Chapter 5 discussion of m-estimation, with √ NgN ( θ) = √ NgN (θ0) + GN (θ+ ) √ N( θ − θ0), (6.30) 182
  • 222. 6.4. LINEAR INSTRUMENTAL VARIABLES recalling that GN (θ) = ∂gN (θ)/∂θ , where θ+ is a point between θ0 and θ. Substitut- ing (6.30) back into (6.29) yields GN ( θ) WN √ NgN (θ0) + GN (θ+ ) √ N( θ − θ0) = 0. Solving for √ N( θ − θ0) yields √ N( θ − θ0) = − GN ( θ) WN GN (θ+ ) −1 GN ( θ) WN √ NgN (θ0). (6.31) Equation (6.31) is the key result for obtaining the limit distribution of the GMM estimator. We obtain the probability limits of each of the first five terms using θ p → θ0, given consistency, in which case θ+ p → θ0. The last term on the right-hand side of (6.31) has a limit normal distribution by assumption (v). Thus √ N( θ − θ0) d → −(G 0W0G0)−1 G 0W0 × N[0, S0], where G0, W0, and S0 have been defined in Proposition 6.1. Applying the limit normal product rule (Theorem A.17) yields (6.11). This derivation treats the GMM first-order conditions as being q linear combina- tions of the r sample moments gN ( θ), since GN ( θ) WN is a q × r matrix. The MM estimator is the special case q = r, since then GN ( θ) WN is a full-rank square matrix, so GN ( θ) WN gN ( θ) = 0 implies that gN ( θ) = 0. To derive the distribution of the OIR test statistic in (6.26), begin with a first-order Taylor series expansion of √ NgN ( θ) around θ0 to obtain √ NgN ( θOGMM) = √ NgN (θ0) + GN (θ+ ) √ N( θOGMM − θ0) = √ NgN (θ0) − G0(G 0S−1 0 G0)−1 G 0S−1 0 √ NgN (θ0) + op(1) = [I − M0S−1 0 ] √ NgN (θ0) + op(1), where the second equality uses (6.31) with WN consistent for S−1 0 , M0 = G0(G 0S−1 0 G0)−1 G 0, and op(1) is defined in Definition A.22. It follows that S −1/2 0 √ NgN ( θOGMM) = S −1/2 0 [I − M0S−1 0 ] √ NgN (θ0) + op(1) (6.32) = [I − S −1/2 0 M0S −1/2 0 ]S −1/2 0 √ NgN (θ0) + op(1). Now [I − S −1/2 0 M0S −1/2 0 ] = [I − S −1/2 0 G0(G 0S−1 0 G0)−1 G 0S −1/2 0 ] is an idempotent matrix of rank (r − q), and S −1/2 0 √ NgN (θ0) d → N[0, I] given √ NgN (θ0) d → N[0, S0]. From standard results for quadratic forms of normal variables it follows that the inner product τN = (S −1/2 0 √ NgN ( θOGMM)) (S −1/2 0 √ NgN ( θOGMM)) converges to the χ2 (r − q) distribution. 6.4. Linear Instrumental Variables Correlation of regressors with the error term leads to inconsistency of least- squares methods. Examples of such failure include omitted variables, simultaneity, 183
  • 223. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION measurement error in the regressors, and sample selection bias. Instrumental variables methods provide a general approach that can handle any of these problems, provided suitable instruments exist. Instrumental variables methods fall naturally into the GMM framework as a surplus of instruments leads to an excess of moment conditions that can be used for estimation. Many IV results are most easily obtained using the GMM framework. Linear IV is important enough to appear in many places in this book. An introduc- tion was given in Sections 4.8 and 4.9. This section presents single-equation linear IV as a particular application of GMM. For completeness the section also presents the earlier literature on a special case, the two-stage least-squares estimator. Systems lin- ear IV estimation is summarized in Section 6.9.5. Tests of endogeneity and tests of overidentifying restrictions for linear models are detailed in Section 8.4. Chapter 22 presents linear IV estimation with panel data. 6.4.1. Linear GMM with Instruments Consider the linear regression model yi = x i β+ui , (6.33) where each component of x is viewed as being an exogenous regressor if it is uncor- related with the error in model (6.33) or an endogenous regressor if it is correlated. If all regressors are exogenous then LS estimators can be used, but if any components of x are endogenous then LS estimators are inconsistent for β. From Section 4.8, consistent estimates can be obtained by IV estimation. The key assumption is the existence of an r × 1 vector of instruments z that satisfies E[ui |zi ] = 0. (6.34) Exogenous regressors can be instrumented by themselves. As there must be at least as many instruments as regressors, the challenge is to find additional instruments that at least equal the number of endogenous variables in the model. Some examples of such instruments have been given in Section 4.8.2. Linear GMM Estimator From Section 6.2.5, the conditional moment restriction (6.34) and model (6.33) imply the unconditional moment restriction E[zi (yi −x i β)] = 0, (6.35) where for notational simplicity the following analysis uses β rather than the more formal β0 to denote the true parameter value. A quadratic form in the corresponding sample moments leads to the GMM objective function QN (β) given in (6.4). In matrix notation define y = Xβ + u as usual and let Z denote the N × r matrix of instruments with ith row z i . Then i zi (yi −x i β) = Z u and (6.4) becomes QN (β) = 1 N (y − Xβ) Z WN 1 N Z (y − Xβ) , (6.36) 184
  • 224. 6.4. LINEAR INSTRUMENTAL VARIABLES where WN is an r × r full-rank symmetric weighting matrix with leading examples given at the end of this section. The first-order conditions ∂ QN (β) ∂β = −2 1 N X Z WN 1 N Z (y − Xβ) = 0 can actually be solved for β in this special case of GMM, leading to the GMM esti- mator in the linear IV model βGMM = X ZWN Z X −1 X ZWN Z y, (6.37) where the divisions by N have canceled out. Distribution of Linear GMM Estimator The general results of Section 6.3 can be used to derive the asymptotic distribution. Alternatively, since an explicit solution for βGMM exists the analysis for OLS given in Section 4.4. can be adapted. Substituting y = Xβ + u into (6.37) yields βGMM = β + N−1 X Z WN N−1 Z X −1 N−1 X Z WN N−1 Z u . (6.38) From the last term, consistency of the GMM estimator essentially requires that plim N−1 Z u = 0. Under pure random sampling this requires that (6.35) holds, whereas under other common sampling schemes (see Section 24.3) the stronger as- sumption (6.34) is needed. Additionally, the rank condition for identification of β that plim N−1 Z X is of rank K ensures that the inverse in the right-hand side exists, provided WN is of full rank. A weaker order condition is that r ≥ K. The limit distribution is based on the expression for √ N( βGMM − β) obtained by simple manipulation of (6.38). This yields an asymptotic normal distribution for βGMM with mean β and estimated asymptotic variance V[ βGMM] = N X ZWN Z X −1 X ZWN SWN Z X X ZWN Z X −1 , (6.39) where S is a consistent estimate of S = lim 1 N N i=1 E u2 i zi z i , given the usual cross-section assumption of independence over i. The essential addi- tional assumption needed for (6.39) is that N−1/2 Z u d → N[0, S]. Result (6.39) also follows from Proposition 6.1 with h(·) = z(y − x β) and hence ∂h/∂β = −zx . For cross-section data with heteroskedastic errors, S is consistently estimated by S = 1 N N i=1 u2 i zi z i = Z DZ/N, (6.40) where ui = yi − x i βGMM is the GMM residual and D is an N × N diagonal matrix with entries u2 i . A commonly used small-sample adjustment is to divide by N − K 185
  • 225. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION Table 6.2. GMM Estimators in Linear IV Model and Their Asymptotic Variancea Estimator Definition and Asymptotic Variance GMM βGMM = [X ZWN Z X]−1 X ZWN Z y (general WN ) V[ β] = N[X ZWN Z X]−1 [X ZWN SWN Z X][X ZWN Z X]−1 Optimal GMM βOGMM = [X Z S −1 Z X]−1 X Z S −1 Z y (WN = S−1 ) V[ β] = N[X Z S −1 Z X]−1 2SLS β2SLS = [X Z(Z Z)−1 Z X]−1 X Z(Z Z)−1 Z y (WN = [N−1 Z Z]−1 ) V[ β] = N[X Z(Z Z)−1 Z X]−1 [X Z(Z Z)−1 S(Z Z)−1 Z X] × [X Z(Z Z)−1 Z X]−1 V[ β] = s2 [X Z(Z Z)−1 Z X]−1 if homoskedastic errors IV βIV = [Z X]−1 Z y (just-identified) V[ β] = N(Z X)−1 S(X Z)−1 a Equations are based on a linear regression model with dependent variable y, regressors X, and instruments Z. S is defined in (6.40) and s2 is defined after (6.41). All variance matrix estimates assume errors that are independent across observations and heteroskedastic, aside from the simplification for homoskedastic errors given for the 2SLS estimator. Optimal GMM uses the optimal weighting matrix. rather than N in the formula for S. In the more restrictive case of homoskedastic errors, E[u2 i |zi ] = σ2 and so S = lim N−1 i σ2 E[zi z i ], leading to estimate S = s2 Z Z/N, (6.41) where s2 = (N − K)−1 N i=1 u2 i is consistent for σ2 . These results mimic similar re- sults for OLS presented in Section 4.4.5. 6.4.2. Different Linear GMM Estimators Implementation of the results of Section 6.4.1 requires specification of the weighting matrix WN . For just-identified models all choices of WN lead to the same estima- tor. For overidentified models there are two common choices of WN , given in the following. Table 6.2 summarizes these estimators and gives the appropriate specialization of the estimated variance matrix formula given in (6.39), assuming independent het- eroskedastic errors. Instrumental Variables Estimator In the just-identified case r = K and X Z is a square matrix that is invertible. Then [X ZWN Z X]−1 = (Z X)−1 W−1 N (X Z)−1 and (6.37) simplifies to the instrumental variables estimator βIV = (Z X) −1 Z y, (6.42) introduced in Section 4.8.6. For just-identified models the GMM estimator for any choice of WN equals the IV estimator. 186
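For reference (an added sketch, not from the original text), the just-identified IV and 2SLS formulas in Table 6.2 translate directly into NumPy; a two-step optimal GMM version follows the same pattern with W_N = Ŝ^{-1}.

```python
import numpy as np

def iv(y, X, Z):
    """Just-identified IV (Table 6.2): beta = (Z'X)^{-1} Z'y, requires dim(z) = dim(x)."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

def two_sls(y, X, Z):
    """2SLS (Table 6.2): beta = [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y."""
    A = X.T @ Z @ np.linalg.solve(Z.T @ Z, Z.T)   # X'Z (Z'Z)^{-1} Z'
    return np.linalg.solve(A @ X, A @ y)
```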
  • 226. 6.4. LINEAR INSTRUMENTAL VARIABLES The simple IV estimator can also be used in overidentified models, by discarding some of the instruments so that the model is just-identified, but this results in an effi- ciency loss compared to using all the instruments. Optimal-Weighted GMM From Section 6.3.5, for overidentified models the most efficient GMM estimator, meaning GMM with optimal choice of weighting matrix, sets WN = S−1 in (6.37). The optimal GMM estimator or two-step GMM estimator in the linear IV model is βOGMM = (X Z) S−1 (Z X) −1 (X Z) S−1 (Z y). (6.43) For heteroskedastic errors, S is computed using (6.40) based on a consistent first-step estimate β such as the 2SLS estimator defined in (6.44). White (1982) called this estimator a two-stage IV estimator, since both steps entail IV estimation. The estimated asymptotic variance matrix for optimal GMM given in Table 6.2 is of relatively simple form as (6.39) simplifies when WN = S−1 . In computing the estimated variance one can use S as presented in Table 6.2, but it is more common to instead use an estimator S, say, that is also computed using (6.40) but evaluates the residual at the optimal GMM estimator rather than the first-step estimate used to form S in (6.43). Two-Stage Least Squares If errors are homoskedastic rather than heteroskedastic, S−1 = [s2 N−1 Z Z]−1 from (6.41). Then WN = (N−1 Z Z)−1 in (6.37), leading to the two-stage least-squares estimator, introduced in Section 4.8.7, that can be expressed compactly as β2SLS = X PZX −1 X PZy , (6.44) where PZ = Z(ZZ )−1 Z . The basis of the term two-stage least-squares is presented in the next section. The 2SLS estimator is also called the generalized instrumental variables (GIV) estimator as it generalizes the IV estimator to the overidentified case of more instruments than regressors. It is also called the one-step GMM because (6.44) can be calculated in one step, whereas optimal GMM requires two steps. The 2SLS estimator is asymptotically normal distributed with estimated asymptotic variance given in Table 6.2. The general form should be used if one wishes to guard against heteroskedastic errors whereas the simpler form, presented in many introduc- tory textbooks, is consistent only if errors are indeed homoskedastic. Optimal GMM versus 2SLS Both the optimal GMM and the 2SLS estimator lead to efficiency gains in overiden- tified models. Optimal GMM has the advantage of being more efficient than 2SLS, if errors are heteroskedastic, though the efficiency gain need not be great. Some of the GMM testing procedures given in Section 7.5 and Chapter 8 assume estimation 187
  • 227. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION using the optimal weighting matrix. Optimal GMM has the disadvantage of requiring additional computation compared to 2SLS. Moreover, as discussed in Section 6.3.5, asymptotic theory may provide a poor small-sample approximation to the distribution of the optimal GMM estimator. In cross-section applications it is common to use the less efficient 2SLS, though with inference based on heteroskedastic robust standard errors. Even More Efficient GMM Estimation The estimator βOGMM is the most efficient estimator based on the unconditional mo- ment condition E[zi ui ] = 0, where ui = yi −x i β. However, this is not the best moment condition to use if the starting point is the conditional moment condition E[ui |zi ] = 0 and errors are heteroskedastic, meaning V[ui |zi ] varies with zi . Applying the general results of Section 6.3.7, we can write the optimal moment condition for GMM estimation based on E[ui |zi ] = 0 as E E xi |zi ui /V [ui |zi ] = 0. (6.45) As with the LS regression example in Section 6.3.7, one should divide by the error variance V[u|z]. Implementation is more difficult than in the LS case, however, as a model for E[x|z] needs to be specified in addition to one for V[u|z]. This may be possible with additional structure. In particular, for a linear simultaneous equations system E[xi |zi ] is linear in z so that estimation is based on E[xi ui /V[ui |zi ]] = 0. For linear models the GMM estimator is usually based on the simpler condition E[zi ui ] = 0. Given this condition, the optimal GMM estimator defined in (6.43) is the most efficient GMM estimator. 6.4.3. Alternative Derivations of Two-Stage Least Squares The 2SLS estimator, the standard IV estimator for overidentified models, was derived in Section 6.4.2 as a GMM estimator. Here we present three other derivations of the 2SLS estimator. One of these deriva- tions, due to Theil, provided the original motivation for 2SLS, which predates GMM. Theil’s interpretation is emphasized in introductory treatments. However, it does not generalize to nonlinear models, whereas the GMM interpretation does. We consider the linear model y = Xβ + u, (6.46) with E[u|Z] = 0 and additionally V[u|Z] = σ2 I. GLS in a Transformed Model Premultiplication of (6.46) by the instruments Z yields the transformed model Z y = Z Xβ + Z u. (6.47) 188
  • 228. 6.4. LINEAR INSTRUMENTAL VARIABLES This transformed model is often used as motivation for the IV estimator when r = K, since ignoring Z u since N−1 Z u → 0 and solving yields β = (Z X)−1 Z y. Here instead we consider the overidentified case. Conditional on Z the error Z u has mean zero and variance σ2 Z Z given the assumptions after (6.46). The efficient GLS estimator of β in model (6.46) is then β = X Z(σ2 Z Z)−1 Z X −1 X Z(σ2 Z Z)−1 Z y, (6.48) which equals the 2SLS estimator in (6.44) since the multipliers σ2 cancel out. More generally, note that if the transformed model (6.47) is instead estimated by WLS with weighting matrix WN then the more general estimator (6.37) is obtained. Theil’s Interpretation Theil (1953) proposed estimation by OLS regression of the original model (6.46), except that the regressors X are replaced by a prediction X that is asymptotically un- correlated with the error term. Suppose that in the reduced form model the regressors X are a linear combination of the instruments plus some error, so that X = ZΠ + v, (6.49) where Π is a K × r matrix. Multivariate OLS regression of X on Z yields estimator Π = (Z Z)−1 Z X and OLS predictions X = Z Π or X = PZX, where PZ = Z(Z Z)−1 Z . OLS regression of y on X rather than y on X yields estimator βTheil = ( X X)−1 X y. (6.50) Theil’s interpretation permits computation by two OLS regressions, with the first-stage OLS giving X and the second-stage OLS giving β, leading to the term two-stage least- squares estimator. To establish consistency of this estimator reexpress the linear model (6.46) as y = Xβ + (X− X)β + u. The second-stage OLS regression of y on X yields a consistent estimator of β if the re- gressor X is asymptotically uncorrelated with the composite error term (X− X)β + u. If X were any proxy variable there is no reason for this to hold; however, here X is un- correlated with (X− X) as an OLS prediction is orthogonal to the OLS residual. Thus plim N−1 X (X− X)β = 0. Also, N−1 X u = N−1 X PZu = N−1 X Z(N−1 Z Z)−1 N−1 Z u. Then X is asymptotically uncorrelated with u provided Z is a valid instrument so that plim N−1 Z u = 0. This consistency result for βTheil depends heavily on the linearity of the model and does not generalize to nonlinear models. 189
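As an added illustration of Theil's two-regression computation (names ours), only the point estimate from the second-stage OLS should be used; residuals for inference must be formed with the original regressors X rather than the first-stage predictions, a point taken up in the discussion that follows.

```python
import numpy as np

def two_sls_via_theil(y, X, Z):
    """2SLS computed as two OLS regressions (Theil's interpretation)."""
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)    # first stage: fitted values P_Z X
    b = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)  # second stage: OLS of y on Xhat
    u = y - X @ b                                   # residuals with X, not Xhat, for inference
    return b, u
```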
  • 229. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION Theil’s estimator in (6.50) equals the 2SLS estimator defined earlier in (6.44). We have βTheil = ( X X)−1 X y = (X P ZPZX)−1 X PZy = (X PZX)−1 X PZy, the 2SLS estimator, using P ZPZ = PZ in the final equality. Care is needed in implementing 2SLS using Theil’s method. The second-stage OLS will give the wrong standard errors, even if errors are homoskedastic, as it will esti- mate σ2 using the second-stage OLS regression residuals (y − X β) rather than the ac- tual residuals (y − X β). In practice one may also make adjustment for heteroskedastic errors. It is much easier to use a program that offers 2SLS as an option and directly computes (6.44) and the associated variance matrix given in Table 6.2. The 2SLS interpretation does not always carry over to nonlinear models, as detailed in Section 6.5.4. The GMM interpretation does, and for this reason it is emphasized here more than Theil’s original derivation of linear 2SLS. Theil actually considered a model where only some of the regressors X are endoge- nous and the remaining are exogenous. The preceding analysis still applies, provided all the exogenous components of X are included in the instruments Z. Then the first- stage OLS regression of the exogenous regressors on the instruments fits perfectly and the predictions of the exogenous regressors equal their actual values. So in practice at the first-stage just the endogenous variables are regressed on the instruments, and the second-stage regression is of y on the exogenous regressors and the first-stage predic- tions of the endogenous regressors. Basmann’s Interpretation Basmann (1957) proposed using as instruments the OLS reduced form predictions X = PZX for the simple IV estimator in the just-identified case, since there are then exactly as many instruments X as regressors X. This yields βBasmann = ( X X)−1 X y. (6.51) This is consistent since plim N−1 X u = 0, as already shown for Theil’s estimator. The estimator (6.51) actually equals the 2SLS estimator defined in (6.44), since X = X PZ. This IV approach will lead to correct standard errors and can be extended to non- linear settings. 6.4.4. Alternatives to Standard IV Estimators The IV-based optimal GMM and 2SLS estimators presented in Section 6.4.2 are the standard estimators used when regressors are endogenous. Chernozhukov and Hansen (2005) present an IV estimator for quantile regression. 190
• 230. 6.4. LINEAR INSTRUMENTAL VARIABLES

Here we briefly discuss leading alternative estimators that have received renewed interest given the poor finite-sample properties of 2SLS with weak instruments detailed in Section 4.9. We focus on single-equation linear models. At this stage there is no method that is relatively efficient yet has small bias in small samples.

Limited-Information Maximum Likelihood

The limited-information maximum likelihood (LIML) estimator is obtained by joint ML estimation of the single equation (6.46) plus the reduced form for the endogenous regressors in the right-hand side of (6.46), assuming homoskedastic normal errors. For details see Greene (2003, p. 402) or Davidson and MacKinnon (1993, pp. 644–651). More generally the k class of estimators (see, for example, Greene, 2003, p. 403) includes LIML, 2SLS, and OLS.

The LIML estimator, due to Anderson and Rubin (1949), predates the 2SLS estimator. Unlike 2SLS, the LIML estimator is invariant to the normalization used in a simultaneous equations system. Moreover, LIML and 2SLS are asymptotically equivalent given homoskedastic errors. Yet LIML is rarely used as it is more difficult to implement and harder to explain than 2SLS. Bekker (1994) presents small-sample results for LIML and a generalization of LIML. See also Hahn and Hausman (2002).

Split-Sample IV

Begin with Basmann's interpretation of 2SLS as an IV estimator given in (6.51). Substituting for y from (6.46) yields β̂ = β + (X̂'X)^{-1}X̂'u. By assumption plim N^{-1}Z'u = 0, so plim N^{-1}X̂'u = 0 and β̂ is consistent. However, correlation between X and u, the reason for IV estimation, means that X̂ = P_Z X is correlated with u. Thus E[X̂'u] ≠ 0, which leads to finite-sample bias in the IV estimator. This bias arises from using X̂ = ZΠ̂ rather than ZΠ as the instrument.

An alternative is to instead use as instrument predictions X̃ that have the property that E[X̃'u] = 0, in addition to plim N^{-1}X̃'u = 0, and use estimator β̃ = (X̃'X)^{-1}X̃'y. Since E[X̃'u] = 0 does not imply E[(X̃'X)^{-1}X̃'u] = 0, this estimator will still be biased, but the bias may be reduced.

Angrist and Krueger (1995) proposed obtaining such instruments by splitting the sample into two subsamples (y_1, X_1, Z_1) and (y_2, X_2, Z_2). The first sample is used to obtain the estimate Π̂_1 from regression of X_1 on Z_1. The second sample is used to obtain the IV estimator, where the instrument X̃_2 = Z_2Π̂_1 uses Π̂_1 obtained from the separate first sample. Angrist and Krueger (1995) define the unbiased split-sample IV estimator as

β̂_USSIV = (X̃_2'X_2)^{-1}X̃_2'y_2. 191
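A minimal sketch of the split-sample construction (simulated data, purely illustrative; numpy assumed): Π̂_1 comes from the first half-sample only, so the instrument used in the second half-sample is constructed from data independent of its errors. The Theil-form variant discussed next is computed alongside.

    # Sketch of split-sample IV: first-stage coefficients from one half-sample,
    # IV estimation in the other.  The dgp is an illustrative assumption.
    import numpy as np

    rng = np.random.default_rng(2)
    N = 2000
    z = rng.normal(size=(N, 4))
    Z = np.column_stack([np.ones(N), z])
    u = rng.normal(size=N)
    x = z @ np.full(4, 0.3) + 0.6 * u + rng.normal(size=N)   # modestly strong instruments
    X = np.column_stack([np.ones(N), x])
    y = X @ np.array([1.0, 1.0]) + u

    half = N // 2
    Z1, X1 = Z[:half], X[:half]
    Z2, X2, y2 = Z[half:], X[half:], y[half:]

    # First-stage coefficients from sample 1 only; instrument for sample 2.
    Pi1 = np.linalg.solve(Z1.T @ Z1, Z1.T @ X1)       # OLS of X1 on Z1
    Xtilde2 = Z2 @ Pi1                                # uses the out-of-sample estimate of Pi

    beta_ussiv = np.linalg.solve(Xtilde2.T @ X2, Xtilde2.T @ y2)      # unbiased split-sample IV
    beta_ssiv = np.linalg.solve(Xtilde2.T @ Xtilde2, Xtilde2.T @ y2)  # Theil-form split-sample variant
    print(beta_ussiv, beta_ssiv)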
  • 231. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION The split-sample IV estimator βSSIV = ( X 2 X2)−1 X 2y2 is a variant based on Theil’s interpretation of 2SLS. These estimators have finite-sample bias toward zero, unlike 2SLS, which is biased toward OLS. However, considerable efficiency loss occurs be- cause only half the sample is used at the final stage. Jackknife IV A more efficient variant of this estimator implements a similar procedure but generates instruments observation by observation. Let the subscript (−i) denote the leave-one-out operation that drops the ith obser- vation. Then for the ith observation we obtain estimate Πi from regression of X(−i) on Z(−i) and use as instrument x i = z i Πi . Repeating N times gives an instrument vector denoted X(−i) with ith row x i . This leads to the jackknife IV estimator βJIV = ( X (−i)X)−1 X (−i)y2. This estimator was originally proposed by Phillips and Hale (1977). Angrist, Imbens and Krueger (1999) and Blomquist and Dahlberg (1999) called it a jackknife estimator since the jackknife (see Section 11.5.5) is a leave-one-out method for bias reduction. The computational burden of obtaining the N jackknife predicted values x i is modest by use of the recursive formula given in Section 11.5.5. The Monte Carlo evidence given in the two recent papers is mixed, however, indicating a potential for bias reduction but also an increase in the variance. So the jackknife version may not be better than the conventional version in terms of mean-square error. The earlier paper by Phillips and Hale (1977) presents analytical results that the finite-sample bias of the JIV estimator is smaller than that of 2SLS only for appreciably overidentified models with r 2(K + 1). See also Hahn, Hausman and Kuersteiner (2001). Independently Weighted 2SLS A related method to split-sample IV is the independently weighted GMM estimator of Altonji and Segal (1996) given in Section 6.3.5. Splitting the sample into G groups and specializing to linear IV yields the independently weighted IV estimator βIWIV = 1 G G g=1 X gZg S−1 (−g)Z gXg −1 X gZg S−1 (−g)Z gyg, where S(−g) is computed using S defined in (6.40) except that observations from the gth group are excluded. In a panel application Ziliak (1997) found that the indepen- dently weighted IV estimator performed much better than the unbiased split-sample IV estimator. 6.5. Nonlinear Instrumental Variables Nonlinear IV methods, notably nonlinear 2SLS proposed by Amemiya (1974), per- mit consistent estimates of nonlinear regression models in situations where the NLS 192
• 232. 6.5. NONLINEAR INSTRUMENTAL VARIABLES

estimator is inconsistent because regressors are correlated with the error term. We present these methods as a straightforward extension of the GMM approach for linear models. Unlike the linear case the estimators have no explicit formula, but the asymptotic distribution can be obtained as a special case of the Section 6.3 results.

This section presents single-equation results, with systems results given in Section 6.10.4. A fundamentally important result is that a natural extension of Theil's 2SLS method for linear models to nonlinear models can lead to inconsistent parameter estimates (see Section 6.5.4). Instead, the GMM approach should be used.

An alternative nonlinearity can arise when the model for the dependent variable is a linear model, but the reduced form for the endogenous regressor(s) is a nonlinear model owing to special features of that regressor. For example, the endogenous regressor may be a count or a binary outcome. In that case the linear methods of the previous section still apply. One approach is to ignore the special nature of the endogenous regressor and just do regular linear 2SLS or optimal GMM. Alternatively, obtain fitted values for the endogenous regressor by appropriate nonlinear regression, such as Poisson regression on all the instruments if the endogenous regressor is a count, and then do regular linear IV using this fitted value as the instrument for the count, following Basmann's approach. Both estimators are consistent, though they have different asymptotic distributions. The first, simpler approach is the usual procedure.

6.5.1. Nonlinear GMM with Instruments

Consider the quite general nonlinear regression model where the error term may be additive or nonadditive (see Section 6.2.2). Thus

u_i = r(y_i, x_i, β), (6.52)

where the nonlinear model with additive error is the special case

u_i = y_i − g(x_i, β), (6.53)

where g(·) is a specified function. The estimators given in Section 6.2.2 are inconsistent if E[u_i|x_i] ≠ 0. Assume the existence of r instruments z, where r ≥ K, that satisfy

E[u_i|z_i] = 0. (6.54)

This is the same conditional moment condition as in the linear case, except that u_i = r(y_i, x_i, β) rather than u_i = y_i − x_i'β.

Nonlinear GMM Estimator

By the law of iterated expectations, (6.54) leads to

E[z_i u_i] = 0. (6.55)

The GMM estimator minimizes the quadratic form in the corresponding sample moment condition. 193
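In practice the estimator is computed by numerical minimization of the quadratic form in these sample moments, given formally in (6.56) below. A minimal sketch (Python with numpy/scipy; the logistic conditional mean and the simulated data are illustrative assumptions, not from the text):

    # Sketch of nonlinear GMM with instruments: form N^{-1} Z'u(beta) and
    # minimize the quadratic-form objective.  The dgp is an assumption.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    N = 3000
    z = rng.normal(size=(N, 2))
    Z = np.column_stack([np.ones(N), z])               # r = 3 instruments
    ev = rng.normal(size=N)
    x = 0.8 * z[:, 0] + 0.8 * z[:, 1] + ev             # endogenous regressor
    X = np.column_stack([np.ones(N), x])               # K = 2 regressors
    beta_true = np.array([0.2, 0.7])

    g = lambda X, b: 1.0 / (1.0 + np.exp(-(X @ b)))    # nonlinear-in-parameters mean
    y = g(X, beta_true) + 0.3 * ev + 0.1 * rng.normal(size=N)   # E[u|z] = 0, E[u|x] != 0

    def gmm_objective(b, W):
        u = y - g(X, b)                                # u_i(beta) = y_i - g(x_i, beta)
        m = Z.T @ u / N                                # sample moments
        return m @ W @ m                               # quadratic form in the moments

    W = np.linalg.inv(Z.T @ Z / N)                     # NL2SLS choice of weighting matrix
    res = minimize(gmm_objective, x0=np.zeros(2), args=(W,), method="BFGS")
    print(res.x)                                        # close to beta_true in large samples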
  • 233. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION In matrix notation let u denote the N × 1 error vector with ith entry ui given in (6.52) and let Z to be an N × r matrix of instruments with ith row z i . Then i zi ui = Z u and the GMM estimator in the nonlinear IV model βGMM minimizes QN (β) = 1 N u Z WN 1 N Z u , (6.56) where WN is an r × r weighting matrix. Unlike linear GMM, the first-order conditions do not lead to a closed-form solution for βGMM. Distribution of Nonlinear GMM Estimator The GMM estimator is consistent for β given (6.54) and asymptotically normally dis- tributed with estimated asymptotic variance V βGMM = N D ZWN Z D −1 D ZWN SWN Z D D ZWN Z D −1 (6.57) using the results from Section 6.3.3 with h(·) = zu, where S is given in the following and D is an N × K matrix of derivatives of the error term D = ∂u ∂β βGMM . (6.58) With nonadditive errors, D has ith row ∂r(yi , xi , β)/∂β β . With additive errors, D has ith row ∂g(xi , β)/∂β β , ignoring the minus sign that cancels out in (6.57). For independent heteroskedastic errors, S = N−1 i u2 i zi z i , (6.59) similar to the linear case except now ui = r(yi , x, β) or ui = yi − g(x, β). The asymptotic variance of the GMM estimator in the nonlinear model is therefore the same as that in the linear case given in (6.39), with the change that the regressor matrix X is replaced by the derivative ∂u/∂β β . This is exactly the same change as observed in Section 5.8 in going from linear to nonlinear least squares. By analogy with linear IV, the rank condition for identification is that plim N−1 Z ∂u/∂β β0 is of rank K and the weaker order condition is that r ≥ K. 6.5.2. Different Nonlinear GMM Estimators. Two leading specializations of the GMM estimator, which differ in the choice of weighting matrix, are optimal GMM that sets WN = S−1 and nonlinear two-stage least squares (NL2SLS) that sets WN = (Z Z)−1 . Table 6.3 summarizes these estimators and their associated variance matrices, assuming independent heteroskedastic errors, and gives results for general WN and results for nonlinear IV in the just-identified model. 194
  • 234. 6.5. NONLINEAR INSTRUMENTAL VARIABLES Table 6.3. GMM Estimators in Nonlinear IV Model and Their Asymptotic Variancea Estimator Definition and Asymptotic Variance GMM QGMM(β) = u ZWN Z u (general WN ) V[ β] = N[ D ZWN Z D]−1 [ D ZWN SWN Z D][ D ZWN Z D]−1 Optimal GMM QOGMM(β) = u Z S −1 Z u (WN = S−1 ) V[ β] = N[ D Z S −1 Z D]−1 NL2SLS QNL2SLS(β) = u Z(Z Z)−1 Z u (WN = [N−1 Z Z]−1 ) V[ β] = N[ D Z(Z Z)−1 Z D]−1 [ D Z(Z Z)−1 S(Z Z)−1 Z D] × [ D Z(Z Z)−1 Z D]−1 V[ β] = s2 [ D Z(Z Z)−1 Z D]−1 if homoskedastic errors NLIV βNLIV solves Z u = 0 (just-identified) V[ β] = N(Z D)−1 S( D Z)−1 a Equations are for a nonlinear regression model with error u defined in (6.53) or (6.52) and instruments Z. D is the derivative of the error vector with respect to β evaluated at β and simplifies for models with additive error to the derivative of the conditional mean function with respect to β evaluated at β. S is defined in (6.59). All variance matrix estimates assume errors that are independent across observations and heteroskedastic, aside from the simplification for homoskedastic errors given for the NL2SLS estimator. Nonlinear Instrumental Variables In the just-identified case one can directly use the sample moment conditions corre- sponding to (6.55). This yields the method of moments estimator in the nonlinear IV model βNLIV that solves 1 N N i=1 zi ui = 0, (6.60) or equivalently Z u = 0 with asymptotic variance matrix given in Table 6.3. Nonlinear estimators are often computed using iterative methods that obtain an op- timum to an objective function rather than solve nonlinear systems of estimating equa- tions. For the just-identified case βNLIV can be computed as a GMM estimator mini- mizing (6.56) with any choice of weighting matrix, most simply WN = I, leading to the same estimate. Optimal Nonlinear GMM For overidentified models the optimal GMM estimator uses weighting matrix WN = S−1 . The optimal GMM estimator in the nonlinear IV model βOGMM therefore minimizes QN (β) = 1 N u Z S−1 1 N Z u . (6.61) The estimated asymptotic variance matrix given in Table 6.3 is of relatively simple form as (6.57) simplifies when WN = S−1 . 195
  • 235. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION As in the linear case the optimal GMM estimator is a two-step estimator when errors are heteroskedastic. In computing the estimated variance one can use S as presented in Table 6.3, but it is more common to instead use an estimator S, say, that is also computed using (6.59) but evaluates the residual at the optimal GMM estimator rather than the first-step estimate used to form S in (6.61). Nonlinear 2SLS A special case of the GMM estimator with instruments sets WN = (N−1 Z Z)−1 in (6.56). This gives the nonlinear two-stage least-squares estimator βNL2SLS that minimizes QN (β) = 1 N u Z(Z Z)−1 Z u. (6.62) This estimator has the attraction of being the optimal GMM estimator if errors are homoskedastic, as then S = s2 Z Z/N, where s2 is a consistent estimate of the constant V[u|z] so S−1 is a multiple of (Z Z)−1 . With homoskedastic error this estimator has the simpler estimated asymptotic vari- ance given in Table 6.3, a result often given in textbooks. However, in microecono- metrics applications it is common to permit heteroskedastic errors and use the more complicated robust estimate also given in Table 6.3. The NL2SLS estimator, proposed by Amemiya (1974), was an important precursor to GMM. The estimator can be motivated along similar lines to the first motivation for linear 2SLS given in Section 6.4.3. Thus premultiply the model error u by the instruments Z to obtain Z u, where E[Z u] = 0 since E[u|Z] = 0. Then do nonlinear GLS regression. Assuming homoskedastic errors this minimizes QN (β) = u Z[σ2 Z Z]−1 Z u, as V[u|Z] = σ2 I implies V[Z u|Z] = σ2 Z Z. This objective function is just a scalar multiple of (6.62). The Theil two-stage interpretation of linear 2SLS does not always carry over to non- linear models (see Section 6.5.4). Moreover, NL2SLS is clearly a one-step estimator. Amemiya chose the name NL2SLS because, as in the linear case, it permits consistent estimation using instrumental variables. The name should not be taken literally, and clearer terms are nonlinear IV or nonlinear generalized IV estimation. Instrument Choice in Nonlinear Models The preceding estimators presume the existence of instruments such that E[u|z] = 0 and that estimation is best if based on the unconditional moment condition E[zu] = 0. Consider the nonlinear model with additive error so that u = y − g(x, β). To be relevant the instrument must be correlated with the regressors x; yet to be valid it cannot be a direct causal variable for y. From the variance matrix given in (6.57) it is actually correlation of z with ∂g/∂β rather than just x that matters, to ensure that D Z should be large. Weak instruments concerns are just as relevant here as in the linear case studied in Section 4.9. 196
• 236. 6.5. NONLINEAR INSTRUMENTAL VARIABLES

6.5.3. Poisson IV Example

The Poisson regression model with exogenous regressors specifies E[y|x] = exp(x'β). This can be viewed as a model with additive error u = y − exp(x'β). If regressors are endogenous then E[u|x] ≠ 0 and the Poisson MLE will then be inconsistent. Consistent estimation assumes the existence of instruments z that satisfy E[u|z] = 0 or, equivalently, E[y − exp(x'β)|z] = 0.

The preceding results can be directly applied. The objective function is

Q_N(β) = [N^{-1} Σ_i z_i u_i]' W_N [N^{-1} Σ_i z_i u_i],

where u_i = y_i − exp(x_i'β). The first-order conditions are then

[Σ_i exp(x_i'β) x_i z_i'] W_N [Σ_i z_i (y_i − exp(x_i'β))] = 0.

The asymptotic distribution is given in Table 6.3, with D̂'Z = Σ_i exp(x_i'β̂) x_i z_i', since ∂g/∂β = exp(x'β)x, and with S defined in (6.59) with u_i = y_i − exp(x_i'β). The optimal GMM and NL2SLS estimators differ in whether the weighting matrix is Ŝ^{-1} or (N^{-1}Z'Z)^{-1}, where Z'Z = Σ_i z_i z_i'.

An alternative consistent estimator follows the Basmann approach. First, estimate by OLS the reduced form x_i = Πz_i + v_i, giving the K predictions x̂_i = Π̂z_i. Second, estimate by nonlinear IV as in (6.60) with instruments x̂_i rather than z_i. Given the OLS formula for Π̂, this estimator solves

[Σ_i x̂_i z_i'] [Σ_i z_i z_i']^{-1} [Σ_i z_i (y_i − exp(x_i'β))] = 0.

This estimator differs from the NL2SLS estimator because the first term on the left-hand side differs. Potential problems with instead generalizing Theil's method for linear models are detailed in the next section. 197
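A minimal sketch of the estimators just described (Python with numpy/scipy; the simulated design is an illustrative assumption, and a continuous outcome with exponential conditional mean stands in for a count so that the additive-error moment condition holds exactly in the simulated data). It computes NL2SLS and the Basmann-style estimator that uses the linear reduced-form prediction x̂ as instrument; both are consistent but not numerically identical.

    # Sketch of the Section 6.5.3 setup: exponential conditional mean with an
    # endogenous regressor.  The dgp is an assumption for illustration.
    import numpy as np
    from scipy.optimize import minimize, root

    rng = np.random.default_rng(4)
    N = 4000
    z = rng.normal(size=(N, 2))
    Z = np.column_stack([np.ones(N), z])
    eps = rng.normal(size=N)
    x = 0.4 * z[:, 0] + 0.4 * z[:, 1] + eps            # endogenous regressor
    X = np.column_stack([np.ones(N), x])
    beta0 = np.array([0.3, 0.4])
    y = np.exp(X @ beta0) + 0.7 * eps + rng.normal(size=N)   # E[u|z] = 0, E[u|x] != 0

    resid = lambda b: y - np.exp(X @ b)                # u(beta) = y - exp(x'beta)

    # (a) NL2SLS: minimize u'Z (Z'Z)^{-1} Z'u.
    W = np.linalg.inv(Z.T @ Z / N)
    Q = lambda b: (Z.T @ resid(b) / N) @ W @ (Z.T @ resid(b) / N)
    b_nl2sls = minimize(Q, np.zeros(2), method="BFGS").x

    # (b) Basmann-style: OLS reduced form gives xhat, then solve the
    # just-identified nonlinear IV conditions sum_i xhat_i * u_i(beta) = 0.
    Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    b_basmann = root(lambda b: Xhat.T @ resid(b) / N, x0=b_nl2sls).x

    print(b_nl2sls, b_basmann)                          # both consistent, different estimates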
• 237. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION

Similar issues arise in nonlinear models other than Poisson regression, such as models for binary data.

6.5.4. Two-Stage Estimation in Nonlinear Models

The usual interpretation of linear 2SLS can fail in nonlinear models. Thus suppose y has mean g(x, β) and there are instruments z for the regressors x. Then OLS regression of x on instruments z to get fitted values x̂, followed by NLS regression of y on g(x̂, β), can lead to inconsistent parameter estimates of β, as we now demonstrate. Instead, one needs to use the NL2SLS estimator presented in the previous section.

Consider the following simple model, based on one presented in Amemiya (1984), that is nonlinear in variables though still linear in parameters. Let

y = βx² + u, (6.63)
x = πz + v,

where the zero-mean errors u and v are correlated. The regressor x² is endogenous, since x is a function of v and by assumption u and v are correlated. As a result the OLS estimator of β is inconsistent. If z is generated independently of the other random variables in the model it is a valid instrument, as it is clearly then independent of u but correlated with x.

The IV estimator is β̂_IV = (Σ_i z_i x_i²)^{-1} Σ_i z_i y_i. This can be implemented by a regular IV regression of y on x² with instrument z. Some algebra shows that, as expected, β̂_IV equals the nonlinear IV estimator defined in (6.60).

Suppose instead we perform the following two-stage least-squares estimation. First, regress x on z to get x̂ = π̂z and then regress y on x̂². Then

β̂_2SLS = (Σ_i x̂_i² x̂_i²)^{-1} Σ_i x̂_i² y_i,

where x̂_i² is the square of the prediction x̂_i obtained from OLS regression of x on z. This yields an inconsistent estimate. Adapting the proof for the linear case in Section 6.4.3 we have

y_i = βx_i² + u_i = βx̂_i² + w_i,

where w_i = β(x_i² − x̂_i²) + u_i. An OLS regression of y_i on x̂_i² is inconsistent for β because the regressor x̂_i² is asymptotically correlated with the composite error term w_i. Formally,

(x_i² − x̂_i²) = (πz_i + v_i)² − (π̂z_i)² = π²z_i² + 2πz_i v_i + v_i² − π̂²z_i²

implies, using plim π̂ = π and some algebra, that plim N^{-1} Σ_i x̂_i²(x_i² − x̂_i²) = plim N^{-1} Σ_i π²z_i²v_i² ≠ 0 even if z_i and v_i are independent. Hence plim N^{-1} Σ_i x̂_i² w_i = plim N^{-1} Σ_i x̂_i² β(x_i − x̂_i)² ≠ 0.

A variation that is consistent, however, is to regress x² rather than x on z at the first stage and use the resulting fitted value of x², rather than (x̂)², at the second stage. It can be shown that this equals β̂_IV. The instrument for x² needs to be the fitted value for x² rather than the square of the fitted value for x.

This example generalizes to other nonlinear models where the nonlinearity is in regressors only, so that

y = g(x)'β + u, 198
• 238. 6.5. NONLINEAR INSTRUMENTAL VARIABLES

Table 6.4. Nonlinear Two-Stage Least-Squares Example^a

                           Estimator
    Variable        OLS         NL2SLS      Two-Stage
    x²              1.189       0.960       1.642
                    (0.025)     (0.046)     (0.172)
    R²              0.88        0.85        0.80

    ^a The dgp given in the text has true coefficient equal to one. The sample size is N = 200.

where g(x) is a nonlinear function of x. Common examples are use of powers and the natural logarithm. Suppose E[u|z] = 0. Inconsistent estimates are obtained by regressing x on z to get predictions x̂, and then regressing y on g(x̂). Consistent estimates can be obtained by instead regressing g(x) on z to get fitted values of g(x), and then regressing y on these fitted values at the second stage. That is, the fitted value of g(x), rather than g(x̂), is used as the instrument for g(x). Even then the second-stage regression gives invalid standard errors, as OLS output will use residuals formed from the fitted value of g(x) rather than û = y − g(x)'β̂. It is best to directly use a GMM or NL2SLS command.

More generally, models may be nonlinear in both variables and parameters. Consider a single-index model with additive error, so that y = g(x'β) + u. Inconsistent estimates may be obtained by OLS of x on z to get predictions x̂, and then NLS regression of y on g(x̂'β). Either GMM or NL2SLS needs to be used. Essentially, for consistency we want the fitted value of g(x'β), not g(x̂'β).

NL2SLS Example

We consider NL2SLS estimation in a model with a simple nonlinearity resulting from the square of an endogenous variable appearing as a regressor, as in the previous section. The dgp is (6.63), so y = βx² + u and x = πz + v, where β = 1 and π = 1, z = 1 for all observations, and (u, v) are joint normal with means 0, variances 1, and correlation 0.8. A sample of size 200 is drawn. Results are shown in Table 6.4.

The nonlinearity here is quite mild, with the square of x rather than x appearing as regressor. Interest lies in estimating its coefficient β. The OLS estimator is inconsistent, whereas NL2SLS is consistent. The two-stage method, where first an OLS regression of x on z is used to form x̂ and then an OLS regression of y on (x̂)² is performed, yields an estimate that is more than two standard errors from the true value of β = 1. The simulation also indicates a loss in goodness of fit and precision, with larger standard errors and lower R², similar to linear IV. 199
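The experiment is easy to replicate in outline. The following Monte Carlo sketch (Python with numpy) follows the stated design as we read it here, with z fixed at one, corr(u, v) = 0.8, and N = 200, and reports averages over replications, so the numbers need not match the single sample in Table 6.4. It confirms that OLS and the two-stage method are biased while the IV/NL2SLS estimator centers near β = 1.

    # Monte Carlo sketch in the spirit of the Table 6.4 experiment.
    # Averages over replications; design details are read from the text above
    # and should be treated as an approximation of the original experiment.
    import numpy as np

    rng = np.random.default_rng(5)
    R, N = 1000, 200
    cov = np.array([[1.0, 0.8], [0.8, 1.0]])
    results = {"OLS": [], "NL2SLS (= IV)": [], "Two-stage": []}

    for _ in range(R):
        u, v = rng.multivariate_normal([0.0, 0.0], cov, size=N).T
        z = np.ones(N)
        x = z + v                       # pi = 1
        y = x ** 2 + u                  # true beta = 1
        x2 = x ** 2

        results["OLS"].append(np.sum(x2 * y) / np.sum(x2 * x2))
        # Just-identified and linear in parameters, so NL2SLS reduces to simple
        # IV of y on x^2 with instrument z.
        results["NL2SLS (= IV)"].append(np.sum(z * y) / np.sum(z * x2))
        # Forbidden two-stage: regress x on z, then regress y on the squared fit.
        xhat = z * (np.sum(z * x) / np.sum(z * z))
        results["Two-stage"].append(np.sum(xhat ** 2 * y) / np.sum(xhat ** 4))

    for name, draws in results.items():
        print(name, round(float(np.mean(draws)), 3))   # OLS and Two-stage are biased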
  • 239. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION 6.6. Sequential Two-Step m-Estimation Sequential two-step estimation procedures are estimation procedures where the es- timate of a parameter of ultimate interest is based on initial estimation of an un- known parameter. An example is feasible GLS when the error has conditional vari- ance exp(z γ). Given an estimate γ of γ, the FGLS estimator β solves N i=1(yi − x i β)2 / exp(z i γ). A second example is the Heckman two-step estimator given in Sec- tion 16.10.2. These estimators are attractive as they can provide a relatively simple way to obtain consistent parameter estimates. However, for valid statistical inference it may be nec- essary to adjust the asymptotic variance of the second-step estimator to allow for the first-step estimation. We present results for the special case where the estimating equa- tions for both the first- and second-step estimators set a sample average to zero, which is the case for m-estimators, method of moments, and estimating equations estimators. Partition the parameter vector θ into θ1 and θ2, with ultimate interest in θ2. The model is estimated sequentially by first obtaining θ1 that solves N i=1 h1i ( θ1) = 0 and then, given θ1, obtaining θ2 that solves N−1 N i=1 h2i ( θ1, θ2) = 0. In general the dis- tribution of θ2 given estimation of θ1 differs from, and is more complicated than, the distribution of θ2 if θ1 is known. Statistical inference is invalid if it fails to take into account this complication, except in some special cases given at the end of this section. The following derivation is given in Newey (1984), with similar results obtained by Murphy and Topel (1985) and Pagan (1986). The two-step estimator can be rewritten as a one-step estimator where (θ1, θ2) jointly solve the equations N−1 N i=1 h1(wi , θ1) = 0, (6.64) N−1 N i=1 h2(wi , θ1, θ2) = 0. Defining θ = (θ 1 θ 2) and hi = (h 1i h 2i ) , we can write the equations as N−1 N i=1 h(wi , θ) = 0. In this setup it is assumed that dim(h1) = dim(θ1) and dim(h2) = dim(θ2), so that the number of estimating equations equals the number of parameters. Then (6.64) is an estimating equations estimator or MM estimator. Consistency requires that plim N−1 i h(wi , θ0) = 0, where θ0 = [θ1 10, θ1 20]. This condition should be satisfied if θ1 is consistent for θ10 in the first step, and if second- step estimation of θ2 with θ10 known (rather than estimated by θ1) would lead to a consistent estimate of θ20. Within a method of moments framework we require E[h1i (θ1)] = 0 and E[h2i (θ1, θ2)] = 0. We assume that consistency is established. For the asymptotic distribution we apply the general result that √ N( θ − θ0) d → N 0, G−1 0 S0(G−1 0 ) , 200
  • 240. 6.6. SEQUENTIAL TWO-STEP M-ESTIMATION where G0 and S0 are defined in Proposition 6.1. Partition G0 and S0 in a similar way to the partitioning of θ and hi . Then G0 = lim 1 N N i=1 E ∂h1i /∂θ 1 0 ∂h2i /∂θ 1 ∂h2i /∂θ 2 = G11 0 G21 G22 , using ∂h1i (θ)/∂θ 2 = 0 since h1i (θ) is not a function of θ2 from (6.64). Since G0, G11, and G22 are square matrices G−1 0 = G−1 11 0 −G−1 22 G21G−1 11 G−1 22 . Clearly, S0 = lim 1 N N i=1 E h1i h1i h1i h2i h2i h1i h2i h2i = S11 S12 S21 S22 . The asymptotic variance of θ2 is the (2, 2) submatrix of the variance matrix of θ. After some algebra, we get V[ θ2] = G−1 22 S22 + G21[G−1 11 S11G−1 11 ]G 21 −G21G−1 11 S12 − S21G−1 11 G 21 . G−1 22 . (6.65) The usual computer output yields standard errors that are incorrect and understate the true standard errors, since V[ θ2] is then assumed to be G−1 22 S22G−1 22 , which can be shown to be smaller than the true variance given in (6.65). There is no need to account for additional variability in the second-step caused by estimation in the first step in the special case that E[∂h2i (θ)/∂θ1] = 0, as then G21 = 0 and V[ θ2] in (6.65) reduces to G−1 22 S22G−1 22 . A well-known example of G21 = 0 is FGLS. Then for heteroskedastic errors h2i (θ) = x2i (yi − x i θ2) σ(xi , θ1) , where V[yi |xi ] = σ2 (xi , θ1), and E[∂h2i (θ)/∂θ1] = E −x2i (yi − x i θ2) σ(xi , θ1)2 ∂σ(xi , θ1) ∂θ1 , which equals zero since E[yi |xi ] = x i θ2. Furthermore, for FGLS consistency of θ2 does not require that θ1 be consistent since E[h2i (θ)] = 0 just requires that E[yi |xi ] = x i θ2, which does not depend on θ1. A second example of G21 = 0 is ML estimation with a block diagonal matrix so that E[∂2 L(θ)/∂θ1∂θ 2] = 0. This is the case for example for regression under normality, where θ1 are the variance parameters and θ2 are the regression parameters. In other examples, however, G21 = 0 and the more cumbersome expression (6.65) needs to be used. This is done automatically by computer packages for some standard two-step estimators, most notably Heckman’s two-step estimator of the sample selec- tion model given in Section 16.5.4. Otherwise, V[ θ2] needs to be computed manually. Many of the components come from earlier estimation. In particular, G−1 11 S11G−1 11 is 201
  • 241. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION the robust variance matrix of θ1 and G−1 22 S22G−1 22 is the robust variance matrix esti- mate of θ2 that incorrectly ignores the estimation error in θ1. For data independent over i the subcomponents of the S0 submatrix are consistently estimated by Sjk = N−1 i hji hki , j, k = 1, 2. This leaves computation of G21 = N−1 i ∂h2i /∂θ 1 θ as the main challenge. A recommended simpler approach is to obtain bootstrap standard errors (see Sec- tion 16.2.5), or directly jointly estimate θ1 and θ2 in the combined model (6.64), as- suming access to a GMM routine. These simpler approaches can also be applied to sequential estimators that are GMM estimators rather than m-estimators. Then combining the two estimators will lead to a set of conditions more complicated than (6.64) and we no longer get (6.65). However, one can still bootstrap or estimate jointly rather than sequentially. 6.7. Minimum Distance Estimation Minimum distance estimation provides a way to estimate structural parameters θ that are a specified function of reduced form parameters π, given a consistent estimate π of π. A standard reference is Ferguson (1958). Rothenberg (1973) applied this method to linear simultaneous equations models, though the alternative methods given in Sec- tion 6.9.6 are the standard methods used. Minimum distance estimation is most often used in panel data analysis. In the initial work by Chamberlain (1982, 1984) (see Sec- tion 22.2.7) he lets π be OLS estimates from linear regression of the current-period dependent variable on regressors in all periods. Subsequent applications to covariance structures (see Section 22.5.4) let π be estimated variances and autocovariances of the panel data. See also the indirect inference method (Section 12.6). Suppose that the relationship between q structural parameters and r q reduced form parameters is that π0 = g(θ0). Further suppose that we have a consistent estimate π of the reduced form parameters. An obvious estimator is θ such that π = g( θ), but this is infeasible since q r. Instead, the minimum distance (MD) estimator θMD minimizes with respect to θ the objective function QN (θ) = ( π − g(θ)) WN ( π − g(θ)), (6.66) where WN is an r × r weighting matrix. If π p → π0 and WN p → W0, where W0 is finite positive semidefinite then QN ( θ) p → Q0(θ) = (π0−g(θ)) W0(π0−g(θ)). It follows that θ0 is locally identified if Rank[W0 × ∂g(θ)/∂θ ] = q, while consistency essentially requires that π0= g(θ0). For the MD estimator √ N( θMD − θ0) d → N[0,V[ θMD]], where V[ θMD] = (G 0W0G0)−1 (G 0W0V[ π]W0G0)(G 0W0G0)−1 , (6.67) G0 = ∂g(θ)/∂θ θ0 , and it is assumed that the reduced form parameters π have limit distribution √ N( π − π0) d → N[0,V[ π]]. More efficient reduced form estimators lead to more efficient MD estimators, since smaller V[ π] leads to smaller V[ θMD] in (6.67). 202
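As a small numerical illustration of (6.66), the following sketch (Python with numpy/scipy) computes the MD estimator under two weighting choices, W_N = I and W_N = V[π̂]^{-1}, together with the quadratic form at the optimum, which, as discussed below, can be used as a specification test. The mapping g(θ) = (θ, θ², θ³) and the values of π̂ and V[π̂] are hypothetical and chosen purely for illustration.

    # Sketch of minimum distance estimation: q = 1 structural parameter mapped
    # into r = 3 reduced-form parameters.  All inputs are hypothetical.
    import numpy as np
    from scipy.optimize import minimize_scalar

    g = lambda t: np.array([t, t ** 2, t ** 3])

    pihat = np.array([1.95, 4.10, 8.20])           # hypothetical reduced-form estimates
    Vpi = np.diag([0.04, 0.10, 0.30])              # hypothetical V[pihat]

    def md(theta, W):
        d = pihat - g(theta)
        return d @ W @ d                            # objective (6.66)

    theta_ewmd = minimize_scalar(lambda t: md(t, np.eye(3)),
                                 bounds=(0.0, 5.0), method="bounded").x
    W_opt = np.linalg.inv(Vpi)                      # optimal weighting matrix
    theta_omd = minimize_scalar(lambda t: md(t, W_opt),
                                bounds=(0.0, 5.0), method="bounded").x

    # Minimized optimal-MD criterion: chi-square with r - q = 2 dof under H0.
    stat = md(theta_omd, W_opt)
    print(theta_ewmd, theta_omd, stat)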
  • 242. 6.8. EMPIRICAL LIKELIHOOD To obtain the result (6.67), begin with the following rescaling of the first-order conditions for the MD estimator: GN ( θ) WN √ N( π − g( θ)) = 0, (6.68) where GN (θ) = ∂g(θ)/∂θ . An exact first-order Taylor series expansion about θ0 yields √ Nh( π − g( θ)) = √ N( π − π0) − GN (θ+ ) √ N( θ − θ0), (6.69) where θ+ lies between θ and θ0 and we have used g(θ0) = π0. Substituting (6.69) back into (6.68) and solving for √ N( θ − θ0) yields √ N( θ − θ0) = [GN ( θ) WN GN (θ+ )]−1 GN ( θ) WN √ N( π − π0), (6.70) which leads directly to (6.67). For given reduced form estimator π, the most efficient MD estimator uses weighting matrix WN = V[ π]−1 in (6.66). This estimator is called the optimal MD (OMD) estimator, and sometimes the minimum chi-square estimator following Ferguson (1958). A common alternative special case is the equally weighted minimum distance (EWMD) estimator, which sets WN = I. This is less efficient than the OMD estima- tor, but it does not have the finite-sample bias problems analogous to those discussed in Section 6.3.5 that arise when the optimal weighting matrix is used. The EWMD es- timator can be simply obtained by NLS regression of π j on gj ( θ), j = 1, . . . ,r, since minimizing ( π − g( θ)) ( π − g( θ)) yields the same first-order conditions as those in (6.68) with WN = I. The maximized value of the objective function for the OMD is chi-squared dis- tributed. Specifically, ( π − g( θOMD)) V[ π]−1 ( π − g( θOMD)) (6.71) is asymptotically distributed as χ2 (r − q) under H0 : g(θ0) = π0. This provides a model specification test analogous to the OIR test of Section 6.3.8. The MD estimator is qualitatively similar to the GMM estimator. The GMM frame- work is the standard one employed. MD estimation is most often used in panel studies of covariance structures, since then π comprises easily estimated sample moments (variances and covariances) that can then be used to obtain θ. 6.8. Empirical Likelihood The MM and GMM approaches do not require complete specification of the con- ditional density. Instead, estimation is based on moment conditions of the form E[h(y, x, θ)] = 0. The empirical likelihood approach, due to Owen (1988), is an alter- native estimation procedure based on the same moment condition. An attraction of the empirical likelihood estimator is that, although it is asymptoti- cally equivalent to the GMM estimator, it has different finite-sample properties, and in some examples it outperforms the GMM estimator. 203
  • 243. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION 6.8.1. Empirical Likelihood Estimation of Population Mean We begin with empirical likelihood in the case of a scalar iid random variable y with density f (y) and sample likelihood function i f (yi ). The complication con- sidered here is that the density f (y) is not specified, so the usual ML approach is not possible. A completely nonparametric approach seeks to estimate the density f (y) evaluated at each of the sample values of y. Let πi = f (yi ) denote the probability that the ith observation on y takes the realized value yi . Then the goal is to maximize the so- called empirical likelihood function i πi , or equivalently to maximize the empirical log-likelihood function N−1 i ln πi , which is a multinomial model with no structure placed on πi . This log-likelihood is unbounded, unless a constraint is placed on the range of values taken by πi . The normalization used is that i πi = 1. This yields the standard estimate of the cumulative distribution function in the fully nonparametric case, as we now demonstrate. The empirical likelihood estimator maximizes with respect to π and η the Lagrangian LEL(π, η) = 1 N N i=1 ln πi − η N i=1 πi − 1 , (6.72) where π = [π1. . . πN ] and η is a Lagrange multiplier. Although the data yi do not explicitly appear in (6.72) they appear implicitly as πi = f (yi ). Setting the derivatives with respect to πi (i = 1, . . . , N), and η to zero and solving yields πi = 1/N and η = 1. Thus the estimated density function f (y) has mass 1/N at each of the realized values yi , i = 1, . . . , N. The resulting distribution function is F(y) = N−1 N i=1 1(y ≤ yi ), where 1(A) = 1 if event A occurs and 0 otherwise. F(y) is just the usual empirical distribution function. Now introduce parameters. As a simple example, suppose we introduce the moment restriction that E[y − µ] = 0, where µ is the unknown population mean. In the empir- ical likelihood context this population moment is replaced by a sample moment, where the sample moment weights sample values by the probabilities πi . Thus we introduce the constraint that i πi (yi − µ) = 0. The Lagrangian for the maximum empirical likelihood estimator is LEL(π, η, λ, µ) = 1 N N i=1 ln πi − η N i=1 πi − 1 − λ N i=1 πi (yi − µ), (6.73) where η and λ are Lagrange multipliers. Begin by differentiating the Lagrangian with respect to πi (i = 1, . . . , N), η, and λ but not µ. Setting these derivatives to zero yields equations that are functions of µ. Solving leads to the solution πi = πi (µ) and hence an empirical likelihood N−1 i ln πi (µ) that is then maximized with respect to µ. This solution method leads to nonlinear equations that need to be solved numerically. For this particular problem an easier way to solve for µ is to note that the max- imized value of L(π, η, λ, µ) must be less than or equal to N−1 i ln N−1 , since this is the maximized value without the last constraint. However, L(π, η, λ, µ) equals 204
  • 244. 6.8. EMPIRICAL LIKELIHOOD N−1 i ln N−1 if πi = 1/N and µ = N−1 i yi = ȳ. So the maximum empirical likelihood estimator of the population mean is the sample mean. 6.8.2. Empirical Likelihood Estimation of Regression Parameters Now consider regression data that are iid over i. The only structure placed on the model are r moment conditions E[h(wi , θ)] = 0, (6.74) where h(·) and wi are defined in Section 6.3.1. For example, h(w, θ) = x(y − x θ) for OLS estimation and h(y, x, θ) = (∂g/∂θ)(y − g(x,θ)) for NLS estimation. The empirical likelihood approach maximizes the empirical likelihood function N−1 i ln πi subject to the constraint i πi = 1 (see (6.72)) and the additional sam- ple constraint based on the population moment condition (6.74) that N i=1 πi h(wi , θ) = 0. (6.75) Thus we maximize with respect to π, η, λ, and θ LEL(π, η, λ, θ) = 1 N N i=1 ln πi − η N i=1 πi − 1 − λ N i=1 πi h(wi , θ), (6.76) where the Lagrangian multipliers are a scalar η and column vector λ of the same dimension as h(·). First, concentrate out the N parameters π1, . . . , πN . Differentiating L(π, η, λ, θ) with respect to πi yields 1/(Nπi ) − η − λ hi = 0. Then we obtain η = 1 by multiply- ing by πi and summing over i and using i πi hi = 0. It follows that πi (θ, λ) = 1 N(1 + λ h(wi , θ)) . (6.77) The problem is now reduced to a maximization problem with respect to (r + q) vari- ables λ and θ, the Lagrangian multipliers associated with the r moment conditions (6.74), and the q parameters θ. Solution at this stage requires numerical methods, even for just-identified mod- els. One can maximize with respect to θ and λ the function N−1 i ln[1/N(1 + λ h(wi , θ))]. Alternatively, first concentrate out λ. Differentiating L(π(θ, λ), η, λ) with respect to λ yields i πi hi = 0. Define λ(θ) to be the implicit solution to the system of dim(λ) equations N i=1 1 N(1 + λ h(wi , θ)) h(wi , θ) = 0. In implementation numerical methods are needed to obtain λ(θ). Then (6.77) becomes πi (θ) = 1 N(1 + λ(θ) h(wi , θ)) . (6.78) 205
  • 245. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION By substituting (6.78) into the empirical likelihood function N−1 i ln πi , the empir- ical log-likelihood function evaluated at θ becomes LEL(θ) = −N−1 N i=1 ln[N(1 + λ(θ) h(wi , θ))]. The maximum empirical likelihood (MEL) estimator θMEL maximizes this function with respect to θ. Qin and Lawless (1994) show that √ N( θMEL − θ0) d → N[0, A(θ0)−1 B(θ0)A(θ0)−1 ], where A(θ0) = plimE[∂h(θ)/∂θ |θ0 ] and B(θ0) = plimE[h(θ)h(θ) |θ0 ]. This is the same limit distribution as the method of moments (see (6.13)). In finite samples θMEL differs from θGMM, however, and inference is based on sample estimates A = N i=1 πi ∂h i ∂θ θ , B = N i=1 πi hi ( θ)hi ( θ) that weight by the estimated probabilities πi rather than the proportions 1/N. Imbens (2002) provides a recent survey of empirical likelihood that contrasts em- pirical likelihood with GMM. Variations include replacing N−1 i ln πi in (6.26) by N−1 i πi ln πi . Empirical likelihood is computationally more burdensome; see Imbens (2002) for a discussion. The advantage is that the asymptotic theory provides a better finite-sample approximation to the distribution of the empirical likelihood es- timator than it does to that for the GMM estimator. This is pursued further in Sec- tion 11.6.4. 6.9. Linear Systems of Equations The preceding estimation theory covers single-equation estimation methods used in the majority of applied studies. We now consider joint estimation of several equations. Equations linear in parameters with an additive error are presented in this section, with extensions to nonlinear systems given in the subsequent section. The main advantage of joint estimation is the gain in efficiency that results from incorporation of correlation in unobservables across equations for a given individual. Additionally, joint estimation may be necessary if there are restrictions on parameters across equations. With exogenous regressors systems estimation is a minor extension of single-equation OLS and GLS estimation, whereas with endogenous regressors it is single-equation IV methods that are adapted. One leading example is systems of equations such as those for observed demand of several commodities at a point in time for many individuals. For seemingly unrelated regression all regressors are exogenous whereas for simultaneous equations models some regressors are endogenous. 206
  • 246. 6.9. LINEAR SYSTEMS OF EQUATIONS A second leading example is panel data, where a single equation is observed at several points in time for many individuals, and each time period is treated as a separate equation. By viewing a panel data model as an example of a system it is possible to improve efficiency, obtain panel robust standard errors, and derive instruments when some regressors are endogenous. Many econometrics texts provide lengthy presentations of linear systems. The treat- ment here is very brief. It is mainly directed toward generalization to nonlinear systems (see Section 6.10) and application to panel data (see Chapters 21–23). 6.9.1. Linear Systems of Equations The single-equation linear model is given by yi = x i β + ui , where yi and ui are scalars and xi and β are column vectors. The multiple-equation linear model, or multivari- ate linear model, with G dependent variables is given by yi = Xi β + ui , i = 1, . . . , N, (6.79) where yi and ui are G × 1 vectors, Xi is a G × K matrix, and β is a K × 1 column vector. Throughout this section we make the cross-section assumption that the error vector ui is independent over i, so E[ui u j ] = 0 for i = j. However, components of ui for given i may be correlated and have variances and covariances that vary over i, leading to conditional error variance matrix for the ith individual Ωi = E[ui u i |Xi ]. (6.80) There are various ways that a multiple-equation model may arise. At one extreme the seemingly unrelated equations model combines G equations, such as demands for different consumer goods, where parameters vary across equations and regressors may or may not vary across equations. At the other extreme the linear panel data combines G periods of data for the same equation, with parameters that are constant across periods and regressors that may or may not vary across periods. These two cases are presented in detail in Sections 6.9.3 and 6.9.4. Stacking (6.79) over N individuals gives    y1 . . . yN    =    X1 . . . XN    β +    u1 . . . uN    , (6.81) or y = Xβ + u, (6.82) where y and u are NG × 1 vectors and X is a NG × K matrix. The results given in the following can be obtained by treating the stacked model (6.82) in the same way as in the single-equation case. Thus the OLS estimator is β = (X X)−1 X y and in the just-identified case with instrument matrix Z the IV estimator is β = (Z X)−1 Z y. The only real change is that the usual cross-section assumption of a diagonal error variance matrix is replaced by assumption of a block-diagonal error 207
  • 247. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION matrix. This block-diagonality needs to be accommodated in computing the estimated variance matrix of a systems estimator and in forming feasible GLS estimators and efficient GMM estimators. 6.9.2. Systems OLS and FGLS Estimation An OLS estimation of the system (6.82) yields the systems OLS estimator (X X)−1 X y. Using (6.81) it follows immediately that βSOLS = N i=1 X i Xi '−1 N i=1 X i yi . (6.83) The estimator is asymptotically normal and, assuming the data are independent over i, the usual robust sandwich result applies and V βSOLS = N i=1 X i Xi '−1 N i=1 X i ui u i Xi N i=1 X i Xi '−1 , (6.84) where ui = yi − Xi β. This variance matrix estimate permits conditional variances and covariances of the errors to differ across individuals. Given correlation of the components of the error vector for a given individual, more efficient estimation is possible by GLS or FGLS. If observations are indepen- dent over i, the systems GLS estimator is systems OLS applied to the transformed system Ω −1/2 i yi = Ω −1/2 i Xi β + Ω −1/2 i ui , (6.85) where Ωi is the error variance matrix defined in (6.80). The transformed error Ω −1/2 i ui has mean zero and variance E Ω −1/2 i ui Ω −1/2 i ui |Xi = Ω −1/2 i E u i ui |Xi Ω −1/2 i = Ω −1/2 i Ωi Ω −1/2 i = IG. So the transformed system has errors that are homoskedastic and uncorrelated over G equations and OLS is efficient. To implement this estimator, a model for Ωi needs to be specified, say Ωi = Ωi (γ). Then perform systems OLS estimation in the transformed system where Ωi is replaced by Ωi ( γ), where γ is a consistent estimate of γ. This yields the systems feasible GLS (SFGLS) estimator βSFGLS = N i=1 X i Ω −1 i Xi '−1 N i=1 X i Ω −1 i yi . (6.86) 208
  • 248. 6.9. LINEAR SYSTEMS OF EQUATIONS This estimator is asymptotically normal and to guard against possible misspecification of Ωi (γ) we can use the robust sandwich estimate of the variance matrix V βSFGLS = N i=1 X i Ωi −1 Xi '−1 N i=1 X i Ω −1 i ui u i Ωi −1 Xi N i=1 X i Ωi −1 Xi '−1 , (6.87) where Ωi = Ωi ( γ). The most common specification used for Ωi is to assume that it does not vary over i. Then Ωi = Ω is a G × G matrix that can be consistently estimated for finite G and N → ∞ by Ω = 1 N N i=1 ui u i , (6.88) where ui = yi − Xi βSOLS. Then the SFGLS estimator is (6.86) with Ω instead of Ωi , and after some algebra the SFGLS estimator can also be written as βSFGLS = X Ω −1 ⊗ IN X −1 X Ω −1 ⊗ IN y , (6.89) where ⊗ denotes the Kronecker product. The assumption that Ωi = Ω rules out, for example, heteroskedasticity over i. This is a strong assumption, and in many applica- tions it is best to use robust standard errors calculated using (6.87), which gives correct standard errors even if Ωi does vary over i. 6.9.3. Seemingly Unrelated Regressions The seemingly unrelated regressions (SUR) model specifies the gth of G equations for the ith of N individuals to be given by yig = x igβg + uig, g = 1, . . . , G, i = 1, . . . , N, (6.90) where xig are regressors that are assumed to be exogenous and βg are Kg × 1 param- eter vectors. For example, for demand data on G goods for N individuals, yig may be the ith individual’s expenditure on good g or budget share for good g. In all that follows G is assumed fixed and reasonably small while N → ∞. Note that we use the subscript order yig as results then transfer easily to panel data with variable yit (see Section 6.9.4). Other authors use the reverse order ygi . The SUR model was proposed by Zellner (1962). The term seemingly unrelated regressions is deceptive, as clearly the equations are related if the errors uig in different equations are correlated. For the SUR model the relationship between yig and yih is indirect; it comes through correlation in the errors across different equations. Estimation combines observations over both equations and individuals. For microe- conometrics applications, where independence over i is assumed, it is most convenient to first stack all equations for a given individual. Stacking all G equations for the ith 209
  • 249. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION individual we get    yi1 . . . yiG    =    x i1 0 0 0 ... 0 0 0 x iG       β1 . . . βG    +    ui1 . . . uiG    , (6.91) which is of the form yi = Xi β + ui in (6.79), where yi and ui are G × 1 vectors with gth entries yig and uig, Xi is a G × K matrix with gth row [0· · · x ig· · · 0], and β = [β 1. . . β G] is a K × 1 vector where K = K1 + · · · KG. Some authors instead first stack all individuals for a given equation, leading to different algebraic expressions for the same estimators. Given the definitions of Xi and yi it is easy to show that βSOLS in (6.83) is    β1 . . . βG    =      N i=1 xi1x i1 −1 N i=1 xi1 yi1 . . . N i=1 xiGx iG −1 N i=1 xiG yiG      , so that systems OLS is the same as separate equation-by-equation OLS. As might be expected a priori, if the only link across equations is the error and the errors are treated as being uncorrelated then joint estimation reduces to single-equation estimation. A better estimator is the feasible GLS estimator defined in (6.86) using Ω in (6.88) and statistical inference based on the asymptotic variance given in (6.87). This estima- tor is generally more efficient than systems OLS, though it can be shown to collapse to OLS if the errors are uncorrelated across equations or if exactly the same regressors appear in each equation. Seemingly unrelated regression models may impose cross-equation parameter restrictions. For example, a symmetry restriction may imply that the coefficient of the second regressor in the first equation equals the coefficient of the first regressor in the second equation. If such restrictions are equality restrictions one can easily estimate the model by appropriate redefinition of Xi and β given in (6.79). For ex- ample, if there are two equations and the restriction is that β2 = −β1 then define Xi = [xi1 − xi2] and β = β1. Alternatively, one can estimate using systems exten- sions of single-equation OLS and GLS with linear restrictions on the parameters. Also, in systems of equations it is possible that the variance matrix of the error vector ui is singular, as a result of adding-up constraints. For example, suppose yig is the ith budget share, and the model is yig = αg + z i βg + uig, where the same re- gressors appear in each equation. Then g yig = 1 since budget shares sum to one, which requires g αg = 1, g βg = 0, and g uig = 0. The last restriction means Ωi is singular and hence noninvertible. One can eliminate one equation, say the last, and estimate the model by systems estimation applied to the remaining G − 1 equa- tions. Then the parameter estimates for the Gth equation can be obtained using the adding-up constraint. For example, αG = 1 − ( α1 + · · · + αG−1). It is also possible to impose equality restrictions on the parameters in this setup. A literature exists on methods that ensure that estimates obtained are invariant to the equation deleted; see, for example, Berndt and Savin (1975). 210
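The SUR estimators above are straightforward to compute directly. A minimal sketch with two simulated equations (the design and all names are purely illustrative; numpy assumed) computes systems OLS, the error covariance estimate formed from its residuals, and the feasible GLS estimator of (6.86).

    # Sketch of SUR feasible GLS, G = 2 equations with cross-equation error
    # correlation.  Simulated data for illustration only.
    import numpy as np

    rng = np.random.default_rng(6)
    N, G = 1000, 2
    x1 = np.column_stack([np.ones(N), rng.normal(size=N)])        # regressors, eq. 1
    x2 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # regressors, eq. 2
    u = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=N)
    y1 = x1 @ np.array([1.0, 2.0]) + u[:, 0]
    y2 = x2 @ np.array([0.5, 1.0, -1.0]) + u[:, 1]

    # Stack the G equations for each individual: y_i = X_i beta + u_i as in (6.91).
    K = x1.shape[1] + x2.shape[1]
    Xi = np.zeros((N, G, K))
    Xi[:, 0, :2] = x1
    Xi[:, 1, 2:] = x2
    yi = np.column_stack([y1, y2])                                # N x G

    # Systems OLS (6.83); here it equals equation-by-equation OLS.
    A = np.einsum("igk,igl->kl", Xi, Xi)
    b = np.einsum("igk,ig->k", Xi, yi)
    beta_sols = np.linalg.solve(A, b)

    # Estimate Omega from the systems OLS residuals (6.88), then FGLS (6.86).
    res = yi - np.einsum("igk,k->ig", Xi, beta_sols)
    Omega = res.T @ res / N
    Oinv = np.linalg.inv(Omega)
    A = np.einsum("igk,gh,ihl->kl", Xi, Oinv, Xi)
    b = np.einsum("igk,gh,ih->k", Xi, Oinv, yi)
    beta_sfgls = np.linalg.solve(A, b)
    print(beta_sols, beta_sfgls)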
  • 250. 6.9. LINEAR SYSTEMS OF EQUATIONS 6.9.4. Panel Data Another leading application of systems GLS methods is to panel data, where a scalar dependent variable is observed in each of T time periods for N individuals. Panel data can be viewed as a system of equations, either T equations for N individuals or N equations for T time periods. In microeconometrics we assume a short panel, with T small and N → ∞ so it is natural to set it up as a scalar dependent variable yit , where the gth equation in the preceding discussion is now interpreted as the tth time period and G = T . A simple panel data model is yit = x it β + uit , t = 1, . . . , T, i = 1, . . . , N, (6.92) a specialization of (6.90) with β now constant. Then in (6.79) the regressor matrix becomes Xi = [xi1· · · xiT ] . After some algebra the systems OLS estimator defined in (6.83) can be reexpressed as βPOLS = N i=1 T t=1 xit x it '−1 N i=1 T t=1 xit yit . (6.93) This estimator is called the pooled OLS estimator as it pools or combines the cross- section and time-series aspects of the data. The pooled estimator is obtained simply by OLS estimation of yit on xit . However, if uit are correlated over t for given i, the default OLS standard errors that assume independence of the error over both i and t are invalid and can be greatly downward biased. Instead, statistical inference should be based on the robust form of the co- variance matrix given in (6.84). This is detailed in Section 21.2.3. In practice models more complicated than (6.92) that include individual specific effects are estimated (see Section 21.2). 6.9.5. Systems IV Estimation Estimation of a single linear equation with endogenous regressors was presented in Section 6.4. Now we extend this to the multivariate linear model (6.79) when E[ui |Xi ] = 0. Brundy and Jorgenson (1971) considered IV estimation applied to the system of equations to produce estimates that are both consistent and efficient. We assume the existence of a G × r matrix of instruments Zi that satisfy E[ui |Zi ] = 0 and hence E[Z i (yi − Xi β)] = 0. (6.94) These instruments can be used to obtain consistent parameter estimates using single- equation IV methods, but joint equation estimation can improve efficiency. The sys- tems GMM estimator minimizes QN (β) = N i=1 Z i (yi − Xi β) ' WN N i=1 Z i (yi − Xi β) ' , (6.95) 211
  • 251. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION where WN is an r × r weighting matrix. Performing some algebra yields βSGMM = X ZWN Z X −1 X ZWN Z y , (6.96) where X is an NG × K matrix obtained by stacking X1, . . . , XN (see (6.81)) and Z is an NG × r matrix obtained by similarly stacking Z1, . . . , ZN . The systems GMM estimator has exactly the same form as (6.37), and the asymptotic variance matrix is that given in (6.39). It follows that a robust estimate of the variance matrix is V[ βSGMM] = N X ZWN Z X −1 X ZWN SWN Z X X ZWN Z X −1 , (6.97) where, in the systems case and assuming independence over i, S = 1 N N i=1 Z i ui u i Zi . (6.98) Several choices of weighting matrix receive particular attention. First, the optimal systems GMM estimator is (6.96) with WN = S−1 , where S is defined in (6.98). The variance matrix then simplifies to V[ βOSGMM] = N X Z S −1 Z X −1 . This estimator is the most efficient GMM estimator based on moment conditions (6.94). The efficiency gain arises from two factors: (1) systems estimation, which per- mits errors in different equations to be correlated, so that V[ui |Zi ] is not restricted to being block diagonal, and (2) an allowance for quite general heteroskedasticity and correlation, so that Ωi can vary over i. Second, the systems 2SLS estimator arises when WN = (N−1 Z Z)−1 . Consider the SUR model defined in (6.91), with some of the regressors xig now endogenous. Then systems 2SLS reduces to equation-by-equation 2SLS, with instruments zg for the gth equation, if we define the instrument matrix to be Zi =    z i1 0 0 0 ... 0 0 0 z iG    . (6.99) In many applications z1 = z2 = · · · = zg so that a common set of instruments is used in all equations, but we need not restrict analysis to this case. For the panel data model (6.92) systems 2SLS reduces to pooled 2SLS if we define Zi = [zi1· · · ziT ] . Third, suppose that V[ui |Zi ] does not vary over i, so that V[ui |Zi ] = Ω. This is a systems analogue of the single-equation assumption of homoskedasticity. Then as with (6.88) a consistent estimate of Ω is Ω = N−1 i ui u i , where ui are residuals based on a consistent IV estimator such as systems 2SLS. Then the optimal GMM estimator is (6.96) with WN = IN ⊗ Ω. This estimator should be contrasted with the three-stage least-squares estimator presented at the end of the next section. 212
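A minimal sketch of the systems 2SLS special case of (6.96), applied to a simulated short panel as pooled 2SLS, together with the panel-robust variance of (6.97)-(6.98) that permits errors to be correlated within each individual. The data design is purely illustrative; numpy assumed.

    # Sketch of systems (pooled) 2SLS for a short panel with a cluster-robust
    # variance matrix.  The dgp is an illustrative assumption.
    import numpy as np

    rng = np.random.default_rng(7)
    N, T = 800, 4                                     # N individuals, T periods (G = T)
    alpha = rng.normal(size=N)                        # individual effect
    z1 = rng.normal(size=(N, T))
    z2 = rng.normal(size=(N, T))
    x = 0.5 * z1 + 0.5 * z2 + alpha[:, None] + rng.normal(size=(N, T))  # endogenous via alpha
    e = 0.5 * alpha[:, None] + rng.normal(size=(N, T))                  # error correlated within i
    y = 1.0 + 2.0 * x + e                             # true coefficients (1, 2)

    Xi = np.stack([np.ones((N, T)), x], axis=2)       # N x T x K with K = 2
    Zi = np.stack([np.ones((N, T)), z1, z2], axis=2)  # N x T x r with r = 3

    ZX = np.einsum("itr,itk->rk", Zi, Xi)             # Z'X
    Zy = np.einsum("itr,it->r", Zi, y)                # Z'y
    ZZ = np.einsum("itr,its->rs", Zi, Zi)             # Z'Z
    W = np.linalg.inv(ZZ / N)                         # systems 2SLS weighting matrix

    A = ZX.T @ W @ ZX
    beta = np.linalg.solve(A, ZX.T @ W @ Zy)          # pooled (systems) 2SLS estimate

    # Panel-robust variance with S = N^{-1} sum_i Z_i' u_i u_i' Z_i.
    u = y - np.einsum("itk,k->it", Xi, beta)
    Zu = np.einsum("itr,it->ir", Zi, u)               # row i is u_i'Z_i
    S = Zu.T @ Zu / N
    V = N * np.linalg.inv(A) @ (ZX.T @ W @ S @ W @ ZX) @ np.linalg.inv(A)
    print(beta, np.sqrt(np.diag(V)))                  # coefficients and robust standard errors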
  • 252. 6.9. LINEAR SYSTEMS OF EQUATIONS 6.9.6. Linear Simultaneous Equations Systems The linear simultaneous equations model, introduced in Section 2.4, is a very impor- tant model that is often presented in considerable length in introductory graduate-level econometrics courses. In this section we provide a very brief self-contained summary. The discussion of identification overlaps with that in Chapter 2. Due to the presence of endogenous variables OLS and SUR estimators are inconsistent. Consistent estima- tion methods are placed in the context of GMM estimation, even though the standard methods were developed well before GMM. The linear simultaneous equations model specifies the gth of G equations for the ith of N individuals to be given by yig = z igγg + Y igβg + uig, g = 1, . . . , G, (6.100) where the order of subscripts is that of Section 6.9 rather than Section 2.4, zg is a vector of exogenous regressors that are assumed to be uncorrelated with the er- ror term ug and Yg is a vector that contains a subset of the dependent variables y1, . . . , yg−1, yg+1, . . . , yG of the other G − 1 equations. Yg is endogenous as it is correlated with model errors. The model for the ith individual can equivalently be written as y i B + z i Γ = ui , (6.101) where yi = [yi1. . . yiG] is a G × 1 vector of endogenous variables, zi is an r × 1 vector of exogenous variables that is the union of zi1, . . . , ziG, ui = [ui1. . . uiG] is a G × 1 error vector, B is a G × G parameter matrix with diagonal entries unity, Γ is an r × G parameter matrix, and some of the entries in B and Γ are constrained to be unity. It is assumed that ui is iid over i with mean 0 and variance matrix Σ. The model (6.101) is called the structural form with different restrictions on B and Γ corresponding to different structures. Solving for the endogenous variables as a function of the exogenous variables yields the reduced form yi = −z i ΓB−1 + ui B−1 (6.102) = z i Π + vi , where Π = −ΓB−1 is the r × G matrix of reduced form parameters and vi = ui B−1 is the reduced form error vector with variance Ω = (B−1 ) ΣB−1 . The reduced form can be consistently estimated by OLS, yielding estimates of Π = −ΓB−1 and Ω = (B−1 ) ΣB−1 . The problem of identification, see Section 2.5, is one of whether these lead to unique estimates of the structural form parameters B, Γ and Σ. This requires some parameter restrictions since without restrictions B, Γ, and Σ contain G2 more parameters than Π and Ω. A necessary condition for identi- fication of parameters in the gth equation is the order condition that the number of exogenous variables excluded from the gth equation must be at least equal to the num- ber of endogenous variables included. This is the same as the order condition given in Section 6.4.1. For example, if Yig in (6.100) has one component, so there is one endogenous variable in the equation, then at least one of the components of xi must not be included. This will ensure that there are as many instruments as regressors. 213
  • 253. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION A sufficient condition for identification is the stronger rank condition. This is given in many books such as Greene’s (2003) and for brevity is not given here. Other restric- tions, such as covariance restrictions, may also lead to identification. Given identification, the structural model parameters can be consistently estimated by separate estimation of each equation by two-stage least squares defined in (6.44). The same set of instruments zi is used for each equation. In the gth equation the sub- component zig is used as instrument for itself and the remainder of zi is used as instru- ment for Yig. More efficient systems estimates are obtained using the three-stage least-squares (3SLS) estimator of Zellner and Theil (1962), which assumes errors are homoskedas- tic but are correlated across equations. First, estimate the reduced form coefficients Π in (6.102) by OLS regression of y on z. Second, obtain the 2SLS estimates by OLS re- gression of (6.100), where Yg is replaced by the reduced form predictions Yg = z ΠG. This is OLS regression of yg on Yg and zg, or equivalently of yg on xg, where xg are the predictions of Yg and zg from OLS regression on z. Third, obtain the 3SLS estimates by systems OLS regression of yg on xg, g = 1, . . . , G. Then from (6.89) θ3SLS = X Ω −1 ⊗ IN X −1 X Ω −1 ⊗ IN y, where X is obtained by first forming a block-diagonal matrix Xi with diagonal blocks xi1, . . . , xiG and then stacking X1, . . . , XN , and Ω = N−1 i ui u i with ui the residual vectors calculated using the 2SLS estimates. This estimator coincides with the systems GMM estimator with WN = IN ⊗ Ω in the case that the systems GMM estimator uses the same instruments in every equation. Otherwise, 3SLS and systems GMM differ, though both yield consistent estimates if E[ui |zi ] = 0. 6.9.7. Linear Systems ML Estimation The systems estimators for the linear model are essentially LS or IV estimators with in- ference based on robust standard errors. Now additionally assume normally distributed iid errors, so that ui ∼ N[0, Ω]. For systems with exogenous regressors the resulting MLE is asymptotically equiva- lent to the GLS estimator. These estimators do use different estimators of Ω and hence β, however, so that there are small-sample differences between the MLE and the GLS estimator. For example, see Chapter 21 for the random effects panel data model. For the linear SEM (6.101), the limited information maximum likelihood es- timator, a single-equation ML estimator, is asymptotically equivalent to 2SLS. The full information maximum likelihood estimator, the systems MLE, is asymptotically equivalent to 3SLS. See, for example, Schmidt (1976) and Greene (2003). 6.10. Nonlinear Sets of Equations We now consider systems of equations that are nonlinear in parameters. For example, demand equation systems obtained from a specified direct or indirect utility may be 214
  • 254. 6.10. NONLINEAR SETS OF EQUATIONS nonlinear in parameters. More generally, if a nonlinear model is appropriate for a de- pendent variable studied in isolation, for example a logit or Poisson model, then any joint model for two or more such variables will necessarily be nonlinear. We begin with a discussion of fully parametric joint modeling, before focusing on partially parametric modeling. As in the linear case we present models with exogenous regressors before considering the complication of endogenous regressors. 6.10.1. Nonlinear Systems ML Estimation Maximum likelihood estimation for a single dependent variable was presented in Sec- tion 5.6. These results can be immediately applied to joint models of several dependent variables, with the very minor change that the single dependent variable conditional density f (yi |xi , θ) becomes f (yi |Xi , θ), where yi denotes the vector of dependent variables, Xi denotes all the regressors, and θ denotes all the parameters. For example, if y1 ∼ N[exp(x 1β1), σ2 1 ] and y2 ∼ N[exp(x 2β2), σ2 2 ] then a suitable joint model may be to assume that (y1, y2) are bivariate normal with means exp(x 1β1) and exp(x 2β2), variances σ2 1 and σ2 2 , and correlation ρ. For data that are not normally distributed there can be challenges in specifying and selecting a sufficiently flexible joint distribution. For example, for univariate counts a standard starting model is the negative binomial (see Chapter 20). However, in ex- tending this to a bivariate or multivariate model for counts there are several alternative bivariate negative binomial models to choose from. These might differ, for example, as to whether the univariate conditional distribution or the univariate marginal distri- bution is negative binomial. In contrast the multivariate normal distribution has condi- tional and marginal distributions that are both normal. All of these multivariate nega- tive binomial distributions place some restrictions on the range of correlation such as restricting to positive correlation, whereas for the multivariate normal there is no such restriction. Fortunately, modern computational advances permit richer models to be specified. For example, a reasonably flexible model for correlated bivariate counts is to assume that, conditional on unobservables ε1 and ε2, y1 is Poisson with mean exp(x 1β1 + ε1) and y2 is Poisson with mean exp(x 1β1 + ε2). An estimable bivariate distribution can be obtained by assuming that the unobservables ε1 and ε2 are bivariate normal and in- tegrating them out. There is no closed-form solution for this bivariate distribution, but the parameters can nonetheless be estimated using the method of maximum simulated likelihood presented in Section 12.4. A number of examples of nonlinear joint models are given throughout Part 4 of the book. The simplest joint models can be inflexible, so consistency can rely on distribu- tional assumptions that are too restrictive. However, there is generally no theoretical impediment to specifying more flexible models that can be estimated using computa- tionally intensive methods. In particular, two leading methods for generating rich multivariate parametric mod- els are presented in detail in Section 19.3. These methods are given in the context of duration data models, but they have much wider applicability. First, one can introduce correlated unobserved heterogeneity, as in the bivariate count example just given. 215
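As a concrete sketch of this first approach for the bivariate count example just described, the code below forms the simulated likelihood: conditional on bivariate normal unobservables (ε1, ε2), the two counts are independent Poisson with log-linear means each shifted by its own unobservable, and the unobservables are integrated out by averaging over simulation draws, as in the maximum simulated likelihood method of Section 12.4. The data-generating process, number of draws, and starting values are all illustrative.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
N, S = 300, 100                               # observations, simulation draws

# Illustrative dgp: correlated normal heterogeneity in two Poisson log-means
x1, x2 = rng.normal(size=N), rng.normal(size=N)
eps = rng.multivariate_normal([0, 0], [[0.3, 0.2], [0.2, 0.3]], size=N)
y1 = rng.poisson(np.exp(0.5 + 0.8 * x1 + eps[:, 0]))
y2 = rng.poisson(np.exp(0.2 - 0.5 * x2 + eps[:, 1]))

# Common random numbers: standard normal draws held fixed across iterations
base = rng.normal(size=(S, 2))

def neg_msl(theta):
    """Negative simulated log-likelihood; theta = (b10, b11, b20, b21, ln s1, ln s2, a)."""
    b10, b11, b20, b21, ls1, ls2, a = theta
    s1, s2, rho = np.exp(ls1), np.exp(ls2), np.tanh(a)   # enforce s > 0, |rho| < 1
    # Transform the fixed draws into the assumed bivariate normal (eps1, eps2)
    e1 = s1 * base[:, 0]
    e2 = s2 * (rho * base[:, 0] + np.sqrt(1 - rho**2) * base[:, 1])
    # Simulated likelihood: average over draws of the product of Poisson pmfs
    mu1 = np.exp(b10 + b11 * x1[None, :] + e1[:, None])   # S x N
    mu2 = np.exp(b20 + b21 * x2[None, :] + e2[:, None])
    f = stats.poisson.pmf(y1[None, :], mu1) * stats.poisson.pmf(y2[None, :], mu2)
    return -np.sum(np.log(f.mean(axis=0) + 1e-300))       # small constant guards log(0)

start = np.array([0.0, 0.0, 0.0, 0.0, np.log(0.5), np.log(0.5), 0.0])
res = optimize.minimize(neg_msl, start, method="BFGS")
print("MSL estimates (regression coefficients):", res.x[:4].round(3))
```

Holding the underlying standard normal draws fixed across iterations (common random numbers) keeps the simulated objective smooth in the parameters.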
  • 255. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION Second, one can use copulas, which provide a way to generate a joint distribution given specified univariate marginals. For ML estimation a simpler though less efficient quasi-ML approach is to specify separate parametric models for y1 and y2 and obtain ML estimates assuming inde- pendence of y1 and y2 but then do statistical inference permitting y1 and y2 to be correlated. This has been presented in Section 5.7.5. In the remainder of this section we consider such partially parametric approaches. The challenges became greater if there is endogeneity, so that a dependent variable in one equation appears as a regressor in another equation. Few models for nonlinear simultaneous equations exist, aside from nonlinear regression models with additive errors that are normally distributed. 6.10.2. Nonlinear Systems of Equations For linear regression the movement from single equation to multiple equations is clear as the starting point is the linear model y = x β + u and estimation is by least squares. Efficient systems estimation is then by systems GLS estimation. For nonlinear models there can be much more variety in the starting point and estimation method. We define the multivariate nonlinear model with G dependent variables to be r(yi , Xi , β) = ui , (6.103) where yi and ui are G × 1 vectors, r(yi , Xi , β) is a G × 1 vector function, Xi is a G × L matrix, and β is a K × 1 column vector. Throughout this section we make the cross-section assumption that the error vector ui is independent over i, but components of ui for given i may be correlated with variances and covariances that vary over i. One example of (6.103) is a nonlinear seemingly unrelated regression model. Then the gth of G equations for the ith of N individuals is given by rg(yig, xig, βg) = uig, g = 1, . . . , G. (6.104) For example, uig = yig − exp(x igβg). Then ui and r(·) in (6.103) are G × 1 vectors with gth entries uig and rg(·), Xi is the same block-diagonal matrix as that defined in (6.91), and β is obtained by stacking β1 to βG. A second example is a nonlinear panel data model. Then for individual i in period t r(yit , xit , β) = uit , t = 1, . . . , T. (6.105) Then ui and r(·) in (6.103) are T × 1 vectors, so G = T , with tth entries uit and r(yit , xit , β). The panel model differs from the SUR model by having the same func- tion r(·) and parameters β in each period. 6.10.3. Nonlinear Systems Estimation When the regressors Xi in the model (6.103) are exogenous E[ui |Xi ] = 0, (6.106) 216
  • 256. 6.10. NONLINEAR SETS OF EQUATIONS where ui is the error term defined in (6.103). We assume that the error term is inde- pendent over i, and the variance matrix is Ωi = E[ui u i |Xi ]. (6.107) Additive Errors Systems estimation is a straightforward adaptation of systems OLS and FGLS estima- tion of the linear models when the nonlinear model is additive in the error term, so that (6.103) specializes to ui = yi − g(Xi , β). (6.108) Then the systems NLS estimator minimizes the sum of squared residuals i u i ui , whereas the systems FGNLS estimator minimizes QN (β) = i u i Ω −1 i ui , (6.109) where we specify a model Ωi (γ) for Ωi and Ωi = Ωi ( γ). To guard against possible misspecification of Ωi one can use robust standard errors that essentially require only that ui is independent and satisfies (6.106). Then the estimated variance of the systems FGNLS estimator is the same as that for the linear systems FGLS estimator in (6.87), with Xi replaced by ∂g(yi , β)/∂β β and now ui = yi − g(Xi , β). The estimated vari- ance of the simpler systems NLS estimator is obtained by additionally replacing Ωi by IG. The main challenge can be specifying a useful model for Ωi . As an example, sup- pose we wish to jointly model two count data variables. In Chapter 20 we show that a standard model for counts, a little more general than the Poisson model, specifies the conditional mean to be exp(x β) and the conditional variance to be a multiple of exp(x β). Then a joint model might specify u = [u1 u2] , where u1 = y1 − exp(x 1β1) and u2 = y2 − exp(x 2β2). The variance matrix Ωi then has diagonal entries α1 exp(x i1β1) and α2 exp(x i2β2), and one possible parameterization for the co- variance is α3[exp(x i1β1) exp(x i2β2)]1/2 . The estimate Ωi then requires estimates of β1, β2, α1, α2, and α3 that may be obtained from first-step single-equation estimation. Nonadditive Errors With nonadditive errors least-squares regression is no longer appropriate, as shown in the single-equation case in Section 6.2.2. Wooldridge (2002) presents consistent method of moments estimation. The conditional moment restriction (6.106) leads to many possible unconditional moment conditions that can be used for estimation. The obvious starting point is to base estimation on the moment conditions E[X i ui ] = 0. However, other moment con- ditions may be used. We more generally consider estimation based on K moment conditions E[R(Xi , β) ui ] = 0, (6.110) 217
  • 257. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION where R(Xi , β) is a K × G matrix of functions of Xi and β. The specification of R(Xi , β) and possible dependence on β are discussed in the following. By construction there are as many moment conditions as parameters. The sys- tems method of moments estimator βSMM solves the corresponding sample moment conditions 1 N N i=1 R(Xi , β) r(yi , Xi , βSMM) = 0, (6.111) where in practice R(Xi , β) is evaluated at a first-step estimate β. This estimator is asymptotically normal with variance matrix V βSMM = N i=1 D i Ri '−1 N i=1 R i ui u i Ri N i=1 R i Di '−1 , (6.112) where Di = ∂ri /∂β β , Ri = R(Xi , β), and ui = r(yi , Xi , βSMM). The main issue is specification of R(X, β) in (6.110). From Section 6.3.7, the most efficient estimator based on (6.106) specifies R∗ (Xi , β) = E ∂r(yi , Xi , β) ∂β |Xi Ω−1 i . (6.113) In general the first expectation on the right-hand side requires strong distributional assumptions, making optimal estimation difficult. Simplification does occur, however, if the nonlinear model is one with additive er- ror defined in (6.108). Then R∗ (Xi , β) = ∂g(Xi , β) /∂β × Ω−1 i , and the estimating equations (6.110) become N−1 N i=1 ∂g(Xi , β) ∂β Ω−1 i (yi − X i βSMM) = 0. This estimator is asymptotically equivalent to the systems FGNLS estimator that min- imizes (6.109). 6.10.4. Nonlinear Systems IV Estimation When the regressors Xi in the model (6.103) are endogenous, so that E[ui |Xi ] = 0, we assume the existence of a G × r matrix of instruments Zi such that E[ui |Zi ] = 0, (6.114) where ui is the error term defined in (6.103). We assume that the error term is indepen- dent over i, and the variance matrix is Ωi = E[ui u i |Zi ]. For the nonlinear SUR model Zi is as defined in (6.99). The approach is similar to that used in the preceding section for the systems MM estimator, with the additional complication that now there may be a surplus of instru- ments leading to a need for GMM estimation rather than just MM estimation. Condi- tional moment restriction (6.106) leads to many possible unconditional moment condi- tions that can be used for estimation. Here we follow many others in basing estimation 218
  • 258. 6.11. PRACTICAL CONSIDERATIONS on the moment conditions E[Z i ui ] = 0. Then a systems GMM estimator minimizes QN (β) = N i=1 Z i r(yi , Xi , β) ' WN N i=1 Z i r(yi , Xi , β) ' . (6.115) This estimator is asymptotically normal with estimated variance V βSGMM = N D ZWN Z D −1 D ZWN SWN Z D D ZWN Z D −1 , (6.116) where D Z = i ∂r i /∂β β Zi and S = N−1 i Zi ui u i Z i and we assume ui is inde- pendent over i with variance matrix V[ui |Xi ] = Ωi . The choice WN = [N−1 i Zi Z i ]−1 corresponds to NL2SLS in the case that r(yi , Xi , β) is obtained from a nonlinear SUR model. The choice WN = [N−1 i Zi ΩZ i ]−1 , where Ω = N−1 i ui u i , is called nonlinear 3SLS (NL3SLS) and is the most efficient estimator based on the moment condition E[Z i ui ] = 0 in the special case that Ωi = Ω. The choice WN = S−1 gives the most efficient estimator un- der the more general assumption that Ωi may vary with i. As usual, however, moment conditions other than E[Z i ui ] = 0 may lead to more efficient estimators. 6.10.5. Nonlinear Simultaneous Equations Systems The nonlinear simultaneous equations model specifies that the gth of G equations for the ith of N individuals is given by uig = rg(yi , xig, βg), g = 1, . . . , G. (6.117) This is the nonlinear SUR model with regressors that now include dependent variables from other equations. Unlike the linear SEM, there are few practically useful results to help ensure that a nonlinear SEM is identified. Given identification, consistent estimates can be obtained using the GMM estima- tors presented in the previous section. Alternatively, we can assume that ui ∼ N[0, Ω] and obtain the nonlinear full-information maximum likelihood estimator. In a de- parture from the linear SEM, the nonlinear full-information MLE in general has an asymptotic distribution that differs from NL3SLS, and consistency of the nonlinear full-information MLE requires that the errors are actually normally distributed. For details see Amemiya (1985). Handling endogeneity in nonlinear models can be complicated. Section 16.8 con- siders simultaneity in Tobit models, where analysis is simpler when the model is linear in the latent variables. Section 20.6.2 considers a more highly nonlinear example, en- dogenous regressors in count data models. 6.11. Practical Considerations Ideally GMM could be implemented using an econometrics package, requiring little more difficulty and knowledge than that needed, say, for nonlinear least-squares esti- mation with heteroskedastic errors. However, not all leading econometrics packages 219
  • 259. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION provide a broad GMM module. Depending on the specific application, GMM estima- tion may require a switch to a more suitable package or use of a matrix programming language along with familiarity with the algebra of GMM. A common application of GMM is IV estimation. Most econometrics packages in- clude linear IV but not all include nonlinear IV estimators. The default standard errors may assume homoskedastic errors rather than being heteroskedastic-robust. As already emphasized in Chapter 4, it can be difficult to obtain instruments that are uncorrelated with the error yet reasonably correlated with the regressor or, in the nonlinear case, the appropriate derivative of the error with respect to parameters. Econometrics packages usually include linear systems but not nonlinear systems. Again, default standard errors may not be robust to heteroskedasticity. 6.12. Bibliographic Notes Textbook treatments of GMM include chapters by Davidson and MacKinnon (1993, 2004), Hamilton (1994), and Greene (2003). The more recent books by Hayashi (2000) and Wooldridge (2002) place considerable emphasis on GMM estimation. Bera and Bilias (2002) provide a synthesis and history of many of the estimators presented in Chapters 5 and 6. 6.3 The original reference for GMM is Hansen (1982). A good explanation of optimal mo- ments for GMM is given in the appendix of Arellano (2003). The October 2002 issue of Journal of Business and Economic Statistics is devoted to GMM estimation. 6.4 The classic treatment of linear IV estimation by Sargan (1958) is a key precursor to GMM. 6.5 The nonlinear 2SLS estimator introduced by Amemiya (1974) generalizes easily to the GMM estimator. 6.6 Standard references for sequential two-step estimation are Newey (1984), Murphy and Topel (1985), and Pagan (1986). 6.7 A standard reference for minimum distance estimation is Chamberlain (1982). 6.8 A good overview of empirical likelihood is provided by Mittelhammer, Judge, and Miller (2000) and key references are Owen (1988, 2001) and Qin and Lawless (1994). Imbens (2002) provides a review and application of this relatively new method. 6.9 Texts such as Greene’s (2003) provide a more detailed coverage of systems estimation than that provided here, especially for linear seemingly unrelated regressions and linear simultaneous equations models. 6.10 Amemiya (1985) presents nonlinear simultaneous equations in detail. Exercises 6–1 For the gamma regression model of Exercise 5.2, E[y|x] = exp(x β) and V[y|x] = (exp(x β))2 /2. (a) Show that these conditions imply that E[x{(y − x β)2 − (exp(x β))2 /2}] = 0. (b) Use the moment condition in part (a) to form a method of moments estimator βMM. (c) Give the asymptotic distribution of βMM using result (6.13) . (d) Suppose we use the moment condition E[x(y − exp(x β))] in addition to that in part (a). Give the objective function for a GMM estimator of β. 220
  • 260. 6.12. BIBLIOGRAPHIC NOTES 6–2 Consider the linear regression model for data independent over i with yi = x i β + ui . Suppose E[ui |xi ] = 0 but there are available instruments zi with E[ui |zi ] = 0 and V[ui |zi ] = σ2 i , where dim(z) dim(x). We consider the GMM es- timator β that minimizes QN(β) = [N−1 i zi (yi − x i β)] WN[N−1 i zi (yi − x i β)]. (a) Derive the limit distribution of √ N( β − β0) using the general GMM result (6.11). (b) State how to obtain a consistent estimate of the asymptotic variance of β. (c) If errors are homoskedastic what choice of WN would you use? Explain your answer. (d) If errors are heteroskedastic what choice of WN would you use? Explain your answer. 6–3 Consider the Laplace intercept-only example at the end of Section 6.3.6, so y = µ + u. Then GMM estimation is based on E[h(µ)] = 0, where h(µ) = [(y − µ), (y − µ)3 ] . (a) Using knowledge of the central moments of y given in Section 6.3.6, show that G0 = E[∂h/∂µ] = [−1, −6] and that S0 = E[hh ] has diagonal entries 2 and 720 and off-diagonal entries 24. (b) Hence show that G 0S−1 0 G0 = 252/432. (c) Hence show that µOGMM has asymptotic variance 1.7143/N. (d) Show that the GMM estimator of µ with W = I2 has asymptotic variance 19.14/N. 6–4 This question uses the probit model but requires little knowledge of the model. Let y denote a binary variable that takes value 0 or 1 according to whether or not an event occurs, let x denote a regressor vector, and assume independent observations. (a) Suppose E[y|x] = Φ(x β), where Φ(·) is the standard normal cdf. Show that E[(y − Φ(x β))x] = 0. Hence give the estimating equations for a method of moments estimator for β. (b) Will this estimator yield the same estimates as the probit MLE? [For just this part you need to read Section 14.3.] (c) Give a GMM objective function corresponding to the estimator in part (a). That is, give an objective function that yields the same first-order conditions, up to a full-rank matrix transformation, as those obtained in part (a). (d) Now suppose that because of endogeneity in some of the components E[y|x] = Φ(x β). Assume there exists a vector z, dim[z] dim[x], such that E[y − Φ(x β)|z] = 0. Give the objective function for a consistent estimator of β. The estimator need not be fully efficient. (e) For your estimator in part (d) give the asymptotic distribution of the estimator. State clearly any assumptions made on the dgp to obtain this result. (f) Give the weighting matrix, and a way to calculate it, for the optimal GMM estimator in part (d). (g) Give a real-world example of part (d). That is, give a meaningful example of a probit model with endogenous regressor(s) and valid instrument(s). State the dependent variable, the endogenous regressor(s), and the instrument(s) used to permit consistent estimation. [This part is surprisingly difficult.] 221
  • 261. GENERALIZED METHOD OF MOMENTS AND SYSTEMS ESTIMATION 6–5 Suppose we impose the constraint that E[wi ] = g(θ), where dim[w] dim[θ]. (a) Obtain the objective function for the GMM estimator. (b) Obtain the objective function for the minimum distance estimator (see Sec- tion 6.7) with π = E[wi ] and π = w̄. (c) Show that MD and GMM are equivalent in this example. 6–6 The MD estimator (see Section 6.7) uses the restriction π − g(θ) = 0. Suppose more generally that the restriction is h(θ, π) = 0 and we estimate using the gen- eralized MD estimator that minimizes QN(θ) = h(θ, π) WNh(θ, π). Adapt (6.68)– (6.70) to show that (6.67) holds with G0 = ∂h(θ, π)/∂θ θ0,π0 and V[ π] replaced by H 0V[ π]H0, where H0 = ∂h(θ, π)/∂π θ0,π0 . 6–7 For data generated from the dgp given in Section 6.6.4 with N = 1,000, obtain NL2SLS estimates and compare these to the two-stage estimates. 222
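For Exercise 6–7 the dgp of Section 6.6.4 is needed and is not reproduced here. Purely as a template for the mechanics, the following sketch computes an NL2SLS estimate for an invented additive-error exponential-regression dgp with one endogenous regressor; the model, instruments, and starting values are illustrative stand-ins, not the exercise's actual design.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(2)
N = 1000

# Illustrative dgp: exponential conditional mean with an endogenous regressor
z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # instruments
v = rng.normal(size=N)
x = np.column_stack([np.ones(N), z[:, 1] + 0.5 * v + rng.normal(size=N)])
err = 0.5 * v + rng.normal(size=N)                           # error correlated with x
y = np.exp(1.0 + 0.5 * x[:, 1]) + err

def resid(beta):
    return y - np.exp(x @ beta)                              # additive-error nonlinear model

Pz = z @ np.linalg.solve(z.T @ z, z.T)                       # projection onto the instruments

def q_nl2sls(beta):
    """NL2SLS objective: u(beta)' Z (Z'Z)^{-1} Z' u(beta)."""
    u = resid(beta)
    return u @ Pz @ u

res = optimize.minimize(q_nl2sls, x0=np.array([0.5, 0.1]), method="Nelder-Mead")
print("NL2SLS estimate:", res.x.round(3))
```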
• 262. C H A P T E R 7 Hypothesis Tests

7.1. Introduction

In this chapter we consider tests of hypotheses, possibly nonlinear in the parameters, using estimators appropriate for nonlinear models.

The distribution of test statistics can be obtained using the same statistical theory as that used for estimators, since test statistics, like estimators, are statistics, that is, functions of the sample. Given appropriate linearization of estimators and hypotheses, the results closely resemble those for testing linear restrictions in the linear regression model. The results rely on asymptotic theory, however, and the exact t- and F-distributed test statistics for the linear model under normality are replaced by test statistics that are asymptotically standard normal distributed (z-tests) or chi-square distributed.

There are two main practical concerns in hypothesis testing. First, tests may have the wrong size, so that in testing at a nominal significance level of, say, 5%, the actual probability of rejecting the null hypothesis may be much more or less than 5%. Such size distortion is almost certain to arise in moderately sized samples, as the underlying asymptotic distribution theory is only an approximation. One remedy is the bootstrap method, introduced in this chapter but sufficiently important and broad to be treated separately in Chapter 11. Second, tests may have low power, so that there is low probability of rejecting the null hypothesis when it should be rejected. This potential weakness of tests is often neglected. Size and power are given more prominence here than in most textbook treatments of testing.

The Wald test, the most widely used testing procedure, is defined in Section 7.2. Section 7.3 additionally presents the likelihood ratio test and the score or Lagrange multiplier test, applicable when estimation is by ML. The various tests are illustrated in Section 7.4. Section 7.5 extends these tests to estimators other than ML, including robust forms of the tests. Sections 7.6, 7.7, and 7.8 present, respectively, test power, Monte Carlo simulation methods, and the bootstrap. Methods for determining model specification and selection, rather than hypothesis tests per se, are given separate treatment in Chapter 8. 223
  • 263. HYPOTHESIS TESTS 7.2. Wald Test The Wald test, due to Wald (1943), is the preeminent hypothesis test in microecono- metrics. It requires estimation of the unrestricted model, that is, the model without imposition of the restrictions of the null hypothesis. The Wald test is widely used be- cause modern software usually permits estimation of the unrestricted model even if it is more complicated than the restricted model, and modern software increasingly provides robust variance matrix estimates that permit Wald tests under relatively weak distributional assumptions. The usual statistics for tests of statistical significance of regressors reported by computer packages are examples of Wald test statistics. This section presents the Wald test of nonlinear hypotheses in considerable detail, presenting both theory and examples. The closely related delta method, used to form confidence intervals or regions for nonlinear functions of parameters, is also presented. A weakness of the Wald test – its lack of invariance to algebraically equivalent param- eterizations of the null hypothesis – is detailed at the end of the section. 7.2.1. Linear Hypotheses in Linear Models We first review standard linear model results, as the Wald test is a generalization of the usual test for linear restrictions in the linear regression model. The null and alternative hypotheses for a two-sided test of linear restrictions on the regression parameters in the linear regression model y = X β + u are H0 : Rβ0 − r = 0, Ha : Rβ0 − r = 0, (7.1) where in the notation used here there are h restrictions, R is an h × K matrix of con- stants of full rank h, β is the K × 1 parameter vector, r is an h × 1 vector of constants, and h ≤ K. For example, a joint test that β1 = 1 and β2 − β3 = 2 when K = 4 can be expressed as (7.1) with R = 1 0 0 0 0 1 −1 0 , r = 1 2 . The Wald test of Rβ0 − r = 0 is a test of closeness to zero of the sample analogue R β − r, where β is the unrestricted OLS estimator. Under the strong assumption that u ∼ N[0, σ2 0 I], the estimator β ∼ N β0, σ2 0 (X X)−1 and so R β − r ∼ N 0, σ2 0 R(X X)−1 R , under H0, where Rβ0 − r = 0 has led to simplification to a mean of 0. Taking the quadratic form leads to the test statistic W1 = (R β − r) σ2 0 R(X X)−1 R −1 (R β − r), which is exactly χ2 (h) distributed under H0. In practice the test statistic W1 cannot be calculated, however, as σ2 0 is not known. 224
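A minimal numpy sketch of the feasible version of this test (σ0² replaced by the usual estimate s², as formalized in the next paragraph of the text), on simulated data and using the example restrictions β1 = 1 and β2 − β3 = 2 given above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N, K = 200, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
beta0 = np.array([1.0, 2.5, 0.5, -1.0])              # satisfies H0 below
y = X @ beta0 + rng.normal(size=N)

b = np.linalg.solve(X.T @ X, X.T @ y)                 # unrestricted OLS
s2 = np.sum((y - X @ b) ** 2) / (N - K)

# H0: beta_1 = 1 and beta_2 - beta_3 = 2, i.e. R beta - r = 0 as in the text's example
R = np.array([[1., 0., 0., 0.],
              [0., 1., -1., 0.]])
r_vec = np.array([1., 2.])
h = R.shape[0]

diff = R @ b - r_vec
V = s2 * R @ np.linalg.inv(X.T @ X) @ R.T             # estimated variance of R b - r
W2 = diff @ np.linalg.solve(V, diff)                  # asymptotically chi-square(h) under H0
F = W2 / h                                            # exact F(h, N - K) under normality
print(f"W2 = {W2:.3f}, p = {stats.chi2.sf(W2, h):.3f}")
print(f"F  = {F:.3f}, p = {stats.f.sf(F, h, N - K):.3f}")
```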
  • 264. 7.2. WALD TEST In large samples replacing σ2 0 by its estimate s2 does not affect the limit distribution of W1, since this is equivalent to premultiplication of W1 by σ2 0 /s2 and plim(σ2 0 /s2 ) = 1 (see the Transformation Theorem A.12). Thus W2 = (R β − r) s2 R(X X)−1 R −1 (R β − r) (7.2) converges to the χ2 (h) distribution under H0. The test statistic W2 is chi-square distributed only asymptotically. In this linear example with normal errors an alternative exact small-sample result can be obtained. A standard result derived in many introductory texts is that W3 = W2/h is exactly F(h, N − K) distributed under H0, if s2 = (N − K)−1 i u2 i , where ui is the OLS residual. This is the familiar F−test statistic, which is often reexpressed in terms of sums of squared residuals. Exact results such as that for W3 are not possible in nonlinear models, and even in linear models they require very strong assumptions. Instead, the nonlinear analogue of W2 is employed, with distributional results that are asymptotic only. 7.2.2. Nonlinear Hypotheses We consider hypothesis tests of h restrictions, possibly nonlinear in parameters, on the q × 1 parameter vector θ, where h ≤ q. For linear regression θ = β and q = K. The null and alternative hypotheses for a two-sided test are H0 : h(θ0) = 0, Ha : h(θ0) = 0, (7.3) where h(·) is a h × 1 vector function of θ. Note that h(θ) in this chapter is used to denote the restrictions of the null hypothesis. This should not be confused with the use of h(w, θ) in the previous chapter to denote the moment conditions used to form an MM or GMM estimator. Familiar linear examples include tests of statistical significance of a single coeffi- cient, h(θ) = θj = 0, and tests of subsets of coefficients, h(θ) = θ2 = 0. A nonlinear example of a single restriction is h(θ) = θ1/θ2 − 1 = 0. These examples are studied in later sections. It is assumed that h(θ) is such that the h × q matrix R(θ) = ∂h(θ) ∂θ (7.4) is of full rank h when evaluated at θ = θ0. This assumption is equivalent to linear inde- pendence of restrictions in the linear model, in which case R(θ) = R does not depend on θ and has rank h. It is also assumed that the parameters are not at the boundary of the parameter space under the null hypothesis. This rules out, for example, testing H0 : θ1 = 0 if the model requires θ1 ≥ 0. 225
  • 265. HYPOTHESIS TESTS 7.2.3. Wald Test Statistic The intuition behind the Wald test is very simple. The obvious test of whether h(θ0) = 0 is to obtain estimate θ without imposing the restrictions and see whether h( θ) 0. If h( θ) a ∼ N[0,V[h( θ)]] under H0 then the test statistic W = h( θ) [V[h( θ)]] −1 h( θ) a ∼ χ2 (h). The only complication is finding V[h( θ)], which will depend on the restrictions h(·) and the estimator θ. By a first-order Taylor series expansion (see section 7.2.4) under the null hypoth- esis, h( θ) has the same limit distribution as R(θ0)( θ − θ0 ), where R(θ) is defined in (7.4). Then h( θ) is asymptotically normal under H0 with mean zero and variance ma- trix R(θ0)V[ θ]R(θ0) . A consistent estimate is RN−1 C R , where R = R( θ) and it is assumed that the estimator θ is root-N consistent with √ N( θ − θ0 ) d → N[0, C0], (7.5) and C is any consistent estimate of C0. Common Versions of the Wald Test The preceding discussion leads to the Wald test statistic W = N h [ R C R ]−1 h, (7.6) where h = h( θ) and R = ∂h(θ)/∂θ θ . An equivalent expression is W = h [ R V [ θ] R ]−1 h, where V[ θ] = N−1 C is the estimated asymptotic variance of θ. The test statistic W is asymptotically χ2 (h ) distributed under H0. So H0 is rejected against Ha at significance level α if W χ2 α(h) and is not rejected otherwise. Equiv- alently, H0 is rejected at level α if the p-value, which equals Pr[χ2 (h) W], is less than α. One can also implement the Wald test statistic as an F−test. The Wald asymptotic F-statistic F = W/h (7.7) is asymptotically F(h, N − q) distributed. This yields the same p-value as W in (7.6) as N → ∞ though in finite samples the p-values will differ. For nonlinear models it is most common to report W, though F is also used in the hope that it might provide a better approximation in small samples. For a test of just one restriction, the square root of the Wald chi-square test is a standard normal test statistic. This result is useful as it permits testing a one-sided hypothesis. Specifically, for scalar h(θ) the Wald z-test statistic is Wz = h rN−1 C r , (7.8) where h = h( θ) and r = ∂h(θ)/∂θ θ is a 1 × k vector. Result (7.6) implies that Wz is asymptotically standard normal distributed under H0. Equivalently, Wz is 226
  • 266. 7.2. WALD TEST asymptotically t distributed with (N − q) degrees of freedom, since the t goes to the normal as N → ∞. So Wz can also be a Wald t-test statistic. Discussion The Wald test statistic (7.6) for the nonlinear case has the same form as the linear model statistic W2 given in (7.2). The estimated deviation from the null hypothesis is h( θ) rather than (R β − r). The matrix R is replaced by the estimated derivative matrix R, and the assumption that R is of full rank is replaced by the assumption that R0 is of full rank. Finally, the estimated asymptotic variance of the estimator is N−1 C rather than s2 (X X)−1 . There is a range of possible consistent estimates of C0 (see Section 5.5.2), lead- ing in practice to different computed values of W or F or Wz that are asymptotically equivalent. In particular, C0 is often of the sandwich form A−1 0 B0A−1 0 , consistently es- timated by a robust estimate A−1 B A−1 . An advantage of the Wald test is that it is easy to robustify to ensure valid statistical inference under relatively weak distributional assumptions, such as potentially heteroskedastic errors. Rejection of H0 is more likely the larger is W or F or, for two-sided tests, Wz. This happens the further h( θ) is from the null hypothesis value 0; the more efficient the estimator θ, since then C is small; and the larger the sample size since then N−1 is small. The last result is a consequence of testing at unchanged significance level α as sample size increases. In principle one could decrease α as the sample size is increased. Such penalties for fully parametric models are presented in Section 8.5.1. 7.2.4. Derivation of the Wald Statistic By an exact first-order Taylor series expansion around θ0 h( θ) = h(θ0) + ∂h ∂θ θ+ ( θ − θ0), for some θ+ between θ and θ0. It follows that √ N(h( θ) − h(θ0)) = R(θ+ ) √ N( θ − θ0), where R(θ) is defined in (7.4), which implies that √ N(h( θ) − h(θ0)) d → N 0, R0C0R0 (7.9) by direct application of the limit normal product rule (Theorem A.7) as R(θ+ ) p → R0 = R(θ0) and using the limit distribution for √ N( θ − θ0) given in (7.5). Under the null hypothesis (7.9) simplifies since h(θ0) = 0, and hence √ Nh( θ) d → N 0, R0C0R0 (7.10) under H0. One could in theory use this multivariate normal distribution to define a rejection region, but it is much simpler to transform to a chi-square distribution. Re- call that z ∼ N[0, Ω] with Ω of full rank implies z Ω−1 z ∼ χ2 (dim(Ω)). Then (7.10) 227
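Since (7.6) requires only the unrestricted estimate, its estimated asymptotic variance, and the Jacobian of the restrictions, it is straightforward to code once and reuse. A sketch using a numerical Jacobian follows; the estimates and variance matrix fed in at the end are placeholders rather than output from any particular model.

```python
import numpy as np
from scipy import stats

def numerical_jacobian(h, theta, eps=1e-6):
    """R(theta) = dh/dtheta' evaluated numerically, one column per parameter."""
    h0 = np.atleast_1d(h(theta))
    R = np.zeros((h0.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta); step[j] = eps
        R[:, j] = (np.atleast_1d(h(theta + step)) - h0) / eps
    return R

def wald_test(h, theta_hat, V_theta_hat):
    """Wald statistic h' [R V R']^{-1} h for H0: h(theta) = 0, as in (7.6).

    V_theta_hat is the estimated asymptotic variance of theta_hat, i.e. N^{-1} C_hat."""
    hv = np.atleast_1d(h(theta_hat))
    R = numerical_jacobian(h, theta_hat)
    W = hv @ np.linalg.solve(R @ V_theta_hat @ R.T, hv)
    p = stats.chi2.sf(W, hv.size)
    return W, p

# Illustrative use: test H0: theta1/theta2 - 1 = 0 given estimates and their variance
theta_hat = np.array([1.15, 1.00, 0.30])
V_hat = np.diag([0.04, 0.03, 0.05])            # placeholder variance estimate
W, p = wald_test(lambda t: np.array([t[0] / t[1] - 1.0]), theta_hat, V_hat)
print(f"W = {W:.3f}, p-value = {p:.3f}")
```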
  • 267. HYPOTHESIS TESTS implies that Nh( θ) [R0C0R0 ]−1 h( θ) d → χ2 (h ), under H0, where the matrix inverse in this expression exists by the assumptions that R0 and C0 are of full rank. The Wald statistic defined in (7.6) is obtained upon replacing R0 and C0 by consistent estimates. 7.2.5. Wald Test Examples The most common tests are tests of one or more exclusion restrictions. We also provide an example of test of a nonlinear hypothesis. Tests of Exclusion Restrictions Consider the exclusion restrictions that the last h components of θ are equal to zero. Then h(θ) = θ2 = 0 where we partition θ = (θ 1, θ 2) . It follows that R(θ) = ∂h(θ) ∂θ = ∂θ2 ∂θ 1 ∂θ2 ∂θ 2 = [0 Ih] , where 0 is a (q − h) × q matrix of zeros and Ih is an h × h identity matrix, so R(θ)C(θ)R(θ) = [0 Ih] C11 C12 C21 C22 0 Ih = C22. The Wald test statistic for exclusion restrictions is therefore W = θ2 [N−1 C22]−1 θ2, (7.11) where N−1 C22 = V[ θ2], and is asymptotically distributed as χ2 (h ) under H0. This test statistic is a generalization of the test of subsets of regressors in the linear regression model. In that case small-sample results are available if errors are normally distributed and the related F-test is instead used. Tests of Statistical Significance Tests of significance of a single coefficient are tests of whether or not θj , the jth component of θ, differs from zero. Then h(θ) = θj and r(θ) = ∂h/∂θ is a vector of zeros except for a jth entry of 1, so (7.8) simplifies to Wz = θ j se[ θ j ] , (7.12) where se[ θ j ] = N−1 cj j is the standard error of θ j and cj j is the jth diagonal entry in C. The test statistic Wz in (7.12) is often called a “t-statistic”, owing to results for the linear regression model under normality, but strictly speaking it is an asymptotic “z-statistic.” 228
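A small sketch of (7.11) and (7.12): given the estimated subvector θ̂2 under test and the corresponding block of its estimated variance matrix (placeholder numbers below), the joint Wald statistic and the coefficient-level z-statistics are computed as follows.

```python
import numpy as np
from scipy import stats

# Placeholder estimates of the last h = 2 coefficients and of V[theta2_hat]
theta2_hat = np.array([0.40, -0.15])
V22 = np.array([[0.040, 0.010],
                [0.010, 0.025]])          # the N^{-1} C22 block of the variance estimate

# Joint Wald test of H0: theta2 = 0, equation (7.11)
W = theta2_hat @ np.linalg.solve(V22, theta2_hat)
print("W =", round(W, 3), " p =", round(stats.chi2.sf(W, theta2_hat.size), 3))

# Coefficient-by-coefficient z-statistics, equation (7.12)
se = np.sqrt(np.diag(V22))
z = theta2_hat / se
p_two_sided = 2 * stats.norm.sf(np.abs(z))
print("z =", z.round(3), " two-sided p =", p_two_sided.round(3))
```

Note that the joint test can reject even when only one of the individual z-statistics is significant, which previews the discussion of joint versus separate tests below.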
  • 268. 7.2. WALD TEST For a two-sided test of H0 : θj0 = 0 against Ha : θj0 = 0, H0 is rejected at signifi- cance level α if |Wz| zα/2 and is not rejected otherwise. This yields exactly the same results as the Wald chi-square test, since W2 z = W, where W is defined in (7.6), and z2 α/2 = χ2 α(1). Often there is prior information about the sign of θj . Then one should use a one- sided hypothesis test. For example, suppose it is felt based on economic reasoning or past studies that θj 0. It makes a difference whether θj 0 is specified to be the null or the alternative hypothesis. For one-sided tests it is customary to specify the claim made as the alternative hypothesis, as it can be shown that then stronger evidence is required to support the claim. Here H0 : θj0 ≤ 0 is rejected against Ha : θj0 0 at significance level α if Wz zα. Similarly, for a claim that θj 0, test H0 : θj0 ≥ 0 against Ha : θj0 0 and reject H0 at significance level α if Wz −zα. Computer output usually gives the p-value for a two-sided test, but in many cases it is more appropriate to use a one-sided test. If θ j has the “correct” sign then the p-value for the one-sided test is half that reported for a two-sided test. Tests of Nonlinear Restriction Consider a test of the single nonlinear restriction H0 : h(θ) = θ1/θ2 − 1 = 0. Then R(θ) is a 1 × q vector with first element ∂h/∂θ1 = 1/θ2, second element ∂h/∂θ2 = −θ1/θ2 2 , and remaining elements zero. By letting cjk denote the jkth el- ement of C, (7.6) becomes W = N θ1 θ2 − 1 2    1 θ2 − θ1 θ 2 2 0 '    c11 c12 · · · c21 c22 · · · . . . . . . ...       1/ θ2 − θ1/ θ 2 2 0       −1 , where 0 is a (q − 2) × q matrix of zeros, yielding W = N[ θ2( θ1 − θ2)]2 ( θ 2 2 c11 − 2 θ1 θ2 c12 + θ 2 1 c22)−1 , (7.13) which is asymptotically χ2 (1) distributed under H0. Equivalently, √ W is asymptoti- cally standard normal distributed. 7.2.6. Tests in Misspecified Models Most treatments of hypothesis testing, including that given in Chapters 7 and 8 of this book, assume that the null hypothesis model is correctly specified, aside from relatively minor misspecification that does not affect estimator consistency but requires robustification of standard errors. In practice this is a considerable oversimplification. For example, in testing for het- eroskedastic errors it is assumed that this is the only respect in which the regression is deficient. However, if the conditional mean is misspecified then the true size of the test will differ from the nominal size, even asymptotically. Moreover, asymptotic 229
  • 269. HYPOTHESIS TESTS equivalence of tests, such as that for the Wald, likelihood ratio, and Lagrange mul- tiplier tests, will no longer hold. The better specified the model, however, the more useful are the tests. Also, note that tests often have some power against hypotheses other than the ex- plicitly stated alternative hypothesis. For example, suppose the null hypothesis model is y = β1 + β2x + u, where u is homoskedastic. A test of whether to also include z as a regressor will also have some power against the alternative that the model is nonlin- ear in x, for example y = β1 + β2x + β3x2 + u, if x and z are correlated. Similarly, a test against heteroskedastic errors will also have some power against nonlinearity in x. Rejection of the null hypothesis does not mean that the alternative hypothesis model is the only possible model. 7.2.7. Joint Versus Separate Tests In applied work one often wants to know which coefficients out of a set of coefficients are “significant.” When there are several hypotheses under test, one can either do a joint test or simultaneous test of all hypotheses of interest or perform separate tests of the hypotheses. A leading example in linear regression concerns the use of separate t-tests for test- ing the null hypotheses H10 : β1 = 0 and H20 : β2 = 0 versus using an F-test of the joint hypothesis H0 : β1 = β2 = 0, where throughout the alternative is that at least one of the parameters does not equal zero. The F-test is an explicit joint test, with rejection of H0 if the estimated point ( β1, β2) falls outside an elliptical probability contour. Alternatively, the two separate t-tests can be conducted. This procedure is an implicit joint test, called an induced test (Savin, 1984). The separate tests reject H0 if either H10 or H20 is rejected, which occurs if ( β1, β2) falls outside a rectangle whose boundaries are the critical values of the two test statistics. Even if the same signifi- cance level is used to test H0, so that the ellipse and rectangles have the same area, the rejection regions for the joint and separate tests differ and there is a potential for a conflict between them. For example, ( β1, β2) may lie within the ellipse but outside the rectangle. Let e1 and e2 denote the event of type I error (see Section 7.5.1) in the two separate tests, and let eI = e1 ∪ e2 denote the event of a type I error in the induced joint test. Then Pr[eI] = Pr[e1] + Pr[e2] − Pr[eI ∩ e2], which implies that αI ≤ α1 + α2, (7.14) where αI, α1, and α2 denote the sizes of, respectively, the induced joint test, the first separate test, and the second separate test. In the special case where the separate tests are statistically independent, Pr[eI ∩ e2] = Pr[e1] Pr[e2] = α1α2 and hence αI = α1 + α2 − α1α2. For a typically low value of α1 and α2, such as .05 or .01, α1α2 is very small and the upper bound (7.14) is a good indicator of the size of the test. A substantial literature on induced tests examines the problem of choosing critical values for the separate tests such that the induced test has a known size. We do not pur- sue this issue at length but mention the Bonferroni t-test as an example. The critical values of this test have been tabulated; see Savin (1984). 230
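A quick numerical check of these size calculations (the separate-test sizes below are placeholders):

```python
# Size of the induced joint test formed from two separate tests of sizes a1 and a2
a1 = a2 = 0.05
upper_bound = a1 + a2                      # Bonferroni-type bound (7.14), always valid
independent = a1 + a2 - a1 * a2            # exact size when the separate tests are independent
print(f"upper bound: {upper_bound:.4f}; size under independence: {independent:.4f}")

# Reversing the logic: separate tests of size a/2 keep the induced size at or below a
a = 0.05
print(f"separate tests at {a/2:.3f} bound the induced test size by {a:.3f}")
```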
  • 270. 7.2. WALD TEST Statistically independent tests arise in linear regression with orthogonal regressors and in likelihood-based testing (see Section 7.3) if relevant parts of the information matrix are diagonal. Then the induced joint test statistic is based on the two statistically independent separate test statistics, whereas the explicit joint null test statistic is the sum of the two separate test statistics. The joint null may be rejected because either one component or both components of the null are rejected. The use of separate tests will reveal which situation applies. In the more general case of correlated regressors or a nondiagonal information ma- trix, the explicit joint test suffers from the disadvantage that the rejection of the null does not indicate the source of the rejection. If the induced joint test is used then set- ting the size of the test requires some variant of the Bonferroni test or approximation using the upper bound in (7.14). Similar issues also arise when separate tests are ap- plied sequentially, with each stage conditioned on the outcome of the previous stage. Section 18.7.1 presents an example with discussion of a joint test of two hypotheses where the two components of the test are correlated. 7.2.8. Delta Method for Confidence Intervals The method used to derive the Wald test statistic is called the delta method, as Taylor series approximation of h( θ) entails taking the derivative of h(θ). This method can also be used to obtain the distribution of a nonlinear combination of parameters and hence form confidence intervals or regions. One example is estimating the ratio θ1/θ2 by θ1/ θ2. A second example is prediction of the conditional mean g(x β), say, using g(x β). A third example is the estimated elasticity with respect to change in one component of x. Confidence Intervals Consider inference on the parameter vector γ = h(θ) that is estimated by γ = h( θ), (7.15) where the limit distribution of √ N( θ − θ0 ) is that given in (7.5). Then direct ap- plication of (7.9) yields √ N( γ − γ0) d → N 0, R0C0R0 , where R(θ) is defined in (7.4). Equivalently, we say that γ is asymptotically normally distributed with estimated asymptotic variance matrix V[ γ] = RN−1 C R , (7.16) a result that can be used to form confidence intervals or regions. In particular, a 100(1 − α)% confidence interval for the scalar parameter γ is γ ∈ γ ± zα/2se[ γ ], (7.17) where se[ γ ] = rN−1 C r, (7.18) where r = r( θ) and r(θ) = ∂γ/∂θ = ∂h(θ)/∂θ . 231
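A sketch of (7.16)–(7.18) for a scalar function of the parameters, using a numerical gradient; the ratio γ = θ1/θ2 mentioned above is taken as the example, and the estimates and variance matrix are placeholders.

```python
import numpy as np
from scipy import stats

def delta_method_ci(g, theta_hat, V_theta_hat, level=0.95, eps=1e-6):
    """Delta-method confidence interval (7.17) for gamma = g(theta), scalar g."""
    grad = np.array([(g(theta_hat + eps * e) - g(theta_hat)) / eps
                     for e in np.eye(theta_hat.size)])
    se = np.sqrt(grad @ V_theta_hat @ grad)               # standard error as in (7.18)
    z = stats.norm.ppf(0.5 + level / 2)
    gamma_hat = g(theta_hat)
    return gamma_hat, (gamma_hat - z * se, gamma_hat + z * se)

# Illustrative use: a confidence interval for the ratio gamma = theta1 / theta2
theta_hat = np.array([1.2, 0.8])
V_hat = np.array([[0.05, 0.01],
                  [0.01, 0.04]])                          # placeholder variance of theta_hat
g_hat, ci = delta_method_ci(lambda t: t[0] / t[1], theta_hat, V_hat)
print(f"gamma_hat = {g_hat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```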
  • 271. HYPOTHESIS TESTS Confidence Interval Examples As an example, suppose that E[y|x] = exp (x β) and we wish to obtain a confidence interval for the predicted conditional mean when x = xp. Then h(β) = exp (x pβ), so ∂h/∂β = exp (x pβ)xp and (7.18) yields se[exp (x p β)] = exp (x p β) % x p N−1 Cxp, where C is a consistent estimate of the variance matrix in the limit distribution of √ N( β − β0 ). As a second example, suppose we wish to obtain a confidence interval for eβ rather than for β, a scalar coefficient. Then h(β) = eβ , so ∂h/∂β = eβ and (7.18) yields se[e β ] = e β se[ β]. This yields a 95% confidence interval for eβ of e β ± 1.96e β se[ β]. The delta method is not always the best method to obtain a confidence interval, because it restricts the confidence interval to being symmetric about γ . Moreover, in the preceding example the confidence interval can include negative values even though eβ 0. An alternative confidence interval is obtained by exponentiation of the terms in the confidence interval for β. Then Pr β − 1.96se[ β] β β + 1.96se[ β] = 0.95 ⇒ Pr exp( β − 1.96se[ β]) eβ exp( β + 1.96se[ β]) = 0.95. This confidence interval has the advantage of being asymmetric and including only positive values. This transformation is often used for confidence intervals for slope parameters in binary outcome models and in duration models. The approach can be generalized to other transformations γ = h(θ), provided h(·) is monotonic. 7.2.9. Lack of Invariance of the Wald Test The Wald test statistic is easily obtained, provided estimates of the unrestricted model can be obtained, and is no less powerful than other possible test procedures, as dis- cussed in later sections. For these reasons it is the most commonly used test procedure. However, the Wald test has a fundamental problem: It is not invariant to alge- braically equivalent parameterizations of the null hypothesis. For example, consider the example of Section 7.2.5. Then H0 : θ1/θ2 − 1 = 0 can equivalently be expressed as H0 : θ1 − θ2 = 0, leading to Wald chi-square test statistic W∗ = N( θ1 − θ2)2 ( c11 − 2 c12 + c22)−1 , (7.19) which differs from W in (7.13). The statistics W and W∗ can differ substantially in finite samples, even though asymptotically they are equivalent. The small-sample dif- ference can be quite substantial, as demonstrated in a Monte Carlo exercise by Gregory and Veall (1985), who considered a very similar example. For tests with nominal size 0.05, one variant of the Wald test had actual size between 0.04 and 0.06 across all sim- ulations, so asymptotic theory provided a good small-sample approximation, whereas an alternative asymptotically equivalent variant of the Wald test had actual size that in some simulations exceeded 0.20. 232
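The lack of invariance is easy to see numerically: computing (7.13) and (7.19) from the same estimates generally gives different values in finite samples. The estimates and variance entries below are placeholders.

```python
import numpy as np
from scipy import stats

N = 50
theta1, theta2 = 1.3, 1.0                      # placeholder unrestricted estimates
c11, c22, c12 = 0.9, 0.7, 0.2                  # placeholder entries of C_hat (V = C_hat / N)

# W based on H0: theta1/theta2 - 1 = 0, equation (7.13)
W_ratio = (N * (theta2 * (theta1 - theta2)) ** 2
           / (theta2**2 * c11 - 2 * theta1 * theta2 * c12 + theta1**2 * c22))

# W* based on the algebraically equivalent H0: theta1 - theta2 = 0, equation (7.19)
W_diff = N * (theta1 - theta2) ** 2 / (c11 - 2 * c12 + c22)

for name, W in [("ratio form W", W_ratio), ("difference form W*", W_diff)]:
    print(f"{name}: {W:.3f}, p = {stats.chi2.sf(W, 1):.3f}")
```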
  • 272. 7.3. LIKELIHOOD-BASED TESTS Phillips and Park (1988) explained the differences by showing that, although differ- ent representations of the null hypothesis restrictions have the same chi-square distri- bution using conventional asymptotic methods, they have different asymptotic distri- butions using a more refined asymptotic theory based on Edgeworth expansions (see Section 11.4.3). Furthermore, in particular settings such as the previous example, the Edgeworth expansions can be used to indicate parameterizations of H0 and regions of the parameter space where the usual asymptotic theory is likely to provide a poor small-sample approximation. The lesson is that care is needed when nonlinear restrictions are being tested. As a robustness check one can perform several Wald tests using different algebraically equivalent representations of the null hypothesis restrictions. If these lead to substan- tially different conclusions there may be a problem. One solution is to perform a boot- strap version of the Wald test. This can provide better small-sample performance and eliminate much of the difference between Wald tests that use different representations of H0, because from Section 11.4.4 the bootstrap essentially implements an Edgeworth expansion. A second solution is to use other testing methods, given in the next section, that are invariant to different representations of H0. 7.3. Likelihood-Based Tests In this section we consider hypothesis testing when the likelihood function is known, that is, the distribution is fully specified. There are then three classical statistical tech- niques for testing hypotheses – the Wald test, the likelihood ratio (LR) test, and the Lagrange multiplier (LM) test. A fourth test, the C(α) test, due to Neyman (1959), is less commonly used and is not presented here; see Davidson and MacKinnon (1993). All four tests are asymptotically equivalent, so one chooses among them based on ease of computation and on finite-sample performance. We also do not cover the smooth test of Neyman (1937), which Bera and Ghosh (2002) argue is optimal and is as fun- damental as the other tests. These results assume correct specification of the likelihood function. Extension to tests based on quasi-ML estimators, as well as on m-estimators and efficient GMM estimators, is given in Section 7.5. 7.3.1. Wald, Likelihood Ratio, and Lagrange Multiplier (Score) Tests Let L(θ) denote the likelihood function, the joint conditional density of y given X and parameters θ. We wish to test the null hypothesis given in (7.3) that h(θ0) = 0. Tests other than the Wald test require estimation that imposes the restrictions of the null hypothesis. Define the estimators θu (unrestricted MLE), θr (restricted MLE). (7.20) The unrestricted MLE θu maximizes ln L(θ); it was more simply denoted θ in ear- lier discussion of the Wald test. The restricted MLE θr maximizes the Lagrangian 233
  • 273. HYPOTHESIS TESTS ln L(θ) − λ h(θ), where λ is an h × 1 vector of Lagrangian multipliers. In the simple case of exclusion restrictions h(θ) = θ2 = 0, where θ = (θ 1, θ 2) , the restricted MLE is θr = ( θ 1r , 0 ), where θ 1r is obtained simply as the maximum with respect to θ1 of the restricted likelihood ln L(θ1, 0) and 0 is a (q − h) × 1 vector of zeros. We motivate and define the three test statistics here, with derivation deferred to Section 7.3.3. All three test statistics converge in distribution to χ2 (h ) under H0. So H0 is rejected at significance level α if the computed test statistic exceeds χ2 α(h ). Equivalently, reject H0 at level α if p ≤ α, where p = Pr χ2 (h ) t is the p-value and t is the computed value of the test statistic. Likelihood Ratio Test The motivation for the LR test statistic is that if H0 is true, the unconstrained and constrained maxima of the log-likelihood function should be the same. This suggests using a function of the difference between ln L( θu) and ln L( θr ). Implementation requires obtaining the limit distribution of this difference. It can be shown that twice the difference is asymptotically chi-square distributed under H0. This leads immediately to the likelihood ratio test statistic LR = −2 ln L( θr ) − ln L( θu) . (7.21) Wald Test The motivation for the Wald test is that if H0 is true, the unrestricted MLE θu should satisfy the restrictions of H0, so h( θu) should be close to zero. Implementation requires obtaining the asymptotic distribution of h( θu). The general form of the Wald test is given in (7.6). Specialization occurs for the MLE because by the IM equality V[ θu] = −N−1 A0 −1 , where A0 = plim N−1 ∂2 ln L ∂θ∂θ θ0 . (7.22) This leads to the Wald test statistic W = −N h R A−1 R −1 h, (7.23) where h = h( θu), R = R( θu), R(θ) = ∂h(θ)/∂θ , and A is a consistent estimate of A0. The minus sign appears since A0 is negative definite. Lagrange Multiplier Test or Score Test One motivation for the LM test statistic is that the gradient ∂ ln L/∂θ| θu = 0 at the maximum of the likelihood function. If H0 is true, then this maximum should also occur at the restricted MLE (i.e., ∂ ln L/∂θ| θr 0) because imposing the constraint will have little impact on the estimated value of θ. Using this motivation LM is called the score test because ∂ ln L/∂θ is the score vector. An alternative motivation is to measure the closeness to zero of the Lagrange mul- tipliers of the constrained optimization problem for the restricted MLE. Maximizing 234
  • 274. 7.3. LIKELIHOOD-BASED TESTS ln L(θ) − λ h(θ) with respect to θ implies that ∂ ln L ∂θ θr = ∂h(θ) ∂θ θr × λr . (7.24) It follows that tests based on the estimated Lagrange multipliers λr are equivalent to tests based on the score ∂ ln L/∂θ| θr , since ∂h/∂θ is assumed to be of full rank. Implementation requires obtaining the asymptotic distribution of ∂ ln L/∂θ| θr . This leads to the Lagrange multiplier test or score test statistic LM = −N−1 ∂ ln L ∂θ θr A−1 ∂ ln L ∂θ θr , (7.25) where A is a consistent estimate of A0 in (7.22) evaluated at θr rather than θu. The LM test, due to Aitchison and Silvey (1958) and Silvey (1959), is equivalent to the score test, due to Rao (1947). The test statistic LM is usually derived by obtaining an analytical expression for the score rather than the Lagrange multipliers. Econome- tricians usually call the test an LM test, even though a clearer terminology is to call it a score test. Discussion Good intuition is provided by the expository graphical treatment of the three tests by Buse (1982) that views all three tests as measuring the change in the log-likelihood. Here we provide a verbal summary. Consider scalar parameter and a Wald test of whether θ0 − θ∗ = 0. Then a given departure of θu from θ∗ will translate into a larger change in ln L, the more curved is the log-likelihood function. A natural measure of curvature is the second derivative H(θ) = ∂2 ln L/∂θ2 . This suggests W= −( θu − θ∗ )2 H( θu). The statistic W in (7.23) can be viewed as a generalization to vector θ and more general restrictions h(θ0) with N A measuring the curvature. For the score test Buse shows that a given value of ∂ ln L/∂θ| θr translates into a larger change in ln L, the less curved is the log-likelihood function. This leads to use of (N A)−1 in (7.25). And the statistic LR directly compares the log-likelihoods. An Illustration To illustrate the three tests consider an iid example with yi ∼ N[µ0, 1] and test of H0 : µ0 = µ∗ . Then µu = ȳ and µr = µ∗ . For the LR test, ln L(µ) = − N 2 ln 2π − 1 2 i (yi − µ)2 and some algebra yields LR = 2[ln L(ȳ) − ln L(µ∗ )] = N(ȳ − µ∗ )2 . The Wald test is based on whether ȳ − µ∗ 0. Here it is easy to show that ȳ − µ∗ ∼ N[0, 1/N] under H0, leading to the quadratic form W = (ȳ − µ∗ )[1/N]−1 (ȳ − µ∗ ). This simplifies to N(ȳ − µ∗ )2 and so here W = LR. 235
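The exact equality LR = W = LM in this example is easy to verify numerically on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
N, mu_star = 100, 0.0
y = rng.normal(loc=0.1, scale=1.0, size=N)        # iid N[mu0, 1] data (illustrative)
ybar = y.mean()

def loglik(mu):
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((y - mu) ** 2)

LR = 2 * (loglik(ybar) - loglik(mu_star))          # likelihood ratio statistic
W = N * (ybar - mu_star) ** 2                      # (ybar - mu*) [1/N]^{-1} (ybar - mu*)
score = np.sum(y - mu_star)                        # d lnL / d mu evaluated at mu*
LM = score ** 2 / N                                # N^{-1} s [1]^{-1} s with -A_hat = 1, (7.25)
print(f"LR = {LR:.6f}, W = {W:.6f}, LM = {LM:.6f}")   # all equal N*(ybar - mu*)^2
```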
  • 275. HYPOTHESIS TESTS The LM test is based on closeness to zero of ∂ ln L(µ)/∂µ|µ∗ = i (yi − µ)|µ∗ = N(ȳ − µ∗ ). This is just a rescaling of (ȳ − µ∗ ) so LM = W. More formally, A(µ∗ ) = − 1 since ∂2 ln L(µ)/∂µ2 = −N and (7.25) yields LM = N−1 (N(ȳ − µ∗ ))[1]−1 (N(ȳ − µ∗ )). This also simplifies to N(ȳ − µ∗ )2 and verifies that LM = W = LR. Despite their quite different motivations, the three test statistics are equivalent here. This exact equivalence is special to this example with constant curvature owing to a log-likelihood quadratic in µ. More generally the three test statistics differ in finite samples but are equivalent asymptotically (see Section 7.3.4). 7.3.2. Poisson Regression Example Consider testing exclusion restrictions in the Poisson regression model introduced in Section 5.2. This example is mainly pedagogical as in practice one should perform statistical inference for count data under weaker distributional assumptions than those of the Poisson model (see Chapter 20). If y given x is Poisson distributed with conditional mean exp(x β) then the log- likelihood function is ln L(β) = N i=1 ! − exp(x i β) + yi x i β − ln yi ! . (7.26) For h exclusion restrictions the null hypothesis is H0 : h(β) = β2 = 0, where β = (β 1, β 2) . The unrestricted MLE β maximizes (7.26) with respect to β and has first-order conditions i (yi − exp(x i β))xi = 0. The limit variance matrix is −A−1 , where A = − plim N−1 i exp (x i β)xi x i . The restricted MLE is β = ( β 1, 0 ) , where β1 maximizes (7.26) with respect to β1, with x i β replaced by x 1i β1 since β2 = 0. Thus β1 solves the first-order conditions i (yi − exp(x 1i β1))x1i = 0. The LR test statistic (7.21) is easily calculated from the fitted log-likelihoods of the restricted and unrestricted models. The Wald test statistic for exclusion restrictions from Section 7.2.5 is W = −N β2 A22 β2, where A22 is the (2,2) block of A−1 and A = −N−1 i exp (x i β)xi x i . The LM test is based on ∂ ln L(β)/∂β = i xi (yi − exp (x i β)). At the restricted MLE this equals i xi ui , where ui = yi − exp (x 1i β1) is the residual from estimation of the restricted model. The LM test statistic (7.25) is LM = N i=1 xi ui N i=1 exp (x 1i β1)xi x i −1 N i=1 xi ui . (7.27) Some further simplification is possible since i x1i ui = 0 from the first-order condi- tions for the restricted MLE given earlier. The LM test here is based on the correlation between the omitted regressors and the residual, a result that is extended to other ex- amples in Section 7.3.5. 236
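The three statistics are straightforward to compute directly. The following sketch simulates Poisson data, obtains the restricted and unrestricted MLEs by maximizing (7.26), and evaluates (7.21), (7.11), and (7.27) for a test that the last two coefficients are zero; the data-generating process is illustrative.

```python
import numpy as np
from scipy import optimize, stats, special

rng = np.random.default_rng(5)
N = 500
x = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])   # x1 = first 2 cols, x2 = last 2
beta_true = np.array([0.5, 0.6, 0.0, 0.0])                    # H0: last two coefficients zero
y = rng.poisson(np.exp(x @ beta_true))

def negll(beta, X):
    """Negative Poisson log-likelihood (7.26)."""
    xb = X @ beta
    return -np.sum(-np.exp(xb) + y * xb - special.gammaln(y + 1))

# Unrestricted and restricted Poisson MLEs
fit_u = optimize.minimize(negll, np.zeros(4), args=(x,), method="BFGS")
fit_r = optimize.minimize(negll, np.zeros(2), args=(x[:, :2],), method="BFGS")
b_u, b_r = fit_u.x, np.concatenate([fit_r.x, [0.0, 0.0]])

# LR test, equation (7.21)
LR = 2 * (fit_r.fun - fit_u.fun)

# Wald test, equation (7.11): variance estimate is the inverse of sum_i exp(x_i'b) x_i x_i'
A = x.T @ (np.exp(x @ b_u)[:, None] * x)           # minus the Hessian of lnL at b_u
V = np.linalg.inv(A)                               # estimated V[b_u] via the IM equality
W = b_u[2:] @ np.linalg.solve(V[2:, 2:], b_u[2:])  # uses the (2,2) block of the inverse

# LM (score) test, equation (7.27), evaluated at the restricted MLE
u_r = y - np.exp(x @ b_r)                          # residuals from the restricted model
score = x.T @ u_r
A_r = x.T @ (np.exp(x @ b_r)[:, None] * x)
LM = score @ np.linalg.solve(A_r, score)

for name, stat in [("LR", LR), ("Wald", W), ("LM", LM)]:
    print(f"{name} = {stat:.3f}, p = {stats.chi2.sf(stat, 2):.3f}")
```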
  • 276. 7.3. LIKELIHOOD-BASED TESTS In general it can be difficult to obtain an algebraic expression for the LM test. For standard applications of the LM test this has been done and is incorporated into com- puter packages. Computation by auxiliary regression may also be possible (see Sec- tion 3.5). 7.3.3. Derivation of Tests The distribution of the Wald test was formally derived in Section 7.2.4. Proofs for the likelihood ratio and Lagrange multiplier tests are more complicated and we merely sketch them here. Likelihood Ratio Test For simplicity consider the special case where the null hypothesis is θ = θ, so that there is no estimation error in θr = θ. Taking a second-order Taylor series expansion of ln L(θ) about ln L( θu) yields ln L(θ) = ln L( θu) + ∂ ln L ∂θ θu (θ − θu) + 1 2 (θ − θu) ∂2 ln L ∂θ∂θ θu (θ − θu) + R, where R is a remainder term. Since ∂ ln L/∂θ| θu = 0 by the first-order conditions, this implies upon rearrangement that −2 ln L(θ) − ln L( θu) = −(θ − θu) ∂2 ln L ∂θ∂θ θu (θ − θu) + R. (7.28) The right-hand side of (7.28) is χ2 (h) under H0 : θ = θ since by standard results √ N( θu − θ) d → N 0, −[plim N−1 ∂2 ln L/∂θ∂θ ]−1 . For derivation of the limit dis- tribution of LR in the general case see, for example, Amemiya (1985, p. 143). A reason for preferring LR is that by the Neyman–Pearson (1933) lemma the uni- formly most powerful test for testing a simple null hypothesis versus simple alternative hypothesis is a function of the likelihood ratio L( θr )/L( θu), though not necessarily the specific function −2 ln(L( θr )/L( θu)) that equals LR given in (7.21) and gives the test statistic its name. LM or Score Test By a first-order Taylor series expansion 1 √ N ∂ ln L ∂θ θr = 1 √ N ∂ ln L ∂θ θ0 + 1 N ∂2 ln L ∂θ∂θ √ N( θr − θ0), and both terms in the right-hand side contribute to the limit distribution. Then the χ2 (h) distribution of LM defined in (7.25) follows since it can be shown that R0A−1 0 1 √ N ∂ ln L ∂θ θr d → N 0, R0A−1 0 B0A−1 0 R 0 , (7.29) 237
  • 277. HYPOTHESIS TESTS where details are provided in Wooldridge (2002, p. 365), for example, and R0 and A0 are defined in (7.4) and (7.22) and B0 = plim N−1 ∂ ln L ∂θ ∂ ln L ∂θ θ0 . (7.30) Result (7.29) leads to a chi-square statistic that is much more complicated than (7.25), but simplification to (7.25) then occurs by the information matrix equality. 7.3.4. Which Test? Choice of test procedure is usually made based on existence of robust versions, finite- sample performance, and ease of computation. Asymptotic Equivalence All three test statistics are asymptotically distributed as χ2 (h) under H0. Further- more, all three can be shown to be noncentral χ2 (h; λ) distributed with the same noncentrality parameter under local alternatives. Details are provided for the Wald test in Section 7.6.3. So the tests all have the same asymptotic power against local alternatives. The finite-sample distributions of the three statistics differ. In the linear regression model with normality, a variant of the Wald test statistic for h linear restrictions on θ exactly equals the F(h, N − K) statistic (see Section 7.2.1) whereas no analytical results exist for the LR and LM statistics. More generally, in nonlinear models exact small-sample results do not exist. In some cases an ordering of the values taken by the three test statistics can be obtained. In particular for tests of linear restrictions in the linear regression model under normality, Berndt and Savin (1977) showed that Wald ≥ LR ≥ LM. This result is of little theoretical consequence, as the test least likely to reject under the null will have the smallest actual size but also the smallest power. However, it is of practical consequence for the linear model, as it means when testing at fixed nominal size α that the Wald test will always reject H0 more often than the LR, which in turn will reject more often than the LM test. The Wald test would be preferred by a researcher determined to reject H0. This result is restricted to linear models. Invariance to Reparameterization The Wald test is not invariant to algebraically equivalent parameterizations of the null hypothesis (see Section 7.2.9) whereas the LR test is invariant. Some but not all ver- sions of the LM test are invariant. The LM test is generally invariant if the expected Hessian (see Section 5.5.2) is used to estimate A0 and not invariant if the Hessian is used. The test LM∗ defined later in (7.34) is invariant. The lack of invariance for the Wald test is a major weakness. 238
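The lack of invariance is easy to see by computing delta-method Wald statistics for two algebraically equivalent ways of writing the same restriction, here θ1 − θ2 = 0 versus θ1/θ2 − 1 = 0. The sketch below is purely illustrative Python/numpy; the estimates and variance matrix are made-up numbers, not taken from the text.

    import numpy as np

    theta = np.array([1.2, 1.0])                  # hypothetical estimates
    V = np.array([[0.09, 0.02], [0.02, 0.16]])    # hypothetical estimated variance of theta-hat

    def wald(h, R):
        # W = h' [R V R']^{-1} h for a single restriction h(theta) = 0 with gradient R
        return float(h ** 2 / (R @ V @ R))

    # Restriction written as theta1 - theta2 = 0
    W_diff = wald(theta[0] - theta[1], np.array([1.0, -1.0]))

    # Same restriction written as theta1/theta2 - 1 = 0
    W_ratio = wald(theta[0] / theta[1] - 1.0,
                   np.array([1.0 / theta[1], -theta[0] / theta[1] ** 2]))

    print(W_diff, W_ratio)    # the two Wald statistics differ; LR and LM would not

The LR statistic, and LM statistics based on the expected Hessian, would be identical for the two parameterizations, which is the point made in the text.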
  • 278. 7.3. LIKELIHOOD-BASED TESTS Robust Versions In some cases with misspecified density the quasi-MLE (see Section 5.7) remains con- sistent. The Wald test is then easily robustified (see Section 7.2). The LM test can be robustified with more difficulty; see (7.38) in Section 7.5.1 for a general result for m- estimators and Section 8.4 for some robust LM test examples. The LR test is no longer chi-square distributed, except in a special case given later in (7.39). Instead, the LR test is a mixture of chi-squares (see Section 8.5.3). Convenience Convenience in computation is also a consideration. LR requires estimation of the model twice, once with and once without the restrictions of the null hypothesis. If done by a package, it is easily implemented as one need only read off the printed log- likelihood routinely printed out, subtract, and multiply by 2. Wald requires estimation only under Ha and is best to use when the unrestricted model is easy to estimate. For example, this is the case for restrictions on the parameters of the conditional mean in nonlinear models such as NLS, probit, Tobit, and logit. The LM statistic requires estimation only under H0 and is best to use when the restricted model is easy to esti- mate. Examples are tests for autocorrelation and heteroskedasticity, where it is easiest to estimate the null hypothesis model that does not have these complications. The Wald test is often used for tests of statistical significance whereas the LM test is often used for tests of correct model specification. 7.3.5. Interpretation and Computation of the LM test Lagrange multiplier tests have the additional advantages of simple interpretation in some leading examples and computation by auxiliary regression. In this section attention is restricted to the usual cross-section data case of a scalar dependent variable independent over i, so that ∂ ln L(θ)/∂θ = i si (θ), where si (θ) = ∂ ln f (yi |xi , θ) ∂θ (7.31) is the contribution of the ith observation to the score vector of the unrestricted model. From (7.25) the LM test is a test of the closeness to zero of i si ( θr ). Simple Interpretation of the LM Test Suppose that the density is such that s(θ) factorizes as s(θ)= g(x, θ)r(y, x, θ) (7.32) for some q × 1 vector function g(·) and scalar function r(y, x, θ), the latter of which may be interpreted as a generalized residual because y appears in r(·) but not g(·). For example, for Poisson regression ∂ ln f /∂θ = x(y − exp(x β)). 239
  • 279. HYPOTHESIS TESTS Given (7.32) and independence over i, ∂ ln L/∂θ| θr = i gi ri , where gi = g(xi , θr ) and ri = r(yi , xi , θr ). The LM test can therefore be simply interpreted as a score test of the correlation between gi and the residual ri . This interpretation was given in Section 7.3.2 for the LM test with Poisson regression, where gi = xi and ri = yi − exp(x 1i β1). The partition (7.32) will arise whenever f (y) is based on a one-parameter den- sity. In particular, many common likelihood models are based on one-parameter LEF densities, with parameter µ then modeled as a function of x and β. In the LEF case r(y, x, θ) = (y − E[y|x]) (see Section 5.7.3), so the generalized residual r(·) in (7.32) is then the usual residual. More generally a partition similar to (7.32) will also arise when f (y) is based on a two-parameter density, the information matrix is block diagonal in the two parameters, and the two parameters in turn depend on regressors and parameter vectors β and α that are distinct. Then LM tests on β are tests of correlation of gβi and rβi , where s(β) = gβ(x, θ)rβ(y, x, θ), with similar interpretation for LM tests on α. A leading example is linear regression under normality with two parameters µ and σ2 modeled as µ = x β and σ2 = α or σ2 = σ2 (z, α). For exclusion restrictions in lin- ear regression under normality, si (β) = xi (yi − x i β) and the LM test is one of correla- tion between regressors xi and the restricted model residual ui = yi − x 1i β1. For tests of heteroskedasticity with σ2 i = exp(α1 + z i α2), si (α) =1 2 zi ((yi − x i β)2 /σ2 i ) − 1), and the LM test is one of correlation between zi and the squared residual u2 i = (yi − x i β)2 , since σ2 i is constant under the null hypothesis that α2 = 0. Outer Product of the Gradient Versions of the LM Test Now return to the general si (θ) defined in (7.31). We show in the following that an asymptotically equivalent version of the LM test statistic (7.25) can be obtained by running the auxiliary regression or artificial regression 1 = s i γ + vi , (7.33) where si = si ( θr ), and computing LM∗ = N R2 u, (7.34) where R2 u is the uncentered R2 defined after (7.36). LM∗ is asymptotically χ2 (h) under H0. Equivalently, LM∗ equals ESSu, the uncentered explained sum of squares (the sum of squares of the fitted values), or equals N− RSS, where RSS is the residual sum of squares, from regression (7.33). This result can be easy to implement as in many applications it can be quite simple to analytically obtain si (θ), generate data for the q components s1i , . . . , sqi , and regress 1 on s1i , . . . , sqi . Note that here f (yi |xi , θ) in (7.31) is the density of the unrestricted model. For the exclusion restrictions in the Poisson model example in Section 7.3.2, si (β) = (yi − exp (x i β))xi and x i βr = x 1i β1r . It follows that LM∗ can be computed 240
  • 280. 7.4. EXAMPLE: LIKELIHOOD-BASED HYPOTHESIS TESTS as N R2 u from regressing 1 on (yi − exp (x 1i β1r ))xi , where xi contains both x1i and x2i , and β1r is obtained from Poisson regression of yi on x1i alone. Equations (7.33) and (7.34) require only independence over i. Other auxiliary re- gressions are possible if further structure is assumed. In particular, specialize to cases where s(θ) factorizes as in (7.32), and define r(y, x, θ) so that V[r(y, x, θ)] = 1. Then an alternative asymptotically equivalent version of the LM test is N R2 u from regression of ri on gi . This includes LM tests for linear regression under normality, such as the Breusch–Pagan LM test for heteroskedasticity. These alternative versions of the LM test are called outer-product-of-the-gradient versions of the LM test, as they replace −A0 in (7.22) by an outer-product-of-the- gradient (OPG) estimate or BHHH estimate of B0. Although they are easily computed, OPG variants of LM tests can have poor small-sample properties with large size distor- tions. This has discouraged use of the OPG form of the LM test. These small-sample problems can be greatly reduced by bootstrapping (see Section 11.6.3). Davidson and MacKinnon (1984) propose double-length auxiliary regressions that also perform bet- ter in finite samples. Derivation of the OPG Version To derive LM∗ , first note that in (7.25), ∂ ln L(θ )/∂θ| θr = si . Second, by the information matrix equality A0 = −B0 and, from Section 5.5.2, B0 can be consis- tently estimated under H0 by the OPG estimate or BHHH estimate N−1 si s i . Com- bining, these results gives an asymptotically equivalent version of the LM test sta- tistic (7.25): LM∗ = N i=1 s i N i=1 si s i −1 N i=1 si . (7.35) This statistic can be computed from an auxiliary regression of 1 on si as follows. Define S to be the N × q matrix with ith row s i , and define l to be the N × 1 vector of ones. Then LM∗ = l S[S S]−1 S l = ESSu = N R2 u. (7.36) In general for regression of y on X the uncentered explained sums of squares (ESSu) is y X (X X)−1 X y, which is exactly of the form (7.36), whereas the uncentered R2 is R2 u = y X (X X)−1 X y/y y, which here is (7.36) divided by l l = N. The term uncen- tered is used because in R2 u division is by the sum of squared deviations of y around zero rather than around the sample mean. 7.4. Example: Likelihood-Based Hypothesis Tests The various test procedures – Wald, LR, and LM – are illustrated using generated data from the dgp y|x Poisson distributed with mean exp(β1 + β2x2 + β3x3 + β4x4), where β1 = 0 and β2 = β3 = β4 = 0.1 and the three regressors are iid draws from N[0, 1]. 241
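Before turning to the results, the sketch below shows how such a sample can be generated and how the OPG statistic LM* = N R²_u of (7.34) might be computed for the single exclusion restriction H0: β3 = 0. It is a minimal Python (numpy/statsmodels) illustration; a different seed or sample will of course not reproduce the exact numbers reported in Table 7.1.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(123)                   # arbitrary seed for illustration
    N = 200
    x = rng.normal(size=(N, 3))                        # x2, x3, x4 ~ iid N[0,1]
    X = sm.add_constant(x)                             # columns: 1, x2, x3, x4
    mu = np.exp(0.0 + 0.1 * x[:, 0] + 0.1 * x[:, 1] + 0.1 * x[:, 2])
    y = rng.poisson(mu)

    # Restricted MLE under H0: beta3 = 0 (drop the x3 column)
    X_r = X[:, [0, 1, 3]]
    beta_r = sm.Poisson(y, X_r).fit(disp=0).params
    u_r = y - np.exp(X_r @ beta_r)                     # restricted residuals

    # OPG / auxiliary regression form: scores s_i = u_i x_i, regress 1 on s_i,
    # LM* = uncentered explained sum of squares = N * (uncentered R-squared)
    S = X * u_r[:, None]
    ones = np.ones(N)
    LM_star = ones @ S @ np.linalg.solve(S.T @ S, S.T @ ones)
    print(LM_star)                                     # compare to chi2(1) critical value 3.84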
Table 7.1. Test Statistics for Poisson Regression Example^a

                                     Test Statistic                              Result
Null Hypothesis            Wald      LR       LM       LM*      ln L            at level 0.05
H10: β3 = 0                5.904     5.754    5.916    6.218    −241.648        Reject
                          (0.015)   (0.016)  (0.015)  (0.013)
H20: β3 = 0, β4 = 0        8.570     8.302    8.575    9.186    −242.922        Reject
                          (0.014)   (0.016)  (0.014)  (0.010)
H30: β3 − β4 = 0           0.293     0.293    0.293    0.315    −238.918        Do not reject
                          (0.588)   (0.589)  (0.588)  (0.575)
H40: β3/β4 − 1 = 0         0.158     0.293    0.293    0.315    −238.918        Do not reject
                          (0.691)   (0.589)  (0.588)  (0.575)

^a The dgp for y is the Poisson distribution with parameter exp(0.0 + 0.1x2 + 0.1x3 + 0.1x4) and sample size N = 200. Test statistics are given with associated p-values in parentheses. Tests of the second hypothesis are χ²(2) and the other tests are χ²(1) distributed. Log-likelihoods for restricted ML estimation are also given; the log-likelihood in the unrestricted model is −238.772.

Poisson regression of y on an intercept, x2, x3, and x4 for a generated sample of size 200 yielded unrestricted MLE

Ê[y|x] = exp(−0.165 − 0.028 x2 + 0.163 x3 + 0.103 x4),
          (−2.14)   (−0.36)    (2.43)     (0.08)

where associated t-statistics are given in parentheses below the coefficient estimates and the unrestricted log-likelihood is −238.772.

The analysis tests four different hypotheses, detailed in the first column of Table 7.1. The estimator is nonlinear, whereas the hypotheses are examples of, respectively, a single exclusion restriction, multiple exclusion restrictions, linear restrictions, and nonlinear restrictions. The remainder of the table gives four asymptotically equivalent test statistics of these hypotheses and their associated p-values. For this sample all tests reject the first two hypotheses and do not reject the remaining two, at significance level 0.05.

The Wald test statistic is computed using (7.23). This requires estimation of the unrestricted model, given previously, to obtain the variance matrix estimate of the unrestricted MLE. Wald tests of different hypotheses then require computation of different h and R and simplify in some cases. The Wald chi-square test of the single exclusion restriction is just the square of the usual t-test, with 2.43² ≈ 5.90. The Wald test statistic of the joint exclusion restrictions is detailed in Section 7.2.5. Here x3 is statistically significant and x4 is statistically insignificant, whereas jointly x3 and x4 are statistically significant at level 0.05. The Wald test for the third hypothesis is given in (7.19) and leads to nonrejection. The third and fourth hypotheses are equivalent, since β3/β4 − 1 = 0 implies β3 = β4, but the Wald test statistic for the fourth hypothesis, given in (7.13), differs from (7.19). The statistic (7.13) was calculated using matrix operations, as most packages will at best calculate Wald tests of linear hypotheses.

The LR test statistic is especially easy to compute, using (7.21), given estimation of the restricted model. For the first three hypotheses the restricted model is
  • 282. 7.5. TESTS IN NON-ML SETTINGS estimated by Poisson regression of y on, respectively, regressors (1, x2, x4), (1, x2), and (1, x2, x3 + x4), where the third regression uses β3x3 + β4x4 = β3(x3 + x4) if β3 = β4. As an example of the LR test, for the second hypothesis LR = −2[−238.772 − (−242.922)] = 8.30. The fourth restricted model in theory requires ML estimation subject to nonlinear constraints on the parameters, which few packages do. However, constrained ML estimation is invariant to the way the restrictions are expressed, so here the same estimates are obtained as for the third restricted model, leading to the same LR test statistic. The LM test statistic is computed using (7.25), which for the Poisson model spe- cializes to (7.27). This statistic is computed using matrix commands, with different restrictions leading to the different restricted MLE estimates β. As for the LR test, the LM test is invariant to transformations, so the LM tests of the third and fourth hypotheses are equivalent. An asymptotically equivalent version of the LM test statistic is the statistic LM∗ given in (7.35). This can be computed as the explained sum of squares from the auxiliary regression (7.33). For the Poisson model sji = ∂ ln f (yi )/∂βj = (yi − exp(x i β))xji , with evaluation at the appropriate restricted MLE for the hypothe- sis under consideration. The statistic LM∗ is simpler to compute than LM, though like LM it requires restricted ML estimates. In this example with generated data the various test statistics are very similar. This is not always the case. In particular, the test statistic LM∗ can have poorer finite-sample size properties than LM, even if the dgp is known. Also, in applications with real data the dgp is unlikely to be perfectly specified, leading to divergence of the various test statistics even in infinitely large samples. 7.5. Tests in Non-ML Settings The Wald test is the standard test to use in non-ML settings. From Section 7.2 it is a general testing procedure that can always be implemented, using an appropriate sand- wich estimator of the variance matrix of the parameter estimates. The only limitation is that in some applications unrestricted estimation may be much more difficult to perform than restricted estimation. The LM or score test, based on departures from zero of the gradient vector of the unrestricted model evaluated at the restricted estimates, can also be generalized to non-ML estimators. The form of the LM test, however, is usually considerably more complicated than in the ML case. Moreover, the simplest forms of the LM test statistic based on auxiliary regressions are usually not robust to distributional misspecification. The LR test is based on the difference between the maximized values of the objec- tive function with and without restrictions imposed. This usually does not generalize to objective functions other than the likelihood function, as this difference is usually not chi–square distributed. For completeness we provide a condensed presentation of extension of the ML tests to m-estimators and to efficient GMM estimators. As already noted, in most applica- tions use of the simpler Wald test is sufficient. 243
7.5.1. Tests Based on m-Estimators

Tests for m-estimators are straightforward extensions of those for ML estimators, except that it is no longer possible to use the information matrix equality to simplify the test statistics, and the LR test generalizes in only very special cases. The resulting test statistics are asymptotically χ²(h) distributed under H0: h(θ) = 0 and have the same noncentral chi-square distribution under local alternatives.

Consider m-estimators that maximize Q_N(θ) = N^{-1} Σ_i q_i(θ) with first-order conditions N^{-1} Σ_i s_i(θ) = 0. Define the q × q matrices A(θ) = N^{-1} Σ_i ∂s_i(θ)/∂θ' and B(θ) = N^{-1} Σ_i s_i(θ)s_i(θ)' and the h × q matrix R(θ) = ∂h(θ)/∂θ'. Let θ̂_u and θ̃_r denote the unrestricted and restricted estimators, respectively, and let Â = A(θ̂_u) and Ã = A(θ̃_r), with similar notation for B and R. Finally, let ĥ = h(θ̂_u) and s̃_i = s_i(θ̃_r).

The Wald test statistic is based on closeness of ĥ to zero. Here

W = \hat{h}' \big[ \hat{R}\, N^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1} \hat{R}' \big]^{-1} \hat{h} ,   (7.37)

since from Section 5.5.1 the robust variance matrix estimate for θ̂_u is N^{-1}Â^{-1}B̂Â^{-1}. Packages with the option of robust standard errors use this more general form to compute Wald tests of statistical significance.

Let g(θ) = ∂Q_N(θ)/∂θ denote the gradient vector, and let g̃ = g(θ̃_r) = N^{-1} Σ_i s̃_i. The LM test statistic is based on the closeness of g̃ to 0 and is given by

\mathrm{LM} = N \tilde{g}' \tilde{A}^{-1}\tilde{R}' \big[ \tilde{R}\tilde{A}^{-1}\tilde{B}\tilde{A}^{-1}\tilde{R}' \big]^{-1} \tilde{R}\tilde{A}^{-1} \tilde{g} ,   (7.38)

a result obtained by forming a chi-square test statistic based on (7.29), where N g̃ replaces ∂ln L/∂θ evaluated at θ̃_r. This test is clearly not as simple to implement as a robust Wald test. Some examples of computation of the robust form of LM tests are given in Section 8.4. The standard implementations of LM tests in computer packages are often not robust versions of the LM test.

The LR test does not generalize easily. It does generalize to m-estimators if B_0 = −αA_0 for some scalar α, a weaker version of the IM equality. In such special cases the quasi-likelihood ratio (QLR) test statistic is

\mathrm{QLR} = -2N \big[ Q_N(\tilde{\theta}_r) - Q_N(\hat{\theta}_u) \big] / \hat{\alpha}_u ,   (7.39)

where α̂_u is a consistent estimate of α obtained from unrestricted estimation (see Wooldridge, 2002, p. 370). The condition B_0 = −αA_0 holds for generalized linear models (see Section 5.7.4). Then the statistic QLR is equivalent to the difference of deviances for the restricted and unrestricted models, a generalization of the F-test based on the difference between restricted and unrestricted sums of squared residuals for OLS and NLS estimation with homoskedastic errors. For general quasi-ML estimation, with B_0 ≠ −αA_0, the LR test statistic can be distributed as a weighted sum of chi-squares (see Section 8.5.3).
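The robust Wald statistic (7.37) is simple to compute once the sandwich variance estimate is formed. The sketch below illustrates it for a Poisson quasi-MLE with overdispersed (negative binomial) data and a single exclusion restriction; it is an illustrative Python/numpy example on invented data under stated assumptions, not a general-purpose routine.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    N = 400
    X = sm.add_constant(rng.normal(size=(N, 2)))         # columns: 1, x2, x3
    # Overdispersed counts: the Poisson mean is correct but the Poisson variance is not
    y = rng.negative_binomial(2, 2 / (2 + np.exp(0.5 * X[:, 1])))

    beta = sm.Poisson(y, X).fit(disp=0).params           # quasi-MLE, still consistent for the mean
    mu = np.exp(X @ beta)
    s_i = X * (y - mu)[:, None]                          # scores s_i = x_i (y_i - mu_i)

    A = -(X * mu[:, None]).T @ X / N                     # A-hat = N^-1 sum_i d s_i / d beta'
    B = s_i.T @ s_i / N                                  # B-hat = N^-1 sum_i s_i s_i'
    V = np.linalg.inv(A) @ B @ np.linalg.inv(A) / N      # sandwich variance N^-1 A^-1 B A^-1

    R = np.array([[0.0, 0.0, 1.0]])                      # H0: coefficient on x3 equals 0
    h = R @ beta
    W = float(h @ np.linalg.solve(R @ V @ R.T, h))       # robust Wald statistic (7.37)
    print(W)                                             # compare to chi2(1) critical value 3.84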
7.5.2. Tests Based on Efficient GMM Estimators

For GMM the various test statistics are simplest for efficient GMM, meaning GMM estimation using the optimal weighting matrix. This poses no great practical restriction as the optimal weighting matrix can always be estimated, as detailed in Section 6.3.5.

Consider GMM estimation based on the moment condition E[m_i(θ)] = 0. (Note the change in notation from Chapter 6: h(θ) is being used in the current chapter to denote the restrictions under H0.) Using the notation introduced in Section 6.3.5, the efficient unrestricted GMM estimator θ̂_u minimizes

Q_N(\theta) = g_N(\theta)' \hat{S}_N^{-1} g_N(\theta) ,

where g_N(θ) = N^{-1} Σ_i m_i(θ) and Ŝ_N is consistent for S_0 = V[g_N(θ)]. The restricted GMM estimator θ̃_r is assumed to minimize Q_N(θ) with the same weighting matrix Ŝ_N^{-1}, subject to the restriction h(θ) = 0. The three following test statistics, summarized by Newey and West (1987a), are asymptotically χ²(h) distributed under H0: h(θ) = 0 and have the same noncentral chi-square distribution under local alternatives.

The Wald test statistic as usual is based on closeness of ĥ to zero. This yields

W = \hat{h}' \big[ \hat{R}\, N^{-1} (\hat{G}'\hat{S}^{-1}\hat{G})^{-1} \hat{R}' \big]^{-1} \hat{h} ,   (7.40)

since the variance of the efficient GMM estimator is N^{-1}(Ĝ'Ŝ^{-1}Ĝ)^{-1} from Section 6.3.5, where G_N(θ) = ∂g_N(θ)/∂θ' and the caret denotes evaluation at θ̂_u.

The first-order conditions of efficient GMM are Ĝ'Ŝ^{-1}ĝ = 0. The LM statistic tests whether this gradient vector is close to zero when instead evaluated at θ̃_r, leading to

\mathrm{LM} = N \tilde{g}' \tilde{S}^{-1}\tilde{G} \big(\tilde{G}'\tilde{S}^{-1}\tilde{G}\big)^{-1} \tilde{G}'\tilde{S}^{-1} \tilde{g} ,   (7.41)

where the tilde denotes evaluation at θ̃_r and we use the Section 6.3.3 assumption that √N g_N(θ_0) →_d N[0, S_0], so that √N G̃'S̃^{-1}g̃ →_d N[0, plim G̃'S̃^{-1}G̃].

For the efficient GMM estimator the difference in maximized values of the objective function can also be compared, leading to the difference test statistic

D = N \big[ Q_N(\tilde{\theta}_r) - Q_N(\hat{\theta}_u) \big] .   (7.42)

Like W and LM, the statistic D is asymptotically χ²(h) distributed under H0: h(θ) = 0.

Even in the likelihood case, this last statistic differs from the LR statistic because it uses a different objective function. The MLE minimizes Q_N(θ) = −N^{-1} Σ_i ln f(y_i|θ). From Section 6.3.7, the asymptotically equivalent efficient GMM estimator instead minimizes the quadratic form Q_N(θ) = [N^{-1} Σ_i s_i(θ)]' Ŝ_N^{-1} [N^{-1} Σ_i s_i(θ)], where s_i(θ) = ∂ln f(y_i|θ)/∂θ. The statistic D can be used in general, provided the GMM estimator used is the efficient GMM estimator, whereas the LR test can be generalized only for the special cases of m-estimators mentioned after (7.39).

For MM estimators, that is, in the just-identified GMM model, D = LM = N Q_N(θ̃_r), so the LM and difference tests are equivalent. For D this simplification occurs because g_N(θ̂_u) = 0 and so Q_N(θ̂_u) = 0. For LM the simplification occurs in (7.41) because G̃_N is then invertible.
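A compact way to see the difference statistic (7.42) at work is a linear model estimated by GMM with instruments z_i and moment m_i(θ) = z_i(y_i − x_i'θ). The Python/numpy sketch below uses invented data and holds the weighting matrix fixed across the restricted and unrestricted fits, as the text requires.

    import numpy as np

    rng = np.random.default_rng(42)
    N = 500
    x1, x2 = rng.normal(size=N), rng.normal(size=N)
    y = 1.0 + 0.5 * x1 + rng.normal(size=N)             # dgp: the coefficient on x2 is zero
    X = np.column_stack([np.ones(N), x1, x2])
    Z = np.column_stack([np.ones(N), x1, x2, x1 ** 2])  # 4 moment conditions, 3 parameters

    def gmm_est(Xmat, Winv):
        # closed-form linear GMM: minimizes g(b)' Winv g(b), g(b) = Z'(y - Xmat b)/N
        A = Xmat.T @ Z @ Winv @ Z.T @ Xmat
        return np.linalg.solve(A, Xmat.T @ Z @ Winv @ Z.T @ y)

    def Q(b, Winv):
        g = Z.T @ (y - X @ b) / N
        return g @ Winv @ g

    # Step 1: preliminary estimate, then the estimated optimal weighting matrix S^-1
    b1 = gmm_est(X, np.linalg.inv(Z.T @ Z / N))
    e1 = y - X @ b1
    S = (Z * (e1 ** 2)[:, None]).T @ Z / N
    Sinv = np.linalg.inv(S)

    b_u = gmm_est(X, Sinv)                              # unrestricted efficient GMM
    b_r = np.append(gmm_est(X[:, :2], Sinv), 0.0)       # restricted: coefficient on x2 set to 0

    D = N * (Q(b_r, Sinv) - Q(b_u, Sinv))               # difference statistic (7.42)
    print(D)                                            # chi2(1) distributed under H0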
  • 285. HYPOTHESIS TESTS 7.6. Power and Size of Tests The remaining sections of this chapter study two limitations in using the usual com- puter output to test hypotheses. First, a test can have little ability to discriminate between the null and alternative hypotheses. Then the test has low power, meaning there is a low probability of rejecting the null hypothesis when it is false. Standard computer output does not calculate test power, but it can be evaluated using asymptotic methods (see this section) or finite- sample Monte Carlo methods (see Section 7.7). If a major contribution of an empirical paper is the rejection or nonrejection of a particular hypothesis, there is no reason for the paper not to additionally present the power of the test against some meaningful alternative hypothesis. Second, the true size of the test may differ substantially from the nominal size of the test obtained from asymptotic theory. The rule of thumb that sample size N 30 is sufficient for asymptotic theory to provide a good approximation for inference on a single variable does not extend to models with regressors. Poor approximation is most likely in the tails of the approximating distribution, but the tails are used to obtain critical values of tests at common significance levels such as 5%. In practice the critical value for a test statistic obtained from large-sample approximation is often smaller than the correct critical value based on the unknown true distribution. Small-sample refinements are attempts to get closer to the exact critical value. For linear regression under normality exact critical values can be obtained, using the t rather than z and the F rather than χ2 distribution, but similar results are not exact for nonlinear regression. Instead, small-sample refinements may be obtained through Monte Carlo methods (see Section 7.7) or by use of the bootstrap (see Section 7.8 and Chapter 11). With modern computers it is relatively easy to correct the size and investigate the power of tests used in an applied study. We present this neglected topic in some detail. 7.6.1. Test Size and Power Hypothesis tests lead to either rejection or nonrejection of the null hypothesis. Correct decisions are made if H0 is rejected when H0 is false or if H0 is not rejected when H0 is true. There are also two possible incorrect decisions: (1) rejecting H0 when H0 is true, called a type I error, and (2) nonrejection of H0 when H0 is false, called a type II error. Ideally the probabilities of both errors will be low, but in practice decreasing the probability of one type of error comes at the expense of increasing the probability of the other. The classical hypothesis testing solution is to fix the probability of a type I error at a particular level, usually 0.05, while leaving the probability of a type II error unspecified. Define the size of a test or significance level α = Pr type I error = Pr reject H0|H0 true , (7.43) 246
  • 286. 7.6. POWER AND SIZE OF TESTS with common choices of α being 0.01, 0.05, or 0.10. A hypothesis is rejected if the test statistic falls into a rejection region defined so that the test significance level equals the specified value of α. A closely related equivalent method computes the p-value of a test, the marginal significance level at which the null hypothesis is just rejected, and rejects H0 if the p-value is less than the specified value of α. Both methods require only knowledge of the distribution of the test statistic under the null hypothesis, presented in Section 7.2 for the Wald test statistic. Consideration should also be given to the probability of a type II error. The power of a test is defined to be Power = Pr reject H0|Ha true = 1 − Pr accept H0|Ha true = 1 − Pr Type II error . (7.44) Ideally, test power is close to one since then the probability of a type II error is close to zero. Determining the power requires knowledge of the distribution of the test statistic under Ha. Analysis of test power is typically ignored in empirical work, except that test proce- dures are usually chosen to be ones that are known theoretically to have power that, for given level α, is high relative to other alternative test statistics. Ideally, the uniformly most powerful (UMP) test is used. This is the test that has the greatest power, for given level α, for all alternative hypotheses. UMP tests do exist when testing a simple null hypothesis against a simple alternative hypothesis. Then the Neyman–Pearson lemma gives the result that the UMP test is a function of the likelihood ratio. For more gen- eral testing situations involving composite hypotheses there is usually no UMP test, and further restrictions are placed such as UMP one-sided tests. In practice, power considerations are left to theoretical econometricians who use theory and simulations applied to various testing procedures to suggest which testing procedures are the most powerful. It is nonetheless possible to determine test power in any given application. In the following we detail how to compute the asymptotic power of the Wald test, which equals that of the LR and LM tests in the fully parametric case. 7.6.2. Local Alternative Hypotheses Since power is the probability of rejecting H0 when Ha is true, the computation of power requires obtaining the distribution of the test statistic under the alterna- tive hypothesis. For a Wald chi-square test at significance level α the power equals Pr[W χ2 α(h)|Ha]. Calculation of this probability requires specification of a particular alternative hypothesis, because Ha : h(θ) = 0 is very broad. The obvious choice is the fixed alternative h(θ) = δ, where δ is an h × 1 finite vector of nonzero constants. The quantity δ is sometimes referred to as the hypoth- esis error, and larger hypothesis errors lead to greater power. For a fixed alternative the Wald test statistic asymptotically has power one as it rejects the null hypothesis all the time. To see this note that if h(θ) = δ then the Wald test statistic becomes 247
infinite, since

W = \hat{h}' \big( \hat{R}\, N^{-1} \hat{C} \hat{R}' \big)^{-1} \hat{h} \xrightarrow{p} \delta' \big( R_0\, N^{-1} C_0 R_0' \big)^{-1} \delta ,

using θ̂ →_p θ_0, so ĥ = h(θ̂_u) →_p h(θ_0) = δ, and Ĉ →_p C_0. It follows that W →_p ∞ since all the terms except N are finite and nonzero. This infinite value leads to H0 being always rejected, as it should be, and hence to perfect power of one.

The Wald test statistic is therefore a consistent test statistic, that is, one whose power goes to one as N → ∞. Many test statistics are consistent, just as many estimators are consistent. More stringent criteria are needed to discriminate among the test statistics, just as relative efficiency is used to choose among estimators.

For estimators that are root-N consistent, we consider a sequence of local alternatives

H_a : h(\theta) = \delta / \sqrt{N} ,   (7.45)

where δ is a vector of fixed constants with δ ≠ 0. This sequence of alternative hypotheses, called Pitman drift, gets closer to the null hypothesis value of zero as the sample size gets larger, at the same rate √N as is used to scale up θ̂ to get a nondegenerate distribution for the consistent estimator. The alternative hypothesis value of h(θ) therefore moves toward zero at a rate that negates any improved efficiency with increased sample size. For a much more detailed account of local alternatives and related literatures see McManus (1991).

7.6.3. Asymptotic Power of the Wald Test

Under the sequence of local alternatives (7.45) the Wald test statistic has a nondegenerate distribution, the noncentral chi-square distribution. This permits determination of the power of the Wald test. Specifically, as is shown in Section 7.6.4, under Ha the Wald statistic W defined in (7.6) is asymptotically χ²(h; λ) distributed, where χ²(h; λ) denotes the noncentral chi-square distribution with noncentrality parameter

\lambda = \tfrac{1}{2}\, \delta' \big( R_0 C_0 R_0' \big)^{-1} \delta ,   (7.46)

and R_0 and C_0 are defined in (7.4) and (7.5). The power of the Wald test, the probability of rejecting H0 given that the local alternative Ha is true, is therefore

\mathrm{Power} = \Pr\big[ W > \chi^2_\alpha(h) \,\big|\, W \sim \chi^2(h; \lambda) \big] .   (7.47)

Figure 7.1 plots power against λ for tests of a scalar hypothesis (h = 1) at the commonly used sizes or significance levels of 10%, 5%, and 1%. For λ close to zero the power equals the size, and for large λ the power goes to one. These features hold also for h > 1. In particular, power is monotonically increasing in the noncentrality parameter λ defined in (7.46). Several general results follow.

First, power is increasing in the distance between the null and alternative hypotheses, as then δ and hence λ increase.
[Figure 7.1 (plot omitted): Test power as a function of the noncentrality parameter. Power of Wald chi-square test with one degree of freedom for three different test sizes (0.10, 0.05, and 0.01) as the noncentrality parameter ranges from 0 to 20.]

Second, for a given alternative δ, power increases with the efficiency of the estimator θ̂, as then C_0 is smaller and hence λ is larger.

Third, as the size of the test increases, power increases and the probability of a type II error decreases.

Fourth, if several different test statistics are all χ²(h) under the null hypothesis and noncentral χ²(h) under the alternative, the preferred test statistic is that with the highest noncentrality parameter λ, since then power is the highest. Furthermore, two tests that have the same noncentrality parameter are asymptotically equivalent under local alternatives.

Finally, in actual applications one can calculate the power as a function of δ. Specifically, for a specified alternative δ, an estimated noncentrality parameter λ̂ can be computed using (7.46) with parameter estimate θ̂ and associated estimates R̂ and Ĉ. Such power calculations are illustrated in Section 7.6.5.

7.6.4. Derivation of Asymptotic Power

To obtain the distribution of W under Ha, begin with the Taylor series expansion result (7.9). This simplifies to

\sqrt{N}\, h(\hat{\theta}) \xrightarrow{d} N\big[ \delta,\; R_0 C_0 R_0' \big]   (7.48)

under Ha, since then √N h(θ) = δ. Thus a quadratic form centered at δ would be chi-square distributed under Ha. The Wald test statistic W defined in (7.6) instead forms a quadratic form centered at 0 and is no longer chi-square distributed under Ha.

In general if z ~ N[µ, Ω], where rank(Ω) = h, then z'Ω^{-1}z ~ χ²(h; λ), where χ²(h; λ) denotes the noncentral chi-square distribution with noncentrality parameter λ = ½ µ'Ω^{-1}µ. Applying this result to (7.48) yields

N\, h(\hat{\theta})' \big( R_0 C_0 R_0' \big)^{-1} h(\hat{\theta}) \xrightarrow{d} \chi^2(h; \lambda)   (7.49)

under Ha, where λ is defined in (7.46).
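The power in (7.47) can be evaluated directly from the noncentral chi-square distribution, which is essentially how a power curve like the one in Figure 7.1 can be produced. The sketch below uses Python with scipy, an illustrative choice. One caution: noncentrality conventions differ across sources; SciPy's nc parameter equals µ'Ω^{-1}µ for a quadratic form in N[µ, Ω] variables, whereas λ in (7.46) includes a factor of one-half, so the conversion nc = 2λ used below is an assumption that should be checked against the convention used for the figure.

    import numpy as np
    from scipy.stats import chi2, ncx2

    h = 1                                            # degrees of freedom (scalar hypothesis)
    lam = np.array([0.5, 2.0, 5.0, 10.0, 20.0])      # noncentrality parameters as in (7.46)
    for alpha in (0.10, 0.05, 0.01):
        crit = chi2.ppf(1.0 - alpha, df=h)           # chi-square critical value chi2_alpha(h)
        # assumed conversion: scipy nc = mu' Omega^{-1} mu = 2 * lambda of (7.46)
        power = ncx2.sf(crit, df=h, nc=2.0 * lam)    # Pr[W > crit | noncentral chi-square]
        print(alpha, np.round(power, 3))             # power rises from roughly alpha toward 1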
7.6.5. Calculation of Asymptotic Power

To shed light on how power changes with δ, consider tests of coefficient significance in the scalar case. Then the noncentrality parameter defined in (7.46) is

\lambda = \frac{\delta^2}{2c} \simeq \frac{(\delta/\sqrt{N})^2}{2\,(\mathrm{se}[\hat{\theta}])^2} ,   (7.50)

where the approximation arises because of estimation of c, the limit variance of √N(θ̂ − θ), by N(se[θ̂])², where se[θ̂] is the standard error of θ̂.

Consider a Wald chi-square test of H0: θ = 0 against the alternative hypothesis that θ is within a standard errors of zero, that is, against Ha: θ = a × se[θ̂], where se[θ̂] is treated here as a constant. Then δ/√N in (7.45) equals a × se[θ̂], so that (7.50) simplifies to λ = a²/2. Thus the Wald test is asymptotically χ²(1; λ) distributed under Ha, where λ = a²/2.

From Figure 7.1 it is clear for the common case of significance level tests at 5% that if a = 2 the power is well below 0.5, if a = 4 the power is around 0.5, and if a = 6 the power is still below 0.9. A borderline test of statistical significance can therefore have low power against alternatives that are many standard errors from zero. Intuitively, if θ̂ = 2 se[θ̂] then a test of θ = 0 against θ = 4 se[θ̂] has power of approximately 0.5, because a 95% confidence interval for θ is approximately (0, 4 se[θ̂]), implying that values of θ = 0 or θ = 4 se[θ̂] are just as likely.

As a more concrete example, suppose θ measures the percentage increase in wage resulting from a training program, and that a study finds θ̂ = 6 with se[θ̂] = 4. Then the Wald test at 5% significance level leads to nonrejection of H0, since W = (6/4)² = 2.25 < χ²_{.05}(1) = 3.84. The conclusion of such a study will often state that the training program is not statistically significant. One should not interpret this as meaning that there is a high probability that the training program has no effect, however, as this test has low power. For example, the preceding analysis indicates that a test of H0: θ = 0 against Ha: θ = 16, a relatively large training effect, has power of only 0.5, since 4 × se[θ̂] = 16.

Reasons for low power include small sample size, large model error variance, and small spread in the regressors. In simple cases, solving the inverse problem of estimating the minimum sample size needed to achieve a given desired level of power is possible. This is especially popular in medical studies.

Andrews (1989) gives a more formal treatment of using the noncentrality parameter to determine regions of the parameter space against which a test in an empirical setting is likely to have low power. He provides many applied examples where it is easy to determine that tests have low power against meaningful alternatives.

7.7. Monte Carlo Studies

Our discussion of statistical inference has so far relied on asymptotic results. For small samples analytical results are rarely available, aside from tests of linear restrictions in
the linear regression model under normality. Small-sample results can nonetheless be obtained by performing a Monte Carlo study.

7.7.1. Overview

An example of a Monte Carlo study of the small-sample properties of a test statistic is the following. Set the sample size N to 40, say, and randomly generate 10,000 samples of size 40 under the H0 model. For each replication (sample) form the test statistic of interest and test H0, rejecting H0 if the test statistic falls in the rejection region, usually determined by asymptotic results. The true size or actual size of the test statistic is simply the fraction of replications for which the test statistic falls in the rejection region. Ideally, this is close to the nominal size, which is the chosen significance level of the test. For example, if testing at 5% the nominal test size is 0.05 and the true size is hopefully close to 0.05.

Determining test power in small samples requires additional simulation, with samples generated under one or more particular specifications of the possible models that lie in the composite alternative hypothesis Ha. The power is calculated as the fraction of replications for which the null hypothesis is rejected, using either the same test as used in determining the true size, or a size-corrected version of the test that uses a rejection region such that the nominal size equals the true size.

Monte Carlo studies are simple to implement, but there are many subtleties involved in designing a good Monte Carlo study. For an excellent discussion see Davidson and MacKinnon (1993).

7.7.2. Monte Carlo Details

As an example of a Monte Carlo study we consider statistical inference on the slope coefficient in a probit model. The following analysis does not rely on knowledge of the probit model. The data-generating process is a probit model, with binary dependent variable y equal to one with probability Pr[y = 1|x] = Φ(β1 + β2x), where Φ(·) is the standard normal cdf, x ~ N[0, 1], and (β1, β2) = (0, 1).

The data (y, x) are easily generated for this dgp. The regressor x is first obtained as a random draw from the standard normal distribution. Then, from Section 14.4.2, the dependent variable y is set equal to 1 if x + u > 0 and is set to 0 otherwise, where u is a random draw from the standard normal. For this dgp y = 1 roughly half the time and y = 0 the other half.

In each simulation N new observations of both x and y are drawn, and the MLE from probit regression of y on x is obtained. An alternative is to use the same N draws of the regressor x in each simulation and only redraw y. The former setup corresponds to simple random sampling and the latter corresponds to analysis conditional on x or "fixed in repeated trials"; see Section 4.4.7.

Monte Carlo studies often consider a range of sample sizes. Here we simply set N = 40. Programs can be checked by also setting a very large value of N,
  • 291. HYPOTHESIS TESTS say N = 10,000, as then Monte Carlo results should be very close to asymptotic results. Numerous simulations are needed to determine actual test size, because this de- pends on behavior in the tails of the distribution rather than the center. If S simulations are run for a test of true size α, then the proportion of times the null hypothesis is correctly rejected is an outcome from S binomial trials with mean α and variance α(1 − α)/S. So 95% of Monte Carlos will estimate the test size to be in the inter- val α ± 1.96 √ α(1 − α)/S. A mere 100 simulations is not enough since, for example, this interval is (0.007, 0.093) when α = 0.05. For 10,000 simulations the 95% inter- val is much more precise, equalling (0.008, 0.012), (0.046, 0.054), (0.094, 0.106), and (0.192, 0.208) for α equal to, respectively, 0.01, 0.05, 0.10, and 0.20. Here S = 10,000 simulations are used. A problem that can arise in Monte Carlo simulations is that for some simulation samples the model may not be estimable. For example, consider linear regression on just an intercept and an indicator variable. If the indicator variable happens to always take the same value, say 0, in a simulation sample then its coefficient cannot be sepa- rately identified from that for the intercept. A similar problem arises in the probit and other binary outcome models, if all ys are 0 or all ys are 1 in a simulation sample. The standard procedure, which can be criticized, is to drop such simulation samples, and to write computer code that permits the simulation loop to continue when such a problem arises. In this example the problem did not arise with N = 40, but it did for N = 30. 7.7.3. Small-Sample Bias Before moving to testing we look at the small-sample properties of the MLE β2 and its estimated standard error se[ β2]. Across the 10,000 simulations β2 had mean 1.201 and standard deviation 0.452, whereas se[ β2] had mean 0.359. The MLE is therefore biased upward in small sam- ples, as the average of β2 is considerably greater than β2 = 1. The standard errors are biased downward in small samples since the average of se[ β2] is considerably smaller than the standard deviation of β2. 7.7.4. Test Size We consider a two-sided test of H0 : β2 = 1 against Ha : β2 = 1, using the Wald test z = Wz = β2 − 1 se[ β2] , where se[ β2] is the standard error of the MLE estimated using the variance matrix given in Section 14.3.2, which is minus the inverse of the expected Hessian. Given the dgp, asymptotically z is standard normal distributed and z2 is chi-squared distributed. The goal is to find how well this approximates the small-sample distribution. Figure 7.2 gives the density for the S = 10,000 computed values of z, where the den- sity is plotted using the kernel density estimate of Chapter 9 rather than a histogram. This is superimposed on the standard normal density. Clearly the asymptotic result is not exact, especially in the upper tail where the difference is clearly large enough to 252
lead to size distortions when testing at, say, 5%. Also, across the simulations z has mean 0.114 ≠ 0 and standard deviation 0.956 ≠ 1.

[Figure 7.2 (plot omitted): Density of the Wald test statistic that the slope coefficient equals one, computed by Monte Carlo simulation, with the standard normal density also plotted for comparison. Data are generated from a probit regression model.]

Table 7.2. Wald Test Size and Power for Probit Regression Example^a

Nominal Size (α)    Actual Size    Actual Power    Asymptotic Power
0.01                0.005          0.007           0.272
0.05                0.029          0.226           0.504
0.10                0.081          0.608           0.628
0.20                0.192          0.858           0.755

^a The dgp for y is the probit with Pr[y = 1] = Φ(0 + β2x) and sample size N = 40. The test is a two-sided Wald test of whether or not the slope coefficient equals 1. Actual size is calculated from S = 10,000 simulations with β2 = 1, and power is calculated from 10,000 simulations with β2 = 2.

The first two columns of Table 7.2 give the nominal size and the actual size of the Wald test for nominal sizes α = 0.01, 0.05, 0.10, and 0.20. The actual size is the proportion of the 10,000 simulations in which |z| > z_{α/2}, or equivalently in which z² > χ²_α(1). Clearly the actual size of the test is much less than the nominal size for α ≤ 0.10.

An ad hoc small-sample correction is to instead assume that z is t-distributed with 38 degrees of freedom, and reject if |z| > t_{α/2}(38). However, this leads to even smaller actual size, since t_{α/2}(38) > z_{α/2}.

The Monte Carlo simulations can also be used to obtain size-corrected critical values. Thus the lower and upper 2.5 percentiles of the 10,000 simulated values of z are −1.905 and 2.003. It follows that an asymmetric rejection region with actual size 0.05 is z < −1.905 or z > 2.003, a larger rejection region than |z| > 1.960.

7.7.5. Test Power

We consider power of the Wald test under Ha: β2 = 2. We would expect the power to be reasonable because this value of β2 lies two to three standard errors away from the
  • 293. HYPOTHESIS TESTS null hypothesis value of β2 = 1, given that se[ β2] has average value 0.359. The actual and nominal power of the Wald test are given in the last two columns of Table 7.2. The actual power is obtained in the same way as actual size, being the proportion of the 10,000 simulations in which |z| zα/2. The only change is that, in generating y in the simulation, β2 = 2 rather than 1. The actual power is very low for α = 0.01 and 0.05, cases where the actual size is much less than the nominal size. The nominal power of the Wald test is determined using the asymptotic non- central χ2 (1, λ) distribution under Ha, where from (7.50) λ = 1 2 (δ/ √ N)2 /se[ β2]2 = 1 2 × 12 /0.3592 3.88, since the local alternative is that Ha : β2 − 1 = δ/ √ N, so δ/ √ N = 1 for β2 = 2. The asymptotic result is not exact, but it does provide a useful estimate of the power for α = 0.10 and 0.20, cases where the true size closely matches the nominal size. 7.7.6. Monte Carlo in Practice The preceding discussion has emphasized use of the Monte Carlo analysis to calculate test power and size. A Monte Carlo analysis can also be very useful for determining small-sample bias in an estimator and, by setting N large, for determining that an estimator is actually consistent. Such Monte Carlo routines are very simple to run using current computer packages. A Monte Carlo analysis can be applied to real data if the conditional distribution of y given x is fully parametrized. For example, consider a probit model estimated with real data. In each simulation the regressors are set at their sample values, if the sampling framework is one of fixed regressors in repeated samples, while a new set of values for the binary dependent variable y needs to be generated. This will depend on what values of the parameters β are used. Let β1, . . . , βK denote the probit estimates from the original sample and consider a Wald test of H0 : βj = 0. To calculate test size, generate S simulation samples by setting βk = βk for j = k and setting βj = 0, and then calculate the proportion of simulations in which H0 : βj = 0 is rejected. To esti- mate the power of the Wald test against a specific alternative Ha : βj = 1, say, generate y with βk = βk for j = k and βj = 1 in generating y, and calculate the proportion of simulations in which H0 : βj = 0 is rejected. In practice much microeconometric analysis is based on estimators that are not based on fully parametric models. Then additional distributional assumptions are needed to perform a Monte Carlo analysis. Alternatively, power can be calculated using asymptotic methods rather than finite- sample methods. Additionally the bootstrap, presented next, can be used to obtain size using a more refined asymptotic theory. 7.8. Bootstrap Example The bootstrap is a variant of Monte Carlo simulation that has the attraction of being implementable with fewer parametric assumptions and with little additional program 254
  • 294. 7.8. BOOTSTRAP EXAMPLE code beyond that required to estimate the model in the first place. Essential ingredients for the bootstrap to be valid are that the estimator actually has a limit distribution and that the bootstrap resamples quantities that are iid. The bootstrap has two general uses. First, it can be used as an alternative way to compute statistics without asymptotic refinement. This is particularly useful for com- puting standard errors when analytical formulas are complex. Second, it can be used to implement a refinement of the usual asymptotic theory that may provide a better finite-sample approximation to the distribution of test statistics. We illustrate the bootstrap to implement a Wald test, ahead of a complete treatment in Chapter 11. 7.8.1. Inference Using Standard Asymptotics Consider again a probit example with binary regressor y equal to one with probability p = (γ + βx), where (·) is the standard normal cdf. Interest lies in testing H0 : β = 1 against Ha : β = 1 at significance level 0.05. The analysis here does not require knowledge of the probit model. One sample of size N = 30 is generated. Probit ML estimation yields β = 0.817 and s β = 0.294, where the standard error is based on − A−1 , so the test statistic z = (1 − 0.817)/0.294 = −0.623. Using standard asymptotic theory we obtain 5% critical values of −1.96 and 1.96, since z.025 = 1.96, and H0 is not rejected. 7.8.2. Bootstrap without Asymptotic Refinement The departure point of the bootstrap method is to resample from an approximation to the population; see Section 11.2.1. The paired bootstrap does so by resampling from the original sample. Thus form B pseudo-samples of size N by drawing with replacement from the orig- inal data {(yi , xi ), i = 1, . . . , N}. For example, the first pseudo-sample of size 30 may have (y1, x1) once, (y2, x2) not at all, (y3, x3) twice, and so on. This yields B estimates β ∗ 1, . . . , β ∗ B of the parameter of interest β, that can be used to estimate features of the distribution of the original estimate β. For example, suppose the computer program used to estimate a probit model reports β but not the standard error s β. The bootstrap solves this problem since we can use the estimated standard deviation s β,boot of β ∗ 1, . . . , β ∗ B from the B bootstrap pseudo- samples. Given this standard error estimate it is possible to perform a Wald hypothesis test on β. For the probit Wald test example, the resulting bootstrap estimate of the standard error of β is 0.376, leading to z = (1 − 0.817)/0.376 = −0.487. Since −0.487 lies in (−1.96, 1.96) we do not reject H0 at 5%. This use of the bootstrap to test hypotheses does not lead to size improvements in small samples. However, it can lead to great time savings in many applications if it is difficult to otherwise obtain the standard errors for an estimator. 255
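The paired bootstrap of this section is only a few lines of code. The sketch below (Python, numpy/statsmodels; the seed and simulated sample are illustrative, so the numbers will differ from the 0.376 reported above) resamples (y, x) pairs with replacement, re-estimates the probit slope in each pseudo-sample, and uses the standard deviation of the bootstrap estimates as the standard error in a Wald test.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    N, B = 30, 400                                    # sample size and bootstrap replications
    x = rng.normal(size=N)
    y = (x + rng.normal(size=N) > 0).astype(float)    # probit dgp with (beta1, beta2) = (0, 1)
    X = sm.add_constant(x)

    beta_hat = sm.Probit(y, X).fit(disp=0).params[1]  # original slope estimate

    boot = []
    while len(boot) < B:
        idx = rng.integers(0, N, size=N)              # resample (y_i, x_i) pairs with replacement
        yb, Xb = y[idx], X[idx]
        try:
            # drop degenerate pseudo-samples (e.g. all y equal), as the text
            # recommends for samples this small
            boot.append(sm.Probit(yb, Xb).fit(disp=0).params[1])
        except Exception:
            continue

    se_boot = np.std(boot, ddof=1)                    # bootstrap standard error of the slope
    z = (beta_hat - 1.0) / se_boot                    # Wald test of H0: beta2 = 1
    print(beta_hat, se_boot, z)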
7.8.3. Bootstrap with Asymptotic Refinement

Some bootstraps can lead to a better asymptotic approximation to the distribution of z. This is likely to lead to finite-sample critical values that are better in the sense that the actual size is likely to be closer to the nominal size of 0.05. Details are provided in Chapter 11. Here we illustrate the method.

Again form B pseudo-samples of size N by drawing with replacement from the original data. Estimate the probit model in each pseudo-sample and for the bth pseudo-sample compute z*_b = (β̂*_b − β̂)/s_{β̂*_b}, where β̂ is the original estimate. The bootstrap distribution for the original test statistic z is then the empirical distribution of z*_1, . . . , z*_B rather than the standard normal. The lower and upper 2.5 percentiles of this empirical distribution give the bootstrap critical values.

For the example here with B = 1,000 the lower and upper 2.5 percentiles of the empirical bootstrap distribution of z were found to be −2.62 and 1.83. The bootstrap critical values for testing at 5% are then −2.62 and 1.83, rather than the usual ±1.96. Since the initial sample test statistic z = −0.623 lies in (−2.62, 1.83) we do not reject H0: β = 1. A bootstrap p-value can also be computed.

Unlike the bootstrap in the previous section, an asymptotic improvement occurs here because the studentized test statistic z is asymptotically pivotal (see Section 11.2.3) whereas the estimator β̂ is not.

7.9. Practical Considerations

Microeconometrics research places emphasis on statistical inference based on minimal distributional assumptions, using robust estimates of the variance matrix of an estimator. There is no sense in robust inference, however, if failure of distributional assumptions leads to the more serious complication of estimator inconsistency, as can happen for some though not all ML estimators. Many packages provide a "robust" standard errors option in estimator commands. In microeconometrics packages robust often means heteroskedastic consistent and does not guard against other complications, such as clustering (see Section 24.5), that can also lead to invalid statistical inference.

Robust inference is usually implemented using a Wald test. The Wald test has the weakness of lack of invariance to reparametrization of nonlinear hypotheses, though this may be diminished by performing an appropriate bootstrap. Standard auxiliary regressions for the LM test and implementations of LM tests in computer packages are usually not robustified, though in some cases relatively simple robustification of the LM test is possible (see Section 8.4).

The power of tests can be weak. Ideally, power against some meaningful alternative would be reported. Failing this, as Section 7.6 indicates, one should be careful about overstating the conclusions from a hypothesis test unless parameters are very precisely estimated.

The finite-sample size of tests derived from asymptotic theory is also an issue. The bootstrap method, detailed in Chapter 11, has the potential to yield hypothesis tests and confidence intervals with much better finite-sample properties.
  • 296. 7.10. BIBLIOGRAPHIC NOTES Statistical inference can be quite fragile, so these issues are of importance to the practitioner. Consider a two-tailed Wald test of statistical significance when θ = 1.96, and assume the test statistic is indeed standard normal distributed. If s θ = 1.0 then t = 1.96 and the p−value is 0.050. However, the true p−value is a much higher 0.117 if the standard error was underestimated by 20% (so correct t = 1.57), and a much lower 0.014 if the standard error was overestimated by 20% (so t = 2.35). 7.10. Bibliographic Notes The econometrics texts by Gouriéroux and Monfort (1989) and Davidson and MacKinnon (1993) give quite lengthy treatment of hypothesis testing. The presentation here considers only equality restrictions. For tests of inequality restrictions see Gouriéroux, Holly, and Monfort (1982) for the linear case and Wolak (1991) for the nonlinear case. For hypothesis testing when the parameters are at the boundary of the parameter space under the null hypothesis the tests can break down; see Andrews (2001). 7.3 A useful graphical treatment of the three classical test procedures is given by Buse (1982). 7.5 Newey and West (1987a) present extension of the classical tests to GMM estimation. 7.6 Davidson and MacKinnon (1993) give considerable discussion of power and explain the distinction between explicit and implicit null and alternative hypotheses. 7.7 For Monte Carlo studies see Davidson and MacKinnon (1993) and Hendry (1984). 7.8 The bootstrap method due to Efron (1979) is detailed in Chapter 11. Exercises 7–1 Suppose a sample yields estimates θ1 = 5, θ2 = 3 with asymptotic variance es- timates 4 and 2 and the correlation coefficient between θ1 and θ2 equals 0.5. Assume asymptotic normality of the parameter estimates. (a) Test H0 : θ1eθ2 = 100 against Ha : θ1 = 100 at level 0.05. (b) Obtain a 95% confidence interval for γ = θ1eθ2 . 7–2 Consider NLS regression for the model y = exp(α + βx) + ε, where α, β, and x are scalars and ε ∼ N[0, 1]. Note that for simplicity σ2 ε = 1 and need not be estimated. We want to test H0 : β = 0 against Ha : β = 0. (a) Give the first-order conditions for the unrestricted MLE of α and β. (b) Give the asymptotic variance matrix for the unrestricted MLE of α and β. (c) Give the explicit solution for the restricted MLE of α and β. (d) Give the auxiliary regression to compute the OPG form of the LM test. (e) Give the complete expression for the original form of the LM test. Note that it involves derivatives of the unrestricted log-likelihood evaluated at the re- stricted MLE of α and β. [This is more difficult than parts (a)–(d).] 7–3 Suppose we wish to choose between two nested parametric models. The relation- ship between the densities of the two models is that g(y|x,β,α = 0) = f (y|x,β), where for simplicity both β and α are scalars. If g is the correct density then the MLE of β based on density f is inconsistent. A test of model f against model g is a test of H0 : α = 0 against Ha : α = 0. Suppose ML estimation yields the following results: (1) model f : β = 5.0, se[ β] = 0.5, and ln L = −106; (2) model g: β = 3.0, se[ β] = 1.0, α = 2.5, se[ α] = 1.0, and ln L = −103. Not all of the 257
  • 297. HYPOTHESIS TESTS following tests are possible given the preceding information. If there is enough information, perform the tests and state your conclusions. If there is not enough information, then state this. (a) Perform a Wald test of H0 at level 0.05. (b) Perform a Lagrange multiplier test of H0 at level 0.05. (c) Perform a likelihood ratio test of H0 at level 0.05. (d) Perform a Hausman test of H0 at level 0.05. 7–4 Consider a test of H0 : µ = 0 against Ha : µ ≠ 0 at nominal size 0.05 when the dgp is y ∼ N[µ, 100], so the standard deviation is 10, and the sample size is N = 10. The test statistic is the usual t-test statistic t = µ/(s/√10), where s2 = (1/9) Σi (yi − ȳ)2 . Perform 1,000 simulations to answer the following. (a) Obtain the actual size of the t-test if the correct finite-sample critical values ±t.025(8) = ±2.306 are used. Is there size distortion? (b) Obtain the actual size of the t-test if the asymptotic approximation critical values ±z.025 = ±1.960 are used. Is there size distortion? (c) Obtain the power of the t-test against the alternative Ha : µ = 1, when the critical values ±t.025(8) = ±2.306 are used. Is the test powerful against this particular alternative? 7–5 Use the health expenditure data of Section 16.6. The model is a probit regression of DMED, an indicator variable for positive health expenditures, against the 17 regressors listed in the second paragraph of Section 16.6. You should obtain the estimates given in the first column of Table 16.1. Consider a joint test of the statistical significance of the self-rated health indicators HLTHG, HLTHF, and HLTHP at level 0.05. (a) Perform a Wald test. (b) Perform a likelihood ratio test. (c) Perform an auxiliary regression to implement an LM test. [This will require some additional coding.] 258
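Exercise 7–4 can be set up along the following lines. This Python sketch is one possible implementation, not the only one; the seed and function name are our own, and the critical values are those quoted in the exercise.

```python
# Monte Carlo size and power of the t-test for H0: mu = 0 with N = 10 and sigma = 10.
import numpy as np

rng = np.random.default_rng(10101)
R, N, sigma = 1000, 10, 10.0
t_crit, z_crit = 2.306, 1.960            # critical values quoted in the exercise

def rejection_rate(mu, crit):
    rejects = 0
    for _ in range(R):
        y = rng.normal(mu, sigma, N)
        t = y.mean() / (y.std(ddof=1) / np.sqrt(N))   # the usual t-statistic
        rejects += abs(t) > crit
    return rejects / R

print("size, finite-sample critical value:", rejection_rate(0.0, t_crit))
print("size, asymptotic critical value:   ", rejection_rate(0.0, z_crit))
print("power at mu = 1, finite-sample cv: ", rejection_rate(1.0, t_crit))
```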
  • 298. C H A P T E R 8 Specification Tests and Model Selection 8.1. Introduction Two important practical aspects of microeconometric modeling are determining whether a model is correctly specified and selecting from alternative models. For these purposes it is often possible to use the hypothesis testing methods presented in the pre- vious chapter, especially when models are nested. In this chapter we present several other methods. First, m-tests such as conditional moment tests are tests of whether moment con- ditions imposed by a model are satisfied. The approach is similar in spirit to GMM, except that the moment conditions are not imposed in estimation and are instead used for testing. Such tests are conceptually very different from the hypothesis tests of Chapter 7, as there is no explicit statement of an alternative hypothesis model. Second, Hausman tests are tests of the difference between two estimators that are both consistent if the model is correctly specified but diverge if the model is incorrectly specified. Third, tests of nonnested models require special methods because the usual hypoth- esis testing approach can only be applied when one model is nested within another. Finally, it can be useful to compute and report statistics of model adequacy that are not test statistics. For example, an analogue of R2 may be used to measure the good- ness of fit of a nonlinear model. Ideally, these methods are used in a cycle of model specification, estimating, testing, and evaluation. This cycle can move from a general model toward a specific model, or from a specific model to a more general one that is felt to capture the most important features of the data. Section 8.2 presents m-tests, including conditional moment tests, the information matrix test, and chi-square goodness of fit tests. The Hausman test is presented in Section 8.3. Tests for several common misspecifications are discussed in Section 8.4. Discrimination between nonnested models is the focus of Section 8.5. Commonly used convenient implementations of the tests of Sections 8.2–8.5 can rely on strong distri- butions and/or perform poorly in finite samples. These concerns have discouraged use 259
  • 299. SPECIFICATION TESTS AND MODEL SELECTION of some of these tests, but such concerns are outdated because in many cases the boot- strap methods presented in Chapter 11 can correct for these weaknesses. Section 8.6 considers the consequences of testing a model on subsequent inference. Model diag- nostics are presented in the stand-alone Section 8.7. 8.2. m-Tests m-Tests, such as conditional moment tests, are a general specification testing proce- dure that encompasses many common specification tests. The tests are easily imple- mented using auxiliary regressions when estimation is by ML, a situation where tests of model assumptions are especially desirable. Implementation is usually more diffi- cult when estimators are instead based on minimal distributional assumptions. We first introduce the test statistic and computational methods, followed by leading examples and an illustration of the tests. 8.2.1. m-Test Statistic Suppose a model implies the population moment condition H0 : E[mi (wi , θ)] = 0, (8.1) where w is a vector of observables, usually the dependent variable y and regressors x and sometimes additional variables z, θ is a q × 1 vector of parameters, and mi (·) is an h × 1 vector. A simple example is that E[(y − x β)z] = 0 if z can be omitted in the linear model y = x β + u. Especially for fully parametric models there are many candidates for mi (·). An m-test is a test of the closeness to zero of the corresponding sample moment mN ( θ) = N−1 N i=1 mi (wi , θ). (8.2) This approach is similar to that for the Wald test, where h(θ) = 0 is tested by testing the closeness to zero of h( θ). A test statistic is obtained by a method similar to that detailed in Section 7.2.4 for the Wald test. In Section 8.2.3 it is shown that if (8.1) holds then √ N mN ( θ) d → N[0, Vm], (8.3) where Vm defined later in (8.10) is more complicated than in the case of the Wald test because mi (wi , θ) has two sources of stochastic variation as both wi and θ are random. A chi-square test statistic can then be obtained by taking the corresponding quadratic form. Thus the m-test statistic for (8.1) is M = N mN ( θ) V−1 m mN ( θ), (8.4) which is asymptotically χ2 (rank[Vm]) distributed if the moment conditions (8.1) are correct. An m-test rejects the moment conditions (8.1) at significance level α if M χ2 α(h) and does not reject otherwise. 260
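In code the statistic (8.4) is a short calculation once the sample moment (8.2) and an estimate of Vm are in hand. The following Python helper (the function name and tooling are ours, not the text's) uses a Moore–Penrose generalized inverse so that it also covers the rank-deficient case discussed next; obtaining a suitable estimate of Vm is the hard part and is taken up in Section 8.2.2.

```python
# A minimal sketch of the m-statistic (8.2)-(8.4) given moment contributions and an
# estimate of V_m; the generalized inverse handles a rank-deficient V_m.
import numpy as np
from scipy import stats

def m_test(m_contrib, Vm_hat):
    """m_contrib: N x h array of m_i(w_i, theta_hat); Vm_hat: h x h estimate of V_m in (8.3)."""
    N = m_contrib.shape[0]
    m_bar = m_contrib.mean(axis=0)                      # the sample moment (8.2)
    M = N * m_bar @ np.linalg.pinv(Vm_hat) @ m_bar      # quadratic form (8.4)
    df = np.linalg.matrix_rank(Vm_hat)                  # degrees of freedom = rank[V_m]
    return M, df, stats.chi2.sf(M, df)
```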
  • 300. 8.2. M-TESTS A complication is that Vm may not be of full rank h. For example, this is the case if the estimator θ itself sets a linear combination of components of mN ( θ) to 0. In some cases, such as the OIR test, Vm is still of full rank and M can be computed but the chi-square test statistic has only rank[Vm] degrees of freedom. In other cases Vm itself is not of full rank. Then it is simplest to drop (h − rank[Vm]) of the moment conditions and perform an m-test using just this subset of the moment conditions. Alternatively, the full set of moment conditions can be used, but V−1 m in (8.4) is replaced by V− m, the generalized inverse of Vm. The Moore–Penrose generalized inverse V− of a matrix V satisfies VV− V = V, V− VV− = V− , (VV− )′ = VV− , and (V− V)′ = V− V. When Vm is less than full rank then strictly speaking (8.3) no longer holds, since the multivariate normal requires full rank Vm, but (8.4) still holds given these adjustments. The m-test approach is conceptually very simple. The moment restriction (8.1) is rejected if a quadratic form in the sample estimate (8.2) is far enough from zero. The challenges are in calculating M since Vm can be quite complex (see Section 8.2.2), selecting moments m(·) to test (see Sections 8.2.3–8.2.6 for leading examples), and interpreting reasons for rejection of (8.1) (see Section 8.2.8). 8.2.2. Computation of the m-Statistic There are several ways to compute the m-statistic. First, one can always directly compute Vm, and hence M, using the consistent estimates of the components of Vm given in Section 8.2.3. Most practitioners shy away from this approach as it entails matrix computations. Second, the bootstrap can always be used (see Section 11.6.3), since the bootstrap can provide an estimate of Vm that controls for all sources of variation in mN ( θ) = N−1 i mi (wi , θ). Third, in some cases auxiliary regressions similar to those for the LM test given in Section 7.3.5 can be run to compute asymptotically equivalent versions of M that do not require computation of Vm. These auxiliary regressions may in turn be bootstrapped to obtain an asymptotic refinement (see Section 11.6.3). We present several leading auxiliary regressions. Auxiliary Regressions Using the ML Estimator Model specification tests are especially desirable when inference is done within the likelihood framework, as in general any misspecification of the density can lead to inconsistency of the MLE. Fortunately, an m-test is easily implemented when estimation is by maximum likelihood. Specifically, when θ is the MLE, generalizing the LM test result of Section 7.3.5 (see Section 8.2.3), an asymptotically equivalent version of the m-test is obtained from the auxiliary regression 1 = m i δ + s i γ + ui , (8.5) 261
  • 301. SPECIFICATION TESTS AND MODEL SELECTION where mi = mi (yi , xi , θML), si = ∂ ln f (yi |xi , θ)/∂θ| θML is the contribution of the ith observation to the score and f (yi |xi , θ) is the conditional density function, by calculating M∗ = N R2 u, (8.6) where R2 u is the uncentered R2 defined at the end of Section 7.3.5. Equivalently, M∗ equals ESSu, the uncentered explained sum of squares (the sum of squares of the fitted values) from regression (8.5), or M∗ equals N − RSS, where RSS is the residual sum of squares from regression (8.5). M∗ is asymptotically χ2 (h) under H0. The test statistic M∗ is called the outer product of the gradient form of the m-test, and it is a generalization of the auxiliary regression for the LM test (see Section 7.3.5). Although the OPG form can be easily computed, it has poor small-sample properties with large size distortions. Similar to the LM test, however, these small-sample prob- lems can be greatly reduced by using bootstrap methods (see Section 11.6.3). The test statistic M∗ may also be appropriate in some non-ML settings. The auxil- iary regression is applicable whenever E[∂m/∂θ ] = −E[ms ] (see Section 8.2.3). By the generalized IM equality (see Section 5.6.3), this condition holds for the MLE when expectation is with respect to the specified density f (·). It can also hold under weaker distributional assumptions in some cases. Auxiliary Regressions When E[∂m/∂θ ] = 0 In some applications mi (wi , θ) satisfies E ∂mi (wi , θ)/∂θ θ0 = 0, (8.7) in addition to (8.1). Then it can be shown that the asymptotic distribution of √ N mN ( θ) is the same as that of √ NmN (θ0), so Vm = plim N−1 i mi0m i0, which can be consistently esti- mated by Vm = N−1 i mi m i . The test statistic can be computed in a similar manner to (8.5), except the auxiliary regression is more simply 1 = m i δ + ui , (8.8) with test statistic M∗∗ equal to N times the uncentered R2 . This auxiliary regression is valid for any root-N consistent estimator θ, not just the MLE, provided (8.7) holds. The condition (8.7) is met in a few examples; see Section 8.2.9 for an example. Even if (8.7) does not hold the simpler regression (8.8) might still be run as a guide, as it places a lower bound on the correct value of M, the m-test statistic. If this simpler regression leads to rejection then (8.1) is certainly rejected. Other Auxiliary Regressions Alternative auxiliary regressions to (8.5) and (8.8) are possible if m(y, x, θ) and s(y, x, θ) can be appropriately factorized. 262
  • 302. 8.2. M-TESTS First, if s(y, x, θ) = g(x, θ)r(y, x, θ) and m(y, x, θ) = h(x, θ)r(y, x, θ) for some common scalar function r(·) with V[r(y, x, θ)] = 1 and estimation is by ML, then an asymptotically equivalent regression to (8.5) is N R2 u from regression of ri on gi and hi . Second, if m(y, x, θ) = h(x, θ)v(y, x, θ) for some scalar function v(·) with V[v(y, x, θ)] = 1 and E[∂m/∂θ ] = 0, then an asymptotically equivalent regression to (8.8) is N R2 u from regression of vi on hi . For further details see Wooldridge (1991). Additional auxiliary regressions exist in special settings. Examples are given in Section 8.4, and White (1994) gives a quite general treatment. 8.2.3. Derivations for the m-Test Statistic To avoid the need to compute Vm, the variance matrix in (8.3), m-tests are usually implemented using auxiliary regressions or bootstrap methods. For completeness this section derives the actual expression for Vm and provides justification for the auxiliary regressions (8.5) and (8.8). The key is obtaining the distribution of mN ( θ) defined in (8.2). This is complicated because mN ( θ) is stochastic for two reasons: the random variables wi and evaluation at the estimator θ. Assume that θ is an m-estimator or estimating equations estimator that solves 1 N N i=1 si (wi , θ) = 0, (8.9) for some function s(·), here not necessarily ∂ ln f (y|x, θ)/∂θ, and make the usual cross-section assumption that data are independent over i. Then we shall show that √ N mN ( θ) d → N[0, Vm], as in (8.3), where Vm = H0J0H 0, (8.10) the h × (h + q) matrix H0 = [Ih − C0A−1 0 ], (8.11) where C0 = plim N−1 i ∂mi0/∂θ and A0 = plim N−1 i ∂si0/∂θ , and the (h + q) × (h + q) matrix J0 = plim N−1 N i=1 mi0m i0 N i=1 mi0s i0 N i=1 si0m i0 N i=1 si0s i0 ' , (8.12) where mi0 = mi (wi , θ0) and si0 = si (wi , θ0). To derive (8.10), take a first-order Taylor series expansion around θ0 to obtain √ N mN ( θ) = √ NmN (θ0) + ∂mN (θ0) ∂θ √ N( θ − θ0) + op(1). (8.13) For θ defined in (8.9) this implies that √ N mN ( θ) = 1 √ N N i=1 mi (θ0) − C0A−1 0 1 √ N N i=1 si0 + op(1), (8.14) 263
  • 303. SPECIFICATION TESTS AND MODEL SELECTION where we use mN = N−1 i mi , ∂mN /∂θ = N−1 i ∂mi /∂θ p → C0, and √ N( θ − θ0) has the same limit distribution as A−1 0 N−1/2 i si0 by applying the usual first-order Taylor series expansion to (8.9). Equation (8.14) can be written as √ N mN ( θ) = Ih −C0A−1 0   1 √ N N i=1 mi0 1 √ N N i=1 si0   + op(1). (8.15) Equation (8.10) follows by application of the limit normal product rule (Theo- rem A.17) as the second term in the product in (8.15) has limit normal distribution under H0 with mean 0 and variance J0. To compute M in (8.4), a consistent estimate Vm for Vm can be obtained by replac- ing each component of Vm by a consistent estimate. For example, C0 can be consis- tently estimated by C=N−1 i ∂mi /∂θ θ , and so on. Although this can always be done, using auxiliary regressions is easier when they are available. First, consider the auxiliary regression (8.5) when θ is the MLE. By the generalized IM equality (see Section 5.6.3) E[∂mi0/∂θ ] = −E[mi0s i0], where for the MLE we specialize to si = ∂ ln f (yi , xi , θ)/∂θ . Considerable simplification occurs since then C0 = −plimN−1 i mi0s i0 and A0 = −plimN−1 i si0s i0, which also appear in the J0 matrix. This leads to the OPG form of the test. For further details see Newey (1985) or Pagan and Vella (1989). Second, for the auxiliary regression (8.8), note that if E[∂mi0/∂θ ] = 0 then C0 = 0, so H0 = [Ih 0] and hence H0J0H 0 = plimN−1 i mi0m i0. 8.2.4. Conditional Moment Tests Conditional moment tests, due to Newey (1985) and Tauchen (1985), are m-tests of unconditional moment restrictions that are obtained from an underlying conditional moment restriction. As an example, consider the linear regression model y = x β + u. A standard as- sumption for consistency of the OLS estimator is that the error has conditional mean zero, or equivalently the conditional moment restriction E[y − x β|x] = 0. (8.16) In Chapter 6 we considered using some of the implied unconditional moment restric- tions as the basis of MM or GMM estimation. In particular (8.16) implies that E[x(y − x β)] = 0. Solving the corresponding sample moment condition i xi (yi − x i β) = 0 leads to the OLS estimator for β. However, (8.16) implies many other moment condi- tions that are not used in estimation. Consider the unconditional moment restriction E[g(x)(y − x β)] = 0, where the vector g(x) should differ from x, already used in OLS estimation. For exam- ple, g(x) may contain the squares and cross-products of the components of the regres- sor vector x. This suggests a test based on whether or not the corresponding sample moment mN ( β) = N−1 i g(xi )(yi − x i β) is close to zero. 264
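The following Python sketch illustrates this conditional moment test for OLS on simulated data, testing E[g(x)(y − x β)] = 0 with g(x) containing powers of the regressor not used in estimation. Here Vm is estimated by a pairs bootstrap, one of the options listed in Section 8.2.2; the dgp, seed, and function name are illustrative assumptions.

```python
# CM test of E[g(x)u] = 0 after OLS, with V_m estimated by a pairs bootstrap.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, B = 200, 500
x = rng.normal(size=N)
y = 1.0 + x + rng.normal(size=N)                # linear dgp, so the tested moments hold
X = np.column_stack([np.ones(N), x])
g = np.column_stack([x ** 2, x ** 3])           # moments not used in OLS estimation

def moment_mean(y, X, g):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ beta
    return (g * u[:, None]).mean(axis=0)        # m_N(beta_hat)

m_hat = moment_mean(y, X, g)
m_star = np.empty((B, g.shape[1]))
for b in range(B):
    idx = rng.integers(0, N, N)                 # resample (y_i, x_i) pairs
    m_star[b] = moment_mean(y[idx], X[idx], g[idx])

Vm = N * np.cov(m_star, rowvar=False)           # bootstrap estimate of the variance in (8.3)
M = N * m_hat @ np.linalg.pinv(Vm) @ m_hat      # the m-statistic (8.4)
print("M =", round(M, 2), "p-value =", round(stats.chi2.sf(M, g.shape[1]), 3))
```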
  • 304. 8.2. M-TESTS More generally, consider the conditional moment restriction E[r(y, x, θ)|x] = 0, (8.17) for some scalar function r(·). The conditional (CM) moment test is an m-test based on the implied unconditional moment restrictions E[g(x)r(y, x, θ)] = 0, (8.18) where g(x) and/or r(y, x, θ) are chosen so that these restrictions are not already used in estimation. Likelihood-based models lead to many potential restrictions. For less than fully parametric models examples of r(y, x, θ) include y − µ(x, θ), where µ(·) is the spec- ified conditional mean function, and (y − µ(x, θ))2 − σ2 (x, θ), where σ2 (x, θ) is a specified conditional variance function. 8.2.5. White’s Information Matrix Test For ML estimation the information matrix equality implies moment restrictions that may be used in an m-test, as they are usually not imposed in obtaining the MLE. Specifically, from Section 5.6.3 the IM equality implies E[Vech [Di (yi , xi , θ0)]] = 0, (8.19) where the q × q matrix Di is given by Di (yi , xi , θ0) = ∂2 ln fi ∂θ∂θ + ∂ ln fi ∂θ ∂ ln fi ∂θ , (8.20) and the expectation is taken with respect to the assumed conditional density fi = f (yi |xi , θ). Here Vech is the vector-half operator that stacks the columns of the ma- trix Di in the same way as the Vec operator, except that only the q(q + 1)/2 unique elements of the symmetric matrix Di are stacked. White (1982) proposed the information matrix test of whether the corresponding sample moment dN ( θ) = N−1 N i=1 Vech[Di (yi , xi , θML)] (8.21) is close to zero. Using (8.4) the IM test statistic is IM = N dN ( θ) V−1 dN ( θ), (8.22) where the expression for V given in White (1982) is quite complicated. A much easier way to implement the test, due to Lancaster (1984) and Chesher (1984), is to use the auxiliary regression (8.5), which is applicable since the MLE is used in (8.21). The IM test can also be applied to a subset of the restrictions in (8.19). This should be done if q is large as then the number of restrictions q(q + 1)/2 being tested is very large. Large values of the IM test statistic lead to rejection of the restrictions of the IM equality and the conclusion that the density is incorrectly specified. In general 265
  • 305. SPECIFICATION TESTS AND MODEL SELECTION this means that the ML estimator is inconsistent. In some special cases, detailed in Section 5.7, the MLE may still be consistent though standard errors need then to be based on the sandwich form of the variance matrix. 8.2.6. Chi-Square Goodness-of-Fit Test A useful specification test for fully parametric models is to compare predicted prob- abilities with sample relative frequencies. The model is a poor one if these differ considerably. Begin with discrete iid random variable y that can take one of J possible values with probabilities p1, p2, . . . , pJ , J j=1 pj = 1. The correct specification of the prob- abilities can be tested by testing the equality of theoretical frequencies Npj to the observed frequencies N p̄j , where p̄j is the fraction of the sample that takes the jth possible value. The Pearson chi-square goodness-of-fit test (PCGF) statistic is PCGF = J j=1 (N p̄j − Npj )2 Npj . (8.23) This statistic is asymptotically χ2 (J − 1) distributed under the null hypothesis that the probabilities p1, p2, . . . , pJ are correct. The test can be extended to permit the prob- abilities to be predicted from regression (see Exercise 8.2). Consider a multinomial model for discrete y with probabilities pi j = pi j (xi , θ). Then pj in (8.23) is replaced by pj = N−1 i Fj (xi , θ) and if θ is the multinomial MLE we again get a chi-square distribution, but with reduced number of degrees of freedom (J − dim(θ) − 1) result- ing from the estimation of θ (see Andrews, 1988a). For regression models other than multinomial models, the statistic PCGF in (8.23) can be computed by grouping y into cells, but the statistic PCGF is then no longer chi-square distributed. Instead, a closely related m-test statistic is used. To derive this statistic, break the range of y into J mutually exclusive cells, where the J cells span all possible values of y. Let di j (yi ) be an indicator variable equal to one if yi ∈ cell j and equal to zero otherwise. Let pi j (xi , θ) = ) yi ∈cell j f (yi |xi , θ)dyi be the predicted probability that observation i falls in cell j, where f (y|x, θ) is the conditional density of y and to begin with we assume the parameter vector θ is known. If the conditional density is correctly specified, then E[di j (yi ) − pi j (xi , θ)] = 0, j = 1, . . . , J. (8.24) Stacking all J moments in obvious vector notation, we have E[di (yi ) − pi (xi , θ)] = 0, (8.25) where di and pi are J × 1 vectors with jth entries di j and pi j . This suggests an m-test of the closeness to zero of the corresponding sample moment 1 dpN ( θ) = N−1 N i=1 (di (yi ) − pi (xi , θ)), (8.26) which is the difference between the vector of sample relative frequencies N−1 i di and the vector of predicted frequencies N−1 i pi . Using (8.5) we obtain the 266
  • 306. 8.2. M-TESTS chi-square goodness-of-fit (CGF) test statistic of Andrews (1988a, 1988b): CGF = N1 dpN ( θ) V−11 dpN ( θ), (8.27) where the expression for V is quite complicated. The CGF test statistic is easily com- puted using the auxiliary regression (8.5), with mi = di − pi . This auxiliary regression is appropriate here because a fully parametric model is being tested and so θ will be the MLE. One of the categories needs to be dropped because of the restriction that probabil- ities sum to one, yielding a test statistic that is asymptotically χ2 (J − 1) under the null hypothesis that f (y|x, θ) is correctly specified. Further categories may need to be dropped in some special cases, such as the multinomial example already discussed after (8.23). In addition to reporting the calculated test statistic it can be informative to report the components of N−1 i di and N−1 i pi . The relevant asymptotic theory is provided by Andrews (1988a), with a simpler presentation and several applications given in Andrews (1988b). For simplicity we presented cells determined by the range of y, but the partitioning can be on both y and x. Cells should be chosen so that no cell has only a few observations. For further details and a history of this test see these articles. For continuous random variable y in the iid case a more general test than the SCGF test is the Kolmogorov test; this uses the entire distribution of y, not just cells formed from y. Andrews (1997) presents a regression version of the Kolmogorov test, but it is much more difficult to implement than the CGF test. 8.2.7. Test of Overidentifying Restrictions Tests of overidentifying assumptions (see Section 6.3.8) are examples of m-tests. In the notation of Chapter 6, the GMM estimator is based on the assumption that E[h(wi , θ0)] = 0. If the model is overidentified, then only q of these moment re- strictions are used in estimation, leading to (r − q) linearly dependent orthogonal- ity conditions, where r = dim[h(·)], that can be used to form an m-test. Then we use M in (8.4), where mN = N−1 i h(wi , θ). As shown in Section 6.3.9, if θ is the optimal GMM estimator then mN ( θ) S−1 N mN ( θ), where SN = N−1 N i=1 hi h i , is asymptotically χ2 (r − q) distributed. A more intuitive linear IV example is given in Section 8.4.4. 8.2.8. Power and Consistency of Conditional Moment Tests Because there is no explicit alternative hypothesis, m-tests differ from the tests of Chapter 7. Several authors have given examples where the IM test can be shown to be equiv- alent to a conventional LM test of null against alternative hypotheses. Chesher (1984) interpreted the IM test as a test for random parameter heterogeneity. For the linear model under normality, A. Hall (1987) showed that subcomponents of the IM test correspond to LM tests of heteroskedasticity, symmetry, and kurtosis. Cameron and 267
  • 307. SPECIFICATION TESTS AND MODEL SELECTION Trivedi (1998) give some additional examples and reference to results for the linear exponential family. More generally, m-tests can be interpreted in a conditional moment framework as follows. Begin with an added variable test in a linear regression model. Suppose we want to test whether β2 = 0 in the model y = x 1β1 + x 2β2 + u. This is a test of H0 : E[y − x 1β1|x] = 0 against Ha : E[y − x 1β1|x] = x 2β2. The most powerful test of H0 : β2 = 0 in regression of y − x 1β1 on x2 is based on the efficient GLS estimator β2 = N i=1 x2i x 2i σ2 i '−1 N i=1 x2i (yi − x 1i β1) σ2 i , where σ2 i = V[yi |xi ] under H0 and independence over i is assumed. This test is equiv- alent to a test based on the second sum alone, which is an m-test of E x2i (yi − x 1i β1) σ2 i = 0. (8.28) Reversing the process, we can interpret an m-test based on (8.28) as a CM test of H0 : E[y − x 1β1|x] = 0 against Ha : E[y − x 1β1|x] = x 2β2. Also, an m-test based on E x2 y − x 1β1 = 0 can be interpreted as a CM test of H0 : E[y − x 1β1|x] = 0 against Ha : E[y − x 1β1|x] = σ2 y|xx 2β2, where σ2 y|x = V[y|x] under H0. More generally, suppose we start with the conditional moment restriction E[r(yi , xi , θ)|xi ] = 0, (8.29) for some scalar function r(·). Then an m-test based on the unconditional moment restriction E[g(xi )r(yi , xi , θ)] = 0 (8.30) can be interpreted as a CM test with null and alternative hypotheses H0 : E[r(yi , xi , θ)|xi ] = 0, (8.31) Ha : E[r(yi , xi , θ)|xi ] = σ2 i g(xi ) γ, where σ2 i = V[r(yi , xi , θ)|xi ] under H0. This approach gives a guide to the directions in which a CM test has power. Al- though (8.30) suggests power is in the general direction of g(x), from (8.31) a more precise statement is that it is instead the direction of g(x) multiplied by the variance of r(y, x, θ). The distinction is important because many cross-section applications this variance is not constant across observations. For further details and references see Cameron and Trivedi (1998), who call this a regression-based CM test. The approach generalizes to vector r(·), though with more cumbersome algebra. An m-test is a test of a finite number of moment conditions. It is therefore possible to construct a dgp for which the underlying conditional moment condition, such as that in (8.29), is false yet the moment conditions are satisfied. Then the CM test is inconsistent as it fails to reject with probability one as N → ∞. Bierens (1990) proposed a way to specify g(x) in (8.30) that ensures a consistent conditional moment test, for tests of functional form in the nonlinear regression model where r(y, x, θ) = y − f (x, θ). 268
  • 308. 8.2. M-TESTS Ensuring the consistency of the test does not, however, ensure that it will have high power against particular alternatives. 8.2.9. m-Tests Example To illustrate various m-tests we consider the Poisson regression model introduced in Section 5.2, with Poisson density f (y) = e−µ µy /y! and µ = exp(x β). We wish to test H0 : E[m(y, x, β)] = 0, for various choices of m(·). This test will be conducted under the assumption that the dgp is indeed the specified Poisson density. Auxiliary Regressions Since estimation is by ML we can use the m-test statistic M∗ computed as N times the uncentered R2 from auxiliary regression (8.5), where 1 = m(yi , xi , β) δ + (yi − exp(x i β))x i γ+ui , (8.32) since s = |∂ ln f (y)/∂β| β = (y − exp(x β))x and β is the MLE. Under H0 the test is χ2 (dim(m)) distributed. An alternative is the M∗∗ statistic from auxiliary regression 1 = m(y, x, z, β) δ+u. (8.33) This test is asymptotically equivalent to LM∗ if m(·) is such that E[∂m/∂β] = 0, but otherwise it is not chi-squared distributed. Moments Tested Correct specification of the conditional mean function, that is, E[y − exp(x β)|x] = 0, can be tested by an m-test of E[(y − exp(x β))z] = 0, where z may be a function of x. For the Poisson and other LEF models, z cannot equal x because the first-order conditions for βML impose the restriction that i (yi − exp(x i β))xi = 0, leading to M = 0 if z = x. Instead, z could include squares and cross- products of the regressors. Correct specification of the variance may also be tested, as the Poisson distribution implies conditional mean–variance equality. Since V[y|x]−E[y|x] = 0, with E[y|x] = exp(x β), this suggests an m-test of E[{(y − exp(x β))2 − exp(x β)}x] = 0. A variation instead tests E[{(y − exp(x β))2 − y}x] = 0, 269
  • 309. SPECIFICATION TESTS AND MODEL SELECTION as E[y|x] = exp(x β). Then m(β) = {(y − exp(xβ))2 − y}x has the property that E[∂m/∂β] = 0, so (8.7) holds and the alternative regression (8.33) yields an asymp- totically equivalent test to the regression (8.32). A standard specification test for parametric models is the IM test. For the Poisson density, D defined in (8.19) becomes D(y, x, β) = {(y − exp(x β))2 − y}xx , and we test E[{(y − exp(x β))2 − y}Vech[xx ]] = 0. Clearly for the Poisson example the IM test is a test of the first and second moment con- ditions implied by the Poisson model, a result that holds more generally for LEF mod- els. The test statistic M∗∗ is asymptotically equivalent to M∗ since here E[∂m/∂β] = 0. The Poisson assumption can also be tested using a chi-square goodness-of-fit test. For example, since few counts exceed three in the subsequent simulation example, form four cells corresponding to y = 0, 1, 2, and 3 or more, where in implementing the test the cell with y = 3 or more are dropped because probabilities sum to one. So for j = 0, . . . , 2 compute indicator di j = 1 if yi = j and di j = 0 otherwise and compute predicted probability pi j = e− µi µ j i /j!, where µi = exp(x i β). Then test E[(d − p)] = 0, where di = [di0, di1, di2] and pi = [pi0, pi1, pi2] by the auxiliary regression (8.33) where mi = di − pi . Simulation Results Data were generated from a Poisson model with mean E[y|x] = exp(β1 + β2x2), where x2 ∼ N[0, 1] and (β1, β2) = (0, 1). Poisson ML regression of y on x for a sam- ple of size 200 yielded E[y|x] = exp(−0.165 (0.089) + 1.124 (0.069) x2), where associated standard errors are in parentheses. The results of the various M-tests are given in Table 8.1. Table 8.1. Specification m-Tests for Poisson Regression Examplea Test Type H0 where µ = exp(x β) M∗ dof p-value M∗∗ 1. Correct mean E[(y − µ)x2 2 ] = 0 3.27 1 0.07 0.44 2. Variance = mean E[{(y − µ)2 − µ}x] = 0 2.43 2 0.30 1.89 3. Variance = mean E[{(y − µ)2 − y}x] = 0 2.43 2 0.30 2.41 4. Information Matrix E[{(y − µ)2 − y}Vech[xx ]] = 0 2.95 3 0.40 2.73 5. Chi-square GOF E[d − p] = 0 2.50 3 0.48 0.75 a The dgp for y is the Poisson distribution with mean parameter exp(0 + x2) and sample size N = 200. The m-test statistic M∗ is chi-squared with degrees of freedom given in the dof column and p-value given in the p-value column. The alternative test statistic M∗∗ is valid for tests 3 and 4 only. 270
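The worked IM-test computation reported next can be mirrored in code. This Python sketch re-creates the simulation design (Poisson dgp with mean exp(0 + x2) and N = 200) and computes M∗ = N − RSS from the OPG auxiliary regression (8.32); the seed and the statsmodels/NumPy tooling are our own choices, so the numerical value will differ from the 2.95 reported in the text.

```python
# IM test for the Poisson regression example via the OPG auxiliary regression.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2024)
N = 200
x2 = rng.normal(size=N)
X = sm.add_constant(x2)                                # x = (1, x2)
y = rng.poisson(np.exp(0.0 + 1.0 * x2))                # dgp: Poisson with mean exp(0 + x2)

fit = sm.Poisson(y, X).fit(disp=0)                     # Poisson MLE
mu = np.exp(X @ fit.params)
u = y - mu

vech_xx = np.column_stack([np.ones(N), x2, x2 ** 2])   # Vech[x x'] = (1, x2, x2^2)
m = ((y - mu) ** 2 - y)[:, None] * vech_xx             # IM-test moment contributions
s = u[:, None] * X                                     # score contributions (y - mu) x

W = np.column_stack([m, s])                            # regress 1 on (m_i, s_i), as in (8.32)
coef, rss, *_ = np.linalg.lstsq(W, np.ones(N), rcond=None)
rss = float(rss[0]) if rss.size else float(((np.ones(N) - W @ coef) ** 2).sum())
M_star = N - rss                                       # equals N times the uncentered R-squared
print("IM test M* =", round(M_star, 2),
      "p-value =", round(stats.chi2.sf(M_star, m.shape[1]), 3))
```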
  • 310. 8.3. HAUSMAN TEST As an example of computation of M∗ using (8.32) consider the IM test. Since x = [1, x2] and Vech[xx ] = [1, x2, x2 2 ] , the auxiliary regression is of 1 on {(y − µ)2 − y}, {(y − µ)2 − y}x2, {(y − µ)2 − y}x2 2 , (y − µ), and (y − µ)x2 and yields uncentered R2 = 0.01473 and N = 200, leading to M∗ = 2.95. The same value of M∗ is obtained directly from the uncentered explained sum of squares of 2.95, and indirectly as N minus 197.05, the residual sum of squares from this regression. The test statistic is χ2 (3) distributed with p = 0.40, so the null hypothesis is not rejected at significance level 0.05. For the chi-square goodness-of-fit test the actual frequencies are, respectively, 0.435, 0.255, and 0.110; and the corresponding predicted frequencies are 0.429, 0.241, and 0.124. This yields PCGF = 0.47 using (8.23), but this statistic is not chi-squared as it does not control for error in estimating β. The auxiliary regression for the correct statistic CGF in (8.27) leads to M∗ = 2.50, which is chi-square distributed. In this simulation all five moment conditions are not rejected at level 0.05 since the p-value for M∗ exceeds 0.05. This is as expected, as the data in this simulation example are generated from the specified density so that tests at level 0.05 should reject only 5% of the time. The alternative statistic M∗∗ is valid only for tests 3 and 4 since only then does E[∂m/∂β] = 0; otherwise, it only provides a lower bound for M. 8.3. Hausman Test Tests based on comparisons between two different estimators are called Hausman tests, after Hausman (1978), or Wu–Hausman tests or even Durbin–Wu–Hausman tests after Wu (1973) and Durbin (1954), who proposed similar tests. 8.3.1. Hausman Test Consider a test for endogeneity of a regressor in a single equation. Two alternative estimators are the OLS and 2SLS estimators, where the 2SLS estimator uses instruments to control for possible endogeneity of the regressor. If there is endogeneity then OLS is inconsistent, so the two estimators will have different probability limits. If there is no endogeneity both estimators are consistent, so the two estimators have the same probability limit. This suggests testing for endogeneity by testing for a difference between the OLS and 2SLS estimators; see Section 8.4.3 for further discussion. More generally, consider two estimators θ and θ. We consider the testing situation where H0 : plim( θ − θ) = 0, Ha : plim( θ − θ) ≠ 0. (8.34) Assume the difference between the two root-N consistent estimators is also root-N consistent under H0 with mean 0 and a limit normal distribution, so that √ N( θ − θ) d → N [0, VH] , 271
  • 311. SPECIFICATION TESTS AND MODEL SELECTION where VH denotes the variance matrix in the limiting distribution. Then the Hausman test statistic H = ( θ − θ) (N−1 VH)−1 ( θ − θ) (8.35) is asymptotically χ2 (q) distributed under H0. We reject H0 at level α if H χ2 α(q). In some applications, such as tests of endogeneity, V[ θ − θ] is of less than full rank. Then the generalized inverse is used in (8.35) and the chi-square test has degrees of freedom equal to the rank of V[ θ − θ]. The Hausman test can be applied to just a subset of the parameters. For example, interest may lie solely in the coefficient of the possibly endogenous regressor and whether it changes in moving from OLS to 2SLS. Then just one component of θ is used and the test statistic is χ2 (1) distributed. As in other settings, this test on a subset of parameters can lead to a conclusion different from that of a test on all parameters. 8.3.2. Computation of the Hausman Test Computing the Hausman test is easy in principle but difficult in practice owing to the need to obtain a consistent estimate of VH, the limit variance matrix of √ N( θ − θ). In general N−1 VH = V[ θ − θ] = V[ θ] + V[ θ] − 2Cov[ θ, θ]. (8.36) The first two quantities are readily computed from the usual output, but the third is not. Computation for Fully Efficient Estimator under the Null Hypothesis Although the essential null and alternative hypotheses of the Hausman test are as in (8.34), in applications there is usually a specific null hypothesis model and alternative hypothesis in mind. For example, in comparing OLS and 2SLS estimators the null hy- pothesis model has all regressors exogenous whereas the alternative hypothesis model permits some regressors to be endogenous. If θ is the efficient estimator in the null hypothesis model, then Cov[ θ, θ] = V[ θ]. For proof see Exercise 8.3. This implies V[ θ − θ] = V[ θ]−V[ θ], so H = ( θ − θ) V[ θ] − V[ θ] −1 ( θ − θ). (8.37) This statistic has the considerable advantage of requiring only the estimated asymptotic variance matrices of the parameter estimates θ and θ. It is helpful to use a program that permits saving parameter and variance matrix estimates and computation using matrix commands. For example, this simplification can be applied to endogeneity tests in a linear re- gression model if the errors are assumed to be homoskedastic. Then θ is the OLS estimator that is fully efficient under the null hypothesis of no endogeneity, and θ is the 2SLS estimator. Care is needed, however, to ensure the consistent estimates of the variance matrices are such that V[ θ] − V[ θ] is positive definite (see Ruud, 1984). In 272
  • 312. 8.3. HAUSMAN TEST the OLS–2SLS comparison the variance matrix estimators V[ θ] and V[ θ] should use the same estimate of the error variance σ2 . Version (8.37) of the Hausman test is especially easy to calculate by hand if θ is a scalar, or if only one component of the parameter vector is tested. Then H = ( θ − θ)2 /( s2 − s2 ) is χ2 (1) distributed, where s and s are the reported standard errors of θ and θ. Auxiliary Regressions In some leading cases the Hausman test can be more simply computed as a standard test for the significance of a subset of regressors in an augmented OLS regression, derived under the assumption that θ is fully efficient. Examples are given in Section 8.4.3 and in Section 21.4.3. Robust Hausman Tests The simpler version (8.37) of the Hausman test, and standard auxiliary regressions, requires the strong distributional assumption that θ is fully efficient. This is counter to the approach of performing robust inference under relatively weak distributional assumptions. Direct estimation of Cov[ θ, θ] and hence VH is in principle possible. Suppose θ and θ are m-estimators that solve i h1i ( θ) = 0 and i h2i ( θ) = 0. Define δ = [ θ, θ]. Then V[ δ] = G−1 0 S0(G−1 0 ) , where G0 and S0 are defined in Section 6.6, with the sim- plification that here G12 = 0. The desired V[ θ − θ] = RV[ δ]R , where R = [Iq, −Iq]. Implementation can require additional coding that may be application specific. A simpler approach is to bootstrap (see Section 11.6.3), though care is needed in some applications to ensure use of the correct degrees of freedom in the chi-square test. Another possible approach for less than fully efficient θ is to use an auxiliary re- gression that is appropriate in the efficient case but to perform the subsets of regres- sors test using robust standard errors. This robust test is simple to implement and will have power in testing the misspecification of interest, though it may not necessarily be equivalent to the Hausman test that uses the more general form of H given in (8.35). An example is given in Section 21.4.3. Finally, bounds can be calculated that do not require computation of Cov[ θ, θ]. For scalar random variables, Cov[x, y] ≤ sx sy. For the scalar case this suggests an upper bound for H of N( θ − θ)2 /( s2 + s2 − 2 s s), where s2 = V[ θ] and s2 = V[ θ]. A lower bound for H is N( θ − θ)2 /( s2 + s2 ), under the assumption that θ and θ are positively correlated. In practice, however, these bounds are quite wide. 8.3.3. Power of the Hausman Test The Hausman test is a quite general procedure that does not explicitly state an alterna- tive hypothesis and therefore need not have high power against particular alternatives. 273
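Before taking up power, the simple form (8.37) for the OLS–2SLS comparison can be sketched in code as follows, using a common error-variance estimate as just recommended and a generalized inverse to handle the rank deficiency noted in Section 8.3.1. The dgp deliberately builds in endogeneity and is purely illustrative, as are the seed and variable names; using the OLS residuals for the common σ2 estimate is one common choice.

```python
# Hausman contrast (8.37) for OLS versus 2SLS with one potentially endogenous regressor.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N = 500
z = rng.normal(size=(N, 2))                     # two instruments
e = rng.normal(size=(N, 2))
x1 = z @ [0.8, 0.5] + e[:, 0]
u = 0.5 * e[:, 0] + e[:, 1]                     # error correlated with x1: endogeneity
y = 1.0 + 1.0 * x1 + u

X = np.column_stack([np.ones(N), x1])
Z = np.column_stack([np.ones(N), z])

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0] # first-stage fitted values
b_iv = np.linalg.lstsq(Xhat, y, rcond=None)[0]  # 2SLS estimator

s2 = ((y - X @ b_ols) ** 2).sum() / (N - X.shape[1])   # common sigma^2 estimate
V_ols = s2 * np.linalg.inv(X.T @ X)
V_iv = s2 * np.linalg.inv(Xhat.T @ Xhat)

d = b_iv - b_ols
H = d @ np.linalg.pinv(V_iv - V_ols) @ d        # generalized inverse; rank 1 here
print("H =", round(H, 2), "p-value =", round(stats.chi2.sf(H, 1), 4))
```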
  • 313. SPECIFICATION TESTS AND MODEL SELECTION For example, consider tests of exclusion restrictions in fully parametric models. De- note the null hypothesis H0 : θ2 = 0, where θ is partitioned as (θ 1, θ 2) . An obvious specification test is a Hausman test of the difference θ1 − θ1, where ( θ1, θ2) is the un- restricted MLE and ( θ1, 0) is the restricted MLE of θ. Holly (1982) showed that this Hausman test coincides with a classical test (Wald, LR, or LM) of H0 : I−1 11 I12θ2 = 0, where Ii j = E ∂2 L(θ1, θ2)/∂θi ∂θj , rather than of H0 : θ2 = 0. The two tests co- incide if I12 is of full column rank and dim(θ1) ≥dim(θ2), as then I−1 11 I12θ2 = 0 iff θ2 = 0. Otherwise, they can differ. Clearly, the Hausman test will have no power against H0 if the information matrix is block diagonal as then I12 = 0. Holly (1987) extended analysis to nonlinear hypotheses. 8.4. Tests for Some Common Misspecifications In this section we present tests for some common model misspecifications. Attention is focused on test statistics that can be computed using auxiliary regressions, using minimal assumptions to permit inference robust to heteroskedastic errors. 8.4.1. Tests for Omitted Variables Omitted variables usually lead to inconsistent parameter estimates, except for special cases such as an omitted regressor in the linear model that is uncorrelated with the other regressors. It is therefore important to test for potential omitted variables. The Wald test is most often used as it is usually no more difficult to estimate the model with omitted variables included than to estimate the restricted model with omit- ted variables excluded. Furthermore, this test can use robust sandwich standard errors, though this really only makes sense if the estimator retains consistency in situations where robust sandwich errors are necessary. If attention is restricted to ML estimation an alternative is to estimate models with and without the potentially irrelevant regressors and perform an LR test. Robust forms of the LM test can be easily computed in some settings. For example, consider a test of H0 : β2 = 0 in the Poisson model with mean exp(x 1β1 + x 2β2). The LM test statistic is based on the score statistic i xi ui , where ui = yi − exp (x 1i β1) (see Section 7.3.2). Now a heteroskedastic robust estimate for the variance of N−1/2 i xi ui , where ui = yi − E[yi |xi ], is N−1 i u2 i xi x i , and it can be shown that LM+ = n i=1 xi ui ' n i=1 u2 i xi x i '−1 n i=1 xi ui ' is a robust LM test statistic that does not require the Poisson restriction that V[ui |xi ] = exp (x 1i β1) under H0. This can be computed as N times the uncentered R2 from re- gression of 1 on x1i ui and x2i ui . Such robust LM tests are possible more generally for assumed models in the linear exponential family, as the score statistic in such models is again a weighted average of a residual ui (see Wooldridge, 1991). This class includes OLS, and adaptations are also possible when estimation is by 2SLS or by NLS; see Wooldridge (2002). 274
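A minimal sketch of this robust LM test for the Poisson example follows, computed as N times the uncentered R2 (equivalently N minus the residual sum of squares) from the regression of 1 on x1i ui and x2i ui. The data, seed, and statsmodels/NumPy tooling are illustrative assumptions, not the text's own implementation.

```python
# Robust LM test for omitted regressors in a Poisson model, via N times uncentered R^2.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(17)
N = 400
x1 = sm.add_constant(rng.normal(size=N))          # included regressors (constant, x)
x2 = rng.normal(size=(N, 1))                      # candidate omitted regressor
y = rng.poisson(np.exp(0.5 * x1[:, 1]))           # dgp excludes x2, so H0 is true

fit0 = sm.Poisson(y, x1).fit(disp=0)              # restricted ML estimates
u = y - np.exp(x1 @ fit0.params)

W = np.column_stack([x1 * u[:, None], x2 * u[:, None]])
coef, rss, *_ = np.linalg.lstsq(W, np.ones(N), rcond=None)
rss = float(rss[0]) if rss.size else float(((np.ones(N) - W @ coef) ** 2).sum())
LM_plus = N - rss                                 # N times the uncentered R-squared
print("robust LM =", round(LM_plus, 2),
      "p-value =", round(stats.chi2.sf(LM_plus, x2.shape[1]), 3))
```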
  • 314. 8.4. TESTS FOR SOME COMMON MISSPECIFICATIONS 8.4.2. Tests for Heteroskedasticity Parameter estimates in linear or nonlinear regression models of the conditional mean estimated by LS or IV methods retain their consistency in the presence of het- eroskedasticity. The only correction needed is to the standard errors of these estimates. This does not require modeling heteroskedasticity, as heteroskedastic-robust standard errors can be computed under minimal distributional assumptions using the result of White (1980). So there is little need to test for heteroskedasticity, unless estimator efficiency is of great concern. Nonetheless, we summarize some results on tests for heteroskedasticity. We begin with LS estimation of the linear regression model y = x β + u. Suppose heteroskedasticity is modeled by V[u|x] = g(α1 + z α2), where z is usually a sub- set of x and g(·) is often the exponential function. The literature focuses on tests of H0 : α2 = 0 using the LM approach because, unlike Wald and LR tests, these require only OLS estimation of β. The standard LM test of Breusch and Pagan (1979) depends heavily on the assumption of normally distributed errors, as it uses the restriction that E[u4 |x4 ] = 3σ4 under H0. Koenker (1981) proposed a more robust version of the LM test, N R2 from regression of u2 i on 1 and zi , where ui is the OLS residual. This test re- quires the weaker assumption that E[u4 |x] is constant. Like the Breusch–Pagan test it is invariant to choice of the function g(·). The White (1980a) test for heteroskedasticity is equivalent to this LM test, with z = Vech[xx ]. The test can be further generalized to let E[u4 |x] vary with x, though constancy may be a reasonable assumption for the test since H0 already specifies that E[u2 |x] is constant. Qualitatively similar results carry over to nonlinear models of the conditional mean that assume a particular form of heteroskedasticity that may be tested for misspec- ification. For example, the Poisson regression model sets V[y|x] = exp (x β). More generally, for models in the linear exponential family, the quasi-MLE is consistent despite misspecified heteroskedasticity and qualitatively similar results to those here apply. Then valid inference is possible even if the model for heteroskedasticity is mis- specified, provided the robust standard errors presented in Section 5.7.4 are used. If one still wishes to test for correct specification of heteroskedasticity then robust LM tests are possible (see Wooldridge, 1991). Heteroskedasticity can lead to the more serious consequence of inconsistency of pa- rameter estimates in some nonlinear models. A leading example is the Tobit model (see Chapter 16), a linear regression model with normal homoskedastic errors that becomes nonlinear as the result of censoring or truncation. Then testing for heteroskedasticity becomes more important. A model for V[u|x] can be specified and Wald, LR, or LM tests can be performed or m-tests for heteroskedasticity can be used (see Pagan and Vella, 1989). 8.4.3. Hausman Tests for Endogeneity Instrumental variables estimators should only be used where there is a need for them, since LS estimators are more efficient if all regressors are exogenous and from Sec- tion 4.9 this loss of efficiency can be substantial. It can therefore be useful to test 275
  • 315. SPECIFICATION TESTS AND MODEL SELECTION whether IV methods are needed. A test for endogeneity of regressors compares IV estimates with LS estimates. If regressors are endogenous then in the limit these esti- mates will differ, whereas if regressors are exogenous the two estimators will not differ. Thus large differences between LS and IV estimates can be interpreted as evidence of endogeneity. This example provides the original motivation for the Hausman test. Consider the linear regression model y = x 1β1 + x 2β2 + u, (8.38) where x1 is potentially endogenous and x2 is exogenous. Let β be the OLS estimator and β be the 2SLS estimator in (8.38). Assuming homoskedastic errors so that OLS is efficient under the null hypothesis of no endogeneity, a Hausman test of endogeneity of x1 can be calculated using the test statistic H defined in (8.37). Because V[ β] − V[ β] can be shown to be not of full rank, however, a generalized inverse is needed and the degrees of freedom are dim(β1) rather than dim(β). Hausman (1978) showed that the test can more simply be implemented by test of γ = 0 in the augmented OLS regression y = x 1β1 + x 2β2 + x 1γ + u, where x1 is the predicted value of the endogenous regressors x1 from reduced form multivariate regression of x1 on the instruments z. Equivalently, we can test γ = 0 in the augmented OLS regression y = x 1β1 + x 2β2 + v 1γ+u, where v1 is the residual from the reduced form multivariate regression of x1 on the instruments z. Intuition for these tests is that if u in (8.38) is uncorrelated with x1 and x2, then γ = 0. If instead u is correlated with x1, then this will be picked up by significance of additional transformations of x1 such as x1 and v1. For cross-section data it is customary to presume heteroskedastic errors. Then the OLS estimator β is inefficient in (8.38) and the simpler version (8.37) of the Haus- man test cannot be used. However, the preceding augmented OLS regressions can still be used, provided γ = 0 is tested using the heteroskedastic-consistent estimate of the variance matrix. This should actually be equivalent to the Hausman test, as from Davidson and MacKinnon (1993, p. 239) γOLS in these augmented regressions equals AN ( β − β), where AN is a full-rank matrix with finite probability limit. Additional Hausman tests for endogeneity are possible. Suppose y = x 1β1 + x 2β2 + x 3β3 + u, where x1 is potentially endogenous x2 is assumed to be endoge- nous, and x3 is assumed to be exogenous. Then endogeneity of x1 can be tested by comparing the 2SLS estimator with just x2 instrumented to the 2SLS estima- tor with both x1 and x2 instrumented. The Hausman test can also be generalized to nonlinear regression models, with OLS replaced by NLS and 2SLS replaced by NL2SLS. Davidson and MacKinnon (1993) present augmented regressions that can be used to compute the relevant Hausman test, assuming homoskedastic errors. Mroz (1987) provides a good application of endogeneity tests including examples of computation of V[ θ − θ] when θ is not efficient. 276
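The residual-augmented version of the endogeneity test, combined with the heteroskedasticity-robust standard errors recommended above for cross-section data, can be sketched as follows. The single endogenous regressor, the instruments, the dgp, and the seed are illustrative; the reduced form regresses x1 on the exogenous regressor and the instruments.

```python
# Augmented-regression (control function) test of endogeneity of x1, robust to heteroskedasticity.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
N = 500
z = rng.normal(size=(N, 2))                       # instruments for x1
e = rng.normal(size=(N, 2))
x1 = z @ [1.0, 0.5] + e[:, 0]
x2 = rng.normal(size=N)                           # exogenous regressor
y = 1.0 + x1 + 0.5 * x2 + 0.6 * e[:, 0] + e[:, 1] # x1 endogenous via e[:, 0]

first = sm.OLS(x1, sm.add_constant(np.column_stack([x2, z]))).fit()
v1 = first.resid                                  # reduced-form residual
aug = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, v1]))).fit(cov_type="HC1")
print("t-statistic on v1:", round(aug.tvalues[-1], 2),
      "p-value:", round(aug.pvalues[-1], 4))
```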
  • 316. 8.4. TESTS FOR SOME COMMON MISSPECIFICATIONS 8.4.4. OIR Tests for Exogeneity If an IV estimator is used then the instruments must be exogenous for the IV estimator to be consistent. For just-identified models it is not possible to test for instrument exogeneity. Instead, a priori arguments need to be used to justify instrument validity. Some examples are given in Section 4.8.2. For overidentified models, however, a test for exogeneity of instruments is possible. We begin with linear regression. Then y = x β + u and instruments z are valid if E[u|z] = 0 or if E[zu] = 0. An obvious test of H0 : E[zu] = 0 is based on depar- tures of N−1 i zi ui from zero. In the just-identified case the IV estimator solves N−1 i zi ui = 0 so this test is not useful. In the overidentified case the overidentify- ing restrictions test presented in Section 6.3.8 is OIR = u Z S−1 Z u, (8.39) where u = y − X β, β is the optimal GMM estimator that minimizes u Z S−1 Z u, and S is consistent for plim N−1 i u2 i zi z i . The OIR test of Hansen (1982) is an extension of a test proposed by Sargan (1958) for linear IV, and the test statistic (8.39) is often called a Sargan test. If OIR is large then the moment conditions are rejected and the IV estimator is inconsistent. Rejection of H0 is usually interpreted as evidence that the instruments z are endogenous, but it could also be evidence of model misspecifica- tion so that in fact y = x β + u. In either case rejection indicates problems for the IV estimator. As formally derived in Section 6.3.9, OIR is distributed as χ2 (r − K) under H0, where (r − K) is the number of overidentifying restrictions. To gain some intuition for this result it is useful to specialize to homoskedastic errors. Then S = σ2 Z Z, where σ2 = u u/(N − K), so OIR = u PZ u u u/(N − K) , where PZ = Z(Z Z)−1 Z . Thus OIR is a ratio of quadratic forms in u. Under H0 the numerator has probability limit σ2 (r − K) and the denominator has plim σ2 = σ2 , so the ratio is centered on r − K, but this is the mean of a χ2 (r − K) random variable. The test statistic in (8.39) extends immediately to nonlinear regression, by simply defining ui = y − g(x, β) or u = r(y, x, β) as in Section 6.5, and to linear systems and panel estimators by appropriate definition of u (see Sections 6.9 and 6.10). For linear IV with homoskedastic errors alternative OIR tests to (8.39) have been proposed. Magdalinos (1988) contrasts a number of these tests. One can also use in- cremental OIR tests of a subset of overidentifying restrictions. 8.4.5. RESET Test A common functional form misspecification may involve neglected nonlinearity in some of the regressors. Consider the regression y = x β + u, where we assume that the regressors enter linearly and are asymptotically uncorrelated with the error u. To test for nonlinearity one straightforward approach is to enter power functions of exogenous 277
  • 317. SPECIFICATION TESTS AND MODEL SELECTION variables, most commonly squares, as additional independent regressors and test the statistical significance of these additional variables using a Wald test or an F-test. This requires the investigator to have specific reasons for considering nonlinearity, and clearly the technique will not work for categorical x variables. Ramsey (1969) suggested a test of omitted variables from the regression that can be formulated as a test of functional form. The proposal is to fit the initial regres- sion and generate new regressors that are functions of fitted values y = x β, such as w = [(x β)2 , (x β)3 , . . . , (x β)p ]. Then estimate the model y = x β + w γ + u, and the test of nonlinearity is the Wald test of p restrictions, H0 : γ = 0 against Ha : γ = 0. Typically a low value of p such as 2 or 3 is used. This test can be made robust to heteroskedasticity. 8.5. Discriminating between Nonnested Models Two models are nested if one is a special case of the other; they are nonnested if neither can be represented as a special case of the other. Discriminating between nested models is possible using a standard hypothesis test of the parametric restrictions that reduce one model to the other. In the nonnested case, however, alternative methods need to be developed. The presentation focuses on nonnested model discrimination within the likelihood framework, where results are well developed. A brief discussion of the nonlikelihood case is given in Section 8.5.4. Bayesian methods for model discrimination are pre- sented in Section 13.8. 8.5.1. Information Criteria Information criteria are log-likelihood criteria with degrees of freedom adjustment. The model with the smallest information criterion is preferred. The essential intuition is that there exists a tension between model fit, as measured by the maximized log-likelihood value, and the principle of parsimony that favors a simple model. The fit of the model can be improved by increasing model complexity. However, parameters are only added if the resulting improvement in fit sufficiently compensates for loss of parsimony. Note that in this viewpoint it is not necessary that the set of models under consideration should include the “true dgp.” Different information criteria vary in how steeply they penalize model complexity. Akaike (1973) originally proposed the Akaike information criterion AIC = −2 ln L + 2q, (8.40) where q is the number of parameters, with the model with lowest AIC preferred. The term information criterion is used because the underlying theory, presented more sim- ply in Amemiya (1980), discriminates among models using the Kullback–Liebler in- formation criterion (KLIC). A considerable number of modifications to AIC have been proposed, all of the form −2 lnL+g(q, N) for specified penalty function g(·) that exceeds 2q. The most popular 278
  • 318. 8.5. DISCRIMINATING BETWEEN NONNESTED MODELS variation is the Bayesian information criterion BIC = −2 ln L + (ln N)q, (8.41) proposed by Schwarz (1978). Schwarz assumed y has density in the exponential family with parameter θ, the jth model has parameter θj with dim[θj ] = qj ≤ dim[θ], and the prior across models is a weighted sum of the prior for each θj . He showed that under these assumptions maximizing the posterior probability (see Chapter 13) is asymptotically equivalent to choosing the model for which ln L − (ln N)qj /2 is largest. Since this is equivalent to minimizing (8.41), the procedure of Schwarz has been labeled the Bayesian information criterion. A refinement of AIC based on minimization of KLIC that is similar to BIC is the consistent AIC, CAIC = −2 ln L + (1 + ln N)q. Some authors define criteria such as AIC and BIC by additionally dividing by N in the right-hand sides of (8.40) and (8.41). If model parsimony is important, then BIC is more widely used as the model-size penalty for AIC is relatively low. Consider two nested models with q1 and q2 parameters, respectively, where q2 = q1 + h. An LR test is then possible and favors the larger model at significance level 5% if 2 ln L increases by more than χ2 .05(h). AIC favors the larger model if 2 ln L increases by more than 2h, a lesser penalty for model size than the LR test if h ≤ 7. In particular for h = 1, that is, one restriction, the LR test uses a 5% critical value of 3.84 whereas AIC uses a much lower value of 2. The BIC favors the larger model if 2 ln L increases by more than h ln N, a much larger penalty than either AIC or an LR test of size 0.05 (unless N is exceptionally small). The Bayesian information criterion increases the penalty as sample size increases, whereas traditional hypothesis tests at a significance level such as 5% do not. For nested models with q2 = q1 + 1 choosing the larger model on the basis of lower BIC is equivalent to using a two-sided t-test critical value of √ ln N, which equals 2.15, 3.03, and 3.72, respectively, for N = 10^2, 10^4, and 10^6. By comparison traditional hypothesis tests with size 0.05 use an unchanging critical value of 1.96. More generally, for a χ2 (h) distributed test statistic the BIC suggests using a critical value of h ln N rather than the customary χ2 .05(h). Given their simplicity, penalized likelihood criteria are often used for selecting “the best model.” However, there is no clear answer as to which criterion, if any, should be preferred. Considerable approximation is involved in deriving the formulas for AIC and related measures, and loss functions other than minimization of KLIC, or maximization of the posterior probability in the case of BIC, might be much more appropriate. From a decision-theoretic viewpoint, the choice of the model from a set of models should depend on the intended use of that model. For example, the purpose of the model may be to summarize the main features of a complex reality, or to predict some outcome, or to test some important hypothesis. In applied work it is quite rare to see an explicit statement of the intended use of an econometric model. 8.5.2. Cox Likelihood Ratio Test of Nonnested Models Consider choosing between two parametric models. Let model Fθ have density f (y|x, θ) and model Gγ have density g(y|x, γ). 279
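Before turning to likelihood ratio comparisons, note that the information criteria of Section 8.5.1 are computed directly from maximized log-likelihoods. The Python sketch below evaluates AIC (8.40) and BIC (8.41) for a Poisson and a negative binomial fit to the same simulated overdispersed count data; the dgp, seed, and statsmodels tooling are illustrative assumptions.

```python
# Compare two count-data models by AIC (8.40) and BIC (8.41), computed by hand from llf.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(13)
N = 500
x = rng.normal(size=N)
X = sm.add_constant(x)
y = rng.negative_binomial(n=2, p=2 / (2 + np.exp(0.5 * x)))   # overdispersed counts, mean exp(0.5x)

def ic(loglik, q, N):
    return -2 * loglik + 2 * q, -2 * loglik + np.log(N) * q   # (AIC, BIC)

pois = sm.Poisson(y, X).fit(disp=0)
nb = sm.NegativeBinomial(y, X).fit(disp=0)
print("Poisson AIC, BIC:", ic(pois.llf, len(pois.params), N))
print("NegBin  AIC, BIC:", ic(nb.llf, len(nb.params), N))
```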
  • 319. SPECIFICATION TESTS AND MODEL SELECTION A likelihood ratio test of the model Fθ against Gγ is based on LR( θ, γ) ≡ L f ( θ) − Lg( γ) = N i=1 ln f (yi |xi , θ) g(yi |xi , γ) . (8.42) If Gγ is nested in Fθ then, from Section 7.3.1, 2LR( θ, γ) is chi-square distributed under the null hypothesis that Fθ = Gγ . However, this result no longer holds if the models are nonnested. Cox (1961, 1962b) proposed solving this problem in the special case that Fθ is the true model but the models are not nested, by applying a central limit theorem under the assumption that Fθ is the true model. This approach is computationally awkward to implement if one cannot analytically obtain E f [ln( f (y|x, θ)/g(y|x, γ))], where E f denotes expectation with respect to the density f (y|x, θ). Furthermore, if a similar test statistic is obtained with the roles of Fθ and Gγ reversed it is possible to find both that model Fθ is rejected in favor of Gγ and that model Gγ is rejected in favor of Fθ. The test is therefore not necessarily one of model selection as it does not necessarily select one or the other; instead it is a model specification test that zero, one, or two of the models can pass. The Cox statistic has been obtained analytically in some cases. For nonnested linear regression models y = x β + u and y = z γ + v with homoskedastic nor- mally distributed errors (see Pesaran, 1974). For nonnested transformation models h(y) = x β + u and g (y) = z γ + v, where h(y) and g(y) are known transforma- tions; see Pesaran and Pesaran (1995), who use a simulation-based approach. This permits, for example, discrimination between linear and log-linear parametric mod- els, with h(·) the identity transformation and g(·) the log transformation. Pesaran and Pesaran (1995) apply the idea to choosing between logit and probit models presented in Chapter 14. 8.5.3. Vuong Likelihood Ratio Test of Nonnested Models Vuong (1989) provided a very general distribution theory for the LR test statistic that covers both nested and nonnested models and more remarkably permits the dgp to be an unknown density that differs from both f (·) and g(·). The asymptotic results of Vuong, presented here to aid understanding of the variety of tests presented in Vuong’s paper, are relatively complex as in some cases the test statistic is a weighted sum of chi-squares with weights that can be difficult to compute. Vuong proposed a test of H0 : E0 ln f (y|x, θ) g(y|x, γ) = 0, (8.43) where E0 denotes expectation with respect to the true dgp h(y|x), which may be un- known. This is equivalent to testing Eh[ln(h/g)]−Eh[ln(h/f )] = 0, or testing whether the two densities f and g have the same Kullback–Liebler information criterion (see Section 5.7.2). One-sided alternatives are possible with Hf : E0[ln( f/g)] 0 and Hg : E0[ln( f/g)] 0. 280
  • 320. 8.5. DISCRIMINATING BETWEEN NONNESTED MODELS An obvious test of H0 is an m-test of whether the sample analogue LR( θ, γ) defined in (8.42) differs from zero. Here the distribution of the test statistic is to be obtained with possibly unknown dgp. This is possible because from Section 5.7.1 the quasi- MLE θ converges to the pseudo-true value θ∗ and √ N( θ − θ∗ ) has a limit normal distribution, with a similar result for the quasi-MLE γ. General Result The resulting distribution of LR( θ, γ) varies according to whether or not the two mod- els, both possibly incorrect, are equivalent in the sense that f (y|x, θ∗) = g(y|x, γ∗), where θ∗ and γ∗ are the pseudo-true values of θ and γ. If f (y|x, θ∗) = g(y|x, γ∗) then 2LR( θ, γ) d → Mp+q(λ∗), (8.44) where p and q are the dimensions of θ and γ and Mp+q(λ∗) denotes the cdf of the weighted sum of chi-squared variables p+q j=1 λ∗ j Z2 j . The Z2 j are iid χ2 (1) and λ∗ j are the eigenvalues of the (p + q) × (p + q) matrix W = −B f (θ∗)A f (θ∗)−1 −B f g(θ∗, γ∗)Ag(γ∗)−1 −Bg f (γ∗, θ∗)A f (θ∗)−1 −Bg(γ∗)Ag(γ∗)−1 , (8.45) where A f (θ∗) = E0[∂2 ln f/∂θ∂θ ], B f (θ∗) = E0[(∂ ln f/∂θ)(∂ ln f/∂θ )], the matri- ces Ag(γ∗) and Bg(γ∗) are similarly defined for the density g(·), the cross-matrix B f g(θ∗, γ∗) = E0[(∂ ln f/∂θ)(∂ ln g/∂γ )], and expectations are with respect to the true dgp. For explanation and derivation of these results see Vuong (1989). If instead f (y|x, θ∗) = g(y|x, γ∗), then under H0 N−1/2 LR( θ, γ) d → N[0, ω2 ∗], (8.46) where ω2 ∗ = V0 ln f (y|x, θ∗) g(y|x, γ∗) , (8.47) and the variance is with respect to the true dgp. For derivation again see Vuong (1989). Use of these results varies with whether or not one model is assumed to be correctly specified and with the nesting relationship between the two models. Vuong differentiated among three types of model comparisons. The models Fθ and Gγ are (1) nested with Gγ nested in Fθ if Gγ ⊂ Fθ; (2) strictly nonnested models if and only if Fθ ∩ Gγ = φ so that neither model can specialize to the other; and (3) overlapping if Fθ ∩ Gγ = φ and Fθ Gγ and Gγ Fθ. Similar distinctions are made by Pesaran and Pesaran (1995). Both (2) and (3) are nonnested models, but they require different testing procedures. Examples of strictly nonnested models are linear models with different error distribu- tions and nonlinear regression models with the same error distributions but different functional forms for the conditional mean. For overlapping models some specializa- tions of the two models are equal. An example is linear models with some regressors in common and some regressors not in common. 281
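The weighted chi-square distribution Mp+q(λ∗) appearing in (8.44) has no simple closed form, but once the eigenvalues have been estimated its p-values and critical values are easily obtained by simulation, which is how the example of Section 8.5.5 proceeds. A minimal sketch follows (ours, not the book's code); the eigenvalues and test statistic shown are simply the values quoted in Section 8.5.5, and the number of draws and seed are arbitrary.

```python
import numpy as np

def weighted_chisq_pvalue(stat, eigenvalues, draws=10_000, seed=12345):
    """Simulate the null distribution sum_j lambda_j * chi2(1) of (8.44) and
    return the p-value of `stat` together with the 5% critical value."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(eigenvalues, dtype=float)
    sims = (rng.standard_normal((draws, lam.size)) ** 2) @ lam  # draws of sum_j lambda_j Z_j^2
    return np.mean(sims >= stat), np.quantile(sims, 0.95)

# Eigenvalues and statistic quoted in the example of Section 8.5.5.
pval, crit05 = weighted_chisq_pvalue(stat=69.14, eigenvalues=[0.29, 1.00, 1.06, 1.48, 2.75])
print(pval, crit05)   # a p-value close to zero and a 5% critical value near 16
```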
Nested Models

For nested models it is necessarily the case that f(y|x, θ∗) = g(y|x, γ∗). For Gγ nested in Fθ, H0 is tested against Hf : E0[ln(f/g)] > 0.

For density possibly misspecified, the weighted chi-square result (8.44) is appropriate, using the eigenvalues λ̂j of the sample analogue of W in (8.45). Alternatively, one can use the eigenvalues λ̃j of the sample analogue of the smaller matrix

    W̃ = Bf(θ∗)[D(γ∗) Ag(γ∗)^{-1} D(γ∗)′ − Af(θ∗)^{-1}],

where D(γ∗) = ∂φ(γ∗)/∂γ′ and the constrained quasi-MLE is θ̃ = φ(γ̂); see Vuong (1989). This result provides a robustified version of the standard LR test for nested models. If the density f(·) is actually correctly specified, or more generally satisfies the IM equality, we get the expected result that 2LR(θ̂, γ̂) →d χ²(p − q), as then (p − q) of the eigenvalues of W or W̃ equal one whereas the others equal zero.

Strictly Nonnested Models

For strictly nonnested models it is necessarily the case that f(y|x, θ∗) ≠ g(y|x, γ∗). The normal distribution result (8.46) is applicable, and a consistent estimate of ω²∗ is

    ω̂² = N^{-1} Σi [ln( f(yi|xi, θ̂)/g(yi|xi, γ̂) )]² − [ N^{-1} Σi ln( f(yi|xi, θ̂)/g(yi|xi, γ̂) ) ]².    (8.48)

Thus form

    TLR = N^{-1/2} LR(θ̂, γ̂)/ω̂ →d N[0, 1].    (8.49)

For tests with critical value c, H0 is rejected in favor of Hf : E0[ln(f/g)] > 0 if TLR > c, H0 is rejected in favor of Hg : E0[ln(f/g)] < 0 if TLR < −c, and discrimination between the two models is not possible if |TLR| ≤ c. The test can be modified to permit log-likelihood penalties similar to AIC and BIC; see Vuong (1989, p. 316). An asymptotically equivalent statistic to (8.49) replaces ω̂² by ω̃², equal to just the first term in the right-hand side of (8.48).

This test assumes that both models are misspecified. If instead one of the models is assumed to be correctly specified, the Cox test approach of Section 8.5.2 needs to be used.

Overlapping Models

For overlapping models it is not clear a priori whether or not f(y|x, θ∗) = g(y|x, γ∗), and one needs to first test this condition. Vuong (1989) proposes testing whether or not the variance ω²∗ defined in (8.47) equals zero, since ω²∗ = 0 if and only if f(·) = g(·). Thus compute ω̂² in (8.48). Under Hω0 : ω²∗ = 0,

    N ω̂² →d Mp+q(λ∗),    (8.50)
  • 322. 8.5. DISCRIMINATING BETWEEN NONNESTED MODELS where the Mp+q(λ∗) distribution is defined after (8.44). Hypothesis Hω 0 is rejected at level α if N ω2 exceeds the upper α percentile of the Mp+q( λ) distribution, using the eigenvalues λj of the sample analogue of W in (8.45). Alternatively, and more simply, one can test the conditions that θ∗ and γ∗ must satisfy for f (·) = g(·). Examples are given in Lien and Vuong (1987). If Hω 0 is not rejected, or the conditions for f (·) = g(·) are not rejected, conclude that it is not possible to discriminate between the two models given the data. If Hω 0 is rejected, or the conditions for f (·) = g(·) are rejected, then test H0 against Hf or Hg using TLR as detailed in the strictly nonnested case. In this latter case the significance level is at most the maximum of the significance levels for each of the two tests. This test assumes that both models are misspecified. If instead one of the models is assumed to be correctly specified, then the other model must also be correctly specified for the two models to be equivalent. Thus f (y|x, θ∗) = g(y|x, γ∗) under H0, and one can directly move to the LR test using the weighted chi-square result (8.44). Let c1 and c2 be upper tail and lower tail critical values, respectively. If 2LR( θ, γ) c1 then H0 is rejected in favor of Hf ; if 2LR( θ, γ) c2 then H0 is rejected in favor of Hg; and the test is otherwise inconclusive. 8.5.4. Other Nonnested Model Comparisons The preceding methods are restricted to fully parametric models. Methods for discrim- inating between models that are only partially parameterized, such as linear regression without the assumption of normality, are less clear-cut. The information criteria of Section 8.5.1 can be replaced by criteria developed using loss functions other than KLIC. A variety of measures corresponding to different loss functions are presented in Amemiya (1980). These measures are often motivated for nested models but may also be applicable to nonnested models. A simple approach is to compare predictive ability, selecting the model with low- est value of mean-squared error (N − q)−1 i (yi − yi )2 . For linear regression this is equivalent to choosing the model with highest adjusted R2 , which is generally viewed as providing too small a penalty for model complexity. An adaptation for nonparamet- ric regression is leave-one-out cross-validation (see Section 9.5.3). Formal tests to discriminate between nonnested models in the nonlikelihood case often take one of two approaches. Artificial nesting, proposed by Davidson and MacKinnon (1984), embeds the two nonnested models into a more general artificial model and leads to so-called J tests and P tests and related tests. The encompassing principle, proposed by Mizon and Richard (1986), leads to a quite general framework for testing one model against a competing nonnested model. White (1994) links this approach with CM tests. For a summary of this literature see Davidson and MacKinnon (1993, chapter 11). 8.5.5. Nonnested Models Example A sample of 100 observations is generated from a Poisson model with mean E[y|x] = exp(β1 + β2x2 + β3x3), where x2, x3 ∼ N[0, 1], and (β1, β2, β3) = (0.5, 0.5, 0.5). 283
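Table 8.2 summarizes the comparisons the authors report for this design. As a rough illustration of how such numbers can be produced (this is not the authors' code; the random seed and the use of statsmodels' Poisson estimator are our own choices, so the figures will not match the table exactly), one might proceed as follows.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10101)          # arbitrary seed, so results differ from Table 8.2
N = 100
x2, x3 = rng.standard_normal(N), rng.standard_normal(N)
y = rng.poisson(np.exp(0.5 + 0.5 * x2 + 0.5 * x3))          # the Poisson dgp above

X1 = sm.add_constant(x2)                                    # Model 1: intercept and x2
X2 = sm.add_constant(np.column_stack((x3, x3 ** 2)))        # Model 2: intercept, x3, x3^2
m1 = sm.Poisson(y, X1).fit(disp=0)
m2 = sm.Poisson(y, X2).fit(disp=0)

for name, m, q in [("Model 1", m1, 2), ("Model 2", m2, 3)]:
    aic = -2 * m.llf + 2 * q                                # AIC, equation (8.40)
    bic = -2 * m.llf + np.log(N) * q                        # BIC, equation (8.41)
    print(name, round(-2 * m.llf, 2), round(aic, 2), round(bic, 2))
```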
Table 8.2. Nonnested Model Comparisons for Poisson Regression Example (N = 100)

  Test                        Model 1               Model 2     Conclusion
  −2 ln L                     366.86                352.18      Model 2 preferred
  AIC                         370.86                358.18      Model 2 preferred
  BIC                         376.07                366.00      Model 2 preferred
  N ω̂²                        7.84 with p = 0.000               Can discriminate
  TLR = N^{-1/2} LR/ω̂         −0.883 with p = 0.377             No model favored

  Note: Model 1 is Poisson regression of y on an intercept and x2. Model 2 is Poisson regression of y on an intercept, x3, and x3². The final two rows are for the Vuong test for overlapping models (see the text).

The dependent variable y has sample mean 1.92 and standard deviation 1.84. Two incorrect nonnested models were estimated by Poisson regression:

  Model 1: Ê[y|x] = exp(0.608 + 0.291 x2),              with t-statistics 8.08 and 4.03,
  Model 2: Ê[y|x] = exp(0.493 + 0.359 x3 + 0.091 x3²),  with t-statistics 5.14, 5.10, and 1.78.

The first three rows of Table 8.2 give various information criteria, with the model with the smallest value preferred. The first does not penalize the number of parameters and favors model 2. The second and third measures, defined in (8.40) and (8.41), give a larger penalty to model 2, which has an additional parameter, but still lead to the larger model 2 being favored.

The final two rows of Table 8.2 summarize Vuong's test, here a test of overlapping models. First, test the condition of equality of the densities when evaluated at the pseudo-true values. The statistic ω̂² in (8.48) is easily computed given expressions for the densities. The difficult part is computing an estimate of the matrix W in (8.45). For the Poisson density we can use Â and B̂ defined at the end of Section 5.2.3 and B̂fg = N^{-1} Σi (yi − μ̂fi)(yi − μ̂gi) xfi x′gi. The eigenvalues of Ŵ are λ̂1 = 0.29, λ̂2 = 1.00, λ̂3 = 1.06, λ̂4 = 1.48, and λ̂5 = 2.75. The p-value for the test statistic N ω̂², with distribution given in (8.44), is obtained as the proportion of draws of Σ_{j=1}^{5} λ̂j zj², say 10,000 draws, that exceed N ω̂² = 69.14. Here p = 0.000 < 0.05 and we conclude that it is possible to discriminate between the models. The critical value at level 0.05 in this example equals 16.10, quite a bit higher than χ².05(5) = 11.07.

Given that discrimination is possible, the second test can be applied. Here TLR = −0.883 favors the second model, since it is negative. However, using a standard normal two-tail test at 5%, the difference is not statistically significant. In this example ω̂² is quite large, which means the first test statistic N ω̂² is large but the second test statistic N^{-1/2} LR(θ̂, γ̂)/ω̂ is small.
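A sketch of how the two Vuong statistics in Table 8.2 might be computed is given below. This is our own illustration, not the book's code: it regenerates data from the same dgp with an arbitrary seed, uses statsmodels and scipy for the Poisson fits and log densities, and forms the sample analogues of A, B, and Bfg for the Poisson case as described above, so the numerical values will differ from those reported in the table.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import poisson

rng = np.random.default_rng(10101)                          # arbitrary seed
N = 100
x2, x3 = rng.standard_normal(N), rng.standard_normal(N)
y = rng.poisson(np.exp(0.5 + 0.5 * x2 + 0.5 * x3))

Xf = sm.add_constant(x2)                                    # model F: intercept, x2
Xg = sm.add_constant(np.column_stack((x3, x3 ** 2)))        # model G: intercept, x3, x3^2
mf = sm.Poisson(y, Xf).fit(disp=0).predict()                # fitted Poisson means
mg = sm.Poisson(y, Xg).fit(disp=0).predict()

# Pointwise log-density ratios; their sum is LR in (8.42) and their
# variance (ddof = 0) is omega^2-hat in (8.48).
d = poisson.logpmf(y, mf) - poisson.logpmf(y, mg)
LR, omega2 = d.sum(), d.var()

# Sample analogues of A, B, and B_fg for the Poisson case, then W as in (8.45).
def outer_mean(w, X1, X2):
    return (w[:, None, None] * X1[:, :, None] * X2[:, None, :]).mean(axis=0)

Af, Ag = -outer_mean(mf, Xf, Xf), -outer_mean(mg, Xg, Xg)
Bf, Bg = outer_mean((y - mf) ** 2, Xf, Xf), outer_mean((y - mg) ** 2, Xg, Xg)
Bfg = outer_mean((y - mf) * (y - mg), Xf, Xg)
W = np.block([[-Bf @ np.linalg.inv(Af), -Bfg @ np.linalg.inv(Ag)],
              [-Bfg.T @ np.linalg.inv(Af), -Bg @ np.linalg.inv(Ag)]])
lam = np.linalg.eigvals(W).real                             # real apart from numerical noise

# Step 1: test of H0 that omega^2 = 0 using the weighted chi-square distribution.
sims = (rng.standard_normal((10_000, lam.size)) ** 2) @ lam
p_step1 = np.mean(sims >= N * omega2)

# Step 2 (only if step 1 rejects): compare the models with T_LR of (8.49).
T_LR = LR / (np.sqrt(N) * np.sqrt(omega2))
print(N * omega2, p_step1, T_LR)
```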
  • 324. 8.6. CONSEQUENCES OF TESTING 8.6. Consequences of Testing In practice more than one test is performed before one reaches a preferred model. This leads to several complications that practitioners usually ignore. 8.6.1. Pretest Estimation The use of specification tests to choose a model complicates the distribution of an estimator. For example, suppose we choose between two estimators θ and θ on the basis of a statistical test at 5%. For instance, θ and θ may be estimators in unrestricted and restricted models. Then the actual estimator is θ+ = w θ + (1 − w) θ, where the random variable w takes value 1 if the test favors θ and 0 if the test favors θ. In short, the estimator depends on the restricted and unrestricted estimators and on a random variable w, which in turn depends on the significance level of the test. Hence θ+ is an estimator with complex properties. This is called a pretest estimator, as the estimator is based on an initial test. The distribution of θ+ has been obtained for the linear regression model under normality and is nonstandard. In theory statistical inference should be based on the distribution of θ+ . In practice inference is based on the distribution of θ if w = 1 or of θ if w = 0, ignoring the randomness in w. This is done for simplicity, as even in the simplest models the dis- tribution of the estimator becomes intractable when several such tests are performed. 8.6.2. Order of Testing Different conclusions can be drawn according to the order in which tests are con- ducted. One possible ordering is from general to specific model. For example, one may estimate a general model for demand before testing restrictions from consumer de- mand theory such as homogeneity and symmetry. Or the cycle may go from specific to general model, with regressors added as needed and additional complications such as endogeneity controlled for if present. Such orderings are natural when choosing which regressors to include in a model, but when specification tests are also being performed it is not uncommon to use both general to specific and specific to general orderings in the same study. A related issue is that of joint versus separate tests. For example, the significance of two regressors can be tested by either two individual t−tests of significance or a joint F−test or χ2 (2) test of significance. A general discussion was given in Sec- tion 7.2.7 and an example is given later in Section 18.7. 8.6.3. Data Mining Taken to its extreme, the extensive use of tests to select a model has been called data mining (Lovell, 1983). For example, one may search among several hundred possible 285
  • 325. SPECIFICATION TESTS AND MODEL SELECTION predictors of y and choose just those predictors that are significant at 5% on a two- sided test. Computer programs exist that automate such searches and are commonly used in some branches of applied statistics. Unfortunately, such broad searches will lead to discovery of spurious relationships, since a test with size 0.05 leads to er- roneous findings of statistical significance 5% of the time. Lovell pointed out that the application of such a methodology tends to overestimate the goodness-of-fit mea- sures (e.g., R2 ) and underestimate the sampling variances of regression coefficients, even when it succeeds in uncovering the variables that feature in the data-generating process. Using standard tests and reporting p-values without taking account of the model-search procedure is misleading because nominal and actual p-values are not the same. White (2001b) and Sullivan, Timmermann, and White (2001) show how to use bootstrap methods to calculate the true statistical significance of regressors. See also P. Hansen (2003). The motivation for data mining is sometimes to conserve degrees of freedom or to avoid overparameterization (“clutter”). More importantly, many aspects of speci- fication, such as the functional form of covariates, are left unresolved by underlying theory. Given specification uncertainty, justification exists for specification searching (Sargan, 2001). However, care needs to be taken especially if small samples are an- alyzed and the number of specification searches is large relative to the sample size. When the specification search is sequential, with a large number of steps, and with each step determined by a previous test outcome, the statistical properties of the pro- cedure as a whole are complex and analytically intractable. 8.6.4. A Practical Approach Applied microeconometrics research generally minimizes the problem of pretest esti- mation by making judicious use of hypothesis tests. Economic theory is used to guide the selection of regressors, to greatly reduce the number of potential regressors. If the sample size is large there is little purpose served by dropping “insignificant” variables. Final results often use regressions that include statistically insignificant regressors for control variables, such as region, industry, and occupation dummies in an earnings regression. Clutter can be avoided by not reporting unimportant coefficients in a full model specification but noting that fact in an appropriate place. This can lead to some loss of precision in estimating the key regressors of interest, such as years of school- ing in an earnings regression, but guards against bias caused by erroneously dropping variables that should be included. Good practice is to use only part of the sample (“training sample”) for specification searches and model selection, and then report results using the preferred model esti- mated using a completely separate part of the sample (“estimation sample”). In such circumstances pretesting does not affect the distribution of the estimator, if the sub- samples are independent. This procedure is usually only implemented when sample sizes are very large, because using less than the full sample in final estimation leads to a loss in estimator precision. 286
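A minimal sketch of the training-sample/estimation-sample practice described above is given below; the function name, split fraction, and seed are our own illustration rather than a recommendation from the text.

```python
import numpy as np

def split_sample(n_obs, train_frac=0.3, seed=42):
    """Return index arrays for a training sample (used only for specification
    searches) and an estimation sample (used for the final reported estimates)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_obs)
    n_train = int(train_frac * n_obs)
    return perm[:n_train], perm[n_train:]

train_idx, est_idx = split_sample(n_obs=5000)
# e.g. search over specifications using y[train_idx], X[train_idx],
# then estimate and report only the preferred model on y[est_idx], X[est_idx].
```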
  • 326. 8.7. MODEL DIAGNOSTICS 8.7. Model Diagnostics In this section we discuss goodness-of-fit measures and definitions of residuals in non- linear models. Useful measures are those that reveal model deficiency in some partic- ular dimension. 8.7.1. Pseudo-R2 Measures Goodness of fit is interpreted as closeness of fitted values to sample values of the dependent variable. For linear models with K regressors the most direct measure is the standard error of the regression, which is the estimated standard deviation of the error term, s = 1 N − K N i=1 (yi − yi )2 '1/2 . For example, a standard error of regression of 0.10 in a log-earnings regression means that approximately 95% of the fitted values are within 0.20 of the actual value of log-earnings, or within 22% of actual earnings using e0.2 1.22. This measure is the same as the in-sample root mean squared error where yi is viewed as a forecast of of yi , aside from a degrees of freedom correction. Alternatively, one can use the mean absolute error (N − K)−1 i |yi − yi |. The same measures can be used for nonlinear regression models, provided the nonlinear models lead to a predicted value yi of the dependent variable. A related measure in linear models is R2 , the coefficient of multiple determina- tion. This explains the fraction of variation of the dependent variable explained by the regressors. The statistic R2 is more commonly reported than s, even though s may be more informative in evaluating the goodness of fit. A pseudo-R2 is an extension of R2 to nonlinear regression model. There are several interpretations of R2 in the linear model. These lead to several possible pseudo-R2 measures that in nonlinear models differ and do not necessarily have the properties of lying between zero and one and increasing as regressors are added. We present several of these measures that, for simplicity, are not adjusted for degrees of freedom. One approach bases R2 on decomposition of the total sum of squares (TSS), with i (yi − ȳ)2 = i (yi − yi )2 + i ( yi − ȳ)2 + 2 i (yi − yi )( yi − ȳ). The first sum in the right-hand side is the residual sum of squares (RSS) and the second term is the explained sum of squares (ESS). This leads to two possible measures: R2 RES = 1 − RSS/TSS, R2 EXP = ESS/TSS. For OLS regression in the linear model with intercept the third sum equals zero, so R2 RES = R2 EXP. However, this simplification does not occur in other models and in gen- eral R2 RES = R2 EXP in nonlinear models. The measure R2 RES can be less than zero, R2 EXP 287
  • 327. SPECIFICATION TESTS AND MODEL SELECTION can exceed one, and both measures may decrease as regressors are added though R2 RES will increase for NLS regression of the nonlinear model as then the estimator is mini- mizing RSS. A closely related measure uses R2 COR = 3 Cor2 [yi , yi ] , the squared correlation between actual and fitted values. The measure R2 COR lies be- tween zero and one and equals R2 in OLS regression for the linear model with inter- cept. In nonlinear models R2 COR can decrease as regressors are added. A third approach uses weighted sums of squares that control for the intrinsic het- eroskedasticity of cross-section data. Let σ2 i be the fitted conditional variance of yi , where it is assumed that heteroskedasticity is explicitly modeled as is the case for FGLS and for models such as logit and Poisson. Then we can use R2 WSS = 1 − WRSS/WTSS, where the weighted residual sum of squares WRSS = i (yi − yi )2 / σ2 i , WTSS = i (yi − µ)2 / σ2 , and µ and σ2 are the estimated mean and variance in the intercept- only model. This can be called a Pearson R2 because WRSS equals the Pearson statistic, which, aside from any finite-sample corrections, should equal N if het- eroskedasticity is correctly modeled. Note that R2 WSS can be less than zero and decrease as regressors are added. A fourth approach is a generalization of R2 to objective functions other than the sum of squared residuals. Let QN (θ) denote the objective function being maximized, Q0 denote its value in the intercept-only model, Qfit denote the value in the fitted model, and Qmax denote the largest possible value of QN (θ). Then the maximum potential gain in the objective function resulting from inclusion of regressors is Qmax − Q0 and the actual gain is Qfit − Q0. This suggests the measure R2 RG = Qfit − Q0 Qmax − Q0 = 1 − Qmax − Qfit Qmax − Q0 , where the subscript RG means relative gain. For least-squares estimation the loss function maximized is minus the residual sum of squares. Then Q0 = −TSS, Qfit = −RSS, and Qmax = 0, so R2 RG = ESS/TSS for OLS or NLS regression. The measure R2 RG has the advantage of lying between zero and one and increasing as regressors are added. For ML estimation the loss function is QN (θ) = ln LN (θ). Then R2 RG cannot always be used as in some models there may be no bound on Qmax. For example, for the linear model under normality LN (β,σ2 ) →∞ as σ2 →0. For ML and quasi-ML estimation of linear exponential family models, such as logit and Poisson, Qmax is usually known and R2 RG can be shown to be an R2 based on the deviance residuals defined in the next section. A related measure to R2 RG is R2 Q = 1 − Qfit/Q0. This measure increases as re- gressors are added. It equals R2 RG if Qmax = 0, which is the case for OLS regres- sion and for binary and multinomial models. Otherwise, for discrete data this mea- sure may have upper bound less than one, whereas for continuous data the measure 288
  • 328. 8.7. MODEL DIAGNOSTICS may not be bounded between zero and one as the log-likelihood can be negative or positive. For example, for ML estimation with continuous density it is possible that Q0 = 1 and Qfit = 4, leading to R2 Q = −3, or that Q0 = −1 and Qfit = 4, leading to R2 Q = 5. For nonlinear models there is therefore no universal pseudo-R2 . The most useful measures may be R2 COR, as correlation coefficients are easily interpreted, and R2 RG in special cases that Qmax is known. Cameron and Windmeijer (1997) analyze many of the measures and Cameron and Windmeijer (1996) apply these measures to count data models. 8.7.2. Residual Analysis Microeconometrics analysis actually places little emphasis on residual analysis, com- pared to some other areas of statistics. If data sets are small then there is concern that residual analysis may lead to overfitting of the model. If the data set is large then there is a belief that residual analysis may be unnecessary as a single observation will have little impact on the analysis. We therefore give a brief summary. A more exten- sive discussion is given in, for example, McCullagh and Nelder (1989) and Cameron and Trivedi (1998, chapter 5). Econometricians have had particular interest in defining residuals in censored and truncated models. A wide range of residuals have been proposed for nonlinear regression models. Consider a scalar dependent variable yi with fitted value yi = µi = µ(xi , θ). The raw residual is ri = yi − µi . The Pearson residual is the obvious correction for het- eroskedasticity pi = (yi − µi )/ σi , where σi is an estimate of the conditional variance of yi . This requires a specification of the variance for yi , which is done for models such as the Poisson. For an LEF density (see Section 5.7.3) the deviance residual is di = sign(yi − µi ) 2[l(yi ) − l( µi )], where l(y) denotes the log-density of y|µ eval- uated at µ = y and l( µ) denotes evaluation at µ = µ. A motivation for the deviance residual is that the sum of squares of these residuals is the deviance statistic that is the generalization for LEF models of the sum of raw residuals in the linear model. The Anscombe residual is defined to be the transformation of y that is closest to normality, then standardized to mean zero and variance 1. This transformation has been obtained for LEF densities. Small-sample corrections to residuals have been proposed to account for estima- tion error in µi . For the linear model this entails division of residuals by √ 1 − hii , where hii is the ith diagonal entry in the hat matrix H = X(X X)−1 X. These residu- als are felt to have better finite-sample performance. Since H has rank K, the num- ber of regressors, the average value of hii is K/N and values of hii in excess of 2K/N are viewed as having high leverage. These results extend to LEF models with H = W1/2 X(X WX)−1 XW1/2 , where W = Diag[wii ] and wii = g (x i β)/σ2 i with g(x i β) and σ2 i the specified conditional mean and variance, respectively. McCullagh and Nelder (1989) provide a summary. More generally, Cox and Snell (1968) define a generalized residual to be any scalar function ri = r(yi , xi , θ) that satisfies some relatively weak conditions. One way that such residuals arise is that many estimators have first-order conditions of the form 289
  • 329. SPECIFICATION TESTS AND MODEL SELECTION i g(xi , θ)r(yi , xi , θ) = 0, where yi appears in the scalar r(·) but not in the vector g(·). See also White (1994). For regression models based on a normal latent variable (see Chapters 14 and 16) Chesher and Irish (1987) propose using E[ε∗ i |yi ] as the residual, where y∗ i = µi + ε∗ i is the unobserved latent variable and yi = g(y∗ i ) is the observed dependent variable. Particular choices of g(·) correspond to the probit and Tobit models. Gouriéroux et al. (1987) generalize this approach to LEF densities. A natural approach in this context is to treat residuals as missing data, along the lines of the expectation maximum algo- rithm in Section 10.3. A common use of residuals is in plots against other variables of interest. Plots of residuals against fitted values can reveal poor model fit; plots of residuals against omit- ted variables can suggest further regressors to include in the model; and plots of resid- uals against included regressors can suggest need for a different functional form. It can be helpful to include a nonparametric regression line in such plots, (see Chapter 9). If data take only a few discrete values the plots can be difficult to interpret because of clustering at just a few values, and it can be helpful to use a so-called jitter feature that adds some random noise to the data to reduce the clustering. Some parametric models imply that an appropriately defined residual should be normally distributed. This can be checked by a normal scores plot that orders residuals ri from smallest to largest and plots them against the values predicted if the resid- uals were exactly normally distributed. Thus plot ordered ri against r + sr Φ−1 ((i − 0.5)/N), where r and sr are the sample mean and standard deviation of r and Φ−1 (·) is the inverse of the standard normal cdf. 8.7.3. Diagnostics Example Table 8.3 uses the same data-generating process as in Section 8.5.5. The dependent variable y has sample mean 1.92 and standard deviation 1.84. Poisson regression of y on x3 and of y on x3 and x2 3 yields Model 1: E[y|x] = exp(0.586 (5.20) + 0.389 (7.60) x3), Model 2: E[y|x] = exp(0.493 (5.14) + 0.359 (5.10) x3 + 0.091 (1.78) x2 3 ), where t-statistics are given in parentheses. In this example all R2 measures increase with addition of x2 3 as regressor, though by quite different amounts given that in this example all but the last R2 have similar values. More generally the first three R2 are scaled similarly and R2 RES and R2 COR can be quite close, but the remaining three measures are scaled quite differently. Only the last two R2 measures are guaranteed to increase as a regressor is added, unless the objective function is the sum of squared errors. The measure R2 RG can be constructed here, as the Poisson log-likelihood is maximized if the fitted mean µi = yi for all i, leading to Qmax = i [yi ln yi − yi − ln yi !], where y ln y = 0 when y = 0. Additionally, three residuals were calculated for the second model. The sample mean and standard deviation of residuals were, respectively, 0 and 1.65 for the raw 290
residuals, 0.01 and 1.97 for the Pearson residuals, and −0.21 and 1.22 for the deviance residuals. The zero mean for the raw residual is a property of Poisson regression with an intercept included that is shared by very few other models. The larger standard deviation of the raw residuals reflects the lack of scaling and the fact that here the standard deviation of y exceeds 1. The correlations between pairs of these residuals all exceed 0.96. This is likely to happen when R² is low, so that ŷi ≈ ȳ.

Table 8.3. Pseudo-R²s: Poisson Regression Example (N = 100)

  Diagnostic                            Model 1    Model 2    Difference
  s, where s² = RSS/(N − K)              0.1662     0.1661      0.0001
  R²RES = 1 − RSS/TSS                    0.1885     0.1962     +0.0077
  R²EXP = ESS/TSS                        0.1667     0.2087     +0.0402
  R²COR = Ĉor²[yi, ŷi]                   0.1893     0.1964     +0.0067
  R²WSS = 1 − WRSS/WTSS                  0.1562     0.1695     +0.0233
  R²RG = (Qfit − Q0)/(Qmax − Q0)         0.1552     0.1712     +0.0160
  R²Q = 1 − Qfit/Q0                      0.0733     0.0808     +0.0075

  Note: Model 1 is Poisson regression of y on an intercept and x3. Model 2 is Poisson regression of y on an intercept, x3, and x3². RSS is the residual sum of squares (SS), ESS is the explained SS, TSS is the total SS, WRSS is the weighted RSS, WTSS is the weighted TSS, Qfit is the fitted value of the objective function, Q0 is the fitted value in the intercept-only model, and Qmax is the maximum possible value of the objective function given the data, which exists only for some objective functions.

8.8. Practical Considerations

m-Tests and Hausman tests are most easily implemented by use of auxiliary regressions. One should be aware that these auxiliary regressions may be valid only under distributional assumptions that are stronger than those made to obtain the usual robust standard errors of regression coefficients. Some robust tests have been presented in Section 8.4.

With a large enough data set and a fixed significance level such as 5%, the sample moment conditions implied by a model will be rejected, except in the unrealistic case that all aspects of the model (functional form, regressors, and distribution) are correctly specified. In classical testing situations this is often a desired result. In particular, with a large enough sample, regression coefficients will always be significantly different from zero, and many studies seek such a result. However, for specification tests the desire is usually to not reject, so that one can say that the model is correctly specified. Perhaps for this reason specification tests are underutilized.

As an illustration, consider tests of correct specification of life-cycle models of consumption. Unless samples are small, a dedicated specification tester is likely to reject the model at 5%. For example, suppose a model specification test statistic that is χ²(12) distributed has a p-value of 0.02 when applied to a sample with N = 3,000. It is not clear that the life-cycle model is providing a poor explanation of the
  • 331. SPECIFICATION TESTS AND MODEL SELECTION data, even though it would be formally rejected at the 5% significance level. One possibility is to increase the critical value as sample size increases using BIC (see Section 8.5.1). Another reason for underutilization of specification tests is difficulty in computation and poor size property of tests when more convenient auxiliary regressions are used to implement an asymptotically equivalent version of a test. These drawbacks can be greatly reduced by use of the bootstrap. Chapter 11 presents bootstrap methods to implement many of the tests given in this chapter. 8.9. Bibliographic Notes 8.2 The conditional moment test, due to Newey (1985) and Tauchen (1985), is a generalization of the information matrix test of White (1982). For ML estimation, the computation of the m-test by auxiliary regression generalizes methods of Lancaster (1984) and Chesher (1984) for the IM test. A good overview of m-tests is given in Pagan and Vella (1989). The m-test provides a very general framework for viewing testing. It can be shown to nest all tests, such as Wald, LM, LR, and Hausman tests. This unifying element is emphasized in White (1994). 8.3 The Hausman test was proposed by Hausman (1978), with earlier references already given in Section 8.3 and a good survey provided by Ruud (1984). 8.4 The econometrics texts by Greene (2003), Davidson and McKinnon (1993) and Wooldridge (2002) present many of the standard specification tests. 8.5 Pesaran and Pesaran (1993) discuss how the Cox (1961, 1962b) nonnested test can be implemented when an analytical expression for the expectation of the log-likelihood is not available. Alternatively, the test of Vuong (1989) can be used. 8.7 Model diagnostics for nonlinear models are often obtained by extension of results for the linear regression model to generalized linear models such as logit and Poisson models. A detailed discussion with references to the literature is given in Cameron and Trivedi (1998, Chapter 5). Exercises 8–1 Suppose y = x β + u, where u ∼ N[0,σ2 ], with parameter vector θ = [β , σ2 ] and density f (y|θ) = (1/ √ 2πσ) exp[−(y − x β)2 /2σ2 ]. We have a sample of N inde- pendent observations. (a) Explain why a test of the moment condition E[x(y − x β)3 ] is a test of the assumption of normally distributed errors. (b) Give the expressions for mi and si given in (8.5) necessary to implement the m-test based on the moment condition in part (a). (c) Suppose dim[x] =10, N = 100, and the auxiliary regression in (8.5) yields an uncentered R2 of 0.2. What do you conclude at level 0.05? (d) For this example give the moment conditions tested by White’s information matrix test. 8–2 Consider the multinomial version of the PCGF test given in (8.23) with pj replaced by pj = N−1 i Fj (xi , θ). Show that PCGF can be expressed as CGF in (8.27) 292
  • 332. 8.9. BIBLIOGRAPHIC NOTES with V = Diag[N pj ]. [Conclude that in the multinomial case Andrew’s test statistic simplifies to Pearson’s statistic.] 8–3 (Adapted from Amemiya, 1985). For the Hausman test given in Section 8.4.1 let V11 = V[ θ], V22 = V[ θ], and V12 = Cov[ θ, θ]. (a) Show that the estimator θ̄ = θ + [V11 + V22 − 2V12]−1 ( θ, θ) has asymptotic variance matrix V[θ̄] = V11 − [V11 − V12][V11 + V22 − 2V12]−1 [V11 − V12]. (b) Hence show that V[θ̄] is less than V[ θ] in the matrix sense unless Cov[ θ, θ] = V[ θ]. (c) Now suppose that θ is fully efficient. Can V[θ̄] be less than V[ θ]? What do you conclude? 8–4 Suppose that two models are non-nested and there are N = 200 observations. For model 1, the number of parameters q = 10 and ln L = −400. For model 3, q = 10 and ln L = −380. (a) Which model is favored using AIC? (b) Which model is favored using BIC? (c) Which model would be favored if the models were actually nested and we used a likelihood ratio test at level 0.05? 8–5 Use the health expenditure data of Section 16.6. The model is a probit regres- sion of DMED, an indicator variable for positive health expenditures, against the 17 regressors listed in the second paragraph of Section 16.6. You should obtain the estimates given in the first column of Table 16.1. (a) Test the joint statistical significance of the self-rated health indicators HLTHG, HLTHF, and HLTHP at level 0.05 using a Hausman test. [This may require some additional coding, depending on the package used.] (b) Is the Hausman test the best test to use here? (c) Does an information matrix test at level 0.05 support the restrictions of this model? [This will require some additional coding.] (d) Discriminate between a model that drops HLTHG, HLTHF, and HLTHP and a model that drops LC, IDP, and LPI on the basis of R2 RES, R2 EXP, R2 COR, and R2 RG. 293
  • 333. C H A P T E R 9 Semiparametric Methods 9.1. Introduction In this chapter we present methods for data analysis that require less model specifica- tion than the methods of the preceding chapters. We begin with nonparametric estimation. This makes very minimal assumptions regarding the process that generated the data. One leading example is estimation of a continuous density using a kernel density estimate. This has the attraction of pro- viding a smoother estimate than the familiar histogram. A second leading example is nonparametric regression, such as kernel regression, on a scalar regressor. This places a flexible curve on an (x, y) scatterplot with no parametric restrictions on the form of the curve. Nonparametric estimates have numerous uses, including data de- scription, exploratory analysis of data and of fitted residuals from a regression model, and summary across simulations of parameter estimates obtained from a Monte Carlo study. Econometric analysis emphasizes multivariate regression of a scalar y on a vector of regressors x. However, nonparametric methods, although theoretically possible with an infinitely large sample, break down in practice because the data need to be sliced in several dimensions, leading to too few data points in each slice. As a result econometricians have focused on semiparametric methods. These com- bine a parametric component, greatly reducing the dimensionality, with a nonpara- metric component. One important application is to permit more flexible models of the conditional mean. For example, the conditional mean E[y|x] may be parameterized to be of the single-index form g(x β), where the functional form for g(·) is not specified but is instead nonparametrically estimated, along with the unknown parameters β. An- other important application relaxes distributional assumptions that if misspecified lead to inconsistent parameter estimates. For example, we may wish to obtain consistent estimates of β in a linear regression model y = x β + ε when data on y are trun- cated or censored (see Chapter 16), without having to correctly specify the particular distribution of the error term ε. 294
  • 334. 9.2. NONPARAMETRIC EXAMPLE: HOURLY WAGE The asymptotic theory for nonparametric methods differs from that for more para- metric methods. Estimates are obtained by cutting the data into ever smaller slices as N → ∞ and estimating local behavior within each slice. Since less than N observa- tions are being used in estimating each slice the convergence rate is slower than that obtained in the preceding chapters. Nonetheless, in the simplest cases nonparamet- ric estimates are still asymptotically normally distributed. In some leading cases of semiparametric regression the estimators of parameters β have the usual property of converging at rate N−1/2 , so that scaling by √ N leads to a limit normal distribution, whereas the nonparametric component of the model converges at a slower rate N−r , r 1/2. Because nonparametric methods are local averaging methods, different choices of localness lead to different finite-sample results. In some restrictive cases there are rules or methods to determine the bandwidth or window width used in local averaging, just as there are rules for determining the number of bins in a histogram given the number of observations. In addition, it is common practice to use the nonscientific method of choosing the bandwidth that gives a graph that to the eye looks reasonably smooth yet is still capable of picking up details in the relationship of interest. Nonparametric methods form the bulk of this chapter, both because they are of intrinsic interest and because they are an essential input for semiparametric methods, presented most notably in the chapters on discrete and censored dependent-variable models. Kernel methods are emphasized as they are relatively simple to present and because “It is argued that all smoothing methods are in an asymptotic sense essentially equivalent to kernel smoothing” (Härdle, 1990, p. xi). Section 9.2 provides examples of nonparametric density estimation and nonpara- metric regression applied to data. Kernel density estimation is presented in Section 9.3. Local regression is discussed in Section 9.4, to provide motivation for the formal treatment of kernel regression given in Section 9.5. Section 9.6 presents nonparamet- ric regression methods other than kernel methods. The vast topic of semiparametric regression is then introduced in Section 9.7. 9.2. Nonparametric Example: Hourly Wage As an example we consider the hourly wage and education for 175 women aged 36 years who worked in 1993. The data are from the Michigan Panel Survey of In- come Dynamics. It is easily established that the distribution of the hourly wage is right-skewed and so we model ln wage, the natural logarithm of the hourly wage. We give just one example of nonparametric density estimation and one of nonpara- metric regression and illustrate the important role of bandwidth selection. Sections 9.3 to 9.6 then provide the underlying theory. 9.2.1. Nonparametric Density Estimate A histogram of the natural logarithm of wage is given in Figure 9.1. To provide detail the bin width is chosen so that there are 30 bins, each of width about 0.20. This is an 295
unusually narrow bin width for only 175 observations, but many details are lost with a larger bin width. The log-wage data seem to be reasonably symmetric, though they are possibly slightly left-skewed.

Figure 9.1: Histogram for natural logarithm of hourly wage. Data for 175 U.S. women aged 36 years who worked in 1993.

The standard smoothed nonparametric density estimate is the kernel density estimate defined in (9.3). Here we use the Epanechnikov kernel defined in Table 9.1. The essential decision in implementation is the choice of bandwidth. For this example Silverman's plug-in estimate defined in (9.13) yields a bandwidth of h = 0.545. Then the kernel estimate is a weighted average of those observations that have log wage within 0.21 units of the log wage at the current point of evaluation, with greatest weight placed on data closest to the current point of evaluation.

Figure 9.2: Kernel density estimates for log wage for three different bandwidths using the Epanechnikov kernel. The plug-in bandwidth is h = 0.545. Same data as Figure 9.1.

Figure 9.2 presents three kernel density estimates, with bandwidths of 0.273, 0.545, and 1.091, respectively,
corresponding to one-half the plug-in, the plug-in, and two times the plug-in bandwidth. Clearly the smallest bandwidth is too small, as it leads to too jagged a density estimate. The largest bandwidth oversmooths the data. The middle bandwidth, the plug-in value of 0.545, seems the best choice: it gives a reasonably smooth density estimate.

What might we do with this kernel density estimate? One possibility is to compare the density to the normal, by superimposing a normal density with mean equal to the sample mean and variance equal to the sample variance. The graph is not reproduced here but reveals that the kernel density estimate with preferred bandwidth 0.545 is considerably more peaked than the normal. A second possibility is to compare log-wage kernel density estimates for different subgroups, such as by educational attainment or by full-time or part-time work status.

9.2.2. Nonparametric Regression

We consider the relationship between log wage and education. The nonparametric method used here is the Lowess local regression method, a local weighted average estimator (see Equation (9.16) and Section 9.6.2). A local weighted regression line at each point x is fitted using centered subsets that include the closest 0.8N observations, the program default, where N is the sample size, and the weights decline as we move away from x. For values of x near the end points, smaller uncentered subsets are used.

Figure 9.3 gives a scatter plot of log wage against education and three Lowess regression curves for bandwidths of 0.8, 0.4, and 0.1. The first two bandwidths give similar curves. The relationship appears to be quadratic, but this may be speculative as the data are relatively sparse at low education levels, with less than 10% of the sample having less than 10 years of schooling. For the majority of the data a linear relationship may also work well. For simplicity we have not presented 95% confidence intervals or bands that might also be provided.

Figure 9.3: Nonparametric regression of log wage on education for three different bandwidths using Lowess regression. Same sample as Figure 9.1.
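A Lowess fit of the kind underlying Figure 9.3 can be produced with the lowess smoother in statsmodels; the sketch below is ours (the data are fabricated stand-ins for the schooling and log-wage variables), with frac playing the role of the 0.8 default bandwidth mentioned above.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Fabricated stand-ins: in the application, x is years of schooling and
# y is log hourly wage for the 175 women in the sample.
rng = np.random.default_rng(7)
x = rng.integers(5, 18, size=175).astype(float)
y = 0.5 + 0.10 * x + rng.normal(scale=0.5, size=175)

for frac in (0.8, 0.4, 0.1):             # fraction of the sample used in each local fit
    fit = lowess(y, x, frac=frac)        # returns (x, smoothed y) pairs sorted by x
    print(frac, fit[:3, 1])              # first few smoothed values
```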
  • 337. SEMIPARAMETRIC METHODS 9.3. Kernel Density Estimation Nonparametric density estimates are useful for comparison across different groups and for comparison to a benchmark density such as the normal. Compared to a histogram they have the advantage of providing a smoother density estimate. A key decision, analogous to choosing the number of bins in a histogram, is bandwidth choice. We focus on the standard nonparametric density estimator, the kernel density estimator. A detailed presentation is given as results also relevant for regression are more simply obtained for density estimation. 9.3.1. Histogram A histogram is an estimate of the density formed by splitting the range of x into equally spaced intervals and calculating the fraction of the sample in each interval. We give a more formal presentation of the histogram, one that extends naturally to the smoother kernel density estimator. Consider estimation of the density f (x0) of a scalar continuous random variable x evaluated at x0. Since the density is the derivative of the cdf F(x0) (i.e., f (x0) = dF(x0)/dx) we have f (x0) = lim h→0 F(x0 + h) − F(x0 − h) 2h = lim h→0 Pr [x0 − h x x0 + h] 2h . For a sample {xi , i = 1, . . . , N} of size N, this suggests using the estimator fHIST(x0) = 1 N N i=1 1(x0 − h xi x0 + h) 2h , (9.1) where the indicator function 1(A) = 1 if event A occurs, 0 otherwise. The estimator fHIST(x0) is a histogram estimate centered at x0 with bin width 2h, since it equals the fraction of the sample that lies between x0 − h and x0 + h divided by the bin width 2h. If fHIST is evaluated over the range of x at equally spaced values of x, each 2h units apart, it yields a histogram. The estimator fHIST(x0) gives all observations in x0 ± h equal weight as is clear from rewriting (9.1) as fHIST(x0) = 1 Nh N i=1 1 2 × 1 xi − x0 h 1 . (9.2) This leads to a density estimate that is a step function, even if the underlying density is continuous. Smoother estimates can be obtained by using weighting functions other than the indicator function chosen here. 298
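As a concrete illustration of (9.1)-(9.2), the following sketch (ours, not the book's) evaluates the running-histogram estimate on a grid of points; because every observation in the window receives equal weight, the resulting estimate is a step function.

```python
import numpy as np

def f_hist(x0, x, h):
    """Running-histogram density estimate (9.2) at evaluation point(s) x0."""
    x0 = np.atleast_1d(x0)
    inside = np.abs((x[None, :] - x0[:, None]) / h) <= 1    # 1(x0 - h <= x_i <= x0 + h)
    return inside.mean(axis=1) / (2 * h)

x = np.random.default_rng(1).normal(size=200)               # illustrative data
grid = np.linspace(-3, 3, 61)
print(f_hist(grid, x, h=0.5)[:5])
```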
  • 338. 9.3. KERNEL DENSITY ESTIMATION 9.3.2. Kernel Density Estimator The kernel density estimator, introduced by Rosenblatt (1956), generalizes the his- togram estimate (9.2) by using an alternative weighting function, so f (x0) = 1 Nh N i=1 K xi − x0 h . (9.3) The weighting function K(·) is called a kernel function and satisfies restrictions given in the next section. The parameter h is a smoothing parameter called the bandwidth, and two times h is the window width. The density is estimated by evaluating f (x0) at a wider range of values of x0 than used in forming a histogram; usually evaluation is at the sample values x1, . . . , xN . This also helps provide a density estimate smoother than a histogram. 9.3.3. Kernel Functions The kernel function K(·) is a continuous function, symmetric around zero, that inte- grates to unity and satisfies additional boundedness conditions. Following Lee (1996) we assume that the kernel satisfies the following conditions: (i) K(z) is symmetric around 0 and is continuous. (ii) ) K(z)dz = 1, ) zK(z)dz = 0, and ) |K(z)|dz ∞. (iii) Either (a) K(z) = 0 if |z| ≥ z0 for some z0 or (b) |z|K(z) → 0 as |z| → ∞. (iv) ) z2 K(z)dz = κ, where κ is a constant. In practice kernel functions work better if they satisfy condition (iiia) rather than just the weaker condition (iiib). Then restricting attention to the interval [−1, 1] rather than [−z0, z0] is simply a normalization for convenience, and usually K(z) is restricted to z ∈ [−1, 1]. Some commonly used kernel functions are given in Table 9.1. The uniform kernel uses the same weights as a histogram of bin width 2h, except that it produces a running histogram that is evaluated at a series of points x0 rather than using fixed bins. The Gaussian kernel satisfies (iiib) rather than (iiia) because it does not restrict z ∈ [−1, 1]. A pth-order kernel is one whose first nonzero moment is the pth moment. The first seven kernels are of second order and satisfy the second condition in (ii). The last two kernels are fourth-order kernels. Such higher order kernels can increase rates of convergence if f (x) is more than twice differentiable (see Section 9.3.10), though they can take negative values. Table 9.1 also gives the parameter δ, defined in (9.11) and used in Section 9.3.6 to aid bandwidth choice, for some of the kernels. Given K(·) and h the estimator is very simple to implement. If the kernel estimator is evaluated at r distinct values of x0 then computation of the kernel estimator requires at most Nr operations, when the kernel has unbounded support. Considerable compu- tational savings on this are possible; see, for example, Härdle (1990, p. 35). 299
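As the text notes, the estimator is simple to implement once K(·) and h are chosen. A minimal sketch of (9.3) follows (ours, not the book's); the kern argument can be any of the kernels in Table 9.1, and passing the uniform kernel reproduces the running histogram of (9.2).

```python
import numpy as np

def kernel_density(x0, x, h, kern=None):
    """Kernel density estimate (9.3) at evaluation point(s) x0 with bandwidth h."""
    if kern is None:                                         # Gaussian kernel by default
        kern = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    x0 = np.atleast_1d(x0)
    z = (x[None, :] - x0[:, None]) / h                       # (x_i - x0)/h for every pair
    return kern(z).mean(axis=1) / h                          # (1/Nh) sum_i K((x_i - x0)/h)

x = np.random.default_rng(1).normal(size=200)                # illustrative data
fhat = kernel_density(np.linspace(-3, 3, 61), x, h=0.4)
```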
Table 9.1. Kernel Functions: Commonly Used Examples

  Kernel                          Kernel Function K(z)                              δ
  Uniform (or box or rectangular) (1/2) × 1(|z| < 1)                                1.3510
  Triangular (or triangle)        (1 − |z|) × 1(|z| < 1)                            –
  Epanechnikov (or quadratic)     (3/4)(1 − z²) × 1(|z| < 1)                        1.7188
  Quartic (or biweight)           (15/16)(1 − z²)² × 1(|z| < 1)                     2.0362
  Triweight                       (35/32)(1 − z²)³ × 1(|z| < 1)                     2.3122
  Tricubic                        (70/81)(1 − |z|³)³ × 1(|z| < 1)                   –
  Gaussian (or normal)            (2π)^{-1/2} exp(−z²/2)                            0.7764
  Fourth-order Gaussian           (1/2)(3 − z²)(2π)^{-1/2} exp(−z²/2)               –
  Fourth-order quartic            (15/32)(3 − 10z² + 7z⁴) × 1(|z| < 1)              –

  Note: The constant δ is defined in (9.11) and is used to obtain Silverman's plug-in estimate given in (9.13).

9.3.4. Kernel Density Example

The key choice of bandwidth h has already been illustrated in Figure 9.2. Here we illustrate the choice of kernel using generated data, a random sample of size 100 drawn from the N[0, 25²] distribution. For the particular sample drawn the sample mean is 2.81 and the sample standard deviation is 25.27.

Figure 9.4 shows the effect of using different kernels. For the Epanechnikov, Gaussian, quartic, and uniform kernels, Silverman's plug-in estimate given in (9.13) yields bandwidths of, respectively, 0.545, 0.246, 0.646, and 0.214. The resulting kernel density estimates are very similar, even for the uniform kernel, which produces a running histogram. The variation in the density estimate with kernel choice is much less than the variation with bandwidth choice evident in Figure 9.2.

Figure 9.4: Kernel density estimates for log wage for four different kernels using the corresponding Silverman's plug-in estimate for bandwidth. Same data as Figure 9.1.
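The kernel conditions of Section 9.3.3 and the notion of kernel order are easy to verify numerically. The sketch below (our own check, not from the book) confirms that the Epanechnikov kernel integrates to one and has a nonzero second moment, whereas the fourth-order Gaussian kernel in Table 9.1 has zero second moment but a nonzero fourth moment.

```python
import numpy as np
from scipy.integrate import quad

epan = lambda z: 0.75 * (1 - z ** 2) * (abs(z) < 1)                                  # second order
gauss4 = lambda z: 0.5 * (3 - z ** 2) * np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)   # fourth order

for name, K, lim in [("Epanechnikov", epan, 1), ("4th-order Gaussian", gauss4, np.inf)]:
    total = quad(K, -lim, lim)[0]                            # should equal 1
    m2 = quad(lambda z: z ** 2 * K(z), -lim, lim)[0]         # kappa; zero for a fourth-order kernel
    m4 = quad(lambda z: z ** 4 * K(z), -lim, lim)[0]         # nonzero for a fourth-order kernel
    print(name, round(total, 4), round(m2, 4), round(m4, 4))
```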
  • 340. 9.3. KERNEL DENSITY ESTIMATION 9.3.5. Statistical Inference We present the distribution of the kernel density estimator f (x) for given choice of K(·) and h, assuming the data x are iid. The estimate f (x) is biased. This bias goes to zero asymptotically if the bandwidth h → 0 as N → ∞, so f (x) is consistent. How- ever, the bias term does not necessarily disappear in the asymptotic normal distribution for f (x), complicating statistical inference. Mean and Variance The mean and variance of f (x0) are obtained in Section 9.8.1, assuming that the second derivative of f (x) exists and is bounded and that the kernel satisfies ) zK(z)dz = 0, as assumed in property (ii) of Section 9.3.3. The kernel density estimator is biased with bias term b(x0) that depends on the bandwidth, the curvature of the true density, and the kernel used according to b(x0) = E[ f (x0)] − f (x0) = 1 2 h2 f (x0) ( z2 K(z)dz. (9.4) The kernel estimator is biased of size O(h2 ), where we use the order of magnitude notation that a function a(h) is O(hk ) if a(h)/hk is finite. The bias disappears asymp- totically if h → 0 as N → ∞. Assuming h → 0 and N → ∞, the variance of the kernel density estimator is V[ f (x0)] = 1 Nh f (x0) ( K(z)2 dz + o 1 Nh , (9.5) where a function a(h) is o(hk ) if a(h)/hk → 0. The variance depends on the sample size, bandwidth, the true density, and the kernel. The variance disappears if Nh → ∞, which requires that while h → 0 it must do so at a slower rate than N → ∞. Consistency The kernel estimator is pointwise consistent, that is, consistent at a particular point x = x0, if both the bias and variance disappear. This is the case if h → 0 and Nh → ∞. For estimation of f (x) at all values of x the stronger condition of uniform conver- gence, that is, supx0 | f (x0) − f (x0)| p → 0, can be shown to occur if Nh/ ln N → ∞. This requires h larger than for pointwise convergence. Asymptotic Normality The preceding results show that asymptotically f (x0) has mean f (x0) + b(x0) and variance (Nh)−1 f (x0) ) K(z)2 dz. It follows that if a central limit theorem can be 301
  • 341. SEMIPARAMETRIC METHODS applied, the kernel density estimator has limit distribution √ Nh( f (x0) − f (x0) − b(x0)) d → N 0, f (x0) ( K(z)2 dz . (9.6) The central limit theorem applied is a nonstandard one and requires condition (iv); see, for example, Lee (1996, p. 139) or Pagan and Ullah (1999, p. 40). It is important to note the presence of the bias term b(x0), defined in (9.4). For typical choices of bandwidth this term does not disappear, complicating computation of confidence intervals (presented in Section 9.3.7). 9.3.6. Bandwidth Choice The choice of bandwidth h is much more important than choice of kernel function K(·). There is a tension between setting h small to reduce bias and setting h large to ensure smoothness. A natural metric to use is therefore mean-squared error (MSE), the sum of bias squared and variance. From (9.4) the bias is O(h2 ) and from (9.5) the variance is O((Nh)−1 ). Intu- itively MSE is minimized by choosing h so that bias squared and variance are of the same order, so h4 = (Nh)−1 , which implies the optimal bandwidth h = O(N−0.2 ) and √ Nh = O(N0.4 ). We now give a more formal treatment that includes a practical plug- in estimate for h. Mean Integrated Squared Error A local measure of the performance of the kernel density estimate at x0 is the MSE MSE[ f (x0)] = E[( f (x0) − f (x0))2 ], (9.7) where the expectation is with respect to the density f (x). Since MSE equals variance plus squared bias, (9.4) and (9.5) yield the MSE of the kernel density estimate MSE[ f (x0)] 1 Nh f (x0) ( K(z)2 dz + 1 2 h2 f (x0) ( z2 K(z)dz .2 . (9.8) To obtain a global measure of performance at all values of x0 we begin by defining the integrated squared error (ISE) ISE(h) = ( ( f (x0) − f (x0))2 dx0, (9.9) the continuous analogue of summing squared error over all x0 in the discrete case. This is written as a function of h to emphasize dependence on the bandwidth. We then eliminate the dependence of f (x0) on x values other than x0 by taking the expected 302
  • 342. 9.3. KERNEL DENSITY ESTIMATION value of the ISE with respect to the density f (x). This yields the mean integrated squared error (MISE), MISE(h) = E [ISE(h)] = E ( ( f (x0) − f (x0))2 dx0 = ( E[( f (x0) − f (x0))2 ]dx0 = ( MSE[ f (x0)]dx0, where MSE[ f (x)] is defined in (9.8). From the preceding algebra MISE equals the integrated mean-squared error (IMSE). Optimal Bandwidth The optimal bandwidth minimizes MISE. Differentiating MISE(h) with respect to h and setting the derivative to zero yields the optimal bandwidth h∗ = δ ( f (x0)2 dx0 −0.2 N−0.2 , (9.10) where δ depends on the kernel function used, with δ = ) K(z)2 dz ) z2 K(z)dz 2 0.2 . (9.11) This result is due to Silverman (1986). Since h∗ = O(N−0.2 ), we have h∗ → 0 as N → ∞ and Nh∗ = O(N0.8 ) → ∞ as required for consistency. The bias in f (x0) is O(h∗2 ) = O(N−0.4 ), which disap- pears as N → ∞. For a histogram estimate it can be shown that h∗ = O(N−0.2 ) and MISE(h∗ ) = O(N−2/3 ), inferior to MISE(h∗ ) = O(N−4/5 ) for the kernel density estimate. The optimal bandwidth depends on the curvature of the density, with h∗ lower if f (x) is highly variable. Optimal Kernel The optimal bandwidth varies with the kernel (see (9.10) and (9.11)). It can be shown that MISE(h∗ ) varies little across kernels, provided different optimal h∗ are used for different kernels (Figure 9.4 provides an illustration). It can be shown that the optimal kernel is the Epanechnikov, though this advantage is slight. Bandwidth choice is much more important than kernel choice and from (9.10) this varies with the kernel. 303
  • 343. SEMIPARAMETRIC METHODS Plug-in Bandwidth Estimate A plug-in estimate for the bandwidth is a simple formula for h that depends on the sample size N and the sample standard deviation s. A useful starting point is to assume that the data are normally distributed. Then ) f (x0)2 dx0 = 3/(8 √ πσ5 ) = 0.2116/σ5 , in which case (9.10) specializes to h∗ = 1.3643δN−0.2 s, (9.12) where s is the sample standard deviation of x and δ is given in Table 9.1 for several kernels. For the Epanechnikov kernel h∗ = 2.345N−0.2 s, and for the Gaussian kernel h∗ = 1.059N−0.2 s. The considerably lower bandwidth for the normal kernel arises because, unlike most kernels, the normal kernel gives some weight to xi even if |xi − x0| h. In practice one uses Silverman’s plug-in estimate h∗ = 1.3643δN−0.2 min(s, iqr/1.349), (9.13) where iqr is the sample interquartile range. This uses iqr/1.349 as an alternative estimate of σ that protects against outliers, which can increase s and lead to too large an h. These plug-in estimates for h work well in practice, especially for symmetric uni- modal densities, even if f (x) is not the normal density. Nonetheless, one should also check by using variations such as twice and half the plug-in estimate. For the example in Figures 9.2 and 9.4 we have 177−0.2 = 0.3551, s = 0.8282, and iqr/1.349 = 0.6459, so (9.13) yields h∗ = 0.3173δ. For the Epanechnikov kernel, for example, this yields h∗ = 0.545 since δ = 1.7188 from Table 9.1. Cross-Validation From (9.9), ISE(h) = ) f 2 (x0)dx0 − 2 ) f (x0) f (x0)dx0 + ) f 2 (x0)dx0. The third term does not depend on h. An alternative data-driven approach estimates the first two terms in ISE(h) by CV(h) = 1 N2h i j K(2) xi − xj h − 2 N N i=1 f−i (xi ) , (9.14) where K(2) (u) = ) K(u − t)K(t)dt is the convolution of K with itself, and f−i (xi ) is the leave-one-out kernel estimator of f (xi ). See Lee (1996, p. 137) or Pagan and Ullah (1999, p. 51) for a derivation. The cross-validation estimate hCV is chosen to mini- mize 1 CV(h). It can be shown that hCV p → h∗ as N → ∞, but the rate of convergence is very slow. Obtaining hCV is computationally burdensome because 3 ISE(h) needs to be com- puted for a range of values of h. It is often not necessary to cross-validate for kernel density estimation as the plug-in estimate usually provides a good starting point. 304
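To make the plug-in rule concrete, the following minimal sketch (Python with NumPy; the function names, the Gaussian-kernel choice, and the simulated lognormal data are illustrative assumptions) computes the bandwidth from (9.13), using the Gaussian-kernel constant 1.059 quoted in the text and the robust spread measure min(s, iqr/1.349), and then evaluates the kernel density estimate on a grid.

import numpy as np

def gaussian_kernel(z):
    """Standard normal density used as the kernel weight."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def plugin_bandwidth(x):
    """Plug-in bandwidth in the spirit of (9.13) for the Gaussian kernel:
    h* = 1.059 * N^(-0.2) * min(s, iqr/1.349)."""
    n = x.size
    s = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))   # 75th minus 25th percentile
    return 1.059 * n**(-0.2) * min(s, iqr / 1.349)

def kde(x, grid, h):
    """Kernel density estimate f_hat(x0) = (Nh)^(-1) sum_i K((x_i - x0)/h) on a grid."""
    z = (x[None, :] - grid[:, None]) / h             # (n_grid, N) standardized distances
    return gaussian_kernel(z).mean(axis=1) / h       # average kernel weight, rescaled by h

# Example with simulated skewed data
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=500)
h = plugin_bandwidth(x)
grid = np.linspace(x.min(), x.max(), 100)
f_hat = kde(x, grid, h)
print(f"plug-in bandwidth h* = {h:.4f}")

As suggested above, halving and doubling the resulting h and replotting the density is a quick check that the plug-in choice is not driving the conclusions.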
  • 344. 9.3. KERNEL DENSITY ESTIMATION 9.3.7. Confidence Intervals Kernel density estimates are usually presented without confidence intervals, but it is possible to construct pointwise confidence intervals for f(x0), where pointwise means evaluated at a particular value of x0. A simple procedure is to obtain confidence intervals at a small number of evaluation points x0, say 10, that are evenly distributed over the range of x and plot these along with the estimated density curves. The result (9.6) yields the following 95% confidence interval for f(x0): f(x0) ∈ f̂(x0) − b̂(x0) ± 1.96 × √[(Nh)^(-1) f̂(x0) ∫K(z)² dz]. For most kernels ∫K(z)² dz is easily obtained by analytical methods. The situation is complicated by the bias term, which should not be ignored in finite samples, even though asymptotically b(x0) →p 0. This is because with optimal bandwidth h* = O(N^(-0.2)) the bias of the rescaled random variable √(Nh)(f̂(x0) − f(x0)) given in (9.6) does not disappear, since √(Nh*) times O(h*²) = O(1). The bias can be estimated using (9.4) and a kernel estimate of f''(x0), but in practice the estimate of f''(x0) is noisy. Instead, the usual method is to reduce the bias in computing the confidence interval, but not f̂(x0) itself, by undersmoothing, that is, by choosing h < h* so that h = o(N^(-0.2)). Other approaches include using a higher order kernel, such as the fourth-order kernels given in Table 9.1, or bootstrapping (see Section 11.6.5). One can also compute confidence bands for f(x) over all possible values of x. These are wider than the pointwise confidence intervals for each value x0. 9.3.8. Estimation of Derivatives of a Density In some cases estimates of the derivatives of a density need to be made. For example, estimation of the bias term of f̂(x0) given in (9.4) requires an estimate of f''(x0). For simplicity we present estimates of the first derivative. A finite-difference approach uses f̂'(x0) = [f̂(x0 + Δ) − f̂(x0 − Δ)]/(2Δ) for small Δ. A calculus approach instead takes the first derivative of f̂(x0) in (9.3), yielding f̂'(x0) = −(Nh²)^(-1) Σi K'((xi − x0)/h). Intuitively, a larger bandwidth should be used to estimate derivatives, which can be more variable than f̂(x0). The bias of f̂^(s)(x0) is as before but the variance converges more slowly, leading to optimal bandwidth h* = O(N^(-1/(2s+2p+1))) if f(x0) is p times differentiable. For kernel estimation of the first derivative we need p ≥ 3. 9.3.9. Multivariate Kernel Density Estimate The preceding discussion considered kernel density estimation for scalar x. For the density of the k-dimensional random variable x, the multivariate kernel density estimator is f̂(x0) = (Nh^k)^(-1) Σ(i=1 to N) K((xi − x0)/h), 305
  • 348. SEMIPARAMETRIC METHODS where K(·) is now a k-dimensional kernel. Usually K(·) is a product kernel, the prod- uct of one-dimensional kernels. Multivariate kernels such as the multivariate normal density or spherical kernels proportionate to K(z z) can also be used. The kernel K(·) satisfies properties similar to properties given in the one-dimensional case; see Lee (1996, p. 125). The analytical results and expressions are similar to those before, except that the variance of f (x0) declines at rate O(Nhk ), which for k 1 is slower than O(Nh) in the one-dimensional case. Then Nhk( f (x0) − f (x0) − b(x0)) d → N 0, f (x0) ( K(z)2 dz . The optimal bandwidth choice is h = O(N−1/(k+4) ), which is larger than O(N−0.2 ) in the one-dimensional case, and implies √ Nhk = O(N2/(4+k) ). The plug-in and cross- validation methods can be extended to the multivariate case. For the product normal kernel Scott’s plug-in estimate for the jth component of x is h j = N−1/(k+4) sj , where sj is the sample standard deviation of xj . Problems of sparseness of data are more likely to arise with a multivariate kernel. There is a curse of dimensionality, as fewer observations in the vicinity of x0 receive substantial weight when x is of higher dimension. Even when this is not a problem, plotting even a bivariate kernel density estimate requires a three-dimensional plot that can be difficult to read and interpret. One use of a multivariate kernel density estimate is to permit estimation of a conditional density. Since f (y|x) = f (x, y)/f (x), an obvious estimator is f (y|x) = f (x, y)/ f (x), where f (x, y) and f (x) are bivariate and univariate kernel density estimates. 9.3.10. Higher Order Kernels The preceding analysis assumes f (x) is twice differentiable, a necessary assumption to obtain the bias term in (9.4). If f (x) is more than twice differentiable then using higher order kernels (see Section 9.3.3 for fourth-order examples) reduces the order of the bias, leading to smaller h∗ and faster rates of convergence. A general statement is that if x is k dimensional and f (x) is p times differentiable and a pth-order kernel is used, then the kernel estimate f (x0) of f (x) has optimal rate of convergence N−p/(2p+k) when h∗ = O(N−1/(2p+k) ). 9.3.11. Alternative Nonparametric Density Estimates The kernel density estimate is the standard nonparametric estimate. Other density es- timates are presented, for example, in Pagan and Ullah (1999). These often use ap- proaches such as nearest-neighbors methods that are more commonly used in non- parametric regression and are presented briefly in Section 9.6. 306
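Returning to the conditional density estimator f̂(y|x) = f̂(x, y)/f̂(x) mentioned in Section 9.3.9, a minimal sketch with product Gaussian kernels follows (Python/NumPy; the bandwidth constants, which use the slower N^(-1/(k+4)) scaling with k = 2, and the simulated linear-Gaussian design are illustrative assumptions rather than recommendations).

import numpy as np

def gauss(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def kde_1d(x, x0, h):
    """Univariate kernel density estimate at a single point x0."""
    return gauss((x - x0) / h).mean() / h

def kde_2d_product(x, y, x0, y0, hx, hy):
    """Bivariate estimate f_hat(x0, y0) using a product of Gaussian kernels."""
    return (gauss((x - x0) / hx) * gauss((y - y0) / hy)).mean() / (hx * hy)

def conditional_density(x, y, x0, y0, hx, hy):
    """f_hat(y0 | x0) = f_hat(x0, y0) / f_hat(x0)."""
    return kde_2d_product(x, y, x0, y0, hx, hy) / kde_1d(x, x0, hx)

# Example: y | x ~ N(2x, 1), so f(y=0 | x=0) should be close to 0.399
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)
hx = 1.059 * x.std(ddof=1) * x.size**(-1.0/6.0)
hy = 1.059 * y.std(ddof=1) * y.size**(-1.0/6.0)
print(conditional_density(x, y, x0=0.0, y0=0.0, hx=hx, hy=hy))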
  • 349. 9.4. NONPARAMETRIC LOCAL REGRESSION 9.4. Nonparametric Local Regression We consider regression of scalar dependent variable y on a scalar regressor variable x. The regression model is yi = m(xi ) + εi , i = 1, . . . , N, εi ∼ iid [0, σ2 ε ]. (9.15) The complication is that the functional form m(·) is not specified, so NLS estimation is not possible. This section provides a simple general treatment of nonparametric regression us- ing local weighted averages. Specialization to kernel regression is given in Section 9.5 and other commonly used local weighted methods are presented in Section 9.6. 9.4.1. Local Weighted Averages Suppose that for a distinct value of the regressor, say x0, there are multiple obser- vations on y, say N0 observations. Then an obvious simple estimator for m(x0) is the sample average of these N0 values of y, which we denote m(x0). It follows that m(x0) ∼ m(x0), N−1 0 σ2 ε , since it is the average of N0 observations that by (9.15) are iid with mean m(x0) and variance σ2 ε . The estimator m(x0) is unbiased but not necessarily consistent. Consistency requires N0 → ∞ as N → ∞, so that V[ m(x0)] → 0. With discrete regressors this estimator may be very noisy in finite samples because N0 may be small. Even worse, for con- tinuous regressors there may be only one observation for which xi takes the particular value x0, even as N → ∞. The problem of sparseness in data can be overcome by averaging observed values of y when x is close to x0, in addition to when x exactly equals x0. We begin by noting that the estimator m(x0) can be expressed as a weighted average of the dependent variable, with m(x0) = i wi0 yi , where the weights wi0 equal 1/N0 if xi = x0 and equal 0 if xi = x0. Thus the weights vary with both the evaluation point x0 and the sample values of the regressors. More generally we consider the local weighted average estimator m(x0) = N i=1 wi0,h yi , (9.16) where the weights wi0,h = w(xi , x0, h) sum to one, so i wi0,h = 1. The weights are specified to increase as xi becomes closer to x0. The additional parameter h is generic notation for a window width parameter. It is defined so that smaller values of h lead to a smaller window and more weight being placed on those observations with xi close to x0. In the specific example of kernel regression, h is the bandwidth. Other methods given in Section 9.6 have alternative smoothing parameters that play a similar role to h here. As h becomes smaller m(x0) 307
  • 350. SEMIPARAMETRIC METHODS becomes less biased, as only observations close to x0 are being used, but more variable, as fewer observations are being used. The OLS predictor for the linear regression model is a weighted average of yi , since some algebra yields mOLS(x0) = N i=1 5 1 N + (x0 − x̄)(xi − x̄) j (xj − x̄)2 6 yi . The OLS weights, however, can actually increase with increasing distance between x0 and xi if, for example, xi x0 x̄. Local regression instead uses weights that are decreasing in |xi − x0|. 9.4.2. K-Nearest Neighbors Example We consider a simple example, the unweighted average of the y values correspond- ing to the closest (k − 1)/2 observations on x less than x0 and the closest (k − 1)/2 observations on x greater than x0. Order the observations by increasing x values. Then evaluation at x0 = xi yields mk(xi ) = 1 k (yi−(k−1)/2 + · · · + yi+(k−1)/2), where for simplicity k is odd, and potential modifications caused by ties and values of x0 close to the end points x1 or xN are ignored. This estimator can be expressed as a special case of (9.16) with weight wi0,k = 1 k × 1 |i − 0| k − 1 2 , x1 x2 · · · x0 · · · xN . This estimator has many names. We refer to it as a (symmetrized) k–nearest neigh- bors estimator (k−NN), defined in Section 9.6.1. It is also a standard local running average or running mean or moving average of length k centered at x0 that is used, for example, to plot a time series y against time x. The parameter k plays the role of the window width h in Section 9.4.1, with small k corresponding to small h. As an illustration, consider data generated from the model yi = 150 + 6.5xi − 0.15x2 i + 0.001x3 i + εi , i = 1, . . . , 100, (9.17) xi = i, εi ∼ N[0, 252 ]. The mean of y is a cubic in x, with x taking values 1, 2, . . . , 100, with turning points at x = 20 and x = 80. To this is added a normally distributed error term with standard deviation 25. Figure 9.5 plots the symmetrized k–NN estimator with k = 5 and 25. Both moving averages suggest a cubic relationship. The second is smoother than the first but is still quite jagged despite one-quarter of the sample being used to form the average. The OLS regression line is also given on the diagram. 308
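A minimal sketch of this symmetrized k-NN (moving-average) estimator on data simulated from (9.17) follows (Python/NumPy; the seed is arbitrary, and the one-sided treatment of the end points is made explicit rather than ignored).

import numpy as np

# Simulate the dgp in (9.17): cubic conditional mean plus N(0, 25^2) noise.
rng = np.random.default_rng(2)
x = np.arange(1, 101, dtype=float)
m_true = 150 + 6.5 * x - 0.15 * x**2 + 0.001 * x**3
y = m_true + rng.normal(scale=25.0, size=x.size)

def knn_symmetric(y, k):
    """Centered moving average of length k (k odd), on data ordered by x;
    the average becomes one-sided near the end points."""
    n = y.size
    half = (k - 1) // 2
    m_hat = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        m_hat[i] = y[lo:hi].mean()        # fewer points are averaged at the boundaries
    return m_hat

m5 = knn_symmetric(y, k=5)     # rough estimate
m25 = knn_symmetric(y, k=25)   # smoother, but biased at the end points
print(m25[:3], m_true[:3])     # illustrates the boundary overestimate discussed next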
  • 351. 9.4. NONPARAMETRIC LOCAL REGRESSION [Figure 9.5: k-nearest neighbors regression curve for two different choices of k, as well as OLS regression line. The data are generated from a cubic polynomial model.] The slope of m̂k(x) is flatter at the end points when k = 25 rather than k = 5. This illustrates a boundary problem in estimating m(x) at the end points. For example, for the smallest regressor value x1 there are no lower valued observations on x to be included, and the average becomes a one-sided average m̂k(x1) = (y1 + · · · + y1+(k−1)/2)/[(k + 1)/2]. Since for these data m(x) is increasing in x in this region, this leads to m̂k(x1) being an overestimate and the overstatement is increasing in k. Such boundary problems are reduced by instead using methods given in Section 9.6.2. 9.4.3. Lowess Regression Example Using alternative weights to those used to form the symmetrized k–NN estimator can lead to better estimates of m(x). An example is the Lowess estimator, defined in Section 9.6.2. This provides a smoother estimate of m(x) as it uses kernel weights rather than an indicator function, analogous to a kernel density estimate being smoother than a running histogram. It also has smaller bias (see Section 9.6.2), which is especially beneficial in estimating m(x) at the end points. Figure 9.6 plots, for data generated by (9.17), the Lowess estimate with k = 25. This local regression estimate is quite close to the true cubic conditional mean function, which is also drawn. Comparing Figure 9.6 to Figure 9.5 for symmetrized k–NN with k = 25, we see that Lowess regression leads to a much smoother regression function estimate and more precise estimation at the boundaries. 9.4.4. Statistical Inference When the error term is normally distributed and analysis is conditional on x1, . . . , xN, the exact small-sample distribution of m̂(x0) in (9.16) can be easily obtained. 309
  • 352. SEMIPARAMETRIC METHODS [Figure 9.6: Nonparametric regression curve using Lowess, as well as a cubic regression curve. Same generated data as Figure 9.5.] Substituting yi = m(xi) + εi into the definition of m̂(x0) leads directly to m̂(x0) − Σ(i=1 to N) wi0,h m(xi) = Σ(i=1 to N) wi0,h εi, which implies with fixed regressors, and if εi are iid N[0, σ²ε], that m̂(x0) ∼ N[Σ(i=1 to N) wi0,h m(xi), σ²ε Σ(i=1 to N) w²i0,h]. (9.18) Note that in general m̂(x0) is biased and the distribution is not necessarily centered around m(x0). With stochastic regressors and nonnormal errors, we condition on x1, . . . , xN and apply a central limit theorem for U-statistics that is appropriate for double summations (see, for example, Pagan and Ullah, 1999, p. 359). Then for εi iid [0, σ²ε], c(N) Σ(i=1 to N) wi0,h εi →d N[0, σ²ε lim c(N)² Σ(i=1 to N) w²i0,h], (9.19) where c(N) is a function of the sample size with O(c(N)) ≤ N^(1/2) that can vary with the local estimator. For example, c(N) = √(Nh) for kernel regression and c(N) = N^0.4 for kernel regression with optimal bandwidth. Then c(N)(m̂(x0) − m(x0) − b(x0)) →d N[0, σ²ε lim c(N)² Σ(i=1 to N) w²i0,h], (9.20) where b(x0) = Σi wi0,h m(xi) − m(x0). Note that (9.20) yields (9.18) for the asymptotic distribution of m̂(x0). Clearly, the distribution of m̂(x0), a simple weighted average, can be obtained under alternative distributional assumptions. For example, for heteroskedastic errors 310
  • 353. 9.5. KERNEL REGRESSION the variance in (9.19) and (9.20) is replaced by lim c(N)2 i σ2 ε,i w2 i0,h, which can be consistently estimated by replacing σ2 ε,i by the squared residual (yi − m(xi ))2 . Alter- natively, one can bootstrap (see Section 11.6.5). 9.4.5. Bandwidth Choice Throughout this chapter we follow the nonparametric terminology that an estimator θ of θ0 has convergence rate N−r if θ = θ0 + Op(N−r ), so that Nr ( θ − θ0) = Op(1) and ideally Nr ( θ − θ0) has a limit normal distribution. Note in particular that an esti- mator that is commonly called a √ N-consistent estimator is converging at rate N−1/2 . Nonparametric estimators typically have a slower rate of convergence than this, with r 1/2, because small bandwidth h is needed to eliminate bias but then less than N observations are being used to estimate m(x0). As an example, consider the k–NN example of Section 9.4.2. Suppose k = N4/5 , so that for example k = 251 if N = 1,000. Then the estimator is consistent as the moving average uses N4/5 /N = N−1/5 of the sample and is therefore collapsing around x0 as N → ∞. Using (9.18), the variance of the moving average estimator is σ2 ε i w2 i0,k = σ2 ε × k × (1/k)2 = σ2 ε × 1/k = σ2 ε N−4/5 , so in (9.19) c(N) = √ k = √ N4/5 = N0.4 , which is less than N1/2 . Other values of k also ensure consistency, provided k O(N). More generally, a range of values of the bandwidth parameter eliminates asymptotic bias, but smaller bandwidth increases variability. In this literature this trade-off is ac- counted for by minimizing mean-squared error, the sum of variance and bias squared. Stone (1980) showed that if x is k dimensional and m(x) is p times differentiable then the fastest possible rate of convergence for a nonparametric estimator of an sth- order derivative of m(x) is N−r , where r = (p − s)/(2p + k). This rate decreases as the order of the derivative increases and as the dimension of x increases. It increases the more differentiable m(x) is assumed to be, approaching N−1/2 if m(x) has derivatives of order approaching infinity. For scalar regression estimation of m(x) it is customary to assume existence of m (x), in which case r = 2/5 and the fastest convergence rate is N−0.4 . 9.5. Kernel Regression Kernel regression is a weighted average estimator using kernel weights. Issues such as bias and choice of bandwidth presented for kernel density estimation are also relevant here. However, there is less guidance for choice of bandwidth than in the regression case. Also, while we present kernel regression for pedagogical reasons, kernel local regression estimators are often used in practice (see Section 9.6). 9.5.1. Kernel Regression Estimator The goal in kernel regression is to estimate the regression function m(x) in the model y = m(x) + ε defined in (9.15). 311
  • 354. SEMIPARAMETRIC METHODS From Section 9.4.1, an obvious estimator of m(x0) is the average of the sample values yi of the dependent variable corresponding to the xi s close to x0. A variation on this is to find the average of the yi s for all observations with xi within distance h of x0. This can be formally expressed as m(x0) ≡ N i=1 1 xi −x0 h 1 yi N i=1 1 xi −x0 h 1 , where as before 1(A) = 1 if event A occurs and equals 0 otherwise. The numerator sums the y values and the denominator gives the number of y values that are summed. This expression gives equal weights to all observations close to x0, but it may be preferable to give the greatest weight at x0 and decrease the weight as we move away. Thus more generally we consider a kernel weighting function K(·), introduced in Sec- tion 9.3.2. This yields the kernel regression estimator m(x0) ≡ 1 Nh N i=1 K xi −x0 h yi 1 Nh N i=1 K xi −x0 h . (9.21) Several common kernel functions – uniform, Gaussian, Epanechnikov, and quartic – have already been given in Table 9.1. The constant h is called the bandwidth, and 2h is called the window width. The bandwidth plays the same role as k in the k–NN example of Section 9.4.2. The estimator (9.21) was proposed by Nadaraya (1964) and Watson (1964), who gave an alternative derivation. The conditional mean m(x) = ) y f (y|x)dy = ) y[ f (y, x)/f (x)]dy, which can be estimated by m(x) = ) y[ f (y, x)/ f (x)]dy, where f (y, x) and f (x) are bivariate and univariate kernel density estimators. It can be shown that this equals the estimator in (9.21). The statistics literature also considers kernel re- gression in the fixed design or fixed regressors case where f (x) is known and need not be estimated, whereas we consider only the case of stochastic regressors that arises with observational data. The kernel regression estimator is a special case of the weighted average (9.16), with weights wi0,h = 1 Nh K xi −x0 h 1 Nh N i=1 K xi −x0 h , (9.22) which by construction sum over i to one. The general results of Section 9.4 are relevant, but we give a more detailed analysis. 9.5.2. Statistical Inference We present the distribution of the kernel regression estimator m(x) for given choice of K(·) and h, assuming the data x are iid. We implicitly assume that regressors are continuous. With discrete regressors m(x0) will still collapse on m(x0), and both m(x0) in the limit and m(x0) are step functions. 312
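The estimator (9.21) is a one-line weighted average once the kernel weights are computed. A minimal sketch follows (Python/NumPy; the Epanechnikov kernel, the simulated cubic data from (9.17), and the bandwidth h = 10 are illustrative choices).

import numpy as np

def epanechnikov(z):
    """Epanechnikov kernel K(z) = 0.75 (1 - z^2) on |z| <= 1."""
    return np.where(np.abs(z) <= 1.0, 0.75 * (1.0 - z**2), 0.0)

def nw_regression(x, y, x0, h, kernel=epanechnikov):
    """Nadaraya-Watson estimator (9.21): kernel-weighted average of y around each x0."""
    x0 = np.atleast_1d(x0)
    k = kernel((x[None, :] - x0[:, None]) / h)      # (n_eval, N) kernel weights
    return (k * y[None, :]).sum(axis=1) / k.sum(axis=1)

# Example on the generated cubic data of Section 9.4.2
rng = np.random.default_rng(3)
x = np.arange(1, 101, dtype=float)
y = 150 + 6.5*x - 0.15*x**2 + 0.001*x**3 + rng.normal(scale=25.0, size=x.size)
grid = np.linspace(5, 95, 19)
print(nw_regression(x, y, grid, h=10.0))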
  • 355. 9.5. KERNEL REGRESSION Consistency Consistency of m(x0) for the conditional mean function m(x0) requires h → 0, so that substantial weight is given only to xi very close to x0. At the same time we need many xi close to x0, so that many observations are used in forming the weighted average. Formally, m(x0) p → m(x0) if h → 0 and Nh → ∞ as N → ∞. Bias The kernel regression estimator is biased of size O(h2 ), with bias term b(x0) = h2 m (x0) f (x0) f (x0) + 1 2 m (x0) ( z2 K(z)dz (9.23) (see Section 9.8.2) assuming m(x) is twice differentiable. As for kernel density estima- tion, the bias varies with the kernel function used. More importantly, the bias depends on the slope and curvature of the regression function m(x0) and the slope of the density f (x0) of the regressors, whereas for density estimation the bias depended only on the second derivatives of f (x0). The bias can be particularly large at the end points, as illustrated in Section 9.4.2. The bias can be reduced by using higher order kernels, defined in Section 9.3.3, and boundary modifications such as specific boundary kernels. Local polynomial regres- sion and modifications such as Lowess (see Section 9.6.2) have the attraction that the term in (9.23) depending on m (x0) drops out and perform well at the boundaries. Asymptotic Normality In Section 9.8.2 it is shown that, for xi iid with density f (xi ), the kernel regression estimator has limit distribution √ Nh( m(x0) − m(x0) − b(x0)) d → N 0, σ2 ε f (x0) ( K(z)2 dz . (9.24) The variance term in (9.24) is larger for small f (x0), so as expected the variance of m(x0) is larger in regions where x is sparse. 9.5.3. Bandwidth Choice Incorporating values of yi for which xi = x0 into the weighted average introduces bias, since E[yi |xi ] = m(xi ) = m(x0) for xi = x0. However, using these additional points reduces the variance of the estimator, since we are averaging over more data. The opti- mal bandwidth balances the trade-off between increased bias and decreased variance, using squared error loss. Unlike kernel density estimation, plug-in approaches are im- practical and cross-validation is used more extensively. For simplicity most studies focus on choosing one bandwidth for all values of x0. Some methods with variable bandwidths, notably k–NN and Lowess, are given in Section 9.6. 313
  • 356. SEMIPARAMETRIC METHODS Mean Integrated Squared Error The local performance of m(·) at x0 is measured by the mean-squared error, given by MSE[ m(x0)] = E[( m(x0) − m (x0))2 ], where the expectation eliminates dependence of m(x0) on x. Since MSE equals vari- ance plus squared bias, the MSE can be obtained using (9.23) and (9.24). Similar to Section 9.3.6, the integrated square error is ISE(h) = ( ( m(x0) − m (x0))2 f (x0)dx0, where f (x) denotes the density of the regressors x, and the mean integrated square error, or equivalently the integrated mean-squared error, is MISE(h) = ( MSE[ m(x0)] f (x0)dx0. Optimal Bandwidth The optimal bandwidth h∗ minimizes MISE(h). This yields h∗ = O(N−0.2 ) since the bias is O(h2 ) from (9.23); the variance is O((Nh)−1 ) from (9.24) since an O(1) variance is obtained after scaling m(x0) by √ Nh; and for bias squared and variance to be of the same order (h2 )2 = (Nh)−1 or h = N−0.2 . The kernel estimate then converges to m(x0) at rate (Nh∗ )−1/2 = N−0.4 rather than the usual N−0.5 for parametric analysis. Plug-in Bandwidth Estimate One can obtain an exact expression for h∗ that minimizes MISE(h), using calculus methods similar to those in Section 9.3.5 for the kernel density estimator. Then h∗ depends on the bias and variance expressions in (9.23) and (9.24). A plug-in approach calculates h∗ using estimates of these unknowns. However, estimation of m (x), for example, requires nonparametric methods that in turn require an initial bandwidth choice, but h∗ also depends on unknowns such as m (x). Given these complications one should be wary of plug-in estimates. More common is to use cross-validation, presented in the following. It can also be shown that MISE(h∗ ) is minimized if the Epanichnikov kernel is used (see Härdle, 1990, p. 186, or Härdle and Linton, 1994, p. 2321), though as in the kernel regression case MISE(h∗ ) is not much larger for other kernels. The key issue is determination of h∗ , which will vary with kernel and the data. Cross-Validation An empirical estimate of the optimal h can be obtained by the leave-one-out cross- validation procedure. This chooses h∗ that minimizes CV(h) = N i=1 (yi − m−i (xi ))2 π(xi ), (9.25) 314
  • 357. 9.5. KERNEL REGRESSION where π(xi ) is a weighting function (discussed in the following) and m−i (xi ) = j=i wji,h yj / j=i wji,h (9.26) is a leave-one-out estimate of m(xi ) obtained by the kernel formula (9.21), or more generally by a weighted procedure (9.16), with the modification that yi is dropped. Cross-validation is not as computationally intensive as it first appears. It can be shown that yi − m−i (xi ) = yi − m(xi ) 1 − [wii,h/ j wji,h] , (9.27) so that for each value of h cross-validation requires only one computation of the weighted averages m(xi ), i = 1, . . . , N. The weights π(xi ) are introduced to potentially downweight the end points, which otherwise may receive too much importance since local weighted estimates can be quite highly biased at the end points as illustrated in Section 9.4.2. For example, ob- servations with xi outside the 5th to 95th percentiles may not be used in calculating CV(h), in which case π(xi ) = 0 for these observations and π(xi ) = 1 otherwise. The term cross-validation is used as it validates the ability to predict the ith observation us- ing all the other observations in the data set. The ith observation is dropped because if instead it was additionally used in the prediction, then CV(h) would be trivially mini- mized when mh(xi ) = yi , i = 1, . . . , N. CV(h) is also called the estimated prediction error. Härdle and Marron (1985) showed that minimizing CV(h) is asymptotically equiv- alent to minimizing a modification of ISE(h) and MISE(h). The modification includes weight function π(x0) in the integrand, as well as the averaged squared error (ASE) N−1 i ( m(xi ) − m(xi ))2 π(xi ), which is a discrete sample approximation to ISE(h). The measure CV(h) converges at the slow rate of O(N−0.1 ) however, so CV(h) can be quite variable in finite samples. Generalized Cross-Validation An alternative to leave-one-out cross validation is to use a measure similar to CV(h) but one that more simply uses m(xi ) rather than m−i (xi ) and then adds a model com- plexity penalty that increases as the bandwidth h decreases. This leads to PV(h) = N i=1 (yi − m(xi ))2 π(xi )p(wii,h), where p(·) is the penalty function and wii,h is the weight given to the ith observation in m(xi ) = j wji,h yj . A popular example is the generalized cross-validation measure that uses the penalty function p(wii,h) = (1 − wii,h)2 . Other penalties are given in Härdle (1990, p. 167) and Härdle and Linton (1994, p. 2323). 315
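A minimal sketch of leave-one-out cross-validation for the Nadaraya–Watson estimator, using the computational shortcut (9.27), follows (Python/NumPy; the sine regression function, the bandwidth grid, and the choice π(xi) = 1 for all observations are illustrative assumptions; in practice the tails might be downweighted).

import numpy as np

def gauss(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def cv_score(x, y, h, pi=None):
    """Leave-one-out CV(h) in (9.25), computed with the shortcut (9.27):
    y_i - m_hat_{-i}(x_i) = (y_i - m_hat(x_i)) / (1 - w_ii / sum_j w_ji)."""
    if pi is None:
        pi = np.ones_like(y)                       # weight pi(x_i); could trim the tails
    k = gauss((x[None, :] - x[:, None]) / h)       # k[i, j] = K((x_j - x_i)/h)
    denom = k.sum(axis=1)
    m_hat = (k @ y) / denom                        # Nadaraya-Watson fit at each x_i
    leverage = np.diag(k) / denom                  # w_ii / sum_j w_ji
    resid_loo = (y - m_hat) / (1.0 - leverage)
    return np.sum(pi * resid_loo**2)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid_h = np.linspace(0.1, 2.0, 20)
scores = [cv_score(x, y, h) for h in grid_h]
print("h_CV =", grid_h[int(np.argmin(scores))])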
  • 358. SEMIPARAMETRIC METHODS Cross-Validation Example For the local running average example in Section 9.4.2, CV(k) = 54,811, 56,666, 63,456, 65,605, and 69,939 for k = 3, 5, 7, 9, and 25, respectively. In this case all observations were used to calculate CV(k), with π(xi ) = 1, despite possible end-point problems. There is no real gain after k = 5, though from Figure 9.5 this value pro- duced too rough an estimate and in practice one would choose a higher value of k to get a smoother curve. More generally cross-validation is by no means perfect and it is common to “eye- ball” fitted nonparametric curves to select h to achieve a desired degree of smoothness. Trimming The denominator of the kernel estimator in (9.21) is f (x0), the kernel estimate of the density of the regressor at x0. At some evaluation points f (xi ) can be very small, leading to a very large estimate m(xi ). Trimming eliminates or greatly downweights all points with f (xi ) b, say, where b → 0 at an appropriate rate as N → ∞. Such problems are most likely to occur in the tails of the distribution. For nonparametric estimation one can just focus on estimation of m(xi ) for more central values of xi , and values in the tails may be downweighted in cross-validation. However, the semipara- metric methods of Section 9.7 can entail computation of m(xi ) at all values of xi , in which case it is not unusual to trim. Ideally, the trimming function should make no difference asymptotically, though it will make a difference in finite samples. 9.5.4. Confidence Intervals Kernel regression estimates should generally be presented with pointwise confidence intervals. A simple procedure is to present pointwise confidence intervals for f (x0) evaluated at, for example, x0 equal to the first through ninth deciles of x. If the bias b(x0) in m(x0) is ignored, (9.24) yields the following 95% confidence interval: m(x0) ∈ m(x0) ± 1.96 4 1 Nh σ2 ε f (x0) ( K(z)2dz, where σ2 ε = i wi0,h ε2 i and wi0,h is defined in (9.22) and f (x0) is the kernel density estimate at x0. This estimate assumes homoskedastic errors, though is likely to be somewhat robust to heteroskedasticity since observations close to x0 are given the greatest weight. Alternatively, from the discussion after (9.20) a heteroskedastic robust 95% confidence interval is m(x0) ± 1.96 s0, where s2 0 = i w2 i0,h ε2 i . As in the kernel density case, the bias in m(x0) should not be ignored. As already noted, estimation of the bias is difficult. Instead, the standard procedure is to under- smooth, with smaller bandwidth h satisfying h = o(N−0.2 ) rather than the optimal h∗ = O(N−0.2 ). 316
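A minimal sketch of the heteroskedasticity-robust pointwise interval m̂(x0) ± 1.96 s0 described above follows (Python/NumPy; the Gaussian kernel and the simulated sine regression are illustrative choices, and the bias term is ignored, so in practice h should be undersmoothed relative to h*).

import numpy as np

def gauss(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def nw_fit_and_ci(x, y, x0, h):
    """Nadaraya-Watson estimate at x0 with the heteroskedasticity-robust 95% interval
    m_hat(x0) +/- 1.96 * s0, where s0^2 = sum_i w_i0^2 * e_i^2 (bias ignored)."""
    # residuals from the kernel fit at the sample points
    kx = gauss((x[None, :] - x[:, None]) / h)
    resid = y - (kx @ y) / kx.sum(axis=1)
    # normalized weights (9.22) at the evaluation point
    k0 = gauss((x - x0) / h)
    w = k0 / k0.sum()
    m0 = w @ y
    s0 = np.sqrt(np.sum(w**2 * resid**2))
    return m0, m0 - 1.96 * s0, m0 + 1.96 * s0

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
for x0 in np.percentile(x, [10, 50, 90]):          # a few deciles of x
    print(x0, nw_fit_and_ci(x, y, x0, h=0.5))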
  • 359. 9.5. KERNEL REGRESSION Härdle (1990) gives a detailed presentation of confidence intervals, including uniform confidence bands rather than pointwise intervals, and the bootstrap methods given in Section 11.6.5. 9.5.5. Derivative Estimation In regression we are often interested in how the conditional mean of y changes with changes in x, the marginal effect, rather than the conditional mean per se. Kernel estimates can be easily used to form the derivative. The general result is that the sth derivative of the kernel regression estimate, m̂^(s)(x0), is consistent for m^(s)(x0), the sth derivative of the conditional mean m(x0). Either calculus or finite-difference approaches can be taken. As an example, consider estimation of the first derivative in the generated-data example of the previous section. Let z1, . . . , zN denote the ordered points at which the kernel regression function is evaluated and m̂(z1), . . . , m̂(zN) denote the estimates at these points. A finite-difference estimate is m̂'(zi) = [m̂(zi) − m̂(zi−1)]/[zi − zi−1]. This is plotted in Figure 9.7, along with the true derivative, which for the dgp given in (9.17) is the quadratic m'(zi) = 6.5 − 0.30zi + 0.003z²i. [Figure 9.7: Nonparametric derivative estimate using previously estimated Lowess regression curve, as well as using a cubic regression curve. Same generated data as Figure 9.5.] As expected the derivative estimate is somewhat noisy, but it picks up the essentials. Derivative estimates should be based on oversmoothed estimates of the conditional mean. For further details see Pagan and Ullah (1999, chapter 4). Härdle (1990, p. 160) presents adaptation of cross-validation to derivative estimation. In addition to the local derivative m'(x0) we may also be interested in the average derivative E[m'(x)]. The average derivative estimator given in Section 9.7.4 provides a √N-consistent and asymptotically normal estimate of E[m'(x)]. 9.5.6. Conditional Moment Estimation The kernel regression methods for the conditional mean E[y|x] = m(x) can be extended to nonparametric estimation of other conditional moments. 317
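As a preview of the weighted raw-moment approach formalized on the next page, the sketch below estimates V[y|x0] as Ê[y²|x0] − m̂(x0)² using the same kernel weights for both moments (Python/NumPy; the heteroskedastic design and the bandwidth are illustrative assumptions).

import numpy as np

def gauss(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def conditional_variance(x, y, x0, h):
    """V_hat[y|x0] = E_hat[y^2|x0] - (E_hat[y|x0])^2, with the same kernel weights
    used for both raw conditional moments."""
    k = gauss((x - x0) / h)
    w = k / k.sum()
    m1 = w @ y          # E_hat[y | x0]
    m2 = w @ y**2       # E_hat[y^2 | x0]
    return m2 - m1**2

# Example with variance increasing in x: y = x + N(0, (0.5 + 0.2 x)^2)
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 2000)
y = x + (0.5 + 0.2 * x) * rng.normal(size=x.size)
for x0 in (2.0, 5.0, 8.0):
    print(x0, conditional_variance(x, y, x0, h=0.7), (0.5 + 0.2 * x0)**2)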
  • 360. SEMIPARAMETRIC METHODS For raw conditional moments such as E[yk |x] we use the weighted average E[yk |x0] = N i=1 wi0,h yk i , (9.28) where the weights wi0,h may be the same weights as used for estimation of m(x0). Central conditional moments can then be computed by reexpressing them as weighted sums of raw moments. For example, since V[y|x] = E[y2 |x] − (E[y|x])2 , the conditional variance can be estimated by E[y2 |x0] − m(x0)2 . One expects that higher order conditional moments will be estimated with more noise than will be the condi- tional mean. 9.5.7. Multivariate Kernel Regression We have focused on kernel regression on a single regressor. For regression of scalar y on k-dimensional vector x, that is, yi = m(xi ) + εi = m(x1i , . . . , xki ) + εi , the kernel estimator of m(x0) becomes m(x0) ≡ 1 Nhk N i=1 K xi −x0 h yi 1 Nhk N i=1 K xi −x0 h , where K(·) is now a multivariate kernel. Often K(·) is the product of k one- dimensional kernels, though multivariate kernels such as the multivariate normal den- sity can be used. If a product kernel is used the regressors should be transformed to a common scale by dividing by the standard deviation. Then the cross-validation measure (9.25) can be used to determine a common optimal bandwidth h∗ , though determining which xi should be downweighted as the result of closeness to the end points is more compli- cated when x is multivariate. Alternatively, regressors need not be rescaled, but then different bandwidths should be used for each regressor. The asymptotic results and expressions are similar to those considered before, as the estimate is again a local average of the yi . The bias b(x0) is again O(h2 ) as before, but the variance of m(x0) declines at a rate O(Nhk ), slower than in the one-dimensional case since essentially a smaller fraction of the sample is being used to form m(x0). Then Nhk( m(x0) − m(x0) − b(x0)) d → N 0, σ2 ε f (x0) ( K(z)2 dz . The optimal bandwidth choice is h∗ = O(N−1/(k+4) ), which is larger than O(N−0.2 ) in the one-dimensional case. The corresponding optimal rate of convergence of m(x0) is N−2/(k+4) . This result and the earlier scalar result assumes that m(x) is twice differentiable, a necessary assumption to obtain the bias term in (9.23). If m(x) is instead p times dif- ferentiable then kernel estimation using a pth order kernel (see Section 9.3.3) reduces the order of the bias, leading to smaller h∗ and faster rates of convergence that attain Stone’s bound given in Section 9.4.5; see Härdle (1990, p. 93) for further details. Other nonparametric estimators given in the next section can also attain Stone’s bound. 318
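A minimal sketch of kernel regression with two regressors and a product Gaussian kernel follows (Python/NumPy; the regressors are rescaled by their standard deviations so a single bandwidth is shared, the bandwidth uses the N^(-1/(k+4)) scaling with k = 2, and the dgp is an arbitrary illustration).

import numpy as np

def gauss(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def nw_multivariate(X, y, x0, h):
    """Kernel regression of y on k regressors with a product Gaussian kernel.
    Regressors are standardized so one bandwidth h is shared across dimensions."""
    scale = X.std(axis=0, ddof=1)
    U = (X - x0) / (scale * h)                 # (N, k) standardized distances
    k = gauss(U).prod(axis=1)                  # product kernel weight per observation
    return (k @ y) / k.sum()

# Example: m(x1, x2) = x1^2 + sin(x2)
rng = np.random.default_rng(12)
n = 2000
X = rng.uniform(-2, 2, size=(n, 2))
y = X[:, 0]**2 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=n)
h = n**(-1.0 / 6.0)                            # O(N^(-1/(k+4))) with k = 2
print(nw_multivariate(X, y, np.array([1.0, 0.5]), h), 1.0**2 + np.sin(0.5))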
  • 361. 9.6. ALTERNATIVE NONPARAMETRIC REGRESSION ESTIMATORS The convergence rate decreases as the number of regressors increases, approaching N0 as the number of regressors approaches infinity. This curse of dimensionality greatly restricts the use of nonparametric methods in regression models with several regressors. Semiparametric models (see Section 9.7) place additional structure so that the nonparametric components are of low dimension. 9.5.8. Tests of Parametric Models An obvious test of correct specification of a parametric model of the conditional mean is to compare the fitted mean with that obtained from a nonparametric model. Let mθ(x) denote a parametric estimator of E[y|x] and mh(x) denote a nonparamet- ric estimator such as a kernel estimator. One approach is to compare mθ(x) with mh(x) at a range of values of x. This is complicated by the need to correct for asymptotic bias in mh(x) (see Härdle and Mammen, 1993). A second approach is to consider con- ditional moment tests of the form N−1 i wi (yi − mθ(xi )), where different weights, based in part on kernel regression, test failure of E[y|x] = mθ(x) in different direc- tions. For example, Horowitz and Härdle (1994) use wi = mh(xi ) − mθ(xi ). Pagan and Ullah (1999, pp. 141–150) and Yatchew (2003, pp. 119–124) survey some of the methods used. 9.6. Alternative Nonparametric Regression Estimators Section 9.4 introduced local regression methods that estimate the regression function m(x0) by a local weighted average m(x0) = i wi0,h yi , where the weights wi0,h = w(xi , x0, h) differ with the point of evaluation x0 and the sample value of xi . Section 9.5 presented detailed results when the weights are kernel weights. Here we consider other commonly used local estimators that correspond to other weights. Many of the results of Section 9.5 carry through, with similar optimal rates of convergence and use of cross-validation for bandwidth selection, though the exact expressions for bias and variance differ from those in (9.23) and (9.24). The estimators given in Section 9.6.2 are especially popular. 9.6.1. Nearest Neighbors Estimator The k–nearest neighbor estimator is the equally weighted average of the y values for the k observations of xi closest to x0. Define Nk(x0) to be the set of k observations of xi closest to x0. Then mk−N N (x0) = 1 k N i=1 1(xi ∈ Nk(x0))yi . (9.29) This estimator is a kernel estimator with uniform weights (see Table 9.1) except that the bandwidth is variable. Here the bandwidth h0 at x0 equals the distance between x0 and the furthest of the k nearest neighbors, and more formally h0 k/(2N f (x0)). 319
  • 362. SEMIPARAMETRIC METHODS The quantity k/N is called the span. Smoother curves can be obtained by using kernel weights in (9.29). The estimator has the attraction of providing a simple rule for variable bandwidth selection. It is computationally faster to use a symmetrized version that uses the k/2 nearest neighbors to the left and a similar number to the right, which is the local run- ning average method used in Section 9.4.2. Then one can use an updating formula on observations ordered by increasing xi , as then one observation leaves the data and one enters as x0 increases. 9.6.2. Local Linear Regression and Lowess The kernel regression estimator is a local constant estimator because it assumes that m(x) equals a constant in the local neighborhood of x0. Instead, one can let m(x) be linear in the neighborhood of x0, so that m(x) = a0 + b0(x − x0) in the neighborhood of x0. To implement this idea, note that the kernel regression estimator m(x0) can be ob- tained by minimizing i K ((xi − x0)/h) (yi − m0)2 with respect to m0. The local linear regression estimator minimizes N i=1 K xi − x0 h (yi − a0 − b0(xi − x0))2 , (9.30) with respect to a0 and b0, where K(·) is a kernel weighting function. Then m(x) = a0 + b0(x − x0) in the neighborhood of x0. The estimate at exactly x0 is then m(x) = a0, and b0 provides an estimate of the first derivative m (x0). More generally, a local polynomial estimator of degree p minimizes N i=1 K xi − x0 h (yi − a0,0 − a0,1(xi − x0) − · · · − a0,p (xi − x0)p p! )2 , (9.31) yielding m(s) (x0) = a0,s. Fan and Gijbels (1996) list many properties and attractions of this method. Esti- mation entails only weighted least-squares regression at each evaluation point x0. The estimators can be expressed as a weighted average of yi , since they are LS estimators. The local linear estimator has bias term b(x0) = h2 1 2 m (x0) ) z2 K(z)dz, which, un- like the bias for kernel regression given in (9.23), does not depend on m (x0). This is especially beneficial for overcoming the boundary problems illustrated in Section 9.4.2. For estimating an sth-order derivative a good choice of p is p = s + 1 so that, for example, one uses a local quadratic estimator to estimate the first derivative. A standard local regression estimator is the locally weighted scatterplot smoothing or Lowess estimator of Cleveland (1979). This is a variant of local polynomial estima- tion that in (9.31) uses a variable bandwidth h0,k determined by the distance from x0 to its kth nearest neighbor; uses the tricubic kernel K(z) = (70/81)(1 − |z|3 )3 1(|z| 1); and downweights observations with large residuals yi − m(xi ), which requires passing through the data N times. For a summary see Fan and Gijbels (1996, p. 24). Lowess is attractive compared to kernel regression as it uses a variable bandwidth, robustifies 320
  • 363. 9.6. ALTERNATIVE NONPARAMETRIC REGRESSION ESTIMATORS against outliers, and uses a local polynomial estimator to minimize boundary prob- lems. However, it is computationally intensive. Another popular variation is the supersmoother of Friedman (1984) (see Härdle, 1990, p. 181). The starting point is symmetrized k–NN, using local linear fit rather than local constant fit for better fit at the boundary. Rather than use a fixed span or fixed k, however, the supersmoother is a variable span smoother where the variable span is determined by local cross-validation that entails nine passes over the data. Compared to Lowess the supersmoother does not robustify against outliers, but it permits the span to vary and is fast to compute. 9.6.3. Smoothing Spline Estimator The cubic smoothing spline estimator mλ(x) minimizes the penalized residual sum of squares PRSS(λ) = N i=1 (yi − m(xi ))2 + λ ( (m (x))2 dx, (9.32) where λ is a smoothing parameter. As elsewhere in this chapter squared error loss is used. The first term alone leads to a very rough fit since then m(xi ) = yi . The second term is introduced to penalize roughness. The cross-validation methods of Section 9.5.3 can be used to determine λ, with larger values of λ leading to a smoother curve. Härdle (1990, pp. 56–65) shows that mλ(x) is a cubic polynomial between succes- sive x-values and that the estimator can be expressed as a local weighted average of the ys and is asymptotically equivalent to a kernel estimator with a particular variable kernel. In microeconometrics smoothing splines are used less frequently than the other methods presented here. The approach can be adapted to other roughness penalties and other loss functions. 9.6.4. Series Estimators Series estimators approximate a regression function by a weighted sum of K functions z1(x), . . . , zK (x), mK (x) = K j=1 β j z j (x), (9.33) where the coefficients β1, . . . , βK are simply obtained by OLS regression of y on z1(x), . . . , zK (x). The functions z1(x), . . . , zK (x) form a truncated series. Examples include a (K − 1)th-order polynomial approximation or power series with z j (x) = x j−1 , j = 1, . . . , K; orthogonal and orthonormal polynomial variants (see Section 12.3.1); truncated Fourier series where the regressor is rescaled so that x ∈ [0, 2π]; the Fourier flexible functional form of Gallant (1981), which is a truncated Fourier series plus the terms x and x2 ; and regression splines that approximate the regres- sion function m(x) by polynomial functions between a given number of knots that are joined at the knots. 321
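A minimal sketch of the simplest series estimator, a power series fitted by OLS, follows (Python/NumPy; the rescaling of the regressor to (0, 1] and the choice K = 4 are our own illustrative choices, and the dgp is the cubic (9.17) after rescaling x by 100).

import numpy as np

def series_fit(x, y, K):
    """Power-series estimator (9.33): OLS of y on z_j(x) = x^(j-1), j = 1,...,K.
    Returns a function m_hat(.) built from the fitted coefficients."""
    Z = np.column_stack([x**j for j in range(K)])        # 1, x, x^2, ..., x^(K-1)
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    return lambda x0: np.column_stack([x0**j for j in range(K)]) @ beta

rng = np.random.default_rng(11)
x = np.arange(1, 101, dtype=float) / 100                 # rescaled to avoid huge powers
y = 150 + 650*x - 1500*x**2 + 1000*x**3 + rng.normal(scale=25.0, size=x.size)
m_hat = series_fit(x, y, K=4)                            # K - 1 = 3: cubic approximation
print(m_hat(np.array([0.2, 0.5, 0.8])))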
  • 364. SEMIPARAMETRIC METHODS The approach differs from that in Section 9.4 as it is a global approximation ap- proach to estimation of m(x), rather than a local approach to estimation of m(x0). Nonetheless, mK (x) p → m(x0) if K → ∞ at an appropriate rate as N → ∞. From Newey (1997) if x is k dimensional and m(x) is p times differentiable the mean in- tegrated squared error (see Section 9.5.3) MISE(h) = O(K−2p/k + K/N), where the first term reflects bias and the second term variance. Equating these gives the optimal K∗ = Nk/(2p+k) , so K grows but at slower rate than the sample size. The convergence rate of mK∗ (x) equals the fastest possible rate of Stone (1980), given in Section 9.4.5. Intuitively, series estimators may not be robust as outliers may have a global rather than merely local impact on m(x), but this conjecture is not tested in typical examples given in texts. Andrews (1991) and Newey (1997) give a very general treatment that includes the multivariate case, estimation of functionals other than the conditional mean, and extensions to semiparametric models where series methods are most often used. 9.7. Semiparametric Regression The preceding analysis has emphasized regression models without any structure. In microeconometrics some structure is usually placed on the regression model. First, economic theory may place some structure, such as symmetry and homo- geneity restrictions, in a demand function. Such information may be incorporated into nonparametric regression; see, for example, Matzkin (1994). Second, and more frequently, econometric models include so many potential regres- sors that the curse of dimensionality makes fully nonparametric analysis impractical. Instead, it is common to estimate a semiparametric model that loosely speaking com- bines a parametric component with a nonparametric component; see Powell (1994) for a careful discussion of the term semiparametric. There are many different semiparametric models and myriad methods are often available to consistently estimate these models. In this section we present just a few leading examples. Applications are given elsewhere in this book, including the binary outcome models and censored regression models given in Chapters 14 and 16. 9.7.1. Examples Table 9.2 presents several leading examples of semiparametric regression. The first two examples, detailed in the following, generalize the linear model x β by adding an unspecified component λ(z) or by permitting an unspecified transformation g(x β), whereas the third combines the first two. The next three models, used more in ap- plied statistics than econometrics, reduce the dimensionality by assuming additivity or separability of the regressors but are otherwise nonparametric. We detail the gen- eralized additive model. Related to these are neural network models; see Kuan and White (1994). The last example, also detailed in the following, is a flexible model of the conditional variance. Care needs to be taken to ensure that semiparametric models 322
  • 365. 9.7. SEMIPARAMETRIC REGRESSION are identified. For example, see the discussion of single-index models. In addition to estimation of β, interest also lies in the marginal effects such as ∂E[y|x, z]/∂x.
Table 9.2. Semiparametric Models: Leading Examples
Name | Model | Parametric | Nonparametric
Partially linear | E[y|x, z] = x'β + λ(z) | β | λ(·)
Single index | E[y|x] = g(x'β) | β | g(·)
Generalized partial linear | E[y|x, z] = g(x'β + λ(z)) | β | g(·), λ(·)
Generalized additive | E[y|x] = c + Σ(j=1 to k) gj(xj) | – | gj(·)
Partial additive | E[y|x, z] = x'β + c + Σ(j=1 to k) gj(zj) | β | gj(·)
Projection pursuit | E[y|x] = Σ(j=1 to M) gj(x'βj) | βj | gj(·)
Heteroskedastic linear | E[y|x] = x'β; V[y|x] = σ²(x) | β | σ²(·)
9.7.2. Efficiency of Semiparametric Estimators We consider loss of efficiency in estimating by semiparametric rather than parametric methods, ahead of presenting results for several leading semiparametric models. Our summary follows Robinson (1988b), who considers a semiparametric model with parametric component denoted β and nonparametric component denoted G that depends on infinitely many nuisance parameters. Examples of G include the shape of the distribution of a symmetrically distributed iid error and the single-index function g(·) given in (9.37) in Section 9.7.4. The estimator β̂ = β̂(Ĝ), where Ĝ is a nonparametric estimator of G. Ideally, the estimator β̂ is adaptive, meaning that there is no efficiency loss in having to estimate G by nonparametric methods, so that √N(β̂ − β) →d N[0, VG], where VG is the covariance matrix for any shape function G in the particular class being considered. Within the likelihood framework VG is the Cramer–Rao lower bound. In the second-moment context VG is given by the Gauss–Markov theorem or a generalization such as to GMM. A leading example of an adaptive estimator is estimation with specified conditional mean function but with unknown functional form for heteroskedasticity (see Section 9.7.6). If the estimator β̂ is not adaptive then the next best optimality property is for the estimator to attain the semiparametric efficiency bound V*G, so that √N(β̂ − β) →d N[0, V*G], where V*G is a generalization of the Cramer–Rao lower bound or its second-moment analogue that provides the smallest variance matrix possible given the specified semiparametric model. For an adaptive estimator V*G = VG, but usually V*G exceeds VG. Semiparametric efficiency bounds are introduced in Section 9.7.8. They can be 323
  • 366. SEMIPARAMETRIC METHODS obtained only in some semiparametric settings, and even when they are known no estimator may exist that attains the bound. An example that attains the bound is the binary choice model estimator of Klein and Spady (1993) (see Section 14.7.4). If the semiparametric efficiency bound is not attained or is not known, then the next best property is that √ N( β − β) d → N[0,V∗∗ G ] for V∗∗ G greater than V∗ G, which permits the usual statistical inference. More generally, √ N( β − β) = Op(1) but is not neces- sarily normally distributed. Finally, consistent but less than √ N-consistent estimators have the property that Nr ( β − β) = Op(1), where r 0.5. Often asymptotic normal- ity cannot be established. This often arises when the parametric and nonparametric parts are treated equally, so that maximization occurs jointly over β and G. There are many examples, particularly in discrete and truncated choice models. Despite their potential inefficiency, semiparametric estimators are attractive because they can retain consistency in settings where a fully parametric estimator is inconsis- tent. Powell (1994, p. 2513) presents a table that summarizes the existence of consis- tent and √ N-consistent asymptotic normal estimators for a range of semiparametric models. 9.7.3. Partially Linear Model The partially linear model specifies the conditional mean to be the usual linear re- gression function plus an unspecified nonlinear component, so E[y|x, z] = x β + λ(z), (9.34) where the scalar function λ(·) is unspecified. An example is the estimation of a demand function for electricity, where z reflects time-of-day or weather indicators such as temperature. A second example is the sample selection model given in Section 16.5. Ignoring λ(z) leads to inconsistent β owing to omitted variables bias, unless Cov[x, λ(z)] = 0. In applications interest may lie in β, λ(z) or both. Fully nonparametric estimation of E[y|x, z] is possible but leads to less than √ N-consistent estimation of β. Robinson Difference Estimator Instead, Robinson (1988a) proposed the following method. The regression model implies y = x β + λ(z) + u, where the error u = y − E[y|x, z]. This in turn implies E[y|z] = E[x|z] β + λ(z) since E[u|x, z] = 0 implies E[u|z] = 0. Subtracting the two equations yields y − E[y|z] = (x − E[x|z]) β + u. (9.35) The conditional moments in (9.35) are unknown, but they can be replaced by nonpara- metric estimates. 324
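As a preview of the estimator stated formally on the next page, the following minimal sketch implements the differencing idea in (9.35) with kernel first-stage regressions (Python/NumPy; the scalar z, the fixed bandwidth, the absence of trimming, and the artificial dgp are all simplifying assumptions of the illustration).

import numpy as np

def gauss(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def nw_fit(z, v, h):
    """Kernel regression of v on scalar z, evaluated at the sample points."""
    k = gauss((z[None, :] - z[:, None]) / h)
    return (k @ v) / k.sum(axis=1)

def partially_linear(y, X, z, h):
    """Robinson-type difference estimator: OLS of y - E_hat[y|z] on X - E_hat[X|z]."""
    y_tilde = y - nw_fit(z, y, h)
    X_tilde = X - np.column_stack([nw_fit(z, X[:, j], h) for j in range(X.shape[1])])
    return np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

# Example: y = 1.5*x1 - 1.0*x2 + sin(z) + u, with x1 correlated with lambda(z) = sin(z),
# so OLS that simply omits lambda(z) would be biased for the coefficient on x1.
rng = np.random.default_rng(9)
n = 1000
z = rng.uniform(0, 2*np.pi, n)
x1 = np.sin(z) + 0.5 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 1.5*x1 - 1.0*x2 + np.sin(z) + rng.normal(scale=0.3, size=n)
print(partially_linear(y, X, z, h=0.3))      # should be close to (1.5, -1.0)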
  • 367. 9.7. SEMIPARAMETRIC REGRESSION Thus Robinson proposed the OLS regression estimation of yi − myi = (x − mxi ) β + v, (9.36) where myi and mxi are predictions from nonparametric regression of, respectively, yi and xi on zi . Given independence over i, the OLS estimator of β in (9.36) is √ N consistent and asymptotically normal with √ N( βPL − β) d → N  0, σ2 plim 1 N N i=1 (xi − E[xi |zi ])(xi − E[xi |zi ]) −1   , assuming ui is iid [0, σ2 ]. Not specifying λ(z) generally leads to an efficiency loss, though there is no loss if E[x|z] is linear in z. To estimate V[ βPL] simply replace (xi −E[xi |zi ]) by (xi − mxi ). The asymptotic result generalizes to heteroskedastic er- rors, in which case one just uses the usual Eicker–White standard errors from the OLS regression (9.36). Since λ(z) = E[y|z] − E[x|z] β it can be consistently estimated by λ(z) = myi − mxi β. A variety of nonparametric estimators myi and mxi can be used. Robinson (1988a) used kernel estimates that require convergence at rate no slower than N−1/4 so that oversmoothing or higher order kernels are needed if the dimension of z is large; see Pagan and Ullah (1999, p. 205). Note also that the kernel estimators may be trimmed (see Section 9.5.3). Other Estimators Several other methods lead to √ N-consistent estimates of β in the partially linear model. Speckman (1988) also used kernels. Engle et al. (1986) used a generalization of the cubic smoothing spline estimator. Andrews (1991) presented regression of y on x and a series approximation for λ(z) given in Section 9.6.4. Yatchew (1997) presents a simple differencing estimator. 9.7.4. Single-Index Models A single-index model specifies the conditional mean to be an unknown scalar function of a linear combination of the regressors, with E[y|x] = g(x β), (9.37) where the scalar function g(·) is unspecified. The advantages of single-index models have been presented in Section 5.2.4. Here the function g(·) is obtained from the data, whereas previous examples specified, for example, E[y|x] = exp(x β). Identification Ichimura (1993) presents identification conditions for the single-index model. For unknown function g(·) the single-index model β is only identified up to location and scale. To see this note that for scalar v the function g∗ (a + bv) can always be expressed 325
  • 368. SEMIPARAMETRIC METHODS as g(v), so the function g∗ (a + bx β) is equivalent to g(x β). Additionally, g(·) must be differentiable. In the simplest case all regressors are continuous. If instead some regressors are discrete, then at least one regressor must be continuous and if g(·) is monotonic then bounds can be obtained for β. Average Derivative Estimator For continuous regressors, Stoker (1986) observed that if the conditional mean is single index then the vector of average derivatives of the conditional mean determines β up to scale, since for m(xi ) = g(x i β) δ ≡ E ∂m(x) ∂x = E[g (x β)]β, (9.38) and E[g (x i β)] is a scalar. Furthermore, by the generalized information matrix equal- ity given in Section 5.6.3, for any function h(x), E[∂h(x)/∂x] = −E[h(x)s(x)], where s(x) = ∂ ln f (x)/∂x = f (x)/f (x) and f (x) is the density of x. Thus δ = −E [m(x)s(x)] = −E [E[y|x]s(x)] . (9.39) It follows that δ, and hence β up to scale, can be estimated by the average derivative (AD) estimator δAD = − 1 N N i=1 yi s(xi ), (9.40) where s(xi ) = f (xi )/ f (xi ) can be obtained by kernel estimation of the density of xi and its first derivative. The estimator δ is √ N consistent and its asymptotic normal distribution was derived by Härdle and Stoker (1989). The function g(·) can be esti- mated by nonparametric regression of yi on x i δ. Note that δAD provides an estimate of E m (x) regardless of whether a single-index model is relevant. A weakness of δAD is that s(xi ) can be very large if f (xi ) is small. One possibility is to trim when f (xi ) is small. Powell, Stock, and Stoker (1989) instead observed that the result (9.38) extends to weighted derivatives with δ ≡ E[w(x)m (x)]. Especially con- venient is to choose w(x) = f (x), which yields the density weighted average deriva- tive (DWAD) estimator δDWAD = − 1 N N i=1 yi f (xi ), (9.41) which no longer divides by f (xi ). This yields a √ N-consistent and asymptotically normal estimate of β up to scale. For example, if the first component of β is normalized to one then β1 = 1 and β j = δ j / δ1 for j 1. These methods require continuous regressors so that the derivatives exist. Horowitz and Härdle (1996) present extension to discrete regressors. 326
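A minimal sketch of the density-weighted average derivative estimator, following (9.41) literally, is given below (Python/NumPy; the product Gaussian kernel, the full-sample rather than leave-one-out density gradient, and the absence of trimming are simplifications of this illustration). Since β is identified only up to scale, only the ratio of the components of δ̂ matters.

import numpy as np

def gauss(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def density_gradient(X, x0, h):
    """Gradient of the product-Gaussian kernel density estimate at x0 (k-dimensional)."""
    U = (X - x0) / h                              # (N, k) standardized differences
    prod_k = gauss(U).prod(axis=1)                # product kernel for each observation
    return (U / h * prod_k[:, None]).mean(axis=0) / h**X.shape[1]

def dwad(X, y, h):
    """Density-weighted average derivative as written in (9.41):
    delta_hat = -(1/N) sum_i y_i * grad f_hat(x_i)."""
    grads = np.array([density_gradient(X, X[i], h) for i in range(X.shape[0])])
    return -(y[:, None] * grads).mean(axis=0)

# Example: single-index dgp y = g(x'beta) + u with beta = (1, 2)
rng = np.random.default_rng(10)
n = 1000
X = rng.normal(size=(n, 2))
index = X @ np.array([1.0, 2.0])
y = np.tanh(index) + rng.normal(scale=0.2, size=n)
d = dwad(X, y, h=0.5)
print(d[1] / d[0])    # should be roughly 2, the ratio beta_2 / beta_1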
  • 369. 9.7. SEMIPARAMETRIC REGRESSION Semiparametric Least Squares An alternative estimator of the single-index model was proposed by Ichimura (1993). Begin by assuming that g(·) is known, in which case the WLS estimator of β minimizes SN (β) = 1 N N i=1 wi (x)(yi − g(x i β))2 . For unknown g(·) Ichimura proposed replacing g(x i β) by a nonparametric estimate g(x i β), leading to the weighted semiparametric least-squares (WSLS) estimator βWSLS that minimizes QN (β) = 1 N N i=1 π(xi )wi (x)(yi − g(x i β))2 , where π(xi ) is a trimming function that drops observations if the kernel regression estimate of the scalar x i β is small, and g(x i β) is a leave-one-out kernel estimator from regression of yi on x i β. This is a √ N-consistent and asymptotically normal estimate of β up to scale that is generaly more efficient than the DWAD estimator. For heteroskedastic data the most efficient estimator is the analogue of feasible GLS that uses estimated weight function wi (x) = 1/ σ2 i , where σ2 i is the kernel estimate given in (9.43) of Section 9.7.6 and where ui = yi − g(x i β) and β is obtained from initial minimization of QN (β) with wi (x) = 1. The WSLS estimator is computed by iterative methods. Begin with an initial esti- mator β (1) , such as the DWAD estimator with first component normalized to one. Form the kernel estimate g(x i β (1) ) and hence QN ( β (1) ), perturb β (1) to obtain the gradient gN ( β (1) ) = ∂ QN (β)/∂β| β (1) and hence an update β (2) = β (1) + AN gN ( β (1) ), and so on. This estimator is considerably more difficult to calculate than the DWAD estima- tor, especially as QN (β) can be nonconvex and multimodal. 9.7.5. Generalized Additive Models Generalized additive models specify E[y|x] = g1(x1) + · · · +gk(xk), a specializa- tion of the fully nonparametric model E[y|x] = g(x1, . . . , gk). This specialization re- sults in the estimated subfunctions gj (xj ) converging at the rate for a one-dimensional nonparametric regression rather than the slower rate of a k-dimensional nonparametric regression. A well-developed methodology exists for estimating such models (see Hastie and Tibsharani, 1990). This is automated in some statistical packages such as S-Plus. Plots of the estimated subfunctions gj (xj ) on xj trace out the marginal effects of xj on E[y|x], so the additive model can provide a useful tool for exploratory data analy- sis. The model sees little use in microeconometrics in part because many applications such as censoring, truncation, and discrete outcomes lead naturally to single-index and partially linear models. 327
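A generalized additive model can be fitted by backfitting: each subfunction is re-estimated by smoothing the partial residuals against its own regressor until the fits stabilize. The sketch below is one simple variant using Nadaraya–Watson smoothers with fixed bandwidths on simulated data; it is not the exact algorithm of Hastie and Tibshirani or of any particular package.

```python
# Backfitting sketch for an additive model E[y|x] = g1(x1) + g2(x2).
import numpy as np

def nw_smooth(x, y, h):
    """Nadaraya-Watson fit of y on scalar x, evaluated at the sample points."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u ** 2)                    # Gaussian kernel weights
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(1)
N = 400
x1, x2 = rng.uniform(-2, 2, N), rng.uniform(-2, 2, N)
y = np.sin(x1) + 0.5 * x2 ** 2 + 0.3 * rng.normal(size=N)

alpha = y.mean()
g1, g2 = np.zeros(N), np.zeros(N)
for _ in range(20):                              # iterate until the fits settle down
    g1 = nw_smooth(x1, y - alpha - g2, h=0.3)
    g1 -= g1.mean()                              # center each component for identification
    g2 = nw_smooth(x2, y - alpha - g1, h=0.3)
    g2 -= g2.mean()

fitted = alpha + g1 + g2
print("in-sample R2:", 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum())
```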
  • 370. SEMIPARAMETRIC METHODS 9.7.6. Heteroskedastic Linear Model The heteroskedastic linear model specifies E[y|x] = x β, V[y|x] = σ2 (x), where the variance function σ2 (·) is unspecified. The assumption that errors are heteroskedastic is the standard cross-section data assumption in modern microecometrics. One can obtain consistent but inefficient esti- mates of β by doing OLS and using the Eicker–White heteroskedastic-consistent esti- mate of the variance matrix of the OLS estimator. Cragg (1983) and Amemiya (1983) proposed an IV estimator that is more efficient than OLS but still not fully efficient. Feasible GLS provides a fully efficient second-moment estimator but is not attractive as it requires specification of a functional form for σ2 (x) such as σ2 (x) = exp(x γ). Robinson (1987) proposed a variant of FGLS using a nonparametric estimator of σ2 i = σ2 (xi ). Then βHLM = N i=1 σ−2 i xi x i −1 N i=1 σ−2 i xi yi , (9.42) where Robinson (1987) used a k–NN estimator of σ2 i with uniform weight, so σ2 i = 1 k N j=1 1(xj ∈ Nk(xi )) u2 j , (9.43) where ui = yi − x i βOLS is the residual from first-stage OLS regression of yi on xi and Nk(xi ) is the set of k observations of xj closest to xi in weighted Euclidean norm. Then √ N( βHLM − β) d → N[0, N  0, plim 1 N N i=1 σ−2 (xi )xi xi −1   , assuming ui is iid [0, σ2 (xi )]. This estimator is adaptive as it attains the Gauss– Markov bound so is as as efficient as the GLS estimator when σ2 i is known. The variance matrix is consistently estimated by N−1 i σ−2 i xi x i −1 . In principle other nonparametric estimators of σ2 (xi ) might be used, but Carroll (1982) and others originally proposed use of a kernel estimator of σ2 i and found that proof of efficiency was possible only under very restrictive assumptions on xi . The Robinson method extends to models with nonlinear mean function. 9.7.7. Seminonparametric MLE Suppose yi is iid with specified density f (yi |xi , β). In general, misspecification of the density leads to inconsistent parameter estimates. Gallant and Nychka (1987) proposed approximating the unknown true density by a power-series expansion around the den- sity f (y|x, β). To ensure a positive density they actually use a squared power-series 328
  • 371. 9.7. SEMIPARAMETRIC REGRESSION expansion around f (y|x, β), yielding hp(y|x, β, α) = (p(y|α))2 f (y|x, β) ) (p(z|α))2 f (y|z, β)dz , (9.44) where p(y|α) is a pth order polynomial in y, α is the vector of coefficients of the poly- nomial, and division by the denominator ensures that probabilities integrate or sum to one. The estimator of β and α maximizes the log-likelihood N i=1 ln hp(yi |x, β, α). The approach generalizes immediately to multivariate yi . The estimator is called the seminonparametric maximum likelihood estimator because it is a nonparametric estimator that can be estimated in the same way as a maximum likelihood estimator. Gallant and Nychka (1987) showed that under fairly general conditions the estimator yields consistent estimates of the density if the order p of the polynomial increases with sample size N at an appropriate rate. This result provides a strong basis for using (9.44) to obtain a class of flexible dis- tributions for any particular data. The method is particularly simple if the polynomial series p(y|α) is the orthogonal or orthonormal polynomial series (see Section 12.3.1) for the baseline density f (y|x, β), as then the normalizing factor in the denominator can be simply constructed. The order of the polynomial can be chosen using infor- mation criteria, with measures that penalize model complexity more than AIC used in practice. Regular ML statistical inference is possible if one ignores the data-dependent selection of the polynomial order and assumes that the resulting density hp(y|x, β, α) is correctly specified. An example of this approach for count data regression is given in Cameron and Johansson (1997). 9.7.8. Semiparametric Efficiency Bounds Semiparametric efficiency bounds extend efficiency bounds such as Cramer–Rao or the Gauss–Markov theorem to cases where the dgp has a nonparametric component. The best semiparametric methods achieve this efficiency bound. We use β to denote parameters we wish to estimate, which may include variance components such as σ2 , and η to denote nuisance parameters. For simplicity we con- sider ML estimation with a nonparametric component. We begin with the fully parametric case. The MLE ( β, η) maximizes L(β, η) = ln L(β, η). Let θ = (β, η) and let Iθθ be the information matrix defined in (5.43). Then √ N( θ − θ) d → N[0, I−1 θθ ]. For √ N( β − β), partitioned inversion of Iθθ leads to V∗ = (Iββ − IβηI−1 ηη Iηβ)−1 (9.45) as the efficiency bound for estimation of β when η is unknown. There is an efficiency loss when η is unknown, unless the information matrix is block diagonal so that Iβη = 0 and the variance reduces to I−1 ββ. Now consider extension to the nonparametric case. Suppose we have a paramet- ric submodel, say L0(β), that involves β alone. Consider the family of all possible parametric models L(β, η) that nest L0(β) for some value of η. The semiparametric 329
  • 372. SEMIPARAMETRIC METHODS efficiency bound is the largest value of V∗ given in (9.45) over all possible parametric models L(β, η), but this is difficult to obtain. Simplification is possible by considering sβ = sβ − E[sβ|sη], where sθ denotes the score ∂L/∂θ, and sβ is the score for β after concentrating out η. For finite-dimensional η it can be shown that E[N−1 sβ s β] = V∗ . Here η is instead infinite dimensional. Assume iid data and let sθi denote the ith component in the sum that leads to the score sθ. Begun et al. (1983) define the tangent set to be the set of all linear combinations of sηi . When this tangent set is linear and closed the largest value of V∗ in (9.45) equals Ω = plim N−1 sβ s β −1 = (E[ sβi s βi ])−1 . The matrix Ω is then the semiparametric efficiency bound. In applications one first obtains sη = i sηi . Then obtain E[sβi |sηi ], which may entail assumptions such as symmetry of errors that place restrictions on the class of semiparametric models being considered. This yields sβi and hence Ω. For more de- tails and applications see Newey (1990b), Pagan and Ullah (1999), and Severini and Tripathi (2001). 9.8. Derivations of Mean and Variance of Kernel Estimators Nonparametric estimation entails a balance between smoothness (variance) and bias (mean). Here we derive the mean and variance of kernel density and kernel regression estimators. The derivations follow those of M. J. Lee (1996). 9.8.1. Mean and Variance of Kernel Density Estimator Since xi are iid each term in the summation has the same expected value and E[ f (x0)] = E 1 h K x−x0 h = ) 1 h K x−x0 h f (x)dx. By change of variable to z = (x − x0)/h so that x = x0 + hz and dx/dz = h we obtain E[ f (x0)] = ( K(z) f (x0 + hz)dz. A second-order Taylor series expansion of f (x0 + hz) around f (x0) yields E[ f (x0)] = ) K(z){ f (x0) + f (x0)hz + 1 2 f (x0)(hz)2 }dz = f (x0) ) K(z)dz + h f (x0) ) zK(z)dz + 1 2 h2 f (x0) ) z2 K(z)dz. 330
  • 373. 9.8. DERIVATIONS OF MEAN AND VARIANCE OF KERNEL ESTIMATORS Since the kernel K(z) integrates to unity this simplifies to E[ f (x0)] − f (x0) = h f (x0) ( zK(z)dz + 1 2 h2 f (x0) ( z2 K(z)dz. If additionally the kernel satisfies ) zK(z)dz = 0, assumed in condition (ii) in Section 9.3.3, and second derivatives of f are bounded, then the first term on the right-hand side disappears, yielding E[ f (x0)] − f (x0) = b(x0), where b(x0) is defined in (9.4). To obtain the variance of f (x0), begin by noting that if yi are iid then V[ȳ] = N−1 V[y] = N−1 E[y2 ] − N−1 (E[y])2 . Thus V[ f (x0)] = 1 N E 1 h K x−x0 h 2 − 1 N E 1 h K x−x0 h 2 . Now by change of variables and first-order Taylor series expansion E 1 h K x−x0 h 2 = ) 1 h K(z)2 { f (x0) + f (x0)hz}dz = 1 h f (x0) ) K(z)2 dz + f (x0) ) zK(z)2 dz. It follows that V[ f (x0)] = 1 Nh f (x0) ) K(z)2 dz + 1 N f (x) ) zK(z)2 dz − 1 N [ f (x0) + h2 2 f (x0)[ ) z2 K(z)dz]]2 . For h → 0 and N → ∞ this is dominated by the first term, leading to Equation (9.5). 9.8.2. Distribution of Kernel Regression Estimator We obtain the distribution for regressors xi that are iid with density f (x). From Section 9.5.1 the kernel estimator is a weighted average m(x0) = i wi0,h yi , where the kernel weights wi0,h are given in (9.22). Since the weights sum to unity we have m(x0) − m(x0) = i wi0,h(yi − m(x0)). Substituting (9.15) for yi , and normalizing by √ Nh as in the kernel density estimator case we have √ Nh( m(x0) − m(x0)) = √ Nh N i=1 wi0,h(m(xi ) − m(x0) + εi ). (9.46) One approach to obtaining the limit distribution of (9.46) is to take a second-order Taylor series expansion of m(xi ) around x0. This approach is not always taken be- cause the weights wi0,h are complicated by the normalization that they sum to one (see (9.22)). Instead, we take the approach of Lee (1996, pp. 148–151) following Bierens (1987, pp. 106–108). Note that the denominator of the weight function is the kernel estimate of the density of x0, since f (x0) = (Nh)−1 i K ((xi − x0)/h). Then (9.46) yields √ Nh( m(x0) − m(x0)) = 1 √ Nh N i=1 K xi − x0 h (m(xi ) − m(x0) + εi ) 7 f (x0). (9.47) We apply the Transformation Theorem (Theorem A.12) to (9.47), using f (x0) p → f (x0) for the denominator, while several steps are needed to obtain a limit normal 331
  • 374. SEMIPARAMETRIC METHODS distribution for the numerator: 1 √ Nh N i=1 K xi − x0 h (m(xi ) − m(x0) + εi ) (9.48) = 1 √ Nh N i=1 K xi − x0 h (m(xi ) − m(x0)) + 1 √ Nh N i=1 K xi − x0 h εi . Consider the first sum in (9.48); if a law of large numbers can be applied it converges in probability to its mean E 1 √ Nh N i=1 K xi − x0 h (m(xi ) − m(x0)) ' (9.49) = √ N √ h ( K x − x0 h (m(x) − m(x0)) f (x)dx = √ Nh ( K(z)(m(x0 + hz) − m(x0)) f (x0 + hz)dz = √ Nh ( K(z) hzm (x0) + 1 2 h2 z2 m (x0) f (x0) + hzf (x0) dz = √ Nh ( K(z)h2 z2 m (x0) f (x0)dz + ( K(z) 1 2 h2 z2 m (x0) f (x0)dz . = √ Nhh2 m (x0) f (x0) + 1 2 m (x0) f (x0) ( z2 K(z)dz = √ Nh f (x0)b(x0), where b(x0) is defined in (9.23). The first equality uses xi iid; the second equality is change of variables to z = (x − x0)/h; the third equality applies a second-order Taylor series expansion to m(x0 + hz) and a first-order Taylor series expansion to f (x0 + hz); the fourth equality follows because upon expanding the product to four terms, the two terms given dominate the others (see, e.g., Lee, 1996, p. 150). Now consider the second sum in (9.48); the terms in the sum clearly have mean zero, and the variance of each term, dropping subscript i, is V K x − x0 h ε = E K2 x − x0 h ε2 (9.50) = ( K2 x − x0 h V[ε|x] f (x)dx = h ( K2 (z) V[ε|x0 + hz] f (x0 + hz)dz = hV[ε|x0] f (x0) ( K2 (z) dz, by change of variables to z = (x − x0)/h with dx = hdz in the third-line term, and letting h → 0 to get the last line. It follows upon applying a central limit theorem that 1 √ Nh N i=1 K xi − x0 h εi d → N 0, V[ε|x0] f (x0) ( K2 (z) dz . (9.51) 332
  • 375. 9.9. PRACTICAL CONSIDERATIONS Combining (9.49) and (9.51), we have that √ Nh( m(x0) − m(x0)) defined in (9.47) converges to 1/f (x0) times N √ Nh f (x0)b(x0), V[ε|x0] f (x0) ) K2 (z) dz . Division of the mean by f (x0) and the variance by f (x0)2 leads to the limit distribution given in (9.24). 9.9. Practical Considerations All-purpose regression packages increasingly offer adequate methods for univariate nonparametric density estimation and regression. The programming language XPlore emphasizes nonparametric and graphical methods; details on many of the methods are provided at its Web site. Nonparametric univariate density estimation is straightforward, using a kernel den- sity estimate based on a kernel such as the Gaussian or Epanechnikov. Easily computed plug-in estimates for the bandwidth provide a useful starting point that one may then, say, halve or double to see if there is an improvement. Nonparametric univariate regression is also straightforward, aside from bandwidth selection. If relatively unbiased estimates of the regression function at the end points are desired, then local linear regression or Lowess estimates are better than kernel regression. Plug-in estimates for the bandwidth are more difficult to obtain and cross- validation is instead used (see Section 9.5.3) along with eyeballing the scatterplot with a fitted line. The degree of desired smoothness can vary with application. For nonpara- metric multivariate regression such eyeballing may be impossible. Semiparametric regression is more complicated. It can entail subtleties such as trim- ming and undersmoothing the nonparametric component since typically estimation of the parametric component involves averaging the nonparametric component. For such purposes one generally uses specialized code written in languages such as Gauss, Matlab, Splus, or XPlore. For the nonparametric estimation component considerable computational savings can be obtained through use of fast computing algorithms such as binning and updating; see, for example, Fan and Gijbels (1996) and Härdle and Linton (1994). All methods require at some stage specification of a bandwidth or window width. Different choices lead to different estimates in finite samples, and the differences can be quite large as illustrated in many of the figures in this chapter. By contrast, within a fully parametric framework different researchers estimating the same model by ML will all obtain the same parameter estimates. This indeterminedness is a detraction of nonparametric methods, though the hope is that in semiparametric methods at least the spillover effects to the parametric component of the model may be small. 9.10. Bibliographic Notes Nonparametric estimation is well presented in many statistics texts, including Fan and Gijbels (1996). Ruppert, Wand, and Carroll (2003) present application of many semiparametric meth- ods. The econometrics books by Härdle (1990), M. J. Lee (1996), Horowitz (1998b), Pagan and Ullah (1999), and Yatchew (2003) cover both nonparametric and semiparametric estimation. 333
  • 376. SEMIPARAMETRIC METHODS Pagan and Ullah (1999) is particularly comprehensive. Yatchew (2003) is oriented to the ap- plied econometrician. He emphasizes the partial linear and single-index models and practical aspects of their implementation such as computation of confidence intervals. 9.3 Key early references for kernel density estimation are Rosenblatt (1956) and Parzen (1962). Silverman’s (1986) is a classic book on nonparametric density estimation. 9.4 A quite general statement of optimal rates of convergence for nonparametric estimators is given in Stone (1980). 9.5 Kernel regression estimation was proposed by Nadaraya (1964) and Watson (1964). A very helpful and relatively simple survey of kernel and nearest-neighbors regression is by Altman (1992). There are many other surveys in the statistics literature. Härdle (1990, chap- ter 5) has a lengthy discussion of bandwidth choice and confidence intervals. 9.6 Many approaches to nonparametric local regression are contained in Stone (1977). For series estimators see Andrews (1991) and Newey (1997). 9.6 For semiparametric efficiency bounds see the survey by Newey (1990b) and the more recent paper by Severini and Tripathi (2001). An early econometrics application was given by Chamberlain (1987). 9.7 The econometrics literature focuses on semiparametric regression. Survey papers include those by Powell (1994), Robinson (1988b), and, at a more introductory level, Yatchew (1998). Additional references are given in elsewhere in this book, notably in Sections 14.7, 15.11, 16.9, 20.5, and 23.8. The applied study by Bellemare, Melenberg, and Van Soest (2002) illustrates several semiparametric methods. Exercises 9–1 Suppose we obtain a kernel density estimate using the uniform kernel (see Table 9.1) with h = 1 and a sample of size N = 100. Suppose in fact the data x ∼ N[0, 1]. (a) Calculate the bias of the kernel density estimate at x0 = 1 using (9.4). (b) Is the bias large relative to the true value φ(1), where φ(·) is the standard normal pdf? (c) Calculate the variance of the kernel density estimate at x0 = 1 using (9.5). (d) Which is making a bigger contribution to MSE at x0 = 1, variance or bias squared? (e) Using results in Section 9.3.7, give a 95% confidence interval for the density at x0 = 1 based on the kernel density estimate f (1). (f) For this example, what is the optimal bandwidth h∗ from (9.10). 9–2 Suppose we obtain a kernel regression estimate using a uniform kernel (see Table 9.1) with h = 1 and a sample of size N = 100. Suppose in fact the data x ∼ N[0, 1] and the conditional mean function is m(x) = x2 . (a) Calculate the bias of the kernel regression estimate at x0 = 1 using (9.23). (b) Is the bias large relative to the true value m(1) = 1? (c) Calculate the variance of the kernel regression estimate at x0 = 1 using (9.24). (d) Which is making a bigger contribution to MSE at x0 = 1, variance or bias squared? 334
  • 377. 9.10. BIBLIOGRAPHIC NOTES (e) Using results in Section 9.5.4, give a 95% confidence interval for E[y|x0 = 1] based on the kernel regression estimate m(1). 9–3 This question assumes access to a nonparametric density estimation program. Use the Section 4.6.4 data on health expenditure. Use a kernel density estimate with Gaussian kernel (if available). (a) Obtain the kernel density estimate for health expenditure, choosing a suitable bandwidth by eyeballing and trial and error. State the bandwidth chosen. (b) Obtain the kernel density estimate for natural logarithm of health expenditure, choosing a suitable bandwidth by eyeballing and trial and error. State the bandwidth chosen. (c) Compare your answer in part (b) to an appropriate histogram. (d) If possible superimpose a fitted normal density on the same graph as the kernel density estimate from part (b). Do health expenditures appear to be log-normally distributed? 9–4 This question assumes access to a kernel regression program or other non- parametric smoother. Use the complete sample of the Section 4.6.4 data on natural logarithm of health expenditure (y) and natural logarithm of total expenditure (x). (a) Obtain the kernel regression density estimate for health expenditure, choos- ing a good bandwidth by eyeballing and trial and error. State the bandwidth chosen. (b) Given part (a), does health appear to be a normal good? (c) Given part (a), does health appear to be a superior good? (d) Compare your nonparametric estimates with predictions from linear and quadratic regression. 335
  • 378. C H A P T E R 10 Numerical Optimization 10.1. Introduction Theoretical results on consistency and the asymptotic distribution of an estimator de- fined as the solution to an optimization problem were presented in Chapters 5 and 6. The more practical issue of how to numerically obtain the optimum, that is, how to calculate the parameter estimates, when there is no explicit formula for the estimator, comprises the subject of this chapter. For the applied researcher estimation of standard nonlinear models, such as logit, probit, Tobit, proportional hazards, and Poisson, is seemingly no different from es- timation of an OLS model. A statistical package obtains the estimates and reports coefficients, standard errors, t-statistics, and p-values. Computational problems gen- erally only arise for the same reasons that OLS may fail, such as multicollinearity or incorrect data input. Estimation of less standard nonlinear models, including minor variants of a standard model, may require writing a program. This may be possible within a standard statisti- cal package. If not, then a programming language is used. Especially in the latter case a knowledge of optimization methods becomes necessary. General considerations for optimization are presented in Section 10.2. Various iter- ative methods, including the Newton–Raphson and Gauss–Newton gradient methods, are described in Section 10.3. Practical issues, including some common pitfalls, are presented in Section 10.4. These issues become especially relevant when the opti- mization method fails to produce parameter estimates. 10.2. General Considerations Microeconometric analysis is often based on an estimator θ that maximizes a stochas- tic objective function QN (θ), where usually θ solves the first-order conditions ∂ QN (θ)/∂θ = 0. A minimization problem can be recast as a maximization by mul- tiplying the objective function by minus one. In nonlinear applications there will 336
  • 379. 10.2. GENERAL CONSIDERATIONS generally be no explicit solution to the first-order conditions, a nonlinear system of q equations in the q unknowns θ. A grid search procedure is usually impractical and iterative methods, usually gradi- ent methods, are employed. 10.2.1. Grid Search In grid search methods, the procedure is to select many values of θ along a grid, compute QN (θ) for each of these values, and choose as the estimator θ the value that provides the largest (locally or globally depending on the application) value of QN (θ). If a fine enough grid can be chosen this method will always work. It is generally impractical, however, to choose a fine enough grid without further restrictions. For example, if 10 parameters need to be estimated and the grid evaluates each parameter at just 10 points, a very sparse grid, there are 1010 or 10 billion evaluations. Grid search methods are nonetheless useful in applications where the grid search need only be performed among a subset of the parameters. They also permit viewing the response surface to verify that in using iterative methods one need not be concerned about multiple maxima. For example, many time-series packages do this for the scalar AR(1) coefficient in a regression model with AR(1) error. A second example is doing a grid search for the scalar inclusive parameter in a nested logit model (see Section 15.6). Of course, grid search methods may have to be used if nothing else works. 10.2.2. Iterative Methods Virtually all microeconometric applications instead use iterative methods. These update the current estimate of θ using a particular rule. Given an sth-round estimate θs the iterative method provides a rule that yields a new estimate θs+1, where θs denotes the sth-round estimate rather than the sth component of θ. Ideally, the new estimate is a move toward the maximum, so that QN ( θs+1) QN ( θs), but in general this cannot be guaranteed. Also, gradient estimates may find a local maximum but not necessarily the global maximum. 10.2.3. Gradient Methods Most iterative methods are gradient methods that change θs in a direction determined by the gradient. The update formula is a matrix weighted average of the gradient θs+1 = θs + Asgs, s = 1, . . . , S, (10.1) where As is a q × q matrix that depends on θs, and gs = ∂ QN (θ) ∂θ θs (10.2) is the q × 1 gradient vector evaluated at θs. Different gradient methods use differ- ent matrices As, detailed in Section 10.3. A leading example is the Newton–Raphson method, which sets As = −H−1 s , where Hs is the Hessian matrix defined later in (10.6). 337
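A minimal sketch of the update formula (10.1), using the Newton–Raphson choice A_s = −H_s^{-1} just mentioned, applied to a toy quadratic objective (for which a single NR step lands exactly on the maximum); the toy functions and tolerance are illustrative only.

```python
# Skeleton of the gradient update (10.1): theta_{s+1} = theta_s + A_s g_s,
# with the Newton-Raphson weighting matrix A_s = -inverse(Hessian).
import numpy as np

def gradient_method(grad, hess, theta, max_iter=100, tol=1e-6):
    for s in range(max_iter):
        g = grad(theta)
        A = -np.linalg.inv(hess(theta))          # A_s = -H_s^{-1}
        step = A @ g
        theta = theta + step
        if np.max(np.abs(step)) < tol:           # stop once the update is negligible
            break
    return theta, s + 1

# Toy objective Q_N(theta) = -(theta1 - 1)^2 - 2*(theta2 + 0.5)^2, maximum at (1, -0.5).
grad = lambda t: np.array([-2.0 * (t[0] - 1.0), -4.0 * (t[1] + 0.5)])
hess = lambda t: np.array([[-2.0, 0.0], [0.0, -4.0]])
print(gradient_method(grad, hess, np.zeros(2)))  # reaches (1, -0.5) after one NR step
```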
  • 380. NUMERICAL OPTIMIZATION Note that in this chapter A and g denote quantities that differ from those in other chap- ters. Here A is not the matrix that appears in the limit distribution of an estimator and g is not the conditional mean of y in the nonlinear regression model. Ideally, the matrix As is positive definite for a maximum (or negative definite for a minimum), as then it is likely that QN ( θs+1) QN ( θs). This follows from the first- order Taylor series expansion QN ( θs+1) = QN ( θs) + g s( θs+1 − θs) + R, where R is a remainder. Substituting in the update formula (10.1) yields QN ( θs+1) − QN ( θs) = g sAsgs + R, which is greater than zero if As is positive definite and the remainder R is sufficiently small, since for a positive definite square matrix A the quadratic form x Ax 0 for all column vectors x = 0. Too small a value of As leads to an iterative procedure that is too slow; however, too large a value of As may lead to overshooting, even if As is positive definite, as the remainder term cannot be ignored for large changes. A common modification to gradient methods is to add a step-size adjustment to prevent possible overshooting or undershooting, so θs+1 = θs + λsAsgs, (10.3) where the stepsize λs is a scalar chosen to maximize QN ( θs+1). At the sth round first calculate Asgs, which may involve considerable computation. Then calculate QN ( θ), where θ = θs + λAsgs for a range of values of λ (called a line search), and choose λs as that λ that maximizes QN ( θ). Considerable computational savings are possible because the gradient and As are not recomputed along the line search. A second modification is sometimes made when the matrix As is defined as the inverse of a matrix Bs, say, so that As = B−1 s . Then if Bs is close to singular a matrix of constants, say C, is added or subtracted to permit inversion, so As = (Bs + C)−1 . Similar adjustments can be made if As is not positive definite. Further discussion of computation of As is given in Section 10.3. Gradient methods are most likely to converge to the local maximum nearest the starting values. If the objective function has multiple local optima then a range of starting values should be used to increase the chance of finding the global maximum. 10.2.4. Gradient Method Example Consider calculation of the NLS estimator in the exponential regression model when the only regressor is the intercept. Then E[y] = eβ and a little algebra yields the gra- dient g = N−1 i (yi − eβ )eβ = (ȳ − eβ )eβ . Suppose in (10.1) we use As = e−2 βs , which corresponds to the method of scoring variant of the Newton–Raphson algo- rithm presented later in Section 10.3.2. The iterative method simplifies to βs+1 = βs + (ȳ − e βs )/e βs . As an example of the performance of this algorithm, suppose ȳ = 2 and the starting value is β1 = 0. This leads to the iterations listed in Table 10.1. There is very rapid convergence to the NLS estimate, which for this simple example can be analytically obtained as β = ln ȳ = ln 2 = 0.693147. The objective function increases throughout, 338
a consequence of using the NR algorithm with a globally concave objective function. Note that overshooting occurs in the first iteration, from β_1 = 0.0 to β_2 = 1.0, greater than β = 0.693. Quick convergence usually occurs when the NR algorithm is used and the objective function is globally concave. The challenge in practice is that nonstandard nonlinear models often have objective functions that are not globally concave.

Table 10.1. Gradient Method Results

Round s   Estimate β_s   Gradient g_s   Objective Function Q_N(β_s)
  1         0.000000       1.000000     1.500000 − Σ_i y_i^2/(2N)
  2         1.000000      −1.952492     1.742036 − Σ_i y_i^2/(2N)
  3         0.735758      −0.181711     1.996210 − Σ_i y_i^2/(2N)
  4         0.694042      −0.003585     1.999998 − Σ_i y_i^2/(2N)
  5         0.693147      −0.000002     2.000000 − Σ_i y_i^2/(2N)

Here Q_N(β) = −(1/(2N)) Σ_i (y_i − e^β)^2.

10.2.5. Method of Moments and GMM Estimators

For m-estimators Q_N(θ) = N^{-1} Σ_i q_i(θ) and the gradient is g(θ) = N^{-1} Σ_i ∂q_i(θ)/∂θ. For GMM estimators Q_N(θ) is a quadratic form (see Section 6.3.2) and the gradient takes the more complicated form

g(θ) = [N^{-1} Σ_i ∂h_i(θ)'/∂θ] × W_N × [N^{-1} Σ_i h_i(θ)].

Some gradient methods can then no longer be used as they work only for averages. Methods given in Section 10.3 that can still be used include Newton–Raphson, steepest ascent, DFP, BFGS, and simulated annealing.

Method of moments and estimating equations estimators are defined as solving a system of equations, but they can be converted to a numerical optimization problem similar to GMM. The estimator θ that solves the q equations N^{-1} Σ_i h_i(θ) = 0 can be obtained by minimizing Q_N(θ) = [N^{-1} Σ_i h_i(θ)]'[N^{-1} Σ_i h_i(θ)].

10.2.6. Convergence Criteria

Iterations continue until there is virtually no change. Programs ideally stop when all of the following occur: (1) a small relative change occurs in the objective function Q_N(θ_s); (2) a small change occurs in the gradient vector g_s relative to the Hessian; and (3) a small relative change occurs in the parameter estimates θ_s. Statistical packages typically choose default threshold values for these three changes, called convergence criteria. These values can often be changed by the user. A conservative value is 10^{-6}.
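A few lines of code reproduce the Section 10.2.4 iteration and illustrate a stopping rule in the spirit of the convergence criteria just described; the specific test used below (both the step and the gradient smaller than 10^{-6}) is an illustrative simplification.

```python
# The Section 10.2.4 iteration beta_{s+1} = beta_s + (ybar - exp(beta_s))/exp(beta_s),
# stopped when the parameter change and the gradient are both negligible.
import math

ybar, beta, tol = 2.0, 0.0, 1e-6                      # ybar = 2, beta_1 = 0 as in Table 10.1
for s in range(1, 51):
    g = (ybar - math.exp(beta)) * math.exp(beta)      # gradient g_s = (ybar - e^beta) e^beta
    step = (ybar - math.exp(beta)) / math.exp(beta)   # A_s g_s with A_s = exp(-2*beta_s)
    print(f"round {s}: beta = {beta:.6f}  gradient = {g:.6f}")
    if abs(step) < tol and abs(g) < tol:
        break
    beta += step
print("converged to", round(beta, 6), "; analytical NLS estimate is ln 2 =", round(math.log(2), 6))
```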
  • 382. NUMERICAL OPTIMIZATION In addition there is usually a maximum number of iterations that will be attempted. If this maximum is reached estimates are typically reported. The estimates should not be used, however, unless convergence has been achieved. If convergence is achieved then a local maximum has been obtained. However, there is no guarantee that the global maximum is obtained, unless the objective function is globally concave. 10.2.7. Starting Values The number of iterations is considerably reduced if the initial starting values θ1 are close to θ. Consistent parameter estimates are obviously good estimates to use as start- ing values. A poor choice of starting values can lead to failure of iterative methods. In particular, for some estimators and gradient methods it may not be possible to compute g1 or A1 if the starting value is θ1 = 0. If the objective function is not globally concave it is good practice to use a range of starting values to increase the chance of obtaining a global maximum. 10.2.8. Numerical and Analytical Derivatives Any gradient method by definition uses derivatives of the objective function. Either numerical derivatives or analytical derivatives may be used. Numerical derivatives are computed using
ĝ_j = [Q_N(θ_s + h e_j) − Q_N(θ_s − h e_j)] / (2h),    j = 1, . . . , q,    (10.4)

where h is small and e_j = (0 . . . 0 1 0 . . . 0)' is a vector with unity in the jth row and zeros elsewhere. In theory h should be very small, as formally ∂Q_N(θ)/∂θ_j equals the limit of ĝ_j as h → 0. In practice too small a value of h leads to inaccuracy owing to rounding error. For this reason calculations using numerical derivatives should always be done in double precision or quadruple precision rather than single precision. Although a program may use a default value such as h = 10^{-6}, other values will be better for any particular problem. For example, a smaller value of h is appropriate if the dependent variable y in NLS regression is measured in thousands of dollars rather than dollars (with regressors not rescaled), since then θ will be one-thousandth the size.

A drawback of using numerical derivatives is that these derivatives have to be computed many times: for each of the q parameters, for each of the N observations, and for each of the S iterations. This requires 2qNS evaluations of the objective function, where each evaluation itself may be computationally burdensome.

An alternative is to use analytical derivatives. These will be more accurate than numerical derivatives and may be much quicker to compute, especially if the analytical derivatives are simpler than the objective function itself. Moreover, only qNS function evaluations are needed. For methods that additionally require calculation of second derivatives to form A_s there is even greater benefit to providing analytical derivatives. Even if just analytical first derivatives are given, the second derivative may then be more quickly and
  • 387. 10.3. SPECIFIC METHODS accurately obtained as the numerical first derivative of the analytical first derivative. Statistical packages often provide the user with the option of providing analytical first and second derivatives. Numerical derivatives have the advantage of requiring no coding beyond providing the objective function. This saves coding time and eliminates one possible source of user error, though some packages have the ability to take analytical derivatives. If computational time is a factor or if there is concern about accuracy of calcula- tions, however, it is worthwhile going to the trouble of providing analytical derivatives. It is still good practice then to check that the analytical derivatives have been correctly coded by obtaining parameter estimates using numerical derivatives, with starting val- ues the estimates obtained using analytical derivatives. 10.2.9. Nongradient Methods Gradient methods presume the objective function is sufficiently smooth to ensure ex- istence of the gradient. For some examples, notably least absolute deviations (LAD), quantile regression, and maximum score estimation, there is no gradient and alterna- tive iterative methods are used. For example, for LAD the objective function QN (θs) = N−1 i |yi − xi β| has no derivative and linear programming methods are used in place of gradient methods. Such examples are sufficiently rare in microeconometrics that we focus almost exclu- sively on gradient methods. For objective functions that are difficult to maximize, particularly because of multi- ple local optima, use can be made of nongradient methods such as simulated annealing (presented in Section 10.3.8) and genetic algorithms (see Dorsey and Mayer, 1995). 10.3. Specific Methods The leading method for obtaining a globally concave objective function is the Newton– Raphson iterative method. The other methods, such as steepest descent and DFP, are usually learnt and employed when the Newton–Raphson method fails. Another com- mon method is the Gauss–Newton method for the NLS estimator. This method is not as universal as the Newton–Raphson method, as it is applicable only to least- squares problems, and it can be obtained as a minor adaptation of the Newton–Raphson method. These various methods are designed to obtain a local optimum given some starting values for the parameters. This section also presents the expectation method, which is particularly useful in missing data problems, and the method of simulated annealing, which is an example of a nongradient method and is more likely to yield a global rather than local maximum. 10.3.1. Newton–Raphson Method The Newton–Raphson (NR) method is a popular gradient method that works espe- cially well if the objective function is globally concave in θ. In this method θs+1 = θs − H−1 s gs, (10.5) 341
  • 388. NUMERICAL OPTIMIZATION where gs is defined in (10.2) and Hs = ∂2 QN (θ) ∂θ∂θ θs (10.6) is the q × q Hessian matrix evaluated at θs. These formulas apply to both maximiza- tion and minimization of QN (θ) since premultiplying QN (θ) by minus one changes the sign of both H−1 s and gs. To motivate the NR method, begin with the sth-round estimate θs for θ. Then by second-order Taylor series expansion around θs QN (θ) = QN ( θs) + ∂ QN (θ ) ∂θ θs (θ − θs) + 1 2 (θ − θs) ∂2 QN (θ) ∂θ∂θ θs (θ − θs) + R. Ignoring the remainder term R and using more compact notation, we approximate QN (θ) by Q∗ N (θ) = QN ( θs) + g s(θ − θs) + 1 2 (θ − θs) Hs(θ − θs), where gs and Hs are defined in (10.2) and (10.6). To maximize the approxima- tion Q∗ N (θ) with respect to θ we set the derivative to zero. Then gs + Hs(θ − θs) = 0, and solving for θ yields θs+1 = θs − H−1 s gs, which is (10.5). The NR update therefore maximizes a second-order Taylor series approximation to QN (θ) evaluated at θs. To see whether NR iterations will necessarily increase QN (θ), substitute the (s + 1)th-round estimate back into the Taylor series approximation to obtain QN ( θs+1) = QN ( θs) − 1 2 ( θs+1 − θs) Hs( θs+1 − θs) + R. Ignoring the remainder term, we see that this increases (or decreases) if Hs is negative (or positive) definite. At a local maximum the Hessian is negative semi-definite, but away from the maximum this may not be the case even for well-defined problems. If the NR method strays into such territory it may not necessarily move toward the max- imum. Furthermore the Hessian is then singular, in which case H−1 s in (10.5) cannot be computed. Clearly, the NR method works best for maximization (or minimization) problems if the objective function is globally concave (or convex), as then Hs is al- ways negative (or positive) definite. In such cases convergence often occurs within 10 iterations. An additional attraction of the NR method arises if the starting value θ1 is root-N consistent, that is, if √ N( θ1 − θ0) has a proper limiting distribution. Then the second- round estimator θ2 can be shown to have the same asymptotic distribution as the es- timator obtained by iterating to convergence. There is therefore no theoretical gain to further iteration. An example is feasible GLS, where initial OLS leads to consistent regression parameter estimates, and these in turn are used to obtain consistent variance parameter estimates, which are then used to obtain efficient GLS. A second example is use of easily obtained consistent estimates as starting values before maximizing a complicated likelihood function. Although there is no need to iterate further, in practice most researchers still prefer to iterate to convergence unless this is computationally too 342
  • 389. 10.3. SPECIFIC METHODS time consuming. One advantage of iterating to convergence is that different researchers should obtain the same parameter estimates, whereas different initial root-N consistent estimates lead to second-round parameter estimates that will differ even though they are asymptotically equivalent. 10.3.2. Method of Scoring A common modification of the NR method is the method of scoring (MS). In this method the Hessian matrix is replaced by its expected value HMS,s = E ∂2 QN (θ) ∂θ∂θ θs . (10.7) This substitution is especially advantageous when applied to the MLE (i.e., QN (θ) = N−1 LN (θ)), because the expected value should be negative definite, since by the infor- mation matrix equality (see Section 5.6.3), HMS,s = E ∂LN /∂θ × ∂LN /∂θ , which is positive definite since it is a covariance matrix. Obtaining the expectation in (10.7) is possible only for m-estimators and even then may be analytically difficult. The method of scoring algorithm for the MLE of generalized linear models, such as the Poisson, probit, and logit, can be shown to be implementable using iteratively reweighted least squares (see McCullagh and Nelder, 1989). This was advantageous to early adopters of these models who only had access to an OLS program. The method of scoring can also be applied to m-estimators other than the MLE, though then HMS,s may not be negative definite. 10.3.3. BHHH Method The BHHH method of Berndt, Hall, Hall, and Hausman (1974) uses (10.1) with weighting matrix As = −H−1 BHHH,s where the matrix HBHHH,s = − N i=1 ∂qi (θ) ∂θ ∂qi (θ) ∂θ θs , (10.8) and QN (θ) = i qi (θ). Compared to NR, this has the advantage of requiring evalua- tion of first derivatives only, offering considerable computational savings. To justify this method, begin with the method of scoring for the MLE, in which case QN (θ) = i ln fi (θ), where fi (θ) is the log-density. The information matrix equality can be expressed as E ∂2 LN (θ) ∂θ∂θ = −E N i=1 ∂ ln fi (θ) ∂θ N j=1 ∂ ln f j (θ) ∂θ ' , and independence over i implies E ∂2 LN (θ) ∂θ∂θ = − N i=1 E ∂ ln fi (θ) ∂θ ∂ ln fi (θ) ∂θ . Dropping the expectation leads to (10.8). 343
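A sketch of BHHH iterations for a concrete m-estimator, the Poisson MLE, whose per-observation score is (y_i − exp(x_i'β))x_i. The simulated data, starting values, and stopping rule below are illustrative choices, not part of the text.

```python
# BHHH update (10.1) with A_s = [sum_i s_i s_i']^{-1}, i.e. minus the inverse of (10.8),
# illustrated for the Poisson MLE.
import numpy as np

rng = np.random.default_rng(3)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.8])))

beta = np.array([np.log(y.mean()), 0.0])                 # simple starting values
for s in range(100):
    scores = (y - np.exp(X @ beta))[:, None] * X         # N x q matrix of scores s_i'
    g = scores.sum(axis=0)                               # gradient
    A = np.linalg.inv(scores.T @ scores)                 # outer-product-of-gradients weight
    step = A @ g
    beta = beta + step
    if np.max(np.abs(step)) < 1e-8:
        break
print("BHHH estimate:", beta, "after", s + 1, "iterations")
```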
  • 390. NUMERICAL OPTIMIZATION The BHHH method can also be applied to estimators other than the MLE, in which case it is viewed as simply another choice of matrix As in (10.1) rather than as an estimate of the Hessian matrix Hs. The BHHH method is used for many cross-section m-estimators as it can work well and requires only first derivatives. 10.3.4. Method of Steepest Ascent The method of steepest ascent sets As = Iq, the simplest choice of weighting matrix. A line search is then done (see (10.3)) to scale Iq by a constant λs. The line search can be down manually. In practice it is common to use the optimal λ for the line search, which can be shown to be λs = −g sgs/g sHsgs, where Hs is the Hessian matrix. This optimal λs requires computation of the Hessian, in which case one might instead use NR. The advantage of steepest ascent rather than NR is that Hs can be singular, though Hs still needs to be negative definite to ensure λs 0 so that λsIq is negative definite. 10.3.5. DFP and BFGS Methods The DFP algorithm due to Davidon, Fletcher, and Powell is a gradient method with weighting matrix As that is positive definite and requires computation of only first derivatives, unlike NR, which requires computation of the Hessian. Here the method is presented without derivation. The weighting matrix As is computed by the recursion As = As−1 + δs−1δ s−1 δ s−1γs−1 + As−1γs−1γ s−1As−1 γ s−1As−1γs−1 , (10.9) where δs−1 = As−1gs−1 and γs−1 = gs − gs−1. By inspection of the right-hand side of (10.9), As will be positive definite provided the initial A0 is positive definite (e.g., A0 = Iq). The procedure converges quite well in many statistical applications. Eventually As goes to the theoretically preferred −H−1 s . In principle this method can also provide an approximate estimate of the inverse of the Hessian for use in computation of stan- dard errors, without needing either second derivatives or matrix inversion. In practice, however, this estimate can be a poor one. A refinement of the DFP algorithm is the BFGS algorithm of Boyden, Fletcher, Goldfarb, and Shannon with As = As−1 + δs−1δ s−1 δ s−1γs−1 + As−1γs−1γ s−1As−1 γ s−1As−1γs−1 − (γ s−1As−1γs−1)ηs−1η s−1, (10.10) where ηs−1 = (δs−1/δ s−1γs−1) − (As−1γs−1/γ s−1As−1γs−1). 344
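In practice one rarely codes the DFP or BFGS recursions directly; quasi-Newton routines are built into most matrix languages. A hedged sketch, assuming the SciPy library is available: the routine minimizes, so it is applied to the negative of the objective (here a Poisson log-likelihood with constants dropped), with the analytical gradient supplied.

```python
# Calling a library BFGS routine on minus the objective function.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.8])))

def negQ(beta):                      # minus the average Poisson log-likelihood (constants dropped)
    return -np.mean(y * (X @ beta) - np.exp(X @ beta))

def neg_grad(beta):                  # its analytical gradient
    return -X.T @ (y - np.exp(X @ beta)) / N

start = np.array([np.log(y.mean()), 0.0])      # easily obtained starting values
res = minimize(negQ, x0=start, jac=neg_grad, method="BFGS")
print(res.x, res.success)
```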
  • 391. 10.3. SPECIFIC METHODS 10.3.6. Gauss–Newton Method The Gauss–Newton (GN) method is an iterative method for the NLS estimator that can be implemented by iterative OLS. Specifically, for NLS with conditional mean function g(xi , β), the GN method sets the parameter change vector ( βs+1 − βs) equal to the OLS coefficient estimates from the artificial regression yi − g(xi , βs) = ∂gi ∂β βs β + vi . (10.11) Equivalently, βs+1 equals the OLS coefficient estimates from the artificial regression yi − g(xi , βs) − ∂gi ∂β βs βs = ∂gi ∂β βs β + vi . (10.12) To derive this method, let βs be a starting value, approximate g(xi , β) by a first- order Taylor series expansion g(xi , β) = g(xi , βs) + ∂gi ∂β βs (β − βs), and substitute this in the least-squares objective function QN (β) to obtain the approximation Q∗ N (β) = N i=1 yi − g(xi , βs) − ∂gi ∂β βs (β − βs) 2 . But this is the sum of squared residuals for OLS regression of yi − g(xi , βs) on ∂gi /∂β βs with parameter vector (β − βs), leading to (10.11). More formally, βs+1 = βs + i ∂gi ∂β βs ∂gi ∂β βs '−1 i ∂gi ∂β βs (yi − g(xi , βs)). (10.13) This is the gradient method (10.1) with vector gs = i ∂gi /∂β| βs (yi − g(xi , βs)) weighted by matrix As = [ i ∂gi /∂β×∂gi /∂β | βs ]−1 . The iterative method (10.13) equals the method of scoring variant of the Newton– Raphson algorithm for NLS estimation since, from Section 5.8, the second sum on the right-hand side is the gradient vector and the first sum is minus the expected value of the Hessian (see also Section 10.3.9). The Gauss–Newton algorithm is therefore a special case of the Newton–Raphson, and NR is emphasized more here as it can be applied to a much wider range of problems than can GN. 10.3.7. Expectation Maximization There are a number of data and model formulations considered in this book that can be thought of as involving incomplete or missing data. For example, outcome variables of interest (e.g., expenditure or the length of a spell in some state) may be right-censored. That is, for some cases we may observe the actual expenditure or spell length, whereas 345
  • 392. NUMERICAL OPTIMIZATION in other cases we may only know that the outcome exceeded some specific value, say c∗ . A second example involves a multiple regression in which the data matrix looks as follows: y1 X1 ? X2 , where ? stands for missing data. Here we envisage a situation in which we wish to estimate a linear regression model y = Xβ + u, where y = y1 ? , X = X1 X2 , but a subset of variables y is missing. A third example involves estimating the parame- ters (θ1, θ2, . . . , θC , π1, . . . , πC ) of a C-component mixture distribution, also called a latent class model, h (y|X) = C j=1 πj f j yj |Xj , θj , where f j yj |Xj , θj are well- defined pdfs. Here πj ( j = 1, . . . , C) are unknown sampling fractions corresponding to the C latent densities from which the observations are sampled. It is convenient to think of this problem also as a missing data problem in the sense that if the sampling fractions were known constants then estimation would be simpler. The expectation maximization (EM) framework provides a unifying framework for developing algorithms for problems that can be interpreted as involving miss- ing data. Although particular solutions to this type of estimation problem have long been found in the literature, Dempster, Laird, and Rubin (1977) provided a definitive treatment. Let y denote the vector dependent variable of interest, determined by the under- lying latent variable vector y∗ . Let f ∗ (y∗ |X, θ) denote the joint density of the latent variables, conditional on regressors X, and let f (y|X, θ) denote the joint density of the observed variables. Let there be a many-to-one mapping from the sample space of y to that of y∗ ; that is, the value of the latent variable y∗ uniquely determines y, but the value of y does not uniquely determine y∗ . It follows that f (y|X, θ) = f ∗ (y∗ |X, θ)/f (y∗ |y, X, θ), since from Bayes rule the conditional density f (y∗ |y) = f (y, y∗ )/ f (y) = f ∗ (y∗ )/ f (y), where the final equality uses f (y∗ , y) = f ∗ (y∗ ) as y∗ uniquely determines y. Rearranging gives f (y) = f ∗ (y∗ )/f (y∗ |y). The MLE maximizes QN (θ) = 1 N LN (θ) = 1 N ln f ∗ (y∗ |X, θ) − 1 N ln f (y∗ |y, X, θ). (10.14) Because y∗ is unobserved the first term in the log-likelihood is ignored. The second term is replaced by its expected value, which will not involve y∗ , where at the sth round this expectation is evaluated at θ = θs. The expectation (E) part of the EM algorithm calculates QN (θ| θs) = −E 1 N ln f (y∗ |y, X, θ)|y, X, θs , (10.15) where expectation is with respect to the density f (y∗ |y, X, θs). The maximization (M) part of the EM algorithm maximizes QN (θ| θs) to obtain θs+1. The full EM algorithm is iterative. The likelihood is maximized, given the expected value of the latent variable; the expected value is evaluated afresh given the current value of θ. The iterative process continues until convergence is achieved. The EM algorithm has the advantage of always leading to an increase or constancy in QN (θ); 346
  • 393. 10.3. SPECIFIC METHODS see Amemiya (1985, p. 376). The EM algorithm is applied to a latent class model in Section 18.5.3 and to missing data in Section 27.5. There is a very extensive literature on situations where the EM algorithm can be usefully applied, even though it can be applied to only a subset of optimization prob- lems. The EM algorithm is easy to program in many cases and its use was further en- couraged by considerations of limited computing power and storage that are no longer paramount. Despite these attractions, for censored data models and latent class models direct estimation using Newton–Raphson type iterative procedures is often found to be faster and more efficient computationally. 10.3.8. Simulated Annealing Simulated annealing (SA) is a nongradient iterative method reviewed by Goffe, Ferrier, and Rogers (1994). It differs from gradient methods in permitting movements that decrease rather than increase the objective function to be maximized, so that one is not locked in to moving steadily toward one particular local maximum. Given a value θs at the sth round we perturb the jth component of θs to obtain a new trial value of θ∗ s = θs + 0 · · · 0 (λjrj ) 0 · · · 0 , (10.16) where λj is a prespecified step length and rj is a draw from a uniform distribution on (−1, 1). The new trial value is used, that is, the method sets θs+1 = θ∗ s , if it increases the objective function, or if it does not increase the value of the objective function but does pass the Metropolis criterion that exp (QN (θ∗ s ) − QN ( θs))/Ts u, (10.17) where u is a drawing from a uniform (0, 1) distribution and Ts is a scaling parameter called the temperature. Thus not only uphill moves are accepted, but downhill moves are also accepted with a probability that decreases with the difference between QN (θ∗ s ) and QN ( θs) and that increases with the temperature. The terms simulated annealing and temperature come from analogy with minimizing thermal energy by slowly cool- ing (annealing) a molten metal. The user needs to set the step-size parameter λj . Goffe et al. (1994) suggest period- ically adjusting λj so that 50% of all moves over a number of iterations are accepted. The temperature also needs to be chosen and reduced during the course of iterations. Then the algorithm initially is searching over a wide range of parameter values before steadily locking in on a particular region. Fast simulated annealing (FSA), proposed by Szu and Hartley (1987), is a faster method. It replaces the uniform (−1, 1) random number rj by a Cauchy random vari- able rj scaled by the temperature and permits a fixed step length vj . The method also uses a simpler adjustment of the temperature over iterations with Ts equal to the ini- tial temperature divided by the number of FSA iterations, where one iteration is a full cycle over the q components of θ. Cameron and Johansson (1997) discuss and use simulated annealing, following the methods of Horowitz (1992). This begins with FSA but on grounds of computational 347
  • 394. NUMERICAL OPTIMIZATION savings switches to gradient methods (BFGS) when relatively little change in QN (·) occurs over a number of iterations or after many (250) FSA iterations. In a simulation they find that NR with a number of different starting values offers a considerable im- provement over NR with just one set of starting values, but even better is FSA with a number of different starting values. 10.3.9. Example: Exponential Regression Consider the nonlinear regression model with exponential conditional mean E[yi |xi ] = exp(x i β), (10.18) where xi and β are K × 1 vectors. The NLS estimator β minimizes QN (β) = i (yi − exp(x i β))2 , (10.19) where for notational simplicity scaling by 2/N is ignored. The first-order conditions are nonlinear in β and there is no explicit solution for β. Instead, gradient methods need to be used. For this example the gradient and Hessian are, respectively, g = −2 i (yi − ex i β )ex i β xi (10.20) and H = 2 i # ex i β ex i β xi x i − 2(yi − ex i β )ex i β xi x i $ . (10.21) The NR iterative method (10.5) uses gs and Hs equal to (10.20) and (10.21) evaluated at βs. A simpler method of scoring variation of NR notes that (10.18) implies E[H] = 2 i ex i β ex i β xi x i . (10.22) Using E[Hs] in place of Hs yields βs+1 − βs = i ex i βs ex i βs xi x i '−1 i ex i βs xi (yi − ex i βs ). It follows that βs+1 − βs can be computed from OLS regression of (yi − ex i βs ) on ex i βs xi . This is also the Gauss–Newton regression (10.11), since ∂g(xi , β)/∂β = exp(x i βs)xi for the exponential conditional mean (10.18). Specialization to exp(x i β) = exp(β) gives the iterative procedure presented in Section 10.2.4. 10.4. Practical Considerations Some practical issues have already been presented in Section 10.2, notably conver- gence criteria, modifications such as step-size adjustment, and the use of numerical rather than analytical derivatives. In this section a brief overview of statistical packages 348
  • 395. 10.4. PRACTICAL CONSIDERATIONS is given, followed by a discussion of common pitfalls that can arise in computation of a nonlinear estimator. 10.4.1. Statistical Packages All standard microeconometric packages such as Limdep, Stata, PCTSP, and SAS have built-in procedures to estimate basic nonlinear models such as logit and probit. These packages are simple to use, requiring no knowledge of iterative methods or even of the model being used. For example, the command for logit regression might be “logit y x” rather than the command “ols y x” for OLS. Nonlinear least squares requires some code to convey to the package the particular functional form for g(x, β) one wishes to specify. Estimation should be quick and accurate as the program should exploit the structure of the particular model. For example, if the objective function is globally concave then the method of scoring might be used. If a statistical package does not contain a particular model then one needs to write one’s own code. This situation can arise with even minor variation of standard mod- els, such as imposing restrictions on parameters or using parameterizations that are not of single-index form. The code may be written using one’s own favorite statistical package or using other more specialized programming languages. Possibilities include (1) built-in optimization procedures within the statistical package that require spec- ification of the objective function and possibly its derivatives; (2) matrix commands within the statistical package to compute As and gs and iterate; (3) a matrix program- ming language such as Gauss, Matlab, OX, SAS/IML, or S-Plus, and possibly add-on optimization routines; (4) a programming language such as Fortran or C++; and (5) an optimization package such as those in GAMS, GQOPT, or NAGLIB. The first and second methods are attractive because they do not force the user to learn a new program. The first method is particularly simple for m-estimation as it can require merely specification of the subfunction qi (θ) for the ith observation rather than specification of QN (θ). In practice, however, the optimization procedures for user- defined functions in the standard packages are more likely to encounter numerical problems than if more specialized programs are used. Moreover, for some packages the second method can require learning arcane forms of matrix commands. For nonlinear problems, the third method is the best, although this might require the user to learn a matrix programming language from scratch. One then is set up to han- dle virtually any econometric problem encountered, and the optimization routines that come with matrix programming languages are usually adequate. Also, many authors make available the code used in specific papers. The fourth and fifth methods generally require a higher level of programming so- phistication than the third method. The fourth method can lead to much faster compu- tation and the fifth method can solve the most numerically challenging optimization problems. Other practical issues include cost of software; the software used by colleagues; and whether the software has clear error messages and useful debugging features, such as a trace program that tracks line-by-line program execution. The value of using software similar to that used by other colleagues cannot be underestimated. 349
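To illustrate the remark that for m-estimation the user may need to supply only the subfunction q_i(θ), the following is a hypothetical wrapper that forms Q_N(θ), builds a central-difference gradient as in (10.4), and passes both to a library quasi-Newton routine. The wrapper, its names, and the SciPy dependence are illustrative assumptions, not an existing package interface.

```python
# A small m-estimation wrapper: the user supplies only q_i(theta).
import numpy as np
from scipy.optimize import minimize

def m_estimate(qi, data, theta0, h=1e-6):
    y, X = data
    QN = lambda th: np.mean(qi(th, y, X))          # Q_N(theta) = N^{-1} sum_i q_i(theta)
    def grad(th):                                  # central differences as in (10.4)
        g = np.zeros_like(th)
        for j in range(len(th)):
            e = np.zeros_like(th)
            e[j] = h
            g[j] = (QN(th + e) - QN(th - e)) / (2 * h)
        return g
    # maximize Q_N by minimizing -Q_N with a quasi-Newton routine
    res = minimize(lambda th: -QN(th), theta0, jac=lambda th: -grad(th), method="BFGS")
    return res.x

# Example m-estimator: NLS with exponential mean, q_i(theta) = -(y_i - exp(x_i'beta))^2.
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = np.exp(X @ np.array([0.2, 0.6])) + rng.normal(size=300)
qi = lambda th, y, X: -(y - np.exp(X @ th)) ** 2
print(m_estimate(qi, (y, X), theta0=np.zeros(2)))
```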
Table 10.2. Computational Difficulties: A Partial Checklist

Problem                         Check
Data read incorrectly           Print full descriptive statistics.
Imprecise calculation           Use analytical derivatives or numerical with different step size h.
Multicollinearity               Check condition number of X'X. Try subset of regressors.
Singular matrix in iterations   Try method not requiring matrix inversion such as DFP.
Poor starting values            Try a range of different starting values.
Model not identified            Difficult to check. Obvious are dummy variable traps.
Strange parameter values        Constant included/excluded? Iterations actually converged?
Different standard errors       Which method was used to calculate variance matrix?

10.4.2. Computational Difficulties

Computational difficulties are, in practice, situations where it is not possible to obtain an estimate of the parameters. For example, an error message may indicate that the estimator cannot be calculated because the Hessian is singular. There are many possible reasons for this, as detailed in the following and summarized in Table 10.2. These reasons may also provide explanation for another common situation of parameter estimates that are obtained but are seemingly in error.

First, the data may not have been read in correctly. This is a remarkably common oversight. With large data sets it is not practical to print out all the data. However, at a minimum one should always obtain descriptive statistics and check for anomalies such as incorrect range for a variable, unusually large or small sample mean, and unusually large or small standard deviation (including a value of zero, which indicates no variation). See Section 3.5.4 for further details.

Second, there may be calculation errors. To minimize these all calculations should be done in double precision or even quadruple precision rather than single precision. It is helpful to rescale the data so that the regressors have similar means and variances. For example, it may be better to use annual income in thousands of dollars rather than in dollars. If numerical derivatives are used it may be necessary to alter the change value h in (10.4). Care needs to be taken in how functions are evaluated. For example, the function ln Γ(y), where Γ(·) is the gamma function, is best evaluated using the log-gamma function. If instead one evaluates the gamma function followed by the log function, considerable numerical error arises even for moderate-sized y.

Third, multicollinearity may be a problem. In single-index models (see Section 5.2.4) the usual checks for multicollinearity will carry over. The correlation matrix for the regressors can be printed, though this only considers pairwise correlation. Better is to use the condition number of X'X, that is, the square root of the ratio of the largest to smallest eigenvalue of X'X. If this exceeds 100 then problems may arise. For more highly nonlinear models than single-index ones it is possible to have problems even if the condition number is not large. If one suspects multicollinearity is causing
  • 397. 10.4. PRACTICAL CONSIDERATIONS numerical problems then see whether it is possible to estimate the model with a subset of the variables that are less likely to be collinear. Fourth, a noninvertible Hessian during iterations does not necessarily imply singu- larity at the true maximum. It is worthwhile trying a range of iterative methods such as steepest ascent with line search and DFP, not just Newton–Raphson. This problem may also result from multicollinearity. Fifth, try different starting values. The iterative gradient methods are designed to obtain a local maximum rather than the global maximum. One way to guard against this is to begin iterations at a wide range of starting values. A second way is to per- form a grid search. Both of these approaches theoretically require evaluations at many different points if the dimension of θ is large, but it may be sufficient to do a detailed analysis for a stripped-down version of the model that includes just the few regressors thought to be most statistically significant. Lastly, the model may not be identified. Indeed a standard necessary condition for model identification is that the Hessian be invertible. As with linear models, sim- ple checks include avoiding dummy variable traps and, if a subset of data is being used in initial analysis, determining that all variables in the subset of the data have some variation. For example, if data are ordered by gender or by age or by region then problems can arise if these appear as indicator variables and the chosen subset is of individuals of a particular gender, age, or region. For nonlinear models it can be difficult to theoretically determine that the model is not identified. Often one first eliminates all other potential causes before returning to a careful analysis of model identification. Even after parameter estimates are successfully obtained computational problems can still arise, as it may not be possible to obtain estimates of the variance matrix A−1 BA−1 . This situation can arise when the iterative method used, such as DFP, does not use the Hessian matrix A−1 as the weighting matrix in the iterations. First check that the iterative method has indeed converged rather than, for example, stopping at a default maximum number of iterations. If convergence has occurred, try alternative estimates of A, using the expected Hessian or using more accurate numerical com- putations by, for example, using analytical rather than numerical derivatives. If such solutions still fail it is possible that the model is not identified, with this nonidentifica- tion being finessed at the parameter estimation stage by using an iterative method that did not compute the Hessian. Other perceived computational problems are parameter and variance estimates that do not accord with prior beliefs. For parameter estimates obvious checks include en- suring correct treatment of an intercept term (inclusion or exclusion, depending on the context), that convergence has been achieved, and that a global maximum is obtained (by trying a range of starting values). If standard errors of parameter estimates dif- fer across statistical packages that give the same parameter estimates, the most likely cause is that a different method has been used to construct the variance matrix estimate (see Section 5.5.2). A good computational strategy is to start with a small subset of the data and regres- sors, say one regressor and 100 observations. 
This simplifies detailed tracing of the program either manually, such as by printing out key output along the way, or using
  • 398. NUMERICAL OPTIMIZATION a built-in trace facility if the program has one. If the program passes this test then computational problems with the full model and data are less likely to be due to in- correct data input or coding errors and are more likely due to genuine computational difficulties such as multicollinearity or poor starting values. A good way to test program validity is to construct a simulated data set where the true parameters are known. For a large sample size, say N = 10,000, the estimated parameter values should be close to the true values. Finally, note that obtaining reasonable computational results from estimation of a nonlinear model does not guarantee correct results. For example, many early pub- lished applications of multinomial probit models reported apparently sensible results, yet the models estimated have subsequently been determined to be not identified (see Section 15.8.1). 10.5. Bibliographic Notes Numerical problems can arise even in linear models, and it is instructive to read Davidson and MacKinnon (1993, Section 1.5) and Greene (2003, appendix E). Standard references for statis- tical computation are Kennedy and Gentle (1980) and especially Press et al. (1993) and related co-authored books by Press. For evaluation of functions the standard reference is Abramowitz and Stegun (1971). Quandt (1983) presents many computational issues, including optimization. 5.3 Summaries of iterative methods are given in Amemiya (1985, Section 4.4), Davidson and MacKinnon (1993, Section 6.7), Maddala (1977, Section 9.8), and especially Greene (2003, appendix E.6). Harvey (1990) gives many applications of the GN algorithm, which, owing to its simplicity, is the usual iterative method for NLS estimation. For the EM algorithm see especially Amemiya (1985, pp. 375–378). For SA see Goffe et al. (1994). Exercises 10–1 Consider calculation of the MLE in the logit regression model when the only re- gressor is the intercept. Then E[y ] = 1/(1 + e−β ) and the gradient of the scaled log-likelihood function g(β) = (y − 1/(1 + e−β )). Suppose a sample yields ȳ = 0.8 and the starting value is β = 0.0. (a) Calculate β for the first six iterations of the Newton–Raphson algorithm. (b) Calculate the first six iterations of a gradient algorithm that sets As = 1 in (10.1), so βs+1 = βs + gs. (c) Compare the performance of the methods in parts (a) and (b). 10–2 Consider the nonlinear regression model y = αx1 + γ/(x2 − δ) + u, where x1 and x2 are exogenous regressors independent of the iid error u ∼ N[0, σ2 ]. (a) Derive the equation for the Gauss–Newton algorithm for estimating (α, γ, δ). (b) Derive the equation for the Newton–Raphson algorithm for estimating (α, γ, δ). (c) Explain the importance of not arbitrarily choosing the starting values of the algorithm. 10–3 Suppose that the pdf of y has a C-component mixture form, f (y|π) = C j =1 π j f j (y), where π = (π1, . . . , πC), πj 0, C j =1 π j = 1. The π j are 352
  • 399. 10.5. BIBLIOGRAPHIC NOTES unknown mixing proportions whereas the parameters of the densities f j (y) are presumed known. (a) Given a random sample on yi , i = 1, . . . , N, write the general log-likelihood function and obtain the first-order conditions for πML. Verify that there is no explicit solution for πML. (b) Let zi be a C × 1 vector of latent categorical variables, i = 1, . . . , N, such that zji = 1 if y comes from the j th component of the mixture and zji = 0 otherwise. Write down the likelihood function in terms of the observed and latent variables as if the latent variable were observed. (c) Devise an EM algorithm for estimating π. [Hint: If zji were observable the MLE of π j = N−1 i zji . The E step requires calculation of E[zji |yi ]; the M step requires replacing zji by E[zji |yi ] and then solving for π.] 10–4 Let (y1i , y2i ), i = 1, . . . , N, have a bivariate normal distribution with mean (µ1, µ2) and covariance parameters (σ11, σ12, σ22) and correlation coefficient ρ. Suppose that all N observations on y1 are available but there are m N missing observations on y2. Using the fact that the marginal distribution of yj is N[µj , σj j ], and that conditionally y2|y1 ∼ N[µ2.1, σ22.1], where µ2.1 = µ2 + σ12/σ22(y1 − µ1), σ22.1 = (1 − ρ2 )σ22, devise an EM algorithm for imputing the missing observations on y1. 353
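The E and M steps described in the hint to Exercise 10-3 can be made concrete with a short sketch. The following Python/NumPy code is an illustration only: the two-component normal mixture used to generate the data, the sample size, and the variable names are assumptions chosen for the example. The component densities are treated as known, so only the mixing proportions are updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a mixture of N(0,1) and N(4,1) with true mixing proportions (0.3, 0.7)
N = 2000
from_first = rng.random(N) < 0.3
y = np.where(from_first, rng.normal(0.0, 1.0, N), rng.normal(4.0, 1.0, N))

def npdf(y, m, s):
    """Normal density with mean m and standard deviation s."""
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

f = np.column_stack([npdf(y, 0.0, 1.0), npdf(y, 4.0, 1.0)])   # f_j(y_i), fixed throughout

pi = np.array([0.5, 0.5])                        # starting values for the mixing proportions
for it in range(500):
    # E step: E[z_ji | y_i] = pi_j f_j(y_i) / sum_k pi_k f_k(y_i)
    w = pi * f
    w /= w.sum(axis=1, keepdims=True)
    # M step: pi_j = N^{-1} sum_i E[z_ji | y_i]
    pi_new = w.mean(axis=0)
    if np.max(np.abs(pi_new - pi)) < 1e-10:
        break
    pi = pi_new

print("estimated mixing proportions:", pi)       # should be close to (0.3, 0.7)
```

The same E-then-M pattern, with conditional expectations of the latent quantities replacing the unobserved values, underlies the imputation algorithm asked for in Exercise 10-4.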
  • 401. PART THREE Simulation-Based Methods Part 1 emphasized that microeconometric models are frequently nonlinear models es- timated using large and heterogeneous data sets drawn from surveys that are complex and subject to a variety of sampling biases. A realistic depiction of the economic phe- nomena in such settings often requires the use of models for which estimation and subsequent statistical inference are difficult. Advances in computing hardware and software now make it feasible to tackle such tasks. Part 3 presents modern, computer- intensive, simulation-based methods of estimation and inference that mitigate some of these difficulties. The background required to cover this material varies somewhat with the chapter, but the essential base is least squares and maximum likelihood estimation. Chapter 11 presents bootstrap methods for statistical inference. These methods have the attraction of providing a simple way to obtain standard errors when the formulae from asymptotic theory are complex, as is the case, for example, for some two-step estimators. Furthermore, if implemented appropriately, a bootstrap can lead to a more refined asymptotic theory that may then lead to better statistical inference in small samples. Chapter 12 presents simulation-based estimation methods. These methods permit estimation in situations where standard computational methods may not permit calcu- lation of an estimator, because of the presence of an integral over a probability distri- bution that leads to no closed-form solution. Chapter 13 surveys Bayesian methods that provide an approach to estimation and inference that is quite different from the classical approach used in other chapters of this book. Despite this different approach, in practice in large sample settings the Bayesian approach produces similar results to those from classical methods. Further, they often do so in a computationally more efficient manner. 355
  • 403. C H A P T E R 11 Bootstrap Methods 11.1. Introduction Exact finite-sample results are unavailable for most microeconometrics estimators and related test statistics. The statistical inference methods presented in preceding chapters rely on asymptotic theory that usually leads to limit normal and chi-square distributions. An alternative approximation is provided by the bootstrap, due to Efron (1979, 1982). This approximates the distribution of a statistic by a Monte Carlo simulation, with sampling done from the empirical distribution or the fitted distribution of the ob- served data. The additional computation required is usually feasible given advances in computing power. Like conventional methods, however, bootstrap methods rely on asymptotic theory and are only exact in infinitely large samples. The wide range of bootstrap methods can be classified into two broad approaches. First, the simplest bootstrap methods can permit statistical inference when conven- tional methods such as standard error computation are difficult to implement. Second, more complicated bootstraps can have the additional advantage of providing asymp- totic refinements that can lead to a better approximation in-finite samples. Applied researchers are most often interested in the first aspect of the bootstrap. Theoreticians emphasize the second, especially in settings where the usual asymptotic methods work poorly in finite samples. The econometrics literature focuses on use of the bootstrap in hypothesis test- ing, which relies on approximation of probabilities in the tails of the distributions of statistics. Other applications are to confidence intervals, estimation of standard er- rors, and bias reduction. The bootstrap is straightforward to implement for smooth √ N-consistent estimators based on iid samples, though bootstraps with asymptotic re- finements are underutilized. Caution is needed in other settings, including nonsmooth estimators such as the median, nonparametric estimators, and inference for data that are not iid. A reasonably self-contained summary of the bootstrap is provided in Section 11.2, an example is given in Section 11.3, and some theory is provided in Section 11.4. 357
  • 404. BOOTSTRAP METHODS Further variations of the bootstrap are presented in Section 11.5. Section 11.6 presents use of the bootstrap for specific types of data and specific methods used often in microeconometrics. 11.2. Bootstrap Summary We summarize key bootstrap methods for estimator θ and associated statistics based on an iid sample {w1, . . . , wN }, where usually wi = (yi , xi ) and θ is a smooth esti- mator that is √ N consistent and asymptotically normally distributed. For notational simplicity we generally present results for scalar θ. For vector θ in most instances replace θ by θj , the jth component of θ. Statistics of interest include the usual regression output: the estimate θ; standard er- rors s θ ; t-statistic t = ( θ − θ0)/s θ , where θ0 is the null hypothesis value; the associated critical value or p-value for this statistic; and a confidence interval. This section presents bootstraps for each of these statistics. Some motivation is also provided, with the underlying theory sketched in Section 11.4. 11.2.1. Bootstrap without Refinement Consider estimation of the variance of the sample mean µ = ȳ = N−1 N i=1 yi , where the scalar random variable yi is iid [µ, σ2 ], when it is not known that V[ µ] = σ2 /N. The variance of µ could be obtained by obtaining S such samples of size N from the population, leading to S sample means and hence S estimates µs = ȳs, s = 1, . . . , S. Then we could estimate V[ µ] by (S − 1)−1 S s=1( µs − µ)2 , where µ = S−1 S s=1 µs. Of course this approach is not possible, as we only have one sample. A bootstrap can implement this approach by viewing the sample as the population. Then the finite population is now the actual data y1, . . . , yN . The distribution of µ can be obtained by drawing B bootstrap samples from this population of size N, where each bootstrap sample of size N is obtained by sampling from y1, . . . , yN with replacement. This leads to B sample means and hence B estimates µb = ȳb, b = 1, . . . , B. Then esti- mate V[ µ] by (B − 1)−1 B b=1( µb − µ)2 , where µ = B−1 B b=1 µb. Sampling with replacement may seem to be a departure from usual sampling methods, but in fact standard sampling theory assumes sampling with replacement rather than without re- placement (see Section 24.2.2). With additional information other ways to obtain bootstrap samples may be possi- ble. For example, if it is known that yi ∼ N[µ, σ2 ] then we could obtain B bootstrap samples of size N by drawing from the N[ µ, s2 ] distribution. This bootstrap is an example of a parametric bootstrap, whereas the preceding bootstrap was from the em- pirical distribution. More generally, for estimator θ similar bootstraps can be used to, for example, estimate V[ θ] and hence standard errors when analytical formulas for V[ θ] are com- plex. Such bootstraps are usually valid for observations wi that are iid over i, and they have similar properties to estimates obtained using the usual asymptotic theory. 358
  • 405. 11.2. BOOTSTRAP SUMMARY 11.2.2. Asymptotic Refinements In some settings it is possible to improve on the preceding bootstrap and obtain es- timates that are equivalent to those obtained using a more refined asymptotic theory that may better approximate the finite-sample distribution of θ. Much of this chapter is directed to such asymptotic refinements. Usual asymptotic theory uses the result that √ N( θ − θ0) d → N[0, σ2 ]. Thus Pr[ √ N( θ − θ0)/σ ≤ z] = Φ(z) + R1, (11.1) where Φ(·) is the standard normal cdf and R1 is a remainder term that disappears as N → ∞. This result is based on asymptotic theory detailed in Section 5.3 that includes ap- plication of a central limit theorem. The CLT is based on a truncated power-series expansion. The Edgeworth expansion, detailed in Section 11.4.3, includes additional terms in the expansion. With one extra term this yields Pr[ √ N( θ − θ0)/σ ≤ z] = Φ(z) + g1(z)φ(z) √ N + R2, (11.2) where φ(·) is the standard normal density, g1(·) is a bounded function given after (11.13) in Section 11.4.3 and R2 is a remainder term that disappears as N → ∞. The Edgeworth expansion is difficult to implement theoretically as the function g1(·) is data dependent in a complicated way. A bootstrap with asymptotic refinement provides a simple computational method to implement the Edgeworth expansion. The theory is given in Section 11.4.4. Since R1 = O(N−1/2 ) and R2 = O(N−1 ), asymptotically R2 R1, leading to a better approximation as N → ∞. However, in finite samples it is possible that R2 R1. A bootstrap with asymptotic refinement provides a better approximation asymptot- ically that hopefully leads to a better approximation in samples of the finite sizes typ- ically used. Nevertheless, there is no guarantee and simulation studies are frequently used to verify that finite-sample gains do indeed occur. 11.2.3. Asymptotically Pivotal Statistic For asymptotic refinement to occur, the statistic being bootstrapped must be an asymp- totically pivotal statistic, meaning a statistic whose limit distribution does not depend on unknown parameters. This result is explained in Section 11.4.4. As an example, consider sampling from yi ∼ [µ, σ2 ]. Then the estimate µ = ȳ a ∼ N[µ, σ2 /N] is not asymptotically pivotal even given a null hypothesis value µ = µ0 since its distribution depends on the unknown parameter σ2 . However, the studentized statistic t = ( µ − µ0)/s µ a ∼ N[0, 1] is asymptotically pivotal. Estimators are usually not asymptotically pivotal. However, conventional asymp- totically standard normal or chi-squared distributed test statistics, including Wald, Lagrange multiplier, and likelihood ratio tests, and related confidence intervals, are asymptotically pivotal. 359
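The practical content of pivotality can be seen in a small simulation, sketched below in Python/NumPy; the normal data-generating process, the sample size, and the number of replications are illustrative assumptions. Across repeated samples the spread of the unstudentized statistic scales with the unknown σ, whereas the studentized statistic has roughly unit spread whichever σ generated the data.

```python
import numpy as np

rng = np.random.default_rng(1)
N, R, mu0 = 100, 5000, 0.0

for sigma in (1.0, 5.0):
    raw = np.empty(R)
    tstat = np.empty(R)
    for r in range(R):
        y = rng.normal(mu0, sigma, N)
        mu_hat = y.mean()
        s_mu = y.std(ddof=1) / np.sqrt(N)        # standard error of the sample mean
        raw[r] = np.sqrt(N) * (mu_hat - mu0)     # not pivotal: its spread depends on sigma
        tstat[r] = (mu_hat - mu0) / s_mu         # studentized, asymptotically pivotal
    print(f"sigma={sigma}: sd of sqrt(N)(mu_hat - mu0) = {raw.std():.2f}, sd of t = {tstat.std():.2f}")
```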
  • 406. BOOTSTRAP METHODS 11.2.4. The Bootstrap In this section we provide a broad description of the bootstrap, with further details given in subsequent sections. Bootstrap Algorithm A general bootstrap algorithm is as follows: 1. Given data w1, . . . , wN , draw a bootstrap sample of size N using a method given in the following and denote this new sample w∗ 1, . . . , w∗ N . 2. Calculate an appropriate statistic using the bootstrap sample. Examples include (a) the estimate θ ∗ of θ, (b) the standard error s θ ∗ of the estimate θ ∗ , and (c) a t-statistic t∗ = ( θ ∗ − θ)/s θ ∗ centered at the original estimate θ. Here θ ∗ and s θ ∗ are calculated in the usual way but using the new bootstrap sample rather than the original sample. 3. Repeat steps 1 and 2 B independent times, where B is a large number, obtaining B bootstrap replications of the statistic of interest, such as θ ∗ 1 , . . . , θ ∗ B or t∗ 1 , . . . , t∗ B. 4. Use these B bootstrap replications to obtain a bootstrapped version of the statistic, as detailed in the following subsections. Implementation can vary according to how bootstrap samples are obtained, how many bootstraps are performed, what statistic is being bootstrapped, and whether or not that statistic is asymptotically pivotal. Bootstrap Sampling Methods The bootstrap dgp in step 1 is used to approximate the true unknown dgp. The simplest bootstrapping method is to use the empirical distribution of the data, which treats the sample as being the population. Then w∗ 1, . . . , w∗ N are obtained by sampling with replacement from w1, . . . , wN . In each bootstrap sample so obtained, some of the original data points will appear multiple times whereas others will not appear at all. This method is an empirical distribution function (EDF) bootstrap or nonparametric bootstrap. It is also called a paired bootstrap since in single- equation regression models wi = (yi , xi ), so here both yi and xi are resampled. Suppose the conditional distribution of the data is specified, say y|x ∼ F(x, θ0), and an estimate θ p → θ0 is available. Then in step 1 we can instead form a bootstrap sample by using the original xi while generating yi by random draws from F(xi , θ). This corresponds to regressors fixed in repeated samples (see Section 4.4.5). Alternatively, we may first resample x∗ i from x1, . . . , xN and then generate yi from F(x∗ i , θ), i = 1, . . . , N. Both are examples of a parametric bootstrap that can be applied in fully parametric models. For regression model with additive iid error, say yi = g(xi , β) + ui , we can form fitted residuals u1, . . . , uN , where ui = yi − g(xi , β). Then in step 1 bootstrap from these residuals to get a new draw of residuals, say ( u∗ 1 , . . . , u∗ N ), leading to a bootstrap sample (y∗ 1 , x1), . . . , (y∗ N , xN ), where y∗ i = g(xi , β) + u∗ i . This bootstrap is called a 360
  • 407. 11.2. BOOTSTRAP SUMMARY residual bootstrap. It uses information intermediate between the nonparametric and parametric bootstrap. It can be applied if the error term has distribution that does not depend on unknown parameters. We emphasize the paired bootstrap on grounds of its simplicity, applicability to a wide range of nonlinear models, and reliance on weak distributional assumptions. However, the other bootstraps generally provide a better approximation (see Horowitz, 2001, p. 3185) and should be used if the stronger model assumptions they entail are warranted. The Number of Bootstraps The bootstrap asymptotics rely on N → ∞ and so the bootstrap can be asymptotically valid even for low B. However, clearly the bootstrap is more accurate as B → ∞. A sufficiently large value of B varies with one’s tolerance for bootstrap-induced simula- tion error and with the purpose of the bootstrap. Andrews and Buchinsky (2000) present an application-specific numerical method to determine the number of replications B needed to ensure a given level of accuracy or, equivalently, the level of accuracy obtained for a given value of B. Let λ denote the quantity of interest, such as a standard error or a critical value, λ∞ denote the ideal bootstrap estimate with B = ∞, and λB denote the estimate with B bootstraps. Then Andrews and Buchinsky (2000) show that √ B( λB − λ∞)/ λ∞ d → N[0, ω], where ω varies with the application and is defined in Table III of Andrews and Buchin- sky (2000). It follows that Pr[δ ≤ zτ/2 √ ω/B] = 1 − τ, where δ = | λB − λ∞|/ λ∞ denotes the relative discrepancy caused by only B replications. Thus B ≥ ωz2 τ/2/δ2 ensures the relative discrepancy is less than δ with probability at least 1 − τ. Alterna- tively, given B replications the relative discrepancy is less than δ = zτ/2 √ ω/B. To provide concrete guidelines we propose the rule of thumb that B = 384ω. This ensures that the relative discrepancy is less than 10% with probability at least 0.95, since z2 .025/0.12 = 384. The only difficult part in implementation is estimation of ω, which varies with the application. For standard error estimation ω = (2 + γ4)/4, where γ4 is the coefficient of excess kurtosis for the bootstrap estimator θ ∗ . Intuitively, fatter tails in the distribution of the estimator mean outliers are more likely, contaminating standard error estimation. It follows that B = 384 × (1/2) = 192 is enough if γ4 = 0 whereas B = 960 is needed if γ4 = 8. These values are higher than those proposed by Efron and Tibsharani (1993, p. 52), who state that B = 200 is almost always enough. For a symmetric two-sided test or confidence interval at level α, ω = α(1 − α)/[2zα/2φ(zα/2)]2 . This leads to B = 348 for α = 0.05 and B = 685 for α = 0.01. As expected more bootstraps are needed the further one goes into the tails of the distribution. 361
  • 408. BOOTSTRAP METHODS For a one-sided test or nonsymmetric two-sided test or confidence interval at level α, ω = α(1 − α)/[zαφ(zα)]2 . This leads to B = 634 for α = 0.05 and B = 989 for α = 0.01. More bootstraps are needed when testing in one tail. For chi-squared tests with h degrees of freedom ω = α(1 − α)/[χ2 α(h) f (χ2 α(h))]2 , where f (·) is the χ2 (h) density. For test p-values ω = (1 − p)/p. For example, if p = 0.05 then ω = 19 and B = 7,296. Many more bootstraps are needed for precise calculation of the test p-value compared to hypothesis rejection if a critical value is exceeded. For bias-corrected estimation of θ a simple rule uses ω = σ2 / θ 2 , where the esti- mator θ has standard error σ. For example, if the usual t-statistic t = θ/ σ = 2 then ω = 1/4 and B = 96. Andrews and Buchinsky (2000) provide many more details and refinements of these results. For hypothesis testing, Davidson and MacKinnon (2000) provide an alternative approach. They focus on the loss of power caused by bootstrapping with finite B. (Note that there is no power loss if B = ∞.) On the basis of simulations they recom- mend at least B = 399 for tests at level 0.05, and at least B = 1,499 for tests at level 0.01. They argue that for testing their approach is superior to that of Andrews and Buchinsky. Several other papers by Davidson and MacKinnon, summarized in MacKinnon (2002), emphasize practical considerations in bootstrap inference. For hypothesis test- ing at level α choose B so that α(B + 1) is an integer. For example, at α = 0.05 let B = 399 rather than 400. If instead B = 400 it is unclear on an upper one-sided al- ternative test whether the 20th or 21st largest bootstrap t-statistic is the critical value. For nonlinear models computation can be reduced by performing only a few Newton– Raphson iterations in each bootstrap sample from starting values equal to the initial parameter estimates. 11.2.5. Standard Error Estimation The bootstrap estimate of variance of an estimator is the usual formula for estimating a variance, applied to the B bootstrap replications θ ∗ 1 , . . . , θ ∗ B: s2 θ,Boot = 1 B − 1 B b=1 ( θ ∗ b − θ ∗ )2 , (11.3) where θ ∗ = B−1 B b=1 θ ∗ b . (11.4) Taking the square root yields s θ,Boot, the bootstrap estimate of the standard error. This bootstrap provides no asymptotic refinement. Nonetheless, it can be ex- traordinarily useful when it is difficult to obtain standard errors using conventional methods. There are many examples. The estimate θ may be a sequential two-step m-estimator whose standard error is difficult to compute using the results given in Secttion 6.8. The estimate θ may be a 2SLS estimator estimated using a package that 362
  • 409. 11.2. BOOTSTRAP SUMMARY only reports standard errors assuming homoskedastic errors but the errors are actu- ally heteroskedastic. The estimate θ may be a function of other parameters that are actually estimated, for example, θ = α/ β, and the bootstrap can be used instead of the delta method. For clustered data with many small clusters, such as short panels, cluster-robust standard errors can be obtained by resampling the clusters. Since the bootstrap estimate s θ,Boot is consistent, it can be used in place of s θ in the usual asymptotic formula to form confidence intervals and hypothesis tests that are asymptotically valid. Thus asymptotic statistical inference is possible in settings where it is difficult to obtain standard errors by other methods. However, there will be no improvement in finite-sample performance. To obtain an asymptotic refinement the methods of the next section are needed. 11.2.6. Hypothesis Testing Here we consider tests on an individual coefficient, denoted θ. The test may be either an upper one-tailed alternative of H0 : θ ≤ θ0 against Ha : θ θ0 or a two-sided test of H0 : θ = θ0 against Ha : θ = θ0. Other tests are deferred to Section 11.6.3. Tests with Asymptotic Refinement The usual test statistic TN = ( θ − θ0)/s θ provides the potential for asymptotic refine- ment, as it is asymptotically pivotal since its asymptotic standard normal distribution does not depend on unknown parameters. We perform B bootstrap replications pro- ducing B test statistics t∗ 1 , . . . , t∗ B, where t∗ b = ( θ ∗ b − θ)/s θ ∗ b . (11.5) The estimates t∗ b are centered around the original estimate θ since resampling is from a distribution centered around θ. The empirical distribution of t∗ 1 , . . . , t∗ B, or- dered from smallest to largest, is then used to approximate the distribution of TN as follows. For an upper one-tailed alternative test the bootstrap critical value (at level α) is the upper α quantile of the B ordered test statistics. For example, if B = 999 and α = 0.05 then the critical value is the 950th highest value of t∗ , since then (B + 1)(1 − α) = 950. For a similar lower tail one-sided test the critical value is the 50th smallest value of t∗ . One can also compute a bootstrap p-value in the obvious way. For example, if the original statstistic t lies between the 914th and 915th largest values of 999 bootstrap replicates then the p-value for a upper one-tailed alternative test is 1 − 914/(B + 1) = 0.086. For a two-sided test a distinction needs to be made between symmetrical and nonsymmetrical tests. For a nonsymmetrical test or equal-tailed test the bootstrap critical values (at level α) are the lower α/2 and upper α/2 quantiles of the ordered test statistics t∗ , and the null hypothesis is rejected at level α if the original t-statistic lies outside this range. For a symmetrical test we instead order |t∗ | and the bootstrap 363
  • 410. BOOTSTRAP METHODS critical value (at level α) is the upper α quantile of the ordered |t∗ |. The null hypoth- esis is rejected at level α if |t| exceeds this critical value. These tests, using the percentile-t method, provide asymptotic refinements. For a one-sided t-test and for a nonsymmetrical two-sided t-test the true size of the test is α + O(N−1/2 ) with standard asymptotic critical values and α + O(N−1 ) with boot- strap critical values. For a two-sided symmetrical t-test or for an asymptotic chi- square test the asymptotic approximations work better, and the true size of the test is α + O(N−1 ) using standard asymptotic critical values and α + O(N−2 ) using boot- strap critical values. Tests without Asymptotic Refinement Alternative bootstrap methods can be used that although asymptotically valid do not provide an asymptotic refinement. One approach already mentioned at the end of Section 11.2.5 is to compute t = ( θ − θ0)/s θ,boot, where the bootstrap estimate s θ,boot given in (11.3) replaces the usual estimate s θ , and compare this test statistic to critical values from the standard normal distribution. A second approach, exposited here for a two-sided test of H0 : θ = θ0 against Ha : θ = θ0, finds the lower α/2 and upper α/2 quantiles of the bootstrap estimates θ ∗ 1 , . . . , θ ∗ B and rejects H0 if θ0 falls outside this region. This is called the percentile method. Asymptotic refinement is obtained by using t∗ b in (11.5) that centers around θ rather than θ0 and using a different standard error s ∗ θ in each bootstrap. These two bootstraps have the attraction of not requiring computation of s θ , the usual standard error estimate based on asymptotic theory. 11.2.7. Confidence Intervals Much of the statistics literature considers confidence interval estimation rather than its flip side of hypothesis tests. Here instead we began with hypothesis tests, so only a brief presentation of confidence intervals is necessary. An asymptotic refinement is based on the t-statistic, which is asymptotically piv- otal. Thus from steps 1–3 in Section 11.2.4 we obtain bootstrap replication t-statistics t∗ 1 , . . . , t∗ B. Then let t∗ [1−α/2] and t∗ [α/2] denote the lower and upper α/2 quantiles of these t-statistics. The percentile-t method 100(1 − α) percent confidence interval is θ − t∗ [1−α/2] × s θ , θ + t∗ [α/2] × s θ , (11.6) where θ and s θ are the estimate and standard error from the original sample. An alternative is the bias-corrected and accelerated (BCa) method detailed in Efron (1987). This offers an asymptotic refinement in a wider class of problems than the percentile-t method. Other methods provide an asymptotically valid confidence interval, but without asymptotic refinement. First, one can use the bootstrap estimate of the standard 364
  • 411. 11.2. BOOTSTRAP SUMMARY error in the usual confidence interval formula, leading to interval ( θ − z[1−α/2] × s θ,boot, θ + z[α/2] × s θ,boot). Second, the percentile method confidence interval is the distance between the lower α/2 and upper α/2 quantiles of the B bootstrap estimates θ ∗ 1 , . . . , θ ∗ B of θ. 11.2.8. Bias Reduction Nonlinear estimators are usually biased in finite samples, though this bias goes to zero asymptotically if the estimator is consistent. For example, if µ3 is estimated by θ = ȳ3 , where yi is iid [µ, σ2 ], then E[ θ − µ3 ] = 3µσ2 /N+E[(y − µ)3 ]/N2 . More generally, for a √ N-consistent estimator E[ θ − θ0] = aN N + bN N2 + cN N3 + . . . , (11.7) where aN , bN , and cN are bounded constants that vary with the data and estimator (see Hall, 1992, p. 53). An alternative estimator θ provides an asymptotic refinement if E[ θ − θ0] = BN N2 + CN N3 + . . . , (11.8) where BN and CN are bounded constants. For both estimators the bias disappears as N → ∞. The latter estimator has the attraction that the bias goes to zero at a faster rate, and hence it is an asymptotic refinement, though in finite samples it is possible that (BN /N2 ) (aN /N + bN /N2 ). We wish to estimate the bias E[ θ] − θ. This is the distance between the expected value or population average value of the parameter and the parameter value generating the data. The bootstrap replaces the population with the sample, so that the bootstrap samples are generated by parameter θ, which has average value θ ∗ over the bootstraps. The bootstrap estimate of the bias is then Bias θ = ( θ ∗ − θ), (11.9) where θ ∗ is defined in (11.4). Suppose, for example, that θ = 4 and θ ∗ = 5. Then the estimated bias is (5 − 4) = 1, an upward bias of 1. Since θ overestimates by 1, bias correction requires subtracting 1 from θ, giving a bias-corrected estimate of 3. More generally, the bootstrap bias- corrected estimator of θ is θBoot = θ − ( θ ∗ − θ) (11.10) = 2 θ − θ ∗ . Note that θ ∗ itself is not the bias-corrected estimate. For more details on the direction of the correction, which may seem puzzling, see Efron and Tibsharani (1993, p. 138). For typical √ N-consistent estimators the asymptotic bias of θ is O(N−1 ) whereas the asymptotic bias of θBoot is instead O(N−2 ). In practice bias correction is seldom used for √ N-consistent estimators, as the boot- strap estimate can be more variable than the original estimate θ and the bias is often 365
  • 412. BOOTSTRAP METHODS small relative to the standard error of the estimate. Bootstrap bias correction is used for estimators that converge at rate less than √ N, notably nonparametric regression and density estimators. 11.3. Bootstrap Example As a bootstrap example, consider the exponential regression model introduced in Sec- tion 5.9. Here the data are generated from an exponential distribution with an expo- nential mean with two regressors: yi |xi ∼ exponential(λi ), i = 1, . . . , 50, λi = exp(β1 + β2x2i + β3x3i ), (x2i , x3i ) ∼ N[0.1, 0.1; 0.12 , 0.12 , 0.005], (β1, β2, β3) = (−2, 2, 2). Maximum likelihood estimation on a sample of 50 observations yields β1 = −2.192; β2 = 0.267, s2 = 1.417, and t2 = 0.188; and β3 = 4.664, s3 = 1.741, and t3 = 2.679. For this ML example the standard errors were based on − A−1 , minus the inverse of the estimated Hessian matrix. We concentrate on statistical inference for β3 and demonstrate the bootstrap for standard error computation, test of statistical significance, confidence intervals, and bias correction. The differences between bootstrap and usual asymptotic estimates are relatively small in this example and can be much larger in other examples. The results reported here are based on the paired bootstrap (see Section 11.2.4) with (yi , x2i , x3i ) jointly resampled with replacement B = 999 times. From Table 11.1, the 999 bootstrap replication estimates β ∗ 3,b, b = 1, . . . , 999, had mean 4.716 and standard deviation of 1.939. Table 11.1 also gives key percentiles for β ∗ 3 and t∗ 3 (defined in the following). A parametric bootstrap could have been used instead. Then bootstrap samples would be obtained by drawing yi from the exponential distribution with parameter exp( β1 + β2x2i + β3x3i ). In the case of tests of H0 : β3 = 0 the exponential param- eter could instead be exp( β1 + β2x2i ), where β1 and β2 are then the restricted ML estimates from the original sample. Standard errors: From (11.3) the bootstrap estimate of standard error is computed using the usual standard deviation formula for the 999 bootstrap replication esti- mates of β3. This yields estimate 1.939 compared to the usual asymptotic standard error estimate of 1.741. Note that this bootstrap offers no refinement and would only be used as a check or if finding the standard error by other means proved difficult. Hypothesis testing with asymptotic refinement: We consider test of H0 : β3 = 0 against Ha : β3 = 0 at level 0.05. A test with asymptotic refinement is based on the t-statistic, which is asymptotically pivotal. From Section 11.2.6 for each bootstrap we compute t∗ 3 = ( β ∗ 3 − 4.664)/s β ∗ 3 , which is centered on the estimate β3 = 4.664 from the original sample. For a nonsymmetrical test the bootstrap critical values 366
  • 413. 11.3. BOOTSTRAP EXAMPLE Table 11.1. Bootstrap Statistical Inference on a Slope Coefficient: Examplea β ∗ 3 t∗ 3 z = t(∞) t(47) Mean 4.716 0.026 1.021 1.000 SDb 1.939 1.047 1.000 1.021 1% −.336 −2.664 −2.326 −2.408 2.5% 0.501 −2.183 −1.960 −2.012 5% 1.545 −1.728 −1.645 −1.678 25% 3.570 −0.621 −0.675 −0.680 50% 4.772 0.062 0.000 0.000 75% 5.971 0.703 0.675 0.680 95% 7.811 1.706 1.645 1.678 97.5% 8.484 2.066 1.960 2.012 99.0% 9.427 2.529 2.326 2.408 a Summary statistics and percentiles based on 999 paired bootstrap resamples for (1) estimate β ∗ 3; (2) the associated statistics t∗ 3 = ( β ∗ 3− β3)/s β ∗ 3 ; (3) student t- distribution with 47 degrees of freedom; (4) standard normal distribution. Original dgp is one draw from the exponential distribution given in the text; the sample size is 50. b SD, standard deviation. equal the lower and upper 2.5 percentiles of the 999 values of t∗ 3 , the 25th lowest and 25th highest values. From Table 11.1 these are −2.183 and 2.066. Since the t-statistic computed from the original sample t3 = (4.664 − 0)/1.741 = 2.679 2.066, the null hypothesis is rejected. A symmetrical test that instead uses the upper 5 percentile of |t∗ 3 | yields bootstrap critical value 2.078 that again leads to rejection of H0 at level 0.05. The bootstrap critical values in this example exceed those using the asymptotic approximation of either standard normal or t(47), an ad hoc finite-sample adjust- ment motivated by the exact result for linear regression under normality. So the usual asymptotic results in this example lead to overrejection and have actual size that exceeds the nominal size. For example, at 5% the z critical region values of (−1.960, 1.960) are smaller than the bootstrap critical values (−2.183, 2.066). Figure 11.1 plots the bootstrap estimate based on t∗ 3 of the density of the t-test, smoothed using kernel methods, and compares it to the standard normal. The two densities appear close, though the left tail is notably fatter for the bootstrap estimate. Table 11.1 makes clearer the difference in the tails. Hypothesis testing without asymptotic refinement: Alternative bootstrap testing methods can be used but do not offer an asymptotic refinement. First, using the bootstrap standard error estimate of 1.939, rather than the asymptotic standard error estimate of 1.741, yields t3 = (4.664 − 0)/1.939 = 2.405. This leads to rejection at level 0.05 using either standard normal or t(47) critical values. Second, from Table 11.1, 95% of the bootstrap estimates β ∗ 3 lie in the range (0.501, 8.484), which does not include the hypothesized value of 0, so again we reject H0 : β3 = 0. 367
  • 414. BOOTSTRAP METHODS 0 .1 .2 .3 .4 Density -4 -2 0 2 4 t-statistic from each bootstrap replication Bootstrap Estimate Standard Normal Bootstrap Density of ‘t-Statistic’ Figure 11.1: Bootstrap density of t-test statistic for slope equal to zero obtained from 999 bootstrap replications with standard normal density plotted for comparison. Data are generated from an exponential distribution regression model. Confidence intervals: An asymptotic refinement is obtained using the 95% percentile- t confidence interval. Applying (11.6) yields (4.664 − 2.183 × 1.741, 4.664 + 2.066 × 1.741) or (0.864, 8.260). This compares to a conventional 95% asymptotic confidence interval of 4.664 ± 1.960 × 1.741 or (1.25, 8.08). Other confidence intervals can be constructed, but these do not have an asymp- totic refinement. Using the bootstrap standard error estimate leads to a 95% con- fidence interval 4.664 ± 1.960 × 1.939 = (0.864, 8.464). The percentile method uses the lower and upper 2.5 percentiles of the 999 bootstrap coefficient estimates, leading to a 95% confidence interval of (0.501, 8.484). Bias correction: The mean of the 999 bootstrap replication estimates of β3 is 4.716, compared to the original estimate of 4.664. The estimated bias of (4.716 − 4.664) = 0.052 is quite small, especially compared to the standard error of s3 = 1.741. The estimated bias is upward and (11.10) yields a bias-corrected estimate of β3 equal to 4.664 − 0.052 = 4.612. The bootstrap relies on asymptotic theory and may actually provide a finite- sample approximation worse than that of conventional methods. To determine that the bootstrap is really an improvement here we need a full Monte Carlo analysis with, say, 1, 000 samples of size 50 drawn from the exponential dgp, with each of these samples then bootstrapped, say, 999 times. 11.4. Bootstrap Theory The exposition here follows the comprehensive survey of Horowitz (2001). Key results are consistency of the bootstrap and, if the bootstrap is applied to an asymptotically pivotal statistic, asymptotic refinement. 368
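Before turning to the theory, the bootstrap computations used in the preceding example can be collected in a single short sketch. To keep it brief, the Python/NumPy code below applies a paired bootstrap to the OLS slope coefficient in a simulated linear model rather than to the exponential MLE of Section 11.3; the data-generating process, the sample size, and B = 999 are illustrative assumptions. The sketch computes the bootstrap standard error in (11.3), percentile-t critical values and an equal-tailed 95% confidence interval following the computation in the Section 11.3 example (equation (11.6)), a symmetrical percentile-t test of zero slope, and the bias-corrected estimate in (11.10).

```python
import numpy as np

rng = np.random.default_rng(42)
N, B = 50, 999

# Simulated original sample (illustrative dgp): y = 1 + 2x + error
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.standard_t(df=5, size=N)

def ols_slope_and_se(y, x):
    """OLS slope and its conventional standard error."""
    X = np.column_stack([np.ones(x.size), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (x.size - 2)
    return b[1], np.sqrt(s2 * XtX_inv[1, 1])

theta_hat, se_hat = ols_slope_and_se(y, x)

theta_star = np.empty(B)
t_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)             # paired bootstrap: resample (y_i, x_i) jointly
    th_b, se_b = ols_slope_and_se(y[idx], x[idx])
    theta_star[b] = th_b
    t_star[b] = (th_b - theta_hat) / se_b        # t* centered at the original estimate, (11.5)

se_boot = theta_star.std(ddof=1)                 # bootstrap standard error, (11.3)

# Percentile-t critical values and an equal-tailed 95% confidence interval,
# computed as in the numerical example of Section 11.3 (equation (11.6))
q_lo, q_up = np.percentile(t_star, [2.5, 97.5])
ci_t = (theta_hat + q_lo * se_hat, theta_hat + q_up * se_hat)

# Symmetrical percentile-t test of H0: slope = 0 at level 0.05
crit_sym = np.percentile(np.abs(t_star), 95)
reject = np.abs(theta_hat / se_hat) > crit_sym

theta_bc = 2 * theta_hat - theta_star.mean()     # bias-corrected estimate, (11.10)

print("estimate:", theta_hat, "asymptotic se:", se_hat, "bootstrap se:", se_boot)
print("percentile-t 95% CI:", ci_t, "reject H0: slope = 0?", reject)
print("bias-corrected estimate:", theta_bc)
```

Replacing the OLS routine by any other smooth root-N-consistent estimator and its standard error leaves the bootstrap loop itself unchanged.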
  • 415. 11.4. BOOTSTRAP THEORY 11.4.1. The Bootstrap We use X1, . . . , XN as generic notation for the data, where for notational simplicity bold is not used for Xi even though it is usually a vector, such as (yi , xi ). The data are assumed to be independent draws from distribution with cdf F0(x) = Pr[X ≤ x]. In the simplest applications F0 is in a finite-dimensional family, with F0 = F0(x, θ0). The statistic being considered is denoted TN = TN (X1, . . . , XN ). The exact finite- sample distribution of TN is GN = GN (t, F0) = Pr[TN ≤ t]. The problem is to find a good approximation to GN . Conventional asymptotic theory uses the asymptotic distribution of TN , denoted G∞ = G∞(t, F0). This may theoretically depend on unknown F0, in which case we use a consistent estimate of F0. For example, use F0 = F0(·, θ), where θ is consistent for θ0. The empirical bootstrap takes a quite different approach to approximating GN (·, F0). Rather than replace GN by G∞, the population cdf F0 is replaced by a consistent estimator FN of F0, such as the empirical distribution of the sample. GN (·, FN ) cannot be determined analytically but can be approximated by boot- strapping. One bootstrap resample with replacement yields the statistic T ∗ N = TN (X∗ 1, . . . , X∗ N ). Repeating this step B independent times yields replications T ∗ N,1, . . . , T ∗ N,B. The empirical cdf of T ∗ N,1, . . . , T ∗ N,B is the bootstrap estimate of the distribution of T , yielding GN (t, FN ) = 1 B B b=1 1(T ∗ N,b ≤ t), (11.11) where 1(A) equals one if event A occurs and equals zero otherwise. This is just the proportion of the bootstrap resamples for which the realized T ∗ N ≤ t. The notation is summarized in Table 11.2. 11.4.2. Consistency of the Bootstrap The bootstrap estimate GN (t, FN ) clearly converges to GN (t, FN ) as the number of bootstraps B → ∞. Consistency of the bootstrap estimate GN (t, FN ) for GN (t, F0) Table 11.2. Bootstrap Theory Notation Quantity Notation Sample (iid) X1, . . . , XN , where Xi is usually a vector Population cdf of X F0 = F0(x, θ0) = Pr[X ≤ x] Statistic of interest TN = TN (X1, . . . , XN ) Finite sample cdf of TN GN = GN (t, F0) = Pr[TN ≤ t] Limit cdf of TN G∞ = G∞(t, F0) Asymptotic cdf of TN G∞ = G∞(t, F0), where F0 = F0(x, θ) Bootstrap cdf of TN GN (t, FN ) = B−1 B b=1 1(T ∗ N,b ≤ t) 369
  • 416. BOOTSTRAP METHODS therefore requires that GN (t, FN ) p → GN (t, F0), uniformly in the statistic t and for all F0 in the space of permitted cdfs. Clearly, FN must be consistent for F0. Additionally, smoothness in the dgp F0(x) is needed, so that FN (x) and F0(x) are close to each other uniformly in the observations x for large N. Moreover, smoothness in GN (·, F), the cdf of the statistic considered as a functional of F, is required so that GN (·, FN ) is close to GN (·, F0) when N is large. Horowitz (2001, pp. 3166–3168) gives two formal theorems, one general and one for iid data, and provides examples of potential failure of the bootstrap, including estimation of the median and estimation with boundary constraints on parameters. Subject to consistency of FN for F0 and smoothness requirements on F0 and GN , the bootstrap leads to consistent estimates and asymptotically valid inference. The bootstrap is consistent in a very wide range of settings. 11.4.3. Edgeworth Expansions An additional attraction of the bootstrap is that it allows for asymptotic refinement. Singh (1981) provided a proof using Edgeworth expansions, which we now introduce. Consider the asymptotic behavior of ZN = i Xi / √ N, where for simplicity Xi are standardized scalar random variables that are iid [0, 1]. Then application of a central limit theorem leads to a limit standard normal distribution for ZN . More precisely, ZN has cdf GN (z) = Pr[ZN ≤ z] = (z) + O(N−1/2 ), (11.12) where (·) is the standard normal cdf. The remainder term is ignored and regular asymptotic theory approximates GN (z) by G∞(z) = (z). The CLT leading to (11.12) is formally derived by a simple approximation of the characteristic function of ZN , E[eisZN ], where i = − √ 1. A better approximation expands this characteristic function in powers of N−1/2 . The usual Edgeworth expan- sion adds two additional terms, leading to GN (z) = Pr[ZN ≤ z] = (z) + g1(z) √ N + g2(z) N + O(N−3/2 ), (11.13) where g1(z) = −(z2 − 1)φ(z)κ3/6, φ(·) denotes the standard normal density, κ3 is the third cumulant of ZN , and the lengthy expression for g2(·) is given in Rothenberg (1984, p. 895) or Amemiya (1985, p. 93). In general the rth cumulant κr is the rth coefficient in the series expansion ln(E[eisZN ]) = ∞ r=0 κr (is)r /r! of the log charac- teristic function or cumulant generating function. The remainder term in (11.13) is ignored and an Edgeworth expansion approximates GN (z, F0) by G∞(z, F0) = (z) + N−1/2 g1(z) + N−1 g2(z). If ZN is a test statistic this can be used to compute p-values and critical values. Alternatively, (11.13) can be 370
  • 417. 11.4. BOOTSTRAP THEORY inverted to Pr ZN + h1(z) √ N + h2(z) N ≤ z Φ(z), (11.14) for functions h1(z) and h2(z) given in Rothenberg (1984, p. 895). The left-hand side gives a modified statistic that will be better approximated by the standard normal than the original statistic ZN . The problem in application is that the cumulants of ZN are needed to evaluate the functions g1(z) and g2(z) or h1(z) and h2(z). It can be very difficult to obtain analytical expressions for these cumulants (e.g., Sargan, 1980, and Phillips, 1983). The bootstrap provides a numerical method to implement the Edgeworth expansion without the need to calculate cumulants, as shown in the following. 11.4.4. Asymptotic Refinement via Bootstrap We now return to the more general setting of Section 11.4.1, with the additional as- sumption that TN has a limit normal distribution and usual √ N asymptotics apply. Conventional asymptotic methods use the limit cdf G∞(t, F0) as an approximation to the true cdf GN (t, F0). For √ N-consistent asymptotically normal estimators this has an error that in the limit behaves as a multiple of N−1/2 . We write this as GN (t, F0) = G∞(t, F0) + O(N−1/2 ), (11.15) where in our example G∞(t, F0) = Φ(t). A better approximation is possible using an Edgeworth expansion. Then GN (t, F0) = G∞(t, F0) + g1(t, F0) √ N + g2(t, F0) N + O(N−3/2 ). (11.16) Unfortunately, as already noted, the functions g1(·) and g2(·) on the right-hand side can be difficult to construct. Now consider the bootstrap estimator GN (t, FN ). An Edgeworth expansion yields GN (t, FN ) = G∞(t, FN ) + g1(t, FN ) √ N + g2(t, FN ) N + O(N−3/2 ); (11.17) see Hall (1992) for details. The bootstrap estimator GN (t, FN ) is used to approximate the finite-sample cdf GN (t, F0). Subtracting (11.16) from (11.17), we get GN (t, FN ) − GN (t, F0) = [G∞(t, FN ) − G∞(t, F0)] (11.18) + [g1(t, FN ) − g1(t, F0)] √ N + O(N−1 ). Assume that FN is √ N consistent for the true cdf F0, so that FN − F0 = O(N−1/2 ). For continuous function G∞ the first term on the right-hand side of (11.18), [G∞(t, FN ) − G∞(t, F0)], is therefore O(N−1/2 ), so GN (t, FN ) − GN (t, F0) = O(N−1/2 ). The bootstrap approximation GN (t, FN ) is therefore in general no closer asymptot- ically to GN (t, F0) than is the usual asymptotic approximation G∞(t, F0); see (11.15). 371
  • 418. BOOTSTRAP METHODS Now suppose the statistic TN is asymptotically pivotal, so that its asymptotic dis- tribution G∞ does not depend on unknown parameters. Here this is the case if TN is standardized so that its limit distribution is the standard normal. Then G∞(t, FN ) = G∞(t, F0), so (11.18) simplifies to GN (t, FN ) − GN (t, F0) = N−1/2 [g1(t, FN ) − g1(t, F0)] + O(N−1 ). (11.19) However, because FN − F0 = O(N−1/2 ) we have that [g1(t, FN ) − g1(t, F0)] = O(N−1/2 ) for g1 continuous in F. It follows upon simplification that GN (t, FN ) = GN (t, F0) + O(N−1 ). The bootstrap approximation GN (t, FN ) is now a better asymp- totic approximation to GN (t, F0) as the error is now O(N−1 ). In summary, for a bootstrap on an asymptotically pivotal statistic we have GN (t, F0) = GN (t, FN ) + O(N−1 ), (11.20) an improvement on the conventional approximation GN (t, F0) = G∞(t, F0) + O(N−1/2 ). The bootstrap on an asymptotically pivotal statistic therefore leads to an improved small-sample performance in the following sense. Let α be the nominal size for a test procedure. Usual asymptotic theory produces t-tests with actual size α + O(N−1/2 ), whereas the bootstrap produces t-tests with actual size α + O(N−1 ). For symmetric two-sided hypothesis tests and confidence intervals the bootstrap on an asymptotically pivotal statistic can be shown to have approximation error O(N−3/2 ) compared to error O(N−1 ) using usual asymptotic theory. The preceding results are restricted to asymptotically normal statistics. For chi- squared distributed test statistics the asymptotic gains are similar to those for sym- metric two-sided hypothesis tests. For proof of bias reduction by bootstrapping, see Horowitz (2001, p. 3172). The theoretical analysis leads to the following points. The bootstrap should be from distribution FN consistent for F0. The bootstrap requires smoothness and continuity in F0 and GN , so that a modification of the standard bootstrap is needed if, for example, there is a discontinuity because of a boundary constraint on the parameters such as θ ≥ 0. The bootstrap assumes existence of low-order moments, as low-order cumu- lants appear in the function g1 in the Edgeworth expansions. Asymptotic refinement requires use of an asymptotically pivotal statistic. The bootstrap refinement presented assumes iid data, so that modification is needed even for heteroskedastic errors. For more complete discussion see Horowitz (2001). 11.4.5. Power of Bootstrapped Tests The analysis of the bootstrap has focused on getting tests with correct size in small samples. The size correction of the bootstrap will lead to changes in the power of tests, as will any size correction. Intuitively, if the actual size of a test using first-order asymptotics exceeds the nom- inal size, then bootstrapping with asymptotic refinement will not only reduce the size toward the nominal size but, because of less frequent rejection, will also reduce the power of the test. Conversely, if the actual size is less than the nominal size then 372
  • 419. 11.5. BOOTSTRAP EXTENSIONS bootstrapping will increase test power. This is observed in the simulation exercise of Horowitz (1994, p. 409). Interestingly, in his simulation he finds that although boot- strapping first-order asymptotically equivalent tests leads to tests with similar actual size (essentially equal to the nominal size) there can be considerable difference in test power across the bootstrapped tests. 11.5. Bootstrap Extensions The bootstrap methods presented so far emphasize smooth √ N-consistent asymp- totically normal estimators based on iid data. The following extensions of the boot- strap permit for a wider range of applications a consistent bootstrap (Sections 11.5.1 and 11.5.2) or a consistent bootstrap with asymptotic refinement (Sections 11.5.3– 11.5.5). The presentation of these more advanced methods is brief. Some are used in Section 11.6. 11.5.1. Subsampling Method The subsampling method uses a sample of size m that is substantially smaller than the sample size N. The subsampling may be with replacement (Bickel, Gotze, and van Zwet, 1997) or without replacement (Politis and Romano, 1994). Replacement subsampling provides subsamples that are random samples of the pop- ulation, rather than random samples of an estimate of the distribution such as the sam- ple in the case of a paired bootstrap. Replacement subsampling can then be consistent when failure of the smoothness conditions discussed in Section 11.4.2 leads to in- consistency of a full sample bootstrap. The associated asymptotic error for testing or confidence intervals, however, is of higher order of magnitude than the usual 0(N−1/2 ) obtained when a full sample bootstrap without refinement can be used. Subsample bootstraps are useful when full sample bootstraps are invalid, or as a way to verify that a full sample bootstrap is valid. Results will differ with the choice of subsample size. And there is a considerable increase in sample error because a smaller fraction of the sample is being used. Indeed, we should have (m/N) → 0 and N → ∞. Politis, Romano, and Wolf (1999) and Horowitz (2001) provide further details. 11.5.2. Moving Blocks Bootstrap The moving blocks bootstrap is used for data that are dependent rather than indepen- dent. This splits the sample into r nonoverlapping blocks of length l, where rl N. First, one samples with replacement from these blocks, to give r new blocks, which will have a different temporal ordering from the original r blocks. Then one estimates the parameters using this bootstrap sample. The moving blocks method treats the randomly drawn blocks as being independent of each other, but allows dependence within the blocks. A similar blocking was ac- tually used by Anderson (1971) to derive a central limit theorem for an m-dependent process. The moving blocks process requires r → ∞ as N → ∞ to ensure that we 373
11.5.3. Nested Bootstrap

A nested bootstrap, introduced by Hall (1986), Beran (1987), and Loh (1987), is a bootstrap within a bootstrap. This method is especially useful if the bootstrap is on a statistic that is not asymptotically pivotal. In particular, if the standard error of the estimate is difficult to compute, one can bootstrap the current bootstrap sample to obtain a bootstrap standard error estimate $s_{\hat\theta^*,\text{Boot}}$ and form $t^* = (\hat\theta^* - \hat\theta)/s_{\hat\theta^*,\text{Boot}}$, and then apply the percentile-t method to the bootstrap replications $t_1^*, \ldots, t_B^*$. This permits asymptotic refinements where a single round of bootstrap would not.

More generally, iterated bootstrapping is a way to improve the performance of the bootstrap by estimating the errors (i.e., bias) that arise from a single pass of the bootstrap, and correcting for these errors. In general each further iteration of the bootstrap reduces bias by a factor $N^{-1}$ if the statistic is asymptotically pivotal and by a factor $N^{-1/2}$ otherwise. For a good exposition see Hall and Martin (1988). If B bootstraps are performed at each iteration then $B^k$ bootstraps need to be performed if there are k iterations. For this reason at most two iterations, called a double bootstrap or calibrated bootstrap, are done.

Davison, Hinkley, and Schechtman (1986) proposed balanced bootstrapping. This method ensures that each sample observation is reused exactly the same number of times over all B bootstraps, leading to better bootstrap estimates. For implementation see Gleason (1988), whose algorithms add little to computational time compared to the usual unbalanced bootstrap.

11.5.4. Recentering and Rescaling

To yield an asymptotic refinement the bootstrap should be based on an estimate $\hat F$ of the dgp $F_0$ that imposes all the conditions of the model under consideration. A leading example arises with the residual bootstrap.

Least-squares residuals do not sum to zero in nonlinear models, or even in linear models if there is no intercept. The residual bootstrap (see Section 11.2.4) based on least-squares residuals will then fail to impose the restriction that $E[u_i] = 0$. The residual bootstrap should instead bootstrap the recentered residual $\hat u_i - \bar{\hat u}$, where $\bar{\hat u} = N^{-1}\sum_{i=1}^N \hat u_i$. Similar recentering should be done for paired bootstraps of GMM estimators in overidentified models (see Section 11.6.4).

Rescaling of residuals can also be useful. For example, in the linear regression model with iid errors resample from $(N/(N-K))^{1/2}\hat u_i$, since these have variance $s^2$. Other adjustments include using the standardized residual $\hat u_i / \sqrt{(1 - h_{ii})s^2}$, where $h_{ii}$ is the ith diagonal entry in the projection matrix $X(X'X)^{-1}X'$.
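The sketch below (an added illustration, not from the text) implements a residual bootstrap for OLS with the recentering and rescaling just described: residuals are demeaned and inflated by $(N/(N-K))^{1/2}$ before resampling. The dgp and all names are hypothetical.

```python
# Residual bootstrap for OLS with recentered and rescaled residuals.
# Illustrative sketch; dgp and names are hypothetical.
import numpy as np

rng = np.random.default_rng(2)
N, B = 60, 999
x = rng.normal(2, 1, size=N)
X = np.column_stack([np.ones(N), x])            # intercept + regressor, K = 2
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=N)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta_hat
K = X.shape[1]

# Impose E[u] = 0 and correct the variance understatement of OLS residuals
u_adj = (u_hat - u_hat.mean()) * np.sqrt(N / (N - K))

beta_star = np.empty((B, K))
for b in range(B):
    u_star = rng.choice(u_adj, size=N, replace=True)
    y_star = X @ beta_hat + u_star               # regressors held fixed
    beta_star[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]

print("residual-bootstrap se of slope:", beta_star[:, 1].std(ddof=1))
```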
11.5.5. The Jackknife

The bootstrap can be used for bias correction (see Section 11.2.8). An alternative resampling method is the jackknife, a precursor of the bootstrap. The jackknife uses N deterministically defined subsamples of size $N-1$ obtained by dropping in turn each of the N observations and recomputing the estimator.

To see how the jackknife works, let $\hat\theta_N$ denote the estimate of $\theta$ using all N observations, and let $\hat\theta_{N-1}$ denote the estimate of $\theta$ using the first $(N-1)$ observations. If (11.7) holds then

$$E[\hat\theta_N] = \theta + a/N + b/N^2 + O(N^{-3})$$

and

$$E[\hat\theta_{N-1}] = \theta + a/(N-1) + b/(N-1)^2 + O(N^{-3}),$$

which implies

$$E[N\hat\theta_N - (N-1)\hat\theta_{N-1}] = \theta + O(N^{-2}).$$

Thus $N\hat\theta_N - (N-1)\hat\theta_{N-1}$ has smaller bias than $\hat\theta_N$. The estimator can be more variable, however, as it uses less of the data. As an extreme example, if $\hat\theta = \bar y$ then the new estimator is simply $y_N$, the Nth observation. The variation can be reduced by dropping each observation in turn and averaging.

More formally then, consider the estimator $\hat\theta$ of a parameter vector $\theta$ based on a sample of size N from iid data. For $i = 1, \ldots, N$ sequentially delete the ith observation and obtain N jackknife replication estimates $\hat\theta_{(-i)}$ from the N jackknife resamples of size $(N-1)$. The jackknife estimate of the bias of $\hat\theta$ is $(N-1)(\bar\theta - \hat\theta)$, where $\bar\theta = N^{-1}\sum_i \hat\theta_{(-i)}$ is the average of the N jackknife replications $\hat\theta_{(-i)}$. The bias appears large because of multiplication by $(N-1)$, but the differences $(\hat\theta_{(-i)} - \hat\theta)$ are much smaller than in the bootstrap case, since a jackknife resample differs from the original sample in only one observation. This leads to the bias-corrected jackknife estimate of $\theta$:

$$\hat\theta_{\text{Jack}} = \hat\theta - (N-1)(\bar\theta - \hat\theta) = N\hat\theta - (N-1)\bar\theta. \qquad (11.21)$$

This reduces the bias from $O(N^{-1})$ to $O(N^{-2})$, which is the same order of bias reduction as for the bootstrap. It is assumed that, as for the bootstrap, the estimator is a smooth $\sqrt{N}$-consistent estimator. The jackknife estimate can have increased variance compared with $\hat\theta$, and examples where the jackknife fails are given in Miller (1974).

A simple example is estimation of $\sigma^2$ from an iid sample with $y_i \sim [\mu, \sigma^2]$. The estimate $\hat\sigma^2 = N^{-1}\sum_i (y_i - \bar y)^2$, the MLE under normality, has $E[\hat\sigma^2] = \sigma^2(N-1)/N$, so that the bias equals $-\sigma^2/N$, which is $O(N^{-1})$. In this example the jackknife estimate can be shown to simplify to $\hat\sigma^2_{\text{Jack}} = (N-1)^{-1}\sum_i (y_i - \bar y)^2$, so one does not need to compute N separate estimates $\hat\sigma^2_{(-i)}$. This is an unbiased estimate of $\sigma^2$, so the bias is actually zero rather than the general result of $O(N^{-2})$.

The jackknife is due to Quenouille (1956). Tukey (1958) considered application to a wider range of statistics. In particular, the jackknife estimate of the standard error of an estimator $\hat\theta$ is

$$\text{se}_{\text{Jack}}[\hat\theta] = \left[\frac{N-1}{N}\sum_{i=1}^N (\hat\theta_{(-i)} - \bar\theta)^2\right]^{1/2}. \qquad (11.22)$$

Tukey proposed the term jackknife by analogy to a Boy Scout jackknife that solves a variety of problems, each of which could be solved more efficiently by a specially constructed tool. The jackknife is a "rough and ready" method for bias reduction in many situations, but it is not the ideal method in any. The jackknife can be viewed as a linear approximation of the bootstrap (Efron and Tibshirani, 1993, p. 146). It requires less computation than the bootstrap in small samples, as then $N < B$ is likely, but is outperformed by the bootstrap as $B \to \infty$.
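A minimal leave-one-out implementation of (11.21) and (11.22) is sketched below (an added illustration; the estimator chosen, the biased variance estimator $\hat\sigma^2 = N^{-1}\sum_i(y_i-\bar y)^2$, matches the simple example above, and all names are hypothetical).

```python
# Jackknife bias correction and standard error, applied to the biased
# variance estimator sigma2_hat = N^{-1} sum (y_i - ybar)^2 from the example above.
import numpy as np

rng = np.random.default_rng(3)
N = 30
y = rng.normal(5.0, 2.0, size=N)               # true sigma^2 = 4

def sigma2_hat(sample):
    return ((sample - sample.mean()) ** 2).mean()   # divides by N, hence biased

theta_hat = sigma2_hat(y)

# Leave-one-out replications theta_(-i)
theta_loo = np.array([sigma2_hat(np.delete(y, i)) for i in range(N)])
theta_bar = theta_loo.mean()

bias_jack = (N - 1) * (theta_bar - theta_hat)            # jackknife bias estimate
theta_jack = N * theta_hat - (N - 1) * theta_bar         # eq. (11.21)
se_jack = np.sqrt((N - 1) / N * ((theta_loo - theta_bar) ** 2).sum())  # eq. (11.22)

print(f"biased estimate {theta_hat:.3f}, jackknife-corrected {theta_jack:.3f}, "
      f"unbiased formula {y.var(ddof=1):.3f}, se_jack {se_jack:.3f}")
```

As the text notes, the jackknife-corrected value coincides here with the usual unbiased variance estimator, so the correction removes the bias exactly in this special case.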
Consider the linear regression model $y = X\beta + u$, with $\hat\beta = (X'X)^{-1}X'y$. An example of a biased estimator from OLS regression is a time-series model with a lagged dependent variable as regressor. The regression estimator based on the ith jackknife sample $(X_{(-i)}, y_{(-i)})$ is given by

$$\hat\beta_{(-i)} = [X'_{(-i)}X_{(-i)}]^{-1}X'_{(-i)}y_{(-i)} = [X'X - x_i x_i']^{-1}(X'y - x_i y_i) = \hat\beta - [X'X]^{-1}x_i(y_i - x_i'\hat\beta_{(-i)}).$$

The third equality avoids the need to invert $X'_{(-i)}X_{(-i)}$ for each i and is obtained using

$$[X'X]^{-1} = [X'_{(-i)}X_{(-i)}]^{-1} - \frac{[X'_{(-i)}X_{(-i)}]^{-1}x_i x_i'[X'_{(-i)}X_{(-i)}]^{-1}}{1 + x_i'[X'_{(-i)}X_{(-i)}]^{-1}x_i}.$$

Here the pseudo-values are given by $N\hat\beta - (N-1)\hat\beta_{(-i)}$, and the jackknife estimator of $\beta$ is given by

$$\hat\beta_{\text{Jack}} = N\hat\beta - (N-1)\frac{1}{N}\sum_{i=1}^N \hat\beta_{(-i)}. \qquad (11.23)$$

An interesting application of the jackknife to bias reduction is the jackknife IV estimator (see Section 6.4.4).

11.6. Bootstrap Applications

We consider application of the bootstrap taking into account typical microeconometric complications such as heteroskedasticity and clustering, and more complicated estimators that can lead to failure of simple bootstraps.

11.6.1. Heteroskedastic Errors

For least squares in models with additive errors that are heteroskedastic, the standard procedure is to use White's heteroskedastic-consistent covariance matrix estimator (HCCME). This is well known to perform poorly in small samples. When done correctly, the bootstrap can provide an improvement.

The paired bootstrap leads to valid inference, since the essential assumption that $(y_i, x_i)$ is iid still permits $V[u_i|x_i]$ to vary with $x_i$ (see Section 4.4.7). However, it does not offer an asymptotic refinement because it does not impose the condition that $E[u_i|x_i] = 0$.

The usual residual bootstrap actually leads to invalid inference, since it assumes that $u_i|x_i$ is iid and hence erroneously imposes the condition of homoskedastic errors. In terms of the theory of Section 11.4, $\hat F$ is then inconsistent for $F_0$. One can specify a formal model for heteroskedasticity, say $u_i = \exp(z_i'\alpha)\varepsilon_i$, where the $\varepsilon_i$ are iid, obtain the estimate $\exp(z_i'\hat\alpha)$, and then bootstrap the implied residuals $\hat\varepsilon_i$. Consistency and asymptotic refinement of this bootstrap require correct specification of the functional form for the heteroskedasticity.

The wild bootstrap, introduced by Wu (1986) and Liu (1988) and studied further by Mammen (1993), provides asymptotic refinement without imposing such structure on the heteroskedasticity. This bootstrap replaces the OLS residual $\hat u_i$ by the following residual:

$$\hat u_i^* = \begin{cases} \dfrac{1-\sqrt{5}}{2}\,\hat u_i \simeq -0.6180\,\hat u_i & \text{with probability } \dfrac{1+\sqrt{5}}{2\sqrt{5}} \simeq 0.7236,\\[2ex] \left(1-\dfrac{1-\sqrt{5}}{2}\right)\hat u_i \simeq 1.6180\,\hat u_i & \text{with probability } 1-\dfrac{1+\sqrt{5}}{2\sqrt{5}} \simeq 0.2764. \end{cases}$$

Taking expectations with respect to only this two-point distribution and performing some algebra yields $E[\hat u_i^*] = 0$, $E[\hat u_i^{*2}] = \hat u_i^2$, and $E[\hat u_i^{*3}] = \hat u_i^3$. Thus $\hat u_i^*$ leads to a residual with zero conditional mean as desired, since $E[\hat u_i^*|\hat u_i, x_i] = 0$ implies $E[\hat u_i^*|x_i] = 0$, while the second and third moments are unchanged. The wild bootstrap resamples have ith observation $(y_i^*, x_i)$, where $y_i^* = x_i'\hat\beta + \hat u_i^*$. The resamples vary because of different realizations of $\hat u_i^*$. Simulations by Horowitz (1997, 2001) show that this bootstrap works much better than a paired bootstrap when there is heteroskedasticity, and works well compared to other bootstrap methods even if there is no heteroskedasticity.

It seems surprising that this bootstrap should work, because for the ith observation it draws from only two possible values for the residual, $-0.6180\,\hat u_i$ or $1.6180\,\hat u_i$. However, a similar draw is being made over all N observations and over B bootstrap iterations. Recall also that White's estimator replaces $E[u_i^2]$ by $\hat u_i^2$, which, although incorrect for one observation, is valid when averaged over the sample. The wild bootstrap is instead drawing from a two-point distribution with mean 0 and variance $\hat u_i^2$.
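The two-point scheme above is straightforward to code. The sketch below (an added illustration with a hypothetical dgp and hypothetical names) draws wild-bootstrap resamples for an OLS slope and reports a percentile-t confidence interval, studentizing each resample with heteroskedasticity-robust (HC0) standard errors.

```python
# Wild bootstrap for OLS with heteroskedastic errors, using the two-point
# (Mammen) distribution displayed above. Illustrative sketch; dgp is hypothetical.
import numpy as np

rng = np.random.default_rng(4)
N, B = 80, 999
x = rng.normal(0, 1, size=N)
u = rng.normal(0, 1, size=N) * (0.5 + np.abs(x))      # heteroskedastic errors
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(N), x])

def ols_hc0(X, y):
    """Return OLS estimates and HC0 (White) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    V = XtX_inv @ (X.T * e**2) @ X @ XtX_inv
    return b, np.sqrt(np.diag(V))

b_hat, se_hat = ols_hc0(X, y)
u_hat = y - X @ b_hat

# Two-point weights: -0.6180 with probability 0.7236, +1.6180 with probability 0.2764
a1, a2 = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
p1 = (1 + np.sqrt(5)) / (2 * np.sqrt(5))

t_star = np.empty(B)
for b in range(B):
    w = np.where(rng.random(N) < p1, a1, a2)
    y_star = X @ b_hat + u_hat * w                    # regressors held fixed
    b_s, se_s = ols_hc0(X, y_star)
    t_star[b] = (b_s[1] - b_hat[1]) / se_s[1]         # studentized slope

# Percentile-t 95% confidence interval for the slope
lo, hi = np.quantile(t_star, [0.025, 0.975])
print("wild-bootstrap percentile-t CI:",
      (b_hat[1] - hi * se_hat[1], b_hat[1] - lo * se_hat[1]))
```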
11.6.2. Panel Data and Clustered Data

Consider a linear panel regression model

$$\tilde y_{it} = \tilde w_{it}'\theta + \tilde u_{it},$$

where i denotes the individual and t denotes the time period. Following the notation of Section 21.2.3, the tilde is added because the original data $y_{it}$ and $x_{it}$ may first be transformed, for example to eliminate fixed effects. We assume that the errors $\tilde u_{it}$ are independent over i, though they may be heteroskedastic and correlated over t for given i.

If the panel is short, so that T is finite and asymptotic theory relies on $N \to \infty$, then consistent standard errors for $\hat\theta$ can be obtained by a paired or EDF bootstrap that resamples over i but does not resample over t. In the preceding presentation $w_i$ becomes $[\tilde y_{i1}, \tilde x_{i1}, \ldots, \tilde y_{iT}, \tilde x_{iT}]$, and we resample over i and obtain all T observations for the chosen i. This panel bootstrap, also called a block bootstrap, can also be applied to the nonlinear panel models of Chapter 23. The key assumptions are that the panel is short and the data are independent over i. More generally, this bootstrap can be applied whenever data are clustered (see Section 24.5), provided cluster size is finite and the number of clusters goes to infinity.

The panel bootstrap produces standard errors that are asymptotically equivalent to panel-robust sandwich standard errors (see Section 21.2.3). It does not provide an asymptotic refinement. However, it is quite simple to implement and is practically very useful, as many packages do not automatically provide panel-robust standard errors even for quite basic panel estimators such as the fixed effects estimator. Depending on the application, other bootstraps such as parametric and residual bootstraps may be possible, provided again that resampling is over i only.

Asymptotic refinement is straightforward if the errors are iid. More realistically, however, $\tilde u_{it}$ will be heteroskedastic and correlated over t for given i. The wild bootstrap (see Section 11.6.1) should provide an asymptotic refinement in a linear model if the panel is short. Then wild bootstrap resamples have (i, t)th observation $(\tilde y_{it}^*, \tilde w_{it})$, where $\tilde y_{it}^* = \tilde w_{it}'\hat\theta + \tilde u_{it}^*$, $\hat{\tilde u}_{it} = \tilde y_{it} - \tilde w_{it}'\hat\theta$, and $\tilde u_{it}^*$ is a draw from the two-point distribution given in Section 11.6.1.
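A pairs cluster bootstrap as described here resamples whole individuals (clusters) with replacement and keeps all of their time periods together. The sketch below (an added illustration; the dgp and names are hypothetical) computes cluster-bootstrap standard errors for pooled OLS on a short panel.

```python
# Panel (cluster) bootstrap: resample individuals i with replacement,
# keeping all T observations of each drawn individual. Illustrative sketch.
import numpy as np

rng = np.random.default_rng(5)
N, T, B = 100, 5, 999
alpha_i = rng.normal(0, 1, size=N)                      # individual effects
x = rng.normal(0, 1, size=(N, T))
u = alpha_i[:, None] + rng.normal(0, 1, size=(N, T))    # errors correlated within i
y = 1.0 + 0.5 * x + u

def pooled_ols(x, y):
    """Pooled OLS intercept and slope on (N, T) arrays."""
    X = np.column_stack([np.ones(x.size), x.ravel()])
    return np.linalg.lstsq(X, y.ravel(), rcond=None)[0]

theta_hat = pooled_ols(x, y)

theta_star = np.empty((B, 2))
for b in range(B):
    ids = rng.integers(0, N, size=N)     # draw N individuals with replacement
    theta_star[b] = pooled_ols(x[ids], y[ids])

print("cluster-bootstrap se of slope:", theta_star[:, 1].std(ddof=1))
```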
11.6.3. Hypothesis and Specification Tests

Section 11.2.6 focused on tests of the hypothesis $\theta = \theta_0$. Here we consider more general tests. As in Section 11.2.6, the bootstrap can be used to perform hypothesis tests with or without asymptotic refinement.

Tests without Asymptotic Refinement

A leading example of the usefulness of the bootstrap is the Hausman test (see Section 8.3). Standard implementation of this test requires estimation of $V[\hat\theta - \tilde\theta]$, where $\hat\theta$ and $\tilde\theta$ are the two estimators being contrasted. Obtaining this estimate can be difficult unless the strong assumption is made that one of the estimators is fully efficient under $H_0$. The paired bootstrap can be used instead, leading to the consistent estimate

$$\hat V_{\text{Boot}}[\hat\theta - \tilde\theta] = \frac{1}{B-1}\sum_{b=1}^B \left[(\hat\theta_b^* - \tilde\theta_b^*) - (\bar{\hat\theta}^* - \bar{\tilde\theta}^*)\right]\left[(\hat\theta_b^* - \tilde\theta_b^*) - (\bar{\hat\theta}^* - \bar{\tilde\theta}^*)\right]',$$

where $\bar{\hat\theta}^* = B^{-1}\sum_b \hat\theta_b^*$ and $\bar{\tilde\theta}^* = B^{-1}\sum_b \tilde\theta_b^*$. Then compute

$$H = (\hat\theta - \tilde\theta)'\left(\hat V_{\text{Boot}}[\hat\theta - \tilde\theta]\right)^{-1}(\hat\theta - \tilde\theta) \qquad (11.24)$$

and compare to chi-square critical values. As mentioned in Chapter 8, a generalized inverse may need to be used, and care may be needed to ensure chi-square critical values are obtained using the correct degrees of freedom.

More generally, this approach can be used for any standard normal test or chi-square distributed test where implementation is difficult because a variance matrix must be estimated. Examples include hypothesis tests based on a two-step estimator and the m-tests of Chapter 8.

Tests with Asymptotic Refinement

Many tests, especially those for fully parametric models such as the LM test and IM test, can be simply implemented using an auxiliary regression (see Sections 7.3.5 and 8.2.2). The resulting test statistics, however, perform poorly in finite samples, as documented in many Monte Carlo studies. Such test statistics are easily computed and are asymptotically pivotal, as the chi-square distribution does not depend on unknown parameters. They are therefore prime candidates for asymptotic refinement by bootstrap.

Consider the m-test of $H_0: E[m_i(y_i|x_i, \theta)] = 0$ against $H_a: E[m_i(y_i|x_i, \theta)] \neq 0$ (see Section 8.2). From the original data estimate $\theta$ by ML, and calculate the test statistic M. Using a parametric bootstrap, resample $y_i^*$ from the fitted conditional density $f(y_i|x_i, \hat\theta)$, for fixed regressors in repeated samples, or from $f(y_i|x_i^*, \hat\theta)$. Compute $M_b^*$, $b = 1, \ldots, B$, in the bootstrap resamples. Reject $H_0$ at level $\alpha$ if the original calculated statistic M exceeds the upper $\alpha$ quantile of $M_b^*$, $b = 1, \ldots, B$.

Horowitz (1994) presented this bootstrap for the IM test and demonstrated with simulation examples that there are substantial finite-sample gains to this bootstrap. A detailed application by Drukker (2002) to specification tests for the tobit model suggests that conditional moment specification tests can be easily applied to fully parametric models, since any size distortion in the auxiliary regressions can be corrected through the bootstrap.

Note that bootstrap tests without asymptotic refinement, such as the Hausman test given here, can be refined by use of the nested bootstrap given in Section 11.5.3.
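As an added illustration of the bootstrap Hausman variance in (11.24), the sketch below contrasts OLS and IV estimates of a slope in a simulated just-identified model, pairs-bootstraps the difference of the two estimators, and forms H. The dgp, the choice of estimators, and all names are hypothetical.

```python
# Bootstrap Hausman test: pairs-bootstrap the contrast between OLS and IV
# slope estimates and use the bootstrap variance in the chi-square statistic (11.24).
# Illustrative sketch; the dgp (exogenous regressor, so H0 holds) is hypothetical.
import numpy as np

rng = np.random.default_rng(6)
N, B = 200, 999
z = rng.normal(0, 1, size=N)                 # instrument
x = 0.8 * z + rng.normal(0, 1, size=N)       # regressor, exogenous under this dgp
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=N)

def slopes(y, x, z):
    """Return (OLS slope, IV slope) for a model with an intercept."""
    xd, yd, zd = x - x.mean(), y - y.mean(), z - z.mean()
    b_ols = (xd @ yd) / (xd @ xd)
    b_iv = (zd @ yd) / (zd @ xd)
    return b_ols, b_iv

b_ols, b_iv = slopes(y, x, z)

diff_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)          # paired bootstrap: resample rows jointly
    bo, bi = slopes(y[idx], x[idx], z[idx])
    diff_star[b] = bo - bi

v_boot = diff_star.var(ddof=1)                # scalar version of V_Boot
H = (b_ols - b_iv) ** 2 / v_boot              # compare with chi-square(1), 3.84 at 5%
print(f"H = {H:.3f}, reject at 5%: {H > 3.84}")
```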
11.6.4. GMM, Minimum Distance, and Empirical Likelihood in Overidentified Models

The GMM estimator is based on the population moment conditions $E[h(w_i, \theta)] = 0$ (see Section 6.3.1). In a just-identified model a consistent estimator simply solves $N^{-1}\sum_i h(w_i, \hat\theta) = 0$. In overidentified models this estimator is no longer feasible. Instead, the GMM estimator is used (see Section 6.3.2).

Now consider bootstrapping, using the paired or EDF bootstrap. For GMM in an overidentified model $N^{-1}\sum_i h(w_i, \hat\theta) \neq 0$, so this bootstrap does not impose on the bootstrap resamples the original population restriction that $E[h(w_i, \theta)] = 0$. As a result, even if the asymptotically pivotal t-statistic is used there is no longer a bootstrap refinement, though bootstraps on $\hat\theta$ and related confidence intervals and t-test statistics remain consistent. More fundamentally, the bootstrap of the OIR test (see Section 6.3.8) can be shown to be inconsistent. We focus on cross-section data, but similar issues arise for panel GMM estimators (see Chapter 22) in overidentified models.

Hall and Horowitz (1996) propose correcting this by recentering. Then the bootstrap is based on

$$h^*(w_i, \theta) = h(w_i, \theta) - N^{-1}\sum_i h(w_i, \hat\theta),$$

and asymptotic refinements can be obtained for statistics based on $\hat\theta$, including the OIR test. Horowitz (1998) does similar recentering for the minimum distance estimator (see Section 6.7). He then applies the bootstrap to the covariance structure example of Altonji and Segal (1996) discussed in Section 6.3.5.

An alternative adjustment proposed by Brown and Newey (2002) is to not recenter but to instead resample the observations $w_i$ with probabilities that vary across observations rather than using equal weights $1/N$. Specifically, let $\Pr[w^* = w_i] = \hat\pi_i$, where $\hat\pi_i = 1/(N(1 + \hat\lambda'\hat h_i))$, $\hat h_i = h(w_i, \hat\theta)$, and $\hat\lambda$ maximizes $\sum_i \ln(1 + \lambda'\hat h_i)$. The motivation is that the probabilities $\hat\pi_i$ equivalently are the solution to an empirical likelihood (EL) problem (see Section 6.8.2) of maximizing $\sum_i \ln \pi_i$ with respect to $\pi_1, \ldots, \pi_N$ subject to the constraints $\sum_i \pi_i \hat h_i = 0$ and $\sum_i \pi_i = 1$. This empirical likelihood bootstrap of the GMM estimator therefore imposes the constraint $\sum_i \pi_i \hat h_i = 0$. One could instead work directly with EL from the beginning, letting $\hat\theta$ be the EL estimator rather than the GMM estimator. The advantage of the Brown and Newey (2002) approach is that it avoids the more challenging computation of the EL estimator. Instead, one needs only the GMM estimator and the solution of the concave programming problem of maximizing $\sum_i \ln(1 + \lambda'\hat h_i)$.
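To show what the Hall and Horowitz recentering does in practice, the sketch below (an added illustration, not the authors' code) bootstraps a linear IV model with two excluded instruments: in each pairs resample the moment $z_i(y_i - x_i'\theta)$ is recentered by the full-sample moment evaluated at $\hat\theta$ before the GMM criterion is minimized. The dgp, weight matrix, and names are hypothetical choices.

```python
# Recentered pairs bootstrap for an overidentified linear GMM/IV model.
# Illustrative sketch; dgp, instruments, and names are hypothetical.
import numpy as np

rng = np.random.default_rng(7)
N, B = 200, 499
z1, z2 = rng.normal(size=N), rng.normal(size=N)
e = rng.normal(size=N)
x = 0.7 * z1 + 0.4 * z2 + 0.5 * e + rng.normal(size=N)   # endogenous regressor
y = 1.0 + 0.5 * x + e
X = np.column_stack([np.ones(N), x])                      # 2 parameters
Z = np.column_stack([np.ones(N), z1, z2])                 # 3 instruments: overidentified

def gmm(X, Z, y, center=None):
    """One-step GMM with weight (Z'Z/n)^{-1}; 'center' shifts the sample moments."""
    n = len(y)
    W = np.linalg.inv(Z.T @ Z / n)
    S = Z.T @ X / n
    m = Z.T @ y / n - (0 if center is None else center)
    return np.linalg.solve(S.T @ W @ S, S.T @ W @ m)

theta_hat = gmm(X, Z, y)
# Full-sample moment at theta_hat; nonzero because the model is overidentified
c = Z.T @ (y - X @ theta_hat) / N

theta_star = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, N, size=N)                      # paired bootstrap
    theta_star[b] = gmm(X[idx], Z[idx], y[idx], center=c) # recentered moments
print("recentered-bootstrap se of slope:", theta_star[:, 1].std(ddof=1))
```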
11.6.5. Nonparametric Regression

Nonparametric density and regression estimators converge at a rate less than $\sqrt{N}$ and are asymptotically biased. This complicates inference such as confidence intervals (see Sections 9.3.7 and 9.5.4).

We consider the kernel regression estimator $\hat m(x_0)$ of $m(x_0) = E[y|x = x_0]$ for observations $(y, x)$ that are iid, though conditional heteroskedasticity is permitted. From Horowitz (2001, p. 3204), an asymptotically pivotal statistic is

$$t = \frac{\hat m(x_0) - m(x_0)}{s_{\hat m(x_0)}},$$

where $\hat m(x_0)$ is an undersmoothed kernel regression estimator with bandwidth $h = o(N^{-1/3})$ rather than the optimal $h^* = O(N^{-1/5})$, and

$$s^2_{\hat m(x_0)} = \frac{1}{Nh[\hat f(x_0)]^2}\sum_{i=1}^N (y_i - \hat m(x_i))^2 K\!\left(\frac{x_i - x_0}{h}\right)^2,$$

where $\hat f(x_0)$ is a kernel estimate of the density $f(x)$ at $x = x_0$. A paired bootstrap resamples $(y^*, x^*)$ and forms $t_b^* = [\hat m_b^*(x_0) - \hat m(x_0)]/s^*_{\hat m(x_0),b}$, where $s^*_{\hat m(x_0),b}$ is computed using the bootstrap sample kernel estimates $\hat m_b^*(x_i)$ and $\hat f_b^*(x_0)$. The percentile-t confidence interval of Section 11.2.7 then provides an asymptotic refinement. For a symmetrical confidence interval or symmetrical test at level $\alpha$ the error is $o((Nh)^{-1})$ rather than $O((Nh)^{-1})$ using the first-order asymptotic approximation.

Several variations on this bootstrap are possible. Rather than using undersmoothing, bias can be eliminated by directly estimating the bias term given in Section 9.5.2. Also, rather than using $s^2_{\hat m(x_0)}$, the variance term given in Section 9.5.2 can be directly estimated.

Yatchew (2003) provides considerable detail on implementing the bootstrap in nonparametric and semiparametric regression.
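The sketch below (an added illustration; the Gaussian kernel, bandwidth constant, and dgp are hypothetical choices) computes a Nadaraya-Watson estimate at a point and a paired-bootstrap percentile-t confidence interval, estimating the standard error directly from the kernel weights and squared residuals rather than from the displayed formula.

```python
# Paired bootstrap percentile-t confidence interval for kernel (Nadaraya-Watson)
# regression at a point. Illustrative sketch; kernel, bandwidth, and dgp are
# arbitrary choices, and the standard error is built from the kernel weights.
import numpy as np

rng = np.random.default_rng(8)
N, B, x0 = 400, 499, 0.5
x = rng.uniform(0, 1, size=N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=N)
h = 1.0 * N ** (-1 / 3)                          # deliberately undersmoothed

def nw(xq, x, y, h):
    """Nadaraya-Watson estimates at query points xq."""
    k = np.exp(-0.5 * ((np.atleast_1d(xq)[:, None] - x[None, :]) / h) ** 2)
    return (k @ y) / k.sum(axis=1)

def nw_se(x, y, x0, h):
    """Standard error of the NW estimate at x0 from kernel weights and residuals."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)
    w = k / k.sum()
    resid = y - nw(x, x, y, h)                   # y_i - m_hat(x_i)
    return np.sqrt(np.sum(w ** 2 * resid ** 2))

m0, s0 = nw(x0, x, y, h)[0], nw_se(x, y, x0, h)

t_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)             # paired bootstrap of (y, x)
    xb, yb = x[idx], y[idx]
    t_star[b] = (nw(x0, xb, yb, h)[0] - m0) / nw_se(xb, yb, x0, h)

lo, hi = np.quantile(t_star, [0.025, 0.975])
print("percentile-t 95% CI for m(0.5):", (m0 - hi * s0, m0 - lo * s0))
```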
11.6.6. Nonsmooth Estimators

From Section 11.4.2 the bootstrap assumes smoothness in estimators and statistics. Otherwise the bootstrap may not offer an asymptotic refinement and may even be invalid.

As an illustration we consider the LAD estimator and its extension to binary data. The LAD estimator (see Section 4.6.2) has objective function $\sum_i |y_i - x_i'\beta|$, which has a discontinuous first derivative. A bootstrap can provide a valid asymptotic approximation but does not provide an asymptotic refinement. For binary outcomes, the LAD estimator extends to the maximum score estimator of Manski (1975) (see Section 14.7.2). For this estimator the bootstrap is not even consistent.

In these examples bootstraps with asymptotic refinements can be obtained by using a smoothed version of the original objective function for the estimator. For example, the smoothed maximum score estimator of Horowitz (1992) is presented in Section 14.7.2.

11.6.7. Time Series

The bootstrap relies on resampling from an iid distribution. Time-series data therefore present obvious problems as a result of dependence.

The bootstrap is straightforward in the linear model with an ARMA error structure, by resampling the underlying white noise error. As an example, suppose $y_t = \beta x_t + u_t$, where $u_t = \rho u_{t-1} + \varepsilon_t$ and $\varepsilon_t$ is white noise. Then given estimates $\hat\beta$ and $\hat\rho$ we can recursively compute residuals as

$$\hat\varepsilon_t = \hat u_t - \hat\rho\hat u_{t-1} = (y_t - x_t\hat\beta) - \hat\rho(y_{t-1} - x_{t-1}\hat\beta).$$

Bootstrapping these residuals to give $\varepsilon_t^*$, $t = 1, \ldots, T$, we can then recursively compute $u_t^* = \hat\rho u_{t-1}^* + \varepsilon_t^*$ and hence $y_t^* = \hat\beta x_t + u_t^*$. Then regress $y_t^*$ on $x_t$ with AR(1) error.

An early example was presented by Freedman (1984), who bootstrapped a dynamic linear simultaneous equations regression model estimated by 2SLS. Given linearity, simultaneity poses few additional problems. The dynamic nature of the model is handled by recursively constructing $y_t^* = f(y_{t-1}^*, x_t, u_t^*)$, where the $u_t^*$ are obtained by resampling from the 2SLS structural equation residuals and $y_0^* = y_0$. Then perform 2SLS on each bootstrap sample. This method assumes the underlying error is iid. For general dependent data without an ARMA specification, for example nonstationary data, the moving blocks bootstrap presented in Section 11.5.2 can be used.

For testing unit roots or cointegration, special care is needed in applying the bootstrap as the behavior of the test statistic changes discontinuously at the unit root. See, for example, Li and Maddala (1997). Although it is possible to implement a valid bootstrap in this situation, to date these bootstraps do not provide an asymptotic refinement.
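The recursive residual bootstrap just described is sketched below (an added illustration; the simple two-step estimation of $\beta$ and $\rho$ and all names are hypothetical simplifications): white-noise residuals are resampled, the AR(1) errors and outcomes are rebuilt recursively, and the slope is re-estimated in each resample.

```python
# Recursive residual bootstrap for y_t = beta*x_t + u_t with AR(1) errors.
# Illustrative sketch; beta and rho are estimated by a simple two-step OLS here.
import numpy as np

rng = np.random.default_rng(9)
T, B = 200, 999
x = rng.normal(0, 1, size=T)
u = np.empty(T)
u[0] = rng.normal()
for t in range(1, T):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 0.5 * x + u

def fit(y, x):
    """Two-step estimates: OLS slope, then AR(1) coefficient of the residuals."""
    beta = (x @ y) / (x @ x)
    uh = y - beta * x
    rho = (uh[:-1] @ uh[1:]) / (uh[:-1] @ uh[:-1])
    return beta, rho, uh

beta_hat, rho_hat, u_hat = fit(y, x)
eps_hat = u_hat[1:] - rho_hat * u_hat[:-1]           # recursive residuals
eps_hat = eps_hat - eps_hat.mean()                   # recenter

beta_star = np.empty(B)
for b in range(B):
    eps_s = rng.choice(eps_hat, size=T, replace=True)
    u_s = np.empty(T)
    u_s[0] = u_hat[0]                                # initialize at the observed value
    for t in range(1, T):
        u_s[t] = rho_hat * u_s[t - 1] + eps_s[t]     # rebuild AR(1) errors
    y_s = beta_hat * x + u_s                         # rebuild outcomes, x held fixed
    beta_star[b] = fit(y_s, x)[0]

print("bootstrap se of beta:", beta_star.std(ddof=1))
```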
11.7. Practical Considerations

The bootstrap without asymptotic refinement can be a very useful tool for the applied researcher in situations where it is difficult to perform inference by other means. This need can vary with available software and the practitioner's tool kit. The most common application of the bootstrap to date is computation of standard errors needed to conduct a Wald hypothesis test. Examples include heteroskedasticity-robust and panel-robust inference, inference for two-step estimators, and inference on transformations of estimators. Other potential applications include computation of m-test statistics such as the Hausman test.

The bootstrap can additionally provide an asymptotic refinement. Many Monte Carlo studies show that quite standard procedures can perform poorly in finite samples. There appears to be great potential for use of bootstrap refinements, currently unrealized. In some cases this could improve existing inference, such as use of the wild bootstrap in models with additive errors that are heteroskedastic. In other cases it should encourage increased use of methods that are currently under-utilized. In particular, model specification tests with good small-sample properties can be implemented by bootstrapping easily computed auxiliary regressions.

There are two barriers to the use of the bootstrap. First, the bootstrap is not always built into statistical packages. This will change over time, and for now constructing code for a bootstrap is not too difficult provided the package includes looping and the ability to save regression output. Second, there are subtleties involved. Asymptotic refinement requires use of an asymptotically pivotal statistic, and the simplest bootstraps presume iid data and smoothness of estimators and statistics. This covers a wide class of applications but not all applications.

11.8. Bibliographic Notes

The bootstrap was proposed by Efron (1979) for the iid case. Singh (1981) and Bickel and Freedman (1981) provided early theory. A good introductory statistics treatment is by Efron and Tibshirani (1993), and a more advanced treatment is by Hall (1992). Extensions to the regression case were considered early on; see, for example, Freedman (1984). Most of the work by econometricians has occurred in the past 10 years. The survey of Horowitz (2001) is very comprehensive and is well complemented by the survey of Brownstone and Kazimi (1998), which considers many econometrics applications, and the paper by MacKinnon (2002).

Exercises

11–1 Consider the model $y = \alpha + \beta x + \varepsilon$, where $\alpha$, $\beta$, and x are scalars and $\varepsilon \sim N[0, \sigma^2]$. Generate a sample of size N = 20 with $\alpha = 2$, $\beta = 1$, and $\sigma^2 = 1$, and suppose that $x \sim N[2, 2]$. We wish to test $H_0: \beta = 1$ against $H_a: \beta \neq 1$ at level 0.05 using the t-statistic $t = (\hat\beta - 1)/\text{se}[\hat\beta]$. Do as much of the following as your software permits. Use B = 499 bootstrap replications.
(a) Estimate the model by OLS, giving slope estimate $\hat\beta$.
(b) Use a paired bootstrap to compute the standard error and compare this to the original sample estimate. Use the bootstrap standard error to test $H_0$.
(c) Use a paired bootstrap with asymptotic refinement to test $H_0$.
(d) Use a residual bootstrap to compute the standard error and compare this to the original sample estimate. Use the bootstrap standard error to test $H_0$.
(e) Use a residual bootstrap with asymptotic refinement to test $H_0$.
11–2 Generate a sample of size 20 according to the following dgp. The two regressors are generated by $x_1 \sim \chi^2(4) - 4$ and $x_2 \sim 3.5 + U[1, 2]$; the error is from a mixture of normals with $u \sim N[0, 25]$ with probability 0.3 and $u \sim N[0, 5]$ with probability 0.7; and the dependent variable is $y = 1.3x_1 + 0.7x_2 + 0.5u$.
(a) Estimate by OLS the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$.
(b) Suppose we are interested in estimating the quantity $\gamma = \beta_1 + \beta_2^2$ from the data. Use the least-squares estimates to estimate this quantity. Use the delta method to obtain an approximate standard error for this function.
(c) Then estimate the standard error of $\hat\gamma$ using a paired bootstrap. Compare this to $\text{se}[\hat\gamma]$ from part (b) and explain the difference. For the bootstrap use B = 25 and B = 200.
(d) Now test $H_0: \gamma = 1.0$ at level 0.05 using a paired bootstrap with B = 999. Perform bootstrap tests without and with asymptotic refinement.

11–3 Use 200 observations from the Section 4.6.4 data on the natural logarithm of health expenditure (y) and the natural logarithm of total expenditure (x). Obtain OLS estimates of the model $y = \alpha + \beta x + u$. Use the paired bootstrap with B = 999.
(a) Obtain a bootstrap estimate of the standard error of $\hat\beta$.
(b) Use this standard error estimate to test $H_0: \beta = 1$ against $H_a: \beta \neq 1$.
(c) Do a bootstrap test with refinement of $H_0: \beta = 1$ against $H_a: \beta \neq 1$ under the assumption that u is homoskedastic.
(d) If u is heteroskedastic what happens to your method in (c)? Is the test still asymptotically valid, and if so does it offer an asymptotic refinement?
(e) Do a bootstrap to obtain a bias-corrected estimate of $\beta$.
CHAPTER 12

Simulation-Based Methods

12.1. Introduction

The nonlinear methods presented in the preceding chapters do not require closed-form solutions for the estimator. Nonetheless, they rely considerably on analytical tractability. In particular, the objective function for the estimator has been assumed to have a closed-form expression, and the asymptotic distribution of the estimator is based on a linearization of the estimating equations. In the current chapter we present simulation-based estimation method