Panel Data Regression Notes- Part 2 Subsection -A.pptx
1. Part 2A: Endogeneity [ 1/53]
Econometric Analysis of Panel Data
William Greene
Department of Economics
University of South Florida
2. Part 2A: Endogeneity [ 2/53]
lnSPI = α + β*lnGDPPC(PPP) + ε, 0 < β < 1.
(Huffington Post, 2/16/16)
Reverse Causality in the Preston Curve?
3. Part 2A: Endogeneity [ 3/53]
In two of my projects, I was asked by reviewers to address the endogeneity concerns. In one
project, I regress employee departure on project termination. Arguably project termination is
not exogenous.
DEPARTURE = f(a + b*PROJECT TERMINATION + e)
(Time until departure???)
In the other, I regress firms’ charitable giving in specific countries on their business activities
in the local community. Again, business presence in countries is not exogenous. The
problem is, both papers used non-linear models (Hazard model in one, and hurdle model
in the other), which are required by the data I have. Are you aware of any econometric
methods to deal with endogeneity in non-linear models? My search online did not go
anywhere.
Hazard Model: Not a linear model.
Prob[event happens in time interval t to t+Δ| event happens after time t] =
a function of (x’β)
http://people.stern.nyu.edu/wgreene/Econometrics/NonlinearPanelDataModels.pdf
4. Part 2A: Endogeneity [ 4/53]
I have been asked this question (or ones like it) dozens of times. I think the issue is
getting way overplayed. But, I'm not the majority voice, so you are going to have to deal
with this.
Step 1: you or the referee need to figure out (make a case for) by what construction is
"project termination" endogenous. What is correlated with what **in your hazard
model** that makes the variable endogenous? There must be a second equation that
implies that project termination is endogenous. What is it? What unobservable in that
equation is correlated with what unobservable in the hazard model that makes it
endogenous. Same questions for your hurdle model.
Step 2: Depends on the outcome of Step 1…
11. Part 2A: Endogeneity [ 11/53]
Control Function Approach
S*_i = x_iS'θ + w_i, S_i = 1[S*_i > 0], w_i ~ N[0,1]
lnL = Σ_{i=1..N} ln Φ[(2S_i - 1) x_iS'θ] = Σ_{i=1..N} ln f_i
GENERALIZED RESIDUAL
u_i = ∂ln f_i / ∂(Constant term) = (2S_i - 1) φ[(2S_i - 1) x_iS'θ] / Φ[(2S_i - 1) x_iS'θ]   (Control Function)
(For a linear regression, the generalized residual is e_i / s².)
Poisson or NB1 Model with "Residual Inclusion"
E[C_i | x_i, S_i, u_i] = exp[x_i'β̂ + γ̂ S_i + ρ̂ u_i]
16. Part 2A: Endogeneity [ 16/53]
Endogeneity
y = Xβ+ε,
Definition: E[ε|x]≠0
Why not?
Omitted variables
Unobserved heterogeneity (equivalent to omitted
variables)
Measurement error on the RHS (equivalent to
omitted variables)
Endogenous sampling and attrition
Simultaneity (?) (“reverse causality”)
17. Part 2A: Endogeneity [ 17/53]
Cornwell and Rupert Data
Cornwell and Rupert Returns to Schooling Data, 595 Individuals, 7 Years
Variables in the file are
EXP = work experience
WKS = weeks worked
OCC = occupation, 1 if blue collar
IND = 1 if manufacturing industry
SOUTH = 1 if resides in south
SMSA = 1 if resides in a city (SMSA)
MS = 1 if married
FEM = 1 if female
UNION = 1 if wage set by union contract
ED = years of education
LWAGE = log of wage = dependent variable in regressions
These data were analyzed in Cornwell, C. and Rupert, P., "Efficient Estimation with
Panel Data: An Empirical Comparison of Instrumental Variable Estimators," Journal
of Applied Econometrics, 3, 1988, pp. 149-155. See Baltagi, page 122 for further
analysis. The data were downloaded from the website for Baltagi's text.
19. Part 2A: Endogeneity [ 19/53]
The Effect of Education on LWAGE
LWAGE = β1 + β2 EDUC + β3 EXP + β4 EXP² + ... + ε
What is ε? Motivation, ... + everything else
EDUC = f(GENDER, SMSA, SOUTH, Motivation, ...)
20. Part 2A: Endogeneity [ 20/53]
What Influences LWAGE?
LWAGE = β1 + β2 EDUC(X, Motivation, ...) + β3 EXP + β4 EXP² + ... + ε(Motivation)
Variation in Motivation is associated with variation in EDUC(X, Motivation, ...) and ε(Motivation).
What looks like an effect due to variation in EDUC may be due to variation in Motivation. The estimate of β2 picks up the effect of EDUC and the hidden effect of Motivation.
21. Part 2A: Endogeneity [ 21/53]
The General Problem
y = X1β1 + X2β2 + ε
Cov(X1, ε) = 0, K1 variables
Cov(X2, ε) ≠ 0, K2 variables
X2 is endogenous.
OLS regression of y on (X1, X2) cannot estimate (β1, β2) consistently. Some other estimator is needed.
Additional structure: How does X2 become endogenous?
X2 = ZΠ + V where Cov(V, ε) ≠ 0 but Cov(Z, ε) = 0.
An estimator based on (X1, X2, Z) may be able to estimate (β1, β2) consistently: an instrumental variable (IV) estimator.
22. Part 2A: Endogeneity [ 22/53]
Instrumental Variables
Framework: y = Xβ + ε, K variables in X.
There exists a set of K variables, Z such that
plim(Z’X/n) ≠ 0 but plim(Z’ε/n) = 0
The variables in Z are called instrumental
variables.
An alternative (to least squares) estimator of β is
bIV = (Z’X)^-1 Z’y ~ Cov(Z,y) / Cov(Z,X)
We consider the following:
Why use this estimator?
What are its properties compared to least squares?
We will also examine an important application
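A minimal sketch of the estimator bIV = (Z’X)^-1 Z’y in Python with NumPy, on simulated data. The data-generating process, the variable names (motivation, z, x) and the true slope of 0.5 are assumptions made for illustration; none of them come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (not from the slides):
# "motivation" is unobserved, raises both x and y, and so ends up in eps.
motivation = rng.normal(size=n)
z = rng.normal(size=n)                      # instrument: moves x, unrelated to eps
x = 0.8 * z + motivation + rng.normal(size=n)
eps = motivation + rng.normal(size=n)
y = 1.0 + 0.5 * x + eps                     # true slope is 0.5 by construction

X = np.column_stack([np.ones(n), x])        # K = 2 columns
Z = np.column_stack([np.ones(n), z])        # K = 2 instruments (the constant instruments itself)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^-1 X'y
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # (Z'X)^-1 Z'y

# With one regressor and one instrument the IV slope is Cov(z,y) / Cov(z,x).
slope_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print("OLS slope:", b_ols[1])
print("IV slope: ", b_iv[1], " ratio form:", slope_ratio)
```

In this setup the OLS slope should drift away from 0.5 because x and eps share the motivation term, while the IV slope should stay near it.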
23. Part 2A: Endogeneity [ 23/53]
An Exogenous Influence
LWAGE = β1 + β2 EDUC(X, Z, Motivation, ...) + β3 EXP + β4 EXP² + ... + ε(Motivation)
Variation in Z is associated with variation in EDUC(X, Z, Motivation, ...) and not ε(Motivation).
An effect due to the effect of variation in Z on EDUC will only be due to variation in EDUC. The estimate of β2 picks up the effect of EDUC only.
Z is an Instrumental Variable
24. Part 2A: Endogeneity [ 24/53]
Instrumental Variables
My theory claims that MS and FEM are
instruments
Structural equations
LWAGE (ED,EXP,EXPSQ,WKS,OCC,
SOUTH,SMSA,UNION)
ED (…,MS, FEM) Equation explains the
endogeneity
Reduced Form:
LWAGE[ ED (…,MS, FEM),
EXP,EXPSQ,WKS,OCC,
SOUTH,SMSA,UNION ]
25. Part 2A: Endogeneity [ 25/53]
SNAP Model. X is in both equations. Z is in the SNAP equation. SNAP is in the Health equation.
26. Part 2A: Endogeneity [ 26/53]
Instrumental Variables in Regression
Typical Case: One “problem” variable – the “last” one
yit = β1x1it + β2x2it + … + βKxKit + εit
E[εit|x1it,…,xKit] ≠ 0. (0 for all others)
There exists a variable zit such that
Relevance
E[xKit| x1it, x2it,…, xK-1,it,zit] = g(x1it, x2it,…, xK-1,it,zit)
In the presence of the other variables, zit “explains” xKit
A projection interpretation: In the projection,
xKit = θ1x1it + θ2x2it + … + θK-1xK-1,it + θKzit, θK ≠ 0.
Exogeneity
E[εit| x1it, x2it,…, xK-1,it,zit] = 0
In the presence of the other variables, zit and εit are
uncorrelated.
27. Part 2A: Endogeneity [ 27/53]
Two Stage Least Squares Strategy
Reduced Form:
LWAGE[ ED (MS, FEM,X),
EXP,EXPSQ,WKS,OCC,
SOUTH,SMSA,UNION ]
Strategy
(1) Purge ED of the influence of everything but MS,
FEM and the other X variables. Predict ED using all
exogenous information in the sample (X,MS,FEM).
(2) Regress LWAGE on this prediction of ED and
everything else.
Standard errors must be adjusted for the predicted
ED
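The two steps can be sketched in Python with NumPy. The data here are simulated stand-ins, not the Cornwell and Rupert file: exper plays the role of the exogenous X variables, z1 and z2 play the role of MS and FEM, and the coefficient values are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical stand-ins for the slide's variables (simulated data).
motivation = rng.normal(size=n)                  # unobserved
exper = rng.normal(size=n)                       # exogenous regressor (role of X)
z1, z2 = rng.normal(size=n), rng.normal(size=n)  # excluded instruments (role of MS, FEM)
ed = 0.5 * z1 + 0.5 * z2 + 0.3 * exper + motivation + rng.normal(size=n)
lwage = 1.0 + 0.5 * ed + 0.2 * exper + motivation + rng.normal(size=n)

ones = np.ones(n)
X = np.column_stack([ones, exper, ed])           # structural regressors, ED last
Z = np.column_stack([ones, exper, z1, z2])       # all exogenous information

def ols(A, b):
    """Least squares coefficients of b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Step 1: predict ED from every exogenous variable.
ed_hat = Z @ ols(Z, ed)

# Step 2: regress LWAGE on the prediction of ED and everything else.
X_hat = np.column_stack([ones, exper, ed_hat])
b_2sls = ols(X_hat, lwage)

# Same coefficients in one pass: (Xhat'X)^-1 Xhat'y, where Xhat = fitted values of X on Z.
PX = Z @ ols(Z, X)
b_direct = np.linalg.solve(PX.T @ X, PX.T @ lwage)

print("two-step 2SLS:", b_2sls)
print("one-pass 2SLS:", b_direct)
```

The standard errors printed by an ordinary regression routine in step 2 are not the right ones; this sketch computes coefficients only.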
29. Part 2A: Endogeneity [ 29/53]
The weird results for the coefficient on ED may be due to the instruments, MS and FEM, being dummy variables. There is not much variation in these variables and not much covariation with the other variables.
2SLS Regression (Maybe not a very good theory)
2SLS coefficient estimate is implausible. Now what?
30. Part 2A: Endogeneity [ 30/53]
An Interpretation
The Source of the Endogeneity
LWAGE = f(ED,
EXP,EXPSQ,WKS,OCC,
SOUTH,SMSA,UNION) + ε
ED = f(MS,FEM,
EXP,EXPSQ,WKS,OCC,
SOUTH,SMSA,UNION) + u
31. Part 2A: Endogeneity [ 31/53]
Can We Remove the Endogeneity?
LWAGE = f(ED,
EXP,EXPSQ,WKS,OCC,
SOUTH,SMSA,UNION) + u + ε
Strategy
Estimate u
Add u to the equation.
ED is correlated with u+ε because it is correlated with u.
ED is uncorrelated with the remaining disturbance ε if u is in the equation.
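A sketch of this strategy in Python with NumPy, on simulated data (variable names and coefficients are assumptions, not the slide's data set). It estimates u as the residual from an auxiliary regression of ED on the exogenous variables, adds it to the equation, and compares the result with 2SLS.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Simulated stand-in data; names are hypothetical.
common = rng.normal(size=n)                      # makes u and eps correlated
exper = rng.normal(size=n)
z = rng.normal(size=n)
ed = 0.7 * z + 0.3 * exper + common + rng.normal(size=n)
lwage = 1.0 + 0.5 * ed + 0.2 * exper + common + rng.normal(size=n)

ones = np.ones(n)
X = np.column_stack([ones, exper, ed])
Z = np.column_stack([ones, exper, z])

def ols(A, b):
    """Least squares coefficients of b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Estimate u: residual from the auxiliary regression of ED on the exogenous variables.
u_hat = ed - Z @ ols(Z, ed)

# Add u_hat to the equation and run OLS.
b_cf = ols(np.column_stack([X, u_hat]), lwage)

# 2SLS for comparison: replace ED by its prediction, ed - u_hat.
X_hat = np.column_stack([ones, exper, ed - u_hat])
b_2sls = ols(X_hat, lwage)

print("control function:", b_cf[:3])
print("2SLS:            ", b_2sls)
print("coefficient on u_hat:", b_cf[3])
```

The coefficients on the original regressors coincide with 2SLS because u_hat is orthogonal to every column of Z, so adding it leaves only the part of ED explained by Z to identify the ED coefficient.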
32. Part 2A: Endogeneity [ 32/53]
Auxiliary Regression for ED to Obtain Residuals
[Regression output not reproduced: ED is regressed on the IVs and the exogenous variables.]
34. Part 2A: Endogeneity [ 34/53]
A Warning About Control Functions
Sum of squares is not computed correctly because U is in the regression.
A general result. Control function estimators usually require a fix to the
estimated covariance matrix for the estimator.
35. Part 2A: Endogeneity [ 35/53]
Estimating σ²
Estimating the asymptotic covariance matrix: a caution about estimating σ².
Since the regression is computed by regressing y on x̂, one might use
σ̂² = (1/n) Σ_{i=1..n} (y_i - x̂_i'b_2sls)²   (uses x̂)
This is inconsistent. Use
σ̂² = (1/n) Σ_{i=1..n} (y_i - x_i'b_2sls)²   (uses x)
(Degrees of freedom correction is optional; usually done.)
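The caution can be checked numerically. In this Python/NumPy sketch the data are simulated so that the disturbance variance is 2 by construction; that value, and all names, are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Simulated example; eps has variance 2 by construction.
common = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + common + rng.normal(size=n)
eps = common + rng.normal(size=n)
y = 1.0 + 0.5 * x + eps

ones = np.ones(n)
X = np.column_stack([ones, x])
Z = np.column_stack([ones, z])

x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X_hat = np.column_stack([ones, x_hat])
b_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Second-stage residuals use x_hat: this is what a plain OLS routine would report.
s2_wrong = np.mean((y - X_hat @ b_2sls) ** 2)
# The consistent estimator puts the actual x back in.
s2_right = np.mean((y - X @ b_2sls) ** 2)

print("sigma^2 using x_hat:", s2_wrong)
print("sigma^2 using x:    ", s2_right)
```

Only the version computed with the actual x should land near the variance built into the simulation.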
36. Part 2A: Endogeneity [ 36/53]
Robust estimation of VC
Counterpart to the White estimator allows heteroscedasticity:
Est.Asy.Var[β̂] = (X̂'X̂)^-1 [Σ_{i,t} (y_it - x_it'β̂)² x̂_it x̂_it'] (X̂'X̂)^-1
The residual y_it - x_it'β̂ uses the “actual” X; X̂ and x̂_it are the “predicted” X.
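A sketch of this robust covariance matrix in Python with NumPy. The simulated heteroscedasticity (disturbance scale depending on z) is an assumption of the example, chosen so that the robust and conventional standard errors differ visibly.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

common = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + common + rng.normal(size=n)
# Heteroscedastic disturbance: its scale depends on z (an assumption of this sketch).
eps = (common + rng.normal(size=n)) * np.sqrt(0.5 + z ** 2)
y = 1.0 + 0.5 * x + eps

ones = np.ones(n)
X = np.column_stack([ones, x])                    # "actual" X
Z = np.column_stack([ones, z])
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]  # "predicted" X
b = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

e = y - X @ b                                     # residuals use the actual X
bread = np.linalg.inv(X_hat.T @ X_hat)
meat = (X_hat * (e ** 2)[:, None]).T @ X_hat      # sum of e_i^2 * xhat_i xhat_i'
V_robust = bread @ meat @ bread
V_classic = np.mean(e ** 2) * bread               # homoscedastic formula, for comparison

se_robust = np.sqrt(np.diag(V_robust))
se_classic = np.sqrt(np.diag(V_classic))
print("robust s.e.: ", se_robust)
print("classic s.e.:", se_classic)
```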
38. Part 2A: Endogeneity [ 38/53]
Inference with IV Estimators
(1) Wald Statistics:
(Rβ̂ - q)'{R Est.Asy.Var[β̂] R'}^-1 (Rβ̂ - q)
(E.g., the usual 't-statistics')
(2) A type of F statistic:
Compute SSUA = (y - Xβ̂_U)'(y - Xβ̂_U) without restrictions (Note, X)
Compute SSR = (y - X̂β̂_R)'(y - X̂β̂_R) with restrictions
Compute SSU = (y - X̂β̂_U)'(y - X̂β̂_U) without restrictions (Note, X̂)
F = [(SSR - SSU) / J] / [SSUA / (N-K)] ~ F[J, N-K]
39. Part 2A: Endogeneity [ 39/53]
Endogeneity Test? (Hausman)
        Exogenous                 Endogenous
OLS     Consistent, Efficient     Inconsistent
2SLS    Consistent, Inefficient   Consistent
Base a test on d = b2SLS - bOLS
Use a Wald statistic, d’[Var(d)]^-1 d
What to use for the variance matrix?
Hausman: V2SLS - VOLS
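A rough sketch of the Hausman statistic in Python with NumPy for the case of a single suspect regressor, on simulated data. Two choices here are assumptions of the sketch and not taken from the slides: a common s² (from the OLS residuals) is used in both covariance matrices so that V2SLS - VOLS stays positive semidefinite, and the statistic is computed on the one coefficient that belongs to the suspect variable.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

common = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + common + rng.normal(size=n)        # endogenous by construction
y = 1.0 + 0.5 * x + common + rng.normal(size=n)

ones = np.ones(n)
X = np.column_stack([ones, x])
Z = np.column_stack([ones, z])
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# One common s^2 in both matrices (a choice made for this sketch).
e_ols = y - X @ b_ols
s2 = e_ols @ e_ols / (n - X.shape[1])
V_ols = s2 * np.linalg.inv(X.T @ X)
V_2sls = s2 * np.linalg.inv(X_hat.T @ X_hat)

# Only one variable is suspect, so the statistic has one degree of freedom:
# use the element of d and of V_2sls - V_ols that belongs to x.
d = b_2sls[1] - b_ols[1]
H = d ** 2 / (V_2sls[1, 1] - V_ols[1, 1])

print("d =", d, " Hausman H =", H, "(5% critical value of chi-squared[1] is 3.84)")
```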
42. Part 2A: Endogeneity [ 42/53]
Endogeneity Test: Wu
Considerable complication in Hausman test
(text, pp. 276-277)
Simplification: Wu test.
Regress y on X and X̂ estimated for the
endogenous part of X. Then use an ordinary
Wald test.
44. Part 2A: Endogeneity [ 44/53]
Regression Based Endogeneity Test
An easy t test. (Wooldridge 2010, p. 127)
y_it = x_it'δ + γq_it + ε_it
Z = a set of M instruments.
Write q = Zπ + v
Can be estimated by ordinary least squares.
Endogeneity concerns correlation between v and ε.
Add v̂_it = q_it - z_it'π̂ to the equation and use OLS:
y_it = x_it'δ + γq_it + θv̂_it + {error}
Simple t test on whether θ equals 0.
Even easier, algebraically identical (Wu, 1973), add q̂_it = z_it'π̂ to the equation and do the same test.
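The regression-based test can be sketched in Python with NumPy. The data are simulated, the names (xvar, q, z) are placeholders, and conventional OLS standard errors are used; all of that is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000

common = rng.normal(size=n)
xvar = rng.normal(size=n)                         # exogenous regressor
z = rng.normal(size=n)                            # excluded instrument
q = 0.8 * z + 0.3 * xvar + common + rng.normal(size=n)
y = 1.0 + 0.2 * xvar + 0.5 * q + common + rng.normal(size=n)

ones = np.ones(n)
Z = np.column_stack([ones, xvar, z])

def ols_with_se(A, b):
    """OLS coefficients and conventional standard errors."""
    coef = np.linalg.solve(A.T @ A, A.T @ b)
    resid = b - A @ coef
    s2 = resid @ resid / (A.shape[0] - A.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
    return coef, se

# First stage: q = Z*pi + v, estimated by OLS.
pi_hat, _ = ols_with_se(Z, q)
q_hat = Z @ pi_hat
v_hat = q - q_hat

# Add v_hat to the equation; t test on its coefficient.
coef_v, se_v = ols_with_se(np.column_stack([ones, xvar, q, v_hat]), y)
t_v = coef_v[3] / se_v[3]

# Wu's variant: add q_hat instead; same t ratio in absolute value, opposite sign.
coef_q, se_q = ols_with_se(np.column_stack([ones, xvar, q, q_hat]), y)
t_qhat = coef_q[3] / se_q[3]

print("t on v_hat:", t_v, " t on q_hat:", t_qhat)
```

The two versions span the same column space because q_hat = q - v_hat, which is why the t ratios agree up to sign.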
45. Part 2A: Endogeneity [ 45/53]
Wu Test
Since this is 2SLS using a control function, the standard errors should have
been adjusted to carry out this test. (The sum of squares is too small.)
46. Part 2A: Endogeneity [ 46/53]
Testing Endogeneity of WKS
(1) Regress WKS on 1,EXP,EXPSQ,OCC,SOUTH,SMSA,MS.
U=residual, WKSHAT=prediction
(2) Regress LWAGE on 1,EXP,EXPSQ,OCC,SOUTH,SMSA,WKS, U or WKSHAT
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -9.97734299 .75652186 -13.188 .0000
EXP .01833440 .00259373 7.069 .0000 19.8537815
EXPSQ -.799491D-04 .603484D-04 -1.325 .1852 514.405042
OCC -.28885529 .01222533 -23.628 .0000 .51116447
SOUTH -.26279891 .01439561 -18.255 .0000 .29027611
SMSA .03616514 .01369743 2.640 .0083 .65378151
WKS .35314170 .01638709 21.550 .0000 46.8115246
U -.34960141 .01642842 -21.280 .0000 -.341879D-14
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -9.97734299 .75652186 -13.188 .0000
EXP .01833440 .00259373 7.069 .0000 19.8537815
EXPSQ -.799491D-04 .603484D-04 -1.325 .1852 514.405042
OCC -.28885529 .01222533 -23.628 .0000 .51116447
SOUTH -.26279891 .01439561 -18.255 .0000 .29027611
SMSA .03616514 .01369743 2.640 .0083 .65378151
WKS .00354028 .00116459 3.040 .0024 46.8115246
WKSHAT .34960141 .01642842 21.280 .0000 46.8115246
47. Part 2A: Endogeneity [ 47/53]
Weak Instruments
One endogenous variable: y = Xβ + δxk + ε; Instruments (X,Z); Z is exogenous.
Symptom: The relevance condition, plim Z’X/n not zero, is close to being violated. Relevance: Z must “explain” xk after controlling for X.
Detection:
Standard F test in the regression of xk on (X,Z). Wald test of coefficients on Z equal 0; F < 10 suggests a problem. (Staiger and Stock)
Other versions by Stock and Yogo (2005), Cragg and Donald (1993), Kleibergen and Paap (2006) for when xk is more than one variable and for when xk = f(X,Z) + u and u is heteroscedastic, clustered, nonnormal, etc.
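A sketch of the first-stage F test in Python with NumPy. The helper first_stage_F is a hypothetical name written for this example, and both first stages are simulated with assumed coefficients (0.5 for the strong case, 0.02 for the weak case).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000

def first_stage_F(xk, X, Zex):
    """F statistic for H0: coefficients on Zex are all zero in the regression of xk on (X, Zex)."""
    def ssr(A):
        resid = xk - A @ np.linalg.lstsq(A, xk, rcond=None)[0]
        return resid @ resid
    full = np.column_stack([X, Zex])
    J = Zex.shape[1]
    ssr_r, ssr_u = ssr(X), ssr(full)
    return ((ssr_r - ssr_u) / J) / (ssr_u / (len(xk) - full.shape[1]))

X = np.column_stack([np.ones(n), rng.normal(size=n)])   # included exogenous variables
Zex = rng.normal(size=(n, 2))                           # excluded instruments
noise = rng.normal(size=n)

xk_strong = X @ np.array([1.0, 0.5]) + Zex @ np.array([0.5, 0.5]) + noise
xk_weak = X @ np.array([1.0, 0.5]) + Zex @ np.array([0.02, 0.02]) + noise

F_strong = first_stage_F(xk_strong, X, Zex)
F_weak = first_stage_F(xk_weak, X, Zex)
print("F (strong instruments):", F_strong)
print("F (weak instruments):  ", F_weak)
```

The statistic compares the restricted regression (X only) with the unrestricted one (X and the excluded instruments), which is the F < 10 rule of thumb's input.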
Remedy:
Not much – most of the discussion is about the condition, not what to
do about it.
Use LIML instead of 2SLS? Requires a normality assumption. Probably
48. Part 2A: Endogeneity [ 48/53]
Weak Instruments (cont.)
plim β̂ = β + [Cov(Z,X)]^-1 Cov(Z,ε)
If Cov(Z,ε) is "small" but nonzero, small Cov(Z,X) may hugely magnify the effect.
IV is not only inefficient, it may be very badly biased by "weak" instruments.
Solutions? Can one "test" for weak instruments?
49. Part 2A: Endogeneity [ 49/53]
Weak Instruments
Which is better?
LS is inconsistent, but probably has smaller variance
LS may be more precise
IV is consistent, but probably has larger variance
Asy.Var[β̂] = Q_ZX^-1 Ω_ZZ Q_XZ^-1
Q_ZX^-1 may be large. (Compared to what?)
Strange results with "small" Q_ZX
IV estimator tends to resemble OLS (bias) (not a function of sample size).
Contradictory result. Suppose z is perfectly correlated with x. IV MUST be the same as OLS.
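The last point is an algebraic identity that is easy to check. In this Python/NumPy sketch (simulated data, assumed coefficients) the "instrument" is an exact linear function of x, so Z = XA for a nonsingular A and (Z'X)^-1 Z'y collapses to (X'X)^-1 X'y.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000

common = rng.normal(size=n)
x = common + rng.normal(size=n)                  # endogenous regressor
y = 1.0 + 0.5 * x + common + rng.normal(size=n)
z = 3.0 - 2.0 * x                                # "instrument" that is an exact linear function of x

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

print("OLS:", b_ols)
print("IV: ", b_iv)
```

Such a z is perfectly relevant but inherits all of the endogeneity of x, so the IV estimate carries the same bias as OLS.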
50. Part 2A: Endogeneity [ 50/53]
Endogenous Union Effect
Name ; Xunion = one,occ,smsa,ed,exp,union $
Name ; Zinst = fem,ind,south $
Name ; Zunion = one,occ,smsa,ed,exp,Zinst $
? Inconsistent OLS
Regr ; Lhs = lwage ; Rhs = Xunion ; Cluster = 7 $
? Two Stage Least Squares gives a nonsense result
2sls ; Lhs=lwage;rhs=Xunion;Inst=Zunion ; Cluster=7 $
? Test for weak instruments
Regr ; Lhs=union ; Rhs=Zunion ; res = u
; Cluster = 7 ; test:zinst $
? Control function estimator
? 2SLS coefficients with the wrong standard errors
Regr ; Lhs=lwage;rhs=xunion,u;cluster=7$
53. Part 2A: Endogeneity [ 53/53]
Weak Instruments? No
What is going on here? When the endogenous variable and/or the excluded
instruments are binary, the actual results are sometimes a bit unstable. The
theoretical results are generally about covariation of continuous variables.
55. Part 2A: Endogeneity [ 55/53]
The First IV Study Was a Natural Experiment
(Snow, J., On the Mode of Communication of Cholera, 1855)
http://www.ph.ucla.edu/epi/snow/snowbook3.html
London Cholera epidemic, ca 1853-4
Cholera = f(Water Purity,u) + ε.
‘Causal’ effect of water purity on cholera?
Purity = f(cholera prone environment (poor, garbage in streets, rodents, etc.)). Regression does not work.
Two London water companies: Lambeth; Southwark & Vauxhall
[Map: the River Thames, showing the main sewage discharge]
Paul Grootendorst: A Review of Instrumental Variables Estimation of Treatment Effects…
http://individual.utoronto.ca/grootendorst/pdf/IV_Paper_Sept6_2007.pdf
A review of instrumental variables estimation in the applied health sciences. Health Services and Outcomes Research Methodology 2007; 7(3-4):159-179.
56. Part 2A: Endogeneity [ 56/53]
Investigation Using an Instrumental Variable
Theory: Cholera = β0 + β1 BadWater + Other Factors
Model: C = β0 + β1 B + ε (Stylized)
(C=0/1=no/yes) (B=0/1=good/bad) (ε=other factors)
Interesting measure of causal effect of bad water: β1
Endogeneity Problem: Cholera prone environment u affects B and ε. Interpret this to say B(u) and ε(u) are correlated because of u.
Confounding Effect: E[C|B] ≠ β0 + β1 B because E[ε|B] ≠ 0
E[C|B=1] = β0 + β1 + E[ε|B=1]
E[C|B=0] = β0 + E[ε|B=0]
E[C|B=1] - E[C|B=0] = β1 + {E[ε|B=1] - E[ε|B=0]}
Conclusion: Comparing cholera rates of those with bad water (measurable) to those with good water, P(C|B=1) - P(C|B=0), does not reveal the water effect.
57. Part 2A: Endogeneity [ 57/53]
Instrumental Variable:
L = 1 if water supplied by Lambeth
L = 0 if water supplied by Southwark/Vauxhall
Relevant? Is E[B|L=1] ≠ E[B|L=0]? That is Snow's theory, that the water supply is partly the culprit, and because of their location, Lambeth provided purer water than Southwark.
Exogenous? Is E[ε|L=1] - E[ε|L=0] = 0? Water supply is randomly supplied to houses. Homeowners do not even know which supplier is providing their water. "Assignment is random."
Using the IV in E[C|L] = β0 + β1 E[B|L] + E[ε|L]:
E[C|L=1] = β0 + β1 E[B|L=1] + E[ε|L=1]
E[C|L=0] = β0 + β1 E[B|L=0] + E[ε|L=0]
Estimating Equation:
E[C|L=1] - E[C|L=0] = β1 {E[B|L=1] - E[B|L=0]} + {E[ε|L=1] - E[ε|L=0]}
(the last term is zero because L is exogenous)
58. Part 2A: Endogeneity [ 58/53]
E[C|L=1] - E[C|L=0] = β1 {E[B|L=1] - E[B|L=0]}
IV Estimator: β1 = {E[C|L=1] - E[C|L=0]} / {E[B|L=1] - E[B|L=0]}
(Note: nonzero denominator is the relevance condition.)
Operational:
P(C|L=1) = Proportion of observations supplied by Lambeth that have Cholera
P(C|L=0) = Proportion of observations supplied by Southwark that have Cholera
P(B|L=1) = Proportion of observations supplied by Lambeth with Bad Water
P(B|L=0) = Proportion of observations supplied by Southwark with Bad Water
Estimate: b1 = {P(C|L=1) - P(C|L=0)} / {P(B|L=1) - P(B|L=0)} = (broadly) Cov(C,L) / Cov(B,L)
(The Wald estimator)
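The Wald estimator can be illustrated in Python with NumPy. The households below are entirely simulated and none of the numbers come from Snow's data: the assumed model is a linear probability model with β0 = 0.15 and β1 = 0.20, an unobserved environment u that raises both bad water and cholera, and a supplier L assigned at random.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000

# Entirely simulated households (assumed probabilities, not historical data).
u = rng.uniform(-1.0, 1.0, size=n)               # cholera-prone environment (unobserved)
L = rng.integers(0, 2, size=n)                   # 1 = Lambeth, assigned at random
# Bad water is less likely with Lambeth and more likely in a bad environment.
B = (rng.random(n) < 0.5 - 0.3 * L + 0.2 * u).astype(float)
# Linear probability model for cholera: beta0 = 0.15, beta1 = 0.20; u enters eps.
C = (rng.random(n) < 0.15 + 0.20 * B + 0.10 * u).astype(float)

# Naive comparison of cholera rates by water quality, contaminated by u.
naive = C[B == 1].mean() - C[B == 0].mean()

# Wald estimator: difference in cholera rates over difference in bad-water rates.
wald = (C[L == 1].mean() - C[L == 0].mean()) / (B[L == 1].mean() - B[L == 0].mean())

# The same number written as a ratio of sample covariances.
cov_ratio = np.cov(C, L)[0, 1] / np.cov(B, L)[0, 1]

print("naive difference:", naive)
print("Wald estimate:   ", wald, " Cov ratio:", cov_ratio)
```

With a binary instrument the ratio of differences in means and the ratio of sample covariances are the same number, which is why the two printed values agree.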
59. Part 2A: Endogeneity [ 59/53]
On Sat, May 3, 2014 at 4:48 PM, … wrote:
Dear Professor Greene,
I am giving an Econometrics course in Brazil and we are using
your textbook. I have a question that I think only you can help
me with. In our last class, I did a formal proof that
var(beta_hat_OLS) is less than or equal to var(beta_hat_2SLS),
under homoscedasticity.
We know this assertion is also valid under heteroscedasticity,
but a graduate student asked me for the proof (which is my
problem).
Do you know where I can find it?