Panel Data Regression Notes- Part 2 Subsection -A.pptx

Part 2A: Endogeneity
Econometric Analysis of Panel Data
William Greene
Department of Economics
University of South Florida

lnSPI = α + β*lnGDPPC(PPP) + ε, 0 < β < 1.
(Huffington Post, 2/16/16)
Reverse Causality in the Preston Curve?

In two of my projects, I was asked by reviewers to address the endogeneity concerns. In one
project, I regress employee departure on project termination. Arguably project termination is
not exogenous.
DEPARTURE = f(a + b*PROJECT TERMINATION + e)
(Time until departure???)
In the other, I regress firms’ charitable giving in specific countries on their business activities
in the local community. Again, business presence in countries is not exogenous. The
problem is, both papers used non-linear models (Hazard model in one, and hurdle model
in the other), which are required by the data I have. Are you aware of any econometric
methods to deal with endogeneity in non-linear models? My search online did not go
anywhere.
Hazard Model: Not a linear model.
Prob[event happens in time interval t to t+Δ| event happens after time t] =
a function of (x’β)
http://people.stern.nyu.edu/wgreene/Econometrics/NonlinearPanelDataModels.pdf

I have been asked this question (or ones like it) dozens of times. I think the issue is
getting way overplayed. But, I'm not the majority voice, so you are going to have to deal
with this.
Step 1: you or the referee need to figure out (make a case for) by what construction is
"project termination" endogenous. What is correlated with what **in your hazard
model** that makes the variable endogenous? There must be a second equation that
implies that project termination is endogenous. What is it? What unobservable in that
equation is correlated with what unobservable in the hazard model that makes it
endogenous? Same questions for your hurdle model.
Step 2: Depends on the outcome of Step 1…

By what construction is SNAP endogenous in the
HEALTH equation?
SNAP = Xβ_SNAP + Zδ + ε
HEALTH = Xβ_HEALTH + η·SNAP + v

  

  
  

   
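Poisson Regression
$$\text{Prob}(y = j \mid \mathbf{x}, S) = \frac{\exp(-\lambda)\,\lambda^{j}}{j!}, \qquad \lambda = \exp(\boldsymbol{\beta}'\mathbf{x} + \eta S)$$
[How is S endogenous?]

Negative Binomial Regression
$$\text{Prob}(y = j \mid \mathbf{x}, S, v) = \frac{\exp(-\lambda_v)\,\lambda_v^{j}}{j!}, \qquad \lambda_v = \exp(\boldsymbol{\beta}'\mathbf{x} + \eta S + v) = E[y \mid \mathbf{x}, S, v]$$
$$\text{Prob}(y = j \mid \mathbf{x}, S) = \int_{v} \text{Prob}(y = j \mid \mathbf{x}, S, v)\, f(v)\, dv$$

Negative Binomial Regression with Common Factor
$$\text{Prob}(y = j \mid \mathbf{x}, S, v, u) = \frac{\exp(-\lambda_{v,u})\,\lambda_{v,u}^{j}}{j!}, \qquad \lambda_{v,u} = \exp(\boldsymbol{\beta}'\mathbf{x} + \eta S + v + \theta u) = E[y \mid \mathbf{x}, S, v, u]$$

Control Function Approach
$$S^* = \boldsymbol{\beta}_S'\mathbf{x} + w, \qquad S = 1[S^* > 0], \qquad w \sim N[0,1]$$
$$\ln L = \sum_{i=1}^{N} \ln \Phi\!\left[(2S_i - 1)\,\boldsymbol{\beta}_S'\mathbf{x}_i\right]$$
GENERALIZED RESIDUAL
$$\frac{\partial \ln L}{\partial(\text{Constant term})} = \sum_{i=1}^{N} \frac{(2S_i - 1)\,\phi(\boldsymbol{\beta}_S'\mathbf{x}_i)}{\Phi\!\left[(2S_i - 1)\,\boldsymbol{\beta}_S'\mathbf{x}_i\right]}$$
$$u_i = \frac{(2S_i - 1)\,\phi(\boldsymbol{\beta}_S'\mathbf{x}_i)}{\Phi\!\left[(2S_i - 1)\,\boldsymbol{\beta}_S'\mathbf{x}_i\right]} = \text{Control Function}$$
(For a linear regression, the generalized residual is $e_i / s^2$.)

Poisson or NB1 Model with "Residual Inclusion"
$$E[C_i \mid \mathbf{x}_i, S_i, u_i] = \exp\!\left[\hat{\boldsymbol{\beta}}'\mathbf{x}_i + \hat{\eta} S_i + \hat{\theta} u_i\right]$$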


Endogeneity
• y = Xβ + ε
• Definition: E[ε|x] ≠ 0
• Why not?
  - Omitted variables
  - Unobserved heterogeneity (equivalent to omitted variables)
  - Measurement error on the RHS (equivalent to omitted variables)
  - Endogenous sampling and attrition
  - Simultaneity (?) (“reverse causality”)

Cornwell and Rupert Data
Cornwell and Rupert Returns to Schooling Data, 595 Individuals, 7 Years
Variables in the file are
EXP = work experience
WKS = weeks worked
OCC = occupation, 1 if blue collar,
IND = 1 if manufacturing industry
SOUTH = 1 if resides in south
SMSA = 1 if resides in a city (SMSA)
MS = 1 if married
FEM = 1 if female
UNION = 1 if wage set by union contract
ED = years of education
LWAGE = log of wage = dependent variable in regressions
These data were analyzed in Cornwell, C. and Rupert, P., "Efficient Estimation with
Panel Data: An Empirical Comparison of Instrumental Variable Estimators," Journal
of Applied Econometrics, 3, 1988, pp. 149-155. See Baltagi, page 122 for further
analysis. The data were downloaded from the website for Baltagi's text.

Specification: Quadratic Effect of Experience

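The Effect of Education on LWAGE
$$\text{LWAGE} = \beta_1 + \beta_2\,\text{EDUC} + \beta_3\,\text{EXP} + \beta_4\,\text{EXP}^2 + \dots + \varepsilon$$
What is ε? Motivation, ... + everything else
$$\text{EDUC} = f(\text{GENDER}, \text{SMSA}, \text{SOUTH}, \text{Motivation}, \dots)$$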

What Influences LWAGE?
$$\text{LWAGE} = \beta_1 + \beta_2\,\text{EDUC}(\mathbf{X}, \text{Motivation}, \dots) + \beta_3\,\text{EXP} + \beta_4\,\text{EXP}^2 + \dots + \varepsilon(\text{Motivation})$$
Variation in Motivation is associated with variation in EDUC(X, Motivation, ...) and ε(Motivation).
What looks like an effect due to variation in EDUC may be due to variation in Motivation. The estimate of $\beta_2$ picks up the effect of EDUC and the hidden effect of Motivation.
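The "hidden effect" argument can be checked numerically. The following is a minimal sketch on simulated data, not the Cornwell and Rupert data: the data generating process, the variable names, and the true slope of 1.0 are all assumptions made for illustration. An unobserved variable m (standing in for Motivation) drives both the regressor and the outcome, so the least squares slope absorbs part of its effect.

```python
import random

def cov(a, b):
    """Sample covariance (divisor n) of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def simulate_omitted_variable(n=20000, seed=1):
    """Hypothetical DGP: y = 1.0*x + m + e, where m also drives x but is not observed."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        m = rng.gauss(0, 1)                    # omitted variable ("motivation")
        xi = m + rng.gauss(0, 1)               # regressor is correlated with m
        yi = 1.0 * xi + m + rng.gauss(0, 1)    # m ends up in the disturbance
        x.append(xi)
        y.append(yi)
    return x, y

x, y = simulate_omitted_variable()
b_ols = cov(x, y) / cov(x, x)                  # simple-regression OLS slope
print(f"true slope = 1.0, OLS slope = {b_ols:.3f}")
```

Under these assumed variances the probability limit of the OLS slope is 1 + Cov(x,m)/Var(x) = 1.5, so the estimate should settle near 1.5 rather than 1.0.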

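The General Problem
$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$$
$$\text{Cov}(\mathbf{X}_1, \boldsymbol{\varepsilon}) = \mathbf{0}, \quad K_1 \text{ variables}$$
$$\text{Cov}(\mathbf{X}_2, \boldsymbol{\varepsilon}) \neq \mathbf{0}, \quad K_2 \text{ variables}$$
$\mathbf{X}_2$ is endogenous.
OLS regression of y on $(\mathbf{X}_1, \mathbf{X}_2)$ cannot estimate $(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2)$ consistently. Some other estimator is needed.
Additional structure: How does $\mathbf{X}_2$ become endogenous?
$$\mathbf{X}_2 = \mathbf{Z}\boldsymbol{\Pi} + \mathbf{V} \quad \text{where } \text{Cov}(\mathbf{V}, \boldsymbol{\varepsilon}) \neq \mathbf{0} \text{ but } \text{Cov}(\mathbf{Z}, \boldsymbol{\varepsilon}) = \mathbf{0}.$$
An instrumental variable (IV) estimator based on $(\mathbf{X}_1, \mathbf{X}_2, \mathbf{Z})$ may be able to estimate $(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2)$ consistently.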

Instrumental Variables
• Framework: y = Xβ + ε, K variables in X.
• There exists a set of K variables, Z, such that
  plim(Z’X/n) ≠ 0 but plim(Z’ε/n) = 0.
  The variables in Z are called instrumental variables.
• An alternative (to least squares) estimator of β is
  bIV = (Z’X)^(-1) Z’y ~ Cov(Z,y) / Cov(Z,X)
• We consider the following:
  - Why use this estimator?
  - What are its properties compared to least squares?
• We will also examine an important application

An Exogenous Influence
$$\text{LWAGE} = \beta_1 + \beta_2\,\text{EDUC}(\mathbf{X}, \mathbf{Z}, \text{Motivation}, \dots) + \beta_3\,\text{EXP} + \beta_4\,\text{EXP}^2 + \dots + \varepsilon(\text{Motivation})$$
Variation in Z is associated with variation in EDUC(X, Z, Motivation, ...) and not ε(Motivation).
An effect due to the effect of variation in Z on EDUC will only be due to variation in EDUC. The estimate of $\beta_2$ picks up the effect of EDUC only.
Z is an Instrumental Variable.
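To see the instrument at work in the simplest case (one regressor, one instrument), the sketch below compares the least squares slope with the ratio Cov(z,y)/Cov(z,x). The simulated process is an assumption for illustration only: z shifts x but has no connection to the unobserved m that contaminates the disturbance, and the true slope is set to 1.0.

```python
import random

def cov(a, b):
    """Sample covariance (divisor n) of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def simulate_iv(n=20000, seed=2):
    """Hypothetical DGP: x = z + m + noise, y = 1.0*x + m + noise; m is unobserved."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        m = rng.gauss(0, 1)                    # unobserved; in x and in the disturbance
        zi = rng.gauss(0, 1)                   # instrument: relevant for x, unrelated to m
        xi = zi + m + rng.gauss(0, 1)
        yi = 1.0 * xi + m + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y

z, x, y = simulate_iv()
b_ols = cov(x, y) / cov(x, x)                  # inconsistent: picks up the effect of m
b_iv = cov(z, y) / cov(z, x)                   # IV: uses only the variation in x induced by z
print(f"OLS slope = {b_ols:.3f}, IV slope = {b_iv:.3f}, true slope = 1.0")
```

In this setup the OLS slope converges to 1 + 1/3 while the IV ratio converges to 1.0, at the cost of a larger sampling variance.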

Instrumental Variables
• My theory claims that MS and FEM are instruments
• Structural equations
  - LWAGE (ED, EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION)
  - ED (…, MS, FEM) ← Equation explains the endogeneity
Reduced Form:
LWAGE[ ED (…, MS, FEM), EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION ]

SNAP Model: X is in both equations. Z is in the SNAP equation. SNAP is in the HEALTH equation.

Instrumental Variables in Regression
• Typical Case: One “problem” variable – the “last” one
• yit = β1x1it + β2x2it + … + βKxKit + εit
• E[εit|x1it,…,xKit] ≠ 0. (0 for all others)
• There exists a variable zit such that
Relevance
• E[xKit| x1it, x2it,…, xK-1,it, zit] = g(x1it, x2it,…, xK-1,it, zit)
  In the presence of the other variables, zit “explains” xKit
• A projection interpretation: In the projection,
  xKit = θ1x1it + θ2x2it + … + θK-1xK-1,it + θK zit, θK ≠ 0.
Exogeneity
• E[εit| x1it, x2it,…, xK-1,it, zit] = 0
  In the presence of the other variables, zit and εit are uncorrelated.

Two Stage Least Squares Strategy
• Reduced Form:
  LWAGE[ ED (MS, FEM, X), EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION ]
• Strategy
  - (1) Purge ED of the influence of everything but MS, FEM and the other X variables. Predict ED using all exogenous information in the sample (X, MS, FEM).
  - (2) Regress LWAGE on this prediction of ED and everything else.
  - Standard errors must be adjusted for the predicted ED
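The two steps of the strategy can be sketched for the simplest case: a constant, one endogenous regressor x, and one excluded instrument z. The data are simulated under assumed parameters (true slope 1.0), and the helper `ols` is a hypothetical convenience function, not a routine from any package. With one instrument, the second-stage slope reproduces Cov(z,y)/Cov(z,x).

```python
import random

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def ols(x, y):
    """Intercept and slope from a simple regression of y on a constant and x."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

# Simulated data (assumed DGP): m is unobserved and makes x endogenous.
rng = random.Random(3)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
m = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + mi + rng.gauss(0, 1) for zi, mi in zip(z, m)]
y = [2.0 + 1.0 * xi + mi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

# Step (1): predict x from the exogenous information (here a constant and z).
a1, p1 = ols(z, x)
x_hat = [a1 + p1 * zi for zi in z]

# Step (2): regress y on the prediction.
a2, b_2sls = ols(x_hat, y)

b_iv = cov(z, y) / cov(z, x)                   # direct IV ratio for comparison
print(f"2SLS slope = {b_2sls:.4f}, IV ratio = {b_iv:.4f}")
```

The standard errors printed by an ordinary second-stage regression would not be the right ones, which is the adjustment the strategy refers to.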

OLS Regression (Inconsistent)

2SLS Regression (Maybe not a very good theory)
The 2SLS coefficient estimate is implausible. Now what?
The weird results for the coefficient on ED may be due to the instruments, MS and FEM, being dummy variables. There is not much variation in these variables and not much covariation with the other variables.

An Interpretation
The Source of the Endogeneity
• LWAGE = f(ED, EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION) + ε
• ED = f(MS, FEM, EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION) + u

Can We Remove the Endogeneity?
• LWAGE = f(ED, EXP, EXPSQ, WKS, OCC, SOUTH, SMSA, UNION) + u + ε
• Strategy
  - Estimate u
  - Add u to the equation.
• ED is correlated with u + ε because it is correlated with u.
• ED is uncorrelated with the remaining disturbance, ε, if u is in the equation.

Auxiliary Regression for ED to Obtain Residuals
[Regression output with callouts marking the IVs and the exogenous variables]

OLS with Residual Added (Control Function)
2SLS
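The equality of the two sets of coefficients can be reproduced in a small simulation. This is a sketch under assumed data (one endogenous regressor, one instrument, true slope 1.0); `ols2` is a hypothetical helper that solves the two-regressor normal equations in deviations from means. Adding the first-stage residual to the original regression returns the 2SLS slope on x.

```python
import random

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def ols2(x1, x2, y):
    """Slopes from OLS of y on a constant, x1 and x2."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated data (assumed DGP): m is unobserved and makes x endogenous.
rng = random.Random(4)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
m = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + mi + rng.gauss(0, 1) for zi, mi in zip(z, m)]
y = [2.0 + 1.0 * xi + mi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

# Auxiliary regression of x on (1, z): keep both the prediction and the residual.
p1 = cov(z, x) / cov(z, z)
a1 = mean(x) - p1 * mean(z)
x_hat = [a1 + p1 * zi for zi in z]
u_hat = [xi - xh for xi, xh in zip(x, x_hat)]

b_2sls = cov(x_hat, y) / cov(x_hat, x_hat)     # second stage of 2SLS
b_cf, c_u = ols2(x, u_hat, y)                  # OLS with the residual added

print(f"2SLS slope = {b_2sls:.4f}, control function slope = {b_cf:.4f}")
print(f"coefficient on u_hat = {c_u:.4f}")
```

The coefficient on the residual is nonzero here because x was built to be endogenous; that coefficient is what the regression based endogeneity test examines.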

A Warning About Control Functions
The sum of squares is not computed correctly because U is in the regression.
This is a general result: control function estimators usually require a correction to the estimated covariance matrix of the estimator.

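Estimating σ²
Estimating the asymptotic covariance matrix: a caution about estimating $\sigma^2$.
Since the regression is computed by regressing y on $\hat{\mathbf{x}}$, one might use
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\mathbf{x}}_i'\mathbf{b}_{2sls}\right)^2 \quad (\text{uses } \hat{\mathbf{x}})$$
This is inconsistent. Use
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i'\mathbf{b}_{2sls}\right)^2 \quad (\text{uses } \mathbf{x})$$
(Degrees of freedom correction is optional; usually done.)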
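The difference between the two variance estimators is easy to see numerically. The sketch below uses simulated data in which the structural disturbance has variance 2.0 by construction (an assumption of this example, as are all names and parameter values). It computes the 2SLS coefficients once and then forms residuals both ways.

```python
import random

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Simulated data (assumed DGP): disturbance is m + e, so its variance is 2.0.
rng = random.Random(5)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]
m = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + mi + rng.gauss(0, 1) for zi, mi in zip(z, m)]
y = [2.0 + 1.0 * xi + mi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

# First stage and 2SLS coefficients (constant, one regressor, one instrument).
p1 = cov(z, x) / cov(z, z)
a1 = mean(x) - p1 * mean(z)
x_hat = [a1 + p1 * zi for zi in z]
b = cov(x_hat, y) / cov(x_hat, x_hat)
a = mean(y) - b * mean(x_hat)

# Residuals built from the predicted regressor (what the second-stage OLS reports).
s2_from_xhat = sum((yi - a - b * xh) ** 2 for yi, xh in zip(y, x_hat)) / n
# Residuals built from the actual regressor (the consistent estimator).
s2_from_x = sum((yi - a - b * xi) ** 2 for yi, xi in zip(y, x)) / n

print(f"sigma^2 using x_hat = {s2_from_xhat:.3f}")
print(f"sigma^2 using x     = {s2_from_x:.3f}  (disturbance variance is 2.0 by construction)")
```

Only the version built from the actual x recovers the disturbance variance; the other one mixes in the first-stage residual.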

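Robust estimation of VC
Counterpart to the White estimator allows heteroscedasticity:
$$\text{Est.Asy.Var}[\hat{\boldsymbol{\beta}}] = (\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}\left[\sum_{i,t}\left(y_{it} - \mathbf{x}_{it}'\hat{\boldsymbol{\beta}}\right)^2 \hat{\mathbf{x}}_{it}\hat{\mathbf{x}}_{it}'\right](\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}$$
The residual $(y_{it} - \mathbf{x}_{it}'\hat{\boldsymbol{\beta}})$ uses the “actual” X. The outer products and $(\hat{\mathbf{X}}'\hat{\mathbf{X}})^{-1}$ use the “predicted” X.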

2SLS vs. Robust Standard Errors
+--------------------------------------------------+
| Robust Standard Errors |
+---------+--------------+----------------+--------+
|Variable | Coefficient | Standard Error |b/St.Er.|
+---------+--------------+----------------+--------+
B_1 45.4842872 4.02597121 11.298
B_2 .05354484 .01264923 4.233
B_3 -.00169664 .00029006 -5.849
B_4 .01294854 .05757179 .225
B_5 .38537223 .07065602 5.454
B_6 .36777247 .06472185 5.682
B_7 .95530115 .08681261 11.000
+--------------------------------------------------+
| 2SLS Standard Errors |
+---------+--------------+----------------+--------+
|Variable | Coefficient | Standard Error |b/St.Er.|
+---------+--------------+----------------+--------+
B_1 45.4842872 .36908158 123.236
B_2 .05354484 .03139904 1.705
B_3 -.00169664 .00069138 -2.454
B_4 .01294854 .16266435 .080
B_5 .38537223 .17645815 2.184
B_6 .36777247 .17284574 2.128
B_7 .95530115 .20846241 4.583

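Inference with IV Estimators
(1) Wald Statistics:
$$(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{q})'\left\{\mathbf{R}\,\text{Est.Asy.Var}[\hat{\boldsymbol{\beta}}]\,\mathbf{R}'\right\}^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{q})$$
(E.g., the usual 't-statistics')
(2) A type of F statistic:
Compute $SSUA = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_U)'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_U)$ without restrictions (Note, $\mathbf{X}$)
Compute $SSR = (\mathbf{y} - \hat{\mathbf{X}}\hat{\boldsymbol{\beta}}_R)'(\mathbf{y} - \hat{\mathbf{X}}\hat{\boldsymbol{\beta}}_R)$ with restrictions
Compute $SSU = (\mathbf{y} - \hat{\mathbf{X}}\hat{\boldsymbol{\beta}}_U)'(\mathbf{y} - \hat{\mathbf{X}}\hat{\boldsymbol{\beta}}_U)$ without restrictions (Note, $\hat{\mathbf{X}}$)
$$F = \frac{(SSR - SSU)/J}{SSUA/(N-K)} \sim F[J, N-K]$$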

Endogeneity Test? (Hausman)
        Exogenous                  Endogenous
OLS     Consistent, Efficient      Inconsistent
2SLS    Consistent, Inefficient    Consistent
Base a test on d = b2SLS - bOLS
Use a Wald statistic, d’[Var(d)]^(-1) d
What to use for the variance matrix?
Hausman: V2SLS - VOLS

Hausman Test

Hausman Test: One at a Time?

Endogeneity Test: Wu
• Considerable complication in Hausman test (text, pp. 276-277)
• Simplification: Wu test.
• Regress y on X and X̂, the estimated values for the endogenous part of X. Then use an ordinary Wald test.

Monday, 2/6/17

Regression Based Endogeneity Test
An easy t test. (Wooldridge 2010, p. 127)
$$y_{it} = \mathbf{x}_{it}'\boldsymbol{\delta} + \gamma q_{it} + \varepsilon_{it}$$
Z = a set of M instruments.
Write $\mathbf{q} = \mathbf{Z}\boldsymbol{\pi} + \mathbf{v}$. Can be estimated by ordinary least squares.
Endogeneity concerns correlation between v and ε.
Add $\hat{v}_{it} = q_{it} - \mathbf{z}_{it}'\hat{\boldsymbol{\pi}}$ to the equation and use OLS:
$$y_{it} = \mathbf{x}_{it}'\boldsymbol{\delta} + \gamma q_{it} + \rho\,\hat{v}_{it} + \{\text{error}\}$$
Simple t test on whether $\rho$ equals 0.
Even easier, algebraically identical (Wu, 1973): add $\hat{q}_{it}$ to the equation and do the same test.
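A compact version of this test can be sketched for one suspect regressor q, one instrument z, and a constant as the only other regressor. Everything here is assumed for illustration: the simulated process, the sample size, and the hypothetical helper `t_on_second_regressor`, which uses the conventional OLS standard error. The test is run twice, once with q built to be endogenous and once with q exogenous.

```python
import random

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def t_on_second_regressor(x1, x2, y):
    """OLS of y on a constant, x1 and x2; returns the t ratio of the x2 coefficient."""
    n = len(y)
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    ssr = sum((yi - b0 - b1 * a - b2 * b) ** 2 for yi, a, b in zip(y, x1, x2))
    s2 = ssr / (n - 3)
    # cov() divides by n, so the (2,2) element of (X'X)^-1 in deviations is s11/(n*det).
    se_b2 = (s2 * s11 / (n * det)) ** 0.5
    return b2 / se_b2

def endogeneity_t(endogenous, n=5000, seed=6):
    """t ratio on v_hat when q is (or is not) correlated with the disturbance."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    m = [rng.gauss(0, 1) for _ in range(n)]    # part of the disturbance of y
    q = [zi + (mi if endogenous else 0.0) + rng.gauss(0, 1) for zi, mi in zip(z, m)]
    y = [1.0 + qi + mi + rng.gauss(0, 1) for qi, mi in zip(q, m)]
    pi_hat = cov(z, q) / cov(z, z)             # first stage: q on (1, z)
    a_hat = mean(q) - pi_hat * mean(z)
    v_hat = [qi - a_hat - pi_hat * zi for qi, zi in zip(q, z)]
    return t_on_second_regressor(q, v_hat, y)

t_endog = endogeneity_t(True)
t_exog = endogeneity_t(False)
print(f"t on v_hat, endogenous q: {t_endog:.2f}")
print(f"t on v_hat, exogenous q:  {t_exog:.2f}")
```

A large t ratio in the first case and a small one in the second is the pattern the test is designed to detect.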

Wu Test
Since this is 2SLS using a control function, the standard errors should have
been adjusted to carry out this test. (The sum of squares is too small.)

Testing Endogeneity of WKS
(1) Regress WKS on 1,EXP,EXPSQ,OCC,SOUTH,SMSA,MS.
U=residual, WKSHAT=prediction
(2) Regress LWAGE on 1,EXP,EXPSQ,OCC,SOUTH,SMSA,WKS, U or WKSHAT
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -9.97734299 .75652186 -13.188 .0000
EXP .01833440 .00259373 7.069 .0000 19.8537815
EXPSQ -.799491D-04 .603484D-04 -1.325 .1852 514.405042
OCC -.28885529 .01222533 -23.628 .0000 .51116447
SOUTH -.26279891 .01439561 -18.255 .0000 .29027611
SMSA .03616514 .01369743 2.640 .0083 .65378151
WKS .35314170 .01638709 21.550 .0000 46.8115246
U -.34960141 .01642842 -21.280 .0000 -.341879D-14
+---------+--------------+----------------+--------+---------+----------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|
+---------+--------------+----------------+--------+---------+----------+
Constant -9.97734299 .75652186 -13.188 .0000
EXP .01833440 .00259373 7.069 .0000 19.8537815
EXPSQ -.799491D-04 .603484D-04 -1.325 .1852 514.405042
OCC -.28885529 .01222533 -23.628 .0000 .51116447
SOUTH -.26279891 .01439561 -18.255 .0000 .29027611
SMSA .03616514 .01369743 2.640 .0083 .65378151
WKS .00354028 .00116459 3.040 .0024 46.8115246
WKSHAT .34960141 .01642842 21.280 .0000 46.8115246

Weak Instruments
• One endogenous variable: y = Xβ + xkγ + ε; Instruments (X,Z); Z is exogenous
• Symptom: The relevance condition, plim Z’X/n not zero, is close to being violated. Relevance: Z must “explain” xk after controlling for X.
• Detection:
  - Standard F test in the regression of xk on (X,Z). Wald test of coefficients on Z equal 0; F < 10 suggests a problem. (Staiger and Stock)
  - Other versions by Stock and Yogo (2005), Cragg and Donald (1993), Kleibergen and Paap (2006) for when xk is more than one variable and for when xk = (X,Z)π + u and u is heteroscedastic, clustered, nonnormal, etc.
• Remedy:
  - Not much – most of the discussion is about the condition, not what to do about it.
  - Use LIML instead of 2SLS? Requires a normality assumption. Probably

Weak Instruments (cont.)
$$\text{plim}\,\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + [\text{Cov}(\mathbf{Z}, \mathbf{X})]^{-1}\,\text{Cov}(\mathbf{Z}, \boldsymbol{\varepsilon})$$
If Cov(Z, ε) is "small" but nonzero, small Cov(Z, X) may hugely magnify the effect.
IV is not only inefficient, it may be very badly biased by "weak" instruments.
Solutions? Can one "test" for weak instruments?
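One diagnostic for this question is the first-stage F statistic. The sketch below computes it for the simplest case of a single excluded instrument and no other regressors, where F is the squared t ratio on z in the regression of x on (1, z). The simulated data and the two "strength" values are assumptions chosen only to contrast a strong and a weak instrument.

```python
import random

def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def first_stage_f(z, x):
    """F statistic for H0: coefficient on z is 0 in the regression of x on (1, z)."""
    n = len(x)
    pi_hat = cov(z, x) / cov(z, z)
    a_hat = mean(x) - pi_hat * mean(z)
    ssr = sum((xi - a_hat - pi_hat * zi) ** 2 for xi, zi in zip(x, z))
    s2 = ssr / (n - 2)
    var_pi = s2 / (n * cov(z, z))              # conventional OLS variance of pi_hat
    return pi_hat ** 2 / var_pi                # with one restriction, F = t^2

def simulate_first_stage(strength, n=2000, seed=7):
    """Hypothetical first stage: x = strength*z + m + noise."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [strength * zi + rng.gauss(0, 1) + rng.gauss(0, 1) for zi in z]
    return z, x

f_strong = first_stage_f(*simulate_first_stage(1.0))
f_weak = first_stage_f(*simulate_first_stage(0.02))
print(f"first-stage F, strong instrument: {f_strong:.1f}")
print(f"first-stage F, weak instrument:   {f_weak:.1f}")
```

With several instruments or additional exogenous regressors the same idea applies, but the F statistic has to come from the full first-stage regression rather than this one-variable shortcut.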

Weak Instruments
Which is better?
LS is inconsistent, but probably has smaller variance. LS may be more precise.
IV is consistent, but probably has larger variance.
$$\text{Asy.Var}[\hat{\boldsymbol{\beta}}] = \mathbf{Q}_{ZX}^{-1}\,\boldsymbol{\Omega}_{ZZ}\,\mathbf{Q}_{XZ}^{-1}$$
$\mathbf{Q}_{ZX}^{-1}$ may be large. (Compared to what?)
Strange results with "small" $\mathbf{Q}_{ZX}$: the IV estimator tends to resemble OLS (bias) (not a function of sample size).
Contradictory result. Suppose z is perfectly correlated with x. IV MUST be the same as OLS.

Endogenous Union Effect
Name ; Xunion = one,occ,smsa,ed,exp,union $
Name ; Zinst = fem,ind,south $
Name ; Zunion = one,occ,smsa,ed,exp,Zinst $
? Inconsistent OLS
Regr ; Lhs = lwage ; Rhs = Xunion ; Cluster = 7 $
? Two Stage Least Squares gives a nonsense result
2sls ; Lhs=lwage;rhs=Xunion;Inst=Zunion ; Cluster=7 $
? Test for weak instruments
Regr ; Lhs=union ; Rhs=Zunion ; res = u
; Cluster = 7 ; test:zinst $
? Control function estimator
? 2SLS coefficients with the wrong standard errors
Regr ; Lhs=lwage;rhs=xunion,u;cluster=7$

OLS Should Be Inconsistent

Nonsense 2SLS Result

Weak Instruments? No
What is going on here? When the endogenous variable and/or the excluded
instruments are binary, the actual results are sometimes a bit unstable. The
theoretical results are generally about covariation of continuous variables.

Appendix
Miscellaneous

The First IV Study Was a Natural Experiment
(Snow, J., On the Mode of Communication of Cholera, 1855)
http://www.ph.ucla.edu/epi/snow/snowbook3.html
• London Cholera epidemic, ca 1853-4
• Cholera = f(Water Purity, u) + ε.
• ‘Causal’ effect of water purity on cholera?
• Purity = f(cholera prone environment (poor, garbage in streets, rodents, etc.)). Regression does not work.
[Figure: Two London water companies, Lambeth and Southwark & Vauxhall; River Thames; main sewage discharge]
Paul Grootendorst: A Review of Instrumental Variables Estimation of Treatment Effects…
http://individual.utoronto.ca/grootendorst/pdf/IV_Paper_Sept6_2007.pdf
A review of instrumental variables estimation in the applied health sciences. Health Services and Outcomes Research Methodology 2007; 7(3-4):159-179.

Investigation Using an Instrumental Variable
Theory: $\text{Cholera} = \beta_0 + \beta_1\,\text{BadWater} + \text{Other Factors}$
Model: $C = \beta_0 + \beta_1 B + \varepsilon$ (Stylized)
(C = 0/1 = no/yes) (B = 0/1 = good/bad) (ε = other factors)
Interesting measure of causal effect of bad water: $\beta_1$
Endogeneity Problem: Cholera prone environment u affects B and ε. Interpret this to say B(u) and ε(u) are correlated because of u.
Confounding Effect: $E[C \mid B] \neq \beta_0 + \beta_1 B$ because $E[\varepsilon \mid B] \neq 0$
$$E[C \mid B=1] = \beta_0 + \beta_1 + E[\varepsilon \mid B=1]$$
$$E[C \mid B=0] = \beta_0 + E[\varepsilon \mid B=0]$$
$$E[C \mid B=1] - E[C \mid B=0] = \beta_1 + \{E[\varepsilon \mid B=1] - E[\varepsilon \mid B=0]\}$$
Conclusion: Comparing cholera rates of those with bad water (measurable) to those with good water, P(C|B=1) - P(C|B=0), does not reveal the water effect.

Instrumental Variable:
L = 1 if water supplied by Lambeth
L = 0 if water supplied by Southwark/Vauxhall
Relevant? Is $E[B \mid L=1] \neq E[B \mid L=0]$? That is Snow's theory, that the water supply is partly the culprit, and because of their location, Lambeth provided purer water than Southwark.
Exogenous? Is $E[\varepsilon \mid L=1] - E[\varepsilon \mid L=0] = 0$? Water supply is randomly supplied to houses. Homeowners do not even know which supplier is providing their water. "Assignment is random."

Using the IV
Estimating Equation: $E[C \mid L] = \beta_0 + \beta_1 E[B \mid L] + E[\varepsilon \mid L]$:
$$E[C \mid L=1] = \beta_0 + \beta_1 E[B \mid L=1] + E[\varepsilon \mid L=1]$$
$$E[C \mid L=0] = \beta_0 + \beta_1 E[B \mid L=0] + E[\varepsilon \mid L=0]$$
$$E[C \mid L=1] - E[C \mid L=0] = \beta_1\{E[B \mid L=1] - E[B \mid L=0]\} + \{E[\varepsilon \mid L=1] - E[\varepsilon \mid L=0]\}$$
(The last term is zero because L is exogenous.)
IV Estimator:
$$E[C \mid L=1] - E[C \mid L=0] = \beta_1\{E[B \mid L=1] - E[B \mid L=0]\}$$
$$\beta_1 = \frac{E[C \mid L=1] - E[C \mid L=0]}{E[B \mid L=1] - E[B \mid L=0]}$$
(Note: nonzero denominator is the relevance condition.)
Operational:
P(C|L=1) = Proportion of observations supplied by Lambeth that have Cholera
P(C|L=0) = Proportion of observations supplied by Southwark that have Cholera
P(B|L=1) = Proportion of observations supplied by Lambeth with Bad Water
P(B|L=0) = Proportion of observations supplied by Southwark with Bad Water
Estimate:
$$b_1 = \frac{P(C \mid L=1) - P(C \mid L=0)}{P(B \mid L=1) - P(B \mid L=0)} = \frac{\text{Cov}(C, L)}{\text{Cov}(B, L)} \ \text{(broadly)}$$
(The Wald estimator)
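The Wald estimator and its covariance form can be verified on a tiny table. The counts below are invented for illustration and are not Snow's data: 10 households per supplier, with made-up numbers of bad-water and cholera cases. The helper names are hypothetical.

```python
def mean(a):
    return sum(a) / len(a)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def cond_mean(v, g, value):
    """Mean of v over the observations where g equals value."""
    selected = [vi for vi, gi in zip(v, g) if gi == value]
    return sum(selected) / len(selected)

# Hypothetical records, NOT Snow's data.
# L = 1 (Lambeth): 10 households, 2 with bad water, 1 with cholera.
# L = 0 (Southwark/Vauxhall): 10 households, 8 with bad water, 4 with cholera.
L = [1] * 10 + [0] * 10
B = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
C = [1] * 1 + [0] * 9 + [1] * 4 + [0] * 6

# Wald estimator: difference in cholera rates over difference in bad-water rates.
wald = (cond_mean(C, L, 1) - cond_mean(C, L, 0)) / (cond_mean(B, L, 1) - cond_mean(B, L, 0))

# Equivalent covariance ratio, Cov(C,L) / Cov(B,L).
cov_ratio = cov(C, L) / cov(B, L)

print(f"Wald estimator   = {wald:.4f}")
print(f"Cov(C,L)/Cov(B,L) = {cov_ratio:.4f}")
```

For these counts the numerator is 0.1 - 0.4 and the denominator is 0.2 - 0.8, so both expressions evaluate to 0.5 up to floating-point rounding.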

On Sat, May 3, 2014 at 4:48 PM, … wrote:
Dear Professor Greene,
I am teaching an Econometrics course in Brazil and we are using
your textbook. I have a question that I think only you can help
me with. In our last class, I did a formal proof that
var(beta_hat_OLS) is less than or equal to var(beta_hat_2SLS),
under homoscedasticity.
We know this assertion is also valid under heteroscedasticity,
but a graduate student asked me for the proof (which is my
problem).
Do you know where I can find it?
