Lecture 1: Review of Linear Mixed Models
Shravan Vasishth
Department of Linguistics
University of Potsdam, Germany
October 7, 2014
1 / 42
Lecture 1: Review of Linear Mixed Models
Introduction
Motivating this course
In psycholinguistics, we usually use frequentist methods to
analyze our data.
The most common tool I use is lme4.
In recent years, very powerful programming languages have
become available that make Bayesian modeling relatively easy.
Bayesian tools have several important advantages over
frequentist methods, but they require some very specific
background knowledge.
My goal in this course is to try to provide that background
knowledge.
Note: I will teach a more detailed (2-week) course in ESSLLI in
Barcelona, Aug 3-14, 2015.
2 / 42
Lecture 1: Review of Linear Mixed Models
Introduction
Motivating this course
In this introductory lecture, my goals are to
1 motivate you to look at Bayesian Linear Mixed Models as an
alternative to using frequentist methods.
2 make sure we are all on the same page regarding the moving
parts of a linear mixed model.
3 / 42
Lecture 1: Review of Linear Mixed Models
Preliminaries
Prerequisites
1 Familiarity with fitting standard LMMs such as:
lmer(rt~cond+(1+cond|subj)+(1+cond|item),dat)
2 Basic knowledge of R.
4 / 42
Lecture 1: Review of Linear Mixed Models
Preliminaries
A bit about my background
1999: Discovered the word "ANOVA" in a chance conversation
with a psycholinguist.
1999: Did my first self-paced reading experiment. Fit a
repeated measures ANOVA.
2000: Went to statisticians at Ohio State’s statistical
consulting unit, and they said: "Why are you fitting ANOVA?
You need linear mixed models."
2000-2010: Kept fitting and publishing linear mixed models.
2010: Realized I didn’t really know what I was doing, and
started a part-time MSc in Statistics at Sheffield’s School of
Math and Statistics (2011-2015).
5 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
Linear mixed models
Example: Gibson and Wu data, Language and Cognitive Processes, 2012
6 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
Linear mixed models
Example: Gibson and Wu data, Language and Cognitive Processes, 2012
Subject vs. object relative clauses in Chinese, self-paced
reading.
The critical region is the head noun.
The goal is to find out whether subject relatives (SRs) are harder
to process than object relatives (ORs) at the head noun.
7 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
Linear mixed models
Example: Gibson and Wu 2012 data
> head(data[,c(1,2,3,4,7,10,11)])
subj item type pos rt rrt x
7 1 13 obj-ext 6 1140 -0.8771930 0.5
20 1 6 subj-ext 6 1197 -0.8354219 -0.5
32 1 5 obj-ext 6 756 -1.3227513 0.5
44 1 9 obj-ext 6 643 -1.5552100 0.5
60 1 14 subj-ext 6 860 -1.1627907 -0.5
73 1 4 subj-ext 6 868 -1.1520737 -0.5
8 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
lme4 model of Gibson and Wu data
Crossed varying intercepts and slopes model, with correlation
This is the type of "maximal" model that most people fit nowadays
(citing Barr et al. 2012):
> m1 <- lmer(rrt~x+(1+x|subj)+(1+x|item),
+ subset(data,region=="headnoun"))
I will now show two major (related) problems that occur with the
small datasets we usually have in psycholinguistics:
The correlation estimates lead to degenerate variance-covariance
matrices, and/or
the correlation estimates are wild, bearing no relation to reality.
[This is not a failing of lmer, but rather that the user is
demanding too much of lmer.]
9 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
lme4 model of Gibson and Wu data
Typical data analysis: Crossed varying intercepts and slopes model, with correlation
> summary(m1)
Linear mixed model fit by REML [lmerMod]
Formula: rrt ~ x + (1 + x | subj) + (1 + x | item)
Data: subset(data, region == "headnoun")
REML criterion at convergence: 1595.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.5441 -0.6430 -0.1237 0.5996 3.2501
Random effects:
Groups Name Variance Std.Dev. Corr
subj (Intercept) 0.371228 0.60928
x 0.053241 0.23074 -0.51
item (Intercept) 0.110034 0.33171
x 0.009218 0.09601 1.00
Residual 0.891577 0.94423
Number of obs: 547, groups: subj, 37; item, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.67151 0.13793 -19.369
x -0.07758 0.09289 -0.835
Correlation of Fixed Effects:
(Intr)
x 0.012
10 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
The "best" model
The way to decide on the "best" model is to find the simplest
adequate model using the Generalized Likelihood Ratio Test (Pinheiro
and Bates 2000). Here, this is the varying intercepts model, not the
maximal model.
> m1<- lmer(rrt~x+(1+x|subj)+(1+x|item),
+ headnoun)
> m1a<- lmer(rrt~x+(1|subj)+(1|item),
+ headnoun)
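Here headnoun is assumed to be the head-noun subset used for m1 above, i.e., something like:
> headnoun <- subset(data, region=="headnoun")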
11 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
The "best" model
> anova(m1,m1a)
Data: headnoun
Models:
m1a: rrt ~ x + (1 | subj) + (1 | item)
m1: rrt ~ x + (1 + x | subj) + (1 + x | item)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
m1a 5 1603.5 1625.0 -796.76 1593.5
m1 9 1608.5 1647.3 -795.27 1590.5 2.9742 4 0.5622
12 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
Here, we simulate data with the same structure, sample size, and
parameter values as the Gibson and Wu data, except that we
assume that the correlations are 0.6. Then we analyze the data
using lmer (maximal model). Can lmer recover the
correlations?
13 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
We define a function called new.df that generates data similar to
the Gibson and Wu data-set. For code, see accompanying .R file.
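Since the .R file is not reproduced here, the following is a minimal sketch of what such a data-generating function might look like. The fixed-effect values, variance components, and the condition-assignment scheme below are illustrative assumptions, not the values used in the lecture; the real new.df also returns a data frame with more columns, which is why gendata (next slide) selects columns by index.
> library(MASS)  # for mvrnorm
> new.df <- function(nsubj = 37, nitems = 15, rho.u = 0.6, rho.w = 0.6,
+                    beta = c(6.5, -0.05),   # fixed effects (assumed values)
+                    sigma.u = c(0.6, 0.2),  # subject intercept/slope SDs (assumed)
+                    sigma.w = c(0.3, 0.1),  # item intercept/slope SDs (assumed)
+                    sigma.e = 0.9) {        # residual SD (assumed)
+   ## variance-covariance matrices for by-subject and by-item effects
+   Sigma.u <- diag(sigma.u) %*% matrix(c(1, rho.u, rho.u, 1), 2) %*% diag(sigma.u)
+   Sigma.w <- diag(sigma.w) %*% matrix(c(1, rho.w, rho.w, 1), 2) %*% diag(sigma.w)
+   u <- mvrnorm(nsubj, c(0, 0), Sigma.u)    # by-subject adjustments
+   w <- mvrnorm(nitems, c(0, 0), Sigma.w)   # by-item adjustments
+   dat <- expand.grid(subj = 1:nsubj, item = 1:nitems)
+   ## each subject sees each item in one condition (Latin-square-like assignment)
+   dat$cond <- ifelse((dat$subj + dat$item) %% 2 == 0, 1, 2)
+   x <- ifelse(dat$cond == 1, -0.5, 0.5)
+   dat$rt <- (beta[1] + u[dat$subj, 1] + w[dat$item, 1]) +
+     (beta[2] + u[dat$subj, 2] + w[dat$item, 2]) * x +
+     rnorm(nrow(dat), 0, sigma.e)
+   list(dat)                                # gendata() uses the first list element
+ }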
14 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
Next, we write a function that repeatedly generates data with the
following specifications: the sample sizes for subjects and
items, and the correlations between the subject intercept and slope
and between the item intercept and slope.
> gendata<-function(subjects=37,items=15){
+ dat<-new.df(nsubj=subjects,nitems=items,
+ rho.u=0.6,rho.w=0.6)
+ dat <- dat[[1]]
+ dat<-dat[,c(1,2,3,9)]
+ dat$x<-ifelse(dat$cond==1,-0.5,0.5)
+
+ return(dat)
+ }
15 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
Set number of simulations:
> nsim<-100
Next, we generate simulated data 100 times, store the
estimated subject- and item-level correlations of the random effects,
and plot their distributions.
We do this for two settings: Gibson and Wu sample sizes (37
subjects, 15 items), and 50 subjects and 30 items.
16 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
37 subjects and 15 items
> library(lme4)
> subjcorr<-rep(NA,nsim)
> itemcorr<-rep(NA,nsim)
> for(i in 1:nsim){
+ dat<-gendata()
+ m3<-lmer(rt~x+(1+x|subj)+(1+x|item),dat)
+ subjcorr[i]<-attr(VarCorr(m3)$subj,"correlation")[1,2]
+ itemcorr[i]<-attr(VarCorr(m3)$item,"correlation")[1,2]
+ }
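The plotting code is not shown on the slides; a sketch that would produce the two histograms on the next slide:
> op <- par(mfrow = c(1, 2))
> hist(subjcorr, freq = FALSE, main = "Distribution of subj. corr.",
+      xlab = expression(hat(rho)[u]), ylab = "Density")
> hist(itemcorr, freq = FALSE, main = "Distribution of item corr.",
+      xlab = expression(hat(rho)[w]), ylab = "Density")
> par(op)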
17 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
[Figure: two histograms of the estimated correlations over the 100
simulations. Left: distribution of the subject-level correlation ρ̂u
(roughly −0.2 to 1.0). Right: distribution of the item-level correlation
ρ̂w (roughly −1.0 to 1.0). y-axes: Density.]
18 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
50 subjects and 30 items
> subjcorr<-rep(NA,nsim)
> itemcorr<-rep(NA,nsim)
> for(i in 1:nsim){
+ #print(i)
+ dat<-gendata(subjects=50,items=30)
+ m3<-lmer(rt~x+(1+x|subj)+(1+x|item),dat)
+ subjcorr[i]<-attr(VarCorr(m3)$subj,"correlation")[1,2]
+ itemcorr[i]<-attr(VarCorr(m3)$item,"correlation")[1,2]
+ }
19 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
[Figure: the same two histograms for 50 subjects and 30 items.
Left: ρ̂u (roughly 0.2 to 1.0). Right: ρ̂w (roughly 0.0 to 0.8).
y-axes: Density.]
20 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
Conclusion:
1 It seems that lmer can estimate the correlation parameters only
when the sample sizes for subjects and items are "large enough"
(this can be established using simulation, as done above).
2 Barr et al’s recommendation to fit a maximal model makes
sense as a general rule only if it’s already clear that we have
enough data to estimate all the variance components and
parameters.
3 In my experience, that is rarely the case, at least in
psycholinguistics, especially when we move to designs more complex
than a simple two-condition study.
21 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
Keep it maximal?
Gelman and Hill (2007, p. 549) make a more tempered
recommendation than Barr et al:
Don’t get hung up on whether a coefficient "should" vary
by group. Just allow it to vary in the model, and then, if
the estimated scale of variation is small . . . , maybe you
can ignore it if that would be more convenient. Practical
concerns sometimes limit the feasible complexity of a
model–for example, we might fit a varying-intercept
model first, then allow slopes to vary, then add
group-level predictors, and so forth. Generally, however,
it is only the difficulties of fitting and, especially,
understanding the models that keeps us from adding even
more complexity, more varying coefficients, and more
interactions.
22 / 42
Lecture 1: Review of Linear Mixed Models
Why fit Bayesian LMMs?
Advantages of fitting a Bayesian LMM
1 For such data, there can be situations where you really need
or want to fit full variance-covariance matrices for the random
effects. Bayesian LMMs will let you fit them even in cases
where lmer would fail to converge or return nonsensical
estimates (due to too little data).
The way we will set them up, Bayesian LMMs will typically
underestimate correlation, unless there is enough data.
2 A direct answer to the research question can be obtained by
examining the posterior distribution given data.
3 We can avoid the traditional hard binary decision associated
with frequentist methods: p < 0.05 implies reject null, and
p > 0.05 implies "accept" null. We are more interested in
quantifying our uncertainty about the scientific claim.
4 Prior knowledge can be included in the model.
23 / 42
Lecture 1: Review of Linear Mixed Models
Why fit Bayesian LMMs?
Disadvantages of doing a Bayesian analysis
You have to invest effort into specifying a model; unlike lmer,
which involves a single line of code, JAGS and Stan model
specifications can extend to 20-30 lines.
A lot of decisions have to be made.
There is a steep learning curve; you have to know a bit about
probability distributions, MCMC methods, and of course
Bayes’ Theorem.
It takes much more time to fit a complicated model in a
Bayesian setting than with lmer.
But I will try to demonstrate to you in this course that it’s worth
the effort, especially when you don’t have a lot of data (usually the
case in psycholinguistics).
24 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models
yi = β0 + β1 xi + εi    (1)
(yi: response; β0, β1: parameters; xi: predictor; εi: error)
where
εi is the residual error, assumed to be normally distributed:
εi ∼ N(0,σ2).
Each response yi (i = 1, . . . , I) is independently distributed
as yi ∼ N(β0 + β1 xi , σ2).
Point values for parameters: β0 and β1 are the parameters
to be estimated. In the frequentist setting, these are point
values; they have no distribution.
Hypothesis test: Usually, β1 is the parameter of interest; in
the frequentist setting, we test the null hypothesis that β1 = 0.
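To connect equation (1) to R code, a sketch of fitting this simple (non-mixed) linear model, assuming headnoun is the head-noun data frame used earlier (and ignoring the repeated-measures structure):
> m0 <- lm(rrt ~ x, data = headnoun)
> summary(m0)
The (Intercept) row estimates β0; the row for x estimates β1, and its t value tests the null hypothesis that β1 = 0.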
25 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Repeated measures data
Linear mixed models are useful for correlated data (e.g.,
repeated measures) where the responses y are not
independently distributed.
A key difference from linear models is that the intercept
and/or slope vary by subject j = 1,...,J (and possibly also by
item k = 1,...,K):
yi = [β0 + u0j + w0k] + [β1 + u1j + w1k] xi + εi    (2)
(first bracket: varying intercepts; second bracket: varying slopes;
xi: predictor; εi: error)
26 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation
yi = [β0 + u0j + w0k] + [β1 + u1j + w1k] xi + εi    (3)
(first bracket: varying intercepts; second bracket: varying slopes)
This is the "maximal" model we saw earlier:
> m1 <- lmer(rrt~x+(1+x|subj)+(1+x|item),
+ headnoun)
27 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation
> summary(m1)
Linear mixed model fit by REML [lmerMod]
Formula: rrt ~ x + (1 + x | subj) + (1 + x | item)
Data: headnoun
REML criterion at convergence: 1595.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.5441 -0.6430 -0.1237 0.5996 3.2501
Random effects:
Groups Name Variance Std.Dev. Corr
subj (Intercept) 0.371228 0.60928
x 0.053241 0.23074 -0.51
item (Intercept) 0.110034 0.33171
x 0.009218 0.09601 1.00
Residual 0.891577 0.94423
Number of obs: 547, groups: subj, 37; item, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.67151 0.13793 -19.369
x -0.07758 0.09289 -0.835
Correlation of Fixed Effects:
(Intr)
x 0.012
28 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation
rrti = (β0 + u0j + w0k) + (β1 + u1j + w1k) xi + εi    (4)
1 i = 1, . . . , 547 data points; j = 1, . . . , 37 subjects; k = 1, . . . , 15 items.
2 xi is coded −0.5 (SR) and 0.5 (OR).
3 εi ∼ N(0, σ2).
4 u0j ∼ N(0, σ2u0) and u1j ∼ N(0, σ2u1).
5 w0k ∼ N(0, σ2w0) and w1k ∼ N(0, σ2w1).
with a multivariate normal distribution for the varying slopes and
intercepts
(u0j, u1j) ∼ N(0, Σu),   (w0k, w1k) ∼ N(0, Σw)    (5)
(here 0 is the 2 × 1 vector of zeros)
29 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
The variance components associated with subjects
Random effects:
Groups Name Variance Std.Dev. Corr
subj (Intercept) 0.37 0.61
x 0.05 0.23 -0.51
Σu = ( σ2u0          ρu σu0 σu1 )
     ( ρu σu0 σu1    σ2u1       )

   = ( 0.61^2                  −0.51 × 0.61 × 0.23 )
     ( −0.51 × 0.61 × 0.23     0.23^2              )    (6)
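As a quick check, Σu can be reconstructed in R from the printed estimates (a sketch; m1 is the model fitted earlier, values rounded as on the slide):
> sds <- c(0.61, 0.23)  # SDs of the subject intercept and slope
> rho <- -0.51          # intercept-slope correlation
> diag(sds) %*% matrix(c(1, rho, rho, 1), 2, 2) %*% diag(sds)
lme4 can also return this matrix directly via VarCorr(m1)$subj.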
30 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
The variance components associated with items
Random effects:
Groups Name Variance Std.Dev. Corr
item (Intercept) 0.11 0.33
x 0.01 0.10 1.00
Note the by-items intercept-slope correlation of +1.00.
Σw = ( σ2w0          ρw σw0 σw1 )
     ( ρw σw0 σw1    σ2w1       )

   = ( 0.33^2            1 × 0.33 × 0.10 )
     ( 1 × 0.33 × 0.10   0.10^2          )    (7)
31 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
The variance components of the "maximal" linear mixed
model
RTi = β0 + u0j + w0k + (β1 + u1j + w1k) xi + εi    (8)
(u0j, u1j) ∼ N(0, Σu),   (w0k, w1k) ∼ N(0, Σw)    (9)
εi ∼ N(0, σ2)    (10)
The parameters are β0, β1, Σu, Σw, σ. Each of the matrices Σ has
three parameters (two variances and one correlation), so we have nine
parameters in total.
32 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Summary so far
Linear mixed models allow us to take all relevant variance
components into account, and to describe how the data were
generated.
However, maximal models should not be fit blindly, especially
when there is not enough data to estimate the parameters.
For small datasets we often see degenerate variance-covariance
estimates (with correlations of ±1). Many
psycholinguists ignore this degeneracy.
If one cares about the correlation, one should not ignore the
degeneracy.
33 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The frequentist approach
1 In the frequentist setting, we start with a dependent measure
y, for which we assume a probability model.
2 In the above example, we have reading time data, rt, which
we assume is generated from a normal distribution with some
mean µ and variance σ2; we write this rt ∼ N(µ,σ2).
3 Given a particular set of parameter values µ and σ2, we could
state the probability distribution of rt given the parameters.
We can write this as p(rt | µ,σ2).
34 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The frequentist approach
1 In reality, we know neither µ nor σ2. The goal of fitting a
model to such data is to estimate the two parameters, and
then to draw inferences about what the true value of µ is.
2 The frequentist method relies on the fact that, under repeated
sampling and with a large enough sample size, the sampling
distribution of the sample mean X̄ is N(µ, σ2/n).
3 The standard method is to use the sample mean x̄ as an
estimate of µ; given a large enough sample size n, we can
compute an approximate 95% confidence interval
x̄ ± 2 × √(σ̂2/n).
35 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The frequentist approach
The 95% confidence interval has a slightly complicated
interpretation:
If we were to repeatedly carry out the experiment and compute a
confidence interval each time using the above procedure, 95% of
those confidence intervals would contain the true parameter value
µ (assuming, of course, that all our model assumptions are
satisfied).
The particular confidence interval we calculated for our particular
sample does not give us a range such that we are 95% certain that
the true µ lies within it, although this is how most users of
statistics seem to (mis)interpret the confidence interval.
36 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The 95% CI
[Figure: 95% CIs in 100 repeated samples; x-axis: sample index (1-100),
y-axis: Scores (roughly 56-64).]
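A sketch of the simulation behind such a figure (the true mean, SD, and sample size below are illustrative assumptions, not the values used for the slide):
> set.seed(1)
> mu <- 60; sigma <- 5; n <- 100; nrep <- 100
> lower <- upper <- numeric(nrep)
> for (i in 1:nrep) {
+   x <- rnorm(n, mu, sigma)
+   se <- sd(x) / sqrt(n)
+   lower[i] <- mean(x) - 2 * se
+   upper[i] <- mean(x) + 2 * se
+ }
> plot(1:nrep, rep(mu, nrep), type = "n", ylim = range(lower, upper),
+      xlab = "sample", ylab = "Scores", main = "95% CIs in 100 repeated samples")
> segments(1:nrep, lower, 1:nrep, upper,
+          col = ifelse(lower > mu | upper < mu, "red", "black"))
> abline(h = mu, lty = 2)  # true mean; about 5% of intervals should miss it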
37 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
1 The Bayesian approach starts with a probability model that
defines our prior beliefs about the possible values that the
parameters µ and σ2 might have.
2 This probability model expresses what we know so far about
these two parameters (we may not know much, but in
practical situations, it is not the case that we don’t know
anything about their possible values).
3 Given this prior distribution, the probability model p(y | µ,σ2)
and the data y allow us to compute the probability
distribution of the parameters given the data, p(µ,σ2 | y).
4 This probability distribution, called the posterior distribution,
is what we use for inference.
38 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
1 Unlike the 95% confidence interval, we can define a 95%
credible interval that represents the range within which we are
95% certain that the true value of the parameter lies, given
the data at hand.
2 Note that in the frequentist setting, the parameters are point
values: µ is assumed to have a particular value in nature.
3 In the Bayesian setting, µ is a random variable with a
probability distribution; it has a mean, but there is also some
uncertainty associated with its true value.
39 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
Bayes’ theorem makes it possible to derive the posterior distribution
given the prior and the data. The conditional probability rule in
probability theory (see Kerns) is that the joint distribution of two
random variables p(θ,y) is equal to p(θ | y)p(y). It follows that:
p(θ,y) = p(θ | y)p(y)
        = p(y,θ)           (because p(θ,y) = p(y,θ))
        = p(y | θ)p(θ).    (11)
The first and third lines in the equalities above imply that
p(θ | y)p(y) = p(y | θ)p(θ). (12)
40 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
Dividing both sides by p(y) we get:
p(θ | y) = p(y | θ)p(θ) / p(y)    (13)
The term p(y | θ) is the probability of the data given θ. If we treat
this as a function of θ, we have the likelihood function.
Since p(θ | y) is the posterior distribution of θ given y, and
p(y | θ) the likelihood, and p(θ) the prior, the following
relationship is established:
Posterior ∝ Likelihood×Prior (14)
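As a preview of what this proportionality means in practice, here is a tiny grid-approximation sketch in R (the binomial example and the Beta(4,4) prior are illustrative, not from the lecture):
> theta <- seq(0, 1, length.out = 1000)             # grid of parameter values
> prior <- dbeta(theta, 4, 4)                       # prior on theta
> likelihood <- dbinom(7, size = 10, prob = theta)  # data: 7 successes in 10 trials
> posterior <- prior * likelihood
> posterior <- posterior / sum(posterior)           # normalizing plays the role of p(y)
> theta[which.max(posterior)]                       # posterior mode, approximately 0.625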
41 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
Posterior ∝ Likelihood×Prior (15)
We ignore the denominator p(y) here because it only serves as a
normalizing constant that renders the left-hand side (the posterior)
a probability distribution.
The above is Bayes’ theorem, and is the basis for determining the
posterior distribution given a prior and the likelihood.
The rest of this course simply unpacks this idea.
Next week, we will look at some simple examples of the application
of Bayes’ Theorem.
42 / 42
