Lecture 1: Review of Linear Mixed Models
Shravan Vasishth
Department of Linguistics
University of Potsdam, Germany
October 7, 2014
1 / 42
Lecture 1: Review of Linear Mixed Models
Introduction
Motivating this course
In psycholinguistics, we usually use frequentist methods to
analyze our data.
The most common tool I use is lme4.
In recent years, very powerful programming languages have
become available that make Bayesian modeling relatively easy.
Bayesian tools have several important advantages over
frequentist methods, but they require some very specific
background knowledge.
My goal in this course is to try to provide that background
knowledge.
Note: I will teach a more detailed (2-week) course in ESSLLI in
Barcelona, Aug 3-14, 2015.
2 / 42
Lecture 1: Review of Linear Mixed Models
Introduction
Motivating this course
In this introductory lecture, my goals are to
1 motivate you to look at Bayesian Linear Mixed Models as an
alternative to using frequentist methods.
2 make sure we are all on the same page regarding the moving
parts of a linear mixed model.
3 / 42
Lecture 1: Review of Linear Mixed Models
Preliminaries
Prerequisites
1 Familiarity with fitting standard LMMs such as:
lmer(rt~cond+(1+cond|subj)+(1+cond|item),dat)
2 Basic knowledge of R.
4 / 42
Lecture 1: Review of Linear Mixed Models
Preliminaries
A bit about my background
1999: Discovered the word "ANOVA" in a chance conversation
with a psycholinguist.
1999: Did my first self-paced reading experiment. Fit a
repeated measures ANOVA.
2000: Went to statisticians at Ohio State’s statistical
consulting unit, and they said: "Why are you fitting ANOVA?
You need linear mixed models."
2000-2010: Kept fitting and publishing linear mixed models.
2010: Realized I didn’t really know what I was doing, and
started a part-time MSc in Statistics at Sheffield’s School of
Math and Statistics (2011-2015).
5 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
Linear mixed models
Example: Gibson and Wu data, Language and Cognitive Processes, 2012
6 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
Linear mixed models
Example: Gibson and Wu data, Language and Cognitive Processes, 2012
Subject vs. object relative clauses in Chinese, self-paced
reading.
The critical region is the head noun.
The goal is to find out whether subject relatives (SRs) are harder
to process than object relatives (ORs) at the head noun.
7 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
Linear mixed models
Example: Gibson and Wu 2012 data
> head(data[,c(1,2,3,4,7,10,11)])
subj item type pos rt rrt x
7 1 13 obj-ext 6 1140 -0.8771930 0.5
20 1 6 subj-ext 6 1197 -0.8354219 -0.5
32 1 5 obj-ext 6 756 -1.3227513 0.5
44 1 9 obj-ext 6 643 -1.5552100 0.5
60 1 14 subj-ext 6 860 -1.1627907 -0.5
73 1 4 subj-ext 6 868 -1.1520737 -0.5
8 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
lme4 model of Gibson and Wu data
Crossed varying intercepts and slopes model, with correlation
This is the type of "maximal" model that most people fit nowadays
(citing Barr et al. 2012):
> m1 <- lmer(rrt~x+(1+x|subj)+(1+x|item),
+ subset(data,region=="headnoun"))
I will now show two major (related) problems that occur with the
small datasets we usually have in psycholinguistics:
The correlation estimates lead to degenerate variance-covariance
matrices, and/or
the correlation estimates are wild, bearing no relation to reality.
[This is not a failing of lmer, but rather that the user is
demanding too much of lmer.]
9 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
lme4 model of Gibson and Wu data
Typical data analysis: Crossed varying intercepts and slopes model, with correlation
> summary(m1)
Linear mixed model fit by REML [lmerMod]
Formula: rrt ~ x + (1 + x | subj) + (1 + x | item)
Data: subset(data, region == "headnoun")
REML criterion at convergence: 1595.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.5441 -0.6430 -0.1237 0.5996 3.2501
Random effects:
Groups Name Variance Std.Dev. Corr
subj (Intercept) 0.371228 0.60928
x 0.053241 0.23074 -0.51
item (Intercept) 0.110034 0.33171
x 0.009218 0.09601 1.00
Residual 0.891577 0.94423
Number of obs: 547, groups: subj, 37; item, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.67151 0.13793 -19.369
x -0.07758 0.09289 -0.835
Correlation of Fixed Effects:
(Intr)
x 0.012
10 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
The "best" model
The way to decide on the "best" model is to find the simplest
adequate model using the Generalized Likelihood Ratio Test (Pinheiro
and Bates 2000). Here, this is the varying intercepts model, not the
maximal model.
> m1<- lmer(rrt~x+(1+x|subj)+(1+x|item),
+ headnoun)
> m1a<- lmer(rrt~x+(1|subj)+(1|item),
+ headnoun)
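Here headnoun is assumed to be the head-noun subset used for m1 above, i.e., something like:
> headnoun <- subset(data, region=="headnoun")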
11 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
The "best" model
> anova(m1,m1a)
Data: headnoun
Models:
m1a: rrt ~ x + (1 | subj) + (1 | item)
m1: rrt ~ x + (1 + x | subj) + (1 + x | item)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
m1a 5 1603.5 1625.0 -796.76 1593.5
m1 9 1608.5 1647.3 -795.27 1590.5 2.9742 4 0.5622
12 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
Here, we simulate data with the same structure, sample size, and
parameter values as the Gibson and Wu data, except that we
assume that the correlations are 0.6. Then we analyze the data
using lmer (maximal model). Can lmer recover the
correlations?
13 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
We define a function called new.df that generates data similar to
the Gibson and Wu data-set. For code, see accompanying .R file.
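Since the .R file is not reproduced here, the following is a minimal sketch of what such a data-generating function might look like. The fixed-effect values, variance components, and the condition-assignment scheme below are illustrative assumptions, not the values used in the lecture; the real new.df also returns a data frame with more columns, which is why gendata (next slide) selects columns by index.
> library(MASS)  # for mvrnorm
> new.df <- function(nsubj = 37, nitems = 15, rho.u = 0.6, rho.w = 0.6,
+                    beta = c(6.5, -0.05),   # fixed effects (assumed values)
+                    sigma.u = c(0.6, 0.2),  # subject intercept/slope SDs (assumed)
+                    sigma.w = c(0.3, 0.1),  # item intercept/slope SDs (assumed)
+                    sigma.e = 0.9) {        # residual SD (assumed)
+   ## variance-covariance matrices for by-subject and by-item effects
+   Sigma.u <- diag(sigma.u) %*% matrix(c(1, rho.u, rho.u, 1), 2) %*% diag(sigma.u)
+   Sigma.w <- diag(sigma.w) %*% matrix(c(1, rho.w, rho.w, 1), 2) %*% diag(sigma.w)
+   u <- mvrnorm(nsubj, c(0, 0), Sigma.u)    # by-subject adjustments
+   w <- mvrnorm(nitems, c(0, 0), Sigma.w)   # by-item adjustments
+   dat <- expand.grid(subj = 1:nsubj, item = 1:nitems)
+   ## each subject sees each item in one condition (Latin-square-like assignment)
+   dat$cond <- ifelse((dat$subj + dat$item) %% 2 == 0, 1, 2)
+   x <- ifelse(dat$cond == 1, -0.5, 0.5)
+   dat$rt <- (beta[1] + u[dat$subj, 1] + w[dat$item, 1]) +
+     (beta[2] + u[dat$subj, 2] + w[dat$item, 2]) * x +
+     rnorm(nrow(dat), 0, sigma.e)
+   list(dat)                                # gendata() uses the first list element
+ }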
14 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
Next, we write a function that repeatedly generates data with the
following specifications: the sample sizes for subjects and
items, and the correlations between the subject intercept and slope
and between the item intercept and slope.
> gendata<-function(subjects=37,items=15){
+ dat<-new.df(nsubj=subjects,nitems=items,
+ rho.u=0.6,rho.w=0.6)
+ dat <- dat[[1]]
+ dat<-dat[,c(1,2,3,9)]
+ dat$x<-ifelse(dat$cond==1,-0.5,0.5)
+
+ return(dat)
+ }
15 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
Set number of simulations:
> nsim<-100
Next, we generate simulated data 100 times, store the
estimated subject- and item-level correlations of the random effects,
and plot their distributions.
We do this for two settings: Gibson and Wu sample sizes (37
subjects, 15 items), and 50 subjects and 30 items.
16 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
37 subjects and 15 items
> library(lme4)
> subjcorr<-rep(NA,nsim)
> itemcorr<-rep(NA,nsim)
> for(i in 1:nsim){
+ dat<-gendata()
+ m3<-lmer(rt~x+(1+x|subj)+(1+x|item),dat)
+ subjcorr[i]<-attr(VarCorr(m3)$subj,"correlation")[1,2]
+ itemcorr[i]<-attr(VarCorr(m3)$item,"correlation")[1,2]
+ }
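The plotting code is not shown on the slides; a sketch that would produce the two histograms on the next slide:
> op <- par(mfrow = c(1, 2))
> hist(subjcorr, freq = FALSE, main = "Distribution of subj. corr.",
+      xlab = expression(hat(rho)[u]), ylab = "Density")
> hist(itemcorr, freq = FALSE, main = "Distribution of item corr.",
+      xlab = expression(hat(rho)[w]), ylab = "Density")
> par(op)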
17 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
[Figure: two histograms of the estimated correlations over the 100
simulations. Left: distribution of the subject-level correlation ρ̂u
(roughly −0.2 to 1.0). Right: distribution of the item-level correlation
ρ̂w (roughly −1.0 to 1.0). y-axes: Density.]
18 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
50 subjects and 30 items
> subjcorr<-rep(NA,nsim)
> itemcorr<-rep(NA,nsim)
> for(i in 1:nsim){
+ #print(i)
+ dat<-gendata(subjects=50,items=30)
+ m3<-lmer(rt~x+(1+x|subj)+(1+x|item),dat)
+ subjcorr[i]<-attr(VarCorr(m3)$subj,"correlation")[1,2]
+ itemcorr[i]<-attr(VarCorr(m3)$item,"correlation")[1,2]
+ }
19 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
[Figure: the same two histograms for 50 subjects and 30 items.
Left: ρ̂u (roughly 0.2 to 1.0). Right: ρ̂w (roughly 0.0 to 0.8).
y-axes: Density.]
20 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
How meaningful were the lmer estimates of correlations in
the maximal model m1?
Simulated data
Conclusion:
1 It seems that lmer can estimate the correlation parameters only
when the sample sizes for subjects and items are "large enough"
(this can be established using simulation, as done above).
2 Barr et al’s recommendation to fit a maximal model makes
sense as a general rule only if it’s already clear that we have
enough data to estimate all the variance components and
parameters.
3 In my experience, that is rarely the case, at least in
psycholinguistics, especially when we move to designs more complex
than a simple two-condition study.
21 / 42
Lecture 1: Review of Linear Mixed Models
Repeated measures data
Keep it maximal?
Gelman and Hill (2007, p. 549) make a more tempered
recommendation than Barr et al:
Don’t get hung up on whether a coefficient "should" vary
by group. Just allow it to vary in the model, and then, if
the estimated scale of variation is small . . . , maybe you
can ignore it if that would be more convenient. Practical
concerns sometimes limit the feasible complexity of a
model–for example, we might fit a varying-intercept
model first, then allow slopes to vary, then add
group-level predictors, and so forth. Generally, however,
it is only the difficulties of fitting and, especially,
understanding the models that keeps us from adding even
more complexity, more varying coefficients, and more
interactions.
22 / 42
Lecture 1: Review of Linear Mixed Models
Why fit Bayesian LMMs?
Advantages of fitting a Bayesian LMM
1 For such data, there can be situations where you really need
or want to fit full variance-covariance matrices for the random
effects. Bayesian LMMs will let you fit them even in cases
where lmer would fail to converge or return nonsensical
estimates (due to too little data).
The way we will set them up, Bayesian LMMs will typically
underestimate correlation, unless there is enough data.
2 A direct answer to the research question can be obtained by
examining the posterior distribution given data.
3 We can avoid the traditional hard binary decision associated
with frequentist methods: p < 0.05 implies reject null, and
p > 0.05 implies "accept" null. We are more interested in
quantifying our uncertainty about the scientific claim.
4 Prior knowledge can be included in the model.
23 / 42
Lecture 1: Review of Linear Mixed Models
Why fit Bayesian LMMs?
Disadvantages of doing a Bayesian analysis
You have to invest effort into specifying a model; unlike lmer,
which involves a single line of code, JAGS and Stan model
specifications can extend to 20-30 lines.
A lot of decisions have to be made.
There is a steep learning curve; you have to know a bit about
probability distributions, MCMC methods, and of course
Bayes’ Theorem.
It takes much more time to fit a complicated model in a
Bayesian setting than with lmer.
But I will try to demonstrate to you in this course that it’s worth
the effort, especially when you don’t have a lot of data (usually the
case in psycholinguistics).
24 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models
yi = β0 + β1 xi + εi    (1)
(yi: response; β0, β1: parameters; xi: predictor; εi: error)
where
εi is the residual error, assumed to be normally distributed:
εi ∼ N(0,σ2).
Each response yi (i = 1, . . . , I) is independently distributed
as yi ∼ N(β0 + β1 xi , σ2).
Point values for parameters: β0 and β1 are the parameters
to be estimated. In the frequentist setting, these are point
values; they have no distribution.
Hypothesis test: Usually, β1 is the parameter of interest; in
the frequentist setting, we test the null hypothesis that β1 = 0.
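To connect equation (1) to R code, a sketch of fitting this simple (non-mixed) linear model, assuming headnoun is the head-noun data frame used earlier (and ignoring the repeated-measures structure):
> m0 <- lm(rrt ~ x, data = headnoun)
> summary(m0)
The (Intercept) row estimates β0; the row for x estimates β1, and its t value tests the null hypothesis that β1 = 0.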
25 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Repeated measures data
Linear mixed models are useful for correlated data (e.g.,
repeated measures) where the responses y are not
independently distributed.
A key difference from linear models is that the intercept
and/or slope vary by subject j = 1,...,J (and possibly also by
item k = 1,...,K):
yi = [β0 + u0j + w0k] + [β1 + u1j + w1k] xi + εi    (2)
(first bracket: varying intercepts; second bracket: varying slopes;
xi: predictor; εi: error)
26 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation
yi = [β0 + u0j + w0k] + [β1 + u1j + w1k] xi + εi    (3)
(first bracket: varying intercepts; second bracket: varying slopes)
This is the "maximal" model we saw earlier:
> m1 <- lmer(rrt~x+(1+x|subj)+(1+x|item),
+ headnoun)
27 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation
> summary(m1)
Linear mixed model fit by REML [lmerMod]
Formula: rrt ~ x + (1 + x | subj) + (1 + x | item)
Data: headnoun
REML criterion at convergence: 1595.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.5441 -0.6430 -0.1237 0.5996 3.2501
Random effects:
Groups Name Variance Std.Dev. Corr
subj (Intercept) 0.371228 0.60928
x 0.053241 0.23074 -0.51
item (Intercept) 0.110034 0.33171
x 0.009218 0.09601 1.00
Residual 0.891577 0.94423
Number of obs: 547, groups: subj, 37; item, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.67151 0.13793 -19.369
x -0.07758 0.09289 -0.835
Correlation of Fixed Effects:
(Intr)
x 0.012
28 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Unpacking the lme4 model
Crossed varying intercepts and slopes model, with correlation
rrti = (β0 + u0j + w0k) + (β1 + u1j + w1k) xi + εi    (4)
1 i = 1, . . . , 547 data points; j = 1, . . . , 37 subjects; k = 1, . . . , 15 items.
2 xi is coded −0.5 (SR) and 0.5 (OR).
3 εi ∼ N(0, σ2).
4 u0j ∼ N(0, σ2u0) and u1j ∼ N(0, σ2u1).
5 w0k ∼ N(0, σ2w0) and w1k ∼ N(0, σ2w1).
with a multivariate normal distribution for the varying slopes and
intercepts
(u0j, u1j) ∼ N(0, Σu),   (w0k, w1k) ∼ N(0, Σw)    (5)
(here 0 is the 2 × 1 vector of zeros)
29 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
The variance components associated with subjects
Random effects:
Groups Name Variance Std.Dev. Corr
subj (Intercept) 0.37 0.61
x 0.05 0.23 -0.51
Σu = ( σ2u0          ρu σu0 σu1 )
     ( ρu σu0 σu1    σ2u1       )

   = ( 0.61^2                  −0.51 × 0.61 × 0.23 )
     ( −0.51 × 0.61 × 0.23     0.23^2              )    (6)
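As a quick check, Σu can be reconstructed in R from the printed estimates (a sketch; m1 is the model fitted earlier, values rounded as on the slide):
> sds <- c(0.61, 0.23)  # SDs of the subject intercept and slope
> rho <- -0.51          # intercept-slope correlation
> diag(sds) %*% matrix(c(1, rho, rho, 1), 2, 2) %*% diag(sds)
lme4 can also return this matrix directly via VarCorr(m1)$subj.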
30 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
The variance components associated with items
Random effects:
Groups Name Variance Std.Dev. Corr
item (Intercept) 0.11 0.33
x 0.01 0.10 1.00
Note the by-items intercept-slope correlation of +1.00.
Σw = ( σ2w0          ρw σw0 σw1 )
     ( ρw σw0 σw1    σ2w1       )

   = ( 0.33^2            1 × 0.33 × 0.10 )
     ( 1 × 0.33 × 0.10   0.10^2          )    (7)
31 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
The variance components of the "maximal" linear mixed
model
RTi = β0 + u0j + w0k + (β1 + u1j + w1k) xi + εi    (8)
(u0j, u1j) ∼ N(0, Σu),   (w0k, w1k) ∼ N(0, Σw)    (9)
εi ∼ N(0, σ2)    (10)
The parameters are β0, β1, Σu, Σw, σ. Each of the matrices Σ has
three parameters (two variances and one correlation), so we have nine
parameters in total.
32 / 42
Lecture 1: Review of Linear Mixed Models
Brief review of linear (mixed) models
Linear models and repeated measures data
Summary so far
Linear mixed models allow us to take all relevant variance
components into account, and to describe how the data were
generated.
However, maximal models should not be fit blindly, especially
when there is not enough data to estimate the parameters.
For small datasets we often see degenerate variance-covariance
estimates (with correlations of ±1). Many
psycholinguists ignore this degeneracy.
If one cares about the correlation, one should not ignore the
degeneracy.
33 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The frequentist approach
1 In the frequentist setting, we start with a dependent measure
y, for which we assume a probability model.
2 In the above example, we have reading time data, rt, which
we assume is generated from a normal distribution with some
mean µ and variance σ2; we write this rt ∼ N(µ,σ2).
3 Given a particular set of parameter values µ and σ2, we could
state the probability distribution of rt given the parameters.
We can write this as p(rt | µ,σ2).
34 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The frequentist approach
1 In reality, we know neither µ nor σ2. The goal of fitting a
model to such data is to estimate the two parameters, and
then to draw inferences about what the true value of µ is.
2 The frequentist method relies on the fact that, under repeated
sampling and with a large enough sample size, the sampling
distribution of the sample mean X̄ is N(µ, σ2/n).
3 The standard method is to use the sample mean x̄ as an
estimate of µ; given a large enough sample size n, we can
compute an approximate 95% confidence interval
x̄ ± 2 × √(σ̂2/n).
35 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The frequentist approach
The 95% confidence interval has a slightly complicated
interpretation:
If we were to repeatedly carry out the experiment and compute a
confidence interval each time using the above procedure, 95% of
those confidence intervals would contain the true parameter value
µ (assuming, of course, that all our model assumptions are
satisfied).
The particular confidence interval we calculated for our particular
sample does not give us a range such that we are 95% certain that
the true µ lies within it, although this is how most users of
statistics seem to (mis)interpret the confidence interval.
36 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The 95% CI
[Figure: 95% CIs in 100 repeated samples; x-axis: sample index (1-100),
y-axis: Scores (roughly 56-64).]
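A sketch of the simulation behind such a figure (the true mean, SD, and sample size below are illustrative assumptions, not the values used for the slide):
> set.seed(1)
> mu <- 60; sigma <- 5; n <- 100; nrep <- 100
> lower <- upper <- numeric(nrep)
> for (i in 1:nrep) {
+   x <- rnorm(n, mu, sigma)
+   se <- sd(x) / sqrt(n)
+   lower[i] <- mean(x) - 2 * se
+   upper[i] <- mean(x) + 2 * se
+ }
> plot(1:nrep, rep(mu, nrep), type = "n", ylim = range(lower, upper),
+      xlab = "sample", ylab = "Scores", main = "95% CIs in 100 repeated samples")
> segments(1:nrep, lower, 1:nrep, upper,
+          col = ifelse(lower > mu | upper < mu, "red", "black"))
> abline(h = mu, lty = 2)  # true mean; about 5% of intervals should miss it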
37 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
1 The Bayesian approach starts with a probability model that
defines our prior beliefs about the possible values that the
parameters µ and σ2 might have.
2 This probability model expresses what we know so far about
these two parameters (we may not know much, but in
practical situations, it is not the case that we don’t know
anything about their possible values).
3 Given this prior distribution, the probability model p(y | µ,σ2)
and the data y allow us to compute the probability
distribution of the parameters given the data, p(µ,σ2 | y).
4 This probability distribution, called the posterior distribution,
is what we use for inference.
38 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
1 Unlike the 95% confidence interval, we can define a 95%
credible interval that represents the range within which we are
95% certain that the true value of the parameter lies, given
the data at hand.
2 Note that in the frequentist setting, the parameters are point
values: µ is assumed to have a particular value in nature.
3 In the Bayesian setting, µ is a random variable with a
probability distribution; it has a mean, but there is also some
uncertainty associated with its true value.
39 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
Bayes’ theorem makes it possible to derive the posterior distribution
given the prior and the data. The conditional probability rule in
probability theory (see Kerns) is that the joint distribution of two
random variables p(θ,y) is equal to p(θ | y)p(y). It follows that:
p(θ,y) = p(θ | y)p(y)
        = p(y,θ)           (because p(θ,y) = p(y,θ))
        = p(y | θ)p(θ).    (11)
The first and third lines in the equalities above imply that
p(θ | y)p(y) = p(y | θ)p(θ). (12)
40 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
Dividing both sides by p(y) we get:
p(θ | y) = p(y | θ)p(θ) / p(y)    (13)
The term p(y | θ) is the probability of the data given θ. If we treat
this as a function of θ, we have the likelihood function.
Since p(θ | y) is the posterior distribution of θ given y, and
p(y | θ) the likelihood, and p(θ) the prior, the following
relationship is established:
Posterior ∝ Likelihood×Prior (14)
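As a preview of what this proportionality means in practice, here is a tiny grid-approximation sketch in R (the binomial example and the Beta(4,4) prior are illustrative, not from the lecture):
> theta <- seq(0, 1, length.out = 1000)             # grid of parameter values
> prior <- dbeta(theta, 4, 4)                       # prior on theta
> likelihood <- dbinom(7, size = 10, prob = theta)  # data: 7 successes in 10 trials
> posterior <- prior * likelihood
> posterior <- posterior / sum(posterior)           # normalizing plays the role of p(y)
> theta[which.max(posterior)]                       # posterior mode, approximately 0.625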
41 / 42
Lecture 1: Review of Linear Mixed Models
Frequentist vs Bayesian methods
The Bayesian approach
Posterior ∝ Likelihood×Prior (15)
We ignore the denominator p(y) here because it only serves as a
normalizing constant that renders the left-hand side (the posterior)
a probability distribution.
The above is Bayes’ theorem, and is the basis for determining the
posterior distribution given a prior and the likelihood.
The rest of this course simply unpacks this idea.
Next week, we will look at some simple examples of the application
of Bayes’ Theorem.
42 / 42
