Mixed Effects Models - Model Comparison

Week 4.1: Model Comparison
• Lab: Interactions Practice
• Model Comparison
• Nested Models
• Hypothesis Testing
• REML vs ML
• Non-Nested Models
• Shrinkage
• The Problem
• Solutions

Interpreting Interactions
• Numerical interaction term tells us how the
interaction works:
• Strengthens individual effects with the same sign
as the interaction
• Weakens individual effects with a different sign as
the interaction
• Or, again, just look at the graph ☺

Interpreting Interactions Practice
• Dependent variable: Classroom learning
• Independent variable 1: Intrinsic motivation
• Learning because you want to learn (intrinsic) vs.
to get a good grade (extrinsic)
• Intrinsic motivation has a + effect on learning
• Independent variable 2: Autonomy language
• “You can…” (vs. “You must…”)
• Also has a + effect on learning
• Motivation x autonomy interaction is +
• Interpretation: Combining intrinsic motivation and autonomy language especially benefits learning
• “Synergistic” interaction
(Vansteenkiste et al., 2004, JPSP)

• Dependent variable: Satisfaction with a
consumer purchase
• Number of choices: - effect on
satisfaction
• “Maximizing” strategy: - effect on satisfaction
• Trying to find the best option vs. “good enough”
• Choices x maximizing strategy is -
• Interpretation: Having lots of choices when you’re a maximizer especially reduces satisfaction
• Also a synergistic interaction
(Carrillat, Ladik, & Legoux, 2011; Marketing Letters)

Model Formulae Practice
• Write the R formula for each model:
• 1) We’re interested in the effects of FamilySES, PriorNightSleep, and Nutrition on MathTestPerformance, but we don’t expect them to interact
• MathTestPerformance ~ 1 + FamilySES + PriorNightSleep + Nutrition
• 2) We factorially manipulated SentenceType (active or passive) and Plausibility (low or high) in a test of TextComprehensionAccuracy
• TextComprehensionAccuracy ~ 1 + SentenceType + Plausibility + SentenceType:Plausibility
or
TextComprehensionAccuracy ~ 1 + SentenceType*Plausibility
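
The equivalence of the two ways of writing the interaction model can be checked directly in R. This is a small sketch using base R's terms(); the variable names are the ones from the exercise, and no data are needed because only the formulas are inspected.

```r
# Two ways of writing the same factorial model
long_form  <- TextComprehensionAccuracy ~ 1 + SentenceType + Plausibility +
  SentenceType:Plausibility
short_form <- TextComprehensionAccuracy ~ 1 + SentenceType * Plausibility

# Extract the terms that R will actually estimate for each formula
labels_long  <- attr(terms(long_form),  "term.labels")
labels_short <- attr(terms(short_form), "term.labels")

print(labels_short)
print(identical(labels_long, labels_short))
```

Both formulas expand to the two main effects plus the SentenceType:Plausibility interaction, so either can be passed to lmer().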

• Second language proficiency: + effect on
translation accuracy
• Word frequency: + effect on accuracy
• Frequency x proficiency interaction is -
• Interpretation: Proficiency matters less when translating
high frequency words
• Or: Difference between high & low frequency words gets
smaller if you have high proficiency
• “Antagonistic” interaction. Combining the effects reduces or
reverses the individual effects.
(e.g., Diependaele, Lemhöfer, & Brysbaert, 2012, QJEP)

• Retrieval practice: + effect on long-term
learning
• Working memory span: + effect on learning
• Retrieval practice x WM span interaction is -
(Agarwal et al., 2016)
• Interpretation: Retrieval practice is especially
beneficial for people with low working memory.
• Or: Low WM confers less of a disadvantage if you
do retrieval practice

• Affectionate touch: + effect on feeling of
relationship security
• Avoidant attachment style: - effect on security
• Touch x avoidant attachment interaction is -
• Interpretation: Affectionate touch enhances
relationship security less for people with
an avoidant attachment style
(Jakubiak & Feeney, SPPS, 2016)

• Age: - effect on picture memory
• Older adults have poorer memory
• Emotional valence: - effect on accuracy
• Positive pictures are not remembered as well
compared to negative pictures
• Age x Valence interaction is +
• Interpretation: Age declines are smaller for positive pictures
• Or: Disadvantage of positive pictures is not as strong for
older adults
(e.g., Mather & Carstensen, 2005, TiCS)

Model Comparison
• Sometimes, we may have more than 1 model
that we could consider applying to the data
• 2 or more competing theoretical models
• e.g., critical period in language acquisition
• No critical period (Vanhove, 2013): 1 + AgeOfAcquisition
• Critical period hypothesis (Hartshorne et al., 2020): 1 + AgeOfAcquisition*CriticalPeriod

• Exploratory analysis where we don’t yet know
which model would be appropriate

Dataset
• Social support & health (e.g., Cohen & Wills, 1985)
• lifeexpectancy.csv:
• Longitudinal study of 1000 subjects – some siblings from same family, so 517 total families
• Perceived social support (z-scored)
• Lifespan
• And several control variables

Nested Models
• Three possible models of life expectancy:
• Amount of weekly exercise
• Amount of weekly exercise & perceived social support
• Amount of weekly exercise, perceived social support, years of education, conscientiousness, yearly income, and number of vowels in your last name
• These are nested models: each one can be formed by subtracting variables from the one below it (“nested inside it”)

Nested Models
• Which set of information would give us the most accurate fitted() values?

Nested Models
• The “biggest” nested model will always provide predictions that are at least as good for the data it was fitted to
• Adding info can only explain the same amount or more of the variance in that sample
• Might not be much better (“number of vowels” effect zero or close to zero) but can’t be worse
• Slope of regression line relating last name vowels to life expectancy is near 0
• But that merely fails to improve predictions; doesn’t hurt them
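
A quick way to see that a bigger nested model cannot fit its own sample worse is to add a pure-noise predictor and compare residual sums of squares. This is only a sketch: the data are simulated (not the lifeexpectancy dataset), the variable names are made up for illustration, and ordinary lm() is used instead of lmer() to keep it in base R.

```r
set.seed(1)
n <- 200

# Simulated predictors; vowels is unrelated to the outcome by construction
exercise <- rnorm(n)
support  <- rnorm(n)
vowels   <- rpois(n, lambda = 3)
lifespan <- 75 + 2 * exercise + 1 * support + rnorm(n, sd = 5)

small <- lm(lifespan ~ exercise + support)
big   <- lm(lifespan ~ exercise + support + vowels)

# Residual sum of squares on the data used for fitting (lower = closer fit)
rss_small <- sum(resid(small)^2)
rss_big   <- sum(resid(big)^2)
print(c(small = rss_small, big = rss_big))
```

The bigger model's residual sum of squares is never larger, even though the extra predictor is noise. That guarantee holds only for the sample used to fit the models.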

Hypothesis Testing
• Let’s think about our first two models:
model1: E(Yi(j)) = γ00 + γ10HrsExercise + γ20SocSupport
model2: E(Yi(j)) = γ00 + γ10HrsExercise
• Comparing these two statistical models closely relates to our research question: Which theoretical model best explains data?
• The theoretical model where social support does affect life expectancy
• The model where social support doesn’t affect life expectancy

Hypothesis Testing
• What are some possible values of γ20 (the SocSupport effect) in model1?
• 3.83
• -1.04
• 0 – there is no social support effect

Hypothesis Testing
• What happens when γ20 is equal to 0?
• Anything multiplied by 0 is 0, so SocSupport just drops out of the equation
• Becomes the same thing as model2
model1 with γ20 = 0: E(Yi(j)) = γ00 + γ10HrsExercise + 0 × SocSupport
model2: E(Yi(j)) = γ00 + γ10HrsExercise

Hypothesis Testing
• model2 is just a special case of model1
• The version of model1 where γ20 happens to be 0
• One of many possible versions of model1
• Why we say model2 is “nested” in model1

Hypothesis Testing
• This also helps show why model1 always fits as well as model2 or better
• model1 can account for the case where γ20 = 0
• But it can also account for many other cases, too

Likelihood Ratio Test
• We can compare nested models (only) using the likelihood-ratio test
• Remember that likelihood is what we search for in fitting an individual model (find the values with the highest likelihood)
• First, fit each of the models to be compared
• model1 <- lmer(Lifespan ~ 1 + HrsExercise + SocSupport + (1|Family), data=lifeexpectancy)
• model2 <- lmer(Lifespan ~ 1 + HrsExercise + (1|Family), data=lifeexpectancy)

• Then, compare them with anova():
• anova(model1, model2)
• Order doesn’t matter
• Twice the difference in log likelihoods is (approximately) distributed as a chi-square
• d.f. = # of parameters added or removed
• Here, χ²(1) = 8.67, p = .003
• Log likelihood will be somewhat higher (better) for the complex model … but is it SIGNIFICANTLY better?
• (Callout on the anova() output) We’ll discuss what this means in a moment (don’t worry; it’s what we want)
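
The arithmetic behind the likelihood ratio test can be sketched by hand. The first part recomputes the p-value for the χ²(1) = 8.67 reported above. The second part builds the statistic from two log likelihoods, using simulated data and lm() rather than lmer() on lifeexpectancy.csv, so the numbers there are illustrative only.

```r
# Part 1: p-value for the reported statistic, chi-square(1) = 8.67
p_reported <- pchisq(8.67, df = 1, lower.tail = FALSE)
print(round(p_reported, 4))   # about .003, as on the slide

# Part 2: the same test built by hand on simulated (hypothetical) data
set.seed(2)
n <- 300
exercise <- rnorm(n)
support  <- rnorm(n)
lifespan <- 75 + 2 * exercise + 1 * support + rnorm(n, sd = 5)

m_small <- lm(lifespan ~ exercise)
m_big   <- lm(lifespan ~ exercise + support)

# Test statistic: twice the difference in log likelihoods
lr_stat <- as.numeric(2 * (logLik(m_big) - logLik(m_small)))

# Degrees of freedom: difference in the number of estimated parameters
df_diff <- attr(logLik(m_big), "df") - attr(logLik(m_small), "df")

p_lr <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
print(c(chisq = lr_stat, df = df_diff, p = p_lr))
```

For lmer() models, anova() does this bookkeeping for you.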

• t-test and LR test are very similar!
• t-test: Tests whether an effect differs from 0,
based on this model
• Likelihood ratio: Compare to a model where the
effect actually IS constrained to be 0
• With an infinitely large sample, these two
tests would produce identical conclusions
• With small sample, t-test is less likely to
detect spurious differences (Luke, 2017)
• But, large differences uncommon

• In this example, the two tests give nearly identical p-values:
• p-value from likelihood ratio test: .0032
• p-value from lmerTest t-test: .0033

• Guidance:
• LR test is useful for testing groups of variables
• model1 <- lmer(Lifespan ~ 1 + HrsExercise …)
• model3 <- lmer(Lifespan ~ 1 + HrsExercise + SocSupport + YrsEducation + Conscientiousness …)
• If testing just one variable at a time, use the t-test: slightly less likely to produce a Type I error

REML vs ML
• Technically, two different algorithms that R can
use “behind the scenes” to get the estimates
• REML: Restricted Maximum Likelihood
• Assumes the fixed effects structure is correct
• Bad for comparing models that differ in fixed effects
• ML: Maximum Likelihood
• OK for comparing models
• But, may underestimate variance of random effects
• Ideal: ML for model comparison, REML for final
results
• lme4 does this automatically for you!
• Defaults to REML. But automatically refits models
with ML when you do likelihood ratio test.

REML vs ML
• The one time you might want to mess with this:
• If you are going to be doing a lot of model
comparisons, can fit the model with ML to begin
with
• model1 <- lmer(DV ~ 1 + Predictors,
data=lifeexpectancy, REML=FALSE)
• Saves refitting for each comparison
• Remember to refit the model with REML=TRUE
for your final results
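
One way this workflow might look in practice is sketched below. It assumes lme4 is installed and uses the sleepstudy data that ships with lme4 as a stand-in for lifeexpectancy.csv, so the variables (Reaction, Days, Subject) are not the ones from the lecture.

```r
library(lme4)   # assumes the lme4 package is installed

# Fit both candidate models with ML from the start (REML = FALSE)
m_small <- lmer(Reaction ~ 1 + (1 | Subject), data = sleepstudy, REML = FALSE)
m_big   <- lmer(Reaction ~ 1 + Days + (1 | Subject), data = sleepstudy, REML = FALSE)

# The models are already ML fits, so anova() does not need to refit them
comparison <- anova(m_small, m_big, refit = FALSE)
print(comparison)

# Refit the chosen model with REML for the final reported estimates
m_final <- update(m_big, REML = TRUE)
print(isREML(m_final))
```

With only a couple of comparisons, letting anova() refit automatically is simpler; fitting with ML up front mainly saves time when many comparisons are planned.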

Non-Nested Models
• Which of these pairs is not a case of nested
models?
• A
• Accuracy ~ SentenceType + Aphasia +
SentenceType:Aphasia
• Accuracy ~ SentenceType + Aphasia
• B
• MathAchievement ~ SocioeconomicStatus
• MathAchievement ~ TeacherRating + ClassSize
• C
• Recall ~ StudyTime
• Recall ~ StudyTime + StudyStrategy

• Answer: B. Each of these models has something that the other doesn’t have.

Non-Nested Models
• Models that aren’t nested can’t be tested the same way
• A non-nested comparison:
1st model: E(Yi(j)) = γ00 + 0 × YrsEducation + γ20IncomeThousands
2nd model: E(Yi(j)) = γ00 + γ10YrsEducation + 0 × IncomeThousands
• What would support 1st model over 2nd?
• γ20 is significantly greater than 0, but also γ10 is 0
• But remember we can’t test that something is 0 with frequentist statistics … can’t prove the H0 is true
• Parametric statistics don’t apply here

Non-Nested Models: Comparison
• Can be compared with information criteria
• Remember our fitted values from last week?
• fitted(model2)
• What if we replaced all of our observations with
just the fitted (predicted) values?
• We’d be losing some information
• However, if the model predicted the data well, we
would not be losing that much
• Information criteria measure how much information is
lost with the fitted values (so, lower is better)

• AIC: An Information Criterion or Akaike’s Information Criterion (Akaike, 1974)
• -2(log likelihood) + 2k
• k = # of fixed and random effects in a particular model
• A model with a lower AIC is better
• Doesn’t assume any of the models is correct
• Appropriate for correlational / non-experimental data
• BIC: Bayesian Information Criterion
• -2(log likelihood) + log(n)k
• k = # of fixed & random effects, n = num. observations
• A model with a lower BIC is better
• Typically prefers simpler models than AIC
• Assumes that there’s a “true” underlying model in the set of variables being considered
• Appropriate for experimental data
(Yang, 2005; Oehlert, 2012)
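
The two formulas can be checked against R's built-in AIC() and BIC(). This sketch uses simulated data with made-up variable names and lm() instead of lmer(), so it stays in base R. Note that R's parameter count k (the "df" attribute of logLik()) also includes the residual variance.

```r
set.seed(3)
n <- 250
ses        <- rnorm(n)
rating     <- rnorm(n)
class_size <- rnorm(n)
math <- 50 + 3 * ses + 1 * rating + rnorm(n, sd = 8)

# Two non-nested models: each has a predictor the other lacks
model_a <- lm(math ~ ses)
model_b <- lm(math ~ rating + class_size)

info_criteria <- function(model) {
  ll    <- logLik(model)
  k     <- attr(ll, "df")   # estimated parameters, incl. residual variance
  n_obs <- nobs(model)
  c(AIC = -2 * as.numeric(ll) + 2 * k,
    BIC = -2 * as.numeric(ll) + log(n_obs) * k)
}

manual_a <- info_criteria(model_a)
manual_b <- info_criteria(model_b)
print(rbind(model_a = manual_a, model_b = manual_b))
```

Whichever row has the lower value is preferred by that criterion; no p-value comes with the comparison.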

• Can also get these from anova(model1, model2)
• Just ignore the chi-square if non-nested models
• AIC and BIC do not have a significance test
associated with them
• The model with the lower AIC/BIC is preferred, but
we don’t know how reliable this preference is

Shrinkage
• The “Madden curse”…
• Each year, a top NFL football player is picked to
appear on the cover of the Madden NFL video
game
• That player often doesn’t
play as well in the following
season
• Is the cover “cursed”?

Shrinkage
• What’s needed to be one of the top NFL players
in a season?
• You have to be a good player
• Genuine predictor (signal)
• And, luck on your side
• Random chance or error
• Top-performing player probably
very good and very lucky
• The next season…
• Your skill may persist
• Random chance probably won’t
• Regression to the mean
• Madden video game cover imperfectly predicts next
season’s performance because it was partly based
on random error
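
Regression to the mean is easy to reproduce in a simulation. The sketch below is purely hypothetical: each player's season is modeled as stable skill plus independent luck, with both on an arbitrary standardized scale.

```r
set.seed(4)
n_players <- 10000

skill   <- rnorm(n_players)            # persists across seasons
season1 <- skill + rnorm(n_players)    # skill + this season's luck
season2 <- skill + rnorm(n_players)    # same skill, fresh luck

# "Cover athletes": the top 1% of performers in season 1
top <- season1 >= quantile(season1, 0.99)

print(c(season1 = mean(season1[top]),
        season2 = mean(season2[top]),
        league  = mean(season2)))
```

The top season-1 players are still above the league average in season 2 (skill persists), but clearly below their own season-1 level (luck does not). No curse is required.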

Shrinkage
• Our estimates (& any choice of variables
based on them) always partially reflect random
chance in the dataset we used to obtain them
• Won’t fit any later data set quite
as well … shrinkage
• Problem when we’re using the
data to decide the model

• “If you use a sample to construct a model, or to
choose a hypothesis to test, you cannot make a
rigorous scientific test of the model or the hypothesis
using that same sample data.”
(Babyak, 2004, p. 414)

Shrinkage—Examples
• Relations that we observe between a predictor
variable and a dependent variable might simply
be capitalizing on random chance
• U.S. government puts out 45,000 economic
statistics each year (Silver, 2012)
• Can we use these to predict whether US economy
will go into recession?
• With 45,000 predictors, we are very likely to find a
spurious relation by chance
• Especially w/ only 15
recessions since
the end of WW II

• Significance tests try to address this … but with
45,000 predictors, we are likely to find significant
effects by chance (5% Type I error rate at ɑ=.05)
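
A small simulation shows how many "significant" predictors pure noise can produce. The setup is hypothetical: 2,000 random predictors (rather than 45,000, to keep it fast), 60 observations, and a continuous outcome standing in for the economic indicator. None of the predictors is truly related to the outcome.

```r
set.seed(5)
n_obs        <- 60
n_predictors <- 2000

outcome <- rnorm(n_obs)   # the thing we are trying to predict

# Correlate each pure-noise predictor with the outcome; keep the p-value
p_values <- replicate(n_predictors,
                      cor.test(rnorm(n_obs), outcome)$p.value)

n_significant <- sum(p_values < .05)
print(n_significant)                  # roughly 5% of 2,000 expected
print(n_significant / n_predictors)
```

Reporting only the strongest of these correlations would look impressive while reflecting nothing but sampling error.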

• Adak Island, Alaska
• Daily temperature here predicts
stock market activity!
• r = -.87 correlation with the price
of a specific group of stocks!
• Completely true—I’m not making this up!
• Problem with this:
• With thousands of weather stations & stocks, easy to find a
strong correlation somewhere, even if it’s just sampling error
• Problem is that this factoid doesn’t reveal all of the other (non-
significant) weather stations & stocks we searched through
• Would only be impressive if this hypothesis continued to be
true on a new set of weather data & stock prices
Vul et al., 2009

• “Puzzlingly high correlations” in some fMRI work
• Correlate each voxel in a brain scan with a behavioral
measure (e.g., personality survey)
• Restrict the analysis to voxels where
the correlation is above some threshold
• Compute final correlation in this region
with behavioral measure—very high!
• Problem: Voxels were already chosen based on
those high correlations
• Includes sampling error favoring the correlation but
excludes error that doesn’t
Vul et al., 2009

Shrinkage—Solutions
• One solution: Select model(s) in advance
(perhaps even pre-registered)
• A theory is valuable for this
• Adak Island example is implausible in part because there’s
no causal reason why an island in Alaska would relate to
stock prices
“Just as you do not need to know exactly how a car engine
works in order to drive safely, you do not need to
understand all the intricacies of the economy to accurately
read those gauges.” – Economic forecasting firm ECRI
(quoted in Silver, 2012)

• Not driven purely by the data or by chance if we have an a
priori reason to favor this variable
“There is really nothing so practical as a good theory.”
-- Social psychologist Kurt Lewin (Lewin’s Maxim)

• Based on some other measure (e.g., another brain
scan)
• Based on research design
• For factorial experiments, typical to include all
experimental variables and interactions
• Research design implies you were interested in all of these

• For more exploratory analyses: Show that the
finding replicates
• On a second dataset
• Test whether a model obtained from one subset of the data
applies to another subset (cross-validation)
• e.g., training and test sets
• A better version: Do this with
many randomly chosen subsets
• Monte Carlo methods
• Reading on Canvas for some
general ways to do this in R
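
Cross-validation with many random splits could be sketched as follows. It is a base-R illustration on simulated data (hypothetical variable names, lm() rather than lmer()), not the method from the Canvas reading: a small model is compared with a big one that also contains 20 pure-noise predictors.

```r
set.seed(6)
n <- 200
dat <- data.frame(exercise = rnorm(n), support = rnorm(n))
dat$lifespan <- 75 + 2 * dat$exercise + 1 * dat$support + rnorm(n, sd = 5)

# Add 20 predictors that are unrelated to lifespan
noise <- as.data.frame(matrix(rnorm(n * 20), nrow = n, ncol = 20))
names(noise) <- paste0("noise", 1:20)
dat <- cbind(dat, noise)

# Root mean squared prediction error of a model on a given dataset
rmse <- function(model, newdata) {
  sqrt(mean((newdata$lifespan - predict(model, newdata))^2))
}

# One random split: fit on the training half, evaluate on both halves
one_split <- function() {
  train_rows <- sample(n, size = n / 2)
  train <- dat[train_rows, ]
  test  <- dat[-train_rows, ]
  small <- lm(lifespan ~ exercise + support, data = train)
  big   <- lm(lifespan ~ ., data = train)
  c(small_train = rmse(small, train), big_train = rmse(big, train),
    small_test  = rmse(small, test),  big_test  = rmse(big, test))
}

# Repeat over many randomly chosen subsets and average
results <- replicate(200, one_split())
avg <- rowMeans(results)
print(round(avg, 2))
```

On the training halves the big model always looks at least as good; on the held-out halves its advantage disappears, which is shrinkage made visible.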
