Poisson GAMs and GLMMs for
observational & clinical trial data
CHRISTOS ARGYROPOULOS MD, PHD, FASN
ASSOCIATE PROFESSOR
CHIEF, NEPHROLOGY, DEPARTMENT OF INTERNAL MEDICINE
UNIVERSITY OF NEW MEXICO HEALTH SCIENCES CENTER
Data-driven extraction of patterns in survival analysis
April 29th, 2022, QuantumBlack Research Talk
Disclosures
Consultation fees: Bayer, Otsuka, Baxter, Quanta
Writing support: Astra Zeneca
Research Support: Dialysis Clinic, Inc
Outline of the talk
Generalized Additive Models & Pattern Discovery in Time to Event (“Survival”) Analysis
Case Studies
Extensions & Connections for Big Data Analytics and Bayesian Analyses
Pattern Discovery in
Time to Event
(“Survival”) Analysis
THE GENERALIZED ADDITIVE MODEL THAT COULD
Motivation for GAMs
Secondary analysis of a randomized trial of dialysis devices in 2006
An interaction between an intervention protocol modification (at a known time point) and treatment suggested that the intervention effects were not constant over time
Were there other "secular trends/drift"?
A subgroup effect was also postulated by independent commentary when the trial was published
The clinical questions could be answered by coming up with methods that could support:
1. More than one time scale in the dataset
2. Data-guided exploration for non-constancy of treatment effect in subgroups
Survival Analysis 101
Analyses the time (t) until an event of interest has occurred
Implies a time scale with a well-defined origin (t=0)
Some things that make survival analysis different from all other data analyses:
1. Outcome (time to event) can vary numerically based on the choice of the time scale
2. Not every observational/experiment unit in the database will experience the event of interest
3. Observational units may enter/leave the dataset at arbitrary time points
4. There can be other events (not of primary interest) that interfere/compete with observing the
event of interest
Survival Analysis as a Mixed
Outcome Data Analysis Problem
Instead of assuming that we only have a collection of continuous “times”, assume we
have a collection of times (t) and event indicators (δ, δ=1, experienced the event of
interest, δ=0 did not experience the event of interest, censored)
The analyst shifts attention to jointly modeling t, δ (a mixed discrete/continuous outcome
model)
A missing data perspective can be used to handle the modeling of t, δ depending on
how individuals come under and leave observation
The missing data perspective forces one to consider mechanisms for generation of the
mixed discrete and continuous data in the context of:
1. the primary time scale that is relevant to the analyst
2. ALL other time scales that are relevant to understand the phenomenon under study
Using Lexis Diagrams to Think About
Generation of Time to Event Datasets
Multiple time scales are implicit in ALL time to event datasets
MANY TIME-SCALES ARE IMPLICIT IN ANY SURVIVAL DATASET
◦ Disease scale (D)
◦ Study scale (F)
◦ Calendar time (c)
◦ Age, ….
"Age-period-cohort" type of structure in any survival dataset
Relevant to BOTH clinical trials AND observational datasets
USUALLY IGNORED OR IMPERFECTLY CAPTURED
BUT THE REAL-WORLD FLOWS ALONG THE ARROW OF TIME
•Long studies
•Changing standards of care during the study
•Secular changes that affect the context (organizations, regulatory framework) of e.g. health care delivery
•Disruptive events, e.g. a pandemic or a financial crisis
•The arrow of time & the other time scales are features that are best not ignored even if the primary interest is on the study scale
Modeling a single time scale
Consider a dataset with N survival
observations and a single time scale (study
time)
Individuals come into observation at times E
(not necessarily equal to zero)
They stay under observation until time F (so
the total observation time is F-E)
Indicators of the occurrence of the event of
interest are found in the set D
If all individuals were to experience the
event of interest, then optimization of the
density function f(F) with respect to
parameters of interest β would allow
inference and prediction
In the presence of censoring, the times of the
event of interest for some individuals are not
observed
It is only known that this time is larger than the
last recorded time (“censoring time”) that these
individuals had not experienced the event of
interest
If the distribution of survival times does not
provide any information about the censoring
times, then we have non-informative censoring
Informally, individuals drop out of observation
for reasons not related to the study/observation
(physicists may object to the general validity of
this assumption)
Non-informative censoring is an assumption
that must be verified using subject matter
input/dataset design considerations
Logarithmic derivatives to the rescue
Density function: $f(F)$
Survival function: the complement of the integrated density
$$S(F) = 1 - \int_0^F f(t)\,dt$$
Hazard function: the (negative) logarithmic derivative of the survival function
$$h(F) = -\left.\frac{d \log S(t)}{dt}\right|_{t=F} = \frac{f(F)}{S(F)}$$
Cumulative hazard function: the integral of the hazard function
$$H(F) = \int_0^F h(t)\,dt$$
Target of optimization is the likelihood of the data
$$\prod_{i=1}^{N} \frac{f(F_i)^{\delta_i}\, S(F_i)^{1-\delta_i}}{S(E_i)} = \prod_{i=1}^{N} h(F_i)^{\delta_i}\, \exp\!\left(-\int_{E_i}^{F_i} h(t)\,dt\right)$$
A major innovation in handling such expressions was the introduction of the partial & profile likelihoods and the proportional hazards model by Sir David Cox
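As a quick numerical sanity check of the identity above, here is a minimal base-R sketch for a single left-truncated, right-censored Weibull lifetime; the shape/scale parameters and the times E_i, F_i are arbitrary illustrations:

```r
## Check f(F)^d * S(F)^(1-d) / S(E)  ==  h(F)^d * exp(-int_E^F h(t) dt)
## for one Weibull lifetime; parameters and times are arbitrary.
shape <- 1.3; scale <- 5
f <- function(t) dweibull(t, shape, scale)                      # density
S <- function(t) pweibull(t, shape, scale, lower.tail = FALSE)  # survival
h <- function(t) f(t) / S(t)                                    # hazard
E_i <- 0.5; F_i <- 3.2                                          # entry/exit times

for (d in c(1, 0)) {                                            # event / censored
  lhs <- f(F_i)^d * S(F_i)^(1 - d) / S(E_i)
  rhs <- h(F_i)^d * exp(-integrate(h, E_i, F_i)$value)
  cat(sprintf("delta = %d: %.6f vs %.6f\n", d, lhs, rhs))
}
```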
That ‘70s show
APPROACHES
1. Piecewise exponential model (PEM)
2. Connection to Generalized Linear Models (GLMs)
Discrete time logistic regression
Poisson regression (PR)
Likelihood optimization for PEM/PR ⇔ profile likelihood
for the Cox Proportional Hazards (CPH) model
Unified parametric framework for CPH, logistic regression,
PR
Time is a covariate – handled like any other (so we can have
more than one?)
SOME KEY PUBLICATIONS
Breslow N (1972) Contribution to the discussion on the paper of D.R. Cox : “Regression Models
and Life-Tables.” J R Stat Soc Ser B Methodol 34: 216–217.
Breslow N (1974) Covariance analysis of censored survival data. Biometrics 30: 89–99.
Holford TR (1976) Life tables with concomitant information. Biometrics 32: 587–597.
Holford TR (1980) The analysis of rates and of survivorship using log-linear models. Biometrics
36: 299–305.
Clayton DG (1983) Fitting a General Family of Failure-Time Distributions using GLIM. J R Stat
Soc Ser C Appl Stat 32: 102–109.
Aitkin M, Clayton D (1980) The Fitting of Exponential, Weibull and Extreme Value Distributions
to Complex Censored Survival Data Using GLIM. J R Stat Soc Ser C Appl Stat 29: 156–163.
Peduzzi P, Holford T, Hardy R (1979) A computer program for life table regression analysis with
time dependent covariates. Comput Programs Biomed 9: 106–114.
Whitehead J (1980) Fitting Cox’s Regression Model to Survival Data using GLIM. J R Stat Soc
Ser C Appl Stat 29: 268–275.
Efron B (1988) Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve. J Am Stat
Assoc 83: 414–425.
Trivia: First application of the CPH in a NIH trial (1981) used a
GLM/software published independently in JASA: Laird N, Olivier
D (1981) Covariance Analysis of Censored Survival Data Using Log-
Linear Analysis Techniques. J Am Stat Assoc 76: 231–240.
Back to the basics to link the past to the future
Survival likelihood for right censored/left truncated data:
$$\prod_{i=1}^{N} \frac{f(F_i)^{\delta_i}\, S(F_i)^{1-\delta_i}}{S(E_i)} = \prod_{i=1}^{N} h(F_i)^{\delta_i}\, \exp\!\left(-\int_{E_i}^{F_i} h(t)\,dt\right)$$
Numerical quadrature under a monotonic transformation of the hazard $g(h)$:
$$\int_{E_i}^{F_i} h(t)\,dt = \sum_{j=1}^{n} w_{i,j}\, h(t_{i,j}) + R_n\!\left(F_i, E_i; g(h)\right)$$
Poisson GLM likelihood kernel:
$$\prod_{i=1}^{N} \frac{f(F_i)^{\delta_i}\, S(F_i)^{1-\delta_i}}{S(E_i)} \approx \prod_{i=1}^{N}\prod_{j=1}^{n_i} h(t_{i,j})^{d_{i,j}} \times e^{-h(t_{i,j})\, w_{i,j}}$$
where $h(t_{i,j})$ is the hazard at the node, $d_{i,j}$ the event indicator, and $w_{i,j}$ the weight of the quadrature node
Why didn’t the 70s catch on?
BAD HAIRCUT STYLES? OR SOMETHING ELSE?
Computational bottlenecks:
Trapezoid rule used for quadrature
Time split at each unique time in the dataset
0.5 × (N² + N) records
Top-of-the-line memory chip in 1973 was the
Mostek MK4096 4 kbit DRAM
Cox’s model shifted focus to hazard
modeling, while ignoring the integrals
Counting processes/semiparametrics was
the way one got promoted in academia
Back to the
21st century
Computers are faster and have more memory
There are more efficient ways to integrate functions than the
trapezoidal rule
Trials and observational datasets are getting more complex and
larger
Multistakeholder needs of clinical and drug development related
datasets
We want to “do more stuff” with our data:
• Inference (about the patterns)
• Prediction (about the future)
Best not to waste degrees of freedom during
estimation/inference
Best not to leave functional specifications to the analyst if the
subject matter experts will not commit
Embedding numerical integration
into likelihood optimization
Need more efficient integration rules than the
trapezoidal rule even with modern computers
Interpolatory Gauss-Lobatto (GL) rules preserve
the end-points (entry and exit times) as nodes
of the integration scheme, aiding interpretation
of the resulting approximate likelihoods
Positions of the nodes and the associated
weights can be obtained by solving a tridiagonal
(eigen)system in the interval $[-1, 1]$
Convergence is exponential in the number of
nodes
Memory requirements grow as $O(N)$, rather
than $O(N^2)$
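The node positions and weights are tabulated for small n; a minimal R sketch of a 5-point Gauss-Lobatto rule applied to the cumulative hazard of an illustrative log-linear hazard (the hazard and the entry/exit times are arbitrary choices):

```r
## 5-point Gauss-Lobatto quadrature of a cumulative hazard over [E_i, F_i].
## Nodes/weights are the standard tabulated values on [-1, 1]; the end-points
## are themselves nodes, so entry and exit times are preserved.
x <- c(-1, -sqrt(3/7), 0, sqrt(3/7), 1)       # Lobatto nodes on [-1, 1]
w <- c(1/10, 49/90, 32/45, 49/90, 1/10)       # matching weights (sum to 2)

E_i <- 0.5; F_i <- 3.2                        # entry/exit times (illustrative)
t_ij <- (F_i - E_i)/2 * x + (F_i + E_i)/2     # map nodes to [E_i, F_i]
w_ij <- (F_i - E_i)/2 * w                     # rescale weights

h <- function(t) exp(-1 + 0.4 * t)            # an illustrative log-linear hazard
c(GL5   = sum(w_ij * h(t_ij)),                # quadrature approximation
  exact = integrate(h, E_i, F_i)$value)       # reference value
```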
Of nuts, bolts and numerical errors
IMPLEMENTATION
Create record of n pseudo-observations for
each observation in the dataset
The first n-1 pseudo-observations for each
record have an event indicator of zero
The last pseudo-observation has an event
indicator equal to that of the corresponding
observation
The “inflated” dataset is fit with software for
Poisson regression
ERROR ANALYSIS
Total error incurred by approximating the N numerical
integrals via GL is bounded by
$$\sum_{i=1}^{N} e^{R_n(F_i,E_i;g(h))} \le N \times \max_i\, e^{R_n(F_i,E_i;g(h))}$$
For the GL rule, the error is
$$R_n(F_i, E_i; g(h)) = -\underbrace{\frac{n(n-1)^3\left[(n-2)!\right]^4}{(2n-1)\left[(2n-2)!\right]^3}}_{A}\, \underbrace{(F_i - E_i)^{2n-1}}_{B}\, \left[g(h)\right]^{(2n-2)}(\xi), \quad E_i < \xi < F_i$$
Unless the log-hazard behaves locally as a polynomial, the
last (derivative) term can increase and catch up with the error
reduction for large $n$ implied by the first term
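Putting the quadrature and the pseudo-data expansion together, a minimal, hedged R sketch on simulated data, implementing the IMPLEMENTATION recipe above (all names and simulation settings are illustrative; the 5-point Lobatto nodes/weights are reused from the earlier sketch, and mgcv's gam() fits the Poisson approximation):

```r
## Expand each subject into n pseudo-observations at the Gauss-Lobatto nodes
## and fit with mgcv's Poisson GAM (illustrative simulated data).
library(mgcv)
set.seed(1)
N  <- 500
x  <- rbinom(N, 1, 0.5)                                   # binary "treatment"
Ti <- rweibull(N, shape = 1.3, scale = 5 * exp(-0.3 * x)) # latent event times
Ci <- runif(N, 0.5, 8)                                    # censoring times
Fi <- pmin(Ti, Ci); di <- as.numeric(Ti <= Ci)            # exit time & indicator

gl <- list(x = c(-1, -sqrt(3/7), 0, sqrt(3/7), 1),        # 5-point Lobatto rule
           w = c(1/10, 49/90, 32/45, 49/90, 1/10))
n  <- length(gl$x)
ped <- data.frame(                                        # the "inflated" dataset
  id = rep(seq_len(N), each = n),
  t  = as.vector(vapply(Fi, function(f) f / 2 * gl$x + f / 2, numeric(n))),
  w  = as.vector(vapply(Fi, function(f) f / 2 * gl$w,         numeric(n))),
  x  = rep(x, each = n),
  d  = as.vector(vapply(di, function(d) c(rep(0, n - 1), d),  numeric(n))))

## d_ij ~ Poisson(h(t_ij) * w_ij): smooth log baseline hazard + offset log(w)
fit <- gam(d ~ s(t) + x + offset(log(w)), family = poisson(), data = ped)
summary(fit)                     # exp(coef["x"]) approximates the hazard ratio
```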
Poisson Generalized Additive Models
(PGAMs) for time to event analysis
WHAT HAVE WE ACHIEVED SO FAR
Provided a rigorous, computationally scalable
transformation of time-to-event analysis to a Poisson
GLM
An analyst, given a fixed computational budget, has full
control over the numerical error of the approximation
GLMs are extremely well developed
1. Computationally (weighted linear solvers)
2. Theoretically (uncertainty quantification/propagation)
WHAT WE WOULD LIKE TO DO NEXT
Explore formulations that allow the analyst to verify and
explore patterns in the data
Verification: implies a given parametric form for the
relationships between features (covariates) and
outcomes (outputs)
Exploration/discovery: let the data define the patterns
Generalized Additive Models (GAM) satisfy both
goals
For survival analysis, we specify a GAM regression
structure on the log-hazard function & use the Poisson
GLM to fit (PGAM)
Flexible log-hazard modeling framework
supports diverse analytic goals
Relative Risk Regression (Cox-like): $\log h(t_{i,j}) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta}$
Stratified Relative Risk model (baseline effect of the primary time scale differs by group): $\log h(t_{i,j}) = \sum_{k=1}^{S} \lambda_k(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta}$
Time-varying effects: $\log h(t_{i,j}) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta} + \beta_T(t)$
Spatiotemporal correlations: $\log h(t_{k,i,j}) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta} + \mathbf{z}_k\mathbf{b}, \quad \mathbf{b}\sim N(\mathbf{0}, \mathbf{G})$
Multiple time scales (e.g., calendar time $c$): $\log h(t_{i,j}, c) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta} + f(c)$
Flexible subgroup analyses: $\log h(t_{i,j}, y) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta} + f(y)\times\beta_k$
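A hedged sketch of how these structures might be expressed in mgcv's formula syntax on the expanded pseudo-data of the earlier sketch (the variable names t, w, x, group, id, calendar, y, trt are illustrative, not from the original deck):

```r
## Possible mgcv formulas for the log-hazard structures above
## (illustrative variable names; fit on the "inflated" pseudo-data):
# d ~ s(t) + x + offset(log(w))                     # relative risk (Cox-like)
# d ~ s(t, by = group) + group + x + offset(log(w)) # stratified baseline hazard
# d ~ s(t) + x + s(t, by = x) + offset(log(w))      # time-varying effect of x
# d ~ s(t) + x + s(id, bs = "re") + offset(log(w))  # random effects / frailty
# d ~ s(t) + s(calendar) + x + offset(log(w))       # second (calendar) time scale
# d ~ s(t) + s(y, by = trt) + offset(log(w))        # flexible subgroup analysis
```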
A high-level summary of GAMs
Procedures that approximate functions, e.g. $\lambda_0(t_{i,j})$, $f(c)$, given as a linear combination of
known basis functions $b_j(x)$ and unknown coefficients $\theta_j$:
$$f(c) = \sum_{j=1}^{k} b_j(c)\,\theta_j = \mathbf{C}\boldsymbol{\theta}$$
Examples of basis functions: polynomials, splines (cubic, B, P), tensor product
smooths, random effects, Gaussian random fields
The representation comes with an associated measure of the function's smoothness, $J(f) = \boldsymbol{\theta}^T\mathbf{S}\boldsymbol{\theta}$, where $\mathbf{S}$ is a dataset- and basis-specific positive semi-definite matrix
The penalty $J(f)$ has a null space of functions that are unpenalized
Learning in GAMs ⟺ estimate $\theta_j$ from the data AND the optimal degree of smoothing for
the representation of each function from the data, using an appropriate GLM family (for
survival analysis this is either a Poisson or a logistic family)
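A minimal mgcv sketch materializing one basis/penalty pair (the covariate and basis choice are illustrative); smoothCon() exposes the basis matrix C and penalty matrix S described above:

```r
## Materialize a basis/penalty pair for one smooth with mgcv::smoothCon.
library(mgcv)
set.seed(2)
dat <- data.frame(c0 = runif(200))                   # illustrative covariate
sm  <- smoothCon(s(c0, bs = "cr", k = 10), data = dat)[[1]]
C <- sm$X                  # 200 x 10 basis matrix:  f(c) = C %*% theta
S <- sm$S[[1]]             # 10 x 10 penalty matrix: J(f) = t(theta) %*% S %*% theta
ev <- eigen(S, symmetric = TRUE)$values
sum(ev < 1e-8 * max(ev))   # null-space dimension: unpenalized (linear) functions
```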
GAMs as Generalized Linear Mixed Models I
A standard GAM for a single observation $y_i$ from an exponential family with mean $\mu_i$
and scale $\phi$, $EF(\mu_i, \phi)$, is a model for $\mu_i$ given a) parametric features ("covariates") $\mathbf{x}_i$,
b) smooth functions of flexible (possibly vector valued) parametric features $\mathbf{c}_i$, and c) a link
function $g(\cdot)$:
$$g(\mu_i) = \mathbf{x}_i\boldsymbol{\beta} + \sum_j f_j(c_{i,j}), \qquad y_i \sim EF(\mu_i, \phi)$$
Learning/estimation is via maximization of the penalized log-likelihood:
$$\sum_i \log EF(\mu_i, \phi) - \frac{1}{2}\sum_j \lambda_j\, \boldsymbol{\theta}_j^T \mathbf{S}_j \boldsymbol{\theta}_j$$
GAMs as Generalized Linear Mixed Models II
Perform an (implicit) eigen-decomposition $\mathbf{S}_j = \mathbf{U}_j\mathbf{D}_j\mathbf{U}_j^T$ and, after carrying out steps 1-5, rewrite the log-likelihood to be optimized as
$$\sum_i \log EF(\mu_i, \phi) - \frac{1}{2}\sum_j \underbrace{\lambda_j\, \boldsymbol{\theta}_{j,R}^T \mathbf{D}_j^{+} \boldsymbol{\theta}_{j,R}}_{\log MVN\left(\mathbf{0},\, (\mathbf{D}_j^{+})^{-1}/\lambda_j\right)}$$
1. Collect the positive eigenvalues into the diagonal matrix $\mathbf{D}_j^{+}$
2. Partition the eigenvector matrix $\mathbf{U}_j = [\mathbf{U}_j^{+} : \mathbf{U}_j^{0}]$
3. Reparameterize via the linear system $\mathbf{U}_j^T\boldsymbol{\theta}_j = [\boldsymbol{\theta}_{j,R}^T, \boldsymbol{\theta}_{j,F}^T]^T$
4. Rewrite the smooth $f_j(c) = \sum_{l=1}^{k} b_{j,l}(c)\,\theta_{j,l} = \mathbf{C}_j\boldsymbol{\theta}_j = \mathbf{C}_j\mathbf{U}_j^{0}\boldsymbol{\theta}_{j,F} + \mathbf{C}_j\mathbf{U}_j^{+}\boldsymbol{\theta}_{j,R}$
5. Rewrite the penalty as $\boldsymbol{\theta}_j^T\mathbf{S}_j\boldsymbol{\theta}_j = \boldsymbol{\theta}_{j,R}^T\mathbf{D}_j^{+}\boldsymbol{\theta}_{j,R}$
⟹ a Generalized Linear (Gaussian) Mixed Model
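A minimal numerical sketch of steps 1-5, reusing the penalty S and basis C from the smoothCon sketch above:

```r
## Steps 1-5 carried out numerically on the penalty S (and basis C) from
## the smoothCon sketch above.
eg  <- eigen(S, symmetric = TRUE)
pos <- eg$values > 1e-8 * max(eg$values)
Dp  <- diag(eg$values[pos])          # step 1: D_j^+ (positive eigenvalues)
Up  <- eg$vectors[,  pos]            # step 2: U_j^+
U0  <- eg$vectors[, !pos]            #         U_j^0 (penalty null space)
## Steps 3-4: theta = Up %*% theta_R + U0 %*% theta_F, so the smooth splits into
Z <- C %*% Up                        # a penalized   ("random") design and
X <- C %*% U0                        # an unpenalized ("fixed") design.
## Step 5: t(theta) %*% S %*% theta == t(theta_R) %*% Dp %*% theta_R,
## i.e. theta_R ~ N(0, solve(Dp)/lambda): a Gaussian random effect -> GLMM.
```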
Numerical methods for fitting PGAMs
1. If the degree of smoothing $\lambda_j$ is known:
Penalized Iteratively Re-weighted Least
Squares (P-IRLS)
2. If the degree of smoothing is not known:
◦ Leave-one-out cross validation
◦ Generalized cross validation
◦ Marginal likelihood and Restricted/Extended
Maximum Likelihood (REML)
Optimizations may be applied to
linearizations or to the non-linearized
versions of the GLMM ("direct nested
iteration")
Extremely rich (and esoteric) literature exists
about the linear algebra form of the
updating equations
Log-linear (at the point of analysis) framework can
seamlessly support prediction about non-linear
functionals using simple Monte Carlo
Relative Risk: $RR(t) = \dfrac{1 - S_A(t)}{1 - S_B(t)}$
Relative Survival: $R(t) = \dfrac{S_A(t)}{S_B(t)}$
Absolute Risk Difference (actuarial/pharmacoeconomic applications): $ARD(t) = S_B(t) - S_A(t)$
(Restricted) Mean Survival Time (demography/reliability analysis): $RMST(t) = \displaystyle\int_0^t S_A(u)\,du - \int_0^t S_B(u)\,du$
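A hedged sketch of the Monte Carlo step with mgcv, reusing `fit` from the data-expansion sketch (assumed in scope): simulate coefficient draws from the Bayesian posterior N(coef, Vp) returned by vcov(), push them through lpmatrix predictions, and integrate the implied survival curves; the grid size and number of draws are arbitrary choices:

```r
## Posterior Monte Carlo for the RMST difference between the x = 1 and
## x = 0 arms of `fit` (from the earlier data-expansion sketch).
library(MASS)
tstar <- 5
grid  <- seq(1e-3, tstar, length.out = 200); dt <- diff(grid)[1]
nd    <- function(xval) data.frame(t = grid, x = xval, w = 1)   # log(w) = 0
Xp1   <- predict(fit, newdata = nd(1), type = "lpmatrix")
Xp0   <- predict(fit, newdata = nd(0), type = "lpmatrix")

B <- mvrnorm(2000, coef(fit), vcov(fit))  # draws from the Bayesian posterior
rmst <- function(b, Xp) {                 # S(t) = exp(-H(t)); RMST = int_0^t* S
  sum(exp(-cumsum(exp(Xp %*% b)) * dt)) * dt
}
d_rmst <- apply(B, 1, function(b) rmst(b, Xp1) - rmst(b, Xp0))
quantile(d_rmst, c(0.025, 0.5, 0.975))    # point estimate & 95% interval
```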
Case Studies
Numerical Error of Gauss-Lobatto for integrating a single life-time [figure]
Survival Point Estimates: PGAM vs. Kaplan-Meier [figure]
Hazard Ratio Estimates: PGAM vs. Cox model [figure]
PGAM accurately estimates the RMST under different functional deviations from the baseline hazard [figure]
Cox vs. PGAM in a cancer trial
Hazard Ratios
Model  HR    LCI   UCI
Cox    0.73  0.64  0.83
GL3    0.68  0.59  0.77
GL4    0.77  0.67  0.88
GL5    0.72  0.64  0.82
GL6    0.72  0.63  0.82
GL7    0.73  0.64  0.83
GL8    0.73  0.64  0.83
GL9    0.73  0.64  0.83
GL10   0.73  0.64  0.83
GL11   0.73  0.64  0.83
GL12   0.73  0.64  0.83
GL13   0.73  0.64  0.83
GL14   0.73  0.64  0.83
GL15   0.73  0.64  0.83
GL16   0.73  0.64  0.83
GL17   0.73  0.64  0.83
GL18   0.73  0.64  0.83
GL19   0.73  0.64  0.83
GL20   0.73  0.64  0.83
Cox v.s. PGAM in a nephrology trial
Variable                Cox HR (95% CI)     GL7 HR (95% CI)     GL10 HR (95% CI)    GL20 HR (95% CI)
High Kt/V               0.96 (0.84 - 1.10)  0.96 (0.84 - 1.09)  0.96 (0.84 - 1.09)  0.96 (0.84 - 1.09)
High Flux               0.92 (0.81 - 1.05)  0.92 (0.81 - 1.06)  0.92 (0.81 - 1.06)  0.92 (0.81 - 1.06)
Age (per 10)            1.41 (1.33 - 1.50)  1.42 (1.33 - 1.51)  1.42 (1.33 - 1.51)  1.42 (1.33 - 1.51)
Female                  0.85 (0.73 - 0.98)  0.85 (0.73 - 0.98)  0.85 (0.73 - 0.98)  0.85 (0.73 - 0.98)
Black                   0.77 (0.66 - 0.91)  0.77 (0.66 - 0.91)  0.77 (0.66 - 0.91)  0.77 (0.66 - 0.91)
Diabetic                1.29 (1.11 - 1.50)  1.29 (1.11 - 1.50)  1.29 (1.11 - 1.50)  1.29 (1.11 - 1.50)
Duration                1.04 (1.02 - 1.06)  1.04 (1.02 - 1.05)  1.04 (1.02 - 1.05)  1.04 (1.02 - 1.05)
ICED                    1.37 (1.25 - 1.50)  1.38 (1.27 - 1.51)  1.38 (1.27 - 1.51)  1.38 (1.27 - 1.51)
Alb (per 0.5 g/dl)      0.51 (0.43 - 0.62)  0.53 (0.45 - 0.64)  0.53 (0.45 - 0.64)  0.53 (0.45 - 0.64)
Alb × Time interaction  1.11 (1.04 - 1.19)  1.09 (1.02 - 1.16)  1.09 (1.02 - 1.16)  1.09 (1.02 - 1.16)
Modeling of complex datasets to account for
secular trends & conduct subgroup analyses
TIME SCALE INTERACTIONS [figure]
COVARIATE BY TREATMENT BY DISEASE INTERACTIONS [figure]
PGAMs/GLMMs as approximate Bayesian computations in survival analyses [figure]
PGAMs/GLMMs for scalable inference in large datasets [figure]
Extensions & Connections for
Big Data Analytics and
Bayesian Analyses
Software libraries available for
PGAM/GLMM analyses
mgcv in R (by Simon Wood; extremely fast; requires manual generation of the zero-inflated
dataset; has one of the most versatile collections of smoothers)
pammtools in R (by Andreas Bender; fast; automatic generation of the zero-inflated dataset)
hglm in R (by Alam, Rönnegård, Shen; requires manual generation of the zero-inflated
dataset and of the random effects matrix used to penalize the fit)
gamlss in R (originally by Rigby and Stasinopoulos, now a multinational team of authors; a
versatile tool that allows for additive composition of non-linear function approximations, e.g.
Neural Networks, and offers as many choices as TMB)
glmmTMB in R & TMB in R (extreme flexibility in the selection of optimizers; automatic
differentiation (AD) for sparse Hessians to calculate standard errors; no coding of matrix
algebra calculations; the user only specifies the likelihood in C++ and compiles a DLL callable
from R)
In our hands, TMB-based solutions have scaled much better with size than all other options
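For illustration, a hedged sketch of the pammtools workflow on the classic survival::veteran data (piecewise-exponential intervals rather than Gauss-Lobatto nodes; argument names follow the package documentation and should be checked against the installed version):

```r
## Automatic data expansion with pammtools, then a Poisson GAM in mgcv.
## Default cut points are the unique event times.
library(pammtools); library(mgcv); library(survival)
ped <- as_ped(data = veteran, formula = Surv(time, status) ~ age + trt)
pam <- gam(ped_status ~ s(tend) + age + trt,   # s(tend): log baseline hazard
           data = ped, family = poisson(), offset = offset)
summary(pam)
```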
The Bayesian hiding in the PGAM/GLMM closet: a 30,000 ft view
GLMMs can be given a Bayesian interpretation so that
1. Numerical results obtained by non-Bayesian PGAM
software can be treated as deterministic approximations
to a full and very expensive Bayesian computation
2. If one is fitting PGAMs/GLMMs from a non-Bayesian
perspective and the fitting is numerically unstable, then
one may fit the model using Bayesian methods e.g.
MCMC and report the latter from a non-Bayesian
perspective
3. New implementations of PGAMs can be devised that
take advantage of high-performance optimization
libraries and state-of-the art algorithmic differentiation
approaches
The Bayesian Connection Explained Through the Hierarchical Likelihood Approach of Lee and Nelder
Likelihood:
$$L\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi;\, \mathbf{Y}, \boldsymbol{\theta}_{j,R}\right) = \prod_{i=1}^{N}\prod_j EF\!\left(y_i \mid \mu_i(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,R}, \boldsymbol{\theta}_{j,F}, \mathbf{X}, \mathbf{C}), \phi\right)\, MVN\!\left(\boldsymbol{\theta}_{j,R} \mid \mathbf{0}, (\lambda_j\mathbf{D}_j^{+})^{-1}\right)$$
Assume non-informative, uniform, improper priors for $\boldsymbol{\beta}$, $\lambda_j$, $\phi$, and $\boldsymbol{\theta}_{j,F}$
Posterior for the parameters:
$$p\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi \mid \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) = \frac{L\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi;\, \mathbf{Y}, \boldsymbol{\theta}_{j,R}\right)}{\displaystyle\int L\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi;\, \mathbf{Y}, \boldsymbol{\theta}_{j,R}\right) d\boldsymbol{\beta}\, d\boldsymbol{\theta}_{j,R}\, d\lambda_j\, d\phi}$$
$$= p\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R} \mid \lambda_j, \phi, \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) \times p\!\left(\lambda_j, \phi \mid \mathbf{Y}, \mathbf{X}, \mathbf{C}\right)$$
$$= p\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F} \mid \boldsymbol{\theta}_{j,R}, \lambda_j, \phi, \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) \times p\!\left(\boldsymbol{\theta}_{j,R} \mid \lambda_j, \phi, \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) \times p\!\left(\lambda_j, \phi \mid \mathbf{Y}, \mathbf{X}, \mathbf{C}\right)$$
Nested optimizations using the Laplace approximation handle the integrals of the two marginal likelihoods, e.g.
$$p\!\left(\lambda_j, \phi \mid \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) \propto \int L\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi;\, \mathbf{Y}, \boldsymbol{\theta}_{j,R}\right) d\boldsymbol{\beta}\, d\boldsymbol{\theta}_{j,R}$$
Do we have to stick with linear
functionals in GAMs?
Nothing in theory restricts $f_j(c_{i,j})$ in $g(\mu_i) = \mathbf{x}_i\boldsymbol{\beta} + \sum_j f_j(c_{i,j})$, $y_i \sim EF(\mu_i, \phi)$, to be of the form
$$\sum_{l=1}^{k} b_l(c_{i,j})\,\theta_l$$
Non-linear functionals (e.g. Deep Neural Network structures) could be envisioned
These models operate outside the well-behaved likelihood spaces conventional GAMs live in
Selection of optimizers must be optimized:
gradient descent may not always deliver.
Think of Newton's method with AD-computed Hessians
Connections to Bayesian Neural Networks are immediate:
this is your prior
$$\exp\!\left(-\frac{1}{2}\sum_j \lambda_j\, \boldsymbol{\theta}_{j,R}^T \mathbf{D}_j^{+} \boldsymbol{\theta}_{j,R}\right) \;\propto\; \prod_j MVN\!\left(\boldsymbol{\theta}_{j,R} \mid \mathbf{0}, (\mathbf{D}_j^{+})^{-1}/\lambda_j\right)$$
Summary and some final thoughts
THEORETICAL
GAMs combine the “universal” approximation
properties of polynomials/splines with GLMs
While GAMs are additive, linear models, they can
be extended to additive compositions of non-linear
models (e.g. neural networks).
This will require ditching most existing
implementations in favor of gamlss and/or custom
developed software code
The Poisson GLM appears to be a "universal
approximator" for other lifetime distributions
GAMs and PGAMs offer a cheap and dirty way to
conduct Bayesian inference without Monte Carlo
TECHNICAL
Robust implementations of GAMs are
predominantly found in R
mgcv is currently the best option to balance
coding effort against execution time
Home-grown solutions using TMB are the only
realistic option for massive datasets
While GAMLSS allows for Machine Learning
function approximation tools (e.g. neural nets),
TMB is the one most likely to deliver scalable
results in this area due to the ability to include
black box C++ code in the model specification
Numerical libraries in Python (e.g. TensorFlow) that
implement AD should be explored for GAM
implementations
Editor's Notes
#24: We will discuss a more direct and high-level implementation based on the Bayesian interpretation of PGAMs.