Poisson GAMs and GLMMs for
observational & clinical trial data
CHRISTOS ARGYROPOULOS MD, PHD, FASN
ASSOCIATE PROFESSOR
CHIEF, NEPHROLOGY, DEPARTMENT OF INTERNAL MEDICINE
UNIVERSITY OF NEW MEXICO HEALTH SCIENCES CENTER
Data-driven extraction of patterns in survival analysis
April 29th, 2022, QuantumBlack Research Talk
Disclosures
Consultation fees: Bayer, Otsuka, Baxter, Quanta
Writing support: Astra Zeneca
Research Support: Dialysis Clinic, Inc
Outline of the talk
Generalized Additive Models & Pattern Discovery in Time to Event (“Survival”) Analysis
Case Studies
Extensions & Connections for Big Data Analytics and Bayesian Analyses
Pattern Discovery in
Time to Event
(“Survival”) Analysis
THE GENERALIZED ADDITIVE MODEL THAT COULD
Motivation for GAMs
Secondary analysis of a randomized trial of dialysis devices in 2006
An interaction between an intervention protocol modification (at a known time point) and treatment suggested that the intervention effects were not constant over time
Were there other "secular trends/drift"?
A subgroup effect was also postulated by independent commentary when the trial was published
The clinical questions could be answered by coming up with methods that could support:
1. More than one time scale in the dataset
2. Data-guided exploration for non-constancy of treatment effect in subgroups
Survival Analysis 101
Analyses the time (t) until an event of interest has occurred
Implies a time scale with a well-defined origin (t=0)
Some things that make survival analysis different from all other data analyses:
1. Outcome (time to event) can vary numerically based on the choice of the time scale
2. Not every observational/experiment unit in the database will experience the event of interest
3. Observational units may enter/leave the dataset at arbitrary time points
4. There can be other events (not of primary interest) that interfere/compete with observing the
event of interest
Survival Analysis as a Mixed
Outcome Data Analysis Problem
Instead of assuming that we only have a collection of continuous “times”, assume we
have a collection of times (t) and event indicators (δ, δ=1, experienced the event of
interest, δ=0 did not experience the event of interest, censored)
The analyst shifts attention to jointly modeling t, δ (a mixed discrete/continuous outcome
model)
A missing data perspective can be used to handle the modeling of t, δ depending on
how individuals come under and leave observation
The missing data perspective forces one to consider mechanisms for generation of the
mixed discrete and continuous data in the context of:
1. the primary time scale that is relevant to the analyst
2. ALL other time scales that are relevant to understand the phenomenon under study
Using Lexis Diagrams to Think About
Generation of Time to Event Datasets
Multiple time scales are implicit in ALL time to event datasets
MANY TIME-SCALES ARE IMPLICIT IN ANY SURVIVAL DATASET
◦ Disease scale (D)
◦ Study scale (F)
◦ Calendar time (c)
◦ Age, ….
"Age-period-cohort" type of structure in any survival dataset
Relevant to BOTH clinical trials AND observational datasets
USUALLY IGNORED OR IMPERFECTLY CAPTURED
BUT THE REAL-WORLD FLOWS ALONG THE ARROW OF TIME
•Long studies
•Changing standards of care during the study
•Secular changes that affect the context (organizations, regulatory framework) of e.g. health care delivery
•Disruptive events, e.g. a pandemic or a financial crisis
•The arrow of time & the other time scales are features that are best not ignored even if the primary interest is on the study scale
Modeling a single time scale
Consider a dataset with N survival
observations and a single time scale (study
time)
Individuals come into observation at times E
(not necessarily equal to zero)
They stay under observation until time F (so
the total observation time is F-E)
Indicators of the occurrence of the event of
interest are found in the set D
If all individuals were to experience the
event of interest, then optimization of the
density function f(F) with respect to
parameters of interest β would allow
inference and prediction
In the presence of censoring, the times of the
event of interest for some individuals are not
observed
It is only known that this time is larger than the
last recorded time (“censoring time”) that these
individuals had not experienced the event of
interest
If the distribution of survival times does not
provide any information about the censoring
times, then we have non-informative censoring
Informally, individuals drop out of observation
for reasons not related to the study/observation
(physicists may object to the general validity of
this assumption)
Non-informative censoring is an assumption
that must be verified using subject matter
input/dataset design considerations
Logarithmic derivatives to the rescue
Density function: $f(F)$
Survival function: the complement of the integrated density
$$S(F) = 1 - \int_0^F f(t)\,dt$$
Hazard function: the (negative) logarithmic derivative of the survival function
$$h(F) = -\left.\frac{d \log S(t)}{dt}\right|_{t=F} = \frac{f(F)}{S(F)}$$
Cumulative hazard function: the integral of the hazard function
$$H(F) = \int_0^F h(t)\,dt$$
Target of optimization is the likelihood of the data
$$\prod_{i=1}^{N} \frac{f(F_i)^{\delta_i}\, S(F_i)^{1-\delta_i}}{S(E_i)} = \prod_{i=1}^{N} h(F_i)^{\delta_i}\, \exp\!\left(-\int_{E_i}^{F_i} h(t)\,dt\right)$$
A major innovation in handling such expressions was the introduction of the partial & profile likelihoods and the proportional hazards model by Sir David Cox
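As a quick numerical sanity check of the identity above, here is a minimal base-R sketch for a single left-truncated, right-censored Weibull lifetime; the shape/scale parameters and the times E_i, F_i are arbitrary illustrations:

```r
## Check f(F)^d * S(F)^(1-d) / S(E)  ==  h(F)^d * exp(-int_E^F h(t) dt)
## for one Weibull lifetime; parameters and times are arbitrary.
shape <- 1.3; scale <- 5
f <- function(t) dweibull(t, shape, scale)                      # density
S <- function(t) pweibull(t, shape, scale, lower.tail = FALSE)  # survival
h <- function(t) f(t) / S(t)                                    # hazard
E_i <- 0.5; F_i <- 3.2                                          # entry/exit times

for (d in c(1, 0)) {                                            # event / censored
  lhs <- f(F_i)^d * S(F_i)^(1 - d) / S(E_i)
  rhs <- h(F_i)^d * exp(-integrate(h, E_i, F_i)$value)
  cat(sprintf("delta = %d: %.6f vs %.6f\n", d, lhs, rhs))
}
```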
That ‘70s show
APPROACHES
1. Piecewise exponential model (PEM)
2. Connection to Generalized Linear Models (GLMs)
Discrete time logistic regression
Poisson regression (PR)
Likelihood optimization for PEM/PR ⇔ profile likelihood
for the Cox Proportional Hazards (CPH) model
Unified parametric framework for CPH, logistic regression,
PR
Time is a covariate – handled like any other (so we can have
more than one?)
SOME KEY PUBLICATIONS
Breslow N (1972) Contribution to the discussion on the paper of D.R. Cox : “Regression Models
and Life-Tables.” J R Stat Soc Ser B Methodol 34: 216–217.
Breslow N (1974) Covariance analysis of censored survival data. Biometrics 30: 89–99.
Holford TR (1976) Life tables with concomitant information. Biometrics 32: 587–597.
Holford TR (1980) The analysis of rates and of survivorship using log-linear models. Biometrics
36: 299–305.
Clayton DG (1983) Fitting a General Family of Failure-Time Distributions using GLIM. J R Stat
Soc Ser C Appl Stat 32: 102–109.
Aitkin M, Clayton D (1980) The Fitting of Exponential, Weibull and Extreme Value Distributions
to Complex Censored Survival Data Using GLIM. J R Stat Soc Ser C Appl Stat 29: 156–163.
Peduzzi P, Holford T, Hardy R (1979) A computer program for life table regression analysis with
time dependent covariates. Comput Programs Biomed 9: 106–114.
Whitehead J (1980) Fitting Cox’s Regression Model to Survival Data using GLIM. J R Stat Soc
Ser C Appl Stat 29: 268–275.
Efron B (1988) Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve. J Am Stat
Assoc 83: 414–425.
Trivia: First application of the CPH in a NIH trial (1981) used a
GLM/software published independently in JASA: Laird N, Olivier
D (1981) Covariance Analysis of Censored Survival Data Using Log-
Linear Analysis Techniques. J Am Stat Assoc 76: 231–240.
Back to the basics to link the past to the future
Survival likelihood for right censored/left truncated data:
$$\prod_{i=1}^{N} \frac{f(F_i)^{\delta_i}\, S(F_i)^{1-\delta_i}}{S(E_i)} = \prod_{i=1}^{N} h(F_i)^{\delta_i}\, \exp\!\left(-\int_{E_i}^{F_i} h(t)\,dt\right)$$
Numerical quadrature under a monotonic transformation of the hazard $g(h)$:
$$\int_{E_i}^{F_i} h(t)\,dt = \sum_{j=1}^{n} w_{i,j}\, h(t_{i,j}) + R_n\!\left(F_i, E_i; g(h)\right)$$
Poisson GLM likelihood kernel:
$$\prod_{i=1}^{N} \frac{f(F_i)^{\delta_i}\, S(F_i)^{1-\delta_i}}{S(E_i)} \approx \prod_{i=1}^{N}\prod_{j=1}^{n_i} h(t_{i,j})^{d_{i,j}} \times e^{-h(t_{i,j})\, w_{i,j}}$$
where $h(t_{i,j})$ is the hazard at the node, $d_{i,j}$ the event indicator, and $w_{i,j}$ the weight of the quadrature node
Why didn’t the 70s catch on?
BAD HAIRCUT STYLES? OR SOMETHING ELSE?
Computational bottlenecks:
Trapezoid rule used for quadrature
Time split at each unique time in the dataset
0.5 × (N² + N) records
Top-of-the-line memory chip in 1973 was the
Mostek MK4096 4 kbit DRAM
Cox’s model shifted focus to hazard
modeling, while ignoring the integrals
Counting processes/semiparametrics was
the way one got promoted in academia
Back to the
21st century
Computers are faster and have more memory
There are more efficient ways to integrate functions than the
trapezoidal rule
Trials and observational datasets are getting more complex and
larger
Multistakeholder needs of clinical and drug development related
datasets
We want to “do more stuff” with our data:
• Inference (about the patterns)
• Prediction (about the future)
Best not to waste degrees of freedom during
estimation/inference
Best not to leave functional specifications to the analyst if the
subject matter experts will not commit
Embedding numerical integration
into likelihood optimization
Need more efficient integration rules than the
trapezoidal rule even with modern computers
Interpolatory Gauss-Lobatto (GL) rules preserve
the end-points (entry and exit times) as nodes
of the integration scheme, aiding interpretation
of the resulting approximate likelihoods
Positions of the nodes and the associated
weights can be obtained by solving a tridiagonal
(eigen)system in the interval $[-1, 1]$
Convergence is exponential in the number of
nodes
Memory requirements grow as $O(N)$, rather
than $O(N^2)$
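The node positions and weights are tabulated for small n; a minimal R sketch of a 5-point Gauss-Lobatto rule applied to the cumulative hazard of an illustrative log-linear hazard (the hazard and the entry/exit times are arbitrary choices):

```r
## 5-point Gauss-Lobatto quadrature of a cumulative hazard over [E_i, F_i].
## Nodes/weights are the standard tabulated values on [-1, 1]; the end-points
## are themselves nodes, so entry and exit times are preserved.
x <- c(-1, -sqrt(3/7), 0, sqrt(3/7), 1)       # Lobatto nodes on [-1, 1]
w <- c(1/10, 49/90, 32/45, 49/90, 1/10)       # matching weights (sum to 2)

E_i <- 0.5; F_i <- 3.2                        # entry/exit times (illustrative)
t_ij <- (F_i - E_i)/2 * x + (F_i + E_i)/2     # map nodes to [E_i, F_i]
w_ij <- (F_i - E_i)/2 * w                     # rescale weights

h <- function(t) exp(-1 + 0.4 * t)            # an illustrative log-linear hazard
c(GL5   = sum(w_ij * h(t_ij)),                # quadrature approximation
  exact = integrate(h, E_i, F_i)$value)       # reference value
```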
Of nuts, bolts and numerical errors
IMPLEMENTATION
Create record of n pseudo-observations for
each observation in the dataset
The first n-1 pseudo-observations for each
record have an event indicator of zero
The last pseudo-observation has an event
indicator equal to that of the corresponding
observation
The “inflated” dataset is fit with software for
Poisson regression
ERROR ANALYSIS
Total error incurred by approximating the N numerical
integrals via GL is bounded by
$$\sum_{i=1}^{N} e^{R_n(F_i,E_i;g(h))} \le N \times \max_i\, e^{R_n(F_i,E_i;g(h))}$$
For the GL rule, the error is
$$R_n(F_i, E_i; g(h)) = -\underbrace{\frac{n(n-1)^3\left[(n-2)!\right]^4}{(2n-1)\left[(2n-2)!\right]^3}}_{A}\, \underbrace{(F_i - E_i)^{2n-1}}_{B}\, \left[g(h)\right]^{(2n-2)}(\xi), \quad E_i < \xi < F_i$$
Unless the log-hazard behaves locally as a polynomial, the
last (derivative) term can increase and catch up with the error
reduction for large $n$ implied by the first term
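Putting the quadrature and the pseudo-data expansion together, a minimal, hedged R sketch on simulated data, implementing the IMPLEMENTATION recipe above (all names and simulation settings are illustrative; the 5-point Lobatto nodes/weights are reused from the earlier sketch, and mgcv's gam() fits the Poisson approximation):

```r
## Expand each subject into n pseudo-observations at the Gauss-Lobatto nodes
## and fit with mgcv's Poisson GAM (illustrative simulated data).
library(mgcv)
set.seed(1)
N  <- 500
x  <- rbinom(N, 1, 0.5)                                   # binary "treatment"
Ti <- rweibull(N, shape = 1.3, scale = 5 * exp(-0.3 * x)) # latent event times
Ci <- runif(N, 0.5, 8)                                    # censoring times
Fi <- pmin(Ti, Ci); di <- as.numeric(Ti <= Ci)            # exit time & indicator

gl <- list(x = c(-1, -sqrt(3/7), 0, sqrt(3/7), 1),        # 5-point Lobatto rule
           w = c(1/10, 49/90, 32/45, 49/90, 1/10))
n  <- length(gl$x)
ped <- data.frame(                                        # the "inflated" dataset
  id = rep(seq_len(N), each = n),
  t  = as.vector(vapply(Fi, function(f) f / 2 * gl$x + f / 2, numeric(n))),
  w  = as.vector(vapply(Fi, function(f) f / 2 * gl$w,         numeric(n))),
  x  = rep(x, each = n),
  d  = as.vector(vapply(di, function(d) c(rep(0, n - 1), d),  numeric(n))))

## d_ij ~ Poisson(h(t_ij) * w_ij): smooth log baseline hazard + offset log(w)
fit <- gam(d ~ s(t) + x + offset(log(w)), family = poisson(), data = ped)
summary(fit)                     # exp(coef["x"]) approximates the hazard ratio
```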
Poisson Generalized Additive Models
(PGAMs) for time to event analysis
WHAT HAVE WE ACHIEVED SO FAR
Provided a rigorous, computationally scalable
transformation of time-to-event analysis to a Poisson
GLM
An analyst, given a fixed computational budget, has full
control over the numerical error of the approximation
GLMs are extremely well developed
1. Computationally (weighted linear solvers)
2. Theoretically (uncertainty quantification/propagation)
WHAT WE WOULD LIKE TO DO NEXT
Explore formulations that allow the analyst to verify and
explore patterns in the data
Verification: implies a given parametric form for the
relationships between features (covariates) and
outcomes (outputs)
Exploration/discovery: let the data define the patterns
Generalized Additive Models (GAM) satisfy both
goals
For survival analysis, we specify a GAM regression
structure on the log-hazard function & use the Poisson
GLM to fit (PGAM)
Flexible log-hazard modeling framework
supports diverse analytic goals
Relative Risk Regression (Cox-like): $\log h(t_{i,j}) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta}$
Stratified Relative Risk model (baseline effect of the primary time scale differs by group): $\log h(t_{i,j}) = \sum_{k=1}^{S} \lambda_k(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta}$
Time-varying effects: $\log h(t_{i,j}) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta} + \beta_T(t)$
Spatiotemporal correlations: $\log h(t_{k,i,j}) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta} + \mathbf{z}_k\mathbf{b}, \quad \mathbf{b}\sim N(\mathbf{0}, \mathbf{G})$
Multiple time scales (e.g., calendar time $c$): $\log h(t_{i,j}, c) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta} + f(c)$
Flexible subgroup analyses: $\log h(t_{i,j}, y) = \lambda_0(t_{i,j}) + \mathbf{x}_i\boldsymbol{\beta} + f(y)\times\beta_k$
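A hedged sketch of how these structures might be expressed in mgcv's formula syntax on the expanded pseudo-data of the earlier sketch (the variable names t, w, x, group, id, calendar, y, trt are illustrative, not from the original deck):

```r
## Possible mgcv formulas for the log-hazard structures above
## (illustrative variable names; fit on the "inflated" pseudo-data):
# d ~ s(t) + x + offset(log(w))                     # relative risk (Cox-like)
# d ~ s(t, by = group) + group + x + offset(log(w)) # stratified baseline hazard
# d ~ s(t) + x + s(t, by = x) + offset(log(w))      # time-varying effect of x
# d ~ s(t) + x + s(id, bs = "re") + offset(log(w))  # random effects / frailty
# d ~ s(t) + s(calendar) + x + offset(log(w))       # second (calendar) time scale
# d ~ s(t) + s(y, by = trt) + offset(log(w))        # flexible subgroup analysis
```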
A high-level summary of GAMs
Procedures that approximate functions, e.g. $\lambda_0(t_{i,j})$, $f(c)$, given as a linear combination of
known basis functions $b_j(x)$ and unknown coefficients $\theta_j$:
$$f(c) = \sum_{j=1}^{k} b_j(c)\,\theta_j = \mathbf{C}\boldsymbol{\theta}$$
Examples of basis functions: polynomials, splines (cubic, B, P), tensor product
smooths, random effects, Gaussian random fields
The representation comes with an associated measure of the function's smoothness, $J(f) = \boldsymbol{\theta}^T\mathbf{S}\boldsymbol{\theta}$, where $\mathbf{S}$ is a dataset- and basis-specific positive semi-definite matrix
The penalty $J(f)$ has a null space of functions that are unpenalized
Learning in GAMs ⟺ estimate $\theta_j$ from the data AND the optimal degree of smoothing for
the representation of each function from the data, using an appropriate GLM family (for
survival analysis this is either a Poisson or a logistic family)
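A minimal mgcv sketch materializing one basis/penalty pair (the covariate and basis choice are illustrative); smoothCon() exposes the basis matrix C and penalty matrix S described above:

```r
## Materialize a basis/penalty pair for one smooth with mgcv::smoothCon.
library(mgcv)
set.seed(2)
dat <- data.frame(c0 = runif(200))                   # illustrative covariate
sm  <- smoothCon(s(c0, bs = "cr", k = 10), data = dat)[[1]]
C <- sm$X                  # 200 x 10 basis matrix:  f(c) = C %*% theta
S <- sm$S[[1]]             # 10 x 10 penalty matrix: J(f) = t(theta) %*% S %*% theta
ev <- eigen(S, symmetric = TRUE)$values
sum(ev < 1e-8 * max(ev))   # null-space dimension: unpenalized (linear) functions
```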
GAMs as Generalized Linear Mixed Models I
A standard GAM for a single observation $y_i$ from an exponential family with mean $\mu_i$
and scale $\phi$, $EF(\mu_i, \phi)$, is a model for $\mu_i$ given a) parametric features ("covariates") $\mathbf{x}_i$,
b) smooth functions of flexible (possibly vector valued) parametric features $\mathbf{c}_i$, and c) a link
function $g(\cdot)$:
$$g(\mu_i) = \mathbf{x}_i\boldsymbol{\beta} + \sum_j f_j(c_{i,j}), \qquad y_i \sim EF(\mu_i, \phi)$$
Learning/estimation is via maximization of the penalized log-likelihood:
$$\sum_i \log EF(\mu_i, \phi) - \frac{1}{2}\sum_j \lambda_j\, \boldsymbol{\theta}_j^T \mathbf{S}_j \boldsymbol{\theta}_j$$
GAMs as Generalized Linear Mixed Models II
Perform an (implicit) eigen-decomposition $\mathbf{S}_j = \mathbf{U}_j\mathbf{D}_j\mathbf{U}_j^T$ and, after carrying out steps 1-5, rewrite the log-likelihood to be optimized as
$$\sum_i \log EF(\mu_i, \phi) - \frac{1}{2}\sum_j \underbrace{\lambda_j\, \boldsymbol{\theta}_{j,R}^T \mathbf{D}_j^{+} \boldsymbol{\theta}_{j,R}}_{\log MVN\left(\mathbf{0},\, (\mathbf{D}_j^{+})^{-1}/\lambda_j\right)}$$
1. Collect the positive eigenvalues into the diagonal matrix $\mathbf{D}_j^{+}$
2. Partition the eigenvector matrix $\mathbf{U}_j = [\mathbf{U}_j^{+} : \mathbf{U}_j^{0}]$
3. Reparameterize via the linear system $\mathbf{U}_j^T\boldsymbol{\theta}_j = [\boldsymbol{\theta}_{j,R}^T, \boldsymbol{\theta}_{j,F}^T]^T$
4. Rewrite the smooth $f_j(c) = \sum_{l=1}^{k} b_{j,l}(c)\,\theta_{j,l} = \mathbf{C}_j\boldsymbol{\theta}_j = \mathbf{C}_j\mathbf{U}_j^{0}\boldsymbol{\theta}_{j,F} + \mathbf{C}_j\mathbf{U}_j^{+}\boldsymbol{\theta}_{j,R}$
5. Rewrite the penalty as $\boldsymbol{\theta}_j^T\mathbf{S}_j\boldsymbol{\theta}_j = \boldsymbol{\theta}_{j,R}^T\mathbf{D}_j^{+}\boldsymbol{\theta}_{j,R}$
⟹ a Generalized Linear (Gaussian) Mixed Model
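A minimal numerical sketch of steps 1-5, reusing the penalty S and basis C from the smoothCon sketch above:

```r
## Steps 1-5 carried out numerically on the penalty S (and basis C) from
## the smoothCon sketch above.
eg  <- eigen(S, symmetric = TRUE)
pos <- eg$values > 1e-8 * max(eg$values)
Dp  <- diag(eg$values[pos])          # step 1: D_j^+ (positive eigenvalues)
Up  <- eg$vectors[,  pos]            # step 2: U_j^+
U0  <- eg$vectors[, !pos]            #         U_j^0 (penalty null space)
## Steps 3-4: theta = Up %*% theta_R + U0 %*% theta_F, so the smooth splits into
Z <- C %*% Up                        # a penalized   ("random") design and
X <- C %*% U0                        # an unpenalized ("fixed") design.
## Step 5: t(theta) %*% S %*% theta == t(theta_R) %*% Dp %*% theta_R,
## i.e. theta_R ~ N(0, solve(Dp)/lambda): a Gaussian random effect -> GLMM.
```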
Numerical methods for fitting PGAMs
1. If the degree of smoothing $\lambda_j$ is known:
Penalized Iteratively Re-weighted Least
Squares (P-IRLS)
2. If the degree of smoothing is not known:
◦ Leave-one-out cross validation
◦ Generalized cross validation
◦ Marginal likelihood and Restricted/Extended
Maximum Likelihood (REML)
Optimizations may be applied to
linearizations or to the non-linearized
versions of the GLMM ("direct nested
iteration")
Extremely rich (and esoteric) literature exists
about the linear algebra form of the
updating equations
Log-linear (at the point of analysis) framework can
seamlessly support prediction about non-linear
functionals using simple Monte Carlo
Relative Risk: $RR(t) = \dfrac{1 - S_A(t)}{1 - S_B(t)}$
Relative Survival: $R(t) = \dfrac{S_A(t)}{S_B(t)}$
Absolute Risk Difference (actuarial/pharmacoeconomic applications): $ARD(t) = S_B(t) - S_A(t)$
(Restricted) Mean Survival Time (demography/reliability analysis): $RMST(t) = \displaystyle\int_0^t S_A(u)\,du - \int_0^t S_B(u)\,du$
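A hedged sketch of the Monte Carlo step with mgcv, reusing `fit` from the data-expansion sketch (assumed in scope): simulate coefficient draws from the Bayesian posterior N(coef, Vp) returned by vcov(), push them through lpmatrix predictions, and integrate the implied survival curves; the grid size and number of draws are arbitrary choices:

```r
## Posterior Monte Carlo for the RMST difference between the x = 1 and
## x = 0 arms of `fit` (from the earlier data-expansion sketch).
library(MASS)
tstar <- 5
grid  <- seq(1e-3, tstar, length.out = 200); dt <- diff(grid)[1]
nd    <- function(xval) data.frame(t = grid, x = xval, w = 1)   # log(w) = 0
Xp1   <- predict(fit, newdata = nd(1), type = "lpmatrix")
Xp0   <- predict(fit, newdata = nd(0), type = "lpmatrix")

B <- mvrnorm(2000, coef(fit), vcov(fit))  # draws from the Bayesian posterior
rmst <- function(b, Xp) {                 # S(t) = exp(-H(t)); RMST = int_0^t* S
  sum(exp(-cumsum(exp(Xp %*% b)) * dt)) * dt
}
d_rmst <- apply(B, 1, function(b) rmst(b, Xp1) - rmst(b, Xp0))
quantile(d_rmst, c(0.025, 0.5, 0.975))    # point estimate & 95% interval
```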
Case Studies
Numerical Error of Gauss-Lobatto for integrating a single life-time [figure]
Survival Point Estimates: PGAM vs. Kaplan-Meier [figure]
Hazard Ratio Estimates: PGAM vs. Cox model [figure]
PGAM accurately estimates the RMST under different functional deviations from the baseline hazard [figure]
Cox vs. PGAM in a cancer trial
Hazard Ratios
Model  HR    LCI   UCI
Cox    0.73  0.64  0.83
GL3    0.68  0.59  0.77
GL4    0.77  0.67  0.88
GL5    0.72  0.64  0.82
GL6    0.72  0.63  0.82
GL7    0.73  0.64  0.83
GL8    0.73  0.64  0.83
GL9    0.73  0.64  0.83
GL10   0.73  0.64  0.83
GL11   0.73  0.64  0.83
GL12   0.73  0.64  0.83
GL13   0.73  0.64  0.83
GL14   0.73  0.64  0.83
GL15   0.73  0.64  0.83
GL16   0.73  0.64  0.83
GL17   0.73  0.64  0.83
GL18   0.73  0.64  0.83
GL19   0.73  0.64  0.83
GL20   0.73  0.64  0.83
Cox v.s. PGAM in a nephrology trial
Variable                Cox HR (95% CI)     GL7 HR (95% CI)     GL10 HR (95% CI)    GL20 HR (95% CI)
High Kt/V               0.96 (0.84 - 1.10)  0.96 (0.84 - 1.09)  0.96 (0.84 - 1.09)  0.96 (0.84 - 1.09)
High Flux               0.92 (0.81 - 1.05)  0.92 (0.81 - 1.06)  0.92 (0.81 - 1.06)  0.92 (0.81 - 1.06)
Age (per 10)            1.41 (1.33 - 1.50)  1.42 (1.33 - 1.51)  1.42 (1.33 - 1.51)  1.42 (1.33 - 1.51)
Female                  0.85 (0.73 - 0.98)  0.85 (0.73 - 0.98)  0.85 (0.73 - 0.98)  0.85 (0.73 - 0.98)
Black                   0.77 (0.66 - 0.91)  0.77 (0.66 - 0.91)  0.77 (0.66 - 0.91)  0.77 (0.66 - 0.91)
Diabetic                1.29 (1.11 - 1.50)  1.29 (1.11 - 1.50)  1.29 (1.11 - 1.50)  1.29 (1.11 - 1.50)
Duration                1.04 (1.02 - 1.06)  1.04 (1.02 - 1.05)  1.04 (1.02 - 1.05)  1.04 (1.02 - 1.05)
ICED                    1.37 (1.25 - 1.50)  1.38 (1.27 - 1.51)  1.38 (1.27 - 1.51)  1.38 (1.27 - 1.51)
Alb (per 0.5 g/dl)      0.51 (0.43 - 0.62)  0.53 (0.45 - 0.64)  0.53 (0.45 - 0.64)  0.53 (0.45 - 0.64)
Alb × Time interaction  1.11 (1.04 - 1.19)  1.09 (1.02 - 1.16)  1.09 (1.02 - 1.16)  1.09 (1.02 - 1.16)
Modeling of complex datasets to account for
secular trends & conduct subgroup analyses
TIME SCALE INTERACTIONS [figure]
COVARIATE BY TREATMENT BY DISEASE INTERACTIONS [figure]
PGAMs/GLMMs as approximate Bayesian computations in survival analyses [figure]
PGAMs/GLMMs for scalable inference in large datasets [figure]
Extensions & Connections for
Big Data Analytics and
Bayesian Analyses
Software libraries available for
PGAM/GLMM analyses
mgcv in R (by Simon Wood; extremely fast; requires manual generation of the zero-inflated
dataset; has one of the most versatile collections of smoothers)
pammtools in R (by Andreas Bender; fast; automatic generation of the zero-inflated dataset)
hglm in R (by Alam, Rönnegård, Shen; requires manual generation of the zero-inflated
dataset and of the random effects matrix used to penalize the fit)
gamlss in R (originally by Rigby and Stasinopoulos, now a multinational team of authors; a
versatile tool that allows for additive composition of non-linear function approximations, e.g.
Neural Networks, and offers as many choices as TMB)
glmmTMB in R & TMB in R (extreme flexibility in the selection of optimizers; automatic
differentiation (AD) for sparse Hessians to calculate standard errors; no coding of matrix
algebra calculations; the user only specifies the likelihood in C++ and compiles a DLL callable
from R)
In our hands, TMB-based solutions have scaled much better with size than all other options
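For illustration, a hedged sketch of the pammtools workflow on the classic survival::veteran data (piecewise-exponential intervals rather than Gauss-Lobatto nodes; argument names follow the package documentation and should be checked against the installed version):

```r
## Automatic data expansion with pammtools, then a Poisson GAM in mgcv.
## Default cut points are the unique event times.
library(pammtools); library(mgcv); library(survival)
ped <- as_ped(data = veteran, formula = Surv(time, status) ~ age + trt)
pam <- gam(ped_status ~ s(tend) + age + trt,   # s(tend): log baseline hazard
           data = ped, family = poisson(), offset = offset)
summary(pam)
```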
The Bayesian hiding in the PGAM/GLMM closet: a 30,000 ft view
GLMMs can be given a Bayesian interpretation so that
1. Numerical results obtained by non-Bayesian PGAM
software can be treated as deterministic approximations
to a full and very expensive Bayesian computation
2. If one is fitting PGAMs/GLMMs from a non-Bayesian
perspective and the fitting is numerically unstable, then
one may fit the model using Bayesian methods e.g.
MCMC and report the latter from a non-Bayesian
perspective
3. New implementations of PGAMs can be devised that
take advantage of high-performance optimization
libraries and state-of-the art algorithmic differentiation
approaches
The Bayesian Connection Explained Through the Hierarchical Likelihood Approach of Lee and Nelder
Likelihood:
$$L\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi;\, \mathbf{Y}, \boldsymbol{\theta}_{j,R}\right) = \prod_{i=1}^{N}\prod_j EF\!\left(y_i \mid \mu_i(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,R}, \boldsymbol{\theta}_{j,F}, \mathbf{X}, \mathbf{C}), \phi\right)\, MVN\!\left(\boldsymbol{\theta}_{j,R} \mid \mathbf{0}, (\lambda_j\mathbf{D}_j^{+})^{-1}\right)$$
Assume non-informative, uniform, improper priors for $\boldsymbol{\beta}$, $\lambda_j$, $\phi$, and $\boldsymbol{\theta}_{j,F}$
Posterior for the parameters:
$$p\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi \mid \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) = \frac{L\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi;\, \mathbf{Y}, \boldsymbol{\theta}_{j,R}\right)}{\displaystyle\int L\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi;\, \mathbf{Y}, \boldsymbol{\theta}_{j,R}\right) d\boldsymbol{\beta}\, d\boldsymbol{\theta}_{j,R}\, d\lambda_j\, d\phi}$$
$$= p\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R} \mid \lambda_j, \phi, \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) \times p\!\left(\lambda_j, \phi \mid \mathbf{Y}, \mathbf{X}, \mathbf{C}\right)$$
$$= p\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F} \mid \boldsymbol{\theta}_{j,R}, \lambda_j, \phi, \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) \times p\!\left(\boldsymbol{\theta}_{j,R} \mid \lambda_j, \phi, \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) \times p\!\left(\lambda_j, \phi \mid \mathbf{Y}, \mathbf{X}, \mathbf{C}\right)$$
Nested optimizations using the Laplace approximation handle the integrals of the two marginal likelihoods, e.g.
$$p\!\left(\lambda_j, \phi \mid \mathbf{Y}, \mathbf{X}, \mathbf{C}\right) \propto \int L\!\left(\boldsymbol{\beta}, \boldsymbol{\theta}_{j,F}, \boldsymbol{\theta}_{j,R}, \lambda_j, \phi;\, \mathbf{Y}, \boldsymbol{\theta}_{j,R}\right) d\boldsymbol{\beta}\, d\boldsymbol{\theta}_{j,R}$$
Do we have to stick with linear
functionals in GAMs?
Nothing in theory restricts $f_j(c_{i,j})$ in $g(\mu_i) = \mathbf{x}_i\boldsymbol{\beta} + \sum_j f_j(c_{i,j})$, $y_i \sim EF(\mu_i, \phi)$, to be of the form
$$\sum_{l=1}^{k} b_l(c_{i,j})\,\theta_l$$
Non-linear functionals (e.g. Deep Neural Network structures) could be envisioned
These models operate outside the well-behaved likelihood spaces conventional GAMs live in
Selection of optimizers must be optimized:
gradient descent may not always deliver.
Think of Newton's method with AD-computed Hessians
Connections to Bayesian Neural Networks are immediate:
this is your prior
$$\exp\!\left(-\frac{1}{2}\sum_j \lambda_j\, \boldsymbol{\theta}_{j,R}^T \mathbf{D}_j^{+} \boldsymbol{\theta}_{j,R}\right) \;\propto\; \prod_j MVN\!\left(\boldsymbol{\theta}_{j,R} \mid \mathbf{0}, (\mathbf{D}_j^{+})^{-1}/\lambda_j\right)$$
Summary and some final thoughts
THEORETICAL
GAMs combine the “universal” approximation
properties of polynomials/splines with GLMs
While GAMs are additive, linear models, they can
be extended to additive compositions of non-linear
models (e.g. neural networks).
This will require ditching most existing
implementations in favor of gamlss and/or custom
developed software code
The Poisson GLM appears to be a "universal
approximator" for other lifetime distributions
GAMs and PGAMs offer a cheap and dirty way to
conduct Bayesian inference without Monte Carlo
TECHNICAL
Robust implementations of GAMs are
predominantly found in R
mgcv is currently the best option to balance
coding effort against execution time
Home-grown solutions using TMB are the only
realistic option for massive datasets
While GAMLSS allows for Machine Learning
function approximation tools (e.g. neural nets),
TMB is the one most likely to deliver scalable
results in this area due to the ability to include
black box C++ code in the model specification
Numerical libraries in Python (e.g. TensorFlow) that
implement AD should be explored for GAM
implementations
Editor's Notes
#24: We will discuss a more direct and high-level implementation based on the Bayesian interpretation of PGAMs.