Clinical data based optimal STI strategies for HIV: a reinforcement learning approach

Damien Ernst
Department of Electrical Engineering and Computer Science
University of Liège
Montefiore - March 9, 2006
Presentation based on the paper: “Clinical data based optimal STI
strategies for HIV: a reinforcement learning approach”. D. Ernst, G.B.
Stan, J. Gonçalves and L. Wehenkel.

HIV
Human Immunodeficiency Virus (HIV) is a retrovirus at the
source of the Acquired Immune Deficiency Syndrome (AIDS)
HIV particles target cells of the immune system (mostly CD4+
lymphocytes and macrophages)
Inclusion of HIV particles in immune cells leads to massive
production of new viral particles, death of the infected cells
and, ultimately, devastation of the immune system

Current anti-HIV drugs
Two main categories:
1. Reverse Transcriptase Inhibitors (RTI)
2. Protease Inhibitors (PI)
Figure: Taken from http://www.cellsalive.com/hiv0.htm

Treatments for infected patients
Highly Active Anti-Retroviral Therapy (HAART): combination
of two or more drugs. Usually one or more RTIs in
combination with a PI.
Two main concerns about the long-term use of antiretroviral
drugs: undesirable side effects (leading to poor compliance)
and mutation of the virus (need to change drugs or even
inability to find appropriate pharmaceutical treatments).
Need for efficient drug scheduling strategies.
Ideally, a drug-scheduling strategy should bring the
system to a state where the immune system has control over
the virus (with a low amount of drugs and low systemic effects).

Structured Treatment Interruption (STI)
STI: to cycle the patient on and off drug therapy
STI strategies are often well received by patients since they offer
them periods of relief from treatment
In some remarkable cases, STI strategies have enabled the
patients to maintain immune control over the virus in the
absence of treatment
Goal of this research: to compute optimal STI strategies

STI: A glimpse at today’s practice
If the CD4+ cell count falls below a certain threshold, put the patient
on drugs. Otherwise take the patient off drugs. This practice has met some
problems:
Figure: Taken from
http://www.cpcra.org/docs/pubs/2006/croi2006-smart.pdf

More advanced techniques (not clinically tested)
Some authors have proposed to design STI treatments by
exploiting mathematical models of the HIV infection.
Models take the form of a set of Ordinary Differential
Equations (ODEs).
STI strategies are deduced by using methods from
control theory.
But modelling the HIV dynamics is a difficult task. Indeed, one
has:
to select the right parametric system of ODEs
to fit the parameters so that they quantitatively reflect biological
observations

An interesting alternative
Infer good STI strategies directly from clinical data, without
modelling the HIV infection dynamics.
Clinical data: time evolution of the patient’s state (CD4+ T cell
count, systemic costs of the drugs, etc.) recorded at
discrete time instants, and the sequence of drugs administered.
Clinical data can be seen as trajectories of the immune system
responding to treatment.

Inferring policies from trajectories
The problem of inferring an appropriate control policy from
trajectories has been studied in control theory and computer
science.
One way to approach it: state an optimality criterion and
search for strategies optimizing this criterion.
Classical approach: infer a model, then derive an optimal strategy
from the model and the optimality criterion.
Reinforcement learning approach: compute optimal strategies
directly from the trajectories, without identifying a model.

1. A pool of HIV infected patients.
2. The patients follow some (possibly suboptimal) STI protocols and are monitored at regular intervals.
3. The monitoring of each patient generates a trajectory for the optimal STI problem, which typically contains the following information: state of the patient at time t0; drugs taken by the patient between t0 and t1 = t0 + n days.
4. The trajectories are processed by using reinforcement learning techniques.
5. Processing of the trajectories gives some (near) optimal STI strategies, often under the form of a mapping between the state of the patient at a given time and the drugs he has to take till the next time his state is monitored.
Figure: Determination of optimal STI strategies from clinical data by
using reinforcement learning algorithms: the overall principle.

Learning from a sample of trajectories: the RL approach
Problem formulation
Discrete-time dynamics:
$x_{t+1} = f(x_t, u_t)$, $t = 0, 1, \ldots$
where $x_t \in X$ and $u_t \in U$.
Cost function: $c(x, u) : X \times U \to \mathbb{R}$. $c(x, u)$ bounded by $B_c$.
Discounted infinite horizon cost associated to stationary policy $\mu : X \to U$:
$J^{\mu}(x) = \lim_{N \to \infty} \sum_{t=0}^{N-1} \gamma^t c(x_t, \mu(x_t))$
Optimal stationary policy $\mu^*$: Policy that minimizes $J^{\mu}$ for all $x$.
Objective: Find an optimal policy $\mu^*$.
We do not know: The discrete-time dynamics.
We know instead: A set of trajectories $(x_0, u_0, x_1, \ldots, u_{T-1}, x_T)$.
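As a small illustration of the discounted cost criterion, the sketch below evaluates a truncated version of the sum for a given policy, starting from an initial state x0. The toy system (five states on a line, actions -1 and +1, cost equal to the state) is purely hypothetical and only serves to make the example runnable; it is not the HIV problem.

```python
def discounted_cost(x0, policy, f, c, gamma, horizon):
    """Truncated approximation of J^mu(x0) = sum_t gamma^t * c(x_t, mu(x_t))."""
    x, total = x0, 0.0
    for t in range(horizon):
        u = policy(x)                      # u_t = mu(x_t)
        total += (gamma ** t) * c(x, u)    # accumulate the discounted cost
        x = f(x, u)                        # x_{t+1} = f(x_t, u_t)
    return total


# Hypothetical toy system: states 0..4, actions -1 / +1, cost = current state.
def toy_f(x, u):
    return min(4, max(0, x + u))


def toy_c(x, u):
    return float(x)


def always_down(x):
    return -1


J = discounted_cost(3, always_down, toy_f, toy_c, gamma=0.5, horizon=50)
print(J)
```

In the setting of the slides, f is unknown, so this direct evaluation is not available; only recorded trajectories are.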

Some dynamic programming results
Sequence of functions $Q_N : X \times U \to \mathbb{R}$
$Q_N(x, u) = c(x, u) + \gamma \min_{u' \in U} Q_{N-1}(f(x, u), u')$, $\forall N > 1$
with $Q_1(x, u) \equiv c(x, u)$, converges to the Q-function, unique
solution of the Bellman equation:
$Q(x, u) = c(x, u) + \gamma \min_{u' \in U} Q(f(x, u), u')$.
Necessary and sufficient optimality condition:
$\mu^*(x) \in \arg\min_{u \in U} Q(x, u)$
Stationary policy $\mu^*_N$:
$\mu^*_N(x) \in \arg\min_{u \in U} Q_N(x, u)$.
Bound on the suboptimality of $\mu^*_N$:
$J^{\mu^*_N} - J^{\mu^*} \le \frac{2 \gamma^N B_c}{(1 - \gamma)^2}$.
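A minimal sketch of the Q_N recursion and of the suboptimality bound, assuming a small finite deterministic system whose dynamics f are known. The five-state chain used here is hypothetical and unrelated to the HIV model.

```python
def q_iteration(states, actions, f, c, gamma, n_iter):
    """Return Q_N (N = n_iter) as a dict {(x, u): value}, starting from Q_1 = c."""
    q = {(x, u): c(x, u) for x in states for u in actions}
    for _ in range(n_iter - 1):
        # Q_N(x, u) = c(x, u) + gamma * min_{u'} Q_{N-1}(f(x, u), u')
        q = {(x, u): c(x, u) + gamma * min(q[(f(x, u), v)] for v in actions)
             for x in states for u in actions}
    return q


def greedy_policy(q, actions):
    """mu*_N(x): an action minimizing Q_N(x, .)."""
    return lambda x: min(actions, key=lambda u: q[(x, u)])


def suboptimality_bound(gamma, n, b_c):
    """Right-hand side of the bound: 2 * gamma^N * B_c / (1 - gamma)^2."""
    return 2 * gamma ** n * b_c / (1 - gamma) ** 2


# Hypothetical toy chain: states 0..4, actions -1 / +1, cost = current state.
STATES, ACTIONS = range(5), (-1, 1)


def toy_f(x, u):
    return min(4, max(0, x + u))


def toy_c(x, u):
    return float(x)


q10 = q_iteration(STATES, ACTIONS, toy_f, toy_c, gamma=0.5, n_iter=10)
mu10 = greedy_policy(q10, ACTIONS)
print(q10[(3, -1)], mu10(3), suboptimality_bound(0.5, 10, 4.0))
```

The bound shrinks geometrically with N, which is why a finite number of iterations is enough in practice.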

Fitted Q iteration
Trajectories $(x_0, u_0, x_1, \ldots, u_{T-1}, x_T)$ transformed into a set of
one-step system transitions $F = \{(x_t^l, u_t^l, x_{t+1}^l)\}_{l=1}^{\#F}$.
Fitted Q iteration computes from $F$ the functions $\hat{Q}_1, \hat{Q}_2, \ldots, \hat{Q}_N$,
approximations of $Q_1, Q_2, \ldots, Q_N$.
Computation done iteratively by solving a sequence of standard
supervised learning (SL) problems. Training sample for the $k$th
($k \ge 2$) problem is
$\left\{ \left( (x_t^l, u_t^l),\; c(x_t^l, u_t^l) + \gamma \min_{u \in U} \hat{Q}_{k-1}(x_{t+1}^l, u) \right) \right\}_{l=1}^{\#F}$
with $\hat{Q}_1(x, u) \equiv c(x, u)$. From the $k$th training sample, the supervised
learning algorithm outputs $\hat{Q}_k$.
$\hat{\mu}^*_N(x) \in \arg\min_{u \in U} \hat{Q}_N(x, u)$ is taken as approximation of $\mu^*(x)$.
In our simulations, SL method used is an ensemble of regression
trees method named Extra-Trees.
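The iteration above can be sketched in a few lines. This is only an illustration under stated assumptions: a nearest-neighbour averaging regressor is swapped in for Extra-Trees to keep the example dependency-free, the cost function is assumed known, and the two trajectories come from a hypothetical five-state chain rather than from patients. All function names are illustrative.

```python
def to_transitions(trajectory):
    """Split (x0, u0, x1, ..., u_{T-1}, x_T) into one-step tuples (x_t, u_t, x_{t+1})."""
    return [(trajectory[i], trajectory[i + 1], trajectory[i + 2])
            for i in range(0, len(trajectory) - 2, 2)]


def knn_regressor(inputs, targets, k=1):
    """Stand-in supervised learner (NOT Extra-Trees): mean of the k nearest targets."""
    def predict(z):
        order = sorted(range(len(inputs)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(inputs[i], z)))
        nearest = order[:k]
        return sum(targets[i] for i in nearest) / len(nearest)
    return predict


def fitted_q_iteration(transitions, actions, cost, gamma, n_iter, fit=knn_regressor):
    """Return Q_hat_N; states are tuples of floats, actions are numbers."""
    inputs = [tuple(x) + (u,) for x, u, _ in transitions]

    def q_hat(z):
        # Q_hat_1 = c
        return cost(z[:-1], z[-1])

    for _ in range(n_iter - 1):
        # Targets for the k-th supervised learning problem.
        targets = [cost(x, u) + gamma * min(q_hat(tuple(xn) + (v,)) for v in actions)
                   for x, u, xn in transitions]
        q_hat = fit(inputs, targets)
    return q_hat


def greedy_action(q_hat, x, actions):
    """mu_hat*_N(x): an action minimizing Q_hat_N(x, .)."""
    return min(actions, key=lambda u: q_hat(tuple(x) + (u,)))


# Two hypothetical trajectories on a chain with states 0..4 and actions -1 / +1.
up = [(0.0,), 1, (1.0,), 1, (2.0,), 1, (3.0,), 1, (4.0,), 1, (4.0,)]
down = [(4.0,), -1, (3.0,), -1, (2.0,), -1, (1.0,), -1, (0.0,), -1, (0.0,)]
F = to_transitions(up) + to_transitions(down)

q_hat = fitted_q_iteration(F, actions=(-1, 1), cost=lambda x, u: x[0],
                           gamma=0.5, n_iter=10)
print(greedy_action(q_hat, (3.0,), (-1, 1)))
```

Because the sample covers every state-action pair of the toy chain and k = 1, the learned function coincides with the exact Q_N at the sampled points; with real data the regressor has to generalize between samples.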

Illustration
We present results obtained by using the RL-based
approach on artificially generated data.
The example is directly inspired by
B.M. Adams, H.T. Banks, Hee-Dae Kwon and H.T. Tran.
(2004). “Dynamic multidrug therapies for HIV: Optimal and
STI Control Approaches”. Mathematical Biosciences and
Engineering, 1, 223-241.

Illustration: Kinds of STI strategies targeted
Bi-therapy treatments combining a fixed RTI and a fixed PI.
Revise drug administration every five days based on clinical
measurements.
Four possible on-off combinations for the next five days: RTI and
PI on, only RTI on, only PI on, RTI and PI off
We seek STI strategies that minimize Jµ.
Instantaneous cost at time t:
$c(x_t, u_t) = 0.1 V_t + 20000\, \epsilon_{1t}^2 + 2000\, \epsilon_{2t}^2 - 1000 E_t$
$\epsilon_{1t} = 0.7$ (resp. $\epsilon_{1t} = 0$) if the RTI is cycled on (resp. off) at time t
$\epsilon_{2t} = 0.3$ (resp. $\epsilon_{2t} = 0$) if the PI is cycled on (resp. off) at time t
V : number of free HI viruses
E: number of cytotoxic T-lymphocytes
Decay factor γ: chosen equal to 0.98.
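A short sketch of the action set and of the instantaneous cost, using the coefficients and drug efficacies given on this slide. The function and constant names are illustrative choices, not taken from the paper.

```python
from itertools import product

EPS_RTI_ON = 0.7   # value of the RTI control when the drug is cycled on
EPS_PI_ON = 0.3    # value of the PI control when the drug is cycled on
GAMMA = 0.98       # decay factor

# The four on-off combinations, encoded as (eps1, eps2) pairs.
ACTIONS = list(product((0.0, EPS_RTI_ON), (0.0, EPS_PI_ON)))


def instantaneous_cost(V, E, eps1, eps2):
    """c(x_t, u_t) = 0.1 V + 20000 eps1^2 + 2000 eps2^2 - 1000 E."""
    return 0.1 * V + 20000 * eps1 ** 2 + 2000 * eps2 ** 2 - 1000 * E


for eps1, eps2 in ACTIONS:
    print(eps1, eps2, instantaneous_cost(V=415, E=353108, eps1=eps1, eps2=eps2))
```

The large negative weight on E rewards a strong cytotoxic T-lymphocyte response, while the quadratic drug terms penalize treatment.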

Illustration: A mathematical model as substitute for
real-life patients
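$\dot{T}_1 = \lambda_1 - d_1 T_1 - (1 - \epsilon_1) k_1 V T_1$
$\dot{T}_2 = \lambda_2 - d_2 T_2 - (1 - f \epsilon_1) k_2 V T_2$
$\dot{T}_1^* = (1 - \epsilon_1) k_1 V T_1 - \delta T_1^* - m_1 E T_1^*$
$\dot{T}_2^* = (1 - f \epsilon_1) k_2 V T_2 - \delta T_2^* - m_2 E T_2^*$
$\dot{V} = (1 - \epsilon_2) N_T \delta (T_1^* + T_2^*) - c V - [(1 - \epsilon_1) \rho_1 k_1 T_1 + (1 - f \epsilon_1) \rho_2 k_2 T_2] V$
$\dot{E} = \lambda_E + \frac{b_E (T_1^* + T_2^*)}{(T_1^* + T_2^*) + K_b} E - \frac{d_E (T_1^* + T_2^*)}{(T_1^* + T_2^*) + K_d} E - \delta_E E$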
$T_1$ ($T_1^*$) = number of non-infected (infected) CD4+ lymphocytes
$T_2$ ($T_2^*$) = number of non-infected (infected) macrophages
$V$ = number of free HI viruses
$E$ = number of cytotoxic T-lymphocytes.
$\epsilon_1$ and $\epsilon_2$ = control actions corresponding to the RTI and the PI.
Period during which the RTI (resp. the PI) is administered to the
patient: $\epsilon_1$ (resp. $\epsilon_2$) is set equal to 0.7 (resp. 0.3).
RTI (resp. the PI) not administered: $\epsilon_1 = 0$ (resp. $\epsilon_2 = 0$).
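The right-hand side of this model can be written as a plain function, as sketched below. The parameter values are not given on the slides: the ones used here are assumed from Adams et al. (2004) and should be checked against that paper before any use. They are at least consistent with the uninfected equilibrium quoted later in this presentation, where all derivatives vanish.

```python
# ASSUMED parameter values (recalled from Adams et al., 2004; not stated in the slides).
PARAMS = dict(lam1=1e4, d1=0.01, k1=8e-7, lam2=31.98, d2=0.01, f=0.34, k2=1e-4,
              delta=0.7, m1=1e-5, m2=1e-5, NT=100.0, c=13.0, rho1=1.0, rho2=1.0,
              lamE=1.0, bE=0.3, Kb=100.0, dE=0.25, Kd=500.0, deltaE=0.1)


def hiv_rhs(state, eps1, eps2, p=PARAMS):
    """Time derivatives (dT1, dT2, dT1*, dT2*, dV, dE) of the six-state model."""
    T1, T2, T1s, T2s, V, E = state
    rate1 = (1 - eps1) * p["k1"] * T1             # infection rate of CD4+ cells per virion
    rate2 = (1 - p["f"] * eps1) * p["k2"] * T2    # infection rate of macrophages per virion
    Ts = T1s + T2s                                # total number of infected cells
    dT1 = p["lam1"] - p["d1"] * T1 - rate1 * V
    dT2 = p["lam2"] - p["d2"] * T2 - rate2 * V
    dT1s = rate1 * V - p["delta"] * T1s - p["m1"] * E * T1s
    dT2s = rate2 * V - p["delta"] * T2s - p["m2"] * E * T2s
    dV = ((1 - eps2) * p["NT"] * p["delta"] * Ts - p["c"] * V
          - (p["rho1"] * rate1 + p["rho2"] * rate2) * V)
    dE = (p["lamE"] + p["bE"] * Ts / (Ts + p["Kb"]) * E
          - p["dE"] * Ts / (Ts + p["Kd"]) * E - p["deltaE"] * E)
    return (dT1, dT2, dT1s, dT2s, dV, dE)


uninfected = (1e6, 3198.0, 0.0, 0.0, 0.0, 10.0)
print(hiv_rhs(uninfected, 0.0, 0.0))   # every component should be (numerically) zero
```

To simulate a patient over five-day intervals, this function would be passed to a numerical ODE integrator with the controls held constant over each interval; the choice of integrator is left open here.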

Illustration: Some insight into this model
In absence of treatment, three physical equilibrium points:
1. uninfected state:
$(T_1, T_2, T_1^*, T_2^*, V, E) = (10^6, 3198, 0, 0, 0, 10)$
2. “healthy” locally stable equilibrium
$(T_1, T_2, T_1^*, T_2^*, V, E) = (967839, 621, 76, 6, 415, 353108)$
(small viral load, a high CD4+ T-lymphocytes count, high
HIV-specific cytotoxic T-cells count)
3. “non-healthy” locally stable equilibrium point
$(T_1, T_2, T_1^*, T_2^*, V, E) = (163573, 5, 11945, 46, 63919, 24)$
(T-cells depleted, viral load very high).

Illustration: Protocol for artificially generating the clinical
data
Monitoring of patients: every five days during 1000 days.
Medication: can be revised every five days based on the
information generated by the monitoring.
Iterative generation of the clinical data (ten iterations):
First iteration. Thirty patients in “non-healthy” steady-state.
Physiological data $(T_1, T_2, T_1^*, T_2^*, V, E)$ recorded and a
new type of medication randomly selected in $U$ every five
days. Monitoring of each patient generates a trajectory
$(x_0, u_0, x_1, \ldots, x_{199}, u_{199}, x_{200})$.
Second iteration. Only difference with the first iteration:
medication determined by the following STI strategy: in 85%
of the cases, use strategy $\hat{\mu}^*_{400}$ computed by fitted Q iteration
on previously generated trajectories; in the remaining 15%,
medication randomly selected in $U$.
Third to tenth iterations: same as the second iteration.
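The medication rule used from the second iteration onwards amounts to following the current policy most of the time and exploring otherwise. A minimal sketch, assuming the policy is passed as a Python callable (or None during the first iteration, when no policy exists yet); the function name and signature are illustrative.

```python
import random


def select_medication(x, policy, actions, explore_prob=0.15, rng=random):
    """Pick the medication for the next five days.

    With probability explore_prob (15% in the protocol), or when no policy is
    available yet (first iteration), draw an action uniformly at random in U;
    otherwise follow the policy computed by fitted Q iteration.
    """
    if policy is None or rng.random() < explore_prob:
        return rng.choice(actions)
    return policy(x)


# Hypothetical usage: four (eps1, eps2) actions and a placeholder policy that
# always prescribes both drugs.
U = [(0.0, 0.0), (0.7, 0.0), (0.0, 0.3), (0.7, 0.3)]


def both_on(x):
    return (0.7, 0.3)


rng = random.Random(0)
choices = [select_medication(None, both_on, U, rng=rng) for _ in range(5)]
print(choices)
```

Keeping some random medication in every iteration lets later iterations collect data in parts of the state space that the current policy would never visit.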

Illustration: Simulation results
Figure: Time evolution (in days) of log10(T1), log10(T2), log10(T1*),
log10(T2*), log10(V) and log10(E). Solid curves: patient who follows the
STI strategy; dashed curves: no interruption in the treatment;
dotted curves: no treatment.

Figure: STI treatment for a patient treated from early stage of infection
(on/off status of the reverse transcriptase inhibitor and of the protease
inhibitor versus time in days). Clinical data generated by 300 patients.
Figure: Influence of the number of patients on the infinite time horizon
cost corresponding to the computed STI strategies (horizontal axis: number
of patients, from 30 to 300; vertical axis: infinite time horizon cost).

From numerically simulated data to real-life patients
We expect to face four main difficulties:
The HIV/immune system dynamics may differ from one
patient to another.
Difficulty of properly stating the optimal control problem
Partial observability
Corrupted measurements

Conclusions
Reinforcement learning algorithms seem to be promising tools
for extracting good STI strategies from clinical data.
A lot of work is, however, still needed!
But 40 million people are living with HIV/AIDS. Isn’t that a
good reason to keep working hard?
Figure: Taken from UNAIDS. AIDS epidemic update: December 2005.
“UNAIDS/05.19E”
