© SSE, Prof. Dr. Klaus Pohl, Prof. Dr. Andreas Metzger
Explaining Online Reinforcement Learning Decisions
of Self-Adaptive Systems
Felix Feit, Andreas Metzger, Klaus Pohl
ACSOS 2022
Agenda
1. Motivation
2. Explanation Approach “XRL-DINE“
3. Validation
4. Discussion and Outlook
Online Reinforcement Learning for SAS
[Figure: Combination of the MAPE-K self-adaptation logic (Monitor, Analyze, Plan, Execute, Knowledge) with online RL. The RL agent monitors state S, selects an adaptation action A (Analyze + Plan), executes it, and receives reward R and next state S' from the environment, which drive the policy (Knowledge) update.]
• Online RL = emerging approach for addressing design-time uncertainty [Palm et al. @ CAiSE 2020]
→ Leverages information only available at runtime (i.e., during live system execution)
• Since 2019, learning has been the most prominent strategy for realizing SAS [Porter et al. @ ACSOS 2020]
• Conceptual model of online RL [Metzger et al. @ Computing 2022]: the learning goal is defined by the reward function (see the sketch below)
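The following is a minimal sketch of how such an online RL loop can drive the MAPE-K cycle; it is illustrative only, and the `sas_env` and `policy` interfaces are assumptions rather than the implementation used in the paper.

```python
# Minimal sketch of online RL driving MAPE-K (illustrative, not the paper's code).
# Assumed interfaces: `sas_env` exposes monitor()/execute()/reward(), and `policy`
# selects adaptation actions and supports an update step.

def online_rl_loop(sas_env, policy, steps=1000):
    state = sas_env.monitor()                     # Monitor: observe current state S
    for _ in range(steps):
        action = policy.select_action(state)      # Analyze + Plan: select adaptation A
        sas_env.execute(action)                   # Execute: apply the adaptation
        reward = sas_env.reward()                 # Reward R, defined by the reward function (learning goal)
        next_state = sas_env.monitor()            # Observe next state S'
        policy.update(state, action, reward, next_state)  # Knowledge: policy update
        state = next_state
```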
Online Reinforcement Learning for SAS
Policy (Knowledge) represented as a deep neural network
Pro
• Handles continuous states and actions
• Generalizes over unseen, neighboring states
Con
• Deep RL = “black box”
→ Limited trustworthiness
→ Difficult to debug (e.g., is the reward function correctly defined?)
Increasing use of Deep RL. The power of deep learning:
“/imagine yellow Labrador in the style of…” [image generated using Midjourney AI]
[Figure: comparison of Classical RL (Q-Learning), which stores the policy as a Q-table of action values per state (columns UP, RIGHT, DOWN, LEFT), with Deep RL, which approximates Q(State S, Action A) by a deep neural network interacting with the environment; see the sketch below.]
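To illustrate the difference, here is a hedged sketch (not the deck's code): tabular Q-learning stores one value per (state, action) pair, while deep RL replaces the table with a function approximator. The grid size, hyperparameters, and the commented-out network architecture are arbitrary assumptions.

```python
import numpy as np

# Classical RL: tabular Q-learning keeps one entry per (state, action) pair.
n_states, n_actions = 48, 4        # e.g., a small grid world with UP/RIGHT/DOWN/LEFT
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99           # learning rate and discount factor (illustrative values)

def q_learning_update(s, a, r, s_next):
    """One Bellman update of the Q-table."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Deep RL: the table is replaced by a neural network Q(s, .) -> action values,
# which can handle continuous states and generalize to unseen, neighboring states.
# Sketch using PyTorch; layer sizes are illustrative assumptions:
# import torch.nn as nn
# q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
```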
Explainable Reinforcement Learning (XRL) for SAS
State of the Art
Goal-based Models [Welsh et al. @ Trans. CCI 2014]
• Explanations in terms of the satisficement of softgoals
• Requires making assumptions about environment dynamics at design time (difficult due to design-time uncertainty)
Provenance Graphs [Reynolds et al. @ MODELS Wkshp 2020]
• Graph of the history to determine if, how, and why the model changed
• Graph can become too complex to be meaningfully interpreted by humans
• Query language suggested for “pruning” graphs, but not for explanations
Temporal Graph Models [Ullauri et al. @ SoSym 2022]
• Explicitly considers Deep RL
• Explanations via queries to a model @ runtime
• Interesting points of interaction extracted via complex event processing (CEP)
• No detailed, contrastive decomposition of explanations
Examples used in these works: Vacuum Cleaner, Fibonacci, Remote Data Mirroring
Agenda
1. Motivation
2. Explanation Approach “XRL-DINE“
3. Validation
4. Discussion and Outlook
XRL-DINE
Reward Decomposition
[Juozapaitis et al. @ IJCAI Wkshp 2019]
Decompose the reward function to explain the short-term goal orientation of RL (train sub-RL agents)
Pro
• Helpful in the presence of multiple, “competing” quality goals for learning
• Provides contrastive (counterfactual) explanations
Con
• No indication of an explanation’s relevance
• Requires manually selecting relevant explanations → cognitive overhead
Interestingness Elements
[Sequeira et al. @ Artif. Intell. 2020]
Identify relevant moments of interaction between agent and environment at runtime
Pro
• Facilitates automatically selecting relevant interactions to be explained
Con
• Does not explain whether RL behaves as expected and for the right reasons
Augment and combine RL explanation techniques from AI research:
Decomposed Interestingness Elements (DINEs) = Reward Decomposition + Interestingness Elements (see the sketch below)
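As a hedged sketch of the reward-decomposition mechanism underlying DINEs (illustrative channel names and hand-picked Q-values standing in for learned sub-agents), each sub-agent learns action values for one reward channel and the composed decision maximizes their sum:

```python
import numpy as np

# Illustrative reward channels for a self-adaptive web application (assumed names).
CHANNELS = ["response_time", "server_cost", "dimmer"]

def composed_action(q_values_per_channel):
    """Select the action that maximizes the sum of the sub-agents' Q-values.

    q_values_per_channel: dict mapping channel name -> array of Q(s, a) over actions.
    """
    total_q = sum(q_values_per_channel[c] for c in CHANNELS)
    return int(np.argmax(total_q)), total_q

# Example: three sub-agents, four possible adaptation actions.
q = {
    "response_time": np.array([-2.0, -0.5, -1.5, -3.0]),
    "server_cost":   np.array([-1.0, -2.5, -0.5, -0.5]),
    "dimmer":        np.array([ 0.0, -0.2, -0.1, -0.4]),
}
action, total = composed_action(q)   # action 2 here: the best trade-off across channels
```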
XRL-DINE
Understanding RL without DINEs?
• Internal behaviour: evolution of reward R
• External behaviour: evolution of states S and actions A
XRL-DINE
Three Types of DINEs: Important Interaction
Determine whether the RL agent in a given state is uncertain (wide range of actions) or certain (almost always the same action)
• How much does the relative importance of actions differ for each sub-agent?
• Number of DINEs shown can be tuned via threshold ρ (level of inequality); see the sketch below
Visualization in dashboard: certain vs. uncertain interactions
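A hedged sketch of how such a DINE could be detected: a Gini-style inequality measure over softmax action importances stands in here for the paper's exact measure, and the default value of ρ is an arbitrary assumption.

```python
import numpy as np

def action_importance(q_values):
    """Softmax over Q-values as a proxy for the relative importance of actions."""
    z = np.exp(q_values - q_values.max())
    return z / z.sum()

def important_interaction(q_values, rho=0.3):
    """Classify an interaction as 'certain' or 'uncertain'.

    Uses a Gini-style coefficient over the action importances as an illustrative
    stand-in for the inequality measure; rho is the tunable threshold that
    controls how many DINEs are shown.
    """
    p = np.sort(action_importance(q_values))
    n = p.size
    gini = (2 * np.arange(1, n + 1) - n - 1) @ p / (n * p.sum())
    return ("certain", gini) if gini > rho else ("uncertain", gini)

# Nearly uniform Q-values -> uncertain; one clearly dominant action -> certain.
print(important_interaction(np.array([-1.0, -1.1, -0.9, -1.05])))  # ('uncertain', ~0.04)
print(important_interaction(np.array([-1.0, -5.0, -6.0, -7.0])))   # ('certain', ~0.73)
```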
XRL-DINE
Three Types of DINEs: Reward Channel Dominance
Influence that each sub-agent has on each possible action
• Relative influence of the sub-agents’ rewards on the composed decision (see the sketch below)
Visualization in dashboard
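A hedged sketch of one way to quantify this dominance (a simple normalization of each channel's Q-value for the chosen action; the paper's exact computation may differ):

```python
def reward_channel_dominance(q_values_per_channel, action):
    """Relative influence of each reward channel on the selected action.

    Illustrative sketch: normalize the magnitude of each channel's Q-value for
    the chosen action so that the dominances sum to 1.
    """
    contributions = {c: q[action] for c, q in q_values_per_channel.items()}
    total = sum(abs(v) for v in contributions.values()) or 1.0
    return {c: abs(v) / total for c, v in contributions.items()}

# Reusing `q` and `action` from the reward-decomposition sketch above:
# reward_channel_dominance(q, action)
# -> {'response_time': ~0.71, 'server_cost': ~0.24, 'dimmer': ~0.05}
```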
XRL-DINE
Three Types of DINEs: Reward Channel Extremum
Points after a local minimum/maximum of the state value → RL decisions in potentially critical states
• ExpectedReward(S) − ExpectedReward(S') > ϕ → local maximum
• Number of DINEs shown can be tuned via threshold ϕ (see the sketch below)
Visualization in dashboard: minima and maxima
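A hedged sketch of detecting such extrema over a trace of expected rewards; the slide only states the maximum condition, so the symmetric condition for minima is an assumption here:

```python
def reward_channel_extrema(expected_rewards, phi=1.0):
    """Flag time steps right after a local extremum of the expected reward.

    Following the slide's condition, a drop larger than phi between consecutive
    states marks a local maximum; the symmetric rise condition for minima is an
    assumption. Returns a list of (index, 'maximum'|'minimum') pairs.
    """
    extrema = []
    for t in range(1, len(expected_rewards)):
        delta = expected_rewards[t - 1] - expected_rewards[t]   # ExpectedReward(S) - ExpectedReward(S')
        if delta > phi:
            extrema.append((t, "maximum"))
        elif -delta > phi:
            extrema.append((t, "minimum"))
    return extrema

# Example trace of expected rewards over consecutive states:
print(reward_channel_extrema([-3.0, -2.5, -2.4, -4.0, -3.9, -2.2], phi=1.0))
# -> [(3, 'maximum'), (5, 'minimum')]
```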
Agenda
1. Motivation
2. Explanation Approach “XRL-DINE“
3. Validation
4. Discussion and Outlook
Validation
Proof-of-Concept Implementation
• Double Deep Q-Networks with Experience Replay [van Hasselt et al. @ AAAI 2016]
• Approximation of the environment model using supervised learning on the contents of the replay memory
• OpenAI Gym interface to connect RL and SAS
Experimental Setup
RL Problem Formulation
• Action space = {add / remove web servers, change dimmer value}
• State space = {request arrival rate, average throughput, average response time}
• Decomposed reward function
Self-Adaptive System
• “SWIM” exemplar [Moreno et al. @ SEAMS 2018]
• Self-adaptive multi-tier web application (a sketch of such a setup follows below)
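The following is a minimal Gym-style sketch of how a SWIM-like self-adaptive web application could expose the state space, action space, and a decomposed reward to the RL agent. All class names, reward channels, and transition dynamics are illustrative assumptions and not the paper's actual implementation.

```python
import numpy as np

class SwimLikeEnv:
    """Gym-style sketch of a SWIM-like self-adaptive multi-tier web application.

    Illustrative only: the state follows the slide (request arrival rate, average
    throughput, average response time), while the reward channels and the crude
    transition dynamics are assumptions standing in for the SWIM simulator.
    """
    ACTIONS = ["add_server", "remove_server", "increase_dimmer", "decrease_dimmer"]

    def reset(self):
        self.servers, self.dimmer = 2, 0.5
        return self._observe()

    def step(self, action):
        self.servers = max(1, self.servers + {0: 1, 1: -1}.get(action, 0))
        self.dimmer = float(np.clip(self.dimmer + {2: 0.1, 3: -0.1}.get(action, 0.0), 0.0, 1.0))
        state = self._observe()
        # Decomposed reward: one channel per quality goal (assumed definitions).
        reward_channels = {
            "response_time": -state[2],            # penalize slow responses
            "server_cost": -0.1 * self.servers,    # penalize operating cost
            "user_experience": self.dimmer,        # reward serving optional content
        }
        return state, reward_channels, False, {}

    def _observe(self):
        arrival_rate = np.random.uniform(10, 100)             # requests/s (simulated load)
        throughput = min(arrival_rate, 30.0 * self.servers)   # capacity-limited
        response_time = arrival_rate / (30.0 * self.servers)  # grows with utilization
        return np.array([arrival_rate, throughput, response_time])
```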
Validation
Qualitative Results
Validation
Quantitative Results
Cognitive load ~ number of DINEs shown to developers
[Plots: Important Interactions; Reward Channel Extrema]
Agenda
1. Motivation
2. Explanation Approach “XRL-DINE“
3. Validation
4. Discussion and Outlook
Discussion
Limitations of XRL-DINE
May generate explanations that are difficult to understand
• Reason 1: Reward function was decomposed incorrectly or non-optimally
• Reason 2: Environment dynamics may delay the effects of adaptations and thus reduce understandability
Not directly applicable to collaborative adaptive systems
• XRL-DINE does not consider the decisions of other RL agents
• May lead to misleading explanations if XRL-DINE is directly applied in a collaborative setting
Only works for value-based deep RL
• DINEs are computed using the value function Q(S, A); see the paper for details
• Value-based deep RL: the policy is derived from Q(S, A), which is approximated by a neural network
Outlook
Considering Explanation Requirements from Social Sciences [Miller @ Artif. Intell. 2019]
Contrastive: “why did P happen instead of Q?”
→ “Reward Channel Dominance” DINE
Selective: “no need for the complete course of events”
→ “Reward Channel Extrema” DINE & “Important Interactions” DINE
Causal: “the most likely explanation is not necessarily the best”
→ Future work: e.g., check whether the agent relies on spurious correlations (and not causality) [Gajcin et al. @ AAMAS Wkshp 2022]
Social: “transfer of knowledge as part of a conversation”
→ Future work: e.g., Chatbot4XAI
[Figure: example Chatbot4XAI dialogue between a human (explainee) and the chatbot (explainer). Prediction: “Train will be delayed”. Explanation: “Train passed last light 5 min later than typical”. Explainee: “This also happened yesterday, but why is it a problem today?” Explainer: “Because attribute ‘number of trains behind current one’ > 5”. Explainee: “What would have to change so that there is no delay (counterfactual)?” Explainer: “‘Number of trains behind current one’ < 2”.]
Thank You!
Research leading to these results has received funding from the EU’s Horizon 2020 research and innovation programme under grant agreements no. 780351 & 871493
Further Reading
• A. Metzger, C. Quinton, Z. Á. Mann, L. Baresi, K. Pohl, “Realizing Self-Adaptive Systems via Online Reinforcement Learning and Feature-Model-guided Exploration”, Computing, Springer, March 2022
• A. Metzger, C. Quinton, Z. Á. Mann, L. Baresi, and K. Pohl, “Feature model-guided online reinforcement learning for self-adaptive services,” in 18th Int’l Conf. on Service-Oriented Computing (ICSOC 2020), LNCS 12571, Springer, 2020
• A. Palm, A. Metzger, and K. Pohl, “Online reinforcement learning for self-adaptive information systems,” in 32nd Int’l Conf. on Advanced Information Systems Engineering (CAiSE 2020), LNCS 12127, Springer, 2020
www.enact-project.eu www.dataports-project.eu