© SSE, Prof. Dr. Klaus Pohl, Prof. Dr. Andreas Metzger
Explaining Online Reinforcement Learning Decisions
of Self-Adaptive Systems
Felix Feit, Andreas Metzger, Klaus Pohl
ACSOS 2022
Agenda
1. Motivation
2. Explanation Approach “XRL-DINE“
3. Validation
4. Discussion and Outlook
Online Reinforcement Learning for SAS
[Figure: Combination of the MAPE-K self-adaptation logic (Monitor, Analyze, Plan, Execute, Knowledge) with online RL. The RL agent monitors state S, selects an adaptation action A (Analyze + Plan), executes it, and receives reward R and next state S' from the environment, which drive the policy (Knowledge) update.]
• Online RL = emerging approach for addressing design-time uncertainty [Palm et al. @ CAiSE 2020]
→ Leverages information only available at runtime (i.e., during live system execution)
• Since 2019, learning has been the most prominent strategy for realizing SAS [Porter et al. @ ACSOS 2020]
• Conceptual model of online RL [Metzger et al. @ Computing 2022]: the learning goal is defined by the reward function (see the sketch below)
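The following is a minimal sketch of how such an online RL loop can drive the MAPE-K cycle; it is illustrative only, and the `sas_env` and `policy` interfaces are assumptions rather than the implementation used in the paper.

```python
# Minimal sketch of online RL driving MAPE-K (illustrative, not the paper's code).
# Assumed interfaces: `sas_env` exposes monitor()/execute()/reward(), and `policy`
# selects adaptation actions and supports an update step.

def online_rl_loop(sas_env, policy, steps=1000):
    state = sas_env.monitor()                     # Monitor: observe current state S
    for _ in range(steps):
        action = policy.select_action(state)      # Analyze + Plan: select adaptation A
        sas_env.execute(action)                   # Execute: apply the adaptation
        reward = sas_env.reward()                 # Reward R, defined by the reward function (learning goal)
        next_state = sas_env.monitor()            # Observe next state S'
        policy.update(state, action, reward, next_state)  # Knowledge: policy update
        state = next_state
```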
Online Reinforcement Learning for SAS
Policy (Knowledge) represented as a deep neural network
Pro
• Handles continuous states and actions
• Generalizes over unseen, neighboring states
Con
• Deep RL = “black box”
→ Limited trustworthiness
→ Difficult to debug (e.g., is the reward function correctly defined?)
Increasing use of Deep RL. The power of deep learning:
“/imagine yellow Labrador in the style of…” [image generated using Midjourney AI]
[Figure: comparison of Classical RL (Q-Learning), which stores the policy as a Q-table of action values per state (columns UP, RIGHT, DOWN, LEFT), with Deep RL, which approximates Q(State S, Action A) by a deep neural network interacting with the environment; see the sketch below.]
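To illustrate the difference, here is a hedged sketch (not the deck's code): tabular Q-learning stores one value per (state, action) pair, while deep RL replaces the table with a function approximator. The grid size, hyperparameters, and the commented-out network architecture are arbitrary assumptions.

```python
import numpy as np

# Classical RL: tabular Q-learning keeps one entry per (state, action) pair.
n_states, n_actions = 48, 4        # e.g., a small grid world with UP/RIGHT/DOWN/LEFT
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99           # learning rate and discount factor (illustrative values)

def q_learning_update(s, a, r, s_next):
    """One Bellman update of the Q-table."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Deep RL: the table is replaced by a neural network Q(s, .) -> action values,
# which can handle continuous states and generalize to unseen, neighboring states.
# Sketch using PyTorch; layer sizes are illustrative assumptions:
# import torch.nn as nn
# q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
```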
Explainable Reinforcement Learning (XRL) for SAS
State of the Art
Goal-based Models [Welsh et al. @ Trans. CCI 2014]
• Explanations in terms of the satisficement of softgoals
• Requires making assumptions about environment dynamics at design time (difficult due to design-time uncertainty)
Provenance Graphs [Reynolds et al. @ MODELS Wkshp 2020]
• Graph of the history to determine if, how, and why the model changed
• Graph can become too complex to be meaningfully interpreted by humans
• Query language suggested for “pruning” graphs, but not for explanations
Temporal Graph Models [Ullauri et al. @ SoSym 2022]
• Explicitly considers Deep RL
• Explanations via queries to a model @ runtime
• Interesting points of interaction extracted via complex event processing (CEP)
• No detailed, contrastive decomposition of explanations
Examples used in these works: Vacuum Cleaner, Fibonacci, Remote Data Mirroring
Agenda
1. Motivation
2. Explanation Approach “XRL-DINE“
3. Validation
4. Discussion and Outlook
XRL-DINE
Reward Decomposition
[Juozapaitis et al. @ IJCAI Wkshp 2019]
Decompose the reward function to explain the short-term goal orientation of RL (train sub-RL agents)
Pro
• Helpful in the presence of multiple, “competing” quality goals for learning
• Provides contrastive (counterfactual) explanations
Con
• No indication of an explanation’s relevance
• Requires manually selecting relevant explanations → cognitive overhead
Interestingness Elements
[Sequeira et al. @ Artif. Intell. 2020]
Identify relevant moments of interaction between agent and environment at runtime
Pro
• Facilitates automatically selecting relevant interactions to be explained
Con
• Does not explain whether RL behaves as expected and for the right reasons
Augment and combine RL explanation techniques from AI research:
Decomposed Interestingness Elements (DINEs) = Reward Decomposition + Interestingness Elements (see the sketch below)
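As a hedged sketch of the reward-decomposition mechanism underlying DINEs (illustrative channel names and hand-picked Q-values standing in for learned sub-agents), each sub-agent learns action values for one reward channel and the composed decision maximizes their sum:

```python
import numpy as np

# Illustrative reward channels for a self-adaptive web application (assumed names).
CHANNELS = ["response_time", "server_cost", "dimmer"]

def composed_action(q_values_per_channel):
    """Select the action that maximizes the sum of the sub-agents' Q-values.

    q_values_per_channel: dict mapping channel name -> array of Q(s, a) over actions.
    """
    total_q = sum(q_values_per_channel[c] for c in CHANNELS)
    return int(np.argmax(total_q)), total_q

# Example: three sub-agents, four possible adaptation actions.
q = {
    "response_time": np.array([-2.0, -0.5, -1.5, -3.0]),
    "server_cost":   np.array([-1.0, -2.5, -0.5, -0.5]),
    "dimmer":        np.array([ 0.0, -0.2, -0.1, -0.4]),
}
action, total = composed_action(q)   # action 2 here: the best trade-off across channels
```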
XRL-DINE
Understanding RL without DINEs?
• Internal behaviour: evolution of reward R
• External behaviour: evolution of states S and actions A
XRL-DINE
Three Types of DINEs: Important Interaction
Determine whether the RL agent in a given state is uncertain (wide range of actions) or certain (almost always the same action)
• How much does the relative importance of actions differ for each sub-agent?
• Number of DINEs shown can be tuned via threshold ρ (level of inequality); see the sketch below
Visualization in dashboard: certain vs. uncertain interactions
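A hedged sketch of how such a DINE could be detected: a Gini-style inequality measure over softmax action importances stands in here for the paper's exact measure, and the default value of ρ is an arbitrary assumption.

```python
import numpy as np

def action_importance(q_values):
    """Softmax over Q-values as a proxy for the relative importance of actions."""
    z = np.exp(q_values - q_values.max())
    return z / z.sum()

def important_interaction(q_values, rho=0.3):
    """Classify an interaction as 'certain' or 'uncertain'.

    Uses a Gini-style coefficient over the action importances as an illustrative
    stand-in for the inequality measure; rho is the tunable threshold that
    controls how many DINEs are shown.
    """
    p = np.sort(action_importance(q_values))
    n = p.size
    gini = (2 * np.arange(1, n + 1) - n - 1) @ p / (n * p.sum())
    return ("certain", gini) if gini > rho else ("uncertain", gini)

# Nearly uniform Q-values -> uncertain; one clearly dominant action -> certain.
print(important_interaction(np.array([-1.0, -1.1, -0.9, -1.05])))  # ('uncertain', ~0.04)
print(important_interaction(np.array([-1.0, -5.0, -6.0, -7.0])))   # ('certain', ~0.73)
```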
XRL-DINE
Three Types of DINEs: Reward Channel Dominance
Influence that each sub-agent has on each possible action
• Relative influence of the sub-agents’ rewards on the composed decision (see the sketch below)
Visualization in dashboard
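A hedged sketch of one way to quantify this dominance (a simple normalization of each channel's Q-value for the chosen action; the paper's exact computation may differ):

```python
def reward_channel_dominance(q_values_per_channel, action):
    """Relative influence of each reward channel on the selected action.

    Illustrative sketch: normalize the magnitude of each channel's Q-value for
    the chosen action so that the dominances sum to 1.
    """
    contributions = {c: q[action] for c, q in q_values_per_channel.items()}
    total = sum(abs(v) for v in contributions.values()) or 1.0
    return {c: abs(v) / total for c, v in contributions.items()}

# Reusing `q` and `action` from the reward-decomposition sketch above:
# reward_channel_dominance(q, action)
# -> {'response_time': ~0.71, 'server_cost': ~0.24, 'dimmer': ~0.05}
```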
XRL-DINE
Three Types of DINEs: Reward Channel Extremum
Points after a local minimum/maximum of the state value → RL decisions in potentially critical states
• ExpectedReward(S) − ExpectedReward(S') > ϕ → local maximum
• Number of DINEs shown can be tuned via threshold ϕ (see the sketch below)
Visualization in dashboard: minima and maxima
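A hedged sketch of detecting such extrema over a trace of expected rewards; the slide only states the maximum condition, so the symmetric condition for minima is an assumption here:

```python
def reward_channel_extrema(expected_rewards, phi=1.0):
    """Flag time steps right after a local extremum of the expected reward.

    Following the slide's condition, a drop larger than phi between consecutive
    states marks a local maximum; the symmetric rise condition for minima is an
    assumption. Returns a list of (index, 'maximum'|'minimum') pairs.
    """
    extrema = []
    for t in range(1, len(expected_rewards)):
        delta = expected_rewards[t - 1] - expected_rewards[t]   # ExpectedReward(S) - ExpectedReward(S')
        if delta > phi:
            extrema.append((t, "maximum"))
        elif -delta > phi:
            extrema.append((t, "minimum"))
    return extrema

# Example trace of expected rewards over consecutive states:
print(reward_channel_extrema([-3.0, -2.5, -2.4, -4.0, -3.9, -2.2], phi=1.0))
# -> [(3, 'maximum'), (5, 'minimum')]
```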
Agenda
1. Motivation
2. Explanation Approach “XRL-DINE“
3. Validation
4. Discussion and Outlook
Validation
Proof-of-Concept Implementation
• Double Deep Q-Networks with Experience Replay [van Hasselt et al. @ AAAI 2016]
• Approximation of the environment model using supervised learning on the contents of the replay memory
• OpenAI Gym interface to connect RL and SAS
Experimental Setup
RL Problem Formulation
• Action space = {add / remove web servers, change dimmer value}
• State space = {request arrival rate, average throughput, average response time}
• Decomposed reward function
Self-Adaptive System
• “SWIM” exemplar [Moreno et al. @ SEAMS 2018]
• Self-adaptive multi-tier web application (a sketch of such a setup follows below)
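The following is a minimal Gym-style sketch of how a SWIM-like self-adaptive web application could expose the state space, action space, and a decomposed reward to the RL agent. All class names, reward channels, and transition dynamics are illustrative assumptions and not the paper's actual implementation.

```python
import numpy as np

class SwimLikeEnv:
    """Gym-style sketch of a SWIM-like self-adaptive multi-tier web application.

    Illustrative only: the state follows the slide (request arrival rate, average
    throughput, average response time), while the reward channels and the crude
    transition dynamics are assumptions standing in for the SWIM simulator.
    """
    ACTIONS = ["add_server", "remove_server", "increase_dimmer", "decrease_dimmer"]

    def reset(self):
        self.servers, self.dimmer = 2, 0.5
        return self._observe()

    def step(self, action):
        self.servers = max(1, self.servers + {0: 1, 1: -1}.get(action, 0))
        self.dimmer = float(np.clip(self.dimmer + {2: 0.1, 3: -0.1}.get(action, 0.0), 0.0, 1.0))
        state = self._observe()
        # Decomposed reward: one channel per quality goal (assumed definitions).
        reward_channels = {
            "response_time": -state[2],            # penalize slow responses
            "server_cost": -0.1 * self.servers,    # penalize operating cost
            "user_experience": self.dimmer,        # reward serving optional content
        }
        return state, reward_channels, False, {}

    def _observe(self):
        arrival_rate = np.random.uniform(10, 100)             # requests/s (simulated load)
        throughput = min(arrival_rate, 30.0 * self.servers)   # capacity-limited
        response_time = arrival_rate / (30.0 * self.servers)  # grows with utilization
        return np.array([arrival_rate, throughput, response_time])
```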
Validation
Qualitative Results
Validation
Quantitative Results
Cognitive load ~ number of DINEs shown to developers
[Plots: Important Interactions; Reward Channel Extrema]
Agenda
1. Motivation
2. Explanation Approach “XRL-DINE“
3. Validation
4. Discussion and Outlook
Discussion
Limitations of XRL-DINE
May generate explanations that are difficult to understand
• Reason 1: Reward function was decomposed incorrectly or non-optimally
• Reason 2: Environment dynamics may delay the effects of adaptations and thus reduce understandability
Not directly applicable to collaborative adaptive systems
• XRL-DINE does not consider the decisions of other RL agents
• May lead to misleading explanations if XRL-DINE is directly applied in a collaborative setting
Only works for value-based deep RL
• DINEs are computed using the value function Q(S, A); see the paper for details
• Value-based deep RL: the policy is derived from Q(S, A), which is approximated by a neural network
Outlook
Considering Explanation Requirements from Social Sciences [Miller @ Artif. Intell. 2019]
Contrastive: “why did P happen instead of Q?”
→ “Reward Channel Dominance” DINE
Selective: “no need for the complete course of events”
→ “Reward Channel Extrema” DINE & “Important Interactions” DINE
Causal: “the most likely explanation is not necessarily the best”
→ Future work: e.g., check whether the agent relies on spurious correlations (and not causality) [Gajcin et al. @ AAMAS Wkshp 2022]
Social: “transfer of knowledge as part of a conversation”
→ Future work: e.g., Chatbot4XAI
[Figure: example Chatbot4XAI dialogue between a human (explainee) and the chatbot (explainer). Prediction: “Train will be delayed”. Explanation: “Train passed last light 5 min later than typical”. Explainee: “This also happened yesterday, but why is it a problem today?” Explainer: “Because attribute ‘number of trains behind current one’ > 5”. Explainee: “What would have to change so that there is no delay (counterfactual)?” Explainer: “‘Number of trains behind current one’ < 2”.]
Thank You!
Research leading to these results has received funding from the EU’s Horizon 2020 research and innovation programme under grant agreements no. 780351 & 871493
Further Reading
• A. Metzger, C. Quinton, Z. Á. Mann, L. Baresi, K. Pohl, “Realizing Self-Adaptive Systems via Online Reinforcement Learning and Feature-Model-guided Exploration”, Computing, Springer, March 2022
• A. Metzger, C. Quinton, Z. Á. Mann, L. Baresi, and K. Pohl, “Feature model-guided online reinforcement learning for self-adaptive services,” in 18th Int’l Conf. on Service-Oriented Computing (ICSOC 2020), LNCS 12571, Springer, 2020
• A. Palm, A. Metzger, and K. Pohl, “Online reinforcement learning for self-adaptive information systems,” in 32nd Int’l Conf. on Advanced Information Systems Engineering (CAiSE 2020), LNCS 12127, Springer, 2020
www.enact-project.eu www.dataports-project.eu