Autonomous Systems: How to
Address the Dilemma between
Autonomy and Safety
Lionel Briand
ASE 2022 Keynote
http://www.lbriand.info
Cyber-Physical Systems
2
Leveson, 2016
Autonomous Systems
‱ Autonomy lies on a spectrum
‱ Fully autonomous systems ↔ advisory systems
‱ Open context: context not fully specified
3
Burton et al., 2019
Why Autonomy?
‱ Reactivity to events, e.g., lunar rover
‱ Alleviate cognitive load in complex tasks, e.g., adaptive
cruise control
‱ New missions without human physical constraints, e.g.,
pilots in fighter jets
‱ Complete specifications difficult to formalize: environmental
complexity as well as continual change
4
Autonomy and Machine Learning
‱ Machine learning (ML) helps with the implementation of the
understanding and decision layers.
‱ ML: (1) infer input-output relationships from samples of intended
behavior, (2) generalize the learned function to unknown and
unforeseen inputs
‱ Risk: unanticipated responses and emergent system behavior
‱ Our focus: autonomous systems that use ML models
5
Semantic Gap
‱ Gap between intended functionality and specified
functionality (Burton et al., 2019).
‱ More challenging for autonomous systems
‱ Causes:
- Complexity and unpredictability of the operational domain
- E.g., Urban environments
6
Semantic Gap
‱ Gap between intended functionality and specified
functionality (Burton et al., 2019).
‱ More challenging for autonomous systems
‱ Causes:
- Complexity and unpredictability of the system itself
- E.g., Heterogeneous sensing channels
7
Semantic Gap
‱ Gap between intended functionality and specified
functionality (Burton et al., 2019).
‱ More challenging for autonomous systems
‱ Causes:
- Transfer of decision function to the system
- E.g., ML functions do not deliver clear-cut answers
8
Paradox
10
‱ “The problem of deriving a suitable specification of the
intended behavior is instead transferred to the problem
of demonstrating that the implemented (learned)
behaviour meets the intent.” Burton et al. 2019
‱ Main challenge: Definition and validation of adequate
system safety requirements
Assurance Cases
‱ A valid, evidence-based justification for a set of claims
about the safety of a system for a given function over its
operational context
11
Burton et al., 2019
Example: Autonomous Taxiing
12
Asaadi et al., 2020
CNN:
‱ Cross-track error (CTE)
‱ Heading error (HE)
Example: Autonomous Taxiing
13
‱ Safety property: Lateral runway excursions
‱ Safety requirement: the aircraft CTE shall not exceed a
specified offset whilst taxiing
‱ Hazard: aligning the aircraft nose with a different runway
marking instead of the centerline
‱ Mitigation: retraining the ML component using training data
augmented with appropriate test data exhibiting violations
Assurance Case
14
GSN notation:
Goals, strategies,
evidence
Structured
arguments backed
up by evidence
Evidence:
‱ artifacts
‱ verification
Asaadi et al., 2020
Safety as a Control Problem
15
‱ Safety should be
treated as an emergent
property of the system
(Leveson, 2016)
‱ Paradigm Change:
Run-time Assurance Architecture
16
‱ HiL simulator and iron bird
‱ CNN outputs used for run-time monitoring
‱ Contingency actions
Asaadi et al., 2020
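The run-time monitoring idea can be sketched minimally. The thresholds and action names below are illustrative assumptions, not values from Asaadi et al.:

```python
def monitor_step(cte, he, cte_limit=1.5, he_limit=10.0):
    """Compare CNN outputs (cross-track error in meters, heading error in
    degrees) against hypothetical safety thresholds and pick an action."""
    if abs(cte) > cte_limit or abs(he) > he_limit:
        return "contingency"  # e.g., hand control to a backup controller
    return "nominal"          # continue autonomous taxiing
```

In the actual architecture, the contingency branch would trigger the backup channel exercised on the HiL simulator and iron bird.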
ML Challenges
‱ Robustness to adversarial or unexpected inputs
‱ Uncertainty of predictions
‱ Explainability: provides insights in a human interpretable way
‱ Verification: formal (proof) or experimental (test)
17
Tambon et al., 2022
Safety Verification
‱ Ideally:
‱ Formal safety guarantees
‱ Practical assumptions (about input space, model, properties, …)
‱ Scalable (e.g., to large models and systems)
‱ But
‱ We cannot have it all (e.g., high dimensionality, many parameters)
‱ Safety is not a model property, rather a system one
18
Safety Verification
‱ Formal analysis, e.g., reachability analysis
‱ Focus on robustness to perturbations
‱ Guarantees about models (not systems) under very restrictive
assumptions (e.g., model architecture, input space shape, reasonable
over-approximation of output space)
‱ Scalability issues
‱ Testing-based analysis, e.g., search-based testing
‱ Heuristics for input space exploration
‱ No guarantees, but no restrictive assumptions and more scalable
‱ Smart combination?
19
ML-Related Assurance Questions
‱ Model level: Under which input conditions is an ML component
classifying/predicting (in)correctly?
‱ System level: Under which conditions is an ML-component
(un)safe for the system?
‱ Integration level: Are there possible undesirable interactions
among (ML) components?
‱ Context: No specifications or code for ML components,
increasingly large models (DL, RL)
20
Summary
‱ Autonomous safety functions can help achieve higher safety,
e.g., emergency braking
‱ However, they also entail risks; (1) design-time assurance
cases and (2) run-time safety monitoring play a key role in
alleviating those risks
‱ Automated support to collect appropriate evidence (1) and to
guide monitoring (2) is required in the context of ML-enabled
autonomous systems
21
Example Projects and
Lessons Learned
22
Model-Level Testing and
Analysis
23
Test Inputs: Adversarial or
Natural?
‱ Adversarial inputs: Focus on robustness, e.g., noise or attacks
‱ Natural inputs: Focus on functional aspects, e.g., functional
safety
24
Example: Key-points Detection
‱ DNNs used for key-points detection
in images
‱ Many applications, e.g., face
recognition
‱ Testing: Find a test suite that causes
the DNN to poorly predict as many
key-points as possible within a time
budget
‱ Images generated by a simulator
28
Ground truth
Predicted
Example Application
‱ Drowsiness or gaze detection based on interior camera monitoring the driver
‱ In the drowsiness or gaze detection problem, each Key-Point (KP) may be highly
important for safety
‱ Each KP leads to a test objective
‱ For our subject DNN, we have 27 test objectives
‱ Goal: Cause the DNN to mispredict as many key-points as possible
‱ Solution: Many-objective search algorithms (based on genetic algorithms)
combined with simulator
30
Ul Haq et al., 2021
Overview
31
[Diagram: the Input Generator (search) passes an input vector to the Simulator, which produces a test image; the DNN predicts key-point positions; the Fitness Calculator compares predicted against actual key-point positions to compute a fitness score (error value), which guides the search toward the most critical test inputs.]
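A minimal sketch of the fitness calculation, assuming a Euclidean distance between actual and predicted key-point positions divided by a reference length (the exact normalization used in the paper may differ):

```python
import math

def keypoint_fitness(actual, predicted, norm):
    """Per-key-point error: Euclidean distance between actual and predicted
    positions, divided by a normalization length. The search maximizes these
    values, i.e., it looks for inputs the DNN mispredicts most severely."""
    return [math.dist(a, p) / norm for a, p in zip(actual, predicted)]
```

Each key-point's error is a separate objective, which is why a many-objective search algorithm is used.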
Results
‱ Our approach is effective in generating test suites that cause the DNN to severely
mispredict more than 93% of all key-points on average
‱ Not all mispredictions can be considered failures… (e.g., shadows)
‱ We must know when the DNN cannot be expected to be accurate and have contingency
measures
‱ Some key-points are more severely mispredicted than others; detailed analysis revealed
two reasons:
‱ Under-representation of some key-points (hidden) in the training data
‱ Large variation in the shape and size of the mouth across different 3D models (more
training needed)
32
Interpretation
‱ Regression trees to predict accuracy based on simulation parameters
‱ Enable detailed analysis to find the root causes of high Normalized Error (NE) values, e.g., shadow
on the location of KP26 is the cause of high NE values
‱ Regression trees show excellent accuracy, reasonable size.
‱ Amenable to risk analysis, useful safety insights, and run-time contingency plans
33
Representative rules derived from the decision tree for KP26
(M: Model-ID, P: Pitch, R: Roll, Y: Yaw, NE: Normalized Error):
Image Characteristics Condition → NE
M = 9 ∧ P < 18.41 → 0.04
M = 9 ∧ P ≄ 18.41 ∧ R < −22.31 ∧ Y < 17.06 → 0.26
M = 9 ∧ P ≄ 18.41 ∧ R < −22.31 ∧ 17.06 ≀ Y < 19 → 0.71
M = 9 ∧ P ≄ 18.41 ∧ R < −22.31 ∧ Y ≄ 19 → 0.36
(A) A test image satisfying the first condition: NE = 0.013
(B) A test image satisfying the third condition: NE = 0.89
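Rules like these can be deployed directly as a risk check. The sketch below hard-codes the four representative rules shown for KP26, assuming the condition symbols map to the legend in order (M, P, R, Y):

```python
def ne_for_kp26(model_id, pitch, roll, yaw):
    """Estimate the Normalized Error for KP26 from the representative
    regression-tree rules; returns None outside the shown branches."""
    if model_id != 9:
        return None                  # rules shown only cover model 9
    if pitch < 18.41:
        return 0.04
    if roll < -22.31:
        if yaw < 17.06:
            return 0.26
        if yaw < 19:
            return 0.71
        return 0.36
    return None                      # remaining branches not on the slide
```

Such a rule evaluator is one way the tree's insights could feed a run-time contingency plan.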
Real Images
‱ Test selection requires a different approach than with a
simulator
‱ Labeling costs are significant
36
DNN Structural Coverage Criteria
37
Chen et al. 2020
‱ Neuron coverage
‱ How neurons are
activated
‱ Many variants
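As a concrete illustration, basic neuron coverage can be computed from recorded activations. The threshold is an assumption here; the many variants differ in how activations are thresholded and normalized:

```python
def neuron_coverage(activations_per_input, threshold=0.0):
    """Fraction of neurons activated above `threshold` by at least one
    test input. `activations_per_input` is a list of per-input vectors,
    one activation value per neuron."""
    n_neurons = len(activations_per_input[0])
    covered = set()
    for activations in activations_per_input:
        for i, a in enumerate(activations):
            if a > threshold:
                covered.add(i)
    return len(covered) / n_neurons
```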
Limitations
‱ Require access to the DNN internals and sometimes the
training set. Not realistic in many practical settings.
‱ There is a weak correlation between coverage and
misclassification or poor predictions for natural inputs
‱ Also many questions regarding the studies focused on
adversarial inputs…
38
Diversity-driven Test Selection
‱ Test inputs are real images
‱ Black-box approach based on measuring the diversity of
test inputs.
‱ Scalable selection
‱ The more diverse, the more likely test inputs are to reveal
faults
39
Aghababaeyan et al., 2022
Geometric Diversity (GD)
‱ Given a dataset X and its corresponding feature vectors V,
the geometric diversity of a subset S ⊆ X is defined as the
hyper-volume of the parallelepiped spanned by the rows of
Vs, i.e., feature vectors of items in S, where the larger the
volume, the more diverse is the feature space of S
40
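The definition translates directly: for the subset's feature matrix Vs, the hyper-volume equals sqrt(det(Vs Vsᔀ)). A sketch using NumPy:

```python
import numpy as np

def geometric_diversity(V_s):
    """Geometric diversity of a subset S: the hyper-volume of the
    parallelepiped spanned by the rows of V_s (S's feature vectors),
    computed as sqrt(det(V_s @ V_s.T)). Larger means more diverse."""
    V_s = np.asarray(V_s, dtype=float)
    gram = V_s @ V_s.T               # Gram matrix of the feature vectors
    return float(np.sqrt(np.linalg.det(gram)))
```

Near-duplicate feature vectors make the Gram matrix nearly singular, driving GD toward zero, which is exactly why maximizing it favors diverse test inputs.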
Extracting Image Features
‱ VGG16 is a convolutional neural network trained on a
subset of the ImageNet dataset, a collection of over 14
million images belonging to 22,000 categories.
41
Correlation with Faults
Correlation between geometric diversity and faults
Pareto Front Optimization
44
‱ Black-box test selection to detect as many diverse faults
as possible for a given test budget
‱ Search-based Approach: Multi-Objective Genetic Search
(NSGA-II)
‱ Two objectives (Max):
‱ diversity
‱ uncertainty (e.g., Gini)
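Of the two objectives, the uncertainty score has a particularly simple form. A sketch of a Gini-based objective over a model's output probabilities (the diversity objective being the GD measure defined earlier):

```python
def gini_uncertainty(probs):
    """Gini impurity of a softmax output: 1 - sum(p^2). It is 0 for a
    fully confident prediction and grows as probability mass spreads
    across classes, i.e., as the model becomes less certain."""
    return 1.0 - sum(p * p for p in probs)
```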
HUDD
45
‱ How to help perform risk analysis with real images?
‱ Rely on heatmaps to generate clusters of images leading to failures due to the
same root cause (common characteristics of the images)
‱ Enable engineers to ensure safety by introducing countermeasures
[Diagram: error-inducing test-set images → Step 1: heatmap-based clustering → root-cause clusters (C1, C2, C3) → Step 2: inspection of a subset of each cluster's elements]
Fahmy et al. 2021
‱ Classification
‱ Gaze Detection
46
[Figure: gaze-angle classes at 22.5° intervals (0°–337.5°), grouped into regions: Top Left, Top Center, Top Right, Middle Left, Middle Right, Bottom Left, Bottom Center, Bottom Right]
Example Application
Clusters identify different problems:
‱ Cluster 1 (angle ~157.5): borderline cases
‱ Cluster 2 (eye middle center): incomplete set of classes
‱ Cluster 3 (near closed eyes): incomplete training set
49
SEDE: Simulator-based Explanations
for DNN failurEs
50
‱ Built on HUDD
‱ Generates readable
descriptions of
failure-inducing,
real-world images
‱ Effective retraining
‱ Leverages
availability of
simulators
Fahmy et al., 2022
SAFE: Black-Box Approach
51
[Pipeline: error-inducing images → data preprocessing → feature extraction → dimensionality reduction → clustering → root-cause clusters → unsafe-set selection → retraining → improved DNN]
‱ Black-box alternative to HUDD
‱ No need to extend the DNN (LRP)
‱ Reduces training time and memory usage
‱ More accurate clustering (density based,
DBSCAN)
Attaoui et al., 2022
Reinforcement Learning Testing
and Safety
52
Reward and Functional Faults
54
‱ Testing goal: Detect and explain reward and functional faults
‱ A pole is attached to a cart, and the goal is to move the cart right and left to
keep the pole from falling.
‱ Reward fault: the accumulated reward is less than a threshold, e.g., the pole
falls down in the first 70 timesteps
‱ Functional fault: the pole is stable but, regardless of the accumulated reward,
the cart moves beyond the 2.4 m distance limit and the episode terminates
Functional fault Reward fault
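Using the illustrative thresholds from this slide, an episode can be labelled as follows (a sketch only; STARLA's actual fault oracle is defined over full episodes):

```python
def classify_episode(total_reward, final_cart_pos,
                     reward_threshold=70.0, pos_limit=2.4):
    """Label a cart-pole episode with the two fault types above."""
    if abs(final_cart_pos) > pos_limit:
        return "functional fault"    # cart left the track
    if total_reward < reward_threshold:
        return "reward fault"        # pole fell too early
    return "pass"
```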
STARLA: Objectives
‱ Detect as quickly
as possible diverse
faulty episodes
‱ Accurately
characterize faulty
episodes
‱ STARLA: Search
and ML based
55
Zolfagharian et al. 2022
System-Level Testing and
Analysis
56
Testing via Physics-based
Simulation
58
ADAS
(SUT)
Simulator (Matlab/Simulink)
Model
(Matlab/Simulink)
▪ Physical plant (vehicle / sensors / actuators)
▪ Other cars
▪ Pedestrians
▪ Environment (weather / roads / traffic signs)
Test input
Test output
time-stamped output
Reinforcement Learning for ADAS
Testing
‱ Use RL to change the environment in a realistic manner
with the objective of triggering safety violations
59
Action (Behavior of environment actors, e.g., vehicle in front)
Reward (e.g., based on distance from vehicle in front)
State/Next State (Locations and conditions about the
environment, e.g., collision)
RL Agent
RL-Environment
Simulator
(CARLA)
Control
(Image, location, …)
ADAS
Example Violation
‱ Violation: Ego Vehicle should not collide with other vehicle
‱ Vehicle-in-front slows down suddenly and then moves to the right
‱ Possible reason: Agent was not trained with such episodes
61
Car View Top View
System Testing Challenges
62
Challenges: a large input space, many safety requirements, and
computationally-intensive simulation
To address these challenges, we propose SAMOTA (Surrogate-Assisted
Many-Objective Testing Approach), leveraging many-objective
search and Surrogate Models (SMs)
Ul Haq et al. 2022
Surrogate Models
‱ Surrogate model: Model that mimics the simulator, to a
certain extent, while being much less computationally
expensive
‱ Research: Combine search with surrogate modeling to
decrease the computational cost of testing (Ul Haq et al. 2022)
63
Polynomial Regression (PR), Radial Basis Function (RBF), Kriging (KR)
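A minimal, one-dimensional illustration of the idea using a polynomial-regression surrogate via NumPy's polyfit (SAMOTA's actual surrogates are multi-dimensional and include RBF and Kriging models):

```python
import numpy as np

def fit_surrogate(inputs, fitnesses, degree=2):
    """Fit a polynomial-regression surrogate to (input, fitness) pairs
    collected from expensive simulator runs; the returned callable
    predicts fitness at negligible cost."""
    return np.poly1d(np.polyfit(inputs, fitnesses, degree))

# Five "simulator" evaluations of a quadratic fitness landscape
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x * x for x in xs]
surrogate = fit_surrogate(xs, ys)
```

The search then queries the surrogate thousands of times and only runs the real simulator on the most promising or most uncertain candidates.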
SAMOTA
64
Steps:
1. Initialization
2. Global search
3. Local search
4. Global search
5. Local search
6. …
[Diagram: initialisation executes the simulator to seed a database; global SMs feed a many-objective search algorithm yielding the most critical and most uncertain test cases, which are executed on the simulator; clustering of top points builds a local SM per cluster; a single-objective search over each local SM yields further critical test cases; the result is a minimal test suite.]
ML Impact on System Safety
‱ Inherent uncertainty in ML models
‱ Test mitigation mechanisms in
systems to handle misclassifications
or mispredictions
‱ Goal: We want to learn, as accurately
as possible, when an ML component
leads to system safety violations in
terms of inputs and outputs
‱ Applications: This is expected to help
guide and focus the testing of ML
components or implement safety
monitors for them.
65
Learn the Safety Envelope
Monitoring: How far are we from the safety envelope
for a given violation probability?
66
[Plot: the input/output space of the ML component partitioned into safe and unsafe regions by the learnt envelope]
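A deliberately tiny, one-dimensional sketch of learning such an envelope from labelled observations of an ML component's output (a real envelope is a multi-dimensional region with an associated violation probability):

```python
def learn_safe_interval(observations):
    """Learn the tightest interval containing every output value that was
    observed to be safe; `observations` is a list of (value, is_safe)."""
    safe = [v for v, is_safe in observations if is_safe]
    return min(safe), max(safe)

def in_envelope(value, interval):
    """Run-time check: flag values outside the learnt interval as unsafe."""
    lo, hi = interval
    return lo <= value <= hi
```

At run time, outputs falling outside the learnt envelope would trigger the system's mitigation mechanisms.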
Conclusions
67
Safety Analysis
‱ The adoption of machine learning to enable autonomy raises new
safety challenges and amplifies existing ones
‱ Assurance cases (1) + run-time assurance architecture (2)
‱ Automated test support to collect evidence (1) and learn safety
conditions to monitor (2)
‱ Approaches based on metaheuristic search and machine learning
have been shown to be practical, effective, and reasonably efficient
‱ Limited industrial experience properly reported on these topics
68
Automated Testing
‱ Remains key mechanism to provide safety evidence
‱ Different levels of testing: model, integration, system
‱ Automation is key at all levels and requires different strategies
‱ Strategy: evolutionary computing and machine learning
‱ Scalability is the main challenge: simulations, surrogate models, …
‱ More work needed beyond stateless models (DNN): reinforcement
learning, …
69
Autonomous Systems: How to
Address the Dilemma between
Autonomy and Safety
Lionel Briand
ASE 2022 Keynote
http://www.lbriand.info
References
72
Selected References
‱ Ul Haq et al. "Automatic Test Suite Generation for Key-points Detection DNNs Using Many-Objective Search" ACM International
Symposium on Software Testing (ISSTA), 2021
‱ Ul Haq et al., “Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization”
IEEE/ACM ICSE 2022
‱ Fahmy et al. "Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning" IEEE Transactions
on Reliability, Special section on Quality Assurance of Machine Learning Systems, 2021
‱ Fahmy et al. "Simulator-based explanation and debugging of hazard-triggering events in DNN-based safety-critical systems”,
ArXiv report, https://arxiv.org/pdf/2204.00480.pdf, ACM TOSEM, 2022
‱ Attaoui et al., “Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering”, ArXiv report,
https://arxiv.org/pdf/2201.05077v1.pdf, ACM TOSEM, 2022
‱ Aghababaeyan et al., “Black-Box Testing of Deep Neural Networks through Test Case Diversity”,
https://arxiv.org/pdf/2112.12591.pdf
‱ Zolfagharian et al., “Search-Based Testing Approach for Deep Reinforcement Learning Agents”,
https://arxiv.org/pdf/2206.07813.pdf
73
Selected Safety References
‱ Asaadi et al., “Assured Integration of Machine Learning-based
Autonomy on Aviation Platforms”, 2020 AIAA/IEEE 39th
Digital Avionics Systems Conference (DASC)
‱ Burton et al., “Mind the gaps: Assuring the safety of autonomous
systems from an engineering, ethical, and legal perspective”,
Artificial Intelligence (Elsevier), 2019
‱ Tambon et al., “How to certify machine learning based
safety-critical systems? A systematic literature review”,
Automated Software Engineering (Springer), 2022
‱ Leveson, “Engineering a safer world: Systems thinking applied to
safety”, The MIT Press, 2016.
74
Selected ML Testing References
‱ Goodfellow et al. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
‱ Zhang et al. "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems." In 33rd
IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018.
‱ Tian et al. "DeepTest: Automated testing of deep-neural-network-driven autonomous cars." In Proceedings of the 40th
international conference on software engineering, 2018.
‱ Li et al. “Structural Coverage Criteria for Neural Networks Could Be Misleading”, IEEE/ACM 41st International Conference on
Software Engineering: New Ideas and Emerging Results (NIER)
‱ Kim et al. "Guiding deep learning system testing using surprise adequacy." In IEEE/ACM 41st International Conference on Software
Engineering (ICSE), 2019.
‱ Ma et al. "DeepMutation: Mutation testing of deep learning systems." In 2018 IEEE 29th International Symposium on Software
Reliability Engineering (ISSRE), 2018.
‱ Zhang et al. "Machine learning testing: Survey, landscapes and horizons." IEEE Transactions on Software Engineering (2020).
‱ Riccio et al. "Testing machine learning based systems: a systematic mapping." Empirical Software Engineering 25, no. 6 (2020)
‱ Gerasimou et al., “Importance-Driven Deep Learning System Testing”, IEEE/ACM 42nd International Conference on Software
Engineering, 2020
75

More Related Content

PPTX
ISO/PAS 21448 (SOTIF) in the Development of ADAS and Autonomous Vehicles
PDF
Azure Lab Services.pdf
PPTX
Testing object oriented software.pptx
PPTX
Software Quality Assurance
PDF
V2X Communications: Getting our Cars Talking
PDF
MagicEye Driver Fatigue Monitoring System
PDF
Sikuli script
PDF
Cellular V2X
ISO/PAS 21448 (SOTIF) in the Development of ADAS and Autonomous Vehicles
Azure Lab Services.pdf
Testing object oriented software.pptx
Software Quality Assurance
V2X Communications: Getting our Cars Talking
MagicEye Driver Fatigue Monitoring System
Sikuli script
Cellular V2X

What's hot (20)

PPTX
Java Swing
PPTX
AI Testing What Why and How To Do It?
 
PDF
Types of Software Testing | Edureka
PPTX
Software testing
PPTX
PROGRESS OF AUTOSAR STANDARDS FOR FUTURE INTELLIGENT VEHICLES
 
PPTX
Introduction to ASPICE
PDF
An integrative solution towards SOTIF and AV safety
PDF
Autonomous Vehicles: Technologies, Economics, and Opportunities
PPT
Corba
PPTX
PostGIS - National Education Center for GIS: Open Source GIS
PPT
Component based models and technology
ODP
Software testing ppt
PPT
Chapter 9 Testing Strategies.ppt
PPT
Uml package diagram
PDF
Structure chart
PPTX
ISO 26262 introduction
PPTX
Design Pattern in Software Engineering
PPT
Parking Guidance Systems
PPTX
Automotive Hacking
PDF
Automotive Infotainment Test Solution or In-Vehicle Infotainment Testing (IVI...
Java Swing
AI Testing What Why and How To Do It?
 
Types of Software Testing | Edureka
Software testing
PROGRESS OF AUTOSAR STANDARDS FOR FUTURE INTELLIGENT VEHICLES
 
Introduction to ASPICE
An integrative solution towards SOTIF and AV safety
Autonomous Vehicles: Technologies, Economics, and Opportunities
Corba
PostGIS - National Education Center for GIS: Open Source GIS
Component based models and technology
Software testing ppt
Chapter 9 Testing Strategies.ppt
Uml package diagram
Structure chart
ISO 26262 introduction
Design Pattern in Software Engineering
Parking Guidance Systems
Automotive Hacking
Automotive Infotainment Test Solution or In-Vehicle Infotainment Testing (IVI...
Ad

Similar to Autonomous Systems: How to Address the Dilemma between Autonomy and Safety (20)

PDF
Applications of Search-based Software Testing to Trustworthy Artificial Intel...
PDF
Testing Machine Learning-enabled Systems: A Personal Perspective
PDF
Automated Testing and Safety Analysis of Deep Neural Networks
PDF
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
PDF
Adobe Audition Crack FRESH Version 2025 FREE
PDF
Functional Safety in ML-based Cyber-Physical Systems
PDF
Approach AI assurance
PDF
Safety Verification of Deep Neural Networks_.pdf
PPTX
[DSC Croatia 22] Towards Classification Trustworthiness in Safety-Critical Au...
PDF
“Building an Autonomous Detect-and-Avoid System for Commercial Drones,” a Pre...
PDF
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...
PPTX
DSML 2021 Keynote: Intelligent Software Engineering: Working at the Intersect...
PDF
Enzo Fenoglio (Cisco) Beyond Deep Learning – How to be ahead of tomorrow
 
PDF
Leveraging Artificial Intelligence Processing on Edge Devices
 
PDF
Keynote presentation at DeepTest Workshop 2025
PPTX
rsec2a-2016-jheaton-morning
PPTX
Explainability First! Cousteauing the Depths of Neural Networks
PPTX
Introduction to AI Safety (public presentation).pptx
PDF
Machine Learning: Past, Present and Future - by Tom Dietterich
PDF
Quant university MRM and machine learning
Applications of Search-based Software Testing to Trustworthy Artificial Intel...
Testing Machine Learning-enabled Systems: A Personal Perspective
Automated Testing and Safety Analysis of Deep Neural Networks
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Adobe Audition Crack FRESH Version 2025 FREE
Functional Safety in ML-based Cyber-Physical Systems
Approach AI assurance
Safety Verification of Deep Neural Networks_.pdf
[DSC Croatia 22] Towards Classification Trustworthiness in Safety-Critical Au...
“Building an Autonomous Detect-and-Avoid System for Commercial Drones,” a Pre...
Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction ...
DSML 2021 Keynote: Intelligent Software Engineering: Working at the Intersect...
Enzo Fenoglio (Cisco) Beyond Deep Learning – How to be ahead of tomorrow
 
Leveraging Artificial Intelligence Processing on Edge Devices
 
Keynote presentation at DeepTest Workshop 2025
rsec2a-2016-jheaton-morning
Explainability First! Cousteauing the Depths of Neural Networks
Introduction to AI Safety (public presentation).pptx
Machine Learning: Past, Present and Future - by Tom Dietterich
Quant university MRM and machine learning
Ad

More from Lionel Briand (20)

PDF
LTM: Scalable and Black-box Similarity-based Test Suite Minimization based on...
PDF
TEASMA: A Practical Methodology for Test Adequacy Assessment of Deep Neural N...
PDF
Automated Test Case Repair Using Language Models
PDF
FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categorie...
PDF
Precise and Complete Requirements? An Elusive Goal
PDF
Large Language Models for Test Case Evolution and Repair
PDF
Metamorphic Testing for Web System Security
PDF
Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...
PDF
Fuzzing for CPS Mutation Testing
PDF
Data-driven Mutation Analysis for Cyber-Physical Systems
PDF
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems
PDF
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...
PDF
PRINS: Scalable Model Inference for Component-based System Logs
PDF
Revisiting the Notion of Diversity in Software Testing
PDF
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
PDF
Reinforcement Learning for Test Case Prioritization
PDF
Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results ...
PDF
On Systematically Building a Controlled Natural Language for Functional Requi...
PDF
Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and...
PDF
Guidelines for Assessing the Accuracy of Log Message Template Identification ...
LTM: Scalable and Black-box Similarity-based Test Suite Minimization based on...
TEASMA: A Practical Methodology for Test Adequacy Assessment of Deep Neural N...
Automated Test Case Repair Using Language Models
FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categorie...
Precise and Complete Requirements? An Elusive Goal
Large Language Models for Test Case Evolution and Repair
Metamorphic Testing for Web System Security
Simulator-based Explanation and Debugging of Hazard-triggering Events in DNN-...
Fuzzing for CPS Mutation Testing
Data-driven Mutation Analysis for Cyber-Physical Systems
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems
ATM: Black-box Test Case Minimization based on Test Code Similarity and Evolu...
PRINS: Scalable Model Inference for Component-based System Logs
Revisiting the Notion of Diversity in Software Testing
Mathematicians, Social Scientists, or Engineers? The Split Minds of Software ...
Reinforcement Learning for Test Case Prioritization
Mutation Analysis for Cyber-Physical Systems: Scalable Solutions and Results ...
On Systematically Building a Controlled Natural Language for Functional Requi...
Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and...
Guidelines for Assessing the Accuracy of Log Message Template Identification ...

Recently uploaded (20)

PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PPTX
Materi-Enum-and-Record-Data-Type (1).pptx
PPTX
Transform Your Business with a Software ERP System
PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PPTX
ISO 45001 Occupational Health and Safety Management System
PPTX
ai tools demonstartion for schools and inter college
PDF
Complete React Javascript Course Syllabus.pdf
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PDF
Softaken Excel to vCard Converter Software.pdf
PDF
System and Network Administration Chapter 2
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PPTX
Online Work Permit System for Fast Permit Processing
PPTX
Essential Infomation Tech presentation.pptx
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
DOCX
The Five Best AI Cover Tools in 2025.docx
PDF
medical staffing services at VALiNTRY
PDF
System and Network Administraation Chapter 3
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Materi-Enum-and-Record-Data-Type (1).pptx
Transform Your Business with a Software ERP System
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
ISO 45001 Occupational Health and Safety Management System
ai tools demonstartion for schools and inter college
Complete React Javascript Course Syllabus.pdf
Adobe Illustrator 28.6 Crack My Vision of Vector Design
2025 Textile ERP Trends: SAP, Odoo & Oracle
Softaken Excel to vCard Converter Software.pdf
System and Network Administration Chapter 2
Which alternative to Crystal Reports is best for small or large businesses.pdf
Online Work Permit System for Fast Permit Processing
Essential Infomation Tech presentation.pptx
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
The Five Best AI Cover Tools in 2025.docx
medical staffing services at VALiNTRY
System and Network Administraation Chapter 3
How to Choose the Right IT Partner for Your Business in Malaysia
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises

Autonomous Systems: How to Address the Dilemma between Autonomy and Safety

  • 1. Autonomous Systems: How to Address the Dilemma between Autonomy and Safety Lionel Briand ASE 2022 Keynote http://www.lbriand.info
  • 3. Autonomous Systems ‱ Autonomy lies on a spectrum ‱ Fully autonomous systems Ăł advisory systems ‱ Open context: context not fully specified 3 Burton et al., 2019
  • 4. Why Autonomy? ‱ Reactivity to events, e.g., lunar rover ‱ Alleviate cognitive load in complex tasks. E.g., adaptive cruise control ‱ New missions without human physical constraints, e.g., pilots in fighter jets ‱ Complete specifications difficult to formalize: environmental complexity as well as continual change 4
  • 5. Autonomy and Machine Learning ‱ Machine learning (ML) helps with the implementation of the understanding and decision layers. ‱ ML: (1) infer input-output relationships from samples of intended behavior, (2) generalize the learned function to unknown and unforeseen inputs ‱ Risk: unanticipated responses and emergent system behavior ‱ Our focus: autonomous systems that use ML models 5
  • 6. Semantic Gap ‱ Gap between intended functionality and specified functionality (Burton et al., 2019). ‱ More challenging for autonomous systems ‱ Causes: - Complexity and unpredictability of the operational domain - E.g., Urban environments 6
  • 7. Semantic Gap ‱ Gap between intended functionality and specified functionality (Burton et al., 2019). ‱ More challenging for autonomous systems ‱ Causes: - Complexity and unpredictability of the system itself - E.g., Heterogeneous sensing channels 7
  • 8. Semantic Gap ‱ Gap between intended functionality and specified functionality (Burton et al., 2019). ‱ More challenging for autonomous systems ‱ Causes: - Transfer of decision function to the system - E.g., ML functions do not deliver clear-cut answers 8
  • 9. Paradox 10 ‱ “The problem of deriving a suitable specification of the intended behavior is instead transferred to the problem of demonstrating that the implemented (learned) behaviour meets the intent.” Burton et al. 2019 ‱ Main challenge: Definition and validation of adequate system safety requirements
  • 10. Assurance Cases ‱ A valid, evidence-based justification for a set of claims about the safety of a system for a given function over its operational context 11 Burton et al., 2019
  • 11. Example: Autonomous Taxiing 12 Asaadi et al., 2020 CNN: ‱ Cross track error (CTE) ‱ Heading Error (HE)
  • 12. Example: Autonomous Taxiing 13 ‱ Safety property: Lateral runway excursions ‱ Safety requirement: the aircraft CTE shall not exceed a specified offset whilst taxiing ‱ Hazard: aligning the aircraft nose with a different runway marking instead of the centerline ‱ Mitigation: retraining the ML component using training data augmented with appropriate test data exhibiting violations
  • 13. Assurance Case 14 GSN notation: Goals, strategies, evidence Structured arguments backed up by evidence Evidence: ‱ artifacts ‱ verification Asaadi et al., 2020
  • 14. Safety as a Control Problem 15 ‱ Safety should be treated as an emergent property of the system (Leveson, 2016) ‱ Paradigm Change:
  • 15. Run-time Assurance Architecture 16 ‱ HiL simulator and iron bird ‱ CNN outputs used for run-time monitoring ‱ Contingency actions Asaadi et al., 2020
  • 16. ML Challenges ‱ Robustness to adversarial or unexpected inputs ‱ Uncertainty of predictions ‱ Explainability: provides insights in a human interpretable way ‱ Verification: formal (proof) or experimental (test) 17 Tambon et al., 2022
  • 17. Safety Verification ‱ Ideally: ‱ Formal safety guarantees ‱ Practical assumptions (about input space, model, properties 
) ‱ Scalable (e.g., to large models and systems) ‱ But ‱ We cannot have it all (e.g., high dimensionality, many parameters) ‱ Safety is not a model property, rather a system one 18
  • 18. Safety Verification ‱ Formal analysis, e.g., reachability analysis ‱ Focus on robustness to perturbations ‱ Guarantees about models (not systems) under very restrictive assumptions (e.g., model architecture, input space shape, reasonable over-approximation of output space) ‱ Scalability issues ‱ Testing-based analysis, e.g., search-based testing ‱ Heuristics for input space exploration ‱ No guarantees, but no restrictive assumptions and more scalable ‱ Smart combination? 19
  • 19. ML-Related Assurance Questions ‱ Model level: Under which input conditions is an ML component classifying/predicting (in)correctly? ‱ System level: Under which conditions is an ML component (un)safe for the system? ‱ Integration level: Are there possible undesirable interactions among (ML) components? ‱ Context: No specifications or code for ML components, increasingly large models (DL, RL) 20
  • 20. Summary ‱ Autonomous safety functions can help achieve higher safety, e.g., emergency braking ‱ However, they also entail risks; (1) design-time assurance cases and (2) run-time safety monitoring play a key role in alleviating those risks ‱ Automated support to collect appropriate evidence (1) and guide monitoring (2) is required in the context of ML-enabled autonomous systems 21
  • 23. Test Inputs: Adversarial or Natural? ‱ Adversarial inputs: Focus on robustness, e.g., noise or attacks ‱ Natural inputs: Focus on functional aspects, e.g., functional safety 24
  • 24. Example: Key-points Detection ‱ DNNs used for key-points detection in images ‱ Many applications, e.g., face recognition ‱ Testing: Find a test suite that causes the DNN to poorly predict as many key-points as possible within a time budget ‱ Images generated by a simulator 28 Ground truth Predicted
  • 25. Example Application ‱ Drowsiness or gaze detection based on interior camera monitoring the driver ‱ In the drowsiness or gaze detection problem, each Key-Point (KP) may be highly important for safety ‱ Each KP leads to a test objective ‱ For our subject DNN, we have 27 test objectives ‱ Goal: Cause the DNN to mispredict as many key-points as possible ‱ Solution: Many-objective search algorithms (based on genetic algorithms) combined with simulator 30 Ul Haq et al., 2021
  • 26. Overview 31 [Pipeline diagram: the input generator (search) feeds input vectors to the simulator, which produces test images; the DNN predicts key-point positions, and a fitness calculator compares them against the actual key-point positions to compute a fitness score (error value) that steers the search toward the most critical test inputs]
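The fitness calculator in the pipeline above can be sketched minimally, with one error-based objective per key-point (the function name and the (x, y) tuple representation are assumptions):

```python
import math

def keypoint_fitness(predicted, actual):
    # one objective per key-point: the Euclidean distance between the
    # predicted and ground-truth (x, y) positions; the many-objective
    # search then tries to maximize these error values
    return [math.dist(p, a) for p, a in zip(predicted, actual)]
```

With 27 key-points this yields 27 objectives, matching the many-objective formulation described above.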
  • 27. Results ‱ Our approach is effective in generating test suites that cause the DNN to severely mispredict more than 93% of all key-points on average ‱ Not all mispredictions can be considered failures 
 (e.g., shadows) ‱ We must know when the DNN cannot be expected to be accurate and have contingency measures ‱ Some key-points are more severely mispredicted than others; detailed analysis revealed two reasons: ‱ Under-representation of some key-points (hidden) in the training data ‱ Large variation in the shape and size of the mouth across different 3D models (more training needed) 32
  • 28. Interpretation ‱ Regression trees to predict accuracy based on simulation parameters ‱ Enable detailed analysis to find the root causes of high Normalized Error (NE) values, e.g., shadow on the location of KP26 is the cause of high NE values ‱ Regression trees show excellent accuracy and reasonable size ‱ Amenable to risk analysis, gaining useful safety insights, and contingency plans at run-time 33 Representative rules derived from the decision tree for KP26 (M: Model-ID, P: Pitch, R: Roll, Y: Yaw, NE: Normalized Error): ‱ M = 9 ∧ P < 18.41 → NE 0.04 ‱ M = 9 ∧ P ≄ 18.41 ∧ R < −22.31 ∧ Y < 17.06 → NE 0.26 ‱ M = 9 ∧ P ≄ 18.41 ∧ R < −22.31 ∧ 17.06 ≀ Y < 19 → NE 0.71 ‱ M = 9 ∧ P ≄ 18.41 ∧ R < −22.31 ∧ Y ≄ 19 → NE 0.36 (A) A test image satisfying the first condition, NE = 0.013; (B) a test image satisfying the third condition, NE = 0.89
  • 29. Real Images ‱ Test selection requires a different approach than with a simulator ‱ Labeling costs are significant 36
  • 30. DNN Structural Coverage Criteria 37 Chen et al. 2020 ‱ Neuron coverage ‱ How neurons are activated ‱ Many variants
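As a rough sketch of the basic criterion, neuron coverage over a test set is the fraction of neurons activated above a threshold on at least one input (the matrix layout is an assumption):

```python
import numpy as np

def neuron_coverage(activations, threshold=0.0):
    # activations: (n_inputs, n_neurons) matrix of recorded neuron
    # outputs over a test set; a neuron counts as covered if its
    # activation exceeds the threshold on at least one input
    covered = (activations > threshold).any(axis=0)
    return float(covered.mean())
```

The many variants mentioned above (e.g., k-multisection, boundary coverage) change how the activation range of each neuron is partitioned, not this basic counting idea.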
  • 31. Limitations ‱ Require access to the DNN internals and sometimes the training set. Not realistic in many practical settings. ‱ There is a weak correlation between coverage and misclassification or poor predictions for natural inputs ‱ Also many questions regarding the studies focused on adversarial inputs 
 38
  • 32. Diversity-driven Test Selection ‱ Test inputs are real images ‱ Black-box approach based on measuring the diversity of test inputs ‱ Scalable selection ‱ The more diverse, the more likely test inputs are to reveal faults 39 Aghababaeyan et al., 2022
  • 33. Geometric Diversity (GD) ‱ Given a dataset X with corresponding feature vectors V, the geometric diversity of a subset S ⊆ X is the hyper-volume of the parallelepiped spanned by the rows of Vs (the feature vectors of the items in S); the larger the volume, the more diverse the feature space of S 40
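The definition above can be sketched with the Gram determinant: for a k×d feature matrix V_S with k ≀ d, the hyper-volume of the spanned parallelepiped equals sqrt(det(V_S V_Sᔀ)). A minimal sketch, with names assumed for illustration:

```python
import numpy as np

def geometric_diversity(V_S):
    # hyper-volume of the parallelepiped spanned by the rows of V_S,
    # computed as the square root of the Gram determinant; a larger
    # volume means a more diverse subset in feature space
    gram = V_S @ V_S.T
    # max() guards against tiny negative determinants from rounding
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))
```

Two orthonormal feature vectors give a volume of 1.0, while near-duplicate vectors collapse the volume toward 0, matching the intuition that redundant test inputs add little diversity.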
  • 34. Extracting Image Features ‱ VGG16 is a convolutional neural network trained on a subset of the ImageNet dataset, a collection of over 14 million images belonging to 22,000 categories. 41
  • 35. Correlation with Faults [Figure: correlation between geometric diversity and faults]
  • 36. Pareto Front Optimization 44 ‱ Black-box test selection to detect as many diverse faults as possible for a given test budget ‱ Search-based approach: Multi-Objective Genetic Search (NSGA-II) ‱ Two objectives (Max): ‱ diversity (GD score) ‱ uncertainty (e.g., Gini score)
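The uncertainty objective can be illustrated with a Gini-style score over the model's predicted class distribution (a common formulation; the exact score used in the cited work may differ):

```python
def gini_score(probs):
    # Gini impurity of a predicted class distribution: 1 - sum(p_i^2);
    # 0 for a confident one-hot prediction, approaching 1 - 1/k for a
    # maximally uncertain uniform prediction over k classes
    return 1.0 - sum(p * p for p in probs)
```

Inputs with high Gini scores are those the DNN is least sure about, which complements diversity when selecting a fault-revealing test subset.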
  • 37. HUDD 45 ‱ How to help perform risk analysis with real images? ‱ Rely on heatmaps to generate clusters of images leading to failures because of a same root cause (common characteristics of the images) ‱ Enable engineers to ensure safety by introducing countermeasures ‱ Process: Step 1, heatmap-based clustering of error-inducing test set images into root-cause clusters (C1, C2, C3); Step 2, inspection of a subset of each cluster's elements. Fahmy et al. 2021
  • 38. Example Application ‱ Classification ‱ Gaze Detection 46 [Figure: gaze directions discretized into angular classes in 22.5° steps (0–337.5°): Top Center, Top Right, Middle Right, Bottom Right, Bottom Center, Bottom Left, Middle Left, Top Left]
  • 39. Clusters identify different problems 49 ‱ Cluster 1 (angle ~157.5): borderline cases ‱ Cluster 2 (eye middle center): incomplete set of classes ‱ Cluster 3 (near closed eyes): incomplete training set
  • 40. SEDE: Simulator-based Explanations for DNN failurEs 50 ‱ Built on HUDD ‱ Generates readable descriptions of failure-inducing, real-world images ‱ Effective retraining ‱ Leverages availability of simulators Fahmy et al., 2022
  • 41. SAFE: Black-Box Approach 51 ‱ Black-box alternative to HUDD ‱ No need to extend the DNN (LRP) ‱ Reduces training time and memory usage ‱ More accurate clustering (density-based, DBSCAN) ‱ Pipeline: error-inducing images → data preprocessing → feature extraction → dimensionality reduction → clustering into root-cause clusters → unsafe-set selection → retraining → improved DNN. Attaoui et al., 2022
  • 43. Reward and Functional Faults 54 ‱ Testing goal: Detect and explain reward and functional faults ‱ A pole is attached to a cart, and the goal is to move the cart right and left to keep the pole from falling. ‱ Reward fault: the accumulated reward is less than a threshold, e.g., the pole falls down in the first 70 timesteps ‱ Functional fault: the pole is stable and, regardless of the accumulated reward, the cart moves beyond the 2.4 limited distance, and the episode terminates Functional fault Reward fault
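Based on the definitions above, an episode-level oracle distinguishing the two fault types can be sketched as follows (the names and threshold values mirror the slide's cart-pole example but are assumptions):

```python
def classify_episode(rewards, cart_positions,
                     reward_threshold=70, x_limit=2.4):
    # reward fault: the accumulated reward falls below the threshold,
    # e.g., the pole falls within the first 70 timesteps, so the
    # episode collects too few reward ticks
    # functional fault: the cart leaves the +/-2.4 track limit even
    # though the accumulated reward may be acceptable
    faults = []
    if sum(rewards) < reward_threshold:
        faults.append("reward fault")
    if any(abs(x) > x_limit for x in cart_positions):
        faults.append("functional fault")
    return faults
```

Separating the two matters for testing: an episode can satisfy the reward objective while still violating a functional (safety) constraint.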
  • 44. STARLA: Objectives ‱ Detect as quickly as possible diverse faulty episodes ‱ Accurately characterize faulty episodes ‱ STARLA: Search and ML based 55 Zolfagharian et al. 2022
  • 46. Testing via Physics-based Simulation 58 ‱ ADAS (SUT) connected to a simulator (Matlab/Simulink) ‱ Simulation model (Matlab/Simulink) includes: ▪ Physical plant (vehicle / sensors / actuators) ▪ Other cars ▪ Pedestrians ▪ Environment (weather / roads / traffic signs) ‱ Test input → time-stamped test output
  • 47. Reinforcement Learning for ADAS Testing ‱ Use RL to change the environment in a realistic manner with the objective of triggering safety violations 59 ‱ Action: behavior of environment actors, e.g., vehicle in front ‱ Reward: e.g., based on distance from vehicle in front ‱ State/Next State: locations and conditions about the environment, e.g., collision ‱ RL agent interacts with the RL environment, i.e., the simulator (CARLA) and the ADAS, exchanging control data (image, location, 
)
  • 48. Example Violation ‱ Violation: Ego vehicle should not collide with other vehicle ‱ Vehicle-in-front slows down suddenly and then moves to the right ‱ Possible reason: Agent was not trained with such episodes 61
  • 49. System Testing Challenges 62 ‱ Challenges: large input space, many safety requirements, computationally-intensive simulation ‱ To address these challenges, we propose SAMOTA (Surrogate-Assisted Many-Objective Testing Approach), leveraging many-objective search and Surrogate Models (SMs) Ul Haq et al. 2022
  • 50. Surrogate Models ‱ Surrogate model: Model that mimics the simulator, to a certain extent, while being much less computationally expensive ‱ Research: Combine search with surrogate modeling to decrease the computational cost of testing (Ul Haq et al. 2022) 63 Polynomial Regression (PR) Radial Basis Function (RBF) Kriging (KR)
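As one example of the RBF family listed above, a Gaussian RBF surrogate that interpolates observed simulator outcomes can be sketched in a few lines (the interface and names are assumptions, not SAMOTA's actual implementation):

```python
import numpy as np

def fit_rbf_surrogate(X, y, eps=1.0):
    # Gaussian RBF interpolant: solve for weights so the surrogate
    # exactly reproduces the observed simulator outcomes y at the
    # previously executed inputs X
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    weights = np.linalg.solve(np.exp(-((eps * dists) ** 2)), y)

    def predict(x):
        # cheap approximation of the simulator output at a new input x
        r = np.linalg.norm(X - x, axis=-1)
        return float(np.exp(-((eps * r) ** 2)) @ weights)

    return predict
```

The surrogate lets the search rank candidate test cases cheaply; only the most critical or most uncertain candidates are then run on the expensive simulator.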
  • 51. SAMOTA 64 Steps: 1. Initialization 2. Global search 3. Local search 4. Global search 5. Local search 6. 
 ‱ Global search: global SMs and a many-objective search algorithm yield the most critical and most uncertain test cases, which are executed on the simulator and stored in a database ‱ Local search: clustering of top points, one local SM per cluster, and a single-objective search algorithm yield further critical test cases ‱ Output: minimal test suite
  • 52. ML Impact on System Safety ‱ Inherent uncertainty in ML models ‱ Test mitigation mechanisms in systems to handle misclassifications or mispredictions ‱ Goal: We want to learn, as accurately as possible, when an ML component leads to system safety violations in terms of inputs and outputs ‱ Applications: This is expected to help guide and focus the testing of ML components or implement safety monitors for them. 65
  • 53. Learn the Safety Envelope 66 ‱ Monitoring: How far are we from the safety envelope for a given violation probability? ‱ [Figure: learnt envelope separating safe from unsafe regions in the input-output space]
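One simple way to operationalize such an envelope is to calibrate, from labeled simulation runs, a conservative threshold on a monitor score such that the empirical violation rate stays below a target probability (the function name and score semantics are assumptions):

```python
def calibrate_threshold(scores, unsafe, p_max=0.05):
    # scores: per-run monitor scores (higher = presumed safer)
    # unsafe: per-run booleans marking observed safety violations
    # returns the lowest threshold t such that, among the runs with
    # score >= t, the empirical violation rate is <= p_max
    pairs = sorted(zip(scores, unsafe), reverse=True)
    violations, best = 0, None
    for i, (score, is_unsafe) in enumerate(pairs, start=1):
        violations += is_unsafe
        if violations / i <= p_max:
            best = score
    return best
```

At run time, the monitor would treat any input scoring below the calibrated threshold as outside the learned safety envelope and trigger a contingency.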
  • 55. Safety Analysis ‱ The adoption of machine learning to enable autonomy raises new safety challenges or amplifies existing ones ‱ Assurance cases (1) + run-time assurance architecture (2) ‱ Automated test support to collect evidence (1) and learn safety conditions to monitor (2) ‱ Approaches based on metaheuristic search and machine learning have been shown to be practical, effective, and reasonably efficient ‱ Limited industrial experience properly reported on these topics 68
  • 56. Automated Testing ‱ Remains key mechanism to provide safety evidence ‱ Different levels of testing: model, integration, system ‱ Automation is key at all levels and requires different strategies ‱ Strategy: evolutionary computing and machine learning ‱ Scalability is the main challenge: simulations, surrogate models, 
 ‱ More work needed beyond stateless models (DNN): reinforcement learning 
 69
  • 59. Selected References ‱ Ul Haq et al., "Automatic Test Suite Generation for Key-points Detection DNNs Using Many-Objective Search", ACM International Symposium on Software Testing and Analysis (ISSTA), 2021 ‱ Ul Haq et al., "Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization", IEEE/ACM ICSE 2022 ‱ Fahmy et al., "Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning", IEEE Transactions on Reliability, Special section on Quality Assurance of Machine Learning Systems, 2021 ‱ Fahmy et al., "Simulator-based explanation and debugging of hazard-triggering events in DNN-based safety-critical systems", ArXiv report, https://arxiv.org/pdf/2204.00480.pdf, ACM TOSEM, 2022 ‱ Attaoui et al., "Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering", ArXiv report, https://arxiv.org/pdf/2201.05077v1.pdf, ACM TOSEM, 2022 ‱ Aghababaeyan et al., "Black-Box Testing of Deep Neural Networks through Test Case Diversity", https://arxiv.org/pdf/2112.12591.pdf ‱ Zolfagharian et al., "Search-Based Testing Approach for Deep Reinforcement Learning Agents", https://arxiv.org/pdf/2206.07813.pdf 73
  • 60. Selected Safety References ‱ Asaadi et al., "Assured Integration of Machine Learning-based Autonomy on Aviation Platforms", 2020 AIAA/IEEE 39th Digital Avionics Systems Conference (DASC) ‱ Burton et al., "Mind the gaps: Assuring the safety of autonomous systems from an engineering, ethical, and legal perspective", Artificial Intelligence (Elsevier), 2019 ‱ Tambon et al., "How to certify machine learning based safety-critical systems? A systematic literature review", Automated Software Engineering (Springer), 2022 ‱ Leveson, "Engineering a safer world: Systems thinking applied to safety", The MIT Press, 2016 74
  • 61. Selected ML Testing References ‱ Goodfellow et al. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014). ‱ Zhang et al. "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems." In 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018. ‱ Tian et al. "DeepTest: Automated testing of deep-neural-network-driven autonomous cars." In Proceedings of the 40th international conference on software engineering, 2018. ‱ Li et al. “Structural Coverage Criteria for Neural Networks Could Be Misleading”, IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (NIER) ‱ Kim et al. "Guiding deep learning system testing using surprise adequacy." In IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019. ‱ Ma et al. "DeepMutation: Mutation testing of deep learning systems." In 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), 2018. ‱ Zhang et al. "Machine learning testing: Survey, landscapes and horizons." IEEE Transactions on Software Engineering (2020). ‱ Riccio et al. "Testing machine learning based systems: a systematic mapping." Empirical Software Engineering 25, no. 6 (2020) ‱ Gerasimou et al., “Importance-Driven Deep Learning System Testing”, IEEE/ACM 42nd International Conference on Software Engineering, 2020 75