Failures or False Alarms? Validating Tests and Failures for Cyber-Physical Systems
Shiva Nejati
Sedna Lab, University of Ottawa (Université d’Ottawa | University of Ottawa)
DeepTest Workshop, May 2025
When Failures Aren’t Failures
Software testing is about finding failures
Caveat: it is easy to mistake invalid failures for genuine ones!
• Failures that are not critical
• Failures resolved by current practices
• Failures due to inputs violating the system’s preconditions
• Failures from simulation non-determinism
• Failures from mismatches between simulated and real-world conditions
Faults vs. Failures: Valid or Invalid?
... It soon became clear that verification has a lot to do with what surrounds the system, and not just with the system itself.
The environment is not merely a backdrop
but an active participant that the system
must interact with correctly.
Michael Jackson: The Meaning of Requirements.
Annals of Software Engineering (1997)
Environment Matters for Failures
System verification: Environment ∧ System ⊨ Requirement
Failure: in a given environment, the system’s interaction with that environment violates a requirement:
∃ Environment s.t. Environment ∧ System ⊭ Requirement
Failures are contextual.
Interpreting Failures
Failures in a valid environment:
• The system is faulty
Failures in an invalid environment:
• Revisit the testing approach or assumptions
What makes an environment invalid?
• Incorrect assumptions about users and needs
• Inaccurate modeling of the operational context
• Poor simulation of real-world conditions
Runtime Crashes ≠ Always Failures
Crashes such as buffer overflows and segfaults
• Often treated as implicit test oracles
But a runtime crash does not always indicate a true failure.
Runtime Crashes Ignored for Function Models
Domain Context:
• Simulink function models in the automotive sector
Function Engineer Focus:
• Function engineers prioritize functionality and control requirements, often ignoring Simulink runtime errors
Reality:
• Crashes are treated as code-level issues, handled after conversion to C code
R. Matinnejad, S. Nejati, L. Briand, T. Bruckmann: Test Generation and Test Prioritization for Simulink Models with Dynamic Behavior. IEEE TSE 2019
Fuzzing Z3: Crashes vs. Developer Intent
Fuzzing the Z3 solver's command-line options, treating assertion violations as failures.
Feedback from Z3 developers:
"All options are for internal consumption and
debugging purposes. The user is expected to run the
tool without options ... ‘’
“I would rather spend my time on developing better
solving algorithms than think of better ways to make
experimental code harder to reach for random
testers."
https://github.com/Z3Prover/z3/issues/3892
Failures from Invalid Inputs
Requirement: When the autopilot is enabled, the aircraft should reach the desired
altitude within 500 seconds in calm air.
Failure occurs because throttle is insufficient — a system prerequisite is unmet.
K. Gaaloul, C. Menghi, S. Nejati, L. Briand, D. Wolfe: Mining Assumptions for Software Components Using Machine Learning. ESEC/FSE 2020
This is not a system bug — it’s a failure caused by an invalid input.
Deep Learning Predicts on Anything
Deep learning systems produce an output for any input, even invalid inputs
S. Dola, M. Dwyer, M. Soffa: Distribution-Aware
Testing of Neural Networks Using Generative
Models. ICSE 2021.
(Figure: valid inputs vs. corrupted/invalid inputs)
This behavior makes deep learning systems hard to test: they never "crash".
Flaky Failures from Simulation Nondeterminism
Simulation randomness can cause the same test to pass or fail unpredictably.
(Figures: initial scene; last scene of the 1st execution; last scene of a random re-execution)
Failures from Real World vs Synthetic Mismatches
Failures arise not from flaws in the model, but from the distribution gap
between real-world and synthetic data
(Figures: real-world data used for training and testing vs. synthetic data used for testing)
Distribution shift between real-world training data
and synthetic test data can produce erroneous
failures.
Explaining Failures
Inferring the conditions under which failures occur to
identify invalid cases
Explaining Failures in Autopilot
Precondition for the ascent requirement: The nose should be pointing
upward with adequate throttle applied.
Explaining Failures in Priority-Based Traffic Management
Design Assumption: High-priority traffic should remain under 75%
utilization to preserve network quality.
Building Failure Explanations
A data-driven method for learning rules from test data
Pipeline: test data (test inputs + pass/fail labels) → inferred model (decision trees / decision rules) → explanation rules of the form "if … then Fail".
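As a rough illustration, here is a minimal sketch (assuming scikit-learn and synthetic data; the feature names "throttle" and "pitch" are hypothetical) of how such rules can be inferred: a decision tree is fit on labelled test data, and every root-to-leaf path that ends in a Fail leaf reads as an "if … then Fail" rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical test data: each row is a test input; label 1 = fail, 0 = pass.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))                        # e.g., throttle, pitch
y = ((X[:, 0] < 0.3) & (X[:, 1] > 0.6)).astype(int)   # failures in one input region

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Every path ending in a leaf that predicts class 1 is an "if ... then Fail" rule.
print(export_text(tree, feature_names=["throttle", "pitch"]))
```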
Challenge: Test data generation is expensive.
Question: How do we generate test data more efficiently?
Efficient Test Data Generation Using Surrogate Models
Efficiently explore test input space
Use surrogate models to approximate pass/fail outcomes — without always running the test
(Figure: search space and surrogate model)
Use the surrogate's label if F̄(t) − e, F̄(t), and F̄(t) + e all yield the same verdict (pass or fail), where F̄(t) is the surrogate's predicted fitness for test input t and e is the surrogate's error margin; otherwise, run the test.
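Read as pseudocode, the rule looks roughly like the sketch below (our illustration; run_test, the fitness threshold, and the error margin e are assumptions, not the paper's exact implementation):

```python
def label_with_surrogate(t, surrogate, run_test, e, threshold=0.0):
    """Return (verdict, executed). Use the surrogate's fitness prediction F(t)
    only if F(t) - e, F(t), and F(t) + e all map to the same pass/fail verdict;
    otherwise fall back to actually executing the test."""
    # Assumption: fitness below the threshold indicates a requirement violation.
    verdict = lambda fitness: "fail" if fitness < threshold else "pass"
    f_hat = surrogate.predict([t])[0]              # surrogate's fitness estimate
    candidates = {verdict(f_hat - e), verdict(f_hat), verdict(f_hat + e)}
    if len(candidates) == 1:                       # surrogate agrees on both sides
        return candidates.pop(), False             # no simulation needed
    return verdict(run_test(t)), True              # run the (expensive) test
```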
Test Data Generation via Boundary Exploitation
Test data near the pass/fail boundary may lead to more informative failure
explanations
We use regression trees and logistic regression to learn the decision boundary between pass and fail
outcomes.
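A minimal sketch of the boundary-exploitation idea (our illustration with scikit-learn's logistic regression, not the paper's implementation): fit a model on already-executed tests and rank candidate inputs by how close their predicted failure probability is to 0.5, i.e., how close they lie to the learned pass/fail boundary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pick_boundary_candidates(executed_inputs, verdicts, candidates, k=10):
    """executed_inputs: (n, d) test inputs already run; verdicts: 1 = fail, 0 = pass;
    candidates: (m, d) unexecuted inputs. Returns the k candidates nearest the boundary."""
    model = LogisticRegression(max_iter=1000).fit(executed_inputs, verdicts)
    p_fail = model.predict_proba(candidates)[:, 1]
    nearest = np.argsort(np.abs(p_fail - 0.5))   # smallest margin = closest to boundary
    return np.asarray(candidates)[nearest[:k]]
```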
Exploration guided by surrogate models leads to more accurate
failure explanations than methods that exploit decision boundaries.
B. Jodat, A. Chandar, S. Nejati, M. Sabetzadeh: Test Generation Strategies for Building Failure Models and Explaining Spurious Failures. ACM TOSEM 2024
Exploitation Finds Failures — But Misses
Comprehensive Coverage
Pareto-based testing underperforms random testing in failure coverage
L. Sorokin, D. Safin, S. Nejati: Can Search-Based Testing with Pareto Optimization Effectively Cover Failure-Revealing Test Inputs? EMSE Journal 2025
(Figure: overlap of failure-revealing test inputs)
• Ground truth (actual space of failures): 558 failures
• Failures found using Pareto-based testing (NSGA-II): 429 failures
• Failures found using random search: 89 failures
Validating Image Test Inputs
Automated Human-Assisted Validation of Test Inputs for Deep Learning
Systems
Industrial Context
SmartInside AI: a DL-based anomaly-detection solution for power-grid facilities
They used GenAI to transform their dataset to resemble Canadian weather conditions
Transformed images might be invalid!
(Figures: original images vs. transformed images)
Human-Assisted Test Input Validation
Validating test inputs via active learning
Challenge: Cannot feed images
directly to the classifier
Solution: Train the classifier on
image comparison metrics
computed for image pairs!
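A minimal sketch of this active-learning loop (our illustration; the classifier choice, batch size, and the human_label callback are assumptions): a classifier is trained on the metric vectors of (original, transformed) image pairs, and the pairs it is least confident about are sent to a human for labelling.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def validate_with_active_learning(metric_vectors, human_label, rounds=5, batch=10):
    """metric_vectors: (n_pairs, n_metrics) array of comparison metrics per image pair.
    human_label: callback returning 1 (valid) or 0 (invalid) for a given pair index."""
    n = len(metric_vectors)
    labeled = list(np.random.default_rng(0).choice(n, size=batch, replace=False))
    labels = {i: human_label(i) for i in labeled}             # small labelled seed set
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    for _ in range(rounds):
        clf.fit(metric_vectors[labeled], [labels[i] for i in labeled])
        confidence = clf.predict_proba(metric_vectors).max(axis=1)
        # Uncertainty sampling: query the pairs the classifier is least sure about.
        pool = [i for i in range(n) if i not in labels]
        query = sorted(pool, key=lambda i: confidence[i])[:batch]
        for i in query:
            labels[i] = human_label(i)
        labeled.extend(query)
    return clf   # predicts validity for the remaining, unlabelled pairs
```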
Image Comparison Metrics
Metrics that measure the similarity or differences between two images
Pixel-Level: Directly compare pixels
Feature-Level: Use features from pretrained DL
models.
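As a rough sketch (our illustration; MSE and ResNet-18 feature cosine similarity are example metrics, not necessarily the exact set used in the paper):

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

def pixel_level_mse(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Pixel-level metric: mean squared error between two HxWxC uint8 images."""
    return float(np.mean((img_a.astype(np.float32) - img_b.astype(np.float32)) ** 2))

# Feature-level metric: cosine similarity between embeddings of a pretrained CNN.
_backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
_backbone.fc = torch.nn.Identity()      # drop the classification head, keep features
_backbone.eval()
_preprocess = T.Compose([T.ToTensor(), T.Resize((224, 224)),
                         T.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def feature_level_similarity(img_a: np.ndarray, img_b: np.ndarray) -> float:
    feats = _backbone(torch.stack([_preprocess(img_a), _preprocess(img_b)]))
    return float(torch.nn.functional.cosine_similarity(feats[0], feats[1], dim=0))
```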
State of the Art (Single-Metric)
Test input validation using a single metric
ICSE 2023: Validation via reconstruction error
of variational autoencoders
ICSE 2022: Validation via visual information
fidelity (VIF)
On average, using multiple metrics and active learning for test input
validation results in at least a 13% improvement in accuracy
compared to baselines that use single metrics.
Both feature-level and pixel-level metrics are influential in validating
test input images
D. Ghobari, M. H. Amini, D. Quoc Tran, S. Park, S. Nejati, M. Sabetzadeh: Test Input Validation for Vision-Based DL Systems: An Active Learning Approach. ICSE-SEIP 2025
Flakiness in Simulation-Based Testing
Prevalence of flaky tests, their impact, and prediction of flaky tests
Testing Automated Driving Systems (ADS)
Test inputs: weather, car positions, static objects
Test outputs: # collisions, distance to the closest vehicle, pass/fail verdict
Is Flakiness Prevalent in ADS Testing?
• Experiments on five setups from the literature based on Carla and BeamNG
simulators
• We fixed all the scene parameters for each test: car position, destination, weather, time of day, and random seeds
After executing each test input 10 times, we observed:
• 4% to 98% flakiness in test outputs (not necessarily leading to a
verdict change)
• 1% to 74% flakiness in test verdicts
M. Amini, S. Naseri, S. Nejati: Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems. EMSE Journal 2024
Flakiness is Prevalent in ADS testing!
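A minimal sketch of the measurement procedure (our illustration; run_simulation and its output fields are assumptions): each scenario is re-executed N times with all scene parameters fixed, and we check whether the outputs and the verdicts stay stable across runs.

```python
def measure_flakiness(scenario, run_simulation, n_runs=10, tol=1e-6):
    """Re-run one fixed scenario n_runs times and report whether the test
    outputs and the pass/fail verdicts vary across executions."""
    distances, verdicts = [], []
    for _ in range(n_runs):
        result = run_simulation(scenario)            # hypothetical simulator wrapper
        distances.append(result["min_distance"])     # e.g., distance to closest vehicle
        verdicts.append(result["verdict"])           # "pass" or "fail"
    output_flaky = (max(distances) - min(distances)) > tol
    verdict_flaky = len(set(verdicts)) > 1
    return output_flaky, verdict_flaky
```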
Do Flaky ADS Tests Reflect Real-World Scenarios or Unrealistic Situations?
Three types of flaky scenarios:
• Type I (2.5%): scenarios violating physics principles (unrealistic)
• Type II (27.5%): unstable behaviour of ADS controllers (realistic)
• Type III (70%): normal scenarios
How Can Flakiness Skew Our Testing Comparisons?
We often compare different testing methods via simulation-based testing, using metrics such as:
• Number of failure-revealing tests
• Fitness values
Hidden assumption:
• Variations in test results are mainly caused by the testing methods
Can flakiness impact these metrics?
Can Flaky Tests Significantly Impact ADS Testing Results?
We compare two different instances of random testing:
• Each test executed once
• Each test executed 10 times
Metrics
• Number of failure-revealing tests
• Fitness values
A random tester that reruns each test multiple times significantly
outperforms the random tester that runs each test once.
M. Amini, S. Naseri, S. Nejati: Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems. EMSE Journal 2024
Metrics for comparing testing methods are significantly impacted by
flakiness.
Mitigation: ADS testing methods cannot be reliably compared by
executing individual test inputs only once.
Lessons Learned
• The simple lane-keeping ADS test setup shows the lowest rate of flaky tests.
• Modular DNN-based ADS reduces flaky tests compared to end-to-end DNN-based ADS.
• The Carla simulator yields a lower rate of flaky tests than the BeamNG simulator.
Bridging the Real-to-Synthetic Gap
Can image-to-image translators effectively reduce this gap?
Image-to-Image Translators
Reducing the gap between real-world and synthetic images
Translators bring synthetic images from the test set closer to the distribution of real-world images in the training set.
Hence, translators reduce the chance of the ADS seeing out-of-distribution images, enabling more realistic evaluation.
Unpaired Image-to-Image Translators
• CycleGAN
  - Simultaneously trains discriminators and generators
  - Unstable training
• SAEVAE
  - Combines a sparse autoencoder (SAE) and a variational autoencoder (VAE)
  - Straightforward training: the VAE (discriminator) is trained once
  - The SAE operates at the pixel level, with no impact on the visible content
  - Fast translations
Can translators lead to accurate evaluation of ADS?
Could translators diminish the effectiveness of test images and their
ability to reveal faults?
How effectively do translators mitigate the accuracy
gap in offline testing?
Metric: Mean Absolute Error (MAE)
Two pairs of datasets: Lane keeping and object detection
Five ADS DNNs and three translators
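A minimal sketch of how the offline accuracy gap can be computed (our illustration; model.predict and the translator interface are assumptions): the gap is the model's MAE on synthetic test images, optionally passed through a translator, minus its MAE on real-world test images.

```python
import numpy as np

def mae(predictions, labels):
    return float(np.mean(np.abs(np.asarray(predictions) - np.asarray(labels))))

def offline_accuracy_gap(model, real_imgs, real_labels, syn_imgs, syn_labels,
                         translator=None):
    """Gap between the model's error on (optionally translated) synthetic images
    and its error on real-world images; a smaller gap means the synthetic test
    set better reflects real-world behaviour."""
    if translator is not None:
        syn_imgs = [translator(img) for img in syn_imgs]
    return mae(model.predict(syn_imgs), syn_labels) - mae(model.predict(real_imgs), real_labels)
```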
For the lane-keeping task, both SAEVAE and CycleGAN reduce the accuracy gap, with SAEVAE outperforming CycleGAN.
For the object-detection task, only SAEVAE reduces the accuracy gap.
M. H. Amini, S. Nejati: Bridging the Gap between Real-World and Synthetic Images for Testing Autonomous Driving Systems. ASE 2024
How effectively do translators reduce failures in
online testing?
Metric: number of out-of-bound (OOB) episodes
Simulator: BeamNG
Four DNNs
SAEVAE significantly reduces failures during online testing
compared to other translators or no translator.
M. H. Amini, S. Nejati: Bridging the Gap between Real-World and Synthetic Images for Testing Autonomous Driving Systems. ASE 2024
Can translators preserve the test data quality?
Metrics: neuron coverage, surprise adequacy, the number of clusters in the latent-space vectors, geometric diversity, and standard deviations
Two pairs of datasets: Lane keeping and object detection
Five DNNs
Translators do not significantly reduce test data quality. SAEVAE,
our best translator, preserves or enhances fault-revealing ability,
maintains diversity, and retains coverage in most cases.
M. H. Amini, S. Nejati: Bridging the Gap between Real-World and Synthetic Images for Testing Autonomous Driving Systems. ASE 2024
Lessons Learned
• Image-to-image translators improve online and offline testing results.
• They preserve test data quality as measured by neuron coverage, surprise adequacy, the number of clusters in the latent-space vectors, geometric diversity, and standard deviations.
• SAEVAE outperforms CycleGAN.
Conclusion
• When reporting failures, use failure explanations to assess their validity.
• When transforming images with GenAI or perturbations, assess the validity of the transformed images.
• When using simulators, discuss or mitigate the potential flakiness of test results.
• When testing ADS with simulators or synthetic images, use image translators to mitigate the impact of the gap between real-world and synthetic data.
Acknowledgements
https://www.uoIoT.ca/
@uOttawaIoT