1. Sedna Lab, University of Ottawa
DeepTest Workshop May 2025
Université d’Ottawa | University of Ottawa
Shiva Nejati
Failures or False Alarms? Validating Tests and
Failures for Cyber Physical Systems
2. When Failures Aren’t Failures
Software testing is about finding failures
Caveat: Easy to mistake invalid failures for
genuine ones!
• Failures that are not critical
• Failures resolved by current practices
• Failures due to inputs violating system’s
preconditions
• Failures from simulation non-determinism
• Failures from simulated versus real-world
mismatches
4. ... It soon became clear that verification has a lot to do with what surrounds the system, and not just with the system itself.
The environment is not merely a backdrop
but an active participant that the system
must interact with correctly.
Michael Jackson: The Meaning of Requirements.
Annals of Software Engineering (1997)
5. Environment Matters for Failures
System Verification: Environment ∧ System ⊨ Requirement
Failure: In a given environment, the system's interaction with that environment violates a requirement.
∃ Environment s.t. Environment ∧ System ⊭ Requirement
Failures are contextual
6. Interpreting Failures
Failures in a valid environment
• System is faulty
Failures in an invalid environment
• Revisit testing approach or assumptions
What makes an environment invalid?
• Incorrect assumptions about users and needs
• Inaccurate modeling of operational context
• Poor simulation of real-world conditions
7. Runtime Crashes ≠ Always Failures
Crashes like buffer overflows and segfaults
• Often treated as implicit test oracles
But, a runtime crash may not always indicate a true failure
8. Runtime Crashes Ignored for Function Models
Domain Context:
• Simulink function models in the automotive sector
Function Engineer Focus:
• Function engineers prioritize functionality and control requirements, often ignoring Simulink runtime errors
Reality:
• Crashes are treated as code-level issues, handled post-conversion to C code
R. Matinnejad, S. Nejati, L. Briand, T. Bruckmann: Test Generation and Test Prioritization for Simulink Models with Dynamic Behavior. IEEE TSE 2019
9. Fuzzing Z3: Crashes vs. Developer Intent
Fuzzing the Z3 solver's command-line options, treating assertion violations as failures.
Feedback from Z3 developers:
"All options are for internal consumption and
debugging purposes. The user is expected to run the
tool without options ... ‘’
“I would rather spend my time on developing better
solving algorithms than think of better ways to make
experimental code harder to reach for random
testers."
https://github.com/Z3Prover/z3/issues/3892
10. Failures from Invalid Inputs
Requirement: When the autopilot is enabled, the aircraft should reach the desired
altitude within 500 seconds in calm air.
Failure occurs because the throttle is insufficient — a system prerequisite is unmet.
This is not a system bug — it’s a failure caused by an invalid input.
K. Gaaloul, C. Menghi, S. Nejati, L. Briand, D. Wolfe: Mining Assumptions for Software Components Using Machine Learning. ESEC/SIGSOFT FSE 2020
11. Deep Learning Predicts on Anything
Deep learning systems produce an output for any input, even invalid inputs
S. Dola, M. Dwyer, M. Soffa: Distribution-Aware
Testing of Neural Networks Using Generative
Models. ICSE 2021.
(Figure: predictions are produced for both valid and corrupted/invalid inputs)
This behavior makes deep learning systems hard to test — they never "crash".
12. Flaky Failures from Simulation Nondeterminism
Simulation randomness can cause the same test to pass or fail unpredictably.
(Figure: initial scene, last scene of the 1st execution, and last scene of a random re-execution)
13. Failures from Real World vs Synthetic Mismatches
Failures arise not from flaws in the model, but from the distribution gap
between real-world and synthetic data
(Figure: real-world data used for training and testing vs. synthetic data used for testing)
Distribution shift between real-world training data
and synthetic test data can produce erroneous
failures.
15. Explaining Failures in Autopilot
Precondition for the ascent requirement: The nose should be pointing
upward with adequate throttle applied.
16. Explaining Failures in Priority-Based Traffic Management
Design Assumption: High-priority traffic should remain under 75%
utilization to preserve network quality.
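To make this concrete, an explanation rule derived from such a design assumption can be encoded as a simple check on the test input; a minimal Python sketch, where the feature name high_priority_utilization is a hypothetical placeholder (only the 75% threshold comes from the slide):

    # Toy failure-explanation rule; 'high_priority_utilization' is a hypothetical
    # feature extracted from each test input.
    def explain_failure(test_input: dict) -> str:
        if test_input["high_priority_utilization"] > 0.75:
            # The design assumption is violated: the failure stems from an
            # invalid environment/input rather than (necessarily) a system fault.
            return "invalid test: design assumption violated"
        return "candidate genuine failure"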
17. Building Failure Explanations
A data-driven method for learning rules from test data
Test Data (test inputs + pass/fail labels) → Inferred Model (decision trees / decision rules) → Explanation Rules ("if … then Fail")
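A minimal sketch of this learning step with scikit-learn, assuming the test data is already available as a feature matrix X of test inputs and a vector y of pass/fail labels (the feature names and values below are made up for illustration):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical test data: one row per test input; label 1 = fail, 0 = pass.
    X = np.array([[0.2, 30.0], [0.8, 45.0], [0.9, 10.0], [0.1, 5.0]])
    y = np.array([0, 1, 1, 0])

    # A shallow tree keeps the inferred rules short and readable.
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # Branches ending in class 1 correspond to "if ... then Fail" explanation rules.
    print(export_text(tree, feature_names=["throttle", "altitude_error"]))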
Challenge: Test data generation is expensive.
Question: How do we generate test data more efficiently?
18. Efficient Test Data Generation Using Surrogate Models
Efficiently explore test input space
Use surrogate models to approximate pass/fail outcomes — without always running the test
Use the surrogate's label if F̄(t) − e, F̄(t), and F̄(t) + e all agree (pass or fail); otherwise, run the test.
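A sketch of this labelling rule, assuming a regression surrogate that predicts the fitness F̄(t), an error margin e, and a hypothetical convention that negative fitness means failure:

    def verdict(fitness: float) -> str:
        # Hypothetical oracle convention: negative fitness counts as a failure.
        return "fail" if fitness < 0.0 else "pass"

    def label_test(t, surrogate, run_simulation, e: float) -> str:
        f_bar = surrogate.predict([t])[0]                  # approximated fitness
        labels = {verdict(f_bar - e), verdict(f_bar), verdict(f_bar + e)}
        if len(labels) == 1:
            return labels.pop()                            # trust the surrogate's label
        return verdict(run_simulation(t))                  # disagreement: run the real test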
19. Test Data Generation via Boundary Exploitation
Test data near the pass/fail boundary may lead to more informative failure
explanations
We use regression trees and logistic regression to learn the decision boundary between pass and fail outcomes.
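One plausible realization of the boundary-exploitation idea with logistic regression (the synthetic data and the 0.05 probability margin are illustrative, not the paper's exact settings):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical labelled test data: X = test inputs, y = 1 for fail, 0 for pass.
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

    boundary_model = LogisticRegression().fit(X, y)

    # Sample candidates and keep those closest to the pass/fail boundary,
    # i.e., where the predicted failure probability is near 0.5.
    candidates = rng.uniform(size=(1000, 2))
    p_fail = boundary_model.predict_proba(candidates)[:, 1]
    near_boundary = candidates[np.abs(p_fail - 0.5) < 0.05]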
20. Exploration guided by surrogate models leads to more accurate
failure explanations than methods that exploit decision boundaries.
B. Jodat, A. Chandar, S. Nejati, M. Sabetzadeh: Test Generation Strategies for Building Failure Models and Explaining Spurious Failures. ACM TOSEM 2024
21. Exploitation Finds Failures — But Misses
Comprehensive Coverage
Pareto-based testing underperforms random testing in failure coverage
L. Sorokin, D. Safin, S. Nejati: Can search-based testing with pareto optimization effectively cover failure-revealing
test inputs? EMSE Journal 2025
• Ground truth (actual space of failures): 558 failures
• Failures found using Pareto-based testing (NSGA-II): 429 failures
• Failures found using random search: 89 failures
22. Validating Image Test Inputs
Automated Human-Assisted Validation of Test Inputs for Deep Learning
Systems
23. Industrial Context
SmartInside AI: a DL-based anomaly detection solution for power grid facilities
They used GenAI to transform their dataset to resemble Canadian weather
Transformed images might be invalid!
(Figure: original images vs. transformed images)
24. Human-Assisted Test Input Validation
Validating test inputs via active learning
Challenge: Cannot feed images
directly to the classifier
Solution: Train the classifier on
image comparison metrics
computed for image pairs!
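A sketch of the active-learning loop, assuming each (original, transformed) image pair is already represented by a vector of image-comparison metrics, and that ask_human stands in for the manual labelling step (1 = valid, 0 = invalid):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def validate_inputs(metric_vectors, ask_human, rounds=10, batch=5):
        X = np.asarray(metric_vectors)                 # one row per image pair
        idx = list(np.random.choice(len(X), batch, replace=False))
        y = [ask_human(int(i)) for i in idx]           # seed labels from the human
        clf = RandomForestClassifier(random_state=0)
        for _ in range(rounds):
            clf.fit(X[idx], y)
            proba = clf.predict_proba(X)
            uncertainty = 1.0 - proba.max(axis=1)      # highest where the model is unsure
            pool = [i for i in np.argsort(-uncertainty) if i not in idx]
            for i in pool[:batch]:                     # query the most uncertain pairs
                idx.append(int(i))
                y.append(ask_human(int(i)))
        clf.fit(X[idx], y)
        return clf                                     # classifies pairs as valid/invalid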
25. Image Comparison Metrics
Metrics that measure the similarity or differences between two images
Pixel-Level: Directly compare pixels
Feature-Level: Use features from pretrained DL
models.
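Illustrative examples of the two metric families: mean squared error as a pixel-level metric, and cosine distance over features from a pretrained ResNet-18 as a feature-level metric (ResNet-18 is an assumption here, not necessarily the backbone used in the study; requires torchvision ≥ 0.13):

    import numpy as np
    import torch
    from torchvision import models, transforms

    def pixel_level_mse(img_a: np.ndarray, img_b: np.ndarray) -> float:
        # Direct pixel comparison; images as float arrays in [0, 1] with the same shape.
        return float(np.mean((img_a - img_b) ** 2))

    # Feature extractor from a pretrained network (ResNet-18 as an example backbone).
    _backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    _extractor = torch.nn.Sequential(*list(_backbone.children())[:-1]).eval()
    _prep = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((224, 224)),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def feature_level_distance(img_a, img_b) -> float:
        # Cosine distance between deep features of the two (PIL) images.
        with torch.no_grad():
            fa = _extractor(_prep(img_a).unsqueeze(0)).flatten()
            fb = _extractor(_prep(img_b).unsqueeze(0)).flatten()
        return 1.0 - torch.nn.functional.cosine_similarity(fa, fb, dim=0).item()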
26. State of the Art (Single-Metric)
Test input validation using a single metric
ICSE 2023: Validation via reconstruction error
of variational autoencoders
ICSE 2022: Validation via visual information
fidelity (VIF)
27. On average, using multiple metrics and active learning for test input
validation results in at least a 13% improvement in accuracy
compared to baselines that use single metrics.
Both feature-level and pixel-level metrics are influential in validating
test input images
D. Ghobari, M. H. Amini, D. Quoc Tran, S. Park, S. Nejati, M. Sabetzadeh: Test Input Validation for Vision-based DL Systems: An Active Learning Approach. ICSE-SEIP 2025
29. Testing Automated Driving Systems (ADS)
Test inputs: weather, car positions, static objects
Test outputs: # collisions, distance to the closest vehicle, pass/fail verdict
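A minimal sketch of how such a test case and its outcome could be represented (the field names and the collision-based oracle are illustrative assumptions):

    from dataclasses import dataclass

    @dataclass
    class ADSTestInput:
        weather: str              # e.g., "clear", "rain", "fog"
        car_positions: list       # positions of the other vehicles in the scene
        static_objects: list      # static obstacles placed in the scene

    @dataclass
    class ADSTestOutput:
        num_collisions: int
        min_distance_to_vehicle: float   # distance to the closest vehicle

        @property
        def verdict(self) -> str:
            # Hypothetical oracle: any collision makes the test fail.
            return "fail" if self.num_collisions > 0 else "pass"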
30. Is Flakiness Prevalent in ADS Testing?
• Experiments on five setups from the literature based on Carla and BeamNG
simulators
• We fixed all the scene parameters for each test: car position, destination, weather, time of day, and random seeds
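A sketch of the rerun-based measurement, assuming run_test executes one fixed scenario in the simulator and returns its pass/fail verdict:

    def verdict_flakiness(run_test, test_inputs, repetitions=10) -> float:
        # Fraction of test inputs whose verdict changes across repeated executions.
        flaky = 0
        for t in test_inputs:
            verdicts = {run_test(t) for _ in range(repetitions)}
            if len(verdicts) > 1:      # same input, different verdicts -> flaky
                flaky += 1
        return flaky / len(test_inputs)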
31. After executing each test input 10 times, we observed:
• 4% to 98% flakiness in test outputs (not necessarily leading to a
verdict change)
• 1% to 74% flakiness in test verdicts
M. Amini, S. Naseri, S. Nejati: Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems. EMSE Journal 2024
Flakiness is Prevalent in ADS testing!
32. Do Flaky ADS Tests Reflect Real-World Scenarios or Unrealistic Situations?
Three types of flaky scenarios:
• Type I (2.5%): scenarios violating physics principles (unrealistic)
• Type II (27.5%): unstable behaviour of ADS controllers (realistic)
• Type III (70%): normal scenarios
33. How Can Flakiness Skew Our Testing Comparisons?
We often compare testing methods via simulation-based testing, using metrics such as:
• Number of failure-revealing tests
• Fitness values
Hidden assumption:
• Variations in test results are mainly caused by the testing methods
Can flakiness impact these metrics?
34. Can Flaky Tests Significantly Impact ADS Testing Results?
We compare two different instances of random testing:
• Each test executed once
• Each test executed 10 times
Metrics
• Number of failure-revealing tests
• Fitness values
35. A random tester that reruns each test multiple times significantly
outperforms the random tester that runs each test once.
Metrics for comparing testing methods are significantly impacted by flakiness.
Mitigation: ADS testing methods cannot be reliably compared by executing individual test inputs only once.
M. Amini, S. Naseri, S. Nejati: Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems. EMSE Journal 2024
36. Lessons Learned
The simple lane-keeping ADS test setup shows the lowest rate of flaky tests.
Modular DNN-based ADS reduces flaky tests compared to end-to-end DNN-based ADS.
The Carla simulator yields a lower flaky test rate than the BeamNG simulator.
38. Image-to-Image Translators
Reducing the gap between real-world and synthetic images
Translators bring synthetic images from the test set closer to the distribution of real-world images in the train set.
Hence, translators reduce the chance of ADS seeing out-of-distribution images, enabling more realistic evaluation.
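In the testing pipeline, the translator simply sits between the simulator and the DNN under test; a sketch, where translator and lane_keeping_dnn are placeholders for the trained models:

    def offline_test(synthetic_frames, steering_labels, translator, lane_keeping_dnn):
        # Translate each synthetic frame toward the real-world distribution
        # before querying the DNN, then collect per-frame prediction errors.
        errors = []
        for frame, true_steering in zip(synthetic_frames, steering_labels):
            realistic_frame = translator(frame)
            predicted = lane_keeping_dnn(realistic_frame)
            errors.append(abs(predicted - true_steering))
        return errors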
39. Unpaired Image-to-Image Translators
• CycleGAN
Simultaneously trains discriminators and generators
Unstable training
• SAEVAE
Combines a sparse autoencoder (SAE) and a variational
autoencoder (VAE)
Straightforward training -- VAE (discriminator) is
trained once
SAE performs at pixel-level – No impact on the visible
content
Fast translations
40. Can translators lead to accurate evaluation of ADS?
Could translators diminish the effectiveness of test images and their
ability to reveal faults?
41. How effectively do translators mitigate the accuracy
gap in offline testing?
Metric: Mean Absolute Error (MAE)
Two pairs of datasets: Lane keeping and object detection
Five ADS DNNs and three translators
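The accuracy gap can then be quantified as the difference in MAE between the real-world test set and the synthetic (or translated) test set; a minimal sketch of that computation:

    import numpy as np

    def mae(preds, labels) -> float:
        return float(np.mean(np.abs(np.asarray(preds) - np.asarray(labels))))

    def accuracy_gap(preds_real, labels_real, preds_syn, labels_syn) -> float:
        # Gap between the DNN's offline-testing error on real-world images and on
        # synthetic (or translated) images; a good translator shrinks this gap.
        return abs(mae(preds_real, labels_real) - mae(preds_syn, labels_syn))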
42. For the lane-keeping task, both SAEVAE and CycleGAN reduce the accuracy gap, with SAEVAE outperforming CycleGAN.
For the object-detection task, only SAEVAE reduces the accuracy gap.
M. H. Amini, S. Nejati: Bridging the Gap between Real-world and Synthetic Images for Testing Autonomous Driving Systems. ASE 2024
43. How effectively do translators reduce failures in
online testing?
Metric: # of out-of-bounds (OOB)
Simulator: BeamNG
Four DNNs
44. SAEVAE significantly reduces failures during online testing
compared to other translators or no translator.
M. H. Amini, S. Nejati: Bridging the Gap between Real-world and Synthetic Images for Testing Autonomous Driving Systems. ASE 2024
45. Can translators preserve the test data quality?
Metrics: Neuron coverage, surprise adequacy, the number of clusters in the latent
space vectors, geometric diversity and standard deviations
Two pairs of datasets: Lane keeping and object detection
Five DNNs
46. Translators do not significantly reduce test data quality. SAEVAE,
our best translator, preserves or enhances fault-revealing ability,
maintains diversity, and retains coverage in most cases.
M. H. Amini, S. Nejati: Bridging the Gap between Real-world and Synthetic Images for Testing Autonomous Driving Systems. ASE 2024
47. Lessons Learned
Image-to-image translators improve online and offline testing results.
They preserve test data quality measured by neuron coverage, surprise adequacy, the number of clusters in the latent space vectors, geometric diversity, and standard deviations.
SAEVAE outperforms CycleGAN.
48. Conclusion
• When reporting failures, use failure explanations to assess their validity.
• When transforming images with GenAI or perturbations, assess the validity of the transformed images.
• When using simulators, discuss or mitigate the potential flakiness of test results.
• When testing ADS with simulators or synthetic images, use image translators to mitigate the impact of the gap between real-world and synthetic data.