Taming Test Flakiness: How We Built a Scalable Tool to Detect and Manage Flaky Tests
Introduction
A year ago, we began addressing flaky tests to improve the Continuous Integration (CI) experience within our monorepo. At the time, flaky tests were tracked through a manually managed, file-based system, which presented several challenges: a complex workflow, no customisation, no actionability, a single point of failure, and difficulty scaling. As we continued enhancing our CI ecosystem, it became essential to establish a system that is effective, scalable, and configurable, easy to adopt, and designed to minimise friction in developers' workflows. In response, we built Flakinator: a platformised, tech-stack-agnostic tool designed to detect, manage, and mitigate flaky tests across all of our codebases.
Before we delve into our solution, it is crucial to grasp the problem's intricacies and underlying significance.
What are Flaky Tests?
Flaky tests are the bane of any software development team. They fail sporadically without any changes to the underlying code, leading to mistrust in test results, wasted debugging efforts, and disruptions to CI/CD pipelines.
The Hidden Cost of Flaky Tests
Non-deterministic behaviour that leads to random failures creates inefficiencies, forcing developers to repeatedly run builds. This not only consumes valuable engineering hours spent troubleshooting tests that should ideally yield consistent results, but it also diminishes developer satisfaction.
Why is it a Big Deal?
Flaky tests are a well-documented problem in the software development lifecycle (SDLC), and several studies and industry insights highlight the severity of their impact.
At Atlassian, test flakiness has been a significant contributor to build reliability issues in the past, responsible for as much as 21% of master build failures in the Jira Frontend repository.
Approximately 15% of Jira backend repo failures are attributed to flaky tests, necessitating reruns that ultimately waste over 150,000 hours of developer time each year.
A study by Microsoft Research on flaky tests in their CI systems found that 13% of their test failures were flaky, highlighting that even mature CI pipelines are not immune.
A Google study of its internal testing systems found that 16% of test failures were determined to be flaky rather than actual bugs.
A survey conducted by CircleCI found that 1 in 4 developers lose trust in their test suite due to flaky tests, often leading to skipping tests altogether or bypassing failed tests.
Key Quotes from Research
“Flaky tests are one of the most time-consuming and frustrating issues in software development. They undermine the trust in automated tests and lead to significant inefficiencies in CI/CD pipelines.” – Google Research
"Flaky tests are not just a result of poor test writing; they’re often a symptom of deeper architectural or environmental issues." – Microsoft Research
Introducing Flakinator
Flakinator serves as an essential offering for our Atlassian products, enabling teams to focus on delivering features and improvements rather than being bogged down by the unpredictability of flaky tests.
Relentless pursuit of flaky tests and their ultimate elimination
Flakinator Capabilities
Design Overview
Flakinator sits within our CI infrastructure and ingests test run data directly from CI. The ingested records undergo transformation, and the raw test data is stored for future use. Different detection mechanisms are implemented for different products to identify flaky tests within the system. Multiple consumers utilise this information, tailoring it to their specific needs and visualisations.
How It All Comes Together
Flakinator is built on a scalable, distributed architecture to handle the large volume of test data across multiple Atlassian products. Here’s an overview of the key components:
Ingestion Pipeline to record Test Runs: A robust data ingestion pipeline collects test run data in real time from CI systems, normalises it, and stores it in centralised storage. Scripts and hooks in the CI/CD pipelines automatically capture metadata from each test run, including test duration, execution environment, results, retry attempts, error messages, and other metadata (a sketch of such an ingested record follows this list).
Flakiness Detection Engine: Flakinator supports multiple detection mechanisms, all constructed upon a similar architectural framework. We use the Java and Kotlin ecosystems alongside numerous AWS components to effectively calculate, store, and serve test quality scores at scale. Various configurations are employed for different products to ensure customisation, and all of these configurations are driven by user preferences.
Notification and Insights: A code ownership and notification module ensures that the relevant team members are promptly informed about the status of their tests, significantly improving communication and accountability. Once a flaky test is detected, a Jira ticket is created for the owning team with a pre-agreed due date for resolution, and the Flakinator Bot sends Slack notifications to keep everyone informed. A user-friendly React interface empowers developers to manage flaky tests effectively: users can explore these tests, take actions such as silencing them, search by test type, and access linked builds along with historical run data for each test.
Scalability and Reliability: The system handles more than 350 million test executions per day with high availability and fault tolerance, and the storage module holds over 3 TB of test data for efficient operation and analysis.
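To make the ingestion step more concrete, here is a minimal sketch of what a normalised test run record could look like. The `TestRunRecord` type and its field names are illustrative assumptions rather than Flakinator's actual schema; they simply mirror the metadata listed above (duration, execution environment, result, retry attempts, error messages).

```kotlin
import java.time.Instant

// Illustrative result states for a single test execution.
enum class TestResult { PASSED, FAILED, SKIPPED }

// Hypothetical shape of a normalised test run record produced by the
// ingestion pipeline. Field names are assumptions for illustration only.
data class TestRunRecord(
    val testId: String,            // fully qualified test name
    val buildId: String,           // CI build that executed the test
    val commitSha: String,         // commit under test
    val result: TestResult,        // outcome of this execution
    val durationMillis: Long,      // how long the test took
    val retryAttempt: Int,         // 0 for the first run, 1+ for retries
    val environment: String,       // e.g. agent pool / container image
    val errorMessage: String?,     // failure message, if any
    val executedAt: Instant        // when the run happened
)

// A CI hook might map its native test report into this shape before
// shipping records to centralised storage for the detection engine.
fun exampleRecord(): TestRunRecord = TestRunRecord(
    testId = "com.example.checkout.CartTest.appliesDiscount",
    buildId = "build-12345",
    commitSha = "abc123",
    result = TestResult.FAILED,
    durationMillis = 840L,
    retryAttempt = 1,
    environment = "linux-docker",
    errorMessage = "Expected 1 element but found 0",
    executedAt = Instant.now()
)
```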
Detection Algorithms
1. RETRY detection mechanism
Failing tests are rerun within the same build, and those results are used as a detection signal to find flakes. The Flakinator CLI, integrated into the pipelines, checks whether failing test cases are already designated as flaky. If a test is not included in the flaky list, an implicit retry mechanism is employed to collect flaky signals, with the circuit breaking at the first occurrence of a flip signal. The number of retries is configurable and varies depending on the test type. When flip signals are received, newly identified flaky tests are logged in the database to improve the efficiency of future builds. This approach has enabled us to achieve an impressive 81% detection rate for certain products.
In a test case history like the one below, the yellow runs are flakes. This flip information is the signal we use for quarantining a test.
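The sketch below illustrates the retry flow described above, assuming a hypothetical `runTest` hook and a configurable retry budget; it is not the actual Flakinator CLI logic. A failing test that passes on a retry produces a flip signal and is flagged as flaky, and retrying stops at the first flip.

```kotlin
// Outcome of evaluating a failing test with retries.
enum class RetryVerdict { FLAKY, CONSISTENT_FAILURE }

/**
 * Illustrative retry-based flake detection: rerun a failing test up to
 * [maxRetries] times and break the circuit at the first flip signal
 * (a pass after the initial failure).
 *
 * `runTest` is a hypothetical hook that executes the test once and
 * returns true on pass, false on failure.
 */
fun detectFlakeByRetry(
    testId: String,
    maxRetries: Int,
    runTest: (String) -> Boolean
): RetryVerdict {
    repeat(maxRetries) {
        if (runTest(testId)) {
            // Flip signal: the test failed earlier in this build but
            // passed on a retry, so it is recorded as flaky.
            return RetryVerdict.FLAKY
        }
    }
    // The test failed on every attempt: treat it as a genuine failure.
    return RetryVerdict.CONSISTENT_FAILURE
}
```

In practice, this check would only run for tests not already on the flaky list, and any newly detected flakes would be written back to the database so future builds can avoid unnecessary retries.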
2. Bayesian Inference for Flakiness Detection
Bayes' theorem is a result in statistics that provides a formula for calculating the probability of an event A occurring given that event B has already occurred. In other words, it is used to update the probability of a hypothesis based on new evidence.
Conditional probability is the likelihood of an outcome occurring given a previous outcome in similar circumstances. Bayes' theorem relies on prior probability distributions to generate posterior probabilities.
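Stated formally, for events A and B: P(A | B) = P(B | A) × P(A) / P(B), where P(A) is the prior probability of A and P(A | B) is its posterior probability after observing B.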
Bayesian inference
In Bayesian statistical inference, prior probability is the probability of an event occurring before new data is collected. Posterior probability is the revised probability of an event occurring after considering the new information.
For the use case of creating a flakiness score for a test case, we use the prior probability distribution of the test case's historic runs and derive the posterior probability from it. The analysis/inference component consists of three modules (a simplified scoring sketch follows the list below):
Historical Analysis: Utilise a moving window approach to analyse historical test run data, applying Bayesian inference to calculate the probability of a test being flaky.
Signal Processors: To derive a comprehensive flakiness score, consider multiple signal distributions (e.g., duration variability, environment consistency, result patterns, retry frequency).
Scoring: Assign a flakiness score between 0 and 1, where higher scores indicate greater flakiness.
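As a rough illustration of how such a score might be computed (a sketch under assumptions, not the production algorithm), the snippet below treats a result flip between consecutive runs as the Bernoulli event, builds a Beta prior from a test's older history, updates it with a recent moving window of runs, and blends the posterior mean with a duration-variability signal into a 0-1 score. The names, weights, and the specific signal combination are assumptions.

```kotlin
import kotlin.math.min

// A single observed run of a test.
data class RunObservation(val passed: Boolean, val durationMillis: Long)

// Count flips (pass -> fail or fail -> pass) between consecutive runs.
fun countFlips(runs: List<RunObservation>): Int =
    runs.zipWithNext().count { (prev, next) -> prev.passed != next.passed }

/**
 * Illustrative Bayesian flakiness scoring. The Beta prior summarises the
 * older history of flips; the recent moving window is the new evidence;
 * the posterior mean estimates the flip probability, which is blended
 * with a duration-variability signal into a score in [0, 1].
 */
fun flakinessScore(
    history: List<RunObservation>,   // older runs, used to build the prior
    window: List<RunObservation>,    // recent moving window (new evidence)
    durationWeight: Double = 0.2     // assumed blend weight for the duration signal
): Double {
    // Beta prior from historical flip behaviour, plus a weak uniform prior.
    val priorAlpha = 1.0 + countFlips(history)
    val priorBeta = 1.0 + (history.size - 1 - countFlips(history)).coerceAtLeast(0)

    // Posterior update from the recent window.
    val windowFlips = countFlips(window)
    val windowNonFlips = (window.size - 1 - windowFlips).coerceAtLeast(0)
    val posteriorMean =
        (priorAlpha + windowFlips) / (priorAlpha + priorBeta + windowFlips + windowNonFlips)

    // Simple duration-variability signal: relative spread of run durations.
    val durations = window.map { it.durationMillis.toDouble() }
    val durationSignal = if (durations.size > 1) {
        val mean = durations.average()
        val spread = durations.maxOrNull()!! - durations.minOrNull()!!
        min(1.0, spread / (mean + 1.0))
    } else 0.0

    // Blend the signals; higher means flakier.
    return (1 - durationWeight) * posteriorMean + durationWeight * durationSignal
}
```

A score close to 1 would push the test towards quarantine, while a score close to 0 keeps it in the regular suite.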
An example of a low-quality test case, where the test shows non-deterministic results in CI across multiple commits.
Results and Impact
Since deploying Flakinator, we’ve seen significant improvements in CI build stabilisation across our engineering products. This tool is currently utilised by over 12 products within Atlassian. Flakinator has already made a substantial impact across various offerings, especially regarding Builds Recovered and Cost Savings. As of the last quarter, Flakinator successfully recovered more than 22,000 builds and identified 7,000 unique flaky tests, leading to considerable cost savings. This tool enhances build reliability, conserves development hours, and reduces CI resource consumption by minimising the need for test reruns, ultimately accelerating time to market.
Metrics give teams visibility into their quality and performance by tracking key indicators. For example, tracking the flaky test rate at a team level highlights which teams are contributing the most to pipeline instability, motivating them to prioritise fixing flaky tests. Data-driven insights also help teams and leadership forecast the effort and time required for specific tasks, such as reducing test flakiness in the packages they own.
Lessons Learned
Building a flaky test management system wasn’t without its challenges. Here are some of the key lessons we learned:
Data Quality Matters: Inconsistent or missing test metadata can lead to inaccurate flakiness detection, so it's critical to invest in reliable data collection.
Iterate on Algorithms: No single algorithm works universally. Combining heuristics, statistical methods, and machine learning provided the most accurate results.
Prioritise Developer Experience: A tool is only effective if developers use it. We focused heavily on building an intuitive UI and smooth integrations with existing workflows.
Future Plans
We’re continuously improving our flaky test management tool. Some of the upcoming features we’re excited about include:
Enhance prediction capabilities using machine learning: Leverage machine learning (ML) algorithms to improve the system's ability to predict outcomes, identify patterns, and forecast potential issues. By analysing historical data, ML models can make more accurate predictions about future flaky behaviour.
Expand integration options: This can include creating APIs, plugins, or connectors that allow other systems to interact with the platform. For example, Flakinator could integrate with popular CI/CD tools (like Jenkins or GitLab), cloud platforms, monitoring tools, or issue-tracking systems.
Improve automated remediation: Automatically fix flaky tests by identifying and addressing common issues (for example, timeout issues, mocking failures, or environmental dependencies). This typically involves monitoring systems for issues, triggering predefined scripts or workflows to address them, and verifying that the remediation was successful. For example, if a system detects a server failure, it might automatically spin up a new server instance or reroute traffic to prevent failures.
Develop more sophisticated analytics: Sophisticated analytics go beyond basic reporting to include predictive analytics, prescriptive analytics (what actions to take), and diagnostic analytics (why something happened).
Broader Adoption: Sharing our solution with the broader engineering community to help others tackle flaky tests.
Flaky tests are an inevitable challenge in large-scale software development, but they don’t have to derail your CI/CD pipelines. By building a robust flaky test management system, we’ve improved build reliability, streamlined developer workflows, and saved resources across Atlassian.
We hope this blog inspires you to tackle test flakiness in your own organisation. Thanks for reading!