Having your build broken by a test is never fun, but flaky tests are a particularly disruptive form of failure: one that is as hard to fix as it is unavoidable. In this article, we’ll lay out a step-by-step plan to prevent and manage those pesky flaky tests.
What are flaky tests?
A flaky test is a test that can both pass and fail without any change to the code or its environment. One moment the test works, the next it fails even though nothing changed in the codebase. Flakiness can take several forms:
- Random flakiness: the most common type of flaky test. These are tests that randomly fail or pass when re-run even though nothing has changed.
- Environmental flakiness: a test that works on your machine but fails on other developers’ machines or in continuous integration (or vice versa) is also flaky.
- Branch flakiness: tests that pass on the feature branch or pull request but begin failing once merged into main.
Why flaky tests are a problem
An essential property of an automated test is its determinism: if the code hasn’t changed, the results shouldn’t either. Flaky tests neutralize the benefits of CI/CD and shake the team’s trust in their test suite.
You may be thinking that if a test fails randomly, you can game the system by retrying it until it passes. If this sounds like a good plan to you, you’re not alone. Some test runners even do this for you automatically.
While re-running flaky tests is a popular “fix” — especially when a deadline is approaching — is it a real solution? How confident are you in your test after having taken this road? How often must the test fail before you declare it a “real” failure? There aren’t any satisfying answers. The game is rigged: you can’t win with this strategy.
Flaky tests are everywhere
A 2022 survey confirmed that flaky tests are still a common and serious problem for many software developers and testers — about half of the surveyed professionals experienced flakiness in their tests on a weekly basis.
Flaky tests are pretty common in the software industry. But they only become a serious problem when we’re not proactive about fixing them.
Step 1 – Commit to fixing the problem right away!
The first appearance of a flaky test is the best moment to fix it. Maybe the test is new, or a recent commit changed its stability. This means that the related history is still fresh in the developers’ memory and that they can act quickly.
Nothing is more frustrating than trying to push a hotfix down the pipeline and seeing a flaky test standing in your way. Retrying tests may temporarily solve the issue but, by slowing down CI/CD, you’re wasting time and reducing your capacity to deliver software.
If you don’t have enough time to fix a test right away, then it has to go into your technical debt pile. You should document it, create a ticket, and start working on it as soon as possible.
Step 2 – Find the flaky tests in your suite
Flaky tests are statistical by nature, so you’ll need to follow a test over a few days to understand its behavior. The more it runs, the more likely a pattern will emerge.
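If you want a quick local approximation of this idea before CI gathers the data for you, you can re-run a suspicious test many times and tally the results. Here is a minimal Node sketch; the test command, arguments, and run count are placeholders you would adapt to your own project:

```javascript
// rerun-test.js: re-run a single test command N times and count failures (illustrative sketch)
const { spawnSync } = require('child_process');

const RUNS = 50;                                    // how many times to repeat the test
const TEST_COMMAND = 'npm';                         // placeholder: your test runner
const TEST_ARGS = ['test', '--', 'checkout.test'];  // placeholder: target a single test file

let failures = 0;
for (let i = 1; i <= RUNS; i++) {
  const result = spawnSync(TEST_COMMAND, TEST_ARGS, { stdio: 'ignore' });
  if (result.status !== 0) {
    failures++;
    console.log(`Run ${i}: FAILED`);
  }
}
console.log(`${failures}/${RUNS} runs failed`);
```

A test that fails even once in such a loop, with no code changes in between, is flaky by definition.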
You can automate re-running tests with CI pipelines — most CI/CD vendors provide a scheduling feature. If you are a Semaphore user, you can set up a cron scheduler so the test suite re-runs periodically. This will help you identify flaky pain points after a couple of days.
Another benefit of scheduling is that builds execute at different times of the day. If you notice a pattern — for example, a test fails only between 3 and 5 am — you are one step closer to fixing the test.
Flaky test detection with CI/CD
After a few days, the CI pipeline history should offer enough data to begin identifying flaky tests. Semaphore users should enable test reports to analyze error data across all CI/CD runs; this will save you the effort of manually tallying up errors in your pipelines.
Building on test reports, Semaphore implements the flaky test dashboard (currently in closed beta, soon to be generally available). This feature provides a UI where you can easily view your flaky tests, figure out when they first appeared, and estimate their impact on your test suite.
Saving debugging information
Save every scrap of information that can help you find the root cause of the flakiness. Event logs, memory maps, profiler outputs — the key to the problem could be anywhere. In most cases, you can increase the logging verbosity by setting environment variables. For example, by starting a Node application with the DEBUG variable:
DEBUG=* node your-script.js
In addition, all languages and frameworks provide some kind of built-in or third-party debugger. In the case of Node, you can activate it with the --inspect flag:
node --inspect your-script.js
Even sprinkling in some console outputs can help you understand the state of your program at the moment it starts flaking:
console.log('Value of x:', x);
console.error('Error encountered:', error);
Semaphore users can persist log files, screenshots, memory dumps, and any other debugging information using the artifact push facility.
See your flaky tests in action with SSH debugging
For quick diagnosis, you can run a job interactively with SSH debugging. Semaphore gives you the option to access all running jobs via SSH, restart your jobs in debug mode, or start on-demand virtual machines to explore CI/CD.
You can reproduce the conditions that caused the test to fail and try ways of fixing it. The changes will be lost when the session ends, so you’ll need to re-apply any modifications as a normal commit in your repository.
# start debugging session and run commands in the terminal
$ sem debug job 09bcc47b-8fec-4f36-8dc9-d363447c1cf9
* Creating debug session for job '09bcc47b-8fec-4f36-8dc9-d363447c1cf9'
* Setting duration to 60 minutes
* Waiting for debug session to boot up ..
* Waiting for ssh daemon to become ready .
Semaphore CI Debug Session.
- Checkout your code with `checkout`
- Run your CI commands with `source ~/commands.sh`
- Leave the session with `exit`
Documentation: https://docs.semaphoreci.com/essentials/debugging-with-ssh-access/.
semaphore@semaphore-vm:~$ source ~/commands.sh
Take screenshots of your UI flaky tests
The most challenging class of flaky errors to debug involves the UI. End-to-end and acceptance tests depend on graphical elements not represented in logs.
Configure your test framework to dump HTML or screenshots when a test fails. You’ll be happy to have something to look at when the error strikes. The following example shows how to save a rendered page as an image with Cucumber.
// Cucumber supports hooks that allow you to execute code after a scenario has failed.
@After
public void afterAcceptanceTests(Scenario scenario) {
    try {
        if (scenario.isFailed()) {
            final byte[] screen = ((TakesScreenshot) driver).getScreenshotAs(OutputType.BYTES);
            scenario.embed(screen, "image/png");
        }
    } finally {
        driver.quit();
    }
}
Here’s the same thing but with Ruby and another BDD framework called Capybara, which also allows you to take screenshots in tests.
after(:each) do |example|
  if example.exception
    # print DOM to file
    print page.html
    # save screenshot of browser
    page.save_screenshot('screenshot.png')
  end
end
Step 3 – Document flaky tests
After driving the flaky tests out into the open:
- Document every flaky test in your ticketing system.
- As you acquire more information about the cause of a test’s flakiness, add it to the ticket.
- Feel free to fix tests right away if the reason for their flakiness is readily apparent.
Too many open tickets are a good indicator that some time needs to be set aside to improve the test suite’s quality. Also, a ticket is an excellent place to discuss ideas for fixes.
Step 4 – Determine the cause and fix the test
In some cases, the cause of failure is obvious, which means you can fix the test and close the case quickly. The problem arises when it’s not immediately clear why a test is failing. In such cases, you will need to analyze all the data you have gathered.
Let’s look at common causes for flakiness and their solutions.
Environmental differences
Differences between your local development machine and CI fall into this category. Variances in operating systems, libraries, environment variables, number of CPUs, or network speed can produce flaky tests.
While having 100% identical systems is impossible, being strict about library versions and consistent in the build process helps to avoid flakiness. Even minor version changes in a library can introduce unexpected behavior or even bugs. Keeping environments as equal as possible during the entire CI process reduces the chance of creating flaky tests.
Containers are great for controlling what goes into the application environment and isolating the code from OS-level influence.
Non-deterministic code
Code that relies on unpredictable inputs such as dates, random values, or remote services, produces non-deterministic tests.
Preventing non-determinism involves exerting tight control over your test environment. One thing you can do is inject known data in place of otherwise uncertain inputs using fakes, stubs, and mocks. These devices give you control over inputs that would otherwise be unpredictable.
In the following example, we override now() with a fixed value, effectively removing the non-deterministic aspects from the test:
@Test
public void methodThatUsesNow() {
    String fixedTime = "2022-01-01T12:00:00Z";
    Clock clock = Clock.fixed(Instant.parse(fixedTime), ZoneId.of("UTC"));
    // now holds a known datetime value
    Instant now = Instant.now(clock);
    // the rest of the test...
}
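Time is not the only non-deterministic input. The same injection idea works for random values: instead of calling the random generator directly, pass it in so the test can supply a fixed one. Here is a minimal JavaScript sketch; the function and values are made up for illustration:

```javascript
// Production code: accept the random source as a parameter instead of calling Math.random directly
function pickDiscountCode(codes, random = Math.random) {
  return codes[Math.floor(random() * codes.length)];
}

// Test: supply a fixed "random" function so the result is always the same
const codes = ['SAVE10', 'SAVE20', 'SAVE30'];
const picked = pickDiscountCode(codes, () => 0.5);
console.assert(picked === 'SAVE20', 'expected the middle code when random() returns 0.5');
```

The same pattern applies to remote services: pass in a stubbed client that returns canned responses instead of calling the real service from the test.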
Asynchronous code
Flaky tests can happen when the test suite and the application run in separate processes. When a test performs an action, the application needs some time to complete the request. After that, it can check if the action yielded the expected result.
A simple solution for asynchrony is for the test to wait for a specified period before it checks if the action was successful:
click_button "Send"
sleep 5
expect_email_to_be_sent
The problem here is that, from time to time, the application will need more than 5 seconds to complete a task. In such cases, the test will fail. Also, if the application typically needs around 2 seconds to complete the task, the same test would be wasting 3 seconds every time it executes.
There are two better solutions to this problem: polling and callbacks.
Polling is based on a series of repeated checks of whether an expectation has been satisfied.
click_button "Send"
wait_for_email_to_be_sent
In this case, the wait_for_email_to_be_sent method checks if the expectation is valid. If it’s not, it sleeps for a short time (say, 0.1 seconds) and checks again. The test fails after a predefined number of unsuccessful attempts.
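As an illustration of how such a helper could work under the hood, here is a small JavaScript sketch of a generic polling function. The names and timings are made up, and your test framework probably ships an equivalent:

```javascript
// Poll a condition until it becomes true or the timeout expires (illustrative sketch)
async function waitFor(condition, { timeout = 5000, interval = 100 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await condition()) return;                                  // expectation satisfied, stop waiting
    await new Promise((resolve) => setTimeout(resolve, interval));  // sleep briefly, then retry
  }
  throw new Error(`Condition not met within ${timeout} ms`);
}

// Usage in a test (hypothetical helper):
// await waitFor(() => emailWasSent('fake@email.com'));
```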
The callback solution allows the code to signal back to the test when it can start executing again. The advantage of this is that the test doesn’t wait longer than necessary.
Imagine that we have an async function that returns a value by reference:
function someAsyncFunction(myObject) {
  // function body
  // ....
  // return value by reference
  myObject.return_value = "some string";
}
How do we test such a function? We can’t simply call it and compare the resulting value, because the function may not have completed by the time the assertion is executed.
// this introduces flakiness
let testObject = {};
someAsyncFunction(testObject);
assertEqual(testObject.return_value, "some string");
We could put a sleep or some kind of timer in place, but this kind of pattern could itself introduce flakiness. It is much better to refactor the async function to accept a callback, which is executed when the body of the function is complete:
// run a callback when the function is done
function someAsyncFunction(myObject, callback) {
  // function body ...
  // execute callback when done
  callback(myObject);
}

// move the test inside the callback function
function callback(testObject) {
  assertEqual(testObject.return_value, "some string");
}
Now we can chain the test to the async function, ensuring that the assertion runs after the function is done:
let testObject = {};
someAsyncFunction(testObject, callback);
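On a modern JavaScript runtime, the same chaining can be expressed with Promises and async/await, which most test runners understand natively. A small sketch under that assumption, reusing the same hypothetical assertEqual helper:

```javascript
// Return a Promise instead of taking a callback
function someAsyncFunction() {
  return new Promise((resolve) => {
    // function body ...
    resolve({ return_value: "some string" });
  });
}

// Awaiting the Promise guarantees the assertion runs only after the work is done
async function testSomeAsyncFunction() {
  const result = await someAsyncFunction();
  assertEqual(result.return_value, "some string");
}
```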
Concurrency
Concurrency can be responsible for flakiness due to deadlocks, race conditions, leaky implementations, or implementations with side effects. The problem stems from using shared resources.
Check out this test for a money transfer function:
function testAccountTransfer(fromAccount, toAccount) {
  lockFrom = fromAccount.lock()
  lockTo = toAccount.lock()

  beforeBalanceFrom = getBalance(fromAccount)
  beforeBalanceTo = getBalance(toAccount)

  transfer(fromAccount, toAccount, 100)

  assert(beforeBalanceFrom - getBalance(fromAccount) == 100)
  assert(getBalance(toAccount) - beforeBalanceTo == 100)

  lockTo.release()
  lockFrom.release()
}
If we were to run multiple instances of this test in parallel, we would risk creating a deadlock in which each function holds a lock on the resources that the other needs, leaving the system in a state in which neither test can finish.
// both tests running in parallel can cause a deadlock
testAccountTransfer('John', 'Paula')
testAccountTransfer('Paula', 'John')
This sort of failure can be prevented by replacing the shared resource (the account) with a mocked component.
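As a sketch of what that replacement could look like in JavaScript (the account API here is invented for illustration), each test builds its own in-memory fake accounts, so parallel runs never compete for the same lock:

```javascript
// A minimal in-memory fake account: no database, no shared locks (illustrative sketch)
function makeFakeAccount(initialBalance) {
  let balance = initialBalance;
  return {
    getBalance: () => balance,
    withdraw: (amount) => { balance -= amount; },
    deposit: (amount) => { balance += amount; },
  };
}

// The function under test (simplified stand-in for the article's transfer example)
function transfer(fromAccount, toAccount, amount) {
  fromAccount.withdraw(amount);
  toAccount.deposit(amount);
}

// Each test creates its own accounts, so parallel tests cannot deadlock on shared resources
function testAccountTransfer() {
  const from = makeFakeAccount(500);
  const to = makeFakeAccount(0);

  transfer(from, to, 100);

  console.assert(from.getBalance() === 400, 'source account should be debited by 100');
  console.assert(to.getBalance() === 100, 'destination account should be credited by 100');
}
```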
Order dependency
Order dependency problems are caused when tests are executed in a different order than planned. One way to solve this issue is to consistently run the tests in the same order. This, however, is a poor solution, as it means we have accepted that the tests are brittle and that their execution depends on a carefully built environment.
The root of the problem is that these tests depend on shared mutable data, and when it’s not mutated in a predefined order, the tests fail. This issue is resolved by breaking the dependency on shared data. Every test should prepare the environment for its execution and clean it up after it’s done.
Look at this Cypress test and think about what would happen if we reversed the order of the subscribe and unsubscribe tests.
describe('Newsletter test', () => {
  it('Subscribes to newsletter', () => {
    cy.visit('https://example.com/newsletter');
    cy.get('.action-email').type('fake@email.com');
    cy.get('.subscribe-button').click();
    cy.get('.message').should('have.value', 'Subscribed successfully');
  });

  it('Unsubscribes from newsletter', () => {
    cy.visit('https://example.com/newsletter');
    cy.get('.action-email').type('fake@email.com');
    cy.get('.unsubscribe-button').click();
    cy.get('.message').should('have.value', 'Unsubscribed successfully');
  });
});
In addition to issues with data in the database, problems can also occur with other shared data, e.g. files on a disk or global variables. In such cases, a custom solution should be developed to clean up the environment and prepare it before every test.
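A common way to express that per-test setup and cleanup is with hooks. Here is a small JavaScript sketch in a Jest/Mocha-style runner; the seeding and cleanup helpers are placeholders for whatever your project actually uses:

```javascript
// Each test prepares its own data and cleans up afterwards, so execution order no longer matters
describe('Newsletter test', () => {
  beforeEach(async () => {
    // placeholder: reset shared state (database rows, files, globals) to a known baseline
    await resetSubscriptions();
  });

  afterEach(async () => {
    // placeholder: remove anything the test created
    await deleteSubscription('fake@email.com');
  });

  it('Subscribes to newsletter', async () => {
    await subscribe('fake@email.com');
    expect(await isSubscribed('fake@email.com')).toBe(true);
  });
});
```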
Improper assumptions
One has to make assumptions when writing tests, e.g. expecting a dataset to already be loaded in the database. Sometimes, however, reality surprises us with a day that has fewer than 24 hours.
The best we can do is to make tests completely self-contained, i.e. prepare the conditions and set up the scenario within the test. The next best thing is to check our assumptions before executing the test. For instance, JUnit has an Assumptions utility class that aborts a test (but does not fail it) if the initial conditions are not suitable.
@Test
void testOnDev() {
    // abort the test (without failing it) unless we are running in the DEV environment
    Assumptions.assumeTrue("DEV".equals(System.getProperty("ENV")));
    // the rest of the test...
}
Fixing the test
After acquiring more data about a test failure, you should start fixing tests. Once you fix a group of tests, merge the branch back into the mainline to pass the benefits to your team.
If all else fails, the offending test should be deleted and re-written from scratch, preferably by someone who wasn’t involved in writing it the first time.
Conclusion
Remember, it’s easier to write a test than to maintain it. Strategies for fixing a flaky test are highly dependent on the application, but the approach and ideas outlined in this article should be enough to get you started.
Is flakiness still plaguing your codebase? Read these next: