How to Deal With and Eliminate Flaky Tests
Randomly failing tests are the hardest to debug. Here’s a system you can use to fix them and keep your test suite healthy.
An essential property of an automated test and the entire test suite is its determinism. This means that a test should always have the same result when the tested code doesn't change. A test that fails randomly is commonly called a flaky test. Flaky tests hinder your development experience, slow down progress, hide real bugs and, in the end, cost money. It makes sense to invest time in keeping your test suite robust and trustworthy.
A strategy that works well when it comes to keeping the quality of your test suite high is "No broken windows". The idea comes from the broken windows theory, a criminological theory that describes how vandalism spreads. In a nutshell, if you have a freshly painted building surrounded by other, well-maintained buildings, its facade will remain in a good state until the first window gets broken. After that, it will quickly acquire more broken windows, graffiti and dirt. The trick for keeping a building in a good state is to prevent minor damage and to do small repairs as fast as possible.
The "No broken windows" strategy works well for many aspects of software development. Applied to flaky tests, this means that you should fix a flaky test as soon as it appears. If not, you risk acquiring more flaky tests, as developers will care less about the overall test suite quality.
Besides, the first appearance of a flaky test is probably the best moment to fix it anyway. The test has either been introduced recently, or there has been a recent change that influenced its stability. This means that the related history is still fresh in the developers' memory, so they can act quickly.
If you don't have enough time to fix a test right away, you should make sure to document it, create a ticket and schedule it as soon as possible. Keep in mind that documenting a problem does not equal fixing it. If it's not fixed soon, it can shake the team's trust in the test suite.
But what if you already have a non-trivial test suite with more than just a few flaky tests? You realized only recently that they are standing in your way and you want them fixed. In this tutorial, we will discuss some strategies that can help you deal with flaky tests and weed them out of your test suite more easily.
Step 1 - Drive Flaky Tests out into the Open
By definition, a flaky test fails only once in a while. If you're using a continuous integration (CI) service such as Semaphore, you might have seen a flaky test fail only in the CI environment, but never locally. The reason for this might be that your whole test suite is executed much more often in CI than on your local development machine.
If you're not using a CI service, you should consider using one even if your project is still small. Practice has shown that developers tend to forget to run the whole test suite regularly, even when the test suite isn't large and it doesn't take much time to execute it. This means that every project should use a CI service from the very start.
If you want flaky tests to show up, you need to run your test suite many times. One efficient way to do this is to utilize a CI service. Create a branch for fixing flaky tests and set up your CI service to schedule a build on the branch every hour, or more often than that. In a day or two, you will have enough builds to demonstrate a number of flaky tests.
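Where a scheduled CI build is not available yet, the same idea can be approximated with a small script. The following is only a sketch: `run_repeatedly` is a hypothetical helper, and it assumes that the test suite can be started with a single command that exits with a non-zero status when a test fails.

```ruby
require "rbconfig"

# Hypothetical helper: runs a command `runs` times and tallies the outcomes.
# Assumes the command exits with status 0 on success and non-zero on failure.
def run_repeatedly(command, runs)
  tally = { passed: 0, failed: 0 }
  runs.times do
    # Kernel#system returns true only when the command exits with status 0.
    ok = system(*command, out: File::NULL, err: File::NULL)
    tally[ok ? :passed : :failed] += 1
  end
  tally
end

# Stand-in for a real test command such as ["bundle", "exec", "rspec"].
always_green = [RbConfig.ruby, "-e", "exit 0"]
puts run_repeatedly(always_green, 3)
```

A tally that contains both passes and failures for unchanged code is a strong hint that at least one test is flaky.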
Another benefit of scheduling builds is that a build will be executed at different times of the day, and time is a possible cause of flaky tests. If you notice a pattern, for example that a test fails every time it runs between 3 and 5 a.m., you are one step closer to fixing the test.
Step 2 - Document Flaky Tests
After you have driven the flaky tests out into the open, document every flaky test in your ticketing system. As you acquire more information about the cause of a test's flakiness, add it to the test's ticket. Feel free to fix right away the tests where the cause of flakiness is obvious.
The number of tickets can give the whole team a strong argument that time needs to be scheduled to improve the quality of the test suite. Also, a ticket is a good place to discuss ideas about fixing the test.
Step 3 - Determine the Cause of Failures
In some cases, the cause of a test failure is obvious. This means that you can fix the test and close the case easily. The problem arises when it's not immediately obvious why a test fails. In this case, you will need to acquire more information about the failure itself.
The idea is to create a hook that will fire when a test fails, and gather data about the test failure - the state of the application, the web page screenshot or HTML dump, the test log, etc.
Example: Dissecting a Failed Test in a Ruby Application
Cucumber supports hooks that allow you to execute code after a scenario has failed:
    # features/support/env.rb
    After do |scenario|
      if scenario.failed?
        print page.html
        puts `cat log/test.log`
      end
    end
This hook checks if a scenario has failed and if it has, it prints the page HTML and the application log to the screen. The output may not look pretty, but it will help you pinpoint the issue that's causing the test to fail.
Capybara provides the option for taking screenshots in tests. You can upload the screenshot to a third-party service like Amazon S3 for further investigation.
If you're using RSpec, you can use a similar method to execute code if a test fails:
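    # spec/spec_helper.rb
    RSpec.configure do |config|
      # Runs after every example; example.exception is set only on failure.
      config.after(:each) do |example|
        if example.exception
          # Print the test log or get more data
        end
      end
    end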
Other testing tools should provide similar capabilities to execute code when a test fails.
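Printing everything to the screen can get noisy, so one option is to save the gathered data to files instead. The sketch below is an illustration in plain Ruby and is not tied to any test framework: `dump_failure_artifacts`, its parameters, and the file names are all hypothetical. It assumes that the hook can pass in the page HTML as a string and the path to the test log.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: saves the page HTML and the tail of the test log
# for one failed test into its own directory, and returns that directory.
def dump_failure_artifacts(test_name, html:, log_path:, dir:, log_lines: 200)
  # Turn the test name into a filesystem-safe directory name.
  slug = test_name.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
  target = File.join(dir, slug)
  FileUtils.mkdir_p(target)

  File.write(File.join(target, "page.html"), html)

  # Keep only the last lines of the log, since the whole file can be large.
  tail = File.exist?(log_path) ? File.readlines(log_path).last(log_lines) : []
  File.write(File.join(target, "test.log"), tail.join)

  target
end

# Usage with stand-in data; a real hook would pass page.html and log/test.log.
Dir.mktmpdir do |tmp|
  log = File.join(tmp, "test.log")
  File.write(log, "line 1\nline 2\nline 3\n")
  saved = dump_failure_artifacts("User sends an email", html: "<html></html>",
                                 log_path: log, dir: tmp, log_lines: 2)
  puts Dir.children(saved).sort.inspect
end
```

Keeping one directory per failed test makes it easier to attach the files to the corresponding ticket.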
Step 4 - Fix a Flaky Test
After acquiring more data about a test failure, you should start fixing tests one by one. After you fix a group of tests, merge the branch back to the mainline to pass the benefits to your team. Leave the flaky tests branch alive and building until no repeating flaky test appears for a longer period of time.
A popular solution that hides flaky test failures is to re-run the failed tests a few times until they pass. If a test passes at least once, it is declared as passed. Some test runners have a built-in feature for re-running failed tests before declaring them as failed. For others, there are plugins that provide the capability. The idea behind this is that a test failed accidentally and that re-running the test will fix it. We believe that this is not an acceptable solution: if a flaky test can fail by accident, it can just as well pass by accident. This means that a flaky test cannot be trusted, and it has to be fixed.
An empirical study of flaky tests shows that the majority of flaky tests are caused by asynchronous waits, concurrency and test order dependency.
Asynchronous wait flaky tests happen because the test suite and the application under test run in separate processes. When a test performs an action, the application needs some time to complete the request. After that, the test can check if the action has yielded the expected result.
A simple solution for the asynchrony is for the test to wait for a specified period of time before it checks if the action has been successful:
    click_button "Send"
    sleep 5
    expect_email_to_be_sent
The problem here is that, from time to time, the application needs more than 5 seconds to complete the task. In that case, the test will fail.
Another issue is that if the application typically needs around 2 seconds to complete the task, the test will be wasting 3 seconds every time it executes.
There are two solutions to this problem: callbacks and polling. The callbacks solution means that the application can signal back to the test when it can start executing again. The advantage of this solution is that the test doesn't wait longer than necessary. However, this solution is rarely supported by the testing library and if it is, the application needs to support it too.
Polling is based on a series of repeated checks of whether an expectation has been satisfied.
    click_button "Send"
    wait_for_email_to_be_sent
The wait_for_email_to_be_sent method checks whether the expectation has been satisfied. If it has not, the method sleeps for a short, fixed period of time (for example, 1 second) and checks again. If the expectation is still not satisfied after a predefined number of attempts, the test fails.
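A minimal sketch of such a polling helper is shown here. The names `wait_for` and `WaitTimeoutError`, the default values, and the use of a time limit in place of a fixed number of attempts are all assumptions made for illustration. They are not part of any particular testing library.

```ruby
# Hypothetical polling helper: re-checks the block until it returns a truthy
# value or the timeout expires. Names and defaults are illustrative only.
class WaitTimeoutError < StandardError; end

def wait_for(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise WaitTimeoutError, "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for an application that finishes its work asynchronously.
sent_emails = []
worker = Thread.new do
  sleep 0.3
  sent_emails << "welcome@example.com"
end

wait_for(timeout: 2) { sent_emails.any? }
worker.join
puts "email sent"
```

The helper returns as soon as the condition holds, so a fast application does not pay for the full timeout, and a slow one gets the whole time budget before the test fails.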
Capybara is a Ruby testing library whose matchers have a built-in polling solution. The maximum wait time is configurable. Similar libraries in other languages provide similar solutions. This is a simple solution that's well-supported in test libraries, and it fixes both problems outlined above.
Flakiness related to concurrency can be caused by non-deterministic behaviour of the test, or the code under test. In some cases, there is a bug in the code. In others, the problem is that the code can produce different valid results, and the test accepts only one. This problem can be solved by making the test more robust, so that it accepts all valid results.
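As an illustration of a test that accepts all valid results, consider code that collects values from several threads. In the sketch below, `parallel_squares` and `same_elements?` are hypothetical names, and the example assumes that any ordering of the results is valid. RSpec's `match_array` matcher expresses the same order-insensitive comparison.

```ruby
# Hypothetical code under test: squares each number in its own thread.
# The order in which results arrive depends on thread scheduling.
def parallel_squares(numbers)
  results = Queue.new
  threads = numbers.map { |n| Thread.new { results << n * n } }
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end

# Order-insensitive comparison: true when both arrays hold the same
# elements, regardless of position.
def same_elements?(actual, expected)
  actual.sort == expected.sort
end

squares = parallel_squares([1, 2, 3])

# Brittle check: squares == [1, 4, 9] passes only for one arrival order.
# Robust check: accepts every valid arrival order.
raise "unexpected result" unless same_elements?(squares, [1, 4, 9])
puts "all squares present"
```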
Order dependency problems are caused by the fact that sometimes tests fail when they are executed in a different order. One way to solve this issue is to always execute tests in the same order. However, this is a poor solution, as it means that we have accepted that tests are brittle and that their execution depends solely on a carefully built environment.
The root of the problem is that tests depend on shared mutable data. When the data is not mutated in a predefined order, tests fail. This issue can be resolved by breaking dependency on shared data. Every test should prepare the environment for its execution and clean the environment after it's done. A popular solution for this problem is to run a test in a database transaction that's rolled back once the test has finished executing.
In addition to problems with the data that lives in a database, problems can also occur with other shared data, e.g. files on a disk, or global variables. In these cases, a custom solution needs to be developed that will clean up the environment and prepare it before every test.
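What such a custom solution looks like depends on the application. As a rough sketch, the helper below gives each test a private scratch directory and restores a global variable afterwards. The names `with_isolated_environment` and `$feature_flags` are hypothetical and stand in for whatever shared state your tests touch.

```ruby
require "tmpdir"

# Hypothetical global setting that tests might change.
$feature_flags = { new_checkout: false }

# Hypothetical helper: gives the block a private scratch directory and
# restores the global flags afterwards, even if the block raises.
def with_isolated_environment
  saved_flags = $feature_flags.dup
  # The block form of Dir.mktmpdir deletes the directory when the block ends.
  Dir.mktmpdir("test-env") do |dir|
    yield dir
  end
ensure
  $feature_flags = saved_flags
end

with_isolated_environment do |dir|
  $feature_flags[:new_checkout] = true
  File.write(File.join(dir, "export.csv"), "id,name\n")
end

puts $feature_flags[:new_checkout]
```

Because the cleanup runs in an `ensure` clause, a failing test cannot leave modified flags or stray files behind for the next one.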
Most testing libraries provide a way to execute tests in random order. We advise using this option, as it will force you to write more resilient and stable tests.
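Random ordering is most useful when a failing order can be reproduced, which is why test runners typically derive the order from a seed. RSpec, for example, accepts a `--seed` option. The toy runner below sketches the idea in plain Ruby. `run_in_random_order` and the two sample checks are hypothetical, and the checks deliberately share mutable state.

```ruby
# Hypothetical mini-runner: executes named checks in a seeded random order.
# Each check is a lambda that returns true when it passes.
def run_in_random_order(tests, seed)
  order = tests.keys.shuffle(random: Random.new(seed))
  failures = order.reject { |name| tests[name].call }
  { order: order, failures: failures }
end

# Shared mutable state creates a hidden order dependency between the checks:
# starts_empty passes only when it runs before adds_item.
cart = []
tests = {
  adds_item:    -> { cart << :book; cart.size == 1 },
  starts_empty: -> { cart.empty? }
}

seed = Random.new_seed
result = run_in_random_order(tests, seed)
puts "seed=#{seed} order=#{result[:order]} failures=#{result[:failures]}"
```

Running the checks again with the same seed, after resetting the shared state, yields the same order, so an order-dependent failure can be replayed while it is being fixed.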
In this article, we discussed optimal approaches to finding flaky tests in an application, documenting problematic tests, and fixing them. We illustrated ideas that can help you gather more information about failing tests and a general pattern for fixing the most common causes of flaky tests.
Once all flaky tests have been fixed in the application test suite, the team should go back to the "no broken windows" strategy. That way, the quality of the test suite will remain high.
Strategies for fixing a concrete flaky test are highly dependent on the application, but the approach and ideas outlined in this article should be enough to get you started.