What are Flaky Tests?

Software testing is vital for ensuring code quality and reducing the risk of defects slipping into production. However, automated tests can introduce their own problems. One particularly insidious issue is known as “test flakiness.” Let’s dive into understanding flaky tests and how to mitigate them.

What is a Flaky Test?

A flaky test is a test that intermittently passes and fails without any actual changes made to the code it’s testing. In other words, the test results are unpredictable and unreliable. This can be extremely frustrating for developers who rely on automated test suites to provide a reliable indication of whether their code is working or not.

Why Do Flaky Tests Happen?

Usually, flaky tests are caused by one or more of the following issues:

  • Timing Issues: Tests dependent on precise timing sequences (e.g., delays or waiting for specific events) might fail if system resources are variable or if the test environment is not deterministic.
  • External Dependencies: Tests that interact with external systems, such as databases, networks, or third-party APIs, are prone to failure if those external resources are unavailable, slow, or return inconsistent results.
  • Test Data: Tests that don’t properly isolate and manage their test data can produce conflicts and unpredictable results if data is shared between tests or modified without being reset appropriately (a short sketch follows this list).
  • Resource Contention: If tests run in parallel and compete for shared resources, such as files or system memory, their outcome can become dependent on the order of execution and resource availability.
  • Non-deterministic Code: Code that uses randomness or relies on elements like system time might produce different results each time it’s run, leading to flaky tests.
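
To make the test-data point concrete, here is a minimal sketch (the names are illustrative) of two tests that share module-level state; the second test passes or fails depending on whether the first one ran before it:

Python
# Module-level state shared between tests and never reset between them.
shopping_cart = []

def test_add_item_to_cart():
    shopping_cart.append("apple")
    assert len(shopping_cart) == 1

def test_cart_starts_empty():
    # Passes when run in isolation, but fails if test_add_item_to_cart
    # ran first, because the shared list still contains "apple".
    assert len(shopping_cart) == 0

Running these tests in a different order, or in parallel, changes the outcome even though no production code changed, which is the hallmark of a flaky test.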

Code Example of a Flaky Test

Here’s a Python example using the pytest framework that demonstrates a flaky test:

Python
import random
import time

def test_random_number():
    random_value = random.randint(1, 10)
    time.sleep(0.5)  # Introduce a potential delay
    assert random_value > 5

This test has a built-in probability of failure: whenever the random number is 5 or less, the assertion fails, even though no code under test has changed. The time.sleep call does not change the outcome here, but timing-dependent waits like it are a common source of flakiness in their own right when the system is under load.
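
For contrast, here is one way this particular test could be made deterministic (a minimal sketch; pinning the value with unittest.mock is just one option, and seeding the generator or restructuring the code under test would work as well):

Python
import random
from unittest import mock

def test_random_number_deterministic():
    # Pin the "random" value so every run exercises the same input.
    with mock.patch("random.randint", return_value=7):
        random_value = random.randint(1, 10)
    assert random_value > 5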

The Problem with Flaky Tests

Flaky tests erode confidence in your test suite. A failed test should reliably signify a problem in the code. But with flaky tests, the following dilemmas arise:

  • False Negatives: A flaky test may pass sometimes, obscuring actual bugs.
  • False Positives: Flaky failures waste developer time investigating problems that aren’t caused by code changes at all.
  • Loss of Trust: Teams start ignoring failing tests, undermining the entire purpose of a test suite.

Flaky tests have a real business cost: they burn engineering time while providing a signal no one trusts, which can make them worse than having no test at all.

Preventing Test Flakiness

Here are strategies to deal with flaky tests:

  • Quarantining: Identify flaky tests and isolate them from your main test suite to prevent them from disrupting development cycles.
  • Refactoring Tests: Improve the test code itself. Address timing issues, manage dependencies effectively, and ensure proper test isolation.
  • Mocking: Use mocks or stubs to simulate external dependencies and create a controlled test environment (a short sketch follows this list).
  • Test Frameworks: Some test frameworks have built-in features for detecting and reporting flaky tests.
  • Robust Retries: Consider a controlled retry mechanism for tests with a known tendency toward flakiness, but reserve it for tests that cannot be made deterministic, such as those that assert on the contents of an LLM output (a retry sketch also follows this list).
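
As a concrete sketch of the mocking strategy, suppose the code under test calls a helper that hits a third-party exchange-rate API (the function names below are made up for illustration). Replacing the helper with a stub keeps the test fast, deterministic, and independent of the network:

Python
from unittest import mock

def get_exchange_rate(base, quote):
    # Stand-in for the real implementation, which would call a third-party API.
    raise RuntimeError("network call not allowed in tests")

def convert_to_eur(amount_usd):
    rate = get_exchange_rate("USD", "EUR")  # slow, network-bound call
    return round(amount_usd * rate, 2)

def test_convert_to_eur_with_mocked_rate():
    # Stub out the external dependency so the test never touches the network.
    with mock.patch(f"{__name__}.get_exchange_rate", return_value=0.9):
        assert convert_to_eur(100) == 90.0

And for the retry strategy, a tiny self-contained retry decorator might look like the sketch below; plugins such as pytest-rerunfailures offer similar behaviour with more polish:

Python
import functools

def retry(times=3):
    """Re-run a flaky test up to `times` attempts before reporting failure."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError:
                    if attempt == times:
                        raise  # out of attempts: surface the real failure
        return wrapper
    return decorator

@retry(times=3)
def test_known_flaky_behaviour():
    ...  # a test that cannot be made fully deterministic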

How to Test LLM Outputs

LLMs are inherently non-deterministic, meaning they don’t necessarily return the same output every time they’re given the same input prompt.

This makes tests that assert on their output especially prone to flakiness. If the response cannot be mocked, the best tactic is to test the semantic similarity of the LLM’s output rather than its exact wording. Use semantic similarity metrics (such as cosine similarity over sentence embeddings) to compare the output to a set of acceptable reference responses or to identify near-duplicates. This allows flexibility in the exact wording while still ensuring the LLM generates semantically meaningful results.
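
Here is a minimal sketch of that idea using the sentence-transformers package (assumed to be installed); the model name, threshold, and call_llm helper are illustrative, not prescriptive:

Python
from sentence_transformers import SentenceTransformer, util

# Illustrative model and threshold; tune both for your own use case.
model = SentenceTransformer("all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = 0.8

def call_llm(prompt):
    # Stand-in for a real LLM call; in practice this hits your model or API.
    return "Payment for the invoice was completed in full on 3 March."

def test_llm_summary_is_semantically_close():
    expected = "The invoice was paid in full on March 3rd."
    actual = call_llm("Summarise the payment status of the invoice.")

    embeddings = model.encode([expected, actual])
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

    assert similarity >= SIMILARITY_THRESHOLD, (
        f"LLM output drifted too far from the reference (cosine={similarity:.2f})"
    )

Rather than comparing against a single reference string, you can also compare against a set of acceptable responses and pass the test if any of them clears the threshold.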