Why Mutation Testing Is Slow (and What to Do About It)

tl;dr: Certain mutations (especially in parsers and loop-heavy code) create non-terminating programs. Vary uses adaptive timeouts, early exit on first failure, and shared compilation caches to keep runs fast despite these pathological mutants.

Mutation testing is powerful. It is also, if you are not careful, slow. Not "large test suite" slow, where adding hardware or parallelism helps linearly. Structurally slow, caused by the interaction between what mutations do and what certain kinds of code do in response.

This article explains why some mutation runs take much longer than others and how Vary addresses each cause.

The cost model

A mutation testing run does three things for each mutant.

Phase	What it does
Compile	Type-check the mutated code and generate bytecode
Load	Bring the compiled mutant into memory
Run	Execute the test suite against the mutant

The total wall-clock time is roughly:

total = N_mutants * (compile_time + load_time + test_time)

For a file with 200 mutants, a 5ms compile, 1ms load, and 10ms test suite, that is 200 * 16ms = 3.2 seconds. Perfectly reasonable.
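The arithmetic can be checked with a few lines of Python. This is only a sketch of the cost model above; the function and parameter names are illustrative and are not part of Vary.

```python
def estimated_run_seconds(n_mutants: int, compile_ms: float,
                          load_ms: float, test_ms: float) -> float:
    """Wall-clock estimate for a sequential run, in seconds."""
    per_mutant_ms = compile_ms + load_ms + test_ms
    return n_mutants * per_mutant_ms / 1000.0


# The example from the text: 200 mutants at 5ms + 1ms + 10ms each.
baseline = estimated_run_seconds(200, compile_ms=5, load_ms=1, test_ms=10)
print(f"{baseline:.1f} seconds")
```

Every term scales with the number of mutants, so anything that inflates the per-mutant cost is multiplied across the whole run.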

The problem is that test_time is not always 10ms.

The infinite loop problem

Consider a parser that walks an input string character by character:

def parse_tokens(text: Str, pos: Int) -> Int {
    mut i = pos
    while i < len(text) {
        if text[i] == " " {
            i = i + 1
        } else {
            break
        }
    }
    return i
}

Now consider what happens when the mutation i = i + 1 becomes i = i - 1. The loop variable decrements instead of incrementing, so it will never reach len(text). Unless the walk happens to hit a non-space character or an out-of-range index raises an error, the program runs forever.

Or when i < len(text) becomes i <= len(text). An off-by-one that, depending on what happens at the boundary, may cause an infinite loop by reading past the end of input and never finding the termination condition.

Or when i = i + 1 is removed entirely (a statement-removal mutation). Now i never changes. The loop spins on the same character forever.

These are not exotic edge cases. They are the natural consequence of applying standard mutation operators to loop-heavy code. Parsers, interpreters, state machines, graph traversals, protocol handlers: any code that uses loops with computed termination conditions is vulnerable.
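To make the statement-removal case concrete, here is a Python port of parse_tokens. It is a sketch rather than Vary code: the step parameter stands in for the mutation (1 is the original update, 0 removes it), and max_steps is a guard added for this demonstration so that the mutant raises instead of hanging.

```python
def parse_tokens(text: str, pos: int, step: int = 1, max_steps: int = 10_000) -> int:
    """Skip spaces from pos and return the index of the first non-space.

    step=1 is the original update (i = i + 1); step=0 models the
    statement-removal mutant. max_steps is a demo-only guard that turns
    a non-terminating loop into an exception.
    """
    i = pos
    iterations = 0
    while i < len(text):
        iterations += 1
        if iterations > max_steps:
            raise TimeoutError(f"no progress after {max_steps} iterations at i={i}")
        if text[i] == " ":
            i = i + step
        else:
            break
    return i


original = parse_tokens("   x", 0)  # advances past the three spaces

try:
    parse_tokens("   x", 0, step=0)  # i never changes, so the loop spins
    removal_mutant_hangs = False
except TimeoutError:
    removal_mutant_hangs = True
```

The mutant only spins when the character at pos is a space. With an input such as "x" it returns the same result as the original, so whether a given test hangs depends on the input it feeds the loop.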

The timeout tax

Mutation testing tools generally handle non-terminating mutants the same way: with a timeout. If a test does not complete within some limit, the tool counts the mutant as killed (the test "failed" by hanging) and moves on.

The question is how long that timeout should be. Too long and a single bad mutant wastes seconds or minutes while the test suite hangs. Too short and legitimately slow tests are cut off, so mutants they would have passed are wrongly counted as killed.

A fixed 5-second timeout sounds reasonable until you do the math. If 10% of 200 mutants trigger infinite loops, that is 20 mutants * 5 seconds = 100 seconds of pure waiting. If each mutant runs against a 30-test suite and each test independently hangs, it is 20 * 30 * 5 = 3,000 seconds. Fifty minutes, doing nothing useful.
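The same numbers, worked as a small Python sketch. The function is illustrative only, and it assumes the worst case described above, in which every test run against a hanging mutant waits for the full timeout.

```python
def timeout_tax_seconds(n_mutants: int, hang_rate: float,
                        timeout_s: float, tests_per_mutant: int = 1) -> float:
    """Time spent purely waiting on timeouts, assuming every test hangs."""
    hanging_mutants = round(n_mutants * hang_rate)
    return hanging_mutants * tests_per_mutant * timeout_s


one_test = timeout_tax_seconds(200, 0.10, timeout_s=5)
thirty_tests = timeout_tax_seconds(200, 0.10, timeout_s=5, tests_per_mutant=30)
print(one_test, thirty_tests, thirty_tests / 60)  # seconds, seconds, minutes
```

The tax is a product of three factors, so there are three ways to shrink it: fewer hanging mutants, fewer tests run per hanging mutant, or a shorter timeout.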

The compounding factors

Three things make the problem worse in practice:

Multiple tests per mutant. If a mutant causes an infinite loop in one test, it likely causes the same loop in every test that exercises the same code path. Running all 30 tests to their timeout is wasteful; the first timeout already killed the mutant.

Recompilation overhead. AST-level mutation requires recompiling each mutant through the type checker and code generator. For source files with complex imports, the type checker recursively resolves the entire module graph. Doing this hundreds of times adds up even if each individual compile is only a few hundred milliseconds.

Uneven mutant distribution. Some source files generate far more mutants than others. A 100-line parser might produce 600 mutants because of its dense control flow, while a 200-line data class produces 20. A naive "test everything" approach spends most of its time on the densest files.

How Vary handles it

1. Early exit on first failure

When a test fails or times out for a mutant, there is no reason to run the remaining tests. The mutant is dead. Vary stops immediately and moves to the next mutant.

This is the single biggest optimization. For a file with 42 tests and 10 timeout-causing mutants, in the worst case where every test hangs, it reduces the timeout tax from 42 * 10 * T to 1 * 10 * T, a 42x improvement.
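The control flow is simple enough to sketch in Python. This is not Vary's implementation: tests are modeled as callables that report their own outcome, and the names run_mutant and Outcome are hypothetical.

```python
from typing import Callable, Iterable, Literal, Tuple

Outcome = Literal["pass", "fail", "timeout"]


def run_mutant(tests: Iterable[Callable[[], Outcome]]) -> Tuple[str, int]:
    """Run tests in order and stop at the first one that does not pass.

    Returns (status, tests_run).
    """
    tests_run = 0
    for test in tests:
        tests_run += 1
        outcome = test()
        if outcome != "pass":
            return ("killed_by_" + outcome, tests_run)
    return ("survived", tests_run)


# A mutant that hangs every one of 42 tests pays for one timeout, not 42.
hanging_suite = [lambda: "timeout"] * 42
status, tests_run = run_mutant(hanging_suite)
```

Only a surviving mutant pays for the whole suite, since every test has to pass before it can be declared a survivor.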

2. Adaptive timeouts

Instead of a fixed 5-second timeout, Vary measures the baseline test duration (how long the tests take on the original, unmutated code) and sets the per-test timeout to 10x the baseline, with a minimum of 1 second and a maximum of 5 seconds.

If your tests complete in 9 milliseconds, 10x the baseline is only 90 milliseconds, so the 1-second minimum applies and the timeout is 1 second. A mutant that causes an infinite loop wastes 1 second, not 5.
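The rule amounts to a clamp. A minimal Python sketch follows, using the multiplier and bounds stated above; the function name and the choice of seconds as the unit are assumptions made for illustration.

```python
def adaptive_timeout_s(baseline_s: float, multiplier: float = 10.0,
                       floor_s: float = 1.0, ceiling_s: float = 5.0) -> float:
    """Scale the measured baseline, then clamp it to [floor_s, ceiling_s]."""
    return min(ceiling_s, max(floor_s, baseline_s * multiplier))


fast = adaptive_timeout_s(0.009)   # 90ms scaled, raised to the floor
medium = adaptive_timeout_s(0.2)   # 2s scaled, inside the bounds
slow = adaptive_timeout_s(3.0)     # 30s scaled, capped at the ceiling
```

The floor protects fast tests from spurious timeouts caused by scheduling noise, and the ceiling bounds the cost of any single hang.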

3. Shared module resolution

When type-checking mutated code, the imported modules do not change between mutants, only the source file does. Vary creates a single module resolver and shares it across all mutants for a file. Imported modules are resolved and type-checked once, and the results are reused for every subsequent mutant.

For projects with deep import chains, this can reduce per-mutant type-checking from hundreds of milliseconds to single-digit milliseconds.
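The idea is ordinary memoization keyed by module name. The Python sketch below is an illustration under stated assumptions, not Vary's resolver: ModuleResolver, type_check, and the module names are all hypothetical, and the "type checker" does nothing except resolve imports.

```python
from typing import Callable, Dict, List


class ModuleResolver:
    """Resolves each imported module once and reuses the result."""

    def __init__(self, load: Callable[[str], dict]) -> None:
        self._load = load
        self._cache: Dict[str, dict] = {}
        self.loads = 0  # how many times the expensive load actually ran

    def resolve(self, name: str) -> dict:
        if name not in self._cache:
            self.loads += 1
            self._cache[name] = self._load(name)
        return self._cache[name]


def type_check(mutant_source: str, imports: List[str], resolver: ModuleResolver) -> int:
    """Stand-in for a type checker: resolves imports, returns how many."""
    return len([resolver.resolve(name) for name in imports])


# One resolver shared across all 200 mutants of the same file.
resolver = ModuleResolver(load=lambda name: {"module": name})
for mutant_id in range(200):
    type_check(f"mutant {mutant_id}", ["lexer", "ast", "errors"], resolver)
```

With a fresh resolver per mutant the three imports would be loaded 600 times; with the shared one, resolver.loads ends at 3. The sharing is only safe because the imports are identical across mutants of one file.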

4. Default mutant cap

By default, Vary tests up to 200 mutants per file (use --all for an exhaustive run or --quick to cap at 20). The mutants are prioritized so that the most valuable ones (those affecting functions with contracts, side effects, or complex control flow) run first. This means a capped run still exercises the highest-value mutations.
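Prioritize-then-cap can be sketched as a stable sort followed by a slice. The scoring below is an assumption made purely for illustration (one point per property named above); the article does not say how Vary actually weighs them, and the Mutant type is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mutant:
    id: int
    has_contract: bool = False
    has_side_effects: bool = False
    complex_control_flow: bool = False


def priority(m: Mutant) -> int:
    """Hypothetical score: one point per high-value property."""
    return int(m.has_contract) + int(m.has_side_effects) + int(m.complex_control_flow)


def select_mutants(mutants: List[Mutant], cap: Optional[int] = 200) -> List[Mutant]:
    """Highest priority first; ties keep source order. cap=None means no cap."""
    ranked = sorted(mutants, key=priority, reverse=True)
    return ranked if cap is None else ranked[:cap]


# 600 candidate mutants, every third one inside a function with a contract.
candidates = [Mutant(i, has_contract=(i % 3 == 0)) for i in range(600)]
selected = select_mutants(candidates)
```

In this example the 200 selected mutants are exactly the 200 that touch a contract, which is the property the cap is meant to preserve.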

What this means for your scores

A mutation score of 80% does not mean the same thing for a calculator module and a parser module.

Module kind	What the score reflects
Calculator	Simple arithmetic operations. Most mutants are killed quickly. Survivors indicate genuinely missing test assertions.
Parser	Dense loop logic. Many mutants are killed by timeout (the test hung, which counts as a failure). Survivors may be in code paths where the mutation causes subtly wrong output rather than non-termination.

Timeout-killed mutants are real kills: a mutation that causes an infinite loop is certainly detected by the test suite. But they tell you less about the precision of your assertions than mutants killed by explicit assertion failures.

Vary reports timeout kills separately in verbose mode (--trace) so you can distinguish between "my test caught the wrong output" and "my test hung because the code looped."

Practical advice

For parser and interpreter code: Use --quick for fast feedback during development. Run the full suite in CI with a --budget time limit. Expect longer per-file times and higher timeout-kill rates. This is normal.

For business logic: Full runs are usually fast. If a file is unexpectedly slow, check how many tests it is paired with. A source file paired with a 50-test integration suite will be slower per mutant than one paired with 5 targeted unit tests.

For any code: The --output log mode shows per-file timing and per-mutant rates, making it easy to identify which files are slow and why:

[2.2s] file_done file=calculator.vary killed=95 survived=5 total=100 duration=2142ms
[8.1s] estimate file=grammar.vary mutants=200 est=~2.8s (14ms/mutant)
[78.6s] file_done file=grammar.vary killed=143 survived=57 total=200 duration=70651ms

When the actual time (about 71s) dramatically exceeds the estimate (2.8s), timeouts are the likely cause. The code is not slow; it is looping.
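That comparison is easy to automate. The Python sketch below parses only the three log lines shown above; the regular expressions are inferred from those lines, so treat the format as an assumption, and the 5x threshold is arbitrary.

```python
import re
from typing import Dict, List, Tuple

LOG = """\
[2.2s] file_done file=calculator.vary killed=95 survived=5 total=100 duration=2142ms
[8.1s] estimate file=grammar.vary mutants=200 est=~2.8s (14ms/mutant)
[78.6s] file_done file=grammar.vary killed=143 survived=57 total=200 duration=70651ms
"""

EST = re.compile(r"estimate file=(\S+) .*est=~([\d.]+)s")
DONE = re.compile(r"file_done file=(\S+) .*duration=(\d+)ms")


def slow_files(log: str, ratio: float = 5.0) -> List[Tuple[str, float, float]]:
    """Return (file, estimate_s, actual_s) where actual exceeds ratio * estimate."""
    estimates: Dict[str, float] = {}
    flagged: List[Tuple[str, float, float]] = []
    for line in log.splitlines():
        if m := EST.search(line):
            estimates[m.group(1)] = float(m.group(2))
        elif m := DONE.search(line):
            name, actual_s = m.group(1), int(m.group(2)) / 1000
            estimate_s = estimates.get(name)
            if estimate_s is not None and actual_s > ratio * estimate_s:
                flagged.append((name, estimate_s, actual_s))
    return flagged


flagged = slow_files(LOG)
```

Only grammar.vary is flagged here. calculator.vary has no estimate line in the excerpt, so the sketch skips it rather than guessing.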