Vary mutation testing speed: comparing to AST and PIT

tl;dr: Vary does regular mutation-performance testing, not just feature work. On the current project-scale Frugal benchmark, the checked-in result is a stable 27-second run over 1,857 mutants with an 82.18 percent mutation score. On the current PIT-style comparison fixture, the PIT-to-Vary wall-time ratio is 0.92x (PIT 6023 ms, Vary 6552 ms, so Vary takes about 9 percent longer), which falls within the acceptance target.

Mutation testing only earns its keep when two things hold at once: the results have to mean something, and the run has to finish in a reasonable amount of time. Historically the second condition has often failed (see Why mutation testing is slow for the long version), so we measure mutation performance directly against benchmark programs instead of waving our hands about it.

Quick definitions

A few terms before the numbers, in case you're new to any of this.

Mutation testing

A way to grade your test suite. The tool makes small changes to your source (flip a > to a <, swap a + for a -, return 0 instead of the real value), then re-runs your tests against each change. If a test fails, the mutation is "killed" and your test suite did its job. If every test still passes, the mutation "survives", which usually means there's a behaviour your tests don't actually pin down. The fraction killed is the mutation score. The longer pitch on why this matters lives in Why mutation testing.
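To make the kill and survive vocabulary concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `in_range` function, the three hand-written mutants, and the deliberately weak test suite are illustrations, not code from Vary or its benchmarks. A real tool generates the mutants automatically.

```python
from typing import Callable, Dict, Tuple

# Original function under test (hypothetical example).
def in_range(x: int, lo: int, hi: int) -> bool:
    return lo <= x <= hi

# Hand-written mutants: each is the original with one small change.
MUTANTS: Dict[str, Callable[[int, int, int], bool]] = {
    "lo <= becomes lo <": lambda x, lo, hi: lo < x <= hi,
    "<= hi becomes < hi": lambda x, lo, hi: lo <= x < hi,
    "always return True": lambda x, lo, hi: True,
}

def suite_passes(fn: Callable[[int, int, int], bool]) -> bool:
    """A deliberately weak suite: it never checks the upper boundary."""
    try:
        assert fn(5, 1, 10) is True
        assert fn(0, 1, 10) is False
        assert fn(1, 1, 10) is True
        return True
    except AssertionError:
        return False

def run_mutation_test() -> Tuple[Dict[str, str], float]:
    # The suite has to pass on the unmutated code, or the results mean nothing.
    assert suite_passes(in_range)
    results = {
        name: "survived" if suite_passes(mutant) else "killed"
        for name, mutant in MUTANTS.items()
    }
    killed = sum(1 for status in results.values() if status == "killed")
    return results, killed / len(results)

if __name__ == "__main__":
    results, score = run_mutation_test()
    for name, status in results.items():
        print(f"{name}: {status}")
    print(f"mutation score: {score:.2%}")
```

The mutant that changes the upper bound survives because no test exercises `x == hi`. Adding an assertion such as `fn(10, 1, 10) is True` would kill it.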

AST mutation vs bytecode mutation

Two ways to do that work. AST mutation parses the source, mutates a tree node, regenerates code, and runs the tests. It's straightforward but slow, because every mutant goes back through the whole compile pipeline. Bytecode mutation compiles once and then patches the JVM bytecode for each mutant, which skips most of the front-end work. Vary supports both. "Vary compiled" in the tables below is the bytecode path; "Vary AST reference" is the AST path. There's a longer write-up in How bytecode mutation testing works, and the case for making it the default is in The bytecode mutation thesis.

PIT

The most widely used mutation testing tool in the JVM world. It's mature, fast, and works at the bytecode level on Java, Kotlin, and Scala. It's the natural baseline for any new JVM mutation engine to measure against, which is why a matched fixture against it is part of how we keep ourselves honest. The deeper dive on PIT lives in How PIT works and Inside PIT, part 2: why it stays fast.

What Vary measures

Two benchmarks anchor the current story.

The first is Frugal, a project-scale parser and grammar workload derived from a port of Parsimonious (a PEG parsing library for Python). Frugal is the one that answers the question users actually ask: what does mutation testing look like on a real codebase, with branch-heavy logic and a real test suite hanging off it?

The second is a smaller PIT-style fixture built for controlled comparison. Its job is narrower. It lines up Vary and PIT on a matched overlap of arithmetic, conditional, and return-value mutations, without dressing up the result as a misleading cross-language headline.

Current Frugal result

On the checked-in Frugal benchmark, Vary's compiled mutation run finishes in a stable 27 seconds across three measured runs. That covers 1,857 mutants, kills 1,526 of them, leaves 298 surviving, and lands at an 82.18 percent mutation score.

Workload	Mode	Wall time	Total mutants	Killed	Survived	Mutation score
Frugal	Vary compiled	27 s	1857	1526	298	82.18%
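As a quick check on how the table's columns relate, the arithmetic can be reproduced in a few lines of Python. The `mutation_score` helper is a hypothetical name. The only assumption is that the score is killed mutants divided by total mutants, which is what the published figure matches.

```python
def mutation_score(killed: int, total: int) -> float:
    """Mutation score as a percentage of all generated mutants."""
    return 100.0 * killed / total

total, killed, survived = 1857, 1526, 298

score = round(mutation_score(killed, total), 2)
# Mutants that appear in neither the Killed nor the Survived column.
other = total - killed - survived

if __name__ == "__main__":
    print(f"score: {score}%")
    print(f"neither killed nor survived: {other}")
```

Killed and survived add up to 1,824, so 33 of the 1,857 mutants fall into neither column. The table does not say how those are classified. The 82.18 percent figure is consistent with dividing by all 1,857 mutants rather than by killed plus survived.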

Speed matters less on its own than what comes with it. The Frugal run is checked in, the 27 seconds reproduces across measurements, and 1,857 mutants is project-scale rather than toy-scale. That combination is what makes the number worth quoting. Mutation testing isn't ordinary test execution. You're not asking whether the current code passes, you're asking whether the tests would still catch the code if it were slightly wrong. A strong mutation signal is far more useful when the run is short enough to fit into normal engineering work, instead of being relegated to an overnight job nobody actually reads.

Comparison to AST

Vary also measures its compiled mutation path against an AST reference workload on the same Frugal benchmark. The current AST project run is 67.0 seconds.

Workload	Mode	Wall time	Total mutants	Killed	Survived	Mutation score
Frugal	Vary compiled	27 s	1857	1526	298	82.18%
Frugal	Vary AST reference	67.0 s	1857	1523	301	82.01%

Same data, drawn out as bars so the gap is easier to see at a glance:

Frugal wall time (lower is better)

Vary compiled       ████████████████████                                27.0 s
Vary AST reference  ██████████████████████████████████████████████████  67.0 s

Treat that comparison as a benchmark result, not a universal law. We're not claiming every AST mutation workflow everywhere will take 67 seconds, or that every compiled run will land at 27. On Vary's pinned Frugal benchmark, the compiled path is about 2.5x faster than the AST reference (27 s against 67.0 s) while landing in the same mutation-score range (82.18 percent against 82.01 percent). That's the claim.

Comparison to PIT

The PIT comparison answers a different question. Anyone who's already used PIT usually wants to know whether Vary belongs in the same performance class. Without that comparison, it's hard to tell whether Vary's mutation engine is genuinely competitive or just an interesting demo.

What the comparison program looks like

The harness is just two small matched programs: ten Java static methods on one side and ten Vary functions on the other, with the same signatures, the same branch structure, and one matching test per function. Same surface, same shape, both sides.

Function	What it does
`add`	a + b
`subtract`	a - b
`multiply`	a * b
`divide`	a / b, returning 0 when b is 0
`absolute_difference`	abs(a - b) via a branch
`in_range`	true when lo ≤ x ≤ hi
`clamp`	clamp x into [lo, hi]
`sum_range`	sum 0..n in a counted loop
`max3`	max of three ints
`signum`	-1, 0, or 1

Only the mutation operators that exist in both tools are enabled (arithmetic, conditional, return-value). Both sides run cold, in separate JVMs, so neither piggybacks on the other's JIT. The whole thing runs from a single pinned reproduction script.
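The cold-run idea can be sketched as a small timing helper that starts a new operating-system process for every measurement. This is not the pinned reproduction script, which is not shown here. The `time_cold_run` and `measure` names are hypothetical, and the command below is a placeholder that launches a throwaway Python process rather than either mutation tool.

```python
import subprocess
import sys
import time
from typing import List, Sequence

def time_cold_run(command: Sequence[str]) -> float:
    """Wall time in milliseconds for one fresh process running `command`."""
    start = time.perf_counter()
    subprocess.run(
        command,
        check=True,  # raise if the tool exits with a non-zero status
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return (time.perf_counter() - start) * 1000.0

def measure(command: Sequence[str], runs: int = 3) -> List[float]:
    """Repeat the cold run; each iteration starts a brand-new process."""
    return [time_cold_run(command) for _ in range(runs)]

if __name__ == "__main__":
    # Placeholder command; substitute the real tool invocation.
    samples = measure([sys.executable, "-c", "pass"], runs=3)
    print([f"{ms:.0f} ms" for ms in samples])
```

Because no process is reused, no run inherits warmed-up JIT state from a previous one. That is the property the comparison depends on.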

It's small on purpose. A Java port of Frugal would be thousands of lines on each side just to confirm a ratio, and that's not a maintenance burden worth taking on.

The numbers

On the current PIT-style comparison fixture, PIT measures 6023 milliseconds and Vary measures 6552 milliseconds, giving a PIT-to-Vary wall-time ratio of 0.92x. In other words, Vary takes about 9 percent longer than PIT on this fixture. That falls within the accepted comparison band and is recorded as a passing result.

Workload	Tool	Wall time
PIT-style overlap fixture	PIT	6023 ms
PIT-style overlap fixture	Vary	6552 ms

Comparison	Result
PIT wall time / Vary wall time	0.92x
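The direction of the ratio is easy to misread, so here is the arithmetic spelled out in Python using the two wall times from the table. The variable names are illustrative.

```python
pit_ms = 6023
vary_ms = 6552

# The reported ratio puts PIT in the numerator.
pit_over_vary = pit_ms / vary_ms
# The inverse reads as "Vary takes this many times PIT's wall time".
vary_over_pit = vary_ms / pit_ms
extra_percent = (vary_over_pit - 1.0) * 100.0

if __name__ == "__main__":
    print(f"PIT / Vary: {pit_over_vary:.2f}x")
    print(f"Vary / PIT: {vary_over_pit:.2f}x")
    print(f"Vary is {extra_percent:.1f}% slower on this fixture")
```

With PIT in the numerator, a value below 1.0 means Vary took longer. A value of exactly 1.0 would mean parity.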

And the same picture as bars, where the point is how close the two are:

PIT-style fixture wall time (lower is better)

PIT   ██████████████████████████████████████████████      6023 ms
Vary  ██████████████████████████████████████████████████  6552 ms

This benchmark is intentionally smaller and narrower than Frugal. It isn't trying to stand in for an entire application. It's trying to answer whether, on a carefully matched overlap fixture, Vary is operating in the same neighbourhood as a well-known JVM mutation tool. The answer right now is yes. That doesn't erase the differences between the two systems, but it does mean the comparison is meaningful rather than aspirational.

What the benchmarks mean

Frugal and the PIT-style fixture are useful precisely because they answer different questions. Frugal shows what mutation testing looks like on demanding real code, which is exactly where weak test suites tend to get exposed. The PIT-style fixture normalizes scope so the result says something concrete about tool performance instead of dissolving into a vague claim.

Put it all together: Frugal at 27 seconds with an 82.18 percent score over 1,857 mutants, about 2.5x faster than the AST reference, and a PIT-to-Vary wall-time ratio of 0.92x on a matched fixture. That's a more useful answer than the usual "mutation testing is fast" handwave, and it's the version we're willing to defend in public.