Strict mode

Strict mode makes mutation testing fast enough to run in tight loops without giving up the correctness guarantees of a fresh-compile run. It combines four optimizations: per-test coverage traces, evidence-based test selection, kill-first scheduling, and long-lived warm workers. Each optimization has a reference mode that disables it, and a parity benchmark verifies that every mode produces the same kill/survive verdicts as the unoptimized baseline.

For the mutation testing basics, see Introduction. For the benchmark harness that validates strict mode, see Benchmark.

CLI flags

| Flag | Values | Default | Purpose |
| --- | --- | --- | --- |
| --strict-selection | evidence, reference | evidence | Test selection strategy |
| --reachability | flag | off | Record per-test method reachability during baseline |
| --warm-workers | on, off | on | Worker lifecycle mode |
| --fresh-workers | flag | off | Alias for --warm-workers=off |
| --backend | fresh-loader, hot-swap, redefine | fresh-loader | Bytecode mutation backend |
| --relevance-graph-path | path | .vary/relevance/ | Custom location for the persisted relevance graph |
| --explain | flag | off | Include selection and scheduling explanation in the report |

Quick start

# Default strict mode: evidence-based selection + warm workers
vary mutate src.vary --level bytecode --reachability

# Reference mode: all tests, fresh workers (for parity checks)
vary mutate src.vary --level bytecode --strict-selection=reference --warm-workers=off

# Full explainability in the JSON report
vary mutate src.vary --level bytecode --reachability --explain --output json

Evidence-based test selection

When --strict-selection=evidence is active and --reachability is enabled, the engine records which test methods reach which source methods during the baseline pass. For each mutant, it builds the candidate test set from only those tests whose recorded trace includes the mutated method.

Tests with incomplete traces (crashed or timed out during baseline) are always included in every candidate set as a conservative fallback. No kill is missed because of missing trace data.

Setting --strict-selection=reference disables this optimization and runs all available tests against every mutant. This is the correctness baseline that evidence mode is verified against by the parity benchmark.
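The selection rule described in this section can be sketched in a few lines of Python. This is an illustrative model, not the engine's implementation: BaselineTrace, select_candidates, and method names such as Calc.add are hypothetical, and the real trace format is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class BaselineTrace:
    """Hypothetical per-test record produced by the baseline pass."""
    test_name: str
    reached_methods: set = field(default_factory=set)
    complete: bool = True  # False if the test crashed or timed out


def select_candidates(traces, mutated_method, strategy="evidence"):
    """Return the tests to run against one mutant, in declaration order."""
    if strategy == "reference":
        # Reference mode: no filtering, every test runs against every mutant.
        return [t.test_name for t in traces]
    return [
        t.test_name
        for t in traces
        # Incomplete traces are always kept as a conservative fallback.
        if not t.complete or mutated_method in t.reached_methods
    ]


traces = [
    BaselineTrace("test_add", {"Calc.add"}),
    BaselineTrace("test_sub", {"Calc.sub"}),
    BaselineTrace("test_flaky", set(), complete=False),
]
evidence = select_candidates(traces, "Calc.add")
reference = select_candidates(traces, "Calc.add", strategy="reference")
print(evidence)   # ['test_add', 'test_flaky']
print(reference)  # ['test_add', 'test_sub', 'test_flaky']
```

The evidence set is always a subset of the reference set, which is why the two modes can only disagree if a relevant test was wrongly filtered out.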

Kill-first scheduling

Within each candidate set, tests are reordered to maximize early kills.

| Priority | Source | Description |
| --- | --- | --- |
| 1 | Historical killer | The test that killed this mutant in a prior run (from the mutation history artifact) |
| 2 | Current-run kill count | Tests with more kills so far in the current run |
| 3 | Historical kill count | Tests with higher per-test kill counts from prior runs |
| 4 | Natural order | Remaining tests run in declaration order |

The scheduler never removes tests from the candidate set; it only reorders them. When a test kills the mutant, the remaining tests are skipped (early exit). The testsRunUntilKill metric records how many tests ran up to and including the killing test, and appears in --explain output.
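One way to model the priority table is a lexicographic sort key followed by an early-exit loop. The sketch below assumes each priority only breaks ties left by the one above it, which the table suggests but does not state; schedule, run_until_kill, and the kill-count dictionaries are hypothetical names, and the value reported for a surviving mutant is likewise an assumption.

```python
def schedule(candidates, historical_killer=None,
             current_kills=None, historical_kills=None):
    """Reorder candidates for early kills; never adds or removes a test."""
    current_kills = current_kills or {}
    historical_kills = historical_kills or {}

    def priority(indexed):
        position, name = indexed
        return (
            0 if name == historical_killer else 1,  # 1: historical killer first
            -current_kills.get(name, 0),            # 2: more kills in this run
            -historical_kills.get(name, 0),         # 3: more kills in prior runs
            position,                               # 4: declaration order
        )

    return [name for _, name in sorted(enumerate(candidates), key=priority)]


def run_until_kill(ordered_tests, kills_mutant):
    """Run tests in order and stop at the first kill (early exit)."""
    for tests_run, name in enumerate(ordered_tests, start=1):
        if kills_mutant(name):
            return name, tests_run
    # Assumption: for a survivor, report that every candidate ran.
    return None, len(ordered_tests)


order = schedule(
    ["test_a", "test_b", "test_c", "test_d"],
    historical_killer="test_c",
    current_kills={"test_d": 2},
    historical_kills={"test_b": 5},
)
print(order)  # ['test_c', 'test_d', 'test_b', 'test_a']
print(run_until_kill(order, lambda name: name == "test_d"))  # ('test_d', 2)
```

Because the key only reorders, the set of tests that can kill a mutant is unchanged; only the number of tests executed before the kill varies.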

Warm workers

When --warm-workers=on (the default), mutation workers are long-lived: each worker processes a batch of mutants before being recycled. Between mutants the worker runs a reset sequence that clears mutable singletons (runtime state, observer traces, reachability traces, coverage, deterministic clocks). After each reset a health check inspects singleton state; if any leak is detected the worker is immediately recycled with a fresh module resolver.

Setting --warm-workers=off (or --fresh-workers) creates a fresh worker for every mutant. This is slower but provides a correctness reference. Both modes produce identical kill/survive classifications, verified by parity tests.
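The two lifecycles, and the parity property between them, can be illustrated with a toy model. Everything here is a hypothetical stand-in: Worker, run_queue, and the "leaky"/"equiv" naming convention exist only for the example, and batch-size recycling is omitted for brevity.

```python
class Worker:
    """Toy worker whose dictionaries stand in for mutable singletons."""

    def __init__(self):
        self.singletons = {}  # state that reset() is able to clear
        self.pinned = []      # state that survives reset(), i.e. a leak

    def run(self, mutant):
        self.singletons["last_mutant"] = mutant
        if mutant.startswith("leaky"):
            self.pinned.append(mutant)  # simulate a state leak
        return "survived" if mutant.startswith("equiv") else "killed"

    def reset(self):
        self.singletons.clear()

    def is_clean(self):
        return not self.singletons and not self.pinned


def run_queue(mutants, warm=True):
    """Return (verdicts, recycles) for a queue of mutants."""
    worker, recycles, verdicts = Worker(), 0, {}
    for mutant in mutants:
        if not warm:
            worker = Worker()  # fresh worker for every mutant
        verdicts[mutant] = worker.run(mutant)
        if warm:
            worker.reset()
            if not worker.is_clean():  # health check after each reset
                worker = Worker()      # leak detected: recycle immediately
                recycles += 1
    return verdicts, recycles


queue = ["m1", "leaky-m2", "equiv-m3"]
warm_verdicts, warm_recycles = run_queue(queue, warm=True)
fresh_verdicts, _ = run_queue(queue, warm=False)
print(warm_recycles)                    # 1
print(warm_verdicts == fresh_verdicts)  # True
```

The final comparison is the parity property in miniature: warm and fresh runs must classify every mutant identically.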

Bytecode backends

The --backend flag selects how mutated bytecode is loaded for testing.

| Backend | Behaviour |
| --- | --- |
| fresh-loader (default) | Each mutant is loaded with a fresh class loader. This is the correctness reference; every other backend is validated against it. |
| hot-swap | Eligible mutants are applied by patching only the mutated method body in an already-loaded class. Ineligible mutants (for example those that change class structure, constructors, or <clinit>) automatically fall back to fresh-loader. |
| redefine | Uses java.lang.instrument.Instrumentation.redefineClasses to swap method bodies in a warm classloader. Requires an attached instrumentation agent; when none is available, every mutant routes through fresh-loader fallback. Classification parity against fresh-loader is enforced by the parity gate. |

Hot-swap and redefine are both performance optimizations. The swapOutcome field in results tracks which path was taken: "hotswap", "redefine", "ineligible:<reason>", "fallback:<reason>", or "fresh-loader".
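To see how often the fast path was actually taken, the swapOutcome values can be tallied by path. The sketch below assumes you have already extracted the values into a list; the reason suffixes in the sample ("clinit", "no-agent") are invented for illustration and are not documented reason codes.

```python
from collections import Counter


def summarize_swap_outcomes(outcomes):
    """Count outcomes by path, folding '<path>:<reason>' into '<path>'."""
    return Counter(outcome.split(":", 1)[0] for outcome in outcomes)


# Hypothetical sample; the reason suffixes are made up for this example.
sample = ["hotswap", "hotswap", "ineligible:clinit",
          "fallback:no-agent", "fresh-loader"]
summary = summarize_swap_outcomes(sample)
print(summary["hotswap"])     # 2
print(summary["ineligible"])  # 1
```

A high share of ineligible or fallback outcomes suggests the selected backend is giving little speedup for that code base.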

Worker-level poison recovery (wall-clock escape hatch, state-leak detection, mid-run classloader retire/recreate) is applied to both bytecode backends so a single misbehaving mutant cannot contaminate the rest of the queue.

Artifacts

Strict-mode mutation produces several artifacts:

| Artifact | Location | Purpose |
| --- | --- | --- |
| Relevance graph | .vary/relevance/ | Persisted test-to-method coverage map used for evidence-based selection |
| Mutation history | .vary/history/ | Append-only record of prior kill/survive outcomes used to seed kill-first scheduling |
| Benchmark report | .vary-logs/strict-benchmark.json | Cross-mode parity report from vary benchmark |

Both the relevance graph and mutation history are content-addressed and strict about compatibility: a cached artifact is rejected if the source hash, compiler version, or schema version does not match the current run. Rejected artifacts are dropped cleanly and the run falls back to cold ordering.
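The compatibility rule amounts to an all-or-nothing comparison of three fields. The following is a minimal sketch under that reading; ArtifactKey, load_if_compatible, and the sample values are hypothetical, and the on-disk format of the artifacts is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactKey:
    """Hypothetical compatibility key stored alongside a cached artifact."""
    source_hash: str
    compiler_version: str
    schema_version: int


def load_if_compatible(cached_key, cached_payload, current_key):
    """Return the cached payload only if all three fields match, else None."""
    if cached_key != current_key:
        return None  # rejected: the caller falls back to cold ordering
    return cached_payload


current = ArtifactKey("abc123", "1.4.0", 2)
stale = ArtifactKey("abc123", "1.3.9", 2)  # older compiler version
history = {"mutant-17": "test_add"}

print(load_if_compatible(current, history, current))  # {'mutant-17': 'test_add'}
print(load_if_compatible(stale, history, current))    # None
```

Rejecting on any mismatch trades some cache hits for the guarantee that stale evidence never influences selection or ordering.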

Explainability

With --explain, each mutant in the JSON report includes a selectionExplanation object:

{
  "selectionStrategy": "coverage",
  "schedulingStrategy": "kill-first",
  "candidateTests": 3,
  "totalAvailableTests": 8,
  "testsRunUntilKill": 1,
  "testExecutionOrder": [
    {"testName": "test_add", "position": 1, "killed": true, "provenance": "Coverage", "orderingBasis": "historical-killer"},
    {"testName": "test_sub", "position": 2, "killed": false, "provenance": "Coverage", "orderingBasis": "default"}
  ],
  "activeProvenanceTypes": ["Coverage"]
}

Each test entry includes provenance (why it was selected: Coverage, PreviousKiller, Fallback, ...) and orderingBasis (why it was placed at this position: historical-killer, kill-first, historical-counts, default). The top-level activeProvenanceTypes lists all provenance types that contributed to the candidate set.

Use this to understand why a mutant survived or to debug selection coverage gaps.
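As a starting point for that kind of debugging, the sketch below condenses one selectionExplanation object into a few facts. It relies only on the field names shown in the example above; the summarize helper is hypothetical, and where the object sits inside the full report is not assumed.

```python
import json

# Sample selectionExplanation object, copied from the example above.
SAMPLE = """
{
  "selectionStrategy": "coverage",
  "schedulingStrategy": "kill-first",
  "candidateTests": 3,
  "totalAvailableTests": 8,
  "testsRunUntilKill": 1,
  "testExecutionOrder": [
    {"testName": "test_add", "position": 1, "killed": true,
     "provenance": "Coverage", "orderingBasis": "historical-killer"},
    {"testName": "test_sub", "position": 2, "killed": false,
     "provenance": "Coverage", "orderingBasis": "default"}
  ],
  "activeProvenanceTypes": ["Coverage"]
}
"""


def summarize(explanation):
    """Condense one selectionExplanation object into a few debugging facts."""
    order = explanation["testExecutionOrder"]
    return {
        # First test that killed the mutant, or None if none is recorded.
        "killer": next((t["testName"] for t in order if t["killed"]), None),
        # Tests that selection filtered out before scheduling.
        "filtered_out": explanation["totalAvailableTests"]
        - explanation["candidateTests"],
        # Tests included only through the conservative fallback.
        "fallback_tests": [t["testName"] for t in order
                           if t["provenance"] == "Fallback"],
    }


report_summary = summarize(json.loads(SAMPLE))
print(report_summary)
```

For a surviving mutant, a large filtered_out count is the first thing to check: it means most tests never ran against the mutant, so a gap in the recorded traces is a plausible cause.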
