Strict mode makes mutation testing fast enough to run in tight loops without giving up the correctness guarantees of a fresh-compile run. It combines four optimizations: per-test coverage traces, evidence-based test selection, kill-first scheduling, and long-lived warm workers. Each optimization has a reference mode that disables it, and a parity benchmark verifies that every mode produces the same kill/survive verdicts as the unoptimized baseline.
For the mutation testing basics, see Introduction. For the benchmark harness that validates strict mode, see Benchmark.
| Flag | Values | Default | Purpose |
|---|---|---|---|
| --strict-selection | evidence, reference | evidence | Test selection strategy |
| --reachability | flag | off | Record per-test method reachability during baseline |
| --warm-workers | on, off | on | Worker lifecycle mode |
| --fresh-workers | flag | off | Alias for --warm-workers=off |
| --backend | fresh-loader, hot-swap, redefine | fresh-loader | Bytecode mutation backend |
| --relevance-graph-path | path | .vary/relevance/ | Custom location for the persisted relevance graph |
| --explain | flag | off | Include selection and scheduling explanation in the report |
```bash
# Default strict mode: evidence-based selection + warm workers
vary mutate src.vary --level bytecode --reachability

# Reference mode: all tests, fresh workers (for parity checks)
vary mutate src.vary --level bytecode --strict-selection=reference --warm-workers=off

# Full explainability in the JSON report
vary mutate src.vary --level bytecode --reachability --explain --output json
```
When --strict-selection=evidence is active and --reachability is enabled, the engine records which test methods reach which source methods during the baseline pass. For each mutant, it builds a candidate test set from only those tests whose coverage trace intersects the mutated method.
Tests with incomplete traces (those that crashed or timed out during baseline) are always included in every candidate set as a conservative fallback, so missing trace data can never cause a kill to be missed.
Setting --strict-selection=reference disables this optimization and runs all available tests against every mutant. This is the correctness baseline that evidence mode is verified against by the parity benchmark.
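The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Vary's implementation: the `BaselineTrace` class, the `select_candidates` function, and the method names such as `Calc.add` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BaselineTrace:
    """Baseline trace for one test (hypothetical shape, not Vary's schema)."""
    name: str
    reached_methods: frozenset = field(default_factory=frozenset)
    complete: bool = True  # False if the test crashed or timed out in baseline


def select_candidates(traces, mutated_method):
    """Return the names of tests to run against a mutant of `mutated_method`."""
    selected = []
    for trace in traces:
        if not trace.complete:
            # Conservative fallback: incomplete traces are always included.
            selected.append(trace.name)
        elif mutated_method in trace.reached_methods:
            # The trace intersects the mutated method.
            selected.append(trace.name)
    return selected


traces = [
    BaselineTrace("test_add", frozenset({"Calc.add"})),
    BaselineTrace("test_sub", frozenset({"Calc.sub"})),
    BaselineTrace("test_flaky", complete=False),
]
candidates = select_candidates(traces, "Calc.add")
print(candidates)
```

In this sketch a mutant in a method that no complete trace reaches still gets the incomplete-trace tests, which mirrors the conservative fallback.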
Within each candidate set, tests are reordered to maximize early kills.
| Priority | Source | Description |
|---|---|---|
| 1 | Historical killer | The test that killed this mutant in a prior run (from the mutation history artifact) |
| 2 | Current-run kill count | Tests with more kills so far in the current run |
| 3 | Historical kill count | Tests with higher per-test kill counts from prior runs |
| 4 | Natural order | Remaining tests run in declaration order |
The scheduler never removes tests from the candidate set; it only reorders them. When a test kills the mutant, the remaining tests are skipped (early exit). The testsRunUntilKill metric records how many tests ran before the kill, and appears in --explain output.
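One way to read the priority table is as a lexicographic sort key followed by an early-exit loop. The Python sketch below takes that reading as an assumption; the function names, the input dictionaries, and the value returned when no test kills the mutant are hypothetical and not taken from Vary.

```python
def order_tests(candidates, historical_killer, current_kills, historical_kills):
    """Reorder candidates by the four priorities; never drops a test."""
    def sort_key(indexed):
        position, name = indexed
        return (
            0 if name == historical_killer else 1,  # 1. historical killer first
            -current_kills.get(name, 0),            # 2. more kills in this run
            -historical_kills.get(name, 0),         # 3. more kills in prior runs
            position,                               # 4. declaration order
        )
    return [name for _, name in sorted(enumerate(candidates), key=sort_key)]


def run_until_kill(ordered, kills_mutant):
    """Run tests in order and stop at the first kill (early exit)."""
    for tests_run, name in enumerate(ordered, start=1):
        if kills_mutant(name):
            return name, tests_run
    return None, len(ordered)  # survived: every candidate ran


candidates = ["test_a", "test_b", "test_c", "test_d"]
ordered = order_tests(
    candidates,
    historical_killer="test_c",
    current_kills={"test_d": 2},
    historical_kills={"test_b": 5, "test_a": 1},
)
killer, tests_run_until_kill = run_until_kill(ordered, lambda name: name == "test_d")
print(ordered, killer, tests_run_until_kill)
```

Because the key only sorts, the output is always a permutation of the candidate set, which is the property the scheduler guarantees.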
When --warm-workers=on (the default), mutation workers are long-lived: each worker processes a batch of mutants before being recycled. Between mutants the worker runs a reset sequence that clears mutable singletons (runtime state, observer traces, reachability traces, coverage, deterministic clocks). After each reset a health check inspects singleton state; if any leak is detected the worker is immediately recycled with a fresh module resolver.
Setting --warm-workers=off (or --fresh-workers) creates a fresh worker for every mutant. This is slower but provides a correctness reference. Both modes produce identical kill/survive classifications, verified by parity tests.
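The reset, health check, and recycle cycle can be modelled roughly as follows. This Python sketch is a toy model under stated assumptions: the state dictionary stands in for the mutable singletons, replacing it stands in for recycling the worker, and `WarmWorker`, `fresh_state`, and the `rogue_global` key are hypothetical names.

```python
def fresh_state():
    """State of a brand-new worker (stand-in for the mutable singletons)."""
    return {
        "runtime": {},
        "observer_traces": [],
        "reachability_traces": [],
        "coverage": set(),
        "clock": 0,
    }


class WarmWorker:
    def __init__(self):
        self.state = fresh_state()
        self.recycles = 0

    def reset(self):
        """Clear the singletons the reset sequence knows about."""
        for key in ("runtime", "observer_traces", "reachability_traces", "coverage"):
            self.state[key].clear()
        self.state["clock"] = 0

    def healthy(self):
        """After a reset, state must be indistinguishable from a fresh worker."""
        return self.state == fresh_state()

    def run_mutant(self, mutant):
        mutant(self.state)  # stands in for running the candidate tests
        self.reset()
        if not self.healthy():
            # Leak detected: recycle instead of reusing contaminated state.
            self.state = fresh_state()
            self.recycles += 1


worker = WarmWorker()
worker.run_mutant(lambda state: state["coverage"].add("Calc.add"))  # well behaved
worker.run_mutant(lambda state: state.update(rogue_global=1))       # leaks state
print(worker.recycles, worker.healthy())
```

The second mutant writes state that the reset sequence does not know how to clear, so the health check fails and the worker is replaced before the next mutant runs.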
The --backend flag selects how mutated bytecode is loaded for testing.
| Backend | Behaviour |
|---|---|
| fresh-loader (default) | Each mutant is loaded with a fresh class loader. This is the correctness reference; every other backend is validated against it. |
| hot-swap | Eligible mutants are applied by patching only the mutated method body in an already-loaded class. Ineligible mutants (for example those that change class structure, constructors, or <clinit>) automatically fall back to fresh-loader. |
| redefine | Uses java.lang.instrument.Instrumentation.redefineClasses to swap method bodies in a warm classloader. Requires an attached instrumentation agent; when none is available, every mutant routes through fresh-loader fallback. Classification parity against fresh-loader is enforced by the parity gate. |
Hot-swap and redefine are both performance optimisations. The swapOutcome field in results tracks which path was taken: "hotswap", "redefine", "ineligible:<reason>", "fallback:<reason>", or "fresh-loader".
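To see how often the fast path was actually taken, the swapOutcome values can be tallied by category. The Python sketch below assumes the values have already been extracted from the results into a list of strings; the reason suffixes in the sample data (`constructor-change`, `no-agent`) are invented for illustration and are not Vary's real reason codes.

```python
from collections import Counter


def outcome_category(swap_outcome):
    """Strip any ':<reason>' suffix, e.g. 'fallback:xyz' -> 'fallback'."""
    return swap_outcome.split(":", 1)[0]


def summarize(outcomes):
    """Count outcomes per category and compute the share that avoided a fresh loader."""
    counts = Counter(outcome_category(outcome) for outcome in outcomes)
    fast = counts["hotswap"] + counts["redefine"]
    fast_ratio = (fast / len(outcomes)) if outcomes else 0.0
    return counts, fast_ratio


outcomes = [
    "hotswap",
    "hotswap",
    "redefine",
    "ineligible:constructor-change",  # hypothetical reason string
    "fallback:no-agent",              # hypothetical reason string
    "fresh-loader",
]
counts, fast_ratio = summarize(outcomes)
print(dict(counts), fast_ratio)
```

A low ratio suggests that most mutants are ineligible or falling back, in which case the optimised backend is buying little over fresh-loader.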
Worker-level poison recovery (wall-clock escape hatch, state-leak detection, mid-run classloader retire/recreate) is applied to both bytecode backends so a single misbehaving mutant cannot contaminate the rest of the queue.
Strict-mode mutation produces several artifacts:
| Artifact | Location | Purpose |
|---|---|---|
| Relevance graph | .vary/relevance/ | Persisted test-to-method coverage map used for evidence-based selection |
| Mutation history | .vary/history/ | Append-only record of prior kill/survive outcomes used to seed kill-first scheduling |
| Benchmark report | .vary-logs/strict-benchmark.json | Cross-mode parity report from vary benchmark |
Both the relevance graph and mutation history are content-addressed and strict about compatibility: a cached artifact is rejected if the source hash, compiler version, or schema version does not match the current run. Rejected artifacts are dropped cleanly and the run falls back to cold ordering.
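The compatibility rule amounts to comparing a three-part key before trusting a cached artifact. The Python sketch below illustrates that check only; the `ArtifactKey` class, the dictionary layout of the cached artifact, and the use of SHA-256 for the source hash are assumptions, not details of Vary's on-disk format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactKey:
    """Hypothetical compatibility key for a cached artifact."""
    source_hash: str
    compiler_version: str
    schema_version: int


def hash_source(source_text):
    # SHA-256 is an assumption; the text only says a source hash is compared.
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()


def load_if_compatible(cached, current_key):
    """Return the cached payload, or None if any part of the key differs."""
    if cached is None or cached["key"] != current_key:
        return None  # rejected: the caller falls back to cold ordering
    return cached["payload"]


old_key = ArtifactKey(hash_source("fn add(a, b) = a + b"), "1.4.0", 2)
cached = {"key": old_key, "payload": {"test_add": ["Calc.add"]}}

same_run = load_if_compatible(cached, old_key)
edited_key = ArtifactKey(hash_source("fn add(a, b) = a - b"), "1.4.0", 2)
after_edit = load_if_compatible(cached, edited_key)
print(same_run, after_edit)
```

Comparing the whole key at once means a change to any single component is enough to reject the artifact.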
With --explain, each mutant in the JSON report includes a selectionExplanation object:
```json
{
  "selectionStrategy": "coverage",
  "schedulingStrategy": "kill-first",
  "candidateTests": 3,
  "totalAvailableTests": 8,
  "testsRunUntilKill": 1,
  "testExecutionOrder": [
    {"testName": "test_add", "position": 1, "killed": true, "provenance": "Coverage", "orderingBasis": "historical-killer"},
    {"testName": "test_sub", "position": 2, "killed": false, "provenance": "Coverage", "orderingBasis": "default"}
  ],
  "activeProvenanceTypes": ["Coverage"]
}
```
Each test entry includes provenance (why it was selected: Coverage, PreviousKiller, Fallback, ...) and orderingBasis (why it was placed at this position: historical-killer, kill-first, historical-counts, default). The top-level activeProvenanceTypes lists all provenance types that contributed to the candidate set.
Use this to understand why a mutant survived or to debug selection coverage gaps.
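A short script can turn a selectionExplanation object into the numbers that usually matter when debugging. The Python sketch below reads only the fields shown in the example object; the `describe_selection` function and the keys of its result are hypothetical, and how the object is located inside the full JSON report is left out because that layout is not described here.

```python
import json

# The example selectionExplanation object, embedded as a string.
EXPLANATION_JSON = """
{
  "selectionStrategy": "coverage",
  "schedulingStrategy": "kill-first",
  "candidateTests": 3,
  "totalAvailableTests": 8,
  "testsRunUntilKill": 1,
  "testExecutionOrder": [
    {"testName": "test_add", "position": 1, "killed": true,
     "provenance": "Coverage", "orderingBasis": "historical-killer"},
    {"testName": "test_sub", "position": 2, "killed": false,
     "provenance": "Coverage", "orderingBasis": "default"}
  ],
  "activeProvenanceTypes": ["Coverage"]
}
"""


def describe_selection(explanation):
    """Summarize how selection and scheduling treated one mutant."""
    order = explanation["testExecutionOrder"]
    killers = [entry["testName"] for entry in order if entry["killed"]]
    killed = bool(killers)
    candidate_count = explanation["candidateTests"]
    return {
        "killed": killed,
        "killer": killers[0] if killed else None,
        # Tests that selection never considered for this mutant.
        "excluded_by_selection": explanation["totalAvailableTests"] - candidate_count,
        # Candidates that did not run because an earlier test killed the mutant.
        "skipped_by_early_exit": (
            (candidate_count - explanation["testsRunUntilKill"]) if killed else 0
        ),
    }


summary = describe_selection(json.loads(EXPLANATION_JSON))
print(summary)
```

For a surviving mutant, a large `excluded_by_selection` value is a hint to check whether a test that should have reached the mutated method is missing from the relevance graph.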