Alpha. Vary is under active development and not ready for production use. Syntax, APIs, performance, and behaviour may change between releases.

Benchmark

The strict-mode benchmark validates that optimized mutation testing modes produce the same mutant classifications as the unoptimized reference mode. A mutant classified as killed by one mode must be classified as killed by every other mode. "Strict" here means parity-checked: the benchmark does not merely measure throughput, it asserts that every optimization preserves correctness.

For details on the optimizations the benchmark checks, see Strict mode.

Quick start

vary benchmark

This runs every mode against the built-in fixture set and writes a JSON artifact to .vary-logs/strict-benchmark.json. The command exits non-zero if any mode classifies a mutant differently from the reference baseline.

Modes

The benchmark supports three primary modes that form the parity triangle:

| Mode | Label | Test selection | Workers | Purpose |
|------|-------|----------------|---------|---------|
| Reference | reference | All tests | Fresh per mutant | Known-good baseline; no optimizations |
| Evidence | evidence | Coverage-guided | Fresh per mutant | Validates test selection preserves classifications |
| Evidence+Warm | evidence+warm | Coverage-guided | Warm (batched) | Validates both test selection and warm workers |

Two additional backend modes (fresh-loader and hotswap), along with the backend-parity composite mode, are available for backend-specific benchmarking. Use --mode all (the default) to run every mode.

Reference mode

Runs every available test against every mutant using a fresh class loader per mutant. No coverage-guided selection, no test reordering, no worker reuse. This is the correctness baseline every other mode is compared against.

Evidence mode

Enables coverage-guided test selection via reachability instrumentation. Only tests whose coverage trace intersects the mutated method are run. Workers are still fresh (one per mutant). If evidence mode kills or survives a mutant differently from reference, the test selection logic has a bug.
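The selection step can be sketched as a simple intersection test. This is an illustrative sketch, not Vary's actual data model: the trace format, test names, and method names below are hypothetical.

```python
def select_tests(mutated_method, coverage_traces):
    """Return only the tests whose recorded coverage reaches the mutated method.

    coverage_traces maps test name -> set of methods that test executed.
    A test that never reaches the mutated method cannot kill the mutant,
    so skipping it cannot change the classification.
    """
    return [test for test, covered in coverage_traces.items()
            if mutated_method in covered]


# Hypothetical coverage traces from a reachability-instrumented run.
traces = {
    "testAdd":      {"Calculator.add", "Calculator.render"},
    "testSubtract": {"Calculator.subtract"},
    "testRender":   {"Calculator.render"},
}

# Only tests whose trace intersects Calculator.add run against this mutant.
selected = select_tests("Calculator.add", traces)
```

If this filter ever drops a test that would have killed the mutant, the mutant survives in evidence mode but not in reference mode, and the parity check catches it.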

Evidence+warm mode

Enables both coverage-guided test selection and warm workers. Before the timed run, a warmup pass collects kill history and per-test kill counts. The benchmark then uses seeded kill-first scheduling with warm workers that batch multiple mutants before recycling. If this mode diverges from reference, either test selection or warm-worker state leakage is the cause.
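Kill-first scheduling can be sketched as a stable sort on the warmup statistics. This is a minimal illustration under assumed names, not Vary's scheduler:

```python
def kill_first_order(tests, kill_counts):
    """Order tests so historically strong killers run first.

    kill_counts maps test name -> number of mutants that test killed
    during the warmup pass; ties keep the original order (stable sort).
    Running likely killers first lets a killed mutant short-circuit
    before the long tail of weaker tests executes.
    """
    return sorted(tests, key=lambda t: kill_counts.get(t, 0), reverse=True)


# Hypothetical kill history collected by the warmup pass.
history = {"testBoundary": 14, "testHappyPath": 2, "testRender": 0}
ordered = kill_first_order(["testHappyPath", "testRender", "testBoundary"], history)
```

The ordering only affects how quickly a kill is found, never whether one is found, which is exactly what the parity check asserts.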

What it measures

| Metric | Description |
|--------|-------------|
| Wall time (ms) | End-to-end duration per mode |
| Mutants/sec | Throughput: total mutants / wall time |
| Median tests-run-per-mutant | Median number of tests executed per mutant |
| Mutation score (%) | killed / (total - errors) × 100 |
| Fallback counts | Hot-swap path breakdown: hotswap, ineligible, fallback |
| Worker reset counts | Total resets and retired workers (warm modes) |
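The mutation-score formula from the table can be checked with a toy calculation (the numbers are illustrative only):

```python
def mutation_score(killed, total, errors):
    """Mutation score (%) = killed / (total - errors) * 100.

    Errored mutants are excluded from the denominator because they
    produced neither a kill nor a survival.
    """
    return killed / (total - errors) * 100


# 45 killed out of 50 mutants, 2 of which errored: 45 / 48 * 100 = 93.75
score = mutation_score(killed=45, total=50, errors=2)
```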

Parity validation

The benchmark fails if any of these conditions are detected.

| Condition | Meaning |
|-----------|---------|
| Classification mismatch across modes | A mutant classified as killed in one mode but survived in another. This is the primary parity check. |
| Killed with empty killedBy | A mutant marked killed but no test recorded as the killer. |
| Flaky classification | A mutant whose outcome differs across identical runs of the same mode. |

Classification mismatches are the strictest check: they prove that an optimization changed the result, not just the speed.
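The primary check amounts to comparing every mode's verdicts against the reference baseline. A minimal sketch, assuming a hypothetical results shape (mode name mapped to per-mutant verdicts):

```python
def find_mismatches(results):
    """Flag mutants whose classification differs from the reference mode.

    results maps mode name -> {mutant id: "killed" or "survived"}.
    The reference mode is the baseline; any other mode disagreeing on
    any mutant is a parity failure.
    """
    baseline = results["reference"]
    mismatches = []
    for mode, classifications in results.items():
        if mode == "reference":
            continue
        for mutant, verdict in classifications.items():
            if verdict != baseline[mutant]:
                mismatches.append((mutant, mode, verdict, baseline[mutant]))
    return mismatches


runs = {
    "reference":     {"m1": "killed",   "m2": "survived"},
    "evidence":      {"m1": "killed",   "m2": "survived"},
    "evidence+warm": {"m1": "survived", "m2": "survived"},  # diverges on m1
}

# A non-empty mismatch list is what makes the benchmark exit non-zero.
bad = find_mismatches(runs)
```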

Options

vary benchmark                               # all modes, built-in fixtures
vary benchmark --fixture-dir tests/mutation/ # custom fixtures
vary benchmark --output-dir ./results        # custom output location
vary benchmark --runs 5                      # multiple iterations
vary benchmark --mode reference              # single mode
vary benchmark --mode evidence+warm          # single mode

| Flag | Default | Description |
|------|---------|-------------|
| --mode <name> | all | Mode to run: reference, evidence, evidence+warm, fresh-loader, hotswap, backend-parity, or all |
| --fixture-dir <path> | built-in fixtures | Fixture directory to benchmark against |
| --output-dir <path> | .vary-logs/ | Where to write the benchmark artifact |
| --runs <n> | 1 | Iterations per mode |

Continuous integration

The benchmark runs in the nightly workflow against the project's fixture set. On failure, the benchmark artifact and the structured mismatch report are uploaded so a regression can be inspected without re-running the job. Use the benchmark as a regression gate whenever you touch strict-mode internals.

Pinned benchmark matrix

The bytecode mutation performance wave measures progress against a fixed set of workloads rather than ad-hoc commands. Each workload pins a fixture, an exact command line, an artifact path, the telemetry columns the run must emit, and whether it belongs to the short (< 5 minute) class or the long acceptance class.

| Workload | Class | Purpose |
|----------|-------|---------|
| W1: Cold single-file compiled mutation | short | Baseline per-mutant cost on the hot path |
| W2: Warm-repeat single-file compiled mutation | short | Validates warm-worker + inference reuse |
| W3: Full project compiled mutation | long | Project-scale compiled wall time |
| W4: Repeat full-project mutation on unchanged inputs | long | Inference reuse at project scale |
| W5: One-file-edit invalidation run | long | Edit triggers correct invalidation |
| W6: Long-tail dense-mutant file run | short | Single-file p95 behaviour |
| R1: AST-runner reference | long | Compiled-vs-AST speedup baseline |
| R2: PIT-style comparison | long | Cross-tool operator-scope comparison |

The full matrix (pinned environment, required telemetry columns, per-workload fixtures and commands) lives at docs/benchmarks/bytecode-mutation-benchmark-matrix.md in the repository. Numbers are valid only on the pinned dev machine defined by docs/benchmarks/mutation-baseline-2026-04.md; cross-machine runs are informational only.

Final acceptance report

The wave's binary pass / fail verdict against every acceptance bar (compiled-vs-AST speedup, redefine-vs-fresh-loader ratio, unchanged-repeat speedup, PIT-style comparison, test-dispatch exclusion, reproducible recipes) is recorded in docs/benchmarks/bytecode-mutation-wave-acceptance.md. That report is the authoritative source of truth for whether the wave is pass or fail; when it fails, the closeout blocker rule keeps the wave open and appends remediation stories rather than overriding the verdict.