Alpha. Vary is under active development and not ready for production use. Syntax, APIs, performance, and behaviour may change between releases.

Benchmark

The strict-mode benchmark validates that optimized mutation testing modes produce the same mutant classifications as the unoptimized reference mode. A mutant classified as killed by one mode must be classified as killed by every other mode. "Strict" here means parity-checked: the benchmark does not merely measure throughput, it asserts that every optimization preserves correctness.

For details on the optimizations the benchmark checks, see Strict mode.

Quick start

vary benchmark

This runs every mode against the built-in fixture set and writes a JSON artifact to .vary-logs/strict-benchmark.json. The command exits non-zero if any mode classifies a mutant differently from the reference baseline.

Modes

The benchmark supports three primary modes that form the parity triangle:

| Mode | Label | Test selection | Workers | Purpose |
|------|-------|----------------|---------|---------|
| Reference | reference | All tests | Fresh per mutant | Known-good baseline; no optimizations |
| Evidence | evidence | Coverage-guided | Fresh per mutant | Validates test selection preserves classifications |
| Evidence+Warm | evidence+warm | Coverage-guided | Warm (batched) | Validates both test selection and warm workers |

Two additional backend modes (fresh-loader and hotswap), along with the backend-parity composite mode, are available for backend-specific benchmarking. Use --mode all (the default) to run every mode.

Reference mode

Runs every available test against every mutant using a fresh class loader per mutant. No coverage-guided selection, no test reordering, no worker reuse. This is the correctness baseline every other mode is compared against.

Evidence mode

Enables coverage-guided test selection via reachability instrumentation. Only tests whose coverage trace intersects the mutated method are run. Workers are still fresh (one per mutant). If evidence mode kills or survives a mutant differently from reference, the test selection logic has a bug.
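The selection step can be sketched as a simple intersection test. This is an illustrative sketch, not Vary's actual data model: the trace format, test names, and method names below are hypothetical.

```python
def select_tests(mutated_method, coverage_traces):
    """Return only the tests whose recorded coverage reaches the mutated method.

    coverage_traces maps test name -> set of methods that test executed.
    A test that never reaches the mutated method cannot kill the mutant,
    so skipping it cannot change the classification.
    """
    return [test for test, covered in coverage_traces.items()
            if mutated_method in covered]


# Hypothetical coverage traces from a reachability-instrumented run.
traces = {
    "testAdd":      {"Calculator.add", "Calculator.render"},
    "testSubtract": {"Calculator.subtract"},
    "testRender":   {"Calculator.render"},
}

# Only tests whose trace intersects Calculator.add run against this mutant.
selected = select_tests("Calculator.add", traces)
```

If this filter ever drops a test that would have killed the mutant, the mutant survives in evidence mode but not in reference mode, and the parity check catches it.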

Evidence+warm mode

Enables both coverage-guided test selection and warm workers. Before the timed run, a warmup pass collects kill history and per-test kill counts. The benchmark then uses seeded kill-first scheduling with warm workers that batch multiple mutants before recycling. If this mode diverges from reference, either test selection or warm-worker state leakage is the cause.
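Kill-first scheduling can be sketched as a stable sort on the warmup statistics. This is a minimal illustration under assumed names, not Vary's scheduler:

```python
def kill_first_order(tests, kill_counts):
    """Order tests so historically strong killers run first.

    kill_counts maps test name -> number of mutants that test killed
    during the warmup pass; ties keep the original order (stable sort).
    Running likely killers first lets a killed mutant short-circuit
    before the long tail of weaker tests executes.
    """
    return sorted(tests, key=lambda t: kill_counts.get(t, 0), reverse=True)


# Hypothetical kill history collected by the warmup pass.
history = {"testBoundary": 14, "testHappyPath": 2, "testRender": 0}
ordered = kill_first_order(["testHappyPath", "testRender", "testBoundary"], history)
```

The ordering only affects how quickly a kill is found, never whether one is found, which is exactly what the parity check asserts.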

What it measures

| Metric | Description |
|--------|-------------|
| Wall time (ms) | End-to-end duration per mode |
| Mutants/sec | Throughput: total mutants / wall time |
| Median tests-run-per-mutant | Median number of tests executed per mutant |
| Mutation score (%) | killed / (total - errors) × 100 |
| Fallback counts | Hot-swap path breakdown: hotswap, ineligible, fallback |
| Worker reset counts | Total resets and retired workers (warm modes) |
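The mutation-score formula from the table can be checked with a toy calculation (the numbers are illustrative only):

```python
def mutation_score(killed, total, errors):
    """Mutation score (%) = killed / (total - errors) * 100.

    Errored mutants are excluded from the denominator because they
    produced neither a kill nor a survival.
    """
    return killed / (total - errors) * 100


# 45 killed out of 50 mutants, 2 of which errored: 45 / 48 * 100 = 93.75
score = mutation_score(killed=45, total=50, errors=2)
```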

Parity validation

The benchmark fails if any of these conditions are detected.

| Condition | Meaning |
|-----------|---------|
| Classification mismatch across modes | A mutant classified as killed in one mode but survived in another. This is the primary parity check. |
| Killed with empty killedBy | A mutant marked killed but no test recorded as the killer. |
| Flaky classification | A mutant whose outcome differs across identical runs of the same mode. |

Classification mismatches are the strictest check: they prove that an optimization changed the result, not just the speed.
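The primary check amounts to comparing every mode's verdicts against the reference baseline. A minimal sketch, assuming a hypothetical results shape (mode name mapped to per-mutant verdicts):

```python
def find_mismatches(results):
    """Flag mutants whose classification differs from the reference mode.

    results maps mode name -> {mutant id: "killed" or "survived"}.
    The reference mode is the baseline; any other mode disagreeing on
    any mutant is a parity failure.
    """
    baseline = results["reference"]
    mismatches = []
    for mode, classifications in results.items():
        if mode == "reference":
            continue
        for mutant, verdict in classifications.items():
            if verdict != baseline[mutant]:
                mismatches.append((mutant, mode, verdict, baseline[mutant]))
    return mismatches


runs = {
    "reference":     {"m1": "killed",   "m2": "survived"},
    "evidence":      {"m1": "killed",   "m2": "survived"},
    "evidence+warm": {"m1": "survived", "m2": "survived"},  # diverges on m1
}

# A non-empty mismatch list is what makes the benchmark exit non-zero.
bad = find_mismatches(runs)
```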

Options

vary benchmark                               # all modes, built-in fixtures
vary benchmark --fixture-dir tests/mutation/ # custom fixtures
vary benchmark --output-dir ./results        # custom output location
vary benchmark --runs 5                      # multiple iterations
vary benchmark --mode reference              # single mode
vary benchmark --mode evidence+warm          # single mode

| Flag | Default | Description |
|------|---------|-------------|
| --mode <name> | all | Mode to run: reference, evidence, evidence+warm, fresh-loader, hotswap, backend-parity, or all |
| --fixture-dir <path> | built-in fixtures | Fixture directory to benchmark against |
| --output-dir <path> | .vary-logs/ | Where to write the benchmark artifact |
| --runs <n> | 1 | Iterations per mode |

Continuous integration

The benchmark runs in the nightly workflow against the project's fixture set. On failure, the benchmark artifact and the structured mismatch report are uploaded so a regression can be inspected without re-running the job. Use the benchmark as a regression gate whenever you touch strict-mode internals.

Pinned benchmark matrix

The bytecode mutation performance wave measures progress against a fixed set of workloads rather than ad-hoc commands. Each workload pins a fixture, an exact command line, an artifact path, the telemetry columns the run must emit, and whether it belongs to the short (< 5 minute) class or the long acceptance class.

| Workload | Class | Purpose |
|----------|-------|---------|
| W1: Cold single-file compiled mutation | short | Baseline per-mutant cost on the hot path |
| W2: Warm-repeat single-file compiled mutation | short | Validates warm-worker + inference reuse |
| W3: Full project compiled mutation | long | Project-scale compiled wall time |
| W4: Repeat full-project mutation on unchanged inputs | long | Inference reuse at project scale |
| W5: One-file-edit invalidation run | long | Edit triggers correct invalidation |
| W6: Long-tail dense-mutant file run | short | Single-file p95 behaviour |
| R1: AST-runner reference | long | Compiled-vs-AST speedup baseline |
| R2: PIT-style comparison | long | Cross-tool operator-scope comparison |

The full matrix (pinned environment, required telemetry columns, per-workload fixtures and commands) lives at docs/benchmarks/bytecode-mutation-benchmark-matrix.md in the repository. Numbers are valid only on the pinned dev machine defined by docs/benchmarks/mutation-baseline-2026-04.md; cross-machine runs are informational only.

Final acceptance report

The wave's binary pass / fail verdict against every acceptance bar (compiled-vs-AST speedup, redefine-vs-fresh-loader ratio, unchanged-repeat speedup, PIT-style comparison, test-dispatch exclusion, reproducible recipes) is recorded in docs/benchmarks/bytecode-mutation-wave-acceptance.md. That report is the authoritative source of truth for whether the wave is pass or fail; when it fails, the closeout blocker rule keeps the wave open and appends remediation stories rather than overriding the verdict.