Alpha. Vary is under active development and not ready for production use. Syntax, APIs, performance, and behaviour may change between releases.
Benchmark
The strict-mode benchmark validates that optimized mutation testing modes produce the same mutant classifications as the unoptimized reference mode. A mutant classified as killed by one mode must be classified as killed by every other mode. "Strict" here means parity-checked: the benchmark does not merely measure throughput, it asserts that every optimization preserves correctness.
For the optimizations the benchmark checks, see Strict mode.
Quick start
vary benchmark
This runs every mode against the built-in fixture set and writes a JSON artifact to .vary-logs/strict-benchmark.json. The command exits non-zero if any mode classifies a mutant differently from the reference baseline.
Modes
The benchmark supports three primary modes that form the parity triangle:
| Mode | Label | Test selection | Workers | Purpose |
|---|---|---|---|---|
| Reference | reference | All tests | Fresh per mutant | Known-good baseline; no optimizations |
| Evidence | evidence | Coverage-guided | Fresh per mutant | Validates test selection preserves classifications |
| Evidence+Warm | evidence+warm | Coverage-guided | Warm (batched) | Validates both test selection and warm workers |
Two additional backend modes (fresh-loader and hotswap), along with the backend-parity composite mode, are available for backend-specific benchmarking. Use --mode all (the default) to run every mode.
Reference mode
Runs every available test against every mutant using a fresh class loader per mutant. No coverage-guided selection, no test reordering, no worker reuse. This is the correctness baseline every other mode is compared against.
Evidence mode
Enables coverage-guided test selection via reachability instrumentation. Only tests whose coverage trace intersects the mutated method are run. Workers are still fresh (one per mutant). If evidence mode classifies a mutant differently from reference (for example, a mutant killed in one mode but surviving in the other), the test selection logic has a bug.
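The selection rule above can be sketched in a few lines. This is a minimal illustration, not Vary's internals: `coverage_map` and the method names are hypothetical, standing in for whatever the reachability instrumentation records.

```python
# Hypothetical sketch of coverage-guided test selection. The names
# coverage_map and mutated_method are illustrative, not Vary's API.

def select_tests(coverage_map, mutated_method):
    """Return only the tests whose coverage trace touches the mutated method."""
    return sorted(
        test for test, covered in coverage_map.items()
        if mutated_method in covered
    )

# Coverage traces as gathered during instrumentation (hypothetical data):
coverage = {
    "testAdd": {"Calculator.add", "Calculator.check"},
    "testSub": {"Calculator.sub"},
    "testAll": {"Calculator.add", "Calculator.sub"},
}

# A mutant inside Calculator.add only needs the tests that reach it.
print(select_tests(coverage, "Calculator.add"))  # ['testAdd', 'testAll']
```

Reference mode is equivalent to skipping this filter entirely and running every test, which is why any divergence points at the filter.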
Evidence+warm mode
Enables both coverage-guided test selection and warm workers. Before the timed run, a warmup pass collects kill history and per-test kill counts. The benchmark then uses seeded kill-first scheduling with warm workers that batch multiple mutants before recycling. If this mode diverges from reference, either test selection or warm-worker state leakage is the cause.
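Seeded kill-first scheduling can be sketched as a simple sort over the warmup data. Again a hypothetical illustration: the kill-count shape and test names are assumptions, not Vary's internal format.

```python
# Illustrative sketch of kill-first scheduling seeded from a warmup pass.
# The per-test kill counts here are hypothetical warmup output.

def kill_first_order(tests, kill_counts):
    """Run tests with the best kill history first, so mutants die early."""
    return sorted(tests, key=lambda t: kill_counts.get(t, 0), reverse=True)

# Kill counts collected during the warmup pass:
history = {"testFast": 2, "testThorough": 9, "testEdge": 5}

order = kill_first_order(["testFast", "testEdge", "testThorough"], history)
print(order)  # ['testThorough', 'testEdge', 'testFast']
```

Ordering only changes how soon a kill is found, never whether it is found, which is the property the parity check enforces.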
What it measures
| Metric | Description |
|---|---|
| Wall time (ms) | End-to-end duration per mode |
| Mutants/sec | Throughput: total mutants / wall time |
| Median tests-run-per-mutant | Median number of tests executed per mutant |
| Mutation score (%) | killed / (total - errors) × 100 |
| Fallback counts | Hot-swap path breakdown: hotswap, ineligible, fallback |
| Worker reset counts | Total resets and retired workers (warm modes) |
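The derived metrics follow directly from the raw counts. A minimal sketch, assuming hypothetical field names (the real artifact schema may differ):

```python
from statistics import median

# Sketch of the derived benchmark metrics; argument names are assumptions.

def metrics(killed, total, errors, wall_ms, tests_per_mutant):
    return {
        # Throughput: total mutants / wall time (in seconds)
        "mutants_per_sec": total / (wall_ms / 1000),
        "median_tests_per_mutant": median(tests_per_mutant),
        # Mutation score: killed / (total - errors) x 100
        "mutation_score": killed / (total - errors) * 100,
    }

m = metrics(killed=45, total=52, errors=2, wall_ms=10_000,
            tests_per_mutant=[3, 1, 4, 2, 6])
print(m["mutation_score"])  # 90.0
```

Note that errored mutants are excluded from the score's denominator, so an optimization that merely turns errors into survivals would still move the score and be caught.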
Parity validation
The benchmark fails if any of these conditions are detected.
| Condition | Meaning |
|---|---|
| Classification mismatch across modes | A mutant classified as killed in one mode but survived in another. This is the primary parity check. |
| Killed with empty killedBy | A mutant marked killed but no test recorded as the killer. |
| Flaky classification | A mutant whose outcome differs across identical runs of the same mode. |
Classification mismatches are the strictest check: they prove that an optimization changed the result, not just the speed.
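The first two checks can be sketched as a pass over per-mode results. This is a hedged illustration with a hypothetical result shape (mutant id mapped to a classification and a killedBy list); the flaky-classification check is omitted because it requires repeated runs of the same mode.

```python
# Minimal sketch of two of the parity checks. The dict shape
# {mutant: (classification, killed_by)} is hypothetical.

def parity_failures(results_by_mode, reference="reference"):
    failures = []
    ref = results_by_mode[reference]
    for mode, res in results_by_mode.items():
        for mutant, (cls, killed_by) in res.items():
            # Primary check: every mode must agree with the reference.
            if cls != ref[mutant][0]:
                failures.append(("mismatch", mode, mutant))
            # A killed mutant must name at least one killing test.
            if cls == "killed" and not killed_by:
                failures.append(("empty-killedBy", mode, mutant))
    return failures

runs = {
    "reference":     {"m1": ("killed", ["testAdd"]), "m2": ("survived", [])},
    "evidence":      {"m1": ("killed", ["testAdd"]), "m2": ("survived", [])},
    "evidence+warm": {"m1": ("survived", []),        "m2": ("survived", [])},
}
print(parity_failures(runs))  # [('mismatch', 'evidence+warm', 'm1')]
```

Any non-empty failure list maps to the non-zero exit described under Quick start.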
Options
vary benchmark # all modes, built-in fixtures
vary benchmark --fixture-dir tests/mutation/ # custom fixtures
vary benchmark --output-dir ./results # custom output location
vary benchmark --runs 5 # multiple iterations
vary benchmark --mode reference # single mode
vary benchmark --mode evidence+warm # single mode
| Flag | Default | Description |
|---|---|---|
| --mode <name> | all | Mode to run: reference, evidence, evidence+warm, fresh-loader, hotswap, backend-parity, or all |
| --fixture-dir <path> | built-in fixtures | Fixture directory to benchmark against |
| --output-dir <path> | .vary-logs/ | Where to write the benchmark artifact |
| --runs <n> | 1 | Iterations per mode |
Continuous integration
The benchmark runs in the nightly workflow against the project's fixture set. On failure, the benchmark artifact and the structured mismatch report are uploaded so a regression can be inspected without re-running the job. Use the benchmark as a regression gate whenever you touch strict-mode internals.
Pinned benchmark matrix
The bytecode mutation performance wave measures progress against a fixed set of workloads rather than ad-hoc commands. Each workload pins a fixture, an exact command line, an artifact path, the telemetry columns the run must emit, and whether it belongs to the short (< 5 minute) class or the long acceptance class.
| Workload | Class | Purpose |
|---|---|---|
| W1: Cold single-file compiled mutation | short | Baseline per-mutant cost on the hot path |
| W2: Warm-repeat single-file compiled mutation | short | Validates warm-worker + inference reuse |
| W3: Full project compiled mutation | long | Project-scale compiled wall time |
| W4: Repeat full-project mutation on unchanged inputs | long | Inference reuse at project scale |
| W5: One-file-edit invalidation run | long | Edit triggers correct invalidation |
| W6: Long-tail dense-mutant file run | short | Single-file p95 behaviour |
| R1: AST-runner reference | long | Compiled-vs-AST speedup baseline |
| R2: PIT-style comparison | long | Cross-tool operator-scope comparison |
The full matrix (pinned environment, required telemetry columns, per-workload fixtures and commands) lives at docs/benchmarks/bytecode-mutation-benchmark-matrix.md in the repository. Numbers are valid only on the pinned dev machine defined by docs/benchmarks/mutation-baseline-2026-04.md; cross-machine runs are informational only.
Final acceptance report
The wave's binary pass / fail verdict against every acceptance bar (compiled-vs-AST speedup, redefine-vs-fresh-loader ratio, unchanged-repeat speedup, PIT-style comparison, test-dispatch exclusion, reproducible recipes) is recorded in docs/benchmarks/bytecode-mutation-wave-acceptance.md. That report is the authoritative source of truth for whether the wave is pass or fail; when it fails, the closeout blocker rule keeps the wave open and appends remediation stories rather than overriding the verdict.