
The strict-mode benchmark validates that optimized mutation-testing modes produce the same mutant classifications as the unoptimized reference mode. A mutant classified as killed by one mode must be classified as killed by every other mode. "Strict" here means parity-checked: the benchmark does not merely measure throughput; it asserts that every optimization preserves correctness.

For details on the optimizations the benchmark checks, see [Strict mode](/docs/mutation/strict-mode/).

## Quick start

```bash
vary benchmark
```

This runs every mode against the built-in fixture set and writes a JSON artifact to `.vary-logs/strict-benchmark.json`. The command exits non-zero if any mode classifies a mutant differently from the reference baseline.
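Because the command signals parity failures through its exit status, it can be used directly as a gate in scripts. The snippet below is a minimal sketch; the `jq` field names (`modes`, `label`, `mutationScore`) are assumptions about the artifact layout, not a documented schema.

```bash
# Fail fast if any mode diverges from the reference baseline.
vary benchmark || { echo "strict-mode parity check failed"; exit 1; }

# Peek at the artifact. Field names here are illustrative assumptions,
# not a documented schema -- inspect the JSON to see the real layout.
jq '.modes[] | {label, mutationScore}' .vary-logs/strict-benchmark.json
```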

## Modes

The benchmark supports three primary modes that form the parity triangle:

| Mode | Label | Test selection | Workers | Purpose |
|------|-------|---------------|---------|---------|
| **Reference** | `reference` | All tests | Fresh per mutant | Known-good baseline; no optimizations |
| **Evidence** | `evidence` | Coverage-guided | Fresh per mutant | Validates test selection preserves classifications |
| **Evidence+Warm** | `evidence+warm` | Coverage-guided | Warm (batched) | Validates both test selection and warm workers |

Two additional backend modes (`fresh-loader` and `hotswap`), along with the `backend-parity` composite mode, are available for backend-specific benchmarking. Use `--mode all` (the default) to run every mode.
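To produce one artifact per mode instead of a single combined run, the parity triangle can also be driven mode by mode. The loop below is a sketch built on the documented `--mode` and `--output-dir` flags; the assumption that each single-mode run writes its artifact under the given output directory should be verified against your version.

```bash
# Sketch: run each parity-triangle mode on its own, keeping the
# artifacts separate (assumes a single-mode run writes its artifact
# under the directory passed to --output-dir).
for mode in reference evidence evidence+warm; do
  vary benchmark --mode "$mode" --output-dir "./results/$mode"
done
```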

### Reference mode

Runs every available test against every mutant using a fresh class loader per mutant. No coverage-guided selection, no test reordering, no worker reuse. This is the correctness baseline every other mode is compared against.

### Evidence mode

Enables coverage-guided test selection via reachability instrumentation. Only tests whose coverage trace intersects the mutated method are run. Workers are still fresh (one per mutant). If evidence mode classifies a mutant differently from reference (kills one that survived, or vice versa), the test selection logic has a bug.

### Evidence+warm mode

Enables both coverage-guided test selection and warm workers. Before the timed run, a warmup pass collects kill history and per-test kill counts. The benchmark then uses seeded kill-first scheduling with warm workers that batch multiple mutants before recycling. If this mode diverges from reference, either test selection or warm-worker state leakage is the cause.
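Because warm-worker state leakage tends to show up intermittently, it helps to pair this mode with multiple iterations so the flaky-classification check described under Parity validation has repeated runs to compare. Both flags used here are documented under Options below.

```bash
# Run the warm path several times so intermittent divergence caused by
# worker state leakage surfaces as a flaky classification.
vary benchmark --mode evidence+warm --runs 5
```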

## What it measures

| Metric | Description |
|--------|-------------|
| Wall time (ms) | End-to-end duration per mode |
| Mutants/sec | Throughput: total mutants / wall time |
| Median tests-run-per-mutant | Median number of tests executed per mutant |
| Mutation score (%) | killed / (total - errors) × 100 |
| Fallback counts | Hot-swap path breakdown: hotswap, ineligible, fallback |
| Worker reset counts | Total resets and retired workers (warm modes) |

## Parity validation

The benchmark fails if it detects any of the following conditions.

| Condition | Meaning |
|-----------|---------|
| Classification mismatch across modes | A mutant classified as killed in one mode but as surviving in another. This is the primary parity check. |
| Killed with empty `killedBy` | A mutant marked killed but no test recorded as the killer. |
| Flaky classification | A mutant whose outcome differs across identical runs of the same mode. |

Classification mismatches are the strictest check: they prove that an optimization changed the result, not just the speed.
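If the benchmark reports a mismatch, the same artifact can be used to see exactly which mutants changed classification. The query below is a hedged sketch: it assumes each mode entry carries a `results` array with a mutant `id` and a `status` field, which may not match the actual artifact layout.

```bash
# Sketch: list mutants whose status in evidence mode differs from
# reference mode. All field names (.modes, .results, .id, .status)
# are assumptions about the artifact, not a documented schema.
jq -r '
  (.modes[] | select(.label == "reference") | .results
     | map({(.id): .status}) | add) as $ref
  | .modes[] | select(.label == "evidence") | .results[]
  | select($ref[.id] != .status)
  | "\(.id): reference=\($ref[.id]) evidence=\(.status)"
' .vary-logs/strict-benchmark.json
```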

## Options

```bash
vary benchmark                               # all modes, built-in fixtures
vary benchmark --fixture-dir tests/mutation/ # custom fixtures
vary benchmark --output-dir ./results        # custom output location
vary benchmark --runs 5                      # multiple iterations
vary benchmark --mode reference              # single mode
vary benchmark --mode evidence+warm          # single mode
```

| Flag | Default | Description |
|------|---------|-------------|
| `--mode <name>` | `all` | Mode to run: `reference`, `evidence`, `evidence+warm`, `fresh-loader`, `hotswap`, `backend-parity`, or `all` |
| `--fixture-dir <path>` | built-in fixtures | Fixture directory to benchmark against |
| `--output-dir <path>` | `.vary-logs/` | Where to write the benchmark artifact |
| `--runs <n>` | 1 | Iterations per mode |

## Continuous integration

The benchmark runs in the nightly workflow against the project's fixture set. On failure, the benchmark artifact and the structured mismatch report are uploaded so a regression can be inspected without re-running the job. Use the benchmark as a regression gate whenever you touch strict-mode internals.
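In a CI step, that regression gate can be expressed as a small shell wrapper that preserves the artifact on failure. This is a sketch of the idea, not the project's actual workflow; `CI_ARTIFACT_DIR` is a placeholder for whatever directory your CI system uploads.

```bash
# Sketch of a CI regression gate: run the benchmark, and if parity
# fails, copy the artifact somewhere the CI system will upload it.
if ! vary benchmark --output-dir .vary-logs/; then
  mkdir -p "${CI_ARTIFACT_DIR:-./ci-artifacts}"
  cp .vary-logs/strict-benchmark.json "${CI_ARTIFACT_DIR:-./ci-artifacts}/"
  exit 1
fi
```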

## Pinned benchmark matrix

The bytecode mutation performance wave measures progress against a fixed set of workloads rather than ad-hoc commands. Each workload pins a fixture, an exact command line, an artifact path, the telemetry columns the run must emit, and whether it belongs to the short (< 5 minute) class or the long acceptance class.

| Workload | Class | Purpose |
|----------|-------|---------|
| W1: Cold single-file compiled mutation | short | Baseline per-mutant cost on the hot path |
| W2: Warm-repeat single-file compiled mutation | short | Validates warm-worker + inference reuse |
| W3: Full project compiled mutation | long | Project-scale compiled wall time |
| W4: Repeat full-project mutation on unchanged inputs | long | Inference reuse at project scale |
| W5: One-file-edit invalidation run | long | Edit triggers correct invalidation |
| W6: Long-tail dense-mutant file run | short | Single-file p95 behaviour |
| R1: AST-runner reference | long | Compiled-vs-AST speedup baseline |
| R2: PIT-style comparison | long | Cross-tool operator-scope comparison |

The full matrix (pinned environment, required telemetry columns, per-workload fixtures and commands) lives at [`docs/benchmarks/bytecode-mutation-benchmark-matrix.md`](https://github.com/ccollicutt/vary/blob/main/docs/benchmarks/bytecode-mutation-benchmark-matrix.md) in the repository. Numbers are valid only on the pinned dev machine defined by [`docs/benchmarks/mutation-baseline-2026-04.md`](https://github.com/ccollicutt/vary/blob/main/docs/benchmarks/mutation-baseline-2026-04.md); cross-machine runs are informational only.

## Final acceptance report

The wave's binary `pass` / `fail` verdict against every acceptance bar (compiled-vs-AST speedup, `redefine`-vs-`fresh-loader` ratio, unchanged-repeat speedup, PIT-style comparison, test-dispatch exclusion, reproducible recipes) is recorded in [`docs/benchmarks/bytecode-mutation-wave-acceptance.md`](https://github.com/ccollicutt/vary/blob/main/docs/benchmarks/bytecode-mutation-wave-acceptance.md). That report is the authoritative source of truth for whether the wave is `pass` or `fail`; when it fails, the closeout blocker rule keeps the wave open and appends remediation stories rather than overriding the verdict.
