
The strict-mode benchmark validates that optimized mutation-testing modes produce the same mutant classifications as the unoptimized reference mode. A mutant classified as killed by one mode must be classified as killed by every other mode. "Strict" here means parity-checked: the benchmark does not merely measure throughput; it asserts that every optimization preserves correctness.

For details on the optimizations the benchmark checks, see [Strict mode](/docs/mutation/strict-mode/).

## Quick start

```bash
vary benchmark
```

This runs every mode against the built-in fixture set and writes a JSON artifact to `.vary-logs/strict-benchmark.json`. The command exits non-zero if any mode classifies a mutant differently from the reference baseline.
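Because the command signals parity failures through its exit status, it can be used directly as a gate in scripts. The snippet below is a minimal sketch; the `jq` field names (`modes`, `label`, `mutationScore`) are assumptions about the artifact layout, not a documented schema.

```bash
# Fail fast if any mode diverges from the reference baseline.
vary benchmark || { echo "strict-mode parity check failed"; exit 1; }

# Peek at the artifact. Field names here are illustrative assumptions,
# not a documented schema -- inspect the JSON to see the real layout.
jq '.modes[] | {label, mutationScore}' .vary-logs/strict-benchmark.json
```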

## Modes

The benchmark supports three primary modes that form the parity triangle:

| Mode | Label | Test selection | Workers | Purpose |
|------|-------|---------------|---------|---------|
| **Reference** | `reference` | All tests | Fresh per mutant | Known-good baseline; no optimizations |
| **Evidence** | `evidence` | Coverage-guided | Fresh per mutant | Validates test selection preserves classifications |
| **Evidence+Warm** | `evidence+warm` | Coverage-guided | Warm (batched) | Validates both test selection and warm workers |

Two additional backend modes (`fresh-loader` and `hotswap`), along with the `backend-parity` composite mode, are available for backend-specific benchmarking. Use `--mode all` (the default) to run every mode.
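To produce one artifact per mode instead of a single combined run, the parity triangle can also be driven mode by mode. The loop below is a sketch built on the documented `--mode` and `--output-dir` flags; the assumption that each single-mode run writes its artifact under the given output directory should be verified against your version.

```bash
# Sketch: run each parity-triangle mode on its own, keeping the
# artifacts separate (assumes a single-mode run writes its artifact
# under the directory passed to --output-dir).
for mode in reference evidence evidence+warm; do
  vary benchmark --mode "$mode" --output-dir "./results/$mode"
done
```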

### Reference mode

Runs every available test against every mutant using a fresh class loader per mutant. No coverage-guided selection, no test reordering, no worker reuse. This is the correctness baseline every other mode is compared against.

### Evidence mode

Enables coverage-guided test selection via reachability instrumentation. Only tests whose coverage trace intersects the mutated method are run. Workers are still fresh (one per mutant). If evidence mode classifies a mutant differently from reference (kills one that survived, or vice versa), the test selection logic has a bug.

### Evidence+warm mode

Enables both coverage-guided test selection and warm workers. Before the timed run, a warmup pass collects kill history and per-test kill counts. The benchmark then uses seeded kill-first scheduling with warm workers that batch multiple mutants before recycling. If this mode diverges from reference, either test selection or warm-worker state leakage is the cause.
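Because warm-worker state leakage tends to show up intermittently, it helps to pair this mode with multiple iterations so the flaky-classification check described under Parity validation has repeated runs to compare. Both flags used here are documented under Options below.

```bash
# Run the warm path several times so intermittent divergence caused by
# worker state leakage surfaces as a flaky classification.
vary benchmark --mode evidence+warm --runs 5
```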

## What it measures

| Metric | Description |
|--------|-------------|
| Wall time (ms) | End-to-end duration per mode |
| Mutants/sec | Throughput: total mutants / wall time |
| Median tests-run-per-mutant | Median number of tests executed per mutant |
| Mutation score (%) | killed / (total - errors) × 100 |
| Fallback counts | Hot-swap path breakdown: hotswap, ineligible, fallback |
| Worker reset counts | Total resets and retired workers (warm modes) |

## Parity validation

The benchmark fails if it detects any of the following conditions.

| Condition | Meaning |
|-----------|---------|
| Classification mismatch across modes | A mutant classified as killed in one mode but as surviving in another. This is the primary parity check. |
| Killed with empty `killedBy` | A mutant marked killed but no test recorded as the killer. |
| Flaky classification | A mutant whose outcome differs across identical runs of the same mode. |

Classification mismatches are the strictest check: they prove that an optimization changed the result, not just the speed.
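If the benchmark reports a mismatch, the same artifact can be used to see exactly which mutants changed classification. The query below is a hedged sketch: it assumes each mode entry carries a `results` array with a mutant `id` and a `status` field, which may not match the actual artifact layout.

```bash
# Sketch: list mutants whose status in evidence mode differs from
# reference mode. All field names (.modes, .results, .id, .status)
# are assumptions about the artifact, not a documented schema.
jq -r '
  (.modes[] | select(.label == "reference") | .results
     | map({(.id): .status}) | add) as $ref
  | .modes[] | select(.label == "evidence") | .results[]
  | select($ref[.id] != .status)
  | "\(.id): reference=\($ref[.id]) evidence=\(.status)"
' .vary-logs/strict-benchmark.json
```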

## Options

```bash
vary benchmark                               # all modes, built-in fixtures
vary benchmark --fixture-dir tests/mutation/ # custom fixtures
vary benchmark --output-dir ./results        # custom output location
vary benchmark --runs 5                      # multiple iterations
vary benchmark --mode reference              # single mode
vary benchmark --mode evidence+warm          # single mode
```

| Flag | Default | Description |
|------|---------|-------------|
| `--mode <name>` | `all` | Mode to run: `reference`, `evidence`, `evidence+warm`, `fresh-loader`, `hotswap`, `backend-parity`, or `all` |
| `--fixture-dir <path>` | built-in fixtures | Fixture directory to benchmark against |
| `--output-dir <path>` | `.vary-logs/` | Where to write the benchmark artifact |
| `--runs <n>` | 1 | Iterations per mode |

## Continuous integration

The benchmark runs in the nightly workflow against the project's fixture set. On failure, the benchmark artifact and the structured mismatch report are uploaded so a regression can be inspected without re-running the job. Use the benchmark as a regression gate whenever you touch strict-mode internals.
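In a CI step, that regression gate can be expressed as a small shell wrapper that preserves the artifact on failure. This is a sketch of the idea, not the project's actual workflow; `CI_ARTIFACT_DIR` is a placeholder for whatever directory your CI system uploads.

```bash
# Sketch of a CI regression gate: run the benchmark, and if parity
# fails, copy the artifact somewhere the CI system will upload it.
if ! vary benchmark --output-dir .vary-logs/; then
  mkdir -p "${CI_ARTIFACT_DIR:-./ci-artifacts}"
  cp .vary-logs/strict-benchmark.json "${CI_ARTIFACT_DIR:-./ci-artifacts}/"
  exit 1
fi
```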

## Pinned benchmark matrix

The bytecode mutation performance wave measures progress against a fixed set of workloads rather than ad-hoc commands. Each workload pins a fixture, an exact command line, an artifact path, the telemetry columns the run must emit, and whether it belongs to the short (< 5 minute) class or the long acceptance class.

| Workload | Class | Purpose |
|----------|-------|---------|
| W1: Cold single-file compiled mutation | short | Baseline per-mutant cost on the hot path |
| W2: Warm-repeat single-file compiled mutation | short | Validates warm-worker + inference reuse |
| W3: Full project compiled mutation | long | Project-scale compiled wall time |
| W4: Repeat full-project mutation on unchanged inputs | long | Inference reuse at project scale |
| W5: One-file-edit invalidation run | long | Edit triggers correct invalidation |
| W6: Long-tail dense-mutant file run | short | Single-file p95 behaviour |
| R1: AST-runner reference | long | Compiled-vs-AST speedup baseline |
| R2: PIT-style comparison | long | Cross-tool operator-scope comparison |

The full matrix (pinned environment, required telemetry columns, per-workload fixtures and commands) lives at [`docs/benchmarks/bytecode-mutation-benchmark-matrix.md`](https://github.com/ccollicutt/vary/blob/main/docs/benchmarks/bytecode-mutation-benchmark-matrix.md) in the repository. Numbers are valid only on the pinned dev machine defined by [`docs/benchmarks/mutation-baseline-2026-04.md`](https://github.com/ccollicutt/vary/blob/main/docs/benchmarks/mutation-baseline-2026-04.md); cross-machine runs are informational only.

## Final acceptance report

The wave's binary `pass` / `fail` verdict against every acceptance bar (compiled-vs-AST speedup, `redefine`-vs-`fresh-loader` ratio, unchanged-repeat speedup, PIT-style comparison, test-dispatch exclusion, reproducible recipes) is recorded in [`docs/benchmarks/bytecode-mutation-wave-acceptance.md`](https://github.com/ccollicutt/vary/blob/main/docs/benchmarks/bytecode-mutation-wave-acceptance.md). That report is the authoritative source of truth for whether the wave is `pass` or `fail`; when it fails, the closeout blocker rule keeps the wave open and appends remediation stories rather than overriding the verdict.
