
> VAST is a collection of testing modes. This page tells you which modes to use, when, and what to do with the results.

## Profile and mode matrix

Each VAST mode and profile has a distinct purpose, cost, and suitable cadence; use this matrix to pick the right one for the job.

| Mode / Profile | Purpose | Cost | Best bug classes | Cadence | Gating? | Signal |
|---------------|---------|------|-----------------|---------|---------|--------|
| `--mode fast` | Smoke differential correctness | ~2 min, 100 programs | Codegen, obvious regressions | PR, local | Yes (quick gate) | High: catches gross breakage |
| `--profile complete` | Broad semantic confidence | ~5 min per 500 programs | Feature interactions, type bugs | Nightly | No (informational) | Medium: breadth over depth |
| `--opt-check` | Optimizer regression hunting | 2x cost (4 paths) | Optimizer bugs, pass corruption | Nightly, RC | Yes (RC gate) | High: catches real optimizer bugs |
| `--stress` | Edge-case value testing | +30% expression cost | Boundary bugs, overflow, precision | Nightly | No | Medium: targets specific edges |
| `--stateful` | Mutable state accumulation bugs | ~3 min per 500 programs | Loop/state bugs, off-by-one | Nightly | No | Medium: finds subtle state bugs |
| `--aliasing` | Reference propagation bugs | ~2 min per 500 programs | Aliasing, sharing, field chain bugs | Nightly, RC | No | Medium: targets specific subsystem |
| `--exception-propagation` | Stack unwinding mismatches | ~2 min per 500 programs | Exception codegen, catch routing | Nightly | No | Medium |
| `--concurrency` | Deterministic concurrency | ~3 min per 500 programs | Fork-join, pipeline, scheduling | Nightly | No | Medium |
| `--symbolic` | Constraint-driven edge values | ~2 min per 500 programs | Branch condition, range partition | Nightly | No | Medium: complements stress |
| `--ir-check` | IR translation equivalence | ~3 min per 1000 programs | Lowering bugs, IR/JVM divergence | Nightly, RC | Yes (RC gate) | High |
| `--verify-all-passes` | Per-pass optimizer isolation | ~5 min per 1000 programs | Pass-specific semantic corruption | Nightly, RC | Yes (RC gate) | High |
| `--large-programs` | Compiler scalability | ~5 min per 20 programs | Crashes, timeouts, memory exhaustion | Nightly, RC | No (informational) | Low volume, high impact |
| `--mode continuous` | Adaptive coverage exploration | Time-bounded (configurable) | Coverage gaps, undertested areas | Post-RC, soak | No | Cumulative: improves over time |
| `--metamorphic` | Semantics-preserving equivalence | +50% cost | Optimizer commutativity, folding bugs | Nightly | No | Medium |
| `--mutate` | VAST self-validation | +5x per program | VAST blind spots | Nightly | No | Meta: tests the tester |
| Corpus replay | Known regression detection | ~30s | Previously found bugs | Every RC (gate 3) | Yes (hard gate) | Very high: prevents regressions |

## Four testing purposes

Every VAST run should serve one of these four purposes.

### 1. Smoke validation

| | |
|---|---|
| Question | Did we obviously break the compiler? |
| Use | Fast mode, small number of profiles, baseline differential comparison, parser round-trip, corpus replay |
| Run on | Every PR, local development |
| Success means | Cheap and reliable signal that nothing is obviously broken |

```bash
vary vast --mode fast
```

### 2. Semantic regression hunting

| | |
|---|---|
| Question | Did a recent compiler change break semantics? |
| Use | Deep mode, `--opt-check`, semantic coverage, interaction coverage, stress mode, specialized profiles (stateful, aliasing, exceptions, concurrency, symbolic) |
| Run on | Nightly, before RC |
| Success means | High bug-finding probability on recent changes |

```bash
./bin/vary vast --mode deep --verbose --opt-check --stress
```

### 3. Trust and infrastructure validation

| | |
|---|---|
| Question | Is VAST itself still believable? |
| Use | Sabotage modes, trusted path calibration, path health matrix, corpus replay, regression generation validation |
| Run on | Nightly or post-RC, whenever VAST internals change |
| Success means | The tester is still testing |

```bash
make vast-negative
vary vast --profile core --count 100 --calibrate --path-health
```

### 4. Scalability and soak

| | |
|---|---|
| Question | Does the compiler survive hard or long-running conditions? |
| Use | Continuous exploration, large programs, compile-time/memory/performance metrics, failure artifact collection |
| Run on | Nightly, scheduled soak, pre-release |
| Success means | Stability under volume and size |

```bash
vary vast --mode continuous --duration 300
./bin/vary vast --mode deep --large-programs --verbose
```

## CI run profiles

### PR (fast smoke)

Run only fast, high-signal checks:

| Check | Command |
|-------|---------|
| Corpus replay | `make rc` (gate 3) |
| Fast differential | `vary vast --mode fast` |
| Parser round-trip | Included in fast mode |

Goal: quick breakage detection, under 2 minutes.
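
The fail-fast ordering behind this gate can be sketched as a small helper: run the cheapest, highest-signal check first and stop at the first failure. `run_gate` is a hypothetical illustration, not part of VAST; the check commands are the ones from the table above.

```shell
# Run checks cheapest-first and stop at the first failure.
# Each argument is a "name=command" pair; print the name of the first
# failing check, or "pass" if every check succeeds.
run_gate() {
  local spec name cmd
  for spec in "$@"; do
    name=${spec%%=*}
    cmd=${spec#*=}
    if ! eval "$cmd"; then
      echo "$name"
      return 1
    fi
  done
  echo "pass"
}

# In CI this would be invoked as:
#   run_gate "corpus-replay=make rc" "fast-differential=vary vast --mode fast"
```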

### Nightly (semantic exploration)

Run broader semantic exploration with specialized generators:

| Section | What it runs |
|---------|-------------|
| RC validation | 11-gate pipeline |
| Deep differential | `--mode deep --opt-check --stress` |
| Metamorphic + round-trip + coverage | `--metamorphic --round-trip --show-coverage` |
| Mutation expansion | `--mutate` |
| Specialized generators | `--stateful`, `--concurrency`, `--exception-propagation`, `--symbolic` |
| Large programs | `--large-programs` |
| IR translation checks | `--ir-check` |
| Confidence report | `--confidence` |
| Negative validation | Sabotage probes |
| Corpus growth | Exploration + auto-reduction |

Goal: bug discovery and confidence growth.

### RC / pre-release (trust validation)

Run the most expensive trusted suite:

| Check | Purpose |
|-------|---------|
| Deep corpus replay | All known-good programs pass |
| Deep + all specialized profiles | Maximum semantic coverage |
| Pass verification | `--verify-all-passes` |
| IR translation checks | `--ir-check` |
| Large programs | Scale/crash/perf stress |
| Continuous exploration (bounded) | Coverage-gap hunting |
| Sabotage validation | Trust infrastructure check |
| Artifact retention | 14-day failure artifact storage |

Goal: shipping confidence.

## Bug taxonomy

VAST is effective at finding specific classes of bugs. Knowing the taxonomy helps interpret failures and choose the right testing mode.

| Bug class | Description | Best VAST mode | Example |
|-----------|-------------|---------------|---------|
| Codegen bugs | Wrong bytecode emitted for an expression or statement | Deep differential, `--opt-check` | `a - b` compiled to `a + b` in nested context |
| Optimizer semantic corruption | Optimization pass changes program behaviour | `--opt-check`, `--verify-all-passes` | Constant folder evaluates edge-case arithmetic wrong |
| IR lowering bugs | AST to IR translation loses or changes semantics | `--ir-check`, deep differential | Variable binding lost during lowering |
| AST interpreter bugs | Reference oracle itself produces wrong result | Deep differential (AST disagrees with IR + JVM) | Interpreter mishandles enum dispatch |
| Aliasing/reference bugs | Mutation through one alias not visible through another | `--aliasing` | Shared data object not updated through alias |
| Exception unwinding mismatches | Stack unwinding differs between paths | `--exception-propagation` | Catch block executed on wrong path |
| State accumulation bugs | Bug only surfaces after many mutation steps | `--stateful` | Off-by-one in loop iteration interacting with mutable state |
| Concurrency determinism bugs | Parallel execution produces different result across runs | `--concurrency` | Fork-join result depends on scheduling |
| Parser round-trip instability | Format-then-reparse produces different AST | `--round-trip` | Formatter drops parentheses, changing precedence |
| Pass-specific breakage | One optimizer pass breaks semantics while others pass | `--verify-all-passes` | DCE removes a needed side effect |
| Performance/scalability blowups | Compiler crashes, times out, or exhausts memory on large input | `--large-programs` | Quadratic codegen on deep call graphs |
| Float precision divergence | Floating-point results differ between interpreter and JVM | `--profile float`, `--stress` | Precision accumulation in long arithmetic chains |

### Three classes of generated programs

VAST generates programs in three categories, each serving a different purpose:

| Class | Purpose | Size | Example |
|-------|---------|------|---------|
| Minimal semantic probes | Test one feature or interaction precisely | 5-20 AST nodes | `return (x + 0)`: tests identity folding |
| Feature interaction probes | Exercise combinations of language features | 20-200 AST nodes | Enum match inside try/except with nullable field access |
| Realistic structured programs | Stress scalability and complex control flow | 200-10,000 AST nodes | Multi-function program with loops, state, exceptions, and cross-function calls |

Minimal probes are cheap and targeted. Interaction probes find combination bugs. Realistic programs stress the compiler at scale. A good testing session uses all three.

## Mode-to-bug-class mapping

Use the right VAST mode for the compiler subsystem you changed.

| Compiler change | VAST modes to run |
|----------------|-------------------|
| Optimizer (constant folder, DCE) | `--opt-check`, `--verify-all-passes`, `--stress` |
| IR lowering | `--ir-check`, `--metamorphic` |
| Codegen (bytecode generation) | Deep differential, `--opt-check` |
| Data types / aliasing / runtime | `--stateful`, `--concurrency` |
| Exception handling | `--exception-propagation` |
| Concurrency runtime | `--concurrency` |
| Parser / formatter | `--round-trip`, round-trip RC gate |
| Performance / scalability | `--large-programs`, `--mode continuous` |
| New language feature | Deep mode with `complete` profile, `--interaction-coverage` |
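
The mapping above can be encoded as a small dispatch helper, for example in a pre-push hook. `modes_for` and the subsystem labels are illustrative assumptions; only the flags come from the table.

```shell
# Map a changed compiler subsystem to the VAST flags worth running.
modes_for() {
  case "$1" in
    optimizer)   echo "--opt-check --verify-all-passes --stress" ;;
    ir-lowering) echo "--ir-check --metamorphic" ;;
    codegen)     echo "--mode deep --opt-check" ;;
    runtime)     echo "--stateful --concurrency" ;;
    exceptions)  echo "--exception-propagation" ;;
    parser)      echo "--round-trip" ;;
    performance) echo "--large-programs" ;;
    *)           echo "--mode fast" ;;  # default: smoke check
  esac
}

# Example: vary vast $(modes_for optimizer)
```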

## Coverage as a steering tool

Coverage should drive testing choices, not just report status.

| Signal | What it means | Action |
|--------|--------------|--------|
| Low feature coverage | Generator or profile weakness | Add constructs to generator, run richer profiles |
| Low semantic coverage | Branch/value/behaviour gaps | Enable stress mode, run symbolic generation |
| Low interaction coverage | Missing combined constructs | Run `complete` profile with more programs |
| Low confidence but no mismatches | Insufficient exploration | Run continuous mode for longer, increase program count |
| Coverage plateau over multiple nights | Generator saturation | Switch profile weighting, add new generation families |
| Good coverage but repeated failures in one class | Subsystem weakness | Invest in targeted testing for that subsystem |
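
The first three rows can run as an automated nightly advisory. `coverage_advice` is a hypothetical helper; the thresholds (100%, 80%, 50%) are the ones defined under "Coverage adequacy policy" below.

```shell
# Print a steering action for each coverage gap.
# Args: feature, semantic, interaction coverage as integer percentages.
coverage_advice() {
  local feature=$1 semantic=$2 interaction=$3
  if [ "$feature" -lt 100 ]; then
    echo "feature gap: add constructs to the generator or run richer profiles"
  fi
  if [ "$semantic" -lt 80 ]; then
    echo "semantic gap: enable --stress or --symbolic"
  fi
  if [ "$interaction" -lt 50 ]; then
    echo "interaction gap: run the complete profile with more programs"
  fi
}
```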

## Failure classification and routing

Every VAST failure should be classified into a bucket with a standard response.

| Failure type | Identified by | Response |
|-------------|--------------|----------|
| Optimizer bug | AST/IR/JVM-unopt agree, JVM-opt differs | Minimize, check optimizer passes, store corpus entry |
| Codegen bug | AST/IR agree, JVM differs | Minimize, check bytecode generation |
| AST/IR disagreement | AST differs from IR and JVM | Check AST interpreter correctness |
| Parser round-trip failure | Format-reparse produces different AST | Check formatter and parser |
| Pass verification failure | Single pass changes semantics | Isolate which pass, minimize |
| Large-program compile failure | Compilation timeout or crash | Check scalability, reduce program |
| Performance regression | Compile time exceeds threshold | Profile compiler, check for algorithmic regression |
| Infrastructure/tester failure | Path crash, VAST internal error | Check VAST code, not compiler |

For every failure: minimize it, tag it, store it in the corpus, and identify the likely subsystem.
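
The first three rows of the table reduce to a comparison over the four execution paths (AST, IR, JVM-unoptimized, JVM-optimized). `classify` is a sketch of that routing logic, not a real VAST command:

```shell
# Route a mismatch to a bug bucket from the four path results.
classify() {
  local ast=$1 ir=$2 jvm_unopt=$3 jvm_opt=$4
  if [ "$ast" = "$ir" ] && [ "$ir" = "$jvm_unopt" ]; then
    if [ "$jvm_unopt" = "$jvm_opt" ]; then
      echo "agree"             # no mismatch
    else
      echo "optimizer-bug"     # only the optimized JVM path diverges
    fi
  elif [ "$ast" = "$ir" ]; then
    echo "codegen-bug"         # AST/IR agree, the JVM paths diverge
  else
    echo "ast-ir-disagreement" # check the AST interpreter first
  fi
}
```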

## Coverage adequacy policy

Coverage is useful only if it drives decisions. These thresholds define what counts as "enough".

### RC-blocking thresholds

| Metric | Threshold | Action if below |
|--------|-----------|----------------|
| Feature coverage (complete profile) | 100% | Generator bug: all enabled constructs must appear |
| Semantic coverage (deep mode) | 80% | Increase program count or enable `--stress` / `--symbolic` |
| Corpus replay pass rate | 100% | Regression: block release, investigate |
| Confidence score | No drop > 5 points from previous RC | Investigate cause of confidence loss |
| Pass verification | Zero mismatches | Optimizer bug: block release |
| Sabotage detection | All 4 modes detected | VAST infrastructure gap: fix before release |

### Warning thresholds (non-blocking)

| Metric | Threshold | Action |
|--------|-----------|--------|
| Interaction coverage (complete) | Below 50% | Run more programs with `complete` profile |
| Semantic coverage stagnation | Same value for 3+ nights | Generator may be saturated: review profile weights |
| Bug yield decline | Zero new mismatches for 7+ nights | Normal if compiler is stable; flag if new features were added |
| Large-program timeout rate | Above 10% | Check for performance regression in codegen |

### How to interpret coverage gaps

Coverage numbers are not goals in themselves. They are signals about generator quality and testing thoroughness.

| Pattern | Interpretation | Action |
|---------|---------------|--------|
| High feature coverage, low semantic coverage | Generator uses all constructs but does not exercise interesting behaviours (edge values, error paths, deep nesting) | Enable `--stress` and `--symbolic` |
| High semantic coverage, low interaction coverage | Individual features are well-tested but combinations are not | Run the `complete` profile with larger program counts |
| High confidence, no mismatches | The system is working; normal state for a stable compiler | None required |
| Low confidence despite clean runs | Insufficient exploration | Increase program counts or run continuous mode for longer |
| Rising mismatch count | Active compiler bugs | Minimize, store, investigate. Do not release |

## Release decision policy

A release decision should consume these signals:

| Signal | Release gate |
|--------|-------------|
| Mismatch count | Zero unreduced new mismatches |
| Corpus replay | 100% pass rate (no regressions) |
| Confidence score | No drop beyond 5 points from previous release |
| Semantic coverage | Above 80% in deep mode |
| Interaction coverage | Above 50% in complete profile |
| Path health | All paths agree on calibration set |
| Performance trend | No compile-time regression above 25% |
| Large-program stability | No new crash or timeout regressions |
| Pass verification | All optimizer passes preserve semantics |
| Sabotage validation | All 4 sabotage modes detected |
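
A minimal sketch of consuming these signals as a single gate. The metric names and how they are collected are assumptions; the thresholds are the ones in the table.

```shell
# Exit nonzero if any numeric release signal fails its gate.
# Args: new_mismatches corpus_pass_pct confidence_drop semantic_pct interaction_pct
release_gate() {
  local fail=0
  [ "$1" -eq 0 ]   || { echo "BLOCK: $1 unreduced new mismatches"; fail=1; }
  [ "$2" -eq 100 ] || { echo "BLOCK: corpus replay pass rate $2% < 100%"; fail=1; }
  [ "$3" -le 5 ]   || { echo "BLOCK: confidence dropped $3 points (> 5)"; fail=1; }
  [ "$4" -ge 80 ]  || { echo "BLOCK: semantic coverage $4% < 80%"; fail=1; }
  [ "$5" -ge 50 ]  || { echo "BLOCK: interaction coverage $5% < 50%"; fail=1; }
  return $fail
}

# release_gate 0 100 2 85 55 && echo "ship"
```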
