VAST is a collection of testing modes. This page tells you which modes to use, when, and what to do with the results.
Each VAST mode/profile has a purpose, cost, and suitable cadence.
| Mode / Profile | Purpose | Cost | Best bug classes | Cadence | Gating? | Signal |
|---|---|---|---|---|---|---|
| --mode fast | Smoke differential correctness | ~2 min, 100 programs | Codegen, obvious regressions | PR, local | Yes (quick gate) | High: catches gross breakage |
| --profile complete | Broad semantic confidence | ~5 min per 500 programs | Feature interactions, type bugs | Nightly | No (informational) | Medium: breadth over depth |
| --opt-check | Optimizer regression hunting | 2x cost (4 paths) | Optimizer bugs, pass corruption | Nightly, RC | Yes (RC gate) | High: catches real optimizer bugs |
| --stress | Edge-case value testing | +30% expression cost | Boundary bugs, overflow, precision | Nightly | No | Medium: targets specific edges |
| --stateful | Mutable state accumulation bugs | ~3 min per 500 programs | Loop/state bugs, off-by-one | Nightly | No | Medium: finds subtle state bugs |
| --aliasing | Reference propagation bugs | ~2 min per 500 programs | Aliasing, sharing, field chain bugs | Nightly, RC | No | Medium: targets specific subsystem |
| --exception-propagation | Stack unwinding mismatches | ~2 min per 500 programs | Exception codegen, catch routing | Nightly | No | Medium |
| --concurrency | Deterministic concurrency | ~3 min per 500 programs | Fork-join, pipeline, scheduling | Nightly | No | Medium |
| --symbolic | Constraint-driven edge values | ~2 min per 500 programs | Branch condition, range partition | Nightly | No | Medium: complements stress |
| --ir-check | IR translation equivalence | ~3 min per 1000 programs | Lowering bugs, IR/JVM divergence | Nightly, RC | Yes (RC gate) | High |
| --verify-all-passes | Per-pass optimizer isolation | ~5 min per 1000 programs | Pass-specific semantic corruption | Nightly, RC | Yes (RC gate) | High |
| --large-programs | Compiler scalability | ~5 min per 20 programs | Crashes, timeouts, memory exhaustion | Nightly, RC | No (informational) | Low volume, high impact |
| --mode continuous | Adaptive coverage exploration | Time-bounded (configurable) | Coverage gaps, undertested areas | Post-RC, soak | No | Cumulative: improves over time |
| --metamorphic | Semantics-preserving equivalence | +50% cost | Optimizer commutativity, folding bugs | Nightly | No | Medium |
| --mutate | VAST self-validation | +5x per program | VAST blind spots | Nightly | No | Meta: tests the tester |
| Corpus replay | Known regression detection | ~30s | Previously found bugs | Every RC (gate 3) | Yes (hard gate) | Very high: prevents regressions |
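The Cadence and Gating columns above can drive a CI job directly. A minimal sketch of that lookup (mode names and cadences are transcribed from the table; only a representative subset is mapped, and the `vary` binary itself is not invoked):

```python
# Which cadences each mode belongs to, transcribed from the table above
# (subset shown for brevity).
CADENCES = {
    "--mode fast": {"PR", "local"},
    "--profile complete": {"Nightly"},
    "--opt-check": {"Nightly", "RC"},
    "--stress": {"Nightly"},
    "--ir-check": {"Nightly", "RC"},
    "--verify-all-passes": {"Nightly", "RC"},
}

def modes_for(cadence: str) -> list[str]:
    """Return, sorted, the modes whose cadence column includes `cadence`."""
    return sorted(mode for mode, when in CADENCES.items() if cadence in when)

print(modes_for("RC"))  # ['--ir-check', '--opt-check', '--verify-all-passes']
```

A CI driver can then assemble the flag list for its job type instead of hard-coding it in each pipeline definition.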
Every VAST run should have one of these purposes.
| Question | Did we obviously break the compiler? |
|---|---|
| Use | Fast mode, small number of profiles, baseline differential comparison, parser round-trip, corpus replay |
| Run on | Every PR, local development |
| Success means | Cheap and reliable signal that nothing is obviously broken |

`vary vast --mode fast`
| Question | Did a recent compiler change break semantics? |
|---|---|
| Use | Deep mode, --opt-check, semantic coverage, interaction coverage, stress mode, specialized profiles (stateful, aliasing, exceptions, concurrency, symbolic) |
| Run on | Nightly, before RC |
| Success means | High bug-finding probability on recent changes |

`./bin/vary vast --mode deep --verbose --opt-check --stress`
| Question | Is VAST itself still believable? |
|---|---|
| Use | Sabotage modes, trusted path calibration, path health matrix, corpus replay, regression generation validation |
| Run on | Nightly or post-RC, whenever VAST internals change |
| Success means | The tester is still testing |

`make vast-negative`

`vary vast --profile core --count 100 --calibrate --path-health`
| Question | Does the compiler survive hard or long-running conditions? |
|---|---|
| Use | Continuous exploration, large programs, compile-time/memory/performance metrics, failure artifact collection |
| Run on | Nightly, scheduled soak, pre-release |
| Success means | Stability under volume and size |

`vary vast --mode continuous --duration 300`

`./bin/vary vast --mode deep --large-programs --verbose`
Run only fast high-signal checks:
| Check | Command |
|---|---|
| Corpus replay | make rc (gate 3) |
| Fast differential | vary vast --mode fast |
| Parser round-trip | Included in fast mode |
Goal: quick breakage detection, under 2 minutes.
Run broader semantic exploration with specialized generators:
| Section | What it runs |
|---|---|
| RC validation | 11-gate pipeline |
| Deep differential | --mode deep --opt-check --stress |
| Metamorphic + round-trip + coverage | --metamorphic --round-trip --show-coverage |
| Mutation expansion | --mutate |
| Specialized generators | --stateful, --concurrency, --exception-propagation, --symbolic |
| Large programs | --large-programs |
| IR translation checks | --ir-check |
| Confidence report | --confidence |
| Negative validation | Sabotage probes |
| Corpus growth | Exploration + auto-reduction |
Goal: bug discovery and confidence growth.
Run the most expensive trusted suite:
| Check | Purpose |
|---|---|
| Deep corpus replay | All known-good programs pass |
| Deep + all specialized profiles | Maximum semantic coverage |
| Pass verification | --verify-all-passes |
| IR translation checks | --ir-check |
| Large programs | Scale/crash/perf stress |
| Continuous exploration (bounded) | Coverage-gap hunting |
| Sabotage validation | Trust infrastructure check |
| Artifact retention | 14-day failure artifact storage |
Goal: shipping confidence.
VAST is effective at finding specific classes of bugs. Knowing the taxonomy helps interpret failures and choose the right testing mode.
| Bug class | Description | Best VAST mode | Example |
|---|---|---|---|
| Codegen bugs | Wrong bytecode emitted for an expression or statement | Deep differential, --opt-check | `a - b` compiled to `a + b` in nested context |
| Optimizer semantic corruption | Optimization pass changes program behaviour | --opt-check, --verify-all-passes | Constant folder evaluates edge-case arithmetic wrong |
| IR lowering bugs | AST to IR translation loses or changes semantics | --ir-check, deep differential | Variable binding lost during lowering |
| AST interpreter bugs | Reference oracle itself produces wrong result | Deep differential (AST disagrees with IR + JVM) | Interpreter mishandles enum dispatch |
| Aliasing/reference bugs | Mutation through one alias not visible through another | --aliasing | Shared data object not updated through alias |
| Exception unwinding mismatches | Stack unwinding differs between paths | --exception-propagation | Catch block executed on wrong path |
| State accumulation bugs | Bug only surfaces after many mutation steps | --stateful | Off-by-one in loop iteration interacting with mutable state |
| Concurrency determinism bugs | Parallel execution produces different result across runs | --concurrency | Fork-join result depends on scheduling |
| Parser round-trip instability | Format-then-reparse produces different AST | --round-trip | Formatter drops parentheses, changing precedence |
| Pass-specific breakage | One optimizer pass breaks semantics while others pass | --verify-all-passes | DCE removes a needed side effect |
| Performance/scalability blowups | Compiler crashes, times out, or exhausts memory on large input | --large-programs | Quadratic codegen on deep call graphs |
| Float precision divergence | Floating-point results differ between interpreter and JVM | --profile float, --stress | Precision accumulation in long arithmetic chains |
VAST generates programs in three categories, each serving a different purpose:
| Class | Purpose | Size | Example |
|---|---|---|---|
| Minimal semantic probes | Test one feature or interaction precisely | 5-20 AST nodes | `return (x + 0)`: tests identity folding |
| Feature interaction probes | Exercise combinations of language features | 20-200 AST nodes | Enum match inside try/except with nullable field access |
| Realistic structured programs | Stress scalability and complex control flow | 200-10,000 AST nodes | Multi-function program with loops, state, exceptions, and cross-function calls |
Minimal probes are cheap and targeted. Interaction probes find combination bugs. Realistic programs stress the compiler at scale. A good testing session uses all three.
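The size bands above can be encoded directly, for example to label programs in a run report. A small sketch (class names come from the table; assigning shared boundary sizes such as 20 nodes to the smaller class is an assumption, since the table's ranges overlap at the edges):

```python
def program_class(node_count: int) -> str:
    """Bucket a generated program by AST node count, per the table above.
    Programs on a shared boundary (20, 200) go to the smaller class --
    an assumption of this sketch."""
    if node_count <= 20:
        return "minimal semantic probe"
    if node_count <= 200:
        return "feature interaction probe"
    return "realistic structured program"

print(program_class(12))    # minimal semantic probe
print(program_class(150))   # feature interaction probe
print(program_class(5000))  # realistic structured program
```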
Use the right VAST mode for the compiler subsystem you changed.
| Compiler change | VAST modes to run |
|---|---|
| Optimizer (constant folder, DCE) | --opt-check, --verify-all-passes, --stress |
| IR lowering | --ir-check, --metamorphic |
| Codegen (bytecode generation) | Deep differential, --opt-check |
| Data types / aliasing / runtime | --stateful, --concurrency |
| Exception handling | --exception-propagation |
| Concurrency runtime | --concurrency |
| Parser / formatter | --round-trip, round-trip RC gate |
| Performance / scalability | --large-programs, --mode continuous |
| New language feature | Deep mode with complete profile, --interaction-coverage |
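This mapping is easy to automate, for example in a pre-push helper. A hedged sketch (the subsystem keys are illustrative labels chosen for this sketch, not names VAST defines; the flags come from the table above):

```python
# Map a changed compiler subsystem to the VAST flags suggested above.
# Dictionary keys are illustrative labels, not VAST-defined identifiers.
MODES_BY_SUBSYSTEM = {
    "optimizer": ["--opt-check", "--verify-all-passes", "--stress"],
    "ir-lowering": ["--ir-check", "--metamorphic"],
    "exceptions": ["--exception-propagation"],
    "concurrency": ["--concurrency"],
    "parser": ["--round-trip"],
}

def suggested_command(subsystem: str) -> str:
    """Build a deep-mode command line carrying the subsystem's flags.
    Unknown subsystems fall back to plain deep mode."""
    flags = MODES_BY_SUBSYSTEM.get(subsystem, [])
    return " ".join(["vary", "vast", "--mode", "deep", *flags])

print(suggested_command("optimizer"))
# vary vast --mode deep --opt-check --verify-all-passes --stress
```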
Coverage should drive testing choices, not just report status.
| Signal | What it means | Action |
|---|---|---|
| Low feature coverage | Generator or profile weakness | Add constructs to generator, run richer profiles |
| Low semantic coverage | Branch/value/behaviour gaps | Enable stress mode, run symbolic generation |
| Low interaction coverage | Missing combined constructs | Run complete profile with more programs |
| Low confidence but no mismatches | Insufficient exploration | Run continuous mode for longer, increase program count |
| Coverage plateau over multiple nights | Generator saturation | Switch profile weighting, add new generation families |
| Good coverage but repeated failures in one class | Subsystem weakness | Invest in targeted testing for that subsystem |
Every VAST failure should be classified into a bucket with a standard response.
| Failure type | Identified by | Response |
|---|---|---|
| Optimizer bug | AST/IR/JVM-unopt agree, JVM-opt differs | Minimize, check optimizer passes, store corpus entry |
| Codegen bug | AST/IR agree, JVM differs | Minimize, check bytecode generation |
| AST/IR disagreement | AST differs from IR and JVM | Check AST interpreter correctness |
| Parser round-trip failure | Format-reparse produces different AST | Check formatter and parser |
| Pass verification failure | Single pass changes semantics | Isolate which pass, minimize |
| Large-program compile failure | Compilation timeout or crash | Check scalability, reduce program |
| Performance regression | Compile time exceeds threshold | Profile compiler, check for algorithmic regression |
| Infrastructure/tester failure | Path crash, VAST internal error | Check VAST code, not compiler |
For every failure: minimize it, tag it, store it in the corpus, and identify the likely subsystem.
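The "Identified by" column is mechanical enough to automate as a first-pass triage step. A sketch of that logic, assuming each execution path yields a comparable result value (the path argument names are illustrative, not fields of any VAST report format):

```python
def classify_failure(ast, ir, jvm_unopt, jvm_opt) -> str:
    """Bucket a differential mismatch by which paths agree, following the
    triage table above. Each argument is that execution path's result."""
    if ast == ir == jvm_unopt and jvm_opt != jvm_unopt:
        return "optimizer bug"        # only the optimized JVM path differs
    if ast == ir and jvm_unopt != ast:
        return "codegen bug"          # AST/IR agree, JVM output differs
    if ast != ir and ir == jvm_unopt:
        return "AST/IR disagreement"  # reference oracle is the outlier
    return "unclassified: inspect manually"
```

The fall-through case matters: a mismatch that fits no row should go to a human, not be silently bucketed.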
Coverage is useful only if it drives decisions. These thresholds define what counts as "enough".
| Metric | Threshold | Action if below |
|---|---|---|
| Feature coverage (complete profile) | 100% | Generator bug: all enabled constructs must appear |
| Semantic coverage (deep mode) | 80% | Increase program count or enable --stress / --symbolic |
| Corpus replay pass rate | 100% | Regression: block release, investigate |
| Confidence score | No drop > 5 points from previous RC | Investigate cause of confidence loss |
| Pass verification | Zero mismatches | Optimizer bug: block release |
| Sabotage detection | All 4 modes detected | VAST infrastructure gap: fix before release |
| Metric | Trigger | Action |
|---|---|---|
| Interaction coverage (complete) | Below 50% | Run more programs with complete profile |
| Semantic coverage stagnation | Same value for 3+ nights | Generator may be saturated: review profile weights |
| Bug yield decline | Zero new mismatches for 7+ nights | Normal if compiler is stable; flag if new features were added |
| Large-program timeout rate | Above 10% | Check for performance regression in codegen |
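The stagnation and yield-decline triggers are trend checks over nightly history. A minimal sketch (history is represented here as a plain list of nightly values; how you actually store it is up to your pipeline):

```python
def coverage_stagnant(nightly: list[float], nights: int = 3) -> bool:
    """True when the most recent `nights` values are identical
    (the 'same value for 3+ nights' trigger above)."""
    recent = nightly[-nights:]
    return len(recent) == nights and len(set(recent)) == 1

def yield_declined(nightly_mismatches: list[int], nights: int = 7) -> bool:
    """True when no new mismatches appeared for `nights` straight nights
    (the 'zero new mismatches for 7+ nights' trigger above)."""
    recent = nightly_mismatches[-nights:]
    return len(recent) == nights and sum(recent) == 0
```

Per the table, a positive `yield_declined` is only a concern when new features landed in the window; the check flags it, a human interprets it.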
Coverage numbers are not goals in themselves. They are signals about generator quality and testing thoroughness.
| Pattern | Interpretation | Action |
|---|---|---|
| High feature coverage, low semantic coverage | Generator uses all constructs but does not exercise interesting behaviours (edge values, error paths, deep nesting) | Enable --stress and --symbolic |
| High semantic coverage, low interaction coverage | Individual features are well-tested but combinations are not | Run the complete profile with larger program counts |
| High confidence, no mismatches | The system is working; normal state for a stable compiler | None required |
| Low confidence despite clean runs | Insufficient exploration | Increase program counts or run continuous mode for longer |
| Rising mismatch count | Active compiler bugs | Minimize, store, investigate. Do not release |
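These patterns can be checked mechanically from a run summary. A sketch (the 80%/50% thresholds follow the gate tables in this page; treating "low confidence" as a score under 80 is an assumption of this sketch, not a documented cutoff):

```python
def interpret_run(feature: float, semantic: float, interaction: float,
                  new_mismatches: int, confidence: float) -> str:
    """Map one run's signals to the recommended action from the table.
    Coverage values are fractions in [0, 1]; confidence is a score."""
    if new_mismatches > 0:
        return "minimize, store, investigate; do not release"
    if feature >= 0.8 and semantic < 0.8:
        return "enable --stress and --symbolic"
    if semantic >= 0.8 and interaction < 0.5:
        return "run the complete profile with larger program counts"
    if confidence < 80:  # assumed meaning of "low confidence"
        return "increase program counts or run continuous mode longer"
    return "none required"
```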
A release decision should consume these signals:
| Signal | Release gate |
|---|---|
| Mismatch count | Zero unreduced new mismatches |
| Corpus replay | 100% pass rate (no regressions) |
| Confidence score | No drop beyond 5 points from previous release |
| Semantic coverage | Above 80% in deep mode |
| Interaction coverage | Above 50% in complete profile |
| Path health | All paths agree on calibration set |
| Performance trend | No compile-time regression above 25% |
| Large-program stability | No new crash or timeout regressions |
| Pass verification | All optimizer passes preserve semantics |
| Sabotage validation | All 4 sabotage modes detected |
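Taken together, the table reads as a pure predicate over a release summary. A sketch covering a representative subset of the gates (the field names are illustrative; adapt them to whatever report format your pipeline actually emits):

```python
def failed_gates(r: dict) -> list[str]:
    """Return the names of failed release gates; empty means releasable.
    Covers a subset of the table's signals; field names are assumptions."""
    failures = []
    if r["new_mismatches"] != 0:
        failures.append("mismatch count")
    if r["corpus_replay_pass_rate"] < 1.0:
        failures.append("corpus replay")
    if r["prev_confidence"] - r["confidence"] > 5:
        failures.append("confidence score")
    if r["semantic_coverage"] <= 0.80:        # table: above 80%
        failures.append("semantic coverage")
    if r["interaction_coverage"] <= 0.50:     # table: above 50%
        failures.append("interaction coverage")
    if r["compile_time_regression"] > 0.25:   # table: no regression above 25%
        failures.append("performance trend")
    if r["sabotage_modes_detected"] != 4:
        failures.append("sabotage validation")
    return failures
```

Emitting the list of failed gates, rather than a bare pass/fail bit, keeps the release decision auditable.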