Alpha. Vary is under active development and not ready for production use. Syntax, APIs, performance, and behaviour may change between releases.
# Testing playbook
VAST is a collection of testing modes. This page tells you which modes to use, when, and what to do with the results.
## Profile and mode matrix

Each VAST mode and profile has a purpose, a cost, and a suitable cadence.
| Mode / Profile | Purpose | Cost | Best bug classes | Cadence | Gating? | Signal |
|---|---|---|---|---|---|---|
| --mode fast | Smoke differential correctness | ~2 min, 100 programs | Codegen, obvious regressions | PR, local | Yes (quick gate) | High: catches gross breakage |
| --profile complete | Broad semantic confidence | ~5 min per 500 programs | Feature interactions, type bugs | Nightly | No (informational) | Medium: breadth over depth |
| --opt-check | Optimizer regression hunting | 2x cost (4 paths) | Optimizer bugs, pass corruption | Nightly, RC | Yes (RC gate) | High: catches real optimizer bugs |
| --stress | Edge-case value testing | +30% expression cost | Boundary bugs, overflow, precision | Nightly | No | Medium: targets specific edges |
| --stateful | Mutable state accumulation bugs | ~3 min per 500 programs | Loop/state bugs, off-by-one | Nightly | No | Medium: finds subtle state bugs |
| --aliasing | Reference propagation bugs | ~2 min per 500 programs | Aliasing, sharing, field chain bugs | Nightly, RC | No | Medium: targets specific subsystem |
| --exception-propagation | Stack unwinding mismatches | ~2 min per 500 programs | Exception codegen, catch routing | Nightly | No | Medium |
| --concurrency | Deterministic concurrency | ~3 min per 500 programs | Fork-join, pipeline, scheduling | Nightly | No | Medium |
| --symbolic | Constraint-driven edge values | ~2 min per 500 programs | Branch condition, range partition | Nightly | No | Medium: complements stress |
| --ir-check | IR translation equivalence | ~3 min per 1000 programs | Lowering bugs, IR/JVM divergence | Nightly, RC | Yes (RC gate) | High |
| --verify-all-passes | Per-pass optimizer isolation | ~5 min per 1000 programs | Pass-specific semantic corruption | Nightly, RC | Yes (RC gate) | High |
| --large-programs | Compiler scalability | ~5 min per 20 programs | Crashes, timeouts, memory exhaustion | Nightly, RC | No (informational) | Low volume, high impact |
| --mode continuous | Adaptive coverage exploration | Time-bounded (configurable) | Coverage gaps, undertested areas | Post-RC, soak | No | Cumulative: improves over time |
| --metamorphic | Semantics-preserving equivalence | +50% cost | Optimizer commutativity, folding bugs | Nightly | No | Medium |
| --mutate | VAST self-validation | +5x per program | VAST blind spots | Nightly | No | Meta: tests the tester |
| Corpus replay | Known regression detection | ~30s | Previously found bugs | Every RC (gate 3) | Yes (hard gate) | Very high: prevents regressions |
## Four testing purposes
Every VAST run should have one of these purposes.
### 1. Smoke validation

| Question | Did we obviously break the compiler? |
|---|---|
| Use | Fast mode, small number of profiles, baseline differential comparison, parser round-trip, corpus replay |
| Run on | Every PR, local development |
| Success means | Cheap and reliable signal that nothing is obviously broken |

```sh
vary vast --mode fast
```
### 2. Semantic regression hunting

| Question | Did a recent compiler change break semantics? |
|---|---|
| Use | Deep mode, --opt-check, semantic coverage, interaction coverage, stress mode, specialized profiles (stateful, aliasing, exceptions, concurrency, symbolic) |
| Run on | Nightly, before RC |
| Success means | High bug-finding probability on recent changes |

```sh
./bin/vary vast --mode deep --verbose --opt-check --stress
```
### 3. Trust and infrastructure validation

| Question | Is VAST itself still believable? |
|---|---|
| Use | Sabotage modes, trusted path calibration, path health matrix, corpus replay, regression generation validation |
| Run on | Nightly or post-RC, whenever VAST internals change |
| Success means | The tester is still testing |

```sh
make vast-negative
vary vast --profile core --count 100 --calibrate --path-health
```
### 4. Scalability and soak

| Question | Does the compiler survive hard or long-running conditions? |
|---|---|
| Use | Continuous exploration, large programs, compile-time/memory/performance metrics, failure artifact collection |
| Run on | Nightly, scheduled soak, pre-release |
| Success means | Stability under volume and size |

```sh
vary vast --mode continuous --duration 300
./bin/vary vast --mode deep --large-programs --verbose
```
## CI run profiles

### PR (fast smoke)
Run only fast high-signal checks:
| Check | Command |
|---|---|
| Corpus replay | make rc (gate 3) |
| Fast differential | vary vast --mode fast |
| Parser round-trip | Included in fast mode |
Goal: quick breakage detection, under 2 minutes.
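As a sketch, the PR smoke checks above can be wired into a tiny runner. Only the command strings come from the table; the runner, check names, and the per-check timeout are illustrative assumptions, not part of the Vary repo.

```python
import shlex
import subprocess

# Hypothetical PR smoke gate. Only the command strings come from the
# playbook table; everything else here is an illustrative assumption.
PR_CHECKS = [
    ("corpus replay (gate 3)", "make rc"),
    ("fast differential", "vary vast --mode fast"),  # parser round-trip included
]

def gate_argv() -> list[tuple[str, list[str]]]:
    """Split each check's command line into argv form for subprocess.run."""
    return [(name, shlex.split(cmd)) for name, cmd in PR_CHECKS]

def run_gate() -> bool:
    """Run every check, enforcing the 'under 2 minutes' goal per check."""
    for name, argv in gate_argv():
        result = subprocess.run(argv, timeout=120)
        if result.returncode != 0:
            print(f"PR gate failed at: {name}")
            return False
    return True
```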
### Nightly (semantic exploration)
Run broader semantic exploration with specialized generators:
| Section | What it runs |
|---|---|
| RC validation | 11-gate pipeline |
| Deep differential | --mode deep --opt-check --stress |
| Metamorphic + round-trip + coverage | --metamorphic --round-trip --show-coverage |
| Mutation expansion | --mutate |
| Specialized generators | --stateful, --concurrency, --exception-propagation, --symbolic |
| Large programs | --large-programs |
| IR translation checks | --ir-check |
| Confidence report | --confidence |
| Negative validation | Sabotage probes |
| Corpus growth | Exploration + auto-reduction |
Goal: bug discovery and confidence growth.
### RC / pre-release (trust validation)
Run the most expensive trusted suite:
| Check | Purpose |
|---|---|
| Deep corpus replay | All known-good programs pass |
| Deep + all specialized profiles | Maximum semantic coverage |
| Pass verification | --verify-all-passes |
| IR translation checks | --ir-check |
| Large programs | Scale/crash/perf stress |
| Continuous exploration (bounded) | Coverage-gap hunting |
| Sabotage validation | Trust infrastructure check |
| Artifact retention | 14-day failure artifact storage |
Goal: shipping confidence.
## Bug taxonomy
VAST is effective at finding specific classes of bugs. Knowing the taxonomy helps interpret failures and choose the right testing mode.
| Bug class | Description | Best VAST mode | Example |
|---|---|---|---|
| Codegen bugs | Wrong bytecode emitted for an expression or statement | Deep differential, --opt-check | a - b compiled to a + b in nested context |
| Optimizer semantic corruption | Optimization pass changes program behaviour | --opt-check, --verify-all-passes | Constant folder evaluates edge-case arithmetic wrong |
| IR lowering bugs | AST to IR translation loses or changes semantics | --ir-check, deep differential | Variable binding lost during lowering |
| AST interpreter bugs | Reference oracle itself produces wrong result | Deep differential (AST disagrees with IR + JVM) | Interpreter mishandles enum dispatch |
| Aliasing/reference bugs | Mutation through one alias not visible through another | --aliasing | Shared data object not updated through alias |
| Exception unwinding mismatches | Stack unwinding differs between paths | --exception-propagation | Catch block executed on wrong path |
| State accumulation bugs | Bug only surfaces after many mutation steps | --stateful | Off-by-one in loop iteration interacting with mutable state |
| Concurrency determinism bugs | Parallel execution produces different result across runs | --concurrency | Fork-join result depends on scheduling |
| Parser round-trip instability | Format-then-reparse produces different AST | --round-trip | Formatter drops parentheses, changing precedence |
| Pass-specific breakage | One optimizer pass breaks semantics while others pass | --verify-all-passes | DCE removes a needed side effect |
| Performance/scalability blowups | Compiler crashes, times out, or exhausts memory on large input | --large-programs | Quadratic codegen on deep call graphs |
| Float precision divergence | Floating-point results differ between interpreter and JVM | --profile float, --stress | Precision accumulation in long arithmetic chains |
## Three classes of generated programs
VAST generates programs in three categories, each serving a different purpose:
| Class | Purpose | Size | Example |
|---|---|---|---|
| Minimal semantic probes | Test one feature or interaction precisely | 5-20 AST nodes | return (x + 0): tests identity folding |
| Feature interaction probes | Exercise combinations of language features | 20-200 AST nodes | Enum match inside try/except with nullable field access |
| Realistic structured programs | Stress scalability and complex control flow | 200-10,000 AST nodes | Multi-function program with loops, state, exceptions, and cross-function calls |
Minimal probes are cheap and targeted. Interaction probes find combination bugs. Realistic programs stress the compiler at scale. A good testing session uses all three.
## Mode-to-bug-class mapping
Use the right VAST mode for the compiler subsystem you changed.
| Compiler change | VAST modes to run |
|---|---|
| Optimizer (constant folder, DCE) | --opt-check, --verify-all-passes, --stress |
| IR lowering | --ir-check, --metamorphic |
| Codegen (bytecode generation) | Deep differential, --opt-check |
| Data types / aliasing / runtime | --stateful, --concurrency |
| Exception handling | --exception-propagation |
| Concurrency runtime | --concurrency |
| Parser / formatter | --round-trip, round-trip RC gate |
| Performance / scalability | --large-programs, --mode continuous |
| New language feature | Deep mode with complete profile, --interaction-coverage |
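The mapping above is mechanical enough to encode. A minimal sketch, assuming illustrative subsystem keys and a helper function that are not part of the VAST CLI; only the flag combinations come from the table:

```python
# Hypothetical lookup from a changed compiler subsystem to the VAST modes
# worth running. Keys are illustrative labels; flags mirror the table above.
MODES_FOR_CHANGE = {
    "optimizer": ["--opt-check", "--verify-all-passes", "--stress"],
    "ir-lowering": ["--ir-check", "--metamorphic"],
    "codegen": ["--mode deep", "--opt-check"],
    "runtime": ["--stateful", "--concurrency"],
    "exceptions": ["--exception-propagation"],
    "parser": ["--round-trip"],
    "performance": ["--large-programs", "--mode continuous"],
}

def suggested_command(subsystem: str) -> str:
    """Build a vast invocation for the subsystem that changed."""
    flags = MODES_FOR_CHANGE.get(subsystem)
    if flags is None:
        # Unknown or cross-cutting change: fall back to the broad deep run.
        flags = ["--mode deep"]
    return "vary vast " + " ".join(flags)
```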
## Coverage as a steering tool
Coverage should drive testing choices, not just report status.
| Signal | What it means | Action |
|---|---|---|
| Low feature coverage | Generator or profile weakness | Add constructs to generator, run richer profiles |
| Low semantic coverage | Branch/value/behaviour gaps | Enable stress mode, run symbolic generation |
| Low interaction coverage | Missing combined constructs | Run complete profile with more programs |
| Low confidence but no mismatches | Insufficient exploration | Run continuous mode for longer, increase program count |
| Coverage plateau over multiple nights | Generator saturation | Switch profile weighting, add new generation families |
| Good coverage but repeated failures in one class | Subsystem weakness | Invest in targeted testing for that subsystem |
## Failure classification and routing
Every VAST failure should be classified into a bucket with a standard response.
| Failure type | Identified by | Response |
|---|---|---|
| Optimizer bug | AST/IR/JVM-unopt agree, JVM-opt differs | Minimize, check optimizer passes, store corpus entry |
| Codegen bug | AST/IR agree, JVM differs | Minimize, check bytecode generation |
| AST/IR disagreement | AST differs from IR and JVM | Check AST interpreter correctness |
| Parser round-trip failure | Format-reparse produces different AST | Check formatter and parser |
| Pass verification failure | Single pass changes semantics | Isolate which pass, minimize |
| Large-program compile failure | Compilation timeout or crash | Check scalability, reduce program |
| Performance regression | Compile time exceeds threshold | Profile compiler, check for algorithmic regression |
| Infrastructure/tester failure | Path crash, VAST internal error | Check VAST code, not compiler |
For every failure: minimize it, tag it, store it in the corpus, and identify the likely subsystem.
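The routing rules above reduce to asking which execution paths agree. A minimal sketch of that comparison, assuming illustrative path names ("ast", "ir", "jvm_unopt", "jvm_opt") rather than VAST's internal API:

```python
def classify_mismatch(results: dict[str, object]) -> str:
    """Route a differential mismatch by which execution paths agree.

    `results` maps an assumed path name ("ast", "ir", "jvm_unopt",
    "jvm_opt") to that path's observed output.
    """
    ast, ir = results["ast"], results["ir"]
    unopt, opt = results["jvm_unopt"], results["jvm_opt"]
    if ast == ir == unopt == opt:
        return "no mismatch"
    if ast == ir == unopt and opt != unopt:
        # Only the optimized JVM path diverges.
        return "optimizer bug"
    if ast == ir and unopt != ast:
        # Both JVM paths diverge from the AST/IR consensus.
        return "codegen bug"
    if ast != ir and ir == unopt:
        # The reference oracle is the odd one out.
        return "AST interpreter bug"
    return "unclassified: check VAST infrastructure"
```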
## Coverage adequacy policy
Coverage is useful only if it drives decisions. These thresholds define what counts as "enough".
### RC-blocking thresholds
| Metric | Threshold | Action if below |
|---|---|---|
| Feature coverage (complete profile) | 100% | Generator bug: all enabled constructs must appear |
| Semantic coverage (deep mode) | 80% | Increase program count or enable --stress / --symbolic |
| Corpus replay pass rate | 100% | Regression: block release, investigate |
| Confidence score | No drop > 5 points from previous RC | Investigate cause of confidence loss |
| Pass verification | Zero mismatches | Optimizer bug: block release |
| Sabotage detection | All 4 modes detected | VAST infrastructure gap: fix before release |
### Warning thresholds (non-blocking)
| Metric | Threshold | Action |
|---|---|---|
| Interaction coverage (complete) | Below 50% | Run more programs with complete profile |
| Semantic coverage stagnation | Same value for 3+ nights | Generator may be saturated: review profile weights |
| Bug yield decline | Zero new mismatches for 7+ nights | Normal if compiler is stable; flag if new features were added |
| Large-program timeout rate | Above 10% | Check for performance regression in codegen |
### How to interpret coverage gaps
Coverage numbers are not goals in themselves. They are signals about generator quality and testing thoroughness.
| Pattern | Interpretation | Action |
|---|---|---|
| High feature coverage, low semantic coverage | Generator uses all constructs but does not exercise interesting behaviours (edge values, error paths, deep nesting) | Enable --stress and --symbolic |
| High semantic coverage, low interaction coverage | Individual features are well-tested but combinations are not | Run the complete profile with larger program counts |
| High confidence, no mismatches | The system is working; normal state for a stable compiler | None required |
| Low confidence despite clean runs | Insufficient exploration | Increase program counts or run continuous mode for longer |
| Rising mismatch count | Active compiler bugs | Minimize, store, investigate. Do not release |
## Release decision policy
A release decision should consume these signals:
| Signal | Release gate |
|---|---|
| Mismatch count | Zero unreduced new mismatches |
| Corpus replay | 100% pass rate (no regressions) |
| Confidence score | No drop beyond 5 points from previous release |
| Semantic coverage | Above 80% in deep mode |
| Interaction coverage | Above 50% in complete profile |
| Path health | All paths agree on calibration set |
| Performance trend | No compile-time regression above 25% |
| Large-program stability | No new crash or timeout regressions |
| Pass verification | All optimizer passes preserve semantics |
| Sabotage validation | All 4 sabotage modes detected |
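Taken together, these signals fold into a single go/no-go decision. A sketch under assumed metric names (the thresholds mirror the table; the function itself is illustrative, not part of Vary):

```python
def release_gates(m: dict[str, float]) -> dict[str, bool]:
    """Evaluate each release gate; True means the gate passes.

    Keys of `m` are assumed metric names; thresholds mirror the table.
    """
    return {
        "mismatches": m["new_mismatches"] == 0,
        "corpus_replay": m["replay_pass_rate"] == 100.0,
        "confidence": m["prev_confidence"] - m["confidence"] <= 5.0,
        "semantic_coverage": m["semantic_coverage"] > 80.0,
        "interaction_coverage": m["interaction_coverage"] > 50.0,
        "performance": m["compile_time_regression_pct"] <= 25.0,
        "pass_verification": m["pass_mismatches"] == 0,
        "sabotage": m["sabotage_detected"] == 4,
    }

def can_release(m: dict[str, float]) -> bool:
    """Release only when every gate passes."""
    return all(release_gates(m).values())
```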