VAST

Testing playbook

VAST is a collection of testing modes. This page tells you which modes to use, when, and what to do with the results.

Profile and mode matrix

Each VAST mode/profile has a purpose, cost, and suitable cadence.

| Mode / Profile | Purpose | Cost | Best bug classes | Cadence | Gating? | Signal |
|---|---|---|---|---|---|---|
| `--mode fast` | Smoke differential correctness | ~2 min, 100 programs | Codegen, obvious regressions | PR, local | Yes (quick gate) | High: catches gross breakage |
| `--profile complete` | Broad semantic confidence | ~5 min per 500 programs | Feature interactions, type bugs | Nightly | No (informational) | Medium: breadth over depth |
| `--opt-check` | Optimizer regression hunting | 2x cost (4 paths) | Optimizer bugs, pass corruption | Nightly, RC | Yes (RC gate) | High: catches real optimizer bugs |
| `--stress` | Edge-case value testing | +30% expression cost | Boundary bugs, overflow, precision | Nightly | No | Medium: targets specific edges |
| `--stateful` | Mutable state accumulation bugs | ~3 min per 500 programs | Loop/state bugs, off-by-one | Nightly | No | Medium: finds subtle state bugs |
| `--aliasing` | Reference propagation bugs | ~2 min per 500 programs | Aliasing, sharing, field chain bugs | Nightly, RC | No | Medium: targets specific subsystem |
| `--exception-propagation` | Stack unwinding mismatches | ~2 min per 500 programs | Exception codegen, catch routing | Nightly | No | Medium |
| `--concurrency` | Deterministic concurrency | ~3 min per 500 programs | Fork-join, pipeline, scheduling | Nightly | No | Medium |
| `--symbolic` | Constraint-driven edge values | ~2 min per 500 programs | Branch condition, range partition | Nightly | No | Medium: complements stress |
| `--ir-check` | IR translation equivalence | ~3 min per 1000 programs | Lowering bugs, IR/JVM divergence | Nightly, RC | Yes (RC gate) | High |
| `--verify-all-passes` | Per-pass optimizer isolation | ~5 min per 1000 programs | Pass-specific semantic corruption | Nightly, RC | Yes (RC gate) | High |
| `--large-programs` | Compiler scalability | ~5 min per 20 programs | Crashes, timeouts, memory exhaustion | Nightly, RC | No (informational) | Low volume, high impact |
| `--mode continuous` | Adaptive coverage exploration | Time-bounded (configurable) | Coverage gaps, undertested areas | Post-RC, soak | No | Cumulative: improves over time |
| `--metamorphic` | Semantics-preserving equivalence | +50% cost | Optimizer commutativity, folding bugs | Nightly | No | Medium |
| `--mutate` | VAST self-validation | +5x per program | VAST blind spots | Nightly | No | Meta: tests the tester |
| Corpus replay | Known regression detection | ~30s | Previously found bugs | Every RC (gate 3) | Yes (hard gate) | Very high: prevents regressions |
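All of these modes rest on the same differential core: run each generated program through several independent execution paths and flag any disagreement. A minimal sketch of that comparison loop, using toy `eval`-based lambdas as stand-ins for VAST's real AST/IR/JVM paths (the names here are illustrative, not VAST's API):

```python
# Sketch of differential comparison across execution paths.
# The path runners below are toy stand-ins, not VAST's real backends.

def differential_check(program, paths):
    """Run `program` on every path; return the names of paths that disagree."""
    results = {name: run(program) for name, run in paths.items()}
    baseline = next(iter(results.values()))
    return sorted(name for name, out in results.items() if out != baseline)

# Toy paths: three agree, one (a "buggy optimizer") diverges on subtraction.
paths = {
    "ast":       lambda p: eval(p),
    "ir":        lambda p: eval(p),
    "jvm_unopt": lambda p: eval(p),
    "jvm_opt":   lambda p: eval(p.replace("-", "+")),  # injected bug
}

print(differential_check("7 - 3", paths))  # only the jvm_opt path disagrees
print(differential_check("7 + 3", paths))  # all paths agree: empty list
```

The value of extra paths is exactly this cross-check: a single wrong oracle shows up as one outlier rather than a silent pass.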

Four testing purposes

Every VAST run should have one of these purposes.

1. Smoke validation

- **Question:** Did we obviously break the compiler?
- **Use:** Fast mode, small number of profiles, baseline differential comparison, parser round-trip, corpus replay
- **Run on:** Every PR, local development
- **Success means:** Cheap and reliable signal that nothing is obviously broken

```shell
vary vast --mode fast
```

2. Semantic regression hunting

- **Question:** Did a recent compiler change break semantics?
- **Use:** Deep mode, `--opt-check`, semantic coverage, interaction coverage, stress mode, specialized profiles (stateful, aliasing, exceptions, concurrency, symbolic)
- **Run on:** Nightly, before RC
- **Success means:** High bug-finding probability on recent changes

```shell
./bin/vary vast --mode deep --verbose --opt-check --stress
```

3. Trust and infrastructure validation

- **Question:** Is VAST itself still believable?
- **Use:** Sabotage modes, trusted path calibration, path health matrix, corpus replay, regression generation validation
- **Run on:** Nightly or post-RC, whenever VAST internals change
- **Success means:** The tester is still testing

```shell
make vast-negative
vary vast --profile core --count 100 --calibrate --path-health
```

4. Scalability and soak

- **Question:** Does the compiler survive hard or long-running conditions?
- **Use:** Continuous exploration, large programs, compile-time/memory/performance metrics, failure artifact collection
- **Run on:** Nightly, scheduled soak, pre-release
- **Success means:** Stability under volume and size

```shell
vary vast --mode continuous --duration 300
./bin/vary vast --mode deep --large-programs --verbose
```

CI run profiles

PR (fast smoke)

Run only fast high-signal checks:

| Check | Command |
|---|---|
| Corpus replay | `make rc` (gate 3) |
| Fast differential | `vary vast --mode fast` |
| Parser round-trip | Included in fast mode |

Goal: quick breakage detection, under 2 minutes.

Nightly (semantic exploration)

Run broader semantic exploration with specialized generators:

| Section | What it runs |
|---|---|
| RC validation | 11-gate pipeline |
| Deep differential | `--mode deep --opt-check --stress` |
| Metamorphic + round-trip + coverage | `--metamorphic --round-trip --show-coverage` |
| Mutation expansion | `--mutate` |
| Specialized generators | `--stateful`, `--concurrency`, `--exception-propagation`, `--symbolic` |
| Large programs | `--large-programs` |
| IR translation checks | `--ir-check` |
| Confidence report | `--confidence` |
| Negative validation | Sabotage probes |
| Corpus growth | Exploration + auto-reduction |

Goal: bug discovery and confidence growth.

RC / pre-release (trust validation)

Run the most expensive trusted suite:

| Check | Purpose |
|---|---|
| Deep corpus replay | All known-good programs pass |
| Deep + all specialized profiles | Maximum semantic coverage |
| Pass verification | `--verify-all-passes` |
| IR translation checks | `--ir-check` |
| Large programs | Scale/crash/perf stress |
| Continuous exploration (bounded) | Coverage-gap hunting |
| Sabotage validation | Trust infrastructure check |
| Artifact retention | 14-day failure artifact storage |

Goal: shipping confidence.

Bug taxonomy

VAST is effective at finding specific classes of bugs. Knowing the taxonomy helps interpret failures and choose the right testing mode.

| Bug class | Description | Best VAST mode | Example |
|---|---|---|---|
| Codegen bugs | Wrong bytecode emitted for an expression or statement | Deep differential, `--opt-check` | `a - b` compiled to `a + b` in nested context |
| Optimizer semantic corruption | Optimization pass changes program behaviour | `--opt-check`, `--verify-all-passes` | Constant folder evaluates edge-case arithmetic wrong |
| IR lowering bugs | AST-to-IR translation loses or changes semantics | `--ir-check`, deep differential | Variable binding lost during lowering |
| AST interpreter bugs | Reference oracle itself produces wrong result | Deep differential (AST disagrees with IR + JVM) | Interpreter mishandles enum dispatch |
| Aliasing/reference bugs | Mutation through one alias not visible through another | `--aliasing` | Shared data object not updated through alias |
| Exception unwinding mismatches | Stack unwinding differs between paths | `--exception-propagation` | Catch block executed on wrong path |
| State accumulation bugs | Bug only surfaces after many mutation steps | `--stateful` | Off-by-one in loop iteration interacting with mutable state |
| Concurrency determinism bugs | Parallel execution produces different result across runs | `--concurrency` | Fork-join result depends on scheduling |
| Parser round-trip instability | Format-then-reparse produces different AST | `--round-trip` | Formatter drops parentheses, changing precedence |
| Pass-specific breakage | One optimizer pass breaks semantics while others pass | `--verify-all-passes` | DCE removes a needed side effect |
| Performance/scalability blowups | Compiler crashes, times out, or exhausts memory on large input | `--large-programs` | Quadratic codegen on deep call graphs |
| Float precision divergence | Floating-point results differ between interpreter and JVM | `--profile float`, `--stress` | Precision accumulation in long arithmetic chains |

Three classes of generated programs

VAST generates programs in three categories, each serving a different purpose:

| Class | Purpose | Size | Example |
|---|---|---|---|
| Minimal semantic probes | Test one feature or interaction precisely | 5-20 AST nodes | `return (x + 0)`: tests identity folding |
| Feature interaction probes | Exercise combinations of language features | 20-200 AST nodes | Enum match inside try/except with nullable field access |
| Realistic structured programs | Stress scalability and complex control flow | 200-10,000 AST nodes | Multi-function program with loops, state, exceptions, and cross-function calls |

Minimal probes are cheap and targeted. Interaction probes find combination bugs. Realistic programs stress the compiler at scale. A good testing session uses all three.
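As an illustration of how a session might balance the three classes, the size budgets from the table above can be expressed as a weighted sampler. The class names and weights below are hypothetical, not VAST's actual configuration:

```python
import random

# Node budgets taken from the table above; names and weights are
# illustrative, not VAST's real generator configuration.
PROGRAM_CLASSES = {
    "minimal_probe":     {"nodes": (5, 20),       "weight": 0.5},
    "interaction_probe": {"nodes": (20, 200),     "weight": 0.4},
    "realistic_program": {"nodes": (200, 10_000), "weight": 0.1},
}

def pick_program_size(rng):
    """Pick a class by weight, then a node budget within its range."""
    names = list(PROGRAM_CLASSES)
    weights = [PROGRAM_CLASSES[n]["weight"] for n in names]
    cls = rng.choices(names, weights=weights)[0]
    lo, hi = PROGRAM_CLASSES[cls]["nodes"]
    return cls, rng.randint(lo, hi)

rng = random.Random(0)
cls, size = pick_program_size(rng)
lo, hi = PROGRAM_CLASSES[cls]["nodes"]
assert lo <= size <= hi
```

Weighting toward small probes keeps per-bug reduction cheap while still reserving budget for scale stress.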

Mode-to-bug-class mapping

Use the right VAST mode for the compiler subsystem you changed.

| Compiler change | VAST modes to run |
|---|---|
| Optimizer (constant folder, DCE) | `--opt-check`, `--verify-all-passes`, `--stress` |
| IR lowering | `--ir-check`, `--metamorphic` |
| Codegen (bytecode generation) | Deep differential, `--opt-check` |
| Data types / aliasing / runtime | `--stateful`, `--concurrency` |
| Exception handling | `--exception-propagation` |
| Concurrency runtime | `--concurrency` |
| Parser / formatter | `--round-trip`, round-trip RC gate |
| Performance / scalability | `--large-programs`, `--mode continuous` |
| New language feature | Deep mode with complete profile, `--interaction-coverage` |
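This mapping is mechanical enough to encode, for example in a pre-commit helper that suggests modes for a touched subsystem. The subsystem keys and helper name below are chosen for illustration; the flag lists come from the table above:

```python
# Table above as a lookup; subsystem keys are illustrative labels.
MODES_FOR_CHANGE = {
    "optimizer":   ["--opt-check", "--verify-all-passes", "--stress"],
    "ir-lowering": ["--ir-check", "--metamorphic"],
    "codegen":     ["--mode deep", "--opt-check"],
    "runtime":     ["--stateful", "--concurrency"],
    "exceptions":  ["--exception-propagation"],
    "concurrency": ["--concurrency"],
    "parser":      ["--round-trip"],
    "performance": ["--large-programs", "--mode continuous"],
}

def suggest_modes(subsystem):
    """Return suggested flags; default to the new-feature row for unknowns."""
    return MODES_FOR_CHANGE.get(subsystem, ["--profile complete", "--interaction-coverage"])

assert suggest_modes("optimizer") == ["--opt-check", "--verify-all-passes", "--stress"]
```

Defaulting unknown subsystems to the broad complete-profile row errs toward breadth rather than silently skipping coverage.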

Coverage as a steering tool

Coverage should drive testing choices, not just report status.

| Signal | What it means | Action |
|---|---|---|
| Low feature coverage | Generator or profile weakness | Add constructs to generator, run richer profiles |
| Low semantic coverage | Branch/value/behaviour gaps | Enable stress mode, run symbolic generation |
| Low interaction coverage | Missing combined constructs | Run complete profile with more programs |
| Low confidence but no mismatches | Insufficient exploration | Run continuous mode for longer, increase program count |
| Coverage plateau over multiple nights | Generator saturation | Switch profile weighting, add new generation families |
| Good coverage but repeated failures in one class | Subsystem weakness | Invest in targeted testing for that subsystem |

Failure classification and routing

Every VAST failure should be classified into a bucket with a standard response.

| Failure type | Identified by | Response |
|---|---|---|
| Optimizer bug | AST/IR/JVM-unopt agree, JVM-opt differs | Minimize, check optimizer passes, store corpus entry |
| Codegen bug | AST/IR agree, JVM differs | Minimize, check bytecode generation |
| AST/IR disagreement | AST differs from IR and JVM | Check AST interpreter correctness |
| Parser round-trip failure | Format-reparse produces different AST | Check formatter and parser |
| Pass verification failure | Single pass changes semantics | Isolate which pass, minimize |
| Large-program compile failure | Compilation timeout or crash | Check scalability, reduce program |
| Performance regression | Compile time exceeds threshold | Profile compiler, check for algorithmic regression |
| Infrastructure/tester failure | Path crash, VAST internal error | Check VAST code, not compiler |

For every failure: minimize it, tag it, store it in the corpus, and identify the likely subsystem.
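The "identified by" column is effectively a decision procedure over path agreement. A sketch of the first three buckets as code (an illustration of the routing logic, not VAST's actual classifier):

```python
def classify_failure(results):
    """Bucket a mismatch by which execution paths agree.

    `results` maps path name -> observed output; path names mirror the
    routing table above. Illustrative sketch, not VAST's real router.
    """
    ast, ir = results["ast"], results["ir"]
    unopt, opt = results["jvm_unopt"], results["jvm_opt"]
    if ast == ir == unopt and opt != unopt:
        return "optimizer bug"       # only the optimized JVM path diverges
    if ast == ir and unopt != ast:
        return "codegen bug"         # both JVM paths diverge from AST/IR
    if ast != ir:
        return "AST/IR disagreement" # reference oracle itself is suspect
    return "unclassified"

assert classify_failure({"ast": 4, "ir": 4, "jvm_unopt": 4, "jvm_opt": 10}) == "optimizer bug"
```

Routing on agreement patterns first, before reading any code, keeps triage cheap and points the minimized repro at the right subsystem.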

Coverage adequacy policy

Coverage is useful only if it drives decisions. These thresholds define what counts as "enough".

RC-blocking thresholds

| Metric | Threshold | Action if below |
|---|---|---|
| Feature coverage (complete profile) | 100% | Generator bug: all enabled constructs must appear |
| Semantic coverage (deep mode) | 80% | Increase program count or enable `--stress` / `--symbolic` |
| Corpus replay pass rate | 100% | Regression: block release, investigate |
| Confidence score | No drop > 5 points from previous RC | Investigate cause of confidence loss |
| Pass verification | Zero mismatches | Optimizer bug: block release |
| Sabotage detection | All 4 modes detected | VAST infrastructure gap: fix before release |
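A sketch of how these thresholds might be checked mechanically in a release script. The thresholds are the ones in the table; the metric field names are hypothetical:

```python
def rc_blocking_failures(metrics, previous_confidence):
    """Return the list of RC-blocking threshold violations (empty = releasable).

    Thresholds come from the table above; the dict keys are illustrative
    field names, not VAST's actual report schema.
    """
    failures = []
    if metrics["feature_coverage"] < 100:
        failures.append("feature coverage below 100%")
    if metrics["semantic_coverage"] < 80:
        failures.append("semantic coverage below 80%")
    if metrics["corpus_replay_pass_rate"] < 100:
        failures.append("corpus replay regression")
    if previous_confidence - metrics["confidence"] > 5:
        failures.append("confidence dropped more than 5 points")
    if metrics["pass_verification_mismatches"] > 0:
        failures.append("pass verification mismatch")
    if metrics["sabotage_modes_detected"] < 4:
        failures.append("sabotage mode undetected")
    return failures

good = {"feature_coverage": 100, "semantic_coverage": 85,
        "corpus_replay_pass_rate": 100, "confidence": 90,
        "pass_verification_mismatches": 0, "sabotage_modes_detected": 4}
assert rc_blocking_failures(good, previous_confidence=92) == []
```

Returning every violation at once, rather than failing fast, gives the release manager the full picture in one run.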

Warning thresholds (non-blocking)

| Metric | Threshold | Action |
|---|---|---|
| Interaction coverage (complete) | Below 50% | Run more programs with complete profile |
| Semantic coverage stagnation | Same value for 3+ nights | Generator may be saturated: review profile weights |
| Bug yield decline | Zero new mismatches for 7+ nights | Normal if compiler is stable; flag if new features were added |
| Large-program timeout rate | Above 10% | Check for performance regression in codegen |

How to interpret coverage gaps

Coverage numbers are not goals in themselves. They are signals about generator quality and testing thoroughness.

| Pattern | Interpretation | Action |
|---|---|---|
| High feature coverage, low semantic coverage | Generator uses all constructs but does not exercise interesting behaviours (edge values, error paths, deep nesting) | Enable `--stress` and `--symbolic` |
| High semantic coverage, low interaction coverage | Individual features are well-tested but combinations are not | Run the complete profile with larger program counts |
| High confidence, no mismatches | The system is working; normal state for a stable compiler | None required |
| Low confidence despite clean runs | Insufficient exploration | Increase program counts or run continuous mode for longer |
| Rising mismatch count | Active compiler bugs | Minimize, store, investigate. Do not release |
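The first rows of this pattern table can be encoded as a simple triage helper. The 80%/50% cut-offs are reused from the RC policy above purely for illustration; the function is a sketch, not part of VAST:

```python
def coverage_action(feature, semantic, interaction, mismatches):
    """Suggest a next step from the coverage pattern (thresholds illustrative)."""
    if mismatches > 0:
        return "minimize, store, investigate; do not release"
    if feature >= 80 and semantic < 80:
        return "enable --stress and --symbolic"
    if semantic >= 80 and interaction < 50:
        return "run complete profile with larger program counts"
    return "none required"

assert coverage_action(100, 60, 40, 0) == "enable --stress and --symbolic"
```

Checking mismatches before coverage mirrors the table's priority: an active compiler bug trumps any coverage question.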

Release decision policy

A release decision should consume these signals:

| Signal | Release gate |
|---|---|
| Mismatch count | Zero unreduced new mismatches |
| Corpus replay | 100% pass rate (no regressions) |
| Confidence score | No drop beyond 5 points from previous release |
| Semantic coverage | Above 80% in deep mode |
| Interaction coverage | Above 50% in complete profile |
| Path health | All paths agree on calibration set |
| Performance trend | No compile-time regression above 25% |
| Large-program stability | No new crash or timeout regressions |
| Pass verification | All optimizer passes preserve semantics |
| Sabotage validation | All 4 sabotage modes detected |