VAST

Testing playbook

VAST is a collection of testing modes. This page tells you which modes to use, when, and what to do with the results.

Profile and mode matrix

Each VAST mode/profile has a purpose, cost, and suitable cadence.

| Mode / Profile | Purpose | Cost | Best bug classes | Cadence | Gating? | Signal |
|---|---|---|---|---|---|---|
| `--mode fast` | Smoke differential correctness | ~2 min, 100 programs | Codegen, obvious regressions | PR, local | Yes (quick gate) | High: catches gross breakage |
| `--profile complete` | Broad semantic confidence | ~5 min per 500 programs | Feature interactions, type bugs | Nightly | No (informational) | Medium: breadth over depth |
| `--opt-check` | Optimizer regression hunting | 2x cost (4 paths) | Optimizer bugs, pass corruption | Nightly, RC | Yes (RC gate) | High: catches real optimizer bugs |
| `--stress` | Edge-case value testing | +30% expression cost | Boundary bugs, overflow, precision | Nightly | No | Medium: targets specific edges |
| `--stateful` | Mutable state accumulation bugs | ~3 min per 500 programs | Loop/state bugs, off-by-one | Nightly | No | Medium: finds subtle state bugs |
| `--aliasing` | Reference propagation bugs | ~2 min per 500 programs | Aliasing, sharing, field chain bugs | Nightly, RC | No | Medium: targets specific subsystem |
| `--exception-propagation` | Stack unwinding mismatches | ~2 min per 500 programs | Exception codegen, catch routing | Nightly | No | Medium |
| `--concurrency` | Deterministic concurrency | ~3 min per 500 programs | Fork-join, pipeline, scheduling | Nightly | No | Medium |
| `--symbolic` | Constraint-driven edge values | ~2 min per 500 programs | Branch condition, range partition | Nightly | No | Medium: complements stress |
| `--ir-check` | IR translation equivalence | ~3 min per 1000 programs | Lowering bugs, IR/JVM divergence | Nightly, RC | Yes (RC gate) | High |
| `--verify-all-passes` | Per-pass optimizer isolation | ~5 min per 1000 programs | Pass-specific semantic corruption | Nightly, RC | Yes (RC gate) | High |
| `--large-programs` | Compiler scalability | ~5 min per 20 programs | Crashes, timeouts, memory exhaustion | Nightly, RC | No (informational) | Low volume, high impact |
| `--mode continuous` | Adaptive coverage exploration | Time-bounded (configurable) | Coverage gaps, undertested areas | Post-RC, soak | No | Cumulative: improves over time |
| `--metamorphic` | Semantics-preserving equivalence | +50% cost | Optimizer commutativity, folding bugs | Nightly | No | Medium |
| `--mutate` | VAST self-validation | +5x per program | VAST blind spots | Nightly | No | Meta: tests the tester |
| Corpus replay | Known regression detection | ~30s | Previously found bugs | Every RC (gate 3) | Yes (hard gate) | Very high: prevents regressions |
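All of these modes rest on the same differential core: run each generated program through several independent execution paths and flag any disagreement. A minimal sketch of that comparison loop, using toy `eval`-based lambdas as stand-ins for VAST's real AST/IR/JVM paths (the names here are illustrative, not VAST's API):

```python
# Sketch of differential comparison across execution paths.
# The path runners below are toy stand-ins, not VAST's real backends.

def differential_check(program, paths):
    """Run `program` on every path; return the names of paths that disagree."""
    results = {name: run(program) for name, run in paths.items()}
    baseline = next(iter(results.values()))
    return sorted(name for name, out in results.items() if out != baseline)

# Toy paths: three agree, one (a "buggy optimizer") diverges on subtraction.
paths = {
    "ast":       lambda p: eval(p),
    "ir":        lambda p: eval(p),
    "jvm_unopt": lambda p: eval(p),
    "jvm_opt":   lambda p: eval(p.replace("-", "+")),  # injected bug
}

print(differential_check("7 - 3", paths))  # only the jvm_opt path disagrees
print(differential_check("7 + 3", paths))  # all paths agree: empty list
```

The value of extra paths is exactly this cross-check: a single wrong oracle shows up as one outlier rather than a silent pass.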

Four testing purposes

Every VAST run should have one of these purposes.

1. Smoke validation

- **Question:** Did we obviously break the compiler?
- **Use:** Fast mode, small number of profiles, baseline differential comparison, parser round-trip, corpus replay
- **Run on:** Every PR, local development
- **Success means:** Cheap and reliable signal that nothing is obviously broken

```shell
vary vast --mode fast
```

2. Semantic regression hunting

- **Question:** Did a recent compiler change break semantics?
- **Use:** Deep mode, `--opt-check`, semantic coverage, interaction coverage, stress mode, specialized profiles (stateful, aliasing, exceptions, concurrency, symbolic)
- **Run on:** Nightly, before RC
- **Success means:** High bug-finding probability on recent changes

```shell
./bin/vary vast --mode deep --verbose --opt-check --stress
```

3. Trust and infrastructure validation

- **Question:** Is VAST itself still believable?
- **Use:** Sabotage modes, trusted path calibration, path health matrix, corpus replay, regression generation validation
- **Run on:** Nightly or post-RC, whenever VAST internals change
- **Success means:** The tester is still testing

```shell
make vast-negative
vary vast --profile core --count 100 --calibrate --path-health
```

4. Scalability and soak

- **Question:** Does the compiler survive hard or long-running conditions?
- **Use:** Continuous exploration, large programs, compile-time/memory/performance metrics, failure artifact collection
- **Run on:** Nightly, scheduled soak, pre-release
- **Success means:** Stability under volume and size

```shell
vary vast --mode continuous --duration 300
./bin/vary vast --mode deep --large-programs --verbose
```

CI run profiles

PR (fast smoke)

Run only fast high-signal checks:

| Check | Command |
|---|---|
| Corpus replay | `make rc` (gate 3) |
| Fast differential | `vary vast --mode fast` |
| Parser round-trip | Included in fast mode |

Goal: quick breakage detection, under 2 minutes.

Nightly (semantic exploration)

Run broader semantic exploration with specialized generators:

| Section | What it runs |
|---|---|
| RC validation | 11-gate pipeline |
| Deep differential | `--mode deep --opt-check --stress` |
| Metamorphic + round-trip + coverage | `--metamorphic --round-trip --show-coverage` |
| Mutation expansion | `--mutate` |
| Specialized generators | `--stateful`, `--concurrency`, `--exception-propagation`, `--symbolic` |
| Large programs | `--large-programs` |
| IR translation checks | `--ir-check` |
| Confidence report | `--confidence` |
| Negative validation | Sabotage probes |
| Corpus growth | Exploration + auto-reduction |

Goal: bug discovery and confidence growth.

RC / pre-release (trust validation)

Run the most expensive trusted suite:

| Check | Purpose |
|---|---|
| Deep corpus replay | All known-good programs pass |
| Deep + all specialized profiles | Maximum semantic coverage |
| Pass verification | `--verify-all-passes` |
| IR translation checks | `--ir-check` |
| Large programs | Scale/crash/perf stress |
| Continuous exploration (bounded) | Coverage-gap hunting |
| Sabotage validation | Trust infrastructure check |
| Artifact retention | 14-day failure artifact storage |

Goal: shipping confidence.

Bug taxonomy

VAST is effective at finding specific classes of bugs. Knowing the taxonomy helps interpret failures and choose the right testing mode.

| Bug class | Description | Best VAST mode | Example |
|---|---|---|---|
| Codegen bugs | Wrong bytecode emitted for an expression or statement | Deep differential, `--opt-check` | `a - b` compiled to `a + b` in nested context |
| Optimizer semantic corruption | Optimization pass changes program behaviour | `--opt-check`, `--verify-all-passes` | Constant folder evaluates edge-case arithmetic wrong |
| IR lowering bugs | AST-to-IR translation loses or changes semantics | `--ir-check`, deep differential | Variable binding lost during lowering |
| AST interpreter bugs | Reference oracle itself produces wrong result | Deep differential (AST disagrees with IR + JVM) | Interpreter mishandles enum dispatch |
| Aliasing/reference bugs | Mutation through one alias not visible through another | `--aliasing` | Shared data object not updated through alias |
| Exception unwinding mismatches | Stack unwinding differs between paths | `--exception-propagation` | Catch block executed on wrong path |
| State accumulation bugs | Bug only surfaces after many mutation steps | `--stateful` | Off-by-one in loop iteration interacting with mutable state |
| Concurrency determinism bugs | Parallel execution produces different result across runs | `--concurrency` | Fork-join result depends on scheduling |
| Parser round-trip instability | Format-then-reparse produces different AST | `--round-trip` | Formatter drops parentheses, changing precedence |
| Pass-specific breakage | One optimizer pass breaks semantics while others pass | `--verify-all-passes` | DCE removes a needed side effect |
| Performance/scalability blowups | Compiler crashes, times out, or exhausts memory on large input | `--large-programs` | Quadratic codegen on deep call graphs |
| Float precision divergence | Floating-point results differ between interpreter and JVM | `--profile float`, `--stress` | Precision accumulation in long arithmetic chains |

Three classes of generated programs

VAST generates programs in three categories, each serving a different purpose:

| Class | Purpose | Size | Example |
|---|---|---|---|
| Minimal semantic probes | Test one feature or interaction precisely | 5-20 AST nodes | `return (x + 0)`: tests identity folding |
| Feature interaction probes | Exercise combinations of language features | 20-200 AST nodes | Enum match inside try/except with nullable field access |
| Realistic structured programs | Stress scalability and complex control flow | 200-10,000 AST nodes | Multi-function program with loops, state, exceptions, and cross-function calls |

Minimal probes are cheap and targeted. Interaction probes find combination bugs. Realistic programs stress the compiler at scale. A good testing session uses all three.
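As an illustration of how a session might balance the three classes, the size budgets from the table above can be expressed as a weighted sampler. The class names and weights below are hypothetical, not VAST's actual configuration:

```python
import random

# Node budgets taken from the table above; names and weights are
# illustrative, not VAST's real generator configuration.
PROGRAM_CLASSES = {
    "minimal_probe":     {"nodes": (5, 20),       "weight": 0.5},
    "interaction_probe": {"nodes": (20, 200),     "weight": 0.4},
    "realistic_program": {"nodes": (200, 10_000), "weight": 0.1},
}

def pick_program_size(rng):
    """Pick a class by weight, then a node budget within its range."""
    names = list(PROGRAM_CLASSES)
    weights = [PROGRAM_CLASSES[n]["weight"] for n in names]
    cls = rng.choices(names, weights=weights)[0]
    lo, hi = PROGRAM_CLASSES[cls]["nodes"]
    return cls, rng.randint(lo, hi)

rng = random.Random(0)
cls, size = pick_program_size(rng)
lo, hi = PROGRAM_CLASSES[cls]["nodes"]
assert lo <= size <= hi
```

Weighting toward small probes keeps per-bug reduction cheap while still reserving budget for scale stress.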

Mode-to-bug-class mapping

Use the right VAST mode for the compiler subsystem you changed.

| Compiler change | VAST modes to run |
|---|---|
| Optimizer (constant folder, DCE) | `--opt-check`, `--verify-all-passes`, `--stress` |
| IR lowering | `--ir-check`, `--metamorphic` |
| Codegen (bytecode generation) | Deep differential, `--opt-check` |
| Data types / aliasing / runtime | `--stateful`, `--concurrency` |
| Exception handling | `--exception-propagation` |
| Concurrency runtime | `--concurrency` |
| Parser / formatter | `--round-trip`, round-trip RC gate |
| Performance / scalability | `--large-programs`, `--mode continuous` |
| New language feature | Deep mode with complete profile, `--interaction-coverage` |
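This mapping is mechanical enough to encode, for example in a pre-commit helper that suggests modes for a touched subsystem. The subsystem keys and helper name below are chosen for illustration; the flag lists come from the table above:

```python
# Table above as a lookup; subsystem keys are illustrative labels.
MODES_FOR_CHANGE = {
    "optimizer":   ["--opt-check", "--verify-all-passes", "--stress"],
    "ir-lowering": ["--ir-check", "--metamorphic"],
    "codegen":     ["--mode deep", "--opt-check"],
    "runtime":     ["--stateful", "--concurrency"],
    "exceptions":  ["--exception-propagation"],
    "concurrency": ["--concurrency"],
    "parser":      ["--round-trip"],
    "performance": ["--large-programs", "--mode continuous"],
}

def suggest_modes(subsystem):
    """Return suggested flags; default to the new-feature row for unknowns."""
    return MODES_FOR_CHANGE.get(subsystem, ["--profile complete", "--interaction-coverage"])

assert suggest_modes("optimizer") == ["--opt-check", "--verify-all-passes", "--stress"]
```

Defaulting unknown subsystems to the broad complete-profile row errs toward breadth rather than silently skipping coverage.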

Coverage as a steering tool

Coverage should drive testing choices, not just report status.

| Signal | What it means | Action |
|---|---|---|
| Low feature coverage | Generator or profile weakness | Add constructs to generator, run richer profiles |
| Low semantic coverage | Branch/value/behaviour gaps | Enable stress mode, run symbolic generation |
| Low interaction coverage | Missing combined constructs | Run complete profile with more programs |
| Low confidence but no mismatches | Insufficient exploration | Run continuous mode for longer, increase program count |
| Coverage plateau over multiple nights | Generator saturation | Switch profile weighting, add new generation families |
| Good coverage but repeated failures in one class | Subsystem weakness | Invest in targeted testing for that subsystem |

Failure classification and routing

Every VAST failure should be classified into a bucket with a standard response.

| Failure type | Identified by | Response |
|---|---|---|
| Optimizer bug | AST/IR/JVM-unopt agree, JVM-opt differs | Minimize, check optimizer passes, store corpus entry |
| Codegen bug | AST/IR agree, JVM differs | Minimize, check bytecode generation |
| AST/IR disagreement | AST differs from IR and JVM | Check AST interpreter correctness |
| Parser round-trip failure | Format-reparse produces different AST | Check formatter and parser |
| Pass verification failure | Single pass changes semantics | Isolate which pass, minimize |
| Large-program compile failure | Compilation timeout or crash | Check scalability, reduce program |
| Performance regression | Compile time exceeds threshold | Profile compiler, check for algorithmic regression |
| Infrastructure/tester failure | Path crash, VAST internal error | Check VAST code, not compiler |

For every failure: minimize it, tag it, store it in the corpus, and identify the likely subsystem.
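The "identified by" column is effectively a decision procedure over path agreement. A sketch of the first three buckets as code (an illustration of the routing logic, not VAST's actual classifier):

```python
def classify_failure(results):
    """Bucket a mismatch by which execution paths agree.

    `results` maps path name -> observed output; path names mirror the
    routing table above. Illustrative sketch, not VAST's real router.
    """
    ast, ir = results["ast"], results["ir"]
    unopt, opt = results["jvm_unopt"], results["jvm_opt"]
    if ast == ir == unopt and opt != unopt:
        return "optimizer bug"       # only the optimized JVM path diverges
    if ast == ir and unopt != ast:
        return "codegen bug"         # both JVM paths diverge from AST/IR
    if ast != ir:
        return "AST/IR disagreement" # reference oracle itself is suspect
    return "unclassified"

assert classify_failure({"ast": 4, "ir": 4, "jvm_unopt": 4, "jvm_opt": 10}) == "optimizer bug"
```

Routing on agreement patterns first, before reading any code, keeps triage cheap and points the minimized repro at the right subsystem.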

Coverage adequacy policy

Coverage is useful only if it drives decisions. These thresholds define what counts as "enough".

RC-blocking thresholds

| Metric | Threshold | Action if below |
|---|---|---|
| Feature coverage (complete profile) | 100% | Generator bug: all enabled constructs must appear |
| Semantic coverage (deep mode) | 80% | Increase program count or enable `--stress` / `--symbolic` |
| Corpus replay pass rate | 100% | Regression: block release, investigate |
| Confidence score | No drop > 5 points from previous RC | Investigate cause of confidence loss |
| Pass verification | Zero mismatches | Optimizer bug: block release |
| Sabotage detection | All 4 modes detected | VAST infrastructure gap: fix before release |
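A sketch of how these thresholds might be checked mechanically in a release script. The thresholds are the ones in the table; the metric field names are hypothetical:

```python
def rc_blocking_failures(metrics, previous_confidence):
    """Return the list of RC-blocking threshold violations (empty = releasable).

    Thresholds come from the table above; the dict keys are illustrative
    field names, not VAST's actual report schema.
    """
    failures = []
    if metrics["feature_coverage"] < 100:
        failures.append("feature coverage below 100%")
    if metrics["semantic_coverage"] < 80:
        failures.append("semantic coverage below 80%")
    if metrics["corpus_replay_pass_rate"] < 100:
        failures.append("corpus replay regression")
    if previous_confidence - metrics["confidence"] > 5:
        failures.append("confidence dropped more than 5 points")
    if metrics["pass_verification_mismatches"] > 0:
        failures.append("pass verification mismatch")
    if metrics["sabotage_modes_detected"] < 4:
        failures.append("sabotage mode undetected")
    return failures

good = {"feature_coverage": 100, "semantic_coverage": 85,
        "corpus_replay_pass_rate": 100, "confidence": 90,
        "pass_verification_mismatches": 0, "sabotage_modes_detected": 4}
assert rc_blocking_failures(good, previous_confidence=92) == []
```

Returning every violation at once, rather than failing fast, gives the release manager the full picture in one run.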

Warning thresholds (non-blocking)

| Metric | Threshold | Action |
|---|---|---|
| Interaction coverage (complete) | Below 50% | Run more programs with complete profile |
| Semantic coverage stagnation | Same value for 3+ nights | Generator may be saturated: review profile weights |
| Bug yield decline | Zero new mismatches for 7+ nights | Normal if compiler is stable; flag if new features were added |
| Large-program timeout rate | Above 10% | Check for performance regression in codegen |

How to interpret coverage gaps

Coverage numbers are not goals in themselves. They are signals about generator quality and testing thoroughness.

| Pattern | Interpretation | Action |
|---|---|---|
| High feature coverage, low semantic coverage | Generator uses all constructs but does not exercise interesting behaviours (edge values, error paths, deep nesting) | Enable `--stress` and `--symbolic` |
| High semantic coverage, low interaction coverage | Individual features are well-tested but combinations are not | Run the complete profile with larger program counts |
| High confidence, no mismatches | The system is working; normal state for a stable compiler | None required |
| Low confidence despite clean runs | Insufficient exploration | Increase program counts or run continuous mode for longer |
| Rising mismatch count | Active compiler bugs | Minimize, store, investigate. Do not release |
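The first rows of this pattern table can be encoded as a simple triage helper. The 80%/50% cut-offs are reused from the RC policy above purely for illustration; the function is a sketch, not part of VAST:

```python
def coverage_action(feature, semantic, interaction, mismatches):
    """Suggest a next step from the coverage pattern (thresholds illustrative)."""
    if mismatches > 0:
        return "minimize, store, investigate; do not release"
    if feature >= 80 and semantic < 80:
        return "enable --stress and --symbolic"
    if semantic >= 80 and interaction < 50:
        return "run complete profile with larger program counts"
    return "none required"

assert coverage_action(100, 60, 40, 0) == "enable --stress and --symbolic"
```

Checking mismatches before coverage mirrors the table's priority: an active compiler bug trumps any coverage question.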

Release decision policy

A release decision should consume these signals:

| Signal | Release gate |
|---|---|
| Mismatch count | Zero unreduced new mismatches |
| Corpus replay | 100% pass rate (no regressions) |
| Confidence score | No drop beyond 5 points from previous release |
| Semantic coverage | Above 80% in deep mode |
| Interaction coverage | Above 50% in complete profile |
| Path health | All paths agree on calibration set |
| Performance trend | No compile-time regression above 25% |
| Large-program stability | No new crash or timeout regressions |
| Pass verification | All optimizer passes preserve semantics |
| Sabotage validation | All 4 sabotage modes detected |