VAST measures what it has tested, not just whether anything failed. Coverage tracking and confidence scoring tell you how thoroughly the compiler has been exercised.
VAST tracks coverage at three levels.
Feature coverage tracks which of the 22 language constructs appear in generated programs: variable declarations, assignments, binary/unary operators, if statements, while loops, function definitions, literals (int, bool, string, float, none), enum/data definitions, list literals, match statements, try/except, raise, and generic functions.
vary vast --profile complete --count 50 --seed 42 --show-coverage
Feature coverage: 22/22 (100%)
+ variable_decl: 342
+ assignment: 87
+ if_stmt: 156
...
This tells you whether the generator actually used all the constructs the profile enables.
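
To make the counting concrete, here is a minimal sketch of a feature counter. It is not VAST's implementation: the nested-dict AST shape, the `walk` and `feature_coverage` helpers, and the four-name `ENABLED` list are assumptions made for illustration (the real tracker covers 22 constructs).

```python
from collections import Counter

# Hypothetical subset of construct names; names other than those in the
# sample output above are assumed.
ENABLED = ["variable_decl", "assignment", "if_stmt", "while_loop"]

def walk(node):
    """Yield every node of a nested-dict AST, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def feature_coverage(programs, enabled=ENABLED):
    """Count how often each enabled construct appears across all programs."""
    counts = Counter({name: 0 for name in enabled})
    for program in programs:
        for node in walk(program):
            if node["kind"] in counts:
                counts[node["kind"]] += 1
    return counts

def report(counts):
    """Render counts in the same +/- style as the CLI output."""
    hit = sum(1 for n in counts.values() if n > 0)
    total = len(counts)
    lines = [f"Feature coverage: {hit}/{total} ({100 * hit // total}%)"]
    for name, n in counts.items():
        lines.append(f"{'+' if n else '-'} {name}: {n}")
    return "\n".join(lines)
```

A construct that the profile enables but that never shows up keeps a count of zero, which is exactly the gap this level of coverage is meant to expose.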
Semantic coverage tracks 27 semantic properties across six categories. These properties describe behaviors that actually trigger compiler bugs, not just whether a construct appeared.
| Category | Properties |
|---|---|
| Value ranges | zero, negative, large, boundary values |
| Control flow | branch taken/not taken, loop zero/multi iteration, nested control, early return |
| Type interactions | mixed type expressions, nullable access, collection index, enum dispatch, data field access |
| Error paths | division risk, exception thrown/caught, overflow risk |
| Expression complexity | nested binary ops, chained calls, conditional expressions, multi-arg calls |
| Stress testing | identity ops, complementary ops, deep arithmetic chains, float precision |
Each profile defines which properties are enabled. The semantic tracker walks each generated AST and records which properties occur.
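
The AST walk can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual tracker: the dict-based AST, the detection rules, and the identifiers `division_risk` and `nested_binary_ops` are guesses derived from the table above; only `zero_value`, `negative_value`, and `loop_zero_iter` appear verbatim in VAST's output.

```python
def record_properties(node, seen):
    """Add to `seen` every semantic property this subtree exhibits."""
    kind = node["kind"]
    children = node.get("children", [])
    if kind == "int_literal":
        if node["value"] == 0:
            seen.add("zero_value")
        elif node["value"] < 0:
            seen.add("negative_value")
    elif kind == "binary_op":
        if node["op"] in ("/", "%"):
            seen.add("division_risk")      # assumed rule: any division or modulo
        if any(c["kind"] == "binary_op" for c in children):
            seen.add("nested_binary_ops")  # assumed rule: operator inside operator
    for child in children:
        record_properties(child, seen)

def semantic_coverage(programs, enabled):
    """Return (properties hit, enabled properties never hit)."""
    seen = set()
    for program in programs:
        record_properties(program, seen)
    hit = seen & set(enabled)
    return hit, sorted(set(enabled) - hit)
```

Unlike feature counting, the rules here look at values and structure: a literal `0` and a literal `-3` are the same construct but different properties.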
vary vast --profile types --count 100 --seed 42 --semantic-coverage
Semantic coverage: 18/20 (90%)
Value ranges:
+ zero_value: 42
+ negative_value: 31
+ large_value: 8
+ boundary_value: 3
Control flow:
+ branch_taken: 89
+ branch_not_taken: 67
- loop_zero_iter: 0
+ loop_multi_iter: 23
...
A + marks a property that was hit at least once; a - marks an enabled property that was never exercised.
Interaction coverage tracks pairwise co-occurrence of language features. Most compiler bugs appear in feature combinations: generics + collections, nullable + pattern matching, exceptions + loops.
vary vast --profile complete --count 100 --seed 42 --interaction-coverage
This catches gaps where individual features work but their combination does not.
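
Pairwise co-occurrence can be sketched with `itertools.combinations`. The function name, the per-program feature sets, and the choice to count every unordered pair of enabled features as the denominator are assumptions; VAST may define the pair universe differently.

```python
from itertools import combinations

def interaction_coverage(program_features, enabled):
    """program_features: one set of feature names per generated program.

    Returns (fraction of enabled pairs seen together, sorted missing pairs).
    """
    enabled = set(enabled)
    all_pairs = set(combinations(sorted(enabled), 2))
    if not all_pairs:
        return 1.0, []
    seen = set()
    for features in program_features:
        # A pair counts as covered when both features occur in one program.
        seen.update(combinations(sorted(features & enabled), 2))
    return len(seen) / len(all_pairs), sorted(all_pairs - seen)
```

In this model every feature can have 100% feature coverage while a pair such as `("generics", "match_stmt")` is still missing, because the two never appeared in the same program.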
The confidence report rolls the three coverage dimensions, program volume, and mismatch rate into one score.
vary vast --profile complete --count 200 --seed 42 --confidence
Confidence: HIGH (78%)
Feature coverage: 100%
Semantic coverage: 85%
Interaction coverage: 62%
Mismatch rate: 0.00%
Programs tested: 200
Gaps:
- Low interaction coverage (62%) — run with richer profiles
| Component | Weight | What it measures |
|---|---|---|
| Feature coverage | 20% | Construct diversity |
| Semantic coverage | 25% | Behavioral diversity |
| Interaction coverage | 15% | Combinatorial depth |
| Volume | 20% | How many programs were tested |
| Cleanliness | 20% | Absence of mismatches and path failures |
| Level | Score range | Meaning |
|---|---|---|
| VERY_HIGH | 90-100 | Comprehensive testing, no known gaps |
| HIGH | 70-89 | Solid coverage, minor gaps identified |
| MODERATE | 50-69 | Decent coverage, significant gaps remain |
| LOW | 0-49 | Insufficient testing, major gaps |
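
The weights and levels above combine as a weighted sum followed by a range lookup. The sketch below assumes each component is already normalized to a 0-100 score; how VAST normalizes volume and cleanliness is not documented here, so the volume value of 40 in the usage example is purely illustrative.

```python
# Weights and level floors copied from the tables above.
WEIGHTS = {
    "feature": 0.20,
    "semantic": 0.25,
    "interaction": 0.15,
    "volume": 0.20,
    "cleanliness": 0.20,
}
LEVELS = [(90, "VERY_HIGH"), (70, "HIGH"), (50, "MODERATE"), (0, "LOW")]

def confidence(components):
    """components: name -> score in [0, 100]. Returns (score, level)."""
    score = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    # Assumption: fractional scores fall into the range whose floor they reach.
    level = next(label for floor, label in LEVELS if score >= floor)
    return score, level

score, level = confidence({
    "feature": 100, "semantic": 85, "interaction": 62,
    "volume": 40,          # hypothetical input, not a value reported by VAST
    "cleanliness": 100,
})
print(f"Confidence: {level} ({score:.0f}%)")
```

Because semantic coverage carries the largest weight, raising it moves the score more than the same gain in any other single component.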
The confidence report flags four kinds of gap:
| Gap type | When it triggers |
|---|---|
| Uncovered semantic properties | One or more enabled behaviors have never been tested |
| Low program count | Volume is too small for the profile's complexity |
| High mismatch rate | Active bugs are inflating the failure rate |
| Low coverage area | A coverage dimension (feature, semantic, or interaction) falls below threshold |
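
The four gap checks could be expressed as below. Every threshold in this sketch (`coverage_threshold`, `min_programs`, `max_mismatch_rate`) is a hypothetical default chosen for illustration; the actual trigger values are not stated in this document, and `find_gaps` is not a VAST API.

```python
def find_gaps(coverage, uncovered_properties, programs, mismatch_rate,
              coverage_threshold=70.0,   # assumed
              min_programs=100,          # assumed
              max_mismatch_rate=0.01):   # assumed
    """Return human-readable gap descriptions, one per triggered check.

    coverage: dimension name -> percentage (0-100).
    uncovered_properties: enabled semantic properties with zero hits.
    mismatch_rate: fraction of programs whose paths disagreed (0.0-1.0).
    """
    gaps = []
    if uncovered_properties:
        names = ", ".join(sorted(uncovered_properties))
        gaps.append(f"Uncovered semantic properties: {names}")
    if programs < min_programs:
        gaps.append(f"Low program count ({programs})")
    if mismatch_rate > max_mismatch_rate:
        gaps.append(f"High mismatch rate ({mismatch_rate:.2%})")
    for name, pct in coverage.items():
        if pct < coverage_threshold:
            gaps.append(f"Low {name} coverage ({pct:.0f}%)")
    return gaps
```

With the coverage figures from the sample report (100%, 85%, 62%) and these assumed thresholds, only interaction coverage is flagged.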
In CI modes (fast and deep), confidence is computed per-profile and overall. The CI dashboard shows the confidence level for each profile.
Stress testing targets compiler edge cases that normal random generation rarely hits.
vary vast --profile core --count 100 --seed 42 --stress
The stress generator produces difficult inputs in several categories.
Integer boundary values: Long.MAX_VALUE, Long.MIN_VALUE, Int.MAX_VALUE, Int.MIN_VALUE, byte/short boundaries, powers of 10.
Float boundary values: positive/negative zero, Double.MAX_VALUE, Double.MIN_VALUE, near-zero values, 1e15, 1e-15, 0.1 + 0.2.
Stress patterns:
| Pattern | Example | What it catches |
|---|---|---|
| Identity ops | x + 0, x * 1, x - 0 | Optimizer identity folding |
| Complementary ops | (x + 5) - 5, (x * 3) / 3 | Optimizer inverse folding |
| Overflow chains | MAX_VALUE * 2 | Overflow handling |
| Deep nesting | ((a + b) * c) - d (3+ levels) | Stack depth, register allocation |
| Float precision | 0.1 + 0.1 + 0.1 + ... | Precision accumulation |
| Boundary negation | -Long.MAX_VALUE | Negation overflow |
| Division edges | x / 1, x / -1 | Division special cases |
When stress mode is active, approximately 30% of generated expressions use stress patterns. The semantic coverage tracker records which stress properties were exercised.
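
One way to picture the 30% rate is a generator hook that occasionally wraps an expression in a value-preserving pattern. This is a sketch only: `maybe_stress`, the string-based expressions, and the three patterns chosen from the table are assumptions about how such a hook could look.

```python
import random

# Value-preserving rewrites taken from the pattern table; each should
# evaluate to the same result as the expression it wraps.
STRESS_PATTERNS = {
    "identity": lambda e: f"({e} + 0)",
    "complementary": lambda e: f"(({e} + 5) - 5)",
    "division_edge": lambda e: f"({e} / 1)",
}

def maybe_stress(expr, rng, rate=0.30):
    """With probability `rate`, wrap `expr` in a randomly chosen pattern.

    Returns (expression, pattern name or None).
    """
    if rng.random() >= rate:
        return expr, None
    name = rng.choice(sorted(STRESS_PATTERNS))
    return STRESS_PATTERNS[name](expr), name

rng = random.Random(42)  # seeded so a run is reproducible
samples = [maybe_stress("x", rng) for _ in range(5)]
```

Returning the pattern name alongside the expression gives a coverage tracker something to record without re-parsing the generated text.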
The --opt-check flag adds a fourth execution path to catch optimizer bugs.
vary vast --profile types --count 100 --seed 42 --opt-check
| Path | Pipeline |
|---|---|
| AST interpreter | Direct AST evaluation (reference oracle) |
| IR interpreter | AST lowered to register-based IR, then interpreted |
| JVM optimized | ConstantFolder + DeadCodeEliminator, then bytecode |
| JVM unoptimized | Direct bytecode generation, no optimization passes |
When paths disagree, the comparator identifies the suspect:
| AST | IR | JVM-unopt | JVM-opt | Blame |
|---|---|---|---|---|
| A | A | A | B | Optimizer bug |
| A | B | B | B | AST interpreter bug |
| A | A | B | B | Codegen bug |
| A | B | A | B | Mixed (multiple issues) |
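
The blame table can be read as a lookup on the agreement pattern of the four outputs. The sketch below assumes outputs are comparable as strings; the `pattern` and `blame` helpers are hypothetical, and any pattern the table does not list is reported as unclassified rather than guessed.

```python
# Rows of the blame table, keyed by agreement pattern in the column
# order AST, IR, JVM-unopt, JVM-opt.
BLAME_TABLE = {
    "AAAB": "Optimizer bug",
    "ABBB": "AST interpreter bug",
    "AABB": "Codegen bug",
    "ABAB": "Mixed (multiple issues)",
}

def pattern(results):
    """Relabel outputs as A, B, C, ... in order of first appearance."""
    labels = {}
    out = []
    for r in results:
        if r not in labels:
            labels[r] = chr(ord("A") + len(labels))
        out.append(labels[r])
    return "".join(out)

def blame(ast, ir, jvm_unopt, jvm_opt):
    key = pattern((ast, ir, jvm_unopt, jvm_opt))
    if key == "AAAA":
        return "No mismatch"
    # Patterns absent from the table are left unclassified in this sketch.
    return BLAME_TABLE.get(key, "Unclassified")
```

The optimizer row shows why the unoptimized JVM path matters: it is the only path that separates "bytecode generation is wrong" from "an optimization pass is wrong".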
In deep CI mode, --opt-check is enabled automatically.
Path calibration verifies that execution paths are healthy before testing begins.
vary vast --profile core --count 100 --seed 42 --calibrate --path-health
The --calibrate flag runs a set of known-answer programs and checks that all paths produce expected results. If calibration fails, VAST exits with an error rather than producing unreliable results.
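
Calibration amounts to running known-answer programs through every path before trusting any comparison. In this sketch the path callables, the `(source, expected)` pairs, and the `calibrate` helper are all assumed for illustration.

```python
import sys

def calibrate(paths, known_answers):
    """Run each known-answer program on each path.

    paths: path name -> callable taking source text and returning output.
    known_answers: list of (source, expected_output) pairs.
    Returns a list of (path, source, expected, actual) failures.
    """
    failures = []
    for source, expected in known_answers:
        for name, run in paths.items():
            actual = run(source)
            if actual != expected:
                failures.append((name, source, expected, actual))
    return failures

def calibrate_or_exit(paths, known_answers):
    """Abort instead of continuing with an unhealthy path."""
    failures = calibrate(paths, known_answers)
    for name, source, expected, actual in failures:
        print(f"calibration failed on {name}: {source!r} "
              f"expected {expected!r}, got {actual!r}", file=sys.stderr)
    if failures:
        sys.exit(1)
```

Failing fast matters because a broken reference path would otherwise turn every later disagreement into a false bug report.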
The --path-health flag tracks path agreement across all programs in the run and prints a reliability matrix showing how often each pair of paths agrees.
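
A pairwise reliability matrix is straightforward to compute once each program's per-path outputs are recorded. The data shape and the `agreement_matrix` name below are assumptions; the path names in the usage example are shortened for illustration.

```python
from itertools import combinations

def agreement_matrix(runs):
    """runs: one dict per program, mapping path name -> output.

    Returns {(path_a, path_b): fraction of programs where both agree}.
    """
    if not runs:
        return {}
    paths = sorted(runs[0])
    rates = {}
    for a, b in combinations(paths, 2):
        agree = sum(1 for run in runs if run[a] == run[b])
        rates[(a, b)] = agree / len(runs)
    return rates

runs = [
    {"ast": "1", "ir": "1", "jvm": "1"},
    {"ast": "2", "ir": "2", "jvm": "3"},
]
for (a, b), rate in agreement_matrix(runs).items():
    print(f"{a:>4} vs {b:<4} {rate:.0%}")
```

A path that disagrees with every other path at a similar rate is the likely outlier, which is the pattern such a matrix makes visible.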
Continuous mode runs time-bounded exploration with adaptive profile selection.
vary vast --mode continuous --duration 300
Continuous mode picks profiles and batch sizes based on what has been tested so far.
| Behavior | How it works |
|---|---|
| Profile selection | Profiles with lower semantic coverage get higher selection probability. Untried profiles get an exploration bonus. |
| Batch sizing | Early iterations use small batches (20 programs) for breadth. Later iterations use larger batches (100 programs) for depth. |
| Cumulative state | Coverage maps merge across iterations, so the system tracks overall progress across profiles. |
| Convergence | When cumulative semantic coverage exceeds 95%, exploration stops early. |
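
The adaptive behaviors in the table can be sketched as a few small functions. The weighting formula, the `exploration_bonus` value, and the `warmup` cutoff between small and large batches are assumptions; only the batch sizes (20 and 100) and the 95% convergence threshold come from the table.

```python
import random

def selection_weights(semantic_coverage, tried, exploration_bonus=0.5):
    """Lower coverage gives a higher weight; untried profiles get a bonus.

    semantic_coverage: profile -> coverage in [0.0, 1.0].
    tried: set of profiles that have run at least once.
    """
    weights = {}
    for profile, cov in semantic_coverage.items():
        w = 1.0 - cov
        if profile not in tried:
            w += exploration_bonus  # assumed bonus size
        weights[profile] = max(w, 0.01)  # keep every profile selectable
    return weights

def pick_profile(weights, rng):
    """Draw one profile with probability proportional to its weight."""
    names = sorted(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def batch_size(iteration, warmup=5):
    """Small batches early for breadth, larger ones later for depth."""
    return 20 if iteration < warmup else 100  # warmup cutoff is assumed

def converged(cumulative_semantic_coverage):
    """Stop early once cumulative coverage exceeds 95%."""
    return cumulative_semantic_coverage > 0.95
```

Keeping a small floor on each weight means a well-covered profile is still revisited occasionally, which is one reasonable design choice for catching regressions; whether VAST does this is not stated here.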
The report includes a per-profile summary, confidence score, and coverage gap analysis identifying which profiles and properties need more work.
VAST Continuous Exploration Report
============================================================
Duration: 300.2s | Iterations: 47
Programs: 2340 total, 2340 passed, 0 mismatches
Status: TIME_BUDGET_EXHAUSTED
Per-Profile Summary:
Profile Programs Passed Mismatches Iters
------------------------------------------------------
collections 90 90 0 3
complete 450 450 0 10
control 180 180 0 5
core 160 160 0 4
...
Confidence: HIGH (82%)
...
Coverage Gap Analysis:
Overall semantic coverage: 88%
...
See CI integration for how continuous mode fits into the CI pipeline.