VAST

Coverage and confidence

VAST measures what it has tested, not just whether anything failed. Coverage tracking and confidence scoring tell you how thoroughly the compiler has been exercised.

Three coverage dimensions

VAST tracks coverage along three dimensions: feature, semantic, and interaction.

Feature coverage

Tracks which of 22 language constructs appear in generated programs: variable declarations, assignments, binary/unary operators, if statements, while loops, function definitions, literals (int, bool, string, float, none), enum/data definitions, list literals, match statements, try/except, raise, and generic functions.

vary vast --profile complete --count 50 --seed 42 --show-coverage
Feature coverage: 22/22 (100%)
  + variable_decl: 342
  + assignment: 87
  + if_stmt: 156
  ...

This tells you whether the generator actually used all the constructs the profile enables.
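As a rough illustration of what a feature-coverage count involves, the sketch below tallies construct names per generated program and reports which enabled constructs were hit. It is not VAST's implementation: the `feature_coverage` function, the four-entry `ENABLED` set, and the list-of-names program representation are assumptions made for the example.

```python
from collections import Counter

# Hypothetical subset of construct names; the real list is defined by VAST.
ENABLED = {"variable_decl", "assignment", "if_stmt", "while_loop"}

def feature_coverage(programs, enabled):
    """Count construct occurrences and how many enabled constructs were hit."""
    counts = Counter()
    for constructs in programs:
        counts.update(c for c in constructs if c in enabled)
    hit = sum(1 for c in enabled if counts[c] > 0)
    return counts, hit, len(enabled)

# Each program is represented here as the list of constructs it contains.
programs = [
    ["variable_decl", "assignment", "if_stmt"],
    ["variable_decl", "variable_decl"],
]
counts, hit, total = feature_coverage(programs, ENABLED)
print(f"Feature coverage: {hit}/{total} ({100 * hit // total}%)")
for name in sorted(ENABLED):
    mark = "+" if counts[name] else "-"
    print(f"  {mark} {name}: {counts[name]}")
```

In this toy run, while_loop is never generated, so it is the one construct reported as a miss.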

Semantic coverage

Tracks 27 semantic properties across six categories. These measure behaviors that actually trigger compiler bugs, not just whether a construct appeared.

| Category | Properties |
| --- | --- |
| Value ranges | zero, negative, large, boundary values |
| Control flow | branch taken/not taken, loop zero/multi iteration, nested control, early return |
| Type interactions | mixed type expressions, nullable access, collection index, enum dispatch, data field access |
| Error paths | division risk, exception thrown/caught, overflow risk |
| Expression complexity | nested binary ops, chained calls, conditional expressions, multi-arg calls |
| Stress testing | identity ops, complementary ops, deep arithmetic chains, float precision |

Each profile defines which properties are enabled. The semantic tracker walks each generated AST and records which properties occur.

vary vast --profile types --count 100 --seed 42 --semantic-coverage
Semantic coverage: 18/20 (90%)
  Value ranges:
    + zero_value: 42
    + negative_value: 31
    + large_value: 8
    + boundary_value: 3
  Control flow:
    + branch_taken: 89
    + branch_not_taken: 67
    - loop_zero_iter: 0
    + loop_multi_iter: 23
  ...

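A + marks a property that was hit at least once; a - marks an enabled property that was never hit. The number after each name is the hit count.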
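To make the idea of walking an AST and recording properties concrete, here is a minimal sketch over a toy two-node AST. The `Lit` and `BinOp` classes and the `track` function are invented for the example, and the names `division_risk` and `nested_binary_ops` are guesses based on the category table; VAST's real AST and tracker are not shown in this document.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Lit:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def track(node, hits):
    """Recursively walk the toy AST and record the properties observed."""
    if isinstance(node, Lit):
        if node.value == 0:
            hits["zero_value"] += 1
        elif node.value < 0:
            hits["negative_value"] += 1
    elif isinstance(node, BinOp):
        if node.op == "/":
            hits["division_risk"] += 1
        if isinstance(node.left, BinOp) or isinstance(node.right, BinOp):
            hits["nested_binary_ops"] += 1
        track(node.left, hits)
        track(node.right, hits)

# Properties assumed to be enabled by a hypothetical profile.
ENABLED = ["zero_value", "negative_value", "large_value",
           "division_risk", "nested_binary_ops"]

hits = Counter()
track(BinOp("/", BinOp("+", Lit(-3), Lit(0)), Lit(7)), hits)  # (-3 + 0) / 7
covered = sum(1 for p in ENABLED if hits[p] > 0)
print(f"Semantic coverage: {covered}/{len(ENABLED)}")
for p in ENABLED:
    print(f"  {'+' if hits[p] else '-'} {p}: {hits[p]}")
```

The single expression covers four of the five assumed properties; large_value stays at zero and would show up as a miss.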

Interaction coverage

Tracks pairwise co-occurrence of language features. Most compiler bugs appear in feature combinations: generics + collections, nullable + pattern matching, exceptions + loops.

vary vast --profile complete --count 100 --seed 42 --interaction-coverage

This catches gaps where individual features work but their combination does not.
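Pairwise co-occurrence can be sketched as follows, assuming each program is summarized by the set of features it uses. The feature names and the `interaction_coverage` function are illustrative only and are not taken from VAST.

```python
from itertools import combinations

def interaction_coverage(programs, features):
    """Return the feature pairs seen together in a program, and the pairs never seen."""
    all_pairs = set(combinations(sorted(features), 2))
    seen = set()
    for used in programs:
        present = sorted(set(used) & set(features))
        seen.update(combinations(present, 2))
    return seen, all_pairs - seen

features = ["exceptions", "generics", "lists", "loops"]
programs = [
    {"generics", "lists"},
    {"exceptions", "loops", "lists"},
]
seen, missing = interaction_coverage(programs, features)
print(f"Interaction coverage: {len(seen)}/{len(seen) + len(missing)}")
for pair in sorted(missing):
    print("  - " + " + ".join(pair))
```

Every feature appears in at least one program here, yet two pairs (exceptions + generics, generics + loops) never occur together, which is the kind of gap this dimension is meant to expose.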

Confidence scoring

The confidence report rolls the three coverage dimensions, program volume, and mismatch rate into one score.

vary vast --profile complete --count 200 --seed 42 --confidence
Confidence: HIGH (78%)
  Feature coverage:     100%
  Semantic coverage:    85%
  Interaction coverage: 62%
  Mismatch rate:        0.00%
  Programs tested:      200
  Gaps:
    - Low interaction coverage (62%) — run with richer profiles

Scoring formula

| Component | Weight | What it measures |
| --- | --- | --- |
| Feature coverage | 20% | Construct diversity |
| Semantic coverage | 25% | Behavioral diversity |
| Interaction coverage | 15% | Combinatorial depth |
| Volume | 20% | How many programs were tested |
| Cleanliness | 20% | Absence of mismatches and path failures |
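Read as a weighted sum, the formula can be sketched like this. The weights come from the table above; treating every component as a 0-100 sub-score is an assumption, and this document does not say how VAST derives the volume and cleanliness sub-scores, so the input values below are illustrative.

```python
# Weights in percent, as listed in the scoring table.
WEIGHTS = {
    "feature": 20,
    "semantic": 25,
    "interaction": 15,
    "volume": 20,
    "cleanliness": 20,
}

def confidence_score(subscores):
    """Weighted average of 0-100 sub-scores; the result is also in 0-100."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS) / 100

# Illustrative inputs; the volume sub-score of 50 is made up for the example.
score = confidence_score({
    "feature": 100,
    "semantic": 85,
    "interaction": 62,
    "volume": 50,
    "cleanliness": 100,
})
print(f"{score:.2f}")  # prints 80.55
```

Because the volume sub-score is invented here, the result is not expected to reproduce the 78% shown in the sample report.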

Confidence levels

| Level | Score range | Meaning |
| --- | --- | --- |
| VERY_HIGH | 90-100 | Comprehensive testing, no known gaps |
| HIGH | 70-89 | Solid coverage, minor gaps identified |
| MODERATE | 50-69 | Decent coverage, significant gaps remain |
| LOW | 0-49 | Insufficient testing, major gaps |
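Mapping a score to a level is a simple threshold lookup. The sketch below assumes each band starts at its lower bound, so a fractional score such as 89.5 counts as HIGH; the table does not say how VAST rounds.

```python
def confidence_level(score):
    """Map a 0-100 confidence score to the level names from the table."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "VERY_HIGH"
    if score >= 70:
        return "HIGH"
    if score >= 50:
        return "MODERATE"
    return "LOW"

for s in (95, 78, 55, 30):
    print(s, confidence_level(s))
```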

Gap identification

The confidence report flags four kinds of gap:

| Gap type | When it triggers |
| --- | --- |
| Uncovered semantic properties | One or more enabled behaviors have never been tested |
| Low program count | Volume is too small for the profile's complexity |
| High mismatch rate | Active bugs are inflating the failure rate |
| Low coverage area | A coverage dimension (feature, semantic, or interaction) falls below threshold |
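The four checks could look roughly like the sketch below. The thresholds (`coverage_floor`, `min_programs`, `max_mismatch_rate`) and the report layout are placeholders chosen for the example; the actual cutoffs VAST uses are not stated in this document.

```python
def find_gaps(report, coverage_floor=70.0, min_programs=100, max_mismatch_rate=1.0):
    """Return human-readable gap descriptions; all thresholds are hypothetical."""
    gaps = []
    if report["uncovered_properties"]:
        names = ", ".join(report["uncovered_properties"])
        gaps.append(f"Uncovered semantic properties: {names}")
    if report["programs"] < min_programs:
        gaps.append(f"Low program count ({report['programs']})")
    if report["mismatch_rate"] > max_mismatch_rate:
        gaps.append(f"High mismatch rate ({report['mismatch_rate']:.2f}%)")
    for dim in ("feature", "semantic", "interaction"):
        pct = report["coverage"][dim]
        if pct < coverage_floor:
            gaps.append(f"Low {dim} coverage ({pct:.0f}%)")
    return gaps

# Illustrative report values, not real VAST output.
report = {
    "uncovered_properties": ["loop_zero_iter"],
    "programs": 200,
    "mismatch_rate": 0.0,
    "coverage": {"feature": 100, "semantic": 85, "interaction": 62},
}
gaps = find_gaps(report)
for gap in gaps:
    print("  - " + gap)
```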

In CI modes (fast and deep), confidence is computed per-profile and overall. The CI dashboard shows the confidence level for each profile.

Stress testing

Stress testing targets compiler edge cases that normal random generation rarely hits.

vary vast --profile core --count 100 --seed 42 --stress

What stress mode generates

The stress generator produces difficult inputs in several categories.

Integer boundary values: Long.MAX_VALUE, Long.MIN_VALUE, Int.MAX_VALUE, Int.MIN_VALUE, byte/short boundaries, powers of 10.

Float boundary values: positive/negative zero, Double.MAX_VALUE, Double.MIN_VALUE, near-zero values, 1e15, 1e-15, 0.1 + 0.2.

Stress patterns:

| Pattern | Example | What it catches |
| --- | --- | --- |
| Identity ops | x + 0, x * 1, x - 0 | Optimizer identity folding |
| Complementary ops | (x + 5) - 5, (x * 3) / 3 | Optimizer inverse folding |
| Overflow chains | MAX_VALUE * 2 | Overflow handling |
| Deep nesting | ((a + b) * c) - d (3+ levels) | Stack depth, register allocation |
| Float precision | 0.1 + 0.1 + 0.1 + ... | Precision accumulation |
| Boundary negation | -Long.MAX_VALUE | Negation overflow |
| Division edges | x / 1, x / -1 | Division special cases |

When stress mode is active, approximately 30% of generated expressions use stress patterns. The semantic coverage tracker records which stress properties were exercised.
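One way to picture the roughly 30% mix is a generator that flips a biased coin per expression. This is a sketch under that assumption, not VAST's generator: `gen_expr`, the three-entry `STRESS_PATTERNS` table, and the string output format are invented for illustration.

```python
import random

INT64_MAX = 2**63 - 1      # same value as Long.MAX_VALUE
INT64_MIN = -(2**63)       # same value as Long.MIN_VALUE

# A few of the patterns from the table, as templates over one operand.
STRESS_PATTERNS = {
    "identity": lambda x: f"({x} + 0)",
    "complementary": lambda x: f"(({x} + 5) - 5)",
    "division_edge": lambda x: f"({x} / -1)",
}

def gen_expr(rng, stress_rate=0.30):
    """Return (pattern_name, expression); pattern_name is None for ordinary expressions."""
    operand = rng.choice(["x", str(INT64_MAX), str(INT64_MIN)])
    if rng.random() < stress_rate:
        name = rng.choice(sorted(STRESS_PATTERNS))
        return name, STRESS_PATTERNS[name](operand)
    return None, f"({operand} + {rng.randint(1, 9)})"

rng = random.Random(42)
samples = [gen_expr(rng) for _ in range(1000)]
stressed = sum(1 for name, _ in samples if name is not None)
print(f"{stressed} of {len(samples)} expressions used a stress pattern")
```

Returning the pattern name alongside the expression mirrors the idea that a coverage tracker can record which stress properties were exercised.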

Optimizer validation

The --opt-check flag adds a fourth execution path to catch optimizer bugs.

vary vast --profile types --count 100 --seed 42 --opt-check

Four execution paths

| Path | Pipeline |
| --- | --- |
| AST interpreter | Direct AST evaluation (reference oracle) |
| IR interpreter | AST lowered to register-based IR, then interpreted |
| JVM optimized | ConstantFolder + DeadCodeEliminator, then bytecode |
| JVM unoptimized | Direct bytecode generation, no optimization passes |

Blame localization

When paths disagree, the comparator identifies the suspect:

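| AST | IR | JVM-unopt | JVM-opt | Blame |
| --- | --- | --- | --- | --- |
| A | A | A | B | Optimizer bug |
| A | B | B | B | AST interpreter bug |
| A | A | B | B | Codegen bug |
| A | B | A | B | Mixed (multiple issues) |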
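The table reads naturally as a sequence of equality checks over the four outputs. The sketch below encodes exactly the four listed patterns; mapping every other disagreement pattern to "Mixed" is an assumption, since the table does not cover them, and the `blame` function name is hypothetical.

```python
def blame(ast, ir, jvm_unopt, jvm_opt):
    """Classify a four-path result according to the blame table."""
    if ast == ir == jvm_unopt == jvm_opt:
        return "No mismatch"
    if ast == ir == jvm_unopt:
        return "Optimizer bug"            # A A A B: only the optimized path differs
    if ir == jvm_unopt == jvm_opt:
        return "AST interpreter bug"      # A B B B: only the oracle differs
    if ast == ir and jvm_unopt == jvm_opt:
        return "Codegen bug"              # A A B B: interpreters agree, JVM paths agree
    return "Mixed (multiple issues)"      # A B A B and anything not listed (assumed)

print(blame(1, 1, 1, 2))
print(blame(1, 2, 2, 2))
print(blame(1, 1, 2, 2))
print(blame(1, 2, 1, 2))
```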

In deep CI mode, --opt-check is enabled automatically.

Path health monitoring

Path calibration verifies that execution paths are healthy before testing begins.

vary vast --profile core --count 100 --seed 42 --calibrate --path-health

The --calibrate flag runs a set of known-answer programs and checks that all paths produce expected results. If calibration fails, VAST exits with an error rather than producing unreliable results.
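A known-answer check amounts to running each calibration program on every path and comparing against a stored expected result. The sketch below uses toy tuple programs and two stand-in evaluators, one deliberately broken, to show a failure being caught; none of these names come from VAST.

```python
def calibrate(known_answers, paths):
    """Run each known-answer program on each path; return a list of failures."""
    failures = []
    for program, expected in known_answers:
        for name, run in paths.items():
            actual = run(program)
            if actual != expected:
                failures.append((name, program, actual, expected))
    return failures

def evaluate(program):
    """Stand-in for a healthy execution path."""
    op, a, b = program
    return a + b if op == "+" else a * b

def broken_evaluate(program):
    """Stand-in for an unhealthy path: it ignores the operator."""
    _, a, b = program
    return a + b

KNOWN_ANSWERS = [(("+", 2, 3), 5), (("*", 2, 3), 6)]

healthy = calibrate(KNOWN_ANSWERS, {"ast": evaluate, "ir": evaluate})
unhealthy = calibrate(KNOWN_ANSWERS, {"ast": evaluate, "ir": broken_evaluate})
print("healthy run failures:", healthy)
print("unhealthy run failures:", unhealthy)
```

A real runner would stop with a non-zero exit status when the failure list is non-empty, matching the behavior described above.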

The --path-health flag tracks path agreement across all programs in the run and prints a reliability matrix showing how often each pair of paths agrees.
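A pairwise reliability matrix can be computed by counting, for each pair of paths, the fraction of programs on which both produced the same output. The path keys and result layout below are assumptions for the sketch, and the printed format is not VAST's.

```python
from itertools import combinations

PATHS = ["ast", "ir", "jvm_unopt", "jvm_opt"]

def agreement_matrix(results):
    """Map each pair of paths to the fraction of programs where they agree."""
    if not results:
        return {}
    agree = {pair: 0 for pair in combinations(PATHS, 2)}
    for outputs in results:
        for a, b in agree:
            if outputs[a] == outputs[b]:
                agree[(a, b)] += 1
    return {pair: count / len(results) for pair, count in agree.items()}

# Four illustrative program results; the third disagrees on the optimized path.
results = [
    {"ast": 1, "ir": 1, "jvm_unopt": 1, "jvm_opt": 1},
    {"ast": 2, "ir": 2, "jvm_unopt": 2, "jvm_opt": 2},
    {"ast": 3, "ir": 3, "jvm_unopt": 3, "jvm_opt": 4},
    {"ast": 5, "ir": 5, "jvm_unopt": 5, "jvm_opt": 5},
]
matrix = agreement_matrix(results)
for (a, b), rate in matrix.items():
    print(f"{a:>9} vs {b:<9} {rate:.0%}")
```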

Continuous exploration

Continuous mode runs time-bounded exploration with adaptive profile selection.

vary vast --mode continuous --duration 300

Continuous mode picks profiles and batch sizes based on what has been tested so far.

| Behavior | How it works |
| --- | --- |
| Profile selection | Profiles with lower semantic coverage get higher selection probability. Untried profiles get an exploration bonus. |
| Batch sizing | Early iterations use small batches (20 programs) for breadth. Later iterations use larger batches (100 programs) for depth. |
| Cumulative state | Coverage maps merge across iterations, so the system tracks overall progress across profiles. |
| Convergence | When cumulative semantic coverage exceeds 95%, exploration stops early. |
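The adaptive behaviors can be sketched as a weighted random choice plus two small rules. The batch sizes (20 and 100) and the 95% convergence threshold come from the table; the weight formula, the `exploration_bonus` value, the 0.05 floor, and the five-iteration warm-up are assumptions picked for the example.

```python
import random

def selection_weights(coverage, exploration_bonus=0.5):
    """coverage maps profile -> semantic coverage in [0, 1], or None if untried."""
    weights = {}
    for profile, cov in coverage.items():
        if cov is None:
            weights[profile] = 1.0 + exploration_bonus   # untried: bonus on top of max weight
        else:
            weights[profile] = (1.0 - cov) + 0.05        # small floor keeps every profile selectable
    return weights

def pick_profile(rng, coverage):
    """Choose a profile with probability proportional to its weight."""
    weights = selection_weights(coverage)
    names = sorted(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def batch_size(iteration, warmup=5):
    """Small batches early for breadth, larger ones later for depth."""
    return 20 if iteration < warmup else 100

def converged(cumulative_coverage):
    """Stop early once cumulative semantic coverage exceeds 95%."""
    return cumulative_coverage > 0.95

coverage = {"core": 0.90, "collections": 0.40, "types": None}
rng = random.Random(42)
print(pick_profile(rng, coverage), batch_size(0), batch_size(10), converged(0.88))
```

With these weights, the untried profile is the most likely pick, followed by the profile with the lowest coverage.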

The report includes a per-profile summary, confidence score, and coverage gap analysis identifying which profiles and properties need more work.

VAST Continuous Exploration Report
============================================================
Duration: 300.2s | Iterations: 47
Programs: 2340 total, 2340 passed, 0 mismatches
Status: TIME_BUDGET_EXHAUSTED

Per-Profile Summary:
  Profile        Programs   Passed Mismatches    Iters
  ------------------------------------------------------
  collections          90       90          0        3
  complete            450      450          0       10
  control             180      180          0        5
  core                160      160          0        4
  ...

Confidence: HIGH (82%)
  ...

Coverage Gap Analysis:
  Overall semantic coverage: 88%
  ...

See CI integration for how continuous mode fits into the CI pipeline.
