Coverage and confidence

Three coverage dimensions

VAST tracks coverage along three dimensions: feature, semantic, and interaction.

Feature coverage

Tracks which of 22 language constructs appear in generated programs: variable declarations, assignments, binary/unary operators, if statements, while loops, function definitions, literals (int, bool, string, float, none), enum/data definitions, list literals, match statements, try/except, raise, and generic functions.

vary vast --profile complete --count 50 --seed 42 --show-coverage

Feature coverage: 22/22 (100%)
  + variable_decl: 342
  + assignment: 87
  + if_stmt: 156
  ...

This tells you whether the generator actually used all the constructs the profile enables.

Semantic coverage

Tracks 27 semantic properties across six categories. These measure behaviors that actually trigger compiler bugs, not just whether a construct appeared.

Category	Properties
Value ranges	zero, negative, large, boundary values
Control flow	branch taken/not taken, loop zero/multi iteration, nested control, early return
Type interactions	mixed type expressions, nullable access, collection index, enum dispatch, data field access
Error paths	division risk, exception thrown/caught, overflow risk
Expression complexity	nested binary ops, chained calls, conditional expressions, multi-arg calls
Stress testing	identity ops, complementary ops, deep arithmetic chains, float precision

Each profile defines which properties are enabled. The semantic tracker walks each generated AST and records which properties occur.

vary vast --profile types --count 100 --seed 42 --semantic-coverage

Semantic coverage: 18/20 (90%)
  Value ranges:
    + zero_value: 42
    + negative_value: 31
    + large_value: 8
    + boundary_value: 3
  Control flow:
    + branch_taken: 89
    + branch_not_taken: 67
    - loop_zero_iter: 0
    + loop_multi_iter: 23
  ...

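A + marks a property that was hit at least once. A - marks an enabled property that was never hit (count 0), such as loop_zero_iter above.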
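The hit/miss bookkeeping can be sketched in a few lines of Python. This is an illustration only, not VAST's tracker: the function name `semantic_coverage` and the flat list of observed property names are assumptions made for the example.

```python
from collections import Counter

def semantic_coverage(enabled, observed):
    """Return (hits, total, report_lines) for the enabled properties.

    enabled:  property names the profile turns on (hypothetical input shape)
    observed: one entry per occurrence found while walking generated ASTs
    """
    counts = Counter(observed)
    hits = 0
    lines = []
    for prop in enabled:
        n = counts[prop]
        if n > 0:
            hits += 1
        mark = "+" if n > 0 else "-"
        lines.append(f"{mark} {prop}: {n}")
    return hits, len(enabled), lines

enabled = ["zero_value", "branch_taken", "loop_zero_iter"]
observed = ["zero_value", "branch_taken", "branch_taken"]
hits, total, lines = semantic_coverage(enabled, observed)
print(f"Semantic coverage: {hits}/{total}")
for line in lines:
    print("  " + line)
```

Only enabled properties count toward the denominator, which is why a profile can report 18/20 even though 27 properties exist overall.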

Interaction coverage

Tracks pairwise co-occurrence of language features. Most compiler bugs appear in feature combinations: generics + collections, nullable + pattern matching, exceptions + loops.

vary vast --profile complete --count 100 --seed 42 --interaction-coverage

This catches gaps where individual features work but their combination does not.
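As a rough sketch of what pairwise co-occurrence means, the following Python counts which feature pairs appear together in at least one program. The feature names and the representation of a program as a set of feature names are assumptions for illustration; VAST's internal representation may differ.

```python
from itertools import combinations

def interaction_coverage(programs, features):
    """Return (seen_pairs, missing_pairs) over all unordered feature pairs."""
    all_pairs = set(combinations(sorted(features), 2))
    seen = set()
    for used in programs:
        # Only features the profile enables count toward interaction coverage.
        present = sorted(set(used) & set(features))
        seen.update(combinations(present, 2))
    return seen, all_pairs - seen

features = ["generics", "collections", "nullable", "match"]
programs = [
    {"generics", "collections"},
    {"nullable", "match", "collections"},
]
seen, missing = interaction_coverage(programs, features)
print(f"Interaction coverage: {len(seen)}/{len(seen) + len(missing)}")
for pair in sorted(missing):
    print("  - " + " + ".join(pair))
```

Here every feature appears in some program, so feature coverage would be complete, yet two pairs (both involving generics) are never exercised together.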

Confidence scoring

The confidence report rolls the three coverage dimensions, program volume, and mismatch rate into one score.

vary vast --profile complete --count 200 --seed 42 --confidence

Confidence: HIGH (78%)
  Feature coverage:     100%
  Semantic coverage:    85%
  Interaction coverage: 62%
  Mismatch rate:        0.00%
  Programs tested:      200
  Gaps:
    - Low interaction coverage (62%): run with richer profiles

Scoring formula

Component	Weight	What it measures
Feature coverage	20%	Construct diversity
Semantic coverage	25%	Behavioral diversity
Interaction coverage	15%	Combinatorial depth
Volume	20%	How many programs were tested
Cleanliness	20%	Absence of mismatches and path failures
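Read as a weighted sum, the table can be sketched as follows. The weights come from the table; how VAST converts program volume and mismatch rate into 0-100 component scores is not described here, so the sketch takes those component scores as inputs. The input values are made up for the example.

```python
# Weights in percent, taken from the scoring table.
WEIGHTS = {
    "feature": 20,
    "semantic": 25,
    "interaction": 15,
    "volume": 20,
    "cleanliness": 20,
}

def confidence_score(components):
    """Combine 0-100 component scores into a 0-100 confidence score."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100

# Hypothetical component scores, not taken from a real run.
example = {
    "feature": 100,
    "semantic": 80,
    "interaction": 60,
    "volume": 50,
    "cleanliness": 100,
}
print(confidence_score(example))
```

Because semantic coverage carries the largest weight, a ten-point gain there moves the score by 2.5 points, compared with 1.5 points for the same gain in interaction coverage.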

Confidence levels

Level	Score range	Meaning
VERY_HIGH	90-100	Comprehensive testing, no known gaps
HIGH	70-89	Solid coverage, minor gaps identified
MODERATE	50-69	Decent coverage, significant gaps remain
LOW	0-49	Insufficient testing, major gaps
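The banding can be expressed as a simple threshold lookup. This sketch treats the lower bound of each range as the cutoff; how VAST rounds fractional scores that fall between bands (for example 89.5) is not stated, so that behavior is an assumption.

```python
def confidence_level(score):
    """Map a 0-100 confidence score to the level names in the table."""
    if score >= 90:
        return "VERY_HIGH"
    if score >= 70:
        return "HIGH"
    if score >= 50:
        return "MODERATE"
    return "LOW"

for score in (95, 78, 50, 49):
    print(score, confidence_level(score))
```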

Gap identification

The confidence report flags four kinds of gap:

Gap type	When it triggers
Uncovered semantic properties	One or more enabled behaviors have never been tested
Low program count	Volume is too small for the profile's complexity
High mismatch rate	Active bugs are inflating the failure rate
Low coverage area	A coverage dimension (feature, semantic, or interaction) falls below threshold
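The four checks can be pictured as independent rules over a run summary. The sketch below is hypothetical throughout: the thresholds (`min_coverage`, `min_programs`, `max_mismatch_rate`), the `report` dictionary shape, and the message wording are all assumptions, since the actual thresholds are not documented here.

```python
def find_gaps(report, min_coverage=70, min_programs=100, max_mismatch_rate=0.01):
    """Return a list of gap messages; all thresholds are illustrative."""
    gaps = []
    # Uncovered semantic properties: enabled but never observed.
    missed = sorted(p for p, n in report["semantic_counts"].items() if n == 0)
    if missed:
        gaps.append("Uncovered semantic properties: " + ", ".join(missed))
    # Low program count.
    if report["programs"] < min_programs:
        gaps.append(f"Low program count ({report['programs']})")
    # High mismatch rate.
    if report["mismatch_rate"] > max_mismatch_rate:
        gaps.append(f"High mismatch rate ({report['mismatch_rate']:.2%})")
    # Low coverage area, checked per dimension.
    for dim in ("feature", "semantic", "interaction"):
        pct = report["coverage"][dim]
        if pct < min_coverage:
            gaps.append(f"Low {dim} coverage ({pct}%)")
    return gaps

report = {
    "coverage": {"feature": 100, "semantic": 85, "interaction": 62},
    "semantic_counts": {"zero_value": 42, "loop_zero_iter": 0},
    "programs": 200,
    "mismatch_rate": 0.0,
}
gaps = find_gaps(report)
for gap in gaps:
    print("- " + gap)
```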

In CI modes (fast and deep), confidence is computed per-profile and overall. The CI dashboard shows the confidence level for each profile.

Stress testing

Stress testing targets compiler edge cases that normal random generation rarely hits.

vary vast --profile core --count 100 --seed 42 --stress

What stress mode generates

The stress generator produces difficult inputs in several categories.

Integer boundary values: Long.MAX_VALUE, Long.MIN_VALUE, Int.MAX_VALUE, Int.MIN_VALUE, byte/short boundaries, powers of 10.

Float boundary values: positive/negative zero, Double.MAX_VALUE, Double.MIN_VALUE, near-zero values, 1e15, 1e-15, 0.1 + 0.2.

Stress patterns:

Pattern	Example	What it catches
Identity ops	`x + 0`, `x * 1`, `x - 0`	Optimizer identity folding
Complementary ops	`(x + 5) - 5`, `(x * 3) / 3`	Optimizer inverse folding
Overflow chains	`MAX_VALUE * 2`	Overflow handling
Deep nesting	`((a + b) * c) - d` (3+ levels)	Stack depth, register allocation
Float precision	`0.1 + 0.1 + 0.1 + ...`	Precision accumulation
Boundary negation	`-Long.MAX_VALUE`	Negation overflow
Division edges	`x / 1`, `x / -1`	Division special cases

When stress mode is active, approximately 30% of generated expressions use stress patterns. The semantic coverage tracker records which stress properties were exercised.
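To see why these patterns are worth generating, the sketch below simulates signed 64-bit wrapping arithmetic in Python. It assumes JVM-style `long` semantics (two's complement wrap, division truncating toward zero); whether the language under test wraps or raises on overflow is not stated here, so treat this as an illustration of the hazard rather than of VAST's behavior. The helpers `wrap64` and `div64` are names invented for the example.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

def wrap64(n):
    """Wrap an unbounded Python int to signed 64-bit two's complement."""
    return (n + 2**63) % 2**64 - 2**63

def div64(a, b):
    """Divide, truncating toward zero, then wrap to 64 bits."""
    q = abs(a) // abs(b)
    return wrap64(q if (a < 0) == (b < 0) else -q)

x = INT64_MAX

# Complementary add/sub: the intermediate overflows, but wrapping undoes it.
add_sub = wrap64(wrap64(x + 5) - 5)

# Complementary mul/div: the overflow is not undone, so folding to x is wrong.
mul_div = div64(wrap64(x * 3), 3)

# Negating the minimum value, or dividing it by -1, wraps back onto itself.
neg_min = wrap64(-INT64_MIN)
div_min = div64(INT64_MIN, -1)

# Float precision: repeated addition of 0.1 drifts away from the exact sum.
tenth_sum = sum([0.1] * 10)

print("(x + 5) - 5 == x:", add_sub == x)
print("(x * 3) / 3 == x:", mul_div == x)
print("-MIN == MIN:", neg_min == INT64_MIN)
print("MIN / -1 == MIN:", div_min == INT64_MIN)
print("0.1 + 0.2 == 0.3:", 0.1 + 0.2 == 0.3)
print("ten 0.1s == 1.0:", tenth_sum == 1.0)
```

Under these assumptions an optimizer may fold `(x + 5) - 5` to `x`, but folding `(x * 3) / 3` to `x` changes the result at the boundary, which is exactly the kind of disagreement the stress patterns are meant to expose.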

Optimizer validation

The --opt-check flag adds a fourth execution path so that optimized and unoptimized JVM results can be compared, which isolates optimizer bugs.

vary vast --profile types --count 100 --seed 42 --opt-check

Four execution paths

Path	Pipeline
AST interpreter	Direct AST evaluation (reference oracle)
IR interpreter	AST lowered to register-based IR, then interpreted
JVM optimized	ConstantFolder + DeadCodeEliminator, then bytecode
JVM unoptimized	Direct bytecode generation, no optimization passes

Blame localization

When paths disagree, the comparator uses the pattern of agreement to identify the likely culprit. In the table, A and B stand for two different outputs:

AST	IR	JVM-unopt	JVM-opt	Blame
A	A	A	B	Optimizer bug
A	B	B	B	AST interpreter bug
A	A	B	B	Codegen bug
A	B	A	B	Mixed (multiple issues)
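The table translates directly into a small decision function. This is a sketch of the rules as listed, not the comparator's actual code; the function name and the "No mismatch" label are assumptions, and any disagreement pattern not in the table falls through to the mixed case.

```python
def localize_blame(ast, ir, jvm_unopt, jvm_opt):
    """Classify a four-path result using the blame table's rules."""
    if ast == ir == jvm_unopt == jvm_opt:
        return "No mismatch"
    if ast == ir == jvm_unopt:
        # Only the optimized JVM path differs.
        return "Optimizer bug"
    if ir == jvm_unopt == jvm_opt:
        # Only the AST interpreter differs.
        return "AST interpreter bug"
    if ast == ir and jvm_unopt == jvm_opt:
        # Both interpreters agree, both JVM paths agree, but not with each other.
        return "Codegen bug"
    return "Mixed (multiple issues)"

for outputs in [("A", "A", "A", "B"), ("A", "B", "B", "B"),
                ("A", "A", "B", "B"), ("A", "B", "A", "B")]:
    print(outputs, "->", localize_blame(*outputs))
```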

In deep CI mode, --opt-check is enabled automatically.

Path health monitoring

Path calibration verifies that execution paths are healthy before testing begins.

vary vast --profile core --count 100 --seed 42 --calibrate --path-health

The --calibrate flag runs a set of known-answer programs and checks that all paths produce expected results. If calibration fails, VAST exits with an error rather than producing unreliable results.

The --path-health flag tracks path agreement across all programs in the run and prints a reliability matrix showing how often each pair of paths agrees.
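A reliability matrix of this kind can be sketched as pairwise agreement rates. The path names, the input shape (one dictionary of path outputs per program), and the function name are assumptions for illustration; the actual printed format is not shown here.

```python
from itertools import combinations

def agreement_matrix(results):
    """Return {(path_a, path_b): fraction of programs where both agree}."""
    if not results:
        return {}
    paths = sorted(results[0])
    matrix = {}
    for a, b in combinations(paths, 2):
        agree = sum(1 for r in results if r[a] == r[b])
        matrix[(a, b)] = agree / len(results)
    return matrix

# Hypothetical outputs for four programs across three paths.
results = [
    {"ast": "3", "ir": "3", "jvm": "3"},
    {"ast": "7", "ir": "7", "jvm": "7"},
    {"ast": "0", "ir": "0", "jvm": "1"},
    {"ast": "5", "ir": "5", "jvm": "5"},
]
matrix = agreement_matrix(results)
for (a, b), rate in matrix.items():
    print(f"{a} vs {b}: {rate:.0%}")
```

A pair whose agreement rate drops well below the others points at the path they do not share with the healthy pairs, here the JVM path.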

Continuous exploration

Continuous mode runs time-bounded exploration with adaptive profile selection. In the example, --duration 300 sets a budget of 300 seconds.

vary vast --mode continuous --duration 300

Continuous mode picks profiles and batch sizes based on what has been tested so far.

Behavior	How it works
Profile selection	Profiles with lower semantic coverage get higher selection probability. Untried profiles get an exploration bonus.
Batch sizing	Early iterations use small batches (20 programs) for breadth. Later iterations use larger batches (100 programs) for depth.
Cumulative state	Coverage maps merge across iterations, so the system tracks overall progress across profiles.
Convergence	When cumulative semantic coverage exceeds 95%, exploration stops early.
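The behaviors in the table can be sketched as three small functions. The batch sizes (20 and 100) and the 95% convergence threshold come from the table; the weighting formula, the size of the exploration bonus, and the iteration at which batches grow are not documented here and are assumptions chosen for the example.

```python
import random

def selection_weights(coverage, tried, bonus=0.5):
    """Lower semantic coverage gives a higher weight; untried profiles get a bonus."""
    weights = {}
    for profile, cov in coverage.items():
        w = 1.0 - cov
        if profile not in tried:
            w += bonus  # hypothetical exploration bonus
        weights[profile] = max(w, 0.01)  # keep every profile selectable
    return weights

def batch_size(iteration, warmup=5):
    """Small batches early for breadth, larger ones later for depth."""
    return 20 if iteration < warmup else 100

def should_stop(cumulative_coverage):
    """Stop early once cumulative semantic coverage exceeds 95%."""
    return cumulative_coverage > 0.95

coverage = {"core": 0.9, "types": 0.6, "collections": 0.0}
tried = {"core", "types"}
weights = selection_weights(coverage, tried)

rng = random.Random(42)
profile = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
print(profile, batch_size(0), batch_size(10), should_stop(0.96))
```

Weighted random choice, rather than always picking the least-covered profile, keeps well-covered profiles in rotation at a low rate, which is one plausible way to get the behavior described.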

The report includes a per-profile summary, confidence score, and coverage gap analysis identifying which profiles and properties need more work.

VAST Continuous Exploration Report
============================================================
Duration: 300.2s | Iterations: 47
Programs: 2340 total, 2340 passed, 0 mismatches
Status: TIME_BUDGET_EXHAUSTED

Per-Profile Summary:
  Profile        Programs   Passed Mismatches    Iters
  ------------------------------------------------------
  collections          90       90          0        3
  complete            450      450          0       10
  control             180      180          0        5
  core                160      160          0        4
  ...

Confidence: HIGH (82%)
  ...

Coverage Gap Analysis:
  Overall semantic coverage: 88%
  ...

See CI integration for how continuous mode fits into the CI pipeline.