VAST

Differential testing

Run the same program through multiple independent engines. If they produce different results, one of them is wrong. With three engines, you can usually tell which one.

The core idea

Differential testing is simple: run the same program through two or more independent implementations, and compare the results. If they disagree, at least one of them has a bug.

This technique has found thousands of bugs in production compilers. Csmith found hundreds of bugs in GCC and Clang by generating random C programs and comparing output across compilers. VAST applies the same idea inside Vary, but instead of comparing different compilers, it compares three independent execution paths within the same project.
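The comparison step can be sketched in a few lines. This is a minimal illustration, not VAST's implementation: `differential_check`, the toy "a+b" language, and both engines are hypothetical names invented for the example, and the second engine carries a deliberately planted fault.

```python
from typing import Callable, Dict, Hashable, List


def differential_check(
    program: str, engines: Dict[str, Callable[[str], Hashable]]
) -> Dict[Hashable, List[str]]:
    """Run `program` through every engine and group engine names by result."""
    groups: Dict[Hashable, List[str]] = {}
    for name, run in engines.items():
        groups.setdefault(run(program), []).append(name)
    return groups


# Two hypothetical engines for a toy language whose programs look like "a+b".
def engine_reference(src: str) -> int:
    left, right = src.split("+")
    return int(left) + int(right)


def engine_buggy(src: str) -> int:
    left, right = src.split("+")
    total = int(left) + int(right)
    # Deliberately planted fault: sums of 100 or more are off by one.
    return total if total < 100 else total - 1


ENGINES = {"reference": engine_reference, "buggy": engine_buggy}
```

A single group means every engine agreed. More than one group means at least one engine is wrong, which is all a two-engine setup can say.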

Multiple paths, one program

Every VAST-generated program runs through three paths by default, or four with --opt-check:

Generated program
       |
       +--- AST interpreter          (reference oracle)
       |
       +--- IR interpreter           (middle layer)
       |
       +--- JVM compiler (optimized) (real pipeline)
       |
       +--- JVM compiler (unoptimized, with --opt-check)
       |
Compare all results

The AST interpreter walks the syntax tree directly. It uses sealed value types (VInt, VBool, VStr, etc.) with no type casting. Simple on purpose: its job is to be obviously correct, not fast. This is the reference implementation.

The IR interpreter lowers the AST to a flat intermediate representation (registers, blocks, jumps) and interprets it. This provides a middle layer between the high-level AST interpreter and the low-level JVM path.

The JVM compiler takes the same AST through the real compiler pipeline: constant folding, dead code elimination, type checking, bytecode generation, and classloader execution. This is the path that matters for users.
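To make the idea of independent paths concrete, here is a toy sketch of the first two: a tree-walking evaluator and an evaluator that first lowers the tree to flat register instructions. Everything in it (`Num`, `BinOp`, `eval_ast`, `lower`, `eval_ir`) is an assumption made for illustration. It covers only integer arithmetic, has no blocks or jumps, and does not reflect Vary's actual AST or IR.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Num:
    value: int


@dataclass(frozen=True)
class BinOp:
    op: str  # "+", "-", or "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Num, BinOp]

# Shared only to keep the sketch short; truly independent paths would not share this.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def eval_ast(node: Expr) -> int:
    """Path 1: walk the syntax tree directly."""
    if isinstance(node, Num):
        return node.value
    return OPS[node.op](eval_ast(node.left), eval_ast(node.right))


def lower(node: Expr, code: List[Tuple]) -> int:
    """Append flat instructions for `node`; return the register holding its value."""
    if isinstance(node, Num):
        dst = len(code)  # each instruction defines exactly one register
        code.append(("const", dst, node.value))
        return dst
    lhs = lower(node.left, code)
    rhs = lower(node.right, code)
    dst = len(code)
    code.append(("bin", node.op, dst, lhs, rhs))
    return dst


def eval_ir(node: Expr) -> int:
    """Path 2: lower to register instructions, then interpret them in order."""
    code: List[Tuple] = []
    result_reg = lower(node, code)
    regs: Dict[int, int] = {}
    for instr in code:
        if instr[0] == "const":
            _, dst, value = instr
            regs[dst] = value
        else:
            _, op, dst, lhs, rhs = instr
            regs[dst] = OPS[op](regs[lhs], regs[rhs])
    return regs[result_reg]


program = BinOp("+", BinOp("-", Num(7), Num(3)), Num(3))  # (7 - 3) + 3
```

Both evaluators start from the same `program` but share almost no control flow, so a bug in `lower` would show up as a disagreement between `eval_ast(program)` and `eval_ir(program)`.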

Why three or more paths instead of two

Two paths can detect a disagreement, but they cannot tell you which one is wrong. With three or more paths, blame localization becomes possible:

Three-path blame (default):

| AST | IR | JVM | Likely fault |
|---|---|---|---|
| agree | agree | differs | Codegen or bytecode emission bug |
| agree | differs | agree | IR lowering bug |
| differs | agree | agree | AST interpreter bug (reference is wrong) |
| all differ | | | Multiple independent bugs (rare) |

Four-path blame (with --opt-check):

| AST | IR | JVM-unopt | JVM-opt | Likely fault |
|---|---|---|---|---|
| A | A | A | B | Optimizer bug |
| A | B | B | B | AST interpreter bug |
| A | A | B | B | Codegen bug |

When a majority of paths agree and one differs, the odd one out is the suspect. This narrows debugging from "something is wrong somewhere" to "the fault is in this specific compiler stage." The four-path mode is particularly valuable because many real compiler bugs live in optimizers, and the three-path default always runs the optimized pipeline, so on its own it cannot separate an optimizer bug from a codegen bug.
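The majority rule for the three-path case might look like the sketch below. The function names and the `"ast"`/`"ir"`/`"jvm"` keys are hypothetical, and the fault labels are borrowed from the three-path table. The four-path patterns (such as a two-against-two split) need pattern matching beyond a single outlier and are not covered here.

```python
from collections import Counter
from typing import Dict, Hashable, Optional

# Fault labels keyed by whichever path is the odd one out (hypothetical keys).
THREE_PATH_FAULT = {
    "jvm": "Codegen or bytecode emission bug",
    "ir": "IR lowering bug",
    "ast": "AST interpreter bug (reference is wrong)",
}


def odd_one_out(results: Dict[str, Hashable]) -> Optional[str]:
    """Return the single path that disagrees with all the others, if there is one."""
    if len(results) < 3:
        return None  # two paths can detect a disagreement but cannot assign blame
    counts = Counter(results.values())
    majority_value, majority_size = counts.most_common(1)[0]
    if majority_size != len(results) - 1:
        return None  # everyone agrees, or there is no single outlier
    return next(path for path, value in results.items() if value != majority_value)


def blame(results: Dict[str, Hashable]) -> str:
    """Turn three path results into a human-readable suspicion."""
    if len(set(results.values())) == 1:
        return "agreement"
    suspect = odd_one_out(results)
    if suspect is None:
        return "no single suspect (multiple paths disagree)"
    return THREE_PATH_FAULT[suspect]
```

The result is a suspicion rather than a proof: two paths can share a bug and outvote a correct third.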

What counts as agreement

VAST normalizes results before comparison. Both successful return values and runtime errors are compared:

| Path A result | Path B result | Verdict |
|---|---|---|
| Success(42) | Success(42) | Agreement |
| Success(7) | Success(9) | Value mismatch |
| RuntimeError(DIV_ZERO) | RuntimeError(DIV_ZERO) | Agreement |
| Success(42) | RuntimeError(DIV_ZERO) | Outcome kind mismatch |
| RuntimeError(DIV_ZERO) | RuntimeError(STACK_OVERFLOW) | Error category mismatch |
| RuntimeError(INFINITE_LOOP) | Timeout | Agreement (same root cause) |

The last row is important. The AST interpreter detects infinite loops by counting iterations. The JVM executor detects them by timeout. Both map to the same error category, so VAST treats them as agreement.
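One way to express these comparison rules is sketched below. The types and function names are assumptions for illustration, not VAST's API. The error class is called `RuntimeFailure` here only to avoid shadowing Python's built-in `RuntimeError`, and the verdict strings are taken from the table.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Success:
    value: object


@dataclass(frozen=True)
class RuntimeFailure:
    category: str  # e.g. "DIV_ZERO", "STACK_OVERFLOW", "INFINITE_LOOP"


@dataclass(frozen=True)
class Timeout:
    pass


Outcome = Union[Success, RuntimeFailure, Timeout]


def normalize(outcome: Outcome) -> Outcome:
    """Fold a timeout into the INFINITE_LOOP error category before comparing."""
    if isinstance(outcome, Timeout):
        return RuntimeFailure("INFINITE_LOOP")
    return outcome


def verdict(a: Outcome, b: Outcome) -> str:
    """Compare two normalized outcomes and name the kind of (dis)agreement."""
    a, b = normalize(a), normalize(b)
    if type(a) is not type(b):
        return "Outcome kind mismatch"
    if isinstance(a, Success):
        return "Agreement" if a.value == b.value else "Value mismatch"
    return "Agreement" if a.category == b.category else "Error category mismatch"
```

Normalizing before comparing is what lets `verdict(RuntimeFailure("INFINITE_LOOP"), Timeout())` come out as agreement.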

Error categories

Both interpreters and the JVM executor map their exceptions to a shared set of categories:

| Category | AST interpreter | JVM executor |
|---|---|---|
| DIVISION_BY_ZERO | Caught at divide/modulo operations | ArithmeticException |
| STACK_OVERFLOW | Call depth limit exceeded | StackOverflowError |
| INFINITE_LOOP | Iteration cap exceeded | Execution timeout |
| INDEX_OUT_OF_BOUNDS | List index check | IndexOutOfBoundsException |
| NULL_REFERENCE | None access check | NullPointerException |

This normalization keeps comparison fair across paths that detect the same problem in different ways.
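The same mapping idea can be shown with Python's own exceptions standing in for the JVM ones. This is a loose analogy, not VAST's code: the exception-to-category pairs below are assumptions chosen for the example, and `IterationCapExceeded` and `run_and_categorize` are hypothetical names.

```python
from typing import Callable, Tuple


class IterationCapExceeded(Exception):
    """Hypothetical error raised when a loop exceeds its iteration budget."""


# Checked in order; the first matching exception type wins.
CATEGORY_BY_EXCEPTION = [
    (ZeroDivisionError, "DIVISION_BY_ZERO"),
    (RecursionError, "STACK_OVERFLOW"),
    (IterationCapExceeded, "INFINITE_LOOP"),
    (TimeoutError, "INFINITE_LOOP"),
    (IndexError, "INDEX_OUT_OF_BOUNDS"),
    (AttributeError, "NULL_REFERENCE"),  # e.g. attribute access on None
]


def run_and_categorize(thunk: Callable[[], object]) -> Tuple[str, object]:
    """Run `thunk` and report either its value or a shared error category."""
    try:
        return ("success", thunk())
    except Exception as exc:
        for exc_type, category in CATEGORY_BY_EXCEPTION:
            if isinstance(exc, exc_type):
                return ("error", category)
        raise  # unknown failures should surface rather than be miscategorized


def capped_loop() -> None:
    """Stand-in for an interpreter loop that gives up after a fixed budget."""
    raise IterationCapExceeded("iteration cap exceeded")
```

Two different detection mechanisms, the iteration cap and `TimeoutError`, land in the same `INFINITE_LOOP` category, which is the property the comparison relies on.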

What differential testing finds

Differential testing is effective at finding bugs that live in unusual feature interactions:

| Bug class | Example | How differential testing finds it |
|---|---|---|
| Optimizer bug | Constant folder changes semantics for edge-case arithmetic | Interpreter computes correctly, JVM returns wrong result |
| Codegen bug | Wrong arithmetic opcode emitted for specific nesting | Interpreter agrees on correct value, JVM disagrees |
| Control flow bug | Off-by-one in bytecode jump target for while loops | Loop runs one too many times on JVM path |
| Type system bug | Incorrect casting between numeric types | Interpreter preserves value, JVM truncates |
| Scoping bug | Variable shadowing handled differently in compiler | Interpreter uses correct binding, JVM uses wrong one |

These bugs are hard to catch with hand-written tests because developers test expected behaviour. Nobody sits down and writes a test for (x - y) + y inside a nested conditional with mutable loop variables. The generator produces those combinations naturally.
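As a hypothetical illustration of the optimizer row, consider an unsound rewrite that folds `x // x` to `1`. The rewrite and both functions below are invented for this sketch and are not taken from Vary's optimizer. The point is that a hand-written test with "normal" inputs passes, while comparing against a literal evaluation exposes the one input where the rewrite changes behaviour.

```python
from typing import Iterable, List, Tuple


def reference_eval(x: int) -> Tuple[str, object]:
    """Reference path: evaluate x // x literally."""
    try:
        return ("success", x // x)
    except ZeroDivisionError:
        return ("error", "DIVISION_BY_ZERO")


def optimized_eval(x: int) -> Tuple[str, object]:
    """'Optimized' path with a deliberately unsound rewrite: x // x => 1."""
    return ("success", 1)


def find_mismatches(inputs: Iterable[int]) -> List[int]:
    """Return every input on which the two paths disagree."""
    return [x for x in inputs if reference_eval(x) != optimized_eval(x)]
```

Calling `find_mismatches(range(-3, 4))` returns only the input `0`, where the reference path reports a division by zero and the rewritten path silently returns `1`.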

Deterministic replay

Every VAST run uses a seed. The generator is deterministic: the same seed always produces the same program. When a mismatch is found, the seed is printed alongside the failing program:

VAST mismatch [seed=41822917, profile=core]
  Verdict: MISMATCH_VALUE
  AST interpreter: success(Int(7))
  IR interpreter:  success(Int(7))
  JVM bytecode:    success(Int(9))

  Replay: vary vast --seed 41822917 --count 1 --profile core

Replay is exact. The same seed produces the same program, the same execution, and the same mismatch. Without this, a failure found by random testing would be very hard to reproduce, let alone debug.
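Seed-based determinism can be sketched with a small expression generator. This is a toy, not VAST's generator: `generate_expr` and `program_for_seed` are hypothetical, and the grammar is limited to integer arithmetic over `x` and `y`.

```python
import random


def generate_expr(rng: random.Random, depth: int = 3) -> str:
    """Build a random arithmetic expression over x and y as source text."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(["x", "y", str(rng.randint(0, 9))])
    op = rng.choice(["+", "-", "*"])
    left = generate_expr(rng, depth - 1)
    right = generate_expr(rng, depth - 1)
    return f"({left} {op} {right})"


def program_for_seed(seed: int) -> str:
    """Generate the program for `seed` without touching global random state."""
    # A private Random instance means nothing else in the process can
    # perturb the sequence of choices made for this seed.
    return generate_expr(random.Random(seed))
```

Because every random choice flows from one seeded `random.Random` instance, calling `program_for_seed` twice with the same seed yields identical source text, so a reported seed is enough to regenerate the failing program.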

Where it fits

Differential testing is the foundation of the VAST program. Every other technique builds on it:

| Technique | Role |
|---|---|
| Metamorphic testing | Transforms programs and checks that multi-path comparison still agrees |
| Mutation testing | Injects faults and verifies that the comparison detects them |
| Reduction | Shrinks failing programs found by differential testing to minimal reproducers |
| CI integration | Runs differential testing continuously across all language profiles |