Differential testing

Run the same program through multiple independent engines. If they produce different results, one of them is wrong. With three engines, you can usually tell which one.

The core idea

Differential testing is simple: run the same program through two or more independent implementations, and compare the results. If they disagree, at least one of them has a bug.
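A minimal sketch of that compare step in Python. The `differential_test` helper and the two toy engines are hypothetical and are not part of VAST; they only illustrate how a disagreement surfaces.

```python
from typing import Callable, Dict, Tuple

def differential_test(
    program: str, engines: Dict[str, Callable[[str], object]]
) -> Tuple[bool, Dict[str, object]]:
    """Run one program through every engine and report whether all results match."""
    results = {name: run(program) for name, run in engines.items()}
    distinct = {repr(r) for r in results.values()}
    return len(distinct) == 1, results

# Two hypothetical engines for programs of the form "a / b" over integers.
def engine_floor(program: str) -> int:
    a, b = (int(p) for p in program.split("/"))
    return a // b            # rounds toward negative infinity

def engine_trunc(program: str) -> int:
    a, b = (int(p) for p in program.split("/"))
    return int(a / b)        # rounds toward zero

engines = {"floor": engine_floor, "trunc": engine_trunc}
print(differential_test("7 / 2", engines))    # engines agree
print(differential_test("-7 / 2", engines))   # engines disagree on rounding
```

The two engines agree on positive operands and disagree on negative ones. Neither result is obviously wrong without a specification, which is why a mismatch only says that at least one path has a bug.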

This technique has found thousands of bugs in production compilers. Csmith found hundreds of bugs in GCC and Clang by generating random C programs and comparing output across compilers. VAST applies the same idea inside Vary, but instead of comparing different compilers, it compares three independent execution paths within the same project.

Multiple paths, one program

Every VAST-generated program runs through three paths by default, or four with --opt-check:

Generated program
       |
       +--- AST interpreter          (reference oracle)
       |
       +--- IR interpreter           (middle layer)
       |
       +--- JVM compiler (optimized) (real pipeline)
       |
       +--- JVM compiler (unoptimized, with --opt-check)
              |
        Compare all results

The AST interpreter walks the syntax tree directly. It uses sealed value types (VInt, VBool, VStr, etc.) with no type casting. Simple on purpose: its job is to be obviously correct, not fast. This is the reference implementation.

The IR interpreter lowers the AST to a flat intermediate representation (registers, blocks, jumps) and interprets it. This provides a middle layer between the high-level AST interpreter and the low-level JVM path.

The JVM compiler takes the same AST through the real compiler pipeline: constant folding, dead code elimination, type checking, bytecode generation, and classloader execution. This is the path that matters for users.
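To make the difference between a tree-walking path and a compiled path concrete, here is a toy sketch in Python. It is not Vary code: the tuple-based expression format, the `interpret`, `compile_expr`, and `run_compiled` names, and the stack machine are all assumptions chosen for illustration.

```python
from typing import List, Tuple, Union

Expr = Union[int, Tuple[str, "Expr", "Expr"]]   # e.g. ("add", 1, ("mul", 2, 3))

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def interpret(expr: Expr) -> int:
    """Path 1: walk the tree directly."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](interpret(left), interpret(right))

def compile_expr(expr: Expr) -> List[Tuple[str, int]]:
    """Path 2, step 1: flatten the tree into stack-machine instructions."""
    if isinstance(expr, int):
        return [("push", expr)]
    op, left, right = expr
    return compile_expr(left) + compile_expr(right) + [(op, 0)]

def run_compiled(code: List[Tuple[str, int]]) -> int:
    """Path 2, step 2: execute the flat instruction list."""
    stack: List[int] = []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        else:
            right, left = stack.pop(), stack.pop()   # right operand is on top
            stack.append(OPS[op](left, right))
    return stack.pop()

program = ("sub", ("add", 10, 4), ("mul", 2, 3))     # (10 + 4) - (2 * 3)
assert interpret(program) == run_compiled(compile_expr(program)) == 8
```

The two paths share only the operator table. A mistake in operand order inside `run_compiled`, for example, would leave `interpret` unaffected and show up as a mismatch on any `sub` expression.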

Why three or more paths instead of two

Two paths can detect a disagreement, but they cannot tell you which one is wrong. With three or more paths, blame localization becomes possible:

Three-path blame (default):

AST	IR	JVM	Likely fault
agree	agree	differs	Optimizer, codegen, or bytecode emission bug
agree	differs	agree	IR lowering bug
differs	agree	agree	AST interpreter bug (reference is wrong)
all differ			Multiple independent bugs (rare)

Four-path blame (with --opt-check):

AST	IR	JVM-unopt	JVM-opt	Likely fault
A	A	A	B	Optimizer bug
A	B	B	B	AST interpreter bug
A	A	B	B	Codegen bug

When a majority of paths agree and one differs, the odd one out is the suspect. This narrows debugging from "something is wrong somewhere" to "the fault is in this specific compiler stage." The four-path mode is particularly valuable because optimizers are a common source of real compiler bugs, and the three-path default always applies optimizations, so on its own it cannot separate an optimizer bug from a codegen bug.
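The odd-one-out rule can be sketched as a small Python function. The `suspect` name and the dictionary-of-results shape are assumptions for illustration, not VAST's actual API. The sketch deliberately returns `None` for split patterns such as A/A/B/B, which need knowledge of the pipeline stages rather than a vote.

```python
from collections import Counter
from typing import Dict, Optional

def suspect(results: Dict[str, object]) -> Optional[str]:
    """Return the single path that disagrees with all the others, or None.

    None means full agreement, or no unique odd one out.
    """
    counts = Counter(repr(v) for v in results.values())
    if len(counts) != 2:
        return None                  # all agree, or three or more distinct results
    (_, majority_n), (minority, minority_n) = counts.most_common(2)
    if minority_n != 1 or majority_n < 2:
        return None                  # a 1-vs-1 or 2-vs-2 split has no clear suspect
    return next(name for name, v in results.items() if repr(v) == minority)

print(suspect({"ast": 7, "ir": 7, "jvm": 9}))   # jvm
print(suspect({"ast": 7, "jvm": 9}))            # None: two paths cannot assign blame
```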

What counts as agreement

VAST normalizes results before comparison. Both successful return values and runtime errors are compared:

Path A result	Path B result	Verdict
`Success(42)`	`Success(42)`	Agreement
`Success(7)`	`Success(9)`	Value mismatch
`RuntimeError(DIVISION_BY_ZERO)`	`RuntimeError(DIVISION_BY_ZERO)`	Agreement
`Success(42)`	`RuntimeError(DIVISION_BY_ZERO)`	Outcome kind mismatch
`RuntimeError(DIVISION_BY_ZERO)`	`RuntimeError(STACK_OVERFLOW)`	Error category mismatch
`RuntimeError(INFINITE_LOOP)`	`Timeout`	Agreement (same root cause)

The last row is important. The AST interpreter detects infinite loops by counting iterations. The JVM executor detects them by timeout. Both map to the same error category, so VAST treats them as agreement.
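The verdict table can be sketched in Python as follows. The class names and the `verdict` function are hypothetical. `RuntimeFailure` stands in for the `RuntimeError(...)` outcome so that the sketch does not shadow Python's built-in `RuntimeError`.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Success:
    value: object

@dataclass(frozen=True)
class RuntimeFailure:
    category: str

@dataclass(frozen=True)
class Timeout:
    pass

Outcome = Union[Success, RuntimeFailure, Timeout]

def normalize(outcome: Outcome) -> Outcome:
    """Fold a timeout into INFINITE_LOOP so both detection styles compare equal."""
    if isinstance(outcome, Timeout):
        return RuntimeFailure("INFINITE_LOOP")
    return outcome

def verdict(a: Outcome, b: Outcome) -> str:
    a, b = normalize(a), normalize(b)
    if type(a) is not type(b):
        return "outcome kind mismatch"
    if isinstance(a, Success):
        return "agreement" if a.value == b.value else "value mismatch"
    return "agreement" if a.category == b.category else "error category mismatch"

print(verdict(Success(7), Success(9)))                        # value mismatch
print(verdict(RuntimeFailure("INFINITE_LOOP"), Timeout()))    # agreement
```

Normalizing before comparing keeps the comparison itself simple: once a timeout is rewritten as an `INFINITE_LOOP` failure, the last row of the table needs no special case.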

Error categories

Both interpreters and the JVM executor map their exceptions to a shared set of categories:

Category	AST interpreter	JVM executor
`DIVISION_BY_ZERO`	Caught at divide/modulo operations	`ArithmeticException`
`STACK_OVERFLOW`	Call depth limit exceeded	`StackOverflowError`
`INFINITE_LOOP`	Iteration cap exceeded	Execution timeout
`INDEX_OUT_OF_BOUNDS`	List index check	`IndexOutOfBoundsException`
`NULL_REFERENCE`	None access check	`NullPointerException`

This normalization keeps comparison fair across paths that detect the same problem in different ways.
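As a rough illustration, the JVM column of the table could be expressed as a lookup like the one below. This is a sketch in Python, not VAST's implementation: the `categorize_jvm` function, its parameters, and the `UNCATEGORIZED` fallback are assumptions, and an exact-name lookup does not account for exception subclasses.

```python
from typing import Optional

# JVM exception class name -> shared category, taken from the table above.
JVM_CATEGORIES = {
    "java.lang.ArithmeticException": "DIVISION_BY_ZERO",
    "java.lang.StackOverflowError": "STACK_OVERFLOW",
    "java.lang.IndexOutOfBoundsException": "INDEX_OUT_OF_BOUNDS",
    "java.lang.NullPointerException": "NULL_REFERENCE",
}

def categorize_jvm(exception_class: Optional[str], timed_out: bool = False) -> str:
    """Map a JVM-side failure to a shared category name."""
    if timed_out:
        return "INFINITE_LOOP"       # the JVM path reports non-termination as a timeout
    # Hypothetical fallback for anything the table does not list.
    return JVM_CATEGORIES.get(exception_class, "UNCATEGORIZED")

print(categorize_jvm("java.lang.ArithmeticException"))   # DIVISION_BY_ZERO
print(categorize_jvm(None, timed_out=True))              # INFINITE_LOOP
```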

What differential testing finds

Differential testing is effective at finding bugs that live in unusual feature interactions:

Bug class	Example	How differential testing finds it
Optimizer bug	Constant folder changes semantics for edge-case arithmetic	Interpreter computes correctly, JVM returns wrong result
Codegen bug	Wrong arithmetic opcode emitted for specific nesting	Interpreter agrees on correct value, JVM disagrees
Control flow bug	Off-by-one in bytecode jump target for while loops	Loop runs one too many times on JVM path
Type system bug	Incorrect casting between numeric types	Interpreter preserves value, JVM truncates
Scoping bug	Variable shadowing handled differently in compiler	Interpreter uses correct binding, JVM uses wrong one

These bugs are hard to catch with hand-written tests because developers test expected behaviour. Nobody sits down and writes a test for (x - y) + y inside a nested conditional with mutable loop variables. The generator produces those combinations naturally.

Deterministic replay

Every VAST run uses a seed. The generator is deterministic: the same seed always produces the same program. When a mismatch is found, the seed is printed alongside the failing program:

VAST mismatch [seed=41822917, profile=core]
  Verdict: MISMATCH_VALUE
  AST interpreter: success(Int(7))
  IR interpreter:  success(Int(7))
  JVM bytecode:    success(Int(9))

  Replay: vary vast --seed 41822917 --count 1 --profile core

Replay is exact. The same seed produces the same program, the same execution, and the same mismatch. Without this, failures found by random testing would be very hard to debug.
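The property that makes replay work is that the generator draws all randomness from a seeded source. A toy Python sketch follows; the `generate` function and its expression grammar are hypothetical and far simpler than VAST's generator.

```python
import random

def generate(seed: int, depth: int = 3) -> str:
    """Build a random arithmetic expression that depends only on the seed."""
    rng = random.Random(seed)        # private RNG, no shared global state

    def expr(d: int) -> str:
        if d == 0 or rng.random() < 0.3:
            return str(rng.randint(0, 9))
        op = rng.choice(["+", "-", "*"])
        return f"({expr(d - 1)} {op} {expr(d - 1)})"

    return expr(depth)

# The same seed yields the same program text, so a failure can be regenerated on demand.
assert generate(41822917) == generate(41822917)
```

Using a private `random.Random(seed)` instance instead of the module-level functions keeps the output independent of any other code that touches the global random state.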

Where it fits

Differential testing is the foundation of the VAST program. Every other technique builds on it:

Technique	Role
Metamorphic testing	Transforms programs and checks that multi-path comparison still agrees
Mutation testing	Injects faults and verifies that the comparison detects them
Reduction	Shrinks failing programs found by differential testing to minimal reproducers
CI integration	Runs differential testing continuously across all language profiles