Run the same program through multiple independent engines. If they produce different results, one of them is wrong. With three engines, you can usually tell which one.
Differential testing is simple: run the same program through two or more independent implementations, and compare the results. If they disagree, at least one of them has a bug.
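The idea can be sketched in a few lines. The following is a minimal illustration, not VAST code: the three "engines" are stand-in Python callables that all claim to compute 1 + 2 + ... + n, and one of them carries a planted bug. All names here (`run_all`, `group_by_result`, `ENGINES`) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

Engine = Callable[[int], Hashable]


def run_all(program_input: int, engines: Dict[str, Engine]) -> Dict[str, Hashable]:
    """Run the same input through every engine and collect results by name."""
    return {name: engine(program_input) for name, engine in engines.items()}


def group_by_result(results: Dict[str, Hashable]) -> Dict[Hashable, List[str]]:
    """Group engine names by the result they produced."""
    groups = defaultdict(list)
    for name, value in results.items():
        groups[value].append(name)
    return dict(groups)


# Three stand-in engines computing 1 + 2 + ... + n; one has a planted bug.
ENGINES: Dict[str, Engine] = {
    "loop": lambda n: sum(range(1, n + 1)),
    "closed_form": lambda n: n * (n + 1) // 2,
    "buggy": lambda n: n * (n - 1) // 2,  # planted off-by-one
}

results = run_all(10, ENGINES)
groups = group_by_result(results)
if len(groups) > 1:
    # More than one distinct result means at least one engine is wrong.
    print("disagreement:", groups)
```

With more than one group, at least one engine is wrong; the grouping also shows which engines side with each other.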
This technique has found thousands of bugs in production compilers. Csmith found hundreds of bugs in GCC and Clang by generating random C programs and comparing output across compilers. VAST applies the same idea inside Vary, but instead of comparing different compilers, it compares three independent execution paths within the same project.
Every VAST-generated program runs through three paths by default, or four with --opt-check:
```
Generated program
  |
  +--- AST interpreter                              (reference oracle)
  |
  +--- IR interpreter                               (middle layer)
  |
  +--- JVM compiler (optimized)                     (real pipeline)
  |
  +--- JVM compiler (unoptimized, with --opt-check)
  |
Compare all results
```
The AST interpreter walks the syntax tree directly. It uses sealed value types (VInt, VBool, VStr, etc.) with no type casting. Simple on purpose: its job is to be obviously correct, not fast. This is the reference implementation.
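To make "walks the syntax tree directly" concrete, here is a small sketch of a tree-walking evaluator in Python. It is an illustration under assumptions, not the Vary implementation: only `VInt` and `VBool` come from the text above, while the node types (`Lit`, `Add`, `Less`, `If`) and helper names are hypothetical. Python has no sealed classes, so a `Union` of frozen dataclasses stands in for them.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class VInt:
    value: int


@dataclass(frozen=True)
class VBool:
    value: bool


Value = Union[VInt, VBool]


# Hypothetical AST node types for a tiny expression language.
@dataclass(frozen=True)
class Lit:
    value: Value


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Less:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class If:
    cond: "Expr"
    then: "Expr"
    orelse: "Expr"


Expr = Union[Lit, Add, Less, If]


def expect_int(v: Value) -> int:
    """Unwrap a VInt; any other value is an error, never a silent cast."""
    if not isinstance(v, VInt):
        raise TypeError(f"expected VInt, got {type(v).__name__}")
    return v.value


def evaluate(e: Expr) -> Value:
    """Evaluate an expression by recursing over the tree."""
    if isinstance(e, Lit):
        return e.value
    if isinstance(e, Add):
        return VInt(expect_int(evaluate(e.left)) + expect_int(evaluate(e.right)))
    if isinstance(e, Less):
        return VBool(expect_int(evaluate(e.left)) < expect_int(evaluate(e.right)))
    if isinstance(e, If):
        cond = evaluate(e.cond)
        if not isinstance(cond, VBool):
            raise TypeError(f"expected VBool, got {type(cond).__name__}")
        return evaluate(e.then if cond.value else e.orelse)
    raise TypeError(f"unknown node: {e!r}")


# if 1 < 2 then 40 + 2 else 0
program = If(Less(Lit(VInt(1)), Lit(VInt(2))),
             Add(Lit(VInt(40)), Lit(VInt(2))),
             Lit(VInt(0)))
result = evaluate(program)
```

Each node kind has exactly one evaluation rule and each value carries its own tag, which is what keeps an interpreter of this style easy to audit.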
The IR interpreter lowers the AST to a flat intermediate representation (registers, blocks, jumps) and interprets it. This provides a middle layer between the high-level AST interpreter and the low-level JVM path.
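A flat IR of this kind can be pictured as labelled blocks of register instructions ending in jumps. The sketch below interprets such an IR in Python; the instruction set (`const`, `add`, `lt`, `jump`, `branch`, `ret`) and the tuple encoding are assumptions made for illustration, and the lowering step from AST to IR is not shown.

```python
from typing import Dict, List, Tuple

Instr = Tuple  # (opcode, operand, ...)


def run_ir(blocks: Dict[str, List[Instr]], entry: str = "entry") -> int:
    """Interpret a block-structured register IR until a `ret` is reached."""
    regs: Dict[str, int] = {}
    label, pc = entry, 0
    while True:
        op, *args = blocks[label][pc]
        pc += 1
        if op == "const":          # dst <- immediate
            dst, value = args
            regs[dst] = value
        elif op == "add":          # dst <- a + b
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "lt":           # dst <- 1 if a < b else 0
            dst, a, b = args
            regs[dst] = int(regs[a] < regs[b])
        elif op == "jump":         # unconditional jump to a block
            (label,) = args
            pc = 0
        elif op == "branch":       # conditional jump on a register
            cond, then_label, else_label = args
            label = then_label if regs[cond] else else_label
            pc = 0
        elif op == "ret":          # return the value of a register
            (src,) = args
            return regs[src]
        else:
            raise ValueError(f"unknown opcode: {op}")


# Equivalent of: acc = 0; i = 0; while i < 5: acc += i; i += 1; return acc
blocks = {
    "entry": [("const", "i", 0), ("const", "acc", 0),
              ("const", "one", 1), ("const", "n", 5), ("jump", "head")],
    "head": [("lt", "c", "i", "n"), ("branch", "c", "body", "exit")],
    "body": [("add", "acc", "acc", "i"), ("add", "i", "i", "one"), ("jump", "head")],
    "exit": [("ret", "acc")],
}
total = run_ir(blocks)
```

Structured control flow such as `while` disappears at this level and becomes blocks and jumps, so this path exercises different logic than a tree walk does.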
The JVM compiler takes the same AST through the real compiler pipeline: constant folding, dead code elimination, type checking, bytecode generation, and classloader execution. This is the path that matters for users.
Two paths can detect a disagreement, but they cannot tell you which one is wrong. With three or more paths, blame localization becomes possible:
Three-path blame (default):
| AST | IR | JVM | Likely fault |
|---|---|---|---|
| agree | agree | differs | Codegen or bytecode emission bug |
| agree | differs | agree | IR lowering bug |
| differs | agree | agree | AST interpreter bug (reference is wrong) |
| differs | differs | differs | Multiple independent bugs (rare) |
Four-path blame (with --opt-check):
| AST | IR | JVM-unopt | JVM-opt | Likely fault |
|---|---|---|---|---|
| A | A | A | B | Optimizer bug |
| A | B | B | B | AST interpreter bug |
| A | A | B | B | Codegen bug |
When a majority of paths agree and one differs, the odd one out is the suspect. This narrows debugging from "something is wrong somewhere" to "the fault is in this specific compiler stage." The four-path mode is particularly valuable because optimizers are a common source of real compiler bugs, and the three-path default always applies optimizations, so a JVM-only mismatch cannot be attributed to the optimizer rather than to codegen.
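The majority rule can be sketched as a small voting function. This is an illustrative Python sketch, not VAST's actual logic; the function name `odd_ones_out` and the path labels are hypothetical. It returns the paths that disagree with a strict majority, and deliberately gives up when there is no strict majority.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional


def odd_ones_out(results: Dict[str, Hashable]) -> Optional[List[str]]:
    """Return the paths that disagree with a strict majority.

    [] means all paths agree; None means no strict majority exists,
    so voting alone cannot assign blame.
    """
    counts = Counter(results.values())
    majority_value, votes = counts.most_common(1)[0]
    if votes == len(results):
        return []
    if votes * 2 <= len(results):
        return None
    return [path for path, value in results.items() if value != majority_value]


three_path = odd_ones_out({"ast": 7, "ir": 7, "jvm": 9})
four_path = odd_ones_out({"ast": 7, "ir": 7, "jvm_unopt": 7, "jvm_opt": 9})
split = odd_ones_out({"ast": 7, "ir": 7, "jvm_unopt": 9, "jvm_opt": 9})

print(three_path)  # the JVM path is the suspect
print(four_path)   # only the optimized JVM path differs
print(split)       # 2-2 split: no majority
```

A 2-2 split such as the last case cannot be resolved by counting votes. Assigning it to a stage, as the four-path table does, relies on knowing which paths share code.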
VAST normalizes results before comparison. Both successful return values and runtime errors are compared:
| Path A result | Path B result | Verdict |
|---|---|---|
| Success(42) | Success(42) | Agreement |
| Success(7) | Success(9) | Value mismatch |
| RuntimeError(DIV_ZERO) | RuntimeError(DIV_ZERO) | Agreement |
| Success(42) | RuntimeError(DIV_ZERO) | Outcome kind mismatch |
| RuntimeError(DIV_ZERO) | RuntimeError(STACK_OVERFLOW) | Error category mismatch |
| RuntimeError(INFINITE_LOOP) | Timeout | Agreement (same root cause) |
The last row is important. The AST interpreter detects infinite loops by counting iterations. The JVM executor detects them by timeout. Both map to the same error category, so VAST treats them as agreement.
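The comparison rules in the table can be sketched as follows. This is a minimal Python illustration under assumptions: the class names `Success`, `RuntimeErr`, and `Timeout` and the functions `normalize` and `compare` are hypothetical, and the verdict strings are taken from the table rather than from VAST's real verdict identifiers.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Success:
    value: object


@dataclass(frozen=True)
class RuntimeErr:
    category: str


@dataclass(frozen=True)
class Timeout:
    pass


Outcome = Union[Success, RuntimeErr, Timeout]


def normalize(outcome: Outcome) -> Outcome:
    """Fold a timeout into the infinite-loop category before comparing."""
    if isinstance(outcome, Timeout):
        return RuntimeErr("INFINITE_LOOP")
    return outcome


def compare(a: Outcome, b: Outcome) -> str:
    """Return the verdict for two outcomes, following the table above."""
    a, b = normalize(a), normalize(b)
    if type(a) is not type(b):
        return "Outcome kind mismatch"
    if isinstance(a, Success):
        return "Agreement" if a.value == b.value else "Value mismatch"
    return "Agreement" if a.category == b.category else "Error category mismatch"


print(compare(Success(7), Success(9)))
print(compare(RuntimeErr("INFINITE_LOOP"), Timeout()))
```

Normalizing first and comparing second keeps the comparison itself trivial: all path-specific knowledge lives in `normalize`.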
Both interpreters and the JVM executor map their exceptions to a shared set of categories:
| Category | AST interpreter | JVM executor |
|---|---|---|
| DIVISION_BY_ZERO | Caught at divide/modulo operations | ArithmeticException |
| STACK_OVERFLOW | Call depth limit exceeded | StackOverflowError |
| INFINITE_LOOP | Iteration cap exceeded | Execution timeout |
| INDEX_OUT_OF_BOUNDS | List index check | IndexOutOfBoundsException |
| NULL_REFERENCE | None access check | NullPointerException |
This normalization keeps comparison fair across paths that detect the same problem in different ways.
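One way to picture this mapping is as a pair of lookup tables, one per kind of path. The Python sketch below is an assumption-laden illustration: the category names and JVM exception names come from the table, but the interpreter signal names, the `ExecutionTimeout` key, and the `categorize` function are hypothetical.

```python
from typing import Optional

# JVM-side signals from the table; "ExecutionTimeout" is a hypothetical
# name for the executor's timeout, which is not a JVM exception.
JVM_SIGNALS = {
    "ArithmeticException": "DIVISION_BY_ZERO",
    "StackOverflowError": "STACK_OVERFLOW",
    "ExecutionTimeout": "INFINITE_LOOP",
    "IndexOutOfBoundsException": "INDEX_OUT_OF_BOUNDS",
    "NullPointerException": "NULL_REFERENCE",
}

# Hypothetical names for the interpreter's own runtime checks.
INTERPRETER_SIGNALS = {
    "divide_or_modulo_by_zero": "DIVISION_BY_ZERO",
    "call_depth_limit": "STACK_OVERFLOW",
    "iteration_cap": "INFINITE_LOOP",
    "list_index_check": "INDEX_OUT_OF_BOUNDS",
    "none_access_check": "NULL_REFERENCE",
}


def categorize(path: str, signal: str) -> Optional[str]:
    """Map a path-specific failure signal to a shared category.

    Returns None for signals outside the shared set, so unexpected
    failures are not silently folded into a known category.
    """
    table = JVM_SIGNALS if path == "jvm" else INTERPRETER_SIGNALS
    return table.get(signal)


ast_category = categorize("ast", "iteration_cap")
jvm_category = categorize("jvm", "ExecutionTimeout")
print(ast_category == jvm_category)
```

Returning `None` for unknown signals is a design choice in this sketch: an unmapped exception is more useful as a visible anomaly than as a guessed category.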
Differential testing is effective at finding bugs that live in unusual feature interactions:
| Bug class | Example | How differential testing finds it |
|---|---|---|
| Optimizer bug | Constant folder changes semantics for edge-case arithmetic | Interpreter computes correctly, JVM returns wrong result |
| Codegen bug | Wrong arithmetic opcode emitted for specific nesting | Interpreters agree on the correct value, JVM disagrees |
| Control flow bug | Off-by-one in bytecode jump target for while loops | Loop runs one too many times on JVM path |
| Type system bug | Incorrect casting between numeric types | Interpreter preserves value, JVM truncates |
| Scoping bug | Variable shadowing handled differently in compiler | Interpreter uses correct binding, JVM uses wrong one |
These bugs are hard to catch with hand-written tests because developers test expected behaviour. Nobody sits down and writes a test for (x - y) + y inside a nested conditional with mutable loop variables. The generator produces those combinations naturally.
Every VAST run uses a seed. The generator is deterministic: the same seed always produces the same program. When a mismatch is found, the seed is printed alongside the failing program:
```
VAST mismatch [seed=41822917, profile=core]
Verdict: MISMATCH_VALUE
AST interpreter: success(Int(7))
IR interpreter: success(Int(7))
JVM bytecode: success(Int(9))
Replay: vary vast --seed 41822917 --count 1 --profile core
```
Replay is exact. The same seed produces the same program, the same execution, and the same mismatch. Without this, failures found by random testing would be nearly impossible to debug.
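Seeded determinism is straightforward to sketch. The toy generator below is written in Python and is not VAST's generator; `generate_expr` and `generate_program` are hypothetical names. The key point is that all randomness flows from a private `random.Random(seed)` instance rather than from global state.

```python
import random


def generate_expr(rng: random.Random, depth: int = 3) -> str:
    """Generate a random arithmetic expression as source text."""
    if depth == 0 or rng.random() < 0.3:
        return str(rng.randint(0, 9))
    op = rng.choice(["+", "-", "*"])
    left = generate_expr(rng, depth - 1)
    right = generate_expr(rng, depth - 1)
    return f"({left} {op} {right})"


def generate_program(seed: int) -> str:
    """Build a program from a seed; every random choice comes from this RNG."""
    return generate_expr(random.Random(seed))


first = generate_program(41822917)
replay = generate_program(41822917)
print(first == replay)
```

Because the generator never touches the global random state, the clock, or any other hidden input, running it again with the same seed rebuilds the same program text.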
Differential testing is the foundation of the VAST program. Every other technique builds on it:
| Technique | Role |
|---|---|
| Metamorphic testing | Transforms programs and checks that multi-path comparison still agrees |
| Mutation testing | Injects faults and verifies that the comparison detects them |
| Reduction | Shrinks failing programs found by differential testing to minimal reproducers |
| CI integration | Runs differential testing continuously across all language profiles |