Vary compiles to JVM bytecode, so at runtime it gets the same JIT compiler that optimizes Java. We expect Vary to be considerably faster than Python, and we hope it reaches at least about 85% of Java's speed on most workloads.
We wrote a small benchmark suite to see where we stand. The results are not scientific. Nine workloads on one machine with a fixed heap is not a performance study. But it tells us whether the bytecode we generate is in the right ballpark, and so far it is.
Setup: Eclipse Temurin 25 on Linux with a 2 GB heap. Each benchmark runs 750 ms of warmup, then 9 trials of 2500 ms each. The numbers below are medians across the trials.
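For readers who want to picture the measurement loop, here is a rough Java sketch of a warmup-then-trials harness. It assumes each trial reports mean milliseconds per iteration over its time window, which is our guess at a reasonable shape and not a description of the actual suite. `MedianHarness`, `runWindow`, and `measure` are names made up for this illustration.

```java
import java.util.Arrays;

final class MedianHarness {
    /** Median of the samples; averages the two middle values when the count is even. */
    static double median(double[] samples) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Runs the workload repeatedly for at least windowMillis; returns mean ms per iteration. */
    static double runWindow(Runnable workload, long windowMillis) {
        long windowNanos = windowMillis * 1_000_000L;
        long start = System.nanoTime();
        long iterations = 0;
        long elapsed;
        do {
            workload.run();
            iterations++;
            elapsed = System.nanoTime() - start;
        } while (elapsed < windowNanos);
        return (elapsed / 1_000_000.0) / iterations;
    }

    /** One discarded warmup window, then `trials` timed windows; returns the median. */
    static double measure(Runnable workload, long warmupMillis, int trials, long trialMillis) {
        runWindow(workload, warmupMillis); // lets the JIT compile the hot path first
        double[] results = new double[trials];
        for (int i = 0; i < trials; i++) {
            results[i] = runWindow(workload, trialMillis);
        }
        return median(results);
    }

    public static void main(String[] args) {
        long[] sink = new long[1]; // keeps the result observable so the loop is not dead code
        Runnable workload = () -> {
            long a = 0, b = 1;
            for (int i = 0; i < 1_000_000; i++) {
                long t = a + b;
                a = b;
                b = t;
            }
            sink[0] = a;
        };
        // Same shape as the setup above: 750 ms warmup, 9 trials of 2500 ms.
        double ms = measure(workload, 750, 9, 2500);
        System.out.printf("median: %.3f ms per iteration%n", ms);
    }
}
```

A median is less sensitive than a mean to a single trial that happens to hit a GC pause, which is one reason to prefer it for a small number of trials.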
| Benchmark | Java (ms) | Vary (ms) | vs Java |
|---|---|---|---|
| Fib Iterative | 49.3 | 49.3 | ≈ |
| Int Arith | 26.6 | 25.8 | ≈ |
| Mandelbrot | 111.8 | 110.6 | ≈ |
| List Ops | 6.2 | 4.3 | -31% |
| Map Ops | 124.7 | 133.5 | +7% |
| Map Ops (defaults) | 174.3 | 177.3 | ≈ |
| Alloc | 29.9 | 30.2 | ≈ |
| String Concat | 56.5 | 42.7 | -24% |
| Primes Sieve | 33.3 | 28.1 | -16% |
≈ means within 5%. Negative percentages mean Vary was faster.
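The "vs Java" column can be reproduced from the two timing columns. The sketch below assumes the percentage is the relative difference against the Java time, rounded to the nearest whole percent, and that "within 5%" means an absolute difference strictly below 5%. The class and method names are ours, and the exact rounding and boundary handling are assumptions.

```java
import java.util.Locale;

final class VsJava {
    /**
     * Formats the relative difference of Vary against Java.
     * Negative means Vary was faster; differences under 5% collapse to the "≈" sign.
     */
    static String vsJava(double javaMs, double varyMs) {
        double pct = (varyMs - javaMs) / javaMs * 100.0;
        if (Math.abs(pct) < 5.0) {
            return "\u2248"; // the "approximately equal" sign used in the table
        }
        return String.format(Locale.ROOT, "%+d%%", Math.round(pct));
    }

    public static void main(String[] args) {
        System.out.println("List Ops:      " + vsJava(6.2, 4.3));
        System.out.println("Map Ops:       " + vsJava(124.7, 133.5));
        System.out.println("String Concat: " + vsJava(56.5, 42.7));
        System.out.println("Primes Sieve:  " + vsJava(33.3, 28.1));
        System.out.println("Alloc:         " + vsJava(29.9, 30.2));
    }
}
```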
Most of the table is a wash, which is what we hoped for. The JIT does not care that the bytecode came from Vary instead of javac.
String concat is faster because Vary uses invokedynamic-based concatenation, and the pattern in this particular benchmark JITs well. List ops and primes sieve benefit from Vary's IntArray and BoolArray types, which skip boxing. Map ops is 7% slower, likely due to how Vary's codegen handles map default values compared to hand-written Java.
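To make "invokedynamic-based concatenation" a bit more concrete, the sketch below drives the JDK's `StringConcatFactory` by hand from Java. This is the same bootstrap that javac has targeted for `+` on strings since JDK 9. How Vary's code generator actually wires up its call sites is not shown here, so treat the recipe and the `IndyConcat` class as illustrative assumptions.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;

final class IndyConcat {
    /** Builds "label=value" through a StringConcatFactory call site. */
    static String join(String label, long value) {
        try {
            // In the recipe, "\1" marks an argument slot; other characters are literals.
            CallSite site = StringConcatFactory.makeConcatWithConstants(
                    MethodHandles.lookup(),
                    "concat",
                    MethodType.methodType(String.class, String.class, long.class),
                    "\1=\1");
            MethodHandle handle = site.dynamicInvoker();
            // The long is passed as a primitive; no boxing is needed at the call.
            return (String) handle.invokeExact(label, value);
        } catch (Throwable t) {
            throw new IllegalStateException("concat bootstrap failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(join("n", 42L));
    }
}
```

In real generated code the bootstrap runs once per call site and the linked handle is reused; this sketch rebuilds it on every call only to stay short.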
Where Vary loses, it is usually boxing. List[Int] in Vary is an ArrayList<Long> under the hood, while Java can use long[] directly. IntArray and BoolArray exist to close that gap for hot paths, but they only help when you use them.
Note: Boxing means wrapping a primitive value (like a 64-bit integer) in a heap-allocated object so it can be stored in a generic collection. Every box costs an allocation and an extra pointer dereference. In tight loops over large lists, that adds up.
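The difference is easy to see on the Java side. The sketch below sums the same values from an `ArrayList<Long>` and from a `long[]`; the class and method names are ours, and it makes no timing claims.

```java
import java.util.ArrayList;
import java.util.List;

final class BoxingDemo {
    /** Boxed storage: each element is a reference to a Long object. */
    static List<Long> boxedRange(int n) {
        List<Long> values = new ArrayList<>(n);
        for (long i = 0; i < n; i++) {
            // Autoboxes via Long.valueOf; outside the small cached range
            // this is typically a fresh heap object.
            values.add(i);
        }
        return values;
    }

    /** Primitive storage: the 64-bit values sit contiguously in the array. */
    static long[] primitiveRange(int n) {
        long[] values = new long[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        return values;
    }

    static long sumBoxed(List<Long> values) {
        long total = 0;
        for (Long v : values) {
            total += v; // follows the reference, then unboxes
        }
        return total;
    }

    static long sumPrimitive(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v; // reads the value directly from the array
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println(sumBoxed(boxedRange(n)));
        System.out.println(sumPrimitive(primitiveRange(n)));
    }
}
```

Both loops compute the same result. The boxed version simply does more memory work per element to get there.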
Beyond runtime throughput, we now measure the compiler toolchain itself. Five benchmark suites run in CI to catch regressions:
| Suite | What it measures |
|---|---|
| Startup | Cold-start time for vary run, vary check, vary test |
| Workflow | End-to-end latency of check and test workflows |
| Memory | Peak heap, GC pause time, allocation rate |
| Verification | Throughput of mutation testing and VAST differential testing |
| Trend | Historical score tracking across runs |
Each suite has a stored baseline. Regressions above 25% fail the nightly build; regressions above 10% produce warnings. The trend suite generates historical reports for tracking performance over time.
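The threshold rule can be sketched as a small classifier. This assumes a metric where higher is worse (time, memory) and that the regression is measured relative to the stored baseline; a throughput metric would need the sign flipped. `RegressionGate` and `Verdict` are hypothetical names, not part of the real CI tooling.

```java
final class RegressionGate {
    enum Verdict { PASS, WARN, FAIL }

    static final double WARN_THRESHOLD = 0.10; // above 10%: warning
    static final double FAIL_THRESHOLD = 0.25; // above 25%: nightly build fails

    /** Classifies a measurement against its baseline, for metrics where higher is worse. */
    static Verdict classify(double baseline, double current) {
        double regression = (current - baseline) / baseline;
        if (regression > FAIL_THRESHOLD) {
            return Verdict.FAIL;
        }
        if (regression > WARN_THRESHOLD) {
            return Verdict.WARN;
        }
        return Verdict.PASS; // includes improvements, where regression is negative
    }

    public static void main(String[] args) {
        System.out.println(classify(100.0, 105.0));
        System.out.println(classify(100.0, 115.0));
        System.out.println(classify(100.0, 130.0));
    }
}
```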
One thing none of the benchmarks above covers: concurrent throughput with spawn/join under realistic allocation patterns.