tl;dr: We do not use one mutation tool for everything. `vary mutate` tests Vary programs, PIT tests Kotlin compiler code, and VAST tests whether the compiler preserves the meaning of generated Vary programs.

If you want to know whether a smoke alarm matters, you do not stare at the battery light. You press the test button. That is what mutation testing is for software. A normal test suite tells you the system looks healthy. Mutation testing asks whether anything would actually notice if something small but plausible were broken.
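The "press the test button" idea can be sketched in a few lines. This is a toy illustration of the concept, not any real mutation tool's API: the names `weakSuite` and `strongSuite` are invented here, and the "mutant" is applied by hand rather than by a framework.

```java
import java.util.function.IntPredicate;

public class MutationSketch {
    // "Production" code under test.
    static final IntPredicate isAdult = age -> age >= 18;

    // A small, plausible mutant: >= flipped to >.
    static final IntPredicate mutantIsAdult = age -> age > 18;

    // A weak suite: never probes the boundary, so the mutant survives it.
    static boolean weakSuite(IntPredicate p) {
        return p.test(30) && !p.test(10);
    }

    // A stronger suite that checks the boundary value and kills the mutant.
    static boolean strongSuite(IntPredicate p) {
        return weakSuite(p) && p.test(18);
    }

    public static void main(String[] args) {
        System.out.println("weak suite kills mutant: " + !weakSuite(mutantIsAdult));
        System.out.println("strong suite kills mutant: " + !strongSuite(mutantIsAdult));
    }
}
```

A mutation tool automates exactly this loop at scale: it generates many small mutants and reports the ones no test notices.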

That question matters even more for a compiler, because a compiler is not just another application. It is the machine other programs trust. If it quietly mistranslates code, the bug does not stay inside the compiler repository. It leaks into everything built with it.

What vary mutate is for

Vary has built-in mutation testing for Vary code. When you run `vary mutate`, the compiler can use its knowledge of the Vary syntax tree, type system, contracts, purity rules, and test structure to make mutation testing more precise and more useful. That is one of the advantages of building mutation testing into the language from the start.

Why that does not cover the compiler itself

The complication is that the Vary compiler itself is still written mostly in Kotlin. `vary mutate` is designed to mutate Vary programs and run Vary tests. It is not a general-purpose Kotlin mutation framework, and trying to force it into that role would miss the point of why it works so well in the first place. Its strength comes from deep Vary-specific knowledge.

The three layers

So we test different layers in different ways.

| Tool | What it tests | Main question |
| --- | --- | --- |
| `vary mutate` | Vary user code and its tests | Do your tests catch bugs in your Vary program? |
| PIT | Kotlin compiler implementation code | Do our Kotlin tests catch mistakes in the compiler code? |
| `vary vast` | Compiler semantics through generated Vary programs | Does the compiler preserve the meaning of Vary programs? |

PIT for the Kotlin implementation

For the Kotlin implementation, we use PIT. Today that targets the compiler's check-rule logic and reports where the Kotlin-side tests are too weak to notice a real behavioral change. That is useful because the compiler implementation is JVM code, and PIT is the right mutation tool for JVM code.

VAST for compiler semantics

But compiler quality is not only about implementation tests. A compiler can have excellent unit tests and still be wrong at the system level. That is why PIT is only one layer. We also use Vary integration tests, validation scripts, sabotage checks, release gates, and VAST, which generates many valid Vary programs and checks that the AST interpreter, IR path, and JVM path all agree about what those programs mean.
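The agreement check at the heart of that last layer is differential testing, and the shape of it is easy to sketch. This is a toy model, not VAST itself: the two "paths" below stand in for the real AST-interpreter and IR/JVM backends, and `pathsAgree` is an invented name.

```java
import java.util.Random;

public class DifferentialSketch {
    // Two stand-ins for independent evaluation paths of the same program.
    // In the real compiler these would be, e.g., the AST interpreter and the
    // IR/JVM pipeline; here they are two algebraically identical expressions.
    static int evalPathA(int x) { return (x + 3) * 2; }
    static int evalPathB(int x) { return 2 * x + 6; }

    // Feed both paths many generated inputs and check they always agree.
    static boolean pathsAgree(long seed, int trials) {
        Random rng = new Random(seed);
        for (int i = 0; i < trials; i++) {
            int x = rng.nextInt(1000) - 500;
            if (evalPathA(x) != evalPathB(x)) {
                return false; // a disagreement means one path mistranslates
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("paths agree on 1000 random inputs: " + pathsAgree(42L, 1000));
    }
}
```

The real system generates whole valid Vary programs rather than integers, but the invariant is the same: every backend must assign the same meaning to the same program.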

Why this split is fine

PIT and Vary-native mutation answer different questions at different layers. PIT helps us strengthen the Kotlin tests for the implementation we have today. vary mutate helps users strengthen the tests for their Vary programs. VAST then pressures the compiler from the outside by asking whether it preserves semantics across generated programs, not just whether its internal helper functions passed unit tests.

That is the testing shape that makes sense right now. We use Kotlin-native mutation tooling for Kotlin compiler code, Vary-native mutation tooling for Vary programs, and VAST plus release validation to test the compiler as a compiler. If more of the implementation moves into Vary over time, that boundary may move too. The goal is trustworthy behaviour, and the practical way to get that is to test each layer with the tool that fits it.
