How Bytecode Mutation Testing Works

The companion article explains why Vary does bytecode mutation. This one explains how. A follow-up article covers the implementation internals for readers who want to go deeper.

What mutation testing actually is

Code coverage tells you which lines your tests execute. It does not tell you whether the tests would notice if those lines were wrong.

Mutation testing closes that gap. Make a small, deliberate change to your code (a "mutant"), then run your tests. If at least one test fails, the mutant is "killed": your tests caught the change. If every test still passes, the mutant "survived" and there is a hole in your test suite.

Think of it like a smoke detector check. You press the test button (introduce a known fault) and see whether the alarm goes off (a test fails). If the alarm stays silent, the detector is not doing its job.

A mutation testing tool does this hundreds or thousands of times, with different small changes: swapping + for -, changing < to <=, replacing a return value with zero. The percentage of mutants your tests kill is the mutation score. A high score means your tests are actually verifying behaviour rather than merely executing code.

Why bytecode

Most mutation tools work at the source level. They edit your source code, recompile, run the tests, then undo the edit and repeat. The problem is that recompiling for every single mutant is slow. If you have 500 mutants and each one needs a full compile, the wait adds up fast.

Vary takes a different approach. It compiles your source code once, all the way down to JVM bytecode (the low-level instructions the Java Virtual Machine executes). Then it makes mutations directly in the compiled bytecode, without going back through the compiler. Patching one instruction in an already-compiled program is cheaper than re-parsing, re-type-checking, and re-compiling from scratch.

The analogy: imagine you have a printed book and you want to test whether a proofreader notices a typo. Source-level mutation reprints the entire book for each typo. Bytecode mutation uses white-out on one word and photocopies that page.

The pipeline

When you run vary mutate add.vary --tests test_add.vary, six things happen:

Step	What happens
Compile source	Vary source becomes JVM bytecode
Compile tests	Same process for the test file
Run baseline	Load both into memory, run every test, record which ones pass
Generate mutations	Walk every instruction in the compiled code and identify places where a small change would be meaningful
Test each mutation	Patch one instruction, load the patched version, run the same tests, check whether any test that passed before now fails. If so, the mutant is killed
Report	Mutation score = killed / total

Steps 1 through 4 happen once. Step 5 repeats for every mutant, but it never reparses source, never type-checks, never regenerates bytecode from scratch. It patches one instruction and asks one question: did the tests notice?
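The baseline, generate, test, and report steps can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Vary's implementation: the "compiled" function is a list of opcode names, only arithmetic swaps are generated, and `run`, `SWAPS`, `TESTS`, and `mutation_score` are names invented for the illustration.

```python
# Toy model of the pipeline: a "compiled" function is a list of opcode names.
# All names here are illustrative assumptions, not Vary's real API.

ADD_CODE = ["LLOAD_A", "LLOAD_B", "LADD", "LRETURN"]   # compiled once

SWAPS = {"LADD": "LSUB", "LSUB": "LADD"}  # assumed subset of mutation rules


def run(code, a, b):
    """Execute the toy instruction list on a small operand stack."""
    stack = []
    for op in code:
        if op == "LLOAD_A":
            stack.append(a)
        elif op == "LLOAD_B":
            stack.append(b)
        elif op == "LADD":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "LSUB":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs - rhs)
        elif op == "LRETURN":
            return stack.pop()
    raise ValueError("fell off the end of the method")


# Each test takes the code under test and returns True when it passes.
TESTS = {"add sums two ints": lambda code: run(code, 2, 3) == 5}


def mutation_score(code, tests):
    # Run baseline: record which tests pass on the unmodified code.
    baseline = {name for name, test in tests.items() if test(code)}
    # Generate mutations: one (index, replacement) pair per candidate.
    candidates = [(i, SWAPS[op]) for i, op in enumerate(code) if op in SWAPS]
    killed = 0
    for index, replacement in candidates:
        mutant = list(code)          # copy; the original is never modified
        mutant[index] = replacement  # patch exactly one instruction
        # Killed if any test that passed at baseline now fails.
        if any(not tests[name](mutant) for name in baseline):
            killed += 1
    return killed, len(candidates)


killed, total = mutation_score(ADD_CODE, TESTS)
print(f"{killed}/{total} mutants killed")
```

Note that the loop body never touches a parser or type checker; the only per-mutant work is a list copy, a one-element patch, and a test run.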

One concrete example, all the way down

Start with this Vary source:

def add(a: Int, b: Int) -> Int {
    return a + b
}

test "add sums two ints" {
    observe add(2, 3) == 5
}

What the compiler emits

When Vary compiles this function, it produces JVM bytecode: a sequence of low-level instructions that the Java Virtual Machine knows how to execute. You do not normally see these instructions, but they are what actually runs when your program executes.

Vary's Int type maps to JVM long (a 64-bit integer). This means integer arithmetic uses the L-prefix instructions. Here is the bytecode for add:

public static long add(long, long)
  0: LLOAD 0       // load parameter a onto the stack
  1: LLOAD 2       // load parameter b onto the stack
  2: LADD          // pop both values, add them, push the result
  3: LRETURN       // return the value on top of the stack

Four instructions. LLOAD loads a long value. LADD adds two longs. LRETURN returns a long. The JVM is a stack machine, so values get pushed onto a stack, operations consume values from the top, and results get pushed back.
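To make the stack behaviour concrete, here is a minimal Python sketch that steps through the same four instructions and records the stack after each one. It is a toy interpreter written for this article, not the JVM. The one JVM detail it does model is that a long occupies two local-variable slots, which is why b is loaded from slot 2 rather than slot 1; `trace_add` is a hypothetical name.

```python
def trace_add(a, b):
    """Step through LLOAD 0, LLOAD 2, LADD, LRETURN on a toy operand stack."""
    local_slots = {0: a, 2: b}   # a long takes two slots, so b lives in slot 2
    program = [("LLOAD", 0), ("LLOAD", 2), ("LADD", None), ("LRETURN", None)]
    stack, trace = [], []
    for opcode, operand in program:
        if opcode == "LLOAD":
            stack.append(local_slots[operand])
        elif opcode == "LADD":
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs + rhs)
        elif opcode == "LRETURN":
            result = stack.pop()
            trace.append((opcode, list(stack)))
            return result, trace
        trace.append((opcode, list(stack)))


result, trace = trace_add(2, 3)
for opcode, stack_after in trace:
    print(f"{opcode:<8} stack after: {stack_after}")
print("returned", result)
```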

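The mutation engine walks this instruction list and asks, at each position, "could a small change here be meaningful?" Only some opcodes are mutator candidates. In this method, LADD is a candidate for an arithmetic swap and LRETURN is a candidate for the return-value operators; the two LLOAD instructions are not candidates.

Each pairing of a candidate instruction with an operator is one potential mutant. The engine produces one patched copy of the method per mutant, runs the tests against it, then discards the copy. The original bytecode is never modified in place.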

What a mutation looks like

The arithmetic mutation operator scans the instruction list. When it reaches LADD at index 2, it knows it can swap addition for subtraction. The mutated bytecode becomes:

public static long add(long, long)
  0: LLOAD 0       // load a
  1: LLOAD 2       // load b
  2: LSUB          // subtract instead of add    <-- the mutation
  3: LRETURN       // return

One instruction changed. Everything else is identical. The mutation engine did not re-read the source file, did not re-run the type checker, did not regenerate the other three instructions. It swapped one byte.
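At the byte level the swap looks like this. The sketch uses opcode values from the JVM specification (lload_0 = 0x1e, lload_2 = 0x20, ladd = 0x61, lsub = 0x65, lreturn = 0xad) and assumes the compiler emits the compact one-byte lload_0 and lload_2 forms. `patch_opcode` is a hypothetical helper, and a real tool would edit the method inside a full class file rather than a bare byte string.

```python
LADD, LSUB = 0x61, 0x65

# lload_0, lload_2, ladd, lreturn: the body of add(long, long)
original = bytes([0x1E, 0x20, LADD, 0xAD])


def patch_opcode(code, offset, expected, replacement):
    """Return a copy of `code` with one opcode replaced; the input is untouched."""
    if code[offset] != expected:
        raise ValueError(f"expected opcode {expected:#x} at offset {offset}")
    patched = bytearray(code)
    patched[offset] = replacement
    return bytes(patched)


mutant = patch_opcode(original, 2, LADD, LSUB)
changed = [i for i, (x, y) in enumerate(zip(original, mutant)) if x != y]
print(original.hex(" "), "->", mutant.hex(" "), "changed offsets:", changed)
```

The guard on `expected` is a design choice for the sketch: refusing to patch when the byte at the offset is not the opcode you planned to replace keeps a stale mutation plan from silently corrupting the method.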

What happens when the tests run

The mutated bytecode gets loaded into a fresh, isolated environment (a new JVM classloader, which is the JVM's way of loading compiled code into memory). The test calls add(2, 3). The original would return 5. The mutant computes 2 - 3 = -1. The observe statement checks -1 == 5, which is false, so the test fails. The mutant is killed.

Even a weaker test, like observe add(2, 3) > 0, would still kill this mutant: -1 > 0 is false, so the test fails. But observe add(2, 3) > -10 would not, since -1 > -10 is still true, and the mutant would survive. Mutation testing finds exactly these kinds of gaps.
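The three assertions discussed above can be compared directly. In this Python sketch, `add` and `add_mutant` stand in for the original and mutated bytecode, and each assertion is a plain predicate. An assertion kills the mutant when it passes on the original and fails on the mutant.

```python
def add(a, b):
    return a + b


def add_mutant(a, b):
    return a - b   # models the LADD -> LSUB mutant


assertions = {
    "== 5": lambda f: f(2, 3) == 5,
    "> 0": lambda f: f(2, 3) > 0,
    "> -10": lambda f: f(2, 3) > -10,
}

verdicts = {}
for label, check in assertions.items():
    killed = check(add) and not check(add_mutant)
    verdicts[label] = "killed" if killed else "survived"
    print(f"observe add(2, 3) {label}: mutant {verdicts[label]}")
```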

The six mutation operators

Vary has six types of bytecode mutations, each targeting a different kind of instruction.

Arithmetic: swap math operations

Replaces one arithmetic operation with another. + becomes -, * becomes /, and so on.

What you wrote	What the mutant does
`a + b`	`a - b`
`a - b`	`a + b`
`a * b`	`a / b`
`a / b`	`a * b`
`a % b`	`a * b` or `a / b`

At the bytecode level, this is a single opcode swap: LADD becomes LSUB. The stack shape stays the same (two values in, one value out), so nothing else in the method needs to change.
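The table above is effectively a lookup from one opcode to its replacements. A Python sketch, assuming the long-typed opcodes from the add example: LADD, LSUB, LMUL, LDIV, and LREM are real JVM mnemonics, while `ARITHMETIC_SWAPS`, `arithmetic_mutants`, and the instruction-list representation are made up for illustration.

```python
ARITHMETIC_SWAPS = {
    "LADD": ["LSUB"],
    "LSUB": ["LADD"],
    "LMUL": ["LDIV"],
    "LDIV": ["LMUL"],
    "LREM": ["LMUL", "LDIV"],   # a % b can become a * b or a / b
}


def arithmetic_mutants(code):
    """Yield (index, mutated_copy) for every arithmetic opcode in `code`."""
    for index, opcode in enumerate(code):
        for replacement in ARITHMETIC_SWAPS.get(opcode, []):
            mutated = list(code)
            mutated[index] = replacement
            yield index, mutated


# Hypothetical instruction list for "return a % b"
remainder = ["LLOAD 0", "LLOAD 2", "LREM", "LRETURN"]
mutants = list(arithmetic_mutants(remainder))
for index, mutated in mutants:
    print(index, mutated)
```

Because every replacement pops two longs and pushes one, the mutated list is still a valid method body without any other adjustment.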

Conditional: change boundary conditions

Alters comparison operators. < becomes <= or >=. == becomes !=. null checks flip.

What you wrote	What the mutant does
`a < b`	`a <= b` or `a >= b`
`a <= b`	`a < b` or `a > b`
`a == b`	`a != b`
`x != None`	`x == None`

These mutations catch off-by-one errors and missing boundary tests. If your test only checks max(3, 5) and never checks max(5, 5), a boundary mutant that changes < to <= might survive.

Negation removal: drop the minus sign

If your code negates a value (-x), this mutation removes the negation, turning -x into just x. If no test checks that the sign is correct, the mutant survives.

Return value replacement: return a default instead

Ignores whatever the function computed and returns a default value instead: 0 for integers, 0.0 for floats, null for objects. This checks whether your tests actually verify the return value.

Return poison: return a value designed to cause trouble

Similar to return value replacement, but instead of benign defaults, it returns values chosen to trigger subtle bugs: -1 for integers, the largest possible float, empty string "" for objects.

This catches tests that only check "not null" or "not zero" without verifying the actual value. observe result != None catches a null return, but misses an empty string.

Call skip: pretend a method call never happened

Removes an entire method call and replaces it with a default return value. If your code calls validate(input) and no test notices when that call disappears, then nothing is actually checking that validation happens.

Putting it together

Take the add function from earlier. The compiler produces four bytecode instructions. The mutation engine scans all four and finds three possible mutations: one arithmetic swap on the LADD, one return-value replacement on the LRETURN, and one return-poison on the same LRETURN.

For each mutation, the engine copies the original bytecode, patches one instruction, loads the patched version into a fresh classloader, and runs the test suite. The arithmetic mutant (changing + to -) is killed because add(2, 3) returns -1 instead of 5 and the test notices. The return-value mutant (returning 0 instead of the computed sum) is killed too, as is the return-poison mutant (returning -1 regardless of input).

Three mutations, three kills, 100% mutation score. The tests are checking the behaviour of this function rather than only executing it.

That is the point of mutation testing. Bytecode mutation is what makes it fast enough to run during normal development.

Page	Focus
Bytecode mutation under the hood	The implementation internals: ASM library, classloader isolation, kill detection, and stable mutation IDs
Bytecode Mutation Is Why Vary Uses the JVM	The architectural motivation for targeting JVM bytecode
Why Mutation Testing Belongs in the Compiler	Why Vary builds mutation into the language instead of treating it as a plugin
How We Mutation Test the Compiler	How Vary uses different mutation strategies for Vary code and Kotlin compiler code