This article lays out the case for bytecode mutation testing, explains the methodology we used to evaluate it, and reports what the controlled benchmarks actually showed.
The companion articles, How Bytecode Mutation Testing Works and Bytecode Mutation Under the Hood, cover the mechanics. This article covers the argument and the data.
The thesis
Mutation testing is slow because it recompiles for every mutant. If you can skip recompilation by patching compiled bytecode directly, mutation testing becomes fast enough to run during normal development.
That is the claim. The rest of this article examines the evidence for it and the gaps in that evidence.
Why source-level mutation is expensive
A source-level mutation tool (whether it operates on raw text or an AST) follows a loop like this for each mutant:
for each mutation site:
1. Modify the source (text patch or AST transform)
2. Parse the modified source
3. Type-check the modified program
4. Compile to executable form
5. Run the test suite
6. Record whether any test failed
Steps 2 through 4 are the recompilation tax. They repeat for every mutant even though the change is tiny: a single operator swap, a constant replacement, a negated condition. The compiler does not know that 99.9% of the program is identical to the last run. It starts from scratch each time.
For a codebase that produces 1,000 mutants, this means 1,000 full compilations. The test suite runs 1,000 times regardless of the approach, but the recompilation overhead is pure waste. The program was already compiled correctly. Only one instruction changed.
The bytecode alternative
Vary compiles source to JVM bytecode once. Then the mutation engine works directly on the compiled .class bytes:
1. Compile source to bytecode (once)
2. Compile tests to bytecode (once)
3. Run the baseline test suite (once)
for each mutation site:
4. Patch one instruction in the compiled bytecode
5. Load the patched class via a fresh classloader
6. Run the test suite
7. Record whether any test failed
Step 4 replaces steps 1-4 of the source-level loop. Patching a bytecode instruction means reading the class file's method bytes, swapping an opcode (for example, LADD to LSUB), and recomputing the stack frame metadata. No parsing. No type checking. No code generation. The ASM library does this in microseconds.
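The patch itself can be sketched with nothing but plain Java and the JVM's published opcode values (LADD is 0x61, LSUB is 0x65). A real engine uses ASM to walk the instruction stream rather than scanning raw bytes, since an operand byte can coincidentally equal an opcode; this sketch assumes the offset of the target instruction is already known, and the code array is a hypothetical example.

```java
public class OpcodePatch {
    // JVM opcode values from the class file specification.
    static final byte LADD = (byte) 0x61;
    static final byte LSUB = (byte) 0x65;

    // Return a copy of a method's code array with the opcode at `offset` swapped.
    // A real engine (e.g. ASM) locates the offset by walking the instruction
    // stream; here the caller supplies it.
    static byte[] swapOpcode(byte[] code, int offset, byte from, byte to) {
        if (code[offset] != from) {
            throw new IllegalArgumentException("expected opcode not found at offset");
        }
        byte[] patched = code.clone();
        patched[offset] = to;
        return patched;
    }

    public static void main(String[] args) {
        // Hypothetical code array for a long addition and store:
        // lload_0, lload_2, ladd, lstore_0  (0x1E, 0x20, 0x61, 0x3F)
        byte[] code = { 0x1E, 0x20, LADD, 0x3F };
        byte[] mutant = swapOpcode(code, 2, LADD, LSUB);
        System.out.println(mutant[2] == LSUB); // the addition is now a subtraction
        System.out.println(code[2] == LADD);   // the original bytes are untouched
    }
}
```

The key property is that the patch is a pure byte-array copy: the original class bytes stay pristine, so every mutant starts from the same verified baseline.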
A concrete example from Parsimonious
Consider a function from the Parsimonious PEG parser, which we ported from Python to Vary as part of this evaluation. The Python version:
def _length_of(self):
    child_lengths = [c._length_of() for c in self]
    if None in child_lengths:
        return None
    return sum(child_lengths)
The Vary port:
def length_of(self) -> Int {
    mut total = 0
    for child in self.members {
        let child_len = child.length_of()
        if child_len < 0 {
            return -1
        }
        total = total + child_len
    }
    return total
}
When Vary compiles this method, the total = total + child_len line becomes a sequence of LLOAD, LLOAD, LADD, LSTORE bytecode instructions. The mutation engine can swap LADD for LSUB without touching any other part of the program. The compiler already verified that total and child_len are both Int, that the method returns Int, and that the class structure is sound. None of that work needs to repeat.
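The behavioral effect of that single-opcode swap is easiest to see at the source level. The sketch below (plain Java with hypothetical names, not the actual Vary output) shows the original accumulation next to its LADD-to-LSUB mutant, and why any test asserting the correct sum kills the mutant.

```java
public class KillDemo {
    // Original: sums child lengths, mirroring the Vary length_of loop.
    static long lengthOf(long[] childLengths) {
        long total = 0;
        for (long len : childLengths) {
            if (len < 0) return -1;  // sentinel for "no length"
            total = total + len;
        }
        return total;
    }

    // The LADD -> LSUB mutant: one arithmetic instruction flipped.
    static long lengthOfMutant(long[] childLengths) {
        long total = 0;
        for (long len : childLengths) {
            if (len < 0) return -1;
            total = total - len;     // mutated operator
        }
        return total;
    }

    public static void main(String[] args) {
        long[] children = { 2, 3, 4 };
        System.out.println(lengthOf(children));       // 9
        System.out.println(lengthOfMutant(children)); // -9: a test asserting 9 kills this mutant
    }
}
```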
With source-level mutation, the tool would need to modify the source text, re-parse the entire file, re-type-check the entire module (including all its imports), and re-generate bytecode for the entire class. For one changed operator.
The Frugal port: our evaluation methodology
To evaluate this thesis against real code, we chose Parsimonious, an existing, well-tested PEG parsing library for Python, and ported it to Vary. We call this the Frugal project.
Why a parser library
Parser code is a demanding mutation target. It combines properties that mutation testing tends to struggle with.
| Property | Why it matters for mutation |
|---|---|
| Heavy recursion and loops | Mutations can create infinite loops |
| Complex conditional logic | Operator swaps produce subtle behavioral changes |
| String manipulation | Boundary conditions matter |
| Well-defined I/O contract | Parse trees from grammar plus input text give a clean kill signal |
If bytecode mutation can handle parser code efficiently, it should handle typical application code as well.
The scope
The Python codebase consists of 5 core modules totaling 1,490 lines of code, with 84 passing tests across 4 test files. The Vary port covers the same functionality across 10 source modules (approximately 2,200 lines including test helpers), restructured to fit Vary's module system and type constraints.
The port was not a line-for-line transliteration. Vary's static type system, its explicit optional types instead of None-as-sentinel, and different collection semantics required genuine adaptation. For example, Python's use of None as a sentinel return value became explicit -1 returns in Vary, and Python's list comprehensions became explicit loops. These differences make the port more realistic as a comparison target: the code is idiomatic to each language, not a mechanical translation.
The Python baseline: measured data
We ran mutmut 3.5.0, an AST-based Python mutation testing tool, against the Parsimonious test suite three times under controlled conditions. These are the averaged results.
| Metric | Value |
|---|---|
| Source lines of code | 1,490 |
| Total mutants generated | 1,189 |
| Mutants killed | 769 (mean) |
| Mutants survived | 394 (mean) |
| Timeouts | 3 |
| Kill rate | 64.7% |
| Wall-clock time | 67.7 seconds (mean of 3 runs) |
| Throughput | 18.9 mutants/second (mean) |
| Environment | Python 3.12.3, pytest 9.0.3, Linux x86_64, AMD Ryzen 5 3600 |
Several things stand out:
Throughput is reasonably consistent across runs. Individual runs ranged from 17.5 to 21.5 mutants per second, with the variance driven mainly by OS-level caching effects. mutmut v3 does not cache mutation results between runs: every invocation regenerates all mutants and re-tests them from scratch.
At 18.9 mutants per second, each mutant takes about 53 milliseconds end to end. For 1,189 mutants, that is just over a minute. Parsimonious is a small library. A codebase ten times larger could easily produce ten times more mutants.
Three mutants timed out consistently. This is characteristic of parser code. Certain mutations (like changing < to <= in a loop bound) can create infinite loops.
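The scaling arithmetic behind the second observation is worth making explicit. A short calculation from the measured figures above (the linear-scaling assumption for larger codebases is ours):

```java
public class ScalingMath {
    public static void main(String[] args) {
        double throughput = 18.9; // mutants per second (measured mean)
        int mutants = 1_189;      // measured mutant count

        double perMutantMs = 1000.0 / throughput;
        double totalS = mutants / throughput;
        // Assumption: mutant count scales roughly linearly with code size.
        double tenXMinutes = (mutants * 10) / throughput / 60.0;

        System.out.printf("per mutant: %.1f ms%n", perMutantMs);  // ~52.9 ms
        System.out.printf("full run:   %.1f s%n", totalS);        // ~62.9 s
        System.out.printf("10x code:   %.1f min%n", tenXMinutes); // ~10.5 min
    }
}
```

A minute is tolerable on every commit; ten minutes is not, which is why per-mutant overhead matters even when a single run looks cheap.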
The architectural argument for bytecode mutation
The cost model predicts that bytecode mutation should eliminate per-mutant recompilation overhead. To test this, we ran controlled benchmarks on both the Python original and the Vary port.
Per-mutant cost breakdown
In the source-level model (mutmut + Python):
| Phase | Happens per mutant | Approximate cost |
|---|---|---|
| Source modification | Yes | AST transform via libcst |
| Re-import / recompile | Yes | Python re-imports the module |
| Test discovery | Yes | pytest collects tests |
| Test execution | Yes | Run all relevant tests |
In the bytecode model (Vary):
| Phase | Happens per mutant | Approximate cost |
|---|---|---|
| Bytecode patch | Yes | Microseconds (ASM library) |
| Classloader creation | Yes | Milliseconds (JVM) |
| Test execution | Yes | Run all relevant tests |
The test execution phase is comparable in both cases: the same tests run against the same logic. The difference is everything that happens before the tests run. Source-level mutation pays a recompilation cost that scales with program size. Bytecode mutation pays a patching cost that scales with the size of one method.
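These two tables reduce to a simple cost model. The sketch below uses illustrative per-phase costs (assumed, not measured) to show both sides of the argument: the fixed compile-once term amortizes away as mutant counts grow, but when test time dominates the per-mutant cost, the two models converge.

```java
public class CostModel {
    // All costs in milliseconds. Parameter values are illustrative assumptions.
    static double sourceLevel(int mutants, double recompileMs, double testMs) {
        // Recompile + test, paid per mutant.
        return mutants * (recompileMs + testMs);
    }

    static double bytecode(int mutants, double compileOnceMs, double patchMs, double testMs) {
        // One compilation up front, then patch + test per mutant.
        return compileOnceMs + mutants * (patchMs + testMs);
    }

    public static void main(String[] args) {
        int mutants = 1_000;
        // Compile-bound case: recompilation 500 ms, tests 100 ms, one-time compile 2 s, patch 0.01 ms.
        System.out.printf("source-level: %.0f s%n", sourceLevel(mutants, 500, 100) / 1000);       // 600 s
        System.out.printf("bytecode:     %.0f s%n", bytecode(mutants, 2_000, 0.01, 100) / 1000);  // ~102 s
        // Test-bound case: recompilation only 5 ms per mutant. The gap nearly vanishes.
        System.out.printf("test-bound:   %.0f s vs %.0f s%n",
                sourceLevel(mutants, 5, 100) / 1000,
                bytecode(mutants, 2_000, 0.01, 100) / 1000);                                      // 105 s vs ~102 s
    }
}
```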
What "compile once" actually means
When Vary's mutation engine runs in bytecode mode, the compilation pipeline executes exactly once:
Lexer → Parser → ConstantFolder → DeadCodeEliminator → TypeChecker → BytecodeGenerator
This produces a set of .class bytes in memory. The mutation engine then walks the bytecode of each method, identifies mutation sites (operator swaps, constant replacements, conditional boundary changes, return value mutations), and for each site, creates a patched copy of that single method's bytecode.
The patched copy is loaded via a fresh ClassLoader, a standard JVM mechanism for loading class definitions at runtime. The JVM verifies the patched bytecode (checking stack consistency and type safety), JIT-compiles it if it runs hot, and discards it when the classloader is garbage collected. No file I/O, no process spawning, no serialization.
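The fresh-classloader mechanism needs nothing beyond the JDK. The sketch below (assumed names, not Vary's actual engine) defines a class from raw bytes and shows that two loaders produce two distinct Class objects from the same bytes, which is what isolates one mutant from the next. It uses its own compiled bytes as a stand-in for patched mutant bytes.

```java
import java.io.InputStream;

public class LoaderDemo {
    // A throwaway classloader that defines one class from raw bytes,
    // checking its own bytes before delegating to the parent.
    static class FreshLoader extends ClassLoader {
        private final String className;
        private final byte[] classBytes;

        FreshLoader(String className, byte[] classBytes) {
            super(LoaderDemo.class.getClassLoader());
            this.className = className;
            this.classBytes = classBytes;
        }

        @Override
        public Class<?> loadClass(String name) throws ClassNotFoundException {
            if (name.equals(className)) {
                Class<?> c = findLoadedClass(name);
                return (c != null) ? c : defineClass(name, classBytes, 0, classBytes.length);
            }
            return super.loadClass(name); // everything else: normal parent delegation
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for patched mutant bytes: this class's own compiled bytes.
        byte[] bytes;
        try (InputStream in = LoaderDemo.class.getResourceAsStream("LoaderDemo.class")) {
            bytes = in.readAllBytes();
        }
        Class<?> a = new FreshLoader("LoaderDemo", bytes).loadClass("LoaderDemo");
        Class<?> b = new FreshLoader("LoaderDemo", bytes).loadClass("LoaderDemo");
        // Each fresh loader yields a distinct Class for the same bytes.
        System.out.println(a != b);                // true
        System.out.println(a != LoaderDemo.class); // true
    }
}
```

When a FreshLoader becomes unreachable, its defined classes are garbage collected with it, so no mutant leaks into the next run.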
Compare this to mutmut's approach: for each mutant, libcst parses the source file into a concrete syntax tree, applies one AST transformation, unparses the tree back to source text, writes it to a working directory, and invokes pytest in a subprocess. The subprocess re-imports the module (which means Python's import machinery re-parses and re-compiles the source to .pyc bytecode), discovers tests, and runs them.
Why recompilation avoidance matters more for larger codebases
The recompilation tax grows with program complexity. A small file compiles quickly regardless. But as module graphs deepen, type checking touches more code, and compilation involves more optimization passes, the per-mutant overhead of source-level mutation grows while bytecode mutation's per-mutant cost stays roughly constant.
For Parsimonious at 1,490 lines, the recompilation overhead in Python is small. Python's compilation is lightweight compared to a statically-typed language. But consider the same argument applied to a 50,000-line codebase with deep type inference and cross-module dependencies. The source-level approach recompiles the dependency graph for each mutant. The bytecode approach patches one method and moves on.
The Vary measurement: controlled results
We ran Vary's mutation engine in both AST mode (--level ast, which recompiles per mutant) and bytecode mode (--level bytecode, which patches compiled bytecode) against the Frugal port, three runs each, on the same hardware as the Python baseline.
| Metric | Python (mutmut) | Vary AST | Vary Bytecode |
|---|---|---|---|
| Source LOC | 1,490 | 3,402 | 3,402 |
| Mutants generated | 1,189 | 1,844 | 1,844 |
| Killed | 769 | 1,512 | 1,513 |
| Survived | 394 | 332 | 331 |
| Timeouts | 3 | 0 | 0 |
| Mutation score | 64.7% | 82.0% | 82.0% |
| Wall-clock time (mean) | 67.7 s | 251.0 s | 247.6 s |
| Throughput (mean) | 18.9 mut/s | 7.5 mut/s | 7.5 mut/s |
Bytecode vs. AST speedup: 1.01x.
This is the headline result, and it is not what the thesis predicted. Bytecode mode and AST mode have nearly identical throughput on this codebase. The recompilation tax that bytecode mutation was supposed to eliminate is not the bottleneck here. Test execution is.
Why the speedup is so small
For each mutant, regardless of mode, the engine must: create a classloader, load the mutated class, run the test suite, and record the result. The test execution phase dominates wall time. The per-mutant compilation cost that AST mode pays (parse, type-check, codegen) is small relative to running 22 test files against a PEG parser library.
Coverage-guided test selection (which skips irrelevant tests per mutant) improved AST mode by 22% in wall time, further closing the gap. Bytecode mode, which already skipped recompilation, saw only 8.5% improvement from the same optimization.
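Coverage-guided selection is conceptually simple: record which tests execute each method during the baseline run, then run only the tests covering the mutated method for each mutant. A minimal sketch of the selection step (plain Java; the coverage map, method names, and test names are all hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TestSelection {
    // Hypothetical coverage map recorded during the baseline run:
    // mutated method -> names of tests that executed it.
    static List<String> testsFor(String mutatedMethod,
                                 Map<String, Set<String>> coverage,
                                 List<String> allTests) {
        Set<String> covering = coverage.get(mutatedMethod);
        // No test executes the mutated method: nothing can kill the mutant,
        // so it survives without running anything.
        if (covering == null) return List.of();
        return allTests.stream().filter(covering::contains).toList();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> coverage = Map.of(
                "Node.length_of", Set.of("test_length", "test_round_trip"),
                "Grammar.parse", Set.of("test_parse_ok", "test_parse_error"));
        List<String> allTests = List.of(
                "test_length", "test_round_trip", "test_parse_ok", "test_parse_error");

        // A mutant inside Node.length_of only needs two of the four tests.
        System.out.println(testsFor("Node.length_of", coverage, allTests));
        // -> [test_length, test_round_trip]
    }
}
```

The savings scale with how localized coverage is: a mutant touched by every test gains nothing, which is one reason the improvement differs between modes.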
The cross-language comparison
The full per-run data, methodology, and caveats are kept internally and summarized here.
Python's mutmut achieves 18.9 mutants per second against 1,189 mutants. Vary achieves 7.5 mutants per second against 1,844 mutants. The raw throughput comparison (0.40x) is misleading without context:
| Confounding factor | Detail |
|---|---|
| Different mutant counts | Vary generates 55% more mutants from roughly equivalent code, because bytecode-level operators find mutation sites that AST-level operators miss. More mutants means more test runs. |
| Different test execution models | mutmut runs pytest in subprocesses. Vary runs tests via in-process JVM classloader reloading. These have different fixed costs per mutant. |
| Different languages and runtimes | Python's import-based recompilation is lightweight. Vary's Kotlin-compiled mutation engine has JVM startup overhead but benefits from JIT compilation over longer runs. |
A direct throughput comparison across languages is not meaningful without normalizing for these factors.
What we are not claiming
The data is in. Here is what it does and does not support.
| Claim we do not make | What the data shows |
|---|---|
| Bytecode mutation is a universal speedup | On this codebase, the measured bytecode-vs-AST speedup is 1.01x. The recompilation tax is real but small relative to test execution time. |
| Vary is faster than Python for mutation testing | The 0.40x throughput ratio reflects different tools, different runtimes, different mutant counts, and different test execution models. It is not an apples-to-apples comparison. |
| These results generalize to all codebases | The Frugal port is a parser library with heavy test execution costs. Codebases where compilation is the bottleneck (deep type inference, large module graphs, expensive optimization passes) may see a larger bytecode-vs-AST gap. |
What the data does support: bytecode mutation eliminates recompilation by construction, and that mechanism works as designed. The ASM-based patching is measured in microseconds. The JVM classloader reload is measured in milliseconds. The architectural argument is sound. But on a test-execution-dominated workload, removing compilation overhead does not materially change throughput.
Methodological caveats
This comparison has structural limitations that readers should weigh.
| Caveat | Detail |
|---|---|
| Different languages | Python and Vary have different compilation costs, different runtime characteristics, and different standard libraries. The port is idiomatic to each language, not a line-for-line transliteration. |
| Different mutation tools | mutmut uses AST-based mutation via libCST. Vary's engine operates at both AST and bytecode levels. The operator sets overlap but are not identical. |
| Different test frameworks | mutmut invokes pytest in subprocesses. Vary runs tests via in-process classloader reloading. Per-mutant fixed costs differ. |
| Single hardware environment | All benchmarks ran on AMD Ryzen 5 3600, Linux x86_64, OpenJDK 25. Results may differ on other hardware or JVM versions. |
| Mutant count difference | Vary generated 1,844 mutants from 3,402 LOC; Python generated 1,189 from 1,490 LOC. Normalizing by LOC, Vary produces 0.54 mutants/LOC vs. Python's 0.80 mutants/LOC, but the absolute count difference affects wall time. |
The raw data for all benchmarks is available in programs/frugal/artifacts/benchmark/, including per-run CSVs, JSON summaries, environment metadata, and the benchmark script itself. The canonical post-PRD results report is at vary/postPRD-report.md, and the post-hardening mutation run output is at programs/frugal/artifacts/mutation-pass-post/.
The path forward
The data changed the question. The original thesis asked: "Can bytecode mutation eliminate recompilation overhead?" The answer is yes, but on this codebase, recompilation is not the bottleneck. The new question is: "Where does bytecode mutation's advantage actually show up?"
Three directions are worth pursuing.
| Direction | What it explores |
|---|---|
| Parallel mutant execution | The current engine runs mutants sequentially. Bytecode mode's lightweight classloader-per-mutant architecture is well suited to parallel execution across CPU cores. On a 6-core machine, 4-way parallelism could approach the throughput targets that sequential execution missed. |
| Compilation-heavy workloads | The Frugal port's per-mutant compilation cost is small because individual Vary files compile quickly. Testing the thesis against a codebase with deep cross-module type inference or expensive optimization passes would better isolate the recompilation tax. |
| JIT warmup reuse | The JVM JIT compiler currently treats each classloader-loaded class as cold code. Sharing JIT profiles across mutants that differ by a single instruction could reduce per-mutant test execution time, which is the actual bottleneck this data identified. |
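The first direction is mechanically straightforward precisely because each mutant already lives in its own classloader. A sketch of the dispatch loop (plain Java; runMutant is a hypothetical stand-in for the patch-load-test cycle):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelMutants {
    // Hypothetical stand-in for: patch bytecode, load via fresh loader, run tests.
    // Returns true when some test failed, i.e. the mutant was killed.
    static boolean runMutant(int mutantId) {
        return mutantId % 5 != 0; // fake deterministic result for the sketch
    }

    public static void main(String[] args) throws Exception {
        int workers = 4; // e.g. 4-way parallelism on a 6-core machine
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger killed = new AtomicInteger();

        List<Future<?>> futures = new ArrayList<>();
        for (int id = 0; id < 100; id++) {
            final int mutantId = id;
            futures.add(pool.submit(() -> {
                if (runMutant(mutantId)) killed.incrementAndGet();
            }));
        }
        for (Future<?> f : futures) f.get(); // wait for every mutant to finish
        pool.shutdown();

        System.out.println("killed " + killed.get() + " of 100"); // killed 80 of 100
    }
}
```

The caveat is shared mutable state: tests that touch global singletons or the filesystem would need per-worker isolation before this is safe.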
The bytecode mutation thesis was a reasonable architectural bet. The mechanism works. The speedup on this workload is negligible. Whether it matters depends on where your bottleneck is.
Related reading
| Page | Focus |
|---|---|
| How Bytecode Mutation Testing Works | The mechanics of bytecode mutation with worked examples |
| Bytecode Mutation Under the Hood | Implementation details: ASM, classloaders, and kill detection |
| Why Mutation Testing Is Slow | The infinite loop problem and how Vary handles timeouts |
| Bytecode Mutation Is Why Vary Uses the JVM | The language design decision behind JVM targeting |