Mutation

Overview

Mutation testing checks whether your safety nets actually work. It plants small, deliberate mistakes in a program and sees if anything catches them. If a mistake slips through, you have found a blind spot before it matters.

What problem does this solve?

When someone builds software, they also write checks that verify the software works correctly. These checks are called tests. A test might verify that a calculator given 2 and 3 returns 5, or that a login page rejects a wrong password. When all the tests pass, the software appears to be working.
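To make the calculator check concrete, here is what such a test looks like in Python (the names are invented for illustration; Vary's test syntax is not shown here):

```python
# A minimal example of a test: a check written alongside the software.

def add(a, b):
    return a + b            # the "calculator"

def test_add():
    assert add(2, 3) == 5   # the check: given 2 and 3, expect 5

test_add()                  # passes silently: the software appears to work
```

When the assertion holds, nothing happens and the test passes; when it fails, the test raises an error and the problem is visible.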

But passing tests can be misleading. A test might run part of the program without actually checking the result. Think of a fire alarm wired to every room that never sounds when there is smoke. The wiring looks complete. The system looks good. It just does not work when it matters.

There is a common metric called code coverage that measures how much of the program the tests touch. High coverage feels reassuring, but it only measures whether the tests ran the code, not whether they would notice if the code were wrong.

What mutation testing does

Mutation testing takes a direct approach: plant small, realistic mistakes in the program and see if the tests catch them.

Each planted mistake is called a mutant. One mutant might change an addition to a subtraction. Another might make a calculation return zero instead of computing a real answer. These are the kinds of mistakes that actually happen when people write or change code.

After planting a mistake, the system runs all the tests. If a test fails, the mistake was caught. That mutant is "killed." If every test still passes, the mistake was missed. That mutant "survived," and a real bug in the same spot would go unnoticed too.

The mutation score is the percentage of planted mistakes that were caught. If something went wrong in this code, would anyone know? The score answers that question with a number.

If you want to see that flow with one function and one test, start with Smallest example.

Why Vary includes this

Most programming languages leave mutation testing to external tools that must be installed, configured, and run separately. These tools work, but they are slow and disconnected from the language they test.

Vary ships mutation testing as a built-in command. Running vary mutate works the same way as running any other Vary command. Because the mutation engine understands Vary's language features, it knows which mistakes are realistic and which are noise. It also runs fast, because it works on already-compiled output rather than rewriting and recompiling the original code for every single mutant. A typical file finishes in seconds. That matters, because if the tool is slow, nobody uses it.

How it works

Vary translates programs into a low-level instruction format that computers execute directly. Each instruction does one small thing: add two numbers, or compare two values. Mutation testing swaps one instruction for a different one, runs the tests in an isolated sandbox, and throws the sandbox away when done. Each cycle takes a fraction of a second, and many run at the same time.

Because the changes happen to already-compiled instructions, the system skips the expensive step of re-translating the whole program for every mutant. A file with 30 possible mutations finishes in a few seconds.

Vary also supports a second mode that works at a higher level, making structural changes like removing entire sections of code or rearranging the order of inputs to a function. This mode is slower but catches patterns that instruction-level changes cannot.
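One such structural change, rearranging a function's inputs, can be sketched with Python's `ast` module standing in for source-level rewriting (the function and its body are hypothetical; Vary's own mechanism is not shown):

```python
import ast

source = "def pay(amount, fee):\n    return amount - fee"
tree = ast.parse(source)

# Rearrange the order of the function's inputs: (amount, fee) -> (fee, amount)
params = tree.body[0].args.args
params[0], params[1] = params[1], params[0]
ast.fix_missing_locations(tree)

scope = {}
exec(compile(tree, "<mutant>", "exec"), scope)
print(scope["pay"](100, 3))   # original pay(100, 3) is 97; the mutant gives -97
```

An instruction-level swap could never produce this mutant, because no single instruction encodes the parameter order; only a structural rewrite can.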

For larger test suites, Vary has a strict mode that records which tests touch which methods during a baseline pass, then runs only the tests that reach the mutated method for each mutant. A parity benchmark verifies that every optimization produces the same kill/survive verdicts as a fresh-compile reference run, so speed never comes at the cost of correctness. See Strict mode for details.

What you learn

After running mutation testing, you get a score and a breakdown of what survived. Surviving mutants are not abstract statistics. For each one, the tool explains what was changed, where in the program, why the tests missed it, and what kind of check would catch it.

Say a calculation produces the wrong answer and no test notices. The tool tells you: this calculation's result is never verified. Write a test that checks the result. That is more useful than a coverage report saying "this line was executed."

What counts as a good score

There is no single right number. Some parts of a program are genuinely harder to test than others. Some mutations produce changes that have no visible effect on the program's behaviour (so-called equivalent mutants), making them impossible to catch by design.

The score matters less than the list of survivors. A score of 80% where you understand the survivors is healthier than 100% achieved by writing shallow checks. The survivors tell you where the weak spots are and what to do about them.

How it fits with the rest of Vary

Mutation testing and code coverage answer different questions. Coverage asks: did the tests run this code? Mutation testing asks: would the tests catch a mistake in this code? High coverage with a low mutation score means the tests touch the code but never verify the results. You want both.
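The gap between the two metrics fits in a few lines. In this hypothetical Python example, the weak test gives full line coverage while catching nothing:

```python
def add_tax(price):
    return price + price // 5      # adds 20% tax (integer arithmetic)

def test_weak():
    add_tax(100)                   # runs the line: coverage says 100%
                                   # ...but never checks the result

def test_strong():
    assert add_tax(100) == 120     # a mutant changing the tax rate fails here
```

Against `test_weak`, every mutant of `add_tax` survives, so the mutation score exposes what the coverage number hides.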

Vary also has a separate system called VAST that tests the Vary compiler itself by generating thousands of random programs and checking that the compiler handles them all correctly. VAST makes sure the compiler translates your program faithfully. Mutation testing makes sure your tests verify that the program does what you intended. One checks the translator, the other checks the safety net.

Introduction →