This page walks through a complete mutation testing session, from a weak test suite to a strong one. Mutation testing does not measure whether your code runs. It measures whether your tests would notice if the code changed.
This workflow is especially useful when tests are generated by AI, where surface coverage may be high but behavioural guarantees are weak.
You can follow along using examples/mutation-workflow/ from the repository. For an introduction, see Introduction. For the full reference, see Advanced overview.
scoring.vary has five small functions:
def add(a: Int, b: Int) -> Int {
    return a + b
}

def subtract(a: Int, b: Int) -> Int {
    return a - b
}

def clamp(value: Int, low: Int, high: Int) -> Int {
    if value < low {
        return low
    }
    if value > high {
        return high
    }
    return value
}

def abs_val(n: Int) -> Int {
    if n < 0 {
        return -n
    }
    return n
}

def is_passing(score: Int) -> Bool {
    return score >= 60 and score <= 100
}
test_scoring.vary has deliberately weak tests. They call every function but use loose assertions:
from scoring import add, subtract, clamp, abs_val, is_passing

test "add positive" {
    observe add(2, 3) > 0
}

test "subtract positive" {
    observe subtract(10, 3) > 0
}

test "clamp middle" {
    observe clamp(5, 0, 10) == 5
}

test "abs positive" {
    observe abs_val(5) == 5
}

test "is passing" {
    observe is_passing(75)
}
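To see why loose assertions let changes slip through, the idea can be modelled outside Vary. The sketch below is Python, not Vary, and `subtract_mutant` is a hypothetical mutant (the `-` operator swapped for `+`) chosen for illustration; the mutants the runner actually generates may differ.

```python
def subtract(a: int, b: int) -> int:
    return a - b


def subtract_mutant(a: int, b: int) -> int:
    # Hypothetical mutant: the operator - replaced with +
    return a + b


def weak_test(fn) -> bool:
    # Mirrors: observe subtract(10, 3) > 0
    return fn(10, 3) > 0


def exact_test(fn) -> bool:
    # Mirrors an exact check: observe subtract(10, 3) == 7
    return fn(10, 3) == 7


def kills(test, mutant) -> bool:
    # A mutant is killed when a test that passes on the original fails on it
    return not test(mutant)


weak_kills = kills(weak_test, subtract_mutant)    # 13 > 0 still holds
exact_kills = kills(exact_test, subtract_mutant)  # 13 == 7 fails
```

Both tests pass on the original function, but only the exact check fails on the mutant, so only the exact check would count as a kill.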
Verify the code works before mutating it:
vary run examples/mutation-workflow/scoring.vary
vary test examples/mutation-workflow/test_scoring.vary
Both commands should exit cleanly. The test command runs all five tests.
vary mutate scoring.vary --tests test_scoring.vary
Output:
Score: 27% killed (13/48)
Test Strength: Weak (27%)
Early-stage test depth. The leverage fixes below are the fastest path to stronger test signal.
Biggest Leverage Fixes
1) Add scoring edge-case tests
Impact: ~14 mutants Projected after (approx): 56% (+29pp)
Location: scoring
2) Assert scoring outputs and constants
Impact: ~9 mutants Projected after (approx): 75% (+48pp)
Location: scoring
3) Assert scoring events and state changes
Impact: ~9 mutants Projected after (approx): 94% (+67pp)
Location: scoring
4) Pin scoring return values
Impact: ~3 mutants Projected after (approx): 100% (+73pp)
Location: scoring
Start here: address #1 in scoring. (~14 mutants, projected to 56%)
vary mutate . --expand "scoring"
Top survivor groups (4 of 4)
GROUP FILE:LINE SURV CAUSE
scoring#clamp scoring.vary:10-11 14 ASSERT_EFFECT
scoring#is_passing scoring.vary:27 11 ASSERT_VALUE
scoring#abs_val scoring.vary:20-21 9 ASSERT_EFFECT
scoring#subtract scoring.vary:6 1 ASSERT_MATH
Why Survivors Exist
40% (14) Branch conditions not covered
26% (9) Values changed but never asserted
26% (9) Weak assertions only
9% (3) Return values not pinned
A score of 27% means the tests killed 13 of the 48 generated mutants, so they would notice only about a quarter of the changes the tool tried.
The leverage fixes are cumulative. Fix #1 projects the score to 56%. Fix #2 builds on that, projecting 75%. The (+Npp) figure is measured from the current score, not from the previous fix, so +48pp on fix #2 means going from 27% to 75%. The "Start here" line tells you which fix to tackle first and gives the command to inspect it.
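The projections in the sample output are consistent with simple arithmetic on the mutant counts. The Python sketch below reproduces them, assuming each projection is the current kills plus the cumulative Impact figures, divided by the 48 total mutants and rounded to a whole percent. That formula is inferred from the numbers shown, not taken from the tool's documentation.

```python
TOTAL_MUTANTS = 48
KILLED_NOW = 13
IMPACTS = [14, 9, 9, 3]  # "Impact: ~N mutants" for fixes 1 to 4


def percent(killed: int, total: int) -> int:
    # Whole-percent score, as printed in the report
    return round(100 * killed / total)


current = percent(KILLED_NOW, TOTAL_MUTANTS)

projections = []
running = KILLED_NOW
for impact in IMPACTS:
    running += impact                      # fixes stack on top of each other
    projected = percent(running, TOTAL_MUTANTS)
    projections.append((projected, projected - current))

for number, (projected, gain) in enumerate(projections, start=1):
    print(f"fix #{number}: {projected}% (+{gain}pp)")
```

Under that assumption the loop yields 56%, 75%, 94% and 100%, with gains of 29, 48, 67 and 73 percentage points over the current 27%.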
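The CAUSE column in the survivor table tells you why each group survived. ASSERT_VALUE means a value was changed but never asserted. ASSERT_EFFECT means side effects were not observed. ASSERT_MATH appears on scoring#subtract, whose only test checks subtract(10, 3) > 0 rather than an exact result.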
The "Why Survivors Exist" breakdown gives the broader categories. "Values changed but never asserted" and "Weak assertions only" both point to the same root cause: the tests use a loose check such as observe add(2, 3) > 0 where they should use an exact check such as observe add(2, 3) == 5.
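The percentages in the breakdown are shares of the surviving mutants, not of all mutants. A short Python check, using only the counts printed in the sample output, shows how they fit together; rounding to whole percents is an assumption about how the report formats them.

```python
TOTAL_MUTANTS = 48
KILLED = 13

# Counts copied from the "Why Survivors Exist" block
survivors = {
    "Branch conditions not covered": 14,
    "Values changed but never asserted": 9,
    "Weak assertions only": 9,
    "Return values not pinned": 3,
}

total_survivors = sum(survivors.values())

# Share of survivors per category, rounded to a whole percent
shares = {
    cause: round(100 * count / total_survivors)
    for cause, count in survivors.items()
}
```

The four categories add up to the 35 mutants that were not killed (48 minus 13). Because each share is rounded separately, the printed percentages sum to 101 rather than 100.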
The "Biggest Leverage Fixes" section ranks the changes by how many mutants each would kill, so working from the top gives the largest gain first.
Start with fix #1. In this case, the biggest group is scoring#clamp (14 survivors) because the test only checks the middle of the range, not the boundaries.
Pick scoring#clamp and see what survived:
vary mutate scoring.vary --tests test_scoring.vary --expand "scoring#clamp"
Expanded: scoring#clamp (14 mutants)
scoring.vary:10 Replace < with <=
vary mutate scoring.vary --replay clamp:REL_LT_TO_LE:a8c21b3f
scoring.vary:10 Replace low with 0
vary mutate scoring.vary --replay clamp:LIT_CHANGE:b3e48d12:1
scoring.vary:13 Replace > with >=
vary mutate scoring.vary --replay clamp:REL_GT_TO_GE:c4d51e2a
...
Each entry shows what changed and where, followed by a --replay command that re-runs that single mutant if you want to reproduce it.
Pick a mutant and ask why it survived:
vary mutate scoring.vary --tests test_scoring.vary --why "clamp:REL_LT_TO_LE:a8c21b3f"
Mutant: The comparison operator was changed but no test covers the boundary
Location: scoring.vary line 10, in clamp
Change:
Replace < with <=
Why it survived:
The comparison operator was changed but no test covers the boundary.
Fix:
Add a test at the boundary value where < and <= differ.
Example: observe clamp(0, 0, 10) == 0
The test calls observe clamp(5, 0, 10) == 5, which passes whether the boundary check is < or <=. No test checks the boundary itself.
Add boundary tests to test_scoring.vary:
test "clamp at boundaries" {
    observe clamp(0, 0, 10) == 0
    observe clamp(10, 0, 10) == 10
    observe clamp(-5, 0, 10) == 0
    observe clamp(15, 0, 10) == 10
}
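It can help to check which of these assertions actually separate a mutant from the original. The sketch below is a Python model of clamp, not the Vary runner. `lt_to_le` and `low_as_zero` model two of the changes listed in the expanded output (assuming the replaced `low` is the one in the condition on line 10), while `low_returns_value` and the nonzero-low check are hypothetical additions for illustration.

```python
def clamp(value: int, low: int, high: int) -> int:
    if value < low:
        return low
    if value > high:
        return high
    return value


def lt_to_le(value: int, low: int, high: int) -> int:
    # Models "Replace < with <=" on the first condition
    if value <= low:
        return low
    if value > high:
        return high
    return value


def low_as_zero(value: int, low: int, high: int) -> int:
    # Models "Replace low with 0" in the first condition
    if value < 0:
        return low
    if value > high:
        return high
    return value


def low_returns_value(value: int, low: int, high: int) -> int:
    # Hypothetical mutant: the low branch returns the input unchanged
    if value < low:
        return value
    if value > high:
        return high
    return value


MUTANTS = {
    "lt_to_le": lt_to_le,
    "low_as_zero": low_as_zero,
    "low_returns_value": low_returns_value,
}


def middle_only(fn) -> bool:
    # The original weak test
    return fn(5, 0, 10) == 5


def boundary_suite(fn) -> bool:
    # The four boundary assertions added above
    return (fn(0, 0, 10) == 0 and fn(10, 0, 10) == 10
            and fn(-5, 0, 10) == 0 and fn(15, 0, 10) == 10)


def nonzero_low_check(fn) -> bool:
    # Hypothetical extra assertion with a lower bound other than 0
    return fn(3, 5, 10) == 5


after_middle = sorted(n for n, fn in MUTANTS.items() if middle_only(fn))
after_boundary = sorted(n for n, fn in MUTANTS.items() if boundary_suite(fn))
after_nonzero = sorted(
    n for n, fn in MUTANTS.items()
    if boundary_suite(fn) and nonzero_low_check(fn)
)
```

In this model the middle-only test lets all three mutants survive, and the boundary tests kill the one that breaks a branch body. The literal substitution survives only because every assertion uses a lower bound of 0; an assertion with a nonzero lower bound separates it. The `<` to `<=` change returns the same value as the original whenever low <= high, because both versions return low when value equals low. If the runner still reports it after the boundary tests are added, it may be an equivalent mutant rather than a gap in the tests.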
And pin the is_passing boundaries:
test "is passing at boundary" {
    observe is_passing(60)
    observe not is_passing(59)
    observe is_passing(100)
    observe not is_passing(101)
}
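The effect of pinning both edges of the range can be sketched in Python. The three mutants below are hypothetical examples of relational and logical changes to is_passing; the names and the exact set of mutants are assumptions, not output from the Vary runner.

```python
def is_passing(score: int) -> bool:
    return score >= 60 and score <= 100


# Hypothetical mutants of the single return expression
MUTANTS = {
    "ge_to_gt": lambda s: s > 60 and s <= 100,
    "le_to_lt": lambda s: s >= 60 and s < 100,
    "and_to_or": lambda s: s >= 60 or s <= 100,
}


def weak_suite(fn) -> bool:
    # Mirrors: observe is_passing(75)
    return fn(75)


def boundary_suite(fn) -> bool:
    # Mirrors the four boundary assertions above
    return fn(60) and not fn(59) and fn(100) and not fn(101)


survive_weak = sorted(n for n, fn in MUTANTS.items() if weak_suite(fn))
survive_boundary = sorted(n for n, fn in MUTANTS.items() if boundary_suite(fn))
```

A score of 75 sits well inside the range, so every mutant in this model passes the weak test. Each boundary assertion fails for at least one of them: 60 catches the strict lower bound, 100 the strict upper bound, and 59 the `or` variant.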
Re-run:
vary mutate scoring.vary --tests test_scoring.vary
The scoring#clamp and scoring#is_passing groups shrink and the score goes up.
The add and subtract tests use observe add(2, 3) > 0. Replace them with exact checks:
test "add returns sum" {
    observe add(2, 3) == 5
    observe add(0, 5) == 5
    observe add(-1, 1) == 0
}

test "subtract returns difference" {
    observe subtract(10, 3) == 7
    observe subtract(5, 5) == 0
}
Re-run. The "Weak assertions only" category disappears from the breakdown.
For abs_val, test a negative input:
test "abs negative" {
    observe abs_val(-3) == 3
    observe abs_val(0) == 0
}
vary mutate scoring.vary --tests test_scoring.vary
The score should be above 90%. Any remaining survivors are either equivalent mutants (changes that don't affect observable behaviour) or edge cases worth investigating with --why.
You can strengthen the mutation score without writing more tests by adding contracts:
def abs_val(n: Int) -> Int {
    out (r) {
        r >= 0
    }
    if n < 0 {
        return -n
    }
    return n
}
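Now a mutant that changes return -n to return n breaks the postcondition as soon as any test calls abs_val with a negative input. The mutation runner counts contract violations as kills, so the contract does the checking, but a test still has to exercise the negative branch.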
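A rough Python analogue shows the mechanism. This is a sketch of the idea only: `ContractViolation` and the explicit check stand in for Vary's out block, and `call_completes` is a hypothetical helper that calls the function without asserting anything about the result.

```python
class ContractViolation(Exception):
    """Stand-in for a failed postcondition."""


def check_post(result: int) -> int:
    # Models the postcondition: out (r) { r >= 0 }
    if not result >= 0:
        raise ContractViolation(f"expected r >= 0, got {result}")
    return result


def abs_val(n: int) -> int:
    if n < 0:
        return check_post(-n)
    return check_post(n)


def abs_val_mutant(n: int) -> int:
    # Models the mutant that changes `return -n` to `return n`
    if n < 0:
        return check_post(n)
    return check_post(n)


def call_completes(fn, arg: int) -> bool:
    # Calls fn without checking its result; only a contract failure is noticed
    try:
        fn(arg)
        return True
    except ContractViolation:
        return False


original_negative = call_completes(abs_val, -3)
mutant_negative = call_completes(abs_val_mutant, -3)
mutant_positive = call_completes(abs_val_mutant, 5)
```

The mutant fails on -3 even though the caller asserts nothing, which is the sense in which a contract can kill a mutant. The same mutant completes normally on 5, so a suite that only ever passes positive inputs would not trigger the violation.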
| Step | What you do | What it tells you |
|---|---|---|
| 1 | vary mutate file.vary | Overall score and survivor breakdown |
| 2 | Read leverage fixes | Cumulative projected scores for each fix |
| 3 | --expand a group | Individual surviving mutants |
| 4 | --why on a mutant | Root cause and suggested fix |
| 5 | Write the test, re-run | Confirm the score improved |
| 6 | Repeat | Until the score is where you want it |
| Flag | What it does |
|---|---|
| --expand GROUP | Show individual mutants in a group |
| --why ID | Explain why a specific mutant survived |
| --replay ID | Re-run a single mutant |
| --top N | Change how many groups the table shows |
| --group MODE | Group survivors by function, file, or cause |
| --quick | Fast mode: relational + literal operators, max 20 mutants/file |
| --all | Exhaustive mode: override the default 200 mutants/file cap |
| --output MODE | Output mode: text (default, live spinner), log, json, or html |
| -v | Verbose output with OP and HINT columns |