Mutation

Golden path

This page walks through a complete mutation testing session, from a weak test suite to a strong one. Mutation testing does not measure whether your code runs. It measures whether your tests would notice if the code changed.

This workflow is especially useful when tests are generated by AI, where surface coverage may be high but behavioural guarantees are weak.

You can follow along using examples/mutation-workflow/ from the repository. For an introduction, see Introduction. For the full reference, see Advanced overview.

The code

scoring.vary has five small functions:

def add(a: Int, b: Int) -> Int {
    return a + b
}

def subtract(a: Int, b: Int) -> Int {
    return a - b
}

def clamp(value: Int, low: Int, high: Int) -> Int {
    if value < low {
        return low
    }
    if value > high {
        return high
    }
    return value
}

def abs_val(n: Int) -> Int {
    if n < 0 {
        return -n
    }
    return n
}

def is_passing(score: Int) -> Bool {
    return score >= 60 and score <= 100
}

test_scoring.vary has deliberately weak tests. They call every function but use loose assertions:

from scoring import add, subtract, clamp, abs_val, is_passing

test "add positive" {
    observe add(2, 3) > 0
}

test "subtract positive" {
    observe subtract(10, 3) > 0
}

test "clamp middle" {
    observe clamp(5, 0, 10) == 5
}

test "abs positive" {
    observe abs_val(5) == 5
}

test "is passing" {
    observe is_passing(75)
}

Running the examples

Verify the code works before mutating it:

vary run examples/mutation-workflow/scoring.vary
vary test examples/mutation-workflow/test_scoring.vary

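Both commands should exit cleanly. The test command runs all five tests. The commands in the steps below use bare file names, so they assume you are running them from inside examples/mutation-workflow/.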

Step 1: Get the score

vary mutate scoring.vary --tests test_scoring.vary

Output:

Score: 27% killed (13/48)
Test Strength: Weak (27%)

Early-stage test depth. The leverage fixes below are the fastest path to stronger test signal.

Biggest Leverage Fixes

1) Add scoring edge-case tests
   Impact: ~14 mutants  Projected after (approx): 56% (+29pp)
   Location: scoring

2) Assert scoring outputs and constants
   Impact: ~9 mutants  Projected after (approx): 75% (+48pp)
   Location: scoring

3) Assert scoring events and state changes
   Impact: ~9 mutants  Projected after (approx): 94% (+67pp)
   Location: scoring

4) Pin scoring return values
   Impact: ~3 mutants  Projected after (approx): 100% (+73pp)
   Location: scoring

Start here: address #1 in scoring. (~14 mutants, projected to 56%)
  vary mutate . --expand "scoring"

Top survivor groups (4 of 4)

GROUP               FILE:LINE           SURV  CAUSE
scoring#clamp       scoring.vary:10-11    14  ASSERT_EFFECT
scoring#is_passing  scoring.vary:27       11  ASSERT_VALUE
scoring#abs_val     scoring.vary:20-21     9  ASSERT_EFFECT
scoring#subtract    scoring.vary:6         1  ASSERT_MATH

Why Survivors Exist

   40% (14) Branch conditions not covered
   26% (9) Values changed but never asserted
   26% (9) Weak assertions only
    9% (3) Return values not pinned

A score of 27% means the tests killed 13 of the 48 generated mutants, roughly a quarter of the changes the tool tried.

The leverage fixes are cumulative. Fix #1 projects the score to 56%. Fix #2 adds to that, projecting 75%. The +Npp column shows how many percentage points above the current score each fix brings you. The "Start here" line tells you which fix to tackle first and gives the command to inspect it.
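The projected percentages in the sample output are consistent with simple arithmetic on the kill counts. The sketch below uses Python purely for illustration (it is not Vary, and it is not the tool's implementation). It assumes each projection is (mutants killed so far + mutants addressed by the fix) / total, rounded to a whole percent; that formula is inferred from the numbers shown above, not taken from the tool's documentation.

```python
# Numbers copied from the sample output above.
TOTAL_MUTANTS = 48
KILLED_NOW = 13
FIX_IMPACTS = [14, 9, 9, 3]  # approximate mutants addressed by fixes #1..#4


def percent(killed: int, total: int) -> int:
    """Kill rate as a whole-number percentage."""
    return round(100 * killed / total)


def projections(killed: int, total: int, impacts: list) -> list:
    """Return (projected %, gain in percentage points) after each cumulative fix."""
    baseline = percent(killed, total)
    rows = []
    for impact in impacts:
        killed += impact  # fixes stack on top of each other
        projected = percent(killed, total)
        rows.append((projected, projected - baseline))
    return rows


if __name__ == "__main__":
    print(f"current: {percent(KILLED_NOW, TOTAL_MUTANTS)}%")
    rows = projections(KILLED_NOW, TOTAL_MUTANTS, FIX_IMPACTS)
    for number, (projected, gain) in enumerate(rows, start=1):
        print(f"fix #{number}: {projected}% (+{gain}pp)")
```

Under that assumption the four rows come out as 56% (+29pp), 75% (+48pp), 94% (+67pp) and 100% (+73pp), matching the report.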

The CAUSE column in the survivor table tells you why each group survived. ASSERT_VALUE means a value was changed but never asserted. ASSERT_EFFECT means side effects were not observed.

The "Why Survivors Exist" breakdown gives the broader categories. "Values changed but never asserted" and "Weak assertions only" both point to the same root cause: the tests use observe x > 0 where they should use observe x == 5.
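To make the root cause concrete, here is a small Python port of subtract alongside one hand-written mutant. The mutant (minus replaced with plus) is a hypothetical example chosen for illustration; the page does not say which subtract mutant actually survived. The loose check mirrors the existing test and the exact check mirrors the kind of assertion that would replace it.

```python
def subtract(a: int, b: int) -> int:
    return a - b


def subtract_mutant(a: int, b: int) -> int:
    # Hypothetical mutant: the minus operator replaced with plus.
    return a + b


def loose_test(fn) -> bool:
    # Mirrors: observe subtract(10, 3) > 0
    return fn(10, 3) > 0


def exact_test(fn) -> bool:
    # Mirrors: observe subtract(10, 3) == 7
    return fn(10, 3) == 7


if __name__ == "__main__":
    for name, fn in [("original", subtract), ("mutant", subtract_mutant)]:
        print(name, "loose:", loose_test(fn), "exact:", exact_test(fn))
```

The loose check passes for both versions, so the mutant survives. The exact check passes only for the original, so the mutant is killed.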

Step 2: Read the leverage fixes

The "Biggest Leverage Fixes" section ranks changes by how many mutants each would kill. Because the projections are cumulative, addressing fix #1 alone takes the score to about 56%, and addressing fix #2 as well takes it to about 75%.

Start with fix #1. In this case, the biggest survivor group is scoring#clamp (14 survivors). It survives because the only clamp test checks a value in the middle of the range and never the boundaries.

Step 3: Expand a group

Pick scoring#clamp and see what survived:

vary mutate scoring.vary --tests test_scoring.vary --expand "scoring#clamp"

Output:

Expanded: scoring#clamp (14 mutants)

  scoring.vary:10  Replace < with <=
    vary mutate scoring.vary --replay clamp:REL_LT_TO_LE:a8c21b3f
  scoring.vary:10  Replace low with 0
    vary mutate scoring.vary --replay clamp:LIT_CHANGE:b3e48d12:1
  scoring.vary:13  Replace > with >=
    vary mutate scoring.vary --replay clamp:REL_GT_TO_GE:c4d51e2a
  ...

Each line shows what changed and where. The --replay command re-runs a single mutant if you want to reproduce it.
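As an example of reading one of these lines, the Python sketch below models "scoring.vary:10  Replace low with 0". It assumes the mutation replaces low in the condition on line 10 (if value < low); that reading is an inference from the line number, not something the output states. The differs helper is a hypothetical name introduced for this illustration.

```python
def clamp(value: int, low: int, high: int) -> int:
    if value < low:
        return low
    if value > high:
        return high
    return value


def clamp_low_replaced(value: int, low: int, high: int) -> int:
    # Assumed reading of "Replace low with 0": the condition compares against 0.
    if value < 0:
        return low
    if value > high:
        return high
    return value


def differs(value: int, low: int, high: int) -> bool:
    """True when this input tells the mutant apart from the original."""
    return clamp(value, low, high) != clamp_low_replaced(value, low, high)


if __name__ == "__main__":
    print(differs(5, 0, 10))   # low is 0, so both versions behave the same
    print(differs(-5, 0, 10))  # still low == 0
    print(differs(3, 5, 10))   # a nonzero low exposes the change
```

If that reading is right, a test can only catch this mutant when it passes a low other than 0, for example an assertion that clamp(3, 5, 10) equals 5.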

Step 4: Explain a survivor

Pick a mutant and ask why it survived:

vary mutate scoring.vary --tests test_scoring.vary --why "clamp:REL_LT_TO_LE:a8c21b3f"

Output:

Mutant: The comparison operator was changed but no test covers the boundary
Location: scoring.vary line 10, in clamp

Change:
  Replace < with <=

Why it survived:
  The comparison operator was changed but no test covers the boundary.

Fix:
  Add a test at the boundary value where < and <= differ.
  Example: observe clamp(0, 0, 10) == 0

The only clamp test is observe clamp(5, 0, 10) == 5. Because 5 is nowhere near either bound, that test passes whether the lower check is < or <=. No test exercises the boundaries themselves or a value outside the range.

Step 5: Write a better test

Add boundary tests to test_scoring.vary:

test "clamp at boundaries" {
    observe clamp(0, 0, 10) == 0
    observe clamp(10, 0, 10) == 10
    observe clamp(-5, 0, 10) == 0
    observe clamp(15, 0, 10) == 10
}
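The Python sketch below shows why the extra inputs matter. Both mutants are hypothetical, hand-written for illustration; they are not taken from the tool's output. The two check functions mirror the original "clamp middle" test and the new "clamp at boundaries" test.

```python
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value


def clamp_lower_eq(value, low, high):
    # Hypothetical mutant: "<" replaced with "==" in the lower check.
    if value == low:
        return low
    if value > high:
        return high
    return value


def clamp_upper_passthrough(value, low, high):
    # Hypothetical mutant: "return high" replaced with "return value".
    if value < low:
        return low
    if value > high:
        return value
    return value


def middle_test(fn):
    # Mirrors the original test "clamp middle".
    return fn(5, 0, 10) == 5


def boundary_tests(fn):
    # Mirrors the new test "clamp at boundaries".
    return (
        fn(0, 0, 10) == 0
        and fn(10, 0, 10) == 10
        and fn(-5, 0, 10) == 0
        and fn(15, 0, 10) == 10
    )


if __name__ == "__main__":
    for fn in (clamp, clamp_lower_eq, clamp_upper_passthrough):
        print(fn.__name__, "middle:", middle_test(fn), "boundaries:", boundary_tests(fn))
```

Both mutants pass the middle-of-range check and fail the boundary suite, in each case on one of the out-of-range inputs (-5 and 15).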

And pin the is_passing boundaries:

test "is passing at boundary" {
    observe is_passing(60)
    observe not is_passing(59)
    observe is_passing(100)
    observe not is_passing(101)
}
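The same idea applies to is_passing. In the Python sketch below, the two mutants are illustrative relational changes written by hand; the page does not list the actual is_passing survivors. Each one shifts a boundary by one, which a mid-range input such as 75 cannot detect.

```python
def is_passing(score: int) -> bool:
    return score >= 60 and score <= 100


def is_passing_lower_mutant(score: int) -> bool:
    # Illustrative mutant: ">=" replaced with ">".
    return score > 60 and score <= 100


def is_passing_upper_mutant(score: int) -> bool:
    # Illustrative mutant: "<=" replaced with "<".
    return score >= 60 and score < 100


def original_test(fn) -> bool:
    # Mirrors: observe is_passing(75)
    return fn(75)


def boundary_test(fn) -> bool:
    # Mirrors the four observations in "is passing at boundary".
    return fn(60) and not fn(59) and fn(100) and not fn(101)


if __name__ == "__main__":
    for fn in (is_passing, is_passing_lower_mutant, is_passing_upper_mutant):
        print(fn.__name__, "original:", original_test(fn), "boundary:", boundary_test(fn))
```

Checking 60 kills the lower mutant and checking 100 kills the upper one. The 59 and 101 checks guard against changes that widen the range instead.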

Re-run:

vary mutate scoring.vary --tests test_scoring.vary

The scoring#clamp and scoring#is_passing groups shrink and the score goes up.

Step 6: Fix the weak assertions

The add and subtract tests only check that the result is positive (observe add(2, 3) > 0 and observe subtract(10, 3) > 0). Replace them with exact checks:

test "add returns sum" {
    observe add(2, 3) == 5
    observe add(0, 5) == 5
    observe add(-1, 1) == 0
}

test "subtract returns difference" {
    observe subtract(10, 3) == 7
    observe subtract(5, 5) == 0
}

Re-run. The "Weak assertions only" category disappears from the breakdown.

Step 7: Cover the remaining branches

For abs_val, test a negative input:

test "abs negative" {
    observe abs_val(-3) == 3
    observe abs_val(0) == 0
}

Step 8: Confirm the score

vary mutate scoring.vary --tests test_scoring.vary

The score should be above 90%. Any remaining survivors are either equivalent mutants (changes that don't affect observable behaviour) or edge cases worth investigating with --why.
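One way to judge whether a survivor is equivalent is to compare the original and the mutant over a small input space. The Python sketch below does this for the "Replace < with <=" mutant from the --expand output, plus one hypothetical mutant for contrast. It assumes clamp is only called with low <= high, and the helper name first_difference is introduced here for illustration. An exhaustive search over a small range is evidence, not proof.

```python
from itertools import product


def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value


def clamp_le(value, low, high):
    # Mutant from the --expand output: "<" replaced with "<=" in the lower check.
    if value <= low:
        return low
    if value > high:
        return high
    return value


def clamp_eq(value, low, high):
    # Hypothetical mutant for contrast: "<" replaced with "==".
    if value == low:
        return low
    if value > high:
        return high
    return value


def first_difference(original, mutant, values=range(-5, 6)):
    """Return the first (value, low, high) where the two disagree, or None."""
    for value, low, high in product(values, repeat=3):
        if low > high:
            continue  # assumption: callers always pass low <= high
        if original(value, low, high) != mutant(value, low, high):
            return (value, low, high)
    return None


if __name__ == "__main__":
    print("<= mutant:", first_difference(clamp, clamp_le))
    print("== mutant:", first_difference(clamp, clamp_eq))
```

The search finds no distinguishing input for the <= mutant. The reasoning is direct: when value equals low, returning low and returning value give the same number, so under the low <= high assumption no test can kill it. The == mutant, by contrast, is distinguishable and deserves a test.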

Adding contracts for free kills

You can strengthen the mutation score without writing more tests by adding contracts:

def abs_val(n: Int) -> Int {
    out (r) {
        r >= 0
    }
    if n < 0 {
        return -n
    }
    return n
}

Now a mutant that changes return -n to return n violates the postcondition whenever a test calls abs_val with a negative input, even if that test never asserts on the result. The mutation runner counts contract violations as kills.
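The effect can be sketched in Python by modelling the postcondition as an explicit check on the return value. This is an analogy only: ContractViolation, ensure_non_negative and call_only_test are hypothetical names, and how Vary enforces out blocks internally is not described on this page.

```python
class ContractViolation(Exception):
    """Raised when the modelled postcondition r >= 0 does not hold."""


def ensure_non_negative(result: int) -> int:
    if result < 0:
        raise ContractViolation(f"expected r >= 0, got {result}")
    return result


def abs_val(n: int) -> int:
    if n < 0:
        return ensure_non_negative(-n)
    return ensure_non_negative(n)


def abs_val_mutant(n: int) -> int:
    # The mutant from the text: "return -n" replaced with "return n".
    if n < 0:
        return ensure_non_negative(n)
    return ensure_non_negative(n)


def call_only_test(fn) -> bool:
    """Calls fn(-3) without asserting on the result; False means the mutant was killed."""
    try:
        fn(-3)
    except ContractViolation:
        return False
    return True


if __name__ == "__main__":
    print("original passes:", call_only_test(abs_val))
    print("mutant passes:", call_only_test(abs_val_mutant))
```

The mutant is caught without any assertion in the test, but only because the test passes a negative number. A call such as abs_val(5) never reaches the mutated line, so the contract cannot help there.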

Summary

Step  What you do             What it tells you
1     vary mutate file.vary   Overall score and survivor breakdown
2     Read leverage fixes     Cumulative projected scores for each fix
3     --expand a group        Individual surviving mutants
4     --why on a mutant       Root cause and suggested fix
5     Write the test, re-run  Confirm the score improved
6     Repeat                  Until the score is where you want it

CLI flags used in this walkthrough

Flag            What it does
--expand GROUP  Show individual mutants in a group
--why ID        Explain why a specific mutant survived
--replay ID     Re-run a single mutant
--top N         Change how many groups the table shows
--group MODE    Group survivors by function, file, or cause
--quick         Fast mode: relational + literal operators, max 20 mutants/file
--all           Exhaustive mode: override the default 200 mutants/file cap
--output MODE   Output mode: text (default, live spinner), log, json, or html
-v              Verbose output with OP and HINT columns