The verification ladder tells an AI tool what to run next: check, then test, then mutate, then validate. That ordering still matters.
But it assumes the code and tests are already in reasonable shape when verification starts. In practice, they usually are not. Generated code compiles and the suite is green, but the structure is sloppy and the tests are shallow. Running mutate against a weak suite just produces a long list of survivors that nobody wants to triage.
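A minimal sketch of that failure mode, in Python rather than Vary's own syntax: a success-only test with no boundary values cannot distinguish the original function from an obvious mutant, so a mutation run reports the mutant as a survivor.

```python
def clamp(value, low, high):
    """Clamp value into the inclusive range [low, high]."""
    if value < low:
        return low
    if value > high:
        return high
    return value

def clamp_mutant(value, low, high):
    """Mutant: the lower-bound check has been deleted."""
    if value > high:
        return high
    return value

def shallow_test(fn):
    """Success-only test: one in-range input, no boundaries."""
    return fn(5, 0, 10) == 5

def boundary_test(fn):
    """A below-range input distinguishes original from mutant."""
    return fn(-3, 0, 10) == 0

# The shallow test passes for both versions: the mutant survives.
print(shallow_test(clamp), shallow_test(clamp_mutant))  # True True
# The boundary test kills it.
print(boundary_test(clamp), boundary_test(clamp_mutant))  # True False
```

This is why the front half of the workflow strengthens the tests first: mutation results are only meaningful once the suite can actually kill mutants.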
The confidence workflow adds a front half that tries to improve the code and tests before the heavier verification stages run.
The workflow
| Stage | Command | What it does |
|---|---|---|
| 1 | vary new | Start from a canonical project layout with predictable boundaries |
| 2 | vary check --great-code | Push code toward stronger structure and more canonical Vary patterns |
| 3 | vary check --great-tests | Catch success-only tests, duplicated logic, vacuous assertions, missing negative paths |
| 4 | vary check --review-packet | Group and count checker findings into a scannable summary |
| 5 | vary test | Confirm the improved code still behaves correctly |
| 6 | deeper verification | Escalate only when the change justifies more cost |
What each step actually does
vary new gives you scaffolding where src/, tests/, effects, and pure logic have designated places. If the layout is already familiar, both humans and tools start from a better position.
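A hypothetical layout of the kind vary new might produce; beyond src/ and tests/, the directory names here are assumptions for illustration, not Vary's actual scaffold:

```
my-project/
├── src/
│   ├── effects/   # IO, network, clock: anything that touches the world
│   └── core/      # pure logic, deterministic and easy to test
└── tests/
```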
--great-code enables 5 opinionated static rules that are off by default. They flag effect-logic coupling, stringly-typed parameters (3+ Str args), broad return types on long functions, functions that raise but do not return Result[T, E], and raise "string" instead of typed errors. These are structural weaknesses that make generated code hard to review, not style nits.
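As an illustration of two of those findings and one way to resolve them, sketched in Python rather than Vary's syntax: stringly-typed parameters become a small typed value, and a bare string-style raise becomes a typed error a caller can catch.

```python
from dataclasses import dataclass

# Before: three raw string-ish parameters and an untyped error.
def connect_weak(host, port, user):
    if not host:
        raise Exception("host missing")   # caller can only catch Exception
    return f"{user}@{host}:{port}"

# After: a small typed parameter object and a typed error.
@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int
    user: str

class MissingHostError(ValueError):
    """Typed error that callers can handle specifically."""

def connect(endpoint: Endpoint) -> str:
    if not endpoint.host:
        raise MissingHostError("endpoint has no host")
    return f"{endpoint.user}@{endpoint.host}:{endpoint.port}"

print(connect(Endpoint("db.local", 5432, "app")))  # app@db.local:5432
```

The structural point carries over regardless of language: grouped, typed inputs are harder to pass in the wrong order, and typed errors make the failure paths reviewable.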
--great-tests enables 12 rules that look at the structure of test blocks. They catch success-path-only suites, tests that call a function but never check the output (observe True), shallow observations that ignore most fields of a data class, duplicate test logic (3+ identical structures with different literals), missing boundary values, and empty or tautological assertions. These patterns look like coverage but would not catch a real bug.
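In Python terms (Vary's observe blocks will look different), here is the gap between a vacuous test, a success-only test, and one that exercises boundaries and negative paths:

```python
def parse_port(text: str) -> int:
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

# Vacuous: calls the function but never checks the output.
def test_vacuous():
    parse_port("8080")
    assert True

# Shallow: success path only, no boundaries, no negative cases.
def test_shallow():
    assert parse_port("8080") == 8080

# Stronger: boundary values plus negative paths.
def test_negative_paths():
    assert parse_port("1") == 1
    assert parse_port("65535") == 65535
    for bad in ("0", "65536", "not-a-port"):
        try:
            parse_port(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")
```

All three pass today, which is the trap: only the last one would fail if the range check regressed.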
--review-packet is a rendering flag, not an analysis pass. It takes the same diagnostics vary check already produced and groups them into a compact summary: one verdict line, counts by category, the top finding per area, and suggested next steps. It does not know about test results or mutation scores.
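The grouping itself is mechanical. A minimal sketch in Python of that kind of rendering pass; the finding tuple shape and verdict wording here are assumptions for illustration, not Vary internals:

```python
from collections import Counter

def review_packet(findings):
    """Render (code, category, severity, message) tuples as a summary."""
    by_category = Counter(cat for _, cat, _, _ in findings)
    warnings = sum(1 for _, _, sev, _ in findings if sev == "warning")
    verdict = ("PASS: no findings" if not findings
               else f"REVIEW: {warnings} warning(s) to evaluate" if warnings
               else "PASS with suggestions")
    lines = [f"Verdict: {verdict}", f"Total: {len(findings)} finding(s)"]
    for cat, count in by_category.most_common():
        # Top finding per area: the first one seen in that category.
        top = next(msg for _, c, _, msg in findings if c == cat)
        lines.append(f"{cat}: {count}")
        lines.append(f"  -> {top}")
    return "\n".join(lines)

packet = review_packet([
    ("VCT002", "testing", "suggestion",
     "module has 2 test(s) but none exercise negative paths"),
    ("VCD002", "dead_code", "warning",
     "unused import 'Job' from 'model'"),
])
print(packet)
```

Because it is a pure re-rendering of existing diagnostics, the packet can never disagree with a plain vary check run; it only changes how the same findings read.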
Here is a real example from a weak test suite:
```
$ vary check tests/ --great-tests --review-packet
== Review Packet ==
Verdict: REVIEW: 1 warning(s) to evaluate
Total: 8 finding(s)
By category:
testing: 7 (7 suggestion)
→ [VCT002] module has 2 test(s) but none exercise negative paths
dead_code: 1 (1 warning)
→ [VCD002] unused import 'Job' from 'model'
Next steps:
- Review 1 warning(s)
- 7 test quality finding(s), run vary test tests/
```
After fixing the tests, the packet comes back PASS: no findings. The companion article has the full rule tables for both profiles.
Before and after
| Old habit | With the confidence workflow |
|---|---|
| Ad hoc project structure | vary new provides a canonical starting shape |
| Passing tests hide weak code | --great-code flags structural problems before tests run |
| A green suite can still be shallow | --great-tests exposes weak assertions and missing paths |
| Scrolling through individual diagnostics | --review-packet groups them into a scannable summary |
Where this fits
The verification ladder tells the tool what to run next. The confidence workflow tries to clean up the code and tests before you get there. Fix the shape first, strengthen the tests second, scan the summary third, then let the ladder take over.
Related reading
| Article | Focus |
|---|---|
| The verification ladder for AI coding tools | The escalation order: check, test, mutate, validate |
| Human-readable, AI-written, confidence at scale | The product direction behind these workflow changes |