From generated code to confidence at scale

The verification ladder tells an AI tool what to run next: check, then test, then mutate, then validate. That ordering still matters.

But it assumes the code and tests are already in reasonable shape when verification starts. In practice, they usually are not. Generated code compiles, and the suite may even be green, but the structure is sloppy and the tests are shallow. Running the mutate stage against a weak suite just produces a long list of surviving mutants nobody wants to triage.

The confidence workflow adds a front half that tries to improve the code and tests before the heavier verification stages run.

The workflow

| Stage | Command | What it does |
| --- | --- | --- |
| 1 | `vary new` | Start from a canonical project layout with predictable boundaries |
| 2 | `vary check --great-code` | Push code toward stronger structure and more canonical Vary patterns |
| 3 | `vary check --great-tests` | Catch success-only tests, duplicated logic, vacuous assertions, missing negative paths |
| 4 | `vary check --review-packet` | Group and count checker findings into a scannable summary |
| 5 | `vary test` | Confirm the improved code still behaves correctly |
| 6 | deeper verification | Escalate only when the change justifies more cost |
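
Run end to end, stages 1 through 5 are a short command sequence. A sketch follows; the `src/` and `tests/` path arguments are assumptions based on the scaffold's layout, and only the `--great-tests --review-packet` pairing on step 4 is taken from the example later in this article.

```sh
vary new                                          # 1. canonical scaffold
vary check src/ --great-code                      # 2. structural rules on the code
vary check tests/ --great-tests                   # 3. quality rules on the tests
vary check tests/ --great-tests --review-packet   # 4. group findings into a packet
vary test tests/                                  # 5. confirm behavior still holds
# 6. deeper verification (mutation and beyond) only when the change justifies it
```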

What each step actually does

`vary new` gives you scaffolding where `src/`, `tests/`, effects, and pure logic all have designated places. If the layout is already familiar, both humans and tools start from a better position.
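
The description stops at that sentence, so the tree below is only a guess at what such a layout looks like in practice; the `src/` and `tests/` split comes from the description above, and everything else is illustrative rather than the actual `vary new` output.

```
myproject/
├── src/        # pure logic, kept apart from effects
├── tests/      # test blocks for each source module
└── ...         # effect boundaries in their own designated place
```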

`--great-code` enables 5 opinionated static rules that are off by default. They flag effect-logic coupling, stringly-typed parameters (3+ `Str` args), broad return types on long functions, functions that raise but do not return `Result[T, E]`, and `raise "string"` instead of typed errors. These are structural weaknesses that make generated code hard to review, not style nits.
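
The article never shows Vary source, so here is a rough Python analogue of the shapes those five rules target. Every name below is illustrative; only the rule descriptions come from the tool.

```python
# All five --great-code smells packed into one function (Python stand-in).

PRICES = {"sku-1": 9.99}

def charge_card(customer: str, amount: float) -> None:
    print(f"charging {customer}: {amount}")  # stand-in for a payment effect

def place_order(customer: str, sku: str, warehouse: str) -> object:
    # Stringly-typed parameters: three str args where Customer, Sku, and
    # Warehouse types would carry meaning (the 3+ Str args rule).
    # Broad return type: `object` tells callers nothing about the result.
    price = PRICES.get(sku)                  # pure pricing logic...
    if price is None:
        # Raises instead of returning a Result[T, E]-style value, and the
        # error is an untyped message rather than a typed error class.
        raise Exception("unknown sku")
    charge_card(customer, price)             # ...coupled to an effect in one body
    return price
```

The repaired shape would split the pure lookup from the charging effect, take domain types instead of three strings, and return a typed success-or-error value.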

`--great-tests` enables 12 rules that look at the structure of test blocks. They catch success-path-only suites, tests that call a function but never check the output (`observe True`), shallow observations that ignore most fields of a data class, duplicate test logic (3+ identical structures with different literals), missing boundary values, and empty or tautological assertions. These patterns look like coverage but would not catch a real bug.
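
Again in Python rather than Vary test blocks (so plain `assert` stands in for Vary's `observe`), a suite that trips several of these rules might look like this: green, but guarding almost nothing.

```python
from dataclasses import dataclass

@dataclass
class Order:
    sku: str
    quantity: int
    total: float

def make_order(sku: str, quantity: int) -> Order:
    return Order(sku, quantity, round(quantity * 9.99, 2))

def test_calls_but_never_checks():
    make_order("sku-1", 1)
    assert True                        # vacuous assertion: always passes

def test_shallow_observation():
    order = make_order("sku-1", 2)
    assert order.sku == "sku-1"        # ignores quantity and total entirely

def test_quantity_one():
    assert make_order("sku-1", 1).quantity == 1

def test_quantity_two():
    assert make_order("sku-1", 2).quantity == 2   # same structure, new literal

def test_quantity_three():
    assert make_order("sku-1", 3).quantity == 3   # third copy: the 3+ duplicate rule

# Nothing here exercises a negative path or a boundary value (0, negative
# quantity), so the suite reads as coverage while guarding very little.
```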

`--review-packet` is a rendering flag, not an analysis pass. It takes the same diagnostics `vary check` already produced and groups them into a compact summary: one verdict line, counts by category, the top finding per area, and suggested next steps. It does not know about test results or mutation scores.

Here is a real example from a weak test suite:

```
$ vary check tests/ --great-tests --review-packet

== Review Packet ==

Verdict: REVIEW: 1 warning(s) to evaluate
Total:   8 finding(s)

By category:
  testing: 7 (7 suggestion)
    → [VCT002] module has 2 test(s) but none exercise negative paths
  dead_code: 1 (1 warning)
    → [VCD002] unused import 'Job' from 'model'

Next steps:
  - Review 1 warning(s)
  - 7 test quality finding(s), run `vary test tests/`
```

After fixing the tests, the packet comes back PASS: no findings. The companion article has the full rule tables for both profiles.
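
What does fixing the VCT002 finding look like? A minimal sketch, in Python rather than Vary and with a hypothetical `price_order` function standing in for the real code under test: add the negative path and a boundary value the packet said were missing.

```python
import pytest

def price_order(sku: str, quantity: int) -> float:
    prices = {"sku-1": 9.99}
    if sku not in prices:
        raise ValueError(f"unknown sku: {sku}")
    return prices[sku] * quantity

def test_rejects_unknown_sku():        # the negative path VCT002 asked for
    with pytest.raises(ValueError):
        price_order("no-such-sku", 1)

def test_zero_quantity_boundary():     # boundary value, asserted concretely
    assert price_order("sku-1", 0) == 0.0
```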

Before and after

| Old habit | With the confidence workflow |
| --- | --- |
| Ad hoc project structure | `vary new` provides a canonical starting shape |
| Passing tests hide weak code | `--great-code` flags structural problems before tests run |
| A green suite can still be shallow | `--great-tests` exposes weak assertions and missing paths |
| Scrolling through individual diagnostics | `--review-packet` groups them into a scannable summary |

Where this fits

The verification ladder tells the tool what to run next. The confidence workflow tries to clean up the code and tests before you get there. Fix the shape first, strengthen the tests second, scan the summary third, then let the ladder take over.

Related reading

| Article | Focus |
| --- | --- |
| The verification ladder for AI coding tools | The escalation order: check, test, mutate, validate |
| Human-readable, AI-written, confidence at scale | The product direction behind these workflow changes |