The verification ladder tells an AI tool what to run next: check, then test, then mutate, then validate. That ordering still matters.
But it assumes the code and tests are already in reasonable shape when verification starts. In practice, they usually are not. Generated code compiles and the suite is green, but the structure is sloppy and the tests are shallow. Running mutate against a weak suite just produces a long list of survivors that nobody wants to triage.
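A minimal sketch of that failure mode, in Python rather than Vary's own syntax: a success-only test with no boundary values cannot distinguish the original function from an obvious mutant, so a mutation run reports the mutant as a survivor.

```python
def clamp(value, low, high):
    """Clamp value into the inclusive range [low, high]."""
    if value < low:
        return low
    if value > high:
        return high
    return value

def clamp_mutant(value, low, high):
    """Mutant: the lower-bound check has been deleted."""
    if value > high:
        return high
    return value

def shallow_test(fn):
    """Success-only test: one in-range input, no boundaries."""
    return fn(5, 0, 10) == 5

def boundary_test(fn):
    """A below-range input distinguishes original from mutant."""
    return fn(-3, 0, 10) == 0

# The shallow test passes for both versions: the mutant survives.
print(shallow_test(clamp), shallow_test(clamp_mutant))  # True True
# The boundary test kills it.
print(boundary_test(clamp), boundary_test(clamp_mutant))  # True False
```

This is why the front half of the workflow strengthens the tests first: mutation results are only meaningful once the suite can actually kill mutants.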
The confidence workflow adds a front half that tries to improve the code and tests before the heavier verification stages run.
The workflow
| Stage | Command | What it does |
|---|---|---|
| 1 | vary new | Start from a canonical project layout with predictable boundaries |
| 2 | vary check --great-code | Push code toward stronger structure and more canonical Vary patterns |
| 3 | vary check --great-tests | Catch success-only tests, duplicated logic, vacuous assertions, missing negative paths |
| 4 | vary check --review-packet | Group and count checker findings into a scannable summary |
| 5 | vary test | Confirm the improved code still behaves correctly |
| 6 | deeper verification | Escalate only when the change justifies more cost |
What each step actually does
vary new gives you scaffolding where src/, tests/, effects, and pure logic have designated places. If the layout is already familiar, both humans and tools start from a better position.
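A hypothetical layout of the kind vary new might produce; beyond src/ and tests/, the directory names here are assumptions for illustration, not Vary's actual scaffold:

```
my-project/
├── src/
│   ├── effects/   # IO, network, clock: anything that touches the world
│   └── core/      # pure logic, deterministic and easy to test
└── tests/
```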
--great-code enables 5 opinionated static rules that are off by default. They flag effect-logic coupling, stringly-typed parameters (3+ Str args), broad return types on long functions, functions that raise but do not return Result[T, E], and raise "string" instead of typed errors. These are structural weaknesses that make generated code hard to review, not style nits.
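As an illustration of two of those findings and one way to resolve them, sketched in Python rather than Vary's syntax: stringly-typed parameters become a small typed value, and a bare string-style raise becomes a typed error a caller can catch.

```python
from dataclasses import dataclass

# Before: three raw string-ish parameters and an untyped error.
def connect_weak(host, port, user):
    if not host:
        raise Exception("host missing")   # caller can only catch Exception
    return f"{user}@{host}:{port}"

# After: a small typed parameter object and a typed error.
@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int
    user: str

class MissingHostError(ValueError):
    """Typed error that callers can handle specifically."""

def connect(endpoint: Endpoint) -> str:
    if not endpoint.host:
        raise MissingHostError("endpoint has no host")
    return f"{endpoint.user}@{endpoint.host}:{endpoint.port}"

print(connect(Endpoint("db.local", 5432, "app")))  # app@db.local:5432
```

The structural point carries over regardless of language: grouped, typed inputs are harder to pass in the wrong order, and typed errors make the failure paths reviewable.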
--great-tests enables 12 rules that look at the structure of test blocks. They catch success-path-only suites, tests that call a function but never check the output (observe True), shallow observations that ignore most fields of a data class, duplicate test logic (3+ identical structures with different literals), missing boundary values, and empty or tautological assertions. These patterns look like coverage but would not catch a real bug.
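In Python terms (Vary's observe blocks will look different), here is the gap between a vacuous test, a success-only test, and one that exercises boundaries and negative paths:

```python
def parse_port(text: str) -> int:
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

# Vacuous: calls the function but never checks the output.
def test_vacuous():
    parse_port("8080")
    assert True

# Shallow: success path only, no boundaries, no negative cases.
def test_shallow():
    assert parse_port("8080") == 8080

# Stronger: boundary values plus negative paths.
def test_negative_paths():
    assert parse_port("1") == 1
    assert parse_port("65535") == 65535
    for bad in ("0", "65536", "not-a-port"):
        try:
            parse_port(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")
```

All three pass today, which is the trap: only the last one would fail if the range check regressed.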
--review-packet is a rendering flag, not an analysis pass. It takes the same diagnostics vary check already produced and groups them into a compact summary: one verdict line, counts by category, the top finding per area, and suggested next steps. It does not know about test results or mutation scores.
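The grouping itself is mechanical. A minimal sketch in Python of that kind of rendering pass; the finding tuple shape and verdict wording here are assumptions for illustration, not Vary internals:

```python
from collections import Counter

def review_packet(findings):
    """Render (code, category, severity, message) tuples as a summary."""
    by_category = Counter(cat for _, cat, _, _ in findings)
    warnings = sum(1 for _, _, sev, _ in findings if sev == "warning")
    verdict = ("PASS: no findings" if not findings
               else f"REVIEW: {warnings} warning(s) to evaluate" if warnings
               else "PASS with suggestions")
    lines = [f"Verdict: {verdict}", f"Total: {len(findings)} finding(s)"]
    for cat, count in by_category.most_common():
        # Top finding per area: the first one seen in that category.
        top = next(msg for _, c, _, msg in findings if c == cat)
        lines.append(f"{cat}: {count}")
        lines.append(f"  -> {top}")
    return "\n".join(lines)

packet = review_packet([
    ("VCT002", "testing", "suggestion",
     "module has 2 test(s) but none exercise negative paths"),
    ("VCD002", "dead_code", "warning",
     "unused import 'Job' from 'model'"),
])
print(packet)
```

Because it is a pure re-rendering of existing diagnostics, the packet can never disagree with a plain vary check run; it only changes how the same findings read.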
Here is a real example from a weak test suite:
```
$ vary check tests/ --great-tests --review-packet
== Review Packet ==
Verdict: REVIEW: 1 warning(s) to evaluate
Total: 8 finding(s)
By category:
testing: 7 (7 suggestion)
→ [VCT002] module has 2 test(s) but none exercise negative paths
dead_code: 1 (1 warning)
→ [VCD002] unused import 'Job' from 'model'
Next steps:
- Review 1 warning(s)
- 7 test quality finding(s), run vary test tests/
```
After fixing the tests, the packet comes back PASS: no findings. The companion article has the full rule tables for both profiles.
Before and after
| Old habit | With the confidence workflow |
|---|---|
| Ad hoc project structure | vary new provides a canonical starting shape |
| Passing tests hide weak code | --great-code flags structural problems before tests run |
| A green suite can still be shallow | --great-tests exposes weak assertions and missing paths |
| Scrolling through individual diagnostics | --review-packet groups them into a scannable summary |
Where this fits
The verification ladder tells the tool what to run next. The confidence workflow tries to clean up the code and tests before you get there. Fix the shape first, strengthen the tests second, scan the summary third, then let the ladder take over.
Related reading
| Article | Focus |
|---|---|
| The verification ladder for AI coding tools | The escalation order: check, test, mutate, validate |
| Human-readable, AI-written, confidence at scale | The product direction behind these workflow changes |