Human-readable, AI-written, confidence at scale

The real problem with AI-generated code is not syntax. It is variance.

Five different prompts produce five different project layouts, five different testing habits, five different naming conventions. Each version compiles. Each might even pass its own tests. But a human reviewer looking at the result has no common frame to evaluate any of it, and there is far too much of it to read line by line.

Vary is trying to fix that by reducing the variance on purpose. The approach has a few parts: canonical project shapes via vary new, opinionated checker rules that are off by default until you opt in, a summary format for the findings, a corpus of weak-vs-strong examples to measure against, and language-surface rules that shrink the number of ways to express the same idea.

What the flags actually do

These are three optional flags on vary check. Without them, vary check runs its standard pass: type errors, warnings, and lint rules. Each flag enables an additional set of rules, off by default because they are opinionated.

The --great-code rules (5)

Enables the great-code profile. These rules fire only under this profile, so they never appear in a plain vary check run. They are static checks on the AST, not style nits. The full set:

Rule                    ID      What it flags
effect-logic-coupling   VCG001  Functions that mix effectful calls (print, I/O) with decision logic
stringly-typed-data     VCG002  3+ Str parameters on one function, or string-matching where an enum fits
broad-return-type       VCG003  Functions with 10+ lines returning Str or Bool where a data class would document intent
missing-result-type     VCG004  Functions that can raise but do not return Result[T, E]
bare-raise-string       VCG005  raise "some message" instead of a typed error class

These catch structural weaknesses, not formatting. A function with three string parameters and a string return type compiles fine, but it is hard to review and easy to call wrong.
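
To make that concrete, here is an illustrative sketch of a function that trips four of these rules at once. The names are hypothetical, and the Python-style signatures and # comments are assumptions extrapolated from the snippets quoted elsewhere in this post:

def rate_for_plan(plan: Str, tier: Str, region: Str) -> Str:
    if tier == "premium":          # VCG002: string-matching where an enum fits (and 3 Str parameters)
        print("premium plan")      # VCG001: I/O mixed into decision logic
        return "0.9"
    raise "unknown tier"           # VCG005: bare raise string; VCG004: raises without Result[T, E]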

The --great-tests rules (12)

Enables the great-tests profile. Same idea: these rules are off by default and only fire when you ask for them. They look at the structure of test blocks in the AST.

Rule                    ID      What it flags
single-observe          VCT001  Test has only one observe statement
no-negative-test        VCT002  No test in the file exercises a negative (error) path
duplicate-test-name     VCT003  Two tests with the same name
empty-test              VCT004  Test block with no assertions
test-too-long           VCT005  Test block exceeding the line limit
assert-literal          VCT006  observe True or observe 1 == 1 (always passes)
weak-property           VCT007  Weak properties in across blocks
success-path-only       VCT008  All tests check only the happy path, with no failure or edge cases
shallow-observation     VCT009  Test checks truthiness or one field of a multi-field data class
vacuous-property-test   VCT010  Test calls a function but never checks the output
duplicate-test-logic    VCT011  3+ tests with identical structure differing only in literals
no-boundary-test        VCT012  No tests with boundary values for numeric or collection parameters

These catch what Vary treats as confidence theatre: a green suite that looks like coverage but would not catch a real bug.
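
To show the shape of a weak test, here is a hypothetical example against the rate_for_severity function that appears in the packet output below. The test-block syntax is a guess; observe is the assertion form quoted in the table above:

test "rate works":
    rate_for_severity("critical")
    observe True            # VCT006: literal assertion, always passes
                            # also VCT001 (single observe) and VCT010 (output never checked)

A stronger version observes the actual value and exercises a negative path:

test "critical maps to the top rate":
    observe rate_for_severity("critical") == 0.9
    observe rate_for_severity("bogus") is Err    # negative path; the Err-matching syntax is a guess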

The --review-packet flag

This is a rendering flag, not an analysis pass. It takes the same diagnostics that vary check already produced and groups them into a compact summary: one verdict line, counts by category, the top finding in each category, and suggested next steps.

It does not run additional analysis. It does not know about test results, mutation scores, or validation status. It sees only checker diagnostics. What you get is a scannable ten-line summary instead of a scroll of individual findings.

Here is what it looks like on a weak codebase (the ai-confidence-demo/weak example):

$ vary check src/ --great-code --review-packet

== Review Packet ==

Verdict: CLEAN (minor): 7 suggestion(s)
Total:   7 finding(s)

By category:
  safety: 3 (3 suggestion)
    → [VCS001] print() in 'rate_for_severity': behaviour hidden in side effect
  contracts: 3 (3 suggestion)
    → [VCC001] function 'format_plan' has no contracts
  mutation: 1 (1 suggestion)
    → [VCM001] function 'join_metadata' has no side effects: consider 'pure def'

And the tests:

$ vary check tests/ --great-tests --review-packet

== Review Packet ==

Verdict: REVIEW: 1 warning(s) to evaluate
Total:   8 finding(s)

By category:
  testing: 7 (7 suggestion)
    → [VCT002] module has 2 test(s) but none exercise negative paths
  dead_code: 1 (1 warning)
    → [VCD002] unused import 'Job' from 'model'

Next steps:
  - Review 1 warning(s)
  - 7 test quality finding(s), run `vary test tests/`

After fixing the code and tests (the strong variant of the same project), both come back clean:

== Review Packet ==

Verdict: PASS: no findings
Total:   0 finding(s)

The verdict has four levels: FAIL (errors that block compilation), REVIEW (warnings to evaluate), CLEAN (only suggestions), PASS (nothing found). The JSON output (--json --review-packet) is useful for CI integration.
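
The JSON schema is not shown in this post, but assuming the packet exposes its verdict as a top-level field, a CI gate could be a few lines of shell. The field name and the blocking policy below are assumptions, not documented behaviour:

vary check src/ --great-code --review-packet --json > packet.json
verdict=$(jq -r '.verdict' packet.json)    # ".verdict" is an assumed field name
case "$verdict" in
  FAIL|REVIEW) echo "blocking verdict: $verdict" >&2; exit 1 ;;
esac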

What these flags do not do

They do not run tests, measure mutation scores, or verify behaviour. They are static checks on the AST. --review-packet does not synthesize trust across the whole toolchain. It groups and counts checker findings.

The verification ladder covers the runtime side: vary test for behaviour, vary mutate for test strength, vary validate for the final gate.
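
In command form the ladder reads something like this. Only vary test tests/ appears verbatim in the packet output above; the argument shapes for the other two steps are assumptions:

$ vary test tests/     # behaviour: do the tests pass?
$ vary mutate src/     # test strength: does the suite catch injected bugs?
$ vary validate        # final gate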

Beyond the checker

The flags above are the checker side. Two other pieces support them.

Corpus baseline

We are building a set of "weak" and "strong" variants of the same program (see examples/ai-confidence-demo/) so we can measure whether the toolchain actually moves generated code in the right direction. Without concrete before/after examples, improvement claims are anecdotal.

Language-surface convergence

Fewer overlapping ways to express the same idea means code from different sources reads like the same language. The first consolidation rules prefer Result[T, E] over bare raises, and flag patterns where an enum would be cleaner than string matching. These overlap with the --great-code rules above, which is the point: the checker and the language surface push in the same direction.
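
As a sketch of the direction, with hypothetical names: the enum declaration and the elided Result construction below are guesses, while Result[T, E] and raise "..." are the forms quoted in the rules above.

# Divergent: failure as a bare string, categories as string matching
def parse_mode(raw: Str) -> Str:
    if raw == "fast" or raw == "slow":
        return raw
    raise "bad mode"

# Convergent: one canonical shape for both ideas
enum Mode: Fast, Slow    # enum syntax is a guess

def parse_mode(raw: Str) -> Result[Mode, ParseError]:
    ...                  # Ok/Err constructor names are not shown in this post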

Why bother

If AI-assisted coding keeps scaling, there will be more code than anyone can inspect manually. These flags do not solve that problem. They give the checker a vocabulary for the structural weaknesses and test-quality gaps that make generated code harder to trust, and they give you a summary you can scan without reading every diagnostic. The runtime side (testing, mutation, validation) is covered by the verification ladder.

Related reading

Article                                      Focus
The verification ladder for AI coding tools  The escalation policy for what an AI tool should run next
From generated code to confidence at scale   The confidence workflow built on top of that ladder