
The real problem with AI-generated code is not syntax. It is variance.

Five different prompts produce five different project layouts, five different testing habits, five different naming conventions. Each version compiles. Each might even pass its own tests. But a human reviewer looking at the result has no common frame to evaluate any of it, and there is far too much of it to read line by line.

Vary is trying to fix that by reducing the variance on purpose. The approach has a few parts: canonical project shapes via `vary new`, opinionated checker rules that are off by default until you opt in, a summary format for the findings, a corpus of weak-vs-strong examples to measure against, and language-surface rules that shrink the number of ways to express the same idea.

## What the flags actually do

These are three optional flags on `vary check`. Without them, `vary check` runs its standard checks: type errors, warnings, and lint rules. Each flag enables an additional set of rules that are off by default because they are opinionated.

### The --great-code rules (5)

Enables the `great-code` profile. These rules only fire on this profile, so they never appear in a normal `vary check` run. They are static checks on the AST, not style nits. The full set:

| Rule | ID | What it flags |
|---|---|---|
| effect-logic-coupling | VCG001 | Functions that mix effectful calls (print, I/O) with decision logic |
| stringly-typed-data | VCG002 | 3+ `Str` parameters on one function, or string-matching where an enum fits |
| broad-return-type | VCG003 | Functions with 10+ lines returning `Str` or `Bool` where a data class would document intent |
| missing-result-type | VCG004 | Functions that can raise but do not return `Result[T, E]` |
| bare-raise-string | VCG005 | `raise "some message"` instead of a typed error class |

These catch structural weaknesses, not formatting. A function with three string parameters and a string return type compiles fine, but it is hard to review and easy to call wrong.
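To make the stringly-typed check concrete, here is a minimal Python sketch of the shape of a VCG002-style rule over a toy signature model. The `FunctionSig` type and the threshold constant are illustrative assumptions, not Vary's internal AST representation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FunctionSig:
    """Toy stand-in for a parsed function signature in the AST."""
    name: str
    param_types: List[str]

STR_PARAM_LIMIT = 3  # threshold from the rule description above

def check_stringly_typed(sig: FunctionSig) -> List[str]:
    """VCG002-style check: flag functions with 3+ Str parameters."""
    str_count = sum(1 for t in sig.param_types if t == "Str")
    if str_count >= STR_PARAM_LIMIT:
        return [
            f"VCG002: '{sig.name}' takes {str_count} Str parameters; "
            "a data class or enum would document intent"
        ]
    return []
```

The point of checks like this is that they need no type inference and no execution: a quick count over the signature is enough to surface a call site that is easy to get wrong.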

### The --great-tests rules (12)

Enables the `great-tests` profile. Same idea: these rules are off by default and only fire when you ask for them. They look at the structure of `test` blocks in the AST.

| Rule | ID | What it flags |
|---|---|---|
| single-observe | VCT001 | Test has only one `observe` statement |
| no-negative-test | VCT002 | No test in the file exercises a negative (error) path |
| duplicate-test-name | VCT003 | Two tests with the same name |
| empty-test | VCT004 | Test block with no assertions |
| test-too-long | VCT005 | Test block exceeding the line limit |
| assert-literal | VCT006 | `observe True` or `observe 1 == 1` (always passes) |
| weak-property | VCT007 | Properties in `across` blocks too weak to constrain the function under test |
| success-path-only | VCT008 | All tests only check the happy path, no failure or edge cases |
| shallow-observation | VCT009 | Test checks truthiness or one field of a multi-field data class |
| vacuous-property-test | VCT010 | Test calls a function but never checks the output |
| duplicate-test-logic | VCT011 | 3+ tests with identical structure differing only in literals |
| no-boundary-test | VCT012 | No tests with boundary values for numeric or collection parameters |

These catch what Vary treats as confidence theatre: a green suite that looks like coverage but would not catch a real bug.
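Several of these rules reduce to simple structural counts over parsed `test` blocks. A Python sketch of three of them, using a toy `TestBlock` model that is an assumption, not Vary's parser output:

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class TestBlock:
    """Toy stand-in for a parsed `test` block."""
    name: str
    observe_count: int

def check_test_blocks(tests: List[TestBlock]) -> List[str]:
    """Sketch of three rules: VCT001, VCT003, VCT004."""
    findings = []
    for t in tests:
        if t.observe_count == 0:
            findings.append(f"VCT004: test '{t.name}' has no assertions")
        elif t.observe_count == 1:
            findings.append(f"VCT001: test '{t.name}' has only one observe")
    for name, n in Counter(t.name for t in tests).items():
        if n > 1:
            findings.append(f"VCT003: {n} tests share the name '{name}'")
    return findings
```

Rules like VCT011 and VCT012 need deeper structural comparison, but the family is the same: static facts about the test suite's shape, not its runtime behaviour.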

### The --review-packet flag

This is a rendering flag, not an analysis pass. It takes the same diagnostics that `vary check` already produced and groups them into a compact summary: one verdict line, counts by category, the top finding in each category, and suggested next steps.

It does not run additional analysis. It does not know about test results, mutation scores, or validation status. It only sees checker diagnostics. What it gives you is a scannable 10-line summary instead of scrolling through individual findings.
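The grouping step is easy to picture. A minimal Python sketch of the same idea, assuming each diagnostic carries a category, an ID, and a message (the field names are hypothetical, not Vary's internal schema):

```python
from collections import defaultdict
from typing import Dict, List

def summarize(diagnostics: List[Dict[str, str]]) -> str:
    """Group diagnostics by category: counts plus the top finding in each."""
    by_cat: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for d in diagnostics:
        by_cat[d["category"]].append(d)
    lines = [f"Total:   {len(diagnostics)} finding(s)", "By category:"]
    for cat, items in by_cat.items():
        top = items[0]
        lines.append(f"  {cat}: {len(items)}")
        lines.append(f"    -> [{top['id']}] {top['message']}")
    return "\n".join(lines)
```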

Here is what it looks like on a weak codebase (the `ai-confidence-demo/weak` example):

```
$ vary check src/ --great-code --review-packet

== Review Packet ==

Verdict: CLEAN (minor): 7 suggestion(s)
Total:   7 finding(s)

By category:
  safety: 3 (3 suggestion)
    → [VCS001] print() in 'rate_for_severity': behaviour hidden in side effect
  contracts: 3 (3 suggestion)
    → [VCC001] function 'format_plan' has no contracts
  mutation: 1 (1 suggestion)
    → [VCM001] function 'join_metadata' has no side effects: consider 'pure def'
```

And the tests:

```
$ vary check tests/ --great-tests --review-packet

== Review Packet ==

Verdict: REVIEW: 1 warning(s) to evaluate
Total:   8 finding(s)

By category:
  testing: 7 (7 suggestion)
    → [VCT002] module has 2 test(s) but none exercise negative paths
  dead_code: 1 (1 warning)
    → [VCD002] unused import 'Job' from 'model'

Next steps:
  - Review 1 warning(s)
  - 7 test quality finding(s), run `vary test tests/`
```

After fixing the code and tests (the `strong` variant of the same project), both come back clean:

```
== Review Packet ==

Verdict: PASS: no findings
Total:   0 finding(s)
```

The verdict has four levels: FAIL (errors that block compilation), REVIEW (warnings to evaluate), CLEAN (only suggestions), PASS (nothing found). The JSON output (`--json --review-packet`) is useful for CI integration.
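One way to wire that into CI is to map the verdict onto an exit code. A hedged Python sketch, assuming the JSON carries a top-level `"verdict"` string using the four levels above (verify against real `--json --review-packet` output before relying on this):

```python
import json

def gate(packet_json: str) -> int:
    """Map a review-packet verdict to a CI exit code.

    Assumes a top-level "verdict" field; the schema is an assumption.
    """
    verdict = json.loads(packet_json).get("verdict", "FAIL")
    if verdict.startswith("FAIL"):
        return 2  # errors block the build
    if verdict.startswith("REVIEW"):
        return 1  # warnings need a human look
    return 0      # CLEAN or PASS
```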

### What these flags do not do

They do not run tests, measure mutation scores, or verify behaviour. They are static checks on the AST. `--review-packet` does not synthesize trust across the whole toolchain. It groups and counts checker findings.

The [verification ladder](/articles/the-verification-ladder-for-ai-coding-tools/) covers the runtime side: `vary test` for behaviour, `vary mutate` for test strength, `vary validate` for the final gate.

## Beyond the checker

The flags above are the checker side. Two other pieces support them.

### Corpus baseline

We are building a set of "weak" and "strong" variants of the same program (see `examples/ai-confidence-demo/`) so we can measure whether the toolchain actually moves generated code in the right direction. Without concrete before/after examples, improvement claims are anecdotal.

### Language-surface convergence

Fewer overlapping ways to express the same idea means code from different sources reads like the same language. The first consolidation rules prefer `Result[T, E]` over bare raises, and flag patterns where an enum would be cleaner than string matching. These overlap with the `--great-code` rules above, which is the point: the checker and the language surface push in the same direction.
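The pattern these rules target translates directly to other languages. A Python analogue of the weak-vs-strong shapes, borrowing the `rate_for_severity` name from the weak example's output; the bodies are illustrative, not corpus code:

```python
from enum import Enum

# Weak shape: stringly-typed input, silent fallthrough on typos.
def rate_for_severity_weak(severity: str) -> float:
    if severity == "low":
        return 0.1
    if severity == "high":
        return 0.9
    return 0.5  # intended for "medium", but also catches any misspelling

# Strong shape: the enum makes the valid input space explicit,
# so a typo fails at construction instead of returning a wrong rate.
class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

def rate_for_severity(severity: Severity) -> float:
    return {Severity.LOW: 0.1, Severity.MEDIUM: 0.5, Severity.HIGH: 0.9}[severity]
```

Both shapes compile; only the second one makes the invalid inputs unrepresentable, which is what the string-matching rule is nudging toward.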

## Why bother

If AI-assisted coding keeps scaling, there will be more code than anyone can inspect manually. These flags do not solve that problem. They give the checker a way to flag the structural weaknesses and test-quality gaps that make generated code harder to trust, and they give you a summary you can scan without reading every diagnostic. The runtime side (testing, mutation, validation) is covered by the [verification ladder](/articles/the-verification-ladder-for-ai-coding-tools/).

## Related reading

| Article | Focus |
|---|---|
| [The verification ladder for AI coding tools](/articles/the-verification-ladder-for-ai-coding-tools/) | The escalation policy for what an AI tool should run next |
| [From generated code to confidence at scale](/articles/from-generated-code-to-confidence-at-scale/) | The confidence workflow built on top of that ladder |
