Same-session self-review is structurally ineffective. The model retains its generation reasoning in the conversation context and is reluctant to question its own decisions; no prompt instruction overcomes this.
## The Data
| Approach | Findings per file |
|---|---|
| Same session + “review strictly” | 0.2-0.5 |
| Independent instance | 3.7 |
| Human baseline | 4.1 |
The gap is 7-18x. An independent instance reaches 90% of human performance. Same-session review reaches 5-12%.
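The stated ratios follow directly from the table; a quick sketch to check the arithmetic (figures taken from the table above):

```python
# Findings per file, from the table above
same_session = (0.2, 0.5)  # range across same-session runs
independent = 3.7
human = 4.1

gap_low = independent / same_session[1]   # best same-session case
gap_high = independent / same_session[0]  # worst same-session case
print(f"gap: {gap_low:.1f}x to {gap_high:.1f}x")  # gap: 7.4x to 18.5x
print(f"independent vs human: {independent / human:.0%}")  # 90%
print(f"same-session vs human: {same_session[0] / human:.0%} "
      f"to {same_session[1] / human:.0%}")  # 5% to 12%
```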
## Why Prompt Instructions Fail
“Review critically,” extended thinking, persona prompts, and minimum finding requirements were all tested. Maximum improvement: 0.2 → 0.5 findings. The generation reasoning is in the conversation context. Every assessment passes through “I wrote this because…” — a filter that suppresses genuine criticism.
Forcing minimum finding counts produces fabricated issues. The model invents problems to meet the quota rather than finding real ones.
## The Architectural Fix
Two separate API calls with isolated contexts:
```shell
# Call 1 (generate)
claude -p "Write unit tests for auth module" > tests.ts

# Call 2 (review) -- fresh process, fresh context
claude -p "Review tests.ts for correctness and coverage gaps"
```
The reviewer receives only the generated output and review criteria. No generation prompt, no conversation history, no design goals. It evaluates the code on its own merits.
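The isolation boundary can be made explicit in code. A minimal sketch: `run_model` is a hypothetical hook standing in for one stateless model call (e.g. a wrapper around `claude -p`), injected so that nothing outside the function's two arguments can reach the reviewer:

```python
from typing import Callable

def isolated_review(artifact: str, criteria: str,
                    run_model: Callable[[str], str]) -> str:
    # The review prompt contains only the artifact and the criteria.
    # Deliberately absent: the generation prompt, the conversation
    # history, and any rationale for why the code was written this way.
    prompt = (
        "Review the following code against these criteria.\n"
        f"Criteria:\n{criteria}\n\n"
        f"--- ARTIFACT ---\n{artifact}"
    )
    return run_model(prompt)
```

Because the prompt is built fresh from only these two inputs, generation context cannot leak into the review unless someone passes it in explicitly.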
## The ROI
Independent review adds API time (~45 seconds on a 3-minute CI budget). In one production deployment, it prevented $1,000/month in bug costs for $180/month in additional API spend, a net saving of $820/month.
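The monthly arithmetic, spelled out with the figures from the deployment above:

```python
bugs_prevented = 1000  # $/month in bug costs avoided, from the deployment above
extra_api_cost = 180   # $/month for the independent review calls
net_saving = bugs_prevented - extra_api_cost
print(net_saving)  # 820
```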
## What the Reviewer Should Receive
- Generated code/output (the artifact being reviewed)
- Project review criteria (from CLAUDE.md)
- Test execution results (objective facts, not subjective reasoning)
- Original requirements as a separate document
What it must NOT receive: the generation prompt, conversation history, or any reasoning about why the code was written the way it was.
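The allow-list and deny-list above can be enforced mechanically. A sketch, with hypothetical field names chosen to mirror the bullets:

```python
# Hypothetical field names mirroring the lists above
ALLOWED = {"artifact", "review_criteria", "test_results", "requirements"}
FORBIDDEN = {"generation_prompt", "conversation_history", "design_rationale"}

def build_reviewer_context(**inputs: str) -> dict:
    """Assemble reviewer inputs, rejecting any generation context."""
    leaked = FORBIDDEN & inputs.keys()
    if leaked:
        raise ValueError(f"generation context leaked into review: {sorted(leaked)}")
    unexpected = inputs.keys() - ALLOWED
    if unexpected:
        raise ValueError(f"unexpected reviewer inputs: {sorted(unexpected)}")
    return dict(inputs)
```

Failing loudly on a leak is the point: a reviewer that silently receives the generation prompt regresses to same-session numbers without anyone noticing.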
One-liner: Self-review bias is structural, not behavioral. The only fix is a separate API call where the reviewer has zero generation context: 3.7 findings per file versus 0.2-0.5 in the same session.