A common CI shortcut: have Claude generate code and review it in the same session. One call, done. The problem is that same-session review catches about 6% of issues. An independent instance catches 94%.
Same model. Same review criteria. The only difference is whether the reviewer carries the generation context.
## Why Same-Session Review Fails
When Claude generates code in a session, the conversation history retains the full reasoning — why this approach, why not that one, which edge cases were deliberately deferred. When the review phase begins in the same session, all of that reasoning is still in context.
The model is not evaluating a stranger’s code. It is revisiting decisions it already justified. The conclusion is predictable: looks good.
Anyone who writes knows this feeling. Your own text reads perfectly right after you write it. Come back the next day and the problems are obvious. The difference is whether the “I meant it this way” filter has faded. Claude’s context does not fade — the generation reasoning persists for the entire session.
## Prompt Instructions Cannot Override This
The first instinct is to add “review critically” or “be extremely thorough” to the review prompt. It does not work.
Prompt instructions are suggestions. The retained generation context exerts structural influence on every evaluation the model makes. When assessing each piece of code, the model simultaneously holds “I wrote this because…” — that background makes genuine criticism structurally difficult.
Other things that do not work:
- “Ignore your prior reasoning” — The conversation history is not erased by a prompt instruction. The context remains regardless of what the prompt says.
- Extended thinking — Deeper reasoning with the same bias produces more elaborate justifications for the same conclusions.
- Time delays between stages — Claude’s context does not decay over time. A 30-second pause changes nothing.
- Running the review multiple times — Three biased reviews produce three sets of near-zero findings. Repetition does not fix a structural problem.
The issue is not how the review is conducted. It is who conducts it.
## The Fix: Two Independent claude -p Calls
```bash
# Generate
claude -p "Write unit tests for the auth module" > tests.ts

# Review (fresh instance, no generation context)
claude -p "Review tests.ts for correctness and coverage gaps"
```
The second invocation starts with a clean context. It does not know why the tests were written this way, which edge cases were considered and deferred, or what alternatives were rejected. It evaluates the code on its own merits — exactly what a reviewer should do.
## What the Reviewer Should and Should Not Know
Two kinds of context exist, and they must be separated:
- Generation reasoning (“I wrote it this way because…”) — This is the bias source. The reviewer must not have it.
- Review standards (“Good code in this project looks like…”) — This is the judgment basis. The reviewer needs it.
Project review criteria belong in CLAUDE.md, which loads automatically for every claude -p call. The review instance gets the project’s standards without any generation-phase reasoning. It knows what good looks like without knowing why the code was written the way it was.
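As a concrete illustration, a review-criteria section in CLAUDE.md might look like the following. The section name and the specific criteria are hypothetical examples, not a prescribed format:

```markdown
## Code review criteria

- Every exported function has at least one test covering its error path.
- Async code asserts rejection behavior, not just the happy path.
- Mocks are reset between tests; no shared mutable fixtures.
- Coverage gaps are called out explicitly, never silently skipped.
```

Because CLAUDE.md loads for every invocation, both the generator and the reviewer see these standards, but only the generator carries the reasoning behind the code.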
For multi-stage pipelines (generate → run tests → review), test execution results should be fed to the review instance as well. Test results are objective facts, not subjective reasoning — providing them does not introduce bias.
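A minimal sketch of such a pipeline, assuming a TypeScript project with vitest as the test runner. The file names, prompts, and the `run_review_pipeline` wrapper are illustrative, not a fixed convention:

```shell
#!/bin/sh
# Three-stage pipeline: generate, run tests, review with a fresh instance.
run_review_pipeline() {
  # Stage 1: generation. Its reasoning lives and dies in this one call.
  claude -p "Write unit tests for the auth module" > tests.ts

  # Stage 2: collect objective facts. Test output is not reasoning,
  # so passing it along does not reintroduce bias.
  npx vitest run tests.ts > test-results.txt 2>&1 || true

  # Stage 3: independent review. The reviewer sees the code and the
  # test results, but none of the generation-phase justifications.
  claude -p "Review tests.ts for correctness and coverage gaps.
Test runner output:
$(cat test-results.txt)" > review.txt
}
```

The `|| true` on the test stage matters: failing tests are exactly the evidence the reviewer should see, so the pipeline must not abort before the review runs.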
## The Data
50-report comparison:
| Config | Avg findings/report | Human baseline |
|---|---|---|
| Same session | 0.3 | 4.1 |
| Independent instance | 3.7 | 4.1 |
Same criteria, same prompt. Session isolation alone closed the gap from 7% to 90% of human performance.
200-template A/B test:
| Group | Substantive errors caught | Out of 62 total |
|---|---|---|
| Same session | 4 (6%) | 62 |
| Independent instance | 58 (94%) | 62 |
The same-session group’s 12 total findings might look more “precise,” but finding only 4 of 62 real errors is not precision. It is suppression.
## The Cost: 45 Seconds
| Approach | Time | Issue detection |
|---|---|---|
| Same session | 2 min | 6% |
| Independent instance | 2 min 45 sec | 94% |
Both fit inside a 3-minute CI budget. Forty-five seconds for a 15x improvement in review effectiveness. A review that catches 6% of issues provides false confidence — it is worse than no review at all, because the team believes the code was checked.
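One way to wire this into CI is a script that runs the independent review and fails the build when the reviewer reports findings. The convention of asking the reviewer to print `NO_ISSUES` when clean is an assumption of this sketch, not a built-in `claude` feature:

```shell
#!/bin/sh
# Run the independent review as a CI gate.
# Assumes the generated code is already in the working tree.
ci_review_gate() {
  claude -p "Review tests.ts for correctness and coverage gaps.
If there are no issues, reply with exactly: NO_ISSUES" > review.txt

  if grep -q "NO_ISSUES" review.txt; then
    echo "review passed"
    return 0
  fi
  echo "review found issues:"
  cat review.txt
  return 1
}
```

A non-zero exit fails the CI step, so a review that surfaces findings blocks the merge instead of silently passing.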
One-liner: Same-session review is grading your own exam — use an independent instance so the reviewer has no memory of writing the code.