S4.2.1 Task 4.2

Paired REPORT + SKIP Examples: 67% → 94% Boundary Accuracy

The practical application of few-shot principles: paired examples showing the same pattern with different judgments. A text rule alone achieves 67% boundary accuracy. A single REPORT example pushes it to 72%. Adding a contrasting SKIP example for the same pattern type reaches 94%. The contrast is the decisive element.

The REPORT/SKIP Technique

For a code review finding like “empty catch block”:

REPORT example:

Pattern: empty catch in authentication handler
Judgment: REPORT — security-critical path, swallowed exceptions hide auth failures
Severity: HIGH

SKIP example:

Pattern: empty catch in graceful shutdown cleanup
Judgment: SKIP — intentional silence during shutdown, no user impact
Reasoning: cleanup errors during shutdown are expected and harmless
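The pair above can be kept as structured data and rendered into the prompt, so the full reasoning is never dropped by accident. This is a minimal sketch; the function and variable names are illustrative, not part of any official API, and the example texts mirror the pair shown above.

```python
# Sketch: assembling paired REPORT/SKIP examples into a review prompt.
# Names here (PAIRED_EXAMPLES, render_examples) are hypothetical.

PAIRED_EXAMPLES = [
    {
        "pattern": "empty catch in authentication handler",
        "judgment": "REPORT",
        "reasoning": "security-critical path, swallowed exceptions hide auth failures",
        "severity": "HIGH",
    },
    {
        "pattern": "empty catch in graceful shutdown cleanup",
        "judgment": "SKIP",
        "reasoning": "intentional silence during shutdown, no user impact",
        "severity": None,
    },
]

def render_examples(examples):
    """Render examples as prompt text, keeping the full reasoning intact."""
    lines = []
    for ex in examples:
        lines.append(f"Pattern: {ex['pattern']}")
        lines.append(f"Judgment: {ex['judgment']} — {ex['reasoning']}")
        if ex["severity"]:
            lines.append(f"Severity: {ex['severity']}")
        lines.append("")
    return "\n".join(lines)

prompt_block = render_examples(PAIRED_EXAMPLES)
```

Keeping examples as data also makes it easy to verify that every REPORT has a contrasting SKIP partner before the prompt ships.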

The pair teaches Claude that the same code pattern (empty catch) requires different judgments based on context. Without the SKIP example, Claude flags every empty catch regardless of context — producing the 35-45% false positive rate that all-positive example sets create.

Impact on False Positives

Adding 2 SKIP examples to an existing REPORT-only set:

Metric              Before     After
False positives     38%        9%
Missed real bugs    baseline   +1%

A 76% reduction in false positives for only a one-percentage-point increase in missed bugs. The SKIP examples teach restraint without teaching blindness.
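The headline reduction follows directly from the table values, assuming the percentages are absolute rates:

```python
# Quick check of the table arithmetic (rates taken from the table above).
before_fp = 0.38   # false positive rate before adding SKIP examples
after_fp = 0.09    # false positive rate after
reduction = (before_fp - after_fp) / before_fp
print(f"{reduction:.0%} reduction")  # prints "76% reduction"
```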

When to Use Paired Examples

Target paired REPORT/SKIP examples at the categories with the highest false positive rates first. A category at 40% false positives benefits enormously from a SKIP example. A category at 5% false positives may not need one.

The priority ordering:

  1. Identify the highest-FP category
  2. Create a SKIP example for the most common false positive pattern in that category
  3. Pair it with the existing REPORT example (or create both)
  4. Measure the impact before moving to the next category
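Step 1 of the ordering above can be sketched as a simple ranking over per-category false positive rates. The category names, rates, and the 10% threshold below are all hypothetical, chosen only to illustrate the "may not need one" cutoff from the preceding paragraph.

```python
# Sketch of the prioritization step: rank finding categories by false
# positive rate and pick the worst one to pair first.

categories = {
    "empty-catch": 0.40,    # hypothetical per-category FP rates
    "magic-number": 0.05,
    "todo-comment": 0.22,
}

def next_category_to_pair(fp_rates, threshold=0.10):
    """Return the highest-FP category still above the threshold, or None."""
    worst = max(fp_rates, key=fp_rates.get)
    return worst if fp_rates[worst] > threshold else None

print(next_category_to_pair(categories))  # prints "empty-catch"
```

Re-running the ranking after each measurement (step 4) naturally surfaces the next category to address.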

Reasoning Is Essential

The reasoning in each example is not optional decoration. It is the teaching mechanism.

“SKIP — intentional silence during shutdown” tells Claude WHY this pattern is acceptable. Without the reasoning, Claude sees two contradictory examples (same pattern, different judgments) and cannot generalize to new contexts. With the reasoning, Claude learns the decision logic: “empty catch in security context → flag; empty catch in cleanup context → skip.”

Do not abbreviate reasoning to fit more examples. Two examples with full reasoning outperform four examples with truncated reasoning.

Combined with CLAUDE.md and CI

In CI pipelines, paired examples belong in CLAUDE.md alongside review criteria. They load automatically for every claude -p review run. The examples calibrate judgment; the text criteria provide the framework; --json-schema guarantees the output structure. Three mechanisms, each handling a different dimension of review quality.
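The three mechanisms could be wired together in a CI step along these lines. The `claude -p` and `--json-schema` flags come from the surrounding text; the schema contents and prompt string are illustrative assumptions, and the actual invocation is left commented out since it requires the CLI and credentials.

```python
# Sketch: a CI review invocation combining the three mechanisms.
# CLAUDE.md (criteria + paired examples) loads automatically;
# the schema here is a hypothetical minimal shape for review output.
import json
import subprocess

schema = {
    "type": "object",
    "properties": {"findings": {"type": "array"}},
    "required": ["findings"],
}

cmd = [
    "claude", "-p", "Review this diff against the criteria in CLAUDE.md",
    "--json-schema", json.dumps(schema),
]
# result = subprocess.run(cmd, capture_output=True, text=True)  # run in CI
```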

For escalation design in agentic systems, the same principle applies: include both “should escalate” AND “should handle autonomously” examples. Few-shot examples should reinforce escalation rules (like “customer requests human → immediate escalation”), never teach overriding them.


One-liner: Pair every REPORT example with a contrasting SKIP example for the same pattern — the contrast teaches boundary judgment that single-type examples cannot, cutting false positives from 38% to 9%.