K4.2.2 Task 4.2

Three Diverse Examples Beat Eight Homogeneous Ones

Few-shot design is about maximizing judgment diversity within a 2-4 example budget, not maximizing example count. A set of 3 examples covering different judgment types (flag, ambiguous, acceptable) outperforms 3 examples of the same type at the same token cost.

The Data: Diversity vs Quantity

Example set                               Accuracy   False positives
3 positive-only (all bugs)                78%        35%
1 positive + 1 ambiguous + 1 “no flag”    91%        8%

Same token cost. The diverse set teaches three skills; the homogeneous set teaches one skill three times.

A balanced comparison across 300 reviews confirmed the pattern:

Config                                    Accuracy   False positives   Missed bugs
A: 3 straightforward bugs                 82%        31%
B: 1 bug + 1 ambiguous + 1 acceptable     93%        6%
C: 1 bug + 2 acceptable                   85%        4%                18%

Config B is optimal — balanced coverage produces the best overall accuracy. Config C over-indexes on restraint, cutting false positives but missing 18% of real bugs.

The Three Judgment Dimensions

Every few-shot set should cover:

  1. Clear positive case — An unambiguous finding. Establishes the output format baseline. Always the first example.
  2. Boundary/ambiguous case — A pattern that could go either way. Demonstrates the reasoning for the judgment. This is where production inconsistency actually occurs.
  3. Negative case — A pattern that looks suspicious but should NOT be flagged. Teaches restraint. Without this, Claude flags everything that matches even loosely.

The first example sets the format. The second and third teach judgment.
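The three dimensions can be sketched as a small data structure. This is a hypothetical illustration, not an excerpt from any real harness: the code snippets, the REPORT/SKIP labels, and the render_examples helper are all invented for the sketch.

```python
# Hypothetical sketch: one few-shot example per judgment dimension.
EXAMPLES = [
    {   # 1. Clear positive: unambiguous finding, anchors the output format.
        "code": "cursor.execute(f\"SELECT * FROM users WHERE id = {user_id}\")",
        "verdict": "REPORT",
        "reasoning": "String interpolation into SQL enables injection; "
                     "use parameterized queries.",
    },
    {   # 2. Boundary/ambiguous: the written reasoning IS the lesson.
        "code": "except Exception: pass  # inside best-effort cache warmup",
        "verdict": "REPORT",
        "reasoning": "A broad except is sometimes fine, but silently swallowing "
                     "all errors hides misconfiguration; flag with low severity.",
    },
    {   # 3. Negative: looks suspicious but should NOT be flagged.
        "code": "time.sleep(0.1)  # documented backoff between retry attempts",
        "verdict": "SKIP",
        "reasoning": "Sleep in a retry loop is deliberate backoff, not a bug.",
    },
]

def render_examples(examples):
    """Render the few-shot block in one consistent code/verdict/reasoning shape."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Code:\n{ex['code']}\n"
            f"Verdict: {ex['verdict']}\n"
            f"Reasoning: {ex['reasoning']}"
        )
    return "\n\n".join(blocks)
```

Note the ordering: the clear positive comes first so the format is established before the judgment-heavy examples appear.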

Build Order

Start with the clear case. Add boundary or negative depending on which failure mode matters more:

  • High false positive rate? Add the negative case second (“this looks like a bug but is acceptable because…”)
  • Missing real bugs? Add the boundary case second (“this subtle pattern IS a bug because…”)
  • Both? Three examples: clear → boundary → negative
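The decision rule above can be written down directly. A minimal sketch; the function name and the threshold values are illustrative assumptions, not figures from the source:

```python
def choose_example_order(false_positive_rate, missed_bug_rate,
                         fp_threshold=0.15, miss_threshold=0.10):
    """Pick which examples follow the clear case, based on which
    failure mode dominates in evaluation. Thresholds are illustrative."""
    order = ["clear"]  # the clear positive case always comes first
    too_many_fps = false_positive_rate > fp_threshold
    missing_bugs = missed_bug_rate > miss_threshold
    if too_many_fps and missing_bugs:
        order += ["boundary", "negative"]   # both: clear -> boundary -> negative
    elif too_many_fps:
        order.append("negative")            # teach restraint
    elif missing_bugs:
        order.append("boundary")            # teach subtle-bug recognition
    return order
```

Usage: feed it the rates from a labeled evaluation set, then author only the example types it returns.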

Token-Constrained Design

With a 2-example budget, the optimal allocation is:

  1. One bug-finding example (establishes format + what to flag)
  2. One “acceptable/no finding” example (teaches restraint)

This pair covers more judgment surface than two bug examples. Two bug examples reinforce “flag things” — the model already does that by default. The negative example teaches the harder skill: when NOT to flag.
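The budget-to-dimension mapping described in this section can be sketched as a lookup. The priority order (format first, restraint second, boundary judgment third) comes from the text; the helper itself is hypothetical:

```python
def allocate_examples(budget):
    """Map a token-limited example budget onto judgment dimensions.
    2 slots -> format + restraint; 3 slots -> full coverage; slots
    beyond the core dimensions add little (simplification of the text)."""
    priority = ["clear", "negative", "boundary"]  # marginal value order
    chosen = set(priority[:max(0, min(budget, 3))])
    # Present in the reading order the section recommends.
    return [dim for dim in ("clear", "boundary", "negative") if dim in chosen]
```

With a budget of 2 this yields the bug-finding plus "acceptable" pair; with 3 it yields the full clear/boundary/negative set.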

All-Positive Examples: The 35% False Positive Problem

Examples that only show bugs teach Claude to find bugs. Claude learns: “patterns like these → flag.” Without a counterexample showing “patterns like these → acceptable,” Claude flags everything that resembles a bug even slightly.

Adding 2 SKIP examples to an all-REPORT set reduced false positives from 38% to 9% with only a 1% increase in missed bugs. The restraint dimension is the highest-leverage addition.

Beyond 4: Diminishing Returns

Returns diminish sharply beyond 4 examples. The relationship is not monotonically positive — after the core judgment dimensions are covered, additional examples add cost without proportional quality improvement.

For multi-language support (Python, TypeScript, Go) with a 4-example budget: one example per language showing the same output structure + one ambiguous cross-language pattern. This teaches both format consistency and judgment in a language-agnostic way.
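One way to write down that 4-example allocation so it can be checked automatically. The languages match the text; the role strings and the validate_budget helper are illustrative assumptions:

```python
# Hypothetical 4-example budget for a multi-language reviewer:
# one format-anchoring example per language, plus one ambiguous case.
BUDGET = [
    {"language": "python",     "role": "clear positive (anchors output format)"},
    {"language": "typescript", "role": "clear positive (same format, new syntax)"},
    {"language": "go",         "role": "clear positive (same format, new syntax)"},
    {"language": "any",        "role": "ambiguous cross-language pattern"},
]

def validate_budget(budget, max_examples=4):
    """Check the allocation stays within budget and still covers judgment."""
    assert len(budget) <= max_examples, "over the example budget"
    assert any("ambiguous" in ex["role"] for ex in budget), "no judgment example"
    return True
```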

Each Example Should Teach a Different Skill

Identical-purpose examples waste the budget. A second format example teaches the same lesson; that slot should teach boundary judgment or null handling instead.

Do not abbreviate reasoning to fit more examples. The reasoning in ambiguous cases IS the lesson — without it, Claude learns to produce incomplete output, not better judgment.


One-liner: Maximize judgment diversity in 2-4 examples — one clear finding, one boundary case, one “don’t flag” — because three diverse examples at 91% accuracy beat three homogeneous ones at 78%, and adding more of the same kind doesn’t close the gap.