When detailed text instructions produce inconsistent results — varying formats, hallucinated fields, wrong boundary judgments — the escalation is not more text, not stronger emphasis, not a model change. It is few-shot examples.
Showing is fundamentally more precise than telling.
The Data
| Approach | Format compliance | Hallucination rate |
|---|---|---|
| Text instructions only | 64% | 28% |
| Text + 2 examples | 91% | 8% |
| Text + 4 examples | 93% | 6% |
The jump from 0 to 2 examples is dramatic. From 2 to 4 is modest. Beyond 4, diminishing returns. The optimal range is 2-4 well-chosen examples.
A single null-handling example — showing “field not found → null” instead of fabrication — reduced hallucination from 31% to 4%. One targeted example, 87% reduction.
Three Values of Few-Shot
Format consistency. Text saying “output as JSON with these fields” produces 64% compliance. An example showing the exact JSON structure produces 91%. Claude pattern-matches from examples more reliably than it translates text specifications into structure.
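A minimal sketch of how a format-teaching example might be embedded in a prompt. The task, field names, and `<example>` tag convention here are hypothetical illustrations, not a prescribed API:

```python
# Sketch: embed one exact-format example in the prompt so the model
# pattern-matches the JSON structure instead of interpreting a text spec.
# The task and field names are hypothetical.

EXAMPLE_INPUT = "Invoice #1042 from Acme Corp, due 2024-03-01, total $1,250.00"
EXAMPLE_OUTPUT = """{
  "invoice_id": "1042",
  "vendor": "Acme Corp",
  "due_date": "2024-03-01",
  "total_usd": 1250.00
}"""

def build_prompt(document: str) -> str:
    """Assemble instructions, one few-shot pair, then the real input."""
    return (
        "Extract invoice fields as JSON with keys "
        "invoice_id, vendor, due_date, total_usd.\n\n"
        f"<example>\nInput: {EXAMPLE_INPUT}\nOutput:\n{EXAMPLE_OUTPUT}\n</example>\n\n"
        f"Input: {document}\nOutput:"
    )

print(build_prompt("Invoice #2077 from Globex, due 2024-06-15, total $980.00"))
```

The example carries the whole format contract: key names, nesting, and value types are shown once, verbatim, rather than described.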
Ambiguous judgment calibration. Is an empty catch block a bug or intentional recovery? Text cannot define every boundary case. An example showing “empty catch in auth context → flag as HIGH” and “empty catch in recovery context → skip” teaches the judgment pattern. Text-only escalation accuracy: 35%. After 4 examples (2 escalate + 2 handle): 89%.
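The 2-escalate + 2-handle pattern can be sketched as contrastive pairs rendered into the prompt. The contexts and labels below are hypothetical stand-ins for a real code-review rubric:

```python
# Sketch: contrastive few-shot pairs that teach a boundary judgment
# (flag vs. skip), not a rigid template. Contexts and labels are
# hypothetical.

JUDGMENT_EXAMPLES = [
    ("empty catch block in auth middleware",      "flag: HIGH"),
    ("empty catch block in token refresh path",   "flag: HIGH"),
    ("empty catch block in retry/backoff helper", "skip"),
    ("empty catch block in optional telemetry",   "skip"),
]

def render_examples(pairs: list[tuple[str, str]]) -> str:
    # Two escalate + two handle examples bracket the boundary from
    # both sides, so the model learns the decision logic, not one case.
    return "\n".join(f"Input: {ctx}\nDecision: {label}" for ctx, label in pairs)

print(render_examples(JUDGMENT_EXAMPLES))
```

Pairing examples from both sides of the boundary is what teaches the cut-off; four examples on the same side would just repeat one lesson.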
Hallucination prevention. When a field is missing from source data, Claude may fabricate a plausible value. An example showing the correct behavior — “source lacks payment terms → output: null” — demonstrates that absence is a valid answer. No text instruction achieves the same effect as seeing the null output in context.
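One way to make the null token concrete in the prompt, sketched with hypothetical field names: serialize the example output with a JSON encoder so the model literally sees `null` where the field is absent.

```python
# Sketch: a few-shot pair whose output is literally null for a missing
# field, demonstrating that absence is a valid answer. Names hypothetical.
import json

NULL_EXAMPLE = {
    "input": "Contract between A and B. No payment terms are stated.",
    "output": {"parties": ["A", "B"], "payment_terms": None},
}

def render_example(pair: dict) -> str:
    # json.dumps turns Python None into JSON null, so the model sees
    # the exact token it should emit when the source lacks the field.
    return f"Input: {pair['input']}\nOutput: {json.dumps(pair['output'])}"

print(render_example(NULL_EXAMPLE))
```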
When to Escalate to Few-Shot
The trigger: text instructions have been tried and still produce inconsistent results. The escalation path:
- Write clear text instructions
- Test on representative inputs
- If results are inconsistent → add 2-4 few-shot examples
- Target examples at the specific failure mode (format, judgment, hallucination)
Do not start with few-shot if text instructions work. Do not add more text if text has already failed. The escalation is directional: text → examples, not text → more text.
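The escalation path above can be sketched as a small consistency check: run the text-only prompt several times per representative input and escalate only when outputs diverge. `run_model` is a hypothetical stand-in for an actual API call.

```python
# Sketch of the escalation trigger: test text-only instructions on
# representative inputs; escalate to few-shot only on inconsistency.
# `run_model` is a hypothetical stand-in for a real model call.
from typing import Callable

def needs_few_shot(run_model: Callable[[str], str],
                   prompt: str,
                   inputs: list[str],
                   trials: int = 3) -> bool:
    """True if any input yields more than one distinct output across trials."""
    for item in inputs:
        outputs = {run_model(f"{prompt}\n\nInput: {item}") for _ in range(trials)}
        if len(outputs) > 1:
            return True  # inconsistent -> add 2-4 targeted examples
    return False

# Deterministic fake model for illustration: always consistent, so no escalation.
assert needs_few_shot(lambda p: "ok", "Extract fields.", ["a", "b"]) is False
```

In a real pipeline the distinct-output check would compare parsed structures rather than raw strings, but the decision rule is the same: inconsistency, not failure rate alone, is the trigger.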
Claude Generalizes, Not Memorizes
Few-shot examples teach judgment patterns, not rigid templates. Two well-chosen examples enable correct generalization to unseen inputs. Claude does not need an example for every possible case — it learns the decision logic from a small set and applies it broadly.
This is why 2-4 comprehensive examples outperform 8+ narrow ones. Each example should teach a different skill (format, boundary judgment, null handling) rather than repeating the same lesson with different data.
Combined with Structured Output
For CI pipelines, few-shot examples in CLAUDE.md handle content and judgment calibration, while --json-schema handles structural guarantees. They are complementary: the schema ensures field names and types; the examples ensure field values are correct and non-hallucinated.
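The division of labor can be sketched as follows. The schema and field names are hypothetical, and the structural check is a deliberately minimal stand-in for a real schema validator:

```python
# Sketch: the schema guarantees field names and types; the few-shot
# example teaches the values (including null for absent data).
# Schema and fields here are hypothetical.
import json

SCHEMA = {
    "type": "object",
    "required": ["vendor", "payment_terms"],
    "properties": {
        "vendor": {"type": "string"},
        # payment_terms may be a string or null -- absence is a valid answer
        "payment_terms": {"type": ["string", "null"]},
    },
}

# The few-shot example's output satisfies the schema AND models the
# non-hallucinating behavior the schema alone cannot enforce.
FEW_SHOT_OUTPUT = '{"vendor": "Acme Corp", "payment_terms": null}'

def matches_schema(raw: str, schema: dict) -> bool:
    """Minimal structural check (a real pipeline would use a validator)."""
    obj = json.loads(raw)
    return isinstance(obj, dict) and all(k in obj for k in schema["required"])

assert matches_schema(FEW_SHOT_OUTPUT, SCHEMA)
```

Note what the schema cannot do: `"payment_terms": "net 30"` would validate just as well as `null`, even if the source never mentions payment terms. Only the example teaches which value is correct.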
What Does Not Work Instead
More text emphasis. “Format is critical, follow it exactly” does not improve compliance. Claude needs to see the format, not be told it matters.
Confidence thresholds. LLM self-reported confidence is poorly calibrated — overconfident on hard cases, uncertain on easy ones. No threshold setting fixes this.
Model changes. If text instructions produce inconsistent results, switching models moves the problem. Examples fix the calibration at the prompt level.
Repeated execution. Each API call is independent. Claude does not learn from prior calls or accumulate experience across runs.
One-liner: When text instructions produce inconsistent output, add 2-4 few-shot examples — they fix format compliance (64% → 91%), judgment accuracy (35% → 89%), and hallucination (31% → 4%) in ways that more text cannot.