K3.5.1 Task 3.5

Text Descriptions Hit an Ambiguity Ceiling at 63%. Examples Reach 91%.

A fact extraction agent was tested across three prompt iterations on the same 100-document dataset:

Prompt version   Approach                                                 Accuracy
v1               “Extract key facts accurately”                           60%
v2               “Extract key facts accurately, be precise and thorough”  63%
v3               3 concrete input/output examples                         91%

Refining the text from v1 to v2 gained 3 percentage points; adding examples gained 28. Text descriptions have an ambiguity ceiling that more adjectives cannot break through. Concrete examples bypass it entirely.

Why Text Descriptions Fail

“Summarize findings in a structured format” produces bullet points, numbered lists, and markdown tables. All three are valid interpretations of “structured.” The instruction is ambiguous, and Claude picks among valid options each time.

More detail does not fix this. “Use a concise bullet point format with key findings first, supporting data second” is still subject to interpretation — what counts as “concise”? Which findings are “key”? A concrete example showing the exact expected output format eliminates every ambiguity in one shot.

Text Rules Can Create New Ambiguity

A customer support return eligibility agent was tested:

Version                        Warranty accuracy
v1 (basic text)                70%
v2 (detailed policy rules)     65%
v3 (3 input/output examples)   92%

Adding detailed text rules for warranty returns actually decreased accuracy by 5 points: the rules created conflicting interpretations for edge cases. Concrete examples showing how specific warranty cases should be handled raised accuracy from 65% to 92%.

Text rules are not always additive. When they interact with existing instructions, they can introduce new confusion instead of resolving old confusion.

What Makes a Good Example

An input/output example shows the exact transformation Claude should perform:

Input: "March 5, 2024"
Output: "2024-03-05"

Input: "12/31/23"
Output: "2023-12-31"

The examples are:

  • Concrete — not “convert to ISO 8601” but showing the actual conversion
  • Multiple — covering different input variations (human-readable, slashed, abbreviated year)
  • Paired — input and output shown together as a transformation
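Assembled into a prompt, the pairs above become a few-shot block ahead of the new input. A minimal sketch, assuming a simple string-concatenation approach — the `build_few_shot_prompt` helper, its wording, and the instruction text are illustrative, not a fixed API:

```python
# Example pairs from the date-normalization task above.
DATE_EXAMPLES = [
    ("March 5, 2024", "2024-03-05"),
    ("12/31/23", "2023-12-31"),
]

def build_few_shot_prompt(instruction: str, examples: list[tuple[str, str]], new_input: str) -> str:
    """Concatenate the instruction, worked example pairs, and the new input."""
    blocks = [instruction]
    for src, dst in examples:
        blocks.append(f'Input: "{src}"\nOutput: "{dst}"')
    blocks.append(f'Input: "{new_input}"\nOutput:')
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    "Convert each date to ISO 8601 (YYYY-MM-DD).",
    DATE_EXAMPLES,
    "Jan 2, 2025",
)
```

The prompt ends at `Output:`, so the model's completion is the transformed date itself — the examples define the format, not a text description of it.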

For multi-format tasks, show multiple formats from the same input. A CI pipeline generating release notes AND changelog entries from commits should show side-by-side examples from the same commit list.

For multi-language tasks, provide 1-2 examples per language. Stack trace → user message transformations for Python, JS, Java, Go, and Rust each need their own example pairs because each format is structurally different.
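One way to organize per-language pairs is a small registry keyed by language, so the prompt builder pulls only the relevant examples. A sketch with hypothetical names (`LANGUAGE_EXAMPLES`, `examples_for`) and two abbreviated example pairs:

```python
# Hypothetical registry: one worked example pair per language, because each
# stack-trace format is structurally different.
LANGUAGE_EXAMPLES = {
    "python": (
        'Traceback (most recent call last):\n  File "app.py", line 3, in <module>\n'
        "    1/0\nZeroDivisionError: division by zero",
        "The app tried to divide by zero while starting up.",
    ),
    "javascript": (
        "TypeError: Cannot read properties of undefined (reading 'name')\n"
        "    at greet (app.js:4:18)",
        "The app tried to read a value that was never set.",
    ),
}

def examples_for(language: str) -> list[tuple[str, str]]:
    """Return the example pairs to include in the prompt for one language."""
    pair = LANGUAGE_EXAMPLES.get(language.lower())
    return [pair] if pair else []
```

A language with no entry returns no examples, which is itself a useful signal: that language needs its own pair before the task will be consistent.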

The Anti-Pattern: Text-Only Feedback

“Make it more detailed” is not actionable feedback. Without showing what “detailed” looks like, Claude interprets differently each time. Show a concrete example of the detail level you want.

“Format the response professionally with a greeting, resolution steps, and closing” produces inconsistent results. Show 2-3 example responses in the exact format expected.

“Use conventional commit format” is better than nothing but still ambiguous. Show the exact diff → commit message transformation:

Input: [diff adding null check to parse_record()]
Output: "fix(parser): handle missing JSON field in parse_record()"

This addresses verbosity, scope prefix, and tense in one example.
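The expected shape can also be checked mechanically. A sketch that validates the `type(scope): description` subset of the Conventional Commits format shown above — the regex covers that subset only, not the full specification:

```python
import re

# Matches the common "type(scope): description" shape of Conventional Commits.
# Covers the subset used in the example above, not the full spec (no "!",
# no body or footers).
CONVENTIONAL_COMMIT = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|chore)"  # commit type
    r"(\([a-z0-9_-]+\))?"                               # optional scope
    r": \S.*$"                                          # description
)

def is_conventional(message: str) -> bool:
    return bool(CONVENTIONAL_COMMIT.match(message))
```

A check like this makes the example pair enforceable: outputs that drift from the demonstrated format fail fast instead of accumulating silently.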

When to Invest in Examples

Examples are most valuable when:

  • Output format varies — the same instruction produces different structures across runs
  • Transformation rules are complex — multiple input formats mapping to one output format
  • Text refinement has stalled — adding more adjectives does not improve consistency
  • Multiple outputs from one input — release notes + changelog from same commits

Examples are less critical when:

  • The task is inherently well-defined — “calculate the sum of these numbers” has no format ambiguity
  • A schema enforces structure — tool_use with input_schema provides structural guarantees that examples cannot improve
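A sketch of what a schema-enforced structure looks like, in the shape the Anthropic Messages API uses for tool definitions (the tool name and fields here are hypothetical):

```python
# A JSON Schema under "input_schema" constrains the output's structure,
# so format examples add nothing the schema does not already guarantee.
extract_facts_tool = {
    "name": "record_facts",  # hypothetical tool name
    "description": "Record the key facts extracted from a document.",
    "input_schema": {
        "type": "object",
        "properties": {
            "facts": {
                "type": "array",
                "items": {"type": "string"},
            },
        },
        "required": ["facts"],
    },
}
```

The schema guarantees a list of strings under a `facts` key; examples would still help with *content* quality, but not with structure.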

Iterative Refinement: Examples First, Then Expand

The refinement sequence should be:

  1. Start with 2-3 input/output examples covering the most common cases
  2. Test on new inputs
  3. Add examples for any format variations or edge cases that emerge
  4. Update examples when new patterns are discovered

This is more effective than starting with detailed text rules and adding examples later. Examples establish the pattern; text can supplement with edge-case guidance after the pattern is clear.
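Steps 2 and 3 of the sequence amount to a failure-harvesting pass: run the current prompt over new inputs and promote any mismatch into the example set. A minimal sketch, where `predict` stands in for a model call and the date cases reuse the earlier task:

```python
def harvest_failures(cases, predict):
    """Run predict over (input, expected) cases and return the mismatches.

    Each returned pair is a candidate to add to the prompt's example set.
    `predict` is a stand-in for the model call; `cases` is a hand-labeled
    test set.
    """
    return [(inp, want) for inp, want in cases if predict(inp) != want]

# Toy usage with a deliberately incomplete "model" that only knows one format:
cases = [("March 5, 2024", "2024-03-05"), ("12/31/23", "2023-12-31")]
naive = {"March 5, 2024": "2024-03-05"}
failures = harvest_failures(cases, lambda s: naive.get(s, ""))
# failures now holds the slash-format case — the next example pair to add.
```

Each iteration shrinks the failure list; when it stops shrinking, the remaining cases usually need edge-case text guidance rather than more examples.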


One-liner: When text descriptions produce inconsistent output, stop adding adjectives and show 2-3 concrete input/output examples instead — the format ambiguity that text cannot resolve, examples eliminate in one step.