Retrying Absent Data Causes Hallucination: Format Errors Converge, Missing Data Diverges

Not all extraction errors are the same. Format errors (the data exists but is in the wrong format) resolve within 1-2 retries in 94% of cases. Absent-data errors (the information does not exist in the source) still fail after 3 retries in 96% of cases, and the 4% that appear “resolved” are likely hallucinated.

Treating all errors identically wastes compute and fabricates data.

The Error Taxonomy

Error type	Retryable?	Resolution rate	Example
Format	Yes	94% in 1-2 retries	“March 15” instead of ISO 8601
Structural	Yes	87% in 1-2 retries	Missing required field that exists in source
Absent data	No	4% (likely hallucinated)	warranty_expiry for a general inquiry
Ambiguous	Once	Retry with disambiguation, then accept with qualifier	Unclear date reference
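
The table above can be read as a per-type retry policy. The following is a minimal sketch, assuming the retry limits shown in the table; the names `ErrorType`, `RetryPolicy`, `should_retry`, and the `on_exhausted` labels (including "escalate") are hypothetical and not part of the source.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    FORMAT = "format"
    STRUCTURAL = "structural"
    ABSENT_DATA = "absent_data"
    AMBIGUOUS = "ambiguous"


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int   # upper bound taken from the table above
    on_exhausted: str  # hypothetical fallback label, not from the source


RETRY_POLICY = {
    ErrorType.FORMAT: RetryPolicy(2, "escalate"),
    ErrorType.STRUCTURAL: RetryPolicy(2, "escalate"),
    ErrorType.ABSENT_DATA: RetryPolicy(0, "accept_null"),
    ErrorType.AMBIGUOUS: RetryPolicy(1, "accept_with_qualifier"),
}


def should_retry(error_type: ErrorType, attempts_so_far: int) -> bool:
    """Return True while the error type still has retry budget left."""
    return attempts_so_far < RETRY_POLICY[error_type].max_retries
```

With this mapping, an absent-data error never enters the retry loop, and an ambiguous error gets exactly one attempt with disambiguation.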

The Diagnostic Signature: Divergent Outputs

If retries for the same field produce different values each time (MIT, then Apache-2.0, then BSD-3-Clause for a license field), the data does not exist. The model is inventing a new plausible answer each time because there is no ground truth to converge on.

Format errors converge — each retry gets closer to the correct format. Absent data diverges — each retry fabricates a different value.
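
One way to check for this signature is to compare the values returned across retries for a single field. The sketch below is an illustration only: the function name `diverges` and the `min_attempts` threshold are assumptions, and it compares values after lowercasing and trimming, so in practice the values would first need to be normalized to a common format to avoid flagging a format error as divergence.

```python
from typing import Optional, Sequence


def diverges(values: Sequence[Optional[str]], min_attempts: int = 3) -> bool:
    """Return True when every attempt produced a different value.

    Empty or missing attempts are ignored. With fewer than `min_attempts`
    usable values there is not enough evidence, so the result is False.
    """
    seen = [v.strip().lower() for v in values if v is not None and v.strip()]
    if len(seen) < min_attempts:
        return False
    # All values distinct: nothing to converge on, so the data is likely absent.
    return len(set(seen)) == len(seen)


print(diverges(["MIT", "Apache-2.0", "BSD-3-Clause"]))        # True
print(diverges(["2024-03-15", "2024-03-15", "2024-03-15"]))  # False
```

A field flagged this way is a candidate for null acceptance rather than another retry.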

Production Data

In a 10,000 ticket/day system:

Format errors: 42% of retries, 94% resolved
Structural errors: 18% of retries, 87% resolved
Missing-data errors: 40% of retries, 6% “resolved” (likely hallucinated)

Missing-data retries consumed 40% of the retry compute budget for near-zero genuine value. Pre-classification eliminates this waste.
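
The effect on the retry loop can be worked out from the figures above. This is a back-of-the-envelope calculation, assuming every retry costs roughly the same compute regardless of error type; the variable names are illustrative.

```python
# Share of all retries and resolution rate per error type, from the list above.
shares = {"format": 0.42, "structural": 0.18, "missing_data": 0.40}
resolved = {"format": 0.94, "structural": 0.87, "missing_data": 0.06}

# Resolution rate of the retry loop when every error is retried.
overall = sum(shares[k] * resolved[k] for k in shares)

# Resolution rate when missing-data errors are filtered out before retrying.
retryable = [k for k in shares if k != "missing_data"]
retryable_share = sum(shares[k] for k in retryable)
after = sum(shares[k] * resolved[k] for k in retryable) / retryable_share

print(f"all errors retried:   {overall:.1%}")  # 57.5%
print(f"pre-classified only:  {after:.1%}")    # 91.9%
```

Under that assumption, pre-classification removes 40% of retry calls and raises the share of retries that resolve from about 57.5% to about 91.9%.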

The Correct Response to Absent Data

Do not retry. Change the field from required to nullable (K4.3.3) and accept null with a "not_found_in_source" status. Escalating the urgency of the prompt (“The PO number MUST exist”) only pushes the model harder toward fabrication on each retry.
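
A nullable field with an explicit status might look like the sketch below. Only the "not_found_in_source" status comes from the text; the `ExtractedField` type, the "extracted" status, and the `po_number` field name are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

NOT_FOUND = "not_found_in_source"


@dataclass
class ExtractedField:
    name: str
    value: Optional[str]  # nullable rather than required
    status: str           # "extracted" (hypothetical) or NOT_FOUND


def absent(name: str) -> ExtractedField:
    """Build the accepted result for a field the source does not contain."""
    return ExtractedField(name=name, value=None, status=NOT_FOUND)


def is_valid(field: ExtractedField) -> bool:
    """Null is valid only when it is explicitly marked as absent."""
    if field.value is None:
        return field.status == NOT_FOUND
    return field.status == "extracted"


print(json.dumps(asdict(absent("po_number"))))
# {"name": "po_number", "value": null, "status": "not_found_in_source"}
```

Pairing null with a status lets downstream consumers tell "the source did not contain this" apart from "extraction failed".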

The Two-Tier Strategy

Classify — Is the error retryable (format/structural) or non-retryable (absent data)?
Route — Retryable → retry loop with specific error feedback. Non-retryable → null acceptance with structured context.

Error classification is the essential first step. Without it, the system either wastes retries on absent data (causing hallucination) or skips retries for fixable format errors (losing data).
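
The two tiers can be sketched as a classifier followed by a router. This is a minimal illustration, not a definitive implementation: `classify`, `route`, the `found_in_source` flag (how it is determined is left open), the `retry` callback, and the "extracted" and "retry_exhausted" statuses are all hypothetical. Ambiguous errors are outside this two-way split and are not handled here.

```python
from typing import Callable, Dict, Optional

RETRYABLE = {"format", "structural"}
NON_RETRYABLE = {"absent_data"}


def classify(raw_value: Optional[str], found_in_source: bool) -> str:
    """Tier 1: decide the error type for a field that failed validation."""
    if not found_in_source:
        return "absent_data"
    # Data exists in the source: either the field was omitted or misformatted.
    return "structural" if raw_value is None else "format"


def route(
    error_type: str,
    retry: Callable[[str], Optional[str]],
    feedback: str,
    max_retries: int = 2,
) -> Dict[str, object]:
    """Tier 2: retry with specific feedback, or accept null without retrying."""
    if error_type in NON_RETRYABLE:
        return {"value": None, "status": "not_found_in_source", "attempts": 0}
    if error_type not in RETRYABLE:
        raise ValueError(f"unhandled error type: {error_type}")
    for attempt in range(1, max_retries + 1):
        value = retry(feedback)  # callback returns a valid value or None
        if value is not None:
            return {"value": value, "status": "extracted", "attempts": attempt}
    return {"value": None, "status": "retry_exhausted", "attempts": max_retries}
```

In use, `retry` would wrap the model call plus validation, and `feedback` would carry the specific error (for example, "date must be ISO 8601") rather than a generic "try again".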

What Does Not Fix Absent Data

Upgrading the model (Haiku → Sonnet) conflates capability with data availability. A more powerful model cannot extract information that does not exist.

Raising the retry count adaptively based on failure rate is counterproductive. Fields with a failure rate above 50% are predominantly failing because the data is absent from the source (80% of cases), so more retries mean more hallucination.

One-liner: Classify errors before retrying. Format errors converge toward correct values with feedback, but absent-data errors diverge into fabrication, and retrying them wastes 40% of the retry compute budget while producing hallucinated output.