Not all extraction errors are the same. Format errors (the data exists but is in the wrong format) resolve within 1-2 retries 94% of the time. Absent-data errors (the information does not exist in the source) still fail after 3 retries 96% of the time, and the 4% of “resolved” cases are likely hallucinations.
Treating all errors identically wastes compute and fabricates data.
The Error Taxonomy
| Error type | Retryable? | Resolution rate | Example |
|---|---|---|---|
| Format | Yes | 94% in 1-2 retries | “March 15” instead of ISO 8601 |
| Structural | Yes | 87% in 1-2 retries | Missing required field that exists in source |
| Absent data | No | 4% (likely hallucinated) | warranty_expiry for a general inquiry |
| Ambiguous | Once | Retry with disambiguation, then accept with qualifier | Unclear date reference |
The Diagnostic Signature: Divergent Outputs
If retries for the same field produce different values each time (MIT, then Apache-2.0, then BSD-3-Clause for a license field), the data does not exist. The model is inventing a new plausible answer each time because there is no ground truth to converge on.
Format errors converge — each retry gets closer to the correct format. Absent data diverges — each retry fabricates a different value.
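The divergence signal can be checked mechanically. The following is a minimal sketch, not a definitive detector: the function names and the three-attempt threshold are assumptions made for illustration, and the normalization here only handles case and whitespace. A real system would likely normalize per field type (for example, parsing dates) so that two differently formatted versions of the same value do not count as disagreement.

```python
def normalize(value: str) -> str:
    """Collapse case and whitespace so formatting noise is not counted as disagreement."""
    return " ".join(value.split()).casefold()


def retries_diverge(values: list[str], min_attempts: int = 3) -> bool:
    """Return True when every attempt produced a different normalized value.

    min_attempts is a hypothetical threshold: with fewer attempts there is
    not enough evidence to call the field absent.
    """
    if len(values) < min_attempts:
        return False
    distinct = {normalize(v) for v in values}
    return len(distinct) == len(values)


# License field from the example above: three attempts, three different answers.
print(retries_diverge(["MIT", "Apache-2.0", "BSD-3-Clause"]))  # True
# Same value each time (modulo whitespace): the retries agree.
print(retries_diverge(["2024-03-15", "2024-03-15 ", "2024-03-15"]))  # False
```

When the check returns True, the field is a candidate for null acceptance rather than another retry.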
Production Data
In a 10,000 ticket/day system:
- Format errors: 42% of retries, 94% resolved
- Structural errors: 18% of retries, 87% resolved
- Absent-data errors: 40% of retries, 6% “resolved” (likely hallucinated)
Absent-data retries consumed 40% of the retry compute budget for near-zero genuine value. Pre-classification eliminates this waste.
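The effect of pre-classification on the retry budget can be worked out from the figures above. This is a back-of-the-envelope sketch that assumes every retry costs roughly the same, so that a share of retries equals a share of compute. The dictionary and function names are illustrative only.

```python
# (share of retries, resolution rate) per error type, taken from the list above.
RETRY_MIX = {
    "format": (0.42, 0.94),
    "structural": (0.18, 0.87),
    "absent_data": (0.40, 0.06),
}


def resolved_share(mix: dict[str, tuple[float, float]]) -> float:
    """Fraction of retry spend that ends in a 'resolved' result."""
    total_share = sum(share for share, _ in mix.values())
    return sum(share * rate for share, rate in mix.values()) / total_share


def without(mix: dict[str, tuple[float, float]], skipped: str) -> dict[str, tuple[float, float]]:
    """Retry mix after one error type is routed away from the retry loop."""
    return {name: stats for name, stats in mix.items() if name != skipped}


print(f"{resolved_share(RETRY_MIX):.1%}")                          # 57.5%
print(f"{resolved_share(without(RETRY_MIX, 'absent_data')):.1%}")  # 91.9%
```

Under these assumptions, skipping absent-data retries removes 40% of the retry spend and raises the share of retries that resolve from about 58% to about 92%. The 58% figure also counts the likely-hallucinated absent-data "resolutions", so the genuine baseline is lower still.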
The Correct Response to Absent Data
Do not retry. Change the field from required to nullable (K4.3.3) and accept null with a "not_found_in_source" status. Escalating urgency in the retry prompt (“The PO number MUST exist”) only pushes the model to fabricate a value on each retry.
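One way to represent a nullable field with an explicit status is sketched below. Only the "not_found_in_source" status string comes from the text above. The `ExtractedField` class, the "found" status, and the helper names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

NOT_FOUND = "not_found_in_source"  # status named in the text above
FOUND = "found"                    # hypothetical counterpart status


@dataclass
class ExtractedField:
    name: str
    value: Optional[str]  # nullable: None is a legitimate result
    status: str


def accept_absent(name: str) -> ExtractedField:
    """Record that the source does not contain this field, instead of retrying."""
    return ExtractedField(name=name, value=None, status=NOT_FOUND)


def is_valid(field: ExtractedField) -> bool:
    """A null value is valid only when it is explicitly marked as absent."""
    if field.value is None:
        return field.status == NOT_FOUND
    return field.status == FOUND


po_number = accept_absent("po_number")
print(po_number.value, po_number.status)  # None not_found_in_source
print(is_valid(po_number))                # True
```

Pairing the null with a status lets downstream consumers distinguish "the source does not contain this" from "extraction never ran".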
The Two-Tier Strategy
- Classify — Is the error retryable (format/structural) or non-retryable (absent data)?
- Route — Retryable → retry loop with specific error feedback. Non-retryable → null acceptance with structured context.
Error classification is the essential first step. Without it, the system either wastes retries on absent data (causing hallucination) or skips retries for fixable format errors (losing data).
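The classify-then-route flow might look like the sketch below. The retry limits mirror the taxonomy table (up to 2 retries for format and structural errors, none for absent data, one for ambiguous). The classification signals (`evidence_in_source`, `ambiguous`, `field_returned`) are hypothetical inputs: how a system actually determines them is outside the scope of this sketch. The "flag_for_review" outcome is likewise an assumption, since the text does not say what happens when a retryable error exhausts its budget.

```python
from enum import Enum


class ErrorType(Enum):
    FORMAT = "format"
    STRUCTURAL = "structural"
    ABSENT_DATA = "absent_data"
    AMBIGUOUS = "ambiguous"


# Retry budgets taken from the taxonomy table.
MAX_RETRIES = {
    ErrorType.FORMAT: 2,
    ErrorType.STRUCTURAL: 2,
    ErrorType.ABSENT_DATA: 0,
    ErrorType.AMBIGUOUS: 1,
}


def classify(evidence_in_source: bool, ambiguous: bool, field_returned: bool) -> ErrorType:
    """Tier 1: decide what kind of error this is (signals are hypothetical)."""
    if not evidence_in_source:
        return ErrorType.ABSENT_DATA
    if ambiguous:
        return ErrorType.AMBIGUOUS
    if not field_returned:
        return ErrorType.STRUCTURAL  # required field missing but present in source
    return ErrorType.FORMAT          # value returned, wrong format


def route(error_type: ErrorType, attempts_so_far: int) -> str:
    """Tier 2: pick the next action for a classified error."""
    if error_type is ErrorType.ABSENT_DATA:
        return "accept_null"
    if attempts_so_far < MAX_RETRIES[error_type]:
        return "retry_with_feedback"
    if error_type is ErrorType.AMBIGUOUS:
        return "accept_with_qualifier"
    return "flag_for_review"  # assumption: not specified in the text


err = classify(evidence_in_source=False, ambiguous=False, field_returned=False)
print(err.value, route(err, attempts_so_far=0))  # absent_data accept_null
```

Keeping classification separate from routing means the retry policy can change without touching the logic that decides what kind of error occurred.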
What Does Not Fix Absent Data
Model upgrade (Haiku → Sonnet) conflates capability with data availability. A more powerful model cannot extract information that does not exist.
Adaptive retry count based on failure rate is counterproductive. When a field fails more than 50% of the time, the cause in 80% of cases is that the data is absent from the source, so adding retries for those fields adds hallucination.
One-liner: Classify errors before retrying — format errors converge toward correct values with feedback, but absent-data errors diverge into fabrication, and retrying them wastes 40% of compute while producing hallucinated output.