An invoice has five extracted fields. Invoice number: 0.99 confidence. Date: 0.97. Total amount: 0.95. Warranty period: 0.62. Special terms: 0.45. Document-level routing sends the entire invoice to a human reviewer because two fields are uncertain. The reviewer re-verifies all five fields, spending time confirming the invoice number and date that were already correct.
Field-level confidence routing sends only the warranty period and special terms to review. The high-confidence fields are auto-accepted. The reviewer focuses on the two fields that actually need judgment.
The impact is dramatic: one team cut reviewer workload by 69% (8 hours/day to 2.5 hours/day) while maintaining nearly identical error catch rates (93% vs 95% with document-level routing). The 2% catch rate difference is a trivial cost for freeing 5.5 hours of daily reviewer capacity.
Output Structure: {value, confidence} Per Field
The extraction schema should pair every field with its confidence score:
```json
{
  "invoice_number": { "value": "INV-2026-4412", "confidence": 0.99 },
  "total_amount": { "value": "$1,247.50", "confidence": 0.95 },
  "warranty_period": { "value": "24 months", "confidence": 0.62 },
  "special_terms": { "value": "Net 60 with early payment discount", "confidence": 0.45 }
}
```
This structure lets downstream systems apply per-field thresholds without additional processing. Each consuming system can decide its own routing rules based on the numeric scores.
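A per-field routing pass over this structure can be sketched as follows. The specific thresholds and field names are illustrative assumptions, not a standard; in practice they would come from calibration data.

```python
# Sketch of per-field threshold routing. Thresholds are illustrative:
# stricter for financial fields, looser for informational ones.
DEFAULT_THRESHOLD = 0.8
FIELD_THRESHOLDS = {
    "total_amount": 0.9,    # financial amount: stricter
    "special_terms": 0.7,   # informational: looser
}

def route_fields(extraction: dict) -> dict:
    """Split extracted fields into auto-accepted and review queues."""
    accepted, review = {}, {}
    for field, item in extraction.items():
        threshold = FIELD_THRESHOLDS.get(field, DEFAULT_THRESHOLD)
        (accepted if item["confidence"] >= threshold else review)[field] = item
    return {"accepted": accepted, "review": review}

extraction = {
    "invoice_number": {"value": "INV-2026-4412", "confidence": 0.99},
    "total_amount": {"value": "$1,247.50", "confidence": 0.95},
    "warranty_period": {"value": "24 months", "confidence": 0.62},
    "special_terms": {"value": "Net 60 with early payment discount", "confidence": 0.45},
}
routed = route_fields(extraction)
# Only warranty_period and special_terms land in the review queue.
```

Because the threshold lives in the consumer, not the extraction output, two downstream systems can route the same extraction differently without re-running the model.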
Not a binary flag. A proposal to simplify to confident / needs_review per field loses critical flexibility. The routing decision is ultimately binary (review or auto-accept), but the threshold should be configurable downstream, not baked into the extraction output. With numeric scores, a system can set 0.9 for financial amounts but 0.7 for internal notes. Binary flags lock in one threshold at extraction time — changing it requires re-extracting.
Not a document-level score. A single overall confidence (average or minimum across fields) masks field-level variation. A document with three fields at 0.97 and one at 0.45 might average 0.84 and pass a 0.8 threshold — hiding the genuinely uncertain field. Or it might fail the threshold and send all four fields to review, wasting time on the three reliable ones.
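The averaging trap above is easy to verify with the numbers from the example:

```python
# Three reliable fields mask one genuinely uncertain one.
confidences = [0.97, 0.97, 0.97, 0.45]
doc_score = sum(confidences) / len(confidences)  # ~0.84: clears a 0.8 threshold
passes = doc_score >= 0.8                        # True: the 0.45 field slips through
min_score = min(confidences)                     # 0.45: would fail, but then all
                                                 # four fields go to review
```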
Calibration: The Model’s Confidence Is Not Automatically Accurate
A confidence score of 0.88 should mean the model is correct about 88% of the time at that confidence level. This is not guaranteed.
One team set a global threshold at 0.8: auto-accept anything at or above. An audit revealed warranty_period fields averaged 0.88 confidence but had a 25% error rate. Other fields at similar confidence: 3% error rate. The model was overconfident specifically for warranty extractions.
The root cause: confidence calibration varies by field type. The same numeric score means different accuracy levels for different fields. A global threshold treats them equally, which means over-trusting some fields and under-trusting others.
Calibrate Before You Deploy
Before using confidence scores for routing:
- Create a labeled validation set with human-verified correct values
- Measure actual accuracy at each confidence level per field type
- Compare reported confidence to observed accuracy
- Set thresholds per field type based on the calibration data
There is no universal “0.8 is good enough” standard. Each model, field type, and document type needs its own validation. A model might be well-calibrated for invoice numbers (0.9 confidence → 91% accuracy) but poorly calibrated for warranty terms (0.9 confidence → 72% accuracy).
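The measurement step can be sketched as follows. The record format `(field_type, reported_confidence, extracted_value, true_value)` is an assumption for illustration; any labeled validation set with those four pieces of information works.

```python
from collections import defaultdict

# Sketch of a calibration check: compare reported confidence to observed
# accuracy, per field type, in confidence deciles.
def calibration_by_field(records):
    """Return {field_type: {decile: (observed_accuracy, sample_count)}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for field, conf, predicted, truth in records:
        decile = min(int(conf * 10), 9)  # e.g. 0.88 -> bucket 8 (0.8-0.9)
        counts[field][decile][1] += 1
        counts[field][decile][0] += predicted == truth
    return {
        field: {d: (correct / total, total) for d, (correct, total) in buckets.items()}
        for field, buckets in counts.items()
    }

# Toy validation set: warranty_period is overconfident in the 0.8-0.9 bucket.
records = [
    ("warranty_period", 0.88, "24 months", "12 months"),
    ("warranty_period", 0.87, "24 months", "24 months"),
    ("invoice_number", 0.89, "INV-1", "INV-1"),
    ("invoice_number", 0.88, "INV-2", "INV-2"),
]
calib = calibration_by_field(records)
# calib["warranty_period"][8] -> (0.5, 2): 0.88 reported, 50% observed
```

A real calibration run needs far more than a handful of samples per bucket; the sample count is returned so thin buckets can be flagged rather than trusted.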
Calibrate Per Document Type Too
The same field name can have different reliability profiles across document types. total_amount extracted from a printed invoice (structured table) may be at 99% accuracy, while the same field from a handwritten receipt (variable format) may be at 82%. A single threshold for total_amount across all document types would be too loose for receipts or too strict for invoices.
The comprehensive approach: per-field-per-document-type thresholds, calibrated from labeled data, re-calibrated quarterly as document patterns evolve.
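A per-field-per-document-type threshold table reduces to a keyed lookup. The numbers below are illustrative; in the comprehensive approach they would be derived from the calibration data, not hand-tuned.

```python
# Sketch of (field, document_type) thresholds; values are illustrative
# and would come from per-field-per-document-type calibration.
THRESHOLDS = {
    ("total_amount", "printed_invoice"): 0.85,      # reliable source: looser
    ("total_amount", "handwritten_receipt"): 0.95,  # error-prone source: stricter
}
FALLBACK_THRESHOLD = 0.9  # conservative default for uncalibrated pairs

def needs_review(field: str, doc_type: str, confidence: float) -> bool:
    return confidence < THRESHOLDS.get((field, doc_type), FALLBACK_THRESHOLD)

# The same 0.9 score routes differently depending on document type:
needs_review("total_amount", "printed_invoice", 0.9)      # auto-accept
needs_review("total_amount", "handwritten_receipt", 0.9)  # send to review
```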
Routing With Business Impact
Not all uncertain fields are equally important. A billing extraction might report low confidence on both the refund amount (0.62) and a formatting note (0.58). The refund amount directly affects a financial transaction. The formatting note is informational.
Field-level routing can incorporate business impact alongside confidence:
- High impact + low confidence → priority human review (refund amounts, contract terms)
- Low impact + low confidence → standard review queue (internal codes, formatting notes)
- High impact + high confidence → auto-accept with audit sampling
- Low impact + high confidence → auto-accept
This allocates scarce reviewer time to the fields where errors have the largest consequences, not just the fields where the model is most uncertain.
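The four-quadrant policy above is a small decision table in code. The impact classification and the 0.8 cutoff are illustrative assumptions; both would be business decisions informed by calibration.

```python
# Sketch of impact-aware routing. HIGH_IMPACT membership and the 0.8
# threshold are illustrative, not prescriptive.
HIGH_IMPACT_FIELDS = {"refund_amount", "contract_terms", "total_amount"}

def route(field: str, confidence: float, threshold: float = 0.8) -> str:
    high_impact = field in HIGH_IMPACT_FIELDS
    confident = confidence >= threshold
    if high_impact and not confident:
        return "priority_review"
    if not high_impact and not confident:
        return "standard_review"
    if high_impact:  # confident
        return "auto_accept_with_audit_sampling"
    return "auto_accept"

route("refund_amount", 0.62)    # high impact, low confidence -> priority
route("formatting_note", 0.58)  # low impact, low confidence -> standard queue
```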
What Field-Level Routing Replaces
| Approach | Review trigger | Reviewer effort | Error catch |
|---|---|---|---|
| Review everything | Always | Very high | ~100% |
| Document-level routing | Any field uncertain | High (40% of docs) | ~95% |
| Field-level routing | Specific field uncertain | Low (12% of fields) | ~93% |
| No review | Never | Zero | Variable |
Field-level routing sits at the efficient frontier: near-document-level error detection at a fraction of the workload. The 2% catch rate difference between document-level and field-level routing represents fields that were uncertain but happened to be correct — the model was unsure but got it right. The 69% workload reduction more than compensates.
One-liner: Output a confidence score with every extracted field, calibrate thresholds per field type, and route only the uncertain fields to human review — the certain ones do not need human eyes.