An invoice has five extracted fields. Invoice number: 0.99 confidence. Date: 0.97. Total amount: 0.95. Warranty period: 0.62. Special terms: 0.45. Document-level routing sends the entire invoice to a human reviewer because two fields are uncertain. The reviewer re-verifies all five fields, spending time confirming the invoice number and date that were already correct.
Field-level confidence routing sends only the warranty period and special terms to review. The high-confidence fields are auto-accepted. The reviewer focuses on the two fields that actually need judgment.
The impact is dramatic: one team cut reviewer workload by 69% (8 hours/day to 2.5 hours/day) while maintaining nearly identical error catch rates (93% vs 95% with document-level routing). The 2% catch rate difference is a trivial cost for freeing 5.5 hours of daily reviewer capacity.
Output Structure: {value, confidence} Per Field
The extraction schema should pair every field with its confidence score:
```json
{
  "invoice_number": { "value": "INV-2026-4412", "confidence": 0.99 },
  "total_amount": { "value": "$1,247.50", "confidence": 0.95 },
  "warranty_period": { "value": "24 months", "confidence": 0.62 },
  "special_terms": { "value": "Net 60 with early payment discount", "confidence": 0.45 }
}
```
This structure lets downstream systems apply per-field thresholds without additional processing. Each consuming system can decide its own routing rules based on the numeric scores.
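A per-field routing pass over this structure can be sketched as follows. The specific thresholds and field names are illustrative assumptions, not a standard; in practice they would come from calibration data.

```python
# Sketch of per-field threshold routing. Thresholds are illustrative:
# stricter for financial fields, looser for informational ones.
DEFAULT_THRESHOLD = 0.8
FIELD_THRESHOLDS = {
    "total_amount": 0.9,    # financial amount: stricter
    "special_terms": 0.7,   # informational: looser
}

def route_fields(extraction: dict) -> dict:
    """Split extracted fields into auto-accepted and review queues."""
    accepted, review = {}, {}
    for field, item in extraction.items():
        threshold = FIELD_THRESHOLDS.get(field, DEFAULT_THRESHOLD)
        (accepted if item["confidence"] >= threshold else review)[field] = item
    return {"accepted": accepted, "review": review}

extraction = {
    "invoice_number": {"value": "INV-2026-4412", "confidence": 0.99},
    "total_amount": {"value": "$1,247.50", "confidence": 0.95},
    "warranty_period": {"value": "24 months", "confidence": 0.62},
    "special_terms": {"value": "Net 60 with early payment discount", "confidence": 0.45},
}
routed = route_fields(extraction)
# Only warranty_period and special_terms land in the review queue.
```

Because the threshold lives in the consumer, not the extraction output, two downstream systems can route the same extraction differently without re-running the model.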
Not a binary flag. A proposal to simplify to confident / needs_review per field loses critical flexibility. The routing decision is ultimately binary (review or auto-accept), but the threshold should be configurable downstream, not baked into the extraction output. With numeric scores, a system can set 0.9 for financial amounts but 0.7 for internal notes. Binary flags lock in one threshold at extraction time — changing it requires re-extracting.
Not a document-level score. A single overall confidence (average or minimum across fields) masks field-level variation. A document with three fields at 0.97 and one at 0.45 might average 0.84 and pass a 0.8 threshold — hiding the genuinely uncertain field. Or it might fail the threshold and send all four fields to review, wasting time on the three reliable ones.
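The averaging trap above is easy to verify with the numbers from the example:

```python
# Three reliable fields mask one genuinely uncertain one.
confidences = [0.97, 0.97, 0.97, 0.45]
doc_score = sum(confidences) / len(confidences)  # ~0.84: clears a 0.8 threshold
passes = doc_score >= 0.8                        # True: the 0.45 field slips through
min_score = min(confidences)                     # 0.45: would fail, but then all
                                                 # four fields go to review
```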
Calibration: The Model’s Confidence Is Not Automatically Accurate
A confidence score of 0.88 should mean the model is correct about 88% of the time at that confidence level. This is not guaranteed.
One team set a global threshold at 0.8: auto-accept anything at or above. An audit revealed warranty_period fields averaged 0.88 confidence but had a 25% error rate. Other fields at similar confidence: 3% error rate. The model was overconfident specifically for warranty extractions.
The root cause: confidence calibration varies by field type. The same numeric score means different accuracy levels for different fields. A global threshold treats them equally, which means over-trusting some fields and under-trusting others.
Calibrate Before You Deploy
Before using confidence scores for routing:
- Create a labeled validation set with human-verified correct values
- Measure actual accuracy at each confidence level per field type
- Compare reported confidence to observed accuracy
- Set thresholds per field type based on the calibration data
There is no universal “0.8 is good enough” standard. Each model, field type, and document type needs its own validation. A model might be well-calibrated for invoice numbers (0.9 confidence → 91% accuracy) but poorly calibrated for warranty terms (0.9 confidence → 72% accuracy).
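The measurement step can be sketched as follows. The record format `(field_type, reported_confidence, extracted_value, true_value)` is an assumption for illustration; any labeled validation set with those four pieces of information works.

```python
from collections import defaultdict

# Sketch of a calibration check: compare reported confidence to observed
# accuracy, per field type, in confidence deciles.
def calibration_by_field(records):
    """Return {field_type: {decile: (observed_accuracy, sample_count)}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for field, conf, predicted, truth in records:
        decile = min(int(conf * 10), 9)  # e.g. 0.88 -> bucket 8 (0.8-0.9)
        counts[field][decile][1] += 1
        counts[field][decile][0] += predicted == truth
    return {
        field: {d: (correct / total, total) for d, (correct, total) in buckets.items()}
        for field, buckets in counts.items()
    }

# Toy validation set: warranty_period is overconfident in the 0.8-0.9 bucket.
records = [
    ("warranty_period", 0.88, "24 months", "12 months"),
    ("warranty_period", 0.87, "24 months", "24 months"),
    ("invoice_number", 0.89, "INV-1", "INV-1"),
    ("invoice_number", 0.88, "INV-2", "INV-2"),
]
calib = calibration_by_field(records)
# calib["warranty_period"][8] -> (0.5, 2): 0.88 reported, 50% observed
```

A real calibration run needs far more than a handful of samples per bucket; the sample count is returned so thin buckets can be flagged rather than trusted.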
Calibrate Per Document Type Too
The same field name can have different reliability profiles across document types. total_amount extracted from a printed invoice (structured table) may be at 99% accuracy, while the same field from a handwritten receipt (variable format) may be at 82%. A single threshold for total_amount across all document types would be too loose for receipts or too strict for invoices.
The comprehensive approach: per-field-per-document-type thresholds, calibrated from labeled data, re-calibrated quarterly as document patterns evolve.
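A per-field-per-document-type threshold table reduces to a keyed lookup. The numbers below are illustrative; in the comprehensive approach they would be derived from the calibration data, not hand-tuned.

```python
# Sketch of (field, document_type) thresholds; values are illustrative
# and would come from per-field-per-document-type calibration.
THRESHOLDS = {
    ("total_amount", "printed_invoice"): 0.85,      # reliable source: looser
    ("total_amount", "handwritten_receipt"): 0.95,  # error-prone source: stricter
}
FALLBACK_THRESHOLD = 0.9  # conservative default for uncalibrated pairs

def needs_review(field: str, doc_type: str, confidence: float) -> bool:
    return confidence < THRESHOLDS.get((field, doc_type), FALLBACK_THRESHOLD)

# The same 0.9 score routes differently depending on document type:
needs_review("total_amount", "printed_invoice", 0.9)      # auto-accept
needs_review("total_amount", "handwritten_receipt", 0.9)  # send to review
```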
Routing With Business Impact
Not all uncertain fields are equally important. A billing extraction might report low confidence on both the refund amount (0.62) and a formatting note (0.58). The refund amount directly affects a financial transaction. The formatting note is informational.
Field-level routing can incorporate business impact alongside confidence:
- High impact + low confidence → priority human review (refund amounts, contract terms)
- Low impact + low confidence → standard review queue (internal codes, formatting notes)
- High impact + high confidence → auto-accept with audit sampling
- Low impact + high confidence → auto-accept
This allocates scarce reviewer time to the fields where errors have the largest consequences, not just the fields where the model is most uncertain.
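The four-quadrant policy above is a small decision table in code. The impact classification and the 0.8 cutoff are illustrative assumptions; both would be business decisions informed by calibration.

```python
# Sketch of impact-aware routing. HIGH_IMPACT membership and the 0.8
# threshold are illustrative, not prescriptive.
HIGH_IMPACT_FIELDS = {"refund_amount", "contract_terms", "total_amount"}

def route(field: str, confidence: float, threshold: float = 0.8) -> str:
    high_impact = field in HIGH_IMPACT_FIELDS
    confident = confidence >= threshold
    if high_impact and not confident:
        return "priority_review"
    if not high_impact and not confident:
        return "standard_review"
    if high_impact:  # confident
        return "auto_accept_with_audit_sampling"
    return "auto_accept"

route("refund_amount", 0.62)    # high impact, low confidence -> priority
route("formatting_note", 0.58)  # low impact, low confidence -> standard queue
```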
What Field-Level Routing Replaces
| Approach | Review trigger | Reviewer effort | Error catch |
|---|---|---|---|
| Review everything | Always | Very high | ~100% |
| Document-level routing | Any field uncertain | High (40% of docs) | ~95% |
| Field-level routing | Specific field uncertain | Low (12% of fields) | ~93% |
| No review | Never | Zero | Variable |
Field-level routing sits at the efficient frontier: near-document-level error detection at a fraction of the workload. The 2% catch rate difference between document-level and field-level routing represents fields that were uncertain but happened to be correct — the model was unsure but got it right. The 69% workload reduction more than compensates.
One-liner: Output a confidence score with every extracted field, calibrate thresholds per field type, and route only the uncertain fields to human review — the certain ones do not need human eyes.