K5.2.3 Task 5.2

High confidence (0.9+): 12% errors. Low confidence (<0.5): 68% correct. The signal is broken.

LLM self-reported confidence scores are poorly calibrated for escalation decisions. The model can report 0.9 confidence while giving incorrect policy advice, and 0.3 confidence on a routine return it could resolve easily. No threshold setting fixes this — the signal itself does not reliably correlate with whether escalation is needed.

The Calibration Data

5,000 cases with logged confidence alongside actual outcomes:

| Confidence band | Correctly handled | Error rate |
|---|---|---|
| 0.9–1.0 | 88% | 12% |
| 0.7–0.9 | 79% | 21% |
| 0.5–0.7 | 74% | 26% |
| 0.0–0.5 | 68% | 32% |

The range is only 20 percentage points (88% to 68%). Even at the highest confidence, 12% of cases are errors. At the lowest confidence, 68% are still correct. This weak discrimination means no threshold produces sharp separation between cases needing escalation and those that do not.

Why Thresholds Cannot Fix This

At a 0.7 threshold:

  • False escalations: 74% of cases in the 0.5–0.7 band were correct but escalated anyway (wasted human time)
  • Missed errors: 12% of cases at 0.9+ were wrong yet passed through undetected

Lowering the threshold catches more errors but escalates more correct cases. Raising it reduces unnecessary escalations but misses more errors. The tradeoff is inescapable because the signal itself is weak.
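The tradeoff can be made concrete with a small sketch. It uses the accuracy figures from the calibration table; the per-band case counts are hypothetical (the source reports 5,000 total cases but not the split), so the absolute numbers are illustrative.

```python
# Calibration bands from the table above, with a hypothetical even split
# of the 5,000 logged cases across bands.
BANDS = [
    # (low, high, correct_rate, cases)
    (0.9, 1.0, 0.88, 1250),
    (0.7, 0.9, 0.79, 1250),
    (0.5, 0.7, 0.74, 1250),
    (0.0, 0.5, 0.68, 1250),
]

def threshold_tradeoff(threshold):
    """Escalate cases below `threshold`. Returns:
    - missed: wrong answers that pass through (confidence >= threshold)
    - wasted: correct answers escalated anyway (confidence < threshold)
    """
    missed = sum(c * (1 - r) for lo, hi, r, c in BANDS if lo >= threshold)
    wasted = sum(c * r for lo, hi, r, c in BANDS if hi <= threshold)
    return missed, wasted

for t in (0.5, 0.7, 0.9):
    missed, wasted = threshold_tradeoff(t)
    print(f"threshold {t}: {missed:.0f} errors pass through, "
          f"{wasted:.0f} correct cases escalated")
```

Sweeping the threshold only slides volume between the two failure modes; no setting shrinks both, because the bands' accuracies differ by just 20 points.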

Sentiment Analysis: Same Problem, Different Signal

Customer emotion does not correlate with case complexity. A frustrated customer may have a simple late-delivery issue resolvable in seconds. A calm customer may have a complex policy exception requiring human authority.

Data: 70% of sentiment-triggered escalations were simple issues resolved by humans in under 2 minutes. Meanwhile, complex policy questions from calm customers were handled (incorrectly) by the agent because they did not trigger the sentiment threshold.

Emotion and complexity are independent variables. Sentiment routes by the wrong criterion.

Combining Two Weak Signals Does Not Help

| Trigger | Volume | Genuinely needed human |
|---|---|---|
| Sentiment only | 1,200/mo | 35% |
| Confidence only | 800/mo | 28% |
| Both triggered | 400/mo | 42% |
| Explicit criteria | 300/mo | 91% |

Even combining sentiment and confidence produces only 42% precision. Explicit criteria achieve 91%. The right signal dramatically outperforms the wrong ones, no matter how many wrong ones you stack together.

The Replacement: Observable Condition Triggers

Replace sentiment and confidence with explicit, verifiable conditions:

  1. Customer explicitly requests human — direct statement, no interpretation needed
  2. Policy gap detected — no policy covers the scenario
  3. Stalled resolution — no meaningful progress after 2 substantive attempts
  4. Policy exception needed — request requires authority the agent lacks
  5. Safety/compliance issue — deterministic, rules-engine checked

Each trigger maps to a specific, verifiable reason for human involvement — not a proxy signal that might or might not correlate.
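The five triggers above can be sketched as a single routing check. The case fields and helper names here are hypothetical, not from any specific framework; each condition tests an observable fact about the case rather than a model-reported score.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Case:
    customer_requested_human: bool = False   # explicit statement, matched literally
    matched_policy: Optional[str] = None     # None => policy gap
    substantive_attempts: int = 0
    progress_made: bool = False
    needs_policy_exception: bool = False
    safety_flags: list = field(default_factory=list)  # from a deterministic rules engine

def escalation_reason(case: Case) -> Optional[str]:
    """Return a verifiable reason to escalate, or None to let the agent proceed."""
    if case.customer_requested_human:
        return "customer explicitly requested a human"
    if case.matched_policy is None:
        return "no policy covers this scenario"
    if case.substantive_attempts >= 2 and not case.progress_made:
        return "no meaningful progress after 2 substantive attempts"
    if case.needs_policy_exception:
        return "requires authority the agent lacks"
    if case.safety_flags:
        return "safety/compliance: " + ", ".join(case.safety_flags)
    return None
```

Every branch returns a reason a human reviewer can verify against the case record, which is exactly what a confidence score cannot offer.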

For Code Review: Same Principle

A CI system suppressed a critical SQL injection finding (confidence: 0.45) while passing false-positive style issues (confidence: 0.85). Replace confidence-based filtering with explicit severity criteria: define specific conditions for HIGH findings (data loss, security boundary, crash path) so filtering is based on verifiable code characteristics, not self-assessed confidence.
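A minimal sketch of that replacement, assuming findings arrive as dicts with a category label (the category names are illustrative):

```python
# Gate on what a finding *is*, not on its self-assessed confidence.
HIGH_SEVERITY = {"sql-injection", "auth-bypass", "data-loss", "crash-path"}

def should_block(finding: dict) -> bool:
    """Block the merge when the finding's category meets explicit HIGH criteria."""
    return finding["category"] in HIGH_SEVERITY

findings = [
    {"category": "sql-injection", "confidence": 0.45},  # low confidence, still blocks
    {"category": "style-naming", "confidence": 0.85},   # high confidence, passes
]
blocking = [f for f in findings if should_block(f)]
```

Under this gate the 0.45-confidence injection finding blocks the build and the 0.85-confidence style nit does not, the inverse of what a confidence threshold would do.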

For Data Extraction: Programmatic Validation

Replace confidence-based quality gates with deterministic checks:

  • Schema validation (field types, required fields)
  • Cross-field consistency (line items sum to total)
  • Source verification (extracted value appears in source document)

These checks catch errors that confidence scores miss, without the 40% wasted review effort of routing “low-confidence” extractions that are actually correct.
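The three checks above can be sketched as one deterministic gate. The record schema (vendor, total, line_items) is hypothetical, chosen just to show the pattern:

```python
def validate_extraction(record: dict, source_text: str) -> list:
    """Return a list of verifiable problems; an empty list means the record passes."""
    errors = []
    # 1. Schema validation: required fields and types
    if not isinstance(record.get("vendor"), str) or not record.get("vendor"):
        errors.append("missing/invalid vendor")
    if not isinstance(record.get("total"), (int, float)):
        errors.append("missing/invalid total")
    # 2. Cross-field consistency: line items must sum to the stated total
    items = record.get("line_items", [])
    if items and abs(sum(items) - record.get("total", 0)) > 0.01:
        errors.append("line items do not sum to total")
    # 3. Source verification: the extracted total must appear in the document
    if isinstance(record.get("total"), (int, float)) and \
            f"{record['total']:.2f}" not in source_text:
        errors.append("total not found in source document")
    return errors
```

Each failure names the exact check that tripped, so review effort goes only to records with a demonstrable problem.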


One-liner: LLM confidence scores have only 20-point discrimination between highest and lowest bands — replace them with explicit observable-condition triggers (91% precision) instead of tuning thresholds on a fundamentally weak signal (28-42% precision).