K5.2.3 Task 5.2

High confidence (0.9+): 12% errors. Low confidence (<0.5): 68% correct. The signal is broken.

LLM self-reported confidence scores are poorly calibrated for escalation decisions. The model can report 0.9 confidence while giving incorrect policy advice, and 0.3 confidence on a routine return it could resolve easily. No threshold setting fixes this — the signal itself does not reliably correlate with whether escalation is needed.

The Calibration Data

5,000 cases with logged confidence alongside actual outcomes:

| Confidence band | Correctly handled | Error rate |
|---|---|---|
| 0.9–1.0 | 88% | 12% |
| 0.7–0.9 | 79% | 21% |
| 0.5–0.7 | 74% | 26% |
| 0.0–0.5 | 68% | 32% |

The range is only 20 percentage points (88% to 68%). Even at the highest confidence, 12% of cases are errors. At the lowest confidence, 68% are still correct. This weak discrimination means no threshold produces sharp separation between cases needing escalation and those that do not.

Why Thresholds Cannot Fix This

At a 0.7 threshold:

  • False escalations: 74% of cases in the 0.5–0.7 band were correct but escalated anyway (wasted human time)
  • Missed errors: 12% of cases at 0.9+ were wrong yet passed through undetected

Lowering the threshold catches more errors but escalates more correct cases. Raising it reduces unnecessary escalations but misses more errors. The tradeoff is inescapable because the signal itself is weak.
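The tradeoff can be made concrete with a small sketch. It uses the accuracy figures from the calibration table; the per-band case counts are hypothetical (the source reports 5,000 total cases but not the split), so the absolute numbers are illustrative.

```python
# Calibration bands from the table above, with a hypothetical even split
# of the 5,000 logged cases across bands.
BANDS = [
    # (low, high, correct_rate, cases)
    (0.9, 1.0, 0.88, 1250),
    (0.7, 0.9, 0.79, 1250),
    (0.5, 0.7, 0.74, 1250),
    (0.0, 0.5, 0.68, 1250),
]

def threshold_tradeoff(threshold):
    """Escalate cases below `threshold`. Returns:
    - missed: wrong answers that pass through (confidence >= threshold)
    - wasted: correct answers escalated anyway (confidence < threshold)
    """
    missed = sum(c * (1 - r) for lo, hi, r, c in BANDS if lo >= threshold)
    wasted = sum(c * r for lo, hi, r, c in BANDS if hi <= threshold)
    return missed, wasted

for t in (0.5, 0.7, 0.9):
    missed, wasted = threshold_tradeoff(t)
    print(f"threshold {t}: {missed:.0f} errors pass through, "
          f"{wasted:.0f} correct cases escalated")
```

Sweeping the threshold only slides volume between the two failure modes; no setting shrinks both, because the bands' accuracies differ by just 20 points.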

Sentiment Analysis: Same Problem, Different Signal

Customer emotion does not correlate with case complexity. A frustrated customer may have a simple late-delivery issue resolvable in seconds. A calm customer may have a complex policy exception requiring human authority.

Data: 70% of sentiment-triggered escalations were simple issues resolved by humans in under 2 minutes. Meanwhile, complex policy questions from calm customers were handled (incorrectly) by the agent because they did not trigger the sentiment threshold.

Emotion and complexity are independent variables. Sentiment routes by the wrong criterion.

Combining Two Weak Signals Does Not Help

| Trigger | Volume | Genuinely needed human |
|---|---|---|
| Sentiment only | 1,200/mo | 35% |
| Confidence only | 800/mo | 28% |
| Both triggered | 400/mo | 42% |
| Explicit criteria | 300/mo | 91% |

Even combining sentiment and confidence produces only 42% precision. Explicit criteria achieve 91%. The right signal dramatically outperforms the wrong ones, no matter how many wrong ones you stack together.

The Replacement: Observable Condition Triggers

Replace sentiment and confidence with explicit, verifiable conditions:

  1. Customer explicitly requests human — direct statement, no interpretation needed
  2. Policy gap detected — no policy covers the scenario
  3. Stalled resolution — no meaningful progress after 2 substantive attempts
  4. Policy exception needed — request requires authority the agent lacks
  5. Safety/compliance issue — deterministic, rules-engine checked

Each trigger maps to a specific, verifiable reason for human involvement — not a proxy signal that might or might not correlate.
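The five triggers above can be sketched as a single routing check. The case fields and helper names here are hypothetical, not from any specific framework; each condition tests an observable fact about the case rather than a model-reported score.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Case:
    customer_requested_human: bool = False   # explicit statement, matched literally
    matched_policy: Optional[str] = None     # None => policy gap
    substantive_attempts: int = 0
    progress_made: bool = False
    needs_policy_exception: bool = False
    safety_flags: list = field(default_factory=list)  # from a deterministic rules engine

def escalation_reason(case: Case) -> Optional[str]:
    """Return a verifiable reason to escalate, or None to let the agent proceed."""
    if case.customer_requested_human:
        return "customer explicitly requested a human"
    if case.matched_policy is None:
        return "no policy covers this scenario"
    if case.substantive_attempts >= 2 and not case.progress_made:
        return "no meaningful progress after 2 substantive attempts"
    if case.needs_policy_exception:
        return "requires authority the agent lacks"
    if case.safety_flags:
        return "safety/compliance: " + ", ".join(case.safety_flags)
    return None
```

Every branch returns a reason a human reviewer can verify against the case record, which is exactly what a confidence score cannot offer.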

For Code Review: Same Principle

A CI system suppressed a critical SQL injection finding (confidence: 0.45) while passing false-positive style issues (confidence: 0.85). Replace confidence-based filtering with explicit severity criteria: define specific conditions for HIGH findings (data loss, security boundary, crash path) so filtering is based on verifiable code characteristics, not self-assessed confidence.
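A minimal sketch of that replacement, assuming findings arrive as dicts with a category label (the category names are illustrative):

```python
# Gate on what a finding *is*, not on its self-assessed confidence.
HIGH_SEVERITY = {"sql-injection", "auth-bypass", "data-loss", "crash-path"}

def should_block(finding: dict) -> bool:
    """Block the merge when the finding's category meets explicit HIGH criteria."""
    return finding["category"] in HIGH_SEVERITY

findings = [
    {"category": "sql-injection", "confidence": 0.45},  # low confidence, still blocks
    {"category": "style-naming", "confidence": 0.85},   # high confidence, passes
]
blocking = [f for f in findings if should_block(f)]
```

Under this gate the 0.45-confidence injection finding blocks the build and the 0.85-confidence style nit does not, the inverse of what a confidence threshold would do.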

For Data Extraction: Programmatic Validation

Replace confidence-based quality gates with deterministic checks:

  • Schema validation (field types, required fields)
  • Cross-field consistency (line items sum to total)
  • Source verification (extracted value appears in source document)

These checks catch errors that confidence scores miss, without the 40% wasted review effort of routing “low-confidence” extractions that are actually correct.
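The three checks above can be sketched as one deterministic gate. The record schema (vendor, total, line_items) is hypothetical, chosen just to show the pattern:

```python
def validate_extraction(record: dict, source_text: str) -> list:
    """Return a list of verifiable problems; an empty list means the record passes."""
    errors = []
    # 1. Schema validation: required fields and types
    if not isinstance(record.get("vendor"), str) or not record.get("vendor"):
        errors.append("missing/invalid vendor")
    if not isinstance(record.get("total"), (int, float)):
        errors.append("missing/invalid total")
    # 2. Cross-field consistency: line items must sum to the stated total
    items = record.get("line_items", [])
    if items and abs(sum(items) - record.get("total", 0)) > 0.01:
        errors.append("line items do not sum to total")
    # 3. Source verification: the extracted total must appear in the document
    if isinstance(record.get("total"), (int, float)) and \
            f"{record['total']:.2f}" not in source_text:
        errors.append("total not found in source document")
    return errors
```

Each failure names the exact check that tripped, so review effort goes only to records with a demonstrable problem.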


One-liner: LLM confidence scores have only 20-point discrimination between highest and lowest bands — replace them with explicit observable-condition triggers (91% precision) instead of tuning thresholds on a fundamentally weak signal (28-42% precision).