Model-reported confidence labels (HIGH/MEDIUM/LOW) must be calibrated against a labeled validation set before being used for production routing. Uncalibrated confidence is misleading: in one validation, “HIGH confidence” security findings had a 30% false positive rate, versus 5% for style findings.
## Calibration Data
After validating against 100+ labeled findings:
| Reported confidence | Actual accuracy |
|---|---|
| HIGH | 92% (8% FP) |
| MEDIUM | 71% (29% FP) |
| LOW | 45% (55% FP) |
These numbers vary by category. A global threshold that auto-approves “HIGH confidence” findings works for style checks (95% accurate at “HIGH”) but fails for security checks (only 70% accurate at “HIGH”).
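The calibration step itself is simple bookkeeping: group labeled findings by reported confidence and compare the label against observed accuracy. A minimal sketch, assuming a hypothetical `(reported_confidence, was_correct)` record format for the labeled validation set:

```python
from collections import defaultdict

def calibrate(findings):
    """Observed accuracy of each model-reported confidence label.

    `findings` is a list of (reported_confidence, was_correct) pairs,
    e.g. ("HIGH", True) — a hypothetical schema for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for confidence, was_correct in findings:
        total[confidence] += 1
        correct[confidence] += was_correct
    return {label: correct[label] / total[label] for label in total}

# Toy validation set; real calibration needs 100+ labeled findings.
sample = (
    [("HIGH", True)] * 23 + [("HIGH", False)] * 2
    + [("MEDIUM", True)] * 5 + [("MEDIUM", False)] * 2
    + [("LOW", True)] * 1 + [("LOW", False)] * 1
)
accuracy = calibrate(sample)  # HIGH 0.92, MEDIUM ~0.71, LOW 0.50
```

The output of `calibrate` is the middle column of the table above; the false-positive rate is just `1 - accuracy`.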
## Per-Category Calibration
Different review categories need different confidence thresholds:
| Category | “HIGH” accuracy | Safe to auto-approve? |
|---|---|---|
| Style | 95% | Yes |
| Correctness | 88% | Maybe |
| Security | 70% | No |
A single global threshold cannot serve all categories. Calibrate each category independently using labeled data.
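Per-category calibration is the same bookkeeping keyed by category. A sketch that derives the auto-approve set from labeled data, assuming a hypothetical `(category, reported_confidence, was_correct)` schema and an illustrative 90% accuracy bar:

```python
def auto_approve_categories(findings, threshold=0.90):
    """Categories whose 'HIGH'-confidence findings meet `threshold`
    accuracy on labeled data. Schema and threshold are illustrative."""
    correct, total = {}, {}
    for category, confidence, was_correct in findings:
        if confidence != "HIGH":
            continue  # only 'HIGH' findings are candidates for auto-approval
        total[category] = total.get(category, 0) + 1
        correct[category] = correct.get(category, 0) + was_correct
    return {c for c in total if correct[c] / total[c] >= threshold}

# Toy data mirroring the table: style 95% accurate, security 70%.
labeled = (
    [("style", "HIGH", True)] * 19 + [("style", "HIGH", False)] * 1
    + [("security", "HIGH", True)] * 7 + [("security", "HIGH", False)] * 3
)
approved = auto_approve_categories(labeled)  # {"style"}
```

With this data, style clears the bar and security does not, matching the table: the threshold is applied per category, never globally.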
## Routing Strategy
Combine confidence and severity for intelligent routing:
- High confidence + low severity → auto-approve
- Low confidence + high severity → mandatory human review
- High confidence + high severity → human review (severity overrides confidence)
- Low confidence + low severity → batch for periodic review
Neither confidence alone nor severity alone enables this routing. Both dimensions together allocate limited human review bandwidth effectively.
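The four-way matrix above collapses to a short decision function. A sketch, assuming the calibrated per-category accuracy is looked up before routing; the 0.90 cutoff and severity labels are illustrative:

```python
def route(calibrated_accuracy, severity):
    """Route a finding using calibrated confidence plus severity.

    `calibrated_accuracy`: observed accuracy for this finding's
    (category, confidence-label) pair; `severity`: e.g. "low"/"high".
    Thresholds and labels are illustrative, not a fixed API.
    """
    confident = calibrated_accuracy >= 0.90
    severe = severity in ("high", "critical")
    if severe:
        return "human_review"   # severity overrides confidence
    if confident:
        return "auto_approve"   # high confidence + low severity
    return "batch_review"       # low confidence + low severity
```

Note that both high-severity rows of the matrix land in `human_review`, which is why the severity check comes first.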
## Recalibration
Confidence accuracy drifts over time as code patterns and prompt criteria evolve. Recalibrate quarterly against fresh labeled data. A threshold that worked 6 months ago may no longer be accurate.
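A drift check makes the quarterly recalibration mechanical: recompute accuracy on fresh labeled data and flag any label that has moved beyond a tolerance. A sketch with an illustrative 5-point tolerance and the same hypothetical `(label, was_correct)` schema as earlier:

```python
def drifted_labels(old_accuracy, fresh_findings, tolerance=0.05):
    """Labels whose accuracy on fresh labeled data has moved more than
    `tolerance` since the last calibration. Tolerance is illustrative."""
    correct, total = {}, {}
    for label, was_correct in fresh_findings:
        total[label] = total.get(label, 0) + 1
        correct[label] = correct.get(label, 0) + was_correct
    drifted = {}
    for label, old in old_accuracy.items():
        if label in total:
            fresh = correct[label] / total[label]
            if abs(fresh - old) > tolerance:
                drifted[label] = fresh  # new accuracy, needs re-thresholding
    return drifted

old = {"HIGH": 0.92, "MEDIUM": 0.71}
fresh = [("HIGH", True)] * 8 + [("HIGH", False)] * 2  # HIGH now 0.80
stale = drifted_labels(old, fresh)  # {"HIGH": 0.8}
```

Any label in the returned dict should trigger a full per-category recalibration before its routing threshold is trusted again.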
One-liner: Calibrate confidence labels per category against labeled data before trusting them for routing — uncalibrated “HIGH” means 92% for style but only 70% for security, and a global threshold fails both.