Model-reported confidence labels (HIGH/MEDIUM/LOW) must be calibrated against a labeled validation set before being used for production routing. Uncalibrated confidence is misleading: in one validation, “HIGH confidence” security findings had a 30% false positive rate, versus 5% for style findings.
## Calibration Data
After validating against 100+ labeled findings:
| Reported confidence | Actual accuracy |
|---|---|
| HIGH | 92% (8% FP) |
| MEDIUM | 71% (29% FP) |
| LOW | 45% (55% FP) |
These numbers vary by category. A global threshold that auto-approves “HIGH confidence” findings works for style checks (95% accurate at “HIGH”) but fails for security checks (only 70% accurate at “HIGH”).
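The calibration step itself is simple bookkeeping: group labeled findings by reported confidence and compare the label against observed accuracy. A minimal sketch, assuming a hypothetical `(reported_confidence, was_correct)` record format for the labeled validation set:

```python
from collections import defaultdict

def calibrate(findings):
    """Observed accuracy of each model-reported confidence label.

    `findings` is a list of (reported_confidence, was_correct) pairs,
    e.g. ("HIGH", True) — a hypothetical schema for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for confidence, was_correct in findings:
        total[confidence] += 1
        correct[confidence] += was_correct
    return {label: correct[label] / total[label] for label in total}

# Toy validation set; real calibration needs 100+ labeled findings.
sample = (
    [("HIGH", True)] * 23 + [("HIGH", False)] * 2
    + [("MEDIUM", True)] * 5 + [("MEDIUM", False)] * 2
    + [("LOW", True)] * 1 + [("LOW", False)] * 1
)
accuracy = calibrate(sample)  # HIGH 0.92, MEDIUM ~0.71, LOW 0.50
```

The output of `calibrate` is the middle column of the table above; the false-positive rate is just `1 - accuracy`.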
## Per-Category Calibration
Different review categories need different confidence thresholds:
| Category | “HIGH” accuracy | Safe to auto-approve? |
|---|---|---|
| Style | 95% | Yes |
| Correctness | 88% | Maybe |
| Security | 70% | No |
A single global threshold cannot serve all categories. Calibrate each category independently using labeled data.
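Per-category calibration is the same bookkeeping keyed by category. A sketch that derives the auto-approve set from labeled data, assuming a hypothetical `(category, reported_confidence, was_correct)` schema and an illustrative 90% accuracy bar:

```python
def auto_approve_categories(findings, threshold=0.90):
    """Categories whose 'HIGH'-confidence findings meet `threshold`
    accuracy on labeled data. Schema and threshold are illustrative."""
    correct, total = {}, {}
    for category, confidence, was_correct in findings:
        if confidence != "HIGH":
            continue  # only 'HIGH' findings are candidates for auto-approval
        total[category] = total.get(category, 0) + 1
        correct[category] = correct.get(category, 0) + was_correct
    return {c for c in total if correct[c] / total[c] >= threshold}

# Toy data mirroring the table: style 95% accurate, security 70%.
labeled = (
    [("style", "HIGH", True)] * 19 + [("style", "HIGH", False)] * 1
    + [("security", "HIGH", True)] * 7 + [("security", "HIGH", False)] * 3
)
approved = auto_approve_categories(labeled)  # {"style"}
```

With this data, style clears the bar and security does not, matching the table: the threshold is applied per category, never globally.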
## Routing Strategy
Combine confidence and severity for intelligent routing:
- High confidence + low severity → auto-approve
- Low confidence + high severity → mandatory human review
- High confidence + high severity → human review (severity overrides confidence)
- Low confidence + low severity → batch for periodic review
Neither confidence alone nor severity alone enables this routing. Both dimensions together allocate limited human review bandwidth effectively.
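The four-way matrix above collapses to a short decision function. A sketch, assuming the calibrated per-category accuracy is looked up before routing; the 0.90 cutoff and severity labels are illustrative:

```python
def route(calibrated_accuracy, severity):
    """Route a finding using calibrated confidence plus severity.

    `calibrated_accuracy`: observed accuracy for this finding's
    (category, confidence-label) pair; `severity`: e.g. "low"/"high".
    Thresholds and labels are illustrative, not a fixed API.
    """
    confident = calibrated_accuracy >= 0.90
    severe = severity in ("high", "critical")
    if severe:
        return "human_review"   # severity overrides confidence
    if confident:
        return "auto_approve"   # high confidence + low severity
    return "batch_review"       # low confidence + low severity
```

Note that both high-severity rows of the matrix land in `human_review`, which is why the severity check comes first.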
## Recalibration
Confidence accuracy drifts over time as code patterns and prompt criteria evolve. Recalibrate quarterly against fresh labeled data. A threshold that worked 6 months ago may no longer be accurate.
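A drift check makes the quarterly recalibration mechanical: recompute accuracy on fresh labeled data and flag any label that has moved beyond a tolerance. A sketch with an illustrative 5-point tolerance and the same hypothetical `(label, was_correct)` schema as earlier:

```python
def drifted_labels(old_accuracy, fresh_findings, tolerance=0.05):
    """Labels whose accuracy on fresh labeled data has moved more than
    `tolerance` since the last calibration. Tolerance is illustrative."""
    correct, total = {}, {}
    for label, was_correct in fresh_findings:
        total[label] = total.get(label, 0) + 1
        correct[label] = correct.get(label, 0) + was_correct
    drifted = {}
    for label, old in old_accuracy.items():
        if label in total:
            fresh = correct[label] / total[label]
            if abs(fresh - old) > tolerance:
                drifted[label] = fresh  # new accuracy, needs re-thresholding
    return drifted

old = {"HIGH": 0.92, "MEDIUM": 0.71}
fresh = [("HIGH", True)] * 8 + [("HIGH", False)] * 2  # HIGH now 0.80
stale = drifted_labels(old, fresh)  # {"HIGH": 0.8}
```

Any label in the returned dict should trigger a full per-category recalibration before its routing threshold is trusted again.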
One-liner: Calibrate confidence labels per category against labeled data before trusting them for routing — uncalibrated “HIGH” means 92% for style but only 70% for security, and a global threshold fails both.