K5.5.1 Task 5.5

96% Accuracy, 40% More Escalations: Why Aggregate Metrics Lie

A customer support system extracts order details with 96% overall accuracy. Management signs off on production deployment. Escalations spike 40% over the next quarter. The data extraction team investigates and finds that refund amount accuracy is 82% while customer name accuracy is 99%. The average looks great because easy, high-volume fields inflate it while the hard, high-impact field lags far behind.

This is aggregate accuracy masking failures. It happens in every system that reports a single number across heterogeneous categories.

The Masking Mechanism

Aggregate accuracy is a weighted average. Categories with more volume contribute more to the number. When easy, high-volume categories perform well and hard, low-volume categories perform poorly, the average looks fine while real users experience real failures.

A concrete example from invoice processing:

| Invoice type | Volume | Accuracy |
|---|---|---|
| Printed | 80% | 99% |
| Scanned PDF | 15% | 95% |
| Handwritten | 5% | 72% |
| Aggregate | 100% | 97% |

The 97% aggregate passes any reasonable quality gate. But handwritten invoices have a 28% error rate — more than 1 in 4 contain errors. The finance team manually corrects ~30 invoices per week, almost all handwritten. Their complaints are dismissed because “the system is 97% accurate.”

The pattern repeats across domains:

  • Ticket classification: 96% overall, but the 5% escalation category could be at 50% while billing at 99% props up the average.
  • Code review: 94% overall, but 99% on style violations (high volume) masks 60% on security vulnerabilities (low volume, high impact).
  • Scientific extraction: 95% overall, but abstracts (40% of data, structured format) at 99% conceals methodology sections (15%, complex language) at far lower accuracy.
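The masking arithmetic is easy to reproduce. A minimal sketch, with a hypothetical category mix loosely modeled on the ticket-classification example above:

```python
def aggregate_accuracy(categories):
    """Volume-weighted average accuracy across categories."""
    total = sum(c["volume"] for c in categories)
    return sum(c["volume"] * c["accuracy"] for c in categories) / total

# Hypothetical mix: high-volume easy categories mask one weak category.
categories = [
    {"name": "billing",    "volume": 700, "accuracy": 0.99},
    {"name": "shipping",   "volume": 250, "accuracy": 0.97},
    {"name": "escalation", "volume": 50,  "accuracy": 0.50},
]

print(f"aggregate: {aggregate_accuracy(categories):.1%}")
for c in categories:
    print(f"  {c['name']}: {c['accuracy']:.0%}")
```

The aggregate lands near 96% even though the escalation category is at coin-flip accuracy, because escalations are only 5% of the volume.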

Disaggregate Before You Deploy

The fix is not complex: break the number down before making deployment decisions.

A data extraction system reports 96% field accuracy. Before expanding to refund processing, the team should validate per field:

| Field | Accuracy | Volume |
|---|---|---|
| Customer name | 99% | 40% |
| Order number | 98% | 30% |
| Promo code | 98% | 15% |
| Refund amount | 82% | 15% |
| Aggregate | 96% | 100% |

Refund processing depends on correct order numbers and exact refund amounts. Order numbers at 98% are solid. Refund amounts at 82% are not — 18% of refunds would process the wrong amount. The aggregate of 96% would have approved this for production.

Running a larger test set does not help. Confirming 96% across 50,000 messages instead of 10,000 narrows the confidence interval on the wrong number. The aggregate is not imprecise — it is misleading. More data makes the misleading number more statistically robust.
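The statistical point can be checked directly. A sketch using the normal-approximation 95% confidence interval for a proportion (sample sizes are illustrative):

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """Normal-approximation 95% CI half-width for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Aggregate over 50,000 messages: a very tight interval around
# the wrong number (~0.0017, i.e. 96% +/- 0.2 points).
print(ci_halfwidth(0.96, 50_000))

# Refund amount at 15% of volume: 7,500 samples already pin down
# the 82% figure to well under a point. Precision was never the
# problem; disaggregation was.
print(ci_halfwidth(0.82, 7_500))
```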

Per-Dimension Validation

For systems with multiple dimensions (language, issue type, severity), full cross-product validation is impractical. A code review agent has 4 languages × 4 issue types × 3 severities = 48 combinations. Many rare combinations (Rust × performance × minor) have too few samples for reliable measurement.

The practical approach: validate each dimension independently.

  • 4 language checks: Python, JavaScript, Go, Rust
  • 4 issue type checks: bugs, security, performance, style
  • 3 severity checks: critical, major, minor
  • 11 checks total, not 48

This catches failures like “poor Go accuracy” or “weak security detection” without requiring impractical sample sizes per cell. Critical dimensions (security, critical severity) should have higher thresholds than low-risk ones, reflecting the actual cost of failures.
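One way to implement the marginal per-dimension checks, assuming each validation sample is labeled with its dimension values plus whether the review was correct (the sample data here is hypothetical):

```python
from collections import defaultdict

def per_dimension_accuracy(samples, dimensions):
    """Marginal accuracy per dimension value: 11 checks, not 48 cells.

    samples: list of dicts with dimension values and a boolean 'correct'.
    """
    stats = defaultdict(lambda: [0, 0])  # (dimension, value) -> [correct, total]
    for s in samples:
        for dim in dimensions:
            key = (dim, s[dim])
            stats[key][0] += s["correct"]
            stats[key][1] += 1
    return {key: correct / total for key, (correct, total) in stats.items()}

# Hypothetical review results.
samples = [
    {"language": "go",     "issue": "security", "severity": "critical", "correct": False},
    {"language": "go",     "issue": "style",    "severity": "minor",    "correct": True},
    {"language": "python", "issue": "security", "severity": "major",    "correct": True},
]
acc = per_dimension_accuracy(samples, ["language", "issue", "severity"])
print(acc[("language", "go")])  # 0.5
```

Each sample contributes to one cell per dimension, so even rare cross-product cells feed usable sample sizes into their marginal checks.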

Validating only the top 5 most common combinations is another trap. Rare but critical combinations (Go × security × critical) go unvalidated despite being the highest-risk category.

Per-Category Monitoring in Production

Disaggregation is not a one-time exercise. Category-specific accuracy can degrade independently over time — a framework update might break security test recommendations while everything else stays stable.

An aggregate monitoring threshold of 90% misses this:

| Scenario | Security | Other categories | Aggregate |
|---|---|---|---|
| Baseline | 90% | 94% | 93% |
| After framework update | 70% | 94% | 91% |

The aggregate drops from 93% to 91% — still above 90%, no alert fires. Meanwhile, security test accuracy collapsed by 20 points.

Per-category monitoring with independent thresholds catches this immediately. When security drops from 90% to 70%, its dedicated alert triggers regardless of what the aggregate does.
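A minimal sketch of per-category alerting with independent thresholds (the threshold values and category names are illustrative):

```python
# Hypothetical thresholds; security gets a tighter bar than style.
thresholds = {"security": 0.85, "bugs": 0.85, "style": 0.80, "aggregate": 0.90}

def check_alerts(accuracies, thresholds):
    """Return every category whose accuracy fell below its own threshold."""
    return [cat for cat, acc in accuracies.items()
            if acc < thresholds.get(cat, thresholds["aggregate"])]

after_update = {"security": 0.70, "bugs": 0.94, "style": 0.95, "aggregate": 0.91}
print(check_alerts(after_update, thresholds))  # ['security'] -- aggregate stays quiet
```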

Per-Category Quality Gates

A single aggregate threshold for the entire system has no enforcement power over individual categories. A pipeline processing contracts (91%), invoices (98%), and receipts (96%) reports 95.3% overall — passing a 95% gate while contracts sit at 91%.

The correct design: quality gates per document type, with thresholds calibrated to the cost of errors in each category. Contracts need higher accuracy than receipts because contract errors have larger legal and financial consequences.

Reporting per-type accuracy alongside an aggregate threshold is a half-measure. The gate itself must be per-type to have enforcement power. Otherwise, a category can officially “pass” by riding the average.
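A per-type gate can be as simple as a dictionary of thresholds consulted at release time. This sketch reuses the document-type numbers from the example above; the threshold values themselves are hypothetical:

```python
# Per-type gates calibrated to error cost: contracts are strictest.
gates = {"contract": 0.97, "invoice": 0.95, "receipt": 0.93}

def release_decision(per_type_accuracy, gates):
    """Block the release if any document type misses its own gate."""
    failing = {t: acc for t, acc in per_type_accuracy.items() if acc < gates[t]}
    return ("block", failing) if failing else ("ship", {})

measured = {"contract": 0.91, "invoice": 0.98, "receipt": 0.96}
decision, failing = release_decision(measured, gates)
print(decision, failing)  # block {'contract': 0.91} -- despite a passing aggregate
```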

The Escalation Disconnect

When users complain despite good aggregate numbers, the first instinct is to dismiss the complaints. “The system is 96% accurate” feels like a definitive rebuttal. It is not.

A support system at 96% overall with refund amounts at 82% saw escalations rise 40% in a quarter. Investigation revealed 85% of escalations involved incorrect refund amounts — a direct causal link from one underperforming field to a measurable customer impact. The aggregate of 96% was real, and the customer pain was also real. Both were true simultaneously because the aggregate hid where the errors clustered.

When complaints cluster on a specific category or field, the right response is to disaggregate, not to point at the aggregate as evidence that the system works.

Raising the Aggregate Does Not Fix the Category

A CI/CD code review agent at 92% overall missed a SQL injection vulnerability. The instinct: “We need to improve accuracy to 98%.” But if security issues are 8% of review volume, even an eight-point jump in security detection moves the aggregate by only 0.08 × 8 ≈ 0.6 points, from 92% to about 92.6%. The security improvement is invisible in the aggregate, and any effort to raise the aggregate by 6 points will naturally focus on high-volume categories, where small improvements move the number most.
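The volume-weighting arithmetic behind this can be made explicit (the volume shares here are hypothetical):

```python
def aggregate_gain(volume_share, point_gain):
    """Points added to the aggregate by improving one category's accuracy."""
    return volume_share * point_gain

# Optimizing the aggregate rewards the high-volume category:
print(aggregate_gain(0.60, 2.0))  # style +2 pts    -> aggregate +1.2 pts
print(aggregate_gain(0.08, 8.0))  # security +8 pts -> aggregate +0.64 pts
```

A two-point style improvement moves the aggregate almost twice as much as an eight-point security improvement, which is exactly the wrong incentive.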

The fix must target the specific category. Disaggregate first, identify the gap, then invest in the specific dimension that is underperforming. The aggregate follows — but it is the wrong metric to optimize against.


One-liner: Always break aggregate accuracy down by category before making decisions — a 96% average can hide an 82% failure in the field that matters most.