You sample 100 extractions per month from 10,000 processed messages. Random selection. The distribution: billing 60%, shipping 25%, returns 10%, warranty claims 5%. Your sample contains ~60 billing, ~25 shipping, ~10 returns, and ~5 warranty. Five warranty samples per month. If warranty extraction accuracy is 68%, roughly 2 of those 5 would show errors — statistically indistinguishable from noise. The monitoring system never flags it.
This is the fundamental problem with random sampling for quality monitoring: rare categories get too few samples to produce reliable accuracy measurements.
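The starvation is easy to make concrete. A back-of-envelope sketch using the numbers above (100 samples/month, the category shares from the opening example) shows that a 5%-volume category yields ~5 samples, and the expected error counts at 68% versus 95% accuracy differ by barely one error, well inside monthly noise:

```python
# Expected per-category samples and error counts under random sampling,
# using the numbers from the text: 100 samples/month, 5% warranty volume.
volume_share = {"billing": 0.60, "shipping": 0.25, "returns": 0.10, "warranty": 0.05}
budget = 100

expected = {}
for category, share in volume_share.items():
    n = budget * share  # expected samples for this category
    expected[category] = {
        "samples": n,
        "errors_if_68pct": n * (1 - 0.68),  # a real quality problem
        "errors_if_95pct": n * (1 - 0.95),  # a healthy baseline
    }

# warranty: ~5 samples, ~1.6 expected errors vs ~0.25 at a healthy baseline;
# a roughly one-error gap that month-to-month noise swallows entirely.
```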
## Stratified Sampling: Same Cost, Better Coverage
Stratified sampling divides the population into strata (categories) and samples from each independently, guaranteeing minimum representation for every category.
A direct comparison at identical cost:
| Approach | Total samples | Rare-category coverage | Issues detected |
|---|---|---|---|
| Random (100/month) | 100 | ~3-8 per rare category | 0 of 3 known issues over 4 months |
| Stratified (100/month) | 100 | 15-20 per rare category | All 3 issues in month 1 |
Same total sample budget. Same human review cost. The only difference is allocation strategy. Stratified sampling guarantees each category gets enough samples (15-20) to detect quality problems reliably.
The three issues detected by stratification were in categories under 10% of volume: handwritten invoices (8%), foreign-language contracts (4%), multi-page receipts (3%). Random sampling would need hundreds or thousands of total samples before these categories accumulated enough individual samples to detect their specific error patterns.
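A minimal selection sketch, assuming each item carries a category label; the grouping function and the floor of 15 are illustrative choices, not a prescribed API:

```python
import random
from collections import defaultdict

def stratified_sample(items, category_of, per_stratum=15, seed=None):
    """Guarantee up to `per_stratum` reviewed items from every category,
    no matter how rare the category is in overall volume."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[category_of(item)].append(item)  # group population by stratum
    sample = []
    for members in strata.values():
        # sample each stratum independently, capped by its size
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

With the 10,000-message population from the opening example, random sampling would hand a 5%-volume category ~5 reviews; this guarantees the floor regardless of volume share.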
## Why “Just Increase the Sample Size” Is Wasteful
The instinct when random sampling misses something: “Sample more.” A team sampling 60 tests per month (TypeScript 65%, Java 20%, Python 15%) gets ~12 Java samples. They missed a Java quality issue. The proposal: increase to 200.
At 200 random samples, Java gets ~40. Better, but inefficient: the budget tripled to fix an allocation problem. Stratified sampling with the original 60, setting a minimum of 20 per language, guarantees 20 Java samples at the original cost, enough to detect a meaningful accuracy drop without expanding the review workload.
Tripling the budget to fix a methodology problem is the wrong lever.
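The arithmetic behind that comparison, as a quick sketch using the language shares from the text:

```python
# Expected per-language sample counts: random sampling at two budgets
# versus stratified sampling at the original budget.
shares = {"typescript": 0.65, "java": 0.20, "python": 0.15}

random_60  = {lang: round(60 * s) for lang, s in shares.items()}   # Java: ~12
random_200 = {lang: round(200 * s) for lang, s in shares.items()}  # Java: ~40, at 3.3x cost
stratified_60 = {lang: 20 for lang in shares}                      # Java: 20, at 1x cost
```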
## Multi-Level Stratification
Single-level stratification (by document type, by language) catches category-level problems. But errors can cluster at finer granularities.
**Level 1: Category.** A billing category drops from 98% to 87%. Stratified sampling detects this reliably with 40 billing samples per month.

**Level 2: Sub-category.** Investigation reveals a billing system migration changed 30% of message formats. The old format is still at 98%. The new format is at 82% overall, with the amount field specifically at 71%. Type-level stratification caught the overall drop, but it could not identify the sub-category pattern without deeper breakdown.

**Level 3: Framework-specific.** A code review agent stratified by language shows JavaScript at 91%. Acceptable. But within JavaScript, React Server Components (10% of JS reviews) are at 62% accuracy; the other JS frameworks average 94%. Language-level stratification missed this because RSC is a small fraction of a large category.
The solution: multi-level stratification. Language → framework, or document type → format → field. Each level adds strata with minimum sample allocations. A drop in any stratum triggers investigation.
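One way to sketch this: aggregate accuracy at every prefix depth of a hierarchical key (language, then framework), so a drop inside a sub-stratum is visible even when its parent looks healthy. The record shape and key functions here are assumptions for illustration, not a specific library's API:

```python
from collections import defaultdict

def accuracy_by_stratum(records, key_fns):
    """Aggregate accuracy at every stratification depth.
    `records` are (metadata, is_correct) pairs; `key_fns` is an ordered
    list of functions extracting each level's key from the metadata,
    e.g. language first, then framework."""
    counts = defaultdict(lambda: [0, 0])          # stratum key -> [correct, total]
    for meta, correct in records:
        for depth in range(1, len(key_fns) + 1):  # (lang,), (lang, framework), ...
            key = tuple(fn(meta) for fn in key_fns[:depth])
            counts[key][0] += int(correct)
            counts[key][1] += 1
    return {key: good / total for key, (good, total) in counts.items()}
```

Reviewing accuracy per key surfaces a stratum like `('javascript', 'rsc')` at 62% even when `('javascript',)` reads a healthy 91%.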
## Accuracy Cannot Be Extrapolated Across Types
A manager proposes: “Review only invoices (70% of volume) and extrapolate quality to contracts and receipts.” This assumes extraction accuracy transfers across document types. It does not.
Invoices are tabular with predictable field positions. Contracts contain prose with legal clauses and amendments. Receipts have variable formats across vendors. Each type produces distinct error patterns that the others do not predict. High invoice accuracy does not imply high contract accuracy — the extraction challenges are fundamentally different.
Stratified sampling exists precisely because categories behave differently. If they did not, a single aggregate metric would suffice.
## Confidence Scores Are Not a Substitute
An alternative to stratified sampling: “Only review extractions where the model’s confidence is below 80%.” This catches errors the model knows about. It misses the more dangerous case: confident errors.
On unfamiliar patterns — a new document format, an unusual message structure, a framework the model was not trained on — the model may produce high-confidence but incorrect results. RSC code reviews at 62% accuracy may come with 95% confidence scores because the model treats RSC like standard React and does not recognize the paradigm differences.
Confidence-based review complements stratified sampling but cannot replace it. The categories where the model is confidently wrong are the ones most likely to go undetected without stratified monitoring.
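A sketch of how the two paths combine, assuming each extraction is a record with hypothetical `id` and `confidence` fields and that some stratified sampler is already available:

```python
def review_queue(extractions, stratified_sample_fn, confidence_floor=0.80):
    """Combine two complementary review paths: items the model itself is
    unsure about, plus a stratified sample that catches the categories
    where the model is confidently wrong."""
    low_confidence = [e for e in extractions if e["confidence"] < confidence_floor]
    stratified = stratified_sample_fn(extractions)
    seen, queue = set(), []
    for e in low_confidence + stratified:  # dedupe: review each item once
        if e["id"] not in seen:
            seen.add(e["id"])
            queue.append(e)
    return queue
```

Note that a 95%-confidence RSC review sails past the confidence filter; only the stratified path puts it in front of a human.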
## Adaptive Stratification for Growing Systems
A codebase grows: 2 new modules per quarter, occasional new framework adoption. Fixed stratification categories become stale. New modules start as a small fraction of volume and their low accuracy is masked by established modules’ high scores.
Adaptive stratified sampling solves this:
- Detect new categories automatically (new module paths, new framework indicators)
- Add them as strata with minimum sample allocations from their first month
- Flag initial accuracy for review — new categories are at highest risk of quality issues
- Adjust allocations as categories mature and their accuracy stabilizes
This scales without manual configuration updates. Annual reviews of stratification categories leave 3+ quarters of blind spots for modules added in between. The monitoring methodology should detect new patterns systematically, not depend on organizational communication.
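A minimal registration sketch for the first two steps; the stratum record fields (`min_samples`, `flag_for_review`) are illustrative:

```python
def register_new_strata(strata, observed_categories, default_min=15):
    """Add any category observed this month that has no stratum yet,
    giving it a minimum allocation from its first month and flagging
    its initial accuracy for human review."""
    added = sorted(set(observed_categories) - set(strata))
    for category in added:
        strata[category] = {"min_samples": default_min, "flag_for_review": True}
    return added  # newly detected categories, for the monthly report
```

Running this against each month's observed module paths or framework indicators keeps the strata current without waiting for an annual review.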
## Practical Allocation
With a fixed review budget (say, 150 samples per month) and multiple strata:
- Minimum per stratum: enough to detect a meaningful accuracy drop (15-20 samples is a reasonable floor for detecting a 10+ point shift)
- Above-minimum allocation: distribute proportionally by volume or by risk — higher risk categories (financial fields, security reviews) can get proportionally more
- Review all fields: within each sampled item, review all extracted fields, not just the most important one. A billing extraction might get the amount right but consistently misparse dates — single-field review would miss this.
The goal is not equal samples per category, which oversamples rare categories relative to their volume while starving high-volume ones, but minimum representation for every category with risk-weighted allocation above the minimum.
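The first two rules can be sketched as a single allocation function. The volume shares and risk weights are inputs; the floor of 15 matches the reasonable floor named above:

```python
def allocate(budget, strata, floor=15):
    """Give every stratum its floor, then split the remainder in proportion
    to volume x risk weight. `strata` maps name -> (volume_share, risk_weight)."""
    if budget < floor * len(strata):
        raise ValueError("budget cannot cover the per-stratum floor")
    weights = {name: vol * risk for name, (vol, risk) in strata.items()}
    total = sum(weights.values())
    remainder = budget - floor * len(strata)
    # note: per-stratum rounding can drift the grand total by a sample or two
    return {name: floor + round(remainder * w / total) for name, w in weights.items()}
```

With a 150-sample budget, the opening example's shares, and a doubled risk weight on warranty, every stratum clears the floor and warranty gets roughly triple its proportional share.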
One-liner: Stratify your monitoring samples by category with per-category minimums — random sampling underrepresents rare categories where the worst problems hide.