A code review pipeline has five categories with these accuracy rates: bugs (92%), security (90%), performance (85%), style (70%), comment accuracy (40%). Developers are ignoring ALL findings, including the 92%-accurate bug detection. The 40%-accuracy category produces so much noise that developers have stopped reading the entire review output.
Users judge a system by its worst visible output, not its mathematical average.
The Trust Contamination Data
A pipeline that had run for six months added a documentation quality check with a 55% false positive rate. Before the addition, developers addressed 80% of findings. After, they addressed 25%, across all categories, including bugs at 95% accuracy and security at 93%.
The poorly calibrated category did not just harm its own adoption. It destroyed trust in every other category. A system’s credibility is set by its least reliable visible component.
Early-stage erosion is detectable. After an API conventions check with a 55% dismiss rate was added, dismissals of bug findings crept from 5% to 8%, and of security findings from 8% to 12%. This is the leading indicator: by the time developers mute the bot entirely, the damage is done.
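Tracking that indicator requires nothing more than per-category dismiss rates compared against a baseline. A minimal sketch, assuming findings are logged as (category, dismissed) records; the function names and the 2-point drift threshold are illustrative, not part of any particular tool:

```python
from collections import defaultdict

def dismiss_rates(findings):
    """Per-category dismiss rate from a log of (category, dismissed) records."""
    totals, dismissed = defaultdict(int), defaultdict(int)
    for category, was_dismissed in findings:
        totals[category] += 1
        if was_dismissed:
            dismissed[category] += 1
    return {c: dismissed[c] / totals[c] for c in totals}

def drifting_categories(baseline, current, threshold=0.02):
    """Categories whose dismiss rate rose past the threshold -- the leading indicator."""
    return [c for c, rate in current.items()
            if rate - baseline.get(c, 0.0) > threshold]
```

With the numbers from this example, `drifting_categories({"bugs": 0.05, "security": 0.08}, {"bugs": 0.08, "security": 0.12})` flags both reliable categories as eroding, well before anyone mutes the bot.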
The ROI Case
Consider a 6-category system where accuracy varies from 45% to 94% and value from $100 to $15,000/month per category. The two lowest-accuracy categories contribute only $300/month in value, but their false positives caused the overall address rate to drop from 75% to 40%, costing approximately $6,650/month in missed high-value findings.
Disabling $300/month of unreliable categories to preserve $6,650/month of reliable ones is not a tradeoff. It is arithmetic.
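The arithmetic can be made explicit. In this sketch, the $19,000/month total for the reliable categories is back-derived from the stated $6,650 loss over the 35-point address-rate drop; the helper name is mine, not the pipeline's:

```python
def monthly_cost_of_noise(reliable_value, rate_before, rate_after):
    """Value lost when the address rate on reliable findings drops."""
    return reliable_value * (rate_before - rate_after)

# Reliable categories total ~$19,000/month at full adoption (derived, not stated);
# the address rate fell from 75% to 40% after the noisy categories were added.
lost = monthly_cost_of_noise(19_000, 0.75, 0.40)   # ~ $6,650/month
noisy_value = 300                                  # what the unreliable categories contribute
assert lost > 20 * noisy_value                     # disabling them is arithmetic, not a tradeoff
```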
The Correct Response: Isolate, Fix, Reintroduce
Step 1: Disable the problematic category immediately. Remove it from visible output. This restores trust in the remaining accurate categories. Do not keep it active because “some findings are valid” — a 2:1 noise-to-signal ratio destroys trust faster than having no findings at all.
Step 2: Fix in isolation. Improve the prompts with specific criteria (K4.1.1), add examples, test against a labeled dataset. A category at 60% false positive rate improved to 88% accuracy through better prompt engineering.
Step 3: Reintroduce gradually. Do not flip back to 100% immediately — 88% in internal testing may not hold on real-world data.
- Shadow mode first — generate findings but do not show them. Compare against manual review.
- Enable on 10-20% of PRs. Measure real-world accuracy.
- Expand to 50%, then 100% once metrics hold.
- Communicate the before/after metrics to rebuild developer confidence.
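One way to implement the staged ramp is deterministic bucketing, so a given PR stays in or out of the rollout consistently across runs. A sketch under that assumption; the function and parameter names are illustrative:

```python
import hashlib

def in_rollout(pr_id: str, category: str, percent: int) -> bool:
    """Deterministically assign a PR to a category's rollout bucket.

    Hashing (category, pr_id) keeps a PR's assignment stable across runs
    and decorrelates buckets between categories.
    """
    digest = hashlib.sha256(f"{category}:{pr_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent

# Ramp the percent from 0 (shadow) to 10-20, then 50, then 100
# as measured real-world accuracy holds at each stage.
```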
Shadow Mode for New Categories
New categories should never launch directly into production. The quality gate framework:
- New category starts in shadow mode — findings generated but not displayed
- 2-week measurement against manual review as baseline
- Must exceed 80% accuracy to promote to visible
- Continuous monitoring with auto-demotion below 75%
- Developer dismiss rate tracking as a leading indicator of trust erosion
This prevents the scenario where a promising category with untested real-world accuracy damages an established system’s credibility.
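The promote/demote rules above amount to a small state machine with hysteresis (promote at 80%, demote below 75%, so a category hovering near the threshold does not flap). A minimal sketch; the class and constant names are mine:

```python
from dataclasses import dataclass

PROMOTE_AT = 0.80   # shadow -> visible
DEMOTE_AT = 0.75    # visible -> shadow; the gap prevents flapping

@dataclass
class Category:
    name: str
    state: str = "shadow"   # "shadow" or "visible"

def apply_gate(cat: Category, measured_accuracy: float) -> Category:
    """Promote or demote a category based on accuracy measured against manual review."""
    if cat.state == "shadow" and measured_accuracy >= PROMOTE_AT:
        cat.state = "visible"
    elif cat.state == "visible" and measured_accuracy < DEMOTE_AT:
        cat.state = "shadow"
    return cat
```

A category measuring 77% stays visible if it is already visible, but a shadow category at 77% is not promoted; only sustained accuracy above 80% earns visibility.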
What Does Not Fix Trust Contamination
Adding confidence scores. LLM self-reported confidence is poorly calibrated — the model assigns high confidence to false positives. Shifting the filtering burden to developers (“ignore findings below 0.7 confidence”) does not address the trust problem. Developers should not have to configure thresholds to make a review tool usable.
Aggregate accuracy reporting. “97% overall accuracy” hides that two fields are at 55%. Users experience individual findings, not averages. A developer seeing three wrong findings in a row does not calculate the system’s aggregate — they stop reading.
“Be more conservative” prompts. This vague instruction may reduce all findings — including true positives — without targeting the specific high-FP category. It is the K4.1.1 anti-pattern applied to a K4.1.2 problem.
“Experimental” labels or disclaimers. Creating a two-tier system where some findings are “experimental” still shows developers false positives. They may dismiss the entire labeled category or, worse, develop the habit of skimming all findings.
Limiting findings per PR. Capping at 3 findings per review does not improve accuracy. Three false positives are still three false positives.
Developer voting on categories. Slow, subjective, and does not prevent damage during the voting period. By the time enough votes accumulate to disable a category, trust is already eroded.
Per-Team Path Scoping
In a monorepo with 5 teams using different naming conventions, a universal naming check guarantees false positives for most teams. The fix: path-specific rules that activate naming checks only for each team’s directories, with conventions calibrated to each team’s actual standards.
This is the cross-domain connection to K3.3.1 — path-scoped rules prevent false positives by loading the right criteria for the right code.
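Path scoping can be as simple as glob patterns mapped to per-team conventions. A hypothetical sketch; the rule table, paths, and convention names are invented for illustration:

```python
from fnmatch import fnmatch

# Hypothetical per-team scoping: naming checks fire only on a team's own paths,
# with the convention calibrated to that team's actual standard.
NAMING_RULES = [
    {"paths": ["services/payments/**"], "convention": "snake_case"},
    {"paths": ["web/dashboard/**"],     "convention": "camelCase"},
]

def rules_for(path: str):
    """Return only the naming rules whose path scope matches this file."""
    return [r for r in NAMING_RULES
            if any(fnmatch(path, pattern) for pattern in r["paths"])]
```

A payments file loads only the snake_case rule; a file outside any team's scope loads no naming rules at all, which is exactly what prevents the cross-team false positives.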
One-liner: One low-accuracy category poisons trust in all categories — disable it immediately, fix in shadow mode, and reintroduce gradually with measured accuracy, because users judge by the worst output they see.