K5.6.4 Task 5.6

Established vs Contested: Structure Reports by Evidence Strength

A research report states: “AI will transform the job market, creating more jobs than it eliminates.” Two sub-agents provided findings. One cited a source projecting net job creation. The other cited a source projecting 15% net job reduction. The coordinator chose the first perspective and dropped the second. A contested topic now reads as established fact.

Reports that present all findings with identical formatting — regardless of whether they are confirmed by four sources or sourced from a single blog post — mislead readers about evidence quality. The structure of the report should communicate evidence strength, not hide it.

Why Flat Lists Mislead

A product has two claims: “weighs 2.5 lbs” (confirmed by 4 independent sources) and “battery lasts 8-12 hours” (Source 1 says 8 hrs, Source 2 says 12 hrs, Source 3 says 10 hrs). A flat list presents both with equal formatting. The reader treats the contested battery claim with the same confidence as the verified weight.

User trust data reveals the cost:

| Report format | Overall trust | Surprise rate (relied on contested claim as fact) |
|---|---|---|
| Format A: flat (no evidence labels) | 72% | 40% |
| Format B: structured (established vs contested) | 65% | 8% |

Format B’s “lower” trust is actually better-calibrated trust. Users appropriately question contested findings. Format A’s “higher” trust is built on false confidence — 40% of users are surprised when facts they relied on turn out to be debated.

The Evidence-Strength Structure

Separate findings into sections that signal reliability:

Established Findings

Multi-source consensus. Readers can act on these with confidence.

  • “Product weighs 2.5 lbs” — confirmed by manufacturer, two test labs, and retail listing

Contested Findings

Sources disagree. Present all perspectives.

  • “Battery life: 8 hours (test lab A), 10 hours (test lab B), 12 hours (manufacturer claim)” — different testing conditions may explain the range

Preliminary Findings

Single source, not corroborated.

  • “New firmware improves charging speed by 15%” — reported by one tech blog, awaiting independent verification

This structure maps directly to reader action: established findings inform decisions, contested findings require discussion, preliminary findings need validation.
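The grouping above can be sketched in a few lines. This is a minimal illustration, assuming each finding carries lists of supporting and conflicting sources; the class and field names are hypothetical, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    claim: str
    supporting: list                            # sources that corroborate the claim
    conflicting: list = field(default_factory=list)  # sources that disagree

def section_for(f: Finding) -> str:
    """Map a finding to a report section by evidence strength."""
    if f.conflicting:               # any disagreement => contested
        return "Contested"
    if len(f.supporting) >= 2:      # multi-source consensus
        return "Established"
    return "Preliminary"            # single source, uncorroborated

def build_report(findings):
    sections = {"Established": [], "Contested": [], "Preliminary": []}
    for f in findings:
        sections[section_for(f)].append(f.claim)
    return sections

findings = [
    Finding("Product weighs 2.5 lbs",
            ["manufacturer", "test lab A", "test lab B", "retail listing"]),
    Finding("Battery lasts 8-12 hours",
            ["manufacturer"], conflicting=["test lab A", "test lab B"]),
    Finding("New firmware improves charging speed by 15%", ["tech blog"]),
]
report = build_report(findings)
# The weight claim lands in Established, the battery claim in Contested,
# and the firmware claim in Preliminary.
```

Note that any conflicting source routes a claim to Contested before source count is even considered; disagreement outranks corroboration.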

Four-Level Classification for Technical Reports

For developer tools and code quality reports, a four-level system works well:

| Level | Criteria | Reader action |
|---|---|---|
| Established | Multi-source consensus (2+ independent tools/sources agree) | Act with confidence |
| Supported | Majority agreement (most sources agree, minor dissent) | Likely sound, investigate outlier |
| Contested | Sources disagree (no clear consensus) | Team discussion needed |
| Preliminary | Single source only (one tool/blog/observation) | Verify before acting |

This is more useful than binary (recommended / not recommended), raw source counts (which ignore source quality — 3 blog posts ≠ 3 peer-reviewed benchmarks), or confidence percentages (which obscure the nature of the evidence).
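The four-level table reduces to a small decision function over per-claim counts of agreeing and disagreeing independent sources. The thresholds below are illustrative, not canonical:

```python
def classify(agree: int, disagree: int) -> str:
    """Four-level evidence classification from independent-source counts."""
    total = agree + disagree
    if total <= 1:
        return "Preliminary"    # single source only
    if disagree == 0:
        return "Established"    # multi-source consensus, no dissent
    if agree > disagree:
        return "Supported"      # majority agreement, minor dissent
    return "Contested"          # no clear consensus

classify(3, 0)   # "Established"
classify(4, 1)   # "Supported"
classify(2, 2)   # "Contested"
classify(1, 0)   # "Preliminary"
```

Raw counts only work when sources are roughly comparable; as noted above, if they are not (blog posts vs peer-reviewed benchmarks), the inputs should be weighted before counting.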

Classification Must Require Consensus

A common error: classifying a claim as “established” if at least one authoritative source supports it. One system applied this rule and classified 85% of claims as established. An audit found 30% of those “established” claims had conflicting sources — nearly 1 in 3 was actually contested.

The fix: “established” requires consensus across sources, not support from any single one. A claim supported by one source but contradicted by another is, by definition, contested.
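The difference between the faulty rule and the fix is visible in a contrived two-source case (field names are illustrative):

```python
def established_buggy(sources):
    # Faulty rule: one authoritative supporter is enough.
    return any(s["stance"] == "support" and s["authoritative"] for s in sources)

def established_fixed(sources):
    # Correct rule: consensus -- multiple supporters AND zero contradictions.
    supporting = [s for s in sources if s["stance"] == "support"]
    contradicting = [s for s in sources if s["stance"] == "contradict"]
    return len(supporting) >= 2 and not contradicting

sources = [
    {"stance": "support", "authoritative": True},     # e.g. one benchmark
    {"stance": "contradict", "authoritative": True},  # a conflicting benchmark
]
established_buggy(sources)   # True  -- misclassifies a contested claim
established_fixed(sources)   # False -- correctly leaves it contested
```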

Security and Code Quality: Evidence Strength Matters

A CI/CD agent generates findings from static analysis, dynamic testing, security scanning, and peer review history. Currently, all findings appear in a flat list: CVEs confirmed across multiple vulnerability databases sit alongside single-tool flags.

Structuring by evidence strength:

| Section | Content | % of findings |
|---|---|---|
| Confirmed | 2+ independent tools agree | 60% |
| Flagged | Single tool detection | 30% |
| Disputed | One tool says vulnerable, another says safe | 10% |

Development teams can now allocate response proportionally: confirmed CVEs get immediate attention, single-tool flags warrant investigation, and disputed findings need team assessment. Severity still matters within each section, but evidence strength determines the first-level triage.

Filtering out single-tool findings to “reduce noise” risks missing real vulnerabilities that only one specialized tool can detect. The correct approach: present them with appropriate evidence-strength labeling, not suppress them.
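The triage rule above (evidence strength first, severity within each section) is just a two-key sort. A minimal sketch, with illustrative rank tables and finding records:

```python
EVIDENCE_RANK = {"Confirmed": 0, "Flagged": 1, "Disputed": 2}
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def triage_order(findings):
    """Evidence strength is the first-level sort key; severity only
    breaks ties within each evidence section."""
    return sorted(findings, key=lambda f: (EVIDENCE_RANK[f["evidence"]],
                                           SEVERITY_RANK[f["severity"]]))

findings = [
    {"id": "F1", "evidence": "Flagged",   "severity": "critical"},
    {"id": "F2", "evidence": "Confirmed", "severity": "medium"},
    {"id": "F3", "evidence": "Disputed",  "severity": "high"},
    {"id": "F4", "evidence": "Confirmed", "severity": "critical"},
]
[f["id"] for f in triage_order(findings)]  # ['F4', 'F2', 'F1', 'F3']
```

Note that F1, a critical single-tool flag, still appears in the report ahead of disputed findings; it is labeled and ranked, not suppressed.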

Do Not Remove the Contested Section

A product manager proposes: “Remove the contested findings section. Stakeholders want definitive numbers, not debates.”

This converts uncertain projections into apparent facts. A financial projection where Source A says $5M growth and Source B says $2M decline is genuinely uncertain. Presenting either projection as established exposes stakeholders to unacknowledged risk. The uncertainty itself is actionable information — it tells decision-makers to build contingency plans rather than commit to a single forecast.

Moving contested findings to an appendix is a half-measure. Critical uncertainties in financial projections should be prominently displayed, not buried where most stakeholders will not see them.

Sub-Agents Must Provide Structured Output

The coordinator cannot construct evidence-strength sections without structured metadata from sub-agents. A sub-agent returning plain text like “AI market will reach $50B” does not tell the coordinator which of 10 accessed sources produced this claim, whether other sources agree, or when the data was collected.

Structured output — {claim, source, date, confidence} — gives the coordinator the raw material to classify evidence strength. Without it, the coordinator can only guess at consensus, and its guesses will be wrong often enough to undermine the entire classification system.

Attribution must originate from the agent that accessed the source. Post-synthesis attribution (searching for likely sources after the report is written) is unreliable because multiple sources may contain similar claims, and only the sub-agent knows which specific document it read.
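A sketch of how the pieces fit: each sub-agent emits one structured record per claim, following the {claim, source, date, confidence} schema above, and the coordinator groups records by claim to see every source behind it. The specific source names and values here are illustrative:

```python
from collections import defaultdict

# Emitted by the sub-agent that actually read each source.
sub_agent_findings = [
    {"claim": "AI market will reach $50B", "source": "analyst report A",
     "date": "2024-01-15", "confidence": 0.8},
    {"claim": "AI market will reach $50B", "source": "industry survey B",
     "date": "2024-03-02", "confidence": 0.7},
]

def group_by_claim(findings):
    """Coordinator side: collect every source behind each claim so that
    consensus (or disagreement) can be assessed rather than guessed."""
    by_claim = defaultdict(list)
    for f in findings:
        by_claim[f["claim"]].append(f["source"])
    return dict(by_claim)

group_by_claim(sub_agent_findings)
# {'AI market will reach $50B': ['analyst report A', 'industry survey B']}
```

With this grouping in hand, the coordinator can apply the evidence-strength classification directly instead of attributing sources after synthesis.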


One-liner: Structure reports into established, contested, and preliminary sections so readers know which findings to trust, which to debate, and which to verify — flat lists treat all evidence as equal when it is not.