Structured Data Preserves Attribution; Plain Text Destroys It

Context passing between agents is where information dies. The search agent finds a precise figure with its source, date, and methodology. By the time it reaches the synthesis agent, it’s “significant growth was reported.” Every structured fact that becomes plain text prose is an attribution lost, a verification made impossible.

The core principle: structured claim-source mappings

Each finding passed between agents should be a structured object:

{
  "claim": "AI adoption in healthcare grew 40% year-over-year",
  "source_url": "https://...",
  "document_name": "Healthcare AI Trends 2025",
  "relevant_excerpt": "Our analysis shows a 40% increase...",
  "publication_date": "2025-08-15",
  "content_type": "financial_data",
  "certainty": "well_established"
}

Downstream agents must preserve and augment these mappings, not summarize them away.
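The preserve-and-augment contract can be sketched in a few lines of Python. This is a minimal illustration built from the JSON schema above; the `Finding` class and `augment` helper are assumptions, not an API from any particular framework:

```python
from dataclasses import dataclass, asdict

@dataclass
class Finding:
    """One claim-source mapping, mirroring the JSON schema above."""
    claim: str
    source_url: str
    document_name: str
    relevant_excerpt: str
    publication_date: str
    content_type: str
    certainty: str

def augment(finding: Finding, **extra) -> dict:
    """Downstream agents add fields; they never drop the originals."""
    record = asdict(finding)
    record.update(extra)
    return record

f = Finding(
    claim="AI adoption in healthcare grew 40% year-over-year",
    source_url="https://example.com/report",  # placeholder URL
    document_name="Healthcare AI Trends 2025",
    relevant_excerpt="Our analysis shows a 40% increase...",
    publication_date="2025-08-15",
    content_type="financial_data",
    certainty="well_established",
)
enriched = augment(f, relevance_score=0.9)
assert enriched["source_url"] == "https://example.com/report"  # provenance survives
```

Because `augment` starts from the full record, a downstream agent can add its own fields (relevance scores, validation flags) without ever losing the upstream attribution.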

Four formats that fail

Plain text summary: “Several studies show AI adoption is growing rapidly” loses which studies, which figures, and when they were published. In one system, 92% of claims in the synthesis report lacked verifiable attribution because context transfer had flattened structured data into prose.

Raw HTML: navigation elements, ads, and scripts burn context window on irrelevant content. The relevant data might be 5% of the HTML.

Re-fetch instructions: “Here are the URLs, go read them yourself.” This wastes API calls, adds latency, and the sources may have changed since the original fetch.

Markdown-formatted reports: output like “## auth.py\n- Line 42: SQL injection” looks human-readable, but downstream agents misparse it 15% of the time when they try to extract file names and line numbers programmatically. Use structured JSON for inter-agent communication and separate logs for human debugging.

Narrative output → downstream errors

An exploration agent returned its findings as prose paragraphs. The documentation agent that consumed them produced docs in which 35% of file paths were wrong and 25% of function names did not exist, because it had to interpret narrative descriptions instead of receiving explicit structured fields.

Fix: change output format from narrative to structured data ({"file": "auth.py", "function": "validate", "line": 42, "issue": "sql_injection"}). Downstream agents receive precise, parseable facts instead of prose they must interpret.

Location metadata: the silent loss

An extraction agent outputs {"value": "$1.2M", "field": "revenue", "page": 12, "section": "Financial Summary"}. The validation agent receives only {"revenue": "$1.2M"}. The page and section metadata is stripped during transfer. Result: 68% validation failure rate — the validator can’t find the source value because it doesn’t know where to look.

Fix: pass the complete structured output including location fields. Don’t strip metadata during inter-agent transfer.
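A tiny sketch of the lossy versus lossless hand-off, using the field names from the example above (the helper is illustrative, not a real API):

```python
extraction_output = {
    "value": "$1.2M",
    "field": "revenue",
    "page": 12,
    "section": "Financial Summary",
}

# Lossy transfer: the validator only learns the value, not where it came from.
lossy = {extraction_output["field"]: extraction_output["value"]}
assert "page" not in lossy

# Lossless transfer: forward the record whole; never rebuild it from selected keys.
lossless = dict(extraction_output)
assert lossless["page"] == 12 and lossless["section"] == "Financial Summary"
```

The failure mode is rarely a deliberate decision; it happens whenever an intermediate step re-serializes a record from the keys it happens to care about.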

Conflicting data: preserve both, don’t average

Two search agents find different EV market share figures: Agent A returns 14% (IEA 2023), Agent B returns 22% (BloombergNEF 2024). The synthesis agent receives both but averages to 18% — a fabricated number from neither source.

Fix: structured context with source metadata + synthesis instructions to preserve both values with attribution when data conflicts. “IEA reported 14% in 2023; BloombergNEF projected 22% in 2024” — both correct for their time/methodology.
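Preserving both values is mechanical once the source metadata travels with the figure. A minimal sketch, with illustrative field names:

```python
findings = [
    {"claim": "EV market share", "value": 14, "source": "IEA", "year": 2023},
    {"claim": "EV market share", "value": 22, "source": "BloombergNEF", "year": 2024},
]

def synthesize(conflicting: list[dict]) -> str:
    """Report each figure with its source; never collapse conflicts to an average."""
    return "; ".join(
        f"{f['source']} reported {f['value']}% in {f['year']}" for f in conflicting
    )

summary = synthesize(findings)
assert summary == "IEA reported 14% in 2023; BloombergNEF reported 22% in 2024"
```

Note that nothing in `synthesize` can fabricate an 18%: the function has no averaging path, only attribution.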

Temporal metadata prevents false contradictions

Unemployment: Source A says 8%, Source B says 4%. The synthesis agent flags a contradiction. Investigation: Source A is from 2020 (pandemic), Source B is from 2024 (recovery). Both are accurate for their periods. 25% of flagged contradictions are temporal differences, not real conflicts.

Fix: require all sub-agents to include publication/data collection dates in structured output. The synthesis agent can distinguish trend changes from contradictions.
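With dates in the structured output, the contradiction check becomes a comparison of periods before a comparison of values. A sketch, assuming a simple one-year window (the threshold is an illustrative choice, not a recommendation):

```python
from datetime import date

a = {"metric": "unemployment", "value": 8.0, "as_of": date(2020, 6, 1)}
b = {"metric": "unemployment", "value": 4.0, "as_of": date(2024, 6, 1)}

def classify(x: dict, y: dict, window_days: int = 365) -> str:
    """Only flag a contradiction when two figures describe the same period."""
    if abs((x["as_of"] - y["as_of"]).days) > window_days:
        return "trend_change"  # different periods: both figures can be right
    return "contradiction" if x["value"] != y["value"] else "agreement"

assert classify(a, b) == "trend_change"
```

This single guard would have absorbed the 25% of flagged “contradictions” that were actually temporal differences.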

Linked records from parallel agents

Three parallel agents return: order history (8 fields), payment records (12 fields), communication logs (full threads). The resolution agent needs to correlate: match a payment to an order to a complaint.

Fix: tag outputs with shared keys (customer_id, order_id). Structure aggregated context as linked records — orders linked to their payments linked to their communications, with only resolution-relevant fields from each.
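The linked-record aggregation is essentially a join on the shared keys. A minimal sketch with invented sample data:

```python
orders = [{"order_id": "O1", "customer_id": "C9", "total": 120.0}]
payments = [{"payment_id": "P1", "order_id": "O1", "status": "failed"}]
messages = [{"order_id": "O1", "thread": "Complaint about a double charge"}]

def link_records(orders, payments, messages):
    """Aggregate parallel-agent outputs as linked records keyed by order_id."""
    by_order = {
        o["order_id"]: {"order": o, "payments": [], "messages": []} for o in orders
    }
    for p in payments:
        by_order[p["order_id"]]["payments"].append(p)
    for m in messages:
        by_order[m["order_id"]]["messages"].append(m)
    return by_order

linked = link_records(orders, payments, messages)
assert linked["O1"]["payments"][0]["status"] == "failed"
```

The resolution agent can now follow one key from complaint to payment to order, instead of re-deriving the correlation from three flat lists.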

Upstream reasoning ≠ downstream needs

Upstream agents produce 8,000-token outputs with full reasoning chains. Three such agents contribute 24,000 tokens to a 30,000-token synthesis budget, leaving only 6,000 tokens for synthesis reasoning that needs 12,000.

The insight: upstream reasoning chains are valuable FOR upstream agents but not FOR downstream ones. The synthesis agent needs conclusions (key facts), provenance (citations), and importance (relevance scores) — not the 5,000 tokens of reasoning that produced each conclusion.

Fix: restructure upstream output from verbose to structured key findings. 8,000 → 3,000 tokens per agent. Total: 9,000, leaving 21,000 for synthesis.
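The restructuring is a projection onto the fields the synthesis agent actually consumes. A sketch; the field names (`key_findings`, `citations`, `relevance`) are illustrative:

```python
def compress(upstream: dict) -> dict:
    """Keep conclusions, provenance, and importance; drop the reasoning chain."""
    return {
        "key_findings": upstream["key_findings"],
        "citations": upstream["citations"],
        "relevance": upstream["relevance"],
    }

verbose = {
    "key_findings": ["AI adoption in healthcare grew 40% YoY"],
    "citations": ["Healthcare AI Trends 2025"],
    "relevance": 0.9,
    "reasoning_chain": "Step 1: ... " * 500,  # tokens downstream never needs
}
compact = compress(verbose)
assert "reasoning_chain" not in compact
assert compact["citations"] == ["Healthcare AI Trends 2025"]
```

The compression is lossless for everything downstream cares about: conclusions and provenance survive intact, and only the reasoning that already served its purpose upstream is dropped.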

Content type and certainty metadata

Tag outputs with content_type (financial_data, news, technical_assessment) and certainty_level (well_established, contested, preliminary). The synthesis agent uses these to:

  • Render financial data as tables (easy comparison)
  • Render news as prose (narrative flow)
  • Render technical findings as structured lists (preserve detail)
  • Separate well-established from contested findings in distinct report sections

One system’s reviewer complaints: 60% about financial data buried in paragraphs, 25% about unclear certainty levels, 15% about lost technical structure. Content type metadata addresses all three.
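The metadata-driven rendering reduces to a dispatch table. A minimal sketch of the routing logic described above; the style names and the fallback choice are assumptions:

```python
def render(finding: dict) -> str:
    """Route each finding to a rendering style and report section by metadata."""
    dispatch = {
        "financial_data": "table",            # easy comparison
        "news": "prose",                      # narrative flow
        "technical_assessment": "structured_list",  # preserve detail
    }
    style = dispatch.get(finding["content_type"], "prose")
    section = "contested" if finding["certainty"] == "contested" else "established"
    return f"{style}/{section}"

assert render({"content_type": "financial_data",
               "certainty": "well_established"}) == "table/established"
```

With the routing made explicit, financial figures can no longer end up buried in paragraphs, and contested findings are structurally separated rather than silently mixed in.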


One-liner: Pass structured claim-source mappings with provenance, temporal, content type, and certainty metadata between agents — plain text kills attribution (92% loss), narrative causes errors (35% wrong paths), and conflicting data must be preserved with both sources, not averaged.