Sub-agents returning 3,000-token narrative responses with full reasoning chains fill the orchestrator's context after just four responses. The orchestrator needs conclusions, not the thinking process. Switching to structured output (key facts, citations, confidence scores) reduces per-agent tokens from 2,500 to 250 (a 90% reduction) and raises synthesis accuracy from 68% to 91%.
## The Data: Three Output Formats
| Format | Tokens/agent | Synthesis accuracy | Context overflow rate |
|---|---|---|---|
| Free-form text | 2,500 | 68% | 39% |
| Summarized text | 600 | 79% | 8% |
| Structured facts + citations + scores | 250 | 91% | 1% |
Each structuring level improves all metrics simultaneously. Compact, parsed data is easier for the orchestrator to integrate than verbose text — accuracy improves because relevant information is more accessible, not because less information is provided.
## The Structured Output Schema
```json
{
  "finding": "SQL injection vulnerability in user input handling",
  "category": "security",
  "severity": "HIGH",
  "citation": "auth/handler.py:42",
  "confidence": 0.95
}
```
A unified schema across all sub-agents enables the orchestrator to merge and prioritize findings from 8 specialized agents using the same field structure. Findings can be sorted by severity and confidence across agents without format-specific parsing.
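The merge-and-prioritize step can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the `Finding` dataclass mirrors the schema above, and the severity ranking and function names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One structured result from a sub-agent, mirroring the shared schema."""
    finding: str
    category: str
    severity: str       # "HIGH" | "MEDIUM" | "LOW"
    citation: str
    confidence: float   # 0.0 to 1.0

# Lower rank = more urgent; ordering is illustrative.
SEVERITY_RANK = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def merge_findings(per_agent_results: list[list[Finding]]) -> list[Finding]:
    """Flatten findings from all sub-agents, then order by severity
    first and confidence second (highest confidence first)."""
    merged = [f for results in per_agent_results for f in results]
    merged.sort(key=lambda f: (SEVERITY_RANK[f.severity], -f.confidence))
    return merged
```

Because every agent emits the same fields, this one sort key works across all of them; no per-agent parsing branch is needed.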
## What Sub-Agents Should NOT Return

**Full reasoning chains.** The orchestrator needs results, not the decision process. Reasoning about rejected hypotheses wastes context on information the orchestrator did not request.

**Raw data.** Forcing the orchestrator to re-derive conclusions duplicates work and consumes more context. The sub-agent should provide processed results.

**Full document text.** The orchestrator needs to know which policy applies and where to find it, not the full document. A citation reference (`policy-guide.md:section-3.2`) provides traceability in minimal tokens.
## Confidence Scores Replace Reasoning Verification
A colleague proposes returning full reasoning chains so the orchestrator can verify correctness. But verification by reasoning review is expensive and unreliable — the orchestrator’s job is synthesis, not auditing logic.
Confidence scores serve the verification purpose efficiently: low confidence triggers re-delegation to a different sub-agent or human review. High confidence allows the orchestrator to proceed. No reasoning chain needed.
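The routing rule above fits in a few lines. The thresholds here are hypothetical placeholders to be tuned per task; the source gives no specific cutoffs.

```python
# Hypothetical thresholds -- tune per task and risk tolerance.
REDELEGATE_BELOW = 0.5
REVIEW_BELOW = 0.8

def route(confidence: float) -> str:
    """Decide what the orchestrator does with a finding,
    based on its confidence score alone."""
    if confidence < REDELEGATE_BELOW:
        return "re-delegate"   # hand the task to a different sub-agent
    if confidence < REVIEW_BELOW:
        return "human-review"  # flag for a person to check
    return "accept"            # proceed directly to synthesis
```

The point is that the decision consumes one float per finding, versus thousands of tokens for a reasoning chain the orchestrator would then have to audit.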
## Traceability Through Citations
Every finding must link to source evidence for audit trails. Citations (e.g., `auth.py:42`, `policy-guide.md:section-3.2`) provide this traceability in minimal tokens. Reasoning chains are not required for traceability: they explain the thinking process, while citations point to the evidence.
Context efficiency and traceability are not mutually exclusive. Structured facts with citation references satisfy both constraints simultaneously.
## The First Step
If sub-agents currently return free-form text: define a structured output schema. This addresses both problems at once: context efficiency (structured output is smaller than free-form text) and reliable extraction (parsed fields instead of text mining). Post-processing summarization is a workaround; fixing the output format at the source is the root-cause solution.
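A first-pass schema check can be enforced at the boundary where sub-agent output enters the orchestrator. This sketch validates raw JSON against the shared schema using only the standard library; the field list mirrors the schema shown earlier, and the helper name is an assumption.

```python
import json

# Required fields and their expected types, matching the shared schema.
REQUIRED_FIELDS = {
    "finding": str,
    "category": str,
    "severity": str,
    "citation": str,
    "confidence": float,
}

def parse_finding(raw: str) -> dict:
    """Parse one sub-agent response; reject anything that does not
    conform to the shared schema instead of text-mining around it."""
    data = json.loads(raw)
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    if data["severity"] not in {"HIGH", "MEDIUM", "LOW"}:
        raise ValueError("severity must be HIGH, MEDIUM, or LOW")
    return data
```

Rejecting malformed output at this boundary keeps the fix at the source: a sub-agent that drifts back toward narrative text fails fast instead of silently polluting the orchestrator's context.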
One-liner: Require sub-agents to return structured facts with citations and confidence scores instead of narrative reasoning — this reduces per-agent tokens by 90%, raises synthesis accuracy from 68% to 91%, and provides traceability through citations rather than verbose chains.