Multi-agent systems have two error-handling anti-patterns that compound each other. Silent swallowing hides failures from the coordinator. Workflow termination discards everything when any failure surfaces. Combined, they ensure most queries produce either incomplete reports that look complete or no report at all.
Anti-Pattern #1: Silent Swallowing
A shipping sub-agent times out trying to reach the carrier API. It catches the timeout internally and returns {"tracking_info": [], "status": "success"}. The coordinator sees a successful query with no results. The customer is told: “Your order has no delivery tracking information.”
The package was shipped yesterday with tracking. The customer’s tracking exists — the sub-agent just couldn’t access it. But the coordinator cannot distinguish “no tracking exists” from “tracking lookup failed” because the sub-agent lied about the outcome.
The pattern: Sub-agent catches error → returns empty result marked as success → coordinator proceeds on false premise → customer receives incorrect information.
This is not a minor data gap. Suppose a billing sub-agent silently swallows timeouts that affect 25 customers per day: each of them is told “no outstanding billing concerns” when they actually have a $500 overcharge. The false assurance is worse than admitting the lookup failed.
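The shipping example can be sketched in a few lines to show where the failure disappears. This is a minimal illustration, not a real integration: `fetch_tracking`, `shipping_agent_swallowing`, and `coordinator_reply` are hypothetical names, and the carrier call is hard-coded to time out.

```python
def fetch_tracking(order_id: str) -> list:
    """Hypothetical carrier lookup; hard-coded to time out for this sketch."""
    raise TimeoutError("carrier API did not respond")


def shipping_agent_swallowing(order_id: str) -> dict:
    """Anti-pattern: the timeout is caught and reported as an empty success."""
    try:
        return {"tracking_info": fetch_tracking(order_id), "status": "success"}
    except TimeoutError:
        # The failure disappears here: the caller sees a normal, empty result.
        return {"tracking_info": [], "status": "success"}


def coordinator_reply(result: dict) -> str:
    """The coordinator can only act on what the sub-agent reports."""
    if result["status"] == "success" and not result["tracking_info"]:
        return "Your order has no delivery tracking information."
    return "Tracking found."


reply = coordinator_reply(shipping_agent_swallowing("order-123"))
print(reply)
```

Nothing in the returned dictionary records that a timeout happened, so no amount of coordinator logic can recover the distinction afterwards.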
Anti-Pattern #2: Workflow Termination
A 4-agent code analysis system runs security, performance, style, and documentation checks. The documentation agent fails with an API timeout. The coordinator terminates the entire analysis.
Developers lose 10+ minutes of security, performance, and style results that completed successfully — because one documentation check failed. The terminate-on-first-failure pattern treats any sub-agent failure as total system failure, discarding all valid work.
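A rough sketch of terminate-on-first-failure, assuming the four checks are plain callables (all names here are hypothetical). The documentation check raises a timeout, and because nothing catches it inside the loop, the three finished results never reach the caller.

```python
def security() -> str:
    return "No issues"


def performance() -> str:
    return "2 warnings"


def style() -> str:
    return "Clean"


def documentation() -> str:
    raise TimeoutError("documentation API timed out")


AGENTS = {
    "security": security,
    "performance": performance,
    "style": style,
    "documentation": documentation,
}


def run_fail_fast(agents: dict) -> dict:
    """Anti-pattern: the first exception aborts the loop and drops `results`."""
    results = {}
    for name, agent in agents.items():
        results[name] = agent()
    return results


try:
    report = run_fail_fast(AGENTS)
except TimeoutError:
    report = None  # the three completed checks are discarded with the failure

print(report)
```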
How They Compound
A 1,000-query/month system with 5 sub-agents (5,000 sub-agent calls per month):
- 8% individual agent failure rate (5,000 × 8% = 400 failures/month)
- 60% of failures silently swallowed → reports look complete but have hidden gaps (35% of all reports)
- 40% of failures propagate → trigger workflow termination (25% of all queries killed)
- Result: ~40% of queries produce complete, accurate reports
Silent swallowing and termination partition failures into two harmful outcomes. Neither alone explains the full damage.
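How failures translate into per-query outcomes depends on how they are spread across queries. The sketch below works the arithmetic under one simple assumption that the text does not state: every sub-agent call fails independently. The function name and parameters are illustrative only.

```python
def outcome_shares(agents: int, failure_rate: float, swallowed_share: float) -> dict:
    """Per-query outcome shares, assuming independent sub-agent failures.

    A query is terminated if at least one failure propagates, has hidden
    gaps if it has swallowed failures but no propagated one, and is
    complete only if no sub-agent failed at all.
    """
    propagate_rate = failure_rate * (1 - swallowed_share)
    p_complete = (1 - failure_rate) ** agents
    p_not_terminated = (1 - propagate_rate) ** agents
    return {
        "terminated": 1 - p_not_terminated,
        "hidden_gaps": p_not_terminated - p_complete,
        "complete": p_complete,
    }


expected_failures = 1000 * 5 * 0.08  # sub-agent failures per month
shares = outcome_shares(agents=5, failure_rate=0.08, swallowed_share=0.60)
for outcome, share in shares.items():
    print(f"{outcome}: {share:.1%}")
```

Under independence, these rates give about 66% complete reports, 19% with hidden gaps, and 15% terminated. That is a smaller damaged share than the figures quoted above, which may rest on a different accounting of how failures cluster. Either way, the failures split into the same two harmful outcomes.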
The Correct Pattern: Graceful Degradation with Coverage Annotation
Three layers, each fixing a specific problem:
Layer 1: Structured error context from sub-agents. Replace silent swallowing: return {"status": "error", "failure_type": "timeout", "partial_results": [...], "alternatives": [...]} so the coordinator receives honest, actionable information.
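One possible shape for Layer 1, sketched with a dataclass. The field names follow the JSON above; the `data` field, the `shipping_agent` function, the injected `fetch` callable, and the alternative labels are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Callable, Optional


@dataclass
class AgentResult:
    status: str                          # "success" or "error"
    data: list = field(default_factory=list)
    failure_type: Optional[str] = None   # e.g. "timeout"; None on success
    partial_results: list = field(default_factory=list)
    alternatives: list = field(default_factory=list)


def shipping_agent(order_id: str, fetch: Callable[[str], list]) -> AgentResult:
    """Report a timeout as an error instead of an empty success."""
    try:
        return AgentResult(status="success", data=fetch(order_id))
    except TimeoutError:
        return AgentResult(
            status="error",
            failure_type="timeout",
            alternatives=["retry_later", "check_carrier_site"],  # hypothetical labels
        )


def timing_out_fetch(order_id: str) -> list:
    """Stand-in for a carrier API call that times out."""
    raise TimeoutError("carrier API did not respond")


result = shipping_agent("order-123", timing_out_fetch)
print(asdict(result))
```

An empty `data` list with `status="error"` no longer reads as "no tracking exists", which is the distinction the coordinator was missing.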
Layer 2: Coordinator collects all results — successful and failed. Replace workflow termination. When 1 of 5 agents fails, collect the 4 successful results, note the gap, and continue. Do not discard valid work.
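A minimal sketch of Layer 2, assuming each sub-agent is a callable that returns a dictionary with a "status" key. The function and agent names are hypothetical. Unexpected exceptions are converted into error records as well, so one misbehaving agent cannot discard the others' work.

```python
from typing import Callable


def collect_results(agents: dict[str, Callable[[], dict]]) -> tuple[dict, dict]:
    """Run every agent; return (succeeded, gaps) instead of stopping early."""
    succeeded, gaps = {}, {}
    for name, agent in agents.items():
        try:
            result = agent()
        except Exception as exc:  # safety net for agents that raise anyway
            result = {
                "status": "error",
                "failure_type": "unhandled_exception",
                "detail": str(exc),
            }
        if result.get("status") == "success":
            succeeded[name] = result
        else:
            gaps[name] = result
    return succeeded, gaps


agents = {
    "security": lambda: {"status": "success", "summary": "No issues"},
    "performance": lambda: {"status": "success", "summary": "2 warnings"},
    "style": lambda: {"status": "success", "summary": "Clean"},
    "documentation": lambda: {"status": "error", "failure_type": "timeout"},
}

succeeded, gaps = collect_results(agents)
print(sorted(succeeded), sorted(gaps))
```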
Layer 3: Coverage annotation in final output. The user sees: “Security: ✅ No issues. Performance: ✅ 2 warnings. Style: ✅ Clean. Documentation: ⚠️ Unavailable (timeout — will retry).” Transparent about what succeeded and what didn’t.
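The annotation itself can be a small formatting step. The sketch below assumes each result carries a "summary" on success, and a "failure_type" plus an optional "follow_up" on error. These keys and the function name are illustrative choices, not a fixed schema.

```python
def annotate_coverage(results: dict) -> str:
    """Render one status fragment per check, marking unavailable checks."""
    parts = []
    for name, result in results.items():
        label = name.capitalize()
        if result["status"] == "success":
            parts.append(f"{label}: ✅ {result['summary']}.")
        else:
            follow_up = result.get("follow_up", "no follow-up scheduled")
            parts.append(
                f"{label}: ⚠️ Unavailable ({result['failure_type']}; {follow_up})."
            )
    return " ".join(parts)


results = {
    "security": {"status": "success", "summary": "No issues"},
    "performance": {"status": "success", "summary": "2 warnings"},
    "style": {"status": "success", "summary": "Clean"},
    "documentation": {
        "status": "error",
        "failure_type": "timeout",
        "follow_up": "will retry",
    },
}

report = annotate_coverage(results)
print(report)
```

Every check appears in the output, so a missing analysis cannot pass for a clean one.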
Access Failure vs Valid Empty Result
The most dangerous consequence of silent swallowing: the coordinator cannot distinguish “the data doesn’t exist” from “we couldn’t access the data.” Both look like empty results.
The fix: distinct status codes. status=error with failure details for access failures. status=success with empty data for genuine absence. The coordinator can then say “no dependencies found” (confident, actionable) versus “dependency check unavailable” (honest, with follow-up).
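With distinct status values, the coordinator's wording can follow directly from the result. This is a small sketch using the dependency example; the function name and the "data" key are assumptions.

```python
def describe_dependencies(result: dict) -> str:
    """Choose wording based on whether the lookup failed or found nothing."""
    if result["status"] == "error":
        # Access failure: say so, and do not claim anything about the data.
        return f"Dependency check unavailable ({result['failure_type']}); follow-up needed."
    if not result["data"]:
        # Genuine absence: the lookup ran and returned nothing.
        return "No dependencies found."
    return f"{len(result['data'])} dependencies found."


print(describe_dependencies({"status": "success", "data": []}))
print(describe_dependencies({"status": "error", "failure_type": "timeout"}))
```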
Fix Silent Swallowing First
When both anti-patterns are present: fix silent swallowing first. Without accurate error reporting from sub-agents, the coordinator cannot make correct decisions regardless of its termination logic. Structured error context is the foundation — coordinator-level graceful degradation depends on receiving honest information.
Generic Errors Are Not Enough
A proposal to return {"status": "error", "message": "analysis unavailable"} instead of empty success is an improvement — at least the coordinator knows something failed. But generic errors still prevent intelligent recovery. The coordinator cannot distinguish timeout (retry) from auth failure (refresh credentials) from permanent failure (switch source). Structured error context with failure_type, partial_results, and alternatives enables the appropriate recovery for each failure type.
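The difference shows up as soon as the coordinator tries to pick a recovery. The sketch below maps the three failure types named above to their recoveries. The key and action names are hypothetical labels, and "report_gap" is an assumed fallback for errors that carry no failure_type.

```python
RECOVERY_BY_FAILURE_TYPE = {
    "timeout": "retry",
    "auth_failure": "refresh_credentials",
    "permanent": "switch_source",
}


def choose_recovery(error: dict) -> str:
    """Pick a recovery from failure_type; fall back when it is missing."""
    return RECOVERY_BY_FAILURE_TYPE.get(error.get("failure_type"), "report_gap")


generic = {"status": "error", "message": "analysis unavailable"}
structured = {"status": "error", "failure_type": "timeout", "alternatives": []}

print(choose_recovery(generic))     # no failure_type, so only the fallback applies
print(choose_recovery(structured))
```

The generic error can only reach the fallback, because it gives the coordinator nothing to dispatch on.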
Triple Redundancy Is Over-Engineered
Running each agent 3 times with majority voting triples compute cost. Graceful degradation addresses the same failures with no additional agent runs by presenting partial results with coverage annotation. In CI analysis, seeing 3 of 4 analyses immediately is usually more useful to developers than waiting for 3 redundant runs of the 4th.
One-liner: Replace silent swallowing with structured error propagation and workflow termination with graceful degradation — the three-layer fix (honest errors + continue on failure + coverage annotation) transforms a system where only 40% of queries succeed into one where partial results are always preserved and gaps are always transparent.