A sub-agent returns “search failed.” The orchestrator retries three times, and every attempt fails. The cause was an expired API key, a permanent condition that no number of retries will fix. Three attempts were wasted because the orchestrator could not tell a transient timeout from a permanent failure.
The Structured Error Schema
{
  "failure_type": "auth_error",
  "partial_results": [...],
  "alternatives_suggested": ["try backup API", "use cached data"],
  "retry_recommended": false
}
Each field enables a specific recovery decision:
| Field | What it enables |
|---|---|
| failure_type | Route to correct handler (retry vs refresh vs switch) |
| partial_results | Preserve work done before failure |
| alternatives_suggested | Recovery paths from sub-agent’s domain knowledge |
| retry_recommended | Prevent wasted retries on permanent errors |
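As a rough illustration, the schema above could be modeled as a typed record on the sub-agent side. This is a minimal sketch: the class name `SubAgentError` and the `to_json` helper are assumptions made for the example, not part of any particular framework.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class SubAgentError:
    """Structured error a sub-agent returns instead of a bare message (hypothetical name)."""
    failure_type: str                                   # e.g. "auth_error", "timeout"
    partial_results: list[Any] = field(default_factory=list)
    alternatives_suggested: list[str] = field(default_factory=list)
    retry_recommended: bool = False

    def to_json(self) -> str:
        # Serialize all four fields so the orchestrator always sees the same shape.
        return json.dumps(asdict(self))


# The expired-API-key case from the opening scenario.
err = SubAgentError(
    failure_type="auth_error",
    alternatives_suggested=["try backup API", "use cached data"],
    retry_recommended=False,
)
payload = err.to_json()
```

Giving every field a default means a sub-agent that has nothing to report for `partial_results` still emits an empty list rather than omitting the key.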
The Recovery Data
Recovery rates across 500 sub-agent failures, by error format:
| Error format | Successful recovery |
|---|---|
| Generic “failed” | 18% |
| Structured (type + partial + alternatives) | 71% |
Each field adds value incrementally: failure_type alone reaches 42%, adding partial_results reaches 58%, adding alternatives_suggested reaches 71%, and adding retry_recommended reaches 74%.
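To see which field contributes most, the cumulative rates quoted above can be turned into per-field gains. The short sketch below only does that subtraction; the variable names are illustrative.

```python
# Cumulative recovery rates (percent) in the order the fields are added,
# starting from the generic "failed" baseline.
cumulative = [
    ("generic 'failed'", 18),
    ("failure_type", 42),
    ("partial_results", 58),
    ("alternatives_suggested", 71),
    ("retry_recommended", 74),
]

# Gain of each field over the configuration just before it, in percentage points.
marginal_gain = {
    name: rate - prev_rate
    for (_, prev_rate), (name, rate) in zip(cumulative, cumulative[1:])
}

for name, gain in marginal_gain.items():
    print(f"{name}: +{gain} points")
```

On these figures, failure_type accounts for the largest single step, and each later field adds less than the one before it.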
failure_type Enum → Recovery Router
| Type | Recovery action |
|---|---|
| timeout | Retry same query |
| auth_error | Refresh credentials, then retry |
| not_found | Try alternative source |
| rate_limited | Back off, then retry |
| permanent | Do not retry, use alternatives |
A failure_type enum maps directly to recovery handlers. No message parsing, no guessing, no wasted retries on permanent errors.
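One way to express the table as code is a dictionary keyed by the enum. This is a sketch under stated assumptions: the handler functions are hypothetical stand-ins that only return a label for the action they would take, and treating an unrecognized failure_type as permanent is a choice made for the example, not something the table prescribes.

```python
from enum import Enum
from typing import Callable


class FailureType(Enum):
    TIMEOUT = "timeout"
    AUTH_ERROR = "auth_error"
    NOT_FOUND = "not_found"
    RATE_LIMITED = "rate_limited"
    PERMANENT = "permanent"


# Hypothetical handlers: real ones would perform the action, these just name it.
def retry_same_query(error: dict) -> str:
    return "retry"

def refresh_then_retry(error: dict) -> str:
    return "refresh_credentials_then_retry"

def try_alternative_source(error: dict) -> str:
    return "alternative_source"

def back_off_then_retry(error: dict) -> str:
    return "back_off_then_retry"

def use_alternatives(error: dict) -> str:
    return "use_alternatives"


RECOVERY_ROUTER: dict[FailureType, Callable[[dict], str]] = {
    FailureType.TIMEOUT: retry_same_query,
    FailureType.AUTH_ERROR: refresh_then_retry,
    FailureType.NOT_FOUND: try_alternative_source,
    FailureType.RATE_LIMITED: back_off_then_retry,
    FailureType.PERMANENT: use_alternatives,
}


def route(error: dict) -> str:
    """Dispatch on failure_type; unknown or missing types fall back to 'permanent' (assumption)."""
    try:
        failure_type = FailureType(error.get("failure_type"))
    except ValueError:
        failure_type = FailureType.PERMANENT
    return RECOVERY_ROUTER[failure_type](error)
```

Falling back to the no-retry path for unknown types is the conservative default here, since it avoids retrying an error the orchestrator does not understand.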
Partial Results: Don’t Waste Completed Work
A sub-agent searched 5 sources, found results from 3, then timed out. Returning only “timeout error” loses the 3 successful results. The orchestrator then retries from scratch, re-searching sources that had already completed.
A structured error with partial_results preserves the 3 results and identifies the 2 gaps. The orchestrator uses what exists and retries only the remaining sources.
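The 3-of-5 scenario can be sketched as a small planning step on the orchestrator side. The shape of each partial_results item (a dict with `source` and `data` keys) and the function name `plan_retry` are assumptions for illustration; the schema itself does not fix them.

```python
def plan_retry(all_sources: list[str], error: dict) -> tuple[dict, list[str]]:
    """Return (results to keep, sources that still need searching)."""
    # Assumption: each partial result is {"source": <name>, "data": <payload>}.
    kept = {item["source"]: item["data"] for item in error.get("partial_results", [])}
    remaining = [s for s in all_sources if s not in kept]
    return kept, remaining


sources = ["A", "B", "C", "D", "E"]
error = {
    "failure_type": "timeout",
    "partial_results": [
        {"source": "A", "data": ["a1"]},
        {"source": "B", "data": ["b1"]},
        {"source": "C", "data": ["c1"]},
    ],
    "retry_recommended": True,
}

kept, remaining = plan_retry(sources, error)
# kept holds the three completed sources; remaining lists only the two gaps.
```

The retry then targets `remaining` alone, so completed sources are never queried a second time.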
Silent Completion Is Worse Than Reported Failure
Returning success with only the 3 results (hiding the 2 unsearched sources) is deceptive. The orchestrator believes it has comprehensive results when 40% of sources were never checked. Transparent error reporting with gaps identified is always better than silent partial completion.
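A sub-agent can make the gap explicit by reporting coverage alongside its results. The sketch below assumes a hypothetical `coverage_report` helper and field names (`status`, `unsearched_sources`, `coverage`) that are not part of the schema shown earlier.

```python
def coverage_report(requested: list[str], searched: list[str]) -> dict:
    """Describe how much of the requested work actually ran."""
    searched_set = set(searched)
    unsearched = [s for s in requested if s not in searched_set]
    # Guard against an empty request so the ratio is always defined.
    coverage = 1.0 if not requested else (len(requested) - len(unsearched)) / len(requested)
    return {
        "status": "complete" if not unsearched else "partial",
        "unsearched_sources": unsearched,
        "coverage": coverage,
    }


report = coverage_report(["A", "B", "C", "D", "E"], ["A", "B", "C"])
```

With a report like this, the orchestrator can see that two of five sources were never checked instead of assuming the result set is comprehensive.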
The First Step
If sub-agents currently return unstructured error messages, define a standard error schema for all of them. Consistent structure lets the orchestrator reliably extract recovery information without per-agent parsing logic.
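A shared schema is easier to enforce with a single check at the orchestrator boundary. The validator below is a minimal sketch, assuming the four fields from the schema above are all required; the names `REQUIRED_FIELDS` and `validate_error` are made up for the example.

```python
# Field name -> expected Python type, mirroring the four-field schema.
REQUIRED_FIELDS = {
    "failure_type": str,
    "partial_results": list,
    "alternatives_suggested": list,
    "retry_recommended": bool,
}


def validate_error(payload: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the payload conforms."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems
```

Running every sub-agent response through one check like this is what removes the need for per-agent parsing logic.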
One-liner: Return structured errors with failure_type, partial_results, and alternatives_suggested. Routing each error type to the correct recovery strategy lets the orchestrator recover 71% of failures instead of 18%.