Generic "Failed" → 18% Recovery. Structured Error → 71%. | Context Management & Reliability

A sub-agent returns “search failed.” The orchestrator retries three times, and all three attempts fail. The cause was an expired API key, a failure that retrying the same request can never fix. The three attempts were wasted because the orchestrator could not tell a transient timeout from a permanent failure.

The Structured Error Schema

{
  "failure_type": "auth_error",
  "partial_results": [...],
  "alternatives_suggested": ["try backup API", "use cached data"],
  "retry_recommended": false
}

Each field enables a specific recovery decision:

Field	What it enables
`failure_type`	Route to correct handler (retry vs refresh vs switch)
`partial_results`	Preserve work done before failure
`alternatives_suggested`	Recovery paths from sub-agent’s domain knowledge
`retry_recommended`	Prevent wasted retries on permanent errors
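As a minimal sketch, the four fields could be carried in a small Python dataclass that round-trips through JSON. The class name `SubAgentError` and its helper methods are illustrative assumptions, not part of the schema itself.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class SubAgentError:
    """One structured failure report, mirroring the four schema fields."""
    failure_type: str            # e.g. "auth_error"
    retry_recommended: bool      # False when a plain retry cannot succeed
    partial_results: list[Any] = field(default_factory=list)
    alternatives_suggested: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "SubAgentError":
        data = json.loads(raw)
        return cls(
            failure_type=data["failure_type"],
            retry_recommended=data["retry_recommended"],
            # Optional lists default to empty so a sparse report still parses.
            partial_results=data.get("partial_results", []),
            alternatives_suggested=data.get("alternatives_suggested", []),
        )


report = SubAgentError(
    failure_type="auth_error",
    retry_recommended=False,
    alternatives_suggested=["try backup API", "use cached data"],
)
round_tripped = SubAgentError.from_json(report.to_json())
```

Keeping `failure_type` and `retry_recommended` required while the two lists default to empty is one possible design choice: the routing fields must always be present, and the helpful extras may be absent.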

The Recovery Data

Recovery rates across 500 sub-agent failures, by error format:

Error format	Successful recovery
Generic “failed”	18%
Structured (type + partial + alternatives)	71%

Each field adds recovery on top of the last: failure_type alone reaches 42%, adding partial_results reaches 58%, adding alternatives_suggested reaches 71%, and adding retry_recommended reaches 74%.
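The incremental figures can be turned into per-field gains with a little arithmetic. The sketch below uses only the percentages quoted above and the stated sample of 500 failures. The absolute counts are approximate, since the quoted percentages may be rounded.

```python
# Cumulative recovery rates quoted in the text, in the order fields are added.
cumulative = [
    ("generic failed", 18),
    ("failure_type", 42),
    ("partial_results", 58),
    ("alternatives_suggested", 71),
    ("retry_recommended", 74),
]
TOTAL_FAILURES = 500

# Percentage points gained by each field over the previous configuration.
marginal = {
    label: rate - prev_rate
    for (_, prev_rate), (label, rate) in zip(cumulative, cumulative[1:])
}

# Approximate number of recovered failures at each stage.
recovered = {label: TOTAL_FAILURES * rate // 100 for label, rate in cumulative}

for label, gain in marginal.items():
    print(f"{label}: +{gain} points")
```

On these numbers the first field contributes the largest single gain (24 points), and each later field adds less than the one before it.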

failure_type Enum → Recovery Router

Type	Recovery action
`timeout`	Retry same query
`auth_error`	Refresh credentials, then retry
`not_found`	Try alternative source
`rate_limited`	Back off, then retry
`permanent`	Do not retry, use alternatives

A failure_type enum maps directly to recovery handlers. No message parsing, no guessing, no wasted retries on permanent errors.
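The table above could be sketched as a dictionary from enum members to handler functions. The handler names and their string return values are placeholders invented for illustration, and the fallback for an unknown type is an assumption rather than something the table specifies.

```python
from enum import Enum
from typing import Callable


class FailureType(str, Enum):
    TIMEOUT = "timeout"
    AUTH_ERROR = "auth_error"
    NOT_FOUND = "not_found"
    RATE_LIMITED = "rate_limited"
    PERMANENT = "permanent"


# Placeholder handlers: each returns the name of the action the
# orchestrator would take, so the routing is easy to inspect.
def retry_same_query(error: dict) -> str:
    return "retry"

def refresh_then_retry(error: dict) -> str:
    return "refresh_credentials_then_retry"

def try_alternative_source(error: dict) -> str:
    return "switch_source"

def back_off_then_retry(error: dict) -> str:
    return "back_off_then_retry"

def use_alternatives(error: dict) -> str:
    return "use_alternatives"


ROUTER: dict[FailureType, Callable[[dict], str]] = {
    FailureType.TIMEOUT: retry_same_query,
    FailureType.AUTH_ERROR: refresh_then_retry,
    FailureType.NOT_FOUND: try_alternative_source,
    FailureType.RATE_LIMITED: back_off_then_retry,
    FailureType.PERMANENT: use_alternatives,
}


def route(error: dict) -> str:
    """Pick a recovery action from failure_type alone, with no message parsing."""
    try:
        failure_type = FailureType(error["failure_type"])
    except (KeyError, ValueError):
        # Assumption: a missing or unknown type is treated as non-retryable.
        return use_alternatives(error)
    return ROUTER[failure_type](error)
```

For example, `route({"failure_type": "rate_limited"})` selects the back-off handler without inspecting any error message text.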

Partial Results: Don’t Waste Completed Work

A sub-agent searched 5 sources, found results from 3, then timed out. Returning only “timeout error” loses the 3 successful results. The orchestrator retries from scratch — re-searching sources already completed.

Structured error with partial_results preserves the 3 results and identifies the 2 gaps. The orchestrator uses what exists and retries only the remaining sources.
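On the orchestrator side, the gap calculation might look like the sketch below. It assumes each entry in partial_results carries a `source` key, which the schema shown earlier does not specify. The source names and the `plan_retry` helper are likewise hypothetical.

```python
def plan_retry(requested_sources: list[str], error: dict) -> tuple[list[dict], list[str]]:
    """Split the work into results to keep and sources still to search."""
    kept = list(error.get("partial_results", []))
    # Assumption: every partial result names the source it came from.
    done = {item["source"] for item in kept}
    remaining = [s for s in requested_sources if s not in done]
    return kept, remaining


sources = ["wiki", "news", "papers", "forums", "patents"]
error = {
    "failure_type": "timeout",
    "partial_results": [
        {"source": "wiki", "hits": 4},
        {"source": "news", "hits": 2},
        {"source": "papers", "hits": 7},
    ],
    "alternatives_suggested": [],
    "retry_recommended": True,
}

kept, remaining = plan_retry(sources, error)
print(remaining)  # the two sources that still need searching
```

Here the three completed results are kept as they are, and only `forums` and `patents` go back into the retry queue.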

Silent Completion Is Worse Than Reported Failure

Returning success with only the 3 results, while hiding the 2 unsearched sources, misleads the orchestrator: it believes it has comprehensive results when 2 of 5 sources (40%) were never checked. Reporting the failure with its gaps identified is always better than silently returning partial work.
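On the sub-agent side, one way to avoid silent partial completion is to report success only when every requested source was actually searched. In this sketch the `status` and `sources_missing` fields, the `build_response` helper, and the source names are assumptions added for illustration. They are not part of the schema above.

```python
def build_response(requested: list[str], results: dict[str, list],
                   failure_type: str = "timeout") -> dict:
    """Report success only when every requested source was searched."""
    missing = [s for s in requested if s not in results]
    if not missing:
        return {"status": "success", "results": results}
    return {
        "status": "error",
        "failure_type": failure_type,
        # Keep the completed work instead of discarding it.
        "partial_results": [
            {"source": s, "items": results[s]} for s in requested if s in results
        ],
        "alternatives_suggested": [],
        "retry_recommended": True,
        # Hypothetical extra field that names the gaps explicitly.
        "sources_missing": missing,
    }


requested = ["wiki", "news", "papers", "forums", "patents"]
found = {"wiki": ["a"], "news": ["b"], "papers": ["c"]}
response = build_response(requested, found)
```

With 3 of 5 sources completed, the response is an error that carries the three results and lists the two gaps, so the orchestrator cannot mistake it for full coverage.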

The First Step

If sub-agents currently return unstructured error messages, start by defining one standard error schema for all of them. A consistent structure lets the orchestrator extract recovery information reliably, without per-agent parsing logic.
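One possible way to enforce a single schema is a decorator applied at every sub-agent boundary, so that any exception leaves the sub-agent in the same shape. The mapping from Python's built-in exceptions to failure types below is an assumption made for illustration, as are the `ok` field and the `structured_errors` name.

```python
import functools
from typing import Any, Callable

# Assumed mapping: (exception class, failure_type, retry_recommended).
EXCEPTION_TYPES: list[tuple[type[BaseException], str, bool]] = [
    (TimeoutError, "timeout", True),
    (PermissionError, "auth_error", False),
    (LookupError, "not_found", False),
]


def structured_errors(agent: Callable[..., Any]) -> Callable[..., dict]:
    """Wrap a sub-agent so every failure is reported in the same schema."""
    @functools.wraps(agent)
    def wrapper(*args: Any, **kwargs: Any) -> dict:
        try:
            return {"ok": True, "results": agent(*args, **kwargs)}
        except Exception as exc:
            # Anything unrecognised is reported as permanent and non-retryable.
            failure_type, retry = "permanent", False
            for exc_class, name, retryable in EXCEPTION_TYPES:
                if isinstance(exc, exc_class):
                    failure_type, retry = name, retryable
                    break
            return {
                "ok": False,
                "failure_type": failure_type,
                "partial_results": [],
                "alternatives_suggested": [],
                "retry_recommended": retry,
            }
    return wrapper


@structured_errors
def search(query: str) -> list[str]:
    raise PermissionError("API key expired")  # simulated failure


result = search("anything")
```

This wrapper cannot fill in partial_results or alternatives_suggested on its own. Those still have to come from the sub-agent, which is the only component that knows what it completed and what else might work.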

One-liner: Return structured errors with failure_type, partial_results, and alternatives_suggested. In the comparison above, that let the orchestrator recover 71% of failures instead of 18% by routing each error type to the correct recovery strategy.