K5.3.1 Task 5.3

Generic "Failed" → 18% Recovery. Structured Error → 71%.

A sub-agent returns “search failed.” The orchestrator retries 3 times, and every attempt fails. The cause was an expired API key: an error that retrying the same call can never fix. The three attempts were wasted because the orchestrator could not distinguish a transient timeout from a permanent failure.

The Structured Error Schema

{
  "failure_type": "auth_error",
  "partial_results": [...],
  "alternatives_suggested": ["try backup API", "use cached data"],
  "retry_recommended": false
}

Each field enables a specific recovery decision:

| Field | What it enables |
| --- | --- |
| failure_type | Route to correct handler (retry vs refresh vs switch) |
| partial_results | Preserve work done before failure |
| alternatives_suggested | Recovery paths from sub-agent’s domain knowledge |
| retry_recommended | Prevent wasted retries on permanent errors |
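One way to keep every sub-agent emitting the same four fields is to build the error from a small typed container. The following is a minimal sketch, not a prescribed implementation: the class name `SubAgentError` and the method `to_dict` are hypothetical, and `failure_type` is kept as a plain string here for simplicity.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SubAgentError:
    """Hypothetical container mirroring the four schema fields above."""
    failure_type: str
    partial_results: list[Any] = field(default_factory=list)
    alternatives_suggested: list[str] = field(default_factory=list)
    retry_recommended: bool = False

    def to_dict(self) -> dict[str, Any]:
        # Emit exactly the field names used in the schema.
        return {
            "failure_type": self.failure_type,
            "partial_results": list(self.partial_results),
            "alternatives_suggested": list(self.alternatives_suggested),
            "retry_recommended": self.retry_recommended,
        }


error = SubAgentError(
    failure_type="auth_error",
    alternatives_suggested=["try backup API", "use cached data"],
)
payload = error.to_dict()
print(json.dumps(payload, indent=2))
```

Defaulting `retry_recommended` to `False` is a design assumption: a sub-agent then has to opt in to a retry explicitly.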

The Recovery Data

500 sub-agent failures compared:

| Error format | Successful recovery |
| --- | --- |
| Generic “failed” | 18% |
| Structured (type + partial + alternatives) | 71% |

Each field adds value incrementally: failure_type alone → 42%. Add partial_results → 58%. Add alternatives_suggested → 71%. Add retry_recommended → 74%.
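The size of each step is easier to see as a difference in percentage points. This short sketch only does arithmetic on the rates quoted above; the variable names are illustrative.

```python
# Recovery rates quoted above, in the order the fields are added.
recovery_rate = [
    ("generic 'failed'", 18),
    ("+ failure_type", 42),
    ("+ partial_results", 58),
    ("+ alternatives_suggested", 71),
    ("+ retry_recommended", 74),
]

# Gain in percentage points contributed by each added field.
gains = {}
for (_, prev), (label, rate) in zip(recovery_rate, recovery_rate[1:]):
    gains[label] = rate - prev
    print(f"{label:<26} {rate}%  (+{rate - prev} points)")
```

On these figures, failure_type accounts for the largest single step (24 points) and retry_recommended for the smallest (3 points).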

failure_type Enum → Recovery Router

| Type | Recovery action |
| --- | --- |
| timeout | Retry same query |
| auth_error | Refresh credentials, then retry |
| not_found | Try alternative source |
| rate_limited | Back off, then retry |
| permanent | Do not retry, use alternatives |

A failure_type enum maps directly to recovery handlers. No message parsing, no guessing, no wasted retries on permanent errors.
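A minimal sketch of such a router, assuming the orchestrator receives the error as a dict. The handler functions are hypothetical stubs that only return a label for the action they would take; a real handler would perform the retry, credential refresh, or source switch.

```python
from enum import Enum
from typing import Callable


class FailureType(Enum):
    TIMEOUT = "timeout"
    AUTH_ERROR = "auth_error"
    NOT_FOUND = "not_found"
    RATE_LIMITED = "rate_limited"
    PERMANENT = "permanent"


# Hypothetical stub handlers, one per recovery action in the table.
def retry_same_query(error: dict) -> str:
    return "retry same query"

def refresh_then_retry(error: dict) -> str:
    return "refresh credentials, then retry"

def try_alternative_source(error: dict) -> str:
    return "try alternative source"

def back_off_then_retry(error: dict) -> str:
    return "back off, then retry"

def use_alternatives(error: dict) -> str:
    return "do not retry, use alternatives"


HANDLERS: dict[FailureType, Callable[[dict], str]] = {
    FailureType.TIMEOUT: retry_same_query,
    FailureType.AUTH_ERROR: refresh_then_retry,
    FailureType.NOT_FOUND: try_alternative_source,
    FailureType.RATE_LIMITED: back_off_then_retry,
    FailureType.PERMANENT: use_alternatives,
}


def route(error: dict) -> str:
    # Enum lookup raises ValueError for a type outside the enum,
    # so an unknown failure is surfaced instead of silently retried.
    failure_type = FailureType(error["failure_type"])
    return HANDLERS[failure_type](error)


print(route({"failure_type": "auth_error"}))
```

Dispatching through a dict keyed by the enum keeps the routing table in one place, so adding a failure type means adding one enum member and one handler.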

Partial Results: Don’t Waste Completed Work

A sub-agent searched 5 sources, found results from 3, then timed out. Returning only “timeout error” loses the 3 successful results. The orchestrator retries from scratch — re-searching sources already completed.

A structured error with partial_results preserves the 3 results and identifies the 2 gaps. The orchestrator uses what exists and retries only the remaining sources.
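The "retry only the remaining sources" step can be sketched as a set difference. This assumes each entry in partial_results records which source it came from under a hypothetical `source` key; the source names are placeholders.

```python
def remaining_sources(requested: list[str], error: dict) -> list[str]:
    """Return the requested sources with no entry in partial_results."""
    done = {item["source"] for item in error.get("partial_results", [])}
    # Preserve the original request order for the retry.
    return [s for s in requested if s not in done]


requested = ["source_a", "source_b", "source_c", "source_d", "source_e"]

# Hypothetical error from a sub-agent that finished 3 of 5 sources.
error = {
    "failure_type": "timeout",
    "partial_results": [
        {"source": "source_a", "data": ["result 1"]},
        {"source": "source_b", "data": ["result 2"]},
        {"source": "source_c", "data": ["result 3"]},
    ],
    "alternatives_suggested": [],
    "retry_recommended": True,
}

to_retry = remaining_sources(requested, error)
print(to_retry)  # ['source_d', 'source_e']
```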

Silent Completion Is Worse Than Reported Failure

Returning success with only the 3 results (hiding the 2 unsearched sources) is deceptive. The orchestrator believes it has comprehensive results when 40% of sources were never checked. Transparent error reporting with gaps identified is always better than silent partial completion.
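On the sub-agent side, one way to avoid silent partial completion is to report success only when every requested source was actually searched. A minimal sketch: the `status`, `unsearched_sources`, and `coverage` fields and the function name `build_report` are assumptions added for illustration, not part of the schema above.

```python
def build_report(requested: list[str], results: dict[str, list],
                 failure_type: str = "timeout") -> dict:
    """results maps each source that was actually searched to its findings."""
    missing = [s for s in requested if s not in results]
    if not missing:
        return {"status": "success", "results": results}
    # Any gap turns the response into a structured error that names it.
    return {
        "status": "error",
        "failure_type": failure_type,
        "partial_results": [
            {"source": s, "data": d} for s, d in results.items()
        ],
        "unsearched_sources": missing,            # hypothetical field
        "coverage": len(results) / len(requested),  # hypothetical field
        "retry_recommended": True,
    }


report = build_report(
    requested=["a", "b", "c", "d", "e"],
    results={"a": ["r1"], "b": ["r2"], "c": ["r3"]},
)
print(report["status"], report["unsearched_sources"], report["coverage"])
```

With 3 of 5 sources searched, the report carries the three results and names the two gaps, matching the 40% of sources that were never checked.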

The First Step

If sub-agents currently return unstructured error messages, define a standard error schema for all sub-agents. Consistent structure lets the orchestrator reliably extract recovery information without per-agent parsing logic.
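A lightweight check at the orchestrator boundary can flag sub-agents that have not yet adopted the schema. This is a sketch under the assumption that errors arrive as plain dicts; `REQUIRED_FIELDS` and `schema_problems` are hypothetical names.

```python
# Field names from the schema, with the Python type each should carry.
REQUIRED_FIELDS = {
    "failure_type": str,
    "partial_results": list,
    "alternatives_suggested": list,
    "retry_recommended": bool,
}


def schema_problems(error: object) -> list[str]:
    """Return a list of schema violations; an empty list means conforming."""
    if not isinstance(error, dict):
        return [f"expected an object, got {type(error).__name__}"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in error:
            problems.append(f"missing field: {name}")
        elif not isinstance(error[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems


print(schema_problems("search failed"))
print(schema_problems({"failure_type": "timeout", "retry_recommended": True}))
```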


One-liner: Return structured errors with failure_type, partial_results, and alternatives_suggested. Routing each error type to the correct recovery strategy lets the orchestrator recover 71% of failures instead of 18%.