S2.2.1 Task 2.2

Retry Locally First, Propagate with Context If Recovery Fails

Sub-agents should handle transient errors locally. A timeout that resolves on retry doesn’t need coordinator involvement. But when local recovery fails, the propagated error must include structured context — without it, the coordinator defaults to the simplest (often wrong) recovery strategy.

The local recovery pattern

  1. Sub-agent encounters transient error (timeout, rate limit)
  2. Retry locally with brief backoff (the source was responsive recently)
  3. If retry succeeds → return results normally. Coordinator never knows.
  4. If retry fails → propagate to coordinator with full structured context

Most transient errors resolve in seconds. A single local retry handles the majority of timeouts without coordinator involvement.
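
The four steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: TransientError, call_with_local_retry, and the default retry count and backoff values are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeout, rate limit)."""


def call_with_local_retry(
    operation: Callable[[], T],
    max_retries: int = 1,            # assumed default: a single local retry
    backoff_seconds: float = 0.5,    # assumed "brief backoff"
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures locally before giving up."""
    attempt = 0
    while True:
        try:
            # Success on any attempt returns normally; the coordinator never knows.
            return operation()
        except TransientError:
            if attempt >= max_retries:
                # Local recovery failed: re-raise so the caller can propagate
                # the failure to the coordinator with structured context.
                raise
            sleep(backoff_seconds * (2 ** attempt))
            attempt += 1
```

Only errors marked as transient are retried; anything else escapes immediately, which matches the idea that permanent errors should not be retried locally.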

What to include when propagating

Wrong: {"status": "failed", "message": "Could not complete search"}

The coordinator has three strategies: (1) retry the same query, (2) retry a modified query, (3) proceed without results. Given only “could not complete search,” it defaults to strategy 1 every time, even for permanent errors where a retry is futile. 30% of failures are permanent, but the coordinator can’t tell which ones.

Right: structured context with:

  • Failure type: transient / permanent / validation
  • isRetryable: boolean
  • Attempted query: what was tried
  • Partial results: anything obtained before failure
  • Suggested alternatives: other sources or approaches

This lets the coordinator match failure type to strategy: transient → retry the same query, validation → retry with a modified query, permanent → proceed without results.
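
One way the structured context and the coordinator's strategy choice might look in code. This is a sketch under assumptions: PropagatedError, FailureType, choose_strategy, and the strategy labels are hypothetical names, and treating a non-retryable transient error as "proceed without results" is an assumption the notes do not state.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class FailureType(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    VALIDATION = "validation"


@dataclass
class PropagatedError:
    """Hypothetical error payload a sub-agent sends to the coordinator."""
    failure_type: FailureType
    is_retryable: bool
    attempted_query: str
    partial_results: List[str] = field(default_factory=list)
    suggested_alternatives: List[str] = field(default_factory=list)


def choose_strategy(error: PropagatedError) -> str:
    """Map failure type to one of the coordinator's three strategies."""
    if error.failure_type is FailureType.TRANSIENT and error.is_retryable:
        return "retry_same_query"
    if error.failure_type is FailureType.VALIDATION:
        return "retry_modified_query"
    # Permanent failures (and, by assumption, non-retryable transient ones)
    # are not worth another attempt.
    return "proceed_without_results"
```

With a payload like this, the coordinator no longer has to guess: the failure type selects the strategy, and partial results and suggested alternatives remain available whichever strategy it picks.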

The silent swallow anti-pattern

A sub-agent catches all errors and returns empty results as success. The coordinator makes decisions assuming the task completed normally. Missing data gets incorporated into final outputs without indication of gaps.

This is worse than propagating errors — at least error propagation gives the coordinator a chance to adapt. Silent swallowing hides the failure entirely.
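
The contrast can be illustrated with two hypothetical sub-agent wrappers around the same failing source. The function names and the result dictionary shape are assumptions made for this sketch.

```python
from typing import Callable, List


def search_swallowing(run_query: Callable[[str], List[str]], query: str) -> dict:
    """Anti-pattern: every failure looks like an empty but successful search."""
    try:
        return {"status": "success", "results": run_query(query)}
    except Exception:
        # The failure is hidden; the coordinator sees a normal completion.
        return {"status": "success", "results": []}


def search_reporting(run_query: Callable[[str], List[str]], query: str) -> dict:
    """Alternative: a failure is reported as a failure, with context."""
    try:
        return {"status": "success", "results": run_query(query)}
    except Exception as exc:
        return {
            "status": "failed",
            "attempted_query": query,
            "error": repr(exc),
            "partial_results": [],
        }
```

From the coordinator's side, the first wrapper makes "the source had nothing" and "the source was unreachable" indistinguishable; the second keeps them apart.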

When to handle locally vs propagate

Error type           Local handling                      Propagate
Transient (timeout)  Retry with backoff (1-2 attempts)   If retry fails
Rate limit           Wait and retry                      If still limited after delay
Validation           Fix input if possible               If agent can’t determine fix
Permission           Can’t fix locally                   Always propagate immediately
Business rule        Can’t fix locally                   Always propagate immediately

Transient errors: try locally first. Permanent errors (permission, business rule): propagate immediately, since retrying cannot help.
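
The handle-locally-versus-propagate decision can be expressed as a small lookup. This is a sketch only: ErrorType, LOCAL_ATTEMPTS, and should_propagate are hypothetical names, and the attempt budgets (2 for transient, 1 for rate limit and validation) are assumed values picked to fit the guidance above.

```python
from enum import Enum


class ErrorType(Enum):
    TRANSIENT = "transient"
    RATE_LIMIT = "rate_limit"
    VALIDATION = "validation"
    PERMISSION = "permission"
    BUSINESS_RULE = "business_rule"


# Assumed budget of local recovery attempts per error type.
LOCAL_ATTEMPTS = {
    ErrorType.TRANSIENT: 2,      # retry with backoff, 1-2 attempts
    ErrorType.RATE_LIMIT: 1,     # wait once, then retry
    ErrorType.VALIDATION: 1,     # one attempt to fix the input
    ErrorType.PERMISSION: 0,     # cannot be fixed locally
    ErrorType.BUSINESS_RULE: 0,  # cannot be fixed locally
}


def should_propagate(error_type: ErrorType, local_attempts_made: int) -> bool:
    """Return True once the local attempt budget for this error type is spent."""
    return local_attempts_made >= LOCAL_ATTEMPTS[error_type]
```

A budget of zero means the error is propagated on first sight, which is how permission and business-rule failures are treated here.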


One-liner: Retry transient errors locally before propagating — but when propagation is needed, include failure type, retryability, attempted query, partial results, and alternatives so the coordinator can choose the right recovery strategy, not just default to retry.