K5.3.2 Task 5.3

"Database Error" — Orchestrator Retried 5 Times. The Database Was Permanently Decommissioned.

Generic error messages such as “failed,” “unavailable,” or “error occurred” strip out all recovery information. The orchestrator cannot distinguish a timeout (retry) from a permanent failure (switch source) or an auth expiry (refresh credentials), so it blindly retries everything, including requests that will never succeed.

The Cost of Generic Errors

1,000 sub-agent failures over 3 months with generic error messages:

  • 350 permanent errors retried 3x each = 1,050 wasted API calls
  • 480 failures had partial results that were discarded and re-fetched
  • Total waste: $2,100 in API costs + 15 hours of delayed customer resolutions

After switching to structured errors, wasted calls dropped 94% and partial data was preserved in 89% of failures.
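The arithmetic behind these figures can be checked in a few lines. This is only a sketch: the input numbers are the ones quoted above, while the post-fix call count is derived here from the reported 94% drop and is not a figure the text states directly.

```python
# Figures quoted in the text above.
total_failures = 1_000
permanent_errors = 350
retries_per_permanent_error = 3
failures_with_partial_results = 480

# Retries of permanent errors can never succeed, so every one is wasted.
wasted_calls = permanent_errors * retries_per_permanent_error

# Share of all failures that carried partial results worth keeping.
partial_share = failures_with_partial_results / total_failures

# Implied by the reported 94% drop; derived here, not stated in the text.
wasted_calls_after = round(wasted_calls * (1 - 0.94))

print(wasted_calls, f"{partial_share:.0%}", wasted_calls_after)
```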

What Generic Errors Cause

Problem | Impact
Blind retry of permanent failures | Wasted computation, never succeeds
Discarded partial results | Re-does work already completed
No alternative source switching | Permanent failures never recover
Cannot distinguish 3 failure types returning “failed” | Same generic retry for all

With structured errors, the orchestrator retried only transient failures (65%), switched sources for permanent failures (35%), and preserved 48% of partial results.

The Fix: Replace Generic Strings with Structured Fields

Replace "search failed" with:

{
  "failure_type": "auth_error",
  "partial_results": [],
  "alternatives_suggested": ["refresh credentials"],
  "retry_recommended": false
}
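On the sub-agent side, a response like this could be produced from a small typed record instead of a hand-built string. The sketch below is one possible shape, not a prescribed implementation: the class name SubAgentError and its to_json helper are hypothetical, and only the four field names come from the example above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List


@dataclass
class SubAgentError:
    """Hypothetical container mirroring the four fields shown above."""

    failure_type: str
    partial_results: List[Any] = field(default_factory=list)
    alternatives_suggested: List[str] = field(default_factory=list)
    retry_recommended: bool = False

    def to_json(self) -> str:
        # Serialize to the JSON shape the orchestrator would receive.
        return json.dumps(asdict(self))


auth_failure = SubAgentError(
    failure_type="auth_error",
    alternatives_suggested=["refresh credentials"],
)
payload = auth_failure.to_json()
print(payload)
```

Defaulting partial_results to an empty list means the field is always present, so the orchestrator never has to guess whether partial data was omitted or simply absent.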

The orchestrator routes by failure_type: timeout → retry, auth_error → refresh credentials, permanent → switch source, unknown → human review.
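The routing rule can be sketched as a lookup table. The action names and the route function are hypothetical, as are the "timeout" and "permanent" values, which are assumed by analogy with the "auth_error" value shown earlier. Any missing or unrecognized failure_type is treated here as "unknown", which is an assumption about the desired fallback.

```python
from typing import Any, Dict

# Hypothetical mapping from failure_type to a recovery action.
ROUTES: Dict[str, str] = {
    "timeout": "retry",
    "auth_error": "refresh_credentials",
    "permanent": "switch_source",
    "unknown": "human_review",
}


def route(error: Dict[str, Any]) -> str:
    """Pick a recovery action; anything unrecognized goes to human review."""
    failure_type = error.get("failure_type", "unknown")
    return ROUTES.get(failure_type, ROUTES["unknown"])


decommissioned = {
    "failure_type": "permanent",
    "partial_results": [],
    "alternatives_suggested": [],
    "retry_recommended": False,
}
print(route(decommissioned))
```

With this table, a decommissioned database is routed to a source switch on the first failure instead of being retried.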

Handle Unknown Failure Types

Include an "unknown" category in the failure_type enum. Even an unknown failure may carry partial results, and the sub-agent can still indicate whether a retry might help. Send the raw error message alongside the unknown type for diagnostic context; because it travels inside the structured response, this does not revert to the generic-error anti-pattern.
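A minimal sketch of how a sub-agent might report a failure it cannot classify. The function name describe_unknown_failure and the raw_error field are hypothetical; the text only says that the raw message accompanies the unknown type, not what the field is called.

```python
from typing import Any, Dict, List


def describe_unknown_failure(
    exc: Exception,
    partial_results: List[Any],
    retry_might_help: bool,
) -> Dict[str, Any]:
    """Wrap an unclassified exception without losing partial work."""
    return {
        "failure_type": "unknown",
        "partial_results": list(partial_results),
        "alternatives_suggested": [],
        "retry_recommended": retry_might_help,
        # Hypothetical field name: raw text kept for diagnostics only.
        "raw_error": f"{type(exc).__name__}: {exc}",
    }


fetched_so_far = [{"id": 1}, {"id": 2}]
try:
    raise ValueError("unexpected payload shape")
except ValueError as exc:
    report = describe_unknown_failure(exc, fetched_so_far, retry_might_help=False)

print(report["failure_type"], len(report["partial_results"]))
```

The raw message is carried as one field among several, so the orchestrator still routes on failure_type and never has to parse free text.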

Monitoring + Recovery: Not a Tradeoff

Infrastructure wants simple status for dashboards. Orchestration wants detailed context for recovery. Both needs are met by a single response: a top-level status field for monitoring plus detailed fields for recovery. No compromise required.
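The sketch below shows one response shape serving both consumers. The status values "ok" and "error", the results field, and both helper functions are assumptions made for illustration; the text specifies only that a top-level status field sits alongside the detailed fields.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical responses: every one carries a top-level "status".
responses: List[Dict[str, Any]] = [
    {"status": "ok", "results": [{"id": 1}]},
    {
        "status": "error",
        "failure_type": "timeout",
        "partial_results": [{"id": 2}],
        "retry_recommended": True,
    },
    {
        "status": "error",
        "failure_type": "permanent",
        "partial_results": [],
        "retry_recommended": False,
    },
]


def dashboard_counts(items: List[Dict[str, Any]]) -> Counter:
    """Monitoring view: reads only the top-level status."""
    return Counter(item["status"] for item in items)


def recovery_queue(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Orchestration view: keeps the full detail of each failure."""
    return [item for item in items if item["status"] == "error"]


print(dashboard_counts(responses))
print([item["failure_type"] for item in recovery_queue(responses)])
```

The dashboard never touches the detail fields and the orchestrator never has to infer detail from the status, so neither consumer constrains the other.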


One-liner: Generic errors cause $2,100 in wasted retries and 15 hours of delays per quarter — replace them with structured failure_type, partial_results, and alternatives to cut waste by 94%.