K2.2.1 Task 2.2

Structured Errors: 78-95% Recovery vs 15% for Generic "Operation Failed"

Setting isError: true is just the starting point. A generic “Operation failed” message yields only 15% agent recovery. Adding structured metadata (error category, retryability, customer message, suggested action) raises that to 78-95%, depending on the error type.

The recovery rate data

| Error type | Metadata provided | Recovery rate |
| --- | --- | --- |
| Transient (isRetryable: true) | Category + retryability | 92% auto-recovered |
| Validation (field-level details) | Category + specific errors | 78% auto-corrected |
| Business (isRetryable: false) | Category + customer message | 95% correctly escalated |
| Generic (“Operation failed”) | None | 15% recovery |

The difference between 15% and 78-95% is structured metadata, not model capability.

The basic pattern

{
  "content": [{"type": "text", "text": "Database query timed out after 30s"}],
  "isError": true
}

The agent knows the tool failed and sees why. But it doesn’t know: should it retry? Is this temporary? What should it tell the user?

The full structured pattern

{
  "content": [{"type": "text", "text": "Refund of $750 exceeds $500 policy limit"}],
  "isError": true,
  "structuredContent": {
    "errorCategory": "business",
    "isRetryable": false,
    "customerMessage": "Refunds over $500 require manager approval",
    "suggestedAction": "escalate_to_human"
  }
}

Now the agent knows: it’s a business rule (not transient), retrying won’t help, it can tell the customer why, and it should escalate.
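To make that decision path concrete, here is a minimal sketch of how an agent loop might read the metadata. It assumes results are plain dicts shaped like the JSON above. The function name next_step, the category string "validation", and the return values "retry", "fix_input", "use_result" and "guess" are hypothetical names chosen for illustration and do not come from the MCP spec.

```python
from typing import Any


def next_step(result: dict[str, Any]) -> str:
    """Pick a recovery move from a tool result (illustrative names only)."""
    if not result.get("isError"):
        return "use_result"
    meta = result.get("structuredContent") or {}
    # An explicit suggestion from the tool wins over anything inferred.
    if meta.get("suggestedAction"):
        return meta["suggestedAction"]
    if meta.get("isRetryable") is True:
        return "retry"
    if meta.get("errorCategory") == "validation":
        return "fix_input"
    if meta.get("isRetryable") is False:
        return "escalate_to_human"
    # No metadata at all: the generic "Operation failed" case.
    return "guess"


refund_error = {
    "content": [{"type": "text", "text": "Refund of $750 exceeds $500 policy limit"}],
    "isError": True,
    "structuredContent": {
        "errorCategory": "business",
        "isRetryable": False,
        "customerMessage": "Refunds over $500 require manager approval",
        "suggestedAction": "escalate_to_human",
    },
}
generic_error = {
    "content": [{"type": "text", "text": "Operation failed"}],
    "isError": True,
}
```

With these inputs, next_step(refund_error) follows the tool's own suggestion, while next_step(generic_error) falls through every branch and has nothing to go on.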

Four anti-patterns

Silent swallow: isError: false with empty content when the database is unreachable. The agent tells the researcher “no papers found” — when thousands exist. The database was down, not empty.

Success with error text: isError: false with content “error occurred.” The agent treats this as the data returned.

Unhandled exception: tool errors propagate as exceptions and surface as JSON-RPC protocol errors that crash the connection. This converts recoverable tool errors into non-recoverable protocol errors.

Generic message: “Operation failed” for every error type. Agent retries everything identically — wasting attempts on non-retryable errors.
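One way to avoid all four anti-patterns is a single wrapper around the tool body. The sketch below is illustrative, not a reference implementation: run_search, error_result, the "internal" category and the action strings are hypothetical names, and customerMessage is left out to keep it short. The search backend is passed in as a plain callable so the example runs without any SDK.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("paper_search")


def error_result(text: str, category: str, retryable: bool, action: str) -> dict[str, Any]:
    """Build an error result that carries recovery metadata."""
    return {
        "content": [{"type": "text", "text": text}],
        "isError": True,
        "structuredContent": {
            "errorCategory": category,
            "isRetryable": retryable,
            "suggestedAction": action,
        },
    }


def run_search(search: Callable[[str], list[str]], query: str) -> dict[str, Any]:
    """Call a search backend and always return a tool result, never raise."""
    try:
        papers = search(query)
    except (ConnectionError, TimeoutError):
        # Backend down is not the same as "no papers found".
        return error_result("Paper database temporarily unavailable", "transient", True, "retry")
    except ValueError as exc:
        return error_result(f"Invalid query: {exc}", "validation", False, "fix_input")
    except Exception:
        # Full details stay in the server log; the agent gets a safe summary.
        logger.exception("unexpected failure in search")
        return error_result("Search failed unexpectedly", "internal", False, "escalate_to_human")
    # A genuinely empty result is a success and says so explicitly.
    text = "\n".join(papers) if papers else "No papers matched the query"
    return {"content": [{"type": "text", "text": text}], "isError": False}
```

The key design choice is that an unreachable backend and an empty result set take different code paths, so the agent can never confuse "down" with "empty".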

Protocol errors vs tool execution errors

| Type | Example | LLM recoverability |
| --- | --- | --- |
| Protocol error | Misspelled tool name | Low (code fix needed) |
| Tool execution error | API timeout | High (LLM can retry/adapt) |

The MCP spec says tool execution errors SHOULD be provided to the LLM for self-correction. Letting exceptions propagate converts recoverable errors into non-recoverable ones.
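The two rows of that table correspond to two different JSON-RPC response shapes. The following is a rough sketch of a hand-written tools/call handler using plain dicts, not any SDK's API. TOOLS and handle_tools_call are hypothetical names, and using code -32602 (JSON-RPC "Invalid params") for an unknown tool is an assumption made for this example.

```python
from typing import Any, Callable

# Hypothetical registry mapping tool names to plain Python callables.
TOOLS: dict[str, Callable[..., str]] = {}


def handle_tools_call(request: dict[str, Any]) -> dict[str, Any]:
    """Answer a tools/call request with either an error or a result object."""
    params = request.get("params", {})
    name = params.get("name")
    if name not in TOOLS:
        # Protocol error: the request itself is wrong, so there is no result.
        return {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "error": {"code": -32602, "message": f"Unknown tool: {name}"},
        }
    try:
        text, is_error = TOOLS[name](**params.get("arguments", {})), False
    except TimeoutError:
        text, is_error = "Upstream API timed out", True
    except Exception:
        text, is_error = "Tool failed unexpectedly", True
    # Tool execution error: still a normal result, flagged with isError.
    return {
        "jsonrpc": "2.0",
        "id": request.get("id"),
        "result": {"content": [{"type": "text", "text": text}], "isError": is_error},
    }
```

A timeout inside a registered tool comes back under "result" with isError: true, where the model can see it. Only the misspelled tool name produces a top-level "error".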

Security: sanitized error content

Wrong: “Connection to db-prod-3.internal:5432 refused.”

Right: “Database temporarily unavailable.”

Include the error category and recoverability in the response, and log the full technical details server-side. The agent gets recovery metadata without sensitive internals.
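A small sketch of that split between what is logged and what is returned, under stated assumptions: sanitized_error, the "internal" category and the incidentId field are hypothetical additions for illustration. The incident id is one possible way to let a human match an agent-visible error to the server log.

```python
import logging
import uuid
from typing import Any

logger = logging.getLogger("tool_errors")


def sanitized_error(exc: Exception) -> dict[str, Any]:
    """Log the raw exception server-side; return only safe recovery metadata."""
    incident_id = uuid.uuid4().hex[:8]
    # Hostnames, ports and stack details go to the log, never to the agent.
    logger.error("incident %s: %r", incident_id, exc)
    if isinstance(exc, (ConnectionError, TimeoutError)):
        text, category, retryable = "Database temporarily unavailable", "transient", True
    else:
        text, category, retryable = "Internal error", "internal", False
    return {
        "content": [{"type": "text", "text": text}],
        "isError": True,
        "structuredContent": {
            "errorCategory": category,
            "isRetryable": retryable,
            "incidentId": incident_id,  # hypothetical correlation field
        },
    }
```

The returned text is chosen from a fixed set of messages rather than derived from the exception, so internal details cannot leak through string formatting.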


One-liner: isError: true with structured metadata (errorCategory, isRetryable, customerMessage, suggestedAction) enables 78-95% agent recovery vs 15% for generic messages — catch all tool errors as CallToolResult, never let them become protocol exceptions, and sanitize for security.