Structured Errors: 78-95% Recovery vs 15% for Generic "Operation Failed"

Setting isError: true is just the starting point. A generic “Operation failed” message yields only 15% agent recovery. Adding structured metadata (error category, retryability, customer message, suggested action) raises recovery to 78-95%, depending on the error type.

The recovery rate data

Error type	Metadata provided	Recovery rate
Transient (isRetryable: true)	Category + retryability	92% auto-recovered
Validation (field-level details)	Category + specific errors	78% auto-corrected
Business (isRetryable: false)	Category + customer message	95% correctly escalated
Generic (“Operation failed”)	None	15% recovery

The difference between 15% and 78-95% is structured metadata, not model capability.

The basic pattern

{
  "content": [{"type": "text", "text": "Database query timed out after 30s"}],
  "isError": true
}

The agent knows the tool failed and sees why. But it doesn’t know: should it retry? Is this temporary? What should it tell the user?

The full structured pattern

{
  "content": [{"type": "text", "text": "Refund of $750 exceeds $500 policy limit"}],
  "isError": true,
  "structuredContent": {
    "errorCategory": "business",
    "isRetryable": false,
    "customerMessage": "Refunds over $500 require manager approval",
    "suggestedAction": "escalate_to_human"
  }
}

Now the agent knows: it’s a business rule (not transient), retrying won’t help, it can tell the customer why, and it should escalate.
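A minimal sketch of how both sides might use this metadata, assuming plain Python dicts shaped like the JSON above. The names make_error_result and next_step, and the "use_result" and "retry" labels, are hypothetical and not part of MCP or any SDK.

```python
from typing import Any


def make_error_result(text: str, category: str, retryable: bool,
                      customer_message: str, suggested_action: str) -> dict[str, Any]:
    """Build a tool result dict shaped like the structured pattern above."""
    return {
        "content": [{"type": "text", "text": text}],
        "isError": True,
        "structuredContent": {
            "errorCategory": category,
            "isRetryable": retryable,
            "customerMessage": customer_message,
            "suggestedAction": suggested_action,
        },
    }


def next_step(result: dict[str, Any]) -> str:
    """Pick a recovery path from the metadata (hypothetical agent-side policy)."""
    if not result.get("isError"):
        return "use_result"
    meta = result.get("structuredContent") or {}
    if meta.get("isRetryable"):
        return "retry"
    # Non-retryable: follow the server's suggestion, else hand off to a human.
    return meta.get("suggestedAction", "escalate_to_human")


refund_error = make_error_result(
    "Refund of $750 exceeds $500 policy limit",
    category="business",
    retryable=False,
    customer_message="Refunds over $500 require manager approval",
    suggested_action="escalate_to_human",
)
print(next_step(refund_error))  # escalate_to_human
```

With a generic error that carries no structuredContent, this policy has nothing to branch on and can only fall back to its default, which is the gap the metadata is meant to close.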

Four anti-patterns

Silent swallow: isError: false with empty content when the database is unreachable. The agent tells the researcher “no papers found” — when thousands exist. The database was down, not empty.

Success with error text: isError: false with content “error occurred.” The agent treats this as the data returned.

Unhandled exception: a tool error propagates as an exception and surfaces as a JSON-RPC protocol error instead of a tool result, and in some implementations it can crash the connection. This converts a recoverable tool error into a non-recoverable protocol error.

Generic message: “Operation failed” for every error type. Agent retries everything identically — wasting attempts on non-retryable errors.
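As an illustration of avoiding the silent swallow, the sketch below separates "the database was down" from "the query matched nothing". The search_papers function, the DatabaseUnavailable exception, and the run_query parameter are hypothetical names chosen for this example.

```python
class DatabaseUnavailable(Exception):
    """Hypothetical error raised when the paper database cannot be reached."""


def search_papers(query: str, run_query) -> dict:
    """Return matching papers, or an explicit error result if the lookup fails."""
    try:
        rows = run_query(query)
    except DatabaseUnavailable:
        # An outage is reported as an error, never as an empty success.
        return {
            "content": [{"type": "text", "text": "Paper database temporarily unavailable"}],
            "isError": True,
            "structuredContent": {"errorCategory": "transient", "isRetryable": True},
        }
    # A genuinely empty result is a success with zero matches.
    return {
        "content": [{"type": "text", "text": f"{len(rows)} papers found"}],
        "isError": False,
        "structuredContent": {"papers": rows},
    }


def database_down(_query):
    raise DatabaseUnavailable()


print(search_papers("protein folding", database_down)["isError"])    # True
print(search_papers("protein folding", lambda q: [])["isError"])     # False
```

Only the second call should lead the agent to say "no papers found"; the first gives it a reason to retry.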

Protocol errors vs tool execution errors

Type	Example	LLM recoverability
Protocol error	Misspelled tool name	Low (code fix needed)
Tool execution error	API timeout	High (LLM can retry/adapt)

The MCP spec says tool execution errors SHOULD be provided to the LLM for self-correction. Letting exceptions propagate converts recoverable errors into non-recoverable ones.
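One way to keep exceptions from escaping is to wrap every tool handler, sketched here with plain dicts. The decorator name as_tool_result, the _error helper, the lookup_order tool, and the "unknown" category are assumptions for illustration; some SDKs may already perform a similar conversion, so check what yours does before adding a layer like this.

```python
import functools
import logging

logger = logging.getLogger("tools")


def _error(text: str, category: str, retryable: bool) -> dict:
    """Build an isError result with minimal recovery metadata."""
    return {
        "content": [{"type": "text", "text": text}],
        "isError": True,
        "structuredContent": {"errorCategory": category, "isRetryable": retryable},
    }


def as_tool_result(func):
    """Wrap a tool handler so any exception becomes an isError result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            value = func(*args, **kwargs)
        except TimeoutError:
            return _error("Upstream request timed out", "transient", True)
        except ValueError as exc:
            # Assumes the message is safe to show; an unchanged retry will not help.
            return _error(f"Invalid input: {exc}", "validation", False)
        except Exception:
            # Catch-all: keep the traceback in server logs, return a generic error.
            logger.exception("Unhandled error in tool %s", func.__name__)
            return _error("Internal tool error", "unknown", False)
        return {"content": [{"type": "text", "text": str(value)}], "isError": False}
    return wrapper


@as_tool_result
def lookup_order(order_id: str) -> str:
    if not order_id.isdigit():
        raise ValueError("order_id must be numeric")
    raise TimeoutError()  # simulate a slow upstream API


print(lookup_order("abc")["structuredContent"]["errorCategory"])  # validation
print(lookup_order("123")["structuredContent"]["errorCategory"])  # transient
```

The handler never raises past the wrapper, so every failure reaches the model as a tool result it can act on.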

Security: sanitized error content

Wrong: “Connection to db-prod-3.internal:5432 refused.”

Right: “Database temporarily unavailable.”

Include the error category and recoverability in the result, and log full technical details server-side. The agent gets recovery metadata without sensitive internals.
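A sanitizing step could look like the sketch below. The SAFE_MESSAGES table, the sanitized_error function, and the short reference ID used to correlate the response with the server log are all assumptions added for illustration.

```python
import logging
import uuid

logger = logging.getLogger("tools.errors")

# Hypothetical mapping from exception type to (safe message, category, retryable).
SAFE_MESSAGES = {
    ConnectionRefusedError: ("Database temporarily unavailable", "transient", True),
    TimeoutError: ("Request timed out", "transient", True),
}


def sanitized_error(exc: Exception) -> dict:
    """Log full detail server-side; return only safe text plus recovery metadata."""
    ref = uuid.uuid4().hex[:8]
    # Hostnames, ports, and other internals stay in the server log.
    logger.error("tool error ref=%s: %r", ref, exc)
    text, category, retryable = SAFE_MESSAGES.get(
        type(exc), ("Operation could not be completed", "unknown", False)
    )
    return {
        "content": [{"type": "text", "text": f"{text} (ref {ref})"}],
        "isError": True,
        "structuredContent": {"errorCategory": category, "isRetryable": retryable},
    }


result = sanitized_error(
    ConnectionRefusedError("Connection to db-prod-3.internal:5432 refused")
)
print(result["structuredContent"])
```

The lookup matches exact exception types only; a real mapping would probably need to handle subclasses and library-specific errors.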

One-liner: isError: true with structured metadata (errorCategory, isRetryable, customerMessage, suggestedAction) enables 78-95% agent recovery vs 15% for generic messages — catch all tool errors as CallToolResult, never let them become protocol exceptions, and sanitize for security.