K5.4.4 Task 5.4

Crash Recovery: A Manifest File So You Don't Start Over

A multi-agent codebase analysis takes 45 minutes. The coordinator has completed 3 of 5 phases. The process crashes. Without a recovery mechanism, you restart from scratch — losing 30 minutes of completed work.

This happens more often than teams expect. In one system, 23% of runs crashed at least once. The solution is not to eliminate crashes (though that helps too) but to make them cheap: persist enough state to resume from where you stopped.

Why In-Memory State Is Not Enough

The Claude API is stateless. There is no server-side session. The conversation history lives in your client process. When that process dies — OOM, network timeout, unhandled exception — everything in memory goes with it. The coordinator’s progress, the sub-agents’ findings, the entire analysis context: gone.

This is the fundamental anti-pattern for long-running multi-agent systems: relying entirely on in-memory conversation state without persisting progress to disk. A 5-minute task can afford to restart. A 45-minute task cannot.

The Manifest Approach

A manifest file is a lightweight state checkpoint that the coordinator writes after each major task or phase. It records:

  • Current phase — where in the overall workflow the system is
  • Completed tasks — which tasks finished successfully
  • Pending tasks — which tasks remain
  • Scratchpad references — paths to files containing detailed findings

On restart, the coordinator loads the manifest, injects the recovered state into agent prompts, and resumes from the last checkpoint. Completed work is not re-done. The system jumps directly to the crash point.

{
  "phase": 3,
  "completed": ["dependency_scan", "vulnerability_check", "license_audit"],
  "pending": ["performance_profile", "final_report"],
  "scratchpad_files": [
    "findings/dependencies.md",
    "findings/vulnerabilities.md",
    "findings/licenses.md"
  ],
  "last_updated": "2026-03-27T14:32:00Z"
}

This is not complex infrastructure. It is a JSON file, written after each phase completes, read on startup. A few lines of code provide the recovery mechanism.

What the Manifest Must Include

A common mistake is saving only task status — which tasks are done and which are pending — without referencing the findings from completed tasks.

One team implemented exactly this. After a crash and restart, the coordinator knew that tasks 1-3 were complete and it should start at task 4. But it had no access to what tasks 1-3 actually discovered. The result: the recovered agent used vague references like “typical patterns” instead of the specific class names and line numbers it had found before the crash.

The fix is including scratchpad file paths in the manifest. Since sub-agents typically write findings to scratchpad files already (a best practice from K5.4.2), the manifest just needs to point to those files. On recovery, the coordinator loads the manifest, reads the referenced scratchpad files, and has both the progress status and the detailed findings.

Without findings references: the agent knows where it stopped but not what it learned. With findings references: the agent resumes with full context.

Keep the Manifest Current

A stale manifest is almost as bad as no manifest. One team wrote the manifest once at the start of the run and never updated it. When crashes occurred on later tasks, the manifest pointed to an empty state — recovery started from scratch because the manifest said nothing was complete.

Metrics from a team that deployed manifest recovery but had one stale-manifest failure:

Without recoveryWith manifest recovery
Crashes per month77
Avg time lost per crash35 min (full restart)8 min (resume from checkpoint)
Monthly time lost245 min56 min
Recovery success rate86% (6/7)

The one failed recovery was caused by the stale manifest. The fix: update the manifest after every completed task, not on a schedule or only at startup. The overhead of writing a small JSON file is negligible.

What Not to Save

Not the full conversation history. One engineer proposed saving the complete conversation after every API call. The math kills this: a mid-run conversation reaches 100K+ tokens, and with 40+ API calls per analysis, that is 4 million tokens of I/O per run. Worse, on recovery the full conversation may exceed the context window, producing degraded quality. The manifest approach is orders of magnitude lighter.

Not a replay log. Recording every tool call and replaying them from the beginning to reconstruct state re-does all the completed work. If the goal is to avoid restarting, replaying the entire session from scratch accomplishes nothing.

Not a distributed transaction system. Two-phase commit, write-ahead logging, distributed consensus — these are appropriate for databases handling concurrent writes with consistency guarantees. A sequential multi-agent pipeline needs a JSON file updated after each phase, not enterprise infrastructure.

The Right Level of Engineering

Crash recovery for multi-agent systems sits at a specific point on the complexity spectrum:

ApproachComplexityEffectiveness
No recoveryNoneFull restart on every crash
Manifest + scratchpad refsLowResume from last completed phase
Full conversation saveHighLarge I/O, possible context overflow
Database + transactionsVery highOverkill for sequential pipelines
VM snapshotsVery highInfrastructure-dependent, slow

The manifest approach provides the vast majority of recovery value at minimal implementation cost. A team can go from zero recovery to effective crash resilience with a single JSON file and a few read/write calls.

Integration with the Broader Context Strategy

The manifest does not stand alone. It builds on the same file-based persistence that powers scratchpad files and sub-agent output:

  • Scratchpad files preserve findings as they are discovered (K5.4.2)
  • Sub-agent delegation keeps raw data isolated and returns summaries (K5.4.3)
  • The manifest ties it all together by recording which phases are done and where the findings live

On recovery, the coordinator loads the manifest, reads the referenced scratchpad files, and rebuilds enough context to continue. No work is repeated. The conversation is fresh, but the knowledge is preserved.


One-liner: Write a manifest file after each completed phase — recording task status and scratchpad paths — so a crash costs you one phase of work, not the entire run.