Error Distribution Pattern
Typical batch failure breakdown:
- 60% context exceeded → chunk the documents
- 25% malformed input → fix the request data
- 15% transient/expired → retry as-is
Retrying every failure unchanged (blind retry) reproduces the 60% context errors and the 25% malformed-input errors, because nothing about those requests has changed. Only the 15% transient or expired failures benefit from an unmodified retry.
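The routing above can be sketched as a small lookup from error type to recovery action. This is a minimal illustration, not a specific API: the category names (`context_exceeded`, `malformed_input`, `transient`, `expired`), the action labels, and the `plan_recovery` helper are all hypothetical, and a real batch service will report errors under its own names.

```python
# Hypothetical error categories mapped to the recovery action each one needs.
REMEDIATION = {
    "context_exceeded": "chunk",      # split the document and resubmit the pieces
    "malformed_input": "fix_input",   # repair the request data before resubmitting
    "transient": "retry",             # resubmit unchanged
    "expired": "retry",               # resubmit unchanged
}


def plan_recovery(failures):
    """Group failed request IDs by the action their error type calls for.

    `failures` is an iterable of (request_id, error_type) pairs.
    Unrecognized error types are routed to "inspect" for manual review
    instead of being retried blindly.
    """
    plan = {}
    for request_id, error_type in failures:
        action = REMEDIATION.get(error_type, "inspect")
        plan.setdefault(action, []).append(request_id)
    return plan


failures = [
    ("doc-1", "context_exceeded"),
    ("doc-2", "malformed_input"),
    ("doc-3", "transient"),
    ("doc-4", "context_exceeded"),
    ("doc-5", "expired"),
]
plan = plan_recovery(failures)
for action, request_ids in plan.items():
    print(action, request_ids)
```

Only the IDs under `retry` go back into the next batch unchanged; the others need a chunking or data-repair step first.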
Cancellation Saves Money
If early results from a batch reveal a systematic prompt error (e.g., wrong extraction schema), cancel the running batch before it completes. If the error surfaces after the first 10% of requests, for example, letting the remaining 90% run with a known-broken prompt spends money on output that will have to be discarded.
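One way to turn the cancellation rule into a check is sketched below. The thresholds (`min_sample`, `max_failure_rate`) and both helper names are assumptions chosen for illustration. The sketch only decides whether to cancel and estimates the spend avoided; the actual cancel call depends on the batch service in use.

```python
def should_cancel(early_results, min_sample=20, max_failure_rate=0.5):
    """Return True when early results point to a systematic failure.

    `early_results` is a list of booleans (True = the request succeeded).
    Both thresholds are illustrative assumptions: wait for at least
    `min_sample` results, then cancel if the share of failures reaches
    `max_failure_rate`.
    """
    if len(early_results) < min_sample:
        return False  # too few results to tell systematic from random failures
    failed = sum(1 for ok in early_results if not ok)
    return failed / len(early_results) >= max_failure_rate


def avoidable_cost(total_requests, completed, cost_per_request):
    """Estimate the cost of requests that have not run yet."""
    return (total_requests - completed) * cost_per_request


early = [False] * 18 + [True] * 2  # 18 of the first 20 results failed
if should_cancel(early):
    print("cancel; avoidable cost:", avoidable_cost(1000, len(early), 0.05))
```

A high early failure rate alone does not prove the prompt is at fault, so it is worth confirming that the failures share one cause before cancelling.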
Time Budget for Recovery
With a 30-hour SLA:
- Batch processing: up to 24 hours (worst case)
- Recovery window: 6 hours
- Strategy: submit batches every 4-6 hours, leaving time for one full recovery round within the SLA
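The budget arithmetic above is simple enough to check in code. This sketch assumes the 24-hour worst case from the list; the function name and the error handling are illustrative.

```python
def recovery_window_hours(sla_hours, worst_case_batch_hours=24):
    """Hours left for recovery if the batch takes its worst-case time."""
    window = sla_hours - worst_case_batch_hours
    if window < 0:
        raise ValueError("SLA is shorter than the worst-case batch time")
    return window


print(recovery_window_hours(30))  # 30 - 24 = 6
```

An SLA of 24 hours or less leaves no recovery window at all under this worst-case assumption.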
Sample Testing Before Full Submission
Test prompts on a diverse 20-50 document sample before submitting thousands. One team’s comparison:
- Without sample testing: 18% failure rate, $740 total cost
- With sample testing: 3% failure rate, $519 total cost (30% savings)
The $8/month spent on sample testing saved $300/month in reprocessing, roughly a 37x return (300 / 8 = 37.5).
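The two headline figures can be reproduced from the numbers quoted above. The helper names are illustrative, and the dollar amounts are the ones given in the comparison.

```python
def savings_fraction(cost_before, cost_after):
    """Fraction of the original cost that was saved."""
    return (cost_before - cost_after) / cost_before


def roi_multiple(monthly_savings, monthly_investment):
    """Savings returned per dollar invested."""
    return monthly_savings / monthly_investment


print(f"{savings_fraction(740, 519):.1%}")  # (740 - 519) / 740, about 30%
print(roi_multiple(300, 8))                 # 37.5
```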
One-liner: Test on a 20-50 document sample first (37x ROI), cancel batches with systematic errors early, fix different error types differently, and budget time for one recovery round within your SLA.