Prompt Engineering & Optimization
"Be Conservative" Means Nothing — 47% Agreement. "Lacks Sample Size" Means Something — 94%.
Specific criteria > vague instructions
One 60% False Positive Category Made Developers Ignore the 95% Accurate One
False positives poison trust
Labels Alone: 41%. Text Conditions: 72%. Text + Code Examples: 94%.
Severity definitions with examples
Text Instructions Failed at 64%. Two Examples Reached 91%.
Few-shot as ultimate calibration
Three Diverse Examples Beat Eight Homogeneous Ones
Few-shot design principles
Every Example Shows All Fields Populated — So the Model Fabricates Missing Ones
Few-shot reducing hallucination
Paired REPORT + SKIP Examples: 67% → 94% Boundary Accuracy
Few-shot for format consistency
tool_use Eliminates Structural Errors. Semantic Errors Remain.
tool_use for guaranteed structure
Forced, Any, Auto — Three Modes, Three Guarantees, Three Failure Patterns
input_schema enforcement
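The three modes named above can be written out as `tool_choice` values in the shape the Anthropic Messages API accepts. This is a minimal sketch: the tool name `extract_invoice` and the helper `guarantees_tool_call` are hypothetical, and the comments describe only what each mode guarantees about whether a tool is called, not the failure patterns the headline mentions.

```python
# Three tool_choice settings, written as the plain dicts a request would carry.
# "extract_invoice" is a hypothetical tool name used only for illustration.
TOOL_CHOICE_MODES = {
    # Auto: the model decides whether to call a tool; it may answer in text.
    "auto": {"type": "auto"},
    # Any: the model must call one of the provided tools, but picks which one.
    "any": {"type": "any"},
    # Forced: the model must call this specific tool.
    "forced": {"type": "tool", "name": "extract_invoice"},
}


def guarantees_tool_call(mode: str) -> bool:
    """Return True when the mode guarantees that some tool is called."""
    return TOOL_CHOICE_MODES[mode]["type"] in ("any", "tool")
```

Only the "any" and "forced" settings guarantee structured output on every response; with "auto" the caller still has to handle a plain-text reply.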
Required Fields on Optional Data: The #1 Structural Cause of Hallucination
Nullable fields for optional data
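One way to give the model a legitimate way to say "not present" is to keep the key required but allow `null` as a value. The sketch below assumes a hypothetical invoice-extraction tool; every field name is illustrative and `nullable_fields` is a helper invented for this example.

```python
# Hypothetical tool definition; all field names are illustrative only.
EXTRACT_TOOL = {
    "name": "extract_invoice",
    "description": "Extract invoice fields from a document.",
    "input_schema": {
        "type": "object",
        "properties": {
            # Always present in the source, so a plain string is fine.
            "vendor": {"type": "string"},
            # Optional in the source: the model may return null instead of guessing.
            "po_number": {"type": ["string", "null"]},
            "due_date": {
                "type": ["string", "null"],
                "description": "ISO 8601 date (YYYY-MM-DD), or null if absent",
            },
        },
        # Keys must be present, but nullable ones may carry null.
        "required": ["vendor", "po_number", "due_date"],
    },
}


def nullable_fields(tool: dict) -> list[str]:
    """List the properties whose declared type includes "null"."""
    props = tool["input_schema"]["properties"]
    return sorted(
        name
        for name, spec in props.items()
        if isinstance(spec.get("type"), list) and "null" in spec["type"]
    )
```

Listing the nullable fields this way makes it easy to audit a schema for required, non-nullable fields that the source documents do not always contain.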
Without Format Rules, the Same Date Comes Out Three Different Ways
Closed enum with "other" escape / Format standardization
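A format rule stated in the schema can also be checked downstream. The following is a sketch of such a check, assuming a hypothetical record with `category` and `date` fields; the enum values and function names are made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical closed enum; "other" is the escape value for unlisted cases.
ALLOWED_CATEGORIES = {"bug", "security", "style", "other"}


def is_iso_date(value: str) -> bool:
    """Accept only zero-padded YYYY-MM-DD strings that name a real date."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_output(record: dict) -> list[str]:
    """Return a list of format problems; an empty list means the record passes."""
    problems = []
    if record["category"] not in ALLOWED_CATEGORIES:
        problems.append(f"category {record['category']!r} is not in the closed enum")
    if record["date"] is not None and not is_iso_date(record["date"]):
        problems.append(f"date {record['date']!r} is not YYYY-MM-DD")
    return problems
```

The returned problem strings are specific enough to be fed back to the model as corrective feedback.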
Blind Retry: 12% Fixed After 3 Attempts. Error Feedback: 87% Fixed After 1.
Semantic validation beyond schema (retry with corrective feedback)
Retrying Absent Data Causes Hallucination — Format Errors Converge, Missing Data Diverges
Retry with corrective feedback (error classification)
Track Which Patterns Developers Dismiss — Then Add SKIP Examples
Confidence scoring for routing (feedback loop)
Schema Says Valid. Line Items Don't Sum to Total. Both Are True.
Cross-field consistency validation
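A schema can confirm that `total` is a number but not that it equals the sum of the line items. The sketch below shows one such cross-field check; the invoice shape (`line_items`, `amount`, `total`) and the one-cent tolerance are assumptions made for illustration.

```python
from decimal import Decimal


def check_invoice_total(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return consistency problems that schema validation cannot catch."""
    # Convert through str() so float inputs do not carry binary rounding noise.
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in invoice["line_items"]),
        Decimal("0"),
    )
    total = Decimal(str(invoice["total"]))
    problems = []
    if abs(items_sum - total) > tolerance:
        problems.append(f"line_items sum to {items_sum} but total is {total}")
    return problems
```

A non-empty result is a semantic error on a structurally valid output, which makes it a candidate for retry with corrective feedback or for human review.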
50% Cost Savings — But Up to 24 Hours Wait
Batch API for scale
Results Return in Arbitrary Order — custom_id Is Your Only Correlation
custom_id for result correlation
940 Succeeded, 45 Errored, 15 Expired — Resubmit Only the 60
Batch error handling
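Selecting only the failed requests for resubmission comes down to grouping results by outcome and keeping their `custom_id` values. The sketch below assumes each result is a dict with a `custom_id` and a `result.type` of `succeeded`, `errored`, `canceled`, or `expired`; the function names are hypothetical and no API call is made.

```python
from collections import defaultdict


def partition_results(results: list[dict]) -> dict[str, list[str]]:
    """Group custom_ids by result type; input order does not matter."""
    by_status: dict[str, list[str]] = defaultdict(list)
    for r in results:
        by_status[r["result"]["type"]].append(r["custom_id"])
    return dict(by_status)


def ids_to_resubmit(by_status: dict[str, list[str]]) -> list[str]:
    """Errored and expired requests are candidates for a follow-up batch.

    Errored requests may need a fix (for example a malformed request body)
    before they are resubmitted; expired ones can usually go back unchanged.
    """
    return by_status.get("errored", []) + by_status.get("expired", [])
```

Because results are keyed by `custom_id`, the follow-up batch can be built by looking up the original request for each returned ID.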
Poll, Download, Store Locally — Results Expire in 29 Days
Batch size and timeout
Different Errors Need Different Fixes — Don't Resubmit the Whole Batch
Cost optimization with batch
Always Test on a Sample Before Full Batch — 18% Failure vs 3%
Batch workflow design
Same-Session Review: 0.3 Findings. Independent Instance: 3.7. Human Baseline: 4.1.
Multi-pass review architecture
Even Design Goals from the Generation Prompt Suppress Review Findings
Per-file + cross-file passes
Single-Pass Review at 13+ Files: 43% Detection. Multi-Pass: 86%.
Independent instances eliminate bias
Calibrate Confidence Per Category — "HIGH" Means 92% for Style but 70% for Security
Multi-instance review pipeline
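Per-category calibration can be estimated from reviewer feedback by measuring how often findings at each confidence label were accepted. This is a minimal sketch: the record shape (`category`, `confidence`, `accepted`) and the function name are assumptions.

```python
from collections import defaultdict


def precision_by_category(findings: list[dict]) -> dict[tuple[str, str], float]:
    """Map (category, confidence label) to the fraction of accepted findings."""
    counts = defaultdict(lambda: [0, 0])  # key -> [accepted, total]
    for f in findings:
        key = (f["category"], f["confidence"])
        counts[key][0] += int(f["accepted"])
        counts[key][1] += 1
    return {key: accepted / total for key, (accepted, total) in counts.items()}
```

The resulting table lets a router set a different threshold per category instead of treating one confidence label as meaning the same thing everywhere.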