K4.1.1 Task 4.1

"Be Conservative" Means Nothing — 47% Agreement. "Lacks Sample Size" Means Something — 94%.

A document quality classifier uses the prompt: “Only report high-confidence findings and be conservative with quality ratings.” The same paper gets rated “high quality” in one run and “low quality” in another. Inter-run agreement: 47%.

Replace with: “Flag when methodology lacks sample size or control group.” Agreement jumps to 94%.

Same model, same temperature, same documents. The only change is criteria specificity.
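Inter-run agreement here can be computed as mean pairwise label agreement across repeated runs over the same documents. A minimal sketch (the exact metric definition is an assumption; the source does not specify its formula):

```python
from itertools import combinations

def inter_run_agreement(runs):
    """Mean pairwise label agreement across repeated runs.

    runs: list of label lists, one per run, aligned by document.
    """
    pairs = list(combinations(runs, 2))
    agree = 0.0
    for a, b in pairs:
        agree += sum(x == y for x, y in zip(a, b)) / len(a)
    return agree / len(pairs)
```

Two identical runs score 1.0; two runs that disagree on one of three documents score about 0.67.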

Why Vague Terms Fail

LLMs have no consistent internal calibration for subjective qualifiers. “High-confidence,” “conservative,” “appropriate,” “professional” — these words have different meanings in different contexts, and the model picks among valid interpretations each run.

This is not randomness. It is ambiguity. The model is correctly processing a vague instruction: there are multiple valid ways to be “conservative,” and the model samples from among them. Specific criteria collapse that space to a single interpretation: either the methodology states a sample size or it does not.
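What makes “lacks sample size” consistent is that it is mechanically checkable. A toy heuristic sketch (the regex is a deliberately narrow illustration, not the classifier's actual logic):

```python
import re

def lacks_sample_size(methods_text):
    """True when no sample-size statement like 'n = 120' appears.

    The pattern is a narrow heuristic for illustration only.
    """
    return re.search(r"\bn\s*=\s*\d+", methods_text, re.IGNORECASE) is None
```

The point is not that a regex replaces the model; it is that the criterion has a ground truth the model can be held to.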

The Data

A customer support team replaced vague criteria with specific standards across 500 reviewed responses:

| Criteria | Inter-run agreement | False positive rate |
| --- | --- | --- |
| “Check responses are appropriate” | 52% | 38% |
| “Flag when: refund exceeds policy limit, or response promises unavailable feature” | 91% | 7% |

Specific criteria improved agreement by 39 points and cut false positives by 31 points. The criteria, not the model, drive consistency.
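The false positive rate in the table can be computed from the criterion's flags plus human ground truth. A sketch, assuming binary flags and a human-reviewed label set:

```python
def false_positive_rate(flags, ground_truth):
    """Share of genuinely-fine items the criterion wrongly flagged.

    flags: bool per item (criterion fired).
    ground_truth: bool per item (item truly violates the standard,
    per human review).
    """
    # Keep only items human review marked as fine; count how many
    # the criterion flagged anyway.
    negatives = [f for f, g in zip(flags, ground_truth) if not g]
    return sum(negatives) / len(negatives)
```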

Common Vague Terms and Their Replacements

| Vague | Specific |
| --- | --- |
| “Appropriate tone” | “Flag sarcasm, blame language, or dismissive phrasing” |
| “High-confidence findings” | “Flag when methodology lacks sample size or control group” |
| “Rate severity based on impact” | “HIGH: crash in production path. MEDIUM: incorrect output. LOW: cosmetic” |
| “Be accurate and careful” | “Extract dates in ISO 8601, return null if ambiguous” |
| “Ensure professional quality” | “Flag when refund amount exceeds $500 limit or response promises discontinued features” |

The pattern: replace subjective adjectives with concrete, verifiable conditions. “Professional” is an opinion. “Exceeds $500 limit” is a fact.
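Verifiable conditions translate directly into predicates a pipeline can apply uniformly. A sketch of the refund criterion (the field names are hypothetical, not a real schema):

```python
def violates_refund_criteria(response, policy_limit=500.0):
    """Flag when the refund exceeds the policy limit or the response
    promises a discontinued feature. Field names are illustrative.
    """
    if response.get("refund_amount", 0.0) > policy_limit:
        return True
    promised = set(response.get("features_promised", []))
    discontinued = set(response.get("discontinued_features", []))
    return bool(promised & discontinued)
```

Either condition is a fact about the response; two runs (or a model and a human) checking it will agree.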

Things That Do Not Fix Vague Criteria

Lower temperature. Temperature affects randomness in output generation, not interpretation of subjective terms. “Appropriate” has no consistent meaning at temperature 0 or temperature 1.

More vague text. “Be more accurate and careful” adds zero specificity. “Double-check your work” is motivational, not instructional. Stacking vague modifiers creates layers of ambiguity, not layers of precision.

More runs. Running a vague criterion 10 times and averaging does not converge on a consistent standard. Each run independently interprets the ambiguity — averaging subjective interpretations produces an average of inconsistency.
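A toy model of why repetition does not help: if each run draws its own cutoff for what “conservative” means (a stand-in for the model sampling among interpretations, not real LLM behavior), a borderline document keeps flipping no matter how many runs are averaged:

```python
def vague_run(doc_score, run_id):
    # Toy assumption: each run picks a different cutoff for
    # "conservative", derived deterministically from run_id.
    cutoff = 0.3 + 0.4 * ((run_id * 37) % 100) / 100
    return "high" if doc_score > cutoff else "low"

# A borderline document (score 0.5) gets both labels across 10 runs;
# majority voting averages the inconsistency rather than removing it.
labels = [vague_run(0.5, r) for r in range(10)]
```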

A second review pass. If the first pass uses vague criteria and produces false positives, a second pass with the same vague criteria produces the same false positives. The criteria must change, not the number of passes.

Per-Category Precision

A pipeline with 5 review categories (security, correctness, performance, style, documentation) does not need uniform specificity across all categories:

  • Security: Exact triggers with code examples — SQL injection patterns, unchecked user input in query construction
  • Correctness: Concrete conditions — claimed behavior in comments contradicts actual code logic
  • Performance: Measurable thresholds — O(n²) in loop processing arrays > 10K elements
  • Style: Named patterns — camelCase for variables, PascalCase for components
  • Documentation: Presence checks — public functions without JSDoc, missing parameter descriptions

Each category gets criteria matching its precision needs. Security needs exact triggers (zero false positives preferred). Style needs named conventions (minor inconsistency acceptable). Both are specific — just at different granularity.
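One way to carry this through a pipeline is a per-category criteria map, so each reviewer prompt is assembled from concrete triggers rather than a blanket “be thorough.” A sketch (the trigger wording and helper are illustrative, not a real API):

```python
REVIEW_CRITERIA = {
    # Exact triggers: zero false positives preferred.
    "security": [
        "string concatenation of user input into a SQL query",
        "unchecked user input used in query construction",
    ],
    "correctness": [
        "comment claims behavior that the code logic contradicts",
    ],
    "performance": [
        "O(n^2) loop over arrays larger than 10K elements",
    ],
    # Named conventions: minor inconsistency acceptable.
    "style": [
        "camelCase for variables",
        "PascalCase for components",
    ],
    "documentation": [
        "public function without JSDoc",
        "missing parameter description",
    ],
}

def category_prompt(category):
    """Assemble a reviewer instruction from concrete triggers only."""
    triggers = "; ".join(REVIEW_CRITERIA[category])
    return f"Flag only when you observe: {triggers}."
```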

This combination achieves both coverage (5 categories) and precision (concrete triggers). Dropping categories to reduce false positives is unnecessary when specific criteria prevent false positives within each category.

The Bottom Line

If the criterion contains words like “appropriate,” “professional,” “conservative,” “high-quality,” “careful,” or “thorough” without defining what those mean in concrete terms, it will produce inconsistent results. Replace every subjective qualifier with a verifiable condition.
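This check can itself be automated as a lint pass over prompts before they ship. A minimal sketch using the qualifier list above (the word list is obviously incomplete):

```python
import re

VAGUE_TERMS = {
    "appropriate", "professional", "conservative",
    "high-quality", "careful", "thorough",
}

def vague_qualifiers(prompt):
    """Return the subjective qualifiers found in a prompt."""
    words = set(re.findall(r"[a-z]+(?:-[a-z]+)?", prompt.lower()))
    return sorted(words & VAGUE_TERMS)
```

A nonempty result means the criterion needs rewriting before it reaches production.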


One-liner: Replace every “be appropriate” with “flag when X contradicts Y” — subjective qualifiers produce 47% consistency, concrete conditions produce 94%.