K4.1.3 Task 4.1

Labels Alone: 41%. Text Conditions: 72%. Text + Code Examples: 94%.

HIGH, MEDIUM, LOW — without definitions, these are meaningless labels. Claude has no stable internal calibration for severity. Each run interprets “high severity” differently depending on the surrounding code context. Agreement across runs: 41%. Functionally random.

Adding text conditions (“HIGH: data loss possible, security boundary crossed”) raises agreement to 72%. Adding code examples alongside the conditions reaches 94%. Each layer contributes meaningfully.

The Three-Layer Structure

Layer 1: Label — HIGH, MEDIUM, LOW (useless alone)

Layer 2: Concrete conditions

  • HIGH: data loss possible, security boundary crossed, crash in production path
  • MEDIUM: incorrect but non-fatal functionality, degraded performance
  • LOW: style issues, naming inconsistencies, missing documentation

Layer 3: Code examples

HIGH example:
cursor.execute(f"SELECT * FROM users WHERE id = {user_input}")
→ SQL injection: unsanitized user input in query

MEDIUM example:
for user in users:
    orders = db.query(f"SELECT * FROM orders WHERE user_id = {user.id}")
→ N+1 query: functional but degrades at scale

LOW example:
def calc(x, y): return x + y
→ Missing docstring and non-descriptive parameter names

The code examples anchor abstract conditions to recognizable patterns. “Data loss possible” is still open to interpretation. A SQL injection snippet is not.
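As a sketch, the three layers can be assembled into a single criteria block for the prompt. The dictionary layout and function name below are illustrative, not a fixed schema:

```python
# Sketch: assemble label + conditions + code example into one prompt block.
# All strings mirror the examples above; the structure itself is illustrative.

SEVERITY_CRITERIA = {
    "HIGH": {
        "conditions": "data loss possible, security boundary crossed, crash in production path",
        "example": (
            'cursor.execute(f"SELECT * FROM users WHERE id = {user_input}")\n'
            "-> SQL injection: unsanitized user input in query"
        ),
    },
    "MEDIUM": {
        "conditions": "incorrect but non-fatal functionality, degraded performance",
        "example": (
            "for user in users:\n"
            '    orders = db.query(f"SELECT * FROM orders WHERE user_id = {user.id}")\n'
            "-> N+1 query: functional but degrades at scale"
        ),
    },
    "LOW": {
        "conditions": "style issues, naming inconsistencies, missing documentation",
        "example": (
            "def calc(x, y): return x + y\n"
            "-> Missing docstring and non-descriptive parameter names"
        ),
    },
}

def build_criteria_block(criteria: dict) -> str:
    """Render each severity level as label, conditions, and code example."""
    parts = []
    for label, spec in criteria.items():
        parts.append(f"{label}: {spec['conditions']}\nExample:\n{spec['example']}")
    return "\n\n".join(parts)

print(build_criteria_block(SEVERITY_CRITERIA))
```

Keeping the three layers in one structure makes it hard to ship a prompt that defines conditions for a level but forgets its example.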

Why Labels Without Definitions Fail

A team tested severity classification on 200 code review findings:

Approach                           Inter-run agreement
Labels only (HIGH/MEDIUM/LOW)      41%
Text conditions per level          72%
Text conditions + code examples    94%

Labels alone are near-random because Claude must invent its own definitions each run. Text conditions provide a framework. Code examples close the remaining gap by eliminating interpretation of abstract terms like “data loss.”
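Inter-run agreement of the kind reported above can be measured as the fraction of findings on which every run assigns the same label. A minimal sketch (the sample labels are invented):

```python
# Sketch: exact-match inter-run agreement across repeated classification runs.
# Each run is a list of severity labels, one per finding (invented sample data).

def inter_run_agreement(runs: list[list[str]]) -> float:
    """Fraction of findings where every run assigned the same label."""
    n_findings = len(runs[0])
    assert all(len(run) == n_findings for run in runs), "runs must align"
    agreed = sum(1 for labels in zip(*runs) if len(set(labels)) == 1)
    return agreed / n_findings

runs = [
    ["HIGH", "MEDIUM", "LOW", "HIGH", "MEDIUM"],
    ["HIGH", "MEDIUM", "MEDIUM", "HIGH", "LOW"],
    ["HIGH", "MEDIUM", "LOW", "HIGH", "LOW"],
]
print(inter_run_agreement(runs))  # 3 of 5 findings agree across all runs -> 0.6
```

With three labels, random assignment across runs already produces some agreement, which is why 41% is described as functionally random rather than zero.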

All Levels Must Be Defined

A common mistake: defining only HIGH (“data loss or security breach”) and leaving MEDIUM and LOW to Claude’s judgment. Result: everything gets classified as HIGH because it is the only level with concrete criteria.

The boundary between MEDIUM and LOW matters as much as the boundary between HIGH and MEDIUM. An undefined level becomes a catch-all that absorbs findings from both sides.

Path-Dependent Severity

The same issue type can have different severity depending on code path:

  • Null dereference on production request path → HIGH (crash affects users)
  • Null dereference on debug-only logging path → MEDIUM (cosmetic, no user impact)

Blanket rules like “all null dereferences are HIGH” ignore context. The criteria should include path information: “NULL deref on production path → HIGH, on debug/test path → MEDIUM.”

This is handled in a single review pass with conditional criteria, not by running separate passes per severity level.
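One way to express such conditional criteria so a single pass can apply them is a rule table keyed on both issue type and path category. A hypothetical sketch (issue types and path names are invented for illustration):

```python
# Sketch: path-dependent severity rules, resolved in one classification pass.
# Issue types and path categories are illustrative, not a fixed taxonomy.

PATH_RULES = {
    ("null_dereference", "production"): "HIGH",   # crash affects users
    ("null_dereference", "debug"): "MEDIUM",      # no user impact
    ("sql_injection", "production"): "HIGH",
    ("sql_injection", "test"): "MEDIUM",
}

def classify(issue_type: str, path_kind: str, default: str = "MEDIUM") -> str:
    """Look up severity by (issue, path); fall back to a defined default."""
    return PATH_RULES.get((issue_type, path_kind), default)

print(classify("null_dereference", "production"))  # HIGH
print(classify("null_dereference", "debug"))       # MEDIUM
```

The same table translates directly into prompt text ("NULL deref on production path → HIGH, on debug/test path → MEDIUM"), so the conditional logic lives in the criteria rather than in extra review passes.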

Token-Budget Prioritization

When prompt space is limited, prioritize code examples for categories where misclassification is most costly. Security findings misclassified as LOW may go unaddressed — they need examples. Style findings misclassified as MEDIUM are a minor inconvenience — text-only definitions may suffice.

Do not skip examples entirely to save tokens. The 22-percentage-point improvement from adding examples (72% → 94%) is the most impactful investment in the entire criteria design.
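The prioritization above can be sketched as a greedy allocation: highest-misclassification-cost categories get code examples first, until the token budget runs out. Costs and token sizes here are invented for illustration:

```python
# Sketch: allocate limited prompt tokens to code examples by misclassification cost.
# Category names, costs, and token sizes are invented for illustration.

categories = [
    {"name": "security", "cost": 10, "example_tokens": 120},
    {"name": "correctness", "cost": 6, "example_tokens": 100},
    {"name": "performance", "cost": 3, "example_tokens": 90},
    {"name": "style", "cost": 1, "example_tokens": 80},
]

def pick_examples(categories: list[dict], budget: int) -> list[str]:
    """Greedy: highest-cost categories get code examples until budget is spent."""
    chosen = []
    remaining = budget
    for cat in sorted(categories, key=lambda c: c["cost"], reverse=True):
        if cat["example_tokens"] <= remaining:
            chosen.append(cat["name"])
            remaining -= cat["example_tokens"]
    return chosen

print(pick_examples(categories, budget=250))  # ['security', 'correctness']
```

Under this sketch, security and correctness keep their examples while performance and style fall back to text-only definitions, matching the priority order argued above.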

Things That Do Not Fix Undefined Severity

Numeric scales. Switching from HIGH/MEDIUM/LOW to 1-10 shifts the problem. Without definitions for what distinguishes a 7 from a 4, you get the same inconsistency on a wider scale.

“Be consistent” instructions. This describes the desired outcome without enabling it. Claude already attempts consistency — it lacks the criteria to achieve it.

Multiple passes with majority voting. Running the same undefined criteria three times and taking the majority vote picks the most frequent wrong answer. Ambiguity is not overcome by repetition.

Continuous scores. Replacing discrete levels with continuous scores (4.2 vs 4.5) introduces more ambiguity. The distinction between 4.2 and 4.5 is even less defined than between MEDIUM and LOW.

Percentage-based downgrading. Automatically converting 30% of HIGH findings to MEDIUM ignores actual severity. Real critical findings get randomly downgraded.

CI/CD Integration

In CI pipelines, severity definitions with code examples belong in CLAUDE.md — version-controlled and automatically loaded for every claude -p invocation. Embedding definitions in the -p prompt argument is fragile and unmaintainable. CLAUDE.md ensures every CI run applies the same calibrated severity criteria.
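A rough sketch of what the corresponding CLAUDE.md section might look like, reusing the criteria defined earlier (the heading and wording are illustrative, not a required format):

```markdown
## Severity criteria (apply to every code review)

- HIGH: data loss possible, security boundary crossed, crash in production path
  Example: cursor.execute(f"SELECT * FROM users WHERE id = {user_input}")
  -> SQL injection: unsanitized user input in query
- MEDIUM: incorrect but non-fatal functionality, degraded performance
  Example: per-user query inside a loop -> N+1 query, degrades at scale
- LOW: style issues, naming inconsistencies, missing documentation
  Example: def calc(x, y): return x + y -> missing docstring, vague names
```

Because the file is version-controlled, changes to severity definitions go through the same review process as code, and every CI run picks them up automatically.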


One-liner: Define every severity level with concrete conditions AND code examples — labels alone produce 41% agreement, conditions add 31 points, and examples add 22 more to reach 94%.