
Redesign LLM-as-judge to pointwise evaluation with separate cause/risk scoring #22

@harpomaxx

Description


Current approach and its problems

The current evaluate_risk.py uses listwise evaluation: the judge sees all 4 model outputs simultaneously and ranks them A/B/C/D. This causes several problems:

  • Context size: DAG + 4× cause_analysis + 4× risk_assessment in one prompt regularly exceeds the model's context window, causing empty/truncated responses
  • Position bias: outputs influence each other's scores depending on presentation order (mitigated by randomization but not eliminated)
  • Relative scores: a model ranked 1st may still be poor in absolute terms — you just know it beat the others in that batch
  • Coupled evaluation: cause and risk are evaluated together, hiding per-task strengths

Proposed approach: pointwise evaluation

Score each model output independently against the DAG, one call at a time.

Given this DAG, score this cause analysis on 3 criteria: 1-10 each.
Given this DAG, score this risk assessment on 3 criteria: 1-10 each.

Rankings are derived by aggregating scores across incidents — not by asking the judge to compare models directly.
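To make the shape concrete, here is a minimal sketch of building one pointwise judge prompt. The function and criterion lists mirror the prompts above; the names (`build_pointwise_prompt`, etc.) are illustrative, not the actual evaluate_risk.py API.

```python
# Illustrative pointwise prompt builder: one DAG + one model output per call.
# Names and prompt wording are assumptions, not the script's real interface.

CAUSE_CRITERIA = ["Evidence Grounding", "Cause Specificity", "Alternative Hypotheses"]
RISK_CRITERIA = ["Risk Calibration", "Actionability", "Business Impact Relevance"]

def build_pointwise_prompt(dag: str, model_output: str, task: str) -> str:
    """Build a judge prompt for a single model output; the judge never sees
    the other models' outputs, so there is nothing to be position-biased by."""
    criteria = CAUSE_CRITERIA if task == "cause" else RISK_CRITERIA
    label = "cause analysis" if task == "cause" else "risk assessment"
    lines = [
        f"Given this DAG:\n{dag}\n",
        f"Score this {label} on each criterion from 1 to 10:",
        model_output,
        "",
    ]
    lines += [f"- {c}" for c in criteria]
    return "\n".join(lines)
```

Each call carries exactly one DAG and one output, which is what keeps it under the context window.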

Criteria split

Cause analysis (3 dimensions):

  • Evidence Grounding — cites specific DAG data (IPs, ports, counts)
  • Cause Specificity — names the TTP, not just "malicious activity"
  • Alternative Hypotheses — considers legitimate/misconfiguration causes

Risk assessment (3 dimensions):

  • Risk Calibration — risk level proportionate to evidence weight
  • Actionability — concrete, incident-specific recommendations
  • Business Impact Relevance — realistic impact, not boilerplate
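A hedged sketch of validating a judge response against these six criteria, assuming the judge is asked to return a JSON object mapping criterion name to an integer score (that response format is an assumption of this sketch, not something the issue specifies):

```python
# Hypothetical validator for a pointwise judge response.
# Assumes the judge returns JSON like {"Evidence Grounding": 7, ...}.
import json

CAUSE_CRITERIA = {"Evidence Grounding", "Cause Specificity", "Alternative Hypotheses"}
RISK_CRITERIA = {"Risk Calibration", "Actionability", "Business Impact Relevance"}

def parse_scores(raw: str, task: str) -> dict:
    """Parse and validate one judge response: all criteria present, scores 1-10."""
    expected = CAUSE_CRITERIA if task == "cause" else RISK_CRITERIA
    scores = json.loads(raw)
    if set(scores) != expected:
        raise ValueError(f"missing/extra criteria: {set(scores) ^ expected}")
    for name, value in scores.items():
        if not (isinstance(value, int) and 1 <= value <= 10):
            raise ValueError(f"{name}: score {value!r} outside 1-10")
    return scores
```

Strict validation matters more here than in the listwise setup: with 6,608 independent calls, a silently malformed response would skew a leaderboard rather than one batch.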

Benefits

  • No context size problem — one DAG + one model output per call, always small
  • No position bias — models scored independently
  • Absolute scores — comparable across incidents and datasets
  • Parallelizable — all calls are independent
  • Separate leaderboards — cause ranking and risk ranking are independent, revealing per-task model strengths
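The two independent leaderboards fall out of a plain aggregation over the pointwise scores. A minimal sketch (the flat `(model, task, total_score)` record shape is an assumption for illustration):

```python
# Derive separate cause/risk leaderboards by averaging pointwise scores
# per model. Record shape is illustrative, not the script's actual format.
from collections import defaultdict
from statistics import mean

def leaderboard(records, task):
    """records: iterable of (model, task, total_score); returns models
    ranked by mean score on the given task, best first."""
    by_model = defaultdict(list)
    for model, t, score in records:
        if t == task:
            by_model[model].append(score)
    return sorted(by_model, key=lambda m: mean(by_model[m]), reverse=True)

records = [
    ("A", "cause", 24), ("A", "risk", 12),
    ("B", "cause", 18), ("B", "risk", 27),
]
print(leaderboard(records, "cause"))  # ['A', 'B']
print(leaderboard(records, "risk"))   # ['B', 'A']
```

Note the two rankings can disagree, which is exactly the per-task signal the coupled listwise evaluation was hiding.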

Cost

4 models × 826 incidents × 2 tasks = 6,608 judge calls, versus 826 calls today. Each call is much smaller, so wall-clock time should be comparable.
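A quick sanity check of that arithmetic, using only the numbers stated in this issue:

```python
# Call-count arithmetic: pointwise scoring vs. the current listwise pass.
models, incidents, tasks = 4, 826, 2

pointwise_calls = models * incidents * tasks  # one judge call per output per task
listwise_calls = incidents                    # one batched call per incident today

print(pointwise_calls)  # 6608
```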
