Current approach and its problems
The current evaluate_risk.py uses listwise evaluation: the judge sees all 4 model outputs simultaneously and ranks them A/B/C/D. This causes several problems:
- Context size: DAG + 4× cause_analysis + 4× risk_assessment in one prompt regularly exceeds the model's context window, causing empty/truncated responses
- Position bias: the judge's scores depend on where each output appears in the prompt (mitigated by order randomization, but not eliminated)
- Relative scores: a model ranked 1st may still be poor in absolute terms — you just know it beat the others in that batch
- Coupled evaluation: cause and risk are evaluated together, hiding per-task strengths
Proposed approach: pointwise evaluation
Score each model output independently against the DAG, one call at a time, using two prompt shapes:
- "Given this DAG, score this cause analysis on 3 criteria: 1-10 each."
- "Given this DAG, score this risk assessment on 3 criteria: 1-10 each."
Rankings are derived by aggregating scores across incidents — not by asking the judge to compare models directly.
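A minimal sketch of one such call, assuming an OpenAI-compatible chat client and a judge asked to return JSON; the prompt template, judge model name, and score_one function are illustrative, not the existing evaluate_risk.py API:

```python
import json
from openai import OpenAI  # assumption: any OpenAI-compatible client works here

client = OpenAI()

POINTWISE_PROMPT = """Given this DAG, score this {task} on each criterion below, 1-10.
Criteria: {criteria}
Return JSON only: {{"scores": {{"<criterion>": <int>, ...}}}}

DAG:
{dag}

{task}:
{output}"""

def score_one(dag: str, output: str, task: str, criteria: list[str]) -> dict:
    """One DAG + one model output per call: small prompt, no position bias."""
    prompt = POINTWISE_PROMPT.format(
        task=task, criteria=", ".join(criteria), dag=dag, output=output
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```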
Criteria split
Cause analysis (3 dimensions):
- Evidence Grounding — cites specific DAG data (IPs, ports, counts)
- Cause Specificity — names the TTP, not just "malicious activity"
- Alternative Hypotheses — considers legitimate/misconfiguration causes
Risk assessment (3 dimensions):
- Risk Calibration — risk level proportionate to evidence weight
- Actionability — concrete, incident-specific recommendations
- Business Impact Relevance — realistic impact, not boilerplate
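Both the prompts and the aggregation can share a single rubric table; a sketch (the dict name and criterion ids are illustrative):

```python
# Rubric shared by prompt construction and score aggregation.
# Keys are the two tasks; values pair a criterion id with what the judge checks.
CRITERIA = {
    "cause_analysis": [
        ("evidence_grounding", "cites specific DAG data (IPs, ports, counts)"),
        ("cause_specificity", "names the TTP, not just 'malicious activity'"),
        ("alternative_hypotheses", "considers legitimate/misconfiguration causes"),
    ],
    "risk_assessment": [
        ("risk_calibration", "risk level proportionate to evidence weight"),
        ("actionability", "concrete, incident-specific recommendations"),
        ("business_impact_relevance", "realistic impact, not boilerplate"),
    ],
}
```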
Benefits
- No context size problem — one DAG + one model output per call, always small
- No position bias — models scored independently
- Absolute scores — comparable across incidents and datasets
- Parallelizable — all calls are independent
- Separate leaderboards — cause ranking and risk ranking are independent, revealing per-task model strengths (see the sketch after this list)
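Because every call is independent, a plain thread pool can fan them out, and the two leaderboards fall out of a group-by over the scores. A sketch, assuming score_one and CRITERIA from above plus a purely illustrative incident shape (inc.dag, inc.outputs[model][task]):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def run_eval(incidents, models):
    """Fan out all pointwise judge calls, then build one leaderboard per task."""
    jobs = [
        (model, task, inc)
        for inc in incidents
        for model in models
        for task in ("cause_analysis", "risk_assessment")
    ]
    scores = defaultdict(list)  # (model, task) -> per-incident mean scores
    with ThreadPoolExecutor(max_workers=32) as pool:
        futures = {
            pool.submit(
                score_one, inc.dag, inc.outputs[model][task], task,
                [name for name, _ in CRITERIA[task]],
            ): (model, task)
            for model, task, inc in jobs
        }
        for fut, (model, task) in futures.items():
            result = fut.result()  # {"scores": {criterion: 1-10, ...}}
            scores[(model, task)].append(mean(result["scores"].values()))
    # Two independent leaderboards, ranked by mean score across incidents.
    for task in ("cause_analysis", "risk_assessment"):
        board = sorted(((mean(scores[(m, task)]), m) for m in models), reverse=True)
        print(task, board)
```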
Cost
4 models × 826 incidents × 2 tasks = 6,608 calls, versus the current 826. Each call is much smaller, so wall-clock time should be comparable.
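A back-of-envelope check on that claim, assuming (purely illustratively) ~5 s per judge call and 32 concurrent workers:

```python
calls = 4 * 826 * 2       # 6,608 pointwise calls
latency_s = 5             # assumed mean judge latency per call
workers = 32              # assumed concurrency limit
print(f"{calls} calls ≈ {calls * latency_s / workers / 60:.0f} min wall-clock")
# -> 6608 calls ≈ 17 min, in the same ballpark as 826 large listwise calls
```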