Current approach and its problems
The current evaluate_risk.py uses listwise evaluation: the judge sees all 4 model outputs simultaneously and ranks them A/B/C/D. This causes several problems:
- Context size: DAG + 4× cause_analysis + 4× risk_assessment in one prompt regularly exceeds the model's context window, causing empty/truncated responses
- Position bias: the judge's scores depend on where each output appears in the prompt (mitigated by order randomization, but not eliminated)
- Relative scores: a model ranked 1st may still be poor in absolute terms — you just know it beat the others in that batch
- Coupled evaluation: cause and risk are evaluated together, hiding per-task strengths
Proposed approach: pointwise evaluation
Score each model output independently against the DAG, one call at a time, using two prompt shapes:
- "Given this DAG, score this cause analysis on 3 criteria: 1-10 each."
- "Given this DAG, score this risk assessment on 3 criteria: 1-10 each."
Rankings are derived by aggregating scores across incidents — not by asking the judge to compare models directly.
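A minimal sketch of one such call, assuming an OpenAI-compatible chat client and a judge asked to return JSON; the prompt template, judge model name, and score_one function are illustrative, not the existing evaluate_risk.py API:

```python
import json
from openai import OpenAI  # assumption: any OpenAI-compatible client works here

client = OpenAI()

POINTWISE_PROMPT = """Given this DAG, score this {task} on each criterion below, 1-10.
Criteria: {criteria}
Return JSON only: {{"scores": {{"<criterion>": <int>, ...}}}}

DAG:
{dag}

{task}:
{output}"""

def score_one(dag: str, output: str, task: str, criteria: list[str]) -> dict:
    """One DAG + one model output per call: small prompt, no position bias."""
    prompt = POINTWISE_PROMPT.format(
        task=task, criteria=", ".join(criteria), dag=dag, output=output
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```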
Criteria split
Cause analysis (3 dimensions):
- Evidence Grounding — cites specific DAG data (IPs, ports, counts)
- Cause Specificity — names the TTP, not just "malicious activity"
- Alternative Hypotheses — considers legitimate/misconfiguration causes
Risk assessment (3 dimensions):
- Risk Calibration — risk level proportionate to evidence weight
- Actionability — concrete, incident-specific recommendations
- Business Impact Relevance — realistic impact, not boilerplate
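Both the prompts and the aggregation can share a single rubric table; a sketch (the dict name and criterion ids are illustrative):

```python
# Rubric shared by prompt construction and score aggregation.
# Keys are the two tasks; values pair a criterion id with what the judge checks.
CRITERIA = {
    "cause_analysis": [
        ("evidence_grounding", "cites specific DAG data (IPs, ports, counts)"),
        ("cause_specificity", "names the TTP, not just 'malicious activity'"),
        ("alternative_hypotheses", "considers legitimate/misconfiguration causes"),
    ],
    "risk_assessment": [
        ("risk_calibration", "risk level proportionate to evidence weight"),
        ("actionability", "concrete, incident-specific recommendations"),
        ("business_impact_relevance", "realistic impact, not boilerplate"),
    ],
}
```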
Benefits
- No context size problem — one DAG + one model output per call, always small
- No position bias — models scored independently
- Absolute scores — comparable across incidents and datasets
- Parallelizable — all calls are independent
- Separate leaderboards — cause ranking and risk ranking are independent, revealing per-task model strengths (see the sketch after this list)
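Because every call is independent, a plain thread pool can fan them out, and the two leaderboards fall out of a group-by over the scores. A sketch, assuming score_one and CRITERIA from above plus a purely illustrative incident shape (inc.dag, inc.outputs[model][task]):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def run_eval(incidents, models):
    """Fan out all pointwise judge calls, then build one leaderboard per task."""
    jobs = [
        (model, task, inc)
        for inc in incidents
        for model in models
        for task in ("cause_analysis", "risk_assessment")
    ]
    scores = defaultdict(list)  # (model, task) -> per-incident mean scores
    with ThreadPoolExecutor(max_workers=32) as pool:
        futures = {
            pool.submit(
                score_one, inc.dag, inc.outputs[model][task], task,
                [name for name, _ in CRITERIA[task]],
            ): (model, task)
            for model, task, inc in jobs
        }
        for fut, (model, task) in futures.items():
            result = fut.result()  # {"scores": {criterion: 1-10, ...}}
            scores[(model, task)].append(mean(result["scores"].values()))
    # Two independent leaderboards, ranked by mean score across incidents.
    for task in ("cause_analysis", "risk_assessment"):
        board = sorted(((mean(scores[(m, task)]), m) for m in models), reverse=True)
        print(task, board)
```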
Cost
4 models × 826 incidents × 2 tasks = 6,608 calls, versus the current 826. Each call is much smaller, so wall-clock time should be comparable.
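A back-of-envelope check on that claim, assuming (purely illustratively) ~5 s per judge call and 32 concurrent workers:

```python
calls = 4 * 826 * 2       # 6,608 pointwise calls
latency_s = 5             # assumed mean judge latency per call
workers = 32              # assumed concurrency limit
print(f"{calls} calls ≈ {calls * latency_s / workers / 60:.0f} min wall-clock")
# -> 6608 calls ≈ 17 min, in the same ballpark as 826 large listwise calls
```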