Background
evaluate_risk.py recently received improvements to the LLM-as-judge evaluation (see commit fcc1f2f). These improvements have not been ported to evaluate_summaries.py.
What was improved in evaluate_risk.py
1. Evaluation criteria (were vague, now specific)
The old 6 criteria were loose and overlapping:
- Cause Identification, Evidence-Based Reasoning, Risk Level Accuracy, Business Impact, Investigation Priority, Professional Quality
Replaced with 4 independent dimensions, each with explicit 1-3/4-6/7-9/10 anchor descriptions (sketched below):
- Evidence Grounding — does it cite specific DAG data (IPs, ports, counts)?
- Cause Specificity — does it name the specific attack behavior/TTP or stay vague?
- Risk Calibration — is the risk level proportionate to the actual evidence weight?
- Actionability — are recommended actions concrete and scoped to this incident?
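For illustration, the four dimensions can be expressed as a small data structure that the judge prompt is built from. This is a hypothetical sketch, not the actual code in evaluate_risk.py; the variable name and exact wording are assumptions:

```python
# Hypothetical sketch of the four judging dimensions; the real prompt text
# in evaluate_risk.py may word these differently.
RISK_JUDGE_DIMENSIONS = {
    "evidence_grounding": "Does the analysis cite specific DAG data (IPs, ports, counts)?",
    "cause_specificity": "Does it name the specific attack behavior/TTP, or stay vague?",
    "risk_calibration": "Is the assigned risk level proportionate to the actual evidence weight?",
    "actionability": "Are the recommended actions concrete and scoped to this incident?",
}
```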
2. Score anchoring (was uncalibrated, now anchored)
The old prompt just said "Rate 1-10 (10=excellent, 1=poor)" with no reference points, causing the judge to compress all scores into the 7-9 range. Each dimension now has explicit anchor descriptions at 1-3, 4-6, 7-9, and 10 so the judge has concrete reference points to spread scores.
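As a concrete example of what anchoring looks like, the band descriptions for one dimension might read roughly like this (hypothetical wording; the actual anchor text in evaluate_risk.py may differ):

```python
# Hypothetical anchor bands for one dimension, showing the idea of giving the
# judge explicit reference points at 1-3, 4-6, 7-9, and 10.
EVIDENCE_GROUNDING_ANCHORS = {
    "1-3": "No concrete DAG data cited; claims are generic or unsupported.",
    "4-6": "Mentions some events, but specifics (IPs, ports, counts) are mostly missing.",
    "7-9": "Cites several specific data points and ties them to the stated conclusions.",
    "10":  "Every key claim is backed by a specific, correctly quoted piece of DAG evidence.",
}
```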
3. Per-dimension output (was single score, now structured)
Old output: "scores": {"A": 8, "B": 7, ...}
New output: "scores": {"A": {"evidence_grounding": 8, "cause_specificity": 6, "risk_calibration": 7, "actionability": 4, "total": 25}, ...}
This gives richer diagnostic data to understand why one model beats another, not just that it does.
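A small sketch of how the structured output could be consumed downstream, for example to see which dimension drives the gap between two models. The helper name and anything in the JSON beyond what is shown above are assumptions:

```python
import json

# Hypothetical helper: compare per-dimension scores between two judged models.
# Assumes the judge returns the structured "scores" object shown above.
def dimension_gaps(judge_json: str, model_a: str = "A", model_b: str = "B") -> dict:
    scores = json.loads(judge_json)["scores"]
    a, b = scores[model_a], scores[model_b]
    return {dim: a[dim] - b[dim] for dim in a if dim != "total"}

example = '''{"scores": {
    "A": {"evidence_grounding": 8, "cause_specificity": 6, "risk_calibration": 7, "actionability": 4, "total": 25},
    "B": {"evidence_grounding": 5, "cause_specificity": 7, "risk_calibration": 6, "actionability": 6, "total": 24}
}}'''
print(dimension_gaps(example))
# {'evidence_grounding': 3, 'cause_specificity': -1, 'risk_calibration': 1, 'actionability': -2}
```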
What to adapt for evaluate_summaries.py
The 4 dimensions should be adapted to the summarization task. Suggested dimensions:
- Evidence Coverage — are critical HIGH/MEDIUM events from the DAG mentioned?
- Threat Identification — does it correctly name the attack type/behavior?
- Conciseness — does it compress the raw data or just copy-paste it? (already partially tracked via word count)
- Actionability — does it help the analyst decide on next steps?
The score anchoring and per-dimension output format can be ported directly.
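Putting the suggestions together, the adapted judge for evaluate_summaries.py could reuse the same structure with summary-specific dimensions. This is a sketch of one possible adaptation, not final wording; the dimension keys, helper name, and prompt layout are assumptions:

```python
# Hypothetical dimensions for the summarization judge, mirroring the structure
# used in evaluate_risk.py but scoped to summary quality.
SUMMARY_JUDGE_DIMENSIONS = {
    "evidence_coverage": "Are the critical HIGH/MEDIUM events from the DAG mentioned?",
    "threat_identification": "Does the summary correctly name the attack type/behavior?",
    "conciseness": "Does it compress the raw data, or just copy-paste it?",
    "actionability": "Does it help the analyst decide on next steps?",
}

def build_dimension_prompt(dimensions: dict) -> str:
    """Render the dimension list for the judge prompt, one line per dimension.
    Anchor bands (1-3 / 4-6 / 7-9 / 10) would be appended per dimension,
    as in evaluate_risk.py."""
    return "\n".join(f"- {name}: {question}" for name, question in dimensions.items())

print(build_dimension_prompt(SUMMARY_JUDGE_DIMENSIONS))
```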