Background
evaluate_risk.py recently received improvements to the LLM-as-judge evaluation (see commit fcc1f2f). These improvements have not been ported to evaluate_summaries.py.
What was improved in evaluate_risk.py
1. Evaluation criteria (were vague, now specific)
The old 6 criteria were loose and overlapping:
- Cause Identification, Evidence-Based Reasoning, Risk Level Accuracy, Business Impact, Investigation Priority, Professional Quality
Replaced with 4 independent dimensions, each with explicit 1-3/4-6/7-9/10 anchor descriptions (sketched below):
- Evidence Grounding — does it cite specific DAG data (IPs, ports, counts)?
- Cause Specificity — does it name the specific attack behavior/TTP or stay vague?
- Risk Calibration — is the risk level proportionate to the actual evidence weight?
- Actionability — are recommended actions concrete and scoped to this incident?
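For illustration, the four dimensions can be expressed as a small data structure that the judge prompt is built from. This is a hypothetical sketch, not the actual code in evaluate_risk.py; the variable name and exact wording are assumptions:

```python
# Hypothetical sketch of the four judging dimensions; the real prompt text
# in evaluate_risk.py may word these differently.
RISK_JUDGE_DIMENSIONS = {
    "evidence_grounding": "Does the analysis cite specific DAG data (IPs, ports, counts)?",
    "cause_specificity": "Does it name the specific attack behavior/TTP, or stay vague?",
    "risk_calibration": "Is the assigned risk level proportionate to the actual evidence weight?",
    "actionability": "Are the recommended actions concrete and scoped to this incident?",
}
```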
2. Score anchoring (was uncalibrated, now anchored)
The old prompt just said "Rate 1-10 (10=excellent, 1=poor)" with no reference points, causing the judge to compress all scores into the 7-9 range. Each dimension now has explicit anchor descriptions at 1-3, 4-6, 7-9, and 10 so the judge has concrete reference points to spread scores.
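As a concrete example of what anchoring looks like, the band descriptions for one dimension might read roughly like this (hypothetical wording; the actual anchor text in evaluate_risk.py may differ):

```python
# Hypothetical anchor bands for one dimension, showing the idea of giving the
# judge explicit reference points at 1-3, 4-6, 7-9, and 10.
EVIDENCE_GROUNDING_ANCHORS = {
    "1-3": "No concrete DAG data cited; claims are generic or unsupported.",
    "4-6": "Mentions some events, but specifics (IPs, ports, counts) are mostly missing.",
    "7-9": "Cites several specific data points and ties them to the stated conclusions.",
    "10":  "Every key claim is backed by a specific, correctly quoted piece of DAG evidence.",
}
```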
3. Per-dimension output (was single score, now structured)
Old output: "scores": {"A": 8, "B": 7, ...}
New output: "scores": {"A": {"evidence_grounding": 8, "cause_specificity": 6, "risk_calibration": 7, "actionability": 4, "total": 25}, ...}
This gives richer diagnostic data to understand why one model beats another, not just that it does.
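A small sketch of how the structured output could be consumed downstream, for example to see which dimension drives the gap between two models. The helper name and anything in the JSON beyond what is shown above are assumptions:

```python
import json

# Hypothetical helper: compare per-dimension scores between two judged models.
# Assumes the judge returns the structured "scores" object shown above.
def dimension_gaps(judge_json: str, model_a: str = "A", model_b: str = "B") -> dict:
    scores = json.loads(judge_json)["scores"]
    a, b = scores[model_a], scores[model_b]
    return {dim: a[dim] - b[dim] for dim in a if dim != "total"}

example = '''{"scores": {
    "A": {"evidence_grounding": 8, "cause_specificity": 6, "risk_calibration": 7, "actionability": 4, "total": 25},
    "B": {"evidence_grounding": 5, "cause_specificity": 7, "risk_calibration": 6, "actionability": 6, "total": 24}
}}'''
print(dimension_gaps(example))
# {'evidence_grounding': 3, 'cause_specificity': -1, 'risk_calibration': 1, 'actionability': -2}
```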
What to adapt for evaluate_summaries.py
The 4 dimensions should be adapted to the summarization task. Suggested dimensions:
- Evidence Coverage — are critical HIGH/MEDIUM events from the DAG mentioned?
- Threat Identification — does it correctly name the attack type/behavior?
- Conciseness — does it compress the raw data or just copy-paste it? (already partially tracked via word count)
- Actionability — does it help the analyst decide on next steps?
The score anchoring and per-dimension output format can be ported directly.
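Putting the suggestions together, the adapted judge for evaluate_summaries.py could reuse the same structure with summary-specific dimensions. This is a sketch of one possible adaptation, not final wording; the dimension keys, helper name, and prompt layout are assumptions:

```python
# Hypothetical dimensions for the summarization judge, mirroring the structure
# used in evaluate_risk.py but scoped to summary quality.
SUMMARY_JUDGE_DIMENSIONS = {
    "evidence_coverage": "Are the critical HIGH/MEDIUM events from the DAG mentioned?",
    "threat_identification": "Does the summary correctly name the attack type/behavior?",
    "conciseness": "Does it compress the raw data, or just copy-paste it?",
    "actionability": "Does it help the analyst decide on next steps?",
}

def build_dimension_prompt(dimensions: dict) -> str:
    """Render the dimension list for the judge prompt, one line per dimension.
    Anchor bands (1-3 / 4-6 / 7-9 / 10) would be appended per dimension,
    as in evaluate_risk.py."""
    return "\n".join(f"- {name}: {question}" for name, question in dimensions.items())

print(build_dimension_prompt(SUMMARY_JUDGE_DIMENSIONS))
```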