
Port improved LLM-as-judge criteria and scoring to evaluate_summaries.py #20


Background

evaluate_risk.py recently received improvements to the LLM-as-judge evaluation (see commit fcc1f2f). These improvements have not yet been ported to evaluate_summaries.py.

What was improved in evaluate_risk.py

1. Evaluation criteria (was vague, now specific)

The old 6 criteria were loose and overlapping:

  • Cause Identification, Evidence-Based Reasoning, Risk Level Accuracy, Business Impact, Investigation Priority, Professional Quality

Replaced with 4 independent dimensions, each with explicit 1-3/4-6/7-9/10 anchor descriptions:

  • Evidence Grounding — does it cite specific DAG data (IPs, ports, counts)?
  • Cause Specificity — does it name the specific attack behavior/TTP or stay vague?
  • Risk Calibration — is the risk level proportionate to the actual evidence weight?
  • Actionability — are recommended actions concrete and scoped to this incident?

2. Score anchoring (was unbounded, now calibrated)

The old prompt just said "Rate 1-10 (10=excellent, 1=poor)" with no reference points, causing the judge to compress all scores into the 7-9 range. Each dimension now has explicit anchor descriptions at 1-3, 4-6, 7-9, and 10 so the judge has concrete reference points to spread scores.
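For reference, an anchored dimension in the judge prompt looks roughly like this (wording paraphrased for illustration, not copied from fcc1f2f):

```
Evidence Grounding (1-10): does the report cite specific DAG data?
  1-3:  No concrete data; generic statements that could fit any incident.
  4-6:  Mentions a few specifics (an IP or port) but most claims unsupported.
  7-9:  Most claims tied to concrete DAG evidence (IPs, ports, event counts).
  10:   Every claim traceable to specific DAG data; no unsupported assertions.
```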

3. Per-dimension output (was single score, now structured)

Old output: "scores": {"A": 8, "B": 7, ...}
New output: "scores": {"A": {"evidence_grounding": 8, "cause_specificity": 6, "risk_calibration": 7, "actionability": 4, "total": 25}, ...}

This gives richer diagnostic data to understand why one model beats another, not just that it does.
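A minimal sketch of consuming the structured output, assuming the shape shown above (the `judge_results` variable is hypothetical and model B's numbers are made up for illustration):

```python
from statistics import mean

# Judge output in the new per-dimension format, keyed by model label.
judge_results = {
    "scores": {
        "A": {"evidence_grounding": 8, "cause_specificity": 6,
              "risk_calibration": 7, "actionability": 4, "total": 25},
        "B": {"evidence_grounding": 5, "cause_specificity": 7,
              "risk_calibration": 6, "actionability": 6, "total": 24},
    }
}

DIMENSIONS = ["evidence_grounding", "cause_specificity",
              "risk_calibration", "actionability"]

# Per-dimension breakdown: shows *where* one model wins,
# not just that its total is higher.
for dim in DIMENSIONS:
    per_model = {m: s[dim] for m, s in judge_results["scores"].items()}
    print(f"{dim:20s} {per_model}  mean={mean(per_model.values()):.1f}")
```

Here model A wins overall but loses badly on actionability, which a single total score would hide.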

What to adapt for evaluate_summaries.py

The 4 dimensions should be adapted to the summarization task. Suggested dimensions:

  • Evidence Coverage — are critical HIGH/MEDIUM events from the DAG mentioned?
  • Threat Identification — does it correctly name the attack type/behavior?
  • Conciseness — does it compress the raw data or just copy-paste it? (already partially tracked via word count)
  • Actionability — does it help the analyst decide on next steps?

The score anchoring and per-dimension output format can be ported directly.
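A sketch of how the adapted dimensions could be wired into evaluate_summaries.py, under the assumption that it mirrors the evaluate_risk.py structure (the dimension keys come from the list above; the anchor wording and constant name are placeholders, not a final implementation):

```python
# Suggested dimension keys for the summarization judge. Each one-liner
# below would be expanded into full 1-3/4-6/7-9/10 anchor descriptions,
# as was done for evaluate_risk.py in fcc1f2f.
SUMMARY_DIMENSIONS = {
    "evidence_coverage":     "Are critical HIGH/MEDIUM DAG events mentioned?",
    "threat_identification": "Is the attack type/behavior correctly named?",
    "conciseness":           "Is raw data compressed rather than copy-pasted?",
    "actionability":         "Does it help the analyst decide on next steps?",
}

# Expected judge output, mirroring the evaluate_risk.py format so the
# per-dimension parsing/aggregation code can be ported directly:
# "scores": {"A": {"evidence_coverage": 7, "threat_identification": 8,
#                  "conciseness": 5, "actionability": 6, "total": 26}, ...}
```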
