Problem
Both evaluate_risk.py and evaluate_summaries.py use temperature=0.3 when calling the judge LLM. This introduces randomness into the evaluation, making results non-deterministic and harder to reproduce.
Fix
Set temperature=0.0 in call_judge_llm() in both scripts to ensure deterministic, reproducible evaluations.
Problem
Both
evaluate_risk.pyandevaluate_summaries.pyusetemperature=0.3when calling the judge LLM. This introduces randomness into the evaluation, making results non-deterministic and harder to reproduce.Fix
Set
temperature=0.0incall_judge_llm()in both scripts to ensure deterministic, reproducible evaluations.