
Feat/medexqa judges #67

Draft
mnishant2 wants to merge 9 commits into MedARC-AI:main from mnishant2:feat/medexqa-judges

Conversation

@mnishant2
Contributor

This draft PR adds the new LLM judges for evaluating explanations. I have updated the README and added comments where needed. TL;DR: run `vf-eval medexqa` with `use_judges=False` and `-s` to save/cache outputs for the different specialties, then run `tools/judge_rescore` with `factscore` or `g-eval` to get LLM judge scores in both settings.
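For reference, the two-step workflow above might look like the following shell sketch. The exact syntax for passing `use_judges=False` and for selecting the judge scheme is an assumption on my part; check the README in this PR for the actual invocation:

```shell
# Step 1: run the MedExQA eval with judges disabled; -s saves/caches
# the model outputs per specialty. (The env-arg JSON below is an
# assumed way of passing use_judges=False.)
vf-eval medexqa -a '{"use_judges": false}' -s

# Step 2: rescore the cached outputs with an LLM judge, using either
# the factscore or g-eval scheme (the --judge flag name is assumed).
python tools/judge_rescore.py --judge factscore
python tools/judge_rescore.py --judge g-eval
```

Splitting generation (step 1) from judging (step 2) means the cached outputs can be rescored with both judge schemes without re-running the model.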

