This repository provides a pilot implementation of a multi-agent evaluation framework for assessing the trustworthiness of Large Language Models (LLMs) in medical domains.
If you find this work useful in your research, please cite the following paper:
Shin, E., Ko, S., Yang, U., Won, W., Na, K., Woo, H., & Lee, Y. (2025). A RV Framework for Evaluating the Trustworthiness of Medical Large Language Models. Smart Media Journal, 14(12), 74-84.
```bibtex
@article{shin2025rv,
title={A RV Framework for Evaluating the Trustworthiness of Medical Large Language Models},
author={Shin, Eunji and Ko, Siyeon and Yang, Uijun and Won, Woohyung and Na, Kyungmin and Woo, Hyekyung and Lee, Youngho},
journal={Smart Media Journal},
volume={14},
number={12},
pages={74--84},
year={2025}
}
```

The pilot implementation aims to:
- Demonstrate how diagnostic responses from LLMs can be evaluated using a multi-dimensional rubric (Accuracy, Explainability, Consistency, Safety).
- Compare external evaluation and self-evaluation (LLM self-critique), showing their differences and complementarity (a minimal scoring-comparison sketch follows the setup step below).
- Provide pilot results as a proof-of-concept for the proposed framework in our research paper.
- The pilot uses open clinical QA datasets (e.g., MedQA) as substitutes for real-world diagnostic cases.
- Each case is structured into JSON format with the following fields (see the example below): `summary`, `evidence_list`, `criteria`, `final_judgment`
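A minimal sketch of one case record: only the four top-level keys come from the description above, while the field contents and the 1–5 score scale inside `criteria` are hypothetical illustrations.

```python
import json

# Illustrative case record. Only the four top-level keys (summary,
# evidence_list, criteria, final_judgment) are taken from the README;
# the field contents and the 1-5 score scale are hypothetical.
case = {
    "summary": "Patient presents with acute chest pain; model suggests unstable angina.",
    "evidence_list": [
        "ECG findings cited in the model response",
        "Reference answer from the MedQA item",
    ],
    "criteria": {
        "accuracy": 4,
        "explainability": 3,
        "consistency": 4,
        "safety": 5,
    },
    "final_judgment": "Trustworthy overall, with minor gaps in explanation",
}

# Write the case to disk in the JSON layout described above.
with open("case_001.json", "w", encoding="utf-8") as f:
    json.dump(case, f, ensure_ascii=False, indent=2)
```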
- Install dependencies:

```bash
pip install -r requirements.txt
```
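To illustrate the external-versus-self comparison mentioned above, here is a minimal sketch, assuming each evaluation is stored as a case JSON file like the one shown earlier. The function names and file names are hypothetical and do not reflect the repository's actual API.

```python
import json

# The four rubric dimensions named in the framework.
DIMENSIONS = ["accuracy", "explainability", "consistency", "safety"]

def load_criteria(path: str) -> dict:
    """Read a case JSON file and return its rubric scores (the `criteria` block)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["criteria"]

def score_gap(external: dict, self_eval: dict) -> dict:
    """Per-dimension gap; a positive value means the external rater scored higher."""
    return {dim: external[dim] - self_eval[dim] for dim in DIMENSIONS}

if __name__ == "__main__":
    # Hypothetical file names: one externally rated copy and one self-critique copy.
    external = load_criteria("case_001_external.json")
    self_eval = load_criteria("case_001_self.json")
    print(score_gap(external, self_eval))
```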