Large language models are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands the integration of heterogeneous knowledge bases (trials, primary studies, regulatory documents, and cost data) under strict accuracy constraints. Existing evaluations typically rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended text generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp comprises more than 1,000 human-curated questions that mirror clinical scenarios in which practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals marked performance shortfalls, with accuracy as low as 10%. These findings expose a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp exposes the strengths and weaknesses of current agentic systems, offering a testbed for reliable medical information seeking and clear goals for future model and toolchain upgrades.
- Clone the repository:
git clone https://github.com/yourusername/hemonc_bench.git
cd hemonc_bench
- Install dependencies:
pip install -r requirements.txt
This repository provides two main benchmark datasets (they are encoded to avoid training on the test set; see the next section for how to decode them):
- data/final50.csv (MedBrowseComp50): 50 curated clinical trial samples for rapid evaluation and demonstration.
- data/final121.csv (MedBrowseComp605): 605 clinical trial samples for comprehensive benchmarking.
Use these files as your primary input for all evaluation and processing tasks.
To help prevent accidental data contamination during model training, you can encode the data files before sharing or uploading. Multiple encoding methods are supported:
Each cell is first shifted (by a configurable value), then base64 encoded. The output is a valid CSV file, but the contents are not human-readable.
Encode:
python data/encode_decode.py cell-encode-combo data/final50.csv --shift 3
# Produces: data/final50_cell_combo_shift3_b64.csv
Decode:
python data/encode_decode.py cell-decode-combo data/final50_cell_combo_shift3_b64.csv --shift 3
# Produces: data/final50.csv
- All encoded/decoded files always end with .csv and use underscores for suffixes.
- This combo method is the recommended one for sharing, archiving, and HuggingFace upload; a minimal sketch of the per-cell transformation appears below.
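For reference, a minimal sketch of the per-cell combo transformation described above (hypothetical helpers; the authoritative logic lives in data/encode_decode.py and may differ in details such as character handling):

```python
import base64

def encode_cell(value: str, shift: int = 3) -> str:
    """Shift each character's code point, then base64-encode the result."""
    shifted = "".join(chr(ord(c) + shift) for c in value)
    return base64.b64encode(shifted.encode("utf-8")).decode("ascii")

def decode_cell(value: str, shift: int = 3) -> str:
    """Reverse the transformation: base64-decode, then undo the shift."""
    shifted = base64.b64decode(value.encode("ascii")).decode("utf-8")
    return "".join(chr(ord(c) - shift) for c in shifted)

assert decode_cell(encode_cell("NCT01234567", shift=3), shift=3) == "NCT01234567"
```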
You can also encode/decode each cell using only base64 or only a shift:
Base64:
python data/encode_decode.py cell-encode data/final50.csv --method base64
# Produces: data/final50_cell_base64.csv
python data/encode_decode.py cell-decode data/final50_cell_base64.csv --method base64
# Produces: data/final50.csv
Shift:
python data/encode_decode.py cell-encode data/final50.csv --method shift --shift 3
# Produces: data/final50_cell_shift3.csv
python data/encode_decode.py cell-decode data/final50_cell_shift3.csv --method shift --shift 3
# Produces: data/final50.csv
Some features require API keys for Gemini, OpenAI, or Sonar (Perplexity). Copy .env.example to .env and add your keys:
cp .env.example .env
Edit .env and fill in your API keys:
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
SONAR_API_KEY=your_sonar_api_key_here
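The processing scripts are expected to read these keys from the environment. A minimal sketch of how they could be loaded, assuming the python-dotenv package (the repository's actual loading code may differ):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads variables from .env into the process environment

GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
SONAR_API_KEY = os.environ.get("SONAR_API_KEY")

if not any([GEMINI_API_KEY, OPENAI_API_KEY, SONAR_API_KEY]):
    raise RuntimeError("No API keys found; copy .env.example to .env and add your keys.")
```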
The main script for processing predictions is:
python process_NCT_predictions.py --csv_path data/final50.csv --output_path results50.csv --model gemini-2.0-flash
For the larger benchmark:
python process_NCT_predictions.py --csv_path data/final121.csv --output_path results605.csv --model gemini-2.0-flash
You may also specify other models (e.g., OpenAI or Sonar), and use additional flags:
- --use_tools: Enable the Google Search tool (optional)
- --max_workers N: Number of parallel threads (default: 4)
Example with all options:
python process_NCT_predictions.py --csv_path data/final121.csv --output_path results605.csv --model openai-gpt-4 --use_tools --max_workers 8
- --csv_path (required): Path to your input CSV file (default: data/Hemonc_new_with_all.csv)
- --output_path: Path to save the output CSV
- --model: Model to use (default: gemini-2.0-flash)
- --use_tools: Enable the Google Search tool (optional)
- --max_workers: Number of parallel threads (default: 4)
Example:
python process_NCT_predictions.py --csv_path data/Hemonc_new_with_all.csv --output_path results.csv
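For intuition, a rough sketch of how the --max_workers parallelism could look internally (hypothetical; predict_one stands in for whatever per-row model call process_NCT_predictions.py actually makes):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import pandas as pd

def predict_one(row: dict) -> dict:
    # Hypothetical stand-in for a single model call (Gemini, OpenAI, or Sonar) on one benchmark row.
    return {**row, "prediction": "..."}

def run_benchmark(csv_path: str, output_path: str, max_workers: int = 4) -> None:
    df = pd.read_csv(csv_path)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(predict_one, row) for row in df.to_dict("records")]
        for future in as_completed(futures):
            results.append(future.result())
    pd.DataFrame(results).to_csv(output_path, index=False)
```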
If you use this dataset in your research, please cite: https://arxiv.org/abs/2505.14963
@misc{chen2025medbrowsecompbenchmarkingmedicaldeep,
title={MedBrowseComp: Benchmarking Medical Deep Research and Computer Use},
author={Shan Chen and Pedro Moreira and Yuxin Xiao and Sam Schmidgall and Jeremy Warner and Hugo Aerts and Thomas Hartvigsen and Jack Gallifant and Danielle S. Bitterman},
year={2025},
eprint={2505.14963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.14963},
}
You probably do not need to read the details below.
The random_helper directory contains reusable utilities and scripts for:
- Downloading and preprocessing NCT data (download_nct_data.py)
- Extracting NCT numbers from evidence columns (process_NCT_dataset.py)
- Processing model predictions and results (process_predictions.py)
- PubMed utilities (pubmed_utils.py)
- Data exploration and cleaning notebooks (various .ipynb files)
These scripts help with data preparation, cleaning, and intermediate analysis.
This project uses a robust, multi-step approach to extract detailed information for each clinical trial:
- Retrieves author lists for each PubMed ID using the Bio.Entrez API.
- Handles missing names and collective authors.
- Output: JSON mapping PubMed ID to a pipe-separated author list.
- Scans PubMed records for NCT IDs using regex (NCT\d{8}).
- Searches all relevant fields (identifiers, titles, abstracts, MeSH).
- Attempts to fetch the start date from multiple locations in ClinicalTrials.gov APIs.
- Falls back to pattern matching and special-case handling for older trials.
- Checks for the presence of primary and secondary outcome arrays in the trial record.
- Returns simple Yes/No indicators.
- Extracts drug administration routes from APIs, HTML pages, or PubMed abstracts.
- Uses pattern matching for terms like "oral", "IV", "subcutaneous", etc.
- Maps drug names and keywords to a comprehensive set of classes (e.g., PD-1 inhibitor, kinase inhibitor).
- Output: Pipe-separated list of identified drug classes.
If you have questions or need more detail on any step, see the code comments or helper scripts in random_helper.
The following summarizes how HemOnc Bench extracts and enriches clinical trial data:
- For each PubMed ID, fetches author information using the NCBI Entrez API.
- Builds a JSON mapping of PubMed IDs to a pipe-separated list of authors (e.g., "John Smith|Jane Doe").
- Handles missing names and collective author groups robustly.
- Scans PubMed records for NCT IDs using the regex pattern NCT\d{8}.
- Searches all likely fields, including identifiers, titles, abstracts, and MeSH terms.
- Returns the first valid match found.
- Attempts to extract the start date from ClinicalTrials.gov using the V2 API, which returns detailed trial records as nested JSON.
- Checks several possible JSON paths (e.g., protocolSection/statusModule/startDateStruct/date) because ClinicalTrials.gov has changed its schema over time and start dates may be stored in different locations for different trials.
- If none of the standard paths yield a date, the code searches the entire JSON for any field named "startDate" as a fallback.
- If still not found, falls back to the classic API or uses hardcoded/special-case logic for known trials.
- These JSON paths are determined by inspecting real ClinicalTrials.gov API responses. You can see them by downloading a trial's JSON from ClinicalTrials.gov and looking at its structure.
- This approach ensures compatibility across different data vintages and registry formats.
- Checks if primary and secondary outcome arrays exist in the trial data.
- Returns simple "Yes" or "No" values for each outcome type.
- Falls back to alternate APIs or special handling if needed.
- Extracts drug administration routes from ClinicalTrials.gov APIs or, if necessary, scrapes the ClinicalTrials.gov HTML page or searches PubMed abstracts.
- Uses pattern matching to identify common routes (e.g., oral, intravenous, subcutaneous, intramuscular).
- Returns a pipe-separated list of routes.
- Maps drug names and keywords to a curated set of drug classes (e.g., PD-1 inhibitor, kinase inhibitor, immunotherapy, etc.).
- Uses both direct name matching and keyword detection for comprehensive coverage.
- Returns a pipe-separated list of all identified classes for each trial.
For more details or technical specifics, see the code comments and helper scripts in the random_helper directory.
How author data is extracted
graph TD
A[Input: PubMed ID] --> B[Use Bio.Entrez to fetch PubMed record]
B --> C{Contains AuthorList?}
C -->|Yes| D[Process each author]
C -->|No| E[Return empty string]
D --> F[Extract ForeName and LastName]
D --> G[Or use CollectiveName]
F --> H[Join names with pipe separator]
G --> H
H --> I[Return as JSON mapping PubMed ID to author list]
- Primary Source: PubMed API via Bio.Entrez
- Format: Pipe-separated list (e.g., "John Smith|Jane Doe")
- Resilience: Falls back to LastName only if ForeName is missing; handles collective author names
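A minimal sketch of this lookup with Bio.Entrez (simplified for illustration; the actual helper in random_helper adds batching and error handling):

```python
from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI requires a contact email

def fetch_authors(pubmed_id: str) -> str:
    """Return a pipe-separated author list for one PubMed ID, or '' if none found."""
    handle = Entrez.efetch(db="pubmed", id=pubmed_id, retmode="xml")
    record = Entrez.read(handle)
    handle.close()
    article = record["PubmedArticle"][0]["MedlineCitation"]["Article"]
    names = []
    for author in article.get("AuthorList", []):
        if "CollectiveName" in author:  # collective/group authorship
            names.append(str(author["CollectiveName"]))
        elif "LastName" in author:      # fall back to LastName alone if ForeName is missing
            names.append(" ".join(filter(None, [author.get("ForeName"), author["LastName"]])))
    return "|".join(names)
```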
How NCT IDs are extracted from PubMed
- Process:
  - Fetch the full XML record for a PubMed ID
  - Apply regex pattern matching: NCT\d{8}
  - Return the first match if found
- Why This Approach? NCT IDs can appear in various locations in PubMed records:
  - Secondary identifier fields
  - Article titles
  - Abstracts
  - MeSH terms

  Regex pattern matching provides the most comprehensive way to find these IDs regardless of their location.
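A condensed sketch of this step (illustrative only; here the regex is applied to the raw XML text, which covers identifiers, titles, abstracts, and MeSH terms in one pass):

```python
import re
from Bio import Entrez

Entrez.email = "you@example.com"
NCT_PATTERN = re.compile(r"NCT\d{8}")

def extract_nct_id(pubmed_id: str) -> str:
    """Return the first NCT ID found anywhere in the PubMed XML record, or ''."""
    handle = Entrez.efetch(db="pubmed", id=pubmed_id, retmode="xml")
    xml_text = handle.read()
    handle.close()
    if isinstance(xml_text, bytes):
        xml_text = xml_text.decode("utf-8", errors="ignore")
    match = NCT_PATTERN.search(xml_text)
    return match.group(0) if match else ""
```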
Multi-layered extraction of start dates
flowchart TD
A[Input: NCT ID] --> B[Try ClinicalTrials.gov V2 API]
B --> C{Found start date?}
C -->|Yes| D[Return start date]
C -->|No| E[Check multiple JSON locations]
E --> F{Found start date?}
F -->|Yes| D
F -->|No| G[Try classic ClinicalTrials.gov API]
G --> H{Found start date?}
H -->|Yes| D
H -->|No| I[Special case handling]
I -->|Known trial| J[Use hardcoded value]
I -->|Unknown trial| K[Return empty string]
J --> D
- Primary Method: ClinicalTrials.gov V2 API
- JSON Locations Checked:
  - protocolSection/statusModule/startDateStruct/date
  - protocolSection/statusModule/startDate
  - protocolSection/designModule/studyStartDateStruct/date
  - protocolSection/designModule/studyStartDate
  - derivedSection/startDateStruct/date
  - derivedSection/startDate
  - Pattern matching for any "startDate" field
- Why So Complex? The ClinicalTrials.gov data structure has evolved over time, with older trials having different formats. This comprehensive approach ensures maximum data recovery across all trial vintages.
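A simplified sketch of the V2 API lookup with the path checks and catch-all fallback (the classic-API fallback and special-case handling are omitted):

```python
import requests

START_DATE_PATHS = [
    ("protocolSection", "statusModule", "startDateStruct", "date"),
    ("protocolSection", "statusModule", "startDate"),
    ("protocolSection", "designModule", "studyStartDateStruct", "date"),
    ("protocolSection", "designModule", "studyStartDate"),
    ("derivedSection", "startDateStruct", "date"),
    ("derivedSection", "startDate"),
]

def get_start_date(nct_id: str) -> str:
    """Fetch a study from the ClinicalTrials.gov V2 API and check known start-date locations."""
    resp = requests.get(f"https://clinicaltrials.gov/api/v2/studies/{nct_id}", timeout=30)
    resp.raise_for_status()
    study = resp.json()
    for path in START_DATE_PATHS:
        node = study
        for key in path:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if isinstance(node, str) and node:
            return node
    return find_any_start_date(study)  # fallback: scan the whole document

def find_any_start_date(node) -> str:
    """Recursively search any nested structure for a 'startDate' string field."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "startDate" and isinstance(value, str):
                return value
            found = find_any_start_date(value)
            if found:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_any_start_date(item)
            if found:
                return found
    return ""
```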
Outcome measure detection strategy
- Primary Method: Check for array existence
  - Primary outcomes: protocolSection/outcomesModule/primaryOutcomes
  - Secondary outcomes: protocolSection/outcomesModule/secondaryOutcomes
- Output: Simple "Yes" or "No"
- Fallback Chain: 1. V2 API → 2. Classic API → 3. Special case handling
- Practical Impact: Having this information helps researchers quickly identify trials with well-defined outcome measurements, an important factor in assessing study quality.
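A minimal sketch of the existence check, where study is the JSON record returned by the V2 API (as in the start-date sketch above):

```python
def has_outcomes(study: dict) -> dict:
    """Return Yes/No flags for the presence of primary and secondary outcome arrays."""
    outcomes = study.get("protocolSection", {}).get("outcomesModule", {})
    return {
        "primary_outcomes": "Yes" if outcomes.get("primaryOutcomes") else "No",
        "secondary_outcomes": "Yes" if outcomes.get("secondaryOutcomes") else "No",
    }
```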
How drug routes are determined
flowchart TD
A["Input: NCT ID"] --> B["Try ClinicalTrials.gov V2 API"]
B --> C{"Found route info?"}
C -->|Yes| D["Extract routes"]
C -->|No| E["Scrape HTML page"]
E --> F{"Found route info?"}
F -->|Yes| D
F -->|No| G["Try classic API"]
G --> H{"Found route info?"}
H -->|Yes| D
H -->|No| I["Search PubMed abstracts"]
I --> J{"Found route info?"}
J -->|Yes| D
J -->|No| K["Check for special case"]
K -->|Known trial| L["Use hardcoded value"]
K -->|Unknown trial| M["Return 'Unknown'"]
L --> N["Format as pipe-separated list"]
D --> N
N --> O["Return formatted routes"]
- Pattern Matching: Text analysis for terms like:
  - "oral"
  - "intravenous"/"i.v."
  - "subcutaneous"/"s.c."
  - "intramuscular"/"i.m."
  - And many more
- Why This Matters: Administration route affects drug bioavailability, patient convenience, and compliance. This information helps researchers understand critical aspects of treatment protocols.
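A condensed sketch of the text-based fallback (illustrative term list only; the real helper consults the APIs and HTML page first, as in the flowchart above):

```python
import re

# Illustrative subset of route terms; the actual mapping is larger.
ROUTE_PATTERNS = {
    "Oral": r"\boral(ly)?\b",
    "Intravenous": r"intravenous|\bi\.v\.",
    "Subcutaneous": r"subcutaneous|\bs\.c\.",
    "Intramuscular": r"intramuscular|\bi\.m\.",
}

def extract_routes(text: str) -> str:
    """Return a pipe-separated list of administration routes mentioned in free text."""
    text = text.lower()
    found = [route for route, pattern in ROUTE_PATTERNS.items() if re.search(pattern, text)]
    return "|".join(found) if found else "Unknown"
```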
Intelligent drug classification system
- Two-Pronged Approach:
  - Drug Name Matching: Maps specific drug names to classes
    - pembrolizumab → PD-1 Inhibitor
    - atezolizumab → PD-L1 Inhibitor
    - ipilimumab → CTLA-4 Inhibitor
  - Keyword Detection: Identifies class by terminology
    - "pd-1" → PD-1 Inhibitor
    - "kinase" → Kinase Inhibitor
    - "immunotherapy" → Immunotherapy
- Comprehensive Classification Dictionary:
  - PD-1 Inhibitors
  - PD-L1 Inhibitors
  - CTLA-4 Inhibitors
  - Anti-VEGF agents
  - Tyrosine Kinase Inhibitors
  - EGFR Inhibitors
  - Platinum-based Chemotherapy
  - Taxanes
  - Antimetabolites
  - Immunomodulatory Drugs
  - Corticosteroids
  - BTK Inhibitors
  - CDK4/6 Inhibitors
  - PARP Inhibitors
  - mTOR Inhibitors
  - Proteasome Inhibitors
- Output Format: Pipe-separated list of identified classes
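A small sketch of the two-pronged lookup (heavily abbreviated dictionaries for illustration; the actual classification tables cover many more drugs, keywords, and classes):

```python
# Illustrative subsets only; the real dictionaries are much larger.
DRUG_NAME_TO_CLASS = {
    "pembrolizumab": "PD-1 Inhibitor",
    "atezolizumab": "PD-L1 Inhibitor",
    "ipilimumab": "CTLA-4 Inhibitor",
}
KEYWORD_TO_CLASS = {
    "pd-1": "PD-1 Inhibitor",
    "kinase": "Kinase Inhibitor",
    "immunotherapy": "Immunotherapy",
}

def classify_drugs(text: str) -> str:
    """Map drug names and keywords in free text to a pipe-separated list of drug classes."""
    text = text.lower()
    classes = []
    for table in (DRUG_NAME_TO_CLASS, KEYWORD_TO_CLASS):
        for term, drug_class in table.items():
            if term in text and drug_class not in classes:
                classes.append(drug_class)
    return "|".join(classes)
```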