
Conversation

@chakravarthik27
Contributor

This pull request updates the MedCalc-Bench scenario to use the latest dataset version and adds a basic test for the scenario. The main changes focus on keeping the dataset reference current and improving test coverage.

MedCalc-Bench scenario updates:

  • Updated the dataset reference in both the class docstring and the code in medcalc_bench_scenario.py to use MedCalc-Bench-v2.0 instead of MedCalc-Bench-v1.0 (illustrated with a sketch below).
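
    The following is a hypothetical illustration of this kind of one-line change; it assumes the scenario loads the data with Hugging Face datasets.load_dataset, which may not match the actual loading code in medcalc_bench_scenario.py.

    from datasets import load_dataset

    # Hypothetical sketch; previously this would have referenced "ncbi/MedCalc-Bench-v1.0".
    dataset = load_dataset("ncbi/MedCalc-Bench-v2.0", split="test")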

Testing improvements:

  • Added a new test file test_medcalc_bench_scenario.py with a pytest-based test that verifies the scenario loads instances and that the first instance is from the "test" split.
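
    A minimal sketch of such a test, assuming HELM's usual scenario-test shape; the class name MedCalcBenchScenario, the TEST_SPLIT constant, and the get_instances(output_path) signature follow HELM conventions and may differ from the actual file.

    from tempfile import TemporaryDirectory

    from helm.benchmark.scenarios.medcalc_bench_scenario import MedCalcBenchScenario
    from helm.benchmark.scenarios.scenario import TEST_SPLIT


    # Sketch only: verifies that instances load and that the first one is in the test split.
    def test_medcalc_bench_scenario_get_instances():
        scenario = MedCalcBenchScenario()
        with TemporaryDirectory() as output_path:
            instances = scenario.get_instances(output_path)
        assert len(instances) > 0
        assert instances[0].split == TEST_SPLIT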

Documentation formatting:

  • Changed the docstring in the MATHScenario class to use a raw string for improved formatting.

@yifanmai @MiguelAFH
Could you please review this PR?

@chakravarthik27
Contributor Author

Hi @yifanmai, @MiguelAFH

MedCalc-Bench v1.0 is returning a 404 error here: https://huggingface.co/datasets/ncbi/MedCalc-Bench-v1.0.
I've switched to https://huggingface.co/datasets/ncbi/MedCalc-Bench-v2.0.

Could you please review this PR?

Thanks.

@yifanmai
Collaborator

yifanmai commented Dec 3, 2025

The link you sent https://huggingface.co/datasets/ncbi/MedCalc-Bench-v1.0 does not return a 404. Could you clarify what you meant by this?

As for upgrading to MedCalc-Bench-v2.0, I am OK with this, but it should be a new separate run spec function in order to maintain reverse compatibility. Users running evals using the existing MedCalc-Bench should not see any changes.

@chakravarthik27
Contributor Author

chakravarthik27 commented Dec 4, 2025

> The link you sent https://huggingface.co/datasets/ncbi/MedCalc-Bench-v1.0 does not return a 404. Could you clarify what you meant by this?
>
> As for upgrading to MedCalc-Bench-v2.0, I am OK with this, but it should be a new separate run spec function in order to maintain reverse compatibility. Users running evals using the existing MedCalc-Bench should not see any changes.

Hi @yifanmai

A few weeks ago I encountered a 404 error and found the v2.0 dataset. Since then, versions 1.1 and 1.2 have also been released. As you suggested, I will work on creating the new run specifications for medcalc_bench_v1.1 and medcalc_bench_v1.2.

Thanks

Regards
@chakravarthik27

@yifanmai
Collaborator

yifanmai commented Dec 4, 2025

Great, thanks for the update.

@nikhilk7153

nikhilk7153 commented Dec 15, 2025

Hi, I was going to open an issue suggesting a switch to the new MedCalc dataset, but it seems someone else got here first. I have made a few more changes to the MedCalc-Bench dataset on top of v1.2, and you can find the newest dataset here: https://github.com/nikhilk7153/MedCalc-Bench-Verified.

All future updates will be made in this new repo. MedCalc-Bench Verified is an updated version of v1.2; you can find the changes relative to v1.2 in this release: https://github.com/nikhilk7153/MedCalc-Bench-Verified/releases/tag/MedCalc-Bench-Verified

@chakravarthik27
Contributor Author

Hi @yifanmai

Can you please review this PR?

Thanks


 class MATHScenario(Scenario):
-    """
+    r"""
Collaborator

Revert unrelated change.

)


@run_spec_function("medcalc_bench_v1_0")
Collaborator

There seems to be a lot of duplicated code here. I would suggest having the version number as a parameter, and then doing something like the following. This reduces the duplicated code, and also preserves backwards compatibility.

@run_spec_function("medcalc_bench")
def get_medcalc_bench_spec(version: Optional[str] = None) -> RunSpec:
    scenario_args = {} if version is None else {"version": version}
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.medcalc_bench_scenario.MedCalcBenchScenario",
        args=scenario_args,
    )

    # ...

    run_spec_name = "medcalc_bench" if version is None else f"medcalc_bench:version={version}"

    return RunSpec(
        name=run_spec_name,
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["medcalc_bench"],
    )
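
With this shape, existing run entries that use plain medcalc_bench keep the current (v1.0) behavior, while a versioned entry of the form medcalc_bench:version=<version> selects the requested dataset version, per the run_spec_name logic above; the exact version strings accepted would depend on how the scenario interprets the version argument.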

Comment on lines +154 to +157
MedCalc-Bench v1.0 is an updated version of the MedCalc-Bench dataset designed to
evaluate LLMs' capabilities in medical calculations. This version serves as a baseline
for assessing the performance of language models in computing clinically relevant values
from patient notes.
Collaborator

If you like, you can merge the version-level changelog into the class docstring for MedCalcBenchScenario.
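
For illustration only, a merged docstring could look roughly like the following; the wording is a sketch assembled from the descriptions in this thread, not the actual docstring in the repository.

# Sketch only: illustrative docstring content, not the repository's actual code.
from helm.benchmark.scenarios.scenario import Scenario


class MedCalcBenchScenario(Scenario):
    """
    MedCalc-Bench evaluates LLMs' capabilities in medical calculations, i.e.
    computing clinically relevant values from patient notes.

    Version notes:
    - v1.0: baseline release, used by the existing medcalc_bench run spec.
    - v2.0: updated dataset hosted at
      https://huggingface.co/datasets/ncbi/MedCalc-Bench-v2.0.
    """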

@nikhilk7153

nikhilk7153 commented Jan 5, 2026

Are there any future plans to update MedHELM, @yifanmai? Would it be possible to use MedCalc-Bench Verified instead of the v1.0 that is currently being used? We have fixed a number of annotation and ground-truth label issues (affecting approximately 1/3 of the dataset), so re-running would give a more accurate picture of the landscape.

@yifanmai
Collaborator

yifanmai commented Jan 5, 2026

I think this is more of a question for @MiguelAFH - the official evals and results are maintained by them, so it depends on whether there is funding and bandwidth available for this.
