
Conversation

@chakravarthik27
Contributor

This pull request updates the MedCalc-Bench scenario to use the latest dataset version and adds a basic test for the scenario. The main changes focus on keeping the dataset reference current and improving test coverage.

MedCalc-Bench scenario updates:

  • Updated the dataset reference in both the class docstring and the code in medcalc_bench_scenario.py to use MedCalc-Bench-v2.0 instead of MedCalc-Bench-v1.0 (illustrated with a sketch below).
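
    The following is a hypothetical illustration of this kind of one-line change; it assumes the scenario loads the data with Hugging Face datasets.load_dataset, which may not match the actual loading code in medcalc_bench_scenario.py.

    from datasets import load_dataset

    # Hypothetical sketch; previously this would have referenced "ncbi/MedCalc-Bench-v1.0".
    dataset = load_dataset("ncbi/MedCalc-Bench-v2.0", split="test")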

Testing improvements:

  • Added a new test file test_medcalc_bench_scenario.py with a pytest-based test that verifies the scenario loads instances and that the first instance is from the "test" split.
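
    A minimal sketch of such a test, assuming HELM's usual scenario-test shape; the class name MedCalcBenchScenario, the TEST_SPLIT constant, and the get_instances(output_path) signature follow HELM conventions and may differ from the actual file.

    from tempfile import TemporaryDirectory

    from helm.benchmark.scenarios.medcalc_bench_scenario import MedCalcBenchScenario
    from helm.benchmark.scenarios.scenario import TEST_SPLIT


    # Sketch only: verifies that instances load and that the first one is in the test split.
    def test_medcalc_bench_scenario_get_instances():
        scenario = MedCalcBenchScenario()
        with TemporaryDirectory() as output_path:
            instances = scenario.get_instances(output_path)
        assert len(instances) > 0
        assert instances[0].split == TEST_SPLIT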

Documentation formatting:

  • Changed the docstring in the MATHScenario class to use a raw string for improved formatting.

@yifanmai @MiguelAFH
Could you please review this PR?

@chakravarthik27
Contributor Author

Hi @yifanmai, @MiguelAFH

MedCalc-Bench v1.0 is returning a 404 error here: https://huggingface.co/datasets/ncbi/MedCalc-Bench-v1.0.
I've switched to https://huggingface.co/datasets/ncbi/MedCalc-Bench-v2.0.

Could you please review this PR?

Thanks.

@yifanmai
Collaborator

yifanmai commented Dec 3, 2025

The link you sent https://huggingface.co/datasets/ncbi/MedCalc-Bench-v1.0 does not return a 404. Could you clarify what you meant by this?

As for upgrading to MedCalc-Bench-v2.0, I am OK with this, but it should be a new separate run spec function in order to maintain reverse compatibility. Users running evals using the existing MedCalc-Bench should not see any changes.

@chakravarthik27
Contributor Author

chakravarthik27 commented Dec 4, 2025

> The link you sent https://huggingface.co/datasets/ncbi/MedCalc-Bench-v1.0 does not return a 404. Could you clarify what you meant by this?
>
> As for upgrading to MedCalc-Bench-v2.0, I am OK with this, but it should be a new separate run spec function in order to maintain reverse compatibility. Users running evals using the existing MedCalc-Bench should not see any changes.

Hi @yifanmai

A few weeks ago I encountered a 404 error and found the v2.0 dataset. Since then, versions 1.1 and 1.2 have also been released. As you suggested, I will work on creating the new run specifications for medcalc_bench_v1.1 and medcalc_bench_v1.2.

Thanks

Regards
@chakravarthik27

@yifanmai
Collaborator

yifanmai commented Dec 4, 2025

Great, thanks for the update.

@nikhilk7153

nikhilk7153 commented Dec 15, 2025

Hi, I was going to open an issue suggesting a switch to the new MedCalc dataset, but it seems someone else got here first. I have made a few more changes to the MedCalc-Bench dataset on top of v1.2, and you can find the newest dataset here: https://github.com/nikhilk7153/MedCalc-Bench-Verified.

All future updates will be made in this new repo. MedCalc-Bench Verified is an updated version of v1.2; you can find the changes relative to v1.2 in this release: https://github.com/nikhilk7153/MedCalc-Bench-Verified/releases/tag/MedCalc-Bench-Verified

@chakravarthik27
Contributor Author

Hi @yifanmai

Can you please review this PR?

Thanks


 class MATHScenario(Scenario):
-    """
+    r"""
Collaborator

Revert unrelated change.

)


@run_spec_function("medcalc_bench_v1_0")
Collaborator

There seems to be a lot of duplicated code here. I would suggest having the version number as a parameter, and then doing something like the following. This reduces the duplicated code, and also preserves backwards compatibility.

@run_spec_function("medcalc_bench")
def get_medcalc_bench_spec(version: Optional[str] = None) -> RunSpec:
    scenario_args = {} if version is None else {"version": version}
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.medcalc_bench_scenario.MedCalcBenchScenario",
        args=scenario_args,
    )

    # ...

    run_spec_name = "medcalc_bench" if version is None else f"medcalc_bench:version={version}"

    return RunSpec(
        name=run_spec_name,
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["medcalc_bench"],
    )
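
With this shape, existing run entries that use plain medcalc_bench keep the current (v1.0) behavior, while a versioned entry of the form medcalc_bench:version=<version> selects the requested dataset version, per the run_spec_name logic above; the exact version strings accepted would depend on how the scenario interprets the version argument.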

Comment on lines +154 to +157
MedCalc-Bench v1.0 is an updated version of the MedCalc-Bench dataset designed to
evaluate LLMs' capabilities in medical calculations. This version serves as a baseline
for assessing the performance of language models in computing clinically relevant values
from patient notes.
Collaborator

If you like, you can merge the version-level changelog into the class docstring for MedCalcBenchScenario.
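
For illustration only, a merged docstring could look roughly like the following; the wording is a sketch assembled from the descriptions in this thread, not the actual docstring in the repository.

# Sketch only: illustrative docstring content, not the repository's actual code.
from helm.benchmark.scenarios.scenario import Scenario


class MedCalcBenchScenario(Scenario):
    """
    MedCalc-Bench evaluates LLMs' capabilities in medical calculations, i.e.
    computing clinically relevant values from patient notes.

    Version notes:
    - v1.0: baseline release, used by the existing medcalc_bench run spec.
    - v2.0: updated dataset hosted at
      https://huggingface.co/datasets/ncbi/MedCalc-Bench-v2.0.
    """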

@nikhilk7153

nikhilk7153 commented Jan 5, 2026

Are there any future plans to update MedHELM, @yifanmai? Would it be possible to use MedCalc-Bench Verified instead of the v1.0 that is currently being used? We have fixed a number of annotation and ground-truth label issues (affecting approximately 1/3 of the dataset), so re-running would give a more accurate picture of the landscape.

@yifanmai
Collaborator

yifanmai commented Jan 5, 2026

I think this is more of a question for @MiguelAFH - the official evals and results are maintained by them, so it depends on whether there is funding and bandwidth available for this.
