
New benchmark: CaselawQA #2739

Open · wants to merge 10 commits into main
Conversation

@RicardoDominguez commented Feb 26, 2025

Hi,

This is a new task contribution for the CaselawQA benchmark for legal text annotation, introduced in the paper Lawma: The Power of Specialization for Legal Tasks, appearing at ICLR 2025.

@CLAassistant commented Feb 26, 2025

CLA assistant check
All committers have signed the CLA.

@StellaAthena (Member)

Thank you for the PR! The paper doesn't seem to specify which evaluation implementation was used. Is this the official implementation? If not, have you validated that models behave the same on this implementation as they do on the one used in the paper?

@RicardoDominguez (Author) commented Feb 27, 2025

Yes, this is the official implementation. We will specify this in the paper for the camera-ready version, coming soon.

@StellaAthena (Member)

Great. Can you run the pre-commit hook to address the failing tests?

@RicardoDominguez (Author)

Changes:

  • Removed the .yaml extension from the default templates
  • Renamed the _tiny subtask to _1k, which is clearer
  • Made CoT evaluation the default, rather than MMLU-style direct QA, since newer models (>= 3B parameters) perform better with CoT on this benchmark (see the usage sketch below)
  • Ran the pre-commit hook
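
For context, a minimal sketch of how the benchmark could then be run with the CoT default through the harness's Python API; the task name caselawqa and the model checkpoint below are illustrative assumptions, not taken from this PR:

```python
# Illustrative sketch only: run the (assumed) "caselawqa" task via lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # any causal LM checkpoint
    tasks=["caselawqa"],  # assumed task name for this benchmark
    batch_size=8,
)
print(results["results"])  # per-task metrics
```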

@RicardoDominguez (Author) commented Mar 7, 2025

One question: our benchmark aggregates over 260 different annotation problems. While it provides a measure of overall model performance, researchers might want to evaluate models' accuracy on specific annotation problems, say caselawqa_sc_adminaction. I've implemented these tasks on a different branch, see here. In practice, these tasks only differ in their dataset_name. However, it might be unreasonable to "spam" the lm_eval library by implementing each of these 260 sets of annotation problems as a distinct task. Thoughts?
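
To illustrate what that branch amounts to, here is a minimal sketch of generating one small per-problem config from a shared base, overriding only dataset_name; the file names, the include mechanism, and the problem identifiers are assumptions for illustration rather than the actual branch contents:

```python
# Illustrative sketch only: generate per-problem task configs that differ solely in
# dataset_name (and task name), inheriting everything else from a shared base YAML.
# File names and problem identifiers below are placeholders, not the branch contents.
from pathlib import Path
import yaml

PROBLEMS = ["sc_adminaction", "sc_casedisposition"]  # in practice ~260 entries

out_dir = Path("lm_eval/tasks/caselawqa")
out_dir.mkdir(parents=True, exist_ok=True)

for problem in PROBLEMS:
    config = {
        "include": "_caselawqa_base.yaml",  # assumed shared base config
        "task": f"caselawqa_{problem}",
        "dataset_name": problem,            # the only field that really differs
    }
    with open(out_dir / f"caselawqa_{problem}.yaml", "w") as f:
        yaml.safe_dump(config, f, sort_keys=False)
```

Whether such configs are checked in as files or built at runtime, the per-problem overhead stays at a few lines each.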

@StellaAthena (Member)

@baberabb Can you advise on subtask implementation? I'm not sure we currently support it, but one thing that comes to mind is that it could be helpful to iterate over subtasks algorithmically and report per-subtask scores when a user requests them.

@baberabb (Contributor)

Right now we do expect each individual sub-task to have its own config, but it's worth thinking about having the option to create them programmatically at runtime, especially if they all share the same base config with minor differences. The number of configs is getting a bit unwieldy, and we need to parse them all at startup.

@RicardoDominguez Can you add a link to your branch in the README? I think that would be helpful for users who want to make use of it.

Also, if you could add an entry to lm_eval/tasks/README.md describing your benchmark in a sentence, like all the other tasks!
