
New benchmark: CaselawQA #2739

Open · wants to merge 10 commits into main
Conversation

@RicardoDominguez commented Feb 26, 2025

Hi,

This is a new task contribution for the CaselawQA benchmark for legal text annotation, introduced in the paper Lawma: The Power of Specialization for Legal Tasks, appearing at ICLR 2025.

@CLAassistant commented Feb 26, 2025

CLA assistant check
All committers have signed the CLA.

@StellaAthena (Member)

Thank you for the PR! The paper doesn't seem to specify which evaluation implementation was used. Is this the official implementation? If not, have you validated that models behave the same on this implementation as they do on the one used in the paper?

@RicardoDominguez (Author) commented Feb 27, 2025

Yes, this is the official implementation. We will specify this in the paper for the camera-ready version, coming soon.

@StellaAthena (Member)

Great. Can you run the pre-commit hook to address the failing tests?

@RicardoDominguez (Author)

Changes:

  • Removed the .yaml extension from the default templates
  • Renamed the _tiny subtask to _1k, which is clearer
  • Made CoT evaluation the default, rather than MMLU-style direct QA, since newer models (>= 3B parameters) perform better with CoT on this benchmark (see the usage sketch below)
  • Ran the pre-commit hook
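
For context, a minimal sketch of how the benchmark could then be run with the CoT default through the harness's Python API; the task name caselawqa and the model checkpoint below are illustrative assumptions, not taken from this PR:

```python
# Illustrative sketch only: run the (assumed) "caselawqa" task via lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # any causal LM checkpoint
    tasks=["caselawqa"],  # assumed task name for this benchmark
    batch_size=8,
)
print(results["results"])  # per-task metrics
```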

@RicardoDominguez (Author) commented Mar 7, 2025

One question: our benchmark aggregates over 260 different annotation problems. While it provides a measure of overall model performance, researchers might want to evaluate models' accuracy on specific annotation problems, say caselawqa_sc_adminaction. I've implemented these tasks on a different branch, see here. In practice, these tasks only differ in their dataset_name. However, it might be unreasonable to "spam" the lm_eval library by implementing each of these 260 sets of annotation problems as a distinct task. Thoughts?
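
To illustrate what that branch amounts to, here is a minimal sketch of generating one small per-problem config from a shared base, overriding only dataset_name; the file names, the include mechanism, and the problem identifiers are assumptions for illustration rather than the actual branch contents:

```python
# Illustrative sketch only: generate per-problem task configs that differ solely in
# dataset_name (and task name), inheriting everything else from a shared base YAML.
# File names and problem identifiers below are placeholders, not the branch contents.
from pathlib import Path
import yaml

PROBLEMS = ["sc_adminaction", "sc_casedisposition"]  # in practice ~260 entries

out_dir = Path("lm_eval/tasks/caselawqa")
out_dir.mkdir(parents=True, exist_ok=True)

for problem in PROBLEMS:
    config = {
        "include": "_caselawqa_base.yaml",  # assumed shared base config
        "task": f"caselawqa_{problem}",
        "dataset_name": problem,            # the only field that really differs
    }
    with open(out_dir / f"caselawqa_{problem}.yaml", "w") as f:
        yaml.safe_dump(config, f, sort_keys=False)
```

Whether such configs are checked in as files or built at runtime, the per-problem overhead stays at a few lines each.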

@StellaAthena (Member)

@baberabb Can you advise on subtask implementation? I'm not sure we currently support it, but one thing that comes to mind is that it could be helpful to iterate over subtasks algorithmically and report per-subtask scores when a user requests them.

@baberabb (Contributor)

Right now we do expect each individual sub-task to have its own config, but it's worth thinking about having the option to create them programmatically at runtime, especially if they all share the same base config with minor differences. The number of configs is getting a bit unwieldy, and we need to parse them all at startup.

@RicardoDominguez Can you add a link to your branch in the README? I think that would be helpful for users who want to make use of it.

Also, if you could add an entry to lm_eval/tasks/README.md describing your benchmark in a sentence, like all the other tasks!
