Releases: EleutherAI/lm-evaluation-harness
v0.4.11
v0.4.11 Release Notes
Minor release. Stay tuned for bigger changes next release.
New Platform Support
- Windows ML Backend — Native Windows ML inference support by @chapsiru and @chemwolf6922 in #3470, #3564, #3565
New Benchmarks & Tasks
Task Version Changes
The following tasks have updated versions. Results from previous task versions may not be directly comparable. See the linked PRs or individual task READMEs for changelogs.
- `afrobench_belebele` (all variants): 2 → 3 in #3551
- `evalita_llm`: 0.0 → 0.1 in #3551
- `include` (all 90 language variants): 0.0 → 0.1 in #3551
- `mgsm_direct` (all 11 language variants): 3.0 → 4.0 by @LakshyaChaudhry in #3574
Fixes & Improvements
- Fixed SQuAD v2 evaluation by @HydrogenSulfate in #3535
- Fixed MasakhaNEWS tasks — replaced non-existent `headline_text` field with `headline` by @Mr-Neutr0n in #3567
- Fixed incorrect task configs by @baberabb in #3552
- Replaced `eval()` with `ast.literal_eval` in task configs for safer parsing by @baberabb in #3577
- Fixed SGLang duplicate registration error by @enpimashin in #3543
- Restored `hf_transfer` import check by @baberabb in #3563
- Fixed `modify_gen_kwargs` call in vLLM VLMs by @hmellor in #3573
- Refactored vLLM `gen_kwargs` normalization inline to `modify_gen_kwargs`; fixed cached `gen_kwargs` mutation by @baberabb in #3582
- Fixed README for task-listing CLI command by @UltimateJupiter in #3545
- Updated dependencies by @baberabb in #3546
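On the `eval()` to `ast.literal_eval` swap: `literal_eval` only accepts Python literals (strings, numbers, lists, dicts, booleans, `None`), so config values can no longer execute arbitrary code. A minimal standalone sketch of the difference; the config snippet is illustrative, not the harness's actual schema:

```python
import ast

# A value as it might appear in a task config file (illustrative keys).
raw = "{'num_fewshot': 5, 'stop': ['\\n\\n']}"

# ast.literal_eval parses Python literals only; it cannot run code.
parsed = ast.literal_eval(raw)
assert parsed["num_fewshot"] == 5

# eval() would happily execute arbitrary expressions; literal_eval refuses.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    print("rejected non-literal input")
```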
New Contributors
- @HydrogenSulfate made their first contribution in #3535
- @UltimateJupiter made their first contribution in #3545
- @enpimashin made their first contribution in #3543
- @chapsiru made their first contribution in #3470
- @chemwolf6922 made their first contribution in #3565
- @plonerma made their first contribution in #3496
- @hmellor made their first contribution in #3573
- @Mr-Neutr0n made their first contribution in #3567
- @LakshyaChaudhry made their first contribution in #3574
Full Changelog: v0.4.10...v0.4.11
v0.4.10
Highlights
The big change this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.
Breaking Change: Lightweight Core with Optional Backends
pip install lm_eval no longer installs the HuggingFace/torch stack by default. (#3428)
The core package no longer includes backends. Install them explicitly:
pip install lm_eval # core only, no model backends
pip install lm_eval[hf] # HuggingFace backend (transformers, torch, accelerate)
pip install lm_eval[vllm] # vLLM backend
pip install lm_eval[api] # API backends (OpenAI, Anthropic, etc.)
Additional breaking change: accessing model classes via module attribute no longer works:
# This still works:
from lm_eval.models.huggingface import HFLM
# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
CLI Refactor
The CLI now uses explicit subcommands and supports YAML config files (#3440):
lm-eval run --model hf --tasks hellaswag # run evaluations
lm-eval run --config my_config.yaml # load args from YAML config
lm-eval ls tasks # list available tasks
lm-eval validate --tasks hellaswag,arc_easy # validate task configs
Backward compatible: omitting the run subcommand still works: lm-eval --model hf --tasks hellaswag
See lm-eval --help or the CLI documentation for details.
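A YAML config passed via `--config` would presumably mirror the CLI flags; the following is a hedged sketch, with key names assumed from the flag names rather than verified against the parser:

```yaml
# my_config.yaml: a sketch assuming keys mirror CLI flag names
model: hf
model_args: pretrained=EleutherAI/pythia-160m
tasks:
  - hellaswag
  - arc_easy
num_fewshot: 5
batch_size: 8
```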
Other Improvements
- Decoupled `ContextSampler` with new `build_qa_turn` helper (#3429)
- Normalized `gen_kwargs` with `truncation_side` support for vLLM (#3509)
New Benchmarks & Tasks
- PISA task by @HallerPatrick in #3412
- SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
- OpenAI Multilingual MMLU by @Helw150 in #3473
- ULQA benchmark by @keramjan in #3340
- IFEval in Spanish and Catalan by @juliafalcao in #3467
- TruthfulQA-VA for Catalan by @sgs97ua in #3469
- Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
- NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444
Model Support
- Ministral-3 adapter (`hf-mistral3`) by @medhakimbedhief in #3487
Fixes & Improvements
Task Fixes
- Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
- Fixed `gen_prefix` delimiter handling in multiple-choice tasks by @baberabb in #3508
- Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
- Fixed `a=0` as valid answer index in `build_qa_turn` by @ezylopx5 in #3488
- Fixed `fewshot_config` not being applied to fewshot docs by @baberabb in #3461
- Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
- Fixed `gsm8k_cot_llama` `target_delimiter` issue by @baberabb in #3526
- Updated LIBRA task utils by @bond005 in #3520
Backend Fixes
- Fixed vLLM off-by-one `max_length` error by @baberabb in #3503
- Resolved deprecated `vllm.transformers_utils.get_tokenizer` import by @DarkLight1337 in #3482
- Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
- Removed deprecated `AutoModelForVision2Seq` by @baberabb in #3522
- Fixed Anthropic chat model mapping by @lucafossen in #3453
- Fixed bug preventing `=` sign in checkpoint names by @mrinaldi97 in #3517
- Fixed `pretty_print_task` for external custom configs by @safikhanSoofiyani in #3436
- Fixed CLI regressions by @fxmarty-amd in #3449
New Contributors
- @safikhanSoofiyani made their first contribution in #3436
- @lucafossen made their first contribution in #3453
- @Ahmad21Omar made their first contribution in #3305
- @ezylopx5 made their first contribution in #3488
- @juliafalcao made their first contribution in #3467
- @medhakimbedhief made their first contribution in #3487
- @ntenenz made their first contribution in #3489
- @keramjan made their first contribution in #3340
- @bond005 made their first contribution in #3520
- @mrinaldi97 made their first contribution in #3517
- @wogns3623 made their first contribution in #3523
Full Changelog: v0.4.9.2...v0.4.10
lm-eval v0.4.9.2 Release Notes
This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.
New Benchmarks & Tasks
A big wave of new evaluation tasks this release:
- AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
- BabiLong and Longbench v2 for long-context evaluation by @jannalulu in #3287, #3338
- GraphWalks by @jannalulu in #3377
- ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
- Icelandic WinoGrande by @jmichaelov in #3277
- CLIcK Korean benchmark by @shing100 in #3173
- MMLU-Redux (generative) and Spanish translation by @luiscosio in #2705
- EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
- EQBench in Spanish and Catalan by @priverabsc in #3168
- Anthropic discrim-eval by @Helw150 in #3091
- XNLI-VA by @FranValero97 in #3194
- Bangla MMLU (Titulm) by @Ismail-Hossain-1 in #3317
- HumanEval infilling by @its-alpesh in #3299
- CNN-DailyMail 3.0.0 by @preordinary in #3426
- Global PIQA and new `acc_norm_bytes` metric by @baberabb in #3368
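For context on the new `acc_norm_bytes` metric: byte-length normalization divides each choice's log-likelihood by the UTF-8 byte length of its continuation, making scores comparable across scripts where character counts and byte counts diverge. A minimal sketch of the idea; the function below is illustrative, not the harness's implementation:

```python
def acc_norm_bytes(loglikelihoods, continuations, gold_index):
    """Pick the choice with the highest byte-normalized log-likelihood."""
    scores = [
        ll / len(cont.encode("utf-8"))
        for ll, cont in zip(loglikelihoods, continuations)
    ]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return float(pred == gold_index)

# Longer continuations accumulate more negative log-likelihood;
# dividing by byte length removes that length bias, so the second
# choice wins here despite its worse raw score.
print(acc_norm_bytes([-5.0, -14.0], ["no", "definitely not"], 1))  # → 1.0
```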
Fixes & Improvements
Core Changes:
- Python 3.10 minimum by @jannalulu in #3337
- Unpinned `datasets` library by @baberabb in #3316
- BOS token handling: Delegate to tokenizer; `add_bos_token` now defaults to `None` by @baberabb in #3347
- Renamed `LOGLEVEL` env var to `LMEVAL_LOG_LEVEL` to avoid conflicts by @fxmarty-amd in #3418
- Resolve duplicate task names with safeguards by @giuliolovisotto in #3394
Task Fixes:
- Fixed MMLU-Redux to exclude samples without `error_type="ok"` and display summary table by @fxmarty-amd in #3410, #3406
- Fixed AIME answer extraction by @jannalulu in #3353
- Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361
- Fixed `crows_pairs` dataset by @jannalulu in #3378
- Fixed Gemma tokenizer `add_bos_token` not updating by @DarkLight1337 in #3206
- Fixed `lambada_multilingual_stablelm` by @jmichaelov, @HallerPatrick in #3294, #3222
- Fixed CodeXGLUE by @gsaltintas in #3238
- Pinned correct MMLUSR version by @christinaexyou in #3350
- Updated `minerva_math` by @baberabb in #3259
Backend Fixes:
- Fixed vLLM import errors when not installed by @fxmarty-amd in #3292
- Fixed vLLM `data_parallel_size>1` issue by @Dornavineeth in #3303
- Resolved deprecated `vllm.utils.get_open_port` by @DarkLight1337 in #3398
- Fixed GPT series model bugs by @zinccat in #3348
- Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
- Fixed `additional_config` parsing by @brian-dellabetta in #3393
- Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
- Fixed no-output error handling by @Oseltamivir in #3395
- Replaced deprecated `torch_dtype` with `dtype` by @AbdulmalikDS in #3415
- Fixed custom task config reading by @SkyR0ver in #3425
Model & Backend Support
- OpenAI GPT-5 support by @babyplutokurt in #3247
- Azure OpenAI support by @zinccat in #3349
- Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
- OpenVINO text2text models by @nikita-savelyevv in #3101
- Intel XPU support for HFLM by @kaixuanliu in #3211
- Attention head steering support by @luciaquirke in #3279
- Leverage vLLM's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in #3185
What's Changed
- Remove `trust_remote_code: True` from updated datasets by @Avelina9X in #3213
- Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in #3234
- Fix `add_bos_token` not updated for Gemma tokenizer by @DarkLight1337 in #3206
- remove incomplete compilation instructions, solves #3233 by @ceferisbarov in #3242
- Update utils.py by @Anri-Lombard in #3246
- Adding support for OpenAI GPT-5 model by @babyplutokurt in #3247
- Add xnli_va dataset by @FranValero97 in #3194
- Add ZhoBLiMP benchmark by @jmichaelov in #3218
- Add BLiMP-NL by @jmichaelov in #3221
- Add TurBLiMP by @jmichaelov in #3219
- Add LM-SynEval Benchmark by @jmichaelov in #3184
- Fix unknown group key to tag in yaml config for `lambada_multilingual_stablelm` by @HallerPatrick in #3222
- update `minerva_math` by @baberabb in #3259
- feat: Add CLIcK task by @shing100 in #3173
- Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in #3091
- Add support for OpenVINO text2text generation models by @nikita-savelyevv in #3101
- Update MMLU-ProX task by @weihao1115 in #3174
- Support for AIME dataset by @jannalulu in #3248
- feat(scrolls): delete chat_template from kwargs by @slimfrkha in #3267
- pacify pre-commit by @baberabb in #3268
- Fix codexglue by @gsaltintas in #3238
- Add BHS benchmark by @jmichaelov in #3265
- Add `acc_norm` metric to BLiMP-NL by @jmichaelov in #3272
- Add `acc_norm` metric to ZhoBLiMP by @jmichaelov in #3271
- Add EsBBQ and CaBBQ tasks by @valleruizf in #3167
- Add support for steering individual attention heads by @luciaquirke in #3279
- Add the Icelandic WinoGrande benchmark by @jmichaelov in #3277
- Ignore seed when splitting batch in chunks with groupby by @slimfrkha in #3047
- [fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in #3292
- Fix LongBench Evaluation by @TimurAysin in #3273
- add intel xpu support for HFLM by @kaixuanliu in #3211
- feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in #2705
- Add BabiLong by @jannalulu in #3287
- Add AIME to task description by @jannalulu in #3296
- Add humaneval_infilling task by @its-alpesh in #3299
- Add eqbench tasks in Spanish and Catalan by @priverabsc in #3168
- [fix] add math and longbench to test dependencies by @jannalulu in #3321
- Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in #3303
- unpin datasets; update pre-commit by @baberabb in #3316
- bump to python 3.10 by @jannalulu in #3337
- Longbench v2 by @jannalulu in #3338
- Leverage vllm's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in #3185
- Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in #3317
- remove duplicate tags/groups by @baberabb in #3343
- Align `humaneval_64_instruct` task label in README to name in yaml file by @jmichaelov in #3344
- Fixes bugs when using gpt series model by @zinccat in #3348
- [fix] aime doesn't extract answers by @jannalulu in #3353
- add global_piqa; add acc_norm_bytes metric by @baberabb in #3368
- [fix] crows_pairs dataset by @jannalulu in #3378
- Fix issue 3355 assertion error by @marksverdhei in #3356
- fix(gsm8k): align README to yaml file by @neoheartbeats in #3388
- added azure openai support by @zinccat in #3349
- Delegate BOS to the tokenizer; `add_bos_token` defaults to `None` by @baberabb in #3347
- fix trust...
v0.4.9.1
lm-eval v0.4.9.1 Release Notes
This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon!
Enhanced Reasoning Model Handling
- Better support for reasoning models with a `think_end_token` argument to strip intermediate reasoning from outputs for the `hf`, `vllm`, and `sglang` model backends. A related `enable_thinking` argument was also added for specific models that support it (e.g., Qwen).
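Mechanically, stripping intermediate reasoning amounts to discarding everything up to and including the end-of-thinking marker. A hedged sketch of the idea, assuming `</think>` as the end token (the Qwen-style delimiter); this is not the harness's exact code path:

```python
def strip_reasoning(text: str, think_end_token: str = "</think>") -> str:
    """Keep only the text after the last end-of-thinking marker, if any."""
    marker = text.rfind(think_end_token)
    if marker == -1:
        return text  # no reasoning block: return output unchanged
    return text[marker + len(think_end_token):].lstrip()

out = "<think>2+2 is 4, so...</think>The answer is 4."
print(strip_reasoning(out))  # → The answer is 4.
```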
New Benchmarks & Tasks
- EgyMMLU and EgyHellaSwag by @houdaipha in #3063
- MultiBLiMP benchmark by @jmichaelov in #3155
- LIBRA benchmark for long-context evaluation by @karimovaSvetlana in #2943
- Multilingual Truthfulqa in Spanish, Basque and Galician by @BlancaCalvo in #3062
Fixes & Improvements
Tasks & Benchmarks:
- Aligned Humaneval results for Llama-3.1-70B-Instruct with official scores by @userljz, @baberabb, @idantene in #3201, #3092, #3102
- Fixed incorrect dataset paths for GLUE and medical benchmarks by @Avelina9X and @idantene. (#3159, #3151)
- Removed redundant "Let's think step by step" text from `bbh_cot_fewshot` prompts by @philipdoldo. (#3140)
- Increased `max_gen_toks` to 2048 for HRM8K math benchmarks by @shing100. (#3124)
Backend & Stability:
- Reduced CLI loading time from 2.2s to 0.05s by @stakodiak. (#3099)
- Fixed a process hang caused by mp.Pool in bootstrap_stderr and introduced `DISABLE_MULTIPROC` env var by @ankitgola005 and @neel04. (#3135, #3106)
- Added image hashing and `LMEVAL_HASHMM` env var by @artemorloff in #2973
- TaskManager: `include-path` precedence handling to prioritize custom dir over default by @parkhs21 in #3068
Housekeeping:
- Pinned `datasets < 4.0.0` temporarily to maintain compatibility with `trust_remote_code` by @baberabb. (#3172)
- Removed models from Neural Magic and other unneeded files by @baberabb. (#3112, #3113, #3108)
What's Changed
- llama3 task: update README.md by @annafontanaa in #3074
- Fix Anthropic API compatibility issues in chat completions by @NourFahmy in #3054
- Ensure backwards compatibility in `fewshot_context` by using kwargs by @kiersten-stokes in #3079
- [vllm] remove system message if `TemplateError` for chat_template by @baberabb in #3076
- feat / fix: Properly make use of `subfolder` from HF models by @younesbelkada in #3072
- [HF] fix quantization config by @baberabb in #3039
- FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct by @userljz in #3092
- Truthfulqa multi harness by @BlancaCalvo in #3062
- Fix: Reduce CLI loading time from 2.2s to 0.05s by @stakodiak in #3099
- Humaneval - fix regression by @baberabb in #3102
- Bugfix/hf tokenizer gguf override by @ankush13r in #3098
- [FIX] Initial code to disable multi-proc for stderr by @neel04 in #3106
- fix deps; update hooks by @baberabb in #3107
- delete unneeded files by @baberabb in #3108
- Fixed #3005: Processes both formats of model_args: string and dictionary by @DebjyotiRay in #3097
- add image hashing and `LMEVAL_HASHMM` env var by @artemorloff in #2973
- removal of Neural Magic models by @baberabb in #3112
- Neuralmagic by @baberabb in #3113
- check pil dep when hashing images by @baberabb in #3114
- warning for "chat" pretrained; disable buggy evalita configs by @baberabb in #3127
- fix: remove warning by @baberabb in #3128
- Adding EgyMMLU and EgyHellaSwag by @houdaipha in #3063
- Added mixed_precision_dtype argument to HFLM to enable autocasting by @Avelina9X in #3138
- Fix for hang due to mp.Pool in bootstrap_stderr by @ankitgola005 in #3135
- Fix errors when using vLLM with LoRA by @Jacky-MYQ in #3132
- truncate thinking tags in generations by @baberabb in #3145
- bbh_cot_fewshot: Removed repeated "Let's think step by step." text from bbh cot prompts by @philipdoldo in #3140
- Fix medical benchmarks import by @idantene in #3151
- fix request hanging when request api by @mmmans in #3090
- Custom request headers | trust_remote_code param fix by @RawthiL in #3069
- Bugfix: update path for GLUE by @Avelina9X in #3159
- Add the MultiBLiMP benchmark by @jmichaelov in #3155
- multiblimp - readme by @baberabb in #3162
- [tests] Added missing fixture in test_unitxt_tasks.py by @Avelina9X in #3163
- Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks by @shing100 in #3124
- feat: Add LIBRA benchmark for long-context evaluation by @karimovaSvetlana in #2943
- Added `chat_template_args` to vllm by @Avelina9X in #3164
- Pin datasets < 4.0.0 by @baberabb in #3172
- Remove "device" from vllm_causallms.py by @mgoin in #3176
- remove trust-remote-code in configs; fix escape sequences by @baberabb in #3180
- Fix vllm test issue that call pop() from None by @weireweire in #3182
- [hotfix] vllm: pop `device` from kwargs by @baberabb in #3181
- Update vLLM compatibility by @DarkLight1337 in #3024
- Fix `mmlu_continuation` subgroup names to fit Readme and other variants by @lamalunderscore in #3137
- Fix humaneval_instruct by @idantene in #3201
- Update README.md for mlqa by @newme616 in #3117
- improve include-path precedence handling by @parkhs21 in #3068
- Bump version to 0.4.9.1 by @baberabb in #3208
New Contributors
- @NourFahmy made their first contribution in #3054
- @userljz made their first contribution in #3092
- @BlancaCalvo made their first contribution in #3062
- @stakodiak made their first contribution in #3099
- @ankush13r made their first contribution in #3098
- @neel04 made their first contribution in https://...
v0.4.9
lm-eval v0.4.9 Release Notes
Key Improvements
- Enhanced Backend Support:
- SGLang Generate API by @baberabb in #2997
- vLLM enhancements: Added support for `enable_thinking` argument (#2947) and data parallel for V1 (#3011) by @anmarques and @baberabb
- Chat template improvements: Extended vLLM chat template support (#2902) and fixed HF chat template resolution (#2992) by @anmarques and @fxmarty-amd
- Multimodal Capabilities:
- Audio modality support for Qwen2 Audio models by @artemorloff in #2689
- Image processing improvements: Added resize images support (#2958) and enabled multimodal API usage (#2981) by @artemorloff and @baberabb
- ChartQA multimodal task implementation by @baberabb in #2544
- Performance & Reliability:
- Quantization support added via `quantization_config` by @jerryzh168 in #2842
- Memory optimization: Use `yaml.CLoader` for faster YAML loading by @giuliolovisotto in #2777
- Bug fixes: Resolved MMLU generative metric aggregation (#2761) and context length handling issues (#2972)
New Benchmarks & Tasks
Code Evaluation
- HumanEval Instruct - Instruction-following code generation benchmark by @baberabb in #2650
- MBPP Instruct - Instruction-based Python programming evaluation by @baberabb in #2995
Language Modeling
- C4 Dataset Support - Added perplexity evaluation on C4 web crawl dataset by @Zephyr271828 in #2889
Long Context Benchmarks
Mathematical & Reasoning
- GSM8K Platinum - Enhanced mathematical reasoning benchmark by @Qubitium in #2771
- MastermindEval - Logic reasoning evaluation by @whoisjones in #2788
- JSONSchemaBench - Structured output evaluation by @Saibo-creator in #2865
Llama Reference Implementations
- Llama Reference Implementations - Added task variants for Multilingual MMLU, MMLU CoT, GSM8K, and ARC Challenge based on Llama evaluation standards by @anmarques in #2797, #2826, #2829
Multilingual Expansion
Asian Languages:
- Korean MMLU (KMMLU) multiple-choice task by @Aprilistic in #2849
- MMLU-ProX extended evaluation by @heli-qi in #2811
- KBL 2025 Dataset - Updated Korean benchmark evaluation by @abzb1 in #3000
European Languages:
African Languages:
- AfroBench - Multi-African language evaluation by @JessicaOjo in #2825
- Darija tasks - Moroccan dialect benchmarks (DarijaMMLU, DarijaHellaSwag, Darija_Bench) by @hadi-abdine in #2521
Arabic Languages:
- Arab Culture task for cultural understanding by @bodasadallah in #3006
Domain-Specific Benchmarks
- CareQA - Healthcare evaluation benchmark by @PabloAgustin in #2714
- ACPBench & ACPBench Hard - Automated code generation evaluation by @harshakokel in #2807, #2980
- INCLUDE tasks - Inclusivity evaluation suite by @agromanou in #2769
- Cocoteros VA dataset by @sgs97ua in #2787
Social & Bias Evaluation
- Various social bias tasks for fairness assessment by @oskarvanderwal in #1185
Technical Enhancements
- Fine-grained evaluation: Added `--examples` argument for efficient multi-prompt evaluation by @felipemaiapolo and @mirianfsilva in #2520
- Improved tokenization: Better handling of `add_bos_token` initialization by @baberabb in #2781
- Memory management: Enhanced softmax computations with `softmax_dtype` argument for `HFLM` by @Avelina9X in #2921
Critical Bug Fixes
- Collating Queries Fix - Resolved error with different continuation lengths that was causing evaluation failures by @ameyagodbole in #2987
- Mutual Information Metric - Fixed acc_mutual_info calculation bug that affected metric accuracy by @baberabb in #3035
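For reference, mutual-information scoring compares each choice's log-probability given the question against its unconditional log-probability, selecting the choice the context boosts the most. A minimal sketch of the idea behind `acc_mutual_info`; this is illustrative, not the exact fixed code:

```python
def acc_mutual_info(cond_lls, uncond_lls, gold_index):
    """Select the choice whose conditional log-likelihood gains most over
    its unconditional log-likelihood (pointwise mutual information)."""
    pmi = [c - u for c, u in zip(cond_lls, uncond_lls)]
    pred = max(range(len(pmi)), key=pmi.__getitem__)
    return float(pred == gold_index)

# Choice 1 has a worse raw conditional score (-3.0 vs -2.0) but gains
# far more from the context relative to its prior, so it is selected.
print(acc_mutual_info([-2.0, -3.0], [-2.1, -6.0], 1))  # → 1.0
```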
Breaking Changes & Important Updates
- MMLU dataset migration: Switched to `cais/mmlu` dataset source by @baberabb in #2918
- Default parameter updates: Increased `max_gen_toks` to 2048 and `max_length` to 8192 for MMLU Pro tests by @dazipe in #2824
- Temperature defaults: Set default temperature to 0.0 for vLLM and SGLang backends by @baberabb in #2819
We extend our heartfelt thanks to all contributors who made this release possible, including 43 first-time contributors who brought fresh perspectives and valuable improvements to the evaluation harness.
What's Changed
- fix mmlu (generative) metric aggregation by @wangcho2k in #2761
- Bugfix by @baberabb in #2762
- fix verbosity typo by @baberabb in #2765
- docs: Fix typos in README.md by @ruivieira in #2778
- initialize tokenizer with `add_bos_token` by @baberabb in #2781
- improvement: Use yaml.CLoader to load yaml files when available. by @giuliolovisotto in #2777
- Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only by @perlitz in #2773
- Fix for mc2 calculation by @kdymkiewicz in #2768
- New healthcare benchmark: careqa by @PabloAgustin in #2714
- Capture gen_kwargs from CLI in squad_completion by @ksurya in #2727
- humaneval instruct by @baberabb in #2650
- Update evaluator.py by @zhuzeyuan in #2786
- change piqa dataset path (uses parquet rather than dataset script) by @baberabb in #2790
- use verify_certificate flag in batch requests by @daniel-salib in #2785
- add audio modality (qwen2 audio only) by @artemorloff in #2689
- Add various social bias tasks by @oskarvanderwal in #1185
- update pre-commit by @baberabb in #2799
- Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot by @Avelina9X in #2802
- Add INCLUDE tasks by @agromanou in #2769
- Add support for token-based auth for watsonx models by @kiersten-stokes in #2796
- add version by @baberabb in #2808
- Add cocoteros_va dataset by @sgs97ua in #2787
- Add MastermindEval by @whoisjones in #2788
- Add loncxt tasks by @baberabb in #2629
- [hf-multimodal] pass kwargs to self.processor by @baberabb in #2667
- [MM] Chartqa by @baberabb in #2544
- Allow writing config to wandb by @ksurya in #2736
- [change] group -> tag on afrimgsm, afrimmlu, afrixnli dataset by @jd730 in #2813
- Clean up README and pyproject.toml by @kiersten-stokes in #2814
- Llama3 mmlu correction by @anmarques in #2797
- Add Markdown linter by @kiersten-stokes in #2818
- Configure the pad tokens for Qwen when using vLLM by @zhangruoxu in #2810
- fix typo in humaneval by @baberabb in #2820
- default temp=0.0 for vllm and slang by @baberabb in #2819
- Fixes to mmlu_pro_llama by @anmarques in #2816
- Add MMLU-ProX task by @heli-qi in #2811
- Quick fix for mmlu_pro_llama by @anmarques in #2827
- Fix: tj-actions/changed-files is compromised by @Tautorn in #2828
- Multilingual MMLU for Llama instruct models by @anmarques in #2826
- bbh - changed dataset to parquet version by @baberabb in #2845
- Fix typo in longbench metrics by @djwackey in #2854
- Add kmmlu multiple-choice(accuracy) task #2848 by @Aprilistic in #2849
- Adding ACPBench task by @harshakokel in #2807
- add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench by @hadi-abdine in #2521
- Increase default max_gen_toks to 2048 and max_...
v0.4.8
lm-eval v0.4.8 Release Notes
Key Improvements
- New Backend Support:
- Added SGLang as a new evaluation backend! by @Monstertail
- Enabled model steering with vector support via `sparsify` or `sae_lens` by @luciaquirke and @AMindToThink
- Breaking Change: Python 3.8 support has been dropped as it reached end of life. Please upgrade to Python 3.9 or newer.
- Added support for `gen_prefix` in config, allowing you to append text after the <|assistant|> token (or at the end of non-chat prompts), particularly effective for evaluating instruct models
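As a sketch of how `gen_prefix` might slot into a task config: the task and dataset names below are hypothetical, and the surrounding keys show a typical task-config shape rather than a verbatim harness file.

```yaml
task: my_instruct_task          # hypothetical task name
dataset_path: org/some-dataset  # hypothetical dataset
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
# gen_prefix is appended after the <|assistant|> token
# (or at the end of the prompt for non-chat models):
gen_prefix: "The final answer is"
```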
New Benchmarks & Tasks
Code Evaluation
- HumanEval by @hjlee1371 in #1992
- MBPP by @hjlee1371 in #2247
- HumanEval+ and MBPP+ by @bzantium in #2734
Multilingual Expansion
- Global Coverage:
- Global MMLU (Lite version by @shivalika-singh in #2567, Full version by @bzantium in #2636)
- MLQA multilingual question answering by @KahnSvaer in #2622
- Asian Languages:
- European Languages:
- Middle Eastern Languages:
- Arabic MMLU by @bodasadallah in #2541
- AraDICE task by @firojalam in #2507
Ethics & Reasoning
- Moral Stories by @upunaprosk in #2653
- Histoires Morales by @upunaprosk in #2662
Others
- MMLU Pro Plus by @asgsaeid in #2366
- GroundCocoa by @HarshKohli in #2724
We extend our thanks to all contributors who made this release possible and to our users for your continued support and feedback.
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- drop python 3.8 support by @baberabb in #2575
- Add Global MMLU Lite by @shivalika-singh in #2567
- add warning for truncation by @baberabb in #2585
- Wandb step handling bugfix and feature by @sjmielke in #2580
- AraDICE task config file by @firojalam in #2507
- fix extra_match low if batch_size > 1 by @sywangyi in #2595
- fix model tests by @baberabb in #2604
- update scrolls by @baberabb in #2602
- some minor logging nits by @baberabb in #2609
- Fix gguf loading via Transformers by @CL-ModelCloud in #2596
- Fix Zeno visualizer on tasks like GSM8k by @pasky in #2599
- Fix the format of mgsm zh and ja. by @timturing in #2587
- Add HumanEval by @hjlee1371 in #1992
- Add MBPP by @hjlee1371 in #2247
- Add MLQA by @KahnSvaer in #2622
- assistant prefill by @baberabb in #2615
- fix gen_prefix by @baberabb in #2630
- update pre-commit by @baberabb in #2632
- add hrm8k benchmark for both Korean and English by @bzantium in #2627
- New arabicmmlu by @bodasadallah in #2541
- Add `global_mmlu` full version by @bzantium in #2636
- Update KorMedMCQA: ver 2.0 by @GyoukChu in #2540
- fix tmlu tmlu_taiwan_specific_tasks tag by @nike00811 in #2420
- fixed mmlu generative response extraction by @RawthiL in #2503
- revise mbpp prompt by @bzantium in #2645
- aggregate by group (total and categories) by @bzantium in #2643
- Fix max_tokens handling in vllm_vlms.py by @jkaniecki in #2637
- separate category for `global_mmlu` by @bzantium in #2652
- Add Moral Stories by @upunaprosk in #2653
- add TransformerLens example by @nickypro in #2651
- fix multiple input chat template by @baberabb in #2576
- Add Aggregation for Kobest Benchmark by @tryumanshow in #2446
- update pre-commit by @baberabb in #2660
- remove `group` from bigbench task configs by @baberabb in #2663
- Add Histoires Morales task by @upunaprosk in #2662
- MMLU Pro Plus by @asgsaeid in #2366
- fix early return for multiple dict in task process_results by @baberabb in #2673
- Turkish mmlu Config Update by @ArdaYueksel in #2678
- Fix typos by @omahs in #2679
- remove cuda device assertion by @baberabb in #2680
- Adding the Evalita-LLM benchmark by @m-resta in #2681
- Delete lm_eval/tasks/evalita_llm/single_prompt.zip by @baberabb in #2687
- Update unitxt task.py to bring in line with recent repo changes by @kiersten-stokes in #2684
- change ensure_ascii to False for JsonChatStr by @artemorloff in #2691
- Set defaults for BLiMP scores by @jmichaelov in #2692
- Update remaining references to `assistant_prefill` in docs to `gen_prefix` by @kiersten-stokes in #2683
- Update README.md by @upunaprosk in #2694
- fix `construct_requests` kwargs in python tasks by @baberabb in #2700
- arithmetic: set target delimiter to empty string by @baberabb in #2701
- fix vllm by @baberabb in #2708
- add math_verify to some tasks by @baberabb in #2686
- Logging by @lintangsutawika in #2203
- Replace missing `lighteval/MATH-Hard` dataset with `DigitalLearningGmbH/MATH-lighteval` by @f4str in #2719
- remove unused import by @baberabb in #2728
- README updates: Added IberoBench citation info in corresponding READMEs by @naiarapm in #2729
- add o3-mini support by @HelloJocelynLu in #2697
- add Basque translation of ARC and PAWS to BasqueBench by @naiarapm in #2732
- Add cocoteros_es task in spanish_bench by @sgs97ua in #2721
- Fix the import source for eval_logger by @kailashbuki in #2735
- add humaneval+ and mbpp+ by @bzantium in #2734
- Support SGLang as Potential Backend for Evaluation by @Monstertail in #2703
- fix log condition on main by @baberabb in #2737
- fix vllm data parallel by @baberabb in #2746
- [Readme change for SGLang] fix error in readme and add OOM solutions for sglang by @Monstertail in #2738
- Groundcocoa by @HarshKohli in #2724
- fix doc: generate_until only outputs the generated text! by @baberabb in #2755
- Enable steering HF models by @luciaquirke in #2749
- Add test for a simple Unitxt task by @kiersten-stokes in #2742
- add debug log by @baberabb in #2757
- increment version to 0.4.8 by @baberabb in #2760
New Contributors...
v0.4.7
lm-eval v0.4.7 Release Notes
This release includes several bug fixes, minor improvements to model handling, and task additions.
⚠️ Python 3.8 End of Support Notice
Python 3.8 support will be dropped in future releases as it has reached its end of life. Users are encouraged to upgrade to Python 3.9 or newer.
Backwards Incompatibilities
Chat Template Delimiter Handling (in v0.4.6)
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
New Benchmarks & Tasks
- Basque Integration: Added Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
- SCORE Tasks: Added new subtask for non-greedy robustness evaluation by @rimashahbazyan in #2558
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Score tasks by @rimashahbazyan in #2452
- Filters bugfix; add `metrics` and `filter` to logged sample by @baberabb in #2517
- skip casting if predict_only by @baberabb in #2524
- make utility function to handle `until` by @baberabb in #2518
- Update Unitxt task to use locally installed unitxt and not download Unitxt code from Huggingface by @yoavkatz in #2514
- add Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
- avoid timeout errors with high concurrency in api_model by @dtrawins in #2307
- Update README.md by @baberabb in #2534
- better doc_to_test testing by @baberabb in #2535
- Support pipeline parallel with OpenVINO models by @sstrehlk in #2349
- Super little tiny fix doc by @fzyzcjy in #2546
- [API] left truncate for generate_until by @baberabb in #2554
- Update Lightning import by @maanug-nv in #2549
- add optimum-intel ipex model by @yao-matrix in #2566
- add warning to readme by @baberabb in #2568
- Adding new subtask to SCORE tasks: non greedy robustness by @rimashahbazyan in #2558
- batch `loglikelihood_rolling` across requests by @baberabb in #2559
- fix `DeprecationWarning: invalid escape sequence '\s'` for whitespace filter by @baberabb in #2560
- increment version to 4.6.7 by @baberabb in #2574
New Contributors
- @rimashahbazyan made their first contribution in #2452
- @naiarapm made their first contribution in #2531
- @dtrawins made their first contribution in #2307
- @sstrehlk made their first contribution in #2349
- @fzyzcjy made their first contribution in #2546
- @maanug-nv made their first contribution in #2549
- @yao-matrix made their first contribution in #2566
Full Changelog: v0.4.6...v0.4.7
v0.4.6
lm-eval v0.4.6 Release Notes
This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.
Backwards Incompatibilities
Chat Template Delimiter Handling
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
New Benchmarks & Tasks
Multilingual Expansion
- Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
- Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439
New Task Collections
- Multimodal Unitxt: Added support for multimodal tasks available in Unitxt by @elronbandel in #2364
- Metabench: New benchmark contributed by @kozzy97 in #2357
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Add Unitxt Multimodality Support by @elronbandel in #2364
- Add new tasks to spanish_bench and fix duplicates by @zxcvuser in #2390
- fix typo bug for minerva_math by @renjie-ranger in #2404
- Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in #2393
- fix storycloze datanames by @t1101675 in #2409
- Update NoticIA prompt by @ikergarcia1996 in #2421
- [Fix] Replace generic exception classes with a more specific ones by @LSinev in #1989
- Support for IBM watsonx_llm by @Medokins in #2397
- Fix package extras for watsonx support by @kiersten-stokes in #2426
- Fix lora requests when dp with vllm by @ckgresla in #2433
- Add xquad task by @zxcvuser in #2435
- Add verify_certificate argument to local-completion by @sjmonson in #2440
- Add GPTQModel support for evaluating GPTQ models by @Qubitium in #2217
- Add missing task links by @Sypherd in #2449
- Update CODEOWNERS by @haileyschoelkopf in #2453
- Add real process_docs example by @Sypherd in #2456
- Modify label errors in catcola and paws-x by @zxcvuser in #2434
- Add Japanese Leaderboard by @sitfoxfly in #2439
- Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in #2459
- use global `multi_choice_filter` for mmlu_flan by @baberabb in #2461
- typo by @baberabb in #2465
- pass device_map other than auto for parallelize by @baberabb in #2457
- OpenAI ChatCompletions: switch `max_tokens` by @baberabb in #2443
- Ifeval: Download `punkt_tab` on rank 0 by @baberabb in #2267
- Fix chat template; fix leaderboard math by @baberabb in #2475
- change warning to debug by @baberabb in #2481
- Updated wandb logger to use `new_printer()` instead of `get_printer(...)` by @alex-titterton in #2484
- IBM watsonx_llm fixes & refactor by @Medokins in #2464
- Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in #2492
- update pre-commit hooks and git actions by @baberabb in #2497
- kbl-v0.1.1 by @whwang299 in #2493
- Add mamba hf to `mamba_ssm` by @baberabb in #2496
- remove duplicate `arc_ca` tag by @baberabb in #2499
- Add metabench task to LM Evaluation Harness by @kozzy97 in #2357
- Nits by @baberabb in #2500
- [API models] parse tokenizer_backend=None properly by @baberabb in #2509
New Contributors
- @renjie-ranger made their first contribution in #2404
- @t1101675 made their first contribution in #2409
- @Medokins made their first contribution in #2397
- @kiersten-stokes made their first contribution in #2426
- @ckgresla made their first contribution in #2433
- @sjmonson made their first contribution in #2440
- @Qubitium made their first contribution in #2217
- @Sypherd made their first contribution in #2449
- @sitfoxfly made their first contribution in #2439
- @RobGeada made their first contribution in #2459
- @alex-titterton made their first contribution in #2484
- @OyvindTafjord made their first contribution in #2492
- @whwang299 made their first contribution in #2493
- @kozzy97 made their first contribution in #2357
Full Changelog: v0.4.5...v0.4.6
v0.4.5
lm-eval v0.4.5 Release Notes
New Additions
Prototype Support for Vision Language Models (VLMs)
We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types hf-multimodal and vllm-vlm. This allows for evaluation of models that can process text and image inputs and produce text outputs. Currently we have added support for the MMMU (mmmu_val) task and we welcome contributions and feedback from the community!
New VLM-Specific Arguments
VLM models can be configured with several new arguments within --model_args to support their specific requirements:
- `max_images` (int): Set the maximum number of images for each prompt.
- `interleave` (bool): Determines the positioning of image inputs. When `True` (default), images are interleaved with the text. When `False`, all images are placed at the front of the text. This is model dependent.
hf-multimodal specific args:
- `image_token_id` (int) or `image_string` (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an `"<image>"` string to indicate the location of images in the input, while Qwen2-VL models expect an `"<|image_pad|>"` sentinel string instead. This will be inferred from model configuration files whenever possible, but we recommend confirming whether an override is needed when testing a new model family.
- `convert_img_format` (bool): Whether to convert the images to RGB format.
Example usage:
- `lm_eval --model hf-multimodal --model_args pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=<image> --tasks mmmu_val --apply_chat_template`
- `lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template`
Important considerations
- Chat Template: Most VLMs require the `--apply_chat_template` flag to ensure proper input formatting according to the model's expected chat template.
- Some VLM models are limited to processing a single image per prompt. For these models, always set `max_images=1`. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring `interleave=False`.
- Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.
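The `interleave` behavior above can be sketched with a small helper (a simplified, hypothetical illustration -- `place_image_tokens` is not harness code, and real placeholder handling is model-specific):

```python
def place_image_tokens(text_chunks, n_images, image_string="<image>", interleave=True):
    """Build a prompt from text chunks and image placeholders.

    interleave=True  -> one placeholder before each text chunk (up to n_images),
                        matching models that expect images woven into the text.
    interleave=False -> all placeholders up front, then the text, for models
                        that expect non-interleaved image inputs.
    """
    if interleave:
        parts = []
        for i, chunk in enumerate(text_chunks):
            if i < n_images:
                parts.append(image_string)
            parts.append(chunk)
        return "".join(parts)
    return image_string * n_images + "".join(text_chunks)
```

For a model like Qwen2-VL, the same helper would be called with `image_string="<|image_pad|>"`.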
Tested VLM Models
We have most notably tested the implementation with the following models:
- llava-hf/llava-1.5-7b-hf
- llava-hf/llava-v1.6-mistral-7b-hf
- Qwen/Qwen2-VL-2B-Instruct
- HuggingFaceM4/idefics2 (requires the latest `transformers` from source)
New Tasks
Several new tasks have been contributed to the library for this version!
New tasks as of v0.4.5 include:
- Open Arabic LLM Leaderboard tasks, contributed by @shahrzads @Malikeh97 in #2232
- MMMU (validation set), by @haileyschoelkopf @baberabb @lintangsutawika in #2243
- TurkishMMLU by @ArdaYueksel in #2283
- PortugueseBench, SpanishBench, GalicianBench, BasqueBench, and CatalanBench aggregate multilingual tasks in #2153 #2154 #2155 #2156 #2157 by @zxcvuser and others
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Backwards Incompatibilities
Finalizing group versus tag split
We've now fully deprecated the use of `group` keys directly within a task's configuration file. The appropriate key for these use cases is now `tag`. See the v0.4.4 patch notes for more info on migration, if you maintain a set of task YAMLs outside the Eval Harness repository.
Handling of Causal vs. Seq2seq backend in HFLM
In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, the vast majority of current LMs) models previously hinged on a check for `self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM`. Some users may want causal model behavior, but set `self.AUTO_MODEL_CLASS` to a different factory class, such as `transformers.AutoModelForVision2Seq`.
As a result, users who subclass HFLM but do not call `HFLM.__init__()` may now also need to set the `self.backend` attribute to either `"causal"` or `"seq2seq"` themselves during initialization.
While this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see #2353 for the full set of changes.
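The shape of the new dispatch can be sketched with a stand-in class (illustrative only -- `TinyLM` and `build_inputs` are hypothetical names, not the HFLM API; see #2353 for the actual change):

```python
# Stand-in illustrating the v0.4.5 change: input handling now branches on a
# self.backend attribute ("causal" vs "seq2seq") rather than on AUTO_MODEL_CLASS.
class TinyLM:
    def __init__(self, backend: str = "causal"):
        # Subclasses of HFLM that skip HFLM.__init__() must set this themselves.
        if backend not in ("causal", "seq2seq"):
            raise ValueError(f"unknown backend: {backend}")
        self.backend = backend

    def build_inputs(self, context: str, continuation: str):
        if self.backend == "causal":
            # Decoder-only: context and continuation form one concatenated sequence.
            return context + continuation
        # Encoder-decoder: context feeds the encoder, continuation the decoder.
        return (context, continuation)
```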
Future Plans
We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!
Thanks, the LM Eval Harness team (@baberabb @haileyschoelkopf @lintangsutawika)
What's Changed
- Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) by @Malikeh97 in #2232
- Multimodal prototyping by @lintangsutawika in #2243
- Update README.md by @SYusupov in #2297
- remove comma by @baberabb in #2315
- Update neuron backend by @dacorvo in #2314
- Fixed dummy model by @Am1n3e in #2339
- Add a note for missing dependencies by @eldarkurtic in #2336
- squad v2: load metric with `evaluate` by @baberabb in #2351
- fix writeout script by @baberabb in #2350
- Treat tags in python tasks the same as yaml tasks by @giuliolovisotto in #2288
- change group to tags in task `eus_exams` task configs by @baberabb in #2320
- change glianorex to test split by @baberabb in #2332
- mmlu-pro: add newlines to task descriptions (not leaderboard) by @baberabb in #2334
- Added TurkishMMLU to LM Evaluation Harness by @ArdaYueksel in #2283
- add mmlu readme by @baberabb in #2282
- openai: better error messages; fix greedy matching by @baberabb in #2327
- fix some bugs of mmlu by @eyuansu62 in #2299
- Add new benchmark: Portuguese bench by @zxcvuser in #2156
- Fix missing key in custom task loading. by @giuliolovisotto in #2304
- Add new benchmark: Spanish bench by @zxcvuser in #2157
- Add new benchmark: Galician bench by @zxcvuser in #2155
- Add new benchmark: Basque bench by @zxcvuser in #2153
- Add new benchmark: Catalan bench by @zxcvuser in #2154
- fix tests by @baberabb in #2380
- Hotfix! by @baberabb in #2383
- Solution for CSAT-QA tasks evaluation by @KyujinHan in #2385
- LingOly - Fixing scoring bugs for smaller models by @am-bean in #2376
- Fix float limit override by @cjluo-omniml in #2325
- [API] tokenizer: add trust-remote-code by @baberabb in #2372
- HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS` by @baberabb in #2353
- max_images are passed on to vllm's `limit_mm_per_prompt` by @baberabb in #2387
- Fix Llava-1.5-hf; Update to version 0.4.5 by @haileyschoelkopf in #2388
- Bump version to v0.4.5 by @haileyschoelkopf in #2389
New Contributors
- @Malikeh97 made their first contribution in #2232
- @SYusupov made their first contribution in #2297
- @dacorvo made their first contribution in #2314
- @eldarkurtic made their first contribution in #2336
- @giuliolovisotto made their first contribution in #2288
- @ArdaYueksel made their first contribution in #2283
- @zxcvuser made their first contribution in #2156
- @KyujinHan made their first contribution in #2385
- @cjluo-omniml made their first contribution in #2325
Full Changelog: https://github.com/Eleu...
v0.4.4
lm-eval v0.4.4 Release Notes
New Additions
- This release includes the Open LLM Leaderboard 2 official task implementations! These can be run by using `--tasks leaderboard`. Thank you to the HF team (@clefourrier, @NathanHB, @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here.
- API support is overhauled! Now with support for concurrent requests, chat templates, tokenization, batching, and improved customization. This makes API support more generalizable to new providers and should dramatically speed up API model inference.
  - The url can be specified by passing `base_url` to `--model_args`, for example, `base_url=http://localhost:8000/v1/completions`; concurrent requests are controlled with the `num_concurrent` argument; tokenization is controlled with `tokenized_requests`.
  - Other arguments (such as `top_p`, `top_k`, etc.) can be passed to the API using `--gen_kwargs` as usual.
  - Note: Instruct-tuned models, not just base models, can be used with `local-completions` using `--apply_chat_template` (either with or without `tokenized_requests`).
    - They can also be used with `local-chat-completions` (e.g. with an OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g. multiple-choice). This is because ChatCompletion-style APIs generally do not provide access to logits on prompt/input tokens, preventing easy measurement of multi-token continuations' log probabilities.
  - Example with the OpenAI completions API (using `vllm serve`):
    `lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu`
  - Example with a chat API:
    `lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k`
  - We recommend evaluating Llama-3.1-405B models by serving them with vllm, then running under `local-completions`!
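The loglikelihood point above can be made concrete: a completions-style API that echoes per-token logprobs lets a client sum the logprobs of just the continuation's tokens, which ChatCompletion-style APIs typically cannot do. A simplified sketch (the helper name and field layout are hypothetical; real providers differ):

```python
def continuation_logprob(token_logprobs, text_offsets, prompt_len):
    """Sum logprobs of tokens that start at or after the prompt/continuation
    boundary; prompt tokens (offset < prompt_len) are excluded. The first
    token's logprob is often None in completion APIs, hence the guard."""
    return sum(
        lp
        for lp, off in zip(token_logprobs, text_offsets)
        if off >= prompt_len and lp is not None
    )
```

For a multiple-choice task, each answer choice's score would be computed this way and the highest-scoring choice selected.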
- We've reworked the Task Grouping system to make it clearer when and when not to report an aggregated average score across multiple subtasks. See Backwards Incompatibilities below for more information on changes and migration instructions.
- A combination of data-parallel and model-parallel (using HF's `device_map` functionality for "naive" pipeline parallel) inference using `--model hf` is now supported, thank you to @NathanHB and team!
Other new additions include a number of miscellaneous bugfixes and much more. Thank you to all contributors who helped out on this release!
New Tasks
A number of new tasks have been contributed to the library.
As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.
New tasks as of v0.4.4 include:
- Open LLM Leaderboard 2 tasks--see above!
- Inverse Scaling tasks, contributed by @h-albert-lee in #1589
- Unitxt tasks reworked by @elronbandel in #1933
- MMLU-SR, contributed by @SkySuperCat in #2032
- IrokoBench, contributed by @JessicaOjo @IsraelAbebe in #2042
- MedConceptQA, contributed by @Ofir408 in #2010
- MMLU Pro, contributed by @ysjprojects in #1961
- GSM-Plus, contributed by @ysjprojects in #2103
- Lingoly, contributed by @am-bean in #2198
- GSM8k and Asdiv settings matching the Llama 3.1 evaluation settings, contributed by @Cameron7195 in #2215 #2236
- TMLU, contributed by @adamlin120 in #2093
- Mela, contributed by @Geralt-Targaryen in #1970
Backwards Incompatibilities
tags versus groups, and how to migrate
Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for "parent" tasks like mmlu to aggregate and report a unified score across a set of component "subtasks".
There were two ways to add a task to a given group name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:
```yaml
# this is a *task* yaml file.
group: group_name1
task: my_task1
# rest of task config goes here...
```

or 2) to define a "group config file" and specify a group along with its constituent subtasks:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
```

These would both have the same effect of reporting an averaged metric for `group_name1` when calling `lm_eval --tasks group_name1`. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.
We've now separated out these two use-cases ("shorthand" groupings and hierarchical subtask collections) into a tag and group property separately!
To register a shorthand (now called a tag), simply change the `group` field name within your task's config to `tag` (`group_alias` keys will no longer be supported in task configs):
```yaml
# this is a *task* yaml file.
tag: tag_name1
task: my_task1
# rest of task config goes here...
```

Group config files may remain as-is if aggregation is not desired. To opt in to reporting aggregated scores across a group's subtasks, add the following to your group config file:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
### New! Needed to turn on aggregation ###
aggregate_metric_list:
  - metric: acc  # placeholder. Note that all subtasks in this group must report an `acc` metric key
    weight_by_size: True  # whether to report *micro*- or *macro*-averaged scores across subtasks. Defaults to `True`.
```
Please see our documentation here for more information. We apologize for any headaches this migration may create--however, we believe separating out these two functionalities will make it less likely for users to encounter confusion or errors related to mistaken undesired aggregation.
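The `weight_by_size` option corresponds to micro- versus macro-averaging, which can be sketched as follows (a hypothetical helper for illustration, not the harness's internal aggregation code):

```python
def aggregate(scores, sizes, weight_by_size=True):
    """Aggregate per-subtask scores into one group score.

    weight_by_size=True  -> micro average: every document counts equally,
                            so larger subtasks dominate the group score.
    weight_by_size=False -> macro average: every subtask counts equally,
                            regardless of how many documents it has.
    """
    if weight_by_size:
        return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)
    return sum(scores) / len(scores)
```

For example, a small subtask scoring 0.5 and a subtask three times its size scoring 1.0 micro-average to 0.875 but macro-average to 0.75, which is why a group must opt in to one behavior explicitly.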
Future Plans
We're planning to make more planning documents public and standardize on (likely) 1 new PyPI release per month! Stay tuned.
Thanks, the LM Eval Harness team (@haileyschoelkopf @lintangsutawika @baberabb)
What's Changed
- fix wandb logger module import in example by @ToluClassics in #2041
- Fix strip whitespace filter by @NathanHB in #2048
- Gemma-2 also needs default `add_bos_token=True` by @haileyschoelkopf in #2049
- Update `trust_remote_code` for Hellaswag by @haileyschoelkopf in #2029
- Adds Open LLM Leaderboard Tasks by @NathanHB in #2047
- #1442 inverse scaling tasks implementation by @h-albert-lee in #1589
- Fix TypeError in samplers.py by converting int to str by @uni2237 in #2074
- Group agg rework by @lintangsutawika in #1741
- Fix printout tests (N/A expected for stderrs) by @haileyschoelkopf in #2080
- Easier unitxt tasks loading and removal of unitxt library dependency by @elronbandel in #1933
- Allow gating EvaluationTracker HF Hub results; customizability by @NathanHB in #2051
- Minor doc fix: leaderboard README.md missing mmlu-pro group and task by @pankajarm in #2075
- Revert missing utf-8 encoding for logged sample files (#2027) by @haileyschoelkopf in #2082
- Update utils.py by @lintangsutawika in #2085
- batch_size may be str if 'auto' is specified by @meg-huggingface in #2084
- Prettify lm_eval --tasks list by @anthony-dipofi in #1929
- Suppress noisy RougeScorer logs in `truthfulqa_gen` by @haileyschoelkopf in #2090
- Update default.yaml by @waneon in #2092
- Add new dataset MMLU-SR tasks by @SkySuperCat in #2032
- Irokobench: Benchmark Dataset for African languages by @JessicaOjo in #2042
- docs: remove trailing sentence from contribution doc by @nathan-weinberg in #2098
- Added MedConceptsQA Benchmark by @Ofir408 in #2010
- Also force BOS for `"recurrent_gemma"` and other Gemma model types by @haileyschoelkopf in #2105
- formatting by @lintangsutawika in #2104
- docs: align local test command to match CI by @nathan-weinberg in https://gith...