Releases: EleutherAI/lm-evaluation-harness
v0.4.11
v0.4.11 Release Notes
Minor release. Stay tuned for bigger changes next release.
New Platform Support
- Windows ML Backend — Native Windows ML inference support by @chapsiru and @chemwolf6922 in #3470, #3564, #3565
New Benchmarks & Tasks
Task Version Changes
The following tasks have updated versions. Results from previous task versions may not be directly comparable. See the linked PRs or individual task READMEs for changelogs.
- `afrobench_belebele` (all variants): 2 → 3 in #3551
- `evalita_llm`: 0.0 → 0.1 in #3551
- `include` (all 90 language variants): 0.0 → 0.1 in #3551
- `mgsm_direct` (all 11 language variants): 3.0 → 4.0 by @LakshyaChaudhry in #3574
Fixes & Improvements
- Fixed SQuAD v2 evaluation by @HydrogenSulfate in #3535
- Fixed MasakhaNEWS tasks — replaced non-existent `headline_text` field with `headline` by @Mr-Neutr0n in #3567
- Fixed incorrect task configs by @baberabb in #3552
- Replaced `eval()` with `ast.literal_eval` in task configs for safer parsing by @baberabb in #3577
- Fixed SGLang duplicate registration error by @enpimashin in #3543
- Restored `hf_transfer` import check by @baberabb in #3563
- Fixed `modify_gen_kwargs` call in vLLM VLMs by @hmellor in #3573
- Refactored vLLM `gen_kwargs` normalization inline to `modify_gen_kwargs`; fixed cached `gen_kwargs` mutation by @baberabb in #3582
- Fixed README for task-listing CLI command by @UltimateJupiter in #3545
- Updated dependencies by @baberabb in #3546
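On the `eval()` to `ast.literal_eval` swap: `literal_eval` only accepts Python literals (strings, numbers, lists, dicts, booleans, `None`), so config values can no longer execute arbitrary code. A minimal standalone sketch of the difference; the config snippet is illustrative, not the harness's actual schema:

```python
import ast

# A value as it might appear in a task config file (illustrative keys).
raw = "{'num_fewshot': 5, 'stop': ['\\n\\n']}"

# ast.literal_eval parses Python literals only; it cannot run code.
parsed = ast.literal_eval(raw)
assert parsed["num_fewshot"] == 5

# eval() would happily execute arbitrary expressions; literal_eval refuses.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    print("rejected non-literal input")
```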
New Contributors
- @HydrogenSulfate made their first contribution in #3535
- @UltimateJupiter made their first contribution in #3545
- @enpimashin made their first contribution in #3543
- @chapsiru made their first contribution in #3470
- @chemwolf6922 made their first contribution in #3565
- @plonerma made their first contribution in #3496
- @hmellor made their first contribution in #3573
- @Mr-Neutr0n made their first contribution in #3567
- @LakshyaChaudhry made their first contribution in #3574
Full Changelog: v0.4.10...v0.4.11
v0.4.10
Highlights
The big change this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.
Breaking Change: Lightweight Core with Optional Backends
pip install lm_eval no longer installs the HuggingFace/torch stack by default. (#3428)
The core package no longer includes backends. Install them explicitly:
pip install lm_eval # core only, no model backends
pip install lm_eval[hf] # HuggingFace backend (transformers, torch, accelerate)
pip install lm_eval[vllm] # vLLM backend
pip install lm_eval[api] # API backends (OpenAI, Anthropic, etc.)
Additional breaking change: accessing model classes via module attribute no longer works:
# This still works:
from lm_eval.models.huggingface import HFLM
# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
CLI Refactor
The CLI now uses explicit subcommands and supports YAML config files (#3440):
lm-eval run --model hf --tasks hellaswag # run evaluations
lm-eval run --config my_config.yaml # load args from YAML config
lm-eval ls tasks # list available tasks
lm-eval validate --tasks hellaswag,arc_easy # validate task configs
Backward compatible: omitting the run subcommand still works: lm-eval --model hf --tasks hellaswag
See lm-eval --help or the CLI documentation for details.
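A YAML config passed via `--config` would presumably mirror the CLI flags; the following is a hedged sketch, with key names assumed from the flag names rather than verified against the parser:

```yaml
# my_config.yaml: a sketch assuming keys mirror CLI flag names
model: hf
model_args: pretrained=EleutherAI/pythia-160m
tasks:
  - hellaswag
  - arc_easy
num_fewshot: 5
batch_size: 8
```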
Other Improvements
- Decoupled `ContextSampler` with new `build_qa_turn` helper (#3429)
- Normalized `gen_kwargs` with `truncation_side` support for vLLM (#3509)
New Benchmarks & Tasks
- PISA task by @HallerPatrick in #3412
- SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
- OpenAI Multilingual MMLU by @Helw150 in #3473
- ULQA benchmark by @keramjan in #3340
- IFEval in Spanish and Catalan by @juliafalcao in #3467
- TruthfulQA-VA for Catalan by @sgs97ua in #3469
- Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
- NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444
Model Support
- Ministral-3 adapter (`hf-mistral3`) by @medhakimbedhief in #3487
Fixes & Improvements
Task Fixes
- Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
- Fixed `gen_prefix` delimiter handling in multiple-choice tasks by @baberabb in #3508
- Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
- Fixed `a=0` as valid answer index in `build_qa_turn` by @ezylopx5 in #3488
- Fixed `fewshot_config` not being applied to fewshot docs by @baberabb in #3461
- Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
- Fixed `gsm8k_cot_llama` `target_delimiter` issue by @baberabb in #3526
- Updated LIBRA task utils by @bond005 in #3520
Backend Fixes
- Fixed vLLM off-by-one `max_length` error by @baberabb in #3503
- Resolved deprecated `vllm.transformers_utils.get_tokenizer` import by @DarkLight1337 in #3482
- Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
- Removed deprecated `AutoModelForVision2Seq` by @baberabb in #3522
- Fixed Anthropic chat model mapping by @lucafossen in #3453
- Fixed bug preventing `=` sign in checkpoint names by @mrinaldi97 in #3517
- Fixed `pretty_print_task` for external custom configs by @safikhanSoofiyani in #3436
- Fixed CLI regressions by @fxmarty-amd in #3449
New Contributors
- @safikhanSoofiyani made their first contribution in #3436
- @lucafossen made their first contribution in #3453
- @Ahmad21Omar made their first contribution in #3305
- @ezylopx5 made their first contribution in #3488
- @juliafalcao made their first contribution in #3467
- @medhakimbedhief made their first contribution in #3487
- @ntenenz made their first contribution in #3489
- @keramjan made their first contribution in #3340
- @bond005 made their first contribution in #3520
- @mrinaldi97 made their first contribution in #3517
- @wogns3623 made their first contribution in #3523
Full Changelog: v0.4.9.2...v0.4.10
lm-eval v0.4.9.2 Release Notes
This release continues our steady stream of community contributions with a batch of new benchmarks, expanded model support, and important fixes. A notable change: Python 3.10 is now the minimum required version.
New Benchmarks & Tasks
A big wave of new evaluation tasks this release:
- AIME and MATH500 math reasoning benchmarks by @jannalulu in #3248, #3311
- BabiLong and Longbench v2 for long-context evaluation by @jannalulu in #3287, #3338
- GraphWalks by @jannalulu in #3377
- ZhoBLiMP, BLiMP-NL, TurBLiMP, LM-SynEval, and BHS linguistic benchmarks by @jmichaelov in #3218, #3221, #3219, #3184, #3265
- Icelandic WinoGrande by @jmichaelov in #3277
- CLIcK Korean benchmark by @shing100 in #3173
- MMLU-Redux (generative) and Spanish translation by @luiscosio in #2705
- EsBBQ and CaBBQ bias benchmarks by @valleruizf in #3167
- EQBench in Spanish and Catalan by @priverabsc in #3168
- Anthropic discrim-eval by @Helw150 in #3091
- XNLI-VA by @FranValero97 in #3194
- Bangla MMLU (Titulm) by @Ismail-Hossain-1 in #3317
- HumanEval infilling by @its-alpesh in #3299
- CNN-DailyMail 3.0.0 by @preordinary in #3426
- Global PIQA and new `acc_norm_bytes` metric by @baberabb in #3368
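For context on the new `acc_norm_bytes` metric: byte-length normalization divides each choice's log-likelihood by the UTF-8 byte length of its continuation, making scores comparable across scripts where character counts and byte counts diverge. A minimal sketch of the idea; the function below is illustrative, not the harness's implementation:

```python
def acc_norm_bytes(loglikelihoods, continuations, gold_index):
    """Pick the choice with the highest byte-normalized log-likelihood."""
    scores = [
        ll / len(cont.encode("utf-8"))
        for ll, cont in zip(loglikelihoods, continuations)
    ]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return float(pred == gold_index)

# Longer continuations accumulate more negative log-likelihood;
# dividing by byte length removes that length bias, so the second
# choice wins here despite its worse raw score.
print(acc_norm_bytes([-5.0, -14.0], ["no", "definitely not"], 1))  # → 1.0
```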
Fixes & Improvements
Core Changes:
- Python 3.10 minimum by @jannalulu in #3337
- Unpinned `datasets` library by @baberabb in #3316
- BOS token handling: Delegate to tokenizer; `add_bos_token` now defaults to `None` by @baberabb in #3347
- Renamed `LOGLEVEL` env var to `LMEVAL_LOG_LEVEL` to avoid conflicts by @fxmarty-amd in #3418
- Resolve duplicate task names with safeguards by @giuliolovisotto in #3394
Task Fixes:
- Fixed MMLU-Redux to exclude samples without `error_type="ok"` and display summary table by @fxmarty-amd in #3410, #3406
- Fixed AIME answer extraction by @jannalulu in #3353
- Fixed LongBench evaluation and group handling by @TimurAysin, @jannalulu in #3273, #3359, #3361
- Fixed `crows_pairs` dataset by @jannalulu in #3378
- Fixed Gemma tokenizer `add_bos_token` not updating by @DarkLight1337 in #3206
- Fixed `lambada_multilingual_stablelm` by @jmichaelov, @HallerPatrick in #3294, #3222
- Fixed CodeXGLUE by @gsaltintas in #3238
- Pinned correct MMLUSR version by @christinaexyou in #3350
- Updated `minerva_math` by @baberabb in #3259
Backend Fixes:
- Fixed vLLM import errors when not installed by @fxmarty-amd in #3292
- Fixed vLLM `data_parallel_size>1` issue by @Dornavineeth in #3303
- Resolved deprecated `vllm.utils.get_open_port` by @DarkLight1337 in #3398
- Fixed GPT series model bugs by @zinccat in #3348
- Fixed PIL image hashing to use actual bytes by @tboerstad in #3331
- Fixed `additional_config` parsing by @brian-dellabetta in #3393
- Fixed batch chunking seed handling with groupby by @slimfrkha in #3047
- Fixed no-output error handling by @Oseltamivir in #3395
- Replaced deprecated `torch_dtype` with `dtype` by @AbdulmalikDS in #3415
- Fixed custom task config reading by @SkyR0ver in #3425
Model & Backend Support
- OpenAI GPT-5 support by @babyplutokurt in #3247
- Azure OpenAI support by @zinccat in #3349
- Fine-tuned Gemma3 evaluation support by @LearnerSXH in #3234
- OpenVINO text2text models by @nikita-savelyevv in #3101
- Intel XPU support for HFLM by @kaixuanliu in #3211
- Attention head steering support by @luciaquirke in #3279
- Leverage vLLM's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in #3185
What's Changed
- Remove `trust_remote_code: True` from updated datasets by @Avelina9X in #3213
- Add support for evaluating with fine-tuned Gemma3 by @LearnerSXH in #3234
- Fix `add_bos_token` not updated for Gemma tokenizer by @DarkLight1337 in #3206
- remove incomplete compilation instructions, solves #3233 by @ceferisbarov in #3242
- Update utils.py by @Anri-Lombard in #3246
- Adding support for OpenAI GPT-5 model by @babyplutokurt in #3247
- Add xnli_va dataset by @FranValero97 in #3194
- Add ZhoBLiMP benchmark by @jmichaelov in #3218
- Add BLiMP-NL by @jmichaelov in #3221
- Add TurBLiMP by @jmichaelov in #3219
- Add LM-SynEval Benchmark by @jmichaelov in #3184
- Fix unknown group key to tag in yaml config for `lambada_multilingual_stablelm` by @HallerPatrick in #3222
- update `minerva_math` by @baberabb in #3259
- feat: Add CLIcK task by @shing100 in #3173
- Adds Anthropic/discrim-eval to lm-evaluation-harness by @Helw150 in #3091
- Add support for OpenVINO text2text generation models by @nikita-savelyevv in #3101
- Update MMLU-ProX task by @weihao1115 in #3174
- Support for AIME dataset by @jannalulu in #3248
- feat(scrolls): delete chat_template from kwargs by @slimfrkha in #3267
- pacify pre-commit by @baberabb in #3268
- Fix codexglue by @gsaltintas in #3238
- Add BHS benchmark by @jmichaelov in #3265
- Add `acc_norm` metric to BLiMP-NL by @jmichaelov in #3272
- Add `acc_norm` metric to ZhoBLiMP by @jmichaelov in #3271
- Add EsBBQ and CaBBQ tasks by @valleruizf in #3167
- Add support for steering individual attention heads by @luciaquirke in #3279
- Add the Icelandic WinoGrande benchmark by @jmichaelov in #3277
- Ignore seed when splitting batch in chunks with groupby by @slimfrkha in #3047
- [fix][vllm] Avoid import errors in case vllm is not installed by @fxmarty-amd in #3292
- Fix LongBench Evaluation by @TimurAysin in #3273
- add intel xpu support for HFLM by @kaixuanliu in #3211
- feat: Add mmlu-redux and its Spanish translation as generative task definitions by @luiscosio in #2705
- Add BabiLong by @jannalulu in #3287
- Add AIME to task description by @jannalulu in #3296
- Add humaneval_infilling task by @its-alpesh in #3299
- Add eqbench tasks in Spanish and Catalan by @priverabsc in #3168
- [fix] add math and longbench to test dependencies by @jannalulu in #3321
- Fix: VLLM model when data_parallel_size>1 by @Dornavineeth in #3303
- unpin datasets; update pre-commit by @baberabb in #3316
- bump to python 3.10 by @jannalulu in #3337
- Longbench v2 by @jannalulu in #3338
- Leverage vllm's `tokenizer_info` endpoint to avoid manual duplication by @m-misiura in #3185
- Add support for Titulm Bangla MMLU dataset by @Ismail-Hossain-1 in #3317
- remove duplicate tags/groups by @baberabb in #3343
- Align `humaneval_64_instruct` task label in README to name in yaml file by @jmichaelov in #3344
- Fixes bugs when using gpt series model by @zinccat in #3348
- [fix] aime doesn't extract answers by @jannalulu in #3353
- add global_piqa; add acc_norm_bytes metric by @baberabb in #3368
- [fix] crows_pairs dataset by @jannalulu in #3378
- Fix issue 3355 assertion error by @marksverdhei in #3356
- fix(gsm8k): align README to yaml file by @neoheartbeats in #3388
- added azure openai support by @zinccat in #3349
- Delegate BOS to the tokenizer; `add_bos_token` defaults to `None` by @baberabb in #3347
- fix trust...
v0.4.9.1
lm-eval v0.4.9.1 Release Notes
This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon!
Enhanced Reasoning Model Handling
- Better support for reasoning models with a `think_end_token` argument to strip intermediate reasoning from outputs for the `hf`, `vllm`, and `sglang` model backends. A related `enable_thinking` argument was also added for specific models that support it (e.g., Qwen).
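Mechanically, stripping intermediate reasoning amounts to discarding everything up to and including the end-of-thinking marker. A hedged sketch of the idea, assuming `</think>` as the end token (the Qwen-style delimiter); this is not the harness's exact code path:

```python
def strip_reasoning(text: str, think_end_token: str = "</think>") -> str:
    """Keep only the text after the last end-of-thinking marker, if any."""
    marker = text.rfind(think_end_token)
    if marker == -1:
        return text  # no reasoning block: return output unchanged
    return text[marker + len(think_end_token):].lstrip()

out = "<think>2+2 is 4, so...</think>The answer is 4."
print(strip_reasoning(out))  # → The answer is 4.
```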
New Benchmarks & Tasks
- EgyMMLU and EgyHellaSwag by @houdaipha in #3063
- MultiBLiMP benchmark by @jmichaelov in #3155
- LIBRA benchmark for long-context evaluation by @karimovaSvetlana in #2943
- Multilingual Truthfulqa in Spanish, Basque and Galician by @BlancaCalvo in #3062
Fixes & Improvements
Tasks & Benchmarks:
- Aligned Humaneval results for Llama-3.1-70B-Instruct with official scores by @userljz, @baberabb, @idantene in #3201, #3092, #3102
- Fixed incorrect dataset paths for GLUE and medical benchmarks by @Avelina9X and @idantene. (#3159, #3151)
- Removed redundant "Let's think step by step" text from `bbh_cot_fewshot` prompts by @philipdoldo. (#3140)
- Increased `max_gen_toks` to 2048 for HRM8K math benchmarks by @shing100. (#3124)
Backend & Stability:
- Reduced CLI loading time from 2.2s to 0.05s by @stakodiak. (#3099)
- Fixed a process hang caused by mp.Pool in bootstrap_stderr and introduced `DISABLE_MULTIPROC` env var by @ankitgola005 and @neel04. (#3135, #3106)
- Added image hashing and `LMEVAL_HASHMM` env var by @artemorloff in #2973
- TaskManager: `include-path` precedence handling to prioritize custom dir over default by @parkhs21 in #3068
Housekeeping:
- Pinned `datasets < 4.0.0` temporarily to maintain compatibility with `trust_remote_code` by @baberabb. (#3172)
- Removed models from Neural Magic and other unneeded files by @baberabb. (#3112, #3113, #3108)
What's Changed
- llama3 task: update README.md by @annafontanaa in #3074
- Fix Anthropic API compatibility issues in chat completions by @NourFahmy in #3054
- Ensure backwards compatibility in `fewshot_context` by using kwargs by @kiersten-stokes in #3079
- [vllm] remove system message if `TemplateError` for chat_template by @baberabb in #3076
- feat / fix: Properly make use of `subfolder` from HF models by @younesbelkada in #3072
- [HF] fix quantization config by @baberabb in #3039
- FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct by @userljz in #3092
- Truthfulqa multi harness by @BlancaCalvo in #3062
- Fix: Reduce CLI loading time from 2.2s to 0.05s by @stakodiak in #3099
- Humaneval - fix regression by @baberabb in #3102
- Bugfix/hf tokenizer gguf override by @ankush13r in #3098
- [FIX] Initial code to disable multi-proc for stderr by @neel04 in #3106
- fix deps; update hooks by @baberabb in #3107
- delete unneeded files by @baberabb in #3108
- Fixed #3005: Processes both formats of model_args: string and dictionary by @DebjyotiRay in #3097
- add image hashing and `LMEVAL_HASHMM` env var by @artemorloff in #2973
- removal of Neural Magic models by @baberabb in #3112
- Neuralmagic by @baberabb in #3113
- check pil dep when hashing images by @baberabb in #3114
- warning for "chat" pretrained; disable buggy evalita configs by @baberabb in #3127
- fix: remove warning by @baberabb in #3128
- Adding EgyMMLU and EgyHellaSwag by @houdaipha in #3063
- Added mixed_precision_dtype argument to HFLM to enable autocasting by @Avelina9X in #3138
- Fix for hang due to mp.Pool in bootstrap_stderr by @ankitgola005 in #3135
- Fix errors when using vLLM with LoRA by @Jacky-MYQ in #3132
- truncate thinking tags in generations by @baberabb in #3145
- bbh_cot_fewshot: Removed repeated "Let's think step by step." text from bbh cot prompts by @philipdoldo in #3140
- Fix medical benchmarks import by @idantene in #3151
- fix request hanging when request api by @mmmans in #3090
- Custom request headers | trust_remote_code param fix by @RawthiL in #3069
- Bugfix: update path for GLUE by @Avelina9X in #3159
- Add the MultiBLiMP benchmark by @jmichaelov in #3155
- multiblimp - readme by @baberabb in #3162
- [tests] Added missing fixture in test_unitxt_tasks.py by @Avelina9X in #3163
- Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks by @shing100 in #3124
- feat: Add LIBRA benchmark for long-context evaluation by @karimovaSvetlana in #2943
- Added `chat_template_args` to vllm by @Avelina9X in #3164
- Pin datasets < 4.0.0 by @baberabb in #3172
- Remove "device" from vllm_causallms.py by @mgoin in #3176
- remove trust-remote-code in configs; fix escape sequences by @baberabb in #3180
- Fix vllm test issue that call pop() from None by @weireweire in #3182
- [hotfix] vllm: pop `device` from kwargs by @baberabb in #3181
- Update vLLM compatibility by @DarkLight1337 in #3024
- Fix `mmlu_continuation` subgroup names to fit Readme and other variants by @lamalunderscore in #3137
- Fix humaneval_instruct by @idantene in #3201
- Update README.md for mlqa by @newme616 in #3117
- improve include-path precedence handling by @parkhs21 in #3068
- Bump version to 0.4.9.1 by @baberabb in #3208
New Contributors
- @NourFahmy made their first contribution in #3054
- @userljz made their first contribution in #3092
- @BlancaCalvo made their first contribution in #3062
- @stakodiak made their first contribution in #3099
- @ankush13r made their first contribution in #3098
- @neel04 made their first contribution in https://...
v0.4.9
lm-eval v0.4.9 Release Notes
Key Improvements
- Enhanced Backend Support:
- SGLang Generate API by @baberabb in #2997
- vLLM enhancements: Added support for `enable_thinking` argument (#2947) and data parallel for V1 (#3011) by @anmarques and @baberabb
- Chat template improvements: Extended vLLM chat template support (#2902) and fixed HF chat template resolution (#2992) by @anmarques and @fxmarty-amd
- Multimodal Capabilities:
- Audio modality support for Qwen2 Audio models by @artemorloff in #2689
- Image processing improvements: Added resize images support (#2958) and enabled multimodal API usage (#2981) by @artemorloff and @baberabb
- ChartQA multimodal task implementation by @baberabb in #2544
- Performance & Reliability:
- Quantization support added via `quantization_config` by @jerryzh168 in #2842
- Memory optimization: Use `yaml.CLoader` for faster YAML loading by @giuliolovisotto in #2777
- Bug fixes: Resolved MMLU generative metric aggregation (#2761) and context length handling issues (#2972)
New Benchmarks & Tasks
Code Evaluation
- HumanEval Instruct - Instruction-following code generation benchmark by @baberabb in #2650
- MBPP Instruct - Instruction-based Python programming evaluation by @baberabb in #2995
Language Modeling
- C4 Dataset Support - Added perplexity evaluation on C4 web crawl dataset by @Zephyr271828 in #2889
Long Context Benchmarks
Mathematical & Reasoning
- GSM8K Platinum - Enhanced mathematical reasoning benchmark by @Qubitium in #2771
- MastermindEval - Logic reasoning evaluation by @whoisjones in #2788
- JSONSchemaBench - Structured output evaluation by @Saibo-creator in #2865
Llama Reference Implementations
- Llama Reference Implementations - Added task variants for Multilingual MMLU, MMLU CoT, GSM8K, and ARC Challenge based on Llama evaluation standards by @anmarques in #2797, #2826, #2829
Multilingual Expansion
Asian Languages:
- Korean MMLU (KMMLU) multiple-choice task by @Aprilistic in #2849
- MMLU-ProX extended evaluation by @heli-qi in #2811
- KBL 2025 Dataset - Updated Korean benchmark evaluation by @abzb1 in #3000
European Languages:
African Languages:
- AfroBench - Multi-African language evaluation by @JessicaOjo in #2825
- Darija tasks - Moroccan dialect benchmarks (DarijaMMLU, DarijaHellaSwag, Darija_Bench) by @hadi-abdine in #2521
Arabic Languages:
- Arab Culture task for cultural understanding by @bodasadallah in #3006
Domain-Specific Benchmarks
- CareQA - Healthcare evaluation benchmark by @PabloAgustin in #2714
- ACPBench & ACPBench Hard - Automated code generation evaluation by @harshakokel in #2807, #2980
- INCLUDE tasks - Inclusivity evaluation suite by @agromanou in #2769
- Cocoteros VA dataset by @sgs97ua in #2787
Social & Bias Evaluation
- Various social bias tasks for fairness assessment by @oskarvanderwal in #1185
Technical Enhancements
- Fine-grained evaluation: Added `--examples` argument for efficient multi-prompt evaluation by @felipemaiapolo and @mirianfsilva in #2520
- Improved tokenization: Better handling of `add_bos_token` initialization by @baberabb in #2781
- Memory management: Enhanced softmax computations with `softmax_dtype` argument for `HFLM` by @Avelina9X in #2921
Critical Bug Fixes
- Collating Queries Fix - Resolved error with different continuation lengths that was causing evaluation failures by @ameyagodbole in #2987
- Mutual Information Metric - Fixed acc_mutual_info calculation bug that affected metric accuracy by @baberabb in #3035
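For reference, mutual-information scoring compares each choice's log-probability given the question against its unconditional log-probability, selecting the choice the context boosts the most. A minimal sketch of the idea behind `acc_mutual_info`; this is illustrative, not the exact fixed code:

```python
def acc_mutual_info(cond_lls, uncond_lls, gold_index):
    """Select the choice whose conditional log-likelihood gains most over
    its unconditional log-likelihood (pointwise mutual information)."""
    pmi = [c - u for c, u in zip(cond_lls, uncond_lls)]
    pred = max(range(len(pmi)), key=pmi.__getitem__)
    return float(pred == gold_index)

# Choice 1 has a worse raw conditional score (-3.0 vs -2.0) but gains
# far more from the context relative to its prior, so it is selected.
print(acc_mutual_info([-2.0, -3.0], [-2.1, -6.0], 1))  # → 1.0
```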
Breaking Changes & Important Updates
- MMLU dataset migration: Switched to `cais/mmlu` dataset source by @baberabb in #2918
- Default parameter updates: Increased `max_gen_toks` to 2048 and `max_length` to 8192 for MMLU Pro tests by @dazipe in #2824
- Temperature defaults: Set default temperature to 0.0 for vLLM and SGLang backends by @baberabb in #2819
We extend our heartfelt thanks to all contributors who made this release possible, including 43 first-time contributors who brought fresh perspectives and valuable improvements to the evaluation harness.
What's Changed
- fix mmlu (generative) metric aggregation by @wangcho2k in #2761
- Bugfix by @baberabb in #2762
- fix verbosity typo by @baberabb in #2765
- docs: Fix typos in README.md by @ruivieira in #2778
- initialize tokenizer with `add_bos_token` by @baberabb in #2781
- improvement: Use yaml.CLoader to load yaml files when available. by @giuliolovisotto in #2777
- Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only by @perlitz in #2773
- Fix for mc2 calculation by @kdymkiewicz in #2768
- New healthcare benchmark: careqa by @PabloAgustin in #2714
- Capture gen_kwargs from CLI in squad_completion by @ksurya in #2727
- humaneval instruct by @baberabb in #2650
- Update evaluator.py by @zhuzeyuan in #2786
- change piqa dataset path (uses parquet rather than dataset script) by @baberabb in #2790
- use verify_certificate flag in batch requests by @daniel-salib in #2785
- add audio modality (qwen2 audio only) by @artemorloff in #2689
- Add various social bias tasks by @oskarvanderwal in #1185
- update pre-commit by @baberabb in #2799
- Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot by @Avelina9X in #2802
- Add INCLUDE tasks by @agromanou in #2769
- Add support for token-based auth for watsonx models by @kiersten-stokes in #2796
- add version by @baberabb in #2808
- Add cocoteros_va dataset by @sgs97ua in #2787
- Add MastermindEval by @whoisjones in #2788
- Add loncxt tasks by @baberabb in #2629
- [hf-multimodal] pass kwargs to self.processor by @baberabb in #2667
- [MM] Chartqa by @baberabb in #2544
- Allow writing config to wandb by @ksurya in #2736
- [change] group -> tag on afrimgsm, afrimmlu, afrixnli dataset by @jd730 in #2813
- Clean up README and pyproject.toml by @kiersten-stokes in #2814
- Llama3 mmlu correction by @anmarques in #2797
- Add Markdown linter by @kiersten-stokes in #2818
- Configure the pad tokens for Qwen when using vLLM by @zhangruoxu in #2810
- fix typo in humaneval by @baberabb in #2820
- default temp=0.0 for vllm and slang by @baberabb in #2819
- Fixes to mmlu_pro_llama by @anmarques in #2816
- Add MMLU-ProX task by @heli-qi in #2811
- Quick fix for mmlu_pro_llama by @anmarques in #2827
- Fix: tj-actions/changed-files is compromised by @Tautorn in #2828
- Multilingual MMLU for Llama instruct models by @anmarques in #2826
- bbh - changed dataset to parquet version by @baberabb in #2845
- Fix typo in longbench metrics by @djwackey in #2854
- Add kmmlu multiple-choice(accuracy) task #2848 by @Aprilistic in #2849
- Adding ACPBench task by @harshakokel in #2807
- add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench by @hadi-abdine in #2521
- Increase default max_gen_toks to 2048 and max_...
v0.4.8
lm-eval v0.4.8 Release Notes
Key Improvements
- New Backend Support:
- Added SGLang as a new evaluation backend! by @Monstertail
- Enabled model steering with vector support via `sparsify` or `sae_lens` by @luciaquirke and @AMindToThink
- Breaking Change: Python 3.8 support has been dropped as it reached end of life. Please upgrade to Python 3.9 or newer.
- Added support for `gen_prefix` in config, allowing you to append text after the <|assistant|> token (or at the end of non-chat prompts), particularly effective for evaluating instruct models
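As a sketch of how `gen_prefix` might slot into a task config: the task and dataset names below are hypothetical, and the surrounding keys show a typical task-config shape rather than a verbatim harness file.

```yaml
task: my_instruct_task          # hypothetical task name
dataset_path: org/some-dataset  # hypothetical dataset
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
# gen_prefix is appended after the <|assistant|> token
# (or at the end of the prompt for non-chat models):
gen_prefix: "The final answer is"
```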
New Benchmarks & Tasks
Code Evaluation
- HumanEval by @hjlee1371 in #1992
- MBPP by @hjlee1371 in #2247
- HumanEval+ and MBPP+ by @bzantium in #2734
Multilingual Expansion
- Global Coverage:
- Global MMLU (Lite version by @shivalika-singh in #2567, Full version by @bzantium in #2636)
- MLQA multilingual question answering by @KahnSvaer in #2622
- Asian Languages:
- European Languages:
- Middle Eastern Languages:
- Arabic MMLU by @bodasadallah in #2541
- AraDICE task by @firojalam in #2507
Ethics & Reasoning
- Moral Stories by @upunaprosk in #2653
- Histoires Morales by @upunaprosk in #2662
Others
- MMLU Pro Plus by @asgsaeid in #2366
- GroundCocoa by @HarshKohli in #2724
We extend our thanks to all contributors who made this release possible and to our users for your continued support and feedback.
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- drop python 3.8 support by @baberabb in #2575
- Add Global MMLU Lite by @shivalika-singh in #2567
- add warning for truncation by @baberabb in #2585
- Wandb step handling bugfix and feature by @sjmielke in #2580
- AraDICE task config file by @firojalam in #2507
- fix extra_match low if batch_size > 1 by @sywangyi in #2595
- fix model tests by @baberabb in #2604
- update scrolls by @baberabb in #2602
- some minor logging nits by @baberabb in #2609
- Fix gguf loading via Transformers by @CL-ModelCloud in #2596
- Fix Zeno visualizer on tasks like GSM8k by @pasky in #2599
- Fix the format of mgsm zh and ja. by @timturing in #2587
- Add HumanEval by @hjlee1371 in #1992
- Add MBPP by @hjlee1371 in #2247
- Add MLQA by @KahnSvaer in #2622
- assistant prefill by @baberabb in #2615
- fix gen_prefix by @baberabb in #2630
- update pre-commit by @baberabb in #2632
- add hrm8k benchmark for both Korean and English by @bzantium in #2627
- New arabicmmlu by @bodasadallah in #2541
- Add `global_mmlu` full version by @bzantium in #2636
- Update KorMedMCQA: ver 2.0 by @GyoukChu in #2540
- fix tmlu tmlu_taiwan_specific_tasks tag by @nike00811 in #2420
- fixed mmlu generative response extraction by @RawthiL in #2503
- revise mbpp prompt by @bzantium in #2645
- aggregate by group (total and categories) by @bzantium in #2643
- Fix max_tokens handling in vllm_vlms.py by @jkaniecki in #2637
- separate category for `global_mmlu` by @bzantium in #2652
- Add Moral Stories by @upunaprosk in #2653
- add TransformerLens example by @nickypro in #2651
- fix multiple input chat template by @baberabb in #2576
- Add Aggregation for Kobest Benchmark by @tryumanshow in #2446
- update pre-commit by @baberabb in #2660
- remove `group` from bigbench task configs by @baberabb in #2663
- Add Histoires Morales task by @upunaprosk in #2662
- MMLU Pro Plus by @asgsaeid in #2366
- fix early return for multiple dict in task process_results by @baberabb in #2673
- Turkish mmlu Config Update by @ArdaYueksel in #2678
- Fix typos by @omahs in #2679
- remove cuda device assertion by @baberabb in #2680
- Adding the Evalita-LLM benchmark by @m-resta in #2681
- Delete lm_eval/tasks/evalita_llm/single_prompt.zip by @baberabb in #2687
- Update unitxt task.py to bring in line with recent repo changes by @kiersten-stokes in #2684
- change ensure_ascii to False for JsonChatStr by @artemorloff in #2691
- Set defaults for BLiMP scores by @jmichaelov in #2692
- Update remaining references to `assistant_prefill` in docs to `gen_prefix` by @kiersten-stokes in #2683
- Update README.md by @upunaprosk in #2694
- fix `construct_requests` kwargs in python tasks by @baberabb in #2700
- arithmetic: set target delimiter to empty string by @baberabb in #2701
- fix vllm by @baberabb in #2708
- add math_verify to some tasks by @baberabb in #2686
- Logging by @lintangsutawika in #2203
- Replace missing `lighteval/MATH-Hard` dataset with `DigitalLearningGmbH/MATH-lighteval` by @f4str in #2719
- remove unused import by @baberabb in #2728
- README updates: Added IberoBench citation info in corresponding READMEs by @naiarapm in #2729
- add o3-mini support by @HelloJocelynLu in #2697
- add Basque translation of ARC and PAWS to BasqueBench by @naiarapm in #2732
- Add cocoteros_es task in spanish_bench by @sgs97ua in #2721
- Fix the import source for eval_logger by @kailashbuki in #2735
- add humaneval+ and mbpp+ by @bzantium in #2734
- Support SGLang as Potential Backend for Evaluation by @Monstertail in #2703
- fix log condition on main by @baberabb in #2737
- fix vllm data parallel by @baberabb in #2746
- [Readme change for SGLang] fix error in readme and add OOM solutions for sglang by @Monstertail in #2738
- Groundcocoa by @HarshKohli in #2724
- fix doc: generate_until only outputs the generated text! by @baberabb in #2755
- Enable steering HF models by @luciaquirke in #2749
- Add test for a simple Unitxt task by @kiersten-stokes in #2742
- add debug log by @baberabb in #2757
- increment version to 0.4.8 by @baberabb in #2760
New Contributors...
v0.4.7
lm-eval v0.4.7 Release Notes
This release includes several bug fixes, minor improvements to model handling, and task additions.
⚠️ Python 3.8 End of Support Notice
Python 3.8 support will be dropped in future releases as it has reached its end of life. Users are encouraged to upgrade to Python 3.9 or newer.
Backwards Incompatibilities
Chat Template Delimiter Handling (in v0.4.6)
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
New Benchmarks & Tasks
- Basque Integration: Added Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
- SCORE Tasks: Added new subtask for non-greedy robustness evaluation by @rimashahbazyan in #2558
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Score tasks by @rimashahbazyan in #2452
- Filters bugfix; add `metrics` and `filter` to logged sample by @baberabb in #2517
- skip casting if predict_only by @baberabb in #2524
- make utility function to handle `until` by @baberabb in #2518
- Update Unitxt task to use locally installed unitxt and not download Unitxt code from Huggingface by @yoavkatz in #2514
- add Basque translation of PIQA (piqa_eu) to BasqueBench by @naiarapm in #2531
- avoid timeout errors with high concurrency in api_model by @dtrawins in #2307
- Update README.md by @baberabb in #2534
- better doc_to_test testing by @baberabb in #2535
- Support pipeline parallel with OpenVINO models by @sstrehlk in #2349
- Super little tiny fix doc by @fzyzcjy in #2546
- [API] left truncate for generate_until by @baberabb in #2554
- Update Lightning import by @maanug-nv in #2549
- add optimum-intel ipex model by @yao-matrix in #2566
- add warning to readme by @baberabb in #2568
- Adding new subtask to SCORE tasks: non greedy robustness by @rimashahbazyan in #2558
- batch `loglikelihood_rolling` across requests by @baberabb in #2559
- fix `DeprecationWarning: invalid escape sequence '\s'` for whitespace filter by @baberabb in #2560
- increment version to 4.6.7 by @baberabb in #2574
New Contributors
- @rimashahbazyan made their first contribution in #2452
- @naiarapm made their first contribution in #2531
- @dtrawins made their first contribution in #2307
- @sstrehlk made their first contribution in #2349
- @fzyzcjy made their first contribution in #2546
- @maanug-nv made their first contribution in #2549
- @yao-matrix made their first contribution in #2566
Full Changelog: v0.4.6...v0.4.7
v0.4.6
lm-eval v0.4.6 Release Notes
This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.
Backwards Incompatibilities
Chat Template Delimiter Handling
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
New Benchmarks & Tasks
Multilingual Expansion
- Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
- Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439
New Task Collections
- Multimodal Unitxt: Added support for multimodal tasks available in Unitxt by @elronbandel in #2364
- Metabench: New benchmark contributed by @kozzy97 in #2357
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Add Unitxt Multimodality Support by @elronbandel in #2364
- Add new tasks to spanish_bench and fix duplicates by @zxcvuser in #2390
- fix typo bug for minerva_math by @renjie-ranger in #2404
- Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in #2393
- fix storycloze datanames by @t1101675 in #2409
- Update NoticIA prompt by @ikergarcia1996 in #2421
- [Fix] Replace generic exception classes with a more specific ones by @LSinev in #1989
- Support for IBM watsonx_llm by @Medokins in #2397
- Fix package extras for watsonx support by @kiersten-stokes in #2426
- Fix lora requests when dp with vllm by @ckgresla in #2433
- Add xquad task by @zxcvuser in #2435
- Add verify_certificate argument to local-completion by @sjmonson in #2440
- Add GPTQModel support for evaluating GPTQ models by @Qubitium in #2217
- Add missing task links by @Sypherd in #2449
- Update CODEOWNERS by @haileyschoelkopf in #2453
- Add real process_docs example by @Sypherd in #2456
- Modify label errors in catcola and paws-x by @zxcvuser in #2434
- Add Japanese Leaderboard by @sitfoxfly in #2439
- Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in #2459
- use global `multi_choice_filter` for mmlu_flan by @baberabb in #2461
- typo by @baberabb in #2465
- pass device_map other than auto for parallelize by @baberabb in #2457
- OpenAI ChatCompletions: switch `max_tokens` by @baberabb in #2443
- Ifeval: Download `punkt_tab` on rank 0 by @baberabb in #2267
- Fix chat template; fix leaderboard math by @baberabb in #2475
- change warning to debug by @baberabb in #2481
- Updated wandb logger to use `new_printer()` instead of `get_printer(...)` by @alex-titterton in #2484
- IBM watsonx_llm fixes & refactor by @Medokins in #2464
- Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in #2492
- update pre-commit hooks and git actions by @baberabb in #2497
- kbl-v0.1.1 by @whwang299 in #2493
- Add mamba hf to `mamba_ssm` by @baberabb in #2496
- remove duplicate `arc_ca` tag by @baberabb in #2499
- Add metabench task to LM Evaluation Harness by @kozzy97 in #2357
- Nits by @baberabb in #2500
- [API models] parse tokenizer_backend=None properly by @baberabb in #2509
New Contributors
- @renjie-ranger made their first contribution in #2404
- @t1101675 made their first contribution in #2409
- @Medokins made their first contribution in #2397
- @kiersten-stokes made their first contribution in #2426
- @ckgresla made their first contribution in #2433
- @sjmonson made their first contribution in #2440
- @Qubitium made their first contribution in #2217
- @Sypherd made their first contribution in #2449
- @sitfoxfly made their first contribution in #2439
- @RobGeada made their first contribution in #2459
- @alex-titterton made their first contribution in #2484
- @OyvindTafjord made their first contribution in #2492
- @whwang299 made their first contribution in #2493
- @kozzy97 made their first contribution in #2357
Full Changelog: v0.4.5...v0.4.6
v0.4.5
lm-eval v0.4.5 Release Notes
New Additions
Prototype Support for Vision Language Models (VLMs)
We're excited to introduce prototype support for Vision Language Models (VLMs) in this release, using model types hf-multimodal and vllm-vlm. This allows for evaluation of models that can process text and image inputs and produce text outputs. Currently we have added support for the MMMU (mmmu_val) task and we welcome contributions and feedback from the community!
New VLM-Specific Arguments
VLM models can be configured with several new arguments within --model_args to support their specific requirements:
- `max_images` (int): Set the maximum number of images for each prompt.
- `interleave` (bool): Determines the positioning of image inputs. When `True` (default), images are interleaved with the text. When `False`, all images are placed at the front of the text. This is model dependent.
hf-multimodal specific args:
- `image_token_id` (int) or `image_string` (str): Specifies a custom token or string for image placeholders. For example, Llava models expect an `"<image>"` string to indicate the location of images in the input, while Qwen2-VL models expect an `"<|image_pad|>"` sentinel string instead. This will be inferred from model configuration files whenever possible, but we recommend confirming whether an override is needed when testing a new model family.
- `convert_img_format` (bool): Whether to convert the images to RGB format.
Example usage:
- `lm_eval --model hf-multimodal --model_args pretrained=llava-hf/llava-1.5-7b-hf,attn_implementation=flash_attention_2,max_images=1,interleave=True,image_string=<image> --tasks mmmu_val --apply_chat_template`
- `lm_eval --model vllm-vlm --model_args pretrained=llava-hf/llava-1.5-7b-hf,max_images=1,interleave=True --tasks mmmu_val --apply_chat_template`
Important considerations
- Chat Template: Most VLMs require the `--apply_chat_template` flag to ensure proper input formatting according to the model's expected chat template.
- Some VLM models are limited to processing a single image per prompt. For these models, always set `max_images=1`. Additionally, certain models expect image placeholders to be non-interleaved with the text, requiring `interleave=False`.
- Performance and Compatibility: When working with VLMs, be mindful of potential memory constraints and processing times, especially when handling multiple images or complex tasks.
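The `interleave` behavior above can be sketched with a small helper (a simplified, hypothetical illustration -- `place_image_tokens` is not harness code, and real placeholder handling is model-specific):

```python
def place_image_tokens(text_chunks, n_images, image_string="<image>", interleave=True):
    """Build a prompt from text chunks and image placeholders.

    interleave=True  -> one placeholder before each text chunk (up to n_images),
                        matching models that expect images woven into the text.
    interleave=False -> all placeholders up front, then the text, for models
                        that expect non-interleaved image inputs.
    """
    if interleave:
        parts = []
        for i, chunk in enumerate(text_chunks):
            if i < n_images:
                parts.append(image_string)
            parts.append(chunk)
        return "".join(parts)
    return image_string * n_images + "".join(text_chunks)
```

For a model like Qwen2-VL, the same helper would be called with `image_string="<|image_pad|>"`.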
Tested VLM Models
We have most notably tested the implementation with the following models:
- llava-hf/llava-1.5-7b-hf
- llava-hf/llava-v1.6-mistral-7b-hf
- Qwen/Qwen2-VL-2B-Instruct
- HuggingFaceM4/idefics2 (requires the latest `transformers` from source)
New Tasks
Several new tasks have been contributed to the library for this version!
New tasks as of v0.4.5 include:
- Open Arabic LLM Leaderboard tasks, contributed by @shahrzads @Malikeh97 in #2232
- MMMU (validation set), by @haileyschoelkopf @baberabb @lintangsutawika in #2243
- TurkishMMLU by @ArdaYueksel in #2283
- PortugueseBench, SpanishBench, GalicianBench, BasqueBench, and CatalanBench aggregate multilingual tasks in #2153 #2154 #2155 #2156 #2157 by @zxcvuser and others
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Backwards Incompatibilities
Finalizing group versus tag split
We've now fully deprecated the use of `group` keys directly within a task's configuration file. The appropriate key for these use cases is now `tag`. See the v0.4.4 patch notes for more info on migration, if you maintain a set of task YAMLs outside the Eval Harness repository.
Handling of Causal vs. Seq2seq backend in HFLM
In HFLM, logic specific to handling inputs for Seq2seq (encoder-decoder models like T5) versus Causal (decoder-only autoregressive models, the vast majority of current LMs) models previously hinged on a check for `self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM`. Some users may want causal model behavior, but set `self.AUTO_MODEL_CLASS` to a different factory class, such as `transformers.AutoModelForVision2Seq`.
As a result, users who subclass HFLM but do not call `HFLM.__init__()` may now also need to set the `self.backend` attribute to either `"causal"` or `"seq2seq"` themselves during initialization.
While this should not affect a large majority of users, for those who subclass HFLM in potentially advanced ways, see #2353 for the full set of changes.
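The shape of the new dispatch can be sketched with a stand-in class (illustrative only -- `TinyLM` and `build_inputs` are hypothetical names, not the HFLM API; see #2353 for the actual change):

```python
# Stand-in illustrating the v0.4.5 change: input handling now branches on a
# self.backend attribute ("causal" vs "seq2seq") rather than on AUTO_MODEL_CLASS.
class TinyLM:
    def __init__(self, backend: str = "causal"):
        # Subclasses of HFLM that skip HFLM.__init__() must set this themselves.
        if backend not in ("causal", "seq2seq"):
            raise ValueError(f"unknown backend: {backend}")
        self.backend = backend

    def build_inputs(self, context: str, continuation: str):
        if self.backend == "causal":
            # Decoder-only: context and continuation form one concatenated sequence.
            return context + continuation
        # Encoder-decoder: context feeds the encoder, continuation the decoder.
        return (context, continuation)
```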
Future Plans
We intend to further expand our multimodal support to a wider set of vision-language tasks, as well as a broader set of model types, and are actively seeking user feedback!
Thanks, the LM Eval Harness team (@baberabb @haileyschoelkopf @lintangsutawika)
What's Changed
- Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) by @Malikeh97 in #2232
- Multimodal prototyping by @lintangsutawika in #2243
- Update README.md by @SYusupov in #2297
- remove comma by @baberabb in #2315
- Update neuron backend by @dacorvo in #2314
- Fixed dummy model by @Am1n3e in #2339
- Add a note for missing dependencies by @eldarkurtic in #2336
- squad v2: load metric with `evaluate` by @baberabb in #2351
- fix writeout script by @baberabb in #2350
- Treat tags in python tasks the same as yaml tasks by @giuliolovisotto in #2288
- change group to tags in task `eus_exams` task configs by @baberabb in #2320
- change glianorex to test split by @baberabb in #2332
- mmlu-pro: add newlines to task descriptions (not leaderboard) by @baberabb in #2334
- Added TurkishMMLU to LM Evaluation Harness by @ArdaYueksel in #2283
- add mmlu readme by @baberabb in #2282
- openai: better error messages; fix greedy matching by @baberabb in #2327
- fix some bugs of mmlu by @eyuansu62 in #2299
- Add new benchmark: Portuguese bench by @zxcvuser in #2156
- Fix missing key in custom task loading. by @giuliolovisotto in #2304
- Add new benchmark: Spanish bench by @zxcvuser in #2157
- Add new benchmark: Galician bench by @zxcvuser in #2155
- Add new benchmark: Basque bench by @zxcvuser in #2153
- Add new benchmark: Catalan bench by @zxcvuser in #2154
- fix tests by @baberabb in #2380
- Hotfix! by @baberabb in #2383
- Solution for CSAT-QA tasks evaluation by @KyujinHan in #2385
- LingOly - Fixing scoring bugs for smaller models by @am-bean in #2376
- Fix float limit override by @cjluo-omniml in #2325
- [API] tokenizer: add trust-remote-code by @baberabb in #2372
- HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS` by @baberabb in #2353
- max_images are passed on to vllm's `limit_mm_per_prompt` by @baberabb in #2387
- Fix Llava-1.5-hf; Update to version 0.4.5 by @haileyschoelkopf in #2388
- Bump version to v0.4.5 by @haileyschoelkopf in #2389
New Contributors
- @Malikeh97 made their first contribution in #2232
- @SYusupov made their first contribution in #2297
- @dacorvo made their first contribution in #2314
- @eldarkurtic made their first contribution in #2336
- @giuliolovisotto made their first contribution in #2288
- @ArdaYueksel made their first contribution in #2283
- @zxcvuser made their first contribution in #2156
- @KyujinHan made their first contribution in #2385
- @cjluo-omniml made their first contribution in #2325
Full Changelog: https://github.com/Eleu...
v0.4.4
lm-eval v0.4.4 Release Notes
New Additions
- This release includes the Open LLM Leaderboard 2 official task implementations! These can be run by using `--tasks leaderboard`. Thank you to the HF team (@clefourrier, @NathanHB, @KonradSzafer, @lozovskaya) for contributing these -- you can read more about their Open LLM Leaderboard 2 release here.
- API support is overhauled! Now with support for concurrent requests, chat templates, tokenization, batching, and improved customization. This makes API support more generalizable to new providers and should dramatically speed up API model inference.
  - The url can be specified by passing `base_url` to `--model_args`, for example, `base_url=http://localhost:8000/v1/completions`; concurrent requests are controlled with the `num_concurrent` argument; tokenization is controlled with `tokenized_requests`.
  - Other arguments (such as `top_p`, `top_k`, etc.) can be passed to the API using `--gen_kwargs` as usual.
  - Note: Instruct-tuned models, not just base models, can be used with `local-completions` using `--apply_chat_template` (either with or without `tokenized_requests`).
    - They can also be used with `local-chat-completions` (e.g. with an OpenAI Chat API endpoint), but only the former supports loglikelihood tasks (e.g. multiple-choice). This is because ChatCompletion-style APIs generally do not provide access to logits on prompt/input tokens, preventing easy measurement of multi-token continuations' log probabilities.
  - Example with the OpenAI completions API (using `vllm serve`):
    `lm_eval --model local-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10,tokenized_requests=True,tokenizer_backend=huggingface,max_length=4096 --apply_chat_template --batch_size 1 --tasks mmlu`
  - Example with a chat API:
    `lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,num_concurrent=10 --apply_chat_template --tasks gsm8k`
  - We recommend evaluating Llama-3.1-405B models by serving them with vllm, then running under `local-completions`!
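The loglikelihood point above can be made concrete: a completions-style API that echoes per-token logprobs lets a client sum the logprobs of just the continuation's tokens, which ChatCompletion-style APIs typically cannot do. A simplified sketch (the helper name and field layout are hypothetical; real providers differ):

```python
def continuation_logprob(token_logprobs, text_offsets, prompt_len):
    """Sum logprobs of tokens that start at or after the prompt/continuation
    boundary; prompt tokens (offset < prompt_len) are excluded. The first
    token's logprob is often None in completion APIs, hence the guard."""
    return sum(
        lp
        for lp, off in zip(token_logprobs, text_offsets)
        if off >= prompt_len and lp is not None
    )
```

For a multiple-choice task, each answer choice's score would be computed this way and the highest-scoring choice selected.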
- We've reworked the Task Grouping system to make it clearer when and when not to report an aggregated average score across multiple subtasks. See Backwards Incompatibilities below for more information on changes and migration instructions.
- A combination of data-parallel and model-parallel (using HF's `device_map` functionality for "naive" pipeline parallel) inference using `--model hf` is now supported, thank you to @NathanHB and team!
Other new additions include a number of miscellaneous bugfixes and much more. Thank you to all contributors who helped out on this release!
New Tasks
A number of new tasks have been contributed to the library.
As a further discoverability improvement, `lm_eval --tasks list` now shows all tasks, tags, and groups in a prettier format, along with (if applicable) where to find the associated config file for a task or group! Thank you to @anthony-dipofi for working on this.
New tasks as of v0.4.4 include:
- Open LLM Leaderboard 2 tasks--see above!
- Inverse Scaling tasks, contributed by @h-albert-lee in #1589
- Unitxt tasks reworked by @elronbandel in #1933
- MMLU-SR, contributed by @SkySuperCat in #2032
- IrokoBench, contributed by @JessicaOjo @IsraelAbebe in #2042
- MedConceptQA, contributed by @Ofir408 in #2010
- MMLU Pro, contributed by @ysjprojects in #1961
- GSM-Plus, contributed by @ysjprojects in #2103
- Lingoly, contributed by @am-bean in #2198
- GSM8k and Asdiv settings matching the Llama 3.1 evaluation settings, contributed by @Cameron7195 in #2215 #2236
- TMLU, contributed by @adamlin120 in #2093
- Mela, contributed by @Geralt-Targaryen in #1970
Backwards Incompatibilities
tags versus groups, and how to migrate
Previously, we supported the ability to group a set of tasks together, generally for two purposes: 1) to have an easy-to-call shortcut for a set of tasks one might want to frequently run simultaneously, and 2) to allow for "parent" tasks like mmlu to aggregate and report a unified score across a set of component "subtasks".
There were two ways to add a task to a given group name: 1) to provide (a list of) values to the `group` field in a given subtask's config file:
```yaml
# this is a *task* yaml file.
group: group_name1
task: my_task1
# rest of task config goes here...
```

or 2) to define a "group config file" and specify a group along with its constituent subtasks:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
```

These would both have the same effect of reporting an averaged metric for `group_name1` when calling `lm_eval --tasks group_name1`. However, in use-case 1) (simply registering a shorthand for a list of tasks one is interested in), reporting an aggregate score can be undesirable or ill-defined.
We've now separated out these two use-cases ("shorthand" groupings and hierarchical subtask collections) into a tag and group property separately!
To register a shorthand (now called a tag), simply change the `group` field name within your task's config to `tag` (`group_alias` keys will no longer be supported in task configs):
```yaml
# this is a *task* yaml file.
tag: tag_name1
task: my_task1
# rest of task config goes here...
```

Group config files may remain as-is if aggregation is not desired. To opt in to reporting aggregated scores across a group's subtasks, add the following to your group config file:
```yaml
# this is a group's yaml file
group: group_name1
task:
  - subtask_name1
  - subtask_name2
# ...
### New! Needed to turn on aggregation ###
aggregate_metric_list:
  - metric: acc  # placeholder. Note that all subtasks in this group must report an `acc` metric key
    weight_by_size: True  # whether to report *micro*- or *macro*-averaged scores across subtasks. Defaults to `True`.
```
Please see our documentation here for more information. We apologize for any headaches this migration may create--however, we believe separating out these two functionalities will make it less likely for users to encounter confusion or errors related to mistaken undesired aggregation.
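The `weight_by_size` option corresponds to micro- versus macro-averaging, which can be sketched as follows (a hypothetical helper for illustration, not the harness's internal aggregation code):

```python
def aggregate(scores, sizes, weight_by_size=True):
    """Aggregate per-subtask scores into one group score.

    weight_by_size=True  -> micro average: every document counts equally,
                            so larger subtasks dominate the group score.
    weight_by_size=False -> macro average: every subtask counts equally,
                            regardless of how many documents it has.
    """
    if weight_by_size:
        return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)
    return sum(scores) / len(scores)
```

For example, a small subtask scoring 0.5 and a subtask three times its size scoring 1.0 micro-average to 0.875 but macro-average to 0.75, which is why a group must opt in to one behavior explicitly.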
Future Plans
We're planning to make more planning documents public and standardize on (likely) 1 new PyPI release per month! Stay tuned.
Thanks, the LM Eval Harness team (@haileyschoelkopf @lintangsutawika @baberabb)
What's Changed
- fix wandb logger module import in example by @ToluClassics in #2041
- Fix strip whitespace filter by @NathanHB in #2048
- Gemma-2 also needs default `add_bos_token=True` by @haileyschoelkopf in #2049
- Update `trust_remote_code` for Hellaswag by @haileyschoelkopf in #2029
- Adds Open LLM Leaderboard Tasks by @NathanHB in #2047
- #1442 inverse scaling tasks implementation by @h-albert-lee in #1589
- Fix TypeError in samplers.py by converting int to str by @uni2237 in #2074
- Group agg rework by @lintangsutawika in #1741
- Fix printout tests (N/A expected for stderrs) by @haileyschoelkopf in #2080
- Easier unitxt tasks loading and removal of unitxt library dependency by @elronbandel in #1933
- Allow gating EvaluationTracker HF Hub results; customizability by @NathanHB in #2051
- Minor doc fix: leaderboard README.md missing mmlu-pro group and task by @pankajarm in #2075
- Revert missing utf-8 encoding for logged sample files (#2027) by @haileyschoelkopf in #2082
- Update utils.py by @lintangsutawika in #2085
- batch_size may be str if 'auto' is specified by @meg-huggingface in #2084
- Prettify lm_eval --tasks list by @anthony-dipofi in #1929
- Suppress noisy RougeScorer logs in `truthfulqa_gen` by @haileyschoelkopf in #2090
- Update default.yaml by @waneon in #2092
- Add new dataset MMLU-SR tasks by @SkySuperCat in #2032
- Irokobench: Benchmark Dataset for African languages by @JessicaOjo in #2042
- docs: remove trailing sentence from contribution doc by @nathan-weinberg in #2098
- Added MedConceptsQA Benchmark by @Ofir408 in #2010
- Also force BOS for `"recurrent_gemma"` and other Gemma model types by @haileyschoelkopf in #2105
- formatting by @lintangsutawika in #2104
- docs: align local test command to match CI by @nathan-weinberg in https://gith...