lm-eval v0.4.6 Release Notes

This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.

Backwards Incompatibilities

Chat Template Delimiter Handling

An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.

📝 For detailed documentation, please refer to docs/chat-template-readme.md

New Benchmarks & Tasks

Multilingual Expansion

Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439

New Task Collections

Multimodal Unitext: Added support for multimodal tasks available in unitext by @elronbandel in #2364
Metabench: New benchmark contributed by @kozzy97 in #2357

As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).

Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)

What's Changed

Add Unitxt Multimodality Support by @elronbandel in #2364
Add new tasks to spanish_bench and fix duplicates by @zxcvuser in #2390
fix typo bug for minerva_math by @renjie-ranger in #2404
Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in #2393
fix storycloze datanames by @t1101675 in #2409
Update NoticIA prompt by @ikergarcia1996 in #2421
[Fix] Replace generic exception classes with a more specific ones by @LSinev in #1989
Support for IBM watsonx_llm by @Medokins in #2397
Fix package extras for watsonx support by @kiersten-stokes in #2426
Fix lora requests when dp with vllm by @ckgresla in #2433
Add xquad task by @zxcvuser in #2435
Add verify_certificate argument to local-completion by @sjmonson in #2440
Add GPTQModel support for evaluating GPTQ models by @Qubitium in #2217
Add missing task links by @Sypherd in #2449
Update CODEOWNERS by @haileyschoelkopf in #2453
Add real process_docs example by @Sypherd in #2456
Modify label errors in catcola and paws-x by @zxcvuser in #2434
Add Japanese Leaderboard by @sitfoxfly in #2439
Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in #2459
use global multi_choice_filter for mmlu_flan by @baberabb in #2461
typo by @baberabb in #2465
pass device_map other than auto for parallelize by @baberabb in #2457
OpenAI ChatCompletions: switch max_tokens by @baberabb in #2443
Ifeval: Dowload punkt_tab on rank 0 by @baberabb in #2267
Fix chat template; fix leaderboard math by @baberabb in #2475
change warning to debug by @baberabb in #2481
Updated wandb logger to use new_printer() instead of get_printer(...) by @alex-titterton in #2484
IBM watsonx_llm fixes & refactor by @Medokins in #2464
Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in #2492
update pre-commit hooks and git actions by @baberabb in #2497
kbl-v0.1.1 by @whwang299 in #2493
Add mamba hf to mamba_ssm by @baberabb in #2496
remove duplicate arc_ca tag by @baberabb in #2499
Add metabench task to LM Evaluation Harness by @kozzy97 in #2357
Nits by @baberabb in #2500
[API models] parse tokenizer_backend=None properly by @baberabb in #2509

New Contributors

@renjie-ranger made their first contribution in #2404
@t1101675 made their first contribution in #2409
@Medokins made their first contribution in #2397
@kiersten-stokes made their first contribution in #2426
@ckgresla made their first contribution in #2433
@sjmonson made their first contribution in #2440
@Qubitium made their first contribution in #2217
@Sypherd made their first contribution in #2449
@sitfoxfly made their first contribution in #2439
@RobGeada made their first contribution in #2459
@alex-titterton made their first contribution in #2484
@OyvindTafjord made their first contribution in #2492
@whwang299 made their first contribution in #2493
@kozzy97 made their first contribution in #2357

Full Changelog: v0.4.5...v0.4.6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.4.6