lm-eval v0.4.6 Release Notes
This release brings important changes to chat template handling, expands our task library with new multilingual and multimodal benchmarks, and includes various bug fixes.
Backwards Incompatibilities
Chat Template Delimiter Handling
An important modification has been made to how delimiters are handled when applying chat templates in request construction, particularly affecting multiple-choice tasks. This change ensures better compatibility with chat models by respecting their native formatting conventions.
📝 For detailed documentation, please refer to docs/chat-template-readme.md
New Benchmarks & Tasks
Multilingual Expansion
- Spanish Bench: Enhanced benchmark with additional tasks by @zxcvuser in #2390
- Japanese Leaderboard: New comprehensive Japanese language benchmark by @sitfoxfly in #2439
New Task Collections
- Multimodal Unitext: Added support for multimodal tasks available in unitext by @elronbandel in #2364
- Metabench: New benchmark contributed by @kozzy97 in #2357
As well as several slight fixes or changes to existing tasks (as noted via the incrementing of versions).
Thanks, the LM Eval Harness team (@baberabb and @lintangsutawika)
What's Changed
- Add Unitxt Multimodality Support by @elronbandel in #2364
- Add new tasks to spanish_bench and fix duplicates by @zxcvuser in #2390
- fix typo bug for minerva_math by @renjie-ranger in #2404
- Fix: Turkish MMLU Regex Pattern by @ArdaYueksel in #2393
- fix storycloze datanames by @t1101675 in #2409
- Update NoticIA prompt by @ikergarcia1996 in #2421
- [Fix] Replace generic exception classes with a more specific ones by @LSinev in #1989
- Support for IBM watsonx_llm by @Medokins in #2397
- Fix package extras for watsonx support by @kiersten-stokes in #2426
- Fix lora requests when dp with vllm by @ckgresla in #2433
- Add xquad task by @zxcvuser in #2435
- Add verify_certificate argument to local-completion by @sjmonson in #2440
- Add GPTQModel support for evaluating GPTQ models by @Qubitium in #2217
- Add missing task links by @Sypherd in #2449
- Update CODEOWNERS by @haileyschoelkopf in #2453
- Add real process_docs example by @Sypherd in #2456
- Modify label errors in catcola and paws-x by @zxcvuser in #2434
- Add Japanese Leaderboard by @sitfoxfly in #2439
- Typos: Fix 'loglikelihood' misspellings in api_models.py by @RobGeada in #2459
- use global
multi_choice_filter
for mmlu_flan by @baberabb in #2461 - typo by @baberabb in #2465
- pass device_map other than auto for parallelize by @baberabb in #2457
- OpenAI ChatCompletions: switch
max_tokens
by @baberabb in #2443 - Ifeval: Dowload
punkt_tab
on rank 0 by @baberabb in #2267 - Fix chat template; fix leaderboard math by @baberabb in #2475
- change warning to debug by @baberabb in #2481
- Updated wandb logger to use
new_printer()
instead ofget_printer(...)
by @alex-titterton in #2484 - IBM watsonx_llm fixes & refactor by @Medokins in #2464
- Fix revision parameter to vllm get_tokenizer by @OyvindTafjord in #2492
- update pre-commit hooks and git actions by @baberabb in #2497
- kbl-v0.1.1 by @whwang299 in #2493
- Add mamba hf to
mamba_ssm
by @baberabb in #2496 - remove duplicate
arc_ca
tag by @baberabb in #2499 - Add metabench task to LM Evaluation Harness by @kozzy97 in #2357
- Nits by @baberabb in #2500
- [API models] parse tokenizer_backend=None properly by @baberabb in #2509
New Contributors
- @renjie-ranger made their first contribution in #2404
- @t1101675 made their first contribution in #2409
- @Medokins made their first contribution in #2397
- @kiersten-stokes made their first contribution in #2426
- @ckgresla made their first contribution in #2433
- @sjmonson made their first contribution in #2440
- @Qubitium made their first contribution in #2217
- @Sypherd made their first contribution in #2449
- @sitfoxfly made their first contribution in #2439
- @RobGeada made their first contribution in #2459
- @alex-titterton made their first contribution in #2484
- @OyvindTafjord made their first contribution in #2492
- @whwang299 made their first contribution in #2493
- @kozzy97 made their first contribution in #2357
Full Changelog: v0.4.5...v0.4.6