Releases: huggingface/lighteval
v0.12.2
What's Changed
Bug Fixes 🐛
- Remove suites in task configs example and fix task with hf_filters by @NathanHB in #1051
- put lower bound on typer to use literal type by @NathanHB in #1042
- remove suites from serbian_eval.py by @S-Y-A-N in #1044
- neater bundle and logdir for inspect-ai by @NathanHB in #1043
- No longer force `use_logits` to True by @f14-bertolotti in #1050
- Fix wrong attribute `self.k` -> `self.n` by @f14-bertolotti in #1049
New Contributors
- @S-Y-A-N made their first contribution in #1044
- @f14-bertolotti made their first contribution in #1050
Full Changelog: v0.12.0...v0.12.2
v0.12.1
Patch release with quality-of-life updates!
Run all inference providers at the same time (#1039):

```diff
- lighteval eval hf-inference-providers/openai/gpt-oss-20b:novita hf-inference-providers/openai/gpt-oss-20b:nebius hf-inference-providers/openai/gpt-oss-20b:fireworks-ai ... aime25
+ lighteval eval hf-inference-providers/openai/gpt-oss-20b:all aime25
```

Remove suites and make fewshots optional (#1038):

```diff
- lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0"
+ lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25
```

Bug fix:
Full Changelog: v0.12.0...v0.12.1
v0.12.0
An exciting release in which we pivot to using inspect-ai as a backend and make tasks much easier to find and add, thanks to a task finder Space: here
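For a taste of the new flow, a run against an inference provider boils down to one command (taken from the v0.12.1 notes above; at v0.12.0 the task name still carries its suite prefix, which v0.12.1 made optional):

```bash
# Evaluate gpt-oss-20b through Hugging Face inference providers on AIME25.
# The "lighteval|aime25|0" task form is the v0.12.0 syntax shown above;
# from v0.12.1 on, the bare task name "aime25" also works.
lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0"
```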
New Features 🎉
- Registry refactorisation by @clefourrier in #937
- Multilingual extractiveness by @rolshoven in #956
- Added `backend_options` parameter to llm judges by @rolshoven in #963
- Add automatic tests for metrics by @NathanHB in #939
- Support local GGUF in VLLM and use HF tokenizer #943 by @JIElite in #972
- [RFC] Rework the dependencies to be more versatile by @LysandreJik in #951
- Sample to sample compare for integration tests by @NathanHB in #977
- Move tasks to individual files by @NathanHB in #1016
- Adds inspectai by @NathanHB in #1022
New Tasks
- GSM-PLUS by @NathanHB in #780
- TUMLU-mini by @ceferisbarov in #811
- Filipino Benchmark by @ljvmiranda921 in #852
- MMLU Redux by @clefourrier in #883
- IFBench by @clefourrier in #944
- SLR-Bench by @Ahmad21Omar in #983
- MMLU pro by @NathanHB in #1031
Enhancement ⚙️
- adds `enable_prefix_caching` option to VLLMModelConfig by @GAD-cell in #945
- Added litellm model config options and improved `_prepare_max_new_tokens` by @rolshoven in #967
- always provide parameters in the metric name to allow using several combinations by @clefourrier in #1017
Documentation 📚
- Add org_to_bill parameter to documentation by @tfrere in #781
- Update docs and enforce Google's docstring style by @NathanHB in #941
- Fix broken link by @JoelNiklaus in #1014
- Update huggingface-cli login to use newer hf auth login by @Xceron in #1034
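For clarity, the login change in #1034 amounts to swapping the deprecated CLI entry point for the one shipped with recent huggingface_hub releases:

```diff
- huggingface-cli login
+ hf auth login
```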
Task and Metrics changes 🛠️
- Add Bulgarian and Macedonian literals by @dianaonutu in #769
- Add TranslationLiterals for Language.DANISH by @spyysalo in #770
- Update translation_literals.py with icelandic by @joenaess in #775
- Complete TranslationLiterals for Language.ESTONIAN by @spyysalo in #779
- Update translation_literals.py by @dianaonutu in #923
- Fixing naming for sample evals + adding reqs in aime24 by @clefourrier in #989
- add translation literals for various Indic languages (Bengali, Gujarati, Punjabi, Tamil) by @rpm000 in #1015
Bug Fixes 🐛
- [#794] Fix: Assign SummaCZS instance to `self.summac` in Faithfulness metric by @sahilds1 in #795
- Catch ROCM/HIP/AMD OOM in `should_reduce_batch_size` by @mcleish7 in #812
- Fix GPQA and index extractive metric by @clefourrier in #829
- Update extractive_match_utils.py for words where `:` is preceded by a space by @clefourrier in #831
- fixes from_model function and adds tests by @NathanHB in #921
- fix tasks list by @alielfilali01 in #906
- set upper bound on vllm version by @NathanHB in #964
- Fixed bug that prevented the metrics from being mixed (batched/not batched) by @rolshoven in #958
- Fix inference providers calls by @clefourrier in #1012
- Fixing mixeval by @clefourrier in #1006
- Fix typo in attribute name: CONCURENT_CALLS -> CONCURRENT_CALLS by @muupan in #884
- Added ability to configure concurrent_requests in litellm_model.py by @dameikle in #911
- Added fallback for incomplete configs for vlm models launched as llms by @clefourrier in #828
New Contributors
- @pratyushmaini made their first contribution in #697
- @DeVikingMark made their first contribution in #782
- @sahilds1 made their first contribution in #795
- @dianaonutu made their first contribution in #769
- @tfrere made their first contribution in #781
- @mcleish7 made their first contribution in #812
- @leopardracer made their first contribution in #810
- @spyysalo made their first contribution in #770
- @ceferisbarov made their first contribution in #811
- @joenaess made their first contribution in #775
- @ryantzr1 made their first contribution in #784
- @dtung8068 made their first contribution in #862
- @muupan made their first contribution in #884
- @NouamaneTazi made their first contribution in #841
- @uralik made their first contribution in #887
- @dameikle made their first contribution in #911
- @ljvmiranda921 made their first contribution in #852
- @cpcdoy made their first contribution in #502
- @rolshoven made their first contribution in #958
- @JIElite made their first contribution in #972
- @LysandreJik made their first contribution in #951
- @GAD-cell made their first contribution in #945
- @amstu2 made their first contribution in #986
- @Ahmad21Omar made their first contribution in #983
- @cmpatino made their first contribution in #998
- @rpm000 made their first contribution in #1015
- @Xceron made their first contribution in #1034
Full Changelog: v0.10.0...v0.12.0
v0.11.0
Lighteval v0.11.0
This release introduces major improvements and changes across usability, stability, performance, and documentation.
Highlights include a large refactor to simplify the architecture, automated metric tests, a dependency rework, improved documentation, and new tasks/benchmarks.
Highlights
- Automated tests for metrics and stronger dependency checks
- Continuous batching, caching, and faster CLI with reduced redundancy
- Upgrade to datasets 4.0 and Trackio integration
- Automatic chat template inference and reasoning trace support
- New tasks: GSM-PLUS, TUMLU-mini, IFBench, Filipino benchmarks, MMLU Redux
- Added Bulgarian, Macedonian, Danish, Icelandic, and Estonian literals
- Documentation improvements (Google docstring style, README updates)
What's Changed
New Features
- Automatic inference of chat template usage (no kwargs needed) by @clefourrier (#885)
- More versatile dependency rework by @LysandreJik (#951)
- Automatic tests for metrics by @NathanHB (#939)
- Sample-to-sample comparisons for integration tests by @NathanHB (#977)
- Continuous batching support by @NathanHB (#850) (arthur)
- Refactored code and removed unused parts by @NathanHB (#709)
- Post-processing for reasoning tokens in pipeline by @clefourrier (#882)
- logging of system prompt by @clefourrier (#907)
- Adds Caching of samples by @clefourrier (#909)
- Upgrade to `datasets` 4.0 by @NathanHB (#924)
- Trackio integration when available by @NathanHB (#930)
- Parameterization of sampling evals from CLI by @clefourrier (#926)
- Local GGUF support in VLLM with HF tokenizer by @JIElite (#972)
Enhancement
- `bootstrap_iters` as an argument by @pratyushmaini (#697)
- Load tasks before models by @clefourrier (#931)
- Save `reasoning_content` from litellm as details by @muupan (#929)
- Fix for TGI endpoint inference and JSON grammar generation by @cpcdoy (#502)
- Reduced redundancy in CLI arguments by @NathanHB (#932)
- Registry refactor by @clefourrier (#937)
- Multilingual extractiveness support by @rolshoven (#956)
- Added `backend_options` parameter to LLM judges by @rolshoven (#963)
Documentation
- Added `org_to_bill` parameter by @tfrere (#781)
- Updated docs with Google docstring style by @NathanHB (#941)
- Updated README by @NathanHB (#961)
New Tasks
- Added GSM-PLUS by @NathanHB (#780)
- Added TUMLU-mini benchmark, fixed #577 by @ceferisbarov (#811)
- Added Filipino benchmark community tasks by @ljvmiranda921 (#852)
- MMLU Redux and caching fix by @clefourrier (#883)
- Added IFBench by @clefourrier (#944)
Task and Metrics Changes
- Added Bulgarian and Macedonian literals by @dianaonutu (#769)
- Added Danish translation literals by @spyysalo (#770)
- Added Icelandic translation literals by @joenaess (#775)
- Completed Estonian translation literals by @spyysalo (#779)
- Updated `translation_literals.py` by @dianaonutu (#923)
Bug Fixes
- Fixed [#794]: assigned `SummaCZS` instance in Faithfulness metric by @sahilds1 (#795)
- Caught ROCM/HIP/AMD OOM in `should_reduce_batch_size` by @mcleish7 (#812)
- Fixed GPQA and index extractive metric by @clefourrier (#829)
- Updated `extractive_match_utils.py` for cases with `:` by @clefourrier (#831)
- Fixed `from_model` function and added tests by @NathanHB (#921)
- Fixed tasks list by @alielfilali01 (#906)
- Set upper bound on VLLM version by @NathanHB (#964)
- Fixed batching bug in metrics by @rolshoven (#958)
Other Changes
- Fixed typo in attribute name (`CONCURENT_CALLS` → `CONCURRENT_CALLS`) by @muupan (#884)
- Added ability to configure `concurrent_requests` in `litellm_model.py` by @dameikle (#911)
New Contributors
We’re excited to welcome new contributors in this release:
@pratyushmaini, @DeVikingMark, @sahilds1, @dianaonutu, @tfrere, @mcleish7, @leopardracer, @spyysalo, @ceferisbarov, @joenaess, @ryantzr1, @dtung8068, @muupan, @NouamaneTazi, @uralik, @dameikle, @ljvmiranda921, @cpcdoy, @rolshoven, @JIElite, @LysandreJik
Full Changelog: v0.10.0...v0.11.0
v0.10.0
We now support VLMs when using the transformers backend 🥳
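As a rough sketch, a VLM evaluation could look like the following. This is a hypothetical invocation: the model-args key, the model checkpoint, and the MMMU-Pro task string (support added in #675 below) are assumptions, and the exact syntax varies between releases; check `lighteval accelerate --help` on your installed version.

```bash
# Hypothetical sketch: run a vision-language model through the
# transformers/accelerate backend on MMMU-Pro (added in #675).
# Model-args key and task string are assumptions, not confirmed syntax.
lighteval accelerate \
    "model_name=Qwen/Qwen2-VL-7B-Instruct" \
    "lighteval|mmmu_pro|0"
```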
What's Changed
New Features 🎉
- Added support for quantization in vLLM backend by @SulRash in #690
- Adds multimodal support and MMMU pro by @NathanHB in #675
- Allow for model kwargs when loading transformers from pretrained by @NathanHB in #754
- Adds template for custom path saving results by @NathanHB in #755
- Nanotron, Multilingual tasks update + misc by @hynky1999 in #756
- Async vllm by @clefourrier in #693
New Tasks
- Adds More Generative tasks by @hynky1999 in #694
- Added Flores by @clefourrier in #717
Task and Metrics changes 🛠️
- Nanotron, Multilingual tasks update + misc by @hynky1999 in #756
- add livecodebench v6 by @Cppowboy in #712
- Add MCQ support to Yourbench evaluation by @alozowski in #734
Other Changes
- Bump ruff version by @NathanHB in #774
- Fix revision arg for vLLM tokenizer by @lewtun in #721
- Update README.md by @clefourrier in #733
- Fix litellm by @NathanHB in #736
New Contributors
- @Cppowboy made their first contribution in #712
- @SulRash made their first contribution in #690
- @Abelgurung made their first contribution in #743
Full Changelog: v0.9.2...v0.10.0
v0.9.2
What's Changed
New Features 🎉
- enable together models and reasoning models as judges. by @JoelNiklaus in #537
- Propagate vLLM batch size controls by @alvin319 in #588
- Integrate huggingface_hub inference support for LLM as Judge by @alozowski in #651
- add cot_prompt in vllm by @HERIUN in #654
- Unify modelargs and use Pydantic for model configs by @NathanHB in #609
- Improve test by @qubvel in #674
- adds wandb logging of metrics by @NathanHB in #676
- Adds wandb logging by @NathanHB in #685
- Added custom model inference. by @JoelNiklaus in #437
- Update split iteration for DynamicBatchingDataset by @qubvel in #684
Documentation 📚
- Add --use-chat-template to the broken litellm example by @eldarkurtic in #614
- Lighteval math by @HERIUN in #630
- Update quicktour command by @qubvel in #679
- fix wrong 'custom_task_directory' in python api doc by @xgwang in #671
- docs: improve consistency in punctuation of metric list by @mariagrandury in #605
New Tasks 📈
- add arc agi 2 by @NathanHB in #642
- Add G-Pass@k Metric by @jnanliu in #589
- adds simpleqa by @NathanHB in #680
Task and Metrics changes 🛠️
- Pass At K Math by @clefourrier in #647
- Use `n=16` samples to estimate `pass@1` for AIME benchmarks by @lewtun in #661
- adding uzbek literals by @shopulatov in #664
- Align AIME pass@1 with literature by @lewtun in #666
- Update LCB prompt & fix newlines by @rawsh in #645
- fix gsm8k metric by @NathanHB in #688
- Add pass@1 for GPQA-D and MATH-500 by @lewtun in #698
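For context on the pass@1 entries above: with n sampled completions per problem, of which c are correct, pass@1 is estimated as the mean success rate per problem, averaged across problems. This is the standard formulation from Chen et al. (2021), which we assume these benchmarks follow:

$$\widehat{\text{pass@1}} = \mathbb{E}_{\text{problems}}\left[\frac{c}{n}\right]$$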
Bug Fixes 🐛
- Use `bfloat16` as default for vllm models by @NathanHB in #638
- Fix passing of generation config to main_accelerate by @LoserCheems in #659
- Parse seed for vLLM by @eldarkurtic in #602
- Parse string values for add_special_tokens in vLLM by @eldarkurtic in #598
- hardcode configs to not make lighteval crash if lcb repo unavailable by @NathanHB in #677
- Fix incorrect tokenizer 'padding' param by @xgwang in #669
- Fix TransformersModel.from_model() method by @Vectorrent in #691
- Inference providers by @clefourrier in #701
New Contributors
- @DerekLiu35 made their first contribution in #620
- @AnikiFan made their first contribution in #610
- @alvin319 made their first contribution in #588
- @alozowski made their first contribution in #643
- @Laz4rz made their first contribution in #613
- @shopulatov made their first contribution in #664
- @HERIUN made their first contribution in #654
- @rawsh made their first contribution in #645
- @qubvel made their first contribution in #674
- @xgwang made their first contribution in #669
- @jnanliu made their first contribution in #589
- @Vectorrent made their first contribution in #683
- @omahs made their first contribution in #702
Full Changelog: v0.8.0...v0.9.2
v0.8.0
What's new
Tasks
- LiveCodeBench by @plaguss in #548, #587, #518
- GPQA diamond by @lewtun in #534
- Humanity's last exam by @clefourrier in #520
- Olympiad Bench by @NathanHB in #521
- aime24, 25 and math500 by @NathanHB in #586
- French models evals by @mdiazmel in #505
Metrics
- Pass@k by @clefourrier in #519
- Extractive Match metric by @hynky1999 in #495, #503, #522, #535
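For reference, the usual unbiased pass@k estimator (Chen et al., 2021), given n samples per problem of which c pass, is the following; we assume the standard formulation applies here:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$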
Features
Better logging
- log model config by @NathanHB in #627
- Support custom results/details push to hub by @albertvillanova in #457
- Push details without converting fields to str by @NathanHB in #572
Inference providers
- adds inference providers support by @NathanHB in #616
Load details to be evaluated
- Implemented the possibility to load predictions from details files and continue evaluating from there by @JoelNiklaus in #488
sglang support
- Let lighteval support sglang by @Jayon02 in #552
Bug Fixes and refacto
- Tiny improvements to `endpoint_model.py`, `base_model.py`, ... by @sadra-barikbin in #219
- Update README.md by @NathanHB in #486
- Fix issue with encodings for together models. by @JoelNiklaus in #483
- Made litellm judge backend more robust. by @JoelNiklaus in #485
- Fix `T_co` import bug by @gucci-j in #484
- fix README link by @vxw3t8fhjsdkghvbdifuk in #500
- Fixed issue with o1 in litellm. by @JoelNiklaus in #493
- Hotfix for litellm judge by @JoelNiklaus in #490
- Made judge response processing more robust. by @JoelNiklaus in #491
- VLLM: Allows for max tokens to be set in model config file by @NathanHB in #547
- Bump up the latex2sympy2_extended version + more tests by @hynky1999 in #510
- Fixed bug importing url_to_fs from fsspec by @LoserCheems in #507
- Fix Ukrainian indices and confirmation word by @ayukh in #516
- Fix VLLM data-parallel by @hynky1999 in #541
- relax spacy import to relax dep by @clefourrier in #622
- vllm fix sampling params by @NathanHB in #625
- relax deps for tgi by @NathanHB in #626
- Bug fix extractive match by @hynky1999 in #540
- Fix loading of vllm model from files by @NathanHB in #533
- fix: broken URLs by @deep-diver in #550
- typo(vllm): fix `gpu_memory_utilisation` typo by @tpoisonooo in #553
- allows better flexibility for litellm endpoints by @NathanHB in #549
- Translate task template to Catalan and Galician and fix typos by @mariagrandury in #506
- Relax upper bound on torch by @lewtun in #508
- Fix vLLM generation with sampling params by @lewtun in #578
- Make BLEURT lazy by @hynky1999 in #536
- Fixing backend error in main_sglang. by @TankNee in #597
- VLLM + Math-Verify fixes by @hynky1999 in #603
- raise exception when generation size is more than model length by @NathanHB in #571
Thanks
Huge thanks to Hynek, Lewis, Ben, Agustín, Elie and everyone helping and giving feedback 💙
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Extractive Match metric (#495)
- Fix math extraction (#503)
- Bump up the latex2sympy2_extended version + more tests (#510)
- Math extraction - allow only trying the first match, more customizable latex extraction + bump deps (#522)
- add missing inits (#524)
- Sync Math-verify (#535)
- Make BLEURT lazy (#536)
- Bug fix extractive match (#540)
- Fix VLLM data-parallel (#541)
- VLLM + Math-Verify fixes (#603)
- @plaguss
- @Jayon02
- Let lighteval support sglang (#552)
- @NathanHB
- adds olympiad bench (#521)
- Fix loading of vllm model from files (#533)
- [VLLM] Allows for max tokens to be set in model config file (#547)
- allows better flexibility for litellm endpoints (#549)
- raise exception when generation size is more than model length (#571)
- Push details without converting fields to str (#572)
- adds aime24, 25 and math500 (#586)
- adds inference providers support (#616)
- vllm fix sampling params (#625)
- relax deps for tgi (#626)
- log model config (#627)
v0.7.0
What's New
New Tasks
- added musr by @clefourrier in #375
- Adds Global MMLU by @hynky1999 in #426
- Add new Arabic benchmarks (5) and enhance existing tasks by @alielfilali01 in #372
New Features
- Evaluate a model already loaded in memory for training / evaluation loop by @clefourrier in #390
- Allowing a single prompt to use several formats for one eval by @clefourrier in #398
- Autoscaling inference endpoints hardware by @clefourrier in #412
- CLI new look and features (using typer) by @NathanHB in #407
- Better Looking and more functional logging by @NathanHB in #415
- Add litellm backend by @JoelNiklaus in #385
More Translation Literals by the Community
- add bashkir variants by @AigizK in #374
- add Shan (shn) translation literals by @NoerNova in #376
- Add Udmurt (udm) translation literals by @codemurt in #381
- Add translation literals for the Belarusian language by @Kryuski in #382
- added tatar literals by @gaydmi in #383
New Doc
- Add doc-builder doc-pr-upload GH Action by @albertvillanova in #411
- Set up docs by @albertvillanova in #403
- Add docstring docs by @albertvillanova in #413
- Add missing models to docs by @albertvillanova in #419
- Update docs about inference endpoints by @albertvillanova in #432
- Upgrade deprecated GH Action cache@v2 by @albertvillanova in #456
- Add EvaluationTracker to docs and fix its docstring by @albertvillanova in #464
- Checkout PR merge commit for CI tests by @albertvillanova in #468
Bug Fixes and Refacto
- Allow AdapterModels to have custom tokens by @mapmeld in #306
- Homogeneize generation params by @clefourrier in #428
- fix: cache directory variable by @NazimHAli in #378
- Add trufflehog secrets detection by @albertvillanova in #429
- greedy_until() fix by @vsabolcec in #344
- Fixes a TypeError for generative metrics. by @JoelNiklaus in #386
- Speed up Bootstrapping Computation by @JoelNiklaus in #409
- Fix imports from model_config by @albertvillanova in #443
- Fix wrong instructions and code for custom tasks by @albertvillanova in #450
- Fix minor typos by @albertvillanova in #449
- fix model parallel by @NathanHB in #481
- add configs with their models by @clefourrier in #421
- Fixes a TypeError in Sacrebleu. by @JoelNiklaus in #387
- fix ukr/rus by @hynky1999 in #394
- fix repeated cleanup by @anton-l in #399
- Update instance type/size in endpoint model_config example by @albertvillanova in #401
- Considering the case empty request list is given to base model by @sadra-barikbin in #250
- Fix a tiny bug in `PromptManager::FewShotSampler::_init_fewshot_sampling_random` by @sadra-barikbin in #423
- Fix splitting for generative tasks by @NathanHB in #400
- Fixes an error with getting the golds from the formatted_docs. by @JoelNiklaus in #388
- Fix ignored reuse_existing in config file by @albertvillanova in #431
- Deprecate Obsolete Config Properties by @ParagEkbote in #433
- fix: LightevalTaskConfig.stop_sequence attribute by @ryan-minato in #463
- fix: scorer attribute initialization in ROUGE by @ryan-minato in #471
- Delete endpoint on InferenceEndpointTimeoutError by @albertvillanova in #475
- Remove unnecessary deepcopy in evaluation_tracker by @albertvillanova in #459
- fix: CACHE_DIR Default Value in Accelerate Pipeline by @ryan-minato in #461
- Fix warning about precedence of custom tasks over default ones in registry by @albertvillanova in #466
- Implement TGI model config from path by @albertvillanova in #448
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @clefourrier
- added musr (#375)
- Update README.md
- Use the programmatic interface using an already in memory loaded model (#390)
- Pr sadra (#393)
- Allowing a single prompt to use several formats for one eval (#398)
- Autoscaling inference endpoints (#412)
- add configs with their models (#421)
- Fix custom arabic tasks (#440)
- Adds serverless endpoints back (#445)
- Homogeneize generation params (#428)
- @JoelNiklaus
- @albertvillanova
- Update instance type/size in endpoint model_config example (#401)
- Typo in feature-request.md (#406)
- Add doc-builder doc-pr-upload GH Action (#411)
- Set up docs (#403)
- Add docstring docs (#413)
- Add missing models to docs (#419)
- Add trufflehog secrets detection (#429)
- Update docs about inference endpoints (#432)
- Fix ignored reuse_existing in config file (#431)
- Test inference endpoint model config parsing from path (#434)
- Fix imports from model_config (#443)
- Fix wrong instructions and code for custom tasks (#450)
- Fix minor typos (#449)
- Implement TGI model config from path (#448)
- Upgrade deprecated GH Action cache@v2 (#456)
- Add EvaluationTracker to docs and fix its docstring (#464)
- Remove unnecessary deepcopy in evaluation_tracker (#459)
- Fix warning about precedence of custom tasks over default ones in registry (#466)
- Checkout PR merge commit for CI tests (#468)
- Delete endpoint on InferenceEndpointTimeoutError (#475)
- @NathanHB
- @ParagEkbote
- Deprecate Obsolete Config Properties (#433)
- @alielfilali01
v0.6.0
What's New
Lighteval becomes massively multilingual!
We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.
- Add 3 NLI tasks supporting 26 unique languages. #329 by @hynky1999
- Add 3 COPA tasks supporting about 20 unique languages. #330 by @hynky1999
- Add Hellaswag tasks supporting about 36 unique languages. #332 by @hynky1999
  - mlmm_hellaswag
  - hellaswag_{tha/tur}
- Add RC tasks supporting about 130 unique languages/scripts. #333 by @hynky1999
- Add GK tasks supporting about 35 unique languages/scripts. #338 by @hynky1999
  - meta_mmlu
  - mlmm_mmlu
  - rummlu
  - mmlu_ara_mcf
  - tur_leaderboard_mmlu
  - cmmlu
  - mmlu
  - ceval
  - mlmm_arc_challenge
  - alghafa_arc_easy
  - community_arc
  - community_truthfulqa
  - exams
  - m3exams
  - thai_exams
  - xcsqa
  - alghafa_piqa
  - mera_openbookqa
  - alghafa_openbookqa
  - alghafa_sciqa
  - mathlogic_qa
  - agieval
  - mera_worldtree
- Misc Tasks #339 by @hynky1999
  - openai_mmlu_tasks
  - turkish_mmlu_tasks
  - lumi arc
  - hindi/swahili/arabic (from alghafa) arc
  - cmath
  - mgsm
  - xcodah
  - xstory
  - xwinograd + tr winograd
  - mlqa
  - mkqa
  - mintaka
  - mlqa_tasks
  - french triviaqa
  - chegeka
  - acva
  - french_boolq
  - hindi_boolq
- Serbian LLM Benchmark Task by @DeanChugall in #340
- iroko bench by @hynky1999 in #357
Other Tasks
Features
- Now evaluate OpenAI models by @NathanHB in #359
- New Doc and README by @NathanHB in #327
- Refacto LLM as A Judge by @NathanHB in #337
- Selecting tasks using their superset by @hynky1999 in #308
- Nicer output on task search failure by @hynky1999 in #357
- Adds tasks templating by @hynky1999 in #335
- Support for multilingual generative metrics by @hynky1999 in #293
- Class implementations of faithfulness and extractiveness metrics by @chuandudx in #323
- Translation literals by @hynky1999 in #356
Bug Fixes
- Math normalization: do not crash on invalid format by @guipenedo in #331
- Skipping push to hub test by @clefourrier in #334
- Fix Metrics import path in community task template file. by @chuandudx in #309
- Allow kwargs for BERTScore compute function and remove unused var by @chuandudx in #311
- Fixes sampling for vllm when num_samples==1 by @edbeeching in #343
- Fix the dataset loading for custom tasks by @clefourrier in #364
- Fix: missing property tag in inference endpoints by @clefourrier in #368
- Fix Tokenization + misc fixes by @hynky1999 in #354
- Fix BLEURT evaluation errors by @chuandudx in #316
- Adds Baseline workflow + fixes by @hynky1999 in #363
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Support for multilingual generative metrics (#293)
- Adds tasks templating (#335)
- Multilingual NLI Tasks (#329)
- Multilingual COPA tasks (#330)
- Multilingual Hellaswag tasks (#332)
- Multilingual Reading Comprehension tasks (#333)
- Multilingual General Knowledge tasks (#338)
- Selecting tasks using their superset (#308)
- Fix Tokenization + misc fixes (#354)
- Misc-multilingual tasks (#339)
- add iroko bench + nicer output on task search failure (#357)
- Translation literals (#356)
- selected tasks for multilingual evaluation (#371)
- Adds Baseline workflow + fixes (#363)
- @DeanChugall
- Serbian LLM Benchmark Task (#340)
- @NathanHB
New Contributors
- @chuandudx made their first contribution in #323
- @edbeeching made their first contribution in #343
- @DeanChugall made their first contribution in #340
- @Stopwolf made their first contribution in #225
- @martinscooper made their first contribution in #366
Full Changelog: v0.5.0...v0.6.0
v0.5.0
What's new
Features
- Tokenization-wise encoding by @hynky1999 in #287
- Task config by @hynky1999 in #289