Add support for quantization_config #2842

Merged
baberabb merged 2 commits into EleutherAI:main from jerryzh168:add-quant
Apr 15, 2025

Conversation

@jerryzh168
Contributor

Summary:
Previously, quantization_config was ignored, so torchao-quantized models were not supported; this PR adds that support.

Test Plan:
lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8

Reviewers:

Subscribers:

Tasks:

Tags:
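
For context, a minimal sketch of what enabling quantization_config amounts to, assuming the checkpoint stores its quantization settings under quantization_config in config.json (as in the checkpoint above). This is illustrative only, not the harness's actual loader code:

```python
# Hedged sketch: read quantization_config from the checkpoint's config and
# forward it to from_pretrained so a torchao-quantized checkpoint loads with
# its stored quantization settings. The model id is the one from the test
# plan; everything else here is illustrative.
from transformers import AutoConfig, AutoModelForCausalLM

pretrained = "jerryzh168/gemma3-int4wo"
config = AutoConfig.from_pretrained(pretrained)

model_kwargs = {}
quant_cfg = getattr(config, "quantization_config", None)
if quant_cfg is not None:
    # previously the harness dropped this field; forwarding it lets
    # transformers pick the matching quantized-loading path
    model_kwargs["quantization_config"] = quant_cfg

model = AutoModelForCausalLM.from_pretrained(pretrained, **model_kwargs)
```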

@CLAassistant

CLAassistant commented Mar 25, 2025

CLA assistant check
All committers have signed the CLA.

@StellaAthena
Member

Why is this necessary? Shouldn't the kwargs argument handle it? Are we misusing that argument? If this isn't weirdly bespoke to quantization I would rather address the issue of allowing users to pass through arbitrary keyword args than monkey patch each one that people are interested in.

@jerryzh168
Contributor Author

jerryzh168 commented Mar 27, 2025

@StellaAthena do you mean passing quantization_config from the command line? The config is stored in the Hugging Face model checkpoint, e.g. https://huggingface.co/jerryzh168/llama3-int4wo/blob/main/config.json#L22, and I think it would be better to read it from there directly; right now it seems to be ignored.

Review comment thread on lm_eval/models/huggingface.py (outdated)
@baberabb
Contributor

Hi! Thanks for the PR. Left a comment.

Although it does look like AutoModel uses the quantization_config if it's included in the model config?
https://github.com/huggingface/transformers/blob/348f3285c5114159d2ff4933b4b8ae36866d01a7/src/transformers/models/auto/auto_factory.py#L544-L545

@StellaAthena
Member

@StellaAthena do you mean passing quantization_config from the command line? The config is stored in the Hugging Face model checkpoint, e.g. https://huggingface.co/jerryzh168/llama3-int4wo/blob/main/config.json#L22, and I think it would be better to read it from there directly; right now it seems to be ignored.

Sorry, I misunderstood. I thought you were saying that if the user provides quantization configs when calling lm-eval they're ignored.

@jerryzh168
Contributor Author

Hi! Thanks for the PR. Left a comment.

Although it does look like AutoModel uses the quantization_config if it's included in the model config? huggingface/transformers@348f328/src/transformers/models/auto/auto_factory.py#L544-L545

Looks like it's not picked up from there, maybe because the quantization_config we want comes from the model config, not from the model kwargs.

@jerryzh168 jerryzh168 requested a review from baberabb March 31, 2025 18:01
@jerryzh168
Contributor Author

@baberabb @StellaAthena I updated the PR, can you take a look again

@jerryzh168
Contributor Author

Hi @baberabb @StellaAthena can you take a look again

@baberabb
Contributor

baberabb commented Apr 8, 2025

Hi @baberabb @StellaAthena can you take a look again

LGTM! Thank you!

@jerryzh168
Contributor Author

@baberabb can you also help merge? The CI errors don't look relevant.

@baberabb baberabb merged commit 758c5ed into EleutherAI:main Apr 15, 2025
1 of 20 checks passed
JessicaOjo pushed a commit to JessicaOjo/lm-evaluation-harness that referenced this pull request Dec 10, 2025
* Add support for quantization_config

Summary:
Previously quantization_config is ignored, so torchao quantized models are not supported,
this PR adds that.

Test Plan:
lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8

Reviewers:

Subscribers:

Tasks:

Tags:

* quantization_config is optional
Helw150 added a commit to Helw150/lm-evaluation-harness that referenced this pull request Jan 26, 2026
* skip casting if predict_only (#2524)

* make utility function to handle `until` (#2518)

* make utility function to handle `until`

* fix text

* Update Unitxt task to  use locally installed unitxt and not download Unitxt code from Huggingface (#2514)

* Moved to require unitxt installation and not download unitxt from HF hub.

This has performance benefits and simplifies the code.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated watsonx documentation

* Updated installation instructions

* Removed redundant comma

* Allowed unitxt tasks to generate chat APIs

Modified WatsonXI model to support chat apis

* Removed print

* Run precommit formatting

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* add Basque translation of PIQA (piqa_eu) to BasqueBench (#2531)

* avoid timeout errors with high concurrency in api_model (#2307)

* avoid timeout errors with high concurrency in api_model

* style

* add timeout

* add docs

---------

Co-authored-by: Baber <baber@hey.com>

* Update README.md (#2534)

* Update README.md

add caching tip to readme

* Update README.md

add api link

* add better testing when both doc_to_text ends in and target_delimiter are whitespaces (#2535)

* Support pipeline parallel with OpenVINO models (#2349)

* Handle pipeline_parallel parameter

* Add description of pipeline parallelism with OV models

* Update README.md (#2546)

* [API] left truncate for generate_until (#2554)

* left truncate for generate_until

* pre-commit

* Update Lightning import (#2549)

* update import

Signed-off-by: Maanu Grover <maanug@nvidia.com>

* run formatting

---------

Signed-off-by: Maanu Grover <maanug@nvidia.com>

* add optimum-intel ipex model (#2566)

* initial support for optimum-intel ipex model. LM model as first step

* format

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

* pass dtype

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

* update README
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

---------

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

* add warning to readme (#2568)

* make warning prominent

* make warning prominent

* Adding new subtask to SCORE tasks: non greedy robustness (#2558)

* score readme added

* generate until task's "until" parameter's default value fixed.

* score mmlu-pro and agieval added

* changed macro accuracy to micro for agieval

* Always E removed from agi eval

* redundancies removed

* MATH added

* minor cosmetic changes for math

* Licenses added Readme updated

* changes for flake8 + license header on math

* Score added to readme and precommit was run.

* Score added to readme and precommit was run.

* Import error fixed

* math task bugfix
postprocess minor fix

* CR for math added

* math CR

* math task bugfix
postprocess minor fix

CR for math added

* Math cr fixed

* mmlu_pro non_greedy task added

* non greedy summarizer added

* Non greedy for all score tasks

* Bugfixes for non-greedy

* fixing the until argument

* undoing the change to "until" arguments default behaviour

* minor fix in summarizer

* log naming changes for better readability

* math subtasks naming fix

* agieval subtask naming fix

* logging added for debugging

* path issue fixed

* minor fix

* path fix

* path fix

* non_greedy_math minor fix

* final changes

* changed readme for non-greedy
added Nvidia header
added example script for non_greedy
changed prompts to match those of TRT runs

* non greedy summarizer bugfix

* non_greedy summarizer fixed

* batch `loglikelihood_rolling` across requests (#2559)

* batch all rolling token windows

* nit

* copy to vllm

* fix max_length for `get_rolling_token_windows`

* bugfix

* bugfix

* add type hints

* fix `DeprecationWarning: invalid escape sequence '\s'` for whitespace filter (#2560)

* fix `DeprecationWarning: invalid escape sequence '\s'`

* add type hints

* Revert "add type hints"

This reverts commit 15d8abc626a84e97f8c238ddfbf9e243d6f6eb5c.

* increment version (#2574)

forgot to increment 0.4.6!

* drop python 3.8 support (#2575)

* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit

* Add Global MMLU Lite (#2567)

* add global mmlu lite

* add global mmlu lite

* fix bugs

* add task README.md

* Update README.md

* Update tasks README.md

* Update README.md

* update readme

---------

Co-authored-by: shivi <shivalikasingh95@gmail.com>

* add warning for truncation (#2585)

* add warning for truncation

* Wandb step handling bugfix and feature (#2580)

* AraDICE task config file (#2507)

* added aradice

* Added ArabicMMLU Lev Configs

* added ArabicMMLU egy configs

* Added boolq configs

* Added cultural bench configs

* added openbookqa configs

* Added PiQA configs

* added winogrande configs

* Added truthfulQA configs

* Added aradice group config

* Remove deleted files from repository

* modified arabimmlu configs

* modified metadata versions

* fixed formatting using ruff

* added aradice tasks information

* pre-commit

* Updated openbookqa utils

* fixed formatting on obqa

---------

Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa>
Co-authored-by: Baber <baber@hey.com>

* fix extra_match low if batch_size > 1 (#2595)

* fix extra_match low if batch_size > 1

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add sorting to logprobs

* nit

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Baber <baber@hey.com>

* fix model tests (#2604)

upgrade transformers and peft in CI

* update scrolls (#2602)

* update evaluate; update construct requests

* update construct requests to handle `apply_chat_template` kwarg

* some minor logging nits (#2609)

* remove yaml extension from phraes_va_common

* remove yaml extension from winogenerated

* remove yaml extension from phrases_es

* no cache debug logging when not used

* Fix gguf loading via Transformers (#2596)

* hf support load gguf file

* code review

* code review

* code clean up

* note about use_fast compat with gguf

---------

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>

* Fix Zeno visualizer on tasks like GSM8k (#2599)

* fix(zeno): Generate unique ids in case of multiple filters

* fix(zeno): Report even non-aggregable metrics, just not as metrics

* pre-commit

---------

Co-authored-by: Baber <baber@hey.com>

* Fix the format of mgsm zh and ja. (#2587)

* Fix the format of mgsm zh and ja.

* Add change log to mgsm.

* Add newline after changelog.

* Add HumanEval (#1992)

* add custom filter

* fix type casting of references

* add humaneval

* fix a bug in humaneval

* add greedy version of humaneval

* update tasks README

* test humaneval

* return multiple metrics

* nit

* add confirmation to run code tasks

* nit

* nit

---------

Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

* Add MBPP (#2247)

* add mbpp

* fix some bugs

* add README for mbpp

* update README

* nits

---------

Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

* Add MLQA (#2622)

* Add MLQA
* add mlqa_common_yaml

* add 49 tests of mlqa family

* update tasks/README.md

---------

* fix: mlqa ast error

* nit: removed .yaml ext from template_yaml

* nit changes: minor modifications generate_tasks.py

* deleted    lm_eval/tasks/mlqa/mlqa_common_yaml.yaml

* tests updated

* nit

* assistant prefill  (#2615)

* add assistant prefix

* add arc_challenge from llama

* nit

* nit

* nit

* add assistant prefix

* add mmlu_llama

* nit

* nit

* Revert "nit"

This reverts commit 6a97f8356237305e375212b966b30e8de59dd4bc.

* fix regex bug

* add assistant_prefix to vllm

* add `Question:`

* add mmlu_pro

* add fewshot assistant_prefix

* use `assistant_prefill`

* typehints

* nits

* nits

* add to docs

* add readme

* fix gen_prefix (#2630)

* switch arg

* update pre-commit (#2632)

* update pre-commit

* add hrm8k benchmark for both Korean and English (#2627)

* add hrm8k benchmark for both Korean and English

* apply precommit

* revise tasks so models do not answer directly; use zeroshot_cot if possible

* add README

* Add hrm8k on the task-list

---------

Co-authored-by: Baber <baber@hey.com>

* New arabicmmlu (#2541)

* point to the original ArabicMMLU dataset

* create the new subtasks files

* fix bug when the context filed is empty

* apply precommit (#2636)

* Update KorMedMCQA: ver 2.0 (#2540)

* Update KorMedMCQA: ver 2.0

* Fix pre-commit formatting issues

* Update KorMedMCQA v2.0

* pre-commit

* fix tmlu tmlu_taiwan_specific_tasks tag (#2420)

* fixed mmlu generative response extraction (#2503)

* fixed mmlu generative response extraction

* updated file version | added args to exact_match

* fix

* fix

* pre-commit

* fix groups

---------

Co-authored-by: Baber <baber@hey.com>

* revise mbpp prompt (#2645)

* aggregate by group (total and categories) (#2643)

* Fix max_tokens handling in vllm_vlms.py (#2637)

* Update vllm_vlms.py

* pre-commit

---------

Co-authored-by: Baber <baber@hey.com>

* separate category for `global_mmlu` (#2652)

* separate category

* set version 0.0

* apply precommit

* Add Moral Stories (#2653)

* Add moral stories task

* Add moral stories task

* Create README.md

* Update README.md

* Update line endings in moral_stories files

* add TransformerLens example (#2651)

* add TransformerLens example

Many people use TransformerLens to do interpretability and interventions on models, and then need to test the model.

Here is a simple script that allows one to pass in the TransformerLens model and run evaluations on it.

* Ran pre-commit checks

* fix multiple input chat template (#2576)

* feat: drop Python 3.8 support

* feat: drop Python 3.8 tests

* pre-commit

* handle chat_template for multiple input

* Add Aggregation for Kobest Benchmark (#2446)

Co-authored-by: Baber <baber@hey.com>

* update pre-commit (#2660)

* nit

* update pre-commit

* remove `group` from bigbench task configs (#2663)

* remove group from task configs

* add tags

* update readme

* Add Histoires Morales task (#2662)

* Add Histoires Morales task

* Histoires Morales task: fix mixed line endings

* Histoires Morales task: fix mixed line endings

* Remove tag for a single task

* Add some MT for Histoires Morales

* MMLU Pro Plus (#2366)

* mmlu-pro-plus is implemented

* README file is updated

* Update README.md with new task: MMLU Pro Plus

* Update README.md with new task: MMLU Pro Plus

* pre-commit

* nit

---------

Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>

* fix early return for multiple dict (#2673)

* Turkish mmlu Config Update (#2678)

* Added TurkishMMLU to LM Evaluation Harness

* Fixed COT name

* Fixed COT name

* Updated Readme

* Fixed Test issues

* Completed  Scan for changed tasks

* Updated Readme

* Update README.md

* fixup task naming casing + ensure yaml template stubs aren't registered

* Fix Regex Pattern for CoT experiments

* Fixed multiple choice accuracy

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix typos (#2679)

* fix typo

* fix typos

* fix typos

* remove cuda device assertion (#2680)

* Adding the Evalita-LLM benchmark (#2681)

* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* feat: modified fewshot split for textual entailment task

* fix: new doc_to_target function for NER tasks

* Update prompt

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Update prompt

* Add partition for few-shot evaluation

* Rename file

Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml

* Add partition for few-shot evaluation

* Add partition for few-shot evaluation

* Enhance lexical substitution management

- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task

* Add F1 macro measure for the document dating task

* Add F1-macro measure to evaluate document dating

* Use the whole dataset

* Small changes

* Add the two prompts for the task of lexical substitution

* Add few-shot split configuration

* Add few-shot split configuration

* Add function for handling few-shot learning setup

* Fix prompt

* Remove configuration file

* Update dataset from test_same to test_cross for evaluations

* Remove whitespace at end of prompt

* Fix configuration error: corrected parameter name for the dataset used in few-shot

* Fix: Check if results is not empty before processing in lexical substitution task

* added the prompts and functions for correct NER and RE execution

* Add accuracy measure

* Add tasks for the EVALITA-LLM benchmark evaluation

* Small changes

Add the alias of the task name that will be printed in the final table results.

* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task

* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.

* fix: add information on Evalita-LLM for PR

* fix: rename folders and files

* fix: remove unused imports

* chore: run pre-commit

* chore: add task description

---------

Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>

* Delete lm_eval/tasks/evalita_llm/single_prompt.zip (#2687)

* Update unitxt task.py to bring in line with recent repo changes (#2684)

* change ensure_ascii to False for JsonChatStr (#2691)

* set aggregation and higher_is_better (instead of falling back on defaults) (#2692)

* Update remaining references to assistant_prefill to gen_prefix (#2683)

* Update README.md (#2694)

* fix `construct_requests` kwargs (#2700)

* `arithmetic`: set target delimiter to empty string (#2701)

* set target delimiter to empty string

* nit

* add warning

* fix vllm (#2708)

* fix vllm

* fix data_parallel

* copy to multimodal

* add math_verify to some tasks (#2686)

* add math_verify to minerva math

* add math_verify to benchmark

* fix error

* increment version

* Logging (#2203)

* changed source of eval_logger

* allow eval_logger to be set from args

* removed verbosity arg from non-main methods

* fix logging

* pre-commit

* set verbosity in eval logger

* replace utils.eval_logger

* fix logging in main

* add logging to docs

* add logging message

* nit

* add logging to docs

* refactor setup_logging to utils

---------

Co-authored-by: Baber <baber@hey.com>

* fix missing dataset repo (#2719)

* remove unused import (#2728)

* Added IberoBench citation info (https://aclanthology.org/2025.coling-main.699/) in corresponding READMEs (#2729)

* add o3-mini support (#2697)

* add o3-mini support

* fix linter tests

* add Basque translation of ARC and PAWS to BasqueBench (#2732)

* add Basque translation of ARC and PAWS to BasqueBench

* pre-commit

---------

Co-authored-by: Baber <baber@hey.com>

* add cocoteros_es dataset (#2721)

Co-authored-by: Robiert Sepulveda Torres <rsepulveda911112@gmail.com>

* Fix the import source for eval_logger (#2735)

* Fix the import source for eval_logger

* fix logging

---------

Co-authored-by: Baber <baber@hey.com>

* add humaneval+ and mbpp+ (#2734)

* add humaneval+ and mbpp+

* add newline at end of file

* Support SGLang as Potential Backend for Evaluation (#2703)

* initial components to support sglang

* init of class SGLangLM

* draft for generate_until of SGLang model

* mock loglikelihood

* initial loglikelihood_tokens

* todo: fix bug of sglang engine init

* implement generation tasks and test

* support output type loglikelihood and loglikelihood_rolling (#1)

* .

* loglikelihood_rolling

* /

* support dp_size>1

* typo

* add tests and clean code

* skip tests of sglang for now

* fix OOM error of sglang pytest

* finish test for sglang

* add sglang to readme

* fix OOM of tests and clean SGLang model

* update readme

* clean pyproject and add tests for evaluator

* add accuracy tests and it passed locally

* add notes for test

* Update README.md

update readme

* pre-commit

---------

Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

* fix log condition (#2737)

* fix vllm data parallel (#2746)

* remove ray.remote resources

* remove kobest tag (registered as group)

* [Readme change for SGLang] fix error in readme and add OOM solutions for sglang (#2738)

* initial components to support sglang

* init of class SGLangLM

* draft for generate_until of SGLang model

* mock loglikelihood

* initial loglikelihood_tokens

* todo: fix bug of sglang engine init

* implement generation tasks and test

* support output type loglikelihood and loglikelihood_rolling (#1)

* .

* loglikelihood_rolling

* /

* support dp_size>1

* typo

* add tests and clean code

* skip tests of sglang for now

* fix OOM error of sglang pytest

* finish test for sglang

* add sglang to readme

* fix OOM of tests and clean SGLang model

* update readme

* clean pyproject and add tests for evaluator

* add accuracy tests and it passed locally

* add notes for test

* Update README.md

update readme

* pre-commit

* add OOM guideline for sglang and fix readme error

* fix typo

* fix typo

* add readme

---------

Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

* Groundcocoa (#2724)

* Fix failing tests

* Resolved merge conflicts

* pre-commit

---------

Co-authored-by: Baber <baber@hey.com>

* fix doc: generate_until only outputs the generated text! (#2755)

* Enable steering HF models (#2749)

* Enable steering HF models

Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

* increase HF download timeout

* Update readme; improve steering vector device handling

* Update latest news

* remove HF timeout increase

* fix tests

* ignore sae lens test

* fix accidental force push

---------

Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

* Add test for a simple Unitxt task (#2742)

* Add a test for a custom unitxt task

* Update task.py to bring in line with breaking change in v1.17.2

* Fix lint

* add debug log (#2757)

* increment version to 0.4.8 (#2760)

* fix: mmlu (generative) metric aggregation (#2761)

* Bugfix (#2762)

* bug fix

* add warning for instruct models

* nit

* fix verbosity typo (#2765)

* docs: Fix typos in README.md (#2778)

* initialize tokenizer with bos_token (#2781)

* Use yaml.CLoader to load yaml files when available. (#2777)

* Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only  (#2773)

* Filter new leaderboard_math_hard dataset to "Level 5" only

* align to linters

Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>

---------

Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>

* Fix for mc2 calculation (#2768)

* fix for mc2 calculation

* increment versions and changelog

---------

Co-authored-by: Baber <baber@hey.com>

* New healthcare benchmark: careqa (#2714)

* New healthcare benchmark: careqa

* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>

* Add fixes, READMES, and remove task_list.txt

* pre-commit passed, add formatting updates; add nanmean agg_metric

* Fix import error.

* Wrapped imports in try excepts

* Wrapped imports in try excepts; also metrics to catch bert_score import error

* Try except to catch ImportErrors as well

* use np.nan

* pre-commit

---------

Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>

* Capture gen_kwargs from CLI in squad_completion (#2727)

* Capture gen_kwargs from CLI in squad_completion

* Update lm_eval/tasks/squad_completion/task.py

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* Update lm_eval/tasks/squad_completion/task.py

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* pre-commit

---------

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>

* humaneval instruct (#2650)

* add instruct humaneval

* nit

* add to readme

* nit

* Update evaluator.py (#2786)

minor bug fix, lm_eval.setup_logging -> setup_logging

* change piqa dataset path (uses parquet rather than dataset script) (#2790)

* use verify_certificate flag in batch requests (#2785)

* add audio modality (qwen2 audio only) (#2689)

* Added audio-modality pipeline for qwen2-audio model

* Beauty imports

* fix apply_chat_template args

* update default audio placeholders list

* add demo task - common_voice subset

* add audiolm_qwen libs to pyproject.toml

* pre-commit beautify

---------

Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>

* Add various social bias tasks (#1185)

* Implementation of Winogender

* Minor fixes README.md

* Add winogender

* Clean winogender utils.py

* Change dataset to one containing All subsets

* Flesh out README for BBQ task

* Add missing tasks for BBQ

* Add simple cooccurrence bias task

* Fix wrong mask for ambiguated context+rename metrics

* Made generate_until evaluation (following PALM paper) default

Also moved separate config files per category to separate metrics using custom function.
Created config file for multiple_choice way of evaluating BBQ.

* Add missing version metadata

* Add missing versionmetadata for bbq multiple choice

* Fix metrics and address edge cases

* Made BBQ multiple choice the default version

* Added settings following winogrande

* Add num_fewshot to simple_cooccurrence_bias

* Fixes for bbq (multiple choice)

* Fix wrong dataset

* CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets.

* Use simplest prompt possible without description

* Merge

* BBQ: Fix np.NaN related bug

* BBQ: Fix wrong aggregation method for disamb accuracy

* BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval)

* BBQ: fix showing one target in case of few-shot evals

* BBQ: Fix few-shot example for bbq_generate

* BBQ: simplify subtasks

* BBQ: Minimize number of UNK variations to reduce inference time

* BBQ: Add extra UNK keywords for the generate task

* Add a generate_until version of simple_cooccurrence_bias

* Change system/description prompt to include few-shot examples

* Group agg rework

* Run pre-commit

* add tasks to readme table

* remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text`

* fix

* fix

* fix version

---------

Co-authored-by: Baber <baber@hey.com>

* update pre-commit (#2799)

* Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot (#2802)

* Update openllm.yaml to use train fewshot split for arc

* Add INCLUDE tasks (#2769)

* Add INCLUDE tasks

* pacify pre-commit

---------

Co-authored-by: Baber <baber@hey.com>

* Add support for token-based auth for watsonx models (#2796)

* Add support for token-based auth for watsonx models

* Fix lint

* Move dotenv import to inner scope

* Improve readability of _verify_credentials

* add __version__ (#2808)

* add __version__

* add version consistency check to publish action

* Add cocoteros_va dataset (#2787)

* Add cocoteros_va dataset

* Fix format in cocoteros_va.yml

* Undo newline added

* Execute pre-commit to fix format errors

* Update catalan_bench.yaml version and add Changelog section into Readme.md

* Add MastermindEval (#2788)

* add MastermindEval benchmark

* fill out checklist

* Add loncxt tasks (#2629)

suport for longcontext (and other synthetic tasks)
* add ruler
* add longbench
* pass `metadata` to TaskConfig

* [hf-multimodal] pass kwargs to self.processor (#2667)

* add min_pixels, max_pixels

* fix

* [MM] Chartqa (#2544)

* add changelog to readme template

* add readme

* add to task list

* Allow writing config to wandb (#2736)

* Allow writing config to wandb

* set defaults

* Update help

* Update help

* [change] group -> tag (#2813)

* Clean up README and pyproject.toml (#2814)

* Update CODEOWNERS

* Llama3 mmlu correction (#2797)

* Update continuation template YAML for MMLU task with new generation and filtering options

* Refactor filter_list structure in continuation template YAML for improved readability

* Add 'take_first' function to filter_list in continuation template YAML

* Update filter_list in continuation template YAML to use 'strict_match' and modify filtering functions

* Add 'do_sample' option to generation_kwargs in MMLU template YAML

* Add Markdown linter (#2818)

* Add markdown linter to pre-commit hooks

* Reformat existing markdown (excluding lm_eval/tasks/*.md)

* Configure the pad tokens for Qwen when using vLLM (#2810)

* fix typo (#2820)

* [VLLM, SLANG] default temp=0.0 (#2819)

* Fixes to mmlu_pro_llama (#2816)

* Update generation_kwargs in default template to include additional end tokens

* Update filter_list in MMLU Pro configuration to use strict_match

* Update _default_template_yaml

* Add MMLU-ProX task (#2811)

* update mmlu_prox configs

* update tasks/README

* correct hyphen to underscore in task/README

* update pre-commit codes

* Remove unnecessary nested list in MMLU-Pro default template YAML (#2827)

* feat: replace library (#2828)

I haven't had time to review the library that's replacing tj-actions or whether this change breaks anything, but the vulnerability is quite severe and I would rather the functionality be broken than risk compromise.

**to do:** review this later

* Multilingual MMLU for Llama instruct models (#2826)

* Multilingual MMLU

* Refactor process_docs function calls for clarity and consistency

* changed dataset to parquet version (#2845)

* Fix typo in longbench metrics (#2854)

* Add kmmlu multiple-choice(accuracy) task (#2849)

* Adding ACPBench task (#2807)

* Adding acpbench task

* adding ACPBench in Tasks readme.

* running precommit

* add Darija (Moroccan dialects) tasks including darijammlu. darijahellaswag and darija_bench (#2521)

* add Darija tasks

* fix multiple groups issue in darijammlu

* add MT to the description of the Darija tasks

* Update README.md

nit

* fix the recursion error caused by the darija_summarization task

* use a custom filter instead of the decorator for the strip function

---------

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* Increase default max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tests (#2824)

* Changed default max_length from 2048 to 8192 and max_gen_toks from 256 to 2048 for MMLU Pro tasks.

* Update lm_eval/tasks/mmlu_pro/_default_template_yaml

* pre-commit

* nit

---------

* move warning (#2857)

* Fix: ACPBench Link (#2860)

* Adds MMLU CoT, gsm8k and arc_challenge for llama instruct (#2829)

* llama-style MMLU CoT

* Refactor MMLU CoT template YAML to simplify 'until' structure

* Add GSM8K task configuration for LLaMA3 with few-shot examples

* Fix missing newline at end of MMLU CoT YAML file

* Add ARC-Challenge task configuration and processing utility

* Add additional MMLU and ARC-Challenge task variants to README

* Update README with notes on arc_challenge_llama dataset preprocessing

* [leaderboard] math - sync with repo (#2817)

* sync with leaderboard

* also output old metric

* wrap old extraction in try except

* better log

* Update supported models (#2866)

* Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs (#2865)

* Add JSON schema benchmark

* Update lm_eval/tasks/jsonschema_bench/metrics.py

Thanks for catching this

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* run pre-commit

* add description to task catalogue readme

---------

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>

* leaderboard - add subtask scores (#2867)

* add subtask scores

* pacify pre-commit

* Fix the deps of longbench from jeiba to jieba (#2873)

Signed-off-by: Lu Fang <lufang@fb.com>

* Optimization for evalita-llm rouge computation (#2878)

* feat: initial commit with templates for evalita evaluation

* fix: change rule for generate_until

* feat: modified yaml to use reduced version of NER test datasets

* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)

* Add Six Prompts for Each Multiple-Choice Task

* fix: fastest eval for summarization

* chore: linted with ruff

* chore: linted with ruff

---------

Co-authored-by: rzanoli <zanoli@fbk.eu>

* Update authentications methods, add support for deployment_id for IBM watsonx_ai (#2877)

* update authentication methods, add support for deployment_id

* run pre-commit on changed file

* Add GSM8K Platinum (#2771)

* add gsm8k platinum

* only test splits

* wrong dataset

* link to blog

* format

* Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)

* added option --examples

* specifying examples in dictionary

* run pre-commit - fix arg type

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* fixing bug when examples==None

* fixing bug when examples==None

* limit or examples must be None in simple_evaluate.py and in evaluator.py

* run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* merge main and run pre-commit (fix formatting)

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>

* Update __main__.py

undefined "limit" and "examples"

* update branch, fix conflicts, run pre-commit

* nits

* nits

* change 'examples' to 'samples'

---------

Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Baber <baber@hey.com>

* Extend support for chat template in vLLM (#2902)

* Add support for chat templates defined outside of tokenizer_config.json, as supported by vLLM

* Update template name to avoid conflict with other variable

* tasks README: fix dead link (#2899)

* Add support for quantization_config (#2842)

* Add support for quantization_config

Summary:
Previously quantization_config is ignored, so torchao quantized models are not supported,
this PR adds that.

Test Plan:
lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8

Reviewers:

Subscribers:

Tasks:

Tags:

* quantization_config is optional

* Fix a typo in README for tasks (#2910)

* fix resolve_hf_chat_template version (#2917)

* fix resolve_hf_chat_template version

* pre-commit

* mmlu - switch dataset to cais/mmlu; fix tests (#2918)


* switch MMLU to cais/mmlu

* switch back to tj-actions/changed-files

* cache HF folder

* init pixels before tokenizer creation (#2911)

* Longbench bugfix (#2895)

* add warning in for default until

* fix stop tokens; add vcsum

* bugfix:fix doc_to_target to string

* fix lsht, trec

* add task to readme

* add debugging logs for multiple input/output

* Added softmax_dtype argument to HFLM to coerce log_softmax computations (#2921)

* Added softmax_dtype argument to coerce log_softmax computations

* move softmax_dtype

---------

Co-authored-by: Baber <baber@hey.com>
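
A hedged illustration of what a softmax_dtype knob like this does (standalone sketch, not the harness's HFLM code): compute log_softmax in a wider dtype such as float32 even when the model itself runs in bf16/fp16, to avoid precision loss in the log-probabilities.

```python
import torch
import torch.nn.functional as F

def logprobs(logits: torch.Tensor, softmax_dtype=torch.float32) -> torch.Tensor:
    # coerce the log_softmax computation to softmax_dtype when one is given
    if softmax_dtype is not None:
        return F.log_softmax(logits, dim=-1, dtype=softmax_dtype)
    return F.log_softmax(logits, dim=-1)

scores = logprobs(torch.randn(2, 5, 32000, dtype=torch.bfloat16))
```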

* use np.NaN (#2937)

* Add support for enable_thinking argument in vllm model, set default to False (#2947)

* Added NorEval, a novel Norwegian benchmark (#2919)

* added noreval

* added a checklist for noreval

* run pre-commit

* changed imports and added short noreval description

* fixed norsumm path

* refactored multi-folder tasks

* refactored multi-folder tasks

* Fix import error for eval_logger in score utils (#2940)

* Fix import error for eval_logger in score utils

* pacify pre-commit

---------

Co-authored-by: Baber <baber@hey.com>

* Include all test files in sdist (#2634)

This is useful to run unit tests during distro builds.

* Change citation name (#2956)

This hasn't been a library for few shot language model evaluation in quite a while. Let's update the citation to use "the Language Model Evaluation Harness" as the title.

* add warning on truncation (#2962)

* fix: type error while checking context length (#2972)

* Fix import error for deepcopy (#2969)

Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

* Pin unitxt to most recent major version to avoid test failures (#2970)

Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

* mmlu pro generation_kwargs until Q: -> Question: (#2945)

* mmlu pro generation_kwargs until Q: -> Question:

* pacify pre-commit

* change stop token

---------

Co-authored-by: Baber <baber@hey.com>

* AfroBench: How Good are Large Language Models on African Languages? (#2825)

* add afrixnli to task

* add chat completion

* remove chat completion -untested

* afrimmlu added

* afrimmlu folder update

* afrimmlu folder update

* updated prompt

* remove print

* add afrimgsm -direct

* add squad metric

* fix bash script

* remove direct util, update common yaml

* remove print

* add few show. metric fixes

* fix direct path, add bash script for gpt models

* added transate test

* update afrixnli tasks

* update afrixnli tasks

* update metrics for afrixnli

* prompt translations fix

* prompt translations fix

* filter and metric fix -mgsm

* remove squad metric

* remove squad metric

* add f1 score to mgsm

* add f1 score to mgsm

* update native-direct with lin

* change f1 function

* add lin to utils

* add utils

* remove test limit

* remove test configs

* add swahili to mmlu

* change eng to ewe in ewe yaml mmlu

* add squad metric to mgsm, remove whitespace filter

* added translate test

* added afrixnli_translate

* fix exact match valueError

* fix exact match valueError

* restructure mmlu folder

* spacing

* remove afrimmlu_translate folder

* add utility

* format task name, clean ups

* modified mgsm

* update on afrimgsm

* update on afrimgsm

* removed utils

* other mgsm varieties

* other mgsm varieties

* adding translate direct

* Update translate_direct_yaml

* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model

* edit for open models

* Update translate_direct_yaml

* add verbalizer for xnli

* change xnli from multiple choice to generate

* add manual accuracy scores

* revert xnli to multiple choice

* change afrimgsm utils

* revert xnli to multiple_choice

* cleanups and readmes

* remove openai fixes and unused regex

* pr review changes

* revert metrics.py, task.py and extraction.py to main version

* add afrisenti

* utilities

* pulled from main

* add afrixnli

* add afrimmlu

* update afrixnli prompts

* missing senti language

* fix afrisenti prompt 2

* fix afrisenti prompts

* fix afrisenti prompts

* configure task grouping

* add multiple prompts to afrixnli for irokobench

* add multiple prompts to afrimmlu for irokobench

* Update afrixnli_yaml

* fixes and moves

* fixes and moves

* afrimmlu multiple prompts configs

* remove validation set from afrimmlu

* remove eng from afrimmlu translate test

* correct dataset path

* multiple prompts for mgsm

* file restructure

* afribench grouping

* repo restructuring

* repo restructuring

* update exact match to hugging face exact match and add new mgsm language

* remove decontamination

* update generation kwargs

* update generation kwargs for all mgsm prompts

* remove lang

* update generation kwargs for afrimgsm translatetest

* add afrimgsm cot for direct and translate

* remove eng from translate-cot

* add masakhaPOS tasks

* remove changes from task script

* add masakhanews tasks

* add uhura arc easy

* add afriqa and belebele files

* add tags for easier run. add naija rc

* add new metrics and transformation scripts

* fix afriqa swa fewshot split

* add naijarc

* add afrobench lite tasks

* update afrobench

* update afrobench

* remove unverified files to avoid bugs

* remove files not needed

* add afrobench tasks

* add afrobench tasks

* change to version 1

* change to version 1

* update afrobench

* update afrobench

* restore metric to original script

* update readme instructions

* add individual dataset readmes

* add link to collections

* correct run script

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* align with main

* failed run fixes

* failed run fixes

* add afrimgsm cot

* Apply precommit fixes

* update mafand dataset name

* pull request fixes

* remove afrihate due to availability

---------

Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>

* Added C4 Support (#2889)

* added c4 dataset (working)

* fixed bugs in c4

* fixed loading bugs in c4 dataset; using partial loading

* cleaned the code

* added version number for c4

* removed irrelevant files

* Update utils.py (#2870)

* feat: add question suffix (#2876)

* Add device arg to model_args passed to LLM object in VLLM model class (#2879)

* fix: pass device arg in model_ar in vllm_causallms

* casting device arg to str in vLLM model args

* fix formatting (#2759)

* Delete scripts/cost_estimate.py (#2985)

This function was written years ago when the cost of running an OpenAI model was easy to compute. It is no longer viable to support this.

* Adding ACPBench Hard tasks (#2980)

* adding ACPBench_hard

* adding Clingo

* changing tarski to tarski[clingo]

* denoting the main variants in each paper

* [SGLANG] Add the SGLANG generate API (#2997)

* add `sglang-generate`

* nit

* nit

* nit

* pacify pre-commit

* fix github parse error (#2998)

* Log tokenized request warning only once (#3002)

* Log tokenized request warning only once

* Fix logging for concurrent usecase as well

* add kbl 2025 (#3000)

* Output path fix (#2993)

* fix(output_path): support direct JSON file paths

* fix linting

* turn off external Lm tests for now

* Update help text for `output_path`

---------

Co-authored-by: Baber <baber@hey.com>

* use images with api models (#2981)

* use images with apis

* pacify pre-commit

* Adding resize images support (#2958)

* first version of image resizing

* fixed bug

* clean up `resize_image`

---------

Co-authored-by: Artem Safin <artemsafin67@gmail.com>
Co-authored-by: Baber <baber@hey.com>

* Revert "feat: add question suffix (#2876)" (#3007)

This reverts commit 4dbd5ec9

* change multimodal check in evaluate (#3013)

changed multimodal check from strict equality

* [Fix] Update `resolve_hf_chat_template` arguments (#2992)

* fix arguments

* pacify pre-commit

---------

Co-authored-by: Baber <baber@hey.com>

* Fix error when collating queries with different continuation lengths (fixes #2984) (#2987)

* FIX error due to grouping queries with different continuation length

Make Collator choose query with the longest continuation as the
candidate for generation

* use max for key selection

* added comments explaining variable cont length (identical ctx+cont[:-1])

---------

Co-authored-by: Baber <baber@hey.com>
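
A minimal sketch of the idea described above (hypothetical names, not the harness's actual Collator): when several requests share the same context plus continuation[:-1], choose the request with the longest continuation as the group's representative so one forward pass yields logits covering every grouped request.

```python
from collections import defaultdict

def pick_candidates(requests):
    # requests: list of (ctx_tokens, cont_tokens) pairs
    groups = defaultdict(list)
    for ctx, cont in requests:
        # requests identical up to the final continuation token share a key
        groups[tuple(ctx + cont[:-1])].append((ctx, cont))
    # the longest continuation's logits cover the shorter ones in its group
    return [max(group, key=lambda r: len(r[1])) for group in groups.values()]
```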

* [vllm] data parallel for V1 (#3011)

* add data_parallel for V1

* use Process instead of Queue

* ray used if V0 DP

* better error handling

* fix truncation warning comparison

* add arab_culture task (#3006)

* add arab_culture tasks

* add target_delimeter and remove debugging code

* chore: clean up and extend .gitignore rules (#3030)

* chore: clean up and extend .gitignore rules

* pacify pre-commit

---------

Co-authored-by: Baber <baber@hey.com>

* Enable text-only evals for VLM models (#2999)

* [Fix] acc_mutual_info metric calculation bug (#3035)

* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices

* add tests

* fix: fix vllm issue with DP>1 (#3025)

* add Mbpp instruct (#2995)

* feat: add mbpp_instruct

* fix: update generation_kwargs to use an empty until list

* fix: correct predictions formatting in pass_at_1 function

* fix: improve code block extraction by checking first without opening backticks

* fix mbpp `pass_at_1`

* remove prints (#3041)

* [longbench] fix metric calculation (#2983)

* use all answers

* use middle truncation

* maybe fix classification score

* strip classification preds

* [vllm] remove stop tokens post-hoc

* strip all preds

* pacify pre-commit

* start on truncation utility

* add to readme

* add a footgun doc

* fix newline in yaml templates

* do not strip code_sim preds!

* fix pre-commit config

* fix instruction warning

* add not to longbench readme

* Fallback to super impl in fewshot_context for Unitxt tasks (#3023)

Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

* Fix Typo in README and Comment in utils_mcq.py (#3057)

* Update README.md

* Update utils_mcq.py

* fix longbench citation (#3061)

* fix longbench citation

* Update README.md (#3070)

Wrong task name: mmlu_generation doesn't exist -> mmlu_generative is the correct one

* Update instructions.py (#3060)

* bump version to `0.4.9` (#3073)

* llama3 task: update README.md (#3074)

"arc_chalenge_chat" doesn't exist: I think it should be "arc_challenge_chat", but this task is not implemented here (see arc task folder).

* Fix Anthropic API compatibility issues in chat completions (#3054)

* Fix Anthropic API compatibility issues in chat completions

solves two important compatibility issues between the LM Eval Harness and Anthropic's API:

1) The type field issue - Anthropic's Messages API doesn't accept the type field that other APIs might expect and that was previously included
2) The stop sequences issue - Anthropic requires stop sequences to contain non-whitespace characters

tested with the most recent models from Anthropic (claude-sonnet-4-0, claude-opus-4-0); resolved my local API errors

* pacify pre-commit

* add type

---------

Co-authored-by: Baber <baber@hey.com>
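
A hedged sketch of the two adjustments described above (illustrative helper, not the harness's actual request-building code): strip the type field from chat messages and drop whitespace-only stop sequences before calling Anthropic's Messages API.

```python
def sanitize_for_anthropic(messages, stop_sequences):
    # Anthropic's Messages API rejects the extra "type" field that some other
    # chat-completions APIs expect on message dicts
    cleaned_messages = [{k: v for k, v in m.items() if k != "type"} for m in messages]
    # stop sequences must contain at least one non-whitespace character
    cleaned_stops = [s for s in stop_sequences if s.strip()]
    return cleaned_messages, cleaned_stops
```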

* Ensure backwards compatibility in fewshot_context by using kwargs (#3079)

Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>

* remove system message if `TemplateError` (#3076)

* feat / fix: Properly make use of `subfolder` from HF models (#3072)

* add subfolder

* lint

* change it to empty string

* fix typehints

---------

Co-authored-by: Baber <baber@hey.com>

* [HF] fix quantization config (#3039)

* Try fixing issue 3026, which is caused by the quantization_config argument introduced in commit 758c5ed.
The argument is a dict, but for a GPTQ-quantized model this conflicts with the Hugging Face interface, which expects a QuantizationConfigMixin.
The initial solution removed the quantization_config argument in HFLM._create_model() in lm_eval/models/huggingface.py; further modification is required to restore the functionality provided by the previous commit.

* wrap quantization_config in AutoQuantizationConfig

* handle quantization config not dict

* wrap quantization_config in AutoQuantizationConfig if dict

---------

Co-authored-by: shanhx2000 <hs359@duke.edu>
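
A sketch of the wrapping described above, assuming transformers' AutoQuantizationConfig and its from_dict constructor (the import path and harness wiring here are illustrative, not verbatim from the PR):

```python
from transformers.quantizers.auto import AutoQuantizationConfig

def normalize_quant_config(quantization_config):
    # a raw dict read from config.json conflicts with loaders (e.g. GPTQ) that
    # expect a QuantizationConfigMixin, so wrap it before passing it along
    if isinstance(quantization_config, dict):
        return AutoQuantizationConfig.from_dict(quantization_config)
    return quantization_config
```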

* FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092)

* Fix: Align the Humaneval dataset with official results

Details: (1) modified the "doc_to_text" and "gen_prefix" in the "humaneval_instruct.yaml" file to make them the same as the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".

(2) Changed r.rfind("```") to r.find("```") so it locates the first "```", not the last one.

Results: partially reproduced the official results: LLaMA3.1-8B-Instruct scores 66.5 (official: 72.6) and LLaMA3.1-70B-Instruct scores 80.5 (official: 80.5).

Ref: PR#2650

* add changelog and version

* add changelog
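
A standalone illustration of the rfind -> find change described above: extract the first fenced code block from the model response instead of the last one (not the harness's actual filter code).

```python
FENCE = "`" * 3  # a triple-backtick fence, spelled out to avoid nesting fences here

def extract_first_code_block(response: str) -> str:
    start = response.find(FENCE)
    if start == -1:
        return response
    start = response.find("\n", start) + 1  # skip the opening fence line
    end = response.find(FENCE, start)
    return response[start:end] if end != -1 else response[start:]
```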

* Truthfulqa multi harness (#3062)

* truthfulqa-multi task

* truthfulqa-multi with chat few-shot

* few shot chat implementation

* changed until so it outputs lists

* changed dataset location

* added MT task

* Create README.md

* do not include MT

* changes for PR

* tag change

* removed yaml extension

* adding task to the table

* fix task configs

* add import exception

---------

Co-authored-by: Baber <baber@hey.com>

* Fix: Reduce CLI loading time from 2.2s to 0.05s (#3099)

* Lazy-load submodules to reduce import time

* pacify pre-commit

---------

Co-authored-by: Baber <baber@hey.com>
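
An illustrative sketch of the lazy-loading pattern referenced above, using a module-level __getattr__ (PEP 562) so heavy submodules are imported only on first access; the module names are examples, not a copy of the package's actual __init__.py.

```python
import importlib

_LAZY_SUBMODULES = {"evaluator", "tasks", "models"}  # example names

def __getattr__(name):
    # defer importing heavy submodules until an attribute is first requested,
    # keeping a plain `import lm_eval` (and the CLI startup) fast
    if name in _LAZY_SUBMODULES:
        return importlib.import_module(f".{name}", __name__)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```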

* Humaneval - fix regression (#3102)

* use double quotes

* Bugfix/hf tokenizer gguf override (#3098)

* fix(hf-gguf): skip gguf_file if external tokenizer is provided

* docs(readme): add instructions for evaluating GGUF models with Hugging Face backend

* [FIX] Initial code to disable multi-proc for stderr (#3106)

* [FIX] Initial code to disable multi-proc for stderr

* add docs; align no-mp bootstrap with mp

---------

Co-authored-by: Baber <baber@hey.com>
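
A minimal single-process bootstrap standard error, as an illustration of what the no-multiprocessing path computes; the harness's real implementation adds more bookkeeping and can optionally use a process pool for large iteration counts.

```python
import random
import statistics

def bootstrap_stderr(metric_fn, samples, iters=1000, seed=1234):
    rng = random.Random(seed)
    estimates = [
        metric_fn([rng.choice(samples) for _ in samples]) for _ in range(iters)
    ]
    return statistics.stdev(estimates)

# e.g. stderr of the mean accuracy over per-sample 0/1 scores
print(bootstrap_stderr(lambda xs: sum(xs) / len(xs), [0, 1, 1, 0, 1]))
```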

* remove all; reformat table (#3107)

* delete unneeded files (#3108)

* delete unneeded files

* Fixed #3005: Processes both formats of model_args: string and dictionary (#3097)

* git push --force
correctly processes both formats of model_args: string and dictionary

* exctract to function for better test

* nit

---------

Co-authored-by: Baber <baber@hey.com>

* add image hashing and `LMEVAL_HASHMM` envar (#2973)

* add image hashing

* remove unused params decription

* use `LMEVAL_HASHMM` (default '1') to save raw images

---------

Co-authored-by: Baber <baber@hey.com>

* delete neuralmagic models (#3112)

* Neuralmagic (#3113)

* remove sparse-ml

* check pil dep (#3114)

* warning for "chat" pretrained; disable buggy evalita configs (#3127)

* check for chat for warning

* add test

* remove yaml extension from some evalita configs

* move unitxt to own test script

* fix CI test

* fix: remove warning (#3128)

* Adding EgyMMLU and EgyHellaSwag (#3063)

* add egy mmlu hellaswag

* add egymmlu egyhellaswag to tasks readme

* fix egymmlu config generation

* fix _generate_configs formating

* Added mixed_precision_dtype arg (#3138)

* Fix for hang due to mp.Pool in bootstrap_stderr (#3135)

* make pytorch an optional dependency

* remove FakeLM

* missed a torch

* print statements for the recursion

* sigh

* changes

* change task configs

* update mmlu config

* include soft metrics for MMLU

* logging metrics for mathqa

* lint

* Discrim Eval From Upstream

* Add Forced Versions of this Metric

* Scaling Law Metrics

* Torches

* Fix Pull Main, Add Metrics for Bias Tasks

* Lint

* Unused

* CohereForAI -> CohereLabs

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Naiara Perez <naiara.pme@gmail.com>
Co-authored-by: Trawinski, Dariusz <dariusz.trawinski@intel.com>
Co-authored-by: Baber <baber@hey.com>
Co-authored-by: Slawomir Strehlke <slawomir.strehlke@intel.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com>
Co-authored-by: Yao Matrix <yaoweifeng0301@126.com>
Co-authored-by: Rima Shahbazyan <74137119+rimashahbazyan@users.noreply.github.com>
Co-authored-by: shivalika-singh <shivalikasingh@cohere.com>
Co-authored-by: shivi <shivalikasingh95@gmail.com>
Co-authored-by: Sabrina J. Mielke <sjm@sjmielke.com>
Co-authored-by: Firoj Alam, Scientist, QCRI <firojalam@users.noreply.github.com>
Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: CL-ModelCloud <cl@modelcloud.ai>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Co-authored-by: Petr Baudis <pasky@ucw.cz>
Co-authored-by: Wenyang LUO <86722018+timturing@users.noreply.github.com>
Co-authored-by: Hojin Lee <nyx1371@snu.ac.kr>
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Shivansh Pachnanda <114482037+KahnSvaer@users.noreply.github.com>
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
Co-authored-by: Boda Sadallah <abdelrahman.sadallah@mbzuai.ac.ae>
Co-authored-by: Gyouk Chu <94156717+GyoukChu@users.noreply.github.com>
Co-authored-by: nike00811 <nike00811@gmail.com>
Co-authored-by: Ramiro R. C. <rawthil@gmail.com>
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai>
Co-authored-by: Irina Proskurina <72871167+upunaprosk@users.noreply.github.com>
Co-authored-by: Nicky Pochinkov <52249105+nickypro@users.noreply.github.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: asgsaeid <43481290+asgsaeid@users.noreply.github.com>
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Arda <ge32max@mytum.de>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: omahs <73983677+omahs@users.noreply.github.com>
Co-authored-by: Michele Resta <79645321+m-resta@users.noreply.github.com>
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
Co-authored-by: Kiersten Stokes <kierstenstokes@gmail.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: James A. Michaelov <32554945+jmichaelov@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: Farhan Ahmed <Farhan.Ahmed@ibm.com>
Co-authored-by: Naiara Perez <naiara.perez@ehu.eus>
Co-authored-by: Jocelyn <34988596+HelloJocelynLu@users.noreply.github.com>
Co-authored-by: Santiago Galiano Segura <71637365+sgs97ua@users.noreply.github.com>
Co-authored-by: Robiert Sepulveda Torres <rsepulveda911112@gmail.com>
Co-authored-by: Kailashbuki <111277+kailashbuki@users.noreply.github.com>
Co-authored-by: Jinwei <55192557+Monstertail@users.noreply.github.com>
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Harsh Kohli <harsh14791@gmail.com>
Co-authored-by: Lucia Quirke <luciarosequirke@gmail.com>
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>
Co-authored-by: Yongkeun Hwang <ykstyle@ykstyle.info>
Co-authored-by: Rui Vieira <rcardoso@redhat.com>
Co-authored-by: Giulio Lovisotto <giuliolovisotto@gmail.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Co-authored-by: Kajetan Dymkiewicz <kajetan.dymkiewicz@gmail.com>
Co-authored-by: PabloAgustin <pabloagustinquemas@gmail.com>
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Surya Kasturi <kasturisurya@gmail.com>
Co-authored-by: Zeyuan Allen-Zhu <zhuzeyuan@hotmail.com>
Co-authored-by: daniel-salib <danielsalib@meta.com>
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>
Co-authored-by: Oskar van der Wal <56364990+oskarvanderwal@users.noreply.github.com>
Co-authored-by: Avelina9X <37878580+Avelina9X@users.noreply.github.com>
Co-authored-by: Angelika Romanou <angelika.romanou@epfl.ch>
Co-authored-by: Jonas Golde <jonas.golde@gmail.com>
Co-authored-by: Jaedong Hwang <jdhwang730@gmail.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Alexandre Marques <almarque@redhat.com>
Co-authored-by: Yifei Zhang <yifei.zhang1992@outlook.com>
Co-authored-by: heli-qi <93250319+heli-qi@users.noreply.github.com>
Co-authored-by: Alexandre Marques <alexandre@neuralmagic.com>
Co-authored-by: Bruno Carneiro <brunocarneirofs@gmail.com>
Co-authored-by: wackey <386622495@qq.com>
Co-authored-by: Jinho Heo <70141850+Aprilistic@users.noreply.github.com>
Co-authored-by: Harsha <858059+harshakokel@users.noreply.github.com>
Co-authored-by: Hadi Abdine <59775564+hadi-abdine@users.noreply.github.com>
Co-authored-by: dazipe <126095259+dazipe@users.noreply.github.com>
Co-authored-by: Daniel Holanda <holand.daniel@gmail.com>
Co-authored-by: Saibo-creator <53392976+Saibo-creator@users.noreply.github.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Nikodem Szwast <97400923+Medokins@users.noreply.github.com>
Co-authored-by: Felipe Maia Polo <felipemaiapolo@gmail.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Daniele <36171005+dtrifiro@users.noreply.github.com>
Co-authored-by: Jerry Zhang <jerryzh168@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Vladislav Mikhailov <43072268+vmkhlv@users.noreply.github.com>
Co-authored-by: Anna Fontana <101867173+annafontanaa@users.noreply.github.com>
Co-authored-by: Ihar Hrachyshka <ihrachys@redhat.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: Yoonsoo Kim <34365327+yoonniverse@users.noreply.github.com>
Co-authored-by: Jess <jessicaojo19@gmail.com>
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>
Co-authored-by: Yufeng Xu <yx3038@nyu.edu>
Co-authored-by: tawsif <sleeping4cat@outlook.com>
Co-authored-by: Tingchen Fu <48080217+TingchenFu@users.noreply.github.com>
Co-authored-by: Filippo Momentè <68816087+momentino@users.noreply.github.com>
Co-authored-by: Rob Geada <rob@geada.net>
Co-authored-by: Hongseok Oh <97136787+abzb1@users.noreply.github.com>
Co-authored-by: Niccolò Ajroldi <61059403+Niccolo-Ajroldi@users.noreply.github.com>
Co-authored-by: Artem Safin <artemsafin67@gmail.com>
Co-authored-by: fxmarty-amd <felmarty@amd.com>
Co-authored-by: Ameya Godbole <ameyag416@gmail.com>
Co-authored-by: Boda Sadallah <bodasadallah@gmail.com>
Co-authored-by: Ivan Stankevich <105574942+e1washere@users.noreply.github.com>
Co-authored-by: Yury Sulsky <yury.sulsky@gmail.com>
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: fuder.eth <139509124+vtjl10@users.noreply.github.com>
Co-authored-by: Maxim Evtush <154841002+maximevtush@users.noreply.github.com>
Co-authored-by: NourFahmy <35409519+NourFahmy@users.noreply.github.com>
Co-authored-by: shanhx2000 <hs359@duke.edu>
Co-authored-by: jinze <46251666+userljz@users.noreply.github.com>
Co-authored-by: Blanca Calvo <33485967+BlancaCalvo@users.noreply.github.com>
Co-authored-by: Alex Stachowiak <alexander@computer.org>
Co-authored-by: Ankush <51945739+ankush13r@users.noreply.github.com>
Co-authored-by: Neel Gupta <neelgupta04@outlook.com>
Co-authored-by: Debjyoti Ray <33850567+DebjyotiRay@users.noreply.github.com>
Co-authored-by: Atou Houdaifa <atou.hdf@gmail.com>
Co-authored-by: Ankit Gola <ankitgola005@gmail.com>
Co-authored-by: David Hall <dlwh@stanford.edu>
Co-authored-by: Nikil Ravi <nravi@stanford.edu>
Co-authored-by: Suhas Kotha <38450656+kothasuhas@users.noreply.github.com>