Add support for quantization_config #2842
Conversation
Summary: Previously quantization_config was ignored, so torchao-quantized models were not supported; this PR adds that support. Test Plan: lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8
Why is this necessary? Shouldn't the …

@StellaAthena you mean pass …

Hi! Thanks for the PR. Left a comment. Although it does look like …

Sorry, I misunderstood. I thought you were saying that if the user provides quantization configs when calling lm-eval they're ignored.

looks like it's not picked up from there, maybe because the quantization_config we want is from …

@baberabb @StellaAthena I updated the PR, can you take a look again?

Hi @baberabb @StellaAthena, can you take a look again?

LGTM! Thank you!

@baberabb can you also help merge? The CI errors do not look relevant.
* Add support for quantization_config Summary: Previously quantization_config was ignored, so torchao-quantized models were not supported; this PR adds that support. Test Plan: lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8 * quantization_config is optional
* skip casting if predict_only (#2524)
* make utility function to handle `until` (#2518)
* make utility function to handle `until`
* fix text
* Update Unitxt task to use locally installed unitxt and not download Unitxt code from Huggingface (#2514)
* Moved to require unitxt installation and not download unitxt from HF hub.
This has performance benefits and simplifies the code.
Signed-off-by: Yoav Katz <katz@il.ibm.com>
* Updated watsonx documentation
* Updated installation instructions
* Removed redundant comma
* Allowed unitxt tasks to generate chat APIs
Modified WatsonXI model to support chat apis
* Removed print
* Run precommit formatting
---------
Signed-off-by: Yoav Katz <katz@il.ibm.com>
* add Basque translation of PIQA (piqa_eu) to BasqueBench (#2531)
* avoid timeout errors with high concurrency in api_model (#2307)
* avoid timeout errors with high concurrency in api_model
* style
* add timeout
* add docs
---------
Co-authored-by: Baber <baber@hey.com>
* Update README.md (#2534)
* Update README.md
add caching tip to readme
* Update README.md
add api link
* add better testing when both doc_to_text ends in and target_delimiter are whitespaces (#2535)
* Support pipeline parallel with OpenVINO models (#2349)
* Handle pipeline_parallel parameter
* Add description of pipeline parallelism with OV models
* Update README.md (#2546)
* [API] left truncate for generate_until (#2554)
* left truncate for generate_until
* pre-commit
* Update Lightning import (#2549)
* update import
Signed-off-by: Maanu Grover <maanug@nvidia.com>
* run formatting
---------
Signed-off-by: Maanu Grover <maanug@nvidia.com>
* add optimum-intel ipex model (#2566)
* initial support for optimum-intel ipex model. LM model as first step
* format
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
* pass dtype
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
* update README
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
---------
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
* add warning to readme (#2568)
* make warning prominent
* make warning prominent
* Adding new subtask to SCORE tasks: non greedy robustness (#2558)
* score readme added
* generate until task's "until" parameter's default value fixed.
* score mmlu-pro and agieval added
* changed macro accuracy to micro for agieval
* Always E removed from agi eval
* redundancies removed
* MATH added
* minor cosmetic changes for math
* Licenses added Readme updated
* changes for flake8 + license header on math
* Score added to readme and precommit was run.
* Score added to readme and precommit was run.
* Import error fixed
* math task bugfix
postprocess minor fix
* CR for math added
* math CR
* math task bugfix
postprocess minor fix
CR for math added
* Math cr fixed
* mmlu_pro non_greedy task added
* non greedy summarizer added
* Non greedy for all score tasks
* Bugfixes for non-greedy
* fixing the until argument
* undoing the change to "until" arguments default behaviour
* minor fix in summarizer
* log naming changes for better readability
* math subtasks naming fix
* agieval subtask naming fix
* logging added for debugging
* path issue fixed
* minor fix
* path fix
* path fix
* non_greedy_math minor fix
* final changes
* changed readme for non-greedy
added Nvidia header
added example script for non_greedy
changed prompts to match those of TRT runs
* non greedy summarizer bugfix
* non_greedy summarizer fixed
* batch `loglikelihood_rolling` across requests (#2559)
* batch all rolling token windows
* nit
* copy to vllm
* fix max_length for `get_rolling_token_windows`
* bugfix
* bugfix
* add type hints
* fix `DeprecationWarning: invalid escape sequence '\s'` for whitespace filter (#2560)
* fix `DeprecationWarning: invalid escape sequence '\s'`
* add type hints
* Revert "add type hints"
This reverts commit 15d8abc626a84e97f8c238ddfbf9e243d6f6eb5c.
* increment version (#2574)
forgot to increment 0.4.6!
* drop python 3.8 support (#2575)
* feat: drop Python 3.8 support
* feat: drop Python 3.8 tests
* pre-commit
* Add Global MMLU Lite (#2567)
* add global mmlu lite
* add global mmlu lite
* fix bugs
* add task README.md
* Update README.md
* Update tasks README.md
* Update README.md
* update readme
---------
Co-authored-by: shivi <shivalikasingh95@gmail.com>
* add warning for truncation (#2585)
* add warning for truncation
* Wandb step handling bugfix and feature (#2580)
* AraDICE task config file (#2507)
* added aradice
* Added ArabicMMLU Lev Configs
* added ArabicMMLU egy configs
* Added boolq configs
* Added cultural bench configs
* added openbookqa configs
* Added PiQA configs
* added winogrande configs
* Added truthfulQA configs
* Added aradice group config
* Remove deleted files from repository
* modified arabimmlu configs
* modified metadata versions
* fixed formatting using ruff
* added aradice tasks information
* pre-commit
* Updated openbookqa utils
* fixed formatting on obqa
---------
Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa>
Co-authored-by: Baber <baber@hey.com>
* fix extra_match low if batch_size > 1 (#2595)
* fix extra_match low if batch_size > 1
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* add sorting to logprobs
* nit
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Baber <baber@hey.com>
* fix model tests (#2604)
upgrade transformers and peft in CI
* update scrolls (#2602)
* update evaluate; update construct requests
* update construct requests to handle `apply_chat_template` kwarg
* some minor logging nits (#2609)
* remove yaml extension from phraes_va_common
* remove yaml extension from winogenerated
* remove yaml extension from phrases_es
* no cache debug logging when not used
* Fix gguf loading via Transformers (#2596)
* hf support load gguf file
* code review
* code review
* code clean up
* note about use_fast compat with gguf
---------
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
* Fix Zeno visualizer on tasks like GSM8k (#2599)
* fix(zeno): Generate unique ids in case of multiple filters
* fix(zeno): Report even non-aggregable metrics, just not as metrics
* pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* Fix the format of mgsm zh and ja. (#2587)
* Fix the format of mgsm zh and ja.
* Add change log to mgsm.
* Add newline after changelog.
* Add HumanEval (#1992)
* add custom filter
* fix type casting of references
* add humaneval
* fix a bug in humaneval
* add greedy version of humaneval
* update tasks README
* test humaneval
* return multiple metrics
* nit
* add confirmation to run code tasks
* nit
* nit
---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
* Add MBPP (#2247)
* add mbpp
* fix some bugs
* add README for mbpp
* update README
* nits
---------
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
* Add MLQA (#2622)
* Add MLQA
* add mlqa_common_yaml
* add 49 tests of mlqa family
* update tasks/README.md
---------
* fix: mlqa ast error
* nit: removed .yaml ext from template_yaml
* nit changes: minor modifications generate_tasks.py
* deleted lm_eval/tasks/mlqa/mlqa_common_yaml.yaml
* tests updated
* nit
* assistant prefill (#2615)
* add assistant prefix
* add arc_challenge from llama
* nit
* nit
* nit
* add assistant prefix
* add mmlu_llama
* nit
* nit
* Revert "nit"
This reverts commit 6a97f8356237305e375212b966b30e8de59dd4bc.
* fix regex bug
* add assistant_prefix to vllm
* add `Question:`
* add mmlu_pro
* add fewshot assistant_prefix
* use `assistant_prefill`
* typehints
* nits
* nits
* add to docs
* add readme
* fix gen_prefix (#2630)
* switch arg
* update pre-commit (#2632)
* update pre-commit
* add hrm8k benchmark for both Korean and English (#2627)
* add hrm8k benchmark for both Korean and English
* apply precommit
* revise tasks so models do not answer directly; use zeroshot_cot if possible
* add README
* Add hrm8k on the task-list
---------
Co-authored-by: Baber <baber@hey.com>
* New arabicmmlu (#2541)
* point to the original ArabicMMLU dataset
* create the new subtasks files
* fix bug when the context field is empty
* apply precommit (#2636)
* Update KorMedMCQA: ver 2.0 (#2540)
* Update KorMedMCQA: ver 2.0
* Fix pre-commit formatting issues
* Update KorMedMCQA v2.0
* pre-commit
* fix tmlu tmlu_taiwan_specific_tasks tag (#2420)
* fixed mmlu generative response extraction (#2503)
* fixed mmlu generative response extraction
* updated file version | added args to exact_match
* fix
* fix
* pre-commit
* fix groups
---------
Co-authored-by: Baber <baber@hey.com>
* revise mbpp prompt (#2645)
* aggregate by group (total and categories) (#2643)
* Fix max_tokens handling in vllm_vlms.py (#2637)
* Update vllm_vlms.py
* pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* separate category for `global_mmlu` (#2652)
* separate category
* set version 0.0
* apply precommit
* Add Moral Stories (#2653)
* Add moral stories task
* Add moral stories task
* Create README.md
* Update README.md
* Update line endings in moral_stories files
* add TransformerLens example (#2651)
* add TransformerLens example
Many people use TransformerLens to do interpretability and interventions on models, and then need to test the model.
Here is a simple script that allows one to pass in the TransformerLens model and run evaluations on it.
* Ran pre-commit checks
* fix multiple input chat template (#2576)
* feat: drop Python 3.8 support
* feat: drop Python 3.8 tests
* pre-commit
* handle chat_template for multiple input
* Add Aggregation for Kobest Benchmark (#2446)
Co-authored-by: Baber <baber@hey.com>
* update pre-commit (#2660)
* nit
* update pre-commit
* remove `group` from bigbench task configs (#2663)
* remove group from task configs
* add tags
* update readme
* Add Histoires Morales task (#2662)
* Add Histoires Morales task
* Histoires Morales task: fix mixed line endings
* Histoires Morales task: fix mixed line endings
* Remove tag for a single task
* Add some MT for Histoires Morales
* MMLU Pro Plus (#2366)
* mmlu-pro-plus is implemented
* README file is updated
* Update README.md with new task: MMLU Pro Plus
* Update README.md with new task: MMLU Pro Plus
* pre-commit
* nit
---------
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Baber <baber@hey.com>
* fix early return for multiple dict (#2673)
* Turkish mmlu Config Update (#2678)
* Added TurkishMMLU to LM Evaluation Harness
* Fixed COT name
* Fixed COT name
* Updated Readme
* Fixed Test issues
* Completed Scan for changed tasks
* Updated Readme
* Update README.md
* fixup task naming casing + ensure yaml template stubs aren't registered
* Fix Regex Pattern for CoT experiments
* Fixed multiple choice accuracy
---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
* Fix typos (#2679)
* fix typo
* fix typos
* fix typos
* remove cuda device assertion (#2680)
* Adding the Evalita-LLM benchmark (#2681)
* feat: initial commit with templates for evalita evaluation
* fix: change rule for generate_until
* feat: modified yaml to use reduced version of NER test datasets
* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
* Add Six Prompts for Each Multiple-Choice Task
* feat: modified fewshot split for textual entailment task
* fix: new doc_to_target function for NER tasks
* Update prompt
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Update prompt
* Add partition for few-shot evaluation
* Rename file
Rename file from _evalita-mp_ner_adg_p1 .yaml to _evalita-mp_ner_adg_p1.yaml
* Add partition for few-shot evaluation
* Add partition for few-shot evaluation
* Enhance lexical substitution management
- Improve scorer calculation for better accuracy
- Update model output postprocessing for clearer results
- Add support for few-shot relation extraction task
* Add F1 macro measure for the document dating task
* Add F1-macro measure to evaluate document dating
* Use the whole dataset
* Small changes
* Add the two prompts for the task of lexical substitution
* Add few-shot split configuration
* Add few-shot split configuration
* Add function for handling few-shot learning setup
* Fix prompt
* Remove configuration file
* Update dataset from test_same to test_cross for evaluations
* Remove whitespace at end of prompt
* Fix configuration error: corrected parameter name for the dataset used in few-shot
* Fix: Check if results is not empty before processing in lexical substitution task
* added the prompts and functions for correct NER and RE execution
* Add accuracy measure
* Add tasks for the EVALITA-LLM benchmark evaluation
* Small changes
Add the alias of the task name that will be printed in the final table results.
* Updated the prompts to reflect changes made to the extended dataset for the Admission Test task
* chore: cleaned templates before PR; feat: add configuration to run generation/ppl tasks.
* fix: add information on Evalita-LLM for PR
* fix: rename folders and files
* fix: remove unused imports
* chore: run pre-commit
* chore: add task description
---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
* Delete lm_eval/tasks/evalita_llm/single_prompt.zip (#2687)
* Update unitxt task.py to bring in line with recent repo changes (#2684)
* change ensure_ascii to False for JsonChatStr (#2691)
* set aggregation and higher_is_better (instead of falling back on defaults) (#2692)
* Update remaining references to assistant_prefill to gen_prefix (#2683)
* Update README.md (#2694)
* fix `construct_requests` kwargs (#2700)
* `arithmetic`: set target delimiter to empty string (#2701)
* set target delimiter to empty string
* nit
* add warning
* fix vllm (#2708)
* fix vllm
* fix data_parallel
* copy to multimodal
* add math_verify to some tasks (#2686)
* add math_verify to minerva math
* add math_verify to benchmark
* fix error
* increment version
* Logging (#2203)
* changed source of eval_logger
* allow eval_logger to be set from args
* removed verbosity arg from non-main methods
* fix logging
* pre-commit
* set verbosity in eval logger
* replace utils.eval_logger
* fix logging in main
* add logging to docs
* add logging message
* nit
* add logging to docs
* refactor setup_logging to utils
---------
Co-authored-by: Baber <baber@hey.com>
* fix missing dataset repo (#2719)
* remove unused import (#2728)
* Added IberoBench citation info (https://aclanthology.org/2025.coling-main.699/) in correpsonding READMEs (#2729)
* add o3-mini support (#2697)
* add o3-mini support
* fix linter tests
* add Basque translation of ARC and PAWS to BasqueBench (#2732)
* add Basque translation of ARC and PAWS to BasqueBench
* pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* add cocoteros_es dataset (#2721)
Co-authored-by: Robiert Sepulveda Torres <rsepulveda911112@gmail.com>
* Fix the import source for eval_logger (#2735)
* Fix the import source for eval_logger
* fix logging
---------
Co-authored-by: Baber <baber@hey.com>
* add humaneval+ and mbpp+ (#2734)
* add humaneval+ and mbpp+
* add newline at end of file
* Support SGLang as Potential Backend for Evaluation (#2703)
* initial components to support sglang
* init of class SGLangLM
* draft for generate_until of SGLang model
* mock loglikelihood
* initial loglikelihood_tokens
* todo: fix bug of sglang engine init
* implement generation tasks and test
* support output type loglikelihood and loglikelihood_rolling (#1)
* .
* loglikelihood_rolling
* /
* support dp_size>1
* typo
* add tests and clean code
* skip tests of sglang for now
* fix OOM error of sglang pytest
* finish test for sglang
* add sglang to readme
* fix OOM of tests and clean SGLang model
* update readme
* clean pyproject and add tests for evaluator
* add accuracy tests and it passed locally
* add notes for test
* Update README.md
update readme
* pre-commit
---------
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
* fix log condition (#2737)
* fix vllm data parallel (#2746)
* remove ray.remote resources
* remove kobtest tag (registered as group)
* [Readme change for SGLang] fix error in readme and add OOM solutions for sglang (#2738)
* initial components to support sglang
* init of class SGLangLM
* draft for generate_until of SGLang model
* mock loglikelihood
* initial loglikelihood_tokens
* todo: fix bug of sglang engine init
* implement generation tasks and test
* support output type loglikelihood and loglikelihood_rolling (#1)
* .
* loglikelihood_rolling
* /
* support dp_size>1
* typo
* add tests and clean code
* skip tests of sglang for now
* fix OOM error of sglang pytest
* finish test for sglang
* add sglang to readme
* fix OOM of tests and clean SGLang model
* update readme
* clean pyproject and add tests for evaluator
* add accuracy tests and it passed locally
* add notes for test
* Update README.md
update readme
* pre-commit
* add OOM guideline for sglang and fix readme error
* fix typo
* fix typo
* add readme
---------
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
* Groundcocoa (#2724)
* Fix failing tests
* Resolved merge conflicts
* pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* fix doc: generate_until only outputs the generated text! (#2755)
* Enable steering HF models (#2749)
* Enable steering HF models
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>
* increase HF download timeout
* Update readme; improve steering vector device handling
* Update latest news
* remove HF timeout increase
* fix tests
* ignore sae lens test
* fix accidental force push
---------
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>
* Add test for a simple Unitxt task (#2742)
* Add a test for a custom unitxt task
* Update task.py to bring in line with breaking change in v1.17.2
* Fix lint
* add debug log (#2757)
* increment version to 0.4.8 (#2760)
* fix: mmlu (generative) metric aggregation (#2761)
* Bugfix (#2762)
* bug fix
* add warning for instruct models
* nit
* fix verbosity typo (#2765)
* docs: Fix typos in README.md (#2778)
* initialize tokenizer with bos_token (#2781)
* Use yaml.CLoader to load yaml files when available. (#2777)
* Consistency Fix: Filter new leaderboard_math_hard dataset to "Level 5" only (#2773)
* Filter new leaderboard_math_hard dataset to "Level 5" only
* align to linters
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
---------
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
* Fix for mc2 calculation (#2768)
* fix for mc2 calculation
* increment versions and changelog
---------
Co-authored-by: Baber <baber@hey.com>
* New healthcare benchmark: careqa (#2714)
* New healthcare benchmark: careqa
* LAUNCH_MN5_ACC <python main.py --config config/mn5.yml --models Llama-3.2-1B-Instruct --tasks careqa_open --num_fewshot 0>
* Add fixes, READMES, and remove task_list.txt
* pre-commit passed, add formatting updates; add nanmean agg_metric
* Fix import error.
* Wrapped imports in try excepts
* Wrapped imports in try excepts; also metrics to catch bert_score import error
* Try except to catch ImportErrors as well
* use np.nan
* pre-commit
---------
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Baber <baber@hey.com>
* Capture gen_kwargs from CLI in squad_completion (#2727)
* Capture gen_kwargs from CLI in squad_completion
* Update lm_eval/tasks/squad_completion/task.py
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
* Update lm_eval/tasks/squad_completion/task.py
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
* pre-commit
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
* humaneval instruct (#2650)
* add instruct humaneval
* nit
* add to readme
* nit
* Update evaluator.py (#2786)
minor bug fix, lm_eval.setup_logging -> setup_logging
* change piqa dataset path (uses parquet rather than dataset script) (#2790)
* use verify_certificate flag in batch requests (#2785)
* add audio modality (qwen2 audio only) (#2689)
* Added audio-modality pipeline for qwen2-audio model
* Beauty imports
* fix apply_chat_template args
* update default audio placeholders list
* add demo task - common_voice subset
* add audiolm_qwen libs to pyproject.toml
* pre-commit beautify
---------
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>
* Add various social bias tasks (#1185)
* Implementation of Winogender
* Minor fixes README.md
* Add winogender
* Clean winogender utils.py
* Change dataset to one containing All subsets
* Flesh out README for BBQ task
* Add missing tasks for BBQ
* Add simple cooccurrence bias task
* Fix wrong mask for ambiguated context+rename metrics
* Made generate_until evaluation (following PALM paper) default
Also moved separate config files per category to separate metrics using custom function.
Created config file for multiple_choice way of evaluating BBQ.
* Add missing version metadata
* Add missing versionmetadata for bbq multiple choice
* Fix metrics and address edge cases
* Made BBQ multiple choice the default version
* Added settings following winogrande
* Add num_fewshot to simple_cooccurrence_bias
* Fixes for bbq (multiple choice)
* Fix wrong dataset
* CrowS-Pairs: make it easier to use another dataset by removing dataset_name from the subsets.
* Use simplest prompt possible without description
* Merge
* BBQ: Fix np.NaN related bug
* BBQ: Fix wrong aggregation method for disamb accuracy
* BBQ: Make it possible to only evaluate on (dis)ambiguous subset (needed for few shot eval)
* BBQ: fix showing one target in case of few-shot evals
* BBQ: Fix few-shot example for bbq_generate
* BBQ: simplify subtasks
* BBQ: Minimize number of UNK variations to reduce inference time
* BBQ: Add extra UNK keywords for the generate task
* Add a generate_until version of simple_cooccurrence_bias
* Change system/description prompt to include few-shot examples
* Group agg rework
* Run pre-commit
* add tasks to readme table
* remove trailing space from simple_cooccurrence_bias_gen.yaml `doc_to_text`
* fix
* fix
* fix version
---------
Co-authored-by: Baber <baber@hey.com>
* update pre-commit (#2799)
* Update Legacy OpenLLM leaderboard to use "train" split for ARC fewshot (#2802)
* Update openllm.yaml to use train fewshot split for arc
* Add INCLUDE tasks (#2769)
* Add INCLUDE tasks
* pacify pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* Add support for token-based auth for watsonx models (#2796)
* Add support for token-based auth for watsonx models
* Fix lint
* Move dotenv import to inner scope
* Improve readability of _verify_credentials
* add __version__ (#2808)
* add __version__
* add version consistency check to publish action
* Add cocoteros_va dataset (#2787)
* Add cocoteros_va dataset
* Fix format in cocoteros_va.yml
* Undo newline added
* Execute pre-commit to fix format errors
* Update catalan_bench.yaml version and add Changelog section into Readme.md
* Add MastermindEval (#2788)
* add MastermindEval benchmark
* fill out checklist
* Add loncxt tasks (#2629)
support for long-context (and other synthetic) tasks
* add ruler
* add longbench
* pass `metadata` to TaskConfig
* [hf-multimodal] pass kwargs to self.processor (#2667)
* add min_pixels, max_pixels
* fix
* [MM] Chartqa (#2544)
* add changelog to readme template
* add readme
* add to task list
* Allow writing config to wandb (#2736)
* Allow writing config to wandb
* set defaults
* Update help
* Update help
* [change] group -> tag (#2813)
* Clean up README and pyproject.toml (#2814)
* Update CODEOWNERS
* Llama3 mmlu correction (#2797)
* Update continuation template YAML for MMLU task with new generation and filtering options
* Refactor filter_list structure in continuation template YAML for improved readability
* Add 'take_first' function to filter_list in continuation template YAML
* Update filter_list in continuation template YAML to use 'strict_match' and modify filtering functions
* Add 'do_sample' option to generation_kwargs in MMLU template YAML
* Add Markdown linter (#2818)
* Add markdown linter to pre-commit hooks
* Reformat existing markdown (excluding lm_eval/tasks/*.md)
* Configure the pad tokens for Qwen when using vLLM (#2810)
* fix typo (#2820)
* [VLLM, SGLANG] default temp=0.0 (#2819)
* Fixes to mmlu_pro_llama (#2816)
* Update generation_kwargs in default template to include additional end tokens
* Update filter_list in MMLU Pro configuration to use strict_match
* Update _default_template_yaml
* Add MMLU-ProX task (#2811)
* update mmlu_prox configs
* update tasks/README
* correct hyphen to underscore in task/README
* update pre-commit codes
* Remove unnecessary nested list in MMLU-Pro default template YAML (#2827)
* feat: replace library (#2828)
I haven't had time to review the library that's replacing tj-actions or whether this change breaks anything, but the vulnerability is quite severe and I would rather the functionality be broken than risk compromise.
**to do:** review this later
* Multilingual MMLU for Llama instruct models (#2826)
* Multilingual MMLU
* Refactor process_docs function calls for clarity and consistency
* changed dataset to parquet version (#2845)
* Fix typo in longbench metrics (#2854)
* Add kmmlu multiple-choice(accuracy) task (#2849)
* Adding ACPBench task (#2807)
* Adding acpbench task
* adding ACPBench in Tasks readme.
* running precommit
* add Darija (Moroccan dialects) tasks including darijammlu, darijahellaswag and darija_bench (#2521)
* add Darija tasks
* fix multiple groups issue in darijammlu
* add MT to the description of the Darija tasks
* Update README.md
nit
* fix the recursion error caused by the darija_summarization task
* use a custom filter instead of the decorator for the strip function
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
* Increase default max_gen_toks to 2048 and max_length to 8192 for MMLU Pro tests (#2824)
* Changed default max_length from 2048 to 8192 and max_gen_toks from 256 to 2048 for MMLU Pro tasks.
* Update lm_eval/tasks/mmlu_pro/_default_template_yaml
* pre-commit
* nit
---------
* move warning (#2857)
* Fix: ACPBench Link (#2860)
* Adds MMLU CoT, gsm8k and arc_challenge for llama instruct (#2829)
* llama-style MMLU CoT
* Refactor MMLU CoT template YAML to simplify 'until' structure
* Add GSM8K task configuration for LLaMA3 with few-shot examples
* Fix missing newline at end of MMLU CoT YAML file
* Add ARC-Challenge task configuration and processing utility
* Add additional MMLU and ARC-Challenge task variants to README
* Update README with notes on arc_challenge_llama dataset preprocessing
* [leaderboard] math - sync with repo (#2817)
* sync with leaderboard
* also output old metric
* wrap old extraction in try except
* better log
* Update supported models (#2866)
* Add JSONSchemaBench: A Benchmark for Evaluating Structured Output from LLMs (#2865)
* Add JSON schema benchmark
* Update lm_eval/tasks/jsonschema_bench/metrics.py
Thanks for catching this
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
* run pre-commit
* add description to task catalogue readme
---------
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
* leaderboard - add subtask scores (#2867)
* add subtask scores
* pacify pre-commit
* Fix the deps of longbench from jeiba to jieba (#2873)
Signed-off-by: Lu Fang <lufang@fb.com>
* Optimization for evalita-llm rouge computation (#2878)
* feat: initial commit with templates for evalita evaluation
* fix: change rule for generate_until
* feat: modified yaml to use reduced version of NER test datasets
* feat: added templates to use reduced dataset for summarization (fanpage and ilpost)
* Add Six Prompts for Each Multiple-Choice Task
* fix: fastest eval for summarization
* chore: linted with ruff
* chore: linted with ruff
---------
Co-authored-by: rzanoli <zanoli@fbk.eu>
* Update authentications methods, add support for deployment_id for IBM watsonx_ai (#2877)
* update authentication methods, add support for deployment_id
* run pre-commit on changed file
* Add GSM8K Platinum (#2771)
* add gsm8k platinum
* only test splits
* wrong dataset
* link to blog
* format
* Add `--samples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] (#2520)
* added option --examples
* specifying examples in dictionary
* run pre-commit - fix arg type
Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
* fixing bug when examples==None
* fixing bug when examples==None
* limit or examples must be None in simple_evaluate.py and in evaluator.py
* run pre-commit (fix formatting)
Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
* merge main and run pre-commit (fix formatting)
Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
* Update __main__.py
undefined "limit" and "examples"
* update branch, fix conflicts, run pre-commit
* nits
* nits
* change 'examples' to 'samples'
---------
Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Baber <baber@hey.com>
* Extend support for chat template in vLLM (#2902)
* Add support for chat templates defined outside of tokenizer_config.json, as supported by vLLM
* Update template name to avoid conflict with other variable
* tasks README: fix dead link (#2899)
* Add support for quantization_config (#2842)
* Add support for quantization_config
Summary:
Previously quantization_config was ignored, so torchao-quantized models were not supported;
this PR adds that support.
Test Plan:
lm_eval --model hf --model_args pretrained=jerryzh168/gemma3-int4wo --tasks hellaswag --device cuda:0 --batch_size 8
* quantization_config is optional
* Fix a typo in README for tasks (#2910)
* fix resolve_hf_chat_template version (#2917)
* fix resolve_hf_chat_template version
* pre-commit
* mmlu - switch dataset to cais/mmlu; fix tests (#2918)
* switch MMLU to cais/mmlu
* switch back to tj-actions/changed-files
* cache HF folder
* init pixels before tokenizer creation (#2911)
* Longbench bugfix (#2895)
* add warning in for default until
* fix stop tokens; add vcsum
* bugfix:fix doc_to_target to string
* fix lsht, trec
* add task to readme
* add debugging logs for multiple input/output
* Added softmax_dtype argument to HFLM to coerce log_softmax computations (#2921)
* Added softmax_dtype argument to coerce log_softmax computations
* move softmax_dtype
---------
Co-authored-by: Baber <baber@hey.com>
* use np.NaN (#2937)
* Add support for enable_thinking argument in vllm model, set default to False (#2947)
* Added NorEval, a novel Norwegian benchmark (#2919)
* added noreval
* added a checklist for noreval
* run pre-commit
* changed imports and added short noreval description
* fixed norsumm path
* refactored multi-folder tasks
* refactored multi-folder tasks
* Fix import error for eval_logger in score utils (#2940)
* Fix import error for eval_logger in score utils
* pacify pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* Include all test files in sdist (#2634)
This is useful to run unit tests during distro builds.
* Change citation name (#2956)
This hasn't been a library for few shot language model evaluation in quite a while. Let's update the citation to use "the Language Model Evaluation Harness" as the title.
* add warning on truncation (#2962)
* fix: type error while checking context length (#2972)
* Fix import error for deepcopy (#2969)
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
* Pin unitxt to most recent major version to avoid test failures (#2970)
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
* mmlu pro generation_kwargs until Q: -> Question: (#2945)
* mmlu pro generation_kwargs until Q: -> Question:
* pacify pre-commit
* change stop token
---------
Co-authored-by: Baber <baber@hey.com>
* AfroBench: How Good are Large Language Models on African Languages? (#2825)
* add afrixnli to task
* add chat completion
* remove chat completion -untested
* afrimmlu added
* afrimmlu folder update
* afrimmlu folder update
* updated prompt
* remove print
* add afrimgsm -direct
* add squad metric
* fix bash script
* remove direct util, update common yaml
* remove print
* add few show. metric fixes
* fix direct path, add bash script for gpt models
* added translate test
* update afrixnli tasks
* update afrixnli tasks
* update metrics for afrixnli
* prompt translations fix
* prompt translations fix
* filter and metric fix -mgsm
* remove squad metric
* remove squad metric
* add f1 score to mgsm
* add f1 score to mgsm
* update native-direct with lin
* change f1 function
* add lin to utils
* add utils
* remove test limit
* remove test configs
* add swahili to mmlu
* change eng to ewe in ewe yaml mmlu
* add squad metric to mgsm, remove whitespace filter
* added translate test
* added afrixnli_translate
* fix exact match valueError
* fix exact match valueError
* restructure mmlu folder
* spacing
* remove afrimmlu_translate folder
* add utility
* format task name, clean ups
* modefied mgsm
* update on afrimgsm
* update on afrimgsm
* removed utils
* other mgsm varieties
* other mgsm varieties
* adding translate direct
* Update translate_direct_yaml
* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
* edit for open models
* Update translate_direct_yaml
* add verbalizer for xnli
* change xnli from multiple choice to generate
* add manual accuracy scores
* revert xnli to multiple choice
* change afrimgsm utils
* revert xnli to multiple_choice
* cleanups and readmes
* remove openai fixes and unused regex
* pr review changes
* revert metrics.py, task.py and extraction.py to main version
* add afrisenti
* utilities
* pulled from main
* add afrixnli
* add afrimmlu
* update afrixnli prompts
* missing senti language
* fix afrisenti prompt 2
* fix afrisenti prompts
* fix afrisenti prompts
* configure task grouping
* add multiple prompts to afrixnli for irokobench
* add multiple prompts to afrimmlu for irokobench
* Update afrixnli_yaml
* fixes and moves
* fixes and moves
* afrimmlu multiple prompts configs
* remove validation set from afrimmlu
* remove eng from afrimmlu translate test
* correct dataset path
* multiple prompts for mgsm
* file restructure
* afribench grouping
* repo restructuring
* repo restructuring
* update exact match to hugging face exact match and add new mgsm language
* remove decontamination
* update generation kwargs
* update generation kwargs for all mgsm prompts
* remove lang
* update generation kwargs for afrimgsm translatetest
* add afrimgsm cot for direct and translate
* remove eng from translate-cot
* add masakhaPOS tasks
* remove changes from task script
* add masakhanews tasks
* add uhura arc easy
* add afriqa and belebele files
* add tags for easier run. add naija rc
* add new metrics and transformation scripts
* fix afriqa swa fewshot split
* add naijarc
* add afrobench lite tasks
* update afrobench
* update afrobench
* remove unverified files to avoid bugs
* remove files not needed
* add afrobench tasks
* add afrobench tasks
* change to version 1
* change to version 1
* update afrobench
* update afrobench
* restore metric to original script
* update readme instructions
* add individual dataset readmes
* add link to collections
* correct run script
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* failed run fixes
* failed run fixes
* add afrimgsm cot
* Apply precommit fixes
* update mafand dataset name
* pull request fixes
* remove afrihate due to availability
---------
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>
* Added C4 Support (#2889)
* added c4 dataset (working)
* fixed bugs in c4
* fixed loading bugs in c4 dataset; using partial loading
* cleaned the code
* added version number for c4
* removed irrelevant files
* Update utils.py (#2870)
* feat: add question suffix (#2876)
* Add device arg to model_args passed to LLM object in VLLM model class (#2879)
* fix: pass device arg in model_args in vllm_causallms
* casting device arg to str in vLLM model args
* fix formatting (#2759)
* Delete scripts/cost_estimate.py (#2985)
This function was written years ago when the cost of running an OpenAI model was easy to compute. It is no longer viable to support this.
* Adding ACPBench Hard tasks (#2980)
* adding ACPBench_hard
* adding Clingo
* changing tarski to tarski[clingo]
* denoting the main variants in each paper
* [SGLANG] Add the SGLANG generate API (#2997)
* add `sglang-generate`
* nit
* nit
* nit
* pacify pre-commit
* fix github parse error (#2998)
* Log tokenized request warning only once (#3002)
* Log tokenized request warning only once
* Fix logging for concurrent usecase as well
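Deduplicating a repeated warning can be done with a memoized logging helper; a sketch of the pattern (not the harness's actual implementation), which also behaves under concurrent callers because `lru_cache` guards its cache internally in CPython:

```python
import logging
from functools import lru_cache

logger = logging.getLogger(__name__)

@lru_cache(maxsize=None)
def log_warning_once(msg: str) -> None:
    # lru_cache memoizes on msg, so each distinct warning is emitted once;
    # repeat calls hit the cache and never reach the logger.
    logger.warning(msg)
```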
* add kbl 2025 (#3000)
* Output path fix (#2993)
* fix(output_path): support direct JSON file paths
* fix linting
* turn off external Lm tests for now
* Update help text for `output_path`
---------
Co-authored-by: Baber <baber@hey.com>
* use images with api models (#2981)
* use images with apis
* pacify pre-commit
* Adding resize images support (#2958)
* first version of image resizing
* fixed bug
* clean up `resize_image`
---------
Co-authored-by: Artem Safin <artemsafin67@gmail.com>
Co-authored-by: Baber <baber@hey.com>
* Revert "feat: add question suffix (#2876)" (#3007)
This reverts commit 4dbd5ec9
* change multimodal check in evaluate (#3013)
changed multimodal check from strict equality
* [Fix] Update `resolve_hf_chat_template` arguments (#2992)
* fix arguments
* pacify pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* Fix error due in Collating queries with different continuation lengths (fixes #2984) (#2987)
* FIX error due to grouping queries with different continuation length
Make Collator choose query with the longest continuation as the
candidate for generation
* use max for key selection
* added comments explaining variable cont length (identical ctx+cont[:-1])
---------
Co-authored-by: Baber <baber@hey.com>
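The fix above makes the Collator choose the grouped request with the longest continuation as the generation candidate, so a single forward pass covers ctx + cont[:-1] for every member of the group. A toy sketch of that selection rule (real requests carry more fields than this pair):

```python
def pick_generation_candidate(group):
    # group: requests sharing a context, each as
    # (context_tokens, continuation_tokens).
    # Picking the longest continuation guarantees the cached logits span
    # ctx + cont[:-1] for every request in the group.
    return max(group, key=lambda req: len(req[1]))
```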
* [vllm] data parallel for V1 (#3011)
* add data_parallel for V1
* use Process instead of Queue
* ray used if V0 DP
* better error handling
* fix truncation warning comparison
* add arab_culture task (#3006)
* add arab_culture tasks
* add target_delimiter and remove debugging code
* chore: clean up and extend .gitignore rules (#3030)
* chore: clean up and extend .gitignore rules
* pacify pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* Enable text-only evals for VLM models (#2999)
* [Fix] acc_mutual_info metric calculation bug (#3035)
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices
* add tests
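`acc_mutual_info` scores each answer choice by log P(choice | ctx) − log P(choice), so the unconditional log-likelihoods must line up index-for-index with the conditional ones — the alignment the slicing bug broke. A self-contained sketch of the metric:

```python
def acc_mutual_info(lls_cond, lls_uncond, gold: int) -> float:
    # Pointwise mutual information per choice: log P(c | ctx) - log P(c).
    # A misaligned slice silently scores choice i against the unconditional
    # log-likelihood of choice j, which is the bug this entry fixes.
    assert len(lls_cond) == len(lls_uncond)
    scores = [c - u for c, u in zip(lls_cond, lls_uncond)]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return float(pred == gold)
```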
* fix: fix vllm issue with DP>1 (#3025)
* add Mbpp instruct (#2995)
* feat: add mbpp_instruct
* fix: update generation_kwargs to use an empty until list
* fix: correct predictions formatting in pass_at_1 function
* fix: improve code block extraction by checking first without opening backticks
* fix mbpp `pass_at_1`
* remove prints (#3041)
* [longbench] fix metric calculation (#2983)
* use all answers
* use middle truncation
* maybe fix classification score
* strip classification preds
* [vllm] remove stop tokens post-hoc
* strip all preds
* pacify pre-commit
* start on truncation utility
* add to readme
* add a footgun doc
* fix newline in yaml templates
* do not strip code_sim preds!
* fix pre-commit config
* fix instruction warning
* add note to longbench readme
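Middle truncation, mentioned above, keeps the head and tail of an over-long prompt and drops the middle, matching LongBench's reference scoring setup. A sketch, assuming the prompt is already a token list:

```python
def truncate_middle(tokens, max_len: int):
    # Keep the first and last halves of the budget and drop the middle:
    # LongBench prompts put instructions up front and the question at the
    # end, so the extremes matter most.
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[len(tokens) - (max_len - half):]
```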
* Fallback to super impl in fewshot_context for Unitxt tasks (#3023)
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
* Fix Typo in README and Comment in utils_mcq.py (#3057)
* Update README.md
* Update utils_mcq.py
* fix longbench citation (#3061)
* fix longbench citation
* Update README.md (#3070)
Wrong task name: mmlu_generation doesn't exist -> mmlu_generative is the correct one
* Update instructions.py (#3060)
* bump version to `0.4.9` (#3073)
* llama3 task: update README.md (#3074)
"arc_chalenge_chat" doesn't exist: I think it should be "arc_challenge_chat", but this task is not implemented here (see arc task folder).
* Fix Anthropic API compatibility issues in chat completions (#3054)
* Fix Anthropic API compatibility issues in chat completions
solves two important compatibility issues between the LM Eval Harness and Anthropic's API:
1) The type field issue - Anthropic's Messages API doesn't accept the type field that other APIs might expect, that was previously included
2) The stop sequences issue - Anthropic requires stop sequences to contain non-whitespace characters
tested with most recent models from Anthropic: claude-sonnet-4-0, claude-opus-4-0; resolved my local API errors
* pacify pre-commit
* add type
---------
Co-authored-by: Baber <baber@hey.com>
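The stop-sequence half of the fix above filters out whitespace-only entries before the request goes out, since Anthropic's Messages API rejects stop sequences without non-whitespace characters. A minimal sketch:

```python
def sanitize_stop_sequences(stops):
    # Anthropic's Messages API rejects stop sequences that are empty or
    # whitespace-only (e.g. "\n\n"), so drop those before sending.
    return [s for s in stops if s.strip()]
```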
* Ensure backwards compatibility in fewshot_context by using kwargs (#3079)
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
* remove system message if `TemplateError` (#3076)
* feat / fix: Properly make use of `subfolder` from HF models (#3072)
* add subfolder
* lint
* change it to empty string
* fix typehints
---------
Co-authored-by: Baber <baber@hey.com>
* [HF] fix quantization config (#3039)
* Try fixing issue 3026, which is caused by the quantization_config argument introduced in commit 758c5ed.
The argument is a dict, but for a GPTQ-quantized model this conflicts with the Hugging Face interface, which expects a QuantizationConfigMixin.
The initial solution removed the quantization_config argument in HFLM._create_model() of lm_eval/models/huggingface.py.
Further modification is required to restore the functionality provided by the previous commit.
* wrap quantization_config in AutoQuantizationConfig
* handle quantization config not dict
* wrap quantization_config in AutoQuantizationConfig if dict
---------
Co-authored-by: shanhx2000 <hs359@duke.edu>
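The final fix above wraps a dict-valued quantization_config into a proper config object (via transformers' AutoQuantizationConfig) and passes anything that is already a config object through untouched. A dependency-free sketch of that dispatch, with `DictBackedConfig` standing in for the real transformers class:

```python
class DictBackedConfig:
    # Stand-in for a transformers QuantizationConfigMixin subclass; the
    # actual fix builds the object with AutoQuantizationConfig instead.
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

def normalize_quantization_config(qc):
    if isinstance(qc, dict):      # hub configs arrive as plain dicts
        return DictBackedConfig(**qc)
    return qc                     # already a config object, or None
```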
* FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092)
* Fix: Align the Humaneval dataset with official results
Details: (1) modified the "doc_to_text" and "gen_prefix" in the "humaneval_instruct.yaml" file to make them the same as the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".
(2) Change r.rfind("```") to r.find("```"), so it can locate the first "```", not the last one.
Results: Partially reproduced the official results: The result of LLaMA3.1-8B-Instruct is 66.5 (the official result is 72.6), and the result of LLaMA3.1-70B-Instruct is 80.5 (the official result is 80.5).
Ref: PR#2650
* add changelog and version
* add changelog
* Truthfulqa multi harness (#3062)
* truthfulqa-multi task
* truthfulqa-multi with chat few-shot
* few shot chat implementation
* changed until so it outputs lists
* changed dataset location
* added MT task
* Create README.md
* do not include MT
* changes for PR
* tag change
* removed yaml extension
* adding task to the table
* fix task configs
* add import exception
---------
Co-authored-by: Baber <baber@hey.com>
* Fix: Reduce CLI loading time from 2.2s to 0.05s (#3099)
* Lazy-load submodules to reduce import time
* pacify pre-commit
---------
Co-authored-by: Baber <baber@hey.com>
* Humaneval - fix regression (#3102)
* use double quotes
* Bugfix/hf tokenizer gguf override (#3098)
* fix(hf-gguf): skip gguf_file if external tokenizer is provided
* docs(readme): add instructions for evaluating GGUF models with Hugging Face backend
* [FIX] Initial code to disable multi-proc for stderr (#3106)
* [FIX] Initial code to disable multi-proc for stderr
* add docs; align no-mp bootstrap with mp
---------
Co-authored-by: Baber <baber@hey.com>
* remove all; reformat table (#3107)
* delete unneeded files (#3108)
* delete unneeded files
* Fixed #3005: Processes both formats of model_args: string and dictionary (#3097)
* git push --force
correctly processes both formats of model_args: string and dictionary
* extract to function for better test
* nit
---------
Co-authored-by: Baber <baber@hey.com>
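The fix above normalizes model_args before use, accepting both the classic comma-separated string and a dict (as passed from the Python API). A sketch of such a normalizer; the function name is hypothetical:

```python
def normalize_model_args(model_args):
    # Accept "pretrained=gpt2,dtype=float16" strings, dicts, or None.
    if model_args is None:
        return {}
    if isinstance(model_args, dict):
        return dict(model_args)  # copy so the caller's dict isn't mutated
    # split only on the first "=" so values may themselves contain "="
    return dict(kv.split("=", 1) for kv in model_args.split(",") if kv)
```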
* add image hashing and `LMEVAL_HASHMM` envar (#2973)
* add image hashing
* remove unused params description
* use `LMEVAL_HASHMM` (default '1') to save raw images
---------
Co-authored-by: Baber <baber@hey.com>
* delete neuralmagic models (#3112)
* Neuralmagic (#3113)
* remove sparse-ml
* check pil dep (#3114)
* warning for "chat" pretrained; disable buggy evalita configs (#3127)
* check for chat for warning
* add test
* remove yaml extension from some evalita configs
* move unitxt to own test script
* fix CI test
* fix: remove warning (#3128)
* Adding EgyMMLU and EgyHellaSwag (#3063)
* add egy mmlu hellaswag
* add egymmlu egyhellaswag to tasks readme
* fix egymmlu config generation
* fix _generate_configs formatting
* Added mixed_precision_dtype arg (#3138)
* Fix for hang due to mp.Pool in bootstrap_stderr (#3135)
* make pytorch an optional dependency
* remove FakeLM
* missed a torch
* print statements for the recursion
* sigh
* changes
* change task configs
* update mmlu config
* include soft metrics for MMLU
* logging metrics for mathqa
* lint
* Discrim Eval From Upstream
* Add Forced Versions of this Metric
* Scaling Law Metrics
* Torches
* Fix Pull Main, Add Metrics for Bias Tasks
* Lint
* Unused
* CohereForAI -> CohereLabs
---------
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Maanu Grover <maanug@nvidia.com>
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Yotam Perlitz <y.perlitz@ibm.com>
Signed-off-by: Lu Fang <lufang@fb.com>
Signed-off-by: Mírian Silva <mirianfrsilva@ibm.com>
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Naiara Perez <naiara.pme@gmail.com>
Co-authored-by: Trawinski, Dariusz <dariusz.trawinski@intel.com>
Co-authored-by: Baber <baber@hey.com>
Co-authored-by: Slawomir Strehlke <slawomir.strehlke@intel.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Maanu Grover <109391026+maanug-nv@users.noreply.github.com>
Co-authored-by: Yao Matrix <yaoweifeng0301@126.com>
Co-authored-by: Rima Shahbazyan <74137119+rimashahbazyan@users.noreply.github.com>
Co-authored-by: shivalika-singh <shivalikasingh@cohere.com>
Co-authored-by: shivi <shivalikasingh95@gmail.com>
Co-authored-by: Sabrina J. Mielke <sjm@sjmielke.com>
Co-authored-by: Firoj Alam, Scientist, QCRI <firojalam@users.noreply.github.com>
Co-authored-by: Basel Mousi <bmousi@hbku.edu.qa>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: CL-ModelCloud <cl@modelcloud.ai>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Co-authored-by: Petr Baudis <pasky@ucw.cz>
Co-authored-by: Wenyang LUO <86722018+timturing@users.noreply.github.com>
Co-authored-by: Hojin Lee <nyx1371@snu.ac.kr>
Co-authored-by: Hojin Lee <19949034+hjlee1371@users.noreply.github.com>
Co-authored-by: Shivansh Pachnanda <114482037+KahnSvaer@users.noreply.github.com>
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
Co-authored-by: Boda Sadallah <abdelrahman.sadallah@mbzuai.ac.ae>
Co-authored-by: Gyouk Chu <94156717+GyoukChu@users.noreply.github.com>
Co-authored-by: nike00811 <nike00811@gmail.com>
Co-authored-by: Ramiro R. C. <rawthil@gmail.com>
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai>
Co-authored-by: Irina Proskurina <72871167+upunaprosk@users.noreply.github.com>
Co-authored-by: Nicky Pochinkov <52249105+nickypro@users.noreply.github.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: asgsaeid <43481290+asgsaeid@users.noreply.github.com>
Co-authored-by: asgsaeid <asgaris@Saeids-MacBook-Pro.local>
Co-authored-by: Arda <ge32max@mytum.de>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: omahs <73983677+omahs@users.noreply.github.com>
Co-authored-by: Michele Resta <79645321+m-resta@users.noreply.github.com>
Co-authored-by: rzanoli <zanoli@fbk.eu>
Co-authored-by: Marco Madeddu <marco.madeddu.bra@gmail.com>
Co-authored-by: Kiersten Stokes <kierstenstokes@gmail.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: James A. Michaelov <32554945+jmichaelov@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: Farhan Ahmed <Farhan.Ahmed@ibm.com>
Co-authored-by: Naiara Perez <naiara.perez@ehu.eus>
Co-authored-by: Jocelyn <34988596+HelloJocelynLu@users.noreply.github.com>
Co-authored-by: Santiago Galiano Segura <71637365+sgs97ua@users.noreply.github.com>
Co-authored-by: Robiert Sepulveda Torres <rsepulveda911112@gmail.com>
Co-authored-by: Kailashbuki <111277+kailashbuki@users.noreply.github.com>
Co-authored-by: Jinwei <55192557+Monstertail@users.noreply.github.com>
Co-authored-by: Xiaotong Jiang <xiaotong.jiang@databricks.com>
Co-authored-by: Harsh Kohli <harsh14791@gmail.com>
Co-authored-by: Lucia Quirke <luciarosequirke@gmail.com>
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>
Co-authored-by: Yongkeun Hwang <ykstyle@ykstyle.info>
Co-authored-by: Rui Vieira <rcardoso@redhat.com>
Co-authored-by: Giulio Lovisotto <giuliolovisotto@gmail.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Co-authored-by: Kajetan Dymkiewicz <kajetan.dymkiewicz@gmail.com>
Co-authored-by: PabloAgustin <pabloagustinquemas@gmail.com>
Co-authored-by: PabloAgustin <pablo.martin@bsc.es>
Co-authored-by: Surya Kasturi <kasturisurya@gmail.com>
Co-authored-by: Zeyuan Allen-Zhu <zhuzeyuan@hotmail.com>
Co-authored-by: daniel-salib <danielsalib@meta.com>
Co-authored-by: Alexandra Rak <rakalexandra@mail.ru>
Co-authored-by: Oskar van der Wal <56364990+oskarvanderwal@users.noreply.github.com>
Co-authored-by: Avelina9X <37878580+Avelina9X@users.noreply.github.com>
Co-authored-by: Angelika Romanou <angelika.romanou@epfl.ch>
Co-authored-by: Jonas Golde <jonas.golde@gmail.com>
Co-authored-by: Jaedong Hwang <jdhwang730@gmail.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Alexandre Marques <almarque@redhat.com>
Co-authored-by: Yifei Zhang <yifei.zhang1992@outlook.com>
Co-authored-by: heli-qi <93250319+heli-qi@users.noreply.github.com>
Co-authored-by: Alexandre Marques <alexandre@neuralmagic.com>
Co-authored-by: Bruno Carneiro <brunocarneirofs@gmail.com>
Co-authored-by: wackey <386622495@qq.com>
Co-authored-by: Jinho Heo <70141850+Aprilistic@users.noreply.github.com>
Co-authored-by: Harsha <858059+harshakokel@users.noreply.github.com>
Co-authored-by: Hadi Abdine <59775564+hadi-abdine@users.noreply.github.com>
Co-authored-by: dazipe <126095259+dazipe@users.noreply.github.com>
Co-authored-by: Daniel Holanda <holand.daniel@gmail.com>
Co-authored-by: Saibo-creator <53392976+Saibo-creator@users.noreply.github.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
Co-authored-by: Nikodem Szwast <97400923+Medokins@users.noreply.github.com>
Co-authored-by: Felipe Maia Polo <felipemaiapolo@gmail.com>
Co-authored-by: mirianfrsilva <mirianfrsilva@ibm.com>
Co-authored-by: Daniele <36171005+dtrifiro@users.noreply.github.com>
Co-authored-by: Jerry Zhang <jerryzh168@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Vladislav Mikhailov <43072268+vmkhlv@users.noreply.github.com>
Co-authored-by: Anna Fontana <101867173+annafontanaa@users.noreply.github.com>
Co-authored-by: Ihar Hrachyshka <ihrachys@redhat.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: Yoonsoo Kim <34365327+yoonniverse@users.noreply.github.com>
Co-authored-by: Jess <jessicaojo19@gmail.com>
Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>
Co-authored-by: Yufeng Xu <yx3038@nyu.edu>
Co-authored-by: tawsif <sleeping4cat@outlook.com>
Co-authored-by: Tingchen Fu <48080217+TingchenFu@users.noreply.github.com>
Co-authored-by: Filippo Momentè <68816087+momentino@users.noreply.github.com>
Co-authored-by: Rob Geada <rob@geada.net>
Co-authored-by: Hongseok Oh <97136787+abzb1@users.noreply.github.com>
Co-authored-by: Niccolò Ajroldi <61059403+Niccolo-Ajroldi@users.noreply.github.com>
Co-authored-by: Artem Safin <artemsafin67@gmail.com>
Co-authored-by: fxmarty-amd <felmarty@amd.com>
Co-authored-by: Ameya Godbole <ameyag416@gmail.com>
Co-authored-by: Boda Sadallah <bodasadallah@gmail.com>
Co-authored-by: Ivan Stankevich <105574942+e1washere@users.noreply.github.com>
Co-authored-by: Yury Sulsky <yury.sulsky@gmail.com>
Co-authored-by: Younes B <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: fuder.eth <139509124+vtjl10@users.noreply.github.com>
Co-authored-by: Maxim Evtush <154841002+maximevtush@users.noreply.github.com>
Co-authored-by: NourFahmy <35409519+NourFahmy@users.noreply.github.com>
Co-authored-by: shanhx2000 <hs359@duke.edu>
Co-authored-by: jinze <46251666+userljz@users.noreply.github.com>
Co-authored-by: Blanca Calvo <33485967+BlancaCalvo@users.noreply.github.com>
Co-authored-by: Alex Stachowiak <alexander@computer.org>
Co-authored-by: Ankush <51945739+ankush13r@users.noreply.github.com>
Co-authored-by: Neel Gupta <neelgupta04@outlook.com>
Co-authored-by: Debjyoti Ray <33850567+DebjyotiRay@users.noreply.github.com>
Co-authored-by: Atou Houdaifa <atou.hdf@gmail.com>
Co-authored-by: Ankit Gola <ankitgola005@gmail.com>
Co-authored-by: David Hall <dlwh@stanford.edu>
Co-authored-by: Nikil Ravi <nravi@stanford.edu>
Co-authored-by: Suhas Kotha <38450656+kothasuhas@users.noreply.github.com>