Add the BlueBench benchmark #2369

shachardon · 2024-10-01T13:18:28Z

BlueBench is an open-source benchmark developed by domain experts to represent the needs of enterprise users.

It is constructed using state-of-the-art benchmarking methodologies to ensure validity, robustness, and efficiency by utilizing unitxt's abilities for dynamic and flexible text processing.

As a dynamic and evolving benchmark, BlueBench currently encompasses diverse domains such as legal, finance, customer support, and news. It also evaluates a range of capabilities, including RAG, pro-social behavior, summarization, and chatbot performance, with additional tasks and domains to be integrated over time.

CLAassistant · 2024-10-01T13:18:36Z

All committers have signed the CLA.

* Update README.md

  I encountered Git buffer size limits when trying to download the full commit history of the repository, for example:

  ```
  error: RPC failed; curl 18 transfer closed with outstanding read data remaining
  error: 5815 bytes of body are still expected
  fetch-pack: unexpected disconnect while reading sideband packet
  fatal: early EOF
  ```

  Downloading only the latest version of the repository is therefore faster and avoids these errors.

* Fix linting issue

* feat(neuron): align with latest optimum-neuron
* feat(neuron): support pre-exported neuron models
* fix(neuron): correctly use max_length
* fix(neuron): adapt loglikelihood

  The evaluation of log likelihood was not working for neuron models using continuous batching, such as all cached neuron LLama models.

* refactor(neuron): remove dead code

* Treat python tasks same as yaml tasks.
* Add tests.
* Re-add fixture decorators.
* Fix typing specification error for Python 3.9.

* change glianorex to test set
* nit
* fix test; doc_to_target can be str for multiple_choice
* nit

…erAI#2334)

* add newlines to task descriptions; increment versions
* fix task tests (with groups)
* Apply suggestions from code review

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>

* Added TurkishMMLU to LM Evaluation Harness
* Fixed COT name
* Fixed COT name
* Updated Readme
* Fixed Test issues
* Completed Scan for changed tasks
* Updated Readme
* Update README.md
* fixup task naming casing + ensure yaml template stubs aren't registered

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
Co-authored-by: haileyschoelkopf <[email protected]>

* better error message; fix greedy matching
* Update lm_eval/models/openai_completions.py

  Co-authored-by: Hailey Schoelkopf <[email protected]>

* Update lm_eval/models/openai_completions.py

  Co-authored-by: Hailey Schoelkopf <[email protected]>

* pre-commit

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>

* fix some bugs of mmlu
* Fix end of file newline issue

---------

Co-authored-by: eyuansu62 <[email protected]>

* Add portuguese_bench
* Add flores_pt group
* Update _flores_common_yaml
* Run linters and update flores and readme

shachardon · 2024-10-07T09:43:56Z

Hi @haileyschoelkopf @baberabb @lintangsutawika, can you take a look please? Thanks!

yoavkatz · 2024-10-14T12:46:17Z

lm_eval/tasks/README.md

@@ -126,3 +126,4 @@
 | [xnli_eu](xnli_eu/README.md) | Cross-lingual Natural Language Inference tasks in Basque. | Basque |
 | [xstorycloze](xstorycloze/README.md) | Cross-lingual narrative understanding tasks to predict story endings in multiple languages. | Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese |
 | [xwinograd](xwinograd/README.md) | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages. | English, French, Japanese, Portuguese, Russian, Chinese |
+| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese |


Should this be in alphabetical order?
Should "bluebench" be added to the table?

Signed-off-by: Yotam-Perlitz <[email protected]>

shachardon requested review from haileyschoelkopf, lintangsutawika and baberabb as code owners October 1, 2024 13:18

shachardon and others added 26 commits October 1, 2024 09:42

add bluebench files

f5282e6

tmp - run with local unitxt installation

8537fe5

repr bug (EleutherAI#2315)

390ef6f

add version metadata to configs

c06d2b2

update template list names

66c058b

change naming convention to lowercase

6bf4f3b

use recipes instead of cards + add descriptions to the README

8aec6e1

add benchmarks results

7844627

fix model names

a5a70ef

update readme

98066cb

transpose results table

9bbe5ef

another change to the results table

33f8a6e

Add unitxt to the README

6523683

fill checklist

e28919c

Update README.md

9d8b795

finalize checklist

d52e5dc

fix typo

980ec2d

scenarios -> groups

c2ca175

fix type

76f2c68

fix typos

0385c13

clarify checklist

ffcd078

change cards links to recipes links

3cfacd9

Fixed dummy model (EleutherAI#2339)

46dc531

add a note for missing dependencies (EleutherAI#2336)

4b8e739

baberabb and others added 14 commits October 1, 2024 09:47

load metric with evaluate (EleutherAI#2351)

4227dbc

fix writeout script (EleutherAI#2350)

f1a9ac8

Treat tags in python tasks the same as yaml tasks (EleutherAI#2288)

b4eb477

change group to tags in task eus_exams task configs (EleutherAI#2320)

2939e2f

change glianorex to test split (EleutherAI#2332)

ce1e8e9

mmlu-pro: add newlines to task descriptions (not leaderboard) (Eleuth…

1036191

…erAI#2334)

* add newlines to task descriptions; increment versions
* fix task tests (with groups)
* Apply suggestions from code review

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>

add mmlu readme (EleutherAI#2282)

5e3fc76

fix some bugs of mmlu (EleutherAI#2299)

94e8b79

Add new benchmark: Portuguese bench (EleutherAI#2156)

b7b6169

Fix missing key in custom task loading. (EleutherAI#2304)

dca2cb1

use the unitxt HF implementation

2b6a6e4

remove file used to format the yaml files

da0a3ae

shachardon force-pushed the bluebench_pr branch from 83d2d1b to da0a3ae Compare October 1, 2024 13:49

shachardon and others added 2 commits October 2, 2024 04:11

uncomment bluebench_translation_mt_flores_101_eng_kor

4461ff1

Merge branch 'main' into bluebench_pr

0d1eeb2

yoavkatz reviewed Oct 14, 2024

perlitz and others added 7 commits October 23, 2024 20:46

Merge branch 'EleutherAI:main' into bluebench_pr

da73a22

accommodate for args_dict

b2c489e

Signed-off-by: Yotam-Perlitz <[email protected]>

Align to format

8447856

Signed-off-by: Yotam-Perlitz <[email protected]>

Align to formatting

b6ad7b2

Signed-off-by: Yotam-Perlitz <[email protected]>

add blank line

57235ff

Signed-off-by: Yotam-Perlitz <[email protected]>

Add trailing blank line

55fb5c0

Signed-off-by: Yotam-Perlitz <[email protected]>

Merge branch 'EleutherAI:main' into bluebench_pr

81bba69
