add Russian mmlu #2378

Status: Open. Wants to merge 6 commits into base branch `main`.
2 changes: 1 addition & 1 deletion lm_eval/tasks/README.md
@@ -1,4 +1,3 @@

# Tasks

A list of supported tasks and task groupings can be viewed with `lm-eval --tasks list`.
@@ -77,6 +76,7 @@
| [mgsm](mgsm/README.md) | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| [minerva_math](minerva_math/README.md) | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English |
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlu_ru](mmlu_ru/README.md) | Russian machine-translated version of the MMLU benchmark for broad-domain language evaluation. Several variants are supported. | Russian |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
70 changes: 70 additions & 0 deletions lm_eval/tasks/mmlu_ru/README.md
@@ -0,0 +1,70 @@
# mmlu_ru

### Paper

Title: `MMLU in Russian (Measuring Massive Multitask Language Understanding)`

Abstract: https://arxiv.org/abs/2009.03300

The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more; this variant uses a Russian translation of the questions and answer choices.

Homepage: `https://github.com/NLP-Core-Team/mmlu_ru`

Note: The `Flan` variants are derived from [here](https://github.com/jasonwei20/flan-2), and as described in Appendix D.1 of [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416).

Dataset Creation:

The translation was made via the Yandex.Translate API. There are some translation mistakes, especially with terms and formulas; no fixes were applied. The initial dataset was taken from https://people.eecs.berkeley.edu/~hendrycks/data.tar.

### Citation

```
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}

@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
```

### Groups, Tags, and Tasks

#### Groups

* `mmlu_ru`: `Original multiple-choice MMLU benchmark, translated to Russian`
* `mmlu_ru_continuation`: `MMLU in Russian with continuation prompts`
* `mmlu_ru_generation`: `MMLU in Russian, generation variant`

`mmlu_ru` is the original benchmark as implemented by Hendrycks et al., with the choices in context and the answer letters (e.g. `A`, `B`, `C`, `D`) in the continuation.
`mmlu_ru_continuation` is a cloze-style variant without the choices in context and with the full answer choice in the continuation.
`mmlu_ru_generation` is a generation variant, similar to the original, but the LLM is asked to generate the correct answer letter.
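A rough sketch of how the three prompt styles differ for a single document (the field names `question_ru`, `choices_ru`, and `answer` follow the dataset config in this PR; the harness renders the actual prompts from the YAML templates, so this is illustrative only):

```python
# Illustrative sketch of the three mmlu_ru prompt styles for one example.
doc = {
    "question_ru": "Чему равно 2 + 2?",
    "choices_ru": ["3", "4", "5", "6"],
    "answer": 1,  # index of the correct choice
}
letters = ["A", "B", "C", "D"]

# Original style: choices shown in context, target is the answer letter.
original_prompt = (
    f"Question: {doc['question_ru']}\n"
    + "".join(f"{letter}. {choice}\n" for letter, choice in zip(letters, doc["choices_ru"]))
    + "Answer:"
)
original_target = letters[doc["answer"]]

# Continuation style: no choices in context, target is the full choice text.
continuation_prompt = f"Question: {doc['question_ru']}\nAnswer:"
continuation_target = doc["choices_ru"][doc["answer"]]

# Generation style: like the original, but the model must generate the letter.
generation_target = letters[doc["answer"]]
```

The loglikelihood variants score each candidate target; the generation variant instead parses the model's free-form output for the letter.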


#### Subgroups

* `mmlu_ru_stem`
* `mmlu_ru_humanities`
* `mmlu_ru_social_sciences`
* `mmlu_ru_other`

Subgroup variants are prefixed with the variant name, e.g. `mmlu_ru_continuation_stem`.

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
158 changes: 158 additions & 0 deletions lm_eval/tasks/mmlu_ru/_generate_configs.py
@@ -0,0 +1,158 @@
"""
Take in a YAML, and output all "other" splits with this YAML
"""

import argparse
import logging
import os

import yaml
from tqdm import tqdm


eval_logger = logging.getLogger("lm-eval")


SUBJECTS = {
"abstract_algebra": "stem",
"anatomy": "stem",
"astronomy": "stem",
"business_ethics": "other",
"clinical_knowledge": "other",
"college_biology": "stem",
"college_chemistry": "stem",
"college_computer_science": "stem",
"college_mathematics": "stem",
"college_medicine": "other",
"college_physics": "stem",
"computer_security": "stem",
"conceptual_physics": "stem",
"econometrics": "social_sciences",
"electrical_engineering": "stem",
"elementary_mathematics": "stem",
"formal_logic": "humanities",
"global_facts": "other",
"high_school_biology": "stem",
"high_school_chemistry": "stem",
"high_school_computer_science": "stem",
"high_school_european_history": "humanities",
"high_school_geography": "social_sciences",
"high_school_government_and_politics": "social_sciences",
"high_school_macroeconomics": "social_sciences",
"high_school_mathematics": "stem",
"high_school_microeconomics": "social_sciences",
"high_school_physics": "stem",
"high_school_psychology": "social_sciences",
"high_school_statistics": "stem",
"high_school_us_history": "humanities",
"high_school_world_history": "humanities",
"human_aging": "other",
"human_sexuality": "social_sciences",
"international_law": "humanities",
"jurisprudence": "humanities",
"logical_fallacies": "humanities",
"machine_learning": "stem",
"management": "other",
"marketing": "other",
"medical_genetics": "other",
"miscellaneous": "other",
"moral_disputes": "humanities",
"moral_scenarios": "humanities",
"nutrition": "other",
"philosophy": "humanities",
"prehistory": "humanities",
"professional_accounting": "other",
"professional_law": "humanities",
"professional_medicine": "other",
"professional_psychology": "social_sciences",
"public_relations": "social_sciences",
"security_studies": "social_sciences",
"sociology": "social_sciences",
"us_foreign_policy": "social_sciences",
"virology": "other",
"world_religions": "humanities",
}
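As a quick sanity check (not part of the PR), this mapping can be inverted into the category-to-subjects view that the group configs aggregate over; an abridged sketch with one example subject per category:

```python
from collections import defaultdict

# Abridged copy of the SUBJECTS mapping above (four of the 57 subjects),
# inverted to the category -> subjects view used by the group configs.
subjects_sample = {
    "abstract_algebra": "stem",
    "econometrics": "social_sciences",
    "formal_logic": "humanities",
    "business_ethics": "other",
}

by_category = defaultdict(list)
for subject, category in subjects_sample.items():
    by_category[category].append(subject)
```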


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_yaml_path", required=True)
    parser.add_argument("--save_prefix_path", default="mmlu")
    parser.add_argument("--cot_prompt_path", default=None)
    parser.add_argument("--task_prefix", default="")
    parser.add_argument("--group_prefix", default="")
    return parser.parse_args()


if __name__ == "__main__":
args = parse_args()

# get filename of base_yaml so we can `"include": ` it in our "other" YAMLs.
base_yaml_name = os.path.split(args.base_yaml_path)[-1]
with open(args.base_yaml_path, encoding="utf-8") as f:
base_yaml = yaml.full_load(f)

if args.cot_prompt_path is not None:
import json

with open(args.cot_prompt_path, encoding="utf-8") as f:
cot_file = json.load(f)

ALL_CATEGORIES = []
for subject, category in tqdm(SUBJECTS.items()):
if category not in ALL_CATEGORIES:
ALL_CATEGORIES.append(category)

if args.cot_prompt_path is not None:
description = cot_file[subject]
else:
description = f"The following are multiple choice questions (with answers) about {' '.join(subject.split('_'))}.\n\n"

yaml_dict = {
"include": base_yaml_name,
"tag": f"mmlu_{args.task_prefix}_{category}"
if args.task_prefix != ""
else f"mmlu_{category}",
"task": f"mmlu_{args.task_prefix}_{subject}"
if args.task_prefix != ""
else f"mmlu_{subject}",
"task_alias": subject.replace("_", " "),
"dataset_name": subject,
"description": description,
}

file_save_path = args.save_prefix_path + f"_{subject}.yaml"
eval_logger.info(f"Saving yaml for subset {subject} to {file_save_path}")
with open(file_save_path, "w", encoding="utf-8") as yaml_file:
yaml.dump(
yaml_dict,
yaml_file,
allow_unicode=True,
default_style='"',
)

if args.task_prefix != "":
mmlu_subcategories = [
f"mmlu_{args.task_prefix}_{category}" for category in ALL_CATEGORIES
]
else:
mmlu_subcategories = [f"mmlu_{category}" for category in ALL_CATEGORIES]

if args.group_prefix != "":
file_save_path = args.group_prefix + ".yaml"
else:
file_save_path = args.save_prefix_path + ".yaml"

eval_logger.info(f"Saving benchmark config to {file_save_path}")
with open(file_save_path, "w", encoding="utf-8") as yaml_file:
yaml.dump(
{
"group": f"mmlu_{args.task_prefix}"
if args.task_prefix != ""
else "mmlu",
"task": mmlu_subcategories,
},
yaml_file,
indent=4,
default_flow_style=False,
)
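For example, run without `--task_prefix` and without a CoT prompt file, the loop above emits a dict like the following for `abstract_algebra`, which `yaml.dump` then writes to `mmlu_abstract_algebra.yaml` (a sketch of the logic shown; `_default_template_yaml` is a hypothetical base-YAML filename):

```python
# Sketch of the per-subject config the generator script produces for one
# subject, assuming no --task_prefix and no CoT prompt file.
subject, category = "abstract_algebra", "stem"
base_yaml_name = "_default_template_yaml"  # hypothetical base YAML filename

description = (
    "The following are multiple choice questions (with answers) about "
    f"{' '.join(subject.split('_'))}.\n\n"
)

yaml_dict = {
    "include": base_yaml_name,
    "tag": f"mmlu_{category}",
    "task": f"mmlu_{subject}",
    "task_alias": subject.replace("_", " "),
    "dataset_name": subject,
    "description": description,
}
```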
13 changes: 13 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/_continuation_template_yaml
@@ -0,0 +1,13 @@
dataset_path: NLPCoreTeam/mmlu_ru # a copy of `cais/mmlu` with no auxiliary_train split
output_type: multiple_choice
test_split: test
fewshot_split: dev
fewshot_config:
sampler: first_n
doc_to_text: "Question: {{question_ru.strip()}}\nAnswer:"
doc_to_choice: "{{choices_ru}}"
doc_to_target: "{{answer}}"
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
32 changes: 32 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/_mmlu.yaml
@@ -0,0 +1,32 @@
group: mmlu_ru_continuation
group_alias: mmlu_ru (continuation)
task:
- group: stem
task:
- mmlu_ru_continuation_stem
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: other
task:
- mmlu_ru_continuation_other
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: social sciences
task:
- mmlu_ru_continuation_social_sciences
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: humanities
task:
- mmlu_ru_continuation_humanities
aggregate_metric_list:
- metric: acc
weight_by_size: True
aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 2
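The `weight_by_size: True` entries mean each group's accuracy is aggregated weighted by subtask size rather than as a plain mean of per-subtask scores. A sketch of that aggregation under assumed per-subtask results (the accuracies and document counts below are made up; the real values come from the harness at eval time):

```python
# Hypothetical per-subtask results: task -> (accuracy, number of documents).
subtask_results = {
    "mmlu_ru_continuation_abstract_algebra": (0.30, 100),
    "mmlu_ru_continuation_college_mathematics": (0.25, 100),
    "mmlu_ru_continuation_high_school_statistics": (0.40, 216),
}

# Size-weighted aggregation: larger subtasks contribute proportionally more.
total_docs = sum(n for _, n in subtask_results.values())
weighted_acc = sum(acc * n for acc, n in subtask_results.values()) / total_docs
```

An unweighted mean of the three accuracies would differ whenever subtask sizes differ, which is why the group configs opt into size weighting.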
@@ -0,0 +1,6 @@
"dataset_name": "abstract_algebra"
"description": "The following are questions (with answers) about abstract\
\ algebra.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_abstract_algebra"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/mmlu_anatomy.yaml
@@ -0,0 +1,6 @@
"dataset_name": "anatomy"
"description": "The following are questions (with answers) about anatomy.\n\
\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_anatomy"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/mmlu_astronomy.yaml
@@ -0,0 +1,6 @@
"dataset_name": "astronomy"
"description": "The following are questions (with answers) about astronomy.\n\
\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_astronomy"
@@ -0,0 +1,6 @@
"dataset_name": "business_ethics"
"description": "The following are questions (with answers) about business\
\ ethics.\n\n"
"tag": "mmlu_ru_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_business_ethics"
@@ -0,0 +1,6 @@
"dataset_name": "clinical_knowledge"
"description": "The following are questions (with answers) about clinical\
\ knowledge.\n\n"
"tag": "mmlu_ru_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_clinical_knowledge"
@@ -0,0 +1,6 @@
"dataset_name": "college_biology"
"description": "The following are questions (with answers) about college\
\ biology.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_biology"
@@ -0,0 +1,6 @@
"dataset_name": "college_chemistry"
"description": "The following are questions (with answers) about college\
\ chemistry.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_chemistry"
@@ -0,0 +1,6 @@
"dataset_name": "college_computer_science"
"description": "The following are questions (with answers) about college\
\ computer science.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_computer_science"
@@ -0,0 +1,6 @@
"dataset_name": "college_mathematics"
"description": "The following are questions (with answers) about college\
\ mathematics.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_mathematics"
@@ -0,0 +1,6 @@
"dataset_name": "college_medicine"
"description": "The following are questions (with answers) about college\
\ medicine.\n\n"
"tag": "mmlu_ru_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_medicine"
@@ -0,0 +1,6 @@
"dataset_name": "college_physics"
"description": "The following are questions (with answers) about college\
\ physics.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_physics"
@@ -0,0 +1,6 @@
"dataset_name": "computer_security"
"description": "The following are questions (with answers) about computer\
\ security.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_computer_security"
@@ -0,0 +1,6 @@
"dataset_name": "conceptual_physics"
"description": "The following are questions (with answers) about conceptual\
\ physics.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_conceptual_physics"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/mmlu_econometrics.yaml
@@ -0,0 +1,6 @@
"dataset_name": "econometrics"
"description": "The following are questions (with answers) about econometrics.\n\
\n"
"tag": "mmlu_ru_continuation_social_sciences"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_econometrics"
@@ -0,0 +1,6 @@
"dataset_name": "electrical_engineering"
"description": "The following are questions (with answers) about electrical\
\ engineering.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_electrical_engineering"