add Russian mmlu #2378

Status: Open. Wants to merge 6 commits into base branch `main`.
2 changes: 1 addition & 1 deletion lm_eval/tasks/README.md
@@ -1,4 +1,3 @@

# Tasks

A list of supported tasks and task groupings can be viewed with `lm-eval --tasks list`.
@@ -77,6 +76,7 @@
| [mgsm](mgsm/README.md) | Benchmark of multilingual grade-school math problems. | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| [minerva_math](minerva_math/README.md) | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills. | English |
| [mmlu](mmlu/README.md) | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported. | English |
| [mmlu_pro](mmlu_pro/README.md) | A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. | English |
| [mmlu_ru](mmlu_ru/README.md) | Russian machine-translated version of the MMLU benchmark for broad-domain language evaluation. Several variants are supported. | Russian |
| [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
| model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
70 changes: 70 additions & 0 deletions lm_eval/tasks/mmlu_ru/README.md
@@ -0,0 +1,70 @@
# mmlu_ru

### Paper

Title: `MMLU in Russian (Measuring Massive Multitask Language Understanding)`

Abstract: https://arxiv.org/abs/2009.03300

The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more; this variant uses a Russian translation of the questions and answer choices.

Homepage: `https://github.com/NLP-Core-Team/mmlu_ru`

Note: The `Flan` variants are derived from [here](https://github.com/jasonwei20/flan-2), and as described in Appendix D.1 of [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416).

Dataset Creation:

The translation was made via the Yandex.Translate API. There are some translation mistakes, especially with terms and formulas; no fixes were applied. The initial dataset was taken from https://people.eecs.berkeley.edu/~hendrycks/data.tar.

### Citation

```
@article{hendryckstest2021,
title={Measuring Massive Multitask Language Understanding},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}

@article{hendrycks2021ethics,
title={Aligning AI With Shared Human Values},
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2021}
}
```

### Groups, Tags, and Tasks

#### Groups

* `mmlu_ru`: `Original multiple-choice MMLU benchmark, translated to Russian`
* `mmlu_ru_continuation`: `MMLU in Russian with continuation prompts`
* `mmlu_ru_generation`: `MMLU in Russian, generation variant`

`mmlu_ru` is the original benchmark as implemented by Hendrycks et al., with the choices in context and the answer letters (e.g. `A`, `B`, `C`, `D`) in the continuation.
`mmlu_ru_continuation` is a cloze-style variant without the choices in context and with the full answer choice in the continuation.
`mmlu_ru_generation` is a generation variant, similar to the original, but the LLM is asked to generate the correct answer letter.
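A rough sketch of how the three prompt styles differ for a single document (the field names `question_ru`, `choices_ru`, and `answer` follow the dataset config in this PR; the harness renders the actual prompts from the YAML templates, so this is illustrative only):

```python
# Illustrative sketch of the three mmlu_ru prompt styles for one example.
doc = {
    "question_ru": "Чему равно 2 + 2?",
    "choices_ru": ["3", "4", "5", "6"],
    "answer": 1,  # index of the correct choice
}
letters = ["A", "B", "C", "D"]

# Original style: choices shown in context, target is the answer letter.
original_prompt = (
    f"Question: {doc['question_ru']}\n"
    + "".join(f"{letter}. {choice}\n" for letter, choice in zip(letters, doc["choices_ru"]))
    + "Answer:"
)
original_target = letters[doc["answer"]]

# Continuation style: no choices in context, target is the full choice text.
continuation_prompt = f"Question: {doc['question_ru']}\nAnswer:"
continuation_target = doc["choices_ru"][doc["answer"]]

# Generation style: like the original, but the model must generate the letter.
generation_target = letters[doc["answer"]]
```

The loglikelihood variants score each candidate target; the generation variant instead parses the model's free-form output for the letter.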


#### Subgroups

* `mmlu_ru_stem`
* `mmlu_ru_humanities`
* `mmlu_ru_social_sciences`
* `mmlu_ru_other`

Subgroup variants are prefixed with the variant name, e.g. `mmlu_ru_continuation_stem`.

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
158 changes: 158 additions & 0 deletions lm_eval/tasks/mmlu_ru/_generate_configs.py
@@ -0,0 +1,158 @@
"""
Take in a YAML, and output all "other" splits with this YAML
"""

import argparse
import logging
import os

import yaml
from tqdm import tqdm


eval_logger = logging.getLogger("lm-eval")


SUBJECTS = {
"abstract_algebra": "stem",
"anatomy": "stem",
"astronomy": "stem",
"business_ethics": "other",
"clinical_knowledge": "other",
"college_biology": "stem",
"college_chemistry": "stem",
"college_computer_science": "stem",
"college_mathematics": "stem",
"college_medicine": "other",
"college_physics": "stem",
"computer_security": "stem",
"conceptual_physics": "stem",
"econometrics": "social_sciences",
"electrical_engineering": "stem",
"elementary_mathematics": "stem",
"formal_logic": "humanities",
"global_facts": "other",
"high_school_biology": "stem",
"high_school_chemistry": "stem",
"high_school_computer_science": "stem",
"high_school_european_history": "humanities",
"high_school_geography": "social_sciences",
"high_school_government_and_politics": "social_sciences",
"high_school_macroeconomics": "social_sciences",
"high_school_mathematics": "stem",
"high_school_microeconomics": "social_sciences",
"high_school_physics": "stem",
"high_school_psychology": "social_sciences",
"high_school_statistics": "stem",
"high_school_us_history": "humanities",
"high_school_world_history": "humanities",
"human_aging": "other",
"human_sexuality": "social_sciences",
"international_law": "humanities",
"jurisprudence": "humanities",
"logical_fallacies": "humanities",
"machine_learning": "stem",
"management": "other",
"marketing": "other",
"medical_genetics": "other",
"miscellaneous": "other",
"moral_disputes": "humanities",
"moral_scenarios": "humanities",
"nutrition": "other",
"philosophy": "humanities",
"prehistory": "humanities",
"professional_accounting": "other",
"professional_law": "humanities",
"professional_medicine": "other",
"professional_psychology": "social_sciences",
"public_relations": "social_sciences",
"security_studies": "social_sciences",
"sociology": "social_sciences",
"us_foreign_policy": "social_sciences",
"virology": "other",
"world_religions": "humanities",
}
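As a quick sanity check (not part of the PR), this mapping can be inverted into the category-to-subjects view that the group configs aggregate over; an abridged sketch with one example subject per category:

```python
from collections import defaultdict

# Abridged copy of the SUBJECTS mapping above (four of the 57 subjects),
# inverted to the category -> subjects view used by the group configs.
subjects_sample = {
    "abstract_algebra": "stem",
    "econometrics": "social_sciences",
    "formal_logic": "humanities",
    "business_ethics": "other",
}

by_category = defaultdict(list)
for subject, category in subjects_sample.items():
    by_category[category].append(subject)
```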


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_yaml_path", required=True)
    parser.add_argument("--save_prefix_path", default="mmlu")
    parser.add_argument("--cot_prompt_path", default=None)
    parser.add_argument("--task_prefix", default="")
    parser.add_argument("--group_prefix", default="")
    return parser.parse_args()


if __name__ == "__main__":
args = parse_args()

# get filename of base_yaml so we can `"include": ` it in our "other" YAMLs.
base_yaml_name = os.path.split(args.base_yaml_path)[-1]
with open(args.base_yaml_path, encoding="utf-8") as f:
base_yaml = yaml.full_load(f)

if args.cot_prompt_path is not None:
import json

with open(args.cot_prompt_path, encoding="utf-8") as f:
cot_file = json.load(f)

ALL_CATEGORIES = []
for subject, category in tqdm(SUBJECTS.items()):
if category not in ALL_CATEGORIES:
ALL_CATEGORIES.append(category)

if args.cot_prompt_path is not None:
description = cot_file[subject]
else:
description = f"The following are multiple choice questions (with answers) about {' '.join(subject.split('_'))}.\n\n"

yaml_dict = {
"include": base_yaml_name,
"tag": f"mmlu_{args.task_prefix}_{category}"
if args.task_prefix != ""
else f"mmlu_{category}",
"task": f"mmlu_{args.task_prefix}_{subject}"
if args.task_prefix != ""
else f"mmlu_{subject}",
"task_alias": subject.replace("_", " "),
"dataset_name": subject,
"description": description,
}

file_save_path = args.save_prefix_path + f"_{subject}.yaml"
eval_logger.info(f"Saving yaml for subset {subject} to {file_save_path}")
with open(file_save_path, "w", encoding="utf-8") as yaml_file:
yaml.dump(
yaml_dict,
yaml_file,
allow_unicode=True,
default_style='"',
)

if args.task_prefix != "":
mmlu_subcategories = [
f"mmlu_{args.task_prefix}_{category}" for category in ALL_CATEGORIES
]
else:
mmlu_subcategories = [f"mmlu_{category}" for category in ALL_CATEGORIES]

if args.group_prefix != "":
file_save_path = args.group_prefix + ".yaml"
else:
file_save_path = args.save_prefix_path + ".yaml"

eval_logger.info(f"Saving benchmark config to {file_save_path}")
with open(file_save_path, "w", encoding="utf-8") as yaml_file:
yaml.dump(
{
"group": f"mmlu_{args.task_prefix}"
if args.task_prefix != ""
else "mmlu",
"task": mmlu_subcategories,
},
yaml_file,
indent=4,
default_flow_style=False,
)
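For example, run without `--task_prefix` and without a CoT prompt file, the loop above emits a dict like the following for `abstract_algebra`, which `yaml.dump` then writes to `mmlu_abstract_algebra.yaml` (a sketch of the logic shown; `_default_template_yaml` is a hypothetical base-YAML filename):

```python
# Sketch of the per-subject config the generator script produces for one
# subject, assuming no --task_prefix and no CoT prompt file.
subject, category = "abstract_algebra", "stem"
base_yaml_name = "_default_template_yaml"  # hypothetical base YAML filename

description = (
    "The following are multiple choice questions (with answers) about "
    f"{' '.join(subject.split('_'))}.\n\n"
)

yaml_dict = {
    "include": base_yaml_name,
    "tag": f"mmlu_{category}",
    "task": f"mmlu_{subject}",
    "task_alias": subject.replace("_", " "),
    "dataset_name": subject,
    "description": description,
}
```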
13 changes: 13 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/_continuation_template_yaml
@@ -0,0 +1,13 @@
dataset_path: NLPCoreTeam/mmlu_ru # a copy of `cais/mmlu` with no auxiliary_train split
output_type: multiple_choice
test_split: test
fewshot_split: dev
fewshot_config:
sampler: first_n
doc_to_text: "Question: {{question_ru.strip()}}\nAnswer:"
doc_to_choice: "{{choices_ru}}"
doc_to_target: "{{answer}}"
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
32 changes: 32 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/_mmlu.yaml
@@ -0,0 +1,32 @@
group: mmlu_ru_continuation
group_alias: mmlu_ru (continuation)
task:
- group: stem
task:
- mmlu_ru_continuation_stem
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: other
task:
- mmlu_ru_continuation_other
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: social sciences
task:
- mmlu_ru_continuation_social_sciences
aggregate_metric_list:
- metric: acc
weight_by_size: True
- group: humanities
task:
- mmlu_ru_continuation_humanities
aggregate_metric_list:
- metric: acc
weight_by_size: True
aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 2
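The `weight_by_size: True` entries mean each group's accuracy is aggregated weighted by subtask size rather than as a plain mean of per-subtask scores. A sketch of that aggregation under assumed per-subtask results (the accuracies and document counts below are made up; the real values come from the harness at eval time):

```python
# Hypothetical per-subtask results: task -> (accuracy, number of documents).
subtask_results = {
    "mmlu_ru_continuation_abstract_algebra": (0.30, 100),
    "mmlu_ru_continuation_college_mathematics": (0.25, 100),
    "mmlu_ru_continuation_high_school_statistics": (0.40, 216),
}

# Size-weighted aggregation: larger subtasks contribute proportionally more.
total_docs = sum(n for _, n in subtask_results.values())
weighted_acc = sum(acc * n for acc, n in subtask_results.values()) / total_docs
```

An unweighted mean of the three accuracies would differ whenever subtask sizes differ, which is why the group configs opt into size weighting.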
@@ -0,0 +1,6 @@
"dataset_name": "abstract_algebra"
"description": "The following are questions (with answers) about abstract\
\ algebra.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_abstract_algebra"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/mmlu_anatomy.yaml
@@ -0,0 +1,6 @@
"dataset_name": "anatomy"
"description": "The following are questions (with answers) about anatomy.\n\
\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_anatomy"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/mmlu_astronomy.yaml
@@ -0,0 +1,6 @@
"dataset_name": "astronomy"
"description": "The following are questions (with answers) about astronomy.\n\
\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_astronomy"
@@ -0,0 +1,6 @@
"dataset_name": "business_ethics"
"description": "The following are questions (with answers) about business\
\ ethics.\n\n"
"tag": "mmlu_ru_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_business_ethics"
@@ -0,0 +1,6 @@
"dataset_name": "clinical_knowledge"
"description": "The following are questions (with answers) about clinical\
\ knowledge.\n\n"
"tag": "mmlu_ru_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_clinical_knowledge"
@@ -0,0 +1,6 @@
"dataset_name": "college_biology"
"description": "The following are questions (with answers) about college\
\ biology.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_biology"
@@ -0,0 +1,6 @@
"dataset_name": "college_chemistry"
"description": "The following are questions (with answers) about college\
\ chemistry.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_chemistry"
@@ -0,0 +1,6 @@
"dataset_name": "college_computer_science"
"description": "The following are questions (with answers) about college\
\ computer science.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_computer_science"
@@ -0,0 +1,6 @@
"dataset_name": "college_mathematics"
"description": "The following are questions (with answers) about college\
\ mathematics.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_mathematics"
@@ -0,0 +1,6 @@
"dataset_name": "college_medicine"
"description": "The following are questions (with answers) about college\
\ medicine.\n\n"
"tag": "mmlu_ru_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_medicine"
@@ -0,0 +1,6 @@
"dataset_name": "college_physics"
"description": "The following are questions (with answers) about college\
\ physics.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_college_physics"
@@ -0,0 +1,6 @@
"dataset_name": "computer_security"
"description": "The following are questions (with answers) about computer\
\ security.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_computer_security"
@@ -0,0 +1,6 @@
"dataset_name": "conceptual_physics"
"description": "The following are questions (with answers) about conceptual\
\ physics.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_conceptual_physics"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu_ru/continuation/mmlu_econometrics.yaml
@@ -0,0 +1,6 @@
"dataset_name": "econometrics"
"description": "The following are questions (with answers) about econometrics.\n\
\n"
"tag": "mmlu_ru_continuation_social_sciences"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_econometrics"
@@ -0,0 +1,6 @@
"dataset_name": "electrical_engineering"
"description": "The following are questions (with answers) about electrical\
\ engineering.\n\n"
"tag": "mmlu_ru_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_ru_continuation_electrical_engineering"