1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -190,6 +190,7 @@ provided to the individual README.md files for each subfolder.
| [winogrande](winogrande/README.md) | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. | English |
| [wmdp](wmdp/README.md) | A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions. | English |
| [wmt2016](wmt2016/README.md) | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish |
| [wmt24pp](wmt24pp/README.md) | English→55 language/dialect translation benchmark built from the Google WMT24++ dataset, evaluated with BLEU/TER/ChrF per language pair. | English→Arabic, European, Indic, East Asian, African, and other WMT24++ target languages (55 total) |
| [wsc273](wsc273/README.md) | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English |
| [xcopa](xcopa/README.md) | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese |
| [xnli](xnli/README.md) | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
196 changes: 196 additions & 0 deletions lm_eval/tasks/wmt24pp/README.md
@@ -0,0 +1,196 @@
# WMT24++ Translation Tasks

This directory provides YAML-based tasks for evaluating English→X machine
translation on the **WMT24++** benchmark hosted on the Hugging Face Hub as
[`google/wmt24pp`](https://huggingface.co/datasets/google/wmt24pp).

Each language pair is exposed as a separate task, using consistent
WMT-style generation and metrics.

## Dataset

- **HF ID**: `google/wmt24pp`
- **Configs**: one per language pair (e.g. `en-de_DE`, `en-pl_PL`, `en-pt_BR`, ...)
- **Split**: single split (`train`), used here as the evaluation split
- **Fields (per example)**:
- `lp`: language pair, e.g. `"en-de_DE"`
- `domain`: text domain (canary, news, social, speech, literary)
- `document_id`: document identifier
- `segment_id`: global segment identifier
- `is_bad_source`: boolean flag for low-quality sources
- `source`: English source sentence
- `target`: post-edit of `original_target` (recommended reference)
- `original_target`: original reference translation

In this task family, we:
- **always evaluate English→X** using `source` as input and `target` as reference
- **drop all examples with `is_bad_source == True`**
- **use all domains** (no filtering on `domain`).
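The load-and-filter step can be illustrated on toy records that mimic the schema above (hypothetical values, not real dataset rows):

```python
# Toy records mimicking the google/wmt24pp schema (hypothetical values).
examples = [
    {"lp": "en-de_DE", "source": "Hello, world.", "target": "Hallo, Welt.", "is_bad_source": False},
    {"lp": "en-de_DE", "source": "(garbled)", "target": "", "is_bad_source": True},
]

# Keep only well-formed sources, as the task loader does.
kept = [ex for ex in examples if not ex["is_bad_source"]]
print(len(kept))  # → 1
```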

## Tasks

Common configuration is defined in `wmt24pp_common.yaml`; every per-language
YAML pulls it in via `include: wmt24pp_common.yaml`:

- `dataset_path: google/wmt24pp`
- `test_split: train`
- `output_type: generate_until`
- `doc_to_text: !function utils.doc_to_text`
- `doc_to_target: "{{target}}"`
- `custom_dataset: !function utils.load_wmt24pp_dataset`
- Metrics: **BLEU**, **TER**, **ChrF** (same triple as classic WMT tasks)

The `lang_pair` in `metadata` is passed to `utils.load_wmt24pp_dataset`, which
loads the corresponding HF config and filters out bad sources.

Each language pair has its own YAML including the common config, e.g.:

```yaml
include: wmt24pp_common.yaml

task: wmt24pp-en-de_DE

tag:
- translation
- wmt24pp

metadata:
version: 1.0
lang_pair: "en-de_DE"
```

All available language pairs are listed in the dataset card; in this repo they
are instantiated as tasks named `wmt24pp-<lp>`, where `<lp>` matches the HF
config (e.g. `wmt24pp-en-pt_BR`).

### Group

`wmt24pp_group.yaml` defines a group:

- `group: wmt24pp`
- `group_alias: WMT24++`
- `task: [wmt24pp-en-de_DE, wmt24pp-en-pl_PL, ...]`
- `aggregate_metric_list` aggregating **ChrF** across all subtasks using
`mean` (weighted by dataset size).
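A minimal sketch of what such a group config can look like (task list abbreviated; the aggregation field names here are an assumption based on the harness's usual group-config conventions):

```yaml
group: wmt24pp
group_alias: WMT24++
task:
  - wmt24pp-en-de_DE
  - wmt24pp-en-pl_PL
aggregate_metric_list:
  - metric: chrf
    aggregation: mean
    weight_by_size: true
```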

You can run all WMT24++ tasks via:

```bash
python -m lm_eval run \
--model hf --model_args pretrained=... \
--tasks wmt24pp
```

or select any subset of language pairs explicitly:

```bash
python -m lm_eval run \
--model hf --model_args pretrained=... \
  --tasks wmt24pp-en-de_DE,wmt24pp-en-pl_PL
```

You can also apply the model's chat template:

```bash
python -m lm_eval run \
--model hf --model_args pretrained=... \
  --tasks wmt24pp-en-de_DE,wmt24pp-en-pl_PL \
--apply_chat_template ...
```

## Example evaluation config

You can run a subset of language pairs using a YAML config.

```yaml
model: hf
model_args:
pretrained: Qwen/Qwen2.5-7B-Instruct
dtype: float16

tasks:
- wmt24pp-en-pl_PL

num_fewshot: 0
batch_size: 1
max_batch_size: 1
# device: cuda
limit: 10

gen_kwargs:
temperature: 0.0
max_gen_toks: 1400

output_path: ./results/
log_samples: true

wandb_args: {}
hf_hub_log_args: {}
```

With this configuration file in place, run the experiment with the following command:

```bash
lm_eval run \
--config my-tasks-config.yaml \
  --apply_chat_template ...
```

## Metrics

We follow the same metric setup as the other WMT translation tasks in this
repository, exposing three standard MT metrics:

- **BLEU** (`bleu`) – computed via SacreBLEU
- **TER** (`ter`) – Translation Error Rate
- **ChrF** (`chrf`) – character n‑gram F-score; the primary metric of interest
  for WMT24++, matching common reporting practice (e.g. Nemotron-3 Nano 30B).

All metrics are implemented via `lm_eval.api.metrics` and use SacreBLEU under
the hood.
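To make the ChrF idea concrete, here is a stdlib-only toy sketch of a character n-gram F-score. It is illustrative only, not the SacreBLEU implementation (which averages per-order F-scores and handles edge cases differently):

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Character n-grams with whitespace removed (as ChrF does by default)."""
    s = text.replace(" ", "")
    return Counter(s[i : i + n] for i in range(len(s) - n + 1))


def chrf_like(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Toy character n-gram F-beta score in the spirit of ChrF (0-100 scale)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        if h and r:
            precisions.append(overlap / sum(h.values()))
            recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# Identical strings yield a perfect score.
print(round(chrf_like("Der Hund läuft im Park.", "Der Hund läuft im Park."), 1))  # → 100.0
```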

## Task Validity Checklist

For adding novel benchmarks/datasets to the library:

- [x] **Is the task an existing benchmark in the literature?**
Yes. WMT24++ extends the official WMT24 benchmark to 55 languages/dialects as
described by Deutsch et al. (2025).
- [x] **Have you referenced the original paper that introduced the task?**
The citation for the WMT24++ paper is provided in the section below.
- [ ] **If yes, does the original paper provide a reference implementation?**
  Partially. The prompt template and dataset filtering match the reference
  release, but we do not replicate the full original evaluation pipeline.

If other tasks on this dataset are already supported:

- [x] **Is the "Main" variant of this task clearly denoted?**
Yes. Every YAML task is `wmt24pp-en-<target>` to emphasize the English→X
setup, and the group config exposes the complete benchmark as `wmt24pp`.
- [x] **Have you provided a short sentence on what each new variant adds / evaluates?**
The README explains that each YAML corresponds to a single HF config / language
pair; they all evaluate the same translation direction with identical metrics.
- [x] **Have you noted which published evaluation setups are matched by this variant?**
Yes. See the section above for the specific alignment with the WMT24++ dataset
card: same split (`train`), same bad-source filtering, same post-edited reference,
and the BLEU/TER/ChrF metric trio used in the paper/MTME release.

## Citation

Please cite the original WMT24++ paper and the lm-evaluation-harness project
as appropriate when using these tasks in publications.

```
@misc{deutsch2025wmt24expandinglanguagecoverage,
title={{WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects}},
author={Daniel Deutsch and Eleftheria Briakou and Isaac Caswell and Mara Finkelstein and Rebecca Galor and Juraj Juraska and Geza Kovacs and Alison Lui and Ricardo Rei and Jason Riesa and Shruti Rijhwani and Parker Riley and Elizabeth Salesky and Firas Trabelsi and Stephanie Winkler and Biao Zhang and Markus Freitag},
year={2025},
eprint={2502.12404},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.12404},
}
```
143 changes: 143 additions & 0 deletions lm_eval/tasks/wmt24pp/utils.py
@@ -0,0 +1,143 @@
"""Utilities for the WMT24++ translation tasks.

This module provides helpers used by YAML-configured ConfigurableTasks. It
exposes the `custom_dataset` loader, along with logic to render the official
WMT24++ prompt template so that all language-pair YAMLs can share a single
`doc_to_text` implementation.
"""

from __future__ import annotations

from typing import Any, Dict

from datasets import Dataset, load_dataset

SRC_LANG = "English"

TARGET_METADATA = {
"ar_EG": {"tgt_lang": "Arabic", "tgt_region": "Egypt"},
"ar_SA": {"tgt_lang": "Arabic", "tgt_region": "Saudi Arabia"},
"bg_BG": {"tgt_lang": "Bulgarian", "tgt_region": "Bulgaria"},
"bn_IN": {"tgt_lang": "Bengali", "tgt_region": "India"},
"ca_ES": {"tgt_lang": "Catalan", "tgt_region": "Spain"},
"cs_CZ": {"tgt_lang": "Czech", "tgt_region": "Czechia"},
"da_DK": {"tgt_lang": "Danish", "tgt_region": "Denmark"},
"de_DE": {"tgt_lang": "German", "tgt_region": "Germany"},
"el_GR": {"tgt_lang": "Greek", "tgt_region": "Greece"},
"es_MX": {"tgt_lang": "Spanish", "tgt_region": "Mexico"},
"et_EE": {"tgt_lang": "Estonian", "tgt_region": "Estonia"},
"fa_IR": {"tgt_lang": "Persian", "tgt_region": "Iran"},
"fi_FI": {"tgt_lang": "Finnish", "tgt_region": "Finland"},
"fil_PH": {"tgt_lang": "Filipino", "tgt_region": "Philippines"},
"fr_CA": {"tgt_lang": "French", "tgt_region": "Canada"},
"fr_FR": {"tgt_lang": "French", "tgt_region": "France"},
"gu_IN": {"tgt_lang": "Gujarati", "tgt_region": "India"},
"he_IL": {"tgt_lang": "Hebrew", "tgt_region": "Israel"},
"hi_IN": {"tgt_lang": "Hindi", "tgt_region": "India"},
"hr_HR": {"tgt_lang": "Croatian", "tgt_region": "Croatia"},
"hu_HU": {"tgt_lang": "Hungarian", "tgt_region": "Hungary"},
"id_ID": {"tgt_lang": "Indonesian", "tgt_region": "Indonesia"},
"is_IS": {"tgt_lang": "Icelandic", "tgt_region": "Iceland"},
"it_IT": {"tgt_lang": "Italian", "tgt_region": "Italy"},
"ja_JP": {"tgt_lang": "Japanese", "tgt_region": "Japan"},
"kn_IN": {"tgt_lang": "Kannada", "tgt_region": "India"},
"ko_KR": {"tgt_lang": "Korean", "tgt_region": "South Korea"},
"lt_LT": {"tgt_lang": "Lithuanian", "tgt_region": "Lithuania"},
"lv_LV": {"tgt_lang": "Latvian", "tgt_region": "Latvia"},
"ml_IN": {"tgt_lang": "Malayalam", "tgt_region": "India"},
"mr_IN": {"tgt_lang": "Marathi", "tgt_region": "India"},
"nl_NL": {"tgt_lang": "Dutch", "tgt_region": "Netherlands"},
"no_NO": {"tgt_lang": "Norwegian", "tgt_region": "Norway"},
"pa_IN": {"tgt_lang": "Punjabi", "tgt_region": "India"},
"pl_PL": {"tgt_lang": "Polish", "tgt_region": "Poland"},
"pt_BR": {"tgt_lang": "Portuguese", "tgt_region": "Brazil"},
"pt_PT": {"tgt_lang": "Portuguese", "tgt_region": "Portugal"},
"ro_RO": {"tgt_lang": "Romanian", "tgt_region": "Romania"},
"ru_RU": {"tgt_lang": "Russian", "tgt_region": "Russia"},
"sk_SK": {"tgt_lang": "Slovak", "tgt_region": "Slovakia"},
"sl_SI": {"tgt_lang": "Slovenian", "tgt_region": "Slovenia"},
"sr_RS": {"tgt_lang": "Serbian", "tgt_region": "Serbia"},
"sv_SE": {"tgt_lang": "Swedish", "tgt_region": "Sweden"},
"sw_KE": {"tgt_lang": "Swahili", "tgt_region": "Kenya"},
"sw_TZ": {"tgt_lang": "Swahili", "tgt_region": "Tanzania"},
"ta_IN": {"tgt_lang": "Tamil", "tgt_region": "India"},
"te_IN": {"tgt_lang": "Telugu", "tgt_region": "India"},
"th_TH": {"tgt_lang": "Thai", "tgt_region": "Thailand"},
"tr_TR": {"tgt_lang": "Turkish", "tgt_region": "Turkey"},
"uk_UA": {"tgt_lang": "Ukrainian", "tgt_region": "Ukraine"},
"ur_PK": {"tgt_lang": "Urdu", "tgt_region": "Pakistan"},
"vi_VN": {"tgt_lang": "Vietnamese", "tgt_region": "Vietnam"},
"zh_CN": {"tgt_lang": "Chinese", "tgt_region": "China"},
"zh_TW": {"tgt_lang": "Chinese", "tgt_region": "Taiwan"},
"zu_ZA": {"tgt_lang": "Zulu", "tgt_region": "South Africa"},
}

PROMPT_TEMPLATE = (
"You are a professional {src_lang} to {tgt_lang} translator, tasked with providing "
"translations suitable for use in {tgt_region} ({tgt_code}). Your goal is to accurately "
"convey the meaning and nuances of the original {src_lang} text while adhering to {tgt_lang} "
"grammar, vocabulary, and cultural sensitivities.\n"
"Please translate the following {src_lang} text into {tgt_lang} ({tgt_code}):\n\n"
"{input_text}\n\n"
"Produce only the {tgt_lang} translation, without any additional explanations or commentary:\n\n"
)


def render_prompt(*, lang_pair: str, source_text: str) -> str:
"""Render the official WMT24++ translation prompt for a given language pair."""
if "-" not in lang_pair:
msg = f"lang_pair must be of the form 'en-XX_YY', got {lang_pair}"
raise ValueError(msg)

_, tgt_code = lang_pair.split("-", maxsplit=1)
info = TARGET_METADATA.get(tgt_code)
if info is None:
msg = (
f"Unknown WMT24++ target code '{tgt_code}'. Please add metadata to"
" TARGET_METADATA to render the prompt."
)
raise KeyError(msg)

return PROMPT_TEMPLATE.format(
src_lang=SRC_LANG,
tgt_lang=info["tgt_lang"],
tgt_region=info["tgt_region"],
tgt_code=tgt_code,
input_text=source_text,
)


def doc_to_text(doc: Dict[str, Any]) -> str:
"""Shared doc_to_text function that renders the WMT24++ prompt."""
lang_pair = doc.get("lp")
if not lang_pair:
raise KeyError("Expected 'lp' field in WMT24++ example.")

source = doc.get("source", "")
return render_prompt(lang_pair=lang_pair, source_text=source)


def load_wmt24pp_dataset(*, lang_pair: str, split: str = "train", **kwargs: Any) -> Dict[str, Dataset]:
"""Load and filter the WMT24++ dataset for a specific language pair.

Parameters
----------
lang_pair:
Exact value of the `lp` field / HF config name, e.g. "en-de_DE".
split:
Dataset split name to load. WMT24++ exposes a single split ("train"),
which we treat as the evaluation split.
**kwargs:
Extra keyword arguments forwarded to `load_dataset`. Currently unused
but accepted for compatibility with ConfigurableTask metadata plumbing.

Returns
-------
dict[str, Dataset]
Mapping from the requested split name to the filtered dataset.
"""
_ = kwargs # ignore extraneous metadata

ds = load_dataset("google/wmt24pp", lang_pair, split=split)
ds = ds.filter(lambda ex: not ex["is_bad_source"])
return {split: ds}
33 changes: 33 additions & 0 deletions lm_eval/tasks/wmt24pp/wmt24pp_common.yaml
@@ -0,0 +1,33 @@
# Common configuration for WMT24++ English→X translation tasks

# HF dataset information
# Note: we actually load via `custom_dataset`, but this documents the source.
dataset_path: google/wmt24pp
# Each language pair is a separate HF config; see per-language YAMLs.

# We treat the single available split ("train") as the evaluation split.
training_split: null
validation_split: null
test_split: train

output_type: generate_until

# Shared prompt renderer: official WMT24++ instructions
doc_to_text: !function utils.doc_to_text

doc_to_target: "{{target}}"

# Load and filter data via Python helper
custom_dataset: !function utils.load_wmt24pp_dataset

# WMT-style metrics: BLEU, TER, ChrF
metric_list:
- metric: bleu
aggregation: bleu
higher_is_better: true
- metric: ter
aggregation: ter
higher_is_better: false
- metric: chrf
aggregation: chrf
higher_is_better: true
11 changes: 11 additions & 0 deletions lm_eval/tasks/wmt24pp/wmt24pp_en-ar_EG.yaml
@@ -0,0 +1,11 @@
include: wmt24pp_common.yaml

task: wmt24pp-en-ar_EG

tag:
- translation
- wmt24pp

metadata:
version: 1.0
lang_pair: "en-ar_EG"