1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -190,6 +190,7 @@ provided to the individual README.md files for each subfolder.
| [winogrande](winogrande/README.md) | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge. | English |
| [wmdp](wmdp/README.md) | A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions. | English |
| [wmt2016](wmt2016/README.md) | Tasks from the WMT 2016 shared task, focusing on translation between multiple languages. | English, Czech, German, Finnish, Russian, Romanian, Turkish |
| [wmt24pp](wmt24pp/README.md) | English→55 language/dialect translation benchmark built from the Google WMT24++ dataset, evaluated with BLEU/TER/ChrF per language pair. | English→Arabic, European, Indic, East Asian, African, and other WMT24++ target languages (55 total) |
| [wsc273](wsc273/README.md) | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution. | English |
| [xcopa](xcopa/README.md) | Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages. | Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese |
| [xnli](xnli/README.md) | Cross-Lingual Natural Language Inference to test understanding across different languages. | Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese |
196 changes: 196 additions & 0 deletions lm_eval/tasks/wmt24pp/README.md
@@ -0,0 +1,196 @@
# WMT24++ Translation Tasks

This directory provides YAML-based tasks for evaluating English→X machine
translation on the **WMT24++** benchmark hosted on the Hugging Face Hub as
[`google/wmt24pp`](https://huggingface.co/datasets/google/wmt24pp).

Each language pair is exposed as a separate task, using consistent
WMT-style generation and metrics.

## Dataset

- **HF ID**: `google/wmt24pp`
- **Configs**: one per language pair (e.g. `en-de_DE`, `en-pl_PL`, `en-pt_BR`, ...)
- **Split**: single split (`train`), used here as the evaluation split
- **Fields (per example)**:
- `lp`: language pair, e.g. `"en-de_DE"`
- `domain`: text domain (canary, news, social, speech, literary)
- `document_id`: document identifier
- `segment_id`: global segment identifier
- `is_bad_source`: boolean flag for low-quality sources
- `source`: English source sentence
- `target`: post-edit of `original_target` (recommended reference)
- `original_target`: original reference translation

In this task family, we:
- **always evaluate English→X** using `source` as input and `target` as reference
- **drop all examples with `is_bad_source == True`**
- **use all domains** (no filtering on `domain`).
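The load-and-filter step can be illustrated on toy records that mimic the schema above (hypothetical values, not real dataset rows):

```python
# Toy records mimicking the google/wmt24pp schema (hypothetical values).
examples = [
    {"lp": "en-de_DE", "source": "Hello, world.", "target": "Hallo, Welt.", "is_bad_source": False},
    {"lp": "en-de_DE", "source": "(garbled)", "target": "", "is_bad_source": True},
]

# Keep only well-formed sources, as the task loader does.
kept = [ex for ex in examples if not ex["is_bad_source"]]
print(len(kept))  # → 1
```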

## Tasks

Common configuration is defined in `wmt24pp_common.yaml`; every per-language
YAML pulls it in via `include: wmt24pp_common.yaml`:

- `dataset_path: google/wmt24pp`
- `test_split: train`
- `output_type: generate_until`
- `doc_to_text: !function utils.doc_to_text`
- `doc_to_target: "{{target}}"`
- `custom_dataset: !function utils.load_wmt24pp_dataset`
- Metrics: **BLEU**, **TER**, **ChrF** (same triple as classic WMT tasks)

The `lang_pair` in `metadata` is passed to `utils.load_wmt24pp_dataset`, which
loads the corresponding HF config and filters out bad sources.

Each language pair has its own YAML including the common config, e.g.:

```yaml
include: wmt24pp_common.yaml

task: wmt24pp-en-de_DE

tag:
- translation
- wmt24pp

metadata:
version: 1.0
lang_pair: "en-de_DE"
```

All available language pairs are listed in the dataset card; in this repo they
are instantiated as tasks named `wmt24pp-<lp>`, where `<lp>` matches the HF
config (e.g. `wmt24pp-en-pt_BR`).

### Group

`wmt24pp_group.yaml` defines a group:

- `group: wmt24pp`
- `group_alias: WMT24++`
- `task: [wmt24pp-en-de_DE, wmt24pp-en-pl_PL, ...]`
- `aggregate_metric_list` aggregating **ChrF** across all subtasks using
`mean` (weighted by dataset size).
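A minimal sketch of what such a group config can look like (task list abbreviated; the aggregation field names here are an assumption based on the harness's usual group-config conventions):

```yaml
group: wmt24pp
group_alias: WMT24++
task:
  - wmt24pp-en-de_DE
  - wmt24pp-en-pl_PL
aggregate_metric_list:
  - metric: chrf
    aggregation: mean
    weight_by_size: true
```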

You can run all WMT24++ tasks via:

```bash
python -m lm_eval run \
--model hf --model_args pretrained=... \
--tasks wmt24pp
```

or select any subset of language pairs explicitly:

```bash
python -m lm_eval run \
--model hf --model_args pretrained=... \
  --tasks wmt24pp-en-de_DE,wmt24pp-en-pl_PL
```

You can also apply the model's chat template:

```bash
python -m lm_eval run \
--model hf --model_args pretrained=... \
  --tasks wmt24pp-en-de_DE,wmt24pp-en-pl_PL \
--apply_chat_template ...
```

## Example evaluation config

You can run a subset of language pairs using a YAML config.

```yaml
model: hf
model_args:
pretrained: Qwen/Qwen2.5-7B-Instruct
dtype: float16

tasks:
- wmt24pp-en-pl_PL

num_fewshot: 0
batch_size: 1
max_batch_size: 1
# device: cuda
limit: 10

gen_kwargs:
temperature: 0.0
max_gen_toks: 1400

output_path: ./results/
log_samples: true

wandb_args: {}
hf_hub_log_args: {}
```

With this configuration file in place, run the experiment with the following command:

```bash
lm_eval run \
--config my-tasks-config.yaml \
  --apply_chat_template ...
```

## Metrics

We follow the same metric setup as the other WMT translation tasks in this
repository, exposing three standard MT metrics:

- **BLEU** (`bleu`) – computed via SacreBLEU
- **TER** (`ter`) – Translation Error Rate
- **ChrF** (`chrf`) – character n‑gram F-score; the primary metric of interest
  for WMT24++, matching common reporting practice (e.g. Nemotron-3 Nano 30B).

All metrics are implemented via `lm_eval.api.metrics` and use SacreBLEU under
the hood.
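To make the ChrF idea concrete, here is a stdlib-only toy sketch of a character n-gram F-score. It is illustrative only, not the SacreBLEU implementation (which averages per-order F-scores and handles edge cases differently):

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Character n-grams with whitespace removed (as ChrF does by default)."""
    s = text.replace(" ", "")
    return Counter(s[i : i + n] for i in range(len(s) - n + 1))


def chrf_like(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Toy character n-gram F-beta score in the spirit of ChrF (0-100 scale)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        if h and r:
            precisions.append(overlap / sum(h.values()))
            recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# Identical strings yield a perfect score.
print(round(chrf_like("Der Hund läuft im Park.", "Der Hund läuft im Park."), 1))  # → 100.0
```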

## Task Validity Checklist

For adding novel benchmarks/datasets to the library:

- [x] **Is the task an existing benchmark in the literature?**
Yes. WMT24++ extends the official WMT24 benchmark to 55 languages/dialects as
described by Deutsch et al. (2025).
- [x] **Have you referenced the original paper that introduced the task?**
The citation for the WMT24++ paper is provided in the section below.
- [ ] **If yes, does the original paper provide a reference implementation?**
  Partially. The prompt template and dataset filtering match the reference
  release, but we do not replicate the full original evaluation pipeline.

If other tasks on this dataset are already supported:

- [x] **Is the "Main" variant of this task clearly denoted?**
Yes. Every YAML task is `wmt24pp-en-<target>` to emphasize the English→X
setup, and the group config exposes the complete benchmark as `wmt24pp`.
- [x] **Have you provided a short sentence on what each new variant adds / evaluates?**
The README explains that each YAML corresponds to a single HF config / language
pair; they all evaluate the same translation direction with identical metrics.
- [x] **Have you noted which published evaluation setups are matched by this variant?**
Yes. See the section above for the specific alignment with the WMT24++ dataset
card: same split (`train`), same bad-source filtering, same post-edited reference,
and the BLEU/TER/ChrF metric trio used in the paper/MTME release.

## Citation

Please cite the original WMT24++ paper and the lm-evaluation-harness project
as appropriate when using these tasks in publications.

```
@misc{deutsch2025wmt24expandinglanguagecoverage,
title={{WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects}},
author={Daniel Deutsch and Eleftheria Briakou and Isaac Caswell and Mara Finkelstein and Rebecca Galor and Juraj Juraska and Geza Kovacs and Alison Lui and Ricardo Rei and Jason Riesa and Shruti Rijhwani and Parker Riley and Elizabeth Salesky and Firas Trabelsi and Stephanie Winkler and Biao Zhang and Markus Freitag},
year={2025},
eprint={2502.12404},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.12404},
}
```
143 changes: 143 additions & 0 deletions lm_eval/tasks/wmt24pp/utils.py
@@ -0,0 +1,143 @@
"""Utilities for the WMT24++ translation tasks.

This module provides helpers used by YAML-configured ConfigurableTasks. It
exposes the `custom_dataset` loader, along with logic to render the official
WMT24++ prompt template so that all language-pair YAMLs can share a single
`doc_to_text` implementation.
"""

from __future__ import annotations

from typing import Any, Dict

from datasets import Dataset, load_dataset

SRC_LANG = "English"

TARGET_METADATA = {
"ar_EG": {"tgt_lang": "Arabic", "tgt_region": "Egypt"},
"ar_SA": {"tgt_lang": "Arabic", "tgt_region": "Saudi Arabia"},
"bg_BG": {"tgt_lang": "Bulgarian", "tgt_region": "Bulgaria"},
"bn_IN": {"tgt_lang": "Bengali", "tgt_region": "India"},
"ca_ES": {"tgt_lang": "Catalan", "tgt_region": "Spain"},
"cs_CZ": {"tgt_lang": "Czech", "tgt_region": "Czechia"},
"da_DK": {"tgt_lang": "Danish", "tgt_region": "Denmark"},
"de_DE": {"tgt_lang": "German", "tgt_region": "Germany"},
"el_GR": {"tgt_lang": "Greek", "tgt_region": "Greece"},
"es_MX": {"tgt_lang": "Spanish", "tgt_region": "Mexico"},
"et_EE": {"tgt_lang": "Estonian", "tgt_region": "Estonia"},
"fa_IR": {"tgt_lang": "Persian", "tgt_region": "Iran"},
"fi_FI": {"tgt_lang": "Finnish", "tgt_region": "Finland"},
"fil_PH": {"tgt_lang": "Filipino", "tgt_region": "Philippines"},
"fr_CA": {"tgt_lang": "French", "tgt_region": "Canada"},
"fr_FR": {"tgt_lang": "French", "tgt_region": "France"},
"gu_IN": {"tgt_lang": "Gujarati", "tgt_region": "India"},
"he_IL": {"tgt_lang": "Hebrew", "tgt_region": "Israel"},
"hi_IN": {"tgt_lang": "Hindi", "tgt_region": "India"},
"hr_HR": {"tgt_lang": "Croatian", "tgt_region": "Croatia"},
"hu_HU": {"tgt_lang": "Hungarian", "tgt_region": "Hungary"},
"id_ID": {"tgt_lang": "Indonesian", "tgt_region": "Indonesia"},
"is_IS": {"tgt_lang": "Icelandic", "tgt_region": "Iceland"},
"it_IT": {"tgt_lang": "Italian", "tgt_region": "Italy"},
"ja_JP": {"tgt_lang": "Japanese", "tgt_region": "Japan"},
"kn_IN": {"tgt_lang": "Kannada", "tgt_region": "India"},
"ko_KR": {"tgt_lang": "Korean", "tgt_region": "South Korea"},
"lt_LT": {"tgt_lang": "Lithuanian", "tgt_region": "Lithuania"},
"lv_LV": {"tgt_lang": "Latvian", "tgt_region": "Latvia"},
"ml_IN": {"tgt_lang": "Malayalam", "tgt_region": "India"},
"mr_IN": {"tgt_lang": "Marathi", "tgt_region": "India"},
"nl_NL": {"tgt_lang": "Dutch", "tgt_region": "Netherlands"},
"no_NO": {"tgt_lang": "Norwegian", "tgt_region": "Norway"},
"pa_IN": {"tgt_lang": "Punjabi", "tgt_region": "India"},
"pl_PL": {"tgt_lang": "Polish", "tgt_region": "Poland"},
"pt_BR": {"tgt_lang": "Portuguese", "tgt_region": "Brazil"},
"pt_PT": {"tgt_lang": "Portuguese", "tgt_region": "Portugal"},
"ro_RO": {"tgt_lang": "Romanian", "tgt_region": "Romania"},
"ru_RU": {"tgt_lang": "Russian", "tgt_region": "Russia"},
"sk_SK": {"tgt_lang": "Slovak", "tgt_region": "Slovakia"},
"sl_SI": {"tgt_lang": "Slovenian", "tgt_region": "Slovenia"},
"sr_RS": {"tgt_lang": "Serbian", "tgt_region": "Serbia"},
"sv_SE": {"tgt_lang": "Swedish", "tgt_region": "Sweden"},
"sw_KE": {"tgt_lang": "Swahili", "tgt_region": "Kenya"},
"sw_TZ": {"tgt_lang": "Swahili", "tgt_region": "Tanzania"},
"ta_IN": {"tgt_lang": "Tamil", "tgt_region": "India"},
"te_IN": {"tgt_lang": "Telugu", "tgt_region": "India"},
"th_TH": {"tgt_lang": "Thai", "tgt_region": "Thailand"},
"tr_TR": {"tgt_lang": "Turkish", "tgt_region": "Turkey"},
"uk_UA": {"tgt_lang": "Ukrainian", "tgt_region": "Ukraine"},
"ur_PK": {"tgt_lang": "Urdu", "tgt_region": "Pakistan"},
"vi_VN": {"tgt_lang": "Vietnamese", "tgt_region": "Vietnam"},
"zh_CN": {"tgt_lang": "Chinese", "tgt_region": "China"},
"zh_TW": {"tgt_lang": "Chinese", "tgt_region": "Taiwan"},
"zu_ZA": {"tgt_lang": "Zulu", "tgt_region": "South Africa"},
}

PROMPT_TEMPLATE = (
"You are a professional {src_lang} to {tgt_lang} translator, tasked with providing "
"translations suitable for use in {tgt_region} ({tgt_code}). Your goal is to accurately "
"convey the meaning and nuances of the original {src_lang} text while adhering to {tgt_lang} "
"grammar, vocabulary, and cultural sensitivities.\n"
"Please translate the following {src_lang} text into {tgt_lang} ({tgt_code}):\n\n"
"{input_text}\n\n"
"Produce only the {tgt_lang} translation, without any additional explanations or commentary:\n\n"
)


def render_prompt(*, lang_pair: str, source_text: str) -> str:
"""Render the official WMT24++ translation prompt for a given language pair."""
if "-" not in lang_pair:
msg = f"lang_pair must be of the form 'en-XX_YY', got {lang_pair}"
raise ValueError(msg)

_, tgt_code = lang_pair.split("-", maxsplit=1)
info = TARGET_METADATA.get(tgt_code)
if info is None:
msg = (
f"Unknown WMT24++ target code '{tgt_code}'. Please add metadata to"
" TARGET_METADATA to render the prompt."
)
raise KeyError(msg)

return PROMPT_TEMPLATE.format(
src_lang=SRC_LANG,
tgt_lang=info["tgt_lang"],
tgt_region=info["tgt_region"],
tgt_code=tgt_code,
input_text=source_text,
)


def doc_to_text(doc: Dict[str, Any]) -> str:
"""Shared doc_to_text function that renders the WMT24++ prompt."""
lang_pair = doc.get("lp")
if not lang_pair:
raise KeyError("Expected 'lp' field in WMT24++ example.")

source = doc.get("source", "")
return render_prompt(lang_pair=lang_pair, source_text=source)


def load_wmt24pp_dataset(*, lang_pair: str, split: str = "train", **kwargs: Any) -> Dict[str, Dataset]:
"""Load and filter the WMT24++ dataset for a specific language pair.

Parameters
----------
lang_pair:
Exact value of the `lp` field / HF config name, e.g. "en-de_DE".
split:
Dataset split name to load. WMT24++ exposes a single split ("train"),
which we treat as the evaluation split.
**kwargs:
Extra keyword arguments forwarded to `load_dataset`. Currently unused
but accepted for compatibility with ConfigurableTask metadata plumbing.

Returns
-------
dict[str, Dataset]
Mapping from the requested split name to the filtered dataset.
"""
_ = kwargs # ignore extraneous metadata

ds = load_dataset("google/wmt24pp", lang_pair, split=split)
ds = ds.filter(lambda ex: not ex["is_bad_source"])
return {split: ds}
33 changes: 33 additions & 0 deletions lm_eval/tasks/wmt24pp/wmt24pp_common.yaml
@@ -0,0 +1,33 @@
# Common configuration for WMT24++ English→X translation tasks

# HF dataset information
# Note: we actually load via `custom_dataset`, but this documents the source.
dataset_path: google/wmt24pp
# Each language pair is a separate HF config; see per-language YAMLs.

# We treat the single available split ("train") as the evaluation split.
training_split: null
validation_split: null
test_split: train

output_type: generate_until

# Shared prompt renderer: official WMT24++ instructions
doc_to_text: !function utils.doc_to_text

doc_to_target: "{{target}}"

# Load and filter data via Python helper
custom_dataset: !function utils.load_wmt24pp_dataset

# WMT-style metrics: BLEU, TER, ChrF
metric_list:
- metric: bleu
aggregation: bleu
higher_is_better: true
- metric: ter
aggregation: ter
higher_is_better: false
- metric: chrf
aggregation: chrf
higher_is_better: true
11 changes: 11 additions & 0 deletions lm_eval/tasks/wmt24pp/wmt24pp_en-ar_EG.yaml
@@ -0,0 +1,11 @@
include: wmt24pp_common.yaml

task: wmt24pp-en-ar_EG

tag:
- translation
- wmt24pp

metadata:
version: 1.0
lang_pair: "en-ar_EG"