Merged
66 commits
7da2a56
task yaml configuration added
keramjan Aug 24, 2025
990ca11
New task: ULQA
keramjan Sep 6, 2025
a35ce40
celep1/2, uleval, ulqa
keramjan Sep 6, 2025
fff23be
Update ulqa.yaml
keramjan Sep 9, 2025
9628d58
Update ulqa.yaml
keramjan Sep 9, 2025
4109bb2
Update ulqa.yaml
keramjan Sep 9, 2025
29b65cd
Update ulqa.yaml
keramjan Sep 9, 2025
9b2b815
Update ulqa.yaml
keramjan Sep 9, 2025
1e69c4b
Update ulqa.yaml
keramjan Sep 9, 2025
90e0c9c
Update ulqa.yaml
keramjan Sep 9, 2025
fa6954a
Update ulqa.yaml
keramjan Sep 9, 2025
2bae48c
Update ulqa.yaml
keramjan Sep 9, 2025
63b455f
Update ulqa.yaml
keramjan Sep 10, 2025
9c9b401
Update ulqa.yaml
keramjan Sep 10, 2025
6043d84
Update ulqa.yaml
keramjan Sep 10, 2025
dc55fab
Update ulqa.yaml
keramjan Sep 10, 2025
997ea41
Update ulqa.yaml
keramjan Sep 10, 2025
603ff43
Update ulqa.yaml
keramjan Sep 11, 2025
d74e07f
Update celep2.yaml
keramjan Sep 11, 2025
c02194d
Huggingface Dataset Path Updated.
keramjan Sep 16, 2025
d8f04be
lambada_uyghur task added
keramjan Sep 21, 2025
31f65ee
lambada_uyghur task config updated
keramjan Sep 21, 2025
56fe47d
Update lambada_uyghur.yaml
keramjan Sep 21, 2025
137c5b1
Update lambada_uyghur.yaml
keramjan Sep 21, 2025
9d325d9
Merge branch 'EleutherAI:main' into ulqa
keramjan Sep 21, 2025
cd0cd26
Update lambada_uyghur.yaml
keramjan Sep 22, 2025
006a9b9
lambada Uyghur test
keramjan Sep 22, 2025
b375ea3
lambada Uyghur test
keramjan Sep 22, 2025
15a2945
lambada Uyghur test
keramjan Sep 22, 2025
56dd845
lambada Uyghur test
keramjan Sep 22, 2025
a9bb02f
lambada Uyghur working version
keramjan Sep 22, 2025
b081634
ulut task added
keramjan Oct 3, 2025
1c46a93
ulut task debugged
keramjan Oct 3, 2025
edc2a2f
ulut task debugged
keramjan Oct 3, 2025
77ab8c7
ulut task debugged
keramjan Oct 3, 2025
3806fda
ulut sub-task names updated
keramjan Oct 3, 2025
1d071be
ulut sub-task nug config updated
keramjan Oct 3, 2025
0d44c73
ulut sub-task nug config updated
keramjan Oct 3, 2025
ce5caf4
ulut task config files debugged
keramjan Oct 3, 2025
69889b5
ulut task group added
keramjan Oct 4, 2025
0bfd87d
ulut task group debugged
keramjan Oct 4, 2025
56ef5a5
ulut task group debugged
keramjan Oct 4, 2025
3671217
ulut task group debugged
keramjan Oct 4, 2025
811b326
ulut task group debugged
keramjan Oct 4, 2025
3828a25
ulut task group debugged
keramjan Oct 4, 2025
8a79ddc
All sub-tasks are converted to multiple choise questions.
keramjan Oct 4, 2025
46a0540
All sub-tasks are converted to multiple choise questions.
keramjan Oct 4, 2025
42072e4
tag added to ulut.yaml
keramjan Oct 12, 2025
ccae01e
Update README.md
keramjan Oct 13, 2025
661920a
added README file
keramjan Oct 13, 2025
d4392a1
Update README.md
keramjan Oct 13, 2025
318bb48
Update README.md
keramjan Oct 13, 2025
5c70c24
Update README.md
keramjan Oct 13, 2025
d6f2ed2
all Uyghur tasks are merged
keramjan Oct 13, 2025
9d9cf83
Merge branch 'ulqa' of github.com:keramjan/lm-evaluation-harness into…
keramjan Oct 13, 2025
2e98394
ulqa task updated
keramjan Oct 13, 2025
683a6bf
ulqa task updated
keramjan Oct 13, 2025
40a1a3c
Merge branch 'main' into ulqa
keramjan Oct 14, 2025
c836e26
ulut doc_to_decontamination_query value added
keramjan Oct 14, 2025
89ac4c3
doc_to_decontamination_query added to more tasks in ulqa
keramjan Oct 14, 2025
a51d14f
metrics bleu and chrf removed from generative tasks in ulqa
keramjan Oct 14, 2025
00b4c52
Resolve merge conflict: keep ULQA benchmark entry
keramjan Jan 3, 2026
8f93f22
Merge branch 'main' into ulqa
keramjan Jan 3, 2026
b3cd096
Merge branch 'main' into ulqa
baberabb Jan 13, 2026
e313e9f
fix parsing bug
baberabb Jan 13, 2026
7657f78
rm `ulut` tag as its already a group name
baberabb Jan 13, 2026
1 change: 1 addition & 0 deletions lm_eval/tasks/README.md
@@ -187,6 +187,7 @@ provided to the individual README.md files for each subfolder.
| [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
| [turblimp_core](turblimp/README.md) | A benchmark evaluating language models' grammatical capabilities in Turkish based on comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences. | Turkish |
| [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| [ulqa](ulqa/README.md) | A set of tasks that evaluate an LLM's understanding of the Uyghur language and Uyghur literature. | Uyghur |
| [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
| [wikitext](wikitext/README.md) | Tasks based on text from Wikipedia articles to assess language modeling and generation. | English |
23 changes: 16 additions & 7 deletions lm_eval/tasks/__init__.py
@@ -704,15 +704,24 @@ def pretty_print_task(task_name, task_manager, indent: int):

             if isinstance(value, dict):
                 first_key = next(iter(value.keys()))

                 # TODO: simplify
                 if isinstance(first_key, ConfigurableGroup):
                     for subgroup, task_dict in value.items():
-                        eval_logger.info(f" Subgroup: {subgroup.group}")
-                        for task_name, configurable_task in task_dict.items():
-                            if isinstance(configurable_task, ConfigurableTask):
-                                pretty_print_task(task_name, task_manager, indent=2)
-                            else:
-                                eval_logger.info(f"{task_name}: {configurable_task}")
+                        if isinstance(subgroup, ConfigurableGroup):
+                            eval_logger.info(f" Subgroup: {subgroup.group}")
+                            for task_name, configurable_task in task_dict.items():
+                                if isinstance(configurable_task, ConfigurableTask):
+                                    pretty_print_task(task_name, task_manager, indent=2)
+                                else:
+                                    eval_logger.info(
+                                        f"{task_name}: {configurable_task}"
+                                    )
+                        elif isinstance(subgroup, str) and isinstance(
+                            task_dict, ConfigurableTask
+                        ):
+                            pretty_print_task(subgroup, task_manager, indent=1)
+                        else:
+                            eval_logger.info(f" {subgroup}: {task_dict}")
                 else:
                     eval_logger.info(f"{key}: {value}")
             else:
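The change above makes the subgroup loop tolerate dictionaries whose keys are a mix of group objects and plain task-name strings. A minimal, self-contained sketch of that three-way dispatch follows; `Group` and `Task` are hypothetical stand-ins for `ConfigurableGroup` and `ConfigurableTask`, and `describe` is an illustrative helper, not a function from the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:  # hypothetical stand-in for ConfigurableGroup
    group: str


@dataclass(frozen=True)
class Task:  # hypothetical stand-in for ConfigurableTask
    name: str


def describe(value: dict) -> list[str]:
    """Return one description line per entry, mirroring the three branches."""
    lines = []
    for key, entry in value.items():
        if isinstance(key, Group):
            # Group key: the entry is a dict of task name -> task (or other value).
            lines.append(f"Subgroup: {key.group}")
            for task_name, task in entry.items():
                kind = "task" if isinstance(task, Task) else "other"
                lines.append(f"  {kind}: {task_name}")
        elif isinstance(key, str) and isinstance(entry, Task):
            # Plain string key mapped directly to a task.
            lines.append(f"task: {key}")
        else:
            # Anything else is reported as-is.
            lines.append(f"{key}: {entry}")
    return lines


mixed = {
    Group("ulut"): {"nug": Task("nug"), "note": "n/a"},
    "celep1": Task("celep1"),
    "misc": 3,
}
print(describe(mixed))
```

Before the change, the loop read `subgroup.group` on every key, so a plain string key such as `"celep1"` in this sketch would have raised an `AttributeError`.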
39 changes: 39 additions & 0 deletions lm_eval/tasks/ulqa/README.md
@@ -0,0 +1,39 @@
# ULQA

### Descriptions

The ULQA datasets contain crowdsourced exam and exercise questions on the Uyghur language and Uyghur literature. The questions come in multiple-choice, boolean, and generative formats. The tasks cover three skill levels: basic (ULUT, lambada_uyghur), intermediate (ULQA, uleval), and high (CELEP1, CELEP2).


### Tags, Groups and Tasks

#### Tags

* uyghur_llm
* uyghur_language
* uyghur_literature

#### Groups

* `ulqa`
* `ulut`

#### Tasks

* `lambada_uyghur`
* `nug`, `wag`, `wsm`, `wub`, `wum` (sub-tasks of the `ulut` group)
* `ulqa_`
* `uleval`
* `celep1`
* `celep2`

### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
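
The task and group names added in this PR can be passed to the harness's command-line entry point. The sketch below only assembles such a command; `build_command` is a hypothetical helper, the `--model`, `--model_args`, `--tasks` and `--num_fewshot` flags are assumed to match the installed harness version, and `your-org/your-model` is a placeholder rather than a real checkpoint.

```python
import shlex


def build_command(tasks, model_args, num_fewshot=None):
    """Assemble an lm_eval command line for the given task names."""
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd


# "your-org/your-model" is a placeholder, not a real checkpoint.
cmd = build_command(["ulut", "celep1"], "pretrained=your-org/your-model")
print(shlex.join(cmd))
```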
24 changes: 24 additions & 0 deletions lm_eval/tasks/ulqa/celep1.yaml
@@ -0,0 +1,24 @@
tag:
- uyghur_literature
- uyghur_llm
task: celep1
dataset_path: keramjan/CELEP1
dataset_name: main
output_type: multiple_choice
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "{{instruction}}\n{{passage}}\nسۇئال: {{question}}\nجاۋاپ:"
doc_to_target: "{{ ['A','B','C','D'].index(answer) }}"
doc_to_choice: "{{[A, B, C, D]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{passage}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0.0
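
In this config, `doc_to_target` turns the letter stored in the `answer` field into a position in the list built by `doc_to_choice`. The harness renders these templates itself; the plain-Python sketch below only mimics the mapping, and the document contents are made-up placeholders.

```python
LETTERS = ("A", "B", "C", "D")


def choices_and_target(doc: dict) -> tuple[list[str], int]:
    """Mimic doc_to_choice ([A, B, C, D]) and doc_to_target (index of the answer letter)."""
    choices = [doc[letter] for letter in LETTERS]
    target = list(LETTERS).index(doc["answer"])
    return choices, target


# Placeholder document; real rows come from the dataset.
doc = {"A": "first", "B": "second", "C": "third", "D": "fourth", "answer": "C"}
print(choices_and_target(doc))
```

An `answer` value outside `A` to `D` raises `ValueError` here, which is also how `list.index` behaves inside the template expression.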
33 changes: 33 additions & 0 deletions lm_eval/tasks/ulqa/celep2.yaml
@@ -0,0 +1,33 @@
tag:
- uyghur_literature
- uyghur_llm
task: celep2
dataset_path: keramjan/CELEP2
dataset_name: main
output_type: generate_until
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "{{instruction}}\n{{passage}}\nسۇئال: {{question}}\nجاۋاپ:"
doc_to_target: "{{answer}}"
should_decontaminate: true
doc_to_decontamination_query: "{{passage}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
      - "(?s).*#### "
      - "\\.$"
generation_kwargs:
  until: ["سۇئال:"]
  do_sample: false
  temperature: 0.0
repeats: 1
num_fewshot: 0
metadata:
  version: 1.0.0
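
The `exact_match` options above strip a few patterns and ignore case before comparing strings. The sketch below approximates that normalization under the assumption that each regex is removed from both prediction and reference before lowercasing; the harness's own metric implementation may differ in detail, and `normalize` and `exact_match` are illustrative names.

```python
import re

# Same patterns as regexes_to_ignore in the config above.
REGEXES_TO_IGNORE = [",", r"\$", r"(?s).*#### ", r"\.$"]


def normalize(text: str, ignore_case: bool = True) -> str:
    """Remove every ignored pattern, then optionally lowercase."""
    for pattern in REGEXES_TO_IGNORE:
        text = re.sub(pattern, "", text)
    return text.lower() if ignore_case else text


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


print(exact_match("Reasoning #### Answer.", "answer"))
print(exact_match("$1,000", "1000"))
```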
22 changes: 22 additions & 0 deletions lm_eval/tasks/ulqa/lambada_uyghur.yaml
@@ -0,0 +1,22 @@
tag:
- uyghur_llm
task: lambada_uyghur
dataset_path: keramjan/lambada_uyghur
dataset_name: default
output_type: loglikelihood
training_split: train
test_split: train
validation_split: train
doc_to_text: "{{text.split(' ')[:-1]|join(' ')}}"
doc_to_target: "{{' '+text.split(' ')[-1]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"
metric_list:
  - metric: perplexity
    aggregation: perplexity
    higher_is_better: false
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
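
The two templates in this config split each passage into a context (everything except the last space-separated word) and a target (a leading space plus the last word). A plain-Python equivalent, using an English placeholder sentence instead of dataset text:

```python
def split_context_target(text: str) -> tuple[str, str]:
    """Mimic doc_to_text and doc_to_target for a LAMBADA-style passage."""
    words = text.split(" ")
    context = " ".join(words[:-1])
    target = " " + words[-1]
    return context, target


context, target = split_context_target("the quick brown fox")
print(repr(context), repr(target))
```

Concatenating `context` and `target` gives back the original passage, so the model is scored on exactly the final word.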
22 changes: 22 additions & 0 deletions lm_eval/tasks/ulqa/uleval.yaml
@@ -0,0 +1,22 @@
tag:
- uyghur_language
- uyghur_llm
task: uleval
dataset_path: keramjan/uleval
dataset_name: main
output_type: multiple_choice
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "سۇئال: {{question}}\nجاۋاپ:"
doc_to_target: "{{ ['A','B','C','D'].index(answer) }}"
doc_to_choice: "{{[A, B, C, D]}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0.0
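
The multiple-choice configs in this PR report both `acc` and `acc_norm`. The sketch below shows one common way the two differ, assuming `acc` picks the choice with the highest log-likelihood and `acc_norm` divides each log-likelihood by the choice's character length first; the log-likelihood values are invented for illustration.

```python
def acc_and_acc_norm(loglikelihoods, choices, gold):
    """Score one document under raw and length-normalized argmax."""
    indices = range(len(choices))
    pred = max(indices, key=lambda i: loglikelihoods[i])
    # Assumption: normalization uses the character length of each choice.
    pred_norm = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return float(pred == gold), float(pred_norm == gold)


choices = ["no", "a much longer answer"]
loglikelihoods = [-4.0, -10.0]  # made-up values
print(acc_and_acc_norm(loglikelihoods, choices, gold=1))
```

Here the short distractor wins on raw log-likelihood, while the longer correct answer wins once length is accounted for.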
13 changes: 13 additions & 0 deletions lm_eval/tasks/ulqa/ulqa.yaml
@@ -0,0 +1,13 @@
group: ulqa
tag:
- uyghur_language
- uyghur_llm
task:
- ulut
- lambada_uyghur
- ulqa_
- uleval
- celep1
- celep2
metadata:
  version: 1.0
33 changes: 33 additions & 0 deletions lm_eval/tasks/ulqa/ulqa_.yaml
@@ -0,0 +1,33 @@
tag:
- uyghur_language
- uyghur_llm
task: ulqa_
dataset_path: keramjan/ulqa
dataset_name: main
output_type: generate_until
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "سۇئال: {{question}}\nجاۋاپ:"
doc_to_target: "{{answer}}"
should_decontaminate: true
doc_to_decontamination_query: "{{question}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
    regexes_to_ignore:
      - ","
      - "\\$"
      - "(?s).*#### "
      - "\\.$"
generation_kwargs:
  until: ["سۇئال:"]
  do_sample: false
  temperature: 0.0
repeats: 1
num_fewshot: 2
metadata:
  version: 1.0.0
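
The `until` entry in `generation_kwargs` stops generation once the model starts a new question, so only the first answer is scored. A minimal sketch of that truncation, assuming the output is cut at the earliest stop sequence; `truncate_at_stop` is an illustrative helper and `"Question:"` stands in for the Uyghur prompt prefix used in the config.

```python
def truncate_at_stop(generated: str, stop_sequences: list[str]) -> str:
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(generated)
    for stop in stop_sequences:
        idx = generated.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut]


raw = " 42\nQuestion: what comes next?\nAnswer: 43"
print(repr(truncate_at_stop(raw, ["Question:"])))
```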
22 changes: 22 additions & 0 deletions lm_eval/tasks/ulqa/ulut/nug.yaml
@@ -0,0 +1,22 @@
tag:
- uyghur_language
- uyghur_llm
task: nug
dataset_path: keramjan/ulut
dataset_name: nug
output_type: multiple_choice
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "سۇئال: {{question}}\nجاۋاپ:"
doc_to_target: 2
doc_to_choice: "{{[distractor1, distractor2, answer]}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0.0
16 changes: 16 additions & 0 deletions lm_eval/tasks/ulqa/ulut/ulut.yaml
@@ -0,0 +1,16 @@
group: ulut
tag:
- uyghur_language
- uyghur_llm
task:
- nug
- wag
- wsm
- wub
- wum
aggregate_metric_list:
  - metric: acc
    aggregation: mean
    weight_by_size: true
metadata:
  version: 1.0
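
With `weight_by_size: true`, the group score is a mean over sub-tasks weighted by how many documents each contains, rather than a plain average of sub-task scores. A small sketch of the difference, using invented scores and sizes; `aggregate` is an illustrative helper, not the harness's function.

```python
def aggregate(scores, sizes, weight_by_size=True):
    """Combine per-task scores into one group score."""
    if not weight_by_size:
        return sum(scores) / len(scores)
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


scores = [0.5, 1.0]  # made-up per-task accuracies
sizes = [300, 100]   # made-up document counts
print(aggregate(scores, sizes))                        # size-weighted
print(aggregate(scores, sizes, weight_by_size=False))  # plain average
```

The weighted value equals the accuracy over all documents pooled together, so large sub-tasks dominate the group score.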
22 changes: 22 additions & 0 deletions lm_eval/tasks/ulqa/ulut/wag.yaml
@@ -0,0 +1,22 @@
tag:
- uyghur_language
- uyghur_llm
task: wag
dataset_path: keramjan/ulut
dataset_name: wag
output_type: multiple_choice
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "تۆۋەندىكى سۆزنىڭ قارىمۇ-قارشى مەنىلىك سۆزىنى يېزىڭ: {{word}}\nجاۋاپ:"
doc_to_target: 3
doc_to_choice: "{{[distractor1, distractor2, distractor3, antonym]}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0.0
22 changes: 22 additions & 0 deletions lm_eval/tasks/ulqa/ulut/wsm.yaml
@@ -0,0 +1,22 @@
tag:
- uyghur_language
- uyghur_llm
task: wsm
dataset_path: keramjan/ulut
dataset_name: wsm
output_type: multiple_choice
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "تۆۋەندىكى سۆزنىڭ مەنىداش سۆزىنى تاللاڭ: {{word}}\nجاۋاپ:"
doc_to_target: 3
doc_to_choice: "{{[distractor1, distractor2, distractor3, synonym]}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0.0
24 changes: 24 additions & 0 deletions lm_eval/tasks/ulqa/ulut/wub.yaml
@@ -0,0 +1,24 @@
tag:
- uyghur_language
- uyghur_llm
task: wub
dataset_path: keramjan/ulut
dataset_name: wub
output_type: multiple_choice
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "تۆۋەندىكى جۈملىدە خاتا ئىشلىتىلگەن سۆزنىڭ بار-يوقلىقىغا ھۆكۈم قىلىڭ: {{statement}}\nجاۋاپ:"
doc_to_target: check
doc_to_choice: ["true", "false"]
should_decontaminate: true
doc_to_decontamination_query: "{{statement}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0.0
24 changes: 24 additions & 0 deletions lm_eval/tasks/ulqa/ulut/wum.yaml
@@ -0,0 +1,24 @@
tag:
- uyghur_language
- uyghur_llm
task: wum
dataset_path: keramjan/ulut
dataset_name: wum
output_type: multiple_choice
training_split: train
fewshot_split: train
test_split: train
doc_to_text: "بوش ئورۇنغا مۇۋاپىق كىلىدىغان سۆزنى تاللاڭ: {{question}}\nجاۋاپ:"
doc_to_target: "{{ ['A','B','C'].index(answer) }}"
doc_to_choice: "{{[A, B, C]}}"
should_decontaminate: true
doc_to_decontamination_query: "{{question}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0.0