EleutherAI · baberabb · Jan 13, 2026 · Aug 24, 2025 · Sep 6, 2025 · Sep 6, 2025
@@ -187,6 +187,7 @@ provided to the individual README.md files for each subfolder.
 | [turkishmmlu](turkishmmlu/README.md)                                     | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams.                                                                                                                                                                                                                             | Turkish                                                                                                                                                                                                                                                       |
 | [turblimp_core](turblimp/README.md)                                      | A benchmark evaluating language models' grammatical capabilities in Turkish based on comparing the probabilities of minimal pairs of grammatical and ungrammatical sentences.                                                                                                                                                          | Turkish                                                                                                                                                                                                                                                       |
 | [unitxt](unitxt/README.md)                                               | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI.                                                                                                                                                                                        | English                                                                                                                                                                                                                                                       |
+| [ulqa](ulqa/README.md)                                                   | A number of tasks implemented to evaluate LLM's ability to understand Uyghur language and Uyghur literature.                                                                                                                                                                                                                           | Uyghur                                                                                                                        |
 | [unscramble](unscramble/README.md)                                       | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding.                                                                                                                                                                                                                                              | English                                                                                                                                                                                                                                                       |
 | [webqs](webqs/README.md)                                                 | Web-based question answering tasks designed to evaluate internet search and retrieval.                                                                                                                                                                                                                                                 | English                                                                                                                                                                                                                                                       |
 | [wikitext](wikitext/README.md)                                           | Tasks based on text from Wikipedia articles to assess language modeling and generation.                                                                                                                                                                                                                                                | English                                                                                                                                                                                                                                                       |

@@ -704,15 +704,24 @@ def pretty_print_task(task_name, task_manager, indent: int):
 
             if isinstance(value, dict):
                 first_key = next(iter(value.keys()))
-
+                # TODO: simplify
                 if isinstance(first_key, ConfigurableGroup):
                     for subgroup, task_dict in value.items():
-                        eval_logger.info(f"  Subgroup: {subgroup.group}")
-                        for task_name, configurable_task in task_dict.items():
-                            if isinstance(configurable_task, ConfigurableTask):
-                                pretty_print_task(task_name, task_manager, indent=2)
-                            else:
-                                eval_logger.info(f"{task_name}: {configurable_task}")
+                        if isinstance(subgroup, ConfigurableGroup):
+                            eval_logger.info(f"  Subgroup: {subgroup.group}")
+                            for task_name, configurable_task in task_dict.items():
+                                if isinstance(configurable_task, ConfigurableTask):
+                                    pretty_print_task(task_name, task_manager, indent=2)
+                                else:
+                                    eval_logger.info(
+                                        f"{task_name}: {configurable_task}"
+                                    )
+                        elif isinstance(subgroup, str) and isinstance(
+                            task_dict, ConfigurableTask
+                        ):
+                            pretty_print_task(subgroup, task_manager, indent=1)
+                        else:
+                            eval_logger.info(f"  {subgroup}: {task_dict}")
                 else:
                     eval_logger.info(f"{key}: {value}")
             else:

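The new branching can be read as a three-way dispatch on each `(key, value)` pair of the group dict. A minimal sketch of that dispatch, assuming simplified stand-ins for `ConfigurableGroup` and `ConfigurableTask` (the two dataclasses and `classify_entries` below are hypothetical and not part of the harness):

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the harness classes; only what the sketch needs.
@dataclass(frozen=True)
class ConfigurableGroup:
    group: str


@dataclass(frozen=True)
class ConfigurableTask:
    name: str


def classify_entries(value: dict) -> list:
    """Return one label per entry, mirroring the three branches in the diff."""
    labels = []
    for subgroup, task_dict in value.items():
        if isinstance(subgroup, ConfigurableGroup):
            # Nested group: the diff logs the subgroup and walks its tasks.
            labels.append(f"subgroup:{subgroup.group}")
        elif isinstance(subgroup, str) and isinstance(task_dict, ConfigurableTask):
            # Plain task sitting next to subgroups: printed directly.
            labels.append(f"task:{subgroup}")
        else:
            # Anything else falls back to a plain key/value log line.
            labels.append(f"other:{subgroup}")
    return labels


mixed = {
    ConfigurableGroup("ulut"): {"nug": ConfigurableTask("nug")},
    "uleval": ConfigurableTask("uleval"),
    "note": "plain value",
}
print(classify_entries(mixed))
# ['subgroup:ulut', 'task:uleval', 'other:note']
```

A group such as `ulqa`, which mixes a nested group (`ulut`) with standalone tasks, is the kind of input that exercises the second branch.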
@@ -0,0 +1,39 @@
+# ULQA
+
+### Descriptions
+
+The ULQA datasets contain crowdsourced Uyghur language and Uyghur literature exam and exercise questions in multiple-choice, boolean, and generative formats. The tasks cover three skill levels: basic (ULUT or lambada_uyghur), intermediate (ULQA or uleval), and advanced (CELEP1 or CELEP2).
+
+
+### Tags, Groups and Tasks
+
+#### Tags
+
+* uyghur_llm
+* uyghur_literature
+
+#### Groups
+
+* ulut
+
+#### Tasks
+
+* `lambada_uyghur`
+* `ulut`
+* `ulqa_`
+* `uleval`
+* `celep1`
+* `celep2`
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [ ] Is the task an existing benchmark in the literature?
+  * [ ] Have you referenced the original paper that introduced the task?
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [ ] Is the "Main" variant of this task clearly denoted?
+* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
@@ -0,0 +1,24 @@
+tag:
+  - uyghur_literature
+  - uyghur_llm
+task: celep1
+dataset_path: keramjan/CELEP1
+dataset_name: main
+output_type: multiple_choice
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "{{instruction}}\n{{passage}}\nسۇئال: {{question}}\nجاۋاپ:"
+doc_to_target: "{{ ['A','B','C','D'].index(answer) }}"
+doc_to_choice: "{{[A, B, C, D]}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{passage}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0.0
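The `doc_to_target` template maps the answer letter to a position in the `doc_to_choice` list. A plain-Python sketch of that mapping, assuming a document with fields `A` to `D` and `answer` (the sample document and `letter_to_index` are illustrative, not taken from the dataset):

```python
LETTERS = ("A", "B", "C", "D")


def letter_to_index(answer: str) -> int:
    """Same lookup as ['A','B','C','D'].index(answer) in the template."""
    return list(LETTERS).index(answer)


# Hypothetical document; real rows hold Uyghur text.
doc = {
    "A": "first option",
    "B": "second option",
    "C": "third option",
    "D": "fourth option",
    "answer": "C",
}

choices = [doc[letter] for letter in LETTERS]
gold = letter_to_index(doc["answer"])
print(gold, choices[gold])
# 2 third option
```

`list.index` raises `ValueError` for any value outside the list, so an answer stored as `"c"` or `"C "` would fail at this step rather than being scored silently as wrong.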
@@ -0,0 +1,33 @@
+tag:
+  - uyghur_literature
+  - uyghur_llm
+task: celep2
+dataset_path: keramjan/CELEP2
+dataset_name: main
+output_type: generate_until
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "{{instruction}}\n{{passage}}\nسۇئال: {{question}}\nجاۋاپ:"
+doc_to_target: "{{answer}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{passage}}"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: false
+    regexes_to_ignore:
+      - ","
+      - "\\$"
+      - "(?s).*#### "
+      - "\\.$"
+generation_kwargs:
+  until: ["سۇئال:"]
+  do_sample: false
+  temperature: 0.0
+repeats: 1
+num_fewshot: 0
+metadata:
+  version: 1.0.0
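The `exact_match` options above strip a few patterns before comparing strings. A rough sketch of that normalization, assuming the regexes are applied in order to both prediction and reference and that lowercasing happens afterwards (`normalize` and `exact_match` here are illustrative helpers, not the harness implementation):

```python
import re

# The four patterns from the config, written as Python raw strings.
REGEXES_TO_IGNORE = [",", r"\$", r"(?s).*#### ", r"\.$"]


def normalize(text: str, ignore_case: bool = True) -> str:
    """Delete every ignored pattern, then optionally lowercase."""
    for pattern in REGEXES_TO_IGNORE:
        text = re.sub(pattern, "", text)
    return text.lower() if ignore_case else text


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


print(normalize("1,000."))                    # 1000
print(exact_match("Work #### Yes.", "yes"))   # 1.0
print(exact_match("Yes .", "yes"))            # 0.0
```

The last case shows that whitespace is not trimmed: once the trailing period is removed, `"yes "` still differs from `"yes"`.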
@@ -0,0 +1,22 @@
+tag:
+  - uyghur_llm
+task: lambada_uyghur
+dataset_path: keramjan/lambada_uyghur
+dataset_name: default
+output_type: loglikelihood
+training_split: train
+test_split: train
+validation_split: train
+doc_to_text: "{{text.split(' ')[:-1]|join(' ')}}"
+doc_to_target: "{{' '+text.split(' ')[-1]}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{text}}"
+metric_list:
+  - metric: perplexity
+    aggregation: perplexity
+    higher_is_better: false
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0
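The two templates split each passage into a context and a final-word target. The same split in plain Python, as a sketch (`split_last_word` is an illustrative name and the English sample stands in for Uyghur text):

```python
def split_last_word(text: str) -> tuple:
    """Mirror doc_to_text / doc_to_target: everything but the last word, then the last word."""
    words = text.split(" ")
    context = " ".join(words[:-1])
    target = " " + words[-1]
    return context, target


context, target = split_last_word("the cat sat on the mat")
print(repr(context), repr(target))
# 'the cat sat on the' ' mat'

# A trailing space leaves an empty last "word", so the target is a lone space.
print(split_last_word("a b "))
# ('a b', ' ')
```

Because the split is on a single space character, passages with trailing whitespace yield a target that contains no word at all.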
@@ -0,0 +1,22 @@
+tag:
+  - uyghur_language
+  - uyghur_llm
+task: uleval
+dataset_path: keramjan/uleval
+dataset_name: main
+output_type: multiple_choice
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "سۇئال: {{question}}\nجاۋاپ:"
+doc_to_target: "{{ ['A','B','C','D'].index(answer) }}"
+doc_to_choice: "{{[A, B, C, D]}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0.0
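`acc` and `acc_norm` can pick different choices when the options differ in length. A small sketch of the difference, assuming `acc_norm` divides each choice's log-likelihood by its character length (that normalization, the helper `pick_choices`, and the numbers are assumptions for illustration):

```python
def pick_choices(loglikelihoods: list, choices: list) -> tuple:
    """Return (index chosen by raw score, index chosen by length-normalized score)."""
    indices = range(len(choices))
    raw_pick = max(indices, key=lambda i: loglikelihoods[i])
    norm_pick = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return raw_pick, norm_pick


# Hypothetical scores: the longer option has a lower total log-likelihood
# but a better per-character score.
choices = ["no", "absolutely not"]
loglikelihoods = [-4.0, -7.0]
print(pick_choices(loglikelihoods, choices))
# (0, 1)
```

Reporting both metrics helps show whether a model's score is driven by a preference for shorter answers.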
@@ -0,0 +1,13 @@
+group: ulqa
+tag:
+  - uyghur_language
+  - uyghur_llm
+task:
+  - ulut
+  - lambada_uyghur
+  - ulqa_
+  - uleval
+  - celep1
+  - celep2
+metadata:
+  version: 1.0
@@ -0,0 +1,33 @@
+tag:
+  - uyghur_language
+  - uyghur_llm
+task: ulqa_
+dataset_path: keramjan/ulqa
+dataset_name: main
+output_type: generate_until
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "سۇئال: {{question}}\nجاۋاپ:"
+doc_to_target: "{{answer}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{question}}"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: false
+    regexes_to_ignore:
+      - ","
+      - "\\$"
+      - "(?s).*#### "
+      - "\\.$"
+generation_kwargs:
+  until: ["سۇئال:"]
+  do_sample: false
+  temperature: 0.0
+repeats: 1
+num_fewshot: 2
+metadata:
+  version: 1.0.0
@@ -0,0 +1,22 @@
+tag:
+  - uyghur_language
+  - uyghur_llm
+task: nug
+dataset_path: keramjan/ulut
+dataset_name: nug
+output_type: multiple_choice
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "سۇئال: {{question}}\nجاۋاپ:"
+doc_to_target: 2
+doc_to_choice: "{{[distractor1, distractor2, answer]}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0.0
@@ -0,0 +1,16 @@
+group: ulut
+tag:
+  - uyghur_language
+  - uyghur_llm
+task:
+  - nug
+  - wag
+  - wsm
+  - wub
+  - wum
+aggregate_metric_list:
+  - metric: acc
+    aggregation: mean
+    weight_by_size: true
+metadata:
+  version: 1.0
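With `weight_by_size: true`, larger subtasks count for more in the group score. A sketch of a size-weighted mean next to the unweighted one, using made-up accuracies and subtask sizes (`weighted_mean` and all numbers are hypothetical):

```python
def weighted_mean(scores: dict, sizes: dict) -> float:
    """Average per-task scores, weighting each task by its number of documents."""
    total = sum(sizes[task] for task in scores)
    return sum(scores[task] * sizes[task] for task in scores) / total


# Hypothetical per-subtask accuracies and document counts.
scores = {"nug": 0.5, "wag": 0.75}
sizes = {"nug": 100, "wag": 300}

print(weighted_mean(scores, sizes))          # 0.6875
print(sum(scores.values()) / len(scores))    # 0.625
```

The weighted figure equals the accuracy that would result from pooling all documents into one task, which is why it shifts toward the larger subtask.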
@@ -0,0 +1,22 @@
+tag:
+  - uyghur_language
+  - uyghur_llm
+task: wag
+dataset_path: keramjan/ulut
+dataset_name: wag
+output_type: multiple_choice
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "تۆۋەندىكى سۆزنىڭ قارىمۇ-قارشى مەنىلىك سۆزىنى يېزىڭ: {{word}}\nجاۋاپ:"
+doc_to_target: 3
+doc_to_choice: "{{[distractor1, distractor2, distractor3, antonym]}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0.0
@@ -0,0 +1,22 @@
+tag:
+  - uyghur_language
+  - uyghur_llm
+task: wsm
+dataset_path: keramjan/ulut
+dataset_name: wsm
+output_type: multiple_choice
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "تۆۋەندىكى سۆزنىڭ مەنىداش سۆزىنى تاللاڭ: {{word}}\nجاۋاپ:"
+doc_to_target: 3
+doc_to_choice: "{{[distractor1, distractor2, distractor3, synonym]}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0.0
@@ -0,0 +1,24 @@
+tag:
+  - uyghur_language
+  - uyghur_llm
+task: wub
+dataset_path: keramjan/ulut
+dataset_name: wub
+output_type: multiple_choice
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "تۆۋەندىكى جۈملىدە خاتا ئىشلىتىلگەن سۆزنىڭ بار-يوقلىقىغا ھۆكۈم قىلىڭ: {{statement}}\nجاۋاپ:"
+doc_to_target: check
+doc_to_choice: ["true", "false"]
+should_decontaminate: true
+doc_to_decontamination_query: "{{statement}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0.0
@@ -0,0 +1,24 @@
+tag:
+  - uyghur_language
+  - uyghur_llm
+task: wum
+dataset_path: keramjan/ulut
+dataset_name: wum
+output_type: multiple_choice
+training_split: train
+fewshot_split: train
+test_split: train
+doc_to_text: "بوش ئورۇنغا مۇۋاپىق كىلىدىغان سۆزنى تاللاڭ: {{question}}\nجاۋاپ:"
+doc_to_target: "{{ ['A','B','C'].index(answer) }}"
+doc_to_choice: "{{[A, B, C]}}"
+should_decontaminate: true
+doc_to_decontamination_query: "{{question}}"
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
+metadata:
+  version: 1.0.0