This is the repository accompanying our ACL 2024 paper *Cheetah: Natural Language Generation for 517 African Languages*. In this paper, we develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties, allowing us to address the scarcity of NLG resources and provide a solution to foster linguistic diversity. We demonstrate the effectiveness of Cheetah through comprehensive evaluations across seven downstream generation tasks. In five of the seven tasks, Cheetah significantly outperforms other models, showcasing its remarkable performance in generating coherent and contextually appropriate text in a wide range of African languages. We additionally conduct a detailed human evaluation to delve deeper into the linguistic capabilities of Cheetah. The introduction of Cheetah has far-reaching benefits for linguistic diversity. By leveraging pretrained models and adapting them to specific languages, our approach facilitates the development of practical NLG applications for African communities. The findings of this study contribute to advancing NLP research in low-resource settings, enabling greater accessibility and inclusion for African languages in a rapidly expanding digital landscape.
- 1. Our Language Models
- 2. AfroNLG Benchmark and Evaluation
- 3. How to Use the Cheetah Model
- 4. Ethics
- 5. Supported Languages
- 6. Citation
- 7. Acknowledgments
Cheetah Training Data: We are guided by three main principles in developing this data: quality, linguistic diversity, and coverage.
Quality. Developing NLP technologies for low-resource languages poses a significant challenge due to the limited availability of high-quality training data. To address this issue, we manually curated a diverse corpus spanning multiple domains, including news articles, health documents, religious texts, legal documents, and social media feeds. This manual curation was necessary because no existing datasets were available for the majority of the languages we aimed to support, and we wanted to ensure the use of reliable, high-quality data.
Coverage. In all, we train Cheetah on a 42GB multi-domain corpus covering 517 African languages and language varieties. These languages are spoken in 50 of the 54 African countries and are written in five scripts. This provides support for at least 500M Africans.
Linguistic Diversity. The inclusion of languages from various domains, geographical regions, and linguistic typologies, along with the utilization of reliable data sources, contributes to enhancing the robustness and quality of Cheetah. Our data consists of languages from 14 language families in Africa written in five different orthographies. Furthermore, our data spans languages with a vast array of exotic linguistic features including tone, vowel and consonant harmony, reduplication, word orders, and word classes.
- Religious Domain. Our religious data is taken from online Bibles, Qurans, and data crawled from the Jehovah’s Witnesses website. We also include religious texts from the Book of Mormon.
- News Domain. We collect data from online newspapers (Adebara and Abdul-Mageed, 2022) and from news sites such as Voice of America, Voice of Nigeria, BBC, Global Voices, and DW. We collect local newspapers in 27 languages from across Africa.
- Government Documents. We collect government documents from the South African Centre for Digital Language Resources (SADiLaR) and the Universal Declaration of Human Rights (UDHR) in multiple languages.
- Health Documents. We collect multiple health documents from the Department of Health, State Government of Victoria, Australia, in Amharic, Dinka, Harari, Oromo, Somali, Swahili, and Tigrinya.
- Existing Corpora. We collect corpora available on the web for different African languages, including Project Gutenberg for Afrikaans; South African News data for Sepedi and Setswana; and OSCAR (Abadji et al., 2021) for Afrikaans, Amharic, Somali, Swahili, Oromo, Malagasy, and Yoruba. We also use Tatoeba for Afrikaans, Amharic, Bemba, Igbo, Kanuri, Kongo, Luganda, Malagasy, Sepedi, Ndebele, Kinyarwanda, Somali, Swahili, Tsonga, Xhosa, Yoruba, and Zulu; the Swahili Language Modelling Data for Swahili; the Ijdutse corpus for Hausa; the Data4Good corpora for Luganda; CC-100 for Amharic, Fulah, Igbo, Yoruba, Hausa, Tswana, Lingala, Luganda, Afrikaans, Somali, Swahili, Swati, North Sotho, Oromo, Wolof, Xhosa, and Zulu; the Afriberta-Corpus for Afaan Oromo, Amharic, Gahuza, Hausa, Igbo, Pidgin, Somali, Swahili, Tigrinya, and Yoruba; and mC4 for Afrikaans, Amharic, Hausa, Igbo, Malagasy, Chichewa, Shona, Somali, Sepedi, Swahili, Xhosa, Yoruba, and Zulu. Further details are available in the paper.
We pretrain Cheetah using the encoder-decoder architecture of mT5 (Xue et al., 2021). Each of the encoder and decoder components is similar in size and configuration to T5, with 12 layers, each with 12 attention heads, and 768 hidden units for the base model. In total, this results in a model with ~580 million parameters.
For pretraining Cheetah, we use a learning rate of 0.01, a batch size of 1,024 sequences, and a maximum sequence length of 1,024. We pretrain each model for 1M steps. We train our models on Google Cloud TPU with 128 cores (v3-128) from TensorFlow Research Cloud (TFRC).
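For readers who want a concrete picture of these sizes, here is a minimal sketch (not the official pretraining setup) of a T5-style encoder-decoder with the dimensions stated above, built with the Hugging Face transformers library. The vocabulary size, gated-GELU feed-forward, and untied embeddings are assumptions carried over from mT5-base rather than values stated in this README; the released checkpoint defines the actual configuration.

```python
# Minimal sketch only: a T5-style encoder-decoder with the sizes described
# above. Vocabulary size, feed-forward width/activation, and embedding tying
# are assumptions borrowed from mT5-base, not values taken from this README.
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config(
    vocab_size=250_112,               # assumption (mT5-base vocabulary size)
    d_model=768,                      # 768 hidden units
    num_layers=12,                    # 12 encoder layers
    num_decoder_layers=12,            # 12 decoder layers
    num_heads=12,                     # 12 attention heads per layer
    d_ff=2048,                        # assumption (mT5-base feed-forward width)
    feed_forward_proj="gated-gelu",   # assumption, following mT5
    tie_word_embeddings=False,        # assumption, following mT5
)
model = T5ForConditionalGeneration(config)

# Under these assumptions the count lands near the ~580M stated above.
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```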
Cheetah PyTorch and TensorFlow checkpoints are available on the Hugging Face website for direct download and use, exclusively for research. For commercial use, please contact the authors via email (*muhammad.mageed[at]ubc[dot]ca*).
Model | Link |
---|---|
🔥Cheetah-base🔥 | https://huggingface.co/UBC-NLP/cheetah-base |
We create AfroNLG, a multilingual, multi-task benchmark comprising machine translation, paraphrase, question answering, summarization, title generation, and cloze tasks.
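As a point of reference, the snippet below is an illustrative sketch of how BLEU and ROUGE-L scores of the kind reported in the tables that follow can be computed with the sacrebleu and rouge-score packages; the example strings are made up, and the exact evaluation scripts and settings used for AfroNLG may differ.

```python
# Illustrative only: computing BLEU and ROUGE-L with common open-source
# libraries (sacrebleu, rouge-score). The actual AfroNLG evaluation scripts
# and settings may differ; the strings below are made-up examples.
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["the government released new funds today"]
references = [["the government has released new funds today"]]  # one reference stream

# Corpus-level BLEU, as reported for machine translation and title generation.
bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-L, as reported for summarization.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
scores = scorer.score(references[0][0], predictions[0])
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.4f}")
```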
**Machine Translation**

Lang-Pairs | Metric | mT0 | mT5 | Afri-MT5 | AfriTeVa | Cheetah |
---|---|---|---|---|---|---|
English | Bleu | 20.38±0.3 | 12.35±1.1 | 7.12±2.67 | 7.75±1.67 | 19.72±0.75 |
English | Bleu | 19.19±0.3 | 12.28±0.48 | 11.73±12.3 | 20.5±0.87 | 18.9±1.22 |
English | Bleu | 15.98±1.16 | 14.12±0.56 | 14.32±12.74 | 13.88±1.04 | 9.64±1.11 |
English | Bleu | 12.26±0.47 | 8.82±0.43 | 9.57±0.42 | 7.83±1.04 | 10.54±0.54 |
English | Bleu | 11.04±1.2 | 12.74±0.75 | 10.0±1.79 | 10.76±1.4 | 13.3±1.38 |
English | Bleu | 10.59±1.84 | 9.33±0.58 | 3.08±0.57 | 7.24±0.46 | 11.08±0.61 |
English | Bleu | 10.04±0.98 | 8.25±0.7 | 3.86±1.35 | 7.5±0.32 | 12.34±0.51 |
English | Bleu | 17.65±1.86 | 17.97±1.69 | 1.9±1.11 | 13.45±1.81 | 19.49±1.16 |
English | Bleu | 5.06±0.21 | 4.96±0.16 | 0.85±0.04 | 7.32±0.00 | 9.22±0.08 |
English | Bleu | 13.05±0.17 | 11.57±0.23 | 1.12±0.09 | 12.34±0.23 | 16.75±0.26 |
English | Bleu | 2.17±2.77 | 3.33±0.35 | 0.09±0.01 | 4.21±0.77 | 9.75±0.01 |
English | Bleu | 33.17±0.28 | 32.65±0.19 | 2.39±0.23 | 9.39±0.18 | 32.64±0.14 |
English | Bleu | 22.04±2.89 | 23.2±0.23 | 2.79±0.08 | 22.39±0.28 | 28.11±0.14 |
English | Bleu | 6.83±0.29 | 0.58±1.37 | 0.4±0.03 | 4.45±0.37 | 11.75±0.38 |
English | Bleu | 3.4±0.12 | 1.23±0.03 | 0.03±0.0 | 1.68±0.94 | 4.64±0.13 |
English | Bleu | 5.42±0.85 | 2.58±3.1 | 0.04±0.0 | 3.63±4.01 | 7.83±0.14 |
English | Bleu | 10.28±0.49 | 1.31±2.26 | 0.14±0.03 | 3.8±4.2 | 12.13±0.1 |
French | Bleu | 2.0±2.6 | 0.37±0.19 | 0.15±0.01 | 3.18±0.18 | 3.06±0.27 |
French | Bleu | 0.4±0.09 | 0.33±0.01 | 0.07±0.0 | 0.96±0.01 | 0.28±0.25 |
French | Bleu | 0.7±0.35 | 0.31±0.36 | 0.09±0.07 | 0.84±0.16 | 3.47±0.03 |
French | Bleu | 0.69±0.31 | 0.8±0.13 | 1.52±0.06 | 1.73±0.53 | 1.29±0.16 |
French | Bleu | 0.27±0.06 | 0.12±0.05 | 0.19±0.02 | 0.47±0.04 | 1.66±0.86 |
French | Bleu | 4.02±0.12 | 0.3±0.05 | 0.11±0.01 | 3.08±0.25 | 3.01±0.07 |
English | Bleu | 27.44±0.26 | 23.42±1.61 | 7.05±1.37 | 22.54±0.84 | 26.56±0.04 |
Acholi | Bleu | 16.41±0.08 | 11.16±4.77 | 4.9±0.11 | 8.37±8.12 | 19.33±0.1 |
Acholi | Bleu | 2.57±0.21 | 1.48±1.31 | 2.44±0.37 | 8.29±0.14 | 7.21±0.69 |
Acholi | Bleu | 3.64±0.07 | 1.74±0.12 | 0.92±0.01 | 5.53±0.34 | 8.03±0.38 |
Acholi | Bleu | 2.17±0.14 | 0.79±0.51 | 0.46±0.03 | 4.26±0.54 | 5.1±0.14 |
Acholi | Bleu | 1.64±2.34 | 1.94±0.25 | 4.9±0.11 | 7.74±0.33 | 6.33±0.6 |
English | Bleu | 6.19±6.33 | 8.38±0.49 | 5.93±0.22 | 10.95±0.32 | 11.61±0.28 |
English | Bleu | 12.08±0.03 | 10.58±0.25 | 2.59±0.73 | 12.41±0.35 | 17.12±0.16 |
English | Bleu | 6.46±0.08 | 5.69±0.02 | 1.4±0.39 | 7.88±0.18 | 9.04±0.24 |
English | Bleu | 10.24±0.06 | 8.28±0.19 | 4.91±0.59 | 11.64±0.49 | 11.12±0.38 |
Lugbara | Bleu | 2.21±0.35 | 1.5±0.2 | 2.22±0.15 | 6.67±0.32 | 3.68±0.31 |
Luganda | Bleu | 3.96±0.57 | 2.61±0.12 | 3.44±0.32 | 8.05±0.23 | 7.99±0.47 |
Luganda | Bleu | 4.47±0.08 | 3.01±0.16 | 2.5±0.22 | 8.17±0.18 | 8.13±0.33 |
Nyankore | Bleu | 3.45±0.29 | 2.1±0.32 | 2.6±0.29 | 7.5±0.09 | 7.29±0.09 |
Nyankore | Bleu | 8.54±0.17 | 6.91±0.23 | 2.01±0.25 | 6.77±6.73 | 6.25±10.26 |
Nyankore | Bleu | 3.33±0.11 | 2.25±0.23 | 2.12±0.4 | 6.27±0.12 | 6.36±0.4 |
**Paraphrase**

Langs | Metric | mT0 | mT5 | Afri-MT5 | AfriTeVa | Cheetah |
---|---|---|---|---|---|---|
Multilingual | Bleu | 41.79±0.28 | 41.75±0.21 | 34.72±0.51 | 43.02±1.25 | 43.23±0.09 |
Berber | Bleu | 44.84±0.31 | 44.03±0.24 | 36.08±0.83 | 46.41±0.71 | 46.0±0.27 |
Kabyle | Bleu | 25.91±0.13 | 25.32±0.46 | 11.56±0.73 | 16.06±14.79 | 26.27±0.56 |
**Question Answering**

Langs | Metric | mT0 | mT5 | Afri-MT5 | AfriTeVa | Cheetah |
---|---|---|---|---|---|---|
QA Swahili | F1 | 79.84±0.19 | 72.04±0.54 | 0 | 62.64±0.78 | 71.98±1.18 |
**Summarization**

Langs | Metric | mT0 | mT5 | Afri-MT5 | AfriTeVa | Cheetah |
---|---|---|---|---|---|---|
Multilingual | RougeL | 22.31±0.12 | 22.23±0.04 | 5.34±0.48 | 18.97±0.06 | 24.86±0.02 |
Igbo | RougeL | 18.9±0.73 | 13.22±0.46 | 14.24±0.39 | 16.05±0.49 | 17.36±0.43 |
Oromo | RougeL | 11.28±0.03 | 10.51±0.07 | 3.52±0.49 | 7±1.73 | 14.53±0.1 |
Rundi | RougeL | 19.63±0.01 | 18.02±0.13 | 11.82±0.39 | 16.13±0.03 | 22.57±0.04 |
Swahili | RougeL | 26.38±0.02 | 24.81±0.11 | 15.07±0.17 | 21.59±0.13 | 29.05±0.13 |
Yoruba | RougeL | 21.57±0.05 | 20.06±0.12 | 13.52±0.18 | 17.3±0.11 | 22.49±0.0 |
Hausa | RougeL | 26.46±0.06 | 25.76±0.02 | 19.96±0.26 | 25.19±0.11 | 30.07±0.31 |
Nigerian Pidgin | RougeL | 26.54±0.05 | 25.79±0.1 | 14.28±1.23 | 20.29±0.12 | 27.08±0.02 |
Somali | RougeL | 20.69±0.08 | 19.21±0.06 | 13.62±0.81 | 19.27±0.18 | 23.92±0.04 |
Tigrinya | RougeL | 15.84±0.13 | 13.93±0.11 | 6.53±0.42 | 10.07±0.09 | 16.88±0.12 |
**Title Generation**

Langs | Metric | mT0 | mT5 | Afri-MT5 | AfriTeVa | Cheetah |
---|---|---|---|---|---|---|
Multilingual | Bleu | 6.53±0.02 | 6.65±0.08 | 0.1±0.02 | 5.2±0.02 | 7.52±0.07 |
Amharic | Bleu | 3.13±0.23 | 2.65±0.68 | 0.34±0.14 | 2.31±0.14 | 4.34±0.34 |
Igbo | Bleu | 6.95±0.13 | 6.9±0.22 | 0.77±0.12 | 4.61±0.14 | 8.47±0.07 |
Oromo | Bleu | 1.1±1.84 | 2.66±0.19 | 0.21±0.06 | 1.54±0.17 | 3.26±0.21 |
Rundi | Bleu | 4.4±0.28 | 4.13±0.22 | 0.84±0.07 | 3.33±0.23 | 6.05±0.5 |
Swahili | Bleu | 9.1±0.23 | 9.31±0.11 | 1.22±0.09 | 7.01±0.09 | 10.59±0.6 |
Yoruba | Bleu | 6.8±0.16 | 7.23±0.59 | 0.34±0.05 | 5.04±2.0 | 7.97±0.32 |
Hausa | Bleu | 8.11±0.24 | 7.3±0.34 | 2.59±0.01 | 6.69±0.18 | 8.48±0.23 |
Nigerian Pidgin | Bleu | 6.75±0.6 | 3.96±4.3 | 0.89±0.02 | 4.72±0.84 | 6.22±0.28 |
Somali | Bleu | 3.37±0.21 | 3.31±0.16 | 0.38±0.11 | 2.82±0.47 | 5.25±0.14 |
Tigrinya | Bleu | 2.99±0.1 | 2.94±1.09 | 0.7±0.18 | 1.92±0.26 | 5.1±0.05 |
**Cloze**

Task | Metric | mT0 | mT5 | Afri-MT5 | AfriTeVa | Cheetah |
---|---|---|---|---|---|---|
Mask-one - 517 Languages | Bleu | 13.61±0.91 | 8.18±3.94 | 0.00±0.00 | 8.36±3.42 | 13.98±0.32 |
Mask-at-least-one - 517 Languages | Bleu | 2.36±0.11 | 2.66±0.09 | 0.93±0.12 | 0.68±0.09 | 7.07±0.09 |
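To make the two cloze settings above more concrete, here is an illustrative sketch of T5-style sentinel masking, where a token is replaced with a sentinel and the model must generate the missing span. Whether the AfroNLG cloze sets are built exactly this way is an assumption on our part; see the paper for the actual procedure.

```python
# Illustrative sketch of T5-style sentinel masking behind cloze-style
# evaluation. The exact procedure used to build the AfroNLG cloze sets is an
# assumption here; see the paper for details.
import random

def mask_one(tokens, rng=random):
    """Replace one randomly chosen token with a sentinel; the target is the masked span."""
    i = rng.randrange(len(tokens))
    source = tokens[:i] + ["<extra_id_0>"] + tokens[i + 1:]
    target = ["<extra_id_0>", tokens[i], "<extra_id_1>"]
    return " ".join(source), " ".join(target)

src, tgt = mask_one("ìròyìn kan nípa owó ìjọba ìpínlẹ̀ kan".split())
print(src)  # e.g. ìròyìn kan nípa owó ìjọba <extra_id_0> kan
print(tgt)  # e.g. <extra_id_0> ìpínlẹ̀ <extra_id_1>
```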
Below is an example of using Cheetah to predict masked tokens.
```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model from the Hugging Face Hub.
tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/cheetah-base")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/cheetah-base")

# Yoruba prompt with a masked span (roughly: "a news report about the money
# of a certain <extra_id_0> government").
yor_prompt = "ìròyìn kan nípa owó ìjọba <extra_id_0> kan"

input_ids = tokenizer(yor_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print("Tokenized input:", tokenizer.tokenize(yor_prompt))
print("Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Output:

```
Tokenized input: ['▁ìròyìn', '▁kan', '▁nípa', '▁owó', '▁ìjọba', '<extra_id_0>', '▁kan']
Decoded output: ìpínlẹ̀
```
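Continuing from the snippet above, the same model can also be decoded with beam search and an explicit length limit; these decoding settings are illustrative and are not the ones used in the paper's experiments.

```python
# Illustrative decoding settings only; reuses `model`, `tokenizer`, and
# `input_ids` from the example above.
outputs = model.generate(
    input_ids,
    max_new_tokens=20,   # cap the number of generated tokens
    num_beams=5,         # beam search instead of greedy decoding
    early_stopping=True,
)
print("Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```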
Cheetah aligns with Afrocentric NLP, where the needs of African people are taken into consideration when developing technology. We believe Cheetah will be useful not only to speakers of the supported languages but also to researchers who study African languages, such as anthropologists and linguists. We discuss some use cases for Cheetah below and offer a number of broad impacts.
- Cheetah aims to address the lack of access to technology in about 90% of the world's languages, a gap that automatically discriminates against native speakers of those languages; it does so by focusing on Africa. To the best of our knowledge, Cheetah is the first massively multilingual PLM developed for African languages and language varieties, and with knowledge of 517 African languages it is by far the largest of its kind for African NLP to date.
- Cheetah enables improved access to important information for African communities in Indigenous African languages. This is especially beneficial for people who may not be fluent in other languages, and it can potentially connect more people globally.
- Cheetah affords opportunities for language preservation for many African languages. To the best of our knowledge, Cheetah covers languages that have not been used for any NLP task until now. We believe it can help encourage continued use of these languages in several domains, as well as trigger future development of language technologies for many of them.
- Although LMs are useful for a wide range of applications, they can also be misused. Cheetah is developed using publicly available datasets that may carry biases. Although we strive to perform analyses and diagnostic case studies to probe the performance of our models, our investigations are by no means comprehensive, nor do they guarantee the absence of bias in the data. In particular, we do not have access to native speakers of most of the languages covered, which hinders our ability to investigate samples from each (or at least the majority) of the languages.
Please refer to the supported-languages list.
If you use the pre-trained model (Cheetah) for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):
@inproceedings{adebara-etal-2024-cheetah,
title = "Cheetah: Natural Language Generation for 517 {A}frican Languages",
author = "Adebara, Ife and
Elmadany, AbdelRahim and
Abdul-Mageed, Muhammad",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.691",
pages = "12798--12823",
}
We gratefully acknowledge support from Canada Research Chairs (CRC), the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 435-2018-0576; 895-2020-1004; 895-2021-1008), the Canada Foundation for Innovation (CFI; 37771), the Digital Research Alliance of Canada, UBC ARC-Sockeye, Advanced Micro Devices, Inc. (AMD), and Google. Any opinions, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of CRC, NSERC, SSHRC, CFI, the Alliance, AMD, Google, or UBC ARC-Sockeye.