The Swahili Classification Task #998

Merged 7 commits, Jul 2, 2024
Changes from 5 commits

43 changes: 43 additions & 0 deletions mteb/tasks/Classification/swa/SwahiliNewsClassification.py
@@ -0,0 +1,43 @@
from __future__ import annotations

# from ....abstasks import AbsTaskClassification
Contributor

Suggested change
# from ....abstasks import AbsTaskClassification

from mteb.abstasks.AbsTaskClassification import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata


class SwahiliNewsClassification(AbsTaskClassification):
    metadata = TaskMetadata(
        name="SwahiliNewsClassification",
        description="Dataset for Swahili News Classification, categorized with 5 domains. Building and Optimizing Swahili Language Models: Techniques, Embeddings, and Datasets",
Contributor

which 5 domains?

Contributor Author

Reason for adding the dataset, and the domains it classifies:

Swahili is spoken by 100-150 million people across East Africa. In Tanzania, for example, it is one of the primary national languages and the official language of instruction in all schools. News in Swahili is an integral part of Tanzania's media sphere.
News contributes to education, technology, and a country's economic growth, and news in local languages plays an important cultural role in many African countries. In the modern age, however, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.
The Swahili news dataset was created to bridge the gap in utilizing the Swahili language to create NLP technologies. It aims to assist AI practitioners in Tanzania and across Africa in honing their NLP skills to address various challenges within organizations or societies related to the Swahili language. The Swahili news dataset, sourced from multiple websites providing news in Swahili, is a valuable resource for NLP research and development.
The dataset was curated explicitly for text classification tasks, categorizing news content into six distinct topics. This categorization facilitates the development of robust NLP models that can more effectively understand and process Swahili text.
Six domains:
- Local News (Kitaifa): News concerning local events, politics, and developments within Tanzania.
- International News (Kimataifa): Coverage of global events and news stories affecting the international community.
- Finance News (Uchumi): Financial and economic news, including market trends, monetary policies, and business updates.
- Health News (Afya): Information and updates on health-related topics, medical research, public health issues, and wellness.
- Sports News (Michezo): News related to sports events, athletes, competitions, and sports culture locally and internationally.
- Entertainment News (Burudani): Coverage of entertainment industry news, including celebrity updates, music, movies, and cultural events.
By integrating this dataset into MTEB, we aim to support the development of NLP models capable of understanding and processing Swahili text across these vital domains, thus promoting linguistic diversity and technological advancement in East Africa. The high accuracy of this dataset is attributed to the human annotation process involved in its creation.

Contributor

Thanks for the clarification, @msamwelmollel, much appreciated.

Can you specify these domains in the description of the dataset as well?
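
Following the request above, one way to fold the six categories into the `description` field might look like the sketch below. The `DOMAINS` mapping and the exact wording are assumptions for illustration, not the text that was eventually merged.

```python
# Hypothetical sketch: build a task description that names the six categories
# listed in this thread. The wording is an assumption, not the merged text.
DOMAINS = {
    "Kitaifa": "local news",
    "Kimataifa": "international news",
    "Uchumi": "finance",
    "Afya": "health",
    "Michezo": "sports",
    "Burudani": "entertainment",
}

# Render each category as "english (swahili)" and join them into one sentence.
domain_list = ", ".join(
    f"{english} ({swahili})" for swahili, english in DOMAINS.items()
)
description = (
    "Swahili news articles for text classification, labelled with one of "
    f"{len(DOMAINS)} categories: {domain_list}."
)
print(description)
```

The resulting string could then be passed as `description=` in `TaskMetadata`, which would also resolve the mismatch between the "5 domains" in the current description and the six categories listed above.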

        reference="https://huggingface.co/datasets/Mollel/SwahiliNewsClassification",
KennethEnevoldsen marked this conversation as resolved.
        dataset={
            "path": "Mollel/SwahiliNewsClassification",
            "revision": "5bc5ef41a6232c5e3c84e1e9615099b70922d7be",
        },
        type="Classification",
        category="s2s",
        eval_splits=["train"],
        eval_langs=["swa-Latn"],
        main_score="accuracy",
        date=("2019-01-01", "2023-05-01"),
        form=["written"],
        dialect=[],
        domains=["News"],
        task_subtypes=[],
        license="CC BY-NC-SA 4.0",
        socioeconomic_status="mixed",
        annotations_creators="derived",
        text_creation="found",
        bibtex_citation="""
Contributor

No citation?

Contributor Author

   @inproceedings{davis2020swahili,
    title = "Swahili: News Classification Dataset (0.2)",
    author = "Davis, David",
    year = "2020",
    publisher = "Zenodo",
    doi = "10.5281/zenodo.5514203",
    url = "https://doi.org/10.5281/zenodo.5514203"
    }
    """,

Contributor

Wonderful. Feel free to add that in as well.

""",
        n_samples={"train": 2048},
        avg_character_length={"train": 2438.2308135942326},
    )

    def dataset_transform(self) -> None:
        self.dataset = self.dataset.rename_columns(
            {"content": "text", "category": "label"}
        )
        self.dataset = self.stratified_subsampling(
            self.dataset, seed=self.seed, splits=["train"]
        )
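
To illustrate what `dataset_transform` is doing conceptually, here is a minimal standalone sketch of the two steps: renaming the columns to `text` and `label`, then drawing a subsample that keeps each label's share of the data. It is not MTEB's `stratified_subsampling` implementation. The helpers `rename_columns` and `stratified_subsample` are hypothetical stand-ins that operate on plain lists of dicts, and the toy data and the sample size of 100 are arbitrary.

```python
import random
from collections import defaultdict


def rename_columns(rows, mapping):
    """Return copies of the row dicts with keys renamed according to mapping."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def stratified_subsample(rows, n_samples, label_key="label", seed=42):
    """Draw about n_samples rows while keeping each label's share of the data."""
    if len(rows) <= n_samples:
        return list(rows)
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Proportional allocation, with at least one row per label.
        k = max(1, round(n_samples * len(group) / len(rows)))
        sampled.extend(rng.sample(group, min(k, len(group))))
    rng.shuffle(sampled)
    return sampled


# Toy data using the column names that dataset_transform renames.
raw = [
    {"content": f"habari {i}", "category": "michezo" if i % 4 else "afya"}
    for i in range(400)
]
rows = rename_columns(raw, {"content": "text", "category": "label"})
subset = stratified_subsample(rows, n_samples=100)
```

In this toy setup a quarter of the rows are labelled `afya`, and the subsample keeps that ratio. Fixing the seed, as the task does with `self.seed`, makes the draw reproducible across runs.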