The Swahili Classification Task #998
Changes from 5 commits
d2932ce
99bd4e8
b774828
dfaf35a
affb828
ca59425
80b6a0c
@@ -0,0 +1,43 @@
from __future__ import annotations

# from ....abstasks import AbsTaskClassification
from mteb.abstasks.AbsTaskClassification import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata


class SwahiliNewsClassification(AbsTaskClassification):
    metadata = TaskMetadata(
        name="SwahiliNewsClassification",
        description="Dataset for Swahili News Classification, categorized with 5 domains. Building and Optimizing Swahili Language Models: Techniques, Embeddings, and Datasets",
Which 5 domains?
Reason for adding the dataset and domain of the classification: Swahili is spoken by 100-150 million people across East Africa. In Tanzania, for example, it is one of the primary national languages and the official language of instruction in all schools. News in Swahili is an integral part of Tanzania's media sphere.
Thanks for the clarification, @msamwelmollel, much appreciated. Can you specify these domains in the description of the dataset as well?
        reference="https://huggingface.co/datasets/Mollel/SwahiliNewsClassification",
KennethEnevoldsen marked this conversation as resolved.
        dataset={
            "path": "Mollel/SwahiliNewsClassification",
            "revision": "5bc5ef41a6232c5e3c84e1e9615099b70922d7be",
        },
        type="Classification",
        category="s2s",
        eval_splits=["train"],
        eval_langs=["swa-Latn"],
        main_score="accuracy",
        date=("2019-01-01", "2023-05-01"),
        form=["written"],
        dialect=[],
        domains=["News"],
        task_subtypes=[],
        license="CC BY-NC-SA 4.0",
        socioeconomic_status="mixed",
        annotations_creators="derived",
        text_creation="found",
        bibtex_citation="""
No citation?
Wonderful. Feel free to add that in as well.
        """,
        n_samples={"train": 2048},
        avg_character_length={"train": 2438.2308135942326},
    )

    def dataset_transform(self) -> None:
        self.dataset = self.dataset.rename_columns(
            {"content": "text", "category": "label"}
        )
        self.dataset = self.stratified_subsampling(
            self.dataset, seed=self.seed, splits=["train"]
        )
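For readers unfamiliar with the two steps in `dataset_transform`, the following is a minimal standard-library sketch of the idea: rename the columns to `text` and `label`, then draw a label-proportional subsample. It is not mteb's implementation of `stratified_subsampling`. The helper names `rename_columns` and `stratified_subsample`, the default of 2048 rows (chosen to match `n_samples` above), the seed, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def rename_columns(rows, mapping):
    """Return copies of the rows with keys renamed according to mapping."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def stratified_subsample(rows, label_key="label", n_samples=2048, seed=42):
    """Return at most n_samples rows, keeping label proportions roughly intact.

    Hypothetical helper for illustration only; mteb's own method may differ.
    """
    if len(rows) <= n_samples:
        return list(rows)
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    total = len(rows)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # Proportional quota for this label, with at least one row kept.
        quota = max(1, round(n_samples * len(group) / total))
        sample.extend(rng.sample(group, min(quota, len(group))))
    rng.shuffle(sample)
    # Rounding can overshoot by a few rows, so cap the result.
    return sample[:n_samples]


# Toy data with made-up categories, using the column names from the diff.
raw = [{"content": f"doc {i}", "category": "a" if i % 4 else "b"} for i in range(4000)]
renamed = rename_columns(raw, {"content": "text", "category": "label"})
subset = stratified_subsample(renamed)
```

Fixing the seed keeps the subsample reproducible across runs, which is the reason the task passes `seed=self.seed` in the diff.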