Merge pull request #147 from ARBML/add-new-dataset

Adding Gazelle to the catalogue
ARBML · Dec 21, 2024 · d44b188
2 parents c861093 + f7c39ae
Showing 1 changed file with 38 additions and 0 deletions.
diff --git a/datasets/gazelle.json b/datasets/gazelle.json
@@ -0,0 +1,38 @@
+{
+    "Name": "Gazelle",
+    "Subsets": [],
+    "HF Link": "https://huggingface.co/datasets/UBC-NLP/gazelle_benchmark",
+    "Link": "https://huggingface.co/datasets/UBC-NLP/gazelle_benchmark",
+    "License": "CC BY-NC-ND 4.0",
+    "Year": 2024,
+    "Language": "multilingual",
+    "Dialect": "ar-MSA: (Arabic (Modern Standard Arabic))",
+    "Domain": "other",
+    "Form": "text",
+    "Collection Style": "manual curation",
+    "Description": "Gazelle comprises two main themes, text rewriting and writing advice, across five distinct types of writing assistance instructions that we manually curate. Gazelle encompasses parallel data with instructions crafted in both Arabic and English, covering both input and output, allowing users to fine-tune models for these various tasks bilingually. Except for the I\u2019rab and Arabic grammatical rules and definitions datasets, we manually translate the Arabic explanations and instructions into English.",
+    "Volume": "9,423",
+    "Unit": "sentences",
+    "Ethical Risks": "Low",
+    "Provider": "UBC-NLP",
+    "Derived From": "",
+    "Paper Title": "Gazelle: An Instruction Dataset for Arabic Writing Assistance",
+    "Paper Link": "https://arxiv.org/pdf/2410.18163",
+    "Script": "Arab",
+    "Tokenized": "Yes",
+    "Host": "other",
+    "Access": "Free",
+    "Cost": "",
+    "Test Split": "Yes",
+    "Tasks": [
+        "instruction tuning"
+    ],
+    "Venue Title": "arXiv",
+    "Citations": "",
+    "Venue Type": "preprint",
+    "Venue Name": "",
+    "Authors": "Samar M. Magdy, Fakhraddin Alwajih, Sang Yun Kwon, Reem Abdel-Salam, Muhammad Abdul-Mageed",
+    "Affiliations": "",
+    "Abstract": "Writing has long been considered a hallmark of human intelligence and remains a pinnacle task for artificial intelligence (AI) due to the intricate cognitive processes involved. Recently, rapid advancements in generative AI, particularly through the development of Large Language Models (LLMs), have significantly transformed the landscape of writing assistance. However, underrepresented languages like Arabic encounter significant challenges in the development of advanced AI writing tools, largely due to the limited availability of data. This scarcity constrains the training of effective models, impeding the creation of sophisticated writing assistance technologies. To address these issues, we present Gazelle, a comprehensive dataset for Arabic writing assistance. In addition, we offer an evaluation framework designed to enhance Arabic writing assistance tools. Our human evaluation of leading LLMs, including GPT-4, GPT-4o, Cohere Command R+, and Gemini 1.5 Pro, highlights their respective strengths and limitations in addressing the challenges of Arabic writing. Our findings underscore the need for continuous model training and dataset enrichment to manage the complexities of Arabic language processing, paving the way for more effective AI-powered Arabic writing tools.",
+    "Added By": "zaidalyafeai"
+}
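
Entries like this one are hand-edited JSON, so small slips (padded values, line breaks carried over from a PDF, missing keys) are easy to miss in review. The sketch below shows one way such an entry could be linted before merging. It is not part of this repository: `check_entry` and `REQUIRED_KEYS` are hypothetical names, the key list is an assumed subset rather than the catalogue's real schema, and the sample record is a made-up fragment with a deliberately padded value.

```python
import json

# Assumed subset of keys for illustration; the real catalogue schema may differ.
REQUIRED_KEYS = ("Name", "Link", "License", "Year", "Language", "Tasks")


def check_entry(entry: dict) -> list[str]:
    """Return human-readable problems found in one catalogue entry."""
    problems = []
    # Report every assumed-required key that is absent.
    for key in REQUIRED_KEYS:
        if key not in entry:
            problems.append(f"missing key: {key}")
    # Flag string values with padding or embedded line breaks.
    for key, value in entry.items():
        if not isinstance(value, str):
            continue
        if value != value.strip():
            problems.append(f"stray whitespace in {key!r}")
        if "\n" in value:
            problems.append(f"embedded line break in {key!r}")
    return problems


# Made-up fragment: a padded value and several absent keys.
sample = json.loads('{"Name": "Gazelle", "Provider": " UBC-NLP", "Year": 2024}')
problems = check_entry(sample)
for problem in problems:
    print(problem)
```

In practice the same function could be run over each file under `datasets/` after `json.load`, with a non-empty result failing the check.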