Skip to content

Latest commit

 

History

History
55 lines (45 loc) · 5 KB

README.md

File metadata and controls

55 lines (45 loc) · 5 KB

BnPC: A Gold Standard Corpus for Paraphrase Detection in Bangla, and its Evaluation

This is the official repository of the paper titled "BnPC: A Gold Standard Corpus for Paraphrase Detection in Bangla, and its Evaluation", accepted in The 17th Workshop on Building and Using Comparable Corpora (BUCC 2024) co-located with LREC-COLING 2024. It contains the codes and the dataset.

Repository Structure

The repository is divided into three directories.

  • BnPC Code: It contains the all supplementary codes that are used in this research.
  • BnPC Dataset: This directory contains the BnPC Corpus, divided into test, train, validation sets.
  • Cross Validation Datasets: This directory contains the merged datasets used in cross dataset experiments.

Abstract

In this paper, we present a benchmark dataset for paraphrase detection in Bangla. Despite being the sixth most spoken language in the world, paraphrase identification in Bangla is barely explored. Our dataset contains 8,787 human-annotated sentence pairs collected from 23 newspaper outlets’ headlines in four categories. We explored several supervised modeling approaches to benchmark the dataset, including similarity metrics, linguistic features, and fine-tuned BERT models. We also conducted a zero-shot analysis to assess the performance of pre-trained BERT models, and we carried out both zero-shot and few-shot evaluations of the publicly accessible generative language model GPT 3.5 turbo. In the benchmark evaluations, when examining GPT-3.5 using a few-shot modeling approach, it becomes evident that the model can grasp paraphrases in a manner akin to fine-tuned mBERT language models with just a handful of example data points. Within the set of benchmarking trials, the fine-tuned BanglaBERT delivered the most remarkable performance, achieving a weighted-F1 score of 87.91. Noteworthy is that GPT-3.5 excelled in both zero-shot and few-shot experiments, attaining weighted-F1 scores of 51.51 and 80.53, in that order. We also performed a cross-dataset analysis and the outcomes suggest that the model trained in our dataset resembles both diversity and generalization when tested on the other dataset. Finally, we report a human evaluation experiment to obtain a better understanding of the paraphrasing task’s limitations.

Dataset Overview

The dataset consists of the following columns:

  • sentence1: First sentence.
  • sentence2: Second sentence.
  • label: 1 denotes paraphrase and 0 denotes non-paraphrase.

Sample Data

sentence1 sentence2 label
করোনাভাইরাস থেকে মুক্তি ‘সহসাই নয়’ দীর্ঘকাল ধরে করোনা মোকাবিলার প্রস্তুতি রাখতে হবে' 1
করোনাভাইরাস থেকে মুক্তি ‘সহসাই নয়’ করোনা ভাইরাস থেকে সহসাই মুক্তি পাওয়া যাচ্ছে না 1
দীর্ঘকাল ধরে করোনা মোকাবিলার প্রস্তুতি রাখতে হবে' করোনা ভাইরাস থেকে সহসাই মুক্তি পাওয়া যাচ্ছে না 1
সেনাবাহিনীকে উপেক্ষা করে মিয়ানমারের রাস্তায় বিক্ষোভ অব্যাহত মিয়ানমার সেনাবাহিনীর বিরুদ্ধে অবস্থান নিল জাতিসংঘের মানবাধিকার কাউন্সিল 0

License

Contents of this repository are restricted to only non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation

@inproceedings{saha-etal-2024-bnpc,
    title = "{B}n{PC}: A Gold Standard Corpus for Paraphrase Detection in {B}angla, and its Evaluation",
    author = "Saha, Sourav  and
      Nobin, Zeshan Ahmed  and
      Chowdhury, Mufassir Ahmad  and
      Mobin, Md. Shakirul Hasan Khan  and
      Amin, Mohammad Ruhul  and
      Kar, Sudipta",
    editor = "Zweigenbaum, Pierre  and
      Rapp, Reinhard  and
      Sharoff, Serge",
    booktitle = "Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.bucc-1.8",
    pages = "69--84",
}