[Computer Speech & Language] A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages

mehedihasanbijoy/DPCSpell

DPCSpell

A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages
Paper: Computer Speech & Language

How does DPCSpell work?

[Figure: dpcspell]

Tested Environments

| Operating System | Requirement | Remark |
| --- | --- | --- |
| Ubuntu 16.04.7 LTS | requirements_u.yml | ✔️ Successful |
| Ubuntu 18.04.6 LTS (Google Colab) | requirements_c.txt | ✔️ Successful* |
| Windows 10 | requirements_w.yml | ✔️ Successful |

Get Started

git clone https://github.com/mehedihasanbijoy/DPCSpell.git

Alternatively, download and extract the DPCSpell repository from GitHub manually.


Environment Setup

Create A Virtual Environment

conda env create -f requirements_u.yml (for Ubuntu 16.04.7 LTS)
or
conda env create -f requirements_w.yml (for Windows 10)

Activate the Environment

conda activate DPCSpell

Prepare the Spelling Error Correction (SEC) Corpora

gdown "https://drive.google.com/drive/folders/1_sWSi-LFsvuYh9c5GBMDd4V6_uM8yYjH?usp=share_link" -O ./Dataset --folder

Alternatively, download the folder manually from the Google Drive link above and place the extracted files in ./Dataset/
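
Before training, it can help to confirm that the download landed where the scripts expect it. The following is a minimal sketch, not part of the repository: the helper name `find_corpus_files` is hypothetical, and it assumes the corpora are CSV files stored directly under ./Dataset/, as the `./Dataset/corpus.csv` path in the detector command suggests.

```python
from pathlib import Path


def find_corpus_files(dataset_dir="./Dataset"):
    """Return the names of CSV files directly under dataset_dir, sorted."""
    root = Path(dataset_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset folder not found: {root}")
    return sorted(p.name for p in root.glob("*.csv"))


if __name__ == "__main__":
    try:
        files = find_corpus_files()
    except FileNotFoundError as err:
        print(err)
    else:
        print("CSV files found:", files)
        # corpus.csv is the file passed to detector.py via --CORPUS.
        if "corpus.csv" not in files:
            print("Warning: ./Dataset/corpus.csv is missing")
```

If the download placed the files in a nested subfolder, move them up so that `./Dataset/corpus.csv` resolves.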


Training and Evaluation of DPCSpell

Detector Network

python detector.py --CORPUS "./Dataset/corpus.csv" --HID_DIM 128 --ENC_LAYERS 5 --DEC_LAYERS 5 --ENC_HEADS 8 --DEC_HEADS 8 --ENC_PF_DIM 256 --DEC_PF_DIM 256 --ENC_DROPOUT 0.1 --DEC_DROPOUT 0.1 --CLIP 1 --LEARNING_RATE 0.0005 --N_EPOCHS 100
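
The flags above follow a common naming scheme for transformer encoder-decoder models. The sketch below is a hypothetical parser, not the repository's own argument handling: it mirrors the flag names and values from the command, and the meanings given in the comments (for example, `PF_DIM` as the position-wise feed-forward width and `CLIP` as a gradient-clipping threshold) are inferred from the names. It also checks one constraint that standard multi-head attention imposes: the hidden dimension must be divisible by the number of heads.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags in the detector command."""
    p = argparse.ArgumentParser(description="Sketch of DPCSpell-style training flags")
    p.add_argument("--CORPUS", type=str, default="./Dataset/corpus.csv")
    p.add_argument("--HID_DIM", type=int, default=128)        # hidden size (inferred)
    p.add_argument("--ENC_LAYERS", type=int, default=5)
    p.add_argument("--DEC_LAYERS", type=int, default=5)
    p.add_argument("--ENC_HEADS", type=int, default=8)
    p.add_argument("--DEC_HEADS", type=int, default=8)
    p.add_argument("--ENC_PF_DIM", type=int, default=256)     # feed-forward width (inferred)
    p.add_argument("--DEC_PF_DIM", type=int, default=256)
    p.add_argument("--ENC_DROPOUT", type=float, default=0.1)
    p.add_argument("--DEC_DROPOUT", type=float, default=0.1)
    p.add_argument("--CLIP", type=float, default=1.0)         # gradient clipping (inferred)
    p.add_argument("--LEARNING_RATE", type=float, default=0.0005)
    p.add_argument("--N_EPOCHS", type=int, default=100)
    return p


def validate(args):
    """Return (encoder, decoder) per-head dimensions, or raise ValueError."""
    head_dims = []
    for side in ("ENC", "DEC"):
        heads = getattr(args, f"{side}_HEADS")
        if heads <= 0 or args.HID_DIM % heads != 0:
            raise ValueError(f"HID_DIM must be divisible by {side}_HEADS")
        dropout = getattr(args, f"{side}_DROPOUT")
        if not 0.0 <= dropout < 1.0:
            raise ValueError(f"{side}_DROPOUT must be in [0, 1)")
        head_dims.append(args.HID_DIM // heads)
    return tuple(head_dims)


if __name__ == "__main__":
    demo = build_parser().parse_args(["--HID_DIM", "128", "--ENC_HEADS", "8"])
    print(validate(demo))
```

With the values from the command, each of the 8 heads works on a 16-dimensional slice of the 128-dimensional hidden state.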

Purificator Network

python purificator.py --HID_DIM 128 --ENC_LAYERS 5 --DEC_LAYERS 5 --ENC_HEADS 8 --DEC_HEADS 8 --ENC_PF_DIM 256 --DEC_PF_DIM 256 --ENC_DROPOUT 0.1 --DEC_DROPOUT 0.1 --CLIP 1 --LEARNING_RATE 0.0005 --N_EPOCHS 100 

Corrector Network

python corrector.py --HID_DIM 128 --ENC_LAYERS 5 --DEC_LAYERS 5 --ENC_HEADS 8 --DEC_HEADS 8 --ENC_PF_DIM 256 --DEC_PF_DIM 256 --ENC_DROPOUT 0.1 --DEC_DROPOUT 0.1 --CLIP 1 --LEARNING_RATE 0.0005 --N_EPOCHS 100 
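
The three commands share the same hyperparameters, so they can be driven from one place. The sketch below is a hypothetical wrapper, not a script shipped with the repository: it rebuilds the three command lines shown above and runs them in the order this README lists them (detector, purificator, corrector). The names `SHARED`, `STAGES`, `build_command`, and `run_all` are illustrative.

```python
import subprocess
import sys

# Hyperparameters common to all three commands in this README.
SHARED = {
    "HID_DIM": 128, "ENC_LAYERS": 5, "DEC_LAYERS": 5,
    "ENC_HEADS": 8, "DEC_HEADS": 8,
    "ENC_PF_DIM": 256, "DEC_PF_DIM": 256,
    "ENC_DROPOUT": 0.1, "DEC_DROPOUT": 0.1,
    "CLIP": 1, "LEARNING_RATE": 0.0005, "N_EPOCHS": 100,
}

# Only the detector command in this README takes --CORPUS.
STAGES = [
    ("detector.py", {"CORPUS": "./Dataset/corpus.csv"}),
    ("purificator.py", {}),
    ("corrector.py", {}),
]


def build_command(script, extra=None, python=sys.executable):
    """Build the argument list for one training script."""
    cmd = [python, script]
    for name, value in {**(extra or {}), **SHARED}.items():
        cmd += [f"--{name}", str(value)]
    return cmd


def run_all(dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for script, extra in STAGES:
        cmd = build_command(script, extra)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing stage


if __name__ == "__main__":
    run_all(dry_run=True)
```

Run it from the repository root with the DPCSpell environment active; switch to `run_all(dry_run=False)` once the printed commands look right.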

Benchmarking Bangla SEC Task

[Figure: benchmark]

BibTeX Entry and Citation Info

@article{bijoy2024transformer,
  title     = {A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages},
  author    = {Bijoy, Mehedi Hasan and Hossain, Nahid and Islam, Salekul and Shatabda, Swakkhar},
  journal   = {Computer Speech \& Language},
  volume    = {89},
  pages     = {101703},
  year      = {2025},
  issn      = {0885-2308},
  doi       = {10.1016/j.csl.2024.101703},
  url       = {https://www.sciencedirect.com/science/article/pii/S088523082400086X},
  publisher = {Elsevier}
}
