This folder contains scripts and data for generating synthetic parallel data.
If you are interested in authentic parallel data, it can be found at the following links:
- Czech AKCES-GEC
- German FALKO-MERLIN GEC corpus
- Russian RULEC-GEC -- available upon request
- English - there are several datasets, most of which can be found at https://www.cl.cam.ac.uk/research/nl/bea2019st/#data
- ASpell dictionaries, which can be installed (on Ubuntu) with the following commands; a quick way to check that they work is shown right after this list
apt-get install aspell-cs
apt-get install aspell-ru
apt-get install aspell-en
apt-get install aspell-de
- The generate_data.sh script activates an environment with Python 3 and assumes it contains all packages from requirements.txt. Either modify the script to match your setup or run
python3 -m venv ~/virtualenvs/aspell
source ~/virtualenvs/aspell/bin/activate
pip install -r requirements.txt
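A quick way to check that the ASpell dictionaries mentioned above are installed is to list them and query ASpell in pipe mode; the misspelled word is only an example:
# list the dictionaries ASpell can see
aspell dump dicts
# ask the English dictionary for suggestions for a misspelled word
echo "speling" | aspell -a --lang=en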
Data are generated using the generate_data_wrapper.sh script. The script sets several variables:
- character- and token-level corruption probabilities
- language (for ASpell)
- path to the clean monolingual data file (one sentence per line)
- path to the vocabulary TSV file (token\toccurrence format); a way to build such a file is sketched after this list
- max_chunks -- how many parallel jobs to run (to speed up generation)
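A vocabulary file in this format can be built from the tokenized monolingual file with standard Unix tools; the file names below are only illustrative, and the one-liner assumes tokens are separated by single spaces:
# count token occurrences and write them as token<TAB>count, most frequent first
tr ' ' '\n' < monolingual.txt | grep -v '^$' | sort | uniq -c | sort -rn | awk '{print $2 "\t" $1}' > vocab.tsv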
It outputs both the original and the corrupted sentence on one line, separated by a tab.
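If you need separate clean and corrupted files, the output can be split with cut; the file name is illustrative and the commands assume the original sentence is the first column:
# first column: original (clean) sentences, second column: corrupted sentences
cut -f1 generated_data.tsv > clean.txt
cut -f2 generated_data.tsv > corrupted.txt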
The Vocabularies directory contains the vocabulary files used in our experiments. The Sample_monolingual_data directory contains a file with 10 000 clean Czech sentences. In our experiments, we used WMT News Crawl data (http://data.statmt.org/news-crawl/) for each language.
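If you want to use the same source of monolingual data, individual News Crawl shards can be downloaded directly; the file name below is only an example, so check http://data.statmt.org/news-crawl/ for the files that are actually available:
# example: one year of Czech News Crawl (verify the exact file name on the server)
wget http://data.statmt.org/news-crawl/cs/news.2018.cs.shuffled.deduped.gz
gunzip news.2018.cs.shuffled.deduped.gz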