Implementation of DiffuDetox, based on DiffuSeq.
The DiffuDetox datasets are small enough to be included directly in the datasets folder.
The code is based on PyTorch and HuggingFace transformers. After creating a virtual environment, run:
pip install -r requirements.txt
Launch the training script from the scripts folder:
cd scripts
bash train.sh
Arguments explanation:
--dataset_unsup: the name of the unsupervised dataset used to supplement ParaDetox
--folder_name: the name of the results directory
--data_dir: the path to the saved datasets folder, which contains the .jsonl files
--resume_checkpoint: if not none, restore this checkpoint and continue training
--vocab: the tokenizer is initialized from bert, or you can load your own preprocessed vocabulary dictionary (e.g. built with BPE)
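Since --data_dir must point to a folder of .jsonl files, it can help to sanity-check that folder before training. The following is a minimal sketch, not part of the repository: the helper name count_jsonl_records is hypothetical, and it only assumes that each non-blank line of a .jsonl file is one JSON record (it makes no assumption about the field names).

```python
import json
from pathlib import Path


def count_jsonl_records(data_dir):
    """Return {file name: record count} for every .jsonl file in data_dir.

    Hypothetical helper for illustration; it is not part of DiffuDetox.
    """
    counts = {}
    for path in sorted(Path(data_dir).glob("*.jsonl")):
        n_records = 0
        with path.open(encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                try:
                    json.loads(line)  # each line must parse as JSON
                except json.JSONDecodeError as exc:
                    raise ValueError(
                        f"{path.name}:{line_no} is not valid JSON"
                    ) from exc
                n_records += 1
        counts[path.name] = n_records
    return counts
```

Calling count_jsonl_records("../datasets") from the scripts folder would list each .jsonl file with its number of records, or raise a ValueError naming the first malformed line. The path here is only an example and should match whatever you pass as --data_dir.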
To decode, you need to modify the path to model_dir, which is obtained in the training stage:
cd scripts
bash run_decode.sh
For evaluation, you need to specify the folder of decoded texts. This folder should contain the decoded files from the same model, sampled with different random seeds:
cd scripts
python eval_seq2seq.py --folder ../{your-path-to-outputs} --mbr
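The --mbr flag presumably refers to minimum Bayes risk (MBR) selection over the outputs sampled with different seeds: for each input, keep the candidate that agrees most with the other candidates. The sketch below illustrates that general idea only and is not the code in eval_seq2seq.py. The function names are hypothetical, and a simple token-overlap F1 stands in for whatever similarity metric the script actually uses.

```python
from collections import Counter


def overlap_f1(a, b):
    """Token-overlap F1 between two whitespace-tokenized strings.

    Stand-in similarity for illustration; the real script may use another metric.
    """
    tokens_a, tokens_b = a.split(), b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    common = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if common == 0:
        return 0.0
    precision = common / len(tokens_a)
    recall = common / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, similarity=overlap_f1):
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = None, float("-inf")
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(similarity(cand, o) for o in others) / len(others)
        if score > best_score:  # ties keep the earliest candidate
            best, best_score = cand, score
    return best
```

Here candidates would be the outputs for one input sentence, one per random seed. An output that most seeds agree on scores higher than an outlier, which is why the folder needs several decoded files from the same model.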
Running eval_seq2seq.py produces .json files, which you can evaluate with the command below:
python eval_json.py --json_path path/to/json_file.json --save_path path/to/save_results