Official implementation of the [ACL 2024 paper] Semi-Supervised Spoken Language Glossification.
- Install the environment
pip install -r requirement.txt
The data used in our experiments is placed in ./dataset. The structure is as follows:
- ./dataset/phoenix/phoenix2014T/text: the plain-text version of the PHOENIX2014T corpus.
- ./dataset/phoenix/external_corpus.txt: the monolingual data.
- ./dataset/phoenix/rule_pseudo.txt: the rule-based synthetic data. Run the following commands to generate rule-based pseudo glosses for the unlabeled data.
cd ./dataset/phoenix
python rule_based_preprocess.py
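Before training, it can help to confirm that the files listed above are in place. The following is a minimal sketch and not part of the repository: the helper name `missing_dataset_paths` is hypothetical, and the paths are copied from the list above. Only the `phoenix2014T` directory is checked, since this README does not say whether `text` is a directory or a file prefix.

```python
from pathlib import Path

# Paths taken from the dataset layout described in this README,
# relative to the repository root.
EXPECTED_PATHS = [
    "dataset/phoenix/phoenix2014T",
    "dataset/phoenix/external_corpus.txt",
    "dataset/phoenix/rule_pseudo.txt",
]


def missing_dataset_paths(repo_root="."):
    """Return the expected dataset paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths()
    if missing:
        print("Missing dataset paths:")
        for p in missing:
            print("  " + p)
    else:
        print("All expected dataset paths are present.")
```

If `rule_pseudo.txt` is reported as missing, the preprocessing step above has probably not been run yet.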
To train the proposed model, run the following commands from the repository root.
cd ./project/translation
python G2T_train.py -t ./../../dataset/phoenix/phoenix2014T/ -ec ./../../dataset/phoenix/external_corpus.txt -ge 70 -g 0
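The training commands above can also be wrapped in a short Python launcher, for example to start training from another script. This is a sketch under assumptions: the function names `build_train_command` and `run_training` are hypothetical, the flag values are copied verbatim from the command above, and the meaning of each flag is defined by G2T_train.py, not by this sketch.

```python
import subprocess
import sys


def build_train_command(t="./../../dataset/phoenix/phoenix2014T/",
                        ec="./../../dataset/phoenix/external_corpus.txt",
                        ge="70",
                        g="0"):
    """Build the argument list for the training command shown above.

    The keyword names mirror the command-line flags (-t, -ec, -ge, -g);
    their values are passed through unchanged.
    """
    return [sys.executable, "G2T_train.py",
            "-t", str(t), "-ec", str(ec), "-ge", str(ge), "-g", str(g)]


def run_training(workdir="./project/translation"):
    """Run G2T_train.py inside workdir and raise if it exits with an error."""
    subprocess.run(build_train_command(), cwd=workdir, check=True)
```

Calling `run_training()` from the repository root is then roughly equivalent to the two commands above, using the current Python interpreter.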
If you find this repo useful in your research, please consider citing:
@inproceedings{yao2024s3lg,
title={Semi-Supervised Spoken Language Glossification},
author={Yao, Huijie and Zhou, Wengang and Zhou, Hao and Li, Houqiang},
booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year={2024}
}
If you find the monolingual corpus useful in your research, please consider citing:
@inproceedings{zhou2021improving,
title={Improving sign language translation with monolingual data by sign back-translation},
author={Zhou, Hao and Zhou, Wengang and Qi, Weizhen and Pu, Junfu and Li, Houqiang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={1316--1325},
year={2021}
}
Please note that the corpora (PHOENIX2014T) have their own licenses; any use of them must conform to those licenses and include the appropriate citations.