We propose a multi-stream keypoint attention network that models a sequence of keypoints produced by a readily available keypoint estimator. To facilitate interaction across the streams, we investigate several techniques, including keypoint fusion strategies, head fusion, and self-distillation. The resulting framework, denoted MSKA-SLR, is extended to a sign language translation (SLT) model through the straightforward addition of an extra translation network. We carry out comprehensive experiments on the well-known benchmarks Phoenix-2014, Phoenix-2014T, and CSL-Daily to demonstrate the efficacy of our method. Notably, we achieve new state-of-the-art performance on the sign language translation task of Phoenix-2014T.
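As a rough illustration of the core operation, the sketch below runs scaled dot-product self-attention over a single keypoint stream in NumPy. The function name, tensor shapes, and random projection matrices are placeholders for exposition only, not the paper's implementation (which uses multiple streams plus fusion and distillation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def keypoint_self_attention(kp, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one keypoint stream.

    kp: (T, D) array - T frames, D = flattened keypoint coordinates.
    w_q, w_k, w_v: (D, D) query/key/value projection matrices.
    """
    q, k, v = kp @ w_q, kp @ w_k, kp @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (T, T) frame-to-frame affinities
    return softmax(scores, axis=-1) @ v      # (T, D) attended features

# Toy stream (e.g. hand keypoints): 8 frames, 42 coordinates each.
rng = np.random.default_rng(0)
T, D = 8, 42
kp = rng.standard_normal((T, D))
out = keypoint_self_attention(kp, *(rng.standard_normal((D, D)) for _ in range(3)))
print(out.shape)  # (8, 42)
```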
## MSKA-SLR
Dataset | WER (%) | Model | Training |
---|---|---|---|
Phoenix-2014 | 22.1 | ckpt | config |
Phoenix-2014T | 20.5 | ckpt | config |
CSL-Daily | 27.8 | ckpt | config |
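The WER column above is word error rate computed over predicted gloss sequences (lower is better). For reference, a minimal implementation of the metric, using hypothetical gloss sequences rather than the datasets' evaluation scripts:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = reference.split(), hypothesis.split()
    # Single-row dynamic-programming Levenshtein distance.
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # substitution/match, insertion, deletion
            prev, d[j] = d[j], min(prev + (rw != hw), d[j - 1] + 1, d[j] + 1)
    return d[-1] / len(r)

# Hypothetical gloss sequences, not taken from the datasets:
print(wer("MY HOUSE IS BIG", "MY HOUSE BIG"))  # 0.25 -> 25% WER
```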
## MSKA-SLT
Dataset | ROUGE | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Model | Training |
---|---|---|---|---|---|---|---|
Phoenix-2014T | 53.54 | 54.79 | 42.42 | 34.49 | 29.03 | ckpt | config |
CSL-Daily | 54.04 | 56.37 | 42.80 | 32.78 | 25.52 | ckpt | config |
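The BLEU-n columns are computed by standard MT toolkits; the sketch below shows only their per-order ingredient, modified (clipped) n-gram precision, on made-up sentences. It is an illustration of the metric, not the evaluation script used for the numbers above:

```python
from collections import Counter

def ngram_precision(reference, hypothesis, n):
    """Modified (clipped) n-gram precision, the per-order ingredient of BLEU."""
    ref, hyp = reference.split(), hypothesis.split()
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    # Clip each hypothesis n-gram count by its count in the reference.
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

ref = "the cat sat on the mat"
hyp = "the cat on the mat"
print(ngram_precision(ref, hyp, 1))  # 1.0
print(ngram_precision(ref, hyp, 2))  # 0.75
```

BLEU-4 additionally takes the geometric mean of the order-1 to order-4 precisions and applies a brevity penalty for short hypotheses.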
## Installation

```shell
conda create -n mska python==3.10.13
conda activate mska
# Please install PyTorch according to your CUDA version.
pip install -r requirements.txt
```
## Datasets

Download the datasets from their official websites and place them under the corresponding directories in `data/`.
## Pretrained Models
`mbart_de` / `mbart_zh`: pretrained language models used to initialize the translation network for German and Chinese, respectively, with weights from mbart-cc-25.
We provide pretrained models for Phoenix-2014T and CSL-Daily. Download this directory and place it under `pretrained_models/`.
## Keypoints

We provide human keypoints for the three datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, pre-extracted with HRNet. Please download them and place them under `data/Phoenix-2014t` (or `data/Phoenix-2014` / `data/CSL-Daily`, respectively).
## Training and Evaluation

```shell
# Train MSKA-SLR (sign-to-gloss)
python train.py --config configs/${dataset}_s2g.yaml --epoch 100

# Evaluate MSKA-SLR from a checkpoint
python train.py --config configs/${dataset}_s2g.yaml --resume pretrained_models/${dataset}_SLR/best.pth --eval

# Train MSKA-SLT (sign-to-text)
python train.py --config configs/${dataset}_s2t.yaml --epoch 40

# Evaluate MSKA-SLT from a checkpoint
python train.py --config configs/${dataset}_s2t.yaml --resume pretrained_models/${dataset}_SLT/best.pth --eval
```
## Citation

```bibtex
@misc{guan2024multistream,
  title={Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation},
  author={Mo Guan and Yan Wang and Guangkun Ma and Jiarui Liu and Mingzu Sun},
  year={2024},
  eprint={2405.05672},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```