Skip to content

Latest commit

 

History

History
237 lines (180 loc) · 27.9 KB

README.md

File metadata and controls

237 lines (180 loc) · 27.9 KB

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

Samuel Albanie*, Gül Varol*, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox and Andrew Zisserman, BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues, ECCV 2020.

[Project page] [arXiv]

Contents

Setup

Requires: python 3.6. (some non-essential pre-processing scripts require python 3.7)

# Clone this repository
git clone https://github.com/gulvarol/bsl1k.git
cd bsl1k/
# Setup symbolic links (point these to folders where you would like data and checkpoints to be stored)
ln -s <replace_with_data_path> data
ln -s <replace_with_log_path> checkpoint
# Create bsl1k_env environment with dependencies
conda env create -f environment.yml
conda activate bsl1k_env
pip install -r requirements.txt

Demo

The demo folder contains a sample script to apply sign language recognition on an input video. By default, the demo will download: (1) a model that has been pretrained on BSL-1K and then fine-tuned on WLASL, (2) a video from handspeak.com (this particular video is part of the the WLASL test set). The demo should produce the output below (you can change to other inputs):

Usage: run python demo.py.

ASL example for book

The original video source can be found here. Copyright Jolanta Lapiak.

Train and Test

Supported Datasets

  • This code supports I3D classification training for the following sign language video datasets:
Dataset --datasetname Path --num-classes --ram_data info/
BSL-1K (coming soon) bsl1k data/bsl1k/ 1064 0 [COMING SOON]
WLASL wlasl data/wlasl/ 2000 1 (3.7GB)
MSASL msasl data/msasl/ 1000 1 (6.6GB)
Phoenix2014T phoenix2014 data/PHOENIX-2014-T-release-v3/PHOENIX-2014-T/ 1233 0 (3MB)
BSL-Corpus bslcp data/BSLCP/ 966 0 (1MB)
  • Please cite the original papers for WLASL, MSASL, Phoenix2014T and BSL-Corpus datasets. Here, we only provide pre-processed metadata, but not the videos, which can instead be obtained via the metadata provided by the dataset authors, as described next:

Preparing the data

WLASL: First head to the WLASL authors' github page here and download the .json file of links. This file evolves over time, the current version is v3 and is called WLASL_v0.3.json . Place the downloaded file at the location data/wlasl/info/WLASL_v0.3.json. After this step, video files can be downloaded by running the following command:

python misc/wlasl/download_wlasl.py

Notes: some videos may no longer be accessible - you can contact the WLASL authors to address this issue (they provide an email address on the github page linked above). Also note that the v3 json may produce slightly different results from the WLASL_v0.1.json we used for our experiments.

MSASL: As for the dataset above, first download the json files of video links from the authors here and place them into data/msasl/info/. This will create a file directory structure as follows:

data/
   msasl/
       info/
          MSASL_train.json
          MSASL_val.json
          MSASL_test.json

The videos may then be downloaded via:

python misc/msasl/download_msasl.py

Phoenix2014T: video files can be downloaded from here (this file should be unpacked to the location data/PHOENIX-2014-T-release-v3). You can then run the following command script to create .mp4 video files from the provided .png frames:

python misc/phoenix2014/gather_frames.py

BSL-Corpus: can be downloaded from here upon request from the owners.

  • In our folder organization, each dataset has a subfolder info/ in which most pre-extracted annotations are kept:
    • info/info.pkl
    • info/pose.pkl OpenPose is extracted for:
      • bsl1k for a subset of the videos
      • msasl and wlasl for all videos (we provide these within the .tar files)
  • We have pre-processed all of the video datasets to be at 256x256 spatial resolution. The pre-processing scripts can be found under the misc folder for each dataset. Using the original videos is possible, but is slower.
  • We have pre-processed WLASL and MSASL such that the video frames are stored in a pkl file, we then loaded the entire dataset in RAM. Setting --ram_data to 0 will not require this preprocessing step, and use the video files instead. The results are similar with and without this step.

Pretrained models

You can download some of the pretrained models used in the experiments by running bash misc/pretrained_models/download.sh in the project root directory. All the other pretrained models from the experiments are provided in the Experiments section. The best BSL-1K model reported for the final experiments is the first model.

Note 2021.09.14: You might want to check an improved model here from our follow-up CVPR'21 work.

Train

The training launch for each experiment can be found in the Experiments section by clicking "run" links. The training can be ran by directly typing python main.py <args> on terminal with the arguments. We also provide the exp/create_exp.py script that we used when launching experiments. You can use that via:

cd exp/
# Change config.json contents
python train.py

Test

cd exp/
# Change config.json contents
python test.py

Experiments

Experiments on BSL-1K

  • Best model BSL-1K(m.5), last 20 frames, video pose pretrained
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
BSL-1K 75.51 88.83 52.76 72.14 run, args, model, logs
  • Trade-off between training noise vs. size
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
BSL-1K(m.5) 70.61 85.26 47.47 68.13 run, args, model, logs
BSL-1K(m.6) 71.33 85.92 48.83 68.82 run, args, model, logs
BSL-1K(m.7) 70.95 85.73 48.13 67.81 run, args, model, logs
BSL-1K(m.8) 69.00 83.79 45.86 64.42 run, args, model, logs
BSL-1K(m.9) 60.53 77.51 35.09 54.26 run, args, model, logs
  • Contribution of individual cues (pose subset of the data)
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
Pose2Sign (70p face) 24.41 47.59 9.74 25.99 run, args, model, logs
Pose2Sign (60p body,hands) 40.47 59.45 20.24 39.27 run, args, model, logs
Pose2Sign (130p all) 49.66 68.02 29.91 49.21 run, args, model, logs
I3D (face-crop) 42.23 69.70 21.66 50.51 run, args, model, logs
I3D (mouth-masked) 46.75 66.34 25.85 48.02 run, args, model, logs
I3D (full-frame) 65.57 81.33 44.90 64.91 run, args, model, logs
  • Effect of pretraining
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
Random init. 39.80 61.01 15.76 29.87 run, args, model, logs
Gesture recognition 46.93 65.95 19.59 36.44 run, args, model, logs
Sign recognition 69.90 83.45 44.97 62.73 run, args, model, logs
Action recognition 69.00 83.79 45.86 64.42 run, args, model, logs
Video pose distillation 70.38 84.50 46.24 65.31 run, args, model, logs
  • The effect of the temporal window for KWS
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
1 sec 60.10 75.42 36.62 53.83 run, args, model, logs
2 sec 64.91 80.98 40.29 59.63 run, args, model, logs
4 sec 68.09 82.79 45.35 63.64 run, args, model, logs
8 sec 69.00 83.79 45.86 64.42 run, args, model, logs
16 sec 65.91 81.84 39.51 59.03 run, args, model, logs
  • The effect of the number of frames before the mouthing peak
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
16 frames 59.53 77.08 36.16 58.43 run, args, model, logs
20 frames 71.71 85.73 49.64 69.23 run, args, model, logs
24 frames 69.00 83.79 45.86 64.42 run, args, model, logs

Experiments on Transfer

  • WLASL dataset (isolated) - 64 frames input
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
Kinetics pretraining 40.85 74.10 39.06 73.33 run, args, model, logs
BSL-1K pretraining 46.82 79.36 44.72 78.47 run, args, model, logs
  • MSASL dataset (isolated) - 64 frames input
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
Kinetics pretraining 60.45 82.05 57.17 80.02 run, args, model, logs
BSL-1K pretraining 64.71 85.59 61.55 84.43 run, args, model, logs
  • Phoenix2014T dataset (co-articulated) - 16 frames input
Model wer del_rate ins_rate sub_rate Links
Kinetics pretraining 45.07 22.05 6.52 16.50 run, args, model, logs
BSL-1K pretraining 39.49 22.54 5.03 11.92 run, args, model, logs
  • BSL-Corpus dataset subset (co-articulated) - 16 frames input
Model ins. top-1 ins. top-5 cls. top-1 cls. top-5 Links
Kinetics pretraining 12.79 23.11 7.76 15.76 run, args, model, logs
BSL-1K pretraining 24.35 39.14 16.00 28.54 run, args, model, logs

Note on BSL-1K data release

We are in the process of finalising legal confirmation from our broadcasting partners before we release data.

Limitations

We would like to emphasise that this research represents a working progress towards achieving automatic sign language recognition, and as such, has a number of limitations that we are aware of (and likely many that we are not aware of). Key limitations include:

  • The data collected with our technique is long-tailed (this can be seen in Fig. 2 of our paper, referenced below). This reflects the nature of how signs are used in reality, but it also makes it challenging to train existing vision models (which prefer balanced data).
  • All data collected here is interpreted. Interpreted data differs from conversations between native signers (see e.g. this paper for a discussion on this point).
  • Our approach naturally biases the annotated data towards mouthings (signs that are not frequently mouthed, or signers who do not mouth, are less represented).

Citation

If you use this code, please cite the following:

@INPROCEEDINGS{albanie20_bsl1k,
  title     = {{BSL-1K}: {S}caling up co-articulated sign language recognition using mouthing cues},
  author    = {Albanie, Samuel and Varol, G{\"u}l and Momeni, Liliane and Afouras, Triantafyllos and Chung, Joon Son and Fox, Neil and Zisserman, Andrew},
  booktitle = {ECCV},
  year      = {2020}
}