Note: before using this dataset, obtain a license from the Public DGS Corpus.
The dataset construction code is maintained in a separate repository. The steps below reconstruct the version used in our paper:
# requirement: ffmpeg
# if not installed, for example download from here:
# install the dataset repository from git
pip install git+
# generate DGS3-T
python --tfds-data-dir tfds_datasets_custom --preprocess-glosses --output examples.json > generate.out 2> generate.err
# split the whole document-level video into sentence-level segments
python --input examples.json --output-folder output --ffmpeg-custom-path "ffmpeg/ffmpeg-git-20220722-amd64-static/ffmpeg" --num-workers 8 > slice.out 2> slice.err
The resulting DGS3-T dataset is written to the output folder.
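
Because the slicing step depends on a working ffmpeg binary, it can help to confirm that the binary is reachable before starting a long run. The following is a minimal sketch and not part of the dataset repository: the helper name resolve_ffmpeg is hypothetical, and the path in the usage line is simply the one passed to --ffmpeg-custom-path above.

```python
import os
import shutil
from typing import Optional


def resolve_ffmpeg(custom_path: Optional[str] = None) -> Optional[str]:
    """Return a usable ffmpeg executable path, or None if none is found.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    if custom_path is not None:
        # An explicit path must point at an existing, executable file.
        if os.path.isfile(custom_path) and os.access(custom_path, os.X_OK):
            return custom_path
        return None
    # Without an explicit path, fall back to whatever "ffmpeg" resolves to on PATH.
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    # Same path as the --ffmpeg-custom-path argument used in the slicing command.
    path = resolve_ffmpeg("ffmpeg/ffmpeg-git-20220722-amd64-static/ffmpeg")
    print(path if path else "ffmpeg not found; install it before slicing")
```

If the check prints that ffmpeg was not found, install or download ffmpeg first and adjust the path passed to the slicing command accordingly.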