The Fisher and CALLHOME Spanish--English Speech Translation Corpus contains English reference translations and speech recognizer output (in various forms) that complement the LDC Fisher and CALLHOME Spanish audio and transcripts. Together, they make a four-way parallel dataset whose goal is to further research in Spanish--English speech translation.
For a complete description of this corpus, and for citation in your own published research, please cite the following paper. A copy can be found in the doc/ directory.
@inproceedings{post2013improved,
Title = {Improved Speech-to-Text Translation with the {F}isher and {C}allhome {S}panish--{E}nglish Speech Translation Corpus},
Author = {Post, Matt and Kumar, Gaurav and Lopez, Adam and Karakos, Damianos and Callison-Burch, Chris and Khudanpur, Sanjeev},
Booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
Year = {2013},
Address = {Heidelberg, Germany},
Month = {December}
}
The mapping/
direocty contains files corresponding to our data splits. Each line in these files
contains a reference to the LDC transcript file and line numbers.
The corpus/
directory houses the various pieces of the corpus. Each subdirectory contains (a) a
single Spanish side and (b) either one (for Fisher training and all CALLHOME data) or four (for
Fisher test sets) English references. The Spanish side always has the extension ".es", and varies
among (a) LDC transcript (b) Kaldi ASR output (c) Kaldi lattice output and (d) lattice oracle paths.
Due to licensing restrictions, we cannot include the LDC Spanish transcripts with this dataset. We
have, however, provided scripts that will construct our data splits. To build these, first define
the environment variables $LDC2010T04
and $LDC96T17
to point to your
LDC2010T04 and
LDC96T17 installations, respectively. Then run:
make
You can also run the two scripts directly by manually listing the directories, e.g.,
./bin/build_fisher.sh $LDC2010T04
./bin/build_callhome.sh $LDC96T17
Either way, you should end up with the following generated files, in addition to the files included with the data release:
corpus/ldc/fisher_train.es
corpus/ldc/fisher_dev.es
corpus/ldc/fisher_dev2.es
corpus/ldc/fisher_test.es
corpus/ldc/callhome_train.es
corpus/ldc/callhome_devtest.es
corpus/ldc/callhome_evltest.es