Latest commit

5d3e1b9 · Jan 29, 2018

swbd

Switchboard Corpus

Training data (Switchboard corpus)

About the Switchboard corpus:

The Switchboard corpus is conversational telephone speech collected as 2-channel data sampled at 8 kHz. We use only the Switchboard-1 Phase 1 training data, which we believe corresponds to LDC catalog number LDC97S62 (Switchboard-1 Release 2). We also use the Mississippi State transcriptions, which we download separately from here.
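As a rough illustration of what "2-channel, 8 kHz" means in practice, the sketch below splits a stereo sample stream into its two channels and computes a duration. It assumes the audio has already been decoded into interleaved integer samples (`[a0, b0, a1, b1, ...]`); decoding the corpus files themselves is not shown, and the names `split_channels` and `duration_seconds` are hypothetical helpers, not part of this recipe.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 8000  # sampling rate stated in the corpus description
NUM_CHANNELS = 2       # the corpus is described as 2-channel data


def split_channels(interleaved: List[int]) -> Tuple[List[int], List[int]]:
    """Split interleaved samples [a0, b0, a1, b1, ...] into two channel lists.

    The interleaved layout is an assumption about the decoded audio.
    """
    if len(interleaved) % NUM_CHANNELS != 0:
        raise ValueError("sample count must be a multiple of the channel count")
    return interleaved[0::NUM_CHANNELS], interleaved[1::NUM_CHANNELS]


def duration_seconds(channel: List[int]) -> float:
    """Duration of a single channel at the 8 kHz sampling rate."""
    return len(channel) / SAMPLE_RATE_HZ
```

For example, `split_channels([1, 10, 2, 20])` returns `([1, 2], [10, 20])`, and a channel of 16000 samples lasts 2 seconds.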

Additional training data (Fisher-English corpus)

About the Fisher-English corpus:

The Fisher-English corpus is conversational telephone speech collected as 2-channel data sampled at 8 kHz. The data is similar to Switchboard, but most of the transcription was done in a "faster", lower-quality way.

Fisher comes in two parts, and the text and speech have separate LDC numbers. This recipe uses both parts. The LDC numbers are:

The speech: **LDC2004S13**, **LDC2005S13**
The text: **LDC2004T19**, **LDC2005T19**

Evaluation data 1 (eval2000 - Switchboard)

We use the eval2000 (a.k.a. hub5'00) evaluation data. The LDC numbers are:

The speech: **LDC2002S09**
The text: **LDC2002T43**
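
The catalog numbers listed in this README can be collected in one place, which may help when checking that every required LDC package is available before running the recipe. The following is only a sketch: the dictionary keys (`swbd`, `fisher`, `eval2000`) and the helper `catalog_numbers` are hypothetical names chosen for illustration, and the Switchboard text entry is empty because the Mississippi State transcriptions are downloaded separately rather than obtained from the LDC.

```python
from typing import Dict, List, Optional

# LDC catalog numbers as listed in this README.
# The keys are illustrative labels, not names used by the recipe itself.
CORPORA: Dict[str, Dict[str, List[str]]] = {
    "swbd": {
        "speech": ["LDC97S62"],
        "text": [],  # Mississippi State transcriptions are downloaded separately
    },
    "fisher": {
        "speech": ["LDC2004S13", "LDC2005S13"],
        "text": ["LDC2004T19", "LDC2005T19"],
    },
    "eval2000": {
        "speech": ["LDC2002S09"],
        "text": ["LDC2002T43"],
    },
}


def catalog_numbers(kind: Optional[str] = None) -> List[str]:
    """Return all catalog numbers, optionally restricted to 'speech' or 'text'."""
    if kind not in (None, "speech", "text"):
        raise ValueError("kind must be None, 'speech', or 'text'")
    kinds = ["speech", "text"] if kind is None else [kind]
    numbers = []
    for corpus in CORPORA.values():
        for k in kinds:
            numbers.extend(corpus[k])
    return sorted(numbers)
```

Calling `catalog_numbers("text")` lists only the transcript packages, and `catalog_numbers()` lists everything needed from the LDC.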

Evaluation data 2 (eval2000 - CallHome)

Coming soon.

Usage

Feature extraction

Coming soon.

Making dataset

Coming soon.