Creation pipeline

This document describes how to create the Libriheavy ASR corpus (i.e., how to split the training, dev, and test sets). To learn how we aligned Librilight to produce Libriheavy, see the text search libriheavy recipe.

Note: We cannot guarantee that the pipeline will produce exactly the same manifests as ours, because the test and dev sets are selected randomly from a "pool". See our paper for more details.

Download raw manifests aligned by text search

From Hugging Face:

bash run_pipeline.sh --stage 1 --stop-stage 1

Or from ModelScope:

bash run_pipeline.sh --stage 0 --stop-stage 0

Filter out the segments with higher CER (Optional)

We allow some errors when aligning the audio to avoid dropping too much data. If you like, you can filter out the segments with a higher CER (character error rate).

bash run_pipeline.sh --stage 2 --stop-stage 2

You can specify the CER threshold; see scripts/filter_by_cer.py for details.
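To make the idea of a CER threshold concrete, here is a minimal sketch of this kind of filtering. It is not the code in scripts/filter_by_cer.py: the segment fields (`ref`, `hyp`), the function names, and the threshold value are all hypothetical and chosen only for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_by_cer(segments, threshold):
    """Keep segments whose CER does not exceed the threshold.

    `segments` is a list of dicts with hypothetical keys "ref" and "hyp".
    """
    return [s for s in segments if cer(s["ref"], s["hyp"]) <= threshold]


# Toy data; the real manifests have a different layout.
segments = [
    {"id": "a", "ref": "hello world", "hyp": "hello world"},
    {"id": "b", "ref": "hello world", "hyp": "hallo wurld"},
    {"id": "c", "ref": "hello world", "hyp": "goodbye"},
]
kept = filter_by_cer(segments, threshold=0.2)  # threshold is an arbitrary example value
```

A lower threshold keeps fewer but cleaner segments, and a higher one keeps more data at the cost of noisier transcripts.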

Get speakers and books for dev and test sets and exclude them from the training sets

bash run_pipeline.sh --stage 3 --stop-stage 3
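The exclusion in this step can be pictured with the sketch below. It assumes each segment carries a speaker ID and a book ID, and drops a segment from training if either one is held out. The field names (`speaker`, `book`) and the function name are hypothetical. The actual logic in stage 3 may differ.

```python
def exclude_held_out(segments, held_out_speakers, held_out_books):
    """Return segments whose speaker and book are both outside the held-out sets."""
    return [
        s for s in segments
        if s["speaker"] not in held_out_speakers
        and s["book"] not in held_out_books
    ]


# Toy data with hypothetical field names.
segments = [
    {"id": "s1", "speaker": "spk1", "book": "bookA"},
    {"id": "s2", "speaker": "spk2", "book": "bookA"},
    {"id": "s3", "speaker": "spk1", "book": "bookB"},
    {"id": "s4", "speaker": "spk3", "book": "bookC"},
]
train = exclude_held_out(segments, held_out_speakers={"spk2"}, held_out_books={"bookB"})
```

Removing both the speakers and the books keeps the training data from overlapping with dev and test in either voice or text.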

Split dev and test sets

bash run_pipeline.sh --stage 4 --stop-stage 4
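Because the dev and test sets are drawn randomly from a pool, two runs can give different splits. The sketch below illustrates that effect with a seeded shuffle. The function name, the set sizes, and the `seed` parameter are assumptions made for the example. They are not options of run_pipeline.sh.

```python
import random


def split_pool(pool, num_dev, num_test, seed=None):
    """Randomly draw disjoint dev and test subsets from a pool of IDs."""
    rng = random.Random(seed)  # seed=None gives a different draw on each run
    shuffled = list(pool)
    rng.shuffle(shuffled)
    dev = shuffled[:num_dev]
    test = shuffled[num_dev:num_dev + num_test]
    return dev, test


# Toy pool of hypothetical speaker IDs.
pool = [f"spk{i}" for i in range(10)]
dev_a, test_a = split_pool(pool, num_dev=3, num_test=3, seed=42)
dev_b, test_b = split_pool(pool, num_dev=3, num_test=3, seed=42)
```

With the same seed the two draws are identical. Without a fixed seed they generally are not, which is why regenerated manifests may not match the released ones.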

Congratulations, you now have the Libriheavy corpus. See the README for how to use it.
