- This repo implements a PyTorch-based DeepSpeech2 that can load the DeepSpeech2 model pretrained on PaddlePaddle v0.12.
- DeepSpeech2 on PyTorch, tested on Python 3.7 with PyTorch 1.4:
- Ubuntu 19.10 (GPU and CPU)
- macOS 10.15 (CPU only)
- Make sure these libraries and tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. by installing them via `apt-get`:
sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python-dev
or by installing them via `homebrew` or `linuxbrew`:
brew install flac libogg libvorbis boost pkg-config eigen
- Run `setup.sh` to install all Python packages and compile a C-based decoder.
For experiment purposes, the folder structure is data-oriented. `conf` stores all configurations for training, while `exps` and `tensorboard` store all experiment results and intermediate data, including the corresponding configurations. Experiments are named by their starting time.
├── conf
│ ├── augmentation.config
│ └── experiment.yaml
├── data
│ ├── ...
│ └── manifest in csv format
├── decoded
├── exps
│ ├── 200324-15:55:40
│ ├── ...
│ └── 200411-22:28:15
└── tensorboard
├── 200324-15:46:24
├── ...
└── 200411-22:28:15
Our PyTorch version takes a `.csv` file as its dataset interface. This CSV file should contain the following columns (a hypothetical example row is shown after the table):
Column name | Description |
---|---|
uttid | Utterance ID. |
st | Start time of the utterance (required if `segmented=False` is set for the dataloader). |
et | End time of the utterance (required if `segmented=False` is set for the dataloader). |
text | Manual transcription. |
audio_path | Absolute path of the audio recording. |
duration | Duration of the audio. |
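For illustration only, a manifest row might look like the following; the column order, file path, and values are hypothetical and merely show the expected fields:

```csv
uttid,st,et,text,audio_path,duration
voxforge-0001,0.0,3.52,this is an example utterance,/data/example/audio/voxforge-0001.wav,3.52
```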
There are currently two configuration files: one for data augmentation and the other for the training process.
We kept the original code for data augmentation and feature extraction. Six optional augmentation components are provided; they can be selected, configured, and inserted into the processing pipeline:
- Volume Perturbation
- Speed Perturbation
- Shifting Perturbation
- Online Bayesian normalization
- Noise Perturbation (need background noise audio files)
- Impulse Response (need impulse audio files)
In order to inform the trainer which augmentation components are needed and in what order they are applied, an augmentation configuration file in JSON format must be prepared in advance. For example:
[{
"type": "speed",
"params": {"min_speed_rate": 0.95,
"max_speed_rate": 1.05},
"prob": 0.6
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 0.8
}]
If the dataloader is set up with this example configuration, every audio clip loaded by the dataloader will be processed in two steps: first, with 60% probability, it will be speed-perturbed with a speed rate uniformly sampled between 0.95 and 1.05; second, with 80% probability, it will be shifted in time by a randomly sampled offset between -5 ms and 5 ms. The newly synthesized audio clip will be fed into the feature extractor for further training.
For other configuration examples, please refer to `notebooks/example/conf/augmentation.conf`.
Be careful when utilizing data augmentation: improper augmentation will harm training due to an enlarged train-test gap.
`experiment.yaml` stores all configurations for training a model, including the data augmentation introduced above. Please check `notebooks/example/conf/experiment.yaml` for more details.
Several audio recordings from VoxForge are stored in `notebooks/example/audio`. Since the manifest requires an absolute path for each recording, the user has to run `generate_manifest.py` to generate the example manifest. In this example, the train, validation, and test sets all point to `example.csv`.
cd notebooks/example
python generate_manifest.py
Download the DeepSpeech model and the language model from the following link. This may take a while as the language model is huge. In the meantime, change `ds2_model_path`, `language_model_path`, `vocab_filepath` and `mean_std_filepath` accordingly.
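As a sketch, the relevant entries in `experiment.yaml` could look like the following; only the four key names come from the text above, and all paths are hypothetical placeholders for wherever you saved the downloads:

```yaml
# Hypothetical excerpt of experiment.yaml -- adjust paths to your download locations.
ds2_model_path: models/baidu_en8k/params.tar.gz       # pretrained DeepSpeech2 model (assumed path)
language_model_path: models/lm/common_crawl_00.trie   # language model binary (assumed path)
vocab_filepath: models/baidu_en8k/vocab.txt           # vocabulary file (assumed path)
mean_std_filepath: models/baidu_en8k/mean_std.npz     # feature normalization statistics (assumed path)
```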
Training on CPU is possible, but it is very slow. Run the following commands for training:
cd notebooks
python deepspeech_train.py example/conf/experiment.yaml
Run the following commands for testing:
cd notebooks
python deepspeech_test.py example/conf/experiment.yaml example/decoded
The important code is distributed across five folders. The data augmentation and the decoder come from the PaddlePaddle version, while the network part has been rewritten.
├── data_utils: code for data augmentation and dataloader
├── decoders: code for decoding
├── model_utils: code for networks
├── notebooks: example and other test code
└── utils: yaml reader
The English corpus is from the Common Crawl repository, and you can download it from statmt. We use part en.00 to train our English language model. There are some preprocessing steps before training (a minimal sketch follows the list):
- Characters not in [A-Za-z0-9\s'] (\s represents whitespace characters) are removed, and Arabic numbers are converted to English words, e.g. 1000 to one thousand.
- Repeated whitespace characters are squeezed to one and leading whitespace is removed. Since all transcriptions are lowercase, all characters are converted to lowercase.
- Top 400,000 most frequent words are selected to build the vocabulary and the rest are replaced with 'UNKNOWNWORD'.
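The following is a minimal sketch of these preprocessing steps, not the exact script used for our LM; it assumes the third-party `num2words` package for converting Arabic numbers to English words:

```python
import re

from num2words import num2words  # third-party package, assumed available


def normalize_line(line: str) -> str:
    """Apply the LM text preprocessing steps described above to one line."""
    line = line.lower()
    # Convert Arabic numbers to English words, e.g. "1000" -> "one thousand".
    line = re.sub(r"\d+", lambda m: num2words(int(m.group())), line)
    # Remove every character outside [a-z0-9\s'].
    line = re.sub(r"[^a-z0-9\s']", " ", line)
    # Squeeze repeated whitespace and strip leading/trailing whitespace.
    return re.sub(r"\s+", " ", line).strip()


print(normalize_line("On  3 May, we sold 1000 DVDs!"))
# -> "on three may we sold one thousand dvds"
```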
Currently, we have only adapted the BaiduEN8k Model; adapting the model trained on LibriSpeech is left as future work.
Language | Model Name | Training Data | Hours of Speech |
---|---|---|---|
English | BaiduEN8k Model | Baidu Internal English Dataset | 8628 h |
Language Model | Training Data | Token-based | Size | Descriptions |
---|---|---|---|---|
English LM | CommonCrawl(en.00) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; About 1.85 billion n-grams; 'trie' binary with '-a 22 -q 8 -b 8' |
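As an illustrative sketch of how a language model with the settings listed above could be produced with KenLM's `lmplz` and `build_binary` tools (the corpus path, output names, and n-gram order are assumptions, not the exact commands used):

```bash
# Train a pruned n-gram LM on the preprocessed corpus (order 5 assumed).
lmplz -o 5 --prune 0 1 1 1 1 <corpus_en00_preprocessed.txt >common_crawl_00.arpa
# Convert it to a quantized 'trie' binary with the flags listed in the table.
build_binary -a 22 -q 8 -b 8 trie common_crawl_00.arpa common_crawl_00.trie
```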
To perform z-score normalization (zero mean, unit stddev) on the audio features, the mean and standard deviation of the features must be estimated in advance from some training samples:
python tools/compute_mean_std.py \
--num_samples 2000 \
--specgram_type linear \
--manifest_paths data/librispeech/manifest.train \
--output_path data/librispeech/mean_std.npz
It will compute the mean and standard deviation of the power spectrum features from 2000 randomly sampled audio clips listed in `data/librispeech/manifest.train` and save the results to `data/librispeech/mean_std.npz` for further use.
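A minimal sketch of how such precomputed statistics could be applied to a feature matrix; the array names `mean` and `std` inside the `.npz` archive are assumptions, and the actual normalizer lives in `data_utils`:

```python
import numpy as np

stats = np.load("data/librispeech/mean_std.npz")
mean, std = stats["mean"], stats["std"]  # assumed array names inside the archive


def normalize(features: np.ndarray, eps: float = 1e-20) -> np.ndarray:
    """Z-score normalize a (time, frequency) spectrogram with precomputed stats."""
    return (features - mean) / (std + eps)
```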
A vocabulary of possible characters is required to convert the transcriptions into lists of token indices for training and, in decoding, to convert lists of indices back to text. Such a character-based vocabulary can be built with `tools/build_vocab.py`.
python tools/build_vocab.py \
--count_threshold 0 \
--vocab_path data/librispeech/eng_vocab.txt \
--manifest_paths data/librispeech/manifest.train
It will write a vocabulary file `data/librispeech/eng_vocab.txt` covering all transcription text in `data/librispeech/manifest.train`, without vocabulary truncation (`--count_threshold 0`).
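As a sketch of how such a vocabulary maps text to token indices and back (assuming one character per line in the vocabulary file, which may differ from the actual format):

```python
# Assumes the vocabulary file stores one character per line.
with open("data/librispeech/eng_vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

char_to_index = {ch: idx for idx, ch in enumerate(vocab)}
index_to_char = {idx: ch for idx, ch in enumerate(vocab)}

text = "hello world"
token_ids = [char_to_index[ch] for ch in text]             # text -> indices for training
recovered = "".join(index_to_char[i] for i in token_ids)   # indices -> text for decoding
```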