Commit 97f9b9c (parent: ef5da48)

Add documentation for RNNLM training (#1267)

* add documentation for training an RNNLM

Showing 4 changed files with 115 additions and 2 deletions.
@@ -0,0 +1,7 @@

RNN-LM
======

.. toctree::
   :maxdepth: 2

   librispeech/lm-training
@@ -0,0 +1,104 @@

.. _train_nnlm:

Train an RNN language model
===========================

If you have enough text data, you can train a neural network language model (NNLM) to improve
the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
scratch.

.. HINT::

   For how to use an NNLM during decoding, please refer to the following tutorials:
   :ref:`shallow_fusion`, :ref:`LODR`, :ref:`rescoring`
.. note::

   This tutorial is based on the LibriSpeech recipe. Please check it out for the
   Python scripts used in this tutorial. We use the LibriSpeech LM corpus as the LM
   training set for illustration purposes. You can also collect your own data. The
   data format is quite simple: each line should contain a complete sentence, and
   words should be separated by spaces.
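As a hypothetical illustration (the file name and the sentences below are made up, not part of the recipe), a valid training file could be created like this:

.. code-block:: bash

   $ # each line is one complete sentence; words are separated by spaces
   $ cat > my-own-lm-data.txt <<EOF
   THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
   HELLO WORLD THIS IS ANOTHER COMPLETE SENTENCE
   EOF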
First, let's download the training data for the RNNLM. This can be done via the
following command:

.. code-block:: bash

   $ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
   $ gzip -d librispeech-lm-norm.txt.gz
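Optionally, you can peek at the downloaded corpus to confirm the one-sentence-per-line format described above (the exact sentences printed will of course depend on the corpus):

.. code-block:: bash

   $ # optional sanity check: show the first two sentences and count lines
   $ head -n 2 librispeech-lm-norm.txt
   $ wc -l librispeech-lm-norm.txt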
As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
BPE tokenizer. This can be achieved by executing the following command:

.. code-block:: bash

   $ # if you don't have the BPE model, download a pre-trained one
   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
   $ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
   $ git lfs pull --include bpe.model
   $ cd ../../..

   $ ./local/prepare_lm_training_data.py \
      --bpe-model icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/bpe.model \
      --lm-data librispeech-lm-norm.txt \
      --lm-archive data/lang_bpe_500/lm_data.pt
Now, you should have a file named ``lm_data.pt`` stored under the directory ``data/lang_bpe_500``.
This is the packed training data for the RNNLM. We then sort the training data by
sentence length:

.. code-block:: bash

   $ # This could take a while (~ 20 minutes), feel free to grab a cup of coffee :)
   $ ./local/sort_lm_training_data.py \
      --in-lm-data data/lang_bpe_500/lm_data.pt \
      --out-lm-data data/lang_bpe_500/sorted_lm_data.pt \
      --out-statistics data/lang_bpe_500/lm_data_stats.txt
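The statistics file is plain text, so you can inspect it directly once sorting finishes (the exact fields it reports depend on the version of the script):

.. code-block:: bash

   $ # optional: review the statistics of the sorted training data
   $ cat data/lang_bpe_500/lm_data_stats.txt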
The aforementioned steps can be repeated to create a validation set for your RNNLM. Say
your validation text is in ``valid.txt``; simply set ``--lm-data valid.txt``
and ``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling
``./local/prepare_lm_training_data.py``, as sketched below.
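Concretely, the two steps look like this. The validation output names, including ``sorted_lm_data-valid.pt``, are our choice for this sketch; you can name them however you like, as long as the training command later points at the same files:

.. code-block:: bash

   $ # tokenize and pack the validation text (valid.txt is assumed to exist)
   $ ./local/prepare_lm_training_data.py \
      --bpe-model icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/bpe.model \
      --lm-data valid.txt \
      --lm-archive data/lang_bpe_500/lm-data-valid.pt

   $ # sort the packed validation data by sentence length
   $ ./local/sort_lm_training_data.py \
      --in-lm-data data/lang_bpe_500/lm-data-valid.pt \
      --out-lm-data data/lang_bpe_500/sorted_lm_data-valid.pt \
      --out-statistics data/lang_bpe_500/lm-data-valid_stats.txt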
After completing the previous steps, the training and validation sets for the RNNLM are ready.
The next step is to train the RNNLM. The training command is as follows:

.. code-block:: bash

   $ # assume you are in the icefall root directory
   $ cd rnn_lm
   $ ln -s ../../egs/librispeech/ASR/data .
   $ cd ..
   $ ./rnn_lm/train.py \
      --world-size 4 \
      --exp-dir ./rnn_lm/exp \
      --start-epoch 0 \
      --num-epochs 10 \
      --use-fp16 0 \
      --tie-weights 1 \
      --embedding-dim 2048 \
      --hidden-dim 2048 \
      --num-layers 3 \
      --batch-size 300 \
      --lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \
      --lm-data-valid rnn_lm/data/lang_bpe_500/sorted_lm_data-valid.pt
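``--world-size 4`` runs distributed training across 4 GPUs, so it assumes at least that many are available; you can check what your machine actually has and lower the value accordingly:

.. code-block:: bash

   $ # list available GPUs; set --world-size to at most this many
   $ nvidia-smi -L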
.. note::

   You can adjust the RNNLM hyperparameters to control the size of the RNNLM,
   such as the embedding dimension and the hidden state dimension. For more details,
   please run ``./rnn_lm/train.py --help``.

.. note::

   The training of the RNNLM can take a long time (usually a couple of days).