[Ready to merge] Pruned Transducer Stateless2 for WenetSpeech (char-based) (#349)

* add char-based pruned-rnnt2 for wenetspeech
* style check
* style check
* change for export.py
* do some changes
* do some changes
* a small change for .flake8
* solve the conflicts
k2-fsa · May 23, 2022 · 0e57b30 · 0e57b30
1 parent 2f1e23c
commit 0e57b30
Showing 29 changed files with 4,134 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -20,6 +20,8 @@ We provide 6 recipes at present:
   - [TIMIT][timit]
   - [TED-LIUM3][tedlium3]
   - [GigaSpeech][gigaspeech]
+  - [Aidatatang_200zh][aidatatang_200zh]
+  - [WenetSpeech][wenetspeech]
 
 ### yesno
 
@@ -217,6 +219,33 @@ and [Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned R
 |   fast beam search   | 10.50 | 10.69 |
 | modified beam search | 10.40 | 10.51 |
 
+### Aidatatang_200zh
+
+We provide one model for this recipe: [Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss][Aidatatang_200zh_pruned_transducer_stateless2].
+
+#### Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss
+
+|                      |  Dev  | Test  |
+|----------------------|-------|-------|
+|    greedy search     | 5.53  | 6.59  |
+|   fast beam search   | 5.30  | 6.34  |
+| modified beam search | 5.27  | 6.33  |
+
+We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wNSnSj3T5oOctbh5IGCa393gKOoQw2GH?usp=sharing)
+
+### WenetSpeech
+
+We provide one model for this recipe: [Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss][WenetSpeech_pruned_transducer_stateless2].
+
+#### Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss (trained with L subset)
+
+|                      |  Dev  | Test-Net | Test-Meeting |
+|----------------------|-------|----------|--------------|
+|    greedy search     | 7.80  |  8.75    |  13.49       |
+|   fast beam search   | 7.94  |  8.74    |  13.80       |
+| modified beam search | 7.76  |  8.71    |  13.41       |
+
+We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1EV4e1CHa1GZgEF-bZgizqI9RyFFehIiN?usp=sharing)
 
 ## Deployment with C++
 
@@ -243,10 +272,14 @@ Please see: [![Open In Colab](https://colab.research.google.com/assets/colab-bad
 [TED-LIUM3_pruned_transducer_stateless]: egs/tedlium3/ASR/pruned_transducer_stateless
 [GigaSpeech_conformer_ctc]: egs/gigaspeech/ASR/conformer_ctc
 [GigaSpeech_pruned_transducer_stateless2]: egs/gigaspeech/ASR/pruned_transducer_stateless2
+[Aidatatang_200zh_pruned_transducer_stateless2]: egs/aidatatang_200zh/ASR/pruned_transducer_stateless2
+[WenetSpeech_pruned_transducer_stateless2]: egs/wenetspeech/ASR/pruned_transducer_stateless2
 [yesno]: egs/yesno/ASR
 [librispeech]: egs/librispeech/ASR
 [aishell]: egs/aishell/ASR
 [timit]: egs/timit/ASR
 [tedlium3]: egs/tedlium3/ASR
 [gigaspeech]: egs/gigaspeech/ASR
+[aidatatang_200zh]: egs/aidatatang_200zh/ASR
+[wenetspeech]: egs/wenetspeech/ASR
 [k2]: https://github.com/k2-fsa/k2
diff --git a/egs/wenetspeech/ASR/README.md b/egs/wenetspeech/ASR/README.md
@@ -0,0 +1,19 @@
+
+# Introduction
+
+This recipe includes several ASR models trained on WenetSpeech.
+
+[./RESULTS.md](./RESULTS.md) contains the latest results.
+
+# Transducers
+
+There are various folders containing the name `transducer` in this folder.
+The following table lists the differences among them.
+
+|                                       | Encoder             | Decoder            | Comment                     |
+|---------------------------------------|---------------------|--------------------|-----------------------------|
+| `pruned_transducer_stateless2`        | Conformer(modified) | Embedding + Conv1d | Using k2 pruned RNN-T loss  |
+
+The decoder in `pruned_transducer_stateless2` is modified from the paper
+[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
+We place an additional Conv1d layer right after the input embedding layer.
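
The "Embedding + Conv1d" decoder described above can be illustrated with a toy, pure-Python sketch. It is not the recipe's implementation: the function name `make_stateless_decoder`, the window length `context_size=2`, the per-channel kernel, and padding with a blank id of 0 are all assumptions made for illustration. The point it shows is that the output depends only on the last few tokens, so no recurrent state is carried.

```python
import random


def make_stateless_decoder(vocab_size, embed_dim, context_size, blank_id=0, seed=0):
    """Build a toy decoder: embedding lookup followed by a 1-D convolution.

    All names and sizes here are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    # Embedding table: one vector of length embed_dim per token id.
    embedding = [
        [rng.uniform(-1.0, 1.0) for _ in range(embed_dim)]
        for _ in range(vocab_size)
    ]
    # Convolution kernel: one weight per (channel, position in the window).
    # Treating each channel separately is a simplification for this sketch.
    kernel = [
        [rng.uniform(-1.0, 1.0) for _ in range(context_size)]
        for _ in range(embed_dim)
    ]

    def decode(history):
        # Keep only the last `context_size` tokens, left-padding with blank.
        window = ([blank_id] * context_size + list(history))[-context_size:]
        vectors = [embedding[token] for token in window]
        # Convolve over the window, channel by channel.
        return [
            sum(kernel[c][k] * vectors[k][c] for k in range(context_size))
            for c in range(embed_dim)
        ]

    return decode


decode = make_stateless_decoder(vocab_size=10, embed_dim=4, context_size=2)
# Histories that end in the same two tokens give the same output,
# because nothing earlier than the window is remembered.
same = decode([3, 7, 5]) == decode([9, 1, 7, 5])
print(same)
```

Because the decoder sees a fixed window rather than the full history, hypotheses that share their most recent tokens produce identical decoder outputs.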
diff --git a/egs/wenetspeech/ASR/RESULTS.md b/egs/wenetspeech/ASR/RESULTS.md
@@ -0,0 +1,93 @@
+## Results
+
+### WenetSpeech char-based training results (Pruned Transducer 2)
+
+#### 2022-05-19
+
+Using the code from this PR: https://github.com/k2-fsa/icefall/pull/349.
+
+When training with the L subset, the WERs are
+
+|                                    |  dev  | test-net | test-meeting | comment                                  |
+|------------------------------------|-------|----------|--------------|------------------------------------------|
+|          greedy search             | 7.80  | 8.75     | 13.49        | --epoch 10, --avg 2, --max-duration 100  |
+| modified beam search (beam size 4) | 7.76  | 8.71     | 13.41        | --epoch 10, --avg 2, --max-duration 100  |
+| fast beam search (set as default)  | 7.94  | 8.74     | 13.80        | --epoch 10, --avg 2, --max-duration 1500 |
+
+The training command to reproduce these results is:
+
+```
+export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+./pruned_transducer_stateless2/train.py \
+  --lang-dir data/lang_char \
+  --exp-dir pruned_transducer_stateless2/exp \
+  --world-size 8 \
+  --num-epochs 15 \
+  --start-epoch 0 \
+  --max-duration 180 \
+  --valid-interval 3000 \
+  --model-warm-step 3000 \
+  --save-every-n 8000 \
+  --training-subset L
+```
+
+The tensorboard training log can be found at
+https://tensorboard.dev/experiment/wM4ZUNtASRavJx79EOYYcg/#scalars
+
+The decoding command is:
+```
+epoch=10
+avg=2
+
+## greedy search
+./pruned_transducer_stateless2/decode.py \
+        --epoch $epoch \
+        --avg $avg \
+        --exp-dir ./pruned_transducer_stateless2/exp \
+        --lang-dir data/lang_char \
+        --max-duration 100 \
+        --decoding-method greedy_search
+
+## modified beam search
+./pruned_transducer_stateless2/decode.py \
+        --epoch $epoch \
+        --avg $avg \
+        --exp-dir ./pruned_transducer_stateless2/exp \
+        --lang-dir data/lang_char \
+        --max-duration 100 \
+        --decoding-method modified_beam_search \
+        --beam-size 4
+
+## fast beam search
+./pruned_transducer_stateless2/decode.py \
+        --epoch $epoch \
+        --avg $avg \
+        --exp-dir ./pruned_transducer_stateless2/exp \
+        --lang-dir data/lang_char \
+        --max-duration 1500 \
+        --decoding-method fast_beam_search \
+        --beam 4 \
+        --max-contexts 4 \
+        --max-states 8
+```
+
+When training with the M subset, the WERs are
+
+|                                    |   dev  | test-net  | test-meeting  | comment                                   |
+|------------------------------------|--------|-----------|---------------|-------------------------------------------|
+|          greedy search             | 10.40  | 11.31     | 19.64         | --epoch 29, --avg 11, --max-duration 100  |
+| modified beam search (beam size 4) |  9.85  | 11.04     | 18.20         | --epoch 29, --avg 11, --max-duration 100  |
+| fast beam search (set as default)  | 10.18  | 11.10     | 19.32         | --epoch 29, --avg 11, --max-duration 1500 |
+
+
+When training with the S subset, the WERs are
+
+|                                    |  dev   | test-net  | test-meeting  | comment                                   |
+|------------------------------------|--------|-----------|---------------|-------------------------------------------|
+|          greedy search             | 19.92  | 25.20     | 35.35         | --epoch 29, --avg 24, --max-duration 100  |
+| modified beam search (beam size 4) | 18.62  | 23.88     | 33.80         | --epoch 29, --avg 24, --max-duration 100  |
+| fast beam search (set as default)  | 19.31  | 24.41     | 34.87         | --epoch 29, --avg 24, --max-duration 1500 |
+
+
+A pre-trained model and decoding logs can be found at <https://huggingface.co/luomingshuang/icefall_asr_wenetspeech_pruned_transducer_stateless2>
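
Each row in the results tables pairs `--epoch` with `--avg`. A minimal sketch of what that pair is assumed to select follows. Two assumptions are not stated in this file: that `--avg N` averages the last N epoch checkpoints ending at `--epoch`, and that checkpoints are saved as `epoch-<i>.pt` under `--exp-dir`. The helper name `checkpoints_to_average` is hypothetical.

```python
def checkpoints_to_average(exp_dir, epoch, avg):
    """List the checkpoint files assumed to be averaged for decoding.

    Assumption: checkpoints are named epoch-<i>.pt and the last `avg`
    epochs up to and including `epoch` are used.
    """
    start = epoch - avg + 1
    if avg < 1 or start < 0:
        raise ValueError(f"cannot average {avg} checkpoints ending at epoch {epoch}")
    return [f"{exp_dir}/epoch-{i}.pt" for i in range(start, epoch + 1)]


# The L-subset rows use --epoch 10 --avg 2.
files = checkpoints_to_average("pruned_transducer_stateless2/exp", epoch=10, avg=2)
for name in files:
    print(name)
```

Under these assumptions, the M-subset setting (`--epoch 29 --avg 11`) would cover epochs 19 through 29, and the S-subset setting (`--epoch 29 --avg 24`) would cover epochs 6 through 29.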
diff --git a/egs/wenetspeech/ASR/local/compute_fbank_musan.py b/egs/wenetspeech/ASR/local/compute_fbank_musan.py
@@ -0,0 +1 @@
+../../../librispeech/ASR/local/compute_fbank_musan.py
diff --git a/egs/wenetspeech/ASR/local/compute_fbank_wenetspeech_dev_test.py b/egs/wenetspeech/ASR/local/compute_fbank_wenetspeech_dev_test.py
@@ -0,0 +1,93 @@
+#!/usr/bin/env python3
+# Copyright    2021  Johns Hopkins University (Piotr Żelasko)
+# Copyright    2021  Xiaomi Corp.             (Fangjun Kuang)
+#
+# See ../../../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+from pathlib import Path
+
+import torch
+from lhotse import (
+    CutSet,
+    KaldifeatFbank,
+    KaldifeatFbankConfig,
+    LilcomHdf5Writer,
+)
+
+# Torch's multithreaded behavior needs to be disabled or
+# it wastes a lot of CPU and slows things down.
+# Do this outside of main() in case it needs to take effect
+# even when we are not invoking the main (e.g. when spawning subprocesses).
+torch.set_num_threads(1)
+torch.set_num_interop_threads(1)
+
+
+def compute_fbank_wenetspeech_dev_test():
+    in_out_dir = Path("data/fbank")
+    # number of workers in dataloader
+    num_workers = 42
+
+    # number of seconds in a batch
+    batch_duration = 600
+
+    subsets = ("S", "M", "DEV", "TEST_NET", "TEST_MEETING")
+
+    device = torch.device("cpu")
+    if torch.cuda.is_available():
+        device = torch.device("cuda", 0)
+    extractor = KaldifeatFbank(KaldifeatFbankConfig(device=device))
+
+    logging.info(f"device: {device}")
+
+    for partition in subsets:
+        cuts_path = in_out_dir / f"cuts_{partition}.jsonl.gz"
+        if cuts_path.is_file():
+            logging.info(f"{cuts_path} exists - skipping")
+            continue
+
+        raw_cuts_path = in_out_dir / f"cuts_{partition}_raw.jsonl.gz"
+
+        logging.info(f"Loading {raw_cuts_path}")
+        cut_set = CutSet.from_file(raw_cuts_path)
+
+        logging.info("Computing features")
+
+        cut_set = cut_set.compute_and_store_features_batch(
+            extractor=extractor,
+            storage_path=f"{in_out_dir}/feats_{partition}",
+            num_workers=num_workers,
+            batch_duration=batch_duration,
+            storage_type=LilcomHdf5Writer,
+        )
+        cut_set = cut_set.trim_to_supervisions(
+            keep_overlapping=False, min_duration=None
+        )
+
+        logging.info(f"Saving to {cuts_path}")
+        cut_set.to_file(cuts_path)
+
+
+def main():
+    formatter = (
+        "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
+    )
+    logging.basicConfig(format=formatter, level=logging.INFO)
+
+    compute_fbank_wenetspeech_dev_test()
+
+
+if __name__ == "__main__":
+    main()
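
The script above is resumable: a partition is skipped when its `cuts_{partition}.jsonl.gz` file already exists. The sketch below isolates that check using only the standard library, so it can be tried without lhotse or a GPU. The helper name `pending_partitions` and the temporary directory are hypothetical; the file-name pattern and subset names are taken from the script.

```python
import tempfile
from pathlib import Path


def pending_partitions(in_out_dir, subsets):
    """Return the partitions whose output cuts file does not exist yet."""
    in_out_dir = Path(in_out_dir)
    pending = []
    for partition in subsets:
        cuts_path = in_out_dir / f"cuts_{partition}.jsonl.gz"
        if cuts_path.is_file():
            # Same rule as the script: existing output means "skip".
            continue
        pending.append(partition)
    return pending


subsets = ("S", "M", "DEV", "TEST_NET", "TEST_MEETING")
with tempfile.TemporaryDirectory() as tmp:
    # Pretend DEV was already processed in an earlier run.
    (Path(tmp) / "cuts_DEV.jsonl.gz").touch()
    todo = pending_partitions(tmp, subsets)

print(todo)  # ['S', 'M', 'TEST_NET', 'TEST_MEETING']
```

Note that the check only tests for the file's presence, so a partially written cuts file from an interrupted run would also be treated as done.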