Initial commit

Signed-off-by: Philip Abbet <[email protected]>
idiap · Feb 17, 2021 · 34bcc30 · 34bcc30
commit 34bcc30
Show file tree

Hide file tree

Showing 33 changed files with 1,724,550 additions and 0 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -0,0 +1,30 @@
+# Copyright (c) 2021 Idiap Research Institute, http://www.idiap.ch/
+# Written by Rudolf A. Braun <[email protected]>
+# 
+# This file is part of icassp-oov-recognition
+# 
+# icassp-oov-recognition is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License version 3 as
+# published by the Free Software Foundation.
+# 
+# icassp-oov-recognition is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+# 
+# You should have received a copy of the GNU General Public License
+# along with icassp-oov-recognition. If not, see <http://www.gnu.org/licenses/>.
+
+cmake_minimum_required(VERSION 3.9.5)
+SET(CMAKE_CXX_STANDARD 11)
+SET(CMAKE_CXX_FLAGS "-Ofast ")
+
+include_directories("/path/to/openfst-1.6.7/include")
+include_directories("libs/")
+
+add_subdirectory(libs/pybind11)
+pybind11_add_module(fast libs/fast.cc libs/fst-wrapper.cc)
+
+set_target_properties(fast PROPERTIES LIBRARY_OUTPUT_NAME "fast")
+
+target_link_libraries(fast PRIVATE "-L/path/to/openfst-1.6.7/lib" -lfstscript -lfst)
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -0,0 +1,54 @@
+# icassp-oov-recognition
+
+This has data and code related to the ICASSP submission "A comparison of methods for OOV-word recognition"
+
+# data
+
+This contains for English and German:  
+    - The train and test set in kaldi format (audio files not included)  
+    - The lexicon  
+    - For convenience the list of OOVs in the test set relative to the lexicon  
+    - The lexicon for the OOV-words
+
+English LM data: [link](http://www.mediafire.com/file/fy8841cfkwft5tu/en_lm_text.txt.gz/file)
+German LM data: [link](http://www.mediafire.com/file/7egjt3mygxk6whw/de_lm_text.txt.gz/file)
+
+# scripts
+
+Currently contains scripts to  
+    - create the train/test partition from a (kaldi formatted) data folder containing CommonVoice data, `build_cv_test_train.py`  
+    - create the HCL graph which can be inserted into an existing HCLG, `compose_hcl.sh`  
+    - recover words from a decoded lattice that phones arcs attached to the `<unk>` token, `recover_unk_words.sh`  
+
+# libs
+
+More up-to-date version of wrapper is [here](https://github.com/RuABraun/fst-util). You should use that.
+
+---
+
+This has code which wraps OpenFST, and functions for modifying graphs (`insert`, `replace_single`, `add_boost`).
+
+To compile you will need to include add a symlink inside the libs/ directory to a copy of the pybind11 repository, and to use `LD_LIBRARY_PATH` needs have the OpenFST libs in its path and copy the compiled .so to the site-packages/ directory (run `python -m site` to find).
+
+# How to add words to HCLG
+
+As mentioned in the paper, this method requires you to use a monophone model. Additionally, your language model needs to have been trained with pocolm and the `--limit-unk-history` option.
+
+For simplicity, the modification is done on a graph without self-loops. So you need to modify `utils/mkgraph.sh` and comment L167: `rm $dir/HCLGa.fst $dir/Ha.fst 2>/dev/null || true` because we will use `HCLGa.fst`.
+
+Inside the graph dir where the HCLG is there is a `words.txt`. You need to assign IDs to the new words you're adding and append these to `words.txt` file (these should be larger than the existing ones obviously).
+
+Assuming all this is ready you can use `script/compose_hcl.sh` to create the HCL from a lexicon of the OOV words you want to add. Check the script for the input arguments, `model` is the `final.mdl`, isym is phones osym words. Notice it uses `create_lfst.py` so you need to fst wrapper installed. There is one hardcoded parameter on L25, `303`, see [here](https://groups.google.com/g/kaldi-help/c/jL8VnwKGRWs/m/-Pe29-G9AgAJ) for what's about. You can set it to any number larger than the existing phone IDs.
+
+After calling the script and creating the `HCL.fst` you use the fst wrapper to modify the `HCLGa.fst`.
+
+```
+from wrappedfst import WrappedFst
+fst = WrappedFst('HCLGa.fst')
+ifst = WrappedFst('HCL.fst')
+unk_id =  # unk symbol
+fst.replace_single(unk_id, ifst)
+fst.write('HCLGa_new.fst')
+```
+
+Then add the self-loops (check `mkgraph.sh` for how to do that) and you are done. Replace an existing `HCLG.fst` with the new version and you can run decoding as you would normally.
diff --git a/data/de/data_test/spk2utt b/data/de/data_test/spk2utt
diff --git a/data/de/data_test/text b/data/de/data_test/text
diff --git a/data/de/data_test/utt2spk b/data/de/data_test/utt2spk
diff --git a/data/de/data_test/wav.scp b/data/de/data_test/wav.scp
diff --git a/data/de/data_train/spk2utt b/data/de/data_train/spk2utt
diff --git a/data/de/data_train/text b/data/de/data_train/text
diff --git a/data/de/data_train/utt2spk b/data/de/data_train/utt2spk
diff --git a/data/de/data_train/wav.scp b/data/de/data_train/wav.scp