As Chinese students studying in the United States, we found that our speaking habits have morphed: English words and phrases easily slip into our Chinese sentences. We feel a strong need for messaging apps that can handle multilingual speech-to-text transcription. So in this project, we develop that function: we build a model using deep learning architectures (DNN, CNN, LSTM) to correctly transcribe code-switched audio (Chinese and English in the same sentence) into text.
Contains scripts to build our system
LDC2015S04, our dataset description
Our study notes on Kaldi-related recipes, including timit and librispeech
- A First Speech Recognition System For Mandarin-English Code-switch Conversational Speech
- Speech Recognition on English-Mandarin Code-Switching Data using Factored Language Models
- Daniel Povey Lectures
- An Introduction to Kaldi Toolkit
- Building Speech Recognition Systems with the Kaldi Toolkit
- Kaldi Document in CN
- University of Edinburgh - Automatic Speech Recognition Course Lab
- Kaldi Data Prep (Eleanor Chodroff)
- Kaldi Data Prep (kaldi-asr.org)
- Kaldi examples
- Decoding
| | filename | pattern | format | path | source |
|---|---|---|---|---|---|
| acoustic data | spk2gender | `<speakerID> <gender>` | | /data/train /data/test | handmade |
| | utt2spk | `<utteranceID> <speakerID>` | | /data/train /data/test | handmade |
| | wav.scp | `<utteranceID> <full_path_to_audio_file>` | .scp: Kaldi script file | /data/train /data/test | handmade |
| | text | `<utteranceID> <text_transcription>` | .ark: Kaldi archive file | /data/train /data/test | exists |
| language data | lexicon.txt | `<word> <phone 1> <phone 2> ...` | .ark: Kaldi archive file | data/local/dict | egs/voxforge |
| | nonsilence_phones.txt | `<phone>` | | data/local/dict | unknown |
| | silence_phones.txt | `<phone>` | | data/local/dict | unknown |
| | optional_silence.txt | `<phone>` | | data/local/dict | unknown |
| tools | utils | / | | kaldi/egs/wsj/s5 | |
| | steps | / | | kaldi/egs/wsj/s5 | |
| | score.sh | / | | kaldi/egs/voxforge/s5/local | |
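The "handmade" acoustic files in the table above can be generated with a short script. The following is a minimal sketch, not our actual preparation script: the function name `write_kaldi_data_dir` and its input format (a list of utterance ID, speaker ID, and wav path tuples) are assumptions made for illustration. It writes each file sorted by its first field, which is what Kaldi's data validation expects.

```python
import os


def write_kaldi_data_dir(utterances, speaker_gender, out_dir):
    """Write wav.scp, utt2spk, and spk2gender into out_dir.

    utterances: list of (utt_id, spk_id, wav_path) tuples (hypothetical input).
    speaker_gender: dict mapping spk_id to "m" or "f".
    """
    os.makedirs(out_dir, exist_ok=True)
    # Sort by utterance ID; Python's string order matches the C locale for ASCII IDs.
    utterances = sorted(utterances)
    with open(os.path.join(out_dir, "wav.scp"), "w") as wav_scp, \
            open(os.path.join(out_dir, "utt2spk"), "w") as utt2spk:
        for utt_id, spk_id, wav_path in utterances:
            wav_scp.write("%s %s\n" % (utt_id, wav_path))
            utt2spk.write("%s %s\n" % (utt_id, spk_id))
    with open(os.path.join(out_dir, "spk2gender"), "w") as spk2gender:
        for spk_id in sorted(speaker_gender):
            spk2gender.write("%s %s\n" % (spk_id, speaker_gender[spk_id]))
```

Prefixing every utterance ID with its speaker ID (for example `spk1_utt1`) keeps the utterance order and the speaker order consistent. Running `utils/validate_data_dir.sh` on the output directory afterwards is still a good idea.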
What is our language model:
3-grams trained from the transcripts of THCHS30 + LDC2015S04
directory structure taken from /egs/TIMIT/s5:
/data
    /local
        /nist_lm
            /lm_phone_bg.arpa.gz
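To make "3-grams trained from the transcripts" concrete, here is a toy sketch of trigram counting with an unsmoothed maximum-likelihood estimate. It is not how SRILM builds the model (SRILM applies smoothing and back-off and writes ARPA files); the function names and the romanized example sentences are made up for illustration.

```python
from collections import Counter


def count_ngrams(sentences, n=3):
    """Count n-grams over tokenized sentences, padding with <s> and </s>."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


def trigram_prob(counts, w1, w2, w3):
    """Unsmoothed maximum-likelihood estimate of P(w3 | w1, w2)."""
    context_total = sum(c for (a, b, _), c in counts.items() if (a, b) == (w1, w2))
    return counts[(w1, w2, w3)] / context_total if context_total else 0.0


# Two toy code-switched transcripts (romanized, hypothetical).
transcripts = [["wo", "xiang", "chi", "pizza"], ["wo", "xiang", "chi", "fan"]]
trigram_counts = count_ngrams(transcripts)
```

With these two transcripts, `trigram_prob(trigram_counts, "xiang", "chi", "pizza")` is 0.5, because "xiang chi" is followed once by "pizza" and once by "fan". Any trigram not seen in training gets probability 0 here, which is the problem smoothing addresses in a real toolkit.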
How to build a language model:
- SRILM
- Kaldi lm_build
- egs/babel/s5/local/train_lms_srilm.sh built using SRILM toolkit
- Language Preparation
Kaldi script utils/prepare_lang.sh
usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>
e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang
options:
--num-sil-states <number of states> # default: 5, #states in silence models.
--num-nonsil-states <number of states> # default: 3, #states in non-silence models.
--position-dependent-phones (true|false) # default: true; if true, use _B, _E, _S & _I
# markers on phones to indicate word-internal positions.
--share-silence-phones (true|false) # default: false; if true, share pdfs of
# all non-silence phones.
--sil-prob <probability of silence> # default: 0.5 [must have 0 < silprob < 1]
Setting the --share-silence-phones option to true was extremely helpful for the Cantonese data of IARPA's BABEL project, where the data is very messy and has long untranscribed portions that the Kaldi developers try to align to a special phone designated for that purpose. The --sil-prob option might be another potentially important one.
- lexicon.txt
- The pronunciation dictionary, where every line is a word followed by its phonemic pronunciation. It contains only the words (and their pronunciations) that are present in the corpus.
- ENG: CMU dictionary
- nonsilence_phones.txt
- optional_silence.txt
- silence_phones.txt
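Since every phone used in lexicon.txt has to appear in one of the phone lists, nonsilence_phones.txt can be derived from the lexicon. The sketch below assumes the `<word> <phone 1> <phone 2> ...` line format and a silence phone named `sil`; both the function name and the phone symbols in the example are illustrative, not taken from our dictionary.

```python
def phones_from_lexicon(lexicon_lines, silence_phones=("sil",)):
    """Return the sorted set of non-silence phones used in a lexicon.

    Each lexicon line is '<word> <phone 1> <phone 2> ...'.
    """
    phones = set()
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and words without a pronunciation
        phones.update(fields[1:])
    return sorted(phones - set(silence_phones))


# Hypothetical lexicon entries mixing English and Mandarin pronunciations.
example_lexicon = ["hello hh ah l ow", "nihao n i3 h ao3", "<SIL> sil"]
example_phones = phones_from_lexicon(example_lexicon)
```

Writing one phone per line from `example_phones` gives a file in the `<phone>` format listed in the table.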
echo
echo "===== FEATURES EXTRACTION ====="
echo
# Making feats.scp files
mfccdir=mfcc
# Uncomment and modify arguments in scripts below if you have any problems with data sorting
# utils/validate_data_dir.sh data/train # script for checking prepared data - here: for data/train directory
# utils/fix_data_dir.sh data/train # tool for data proper sorting if needed - here: for data/train directory
steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" data/test exp/make_mfcc/test $mfccdir
# Making cmvn.scp files
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test $mfccdir
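As far as we understand, `steps/compute_cmvn_stats.sh` only accumulates the statistics (per speaker); the normalization itself is applied later, when the features are read. The following sketch shows what cepstral mean and variance normalization does to a block of frames. It is a plain-Python illustration under that assumption, not Kaldi's implementation, and the function name is hypothetical.

```python
def apply_cmvn(frames):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: non-empty list of equal-length feature vectors, one per frame.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n_frames
                 for d in range(dim)]
    # Floor the variance so constant dimensions do not divide by zero.
    stds = [max(v, 1e-10) ** 0.5 for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]
```

For example, the two frames `[[1.0, 10.0], [3.0, 30.0]]` normalize to `[[-1.0, -1.0], [1.0, 1.0]]`: the differing scales of the two dimensions are removed.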
MFCC-related documents
a_ij as the transition probability from state i to state j
b_j(x) as the emission probability of observation x (an element of the sequence X) from state j
Forward-backward algorithm fine tunes
HMM solves the following three problems:
- overall likelihood (Forward algorithm): determine the likelihood of an observation sequence X=(x1, x2, ... xT) being generated by an HMM
- training (Forward-backward algorithm, EM): given an observation sequence, learn the best model parameters
- decoding (Viterbi algorithm): given an observation sequence, determine the most probable hidden state sequence
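The likelihood and decoding problems above can be sketched for a small discrete HMM as follows. This is a minimal illustration with made-up parameters; it uses plain probabilities, whereas real toolkits work with log probabilities to avoid underflow on long sequences. `init`, `trans`, and `emit` are hypothetical names for the initial, transition, and emission probabilities.

```python
def forward_likelihood(init, trans, emit, observations):
    """Forward algorithm: likelihood of the observations, summed over all state paths."""
    n_states = len(init)
    alpha = [init[j] * emit[j][observations[0]] for j in range(n_states)]
    for x in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][x]
            for j in range(n_states)
        ]
    return sum(alpha)


def viterbi(init, trans, emit, observations):
    """Viterbi algorithm: most probable hidden state sequence and its probability."""
    n_states = len(init)
    delta = [init[j] * emit[j][observations[0]] for j in range(n_states)]
    backpointers = []
    for x in observations[1:]:
        prev = delta
        # For each state j, remember the best predecessor state.
        best_prev = [max(range(n_states), key=lambda i: prev[i] * trans[i][j])
                     for j in range(n_states)]
        delta = [prev[best_prev[j]] * trans[best_prev[j]][j] * emit[j][x]
                 for j in range(n_states)]
        backpointers.append(best_prev)
    state = max(range(n_states), key=lambda j: delta[j])
    best_prob = delta[state]
    path = [state]
    for best_prev in reversed(backpointers):
        state = best_prev[state]
        path.append(state)
    path.reverse()
    return path, best_prob


# Toy two-state HMM with two observation symbols (0 and 1).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
```

For the observation sequence `[0, 1]`, the forward likelihood is 0.209 and the Viterbi path is `[0, 1]` with probability 0.1296. The forward value is larger because it sums over every path rather than keeping only the best one.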
To train a CNN, we need to extract MFSC features from the acoustic data instead of MFCC features, because the Discrete Cosine Transform (DCT) in MFCC destroys locality. MFSC features are also called filter banks (fbank). In Kaldi, the scripts look something like the following:
steps/make_fbank.sh --nj 3 $trainDir/train_clean_fbank exp/make_fbank/train_clean_fbank feat/fbank/ || exit 1;
steps/compute_cmvn_stats.sh $trainDir/train_clean_fbank exp/make_fbank/train_clean_fbank feat/fbank/ || exit 1;
Note that fbank features don't work well with GMMs: fbank features are highly correlated, while GMMs modelled with diagonal covariance matrices assume independence between feature dimensions. fbank/MFSC is okay for DNNs and best for CNNs.
why MFSC+GMM produced high WER: see Kaldi discussion
why DCT destroys locality: see post
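The locality argument can be checked numerically. The sketch below uses a textbook unnormalized DCT-II on a made-up vector of 8 filter-bank energies; it is not the exact transform Kaldi applies (which also involves a log and liftering), and the helper names are hypothetical. Changing one filter-bank channel changes exactly one fbank value but every cepstral coefficient, so a CNN filter over neighbouring coefficients no longer sees neighbouring frequency bands.

```python
import math


def dct2(values):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def count_changed(before, after, tol=1e-9):
    """Number of positions where two equal-length vectors differ."""
    return sum(1 for a, b in zip(before, after) if abs(a - b) > tol)


fbank = [1.0] * 8            # hypothetical filter-bank energies
perturbed = list(fbank)
perturbed[3] += 1.0          # a change local to one frequency band

changed_fbank = count_changed(fbank, perturbed)
changed_cepstra = count_changed(dct2(fbank), dct2(perturbed))
```

Here `changed_fbank` is 1 while `changed_cepstra` is 8: the local change is spread over all coefficients.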
tensorflow == 1.1.0
theano == 0.9.0.dev-c697eeab84e5b8a74908da654b66ec9eca4f1291
keras == 1.2
This doesn't require Sun GridEngine. Simply download the [CUDA toolkit](https://developer.nvidia.com/cuda-downloads) and install it with
sudo sh cuda_8.0.61_375.26_linux.run
then go to kaldi/src and execute
./configure
to check whether it detects CUDA; if it does, you will also find `CUDA = true` in kaldi/src/kaldi.mk
then recompile Kaldi with
make depend -j 8 # 8 for an 8-core CPU
make -j 8 # 8 for an 8-core CPU
Note that GMM-based training and decoding are not supported on GPU; only nnet is. (source)
If you are using an AWS g2.2xlarge instance launched before 2017-04-18 (when this note was written), its NVIDIA GPU may need a legacy 367.x driver; the default (latest) driver that comes with CUDA 8 (`cuda_8.0.61_375.26_linux.run`) will fail.
To list the NVIDIA driver versions available for installation on the instance, type
apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'
to install a version of your choice from the list, type
sudo apt-get install nvidia-367
You can also download a specific version from the web, for example `NVIDIA-Linux-x86_64-367.18.run`. Install it with
sudo sh NVIDIA-Linux-x86_64-367.18.run
Then, when you install `cuda_8.0.61_375.26_linux.run`, it will ask whether to install NVIDIA driver 375; make sure you choose `no`.
Required:
- install CUDA toolkit 8.0 as of 04-18-2017
- install cuDNN (download v5); as of 04-18-2017, TensorFlow performs best with cuDNN 5.x
Follow the commands from the TensorFlow website carefully. After installation, you can test whether TensorFlow detects your GPU by typing the following:
# make sure you are outside the tensorflow git repo
python
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
A working tensorflow will output:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:04.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0
I tensorflow/core/common_runtime/direct_session.cc:257] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0
- During testing, if you run into an error like:
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcudnn.so.5. LD_LIBRARY_PATH: /usr/local/cuda/lib64
I tensorflow/stream_executor/cuda/cuda_dnn.cc:3517] Unable to load cuDNN DSO
then, in the writer's experience, you didn't set the right `LD_LIBRARY_PATH` in the `~/.profile` file. You need to find where `libcudnn.so.5` is located and move it to the desired location, most likely `/usr/local/cuda`. Also make sure you run `source ~/.profile` to activate the change after you modify the file.
- If you are testing it in a python shell and you get the following error:
ImportError: libcudart.so.8.0: cannot open shared object file: No such file or directory
very likely you are inside the actual `tensorflow` git repo (source). Make sure you leave that directory before testing.
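For the cuDNN error in the first item above, it can help to check which directories on `LD_LIBRARY_PATH` actually contain the library before moving files around. This is a small sketch with a hypothetical helper name; note that the dynamic loader also consults other locations (such as the ldconfig cache), so an empty result only tells you about this one variable.

```python
import os


def find_library(lib_name, search_path):
    """Return the directories in a colon-separated search path that contain lib_name."""
    hits = []
    for directory in search_path.split(":"):
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits
```

Typical usage would be `find_library("libcudnn.so.5", os.environ.get("LD_LIBRARY_PATH", ""))`. An empty list suggests that the library is not in any directory listed in the variable.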
Keras-kaldi's LSTM training script breaks under the current TensorFlow (TensorFlow went through a series of API changes over the previous months), so we need to install Theano with GPU support and switch to the Theano backend to run `run_kt_LSTM.sh`.
After installing Theano-gpu using miniconda, you can change the Theano configuration (`theano.config`) by creating `.theanorc` with the following command:
echo -e "\n[global]\nfloatX=float32\n" >> ~/.theanorc
and add `device=gpu` to this file.
If Theano can't detect NVCC and gives you the following error:
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.
(but you are sure that you installed CUDA), you can solve it by adding the following lines to `~/.profile`:
export PATH=/usr/local/cuda-8.0/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
Don't forget to run `source ~/.profile` to enable the change.
To change the Keras backend from TensorFlow to Theano, set "backend" to "theano" in the Keras config file:
vim $HOME/.keras/keras.json
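If you prefer not to edit the file by hand, the same change can be scripted. The sketch below assumes the config file is JSON with a top-level `"backend"` key, which is how Keras stores it; the function name `set_keras_backend` is hypothetical. All other keys in the file are left untouched.

```python
import json
import os


def set_keras_backend(config_path, backend="theano"):
    """Set the "backend" key in a Keras JSON config file, keeping other keys."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
    config["backend"] = backend
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config
```

Typical usage would be `set_keras_backend(os.path.expanduser("~/.keras/keras.json"))`. Keras prints the active backend when it is imported, which is a quick way to confirm the switch.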
to test if theano is indeed using gpu, execute the following file:
from theano import function, config, shared, tensor
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
- 3-4 hours to train, 3 hours to decode on GPU:
local/online/run_nnet2_baseline.sh
dspavankumar/keras-kaldi github repo
At the time we ran his code, the environment was still Keras 1.2.0
Make sure that the Keras version is the same across the machines.
To downgrade Keras from 2.0.3 to an older version, type
$ sudo pip3 install keras==1.2
or
$ conda install keras==1.2.2 # if you are using conda
If the versions are inconsistent (e.g. the model is trained with 1.2.0 but decoded with 2.0.3), you will run into a problem when loading an existing model:
File "steps_kt/nnet-forward.py", line 33, in <module>
m = keras.models.load_model (model)
File "/usr/local/lib/python3.5/dist-packages/keras/models.py", line 281, in load_model
Error: “Optimizer weight shape (1024, ) not compatible with provided weight shape (429,1024)”
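One way to catch this earlier is to compare version strings before loading a model. The following is a minimal sketch: the helper name is hypothetical, and treating two versions as compatible when their major and minor numbers agree is our assumption, not a guarantee from Keras.

```python
def versions_match(trained_with, running, parts=2):
    """Compare the first `parts` dot-separated components of two version strings."""
    def key(version):
        return tuple(version.split(".")[:parts])
    return key(trained_with) == key(running)
```

For example, `versions_match("1.2.0", "1.2.2")` is true while `versions_match("1.2.0", "2.0.3")` is false. In a decoding script the second argument would be `keras.__version__`, and the first would be whatever version you recorded when training the model.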