Hi, @r9y9. First of all, thank you for such a brilliant implementation of WaveNet. I'm currently studying how to detect phoneme durations (the start and end time of each phoneme) from audio and align them with linguistic features, but I don't know how to do this. Could you explain the approach you use to solve this problem, and point me to the code in this repo that does it? Thanks!

P.S. By the way, is it possible to train this repo on another language? I'm currently working with Vietnamese, using my own dataset (7 hours of audio, with ARPABET linguistic features extracted from the text).