Model Training Process #7
@SuperJonotron Unfortunately, the hard drive with all the original model checkpoints, logs, and datasets got wiped when I sent my laptop in for warranty repair :(. I used a model called Forward Tacotron V2. You actually start by training a Tacotron 2 model, and then use data from that to train the Forward model. While Tacotron 2 is quite nice, it is not very robust. Longer inputs result in word salad (you can hear this on the Uberduck TTS voices). Additionally, it is orders of magnitude slower. Listen to their Forward Tacotron V2 samples; they are extremely fluent! The process involves doing transfer learning on the Tacotron model (train on LJSpeech until fluent, then switch to the desired character's dataset), and then training the Forward model with outputs from Tacotron. Now, the one thing that might improve the verbal fluency a bit would be to transfer learn on the Forward model using LJSpeech as well, but it might sound a bit less like GLaDOS.
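To make the dataset-swap step concrete, here is a minimal PyTorch sketch of the transfer-learning idea described above (train until fluent on LJSpeech, then keep training the same weights on the character's clips). This is not the actual ForwardTacotron training code; the toy model, dataset, and checkpoint name are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def finetune(model, checkpoint_path, dataset, steps=1000, lr=1e-4):
    # Load the weights from the fluent LJSpeech run, then keep training
    # the same model on the character-specific clips.
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    model.train()
    step = 0
    while step < steps:
        for inputs, targets in loader:
            loss = nn.functional.l1_loss(model(inputs), targets)
            optim.zero_grad()
            loss.backward()
            optim.step()
            step += 1
            if step >= steps:
                break
    return model

# Toy stand-ins so the sketch runs end to end; the real models map text
# to mel spectrograms rather than 80-dim vectors to 80-dim vectors.
toy_model = nn.Linear(80, 80)
torch.save(toy_model.state_dict(), "ljspeech_stage.pt")
glados_like = TensorDataset(torch.randn(64, 80), torch.randn(64, 80))
finetune(toy_model, "ljspeech_stage.pt", glados_like, steps=50)
```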
A much more pressing issue is the vocoder. As you've seen through testing, the vocoder causes the majority of the slowdown, especially with longer inputs. The "vocoder-cpu-lq.pt" model is fast enough, but sounds terrible. All of the vocoders use HiFiGAN, with the HQ one using the v1 variant and the LQ one using v3. The creators of HiFiGAN only released a checkpoint with discriminator weights for v1, so I had to train v3 from scratch. You're supposed to train these for millions of steps, which I didn't have time for considering I was training on my laptop 😅. A proper HiFiGAN v3 model would definitely help a lot. I did find someone on the HiFiGAN repo who swapped in the (much faster) training profile from a different vocoder and still got good results, converging much faster.
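Since the vocoder dominates the runtime, a quick way to compare checkpoints is to time a forward pass over a dummy mel spectrogram. This is a rough sketch, assuming the .pt vocoder files are TorchScript modules that accept an 80-bin mel of shape (1, 80, frames); the file names below are just examples, not necessarily the exact names shipped with the repo.

```python
import time
import torch

def time_vocoder(path, mel_frames=600, runs=5):
    # Load a TorchScript vocoder checkpoint and average several timed passes.
    vocoder = torch.jit.load(path, map_location="cpu")
    vocoder.eval()
    mel = torch.randn(1, 80, mel_frames)      # dummy 80-bin mel spectrogram
    with torch.no_grad():
        vocoder(mel)                           # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            vocoder(mel)
    return (time.perf_counter() - start) / runs

for name in ("vocoder-cpu-hq.pt", "vocoder-cpu-lq.pt"):
    print(name, f"{time_vocoder(name):.2f} s per pass")
```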
For somebody (me) who doesn't have any experience with training models, that's a lot to process. Looking at that project a bit, I feel like I could train on the initial LJSpeech dataset and then be pretty lost after that. For example, train_forward has a Python script but not much else to go on if you're not already familiar with what's needed to actually train a model this way. From what I understand, you created such a model from some other data, the 600 samples, and then fed it into this process. That created some intermediary model that you then did some more work on to get the final result, but that whole process, and everything about training further with all these various inputs... no idea. I'd be up for using some more robust hardware to do further training to make it better, or just recreating the entire model from scratch so this project can live on indefinitely. Just not really sure how to code the process to make that happen.
Reading all the info and docs in these repos, here's what I've got so far:
Does this seem like the correct order of operations? Any chance you could fill in the gaps? As for the comments about how to make it better, I'd be happy to attempt it; I'd just need a bit more detail on what would actually need to happen to get it done.
The main bummer is that I lost the SSD that had all my source files due to my laptop frying. I no longer have the dataset I used, or all of the checkpoints I backed up during training. I started off with the Ellen McLain dataset but removed all non-Portal 2 files and added punctuation (VERY IMPORTANT). The files you want are in the LJSpeech folder of that repository; the other one has preprocessed Mel spectrograms. All you do to swap datasets is manually switch them when the voice sounds relatively fluent. There is no "method" of doing this; it is very manual. The ForwardTacotron model only takes the GLaDOS clips. You want to use the pretrained HiFiGAN model weights and first transfer them onto the GLaDOS source files and Mels, then use teacher forcing to generate Mels from the ForwardTacotron model to finetune it. This happens relatively quickly for the larger HiFiGAN variants, but for the smaller ones you have to train from scratch (most likely with full LJSpeech instead of GLaDOS first) due to the lack of checkpoints. I did make one such model, but it sounds terrible. It needs more training time. Due to the sinusoidal nature of GLaDOS's voice, part of me wonders if a fluent hardcoded vocoder might be possible...
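For reference, the LJSpeech-style layout mentioned above is just a folder of wav clips plus a pipe-separated metadata file of `file_id|transcription` lines, with punctuation kept in the text. A small sketch of assembling that file; the wav folder name, example IDs, and transcription dict are placeholders:

```python
from pathlib import Path

# Folder of wav clips plus a pipe-separated metadata file, LJSpeech style.
# The example transcriptions below are placeholders for the real lines.
wav_dir = Path("wavs")
transcripts = {
    "glados_0001": "Oh. It's you.",
    "glados_0002": "It's been a long time. How have you been?",
}

with open("metadata.csv", "w", encoding="utf-8") as meta:
    for wav in sorted(wav_dir.glob("*.wav")):
        text = transcripts.get(wav.stem)
        if text:                               # skip clips with no transcription
            meta.write(f"{wav.stem}|{text}\n")
```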
I started a fork of the Ellen McLain dataset that I'm slowly going through and correcting the punctuation, after which I will split off a branch to remove all the Portal 1 lines. Then I'm going to try training using various methods. ForwardTacotron just merged the multispeaker branch; maybe I'll give that a whirl.
Okay, so I got smart and found a webpage with a transcription, ran it through a regex to format it, then ran it through a super basic PowerShell script to separate it into two datasets: DieKatzchen/Ellen-McLain-Dataset
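For anyone who would rather not use PowerShell, here is a hedged Python equivalent of that split, assuming each formatted line is already `file_id|text` and the game can be told apart by a file-name prefix. The `all_lines.txt` name and the `p1_` prefix are assumed conventions, not necessarily what the linked dataset uses.

```python
def split_dataset(combined="all_lines.txt"):
    # Route each "file_id|text" line into a Portal 1 or Portal 2 metadata
    # file based on an assumed file-name prefix convention.
    with open(combined, encoding="utf-8") as src, \
         open("portal1_metadata.csv", "w", encoding="utf-8") as p1, \
         open("portal2_metadata.csv", "w", encoding="utf-8") as p2:
        for line in src:
            if not line.strip():
                continue
            file_id = line.split("|", 1)[0]
            (p1 if file_id.startswith("p1_") else p2).write(line)

split_dataset()
```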
I have created a new version of the model with a new dataset and training methodology. This time it was much simpler: I just trained a multispeaker ForwardTacotron model with the Portal 1 lines, Portal 2 lines, and the entire LJSpeech dataset as the three speakers. The dataset itself is quite large, but I can share the text file of my manual transcription of the lines if you'd like.
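A hypothetical sketch of what merging the three corpora into one multispeaker metadata file could look like, with each clip tagged by speaker. The exact column layout ForwardTacotron's multispeaker branch expects may differ, so treat this as illustrative and check its preprocessing code for the real format; the file names and speaker labels are placeholders.

```python
# Merge three corpora into one metadata file with a speaker tag per line.
sources = {
    "glados_p1": "portal1_metadata.csv",
    "glados_p2": "portal2_metadata.csv",
    "ljspeech":  "ljspeech_metadata.csv",
}

with open("metadata_multispeaker.csv", "w", encoding="utf-8") as out:
    for speaker, path in sources.items():
        with open(path, encoding="utf-8") as src:
            for line in src:
                line = line.strip()
                if not line:
                    continue
                file_id, text = line.split("|", 1)
                out.write(f"{file_id}|{text}|{speaker}\n")
```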
Could you go into some detail on how exactly you ran the training?
I'm not at my desktop at the moment, but yes, the new version uses the multispeaker version of ForwardTacotron. This is a much more stable solution than transfer learning because it mitigates catastrophic forgetting and gives the model a lot more to reference when training. I'll gladly share the text file for the dataset once I'm back at my desktop; the dataset itself is a bit big, though, and I don't really want to host it on Google Drive or something. Generally, if you're trying to train a model to speak like some other character, I would definitely recommend this process, although for a male character I would suggest using a dataset with a male speaker as the reference. Maybe some subset of VCTK or something.
I would also be interested in more details on how this training is accomplished. I have struggled to follow along with the current documentation available.
First off, thanks for taking the time to reply. What you're describing is an interesting idea I hadn't tried yet. I think the VCTK dataset does something similar; I just hadn't put two and two together. I'd still be interested in the text file of the dataset. I'll try the default configurations with multispeaker ForwardTacotron and see how that goes. Also, great job on the improvements in the new GLaDOS speech model with the Portal 2 output. So far it seems to work better than the previous one (and honestly, the previous one was the gold standard for GLaDOS TTS). Great work! Thanks again.
Hey there, I know this issue/discussion is a tad old, but I'm really interested in doing this myself so I can apply it to other characters and learn how the whole process works. I'm quite lost on how the "pipeline" is structured. Is it possible to get some direction on this? I don't want to be spoon-fed the files haha, just want some direction so I'm on the right lines 😅 Thanks!
Could you publish the code used to train the model, perhaps without the actual dataset if it's too large? Since it looks like there's a Tacotron 2 available that might produce a better model if retrained, I'd be interested in seeing what that process looks like and even trying to retrain against the newer models.