Model Training Process #7

Open
SuperJonotron opened this issue Nov 16, 2022 · 13 comments

@SuperJonotron

Could you publish the code used to train the model, perhaps without the actual dataset if it's too large? Since it looks like there's a Tacotron 2 available, which might produce a better model if retrained on it, I'd be interested in seeing what that process looks like, and might even try retraining against the newer models.

@R2D2FISH
Owner

R2D2FISH commented Nov 16, 2022

@SuperJonotron Unfortunately, the hard drive with all the original model checkpoints, logs, and datasets got wiped when I sent my laptop in for warranty repair :(. I used a model called Forward Tacotron V2. You actually start by training a Tacotron 2 model, and then use data from that to train the Forward model. While Tacotron 2 is quite nice, it is not very robust: longer inputs result in word salad (you can hear this on the Uberduck TTS voices). Additionally, it is orders of magnitude slower. Listen to the Forward Tacotron V2 samples; they are extremely fluent! The process involves doing transfer learning on the Tacotron model (train on LJSpeech until fluent, then switch to the desired character's dataset), and then training the Forward model with outputs from Tacotron. Now, the one thing which might improve the verbal fluency a bit would be to do transfer learning on the Forward model using LJSpeech as well, but it might sound a bit less like GLaDOS.

@R2D2FISH
Owner

R2D2FISH commented Nov 16, 2022

A much more pressing issue is the vocoder. As you've seen through testing, the vocoder causes the majority of the slowdown, especially with longer inputs. The "vocoder-cpu-lq.pt" model is fast enough, but sounds terrible. All of the vocoders use HiFi-GAN, with the hq one using the v1 variant and the lq one using v3. The creators of HiFi-GAN only released a checkpoint with discriminator weights for v1, so I had to train v3 from scratch. You're supposed to train these for millions of steps, which I didn't have time for considering I was training on my laptop 😅. A proper HiFi-GAN v3 model would definitely help a lot. I did find someone on the HiFi-GAN repo who reported that when they swapped in the (much faster) training profile from a different vocoder, they still got good results but converged much faster.
One issue with the smaller model is that it works much better when fine-tuned (using the actual model inputs and outputs as training sources). Ordinarily this would require teacher forcing, which Forward Tacotron cannot do. However, one could probably generate a ton of outputs (~10k) from the TTS and the better vocoder, save the mel spectrograms and wavs, and then train the smaller model exclusively on that. It would probably sound quite good.
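
The synthetic-dataset idea above could be sketched roughly as follows. This is an illustration only, not the code used for this project: `synthesize` and `vocode` are hypothetical callables standing in for the ForwardTacotron model and the higher-quality vocoder, and the JSON mel files are a placeholder for whatever format the smaller vocoder's training script actually expects (most likely NumPy arrays). The 22050 Hz default matches LJSpeech.

```python
import json
import struct
import wave
from pathlib import Path
from typing import Callable, Iterable, List, Sequence

Mel = List[List[float]]                     # frames x mel bins
Synth = Callable[[str], Mel]                # hypothetical: text -> mel spectrogram
Vocode = Callable[[Mel], Sequence[float]]   # hypothetical: mel -> samples in [-1, 1]


def build_finetune_set(sentences: Iterable[str], synthesize: Synth,
                       vocode: Vocode, out_dir: str,
                       sample_rate: int = 22050) -> List[str]:
    """Save one (mel, wav) pair per sentence and return the clip ids."""
    out = Path(out_dir)
    (out / "mels").mkdir(parents=True, exist_ok=True)
    (out / "wavs").mkdir(parents=True, exist_ok=True)
    ids = []
    for i, text in enumerate(sentences):
        clip_id = f"synth_{i:05d}"
        mel = synthesize(text)      # mel as predicted by the TTS model
        samples = vocode(mel)       # audio from the better vocoder
        # Placeholder mel format; swap for the format your trainer reads.
        (out / "mels" / f"{clip_id}.json").write_text(json.dumps(mel))
        # Clamp to [-1, 1] and convert to 16-bit little-endian PCM.
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        with wave.open(str(out / "wavs" / f"{clip_id}.wav"), "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(sample_rate)
            wf.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
        ids.append(clip_id)
    return ids
```

Because the mel and the wav share a clip id, the smaller vocoder sees exactly the kind of spectrograms the TTS model will feed it at inference time, which is the point of fine-tuning here.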

@SuperJonotron
Author

For somebody (me) that doesn't have any experience with training models, that's a lot to process. Looking at that project a bit I feel like I could train the initial LJSpeech dataset and then be pretty lost after that. For example, train forward has some python script but not much else to go on if you're not already familiar with what is needed to actually train a model this way. From what I understand, you created such a model from some other data, the 600 samples, and then fed it into this process. That created some intermediary model that then you did some more work on to get the final result but that whole process and everything about training further with all these various inputs...no idea.

I'd be up for using some more robust hardware to do further training to make it better, or just recreating the entire model from scratch so this project can live on indefinitely. I'm just not really sure how to code the process to make that happen.

@SuperJonotron
Author

After reading all this info and the docs in these repos, here's what I've got so far:

  1. Train ForwardTacotron on the LJSpeech data. This is pretty straightforward following that repo's docs, although it's 10k steps, and while I have it training now, it's going to take a very long time to finish.
  2. Access the Ellen McLain dataset. The only thing I could find online is this: https://github.com/robit-man/Ellen-McLain-Dataset/tree/master/Preprocessed/training_data If this is it, it seems to have an entirely different structure from the LJSpeech data, so I'm not really sure how to work with it.
  3. Feed the (modified) Ellen McLain dataset into the trained LJSpeech model to forward train it. What commands are expected to make that happen?
    > switch to desired character's dataset), and then training the Forward model with outputs from Tacotron
    What does this mean? I can't find any mention of this process in the documentation of these repositories.
  4. Use this forward-trained model to generate a HiFi-GAN model. Was this the step where the vocoder was generated using this repo: https://github.com/jik876/hifi-gan? Is it somehow a matter of feeding the mel files generated in step 3 into this to produce the final vocoder model?

Does this seem like the correct order of operations? Any chance you could fill in the gaps? As far as the comments about how to make it better, I'd be happy to attempt it, just would need a bit more details on what would actually need to happen to get it done.

@R2D2FISH
Owner

The main bummer is that I lost the SSD that had all my source files due to my laptop frying. I no longer have the dataset I used, or all of the checkpoints I backed up during training. I started off with the Ellen McLain dataset but removed all non-Portal 2 files and added punctuation (VERY IMPORTANT). The files you want are in the LJSpeech folder of that repository. The other one has preprocessed Mel spectrograms. All you do to swap datasets is manually switch them when the voice sounds relatively fluent. There is no "method" of doing this; it is very manual. The ForwardTacotron model only takes the GLaDOS clips. You want to use the pretrained HiFiGAN model weights and first transfer it onto the GLaDOS source files and Mels and then use teacher forcing to generate Mels from the ForwardTacotron model to finetune it. This happens relatively quickly for the larger HiFiGAN variants but if using the smaller ones you have to train from scratch (most likely with full LJSpeech instead of GLaDOS first) due to the lack of checkpoints. I did make one such model but it sounds terrible. Needs more training time. Due to the sinusoidal nature of GLaDOS's voice part of me wonders if a fluent hardcoded vocoder might be possible...
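
The dataset cleanup described above (keep only the Portal 2 clips, make sure every line is punctuated) might look something like the sketch below. It is a hedged illustration: the `p1_`/`p2_` id prefixes and the sample sentences are made up, the `keep` predicate is whatever rule identifies Portal 2 files in your copy of the dataset, and the output rows follow the three-column, pipe-delimited layout of LJSpeech's `metadata.csv` with the text simply repeated in the normalized column.

```python
from typing import Callable, Dict, List, Tuple

TERMINAL_PUNCTUATION = (".", "!", "?")


def select_clips(transcripts: Dict[str, str],
                 keep: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Return (metadata_rows, ids_missing_punctuation) for the kept clips."""
    rows, unpunctuated = [], []
    for clip_id in sorted(transcripts):
        if not keep(clip_id):
            continue
        # Collapse stray whitespace left over from transcription.
        text = " ".join(transcripts[clip_id].split())
        # Ignore closing quotes when checking for sentence-final punctuation.
        if not text.rstrip("\"'").endswith(TERMINAL_PUNCTUATION):
            unpunctuated.append(clip_id)
        rows.append(f"{clip_id}|{text}|{text}")
    return rows, unpunctuated


if __name__ == "__main__":
    # Made-up ids and sentences, for illustration only.
    transcripts = {
        "p1_0001": "This clip is from the first game",
        "p2_0001": "The test will begin shortly.",
        "p2_0002": "Please stand   by",
    }
    rows, todo = select_clips(transcripts, keep=lambda cid: cid.startswith("p2_"))
    print("\n".join(rows))
    print("still needs punctuation:", todo)
```

The function only flags unpunctuated lines rather than guessing at punctuation, since adding it correctly is manual work.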

@DieKatzchen

I started a fork of the Ellen McLain dataset in which I'm slowly going through and correcting the punctuation, after which I will split off a branch to remove all the Portal 1 lines. Then I'm going to try training using various methods. ForwardTacotron just merged the multispeaker branch, so maybe I'll give that a whirl.

@DieKatzchen

Okay, so I got smart and found a webpage with a transcription, ran it through a regex to format it, then ran it through a super basic PowerShell script to separate it into two datasets: DieKatzchen/Ellen-McLain-Dataset

benediktkr added a commit to benediktkr/glados-tts that referenced this issue May 21, 2023
@R2D2FISH
Owner

R2D2FISH commented Oct 3, 2023

I have created a new version of the model with a new dataset and training methodology. This time was much simpler. I just trained a multispeaker ForwardTacotron model with the Portal 1 lines, Portal 2 lines, and the entire LJSpeech dataset as the three speakers. The dataset itself is quite large but I can share the text file of my manual transcription of the lines if you'd like.

@setokaibakg

setokaibakg commented Oct 17, 2023

> I have created a new version of the model with a new dataset and training methodology. This time was much simpler. I just trained a multispeaker ForwardTacotron model with the Portal 1 lines, Portal 2 lines, and the entire LJSpeech dataset as the three speakers. The dataset itself is quite large but I can share the text file of my manual transcription of the lines if you'd like.

Could you go into some detail on how exactly you ran the training?
Just following the ForwardTacotron instructions isn't enough and it sounds like you did 3 datasets? I think from some of your previous notes you mentioned running the LJSpeech dataset first then glados. Did you have to change configurations in between steps and what did you do to optimize the model at the end?
It would be amazing if you could provide some info on your process, and I suppose some people would be interested in the text file of the dataset (transcript). Thanks again for any help on this. (This is one of the few things I haven't quite figured out.)

@R2D2FISH
Owner

I'm not at my desktop at the moment, but yes, the new version uses the multispeaker version of ForwardTacotron. This is a much more stable solution than transfer learning because it mitigates catastrophic forgetting and gives the model a lot more to reference when training. I'll gladly share the text file for the dataset. The dataset itself is a bit big, though, and I don't really want to host it on Google Drive or something. But generally, if you're trying to train a model to speak like some other character, I would definitely recommend this process, although for a male character I would suggest using a dataset with a male speaker as the reference. Maybe some subset of VCTK or something.
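
For anyone trying to reproduce the three-speaker setup, combining the transcripts into one metadata list could be sketched like this. It is a minimal sketch under assumptions: the `id|speaker|text` row layout is assumed rather than taken from this thread, so check the ForwardTacotron README for the exact multispeaker format, and the speaker names and example sentences are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def merge_speakers(datasets: Dict[str, Dict[str, str]]) -> Tuple[List[str], Counter]:
    """Merge per-speaker {clip_id: text} maps into pipe-delimited rows.

    Returns the rows plus a per-speaker clip count, which is worth checking
    because a large reference corpus will heavily outnumber the character clips.
    """
    rows: List[str] = []
    counts: Counter = Counter()
    seen = set()
    for speaker, clips in datasets.items():
        for clip_id in sorted(clips):
            if clip_id in seen:
                raise ValueError(f"duplicate clip id across speakers: {clip_id}")
            seen.add(clip_id)
            text = clips[clip_id].strip()
            # Assumed layout: clip id, speaker label, transcript.
            rows.append(f"{clip_id}|{speaker}|{text}")
            counts[speaker] += 1
    return rows, counts


if __name__ == "__main__":
    # Hypothetical speaker labels and made-up sentences.
    rows, counts = merge_speakers({
        "portal1": {"p1_0001": "Hello."},
        "portal2": {"p2_0001": "Welcome back."},
        "ljspeech": {"LJ001-0001": "Printing."},
    })
    print("\n".join(rows))
    print(dict(counts))
```

Rejecting duplicate clip ids up front avoids one speaker's audio silently overwriting another's when the files are copied into a shared folder.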

@SuperJonotron
Author

> Could you go into some detail on how exactly you ran the training?

I would also be interested in more details on how this training is accomplished. I have struggled to follow along with the current documentation available.

@setokaibakg

> I'm not at my desktop at the moment but yes, the new version uses the multispeaker version of ForwardTacotron. This is a much more stable solution than transfer learning because it mitigates catastrophic forgetting and gives the model a lot more to reference from when training. I'm not at my desktop at the moment but I'll gladly share the text file for the dataset. The dataset itself is a bit big though and I don't really want to host it on Google Drive or something. But generally if you're trying to train a model to speak like some other character I would definitely recommend this process, although for a male character I would suggest you use some dataset with a male speaker as reference. Maybe some subset of VCTK or something.

First off, thanks for taking the time to reply. What you are describing is an interesting idea that I had not tried yet. I think the VCTK dataset does something similar, but I hadn't put two and two together. I would still be interested in the text file of the dataset. I'll try the default configuration with multispeaker ForwardTacotron and see how that goes. Also, great job on the improvements in the new GLaDOS speech model with Portal 2 output. So far it seems to work better than the previous one (and honestly the previous one was the gold standard for GLaDOS TTS). Great work! Thanks again.

@RealCalumPlays

RealCalumPlays commented Nov 27, 2024

Hey there,

I know this issue/discussion is a tad old, but I am really interested in doing this myself so I can apply it to other characters and learn how the whole process works. I am quite lost on how the "pipeline" is structured. Is it possible to get some direction on this? I don't want to be spoonfed the files haha, I just want some direction so I'm on the right lines 😅

Thanks!
