Model Training Process #7
@SuperJonotron Unfortunately, the hard drive with all the original model checkpoints, logs, and datasets got wiped when I sent my laptop in for warranty repair :(. I used a model called Forward Tacotron V2. You actually start by training a Tacotron 2 model, and then use data from that to train the Forward model. While Tacotron 2 is quite nice, it is not very robust. Longer inputs result in word salad (you can hear this on the Uberduck TTS voices). Additionally, it is orders of magnitude slower. Listen to their Forward Tacotron V2 samples; they are extremely fluent! The process involves doing transfer learning on the Tacotron model (train on LJSpeech until fluent, then switch to the desired character's dataset), and then training the Forward model with outputs from Tacotron. Now, the one thing that might improve the verbal fluency a bit would be to transfer learn on the Forward model using LJSpeech as well, but it might sound a bit less like GLaDOS.
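To make the dataset-swap step concrete, here is a minimal PyTorch sketch of the transfer-learning idea described above (train until fluent on LJSpeech, then keep training the same weights on the character's clips). This is not the actual ForwardTacotron training code; the toy model, dataset, and checkpoint name are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def finetune(model, checkpoint_path, dataset, steps=1000, lr=1e-4):
    # Load the weights from the fluent LJSpeech run, then keep training
    # the same model on the character-specific clips.
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    model.train()
    step = 0
    while step < steps:
        for inputs, targets in loader:
            loss = nn.functional.l1_loss(model(inputs), targets)
            optim.zero_grad()
            loss.backward()
            optim.step()
            step += 1
            if step >= steps:
                break
    return model

# Toy stand-ins so the sketch runs end to end; the real models map text
# to mel spectrograms rather than 80-dim vectors to 80-dim vectors.
toy_model = nn.Linear(80, 80)
torch.save(toy_model.state_dict(), "ljspeech_stage.pt")
glados_like = TensorDataset(torch.randn(64, 80), torch.randn(64, 80))
finetune(toy_model, "ljspeech_stage.pt", glados_like, steps=50)
```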
A much more pressing issue is the vocoder. As you've seen through testing, the vocoder causes the majority of the slowdown, especially with longer inputs. The "vocoder-cpu-lq.pt" model is fast enough, but sounds terrible. All of the vocoders use HiFiGAN, with the HQ one using the v1 variant and the LQ one using v3. The creators of HiFiGAN only released a checkpoint with discriminator weights for v1, so I had to train v3 from scratch. You're supposed to train these for millions of steps, which I didn't have time for considering I was training on my laptop 😅. A proper HiFiGAN v3 model would definitely help a lot. I did find someone on the HiFiGAN repo who swapped in the (much faster) training profile from a different vocoder and still got good results, converging much faster.
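Since the vocoder dominates the runtime, a quick way to compare checkpoints is to time a forward pass over a dummy mel spectrogram. This is a rough sketch, assuming the .pt vocoder files are TorchScript modules that accept an 80-bin mel of shape (1, 80, frames); the file names below are just examples, not necessarily the exact names shipped with the repo.

```python
import time
import torch

def time_vocoder(path, mel_frames=600, runs=5):
    # Load a TorchScript vocoder checkpoint and average several timed passes.
    vocoder = torch.jit.load(path, map_location="cpu")
    vocoder.eval()
    mel = torch.randn(1, 80, mel_frames)      # dummy 80-bin mel spectrogram
    with torch.no_grad():
        vocoder(mel)                           # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            vocoder(mel)
    return (time.perf_counter() - start) / runs

for name in ("vocoder-cpu-hq.pt", "vocoder-cpu-lq.pt"):
    print(name, f"{time_vocoder(name):.2f} s per pass")
```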
For somebody (me) who doesn't have any experience with training models, that's a lot to process. Looking at that project a bit, I feel like I could train on the initial LJSpeech dataset and then be pretty lost after that. For example, train_forward has a Python script but not much else to go on if you're not already familiar with what's needed to actually train a model this way. From what I understand, you created such a model from some other data, the 600 samples, and then fed it into this process. That created some intermediary model that you then did some more work on to get the final result, but that whole process, and everything about training further with all these various inputs... no idea. I'd be up for using some more robust hardware to do further training to make it better, or just recreating the entire model from scratch so this project can live on indefinitely. Just not really sure how to code the process to make that happen.
Reading all the info and docs in these repos, here's what I've got so far:
Does this seem like the correct order of operations? Any chance you could fill in the gaps? As for the comments about how to make it better, I'd be happy to attempt it; I'd just need a bit more detail on what would actually need to happen to get it done.
The main bummer is that I lost the SSD that had all my source files due to my laptop frying. I no longer have the dataset I used, or all of the checkpoints I backed up during training. I started off with the Ellen McLain dataset but removed all non-Portal 2 files and added punctuation (VERY IMPORTANT). The files you want are in the LJSpeech folder of that repository; the other one has preprocessed Mel spectrograms. All you do to swap datasets is manually switch them when the voice sounds relatively fluent. There is no "method" of doing this; it is very manual. The ForwardTacotron model only takes the GLaDOS clips. You want to use the pretrained HiFiGAN model weights and first transfer them onto the GLaDOS source files and Mels, then use teacher forcing to generate Mels from the ForwardTacotron model to finetune it. This happens relatively quickly for the larger HiFiGAN variants, but for the smaller ones you have to train from scratch (most likely with full LJSpeech instead of GLaDOS first) due to the lack of checkpoints. I did make one such model, but it sounds terrible. It needs more training time. Due to the sinusoidal nature of GLaDOS's voice, part of me wonders if a fluent hardcoded vocoder might be possible...
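For reference, the LJSpeech-style layout mentioned above is just a folder of wav clips plus a pipe-separated metadata file of `file_id|transcription` lines, with punctuation kept in the text. A small sketch of assembling that file; the wav folder name, example IDs, and transcription dict are placeholders:

```python
from pathlib import Path

# Folder of wav clips plus a pipe-separated metadata file, LJSpeech style.
# The example transcriptions below are placeholders for the real lines.
wav_dir = Path("wavs")
transcripts = {
    "glados_0001": "Oh. It's you.",
    "glados_0002": "It's been a long time. How have you been?",
}

with open("metadata.csv", "w", encoding="utf-8") as meta:
    for wav in sorted(wav_dir.glob("*.wav")):
        text = transcripts.get(wav.stem)
        if text:                               # skip clips with no transcription
            meta.write(f"{wav.stem}|{text}\n")
```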
I started a fork of the Ellen McLain dataset that I'm slowly going through and correcting the punctuation, after which I will split off a branch to remove all the Portal 1 lines. Then I'm going to try training using various methods. ForwardTacotron just merged the multispeaker branch; maybe I'll give that a whirl.
Okay, so I got smart and found a webpage with a transcription, ran it through a regex to format it, then ran it through a super basic PowerShell script to separate it into two datasets: DieKatzchen/Ellen-McLain-Dataset
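For anyone who would rather not use PowerShell, here is a hedged Python equivalent of that split, assuming each formatted line is already `file_id|text` and the game can be told apart by a file-name prefix. The `all_lines.txt` name and the `p1_` prefix are assumed conventions, not necessarily what the linked dataset uses.

```python
def split_dataset(combined="all_lines.txt"):
    # Route each "file_id|text" line into a Portal 1 or Portal 2 metadata
    # file based on an assumed file-name prefix convention.
    with open(combined, encoding="utf-8") as src, \
         open("portal1_metadata.csv", "w", encoding="utf-8") as p1, \
         open("portal2_metadata.csv", "w", encoding="utf-8") as p2:
        for line in src:
            if not line.strip():
                continue
            file_id = line.split("|", 1)[0]
            (p1 if file_id.startswith("p1_") else p2).write(line)

split_dataset()
```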
I have created a new version of the model with a new dataset and training methodology. This time it was much simpler: I just trained a multispeaker ForwardTacotron model with the Portal 1 lines, Portal 2 lines, and the entire LJSpeech dataset as the three speakers. The dataset itself is quite large, but I can share the text file of my manual transcription of the lines if you'd like.
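A hypothetical sketch of what merging the three corpora into one multispeaker metadata file could look like, with each clip tagged by speaker. The exact column layout ForwardTacotron's multispeaker branch expects may differ, so treat this as illustrative and check its preprocessing code for the real format; the file names and speaker labels are placeholders.

```python
# Merge three corpora into one metadata file with a speaker tag per line.
sources = {
    "glados_p1": "portal1_metadata.csv",
    "glados_p2": "portal2_metadata.csv",
    "ljspeech":  "ljspeech_metadata.csv",
}

with open("metadata_multispeaker.csv", "w", encoding="utf-8") as out:
    for speaker, path in sources.items():
        with open(path, encoding="utf-8") as src:
            for line in src:
                line = line.strip()
                if not line:
                    continue
                file_id, text = line.split("|", 1)
                out.write(f"{file_id}|{text}|{speaker}\n")
```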
Could you go into some detail on how exactly you ran the training?
I'm not at my desktop at the moment, but yes, the new version uses the multispeaker version of ForwardTacotron. This is a much more stable solution than transfer learning because it mitigates catastrophic forgetting and gives the model a lot more to reference when training. I'll gladly share the text file for the dataset once I'm back at my desktop; the dataset itself is a bit big, though, and I don't really want to host it on Google Drive or something. Generally, if you're trying to train a model to speak like some other character, I would definitely recommend this process, although for a male character I would suggest using a dataset with a male speaker as the reference. Maybe some subset of VCTK or something.
I would also be interested in more details on how this training is accomplished. I have struggled to follow along with the current documentation available.
First off, thanks for taking the time to reply. What you're describing is an interesting idea I hadn't tried yet. I think the VCTK dataset does something similar; I just hadn't put two and two together. I'd still be interested in the text file of the dataset. I'll try the default configurations with multispeaker ForwardTacotron and see how that goes. Also, great job on the improvements in the new GLaDOS speech model with the Portal 2 output. So far it seems to work better than the previous one (and honestly, the previous one was the gold standard for GLaDOS TTS). Great work! Thanks again.
Hey there, I know this issue/discussion is a tad old, but I'm really interested in doing this myself so I can apply it to other characters and learn how the whole process works. I'm quite lost on how the "pipeline" is structured. Is it possible to get some direction on this? I don't want to be spoon-fed the files haha, just want some direction so I'm on the right lines 😅 Thanks!
Could you publish the code used to train the model, perhaps without the actual dataset if it's too large? Since it looks like there's a Tacotron 2 available that might produce a better model if retrained, I'd be interested in seeing what that process looks like and even trying to retrain against the newer models.