Add voices from Super Dialogue Audio Pack #425
Attached is a sample of the output of some of the voices.

Newest samples: Here is a demonstration of all the voices, as well as a failure of the "angry" voice to achieve consistency. I am puzzled, because earlier the same "angry" voice was quite consistent like the rest.

Old samples:

Example of a voice: ideal for normal speech (sdap1.zip)

Example of a voice including non-speech vocalizations: not ideal, includes clips with weird vocalizations (sdap2.zip)

Example of a curated "angry" voice: good demonstration of the ability to mix the clips in different ways to evoke emotions (sdap3.zip)
Hey - thanks for putting this together. Normally I wouldn't recommend more than 3 conditioning clips per voice. Do you find the model performs better with all these clips? Would you mind cutting it down a bit to keep the repo size in check?
I did not do extensive testing with reduced numbers of these particular clips. They immediately worked so well, particularly the speech-only ones, that I left it as is with all the clips. The performance does not seem to be harmed by so many clips.

When I compare the output from these voices to other voices I have put together myself with fewer clips, and even some of the training voices, these voices perform much more consistently. The quality is high, for sure, but the consistency is what is nice.

For the PR, I could remove the Each variation of the voice (

Maybe what I will do is put the "extra" audio clips in a separate folder alongside or within the voices folder, so users can use them if they wish. Does the script scan subfolders for audio clips, too? I could have the extra clips in a subfolder with a text file with instructions.

The reason for the redundant license files was so that if someone shared just a single voice, it would retain the license info.

Let me know if you have any thoughts. I'll see about the subfolder thing.
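Whether extra clips in a subfolder get picked up depends on whether the loader lists only the top level of a voice folder or walks it recursively. The following is a minimal sketch of that difference, not the project's actual loading code; `list_clips` and `AUDIO_EXTENSIONS` are hypothetical names, and the set of extensions is an assumption.

```python
from pathlib import Path

# Assumed set of clip formats; adjust to whatever the loader really accepts.
AUDIO_EXTENSIONS = {".wav", ".mp3"}


def list_clips(voice_dir, recursive=False):
    """Return sorted audio clip paths in voice_dir.

    With recursive=False only the top level is listed, so clips placed
    in a subfolder are ignored. With recursive=True subfolders are
    walked as well.
    """
    root = Path(voice_dir)
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

If the loader behaves like the non-recursive case, an "extra" subfolder would be a safe place to park additional clips without changing the voice.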
https://dillonbecker.itch.io/sdap Attribution 4.0 International (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
I removed all binary file redundancy. Now there are only redundant markdown files with license and instructions for constructing subset voices.
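Constructing a subset voice amounts to copying a chosen handful of clips into a new voice folder. The following is a rough sketch of that step, assuming a voice is simply a folder of audio files; `make_subset_voice` and the folder and clip names in the usage note are hypothetical and do not come from the pack's instructions.

```python
import shutil
from pathlib import Path


def make_subset_voice(source_dir, target_dir, clip_names):
    """Copy the named clips from source_dir into a new voice folder.

    Raises FileNotFoundError before copying anything if a requested
    clip does not exist, so a partial voice is never created.
    """
    source = Path(source_dir)
    target = Path(target_dir)
    missing = [name for name in clip_names if not (source / name).is_file()]
    if missing:
        raise FileNotFoundError(f"clips not found in {source}: {missing}")
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in clip_names:
        # copy2 also preserves file timestamps where the platform allows it.
        copied.append(Path(shutil.copy2(source / name, target / name)))
    return copied
```

For example, `make_subset_voice("voices/full_voice", "voices/angry_subset", ["clip_01.wav", "clip_07.wav"])` would create a two-clip voice, with all of those paths being placeholders.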
I just did a test with an "angry" subset voice, and the results are way less consistent. The voice completely changes. So it does appear that the sheer volume of clips contributes to consistency.
By the way, I have no problem if you decide not to include this in your repo. It requires no maintenance, so on my end I can easily patch this on top of any changes you make. I just wanted to share it with anyone who might want it. It's open source and I had already done all the work, so I figured I might as well give you the choice to include it or not.

After doing a side-by-side test of these new voices against the tortoise default training voices, I must concede that the training voices are overall much better. So the purpose of these massive voices is not very clear.
It's interesting: when I use the "angry" subset of clips for a voice, the consistency is much lower overall, with random female voices and other voices popping up in clips. However, when the prompt is actually words that seem like something an angry person would say, the consistency is much greater and all candidate results are consistent.
I noticed something in your samples that I got too: a strange yelp-groan at the end of the passage. I don't know if it's misinterpreted emotional emphasis or what.
These voices perform very well. They can be arranged in different ways to evoke certain emotions. Example of angry emotion with one of the voices.