[Feature request] pronunciation, cadence and nuances in XTTS v2... #3764

Open
0wwafa opened this issue May 29, 2024 · 3 comments
Labels
feature request feature requests for making TTS better.

Comments

0wwafa commented May 29, 2024

Hello!
I have used XTTS v2 for a while and made great voices.
I wish to know one thing:
every voice I make, when it "speaks", has the same cadence and pronunciation (clearly from the trained model).
How could I get those from the speaker as well?
I mean, to really clone a voice, you need not only the frequencies but also the nuances.
Can you please post an example or, even better, add the feature directly to XTTS v2?
That way one could choose between a standard voice, a "speaker" voice, or a speaker voice with "nuance".
That would be great!
Thanks.

@0wwafa 0wwafa added the feature request feature requests for making TTS better. label May 29, 2024

0wwafa commented Jun 8, 2024

How can I do this manually? Can anybody help?

Aphexus commented Jun 18, 2024

I don't think this is entirely true. I put in some text containing something like "in the butt, yeah in the butt!" and, in the last part, it raised the pitch of "yeah" and spoke it with more excitement, so it felt like an exclamation (as if it took the "!" into account).

So there are some nuances. Maybe there should be some way to modify the speech a bit with "special tokens" that can raise or lower the pitch, increase the speed, or whatever. I think that for this to work, someone would have to categorize a training set that way, otherwise it likely won't feel natural.
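
To make the "special tokens" idea a bit more concrete, here is a minimal sketch of how marked-up text could be split into segments that each carry their own pitch and speed settings. The tag syntax (`[pitch=+2]`, `[speed=1.2]`), the `Segment` class, and `parse_prosody_tags` are all hypothetical and made up for illustration; nothing in this thread suggests XTTS v2 understands such tags. The sketch only covers the text side, not the synthesis.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical tag syntax, e.g. "[pitch=+2]" or "[speed=1.2]".
# This is an illustration only, not an XTTS feature.
TAG_RE = re.compile(r"\[(pitch|speed)=([+-]?\d+(?:\.\d+)?)\]")


@dataclass
class Segment:
    text: str
    pitch: float = 0.0  # pitch shift in semitones (assumed unit)
    speed: float = 1.0  # speaking-rate multiplier (assumed unit)


def parse_prosody_tags(marked_up: str) -> List[Segment]:
    """Split text at tags; each tag changes the settings for the text after it."""
    segments = []
    pitch, speed = 0.0, 1.0
    pos = 0
    for match in TAG_RE.finditer(marked_up):
        # Text before this tag keeps the settings that were active so far.
        chunk = marked_up[pos:match.start()].strip()
        if chunk:
            segments.append(Segment(chunk, pitch, speed))
        name, value = match.group(1), float(match.group(2))
        if name == "pitch":
            pitch = value
        else:
            speed = value
        pos = match.end()
    # Remaining text after the last tag.
    tail = marked_up[pos:].strip()
    if tail:
        segments.append(Segment(tail, pitch, speed))
    return segments


if __name__ == "__main__":
    for seg in parse_prosody_tags("Hello there. [pitch=+2][speed=1.2]Yeah, great!"):
        print(seg)
```

Each segment could then be synthesized separately, with the pitch and speed applied by the model or in post-processing. As noted above, this would probably not sound natural unless the model was trained on data labeled this way.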

0wwafa commented Jun 18, 2024

@Aphexus lol, yes, there are, but they are not the same as the speaker's.

For example:

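from TTS.api import TTS

# Load the XTTS v2 model on the CPU
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
# Read the input text, joining lines with spaces so words do not run together
with open("text.txt", "r", encoding="utf-8") as f:
    t = f.read().replace("\n", " ")
tts.tts_to_file(text=t, speaker_wav=["./speaker1.wav", "./speaker2.wav"], language="en", file_path="test.wav")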

No matter how long the samples are or how many there are, the intonation is not like the original, even if the voice is similar.
