Works in Spanish? #789

Closed
johnfelipe opened this issue Jul 6, 2021 · 31 comments

@johnfelipe

No description provided.

@Wrongtown

There's information on other people's attempts at this in the existing issue Support for other languages (#30).

@pilnyjakub

There is no Spanish model yet.

@johnfelipe
Author

johnfelipe commented Jul 10, 2021 via email

@babysor

babysor commented Aug 9, 2021

You can check the file diff in my repo for reference. Mine works for Chinese, and I think you can make a similar modification.
https://github.com/babysor/Realtime-Voice-Clone-Chinese

ghost closed this as completed Aug 25, 2021
ghost mentioned this issue Oct 8, 2021
@AlexSteveChungAlvarez

Trying to train the synthesizer with the tux-100h (valid) dataset and the cv-corpus-7.0-2021 (validated) dataset gives me the following:
[screenshot: preprocessing output for tux-100h]
The screenshot shows the message for tux-100h, but the same appears for cv-corpus.
Both datasets are already structured as described in previous issues:
[screenshot: dataset directory layout]
I don't know why the audio files are not being recognised; I think that is what is happening here.

@ghost

ghost commented Nov 10, 2021

For audios to be detected, the directory structure must match this exactly, including the "speaker" and "book_dir" levels: #437 (comment)
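
Roughly, that means a layout like this (a sketch based on that comment and on what the preprocessing script expects; speaker_001, book_001 and the utterance names are placeholders, only the datasets_root/LibriTTS/train-clean-100 levels and the speaker/book nesting matter):

datasets_root/
  LibriTTS/
    train-clean-100/
      speaker_001/
        book_001/
          utterance_001.wav
          utterance_001.txt   (transcript with the same stem, needed with --no_alignments)
          utterance_002.wav
          utterance_002.txt
      speaker_002/
        book_001/
          ...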

@AlexSteveChungAlvarez

With the same names, too?

@ghost

ghost commented Nov 10, 2021

It is not necessary to use the same names, except for datasets_root, LibriTTS and train-clean-100 if you are using the preprocessing command that I give.

However, please try matching the names before reporting a problem or asking for help troubleshooting an issue like this.

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Nov 12, 2021

I tried it both with the datasets' own names and with the example names you gave:
[screenshot of the dataset directory with the example names]

However, I got some errors:

  1. With tux-100h (https://discourse.mozilla.org/t/sharing-my-100h-of-single-speaker-spanish/45288):
python synthesizer_preprocess_audio.py datasets_root -n 6 --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments --no_trim

D:\tesis2\Real-Time-Voice-Cloning\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
Arguments:
    datasets_root:   datasets_root
    out_dir:         datasets_root\SV2TTS\synthesizer2
    n_processes:     6
    skip_existing:   False
    hparams:
    no_alignments:   True
    datasets_name:   LibriTTS
    subfolders:      train-clean-100

Using data from:
    datasets_root\LibriTTS\train-clean-100
LibriTTS:   0%|          | 0/1 [00:00<?, ?speakers/s]
D:\tesis2\Real-Time-Voice-Cloning\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
  [the same warning is repeated by each worker process]
LibriTTS:   0%|          | 0/1 [00:02<?, ?speakers/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "D:\Programas\Python3.7\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 76, in preprocess_speaker
    assert text_fpath.exists()
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_audio.py", line 59, in <module>
    preprocess_dataset(**vars(args))
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 35, in preprocess_dataset
    for speaker_metadata in tqdm(job, datasets_name, len(speaker_dirs), unit="speakers"):
  File "D:\tesis2\voiceclonenv\lib\site-packages\tqdm\std.py", line 1180, in __iter__
    for obj in iterable:
  File "D:\Programas\Python3.7\lib\multiprocessing\pool.py", line 748, in next
    raise value
AssertionError
  2. With cv-corpus 7.0 (https://commonvoice.mozilla.org/es/datasets):
python synthesizer_preprocess_audio.py datasets_root -n 6 --datasets_name tux100h-cvcorpus --subfolders valid --no_trim --no_alignments
D:\tesis2\Real-Time-Voice-Cloning\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
Arguments:
    datasets_root:   datasets_root
    out_dir:         datasets_root\SV2TTS\synthesizer
    n_processes:     6
    skip_existing:   False
    hparams:
    no_alignments:   True
    datasets_name:   tux100h-cvcorpus
    subfolders:      valid

Using data from:
    datasets_root\tux100h-cvcorpus\valid
tux100h-cvcorpus:   0%|          | 0/1 [00:00<?, ?speakers/s]
D:\tesis2\Real-Time-Voice-Cloning\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
  [the same warning is repeated by each worker process]
D:\tesis2\voiceclonenv\lib\site-packages\librosa\core\audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
  [repeated]
tux100h-cvcorpus:   0%|                                                                                                     | 0/1 [50:24<?, ?speakers/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "D:\Programas\Python3.7\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 78, in preprocess_speaker
    text = "".join([line for line in text_file])
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 78, in <listcomp>
    text = "".join([line for line in text_file])
  File "D:\Programas\Python3.7\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 99: character maps to <undefined>
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_audio.py", line 59, in <module>
    preprocess_dataset(**vars(args))
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 35, in preprocess_dataset
    for speaker_metadata in tqdm(job, datasets_name, len(speaker_dirs), unit="speakers"):
  File "D:\tesis2\voiceclonenv\lib\site-packages\tqdm\std.py", line 1180, in __iter__
    for obj in iterable:
  File "D:\Programas\Python3.7\lib\multiprocessing\pool.py", line 748, in next
    raise value
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 99: character maps to <undefined>

With cv-corpus it processed some of the files (1,625 of 271,010), but then stopped (after at least 30 minutes) and displayed that UnicodeDecodeError.

@pilnyjakub

pilnyjakub commented Nov 12, 2021

For the 1st issue: #841 (comment)

For the 2nd issue: in synthesizer/preprocess.py and synthesizer/train.py, wherever a file is opened, use the same encoding as your dataset, e.g. encoding="utf-8".
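
For example, a minimal sketch of that change (the path is hypothetical and the surrounding code in preprocess.py may differ slightly; the point is the explicit encoding argument instead of the Windows default cp1252):

from pathlib import Path

text_fpath = Path(r"datasets_root\tux100h-cvcorpus\valid\speaker\book\utterance.txt")  # hypothetical path
with text_fpath.open("r", encoding="utf-8") as text_file:  # explicit encoding avoids the cp1252 UnicodeDecodeError
    text = "".join(line for line in text_file)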

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Nov 12, 2021

Thank you @pilnyjakub for the fast response; I turned my laptop back on as soon as I saw it. I figured out that the file names in the tux-100h dataset are just numbers, so I renamed the first txt and wav files from "0" to "audio-0" and now it is running. I hope it goes well until the end.
For the 2nd one I was thinking of that too; I will have to search the code for wherever it opens files. I will post updates later. Good night.
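
In case it helps anyone else, a hypothetical helper along these lines can do that rename in bulk (the directory path and the "audio-" prefix are assumptions, not part of the dataset):

from pathlib import Path

# Prepend "audio-" to every purely numeric .wav/.txt file name under the dataset folder.
dataset_dir = Path(r"datasets_root\LibriTTS\train-clean-100")  # hypothetical location of tux-100h
for f in dataset_dir.rglob("*"):
    if f.suffix in (".wav", ".txt") and f.stem.isdigit():
        f.rename(f.with_name(f"audio-{f.stem}{f.suffix}"))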

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Nov 18, 2021

I have been training the synthesizer model on the tux100h dataset since Monday, for approximately 32-50 hours. A checkpoint was saved at 50k steps, and I stopped training at 57k steps with a loss of approximately 0.21-0.24. However, when I try to clone my voice with that model in demo_cli.py, the output does sound like a human, but like a voice from the dataset rather than mine, which is the input. Any recommendations? I am using the pre-trained encoder and vocoder models provided in the repo.

@AlexSteveChungAlvarez

I found out that tux100h has the same voice, or a very similar one, in all audios. That may be the problem. So I started preprocessing the cv-corpus dataset, which has multiple speakers, but I got the following error:

(voiceclonenv) D:\tesis2\Real-Time-Voice-Cloning>python synthesizer_preprocess_audio.py datasets_root -n 6 -s --no_trim --no_alignments --datasets_name tux100h-cvcorpus --subfolders valid
D:\tesis2\Real-Time-Voice-Cloning\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
Arguments:
    datasets_root:   datasets_root
    out_dir:         datasets_root\SV2TTS\synthesizer
    n_processes:     6
    skip_existing:   True
    hparams:
    no_alignments:   True
    datasets_name:   tux100h-cvcorpus
    subfolders:      valid

Using data from:
    datasets_root\tux100h-cvcorpus\valid
tux100h-cvcorpus:   0%|          | 0/1 [00:00<?, ?speakers/s]
D:\tesis2\Real-Time-Voice-Cloning\encoder\audio.py:13: UserWarning: Unable to import 'webrtcvad'. This package enables noise removal and is recommended.
  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
  [the same warning is repeated by each worker process]
D:\tesis2\voiceclonenv\lib\site-packages\librosa\core\audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
  [repeated]
D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py:72: RuntimeWarning: invalid value encountered in true_divide
  wav = wav / np.abs(wav).max() * hparams.rescaling_max
tux100h-cvcorpus:   0%|                                                                                                 | 0/1 [44:54:59<?, ?speakers/s]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "D:\Programas\Python3.7\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 88, in preprocess_speaker
    skip_existing, hparams))
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 219, in process_utterance
    mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\audio.py", line 60, in melspectrogram
    D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\audio.py", line 121, in _stft
    return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
  File "D:\tesis2\voiceclonenv\lib\site-packages\librosa\core\spectrum.py", line 217, in stft
    util.valid_audio(y)
  File "D:\tesis2\voiceclonenv\lib\site-packages\librosa\util\utils.py", line 310, in valid_audio
    raise ParameterError("Audio buffer is not finite everywhere")
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_audio.py", line 59, in <module>
    preprocess_dataset(**vars(args))
  File "D:\tesis2\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 35, in preprocess_dataset
    for speaker_metadata in tqdm(job, datasets_name, len(speaker_dirs), unit="speakers"):
  File "D:\tesis2\voiceclonenv\lib\site-packages\tqdm\std.py", line 1180, in __iter__
    for obj in iterable:
  File "D:\Programas\Python3.7\lib\multiprocessing\pool.py", line 748, in next
    raise value
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere
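
I suspect the RuntimeWarning above ("invalid value encountered in true_divide") is related: for a silent clip, np.abs(wav).max() is 0, the division produces NaNs, and librosa then rejects the buffer as not finite. A minimal sketch of a guard that could wrap the rescaling step (this is just a guess on my part, not the project's code):

import numpy as np

def safe_rescale(wav: np.ndarray, rescaling_max: float):
    """Skip silent or non-finite clips instead of letting 0/0 produce NaNs."""
    peak = np.abs(wav).max()
    if peak == 0 or not np.isfinite(peak):
        return None  # caller should skip this utterance
    return wav / peak * rescaling_max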

@AlexSteveChungAlvarez

Please, I need some help here. With the cv-corpus dataset I tried using less data and got it to preprocess and train, but the results were very bad.
By the way, the whole process from preprocessing to training took about a week each time I tried a different dataset or dataset subsample. I am using a laptop with an NVIDIA GeForce RTX 2060 GPU, so I don't think it should be that slow. Any help? I am considering switching to a different dataset, since neither cv-corpus nor tux100h gave me results comparable to the original release of the project, but I don't know which one to choose and I only have one month left to finish this.

@johnfelipe
Author

johnfelipe commented Dec 2, 2021 via email

@AlexSteveChungAlvarez

Thank you very much @johnfelipe, I sent you an email.

@AlexSteveChungAlvarez

I trained the synthesizer with this dataset: http://openslr.org/73/.
The models obtained up to 50k steps are here: https://drive.google.com/drive/folders/1pYc0YK6YfdikMONkR-29054_uMxTgy_g?usp=sharing. However, the results are not even close to the target voice to clone. Any suggestions?
It does sound like a human, but not like the target.

@raul-parada

Hi. Any update in here? Thanks.

@AlexSteveChungAlvarez

Hi. Any update in here? Thanks.

https://github.com/AlexSteveChungAlvarez/Real-Time-Voice-Cloning-Spanish

@raul-parada

Thanks, @AlexSteveChungAlvarez. I have some questions: 1) where should I place my datasets? 2) is the synthesis phase included when running demo_toolbox.py? 3) is there a maximum length of text to synthesize? If you have a tutorial or documentation besides the paper, it is welcome. Thank you again.

@AlexSteveChungAlvarez

If you want to train your own model, you should follow the instructions given in this repo. The toolbox does the synthesis when you click the button it has to generate voice (there's a video in this repo explaining that). There isn't a maximum length of text to synthesize, but the recommendation is to synthesize an audio of a length similar to the original (otherwise the output sometimes contains odd silences or noise). If you train your own model, please share it with me, since my college team is now working on a web interface to calculate the MOS of the Spanish models shared by the community!

@raul-parada

Sure! I'll share my model with you. I would like to generate synthesized audios of approximately 10 minutes; given that, do you recommend training with audio files of around 10 minutes in length? Or which is, in your experience, the most efficient length for excellent quality?

@AlexSteveChungAlvarez

There is no need to train with audio files of the length you want to produce; just use a reference (target) audio of that length to clone and it will work.

@johnfelipe
Author

johnfelipe commented Jul 30, 2022 via email

@raul-parada

Another question: I don't have an NVIDIA GPU; can I use the CPU instead?

@AlexSteveChungAlvarez

Yes, you can use your CPU.
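
It will just be slower. PyTorch falls back to the CPU automatically when no CUDA device is available; a quick way to check which device will be used (newer copies of the repo also accept a --cpu flag on the demo scripts, if yours has it):

import torch

# Prints "cuda" when an NVIDIA GPU is usable, otherwise "cpu".
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)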

@raul-parada

I've followed this video and was able to run the program. However:
1. It only offers the default encoder, synthesizer and vocoder. I copied the latest versions of these folders into the Git checkout (which already had a saved_models folder).
2. I've already run the program 4 times and the voice still cannot talk properly. How many times do you think I should run it? The wav file has a length of 23 minutes.

I attach a screenshot of the program:
[screenshot of the toolbox]

I just added the wav file and clicked Synthesize and vocode. Is this the procedure so far?

@raul-parada

Now I've used the code from the Spanish version and I can see the pretrained options. However, I get this error:
[screenshot of the error]

Any clue?

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Aug 2, 2022

I also get the same error when I use one model first and then switch to the other. To get around it, I just downloaded everything again and made sure to run the correct model from the beginning, so whenever I want to use a model, I run it from its own folder (I have 2-3 different folders for running different models). I don't know what the correct way to fix that error would be.

@raul-parada

Thanks, @AlexSteveChungAlvarez, I solved it by changing the type of synthesizer. I'm confused right now: I've used the pretrained models with an audio file of 23 minutes, yet I cannot correctly synthesize my voice on simple sentences in Spanish. What am I doing wrong? Should I train a specific model with my own voice?

@AlexSteveChungAlvarez

I haven't tried with an audio of that length, since my goal in using this code was to clone voices from short speech audios and only a few reference audios of the people I wanted to clone. When you have access to more audio from the person you want to clone, you will get better results by fine-tuning the model (there is a guide somewhere in the repo on how to do this). I haven't done it myself, but yes, you would need to train the model on your own voice by applying transfer learning to the pre-trained model. I think that would be the better solution in your case (if you have many audios of the target voice). If not, try sharing the steps you follow to clone with your 23-minute audio; sometimes writing a period (".") at the end of each sentence makes the synthesis go wrong, so try using commas instead of periods!

This issue was closed.