[Bug] tts.tts_with_vc_to_file cannot use cpu #3797

Open
pieris98 opened this issue Jun 21, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@pieris98

Describe the bug

Similar to #3787, but when running the xtts_v2 model with voice cloning (which loads the FreeVC voice conversion model), using device='cpu' still results in the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)` 

To Reproduce

import torch
from TTS.api import TTS

device = "cpu"
print(device)

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_with_vc_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)

Expected behavior

The inference should run without using CUDA or reporting any CUDA/cuDNN/GPU-related errors.
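One way to check this expectation (my own sketch using standard PyTorch introspection, not part of the original report): after running the reproduction snippet with device="cpu", PyTorch should never have created a CUDA context.

import torch

# Run this after the reproduction snippet above; if inference really
# stayed on the CPU, no CUDA context was initialized:
print(torch.cuda.is_initialized())  # expected: False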

Logs

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[9], line 12
      7 # !tts --text "hello world" \
      8 # --model_name "tts_models/en/ljspeech/glow-tts" \
      9 # --out_path output.wav
     11 tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
---> 12 tts.tts_with_vc_to_file(text="Hello world!", speaker='Andrew Chipper',speaker_wav="/home/cherry/dev/coqui/steve_taylor.wav", language="en",file_path="/home/cherry/dev/coqui/outputs/xttsv2_en_output.wav")

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:455, in TTS.tts_with_vc_to_file(self, text, language, speaker_wav, file_path, speaker, split_sentences)
    423 def tts_with_vc_to_file(
    424     self,
    425     text: str,
   (...)
    430     split_sentences: bool = True,
    431 ):
    432     """Convert text to speech with voice conversion and save to file.
    433 
    434     Check `tts_with_vc` for more details.
   (...)
    453             applicable to the 🐸TTS models. Defaults to True.
    454     """
--> 455     wav = self.tts_with_vc(
    456         text=text, language=language, speaker_wav=speaker_wav, speaker=speaker, split_sentences=split_sentences
    457     )
    458     save_wav(wav=wav, path=file_path, sample_rate=self.voice_converter.vc_config.audio.output_sample_rate)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:419, in TTS.tts_with_vc(self, text, language, speaker_wav, speaker, split_sentences)
    415     self.tts_to_file(
    416         text=text, speaker=speaker, language=language, file_path=fp.name, split_sentences=split_sentences
    417     )
    418 if self.voice_converter is None:
--> 419     self.load_vc_model_by_name("voice_conversion_models/multilingual/vctk/freevc24")
    420 wav = self.voice_converter.voice_conversion(source_wav=fp.name, target_wav=speaker_wav)
    421 return wav

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:157, in TTS.load_vc_model_by_name(self, model_name, gpu)
    155 self.model_name = model_name
    156 model_path, config_path, _, _, _ = self.download_model_by_name(model_name)
--> 157 self.voice_converter = Synthesizer(vc_checkpoint=model_path, vc_config=config_path, use_cuda=gpu)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/utils/synthesizer.py:101, in Synthesizer.__init__(self, tts_checkpoint, tts_config_path, tts_speakers_file, tts_languages_file, vocoder_checkpoint, vocoder_config, encoder_checkpoint, encoder_config, vc_checkpoint, vc_config, model_dir, voice_dir, use_cuda)
     98     self.output_sample_rate = self.vocoder_config.audio["sample_rate"]
    100 if vc_checkpoint:
--> 101     self._load_vc(vc_checkpoint, vc_config, use_cuda)
    102     self.output_sample_rate = self.vc_config.audio["output_sample_rate"]
    104 if model_dir:

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/utils/synthesizer.py:139, in Synthesizer._load_vc(self, vc_checkpoint, vc_config_path, use_cuda)
    137 # pylint: disable=global-statement
    138 self.vc_config = load_config(vc_config_path)
--> 139 self.vc_model = setup_vc_model(config=self.vc_config)
    140 self.vc_model.load_checkpoint(self.vc_config, vc_checkpoint)
    141 if use_cuda:

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/__init__.py:16, in setup_model(config, samples)
     14 if "model" in config and config["model"].lower() == "freevc":
     15     MyModel = importlib.import_module("TTS.vc.models.freevc").FreeVC
---> 16     model = MyModel.init_from_config(config, samples)
     17 return model

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:552, in FreeVC.init_from_config(config, samples, verbose)
    550 @staticmethod
    551 def init_from_config(config: FreeVCConfig, samples: Union[List[List], List[Dict]] = None, verbose=True):
--> 552     model = FreeVC(config)
    553     return model

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:370, in FreeVC.__init__(self, config, speaker_manager)
    368     self.enc_spk = SpeakerEncoder(model_hidden_size=self.gin_channels, model_embedding_size=self.gin_channels)
    369 else:
--> 370     self.load_pretrained_speaker_encoder()
    372 self.wavlm = get_wavlm()

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:381, in FreeVC.load_pretrained_speaker_encoder(self)
    379 """Load pretrained speaker encoder model as mentioned in the paper."""
    380 print(" > Loading pretrained speaker encoder model ...")
--> 381 self.enc_spk_ex = SpeakerEncoderEx(
    382     "https://github.com/coqui-ai/TTS/releases/download/v0.13.0_models/speaker_encoder.pt"
    383 )

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/modules/freevc/speaker_encoder/speaker_encoder.py:45, in SpeakerEncoder.__init__(self, weights_fpath, device, verbose)
     42 checkpoint = load_fsspec(weights_fpath, map_location="cpu")
     44 self.load_state_dict(checkpoint["model_state"], strict=False)
---> 45 self.to(device)
     47 if verbose:
     48     print("Loaded the voice encoder model on %s in %.2f seconds." % (device.type, timer() - start))

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/module.py:1173, in Module.to(self, *args, **kwargs)
   1170         else:
   1171             raise
-> 1173 return self._apply(convert)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/module.py:779, in Module._apply(self, fn, recurse)
    777 if recurse:
    778     for module in self.children():
--> 779         module._apply(fn)
    781 def compute_should_use_set_data(tensor, tensor_applied):
    782     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    783         # If the new tensor has compatible tensor type as the existing tensor,
    784         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    789         # global flag to let the user control whether they want the future
    790         # behavior of overwriting the existing tensor or not.

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:222, in RNNBase._apply(self, fn, recurse)
    217 ret = super()._apply(fn, recurse)
    219 # Resets _flat_weights
    220 # Note: be v. careful before removing this, as 3rd party device types
    221 # likely rely on this behavior to properly .to() modules like LSTM.
--> 222 self._init_flat_weights()
    224 return ret

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:158, in RNNBase._init_flat_weights(self)
    154 self._flat_weights = [getattr(self, wn) if hasattr(self, wn) else None
    155                       for wn in self._flat_weights_names]
    156 self._flat_weight_refs = [weakref.ref(w) if w is not None else None
    157                           for w in self._flat_weights]
--> 158 self.flatten_parameters()

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:209, in RNNBase.flatten_parameters(self)
    207 if self.proj_size > 0:
    208     num_weights += 1
--> 209 torch._cudnn_rnn_flatten_weight(
    210     self._flat_weights, num_weights,
    211     self.input_size, rnn.get_cudnn_mode(self.mode),
    212     self.hidden_size, self.proj_size, self.num_layers,
    213     self.batch_first, bool(self.bidirectional))

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3060 Laptop GPU"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.3.1+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.9.0",
        "version": "#1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)"
    }
}

Additional context

Note: Even though I do have CUDA and an NVIDIA GPU on my laptop, I want to use the CPU because my GPU's VRAM is not enough for the model I want to use.
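A possible interim workaround (my own sketch; it relies only on standard PyTorch/CUDA behavior, not on any TTS API): hide all GPUs from the process before torch is imported, so that torch.cuda.is_available() returns False and any internal device auto-detection falls back to the CPU.

import os

# Must run before torch (or TTS, which imports torch) is imported anywhere
# in this process; an already-initialized CUDA context cannot be hidden.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
tts.tts_with_vc_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)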

@pieris98 pieris98 added the bug Something isn't working label Jun 21, 2024
@eginhard
Contributor

The XTTS model natively supports voice cloning, so just use the following (and pick just one of speaker and speaker_wav, depending on which of them you need):

from TTS.api import TTS

device = "cpu"
print(device)

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)

This should run correctly on the CPU. The with_vc variant would pass the already-cloned output through an additional voice conversion model (FreeVC), but that's not necessary here and would probably lead to worse results.
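For example, the two variants separately (same tts object and placeholder paths as above):

# Zero-shot cloning from a reference recording:
tts.tts_to_file(
    text="Hello world!",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/cloned.wav",
)

# Or one of the model's built-in speakers:
tts.tts_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    language="en",
    file_path="/path/to/outputs/builtin.wav",
)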

@pieris98
Author

Hey Enno, thanks a lot for the pointer. I didn't realise that some models have voice cloning built in, rather than requiring tts.tts_with_vc_to_file().

I then tried to run the model in tts-server and hit issue #3369, so I just wanted to point it out, as it seems more important to fix in the codebase.
