---
layout: hub_detail
background-class: hub-background
body-class: hub
title: WaveGlow
summary: WaveGlow model for generating speech from mel spectrograms (generated by Tacotron2)
category: researchers
image: nvidia_logo.png
author: NVIDIA
github-id: NVIDIA/DeepLearningExamples
featured_image_1: waveglow_diagram.png
featured_image_2: no-image
accelerator: cuda
order: 10
---
The Tacotron 2 and WaveGlow models form a text-to-speech system that lets users synthesize natural-sounding speech from raw transcripts without any additional prosody information. The Tacotron 2 model (also available via torch.hub) produces mel spectrograms from input text using an encoder-decoder architecture. WaveGlow is a flow-based model that consumes the mel spectrograms to generate speech.
In the example below:
- pretrained Tacotron2 and WaveGlow models are loaded from torch.hub
- given a tensor representation of the input text ("Hello world, I missed you so much"), Tacotron2 generates a mel spectrogram as shown in the illustration
- WaveGlow generates sound from the mel spectrogram
- the output sound is saved in an 'audio.wav' file
To run the example you need some extra Python packages installed. These are needed for preprocessing the text and audio, as well as for display and input/output.

```bash
pip install numpy scipy librosa unidecode inflect
apt-get update
apt-get install -y libsndfile1
```
Load the WaveGlow model pre-trained on the LJ Speech dataset:

```python
import torch
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp32')
```
Prepare the WaveGlow model for inference:

```python
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()
```
Load a pretrained Tacotron2 model:

```python
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp32')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()
```
Now, let's make the model say:

```python
text = "hello world, I missed you so much"
```
Format the input using utility methods:

```python
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])
```
Run the chained models:

```python
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050
```
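Since WaveGlow emits raw samples at a fixed 22,050 Hz rate, the clip length follows directly from the sample count. A quick sanity check (with a synthetic array standing in for the model output):

```python
import numpy as np

rate = 22050  # WaveGlow's output sample rate in Hz
audio_numpy = np.zeros(2 * rate, dtype=np.float32)  # stand-in for the model output

# duration in seconds = number of samples / samples per second
duration_seconds = audio_numpy.shape[0] / rate
print(duration_seconds)  # → 2.0
```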
You can write it to a file and listen to it:

```python
from scipy.io.wavfile import write
write("audio.wav", rate, audio_numpy)
```
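Note that scipy's write preserves the dtype it is given, so passing float32 samples produces a 32-bit float WAV. Some players only handle 16-bit PCM; a sketch of the conversion (again using a synthetic array in place of the model output):

```python
import numpy as np
from scipy.io.wavfile import write

rate = 22050
# stand-in for audio_numpy: a 440 Hz sine tone with samples in [-1.0, 1.0]
t = np.linspace(0, 1, rate, endpoint=False)
audio_numpy = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# scale float samples into the int16 range before writing 16-bit PCM
audio_int16 = (audio_numpy * 32767).astype(np.int16)
write("audio_int16.wav", rate, audio_int16)
```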
Alternatively, play it right away in a notebook with IPython widgets:

```python
from IPython.display import Audio
Audio(audio_numpy, rate=rate)
```
For detailed information on model input and output, training recipes, inference, and performance, visit: github and/or NGC