[Jul 2024 (v2.1)] BigVGAN is now integrated with 🤗 Hugging Face Hub with easy access to inference using pretrained checkpoints. We also provide an interactive demo on Hugging Face Spaces.
[Jul 2024 (v2)] We release BigVGAN-v2 along with pretrained checkpoints. Below are the highlights:
* Custom CUDA kernel for inference: we provide a fused upsampling + activation kernel written in CUDA for accelerated inference speed. Our test shows 1.5 - 3x faster speed on a single A100 GPU.
* Improved discriminator and loss: BigVGAN-v2 is trained using a [multi-scale sub-band CQT discriminator](https://arxiv.org/abs/2311.14957) and a [multi-scale mel spectrogram loss](https://arxiv.org/abs/2306.06546).
* Larger training data: BigVGAN-v2 is trained using datasets containing diverse audio types, including speech in multiple languages, environmental sounds, and instruments.
To set up the repository and install its dependencies:

```shell
cd BigVGAN
pip install -r requirements.txt
```
## Inference Quickstart using 🤗 Hugging Face Hub
The example below shows how to use BigVGAN: load the pretrained BigVGAN generator from Hugging Face Hub, compute a mel spectrogram from an input waveform, and generate a synthesized waveform using the mel spectrogram as the model's input.
```python
device = 'cuda'
import torch
import bigvgan
import librosa
from meldataset import get_mel_spectrogram
# instantiate the model. You can optionally set use_cuda_kernel=True for faster inference.
model = bigvgan.BigVGAN.from_pretrained('nvidia/bigvgan_v2_24khz_100band_256x', use_cuda_kernel=False)
# remove weight norm in the model and set to eval mode
model.remove_weight_norm()
model = model.eval().to(device)
# load wav file and compute mel spectrogram
wav, sr = librosa.load('/path/to/your/audio.wav', sr=model.h.sampling_rate, mono=True) # wav is np.ndarray with shape [T_time] and values in [-1, 1]
wav = torch.FloatTensor(wav).unsqueeze(0) # wav is FloatTensor with shape [B(1), T_time]
# compute mel spectrogram from the ground truth audio
mel = get_mel_spectrogram(wav, model.h).to(device) # mel is FloatTensor with shape [B(1), C_mel, T_frame]
# generate waveform from mel
with torch.inference_mode():
    wav_gen = model(mel) # wav_gen is FloatTensor with shape [B(1), 1, T_time] and values in [-1, 1]
wav_gen_float = wav_gen.squeeze(0).cpu() # wav_gen_float is FloatTensor with shape [1, T_time]
# you can convert the generated waveform to 16 bit linear PCM
wav_gen_int16 = (wav_gen_float * 32767.0).numpy().astype('int16') # wav_gen_int16 is np.ndarray with shape [1, T_time] and int16 dtype
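
# (optional, not part of the original snippet) write the 16-bit PCM result to disk,
# e.g. with scipy; 'output.wav' is a hypothetical output path
from scipy.io import wavfile
wavfile.write('output.wav', model.h.sampling_rate, wav_gen_int16[0])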
```
## Training
Create a symbolic link to the root of the dataset. The codebase uses filelists with paths relative to the dataset. Below is an example for the LibriTTS dataset:
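A minimal sketch (the dataset location `/path/to/your/LibriTTS` and the link name are placeholders; make sure the link name matches the relative paths used in your filelists):

```shell
# link the LibriTTS root into the repository so filelist paths resolve
ln -s /path/to/your/LibriTTS LibriTTS
```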
Train the BigVGAN model. Below is an example command for training BigVGAN-v2 using the LibriTTS dataset at 24kHz with a full 100-band mel spectrogram as input:
```shell
python train.py
```
Synthesize from the BigVGAN model. Below is an example command for generating audio from the model.
It computes mel spectrograms using wav files from `--input_wavs_dir` and saves the generated audio to `--output_dir`.
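A rough sketch of such a command; the script name `inference.py` and the `--checkpoint_file` flag are assumptions, while `--input_wavs_dir` and `--output_dir` are the options described above:

```shell
python inference.py \
    --checkpoint_file /path/to/bigvgan_generator.pt \
    --input_wavs_dir /path/to/input_wavs \
    --output_dir /path/to/generated_wavs
```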
A successful run of the correctness test that compares fused CUDA kernel inference against plain PyTorch inference prints output like:

[Success] test CUDA fused vs. plain torch BigVGAN inference
> mean_difference=0.0007238413265440613
If you see `[Fail] test CUDA fused vs. plain torch BigVGAN inference`, it means that the fused CUDA kernel output does not match the plain PyTorch output.
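One way such a check might be launched, as a sketch only (the test script path and `--checkpoint_file` flag are assumptions; use whatever test utility your checkout provides):

```shell
python tests/test_cuda_vs_torch_model.py \
    --checkpoint_file /path/to/bigvgan_generator.pt
```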
## Pretrained Models
We provide the [pretrained models on Hugging Face Collections](https://huggingface.co/collections/nvidia/bigvgan-66959df3d97fd7d98d97dc9a).
One can download the checkpoints of the generator weight (named `bigvgan_generator.pt`) and its discriminator/optimizer states (named `bigvgan_discriminator_optimizer.pt`) within the listed model repositories.
The paper results are based on the original 24kHz BigVGAN models (`bigvgan_24khz_100band` and `bigvgan_base_24khz_100band`) trained on the LibriTTS dataset.
We also provide 22kHz BigVGAN models with a band-limited setup (i.e., fmax=8000) for TTS applications.
Note that the checkpoints use the `snakebeta` activation with log-scale parameterization, which gives the best overall quality.
You can fine-tune the models by:
1. downloading the checkpoints (both the generator weight and its discriminator/optimizer states)
2. resuming training with your audio dataset by specifying a `--checkpoint_path` that contains the checkpoints when launching `train.py` (see the sketch below)
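A minimal sketch of step 2, assuming the downloaded checkpoints were placed in a hypothetical directory `exp/my_finetune` (any other flags your training setup requires are omitted):

```shell
# resume training from the generator/discriminator checkpoints found in this directory
python train.py \
    --checkpoint_path exp/my_finetune
```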
## Training Details of BigVGAN-v2
Compared to the original BigVGAN, the pretrained checkpoints of BigVGAN-v2 use `batch_size=32` with a longer `segment_size=65536` and are trained using 8 A100 GPUs.