Train v2 model for 44100 Hz (bigvgan_v2_44khz_128band_512x) output

Hello!

I'm trying to train a v2 model that outputs 44100 Hz WAV files (using bigvgan_v2_44khz_128band_512x), but the output audio is unintelligible. Could @Plachtaa or anyone point out what I'm doing wrong, please?

Here's the config I'm using:

```yaml
_target_: modules.v2.vc_wrapper.VoiceConversionWrapper
sr: 44100
hop_size: 512
mel_fn:
  _target_: modules.audio.mel_spectrogram
  _partial_: true
  n_fft: 2048
  win_size: 2048
  hop_size: 512
  num_mels: 128
  sampling_rate: 44100
  fmin: 0
  fmax: null
  center: False
cfm:
  _target_: modules.v2.cfm.CFM
  estimator:
    _target_: modules.v2.dit_wrapper.DiT
    time_as_token: true
    style_as_token: true
    uvit_skip_connection: false
    block_size: 8192
    depth: 13
    num_heads: 8
    hidden_dim: 512
    in_channels: 128
    content_dim: 512
    style_encoder_dim: 192
    class_dropout_prob: 0.1
    dropout_rate: 0.0
    attn_dropout_rate: 0.0
cfm_length_regulator:
  _target_: modules.v2.length_regulator.InterpolateRegulator
  channels: 512
  is_discrete: true
  codebook_size: 2048
  sampling_ratios: [ 1, 1, 1, 1 ]
  f0_condition: false
ar:
  _target_: modules.v2.ar.NaiveWrapper
  model:
    _target_: modules.v2.ar.NaiveTransformer
    config:
      _target_: modules.v2.ar.NaiveModelArgs
      dropout: 0.0
      rope_base: 10000.0
      dim: 768
      head_dim: 64
      n_local_heads: 2
      intermediate_size: 2304
      n_head: 12
      n_layer: 12
      vocab_size: 2049  # 2048 + 1 for eos
ar_length_regulator:
  _target_: modules.v2.length_regulator.InterpolateRegulator
  channels: 768
  is_discrete: true
  codebook_size: 32
  sampling_ratios: [ ]
  f0_condition: false
style_encoder:
  _target_: modules.campplus.DTDNN.CAMPPlus
  feat_dim: 80
  embedding_size: 192
content_extractor_narrow:
  _target_: modules.astral_quantization.default_model.AstralQuantizer
  tokenizer_name: "openai/whisper-small"
  ssl_model_name: "facebook/hubert-large-ll60k"
  ssl_output_layer: 18
  skip_ssl: true
  encoder: &bottleneck_encoder
    _target_: modules.astral_quantization.convnext.ConvNeXtV2Stage
    dim: 512
    num_blocks: 12
    intermediate_dim: 1536
    dilation: 1
    input_dim: 1024
  quantizer:
    _target_: modules.astral_quantization.bsq.BinarySphericalQuantize
    codebook_size: 32  # codebook size, must be a power of 2
    dim: 512
    entropy_loss_weight: 0.1
    diversity_gamma: 1.0
    spherical: True
    enable_entropy_loss: True
    soft_entropy_loss: True
content_extractor_wide:
  _target_: modules.astral_quantization.default_model.AstralQuantizer
  tokenizer_name: "openai/whisper-small"
  ssl_model_name: "facebook/hubert-large-ll60k"
  ssl_output_layer: 18
  encoder: *bottleneck_encoder
  quantizer:
    _target_: modules.astral_quantization.bsq.BinarySphericalQuantize
    codebook_size: 2048  # codebook size, must be a power of 2
    dim: 512
    entropy_loss_weight: 0.1
    diversity_gamma: 1.0
    spherical: True
    enable_entropy_loss: True
    soft_entropy_loss: True
vocoder:
  _target_: modules.bigvgan.bigvgan.BigVGAN.from_pretrained
  pretrained_model_name_or_path: "nvidia/bigvgan_v2_44khz_128band_512x"
  use_cuda_kernel: false
```
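As a quick sanity check on the numbers in this config, the sketch below restates the relevant values in a plain Python dict and verifies that they agree with each other. The dict and the `check_config` helper are hypothetical and not part of seed-vc, and reading the band count and upsampling factor out of the vocoder name is an assumption about BigVGAN's naming convention. It only checks internal consistency; passing does not by itself show that training will produce intelligible audio.

```python
import re

# Values copied by hand from the YAML above.
# Hypothetical flat dict for illustration, not a seed-vc data structure.
cfg = {
    "sr": 44100,
    "hop_size": 512,
    "mel_sampling_rate": 44100,
    "mel_hop_size": 512,
    "num_mels": 128,
    "dit_in_channels": 128,
    "vocoder": "nvidia/bigvgan_v2_44khz_128band_512x",
}


def check_config(cfg):
    """Return a list of human-readable mismatches (empty means consistent)."""
    problems = []
    if cfg["sr"] != cfg["mel_sampling_rate"]:
        problems.append("top-level sr differs from mel_fn.sampling_rate")
    if cfg["hop_size"] != cfg["mel_hop_size"]:
        problems.append("top-level hop_size differs from mel_fn.hop_size")
    if cfg["dit_in_channels"] != cfg["num_mels"]:
        problems.append("DiT in_channels differs from mel_fn.num_mels")

    # Assumption: the vocoder name ends in "<bands>band_<upsample>x".
    match = re.search(r"_(\d+)band_(\d+)x$", cfg["vocoder"])
    if match:
        bands, upsample = int(match.group(1)), int(match.group(2))
        if bands != cfg["num_mels"]:
            problems.append("vocoder band count differs from num_mels")
        if upsample != cfg["hop_size"]:
            problems.append("vocoder upsampling factor differs from hop_size")
    return problems


problems = check_config(cfg)
frames_per_second = cfg["sr"] / cfg["hop_size"]

print(problems or "config values are mutually consistent")
print(f"mel frame rate: {frames_per_second:.2f} frames/s")
```

For the values above the check reports no mismatches, so the mel and vocoder dimensions at least line up with each other.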


I also commented out the line that loads the pretrained checkpoints, since they are incompatible with this config, so I'm training from scratch:
`self.model.load_checkpoints(cfm_checkpoint_path=cfm_checkpoint_path, ar_checkpoint_path=ar_checkpoint_path)`
(from https://github.com/Plachtaa/seed-vc/blob/main/train_v2.py#L174)
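If the incompatibility is a tensor shape mismatch (for example, layers sized by the number of mel bands), one common alternative to training everything from scratch is to load only the checkpoint entries whose names and shapes match the new model. The sketch below shows that filtering step in plain Python. `filter_matching` is a hypothetical helper, not a seed-vc function, and whether partially loaded weights actually help at 44100 Hz is untested here.

```python
def filter_matching(ckpt_state, model_state):
    """Split checkpoint entries into those that fit the model and those that do not.

    Both arguments map parameter names to objects with a `.shape` attribute,
    such as PyTorch state dicts.
    """
    kept, skipped = {}, []
    for name, tensor in ckpt_state.items():
        target = model_state.get(name)
        # Keep an entry only if the model has the same name with the same shape.
        if target is not None and tuple(target.shape) == tuple(tensor.shape):
            kept[name] = tensor
        else:
            skipped.append(name)
    return kept, skipped
```

With PyTorch, the `kept` dict could then be passed to `model.load_state_dict(kept, strict=False)`, which leaves every parameter missing from `kept` at its initial value. Printing `skipped` shows which layers would still be trained from scratch.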

Thanks!
