How to convert PEFT-LoRA trained model into original whisper architecture? #2582

bansal-sid · 2025-04-25T07:26:54Z

Hello, I have trained Whisper large-v2 using PEFT-LoRA, following https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb

See also #988 for PEFT-LoRA training.

I am trying to convert the trained model from the Hugging Face architecture to OpenAI Whisper's architecture.

I also trained Whisper small and large-v2 with DeepSpeed, and for those I was able to convert the model into OpenAI's structure.
For that conversion I followed #830.
The mapping can also be seen at https://github.com/huggingface/transformers/blob/68e85fc822097b3df8d685a4705804348245284d/src/transformers/models/whisper/convert_openai_to_hf.py#L86

The problem is that I'm unable to repeat this for the model trained with PEFT-LoRA.
My code is below:

from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer
from transformers import BitsAndBytesConfig


peft_model_id = "jainsakshi/openai-whisper-large-peft-en-latest-LORA-colab"
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)
merged_model = model.merge_and_unload()
state_dict = merged_model.state_dict()

import re
def hf_to_whisper_states(text):
    text = re.sub(r'^model\.', '', text)
    text = re.sub('.layers.', '.blocks.', text)
    text = re.sub('.base_encoder.', '.encoder.', text)
    text = re.sub('.self_attn.', '.attn.', text)
    text = re.sub('.q_proj.', '.query.', text)
    text = re.sub('.k_proj.', '.key.', text)
    text = re.sub('.v_proj.', '.value.', text)
    text = re.sub('.out_proj.', '.out.', text)
    text = re.sub('.fc1.', '.mlp.0.', text)
    text = re.sub('.fc2.', '.mlp.2.', text)
    text = re.sub('.fc3.', '.mlp.3.', text)
    text = re.sub('.fc3.', '.mlp.3.', text)
    text = re.sub('.encoder_attn.', '.cross_attn.', text)
    text = re.sub('.cross_attn.ln.', '.cross_attn_ln.', text)
    text = re.sub('.embed_positions.weight', '.positional_embedding', text)
    text = re.sub('.embed_tokens.', '.token_embedding.', text)
    # text = re.sub('model.', '', text)
    text = re.sub('attn.layer_norm.', 'attn_ln.', text)
    text = re.sub('.final_layer_norm.', '.mlp_ln.', text)
    text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text)
    text = re.sub('decoder.layer_norm.', 'decoder.ln.', text)
    text = re.sub('proj_out.weight', 'decoder.token_embedding.weight', text)
    return text

filtered_state_dict = {}
for key, value in state_dict.items():
    new_key = hf_to_whisper_states(key)
    # Only include keys that match OpenAI's expected structure
    if not any(x in new_key for x in ['lora_', 'SCB', 'weight_format']):
        filtered_state_dict[new_key] = value

import whisper
whisper_model = whisper.load_model('large-v2', download_root= "/export/home/users/media/shared/sid/whisper_fine_tuning/models")

result = whisper_model.load_state_dict(filtered_state_dict, strict=False)

audio = whisper.load_audio("test.mp3")
result = whisper_model.transcribe(audio)
print(result["text"])

After conversion there were no missing or unexpected keys, because I had filtered out the extra entries containing SCB, lora_, and weight_format. I did this after a discussion with GPT.

Although the weights loaded without any error, transcription fails with the error below, which I can't make sense of:

result = whisper_model.transcribe(audio)

File ~/shared/sid/whisper_fine_tuning/.venv/lib/python3.10/site-packages/whisper/transcribe.py:279, in transcribe(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, word_timestamps, prepend_punctuations, append_punctuations, clip_timestamps, hallucination_silence_threshold, **decode_options)
    276 mel_segment = pad_or_trim(mel_segment, N_FRAMES).to(model.device).to(dtype)
    278 decode_options["prompt"] = all_tokens[prompt_reset_since:]
--> 279 result: DecodingResult = decode_with_fallback(mel_segment)
    280 tokens = torch.tensor(result.tokens)
    282 if no_speech_threshold is not None:
    283     # no voice activity check

File ~/shared/sid/whisper_fine_tuning/.venv/lib/python3.10/site-packages/whisper/transcribe.py:195, in transcribe.<locals>.decode_with_fallback(segment)
    192     kwargs.pop("best_of", None)
    194 options = DecodingOptions(**kwargs, temperature=t)
--> 195 decode_result = model.decode(segment, options)
    197 needs_fallback = False
    198 if (
    199     compression_ratio_threshold is not None
    200     and decode_result.compression_ratio > compression_ratio_threshold
    201 ):

File ~/shared/sid/whisper_fine_tuning/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
...
     77             )
     78 super().__init__()

ValueError: Expected parameter logits (Tensor of shape (1, 51865)) of distribution Categorical(logits: torch.Size([1, 51865])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0')

Advait251206 · 2026-06-24T18:26:31Z

The fact that:

result = whisper_model.load_state_dict(filtered_state_dict, strict=False)

completes without missing or unexpected keys is encouraging, but it does not guarantee that the weights were mapped correctly.

The real clue is this error:

ValueError: Expected parameter logits ...
but found invalid values:

tensor([[nan, nan, nan, ..., nan]], device='cuda:0')

This means the forward pass runs to completion, but the logits are already all NaN by the time the decoder samples from them. The NaNs therefore originate earlier, either in the loaded weights or in an intermediate activation.

First thing I'd verify: merge quality

You're loading the base model as:

WhisperForConditionalGeneration.from_pretrained(
    ...,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

and then doing:

merged_model = model.merge_and_unload()

This immediately raises a concern.

LoRA merging is generally safest when performed on fp16, bf16, or fp32 weights.

Merging into an 8-bit quantized model can produce unexpected results depending on the PEFT and bitsandbytes versions.

I'd try:

import torch

base_model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path,
    torch_dtype=torch.float16,
)

model = PeftModel.from_pretrained(base_model, peft_model_id)

merged_model = model.merge_and_unload()

and then save/check the merged weights before any conversion.

Check for NaNs before conversion

Immediately after merging:

for name, param in merged_model.named_parameters():
    if torch.isnan(param).any():
        print("NaN found:", name)

If NaNs already exist here, the issue is the merge itself.
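One caveat with the loop above: it only catches NaNs. If the base model was loaded in 8-bit, the merged state dict may still hold int8 weights (the SCB entries you filtered out are their scales), and those would pass a NaN check while being meaningless once copied into an fp16 OpenAI model. Here is a minimal sketch of a broader audit. The helper name audit_state_dict is made up for illustration, and it relies only on standard tensor methods:

```python
def audit_state_dict(state_dict):
    """Return (non_float, non_finite) findings for a name -> tensor mapping.

    Only tensor methods are used, so this works on any mapping of
    name -> torch.Tensor without importing torch here.
    """
    non_float, non_finite = [], []
    for name, tensor in state_dict.items():
        if not tensor.is_floating_point():
            # e.g. int8 weights left behind by an 8-bit load
            non_float.append((name, str(tensor.dtype)))
        elif not bool(tensor.isfinite().all()):
            # float tensor containing NaN or +/-Inf
            non_finite.append(name)
    return non_float, non_finite

# Usage (assumes merged_model from the snippet above):
# non_float, non_finite = audit_state_dict(merged_model.state_dict())
# print(non_float[:5], non_finite[:5])
```

If non_float lists attention or MLP weights, I would treat that as a strong hint that the quantized load is the problem, before looking at the key mapping at all.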

Verify logits in HF before conversion

Before converting to OpenAI format:

inputs = processor(...)
outputs = merged_model(**inputs)

or simply:

merged_model.generate(...)

If the HF model works correctly, then the merge is fine and the conversion is the problem.

This is the most important diagnostic step.

Case 1

HF merged model fails:

HF merged model → NaNs

Problem:

PEFT merge
Quantization
Checkpoint corruption

Case 2

HF merged model works:

HF merged model → OK
OpenAI model → NaNs

Problem:

Key mapping
Architecture mismatch

I suspect a mapping issue

Your conversion logic is based on older Whisper HF ↔ OpenAI mappings.

Large-v2 is fairly sensitive to incorrect mappings because a single wrong normalization or attention projection can cause:

finite weights
↓
invalid activations
↓
NaN logits

The highest-risk mappings are:

.final_layer_norm.
.encoder.layer_norm.
.decoder.layer_norm.

and

proj_out.weight

because OpenAI and HF organize these slightly differently.

Just because all tensor shapes match doesn't mean the tensors belong in the correct location.
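One detail in the posted hf_to_whisper_states: the dots in patterns such as '.layers.' are unescaped, so in a regex they match any character, and a few patterns ('.base_encoder.', '.fc3.') do not correspond to any HF Whisper key that I'm aware of. Below is a rough sketch of the same idea with literal, ordered string replacements. The names HF_TO_OPENAI and hf_key_to_openai are hypothetical, and the target names are simply the reverse of the mapping in the convert_openai_to_hf.py script linked in the question, so please verify them against whisper_model.state_dict().keys():

```python
# Ordered (HF substring, OpenAI substring) pairs. str.replace treats dots
# literally, and the more specific names come before their prefixes.
HF_TO_OPENAI = [
    ("encoder_attn_layer_norm", "cross_attn_ln"),
    ("self_attn_layer_norm", "attn_ln"),
    ("final_layer_norm", "mlp_ln"),
    ("encoder.layer_norm", "encoder.ln_post"),
    ("decoder.layer_norm", "decoder.ln"),
    ("encoder_attn", "cross_attn"),
    ("self_attn", "attn"),
    ("q_proj", "query"),
    ("k_proj", "key"),
    ("v_proj", "value"),
    ("out_proj", "out"),
    ("fc1", "mlp.0"),
    ("fc2", "mlp.2"),
    ("embed_positions.weight", "positional_embedding"),
    ("embed_tokens", "token_embedding"),
    ("layers", "blocks"),
]


def hf_key_to_openai(key):
    """Map one HF Whisper state-dict key to its OpenAI name (None = skip)."""
    if key == "proj_out.weight":
        # Tied to model.decoder.embed_tokens.weight in HF, so it is skipped
        # here instead of being written over the token embedding.
        return None
    if key.startswith("model."):
        key = key[len("model."):]
    for old, new in HF_TO_OPENAI:
        key = key.replace(old, new)
    return key


print(hf_key_to_openai("model.encoder.layers.0.self_attn.q_proj.weight"))
print(hf_key_to_openai("model.decoder.layers.3.encoder_attn_layer_norm.bias"))
```

Skipping proj_out.weight is a design choice on my part: it avoids two source tensors competing for decoder.token_embedding.weight.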

Compare against a known-good conversion

A useful sanity test:

Load untouched HF large-v2
Convert it using your script
Load into OpenAI Whisper
Run inference

If this produces NaNs:

conversion script is incorrect

If it works:

LoRA merge or PEFT-specific weights are the issue

Check tied embeddings

One thing I notice:

text = re.sub(
    'proj_out.weight',
    'decoder.token_embedding.weight',
    text
)

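HF Whisper ties proj_out.weight to the decoder token embeddings (model.decoder.embed_tokens.weight).

OpenAI Whisper has no separate output projection; it reuses decoder.token_embedding.weight to compute the logits.

Your mapping therefore sends two HF keys to the same OpenAI key, and whichever comes last in the state dict wins. After a PEFT merge there may also be subtle differences in how the tied weights are represented.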

Verify:

merged_model.proj_out.weight.shape

matches:

whisper_model.decoder.token_embedding.weight.shape

and contains finite values.
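Related to this, it can help to check whether the renaming sends more than one source key to the same target, since the later one silently overwrites the earlier one in a dict. A small sketch follows; find_key_collisions and toy_rename are illustrative names, and toy_rename is a stand-in for whatever renaming function is actually used:

```python
from collections import defaultdict


def find_key_collisions(keys, rename):
    """Group source keys by renamed key; return targets hit more than once."""
    groups = defaultdict(list)
    for key in keys:
        groups[rename(key)].append(key)
    return {new: olds for new, olds in groups.items() if len(olds) > 1}


def toy_rename(key):
    # Hypothetical stand-in that mimics the two rules relevant here.
    if key == "proj_out.weight":
        return "decoder.token_embedding.weight"
    return key.removeprefix("model.").replace("embed_tokens", "token_embedding")


keys = [
    "model.decoder.embed_tokens.weight",
    "proj_out.weight",
    "model.encoder.conv1.weight",
]
print(find_key_collisions(keys, toy_rename))
```

With the real state dict you would pass state_dict.keys() and the actual renaming function. Any collision that shows up is a place where the order of the keys decides which tensor ends up in the OpenAI model.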

Check for non-finite weights after conversion

After creating filtered_state_dict:

for k, v in filtered_state_dict.items():
    if not torch.isfinite(v).all():
        print("Bad tensor:", k)

Even one corrupted tensor can lead to the decoder producing NaN logits immediately.

My likely diagnosis

The most probable causes, in order, are:

Merging LoRA into an 8-bit model
A key-mapping mismatch in the HF → OpenAI conversion
Incorrect handling of tied output embeddings (proj_out.weight)
A PEFT-specific parameter not being merged as expected

I'd start by verifying that the merged Hugging Face model can successfully transcribe audio before any conversion. If the merged HF model works but the converted OpenAI model produces NaNs, then the issue is almost certainly in the conversion mapping rather than the LoRA training itself.
