Looking for complete conversion from pretrained huggingface model #611

lionsheep24 · 2024-06-18T07:56:45Z

Hello,
I have pretrained a model with huggingface and attempted to deploy it using the TRTLLM-Triton Server method as documented here. However, I've noticed that the transcription results differ significantly from the original model's performance when using the Transformer pipeline.

Upon further investigation, I compared the mel spectrograms and the decoding results between the TRT-LLM implementation and the original pipeline. Both comparisons showed noticeable differences, leading to degraded transcription accuracy in the TRT-LLM implementation. In some cases, it even returned a blank string.

Let me share my pipeline implementation

model_ckpt="./models/whisper-large-v2/2"
torch_dtype = torch.float16
feature_extractor: WhisperFeatureExtractor = WhisperFeatureExtractor.from_pretrained(pretrained_model_name_or_path=model_ckpt)
tokenizer:WhisperTokenizer = WhisperTokenizer.from_pretrained(pretrained_model_name_or_path=model_ckpt)
batch_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_ckpt,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    device_map="cuda:0",
)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=batch_model,
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=False,
    torch_dtype=torch_dtype,
    generate_kwargs={"language": "ko", "num_beams": 1, "do_sample": False},
)

TRT-LLM implementation is same with the link , which I mentioned earlier, and the engine has built by below script. (trtllm version is 0.11.0.dev2024060400)

1. save hf-model 
model = AutoModel.from_pretrained(model_name, use_safetensors=True).half() # save to /workspace/models/whisper-large-v2

2. convert to openai
python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2 --output_dir /workspace/models/whisper-openai --output_name large-v2

3. convert to tensorrt-llm
python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2

4. build tensorrt-llm
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/whisper-tensorrt-llm/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 4 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/whisper-tensorrt-llm/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 4 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable

Client code for tensorrt-llm + tritonserver

from transformers import WhisperFeatureExtractor, WhisperTokenizer, AutoModelForSpeechSeq2Seq, pipeline
import torch
import numpy as np

from tritonclient.grpc import InferenceServerClient
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


def send_whisper(
    samples: np.ndarray,
    triton_client: InferenceServerClient,
    protocol_client,
    model_name: str = "whisper-large-v2-tensorrt-llm",
    whisper_prompt: str = "<|startoftranscript|><|ko|><|transcribe|><|notimestamps|>"
):

    inputs = [
        protocol_client.InferInput("WAV", samples.shape, np_to_triton_dtype(samples.dtype)),
        protocol_client.InferInput("TEXT_PREFIX", [1, 1], "BYTES"),
    ]
    inputs[0].set_data_from_numpy(samples)
    
    input_data_numpy = np.array([whisper_prompt], dtype=object).reshape((1, 1))
    inputs[1].set_data_from_numpy(input_data_numpy)

    outputs = [protocol_client.InferRequestedOutput("TRANSCRIPTS")]
    sequence_id = 100000000  # Example sequence_id, this can be changed as needed
    
    response = triton_client.infer(model_name, inputs, request_id=str(sequence_id), outputs=outputs)
    
    decoding_results = response.as_numpy("TRANSCRIPTS")[0]
    if isinstance(decoding_results, np.ndarray):
        decoding_results = b" ".join(decoding_results).decode("utf-8")
    else:
        decoding_results = decoding_results.decode("utf-8")

    print(f"TensorRT LLM STT Result: {decoding_results}")

Could anyone help me understand why these discrepancies are occurring and how to resolve them?

Thank you in advance for your assistance.

The text was updated successfully, but these errors were encountered:

csukuangfj · 2024-06-18T08:04:31Z

@yuekaizhang Could you have a look at this issue?

lionsheep24 · 2024-06-18T08:26:56Z

Let me share my build script for trt-llm.

1. save hf-model 
model = AutoModel.from_pretrained(model_name, use_safetensors=True).half() # save to /workspace/models/whisper-large-v2

2. convert to openai
python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2 --output_dir /workspace/models/whisper-openai --output_name large-v2

3. convert to tensorrt-llm
python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2

4. build tensorrt-llm
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/whisper-tensorrt-llm/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 4 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/whisper-tensorrt-llm/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 4 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable

yuekaizhang · 2024-06-18T08:55:16Z

@lionsheep24 #597 (comment), check this. You may need to align the prompt, beam_size, and other hyper-parameters to get the same outputs.

There are several succuss integration of whisper trt-llm you may refer e.g. https://github.com/Wordcab/wordcab-transcribe/tree/main/src/wordcab_transcribe/engines/tensorrt_llm. Your export steps also look good to me.

lionsheep24 · 2024-06-18T10:06:52Z

@yuekaizhang
I'm using <|startoftranscript|><|ko|><|transcribe|><|notimestamps|> prompt, beam_size of 1 and found differences in extracting mel spectrograms from same audio array between hf-way and openai-way.

Same decoding results from different audio features, you mean? There were some values of -0.74171734 in hf-way but corresponding value of openai-way were 0.

I switched compute_feature function to hf WhisperFeatureExtractor but tokenizer throws OverflowError: out of range integral type conversion because decoding result has -1 token.

I reviewed link you shared but It seems to be similar with current repo.

I'm not sure how transcription results can be same even though extracted features are different.

lionsheep24 · 2024-06-20T06:59:53Z

Hi all! any updates here?

I am curious about why the audio features extracted from the same audio array differ when using the Huggingface library compared to the method provided in this repository.

Additionally, I want to confirm if it is correct for the values to be different. In my opinion, even if the model is converted, the input audio features should be same.

When I input the features extracted using the Huggingface library into the TensorRT-LLM engine, I received a -1 token(which is different from Huggingface pipeline result), which seems to have caused an error during decoding.

Feel free to let me know if you need any further adjustments or additional information included!

yuekaizhang · 2024-06-26T05:54:31Z

Huggingface library compared to the method provided in this repository.

Theoretically, the minor difference of feature values would not have a effect on the transcript results. We actually support huggingface distill whisper in tensorrt-llm, which uses the huggingface feature extractor to train. However, it could work with our feature extractor in inference.

You may try replace the feature extractor if you think that is the root cause.
@lionsheep24

lionsheep24 · 2024-06-27T00:32:58Z

Yeah I calculated differences of features from huggingface and tensorrt-llm example and the absolute difference was up to 0.74. I think it's not a minor difference.

I tried to replace feature extractor to huggingface and feed feature to tensorrt-llm but I got -1 token from engine, as I mentioned earlier.
@yuekaizhang

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Looking for complete conversion from pretrained huggingface model #611

Looking for complete conversion from pretrained huggingface model #611

lionsheep24 commented Jun 18, 2024 •

edited

Loading

csukuangfj commented Jun 18, 2024

lionsheep24 commented Jun 18, 2024 •

edited

Loading

yuekaizhang commented Jun 18, 2024

lionsheep24 commented Jun 18, 2024 •

edited

Loading

lionsheep24 commented Jun 20, 2024 •

edited

Loading

yuekaizhang commented Jun 26, 2024

lionsheep24 commented Jun 27, 2024

Looking for complete conversion from pretrained huggingface model #611

Looking for complete conversion from pretrained huggingface model #611

Comments

lionsheep24 commented Jun 18, 2024 • edited Loading

csukuangfj commented Jun 18, 2024

lionsheep24 commented Jun 18, 2024 • edited Loading

yuekaizhang commented Jun 18, 2024

lionsheep24 commented Jun 18, 2024 • edited Loading

lionsheep24 commented Jun 20, 2024 • edited Loading

yuekaizhang commented Jun 26, 2024

lionsheep24 commented Jun 27, 2024

lionsheep24 commented Jun 18, 2024 •

edited

Loading

lionsheep24 commented Jun 18, 2024 •

edited

Loading

lionsheep24 commented Jun 18, 2024 •

edited

Loading

lionsheep24 commented Jun 20, 2024 •

edited

Loading