OnnxT5 slower than Pytorch #23

Open
GRIGORR opened this issue Oct 13, 2021 · 16 comments

@GRIGORR

GRIGORR commented Oct 13, 2021

Hi. I created an OnnxT5 model (non-quantized) as shown in the README, but it is 10-20% slower than the original Hugging Face T5. Could you share how the latency difference shown in the repo was obtained? Thanks.

@piEsposito

Same here.

@sworddish

same here too

@Ki6an
Owner

Ki6an commented Jan 20, 2022

Can you provide the device specifications and the code you are using to test the speed?

@pramodith

Hi, I'm seeing the same problem. The quantized ONNX version seems to be faster than the PyTorch model when I run it with a batch size of 1. However, with a batch size of 32 it's much slower. I'm using fastt5==0.1.2, transformers==4.10.0 and pytorch==1.7.1. Is there something wrong with my setup?

@ghost

ghost commented Mar 9, 2022

In my case, even the quantized version is slower than PyTorch when the input sequence is longer than 100 tokens. I'm not sure whether this is expected.
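Since the reports above depend on both batch size and input length, it can help to time every combination rather than a single configuration. This is a minimal sketch, assuming the model call is wrapped in a function that takes a batch size and a token length; `sweep` and `run` are hypothetical names, not part of fastT5 or transformers:

```python
from itertools import product
from timeit import default_timer as timer


def sweep(run, batch_sizes, seq_lens):
    """Time run(batch_size, seq_len) once for every combination.

    Returns a dict mapping (batch_size, seq_len) to elapsed seconds.
    """
    results = {}
    for batch_size, seq_len in product(batch_sizes, seq_lens):
        start = timer()
        run(batch_size, seq_len)
        results[(batch_size, seq_len)] = timer() - start
    return results


if __name__ == "__main__":
    # Stand-in workload whose cost grows with both parameters.
    # Replace it with a function that builds a batch and calls model.generate().
    def fake_run(batch_size, seq_len):
        sum(range(batch_size * seq_len * 1000))

    for (bs, n), secs in sweep(fake_run, [1, 8, 32], [16, 128]).items():
        print(f"batch={bs:>2} tokens={n:>3} {secs:.4f}s")
```

Running the same sweep against the ONNX and PyTorch models should show where one backend overtakes the other, if it does at all.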

@JoeREISys

JoeREISys commented May 16, 2022

I experienced this as well. I'm quite disappointed.

from fastT5 import get_onnx_model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from timeit import default_timer as timer

py_model_name = 'Salesforce/mixqg-large'
model_name = 'mixqg-large'
custom_output_path = './onnx_t5'
model = get_onnx_model(model_name, custom_output_path)

py_model = AutoModelForSeq2SeqLM.from_pretrained(py_model_name)
tokenizer = AutoTokenizer.from_pretrained(py_model_name)

# this is also the batch size
num_texts = 4                             # Number of input texts to decode
num_beams = 1                             # Number of beams per input text
max_encoder_length = 768                   # Maximum input token length
max_decoder_length = 768                   # Maximum output token length

def infer(model, tokenizer, text):

    # Truncate and pad every input to max_encoder_length so the token size is compatible with a fixed-size encoder (not necessary for pure CPU execution)
    batch = tokenizer(text, max_length=max_encoder_length, truncation=True, padding='max_length', return_tensors="pt")
    output = model.generate(**batch, max_length=max_decoder_length, num_beams=num_beams, num_return_sequences=num_beams)
    results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]

    print('Texts:')
    for i, summary in enumerate(results):
        print(i + 1, summary)

seq_0 = "Speed bumps are designed to make drivers to slow down. \n Speed bumps are designed to make drivers to slow down. Going over a typical speed bump at 5 miles per hour results in a gentle bounce, while hitting one at 20 delivers a sizable jolt. It's natural to assume that hitting a speed bump at 60mph would deliver a proportionally larger jolt, but it probably wouldn't."
seq_1 = "Going over a typical speed bump at 5 miles per hour results in a gentle bounce, while hitting one at 20 delivers a sizable jolt. \n Speed bumps are designed to make drivers to slow down. Going over a typical speed bump at 5 miles per hour results in a gentle bounce, while hitting one at 20 delivers a sizable jolt. It's natural to assume that hitting a speed bump at 60mph would deliver a proportionally larger jolt, but it probably wouldn't."
seq_2 = "Toyota was by far the most in-demand manufacturer of 2020, totalling over 8.5 million car sales last year. \n Toyota was by far the most in-demand manufacturer of 2020, totalling over 8.5 million car sales last year. They also out-sold rivals Volkswagen by 3.4 million, which equates to just under 10,000 more sales every day and almost 400 more per hour."
seq_3 = "They also out-sold rivals Volkswagen by 3.4 million, which equates to just under 10,000 more sales every day and almost 400 more per hour. \n Toyota was by far the most in-demand manufacturer of 2020, totalling over 8.5 million car sales last year. They also out-sold rivals Volkswagen by 3.4 million, which equates to just under 10,000 more sales every day and almost 400 more per hour."

start = timer()
infer(model, tokenizer, [seq_0, seq_1, seq_2, seq_3])
end = timer()
print("Onnx time:", end - start)

start = timer()
infer(py_model, tokenizer, [seq_0, seq_1, seq_2, seq_3])
end = timer()

print("PyTorch time:", end - start)

Output:

Texts:
1 What do speed bumps do?
2 What does a speed bump do to a driver?
3 What car manufacturer sold the most cars in 2020?
4 How much more did Toyota sell than Volkswagen in 2020?
Pytorch time: 7.330018510000627

Texts:
1 What do speed bumps cause?
2 What does a 20 mph speed bump do?
3 What car manufacturer sold the most cars in 2020?
4 How much more did Toyota sell than Volkswagen in 2020?
Onnx time: 14.700341603999732
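A single timed call that also covers first-call overhead and printing can distort a comparison like this one. Below is a minimal sketch of a steadier measurement, assuming each backend is wrapped in a zero-argument callable; `benchmark`, `warmup`, and `repeats` are hypothetical names, not part of fastT5 or transformers:

```python
from statistics import median
from timeit import default_timer as timer


def benchmark(fn, warmup=2, repeats=5):
    """Return the median wall-clock time of fn() in seconds.

    Warm-up calls are executed but not timed, so one-off costs
    (lazy initialization, memory allocation) are excluded.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = timer()
        fn()
        timings.append(timer() - start)
    return median(timings)


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. lambda: model.generate(**batch)
    print("median seconds:", benchmark(lambda: sum(range(100_000))))
```

Decoding and printing the results outside the timed callable keeps the measurement focused on generation itself.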

@Ki6an
Owner

Ki6an commented May 17, 2022

@JoeREISys I ran the same script in Colab and got the following results. Maybe it's a device issue.

Downloading: 100%
1.43k/1.43k [00:00<00:00, 37.2kB/s]
Downloading: 100%
2.75G/2.75G [01:02<00:00, 52.8MB/s]
Exporting to onnx... |################################| 3/3
Quantizing... |################################| 3/3
Setting up onnx model...
Done!
Downloading: 100%
1.92k/1.92k [00:00<00:00, 39.7kB/s]
Downloading: 100%
773k/773k [00:00<00:00, 3.16kB/s]
Downloading: 100%
1.32M/1.32M [00:00<00:00, 3.07MB/s]
Downloading: 100%
1.74k/1.74k [00:00<00:00, 50.9kB/s]
Texts:
1 What do speed bumps cause?
2 What does a speed bump do to a driver?
3 What car manufacturer sold the most cars in 2020?
4 How much more did Toyota sell than their rival in 2020?
Onnx time: 37.83371752200014
Texts:
1 What do speed bumps do?
2 What does a speed bump do to a driver?
3 What car manufacturer sold the most cars in 2020?
4 How much more did Toyota sell than Volkswagen in 2020?
PyTorch time: 54.744560202

@ierezell

ierezell commented May 19, 2022

Hi there, first thanks a lot for this repo!
I experimented with huggingface/optimum, which is really nice, but it does not support text2text for now (beam search is a beast).

So I tried this repo (on GPU) and got 0.282s for OnnxT5 and 0.160s for a HuggingFace Pipeline... so almost twice as slow for Onnx...
I got approximately the same magnitude on CPU.

For the OnnxT5 I followed the README.
Note that for GPU I changed the code (in ort_settings.py) to make it work with CUDAExecutionProvider (with onnxruntime-gpu; the model is loaded on GPU, confirmed with nvidia-smi).
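For readers unfamiliar with execution providers, here is a minimal sketch of how ONNX Runtime is generally asked to use CUDA. It is not the modified ort_settings.py from this comment, which was not shared; `pick_providers` and `make_session` are hypothetical helper names, and the sketch assumes onnxruntime-gpu is installed:

```python
def pick_providers(available, prefer_gpu=True):
    """Order execution providers, putting CUDA first when it is available."""
    providers = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        providers.append("CUDAExecutionProvider")
    providers.append("CPUExecutionProvider")  # always keep a CPU fallback
    return providers


def make_session(model_path, prefer_gpu=True):
    """Create an ONNX Runtime session for the .onnx file at model_path."""
    import onnxruntime as ort  # imported lazily; needs onnxruntime or onnxruntime-gpu

    providers = pick_providers(ort.get_available_providers(), prefer_gpu)
    session = ort.InferenceSession(model_path, providers=providers)
    # get_providers() reports what the session actually uses.
    print("active providers:", session.get_providers())
    return session
```

Checking `session.get_providers()` is a quick way to confirm that the session did not silently fall back to CPU.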

DEFAULT_GENERATOR_OPTIONS = {
    'max_length': 128, 'min_length': 2, 'early_stopping': True,
    'num_beams': 3, 'temperature': 1.0, 'num_return_sequences': 3,
    'top_k': 50, 'top_p': 1.0, 'repetition_penalty': 2.0, 'length_penalty': 1.0,
}

OnnxT5

from time import perf_counter

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 'mrm8488/t5-base-finetuned-question-generation-ap'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "answer: bananas context: I like to eat bananas for breakfast"
token = tokenizer(t_input, return_tensors='pt')

# Time only the generate() call
start = perf_counter()
tokens = model.generate(input_ids=token['input_ids'],
                        attention_mask=token['attention_mask'],
                        **DEFAULT_GENERATOR_OPTIONS)
print(perf_counter() - start)
# num_return_sequences is 3, so decode every returned sequence
output = tokenizer.batch_decode(tokens, skip_special_tokens=True)
print(output)

Huggingface pipeline:

from transformers import AutoModelForSeq2SeqLM

# Assumed setup (not shown in the original comment): the plain PyTorch model on GPU
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

input_texts = ["answer: bananas context: I like to eat bananas for breakfast"]
inputs = tokenizer.batch_encode_plus(
    input_texts, return_tensors='pt', add_special_tokens=True,
    padding=True, truncation=True
)
inputs = inputs.to("cuda")
# Time only the generate() call
start = perf_counter()
output = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs["attention_mask"],
    **DEFAULT_GENERATOR_OPTIONS
)
print(perf_counter() - start)
all_sentences = tokenizer.batch_decode(output, skip_special_tokens=True)
print(all_sentences)

In conclusion, even with ONNX optimization and on GPU, the model is almost twice as slow.
Note that with optimum and a TokenClassifier model I got a 10x improvement.

Thanks in advance for any help,
Have a great day.

@JoeREISys

JoeREISys commented May 19, 2022

> @JoeREISys I ran the same script in colab, I'm getting the following results. maybe it's the device issue.

@Ki6an I was using R6i instances. I will retry with C6g and C6i instances in AWS. As a side note, could there be performance gains from exporting the generate implementation as a TorchScript module with traced encoder and decoder submodules, or can ONNX not handle traced submodules?

@JoeREISys

JoeREISys commented May 19, 2022

> So I tried this repo (on GPU) and got 0.282s for OnnxT5 and 0.160 for a HuggingFace Pipeline... so twice as slow for Onnx...

@ierezell The documentation mentions that this repo is not optimized for CUDA. GPU performance is expected to be the same as or worse than the HF Pipeline on GPU. A GPU implementation is in progress, but ONNX T5 optimizations don't exist yet for GPU.

@ierezell

@JoeREISys, I updated my comment: I had the same issue on CPU (I tried GPU on the off chance that it would improve things...)

@xingenju

get_onnx_model

Hi Ki6an,
I have the same issue: fastT5 seems faster on my Mac, but it is slower on an AWS P2 instance. Could you please clarify which machine configurations fastT5 works well on?

@Oxi84

Oxi84 commented Jan 14, 2023

Same thing for me: it is around 10 percent slower. I run batch sizes of around 10-15 with a beam size of 4, and the sequence length is 15-20 on average.

Probably the best optimization you can do is to run multiple batches.

@Oxi84

Oxi84 commented Jan 18, 2023

I tried another CPU, and now it is 2x slower than PyTorch (without quantization) with the same settings as above:
batch sizes of around 10-15, a beam size of 4, and an average sequence length of 15-20.

@Oxi84

Oxi84 commented Jan 20, 2023

It does work faster when using smaller batches and fewer cores. It is probably optimal to divide the CPU cores by setting the PyTorch thread count and then run several separate Flask servers as the interface, since fastT5 works amazingly well on one core but a second core does not give any speedup.
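The idea of one thread per worker can be sketched as follows. This is a sketch under assumptions: `plan_workers` and `limit_threads` are hypothetical helpers, and how the returned session options would be passed into fastT5's own session creation is not shown here.

```python
import os


def plan_workers(total_cores, threads_per_worker=1):
    """Return how many worker processes fit on total_cores."""
    if threads_per_worker < 1:
        raise ValueError("threads_per_worker must be >= 1")
    return max(1, total_cores // threads_per_worker)


def limit_threads(num_threads):
    """Cap PyTorch threads and build matching ONNX Runtime session options."""
    import onnxruntime as ort  # lazy imports: both packages must be installed
    import torch

    torch.set_num_threads(num_threads)
    options = ort.SessionOptions()
    options.intra_op_num_threads = num_threads
    return options  # pass as sess_options= when creating an InferenceSession


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print(f"{cores} cores -> {plan_workers(cores)} workers with 1 thread each")
```

Each worker process would call `limit_threads(1)` once at startup and then serve requests independently.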

@jayiitp

jayiitp commented Jun 10, 2023

@ierezell It would be helpful if you could share your ort_settings.py script for GPU. I tried the method you described, but I am getting an error.
