OnnxT5 slower than Pytorch #23
Comments
Same here.
Same here too.
Can you provide the device specifications and the code you are using to test the speed?
Hi, I'm seeing the same problem. It seems like the quantized ONNX version is faster than the PyTorch model when I run it with a batch size of 1; however, with a batch size of 32 it's much slower. I'm using fastT5==0.1.2, transformers==4.10.0 and pytorch==1.7.1. Is there something wrong with my setup?
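For reference, a minimal sketch of that kind of batch-size comparison (the model name, batch sizes, run counts, and generation options below are illustrative assumptions, not the commenter's actual script):

```python
from time import perf_counter

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "t5-small"  # hypothetical checkpoint; any T5 model works the same way
onnx_model = export_and_get_onnx_model(model_name)   # fastT5 export as used elsewhere in this thread
torch_model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def mean_generate_time(model, batch_size, n_runs=5):
    """Average generate() latency for a batch of identical inputs."""
    texts = ["translate English to German: I like to eat bananas."] * batch_size
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    start = perf_counter()
    for _ in range(n_runs):
        model.generate(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"],
                       num_beams=3, max_length=64)
    return (perf_counter() - start) / n_runs

for bs in (1, 32):
    print(f"batch {bs}: onnx {mean_generate_time(onnx_model, bs):.3f}s, "
          f"pytorch {mean_generate_time(torch_model, bs):.3f}s")
```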
In my case, even the quantized version is slower than PyTorch when the input sequence length is over 100 tokens. Not sure if this is expected.
I experienced this as well. I'm quite disappointed.
@JoeREISys I ran the same script in Colab and I'm getting the following results; maybe it's a device issue.
Hi there, first of all thanks a lot for this repo! I tried it (on GPU) and got 0.282 s for OnnxT5 versus 0.160 s for a HuggingFace pipeline, so ONNX was twice as slow. For OnnxT5 I followed the README:

```python
from time import perf_counter

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

DEFAULT_GENERATOR_OPTIONS = {
    'max_length': 128, 'min_length': 2, 'early_stopping': True,
    'num_beams': 3, 'temperature': 1.0, 'num_return_sequences': 3,
    'top_k': 50, 'top_p': 1.0, 'repetition_penalty': 2.0, 'length_penalty': 1.0
}

model_name = 'mrm8488/t5-base-finetuned-question-generation-ap'
model = export_and_get_onnx_model(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

t_input = "answer: bananas context: I like to eat bananas for breakfast"
token = tokenizer(t_input, return_tensors='pt')

start = perf_counter()
tokens = model.generate(input_ids=token['input_ids'],
                        attention_mask=token['attention_mask'],
                        **DEFAULT_GENERATOR_OPTIONS)
print(perf_counter() - start)

output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)
print(output)
```

HuggingFace pipeline:

```python
# note: the HuggingFace T5 model used for this comparison is loaded separately (not shown in the comment)
input_texts = ["answer: bananas context: I like to eat bananas for breakfast"]
inputs = tokenizer.batch_encode_plus(
    input_texts, return_tensors='pt', add_special_tokens=True,
    padding=True, truncation=True
)
inputs = inputs.to("cuda")

start = perf_counter()
output = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs["attention_mask"],
    **DEFAULT_GENERATOR_OPTIONS
)
print(perf_counter() - start)

all_sentences = tokenizer.batch_decode(output, skip_special_tokens=True)
print(all_sentences)
```

Conclusion: even with the ONNX optimizations and on GPU, the model is twice as slow. Thanks in advance for any help.
@Ki6an I was using R6i instances; I will retry with C6g and C6i instances on AWS. As a side note, could there be performance gains from exporting the generate implementation as a TorchScript module with traced encoder and decoder submodules, or can ONNX not handle traced submodules?
@ierezell It's mentioned in the documentation that this repo is not optimized for CUDA. GPU performance is expected to be the same as or worse than the HF pipeline on GPU. A GPU implementation is in progress, but the ONNX T5 optimizations don't exist yet for GPU.
@JoeREISys, I updated my comment; I had the same issue on CPU (I tried GPU on the off chance that it would improve things...).
Hi Ki6an,
For me it's the same thing: it is around 10 percent slower. I run with a batch size of around 10–15, a beam size of 4, and an average sequence length of 15–20. Probably the best optimization you can do is to run multiple batches (see the sketch below).
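If "run multiple batches" here means splitting one large request into several smaller generate() calls, a minimal sketch could look like this (chunk size and tokenizer options are illustrative):

```python
def generate_in_chunks(model, tokenizer, texts, chunk_size=4, **gen_kwargs):
    """Run generate() over small slices of the input list instead of one big batch."""
    results = []
    for i in range(0, len(texts), chunk_size):
        enc = tokenizer(texts[i:i + chunk_size], return_tensors="pt",
                        padding=True, truncation=True)
        out = model.generate(input_ids=enc["input_ids"],
                             attention_mask=enc["attention_mask"],
                             **gen_kwargs)
        results.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return results

# e.g. generate_in_chunks(model, tokenizer, input_texts, chunk_size=4, num_beams=4, max_length=64)
```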
I tried on another CPU and now it is 2x slower (without quantisation) than PyTorch, with the same settings as above.
It does work faster when using smaller batches and fewer cores. It is probably optimal to divide the CPU cores by setting the PyTorch thread count and then run a few separate Flask servers for the interface, since fastT5 works amazingly on one core but a second core does not give any speedup. A minimal sketch of the thread pinning is below.
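This sketch assumes the commenter's arrangement of one single-threaded worker process per core; how fastT5's internal ONNX Runtime sessions pick up these limits may vary by version:

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"   # set before heavy libraries create their thread pools

import torch
from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

torch.set_num_threads(1)              # keep the PyTorch side of generate() on one core

model_name = "t5-small"               # hypothetical checkpoint for illustration
model = export_and_get_onnx_model(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Launch one such process per CPU core, each behind its own Flask (or similar) endpoint,
# and fan requests out across them instead of letting a single process use all cores.
```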
@ierezell It would be helpful if you could share the ort_settings.py script for GPU. I tried the method you described but I am getting an error.
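This is not fastT5's actual ort_settings.py, just a generic sketch of what pointing ONNX Runtime at the GPU usually looks like (it assumes onnxruntime-gpu is installed and that the T5 parts have already been exported to .onnx files with hypothetical names):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Request the CUDA provider first and fall back to CPU if it is unavailable.
session = ort.InferenceSession(
    "t5-encoder.onnx",                # hypothetical path; fastT5's export uses its own file names
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())        # shows which provider was actually loaded
```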
Hi. I have created an OnnxT5 model (non-quantized) as shown in the README, but OnnxT5 is 10–20% slower than the original HuggingFace T5. Could you share how the latency difference shown in the repo was obtained? Thanks.
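Not the maintainer's actual benchmark, but a common way to measure such a latency gap is to warm up first and then average repeated runs on the same input, roughly like this (model name and run counts are illustrative):

```python
from time import perf_counter

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "t5-small"  # illustrative checkpoint
onnx_model = export_and_get_onnx_model(model_name)
hf_model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

enc = tokenizer("translate English to French: The house is wonderful.", return_tensors="pt")

def mean_latency(model, warmup=2, runs=10):
    """Average single-input generate() latency, excluding warm-up runs."""
    for _ in range(warmup):
        model.generate(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    start = perf_counter()
    for _ in range(runs):
        model.generate(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    return (perf_counter() - start) / runs

print("onnx:", mean_latency(onnx_model), "s  pytorch:", mean_latency(hf_model), "s")
```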