❓ [Question] Why is BERT Base slower with Torch-TensorRT than native PyTorch? #830
Comments
End-to-end compilation: change

…

to

…

Once you make this change, please compile and re-install transformers. Regenerate a fresh copy of the BERT model using the exporter: https://huggingface.co/docs/transformers/serialization#using-torchscript-in-python. With this change, you should be able to convert the entire BERT model into a TensorRT engine (you can verify this by setting …).
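For reference, a minimal sketch of the TorchScript export path described in the linked Hugging Face docs (the model name and example input here are illustrative, not from the original comment):

```python
# Sketch of the TorchScript export from the linked HF docs; inputs are illustrative.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# torchscript=True makes the model return tuples, which torch.jit.trace requires.
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True).eval()

encoded = tokenizer("Hello, world!", return_tensors="pt")
traced = torch.jit.trace(model, (encoded["input_ids"], encoded["attention_mask"]))
torch.jit.save(traced, "bert_traced.pt")
```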
Just to note, you don't necessarily need to reinstall transformers; you can just patch this in your installed library, since it's just a change in the Python code.
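For illustration, one way to find the installed file to patch in place (assuming the change lives in `modeling_bert.py`; the module path may differ across transformers versions):

```python
# Locate the installed copy of the BERT modeling code so it can be edited in place.
import transformers.models.bert.modeling_bert as modeling_bert
print(modeling_bert.__file__)
```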
Hi @peri044 @narendasan, thank you so much for the suggestions, and sorry for my late reply. I edited … Could you please share your running code / configuration so that I can try to figure out what's wrong with my code? Here's my code:

…
Also, I got a lot of warnings during inference, but they seem to be ignorable:

…
I'm using the latest commit:
…
Hi @void-main, I tried your script (#830 (comment)) on the master branch and observed the following logs:

…

I used a Titan V card on my host machine. Are you saying that, on your end, compiling results in higher latency than before compiling?
@andi4191 I also hit the "slower" problem:

```python
import time

import torch
import torch.nn as nn
import torch_tensorrt  # assumed: registers the "torch_tensorrt" backend for torch.compile

transformer = nn.TransformerEncoderLayer(512, 8).eval().cuda()
input_ = torch.rand((10, 10, 512), dtype=torch.float).to("cuda")

# Eager baseline.
for i in range(10):
    start_time = time.time()
    transformer(input_)
    torch.cuda.synchronize()
    print("eager:", time.time() - start_time)

# The original kwargs were lost from this comment; these values are assumptions.
compilation_kwargs = {"enabled_precisions": {torch.float}}
optimized_model = torch.compile(
    transformer, backend="torch_tensorrt", options=compilation_kwargs
)

# Compiled model (the first iteration includes compilation time).
for i in range(10):
    start_time = time.time()
    optimized_model(input_)
    torch.cuda.synchronize()
    print("compiled:", time.time() - start_time)
```
❓ Question
I'm trying to optimize Hugging Face's BERT Base uncased model using Torch-TensorRT. The code works after disabling full compilation (`require_full_compilation=False`), and the average latency is ~10ms on a T4. However, it is slower than the native PyTorch implementation (~6ms on a T4). In contrast, running the same model with `trtexec` only takes ~4ms, so for BERT Base this is 2.5x slower than TensorRT. I wonder if this is expected? Here's the full code:
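The full script was not preserved in this thread; the following is a hedged reconstruction of the kind of pipeline being discussed, using the Torch-TensorRT 1.x TorchScript frontend. Apart from `require_full_compilation=False`, which is quoted above, the model name, shapes, and precisions are assumptions:

```python
# Hedged reconstruction; the original code was not preserved in this thread.
import torch
import torch_tensorrt
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased", torchscript=True).eval().cuda()

batch, seq = 1, 128  # assumed shapes
input_ids = torch.randint(0, 30522, (batch, seq), dtype=torch.long).cuda()
attention_mask = torch.ones((batch, seq), dtype=torch.long).cuda()

# Trace with int64 inputs; truncate_long_and_double lets TensorRT run them as int32.
traced = torch.jit.trace(model, (input_ids, attention_mask))

trt_model = torch_tensorrt.compile(
    traced,
    inputs=[
        torch_tensorrt.Input(shape=(batch, seq), dtype=torch.int32),
        torch_tensorrt.Input(shape=(batch, seq), dtype=torch.int32),
    ],
    enabled_precisions={torch.float},
    truncate_long_and_double=True,
    require_full_compilation=False,
)

last_hidden_state = trt_model(input_ids.int(), attention_mask.int())[0]
```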
So, my question is: why is it slower than native PyTorch, and how do I tune it for better performance?
What you have already tried
I've checked the log from Torch-TensorRT; it looks like the model is partitioned into 3 parts, separated by the `at::Int` op, and it looks like the `Int` op is hard to implement. Next, I profiled the inference process with Nsight Systems; here's the screenshot:

…
It is expected to see the 3 divided segments; however, two things caught my attention:

…

Why did `cudaMemcpyAsync` take so long? Shouldn't it only return the `last_hidden_state` tensor?
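A likely explanation for the long `cudaMemcpyAsync`: CUDA kernel launches are asynchronous, so host-side waiting tends to be charged to the next blocking call, often the device-to-host copy, which can make that copy look disproportionately long in an Nsight trace. A minimal synchronization-aware timing sketch (`model` and `inputs` are placeholders, not from the original post):

```python
# Time inference with CUDA events so asynchronous kernel time is measured
# correctly instead of being attributed to the next blocking copy.
import torch

def time_inference(model, inputs, iters=100):
    with torch.no_grad():
        for _ in range(10):  # warm-up iterations
            model(*inputs)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            model(*inputs)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average latency in milliseconds
```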
Environment

How you installed PyTorch (conda, pip, libtorch, source): pip

Additional context