Performance of transformers pipeline #5338
Replies: 2 comments
-
What summarization model do you use? We have optimizations for BERT and similar transformer models. You can try the optimizer (https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers), which can generate an optimized model for inference. For parallel invocation, it is preferable to use one inference session per GPU, and to pin a session to CPU cores within one CPU socket. You will need a larger batch size to reach the best throughput within a given latency budget. For the model, you might consider lower precision (such as float16 or int8 quantization), and training a smaller model (3 or 6 layers instead of 12 or 24) for lower latency.
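A minimal sketch of what that might look like, assuming the summarization model has already been exported to ONNX; the file paths, head count, hidden size, and thread count are placeholders, not values from this thread:

```python
import onnxruntime
from onnxruntime.transformers import optimizer

# Run the transformer-specific graph optimizations offline
# ("model.onnx" is a placeholder for an exported model).
opt_model = optimizer.optimize_model(
    "model.onnx",
    model_type="bert",   # the optimizer targets BERT-like encoder graphs
    num_heads=12,
    hidden_size=768,
)
opt_model.convert_float_to_float16()           # optional lower precision, mainly useful on GPU
opt_model.save_model_to_file("model_opt.onnx")

# One inference session, limited to a fixed number of CPU threads
sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = 4          # placeholder: cores on one socket
session = onnxruntime.InferenceSession("model_opt.onnx", sess_options)
```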
-
We use the Transformers summarization pipeline (default BART-based model), ref:
https://huggingface.co/transformers/main_classes/pipelines.html#transformers.SummarizationPipeline
Sample code:
```python
from transformers import pipeline

# use BART in PyTorch
summarizer = pipeline("summarization")
summary = summarizer("Sam Shleifer writes the best docstring examples in the whole world.")
```
We can have 6-8 such summarizations happening in one invocation of the web app. That is where we are facing issues. We do not have a GPU-enabled Azure VM. We are trying to parallelize using Python's concurrent.futures library. I thought ONNX Runtime could help.
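Roughly the kind of parallel invocation being described, as a minimal sketch (the section texts and worker count are placeholders; since the pipeline runs CPU-bound PyTorch inference, Python threads may not reduce wall-clock time by much):

```python
from concurrent.futures import ThreadPoolExecutor
from transformers import pipeline

summarizer = pipeline("summarization")  # load once and reuse
sections = ["First section text ...", "Second section text ..."]  # placeholders

def summarize(text):
    return summarizer(text, max_length=130, min_length=30)[0]["summary_text"]

# Run the 6-8 section summaries concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = list(pool.map(summarize, sections))
```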
Thanks
-
All,
Is it possible to improve the performance of pipeline('summarization') using this technology? Currently we have a FastAPI-based Docker container where we do such summarization for 4 to 6 sections of an input document. Each one takes about 10 seconds, so over a minute in total, which is quite a lot for a web app. We tried parallel invocation, but that didn't cut the time down much. Just wondering if this tech can be used. We deploy the Docker container on an Azure VM. Thanks.
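For context, a minimal sketch of the kind of FastAPI endpoint described, with all sections passed to the pipeline as one batch; the route, request model, and generation lengths are placeholders:

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
summarizer = pipeline("summarization")  # loaded once at startup, not per request

class Document(BaseModel):
    sections: List[str]  # 4 to 6 section texts per document

@app.post("/summarize")
def summarize(doc: Document):
    # One batched call over all sections instead of one call per section
    results = summarizer(doc.sections, max_length=130, min_length=30)
    return {"summaries": [r["summary_text"] for r in results]}
```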