Performance of transformers pipeline #5338
Replies: 2 comments
-
What summarization model do you use? We have optimizations for BERT and similar transformer models. You can try the optimizer (https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers), which can generate an optimized model for inference. For parallel invocation, it is preferable to use one inference session per GPU, and to pin a session to CPU cores within one CPU socket. You will need a larger batch size to reach the best throughput within a given latency budget. For the model, you might consider lower precision (such as float16 or int8 quantization), and training a smaller model (3 or 6 layers instead of 12 or 24) for lower latency.
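A minimal sketch of what that might look like, assuming the summarization model has already been exported to ONNX; the file paths, head count, hidden size, and thread count are placeholders, not values from this thread:

```python
import onnxruntime
from onnxruntime.transformers import optimizer

# Run the transformer-specific graph optimizations offline
# ("model.onnx" is a placeholder for an exported model).
opt_model = optimizer.optimize_model(
    "model.onnx",
    model_type="bert",   # the optimizer targets BERT-like encoder graphs
    num_heads=12,
    hidden_size=768,
)
opt_model.convert_float_to_float16()           # optional lower precision, mainly useful on GPU
opt_model.save_model_to_file("model_opt.onnx")

# One inference session, limited to a fixed number of CPU threads
sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = 4          # placeholder: cores on one socket
session = onnxruntime.InferenceSession("model_opt.onnx", sess_options)
```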
-
We use the Transformers summarization pipeline (default BART-based model), ref:
https://huggingface.co/transformers/main_classes/pipelines.html#transformers.SummarizationPipeline
Sample code:
```python
from transformers import pipeline

# use BART in PyTorch
summarizer = pipeline("summarization")
summary = summarizer("Sam Shleifer writes the best docstring examples in the whole world.")
```
We can have 6-8 such summarizations happening in one invocation of the web app. That is where we are facing issues. We do not have a GPU-enabled Azure VM. We are trying to parallelize using Python's concurrent.futures library. I thought ONNX Runtime could help.
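Roughly the kind of parallel invocation being described, as a minimal sketch (the section texts and worker count are placeholders; since the pipeline runs CPU-bound PyTorch inference, Python threads may not reduce wall-clock time by much):

```python
from concurrent.futures import ThreadPoolExecutor
from transformers import pipeline

summarizer = pipeline("summarization")  # load once and reuse
sections = ["First section text ...", "Second section text ..."]  # placeholders

def summarize(text):
    return summarizer(text, max_length=130, min_length=30)[0]["summary_text"]

# Run the 6-8 section summaries concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = list(pool.map(summarize, sections))
```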
Thanks
-
All,
Is it possible to improve the performance of pipeline('summarization') using this technology? Currently we have a FastAPI-based Docker container where we do such summarization for 4 to 6 sections of an input document. Each one takes about 10 seconds, so over a minute in total, which is quite a lot for a web app. We tried parallel invocation, but that didn't cut the time down much. Just wondering if this tech can be used. We deploy the Docker container on an Azure VM. Thanks.
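For context, a minimal sketch of the kind of FastAPI endpoint described, with all sections passed to the pipeline as one batch; the route, request model, and generation lengths are placeholders:

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
summarizer = pipeline("summarization")  # loaded once at startup, not per request

class Document(BaseModel):
    sections: List[str]  # 4 to 6 section texts per document

@app.post("/summarize")
def summarize(doc: Document):
    # One batched call over all sections instead of one call per section
    results = summarizer(doc.sections, max_length=130, min_length=30)
    return {"summaries": [r["summary_text"] for r in results]}
```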