Vision Recognition Service with Flask and service streamer
In this post, we show how to empower a deep learning service with batching capabilities so that it can support a large number of concurrent requests.
Following the PyTorch tutorial Deploying PyTorch and Building a REST API using Flask, we can build a simple vision recognition service with Flask (example_vision/app.py).
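For reference, here is a minimal sketch of what the /predict endpoint in app.py looks like when you follow that tutorial. It assumes that helpers such as transform_image, model, device and imagenet_class_index are provided by model.py; the actual example_vision/app.py may differ in details.

# Sketch only: the helper names imported below are assumed to come from model.py,
# as in the PyTorch tutorial; they are not guaranteed to match the repo exactly.
from flask import Flask, jsonify, request
from model import device, imagenet_class_index, model, transform_image

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Read the uploaded image and classify a single image per request
    file = request.files['file']
    img_bytes = file.read()
    tensor = transform_image(image_bytes=img_bytes).to(device)
    outputs = model.forward(tensor)
    _, y_hat = outputs.max(1)
    class_id, class_name = imagenet_class_index[str(y_hat.item())]
    return jsonify({'class_id': class_id, 'class_name': class_name})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5005)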
Start the web service with the Flask development server:
cd example_vision
python app.py
Post the cat.jpg image to your server and you'll see that it's a tabby:
curl -F "file=@cat.jpg" http://localhost:5005/predict
{"class_id":"n02123045","class_name":"tabby"}
Before putting it into production, we need to solve two issues:
- Only one request is served at a time, which is much slower than local batch prediction
- Large numbers of concurrent requests will cause CUDA out-of-memory errors on the GPU
Let's try triggering these issues with wrk, a professional HTTP benchmarking tool.
# first build wrk yourself following the official guide
...
# post requests to your API with 128 concurrent connections
./wrk -c 128 -d 20s --timeout=20s -s file.lua http://127.0.0.1:5005/predict
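If you don't want to build wrk, a rough Python sketch like the one below (not part of example_vision; the URL, file name and helper are illustrative) can generate comparable concurrent load with requests and a thread pool. Either way, with 128 concurrent connections the result is the same:

# Hypothetical load generator, not in the repo: fire many concurrent uploads of cat.jpg
from concurrent.futures import ThreadPoolExecutor

import requests

URL = 'http://127.0.0.1:5005/predict'
CONCURRENCY = 128

def post_image(_):
    # Each call uploads cat.jpg once, just like the curl/wrk requests above
    with open('cat.jpg', 'rb') as f:
        return requests.post(URL, files={'file': f}).status_code

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    status_codes = list(pool.map(post_image, range(CONCURRENCY * 4)))

print({code: status_codes.count(code) for code in set(status_codes)})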
Boom! CUDA out of memory.
File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torchvision/models/densenet.py", line 34, in forward
new_features = super(_DenseLayer, self).forward(x)
File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
exponential_average_factor, self.eps)
File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1697, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.91 GiB total capacity; 10.76 GiB already allocated; 3.94 MiB free; 197.50 MiB cached)
Let's lower our concurrency expectations.
./wrk -c 64 -d 20s --timeout=20s -s file.lua http://127.0.0.1:5005/predict
Running 20s test @ http://127.0.0.1:5005/predict
2 threads and 64 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.45s 1.61s 7.14s 81.40%
Req/Sec 18.51 14.80 101.00 76.97%
344 requests in 20.05s, 64.50KB read
Requests/sec: 17.16
Transfer/sec: 3.22KB
You'll see that the QPS is 17 and the average latency is 3.45s at 64 concurrency. In other words, only 17 requests are served per second when there are 64 concurrent users, and each of them waits 3.45 seconds on average.
To solve the issues above, we need to queue incoming requests into batches and schedule the prediction process. The service_streamer middleware is here to help, and it solves these issues with a couple of lines of code.
ServiceStreamer is middleware for machine learning web services: queued requests from users are sampled into mini-batches, which significantly enhances the overall performance of the web server by improving GPU utilization.
Install ServiceStreamer with pip
pip install service_streamer
In this part, we will add ServiceStreamer to our Flask API server to boost the overall performance of the system. All source code and benchmark scripts are in example_vision.
First, define a batch_prediction function to handle a batch of images (example_vision/model.py).
def batch_prediction(image_bytes_batch):
    # Preprocess every image in the batch and stack them into one tensor
    image_tensors = [transform_image(image_bytes=image_bytes) for image_bytes in image_bytes_batch]
    tensor = torch.cat(image_tensors).to(device)
    # A single forward pass over the whole batch
    outputs = model.forward(tensor)
    _, y_hat = outputs.max(1)
    predicted_ids = y_hat.tolist()
    return [imagenet_class_index[str(i)] for i in predicted_ids]
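batch_prediction relies on some module-level setup in model.py. Below is a sketch of that setup following the PyTorch tutorial; the names transform_image, model, device and imagenet_class_index are assumed here, and the actual example_vision/model.py may differ in details.

# Sketch of the assumed model.py setup (following the PyTorch tutorial); the real file may differ.
import io
import json

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
imagenet_class_index = json.load(open('imagenet_class_index.json'))
model = models.densenet121(pretrained=True).to(device)
model.eval()

def transform_image(image_bytes):
    # Turn raw JPEG bytes into a normalized 1x3x224x224 tensor
    my_transforms = transforms.Compose([
        transforms.Resize(255),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    image = Image.open(io.BytesIO(image_bytes))
    return my_transforms(image).unsqueeze(0)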
Then upgrade the predict API to a stream_predict API: wrap the batch_prediction function with service_streamer and call streamer.predict on user requests.
from flask import jsonify, request
from model import batch_prediction
from service_streamer import ThreadedStreamer

streamer = ThreadedStreamer(batch_prediction, batch_size=64)

# app is the Flask app created earlier in app.py
@app.route('/stream_predict', methods=['POST'])
def stream_predict():
    if request.method == 'POST':
        file = request.files['file']
        img_bytes = file.read()
        class_id, class_name = streamer.predict([img_bytes])[0]
        return jsonify({'class_id': class_id, 'class_name': class_name})
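ThreadedStreamer collects concurrent requests (here, one image per request) into batches of up to 64 before calling batch_prediction, so a single GPU forward pass serves many users. For heavier workloads, the project's top-level README also documents a multiprocess Streamer; the sketch below uses parameter names as documented there, so verify them against your installed version.

# Sketch based on the service_streamer README: run batch_prediction behind a
# multiprocess Streamer; max_latency bounds how long a request waits for its batch to fill.
from service_streamer import Streamer

streamer = Streamer(batch_prediction, batch_size=64, max_latency=0.1)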
Start your server as before and test it.
python app.py
curl -F "file=@cat.jpg" http://localhost:5005/stream_predict
{"class_id":"n02123045","class_name":"tabby"}
Finally, let's benchmark the API again with wrk:
./wrk -c 128 -d 20s --timeout=20s -s file.lua http://127.0.0.1:5005/stream_predict
Running 20s test @ http://127.0.0.1:5005/stream_predict
2 threads and 128 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.92s 236.22ms 3.00s 94.22%
Req/Sec 97.09 99.71 340.00 75.79%
1245 requests in 20.06s, 233.58KB read
Requests/sec: 62.07
Transfer/sec: 11.65KB
There you go!
With 128 concurrency, the QPS is up to 62 (3.6x the throughput) and the average latency drops to 1.92s (1.8x faster). Most importantly, CUDA is safe and sound.