Vision Recognition Service with Flask and service streamer

In this post, we will show how to add batching to a deep learning service so that it can handle a large number of concurrent requests.

Start from PyTorch tutorial

Following the PyTorch tutorial Deploying PyTorch and Building a REST API using Flask, we can build a simple vision recognition service with Flask (example_vision/app.py).
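
Under the hood, app.py is essentially the tutorial code behind a /predict endpoint. A condensed sketch is shown below (our paraphrase of the tutorial; details such as device handling and the port may differ slightly from the actual example_vision/app.py):

import io
import json

import torchvision.transforms as transforms
from torchvision import models
from PIL import Image
from flask import Flask, jsonify, request

app = Flask(__name__)
device = 'cuda'  # the example serves the model from GPU

# ImageNet label map and a pretrained DenseNet, as in the PyTorch tutorial
imagenet_class_index = json.load(open('imagenet_class_index.json'))
model = models.densenet121(pretrained=True).to(device)
model.eval()

def transform_image(image_bytes):
    # Standard ImageNet preprocessing: resize, center-crop, normalize
    my_transforms = transforms.Compose([
        transforms.Resize(255),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    image = Image.open(io.BytesIO(image_bytes))
    return my_transforms(image).unsqueeze(0)

def get_prediction(image_bytes):
    # Single-image inference: one forward pass per HTTP request
    tensor = transform_image(image_bytes).to(device)
    outputs = model.forward(tensor)
    _, y_hat = outputs.max(1)
    return imagenet_class_index[str(y_hat.item())]

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        img_bytes = request.files['file'].read()
        class_id, class_name = get_prediction(img_bytes)
        return jsonify({'class_id': class_id, 'class_name': class_name})

if __name__ == '__main__':
    app.run(port=5005)  # matched to the curl/wrk commands below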

Start the web service with the Flask development server

cd example_vision
python app.py

Post the cat.jpg image to your server, and you'll see it is recognized as a tabby

curl -F "file=@cat.jpg" http://localhost:5005/predict
{"class_id":"n02123045","class_name":"tabby"}

Wait a minute!

Before putting it into production, we need to solve two issues:

  • Only one request is served at a time, which is much slower than local batch prediction
  • Large numbers of concurrent requests will cause CUDA out-of-memory errors on the GPU

Let's try to trigger these issues with wrk, a professional HTTP benchmarking tool.

# first build wrk yourself following the official guide
...

# post requests to your API with 128 concurrent connections
./wrk -c 128 -d 20s --timeout=20s -s file.lua http://127.0.0.1:5005/predict
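
If building wrk is not convenient, a rough load generator like the following (our own sketch, not one of the repo's benchmark scripts) fires the same kind of concurrent traffic and triggers the same behavior:

import requests
from concurrent.futures import ThreadPoolExecutor

URL = 'http://127.0.0.1:5005/predict'
IMG = open('cat.jpg', 'rb').read()

def hit(_):
    # Each worker posts the same image as a multipart form field
    return requests.post(URL, files={'file': ('cat.jpg', IMG)}).status_code

# 128 concurrent clients, 512 requests in total
with ThreadPoolExecutor(max_workers=128) as pool:
    codes = list(pool.map(hit, range(512)))
print(codes.count(200), 'succeeded out of', len(codes))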

Boom! CUDA out of memory.

  File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torchvision/models/densenet.py", line 34, in forward
    new_features = super(_DenseLayer, self).forward(x)
  File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/liuxin/nlp/venv/lib/python3.6/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.91 GiB total capacity; 10.76 GiB already allocated; 3.94 MiB free; 197.50 MiB cached)

Let's lower our concurrency expectations.

./wrk -c 64 -d 20s --timeout=20s -s file.lua http://127.0.0.1:5005/predict 
Running 20s test @ http://127.0.0.1:5005/predict
  2 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.45s     1.61s    7.14s    81.40%
    Req/Sec    18.51     14.80   101.00     76.97%
  344 requests in 20.05s, 64.50KB read
Requests/sec:     17.16
Transfer/sec:      3.22KB

With 64 concurrent connections, the QPS is 17 and the average latency is 3.45s. In other words, only 17 requests are served per second when there are 64 concurrent users, and each of them waits 3.45 seconds on average.

ServiceStreamer to rescue

To solve the issues above, we need to queue incoming requests into batches and schedule the prediction step. The service_streamer middleware is here to help, and it solves these issues with just a couple of lines of code.

ServiceStreamer is a middleware for machine learning web services. Queued requests from users are sampled into mini-batches, which significantly improves the overall performance of the web server by increasing GPU utilization.

Install ServiceStreamer with pip

pip install service_streamer 

Boost your service in 3 minutes

In this part, we will add ServiceStreamer to our Flask API server to boost the overall performance of the system. All source code and benchmark scripts are in example_vision.

First, define a batch_prediction function that handles a batch of images (example_vision/model.py).

def batch_prediction(image_bytes_batch):
    # transform_image, model, device and imagenet_class_index are defined earlier
    # in model.py, following the PyTorch tutorial code
    image_tensors = [transform_image(image_bytes=image_bytes) for image_bytes in image_bytes_batch]
    # Stack the whole batch into one tensor and classify it in a single forward pass
    tensor = torch.cat(image_tensors).to(device)
    outputs = model.forward(tensor)
    _, y_hat = outputs.max(1)
    predicted_ids = y_hat.tolist()
    return [imagenet_class_index[str(i)] for i in predicted_ids]
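
Before wiring it into Flask, you can sanity-check the batching function directly on a list of image bytes (a throwaway snippet of ours, not part of the repo):

from model import batch_prediction

with open('cat.jpg', 'rb') as f:
    img_bytes = f.read()

# A "batch" of two identical images should yield two identical predictions
print(batch_prediction([img_bytes, img_bytes]))
# e.g. [['n02123045', 'tabby'], ['n02123045', 'tabby']]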

Then upgrade the predict API to stream_predict: wrap the batch_prediction function with service_streamer and call streamer.predict to serve user requests.

from flask import jsonify, request
from model import batch_prediction
from service_streamer import ThreadedStreamer

streamer = ThreadedStreamer(batch_prediction, batch_size=64)
 
@app.route('/stream_predict', methods=['POST'])
def stream_predict():
    if request.method == 'POST':
        file = request.files['file']
        img_bytes = file.read()
        class_id, class_name = streamer.predict([img_bytes])[0]
        return jsonify({'class_id': class_id, 'class_name': class_name})
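
Besides batch_size, ThreadedStreamer also takes a max_latency argument that bounds how long a queued request waits for its batch to fill before the model runs on a partial batch; the value below is just illustrative.

streamer = ThreadedStreamer(batch_prediction, batch_size=64, max_latency=0.1)  # flush a partial batch after 0.1s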

Start your server as before and test it.

python app.py

curl -F "file=@cat.jpg" http://localhost:5005/stream_predict
{"class_id":"n02123045","class_name":"tabby"}

Finally, let's run the API benchmark again with wrk

./wrk -c 128 -d 20s --timeout=20s -s file.lua http://127.0.0.1:5005/stream_predict
Running 20s test @ http://127.0.0.1:5005/stream_predict
  2 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.92s   236.22ms   3.00s    94.22%
    Req/Sec    97.09     99.71   340.00     75.79%
  1245 requests in 20.06s, 233.58KB read
Requests/sec:     62.07
Transfer/sec:     11.65KB

You get it!

With 128 concurrent connections, the QPS goes up to 62 (3.6x the throughput) and the average latency drops to 1.92s (1.8x faster). Most importantly, CUDA is safe and sound.

What's More?