
Torch with Gunicorn + Flask API performance issue on Docker #3350

Open

yothinsaengs opened this issue Feb 18, 2025 · 3 comments

yothinsaengs commented Feb 18, 2025

I use Gunicorn as the web server for a Flask API and I see a performance issue compared with using Waitress as the web server. When I compute a matrix operation with numpy, there is no large difference in response time between Gunicorn and Waitress.

Numpy API

from flask import Flask, jsonify
import numpy as np
import torch

app = Flask(__name__)

@app.route('/numpy')
def _numpy():
    matrix_a = np.random.rand(640, 640, 3)  # Create a random array
    count = 0
    while count < 240:
        matrix_a = (matrix_a**2) % 7  # Element-wise squaring and modulo
        count += 1
    return jsonify({"message": "Hello, World!"})

But when I compute the same operation with torch (both with and without torch.no_grad):

Torch API

@app.route('/torch')
def _torch():
    matrix_a = torch.rand(640, 640, 3)  # Create a random tensor
    count = 0
    while count < 240:
        matrix_a = (matrix_a ** 2) % 7  # Element-wise squaring and modulo
        count += 1
    return jsonify({"message": "Hello, World!"})

Torch_no_grad API

@app.route('/torch_no_grad')
def _torch_ng():
    with torch.no_grad():
        matrix_a = torch.rand(640, 640, 3)  # Create a random tensor
        count = 0
        while count < 240:
            matrix_a = (matrix_a ** 2) % 7  # Element-wise squaring and modulo
            count += 1
    return jsonify({"message": "Hello, World!"})

there is a huge difference in response time once the container is limited to 2 CPUs:

limits:
  memory: 1g
  cpus: '8.0'
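
(For reference: a limits block like this normally sits under deploy.resources in a Compose file; the service name and port mapping in this sketch are assumptions, not taken from the report.)

services:
  gunicorn:
    build: .
    ports:
      - "8002:8002"
    deploy:
      resources:
        limits:
          memory: 1g
          cpus: '8.0'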

numpy
----------
waitress: Mean=1.1698s, Std=0.0300s
gunicorn: Mean=1.1715s, Std=0.0311s

torch
----------
waitress: Mean=0.9230s, Std=0.1078s
gunicorn: Mean=0.8869s, Std=0.1190s

torch_no_grad
----------
waitress: Mean=0.9172s, Std=0.1058s
gunicorn: Mean=0.8886s, Std=0.1126s

limits:
  memory: 1g
  cpus: '4.0'

numpy
----------
waitress: Mean=1.1876s, Std=0.0407s
gunicorn: Mean=1.1897s, Std=0.0390s

torch
----------
waitress: Mean=0.9502s, Std=0.1281s
gunicorn: Mean=0.9180s, Std=0.1288s

torch_no_grad
----------
waitress: Mean=0.9119s, Std=0.1063s
gunicorn: Mean=0.8678s, Std=0.1105s

limits:
  memory: 1g
  cpus: '2.0'

numpy
----------
waitress: Mean=1.1881s, Std=0.0494s
gunicorn: Mean=1.1835s, Std=0.0424s

torch
----------
waitress: Mean=0.7837s, Std=0.1328s
gunicorn: Mean=1.3097s, Std=0.0544s

torch_no_grad
----------
waitress: Mean=0.7932s, Std=0.0988s
gunicorn: Mean=1.3300s, Std=0.1083s
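
One factor worth checking when reproducing this (my assumption, not verified in this thread): torch sizes its intra-op thread pool from the visible CPU count, which inside a CPU-limited container can exceed the cgroup quota and lead to oversubscription. A minimal diagnostic sketch:

import os
import torch

# torch's default intra-op thread pool is derived from the visible CPU
# count, which does not account for the container's cpus limit
print("os.cpu_count():", os.cpu_count())
print("torch.get_num_threads():", torch.get_num_threads())

# pinning the pool to the container's CPU quota is a common mitigation
torch.set_num_threads(2)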

I evaluated this on a MacBook Air M2 with 16 GB RAM.

This is the client script that sends requests to Gunicorn and Waitress:

import asyncio
import httpx
import time  
from collections import defaultdict
import numpy as np 
N = 1
url_paths = ["numpy", "torch", "torch_no_grad"]
API_URLS = [
    "http://localhost:8001/",
    "http://localhost:8002/",
]
API_URLS_DICT = {
    "http://localhost:8001/": "waitress",
    "http://localhost:8002/": "gunicorn",
}


async def fetch(client, url, url_path):
    start_time = time.perf_counter()  # Start timing
    response = await client.get(url + url_path, timeout=20.0)

    end_time = time.perf_counter()  # End timing

    response_time = end_time - start_time  # Calculate response time
    return {
        "url": url,
        "status": response.status_code,
        "response_time": response_time,
        "data": response.json()
    }


async def main(url_path):
    async with httpx.AsyncClient() as client:
        # Pass url_path explicitly instead of relying on a module-level global
        tasks = [fetch(client, url, url_path) for url in API_URLS for _ in range(N)]
        results = await asyncio.gather(*tasks)

    return results

if __name__ == "__main__":
    repeat_time = 5
    for url_path in url_paths:
        count = defaultdict(list)
        print(url_path)
        print('----------')
        for _ in range(repeat_time):
y = asyncio.run(main(url_path))
            for x in y:
                count[API_URLS_DICT[x['url']]].append(x['response_time'])

        for k, v in count.items():
            v = np.array(v)
            print(f"{k}: Mean={v.mean():.4f}s, Std={v.std():.4f}s")

        print()
yothinsaengs changed the title from "Torch with Gunicorn+Flask api performance issue" to "Torch with Gunicorn + Flask API performance issue on Docker" on Feb 18, 2025
pajod (Contributor) commented Feb 19, 2025

Thanks for the detailed report.

How did you launch the test targets? Specifically, I am inquiring about the command lines containing the localhost:8001 (resp localhost:8002) listen address. I am assuming you are testing against Gunicorn 23.0 on Python 3.11, correct?

yothinsaengs (Author) replied:

> How did you launch the test targets? Specifically, I am inquiring about the command lines containing the localhost:8001 (resp localhost:8002) listen address. I am assuming you are testing against Gunicorn 23.0 on Python 3.11, correct?

The Python version is 3.10; here is the Dockerfile:

# Use official Python image
FROM python:3.10

# Set the working directory
WORKDIR /app

# Copy the application files
COPY app.py requirements.txt ./

# Install dependencies
RUN pip install -r requirements.txt
# Install curl for health check
RUN apt-get update && apt-get install -y curl  

# Expose port 8002
EXPOSE 8002

# Run the app with Gunicorn (use default worker count)
CMD ["gunicorn", "-b", "0.0.0.0:8002", "app:app"]

Note: there is no performance difference with or without the health check.
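
Worth noting: Gunicorn defaults to a single sync worker, so the CMD above runs one worker process. An explicit multi-worker variant would look like this (a sketch; 4 is an arbitrary count, not what was tested):

CMD ["gunicorn", "-b", "0.0.0.0:8002", "-w", "4", "app:app"]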

benoitc (Owner) commented Feb 21, 2025

Did you try the thread worker?
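
For reference, switching the Dockerfile above to Gunicorn's threaded worker would look roughly like this (a sketch; the thread count is an arbitrary choice):

CMD ["gunicorn", "-b", "0.0.0.0:8002", "-k", "gthread", "--threads", "4", "app:app"]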
