"Invalid picture to process" exception when running on a GPU + Docker without SGBLUR_GPUS #41

Open
zorun opened this issue Apr 2, 2024 · 2 comments


zorun commented Apr 2, 2024

When running through Docker on CPU, the API works and can blur images.

However, when running with a GPU:

docker build -t sgblur:latest .
docker run --gpus all --rm -ti -p 8001:8001 sgblur:latest

and sending a picture to blur, it fails:

original 3649922
INFO:     172.20.192.9:50476 - "POST /blur/ HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
...
  File "/opt/blur/src/api.py", line 28, in blur_picture
    raise HTTPException(status_code=400, detail="Invalid picture to process")
NameError: name 'HTTPException' is not defined. Did you mean: 'BaseException'?

There are two bugs:

  • the exception is not correctly imported (a minor issue; a one-line fix is sketched below)
  • the real issue is that sgblur fails to process this image on a GPU, while the same image works on CPU
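
For the first bug, the fix should just be a one-line import in src/api.py, assuming the API is built on FastAPI (the status_code/detail signature matches FastAPI's HTTPException):

from fastapi import HTTPException  # without this import, raising it produces the NameError above

raise HTTPException(status_code=400, detail="Invalid picture to process")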

This is on a host with 2 x Nvidia Tesla P100-PCIE-16GB running Debian 11. nvidia-smi output on the host:

Tue Apr  2 16:28:22 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           On  | 00000000:3B:00.0 Off |                    0 |
| N/A   35C    P0              26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           On  | 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P0              26W / 250W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

zorun commented Apr 2, 2024

After removing the catch-all exception handling in blurPicture, I get this traceback:

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 315, in _lazy_init
    queued_call()
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 183, in _check_capability
    capability = get_device_capability(d)
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
    prop = get_device_properties(device)
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 457, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
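
So the assert fires inside PyTorch's lazy CUDA initialization: the queued capability check walks the devices PyTorch enumerated, but the CUDA context reports fewer GPUs, so the query for device=1 is out of range. A minimal probe from inside the container (my own diagnostic, not sgblur code):

import torch

# Diagnostic only: compare the device count PyTorch enumerates
# inside the container with what CUDA initialization accepts.
print(torch.cuda.device_count())
torch.cuda.init()  # forces the lazy initialization that fails above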


zorun commented Apr 2, 2024

If I define SGBLUR_GPUS, then everything works fine:

docker run --gpus all -e SGBLUR_GPUS=2 --rm -ti -p 8001:8001 sgblur:latest
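
Presumably SGBLUR_GPUS tells sgblur how many devices to use, and the default guesses wrong inside the container. A hypothetical fallback (I haven't checked sgblur's source; the names are illustrative) would ask PyTorch instead of requiring the variable:

import os
import torch

# Hypothetical sketch: honor SGBLUR_GPUS when set, otherwise fall back
# to the number of devices PyTorch can actually see.
num_gpus = int(os.environ.get("SGBLUR_GPUS", torch.cuda.device_count()))
devices = [torch.device(f"cuda:{i}") for i in range(num_gpus)]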

@zorun zorun changed the title "Invalid picture to process" exception when running on a GPU + Docker "Invalid picture to process" exception when running on a GPU + Docker without SGBLUR_GPUS Apr 2, 2024