"Invalid picture to process" exception when running on a GPU + Docker without SGBLUR_GPUS #41

Open
zorun opened this issue Apr 2, 2024 · 2 comments


zorun commented Apr 2, 2024

When running through Docker on CPU, the API works and can blur images.

However, when running with a GPU:

docker build -t sgblur:latest .
docker run --gpus all --rm -ti -p 8001:8001 sgblur:latest

and sending a picture to blur, it fails:

original 3649922
INFO:     172.20.192.9:50476 - "POST /blur/ HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
...
  File "/opt/blur/src/api.py", line 28, in blur_picture
    raise HTTPException(status_code=400, detail="Invalid picture to process")
NameError: name 'HTTPException' is not defined. Did you mean: 'BaseException'?

There are two bugs:

  • the exception is not correctly imported (a minor issue; a one-line fix is sketched below)
  • the real issue is that sgblur fails to process this image on a GPU, while the same image works on CPU
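
For the first bug, the fix should just be a one-line import in src/api.py, assuming the API is built on FastAPI (the status_code/detail signature matches FastAPI's HTTPException):

from fastapi import HTTPException  # without this import, raising it produces the NameError above

raise HTTPException(status_code=400, detail="Invalid picture to process")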

This is on a host with 2 x Nvidia Tesla P100-PCIE-16GB running Debian 11. nvidia-smi output on the host:

Tue Apr  2 16:28:22 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           On  | 00000000:3B:00.0 Off |                    0 |
| N/A   35C    P0              26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           On  | 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P0              26W / 250W |      2MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

zorun commented Apr 2, 2024

After removing the catch-all exception handling in blurPicture, I get this traceback:

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 315, in _lazy_init
    queued_call()
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 183, in _check_capability
    capability = get_device_capability(d)
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
    prop = get_device_properties(device)
  File "/usr/local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 457, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
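
So the assert fires inside PyTorch's lazy CUDA initialization: the queued capability check walks the devices PyTorch enumerated, but the CUDA context reports fewer GPUs, so the query for device=1 is out of range. A minimal probe from inside the container (my own diagnostic, not sgblur code):

import torch

# Diagnostic only: compare the device count PyTorch enumerates
# inside the container with what CUDA initialization accepts.
print(torch.cuda.device_count())
torch.cuda.init()  # forces the lazy initialization that fails above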


zorun commented Apr 2, 2024

If I define SGBLUR_GPUS, then everything works fine:

docker run --gpus all -e SGBLUR_GPUS=2 --rm -ti -p 8001:8001 sgblur:latest
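
Presumably SGBLUR_GPUS tells sgblur how many devices to use, and the default guesses wrong inside the container. A hypothetical fallback (I haven't checked sgblur's source; the names are illustrative) would ask PyTorch instead of requiring the variable:

import os
import torch

# Hypothetical sketch: honor SGBLUR_GPUS when set, otherwise fall back
# to the number of devices PyTorch can actually see.
num_gpus = int(os.environ.get("SGBLUR_GPUS", torch.cuda.device_count()))
devices = [torch.device(f"cuda:{i}") for i in range(num_gpus)]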

@zorun zorun changed the title "Invalid picture to process" exception when running on a GPU + Docker "Invalid picture to process" exception when running on a GPU + Docker without SGBLUR_GPUS Apr 2, 2024