torch_cluster.radius is not compatible with the CUDA graph #123

Open
raimis opened this issue Feb 22, 2022 · 4 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

raimis commented Feb 22, 2022

torch_cluster.radius is not compatible with CUDA graphs (https://pytorch.org/docs/stable/notes/cuda.html#cuda-graphs):

import torch as pt
from torch_cluster import radius

device = pt.device('cuda')
x = pt.tensor([0.0], device=device)
y = pt.tensor([1.0], device=device)

graph = pt.cuda.CUDAGraph()
with pt.cuda.graph(graph):
    radius(x, y, r=2)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [26], in <module>
      8 with pt.cuda.graph(graph):
----> 9     radius(x, y, r=2)

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/shared2/raimis/opt/miniconda/envs/torchmd-net/lib/python3.9/site-packages/torch_cluster/radius.py", line 72, in radius
        torch.cumsum(deg, 0, out=ptr_y[1:])

    return torch.ops.torch_cluster.radius(x, y, ptr_x, ptr_y, r,
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                                          max_num_neighbors, num_workers)
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rusty1s added the enhancement and help wanted labels on Feb 23, 2022

rusty1s (Owner) commented Feb 23, 2022

Thanks for reporting. I need to look into this. Any help or reference is highly appreciated as I'm pretty unfamiliar with CUDA graphs.

@RaulPPelaez

In order to "compile" a piece of code into a CUDA graph, one must first run it under what is called "capture mode": basically a dry run in which CUDA just registers the kernel launches and their arguments.
Once the graph is captured, it can be replayed in a more efficient manner. The replay always works on the same memory addresses, so the idea is that one overwrites the inputs in place before each replay (a minimal capture/replay sketch follows the list below).

This piece of code must comply with a series of restrictions in order to be CUDA-graph compatible, broadly speaking:

  1. Static shapes (and, more restrictively, kernels must work on the same addresses every time)
  2. Static control flow
  3. No operations that result in a CPU-GPU sync

More information is available in the link provided by @raimis.
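For reference, here is a minimal capture/replay sketch following the pattern from the PyTorch docs linked above; the work function and the buffer size are illustrative placeholders, not anything from torch_cluster:

import torch

device = torch.device('cuda')

# Static buffers: every replay reads from and writes to these same addresses.
static_x = torch.zeros(1024, device=device)
static_out = torch.zeros(1024, device=device)

def work(x):
    # Placeholder computation; shapes and control flow must stay static.
    return torch.sin(x) * 2.0

# Warm up on a side stream so lazy initializations happen outside capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out.copy_(work(static_x))
torch.cuda.current_stream().wait_stream(s)

# Capture: kernel launches are recorded into the graph instead of being run normally.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out.copy_(work(static_x))

# Replay: overwrite the static input in place, then launch the whole graph at once.
static_x.copy_(torch.randn(1024, device=device))
g.replay()
print(static_out[:5])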

In the particular case of the radius kernel, there are several things preventing the function from being captured. For instance, this code suffers from a CPU-GPU sync:

batch_size = 1
if batch_x is not None:
    assert x.size(0) == batch_x.numel()
    batch_size = int(batch_x.max()) + 1  # int(tensor) calls .item(), forcing a device-to-host sync
if batch_y is not None:
    assert y.size(0) == batch_y.numel()
    batch_size = max(batch_size, int(batch_y.max()) + 1)

You could rewrite it as:

batch_size = torch.tensor(1, device=x.device)  # keep it on the same device as the inputs so torch.max below never mixes CPU and GPU tensors
if batch_x is not None:
    assert x.size(0) == batch_x.numel()
    batch_size = batch_x.max().to(dtype=torch.int) + 1
if batch_y is not None:
    assert y.size(0) == batch_y.numel()
    batch_size = torch.max(batch_size, batch_y.max().to(dtype=torch.int) + 1)

Additionally, this line is not static control flow:

if batch_size > 1:

since batch_size depends on the contents of the input tensors.
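For illustration only (this is not torch_cluster code): evaluating a Python if on a CUDA tensor implicitly converts it to a host bool, which is exactly the kind of sync that capture forbids:

import torch

batch_size = torch.tensor(3, device='cuda')

# bool(batch_size > 1) calls .item() under the hood: a device-to-host copy that
# synchronizes the stream. Inside a CUDA-graph capture this raises
# "operation not permitted when stream is capturing".
if batch_size > 1:
    print('data-dependent branch taken')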

@raimis' example does not hit this code path, since batch_x/batch_y are not being passed. For his example, I believe this line is the culprit:

return torch::stack({row.masked_select(mask), col.masked_select(mask)}, 0);

masked_select requires a host-device synchronization, because the size of its output depends on how many entries of the mask are true.
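One common capture-friendly pattern (just a sketch under my own assumptions, not the library's implementation; the -1 sentinel is arbitrary) is to keep the output shape static and mark invalid pairs instead of compacting them:

import torch

device = torch.device('cuda')
row = torch.tensor([0, 0, 1, 1], device=device)
col = torch.tensor([0, 1, 0, 1], device=device)
mask = torch.tensor([True, False, True, True], device=device)

# Dynamic shape (needs a sync): the host must learn mask.sum() to size the output.
# edge_index = torch.stack([row.masked_select(mask), col.masked_select(mask)], 0)

# Static shape (capture-friendly): keep every slot, mark invalid pairs with -1,
# and let the caller filter them outside the captured region.
edge_index = torch.stack([row, col], 0)
edge_index = torch.where(mask, edge_index, torch.full_like(edge_index, -1))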

rusty1s (Owner) commented Mar 1, 2023

Thanks for this insightful issue. Are you interested in fixing the aforementioned problems? Would that be straightforward to integrate?

@RaulPPelaez

We use this functionality in several places, so I am eager to help. I am really new to torch, though, so my torch-fu is not very good -.-
I am trying to make it CUDA-graph compatible without modifying the current interface, so if I get somewhere I will open a PR. I wanted to share some insight in case you saw this as an easy modification.
