pyg::subgraph CUDA implementation #42

Open
wants to merge 11 commits into master
Conversation

rusty1s (Member) commented May 3, 2022:

No description provided.

codecov-commenter commented May 3, 2022:

Codecov Report

Merging #42 (9b4e6e6) into master (84874f1) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #42   +/-   ##
=======================================
  Coverage   99.48%   99.48%           
=======================================
  Files           9        9           
  Lines         195      195           
=======================================
  Hits          194      194           
  Misses          1        1           


Review thread on pyg_lib/csrc/utils/cuda/helpers.h (resolved):
const auto to_local_node_data = to_local_node.data_ptr<scalar_t>();
auto deg_data = deg.data_ptr<scalar_t>();

// Compute induced subgraph degree, parallelize with 32 threads per node:
Contributor:

I'm actually not sure it is necessary to parallelize with 32 threads per node. Most of the time we are dealing with sparse data, and many threads will never enter the for loop.

If you are looking for extreme performance, you can bundle to_local_node_data and col_data into one iterator structure and use this function; I haven't seen anything perform better in the past:
https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html#a4854a13561cb66d46aa617aab16b8825

rusty1s (Member Author):

Do you have an example of bundling to_local_node_data and col_data into one iterator structure? This looks really interesting.

I am okay with dropping the warp-level parallelism for now, but we would lose contiguous access to col_data and probably under-utilize the number of threads available on modern GPUs.

rusty1s (Member Author):

On a second look, this doesn't seem possible, since col_data refers to edges while to_local_node_data refers to nodes, and we actually want to do the computation across the nodes of the induced subgraph.

Comment on lines 52 to 54
// We maintain a O(N) vector to map global node indices to local ones.
// TODO Can we do this without O(N) storage requirement?
const auto to_local_node = nodes.new_full({rowptr.size(0) - 1}, -1);
ZenoTan (Member) commented May 3, 2022:

Does N mean the number of nodes in the graph?
What if we filtered each node in nodes_data instead, since it should be much smaller than rowptr_data?
Otherwise, we may consider caching this tensor to avoid re-allocating it every time.

rusty1s (Member Author) commented May 4, 2022:

Good points! We use this vector as the mapping from global node indices to new local ones. In C++ we use a map for this, but we can't do the same in CUDA. I don't know of a more elegant solution.

Caching is an option as well, but requires a (non-intuitive and backend-specific) change in input arguments. I added it as a TODO for now.

Member:

There are GPU hash tables/sets, which may require some atomic operations when you build them, but lookup is fast.
I found that caching is not a good option, since you would have to reset the array every time.

Since sampling happens on the GPU, the graph is not that big, so a node array is not that bad and keeps the code less complicated.

yaoyaowd (Contributor) left a comment:

Let's see if we can improve it later.

4 participants