pyg::subgraph CUDA implementation #42

Open
wants to merge 11 commits into master
Conversation

rusty1s (Member) commented May 3, 2022:

No description provided.

codecov-commenter commented May 3, 2022:

Codecov Report

Merging #42 (9b4e6e6) into master (84874f1) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #42   +/-   ##
=======================================
  Coverage   99.48%   99.48%           
=======================================
  Files           9        9           
  Lines         195      195           
=======================================
  Hits          194      194           
  Misses          1        1           


Review thread on pyg_lib/csrc/utils/cuda/helpers.h (resolved):
const auto to_local_node_data = to_local_node.data_ptr<scalar_t>();
auto deg_data = deg.data_ptr<scalar_t>();

// Compute induced subgraph degree, parallelize with 32 threads per node:
Contributor:

I'm actually not sure it is necessary to parallelize with 32 threads per node. Most of the time we are dealing with sparse data, and many threads will never enter the for loop.

If you are looking for extreme performance, you can bundle to_local_node_data and col_data into one iterator structure and use this function; I haven't seen anything perform better in the past:
https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html#a4854a13561cb66d46aa617aab16b8825

rusty1s (Member Author):

Do you have an example of bundling to_local_node_data and col_data into one iterator structure? This looks really interesting.

I am okay with dropping the warp-level parallelism for now, but we would lose contiguous access to col_data and probably under-utilize the number of threads available on modern GPUs.

rusty1s (Member Author):

On a second look, this doesn't seem possible, since col_data refers to edges while to_local_node_data refers to nodes, and we actually want to do the computation across the nodes of the induced subgraph.

Comment on lines 52 to 54
// We maintain a O(N) vector to map global node indices to local ones.
// TODO Can we do this without O(N) storage requirement?
const auto to_local_node = nodes.new_full({rowptr.size(0) - 1}, -1);
ZenoTan (Member) commented May 3, 2022:

Does N mean the number of nodes in the graph?
What if we filtered each node in nodes_data instead, since it should be much smaller than rowptr_data?
Otherwise, we may consider caching this tensor to avoid re-allocating it every time.

rusty1s (Member Author) commented May 4, 2022:

Good points! We use this vector as the mapping from global node indices to new local ones. In C++ we use a map for this, but we can't do the same in CUDA. I don't know of a more elegant solution.

Caching is an option as well, but requires a (non-intuitive and backend-specific) change in input arguments. I added it as a TODO for now.

Member:

There are GPU hash tables/sets, which may require some atomic operations when you build them, but lookup is fast.
I found that caching is not a good option, since you would have to reset the array every time.

Since sampling happens on the GPU, the graph is not that big, so a node array is not that bad and keeps the code less complicated.

yaoyaowd (Contributor) left a comment:

Let's see if we can improve it later.

4 participants