pyg::subgraph
CUDA implementation
#42
base: master
Conversation
Codecov Report
@@           Coverage Diff           @@
##           master      #42   +/-   ##
=======================================
  Coverage   99.48%   99.48%
=======================================
  Files           9        9
  Lines         195      195
=======================================
  Hits          194      194
  Misses          1        1
=======================================

Continue to review full report at Codecov.
const auto to_local_node_data = to_local_node.data_ptr<scalar_t>();
auto deg_data = deg.data_ptr<scalar_t>();

// Compute induced subgraph degree, parallelize with 32 threads per node:
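For context, a minimal sketch of what a warp-per-node degree kernel of this shape could look like. The pointer names (`rowptr_data`, `col_data`, `to_local_node_data`, `deg_data`) are taken from the diff; `nodes_data`, the kernel name, and the signature are assumptions for illustration, not the PR's actual implementation:

```cuda
#include <cstdint>

// Hedged sketch: one warp (32 threads) per sampled node counts the neighbors
// that are also part of the induced subgraph. Assumes blockDim.x is a
// multiple of 32 so that all lanes of a warp handle the same node.
__global__ void subgraph_deg_kernel(const int64_t *rowptr_data,
                                    const int64_t *col_data,
                                    const int64_t *to_local_node_data,
                                    const int64_t *nodes_data,  // assumed: sampled global node ids
                                    int64_t *deg_data,
                                    int64_t num_sampled_nodes) {
  const int64_t thread_id = blockIdx.x * int64_t(blockDim.x) + threadIdx.x;
  const int64_t n = thread_id >> 5;  // warp index == local node index
  const int lane = thread_id & 31;   // lane index within the warp
  if (n >= num_sampled_nodes)
    return;

  const int64_t v = nodes_data[n];   // global index of the sampled node
  int64_t count = 0;
  // Consecutive lanes read consecutive entries of col_data (coalesced access):
  for (int64_t e = rowptr_data[v] + lane; e < rowptr_data[v + 1]; e += 32)
    count += (to_local_node_data[col_data[e]] >= 0);

  // Warp-level reduction of the per-lane counts:
  for (int offset = 16; offset > 0; offset >>= 1)
    count += __shfl_down_sync(0xffffffff, count, offset);
  if (lane == 0)
    deg_data[n] = count;
}
```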
I'm actually not sure it is necessary to parallelize with 32 threads per node. Most of the time we are dealing with sparse data, and many threads will never enter the for loop.
If you are looking for extreme performance, you can bundle to_local_node_data and col_data into one iterator structure and use this function; I haven't seen anything perform better in the past:
https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html#a4854a13561cb66d46aa617aab16b8825
Do you have an example of bundling to_local_node_data and col_data into one iterator structure? This looks really interesting.
I am okay with dropping the warp-level parallelism for now, but we would lose contiguous access to col_data and probably under-utilize the number of threads available on modern GPUs.
On second look, this doesn't seem possible, since col_data refers to edges while to_local_node_data refers to nodes, and we actually want to do the computation across the number of nodes in the induced subgraph.
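For what it's worth, here is a hedged sketch of how such "bundled" iterators could be restricted to the sampled nodes: a cub::TransformInputIterator looks up to_local_node_data from col_data, and the per-node segment offsets are derived from rowptr_data and nodes_data. The helper name induced_degree, its signature, and the functors are hypothetical and not part of the PR:

```cuda
#include <cstdint>
#include <cub/cub.cuh>

// Maps an edge's column index to 1 if its endpoint is in the induced
// subgraph (to_local_node[col] >= 0), and to 0 otherwise.
struct InSubgraphOp {
  const int64_t *to_local_node;
  __host__ __device__ int64_t operator()(const int64_t &col) const {
    return to_local_node[col] >= 0 ? int64_t(1) : int64_t(0);
  }
};

// Maps a sampled node's local index i to the begin (shift = 0) or end
// (shift = 1) offset of its CSR row in the full graph.
struct RowOffsetOp {
  const int64_t *rowptr;
  const int64_t *nodes;
  int64_t shift;
  __host__ __device__ int64_t operator()(const int64_t &i) const {
    return rowptr[nodes[i] + shift];
  }
};

// Hypothetical helper: induced-subgraph degree of each sampled node, with one
// segmented reduction over the sampled nodes' CSR rows.
void induced_degree(const int64_t *rowptr_data, const int64_t *col_data,
                    const int64_t *to_local_node_data,
                    const int64_t *nodes_data, int64_t *deg_data,
                    int num_sampled_nodes, cudaStream_t stream = 0) {
  // "Bundle" col_data and to_local_node_data into a single input iterator:
  cub::TransformInputIterator<int64_t, InSubgraphOp, const int64_t *> in(
      col_data, InSubgraphOp{to_local_node_data});

  // Segment offsets: [rowptr[nodes[i]], rowptr[nodes[i] + 1]) per sampled node:
  cub::CountingInputIterator<int64_t> idx(0);
  cub::TransformInputIterator<int64_t, RowOffsetOp,
                              cub::CountingInputIterator<int64_t>>
      seg_begin(idx, RowOffsetOp{rowptr_data, nodes_data, 0}),
      seg_end(idx, RowOffsetOp{rowptr_data, nodes_data, 1});

  // First call queries the temporary storage size, second call reduces:
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, in, deg_data,
                                  num_sampled_nodes, seg_begin, seg_end,
                                  stream);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes, in, deg_data,
                                  num_sampled_nodes, seg_begin, seg_end,
                                  stream);
  cudaFree(d_temp);
}
```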
// We maintain an O(N) vector to map global node indices to local ones.
// TODO Can we do this without the O(N) storage requirement?
const auto to_local_node = nodes.new_full({rowptr.size(0) - 1}, -1);
Does N mean the number of nodes in the graph?
What if we filtered each node in nodes_data instead, since it should be much smaller than rowptr_data?
Otherwise, we may consider caching this tensor to avoid re-allocating memory each time.
Good points! We use this vector as the mapping from global node indices to new local ones. In C++, we use a map for this but can't do the same in CUDA. I don't know of a more elegant solution for this.
Caching is an option as well, but requires a (non-intuitive and backend-specific) change in input arguments. I added it as a TODO for now.
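For illustration, a minimal sketch of how such a global-to-local mapping is typically filled on the GPU, assuming nodes_data holds the sampled global node indices and to_local_node_data was initialized to -1 as in the diff above; the kernel name and the launch configuration are hypothetical:

```cuda
#include <cstdint>

// Scatter kernel: for each sampled node i, record its local index i at the
// position of its global index, so that to_local_node_data[global] == local
// for sampled nodes and stays -1 (the fill value) everywhere else.
__global__ void build_to_local_node(const int64_t *nodes_data,
                                    int64_t *to_local_node_data,
                                    int64_t num_sampled_nodes) {
  const int64_t i = blockIdx.x * int64_t(blockDim.x) + threadIdx.x;
  if (i < num_sampled_nodes)
    to_local_node_data[nodes_data[i]] = i;
}

// Hypothetical launch:
// const int threads = 256;
// const int blocks = (num_sampled_nodes + threads - 1) / threads;
// build_to_local_node<<<blocks, threads>>>(nodes_data, to_local_node_data,
//                                          num_sampled_nodes);
```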
There are GPU hash tables/sets, which may require some atomic operations when you build them, but lookup is fast.
I found that caching is not a good option since you have to reset the array every time.
Since we can sample on the GPU, the graph is not that big, so a node array is not that bad and keeps the code less complicated.
Let's see if we can improve it later.