Concurrent multisource backwardpass #5206
Conversation
… centrality values
…to HH-betweenness-centrality
…g statements that were removed
… input array of size vertex_list and outputs array of same size vertex_list + use scatter to map array of size vertex_list back into global indices --> this fixed indexing issue
… and delta updates on frontier vertices only
… upper bound for each vertex frontier
…reduce kernel launches
…iteration of the delta update loop
Please ping me in Slack if you have any questions.
@@ -48,6 +49,9 @@
#include <thrust/sort.h>
#include <thrust/transform.h>

// Add CUB include for better sorting performance
I think this comment is overkill.
printf("DEBUG: GPU has %d SMs, target chunk size: %zu vertices\n", | ||
handle.get_device_properties().multiProcessorCount, | ||
approx_vertices_to_sort_per_iteration); |
Delete this before merging the code
// Sort vertices by vertex ID within each distance level using chunked CUB segmented sort
if (total_vertices > 0) {
  // Calculate target chunk size based on GPU hardware (like manager's code)
manager's code? You aren't writing comments here for yourself, but for others reading this code later.
Just saying
// Calculate target chunk size based on GPU hardware
will be sufficient
We are basically setting the chunk size large enough to saturate a GPU but small enough to avoid memory allocation failure.
if (total_vertices > 0) {
  // Calculate target chunk size based on GPU hardware (like manager's code)
  auto approx_vertices_to_sort_per_iteration =
    static_cast<size_t>(handle.get_device_properties().multiProcessorCount) * (1 << 20);
Add /* tuning parameter */ after (1 << 20) to inform that this value can be modified without impacting code correctness.
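For example, the quoted line would become:

auto approx_vertices_to_sort_per_iteration =
  static_cast<size_t>(handle.get_device_properties().multiProcessorCount) *
  (1 << 20) /* tuning parameter */;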
if (host_distance_counts[d] > 0) {
  distance_buckets_vertices.emplace_back(host_distance_counts[d], handle.get_stream());
  distance_buckets_sources.emplace_back(host_distance_counts[d], handle.get_stream());
} else {
  distance_buckets_vertices.emplace_back(0, handle.get_stream());
  distance_buckets_sources.emplace_back(0, handle.get_stream());
}
Do you need this if-else statement? The if and the else paths execute the same code.
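A minimal sketch of the collapsed version (when the count is 0, the constructor simply creates an empty buffer):

distance_buckets_vertices.emplace_back(host_distance_counts[d], handle.get_stream());
distance_buckets_sources.emplace_back(host_distance_counts[d], handle.get_stream());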
printf("DEBUG: Processing chunk %zu: %zu vertices, %zu distance levels\n", | ||
chunk_i, | ||
chunk_size, | ||
num_segments_in_chunk); |
Delete.
// Gather data for this chunk from original bucket arrays
size_t write_offset = 0;
std::vector<size_t> chunk_segment_offsets;
chunk_segment_offsets.push_back(0);

// Map distance level indices to actual distance values
std::vector<vertex_t> distance_levels_in_chunk;
for (vertex_t d = 1; d <= global_max_distance; ++d) {
  if (d < distance_buckets_vertices.size() && distance_buckets_vertices[d].size() > 0) {
    distance_levels_in_chunk.push_back(d);
  }
}
This code is not necessary.
// Copy data for distance levels in this chunk
for (size_t level_idx = chunk_distance_start; level_idx < chunk_distance_end; ++level_idx) {
  vertex_t d = distance_levels_in_chunk[level_idx];
  size_t level_size = distance_buckets_vertices[d].size();

  // Copy vertices with int32_t conversion
  thrust::transform(handle.get_thrust_policy(),
                    distance_buckets_vertices[d].begin(),
                    distance_buckets_vertices[d].end(),
                    chunk_vertices_int32.begin() + write_offset,
                    [] __device__(vertex_t v) { return static_cast<int32_t>(v); });

  // Copy sources
  thrust::copy(handle.get_thrust_policy(),
               distance_buckets_sources[d].begin(),
               distance_buckets_sources[d].end(),
               chunk_sources.begin() + write_offset);

  write_offset += level_size;
  chunk_segment_offsets.push_back(write_offset);
}
This as well.
// Copy segment offsets to device
rmm::device_uvector<size_t> d_chunk_segment_offsets(chunk_segment_offsets.size(),
                                                    handle.get_stream());
raft::update_device(d_chunk_segment_offsets.data(),
                    chunk_segment_offsets.data(),
                    chunk_segment_offsets.size(),
                    handle.get_stream());
This too.
You can just use distance_level_offsets (just need to shift the offset values properly).
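A minimal sketch of that shift, assuming distance_level_offsets is a host vector indexed by distance level with a leading 0 (the exact container and indexing are assumptions):

std::vector<size_t> chunk_segment_offsets(chunk_distance_end - chunk_distance_start + 1);
size_t chunk_base = distance_level_offsets[chunk_distance_start];
for (size_t i = 0; i < chunk_segment_offsets.size(); ++i) {
  chunk_segment_offsets[i] = distance_level_offsets[chunk_distance_start + i] - chunk_base;
}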
// Scatter results back to original bucket arrays
size_t read_offset = 0;
for (size_t level_idx = chunk_distance_start; level_idx < chunk_distance_end;
     ++level_idx) {
  vertex_t d = distance_levels_in_chunk[level_idx];
  size_t level_size = distance_buckets_vertices[d].size();

  // Convert back to vertex_t and copy to original bucket
  thrust::transform(handle.get_thrust_policy(),
                    chunk_vertices_int32.begin() + read_offset,
                    chunk_vertices_int32.begin() + read_offset + level_size,
                    distance_buckets_vertices[d].begin(),
                    [] __device__(int32_t v32) { return static_cast<vertex_t>(v32); });

  thrust::copy(handle.get_thrust_policy(),
               chunk_sources.begin() + read_offset,
               chunk_sources.begin() + read_offset + level_size,
               distance_buckets_sources[d].begin());

  read_offset += level_size;
}
You don't need to copy in loops. Just copy for the entire chunk.
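A sketch of what that could look like, assuming the destination is a single consecutive (vertex, origin) array (bucket_vertices, bucket_origins, and chunk_base are hypothetical names) rather than per-distance buckets:

thrust::transform(handle.get_thrust_policy(),
                  chunk_vertices_int32.begin(),
                  chunk_vertices_int32.begin() + chunk_size,
                  bucket_vertices.begin() + chunk_base,
                  [] __device__(int32_t v32) { return static_cast<vertex_t>(v32); });
thrust::copy(handle.get_thrust_policy(),
             chunk_sources.begin(),
             chunk_sources.begin() + chunk_size,
             bucket_origins.begin() + chunk_base);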
// Check that number of sources doesn't overflow origin_t
CUGRAPH_EXPECTS(
  cuda::std::distance(vertex_first, vertex_last) <= std::numeric_limits<origin_t>::max(),
  "Number of sources exceeds maximum value for origin_t (uint32_t), would cause overflow");
You may set origin_t to uint16_t to cut memory capacity & bandwidth requirements.
I don't think speedup will be dramatic but it can be non-negligible...
uint16_t can cover up to 65535 sources, and I assume for most practical use cases (except for tiny graphs), limiting the maximum number of concurrent sources to this number will be sufficient.
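A sketch of the narrowed type, keeping the quoted overflow check (only the message text changes):

using origin_t = uint16_t;  // caps the number of concurrent sources at 65535

CUGRAPH_EXPECTS(
  cuda::std::distance(vertex_first, vertex_last) <=
    static_cast<std::ptrdiff_t>(std::numeric_limits<origin_t>::max()),
  "Number of sources exceeds maximum value for origin_t (uint16_t), would cause overflow");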
while (vertex_frontier.bucket(bucket_idx_cur).aggregate_size() > 0) {
  // Step 1: Extract ALL edges from frontier (filtered by unvisited vertices)
  using bfs_edge_tuple_t = thrust::tuple<vertex_t, origin_t, edge_t>;
We are migrating from thrust::tuple to cuda::std::tuple. Don't use thrust::tuple anymore.
[d_sigma_2d = sigmas_2d.begin(), num_vertices, vertex_partition] __device__(
    auto tagged_src, auto dst, auto, auto, auto) {
  auto src    = thrust::get<0>(tagged_src);
  auto origin = thrust::get<1>(tagged_src);
thrust::get => cuda::std::get
auto src_idx   = origin * num_vertices + src_offset;
auto src_sigma = static_cast<edge_t>(d_sigma_2d[src_idx]);

return thrust::make_tuple(dst, origin, src_sigma);
thrust::make_tuple => cuda::std::make_tuple
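Putting the three substitutions together, the quoted pieces would look roughly like this (parts not shown above are elided):

#include <cuda/std/tuple>

using bfs_edge_tuple_t = cuda::std::tuple<vertex_t, origin_t, edge_t>;

[d_sigma_2d = sigmas_2d.begin(), num_vertices, vertex_partition] __device__(
    auto tagged_src, auto dst, auto, auto, auto) {
  auto src    = cuda::std::get<0>(tagged_src);
  auto origin = cuda::std::get<1>(tagged_src);
  // ... src_offset computation unchanged ...
  auto src_idx   = origin * num_vertices + src_offset;
  auto src_sigma = static_cast<edge_t>(d_sigma_2d[src_idx]);
  return cuda::std::make_tuple(dst, origin, src_sigma);
}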
auto num_unique = thrust::unique_count(
  handle.get_thrust_policy(),
  thrust::make_zip_iterator(frontier_vertices.begin(), frontier_origins.begin()),
  thrust::make_zip_iterator(frontier_vertices.end(), frontier_origins.end()),
  [] __device__(auto const& a, auto const& b) { return a == b; });
I assume you don't need this. You are not using this anywhere.
size_t cumulative_vertices = 0;
for (vertex_t d = 0; d <= global_max_distance; ++d) {
  if (distance_buckets_vertices[d].size() > 0) {
distance_buckets_vertices[d].size() can't be 0, so this if statement is not necessary.
As long as total_vertices > 0, distance_buckets_vertices[global_max_distance].size() >= 1 and there should be at least one vertex in each d.
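A sketch of the simplified loop under that assumption (the loop body isn't quoted above, so the push_back line is an assumed placeholder); an empty level would simply contribute 0:

size_t cumulative_vertices = 0;
for (vertex_t d = 0; d <= global_max_distance; ++d) {
  cumulative_vertices += distance_buckets_vertices[d].size();
  distance_level_offsets.push_back(cumulative_vertices);  // assumed body
}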
  }
}

if (distance_level_offsets.size() > 1) {
Is this necessary? I assume distance_level_offsets.size() is always >= 2.
for (size_t level_idx = chunk_distance_start; level_idx < chunk_distance_end; ++level_idx) {
  vertex_t d = level_idx;
  size_t level_size = distance_buckets_vertices[d].size();

  // Copy vertices
  thrust::copy(handle.get_thrust_policy(),
               distance_buckets_vertices[d].begin(),
               distance_buckets_vertices[d].end(),
               chunk_vertices.begin() + write_offset);

  // Copy sources
  thrust::copy(handle.get_thrust_policy(),
               distance_buckets_sources[d].begin(),
               distance_buckets_sources[d].end(),
               chunk_sources.begin() + write_offset);

  write_offset += level_size;
  chunk_segment_offsets.push_back(write_offset);
}
You can skip this copy if you maintain (vertex, origin) pairs in a consecutive array (instead of creating separate vector pairs for every d).
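A sketch of that layout (names hypothetical): one vertex array, one origin array, and per-level offsets into them, so a chunk is just a contiguous sub-range and needs no gather copy:

rmm::device_uvector<vertex_t> bucket_vertices(total_vertices, handle.get_stream());
rmm::device_uvector<origin_t> bucket_origins(total_vertices, handle.get_stream());
std::vector<size_t> distance_level_offsets;  // one entry per level boundary, leading 0

// A chunk covering levels [chunk_distance_start, chunk_distance_end) then starts at
// bucket_vertices.begin() + distance_level_offsets[chunk_distance_start], and the
// segmented sort can operate on it in place.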
vertex_frontier_t<vertex_t, void, multi_gpu, true> vertex_frontier(handle, num_buckets);
if constexpr (multi_gpu) {
  // Multi-GPU: Use sequential brandes_bfs (more reliable for cross-GPU)
"more reliable" doesn't sound right here. Our multi-source code is not ready for multi-GPU yet. Once it is updated, it should be as reliable as the sequential version.
  include_endpoints,
  do_expensive_check);
auto [distances_2d, sigmas_2d] = detail::multisource_bfs(
  handle, graph_view, edge_weight_view, vertices_begin, vertices_end, do_expensive_check);
Where are you limiting the number of concurrent origins? You need to limit the number of origins within the size of origin_t. If the total number exceeds the limit, you need to call multisource_bfs (and backward_pass) in multiple rounds.
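A sketch of such rounds, reusing the quoted multisource_bfs call (the chunking loop itself is an assumption, not existing code):

size_t num_sources   = static_cast<size_t>(cuda::std::distance(vertices_begin, vertices_end));
size_t max_per_round = static_cast<size_t>(std::numeric_limits<origin_t>::max());

for (size_t first = 0; first < num_sources; first += max_per_round) {
  auto last = std::min(first + max_per_round, num_sources);
  auto [distances_2d, sigmas_2d] = detail::multisource_bfs(
    handle, graph_view, edge_weight_view, vertices_begin + first, vertices_begin + last,
    do_expensive_check);
  // ... run the backward pass for this round and accumulate into the centrality output ...
}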
…arrays per distance level
LGTM, and great job!!!
/ok to test 95764a4
/ok to test 642c4d0
/merge
/ok to test d398316
/merge
Update the Betweenness Centrality implementation to perform the multi-source backward pass concurrently.
Closes #5221