UCP/PERF: UCP tests with configurable level/api/batch #10893
Conversation
double percentile_rank; /* The percentile rank of the percentile reported
                           in latency tests */
unsigned device_thread_count; /* Number of device threads */
unsigned device_block_count; /* Number of device blocks */
"Number of device threads in block"
Need to make sure device_thread_count
is not larger than max num of threads in block.
case UCS_DEVICE_LEVEL_WARP: \
    UCX_KERNEL_CMD(UCS_DEVICE_LEVEL_WARP, _cmd, _blocks, _threads, func, __VA_ARGS__); \
    break; \
case UCS_DEVICE_LEVEL_BLOCK: \
Block and Grid are still not supported
I think we can still keep them here?
src/tools/perf/cuda/cuda_kernel.cuh
Outdated
    return bits;
}

#define UCX_KERNEL_CMD(level, cmd, blocks, threads, func, ...) \
Why do we need this macro? Can't we just call
func<_level, _cmd><<<blocks, threads>>>(__VA_ARGS__);
directly from UCX_KERNEL_DISPATCH?
We can't, because _cmd is not a compile-time constant.
If we could instantiate a template with runtime values, we wouldn't need both templates at all.
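The constraint can be shown in reduced form: template arguments must be compile-time constants, so a runtime value has to be mapped onto the right instantiation through a switch, which is essentially what the dispatch macros expand to. A simplified host-side sketch with hypothetical names (no kernel launch):

```cpp
#include <stdexcept>

// Simplified stand-ins for ucs_device_level_t / ucx_perf_cmd_t.
enum level_t { LEVEL_THREAD, LEVEL_WARP };
enum cmd_t   { CMD_PUT_SINGLE, CMD_PUT_MULTI };

// The "kernel": both level and cmd are template (compile-time) parameters.
template<level_t level, cmd_t cmd> int run()
{
    return (int)level * 10 + (int)cmd; // distinct value per instantiation
}

// A runtime cmd cannot be written as run<level, cmd>() directly; the switch
// enumerates every instantiation, which is what a UCX_KERNEL_CMD-style
// macro generates once per level.
template<level_t level> int dispatch_cmd(cmd_t cmd)
{
    switch (cmd) {
    case CMD_PUT_SINGLE: return run<level, CMD_PUT_SINGLE>();
    case CMD_PUT_MULTI:  return run<level, CMD_PUT_MULTI>();
    }
    throw std::invalid_argument("unknown cmd");
}

// Outer switch maps the runtime level, mirroring UCX_KERNEL_DISPATCH.
int dispatch(level_t level, cmd_t cmd)
{
    switch (level) {
    case LEVEL_THREAD: return dispatch_cmd<LEVEL_THREAD>(cmd);
    case LEVEL_WARP:   return dispatch_cmd<LEVEL_WARP>(cmd);
    }
    throw std::invalid_argument("unknown level");
}
```

Every (level, cmd) pair gets its own instantiation at compile time, while the selection happens at run time.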
src/tools/perf/perftest_params.c
Outdated
printf(" UCP only:\n"); | ||
printf(" -T <threads> number of threads in the test (%d)\n", | ||
printf(" -T <threads>[:<blocks>]\n"); | ||
printf(" number of threads in the test (%d)\n", |
AFAIU it's the number of threads on each block
This documentation refers to the main <threads>
param, which remains the same as before. Documentation for the optional <blocks>
param is below, but I added an explicit statement that it is the number of threads on each block.
indices[i]          = i;
addresses[i]        = (char *)perf.send_buffer + offset;
remote_addresses[i] = perf.ucp.remote_addr + offset;
lengths[i]          = (i == count - 1) ? ONESIDED_SIGNAL_SIZE :
It would be better to use device shared memory when possible, because accessing global GPU memory can be expensive and can affect the measurements.
Maybe we can add this optimization in the next PR?
This one is already quite large
    ctx.status = status;
}

void init_counters(const ucx_perf_context_t &perf)
Better use counter API
Can do in future PR
template<ucs_device_level_t level, ucx_perf_cmd_t cmd>
UCS_F_DEVICE ucs_status_t
ucp_perf_cuda_send_nbx(ucp_perf_cuda_params &params, ucx_perf_counter_t idx,
I will remove the _nbx
... the UCP device API is actually blocking.
Ok, disregard my previous comment.
But it's still nonblocking in the sense that we need to progress until the request completes.
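The pattern being described boils down to polling a request until it leaves the in-progress state. A minimal model with hypothetical names (the actual UCP device API calls differ):

```cpp
// Minimal model of "post once, then progress until completion".
enum status_t { STATUS_OK = 0, STATUS_INPROGRESS = 1 };

struct request_t {
    int remaining; // e.g. outstanding work units still in flight
};

// Hypothetical progress call: completes one unit of work per invocation
// and reports whether the request is done.
static status_t progress(request_t &req)
{
    if (req.remaining > 0) {
        --req.remaining;
    }
    return (req.remaining == 0) ? STATUS_OK : STATUS_INPROGRESS;
}

// The operation "returns" before completion; the caller spins on
// progress() until the request completes. This polling loop is what makes
// the operation nonblocking in the sense described in the comment above.
static status_t wait_completion(request_t &req)
{
    status_t status;
    do {
        status = progress(req);
    } while (status == STATUS_INPROGRESS);
    return status;
}
```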
IMO it's confusing to have _nbx in the name because it's not a UCP API function
# Increase number of threads after the following fixes:
# - Use thread-local memory instead of shared for requests (limit 48K)
# - Fix WQE size limit of 1024
ucp_device_cuda_single_bw_1k_32threads -t ucp_put_single_bw -m cuda -s 1024 -n 10000 -T 32
ucp_device_cuda_single_lat_1k_32threads -t ucp_put_single_lat -m cuda -s 1024 -n 10000 -T 32
ucp_device_cuda_multi_bw_1k_32threads -t ucp_put_multi_bw -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_multi_lat_1k_32threads -t ucp_put_multi_lat -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_partial_bw_1k_32threads -t ucp_put_partial_bw -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_partial_lat_1k_32threads -t ucp_put_partial_lat -m cuda -s 256:8 -n 10000 -T 32 -O 2
shall we test warp level?
will be tested in the next PR
    return bits;
}

#define UCX_KERNEL_CMD(level, cmd, blocks, threads, shared_size, func, ...) \
use _ prefix for macro args
done
    } \
} while (0)

#define UCX_KERNEL_DISPATCH(perf, func, ...) \
- use _ prefix for macro args
- IMO add PERF to the name: UCX_PERF_KERNEL_DISPATCH
done
I also refactored these macros to be more generic
}

template<typename T>
void device_clone(T **dst, const T *src, size_t count)
The name sounds weird. Also, maybe return the pointer (void*) as a return value instead of writing it through the T** argument?
Changed to:
m_params.indices = device_vector(indices);
m_params.addresses = device_vector(addresses);
m_params.remote_addresses = device_vector(remote_addresses);
m_params.lengths = device_vector(lengths);
template<typename T>
T* device_vector(const std::vector<T> &src)
{
size_t size = src.size() * sizeof(T);
T *dst;
CUDA_CALL(, UCS_LOG_LEVEL_FATAL, cudaMalloc, &dst, size);
CUDA_CALL_ERR(cudaMemcpy, dst, src.data(), size, cudaMemcpyHostToDevice);
return dst;
}
ucx_perftest="$ucx_inst/bin/ucx_perftest"
ucp_test_args="-b $ucx_inst_ptest/test_types_ucp_device_cuda"

# TODO: Run on all GPUs & NICs combinations
need to remove (can do in next pr)
ucs_memory_type_t recv_mem_type;    /* Recv memory type */
ucx_perf_accel_dev_t send_device;   /* Send memory device for gdaki */
ucx_perf_accel_dev_t recv_device;   /* Recv memory device for gdaki */
ucs_device_level_t device_level;    /* Device level for gdaki */
minor - i'd remove gdaki
done for all three
Last PR comments are fixed in #10906.
What?
-T 32:2 — number of threads per block and number of blocks
-s 1024:32 — message size and batch count
-L warp — device level selection