Contributor

@iyastreb iyastreb commented Sep 17, 2025

What?

  • Extended the thread config option to also specify the number of device blocks: -T 32:2
  • Added a compact config format for message size: -s 1024:32
  • Added an option for the cooperation level: -L warp
  • Separate tests for single, multi, and partial
  • Common infra for all UCP tests
  • Several bugfixes in perftest
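
To illustrate the `-T <threads>[:<blocks>]` syntax, here is a minimal host-side sketch of how such an argument could be parsed. It is not the perftest implementation: `parse_positive`, `parse_pair`, and the fallback value used when the part after the colon is omitted are all assumptions made for this example.

```cpp
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper: parse a strictly positive decimal integer.
static bool parse_positive(const std::string &text, unsigned &value)
{
    // Require a leading digit so strtoul cannot skip whitespace or accept a sign
    if (text.empty() || !std::isdigit(static_cast<unsigned char>(text[0]))) {
        return false;
    }

    errno = 0;
    char *end = nullptr;
    unsigned long parsed = std::strtoul(text.c_str(), &end, 10);
    if ((errno != 0) || (*end != '\0') || (parsed == 0) || (parsed > UINT_MAX)) {
        return false;
    }

    value = static_cast<unsigned>(parsed);
    return true;
}

// Hypothetical helper: parse "<first>[:<second>]". When the ":<second>" part
// is omitted, second is set to default_second (an assumed default).
bool parse_pair(const std::string &arg, unsigned default_second,
                unsigned &first, unsigned &second)
{
    const std::string::size_type colon = arg.find(':');
    if (colon == std::string::npos) {
        second = default_second;
        return parse_positive(arg, first);
    }

    return parse_positive(arg.substr(0, colon), first) &&
           parse_positive(arg.substr(colon + 1), second);
}
```

With this sketch, `parse_pair("32:2", 1, threads, blocks)` yields 32 threads and 2 blocks, while a plain `"32"` keeps the assumed default of one block. The same two-field shape would also fit an argument such as `-s 1024:32`.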

@iyastreb iyastreb marked this pull request as ready for review September 18, 2025 12:28
double percentile_rank; /* The percentile rank of the percentile reported
in latency tests */
unsigned device_thread_count; /* Number of device threads */
unsigned device_block_count; /* Number of device blocks */
Contributor

"Number of device threads in block"
We need to make sure device_thread_count is not larger than the maximum number of threads per block.
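
A minimal sketch of the check requested above, written as plain host C++ so it compiles without CUDA. The function name and signature are hypothetical; in real code the limit would typically be read from `cudaDeviceProp::maxThreadsPerBlock` (via `cudaGetDeviceProperties`) rather than passed in by hand.

```cpp
// Hypothetical validation helper, not part of the PR.
// max_threads_per_block is supplied by the caller; with the CUDA runtime it
// would typically come from cudaDeviceProp::maxThreadsPerBlock.
bool launch_shape_is_valid(unsigned thread_count, unsigned block_count,
                           unsigned max_threads_per_block)
{
    // A launch needs at least one block and one thread per block
    if ((thread_count == 0) || (block_count == 0)) {
        return false;
    }

    // device_thread_count is per block, so it is compared to the per-block limit
    return thread_count <= max_threads_per_block;
}
```

Rejecting the configuration up front gives a clear error message instead of a failed kernel launch later.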

case UCS_DEVICE_LEVEL_WARP: \
UCX_KERNEL_CMD(UCS_DEVICE_LEVEL_WARP, _cmd, _blocks, _threads, func, __VA_ARGS__); \
break; \
case UCS_DEVICE_LEVEL_BLOCK: \
Contributor

Block and Grid are still not supported

Contributor Author

I think we can still keep them here?

return bits;
}

#define UCX_KERNEL_CMD(level, cmd, blocks, threads, func, ...) \
Contributor

Why do we need this macro? Can't we just call
func<_level, _cmd><<<blocks, threads>>>(__VA_ARGS__);
from UCX_KERNEL_DISPATCH?

Contributor Author

We can't, because _cmd is not a compile-time constant.
If we could instantiate a template with runtime values, we wouldn't need both templates at all.
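
The point above can be sketched in plain C++: a runtime value has to be turned into a template argument by a `switch` that names each constant explicitly, and with two runtime values that takes two nested dispatch steps. The enums and functions below are hypothetical stand-ins (the PR uses macros and a CUDA kernel launch; an ordinary function call replaces the `<<<blocks, threads>>>` launch here so the sketch compiles without CUDA).

```cpp
// Hypothetical stand-ins for ucs_device_level_t and ucx_perf_cmd_t
enum level_t { LEVEL_THREAD, LEVEL_WARP };
enum cmd_t { CMD_PUT_SINGLE, CMD_PUT_MULTI };

// Stand-in for the templated kernel: both parameters must be compile-time
// constants. The return value just encodes which instantiation ran.
template<level_t level, cmd_t cmd>
int run_kernel(int x)
{
    return static_cast<int>(level) * 100 + static_cast<int>(cmd) * 10 + x;
}

// Inner step: level is already a template argument, cmd is still a runtime value
template<level_t level>
int dispatch_cmd(cmd_t cmd, int x)
{
    switch (cmd) {
    case CMD_PUT_SINGLE:
        return run_kernel<level, CMD_PUT_SINGLE>(x);
    case CMD_PUT_MULTI:
        return run_kernel<level, CMD_PUT_MULTI>(x);
    }
    return -1; // unsupported command
}

// Outer step: both level and cmd are runtime values
int dispatch(level_t level, cmd_t cmd, int x)
{
    switch (level) {
    case LEVEL_THREAD:
        return dispatch_cmd<LEVEL_THREAD>(cmd, x);
    case LEVEL_WARP:
        return dispatch_cmd<LEVEL_WARP>(cmd, x);
    }
    return -1; // unsupported level
}
```

Writing `run_kernel<level, cmd>(x)` directly inside `dispatch` would not compile, because neither argument is a constant expression there.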

printf(" UCP only:\n");
printf(" -T <threads> number of threads in the test (%d)\n",
printf(" -T <threads>[:<blocks>]\n");
printf(" number of threads in the test (%d)\n",
Contributor

AFAIU it's the number of threads in each block

Contributor Author

This documentation refers to the main <threads> param, which remains the same as before. The documentation for the optional blocks param is below, but I added an explicit statement about threads in each block.

indices[i] = i;
addresses[i] = (char *)perf.send_buffer + offset;
remote_addresses[i] = perf.ucp.remote_addr + offset;
lengths[i] = (i == count - 1) ? ONESIDED_SIGNAL_SIZE :
Contributor

It would be better to use device shared memory where possible, because accessing global GPU memory can be expensive and can affect the measurements.

Contributor Author

Maybe we can add this optimization in the next PR?
This one is already quite large.


ctx.status = status;
}
void init_counters(const ucx_perf_context_t &perf)
Contributor

Better to use the counter API.
Can be done in a future PR.


template<ucs_device_level_t level, ucx_perf_cmd_t cmd>
UCS_F_DEVICE ucs_status_t
ucp_perf_cuda_send_nbx(ucp_perf_cuda_params &params, ucx_perf_counter_t idx,
Contributor

@ofirfarjun7 ofirfarjun7 Sep 18, 2025

I would remove the _nbx suffix: the UCP device API is actually blocking.

Contributor Author

@iyastreb iyastreb Sep 19, 2025

OK, disregard my previous comment.
But it's still nonblocking in the sense that we need to progress until the request completes.

Contributor

IMO it's confusing to have _nbx in the name because it's not a UCP API function

Comment on lines +11 to +19
# Increase number of threads after following fixes:
# - Use thread-local memory instead of shared for requests (limit 48K)
# - Fix WQE size limit of 1024
ucp_device_cuda_single_bw_1k_32threads -t ucp_put_single_bw -m cuda -s 1024 -n 10000 -T 32
ucp_device_cuda_single_lat_1k_32threads -t ucp_put_single_lat -m cuda -s 1024 -n 10000 -T 32
ucp_device_cuda_multi_bw_1k_32threads -t ucp_put_multi_bw -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_multi_lat_1k_32threads -t ucp_put_multi_lat -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_partial_bw_1k_32threads -t ucp_put_partial_bw -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_partial_lat_1k_32threads -t ucp_put_partial_lat -m cuda -s 256:8 -n 10000 -T 32 -O 2
Contributor

shall we test warp level?

Contributor Author

will be tested in the next PR

return bits;
}

#define UCX_KERNEL_CMD(level, cmd, blocks, threads, shared_size, func, ...) \
Contributor

use _ prefix for macro args

Contributor Author

done

} \
} while (0)

#define UCX_KERNEL_DISPATCH(perf, func, ...) \
Contributor

  1. use _ prefix for macro args
  2. IMO add PERF to the name: UCX_PERF_KERNEL_DISPATCH

Contributor Author

done
I also refactored these macros to be more generic

}

template<typename T>
void device_clone(T **dst, const T *src, size_t count)
Contributor

the name sounds weird, also maybe return void* as return value instead of return T* by value?

Contributor Author

Changed to:

        m_params.indices          = device_vector(indices);
        m_params.addresses        = device_vector(addresses);
        m_params.remote_addresses = device_vector(remote_addresses);
        m_params.lengths          = device_vector(lengths);
 
    template<typename T>
    T* device_vector(const std::vector<T> &src)
    {
        size_t size = src.size() * sizeof(T);
        T *dst;
        CUDA_CALL(, UCS_LOG_LEVEL_FATAL, cudaMalloc, &dst, size);
        CUDA_CALL_ERR(cudaMemcpy, dst, src.data(), size, cudaMemcpyHostToDevice);
        return dst;
    }


ucx_perftest="$ucx_inst/bin/ucx_perftest"
ucp_test_args="-b $ucx_inst_ptest/test_types_ucp_device_cuda"

# TODO: Run on all GPUs & NICs combinations
Contributor

need to remove (can do in next pr)

ucs_memory_type_t recv_mem_type; /* Recv memory type */
ucx_perf_accel_dev_t send_device; /* Send memory device for gdaki */
ucx_perf_accel_dev_t recv_device; /* Recv memory device for gdaki */
ucs_device_level_t device_level; /* Device level for gdaki */
Contributor

minor - i'd remove gdaki

Contributor Author

done for all three

@yosefe yosefe merged commit aaea0a8 into openucx:master Sep 24, 2025
141 checks passed
@iyastreb iyastreb deleted the ucp-perf-device-level branch September 24, 2025 07:20
@iyastreb
Contributor Author

Last PR comments are fixed in #10906
