UCP/PERF: UCP tests with configurable level/api/batch #10893
Conversation
double percentile_rank; /* The percentile rank of the percentile reported
                           in latency tests */
unsigned device_thread_count; /* Number of device threads */
unsigned device_block_count; /* Number of device blocks */
"Number of device threads in block"
Need to make sure device_thread_count
is not larger than max num of threads in block.
case UCS_DEVICE_LEVEL_WARP: \
    UCX_KERNEL_CMD(UCS_DEVICE_LEVEL_WARP, _cmd, _blocks, _threads, func, __VA_ARGS__); \
    break; \
case UCS_DEVICE_LEVEL_BLOCK: \
Block and Grid are still not supported
I think we can still keep them here?
src/tools/perf/cuda/cuda_kernel.cuh
Outdated
    return bits;
}

#define UCX_KERNEL_CMD(level, cmd, blocks, threads, func, ...) \
Why do we need this macro? Can't we just call
func<_level, _cmd><<<blocks, threads>>>(__VA_ARGS__);
directly from UCX_KERNEL_DISPATCH?
We can't, because _cmd is not a compile-time constant.
If we could instantiate a template with runtime values, we wouldn't need both templates at all.
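The constraint can be shown in reduced form: template arguments must be compile-time constants, so a runtime value has to be mapped onto the right instantiation through a switch, which is essentially what the dispatch macros expand to. A simplified host-side sketch with hypothetical names (no kernel launch):

```cpp
#include <stdexcept>

// Simplified stand-ins for ucs_device_level_t / ucx_perf_cmd_t.
enum level_t { LEVEL_THREAD, LEVEL_WARP };
enum cmd_t   { CMD_PUT_SINGLE, CMD_PUT_MULTI };

// The "kernel": both level and cmd are template (compile-time) parameters.
template<level_t level, cmd_t cmd> int run()
{
    return (int)level * 10 + (int)cmd; // distinct value per instantiation
}

// A runtime cmd cannot be written as run<level, cmd>() directly; the switch
// enumerates every instantiation, which is what a UCX_KERNEL_CMD-style
// macro generates once per level.
template<level_t level> int dispatch_cmd(cmd_t cmd)
{
    switch (cmd) {
    case CMD_PUT_SINGLE: return run<level, CMD_PUT_SINGLE>();
    case CMD_PUT_MULTI:  return run<level, CMD_PUT_MULTI>();
    }
    throw std::invalid_argument("unknown cmd");
}

// Outer switch maps the runtime level, mirroring UCX_KERNEL_DISPATCH.
int dispatch(level_t level, cmd_t cmd)
{
    switch (level) {
    case LEVEL_THREAD: return dispatch_cmd<LEVEL_THREAD>(cmd);
    case LEVEL_WARP:   return dispatch_cmd<LEVEL_WARP>(cmd);
    }
    throw std::invalid_argument("unknown level");
}
```

Every (level, cmd) pair gets its own instantiation at compile time, while the selection happens at run time.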
src/tools/perf/perftest_params.c
Outdated
printf(" UCP only:\n"); | ||
printf(" -T <threads> number of threads in the test (%d)\n", | ||
printf(" -T <threads>[:<blocks>]\n"); | ||
printf(" number of threads in the test (%d)\n", |
AFAIU it's the number of threads on each block
This documentation refers to the main <threads>
param, which remains the same as before. Documentation for the optional <blocks>
param is below, but I added an explicit statement that it is the number of threads on each block.
indices[i]          = i;
addresses[i]        = (char *)perf.send_buffer + offset;
remote_addresses[i] = perf.ucp.remote_addr + offset;
lengths[i]          = (i == count - 1) ? ONESIDED_SIGNAL_SIZE :
It would be better to use device shared memory when possible, because accessing global GPU memory can be expensive and can affect the measurements.
Maybe we can add this optimization in the next PR?
This one is already quite large
    ctx.status = status;
}

void init_counters(const ucx_perf_context_t &perf)
Better use counter API
Can do in future PR
template<ucs_device_level_t level, ucx_perf_cmd_t cmd>
UCS_F_DEVICE ucs_status_t
ucp_perf_cuda_send_nbx(ucp_perf_cuda_params &params, ucx_perf_counter_t idx,
I will remove the _nbx
... the UCP device API is actually blocking.
Ok, disregard my previous comment.
But it's still nonblocking in the sense that we need to progress until the request completes.
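The pattern being described boils down to polling a request until it leaves the in-progress state. A minimal model with hypothetical names (the actual UCP device API calls differ):

```cpp
// Minimal model of "post once, then progress until completion".
enum status_t { STATUS_OK = 0, STATUS_INPROGRESS = 1 };

struct request_t {
    int remaining; // e.g. outstanding work units still in flight
};

// Hypothetical progress call: completes one unit of work per invocation
// and reports whether the request is done.
static status_t progress(request_t &req)
{
    if (req.remaining > 0) {
        --req.remaining;
    }
    return (req.remaining == 0) ? STATUS_OK : STATUS_INPROGRESS;
}

// The operation "returns" before completion; the caller spins on
// progress() until the request completes. This polling loop is what makes
// the operation nonblocking in the sense described in the comment above.
static status_t wait_completion(request_t &req)
{
    status_t status;
    do {
        status = progress(req);
    } while (status == STATUS_INPROGRESS);
    return status;
}
```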
IMO it's confusing to have _nbx in the name because it's not a UCP API function
# Increase number of threads after the following fixes:
# - Use thread-local memory instead of shared for requests (limit 48K)
# - Fix WQE size limit of 1024
ucp_device_cuda_single_bw_1k_32threads -t ucp_put_single_bw -m cuda -s 1024 -n 10000 -T 32
ucp_device_cuda_single_lat_1k_32threads -t ucp_put_single_lat -m cuda -s 1024 -n 10000 -T 32
ucp_device_cuda_multi_bw_1k_32threads -t ucp_put_multi_bw -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_multi_lat_1k_32threads -t ucp_put_multi_lat -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_partial_bw_1k_32threads -t ucp_put_partial_bw -m cuda -s 256:8 -n 10000 -T 32 -O 2
ucp_device_cuda_partial_lat_1k_32threads -t ucp_put_partial_lat -m cuda -s 256:8 -n 10000 -T 32 -O 2
shall we test warp level?
will be tested in the next PR
    return bits;
}

#define UCX_KERNEL_CMD(level, cmd, blocks, threads, shared_size, func, ...) \
use _ prefix for macro args
done
    } \
} while (0)

#define UCX_KERNEL_DISPATCH(perf, func, ...) \
- use _ prefix for macro args
- IMO add PERF to the name: UCX_PERF_KERNEL_DISPATCH
done
I also refactored these macros to be more generic
}

template<typename T>
void device_clone(T **dst, const T *src, size_t count)
The name sounds weird. Also, maybe return the pointer (void*) as a return value instead of writing it through the T** argument?
Changed to:
m_params.indices = device_vector(indices);
m_params.addresses = device_vector(addresses);
m_params.remote_addresses = device_vector(remote_addresses);
m_params.lengths = device_vector(lengths);
template<typename T>
T* device_vector(const std::vector<T> &src)
{
size_t size = src.size() * sizeof(T);
T *dst;
CUDA_CALL(, UCS_LOG_LEVEL_FATAL, cudaMalloc, &dst, size);
CUDA_CALL_ERR(cudaMemcpy, dst, src.data(), size, cudaMemcpyHostToDevice);
return dst;
}
ucx_perftest="$ucx_inst/bin/ucx_perftest"
ucp_test_args="-b $ucx_inst_ptest/test_types_ucp_device_cuda"

# TODO: Run on all GPUs & NICs combinations
need to remove (can do in next pr)
ucs_memory_type_t recv_mem_type;    /* Recv memory type */
ucx_perf_accel_dev_t send_device;   /* Send memory device for gdaki */
ucx_perf_accel_dev_t recv_device;   /* Recv memory device for gdaki */
ucs_device_level_t device_level;    /* Device level for gdaki */
minor - i'd remove gdaki
done for all three
Last PR comments are fixed in #10906.
What?
-T 32:2 — number of threads per block and number of blocks
-s 1024:32 — message size and batch count
-L warp — device level selection