Hi Everyone,
@yosefe @tvegas1
I am currently examining the execution of MPI_Send (Blocking send) with UCX in an intra_node scenario. At present, the memory transfer (ucs_memcpy_relaxed()) is executed in the receiver process (rank or processor), as depicted below.

By executing the same in the sender process, as shown below, we could significantly reduce cache-to-cache data transfers and conserve memory bandwidth.

However, I am struggling to find a runtime configuration that would allow me to execute this transfer in the sender process with the hint UCS_ARCH_MEMCPY_NT_DEST and benchmark it. Could anyone provide some guidance or suggestions on this matter?
Thank you in advance for your assistance.
--Arun
Hi Everyone,
@yosefe @tvegas1
I am currently examining the execution of MPI_Send (Blocking send) with UCX in an intra_node scenario. At present, the memory transfer (ucs_memcpy_relaxed()) is executed in the receiver process (rank or processor), as depicted below.
By executing the same in the sender process, as shown below, we could significantly reduce cache-to-cache data transfers and conserve memory bandwidth.
However, I am struggling to find a runtime configuration that would allow me to execute this transfer in the sender process with the hint UCS_ARCH_MEMCPY_NT_DEST and benchmark it. Could anyone provide some guidance or suggestions on this matter?
Thank you in advance for your assistance.
--Arun