How to Use Shared Memory for Intra-node Inter-process Data Transfer? #10016
Comments
@Clownier Shared memory transports should be enabled by default. Can you pls provide the following information:
@yosefe Alternatively, we can reframe the problem: within a single process, how can we make connections to other servers use UCX's RDMA transports while connections to the local server use UCX's TCP transport?
@yosefe
However, after configuring `ucp_config_modify(ucp_config, "TLS", "tcp,shm");`, the loopback traffic went over TCP, but latency increased and our overall performance (IOPS and bandwidth) dropped significantly. How can we make local loopback traffic use shared memory, and is there a straightforward way to verify that shared memory is actually being used?
We would also like to upgrade, but in our tests version 1.12 could not establish connections with a later version (we tested 1.14). The client reported "Operation rejected by remote peer", and the server reported "[ucp_ep_create () failed: Invalid parameter]". This makes it impossible to upgrade one module or one machine at a time. Since our service is already in production, stopping all services and restarting them together is not acceptable. Do you have any suggestions?
@yosefe
to test on the same machine, expecting communication to go through shared memory, but it actually appears to use the lo interface over TCP. The nload output is shown in the figure: [figure: nload output] How can we make the traffic go through shared memory?
@Clownier can you try setting UCX_SYSV_ERROR_HANDLING=y in addition to UCX_TLS=tcp,sm?
@yosefe
After adding UCX_MM_ERROR_HANDLING=y, most of the bandwidth appears to go through shared memory, but we could not find any documentation explaining why. Why does this setting make such a difference?
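For reference, the settings discussed in this thread could be applied to a shell session roughly as follows. This is only a sketch under a few assumptions: `UCX_LOG_LEVEL=info` is added here on the assumption that info-level logging reports the transports chosen for each endpoint configuration, `ucx_info -d` is assumed to be installed alongside UCX, and `./my_ucx_app` is a hypothetical placeholder for the real application.

```shell
# Transports suggested in the thread: TCP plus the shared memory transports.
export UCX_TLS=tcp,sm

# Variable reported in the thread to move most intra-node traffic to shared memory.
export UCX_MM_ERROR_HANDLING=y

# Assumption: info-level logging prints the transports selected per endpoint
# configuration, which may help confirm whether shared memory is in use.
export UCX_LOG_LEVEL=info

# Optional: list the transports this UCX build detects on the machine.
# Skipped silently when ucx_info is not installed.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | grep -i "transport" || true
fi

# Hypothetical placeholder: launch the real application here.
# ./my_ucx_app
```

If the log lines for intra-node endpoints name only tcp, shared memory was probably not selected; comparing runs with and without the error-handling variable is one way to see its effect.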
We have found that deploying multiple data-transfer modules on each server in the cluster leads to tail latencies on the order of seconds in RDMA data transfers, as detailed in issue 9976. After reviewing some research papers, we suspect that the root cause is loopback traffic among the modules running on the same server.
Consequently, we are exploring whether Inter-Process Communication (IPC) mechanisms can be used to bypass loopback traffic within a single machine without major modifications to our existing project. Our current project follows the tag matching approach, similar to the hello world example, for data transfers from multiple servers to multiple servers (including a server to itself).
We are seeking guidance on how to adapt our existing logic to use shared memory for transmission between connections on the same machine, as well as inquiring if there are any relevant demo examples or tutorials that could serve as a reference.
Your assistance in this matter would be greatly appreciated.