"Address not registered by any device(s)" Error when block_size is large #38

Open
c-guo16 opened this issue Dec 16, 2024 · 2 comments

c-guo16 commented Dec 16, 2024

We are using commit 0d9e226 of the Mooncake repo and running "transfer_engine_bench" on 2 nodes: h1 is the server side and h2 is the client side. Our commands are:

  • server side:
MC_MTU=1024 ./mooncake-transfer-engine/example/transfer_engine_bench --mode=target --metadata_server=h1:2379 --local_server_name=h1:12345 --device_name=mlx5_bond_0 --threads=16 --block_size=65537
  • client side:
./mooncake-transfer-engine/example/transfer_engine_bench --metadata_server=h1:2379 --segment_id=h1:12345 --local_server_name=h2:12346 --device_name=mlx5_bond_0 --duration=10 --threads=16 --batch_size=1024 --use_vram=1 --gpu_id=7 --block_size=65537

Other configuration:

NIC: CX7
GPU: L40S

With --block_size=65536, everything works. Setting --block_size=65537 or larger causes the following error:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I1216 04:05:37.228515 18434 rdma_context.cpp:131] RDMA device: mlx5_bond_0, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:1e:9d:7b:9a
I1216 04:05:37.229410 18434 transfer_engine_bench.cpp:256] VRAM is used
E1216 04:05:37.310652 18460 rdma_transport.cpp:244] Address not registered by any device(s) 0x7f37d1ff3fff
F1216 04:05:37.310685 18460 transfer_engine_bench.cpp:163] Assert failed: !ret
*** Check failure stack trace: ***
Aborted (core dumped)

I have no idea what causes this.

@kvcache-ai kvcache-ai deleted a comment Dec 16, 2024
@alogfans
Collaborator

Can you re-run the test with the --verbose option to get more logs? We could not reproduce it in our environment.

@fengquyoumo
Contributor

I believe the error is likely due to an out-of-bounds memory access.
transfer_engine_bench allocates a 1GB buffer by default.
According to the offset calculation, the total memory accessed is batch_size * block_size * threads.
In your case, with block_size = 65536, max_offset = 1024 * 65536 * 16 = 1GB, which exactly fits the buffer. With block_size = 65537, max_offset = 1024 * 65537 * 16 = 1GB + 16KB, which exceeds the buffer and leads to this error.
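
To make the arithmetic easy to check for other parameter combinations, here is a small Python sketch. It is not taken from transfer_engine_bench: it assumes the buffer is exactly 1 GiB (2^30 bytes) and that the footprint follows the batch_size * block_size * threads formula quoted above. The names max_offset and largest_safe_block_size are hypothetical helpers for illustration only.

```python
# Assumption: the default benchmark buffer is exactly 1 GiB (2**30 bytes).
BUFFER_SIZE = 1 << 30


def max_offset(batch_size: int, block_size: int, threads: int) -> int:
    """Total bytes touched, using the formula quoted in this thread."""
    return batch_size * block_size * threads


def largest_safe_block_size(batch_size: int, threads: int,
                            buffer_size: int = BUFFER_SIZE) -> int:
    """Largest block_size whose footprint still fits in the buffer."""
    return buffer_size // (batch_size * threads)


if __name__ == "__main__":
    # Parameters from the report: --batch_size=1024 --threads=16
    for block_size in (65536, 65537):
        total = max_offset(1024, block_size, 16)
        overrun = total - BUFFER_SIZE
        status = "fits" if overrun <= 0 else f"overruns by {overrun} bytes"
        print(f"block_size={block_size}: {total} bytes, {status}")
    print("largest safe block_size:", largest_safe_block_size(1024, 16))
```

Under these assumptions, 65536 is the largest block size that fits with batch_size=1024 and threads=16, so a larger block size would need a smaller batch_size or fewer threads to stay inside the buffer.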

#72 may be a solution.
