vllm-integration with multi rdma devices error #35
@alogfans Can you check this?
Can you provide the full log?
INFO 12-13 06:04:21 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 12-13 06:04:26 worker.py:241] Memory profiling results: duration=3.28 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
I compiled Mooncake without any compilation options such as -DUSE_CUDA, etc., only with
I found that the log indicates your vllm version is
If you built from the source of our experimental vllm branch, I think the version might look like
Also, since the posted logs are mixed together, I want to confirm: are the first few requests normal and these errors occur in a succeeding request, or does the first request already report the error?
I work on the vllm main branch with PR 10502 merged (commit id: 0590ec3fd9857063c43c80df281e24c16c51b2ec). The first request already reports the error.
It seems abnormal that both devices have the same LID and GID. When you use two devices together, the local QP cannot determine the destination device for the data transfer, leading to a transfer timeout (i.e., transport retry counter exceeded).
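To double-check that diagnosis on your machine, here is a minimal libibverbs sketch (assumptions: port number 1 and GID index 0, the usual defaults) that prints the LID and first GID of every local device, so you can see whether mlx5_0 and mlx5_1 really report identical addresses:

#include <infiniband/verbs.h>
#include <cstdio>

int main() {
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    for (int i = 0; i < num; ++i) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx) continue;
        struct ibv_port_attr port = {};
        union ibv_gid gid = {};
        // Query port 1 at GID index 0; adjust if your setup differs.
        if (ibv_query_port(ctx, 1, &port) == 0 &&
            ibv_query_gid(ctx, 1, 0, &gid) == 0) {
            printf("%s: LID=0x%x GID[0]=", ibv_get_device_name(list[i]), port.lid);
            for (int b = 0; b < 16; ++b) printf("%02x", gid.raw[b]);
            printf("\n");
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}

Compile with something like g++ gid_check.cpp -libverbs (the file name is arbitrary). If both devices print the same GID at index 0, the remote side cannot tell them apart.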
I tried export MC_GID_INDEX=6, but it changes the GID of both NICs together. ibv_devices shows:
Is there a way to distinguish the rdma-mlx5 devices by GUID?
mlx5_0 and mlx5_1 belong to different RDMA NICs (PCIe devices).
Is this error caused by an RDMA misconfiguration? Is it normal for two NICs to have the same GID? How can I set different GIDs for the two NICs?
The GID is used to identify the device that is the target of a data transfer, similar to an IP address in a TCP/IP network. Each RDMA device has multiple legal GIDs; try setting the GID to another value using MC_GID_INDEX.
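As a sketch of how to inspect those candidate GIDs programmatically (assuming port 1; the device-name argument is just an illustration), the following dumps the non-zero entries of one device's GID table so you can pick an MC_GID_INDEX at which the two NICs differ:

#include <infiniband/verbs.h>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
    const char *want = argc > 1 ? argv[1] : "mlx5_0"; // device to inspect
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    for (int i = 0; i < num; ++i) {
        if (strcmp(ibv_get_device_name(list[i]), want) != 0) continue;
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx) continue;
        struct ibv_port_attr port = {};
        if (ibv_query_port(ctx, 1, &port) == 0) {
            // Walk the whole GID table of port 1, skipping empty entries.
            for (int idx = 0; idx < port.gid_tbl_len; ++idx) {
                union ibv_gid gid = {};
                if (ibv_query_gid(ctx, 1, idx, &gid)) continue;
                bool zero = true;
                for (int b = 0; b < 16; ++b) if (gid.raw[b]) zero = false;
                if (zero) continue;
                printf("%s gid[%d] = ", want, idx);
                for (int b = 0; b < 16; ++b) printf("%02x", gid.raw[b]);
                printf("\n");
            }
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}

On RoCE v2 setups, the GID entries derived from each NIC's own IP address are typically the ones that differ between NICs.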
I tested on another machine (vllm code: https://github.com/kvcache-ai/vllm/tree/upstream-mooncake-integration), selecting different mlx5 devices with different GIDs, but I still encounter an error. The log is below:
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.71it/s]
INFO 12-16 08:39:04 model_runner.py:1097] Loading model weights took 5.1810 GB
Currently the vLLM integration of Transfer Engine assumes that all devices in prefill nodes are connectable with devices in decode nodes. That is, the code may perform a transfer between any pair of local and peer devices.
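As a hypothetical illustration of that assumption (this is not Mooncake's actual scheduling code, and the device lists are made up), if the engine chooses the local and peer NIC of each slice independently, then every combination must be reachable, including mlx5_0 -> mlx5_4:

#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Device lists as reported by the prefill and decode sides (made up here).
    std::vector<std::string> local_nics = {"mlx5_0", "mlx5_4"};
    std::vector<std::string> peer_nics  = {"mlx5_0", "mlx5_4"};
    int slice = 0;
    // If NIC selection per slice is independent on both sides, any pair can occur.
    for (const auto &l : local_nics)
        for (const auto &p : peer_nics)
            std::printf("slice %d may use %s -> %s, so this pair must be connectable\n",
                        slice++, l.c_str(), p.c_str());
    return 0;
}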
Does this mean mlx5_0 and mlx5_4 cannot connect to each other, and that is why this error occurs?
This was caused by my test machine having a dual-port RDMA network card: although ibv_devices shows two RDMA devices, the machine in fact has only one physical RDMA NIC.
Using the latest Mooncake code, I test with tp=1, num_rdma_nic=2, qps=2, input_len=200, output_len=100 on a single machine, where the prefill instance count is 1 and the decode instance count is also 1.
My mooncake_config.json is shown as below:
{
"prefill_url": "127.0.0.1:8144",
"decode_url": "127.0.0.1:8149",
"metadata_server": "127.0.0.1:2333",
"metadata_backend": "etcd",
"protocol": "rdma",
"device_name": "mlx5_0,mlx5_1"
}
The following error occurs in transfer_engine:
E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:14.286239 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:21.802461 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
With a single RDMA device (mlx5_0 or mlx5_1), it works fine.