vllm-integration with multi rdma devices error #35

Open
junna2016 opened this issue Dec 13, 2024 · 17 comments

Comments

@junna2016

Using the latest mooncake code, I test with tp=1, num_rdma_nic=2, qps=2, input_len=200, output_len=100 on a single machine, with one prefill instance and one decode instance.
My mooncake_config.json is shown below:

{
"prefill_url": "127.0.0.1:8144",
"decode_url": "127.0.0.1:8149",
"metadata_server": "127.0.0.1:2333",
"metadata_backend": "etcd",
"protocol": "rdma",
"device_name": "mlx5_0,mlx5_1"
}

The following error occurs in transfer_engine:

E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:14.286239 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:21.802461 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded

With a single RDMA device (mlx5_0 or mlx5_1), it works fine.
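For comparison, the single-device configuration that works is the same file with device_name reduced to one NIC (a sketch derived from the config above; only that field changes):

{
"prefill_url": "127.0.0.1:8144",
"decode_url": "127.0.0.1:8149",
"metadata_server": "127.0.0.1:2333",
"metadata_backend": "etcd",
"protocol": "rdma",
"device_name": "mlx5_0"
}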

@ShangmingCai
Collaborator

@alogfans Can you check this?

@alogfans
Collaborator

Can you provide the full log?

@junna2016
Author

junna2016 commented Dec 13, 2024

Can you provide the full log?

  • launch_disagg_prefill
  • model=Qwen2.5-7B-Instruct-GPTQ-Int4
  • etcd --listen-client-urls http://0.0.0.0:2333 --advertise-client-urls http://localhost:2333
  • MOONCAKE_CONFIG_PATH=./mooncake_pipe_config.json
  • CUDA_VISIBLE_DEVICES=1
  • wait_for_server 8166
  • python3 -m vllm.entrypoints.openai.api_server --model Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8166 --max-model-len 1000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}'
  • local port=8166
  • timeout 1200 bash -c '
    until curl -s localhost:8166/v1/completions > /dev/null; do
    sleep 1
    done'
  • MOONCAKE_CONFIG_PATH=./mooncake_pipe_config.json
  • CUDA_VISIBLE_DEVICES=2
    python3 -m vllm.entrypoints.openai.api_server --model Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8277 --max-model-len 1000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}'
    2024-12-13 06:03:54.090538 I | etcdmain: etcd Version: 3.3.25
    2024-12-13 06:03:54.090555 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
    2024-12-13 06:03:54.090558 I | etcdmain: Go Version: go1.18.1
    2024-12-13 06:03:54.090564 I | etcdmain: Go OS/Arch: linux/amd64
    2024-12-13 06:03:54.090568 I | etcdmain: setting maximum number of CPUs to 128, total number of available CPUs is 128
    2024-12-13 06:03:54.090580 W | etcdmain: no data-dir provided, using default data-dir ./default.etcd
    2024-12-13 06:03:54.090623 N | etcdmain: the server is already initialized as member before, starting as etcd member...
    2024-12-13 06:03:54.090903 I | embed: listening for peers on http://localhost:2380
    2024-12-13 06:03:54.090935 I | embed: listening for client requests on 0.0.0.0:2333
    2024-12-13 06:03:54.091241 I | etcdserver: name = default
    2024-12-13 06:03:54.091247 I | etcdserver: data dir = default.etcd
    2024-12-13 06:03:54.091249 I | etcdserver: member dir = default.etcd/member
    2024-12-13 06:03:54.091252 I | etcdserver: heartbeat = 100ms
    2024-12-13 06:03:54.091254 I | etcdserver: election = 1000ms
    2024-12-13 06:03:54.091262 I | etcdserver: snapshot count = 100000
    2024-12-13 06:03:54.091353 I | etcdserver: advertise client URLs = http://localhost:2333
    2024-12-13 06:03:54.093171 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 26
    2024-12-13 06:03:54.093202 I | raft: 8e9e05c52164694d became follower at term 4
    2024-12-13 06:03:54.093225 I | raft: newRaft 8e9e05c52164694d [peers: [], term: 4, commit: 26, applied: 0, lastindex: 26, lastterm: 4]
    2024-12-13 06:03:54.093955 W | auth: simple token is not cryptographically signed
    2024-12-13 06:03:54.094296 I | etcdserver: starting server... [version: 3.3.25, cluster version: to_be_decided]
    2024-12-13 06:03:54.095387 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
    2024-12-13 06:03:54.095519 N | etcdserver/membership: set the initial cluster version to 3.3
    2024-12-13 06:03:54.095549 I | etcdserver/api: enabled capabilities for version 3.3
    2024-12-13 06:03:55.694791 I | raft: 8e9e05c52164694d is starting a new election at term 4
    2024-12-13 06:03:55.694851 I | raft: 8e9e05c52164694d became candidate at term 5
    2024-12-13 06:03:55.694864 I | raft: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 5
    2024-12-13 06:03:55.694873 I | raft: 8e9e05c52164694d became leader at term 5
    2024-12-13 06:03:55.694880 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 5
    2024-12-13 06:03:55.695162 I | embed: ready to serve client requests
    2024-12-13 06:03:55.695238 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2333]} to cluster cdf818194e3a8c32
    2024-12-13 06:03:55.695661 N | embed: serving insecure client requests on [::]:2333, this is strongly discouraged!
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-13 06:03:57 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    WARNING 12-13 06:03:57 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    INFO 12-13 06:03:59 api_server.py:643] vLLM API server version 0.4.3.dev2370+g385d690b
    INFO 12-13 06:03:59 api_server.py:644] args: Namespace(host=None, port=8277, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen2.5-7B-Instruct-GPTQ-Int4', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_allocator='CpuGpuBlockAllocator', block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, cns_offload_gb=0, cns_offload_dir='', gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=5000000000.0, kv_role='kv_consumer', kv_rank=1, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
    INFO 12-13 06:03:59 api_server.py:643] vLLM API server version 0.4.3.dev2370+g385d690b
    INFO 12-13 06:03:59 api_server.py:644] args: Namespace(host=None, port=8166, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['
    '], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen2.5-7B-Instruct-GPTQ-Int4', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_allocator='CpuGpuBlockAllocator', block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, cns_offload_gb=0, cns_offload_dir='', gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=5000000000.0, kv_role='kv_producer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
    INFO 12-13 06:03:59 init.py:42] No plugins found.
    INFO 12-13 06:03:59 api_server.py:178] Multiprocessing frontend to use ipc:///tmp/402d28f4-f834-4178-b0eb-adf69603cc89 for IPC Path.
    INFO 12-13 06:03:59 api_server.py:197] Started engine process with PID 5829
    INFO 12-13 06:03:59 init.py:42] No plugins found.
    INFO 12-13 06:03:59 api_server.py:178] Multiprocessing frontend to use ipc:///tmp/e8b95c58-0bf9-4a0b-8451-5e05bef974a2 for IPC Path.
    INFO 12-13 06:03:59 api_server.py:197] Started engine process with PID 5832
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-13 06:04:01 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-13 06:04:02 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    INFO 12-13 06:04:03 init.py:42] No plugins found.
    INFO 12-13 06:04:04 config.py:399] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
    INFO 12-13 06:04:05 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    WARNING 12-13 06:04:05 arg_utils.py:1171] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
    INFO 12-13 06:04:06 init.py:42] No plugins found.
    INFO 12-13 06:04:07 config.py:399] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
    INFO 12-13 06:04:08 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    WARNING 12-13 06:04:08 arg_utils.py:1171] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
    INFO 12-13 06:04:11 config.py:399] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
    INFO 12-13 06:04:12 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    WARNING 12-13 06:04:12 arg_utils.py:1171] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
    INFO 12-13 06:04:12 llm_engine.py:248] Initializing an LLM engine (v0.4.3.dev2370+g385d690b) with config: model='Qwen2.5-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/Qwen2.5-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, block_allocator=CpuGpuBlockAllocator, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-7B-Instruct-GPTQ-Int4, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_attention', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes=None, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=<function PrivateAttr at 0x7ff4555f4b80>, capture_sizes=<function PrivateAttr at 0x7ff4555f4b80>, enabled_custom_ops=Counter(), disabled_custom_ops=Counter(), static_forward_context={})
    INFO 12-13 06:04:13 selector.py:120] Using Flash Attention backend.
    INFO 12-13 06:04:15 config.py:399] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
    INFO 12-13 06:04:16 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    WARNING 12-13 06:04:16 arg_utils.py:1171] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
    INFO 12-13 06:04:16 llm_engine.py:248] Initializing an LLM engine (v0.4.3.dev2370+g385d690b) with config: model='Qwen2.5-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='Qwen2.5-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, block_allocator=CpuGpuBlockAllocator, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-7B-Instruct-GPTQ-Int4, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_attention', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes=None, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=<function PrivateAttr at 0x7f8222f08b80>, capture_sizes=<function PrivateAttr at 0x7f8222f08b80>, enabled_custom_ops=Counter(), disabled_custom_ops=Counter(), static_forward_context={})
    INFO 12-13 06:04:16 selector.py:120] Using Flash Attention backend.
    INFO 12-13 06:04:16 mooncake_connector.py:38] Initializing MooncakeConnector under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=5000000000.0 kv_role='kv_producer' kv_rank=0 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
    INFO 12-13 06:04:16 mooncake_pipe.py:227] Selecting device: cuda
    INFO 12-13 06:04:16 mooncake_pipe.py:69] Mooncake Configuration loaded successfully.
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1213 06:04:16.739224 5832 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
    I1213 06:04:16.742707 5832 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
    INFO 12-13 06:04:17 model_runner.py:1100] Starting to load model Qwen2.5-7B-Instruct-GPTQ-Int4...
    INFO 12-13 06:04:18 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
    Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
    INFO 12-13 06:04:19 mooncake_connector.py:38] Initializing MooncakeConnector under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=5000000000.0 kv_role='kv_consumer' kv_rank=1 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
    INFO 12-13 06:04:19 mooncake_pipe.py:227] Selecting device: cuda
    INFO 12-13 06:04:19 mooncake_pipe.py:69] Mooncake Configuration loaded successfully.
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1213 06:04:19.949631 5829 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
    I1213 06:04:19.956117 5829 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
    Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.30it/s]
    INFO 12-13 06:04:20 model_runner.py:1100] Starting to load model Qwen2.5-7B-Instruct-GPTQ-Int4...
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.79it/s]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.70it/s]

INFO 12-13 06:04:21 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 12-13 06:04:23 model_runner.py:1105] Loading model weights took 5.1810 GB
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.33it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.83it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.73it/s]

INFO 12-13 06:04:26 worker.py:241] Memory profiling results: duration=3.28 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
INFO 12-13 06:04:26 model_runner.py:1105] Loading model weights took 5.1810 GB
INFO 12-13 06:04:26 gpu_executor.py:79] # GPU blocks: 79450, # CPU blocks: 4681, # CNS blocks: 0
INFO 12-13 06:04:26 gpu_executor.py:83] Maximum concurrency for 1000 tokens per request: 1271.20x
INFO 12-13 06:04:29 worker.py:241] Memory profiling results: duration=3.31 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
INFO 12-13 06:04:30 gpu_executor.py:79] # GPU blocks: 79450, # CPU blocks: 4681, # CNS blocks: 0
INFO 12-13 06:04:30 gpu_executor.py:83] Maximum concurrency for 1000 tokens per request: 1271.20x
INFO 12-13 06:04:37 model_runner.py:1427] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-13 06:04:37 model_runner.py:1431] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-13 06:04:40 model_runner.py:1427] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-13 06:04:40 model_runner.py:1431] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-13 06:05:37 model_runner.py:1545] Graph capturing finished in 60 secs, took 0.95 GiB
INFO 12-13 06:05:39 api_server.py:252] vLLM to use /tmp/tmpgrn9380c as PROMETHEUS_MULTIPROC_DIR
INFO 12-13 06:05:39 api_server.py:578] Using supplied chat template:
INFO 12-13 06:05:39 api_server.py:578] None
INFO 12-13 06:05:39 launcher.py:19] Available routes are:
INFO 12-13 06:05:39 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-13 06:05:39 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-13 06:05:39 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-13 06:05:39 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-13 06:05:39 launcher.py:27] Route: /health, Methods: GET
INFO 12-13 06:05:39 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-13 06:05:39 launcher.py:27] Route: /version, Methods: GET
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/score, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /get_prefix_cache_match_len, Methods: POST
INFO: Started server process [5511]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8166 (Press CTRL+C to quit)
INFO: 127.0.0.1:28580 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed

  • return 0
  • wait_for_server 8277
  • local port=8277
  • timeout 1200 bash -c '
    until curl -s localhost:8277/v1/completions > /dev/null; do
    sleep 1
    done'
    INFO 12-13 06:05:42 model_runner.py:1545] Graph capturing finished in 62 secs, took 0.95 GiB
    INFO 12-13 06:05:44 api_server.py:252] vLLM to use /tmp/tmprzx92pf7 as PROMETHEUS_MULTIPROC_DIR
    INFO 12-13 06:05:44 api_server.py:578] Using supplied chat template:
    INFO 12-13 06:05:44 api_server.py:578] None
    INFO 12-13 06:05:44 launcher.py:19] Available routes are:
    INFO 12-13 06:05:44 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
    INFO 12-13 06:05:44 launcher.py:27] Route: /docs, Methods: GET, HEAD
    INFO 12-13 06:05:44 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
    INFO 12-13 06:05:44 launcher.py:27] Route: /redoc, Methods: GET, HEAD
    INFO 12-13 06:05:44 launcher.py:27] Route: /health, Methods: GET
    INFO 12-13 06:05:44 launcher.py:27] Route: /tokenize, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /detokenize, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/models, Methods: GET
    INFO 12-13 06:05:44 launcher.py:27] Route: /version, Methods: GET
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/chat/completions, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/completions, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/embeddings, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/score, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /get_prefix_cache_match_len, Methods: POST
    INFO: Started server process [5512]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:8277 (Press CTRL+C to quit)
    INFO: 127.0.0.1:29122 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed
  • return 0
  • sleep 1
  • python3 disagg_prefill_proxy_server.py
  • Serving Quart app 'disagg_prefill_proxy_server'
  • Debug mode: False
  • Please use an ASGI server (e.g. Hypercorn) directly in production
  • Running on http://127.0.0.1:8009 (CTRL + C to quit)
    [2024-12-13 06:05:45 +0000] [7263] [INFO] Running on http://127.0.0.1:8009 (CTRL + C to quit)
  • for qps in 2
  • benchmark 2 100 disagg_prefill
  • results_folder=./results
  • model=Qwen2.5-7B-Instruct-GPTQ-Int4
  • dataset_name=sonnet
  • dataset_path=../sonnet_4x.txt
  • num_prompts=20
  • qps=2
  • prefix_len=50
  • input_len=200
  • output_len=100
  • tag=disagg_prefill
  • python3 ../benchmark_serving.py --backend vllm --model Qwen2.5-7B-Instruct-GPTQ-Int4 --dataset-name sonnet --dataset-path ../sonnet_4x.txt --sonnet-input-len 200 --sonnet-output-len 100 --sonnet-prefix-len 50 --num-prompts 20 --port 8009 --save-result --result-dir ./results --result-filename disagg_prefill-qps-2.json --request-rate 2
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-13 06:05:51 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    Namespace(backend='vllm', base_url=None, host='localhost', port=8009, endpoint='/v1/completions', dataset=None, dataset_name='sonnet', dataset_path='../sonnet_4x.txt', max_concurrency=None, model='Qwen2.5-7B-Instruct-GPTQ-Int4', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=20, logprobs=None, request_rate=2.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='./results', result_filename='disagg_prefill-qps-2.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, mooncake_mode='qps', sonnet_input_len=200, sonnet_output_len=100, sonnet_prefix_len=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
    Starting initial single prompt test run...
    INFO 12-13 06:05:52 logger.py:37] Received request cmpl-c1227a5084944665a4b59ad58bfc1aa8-0: prompt: "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nPick as many lines as you can from these poem lines:\n\nThy end is truth's and beauty's doom and date.\nThy youth's proud livery, so gazed on now,\nTo thee I send this written embassage,\nWill be a tatter'd weed, of small worth held:\nIf thou couldst answer 'This fair child of mine\nO, learn to read what silent love hath writ:\nThen let not winter's ragged hand deface\nWho all in one, one pleasing note do sing:\nFor no man well of such a salve can speak\nSo should that beauty which you hold in lease\nTo find where your true image pictured lies;\nAnd only herald to the gaudy spring,\nA liquid prisoner pent in walls of glass,\nPity the world, or else this glutton be,\n<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 36953, 438, 1657, 5128, 438, 498, 646, 504, 1493, 32794, 5128, 1447, 1001, 88, 835, 374, 8046, 594, 323, 13143, 594, 58614, 323, 2400, 624, 1001, 88, 12537, 594, 12409, 326, 6497, 11, 773, 342, 27011, 389, 1431, 345, 1249, 39244, 358, 3624, 419, 5326, 7967, 38236, 345, 9945, 387, 264, 259, 1650, 4172, 39375, 11, 315, 2613, 5802, 5644, 510, 2679, 33123, 1410, 267, 4226, 364, 1986, 6624, 1682, 315, 10485, 198, 46, 11, 3960, 311, 1349, 1128, 21059, 2948, 51577, 2107, 510, 12209, 1077, 537, 12406, 594, 20475, 3556, 1424, 707, 578, 198, 15191, 678, 304, 825, 11, 825, 53699, 5185, 653, 7780, 510, 2461, 902, 883, 1632, 315, 1741, 264, 4274, 586, 646, 6468, 198, 4416, 1265, 429, 13143, 892, 498, 3331, 304, 25064, 198, 1249, 1477, 1380, 697, 830, 2168, 41566, 15448, 280, 3036, 1172, 64106, 311, 279, 342, 7880, 88, 10464, 345, 32, 14473, 41850, 20189, 304, 14285, 315, 8991, 345, 47, 487, 279, 1879, 11, 476, 770, 419, 2770, 959, 387, 345, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
    INFO: ::1:41950 - "POST /v1/completions HTTP/1.1" 200 OK
    INFO 12-13 06:05:52 engine.py:272] Added request cmpl-c1227a5084944665a4b59ad58bfc1aa8-0.
    INFO 12-13 06:05:54 logger.py:37] Received request cmpl-65144a565425441587c3d6b20fc0154e-0: prompt: "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nPick as many lines as you can from these poem lines:\n\nThy end is truth's and beauty's doom and date.\nThy youth's proud livery, so gazed on now,\nTo thee I send this written embassage,\nWill be a tatter'd weed, of small worth held:\nIf thou couldst answer 'This fair child of mine\nO, learn to read what silent love hath writ:\nThen let not winter's ragged hand deface\nWho all in one, one pleasing note do sing:\nFor no man well of such a salve can speak\nSo should that beauty which you hold in lease\nTo find where your true image pictured lies;\nAnd only herald to the gaudy spring,\nA liquid prisoner pent in walls of glass,\nPity the world, or else this glutton be,\n<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 36953, 438, 1657, 5128, 438, 498, 646, 504, 1493, 32794, 5128, 1447, 1001, 88, 835, 374, 8046, 594, 323, 13143, 594, 58614, 323, 2400, 624, 1001, 88, 12537, 594, 12409, 326, 6497, 11, 773, 342, 27011, 389, 1431, 345, 1249, 39244, 358, 3624, 419, 5326, 7967, 38236, 345, 9945, 387, 264, 259, 1650, 4172, 39375, 11, 315, 2613, 5802, 5644, 510, 2679, 33123, 1410, 267, 4226, 364, 1986, 6624, 1682, 315, 10485, 198, 46, 11, 3960, 311, 1349, 1128, 21059, 2948, 51577, 2107, 510, 12209, 1077, 537, 12406, 594, 20475, 3556, 1424, 707, 578, 198, 15191, 678, 304, 825, 11, 825, 53699, 5185, 653, 7780, 510, 2461, 902, 883, 1632, 315, 1741, 264, 4274, 586, 646, 6468, 198, 4416, 1265, 429, 13143, 892, 498, 3331, 304, 25064, 198, 1249, 1477, 1380, 697, 830, 2168, 41566, 15448, 280, 3036, 1172, 64106, 311, 279, 342, 7880, 88, 10464, 345, 32, 14473, 41850, 20189, 304, 14285, 315, 8991, 345, 47, 487, 279, 1879, 11, 476, 770, 419, 2770, 959, 387, 345, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
    INFO: ::1:56412 - "POST /v1/completions HTTP/1.1" 200 OK
    INFO 12-13 06:05:54 engine.py:272] Added request cmpl-65144a565425441587c3d6b20fc0154e-0.
    E1213 06:05:57.718971 6744 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7ff2ebffd010, length: 404, dest_addr: 140190878253072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105344, retry_cnt: 0): transport retry counter exceeded
    E1213 06:06:01.476758 6744 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7ff2ebffd010, length: 404, dest_addr: 140190878253072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105344, retry_cnt: 1): transport retry counter exceeded
    E1213 06:06:05.234935 6744 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7ff2ebffd010, length: 1975, dest_addr: 140190878244880, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105344, retry_cnt: 0): transport retry counter exceeded

@junna2016
Author

I compile mooncake without any extra compilation options such as -DUSE_CUDA, with only the standard mkdir build && cd build && cmake .. && make -j && make install steps.
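For clarity, those build steps written out as a script (no CUDA-related options; this mirrors the commands above):

# Build mooncake without any extra CMake options such as -DUSE_CUDA
mkdir build && cd build
cmake ..
make -j
make install   # may require sudo depending on the install prefix (assumption)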

@ShangmingCai
Collaborator

I compile mooncake without any extra compilation options such as -DUSE_CUDA, with only the standard mkdir build && cd build && cmake .. && make -j && make install steps.

I found that the info indicates your vllm version is

INFO 12-13 06:04:12 llm_engine.py:248] Initializing an LLM engine (v0.4.3.dev2370+g385d690b)

If you built from source on our experimental vllm branch, the version should look something like 0.6.4.post2.devxxxx+xxxxxxx. Did you rebase from an earlier version of vllm?

Also, since the posted logs are mixed together, I want to confirm: are the first few requests normal, with these errors occurring on a subsequent request, or does the first request already report the error?

@junna2016
Author

I compile mooncake without any extra compilation options such as -DUSE_CUDA, with only the standard mkdir build && cd build && cmake .. && make -j && make install steps.

I found that the info indicates your vllm version is

INFO 12-13 06:04:12 llm_engine.py:248] Initializing an LLM engine (v0.4.3.dev2370+g385d690b)

If you built from source on our experimental vllm branch, the version should look something like 0.6.4.post2.devxxxx+xxxxxxx. Did you rebase from an earlier version of vllm?

Also, since the posted logs are mixed together, I want to confirm: are the first few requests normal, with these errors occurring on a subsequent request, or does the first request already report the error?

I work on the vllm main branch with pr10502 merged (commit id: 0590ec3fd9857063c43c80df281e24c16c51b2ec),
and I fetched your mooncake pipe and connector code on top of it. I install vllm in python setup.py develop mode.

The first request already reports the error.
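For reproducibility, the setup described above roughly corresponds to the following (a sketch; the clone URL and exact order of steps are assumptions, the commit id is taken from the comment):

# vLLM main branch at the commit with pr10502 merged, installed in develop mode
git clone https://github.com/vllm-project/vllm.git && cd vllm
git checkout 0590ec3fd9857063c43c80df281e24c16c51b2ec
python setup.py develop
# The Mooncake pipe/connector files are then copied in on top of this checkout.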

@alogfans
Collaborator

I1213 06:04:16.739224 5832 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
I1213 06:04:16.742707 5832 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00

This looks abnormal: both devices have the same LID and GID. When you use the two devices together, the local QP cannot determine which device is the destination of a data transfer, leading to a transfer timeout (i.e., transport retry counter exceeded).

  • You can try a different GID index by setting the MC_GID_INDEX=n environment variable, where n is the GID index; see the sketch after this list. Mooncake Transfer Engine can auto-detect one of the valid GIDs, but that may not work in every case.
  • Are mlx5_0 and mlx5_1 two ports of the same RDMA device? Please provide the output of the ibv_devinfo command.
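A minimal sketch of that check, assuming both NICs are visible to ibv_devinfo on this host (the index value 3 below is only a placeholder):

# List every legal GID on both devices (ibv_devinfo -v prints the GID table)
ibv_devinfo -v -d mlx5_0 | grep 'GID\['
ibv_devinfo -v -d mlx5_1 | grep 'GID\['

# Pick an index whose GID value differs between the two NICs, then export it
# before launching both the prefill and the decode vLLM instance.
export MC_GID_INDEX=3   # placeholder; use a GID index that is valid on this host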

@junna2016
Author

  • ibv_devinfo

[screenshot: ibv_devinfo output]

@junna2016
Author

I1213 06:04:16.739224 5832 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
I1213 06:04:16.742707 5832 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00

This looks abnormal: both devices have the same LID and GID. When you use the two devices together, the local QP cannot determine which device is the destination of a data transfer, leading to a transfer timeout (i.e., transport retry counter exceeded).

  • You can try a different GID index by setting the MC_GID_INDEX=n environment variable, where n is the GID index. Mooncake Transfer Engine can auto-detect one of the valid GIDs, but that may not work in every case.
  • Are mlx5_0 and mlx5_1 two ports of the same RDMA device? Please provide the output of the ibv_devinfo command.

I tried export MC_GID_INDEX=6; it changes the GID of both NICs together.

ibv_devices shows:

[screenshot: ibv_devices output]

Is there a way to distinguish the rdma mlx5 devices by their GUID?

@alogfans
Collaborator

mlx5_0 and mlx5_1 belong to different RDMA NICs (PCIe devices).
It is normal for the GID to change when you change the GID index. As long as the two devices have different GID values, it is usually fine.

@junna2016
Author

mlx5_0 and mlx5_1 belong to different RDMA NICs (PCIe devices). It is normal for the GID to change when you change the GID index. As long as the two devices have different GID values, it is usually fine.

Is this error caused by an incorrect RDMA configuration? Is it normal for two NICs to have the same GID? How can I give the two NICs different GIDs?

@alogfans
Collaborator

The GID is used to identify the device that is the target of a data transfer, similar to an IP address in a TCP/IP network. Each RDMA device has multiple legal GIDs; try setting the GID to another value with MC_GID_INDEX=n. You can use ibv_devinfo -v to find all valid GIDs.
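Besides ibv_devinfo -v, the GID table can also be read directly from sysfs (a sketch; port 1 is assumed, and empty all-zero entries are filtered out):

# Each non-empty file under gids/ is a GID index that can be passed via MC_GID_INDEX
for dev in mlx5_0 mlx5_1; do
  echo "== $dev =="
  grep -H . /sys/class/infiniband/$dev/ports/1/gids/* | grep -v '0000:0000:0000:0000:0000:0000:0000:0000'
done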

@junna2016
Author

I tested on another machine; the vllm code is: https://github.com/kvcache-ai/vllm/tree/upstream-mooncake-integration

I selected different mlx5 devices with different GIDs, but still encountered an error. The log is below:

  • model=/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4
  • etcd --listen-client-urls http://0.0.0.0:2333 --advertise-client-urls http://localhost:2333
  • MOONCAKE_CONFIG_PATH=./mooncake_pipe_config.json
  • CUDA_VISIBLE_DEVICES=3
  • wait_for_server 8166
  • python3 -m vllm.entrypoints.openai.api_server --model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8166 --max-model-len 1000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}'
  • local port=8166
  • timeout 1200 bash -c '
    until curl -s localhost:8166/v1/completions > /dev/null; do
    sleep 1
    done'
  • MOONCAKE_CONFIG_PATH=./mooncake_pipe_config.json
  • CUDA_VISIBLE_DEVICES=4
  • python3 -m vllm.entrypoints.openai.api_server --model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8277 --max-model-len 1000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}'
    2024-12-16 08:38:39.167258 I | etcdmain: etcd Version: 3.3.25
    2024-12-16 08:38:39.167275 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
    2024-12-16 08:38:39.167278 I | etcdmain: Go Version: go1.18.1
    2024-12-16 08:38:39.167284 I | etcdmain: Go OS/Arch: linux/amd64
    2024-12-16 08:38:39.167289 I | etcdmain: setting maximum number of CPUs to 128, total number of available CPUs is 128
    2024-12-16 08:38:39.167292 W | etcdmain: no data-dir provided, using default data-dir ./default.etcd
    2024-12-16 08:38:39.167338 N | etcdmain: the server is already initialized as member before, starting as etcd member...
    2024-12-16 08:38:39.167505 I | embed: listening for peers on http://localhost:2380
    2024-12-16 08:38:39.167545 I | embed: listening for client requests on 0.0.0.0:2333
    2024-12-16 08:38:39.167824 I | etcdserver: name = default
    2024-12-16 08:38:39.167829 I | etcdserver: data dir = default.etcd
    2024-12-16 08:38:39.167832 I | etcdserver: member dir = default.etcd/member
    2024-12-16 08:38:39.167838 I | etcdserver: heartbeat = 100ms
    2024-12-16 08:38:39.167842 I | etcdserver: election = 1000ms
    2024-12-16 08:38:39.167844 I | etcdserver: snapshot count = 100000
    2024-12-16 08:38:39.167949 I | etcdserver: advertise client URLs = http://localhost:2333
    2024-12-16 08:38:39.169644 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 28
    2024-12-16 08:38:39.169677 I | raft: 8e9e05c52164694d became follower at term 5
    2024-12-16 08:38:39.169691 I | raft: newRaft 8e9e05c52164694d [peers: [], term: 5, commit: 28, applied: 0, lastindex: 28, lastterm: 5]
    2024-12-16 08:38:39.170384 W | auth: simple token is not cryptographically signed
    2024-12-16 08:38:39.170692 I | etcdserver: starting server... [version: 3.3.25, cluster version: to_be_decided]
    2024-12-16 08:38:39.171679 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
    2024-12-16 08:38:39.171812 N | etcdserver/membership: set the initial cluster version to 3.3
    2024-12-16 08:38:39.171842 I | etcdserver/api: enabled capabilities for version 3.3
    2024-12-16 08:38:40.870663 I | raft: 8e9e05c52164694d is starting a new election at term 5
    2024-12-16 08:38:40.870728 I | raft: 8e9e05c52164694d became candidate at term 6
    2024-12-16 08:38:40.870770 I | raft: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 6
    2024-12-16 08:38:40.870780 I | raft: 8e9e05c52164694d became leader at term 6
    2024-12-16 08:38:40.870785 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 6
    2024-12-16 08:38:40.870966 I | embed: ready to serve client requests
    2024-12-16 08:38:40.871058 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2333]} to cluster cdf818194e3a8c32
    2024-12-16 08:38:40.871389 N | embed: serving insecure client requests on [::]:2333, this is strongly discouraged!
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-16 08:38:42 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    WARNING 12-16 08:38:42 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    INFO 12-16 08:38:44 api_server.py:643] vLLM API server version 0.1.dev3835+g875ca4c
    INFO 12-16 08:38:44 api_server.py:644] args: Namespace(host=None, port=8166, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=5000000000.0, kv_role='kv_producer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
    INFO 12-16 08:38:44 api_server.py:198] Started engine process with PID 15042
    INFO 12-16 08:38:44 api_server.py:643] vLLM API server version 0.1.dev3835+g875ca4c
    INFO 12-16 08:38:44 api_server.py:644] args: Namespace(host=None, port=8277, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['
    '], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=5000000000.0, kv_role='kv_consumer', kv_rank=1, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
    INFO 12-16 08:38:44 api_server.py:198] Started engine process with PID 15047
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-16 08:38:46 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-16 08:38:47 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    INFO 12-16 08:38:49 config.py:451] This model supports multiple tasks: {'score', 'generate', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
    INFO 12-16 08:38:50 config.py:451] This model supports multiple tasks: {'embed', 'reward', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
    INFO 12-16 08:38:53 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 12-16 08:38:53 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 12-16 08:38:56 config.py:451] This model supports multiple tasks: {'classify', 'embed', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
    INFO 12-16 08:38:56 config.py:451] This model supports multiple tasks: {'reward', 'score', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
    INFO 12-16 08:38:57 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 12-16 08:38:57 llm_engine.py:249] Initializing an LLM engine (v0.1.dev3835+g875ca4c) with config: model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
    INFO 12-16 08:38:57 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 12-16 08:38:57 llm_engine.py:249] Initializing an LLM engine (v0.1.dev3835+g875ca4c) with config: model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
    INFO 12-16 08:38:57 selector.py:120] Using Flash Attention backend.
    INFO 12-16 08:38:57 selector.py:120] Using Flash Attention backend.
    INFO 12-16 08:39:01 simple_connector.py:58] Initializing MooncakeConfig under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=5000000000.0 kv_role='kv_consumer' kv_rank=1 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
    INFO 12-16 08:39:01 mooncake_pipe.py:227] Selecting device: cuda
    INFO 12-16 08:39:01 mooncake_pipe.py:69] Mooncake Configuration loaded successfully.
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1216 08:39:02.007782 15047 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:04:00
    I1216 08:39:02.013756 15047 rdma_context.cpp:131] RDMA device: mlx5_4, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:21:ff:ab:23
    INFO 12-16 08:39:02 simple_connector.py:58] Initializing MooncakeConfig under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=5000000000.0 kv_role='kv_producer' kv_rank=0 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
    INFO 12-16 08:39:02 mooncake_pipe.py:227] Selecting device: cuda
    INFO 12-16 08:39:02 mooncake_pipe.py:69] Mooncake Configuration loaded successfully.
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1216 08:39:02.230799 15042 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:04:00
    I1216 08:39:02.234283 15042 rdma_context.cpp:131] RDMA device: mlx5_4, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:21:ff:ab:23
    INFO 12-16 08:39:02 model_runner.py:1092] Starting to load model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4...
    INFO 12-16 08:39:02 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
    INFO 12-16 08:39:03 model_runner.py:1092] Starting to load model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4...
    INFO 12-16 08:39:03 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
    Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 3.33it/s]
    Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 3.63it/s]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.57it/s]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.70it/s]

Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.71it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.86it/s]

INFO 12-16 08:39:04 model_runner.py:1097] Loading model weights took 5.1810 GB
INFO 12-16 08:39:04 model_runner.py:1097] Loading model weights took 5.1810 GB
INFO 12-16 08:39:05 worker.py:237] Memory profiling results: duration=0.82 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
INFO 12-16 08:39:05 worker.py:237] Memory profiling results: duration=0.87 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
INFO 12-16 08:39:05 gpu_executor.py:76] # GPU blocks: 79450, # CPU blocks: 4681
INFO 12-16 08:39:05 gpu_executor.py:80] Maximum concurrency for 1000 tokens per request: 1271.20x
INFO 12-16 08:39:06 gpu_executor.py:76] # GPU blocks: 79450, # CPU blocks: 4681
INFO 12-16 08:39:06 gpu_executor.py:80] Maximum concurrency for 1000 tokens per request: 1271.20x
INFO 12-16 08:39:09 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-16 08:39:09 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-16 08:39:09 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-16 08:39:09 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-16 08:39:27 model_runner.py:1527] Graph capturing finished in 19 secs, took 1.43 GiB
INFO 12-16 08:39:27 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 23.05 seconds
INFO 12-16 08:39:28 model_runner.py:1527] Graph capturing finished in 19 secs, took 1.43 GiB
INFO 12-16 08:39:28 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 23.55 seconds
INFO 12-16 08:39:29 api_server.py:578] Using supplied chat template:
INFO 12-16 08:39:29 api_server.py:578] None
INFO 12-16 08:39:29 launcher.py:19] Available routes are:
INFO 12-16 08:39:29 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /health, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /version, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /score, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [14716]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8277 (Press CTRL+C to quit)
INFO 12-16 08:39:29 api_server.py:578] Using supplied chat template:
INFO 12-16 08:39:29 api_server.py:578] None
INFO 12-16 08:39:29 launcher.py:19] Available routes are:
INFO 12-16 08:39:29 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /health, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /version, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /score, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [14715]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8166 (Press CTRL+C to quit)
INFO: 127.0.0.1:27236 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed
INFO: 127.0.0.1:27252 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed

  • return 0
  • wait_for_server 8277
  • local port=8277
  • timeout 1200 bash -c '
    until curl -s localhost:8277/v1/completions > /dev/null; do
    sleep 1
    done'
    INFO: 127.0.0.1:22874 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed
  • return 0
  • sleep 1
  • python3 disagg_prefill_proxy_server.py
  • Serving Quart app 'disagg_prefill_proxy_server'
  • Debug mode: False
  • Please use an ASGI server (e.g. Hypercorn) directly in production
  • Running on http://127.0.0.1:8009 (CTRL + C to quit)
    [2024-12-16 08:39:30 +0000] [16446] [INFO] Running on http://127.0.0.1:8009 (CTRL + C to quit)
  • for qps in 1
  • benchmark 1 100 disagg_prefill
  • results_folder=./results
  • model=/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4
  • dataset_name=sonnet
  • dataset_path=../sonnet_4x.txt
  • num_prompts=1
  • qps=1
  • prefix_len=50
  • input_len=200
  • output_len=100
  • tag=disagg_prefill
  • python3 ../benchmark_serving.py --backend vllm --model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4 --dataset-name sonnet --dataset-path ../sonnet_4x.txt --sonnet-input-len 200 --sonnet-output-len 100 --sonnet-prefix-len 50 --num-prompts 1 --port 8009 --save-result --result-dir ./results --result-filename disagg_prefill-qps-1.json --request-rate 1
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-16 08:39:34 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    Namespace(backend='vllm', base_url=None, host='localhost', port=8009, endpoint='/v1/completions', dataset=None, dataset_name='sonnet', dataset_path='../sonnet_4x.txt', max_concurrency=None, model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='./results', result_filename='disagg_prefill-qps-1.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, mooncake_mode='qps', sonnet_input_len=200, sonnet_output_len=100, sonnet_prefix_len=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
    Starting initial single prompt test run...
    INFO 12-16 08:39:36 logger.py:37] Received request cmpl-b869f75643534b3ab3d1105fec1995f3-0: prompt: "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nPick as many lines as you can from these poem lines:\n\nThy end is truth's and beauty's doom and date.\nThy youth's proud livery, so gazed on now,\nTo thee I send this written embassage,\nWill be a tatter'd weed, of small worth held:\nIf thou couldst answer 'This fair child of mine\nO, learn to read what silent love hath writ:\nThen let not winter's ragged hand deface\nWho all in one, one pleasing note do sing:\nFor no man well of such a salve can speak\nSo should that beauty which you hold in lease\nTo find where your true image pictured lies;\nAnd only herald to the gaudy spring,\nA liquid prisoner pent in walls of glass,\nPity the world, or else this glutton be,\n<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 36953, 438, 1657, 5128, 438, 498, 646, 504, 1493, 32794, 5128, 1447, 1001, 88, 835, 374, 8046, 594, 323, 13143, 594, 58614, 323, 2400, 624, 1001, 88, 12537, 594, 12409, 326, 6497, 11, 773, 342, 27011, 389, 1431, 345, 1249, 39244, 358, 3624, 419, 5326, 7967, 38236, 345, 9945, 387, 264, 259, 1650, 4172, 39375, 11, 315, 2613, 5802, 5644, 510, 2679, 33123, 1410, 267, 4226, 364, 1986, 6624, 1682, 315, 10485, 198, 46, 11, 3960, 311, 1349, 1128, 21059, 2948, 51577, 2107, 510, 12209, 1077, 537, 12406, 594, 20475, 3556, 1424, 707, 578, 198, 15191, 678, 304, 825, 11, 825, 53699, 5185, 653, 7780, 510, 2461, 902, 883, 1632, 315, 1741, 264, 4274, 586, 646, 6468, 198, 4416, 1265, 429, 13143, 892, 498, 3331, 304, 25064, 198, 1249, 1477, 1380, 697, 830, 2168, 41566, 15448, 280, 3036, 1172, 64106, 311, 279, 342, 7880, 88, 10464, 345, 32, 14473, 41850, 20189, 304, 14285, 315, 8991, 345, 47, 487, 279, 1879, 11, 476, 770, 419, 2770, 959, 387, 345, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
    INFO: 127.0.0.1:27264 - "POST /v1/completions HTTP/1.1" 200 OK
    INFO 12-16 08:39:36 engine.py:267] Added request cmpl-b869f75643534b3ab3d1105fec1995f3-0.
    INFO 12-16 08:39:36 metrics.py:467] Avg prompt throughput: 30.8 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 12-16 08:39:36 logger.py:37] Received request cmpl-24236fc0a6974d90b2cb4f2ab7e28e73-0: prompt: "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nPick as many lines as you can from these poem lines:\n\nThy end is truth's and beauty's doom and date.\nThy youth's proud livery, so gazed on now,\nTo thee I send this written embassage,\nWill be a tatter'd weed, of small worth held:\nIf thou couldst answer 'This fair child of mine\nO, learn to read what silent love hath writ:\nThen let not winter's ragged hand deface\nWho all in one, one pleasing note do sing:\nFor no man well of such a salve can speak\nSo should that beauty which you hold in lease\nTo find where your true image pictured lies;\nAnd only herald to the gaudy spring,\nA liquid prisoner pent in walls of glass,\nPity the world, or else this glutton be,\n<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 36953, 438, 1657, 5128, 438, 498, 646, 504, 1493, 32794, 5128, 1447, 1001, 88, 835, 374, 8046, 594, 323, 13143, 594, 58614, 323, 2400, 624, 1001, 88, 12537, 594, 12409, 326, 6497, 11, 773, 342, 27011, 389, 1431, 345, 1249, 39244, 358, 3624, 419, 5326, 7967, 38236, 345, 9945, 387, 264, 259, 1650, 4172, 39375, 11, 315, 2613, 5802, 5644, 510, 2679, 33123, 1410, 267, 4226, 364, 1986, 6624, 1682, 315, 10485, 198, 46, 11, 3960, 311, 1349, 1128, 21059, 2948, 51577, 2107, 510, 12209, 1077, 537, 12406, 594, 20475, 3556, 1424, 707, 578, 198, 15191, 678, 304, 825, 11, 825, 53699, 5185, 653, 7780, 510, 2461, 902, 883, 1632, 315, 1741, 264, 4274, 586, 646, 6468, 198, 4416, 1265, 429, 13143, 892, 498, 3331, 304, 25064, 198, 1249, 1477, 1380, 697, 830, 2168, 41566, 15448, 280, 3036, 1172, 64106, 311, 279, 342, 7880, 88, 10464, 345, 32, 14473, 41850, 20189, 304, 14285, 315, 8991, 345, 47, 487, 279, 1879, 11, 476, 770, 419, 2770, 959, 387, 345, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
    INFO: 127.0.0.1:22876 - "POST /v1/completions HTTP/1.1" 200 OK
    INFO 12-16 08:39:36 engine.py:267] Added request cmpl-24236fc0a6974d90b2cb4f2ab7e28e73-0.
    E1216 08:39:37.446064 16023 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:37.446094 16023 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8149@mlx5_4
    E1216 08:39:38.470036 16023 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:38.470063 16023 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8149@mlx5_4
    E1216 08:39:39.494045 16023 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:39.494073 16023 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8149@mlx5_4
    E1216 08:39:43.255225 15996 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7f6c7bff9010, length: 1975, dest_addr: 140335372029968, local_nic: mlx5_4, peer_nic: 127.0.0.1:8144@mlx5_0, dest_rkey: 2675712, retry_cnt: 0): transport retry counter exceeded
    E1216 08:39:44.294055 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:44.294175 15996 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:44.294210 15996 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:44.294214 15996 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:45.318032 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:45.318158 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:45.318205 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:45.318209 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:46.342036 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:46.342141 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:46.342185 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:46.342190 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:47.366036 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:47.366148 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:47.366191 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:47.366196 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:48.374042 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:48.374089 15993 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:48.374109 15993 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_4
    E1216 08:39:48.374147 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:48.374181 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:48.374186 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:49.414047 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:49.414168 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:49.414212 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:49.414218 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    INFO 12-16 08:39:49 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    E1216 08:39:50.422031 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:50.422031 15993 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:50.422063 15993 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_4
    E1216 08:39:50.422137 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:50.422179 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:50.422184 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:51.430033 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:51.430141 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:51.430187 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:51.430191 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0

@alogfans
Copy link
Collaborator

Currently the vLLM integration of Transfer Engine assumes that all devices on the prefill nodes are connectable to all devices on the decode nodes. That is, the code may issue an mlx5_0@local -> mlx5_1@remote transfer request even when the two cards are physically unreachable from each other (e.g., on separate networks). We are planning to support auto-detection.
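
Until auto-detection is available, one way to check this assumption by hand is to test every (local NIC, remote NIC) pair directly. The commands below are only a rough sketch: they assume the perftest tools (ib_write_bw) are installed, the device names are the ones from this issue, and 127.0.0.1 is used because prefill and decode run on the same machine here.

```bash
# On the decode side, start a bandwidth server bound to one device
# (restart it for each pair you want to test):
ib_write_bw -d mlx5_0

# On the prefill side, point one local device at that server.
# RoCE setups may additionally need an explicit GID index via -x <gid_index>.
ib_write_bw -d mlx5_4 127.0.0.1

# Repeat for every (local device, remote device) pair; a pair that hangs or
# fails here is a pair the transfer engine cannot use either.
```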

@junna2016
Copy link
Author

junna2016 commented Dec 16, 2024

Currently the vLLM integration of Transfer Engine assumes that all devices on the prefill nodes are connectable to all devices on the decode nodes. That is, the code may issue an mlx5_0@local -> mlx5_1@remote transfer request even when the two cards are physically unreachable from each other (e.g., on separate networks). We are planning to support auto-detection.

Does this mean that mlx5_0 and mlx5_4 cannot reach each other, and that is why this error occurs?

@junna2016
Copy link
Author

I1213 06:04:16.739224 5832 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
I1213 06:04:16.742707 5832 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00

It seems abnormal that both devices have the same LID and GID. When you use the two devices together, the local QP cannot determine the destination device for the data transfer, leading to a transfer timeout (i.e., transport retry counter exceeded).

  • You can try another GID index by setting the MC_GID_INDEX=n environment variable, where n is the GID index. Mooncake Transfer Engine can detect one of the valid GIDs, but this may not work in every case (see the commands sketched below).
  • Are mlx5_0 and mlx5_1 two ports of the same RDMA device? Please provide the output of the ibv_devinfo command.
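
For reference, a minimal sketch of how the GID table could be inspected and the index pinned; the device names and the chosen index below are examples from this issue, not a recommendation:

```bash
# Dump the GID table of each device; with -v, ibv_devinfo prints one GID[n]
# line per populated entry, so you can see which indices exist and whether
# the two devices really report identical GIDs.
ibv_devinfo -v -d mlx5_0 | grep -i 'GID\['
ibv_devinfo -v -d mlx5_1 | grep -i 'GID\['

# Then pin the index Mooncake should use before launching the vLLM instances:
export MC_GID_INDEX=3
```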

This is caused by my test machine having a dual-port RDMA network card: although ibv_devices lists two RDMA devices, the machine in fact has only one physical RDMA device.
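
As a rough heuristic (a sketch, not an official check), two ibverbs devices that resolve to the same PCI slot in sysfs, differing only in the function number, are usually two ports/functions of one physical card:

```bash
# Map each ibverbs device name to the PCI address it hangs off.
for dev in /sys/class/infiniband/*; do
    printf '%s -> %s\n' "$(basename "$dev")" "$(readlink -f "$dev/device")"
done
# e.g. mlx5_0 -> .../0000:3b:00.0 and mlx5_1 -> .../0000:3b:00.1
# would indicate two functions of the same dual-port NIC.
```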

@alogfans
Copy link
Collaborator

Currently the vLLM integration of Transfer Engine assumes that all devices on the prefill nodes are connectable to all devices on the decode nodes. That is, the code may issue an mlx5_0@local -> mlx5_1@remote transfer request even when the two cards are physically unreachable from each other (e.g., on separate networks). We are planning to support auto-detection.

Does this mean that mlx5_0 and mlx5_4 cannot reach each other, and that is why this error occurs?

The mlx5_0 devices on both machines are in one network, while the mlx5_4 devices on both machines are in another network.
