vllm-integration with multi rdma devices error #35

Open
junna2016 opened this issue Dec 13, 2024 · 17 comments

Comments

@junna2016

Using the latest mooncake code, I test with tp=1, num_rdma_nic=2, qps=2, input_len=200, output_len=100 on a single machine, with one prefill instance and one decode instance.
My mooncake_config.json is shown below:

{
"prefill_url": "127.0.0.1:8144",
"decode_url": "127.0.0.1:8149",
"metadata_server": "127.0.0.1:2333",
"metadata_backend": "etcd",
"protocol": "rdma",
"device_name": "mlx5_0,mlx5_1"
}

The following error occurs in transfer_engine:

E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:14.286239 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:21.802461 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded

With a single RDMA device (mlx5_0 or mlx5_1), it works fine.
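For comparison, the single-device configuration that works is the same file with device_name reduced to one NIC (a sketch derived from the config above; only that field changes):

{
"prefill_url": "127.0.0.1:8144",
"decode_url": "127.0.0.1:8149",
"metadata_server": "127.0.0.1:2333",
"metadata_backend": "etcd",
"protocol": "rdma",
"device_name": "mlx5_0"
}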

@ShangmingCai
Collaborator

@alogfans Can you check this?

@alogfans
Collaborator

Can you provide the full log?

@junna2016
Author

junna2016 commented Dec 13, 2024

Can you provide the full log?

  • launch_disagg_prefill
  • model=Qwen2.5-7B-Instruct-GPTQ-Int4
  • etcd --listen-client-urls http://0.0.0.0:2333 --advertise-client-urls http://localhost:2333
  • MOONCAKE_CONFIG_PATH=./mooncake_pipe_config.json
  • CUDA_VISIBLE_DEVICES=1
  • wait_for_server 8166
  • python3 -m vllm.entrypoints.openai.api_server --model Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8166 --max-model-len 1000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}'
  • local port=8166
  • timeout 1200 bash -c '
    until curl -s localhost:8166/v1/completions > /dev/null; do
    sleep 1
    done'
  • MOONCAKE_CONFIG_PATH=./mooncake_pipe_config.json
  • CUDA_VISIBLE_DEVICES=2
    python3 -m vllm.entrypoints.openai.api_server --model Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8277 --max-model-len 1000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}'
    2024-12-13 06:03:54.090538 I | etcdmain: etcd Version: 3.3.25
    2024-12-13 06:03:54.090555 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
    2024-12-13 06:03:54.090558 I | etcdmain: Go Version: go1.18.1
    2024-12-13 06:03:54.090564 I | etcdmain: Go OS/Arch: linux/amd64
    2024-12-13 06:03:54.090568 I | etcdmain: setting maximum number of CPUs to 128, total number of available CPUs is 128
    2024-12-13 06:03:54.090580 W | etcdmain: no data-dir provided, using default data-dir ./default.etcd
    2024-12-13 06:03:54.090623 N | etcdmain: the server is already initialized as member before, starting as etcd member...
    2024-12-13 06:03:54.090903 I | embed: listening for peers on http://localhost:2380
    2024-12-13 06:03:54.090935 I | embed: listening for client requests on 0.0.0.0:2333
    2024-12-13 06:03:54.091241 I | etcdserver: name = default
    2024-12-13 06:03:54.091247 I | etcdserver: data dir = default.etcd
    2024-12-13 06:03:54.091249 I | etcdserver: member dir = default.etcd/member
    2024-12-13 06:03:54.091252 I | etcdserver: heartbeat = 100ms
    2024-12-13 06:03:54.091254 I | etcdserver: election = 1000ms
    2024-12-13 06:03:54.091262 I | etcdserver: snapshot count = 100000
    2024-12-13 06:03:54.091353 I | etcdserver: advertise client URLs = http://localhost:2333
    2024-12-13 06:03:54.093171 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 26
    2024-12-13 06:03:54.093202 I | raft: 8e9e05c52164694d became follower at term 4
    2024-12-13 06:03:54.093225 I | raft: newRaft 8e9e05c52164694d [peers: [], term: 4, commit: 26, applied: 0, lastindex: 26, lastterm: 4]
    2024-12-13 06:03:54.093955 W | auth: simple token is not cryptographically signed
    2024-12-13 06:03:54.094296 I | etcdserver: starting server... [version: 3.3.25, cluster version: to_be_decided]
    2024-12-13 06:03:54.095387 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
    2024-12-13 06:03:54.095519 N | etcdserver/membership: set the initial cluster version to 3.3
    2024-12-13 06:03:54.095549 I | etcdserver/api: enabled capabilities for version 3.3
    2024-12-13 06:03:55.694791 I | raft: 8e9e05c52164694d is starting a new election at term 4
    2024-12-13 06:03:55.694851 I | raft: 8e9e05c52164694d became candidate at term 5
    2024-12-13 06:03:55.694864 I | raft: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 5
    2024-12-13 06:03:55.694873 I | raft: 8e9e05c52164694d became leader at term 5
    2024-12-13 06:03:55.694880 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 5
    2024-12-13 06:03:55.695162 I | embed: ready to serve client requests
    2024-12-13 06:03:55.695238 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2333]} to cluster cdf818194e3a8c32
    2024-12-13 06:03:55.695661 N | embed: serving insecure client requests on [::]:2333, this is strongly discouraged!
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-13 06:03:57 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    WARNING 12-13 06:03:57 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    INFO 12-13 06:03:59 api_server.py:643] vLLM API server version 0.4.3.dev2370+g385d690b
    INFO 12-13 06:03:59 api_server.py:644] args: Namespace(host=None, port=8277, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen2.5-7B-Instruct-GPTQ-Int4', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_allocator='CpuGpuBlockAllocator', block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, cns_offload_gb=0, cns_offload_dir='', gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=5000000000.0, kv_role='kv_consumer', kv_rank=1, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
    INFO 12-13 06:03:59 api_server.py:643] vLLM API server version 0.4.3.dev2370+g385d690b
    INFO 12-13 06:03:59 api_server.py:644] args: Namespace(host=None, port=8166, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['
    '], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen2.5-7B-Instruct-GPTQ-Int4', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_allocator='CpuGpuBlockAllocator', block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, cns_offload_gb=0, cns_offload_dir='', gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=5000000000.0, kv_role='kv_producer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
    INFO 12-13 06:03:59 init.py:42] No plugins found.
    INFO 12-13 06:03:59 api_server.py:178] Multiprocessing frontend to use ipc:///tmp/402d28f4-f834-4178-b0eb-adf69603cc89 for IPC Path.
    INFO 12-13 06:03:59 api_server.py:197] Started engine process with PID 5829
    INFO 12-13 06:03:59 init.py:42] No plugins found.
    INFO 12-13 06:03:59 api_server.py:178] Multiprocessing frontend to use ipc:///tmp/e8b95c58-0bf9-4a0b-8451-5e05bef974a2 for IPC Path.
    INFO 12-13 06:03:59 api_server.py:197] Started engine process with PID 5832
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-13 06:04:01 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-13 06:04:02 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    INFO 12-13 06:04:03 init.py:42] No plugins found.
    INFO 12-13 06:04:04 config.py:399] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
    INFO 12-13 06:04:05 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    WARNING 12-13 06:04:05 arg_utils.py:1171] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
    INFO 12-13 06:04:06 init.py:42] No plugins found.
    INFO 12-13 06:04:07 config.py:399] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
    INFO 12-13 06:04:08 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    WARNING 12-13 06:04:08 arg_utils.py:1171] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
    INFO 12-13 06:04:11 config.py:399] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
    INFO 12-13 06:04:12 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    WARNING 12-13 06:04:12 arg_utils.py:1171] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
    INFO 12-13 06:04:12 llm_engine.py:248] Initializing an LLM engine (v0.4.3.dev2370+g385d690b) with config: model='Qwen2.5-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/Qwen2.5-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, block_allocator=CpuGpuBlockAllocator, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-7B-Instruct-GPTQ-Int4, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_attention', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes=None, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=<function PrivateAttr at 0x7ff4555f4b80>, capture_sizes=<function PrivateAttr at 0x7ff4555f4b80>, enabled_custom_ops=Counter(), disabled_custom_ops=Counter(), static_forward_context={})
    INFO 12-13 06:04:13 selector.py:120] Using Flash Attention backend.
    INFO 12-13 06:04:15 config.py:399] This model supports multiple tasks: {'embedding', 'generate'}. Defaulting to 'generate'.
    INFO 12-13 06:04:16 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    WARNING 12-13 06:04:16 arg_utils.py:1171] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
    INFO 12-13 06:04:16 llm_engine.py:248] Initializing an LLM engine (v0.4.3.dev2370+g385d690b) with config: model='Qwen2.5-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='Qwen2.5-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, block_allocator=CpuGpuBlockAllocator, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-7B-Instruct-GPTQ-Int4, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None, pooler_config=None,compilation_config=CompilationConfig(level=0, backend='', custom_ops=[], splitting_ops=['vllm.unified_attention', 'vllm.unified_v1_flash_attention'], use_inductor=True, inductor_specialize_for_cudagraph_no_more_than=None, inductor_compile_sizes=None, inductor_compile_config={}, inductor_passes={}, use_cudagraph=False, cudagraph_num_of_warmups=0, cudagraph_capture_sizes=None, cudagraph_copy_inputs=False, pass_config=PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), compile_sizes=<function PrivateAttr at 0x7f8222f08b80>, capture_sizes=<function PrivateAttr at 0x7f8222f08b80>, enabled_custom_ops=Counter(), disabled_custom_ops=Counter(), static_forward_context={})
    INFO 12-13 06:04:16 selector.py:120] Using Flash Attention backend.
    INFO 12-13 06:04:16 mooncake_connector.py:38] Initializing MooncakeConnector under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=5000000000.0 kv_role='kv_producer' kv_rank=0 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
    INFO 12-13 06:04:16 mooncake_pipe.py:227] Selecting device: cuda
    INFO 12-13 06:04:16 mooncake_pipe.py:69] Mooncake Configuration loaded successfully.
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1213 06:04:16.739224 5832 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
    I1213 06:04:16.742707 5832 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
    INFO 12-13 06:04:17 model_runner.py:1100] Starting to load model Qwen2.5-7B-Instruct-GPTQ-Int4...
    INFO 12-13 06:04:18 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
    Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
    INFO 12-13 06:04:19 mooncake_connector.py:38] Initializing MooncakeConnector under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=5000000000.0 kv_role='kv_consumer' kv_rank=1 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
    INFO 12-13 06:04:19 mooncake_pipe.py:227] Selecting device: cuda
    INFO 12-13 06:04:19 mooncake_pipe.py:69] Mooncake Configuration loaded successfully.
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1213 06:04:19.949631 5829 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
    I1213 06:04:19.956117 5829 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
    Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.30it/s]
    INFO 12-13 06:04:20 model_runner.py:1100] Starting to load model Qwen2.5-7B-Instruct-GPTQ-Int4...
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.79it/s]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.70it/s]

INFO 12-13 06:04:21 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 12-13 06:04:23 model_runner.py:1105] Loading model weights took 5.1810 GB
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.33it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.83it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.73it/s]

INFO 12-13 06:04:26 worker.py:241] Memory profiling results: duration=3.28 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
INFO 12-13 06:04:26 model_runner.py:1105] Loading model weights took 5.1810 GB
INFO 12-13 06:04:26 gpu_executor.py:79] # GPU blocks: 79450, # CPU blocks: 4681, # CNS blocks: 0
INFO 12-13 06:04:26 gpu_executor.py:83] Maximum concurrency for 1000 tokens per request: 1271.20x
INFO 12-13 06:04:29 worker.py:241] Memory profiling results: duration=3.31 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
INFO 12-13 06:04:30 gpu_executor.py:79] # GPU blocks: 79450, # CPU blocks: 4681, # CNS blocks: 0
INFO 12-13 06:04:30 gpu_executor.py:83] Maximum concurrency for 1000 tokens per request: 1271.20x
INFO 12-13 06:04:37 model_runner.py:1427] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-13 06:04:37 model_runner.py:1431] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-13 06:04:40 model_runner.py:1427] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-13 06:04:40 model_runner.py:1431] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-13 06:05:37 model_runner.py:1545] Graph capturing finished in 60 secs, took 0.95 GiB
INFO 12-13 06:05:39 api_server.py:252] vLLM to use /tmp/tmpgrn9380c as PROMETHEUS_MULTIPROC_DIR
INFO 12-13 06:05:39 api_server.py:578] Using supplied chat template:
INFO 12-13 06:05:39 api_server.py:578] None
INFO 12-13 06:05:39 launcher.py:19] Available routes are:
INFO 12-13 06:05:39 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO 12-13 06:05:39 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO 12-13 06:05:39 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 12-13 06:05:39 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO 12-13 06:05:39 launcher.py:27] Route: /health, Methods: GET
INFO 12-13 06:05:39 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-13 06:05:39 launcher.py:27] Route: /version, Methods: GET
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /v1/score, Methods: POST
INFO 12-13 06:05:39 launcher.py:27] Route: /get_prefix_cache_match_len, Methods: POST
INFO: Started server process [5511]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8166 (Press CTRL+C to quit)
INFO: 127.0.0.1:28580 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed

  • return 0
  • wait_for_server 8277
  • local port=8277
  • timeout 1200 bash -c '
    until curl -s localhost:8277/v1/completions > /dev/null; do
    sleep 1
    done'
    INFO 12-13 06:05:42 model_runner.py:1545] Graph capturing finished in 62 secs, took 0.95 GiB
    INFO 12-13 06:05:44 api_server.py:252] vLLM to use /tmp/tmprzx92pf7 as PROMETHEUS_MULTIPROC_DIR
    INFO 12-13 06:05:44 api_server.py:578] Using supplied chat template:
    INFO 12-13 06:05:44 api_server.py:578] None
    INFO 12-13 06:05:44 launcher.py:19] Available routes are:
    INFO 12-13 06:05:44 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
    INFO 12-13 06:05:44 launcher.py:27] Route: /docs, Methods: GET, HEAD
    INFO 12-13 06:05:44 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
    INFO 12-13 06:05:44 launcher.py:27] Route: /redoc, Methods: GET, HEAD
    INFO 12-13 06:05:44 launcher.py:27] Route: /health, Methods: GET
    INFO 12-13 06:05:44 launcher.py:27] Route: /tokenize, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /detokenize, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/models, Methods: GET
    INFO 12-13 06:05:44 launcher.py:27] Route: /version, Methods: GET
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/chat/completions, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/completions, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/embeddings, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /v1/score, Methods: POST
    INFO 12-13 06:05:44 launcher.py:27] Route: /get_prefix_cache_match_len, Methods: POST
    INFO: Started server process [5512]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:8277 (Press CTRL+C to quit)
    INFO: 127.0.0.1:29122 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed
  • return 0
  • sleep 1
  • python3 disagg_prefill_proxy_server.py
  • Serving Quart app 'disagg_prefill_proxy_server'
  • Debug mode: False
  • Please use an ASGI server (e.g. Hypercorn) directly in production
  • Running on http://127.0.0.1:8009 (CTRL + C to quit)
    [2024-12-13 06:05:45 +0000] [7263] [INFO] Running on http://127.0.0.1:8009 (CTRL + C to quit)
  • for qps in 2
  • benchmark 2 100 disagg_prefill
  • results_folder=./results
  • model=Qwen2.5-7B-Instruct-GPTQ-Int4
  • dataset_name=sonnet
  • dataset_path=../sonnet_4x.txt
  • num_prompts=20
  • qps=2
  • prefix_len=50
  • input_len=200
  • output_len=100
  • tag=disagg_prefill
  • python3 ../benchmark_serving.py --backend vllm --model Qwen2.5-7B-Instruct-GPTQ-Int4 --dataset-name sonnet --dataset-path ../sonnet_4x.txt --sonnet-input-len 200 --sonnet-output-len 100 --sonnet-prefix-len 50 --num-prompts 20 --port 8009 --save-result --result-dir ./results --result-filename disagg_prefill-qps-2.json --request-rate 2
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-13 06:05:51 cuda.py:30] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    Namespace(backend='vllm', base_url=None, host='localhost', port=8009, endpoint='/v1/completions', dataset=None, dataset_name='sonnet', dataset_path='../sonnet_4x.txt', max_concurrency=None, model='Qwen2.5-7B-Instruct-GPTQ-Int4', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=20, logprobs=None, request_rate=2.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='./results', result_filename='disagg_prefill-qps-2.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, mooncake_mode='qps', sonnet_input_len=200, sonnet_output_len=100, sonnet_prefix_len=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
    Starting initial single prompt test run...
    INFO 12-13 06:05:52 logger.py:37] Received request cmpl-c1227a5084944665a4b59ad58bfc1aa8-0: prompt: "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nPick as many lines as you can from these poem lines:\n\nThy end is truth's and beauty's doom and date.\nThy youth's proud livery, so gazed on now,\nTo thee I send this written embassage,\nWill be a tatter'd weed, of small worth held:\nIf thou couldst answer 'This fair child of mine\nO, learn to read what silent love hath writ:\nThen let not winter's ragged hand deface\nWho all in one, one pleasing note do sing:\nFor no man well of such a salve can speak\nSo should that beauty which you hold in lease\nTo find where your true image pictured lies;\nAnd only herald to the gaudy spring,\nA liquid prisoner pent in walls of glass,\nPity the world, or else this glutton be,\n<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 36953, 438, 1657, 5128, 438, 498, 646, 504, 1493, 32794, 5128, 1447, 1001, 88, 835, 374, 8046, 594, 323, 13143, 594, 58614, 323, 2400, 624, 1001, 88, 12537, 594, 12409, 326, 6497, 11, 773, 342, 27011, 389, 1431, 345, 1249, 39244, 358, 3624, 419, 5326, 7967, 38236, 345, 9945, 387, 264, 259, 1650, 4172, 39375, 11, 315, 2613, 5802, 5644, 510, 2679, 33123, 1410, 267, 4226, 364, 1986, 6624, 1682, 315, 10485, 198, 46, 11, 3960, 311, 1349, 1128, 21059, 2948, 51577, 2107, 510, 12209, 1077, 537, 12406, 594, 20475, 3556, 1424, 707, 578, 198, 15191, 678, 304, 825, 11, 825, 53699, 5185, 653, 7780, 510, 2461, 902, 883, 1632, 315, 1741, 264, 4274, 586, 646, 6468, 198, 4416, 1265, 429, 13143, 892, 498, 3331, 304, 25064, 198, 1249, 1477, 1380, 697, 830, 2168, 41566, 15448, 280, 3036, 1172, 64106, 311, 279, 342, 7880, 88, 10464, 345, 32, 14473, 41850, 20189, 304, 14285, 315, 8991, 345, 47, 487, 279, 1879, 11, 476, 770, 419, 2770, 959, 387, 345, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
    INFO: ::1:41950 - "POST /v1/completions HTTP/1.1" 200 OK
    INFO 12-13 06:05:52 engine.py:272] Added request cmpl-c1227a5084944665a4b59ad58bfc1aa8-0.
    INFO 12-13 06:05:54 logger.py:37] Received request cmpl-65144a565425441587c3d6b20fc0154e-0: prompt: "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nPick as many lines as you can from these poem lines:\n\nThy end is truth's and beauty's doom and date.\nThy youth's proud livery, so gazed on now,\nTo thee I send this written embassage,\nWill be a tatter'd weed, of small worth held:\nIf thou couldst answer 'This fair child of mine\nO, learn to read what silent love hath writ:\nThen let not winter's ragged hand deface\nWho all in one, one pleasing note do sing:\nFor no man well of such a salve can speak\nSo should that beauty which you hold in lease\nTo find where your true image pictured lies;\nAnd only herald to the gaudy spring,\nA liquid prisoner pent in walls of glass,\nPity the world, or else this glutton be,\n<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 36953, 438, 1657, 5128, 438, 498, 646, 504, 1493, 32794, 5128, 1447, 1001, 88, 835, 374, 8046, 594, 323, 13143, 594, 58614, 323, 2400, 624, 1001, 88, 12537, 594, 12409, 326, 6497, 11, 773, 342, 27011, 389, 1431, 345, 1249, 39244, 358, 3624, 419, 5326, 7967, 38236, 345, 9945, 387, 264, 259, 1650, 4172, 39375, 11, 315, 2613, 5802, 5644, 510, 2679, 33123, 1410, 267, 4226, 364, 1986, 6624, 1682, 315, 10485, 198, 46, 11, 3960, 311, 1349, 1128, 21059, 2948, 51577, 2107, 510, 12209, 1077, 537, 12406, 594, 20475, 3556, 1424, 707, 578, 198, 15191, 678, 304, 825, 11, 825, 53699, 5185, 653, 7780, 510, 2461, 902, 883, 1632, 315, 1741, 264, 4274, 586, 646, 6468, 198, 4416, 1265, 429, 13143, 892, 498, 3331, 304, 25064, 198, 1249, 1477, 1380, 697, 830, 2168, 41566, 15448, 280, 3036, 1172, 64106, 311, 279, 342, 7880, 88, 10464, 345, 32, 14473, 41850, 20189, 304, 14285, 315, 8991, 345, 47, 487, 279, 1879, 11, 476, 770, 419, 2770, 959, 387, 345, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
    INFO: ::1:56412 - "POST /v1/completions HTTP/1.1" 200 OK
    INFO 12-13 06:05:54 engine.py:272] Added request cmpl-65144a565425441587c3d6b20fc0154e-0.
    E1213 06:05:57.718971 6744 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7ff2ebffd010, length: 404, dest_addr: 140190878253072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105344, retry_cnt: 0): transport retry counter exceeded
    E1213 06:06:01.476758 6744 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7ff2ebffd010, length: 404, dest_addr: 140190878253072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105344, retry_cnt: 1): transport retry counter exceeded
    E1213 06:06:05.234935 6744 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7ff2ebffd010, length: 1975, dest_addr: 140190878244880, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105344, retry_cnt: 0): transport retry counter exceeded

@junna2016
Author

I compile mooncake without any extra compilation options such as -DUSE_CUDA, with only the standard mkdir build && cd build && cmake .. && make -j && make install steps.
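For clarity, those build steps written out as a script (no CUDA-related options; this mirrors the commands above):

# Build mooncake without any extra CMake options such as -DUSE_CUDA
mkdir build && cd build
cmake ..
make -j
make install   # may require sudo depending on the install prefix (assumption)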

@ShangmingCai
Collaborator

I compile mooncake without any extra compilation options such as -DUSE_CUDA, with only the standard mkdir build && cd build && cmake .. && make -j && make install steps.

I found that the info indicates your vllm version is

INFO 12-13 06:04:12 llm_engine.py:248] Initializing an LLM engine (v0.4.3.dev2370+g385d690b)

If you built from source on our experimental vllm branch, the version should look something like 0.6.4.post2.devxxxx+xxxxxxx. Did you rebase from an earlier version of vllm?

Also, since the posted logs are mixed together, I want to confirm: are the first few requests normal, with these errors occurring on a subsequent request, or does the first request already report the error?

@junna2016
Author

I compile mooncake without any extra compilation options such as -DUSE_CUDA, with only the standard mkdir build && cd build && cmake .. && make -j && make install steps.

I found that the info indicates your vllm version is

INFO 12-13 06:04:12 llm_engine.py:248] Initializing an LLM engine (v0.4.3.dev2370+g385d690b)

If you built from source on our experimental vllm branch, the version should look something like 0.6.4.post2.devxxxx+xxxxxxx. Did you rebase from an earlier version of vllm?

Also, since the posted logs are mixed together, I want to confirm: are the first few requests normal, with these errors occurring on a subsequent request, or does the first request already report the error?

I work on the vllm main branch with pr10502 merged (commit id: 0590ec3fd9857063c43c80df281e24c16c51b2ec),
and I fetched your mooncake pipe and connector code on top of it. I install vllm in python setup.py develop mode.

The first request already reports the error.
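For reproducibility, the setup described above roughly corresponds to the following (a sketch; the clone URL and exact order of steps are assumptions, the commit id is taken from the comment):

# vLLM main branch at the commit with pr10502 merged, installed in develop mode
git clone https://github.com/vllm-project/vllm.git && cd vllm
git checkout 0590ec3fd9857063c43c80df281e24c16c51b2ec
python setup.py develop
# The Mooncake pipe/connector files are then copied in on top of this checkout.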

@alogfans
Collaborator

I1213 06:04:16.739224 5832 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
I1213 06:04:16.742707 5832 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00

This looks abnormal: both devices have the same LID and GID. When you use the two devices together, the local QP cannot determine which device is the destination of a data transfer, leading to a transfer timeout (i.e., transport retry counter exceeded).

  • You can try a different GID index by setting the MC_GID_INDEX=n environment variable, where n is the GID index; see the sketch after this list. Mooncake Transfer Engine can auto-detect one of the valid GIDs, but that may not work in every case.
  • Are mlx5_0 and mlx5_1 two ports of the same RDMA device? Please provide the output of the ibv_devinfo command.
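A minimal sketch of that check, assuming both NICs are visible to ibv_devinfo on this host (the index value 3 below is only a placeholder):

# List every legal GID on both devices (ibv_devinfo -v prints the GID table)
ibv_devinfo -v -d mlx5_0 | grep 'GID\['
ibv_devinfo -v -d mlx5_1 | grep 'GID\['

# Pick an index whose GID value differs between the two NICs, then export it
# before launching both the prefill and the decode vLLM instance.
export MC_GID_INDEX=3   # placeholder; use a GID index that is valid on this host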

@junna2016
Author

  • ibv_devinfo

[screenshot: ibv_devinfo output]

@junna2016
Author

I1213 06:04:16.739224 5832 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
I1213 06:04:16.742707 5832 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00

This looks abnormal: both devices have the same LID and GID. When you use the two devices together, the local QP cannot determine which device is the destination of a data transfer, leading to a transfer timeout (i.e., transport retry counter exceeded).

  • You can try a different GID index by setting the MC_GID_INDEX=n environment variable, where n is the GID index. Mooncake Transfer Engine can auto-detect one of the valid GIDs, but that may not work in every case.
  • Are mlx5_0 and mlx5_1 two ports of the same RDMA device? Please provide the output of the ibv_devinfo command.

I tried export MC_GID_INDEX=6; it changes the GID of both NICs together.

ibv_devices shows:

[screenshot: ibv_devices output]

Is there a way to distinguish the rdma mlx5 devices by their GUID?

@alogfans
Collaborator

mlx5_0 and mlx5_1 belong to different RDMA NICs (PCIe devices).
It is normal for the GID to change when you change the GID index. As long as the two devices have different GID values, it is usually fine.

@junna2016
Author

mlx5_0 and mlx5_1 belong to different RDMA NICs (PCIe devices). It is normal for the GID to change when you change the GID index. As long as the two devices have different GID values, it is usually fine.

Is this error caused by an incorrect RDMA configuration? Is it normal for two NICs to have the same GID? How can I give the two NICs different GIDs?

@alogfans
Collaborator

The GID is used to identify the device that is the target of a data transfer, similar to an IP address in a TCP/IP network. Each RDMA device has multiple legal GIDs; try setting the GID to another value with MC_GID_INDEX=n. You can use ibv_devinfo -v to find all valid GIDs.
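Besides ibv_devinfo -v, the GID table can also be read directly from sysfs (a sketch; port 1 is assumed, and empty all-zero entries are filtered out):

# Each non-empty file under gids/ is a GID index that can be passed via MC_GID_INDEX
for dev in mlx5_0 mlx5_1; do
  echo "== $dev =="
  grep -H . /sys/class/infiniband/$dev/ports/1/gids/* | grep -v '0000:0000:0000:0000:0000:0000:0000:0000'
done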

@junna2016
Author

I tested on another machine; the vllm code is: https://github.com/kvcache-ai/vllm/tree/upstream-mooncake-integration

I selected different mlx5 devices with different GIDs, but still encountered an error. The log is below:

  • model=/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4
  • etcd --listen-client-urls http://0.0.0.0:2333 --advertise-client-urls http://localhost:2333
  • MOONCAKE_CONFIG_PATH=./mooncake_pipe_config.json
  • CUDA_VISIBLE_DEVICES=3
  • wait_for_server 8166
  • python3 -m vllm.entrypoints.openai.api_server --model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8166 --max-model-len 1000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}'
  • local port=8166
  • timeout 1200 bash -c '
    until curl -s localhost:8166/v1/completions > /dev/null; do
    sleep 1
    done'
  • MOONCAKE_CONFIG_PATH=./mooncake_pipe_config.json
  • CUDA_VISIBLE_DEVICES=4
  • python3 -m vllm.entrypoints.openai.api_server --model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8277 --max-model-len 1000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}'
    2024-12-16 08:38:39.167258 I | etcdmain: etcd Version: 3.3.25
    2024-12-16 08:38:39.167275 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
    2024-12-16 08:38:39.167278 I | etcdmain: Go Version: go1.18.1
    2024-12-16 08:38:39.167284 I | etcdmain: Go OS/Arch: linux/amd64
    2024-12-16 08:38:39.167289 I | etcdmain: setting maximum number of CPUs to 128, total number of available CPUs is 128
    2024-12-16 08:38:39.167292 W | etcdmain: no data-dir provided, using default data-dir ./default.etcd
    2024-12-16 08:38:39.167338 N | etcdmain: the server is already initialized as member before, starting as etcd member...
    2024-12-16 08:38:39.167505 I | embed: listening for peers on http://localhost:2380
    2024-12-16 08:38:39.167545 I | embed: listening for client requests on 0.0.0.0:2333
    2024-12-16 08:38:39.167824 I | etcdserver: name = default
    2024-12-16 08:38:39.167829 I | etcdserver: data dir = default.etcd
    2024-12-16 08:38:39.167832 I | etcdserver: member dir = default.etcd/member
    2024-12-16 08:38:39.167838 I | etcdserver: heartbeat = 100ms
    2024-12-16 08:38:39.167842 I | etcdserver: election = 1000ms
    2024-12-16 08:38:39.167844 I | etcdserver: snapshot count = 100000
    2024-12-16 08:38:39.167949 I | etcdserver: advertise client URLs = http://localhost:2333
    2024-12-16 08:38:39.169644 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 28
    2024-12-16 08:38:39.169677 I | raft: 8e9e05c52164694d became follower at term 5
    2024-12-16 08:38:39.169691 I | raft: newRaft 8e9e05c52164694d [peers: [], term: 5, commit: 28, applied: 0, lastindex: 28, lastterm: 5]
    2024-12-16 08:38:39.170384 W | auth: simple token is not cryptographically signed
    2024-12-16 08:38:39.170692 I | etcdserver: starting server... [version: 3.3.25, cluster version: to_be_decided]
    2024-12-16 08:38:39.171679 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
    2024-12-16 08:38:39.171812 N | etcdserver/membership: set the initial cluster version to 3.3
    2024-12-16 08:38:39.171842 I | etcdserver/api: enabled capabilities for version 3.3
    2024-12-16 08:38:40.870663 I | raft: 8e9e05c52164694d is starting a new election at term 5
    2024-12-16 08:38:40.870728 I | raft: 8e9e05c52164694d became candidate at term 6
    2024-12-16 08:38:40.870770 I | raft: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 6
    2024-12-16 08:38:40.870780 I | raft: 8e9e05c52164694d became leader at term 6
    2024-12-16 08:38:40.870785 I | raft: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 6
    2024-12-16 08:38:40.870966 I | embed: ready to serve client requests
    2024-12-16 08:38:40.871058 I | etcdserver: published {Name:default ClientURLs:[http://localhost:2333]} to cluster cdf818194e3a8c32
    2024-12-16 08:38:40.871389 N | embed: serving insecure client requests on [::]:2333, this is strongly discouraged!
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-16 08:38:42 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    WARNING 12-16 08:38:42 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    INFO 12-16 08:38:44 api_server.py:643] vLLM API server version 0.1.dev3835+g875ca4c
    INFO 12-16 08:38:44 api_server.py:644] args: Namespace(host=None, port=8166, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=5000000000.0, kv_role='kv_producer', kv_rank=0, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
    INFO 12-16 08:38:44 api_server.py:198] Started engine process with PID 15042
    INFO 12-16 08:38:44 api_server.py:643] vLLM API server version 0.1.dev3835+g875ca4c
    INFO 12-16 08:38:44 api_server.py:644] args: Namespace(host=None, port=8277, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['
    '], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=KVTransferConfig(kv_connector='MooncakeConnector', kv_buffer_device='cuda', kv_buffer_size=5000000000.0, kv_role='kv_consumer', kv_rank=1, kv_parallel_size=2, kv_ip='127.0.0.1', kv_port=14579), worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
    INFO 12-16 08:38:44 api_server.py:198] Started engine process with PID 15047
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-16 08:38:46 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-16 08:38:47 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    INFO 12-16 08:38:49 config.py:451] This model supports multiple tasks: {'score', 'generate', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
    INFO 12-16 08:38:50 config.py:451] This model supports multiple tasks: {'embed', 'reward', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
    INFO 12-16 08:38:53 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 12-16 08:38:53 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 12-16 08:38:56 config.py:451] This model supports multiple tasks: {'classify', 'embed', 'score', 'generate', 'reward'}. Defaulting to 'generate'.
    INFO 12-16 08:38:56 config.py:451] This model supports multiple tasks: {'reward', 'score', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
    INFO 12-16 08:38:57 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 12-16 08:38:57 llm_engine.py:249] Initializing an LLM engine (v0.1.dev3835+g875ca4c) with config: model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
    INFO 12-16 08:38:57 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
    INFO 12-16 08:38:57 llm_engine.py:249] Initializing an LLM engine (v0.1.dev3835+g875ca4c) with config: model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
    INFO 12-16 08:38:57 selector.py:120] Using Flash Attention backend.
    INFO 12-16 08:38:57 selector.py:120] Using Flash Attention backend.
    INFO 12-16 08:39:01 simple_connector.py:58] Initializing MooncakeConfig under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=5000000000.0 kv_role='kv_consumer' kv_rank=1 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
    INFO 12-16 08:39:01 mooncake_pipe.py:227] Selecting device: cuda
    INFO 12-16 08:39:01 mooncake_pipe.py:69] Mooncake Configuration loaded successfully.
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1216 08:39:02.007782 15047 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:04:00
    I1216 08:39:02.013756 15047 rdma_context.cpp:131] RDMA device: mlx5_4, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:21:ff:ab:23
    INFO 12-16 08:39:02 simple_connector.py:58] Initializing MooncakeConfig under kv_transfer_config kv_connector='MooncakeConnector' kv_buffer_device='cuda' kv_buffer_size=5000000000.0 kv_role='kv_producer' kv_rank=0 kv_parallel_size=2 kv_ip='127.0.0.1' kv_port=14579
    INFO 12-16 08:39:02 mooncake_pipe.py:227] Selecting device: cuda
    INFO 12-16 08:39:02 mooncake_pipe.py:69] Mooncake Configuration loaded successfully.
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    I1216 08:39:02.230799 15042 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:04:00
    I1216 08:39:02.234283 15042 rdma_context.cpp:131] RDMA device: mlx5_4, LID: 0, GID: (GID_Index 3) 00:00:00:00:00:00:00:00:00:00:ff:ff:21:ff:ab:23
    INFO 12-16 08:39:02 model_runner.py:1092] Starting to load model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4...
    INFO 12-16 08:39:02 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
    INFO 12-16 08:39:03 model_runner.py:1092] Starting to load model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4...
    INFO 12-16 08:39:03 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
    Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 3.33it/s]
    Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 3.63it/s]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.57it/s]
    Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.70it/s]

Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.71it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.86it/s]

INFO 12-16 08:39:04 model_runner.py:1097] Loading model weights took 5.1810 GB
INFO 12-16 08:39:04 model_runner.py:1097] Loading model weights took 5.1810 GB
INFO 12-16 08:39:05 worker.py:237] Memory profiling results: duration=0.82 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
INFO 12-16 08:39:05 worker.py:237] Memory profiling results: duration=0.87 seconds, total_gpu_memory=79.15GiB, initial_memory_usage=5.89GiB, peak_torch_memory=6.58GiB, memory_usage_post_profile=5.91GiB, non_torch_memory=0.72GiB, kv_cache_size=67.89GiB, gpu_memory_utilization=0.95.
INFO 12-16 08:39:05 gpu_executor.py:76] # GPU blocks: 79450, # CPU blocks: 4681
INFO 12-16 08:39:05 gpu_executor.py:80] Maximum concurrency for 1000 tokens per request: 1271.20x
INFO 12-16 08:39:06 gpu_executor.py:76] # GPU blocks: 79450, # CPU blocks: 4681
INFO 12-16 08:39:06 gpu_executor.py:80] Maximum concurrency for 1000 tokens per request: 1271.20x
INFO 12-16 08:39:09 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-16 08:39:09 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-16 08:39:09 model_runner.py:1413] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-16 08:39:09 model_runner.py:1417] If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 12-16 08:39:27 model_runner.py:1527] Graph capturing finished in 19 secs, took 1.43 GiB
INFO 12-16 08:39:27 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 23.05 seconds
INFO 12-16 08:39:28 model_runner.py:1527] Graph capturing finished in 19 secs, took 1.43 GiB
INFO 12-16 08:39:28 llm_engine.py:446] init engine (profile, create kv cache, warmup model) took 23.55 seconds
INFO 12-16 08:39:29 api_server.py:578] Using supplied chat template:
INFO 12-16 08:39:29 api_server.py:578] None
INFO 12-16 08:39:29 launcher.py:19] Available routes are:
INFO 12-16 08:39:29 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /health, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /version, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /score, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [14716]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8277 (Press CTRL+C to quit)
INFO 12-16 08:39:29 api_server.py:578] Using supplied chat template:
INFO 12-16 08:39:29 api_server.py:578] None
INFO 12-16 08:39:29 launcher.py:19] Available routes are:
INFO 12-16 08:39:29 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 12-16 08:39:29 launcher.py:27] Route: /health, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /tokenize, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /detokenize, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/models, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /version, Methods: GET
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /score, Methods: POST
INFO 12-16 08:39:29 launcher.py:27] Route: /v1/score, Methods: POST
INFO: Started server process [14715]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8166 (Press CTRL+C to quit)
INFO: 127.0.0.1:27236 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed
INFO: 127.0.0.1:27252 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed

  • return 0
  • wait_for_server 8277
  • local port=8277
  • timeout 1200 bash -c '
    until curl -s localhost:8277/v1/completions > /dev/null; do
    sleep 1
    done'
    INFO: 127.0.0.1:22874 - "GET /v1/completions HTTP/1.1" 405 Method Not Allowed
  • return 0
  • sleep 1
  • python3 disagg_prefill_proxy_server.py
  • Serving Quart app 'disagg_prefill_proxy_server'
  • Debug mode: False
  • Please use an ASGI server (e.g. Hypercorn) directly in production
  • Running on http://127.0.0.1:8009 (CTRL + C to quit)
    [2024-12-16 08:39:30 +0000] [16446] [INFO] Running on http://127.0.0.1:8009 (CTRL + C to quit)
  • for qps in 1
  • benchmark 1 100 disagg_prefill
  • results_folder=./results
  • model=/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4
  • dataset_name=sonnet
  • dataset_path=../sonnet_4x.txt
  • num_prompts=1
  • qps=1
  • prefix_len=50
  • input_len=200
  • output_len=100
  • tag=disagg_prefill
  • python3 ../benchmark_serving.py --backend vllm --model /nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4 --dataset-name sonnet --dataset-path ../sonnet_4x.txt --sonnet-input-len 200 --sonnet-output-len 100 --sonnet-prefix-len 50 --num-prompts 1 --port 8009 --save-result --result-dir ./results --result-filename disagg_prefill-qps-1.json --request-rate 1
    Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'.Please follow the instructions at module 'cv2.dnn' has no attribute 'DictValue' opencv/opencv-python#884 to correct your environment. The import of cv2 has been skipped.
    WARNING 12-16 08:39:34 cuda.py:32] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
    Namespace(backend='vllm', base_url=None, host='localhost', port=8009, endpoint='/v1/completions', dataset=None, dataset_name='sonnet', dataset_path='../sonnet_4x.txt', max_concurrency=None, model='/nfs/xingjunna.xjn/workspace/Qwen2.5-7B-Instruct-GPTQ-Int4', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=1.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, metadata=None, result_dir='./results', result_filename='disagg_prefill-qps-1.json', ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, mooncake_mode='qps', sonnet_input_len=200, sonnet_output_len=100, sonnet_prefix_len=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None)
    Starting initial single prompt test run...
    INFO 12-16 08:39:36 logger.py:37] Received request cmpl-b869f75643534b3ab3d1105fec1995f3-0: prompt: "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nPick as many lines as you can from these poem lines:\n\nThy end is truth's and beauty's doom and date.\nThy youth's proud livery, so gazed on now,\nTo thee I send this written embassage,\nWill be a tatter'd weed, of small worth held:\nIf thou couldst answer 'This fair child of mine\nO, learn to read what silent love hath writ:\nThen let not winter's ragged hand deface\nWho all in one, one pleasing note do sing:\nFor no man well of such a salve can speak\nSo should that beauty which you hold in lease\nTo find where your true image pictured lies;\nAnd only herald to the gaudy spring,\nA liquid prisoner pent in walls of glass,\nPity the world, or else this glutton be,\n<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 36953, 438, 1657, 5128, 438, 498, 646, 504, 1493, 32794, 5128, 1447, 1001, 88, 835, 374, 8046, 594, 323, 13143, 594, 58614, 323, 2400, 624, 1001, 88, 12537, 594, 12409, 326, 6497, 11, 773, 342, 27011, 389, 1431, 345, 1249, 39244, 358, 3624, 419, 5326, 7967, 38236, 345, 9945, 387, 264, 259, 1650, 4172, 39375, 11, 315, 2613, 5802, 5644, 510, 2679, 33123, 1410, 267, 4226, 364, 1986, 6624, 1682, 315, 10485, 198, 46, 11, 3960, 311, 1349, 1128, 21059, 2948, 51577, 2107, 510, 12209, 1077, 537, 12406, 594, 20475, 3556, 1424, 707, 578, 198, 15191, 678, 304, 825, 11, 825, 53699, 5185, 653, 7780, 510, 2461, 902, 883, 1632, 315, 1741, 264, 4274, 586, 646, 6468, 198, 4416, 1265, 429, 13143, 892, 498, 3331, 304, 25064, 198, 1249, 1477, 1380, 697, 830, 2168, 41566, 15448, 280, 3036, 1172, 64106, 311, 279, 342, 7880, 88, 10464, 345, 32, 14473, 41850, 20189, 304, 14285, 315, 8991, 345, 47, 487, 279, 1879, 11, 476, 770, 419, 2770, 959, 387, 345, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
    INFO: 127.0.0.1:27264 - "POST /v1/completions HTTP/1.1" 200 OK
    INFO 12-16 08:39:36 engine.py:267] Added request cmpl-b869f75643534b3ab3d1105fec1995f3-0.
    INFO 12-16 08:39:36 metrics.py:467] Avg prompt throughput: 30.8 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    INFO 12-16 08:39:36 logger.py:37] Received request cmpl-24236fc0a6974d90b2cb4f2ab7e28e73-0: prompt: "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nPick as many lines as you can from these poem lines:\n\nThy end is truth's and beauty's doom and date.\nThy youth's proud livery, so gazed on now,\nTo thee I send this written embassage,\nWill be a tatter'd weed, of small worth held:\nIf thou couldst answer 'This fair child of mine\nO, learn to read what silent love hath writ:\nThen let not winter's ragged hand deface\nWho all in one, one pleasing note do sing:\nFor no man well of such a salve can speak\nSo should that beauty which you hold in lease\nTo find where your true image pictured lies;\nAnd only herald to the gaudy spring,\nA liquid prisoner pent in walls of glass,\nPity the world, or else this glutton be,\n<|im_end|>\n<|im_start|>assistant\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 36953, 438, 1657, 5128, 438, 498, 646, 504, 1493, 32794, 5128, 1447, 1001, 88, 835, 374, 8046, 594, 323, 13143, 594, 58614, 323, 2400, 624, 1001, 88, 12537, 594, 12409, 326, 6497, 11, 773, 342, 27011, 389, 1431, 345, 1249, 39244, 358, 3624, 419, 5326, 7967, 38236, 345, 9945, 387, 264, 259, 1650, 4172, 39375, 11, 315, 2613, 5802, 5644, 510, 2679, 33123, 1410, 267, 4226, 364, 1986, 6624, 1682, 315, 10485, 198, 46, 11, 3960, 311, 1349, 1128, 21059, 2948, 51577, 2107, 510, 12209, 1077, 537, 12406, 594, 20475, 3556, 1424, 707, 578, 198, 15191, 678, 304, 825, 11, 825, 53699, 5185, 653, 7780, 510, 2461, 902, 883, 1632, 315, 1741, 264, 4274, 586, 646, 6468, 198, 4416, 1265, 429, 13143, 892, 498, 3331, 304, 25064, 198, 1249, 1477, 1380, 697, 830, 2168, 41566, 15448, 280, 3036, 1172, 64106, 311, 279, 342, 7880, 88, 10464, 345, 32, 14473, 41850, 20189, 304, 14285, 315, 8991, 345, 47, 487, 279, 1879, 11, 476, 770, 419, 2770, 959, 387, 345, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
    INFO: 127.0.0.1:22876 - "POST /v1/completions HTTP/1.1" 200 OK
    INFO 12-16 08:39:36 engine.py:267] Added request cmpl-24236fc0a6974d90b2cb4f2ab7e28e73-0.
    E1216 08:39:37.446064 16023 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:37.446094 16023 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8149@mlx5_4
    E1216 08:39:38.470036 16023 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:38.470063 16023 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8149@mlx5_4
    E1216 08:39:39.494045 16023 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:39.494073 16023 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8149@mlx5_4
    E1216 08:39:43.255225 15996 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7f6c7bff9010, length: 1975, dest_addr: 140335372029968, local_nic: mlx5_4, peer_nic: 127.0.0.1:8144@mlx5_0, dest_rkey: 2675712, retry_cnt: 0): transport retry counter exceeded
    E1216 08:39:44.294055 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:44.294175 15996 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:44.294210 15996 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:44.294214 15996 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:45.318032 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:45.318158 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:45.318205 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:45.318209 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:46.342036 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:46.342141 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:46.342185 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:46.342190 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:47.366036 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:47.366148 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:47.366191 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:47.366196 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:48.374042 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:48.374089 15993 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:48.374109 15993 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_4
    E1216 08:39:48.374147 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:48.374181 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:48.374186 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:49.414047 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:49.414168 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:49.414212 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:49.414218 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    INFO 12-16 08:39:49 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    E1216 08:39:50.422031 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:50.422031 15993 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:50.422063 15993 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_4
    E1216 08:39:50.422137 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:50.422179 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:50.422184 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0
    E1216 08:39:51.430033 16029 rdma_endpoint.cpp:378] Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num: Connection timed out [110]
    E1216 08:39:51.430141 15997 transfer_metadata.cpp:793] Handshake request is rejected by peer endpoint 127.0.0.1:8144@mlx5_0, message: Failed to modity QP to RTR, check mtu, gid, peer lid, peer qp num. Please check peer's configuration.
    E1216 08:39:51.430187 15997 rdma_endpoint.cpp:122] Failed to exchange handshake description
    E1216 08:39:51.430191 15997 worker_pool.cpp:236] Worker: Cannot make connection for endpoint: 127.0.0.1:8144@mlx5_0

@alogfans
Copy link
Collaborator

Currently the vLLM integration of Transfer Engine assumes that all devices on the prefill nodes are connectable to all devices on the decode nodes. That is, the code may issue an mlx5_0@local -> mlx5_1@remote transfer request even when the two cards are physically unreachable from each other (e.g., on separate networks). We are planning to support auto-detection.
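
Until auto-detection is available, one way to check this assumption by hand is to test every (local NIC, remote NIC) pair directly. The commands below are only a rough sketch: they assume the perftest tools (ib_write_bw) are installed, the device names are the ones from this issue, and 127.0.0.1 is used because prefill and decode run on the same machine here.

```bash
# On the decode side, start a bandwidth server bound to one device
# (restart it for each pair you want to test):
ib_write_bw -d mlx5_0

# On the prefill side, point one local device at that server.
# RoCE setups may additionally need an explicit GID index via -x <gid_index>.
ib_write_bw -d mlx5_4 127.0.0.1

# Repeat for every (local device, remote device) pair; a pair that hangs or
# fails here is a pair the transfer engine cannot use either.
```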

@junna2016
Copy link
Author

junna2016 commented Dec 16, 2024

Currently the vLLM integration of Transfer Engine assumes that all devices on the prefill nodes are connectable to all devices on the decode nodes. That is, the code may issue an mlx5_0@local -> mlx5_1@remote transfer request even when the two cards are physically unreachable from each other (e.g., on separate networks). We are planning to support auto-detection.

Does this mean that mlx5_0 and mlx5_4 cannot reach each other, and that is why this error occurs?

@junna2016
Copy link
Author

I1213 06:04:16.739224 5832 rdma_context.cpp:131] RDMA device: mlx5_0, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00
I1213 06:04:16.742707 5832 rdma_context.cpp:131] RDMA device: mlx5_1, LID: 0, GID: (GID_Index 9) 00:00:00:00:00:00:00:00:00:00:ff:ff:0a:f4:00:00

It seems abnormal that both devices have the same LID and GID. When you use the two devices together, the local QP cannot determine the destination device for the data transfer, leading to a transfer timeout (i.e., transport retry counter exceeded).

  • You can try another GID index by setting the MC_GID_INDEX=n environment variable, where n is the GID index. Mooncake Transfer Engine can detect one of the valid GIDs, but this may not work in every case (see the commands sketched below).
  • Are mlx5_0 and mlx5_1 two ports of the same RDMA device? Please provide the output of the ibv_devinfo command.
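
For reference, a minimal sketch of how the GID table could be inspected and the index pinned; the device names and the chosen index below are examples from this issue, not a recommendation:

```bash
# Dump the GID table of each device; with -v, ibv_devinfo prints one GID[n]
# line per populated entry, so you can see which indices exist and whether
# the two devices really report identical GIDs.
ibv_devinfo -v -d mlx5_0 | grep -i 'GID\['
ibv_devinfo -v -d mlx5_1 | grep -i 'GID\['

# Then pin the index Mooncake should use before launching the vLLM instances:
export MC_GID_INDEX=3
```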

This is caused by my test machine having a dual-port RDMA network card: although ibv_devices lists two RDMA devices, the machine in fact has only one physical RDMA device.
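
As a rough heuristic (a sketch, not an official check), two ibverbs devices that resolve to the same PCI slot in sysfs, differing only in the function number, are usually two ports/functions of one physical card:

```bash
# Map each ibverbs device name to the PCI address it hangs off.
for dev in /sys/class/infiniband/*; do
    printf '%s -> %s\n' "$(basename "$dev")" "$(readlink -f "$dev/device")"
done
# e.g. mlx5_0 -> .../0000:3b:00.0 and mlx5_1 -> .../0000:3b:00.1
# would indicate two functions of the same dual-port NIC.
```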

@alogfans
Copy link
Collaborator

Currently the vLLM integration of Transfer Engine assumes that all devices on the prefill nodes are connectable to all devices on the decode nodes. That is, the code may issue an mlx5_0@local -> mlx5_1@remote transfer request even when the two cards are physically unreachable from each other (e.g., on separate networks). We are planning to support auto-detection.

Does this mean that mlx5_0 and mlx5_4 cannot reach each other, and that is why this error occurs?

The mlx5_0 devices on both machines are in one network, while the mlx5_4 devices on both machines are in another network.
