Description
Loading the DeepSeek-R1 fp8 checkpoint with load_format=fastsafetensors fails during weight loading with KeyError: torch.int8 (full logs below). I think it's due to the lack of support for fp8 checkpoints in fastsafetensors. Is it easy to add?
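For reference, the traceback below ends in fastsafetensors' DLPack conversion: from_cuda_buffer looks each tensor's dtype up in DLDataType.TYPE_MAP, and that lookup raises KeyError: torch.int8 for this checkpoint. A minimal probe along these lines (a sketch only; it assumes DLDataType.TYPE_MAP is an ordinary dict keyed by torch dtypes, which is what the traceback suggests) should show which dtypes the map is missing:

# Sketch: check which torch dtypes fastsafetensors' DLPack table covers.
# Assumes fastsafetensors.dlpack.DLDataType.TYPE_MAP is a dict keyed by
# torch dtypes, as the KeyError in the logs suggests.
import torch
from fastsafetensors.dlpack import DLDataType

candidates = [torch.bfloat16, torch.float16, torch.int8, torch.uint8]
# fp8 dtypes only exist in recent torch builds, so add them conditionally.
for name in ("float8_e4m3fn", "float8_e5m2"):
    if hasattr(torch, name):
        candidates.append(getattr(torch, name))

for dtype in candidates:
    print(dtype, "ok" if dtype in DLDataType.TYPE_MAP else "MISSING")

If that map is the only gap, adding DLPack entries for the int8/uint8 and float8 dtypes there would presumably be enough, but I haven't checked whether the GDS copy path needs more changes.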
Logs:
kubectl logs -f pod/vllm-0
Collecting fastsafetensors
Downloading fastsafetensors-0.1.12-cp312-cp312-manylinux_2_34_x86_64.whl.metadata (11 kB)
Requirement already satisfied: typer>=0.9.0 in /usr/local/lib/python3.12/dist-packages (from fastsafetensors) (0.15.3)
Requirement already satisfied: torch>=2.1 in /usr/local/lib/python3.12/dist-packages (from fastsafetensors) (2.6.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (3.18.0)
Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (4.13.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (3.1.6)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (2025.3.2)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (9.1.0.70)
Requirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.5.8)
Requirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (11.2.1.3)
Requirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (10.3.5.147)
Requirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (11.6.1.9)
Requirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.3.1.170)
Requirement already satisfied: nvidia-cusparselt-cu12==0.6.2 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (0.6.2)
Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (2.21.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: triton==3.2.0 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (3.2.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (80.0.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy==1.13.1->torch>=2.1->fastsafetensors) (1.3.0)
Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.9.0->fastsafetensors) (8.1.8)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.9.0->fastsafetensors) (1.5.4)
Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.9.0->fastsafetensors) (14.0.0)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=10.11.0->typer>=0.9.0->fastsafetensors) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=10.11.0->typer>=0.9.0->fastsafetensors) (2.19.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer>=0.9.0->fastsafetensors) (0.1.2)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch>=2.1->fastsafetensors) (3.0.2)
Downloading fastsafetensors-0.1.12-cp312-cp312-manylinux_2_34_x86_64.whl (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 48.2 MB/s eta 0:00:00
Installing collected packages: fastsafetensors
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Successfully installed fastsafetensors-0.1.12
2025-04-29 13:37:10,202 INFO usage_lib.py:467 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-04-29 13:37:10,202 INFO scripts.py:865 -- Local node IP: 10.0.2.32
2025-04-29 13:37:11,525 SUCC scripts.py:902 -- --------------------
2025-04-29 13:37:11,525 SUCC scripts.py:903 -- Ray runtime started.
2025-04-29 13:37:11,525 SUCC scripts.py:904 -- --------------------
2025-04-29 13:37:11,525 INFO scripts.py:906 -- Next steps
2025-04-29 13:37:11,525 INFO scripts.py:909 -- To add another node to this Ray cluster, run
2025-04-29 13:37:11,525 INFO scripts.py:912 -- ray start --address='10.0.2.32:6379'
2025-04-29 13:37:11,525 INFO scripts.py:921 -- To connect to this Ray cluster:
2025-04-29 13:37:11,525 INFO scripts.py:923 -- import ray
2025-04-29 13:37:11,525 INFO scripts.py:924 -- ray.init()
2025-04-29 13:37:11,525 INFO scripts.py:955 -- To terminate the Ray runtime, run
2025-04-29 13:37:11,525 INFO scripts.py:956 -- ray stop
2025-04-29 13:37:11,525 INFO scripts.py:959 -- To view the status of the cluster, use
2025-04-29 13:37:11,525 INFO scripts.py:960 -- ray status
2025-04-29 13:37:11,949 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:11,961 INFO worker.py:1841 -- Connected to Ray cluster.
Wait for all ray workers to be active. 1/2 is active
2025-04-29 13:37:20,256 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:20,267 INFO worker.py:1841 -- Connected to Ray cluster.
Wait for all ray workers to be active. 1/2 is active
2025-04-29 13:37:26,260 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:26,270 INFO worker.py:1841 -- Connected to Ray cluster.
Wait for all ray workers to be active. 1/2 is active
2025-04-29 13:37:32,251 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:32,261 INFO worker.py:1841 -- Connected to Ray cluster.
All ray workers are active and the ray cluster is initialized successfully.
INFO 04-29 13:37:40 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 13:37:43 [api_server.py:1043] vLLM API server version 0.8.5
INFO 04-29 13:37:43 [api_server.py:1044] args: Namespace(host=None, port=8080, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/disks/pd/DeepSeek-R1', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='fastsafetensors', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=2, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-29 13:37:43 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 04-29 13:37:50 [config.py:717] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 04-29 13:37:52 [arg_utils.py:1658] Pipeline Parallelism without Ray distributed executor is not supported by the V1 Engine. Falling back to V0.
INFO 04-29 13:37:52 [config.py:1770] Defaulting to use ray for distributed inference
WARNING 04-29 13:37:52 [fp8.py:63] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 04-29 13:37:52 [cuda.py:157] Forcing kv cache block size to 64 for FlashMLA backend.
INFO 04-29 13:37:52 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5) with config: model='/mnt/disks/pd/DeepSeek-R1', speculative_config=None, tokenizer='/mnt/disks/pd/DeepSeek-R1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.FASTSAFETENSORS, tensor_parallel_size=8, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/disks/pd/DeepSeek-R1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
2025-04-29 13:37:52,592 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:52,603 INFO worker.py:1841 -- Connected to Ray cluster.
INFO 04-29 13:37:52 [ray_utils.py:335] No current placement group found. Creating a new placement group.
INFO 04-29 13:37:52 [ray_distributed_executor.py:176] use_ray_spmd_worker: False
(pid=665) INFO 04-29 13:37:56 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 13:38:05 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()
INFO 04-29 13:38:05 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USAGE_SOURCE', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1']
INFO 04-29 13:38:05 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 04-29 13:38:05 [cuda.py:209] Using FlashMLA backend.
(RayWorkerWrapper pid=665) INFO 04-29 13:38:05 [cuda.py:209] Using FlashMLA backend.
(pid=325, ip=10.0.3.32) INFO 04-29 13:38:00 [__init__.py:239] Automatically detected platform cuda. [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
INFO 04-29 13:38:12 [utils.py:1055] Found nccl from library libnccl.so.2
INFO 04-29 13:38:12 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=322, ip=10.0.3.32) INFO 04-29 13:38:12 [utils.py:1055] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=322, ip=10.0.3.32) INFO 04-29 13:38:12 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:06 [cuda.py:209] Using FlashMLA backend. [repeated 14x across cluster]
(RayWorkerWrapper pid=321, ip=10.0.3.32) INFO 04-29 13:38:14 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 04-29 13:38:14 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 04-29 13:38:52 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(RayWorkerWrapper pid=322, ip=10.0.3.32) INFO 04-29 13:38:52 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(RayWorkerWrapper pid=677) INFO 04-29 13:38:12 [utils.py:1055] Found nccl from library libnccl.so.2 [repeated 14x across cluster]
(RayWorkerWrapper pid=677) INFO 04-29 13:38:12 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 14x across cluster]
INFO 04-29 13:38:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_b0daf338'), local_subscribe_addr='ipc:///tmp/4801a3f6-21da-45d1-b52f-6c5bd25ea11d', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-29 13:38:52 [utils.py:1055] Found nccl from library libnccl.so.2
INFO 04-29 13:38:52 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=321, ip=10.0.3.32) INFO 04-29 13:38:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_00c776e8'), local_subscribe_addr='ipc:///tmp/e6c79602-aa63-404e-984d-b93e19836ada', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-29 13:38:52 [parallel_state.py:1004] rank 0 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-29 13:38:52 [model_runner.py:1108] Starting to load model /mnt/disks/pd/DeepSeek-R1...
WARNING 04-29 13:38:52 [utils.py:168] The model class DeepseekV3ForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
INFO 04-29 13:38:52 [utils.py:106] Hidden layers were unevenly partitioned: [31,30]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(RayWorkerWrapper pid=665) INFO 04-29 13:38:52 [parallel_state.py:1004] rank 1 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 1
(RayWorkerWrapper pid=665) INFO 04-29 13:38:52 [model_runner.py:1108] Starting to load model /mnt/disks/pd/DeepSeek-R1...
(RayWorkerWrapper pid=665) WARNING 04-29 13:38:52 [utils.py:168] The model class DeepseekV3ForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
(RayWorkerWrapper pid=665) INFO 04-29 13:38:52 [utils.py:106] Hidden layers were unevenly partitioned: [31,30]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/11 [00:00<?, ?it/s]
(RayWorkerWrapper pid=323, ip=10.0.3.32) get_device_pci_bus: cudaDeviceGetPCIBusId failed, deviceId=10, err=101
(RayWorkerWrapper pid=664) libnuma: Warning: Cannot read node cpumask from sysfs
(RayWorkerWrapper pid=326, ip=10.0.3.32) get_device_pci_bus: cudaDeviceGetPCIBusId failed, deviceId=13, err=101 [repeated 7x across cluster]
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/11 [00:17<?, ?it/s]
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] Traceback (most recent call last):
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] return func(*args, **kwargs)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] self.model_runner.load_model()
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] loaded_weights = model.load_weights(
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] for name, loaded_weight in weights:
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] yield from self._get_weights_iterator(primary_weights)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr>
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] for (name, tensor) in weights_iterator)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] fb = loader.copy_files_to_device()
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 109, in copy_files_to_device
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 476, in set_device
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] RuntimeError: CUDA error: invalid device ordinal
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620]
(RayWorkerWrapper pid=677) INFO 04-29 13:38:52 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 14x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [utils.py:1055] Found nccl from library libnccl.so.2 [repeated 15x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 15x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [parallel_state.py:1004] rank 12 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 4 [repeated 14x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [model_runner.py:1108] Starting to load model /mnt/disks/pd/DeepSeek-R1... [repeated 14x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) WARNING 04-29 13:38:52 [utils.py:168] The model class DeepseekV3ForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules [repeated 14x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [utils.py:106] Hidden layers were unevenly partitioned: [31,30]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable [repeated 14x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] factory.wait_io(dtype=dtype, noalign=self.nogds)
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/tensor_factory.py", line 39, in wait_io
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] self.tensors = self.copier.wait_io(self.gbuf, dtype=dtype, noalign=noalign)
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py", line 95, in wait_io
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] return self.metadata.get_tensors(gbuf, self.device, self.aligned_offset, dtype=dtype)
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/common.py", line 147, in get_tensors
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] t2 = torch.from_dlpack(from_cuda_buffer(dst_dev_ptr, t.shape, t.strides, t.dtype, device))
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/dlpack.py", line 172, in from_cuda_buffer
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] dl_managed_tensor.dl_tensor.dtype = DLDataType.TYPE_MAP[dtype]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ~~~~~~~~~~~~~~~~~~~^^^^^^^
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] KeyError: torch.int8
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] Traceback (most recent call last): [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] return run_method(self, method, args, kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] return func(*args, **kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model [repeated 24x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] self.model_runner.load_model() [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] loaded_weights = model.load_weights( [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] for name, loaded_weight in weights: [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] yield from self._get_weights_iterator(primary_weights) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr> [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] for (name, tensor) in weights_iterator) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] fb = loader.copy_files_to_device() [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 128, in copy_files_to_device [repeated 8x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] torch.cuda.set_device(self.device) [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 476, in set_device [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] torch._C._cuda_setDevice(device) [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] RuntimeError: CUDA error: invalid device ordinal [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] [repeated 7x across cluster]
ERROR 04-29 13:39:11 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
ERROR 04-29 13:39:11 [worker_base.py:620] Traceback (most recent call last):
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
ERROR 04-29 13:39:11 [worker_base.py:620] return run_method(self, method, args, kwargs)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
ERROR 04-29 13:39:11 [worker_base.py:620] return func(*args, **kwargs)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
ERROR 04-29 13:39:11 [worker_base.py:620] self.model_runner.load_model()
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
ERROR 04-29 13:39:11 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-29 13:39:11 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
ERROR 04-29 13:39:11 [worker_base.py:620] loaded_weights = model.load_weights(
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights
ERROR 04-29 13:39:11 [worker_base.py:620] for name, loaded_weight in weights:
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights
ERROR 04-29 13:39:11 [worker_base.py:620] yield from self._get_weights_iterator(primary_weights)
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr>
ERROR 04-29 13:39:11 [worker_base.py:620] for (name, tensor) in weights_iterator)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator
ERROR 04-29 13:39:11 [worker_base.py:620] fb = loader.copy_files_to_device()
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 128, in copy_files_to_device
ERROR 04-29 13:39:11 [worker_base.py:620] factory.wait_io(dtype=dtype, noalign=self.nogds)
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/tensor_factory.py", line 39, in wait_io
ERROR 04-29 13:39:11 [worker_base.py:620] self.tensors = self.copier.wait_io(self.gbuf, dtype=dtype, noalign=noalign)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py", line 95, in wait_io
ERROR 04-29 13:39:11 [worker_base.py:620] return self.metadata.get_tensors(gbuf, self.device, self.aligned_offset, dtype=dtype)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/common.py", line 147, in get_tensors
ERROR 04-29 13:39:11 [worker_base.py:620] t2 = torch.from_dlpack(from_cuda_buffer(dst_dev_ptr, t.shape, t.strides, t.dtype, device))
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/dlpack.py", line 172, in from_cuda_buffer
ERROR 04-29 13:39:11 [worker_base.py:620] dl_managed_tensor.dl_tensor.dtype = DLDataType.TYPE_MAP[dtype]
ERROR 04-29 13:39:11 [worker_base.py:620] ~~~~~~~~~~~~~~~~~~~^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] KeyError: torch.int8
[rank0]: Traceback (most recent call last):
[rank0]: File "", line 198, in _run_module_as_main
[rank0]: File "", line 88, in _run_code
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in
[rank0]: uvloop.run(run_server(args))
[rank0]: File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 109, in run
[rank0]: return __asyncio.run(
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]: return runner.run(main)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]: return self._loop.run_until_complete(task)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]: File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 61, in wrapper
[rank0]: return await main
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
[rank0]: async with build_async_engine_client(args) as engine_client:
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
[rank0]: async with build_async_engine_client_from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
[rank0]: engine_client = AsyncLLMEngine.from_vllm_config(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 657, in from_vllm_config
[rank0]: return cls(
[rank0]: ^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 612, in init
[rank0]: self.engine = self._engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 267, in init
[rank0]: super().init(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 275, in init
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 286, in init
[rank0]: super().init(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in init
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 114, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 396, in _init_workers_ray
[rank0]: self._run_workers("load_model",
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 516, in _run_workers
[rank0]: self.driver_worker.execute_method(sent_method, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
[rank0]: raise e
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
[rank0]: return run_method(self, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/init.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
[rank0]: loaded_weights = model.load_weights(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights
[rank0]: for name, loaded_weight in weights:
[rank0]: ^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights
[rank0]: yield from self._get_weights_iterator(primary_weights)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in
[rank0]: for (name, tensor) in weights_iterator)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator
[rank0]: fb = loader.copy_files_to_device()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 128, in copy_files_to_device
[rank0]: factory.wait_io(dtype=dtype, noalign=self.nogds)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/tensor_factory.py", line 39, in wait_io
[rank0]: self.tensors = self.copier.wait_io(self.gbuf, dtype=dtype, noalign=noalign)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py", line 95, in wait_io
[rank0]: return self.metadata.get_tensors(gbuf, self.device, self.aligned_offset, dtype=dtype)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/common.py", line 147, in get_tensors
[rank0]: t2 = torch.from_dlpack(from_cuda_buffer(dst_dev_ptr, t.shape, t.strides, t.dtype, device))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/dlpack.py", line 172, in from_cuda_buffer
[rank0]: dl_managed_tensor.dl_tensor.dtype = DLDataType.TYPE_MAP[dtype]
[rank0]: ~~~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]: KeyError: torch.int8
(RayWorkerWrapper pid=677) libnuma: Warning: Cannot read node cpumask from sysfs [repeated 5x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] factory.wait_io(dtype=dtype, noalign=self.nogds) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py", line 95, in wait_io [repeated 12x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] self.tensors = self.copier.wait_io(self.gbuf, dtype=dtype, noalign=noalign) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] return self.metadata.get_tensors(gbuf, self.device, self.aligned_offset, dtype=dtype) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/common.py", line 147, in get_tensors [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] t2 = torch.from_dlpack(from_cuda_buffer(dst_dev_ptr, t.shape, t.strides, t.dtype, device)) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/dlpack.py", line 172, in from_cuda_buffer [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] dl_managed_tensor.dl_tensor.dtype = DLDataType.TYPE_MAP[dtype] [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ~~~~~~~~~~~~~~~~~~~^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] KeyError: torch.int8 [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] return run_method(self, method, args, kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] return func(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model [repeated 18x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] self.model_runner.load_model() [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] loaded_weights = model.load_weights( [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] for name, loaded_weight in weights: [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] yield from self._get_weights_iterator(primary_weights) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr> [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] for (name, tensor) in weights_iterator) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] fb = loader.copy_files_to_device() [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 128, in copy_files_to_device [repeated 6x across cluster]
INFO 04-29 13:39:12 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[rank0]:[W429 13:39:12.125768909 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())