
Fastsafetensors doesn't work with DeepSeek-R1 #14

@shengshiqi-google

Description

Loading DeepSeek-R1 with load_format=fastsafetensors fails during weight loading: every worker crashes with KeyError: torch.int8, raised from DLDataType.TYPE_MAP in fastsafetensors/dlpack.py (full traceback in the logs below). I think it's due to the lack of fp8 support in fastsafetensors. Is it easy to add?
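For context, a minimal, hypothetical sketch of the failure mode visible in the traceback: fastsafetensors hands each loaded buffer to torch.from_dlpack() after looking the torch dtype up in DLDataType.TYPE_MAP (fastsafetensors/dlpack.py), and the fp8 DeepSeek-R1 checkpoint surfaces a dtype (torch.int8) with no entry in that map. The table contents and the proposed entries below are illustrative assumptions, not the library's actual code.

```python
# Minimal sketch of the suspected failure mode (hypothetical table, NOT fastsafetensors' code).
import torch

# Assumed shape of such a table: dtype -> (DLPack type code, bits, lanes),
# with DLPack codes 0=int, 1=uint, 2=float, 4=bfloat.
TYPE_MAP = {
    torch.float32:  (2, 32, 1),
    torch.float16:  (2, 16, 1),
    torch.bfloat16: (4, 16, 1),
}

def to_dl_dtype(dtype: torch.dtype):
    return TYPE_MAP[dtype]  # missing entry -> KeyError, as in the traceback below

if __name__ == "__main__":
    try:
        to_dl_dtype(torch.int8)
    except KeyError as missing:
        print("no DLPack mapping for", missing)  # no DLPack mapping for torch.int8
    # A fix would presumably add the missing 8-bit (and fp8) entries, e.g.:
    TYPE_MAP[torch.int8] = (0, 8, 1)   # assumption: signed 8-bit integer
    TYPE_MAP[torch.uint8] = (1, 8, 1)  # assumption: unsigned 8-bit integer
    print(to_dl_dtype(torch.int8))     # (0, 8, 1)
```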

Logs:
kubectl logs -f pod/vllm-0
Collecting fastsafetensors
Downloading fastsafetensors-0.1.12-cp312-cp312-manylinux_2_34_x86_64.whl.metadata (11 kB)
Requirement already satisfied: typer>=0.9.0 in /usr/local/lib/python3.12/dist-packages (from fastsafetensors) (0.15.3)
Requirement already satisfied: torch>=2.1 in /usr/local/lib/python3.12/dist-packages (from fastsafetensors) (2.6.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (3.18.0)
Requirement already satisfied: typing-extensions>=4.10.0 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (4.13.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (3.1.6)
Requirement already satisfied: fsspec in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (2025.3.2)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (9.1.0.70)
Requirement already satisfied: nvidia-cublas-cu12==12.4.5.8 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.5.8)
Requirement already satisfied: nvidia-cufft-cu12==11.2.1.3 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (11.2.1.3)
Requirement already satisfied: nvidia-curand-cu12==10.3.5.147 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (10.3.5.147)
Requirement already satisfied: nvidia-cusolver-cu12==11.6.1.9 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (11.6.1.9)
Requirement already satisfied: nvidia-cusparse-cu12==12.3.1.170 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.3.1.170)
Requirement already satisfied: nvidia-cusparselt-cu12==0.6.2 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (0.6.2)
Requirement already satisfied: nvidia-nccl-cu12==2.21.5 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (2.21.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.4.127 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (12.4.127)
Requirement already satisfied: triton==3.2.0 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (3.2.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (80.0.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.12/dist-packages (from torch>=2.1->fastsafetensors) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy==1.13.1->torch>=2.1->fastsafetensors) (1.3.0)
Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.9.0->fastsafetensors) (8.1.8)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.9.0->fastsafetensors) (1.5.4)
Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.9.0->fastsafetensors) (14.0.0)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=10.11.0->typer>=0.9.0->fastsafetensors) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=10.11.0->typer>=0.9.0->fastsafetensors) (2.19.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer>=0.9.0->fastsafetensors) (0.1.2)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch>=2.1->fastsafetensors) (3.0.2)
Downloading fastsafetensors-0.1.12-cp312-cp312-manylinux_2_34_x86_64.whl (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 48.2 MB/s eta 0:00:00
Installing collected packages: fastsafetensors
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Successfully installed fastsafetensors-0.1.12
2025-04-29 13:37:10,202 INFO usage_lib.py:467 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add --disable-usage-stats to the command that starts the cluster, or run the following command: ray disable-usage-stats before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-04-29 13:37:10,202 INFO scripts.py:865 -- Local node IP: 10.0.2.32
2025-04-29 13:37:11,525 SUCC scripts.py:902 -- --------------------
2025-04-29 13:37:11,525 SUCC scripts.py:903 -- Ray runtime started.
2025-04-29 13:37:11,525 SUCC scripts.py:904 -- --------------------
2025-04-29 13:37:11,525 INFO scripts.py:906 -- Next steps
2025-04-29 13:37:11,525 INFO scripts.py:909 -- To add another node to this Ray cluster, run
2025-04-29 13:37:11,525 INFO scripts.py:912 -- ray start --address='10.0.2.32:6379'
2025-04-29 13:37:11,525 INFO scripts.py:921 -- To connect to this Ray cluster:
2025-04-29 13:37:11,525 INFO scripts.py:923 -- import ray
2025-04-29 13:37:11,525 INFO scripts.py:924 -- ray.init()
2025-04-29 13:37:11,525 INFO scripts.py:955 -- To terminate the Ray runtime, run
2025-04-29 13:37:11,525 INFO scripts.py:956 -- ray stop
2025-04-29 13:37:11,525 INFO scripts.py:959 -- To view the status of the cluster, use
2025-04-29 13:37:11,525 INFO scripts.py:960 -- ray status
2025-04-29 13:37:11,949 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:11,961 INFO worker.py:1841 -- Connected to Ray cluster.
Wait for all ray workers to be active. 1/2 is active
2025-04-29 13:37:20,256 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:20,267 INFO worker.py:1841 -- Connected to Ray cluster.
Wait for all ray workers to be active. 1/2 is active
2025-04-29 13:37:26,260 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:26,270 INFO worker.py:1841 -- Connected to Ray cluster.
Wait for all ray workers to be active. 1/2 is active
2025-04-29 13:37:32,251 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:32,261 INFO worker.py:1841 -- Connected to Ray cluster.
All ray workers are active and the ray cluster is initialized successfully.
INFO 04-29 13:37:40 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 13:37:43 [api_server.py:1043] vLLM API server version 0.8.5
INFO 04-29 13:37:43 [api_server.py:1044] args: Namespace(host=None, port=8080, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/disks/pd/DeepSeek-R1', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='fastsafetensors', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=2, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-29 13:37:43 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 04-29 13:37:50 [config.py:717] This model supports multiple tasks: {'classify', 'generate', 'embed', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 04-29 13:37:52 [arg_utils.py:1658] Pipeline Parallelism without Ray distributed executor is not supported by the V1 Engine. Falling back to V0.
INFO 04-29 13:37:52 [config.py:1770] Defaulting to use ray for distributed inference
WARNING 04-29 13:37:52 [fp8.py:63] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
INFO 04-29 13:37:52 [cuda.py:157] Forcing kv cache block size to 64 for FlashMLA backend.
INFO 04-29 13:37:52 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5) with config: model='/mnt/disks/pd/DeepSeek-R1', speculative_config=None, tokenizer='/mnt/disks/pd/DeepSeek-R1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.FASTSAFETENSORS, tensor_parallel_size=8, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/mnt/disks/pd/DeepSeek-R1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
2025-04-29 13:37:52,592 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.2.32:6379...
2025-04-29 13:37:52,603 INFO worker.py:1841 -- Connected to Ray cluster.
INFO 04-29 13:37:52 [ray_utils.py:335] No current placement group found. Creating a new placement group.
INFO 04-29 13:37:52 [ray_distributed_executor.py:176] use_ray_spmd_worker: False
(pid=665) INFO 04-29 13:37:56 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 13:38:05 [ray_distributed_executor.py:352] non_carry_over_env_vars from config: set()
INFO 04-29 13:38:05 [ray_distributed_executor.py:354] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USAGE_SOURCE', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_V1']
INFO 04-29 13:38:05 [ray_distributed_executor.py:357] If certain env vars should NOT be copied to workers, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
INFO 04-29 13:38:05 [cuda.py:209] Using FlashMLA backend.
(RayWorkerWrapper pid=665) INFO 04-29 13:38:05 [cuda.py:209] Using FlashMLA backend.
(pid=325, ip=10.0.3.32) INFO 04-29 13:38:00 [__init__.py:239] Automatically detected platform cuda. [repeated 15x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
INFO 04-29 13:38:12 [utils.py:1055] Found nccl from library libnccl.so.2
INFO 04-29 13:38:12 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=322, ip=10.0.3.32) INFO 04-29 13:38:12 [utils.py:1055] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=322, ip=10.0.3.32) INFO 04-29 13:38:12 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:06 [cuda.py:209] Using FlashMLA backend. [repeated 14x across cluster]
(RayWorkerWrapper pid=321, ip=10.0.3.32) INFO 04-29 13:38:14 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 04-29 13:38:14 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
INFO 04-29 13:38:52 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(RayWorkerWrapper pid=322, ip=10.0.3.32) INFO 04-29 13:38:52 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(RayWorkerWrapper pid=677) INFO 04-29 13:38:12 [utils.py:1055] Found nccl from library libnccl.so.2 [repeated 14x across cluster]
(RayWorkerWrapper pid=677) INFO 04-29 13:38:12 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 14x across cluster]
INFO 04-29 13:38:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_b0daf338'), local_subscribe_addr='ipc:///tmp/4801a3f6-21da-45d1-b52f-6c5bd25ea11d', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-29 13:38:52 [utils.py:1055] Found nccl from library libnccl.so.2
INFO 04-29 13:38:52 [pynccl.py:69] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=321, ip=10.0.3.32) INFO 04-29 13:38:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_00c776e8'), local_subscribe_addr='ipc:///tmp/e6c79602-aa63-404e-984d-b93e19836ada', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-29 13:38:52 [parallel_state.py:1004] rank 0 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-29 13:38:52 [model_runner.py:1108] Starting to load model /mnt/disks/pd/DeepSeek-R1...
WARNING 04-29 13:38:52 [utils.py:168] The model class DeepseekV3ForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
INFO 04-29 13:38:52 [utils.py:106] Hidden layers were unevenly partitioned: [31,30]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(RayWorkerWrapper pid=665) INFO 04-29 13:38:52 [parallel_state.py:1004] rank 1 in world size 16 is assigned as DP rank 0, PP rank 0, TP rank 1
(RayWorkerWrapper pid=665) INFO 04-29 13:38:52 [model_runner.py:1108] Starting to load model /mnt/disks/pd/DeepSeek-R1...
(RayWorkerWrapper pid=665) WARNING 04-29 13:38:52 [utils.py:168] The model class DeepseekV3ForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
(RayWorkerWrapper pid=665) INFO 04-29 13:38:52 [utils.py:106] Hidden layers were unevenly partitioned: [31,30]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/11 [00:00<?, ?it/s]
(RayWorkerWrapper pid=323, ip=10.0.3.32) get_device_pci_bus: cudaDeviceGetPCIBusId failed, deviceId=10, err=101
(RayWorkerWrapper pid=664) libnuma: Warning: Cannot read node cpumask from sysfs
(RayWorkerWrapper pid=326, ip=10.0.3.32) get_device_pci_bus: cudaDeviceGetPCIBusId failed, deviceId=13, err=101 [repeated 7x across cluster]
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/11 [00:17<?, ?it/s]

(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] Traceback (most recent call last):
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] return func(*args, **kwargs)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] self.model_runner.load_model()
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] loaded_weights = model.load_weights(
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] for name, loaded_weight in weights:
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] yield from self._get_weights_iterator(primary_weights)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr>
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] for (name, tensor) in weights_iterator)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] fb = loader.copy_files_to_device()
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 109, in copy_files_to_device
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 476, in set_device
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] RuntimeError: CUDA error: invalid device ordinal
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(RayWorkerWrapper pid=322, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620]
(RayWorkerWrapper pid=677) INFO 04-29 13:38:52 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 14x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [utils.py:1055] Found nccl from library libnccl.so.2 [repeated 15x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [pynccl.py:69] vLLM is using nccl==2.21.5 [repeated 15x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [parallel_state.py:1004] rank 12 in world size 16 is assigned as DP rank 0, PP rank 1, TP rank 4 [repeated 14x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [model_runner.py:1108] Starting to load model /mnt/disks/pd/DeepSeek-R1... [repeated 14x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) WARNING 04-29 13:38:52 [utils.py:168] The model class DeepseekV3ForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules [repeated 14x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) INFO 04-29 13:38:52 [utils.py:106] Hidden layers were unevenly partitioned: [31,30]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable [repeated 14x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] factory.wait_io(dtype=dtype, noalign=self.nogds)
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/tensor_factory.py", line 39, in wait_io
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] self.tensors = self.copier.wait_io(self.gbuf, dtype=dtype, noalign=noalign)
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py", line 95, in wait_io
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] return self.metadata.get_tensors(gbuf, self.device, self.aligned_offset, dtype=dtype)
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/common.py", line 147, in get_tensors
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] t2 = torch.from_dlpack(from_cuda_buffer(dst_dev_ptr, t.shape, t.strides, t.dtype, device))
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/dlpack.py", line 172, in from_cuda_buffer
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] dl_managed_tensor.dl_tensor.dtype = DLDataType.TYPE_MAP[dtype]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ~~~~~~~~~~~~~~~~~~~^^^^^^^
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] KeyError: torch.int8
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] Traceback (most recent call last): [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] return run_method(self, method, args, kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] return func(*args, **kwargs) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model [repeated 24x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] self.model_runner.load_model() [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] loaded_weights = model.load_weights( [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] for name, loaded_weight in weights: [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] yield from self._get_weights_iterator(primary_weights) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr> [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] for (name, tensor) in weights_iterator) [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] fb = loader.copy_files_to_device() [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 8x across cluster]
(RayWorkerWrapper pid=677) ERROR 04-29 13:39:08 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 128, in copy_files_to_device [repeated 8x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] torch.cuda.set_device(self.device) [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 476, in set_device [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] torch._C._cuda_setDevice(device) [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] RuntimeError: CUDA error: invalid device ordinal [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions. [repeated 7x across cluster]
(RayWorkerWrapper pid=325, ip=10.0.3.32) ERROR 04-29 13:38:53 [worker_base.py:620] [repeated 7x across cluster]
ERROR 04-29 13:39:11 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution.
ERROR 04-29 13:39:11 [worker_base.py:620] Traceback (most recent call last):
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
ERROR 04-29 13:39:11 [worker_base.py:620] return run_method(self, method, args, kwargs)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
ERROR 04-29 13:39:11 [worker_base.py:620] return func(*args, **kwargs)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
ERROR 04-29 13:39:11 [worker_base.py:620] self.model_runner.load_model()
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
ERROR 04-29 13:39:11 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-29 13:39:11 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
ERROR 04-29 13:39:11 [worker_base.py:620] loaded_weights = model.load_weights(
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights
ERROR 04-29 13:39:11 [worker_base.py:620] for name, loaded_weight in weights:
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights
ERROR 04-29 13:39:11 [worker_base.py:620] yield from self._get_weights_iterator(primary_weights)
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr>
ERROR 04-29 13:39:11 [worker_base.py:620] for (name, tensor) in weights_iterator)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator
ERROR 04-29 13:39:11 [worker_base.py:620] fb = loader.copy_files_to_device()
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 128, in copy_files_to_device
ERROR 04-29 13:39:11 [worker_base.py:620] factory.wait_io(dtype=dtype, noalign=self.nogds)
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/tensor_factory.py", line 39, in wait_io
ERROR 04-29 13:39:11 [worker_base.py:620] self.tensors = self.copier.wait_io(self.gbuf, dtype=dtype, noalign=noalign)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py", line 95, in wait_io
ERROR 04-29 13:39:11 [worker_base.py:620] return self.metadata.get_tensors(gbuf, self.device, self.aligned_offset, dtype=dtype)
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/common.py", line 147, in get_tensors
ERROR 04-29 13:39:11 [worker_base.py:620] t2 = torch.from_dlpack(from_cuda_buffer(dst_dev_ptr, t.shape, t.strides, t.dtype, device))
ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/dlpack.py", line 172, in from_cuda_buffer
ERROR 04-29 13:39:11 [worker_base.py:620] dl_managed_tensor.dl_tensor.dtype = DLDataType.TYPE_MAP[dtype]
ERROR 04-29 13:39:11 [worker_base.py:620] ~~~~~~~~~~~~~~~~~~~^^^^^^^
ERROR 04-29 13:39:11 [worker_base.py:620] KeyError: torch.int8
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
[rank0]: uvloop.run(run_server(args))
[rank0]: File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
[rank0]: return __asyncio.run(
[rank0]: ^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]: return runner.run(main)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]: return self._loop.run_until_complete(task)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]: File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]: return await main
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
[rank0]: async with build_async_engine_client(args) as engine_client:
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
[rank0]: async with build_async_engine_client_from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]: return await anext(self.gen)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
[rank0]: engine_client = AsyncLLMEngine.from_vllm_config(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 657, in from_vllm_config
[rank0]: return cls(
[rank0]: ^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 612, in __init__
[rank0]: self.engine = self._engine_class(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 275, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 286, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 114, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 396, in _init_workers_ray
[rank0]: self._run_workers("load_model",
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 516, in _run_workers
[rank0]: self.driver_worker.execute_method(sent_method, *args, **kwargs)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
[rank0]: raise e
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
[rank0]: return run_method(self, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
[rank0]: loaded_weights = model.load_weights(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights
[rank0]: for name, loaded_weight in weights:
[rank0]: ^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights
[rank0]: yield from self._get_weights_iterator(primary_weights)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr>
[rank0]: for (name, tensor) in weights_iterator)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator
[rank0]: fb = loader.copy_files_to_device()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 128, in copy_files_to_device
[rank0]: factory.wait_io(dtype=dtype, noalign=self.nogds)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/tensor_factory.py", line 39, in wait_io
[rank0]: self.tensors = self.copier.wait_io(self.gbuf, dtype=dtype, noalign=noalign)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py", line 95, in wait_io
[rank0]: return self.metadata.get_tensors(gbuf, self.device, self.aligned_offset, dtype=dtype)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/common.py", line 147, in get_tensors
[rank0]: t2 = torch.from_dlpack(from_cuda_buffer(dst_dev_ptr, t.shape, t.strides, t.dtype, device))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/dlpack.py", line 172, in from_cuda_buffer
[rank0]: dl_managed_tensor.dl_tensor.dtype = DLDataType.TYPE_MAP[dtype]
[rank0]: ~~~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]: KeyError: torch.int8
(RayWorkerWrapper pid=677) libnuma: Warning: Cannot read node cpumask from sysfs [repeated 5x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] factory.wait_io(dtype=dtype, noalign=self.nogds) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py", line 95, in wait_io [repeated 12x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] self.tensors = self.copier.wait_io(self.gbuf, dtype=dtype, noalign=noalign) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] return self.metadata.get_tensors(gbuf, self.device, self.aligned_offset, dtype=dtype) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/common.py", line 147, in get_tensors [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] t2 = torch.from_dlpack(from_cuda_buffer(dst_dev_ptr, t.shape, t.strides, t.dtype, device)) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/dlpack.py", line 172, in from_cuda_buffer [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] dl_managed_tensor.dl_tensor.dtype = DLDataType.TYPE_MAP[dtype] [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ~~~~~~~~~~~~~~~~~~~^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] KeyError: torch.int8 [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] Error executing method 'load_model'. This might cause deadlock in distributed execution. [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] return run_method(self, method, args, kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] return func(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model [repeated 18x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] self.model_runner.load_model() [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] self.model = get_model(vllm_config=self.vllm_config) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] return loader.load_model(vllm_config=vllm_config) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] loaded_weights = model.load_weights( [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 746, in load_weights [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] for name, loaded_weight in weights: [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 431, in get_all_weights [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] yield from self._get_weights_iterator(primary_weights) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 415, in <genexpr> [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] for (name, tensor) in weights_iterator) [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 490, in fastsafetensors_weights_iterator [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] fb = loader.copy_files_to_device() [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=669) ERROR 04-29 13:39:11 [worker_base.py:620] File "/usr/local/lib/python3.12/dist-packages/fastsafetensors/loader.py", line 128, in copy_files_to_device [repeated 6x across cluster]
INFO 04-29 13:39:12 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
[rank0]:[W429 13:39:12.125768909 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
