Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
Serving Qwen2.5-7B-Instruct (GPTQ Int8) with --kv-cache-dtype int8, the Triton attention backend, and --tp-size 2 on a Radeon RX 7900 XTX (ROCm): the server starts up normally, but as soon as the first request is prefilled, every sampling step fails with "Detected errors during sampling! NaN in the logits."
python3 -m sglang.launch_server --model-path /root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype int8 --attention-backend triton --sampling-backend pytorch --tp-size 2
WARNING 11-08 04:42:43 rocm.py:13] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
[2024-11-08 04:42:51] server_args=ServerArgs(model_path='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', tokenizer_path='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='int8', kvint4_groupsize=32, quantization=None, context_length=None, device='cuda', served_model_name='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=661408819, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
[2024-11-08 04:43:04 TP0] Init torch distributed begin.
[2024-11-08 04:43:04 TP1] Init torch distributed begin.
[2024-11-08 04:43:07 TP0] Load weight begin. avail mem=23.03 GB
[2024-11-08 04:43:07 TP1] Load weight begin. avail mem=23.47 GB
[2024-11-08 04:43:07 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-11-08 04:43:07 TP1] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-11-08 04:43:07 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
[2024-11-08 04:43:07 TP0] FlashInfer is not available on Non-NV platforms. Fallback to other kernel libraries.
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:03, 1.74s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:03<00:02, 2.02s/it]
[2024-11-08 04:43:12 TP1] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=19.09 GB
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.50s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:04<00:00, 1.61s/it]
[2024-11-08 04:43:13 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=18.65 GB
[2024-11-08 04:43:13 TP1] Memory pool end. avail mem=4.58 GB
[2024-11-08 04:43:13 TP0] Memory pool end. avail mem=4.14 GB
[2024-11-08 04:43:13 TP0] Capture cuda graph begin. This can take up to several minutes.
[2024-11-08 04:43:13 TP1] Capture cuda graph begin. This can take up to several minutes.
[2024-11-08 04:43:58 TP0] max_total_num_tokens=1033884, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2024-11-08 04:43:58 TP1] max_total_num_tokens=1033884, max_prefill_tokens=16384, max_running_requests=4097, context_len=32768
[2024-11-08 04:43:58] INFO: Started server process [95220]
[2024-11-08 04:43:58] INFO: Waiting for application startup.
[2024-11-08 04:43:58] INFO: Application startup complete.
[2024-11-08 04:43:58] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2024-11-08 04:43:59] INFO: 127.0.0.1:41416 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-11-08 04:43:59 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:04 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:06 TP1] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:06 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:06 TP0] Detected errors during sampling! NaN in the logits.
[2024-11-08 04:44:06 TP1] Detected errors during sampling! NaN in the logits.
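For context, the repeated "Detected errors during sampling! NaN in the logits." lines come from SGLang's NaN detection on the sampling path (note disable_nan_detection=False in the server args above). A minimal sketch of the kind of guard that produces this message, not SGLang's actual code:

```python
import torch

def has_nan_logits(logits: torch.Tensor) -> bool:
    # True if any logit is NaN -- the condition behind the
    # "Detected errors during sampling! NaN in the logits." log line.
    return bool(torch.isnan(logits).any().item())

# Simulate what upstream corruption (e.g., a bad int8 KV-cache read)
# would look like by the time the logits reach the sampler.
logits = torch.randn(4, 32000)   # (batch, vocab) float logits
logits[0, 0] = float("nan")
assert has_nan_logits(logits)
```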
Reproduction
python3 -m sglang.launch_server --model-path /root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype int8 --attention-backend triton --sampling-backend pytorch --tp-size 2
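A minimal client call that exercises the failing path, assuming SGLang's native /generate HTTP endpoint (the prompt text is arbitrary; per the logs above, the very first request already triggers the NaN errors):

```python
import requests

# One generation request against the server launched above.
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "Hello, who are you?",
        "sampling_params": {"max_new_tokens": 32, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```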
Environment
rocminfo output for the GPU:
Name: gfx1100
Uuid: GPU-b1d1b7e55cd7ec87
Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2070
BDFID: 49920
Internal Node ID: 3
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 202
SDMA engine uCode:: 20
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
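For completeness, the device visible to PyTorch can be confirmed as follows (a sketch assuming a standard ROCm build of PyTorch, which exposes HIP through the CUDA-compatible API):

```python
import torch

print(torch.__version__, torch.version.hip)   # torch.version.hip is None on CUDA builds
print(torch.cuda.is_available(), torch.cuda.device_count())
print(torch.cuda.get_device_name(0))          # expected: the RX 7900 XTX reported above
```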