Mistral Nemo -> CUDA out of memory on 2 x 80GB H100 #1437
-
Hello everybody, I am unable to start SGLang with Mistral Nemo. Details below.

```shell
python3 -m sglang.check_env
gpustat
python3 -m sglang.launch_server --model-path mistralai/Mistral-Nemo-Instruct-2407 --tp-size 2 --enable-p2p-check
```
Replies: 4 comments
-
You can set a smaller value for `--mem-fraction-static`, e.g. `--mem-fraction-static 0.8`.
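As a rough illustration of what `--mem-fraction-static` controls (the fraction of each GPU's memory that SGLang budgets for model weights plus the KV-cache pool), here is a back-of-the-envelope sketch. The numbers are hypothetical examples, not values quoted in this thread:

```python
# Hypothetical sketch: how --mem-fraction-static bounds the memory SGLang
# may plan against on each GPU. Not SGLang's actual internal logic.
GPU_MEMORY_GB = 80  # one H100, as in this thread


def static_budget_gb(mem_fraction_static: float,
                     gpu_memory_gb: float = GPU_MEMORY_GB) -> float:
    """Memory (GB) available for weights + KV-cache pool at this fraction."""
    return mem_fraction_static * gpu_memory_gb


# Lowering the fraction shrinks the planned memory pool, which can avoid
# an OOM caused by an over-estimated context length.
print(static_budget_gb(0.9))  # example larger fraction -> 72.0 GB
print(static_budget_gb(0.8))  # suggested smaller value -> 64.0 GB
```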
-
Thank you! Any idea on how to get SGLang to use the 2 available GPUs instead of just one? This way, I wouldn't need to use `--mem-fraction-static`.
-
It turns out that there is something wrong with the model config of mistralai/Mistral-Nemo-Instruct-2407. SGLang reads the model's context length from `max_position_embeddings` in the model config. However, in this model it is set to 1024k, while the model was only trained with 128k: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json#L13
This makes SGLang incorrectly estimate the memory usage of the memory pool, so we need to correct it with either `--mem-fraction-static` or `--context-length`. For example, if your use case only needs a 32k context length, you can pass `--context-length 32768`.
You can also try the model's full 128k context length. This also works.
In all cases (32k, 128k, or `--mem-fraction-static 0.8`), SGLang will use 2 GPUs because you set `--tp-size 2`.
-
Wow, thanks a lot! Just for "history":