Benchmark torchao and torch.compile (need torch 2.5) #1874

Open
jerryzh168 opened this issue Nov 2, 2024 · 4 comments

jerryzh168 (Contributor) commented Nov 2, 2024

vLLM recently updated to PyTorch 2.5, so we can now benchmark torchao together with torch.compile (this was previously blocked by the 2.5 update).

  1. Install the most recent vllm: pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl

  2. Make some small modifications to sglang:
    https://gist.github.com/jerryzh168/bd65f122f24d5c92525f2504a1ff5870

  3. Install sglang from source (a consolidated setup sketch follows this list)

  4. Benchmark
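
For steps 1 to 3, a rough consolidated setup sketch; the sglang repository URL and the source-install command are assumptions based on the sglang README and may need adjusting for your environment:

# step 1: install the vllm nightly wheel
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl

# step 2: apply the small changes from the gist linked above
# step 3: install sglang from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"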

int8wo no compile:
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int8wo

max_total_num_tokens=482059
Warmup ...
Prefill. latency: 0.06436 s, throughput:   1988.85 token/s
Decode.  latency: 0.05185 s, throughput:     19.29 token/s
Decode.  latency: 0.04700 s, throughput:     21.27 token/s
Decode.  latency: 0.04502 s, throughput:     22.21 token/s
Decode.  latency: 0.04504 s, throughput:     22.20 token/s
Decode.  latency: 0.04502 s, throughput:     22.21 token/s
Decode.  median latency: 0.04505 s, median throughput:     22.20 token/s
Total. latency:  0.388 s, throughput:    350.14 token/s
Benchmark ...
Prefill. latency: 0.05367 s, throughput:   2385.12 token/s
Decode.  latency: 0.04498 s, throughput:     22.23 token/s
Decode.  latency: 0.04530 s, throughput:     22.08 token/s
Decode.  latency: 0.04516 s, throughput:     22.15 token/s
Decode.  latency: 0.04511 s, throughput:     22.17 token/s
Decode.  latency: 0.04504 s, throughput:     22.20 token/s
Decode.  median latency: 0.04507 s, median throughput:     22.19 token/s
Total. latency:  0.369 s, throughput:    368.20 token/s

int8wo with compile:
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int8wo --enable-torch-compile

Warmup ...
Prefill. latency: 0.05634 s, throughput:   2271.98 token/s
Decode.  latency: 0.00890 s, throughput:    112.38 token/s
Decode.  latency: 0.00857 s, throughput:    116.66 token/s
Decode.  latency: 0.00843 s, throughput:    118.68 token/s
Decode.  latency: 0.00844 s, throughput:    118.51 token/s
Decode.  latency: 0.00836 s, throughput:    119.59 token/s
Decode.  median latency: 0.00843 s, median throughput:    118.68 token/s
Total. latency:  0.116 s, throughput:   1174.76 token/s
Benchmark ...
Prefill. latency: 0.05326 s, throughput:   2403.52 token/s
Decode.  latency: 0.00880 s, throughput:    113.60 token/s
Decode.  latency: 0.00844 s, throughput:    118.55 token/s
Decode.  latency: 0.00839 s, throughput:    119.24 token/s
Decode.  latency: 0.00835 s, throughput:    119.76 token/s
Decode.  latency: 0.00829 s, throughput:    120.66 token/s
Decode.  median latency: 0.00839 s, median throughput:    119.24 token/s
Total. latency:  0.112 s, throughput:   1211.90 token/s
zhyncs (Member) commented Nov 2, 2024

@jerryzh168 We previously observed significant fluctuations when benchmarking this configuration with bench_latency. You may also want to run a benchmark with bench_serving, thanks!
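
For reference, a serving-style run along those lines might look roughly like the following; the launch_server and bench_serving flags shown here are assumptions from memory of the sglang CLI and may differ between versions:

# terminal 1: launch the server with the same torchao config (assumed flags)
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --torchao-config int8wo --enable-torch-compile

# terminal 2: run the serving benchmark against it (assumed flags)
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 200 --random-input-len 128 --random-output-len 8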

CortexEdgeUser commented:

[2024-11-03 00:41:08 TP0] Init torch distributed begin.
[2024-11-03 00:41:08 TP0] Load weight begin. avail mem=78.58 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/root/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 551, in <module>
[rank0]: raise e
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 549, in <module>
[rank0]: main(server_args, bench_args)
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 513, in main
[rank0]: work_func(server_args, port_args, bench_args, 0)
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 384, in latency_test
[rank0]: model_runner, tokenizer = load_model(server_args, port_args, tp_rank)
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 136, in load_model
[rank0]: model_runner = ModelRunner(
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/srt/model_executor/model_runner.py", line 155, in __init__
[rank0]: self.load_model()
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/srt/model_executor/model_runner.py", line 240, in load_model
[rank0]: monkey_patch_vllm_dummy_weight_loader()
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/srt/utils.py", line 452, in monkey_patch_vllm_dummy_weight_loader
[rank0]: from vllm.model_executor.model_loader.loader import (
[rank0]: ImportError: cannot import name 'DeviceConfig' from 'vllm.model_executor.model_loader.loader' (/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py)
[rank0]:[W1103 00:41:09.547081948 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

When I tried to follow your instructions, I got this error.

jerryzh168 (Contributor, Author) commented Nov 6, 2024

@CortexEdgeUser that seems reasonable, since https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/model_loader/loader.py no longer has DeviceConfig. I'm not sure when it was removed; I think you probably need to import it from https://github.com/vllm-project/vllm/blob/ca9844b340f45f23f8d30fdce23777d215ad987c/vllm/config.py#L1158 instead. I didn't encounter this specific error myself, though.
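
A minimal sketch of the kind of import change this suggests inside sglang/srt/utils.py; only the DeviceConfig location is taken from the linked vllm source, the surrounding code is assumed:

# in monkey_patch_vllm_dummy_weight_loader, pull DeviceConfig from vllm.config
# instead of vllm.model_executor.model_loader.loader, which no longer exposes it
from vllm.config import DeviceConfig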

CortexEdgeUser commented:

I used a commit from before those changes and it worked well. However, I'm encountering issues with tp > 1, though that is also the case with vLLM.
