Benchmark torchao and torch.compile (need torch 2.5) #1874

Open
jerryzh168 opened this issue Nov 2, 2024 · 4 comments

jerryzh168 (Contributor) commented Nov 2, 2024

vLLM recently updated to PyTorch 2.5, so we can now benchmark torchao together with torch.compile (this was previously blocked by the 2.5 update).

  1. Install the most recent vllm: pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl

  2. Make some small modifications to sglang:
    https://gist.github.com/jerryzh168/bd65f122f24d5c92525f2504a1ff5870

  3. Install sglang from source (a consolidated setup sketch follows this list)

  4. Benchmark
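
For steps 1 to 3, a rough consolidated setup sketch; the sglang repository URL and the source-install command are assumptions based on the sglang README and may need adjusting for your environment:

# step 1: install the vllm nightly wheel
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl

# step 2: apply the small changes from the gist linked above
# step 3: install sglang from source
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"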

int8wo no compile:
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int8wo

max_total_num_tokens=482059
Warmup ...
Prefill. latency: 0.06436 s, throughput:   1988.85 token/s
Decode.  latency: 0.05185 s, throughput:     19.29 token/s
Decode.  latency: 0.04700 s, throughput:     21.27 token/s
Decode.  latency: 0.04502 s, throughput:     22.21 token/s
Decode.  latency: 0.04504 s, throughput:     22.20 token/s
Decode.  latency: 0.04502 s, throughput:     22.21 token/s
Decode.  median latency: 0.04505 s, median throughput:     22.20 token/s
Total. latency:  0.388 s, throughput:    350.14 token/s
Benchmark ...
Prefill. latency: 0.05367 s, throughput:   2385.12 token/s
Decode.  latency: 0.04498 s, throughput:     22.23 token/s
Decode.  latency: 0.04530 s, throughput:     22.08 token/s
Decode.  latency: 0.04516 s, throughput:     22.15 token/s
Decode.  latency: 0.04511 s, throughput:     22.17 token/s
Decode.  latency: 0.04504 s, throughput:     22.20 token/s
Decode.  median latency: 0.04507 s, median throughput:     22.19 token/s
Total. latency:  0.369 s, throughput:    368.20 token/s

int8wo with compile:
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int8wo --enable-torch-compile

Warmup ...
Prefill. latency: 0.05634 s, throughput:   2271.98 token/s
Decode.  latency: 0.00890 s, throughput:    112.38 token/s
Decode.  latency: 0.00857 s, throughput:    116.66 token/s
Decode.  latency: 0.00843 s, throughput:    118.68 token/s
Decode.  latency: 0.00844 s, throughput:    118.51 token/s
Decode.  latency: 0.00836 s, throughput:    119.59 token/s
Decode.  median latency: 0.00843 s, median throughput:    118.68 token/s
Total. latency:  0.116 s, throughput:   1174.76 token/s
Benchmark ...
Prefill. latency: 0.05326 s, throughput:   2403.52 token/s
Decode.  latency: 0.00880 s, throughput:    113.60 token/s
Decode.  latency: 0.00844 s, throughput:    118.55 token/s
Decode.  latency: 0.00839 s, throughput:    119.24 token/s
Decode.  latency: 0.00835 s, throughput:    119.76 token/s
Decode.  latency: 0.00829 s, throughput:    120.66 token/s
Decode.  median latency: 0.00839 s, median throughput:    119.24 token/s
Total. latency:  0.112 s, throughput:   1211.90 token/s
zhyncs (Member) commented Nov 2, 2024

@jerryzh168 We previously observed significant fluctuations when benchmarking this configuration with bench_latency. You may also want to run a benchmark with bench_serving, thanks!
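
For reference, a serving-style run along those lines might look roughly like the following; the launch_server and bench_serving flags shown here are assumptions from memory of the sglang CLI and may differ between versions:

# terminal 1: launch the server with the same torchao config (assumed flags)
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --torchao-config int8wo --enable-torch-compile

# terminal 2: run the serving benchmark against it (assumed flags)
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 200 --random-input-len 128 --random-output-len 8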

CortexEdgeUser commented:

[2024-11-03 00:41:08 TP0] Init torch distributed begin.
[2024-11-03 00:41:08 TP0] Load weight begin. avail mem=78.58 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/root/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 551, in <module>
[rank0]: raise e
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 549, in <module>
[rank0]: main(server_args, bench_args)
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 513, in main
[rank0]: work_func(server_args, port_args, bench_args, 0)
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 384, in latency_test
[rank0]: model_runner, tokenizer = load_model(server_args, port_args, tp_rank)
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/bench_latency.py", line 136, in load_model
[rank0]: model_runner = ModelRunner(
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/srt/model_executor/model_runner.py", line 155, in __init__
[rank0]: self.load_model()
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/srt/model_executor/model_runner.py", line 240, in load_model
[rank0]: monkey_patch_vllm_dummy_weight_loader()
[rank0]: File "/home/ubuntu/sglang-4/python/sglang/srt/utils.py", line 452, in monkey_patch_vllm_dummy_weight_loader
[rank0]: from vllm.model_executor.model_loader.loader import (
[rank0]: ImportError: cannot import name 'DeviceConfig' from 'vllm.model_executor.model_loader.loader' (/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py)
[rank0]:[W1103 00:41:09.547081948 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())

When I tried to follow your instructions, I got this error.

jerryzh168 (Contributor, Author) commented Nov 6, 2024

@CortexEdgeUser that seems reasonable, since https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/model_loader/loader.py no longer has DeviceConfig. I'm not sure when it was removed; I think you probably need to import it from https://github.com/vllm-project/vllm/blob/ca9844b340f45f23f8d30fdce23777d215ad987c/vllm/config.py#L1158 instead. I didn't encounter this specific error myself, though.
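
A minimal sketch of the kind of import change this suggests inside sglang/srt/utils.py; only the DeviceConfig location is taken from the linked vllm source, the surrounding code is assumed:

# in monkey_patch_vllm_dummy_weight_loader, pull DeviceConfig from vllm.config
# instead of vllm.model_executor.model_loader.loader, which no longer exposes it
from vllm.config import DeviceConfig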

CortexEdgeUser commented:

I used a commit from before those changes and it worked well. However, I'm encountering issues with tp > 1, though that is also the case with vLLM.
