[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label #5073

Merged: 115 commits, Jun 14, 2024
Changes from 101 commits

Commits (115)
470b1c0
Kuntai: add a script to run tgi inside vllm docker
KuntaiDu May 27, 2024
0f522ae
Kuntai: format the bash script using shfmt
KuntaiDu May 27, 2024
fb4040b
Kuntai: add benchmarking script for trt-llm
KuntaiDu Jun 3, 2024
8abe615
remove huggingface token
KuntaiDu Jun 3, 2024
fca4a47
Kuntai: update vLLM benchmarking script
KuntaiDu Jun 3, 2024
d6e7faf
Kuntai: change the TGI script so that it can be run inside TGI contai…
KuntaiDu Jun 4, 2024
fed1136
Kuntai: let vllm benchmark read test cases from
KuntaiDu Jun 4, 2024
c879a89
Kuntai: update benchmark-parameters.json
KuntaiDu Jun 4, 2024
c7c1ce6
Kuntai: update run-vllm-benchmarks.sh so that it parses test cases fr…
KuntaiDu Jun 4, 2024
33cf9cc
Kuntai: fix throughput parsing error while using tensor parallel size…
KuntaiDu Jun 5, 2024
0f4aab3
Kuntai: attach GPU type (e.g. H100) to test name
KuntaiDu Jun 5, 2024
8e2bb78
Merge branch 'main' of github.com:vllm-project/vllm into kuntai-tgibe…
simon-mo Jun 5, 2024
9d8d904
move files
simon-mo Jun 5, 2024
d02b35b
fix script to run on latest code
simon-mo Jun 5, 2024
2ba3559
fix
simon-mo Jun 5, 2024
c321200
add hf token
simon-mo Jun 5, 2024
e3bb365
fix path
simon-mo Jun 5, 2024
22c6dcc
fix path
simon-mo Jun 5, 2024
249aa02
Kuntai: add parameter, so that we can specify the filename of benchm…
KuntaiDu Jun 5, 2024
b41b579
Kuntai: rename the test parameter files to so that it is clear that …
KuntaiDu Jun 5, 2024
ab7a744
Kuntai: reformat the test cases file.
KuntaiDu Jun 5, 2024
0707a7f
Kuntai: add dummy weight to benchmarking
KuntaiDu Jun 6, 2024
33d31bf
Kuntai: use 7B model for testing.
KuntaiDu Jun 6, 2024
15dd5a9
Kuntai: postprocess the benchmarking results to markdown using python…
KuntaiDu Jun 6, 2024
9ffe79b
Kuntai: bugfix.
KuntaiDu Jun 6, 2024
f9805f0
Kuntai: bugfix.
KuntaiDu Jun 6, 2024
6febc17
Kuntai: reformat the markdown output, bug fix
KuntaiDu Jun 6, 2024
d9a40c1
Kuntai: add load_format to benchmark_latency.py, to allow using dummy…
KuntaiDu Jun 6, 2024
2e448df
Kuntai: see if benchmark_latency.py works in the CI docker
KuntaiDu Jun 6, 2024
f14d3bb
Kuntai: reduce the # of prompts to 100, for debugging.
KuntaiDu Jun 7, 2024
d6db0af
Kuntai: start developing on latency tests
KuntaiDu Jun 7, 2024
43deac5
Kuntai: update markdown generation script for
KuntaiDu Jun 7, 2024
a318650
Kuntai: temporary change for debugging
KuntaiDu Jun 7, 2024
b9e11de
Kuntai: bug fix
KuntaiDu Jun 7, 2024
b12556f
Kuntai: bug fix: percentile key is str not int
KuntaiDu Jun 7, 2024
223a69a
Kuntai: handle the case where the dataframe is empty
KuntaiDu Jun 7, 2024
6aadb3d
Kuntai: empty is a bool not a function
KuntaiDu Jun 7, 2024
63e4bf4
Kuntai: add double quote for artifact upload
KuntaiDu Jun 7, 2024
a8876d3
Kuntai: add various models to latency-tests.json
KuntaiDu Jun 7, 2024
1a19d71
Kuntai: finish debugging, run the full test now
KuntaiDu Jun 7, 2024
3cad48f
Kuntai: fix f-string issue
KuntaiDu Jun 7, 2024
e36e606
Kuntai: add more test to serving test
KuntaiDu Jun 7, 2024
ee9d701
Kuntai: fix python file syntax.
KuntaiDu Jun 7, 2024
0abebfc
Kuntai: remove -x debugging flag from the benchmarking script
KuntaiDu Jun 7, 2024
095b517
Kuntai: add , to the end of string to make yapf happy
KuntaiDu Jun 7, 2024
d5d55b4
Kuntai: reduce tp from 8 to 4 for mixtral 7B model, to avoid memory a…
KuntaiDu Jun 7, 2024
6eaef5a
Kuntai: reduce the tp for Mixtral 8x7B to 2
KuntaiDu Jun 7, 2024
e2428a9
Kuntai: remove 8x22B test, as it triggers illegal memory access
KuntaiDu Jun 7, 2024
3bf2bae
Kuntai: fall back to tp=4 for Mixtral 8x7B to avoid cuda OOM error
KuntaiDu Jun 7, 2024
3dc0bed
Kuntai: add GPU used memory to debug memory leaking
KuntaiDu Jun 7, 2024
70e5778
Kuntai: skip latency tests, for debugging
KuntaiDu Jun 7, 2024
48b8914
Kuntai: fix GPU memory leaking, and update full suite of tests
KuntaiDu Jun 7, 2024
1a5a2c3
Merge branch 'main' into kuntai-tgibench-dev
KuntaiDu Jun 9, 2024
5bd23e9
Kuntai: add GPU memory usage check after killing vllm server
KuntaiDu Jun 9, 2024
4c8dd6a
Kuntai: remove redundant gpu memory check
KuntaiDu Jun 9, 2024
973c018
Kuntai: reduce tp for 8x22B mixtral model, for more stable benchmarking
KuntaiDu Jun 9, 2024
74ecb6f
Kuntai: add debug symbol to see why 8x22B crashes under tp=8
KuntaiDu Jun 10, 2024
152f3f9
Kuntai: adjust latency-test.json to reproduce bugs
KuntaiDu Jun 10, 2024
ca7d6c5
Kuntai: adjust latency-test.json to reproduce bugs
KuntaiDu Jun 10, 2024
1dc23de
Kuntai: bug found (running 8x22B after Llama 70B triggers the bug). U…
KuntaiDu Jun 10, 2024
ef43f7d
Kuntai: bug found (running 8x22B after Llama 70B triggers the bug). U…
KuntaiDu Jun 10, 2024
e721c07
Kuntai: improve the readability of the benchmarking script
KuntaiDu Jun 10, 2024
0de27ff
Kuntai: remove vllm configuration file after execution, hopefully it …
KuntaiDu Jun 10, 2024
3dd81fa
Add H100 node
simon-mo Jun 10, 2024
f511a71
remove comment
simon-mo Jun 10, 2024
21306f2
use aws image
simon-mo Jun 10, 2024
0654dc5
mount code
simon-mo Jun 10, 2024
417e4d3
reset entrypoints
simon-mo Jun 10, 2024
5bd8d93
do not use init
simon-mo Jun 10, 2024
9bcdc87
set command
simon-mo Jun 10, 2024
54754b5
inject env
simon-mo Jun 10, 2024
d5190a6
report if buildkite agent is missing, and add longer timeout for wait…
KuntaiDu Jun 11, 2024
7b57d96
fix git clean bug in buildkite pipeline
KuntaiDu Jun 11, 2024
caea9c2
fix git clean bug in buildkite pipeline
KuntaiDu Jun 11, 2024
92dcff1
add debugging flag for more detailed error trace
KuntaiDu Jun 11, 2024
8351dfc
add debugging flag
KuntaiDu Jun 11, 2024
bdc0201
add debugging flag
KuntaiDu Jun 11, 2024
b7ce36f
log trace dumped. revert to review-ready version of the code
KuntaiDu Jun 11, 2024
f20d0b4
move the code to quick-benchmark folder, so that people do not get co…
KuntaiDu Jun 12, 2024
09baa4f
remove mixtral 8x22B with tp=8 for now, as GPU4 is not stable and thu…
KuntaiDu Jun 12, 2024
61276a0
comment out H100
KuntaiDu Jun 12, 2024
0097e9b
add median and p99, and a new column reflecting GPU type
KuntaiDu Jun 12, 2024
504b862
support dummy loading for throughput test
KuntaiDu Jun 12, 2024
75c517c
add json file for debugging --- contains much less test cases so that…
KuntaiDu Jun 12, 2024
71d21b3
update benchmarking script to handle multiple qps in serving test
KuntaiDu Jun 12, 2024
1bcc201
update postprocessing script accordingly
KuntaiDu Jun 12, 2024
691e8ac
change benchmark root to quick-benchmarks
KuntaiDu Jun 12, 2024
06cc219
fix bug when globbing qps_list
KuntaiDu Jun 12, 2024
3ff7399
fix for loop
KuntaiDu Jun 12, 2024
3c8a000
evaluate client-side benchmarking command
KuntaiDu Jun 12, 2024
da17560
bug fix: fix bug when qps=inf
KuntaiDu Jun 12, 2024
855073d
bug fix: fix bug when qps=inf
KuntaiDu Jun 12, 2024
a58bb94
add missing fi
KuntaiDu Jun 12, 2024
5d83c76
add missing backslash
KuntaiDu Jun 12, 2024
94e2367
bring back the full test cases
KuntaiDu Jun 12, 2024
553266c
update the doc
KuntaiDu Jun 12, 2024
f760deb
fix unnecessary eval command
KuntaiDu Jun 12, 2024
938e86c
make yapf happy
KuntaiDu Jun 12, 2024
7eab728
make yapf happy
KuntaiDu Jun 12, 2024
810c9ff
make yapf happy
KuntaiDu Jun 12, 2024
b3b5d5e
update the documents
KuntaiDu Jun 12, 2024
423ba21
use BUILDKITE_COMMIT
simon-mo Jun 13, 2024
08be0e2
quotation
simon-mo Jun 13, 2024
4f511e2
add jq
simon-mo Jun 13, 2024
99656d3
try >-
simon-mo Jun 13, 2024
e420563
fix quote
simon-mo Jun 13, 2024
42222dd
fix quote
simon-mo Jun 13, 2024
d0ad3ae
use a script
simon-mo Jun 13, 2024
0824f3f
Merge branch 'main' of github.com:vllm-project/vllm into kuntai-tgibe…
simon-mo Jun 13, 2024
73dd63e
don't verbose
simon-mo Jun 13, 2024
d32723a
rename
simon-mo Jun 13, 2024
f0a28f9
clean up
simon-mo Jun 14, 2024
dd90323
fix path
simon-mo Jun 14, 2024
a4af5ff
Merge branch 'main' of github.com:vllm-project/vllm into kuntai-tgibe…
simon-mo Jun 14, 2024
64bfa57
fix path error for convert-results-json-to-markdown.py
KuntaiDu Jun 14, 2024
23 changes: 18 additions & 5 deletions .buildkite/nightly-benchmarks/sample.yaml
```diff
@@ -9,31 +9,44 @@ steps:
     containers:
       # - image: us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:$BUILDKITE_COMMIT
       # TODO(simon): check latest main branch or use the PR image.
-      - image: us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:45c35f0d58f4508bf43bd6af1d3d0d0ec0c915e6
+      - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f7f9c5f97b4dd206a3cd9c65729a1c807ac82f50
         command:
           - bash -c 'nvidia-smi && nvidia-smi topo -m && pwd && ls'
           - bash .buildkite/quick-benchmarks/quick-benchmarks.sh
         resources:
           limits:
             nvidia.com/gpu: 8
         volumeMounts:
           - name: devshm
             mountPath: /dev/shm
         env:
           - name: VLLM_USAGE_SOURCE
             value: ci-test
           - name: HF_TOKEN
             valueFrom:
               secretKeyRef:
                 name: hf-token-secret
                 key: token
     nodeSelector:
       nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
     volumes:
       - name: devshm
         emptyDir:
           medium: Memory
   # TODO(simon): bring H100 online
   # - label: "H100: NVIDIA SMI"
   #   agents:
   #     queue: H100
   #   plugins:
   #     - docker#v5.11.0:
-  #         image: us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:45c35f0d58f4508bf43bd6af1d3d0d0ec0c915e6
+  #         image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f7f9c5f97b4dd206a3cd9c65729a1c807ac82f50
   #         command:
   #           - bash -c 'nvidia-smi && nvidia-smi topo -m'
   #           - bash
   #           - .buildkite/quick-benchmarks/quick-benchmarks.sh
   #         mount-buildkite-agent: true
   #         propagate-environment: true
   #         propagate-uid-gid: false
   #         ipc: host
   #         gpus: all
   #         environment:
   #           - VLLM_USAGE_SOURCE
   #           - HF_TOKEN
```

111 changes: 111 additions & 0 deletions .buildkite/quick-benchmarks/README.md

# Quick benchmark

## Introduction

This directory contains a quick performance-benchmarking CI for vllm. The goal is to help developers understand the impact of their PRs on vllm's performance.

This benchmark will be *triggered* upon:
- A PR being merged into vllm.
- Every commit on PRs that carry the `perf-benchmarks` label.

**Benchmarking coverage**: latency, throughput, and fixed-QPS serving on A100 (support for more GPUs is coming later), across different models.

**Benchmarking ETA**: 40 minutes

## Configuring the workload for the quick benchmark

The workload of the quick benchmark consists of three parts: latency tests in `latency-tests.json`, throughput tests in `throughput-tests.json`, and serving tests in `serving-tests.json`.

### Latency test

Here is an example of one test inside `latency-tests.json`:

```json
[
...
{
"test_name": "latency_llama8B_tp1",
"parameters": {
"model": "meta-llama/Meta-Llama-3-8B",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num_iters_warmup": 5,
"num_iters": 15
}
},
...
]
```

In this example:
- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command-line arguments passed to `benchmark_latency.py`. Use an underscore `_` instead of a dash `-` when specifying an argument; `quick-benchmark.sh` converts the underscores back to dashes when feeding the arguments to `benchmark_latency.py`. For example, the test above translates to the command-line arguments `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15` (see the sketch after this list).
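
To make the conversion concrete, here is a minimal Python sketch of the underscore-to-dash translation described above. This is illustrative only: the real conversion happens inside the benchmarking shell script, and the helper name and the treatment of empty-string values as boolean flags are assumptions.

```python
import json

def params_to_cli_args(parameters: dict) -> list[str]:
    """Hypothetical helper: turn a `parameters` object into CLI arguments."""
    args = []
    for key, value in parameters.items():
        # Underscores in JSON keys become dashes in the CLI flag.
        flag = "--" + key.replace("_", "-")
        if value == "":
            # Assumed convention: an empty string marks a boolean flag,
            # e.g. "disable_log_stats": "" in the serving tests.
            args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args

parameters = json.loads("""{
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15
}""")
print("python3 benchmark_latency.py " + " ".join(params_to_cli_args(parameters)))
```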

The numbers produced by this test are stable, so even a small change in the measured value is meaningful.

WARNING: The benchmarking script saves the json results by itself, so please do not configure the `--output-json` parameter in the json file.


### Throughput test

The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are forwarded to `benchmark_throughput.py`.

The numbers produced by this test are also stable, so even a small change in the measured value is meaningful.

### Serving test


We test online serving using `benchmark_serving.py`, with request rates up to `inf`, to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
...
{
"test_name": "serving_llama8B_tp1_sharegpt",
"qps_list": [1, 4, 16, "inf"],
"server_parameters": {
"model": "meta-llama/Meta-Llama-3-8B",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Meta-Llama-3-8B",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
...
]
```

Inside this example:
- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server_parameters` attribute contains the command-line arguments for the vllm server.
- The `client_parameters` attribute contains the command-line arguments for `benchmark_serving.py`.
- The `qps_list` attribute controls the list of QPS values to test; each value is passed to `benchmark_serving.py` via its `--request-rate` parameter (see the sketch after this list).
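
For intuition, the loop below is a hypothetical Python sketch of how `qps_list` could drive one client run per QPS value. The actual orchestration lives in the benchmarking shell script; only the printed flags (`--request-rate`, `--model`, `--backend`, `--dataset-name`, `--dataset-path`, `--num-prompts`) are real `benchmark_serving.py` options.

```python
import json

test = json.loads("""{
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "client_parameters": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "backend": "vllm",
        "dataset_name": "sharegpt",
        "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
        "num_prompts": 200
    }
}""")

# One benchmark_serving.py invocation per QPS value; "inf" sends all
# requests without rate limiting.
for qps in test["qps_list"]:
    client_args = " ".join(
        f"--{key.replace('_', '-')} {value}"
        for key, value in test["client_parameters"].items())
    print(f"python3 benchmark_serving.py --request-rate {qps} {client_args}")
```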

The numbers from this test are less stable than the latency and throughput benchmarks (due to the randomized ShareGPT dataset sampling inside `benchmark_serving.py`), but a large change in them (e.g., 5%) still says something.

WARNING: The benchmarking script saves json results by itself, so please do not configure `--save-result` or other result-saving parameters in `serving-tests.json`.


## JSON files for debugging

We provide debugging versions of the json files, which cover far fewer test cases but are sufficient to check that the benchmark is set up correctly. Feel free to use them when contributing to this benchmark -- they will make your iteration cycles much faster.


## Visualizing the results

The `results2md.py` script collects the benchmarking results into a markdown table. To see this table, scroll to the very bottom of your PR:
![PR position](./imgs/position.jpg)

Then find the `performance-benchmark` check, click "Details", and you will see the benchmarking tables:
![Benchmarking results](./imgs/results.jpg)

If you do not see the table, please wait until the benchmark finishes running.
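
As a rough sketch of the kind of aggregation a script like `results2md.py` performs, the snippet below collects per-test JSON result files into one markdown table. The `results/` directory layout, the one-file-per-test assumption, and the use of pandas are illustrative assumptions, not the script's actual interface.

```python
import json
import pathlib

import pandas as pd  # DataFrame.to_markdown() also requires `tabulate`

# Assumed layout: each benchmark run leaves one JSON result file per test.
records = []
for path in pathlib.Path("results").glob("*.json"):
    with open(path) as f:
        record = json.load(f)
    record["test_name"] = path.stem  # assumed: file name encodes the test
    records.append(record)

# Render all runs as a single markdown table, ready for a PR comment.
df = pd.DataFrame(records)
print(df.to_markdown(index=False))
```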
Binary file added .buildkite/quick-benchmarks/imgs/position.jpg
Binary file added .buildkite/quick-benchmarks/imgs/results.jpg
12 changes: 12 additions & 0 deletions .buildkite/quick-benchmarks/latency-tests-for-debugging.json
```json
[
    {
        "test_name": "latency_llama8B_tp1",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    }
]
```
32 changes: 32 additions & 0 deletions .buildkite/quick-benchmarks/latency-tests.json
```json
[
    {
        "test_name": "latency_llama8B_tp1",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    },
    {
        "test_name": "latency_llama70B_tp4",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-70B-Instruct",
            "tensor_parallel_size": 4,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    },
    {
        "test_name": "latency_mixtral8x7B_tp2",
        "parameters": {
            "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
            "tensor_parallel_size": 2,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    }
]
```