Offline LLM Engine Benchmark Throughput #1968
base: main
Conversation
@ByronHsu do you want me to add this to CI?
This PR tries to reuse the script bench_serving.py. However, I think a better approach is to use a standalone script.

Reason:

- bench_serving.py is for online serving, but the most common use case of the Engine API is offline. We should benchmark the non-async version of the Engine to see the maximum throughput we can get without all of the streaming/asyncio overhead.
- We want to pass in many other arguments to the server. The new script should be similar to bench_latency.py, which takes the full ServerArgs as arguments.

sglang/python/sglang/bench_latency.py
Lines 536 to 541 in 760552e

parser = argparse.ArgumentParser()
ServerArgs.add_cli_args(parser)
BenchArgs.add_cli_args(parser)
args = parser.parse_args()
server_args = ServerArgs.from_cli_args(args)
bench_args = BenchArgs.from_cli_args(args)

Can you try to write a standalone script bench_offline_throughput.py that takes the same arguments as bench_latency.py?
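For reference, a minimal sketch of how such a standalone script could wire up its arguments, following the bench_latency.py pattern quoted above; the BenchArgs fields and the commented-out throughput_test() entry point are placeholders for illustration, not the actual code in this PR:

```python
import argparse
import dataclasses

from sglang.srt.server_args import ServerArgs


@dataclasses.dataclass
class BenchArgs:
    # Placeholder benchmark-specific arguments.
    num_prompts: int = 1000
    result_filename: str = ""

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser):
        parser.add_argument("--num-prompts", type=int, default=BenchArgs.num_prompts)
        parser.add_argument("--result-filename", type=str, default=BenchArgs.result_filename)

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace):
        return cls(**{f.name: getattr(args, f.name) for f in dataclasses.fields(cls)})


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    ServerArgs.add_cli_args(parser)  # full engine/server arguments (model path, tp size, ...)
    BenchArgs.add_cli_args(parser)   # benchmark-specific arguments
    args = parser.parse_args()
    server_args = ServerArgs.from_cli_args(args)
    bench_args = BenchArgs.from_cli_args(args)
    # throughput_test(server_args, bench_args)  # hypothetical benchmark entry point
```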
will do
@merrymercy updated the script, what do you think?
This looks better! Sorry for the back-and-forth, but I think we should support loading real datasets and better random dataset generation, similar to bench_serving.py.

bench_latency.py uses a simple way to generate synthetic data because it does not support continuous batching or inputs with variable lengths. The engine supports everything, so we can use more realistic data.
def prepare_synthetic_inputs_for_throughput_test(
    batch_size: int, input_len: int, output_len: int
):
    input_ids = [[1] * input_len for _ in range(batch_size)]
Instead of generating synthetic data, can we reuse the data generation from bench_serving.py?

Specifically, support the following arguments. You can import the code from bench_serving.py.

sglang/python/sglang/bench_serving.py
Lines 1189 to 1198 in 59a5ba9
parser.add_argument(
    "--dataset-name",
    type=str,
    default="sharegpt",
    choices=["sharegpt", "random", "generated-shared-prefix"],
    help="Name of the dataset to benchmark on.",
)
parser.add_argument(
    "--dataset-path", type=str, default="", help="Path to the dataset."
)
sglang/python/sglang/bench_serving.py
Lines 1209 to 1245 in 59a5ba9
parser.add_argument(
    "--num-prompts",
    type=int,
    default=1000,
    help="Number of prompts to process. Default is 1000.",
)
parser.add_argument(
    "--sharegpt-output-len",
    type=int,
    default=None,
    help="Output length for each request. Overrides the output length from the ShareGPT dataset.",
)
parser.add_argument(
    "--random-input-len",
    type=int,
    help="Number of input tokens per request, used only for random dataset.",
)
parser.add_argument(
    "--random-output-len",
    type=int,
    help="Number of output tokens per request, used only for random dataset.",
)
parser.add_argument(
    "--random-range-ratio",
    type=float,
    default=0.0,
    help="Range of sampled ratio of input/output length, "
    "used only for random dataset.",
)
parser.add_argument(
    "--request-rate",
    type=float,
    default=float("inf"),
    help="Number of requests per second. If this is inf, then all the requests are sent at time 0. "
    "Otherwise, we use Poisson process to synthesize the request arrival times. Default is inf.",
)
parser.add_argument("--seed", type=int, default=1, help="The random seed.")
Then, we can deprecate these arguments:
batch_size: Tuple[int] = (1,)
input_len: Tuple[int] = (1024,)
output_len: Tuple[int] = (16,)
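For illustration, the replacement arguments could be grouped into a dataclass that mirrors the bench_serving.py flags quoted above; the field names and defaults below are assumptions, not the final BenchArgs in this PR:

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class BenchArgs:
    # Hypothetical dataset-driven benchmark arguments (names mirror the flags above).
    dataset_name: str = "sharegpt"  # "sharegpt", "random", or "generated-shared-prefix"
    dataset_path: str = ""
    num_prompts: int = 1000
    sharegpt_output_len: Optional[int] = None
    random_input_len: int = 1024
    random_output_len: int = 16
    random_range_ratio: float = 0.0
    seed: int = 1
    result_filename: str = ""
```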
batch_size: Tuple[int] = (1,)
input_len: Tuple[int] = (1024,)
output_len: Tuple[int] = (16,)
result_filename: str = ""
# Plotting args
graph_sql: str = (
    "select run_name, batch_size, prefill_throughput from results where run_name='before'"
)
graph_filename: str = "out.png"
With my comment above, we can probably remove all of these arguments. Remove graph_sql because it is not used.
result_list.append(ret)

if bench_args.result_filename:
    with jsonlines.open(bench_args.result_filename, "a") as f:
Use fout.write(json.dumps(value) + "\n") to remove the additional dependency on jsonlines.
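A minimal sketch of that replacement, assuming the result_list and bench_args variables from the quoted code above (names are illustrative):

```python
import json

# Append one JSON object per line instead of using the jsonlines package.
if bench_args.result_filename:
    with open(bench_args.result_filename, "a") as fout:
        for value in result_list:
            fout.write(json.dumps(value) + "\n")
```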
Also, please add a unit test here: https://github.com/sgl-project/sglang/blob/main/test/srt/test_srt_engine.py to run this benchmark for 10 random prompts.
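For illustration only, such a test might look roughly like the sketch below; the bench_offline_throughput module, its throughput_test() function, the BenchArgs fields, the result key, and the model path are all assumptions rather than the code added in this PR:

```python
import unittest

from sglang.srt.server_args import ServerArgs


class TestOfflineThroughput(unittest.TestCase):
    def test_throughput_with_random_prompts(self):
        # Hypothetical import: the benchmark script and its entry point are assumed names.
        from sglang.bench_offline_throughput import BenchArgs, throughput_test

        server_args = ServerArgs(model_path="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
        bench_args = BenchArgs(dataset_name="random", num_prompts=10)
        metrics = throughput_test(server_args=server_args, bench_args=bench_args)
        self.assertGreater(metrics["total_throughput"], 0)  # "total_throughput" is an assumed key


if __name__ == "__main__":
    unittest.main()
```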
@merrymercy made the changes
Can we have an option for
Motivation
#1865
Add a throughput benchmark for engine.generate.
Modifications
Added the ability to specify an engine instead of an API URL in the benchmarks.
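As a rough sketch of the offline path this benchmark exercises (the Engine constructor arguments, sampling-parameter keys, and output fields below are assumptions about the sglang Engine API, not code from this PR):

```python
import sglang as sgl

# Build the offline engine directly instead of pointing the benchmark at an API URL.
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # model path is illustrative

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "max_new_tokens": 32}

# Synchronous, non-streaming generation: the code path whose throughput is measured.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()
```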
Checklist