
Offline LLM Engine Benchmark Throughput #1968

Open
wants to merge 14 commits into main

Conversation

zolinthecow (Contributor)

Motivation

#1865
Add a throughput benchmark for engine.generate.

Modifications

Added the ability to specify an engine instead of an API URL in the benchmarks.
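
For context, the offline path this benchmark targets is roughly the snippet below. This is only a sketch: the model path and sampling parameters are placeholders, not values from this PR.

    # Rough sketch of the offline Engine path being benchmarked.
    # Model path and sampling parameters are placeholders.
    import sglang as sgl

    llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = {"temperature": 0, "max_new_tokens": 32}
    outputs = llm.generate(prompts, sampling_params)
    for prompt, out in zip(prompts, outputs):
        print(prompt, out["text"])
    llm.shutdown()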

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zolinthecow (Contributor Author)

@ByronHsu do you want me to add this to CI?

@merrymercy (Contributor) left a comment

This PR tries to reuse the script bench_serving.py. However, I think a better approach is to use a standalone script.

Reason:

  • bench_serving.py is for online serving, but the most common use case of the Engine API is offline. We should benchmark the non-async version of the Engine to see the maximum throughput we can get without all of the streaming/asyncio overhead.
  • We want to pass many other arguments to the server. The new script should be similar to bench_latency.py, which takes the full ServerArgs as arguments:
    parser = argparse.ArgumentParser()
    ServerArgs.add_cli_args(parser)
    BenchArgs.add_cli_args(parser)
    args = parser.parse_args()
    server_args = ServerArgs.from_cli_args(args)
    bench_args = BenchArgs.from_cli_args(args)

Can you try to write a standalone script bench_offline_throughput.py that takes the same arguments as bench_latency.py?
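
A minimal skeleton for such a script might look like the following. The BenchArgs fields, the placeholder prompts, and the way the Engine is constructed from ServerArgs are illustrative assumptions, not the final implementation.

    # Hypothetical skeleton for bench_offline_throughput.py.
    import argparse
    import dataclasses
    import time

    import sglang as sgl
    from sglang.srt.server_args import ServerArgs


    @dataclasses.dataclass
    class BenchArgs:
        num_prompts: int = 1000
        output_len: int = 16

        @staticmethod
        def add_cli_args(parser: argparse.ArgumentParser):
            parser.add_argument("--num-prompts", type=int, default=BenchArgs.num_prompts)
            parser.add_argument("--output-len", type=int, default=BenchArgs.output_len)

        @classmethod
        def from_cli_args(cls, args: argparse.Namespace):
            return cls(num_prompts=args.num_prompts, output_len=args.output_len)


    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        ServerArgs.add_cli_args(parser)
        BenchArgs.add_cli_args(parser)
        args = parser.parse_args()
        server_args = ServerArgs.from_cli_args(args)
        bench_args = BenchArgs.from_cli_args(args)

        # Build the offline engine directly from the parsed ServerArgs.
        engine = sgl.Engine(**dataclasses.asdict(server_args))

        prompts = ["Hello, my name is"] * bench_args.num_prompts  # placeholder inputs
        sampling_params = {"max_new_tokens": bench_args.output_len, "ignore_eos": True}

        tic = time.perf_counter()
        engine.generate(prompts, sampling_params)
        latency = time.perf_counter() - tic
        print(f"Request throughput: {bench_args.num_prompts / latency:.2f} req/s")

        engine.shutdown()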

@zolinthecow (Contributor Author)

will do

@zolinthecow (Contributor Author) commented Nov 11, 2024

@merrymercy updated the script, what do you think?

@merrymercy (Contributor) left a comment

This looks better! Sorry for the back-and-forth, but I think we should support loading real datasets and better random dataset generation, similar to what bench_serving.py does.

bench_latency.py uses a simple way to generate synthetic data because it does not support continuous batching or inputs with variable lengths. The engine supports everything, so we can use more realistic data.

    def prepare_synthetic_inputs_for_throughput_test(
        batch_size: int, input_len: int, output_len: int
    ):
        input_ids = [[1] * input_len for _ in range(batch_size)]

Contributor review comment:

Instead of generating synthetic data, can we reuse the data generation from bench_serving.py?

Specifically, support the following arguments. You can import the code from bench_serving.py:

    parser.add_argument(
        "--dataset-name",
        type=str,
        default="sharegpt",
        choices=["sharegpt", "random", "generated-shared-prefix"],
        help="Name of the dataset to benchmark on.",
    )
    parser.add_argument(
        "--dataset-path", type=str, default="", help="Path to the dataset."
    )
    parser.add_argument(
        "--num-prompts",
        type=int,
        default=1000,
        help="Number of prompts to process. Default is 1000.",
    )
    parser.add_argument(
        "--sharegpt-output-len",
        type=int,
        default=None,
        help="Output length for each request. Overrides the output length from the ShareGPT dataset.",
    )
    parser.add_argument(
        "--random-input-len",
        type=int,
        help="Number of input tokens per request, used only for random dataset.",
    )
    parser.add_argument(
        "--random-output-len",
        type=int,
        help="Number of output tokens per request, used only for random dataset.",
    )
    parser.add_argument(
        "--random-range-ratio",
        type=float,
        default=0.0,
        help="Range of sampled ratio of input/output length, "
        "used only for random dataset.",
    )
    parser.add_argument(
        "--request-rate",
        type=float,
        default=float("inf"),
        help="Number of requests per second. If this is inf, then all the requests are sent at time 0. "
        "Otherwise, we use Poisson process to synthesize the request arrival times. Default is inf.",
    )
    parser.add_argument("--seed", type=int, default=1, help="The random seed.")

Then, we can deprecate these arguments

    batch_size: Tuple[int] = (1,)
    input_len: Tuple[int] = (1024,)
    output_len: Tuple[int] = (16,)
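
A sketch of wiring those arguments to the samplers in bench_serving.py is below. The function names come from the current bench_serving.py, but treat the exact keyword arguments as an assumption; the generated-shared-prefix case is omitted for brevity.

    # Dispatch on --dataset-name using samplers imported from bench_serving.py.
    from sglang.bench_serving import sample_random_requests, sample_sharegpt_requests


    def sample_requests(args, tokenizer):
        if args.dataset_name == "sharegpt":
            return sample_sharegpt_requests(
                dataset_path=args.dataset_path,
                num_requests=args.num_prompts,
                tokenizer=tokenizer,
                fixed_output_len=args.sharegpt_output_len,
            )
        elif args.dataset_name == "random":
            return sample_random_requests(
                input_len=args.random_input_len,
                output_len=args.random_output_len,
                num_prompts=args.num_prompts,
                range_ratio=args.random_range_ratio,
                tokenizer=tokenizer,
                dataset_path=args.dataset_path,
            )
        else:
            raise ValueError(f"Unsupported dataset: {args.dataset_name}")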

Comment on lines 28 to 36
    batch_size: Tuple[int] = (1,)
    input_len: Tuple[int] = (1024,)
    output_len: Tuple[int] = (16,)
    result_filename: str = ""
    # Plotting args
    graph_sql: str = (
        "select run_name, batch_size, prefill_throughput from results where run_name='before'"
    )
    graph_filename: str = "out.png"

Contributor review comment:

With my comment above, we can probably remove all of these arguments.
Remove graph_sql because it is not used.

    result_list.append(ret)

    if bench_args.result_filename:
        with jsonlines.open(bench_args.result_filename, "a") as f:

Contributor review comment:

Use fout.write(json.dumps(value) + "\n") to remove the additional dependency on jsonlines.
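
A minimal sketch of that change applied to the quoted lines, writing each result as one JSON line:

    import json

    result_list.append(ret)

    if bench_args.result_filename:
        with open(bench_args.result_filename, "a") as fout:
            # One JSON object per line; no extra dependency needed.
            fout.write(json.dumps(ret) + "\n")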

@merrymercy (Contributor)

Also, please add a unit test here https://github.com/sgl-project/sglang/blob/main/test/srt/test_srt_engine.py to run this benchmark for 10 random prompts.
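
Something along these lines could go into test_srt_engine.py. The module name, entry-point function, result shape, and model path are assumptions about how the new script will be structured.

    import unittest

    # Hypothetical entry point and args of the new benchmark script.
    from sglang.bench_offline_throughput import BenchArgs, throughput_test
    from sglang.srt.server_args import ServerArgs


    class TestOfflineThroughput(unittest.TestCase):
        def test_throughput_with_10_random_prompts(self):
            server_args = ServerArgs(model_path="meta-llama/Llama-3.2-1B-Instruct")
            bench_args = BenchArgs(dataset_name="random", num_prompts=10)
            result = throughput_test(server_args=server_args, bench_args=bench_args)
            self.assertIsNotNone(result)


    if __name__ == "__main__":
        unittest.main()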

@zolinthecow (Contributor Author)

@merrymercy made the changes

@ByronHsu (Collaborator)

Can we have an option for a runtime backend? Then we could easily benchmark runtime.generate against engine.generate. Related to #1872.
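
One way to expose that choice in the new script (the flag name and wiring are only a suggestion, not settled API):

    parser.add_argument(
        "--backend",
        type=str,
        default="engine",
        choices=["engine", "runtime"],
        help="Whether to benchmark sglang.Engine.generate or sglang.Runtime.generate.",
    )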
