Offline LLM Engine Benchmark Throughput #1968
base: main
Conversation
@ByronHsu do you want me to add this to CI?
This PR tries to reuse the script bench_serving.py. However, I think a better approach is to use a standalone script.

Reason:

- bench_serving.py is for online serving, but the most common use case of the Engine API is offline. We should benchmark the non-async version of the Engine to see the maximum throughput we can get without all of the streaming/asyncio overhead.
- We want to pass in many other arguments to the server. The new script should be similar to bench_latency.py, which takes the full ServerArgs as arguments.

sglang/python/sglang/bench_latency.py
Lines 536 to 541 in 760552e

parser = argparse.ArgumentParser()
ServerArgs.add_cli_args(parser)
BenchArgs.add_cli_args(parser)
args = parser.parse_args()
server_args = ServerArgs.from_cli_args(args)
bench_args = BenchArgs.from_cli_args(args)

Can you try to write a standalone script bench_offline_throughput.py that takes the same arguments as bench_latency.py?
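For reference, a minimal sketch of how such a standalone script could wire up its arguments, following the bench_latency.py pattern quoted above; the BenchArgs fields and the commented-out throughput_test() entry point are placeholders for illustration, not the actual code in this PR:

```python
import argparse
import dataclasses

from sglang.srt.server_args import ServerArgs


@dataclasses.dataclass
class BenchArgs:
    # Placeholder benchmark-specific arguments.
    num_prompts: int = 1000
    result_filename: str = ""

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser):
        parser.add_argument("--num-prompts", type=int, default=BenchArgs.num_prompts)
        parser.add_argument("--result-filename", type=str, default=BenchArgs.result_filename)

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace):
        return cls(**{f.name: getattr(args, f.name) for f in dataclasses.fields(cls)})


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    ServerArgs.add_cli_args(parser)  # full engine/server arguments (model path, tp size, ...)
    BenchArgs.add_cli_args(parser)   # benchmark-specific arguments
    args = parser.parse_args()
    server_args = ServerArgs.from_cli_args(args)
    bench_args = BenchArgs.from_cli_args(args)
    # throughput_test(server_args, bench_args)  # hypothetical benchmark entry point
```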
will do
@merrymercy updated the script, what do you think?
This looks better! Sorry for the back-and-forth, but I think we should support loading real datasets and better random dataset generation, similar to bench_serving.py.

bench_latency.py uses a simple way to generate synthetic data because it does not support continuous batching or inputs with variable lengths. The engine supports everything, so we can use more realistic data.
def prepare_synthetic_inputs_for_throughput_test(
    batch_size: int, input_len: int, output_len: int
):
    input_ids = [[1] * input_len for _ in range(batch_size)]
Instead of generating synthetic data, can we reuse the data generation from bench_serving.py?

Specifically, support the following arguments. You can import the code from bench_serving.py.

sglang/python/sglang/bench_serving.py
Lines 1189 to 1198 in 59a5ba9
parser.add_argument(
    "--dataset-name",
    type=str,
    default="sharegpt",
    choices=["sharegpt", "random", "generated-shared-prefix"],
    help="Name of the dataset to benchmark on.",
)
parser.add_argument(
    "--dataset-path", type=str, default="", help="Path to the dataset."
)
sglang/python/sglang/bench_serving.py
Lines 1209 to 1245 in 59a5ba9
parser.add_argument(
    "--num-prompts",
    type=int,
    default=1000,
    help="Number of prompts to process. Default is 1000.",
)
parser.add_argument(
    "--sharegpt-output-len",
    type=int,
    default=None,
    help="Output length for each request. Overrides the output length from the ShareGPT dataset.",
)
parser.add_argument(
    "--random-input-len",
    type=int,
    help="Number of input tokens per request, used only for random dataset.",
)
parser.add_argument(
    "--random-output-len",
    type=int,
    help="Number of output tokens per request, used only for random dataset.",
)
parser.add_argument(
    "--random-range-ratio",
    type=float,
    default=0.0,
    help="Range of sampled ratio of input/output length, "
    "used only for random dataset.",
)
parser.add_argument(
    "--request-rate",
    type=float,
    default=float("inf"),
    help="Number of requests per second. If this is inf, then all the requests are sent at time 0. "
    "Otherwise, we use Poisson process to synthesize the request arrival times. Default is inf.",
)
parser.add_argument("--seed", type=int, default=1, help="The random seed.")
Then, we can deprecate these arguments:
batch_size: Tuple[int] = (1,)
input_len: Tuple[int] = (1024,)
output_len: Tuple[int] = (16,)
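For illustration, the replacement arguments could be grouped into a dataclass that mirrors the bench_serving.py flags quoted above; the field names and defaults below are assumptions, not the final BenchArgs in this PR:

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class BenchArgs:
    # Hypothetical dataset-driven benchmark arguments (names mirror the flags above).
    dataset_name: str = "sharegpt"  # "sharegpt", "random", or "generated-shared-prefix"
    dataset_path: str = ""
    num_prompts: int = 1000
    sharegpt_output_len: Optional[int] = None
    random_input_len: int = 1024
    random_output_len: int = 16
    random_range_ratio: float = 0.0
    seed: int = 1
    result_filename: str = ""
```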
batch_size: Tuple[int] = (1,)
input_len: Tuple[int] = (1024,)
output_len: Tuple[int] = (16,)
result_filename: str = ""
# Plotting args
graph_sql: str = (
    "select run_name, batch_size, prefill_throughput from results where run_name='before'"
)
graph_filename: str = "out.png"
With my comment above, we can probably remove all of these arguments. Remove graph_sql because it is not used.
result_list.append(ret)

if bench_args.result_filename:
    with jsonlines.open(bench_args.result_filename, "a") as f:
Use fout.write(json.dumps(value) + "\n") to remove the additional dependency on jsonlines.
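A minimal sketch of that replacement, assuming the result_list and bench_args variables from the quoted code above (names are illustrative):

```python
import json

# Append one JSON object per line instead of using the jsonlines package.
if bench_args.result_filename:
    with open(bench_args.result_filename, "a") as fout:
        for value in result_list:
            fout.write(json.dumps(value) + "\n")
```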
Also, please add a unit test here: https://github.com/sgl-project/sglang/blob/main/test/srt/test_srt_engine.py to run this benchmark for 10 random prompts.
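For illustration only, such a test might look roughly like the sketch below; the bench_offline_throughput module, its throughput_test() function, the BenchArgs fields, the result key, and the model path are all assumptions rather than the code added in this PR:

```python
import unittest

from sglang.srt.server_args import ServerArgs


class TestOfflineThroughput(unittest.TestCase):
    def test_throughput_with_random_prompts(self):
        # Hypothetical import: the benchmark script and its entry point are assumed names.
        from sglang.bench_offline_throughput import BenchArgs, throughput_test

        server_args = ServerArgs(model_path="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model
        bench_args = BenchArgs(dataset_name="random", num_prompts=10)
        metrics = throughput_test(server_args=server_args, bench_args=bench_args)
        self.assertGreater(metrics["total_throughput"], 0)  # "total_throughput" is an assumed key


if __name__ == "__main__":
    unittest.main()
```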
@merrymercy made the changes
Can we have an option for
Motivation
#1865
Add a throughput benchmark for engine.generate.
Modifications
Added the ability to specify an engine instead of an API URL in the benchmarks.
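As a rough sketch of the offline path this benchmark exercises (the Engine constructor arguments, sampling-parameter keys, and output fields below are assumptions about the sglang Engine API, not code from this PR):

```python
import sglang as sgl

# Build the offline engine directly instead of pointing the benchmark at an API URL.
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # model path is illustrative

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "max_new_tokens": 32}

# Synchronous, non-streaming generation: the code path whose throughput is measured.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()
```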
Checklist