Concurrent load generation option is implemented #71
base: main
Conversation
I'm having trouble with this PR; it seems to hang after the first request. Tried with:

```sh
guidellm --target http://localhost:8000/v1 \
  --model ibm-granite/granite-3.1-2b-instruct \
  --data-type emulated --data "prompt_tokens=128,generated_tokens=128" \
  --rate-type concurrent --rate 256 --max-seconds 20
```
@parfeniukink I'm not following how this will satisfy the goal: given a request or time limit, keep a consistent number of concurrent requests open at the targeted number. Currently, it looks like it will max out at max_concurrency from settings plus the rate passed in, due to the inner and outer loops.

I think a simple implementation here would be a new function, _run_concurrent, in the scheduler that keeps the number of open requests equal to the rate passed in, and that raises an exception if that rate is greater than the settings' max concurrency.

A more complicated implementation would be to either pass max concurrency from the load generator, or make the load generator aware of how many requests have completed so a new iteration can start once earlier ones finish.
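A minimal sketch of the simpler option, assuming a hypothetical `_submit_request` coroutine for dispatching a single request (`settings.max_concurrency` and `self.generator` are as in the diff below):

```python
import asyncio

async def _run_concurrent(self, rate: int) -> None:
    # Sketch only: keep exactly `rate` requests in flight until the
    # generator is exhausted (the time/request limits end the generator).
    if rate > settings.max_concurrency:
        raise ValueError(
            f"rate ({rate}) exceeds max_concurrency "
            f"({settings.max_concurrency})"
        )

    pending: set[asyncio.Task] = set()
    for request in self.generator:
        # Wait for a free slot so the in-flight count never exceeds `rate`.
        while len(pending) >= rate:
            _done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
        pending.add(asyncio.create_task(self._submit_request(request)))

    # Drain whatever is still in flight.
    if pending:
        await asyncio.wait(pending)
```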
```python
if _res:
    benchmark.request_completed(_res)
    logger.debug("Request completed: {}", _res)
if self.mode == "consistent":
```
Given the longer if/else here and the mode-specific handling logic, I think it would make sense to break this out into its own function for handling the fixed concurrent request values, rather than merging it here, if we go this route.
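For illustration, one possible shape of that split (the handler names are hypothetical):

```python
# Hypothetical dispatch: keep the "consistent" handling out of the shared loop.
if self.mode == "consistent":
    await self._run_consistent(benchmark)
else:
    await self._run_timed(benchmark)
```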
"the concurrent execution" | ||
) | ||
for index, request in enumerate(self.generator): | ||
while (index + 1 - completed) >= settings.max_concurrency: |
Shouldn't we ignore the max_concurrency setting in this case, or raise an error if the max concurrent requests setting is smaller than the rate we need to run at?

Additionally, our check here should be on the rate: we should only allow a new request once we're below our rate, or once we've passed our time/request-count restrictions, right?
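In other words, something like the following gate (a sketch; `index`, `completed`, and `self.rate` are as in the surrounding diff):

```python
# Sketch: throttle on the requested rate rather than the global setting.
while (index + 1 - completed) >= int(self.rate):
    # Yield control until an in-flight request finishes and bumps `completed`.
    await asyncio.sleep(0)
```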
```python
# Create multiple concurrent tasks
tasks: list[asyncio.Task] = []
for _ in range(int(self.rate)):
```
I'm not following how this is going to keep the concurrent request count fixed. Due to the outer loop, this will always just max out around max_concurrency. I say "around" because the inner for loop lets it go above max_concurrency by appending N=rate new requests before max_concurrency is checked again.
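One way to avoid that overshoot would be to top up only the deficit on each outer pass (a sketch; `_submit_request` and `request_iter` are hypothetical names):

```python
# Sketch: create only enough tasks to reach the target rate, so the
# in-flight count never exceeds it.
in_flight = sum(1 for t in tasks if not t.done())
for _ in range(max(int(self.rate) - in_flight, 0)):
    tasks.append(asyncio.create_task(self._submit_request(next(request_iter))))
```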
```python
modes_map: Dict[str, LoadGenerationMode] = {
    "constant": "constant",
    "poisson": "poisson",
    "concurrent": "consistent",
```
Any rationale for remapping this to "consistent" rather than keeping it as "concurrent" throughout the entire codebase?
Summary

`--rate-type concurrent` is implemented as an additional Profile Generation mode. If `--rate-type` is `concurrent`, you must set `--rate`, which corresponds to the number of concurrent workers that will be started to simulate multiple users working with the system.

Detailed

Added a `consistent` type of Load Generation, which differs from the rest of the generation modes in that `consistent` does not use times to simulate delays. All the requests go one by one within a single asynchronous task.

Example of usage
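Mirroring the command from the conversation above (the endpoint and model are placeholders):

```sh
guidellm --target http://localhost:8000/v1 \
  --model ibm-granite/granite-3.1-2b-instruct \
  --data-type emulated --data "prompt_tokens=128,generated_tokens=128" \
  --rate-type concurrent --rate 256 --max-seconds 20
```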