
Concurrent load generation option is implemented #71

Open
wants to merge 3 commits into main

Conversation

parfeniukink (Contributor)

Summary

  • --rate-type concurrent is implemented as an additional Profile Generation mode.
    • If --rate-type is concurrent, you must set --rate to the number of concurrent workers that will be started to simulate multiple users working with the system.

Detailed

  • Internally this adds the consistent type of Load Generation, which differs from the other generation modes in that it does not use timings to simulate delays: within each worker, requests run one after another in a single asynchronous task (see the sketch below).
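
A minimal sketch of the worker pattern described above, assuming a hypothetical backend.submit coroutine (this is an illustration, not the PR's actual code):

import asyncio

async def _worker(backend, request_iter):
    # Workers share one iterator, so the pool walks the request stream
    # together; each worker issues its requests strictly back to back,
    # with no simulated delays between them.
    for request in request_iter:
        await backend.submit(request)  # assumed request-submission coroutine

async def run_concurrent(backend, requests, rate: int):
    # One asynchronous task per requested worker; concurrency comes only
    # from the number of workers, never from timing.
    request_iter = iter(requests)
    await asyncio.gather(*(_worker(backend, request_iter) for _ in range(rate)))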

Example of usage

$ guidellm --target http://localhost:8080/v1 --model Phi-3-mini-4k-instruct-q4.gguf --data 'prompt_tokens=128,generated_tokens=128' --data-type emulated --tokenizer "hf-internal-testing/llama-tokenizer" --max-requests 2 --rate-type concurrent --rate 2

╭─ Benchmarks ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [10:43:43]   100% consistent   (0.13 req/sec avg)                                                                                                                                                             │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
  Generating report... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ (1/1) [ 0:00:29 < 0:00:00 ]
╭─ GuideLLM Benchmarks Report (stdout) ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ╭─ Benchmark Report 1 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
│ │ Backend(type=openai_server, target=http://localhost:8080/v1, model=Phi-3-mini-4k-instruct-q4.gguf)                                                                                                        │ │
│ │ Data(type=emulated, source=prompt_tokens=128,generated_tokens=128, tokenizer=hf-internal-testing/llama-tokenizer)                                                                                         │ │
│ │ Rate(type=concurrent, rate=(2.0,))                                                                                                                                                                        │ │
│ │ Limits(max_number=2 requests, max_duration=120 sec)                                                                                                                                                       │ │
│ │                                                                                                                                                                                                           │ │
│ │                                                                                                                                                                                                           │ │
│ │ Requests Data by Benchmark                                                                                                                                                                                │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓                                                                                                   │ │
│ │ ┃ Benchmark                 ┃ Requests Completed ┃ Request Failed ┃ Duration  ┃ Start Time ┃ End Time ┃                                                                                                   │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩                                                                                                   │ │
│ │ │ [email protected] req/sec │ 4/4                │ 0/4            │ 29.74 sec │ 10:43:43   │ 10:44:12 │                                                                                                   │ │
│ │ └───────────────────────────┴────────────────────┴────────────────┴───────────┴────────────┴──────────┘                                                                                                   │ │
│ │                                                                                                                                                                                                           │ │
│ │ Tokens Data by Benchmark                                                                                                                                                                                  │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓                                                                                   │ │
│ │ ┃ Benchmark                 ┃ Prompt ┃ Prompt (1%, 5%, 50%, 95%, 99%)    ┃ Output ┃ Output (1%, 5%, 50%, 95%, 99%)    ┃                                                                                   │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩                                                                                   │ │
│ │ │ [email protected] req/sec │ 128.00 │ 128.0, 128.0, 128.0, 128.0, 128.0 │ 128.00 │ 128.0, 128.0, 128.0, 128.0, 128.0 │                                                                                   │ │
│ │ └───────────────────────────┴────────┴───────────────────────────────────┴────────┴───────────────────────────────────┘                                                                                   │ │
│ │                                                                                                                                                                                                           │ │
│ │ Performance Stats by Benchmark                                                                                                                                                                            │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │ │
│ │ ┃                           ┃ Request Latency [1%, 5%, 10%, 50%, 90%, 95%, 99%]      ┃ Time to First Token [1%, 5%, 10%, 50%, 90%, 95%, 99%]   ┃ Inter Token Latency [1%, 5%, 10%, 50%, 90% 95%, 99%]   ┃ │ │
│ │ ┃ Benchmark                 ┃ (sec)                                                  ┃ (ms)                                                    ┃ (ms)                                                   ┃ │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ │
│ │ │ [email protected] req/sec │ 7.91, 8.79, 9.89, 18.70, 27.52, 28.62, 29.51           │ 1146.2, 2053.7, 3188.0, 12159.8, 20962.7, 22060.9,      │ 48.2, 49.1, 49.7, 51.3, 54.8, 57.3, 66.9               │ │ │
│ │ │                           │                                                        │ 22939.4                                                 │                                                        │ │ │
│ │ └───────────────────────────┴────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────┘ │ │
│ │                                                                                                                                                                                                           │ │
│ │ Performance Summary by Benchmark                                                                                                                                                                          │ │
│ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓                                                               │ │
│ │ ┃ Benchmark                 ┃ Requests per Second ┃ Request Latency ┃ Time to First Token ┃ Inter Token Latency ┃ Output Token Throughput ┃                                                               │ │
│ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩                                                               │ │
│ │ │ [email protected] req/sec │ 0.13 req/sec        │ 18.71 sec       │ 12099.47 ms         │ 52.00 ms            │ 17.22 tokens/sec        │                                                               │ │
│ │ └───────────────────────────┴─────────────────────┴─────────────────┴─────────────────────┴─────────────────────┴─────────────────────────┘                                                               │ │
│ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

parfeniukink self-assigned this Feb 12, 2025
sjmonson (Collaborator)

I'm having trouble with this PR; it seems to hang after the first request. Tried with:

guidellm --target http://localhost:8000/v1 \
         --model ibm-granite/granite-3.1-2b-instruct \
         --data-type emulated --data "prompt_tokens=128,generated_tokens=128" \
         --rate-type concurrent --rate 256 --max-seconds 20

markurtz (Member) left a comment


@parfeniukink I'm not following how this will satisfy the goal: given a request or time limit, keep a consistent number of concurrent requests open at the targeted number. Currently, it looks like it will max out at max_concurrency from settings plus the rate passed in, due to the inner and outer loops.

I think a simple implementation here would be a new _run_concurrent function that keeps the number of open requests within the scheduler equal to the rate passed in, and raises an exception if that rate is greater than the settings' max concurrency.

A more complicated implementation would be to either pass max concurrency down from the load generator, or make the load generator aware of how many requests have completed and only enable a new iteration once earlier ones have finished.
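
As a hedged sketch of that simpler suggestion (settings, self.generator, and self._submit are stand-ins for the real scheduler internals, not this PR's code):

import asyncio

async def _run_concurrent(self, benchmark, rate: int):
    if rate > settings.max_concurrency:
        raise ValueError(
            f"--rate {rate} exceeds settings.max_concurrency "
            f"({settings.max_concurrency})"
        )
    pending: set = set()
    for request in self.generator:
        pending.add(asyncio.create_task(self._submit(benchmark, request)))
        # Once `rate` requests are open, block until one finishes, so the
        # in-flight count stays pinned at the requested rate.
        if len(pending) >= rate:
            _done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
    if pending:
        await asyncio.wait(pending)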

if _res:
    benchmark.request_completed(_res)
    logger.debug("Request completed: {}", _res)
if self.mode == "consistent":
markurtz (Member)

Given the longer if/else here and the mode-specific handling logic, I think it would make sense to break this out into its own function that handles the fixed concurrent-request values, rather than merging it in here, if we go this route.

"the concurrent execution"
)
for index, request in enumerate(self.generator):
while (index + 1 - completed) >= settings.max_concurrency:
markurtz (Member)

Shouldn't we ignore the max_concurrency setting in this case, or raise an error if max_concurrency is smaller than the rate we need to run at?

Additionally, our check should be on the rate here, and we would only allow a new request if we're below our rate and haven't passed our time / request count restrictions, right?


# Create multiple concurrent tasks
tasks: list[asyncio.Task] = []
for _ in range(int(self.rate)):
markurtz (Member)

I'm not following how this is going to keep the concurrent request count fixed. Due to the outer loop, this would always just max out around max concurrency. I say "around" because the inner for loop lets it go above max concurrency by appending N=rate new requests before checking max concurrency again.
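
For illustration, a semaphore is one way to pin the in-flight count: acquire a slot before spawning each task and release it only when the request finishes, so the count can never exceed the rate (a sketch under assumed names, not a patch to this PR):

import asyncio

async def run_fixed_concurrency(submit, requests, rate: int):
    # `submit` is an assumed coroutine that issues a single request.
    slots = asyncio.Semaphore(rate)

    async def _one(request):
        try:
            await submit(request)
        finally:
            slots.release()

    tasks = []
    for request in requests:
        await slots.acquire()  # waits until fewer than `rate` are in flight
        tasks.append(asyncio.create_task(_one(request)))
    await asyncio.gather(*tasks)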

modes_map: Dict[str, LoadGenerationMode] = {
    "constant": "constant",
    "poisson": "poisson",
    "concurrent": "consistent",
markurtz (Member)

Any rationale for remapping this to consistent rather than keeping it as concurrent throughout the entire codebase?
