Concurrent load generation option is implemented #71
base: main
Conversation
I'm having trouble with this PR; it seems to hang after the first request. Tried with:

```sh
guidellm --target http://localhost:8000/v1 \
  --model ibm-granite/granite-3.1-2b-instruct \
  --data-type emulated --data "prompt_tokens=128,generated_tokens=128" \
  --rate-type concurrent --rate 256 --max-seconds 20
```
@parfeniukink I'm not following how this will satisfy the goal: given a request or time limit, keep a consistent number of concurrent requests open at the targeted number. Currently, it looks like it will max out at max_concurrency from settings plus the rate passed in, due to the inner and outer loops.

I think a simple implementation here would be a new function, _run_concurrent, in the scheduler that keeps the number of open requests equal to the rate passed in, and that raises an exception if that rate is greater than the settings' max concurrency.

A more complicated implementation would be to either pass max concurrency from the load generator, or make the load generator aware of how many requests have completed so a new iteration can start once earlier ones finish.
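A minimal sketch of the simpler option, assuming a hypothetical `_submit_request` coroutine for dispatching a single request (`settings.max_concurrency` and `self.generator` are as in the diff below):

```python
import asyncio

async def _run_concurrent(self, rate: int) -> None:
    # Sketch only: keep exactly `rate` requests in flight until the
    # generator is exhausted (the time/request limits end the generator).
    if rate > settings.max_concurrency:
        raise ValueError(
            f"rate ({rate}) exceeds max_concurrency "
            f"({settings.max_concurrency})"
        )

    pending: set[asyncio.Task] = set()
    for request in self.generator:
        # Wait for a free slot so the in-flight count never exceeds `rate`.
        while len(pending) >= rate:
            _done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
        pending.add(asyncio.create_task(self._submit_request(request)))

    # Drain whatever is still in flight.
    if pending:
        await asyncio.wait(pending)
```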
```python
if _res:
    benchmark.request_completed(_res)
    logger.debug("Request completed: {}", _res)
if self.mode == "consistent":
```
Given the longer if/else here and the mode-specific handling logic, I think it would make sense to break this out into its own function for handling the fixed concurrent request values, rather than merging it here, if we go this route.
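For illustration, one possible shape of that split (the handler names are hypothetical):

```python
# Hypothetical dispatch: keep the "consistent" handling out of the shared loop.
if self.mode == "consistent":
    await self._run_consistent(benchmark)
else:
    await self._run_timed(benchmark)
```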
"the concurrent execution" | ||
) | ||
for index, request in enumerate(self.generator): | ||
while (index + 1 - completed) >= settings.max_concurrency: |
Shouldn't we ignore the max_concurrency setting in this case, or raise an error if the max concurrent requests setting is smaller than the rate we need to run at?

Additionally, our check here should be on the rate: we should only allow a new request once we're below our rate, or once we've passed our time/request-count restrictions, right?
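In other words, something like the following gate (a sketch; `index`, `completed`, and `self.rate` are as in the surrounding diff):

```python
# Sketch: throttle on the requested rate rather than the global setting.
while (index + 1 - completed) >= int(self.rate):
    # Yield control until an in-flight request finishes and bumps `completed`.
    await asyncio.sleep(0)
```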
```python
# Create multiple concurrent tasks
tasks: list[asyncio.Task] = []
for _ in range(int(self.rate)):
```
I'm not following how this is going to keep the concurrent request count fixed. Due to the outer loop, this will always just max out around max_concurrency. I say "around" because the inner for loop lets it go above max_concurrency by appending N=rate new requests before max_concurrency is checked again.
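One way to avoid that overshoot would be to top up only the deficit on each outer pass (a sketch; `_submit_request` and `request_iter` are hypothetical names):

```python
# Sketch: create only enough tasks to reach the target rate, so the
# in-flight count never exceeds it.
in_flight = sum(1 for t in tasks if not t.done())
for _ in range(max(int(self.rate) - in_flight, 0)):
    tasks.append(asyncio.create_task(self._submit_request(next(request_iter))))
```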
```python
modes_map: Dict[str, LoadGenerationMode] = {
    "constant": "constant",
    "poisson": "poisson",
    "concurrent": "consistent",
```
Any rationale for remapping this to "consistent" rather than keeping it as "concurrent" throughout the entire codebase?
Summary

`--rate-type concurrent` is implemented as an additional Profile Generation mode. If `--rate-type` is `concurrent`, you must set `--rate`, which corresponds to the number of concurrent workers that will be started to simulate multiple users working with the system.

Detailed

Added a `consistent` type of Load Generation, which differs from the rest of the generation modes in that `consistent` does not use times to simulate delays. All the requests go one by one within a single asynchronous task.

Example of usage
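Mirroring the command from the conversation above (the endpoint and model are placeholders):

```sh
guidellm --target http://localhost:8000/v1 \
  --model ibm-granite/granite-3.1-2b-instruct \
  --data-type emulated --data "prompt_tokens=128,generated_tokens=128" \
  --rate-type concurrent --rate 256 --max-seconds 20
```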