feat(facade): bound IoLoopV2 dispatch_q_ quota to prevent starvation #7234

Open
glevkovich wants to merge 1 commit into main from glevkovich/dispatch_q_starvation_prevention

Conversation

@glevkovich (Contributor)

Previously, IoLoopV2 drained dispatch_q_ with an unbounded while loop. Under a PubSub flood, this trapped the fiber in the control path, starving pipelined commands (e.g. GET/SET) and causing client timeouts.

Key changes:

  • Bounded dispatch: process at most FLAGS_async_dispatch_quota messages per iteration; if the quota is hit, fall through to the data path so pipeline commands get a turn. Mirrors V1's async_dispatch_quota / prefer_pipeline_execution mechanism in AsyncFiber.
  • Deferred flush: the quota-hit path falls through to ParseLoop, which reaches the idle-await flush, coalescing PubSub and command replies into a single sendmsg syscall.
  • Batched backpressure: pubsub_ec.notifyAll() is now called once per quota chunk instead of once per message.
  • Testing: parameterized test_pubsub_pipeline_starvation for both V1 and V2 to prevent regressions.
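The bounded-dispatch behavior described above can be sketched as a toy model. All names and structure here are illustrative stand-ins, not the real Dragonfly implementation:

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Toy model of the quota-bounded dispatch loop described above.
struct ToyIoLoop {
  std::deque<int> dispatch_q;      // control-path (PubSub) messages
  std::deque<int> pipeline_q;      // pipelined commands from the socket
  std::vector<std::string> trace;  // records what each iteration did

  // One iteration: flush/read at the top, drain at most `quota` control
  // messages, then either loop again (continue) or serve the data path.
  void RunOnce(uint32_t quota) {
    trace.push_back("flush+read");  // idle-await flush + pending-input read
    uint32_t dispatched = 0;
    while (!dispatch_q.empty() && dispatched < quota) {
      dispatch_q.pop_front();  // real code dispatches the message here
      ++dispatched;
    }
    bool quota_reached = dispatched == quota && !dispatch_q.empty();
    if (!quota_reached) {
      trace.push_back("continue");  // back to the top: flush + fresh read
      return;
    }
    // Quota hit: fall through so pipelined commands get a turn.
    if (!pipeline_q.empty()) {
      pipeline_q.pop_front();
      trace.push_back("data-path");
    }
  }
};
```

Under a simulated flood (many queued control messages, small quota), each iteration still reaches the data path; once the queue drains naturally, the loop goes back to the top instead.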

Copilot AI review requested due to automatic review settings April 28, 2026 10:47
@augmentcode (Bot) commented Apr 28, 2026

🤖 Augment PR Summary

Summary: This PR prevents PubSub/control-path floods from starving pipelined command execution in IoLoopV2.

Changes:

  • Introduce a per-iteration dispatch quota (FLAGS_async_dispatch_quota) when draining dispatch_q_
  • When the quota is reached, fall through to the data path to parse/execute pipelined commands
  • Batch PubSub backpressure notifications (pubsub_ec.notifyAll()) once per processed chunk
  • Adjust existing tests to run against both V1 and V2 I/O loops where applicable
  • Add a V2 regression test ensuring conditional flush does not stall replies on fragmented pipelines


@augmentcode (Bot) left a comment

Review completed. 3 suggestions posted.

Comment thread src/facade/dragonfly_connection.cc Outdated
Comment thread tests/dragonfly/connection_test.py Outdated
Comment thread tests/dragonfly/connection_test.py Outdated
Copilot AI (Contributor) left a comment

Pull request overview

Adds fairness to the V2 connection I/O loop by bounding how many control-path (dispatch queue) messages are processed per iteration, preventing PubSub floods from starving pipelined command execution; expands Python integration tests to run key cases against both IoLoop V1 and V2.

Changes:

  • Bound IoLoopV2 dispatch-queue draining using FLAGS_async_dispatch_quota, falling through to the data path when the quota is hit.
  • Update/parameterize existing connection tests to run with experimental_io_loop_v2 enabled/disabled.
  • Add a V2-focused regression test to ensure conditional flushing does not stall replies.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
tests/dragonfly/connection_test.py | Parameterizes tests across V1/V2, adjusts reply-count expectations, and adds a V2 conditional-flush regression test.
src/facade/dragonfly_connection.cc | Implements quota-bounded dispatch queue draining in IoLoopV2 to prevent pipeline starvation under heavy control-path load.

Comment thread tests/dragonfly/connection_test.py Outdated
Comment thread src/facade/dragonfly_connection.cc
Previously, IoLoopV2 drained dispatch_q_ with an unbounded while loop.
Under a PubSub flood, this trapped the fiber in the control path,
starving pipelined commands (GET/SET) and causing client timeouts.

Key changes:
- Bounded dispatch: process at most FLAGS_async_dispatch_quota messages
  per iteration; if the quota is hit, fall through to the data path so
  pipeline commands get a turn. Mirrors V1's async_dispatch_quota /
  prefer_pipeline_execution mechanism in AsyncFiber.
- Deferred flush: the quota-hit path falls through to ParseLoop, which
  reaches the idle-await flush, coalescing PubSub and command replies
  into a single sendmsg syscall.
- Batched backpressure: pubsub_ec.notifyAll() is now called once per
  quota chunk instead of once per message.
- Testing: parameterized test_pubsub_pipeline_starvation for both V1
  and V2 to prevent regressions.

Signed-off-by: Gil Levkovich <69595609+glevkovich@users.noreply.github.com>
@glevkovich force-pushed the glevkovich/dispatch_q_starvation_prevention branch from 229e939 to 5a48ddb on April 28, 2026 at 11:24
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

Comment thread src/facade/dragonfly_connection.cc
// at the top of the loop, allowing PubSub and command replies to be coalesced into
// one sendmsg syscall.
if (!quota_reached) {
  continue;
Collaborator commented:

I do not understand the flow here.
If the quota was not reached, why go back to the top instead of falling through? I see the continue existed before, but I still do not understand what it does.

@glevkovich (Contributor, Author) commented Apr 29, 2026

The continue is not just an optimization; it is needed for correctness and to keep latency low. I'll explain:

Reason 1:
When we are done processing the dispatch_q here, it took time, in some cases even a long time, and during that time we might have received new client messages waiting in the socket (e.g. a GET/SET command). Since we haven't called ReadPendingInput (worst-case scenario), the io_buf_ is totally empty. If we just fall through to the data path, it checks io_buf_.InputLen() and thinks there is no data from the client, so it skips parsing entirely (it makes a decision based on stale information).

When we jump to the top, this is the only place in the hot path where we flush and read (ignoring the special-case flushes further down). So we read and pull in more data:

if (pending_input_) {
  ReadPendingInput();
}

Now, when the loop eventually reaches the data path, it is working with fresh, up-to-date network data.

Reason 2:
When we process PubSub messages, their replies accumulate in reply_builder_. Flush() only happens at the top of the loop (in the idle-await block). If we fall through, we traverse the entire data path section doing nothing useful, then loop back to the top to flush. The continue skips that dead code and reaches the flush immediately - one fewer loop iteration, lower latency.

In short: We use continue to guarantee low latency and fresh socket reads when the queue is naturally empty. We only use the fall-through as an emergency case (when quota_reached is true) to force the data path to run so it doesn't get starved by a never-ending flood of PubSub messages.
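The two exits described above can be modeled as a minimal decision function; the names (dispatch_q_, reply_builder_, io_buf_, ReadPendingInput) follow the discussion, but the function itself is purely illustrative:

```cpp
#include <string>

// Minimal model of the two exits after draining dispatch_q_.
std::string AfterDispatch(bool quota_reached) {
  if (!quota_reached) {
    // Normal case: the queue drained naturally. `continue` returns to the
    // top of the loop, where the idle-await block flushes reply_builder_
    // and ReadPendingInput() refreshes io_buf_, so a later data-path pass
    // sees up-to-date socket data.
    return "continue: flush replies, read fresh input";
  }
  // Emergency case: quota hit mid-flood. Run the data path now so pipelined
  // commands are not starved, even though io_buf_ may be stale.
  return "fall through: run data path";
}
```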

@romange (Collaborator) commented Apr 30, 2026

Ok, then the comment above focuses only on the "quota_reached" path, but it's not clear why continue is needed in the first place. I would add an additional comment around the "continue" to explain why it's needed. Maybe the code structured differently would provide a more natural flow, but I might be wrong too; a comment will at least close the gap.

// - This mirrors V1's async_dispatch_quota / prefer_pipeline_execution mechanism in AsyncFiber.
if (!dispatch_q_.empty()) {
  uint32_t dispatched{};
  bool quota_reached = false;
Collaborator commented:
Another suggestion: consider extracting this inner while loop into a helper function, e.g.:
bool quota_reached = ProcessControlCommands(async_dispatch_quota);
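A sketch of what the suggested extraction might look like; the class, signature, and queue element type are hypothetical stand-ins, not the real connection code:

```cpp
#include <cstdint>
#include <deque>

// Hypothetical sketch: the inner while loop becomes a helper whose return
// value reports whether the quota was hit.
class ToyConnection {
 public:
  std::deque<int> dispatch_q_;

  // Drains up to `quota` control messages. Returns true when the quota was
  // exhausted with messages still queued, i.e. the caller should fall
  // through to the data path.
  bool ProcessControlCommands(uint32_t quota) {
    uint32_t dispatched = 0;
    while (!dispatch_q_.empty() && dispatched < quota) {
      dispatch_q_.pop_front();  // real code would dispatch the message
      ++dispatched;
    }
    return dispatched == quota && !dispatch_q_.empty();
  }
};
```

The call site would then collapse to the one-liner the reviewer proposes, keeping the quota bookkeeping out of the main loop body.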

3 participants