
Conversation

@harshit-anyscale harshit-anyscale commented Dec 15, 2025

Summary

This PR is part 1 of 3 for adding queue-based autoscaling support for Ray Serve TaskConsumer deployments.

Background

TaskConsumers are workloads that consume tasks from message queues (Redis, RabbitMQ), and their scaling needs are fundamentally different from HTTP-based deployments. Instead of scaling based on HTTP request load, TaskConsumers should scale based on the number of pending tasks in the message queue.

Overall Architecture (Full Feature)

  ┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
  │  Message Queue  │◄─────│  QueueMonitor    │      │ ServeController │
  │  (Redis/RMQ)    │      │  Actor           │◄─────│ Autoscaler      │
  └─────────────────┘      └──────────────────┘      └─────────────────┘
                                   │                         │
                                   │ get_queue_length()      │
                                   └─────────────────────────┘
                                             │
                                             ▼
                                ┌───────────────────────────┐
                                │ queue_based_autoscaling   │
                                │ _policy()                 │
                                │ desired = ceil(len/target)│
                                └───────────────────────────┘
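The policy box in the diagram computes the replica count as `desired = ceil(len/target)`. A minimal sketch of that formula in plain Python (function and parameter names are hypothetical; the actual policy lands in PR 2):

```python
import math

def desired_replicas(
    queue_length: int,
    target_tasks_per_replica: int,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    # desired = ceil(len / target), clamped to the deployment's replica bounds.
    desired = math.ceil(queue_length / target_tasks_per_replica)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 25 pending tasks with a target of 10 tasks per replica yields 3 replicas.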

The full implementation consists of three PRs:

| PR | Description | Status |
|----------------|-----------------------------------------------------|------------|
| PR 1 (This PR) | QueueMonitor actor for querying broker queue length | 🔄 Current |
| PR 2 | Introduce default Queue-based autoscaling policy | Upcoming |
| PR 3 | Integration with TaskConsumer deployments | Upcoming |

This PR: QueueMonitor Actor

This PR introduces the QueueMonitor Ray actor that queries message brokers to get queue length for autoscaling decisions.

Key Features

  • Multi-broker support: Redis and RabbitMQ
  • Lightweight Ray actor: Runs with num_cpus=0, with pika and redis provided via the runtime env
  • Fault tolerance: Caches last known queue length on query failures
  • Named actor pattern: QUEUE_MONITOR::<deployment_name> for easy lookup
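The fault-tolerance bullet can be illustrated with a broker-agnostic sketch (class and method names are hypothetical, not the PR's actual code): on a failed broker query, the monitor serves the last value it saw instead of raising.

```python
class QueueLengthCache:
    """Illustrative only: fall back to the last known queue length
    when the broker query fails, so autoscaling keeps a usable signal."""

    def __init__(self) -> None:
        self._last_known = 0

    def get_queue_length(self, query_fn) -> int:
        try:
            self._last_known = query_fn()
        except Exception:
            pass  # broker unreachable: keep serving the cached value
        return self._last_known
```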

Queue Length Calculation

For accurate autoscaling, QueueMonitor returns total workload (pending tasks):

| Broker   | Pending Tasks    |
|----------|------------------|
| Redis    | `LLEN <queue>`   |
| RabbitMQ | `messages_ready` |
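The pending-only semantics in the table can be sketched as a small mapping (field names follow the table; the `stats` dict shape is an assumption for illustration). The key point is that RabbitMQ's `messages_ready` excludes in-flight deliveries, matching what Redis `LLEN` counts:

```python
def pending_tasks(broker: str, stats: dict) -> int:
    # Hypothetical helper: normalize each broker's stats to "pending tasks only".
    if broker == "redis":
        return stats["llen"]  # LLEN counts only queued (pending) entries
    if broker == "rabbitmq":
        # messages_ready, not messages: excludes unacknowledged (in-flight) work
        return stats["messages_ready"]
    raise ValueError(f"unsupported broker: {broker}")
```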

Components

  1. QueueMonitorConfig - Configuration dataclass with broker URL and queue name
  2. QueueMonitor - Core class that initializes broker connections and queries queue length
  3. QueueMonitorActor - Ray actor wrapper for remote access
  4. Helper functions:
    - create_queue_monitor_actor() - Create named actor
    - get_queue_monitor_actor() - Lookup existing actor
    - delete_queue_monitor_actor() - Cleanup on deployment deletion
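The named-actor pattern behind these helpers reduces to a deterministic name per deployment; a sketch (helper name hypothetical, prefix from the Key Features section):

```python
QUEUE_MONITOR_PREFIX = "QUEUE_MONITOR::"

def queue_monitor_actor_name(deployment_name: str) -> str:
    # Deterministic per-deployment name, so the create/get/delete
    # helpers can all resolve the same actor by name.
    return f"{QUEUE_MONITOR_PREFIX}{deployment_name}"
```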

Test Plan

  • Unit tests for QueueMonitorConfig (7 tests)
    • Broker type detection (Redis, RabbitMQ, SQS, unknown)
    • Config value storage
  • Unit tests for QueueMonitor (4 tests)
    • Redis queue length retrieval (pending)
    • RabbitMQ queue length retrieval
    • Error handling with cached value fallback

@harshit-anyscale harshit-anyscale self-assigned this Dec 15, 2025
@harshit-anyscale harshit-anyscale changed the title add queue monitor [1/n] queue-based autoscaling - add queue monitor Dec 15, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a QueueMonitor actor for monitoring queue lengths in Redis and RabbitMQ, which is a valuable addition for asynchronous task processing. The implementation is generally well-structured and includes unit tests. However, I've identified a significant performance concern with the RabbitMQ connection handling that should be addressed. Additionally, there's a minor inconsistency in broker type detection and an opportunity to improve test coverage for the new actor helper functions. My detailed feedback is in the comments below.

@harshit-anyscale harshit-anyscale changed the title [1/n] queue-based autoscaling - add queue monitor [1/3] queue-based autoscaling - add queue monitor Dec 15, 2025
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Dec 15, 2025
@harshit-anyscale harshit-anyscale force-pushed the queue-based-autoscaling-part-1 branch from 14a17b6 to a982d4b Compare December 15, 2025 11:47
@harshit-anyscale harshit-anyscale marked this pull request as ready for review December 15, 2025 11:47
@harshit-anyscale harshit-anyscale requested a review from a team as a code owner December 15, 2025 12:19
@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Dec 15, 2025
Signed-off-by: harshit <[email protected]>
@harshit-anyscale harshit-anyscale force-pushed the queue-based-autoscaling-part-1 branch from ce6a041 to ad81408 Compare December 17, 2025 18:04
Collaborator

@aslonnie aslonnie left a comment


hmm.. does not need my review any more?

seems that import pika is still in there though?

@harshit-anyscale
Contributor Author

harshit-anyscale commented Dec 19, 2025

> hmm.. does not need my review any more?
>
> seems that import pika is still in there though?

nope, import pika is now resolved as well.

```python
if queues is not None:
    for q in queues:
        if q.get("name") == self._queue_name:
            queue_length = q.get("messages")
```

RabbitMQ uses wrong field causing inconsistent queue metrics

Medium Severity

The code retrieves q.get("messages") from RabbitMQ, but the PR description explicitly specifies that messages_ready should be used. In RabbitMQ, messages equals messages_ready + messages_unacknowledged (in-flight), while messages_ready counts only pending tasks. This creates inconsistent behavior: Redis llen returns only pending messages (equivalent to messages_ready), but RabbitMQ returns pending plus in-progress messages. For autoscaling, this could cause RabbitMQ deployments to scale differently than Redis deployments with the same actual workload.
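A minimal sketch of the fix this comment suggests, keeping the loop's shape (function name and queue-dict shape are illustrative, modeled on the RabbitMQ management API's queue listing):

```python
def pending_for_queue(queues, queue_name):
    # Use messages_ready (pending only) rather than messages
    # (pending + unacknowledged) so the metric matches Redis LLEN.
    if queues is None:
        return None
    for q in queues:
        if q.get("name") == queue_name:
            return q.get("messages_ready")
    return None
```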


Contributor

@abrarsheikh abrarsheikh left a comment


lgtm. Left some nits

@abrarsheikh abrarsheikh merged commit f580a27 into master Jan 8, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the queue-based-autoscaling-part-1 branch January 8, 2026 20:07
elliot-barn pushed a commit that referenced this pull request Jan 11, 2026
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026