
Conversation

@harshit-anyscale (Contributor) commented Dec 18, 2025

Summary

This PR adds a new combined_workload_autoscaling_policy function that enables Ray Serve deployments to scale based on the combined workload from both message queue depth and HTTP requests. It is the second part of the queue-based autoscaling feature for TaskConsumer deployments.

Related PRs:

  • PR 1 (Prerequisite): #59430 - QueueMonitor Actor
  • PR 3 (Follow-up): Integration with TaskConsumer

Changes

New Components

  • combined_workload_autoscaling_policy() (autoscaling_policy.py) - Main policy function that scales based on the combined workload (queue + HTTP requests)
  • _apply_scaling_decision_smoothing() (autoscaling_policy.py) - Reusable helper for delay-based smoothing
  • get_queue_monitor_actor_name() (queue_monitor.py) - Helper to generate consistent actor names
  • DEFAULT_COMBINED_WORKLOAD_AUTOSCALING_POLICY (constants.py) - Import-path constant for the policy

Files Modified

  • python/ray/serve/autoscaling_policy.py - Added new policy and helper functions
  • python/ray/serve/_private/constants.py - Added policy constant
  • python/ray/serve/_private/queue_monitor.py - Added get_queue_monitor_actor_name() helper

Files Added

  • python/ray/serve/tests/unit/test_queue_autoscaling_policy.py - 22 unit tests

How It Works

  ┌─────────────────────────────────────────────────────────────────┐
  │              combined_workload_autoscaling_policy               │
  ├─────────────────────────────────────────────────────────────────┤
  │                                                                 │
  │  1. Get QueueMonitor Actor                                      │
  │     ├── ray.get_actor("QUEUE_MONITOR::")                        │
  │     └── If unavailable, maintain current replicas               │
  │                                                                 │
  │  2. Query Queue Length                                          │
  │     └── ray.get(actor.get_queue_length.remote())                │
  │                                                                 │
  │  3. Calculate Total Workload                                    │
  │     └── total_workload = queue_length + total_num_requests      │
  │                                                                 │
  │  4. Calculate Desired Replicas                                  │
  │     ├── Formula: ceil(total_workload / target_ongoing_requests) │
  │     ├── Scale from zero: immediate scale up if workload > 0     │
  │     └── Clamp to [min_replicas, max_replicas]                   │
  │                                                                 │
  │  5. Apply Smoothing                                             │
  │     ├── Scale up: wait upscale_delay_s                          │
  │     ├── Scale down: wait downscale_delay_s                      │
  │     └── Scale to zero: wait downscale_to_zero_delay_s           │
  │                                                                 │
  └─────────────────────────────────────────────────────────────────┘
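
As a rough illustration of steps 1-2, here is a minimal sketch of the actor lookup and queue query with a graceful fallback when the monitor is unavailable. The helper name, parameter, and fallback convention are assumptions for illustration; in the PR the actor name comes from get_queue_monitor_actor_name() and this logic lives inside the policy itself.

import logging
from typing import Optional

import ray

logger = logging.getLogger("ray.serve")


def _query_queue_length(queue_monitor_actor_name: str) -> Optional[int]:
    """Illustrative sketch of steps 1-2 above; not the PR's exact code."""
    try:
        # Actor name uses the "QUEUE_MONITOR::" prefix shown in the diagram.
        actor = ray.get_actor(queue_monitor_actor_name)
        return ray.get(actor.get_queue_length.remote(), timeout=5.0)
    except Exception:
        # QueueMonitor unavailable or slow to respond: signal the caller to
        # keep the current replica count rather than scale on missing data.
        logger.warning("QueueMonitor unavailable; keeping current replicas.")
        return None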

Scaling Formula

total_workload = queue_length + total_num_requests
desired_replicas = ceil(total_workload / target_ongoing_requests)

Example:

  • Queue has 100 pending tasks
  • HTTP requests: 50 ongoing requests
  • Total workload = 100 + 50 = 150
  • target_ongoing_requests = 10 (each replica handles 10 concurrent tasks)
  • Desired replicas = ceil(150 / 10) = 15 replicas
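
A minimal, self-contained sketch of this calculation follows. Parameter names mirror the fields referenced in the diagram; the PR's actual policy additionally applies the smoothing delays from step 5, so its signature differs.

import math


def compute_desired_replicas(
    queue_length: int,
    total_num_requests: int,
    target_ongoing_requests: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    total_workload = queue_length + total_num_requests
    if total_workload == 0:
        # Nothing queued and no requests: scale down as far as allowed.
        return min_replicas
    # Any positive workload yields at least one replica (scale from zero).
    desired = math.ceil(total_workload / target_ongoing_requests)
    # Clamp to the configured bounds.
    return max(min_replicas, min(desired, max_replicas))


# Worked example above: 100 queued tasks + 50 HTTP requests, 10 per replica -> 15.
assert compute_desired_replicas(100, 50, 10, min_replicas=1, max_replicas=100) == 15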

🤖 Generated with https://claude.com/claude-code

@harshit-anyscale self-assigned this Dec 18, 2025
@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new queue-based autoscaling policy, which is a great addition for TaskConsumer deployments. The implementation is well-structured, with a dedicated QueueMonitor actor and comprehensive unit tests. I've identified a critical bug in the Redis connection handling and a high-severity logic issue in the scaling-to-zero implementation. Addressing these will ensure the new feature is robust and behaves as expected.

@harshit-anyscale added the 'go' label (add ONLY when ready to merge, run all tests) Dec 19, 2025
@github-actions bot commented Jan 2, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions bot added the 'stale' label Jan 2, 2026
@harshit-anyscale force-pushed the queue-based-autoscaling-part-2 branch from 86223e0 to 26d4bd7 on January 9, 2026 07:05
Signed-off-by: harshit <[email protected]>
@harshit-anyscale removed the 'stale' label Jan 9, 2026
Signed-off-by: harshit <[email protected]>
@harshit-anyscale force-pushed the queue-based-autoscaling-part-2 branch from 29f8b0b to 0c7cf30 on January 12, 2026 18:08
@harshit-anyscale marked this pull request as ready for review January 12, 2026 19:07
@harshit-anyscale requested a review from a team as a code owner January 12, 2026 19:07
@ray-gardener bot added the 'serve' label (Ray Serve Related Issue) Jan 13, 2026
@abrarsheikh (Contributor) left a comment

I would base this PR on top of #58857.

Comment on lines +428 to +430
DEFAULT_COMBINED_WORKLOAD_AUTOSCALING_POLICY = (
"ray.serve.autoscaling_policy:default_combined_workload_autoscaling_policy"
)
Contributor

Can be named better; combined_workload is an overloaded term.

Contributor Author

@abrarsheikh how about DEFAULT_QUEUE_AWARE_AUTOSCALING_POLICY? It seems okay to me and not that overloaded. Let me know your view on it and I will amend the code accordingly.

Contributor

I would choose default_async_inference_autoscaling_policy.

default_combined_workload_autoscaling_policy: to the reader, it is not clear which workloads we are combining.
DEFAULT_QUEUE_AWARE_AUTOSCALING_POLICY: which queue? The other autoscaling policy is also queue-aware, but the queue there refers to ongoing requests.

@@ -1,12 +1,15 @@
import logging
import math
Contributor

Many of the changes in this file overlap with the changes in PR #58857. I suggest reviewing the other PR and making sure it's compatible with what we need to do here.

Contributor Author

ack, will review the other PR

Comment on lines +185 to +187
queue_length = ray.get(
queue_monitor_actor.get_queue_length.remote(), timeout=5.0
)
Contributor

This would make one RPC call on every controller iteration. What is the impact of this on the scalability of the controller? Do we need to query the queue length in every iteration?
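
One way to bound that cost, shown here only as an illustration of the trade-off being raised and not as part of this PR, is to reuse the last queue-length reading for a short TTL so the controller skips the RPC on most iterations. The TTL value and names below are hypothetical.

import time

import ray

_last_reading = {"value": 0, "ts": 0.0}
_QUEUE_LENGTH_TTL_S = 10.0  # illustrative value


def cached_queue_length(queue_monitor_actor) -> int:
    """Refresh the cached queue length only after the TTL has elapsed."""
    now = time.monotonic()
    if now - _last_reading["ts"] >= _QUEUE_LENGTH_TTL_S:
        _last_reading["value"] = ray.get(
            queue_monitor_actor.get_queue_length.remote(), timeout=5.0
        )
        _last_reading["ts"] = now
    return _last_reading["value"]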



@pytest.fixture
def queue_monitor_mock():
Contributor

I would avoid such heavy mocking. Instead, figure out a way to write the tests using dependency injection; study how the existing tests are written in test_deployment_state.
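
Purely as an illustration of that suggestion (not the PR's current test structure), the policy logic could accept a queue-length provider so unit tests pass a plain stub instead of patching Ray actors:

import math
from typing import Callable


def scale_decision(
    total_num_requests: int,
    target_ongoing_requests: int,
    get_queue_length: Callable[[], int],
) -> int:
    # The queue source is injected, so tests never have to mock ray.get_actor.
    total_workload = get_queue_length() + total_num_requests
    if total_workload == 0:
        return 0
    return math.ceil(total_workload / target_ongoing_requests)


def test_scale_decision_counts_queue_and_http_requests():
    # A lambda stands in for the QueueMonitor actor.
    assert scale_decision(50, 10, get_queue_length=lambda: 100) == 15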

@harshit-anyscale (Contributor Author)

"I would base this PR on top of #58857."

I am not sure of the timeline we are targeting for the #58857 PR, but since we want to get the queue-based autoscaling feature out ASAP, I thought of merging this PR as it is.

Then, once the #58857 PR is merged, I will create a new one that uses the changes from #58857 to refactor the queue-aware autoscaling policy.

@abrarsheikh let me know your thoughts on it.

