add queue length based autoscaling #59351
Conversation
Signed-off-by: harshit <[email protected]>
Code Review
This pull request introduces queue-based autoscaling for TaskConsumer deployments in Ray Serve. A new QueueMonitor component, implemented as a Ray actor, is added to directly query message broker queue lengths (supporting Redis and RabbitMQ). The task_consumer decorator is updated to mark deployments as TaskConsumer and provide a QueueMonitorConfig for their associated queue.

During application deployment, ApplicationState now identifies TaskConsumer deployments with default autoscaling policies, creates a QueueMonitor actor for them, and switches their policy to the new queue_based_autoscaling_policy. This policy calculates desired replicas based on queue depth (ceil(queue_length / target_ongoing_requests)), respects min_replicas and max_replicas, handles scaling from zero, and incorporates the existing smoothing logic for scaling decisions. The AutoscalingState is updated to clean up QueueMonitor actors when deployments are deregistered.

Unit and integration tests were added to validate the QueueMonitor functionality, the new autoscaling policy, and its interaction with TaskConsumer deployments, including scenarios for scaling up, respecting replica bounds, and actor recovery.

Review comments suggest improving exception handling by catching more specific exceptions or logging full tracebacks, and updating the docstring for _apply_scaling_decision_smoothing to accurately reflect its parameters. Additionally, it was noted that relying on __del__ for resource cleanup in QueueMonitor is discouraged, recommending explicit shutdown() methods or context managers instead.
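A minimal sketch of the explicit-cleanup pattern the last point recommends (the class shape, broker_client, and close() are illustrative assumptions, not the PR's actual QueueMonitor API):

```python
class QueueMonitor:
    """Sketch: release broker connections via an explicit shutdown() or a
    context manager rather than relying on __del__."""

    def __init__(self, broker_client):
        self._broker_client = broker_client
        self._closed = False

    def shutdown(self) -> None:
        # Idempotent explicit cleanup the controller can call when the
        # deployment is deleted.
        if not self._closed:
            self._broker_client.close()
            self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.shutdown()
        return False
```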
Signed-off-by: harshit <[email protected]>
Signed-off-by: harshit <[email protected]>
…ray into add-queue-based-autoscaling
abrarsheikh left a comment
I suggest breaking this PR down into smaller chunks; it's OK to keep it as is if you want to iterate on the implementation.
def _is_task_consumer_deployment(deployment_info: DeploymentInfo) -> bool:
    """Check if a deployment is a TaskConsumer."""
    try:
        deployment_def = deployment_info.replica_config.deployment_def
This access is unsafe because it deserializes user code in the controller, and the controller does not have the user's runtime env to execute that code.
For example, if the user code depends on PyTorch and torch is listed as a pip dependency in the deployment's runtime_env, this call will fail in cluster mode.
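For illustration, one hedged way to avoid this (not from the PR): check a plain flag stamped on the deployment's already-deserialized config instead of touching deployment_def. The attribute name and access path below are hypothetical.

```python
def _is_task_consumer_deployment(deployment_info) -> bool:
    # Read a metadata flag that the @task_consumer decorator could set at
    # deploy time, instead of deserializing user code in the controller.
    # `is_task_consumer` is a hypothetical attribute, not the PR's actual field.
    return bool(getattr(deployment_info.deployment_config, "is_task_consumer", False))
```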
# Configure queue-based autoscaling for TaskConsumers
_configure_queue_based_autoscaling_for_task_consumers(deployment_infos)
This only takes effect in imperative mode. If the user deploys through a config, this code path is not executed.
except Exception as e:
    logger.error(
        f"Failed to create QueueMonitor actor for '{deployment_name}': {e}"
    )
    continue
Why fail silently? Should we fail the deployment here?
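For illustration, a minimal sketch of the fail-loud alternative (the surrounding loop and the create_actor callable are assumptions, not the PR's actual code; whether to hard-fail the deployment is the open question above):

```python
import logging

logger = logging.getLogger(__name__)


def create_queue_monitors(queue_monitor_configs: dict, create_actor) -> None:
    """Sketch: create one QueueMonitor per deployment, failing loudly on error."""
    for deployment_name, monitor_config in queue_monitor_configs.items():
        try:
            create_actor(deployment_name, monitor_config)
        except Exception:
            # logger.exception records the full traceback, not just str(e).
            logger.exception(
                f"Failed to create QueueMonitor actor for '{deployment_name}'"
            )
            # One option: surface the failure instead of `continue`, so the
            # deployment does not silently run without queue-based autoscaling.
            raise
```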
actor = QueueMonitorActor.options(
    name=full_actor_name,
    namespace=namespace,
    lifetime="detached",
This needs a comment explaining why a detached actor was chosen.
Since we recreate the actor when the controller goes down, a detached lifetime is no longer required; removed it in #59430.
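A minimal sketch of the non-detached variant discussed here (constructor arguments are assumptions; get_if_exists is a standard Ray option for named actors, but its use here is illustrative):

```python
# Without lifetime="detached", the actor is fate-shared with the controller,
# which recreates it after a restart; get_if_exists=True reattaches to the
# named actor if it still exists instead of raising on a name collision.
actor = QueueMonitorActor.options(
    name=full_actor_name,
    namespace=namespace,
    get_if_exists=True,
).remote(queue_monitor_config)  # constructor args are illustrative
```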
| return "unknown" | ||
|
|
||
|
|
||
| class QueueMonitor: |
Why did we not go with Flower here? Doesn't that abstract away all of this logic?
Yes, it does abstract away all of this logic, but with Flower we would have to make a remote HTTP call; Flower would intercept it, read it, and then make the call to the broker, and all of that combined is more time-consuming than calling the broker directly. Apart from that, we would have to do port management to run N instances of Flower inside the Serve controller. It is doable, but given these details, writing plain-vanilla code to get the queue length seems more efficient and simpler.
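For reference, a minimal sketch of the kind of direct broker queries this refers to (Redis via LLEN, RabbitMQ via a passive queue declaration, as in the PR description below); connection setup and queue names are illustrative:

```python
import pika
import redis


def redis_queue_length(url: str, queue_name: str) -> int:
    # The Redis broker keeps pending tasks in a list, so LLEN returns the
    # number of queued messages.
    client = redis.Redis.from_url(url)
    try:
        return int(client.llen(queue_name))
    finally:
        client.close()


def rabbitmq_queue_length(url: str, queue_name: str) -> int:
    # A passive queue declaration returns the existing queue's message count
    # without creating or modifying the queue.
    connection = pika.BlockingConnection(pika.URLParameters(url))
    try:
        channel = connection.channel()
        result = channel.queue_declare(queue=queue_name, passive=True)
        return result.method.message_count
    finally:
        connection.close()
```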
Creating new small PRs for this bigger change.
This PR adds queue-based autoscaling for Ray Serve TaskConsumer deployments. TaskConsumers are workloads that consume tasks from message queues (Redis, RabbitMQ), and their scaling needs are fundamentally different from HTTP-based deployments.
Instead of scaling based on HTTP request load, TaskConsumers should scale based on the number of pending tasks in the message queue.
Key Features
- queue_based_autoscaling_policy that scales replicas based on ceil(queue_length / target_ongoing_requests)

Architecture
Components
QueueMonitor (queue_monitor.py)
- Lightweight Ray actor that queries queue length from the message broker
- Supports Redis (using LLEN command) and RabbitMQ (using passive queue declaration)
- Named actor format: QUEUE_MONITOR::<deployment_name>
- Detached lifecycle with automatic cleanup on deployment deletion
Autoscaling policy (autoscaling_policy.py)
- New policy function: queue_based_autoscaling_policy
- Formula: desired_replicas = ceil(queue_length / target_ongoing_requests) (see the sketch after this list)
- Reuses existing smoothing/delay logic to prevent oscillation
- Stores QueueMonitor config in policy_state for actor recovery
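A minimal sketch of the core replica calculation (function name and signature are illustrative; the real policy additionally applies the smoothing/delay logic noted above):

```python
import math


def desired_replicas(
    queue_length: int,
    target_ongoing_requests: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    # ceil(queue_length / target_ongoing_requests); any non-empty queue asks
    # for at least one replica, which is what enables scaling from zero.
    raw = math.ceil(queue_length / target_ongoing_requests) if queue_length > 0 else 0
    # Clamp to the deployment's configured bounds.
    return max(min_replicas, min(raw, max_replicas))
```

For example, queue_length=25 with target_ongoing_requests=10 gives ceil(2.5) = 3 replicas, clamped to [min_replicas, max_replicas].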
ApplicationState (application_state.py)
- Detects TaskConsumer deployments via _is_task_consumer marker
- Automatically creates QueueMonitor actor and switches to queue-based policy
- Only applies when user hasn't specified a custom autoscaling policy
AutoscalingState (autoscaling_state.py)
- QueueMonitor actors are cleaned up when deployments are deleted
- Prevents actor leaks and ensures test isolation
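A minimal sketch of what that cleanup can look like, using the QUEUE_MONITOR::<deployment_name> naming convention listed above (the helper itself is hypothetical):

```python
import ray


def cleanup_queue_monitor(deployment_name: str, namespace: str) -> None:
    # Look up the deployment's QueueMonitor by its well-known name and
    # terminate it so it does not leak after the deployment is deleted.
    try:
        actor = ray.get_actor(
            f"QUEUE_MONITOR::{deployment_name}", namespace=namespace
        )
    except ValueError:
        return  # Already gone; nothing to clean up.
    ray.kill(actor)
```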
Files Changed
Test Plan
Follow-up PRs