-
Notifications
You must be signed in to change notification settings - Fork 7.2k
[Serve][3/N] Add application-level autoscaling snapshot #59995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request extends autoscaling observability to the application level by introducing ApplicationSnapshot logs. The changes are well-structured, reusing existing patterns from deployment-level snapshots, and include comprehensive tests. I've identified a couple of areas for improvement to enhance code clarity and correctness. Overall, this is a solid addition to Ray Serve's observability features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
| autoscaling_contexts=autoscaling_contexts, | ||
| decisions=results, | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing validation for incomplete app-level policy decisions
High Severity
The validation for app-level autoscaling policy decisions (lines 991-1003) checks that returned deployment IDs are valid but fails to verify that decisions include all deployments. The method docstring states it should "Decide scaling for all deployments" and the deployment-level path (lines 1031-1039) returns decisions for all deployments by iterating over _deployment_autoscaling_states, but an app-level policy can return partial decisions and pass validation. This causes inconsistent behavior where only some deployments get scaled, potentially leaving others stuck at their current replica counts indefinitely.
| # user defined policy returns a dictionary of state that is persisted between autoscaling decisions | ||
| # content of the dictionary is determined by the user defined policy but is keyed by deployment id | ||
| self._policy_state: Optional[Dict[DeploymentID, Dict]] = None | ||
| self._cached_application_snapshot: Optional[ApplicationSnapshot] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stale cached application snapshots not invalidated
Medium Severity
The cached application snapshot at self._cached_application_snapshot is not cleared when the application policy is updated via register() or when deployments are added/removed via register_deployment() or deregister_deployment(). Since get_application_snapshot() returns this cached value directly and the controller's _emit_autoscaling_snapshots() calls it every control loop, stale snapshots with incorrect policy_name, num_deployments, or replica totals can be logged to autoscaling snapshot logs, misleading observability and debugging efforts.
Description
Add application-level autoscaling snapshot support for observability.
This PR extends the existing deployment-level autoscaling snapshot feature (PR #56225) to support application-level autoscaling. When an app-level autoscaling policy is configured, the controller now emits
ApplicationSnapshotlogs containing aggregated metrics across all deployments in the application.Related issues
Related to #55833
Additional information
bash % cat /tmp/ray/session_latest/logs/serve/autoscaling_snapshot_6668.log {"asctime": "2026-01-09 13:56:19,481", "levelname": "INFO", "message": "{'snapshots': [{'snapshot_type': 'application', 'timestamp_str': '2026-01-09T04:56:19Z', 'app': 'app_snap_1767934578', 'num_deployments': 2, 'total_current_replicas': 0, 'total_target_replicas': 2, 'scaling_status': 'scaling up', 'policy_name': 'ray.serve.tests.test_controller.simple_app_policy_for_test', 'errors': []}]}", "filename": "controller.py", "lineno": 511, "process": 6668, "timestamp_ns": 1767934579481838000} {"asctime": "2026-01-09 13:56:19,999", "levelname": "INFO", "message": "{'snapshots': [{'snapshot_type': 'application', 'timestamp_str': '2026-01-09T04:56:19Z', 'app': 'app_snap_1767934578', 'num_deployments': 2, 'total_current_replicas': 2, 'total_target_replicas': 2, 'scaling_status': 'stable', 'policy_name': 'ray.serve.tests.test_controller.simple_app_policy_for_test', 'errors': []}]}", "filename": "controller.py", "lineno": 511, "process": 6668, "timestamp_ns": 1767934579999085000}