Skip to content

Conversation

@nadongjun
Copy link
Contributor

Description

Add application-level autoscaling snapshot support for observability.

This PR extends the existing deployment-level autoscaling snapshot feature (PR #56225) to support application-level autoscaling. When an app-level autoscaling policy is configured, the controller now emits ApplicationSnapshot logs containing aggregated metrics across all deployments in the application.

Related issues

Related to #55833

Additional information

bash % cat /tmp/ray/session_latest/logs/serve/autoscaling_snapshot_6668.log 
{"asctime": "2026-01-09 13:56:19,481", "levelname": "INFO", "message": "{'snapshots': [{'snapshot_type': 'application', 'timestamp_str': '2026-01-09T04:56:19Z', 'app': 'app_snap_1767934578', 'num_deployments': 2, 'total_current_replicas': 0, 'total_target_replicas': 2, 'scaling_status': 'scaling up', 'policy_name': 'ray.serve.tests.test_controller.simple_app_policy_for_test', 'errors': []}]}", "filename": "controller.py", "lineno": 511, "process": 6668, "timestamp_ns": 1767934579481838000}
{"asctime": "2026-01-09 13:56:19,999", "levelname": "INFO", "message": "{'snapshots': [{'snapshot_type': 'application', 'timestamp_str': '2026-01-09T04:56:19Z', 'app': 'app_snap_1767934578', 'num_deployments': 2, 'total_current_replicas': 2, 'total_target_replicas': 2, 'scaling_status': 'stable', 'policy_name': 'ray.serve.tests.test_controller.simple_app_policy_for_test', 'errors': []}]}", "filename": "controller.py", "lineno": 511, "process": 6668, "timestamp_ns": 1767934579999085000}

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun nadongjun requested a review from a team as a code owner January 9, 2026 05:39
@nadongjun nadongjun changed the title Serve obsv application [Serve][3/N] Add application-level autoscaling snapshot Jan 9, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request extends autoscaling observability to the application level by introducing ApplicationSnapshot logs. The changes are well-structured, reusing existing patterns from deployment-level snapshots, and include comprehensive tests. I've identified a couple of areas for improvement to enhance code clarity and correctness. Overall, this is a solid addition to Ray Serve's observability features.

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@ray-gardener ray-gardener bot added serve Ray Serve Related Issue observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling community-contribution Contributed by the community labels Jan 9, 2026
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label Jan 20, 2026
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

autoscaling_contexts=autoscaling_contexts,
decisions=results,
)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing validation for incomplete app-level policy decisions

High Severity

The validation for app-level autoscaling policy decisions (lines 991-1003) checks that returned deployment IDs are valid but fails to verify that decisions include all deployments. The method docstring states it should "Decide scaling for all deployments" and the deployment-level path (lines 1031-1039) returns decisions for all deployments by iterating over _deployment_autoscaling_states, but an app-level policy can return partial decisions and pass validation. This causes inconsistent behavior where only some deployments get scaled, potentially leaving others stuck at their current replica counts indefinitely.

Fix in Cursor Fix in Web

# user defined policy returns a dictionary of state that is persisted between autoscaling decisions
# content of the dictionary is determined by the user defined policy but is keyed by deployment id
self._policy_state: Optional[Dict[DeploymentID, Dict]] = None
self._cached_application_snapshot: Optional[ApplicationSnapshot] = None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale cached application snapshots not invalidated

Medium Severity

The cached application snapshot at self._cached_application_snapshot is not cleared when the application policy is updated via register() or when deployments are added/removed via register_deployment() or deregister_deployment(). Since get_application_snapshot() returns this cached value directly and the controller's _emit_autoscaling_snapshots() calls it every control loop, stale snapshots with incorrect policy_name, num_deployments, or replica totals can be logged to autoscaling snapshot logs, misleading observability and debugging efforts.

Additional Locations (2)

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants