-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[core] Fix crash when killing actor handle from previous session #59425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively addresses a crash that occurs when ray.kill() is used on an actor handle from a previous session. The fix is well-implemented across both the C++ and Python layers, preventing the crash and providing a clear, user-friendly error message. The addition of a dedicated test case ensures this scenario is covered going forward. I have one suggestion to improve the robustness of the error handling in the Python code.
Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: kriyanshii <[email protected]>
Previously, calling ray.kill() on an ActorHandle from a previous Ray session (after ray.shutdown() and ray.init()) would crash the Python interpreter with a C++ assertion failure. This fix: 1. Prevents the crash by only calling OnActorKilled() in C++ when the kill operation succeeds 2. Catches the error in Python and converts it to a helpful ValueError explaining that ActorHandle objects are not valid across sessions 3. Adds a test to verify the fix The error message now clearly explains: - ActorHandle objects are not valid across Ray sessions - When this typically happens (after shutdown/restart) - What the user should do (create a new actor handle) Fixes the issue where the interpreter would crash instead of providing a useful error message. Signed-off-by: kriyanshii <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Kriyanshi <[email protected]> Signed-off-by: kriyanshii <[email protected]>
…c threads (ray-project#55904) Signed-off-by: dayshah <[email protected]> Signed-off-by: kriyanshii <[email protected]>
…oject#59386) ## Description RLlib has nightly tests running currently however an exception was silently being ignored (`Object of type int64 is not JSON serializable`). This PR adds a numpy to json compatible function conversion for the results being serialized and remove the try / except to prevent an error occurring the future and not being alerted --------- Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]> Signed-off-by: kriyanshii <[email protected]>
…#59435) ## Description Some utility functions we moved do a recursive import of themselves. Instead, we need to import from the new location. Signed-off-by: kriyanshii <[email protected]>
…tResponse, and DeploymentResponseGenerator (ray-project#59363) Fixes ray-project#52654 ### Summary This PR improves the generic type annotations on Ray Serve's handle classes to enable better IDE support and type inference. It also adds a type checking test file to verify the annotations work correctly. ### Changes #### Type Annotation Improvements (`handle.py`) - **Generic return types**: Updated methods to return the generic type parameter `R` instead of `Any`: - `DeploymentResponse.result()` → `R` - `DeploymentResponse.__await__()` → `Generator[Any, None, R]` - `DeploymentResponseGenerator.__iter__()` → `Iterator[R]` - `DeploymentResponseGenerator.__next__()` → `R` - `DeploymentResponseGenerator.__aiter__()` → `AsyncIterator[R]` - `DeploymentResponseGenerator.__anext__()` → `R` - **mypy compatibility fixes**: - Added `cast()` for `Future` types in sync/async fetch methods to help mypy understand the conditional types - Aligned `_DeploymentHandleBase.options()` signature with `DeploymentHandle.options()` to fix override error - Refactored `remote()` to return directly from each branch instead of assigning different class types to a variable #### Type Checking Tests (`check_handle_typing.py`) Added a mypy test file that verifies: - Generic type `R` is preserved through `result()`, `__await__`, iteration methods - Generic type `T` is preserved through `options()` and method access - Placeholder tests for future mypy plugin (commented out) that will verify method return type inference ### How to Test ```bash mypy python/ray/serve/tests/typing_files/check_handle_typing.py python/ray/serve/handle.py \ --follow-imports=skip \ --ignore-missing-imports ``` ### Future Work A mypy plugin can be implemented to infer the return type of `.remote()` based on which deployment method is being called: ```python handle: DeploymentHandle[MyDeployment] response = handle.get_user.remote(123) # Plugin would infer DeploymentResponse[str] user = response.result() # Would be typed as str ``` The test file includes commented-out tests that should pass once the plugin is implemented. ## Tests From master ``` ❯ mypy python/ray/serve/tests/typing_files/check_handle_typing.py python/ray/serve/handle.py --follow-imports=skip --ignore-missing-imports python/ray/serve/handle.py:10: error: Module "ray._raylet" has no attribute "ObjectRefGenerator" [attr-defined] python/ray/serve/handle.py:162: error: Item "None" of "Optional[Any]" has no attribute "_run_router_in_separate_loop" [union-attr] python/ray/serve/handle.py:210: error: Item "None" of "Optional[Any]" has no attribute "assign_request" [union-attr] python/ray/serve/handle.py:229: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked] python/ray/serve/handle.py:292: error: Unexpected keyword argument "timeout" for "result" of "Future" [call-arg] /home/ubuntu/.local/lib/python3.9/site-packages/mypy/typeshed/stdlib/_asyncio.pyi: note: "result" of "Future" defined here python/ray/serve/handle.py:319: error: Incompatible types in "await" (actual type "Union[concurrent.futures._base.Future[Any], _asyncio.Future[Any]]", expected type "Awaitable[Any]") [misc] python/ray/serve/handle.py:803: error: Incompatible types in assignment (expression has type "type[DeploymentResponse]", variable has type "type[DeploymentResponseGenerator]") [assignment] python/ray/serve/tests/typing_files/check_handle_typing.py:24: error: "DeploymentResponse" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:24: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:28: error: Expression is of type "Any", not "str" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:37: error: "DeploymentResponse" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:37: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:41: error: Expression is of type "Any", not "str" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:46: error: "DeploymentResponseGenerator" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:46: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:49: error: Expression is of type "Iterator[Any]", not "Iterator[int]" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:53: error: Expression is of type "Any", not "int" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:57: error: Expression is of type "Any", not "int" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:63: error: "DeploymentResponseGenerator" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:63: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:66: error: Expression is of type "AsyncIterator[Any]", not "AsyncIterator[int]" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:70: error: Expression is of type "Any", not "int" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:74: error: Expression is of type "Any", not "int" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:88: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:88: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:97: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:104: error: "DeploymentResponse" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:104: error: "DeploymentResponseGenerator" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:115: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:115: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:119: error: Expression is of type "Any", not "DeploymentHandle" [assert-type] python/ray/serve/tests/typing_files/check_handle_typing.py:119: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:151: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:151: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:177: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:177: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:200: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:200: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:230: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:230: note: Error code "type-arg" not covered by "type: ignore" comment python/ray/serve/tests/typing_files/check_handle_typing.py:262: error: "DeploymentHandle" expects no type arguments, but 1 given [type-arg] python/ray/serve/tests/typing_files/check_handle_typing.py:262: note: Error code "type-arg" not covered by "type: ignore" comment Found 30 errors in 2 files (checked 2 source files) ``` Because of changes in this PR ``` ❯ mypy python/ray/serve/tests/typing_files/check_handle_typing.py python/ray/serve/handle.py --follow-imports=skip --ignore-missing-imports python/ray/serve/handle.py:261: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked] Success: no issues found in 2 source files ``` Signed-off-by: abrar <[email protected]> Signed-off-by: kriyanshii <[email protected]>
…y-project#59410) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: kriyanshii <[email protected]>
fixes ray-project#59218 ### Performance Delta ```python from ray import serve from typing import List @serve.deployment(max_ongoing_requests=1000) class MyDeployment: @serve.batch(max_batch_size=10, batch_wait_timeout_s=1) async def handle_batch(self, requests: List[int]) -> List[int]: return [request + 1 for request in requests] async def __call__(self) -> List[int]: return await self.handle_batch(1) app = MyDeployment.bind() ``` `ray start --head --metrics-export-port=8080` -> `serve run batch_test:app` locust 100 users Metric | With Change | Master | Δ (Master – With Change) -- | -- | -- | -- Requests | 32,033 | 33,541 | +1,508 Fails | 0 | 0 | 0 Median (ms) | 170 | 170 | 0 95%ile (ms) | 240 | 240 | 0 99%ile (ms) | 280 | 270 | –10 ms Average (ms) | 172.98 | 171.87 | –1.11 ms Min (ms) | 70 | 84 | +14 ms Max (ms) | 352 | 365 | +13 ms Average size (bytes) | 1 | 1 | 0 Current RPS | 581.9 | 604.1 | +22.2 Current Failures/s | 0 | 0 | 0 --------- Signed-off-by: abrar <[email protected]> Signed-off-by: kriyanshii <[email protected]>
…#59396) updating default python version to py3.10 --------- Signed-off-by: elliot-barn <[email protected]> Signed-off-by: kriyanshii <[email protected]>
…ter_batches (ray-project#58467) ## Description This PR will: - Batch block resolution in `resolve_block_refs()` so `iter_batches()` issues one `ray.get()` per chunk of block refs instead of per ref. The chunk size is configurable using new `DataContext.iter_get_block_batch_size` knob. - Added a test that proves that `resolve_block_refs()` actually batches the `ray.get()` calls. ## Related issues Raised by @amogkam in `python/ray/data/_internal/block_batching/util.py` ## Additional information Simple benchmark available: https://gist.github.com/YoussefEssDS/40de959a42a19334b8dac8bd217c319b --------- Signed-off-by: YoussefEssDS <[email protected]> Signed-off-by: kriyanshii <[email protected]>
…ct#59402) ## Description - Streamlining `Unique` aggregation to avoid back and forth conversion, instead rebasing it onto `BlockColumnAccessor.unique` - Cleaned up corresponding `Preprocessor`s (like `OrdinalEncoder`) ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <[email protected]> Signed-off-by: kriyanshii <[email protected]>
## Description Abstracting the delta calculation in autoscaling so we can implement custom logic on it. ## Related issues ## Additional information --------- Signed-off-by: You-Cheng Lin <[email protected]> Signed-off-by: kriyanshii <[email protected]>
… replicas (ray-project#59373) Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: kriyanshii <[email protected]>
no update in the past 2.5 years, all wheels link in yaml file already stopped working. also `pip install -U` is frowned upon. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: kriyanshii <[email protected]>
… grpc calls (ray-project#59431) ## Description Adds token authentication to OpenTelemetry metric exports. The open telemetry grpc client is created internally within the Opentelemetry sdk and hence does not add the common interceptor automatically hence requiring us to manually include and pass the token header. fixes ray-project#59361 --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]> Signed-off-by: kriyanshii <[email protected]>
so that it does not include other file changes that are committed to master branch. the script will be deprecated with the next `rayci` upgrade, but just correcting this for the record and save a little bit of costs before the upgrade. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: kriyanshii <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: kriyanshii <[email protected]>
those default are not really being used any more, but put some sane value for them anyways Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: kriyanshii <[email protected]>
not used anywhere any more. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: kriyanshii <[email protected]>
unifies docs dependency set to python 3.9 all docs builds should get unified to run on python 3.10 Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: kriyanshii <[email protected]>
removing 3.9 image builds --------- Signed-off-by: elliot-barn <[email protected]> Signed-off-by: kriyanshii <[email protected]>
… using stale snapshot (ray-project#59454) ## Description 1. Start a new Ray cluster with two nodes with FT 2. Kill GCS and then restart 3. Fetch the cluster status from the autoscaler and check it still knows the two nodes The flaky issue is because the `status` gotten from `get_cluster_status(cluster.address)` is static. The autoscaler might still be working and getting correct info from the restarted GCS, but `wait_for_condition` was previously checking a static value. This PR fixes this by making `wait_for_condition` properly check by getting a fresh `cluster_status` on every check. I noticed this because it reproduces consistently in my local dev environment. ## Test ```shell (venv) (base) ubuntu@devbox:~/ray$ for i in {1..4}; do echo "=== Run $i ===" && python -m pytest python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init -v --tb=short 2>&1 | tail -5; done === Run 1 === collecting ... collected 1 item python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init[ray_start_cluster_head_with_external_redis0] PASSED [100%] ============================== 1 passed in 15.01s ============================== === Run 2 === collecting ... collected 1 item python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init[ray_start_cluster_head_with_external_redis0] PASSED [100%] ============================== 1 passed in 16.55s ============================== === Run 3 === collecting ... collected 1 item python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init[ray_start_cluster_head_with_external_redis0] PASSED [100%] ============================== 1 passed in 17.49s ============================== === Run 4 === collecting ... collected 1 item python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init[ray_start_cluster_head_with_external_redis0] PASSED [100%] ============================== 1 passed in 15.57s ============================== ``` --------- Signed-off-by: yicheng <[email protected]> Co-authored-by: yicheng <[email protected]> Signed-off-by: kriyanshii <[email protected]>
…-project#59417) Shows users how to use `download` to download from URI tables. --------- Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: kriyanshii <[email protected]>
…ray-project#59458) Reverts ray-project#59387 Signed-off-by: kriyanshii <[email protected]>
Add API stability annotation for ActorHandleNotFoundError Annotate ray.exceptions.ActorHandleNotFoundError with @DeveloperAPI Fixes ci/lint/check_api_annotations.py failure reporting missing API stability annotation
updated tests in test_ray_shutdown.py removed unnecessary comments from ray/_private/worker.py
codope
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kriyanshii Overall looks good. Can you please fix the premerge ci failures?
|
working on it. |
|
@codope done! ✅ |
edoakes
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks for the contribution @kriyanshii
There was a test failure on premerge CI, but it looks unrelated. I will re-run the CI.
|
all checks have passed with recent merge. |
…-project#59425) Fixes the issue where the interpreter would crash instead of providing a useful error message. Previously, calling ray.kill() on an ActorHandle from a previous Ray session (after ray.shutdown() and ray.init()) would crash the Python interpreter with a C++ assertion failure. This fix: 1. Prevents the crash by only calling OnActorKilled() in C++ when the kill operation succeeds 2. Catches the error in Python and converts it to a helpful ValueError explaining that ActorHandle objects are not valid across sessions 3. Adds a test to verify the fix The error message now clearly explains: - ActorHandle objects are not valid across Ray sessions - When this typically happens (after shutdown/restart) - What the user should do (create a new actor handle) Fixes ray-project#59340 --------- Signed-off-by: kriyanshii <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: jasonwrwang <[email protected]>
Fixes the issue where the interpreter would crash instead of providing a useful error message.
Description
Previously, calling ray.kill() on an ActorHandle from a previous Ray session (after ray.shutdown() and ray.init()) would crash the Python interpreter with a C++ assertion failure.
This fix:
The error message now clearly explains:
Related issues