Skip to content

Conversation

@kriyanshii
Copy link
Contributor

Fixes the issue where the interpreter would crash instead of providing a useful error message.

Description

Previously, calling ray.kill() on an ActorHandle from a previous Ray session (after ray.shutdown() and ray.init()) would crash the Python interpreter with a C++ assertion failure.

This fix:

  1. Prevents the crash by only calling OnActorKilled() in C++ when the kill operation succeeds
  2. Catches the error in Python and converts it to a helpful ValueError explaining that ActorHandle objects are not valid across sessions
  3. Adds a test to verify the fix

The error message now clearly explains:

  • ActorHandle objects are not valid across Ray sessions
  • When this typically happens (after shutdown/restart)
  • What the user should do (create a new actor handle)

Related issues

Fixes #59340

@kriyanshii kriyanshii requested a review from a team as a code owner December 14, 2025 10:00
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a crash that occurs when ray.kill() is used on an actor handle from a previous session. The fix is well-implemented across both the C++ and Python layers, preventing the crash and providing a clear, user-friendly error message. The addition of a dedicated test case ensures this scenario is covered going forward. I have one suggestion to improve the robustness of the error handling in the Python code.

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Dec 14, 2025
eicherseiji and others added 25 commits December 16, 2025 16:36
Previously, calling ray.kill() on an ActorHandle from a previous Ray
session (after ray.shutdown() and ray.init()) would crash the Python
interpreter with a C++ assertion failure.

This fix:
1. Prevents the crash by only calling OnActorKilled() in C++ when the
   kill operation succeeds
2. Catches the error in Python and converts it to a helpful ValueError
   explaining that ActorHandle objects are not valid across sessions
3. Adds a test to verify the fix

The error message now clearly explains:
- ActorHandle objects are not valid across Ray sessions
- When this typically happens (after shutdown/restart)
- What the user should do (create a new actor handle)

Fixes the issue where the interpreter would crash instead of providing
a useful error message.

Signed-off-by: kriyanshii <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kriyanshi <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
…oject#59386)

## Description
RLlib has nightly tests running currently however an exception was
silently being ignored (`Object of type int64 is not JSON
serializable`).
This PR adds a numpy to json compatible function conversion for the
results being serialized and remove the try / except to prevent an error
occurring the future and not being alerted

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
…#59435)

## Description

Some utility functions we moved do a recursive import of themselves.
Instead, we need to import from the new location.

Signed-off-by: kriyanshii <[email protected]>
…tResponse, and DeploymentResponseGenerator (ray-project#59363)

Fixes ray-project#52654

### Summary

This PR improves the generic type annotations on Ray Serve's handle
classes to enable better IDE support and type inference. It also adds a
type checking test file to verify the annotations work correctly.

### Changes

#### Type Annotation Improvements (`handle.py`)

- **Generic return types**: Updated methods to return the generic type
parameter `R` instead of `Any`:
  - `DeploymentResponse.result()` → `R`
  - `DeploymentResponse.__await__()` → `Generator[Any, None, R]`
  - `DeploymentResponseGenerator.__iter__()` → `Iterator[R]`
  - `DeploymentResponseGenerator.__next__()` → `R`
  - `DeploymentResponseGenerator.__aiter__()` → `AsyncIterator[R]`
  - `DeploymentResponseGenerator.__anext__()` → `R`

- **mypy compatibility fixes**:
- Added `cast()` for `Future` types in sync/async fetch methods to help
mypy understand the conditional types
- Aligned `_DeploymentHandleBase.options()` signature with
`DeploymentHandle.options()` to fix override error
- Refactored `remote()` to return directly from each branch instead of
assigning different class types to a variable

#### Type Checking Tests (`check_handle_typing.py`)

Added a mypy test file that verifies:
- Generic type `R` is preserved through `result()`, `__await__`,
iteration methods
- Generic type `T` is preserved through `options()` and method access
- Placeholder tests for future mypy plugin (commented out) that will
verify method return type inference

### How to Test

```bash
    mypy python/ray/serve/tests/typing_files/check_handle_typing.py python/ray/serve/handle.py  \
        --follow-imports=skip \
        --ignore-missing-imports
```

### Future Work

A mypy plugin can be implemented to infer the return type of `.remote()`
based on which deployment method is being called:

```python
handle: DeploymentHandle[MyDeployment]
response = handle.get_user.remote(123)  # Plugin would infer DeploymentResponse[str]
user = response.result()                 # Would be typed as str
```

The test file includes commented-out tests that should pass once the
plugin is implemented.

## Tests

From master
```
❯ mypy python/ray/serve/tests/typing_files/check_handle_typing.py python/ray/serve/handle.py  --follow-imports=skip --ignore-missing-imports
python/ray/serve/handle.py:10: error: Module "ray._raylet" has no attribute "ObjectRefGenerator"  [attr-defined]
python/ray/serve/handle.py:162: error: Item "None" of "Optional[Any]" has no attribute "_run_router_in_separate_loop"  [union-attr]
python/ray/serve/handle.py:210: error: Item "None" of "Optional[Any]" has no attribute "assign_request"  [union-attr]
python/ray/serve/handle.py:229: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
python/ray/serve/handle.py:292: error: Unexpected keyword argument "timeout" for "result" of "Future"  [call-arg]
/home/ubuntu/.local/lib/python3.9/site-packages/mypy/typeshed/stdlib/_asyncio.pyi: note: "result" of "Future" defined here
python/ray/serve/handle.py:319: error: Incompatible types in "await" (actual type "Union[concurrent.futures._base.Future[Any], _asyncio.Future[Any]]", expected type "Awaitable[Any]")  [misc]
python/ray/serve/handle.py:803: error: Incompatible types in assignment (expression has type "type[DeploymentResponse]", variable has type "type[DeploymentResponseGenerator]")  [assignment]
python/ray/serve/tests/typing_files/check_handle_typing.py:24: error: "DeploymentResponse" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:24: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:28: error: Expression is of type "Any", not "str"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:37: error: "DeploymentResponse" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:37: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:41: error: Expression is of type "Any", not "str"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:46: error: "DeploymentResponseGenerator" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:46: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:49: error: Expression is of type "Iterator[Any]", not "Iterator[int]"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:53: error: Expression is of type "Any", not "int"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:57: error: Expression is of type "Any", not "int"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:63: error: "DeploymentResponseGenerator" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:63: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:66: error: Expression is of type "AsyncIterator[Any]", not "AsyncIterator[int]"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:70: error: Expression is of type "Any", not "int"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:74: error: Expression is of type "Any", not "int"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:88: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:88: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:97: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:104: error: "DeploymentResponse" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:104: error: "DeploymentResponseGenerator" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:115: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:115: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:119: error: Expression is of type "Any", not "DeploymentHandle"  [assert-type]
python/ray/serve/tests/typing_files/check_handle_typing.py:119: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:151: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:151: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:177: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:177: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:200: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:200: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:230: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:230: note: Error code "type-arg" not covered by "type: ignore" comment
python/ray/serve/tests/typing_files/check_handle_typing.py:262: error: "DeploymentHandle" expects no type arguments, but 1 given  [type-arg]
python/ray/serve/tests/typing_files/check_handle_typing.py:262: note: Error code "type-arg" not covered by "type: ignore" comment
Found 30 errors in 2 files (checked 2 source files)
```

Because of changes in this PR
```
❯ mypy python/ray/serve/tests/typing_files/check_handle_typing.py python/ray/serve/handle.py  --follow-imports=skip --ignore-missing-imports
python/ray/serve/handle.py:261: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
Success: no issues found in 2 source files
```

Signed-off-by: abrar <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
fixes ray-project#59218

### Performance Delta

```python
from ray import serve
from typing import List

@serve.deployment(max_ongoing_requests=1000)
class MyDeployment:
    @serve.batch(max_batch_size=10, batch_wait_timeout_s=1)
    async def handle_batch(self, requests: List[int]) -> List[int]:
        return [request + 1 for request in requests]

    async def __call__(self) -> List[int]:
        return await self.handle_batch(1)

app = MyDeployment.bind()

```

`ray start --head --metrics-export-port=8080` -> `serve run
batch_test:app`

locust 100 users

Metric | With Change | Master | Δ (Master – With Change)
-- | -- | -- | --
Requests | 32,033 | 33,541 | +1,508
Fails | 0 | 0 | 0
Median (ms) | 170 | 170 | 0
95%ile (ms) | 240 | 240 | 0
99%ile (ms) | 280 | 270 | –10 ms
Average (ms) | 172.98 | 171.87 | –1.11 ms
Min (ms) | 70 | 84 | +14 ms
Max (ms) | 352 | 365 | +13 ms
Average size (bytes) | 1 | 1 | 0
Current RPS | 581.9 | 604.1 | +22.2
Current Failures/s | 0 | 0 | 0

---------

Signed-off-by: abrar <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
…#59396)

updating default python version to py3.10

---------

Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
…ter_batches (ray-project#58467)

## Description
This PR will:

- Batch block resolution in `resolve_block_refs()` so `iter_batches()`
issues one `ray.get()` per chunk of block refs instead of per ref. The
chunk size is configurable using new
`DataContext.iter_get_block_batch_size` knob.
- Added a test that proves that `resolve_block_refs()` actually batches
the `ray.get()` calls.

## Related issues
Raised by @amogkam in `python/ray/data/_internal/block_batching/util.py`

## Additional information
Simple benchmark available:
https://gist.github.com/YoussefEssDS/40de959a42a19334b8dac8bd217c319b

---------

Signed-off-by: YoussefEssDS <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
…ct#59402)

## Description

- Streamlining `Unique` aggregation to avoid back and forth conversion,
instead rebasing it onto `BlockColumnAccessor.unique`
 - Cleaned up corresponding `Preprocessor`s (like `OrdinalEncoder`)

## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234".

## Additional information
> Optional: Add implementation details, API changes, usage examples,
screenshots, etc.

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
## Description
Abstracting the delta calculation in autoscaling so we can implement
custom logic on it.

## Related issues

## Additional information

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
no update in the past 2.5 years, all wheels link in yaml file already
stopped working. also `pip install -U` is frowned upon.

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
… grpc calls (ray-project#59431)

## Description
Adds token authentication to OpenTelemetry metric exports. The open
telemetry grpc client is created internally within the Opentelemetry sdk
and hence does not add the common interceptor automatically hence
requiring us to manually include and pass the token header.

fixes ray-project#59361

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
so that it does not include other file changes that are committed to
master branch.

the script will be deprecated with the next `rayci` upgrade, but just
correcting this for the record and save a little bit of costs before the
upgrade.

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
those default are not really being used any more, but put some sane
value for them anyways

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
not used anywhere any more.

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
unifies docs dependency set to python 3.9

all docs builds should get unified to run on python 3.10

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
removing 3.9 image builds

---------

Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
… using stale snapshot (ray-project#59454)

## Description
1. Start a new Ray cluster with two nodes with FT
2. Kill GCS and then restart
3. Fetch the cluster status from the autoscaler and check it still knows
the two nodes

The flaky issue is because the `status` gotten from
`get_cluster_status(cluster.address)` is static. The autoscaler might
still be working and getting correct info from the restarted GCS, but
`wait_for_condition` was previously checking a static value. This PR
fixes this by making `wait_for_condition` properly check by getting a
fresh `cluster_status` on every check.

I noticed this because it reproduces consistently in my local dev
environment.

## Test
```shell
(venv) (base) ubuntu@devbox:~/ray$ for i in {1..4}; do echo "=== Run $i ===" && python -m pytest python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init -v --tb=short 2>&1 | tail -5; done
=== Run 1 ===
collecting ... collected 1 item

python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init[ray_start_cluster_head_with_external_redis0] PASSED [100%]

============================== 1 passed in 15.01s ==============================
=== Run 2 ===
collecting ... collected 1 item

python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init[ray_start_cluster_head_with_external_redis0] PASSED [100%]

============================== 1 passed in 16.55s ==============================
=== Run 3 ===
collecting ... collected 1 item

python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init[ray_start_cluster_head_with_external_redis0] PASSED [100%]

============================== 1 passed in 17.49s ==============================
=== Run 4 ===
collecting ... collected 1 item

python/ray/tests/test_gcs_fault_tolerance.py::test_autoscaler_init[ray_start_cluster_head_with_external_redis0] PASSED [100%]

============================== 1 passed in 15.57s ==============================
```

---------

Signed-off-by: yicheng <[email protected]>
Co-authored-by: yicheng <[email protected]>
Signed-off-by: kriyanshii <[email protected]>
…-project#59417)

Shows users how to use `download` to download from URI tables.

---------

Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: kriyanshii <[email protected]>
kriyanshii and others added 2 commits December 18, 2025 22:08
Add API stability annotation for ActorHandleNotFoundError

Annotate ray.exceptions.ActorHandleNotFoundError with @DeveloperAPI
Fixes ci/lint/check_api_annotations.py failure reporting missing API stability annotation
@kriyanshii kriyanshii requested a review from codope January 3, 2026 17:19
Copy link
Contributor

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kriyanshii Overall looks good. Can you please fix the premerge ci failures?

@kriyanshii
Copy link
Contributor Author

working on it.

@kriyanshii
Copy link
Contributor Author

@codope done! ✅

@kriyanshii kriyanshii requested a review from codope January 6, 2026 16:09
Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for the contribution @kriyanshii

There was a test failure on premerge CI, but it looks unrelated. I will re-run the CI.

@edoakes edoakes enabled auto-merge (squash) January 7, 2026 15:32
@github-actions github-actions bot disabled auto-merge January 8, 2026 05:18
@kriyanshii
Copy link
Contributor Author

all checks have passed with recent merge.

@edoakes edoakes merged commit 39aa361 into ray-project:master Jan 8, 2026
6 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
…-project#59425)

Fixes the issue where the interpreter would crash instead of providing a
useful error message.

Previously, calling ray.kill() on an ActorHandle from a previous Ray
session (after ray.shutdown() and ray.init()) would crash the Python
interpreter with a C++ assertion failure.

This fix:
1. Prevents the crash by only calling OnActorKilled() in C++ when the
kill operation succeeds
2. Catches the error in Python and converts it to a helpful ValueError
explaining that ActorHandle objects are not valid across sessions
3. Adds a test to verify the fix

The error message now clearly explains:
- ActorHandle objects are not valid across Ray sessions
- When this typically happens (after shutdown/restart)
- What the user should do (create a new actor handle)

Fixes ray-project#59340

---------
Signed-off-by: kriyanshii <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Signed-off-by: jasonwrwang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] ray.kill an actor from another session will crash python interpreter