@antfin-oss
This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-12-02
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits November 12, 2025 20:36
nothing is using it anymore

Signed-off-by: Lonnie Liu <[email protected]>
…58580)

Adding optional `include_setuptools` flag for depset configuration

If the flag is set on a depset config --unsafe-package setuptools will
not be included for depset compilation

If the flag does not exist (default false) on a depset config
--unsafe-package setuptools will be appended to the default arguments
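
The effect of the flag on the compile arguments can be sketched as follows. This is a minimal illustration, not the actual depset code: `build_compile_args` and `DEFAULT_ARGS` are hypothetical names, and only the `--unsafe-package setuptools` argument and the `include_setuptools` flag come from the description above.

```python
from typing import List

# Placeholder for the tool's default compile arguments (assumption; the
# real defaults are not listed in this commit message).
DEFAULT_ARGS: List[str] = ["--generate-hashes"]


def build_compile_args(include_setuptools: bool = False) -> List[str]:
    """Return compile arguments for one depset config (illustrative only)."""
    args = list(DEFAULT_ARGS)
    if not include_setuptools:
        # Default behavior: setuptools is marked as an unsafe package.
        args += ["--unsafe-package", "setuptools"]
    return args


print(build_compile_args())
print(build_compile_args(include_setuptools=True))
```
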

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
otherwise, the newer docker client will refuse to communicate with the
docker daemon that is on an older version.

Signed-off-by: Lonnie Liu <[email protected]>
…ay-project#58542)

## What does this PR do?
   
Fixes HTTP streaming file downloads in Ray Data's download operation.
Some URIs (especially HTTP streams) require `open_input_stream` instead
of `open_input_file`.
   
## Changes

- Modified `download_bytes_threaded` in `plan_download_op.py` to try
both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
- Failed downloads now return `None` gracefully instead of crashing
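
The fallback described in the list above could look roughly like the sketch below. It is not the code in `plan_download_op.py`: `read_uri_bytes` and `FakeStreamingFS` are hypothetical names, the fake filesystem only mimics the two method names mentioned above, and the sketch catches every exception rather than distinguishing error types.

```python
import io
from typing import Optional


def read_uri_bytes(fs, uri: str) -> Optional[bytes]:
    """Try a seekable file first, then a stream; return None if both fail."""
    for opener_name in ("open_input_file", "open_input_stream"):
        try:
            with getattr(fs, opener_name)(uri) as f:
                return f.read()
        except Exception:
            # This opener did not work for the URI; try the next one.
            continue
    return None


class FakeStreamingFS:
    """Hypothetical stand-in for a filesystem whose files cannot seek."""

    def open_input_file(self, uri: str):
        raise ValueError("Cannot seek streaming HTTP file")

    def open_input_stream(self, uri: str):
        if "missing" in uri:
            raise FileNotFoundError(uri)
        return io.BytesIO(b"payload")


fs = FakeStreamingFS()
print(read_uri_bytes(fs, "http://example.com/stream"))
print(read_uri_bytes(fs, "http://example.com/missing"))
```

Wrapping both the open call and the read in the same `try` matters here, because the seek error in the traceback is raised by `f.read()`, not by opening the file.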
   
## Testing
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded

# Test URL: an HTTP streaming endpoint that previously failed to download
urls = [
    "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]

# Create PyArrow table and call download function
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))

# Check results
result_table = results[0]
for i in range(result_table.num_rows):
    url = result_table['url'][i].as_py()
    bytes_data = result_table['bytes'][i].as_py()
    
    if bytes_data is None:
        print(f"Row {i}: FAILED (None) - try-catch worked ✓")
    else:
        print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
    print(f"  URL: {url[:60]}...")

print("\n✅ Test passed: Failed downloads return None instead of crashing.")
```

Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
    test_download_expression_with_streaming_fallback()
  File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
    with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
    if not self.__exit__(*sys.exc_info()):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
    setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
  File "/home/ray/default/test.py", line 16, in <module>
    results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
    uri_bytes = list(
                ^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
  URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
   
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
previously failed:
- ✅ Successfully downloads HTTP stream files
- ✅ Gracefully handles failed downloads (returns None)
- ✅ Maintains backward compatibility with existing file downloads

---------

Signed-off-by: xyuzh <[email protected]>
Signed-off-by: Robert Nishihara <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
## Description

Today we have very little observability into pubsub. On a raylet, one of
the most important pieces of state that must be propagated through the
cluster via pubsub is cluster membership. All raylets should agree on the
list of available nodes in an eventual but timely fashion. This metric
emits a simple counter that tracks the node count.

More pubsub observability to come.

---------

Signed-off-by: zac <[email protected]>
Signed-off-by: Zac Policzer <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
all tests are passing

Signed-off-by: Lonnie Liu <[email protected]>
…#58587)

also stops building python 3.9 aarch64 images

Signed-off-by: Lonnie Liu <[email protected]>
so that importing test.py does not always import github

github repo imports jwt, which then imports cryptography and can lead to
issues on windows.

Signed-off-by: Lonnie Liu <[email protected]>
this makes it possible to run on a different python version than the CI
wrapper code.

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
…ecurity (ray-project#58591)

Migrates Ray dashboard authentication from JavaScript-managed cookies to
server-side HttpOnly cookies to enhance security against XSS attacks.
This addresses code review feedback to improve the authentication
implementation (ray-project#58368)

Main changes:
- The authentication middleware first looks for the `Authorization`
header; if it is not found, the middleware looks for the auth token in
the cookies.
- A new `api/authenticate` endpoint verifies the token and sets the
auth token cookie (with `HttpOnly=true`, `SameSite=Strict`, and
`secure=true` when using HTTPS).
- Removed the JavaScript-based cookie manipulation utils and axios
interceptors that were previously responsible for setting cookies.
- Cookies are deleted when connecting to a cluster with
`AUTH_MODE=disabled`. Connecting to a different Ray cluster (with a
different auth token) through the same endpoint (e.g., due to
port-forwarding or local testing) shows the popup again and asks users
to enter the correct token.
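
A rough sketch of the header-then-cookie lookup and of the cookie attributes listed above, using only the standard library. The cookie name, the `Bearer` scheme, and both function names are assumptions for illustration; the dashboard's actual middleware is not shown in this commit message.

```python
from http.cookies import SimpleCookie
from typing import Mapping, Optional

AUTH_COOKIE_NAME = "ray-auth-token"  # hypothetical cookie name


def extract_token(
    headers: Mapping[str, str], cookies: Mapping[str, str]
) -> Optional[str]:
    """Prefer the Authorization header; fall back to the auth cookie."""
    auth_header = headers.get("Authorization")
    if auth_header:
        # Assumes a "Bearer <token>" header; the real scheme may differ.
        scheme, _, value = auth_header.partition(" ")
        return value if scheme.lower() == "bearer" and value else None
    return cookies.get(AUTH_COOKIE_NAME)


def build_auth_cookie(token: str, is_https: bool) -> str:
    """Build a Set-Cookie value with the attributes named above."""
    cookie = SimpleCookie()
    cookie[AUTH_COOKIE_NAME] = token
    morsel = cookie[AUTH_COOKIE_NAME]
    morsel["httponly"] = True       # not readable from JavaScript
    morsel["samesite"] = "Strict"   # not sent on cross-site requests
    if is_https:
        morsel["secure"] = True     # only sent over HTTPS
    return morsel.OutputString()


print(extract_token({"Authorization": "Bearer abc"}, {}))
print(build_auth_cookie("abc", is_https=True))
```

Because the cookie is `HttpOnly`, page scripts cannot read it, which is the XSS hardening this change is after.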

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
add support for `ray get-auth-token` cli command + test

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
ray-project#57590)

As discovered in the [PR to better define the interface for reference
counter](ray-project#57177 (review)),
plasma store provider and memory store both share thin dependencies on
reference counter that can be refactored out. This will reduce
entanglement in our code base and improve maintainability.

The main logic changes are located in:
* src/ray/core_worker/store_provider/plasma_store_provider.cc, where
reference counter related logic is refactored into the core worker
* src/ray/core_worker/core_worker.cc, where the factored-out reference
counter logic is resolved
* src/ray/core_worker/store_provider/memory_store/memory_store.cc, where
logic related to the reference counter has either been removed because
it was tech debt or refactored into caller functions.

## Checks
Microbenchmark:
```
single client get calls (Plasma Store) per second 10592.56 +- 535.86
single client put calls (Plasma Store) per second 4908.72 +- 41.55
multi client put calls (Plasma Store) per second 14260.79 +- 265.48
single client put gigabytes per second 11.92 +- 10.21
single client tasks and get batch per second 8.33 +- 0.19
multi client put gigabytes per second 32.09 +- 1.63
single client get object containing 10k refs per second 13.38 +- 0.13
single client wait 1k refs per second 5.04 +- 0.05
single client tasks sync per second 960.45 +- 15.76
single client tasks async per second 7955.16 +- 195.97
multi client tasks async per second 17724.1 +- 856.8
1:1 actor calls sync per second 2251.22 +- 63.93
1:1 actor calls async per second 9342.91 +- 614.74
1:1 actor calls concurrent per second 6427.29 +- 50.3
1:n actor calls async per second 8221.63 +- 167.83
n:n actor calls async per second 22876.04 +- 436.98
n:n actor calls with arg async per second 3531.21 +- 39.38
1:1 async-actor calls sync per second 1581.31 +- 34.01
1:1 async-actor calls async per second 5651.2 +- 222.21
1:1 async-actor calls with args async per second 3618.34 +- 76.02
1:n async-actor calls async per second 7379.2 +- 144.83
n:n async-actor calls async per second 19768.79 +- 211.95
```
This PR mainly makes logic changes to the `ray.get` call chain. As the
benchmark above shows, single client get call performance matches
pre-regression levels.

---------

Signed-off-by: davik <[email protected]>
Co-authored-by: davik <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
…ay-project#58471)

2. **Extracted generic `RankManager` class** - Created reusable rank
management logic separated from deployment-specific concerns

3. **Introduced `ReplicaRank` schema** - Type-safe rank representation
replacing raw integers

4. **Simplified error handling** - not supporting self healing

5. **Updated tests** - Refactored unit tests to use new API and removed
flag-dependent test cases

**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support
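
To make the idea of a typed rank and a generic manager concrete, here is a purely illustrative sketch. The field on `ReplicaRank`, the method names, and the lowest-free-rank policy are assumptions, not the Serve implementation; the only points taken from the description are that ranks are a schema rather than raw integers and that inconsistent state raises an error instead of being self-healed.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReplicaRank:
    """Typed wrapper around a rank instead of a raw integer."""

    rank: int


class RankManager:
    """Assigns each key the lowest unused rank (illustrative policy)."""

    def __init__(self) -> None:
        self._ranks: Dict[str, ReplicaRank] = {}

    def assign(self, key: str) -> ReplicaRank:
        if key in self._ranks:
            # No self-healing: report the inconsistency to the caller.
            raise ValueError(f"{key!r} already has a rank")
        used = {r.rank for r in self._ranks.values()}
        rank = next(i for i in range(len(used) + 1) if i not in used)
        self._ranks[key] = ReplicaRank(rank)
        return self._ranks[key]

    def release(self, key: str) -> None:
        if key not in self._ranks:
            raise KeyError(key)
        del self._ranks[key]


manager = RankManager()
print(manager.assign("replica-a"), manager.assign("replica-b"))
manager.release("replica-a")
print(manager.assign("replica-c"))  # reuses the freed rank
```
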


Next PR ray-project#58473

---------

Signed-off-by: abrar <[email protected]>
Currently, Ray metrics and events are exported through a centralized
process called the Dashboard Agent. This process functions as a gRPC
server, receiving data from all other components (GCS, Raylet, workers,
etc.). However, during a node shutdown, the Dashboard Agent may
terminate before the other components, resulting in gRPC errors and
potential loss of metrics and events.

When this issue occurs, the otel SDK logs become very noisy. Add a
default option to disable otel SDK logs to avoid confusion.

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`,
which is a non-flaky version of `fetch_prometheus`. Update all test
usages accordingly.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… RD Datatype (ray-project#58225)

## Description
As title suggests

Signed-off-by: Goutam <[email protected]>
…ay-project#58581)

allowing for py3.13 images (cpu & cu123) in release tests

Signed-off-by: elliot-barn <[email protected]>
## Description
Add avg prompt length metric

When using uniform prompt lengths (especially in testing), the P50 and
P90 computations are skewed by the 1_2_5 buckets used in vLLM.
Average prompt length provides another useful dimension to look at and
validate.

For example, with a uniform ISL of 5000, P50 shows 7200 and P90 shows
9400, while the average accurately shows 5000.
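
The skew can be reproduced with a small worked example. The sketch below assumes 1-2-5 bucket boundaries up to 10000, prompts of 5001 tokens that therefore land just above the 5000 boundary, and Prometheus-style linear interpolation inside a bucket. These are illustrative assumptions, so the estimates differ from the 7200 and 9400 reported above.

```python
import bisect
from typing import List

# Assumed 1-2-5 bucket upper bounds (inclusive, "le" semantics).
BOUNDS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000]


def bucket_quantile(q: float, bounds: List[int], counts: List[int]) -> float:
    """Estimate quantile q by linear interpolation inside the target bucket."""
    target = q * sum(counts)
    cumulative = 0
    for i, count in enumerate(counts):
        if count and cumulative + count >= target:
            lower = bounds[i - 1] if i > 0 else 0
            return lower + (bounds[i] - lower) * (target - cumulative) / count
        cumulative += count
    return float(bounds[-1])


lengths = [5001] * 100  # uniform prompt length just above a boundary
counts = [0] * len(BOUNDS)
for n in lengths:
    counts[bisect.bisect_left(BOUNDS, n)] += 1

print("P50 estimate:", bucket_quantile(0.5, BOUNDS, counts))
print("P90 estimate:", bucket_quantile(0.9, BOUNDS, counts))
print("average:", sum(lengths) / len(lengths))
```

Every observation falls in the (5000, 10000] bucket, so the percentile estimates are spread across that whole range while the average stays at the true value.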

<img width="1186" height="466" alt="image"
src="https://github.com/user-attachments/assets/4615c3ca-2e15-4236-97f9-72bc63ef9d1a"
/>
 

---------

Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Prometheus automatically appends the `_total` suffix to all Counter
metrics. Ray has historically supported counter metrics with and without
the `_total` suffix for backward compatibility, but it is now time to
drop that support (2 years since the warning was added).

There is one place in the Ray Serve dashboard that still doesn't use the
`_total` suffix, so fix it in this PR.
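
As a small illustration of the naming rule, the helper below returns the name a counter ends up exposed under once the suffix is applied. `exported_counter_name` and the metric name used in the example are hypothetical.

```python
def exported_counter_name(name: str) -> str:
    """Return the counter name with the `_total` suffix applied once."""
    return name if name.endswith("_total") else f"{name}_total"


# Hypothetical metric name, used only to show the transformation.
print(exported_counter_name("ray_serve_example_requests"))
print(exported_counter_name("ray_serve_example_requests_total"))
```

A dashboard query that still uses the unsuffixed name would match nothing once the unsuffixed variant is no longer exported.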

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
This PR adds initial support for RAY_AUTH_MODE=k8s. In this mode, Ray
will delegate authentication and authorization of Ray access to
Kubernetes TokenReview and SubjectAccessReview APIs.
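
For readers unfamiliar with the two Kubernetes APIs, the request bodies have roughly the shape sketched below. The helper names and the example verb, resource, API group, and namespace are assumptions for illustration; which attributes Ray actually checks is not stated in this commit message.

```python
import json


def token_review_body(token: str) -> dict:
    """Body for a TokenReview request (authentication.k8s.io/v1)."""
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "spec": {"token": token},
    }


def subject_access_review_body(
    user: str, verb: str, group: str, resource: str, namespace: str
) -> dict:
    """Body for a SubjectAccessReview request (authorization.k8s.io/v1)."""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SubjectAccessReview",
        "spec": {
            "user": user,
            "resourceAttributes": {
                "namespace": namespace,
                "verb": verb,
                "group": group,
                "resource": resource,
            },
        },
    }


print(json.dumps(token_review_body("<token>"), indent=2))
# Hypothetical attributes; the real check may use different ones.
print(json.dumps(
    subject_access_review_body(
        "system:serviceaccount:default:demo", "get", "ray.io", "rayclusters", "default"
    ),
    indent=2,
))
```

The TokenReview response reports whether the token is authenticated and for which user; that user can then be passed to a SubjectAccessReview, whose response reports whether the action is allowed.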

---------

Signed-off-by: Andrew Sy Kim <[email protected]>
unifying to python 3.10

Signed-off-by: Lonnie Liu <[email protected]>
ray-project#56520 (ray-project#56575)

As mentioned in ray-project#51080,
separate the ObjectRefGenerator class from the large _raylet.pyx file.

Closes ray-project#56520

---------

Signed-off-by: l00951262 <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: l00951262 <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
## Description
Currently, streaming repartition applies a map transform to each block
independently and does not merge leftover rows across blocks, so it
cannot guarantee exact row counts per output block. This PR introduces a
new design that computes, on the driver, the input block ranges for
every output block. It avoids driver-side block fetching while ensuring
correctness and leveraging the efficiency of parallel map tasks.
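
One way to picture the driver-side planning step is the sketch below, which needs only the row count of each input block, not its data. `plan_output_blocks` and the slice tuple layout are hypothetical, not the names used in Ray Data.

```python
from typing import List, Tuple

# (input_block_index, start_row, end_row), with end_row exclusive.
BlockSlice = Tuple[int, int, int]


def plan_output_blocks(
    input_rows: List[int], target_rows: int
) -> List[List[BlockSlice]]:
    """Compute which input row ranges make up each output block."""
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    outputs: List[List[BlockSlice]] = []
    current: List[BlockSlice] = []
    remaining = target_rows  # rows still needed by the current output block
    for idx, num_rows in enumerate(input_rows):
        start = 0
        while start < num_rows:
            take = min(remaining, num_rows - start)
            current.append((idx, start, start + take))
            start += take
            remaining -= take
            if remaining == 0:
                outputs.append(current)
                current, remaining = [], target_rows
    if current:
        outputs.append(current)  # final block may be shorter
    return outputs


print(plan_output_blocks([5, 3, 4], target_rows=4))
```

Each inner list can then be handed to a separate map task that slices and concatenates the referenced blocks, so leftover rows from one input block are merged with the next without the driver fetching any block.
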
## Related issues
Closes ray-project#57165

## Additional information

---------

Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
## Description
Reintroduce the old task_completion_time metric as
`task_completion_time_total`.

Refactors the Ray Data histogram metrics to accomplish a few things:
- Abstracts histogram details into a `RuntimeMetricsHistogram` class.
- Removes the need for a lock in the `OpRuntimeMetrics` class.

It does so primarily by moving the delta tracking logic from the
OpRuntimeMetrics to the StatsActor. The delta tracking logic is
necessary because the prometheus Histogram api only accepts new
observations as input and does not allow directly setting histogram
bucket values.
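
The delta tracking can be sketched as below, assuming per-bucket (non-cumulative) counts are reported on each update. `FakeHistogram` and `DeltaTracker` are hypothetical stand-ins; the fake only mimics an observe-only histogram API and is not the Prometheus client or the `StatsActor` code.

```python
from typing import List


class FakeHistogram:
    """Stand-in for a histogram that only accepts observations."""

    def __init__(self) -> None:
        self.observations: List[float] = []

    def observe(self, value: float) -> None:
        self.observations.append(value)


class DeltaTracker:
    """Replays only the observations added since the previous update."""

    def __init__(self, boundaries: List[float], histogram: FakeHistogram) -> None:
        self._boundaries = boundaries
        self._histogram = histogram
        # One slot per boundary plus a final slot for the +Inf bucket.
        self._last_counts = [0] * (len(boundaries) + 1)

    def update(self, bucket_counts: List[int]) -> None:
        for i, (new, old) in enumerate(zip(bucket_counts, self._last_counts)):
            # Representative value that falls inside bucket i.
            if i < len(self._boundaries):
                value = self._boundaries[i]
            else:
                value = self._boundaries[-1] + 1
            for _ in range(new - old):
                self._histogram.observe(value)
        self._last_counts = list(bucket_counts)


hist = FakeHistogram()
tracker = DeltaTracker([1.0, 5.0], hist)
tracker.update([1, 2, 0])
tracker.update([1, 3, 1])  # only the two new observations are replayed
print(hist.observations)
```

A replay like this preserves bucket counts but not the exact observed values, since each observation is replaced by a representative value for its bucket.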

## Additional information
Verified metrics worked:
```
# HELP ray_data_task_completion_time Time spent per task running those tasks to completion.
# TYPE ray_data_task_completion_time histogram
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="0.1",operator="ReadRange->Map(identity_with_sleep)_1"} 0.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="0.25",operator="ReadRange->Map(identity_with_sleep)_1"} 0.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="0.5",operator="ReadRange->Map(identity_with_sleep)_1"} 0.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="1.0",operator="ReadRange->Map(identity_with_sleep)_1"} 0.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="2.5",operator="ReadRange->Map(identity_with_sleep)_1"} 0.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="5.0",operator="ReadRange->Map(identity_with_sleep)_1"} 3.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="7.5",operator="ReadRange->Map(identity_with_sleep)_1"} 6.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="10.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="15.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="20.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="25.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="50.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="75.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="100.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="150.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="500.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="1000.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="2500.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="5000.0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_bucket{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",le="+Inf",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_count{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",operator="ReadRange->Map(identity_with_sleep)_1"} 10.0
ray_data_task_completion_time_sum{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",operator="ReadRange->Map(identity_with_sleep)_1"} 65.0
# HELP ray_data_task_completion_time_total Time spent running tasks to completion. This is a sum of all tasks' completion times.
# TYPE ray_data_task_completion_time_total gauge
ray_data_task_completion_time_total{Component="core_worker",NodeAddress="127.0.0.1",SessionName="session_2025-10-17_12-04-00_414091_75603",Version="3.0.0.dev0",WorkerId="9fa17dcb3156c7bee37b4077bd4361f9ce7e96c06b5267ee9e67a308",dataset="dataset_2_0",operator="ReadRange->Map(identity_with_sleep)_1"} 62.872898580506444
```

---------

Signed-off-by: Alan Guo <[email protected]>
`stabilityai/stable-diffusion-2` was deprecated on Hugging Face, so our
example no longer works. Update the model used in the example doc to fix
it.

Manual test output:
```
2025-11-13 16:37:09,568 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.0.94.19:6379...
2025-11-13 16:37:09,577 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at https://session-zek2tfbffypugm65x9sbfn5e12.i.anyscaleuserdata.com 
2025-11-13 16:37:09,579 INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_b45cce8156d06a44ed67dca704b05de5c732d215.zip' (0.13MiB) to Ray cluster...
2025-11-13 16:37:09,580 INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_b45cce8156d06a44ed67dca704b05de5c732d215.zip'.
INFO 2025-11-13 16:37:09,608 serve 8174 -- Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ServeController pid=6080) INFO 2025-11-13 16:37:09,662 controller 6080 -- Deploying new version of Deployment(name='StableDiffusionXL', app='default') (initial target replicas: 1).
(ServeController pid=6080) INFO 2025-11-13 16:37:09,663 controller 6080 -- Deploying new version of Deployment(name='APIIngress', app='default') (initial target replicas: 1).
(ServeController pid=6080) INFO 2025-11-13 16:37:09,766 controller 6080 -- Adding 1 replica to Deployment(name='StableDiffusionXL', app='default').
(ServeController pid=6080) INFO 2025-11-13 16:37:09,769 controller 6080 -- Stopping 1 replicas of Deployment(name='APIIngress', app='default') with outdated versions.
(ServeController pid=6080) INFO 2025-11-13 16:37:09,769 controller 6080 -- Adding 1 replica to Deployment(name='APIIngress', app='default').
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:11,771 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:13,771 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(autoscaler +6s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:15,772 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:17,772 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:19,772 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:21,772 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:23,772 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:25,772 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:27,772 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeReplica:default:APIIngress pid=6201) INFO 2025-11-13 16:37:29,772 default_APIIngress 4enmm3dc -- Waiting for an additional 2.0s to shut down because there are 1 ongoing requests.
(ServeController pid=6080) INFO 2025-11-13 16:37:29,787 controller 6080 -- Replica(id='4enmm3dc', deployment='APIIngress', app='default') did not shut down after grace period, force-killing it. 
(ServeController pid=6080) INFO 2025-11-13 16:37:29,893 controller 6080 -- Replica(id='4enmm3dc', deployment='APIIngress', app='default') is stopped.
(ServeController pid=6080) WARNING 2025-11-13 16:37:39,800 controller 6080 -- Deployment 'StableDiffusionXL' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1, "GPU": 1}, total resources available: {"CPU": 14.0}. Use `ray status` for more details.
(ServeController pid=6080) WARNING 2025-11-13 16:38:09,903 controller 6080 -- Deployment 'StableDiffusionXL' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1, "GPU": 1}, total resources available: {"CPU": 14.0}. Use `ray status` for more details.
(ServeController pid=6080) WARNING 2025-11-13 16:38:39,985 controller 6080 -- Deployment 'StableDiffusionXL' in application 'default' has 1 replicas that have taken more than 30s to be scheduled. This may be due to waiting for the cluster to auto-scale or for a runtime environment to be installed. Resources required for each replica: {"CPU": 1, "GPU": 1}, total resources available: {"CPU": 14.0}. Use `ray status` for more details.
(ProxyActor pid=3285, ip=10.0.81.200) INFO 2025-11-13 16:38:53,625 proxy 10.0.81.200 -- Proxy starting on node 325c0b192bd38bda7128277e368d5bd4a1308e201572ae36078016c4 (HTTP port: 8000).
(ProxyActor pid=3285, ip=10.0.81.200) INFO 2025-11-13 16:38:53,678 proxy 10.0.81.200 -- Got updated endpoints: {Deployment(name='APIIngress', app='default'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=3285, ip=10.0.81.200) INFO 2025-11-13 16:38:53,718 proxy 10.0.81.200 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7ca07aa12d80>.
Fetching 19 files:   0%|          | 0/19 [00:00<?, ?it/s].81.200) 
Fetching 19 files:  11%|█         | 2/19 [00:00<00:01, 13.87it/s] 
Fetching 19 files:  21%|██        | 4/19 [00:04<00:20,  1.39s/it] 
Fetching 19 files: 100%|██████████| 19/19 [00:08<00:00,  2.28it/s]
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]
Loading pipeline components...:  29%|██▊       | 2/7 [00:00<00:00, 18.54it/s]
Loading pipeline components...:  57%|█████▋    | 4/7 [00:00<00:00,  5.35it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00,  6.99it/s]
INFO 2025-11-13 16:39:06,834 serve 8174 -- Application 'default' is ready at http://127.0.0.1:8000/.
INFO 2025-11-13 16:39:06,840 serve 8174 -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x73852a494740>.
(ServeReplica:default:APIIngress pid=8257) INFO 2025-11-13 16:39:06,864 default_APIIngress uiu0g97s 6dc8f981-a70c-4af6-b719-93c16958cf1f -- Started <ray.serve._private.router.SharedRouterLongPollClient object at 0x7ec9a87d5760>.
(ServeReplica:default:StableDiffusionXL pid=3219, ip=10.0.81.200) /home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/_private/replica.py:1320: UserWarning: Calling sync method 'generate' directly on the asyncio loop. In a future version, sync methods will be run in a threadpool by default. Ensure your sync methods are thread safe or keep the existing behavior by making them `async def`. Opt into the new behavior by setting RAY_SERVE_RUN_SYNC_IN_THREADPOOL=1.
(ServeReplica:default:StableDiffusionXL pid=3219, ip=10.0.81.200)   warnings.warn(
  0%|          | 0/50 [00:00<?, ?it/s]2 pid=3219, ip=10.0.81.200) 
  2%|▏         | 1/50 [00:00<00:23,  2.07it/s]19, ip=10.0.81.200) 
  4%|▍         | 2/50 [00:00<00:13,  3.69it/s]19, ip=10.0.81.200) 
  6%|β–Œ         | 3/50 [00:00<00:11,  4.24it/s]19, ip=10.0.81.200) 
  8%|β–Š         | 4/50 [00:00<00:10,  4.58it/s]19, ip=10.0.81.200) 
 10%|β–ˆ         | 5/50 [00:01<00:10,  4.13it/s]19, ip=10.0.81.200) 
 12%|β–ˆβ–        | 6/50 [00:01<00:09,  4.43it/s]19, ip=10.0.81.200) 
 14%|β–ˆβ–        | 7/50 [00:01<00:09,  4.66it/s]19, ip=10.0.81.200) 
 16%|β–ˆβ–Œ        | 8/50 [00:01<00:08,  4.82it/s]19, ip=10.0.81.200) 
 18%|β–ˆβ–Š        | 9/50 [00:02<00:08,  4.93it/s]19, ip=10.0.81.200) 
 20%|β–ˆβ–ˆ        | 10/50 [00:02<00:08,  4.92it/s]9, ip=10.0.81.200) 
 22%|β–ˆβ–ˆβ–       | 11/50 [00:02<00:07,  4.95it/s]9, ip=10.0.81.200) 
 24%|β–ˆβ–ˆβ–       | 12/50 [00:02<00:07,  5.01it/s]9, ip=10.0.81.200) 
 26%|β–ˆβ–ˆβ–Œ       | 13/50 [00:02<00:07,  5.06it/s]9, ip=10.0.81.200) 
 28%|β–ˆβ–ˆβ–Š       | 14/50 [00:03<00:07,  5.09it/s]9, ip=10.0.81.200) 
 30%|β–ˆβ–ˆβ–ˆ       | 15/50 [00:03<00:06,  5.07it/s]9, ip=10.0.81.200) 
 32%|β–ˆβ–ˆβ–ˆβ–      | 16/50 [00:03<00:06,  5.02it/s]9, ip=10.0.81.200) 
 34%|β–ˆβ–ˆβ–ˆβ–      | 17/50 [00:03<00:06,  5.04it/s]9, ip=10.0.81.200) 
 36%|β–ˆβ–ˆβ–ˆβ–Œ      | 18/50 [00:03<00:06,  5.06it/s]9, ip=10.0.81.200) 
 38%|β–ˆβ–ˆβ–ˆβ–Š      | 19/50 [00:04<00:06,  5.09it/s]9, ip=10.0.81.200) 
 40%|β–ˆβ–ˆβ–ˆβ–ˆ      | 20/50 [00:04<00:05,  5.07it/s]9, ip=10.0.81.200) 
 42%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 21/50 [00:04<00:05,  5.05it/s]9, ip=10.0.81.200) 
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 22/50 [00:04<00:05,  5.06it/s]9, ip=10.0.81.200) 
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 23/50 [00:04<00:05,  5.06it/s]9, ip=10.0.81.200) 
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 24/50 [00:05<00:05,  5.08it/s]9, ip=10.0.81.200) 
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 25/50 [00:05<00:04,  5.06it/s]9, ip=10.0.81.200) 
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 26/50 [00:05<00:04,  5.04it/s]9, ip=10.0.81.200) 
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 27/50 [00:05<00:04,  5.06it/s]9, ip=10.0.81.200) 
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 28/50 [00:05<00:04,  5.06it/s]9, ip=10.0.81.200) 
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 29/50 [00:05<00:04,  5.07it/s]9, ip=10.0.81.200) 
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 30/50 [00:06<00:03,  5.07it/s]9, ip=10.0.81.200) 
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 31/50 [00:06<00:03,  5.06it/s]9, ip=10.0.81.200) 
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 32/50 [00:06<00:03,  5.06it/s]9, ip=10.0.81.200) 
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 33/50 [00:06<00:03,  5.06it/s]9, ip=10.0.81.200) 
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 34/50 [00:06<00:03,  5.08it/s]9, ip=10.0.81.200) 
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 35/50 [00:07<00:02,  5.09it/s]9, ip=10.0.81.200) 
 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 36/50 [00:07<00:02,  5.06it/s]9, ip=10.0.81.200) 
 74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 37/50 [00:07<00:02,  5.05it/s]9, ip=10.0.81.200) 
 76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ  | 38/50 [00:07<00:02,  5.04it/s]9, ip=10.0.81.200) 
 78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š  | 39/50 [00:07<00:02,  5.04it/s]9, ip=10.0.81.200) 
 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  | 40/50 [00:08<00:01,  5.04it/s]9, ip=10.0.81.200) 
 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 41/50 [00:08<00:01,  5.04it/s]9, ip=10.0.81.200) 
 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 42/50 [00:08<00:01,  5.04it/s]9, ip=10.0.81.200) 
 86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 43/50 [00:08<00:01,  5.04it/s]9, ip=10.0.81.200) 
 88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 44/50 [00:08<00:01,  5.04it/s]9, ip=10.0.81.200) 
 90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 45/50 [00:09<00:00,  5.04it/s]9, ip=10.0.81.200) 
 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 46/50 [00:09<00:00,  5.04it/s]9, ip=10.0.81.200) 
 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 47/50 [00:09<00:00,  5.05it/s]9, ip=10.0.81.200) 
 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 48/50 [00:09<00:00,  5.04it/s]9, ip=10.0.81.200) 
 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 49/50 [00:09<00:00,  5.03it/s]9, ip=10.0.81.200) 
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 50/50 [00:10<00:00,  4.92it/s]9, ip=10.0.81.200) 
(ServeReplica:default:StableDiffusionXL pid=3219, ip=10.0.81.200) /tmp/ray/session_2025-11-13_16-27-23_019355_2409/runtime_resources/pip/4f400740bf0dc373d00105af0f56d30a55db3450/virtualenv/lib/python3.12/site-packages/diffusers/image_processor.py:147: RuntimeWarning: invalid value encountered in cast
(ServeReplica:default:StableDiffusionXL pid=3219, ip=10.0.81.200)   images = (images * 255).round().astype("uint8")
(ServeReplica:default:StableDiffusionXL pid=3219, ip=10.0.81.200) INFO 2025-11-13 16:39:17,741 default_StableDiffusionXL 1rrd4bwh 6dc8f981-a70c-4af6-b719-93c16958cf1f -- CALL generate OK 10852.6ms
(ServeReplica:default:APIIngress pid=8257) INFO 2025-11-13 16:39:17,786 default_APIIngress uiu0g97s 6dc8f981-a70c-4af6-b719-93c16958cf1f -- CALL generate OK 10932.3ms
  0%|          | 0/50 [00:00<?, ?it/s]2 pid=3219, ip=10.0.81.200) 
  2%|▏         | 1/50 [00:00<00:05,  8.78it/s]19, ip=10.0.81.200) 
  4%|▍         | 2/50 [00:00<00:07,  6.22it/s]19, ip=10.0.81.200) 
  6%|β–Œ         | 3/50 [00:00<00:08,  5.68it/s]19, ip=10.0.81.200) 
  8%|β–Š         | 4/50 [00:00<00:08,  5.47it/s]19, ip=10.0.81.200) 
 10%|β–ˆ         | 5/50 [00:00<00:08,  5.13it/s]19, ip=10.0.81.200) 
 12%|β–ˆβ–        | 6/50 [00:01<00:08,  5.05it/s]19, ip=10.0.81.200) 
 14%|β–ˆβ–        | 7/50 [00:01<00:08,  5.08it/s]19, ip=10.0.81.200) 
 16%|β–ˆβ–Œ        | 8/50 [00:01<00:08,  5.10it/s]19, ip=10.0.81.200) 
 18%|β–ˆβ–Š        | 9/50 [00:01<00:08,  5.12it/s]19, ip=10.0.81.200) 
 20%|β–ˆβ–ˆ        | 10/50 [00:01<00:07,  5.03it/s]9, ip=10.0.81.200) 
 22%|β–ˆβ–ˆβ–       | 11/50 [00:02<00:07,  4.99it/s]9, ip=10.0.81.200) 
 24%|β–ˆβ–ˆβ–       | 12/50 [00:02<00:07,  5.02it/s]9, ip=10.0.81.200) 
 26%|β–ˆβ–ˆβ–Œ       | 13/50 [00:02<00:07,  5.05it/s]9, ip=10.0.81.200) 
 28%|β–ˆβ–ˆβ–Š       | 14/50 [00:02<00:07,  5.07it/s]9, ip=10.0.81.200) 
 30%|β–ˆβ–ˆβ–ˆ       | 15/50 [00:02<00:06,  5.01it/s]9, ip=10.0.81.200) 
 32%|β–ˆβ–ˆβ–ˆβ–      | 16/50 [00:03<00:06,  4.99it/s]9, ip=10.0.81.200) 
 34%|β–ˆβ–ˆβ–ˆβ–      | 17/50 [00:03<00:06,  5.00it/s]9, ip=10.0.81.200) 
 36%|β–ˆβ–ˆβ–ˆβ–Œ      | 18/50 [00:03<00:06,  5.03it/s]9, ip=10.0.81.200) 
 38%|β–ˆβ–ˆβ–ˆβ–Š      | 19/50 [00:03<00:06,  5.05it/s]9, ip=10.0.81.200) 
 40%|β–ˆβ–ˆβ–ˆβ–ˆ      | 20/50 [00:03<00:05,  5.02it/s]9, ip=10.0.81.200) 
 42%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 21/50 [00:04<00:05,  4.99it/s]9, ip=10.0.81.200) 
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 22/50 [00:04<00:05,  5.01it/s]9, ip=10.0.81.200) 
 46%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 23/50 [00:04<00:05,  5.02it/s]9, ip=10.0.81.200) 
 48%|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     | 24/50 [00:04<00:05,  5.03it/s]9, ip=10.0.81.200) 
 50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 25/50 [00:04<00:04,  5.00it/s]9, ip=10.0.81.200) 
 52%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 26/50 [00:05<00:04,  4.97it/s]9, ip=10.0.81.200) 
 54%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 27/50 [00:05<00:04,  4.99it/s]9, ip=10.0.81.200) 
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    | 28/50 [00:05<00:04,  5.00it/s]9, ip=10.0.81.200) 
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 29/50 [00:05<00:04,  5.01it/s]9, ip=10.0.81.200) 
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 30/50 [00:05<00:03,  5.01it/s]9, ip=10.0.81.200) 
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 31/50 [00:06<00:03,  4.99it/s]9, ip=10.0.81.200) 
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 32/50 [00:06<00:03,  5.01it/s]9, ip=10.0.81.200) 
 66%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 33/50 [00:06<00:03,  5.01it/s]9, ip=10.0.81.200) 
 68%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š   | 34/50 [00:06<00:03,  5.02it/s]9, ip=10.0.81.200) 
 70%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 35/50 [00:06<00:02,  5.01it/s]9, ip=10.0.81.200) 
 72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 36/50 [00:07<00:02,  4.98it/s]9, ip=10.0.81.200) 
 74%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 37/50 [00:07<00:02,  4.97it/s]9, ip=10.0.81.200) 
 76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ  | 38/50 [00:07<00:02,  4.97it/s]9, ip=10.0.81.200) 
 78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š  | 39/50 [00:07<00:02,  4.97it/s]9, ip=10.0.81.200) 
 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  | 40/50 [00:07<00:02,  4.96it/s]9, ip=10.0.81.200) 
 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 41/50 [00:08<00:01,  4.94it/s]9, ip=10.0.81.200) 
 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 42/50 [00:08<00:01,  4.94it/s]9, ip=10.0.81.200) 
 86%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 43/50 [00:08<00:01,  4.96it/s]9, ip=10.0.81.200) 
 88%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š | 44/50 [00:08<00:01,  4.96it/s]9, ip=10.0.81.200) 
 90%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 45/50 [00:08<00:01,  4.95it/s]9, ip=10.0.81.200) 
 92%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 46/50 [00:09<00:00,  4.95it/s]9, ip=10.0.81.200) 
 94%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 47/50 [00:09<00:00,  4.95it/s]9, ip=10.0.81.200) 
 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ| 48/50 [00:09<00:00,  4.96it/s]9, ip=10.0.81.200) 
 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 49/50 [00:09<00:00,  4.96it/s]9, ip=10.0.81.200) 
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 50/50 [00:09<00:00,  5.04it/s]9, ip=10.0.81.200) 
(ServeReplica:default:APIIngress pid=8257) INFO 2025-11-13 16:39:28,093 default_APIIngress uiu0g97s 44cd3b7b-609e-4e9f-8978-69cdd64707f7 -- GET /imagine 200 10295.2ms
(ServeReplica:default:StableDiffusionXL pid=3219, ip=10.0.81.200) INFO 2025-11-13 16:39:28,081 default_StableDiffusionXL 1rrd4bwh 44cd3b7b-609e-4e9f-8978-69cdd64707f7 -- CALL /imagine OK 10276.0ms
```

---------

Signed-off-by: doyoung <[email protected]>
…ay-project#58473)

### Summary
This PR refactors the replica rank system to support multi-dimensional
ranking (global, node-level, and local ranks) in preparation for
node-local rank tracking. The `ReplicaRank` object now contains three
fields instead of being a simple integer, enabling better coordination
of replicas across nodes.

### Motivation
Currently, Ray Serve only tracks a single global rank per replica. For
advanced use cases like tensor parallelism, model sharding across nodes,
and node-aware coordination, we need to track:
- **Global rank**: Replica's rank across all nodes (0 to N-1)
- **Node rank**: Which node the replica is on (0 to M-1) 
- **Local rank**: Replica's rank on its specific node (0 to K-1)

This PR lays the groundwork by introducing the expanded `ReplicaRank`
schema while keeping existing functionality backward compatible.

### Changes

#### Core Implementation
- **`schema.py`**: Extended `ReplicaRank` to include `node_rank` and
`local_rank` fields (currently set to -1 as placeholders)
- **`replica.py`**: Updated replica actors to handle `ReplicaRank`
objects
- **`context.py`**: Changed `ReplicaContext.rank` type from
`Optional[int]` to `ReplicaRank`

### Current Behavior
- `node_rank` and `local_rank` are set to `-1` (placeholder values); a
future PR will populate them
- Global rank assignment and management works as before
- All existing functionality is preserved

### Breaking Changes
The rank type changes from `int` to `ReplicaRank`.

Next PR ray-project#58477
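
A minimal sketch of what the expanded rank object could look like, and how caller code that used to treat the rank as a plain `int` might adapt. The class name `ReplicaRankSketch`, the field name `rank` for the global rank, and the helper `global_rank_of` are assumptions for illustration; only `node_rank`, `local_rank`, and the `-1` placeholders come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaRankSketch:
    """Illustrative stand-in for the expanded rank object (not Ray Serve's class)."""

    rank: int             # global rank across all nodes; field name is assumed
    node_rank: int = -1   # placeholder value, per the description above
    local_rank: int = -1  # placeholder value, per the description above


def global_rank_of(rank) -> int:
    """Hypothetical helper: accept either the old int rank or the new object."""
    if isinstance(rank, int):
        return rank
    return rank.rank


old_style = 3
new_style = ReplicaRankSketch(rank=3)
print(global_rank_of(old_style), global_rank_of(new_style))
```

Code that compared or indexed with the rank directly would need a small adapter like this one, since an object is no longer interchangeable with an `int`.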

---------

Signed-off-by: abrar <[email protected]>
elliot-barn and others added 22 commits November 27, 2025 00:01
Adds the openlineage-python dependency and
upgrades requests from 2.32.3 to 2.32.5.

LLM serve and batch release tests:
https://buildkite.com/ray-project/release/builds/69428
The only failing test has already been disabled:
llm_serve_llama_3dot1_8B_quantized_tp1_2p6d_lmcache

Core multi test: https://buildkite.com/ray-project/release/builds/69489#

---------

Signed-off-by: elliot-barn <[email protected]>
## Description
While reviewing RLlib, I found a couple of cases where we don't fully handle
exceptions.
I've checked every try/except statement in RLlib, and this PR updates all
of those that neither handled the error nor logged or printed it.
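
As a rough illustration of the pattern (not code from RLlib), an `except` branch that falls back to a default should at least log the error it swallowed. The function `parse_int_or_default` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("rllib_sketch")  # hypothetical logger name


def parse_int_or_default(raw: str, default: int = 0) -> int:
    """Parse raw as an int; on failure, log the error and return default."""
    try:
        return int(raw)
    except ValueError:
        # logger.exception records the message plus the active traceback,
        # so the failure is visible even though it is not re-raised.
        logger.exception("Could not parse %r as int; using %d", raw, default)
        return default


print(parse_int_or_default("7"))
print(parse_int_or_default("seven", default=5))
```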

## Related issues
Fixes ray-project#58854

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
…jection logic (ray-project#59042)

## Description
Adds more headers to the denylist for recognising browser requests and
denying them

## Related issues
Supersedes ray-project#59040

Signed-off-by: Richo Healey <[email protected]>
## Description
ray-project#58853 didn't define a byod
script, so the [nightly
test](https://buildkite.com/ray-project/release/builds/69721#019ac407-48f6-4523-b3c0-e8c2f6b6ba82)
failed because `torch` wasn't installed.
This PR adds a byod script that installs `torch` (and other modules that
may be needed).

## Related issues
ray-project#58853

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
… (1/9) (ray-project#58983)

upgrade memray, cupy-cuda12x & rich packages in preparation for python
3.13 image building
rich==13.3.2 -> rich==13.3.3
cupy-cuda12x -> cupy-cuda12x>=13.4.0
Updated memray constraint to 1.14.0

Memray 1.14.0 has Python 3.13 wheels.
The rich upgrade is required for the memray upgrade.
cupy-cuda12x>=13.4.0 has Python 3.13 wheels.

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…oject#58272)

## Description

This PR adds an API overview and example usage for the TPU utility
library added in this PR: ray-project#56723.
I added this section to the existing "Using TPUs with KubeRay" guide,
because the utility library would be primarily used with KubeRay on GKE
(the values used for default labels are set on GKE with a mutating
webhook).

## Related issues
ray-project#55162


---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This PR adds a `bundle_label_selector` argument to the `ScalingConfig`
that allows Ray Train workers to be placed on nodes with particular
labels. The previous workaround, namely using `resources_per_worker`, is
less flexible.

`bundle_label_selector` can either be a single dict, in which case it
will apply to all the workers, or a list of length `num_workers`, in
which case each item in the list will correspond to one of the workers.

I added verification to the controller instead of validating that none
of the callbacks have `on_controller_start_worker_group` when
`bundle_label_selector` is set because we might change
`on_controller_start_worker_group` in the future. We can revisit this
issue then.
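
The dict-or-list behavior described above can be sketched as a small normalization step. This is an illustration under assumptions, not Ray Train's implementation: the helper `expand_bundle_label_selector` and the label key `accelerator-type` are hypothetical.

```python
from typing import Dict, List, Union

LabelSelector = Dict[str, str]


def expand_bundle_label_selector(
    selector: Union[LabelSelector, List[LabelSelector]],
    num_workers: int,
) -> List[LabelSelector]:
    """Return one label selector per worker (hypothetical helper)."""
    if isinstance(selector, dict):
        # A single dict applies to every worker.
        return [dict(selector) for _ in range(num_workers)]
    if len(selector) != num_workers:
        raise ValueError(
            f"Expected {num_workers} label selectors, got {len(selector)}."
        )
    # A list maps item i to worker i.
    return [dict(item) for item in selector]


print(expand_bundle_label_selector({"accelerator-type": "A100"}, 2))
```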

---------

Signed-off-by: Timothy Seah <[email protected]>
Document the new `ray.train.get_all_reported_checkpoints` method. 

---------

Signed-off-by: Timothy Seah <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Justin Yu <[email protected]>
and also closes the file after reading.

Signed-off-by: Lonnie Liu <[email protected]>
temp files generated by python.

Signed-off-by: Lonnie Liu <[email protected]>
python 3.9 reached end of life

---------

Signed-off-by: Lonnie Liu <[email protected]>
# Description
We are cleaning up RLlib's testing, which includes the benchmark folder;
this PR removes that folder in its entirety.

---------

Co-authored-by: Hassam Sheikh <[email protected]>
Co-authored-by: Kamil Kaczmarek <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
## Description

Main scope of this PR:
If there is no data at `data_path`, we currently don't error out but
just log a warning and continue.
So the error that the user gets is something further down the line.
This PR makes it so that we error out if the data does not exist - there
is no reason for us to mask that and try to continue still.
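
A minimal sketch of the fail-fast behavior, assuming a local filesystem path. The helper `require_data` and the file names are hypothetical and do not reflect how RLlib resolves `data_path`.

```python
import tempfile
from pathlib import Path


def require_data(data_path: str) -> Path:
    """Return data_path as a Path, or raise if nothing exists there."""
    path = Path(data_path)
    if not path.exists():
        # Fail fast instead of logging a warning and continuing.
        raise FileNotFoundError(f"No data found at data_path={data_path!r}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "episodes.json"  # hypothetical file name
    present.write_text("[]")
    print(require_data(str(present)).name)
    try:
        require_data(str(Path(tmp) / "missing.json"))
    except FileNotFoundError as err:
        print(f"raised: {type(err).__name__}")
```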

Secondary scope:
We introduced some formatting for log messages that looks like it
adheres to some standard, but I cannot find that format anywhere else in
Ray. This PR removes this formatting so that we don't accumulate a
variety of such formats across our codebase.
…n't use them (ray-project#59052)

This PR removes the `cluster_full_of_actors_detected` and
`cluster_full_of_actors_detected_by_gcs` fields from the protobuf, as
they are not used in autoscaler v2, and autoscaler v1 is scheduled for
deletion soon.

The 2 fields are considered private, so they are deleted without
maintaining backward compatibility.

Signed-off-by: Rueian <[email protected]>
## Description
The RLlib team is working on improving our test coverage.
Currently, several files are excluded from our doctests. This PR moves toward
running doctests for the whole project.

---------

Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
…roject#58628)

## Description

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# Create a placement group.
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())


@ray.remote(num_cpus=1, num_gpus=8)
class Actor:
    def __init__(self):
        pass

actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg),
    name="actor",
    get_if_exists=True,
).remote()
```

* Without this PR, you will get the following error message:
  ```
ValueError: Failed to look up actor with name 'actor'. This could
because 1. You are trying to look up a named actor you didn't create. 2.
The named actor died. 3. You did not use a namespace matching the
namespace of the actor.
  ```

* With this PR, you will get the actual root cause:
  ```
ValueError: Cannot schedule Actor with the placement group because the
resource request {'CPU': 1, 'GPU': 8} cannot fit into any bundles for
the placement group, [{'CPU': 1.0}].
  ```
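
The check behind the second message can be sketched in plain Python: a request is schedulable only if at least one bundle covers every requested resource. The function names are hypothetical and this is not Ray's scheduling code; the error text mirrors the message quoted above.

```python
from typing import Dict, List

Resources = Dict[str, float]


def fits_in_any_bundle(request: Resources, bundles: List[Resources]) -> bool:
    """True if some bundle has at least the requested amount of each resource."""
    return any(
        all(bundle.get(name, 0.0) >= amount for name, amount in request.items())
        for bundle in bundles
    )


def check_schedulable(request: Resources, bundles: List[Resources]) -> None:
    """Raise a descriptive error when no bundle can hold the request."""
    if not fits_in_any_bundle(request, bundles):
        raise ValueError(
            "Cannot schedule Actor with the placement group because the "
            f"resource request {request} cannot fit into any bundles for "
            f"the placement group, {bundles}."
        )


try:
    check_schedulable({"CPU": 1, "GPU": 8}, [{"CPU": 1.0}])
except ValueError as err:
    print(err)
```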

## Related issues

## Additional information

---------

Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
so that failed twine upload subprocess call will not print the token out
in logs as part of the exception.

Signed-off-by: Lonnie Liu <[email protected]>
…t#58859)

Updates the Daft integration section in Ray's libraries documentation:
- Fixed GitHub stars badge URL (was pointing to non-existent `daft/daft`
instead of `Eventual-Inc/Daft`)
- Updated integration link to point to Ray-specific Daft documentation
- Updated Daft logo with new design (replaced webp with png)

---------

Signed-off-by: YK <[email protected]>
…ray-project#59031)

## Description
Catch token loading exceptions and raise them from the Python frontend
instead of crashing in C++.

For example:
```bash
(ray-dev) ubuntu@devbox:~/clone/ray$ export RAY_AUTH_MODE=token
(ray-dev) ubuntu@devbox:~/clone/ray$ export RAY_AUTH_TOKEN_PATH=missing_file.txt
(ray-dev) ubuntu@devbox:~/clone/ray$ ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.5.49
Traceback (most recent call last):
  File "/home/ubuntu/.conda/envs/ray-dev/bin/ray", line 7, in <module>
    sys.exit(main())
  File "/home/ubuntu/clone/ray/python/ray/scripts/scripts.py", line 2817, in main
    return cli()
  File "/home/ubuntu/.conda/envs/ray-dev/lib/python3.10/site-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/ray-dev/lib/python3.10/site-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.conda/envs/ray-dev/lib/python3.10/site-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.conda/envs/ray-dev/lib/python3.10/site-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.conda/envs/ray-dev/lib/python3.10/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/clone/ray/python/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/clone/ray/python/ray/scripts/scripts.py", line 945, in start
    ensure_token_if_auth_enabled(system_config, create_token_if_missing=False)
  File "/home/ubuntu/clone/ray/python/ray/_private/authentication/authentication_token_setup.py", line 93, in ensure_token_if_auth_enabled
    if not token_loader.has_token(ignore_auth_mode=True):
  File "python/ray/includes/rpc_token_authentication.pxi", line 90, in ray._raylet.AuthenticationTokenLoader.has_token
    raise AuthenticationError(result.error_message.decode('utf-8'))
ray.exceptions.AuthenticationError: RAY_AUTH_TOKEN_PATH is set but file cannot be opened or is empty: missing_file.txt. Ensure that the token for the cluster is available in a local file (e.g., ~/.ray/auth_token or via RAY_AUTH_TOKEN_PATH) or as the `RAY_AUTH_TOKEN` environment variable. To generate a token for local development, use `ray get-auth-token --generate` For remote clusters, ensure that the token is propagated to all nodes of the cluster when token authentication is enabled. For more information, see: https://docs.ray.io/en/latest/ray-security/auth.html
```

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Failing test:
```
test_llm_serve_multi_node_integration.py::test_llm_serve_data_parallelism
```

Issue:
The test configuration didn't fill the 2 worker nodes, leading to flakiness if
DP replicas were scheduled across nodes.

Fix:
Change the test configuration from 2 to 4 replicas to fill the 2 worker nodes.

Signed-off-by: Seiji Eicher <[email protected]>
…-project#58902)

Stop running on Python 3.9.

---------

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: matthewdeng <[email protected]>
Co-authored-by: matthewdeng <[email protected]>
1. Previously, the placement group lifetime was tied to the Ray job
driver. If we use Tune + Train V2, or Train V2 with async validation
where the validation task creates its own placement group, placement
groups owned by a non-main job driver stick around for the rest of the
main job driver's lifetime.
2. Why did Train v1 + Tune not run into this issue?
Tune's driver process kept track of the placement groups spawned for
children, including Train. So the Tune driver process was able to remove
the placement group after stopping the trial.
If the Tune driver was launched in a remote task and was killed, you'd
run into the same issue as long as the job driver was still alive.

3. To resolve this, we add a placement group cleaner that runs as
a detached actor alongside the Ray Train controller, wired in through
ControllerCallback and WorkerGroupCallback. The cleaner monitors
the liveness of the controller and, if the controller dies without exiting
gracefully, cleans up the placement groups that the controller spawned.
4. The flow now looks like this:
a. After the controller starts, the PG cleaner is registered with the
controller ID.
b. After the worker group starts and the PG is created, the PG cleaner is
registered with the PG.
c. The PG cleaner runs the monitor loop; if the controller is not alive,
it tries to clean up the PG.
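
The register-then-monitor flow above can be sketched without Ray as a small class with injected callbacks. Everything here is an assumption for illustration: the class `PlacementGroupCleanerSketch`, its method names, and the `is_alive` / `remove_pg` callbacks are hypothetical and are not the actor added by the PR.

```python
from typing import Callable, List, Optional


class PlacementGroupCleanerSketch:
    """Tracks one controller and its placement groups (illustrative only)."""

    def __init__(
        self,
        is_alive: Callable[[str], bool],
        remove_pg: Callable[[str], None],
    ) -> None:
        self._is_alive = is_alive
        self._remove_pg = remove_pg
        self._controller_id: Optional[str] = None
        self._pg_ids: List[str] = []

    def register_controller(self, controller_id: str) -> None:
        # Step a: called after the controller starts.
        self._controller_id = controller_id

    def register_placement_group(self, pg_id: str) -> None:
        # Step b: called after the worker group creates its placement group.
        self._pg_ids.append(pg_id)

    def poll_once(self) -> bool:
        """Step c: one monitor-loop iteration. Returns True if cleanup ran."""
        if self._controller_id is None or self._is_alive(self._controller_id):
            return False
        for pg_id in self._pg_ids:
            self._remove_pg(pg_id)
        self._pg_ids.clear()
        return True


alive = {"controller-1": True}
removed: List[str] = []
cleaner = PlacementGroupCleanerSketch(lambda cid: alive[cid], removed.append)
cleaner.register_controller("controller-1")
cleaner.register_placement_group("pg-1")
print(cleaner.poll_once())   # controller alive, nothing removed
alive["controller-1"] = False
print(cleaner.poll_once(), removed)
```

A real monitor would call `poll_once` on a timer; injecting the liveness and removal callbacks keeps the loop logic testable on its own.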

---------

Signed-off-by: Lehui Liu <[email protected]>
@sourcery-ai sourcery-ai bot left a comment

The pull request #694 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 6204.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a comprehensive update and refactoring of the project's build, test, and dependency management infrastructure. It transitions to a more modular CI/CD pipeline using new Buildkite configurations, standardizes Python dependency compilation with uv and raydepsets, and refines the Bazel build system. These changes aim to improve the reliability, maintainability, and efficiency of the development workflow, particularly for continuous integration and release processes.

Highlights

  • CI/CD Infrastructure Overhaul: Significant updates to the CI/CD pipelines, including the introduction of new Buildkite YAML files (_images.rayci.yml, dependencies.rayci.yml, doc.rayci.yml, release/_images.rayci.yml, release/custom-image-build-and-test-init.sh, release/test-init.sh) and the removal of older ones (releasebuild.rayci.yml, ci/ray_ci/core.tests.yml, ci/k8s/run-kuberay-doc-tests.sh, ci/ray_ci/rllib_contrib/rllib_contrib_ci.sh, ci/lint/format.sh, ci/keep_alive, ci/env/install-miniconda.sh, doc/BUILD). This refactoring aims to modularize and streamline the build and test processes.
  • Bazel Build System Enhancements: Extensive modifications to the Bazel build system, including updating the workspace name from com_github_ray_project_ray to io_ray, introducing platform-specific Redis binaries, and replacing copy_to_workspace rules with pkg_files and pkg_zip for packaging Ray components. Several RPC-related and Raylet-related C++ build definitions have been moved or refactored for better organization.
  • Python Dependency Management with uv and raydepsets: The build system now heavily leverages uv for Python dependency management, replacing miniconda with miniforge3 in Dockerfiles and shell scripts. A new raydepsets tool has been introduced for compiling and managing Python dependency sets, with dedicated configuration files (ci/raydepsets/configs/*.depsets.yaml) and pre-hooks for build processes.
  • Python Version Updates and Standardization: Python 3.10 has been integrated and standardized across many CI/CD configurations, including Dockerfiles and Buildkite matrix builds. Python 3.9 has been removed from several build matrices, indicating a shift towards newer Python versions.
  • Pre-commit Hooks and Linting Improvements: The .pre-commit-config.yaml has been significantly updated to include new hooks for semgrep, vale, cython-lint, and eslint, enhancing code quality and style enforcement. The format.sh script has been removed, indicating a move towards pre-commit for formatting.
  • C++ API and Runtime Changes: Modifications to the C++ API and runtime, including updates to ActorCreator, ActorTaskCaller, TaskCaller to use lang_type_ member, changes in RemoteFunctionHolder initialization, and adjustments to metric recording. The ConfigInternal now uses UniqueID::FromRandom().Hex() for namespace generation and ParseAddress/BuildAddress for bootstrap address handling.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR is a large-scale refactoring of the CI/CD pipeline and build system. Key changes include:

  • Modularizing Buildkite pipelines by splitting them into smaller, more focused files (e.g., for images, docs, dependencies).
  • Refactoring the root BUILD.bazel file, moving target definitions into subdirectories for better organization.
  • Updating dependency management, including switching from miniconda to miniforge, introducing uv for python package management, and centralizing dependency set definitions with a new raydepsets tool.
  • Modernizing build scripts, for example, by using pip wheel instead of setup.py bdist_wheel.
  • Dropping or reducing support for older configurations like Python 3.9 and x86_64 macOS.
  • Numerous updates to C++ code to adapt to API changes and for modernization (e.g., using std::invoke_result_t).

The changes are extensive and generally improve maintainability, reproducibility, and performance of the CI system. My main concern, detailed in a specific comment, is the simplification of except-tags in some test steps, which could have unintended side effects on test execution. Overall, this is a massive and impressive refactoring effort.

--workers "$${BUILDKITE_PARALLEL_JOB_COUNT}" --worker-id "$${BUILDKITE_PARALLEL_JOB}" --parallelism-per-worker 3
--only-tags compiled_graphs
--except-tags multi_gpu
--except-tags custom_setup,cgroup

medium

This change significantly simplifies the list of excluded tags. The previous list was very explicit (debug_tests, asan_tests, multi_gpu, spark_on_ray, etc.), while the new list is much more generic (custom_setup, cgroup). This could lead to tests being run in the wrong environment or being unintentionally skipped. For example, tests tagged with multi_gpu might now run in a non-GPU environment. Please verify that the new tagging strategy correctly excludes all tests that require special setups and that those tests are covered in other appropriate CI steps.
