Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-27
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits November 11, 2025 18:00
be consistent with the default build environment

Signed-off-by: Lonnie Liu <[email protected]>
## Description
- Renamed the `RAY_auth_mode` environment variable to `RAY_AUTH_MODE` across
the codebase
- Excluded healthcheck endpoints from authentication for Kubernetes
compatibility
- Fixed dashboard cookie handling to respect auth mode and clear stale
tokens when switching clusters

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…ls (ray-project#58424)

## Description
- Use a client interceptor to add auth tokens to gRPC calls when
`AUTH_MODE=token`
- `BuildChannel()` will automatically include the interceptor
- Removed the `auth_token` parameter from `ClientCallImpl`
- Removed manual auth from `python_gcs_subscriber.cc`
- Added tests to verify auth works for the autoscaler APIs

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…`) (ray-project#57090)

When actors terminate gracefully, Ray calls the actor's
`__ray_shutdown__()` method if defined, allowing for cleanup of
resources. However, this is not invoked when an actor goes out of scope
due to `del actor`.

### Why `del actor` doesn't invoke `__ray_shutdown__`

Traced through the entire code path, and here's what happens:

Flow when `del actor` is called:

1. **Python side**: `ActorHandle.__del__()` ->
`worker.core_worker.remove_actor_handle_reference(actor_id)`

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040

2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
`reference_counter_->RemoveLocalReference()`
- When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
callback

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506

3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
`AsyncReportActorOutOfScope()` to GCS

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51

4. **GCS receives notification**: `HandleReportActorOutOfScope()` 
- **THE PROBLEM IS HERE** ([line 279 in
`src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
   ```cpp
   DestroyActor(actor_id,
                GenActorOutOfScopeCause(actor),
                /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
                [reply, send_reply_callback]() {
   ```

5. **Actor worker receives kill signal**: `HandleKillActor()` in
[`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
   ```cpp
   if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
       ForceExit(...)  // Skips __ray_shutdown__
   } else {
       Exit(...)  // Would call __ray_shutdown__
   }
   ```

6. **ForceExit path**: Bypasses graceful shutdown -> No
`__ray_shutdown__` callback invoked.

This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
actors. It also updates the docs.
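
The decision the fix changes can be sketched in plain Python (hypothetical names; the actual logic lives in Ray's C++ GCS and core worker):

```python
# Hypothetical sketch of the kill-path decision described above.
# Before the fix, OUT_OF_SCOPE deaths were hardcoded to force_kill=True
# and skipped __ray_shutdown__; after the fix, only explicit force
# kills bypass the graceful path.

def should_force_kill(death_cause: str) -> bool:
    """Return True only for deaths that must skip graceful shutdown."""
    return death_cause == "RAY_KILL"

def handle_kill_actor(death_cause: str, actor) -> str:
    if should_force_kill(death_cause):
        return "ForceExit"           # skips __ray_shutdown__
    if hasattr(actor, "__ray_shutdown__"):
        actor.__ray_shutdown__()     # graceful cleanup hook runs here
    return "Exit"

class DemoActor:
    def __init__(self):
        self.cleaned_up = False
    def __ray_shutdown__(self):
        self.cleaned_up = True

actor = DemoActor()
print(handle_kill_actor("OUT_OF_SCOPE", actor))  # Exit
print(actor.cleaned_up)                          # True
```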

---------

Signed-off-by: Sagar Sumit <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
Currently, a node is considered idle while pulling objects from the
remote object store. This can lead to situations where a node is
terminated as idle, causing the cluster to enter an infinite loop when
pulling large objects that exceed the node idle termination timeout.

This PR fixes the issue by treating object pulling as a busy activity.
Note that nodes can still accept additional tasks while pulling objects
(since pulling consumes no resources), but the auto-scaler will no
longer terminate the node prematurely.
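
The idle-test change can be sketched as follows (assumed names; not the actual autoscaler code):

```python
# Minimal sketch of treating object pulls as busy activity: a node with
# in-progress pulls no longer counts as idle, even though pulls consume
# no scheduling resources.

from dataclasses import dataclass

@dataclass
class NodeState:
    running_tasks: int = 0
    active_object_pulls: int = 0

def is_idle(node: NodeState) -> bool:
    # Before the fix, only running_tasks was considered, so a node
    # pulling a large object could be terminated mid-transfer.
    return node.running_tasks == 0 and node.active_object_pulls == 0

print(is_idle(NodeState(running_tasks=0, active_object_pulls=2)))  # False
print(is_idle(NodeState()))                                        # True
```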

Closes ray-project#54372

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
…_FACTOR` to 2 (ray-project#58262)

## Description

This was originally set to align with the previous default of 4.

However, after some consideration I've realized that 4 is too high, so
this lowers it to 2.

Signed-off-by: Alexey Kudinkin <[email protected]>
…y-project#58523)

## Description

This PR improves documentation consistency in the `python/ray/data`
module by converting all remaining rST-style docstrings (`:param:`,
`:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).
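
An illustrative before/after for the conversion (not one of the actual functions changed in this PR):

```python
# rST-style docstring (the format being removed):
def scale_rst(value, factor):
    """Scale a value.

    :param value: The number to scale.
    :param factor: The multiplier.
    :return: The scaled value.
    """
    return value * factor

# Google-style docstring (the format being adopted):
def scale_google(value, factor):
    """Scale a value.

    Args:
        value: The number to scale.
        factor: The multiplier.

    Returns:
        The scaled value.
    """
    return value * factor

assert ":param" not in scale_google.__doc__ and "Args:" in scale_google.__doc__
```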

## Additional information

**Files modified:**
- `python/ray/data/preprocessors/utils.py` - Converted
`StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted
`unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`

Signed-off-by: Balaji Veeramani <[email protected]>
…oject#58549)

## Description

The original `test_concurrency` function combined multiple test
scenarios into a single test with complex control flow and expensive Ray
cluster initialization. This refactoring extracts the parameter
validation tests into focused, independent tests that are faster,
clearer, and easier to maintain.

Additionally, the original test included "validation" cases that tested
valid concurrency parameters but didn't actually verify that concurrency
was being limited correctly: they only checked that the output was
correct, which isn't useful for validating the concurrency feature
itself.

**Key improvements:**
- Split validation tests into `test_invalid_func_concurrency_raises` and
`test_invalid_class_concurrency_raises`
- Use parametrized tests for different invalid concurrency values
- Switch from `shutdown_only` with explicit `ray.init()` to
`ray_start_regular_shared` to eliminate cluster initialization overhead
- Minimize test data from 10 blocks to 1 element since we're only
validating parameter errors
- Remove non-validation tests that didn't verify concurrency behavior

## Related issues

N/A

## Additional information

The validation tests now execute significantly faster and provide
clearer failure messages. Each test has a single, well-defined purpose
making maintenance and debugging easier.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Previously it was actually using 0.4.0, which is set up by the grpc
repo; the declaration in the workspace file was being shadowed.

Signed-off-by: Lonnie Liu <[email protected]>
## Description
Creates a ranker interface that will rank the best operator to run next
in `select_operator_to_run`. This change only refactors the existing
code. The ranking value must be comparable.
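
A sketch of what such an interface might look like (hypothetical names; the actual Ray Data code differs):

```python
# Each ranker maps an operator to a comparable key, and the scheduler
# picks the max-ranked runnable operator.

from typing import Protocol, Sequence, Tuple

class Ranker(Protocol):
    def rank(self, op: str) -> Tuple:
        """Return any comparable ranking value for the operator."""
        ...

class PreferSmallQueues:
    """Ranks operators so those with the smallest input queue run first."""
    def __init__(self, queue_sizes):
        self.queue_sizes = queue_sizes
    def rank(self, op):
        # Negate so that "best" is the max, matching max()-based selection.
        return (-self.queue_sizes[op],)

def select_operator_to_run(ops: Sequence[str], ranker: Ranker) -> str:
    return max(ops, key=ranker.rank)

ranker = PreferSmallQueues({"read": 10, "map": 2, "write": 5})
print(select_operator_to_run(["read", "map", "write"], ranker))  # map
```

Because the ranking value only needs to be comparable, different policies (tuples, floats, enums) can be swapped in without touching the selection loop.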

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <[email protected]>
…#57783)

1. JaxTrainer relies on the runtime env var `JAX_PLATFORMS` being set to
initialize jax.distributed:
https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
2. Before this change, users had to configure both `use_tpu=True` in
`ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able to
start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
3. If users use other jax.distributed libraries like Orbax, this can
sometimes lead to misleading errors about distributed initialization.
4. After this change, if a user sets `use_tpu=True`, we automatically
add `tpu` to the env var.
5. A TPU unit test is not available at this time; we will explore how to
cover it later.
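
The behavior in point 4 can be sketched as follows (assumed helper name; the real change lives in Ray Train's JAX config):

```python
# When use_tpu=True, ensure "tpu" is present in the comma-separated
# JAX_PLATFORMS env var without clobbering platforms the user set.

def resolve_jax_platforms(env: dict, use_tpu: bool) -> dict:
    env = dict(env)  # don't mutate the caller's mapping
    if use_tpu:
        platforms = [p for p in env.get("JAX_PLATFORMS", "").split(",") if p]
        if "tpu" not in platforms:
            platforms.append("tpu")
        env["JAX_PLATFORMS"] = ",".join(platforms)
    return env

print(resolve_jax_platforms({}, use_tpu=True))                    # {'JAX_PLATFORMS': 'tpu'}
print(resolve_jax_platforms({"JAX_PLATFORMS": "cpu"}, True))      # {'JAX_PLATFORMS': 'cpu,tpu'}
```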


---------

Signed-off-by: Lehui Liu <[email protected]>
and ask people to use that lock file for building docs.

Signed-off-by: Lonnie Liu <[email protected]>
…regression (ray-project#58390)

## Description
This PR addresses the performance regression introduced in the [PR to
make ray.get thread safe](ray-project#57911). Specifically, that PR
required the worker to block and wait for AsyncGet to return with a
reply containing the request id needed to correctly clean up get
requests. This additional synchronous step caused the plasma store Get
to regress in performance.

This PR moves the request id generation step to the plasma store,
removing the blocking step to fix the perf regression.

## Related issues
- [PR which introduced perf
regression](ray-project#57911)
- [PR which observed the
regression](ray-project#58175)

## Additional information
New performance of the change measured by `ray microbenchmark`.
<img width="485" height="17" alt="image"
src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0"
/>

Original performance prior to the change. Here we focus on the
regressing `single client get calls (Plasma Store)` metric: the new
implementation returns to the original ~10k per second range, compared
to the existing sub-5k per second.
<img width="811" height="355" alt="image"
src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c"
/>

---------

Signed-off-by: davik <[email protected]>
Co-authored-by: davik <[email protected]>
## Description
Support token auth in the Ray Client server by using the existing gRPC
interceptors. This PR refactors the code to:
- Add/rename sync and async client and server interceptors
- Create gRPC utils to house gRPC channel and server creation logic;
the Python codebase is updated to use these methods
- Separate tests for sync and async interceptors
- Make the existing authentication integration tests run in RAY_CLIENT
mode

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…oject#58371)

## Description
Currently Ray Data has a preprocessor called `RobustScaler`. This scales
the data based on given quantiles. Calculating the quantiles involves
sorting the entire dataset by column for each column (C sorts for C
columns), which, for a large dataset, requires a lot of computation.

**MAJOR EDIT**: had to replace the original `tdigest` with `ddsketch`,
as I couldn't actually find a well-maintained tdigest library for
Python; ddsketch is better maintained.

**MAJOR EDIT 2**: discussed offline to use the `ApproximateQuantile`
aggregator
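
For context, this is what the scaler computes once the (approximate) quantiles are known. The exact-sort quantile below is a stand-in sketch for the `ApproximateQuantile` aggregator, not Ray Data's implementation:

```python
# RobustScaler-style transform: center by the median and scale by the
# interquartile range (IQR), which is robust to outliers like 100 below.

def quantile(sorted_vals, q):
    # Nearest-rank style quantile over an already-sorted list.
    idx = min(int(q * (len(sorted_vals) - 1) + 0.5), len(sorted_vals) - 1)
    return sorted_vals[idx]

def robust_scale(values, low=0.25, high=0.75):
    s = sorted(values)
    median = quantile(s, 0.5)
    iqr = (quantile(s, high) - quantile(s, low)) or 1.0  # guard zero IQR
    return [(v - median) / iqr for v in values]

print(robust_scale([1, 2, 3, 4, 100]))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```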

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: kyuds <[email protected]>
Signed-off-by: Daniel Shin <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>
Generating depsets for base extra python requirements, and installing
requirements in the base extra image.

---------

Signed-off-by: elliot-barn <[email protected]>
…#58499)

which is more accurate

Also moves python constraint definitions into the `bazel/` directory and
registers the python 3.10 platform with the hermetic toolchain.

This allows migrating from python 3.9 to python 3.10 incrementally.

Signed-off-by: Lonnie Liu <[email protected]>
We no longer support building with python 3.9.

Signed-off-by: Lonnie Liu <[email protected]>
…ct#58576)

Using GetNodeAddressAndLiveness in raylet client pool instead of the
bulkier Get, same for AsyncGetAll. Seems like it was already done in
core worker client pool, so just making the same change for raylet
client pool.

Signed-off-by: joshlee <[email protected]>
…bles (ray-project#58270)

## Description
- Support upserting iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table

Upgrades to pyiceberg 0.10.0 because it now supports upsert and
overwrite functionality. Also, for append, the library now handles the
transaction logic implicitly, so that burden can be lifted from Ray Data.

---------

Signed-off-by: Goutam <[email protected]>
as the pydantic version is pinned in `requirements-doc.txt` now.

Signed-off-by: Lonnie Liu <[email protected]>
nothing is using it anymore

Signed-off-by: Lonnie Liu <[email protected]>
…58580)

Adds an optional `include_setuptools` flag for depset configuration.

If the flag is set on a depset config, `--unsafe-package setuptools`
will not be included for depset compilation.

If the flag is absent (default false) on a depset config,
`--unsafe-package setuptools` is appended to the default arguments.
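
The flag logic can be sketched like this (assumed config shape; the real code builds `pip-compile`/uv arguments):

```python
# Append --unsafe-package setuptools to the compile arguments unless the
# depset config opts in to including setuptools.

def build_compile_args(depset_config: dict, base_args=()):
    args = list(base_args)
    if not depset_config.get("include_setuptools", False):
        args += ["--unsafe-package", "setuptools"]
    return args

print(build_compile_args({}))                            # ['--unsafe-package', 'setuptools']
print(build_compile_args({"include_setuptools": True}))  # []
```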

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
otherwise, the newer docker client will refuse to communicate with the
docker daemon that is on an older version.

Signed-off-by: Lonnie Liu <[email protected]>
…ay-project#58542)

## What does this PR do?

Fixes HTTP streaming file downloads in Ray Data's download operation.
Some URIs (especially HTTP streams) require `open_input_stream` instead
of `open_input_file`.

## Changes

- Modified `download_bytes_threaded` in `plan_download_op.py` to try
both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
- Failed downloads now return `None` gracefully instead of crashing

## Testing
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded

# Test URL: a valid HTTP streaming endpoint
urls = [
    "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]

# Create PyArrow table and call download function
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))

# Check results
result_table = results[0]
for i in range(result_table.num_rows):
    url = result_table['url'][i].as_py()
    bytes_data = result_table['bytes'][i].as_py()
    
    if bytes_data is None:
        print(f"Row {i}: FAILED (None) - try-catch worked ✓")
    else:
        print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
    print(f"  URL: {url[:60]}...")

print("\n✅ Test passed: Failed downloads return None instead of crashing.")
```

Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
    test_download_expression_with_streaming_fallback()
  File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
    with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
    if not self.__exit__(*sys.exc_info()):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
    setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
  File "/home/ray/default/test.py", line 16, in <module>
    results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
    uri_bytes = list(
                ^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
  URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
   
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
previously failed:
- ✅ Successfully downloads HTTP stream files
- ✅ Gracefully handles failed downloads (returns None)
- ✅ Maintains backward compatibility with existing file downloads

---------

Signed-off-by: xyuzh <[email protected]>
Signed-off-by: Robert Nishihara <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
## Description

We today have very little observability into pubsub. On a raylet, one of
the most important pieces of state that needs to be propagated through
the cluster via pubsub is cluster membership: all raylets should, in an
eventual but timely fashion, agree on the list of available nodes. This
metric just emits a simple counter to keep track of the node count.

More pubsub observability to come.

---------

Signed-off-by: zac <[email protected]>
Signed-off-by: Zac Policzer <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
jjyao and others added 21 commits November 25, 2025 13:54
…ashboard's reporter_head. (ray-project#58978)

There's no user of this endpoint in the codebase. This has the added
benefit of reducing the surface area for our cython-bindings for
GcsClient by removing ActorInfoAccessor::AsyncKillActor.

Signed-off-by: irabbani <[email protected]>
…ject#58325)

Following a worker failure or a user abort during a Train job, the
execution of sharded datasets (provided through `get_dataset_shard`) is
ungracefully shut down. Consequently, any ongoing resource request made
by a sharded dataset's SplitCoordinator to the AutoscalingRequester is
not cancelled. This can result in resources being held for a preset
timeout, leading to inefficient cluster utilization and slower train job
turnarounds.

- Implements an eager shutdown path to clean up resource requests made
to the AutoscalingRequester (depicted below)
- Adds new WorkerGroupCallback hooks (`after_worker_group_abort` and
`after_worker_group_shutdown`) to DatasetsSetupCallback for the new
shutdown path
- Implements tests for the new cleanup path
- Implements tests for the new cleanup path

---------

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Justin Yu <[email protected]>
…ay-project#58947)

This PR was motivated by ray-project#58018, where we call methods of the
GCS node info accessor potentially from the user's Python cancel thread,
potentially causing thread safety issues. I did the trivial solution of
adding a mutex onto the node_cache_address_and_liveness_ cache. The one
downside of this is that instead of returning pointers to the
GcsNodeAddressAndLiveness objects in the cache, I return them by value.
I didn't want to allow access to the mutex that guards the cache outside
of the accessor, since I think that's a bad precedent and will create a
mess.

---------

Signed-off-by: joshlee <[email protected]>
1. The jax dependency was introduced in
ray-project#58322
2. The current test environment is for CUDA 12.1, which limits the jax
version to below 0.4.14.
3. jax <= 0.4.14 does not support py 3.12.
4. Skip the jax test if it runs against py3.12+.

Signed-off-by: Lehui Liu <[email protected]>
…ics (ray-project#58870)

When reporting a checkpoint to Ray Train, every worker needs to form a
barrier with a `ray.train.report` call. If every worker reports an empty
checkpoint, we should notify the condition to unblock
`ray.train.get_all_reported_checkpoint` calls.

Before this fix, reporting an empty checkpoint and calling
`get_all_reported_checkpoints` would result in a hang.
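
The barrier fix can be sketched with standard threading primitives (Ray Train's actual implementation differs):

```python
# Notify the condition once *all* workers have reported, whether or not
# their checkpoints are empty (None).

import threading

class CheckpointBarrier:
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.reported = 0
        self.cond = threading.Condition()

    def report(self, checkpoint=None):
        with self.cond:
            # Before the fix, a round of all-empty checkpoints never
            # notified, so waiters hung. Count every report, empty or not.
            self.reported += 1
            if self.reported == self.num_workers:
                self.cond.notify_all()

    def get_all_reported_checkpoints(self, timeout=5.0):
        with self.cond:
            return self.cond.wait_for(
                lambda: self.reported == self.num_workers, timeout
            )

barrier = CheckpointBarrier(num_workers=2)
barrier.report(None)  # worker 0 reports an empty checkpoint
barrier.report(None)  # worker 1 reports an empty checkpoint
print(barrier.get_all_reported_checkpoints())  # True (no hang)
```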

---------

Signed-off-by: Timothy Seah <[email protected]>
Optimization for the case where the list has only one timeseries:
O(n log n) -> O(1).
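
The fast path can be sketched as follows (illustrative names, not the actual code):

```python
# Merging k sorted timeseries normally costs O(n log n) (or a heap
# merge), but when the list holds exactly one series it can be returned
# as-is in O(1).

import heapq

def merge_timeseries(series_list):
    if len(series_list) == 1:
        return series_list[0]          # fast path: no merge needed
    return list(heapq.merge(*series_list))

print(merge_timeseries([[1, 5, 9]]))       # [1, 5, 9]
print(merge_timeseries([[1, 5], [2, 3]]))  # [1, 2, 3, 5]
```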

Signed-off-by: abrar <[email protected]>
…GNAL_HANDLER option (ray-project#58984)

Ray's Abseil failure signal handler conflicts with JVM's internal signal
handling. When PyArrow's HadoopFileSystem starts a JVM, and then Ray
installs its signal handler, JVM's normal internal operations (for
safepoint synchronization) trigger Ray's handler, causing a crash.

It seems we already have some code that disables this for the Java
bindings (there's a TODO about it in io_ray_runtime_RayNativeRuntime.cc).
This is all well and good when we realize we're running in a Java
runtime, but for the bug report I'm not sure how obvious it is (maybe we
should consider doing something like this in Data itself automatically?)
Unsure. For our part, if we know the datasource is from HDFS, that seems
as likely a case as any that this might be happening. But it's only
relevant if you're running the JVM in the same process (I was only able
to reproduce this by running everything in the same process). One of the
comments in the bug ticket points out that they're able to skirt the
error by doing process isolation (which ended up being the hint I
needed).

**What's the catch?**
When this config is set we essentially won't call WriteFailureMessage()
on a crash, which means we won't get a full trace on termination or
flush full logs. So just setting this all the time isn't a great
solution (and is essentially why I've thrown it behind a config). We
might want to revisit this for HDFS users. That said, the JVM core dump
info is somewhat decent, so we're not totally in the dark.

## Related issues

Fixes ray-project#36415

---------

Signed-off-by: zac <[email protected]>
…ject#57639)

Enable zero-copy serialization for all PyTorch tensors by setting
`RAY_ENABLE_ZERO_COPY_TORCH_TENSORS=1` to accelerate serialization.

Example test script:
```python
import os

# Must be set before `import ray` to ensure that the zero-copy tensor pickle reducer
# is properly registered in driver.
os.environ["RAY_ENABLE_ZERO_COPY_TORCH_TENSORS"] = "1"

import ray
import torch
from datetime import datetime

ray.init(runtime_env={"env_vars": {"RAY_ENABLE_ZERO_COPY_TORCH_TENSORS": "1"}})

@ray.remote
def process(tensor):
    return tensor.sum()

x = torch.ones(1024, 1024, 256)

start_time = datetime.now()
x_ref = process.remote(x)
result = ray.get(x_ref)
time_diff = datetime.now() - start_time

print(f"result      : {result}")
print(f"between time: {time_diff.total_seconds()}s")
print(f"result type : {type(result)}")
```
Below are the performance gains and validation results:
<img width="1977" height="965" alt="zuizhongxiaoguo"
src="https://github.com/user-attachments/assets/e3d5210c-142d-4ec3-908c-fe590514cfc8"
/>

Closes ray-project#56740 ray-project#26229

---------

Signed-off-by: Haichuan Hu <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
and performing some styling cleanup

Signed-off-by: Lonnie Liu <[email protected]>
…oughput stats (ray-project#58693)

This PR makes three improvements to Ray Data's throughput statistics:

1. **Makes `test_dataset_throughput` deterministic**: The original test
was flaky because it relied on actual task
execution timing. This PR rewrites it as unit tests
(`test_dataset_throughput_calculation` and
`test_operator_throughput_calculation`) using mocked `BlockStats`
objects, making the tests fast and reliable.

2. **Removes "Estimated single node throughput" from Dataset-level
stats**: This metric was misleading at the
dataset level since it summed wall times across all operators, which
doesn't accurately represent single-node
performance. The "Ray Data throughput" metric (total rows / total wall
time) remains and provides the meaningful
  dataset-level throughput.

3. **Renames "Estimated single node throughput" to "Estimated single
task throughput"**: At the operator level, this metric divides total
rows by the sum of task wall times. The new name more accurately
reflects what it measures: the throughput if all work were done by a
single task serially.
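
The distinction between the two metrics can be sketched as follows (illustrative numbers, not measurements from this PR):

```python
# Dataset throughput divides by total wall time, while "single task
# throughput" divides by the *sum* of per-task wall times, i.e. as if
# one task did all the work serially.

def dataset_throughput(total_rows, total_wall_time_s):
    return total_rows / total_wall_time_s

def single_task_throughput(total_rows, task_wall_times_s):
    return total_rows / sum(task_wall_times_s)

# 1000 rows processed by 4 parallel tasks of 2s each, 2s overall:
print(dataset_throughput(1000, 2.0))            # 500.0 rows/s
print(single_task_throughput(1000, [2.0] * 4))  # 125.0 rows/s
```

With perfect 4-way parallelism the two differ by exactly the parallelism factor, which is why summing wall times across operators was misleading at the dataset level.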

---------

Signed-off-by: dancingactor <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
## Description
This PR removes the deprecated read_parquet_bulk API from Ray Data,
along with its implementation and documentation. This function was
deprecated in favor of read_parquet, which now covers all equivalent use
cases. The deprecation warning stated removal after May 2025, and that
deadline has passed, so this cleanup reduces maintenance burden and
prevents user confusion.

Summary of changes

- Removed read_parquet_bulk from read_api.py and __init__.py
- Deleted ParquetBulkDatasource + its file
- Removed related tests and documentation
- Updated references and docstrings mentioning the deprecated API

## Related issues
Fixes ray-project#58969

---------

Signed-off-by: rushikesh.adhav <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
…8694)

## Description
The test fails intermittently with an assertion error indicating that
the internal input queue for a MapBatches operator is not empty when
it's expected to be. This suggests a race condition or timing issue in
the streaming executor's queue management.

## Related issues
Closes ray-project#58546 

## Additional information

---------

Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
…ect#58890)

## Description
RLlib is divided on its testing structure: we have the `/tests` folder
as well as per-component `<component>/tests` folders, mixing the two
project structures. This PR commits to the component style, moving
`tests/` files to their component folders.

## Related issues
General project structure improvements

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: Kamil Kaczmarek <[email protected]>
## Description
- Convert debug logs in authentication_token_loader to info so that
users are aware of where the token being used is loaded from
- When we raise the `AuthenticationError`, if RAY_AUTH_MODE is not set
to token, explicitly say so in the error message
- In error messages, suggest storing tokens in the filesystem instead of
env vars
- Add state API tests in test_token_auth_integration.py

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
…round (ray-project#59004)

## Description
Remove the dual local/CI hook configuration for pydoclint and simplify
back to a single hook. The workaround was needed due to a bug, but this was fixed in
pydoclint `0.8.3` 🎉

- Upgrade pydoclint from `0.8.1` to `0.8.3`
- Remove separate `pydoclint-local` and `pydoclint-ci` hooks
- Simplify CI lint script to use standard pre-commit run



Signed-off-by: Thomas Desrosiers <[email protected]>
The jax dependency was introduced in
ray-project#58322
The current test environment is for CUDA 12.1, which limits the jax
version to below 0.4.14.
jax <= 0.4.14 does not support py 3.12.
Skip the jax test if it runs against py3.12+.

Signed-off-by: elliot-barn <[email protected]>
Patch release fixes a post-merge bug that was causing GPU batch tests to
fail.

- Add explicit transformers>=4.57.3 dependency to ray[llm]
- Update hashes


## Related issues
anyscale#547


---------

Signed-off-by: Nikhil Ghosh <[email protected]>

@sourcery-ai sourcery-ai bot left a comment

The pull request #690 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 6149.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive overhaul of the Ray project's build and CI systems. It migrates the build system to Bazel, implements modern dependency management with uv and raydepsets, restructures the CI pipeline for better maintainability, enhances Docker image builds, and improves code formatting and linting. These changes collectively aim to improve the development experience, ensure code quality, and enhance the reliability and efficiency of the CI process.

Highlights

  • Bazel Build System: Migrated the project to use Bazel for building and testing, enabling more efficient and reproducible builds.
  • Dependency Management: Introduced uv and raydepsets for managing Python dependencies, ensuring consistent dependency versions across environments.
  • CI Pipeline Restructuring: Refactored the CI pipeline into more modular and maintainable components, improving the overall efficiency and reliability of the testing process.
  • Docker Image Builds: Enhanced Docker image builds by adding support for multiple architectures (x86_64, aarch64) and Python versions, and by optimizing image layering for faster build times.
  • Code Formatting and Linting: Adopted pre-commit hooks and semgrep for code formatting and linting, ensuring code quality and consistency.
  • Testing Improvements: Improved testing by adding new tests, refactoring existing tests, and enabling better test isolation and parallelization.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main, incorporating a large number of changes across the repository. The changes primarily focus on refactoring and improving the CI/CD pipeline, build system, and dependency management. Key improvements include modularizing Buildkite pipelines, migrating to uv and miniforge for Python dependency management, and refactoring Bazel build rules to use rules_pkg. There are also various code style improvements and modernizations in the C++ codebase. The changes are extensive and appear to be well-coordinated improvements to the project's infrastructure. I have one suggestion to improve encapsulation in a test utility class.

Comment on lines 585 to +593

```diff
 class Pip {
  public:
-  std::vector<std::string> packages;
-  bool pip_check = false;
+  std::vector<std::string> packages_;
+  bool pip_check_ = false;
   Pip() = default;
   Pip(const std::vector<std::string> &packages, bool pip_check)
-      : packages(packages), pip_check(pip_check) {}
+      : packages_(packages), pip_check_(pip_check) {}
```


medium

For better encapsulation, the member variables of the Pip class should be private. The to_json and from_json functions can be declared as friends to allow them access to the private members. This prevents direct modification of the members from outside the class, making the code more robust and maintainable.

Suggested change

```cpp
class Pip {
 public:
  Pip() = default;
  Pip(const std::vector<std::string> &packages, bool pip_check)
      : packages_(packages), pip_check_(pip_check) {}

  friend void to_json(nlohmann::json &j, const Pip &pip);
  friend void from_json(const nlohmann::json &j, Pip &pip);

 private:
  std::vector<std::string> packages_;
  bool pip_check_ = false;
};
```
