Daily merge: master → main 2025-12-01 #692
base: main
…#57783)
1. JaxTrainer relies on the runtime env var `JAX_PLATFORMS` being set in order to initialize jax.distributed: https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38
2. Before this change, users had to both configure `use_tpu=True` in `ray.train.ScalingConfig` and pass `JAX_PLATFORMS=tpu` to be able to start jax.distributed. `JAX_PLATFORMS` can be a comma-separated string.
3. If the user uses other jax.distributed libraries such as Orbax, this can sometimes lead to a misleading error about distributed initialization.
4. After this change, if the user sets `use_tpu=True`, we automatically add this to the env var.
5. A TPU unit test is not available at this time; we will explore how to cover it later.
---------
Signed-off-by: Lehui Liu <[email protected]>
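The behavior described in item 4 can be sketched roughly as follows. This is an illustration only, not the code in `ray/train/v2/jax/config.py`: the helper name `ensure_tpu_platform` is hypothetical, and the only assumption taken from the description is that `JAX_PLATFORMS` is a comma-separated string. Appending `tpu` at the end (rather than the front) is also an arbitrary choice made for this sketch.

```python
def ensure_tpu_platform(env: dict, use_tpu: bool) -> dict:
    """Return a copy of `env` with "tpu" present in JAX_PLATFORMS when use_tpu is set.

    Hypothetical helper for illustration; not Ray's actual implementation.
    """
    new_env = dict(env)
    if not use_tpu:
        return new_env
    # JAX_PLATFORMS may hold several platforms separated by commas.
    raw = new_env.get("JAX_PLATFORMS", "")
    platforms = [p.strip() for p in raw.split(",") if p.strip()]
    if "tpu" not in [p.lower() for p in platforms]:
        platforms.append("tpu")
    new_env["JAX_PLATFORMS"] = ",".join(platforms)
    return new_env
```

With a helper like this, a user who sets only `use_tpu=True` would no longer need to pass `JAX_PLATFORMS=tpu` by hand, and an existing value such as `cpu` would be preserved.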
and ask people to use that lock file for building docs. Signed-off-by: Lonnie Liu <[email protected]>
…regression (ray-project#58390)
## Description
This PR addresses the performance regression introduced in the [PR to make ray.get thread safe](ray-project#57911). Specifically, the previous PR requires the worker to block and wait for AsyncGet to return with a reply containing the request id needed to correctly clean up get requests. This additional synchronous step causes the plasma store Get to regress in performance. This PR moves the request id generation step to the plasma store, removing the blocking step and fixing the perf regression.
## Related issues
- [PR which introduced perf regression](ray-project#57911)
- [PR which observed the regression](ray-project#58175)
## Additional information
New performance of the change measured by `ray microbenchmark`.
<img width="485" height="17" alt="image" src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0" />
Original performance prior to the change. Here we focus on the regressing `single client get calls (Plasma Store)` metric, where the new performance returns to the original 10k per second range, compared to the existing sub-5k per second.
<img width="811" height="355" alt="image" src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c" />
---------
Signed-off-by: davik <[email protected]>
Co-authored-by: davik <[email protected]>
## Description
Support token auth in the Ray client server by using the existing gRPC interceptors. This PR refactors the code to:
- add/rename sync and async client and server interceptors
- create grpc utils to house grpc channel and server creation logic; the python codebase is updated to use these methods
- separate tests for sync and async interceptors
- make existing authentication integration tests run with RAY_CLIENT mode
---------
Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…oject#58371)
## Description
Currently Ray Data has a preprocessor called `RobustScaler`, which scales the data based on given quantiles. Calculating the quantiles involves sorting the entire dataset by column, once per column (C sorts for C columns), which is expensive for a large dataset.
**MAJOR EDIT**: had to replace the original `tdigest` with `ddsketch`, as I couldn't find well-maintained tdigest libraries for Python. ddsketch is better maintained.
**MAJOR EDIT 2**: discussed offline to use the `ApproximateQuantile` aggregator.
## Related issues
N/A
## Additional information
N/A
---------
Signed-off-by: kyuds <[email protected]>
Signed-off-by: Daniel Shin <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>
generating depsets for base extra python requirements Installing requirements in base extra image --------- Signed-off-by: elliot-barn <[email protected]>
…#58499) which is more accurate. Also moves the python constraint definitions into the `bazel/` directory and registers the python 3.10 platform with the hermetic toolchain. This allows performing the migration from python 3.9 to python 3.10 incrementally. Signed-off-by: Lonnie Liu <[email protected]>
we stop supporting building with python 3.9 now Signed-off-by: Lonnie Liu <[email protected]>
…ct#58576) Using GetNodeAddressAndLiveness in the raylet client pool instead of the bulkier Get; same for AsyncGetAll. It seems this was already done in the core worker client pool, so this just makes the same change for the raylet client pool. Signed-off-by: joshlee <[email protected]>
…bles (ray-project#58270)
## Description
- Support upserting iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table

Upgrades to pyiceberg 0.10.0 because it now supports upsert and overwrite functionality. Also, for append, the library now handles the transaction logic implicitly, so that burden can be lifted from Ray Data.
## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234".
## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.
---------
Signed-off-by: Goutam <[email protected]>
) Reverts ray-project#58535 failing on windows.. :(
as the pydantic version is pinned in `requirements-doc.txt` now. Signed-off-by: Lonnie Liu <[email protected]>
…ect#58577) Signed-off-by: irabbani <[email protected]>
nothing is using it anymore Signed-off-by: Lonnie Liu <[email protected]>
…58580) Adds an optional `include_setuptools` flag for depset configuration.
If the flag is set on a depset config, `--unsafe-package setuptools` will not be included for depset compilation.
If the flag does not exist on a depset config (default false), `--unsafe-package setuptools` will be appended to the default arguments.
---------
Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
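As a rough sketch of the flag semantics described above (the function name `build_compile_args` and its signature are hypothetical and not taken from the depset tooling), the argument list could be assembled like this:

```python
from typing import List, Sequence


def build_compile_args(
    default_args: Sequence[str], include_setuptools: bool = False
) -> List[str]:
    """Hypothetical helper: assemble compile arguments for one depset config."""
    args = list(default_args)
    if not include_setuptools:
        # Default behavior: treat setuptools as an unsafe package.
        args += ["--unsafe-package", "setuptools"]
    return args
```

Only the `--unsafe-package setuptools` argument and the `include_setuptools` flag name come from the description; everything else is an assumption for illustration.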
otherwise, the newer docker client will refuse to communicate with the docker daemon that is on an older version. Signed-off-by: Lonnie Liu <[email protected]>
…ay-project#58542)
## What does this PR do?
Fixes HTTP streaming file downloads in Ray Data's download operation. Some URIs (especially HTTP streams) require `open_input_stream` instead of `open_input_file`.
## Changes
- Modified `download_bytes_threaded` in `plan_download_op.py` to try both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
- Failed downloads now return `None` gracefully instead of crashing
## Testing
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded

# Test URLs: one valid, one 404
urls = [
    "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]

# Create PyArrow table and call download function
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))

# Check results
result_table = results[0]
for i in range(result_table.num_rows):
    url = result_table['url'][i].as_py()
    bytes_data = result_table['bytes'][i].as_py()
    if bytes_data is None:
        print(f"Row {i}: FAILED (None) - try-catch worked ✓")
    else:
        print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
    print(f" URL: {url[:60]}...")
print("\n✓ Test passed: Failed downloads return None instead of crashing.")
```
Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
    test_download_expression_with_streaming_fallback()
  File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
    with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
    if not self.__exit__(*sys.exc_info()):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
    setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
  File "/home/ray/default/test.py", line 16, in <module>
    results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
    uri_bytes = list(
                ^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
    for result in fn(input_queue_iter):
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
    yield f.read()
          ^^^^^^^^
  File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
  File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
  File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
 URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that previously failed:
- ✓ Successfully downloads HTTP stream files
- ✓ Gracefully handles failed downloads (returns None)
- ✓ Maintains backward compatibility with existing file downloads
---------
Signed-off-by: xyuzh <[email protected]>
Signed-off-by: Robert Nishihara <[email protected]>
Co-authored-by: Robert Nishihara <[email protected]>
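The fallback described in this commit can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual change to `download_bytes_threaded`: the helper name `read_uri_bytes` is hypothetical, and it catches every exception broadly, whereas the PR says it distinguishes between error types. It relies only on the filesystem object exposing `open_input_file` and `open_input_stream`, as `pyarrow.fs.FileSystem` does.

```python
def read_uri_bytes(fs, path):
    """Hypothetical helper: read all bytes at `path`, or return None on failure.

    Tries random-access reading first, then falls back to streaming, because
    some sources (e.g. streaming HTTP) cannot seek.
    """
    for opener_name in ("open_input_file", "open_input_stream"):
        try:
            with getattr(fs, opener_name)(path) as f:
                return f.read()
        except Exception:
            # The open or the read failed; try the next strategy.
            continue
    # Both strategies failed: signal failure without raising.
    return None
```

Returning `None` instead of raising matches the "failed downloads return `None` gracefully" behavior listed above, at the cost of hiding the underlying error from the caller.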
## Description
Today we have very little observability into pubsub. On a raylet, one of the most important states that needs to be propagated through the cluster via pubsub is cluster membership. All raylets should, in an eventual BUT timely fashion, agree on the list of available nodes. This metric just emits a simple counter to keep track of the node count. More pubsub observability to come.
## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234".
## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.
---------
Signed-off-by: zac <[email protected]>
Signed-off-by: Zac Policzer <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
all tests are passing Signed-off-by: Lonnie Liu <[email protected]>
…#58587) also stops building python 3.9 aarch64 images Signed-off-by: Lonnie Liu <[email protected]>
so that importing test.py does not always import github. The github repo imports jwt, which then imports cryptography and can lead to issues on windows. Signed-off-by: Lonnie Liu <[email protected]>
this makes it possible to run on a different python version than the CI wrapper code. Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Lonnie Liu <[email protected]>
…ecurity (ray-project#58591)
Migrates Ray dashboard authentication from JavaScript-managed cookies to server-side HttpOnly cookies to enhance security against XSS attacks. This addresses code review feedback to improve the authentication implementation (ray-project#58368).
Main changes:
- the authentication middleware first looks for the `Authorization` header; if it is not found, it then looks at the cookies for the auth token
- new `api/authenticate` endpoint for verifying the token and setting the auth token cookie (with `HttpOnly=true`, `SameSite=Strict` and `secure=true` (when using https))
- removed JavaScript-based cookie manipulation utils and axios interceptors (previously responsible for setting cookies)
- cookies are deleted when connecting to a cluster with `AUTH_MODE=disabled`. Connecting to a different ray cluster (with a different auth token) using the same endpoint (e.g. due to port-forwarding or local testing) will reshow the popup and ask users to input the right token.
---------
Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
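The header-then-cookie lookup order described in the first bullet might look roughly like the sketch below. The function name `extract_auth_token`, the cookie name, and the `Bearer` scheme are all assumptions made for illustration; the commit message does not specify them.

```python
from typing import Mapping, Optional

# Hypothetical cookie name; the real name is not given in the commit message.
AUTH_COOKIE_NAME = "ray-auth-token"


def extract_auth_token(
    headers: Mapping[str, str], cookies: Mapping[str, str]
) -> Optional[str]:
    """Hypothetical helper: prefer the Authorization header, then the cookie."""
    auth = headers.get("Authorization", "")
    # Assumes a "Bearer <token>" header format.
    scheme, _, value = auth.partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()
    # Fall back to the cookie set by the server (HttpOnly, so not readable by JS).
    return cookies.get(AUTH_COOKIE_NAME) or None
```

Checking the header first keeps programmatic clients working unchanged, while browsers can rely on the server-set cookie.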
add support for `ray get-auth-token` cli command + test --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
ray-project#57590)
As discovered in the [PR to better define the interface for reference counter](ray-project#57177 (review)), the plasma store provider and memory store both share thin dependencies on the reference counter that can be refactored out. This will reduce entanglement in our code base and improve maintainability. The main logic changes are located in
* src/ray/core_worker/store_provider/plasma_store_provider.cc, where reference counter related logic is refactored into core worker
* src/ray/core_worker/core_worker.cc, where the factored-out reference counter logic is resolved
* src/ray/core_worker/store_provider/memory_store/memory_store.cc, where logic related to the reference counter has either been removed because it is tech debt or refactored into caller functions.
<!-- Please give a short summary of the change and the problem this solves. -->
## Related issue number
<!-- For example: "Closes ray-project#1234" -->
## Checks
Microbenchmark:
```
single client get calls (Plasma Store) per second 10592.56 +- 535.86
single client put calls (Plasma Store) per second 4908.72 +- 41.55
multi client put calls (Plasma Store) per second 14260.79 +- 265.48
single client put gigabytes per second 11.92 +- 10.21
single client tasks and get batch per second 8.33 +- 0.19
multi client put gigabytes per second 32.09 +- 1.63
single client get object containing 10k refs per second 13.38 +- 0.13
single client wait 1k refs per second 5.04 +- 0.05
single client tasks sync per second 960.45 +- 15.76
single client tasks async per second 7955.16 +- 195.97
multi client tasks async per second 17724.1 +- 856.8
1:1 actor calls sync per second 2251.22 +- 63.93
1:1 actor calls async per second 9342.91 +- 614.74
1:1 actor calls concurrent per second 6427.29 +- 50.3
1:n actor calls async per second 8221.63 +- 167.83
n:n actor calls async per second 22876.04 +- 436.98
n:n actor calls with arg async per second 3531.21 +- 39.38
1:1 async-actor calls sync per second 1581.31 +- 34.01
1:1 async-actor calls async per second 5651.2 +- 222.21
1:1 async-actor calls with args async per second 3618.34 +- 76.02
1:n async-actor calls async per second 7379.2 +- 144.83
n:n async-actor calls async per second 19768.79 +- 211.95
```
This PR mainly makes logic changes to the `ray.get` call chain. As we can see from the benchmark above, the single client get calls performance matches pre-regression levels.
---------
Signed-off-by: davik <[email protected]>
Co-authored-by: davik <[email protected]>
Co-authored-by: Ibrahim Rabbani <[email protected]>
…ay-project#58471)
2. **Extracted generic `RankManager` class** - Created reusable rank management logic separated from deployment-specific concerns
3. **Introduced `ReplicaRank` schema** - Type-safe rank representation replacing raw integers
4. **Simplified error handling** - not supporting self healing
5. **Updated tests** - Refactored unit tests to use new API and removed flag-dependent test cases

**Impact:**
- Cleaner separation of concerns in rank management
- Foundation for future multi-level rank support

Next PR ray-project#58473
---------
Signed-off-by: abrar <[email protected]>
Currently, Ray metrics and events are exported through a centralized process called the Dashboard Agent. This process functions as a gRPC server, receiving data from all other components (GCS, Raylet, workers, etc.). However, during a node shutdown, the Dashboard Agent may terminate before the other components, resulting in gRPC errors and potential loss of metrics and events. When this happens, the otel sdk logs become very noisy. Add a default option to disable otel sdk logs to avoid confusion. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
Fix `get_metric_check_condition` to use `fetch_prometheus_timeseries`, which is a non-flaky version of `fetch_prometheus`. Update all test usages accordingly. Test: - CI --------- Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Cuong Nguyen <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… RD Datatype (ray-project#58225)
## Description
As title suggests
## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234".
## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <[email protected]>
…ay-project#58581) allowing for py3.13 images (cpu & cu123) in release tests Signed-off-by: elliot-barn <[email protected]>
…oughput stats (ray-project#58693)
This PR makes three improvements to Ray Data's throughput statistics:
1. **Makes `test_dataset_throughput` deterministic**: The original test was flaky because it relied on actual task execution timing. This PR rewrites it as unit tests (`test_dataset_throughput_calculation` and `test_operator_throughput_calculation`) using mocked `BlockStats` objects, making the tests fast and reliable.
2. **Removes "Estimated single node throughput" from Dataset-level stats**: This metric was misleading at the dataset level since it summed wall times across all operators, which doesn't accurately represent single-node performance. The "Ray Data throughput" metric (total rows / total wall time) remains and provides the meaningful dataset-level throughput.
3. **Renames "Estimated single node throughput" to "Estimated single task throughput"**: At the operator level, this metric divides total rows by the sum of task wall times. The new name more accurately reflects what it measures: the throughput if all work were done by a single task serially.
---------
Signed-off-by: dancingactor <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
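The two metrics that remain after this change differ only in their denominator. The sketch below spells out the arithmetic as described in the commit message; the function names are hypothetical and are not Ray Data's internal API.

```python
from typing import Sequence


def ray_data_throughput(total_rows: int, total_wall_time_s: float) -> float:
    """Dataset level: total rows divided by end-to-end wall time."""
    return total_rows / total_wall_time_s if total_wall_time_s > 0 else 0.0


def estimated_single_task_throughput(
    total_rows: int, task_wall_times_s: Sequence[float]
) -> float:
    """Operator level: total rows divided by the SUM of per-task wall times,
    i.e. the rate if one task had done all the work serially."""
    summed = sum(task_wall_times_s)
    return total_rows / summed if summed > 0 else 0.0
```

For example, 1000 rows processed by three overlapping tasks that took 10 s, 10 s and 5 s within a 10 s run give a dataset throughput of 100 rows/s but a single-task estimate of 40 rows/s, which is why the old "single node" label was misleading.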
## Description
> This PR removes the deprecated read_parquet_bulk API from Ray Data, along with its implementation and documentation. This function was deprecated in favor of read_parquet, which now covers all equivalent use cases. The deprecation warning stated removal after May 2025, and that deadline has passed, so this cleanup reduces maintenance burden and prevents user confusion.

Summary of changes
- Removed read_parquet_bulk from read_api.py and __init__.py
- Deleted ParquetBulkDatasource + its file
- Removed related tests and documentation
- Updated references and docstrings mentioning the deprecated API
## Related issues
> Fixes ray-project#58969
---------
Signed-off-by: rushikesh.adhav <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
…8694)
## Description
The test fails intermittently with an assertion error indicating that the internal input queue for a MapBatches operator is not empty when it's expected to be. This suggests a race condition or timing issue in the streaming executor's queue management.
## Related issues
Closes ray-project#58546
## Additional information
---------
Signed-off-by: 400Ping <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
…ect#58890)
## Description
RLlib is divided in its testing structure: we have the top-level `/tests` folder and, for components, `<component>/tests` folders, mixing the two project structures. This PR commits to the component style of project structure, moving `tests/` files to their component folders.
## Related issues
General project structure improvements
---------
Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: Kamil Kaczmarek <[email protected]>
## Description - convert debug logs in authentication_token_loader to info so that users are aware of where the token being used is being loaded from - When we raise the `AuthenticationError`, if RAY_AUTH_MODE is not set to token we should explicitly print that in the error message - in error messages suggest storing tokens in filesystem instead of env - add state api tests in test_token_auth_integration.py --------- Signed-off-by: sampan <[email protected]> Co-authored-by: sampan <[email protected]>
…round (ray-project#59004)
## Description
Remove the dual local/CI hook configuration for pydoclint and simplify back to a single hook. The workaround was needed due to a bug, but this was fixed in pydoclint `0.8.3`.
- Upgrade pydoclint from `0.8.1` to `0.8.3`
- Remove separate `pydoclint-local` and `pydoclint-ci` hooks
- Simplify CI lint script to use standard pre-commit run
## Related issues
> Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234".
## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Thomas Desrosiers <[email protected]>
The Jax dependency was introduced in ray-project#58322. The current test environment is for CUDA 12.1, which limits the jax version to 0.4.14 or below. jax <= 0.4.14 does not support py 3.12, so skip the jax test if it runs against py3.12+. Signed-off-by: elliot-barn <[email protected]>
Patch release fixes a post-merge bug that was causing GPU batch tests to fail. - Add explicit transformers>=4.57.3 dependency to ray[llm] - Update hashes ## Related issues anyscale#547 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Nikhil Ghosh <[email protected]>
The autoscaling context needs expensive function evaluations, and not all autoscaling policies need the data. Lazily evaluate them to save controller CPU. --------- Signed-off-by: abrar <[email protected]>
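One common way to get this kind of lazy evaluation in Python is `functools.cached_property`. The sketch below is a generic illustration of the idea and is not the Serve controller's actual code; the class name `LazyAutoscalingContext` and the `metrics` attribute are hypothetical.

```python
from functools import cached_property
from typing import Any, Callable, Dict


class LazyAutoscalingContext:
    """Hypothetical context whose expensive field is computed on first access."""

    def __init__(self, compute_metrics: Callable[[], Dict[str, Any]]):
        # Store the callable instead of calling it eagerly.
        self._compute_metrics = compute_metrics

    @cached_property
    def metrics(self) -> Dict[str, Any]:
        # Runs at most once per instance, and only if a policy reads it.
        return self._compute_metrics()
```

A policy that never touches `metrics` pays nothing, and a policy that reads it several times still triggers only one evaluation.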
## Description
`test_backpressure_e2e` occasionally fails with an error like this:
```
[2025-11-26T17:33:36Z] PASSED[2025-11-26 17:27:35,172 E 550 12058] core_worker_process.cc:986: The core worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
```
This PR attempts to deflake it by removing an unnecessary `del`. (Long-term, we should rewrite or remove this test. This PR is a mitigation.)
Signed-off-by: Balaji Veeramani <[email protected]>
## Description
RLlib has both a tuned-examples folder and an examples folder; we believe this is a confusing structure. Therefore, this PR merges them, moving all tuned-example scripts into `examples/algorithms` for the new API stack and `examples/_old_api_stack/algorithms` for the old API stack.
---------
Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: Kamil Kaczmarek <[email protected]>
Including openlineage-python dependency upgrading requests from 2.32.3 -> 2.32.5 LLM serve and batch release tests: https://buildkite.com/ray-project/release/builds/69428 Only failing test has already been disabled: llm_serve_llama_3dot1_8B_quantized_tp1_2p6d_lmcache Core multi test: https://buildkite.com/ray-project/release/builds/69489# --------- Signed-off-by: elliot-barn <[email protected]>
## Description Checking RLlib, there are a couple cases where we don't fully handle exception cases. I've checked every try except statement in RLlib and this PR updates all of them that didn't log or print the error if it wasn't handled ## Related issues Fixes ray-project#58854 --------- Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]>
…jection logic (ray-project#59042)
## Description
Adds more headers to the denylist for recognising browser requests and denying them
## Related issues
Supersedes ray-project#59040
Signed-off-by: Richo Healey <[email protected]>
## Description ray-project#58853 didn't define a byod script however the [nightly test](https://buildkite.com/ray-project/release/builds/69721#019ac407-48f6-4523-b3c0-e8c2f6b6ba82) failed as `torch` wasn't installed. This PR adds a byod script and includes `torch` (and other possible modules) to install ## Related issues ray-project#58853 --------- Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]>
… (1/9) (ray-project#58983)
Upgrade the memray, cupy-cuda12x and rich packages in preparation for python 3.13 image building:
- rich==13.3.2 -> rich==13.3.3
- cupy-cuda12x -> cupy-cuda12x>=13.4.0
- Updated memray constraint to 1.14.0

Memray 1.14.0 has python 3.13 wheels. The rich upgrade is required for the memray upgrade. cupy-cuda12x>=13.4.0 has python 3.13 wheels.
---------
Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…oject#58272)
## Description
This PR adds an API overview and example usage for the TPU utility library added in this PR: ray-project#56723. I added this section to the existing "Using TPUs with KubeRay" guide, because the utility library would be primarily used with KubeRay on GKE (the values used for default labels are set on GKE with a mutating webhook).
## Related issues
ray-project#55162
## Additional information
> Optional: Add implementation details, API changes, usage examples, screenshots, etc.
---------
Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This PR adds a `bundle_label_selector` argument to the `ScalingConfig` that allows Ray Train workers to be placed on nodes with particular labels. The previous workaround, namely using `resources_per_worker`, is less flexible. `bundle_label_selector` can either be a single dict, in which case it will apply to all the workers, or a list of length `num_workers`, in which case each item in the list will correspond to one of the workers. I added verification to the controller instead of validating that none of the callbacks have `on_controller_start_worker_group` when `bundle_label_selector` is set because we might change `on_controller_start_worker_group` in the future. We can revisit this issue then. --------- Signed-off-by: Timothy Seah <[email protected]>
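The "single dict or list of length `num_workers`" rule can be illustrated with a small normalization helper. This is a sketch of the described semantics only; the function name `normalize_bundle_label_selector` is hypothetical, and the real validation lives in the Ray Train controller.

```python
from typing import Dict, List, Union

LabelSelector = Dict[str, str]


def normalize_bundle_label_selector(
    selector: Union[LabelSelector, List[LabelSelector]], num_workers: int
) -> List[LabelSelector]:
    """Hypothetical helper: expand the argument into one selector per worker."""
    if isinstance(selector, dict):
        # A single dict applies to every worker.
        return [dict(selector) for _ in range(num_workers)]
    if len(selector) != num_workers:
        raise ValueError(
            f"Expected {num_workers} label selectors, got {len(selector)}."
        )
    # A list maps item i to worker i.
    return [dict(s) for s in selector]
```

The label keys and values passed in (for example `{"zone": "a"}`) are up to the cluster's node labels and are not prescribed by this sketch.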
Document the new `ray.train.get_all_reported_checkpoints` method. --------- Signed-off-by: Timothy Seah <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Justin Yu <[email protected]>
and also closes the file after reading.. Signed-off-by: Lonnie Liu <[email protected]>
The pull request #692 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 6194.
Summary of Changes
Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This automated daily merge pull request synchronizes the …
Code Review
This pull request is an automated daily merge from master to main. It includes a very large number of changes, primarily focused on refactoring and improving the CI/CD pipeline and the Bazel build system. Key changes include modularizing build steps, updating dependencies and toolchains (e.g., moving to miniforge and uv), a shift to arm64 for macOS builds, and significant cleanups in various configuration files and scripts. I've found one issue in the buildkite annotation logic. Otherwise, the changes appear to be consistent and part of a large-scale improvement effort.
for alias in self._get_image_tags():
    aws_alias_image = f"{aws_registry}/anyscale/{self.image_type}:{alias}"
    gcp_alias_image = f"{gcp_registry}/anyscale/{self.image_type}:{alias}"
    azure_alias_image = (
        f"{azure_registry}/anyscale/{self.image_type}:{alias}"
    )
    cmds += [
        f"docker tag {anyscale_image} {aws_alias_image}",
        f"docker push {aws_alias_image}",
        f"docker tag {anyscale_image} {gcp_alias_image}",
        f"docker push {gcp_alias_image}",
        f"docker tag {anyscale_image} {azure_alias_image}",
        f"docker push {azure_alias_image}",
    ]

if os.environ.get("BUILDKITE"):
    subprocess.run(
        [
            "buildkite-agent",
            "annotate",
            "--style=info",
            f"--context={self.image_type}-images",
            "--append",
            f"{aws_alias_image}<br/>",
        ]
    )
There is a bug in the buildkite annotation logic. The aws_alias_image variable is used outside the for loop, so it will only contain the value from the last iteration. This means only the last created alias will be annotated. To fix this, you should collect all aliases and annotate them together after the loop.
images_to_annotate = []
for alias in self._get_image_tags():
    aws_alias_image = f"{aws_registry}/anyscale/{self.image_type}:{alias}"
    gcp_alias_image = f"{gcp_registry}/anyscale/{self.image_type}:{alias}"
    azure_alias_image = (
        f"{azure_registry}/anyscale/{self.image_type}:{alias}"
    )
    cmds += [
        f"docker tag {anyscale_image} {aws_alias_image}",
        f"docker push {aws_alias_image}",
        f"docker tag {anyscale_image} {gcp_alias_image}",
        f"docker push {gcp_alias_image}",
        f"docker tag {anyscale_image} {azure_alias_image}",
        f"docker push {azure_alias_image}",
    ]
    images_to_annotate.append(aws_alias_image)
if os.environ.get("BUILDKITE") and images_to_annotate:
    annotation_body = "<br/>".join(images_to_annotate)
    subprocess.run(
        [
            "buildkite-agent",
            "annotate",
            "--style=info",
            f"--context={self.image_type}-images",
            "--append",
            annotation_body,
        ]
    )
This Pull Request was created automatically to merge the latest changes from `master` into `main` branch.
Created: 2025-12-01
Merge direction: `master` → `main`
Triggered by: Scheduled
Please review and merge if everything looks good.