Merged
Commits
1570 commits
20 changes: 19 additions & 1 deletion doc/source/ray-core/scheduling/index.rst
@@ -3,7 +3,24 @@
Scheduling
==========

For each task or actor, Ray will choose a node to run it and the scheduling decision is based on the following factors.
This page provides an overview of how Ray decides to schedule tasks and actors to nodes.

.. DJS 19 Sept 2025: There should be an overview of all features and configs that impact scheduling here.
    This should include descriptions for default values and behaviors, and links to things like default labels or resource definitions that can be used for scheduling without customization.

Labels
------

Labels provide a simplified solution for controlling scheduling for tasks, actors, and placement group bundles using default and custom labels. See :doc:`./labels`.

Labels are a beta feature. As this feature becomes stable, the Ray team recommends using labels to replace the following patterns:

- ``NodeAffinitySchedulingStrategy`` when ``soft=False``. Use the default ``ray.io/node-id`` label instead.
- The ``accelerator_type`` option for tasks and actors. Use the default ``ray.io/accelerator-type`` label instead.

.. note::

    A legacy pattern recommended using custom resources for label-based scheduling. We now recommend only using custom resources when you need to manage scheduling using numeric values.
Contributor

Suggested change
A legacy pattern recommended using custom resources for label-based scheduling. We now recommend only using custom resources when you need to manage scheduling using numeric values.
A legacy pattern recommended using custom resources for label-based scheduling. Anyscale now recommends only using custom resources when you need to manage scheduling using numeric values.

Contributor Author

This is Ray docs, so the recommender is "we" IIUC.


.. _ray-scheduling-resources:

@@ -127,6 +144,7 @@ More about Ray Scheduling
.. toctree::
    :maxdepth: 1

    labels
    resources
    accelerators
    placement-group
175 changes: 175 additions & 0 deletions doc/source/ray-core/scheduling/labels.md
@@ -0,0 +1,175 @@
---
description: "Learn about using labels to control how Ray schedules tasks, actors, and placement groups to nodes in your Kubernetes cluster."
---

(labels)=
# Use labels to control scheduling

In Ray version 2.49.0 and above, you can use labels to control scheduling for KubeRay. Labels are a beta feature.

This page provides a conceptual overview and usage instructions for labels. Labels are key-value pairs that provide a human-readable configuration for users to control how Ray schedules tasks, actors, and placement group bundles to specific nodes.


```{note}
Ray labels share the same syntax and formatting restrictions as Kubernetes labels, but are conceptually distinct. See the [Kubernetes docs on labels and selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set).
```
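
Because Ray labels follow the Kubernetes label syntax, it can help to check keys and values before passing them to `ray start`. The following is a minimal sketch based on the rules in the Kubernetes documentation, not on Ray's own validation code. The helper names `is_valid_label_key` and `is_valid_label_value` are hypothetical and aren't part of the Ray API.

```python
import re

# Name segment of a key, or a non-empty value: alphanumeric at both ends,
# with '-', '_', and '.' allowed in between, at most 63 characters.
_NAME = re.compile(r"[A-Za-z0-9]([A-Za-z0-9_.-]{0,61}[A-Za-z0-9])?")

# Optional key prefix: a DNS subdomain (lowercase DNS labels joined by dots).
_DNS_LABEL = r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?"
_PREFIX = re.compile(rf"{_DNS_LABEL}(\.{_DNS_LABEL})*")


def is_valid_label_key(key: str) -> bool:
    """Check a key of the form `name` or `prefix/name` (hypothetical helper)."""
    prefix, sep, name = key.rpartition("/")
    if sep:
        # A slash is present, so the prefix must be a valid DNS subdomain.
        if not prefix or len(prefix) > 253 or not _PREFIX.fullmatch(prefix):
            return False
    return _NAME.fullmatch(name) is not None


def is_valid_label_value(value: str) -> bool:
    """Check a label value; the empty string is allowed (hypothetical helper)."""
    return value == "" or _NAME.fullmatch(value) is not None


print(is_valid_label_key("ray.io/accelerator-type"))
print(is_valid_label_value("m5.16xlarge"))
```

This sketch only covers label keys and values. Selector expressions such as `!value` or `in(val1,val2)` use additional characters and aren't label values themselves.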


## How do labels work?

The following is a high-level overview of how you use labels to control scheduling:

- Ray sets default labels that describe the underlying compute. See [](defaults).
- You define custom labels as key-value pairs. See [](custom).
- You specify *label selectors* in your Ray code to define label requirements. You can specify these requirements at the task, actor, or placement group bundle level. See [](label-selectors).
- Ray schedules tasks, actors, or placement group bundles based on the specified label selectors.
- In Ray 2.50.0 and above, if you're using a dynamic cluster with autoscaler V2 enabled, the cluster scales up to add new nodes from a designated worker group to fulfill label requirements.

(defaults)=
## Default node labels
```{note}
Ray reserves all labels under the `ray.io` namespace.
```
During cluster initialization or as autoscaling events add nodes to your cluster, Ray assigns the following default labels to each node:

| Label | Description |
| --- | --- |
| `ray.io/node-id` | A unique ID generated for the node. |
| `ray.io/accelerator-type` | The accelerator type of the node, for example `L4`. CPU-only machines have an empty string. See {ref}`accelerator types <accelerator-types>` for a mapping of values. |

```{note}
You can override default values using `ray start` parameters.
```

The following is an example of a default label:

```python
# Default label indicating the machine is CPU-only.
{"ray.io/accelerator-type": ""}
```

(custom)=
## Define custom labels

You can add custom labels to your nodes using the `--labels` or `--labels-file` parameter when running `ray start`.

```bash
# Example 1: Start a head node with cpu-family and test-label labels
ray start --head --labels="cpu-family=amd,test-label=test-value"

# Example 2: Start a head node with labels from a label file
ray start --head --labels-file='./test-labels-file'

# The file content can be the following (should be a valid YAML file):
# "test-label": "test-value"
# "test-label-2": "test-value-2"
```
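
The `--labels` value is a comma-separated list of `key=value` pairs. As an illustration of that format, the following sketch converts between the string form and a Python dict. The helpers `parse_labels` and `format_labels` are hypothetical and aren't part of Ray; Ray's own parsing may handle edge cases differently.

```python
def parse_labels(spec: str) -> dict[str, str]:
    """Parse "k1=v1,k2=v2" into a dict (hypothetical helper, not a Ray API)."""
    labels: dict[str, str] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # Tolerate empty segments such as a trailing comma.
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[key.strip()] = value.strip()
    return labels


def format_labels(labels: dict[str, str]) -> str:
    """Build the comma-separated string form from a dict of labels."""
    return ",".join(f"{key}={value}" for key, value in labels.items())


labels = parse_labels("cpu-family=amd,test-label=test-value")
print(labels)
print(format_labels(labels))
```

Keeping labels in a dict and formatting them at the last step makes it easier to reuse the same set of labels across several node start commands.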

```{note}
You can't set labels using `ray.init()`. Local Ray clusters don't support labels.
```

(label-selectors)=
## Specify label selectors

You add label selector logic to your Ray code when defining Ray tasks, actors, or placement group bundles. Label selectors define the label requirements for matching your Ray code to a node in your Ray cluster.

Label selectors specify the following:

- The key of the label.
- Operator logic for matching.
- The value or values to match on.

The following table shows the basic syntax for label selector operator logic:

| Operator | Description | Example syntax |
| --- | --- | --- |
| Equals | Label matches exactly one value. | `{"key": "value"}` |
| Not equal | Label matches anything but one value. | `{"key": "!value"}` |
| In | Label matches one of the provided values. | `{"key": "in(val1,val2)"}` |
| Not in | Label matches none of the provided values. | `{"key": "!in(val1,val2)"}` |

You can specify one or more label selectors as a dict. When specifying multiple label selectors, the candidate node must meet all requirements. The following example configuration uses a custom label to require an `m5.16xlarge` EC2 instance and a default label to require node ID to be 123:

```python
label_selector={"instance_type": "m5.16xlarge", "ray.io/node-id": "123"}
```
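
To make the operator semantics concrete, the following sketch evaluates a `label_selector` dict against a node's labels in plain Python. It's an illustration of the table above, not Ray's scheduler code. The function names `matches` and `node_satisfies` are hypothetical, and the handling of a missing label under the negated operators (treated here as a match) is an assumption.

```python
from typing import Optional


def matches(expr: str, label_value: Optional[str]) -> bool:
    """Evaluate one selector expression against one label value.

    `label_value` is None when the node doesn't have the label.
    """
    negate = expr.startswith("!")
    if negate:
        expr = expr[1:]
    if expr.startswith("in(") and expr.endswith(")"):
        # In / Not in: compare against a set of allowed values.
        allowed = {value.strip() for value in expr[3:-1].split(",")}
        hit = label_value in allowed
    else:
        # Equals / Not equal: compare against a single value.
        hit = label_value == expr
    return hit != negate


def node_satisfies(label_selector: dict, node_labels: dict) -> bool:
    """Return True only if the node meets every selector in the dict."""
    return all(
        matches(expr, node_labels.get(key))
        for key, expr in label_selector.items()
    )


node = {"instance_type": "m5.16xlarge", "ray.io/node-id": "123", "region": "us-west-2"}
print(node_satisfies({"instance_type": "m5.16xlarge", "ray.io/node-id": "123"}, node))
print(node_satisfies({"region": "!in(us-west-2,us-east-1)"}, node))
```

The `all(...)` call reflects the rule that a candidate node must meet every requirement when you specify multiple selectors.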

## Specify label requirements for tasks and actors

Use the following syntax to add label selectors to tasks and actors:

```python
import ray

# An example of specifying label_selector in a task's @ray.remote decorator
@ray.remote(label_selector={"label_name": "label_value"})
def f():
    pass

# An example of specifying label_selector in an actor's @ray.remote decorator
@ray.remote(label_selector={"ray.io/accelerator-type": "nvidia-h100"})
class Actor:
    pass

# An example of specifying label_selector in a task's options
@ray.remote
def test_task_label_in_options():
    pass

test_task_label_in_options.options(label_selector={"test-label-key": "test-label-value"}).remote()

# An example of specifying label_selector in an actor's options
@ray.remote
class Actor:
    pass

actor_1 = Actor.options(
    label_selector={"ray.io/accelerator-type": "nvidia-h100"},
).remote()
```

## Specify label requirements for placement group bundles

Use the `bundle_label_selector` option to add label selectors to placement group bundles. See the following examples:

```python
import ray

# All bundles require the same labels:
ray.util.placement_group(
    bundles=[{"GPU": 1}, {"GPU": 1}],
    bundle_label_selector=[{"ray.io/accelerator-type": "H100"}] * 2,
)

# Bundles require different labels:
ray.util.placement_group(
    bundles=[{"CPU": 1}] + [{"GPU": 1}] * 2,
    bundle_label_selector=[{"ray.io/market-type": "spot"}] + [{"ray.io/accelerator-type": "H100"}] * 2,
)
```
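
When bundle lists get longer, it's easy for `bundles` and `bundle_label_selector` to drift out of alignment. The following plain-Python sketch builds both lists and checks their lengths before they would be passed to `ray.util.placement_group`. The `repeat` helper is hypothetical, and the one-selector-per-bundle pairing by position is an assumption drawn from the examples above.

```python
def repeat(item: dict, count: int) -> list[dict]:
    """Return `count` independent copies of a bundle or selector dict."""
    return [dict(item) for _ in range(count)]


bundles = repeat({"CPU": 1}, 1) + repeat({"GPU": 1}, 2)
bundle_label_selector = (
    repeat({"ray.io/market-type": "spot"}, 1)
    + repeat({"ray.io/accelerator-type": "H100"}, 2)
)

# Assumption: one selector per bundle, paired by position.
if len(bundles) != len(bundle_label_selector):
    raise ValueError("expected one label selector per bundle")

for bundle, selector in zip(bundles, bundle_label_selector):
    print(bundle, "->", selector)
```

Copying each dict avoids sharing one dict object across several list entries, which matters only if you modify an entry later.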
## Using labels with autoscaler

Autoscaler V2 supports label-based scheduling. To let the autoscaler add nodes that fulfill label requirements, create a worker group for each combination of label requirements and specify the corresponding labels in the `rayStartParams` field of the Ray cluster configuration. For example:

```yaml
rayStartParams: {
  labels: "region=me-central1,ray.io/accelerator-type=nvidia-h100"
}
```

Contributor

@ryanaoleary ryanaoleary Oct 3, 2025

When ray-project/kuberay#4106 is merged we can direct users to specify the top-level Labels field under the worker or head group with their desired labels with KubeRay v1.5+, but for now rayStartParams is the only option.

Contributor Author

@MengjinYan LMK if this needs to happen for this release or you want to add later.

Contributor

@dstrodtman I think we can add it later.

## Monitor nodes using labels

The Ray dashboard automatically shows the following information:
- Labels for each node. See {py:attr}`ray.util.state.common.NodeState.labels`.
- Label selectors set for each task, actor, or placement group bundle. See {py:attr}`ray.util.state.common.TaskState.label_selector` and {py:attr}`ray.util.state.common.ActorState.label_selector`.

Within a task, you can programmatically obtain the labels of the current node from the runtime context API using `ray.get_runtime_context().get_node_labels()`. This returns a Python dict. See the following example:

```python
import ray

@ray.remote
def test_task_label():
    node_labels = ray.get_runtime_context().get_node_labels()
    print(f"[test_task_label] node labels: {node_labels}")

# Run the task and wait for it to finish.
ray.get(test_task_label.remote())

"""
Example output:
(test_task_label pid=68487) [test_task_label] node labels: {'test-label-1': 'test-value-1', 'test-label-key': 'test-label-value', 'test-label-2': 'test-value-2'}
"""
```

You can also access node labels and label selectors using the state API and state CLI.
Contributor

Should we link to the state reference?

Contributor Author

I'm not sure what the references here are TBH.

13 changes: 3 additions & 10 deletions doc/source/ray-core/scheduling/resources.rst
@@ -62,16 +62,9 @@ The fact that resources are logical has several implications:
Custom Resources
----------------

Besides pre-defined resources, you can also specify a Ray node's custom resources and request them in your tasks or actors.
Some use cases for custom resources:

- Your node has special hardware and you can represent it as a custom resource.
Then your tasks or actors can request the custom resource via ``@ray.remote(resources={"special_hardware": 1})``
and Ray will schedule the tasks or actors to the node that has the custom resource.
- You can use custom resources as labels to tag nodes and you can achieve label based affinity scheduling.
For example, you can do ``ray.remote(resources={"custom_label": 0.001})`` to schedule tasks or actors to nodes with ``custom_label`` custom resource.
For this use case, the actual quantity doesn't matter, and the convention is to specify a tiny number so that the label resource is
not the limiting factor for parallelism.
You can specify custom resources for a Ray node and reference them to control scheduling for your tasks or actors.

Use custom resources when you need to manage scheduling using numeric values. If you need simple label-based scheduling, use labels instead. See :doc:`labels`.

.. _specify-node-resources:
