Releases: ray-project/ray
Ray-1.12.1
Patch release with the following fixes:
- Ray now works on Google Colab again! The bug with memory limit fetching when running Ray in a container is now fixed (#23922).
- `ray-ml` Docker images for CPU will start being built again after they were stopped in Ray 1.9 (#24266).
- [Train/Tune] Start MLflow run under the correct experiment for Ray Train and Ray Tune integrations (#23662).
- [RLlib] Fix for APPO in eager mode (#24268).
- [RLlib] Fix Alphastar for TF2 and tracing enabled (c5502b2).
- [Serve] Fix replica leak in anonymous namespaces (#24311).
Ray-1.11.1
Patch release including fixes for the following issues:
Ray-1.12.0
Highlights
- Ray AI Runtime (AIR), an open-source toolkit for building end-to-end ML applications on Ray, is now in Alpha. AIR is an effort to unify the experience of using different Ray libraries (Ray Data, Train, Tune, Serve, RLlib). You can find more information on the docs or on the public RFC.
- Getting involved with Ray AIR. We’ll be holding office hours, development sprints, and other activities as we get closer to the Ray AIR Beta/GA release. Want to join us? Fill out this short form!
- Ray usage data collection is now off by default. If you have any questions or concerns, please comment on the RFC.
- New algorithms are added to RLlib: SlateQ & Bandits (for recommender systems use cases) and AlphaStar (multi-agent, multi-GPU w/ league-based self-play)
- Ray Datasets: new lazy execution model with automatic task fusion and memory-optimizing move semantics; first-class support for Pandas DataFrame blocks; efficient random access datasets.
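The lazy execution model with task fusion mentioned in the Ray Datasets highlight can be pictured with a toy pipeline. This is only a sketch of the general idea, not Ray's implementation: `LazyPipeline`, `map`, and `execute` are hypothetical names, and the rows are plain integers.

```python
from typing import Callable, List


class LazyPipeline:
    """Toy stand-in for a lazily executed dataset (not Ray's implementation)."""

    def __init__(self, rows: List[int]):
        self._rows = rows
        self._stages: List[Callable[[int], int]] = []

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Record the stage instead of running it right away.
        self._stages.append(fn)
        return self

    def execute(self) -> List[int]:
        # Fuse all recorded stages into one function so each row is
        # processed once and no intermediate list is materialized.
        def fused(row: int) -> int:
            for fn in self._stages:
                row = fn(row)
            return row

        return [fused(row) for row in self._rows]


pipeline = LazyPipeline([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
result = pipeline.execute()
```

Nothing runs until `execute()` is called, which is what gives the executor the chance to combine adjacent stages.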
Ray Autoscaler
🎉 New Features
💫 Enhancements
- Improved documentation and standards around built-in autoscaler node providers. (#22236, #22237)
- Improved KubeRay support (#22987, #22847, #22348, #22188)
- Remove redis requirement (#22083)
🔨 Fixes
- No longer print infeasible warnings for internal placement group resources. Placement groups which cannot be satisfied by the autoscaler still trigger warnings. (#22235)
- Default AMIs per AWS region are updated/fixed. (#22506)
- GCP node termination updated (#23101)
- Retry legacy k8s operator on monitor failure (#22792)
- Cap min and max workers for manually managed on-prem clusters (#21710)
- Fix initialization artifacts (#22570)
- Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)
Ray Client
🎉 New Features:
- ray.init has consistent return value in client mode and driver mode #21355
💫Enhancements:
🔨 Fixes:
- Fix ray client object ref releasing in wrong context #22025
Ray Core
🎉 New Features
- RuntimeEnv:
  - Support setting timeout for runtime_env setup. (#23082)
  - Support setting pip_check and pip_version for runtime_env. (#22826, #23306)
  - env_vars will take effect when the pip install command is executed. (temporarily ineffective in conda) (#22730)
  - Support strongly-typed API `ray.runtime_env.RuntimeEnv` to define runtime env. (#22522)
  - Introduce virtualenv to isolate the pip type runtime env. (#21801, #22309)
- Raylet shares fate with the dashboard agent. And the dashboard agent will stay alive when it catches the port conflicts. (#22382,#23024)
- Enable dashboard in the minimal ray installation (#21896)
- Add task and object reconstruction status to ray memory cli tools(#22317)
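As a rough illustration of the RuntimeEnv options listed above, a runtime environment written as a plain dictionary might look like the following. The package name and values are made up, and the exact key layout (`pip` as a dict with `packages`, and `config` with `setup_timeout_seconds`) is an assumption that should be checked against the runtime_env documentation for your Ray version; the release note itself does not spell out the keys.

```python
# Plain-dict sketch of a runtime environment; all values are illustrative.
runtime_env = {
    "pip": {
        "packages": ["requests"],  # hypothetical dependency list
        "pip_check": False,        # skip the dependency consistency check after install
        "pip_version": "==22.0.2", # version specifier for pip inside the virtualenv
    },
    # Environment variables; per the note above these are visible to the
    # pip install command as well (not yet for conda).
    "env_vars": {"PIP_INDEX_URL": "https://pypi.org/simple"},
    # Assumed location of the setup timeout, in seconds.
    "config": {"setup_timeout_seconds": 600},
}

# Assumed usage with Ray installed:
#   import ray
#   ray.init(runtime_env=runtime_env)
```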
🔨 Fixes
- Report only memory usage of pinned object copies to improve scaledown. (#22020)
- Scheduler:
- Object store:
- Improve ray stop behavior (#22159)
- Avoid warning when receiving too many logs from a different job (#22102)
- Gcs resource manager bug fix and clean up. (#22462, #22459)
- Release GIL when running `parallel_memcopy()` / `memcpy()` during serializations. (#22492)
- Fix registering serializer before initializing Ray. (#23031)
🏗 Architecture refactoring
- Ray distributed scheduler refactoring: (#21927, #21992, #22160, #22359, #22722, #22817, #22880, #22893, #22885, #22597, #22857, #23124)
- Removed support for bootstrapping with Redis.
Ray Data Processing
🎉 New Features
- Big Performance and Stability Improvements:
- Add lazy execution mode with automatic stage fusion and optimized memory reclamation via block move semantics (#22233, #22374, #22373, #22476)
- Support for random access datasets, providing efficient random access to rows via binary search (#22749)
- Add automatic round-robin load balancing for reading and shuffle reduce tasks, obviating the need for the `_spread_resource_prefix` hack (#21303)
- More Efficient Tabular Data Wrangling:
- Groupby + Aggregations Improvements:
- Improved Dataset Windowing:
- Better Text I/O:
- New Operations:
- Add `add_column()` utility for adding derived columns (#21967)
- Add
- Support for metadata provider callback for read APIs (#22896)
- Support configuring autoscaling actor pool size (#22574)
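The random access feature above is described as using binary search. A minimal pure-Python sketch of that idea over sorted blocks follows; the block layout, the `id` key, and the `get_row` helper are assumptions made for illustration and do not reflect Ray's internal data structures.

```python
from bisect import bisect_right
from typing import Dict, List, Optional

# Toy data: rows are sorted by "id" and stored in consecutive blocks.
blocks: List[List[Dict[str, int]]] = [
    [{"id": 1, "value": 10}, {"id": 3, "value": 30}],
    [{"id": 5, "value": 50}, {"id": 8, "value": 80}],
]
# Precomputed indexes: first key of each block, and the keys inside each block.
block_starts = [block[0]["id"] for block in blocks]
block_keys = [[row["id"] for row in block] for block in blocks]


def get_row(key: int) -> Optional[Dict[str, int]]:
    # First binary search: the last block whose first key is <= key.
    block_index = bisect_right(block_starts, key) - 1
    if block_index < 0:
        return None
    # Second binary search: the position of the key inside that block.
    keys = block_keys[block_index]
    row_index = bisect_right(keys, key) - 1
    if row_index >= 0 and keys[row_index] == key:
        return blocks[block_index][row_index]
    return None
```

Each lookup costs two logarithmic searches instead of a scan over all rows, which is the property the feature relies on.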
🔨 Fixes
- Force lazy datasource materialization in order to respect `DatasetPipeline` stage boundaries (#21970)
- Simplify lifetime of designated block owner actor, and don’t create it if dynamic block splitting is disabled (#22007)
- Respect 0 CPU resource request when using manual resource-based load balancing (#22017)
- Remove batch format ambiguity by always converting Arrow batches to Pandas when `batch_format="native"` is given (#21566)
- Fix leaked stats actor handle due to closure capture reference counting bug (#22156)
- Fix boolean tensor column representation and slicing (#22323)
- Fix unhandled empty block edge case in shuffle (#22367)
- Fix unserializable Arrow Partitioning spec (#22477)
- Fix incorrect `iter_epochs()` batch format (#22550)
- Fix infinite `iter_epochs()` loop on unconsumed epochs (#22572)
- Fix infinite hang on `split()` when `num_shards < num_rows` (#22559)
- Patch Parquet file fragment serialization to prevent metadata fetching (#22665)
- Don’t reuse task workers for actors or GPU tasks (#22482)
- Pin pipeline executor actors to driver node to allow for lineage-based fault tolerance for pipelines (#22715)
- Always use non-empty blocks to determine schema (#22834)
- API fix bash (#22886)
- Make label_column optional for `to_tf()` so it can be used for inference (#22916)
- Fix `schema()` for `DatasetPipeline`s (#23032)
- Fix equalized split when `num_splits == num_blocks` (#23191)
💫 Enhancements
- Optimize Parquet metadata serialization via batching (#21963)
- Optimize metadata read/write for Ray Client (#21939)
- Add sanity checks for memory utilization (#22642)
🏗 Architecture refactoring
- Use threadpool to submit `DatasetPipeline` stages (#22912)
RLlib
🎉 New Features
- New “AlphaStar” algorithm: A parallelized, multi-agent/multi-GPU learning algorithm, implementing league-based self-play. (#21356, #21649)
- SlateQ algorithm has been re-tested, upgraded (multi-GPU capable, TensorFlow version), and bug-fixed (added to weekly learning tests). (#22389, #23276, #22544, #22543, #23168, #21827, #22738)
- Bandit algorithms: Moved into `agents` folder as first-class citizens, TensorFlow-Version, unified w/ other agents’ APIs. (#22821, #22028, #22427, #22465, #21949, #21773, #21932, #22421)
- ReplayBuffer API (in progress): Allow users to customize and configure their own replay buffers and use these inside custom or built-in algorithms. (#22114, #22390, #21808)
- Datasets support for RLlib: Dataset Reader/Writer and documentation. (#21808, #22239, #21948)
🔨 Fixes
- Fixed memory leak in SimpleReplayBuffer. (#22678)
- Fixed Unity3D built-in examples: Action bounds from -inf/inf to -1.0/1.0. (#22247)
- Various bug fixes. (#22350, #22245, #22171, #21697, #21855, #22076, #22590, #22587, #22657, #22428, #23063, #22619, #22731, #22534, #22074, #22078, #22641, #22684, #22398, #21685)
🏗 Architecture refactoring
- A3C: Moved into new `training_iteration` API (from `execution_plan` API). Led to a ~2.7x performance increase on an Atari + CNN + LSTM benchmark. (#22126, #22316)
- Make `multiagent->policies_to_train` more flexible via callable option (alternative to providing a list of policy IDs). (#20735)
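To illustrate the callable option for `policies_to_train`, here is a small sketch that does not import RLlib. The policy IDs are hypothetical, and the `(policy_id, batch)` signature is an assumption about what RLlib passes to the callable; consult the multi-agent configuration docs before relying on it.

```python
from typing import Any, Optional


def train_only_learners(policy_id: str, batch: Optional[Any] = None) -> bool:
    # Train policies whose ID starts with "learner"; keep all others frozen.
    return policy_id.startswith("learner")


config = {
    "multiagent": {
        # Hypothetical policy IDs, given as a plain set for illustration.
        "policies": {"learner_0", "learner_1", "frozen_opponent"},
        # A callable replaces the explicit list of policy IDs.
        "policies_to_train": train_only_learners,
    }
}

# Evaluate the callable the way a trainer might, to see which policies qualify.
trainable = sorted(
    pid
    for pid in config["multiagent"]["policies"]
    if config["multiagent"]["policies_to_train"](pid, None)
)
```

A callable is convenient when policies are added at runtime, for example in league-based self-play, because the rule keeps applying without the list being rewritten.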
💫Enhancements:
- Env pre-checking module now active by default. (#22191)
- Callbacks: Added `on_sub_environment_created` and `on_trainer_init` callback options. (#21893, #22493)
- RecSim environment wrappers: Ability to use Google’s RecSim for recommender systems more easily w/ RLlib algorithms (3 RLlib-ready example environments). (#22028, #21773, #22211)
- MARWIL loss function enhancement (exploratory term for stddev). (#21493)
📖Documentation:
- Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239)
- Other doc enhancements and fixes. (#23160, #23226, #22496, #22489, #22380)
Ray Workflow
🎉 New Features:
- Support skip checkpointing.
🔨 Fixes:
- Fix an issue where the event loop is not set.
Tune
🎉 New Features:
- Expose new checkpoint interface to users (#22741)
💫Enhancemen...
Ray-1.11.0
Highlights
🎉 Ray no longer starts Redis by default. Cluster metadata previously stored in Redis is stored in the GCS now.
Ray Autoscaler
🎉 New Features
- AWS Cloudwatch dashboard support #20266
💫 Enhancements
- Kuberay autoscaler prototype #21086
🔨 Fixes
- Ray.autoscaler.sdk import issue #21795
Ray Core
🎉 New Features
🔨 Fixes
- Better support for nested tasks
- Fixed 16GB Mac perf issue by limiting the plasma store size to 2GB #21224
- Fix `SchedulingClassInfo.running_tasks` memory leak #21535
- Round robin during spread scheduling #19968
🏗 Architecture refactoring
- Refactor scheduler resource reporting public APIs #21732
- Refactor ObjectManager wait logic to WaitManager #21369
Ray Data Processing
🎉 New Features
- More powerful to_torch() API, providing more control over the GPU batch format. (#21117)
🔨 Fixes
- Fix simple Dataset sort generating only 1 non-empty block. (#21588)
- Improve error handling across sorting, groupbys, and aggregations. (#21610, #21627)
- Fix boolean tensor column representation and slicing. (#22358)
RLlib
🎉 New Features
- Better utils for flattening complex inputs and enable prev-actions for LSTM/attention for complex action spaces. (#21330)
- `MultiAgentEnv` pre-checker (#21476)
- Base env pre-checker. (#21569)
🔨 Fixes
- Better defaults for QMix (#21332)
- Fix contrib/MADDPG + pettingzoo coop-pong-v4. (#21452)
- Fix action unsquashing causing inf/NaN actions for unbounded action spaces. (#21110)
- Ignore PPO KL-loss term completely if kl-coeff == 0.0 to avoid NaN values (#21456)
- `unsquash_action` and `clip_action` (when None) cause wrong actions computed by `Trainer.compute_single_action`. (#21553)
- Conv2d default filter tests and add default setting for 96x96 image obs space. (#21560)
- Bring back and fix offline RL (BC & MARWIL) learning tests. (#21574, #21643)
- SimpleQ should not use a prio. replay buffer. (#21665)
- Fix video recorder env wrapper. Added test case. (#21670)
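The PPO fix above (ignoring the KL-loss term when kl-coeff is 0.0) rests on a floating-point detail: multiplying zero by an infinite value yields NaN instead of zero. The sketch below shows that effect with plain floats; `total_loss` is a hypothetical helper, not RLlib's loss function.

```python
import math


def total_loss(surrogate_loss: float, kl: float, kl_coeff: float) -> float:
    # Skip the KL term entirely when its coefficient is zero: multiplying
    # 0.0 by an infinite (or NaN) KL estimate would still poison the loss.
    if kl_coeff == 0.0:
        return surrogate_loss
    return surrogate_loss + kl_coeff * kl


# Naive version: 0.0 * inf is NaN in IEEE 754 arithmetic.
naive = 1.5 + 0.0 * math.inf
# Guarded version: the KL term never enters the sum.
guarded = total_loss(1.5, math.inf, 0.0)
```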
🏗 Architecture refactoring
- Decentralized multi-agent learning (#21421)
- Preparatory PR for multi-agent multi-GPU learner (alpha-star style) (#21652)
Ray Workflow
🔨 Fixes
- Fixed workflow recovery issue due to a bug of dynamic output #21571
Tune
🎉 New Features
- It is now possible to load all evaluated points from an experiment into a Searcher (#21506)
- Add CometLoggerCallback (#20766)
💫 Enhancements
- Only sync the checkpoint folder instead of the entire trial folder for cloud checkpoint. (#21658)
- Add test for heterogeneous resource request deadlocks (#21397)
- Remove unused `return_or_clean_cached_pg` (#21403)
- Remove `TrialExecutor.resume_trial` (#21225)
- Leave only one canonical way of stopping a trial (#21021)
🔨 Fixes
- Replace deprecated `running_sanity_check` with `sanity_checking` in PTL integration (#21831)
- Fix loading an `ExperimentAnalysis` object without a registered `Trainable` (#21475)
- Fix stale node detection bug (#21516)
- Fixes to allow `tune/tests/test_commands.py` to run on Windows (#21342)
- Deflake PBT tests (#21366)
- Fix dtype coercion in `tune.choice` (#21270)
📖 Documentation
- Fix typo in `schedulers.rst` (#21777)
Train
🎉 New Features
💫 Enhancements
🔨 Fixes
- Fix Dataloader (#21467)
📖 Documentation
Serve
🎉 New Features
🔨 Fixes
- Warn when serve.start() with different options (#21562)
- Detect http.disconnect and cancel requests properly (#21438)
Thanks
Many thanks to all those who contributed to this release!
@isaac-vidas, @wuisawesome, @stephanie-wang, @jon-chuang, @xwjiang2010, @jjyao, @MissiontoMars, @qbphilip, @yaoyuan97, @gjoliver, @Yard1, @rkooo567, @talesa, @czgdp1807, @DN6, @sven1977, @kfstorm, @krfricke, @simon-mo, @hauntsaninja, @pcmoritz, @JamieSlome, @chaokunyang, @jovany-wang, @sidward14, @DmitriGekhtman, @ericl, @mwtian, @jwyyy, @clarkzinzow, @hckuo, @vakker, @HuangLED, @iycheng, @edoakes, @shrekris-anyscale, @robertnishihara, @avnishn, @mickelliu, @ndrwnaguib, @ijrsvt, @Zyiqin-Miranda, @bveeramani, @SongGuyang, @n30111, @WangTaoTheTonic, @suquark, @richardliaw, @qicosmos, @scv119, @architkulkarni, @lixin-wei, @Catch-Bull, @acxz, @benblack769, @clay4444, @amogkam, @marin-ma, @maxpumperla, @jiaodong, @mattip, @isra17, @raulchen, @wilsonwang371, @carlogrisetti, @ashione, @matthewdeng
Ray-1.10.0
Highlights
- 🎉 Ray Windows support is now in beta – a significant fraction of the Ray test suite is now passing on Windows. We are eager to learn about your experience with Ray 1.10 on Windows, please file issues you encounter at https://github.com/ray-project/ray/issues. In the upcoming releases we will spend more time on making Ray Serve and Runtime Environment tests pass on Windows and on polishing things.
Ray Autoscaler
💫Enhancements:
- Add autoscaler update time to prometheus metrics (#20831)
- Fewer non terminated nodes calls in autoscaler update (#20359, #20623)
🔨 Fixes:
- GCP TPU autoscaling fix (#20311)
- Scale-down stability fix (#21204)
- Report node launch failure in driver logs (#20814)
Ray Client
💫Enhancements
- Client task options are encoded with pickle instead of json (#20930)
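The release note does not say why the encoding was switched, but one general difference between the two formats is that JSON cannot represent every Python type, while pickle round-trips arbitrary picklable objects. The sketch below shows this with a hypothetical options dictionary; the keys are invented and are not Ray Client's actual option names.

```python
import json
import pickle

# Hypothetical task options containing values JSON cannot represent natively.
options = {"resources": {"custom": 1.5}, "placement": ("node-a", 0), "tags": {"gpu"}}

try:
    json.dumps(options)
    json_ok = True
except TypeError:
    # json has no encoding for sets (and would turn the tuple into a list).
    json_ok = False

# pickle round-trips the same structure with the original types preserved.
restored = pickle.loads(pickle.dumps(options))
```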
Ray Core
🎉 New Features:
- `runtime_env`’s `pip` field now installs pip packages in your existing environment instead of installing them in a new isolated environment. (#20341)
🔨 Fixes:
- Fix bug where specifying runtime_env conda/pip per-job using local requirements file using Ray Client on a remote cluster didn’t work (#20855)
- Security fixes for `log4j2`: the `log4j2` version has been bumped to 2.17.1 (#21373)
💫Enhancements:
- Allow runtime_env working_dir and py_modules to be pathlib.Path type (#20853, #20810)
- Add environment variable to skip local runtime_env garbage collection (#21163)
- Change runtime_env error log to debug log (#20875)
- Improved reference counting for runtime_env resources (#20789)
🏗 Architecture refactoring:
- Refactor runtime_env to use protobuf for multi-language support (#19511)
📖Documentation:
Ray Data Processing
🎉 New Features:
- Added stats framework for debugging Datasets performance (#20867, #21070)
- [Dask-on-Ray] New config helper for enabling the Dask-on-Ray scheduler (#21114)
💫Enhancements:
- Reduce memory usage when converting to a Pandas DataFrame (#20921)
🔨 Fixes:
- Fix slow block evaluation when splitting (#20693)
- Fix boundary sampling concatenation on non-uniform blocks (#20784)
- Fix boolean tensor column slicing (#20905)
🏗 Architecture refactoring:
- Refactor table block structure to support more tabular block formats (#20721)
RLlib
🎉 New Features:
- Support for RE3 exploration algorithm (for tf only). (#19551)
- Environment pre-checks, better failure behavior and enhanced environment API. (#20481, #20832, #20868, #20785, #21027, #20811)
🏗 Architecture refactoring:
- Evaluation: Support evaluation setting that makes sure `train` doesn't ever have to wait for `eval` to finish (b/c of long episodes). (#20757); Always attach latest eval metrics. (#21011)
- Soft-deprecate `build_trainer()` utility function in favor of sub-classing `Trainer` directly (and overriding some of its methods). (#20635, #20636, #20633, #20424, #20570, #20571, #20639, #20725)
- Experimental no-flatten option for actions/prev-actions. (#20918)
- Use `SampleBatch` instead of an input dict whenever possible. (#20746)
- Switch off `Preprocessors` by default for `PGTrainer` (experimental). (#21008)
- Toward a Replay Buffer API (cleanups; docstrings; renames; move into `rllib/execution/buffers` dir) (#20552)
📖Documentation:
- Overhaul of auto-API reference pages. (#19786, #20537, #20538, #20486, #20250)
- README and RLlib landing page overhaul (#20249).
- Added example containing code to compute an adapted (time-dependent) GAE used by the PPO algorithm (#20850).
🔨 Fixes:
Tune
🎉 New Features:
- Introduce TrialCheckpoint class, making checkpoint down/upload easier (#20585)
- Add random state to `BasicVariantGenerator` (#20926)
- Multi-objective support for Optuna (#20489)
💫Enhancements:
- Add `set_max_concurrency` to Searcher API (#20576)
- Allow for tuples in _split_resolved_unresolved_values. (#20794)
- Show the name of training func, instead of just ImplicitFunction. (#21029)
- Enforce one future at a time for any given trial at any given time. (#20783)
- Move `on_no_available_trials` to a subclass under `runner` (#20809)
- Clean up code (#20555, #20464, #20403, #20653, #20796, #20916, #21067)
- Start restricting TrialRunner/Executor interface exposures. (#20656)
- TrialExecutor should not take in Runner interface. (#20655)
🔨Fixes:
- Deflake test_tune_restore.py (#20776)
- Fix best_trial_str for nested custom parameter columns (#21078)
- Fix checkpointing error message on K8s (#20559)
- Fix testResourceScheduler and testMultiStepRun. (#20872)
- Fix tune cloud tests for function and rllib trainables (#20536)
- Move _head_bundle_is_empty after conversion (#21039)
- Elongate test_trial_scheduler_pbt timeout. (#21120)
Train
🔨Fixes:
- Ray Train environment variables are automatically propagated and do not need to be manually set on every node (#20523)
- Various minor fixes and improvements (#20952, #20893, #20603, #20487)
📖Documentation:
- Update saving/loading checkpoint docs (#20973). Thanks @jwyyy!
- Various minor doc updates (#20877, #20683)
Serve
💫Enhancements:
- Add validation to Serve AutoscalingConfig class (#20779)
- Add Serve metric for HTTP error codes (#21009)
🔨Fixes:
- No longer create placement group for deployment with no resources (#20471)
- Log errors in deployment initialization/configuration user code (#20620)
Jobs
🎉 New Features:
- Logs can be streamed from job submission server with `ray job logs` command (#20976)
- Add documentation for ray job submission (#20530)
- Propagate custom headers field to JobSubmissionClient and apply to all requests (#20663)
🔨Fixes:
- Fix job server accidentally creating local Ray processes instead of connecting (#20705)
💫Enhancements:
- [Jobs] Update CLI examples to use the same setup (#20844)
Thanks
Many thanks to all those who contributed to this release!
@dmatrix, @suquark, @tekumara, @jiaodong, @jovany-wang, @avnishn, @simon-mo, @iycheng, @SongGuyang, @ArturNiederfahrenhorst, @wuisawesome, @kfstorm, @matthewdeng, @jjyao, @chenk008, @Sertingolix, @larrylian, @czgdp1807, @scv119, @duburcqa, @runedog48, @Yard1, @robertnishihara, @geraint0923, @amogkam, @DmitriGekhtman, @ijrsvt, @kk-55, @lixin-wei, @mvindiola1, @hauntsaninja, @sven1977, @Hankpipi, @qbphilip, @hckuo, @newmanwang, @clay4444, @edoakes, @liuyang-my, @iasoon, @WangTaoTheTonic, @fgogolli, @dproctor, @gramhagen, @krfricke, @richardliaw, @bveeramani, @pcmoritz, @ericl, @simonsays1980, @carlogrisetti, @stephanie-wang, @AmeerHajAli, @mwtian, @xwjiang2010, @shrekris-anyscale, @n30111, @lchu-ibm, @Scalsol, @seonggwonyoon, @gjoliver, @qicosmos, @xychu, @iamhatesz, @architkulkarni, @jwyyy, @rkooo567, @mattip, @ckw017, @MissiontoMars, @clarkzinzow
Ray-1.9.2
Patch release to bump the `log4j` version from `2.16.0` to `2.17.0`. This resolves the security issue CVE-2021-45105.
Ray-1.9.1
Patch release to bump the `log4j2` version from `2.14` to `2.16`. This resolves the security vulnerabilities https://nvd.nist.gov/vuln/detail/CVE-2021-44228 and https://nvd.nist.gov/vuln/detail/CVE-2021-45046.
No library or core changes included.
Thanks @seonggwonyoon and @ijrsvt for contributing the fixes!
Ray-1.9.0
Highlights
- Ray Train is now in beta! If you are using Ray Train, we’d love to hear your feedback here!
- Ray Docker images for multiple CUDA versions are now provided (#19505)! You can specify a `-cuXXX` suffix to pick a specific version. `ray-ml:cpu` images are now deprecated. The `ray-ml` images are only built for GPU.
- Ray Datasets now supports groupby and aggregations! See the groupby API and GroupedDataset docs for usage.
- We are making continuing progress in improving Ray stability and usability on Windows. We encourage you to try it out and report feedback or issues at https://github.com/ray-project/ray/issues.
- We are launching a Ray Job Submission server + CLI & SDK clients to make it easier to submit and monitor Ray applications when you don’t want an active connection using Ray Client. This is currently in alpha, so the APIs are subject to change, but please test it out and file issues / leave feedback on GitHub & discuss.ray.io!
Ray Autoscaler
💫Enhancements:
- Graceful termination of Ray nodes prior to autoscaler scale down (#20013)
- Ray Clusters on AWS are colocated in one Availability Zone to reduce costs & latency (#19051)
Ray Client
🔨 Fixes:
- ray.put on a list of objects now returns a single object ref (#19737)
Ray Core
🎉 New Features:
- Support remote file storage for runtime_env (#20280, #19315)
- Added ray job submission client, cli and rest api (#19567, #19657, #19765, #19845, #19851, #19843, #19860, #19995, #20094, #20164, #20170, #20192, #20204)
💫Enhancements:
- Garbage collection for runtime_env (#20009, #20072)
- Improved logging and error messages for runtime_env (#19897, #19888, #18893)
🔨 Fixes:
- Fix runtime_env hanging issues (#19823)
- Fix specifying runtime env in @ray.remote decorator with Ray Client (#19626)
- Threaded actor / core worker / named actor race condition fixes (#19751, #19598, #20178, #20126)
📖Documentation:
- New page “Handling Dependencies”
- New page “Ray Job Submission: Going from your laptop to production”
Ray Java
API Changes:
- Fully supported namespace APIs. (Check out the namespace for more information.) #19468 #19986 #20057
- Removed global named actor APIs and global placement group APIs. #20219 #20135
- Added timeout parameter for `Ray.get()` API. #20282
Note:
- Use `Ray.getActor(name, namespace)` API to get a named actor between jobs instead of `Ray.getGlobalActor(name)`.
- Use `PlacementGroup.getPlacementGroup(name, namespace)` API to get a placement group between jobs instead of `PlacementGroup.getGlobalPlacementGroup(name)`.
Ray Datasets
🎉 New Features:
- Added groupby and aggregations (#19435, #19673, #20010, #20035, #20044, #20074)
- Support custom write paths (#19347)
🔨 Fixes:
- Support custom CSV write options (#19378)
🏗 Architecture refactoring:
- Optimized block compaction (#19681)
Ray Workflow
🎉 New Features:
- Workflow now supports events (#19239)
- Allow user to specify metadata for workflow and steps (#19372)
- Allow running a step in place if the resources match (#19928)
🔨 Fixes:
- Fix the s3 path issue (#20115)
RLlib
🏗 Architecture refactoring:
- “framework=tf2” + “eager_tracing=True” is now (almost) as fast as “framework=tf”. A check for tf2.x eager re-traces has been added making sure re-tracing does not happen outside the initial function calls. All CI learning tests (CartPole, Pendulum, FrozenLake) are now also run as framework=tf2. (#19273, #19981, #20109)
- Prepare deprecation of `build_trainer`/`build_(tf_)?policy` utility functions. Instead, use sub-classing of `Trainer` or `Torch|TFPolicy`. POCs done for `PGTrainer`, `PPO[TF|Torch]Policy`. (#20055, #20061)
- V-trace (APPO & IMPALA): “Don’t drop last ts” can optionally be switched on. The default is still to drop it, but this may be changed in a future release. (#19601)
- Upgrade to gym 0.21. (#19535)
🔨 Fixes:
- Minor bugs/issues fixes and enhancements: #19069, #19276, #19306, #19408, #19544, #19623, #19627, #19652, #19693, #19805, #19807, #19809, #19881, #19934, #19945, #20095, #20128, #20134, #20144, #20217, #20283, #20366, #20387
📖Documentation:
- RLlib main page (“RLlib in 60sec”) overhaul. (#20215, #20248, #20225, #19932, #19982)
- Major docstring cleanups in preparation for complete overhaul of API reference pages. (#19784, #19783, #19808, #19759, #19829, #19758, #19830)
- Other documentation enhancements. (#19908, #19672, #20390)
Tune
💫Enhancements:
- Refactored and improved experiment analysis (#20197, #20181)
- Refactored cloud checkpointing API/SyncConfig (#20155, #20418, #19632, #19641, #19638, #19880, #19589, #19553, #20045, #20283)
- Remove magic results (e.g. config) before calculating trial result metrics (#19583)
- Removal of tech debt (#19773, #19960, #19472, #17654)
- Improve testing (#20016, #20031, #20263, #20210, #19730)
- Various enhancements (#19496, #20211)
🔨Fixes:
- Documentation fixes (#20130, #19791)
- Tutorial fixes (#20065, #19999)
- Drop 0 value keys from PGF (#20279)
- Fix shim error message for scheduler (#19642)
- Avoid looping through _live_trials twice in _get_next_trial. (#19596)
- clean up legacy branch in update_avail_resources. (#20071)
- fix Train/Tune integration on Client (#20351)
Train
Ray Train is now in Beta! The beta version includes various usability improvements for distributed PyTorch training and checkpoint management, support for Ray Client, and an integration with Ray Datasets for distributed data ingest.
Check out the docs here, and the migration guide from Ray SGD to Ray Train here. If you are using Ray Train, we’d love to hear your feedback here!
🎉 New Features:
- New `train.torch.prepare_model(...)` and `train.torch.prepare_data_loader(...)` API to automatically handle preparing your PyTorch model and DataLoader for distributed training (#20254).
- Checkpoint management and support for custom checkpoint strategies (#19111).
- Easily configure what and how many checkpoints to save to disk.
- Support for Ray Client (#20123, #20351).
💫Enhancements:
- Simplify workflow for training with a single worker (#19814).
- Ray Placement Groups are used for scheduling the training workers (#20091).
- `PACK` strategy is used by default but can be changed by setting the `TRAIN_ENABLE_WORKER_SPREAD` environment variable.
- Automatically unwrap Torch DDP model and convert to CPU when saving a model as checkpoint (#20333).
🔨Fixes:
📖Documentation:
Serve
We would love to hear from you! Fill out the Ray Serve survey here.
🎉 New Features:
- New `checkpoint_path` configuration allows Serve to save its internal state to external storage (disk, S3, and GCS) and recover upon failure. (#19166, #19998, #20104)
- Replica autoscaling is ready for testing out! (#19559, #19520)
- Native Pipeline API for model composition is ready for testing as well!
🔨Fixes:
- Serve deployment functions or classes can take no parameters (#19708)
- Replica slow start message is improved. You can now see whether it is slow to allocate resources or slow to run constructor. (#19431)
- `pip install ray[serve]` will now install `ray[default]` as well. (#19570)
🏗 Architecture refactoring:
- The terms “backend” and “endpoint” are officially deprecated in favor of “deployment”. (#20229, #20085, #20040, #20020, #19997, #19947, #19923, #19798).
- Progress towards Java API compatibility (#19463).
Dashboard
- Ray Dashboard is now enabled on Windows! (#19575)
Thanks
Many thanks to all those who contributed to this release!
@krfricke, @stefanbschneider, @ericl, @nikitavemuri, @qicosmos, @worldveil, @triciasfu, @AmeerHajAli, @javi-redondo, @architkulkarni, @pdames, @clay4444, @mGalarnyk, @liuyang-my, @matthewdeng, @suquark, @rkooo567, @mwtian, @chenk008, @dependabot[bot], @iycheng, @jiaodong, @scv119, @oscarknagg, @Rohan138, @stephanie-wang, @Zyiqin-Miranda, @ijrsvt, @roireshef, @tkaymak, @simon-mo, @ashione, @jovany-wang, @zenoengine, @tgaddair, @11rohans, @amogkam, @zhisbug, @lchu-ibm, @shrekris-anyscale, @pcmoritz, @yiranwang52, @mattip, @sven1977, @Yard1, @DmitriGekhtman, @ckw017, @WangTaoTheTonic, @wuisawesome, @kcpevey, @kfstorm, @rhamnett, @renos, @TeoZosa, @SongGuyang, @clarkzinzow, @avnishn, @iasoon, @gjoliver, @jjyao, @xwjiang2010, @dmatrix, @edoakes, @czgdp1807, @heng2j, @sungho-joo, @lixin-wei
Ray-1.8.0
Highlights
- Ray SGD has been rebranded to Ray Train! The new documentation landing page can be found here.
- Ray Datasets is now in beta! The beta release includes a new integration with Ray Train yielding scalable ML ingest for distributed training. Check out the docs here, try it out for your ML ingest and batch inference workloads, and let us know how it goes!
- This Ray release supports Apple Silicon (M1 Macs). Check out the installation instructions for more information!
Ray Autoscaler
🎉 New Features:
- Fake multi-node mode for autoscaler testing (#18987)
💫Enhancements:
- Improve unschedulable task warning messages by integrating with the autoscaler (#18724)
Ray Client
💫Enhancements
- Use async rpc for remote call and actor creation (#18298)
Ray Core
💫Enhancements
🔨 Fixes:
- Fixed resource demand reporting for infeasible 1-CPU tasks (#19000)
- Fixed printing Python stack trace in Python worker (#19423)
- Fixed macOS security popups (#18904)
- Fixed thread safety issues for coreworker (#18902, #18910, #18913, #19343)
- Fixed placement group performance and resource leaking issues (#19277, #19141, #19138, #19129, #18842, #18652)
- Improve unschedulable task warning messages by integrating with the autoscaler (#18724)
- Improved Windows support (#19014, #19062, #19171, #19362)
- Fix runtime_env issues (#19491, #19377, #18988)
Ray Data
Ray Datasets is now in beta! The beta release includes a new integration with Ray Train yielding scalable ML ingest for distributed training. It supports repeating and rewindowing pipelines, zipping two pipelines together, better cancellation of Datasets workloads, and many performance improvements. Check out the docs here, try it out for your ML ingest and batch inference workloads, and let us know how it goes!
🎉 New Features:
- Ray Train integration (#17626)
- Add support for repeating and rewindowing a DatasetPipeline (#19091)
- .iter_epochs() API for iterating over epochs in a DatasetPipeline (#19217)
- Add support for zipping two datasets together (#18833)
- Transformation operations are now cancelled when one fails or the entire workload is killed (#18991)
- Expose from_pandas()/to_pandas() APIs that accept/return plain Pandas DataFrames (#18992)
- Customize compression, read/write buffer size, metadata, etc. in the IO layer (#19197)
- Add spread resource prefix for manual round-robin resource-based task load balancing
💫Enhancements:
- Minimal rows are now dropped when doing an equalized split (#18953)
- Parallelized metadata fetches when reading Parquet datasets (#19211)
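To make the equalized-split enhancement concrete: when every shard must hold the same number of rows, only the remainder of the division has to be dropped. The following is a simplified sketch of that arithmetic on a flat list; `equalized_split` is a hypothetical helper and ignores the block structure that Ray Datasets actually works with.

```python
from typing import List, Sequence


def equalized_split(rows: Sequence[int], num_shards: int) -> List[List[int]]:
    # Every shard gets the same number of rows; only the remainder
    # (fewer than num_shards rows) is dropped.
    rows_per_shard = len(rows) // num_shards
    return [
        list(rows[i * rows_per_shard:(i + 1) * rows_per_shard])
        for i in range(num_shards)
    ]


# 10 rows into 3 shards: 3 rows each, 1 row dropped.
shards = equalized_split(list(range(10)), 3)
```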
🔨 Fixes:
- Tensor columns now properly support table slicing (#19534)
- Prevent Datasets tasks from being captured by Ray Tune placement groups (#19208)
- Empty datasets are properly handled in most transformations (#18983)
🏗 Architecture refactoring:
- Tensor dataset representation changed to a table with a single tensor column (#18867)
RLlib
🎉 New Features:
- Allow n-step > 1 and prioritized replay for R2D2 and RNNSAC agents. (#18939)
🔨 Fixes:
- Fix memory leaks in TF2 eager mode. (#19198)
- Faster worker spaces inference if specified through configuration. (#18805)
- Fix bug for complex obs spaces containing Box([2D shape]) and discrete components. (#18917)
- Torch multi-GPU stats not protected against race conditions. (#18937)
- Fix SAC agent with dict space. (#19101)
- Fix A3C/IMPALA in multi-agent setting. (#19100)
🏗 Architecture refactoring:
- Unify results dictionary returned from Trainer.train() across agents regardless of (tf or pytorch, multi-agent, multi-gpu, or algos that use >1 SGD iterations, e.g. ppo) (#18879)
Ray Workflow
🎉 New Features:
- Introduce workflow.delete (#19178)
🔨Fixes:
- Fix a bug that allowed a workflow step to be executed multiple times (#19090)
🏗 Architecture refactoring:
- Object reference serialization is decoupled from workflow storage (#18328)
Tune
🎉 New Features:
- PBT: Add burn-in period (#19321)
💫Enhancements:
- Optional forcible trial cleanup, return default autofilled metrics even if Trainable doesn't report at least once (#19144)
- Use queue to display JupyterNotebookReporter updates in Ray client (#19137)
- Add resume="AUTO" and enhance resume error messages (#19181)
- Provide information about resource deadlocks, early stopping in Tune docs (#18947)
- Fix HEBOSearch installation docs (#18861)
- OptunaSearch: check compatibility of search space with evaluated_rewards (#18625)
- Add `save` and `restore` methods for searchers that were missing them, with tests (#18760)
- Add documentation for reproducible runs (setting seeds) (#18849)
- Deprecate `max_concurrent` in `TuneBOHB` (#18770)
- Add `on_trial_result` to ConcurrencyLimiter (#18766)
- Ensure arguments passed to tune `remote_run` match (#18733)
- Only disable ipython in remote actors (#18789)
🔨Fixes:
- Only try to sync driver if sync_to_driver is actually enabled (#19589)
- sync_client: Fix delete template formatting (#19553)
- Force no result buffering for hyperband schedulers (#19140)
- Exclude trial checkpoints in experiment sync (#19185)
- Fix how durable trainable is retained in global registry (#19223, #19184)
- Ensure `loc` column in progress reporter is filled (#19182)
- Deflake PBT Async test (#19135)
- Fix `Analysis.dataframe()` documentation and enable passing of `mode=None` (#18850)
Ray Train (SGD)
Ray SGD has been rebranded to Ray Train! The new documentation landing page can be found here. Ray Train is integrated with Ray Datasets for distributed data loading while training; documentation is available here.
🎉 New Features:
- Ray Datasets Integration (#17626)
🔨Fixes:
📖Documentation:
Serve
🎉 New Features:
- Add ability to recover from a checkpoint on cluster failure (#19125)
- Support kwargs to deployment constructors (#19023)
🔨Fixes:
- Fix asyncio compatibility issue (#19298)
- Catch spurious ConnectionErrors during shutdown (#19224)
- Fix error with uris=None in runtime_env (#18874)
- Fix shutdown logic with exit_forever (#18820)
🏗 Architecture refactoring:
- Progress towards Serve autoscaling (#18793, #19038, #19145)
- Progress towards Java support (#18630)
- Simplifications for long polling (#19154, #19205)
Dashboard
🎉 New Features:
- Basic support for the dashboard on Windows (#19319)
🔨Fixes:
- Fix healthcheck issue causing the dashboard to crash under load (#19360)
- Work around aiohttp 4.0.0+ issues (#19120)
🏗 Architecture refactoring:
- Improve dashboard agent retry logic (#18973)
Thanks
Many thanks to all those who contributed to this release!
@rkooo567, @lchu-ibm, @scv119, @pdames, @suquark, @antoine-galataud, @sven1977, @mvindiola1, @krfricke, @ijrsvt, @sighingnow, @marload, @jmakov, @clay4444, @mwtian, @pcmoritz, @iycheng, @ckw017, @chenk008, @jovany-wang, @jjyao, @hauntsaninja, @franklsf95, @jiaodong, @wuisawesome, @odp, @matthewdeng, @duarteocarmo, @czgdp1807, @gjoliver, @mattip, @richardliaw, @max0x7ba, @Jasha10, @acxz, @xwjiang2010, @SongGuyang, @simon-mo, @zhisbug, @ccssmnn, @Yard1, @hazeone, @o0olele, @froody, @robertnishihara, @amogkam, @sasha-s, @xychu, @lixin-wei, @architkulkarni, @edoakes, @clarkzinzow, @DmitriGekhtman, @avnishn, @liuyang-my, @stephanie-wang, @Chong-Li, @ericl, @juliusfrost, @carlogrisetti
Ray-1.6.0
Highlights
- Runtime Environments are ready for general use! This feature enables you to dynamically specify per-task, per-actor and per-job dependencies, including a working directory, environment variables, pip packages and conda environments. Install it with `pip install -U 'ray[default]'`.
- Ray Dataset is now in alpha! Dataset is an interchange format for distributed datasets, powered by Arrow. You can also use it for a basic Ray native data processing experience. Check it out here.
- Ray Lightning v0.1 has been released! You can install it via `pip install ray-lightning`. Ray Lightning is a library of PyTorch Lightning plugins for distributed training using Ray. Features:
  - Enables quick and easy parallel training
  - Supports PyTorch DDP, Horovod, and Sharded DDP with Fairscale
  - Integrates with Ray Tune for hyperparameter optimization and is compatible with Ray Client
- `pip install ray` now has a significantly reduced set of dependencies. Features such as the dashboard, the cluster launcher, runtime environments, and observability metrics may require `pip install -U 'ray[default]'` to be enabled. If this causes problems, please report them on GitHub!
Ray Autoscaler
🎉 New Features:
- The Ray autoscaler now supports TPUs on GCP. Please refer to this example for spinning up a simple TPU cluster. (#17278)
💫Enhancements:
- Better AWS networking configurability (#17236, #17207, #14080)
- Support for running autoscaler without NodeUpdaters (#17194, #17328)
🔨 Fixes:
Ray Client
💫Enhancements:
- Updated docs for client server ports and ray.init(ray://) (#17003, #17333)
- Better error handling for deserialization failures (#17035)
🔨 Fixes:
- Fix for server proxy not working with non-default redis passwords (#16885)
Ray Core
🎉 New Features:
- Runtime Environments are ready for general use!
- Specify a working directory to upload your local files to all nodes in your cluster.
- Specify different conda and pip dependencies for your tasks and actors and have them installed on the fly.
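As a rough sketch of what such a runtime environment could look like, the dictionary below combines a working directory, environment variables and pip packages. The package name, the environment variable and the helper `connect_with_env` are placeholders invented for illustration, and passing the dictionary through `ray.init(runtime_env=...)` is an assumption about the entry point; check the Runtime Environments documentation for the exact options supported by your Ray version.

```python
# Illustrative runtime environment description (all values are placeholders).
runtime_env = {
    # Local directory to upload to the nodes of the cluster.
    "working_dir": ".",
    # Environment variables to set for tasks and actors.
    "env_vars": {"EXAMPLE_FLAG": "1"},
    # pip packages to install on the fly.
    "pip": ["requests"],
}


def connect_with_env(env: dict):
    """Hypothetical helper: start or connect to Ray with a per-job environment.

    Assumes `ray[default]` is installed and that `ray.init` accepts a
    `runtime_env` argument in the installed version.
    """
    import ray  # imported lazily so the dict above can be inspected without Ray

    return ray.init(runtime_env=env)


print(sorted(runtime_env))  # ['env_vars', 'pip', 'working_dir']
```

Keeping the environment as a plain dictionary makes it easy to share the same description between a whole job and individual tasks or actors.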
🔨 Fixes:
- Fix plasma store bugs for better data processing stability (#16976, #17135, #17140, #17187, #17204, #17234, #17396, #17550)
- Fix a placement group bug where CUDA_VISIBLE_DEVICES were not properly detected (#17318)
- Improved Ray stacktrace messages. (#17389)
- Improved GCS stability and scalability (#17456, #17373, #17334, #17238, #17072)
🏗 Architecture refactoring:
Ray Data Processing
Ray Dataset is now in alpha! Dataset is an interchange format for distributed datasets, powered by Arrow. You can also use it for a basic Ray native data processing experience. Check it out here.
RLlib
🎉 New Features:
- Support for RNN/LSTM models with SAC (new agent: "RNNSAC"). Shoutout to ddworak94! (#16577)
- Support for ONNX model export (tf and torch). (#16805)
- Allow Policies to be added to/removed from a Trainer on-the-fly. (#17566)
🔨 Fixes:
- Fix for view requirements captured during compute actions test pass. Shoutout to Chris Bamford (#15856)
- Issues: 17397, 17425, 16715, 17174. When on driver, Torch|TFPolicy should not use `ray.get_gpu_ids()` (b/c no GPUs assigned by ray). (#17444)
- Other bug fixes: #15709, #15911, #16083, #16716, #16744, #16896, #16999, #17010, #17014, #17118, #17160, #17315, #17321, #17335, #17341, #17356, #17460, #17543, #17567, #17587
🏗 Architecture refactoring:
- CV2 to Skimage dependency change (CV2 still supported). Shoutout to Vince Jankovics. (#16841)
- Unify tf and torch policies wrt. multi-GPU handling: PPO-torch is now 33% faster on Atari and 1 GPU. (#17371)
- Implement all policy maps inside RolloutWorkers to be LRU-caches so that a large number of policies can be added on-the-fly w/o running out of memory. (#17031)
- Move all tf static-graph code into DynamicTFPolicy, such that policies can be deleted and their tf-graph is GC'd. (#17169)
- Simplify multi-agent configs: In most cases, creating dummy envs (only to retrieve spaces) is no longer necessary. (#16565, #17046)
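The LRU-cache idea behind the policy maps can be sketched in a few lines of plain Python. This is not RLlib's implementation: `LRUPolicyMap` is a hypothetical class, the "policies" are plain strings, and evicted entries are simply recorded here, whereas a real policy map may handle evicted policies differently.

```python
from collections import OrderedDict


class LRUPolicyMap:
    """Hypothetical bounded mapping from policy id to policy object.

    Keeps at most `capacity` policies in memory and evicts the least
    recently used one when a new policy would exceed the limit.
    """

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._cache = OrderedDict()
        self.evicted = []  # ids of evicted policies, oldest first

    def __setitem__(self, policy_id, policy):
        self._cache[policy_id] = policy
        self._cache.move_to_end(policy_id)  # mark as most recently used
        if len(self._cache) > self.capacity:
            old_id, _ = self._cache.popitem(last=False)  # drop the LRU entry
            self.evicted.append(old_id)

    def __getitem__(self, policy_id):
        policy = self._cache[policy_id]
        self._cache.move_to_end(policy_id)  # a read also counts as a use
        return policy

    def __contains__(self, policy_id):
        return policy_id in self._cache

    def __len__(self):
        return len(self._cache)


policies = LRUPolicyMap(capacity=2)
policies["a"] = "policy-a"
policies["b"] = "policy-b"
_ = policies["a"]           # "a" becomes the most recently used
policies["c"] = "policy-c"  # exceeds capacity, so "b" is evicted
print(policies.evicted)     # ['b']
```

Bounding the number of in-memory policies this way is what lets many policies be added on the fly without memory growing with each one.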
📖Documentation:
- Example scripts do-over (shoutout to Stefan Schneider for this initiative).
- Example script: League-based self-play with "open spiel" env. (#17077)
- Other doc improvements: #15664 (shoutout to kk-55), #17030, #17530
Tune
🎉 New Features:
- Dynamic trial resource allocation with ResourceChangingScheduler (#16787)
- It is now possible to use a define-by-run function to generate a search space with OptunaSearch (#17464)
💫Enhancements:
- String names of searchers/schedulers can now be used directly in tune.run (#17517)
- Filter placement group resources if not in use (progress reporting) (#16996)
- Add unit tests for flatten_dict (#17241)
🔨Fixes:
📖Documentation:
- LightGBM integration (#17304)
- Other documentation improvements: #17407 (shoutout to amavilla), #17441, #17539, #17503
SGD
🎉 New Features:
- We have started initial development on a new RaySGD v2! We will be rolling it out in a future version of Ray. See the documentation here. (#17536, #17623, #17357, #17330, #17532, #17440, #17447, #17300, #17253)
💫Enhancements:
- Placement Group support for TorchTrainer (#17037)
Serve
🎉 New Features:
- Add Ray API stability annotations to Serve, marking many `serve.*` APIs as `Stable` (#17295)
- Support `runtime_env`'s `working_dir` for Ray Serve (#16480)
🔨Fixes:
- Fix FastAPI's response_model not added to class based view routes (#17376)
- Replace `backend` with `deployment` in metrics & logging (#17434)
🏗Stability Enhancements:
- Large-scale (1K+ cores) tests of Ray Serve with single and multiple deployments now run nightly (#17310, #17411, #17368, #17026, #17277)
Thanks
Many thanks to all who contributed to this release:
@suquark, @xwjiang2010, @clarkzinzow, @kk-55, @mGalarnyk, @pdames, @Souphis, @edoakes, @sasha-s, @iycheng, @stephanie-wang, @antoine-galataud, @scv119, @ericl, @amogkam, @ckw017, @wuisawesome, @krfricke, @vakker, @qingyun-wu, @Yard1, @juliusfrost, @DmitriGekhtman, @clay4444, @mwtian, @corentinmarek, @matthewdeng, @simon-mo, @pcmoritz, @qicosmos, @architkulkarni, @rkooo567, @navneet066, @dependabot[bot], @jovany-wang, @kombuchafox, @thomasjpfan, @kimikuri, @Ivorforce, @franklsf95, @MissiontoMars, @lantian-xu, @duburcqa, @ddworak94, @ijrsvt, @sven1977, @kira-lin, @SongGuyang, @kfstorm, @Rohan138, @jamesmishra, @amavilla, @fyrestone, @lixin-wei, @stefanbschneider, @jiaodong, @richardliaw, @WangTaoTheTonic, @chenk008, @Catch-Bull, @Bam4d