Releases: deepspeedai/DeepSpeed
v0.17.6 Patch Release
What's Changed
- Update version.txt after 0.17.5 release by @loadams in #7502
- Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2 by @LYMDLUT in #7421
- CI funding shout out to modal.com by @stas00 in #7503
- Fix assert when 'pp_int' object has no attribute 'custom_print_str' by @aeeeeeep in #7507
- Update TSC Committers by @PKUWZP in #7517
- Enabling Muon Optimizer in DeepSpeed by @PKUWZP in #7509
- Enable non-ZeRO mode by @sfc-gh-truwase in #7515
- Update README with ZenFlow release blog featured by PyTorch. by @Antlera in #7520
- Add riscv64 cpu support in deepspeed_shm_comm op by @heyujiao99 in #7519
- ZeRO3: Improve mismatch detection by @sfc-gh-truwase in #7525
- fix typo s/1014 /1024 by @digger-yu in #7528
- undo the revert by @stas00 in #7536
- [logging] less startup noise by @stas00 in #7526
- [doc] fixing moe tutorial by @stas00 in #7538
- docs typo: lrrt.md, reference to cycle_min_lr should be cycle_max_lr by @jakehemmerle in #7530
- fixed DeepSpeedCPULion with ZeRO-Offload bug by @qibin0506 in #7531
- Fix scaling and allgather with torch.autocast by @tohtana in #7534
- Fix zenflow_torch_adam.py by @stas00 in #7544
- Relax restrictions of torch.autocast integration by @tohtana in #7543
- Autotune ZenFlow affinity by @delock in #7506
- fix get_cuda_compile_flag by @mingjielu in #7521
- avoid setting device_id to init_process_group by @kaixuanliu in #7542
- Improve error message and reduce validation in autocast test by @tohtana in #7547
- Revert "Add index to HPU devices (#7497)" by @deepcharm in #7545
- [ALST tutorial] support bs>1 by @sfc-gh-sbekman in #7550
- [MoE] Fix misuse of num_experts as expert parallel group size (ep_size) by @Flakes342 in #7551
- Limit random seed range in tests by @tohtana in #7553
- Fix gradient buffer access for DeepCompile Z1/2 by @tohtana in #7548
- Move modal tests to tests/v1 by @tohtana in #7557
- Add dependency for deepcompile test by @tohtana in #7558
- deepcompile: Create dummy inputs using empty_strided by @eternalNight in #7564
- deepcompile: Record graph order using OrderedDict by @eternalNight in #7563
- deepcompile: Create a full list of no-copy ops by @eternalNight in #7562
- fix npu device_id AttributeError issue by @we1sper in #7560
- Make Muon optimizer easier to enable by @delock in #7555
- scripts: Check .is_cuda only in non-C++ files by @eternalNight in #7561
- [bugfix] fix partition context unpatch by @hjh0119 in #7566
New Contributors
- @LYMDLUT made their first contribution in #7421
- @aeeeeeep made their first contribution in #7507
- @heyujiao99 made their first contribution in #7519
- @jakehemmerle made their first contribution in #7530
- @qibin0506 made their first contribution in #7531
- @mingjielu made their first contribution in #7521
- @kaixuanliu made their first contribution in #7542
- @sfc-gh-sbekman made their first contribution in #7550
- @Flakes342 made their first contribution in #7551
- @we1sper made their first contribution in #7560
- @hjh0119 made their first contribution in #7566
Full Changelog: v0.17.5...v0.17.6
v0.17.5 Patch Release
What's Changed
- Update version.txt after v0.17.4 release by @loadams in #7460
- Update README.md by @PKUWZP in #7465
- Add getter APIs for TP/PP/DP ranks in DeepSpeedEngine by @WoosungMyung in #7427
- fix issues raised by Coverity scans by @NirSonnenschein in #7431
- Fix all-gather duplicate params and wrong dtype by @eternalNight in #7462
- fix #7188 by @lpnpcs in #7371
- add --bind_cores_to_rank to zero offload tutorial by @delock in #7474
- Add blog for ZenFlow by @Antlera in #7463
- Fix cpu CI by @sfc-gh-truwase in #7481
- fix deepspeed --venv_script by @stas00 in #7469
- Modal CI by @sfc-gh-truwase in #7289
- [UlyssesSPDataLoaderAdapter] fix iterator reset by @stas00 in #7472
- [TiledFusedLogitsLoss] support inference by @stas00 in #7477
- Fix pre-compile on cpu-only machines by @AlongWY in #7168
- Enable forked PRs by @sfc-gh-truwase in #7486
- fix xpu device_id AttributeError issue by @yao-matrix in #7488
- Add Zenflow code for Stage 1 & 2 by @Antlera in #7391
- Fix invalid f-strings by @cyyever in #7457
- Fix DeepCompile for PyTorch v2.8 by @tohtana in #7496
- Reduce performance impact of compiler.enable decorator by @deepcharm in #7498
- Add index to HPU devices by @deepcharm in #7497
New Contributors
- @WoosungMyung made their first contribution in #7427
- @eternalNight made their first contribution in #7462
- @lpnpcs made their first contribution in #7371
- @Antlera made their first contribution in #7463
- @AlongWY made their first contribution in #7168
- @yao-matrix made their first contribution in #7488
- @cyyever made their first contribution in #7457
Full Changelog: v0.17.4...v0.17.5
v0.17.4 Patch Release
v0.17.3 Patch Release
What's Changed
- [TiledMLP]: fix for bs>1 by @stas00 in #7412
- Update version.txt after v0.17.2 release. by @loadams in #7417
- Enable torch version dependent compilation of record_module and iter_params by @deepcharm in #7362
- [BUGFIX] Reset bucket.elements after reduction in ZeRO Stage 3 by @rahul713rk in #7418
- Align missing argument in AllReduceCoalescedHandle by @deepcharm in #7414
- Improvements to Communication Logger by @alexk101 in #7404
- trying to fix nv-accelerate-v100.yml CI job by @stas00 in #7424
- fix: Propagate strip_tensor_paddings by @saforem2 in #7426
- Use past_key_value when provided by @deepcharm in #7428
- set device_id in torch's init_process_group by @stas00 in #7266
- [Ulysses-ALST] add FA3 support by @stas00 in #7430
- TiledMLP + SequenceTiledCompute: improve the bs>1 use-case by @stas00 in #7422
- Remove unused yaml test configurations and update README by @loadams in #7441
- [ALST] fix typo in the url by @stas00 in #7444
- [ALST] fix typo in the url part2 by @stas00 in #7446
- Remove additional unused tests (human-eval) by @loadams in #7445
- Fix: Adapt Llama injection policy for newer transformers versions by @huanyuqu in #7443
New Contributors
- @rahul713rk made their first contribution in #7418
- @huanyuqu made their first contribution in #7443
Full Changelog: v0.17.2...v0.17.3
v0.17.2 Patch Release
What's Changed
- Update version after 0.17.1 release by @loadams in #7345
- s/UlyssesPlus/Arctic Long Sequence Training (ALST)/ by @stas00 in #7348
- Don't break set_start_method by @tjruwase in #7349
- Fix error of <glog/logging.h> by @Freed-Wu in #7351
- Improve padding util for compile by @tohtana in #7355
- Fix 404s by @tjruwase in #7363
- Fix tutorial title by @stas00 in #7365
- Restore real inputs for recompilation by @tohtana in #7356
- Fix(scheduler): WarmupLR inherits optimizer lr when not specified by @Flink-ddd in #7360
- sequence parallel default dtype by @stas00 in #7364
- Enable torch.autocast with ZeRO by @tohtana in #6993
- add Arctic Long Sequence Training paper reference by @stas00 in #7372
- Flops profiler support for F.interpolate by @sfc-gh-truwase in #7353
- Relax tolerances for FP8 unit test only for ROCm + FP16 by @rraminen in #7373
- Update latest news with DeepNVMe by @loadams in #7375
- Fix release of IPG buffer by @tohtana in #7376
- fix wandb.log() call by removing sync kwarg by @ned2 in #7383
- Fix dtype mismatch in TestParamPartitioningSkipInit by @tohtana in #7377
- Add support for ws=1 scenario by @NirSonnenschein in #7379
- fix(inference): Add missing dtype attribute to ParameterBase setter by @Flink-ddd in #7378
- add blog link by @stas00 in #7385
- fix broken url by @stas00 in #7390
- add support for CUDAtk12.9 by @loscrossos in #7394
- Fix unbound local error for return_val by @HollowMan6 in #7395
- Fix ZeRO stage 1 and add stage 2 support with DeepCompile by @tohtana in #7366
- Improve coverage of DeepCompile by @tohtana in #7386
- Added device detection to communication logging by @alexk101 in #7398
- fix: Add csrc/compile to include paths for DeepCompile builder by @HollowMan6 in #7401
- fix: DeepCompile for torch 2.8 by @HollowMan6 in #7402
- fix(comm): Expose GradBucket in deepspeed.comm API by @Flink-ddd in #7400
- fix: fix FileNotFoundError for build_win.bat by @gjj2828 in #7399
- fix: engine initializes optimizer attributes at the beginning by @HollowMan6 in #7410
New Contributors
- @Freed-Wu made their first contribution in #7351
- @Flink-ddd made their first contribution in #7360
- @ned2 made their first contribution in #7383
- @alexk101 made their first contribution in #7398
- @gjj2828 made their first contribution in #7399
Full Changelog: v0.17.1...v0.17.2
v0.17.1 Patch Release
What's Changed
- Update version.txt after v0.17.0 release by @loadams in #7326
- Ulysses Plus Docs by @stas00 in #7331
- UlyssesPlus Docs take 2 by @stas00 in #7332
- Improve Ulysses Plus Docs by @cynricfu in #7335
- Update config_utils.py by @qgallouedec in #7333
- Fix pytest version to 8.3.5 in hpu-gaudi actions by @raza-sikander in #7337
- Fix issue with symint input by @tohtana in #7243
- fp16 optimizer timers fix - TypeError: 'NoneType' object is not callable by @rraminen in #7330
- DeepNVMe update by @tjruwase in #7215
- fixed: Modified the topkgating function and modified the test_moe file for testing by @xiongjyu in #7163
- Fix LoRA arxiv reference by @emmanuel-ferdman in #7340
- Update folder name by @sfc-gh-truwase in #7343
- Improve overflow handling in ZeRO by @tjruwase in #6976
- Fix docs that are rendering Incorrectly by @felixgondwe in #7344
- Move pytest pinning from individual tests to requirements-dev.txt until fixed. by @loadams in #7327
New Contributors
- @cynricfu made their first contribution in #7335
- @xiongjyu made their first contribution in #7163
- @sfc-gh-truwase made their first contribution in #7343
- @felixgondwe made their first contribution in #7344
Full Changelog: v0.17.0...v0.17.1
DeepSpeed v0.17.0
What's Changed
- Update next version in version.txt after 0.16.9 release. by @loadams in #7306
- Update COMMITTERS.md by @PKUWZP in #7305
- Fix AutoTP gathering replaced layer params when bias is not None by @HollowMan6 in #7257
- Fix the GPU memory usage of ZeRO-Offload (only update stage_1_and_2.py) by @arminzhu in #7309
- Fix: Update grad norm calculation for CPU offload by @therealnaveenkamal in #7302
- CI: prefer bf16 over fp16 by @stas00 in #7304
- tests/conftest.py: automatically add local deepspeed repo when running tests by @stas00 in #7317
- Update gaudi2 nightly,ci to latest 1.21.0 build by @raza-sikander in #7313
- anchor transformers version by @stas00 in #7316
- fix asymmetric in dequantize by @pencil-hub in #7283
- Ulysses SP for HF Integration by @stas00 in #7268
- Fix ci hang in torch2.7& improve ut by @inkcherry in #7321
- Bump to v0.17.0 by @sfc-gh-mwyatt in #7324
New Contributors
- @PKUWZP made their first contribution in #7305
- @arminzhu made their first contribution in #7309
- @therealnaveenkamal made their first contribution in #7302
- @pencil-hub made their first contribution in #7283
- @sfc-gh-mwyatt made their first contribution in #7324
Full Changelog: v0.16.9...v0.17.0
v0.16.9 Patch Release
What's Changed
- Update patch version after 0.16.8 release by @loadams in #7296
- Avoid graph break by removing another redundant requires grad false by @deepcharm in #7263
- Add qwen3 meta loading for AutoTP by @delock in #7293
- Modernize system executable detection across components by @emmanuel-ferdman in #7290
- Enable ZeRO set/get APIs for NVMe offload by @tjruwase in #7046
- Add qwen3moe meta loading for AutoTP by @ranzhejiang in #7297
- disable license check until the new license situation has been sorted… by @stas00 in #7301
- Fix extra_repr_str when weight is None / in zero-3 by @HollowMan6 in #7254
- [XPU] Support XCCL on deepspeed side by @ys950902 in #7299
New Contributors
- @emmanuel-ferdman made their first contribution in #7290
Full Changelog: v0.16.8...v0.16.9
v0.16.8 Patch Release
What's Changed
- Update version.txt after 0.16.7 release by @loadams in #7232
- Recommend using latest by @tohtana in #7233
- [NFC] Fix comment related to SP group by @c8ef in #7234
- Add cpu accelerator fp16 dtype support by @Yejing-Lai in #7207
- Update CPU torch version to 2.7 by @loadams in #7241
- Update README.md by @jizhang02 in #7246
- Fix compile error for nv_bloat162 by @loscrossos in #7248
- add Makefile to ease maintenance by @stas00 in #7267
- Fix fp8 gemm by @RezaYazdaniAminabadi in #7265
- [XPU] update xpu-max1100 CI workflow to torch 2.7 by @Liangliang-Ma in #7284
- Fix issues XPU tests hit with extra-index-url by @loadams in #7291
- Temporarily skip AIO tests due to an issue with runners by @loadams in #7288
- rollback #6726 by @delock in #7258
New Contributors
- @jizhang02 made their first contribution in #7246
- @loscrossos made their first contribution in #7248
Full Changelog: v0.16.7...v0.16.8
v0.16.7 Patch Release
What's Changed
- Update version.txt after 0.16.6 release by @loadams in #7218
- Fix release links by @tjruwase in #7219
- Fix pass for z3 and profiler by @tohtana in #7222
- Fix build on AMD GPUs (related to DeepCompile) by @HollowMan6 in #7224
- Add defence for DeepCompile w/o optimizer by @HollowMan6 in #7225
- Pass with_cuda arg for jit_load in OpBuilder by @HollowMan6 in #7226
- Make sure it's not None before offloading contiguous_grad_buffer by @HollowMan6 in #7227
Full Changelog: v0.16.6...v0.16.7