Releases · deepspeedai/DeepSpeed
v0.16.6 Patch Release
What's Changed
- Update version.txt after 0.16.5 release by @loadams in #7180
- Cross layer overlapping for domino by @hwchen2017 in #7178
- async tp allreduce by @inkcherry in #7115
- Fix issue #5242 grad_norm and loss is nan by @Glaceon-Hyy in #7171
- Add qwen3 autotp support by @Yejing-Lai in #7187
- Update to new torch grad hook API: BF16Optimizer and Stage2 by @deepcharm in #7189
- Reland perf fix for nan inf check by @nelyahu in #7184
- Update to fix pydantic warning by @loadams in #7193
- update dependencies version info by @inkcherry in #7206
- Fix HPU accelerator memory mapping broken by torch filling uninitialized memory by @oelayan7 in #7209
- Support complicated use cases with TiedLayerSpec by @limjcst in #7208
- Add defence for offload_states and reload_states w/o optimizer by @HollowMan6 in #7211 (see the sketch after this list)
- DeepCompile for enhanced compiler integration by @tohtana in #7154
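The `offload_states`/`reload_states` pair hardened in #7211 lets a training script temporarily evict ZeRO-3 states from GPU memory. A minimal sketch, assuming the keyword names from the public API introduced in #6011 and run under the `deepspeed` launcher (verify exact signatures in your version's docs):

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

# Park ZeRO-3 parameters, gradients, and optimizer state in (pinned) host
# memory between training phases, then restore them before the next step.
engine.offload_states(pin_memory=True, non_blocking=False)
engine.reload_states()
```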
New Contributors
- @Glaceon-Hyy made their first contribution in #7171
- @limjcst made their first contribution in #7208
Full Changelog: v0.16.5...v0.16.6
v0.16.5 Patch Release
What's Changed
- Update version.txt after 0.16.4 release by @loadams in #7063
- fix an outdated doc wrt CUDA_VISIBLE_DEVICES by @stas00 in #7058
- Tecorigin sdaa accelerator by @siqi654321 in #6903
- Handle special case of libuv for Windows by @loadams in #7064
- Bug Fix for offload_states API by @U-rara in #7050
- Update README with info on newest accelerator by @loadams in #7065
- Fix TOCTOU issues, switch to fstat by @loadams in #7067
- config torch to avoid graph breaks caused by logger by @ShellyNR in #6999
- Fix meta load tensor incompatible issue by @Yejing-Lai in #7073
- Replace calls to `python setup.py sdist` with `python -m build --sdist` by @loadams in #7069
- Revert "Handle special case of libuv for Windows (#7064)" by @loadams in #7076
- Add DeepseekV3 AutoTP. by @Yejing-Lai in #7045
- Improve inference tutorial docs by @loadams in #7083
- Pin transformers version on tests that use latest. by @loadams in #7085
- Update README.md with ICS '23 MoE paper link by @siddharth9820 in #7087
- Update parallelism for nv-torch-latest/nightly tests due to more GPUs/runner by @loadams in #7086
- Remove workflows for very old torch versions by @loadams in #7090
- Use new dlpack api; Formatting fixes by @tjruwase in #7101
- Avoid graph breaks by disabling sourceless calls in instrument_w_nvtx by @deepcharm in #7081
- Avoid graph breaks in torch.compile caused by inner classes in the backward hooks by @deepcharm in #7062
- Only run pre-commit on the changes by @hwchen2017 in #7106
- Avoid graph break due to unsupported frozenset by @deepcharm in #7105
- Fix fused_qkv print model ValueError by @Yejing-Lai in #7109
- Update references to new X/Twitter handle by @loadams in #7110
- Update gaudi2 nightly/CI to latest 1.20.0 build by @raza-sikander in #7093
- fix keep_module_on_host by @inkcherry in #7112
- Add sequential pytest mark to TestNVMeCheckpointing to resolve pytest forked hangs by @loadams in #7131
- Training multiple models by @tjruwase in #7018 (see the sketch after this list)
- Update CONTRIBUTING.md to reflect changes from CLA to DCO by @loadams in #7135
- Avoid missing attr error by @tjruwase in #7133
- Add conditional expression by @A-transformer in #7119
- Unpin transformers version for most workflows by @loadams in #7139
- Conditionally quote env vars by @saurabhkoshatwar in #7071
- Correct the BACKWARD_PREFETCH_SUBMIT mismatch by @A-transformer in #7120
- Enhance Gaudi2 CI/Nightly Coverage with Model Parallelism and Linear Tests by @raza-sikander in #7146
- Update container version that runs on A6000 tests. by @loadams in #7153
- hf tp+zero training doc. by @inkcherry in #7151
- Avoid graph break by removing redundant requires_grad attr change by @deepcharm in #7158
- Add destroy to tests to free memory by @tohtana in #7160
- [NFC] Typo fix in SP layer. by @c8ef in #7152
- Link AutoTP blog in the front page by @hwchen2017 in #7167
- Fix `seq_parallel_communication_data_type` constant by @stas00 in #7175
- Fix typos in GDS blog by @loadams in #7177
- Variable batch size and LR scheduler by @bm-synth in #7104
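The multiple-models support in #7018 targets workflows such as RLHF, where several engines coexist in one process. A minimal sketch, assuming one `deepspeed.initialize` call per model (the `actor`/`critic` names and config values are illustrative):

```python
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

actor = torch.nn.Linear(512, 512)
critic = torch.nn.Linear(512, 1)

# One engine per model; each carries its own optimizer and LR schedule.
actor_engine, _, _, _ = deepspeed.initialize(
    model=actor, model_parameters=actor.parameters(), config=ds_config)
critic_engine, _, _, _ = deepspeed.initialize(
    model=critic, model_parameters=critic.parameters(), config=ds_config)
```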
New Contributors
- @siqi654321 made their first contribution in #6903
- @A-transformer made their first contribution in #7119
- @saurabhkoshatwar made their first contribution in #7071
- @c8ef made their first contribution in #7152
Full Changelog: v0.16.4...v0.16.5
v0.16.4 Patch Release
What's Changed
- Update version.txt after 0.16.3 release by @loadams in #6965
- Precisely track nvme optimizer offload by @tjruwase in #6963
- Update build_win.bat script to exclude GDS op as it lacks Windows support by @loadams in #6971
- Add CUDA 12.8 support and comment on CUDA 12.7 by @loadams in #6975
- Update cpu torch latest to use torch 2.6 by @loadams in #6977
- generalize deepspeed linear and implement it for non-CUDA systems by @oelayan7 in #6932
- Update recommended Windows whl building versions by @loadams in #6983
- Fix setup_env_ranks to Properly Set Environment Variables Instead of Raising Error by @fabiosanger in #6979
- Specify torchvision in nv-ds-chat workflow (prevents errors with torch 2.6) by @loadams in #6982
- Remove assumption that padding only occurs on last rank by @xylian86 in #6974
- Use ds-specific module id to avoid conflicts by @tjruwase in #6847
- Update A6000 workflows to use newer docker container - 24.09 vs 24.03 by @loadams in #6967
- Allow NVIDIA Blackwell by @fabiendupont in #6991
- Update GH org references by @tjruwase in #6998
- [XPU] max1100 workflow update for docker and software by @Liangliang-Ma in #7003
- autotp training (fix DCO) by @inkcherry in #7004
- import triton files when triton is supported and installed by @oelayan7 in #6989
- Update A6000 tests transformers version by @loadams in #7016
- Fix ds-chat CI regression by @tjruwase in #7015
- [Ulysses tutorial] typos by @stas00 in #7024
- fix hostname -I for macOS #6497 by @fitzjalen in #6990
- Update workflows to cuda 12.4 by @loadams in #7000
- [ROCm] Enable fp_quantizer on ROCm by @rraminen in #7027
- add gds chinese blog by @GuanhuaWang in #7034
- Add chinese blog for deepspeed windows, and fix format by @hwchen2017 in #7035
- AIO on ROCM by @jomayeri in #7023
- Control trace cache warnings by @tjruwase in #7039
- Update CUDA compute capability to support Blackwell by @hwchen2017 in #7047
- Update setup.py handling of ROCm cupy by @loadams in #7051
- nv-ds-chat breaks with latest transformers by @loadams in #7052
- Rename aio_thread_count to intra_op_parallelism by @tjruwase in #7056 (see the config sketch after this list)
- add autoTP training zero2 tests by @inkcherry in #7049
- Fix bf16 optimizer: remove duplicate loop by @wukong1992 in #7054
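The `intra_op_parallelism` rename in #7056 lives in the `aio` config block. A hedged sketch of what that section could look like; the surrounding keys follow the DeepNVMe docs and the values are placeholders, not tuned recommendations:

```python
ds_config = {
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
        "intra_op_parallelism": 1,  # formerly the aio thread-count setting
    }
}
```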
New Contributors
- @fabiosanger made their first contribution in #6979
- @fabiendupont made their first contribution in #6991
- @fitzjalen made their first contribution in #6990
- @wukong1992 made their first contribution in #7054
Full Changelog: v0.16.3...v0.16.4
v0.16.3 Patch Release
What's Changed
- Update version.txt after 0.16.2 release by @loadams in #6893
- Allow compiling collectives for PT > 2.3 by @NirSonnenschein in #6899
- Zero2: avoid graph breaks in torch.compile by using param_idx by @nelyahu in #6803
- hpu_accelerator: use torch.use_deterministic_algorithms by @nelyahu in #6897
- Fix error caused by all_reduce call in domino by @hwchen2017 in #6880
- Update Gaudi2 jobs to latest 1.19 build by @raza-sikander in #6905
- Change compile for pipeline module torch.compile by @NirSonnenschein in #6478
- Stage3: Use new torch grad accumulation hooks API by @deepcharm in #6773
- Cleanup ops/transformer/inference tests by @loadams in #6830
- Fix `checkpointable_layers` logic by @Quentin-Anthony in #6881
- [BUG FIX]: fix get torch.version.cuda error when cuda is None in rocm by @hj-wei in #6909
- Add fp8_gemm fallback for non-triton systems by @oelayan7 in #6916
- Reduce the device bubble introduced by heavy loop synchronization in coalesced fetch/release(z3_leaf_module) by @inkcherry in #6694
- Cleanup ops/transformer/inference tests by @loadams in #6925
- Check transformers version in BLOOM for inference v1 by @lekurile in #6766
- inference: remove unused _validate_args function by @nelyahu in #5505
- Use `torch.log1p` by @kit1980 in #6930 (see the sketch after this list)
- Update python version classifiers by @loadams in #6933
- Fix building on Windows with presence of Triton by @woct0rdho in #6749
- Fix windows blog examples by @loadams in #6934
- Add deepseek autotp by @Yejing-Lai in #6937
- Add position_ids arg to OPTEmbedding forward function by @lekurile in #6939
- Add information on security expectations with this software by @loadams in #6941
- Support pure meta model lm_head tp by @Yejing-Lai in #6812
- Remove op compilation flags due to perf issue by @NirSonnenschein in #6944
- Pin nv-a6000 workflow by @loadams in #6938
- [inf] Add config var to enable keeping module on host by @oelayan7 in #6846
- `warn` to `warning` by @qgallouedec in #6952
- Add extra_repr to Linear classes for debugging purpose by @Xia-Weiwen in #6954
- Update import for torchvision.transformers by @loadams in #6958
- Remove Duplicate Declaration of pandas in `Dockerfile` by @Zerohertz in #6959
- Add the missing view operations from sequence parallel (async) by @inkcherry in #6750
- Update `torch.norm` to `torch.linalg.norm` and `torch.linalg.vector_norm` by @loadams in #6931 (see the sketch after this list)
- Using explicit GPU upcast for ZeRO-Offload by @xylian86 in #6962
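The two PyTorch modernizations above are easy to see side by side; a quick sketch:

```python
import torch

# torch.log1p(x) keeps precision where log(1 + x) loses it for small x.
x = torch.tensor([1e-8], dtype=torch.float32)
print(torch.log(1 + x))  # tensor([0.]) -- 1 + 1e-8 rounds to 1 in fp32
print(torch.log1p(x))    # ~1e-08 -- the small value survives

# torch.linalg.vector_norm is the maintained replacement for torch.norm on
# flattened inputs (torch.linalg.norm covers matrix norms).
g = torch.randn(4, 4)
assert torch.allclose(torch.norm(g), torch.linalg.vector_norm(g))
```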
New Contributors
- @hj-wei made their first contribution in #6909
- @kit1980 made their first contribution in #6930
- @woct0rdho made their first contribution in #6749
- @Xia-Weiwen made their first contribution in #6954
- @Zerohertz made their first contribution in #6959
Full Changelog: v0.16.2...v0.16.3
v0.16.2 Patch Release
What's Changed
- Update pre-commit version by @loadams in #6821
- Update version.txt after 0.16.1 release by @loadams in #6826
- Pin HPU tests by @loadams in #6831
- Flops profiler: support einops.einsum by @lvhoaa in #6755 (see the sketch after this list)
- Pin pytest-subtests version for accelerate tests by @loadams in #6842
- Inference UTs check for triton support from accelerator by @raza-sikander in #6782
- Unpin pytest-subtests now that 0.14.1 is released by @loadams in #6844
- Merge LoCo with Zero++ by @XingyuXie in #6730
- Fix type error in `ZeROOrderedDict` by @oraluben in #6794
- Fix uneven head sequence parallelism bug (#6774) by @Eugene29 in #6797
- Fix nv-torch-nightly test by pinning transformers by @loadams in #6849
- Remove broken links to non-active site by @kaiksi-bb in #6854
- Avoid poisoning process with CUDA calls as soon as importing by @HollowMan6 in #6810
- Fix xpu tests workflow failure by changing pip index url by @Liangliang-Ma in #6864
- Domino updates by @GuanhuaWang in #6861
- add domino navigation by @GuanhuaWang in #6866
- Update TSC by @tjruwase in #6867
- Remove warnings from autodoc and sphinx by @loadams in #6788
- Update real_accelerator.py by @keiwoo in #6845
- Fix assertion for offloading states by @tohtana in #6855
- Remove pin from transformers version and fix Processing/Threading issues in tests by @loadams in #6822
- Add MLP/lm_head tp grain size setting. by @Yejing-Lai in #6828
- Fix --enable_each_rank_log when used with PDSH multi-node runner by @akeshet in #6863
- Update transformers ops unit tests to use `requried_torch_version` by @loadams in #6884
- Don't error out when cpu accelerator doesn't have torch (as default for whl building) by @loadams in #6886
- Add arctic model support by adding w2 to all_reduce by @pi314ever in #6856
- Update code owners by @tjruwase in #6890
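With #6755, `einops.einsum` calls are counted by the flops profiler. A short example of such a call (requires einops >= 0.5; shapes are illustrative):

```python
import torch
import einops

a = torch.randn(8, 64, 32)
b = torch.randn(8, 32, 16)
# einops.einsum takes the operands first and the axis pattern last,
# unlike torch.einsum.
c = einops.einsum(a, b, "batch i j, batch j k -> batch i k")
print(c.shape)  # torch.Size([8, 64, 16])
```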
New Contributors
- @lvhoaa made their first contribution in #6755
- @XingyuXie made their first contribution in #6730
- @Eugene29 made their first contribution in #6797
- @kaiksi-bb made their first contribution in #6854
- @HollowMan6 made their first contribution in #6810
- @keiwoo made their first contribution in #6845
- @akeshet made their first contribution in #6863
- @pi314ever made their first contribution in #6856
Full Changelog: v0.16.1...v0.16.2
v0.16.1 Patch Release
What's Changed
- Update version.txt after 0.16.0 release by @loadams in #6786
- Domino news update on readme.md by @GuanhuaWang in #6815
- Fix zero checkpoint by @xu-song in #6792
- Update python version but now we need to include setuptools on our own by @loadams in #6787
- Adding the new feature of FPDT by @YJHMITWEB in #6462
- Pin transformers to avoid errors with latest version by @loadams in #6820
- Ulyssess offload blog by @samadejacobs in #6814
- add FPDT tutorial by @samadejacobs in #6813
- Update README.md by @samadejacobs in #6824
- Update README.md by @samadejacobs in #6825
- Pin transformers version in cpu-torch-latest due to multiprocessing error. by @loadams in #6823
Full Changelog: v0.16.0...v0.16.1
DeepSpeed v0.16.0
What's Changed
- Update version.txt after 0.15.4 release by @loadams in #6731
- Update GH hosted workflows to 24.04 by @loadams in #6717
- Add COMMITTER file by @tjruwase in #6741
- Update AMD apex version by @loadams in #6739
- Fix Type Name Inconsistency & Typo in cpu_adam by @xylian86 in #6732
- Add Domino code by @zhangsmallshark in #6733
- Add data type check for bf16 by @hwchen2017 in #6742
- add zero3 `module_granularity_threshold` to zero optimization by @inkcherry in #6649
- AIO File Offsets by @jomayeri in #6641
- Update path for BingBertSquad from DeepSpeedExamples by @loadams in #6746
- Sanitize inputs to eval() by @loadams in #6745
- Adding the governance doc by @minjiazhang in #6748
- Add no_sync context manager by @tjruwase in #6675
- Gaudi2 Nightly job for daily check by @raza-sikander in #6753
- Disable failing python tests by @loadams in #6758
- A faster and more memory-efficient implementation of `zero_to_fp32` by @xu-song in #6658
- Pin transformers version to work around latest torch requirements by @loadams in #6759
- make xpu ops compatible with oneapi 2025.0 by @baodii in #6760
- Add explicit parameters for torch.load by @loadams in #6751
- Fix setup.py bash cmd generation to correctly extract git info by @nelyahu in #6762
- Use `json_schema_extra` instead of extra keyword in `Field` by @qgallouedec in #6764 (see the sketch after this list)
- Fix potential memory issues when using DeepSpeed Z3 by @wenbinc-Bin in #6726
- Removes unnecessary cloning by @swigls in #6761
- Enable torch compile on _allgather_params by @deepcharm in #6769
- Unpin with latest transformers fixes by @loadams in #6763
- docs: fix HF links by @imba-tjd in #6780
- Fix Doc Error: ZeRO Stage 2 gradient partitioning by @yewentao256 in #6775
- Cleanup code docs warnings by @loadams in #6783
- Domino Blog by @GuanhuaWang in #6776
- Update version.txt before release by @loadams in #6784
- Revert release workflow by @loadams in #6785
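The Pydantic change in #6764 follows the v2 deprecation of arbitrary extra keyword arguments to `Field()`. A minimal sketch; the `new_param` metadata key is illustrative:

```python
from pydantic import BaseModel, Field

class ExampleConfig(BaseModel):
    # Deprecated under Pydantic v2 (emits a warning):
    #   stage: int = Field(0, new_param="zero_optimization.stage")
    # Supported home for such metadata:
    stage: int = Field(0, json_schema_extra={"new_param": "zero_optimization.stage"})

print(ExampleConfig().stage)  # 0
```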
New Contributors
- @zhangsmallshark made their first contribution in #6733
- @hwchen2017 made their first contribution in #6742
- @minjiazhang made their first contribution in #6748
- @qgallouedec made their first contribution in #6764
- @wenbinc-Bin made their first contribution in #6726
- @swigls made their first contribution in #6761
- @imba-tjd made their first contribution in #6780
- @yewentao256 made their first contribution in #6775
Full Changelog: v0.15.4...v0.16.0
v0.15.4 Patch Release
What's Changed
- Update version.txt after 0.15.3 release by @loadams in #6652
- Fix expert grad scaling problem with ZeRO optimizer by @wyooyw in #6546
- Add attribute check for language_model when replace last linear module by @Yejing-Lai in #6650
- fix init_device_mesh for torch 2.4 by @Lzhang-hub in #6614
- Fix dynamo issue by @oraluben in #6527
- sequence parallel for uneven heads by @inkcherry in #6392
- Add fallback for is_compiling by @tohtana in #6663 (see the sketch after this list)
- Update profiler registration check by @loadams in #6668
- Add support for H100/sm_90 arch compilation by @loadams in #6669
- Update Gaudi2 docker image by @loadams in #6677
- Update gaudi2 docker version to latest release (1.18) by @raza-sikander in #6648
- Update base docker image for A6000 GPU tests by @loadams in #6681
- Remove packages that no longer need to be updated in the latest container by @loadams in #6682
- Fix training of pipeline-based PEFT LoRA models by @xuanhua in #5477
- Update checkout action to latest version by @loadams in #5021
- Add attribute check to support git-base autotp by @Yejing-Lai in #6688
- fix memcpy issue on backward for zero-infinity by @xylian86 in #6670
- Free memory in universal checkpointing tests by @tohtana in #6693
- Explicitly set device when reusing dist env by @tohtana in #6696
- Update URL in README Pipeline Status for Huawei Ascend NPU by @xuedinge233 in #6706
- Pin transformers to 4.45.2 in nv-ds-chat workflow by @loadams in #6710
- [Bug Fix] Support threads_per_head < 64 for wavefront size of 64 by @jagadish-amd in #6622
- Use one param coordinator for both train/inference scenarios by @tohtana in #6662
- Update yapf version by @loadams in #6721
- Update flake8 version by @loadams in #6722
- Switch what versions of python are supported by @loadams in #5676
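The `is_compiling` fallback from #6663 guards against older torch builds that lack the public API. A hedged sketch of the pattern (DeepSpeed's exact placement may differ):

```python
import torch

def is_compiling() -> bool:
    """True inside torch.compile tracing, across torch versions."""
    if hasattr(torch, "compiler") and hasattr(torch.compiler, "is_compiling"):
        return torch.compiler.is_compiling()  # newer torch releases
    if hasattr(torch, "_dynamo"):
        return torch._dynamo.is_compiling()   # older private fallback
    return False                              # no dynamo support at all

print(is_compiling())  # False in eager mode
```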
Full Changelog: v0.15.3...v0.15.4
v0.15.3 Patch Release
What's Changed
- Update version.txt after 0.15.2 release by @loadams in #6615
- Clean up prefetched parameters by @tohtana in #6557
- AIO CPU Locked Tensor by @jomayeri in #6592
- reduce setting global variables to reduce torch compile graph breaks by @NirSonnenschein in #6541
- Add API to get devices of offload states by @tohtana in #6586
- Ignore reuse_dist_env by @tohtana in #6623
- Add API for updating ZeRO gradients by @tjruwase in #6590 (see the sketch after this list)
- [compile] Show breakdown of graph break by @delock in #6601
- Accept btl_tcp_if_include option through launcher_args by @diskkid in #6613
- Add first Step in LR Schedulers by @jomayeri in #6597
- Support safetensors export by @xu-song in #6579
- add option to disable logger while compiling to avoid graph breaks by @ShellyNR in #6496
- Lock cache file of HF model list by @tohtana in #6628
- Add README Pipeline Status for Huawei Ascend NPU by @xuedinge233 in #6588
- Update torch version in workflows by @tohtana in #6631
- Use file store for tests by @tohtana in #6632
- Fix Memory Leak In AIO by @jomayeri in #6630
- [XPU] upgrade xpu max1100 CI workflow to pytorch2.3 by @Liangliang-Ma in #6646
- [XPU] host timer check version from Torch 2.5 to Torch 2.6 by @YizhouZ in #6633
- [XPU] [DeepNVMe] use same cpu_op_desc_t with cuda by @Liangliang-Ma in #6645
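The gradient-update API from #6590 complements the documented `safe_get_full_grad` getter. In this sketch the setter name `safe_set_full_grad` is an assumption matching the PR description; verify it against `deepspeed.utils` for your version (`engine` is a ZeRO engine after `backward()`):

```python
from deepspeed.utils import safe_get_full_grad, safe_set_full_grad

for _, param in engine.module.named_parameters():
    grad = safe_get_full_grad(param)           # gathers the full gradient
    if grad is not None:
        safe_set_full_grad(param, grad * 0.5)  # e.g. custom gradient scaling
```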
Full Changelog: v0.15.2...v0.15.3
v0.15.2 Patch Release
What's Changed
- Update version.txt after 0.15.1 release by @loadams in #6493
- HPU: add required ENV vars to accelerator init by @nelyahu in #6495
- Op_builder->is_compatible: quiet warning by @terry-for-github in #6093
- fix pipeline eval_batch micro_batches argument for schedule by @nelyahu in #6484
- Fix the broken url link by @rogerxfeng8 in #6500
- fix environment variable export bug for MultiNodeRunner by @TideDra in #5878
- Revert "BF16 optimizer: Clear lp grads after updating hp grads in hook" by @nelyahu in #6508
- wrap include cuda_bf16.h with ifdef BF16_AVAILABLE by @oelayan7 in #6520
- Avoid security issues of subprocess shell by @tjruwase in #6498
- Add conditional on torch version for scaled_dot_product_attention by @loadams in #6517
- Added Intel Gaudi to Accelerator Setup Guide by @ShifaAbu in #6543
- Skip failing newly added tests in accelerate by @loadams in #6574
- Use msgpack for p2p comm by @tohtana in #6547
- DeepNVMe perf tuning by @tjruwase in #6560
- [Accelerator] Cambricon MLU support by @Andy666G in #6472
- Fix gradient accumulation for Z2+offload by @tohtana in #6550
- fix errors when setting zero3 leaf modules with torch.compile by @NirSonnenschein in #6564
- [XPU] Support DeepNVMe new code structure by @Liangliang-Ma in #6532
- Add APIs to offload states of model, optimizer, and engine by @tohtana in #6011
- add bfloat16 to inference support dtypes by @nelyahu in #6528
- [COMPILE] workflow for deepspeed + torch.compile by @YizhouZ in #6570
- Fixes on the accelerate side mean we do not need to skip this test by @loadams in #6583
- Fix torch include in `op_builder/mlu/fused_adam.py` and update no-torch workflow triggers by @loadams in #6584
- [ROCm] Fix subprocess error by @jagadish-amd in #6587
- Cleanup CODEOWNERS file to be valid by @loadams in #6603
- Add SSF Best practices badge by @loadams in #6604
- Move V100 workflows from cuda 11.1/11.7 to 12.1 by @loadams in #6607
- Fix SD workflow by @loadams in #6609
- Pin accelerate to fix CI failures/issues by @loadams in #6610
- Add llama3.2 vision autotp by @Yejing-Lai in #6577
- Improve DS logging control by @tjruwase in #6602
- Fix device selection using CUDA_VISIBLE_DEVICES by @tohtana in #6530
- Handle when `backend` is also in compile_kwargs by @oraluben in #6502
- Rearrange inference OPS and stop using builder.load by @oelayan7 in #5490
- Unpin accelerate tests, update lightning with node16 removal. by @loadams in #6611
- Enabled Qwen2-MoE Tensor Parallelism (TP) inference by @gyou2021 in #6551 (see the sketch below)
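A hedged sketch of the AutoTP inference path these PRs extend: kernel injection stays off and DeepSpeed shards supported layers automatically. The checkpoint name and keyword arguments are illustrative and vary across versions:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B", torch_dtype=torch.bfloat16)
model = deepspeed.init_inference(
    model,
    dtype=torch.bfloat16,
    tensor_parallel={"tp_size": 2},    # shard supported layers over 2 GPUs
    replace_with_kernel_inject=False,  # AutoTP instead of kernel injection
)
```

Launched with, e.g., `deepspeed --num_gpus 2 infer.py`.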
New Contributors
- @TideDra made their first contribution in #5878
- @ShifaAbu made their first contribution in #6543
- @jagadish-amd made their first contribution in #6587
- @gyou2021 made their first contribution in #6551
Full Changelog: v0.15.1...v0.15.2