Releases: NVIDIA/recsys-examples
v25.11
What's Changed
Features & Enhancements
- Counter table interface and ScoredHashTable Implementation by @jiashuy in #229
- Embedding admission strategy by @z52527 in #236
- Optimize memory waste in segmented_unique by @z52527 in #244
Bug Fixes
- Fix dtype mismatch of offset and table_range. by @jiashuy in #227
- fix preprocessor local() error in python 3.10 by @shijieliu in #228
- Add new handler in addition to the existing ones by @JacoCheung in #233
- Fix LFU test failed in incremental_dump by @jiashuy in #242
- Fix default parameter initialization of KVCounter by @jiashuy in #253
Misc
- Format dynamicemb's source codes by @jiashuy in #230
- Update training Docker image to CUDA 12.9 by @shijieliu in #205
- Quick fix for commands in Inference README by @geoffreyQiu in #247
Full Changelog: v25.10...v25.11
v25.10
What's Changed
Features & Enhancements
- Add sequence parallelism by @JacoCheung in #216
- Decouple scaling seqlen from max_seqlen in HSTU attn by @geoffreyQiu in #208
- Support LRU score dump/load by @shijieliu in #186
- Gradient clipping by reusing TorchRec&FBGEMM's parameters by @jiashuy in #223
- [HSTU]Add SM 89 support by @JacoCheung in #217
- allow allow_overwrite in DynamicEmbDump by @fshhr46 in #206
Bug Fixes
- Fix LFU mode frequency count bug by @z52527 in #176
- Fix config bug when using torchrec's STBE in benchmark by @jiashuy in #193
- Fix IMA in incremental dump and test the dumped embeddings by @jiashuy in #211
- Fix rab num heads by @JacoCheung in #222
- Fix IMA caused by wrong worker id for device of which max threads is … by @jiashuy in #220
Misc
- Code reorganization for hstu training and inference by @geoffreyQiu in #202
- Add embedding pooling kernel by @z52527 in #215
Full Changelog: v25.09...v25.10
v25.09
What's Changed
Features & Enhancements
- Dynamicemb prefetch integration by @JacoCheung in #181
- Support distributed embedding dumping for dynamicemb by @z52527 @shijieliu in #120 #185
- Add kernel fusion in HSTU block for inference, with KVCache fixes by @geoffreyQiu in #184
- export hstu fp8 quant by @shijieliu in #168
- Replace BatchedDynamicEmbeddingTables with BatchedDynamicEmbeddingTablesV2 by @jiashuy in #155
Bug Fixes
- fix DynamicEmbDump - handle long strings in broadcast_string by @fshhr46 in #164
- fix: consider mask when calc hstu attn flops by @shijieliu in #177
- export fix hstu ima when num_candidates = seqlen by @shijieliu in #183
Misc
- Make local hbm budget grow when num_embeddings grows. by @jiashuy in #156
- Fix several errors for inference. by @geoffreyQiu in #167
- Fix setup.py by @yiwenchen2025 in #169
- Suppress mcore deps install by @JacoCheung in #170
- dynamicemb clean BatchedDynamicEmbeddingTables by @jiashuy in #179
- Update hstu layer benchmark doc by @JacoCheung in #171
- Update dynamicemb's benchmark and example with README.md by @jiashuy in #188
Full Changelog: v25.08...v25.09
v25.08
What's Changed
Features & Enhancements
- Refactor dynamicemb with Cache & Storage by @jiashuy in #128
- Support Kuairand dataset inference with alignment to training by @geoffreyQiu in #122
- Support eval mode for dynamicemb and move insert in backward to forward for use_index_dedup=True by @shijieliu in #136
- export hstu arbitrary mask by @shijieliu in #148
- Optimize TP HSTU layer by @JacoCheung in #132
Bug Fixes
- Fix invalid pip option: replace --no-cache with --no-cache-dir by @mia1460 in #126
- Remove HostAlloc in dataloader by @JacoCheung in #129
- Fix filtering of samples with insufficient history by @mia1460 in #134
- fix pipeline test by @shijieliu in #135
- Hkv timeline clean by @jiashuy in #137
- Fix calc flops by @shijieliu in #139
- fix(dataset): add per-user reorder by time and pre-sort to guarantee … by @mia1460 in #141
- fix preprocessor not working on absolute data path by @shijieliu in #146
- fix codespell checking by @shijieliu in #149
- fix collective unit test by @shijieliu in #151
- Fix the shape hint for offsets by @yiwenchen2025 in #153
Misc
- Update dynamicemb benchmark by @jiashuy in #138
- Update the benchmarks and results. by @geoffreyQiu in #144
- update benchmark doc by @shijieliu in #150
- update benchmark result of dynamicemb to figure by @jiashuy in #154
New Contributors
- @mia1460 made their first contribution in #126
- @yiwenchen2025 made their first contribution in #153
Full Changelog: v25.07...v25.08
v25.07
What's Changed
Features & Enhancements
- HSTU inference benchmark and example release by @geoffreyQiu in #92, #85, #93
- Tensor parallelism support for HSTU layer by @JacoCheung in #101
- Print detailed memory consumption of embedding and optimizer states by @jiashuy in #113
- calc flops in ranking by @shijieliu in #96
- add preprocessing mlp for hstu by @shijieliu in #98
Bug Fixes
- fix noncontiguous input for dynamicemb by @shijieliu in #99
- Fix dynamicemb example's local rank bug on multi-node by @z52527 in #95
- [Fix] retrieval shifting prediction embedding bug by @shijieliu in #114
Full Changelog: v25.06...v25.07
v25.06
What's Changed
Features & Enhancements
LFU Eviction Strategy for Dynamic Embeddings
Added a new Least Frequently Used (LFU) eviction strategy to the dynamicemb module, improving memory management and embedding efficiency.
(Contributed by @z52527 — (#52))
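The idea behind LFU eviction can be sketched in a few lines. This is an illustrative toy, not the dynamicemb implementation (which tracks per-slot frequency counters on the GPU and evicts in bulk); the class and method names here are hypothetical:

```python
class LFUCache:
    """Toy sketch of a Least Frequently Used eviction policy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}   # key -> stored value (stands in for an embedding row)
        self.freq = {}     # key -> access count

    def get(self, key):
        if key not in self.values:
            return None
        self.freq[key] += 1
        return self.values[key]

    def put(self, key, value):
        if key not in self.values and len(self.values) >= self.capacity:
            # Evict the least frequently used key to make room.
            victim = min(self.freq, key=self.freq.get)
            del self.values[victim]
            del self.freq[victim]
        self.values[key] = value
        self.freq[key] = self.freq.get(key, 0) + 1

cache = LFUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" now has count 2
cache.put("c", 3)   # evicts "b", the least frequently used key
```

Compared with LRU, LFU keeps hot embeddings resident even if they were not touched in the most recent batches.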
LayerNorm Recomputation for Fused HSTU Layer
Support for recomputing LayerNorm in the fused HSTU layer to optimize memory usage during training.
(Contributed by @JacoCheung — (#59))
Embedding and Optimizer State Insertion to HKV During Backward Pass
When use_index_dedup is enabled, embeddings and optimizer states are now inserted into the HKV during the backward pass, improving training efficiency.
(Contributed by @jiashuy — (#62))
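The index deduplication that use_index_dedup enables amounts to fetching each unique index once and expanding the results back to the original order. A minimal pure-Python sketch of that idea, with hypothetical names (lookup_fn stands in for the embedding fetch):

```python
def dedup_lookup(indices, lookup_fn):
    """Fetch each unique index once, then expand results back
    to the original batch order."""
    unique = []
    position = {}   # index value -> slot in `unique`
    inverse = []    # for each input, which unique slot it maps to
    for idx in indices:
        if idx not in position:
            position[idx] = len(unique)
            unique.append(idx)
        inverse.append(position[idx])
    fetched = [lookup_fn(i) for i in unique]  # one fetch per unique index
    return [fetched[j] for j in inverse]

# Indices 3 and 7 repeat, but each is fetched only once.
embs = dedup_lookup([3, 7, 3, 3, 7], lambda i: i * 10)
# embs == [30, 70, 30, 30, 70]
```

The same inverse mapping lets gradients for repeated indices be accumulated per unique row before the table update.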
Support for Non-Contiguous Input/Output in HSTU MHA and SiLU Recomputation
Enabled handling of non-contiguous tensors for multi-head attention and SiLU recomputation within HSTU layers.
(Contributed by @JacoCheung — (#64))
Customized CUDA Operation for Concatenating 2D Jagged Tensors
Introduced a new CUDA operator concat_2d_jagged_tensors to efficiently concatenate jagged tensors in 2D.
(Contributed by @z52527 — (#42))
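Jagged tensors are typically stored as a flat value buffer plus per-row offsets. The following pure-Python sketch shows the row-wise concatenation such a kernel computes; the storage layout and function signature here are assumptions for illustration, not the actual operator API:

```python
def concat_2d_jagged(values_a, offsets_a, values_b, offsets_b):
    """Concatenate two 2D jagged tensors row by row.

    Each jagged tensor is a flat value list plus offsets, where
    row r spans values[offsets[r]:offsets[r + 1]].
    """
    out_values, out_offsets = [], [0]
    n_rows = len(offsets_a) - 1
    for r in range(n_rows):
        out_values.extend(values_a[offsets_a[r]:offsets_a[r + 1]])
        out_values.extend(values_b[offsets_b[r]:offsets_b[r + 1]])
        out_offsets.append(len(out_values))
    return out_values, out_offsets

# Row 0: [1, 2] + [9]    -> [1, 2, 9]
# Row 1: [3]    + [8, 7] -> [3, 8, 7]
vals, offs = concat_2d_jagged([1, 2, 3], [0, 2, 3], [9, 8, 7], [0, 1, 3])
# vals == [1, 2, 9, 3, 8, 7]; offs == [0, 3, 6]
```

A fused CUDA kernel performs the same per-row copies in parallel, avoiding the per-row launches a naive loop would incur.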
Support for Training Pipeline
Added support for a streamlined training pipeline to facilitate easier model training and experimentation.
(Contributed by @JacoCheung — (#68))
Bug Fixes
Fixed HSTU Preprocess and Postprocess CI Issues
Resolved continuous integration issues related to HSTU preprocessing and postprocessing steps.
(Contributed by @shijieliu — (#76))
Documentation
Updated HSTU Installation Instructions
Clarified and expanded the README installation guide for the HSTU module to improve user onboarding.
(Contributed by @z52527 — (#84))
Dependency Updates
Stable Dependency Upgrades
Updated key dependencies to stable versions:
torchrec updated to 1.2.0
fbgemm_gpu updated to 1.2.0
mcore updated to 0.12.1
(Contributed by @shijieliu and @JacoCheung — (#74), (#75))
v25.05
Changelog
Dynamicemb example #16 #31 #58
EmbeddingBagCollection support in Dynamicemb #20
Dynamicemb functionality enhancement #45 #46 #53
HSTU cutlass kernel support contextual features in hopper backward #51
Decouple sharding and model definition in hstu example #37
Fused hstu layer #43
Fix kuairand dataset convergence issue #34
Doc enhancement #39
Full Changelog: https://github.com/NVIDIA/recsys-examples/commits/v25.05