FBGEMM_GPU v0.5.0
Release Notes
Highlights
- TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement)
- Many TBE extensions, including a defused TBE backward-optimizer, variable batch size support, and pipeline prefetching support for UVM caching
- Many improvements and new sparse ops added
- ARM support
- SM 9.0 support for CUDA 12.1 for H100 GPUs
- PyTorch 2 support for various operators (e.g., jagged tensor and pooled embedding ops)
Software Requirements
FBGEMM_GPU v0.5.0 has been tested and is known to work on the following setups:
- PyTorch: v2.1
- CUDA: v11.8, 12.1
- Python: v3.8, 3.9, 3.10, 3.11
It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment and/or a Docker container.
Availability
FBGEMM_GPU can be fetched directly from PyPI:
# FBGEMM_GPU CUDA variant (only CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.5.0
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.5.0
Alternatively, it can be fetched from PyTorch PIP:
# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cu121/
# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.5.0 --index-url https://download.pytorch.org/whl/cpu
Changes
Table batched embedding (TBE) operators
- [Improvement] TBE training v2 (optimized TBE forward: up to 4x kernel performance improvement) (#1641, #1804, #1787, #1904)
- [New] Variable batch size support to TBE training (#1653, #1752, #1633, #1634, #1713, #1717, #1943)
- [New] BFloat16 support for TBE CPU (#1839, #1851)
- [New] Defused TBE backward-optimizer and SplitTBE optimizer (#1819, #1820, #1821)
- [New] Max norm support for rowwise_adagrad (#1781)
- [New] Support for 1024-2048 embedding dimension in TBE inference (#1656)
- [Improvement] Backends via PyTorch dispatcher (#1948, #1976)
- [Improvement] Deprecate many TBE optimizers (#1766, #1767, #1771, #1796, #1774, #1773, #1775, #1791, #1793)
- [New] TBE UVM cache pipeline prefetching (#1883, #1893)
Jagged Tensor Operators
- [New] New jagged tensor operators (#1690)
- [New] Backends (Meta) (#1880, #1960)
- [Improvement] Jagged operator optimizations (#1643, #1646, #1644, #1661, #1662, #1691, #1692, #1777)
- [Improvement] Symbolic shape tracing on jagged operators for PyTorch 2 (#1758)
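For readers unfamiliar with the jagged tensor layout these operators work on, the following is a minimal pure-Python sketch (not the FBGEMM API; the function name is illustrative) of how a jagged tensor, stored as a flat values buffer plus row offsets, expands into a padded dense form:

```python
# A jagged tensor stores variable-length rows as one flat values
# buffer plus an offsets array: row i spans values[offsets[i]:offsets[i+1]].
def jagged_to_padded_dense(values, offsets, max_len, pad=0):
    """Expand (values, offsets) into fixed-length rows, padding/truncating
    each row to max_len. Illustrative only; FBGEMM does this on-device."""
    rows = []
    for start, end in zip(offsets, offsets[1:]):
        row = values[start:end][:max_len]
        rows.append(row + [pad] * (max_len - len(row)))
    return rows

# Three rows of lengths 2, 0, and 3 share one flat buffer.
values = [1, 2, 3, 4, 5]
offsets = [0, 2, 2, 5]
print(jagged_to_padded_dense(values, offsets, max_len=3))
# [[1, 2, 0], [0, 0, 0], [3, 4, 5]]
```

The offsets-based layout avoids materializing padding, which is why jagged ops save memory and bandwidth on sparse, variable-length features.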
Index Select Operators
- [New] batch_index_select_dim0 with TBE backend (#1897)
- [New] Variable input sizes support for group_index_select_dim0 (#1968)
- [Improvement] group_index_select improvements (#1764, #1884)
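To illustrate the semantics of the grouped index-select ops listed above, here is a pure-Python sketch (not the FBGEMM implementation; names are illustrative) where tensors are modeled as nested lists and dim 0 is the outer list:

```python
def index_select_dim0(tensor, indices):
    """Select rows (dim 0) of a 2-D 'tensor' (list of rows) by index."""
    return [tensor[i] for i in indices]

def group_index_select_dim0(tensors, index_groups):
    """Apply index_select_dim0 to each (tensor, indices) pair.
    Grouping many small selects lets a backend fuse them into fewer
    kernel launches; this sketch only shows the math, not the fusion."""
    return [index_select_dim0(t, idx) for t, idx in zip(tensors, index_groups)]

a = [[1, 1], [2, 2], [3, 3]]
b = [[4], [5]]
print(group_index_select_dim0([a, b], [[2, 0], [1]]))
# [[[3, 3], [1, 1]], [[5]]]
```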
Low-precision operators
- [New] Meta Backend FP8RowwiseQuantizedToFloat (#1890)
- [New] Column-wise parallel quantization/dequantization (#1743)
- [New] BF16 Support in FP8 quantize ops (#1961)
- [Improvement] FP8 row-wise quantization optimization/improvement (#1729, #1858, #1981, #1909)
Pooled Embedding
- [New] reduce_to_one (#1571)
- [New] permute_duplicate_pooled_embeddings op (#1912)
- [New] BF16 support for permute_pooled_embeddings op (#1937)
- [New] Variable size input-output support for permute_pooled_embs_kernel (#1913)
- [New] Backends (Meta) (#1853)
- [Improvement] Multi-GPU all_to_one enhancements (#1674, #1962)
Misc
- [New] CUB kernel for 2D asynchronous_complete_cumsum (#1707)
- [New] Backends (Meta) (#1709, #1905, #1970, #1971)
- [New] BF16 support in permute_indices_weights_kernel_2 (#1852)
- [New] FP16 and BF16 support in pack_segments (#1708)
- [New] BF16 support for HBC ops (#1744)
- [New] BFloat16 support (#1832, #1865)
- [Improvement] Speed up reorder_batched_ad_indices (#1901, #1902, #1932, #1933, #1711)
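As context for the asynchronous_complete_cumsum item above: a "complete" cumulative sum is, to the best of my understanding, a prefix sum with a leading zero and the grand total appended, which is how per-row lengths become jagged-tensor offsets. A pure-Python sketch of that semantics (not the CUB-based kernel):

```python
def complete_cumsum(lengths):
    """Prefix sum with a leading zero: out[i] == sum(lengths[:i]),
    so out has len(lengths) + 1 entries and out[-1] is the total.
    Sketch of the semantics only; FBGEMM computes this on-device."""
    out = [0]
    for n in lengths:
        out.append(out[-1] + n)
    return out

# Row lengths [2, 0, 5] become offsets [0, 2, 2, 7].
print(complete_cumsum([2, 0, 5]))  # [0, 2, 2, 7]
```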
Benchmarks / Tests
- [New] CLI support to GEMMsBenchmark (#1721, #1725)
- [New] Benchmark for variable batch on TBE (#1559)
- [New] BF16 output test coverage (#1835, #1838)
- [New] Benchmark for reorder_batched_ad_indices (#1895)
- [New] CPU support (#1874, #1926)
- [Improvement] GroupIndexSelect Benchmark with zero_grad (#1559)
- [Improvement] Add nbit-cpu-with-spec benchmark to FBGEMM-GPU's TBE benchmark suite (#1892)
Build / CI improvements and Fixes
- [New] C++17 Support to FBGEMM and FBGEMM_GPU OSS builds (#1652)
- [New] ARM Support in OSS CI (#1813)
- [New] SM 9.0 Support for CUDA 12.1 (#1825, #2002)
- [Improvement] General CI and build system enhancement (#1658, #1695, #1697, #1702, #1719, #1751, #1784, #1795, #1836, #1958, #2020, #2024)
- [Improvement] Reorganized code to enable faster builds (#1843, #1849, #1856, #1860, #1863, #1864, #1866, #1886, #1694, #1705, #1710, #1723, #1757, #1783, #1871, #1873, #1879, #1944, #1816, #1753)