No due date • 23/23 issues closed
Tracks all the items needed for releasing torchtrain to OSS.

Due by April 12, 2024 • 12/12 issues closed
1. AC/SAC (activation checkpointing / selective activation checkpointing)
2. meta init
3. torch.compile
4. grad norm clipping
5. grad scaler
6. lr scheduler

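For context, items 2–6 compose into an ordinary training step. A minimal sketch, not torchtrain's actual code: the model, hyperparameters, and warmup schedule are placeholders, and a CUDA device is assumed for autocast and the grad scaler.

```python
import torch
import torch.nn as nn

with torch.device("meta"):             # 2. meta init: build the model without allocating storage
    model = nn.Linear(1024, 1024)
model = model.to_empty(device="cuda")  # materialize (weights still need a real init)

model = torch.compile(model)           # 3. torch.compile
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(   # 6. lr scheduler (linear warmup)
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 100))
scaler = torch.cuda.amp.GradScaler()   # 5. grad scaler for fp16 mixed precision

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)         # unscale so clipping sees true gradient norms
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 4. grad norm clipping
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    scheduler.step()
```
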
Due by March 29, 2024 • 7/7 issues closed
Enable FP8 training in torchtrain:
1. Enable FSDP FP8
2. Enable SP FP8

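The core primitive behind FP8 training is scaled casting to PyTorch's float8 dtypes. A minimal sketch of just that round trip; the actual milestone work also covers the backward pass, the scaling strategy, and the FSDP/SP interactions listed above.

```python
import torch

def to_float8(x: torch.Tensor):
    # Scale into the representable range of e4m3 before casting down.
    amax = x.abs().max().clamp(min=1e-12)
    scale = torch.finfo(torch.float8_e4m3fn).max / amax
    return (x * scale).to(torch.float8_e4m3fn), scale

def from_float8(x_f8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_f8.to(torch.float32) / scale

x = torch.randn(4, 4)
x_f8, scale = to_float8(x)
print((from_float8(x_f8, scale) - x).abs().max())  # small quantization error
```
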
Due by April 26, 2024 • 2/2 issues closed
Step 1. Enable FSDP + Sequence Parallelism
* add FSDP
* add Sequence Parallelism
Step 2. Enable Pipeline Parallelism in torchtrain, in incremental steps (a manual-splitting sketch follows this list):
1. manual pipeline splitting
2. best pipeline config to work with FSDP
3. explore auto splitting

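A minimal sketch of manual pipeline splitting (step 2.1): partition a model's layer stack so each pipeline rank owns a contiguous slice. The layer type and sizes are placeholders, and real pipeline execution (microbatching, send/recv between stages) is out of scope here.

```python
import torch.nn as nn

def build_stage(layers: nn.ModuleList, pp_rank: int, pp_size: int) -> nn.Sequential:
    # Contiguous split: rank r owns layers [r * per_stage, (r + 1) * per_stage),
    # with the last rank absorbing any remainder.
    per_stage = len(layers) // pp_size
    start = pp_rank * per_stage
    end = start + per_stage if pp_rank < pp_size - 1 else len(layers)
    return nn.Sequential(*list(layers)[start:end])

layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(8))
stage = build_stage(layers, pp_rank=1, pp_size=4)  # rank 1 of 4 owns layers 2-3
```
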
Due by April 19, 2024 • 3/3 issues closed
A few pieces of core infrastructure we need to build:
* enable a single TOML config for the different parts of training (e.g. checkpointing, parallelisms, profiling); a loading sketch follows this list
* enable checkpoint save/load
* metrics collection (e.g. WPS, memory usage, loss value)
* add TensorBoard for visualizing metrics such as loss
* testing: add more tests

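A minimal sketch of the single-config idea using Python's stdlib TOML parser; the section and key names here are hypothetical, not torchtrain's actual schema.

```python
import tomllib  # stdlib TOML parser, Python 3.11+

# Hypothetical config covering several parts of training in one file.
CONFIG = """
[checkpoint]
folder = "outputs/checkpoints"
interval = 500

[parallelism]
data_parallel_degree = 8
sequence_parallel_degree = 1

[profiling]
enable_profiling = true
"""

cfg = tomllib.loads(CONFIG)
print(cfg["parallelism"]["data_parallel_degree"])  # -> 8
```
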
Due by March 1, 2024 • 3/3 issues closed
Make torchtrain trainable on:
1. an internal cluster
2. the cloud, e.g. an AWS cluster

Due by February 29, 2024 • 2/2 issues closed
Enable an efficient data loading solution for LLM training.
Short term:
* enable the Alpaca dataset to produce the correct data on each rank (data-parallel sharding working; see the sketch after this list)
* make sure it's performant when training on a cluster
Long term:
* for datasets too large to load entirely onto CPU, use the iterable/streaming feature
* for iterable/streaming, build indices and a sampler to load data correctly

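A minimal sketch of per-rank sharding with an iterable dataset; in a real run, rank and world_size would come from torch.distributed, and the round-robin indexing is just one illustrative sharding scheme.

```python
from torch.utils.data import IterableDataset

class ShardedDataset(IterableDataset):
    """Each data-parallel rank yields a disjoint shard of the samples."""

    def __init__(self, samples, rank: int, world_size: int):
        self.samples = samples
        self.rank = rank            # e.g. torch.distributed.get_rank()
        self.world_size = world_size

    def __iter__(self):
        # Rank r sees samples r, r + world_size, r + 2 * world_size, ...
        for i in range(self.rank, len(self.samples), self.world_size):
            yield self.samples[i]

ds = ShardedDataset(list(range(10)), rank=1, world_size=4)
print(list(ds))  # -> [1, 5, 9]
```
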
No due date • 3/3 issues closed
Integrate the new PyTorch-native FSDP2 into torchtrain, replacing FlatParameter FSDP, and demonstrate:
1. on-par performance with the existing FlatParameter FSDP
2. support for 2D parallelism (FSDP2 + SP)
3. efficient (no-communication) checkpoint save/load
4. FP8 enablement with FSDP2

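For reference, FSDP2 is applied through a composable fully_shard call rather than a wrapper class. A minimal sketch, assuming the import path of the early-2024 prototype (it later moved); the model is a placeholder and an initialized process group is required.

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard

# Placeholder model; a real setup would initialize torch.distributed first
# (e.g. via torchrun) and typically pass a device mesh to fully_shard.
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))
for layer in model:
    fully_shard(layer)  # shard each layer's parameters individually
fully_shard(model)      # root wrap for anything not already sharded
```
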
Due by June 28, 2024