No due date • 23/23 issues closed
Tracks all the items needed for releasing torchtrain to OSS.

Due by April 12, 2024 • 12/12 issues closed
1. AC/SAC (activation checkpointing / selective activation checkpointing)
2. meta init
3. torch.compile
4. grad norm clipping
5. grad scaler
6. lr scheduler

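For context, items 2–6 compose into an ordinary training step. A minimal sketch, not torchtrain's actual code: the model, hyperparameters, and warmup schedule are placeholders, and a CUDA device is assumed for autocast and the grad scaler.

```python
import torch
import torch.nn as nn

with torch.device("meta"):             # 2. meta init: build the model without allocating storage
    model = nn.Linear(1024, 1024)
model = model.to_empty(device="cuda")  # materialize (weights still need a real init)

model = torch.compile(model)           # 3. torch.compile
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(   # 6. lr scheduler (linear warmup)
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 100))
scaler = torch.cuda.amp.GradScaler()   # 5. grad scaler for fp16 mixed precision

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)         # unscale so clipping sees true gradient norms
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 4. grad norm clipping
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    scheduler.step()
```
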
Due by March 29, 2024 • 7/7 issues closed
Enable FP8 training in torchtrain:
1. Enable FSDP FP8
2. Enable SP FP8

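The core primitive behind FP8 training is scaled casting to PyTorch's float8 dtypes. A minimal sketch of just that round trip; the actual milestone work also covers the backward pass, the scaling strategy, and the FSDP/SP interactions listed above.

```python
import torch

def to_float8(x: torch.Tensor):
    # Scale into the representable range of e4m3 before casting down.
    amax = x.abs().max().clamp(min=1e-12)
    scale = torch.finfo(torch.float8_e4m3fn).max / amax
    return (x * scale).to(torch.float8_e4m3fn), scale

def from_float8(x_f8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_f8.to(torch.float32) / scale

x = torch.randn(4, 4)
x_f8, scale = to_float8(x)
print((from_float8(x_f8, scale) - x).abs().max())  # small quantization error
```
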
Due by April 26, 2024 • 2/2 issues closed
Step 1. Enable FSDP + Sequence Parallelism
* add FSDP
* add Sequence Parallelism
Step 2. Enable Pipeline Parallelism in torchtrain, in incremental steps (a manual-splitting sketch follows this list):
1. manual pipeline splitting
2. best pipeline config to work with FSDP
3. explore auto splitting

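A minimal sketch of manual pipeline splitting (step 2.1): partition a model's layer stack so each pipeline rank owns a contiguous slice. The layer type and sizes are placeholders, and real pipeline execution (microbatching, send/recv between stages) is out of scope here.

```python
import torch.nn as nn

def build_stage(layers: nn.ModuleList, pp_rank: int, pp_size: int) -> nn.Sequential:
    # Contiguous split: rank r owns layers [r * per_stage, (r + 1) * per_stage),
    # with the last rank absorbing any remainder.
    per_stage = len(layers) // pp_size
    start = pp_rank * per_stage
    end = start + per_stage if pp_rank < pp_size - 1 else len(layers)
    return nn.Sequential(*list(layers)[start:end])

layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(8))
stage = build_stage(layers, pp_rank=1, pp_size=4)  # rank 1 of 4 owns layers 2-3
```
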
Due by April 19, 2024 • 3/3 issues closed
A few pieces of core infrastructure we need to build:
* enable a single TOML config for the different parts of training (e.g. checkpointing, parallelisms, profiling); a loading sketch follows this list
* enable checkpoint save/load
* metrics collection (e.g. WPS, memory usage, loss value)
* add TensorBoard for visualizing metrics such as loss
* testing: add more tests

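A minimal sketch of the single-config idea using Python's stdlib TOML parser; the section and key names here are hypothetical, not torchtrain's actual schema.

```python
import tomllib  # stdlib TOML parser, Python 3.11+

# Hypothetical config covering several parts of training in one file.
CONFIG = """
[checkpoint]
folder = "outputs/checkpoints"
interval = 500

[parallelism]
data_parallel_degree = 8
sequence_parallel_degree = 1

[profiling]
enable_profiling = true
"""

cfg = tomllib.loads(CONFIG)
print(cfg["parallelism"]["data_parallel_degree"])  # -> 8
```
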
Due by March 1, 2024 • 3/3 issues closed
Make torchtrain trainable on:
1. an internal cluster
2. the cloud, e.g. an AWS cluster

Due by February 29, 2024 • 2/2 issues closed
Enable an efficient data loading solution for LLM training.
Short term:
* enable the Alpaca dataset to produce the correct data on each rank (data-parallel sharding working; see the sketch after this list)
* make sure it's performant when training on a cluster
Long term:
* for datasets too large to load entirely onto CPU, use the iterable/streaming feature
* for iterable/streaming, build indices and a sampler to load data correctly

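A minimal sketch of per-rank sharding with an iterable dataset; in a real run, rank and world_size would come from torch.distributed, and the round-robin indexing is just one illustrative sharding scheme.

```python
from torch.utils.data import IterableDataset

class ShardedDataset(IterableDataset):
    """Each data-parallel rank yields a disjoint shard of the samples."""

    def __init__(self, samples, rank: int, world_size: int):
        self.samples = samples
        self.rank = rank            # e.g. torch.distributed.get_rank()
        self.world_size = world_size

    def __iter__(self):
        # Rank r sees samples r, r + world_size, r + 2 * world_size, ...
        for i in range(self.rank, len(self.samples), self.world_size):
            yield self.samples[i]

ds = ShardedDataset(list(range(10)), rank=1, world_size=4)
print(list(ds))  # -> [1, 5, 9]
```
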
No due date • 3/3 issues closed
Integrate the new PyTorch-native FSDP2 into torchtrain, replacing FlatParameter FSDP, and demonstrate:
1. on-par performance with the existing FlatParameter FSDP
2. support for 2D parallelism (FSDP2 + SP)
3. efficient (no-communication) checkpoint save/load
4. FP8 enablement with FSDP2

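For reference, FSDP2 is applied through a composable fully_shard call rather than a wrapper class. A minimal sketch, assuming the import path of the early-2024 prototype (it later moved); the model is a placeholder and an initialized process group is required.

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard

# Placeholder model; a real setup would initialize torch.distributed first
# (e.g. via torchrun) and typically pass a device mesh to fully_shard.
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))
for layer in model:
    fully_shard(layer)  # shard each layer's parameters individually
fully_shard(model)      # root wrap for anything not already sharded
```
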
Due by June 28, 2024