Milestones

  • No due date
    23/23 issues closed
  • Tracks all the items needed for releasing torchtrain to OSS

    Due by April 12, 2024
    12/12 issues closed
  • 1. AC/SAC
    2. meta init
    3. torch.compile
    4. grad norm clipping
    5. grad scaler
    6. lr scheduler
    (a training-step sketch covering these items follows this entry)

    Due by March 29, 2024
    7/7 issues closed
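
A minimal, hedged sketch (not torchtrain's actual code) of how items 2-6 above typically compose in one PyTorch training step; the tiny model and random data are placeholders, and item 1 (AC/SAC) would additionally wrap transformer blocks with torch.utils.checkpoint.

```python
import torch
import torch.nn as nn

# 2. meta init: build the model without allocating real storage, then materialize.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
model = model.to_empty(device="cuda")
for p in model.parameters():
    nn.init.normal_(p, std=0.02)  # re-initialize the now-real storage

model = torch.compile(model)  # 3. torch.compile

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# 6. lr scheduler: a simple linear warmup (the schedule shape is an assumption).
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / 100))
scaler = torch.cuda.amp.GradScaler()  # 5. grad scaler for fp16 mixed precision

for step in range(10):
    x = torch.randn(32, 64, device="cuda")
    y = torch.randint(0, 8, (32,), device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees true gradient norms
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 4. grad norm clipping
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    optimizer.zero_grad()
```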
  • Enable FP8 training in torchtrain:
    1. Enable FSDP FP8
    2. Enable SP FP8
    (a hedged sketch follows this entry)

    Due by April 26, 2024
    2/2 issues closed
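
For illustration only, a hedged sketch of FP8 enablement using torchao's float8 API; this is an assumption about the tooling, not necessarily the integration this milestone shipped. Linear layers get swapped for float8 variants, after which FSDP (item 1) or SP (item 2) wrapping proceeds as usual.

```python
import torch.nn as nn
from torchao.float8 import convert_to_float8_training  # assumed dependency: torchao

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
convert_to_float8_training(model)  # swaps nn.Linear -> Float8Linear in place
# From here, wrap with FSDP or parallelize with SP and train as usual; the
# float8 modules handle scaling of activations/weights/gradients dynamically.
```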
  • Step 1. Enable FSDP + Sequence Parallelism:
      * add FSDP
      * add Sequence Parallelism
    Step 2. Enable Pipeline Parallelism in torchtrain, in incremental steps:
      1. manual pipeline splitting
      2. best pipeline config to work with FSDP
      3. explore auto splitting
    (a sketch of Step 1 follows this entry)

    Due by April 19, 2024
    3/3 issues closed
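
A minimal sketch of Step 1 on a 2D device mesh using public torch.distributed APIs; the 2x4 mesh shape and the module names in the parallelize plan are assumptions, and the script must be launched with torchrun so the process group exists.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import SequenceParallel, parallelize_module

# 8 GPUs arranged as 2 data-parallel replicas x 4 tensor/sequence-parallel ranks.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(nn.LayerNorm(256), nn.Linear(256, 256)).cuda()
# Sequence Parallelism: run the norm with its input sharded on the sequence dim.
model = parallelize_module(model, mesh_2d["tp"], {"0": SequenceParallel()})
# FSDP: shard parameters across the data-parallel mesh dimension.
model = FSDP(model, device_mesh=mesh_2d["dp"], use_orig_params=True)
```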
  • A couple of pieces of core infrastructure we need to build:
      * enable a single toml config for the different parts of training (e.g. checkpointing, parallelisms, profiling)
      * enable checkpoint save/load
      * metrics collection (e.g. wps, memory usage, loss value)
      * add TensorBoard for visualizing metrics such as losses
      * testing: add more tests
    (a config/metrics sketch follows this entry)

    Due by March 1, 2024
    3/3 issues closed
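
A small sketch of the single-toml-config plus metrics items; the config file name and key layout are hypothetical, tomllib is stdlib in Python 3.11+, and TensorBoard logging uses torch.utils.tensorboard.SummaryWriter.

```python
import tomllib  # stdlib TOML parser (Python 3.11+)
from torch.utils.tensorboard import SummaryWriter

with open("train_config.toml", "rb") as f:  # hypothetical config file
    cfg = tomllib.load(f)  # one file for checkpoint/parallelism/profiling tables

writer = SummaryWriter(log_dir=cfg["profiling"]["log_dir"])  # assumed key layout
for step in range(3):
    writer.add_scalar("loss", 2.0 / (step + 1), step)  # placeholder values
    writer.add_scalar("wps", 1000.0, step)
writer.close()
```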
  • Enable torchtrain to train on:
    1. an internal cluster
    2. the cloud, e.g. an AWS cluster
    (a launch-setup sketch follows this entry)

    Due by February 29, 2024
    2/2 issues closed
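
A short sketch of cluster-agnostic setup: torchrun (or a cloud scheduler's equivalent) injects RANK/WORLD_SIZE/MASTER_ADDR/LOCAL_RANK into the environment, so the same entry point runs unchanged on an internal or an AWS cluster.

```python
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # reads rendezvous info from the env
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
print(f"rank {dist.get_rank()}/{dist.get_world_size()} ready")
```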
  • Enable an efficient data loading solution for LLM training.
    Short term:
      * enable the Alpaca dataset to produce the correct data on each rank (sharded across data-parallel ranks)
      * make sure it is performant when training on a cluster
    Long term:
      * for datasets too large to load entirely into CPU memory, use iterable/streaming datasets
      * for iterable/streaming datasets, build indices and a sampler to load data correctly
    (a sharded-streaming sketch follows this entry)

    No due date
    3/3 issues closed
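
A minimal sketch of the short-term item, rank-sharded streaming: each data-parallel rank strides over the stream so ranks see disjoint shards. The synthetic token stream is a placeholder, not the Alpaca dataset.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, IterableDataset

class ShardedStream(IterableDataset):
    """Yield only the samples that belong to this data-parallel rank."""

    def __init__(self, num_samples: int, rank: int, world_size: int):
        self.num_samples, self.rank, self.world_size = num_samples, rank, world_size

    def __iter__(self):
        # Stride by world_size starting at rank: disjoint shards, no overlap.
        for i in range(self.rank, self.num_samples, self.world_size):
            yield torch.full((16,), i, dtype=torch.long)  # fake token ids

rank = dist.get_rank() if dist.is_initialized() else 0
world = dist.get_world_size() if dist.is_initialized() else 1
loader = DataLoader(ShardedStream(1024, rank, world), batch_size=8)
```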
  • Integrate the new PyTorch-native FSDP2 into torchtrain, replacing FlatParameter FSDP, and demonstrate:
    1. on-par performance with the existing FlatParameter FSDP
    2. support for 2D parallelism (FSDP2 + SP)
    3. efficient (no-communication) checkpoint save/load
    4. FP8 enablement with FSDP2
    (a hedged FSDP2 wrapping sketch follows this entry)

    Due by June 28, 2024
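
A hedged sketch of FSDP2's per-module wrapping via the composable fully_shard API (a prototype path at the time of this milestone, so the import location may differ across PyTorch versions); it needs an initialized process group, e.g. via torchrun, and the toy model is a placeholder.

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard  # prototype API path

model = nn.Sequential(*[nn.Linear(512, 512) for _ in range(4)])
for block in model:
    fully_shard(block)  # each block's parameters become sharded DTensors
fully_shard(model)      # root call groups any remaining parameters
```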