We need a systematic approach to verify whether a scheduled model with tensor parallelism still produces the same results as the original model (i.e., identical outputs and gradients). It should check the following:
- Whether `.shard` and `.sync` are correctly specified to maintain shape correctness. For example, sharding the weight of a linear layer by its output feature dimension results in partitioned outputs. In this case, we either need an all-gather right after the linear, or the next linear must shard its weight by its input feature dimension.
- Whether `.shard` and `.sync` are correctly specified to maintain functional correctness. For example, sharding the weight of a linear layer by its input feature dimension produces outputs of the same shape, but as partial sums. In this case, an all-reduce is required. Both rules are illustrated in the first sketch after this list.
- Whether the random seed of each dropout is properly configured. If the input tensor of a dropout is not partitioned (i.e., it is replicated or a partial sum), the random seed on each device should be the same, because all devices are supposed to perform the same redundant computation. On the other hand, if the input tensor is partitioned (i.e., its shape on each device is divided by the TP group size), the random seed on each device should be different to avoid repeated dropout patterns that hurt convergence (see the seed sketch below).
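As a concrete illustration of the two sharding rules above, here is a minimal single-process sketch that simulates a TP group of size 2 with plain tensor slicing instead of `torch.distributed` collectives; the shapes and `tp_size` are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)
tp_size = 2
x = torch.randn(4, 8)   # (batch, in_features)
w = torch.randn(6, 8)   # (out_features, in_features), as in nn.Linear
ref = x @ w.t()         # full, unsharded output

# Rule 1: shard by output features -> each "rank" holds a slice of the output.
# Correctness requires an all-gather (simulated here by torch.cat) on the feature dim.
out_shards = [x @ w_i.t() for w_i in w.chunk(tp_size, dim=0)]
assert torch.allclose(torch.cat(out_shards, dim=1), ref, atol=1e-5)

# Rule 2: shard by input features -> each "rank" produces a full-shape partial sum.
# Correctness requires an all-reduce (simulated here by summing the partial results).
x_shards = x.chunk(tp_size, dim=1)
w_shards = w.chunk(tp_size, dim=1)
partials = [x_i @ w_i.t() for x_i, w_i in zip(x_shards, w_shards)]
assert torch.allclose(sum(partials), ref, atol=1e-5)

print("both sharding layouts reproduce the unsharded linear")
```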
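For the dropout check, a hedged sketch of the seed rule is below. The helper name and its arguments are hypothetical, not an existing API; the caller is assumed to know the TP rank and whether the dropout's input is partitioned.

```python
import torch

def seed_dropout(base_seed: int, tp_rank: int, input_is_partitioned: bool) -> None:
    """Configure the RNG so dropout masks are consistent with the TP layout."""
    if input_is_partitioned:
        # Each rank holds a different slice of the activation, so the masks
        # must differ per rank to avoid repeated dropout patterns.
        torch.manual_seed(base_seed + tp_rank)
    else:
        # The activation is replicated (or a partial sum that every rank reduces
        # identically), so every rank must drop the same positions.
        torch.manual_seed(base_seed)
```

Real tensor-parallel runtimes (e.g., Megatron-LM) keep separate RNG states per layout rather than resetting the global seed, but the rank-offset idea is the same.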
We use this issue to discuss possible solutions and track the progress. At first glance, a compiler approach that performs type inference on static graphs looks promising: we could use TorchDynamo or LazyTensor to capture a static graph and apply this analysis. However, this approach won't work if the module cannot be captured as a static graph due to coding-style limitations. A rough sketch of the analysis is shown below.
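The sketch uses `torch.fx` symbolic tracing as a stand-in for TorchDynamo/LazyTensor capture, and propagates a coarse activation layout through the graph. The layout labels and the `sharded_linears` map are illustrative assumptions, not part of an existing API.

```python
import torch
import torch.fx as fx
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Pretend the schedule sharded fc1 by output features and fc2 by input features.
sharded_linears = {"fc1": "output", "fc2": "input"}

gm = fx.symbolic_trace(MLP())
layout = "replicated"  # layout of the activation flowing through the graph
for node in gm.graph.nodes:
    if node.op == "call_module" and node.target in sharded_linears:
        if sharded_linears[node.target] == "output":
            layout = "partitioned"      # output features split across ranks
        else:  # sharded by input features
            if layout != "partitioned":
                raise RuntimeError(f"{node.target}: input must be partitioned")
            layout = "partial"          # full shape, but partial sums
    elif node.op == "call_function" and node.target is torch.relu:
        # Elementwise ops on partial sums do not match the unsharded model;
        # the analysis should demand an all-reduce (.sync) before this point.
        if layout == "partial":
            raise RuntimeError("all-reduce required before elementwise op")

# A "partial" layout at the end means an all-reduce (.sync) is still missing.
print("layout after the traced graph:", layout)
```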