We need a systematic approach to verify whether a scheduled model with tensor parallelism still produces the same results as the original model (i.e., identical outputs and gradients). It should check the following:
- Whether `.shard` and `.sync` are correctly specified to maintain shape correctness. For example, sharding the weight of a linear layer by its output feature dimension results in partitioned outputs. In this case, we either need an all-gather right after the linear, or the next linear must shard its weight by its input feature dimension.
- Whether `.shard` and `.sync` are correctly specified to maintain functional correctness. For example, sharding the weight of a linear layer by its input feature dimension produces outputs of the same shape, but as partial sums. In this case, an all-reduce is required. Both rules are illustrated in the first sketch after this list.
- Whether the random seed of each dropout is properly configured. If the input tensor of a dropout is not partitioned (i.e., it is replicated or a partial sum), the random seed on each device should be the same, because all devices are supposed to perform the same redundant computation. On the other hand, if the input tensor is partitioned (i.e., its shape on each device is divided by the TP group size), the random seed on each device should be different to avoid repeated dropout patterns that hurt convergence (see the seed sketch below).
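As a concrete illustration of the two sharding rules above, here is a minimal single-process sketch that simulates a TP group of size 2 with plain tensor slicing instead of `torch.distributed` collectives; the shapes and `tp_size` are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)
tp_size = 2
x = torch.randn(4, 8)   # (batch, in_features)
w = torch.randn(6, 8)   # (out_features, in_features), as in nn.Linear
ref = x @ w.t()         # full, unsharded output

# Rule 1: shard by output features -> each "rank" holds a slice of the output.
# Correctness requires an all-gather (simulated here by torch.cat) on the feature dim.
out_shards = [x @ w_i.t() for w_i in w.chunk(tp_size, dim=0)]
assert torch.allclose(torch.cat(out_shards, dim=1), ref, atol=1e-5)

# Rule 2: shard by input features -> each "rank" produces a full-shape partial sum.
# Correctness requires an all-reduce (simulated here by summing the partial results).
x_shards = x.chunk(tp_size, dim=1)
w_shards = w.chunk(tp_size, dim=1)
partials = [x_i @ w_i.t() for x_i, w_i in zip(x_shards, w_shards)]
assert torch.allclose(sum(partials), ref, atol=1e-5)

print("both sharding layouts reproduce the unsharded linear")
```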
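For the dropout check, a hedged sketch of the seed rule is below. The helper name and its arguments are hypothetical, not an existing API; the caller is assumed to know the TP rank and whether the dropout's input is partitioned.

```python
import torch

def seed_dropout(base_seed: int, tp_rank: int, input_is_partitioned: bool) -> None:
    """Configure the RNG so dropout masks are consistent with the TP layout."""
    if input_is_partitioned:
        # Each rank holds a different slice of the activation, so the masks
        # must differ per rank to avoid repeated dropout patterns.
        torch.manual_seed(base_seed + tp_rank)
    else:
        # The activation is replicated (or a partial sum that every rank reduces
        # identically), so every rank must drop the same positions.
        torch.manual_seed(base_seed)
```

Real tensor-parallel runtimes (e.g., Megatron-LM) keep separate RNG states per layout rather than resetting the global seed, but the rank-offset idea is the same.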
We use this issue to discuss possible solutions and track the progress. At first glance, a compiler approach that performs type inference on static graphs looks promising: we could use TorchDynamo or LazyTensor to capture a static graph and apply this analysis. However, this approach won't work if the module cannot be captured as a static graph due to coding-style limitations. A rough sketch of the analysis is shown below.
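The sketch uses `torch.fx` symbolic tracing as a stand-in for TorchDynamo/LazyTensor capture, and propagates a coarse activation layout through the graph. The layout labels and the `sharded_linears` map are illustrative assumptions, not part of an existing API.

```python
import torch
import torch.fx as fx
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Pretend the schedule sharded fc1 by output features and fc2 by input features.
sharded_linears = {"fc1": "output", "fc2": "input"}

gm = fx.symbolic_trace(MLP())
layout = "replicated"  # layout of the activation flowing through the graph
for node in gm.graph.nodes:
    if node.op == "call_module" and node.target in sharded_linears:
        if sharded_linears[node.target] == "output":
            layout = "partitioned"      # output features split across ranks
        else:  # sharded by input features
            if layout != "partitioned":
                raise RuntimeError(f"{node.target}: input must be partitioned")
            layout = "partial"          # full shape, but partial sums
    elif node.op == "call_function" and node.target is torch.relu:
        # Elementwise ops on partial sums do not match the unsharded model;
        # the analysis should demand an all-reduce (.sync) before this point.
        if layout == "partial":
            raise RuntimeError("all-reduce required before elementwise op")

# A "partial" layout at the end means an all-reduce (.sync) is still missing.
print("layout after the traced graph:", layout)
```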