[RFC][TorchElastic] topology info in training apps/ranks #57

Open: kurman wants to merge 3 commits into pytorch:master from kurman:rcf-torchelastic-toplogy-info

# Providing Topology Information to Training Apps/Ranks

**Authors:**
* @kurman


## Summary
Provide topology information to each rank/trainer process that can be used by trainers to optimize and potentially automate PyTorch distributed models.

## Motivation
As job sizes increase, ML jobs utilize parallelism to:
1. increase throughput, and
2. allow model sizes that don’t fit into a single GPU or even a single node.

These techniques are known as Data Parallelism (DP), Model Parallelism (MP) and Pipeline Parallelism (PP). Each of them has specific trade-offs and they can be combined.

Using these techniques requires some knowledge of the underlying topology to make the right trade-offs based on communication overhead (speed, throughput), available GPU memory and use of non-accelerator resources (CPU/RAM). PyTorch has come up with various generic distributed modeling solutions like DTensor and DeviceMesh. However, PyTorch distributed primitives provide no automation today for taking advantage of topology information to perform model partitioning; it is the job of the user to configure the 2-D or 3-D (MP+DP+PP) parallelism solutions in a topology-aware manner for optimal performance.

Underlying topology data can inform the application about communication constraints, which can be used to optimize the distributed computation graph around the communication pattern (while still relying on the underlying comms layer to actually use the channels). Regardless of the available communication channel semantics, most of the value comes from grouping nodes by locality. See more details under the [usecases](#Usecases) section.

## Scope
The proposal focuses on exposing information about the infrastructure to the application layer for optimization purposes:
- A contract that applications can use to discover topology information when available
- An initial reference implementation that can be substituted by other implementations

The primary focus is on defining a contract that can be expanded; other potential approaches/algorithms for deriving topology information are out of scope.

## Proposed Implementation
Provide topology information to each rank/trainer process that can be used to create optimized computational graphs for Torch modules.

Current HPC/gang-scheduled jobs rely on a number of variables that are passed to each trainer process, e.g. `WORLD_SIZE`, `RANK` and `LOCAL_RANK`.

The newly proposed, optional `WORLD_TOPOLOGY_FILE` environment variable will reference a local filesystem file that provides information about the underlying topology.

It will be generated on a best-effort basis due to the underlying complexity.

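For illustration, a trainer process would consume the existing variables plus the proposed optional one roughly like this (the file may be absent since it is generated on a best-effort basis):

```python
import os

# Existing torchelastic-provided variables.
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

# Proposed, optional variable: absent when topology information could not be derived.
topology_file = os.environ.get("WORLD_TOPOLOGY_FILE")
if topology_file is not None and os.path.exists(topology_file):
    print(f"rank {rank}: topology information available at {topology_file}")
else:
    print(f"rank {rank}: no topology information, falling back to defaults")
```
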
### Content of the topology information file
Topology information is a representation of a graph where nodes are devices assigned to a RANK and edges are the connectivity type, with optional:
- Latency
- Bandwidth, and
- Available number of channels

The connectivity type assumes HPC-type workloads and can be scoped (but not limited) to the following types:
- NVLink
- NVSwitch
- PCIe
- IB
- NIC/Ethernet

Eg:

|      | RANK0                    | RANK1                    | RANK2                  |
|------|--------------------------|--------------------------|------------------------|
|RANK0 | X                        | NVLink [22us, 64GB/s, 4] | IB [600us, 0.4GB/s, 4] |
|RANK1 | NVLink [22us, 64GB/s, 4] | X                        | IB [600us, 0.4GB/s, 4] |
|RANK2 | IB [600us, 0.4GB/s, 4]   | IB [600us, 0.4GB/s, 4]   | X                      |


Out-of-scope for now:
- Non-topology system resources allocated for each rank (e.g. MEM/CPU/GPU)
  - Those tend to be homogeneous
  - Most of them can be easily detected at runtime by the trainer code
  - Those can be added as an extension if there are specific usecases.
- More fine-grained details based on communication pattern (p2p vs collectives). We are exposing physical topology information between peers, however things such as bandwidth will differ based on the communication pattern.

### Format of the topology information file

- File format: json
- Define version = 0.1
- Extensible:
  - Avoid using lists and use named properties; this allows adding new properties later.

Eg:
```json
{
  "version": "0.1",
  "ranks": {
    "0": {
      "peers": {
        "1": {
          "connection": {
            "type": "IB",
            "latency": {
              "value": "22",
              "measurement": "us"
            },
            "bandwidth": {
              "value": "64",
              "measurement": "GB/s"
            },
            "channels": {
              "count": "4"
            }
          }
        }
      }
    },
    "1": {"peers": {...}}
  }
}
```

#### Python API

A Python API will be exposed to build a representation of the topology as a Python data structure. The `TopologyInfo#build` factory method will use the `WORLD_TOPOLOGY_FILE` env variable to discover information about the topology.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Union


class ConnectionType(Enum):
    ETH = 1
    IB = 2
    ...


@dataclass
class Measurement:
    val: str
    measurement: str


@dataclass
class LatencyInfo(Measurement):
    ...


@dataclass
class BandwidthInfo(Measurement):
    ...


@dataclass
class ChannelInfo:
    count: int
    ...


@dataclass
class ConnectionInfo:
    conn_type: ConnectionType
    latency: LatencyInfo
    bandwidth: BandwidthInfo
    channel: ChannelInfo


@dataclass
class PeersInfo:
    connections: Dict[int, ConnectionInfo]


@dataclass
class TopologyInfo:
    version: str
    ranks: Dict[int, PeersInfo]

    @staticmethod
    def build() -> Union[None, "TopologyInfo"]:
        ...
```
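
A minimal usage sketch, assuming `TopologyInfo.build()` returns the structure above: a trainer could bucket its peers by connection type to make locality-aware decisions. The bucketing policy here is purely illustrative and not part of the proposed contract.

```python
import os
from collections import defaultdict
from typing import Dict, List

# TopologyInfo and ConnectionType are the classes defined above.
topology = TopologyInfo.build()  # None when WORLD_TOPOLOGY_FILE is not set
rank = int(os.environ["RANK"])

if topology is not None and rank in topology.ranks:
    # Bucket peer ranks by how they are connected to this rank.
    peers_by_conn: Dict[ConnectionType, List[int]] = defaultdict(list)
    for peer_rank, conn in topology.ranks[rank].connections.items():
        peers_by_conn[conn.conn_type].append(peer_rank)

    # Example policy: keep bandwidth-hungry groups on the fastest links and
    # fall back to IB/Ethernet peers for data parallelism.
    ib_peers = peers_by_conn.get(ConnectionType.IB, [])
    print(f"rank {rank}: peers reachable over IB: {ib_peers}")
else:
    print("No topology information available; using default process groups.")
```
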

### Implementation details
[torchrun/torchelastic](https://pytorch.org/docs/stable/elastic/run.html) will:
- Add a new flag to inject topology information, `--with-accelerator-topology`
- Build the topology representation during the rendezvous process
- Save the file in the local filesystem
- Launch the job with the `WORLD_TOPOLOGY_FILE` env variable set to the location of the file (see the sketch below).

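A rough sketch of the launcher-side flow under this proposal. The `launch_worker_env` helper, the temp-file location, and the shape of the `topology` document are illustrative assumptions, not existing torchrun/torchelastic APIs; the agent would invoke something like this after rendezvous has produced the topology document.

```python
import json
import os
import tempfile
from typing import Dict, Optional


def launch_worker_env(base_env: Dict[str, str], topology: Optional[dict]) -> Dict[str, str]:
    """Augment a worker's environment with WORLD_TOPOLOGY_FILE when topology is available."""
    env = dict(base_env)
    if topology is not None:
        # Persist the best-effort topology document to the local filesystem.
        fd, path = tempfile.mkstemp(prefix="world_topology_", suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(topology, f)
        env["WORLD_TOPOLOGY_FILE"] = path
    return env
```

Workers launched with this environment then pick up the file exactly as in the earlier trainer-side snippet.
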
#### Defining/discovering topology
The initial implementation can be very simple by targeting NVIDIA devices, built at a high level using the following steps (a rough sketch of steps 1–2 follows this list):
1. Run `nvidia-smi` on the host to list devices
2. Traverse `/sys` on each host to detect the connectivity type between devices on the host
3. Traverse `/sys` on each host to detect NET connectivity with GUIDs
4. Publish as part of node rendezvous data
5. On each node build the representation:
   - For local ranks, use the data from step 2
   - For other hosts, use NET connectivity information.

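As an illustration of steps 1–2, a best-effort sketch that approximates the on-host connectivity matrix via `nvidia-smi topo -m` instead of traversing `/sys` directly; the parsing is deliberately simplified and the mapping from link labels to the connectivity types above is left out.

```python
import subprocess
from typing import Dict, List


def local_gpu_links() -> Dict[str, List[str]]:
    """Best-effort view of per-GPU link types from `nvidia-smi topo -m`."""
    out = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=True
    ).stdout
    links: Dict[str, List[str]] = {}
    for line in out.splitlines():
        if line.startswith("GPU"):
            cells = line.split()
            # cells[0] is the device name; the following columns contain link
            # labels such as X, NV#, PIX, PHB or SYS (trailing CPU/NUMA affinity
            # columns would need to be stripped in a real implementation).
            links[cells[0]] = cells[1:]
    return links
```
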
A further improvement would be to first use the NCCL_TOPO_FILE representation used by [NCCL](https://github.com/NVIDIA/nccl/blob/master/src/graph/topo.cc) to build the topology, which would allow using VM vendor-specific topology definitions.


## Usecases
We expect most of the usecases will be implemented at a framework level:

- Locality-aware DeviceMesh definition to be used by DTensor using topology information, for example for use in [controller](https://github.com/zdevito/single_controller)-based execution (see the sketch after this list)
- DistCompiler can use this information to construct an optimal execution graph
- Use of this information to place MP on NVLink-connected accelerators, and DDP/PP on IB-connected hosts.
- Replication of checkpoints/snapshots on non-adjacent nodes to provide more reliable fault tolerance, e.g. [proposal](https://docs.google.com/document/d/1xX8kTBqmm9Ve03KX8cBhnUdBx4xIf4YtbcUMky46Nd8/edit#heading=h.6xffhuojpv2y)

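A sketch of the first use case, using `torch.distributed.device_mesh.init_device_mesh`; deriving `nvlink_group_size` from the topology information is the assumed/illustrative part here.

```python
import os
from torch.distributed.device_mesh import init_device_mesh

world_size = int(os.environ["WORLD_SIZE"])

# Illustrative: derive the size of the NVLink-connected group from the topology
# (e.g. the number of peers reachable over NVLink plus this rank) and place the
# model-parallel dimension inside it, keeping data parallelism across hosts.
# Assumes world_size is divisible by the group size.
nvlink_group_size = 8  # would come from the topology information in practice
mesh = init_device_mesh(
    "cuda",
    (world_size // nvlink_group_size, nvlink_group_size),
    mesh_dim_names=("dp", "mp"),
)
```
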
#### Additional Context
- https://arxiv.org/abs/1903.04611
- https://github.com/pytorch/pytorch/issues/88838
- https://pytorch.org/docs/stable/distributed.elastic.html