proposal: improve the topology metadata #78

Open
doujiang24 opened this issue Jan 13, 2025 · 2 comments

@doujiang24
Contributor

Currently, we store a simple priority_matrix in metadata, which includes the preferred/available NICs for CPU NUMA nodes and GPUs.

I propose here a richer/original topology matrix in metadata, similar to NCCL_TOPO_FILE, which contains more useful info, including whether GDR is supported, NVLink, NVSwitch, etc.

After that, we would no longer need to store the protocol in metadata; instead, we could choose a proper protocol based on the topology.
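To make the idea concrete, here is a minimal sketch of topology-based protocol selection. The attribute names (gdr, nvlink) mirror the NCCL topology dump shown below; the function and the protocol names are illustrative only, not an existing API:

```python
# Hypothetical sketch: pick a transfer protocol from topology
# attributes instead of storing the protocol in metadata.
# choose_protocol and the protocol names are illustrative.

def choose_protocol(src_gpu: dict, dst_gpu: dict) -> str:
    """Pick a protocol based on link capabilities between two GPUs."""
    # Direct NVLink between the two GPUs beats everything else.
    if src_gpu["busid"] in dst_gpu.get("nvlink_targets", set()):
        return "nvlink"
    # Both sides support GPUDirect RDMA -> use RDMA with GDR.
    if src_gpu.get("gdr") and dst_gpu.get("gdr"):
        return "rdma_gpudirect"
    # Conservative fallback.
    return "tcp"

src = {"busid": "0000:5a:00.0", "gdr": 1, "nvlink_targets": {"0000:f0:00.0"}}
dst = {"busid": "0000:f0:00.0", "gdr": 1, "nvlink_targets": {"0000:5a:00.0"}}
print(choose_protocol(src, dst))  # -> nvlink
```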

Here is a simple sample topology from NCCL:

<system version="1">
  <cpu host_hash="0xb52dbdea56f9ca26" numaid="0" affinity="00000000,ffffffff,00000000,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="106">
    <pci busid="0000:4d:00.0" class="0x060400" vendor="0x1000" device="0xc010" subsystem_vendor="0x1000" subsystem_device="0xa096" link_speed="" link_width="0">
      <pci busid="0000:56:00.0" class="0x060400" vendor="0x1000" device="0xc010" subsystem_vendor="0x1000" subsystem_device="0xa096" link_speed="" link_width="0">
        <pci busid="0000:58:00.0" class="0x060400" vendor="0x1000" device="0xc010" subsystem_vendor="0x10de" subsystem_device="0x13b8" link_speed="" link_width="0">
          <pci busid="0000:5a:00.0" class="0x030200" vendor="0x10de" device="0x20b2" subsystem_vendor="0x10de" subsystem_device="0x1463" link_speed="" link_width="0">
            <gpu dev="1" sm="80" rank="1" gdr="1">
              <nvlink target="0000:f0:00.0" count="2" tclass="0x068000"/>
              <nvlink target="0000:ed:00.0" count="2" tclass="0x068000"/>
            </gpu>
          </pci>
        </pci>
        <pci busid="0000:5b:00.0" class="0x020000" vendor="0x15b3" device="0x101d" subsystem_vendor="0x15b3" subsystem_device="0x0064" link_speed="" link_width="0">
          <nic>
            <net name="mlx5_bond_0" dev="0" speed="100000" port="1" latency="0.000000" guid="0xce0fda0003ebc008" maxconn="262144" gdr="1"/>
          </nic>
        </pci>
      </pci>
    </pci>
  </cpu>
</system>
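A dump like this can be consumed directly with a standard XML parser; the sketch below (using Python's stdlib, with the XML trimmed from the sample above) shows how the per-device capabilities needed for protocol selection could be extracted:

```python
# Sketch: extract per-device capabilities (GDR support, NVLink peers,
# NIC speed) from an NCCL-style topology dump. topo_xml is trimmed
# from the sample above.
import xml.etree.ElementTree as ET

topo_xml = """
<system version="1">
  <cpu numaid="0">
    <pci busid="0000:5a:00.0">
      <gpu dev="1" sm="80" rank="1" gdr="1">
        <nvlink target="0000:f0:00.0" count="2" tclass="0x068000"/>
      </gpu>
    </pci>
    <pci busid="0000:5b:00.0">
      <nic>
        <net name="mlx5_bond_0" dev="0" speed="100000" gdr="1"/>
      </nic>
    </pci>
  </cpu>
</system>
"""

root = ET.fromstring(topo_xml)
for gpu in root.iter("gpu"):
    links = [nv.get("target") for nv in gpu.findall("nvlink")]
    print("gpu", gpu.get("dev"), "gdr:", gpu.get("gdr"), "nvlink:", links)
for net in root.iter("net"):
    print("nic", net.get("name"), "speed:", net.get("speed"), "gdr:", net.get("gdr"))
```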
@alogfans
Collaborator

  • The proposed topology provides more information than the current version, e.g., the supported mode / max bandwidth per RNIC device.
  • It relies more on auto-detection, because it is hard to input manually.
  • The topology file should be shared across the cluster, i.e., it should also be stored in the global metadata storage.
  • The global metadata storage also includes parameters of the RDMA connection (e.g., GID, QPNum, ...), which the proposed topology file does not include.

@doujiang24
Contributor Author

Yep, totally agreed. I'll figure out a JSON schema before creating a PR, which will include GID/LID, ...
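For discussion purposes only, one hypothetical shape for such a combined entry, topology attributes plus the RDMA connection parameters mentioned above, could look like the following. Every field name and value here is illustrative, not a proposed schema:

```python
# Hypothetical sketch of a combined metadata entry: topology plus
# RDMA connection parameters (GID/LID). All field names and values
# are illustrative placeholders, not a real schema.
import json

metadata = {
    "host": "node-0",
    "topology": {
        "gpus": [{"busid": "0000:5a:00.0", "gdr": True,
                  "nvlink": ["0000:f0:00.0", "0000:ed:00.0"]}],
        "nics": [{"name": "mlx5_bond_0", "speed_mbps": 100000, "gdr": True}],
    },
    "rdma": {"gid": "00:00:00:00:00:00:00:00", "lid": 0},  # placeholder values
}

print(json.dumps(metadata, indent=2))
```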
