moe_comms

Simulation and performance modeling of MoE all-to-all communication.

To use, follow these steps:

1. Set the desired parameters and mode in params.py
2. Run simulation.py to generate the routing data
3. Run perf_model.py to run the communication simulation
4. Inspect the traces in comm_log.txt

Diagrams detailing the high-level design of simulation.py and perf_model.py can be found in the PDF (MoE Routing Simulation and Performance Model.pdf).

Currently supported:

Parameters for different MoE configurations and hardware restrictions
Routing data generation (with load imbalance parameters)
Conversion of routing data to bytes for each (dst, src) node pair
Full-mesh communication within a cluster (including the host) under certain assumptions, with communication time
Multiple links per connection between nodes (a link can carry a fragment of a load that is > intra_bw, or send in a different direction)
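
The routing, byte-conversion, and full-mesh items above can be illustrated with a minimal sketch. It is not the code in simulation.py or perf_model.py: every function and parameter name here (`generate_routing`, `bytes_matrix`, `full_mesh_time`, `skew`, `link_bw`, `links_per_pair`) is hypothetical, and the model assumes each token is sent once per selected expert, that every node pair has its own dedicated links, and that a load is split evenly across those links.

```python
import random


def generate_routing(num_tokens, num_experts, top_k, skew=0.0, seed=0):
    """Pick top_k distinct experts per token.

    skew = 0 gives a uniform choice; larger values bias the choice toward
    low-index experts (an assumed stand-in for a load-imbalance parameter).
    """
    rng = random.Random(seed)
    weights = [1.0 / (e + 1) ** skew for e in range(num_experts)]
    routing = []
    for _ in range(num_tokens):
        candidates = list(range(num_experts))
        w = list(weights)
        chosen = []
        for _ in range(top_k):
            # Weighted draw without replacement.
            idx = rng.choices(range(len(candidates)), weights=w)[0]
            chosen.append(candidates.pop(idx))
            w.pop(idx)
        routing.append(chosen)
    return routing


def bytes_matrix(routing, token_src, expert_node, num_nodes, bytes_per_token):
    """Return mat[dst][src]: bytes that node src must send to node dst."""
    mat = [[0] * num_nodes for _ in range(num_nodes)]
    for tok, experts in enumerate(routing):
        src = token_src[tok]
        for e in experts:
            dst = expert_node[e]
            if dst != src:  # local experts need no network transfer
                mat[dst][src] += bytes_per_token
    return mat


def full_mesh_time(mat, link_bw, links_per_pair=1):
    """Estimate transfer time when all pairs send concurrently.

    With dedicated links per pair, the largest (dst, src) load sets the time.
    """
    worst = max(max(row) for row in mat)
    return worst / (link_bw * links_per_pair)


if __name__ == "__main__":
    routing = generate_routing(num_tokens=1000, num_experts=8, top_k=2, skew=1.0)
    token_src = [t % 4 for t in range(1000)]      # tokens spread over 4 nodes
    expert_node = [e // 2 for e in range(8)]      # 2 experts per node
    mat = bytes_matrix(routing, token_src, expert_node, 4, bytes_per_token=4096)
    print(full_mesh_time(mat, link_bw=50e9, links_per_pair=2))
```

Taking the maximum over pairs is the simplest choice for a full mesh; a model with shared links or per-round scheduling would need to sum loads per link instead.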

Not supported (yet) / Assumptions:

Hierarchical communication (source to cluster, cluster to node)
Visualization
Throughput calculation (prefill / decode)
Add delay/transfer parallelization for each round, i.e. each packet needs preparation but can be prepared while the previous packet is sending (list all packets for a round, choose the link with the most packets, and parallelize those to find the critical path)
Add comms for allocation (open question: where)
Add flag size for packets
Round robin is currently applied only within a node's load, i.e. starting at node 0, the simulation tries to finish all of node 0's sends before moving to node 1. This is inefficient, as a node might have no receives during the first rounds
PCIe FIFO buffer size (smaller packets would incur inefficiencies)
Add support for multiple GPUs per node (easy)
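
Two of the open items above, overlapping packet preparation with sending and applying round robin across nodes, can be sketched as follows. This is one possible approach, not an existing part of perf_model.py: the names `link_time`, `round_time`, and `interleave_sends` are hypothetical, preparation time is assumed constant per packet, and the critical path is taken as the slowest link rather than only the link with the most packets.

```python
from itertools import zip_longest


def link_time(send_times, prep_time):
    """Finish time for one link when preparation overlaps with sending.

    Packet i can be prepared while packet i-1 is on the wire, but it cannot
    start sending until both its preparation and the previous send are done.
    """
    prep_done = 0.0
    send_done = 0.0
    for s in send_times:
        prep_done += prep_time
        send_done = max(prep_done, send_done) + s
    return send_done


def round_time(packets_by_link, prep_time):
    """Critical path of a round: the slowest link determines the round time."""
    return max(
        (link_time(times, prep_time) for times in packets_by_link.values()),
        default=0.0,
    )


def interleave_sends(sends_by_node):
    """Round robin across nodes: take one send from each node per pass."""
    order = []
    for batch in zip_longest(*sends_by_node):
        order.extend(s for s in batch if s is not None)
    return order


if __name__ == "__main__":
    # Three 3-unit sends with 1 unit of preparation each: the overlapped
    # schedule finishes earlier than the serial 3 * (1 + 3).
    print(link_time([3.0, 3.0, 3.0], prep_time=1.0))
    print(round_time({"n0->n1": [3.0, 3.0, 3.0], "n0->n2": [1.0]}, prep_time=1.0))
    print(interleave_sends([["a0", "a1"], ["b0"], ["c0", "c1", "c2"]]))
```

Interleaving means every node starts sending in the first pass, which addresses the case where later nodes would otherwise sit idle while node 0 drains its queue.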
