I noticed that a ring-reduce algorithm is chosen for a fully connected topology.
```python
# stage 1: ring reduce
latency = (
    edge_latency
    + effective_data_size_per_device / edge_bandwidth_both_direction
) * (device_count - 1)

# stage 2: broadcast
latency += effective_data_size_per_device / edge_bandwidth_per_direction
latency += (
    data_size / interconnect_module.internal_link_bandwidth_per_direction
)
self.latency = latency
```
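For concreteness, here is a back-of-the-envelope sketch of the two-stage model quoted above as a standalone function. The parameter names mirror the snippet, but the function signature, the assumption that `effective_data_size_per_device = data_size / device_count`, that `edge_bandwidth_both_direction = 2 * edge_bandwidth_per_direction`, and all example values are mine, not LLMCompass's.

```python
def ring_reduce_then_broadcast_latency(
    data_size,                              # total bytes reduced across devices
    device_count,
    edge_latency,                           # seconds of latency per hop
    edge_bandwidth_per_direction,           # bytes/s on one link direction
    internal_link_bandwidth_per_direction,  # bytes/s inside the module
):
    # Assumptions (not from the repo): a full-duplex link doubles usable
    # bandwidth, and each device's effective share is 1/device_count.
    edge_bandwidth_both_direction = 2 * edge_bandwidth_per_direction
    effective_data_size_per_device = data_size / device_count

    # stage 1: ring reduce -- (device_count - 1) steps around the ring
    latency = (
        edge_latency
        + effective_data_size_per_device / edge_bandwidth_both_direction
    ) * (device_count - 1)

    # stage 2: broadcast the reduced result
    latency += effective_data_size_per_device / edge_bandwidth_per_direction
    latency += data_size / internal_link_bandwidth_per_direction
    return latency
```

With, say, 4 devices, 4 GB of data, 1 µs hop latency, 100 GB/s per link direction, and 300 GB/s internal bandwidth, this model predicts roughly 38 ms.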
Why is this the case? I believe this would lead to significant waste of the available bandwidth: in a ring reduce, each device only uses its two ring neighbors at every step, leaving the other links of a fully connected topology idle. Wouldn't a reduce-scatter followed by an all-gather be a better implementation?
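To make the suggestion concrete, here is a hedged sketch of a cost model for the reduce-scatter + all-gather alternative on a fully connected topology, where each phase can finish in a single step because every device exchanges a 1/device_count shard with each peer over all links in parallel. All names and the single-step assumption are illustrative; this is not how LLMCompass models it, and it is not directly comparable to the quoted model unless the duplex-bandwidth assumptions are matched.

```python
def reduce_scatter_all_gather_latency(
    data_size,                     # total bytes reduced across devices
    device_count,
    edge_latency,                  # seconds of latency per hop
    edge_bandwidth_per_direction,  # bytes/s on one link direction
):
    shard = data_size / device_count
    # One phase: every device sends a distinct shard to each of the other
    # (device_count - 1) peers, but over (device_count - 1) parallel links,
    # so only a single shard is serialized on each link.
    per_phase = edge_latency + shard / edge_bandwidth_per_direction
    # reduce-scatter, then all-gather
    return 2 * per_phase
```

Under the same illustrative numbers (4 devices, 4 GB, 1 µs hops, 100 GB/s per direction) this comes to about 20 ms, which is the kind of gap that motivates the question, though the real comparison depends on how the simulator accounts for duplex links and internal-module transfers.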
(Snippet: LLMCompass/software_model/communication_primitives.py, lines 62 to 72 at commit bcc54eb.)