I noticed that a ring-reduce algorithm is chosen for a fully connected topology.
```python
# stage 1: ring reduce
latency = (
    edge_latency
    + effective_data_size_per_device / edge_bandwidth_both_direction
) * (device_count - 1)

# stage 2: broadcast
latency += effective_data_size_per_device / edge_bandwidth_per_direction
latency += (
    data_size / interconnect_module.internal_link_bandwidth_per_direction
)
self.latency = latency
```
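For concreteness, here is a back-of-the-envelope sketch of the two-stage model quoted above as a standalone function. The parameter names mirror the snippet, but the function signature, the assumption that `effective_data_size_per_device = data_size / device_count`, that `edge_bandwidth_both_direction = 2 * edge_bandwidth_per_direction`, and all example values are mine, not LLMCompass's.

```python
def ring_reduce_then_broadcast_latency(
    data_size,                              # total bytes reduced across devices
    device_count,
    edge_latency,                           # seconds of latency per hop
    edge_bandwidth_per_direction,           # bytes/s on one link direction
    internal_link_bandwidth_per_direction,  # bytes/s inside the module
):
    # Assumptions (not from the repo): a full-duplex link doubles usable
    # bandwidth, and each device's effective share is 1/device_count.
    edge_bandwidth_both_direction = 2 * edge_bandwidth_per_direction
    effective_data_size_per_device = data_size / device_count

    # stage 1: ring reduce -- (device_count - 1) steps around the ring
    latency = (
        edge_latency
        + effective_data_size_per_device / edge_bandwidth_both_direction
    ) * (device_count - 1)

    # stage 2: broadcast the reduced result
    latency += effective_data_size_per_device / edge_bandwidth_per_direction
    latency += data_size / internal_link_bandwidth_per_direction
    return latency
```

With, say, 4 devices, 4 GB of data, 1 µs hop latency, 100 GB/s per link direction, and 300 GB/s internal bandwidth, this model predicts roughly 38 ms.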
Why is this the case? I believe this would lead to significant waste of the available bandwidth: in a ring reduce, each device only uses its two ring neighbors at every step, leaving the other links of a fully connected topology idle. Wouldn't a reduce-scatter followed by an all-gather be a better implementation?
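To make the suggestion concrete, here is a hedged sketch of a cost model for the reduce-scatter + all-gather alternative on a fully connected topology, where each phase can finish in a single step because every device exchanges a 1/device_count shard with each peer over all links in parallel. All names and the single-step assumption are illustrative; this is not how LLMCompass models it, and it is not directly comparable to the quoted model unless the duplex-bandwidth assumptions are matched.

```python
def reduce_scatter_all_gather_latency(
    data_size,                     # total bytes reduced across devices
    device_count,
    edge_latency,                  # seconds of latency per hop
    edge_bandwidth_per_direction,  # bytes/s on one link direction
):
    shard = data_size / device_count
    # One phase: every device sends a distinct shard to each of the other
    # (device_count - 1) peers, but over (device_count - 1) parallel links,
    # so only a single shard is serialized on each link.
    per_phase = edge_latency + shard / edge_bandwidth_per_direction
    # reduce-scatter, then all-gather
    return 2 * per_phase
```

Under the same illustrative numbers (4 devices, 4 GB, 1 µs hops, 100 GB/s per direction) this comes to about 20 ms, which is the kind of gap that motivates the question, though the real comparison depends on how the simulator accounts for duplex links and internal-module transfers.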
(Snippet: LLMCompass/software_model/communication_primitives.py, lines 62 to 72 at commit bcc54eb.)