Replies: 2 comments
-
Is it the case that whenever we wait, we wait not on the op itself but on the semaphore within it? So, in the second example above, it wouldn't have mattered whether we wait on recv of
When a device waits on a recv, does it wait on its own copy of `recv_sem`, which will be set by some other device when a transfer from that other device completes? So a `recv_sem` passed into an async (remote) copy_op serves 2 purposes:
Is this understanding correct? If so, I would have expected the semaphores for 1 and 2 to be disentangled, that is, different names for the semaphore of another device and the semaphore on our own device. How can I better understand why this design was chosen?
-
[After reading the docs further, it becomes clear, as the alternative version that waits on the semaphores directly is also there.]
-
Reproducing the `all_gather` example from the Pallas docs below:

And the toy example in the docs below:
I understand that each device will get its own semaphores.

In the second example, if, say, we look at the device with `device_id = 2`, we see that it starts the DMA for `copy_2_to_3` and then waits on the receive of the `copy_3_to_2` DMA (not on the receive of `copy_2_to_3`). This makes sense, as device 2 needs to wait for data to reach itself from device 3.

In the first example (all-gather), though, if I expand `remote_copy_op.wait()` into `remote_copy_op.wait_send(); remote_copy_op.wait_recv()`, I see that device `i` waits on the receive of the op that is transferring data to the next device, not the receive of the op that is transferring data to itself. I can't understand why this is.
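
For what it's worth, here is a toy model in plain Python (not Pallas, and not how the hardware is implemented) of the reading suggested in the replies: every device runs the same program and owns its own copy of `send_sem` and `recv_sem`. Waiting on the receive of an op consumes the local `recv_sem`, which is signaled by whichever remote copy lands on this device. The `Device` and `RemoteCopy` classes and the `run_ring_step` helper are hypothetical names made up for this sketch, and the two-phase "start everything, then wait" loop is a simplification of real blocking waits.

```python
# Toy model only: plain Python counters stand in for DMA semaphores.
from dataclasses import dataclass

NUM_DEVICES = 4


@dataclass
class Device:
    send_sem: int = 0  # signaled when this device's outgoing copy completes
    recv_sem: int = 0  # signaled by a remote device whose copy landed here


@dataclass
class RemoteCopy:
    """Hypothetical stand-in for an async remote copy issued by device `src`."""
    src: int
    dst: int

    def start(self, devices, log):
        # Completion signals send_sem on the sender and recv_sem on the receiver.
        devices[self.src].send_sem += 1
        devices[self.dst].recv_sem += 1
        log.append((self.src, self.dst))

    def wait_send(self, devices):
        # Executed by device `src`: consumes src's own send_sem.
        assert devices[self.src].send_sem > 0
        devices[self.src].send_sem -= 1

    def wait_recv(self, devices):
        # Also executed by device `src`: consumes src's own recv_sem,
        # which was signaled by the left neighbor's copy into `src`.
        assert devices[self.src].recv_sem > 0
        devices[self.src].recv_sem -= 1


def run_ring_step(num_devices=NUM_DEVICES):
    devices = [Device() for _ in range(num_devices)]
    log = []
    # Same program on every device: copy to the right neighbor in the ring.
    ops = [RemoteCopy(src=i, dst=(i + 1) % num_devices) for i in range(num_devices)]
    for op in ops:
        op.start(devices, log)
    for op in ops:
        op.wait_send(devices)
        op.wait_recv(devices)
    return devices, log


devices, log = run_ring_step()
for src, dst in log:
    print(f"copy {src} -> {dst} signaled recv_sem on device {dst}")
```

In this model, device `i` calling `wait_recv()` on its own `i -> i+1` op is only satisfied because the `(i-1) -> i` copy signaled device `i`'s `recv_sem`, so under these assumptions the wait is effectively for the data arriving at device `i`, even though the op object being waited on describes the outgoing transfer.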