Understanding semaphore usage examples in the Pallas Documentation #24235

ji8er · 2024-10-10T16:31:18Z

Reproducing the all_gather example from the Pallas docs below:

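from jax import lax
from jax.experimental import pallas as pl
from jax.experimental.pallas import tpu as pltpu

# num_devices is assumed to be defined elsewhere (e.g. at module level).
def all_gather_kernel(input_ref,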
                      output_ref,
                      local_copy_sem,
                      send_sem,
                      recv_sems):
  outer_step = pl.program_id(0)
  my_id = lax.axis_index('x')
  right_neighbor = lax.rem(my_id + 1, num_devices)
  copy_slot = my_id - outer_step
  copy_slot = lax.rem(copy_slot + num_devices, num_devices)

  @pl.when(outer_step == 0)
  def _():
    local_copy_op = pltpu.make_async_copy(
      src_ref=input_ref,
      dst_ref=output_ref.at[my_id],
      sem=local_copy_sem,
    )
    local_copy_op.start()
    local_copy_op.wait()

  # Copy to our right neighbor.
  # Note that we will also be receiving data from our left neighbor,
  # but at `copy_slot-1` rather than `copy_slot`! This makes use of the fact
  # that the indices do not need to be symmetric between remote DMAs.
  remote_copy_op = pltpu.make_async_remote_copy(
      src_ref=output_ref.at[copy_slot],
      dst_ref=output_ref.at[copy_slot],
      send_sem=send_sem,
      recv_sem=recv_sems.at[outer_step],
      device_id=(right_neighbor, 0),
      device_id_type=pltpu.DeviceIdType.MESH,
  )
  remote_copy_op.start()
  remote_copy_op.wait()

And here is the toy example from the docs:

def example_kernel(input_ref, output_ref, send_sem, recv_sem):
    device_id = lax.axis_index('x')
    copy_0_to_1 = pltpu.make_async_remote_copy(
        src_ref=input_ref,
        dst_ref=output_ref,
        send_sem=send_sem,
        recv_sem=recv_sem,
        device_id=1,
    )
    copy_2_to_3 = pltpu.make_async_remote_copy(
        src_ref=input_ref,
        dst_ref=output_ref,
        send_sem=send_sem,
        recv_sem=recv_sem,
        device_id=3,
    )
    copy_3_to_2 = pltpu.make_async_remote_copy(
        src_ref=input_ref,
        dst_ref=output_ref,
        send_sem=send_sem,
        recv_sem=recv_sem,
        device_id=2,
    )
    @pl.when(device_id == 0)
    def _():
      copy_0_to_1.start()
      copy_0_to_1.wait_send()
    @pl.when(device_id == 1)
    def _():
      copy_0_to_1.wait_recv()
    @pl.when(device_id == 2)
    def _():
      copy_2_to_3.start()
      copy_2_to_3.wait_send()
      copy_3_to_2.wait_recv()
    @pl.when(device_id == 3)
    def _():
      copy_3_to_2.start()
      copy_3_to_2.wait_send()
      copy_2_to_3.wait_recv()

I understand that each device will get its own semaphores.

In the second example, consider the device with device_id == 2. It starts the copy_2_to_3 DMA and then waits on the receive of copy_3_to_2, not on the receive of copy_2_to_3. This makes sense, since device 2 needs to wait for the data that device 3 is sending to it.

In the first example (all-gather), though, if I expand remote_copy_op.wait() into remote_copy_op.wait_send(); remote_copy_op.wait_recv(), I see that device i waits on the receive of the op that transfers data to the next device, not on the receive of the op that transfers data to itself. I can't understand why this is.
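To make the question concrete, here is a toy model of one ring step in plain Python. It is only a sketch, not Pallas code: `ToyDevice`, `start_remote_copy`, `wait`, and `NUM_DEVICES` are hypothetical names, semaphores are plain counters, and the DMA completes instantly. It assumes that a finished remote copy signals send_sem on the sender and recv_sem on the receiver, and that wait_recv consumes the caller's local recv_sem.

```python
from dataclasses import dataclass

NUM_DEVICES = 4  # hypothetical ring size for this sketch


@dataclass
class ToyDevice:
    """One device's private semaphores (plain counters in this model)."""
    send_sem: int = 0
    recv_sem: int = 0
    recv_signalled_by: int = -1  # which device's DMA last signalled recv_sem


def start_remote_copy(devices, src, dst):
    """Toy DMA from `src` to `dst` that completes immediately.

    Assumption: completion signals send_sem on the sender and
    recv_sem on the receiver.
    """
    devices[src].send_sem += 1
    devices[dst].recv_sem += 1
    devices[dst].recv_signalled_by = src


def wait(device, sem_name):
    """Consume one signal from the named semaphore on `device`."""
    value = getattr(device, sem_name)
    if value == 0:
        raise RuntimeError(f"{sem_name} was never signalled (would block)")
    setattr(device, sem_name, value - 1)


def run_ring_step():
    devices = [ToyDevice() for _ in range(NUM_DEVICES)]
    # Every device runs the same kernel body: start a copy to its right neighbor.
    for my_id in range(NUM_DEVICES):
        right_neighbor = (my_id + 1) % NUM_DEVICES
        start_remote_copy(devices, src=my_id, dst=right_neighbor)
    # wait() is modelled as wait_send followed by wait_recv,
    # both acting on the calling device's own semaphores.
    for device in devices:
        wait(device, "send_sem")
        wait(device, "recv_sem")
    return devices


devices = run_ring_step()
for my_id, device in enumerate(devices):
    print(f"device {my_id}: recv_sem was signalled by device "
          f"{device.recv_signalled_by}")
```

In this model, the wait_recv that device i issues on "its" op is satisfied by the copy coming from its left neighbor, because every device runs the same code and the receive semaphore lives on the receiving device.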

ji8er · 2024-10-10T16:42:58Z

Is it the case that whenever we wait, we wait not on the op but on the semaphore inside it?

So, in the second example above, it wouldn't have mattered whether we waited on the recv of copy_2_to_3 or of copy_3_to_2, since all the ops use the same semaphores?

When a device waits on a recv, it waits on its own copy of recv_sem, which is set by some other device when a transfer from that device completes?

So the recv_sem passed into an async (remote) copy op serves two purposes:

1. The copy of recv_sem on the receiving device is set once the transfer from the current device completes.
2. The current device listens for its own copy of recv_sem to be set by some other device.

Is this understanding correct?
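The two purposes above can be sketched in plain Python. This is a toy model under the assumptions stated in this post, not the Pallas implementation: `ToyRemoteCopy`, `_consume`, and `exchange` are hypothetical names, semaphores are plain counters, and transfer sizes are ignored. It replays the device 2 / device 3 exchange from the toy example and lets device 2 wait on the recv of either op.

```python
def _consume(sems, device_id, name):
    """Consume one signal from semaphore `name` on device `device_id`."""
    if sems[device_id][name] == 0:
        raise RuntimeError(f"device {device_id}: {name} never signalled")
    sems[device_id][name] -= 1


class ToyRemoteCopy:
    """Toy stand-in for a remote-copy descriptor (not the Pallas class).

    It stores only the target device id. Semaphores are looked up on
    whichever device calls start/wait, i.e. each device has its own copy.
    """

    def __init__(self, target_id):
        self.target_id = target_id

    def start(self, sems, my_id):
        # Purpose 1: completion signals recv_sem on the *target* device
        # (and send_sem on the device that started the copy).
        sems[my_id]["send"] += 1
        sems[self.target_id]["recv"] += 1

    def wait_send(self, sems, my_id):
        _consume(sems, my_id, "send")

    def wait_recv(self, sems, my_id):
        # Purpose 2: the caller waits on its *own* recv_sem,
        # regardless of which op object the call goes through.
        _consume(sems, my_id, "recv")


def exchange(op_device_2_waits_on):
    sems = {dev: {"send": 0, "recv": 0} for dev in (2, 3)}
    ops = {
        "copy_2_to_3": ToyRemoteCopy(target_id=3),
        "copy_3_to_2": ToyRemoteCopy(target_id=2),
    }
    ops["copy_2_to_3"].start(sems, my_id=2)
    ops["copy_3_to_2"].start(sems, my_id=3)
    # Device 2: wait for its send, then for incoming data.
    ops["copy_2_to_3"].wait_send(sems, my_id=2)
    ops[op_device_2_waits_on].wait_recv(sems, my_id=2)
    # Device 3: same pattern, mirrored.
    ops["copy_3_to_2"].wait_send(sems, my_id=3)
    ops["copy_2_to_3"].wait_recv(sems, my_id=3)
    return sems


# Both choices leave every semaphore drained in this model.
print(exchange("copy_3_to_2") == exchange("copy_2_to_3"))
```

If the understanding above is right, the op object only names which semaphores to use, so two ops that share the same recv_sem are interchangeable for waiting. The sketch does not capture anything size-dependent in a real wait.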

If so:

I would have expected the semaphores for purposes 1 and 2 to be disentangled, that is, one name for the semaphore on another device and a different name for the semaphore on our own device.

How can I better understand the reasoning behind this design?

ji8er · 2024-10-15T07:53:01Z

[After reading the docs further, this became clear: the alternate version that waits on the semaphores directly is also shown there.]
