New op: `iree_gpu.coalesced_gather_dma`

### Request description

This op is meant to go inside the `in_parallel` region of an `scf.forall` or a similar construct. It takes as input a tensor to gather from, a thread-level tile of indices to gather at, and a subgroup-level output tensor to gather into. It returns no results: the subgroup-level tile is a shared output (`shared_outs`).

An example of the use of the op is as follows:
```
  %0 = scf.forall (%flat_thread_id) shared_outs(%shared_dest_slice = %dest_slice) {
    %m_id, %k_id = affine.delinearize_index %flat_thread_id into (mTile, kTile)
    %indices_thread_slice = tensor.extract_slice %indices_slice [%m_id, %k_id] [m, k]
    scf.forall.in_parallel {
      iree_gpu.coalesced_gather_dma (%indices_thread_slice, %Idx) -> %shared_dest_slice
    }
  }
```
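
To make the intended data movement concrete, here is a minimal reference sketch in plain Python. It is not the op's actual semantics or lowering. It assumes (since the request does not specify this) that the gather is along the outermost dimension of the source and that the indices are split evenly across threads. The names `gather_rows` and `coalesced_gather_reference` are hypothetical.

```python
def gather_rows(source, indices):
    """Reference gather along dim 0: out[i] = source[indices[i]]."""
    return [list(source[i]) for i in indices]


def coalesced_gather_reference(source, indices, num_threads):
    """Emulate the forall: each "thread" gathers its own tile of indices
    into one shared destination (standing in for %shared_dest_slice).

    Assumption: indices divide evenly across threads.
    """
    n = len(indices)
    assert n % num_threads == 0, "indices must split evenly across threads"
    tile = n // num_threads
    dest = [None] * n  # shared output tile, written by every thread
    for tid in range(num_threads):  # sequential stand-in for parallel threads
        lo = tid * tile
        thread_indices = indices[lo:lo + tile]  # like %indices_thread_slice
        # Each thread writes a disjoint region of the shared destination.
        dest[lo:lo + tile] = gather_rows(source, thread_indices)
    return dest


if __name__ == "__main__":
    source = [[0, 1], [10, 11], [20, 21], [30, 31]]
    indices = [3, 0, 2, 2]
    print(coalesced_gather_reference(source, indices, num_threads=2))
```

Because each thread writes a disjoint region of the destination, the result matches a single whole-tensor gather regardless of thread order, which is the property that lets the destination be a shared output.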


### What component(s) does this issue relate to?

_No response_

### Additional context

_No response_

New op: `iree_gpu.coalesced_gather_dma` #21784

Request description

What component(s) does this issue relate to?

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

New op: iree_gpu.coalesced_gather_dma #21784

Description

Request description

What component(s) does this issue relate to?

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

New op: `iree_gpu.coalesced_gather_dma` #21784