Skip to content

Conversation

ofirfarjun7
Copy link
Contributor

@ofirfarjun7 ofirfarjun7 commented Sep 29, 2025

What?

Change UCP device API to work with offsets instead of address.
Add channel_id to API to support use of multiple QPs per destination.

Why?

User requirements.

@ofirfarjun7 ofirfarjun7 changed the title UCP/API: Work with offsets. UCP/API: Work with offsets and channel id. Sep 29, 2025
{
return ucs_device_atomic64_write(
reinterpret_cast<uint64_t*>(counter_ptr), value);
return ucs_device_atomic64_write(reinterpret_cast<uint64_t*>(counter_ptr),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is void function, return is not needed

UCP_DEVICE_MEM_LIST_ELEM_FIELD_LENGTH;
elems[i].memh = perf.ucp.send_memh;
elems[i].rkey = perf.ucp.rkey;
elems[i].local_addr = (char*)perf.send_buffer + offset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: UCS_PTR_BYTE_OFFSET

uct_elem = (uct_device_mem_element_t*)UCS_PTR_BYTE_OFFSET(mem_list_h + 1,
elem_offset);
uct_elem = (uct_device_mem_element_t*)
UCS_PTR_BYTE_OFFSET(mem_list_h->uct_mem_elements, elem_offset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe save uct_mem_elements per lane, and remove the field uct_mem_element_size ?

Copy link
Contributor Author

@ofirfarjun7 ofirfarjun7 Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, how we can calculate offset to specific element without saving the size? type of element is unknown to UCP.

BTW: to support multi lanes we do need to add uct_mem_elements per lane. Not sure I will do it in this PR, it's already large.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm right

ucp_device_mem_list_handle_h mem_list_h, unsigned mem_list_index,
const void *address, uint64_t remote_address, size_t length,
uint64_t flags, ucp_device_request_t *req)
ucp_device_mem_list_handle_h mem_list_h, unsigned channel_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO channel_id param should be between "length" and "flags"

/**
* Array of local addresses for the device transfer operations.
*/
void **local_addr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local_addrs

/**
* Array of remote addresses for the device transfer operations.
*/
uint64_t *remote_addr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remote_addrs

* Array of lengths of the local buffers in bytes.
*/
size_t *length;
} ucp_mem_elements;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we can remove this wrapping struct


handle_size += sizeof(*handle.ucp_mem_elements.local_addr);
handle_size += sizeof(*handle.ucp_mem_elements.remote_addr);
handle_size += sizeof(*handle.ucp_mem_elements.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. space line after
  2. maybe combine to single calculation

Comment on lines +343 to +348
ucp_mem_type_unpack(ep->worker, &local_addresses[i], &local_addr,
sizeof(local_addresses[i]), mem_type);
ucp_mem_type_unpack(ep->worker, &remote_addresses[i], &remote_addr,
sizeof(remote_addresses[i]), mem_type);
ucp_mem_type_unpack(ep->worker, &lengths[i], &length,
sizeof(lengths[i]), mem_type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we "unpack" all these in single/several calls, and not for each element?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remote_addrs array follow local_addrs array, so only way I see to do it not for each element is to allocate some tmp array, copy to it and then use unpack.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, can be next pr

elems[i].local_addr = (char*)perf.send_buffer + offset;
elems[i].remote_addr = perf.ucp.remote_addr + offset;
elems[i].length = (i == count - 1) ? ONESIDED_SIGNAL_SIZE :
perf.params.msg_size_list[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next pr: align

local_offsets[i] = 0;
remote_offsets[i] = 0;
lengths[i] = (i == count - 1) ? ONESIDED_SIGNAL_SIZE :
perf.params.msg_size_list[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next pr: align


void init_counters(const ucx_perf_context_t &perf)
{
m_params.length = ucx_perf_get_message_size(&perf.params);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next pr: align

* for the memory transfer. The addresses and length must be valid for the used
* @a mem_list entry.
* for the memory transfer. The @a local_offset and @a remote_offset parameters
* specify byte offsets within the selected memory list entry.The @a length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* specify byte offsets within the selected memory list entry.The @a length,
* specify byte offsets within the selected memory list entry. The @a length,

* @tparam level Level of cooperation of the transfer.
* @param [in] req Request containing operations in progress.
*
* @param [in] channel_id Channel ID to progress.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next pr: align

/**
* Array of local addresses for the device transfer operations.
*/
void **local_addr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be more logical and cache friendly to have an array of struct?
So that the fields of certain element are stored together

Copy link
Contributor Author

@ofirfarjun7 ofirfarjun7 Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that, it will also help to unpack the fields when creating the handle.
But it will require changing UCT API as well...(it expects array of local & remote addresses). I didn't want to do it now because it's urgent.

iyastreb
iyastreb previously approved these changes Oct 6, 2025
@michal-shalev michal-shalev merged commit 1537aa9 into openucx:master Oct 6, 2025
141 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants