Artemy-Mellanox

No description provided.

if (req != nullptr) {
comp = &req->comp;
req->device_ep = device_ep;
uct_device_completion_init(comp);

can also remove the TODO comment

#include <uct/ib/mlx5/gdaki/gdaki.cuh>

union uct_device_completion {
uct_rc_gda_completion_t rc;

rc -> rc_gda

0xffffff);
uct_rc_mlx5_gda_qedump("WQE", wqe_ptr, 64);
uct_rc_mlx5_gda_qedump("CQE", cqe64, 64);
ep->wqe_error = ep->sq_wqe_pi;

Do we need some "atomic write" so it will be visible to all threads?
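
To make the visibility question concrete, here is a minimal host-side sketch of publishing the error index with release/acquire ordering. It is not the GDA device code: `ep_model`, `record_error` and `error_visible` are hypothetical names, and only `wqe_error` and `sq_wqe_pi` come from the diff. In device code the counterpart would likely be a device-scope atomic or a fence such as `__threadfence()`; which one fits this code path is an open question here.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical host-side stand-in for the endpoint. Only the field names
// wqe_error and sq_wqe_pi mirror the diff; the type itself is an assumption.
struct ep_model {
    std::atomic<uint64_t> sq_wqe_pi{0};
    std::atomic<uint64_t> wqe_error{0};
};

// Writer side: record the current producer index as the error boundary.
// The release store makes the value (and earlier writes by this thread)
// visible to any thread that later observes it with an acquire load.
inline void record_error(ep_model &ep)
{
    const uint64_t pi = ep.sq_wqe_pi.load(std::memory_order_relaxed);
    ep.wqe_error.store(pi, std::memory_order_release);
}

// Reader side: a completion posted before the error boundary is failed.
inline bool error_visible(const ep_model &ep, uint64_t wqe_idx)
{
    return wqe_idx < ep.wqe_error.load(std::memory_order_acquire);
}
```

With a plain (non-atomic) store, nothing guarantees when, or whether, other threads observe the new value, which is the concern raised above.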

Comment on lines 600 to 604
if (ep->wqe_error && comp->wqe_idx < ep->wqe_error) {
return UCS_ERR_IO_ERROR;
}

status = (ucs_status_t)__shfl_sync(0xffffffff, status, 0);
__syncwarp();
return status;
} else if (level == UCS_DEVICE_LEVEL_THREAD) {
return uct_rc_mlx5_gda_progress_thread(ep);
} else {
return UCS_ERR_UNSUPPORTED;
if (ep->sq_wqe_pi <= comp->wqe_idx) {

  1. Why do we need to check ep->wqe_error separately?
    IMO comp->wqe_idx < ep->wqe_error is enough.
  2. Do we need some kind of atomic read/write for ep->wqe_error?
  3. Maybe rename ep->wqe_error to ep->wqe_error_pi?
  4. If another thread has updated ep->sq_wqe_pi but has not yet updated ep->wqe_error, we might treat some tl_comp as successful when it has in fact failed.
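
Point 1 can be checked in isolation. The sketch below assumes both indices are unsigned 64-bit values (`wqe_error` is declared `uint64_t` in the diff; the type of `comp->wqe_idx` is an assumption). The helper names are hypothetical and exist only for the comparison.

```cpp
#include <cstdint>

// Form used in the diff: test wqe_error for non-zero, then compare.
constexpr bool is_failed_two_checks(uint64_t wqe_idx, uint64_t wqe_error)
{
    return (wqe_error != 0) && (wqe_idx < wqe_error);
}

// Form suggested in the review: the comparison alone.
// For unsigned values, wqe_idx < 0 is never true, so wqe_error == 0
// ("no error recorded") already yields false without the extra test.
constexpr bool is_failed_one_check(uint64_t wqe_idx, uint64_t wqe_error)
{
    return wqe_idx < wqe_error;
}

// Compile-time spot checks of the two boundary cases.
static_assert(!is_failed_one_check(0, 0), "no error recorded");
static_assert(!is_failed_one_check(4, 4), "index at the boundary is not failed");
```

The equivalence relies on the indices being unsigned; with a signed index type, the separate non-zero test would no longer be redundant.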

uint64_t cqe_ci;
int sq_lock;

uint64_t wqe_error;

maybe move it after sq_wqe_pi?

iyastreb previously approved these changes Oct 2, 2025

@iyastreb iyastreb left a comment

I tested with many threads, no hangs and perf is better than before:

-O 32 -T 1   7.7us
-O 32 -T 32  159us
-O 32 -T 64  153us
-O 16 -T 128 372us
-O 8 -T 256  1169us
-O 4 -T 512  1768us

@yosefe yosefe enabled auto-merge (squash) October 3, 2025 08:42

yosefe commented Oct 3, 2025

This PR was hanging in the CUDA kernel perftest, so maybe there is a real issue.
Update: it hangs in the CUDA perftest on several attempts, so it does not seem random.
https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=108913&view=logs&j=e84301ea-5118-57e7-336f-e36bdd9bfe4e&t=088f285f-9a23-576c-fb3a-515759c9e3bf&s=ffab728f-40b5-5c1b-8699-ef0c0cb2b08c

@yosefe yosefe merged commit cd9f7f8 into openucx:master Oct 6, 2025
141 checks passed