-
Notifications
You must be signed in to change notification settings - Fork 434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/IB/MD: retry memory registration with reduced access flags on failure #10341
base: master
Are you sure you want to change the base?
UCT/IB/MD: retry memory registration with reduced access flags on failure #10341
Conversation
src/ucp/rndv/proto_rndv.inl
Outdated
@@ -49,7 +49,8 @@ ucp_proto_rndv_rts_request_init(ucp_request_t *req) | |||
status = ucp_datatype_iter_mem_reg(ep->worker->context, | |||
&req->send.state.dt_iter, | |||
rpriv->md_map, | |||
UCT_MD_MEM_ACCESS_RMA | | |||
UCT_MD_MEM_ACCESS_REMOTE_GET | | |||
UCT_MD_MEM_ACCESS_LOCAL_READ | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just the most basic option so I thought I can keep mention it as well, but on the other hand it doesn't matter as local read is enabled by default in ibv_reg_mr (there is no such flag in verbs.h because it's always enabled)
src/uct/ib/mlx5/dv/ib_mlx5dv_md.c
Outdated
md->super.relaxed_order, | ||
1, flags) & | ||
access_mask; | ||
int attempt_cnt = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid abbreviations (cnt
)
src/uct/ib/base/ib_md.c
Outdated
@@ -626,9 +627,39 @@ ucs_status_t uct_ib_memh_alloc(uct_ib_md_t *md, size_t length, | |||
return UCS_OK; | |||
} | |||
|
|||
uint64_t uct_ib_memh_access_flags(uct_ib_mem_t *memh, int relaxed_order) | |||
uint64_t uct_flags_to_ibv_mem_access_flags(uint64_t uct_flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uct_ib_..
src/uct/ib/base/ib_md.c
Outdated
@@ -460,7 +460,7 @@ uct_ib_md_handle_mr_list_mt(uct_ib_md_t *md, void *address, size_t length, | |||
ucs_status_t uct_ib_reg_mr(uct_ib_md_t *md, void *address, size_t length, | |||
const uct_md_mem_reg_params_t *params, | |||
uint64_t access_flags, struct ibv_dm *dm, | |||
struct ibv_mr **mr_p) | |||
struct ibv_mr **mr_p, int first_attempt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first_attempt
-> ignore_errors
or smth similar.
This func doesn't care about "attempts" and their number, the meaning of this flag here is only to "force" ignore errors.
src/uct/ib/base/ib_md.c
Outdated
} | ||
|
||
uint64_t uct_ib_memh_access_flags(uct_ib_mem_t *memh, int relaxed_order, | ||
int first_attempt, uint64_t uct_flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well, I'd rename first_attempt
to its real meaning in this context.
Maybe smth like use_uct_flags
?
src/uct/ib/base/ib_md.c
Outdated
access_flags = uct_ib_memh_access_flags(&memh->super, md->relaxed_order, | ||
attempt_cnt == 0, uct_flags); | ||
|
||
status = uct_ib_reg_mr(md, address, length, params, access_flags, NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Align =
src/uct/ib/mlx5/dv/ib_mlx5dv_md.c
Outdated
NULL, &memh->mrs[mr_type].super.ib, | ||
attempt_cnt == 0); | ||
attempt_cnt++; | ||
access_flags = uct_ib_memh_access_flags(&memh->super, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First reg_mr
, then access_flags
- is that intentional?
This line always has attempt_cnt
>= 1...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's intentional, though it might be written better..
access_flags
is first calculated at the top of the function because it's used for another function, and only then it is used in the do while
loop
src/uct/ib/base/ib_md.c
Outdated
@@ -626,9 +627,35 @@ ucs_status_t uct_ib_memh_alloc(uct_ib_md_t *md, size_t length, | |||
return UCS_OK; | |||
} | |||
|
|||
uint64_t uct_ib_memh_access_flags(uct_ib_mem_t *memh, int relaxed_order) | |||
uint64_t uct_ib_flags_to_ibv_mem_access_flags(uint64_t uct_flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static
src/uct/ib/base/ib_md.c
Outdated
@@ -309,7 +309,7 @@ void *uct_ib_md_mem_handle_thread_func(void *arg) | |||
while (ctx->length > 0) { | |||
if (ctx->params != NULL) { | |||
status = uct_ib_reg_mr(ctx->md, ctx->address, length, ctx->params, | |||
ctx->access_flags, NULL, &ctx->mrs[mr_idx]); | |||
ctx->access_flags, NULL, &ctx->mrs[mr_idx], 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why hide_errors is true here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe Oh, I actually wanted to put here a value that doesn't change the previous behavior and put 1
by mistake. I'll also change it in the other places.
src/uct/ib/base/ib_md.c
Outdated
memh = ucs_derived_of(ib_memh, uct_ib_verbs_mem_t); | ||
|
||
do { | ||
access_flags = uct_ib_memh_access_flags(&memh->super, md->relaxed_order, | ||
attempt_count > 0, uct_flags); | ||
|
||
status = uct_ib_reg_mr(md, address, length, params, access_flags, | ||
NULL, &mr_default, attempt_count == 0); | ||
attempt_count++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This retry should happen in UCP layer, otherwise IMO it can cause issues with UCP rcache region merges
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe Where are the rcache region merges triggered from?
Do you think we should still try to register with full permissions and only then with specific ones?
364f3cf
to
251cd9a
Compare
@@ -462,7 +462,7 @@ static void ucp_memh_cleanup(ucp_context_h context, ucp_mem_h memh) | |||
} | |||
|
|||
static ucs_status_t ucp_memh_register_gva(ucp_context_h context, ucp_mem_h memh, | |||
ucp_md_map_t md_map) | |||
ucp_md_map_t md_map, unsigned uct_flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand it's temp, but, just in case, it would be good to replace gva_mr with some structure and keep registrations for different access flags there.
… test_ucp_memheap
… capability is not set
… if have no capability
What?
Enable memory registration for read-only addresses.
Why?
A customer reported an issue when attempting to send read-only memory using RDMA. Details can be found in the related discussion: https://github.com/openucx/ucx/issues/10085#issuecomment-2377368540.