[Bugfix] fix the gid choice of IB device: only choose IBV_GID_TYPE_ROCE_V2 now #113

Merged
merged 3 commits into from
Mar 6, 2025

Conversation

fengquyoumo
Contributor

In the function RdmaContext::getBestGidIndex, we select the first GID with an IPv4-mapped address without considering the GID type.

Since the connection is now TCP-based, the GID type must be RoCEv2.

Selecting a RoCEv1 GID causes transfers to fail with "transport retry counter exceeded" errors:

E0219 16:52:31.208603 273373 worker_pool.cpp:281] Worker: Process failed for slice (opcode: 0, source_addr: 0x7f593802c000, length: 4096, dest_addr: 140622732378112, local_nic: mlx5_bond_0, peer_nic: x.x.x.x:12345@mlx5_bond_2, dest_rkey: 1568428, retry_cnt: 0): transport retry counter exceeded
E0219 16:52:31.208797 273373 worker_pool.cpp:281] Worker: Process failed for slice (opcode: 0, source_addr: 0x7f593800c000, length: 4096, dest_addr: 140622732247040, local_nic: mlx5_bond_0, peer_nic: x.x.x.x:12345@mlx5_bond_3, dest_rkey: 1587190, retry_cnt: 0): transport retry counter exceeded

Meanwhile, I also removed some code that appeared to be unnecessary.

@fengquyoumo fengquyoumo requested a review from alogfans February 22, 2025 17:17
@fengquyoumo
Contributor Author

@alogfans Hi! Could you review this PR again? I've made the changes based on your last feedback.
Thanks!

if (is_ipv4_rival && !is_ipv4) {
if (ipv6_addr_v4mapped((struct in6_addr *)gid_entry.gid.raw) &&
(gid_entry.gid_type == IBV_GID_TYPE_IB ||
gid_entry.gid_type == IBV_GID_TYPE_ROCE_V2)) {
Collaborator

Looks fine to me.
Some InfiniBand devices do not enable IPoIB, so I recommend changing these lines as follows (note the parentheses):

if ((ipv6_addr_v4mapped((struct in6_addr *)gid_entry.gid.raw) &&
       gid_entry.gid_type == IBV_GID_TYPE_ROCE_V2)
    || gid_entry.gid_type == IBV_GID_TYPE_IB) {
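
To make the grouping in the suggested condition concrete, here is a minimal, self-contained sketch of the same predicate. It deliberately does not use the real verbs headers: `GidType`, `GidEntry`, `isV4Mapped`, `isUsableGid`, and `pickGidIndex` are hypothetical stand-ins invented for illustration, and the first-match scan is an assumption about how the surrounding loop behaves, not a copy of `RdmaContext::getBestGidIndex`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the verbs GID type enum (illustration only).
enum class GidType { IB, RoCEv1, RoCEv2 };

// Hypothetical stand-in for a GID table entry: 16 raw bytes plus a type.
struct GidEntry {
    std::array<std::uint8_t, 16> raw;
    GidType type;
};

// True when the 16 bytes have the IPv4-mapped form ::ffff:a.b.c.d,
// i.e. ten zero bytes followed by 0xff 0xff.
bool isV4Mapped(const std::array<std::uint8_t, 16>& raw) {
    for (std::size_t i = 0; i < 10; ++i) {
        if (raw[i] != 0) return false;
    }
    return raw[10] == 0xff && raw[11] == 0xff;
}

// Mirrors the suggested condition: a RoCEv2 entry must be IPv4-mapped,
// while an IB entry is accepted regardless of its address form.
bool isUsableGid(const GidEntry& e) {
    return (isV4Mapped(e.raw) && e.type == GidType::RoCEv2) ||
           e.type == GidType::IB;
}

// Returns the index of the first usable entry, or -1 if there is none.
// (First-match scanning is an assumption made for this sketch.)
int pickGidIndex(const std::vector<GidEntry>& table) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (isUsableGid(table[i])) return static_cast<int>(i);
    }
    return -1;
}
```

With this grouping, an IPv4-mapped RoCEv1 entry is skipped, while an IB entry without an IPv4-mapped address (for example, when IPoIB is not enabled) is still accepted.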

Contributor Author

Adjusted. Please take a look again!

Signed-off-by: fengquyoumo <[email protected]>
@alogfans alogfans merged commit 6cf9381 into kvcache-ai:main Mar 6, 2025
1 check passed