p2p-store-example on Alibaba Cloud eRDMA: fails to start with MC_GID_INDEX=1 and only starts with 0, but ibv_modify_qp still reports an error #53

Open
power-more opened this issue Dec 26, 2024 · 1 comment

Comments

@power-more

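Question: Are the MC_GID_INDEX=0 vs. 1 setting and the errors after startup related to automatic detection?

Description:

I ran into some problems running p2p-store-example on an Alibaba Cloud eRDMA device:

  1. MC_GID_INDEX=1 does not work; only MC_GID_INDEX=0 does. Setting it to 1 fails with: Device erdma_0 GID 1 not available: Invalid argument [22], even though the documentation says Alibaba Cloud eRDMA requires 1.

  2. On the trainer side, moving the Queue Pair (QP) to the RTS state fails with the following error (flooding the log): Failed to modify QP to RTS: Invalid argument [22]. This message comes from the ibv_modify_qp call, so I added an ibv_query_qp call before the modify, as follows: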

int RdmaEndPoint::modify_qp_to_rts(struct ibv_qp *qp, struct ibv_qp_attr *attr) {
    int current_state = 0;
    struct ibv_qp_init_attr init_attr;
    int ret = ibv_query_qp(qp, attr, sizeof(*attr), &init_attr);
    if (ret) {
        std::string message = "Failed to query QP state: " + std::string(strerror(errno));
        PLOG(ERROR) << message;
        return -1;
    }

    PLOG(INFO) << "Current QP state: " << current_state;

    PLOG(INFO) << "Setting QP to RTS with the following attributes:" << std::endl
               << "Timeout: " << attr->timeout << std::endl
               << "Retry Count: " << attr->retry_cnt << std::endl
               << "RNR Retry: " << attr->rnr_retry << std::endl
               << "SQ PSN: " << attr->sq_psn << std::endl
               << "Max QP RD Atomic: " << attr->max_rd_atomic;

    if (ibv_modify_qp(qp, attr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                      IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) {
        std::string message = "Failed to modify QP to RTS: " + std::string(strerror(errno));
        PLOG(ERROR) << message;
        return -1;
    }

    PLOG(INFO) << "QP successfully modified to RTS.";
    return 0;
}

But the problem persists. The error message is now Resource temporarily unavailable [11] (i.e. EAGAIN):
I1226 17:03:17.293787 1807026 rdma_endpoint.cpp:411] Current QP state: 0: Resource temporarily unavailable [11]
I1226 17:03:17.293788 1807026 rdma_endpoint.cpp:414] Setting QP to RTS with the following attributes:
Timeout:
Retry Count:
RNR Retry:
SQ PSN: 0
Max QP RD Atomic: : Resource temporarily unavailable [11]
I1226 17:03:17.293792 1807026 rdma_endpoint.cpp:428] QP successfully modified to RTS.: Resource temporarily unavailable [11]
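A note on reading this log: glog's PLOG appends the description of the current errno to every message, including INFO lines, so the trailing "Resource temporarily unavailable [11]" may simply be a stale errno rather than a new failure. The empty Timeout, Retry Count, RNR Retry and Max QP RD Atomic fields are consistent with those ibv_qp_attr members being uint8_t, which a C++ stream prints as raw characters. In the snippet above, current_state is also never assigned, and the third argument of ibv_query_qp is an attribute mask rather than a size. Below is a minimal sketch of the formatting fix only; QpAttrSketch and format_qp_attr are hypothetical names, and the struct is a stand-in so the example builds without libibverbs headers.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for the fields of ibv_qp_attr that the debug patch
// prints; the field widths mirror the real struct (uint8_t except sq_psn).
struct QpAttrSketch {
    uint8_t timeout;
    uint8_t retry_cnt;
    uint8_t rnr_retry;
    uint32_t sq_psn;
    uint8_t max_rd_atomic;
};

// Formats the attributes with every uint8_t field widened to unsigned int,
// so the stream prints a number instead of a (possibly invisible) character.
std::string format_qp_attr(const QpAttrSketch &attr) {
    std::ostringstream out;
    out << "Timeout: " << static_cast<unsigned>(attr.timeout)
        << ", Retry Count: " << static_cast<unsigned>(attr.retry_cnt)
        << ", RNR Retry: " << static_cast<unsigned>(attr.rnr_retry)
        << ", SQ PSN: " << attr.sq_psn
        << ", Max QP RD Atomic: " << static_cast<unsigned>(attr.max_rd_atomic);
    return out.str();
}
```

Logging such a string with LOG(INFO) instead of PLOG(INFO) would also keep the unrelated errno suffix out of informational lines.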

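  3. I built libtransfer_engine.so by hand, using g++ and linking the required dependency libraries, without using cmake.

To summarize, I have two questions:

  1. Is libtransfer_engine.so guaranteed to be correct as long as it builds without errors? Could the compile command I pieced together be missing some configuration options? I did notice CONFIG_ERDMA (although the documentation does not mention it) and have already added it.
  2. Why does MC_GID_INDEX=1 fail while 0 works? 0 means automatic detection; does the fact that my RDMA device is unavailable mean that automatic detection is broken, and how can I debug this further?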
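As one possible starting point for the second question, the populated GID entries of a device can be listed directly. The sketch below assumes the driver exposes the standard Linux RDMA sysfs layout (/sys/class/infiniband/<device>/ports/<port>/gids/<index>); whether the eRDMA driver does so on a given instance is an assumption, and is_populated_gid and list_populated_gids are hypothetical helper names, not part of the project.

```cpp
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// A GID entry counts as populated if its text contains at least one
// non-zero hex digit; an all-zero GID is treated as an empty table slot.
bool is_populated_gid(const std::string &gid) {
    return gid.find_first_of("123456789abcdefABCDEF") != std::string::npos;
}

// Scans GID indexes 0..255 for one device port and returns the populated
// (index, gid) pairs. Missing or unreadable entries are skipped, so the
// result is empty when the device or the sysfs tree does not exist.
std::vector<std::pair<int, std::string>> list_populated_gids(
    const std::string &device, int port,
    const std::string &sysfs_root = "/sys/class/infiniband") {
    std::vector<std::pair<int, std::string>> result;
    for (int index = 0; index <= 255; ++index) {
        const std::string path = sysfs_root + "/" + device + "/ports/" +
                                 std::to_string(port) + "/gids/" +
                                 std::to_string(index);
        std::ifstream file(path);
        std::string gid;
        if (file && std::getline(file, gid) && is_populated_gid(gid)) {
            result.emplace_back(index, gid);
        }
    }
    return result;
}
```

Calling list_populated_gids("erdma_0", 1) and comparing the returned indexes with the value passed in MC_GID_INDEX would show whether index 1 exists at all on this machine.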
@alogfans
Collaborator

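  1. The GID table depends on the NIC model and on how the deployment is configured, so it differs from cluster to cluster. Valid GID Index values are 0-255 (so 0 is also valid, and is the GID Index most likely to be used). In the current implementation, MC_GID_INDEX=0 means automatically searching for a valid GID Index, but this can fail in some complex cases, and only then does the MC_GID_INDEX environment variable need to be set manually. On our eRDMA test platform the GID Index has to be 1, but your platform may need 0; you can verify this with ibv_devinfo -v.
  2. Errors during the transition to RTS are extremely rare. Try printing the return value of ibv_modify_qp (errno can be misleading in the analysis), and use ulimit to raise the limit on open file handles.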
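The second suggestion can be sketched as follows. ibv_modify_qp returns 0 on success or an errno-style code on failure, so the returned value itself, not the global errno, identifies the cause. describe_verbs_result is a hypothetical helper name, and it takes the return code as a plain int so the sketch builds without libibverbs headers.

```cpp
#include <cstring>
#include <string>

// Turns the return code of a verbs call into a log-friendly message.
// rc == 0 means success; any other value is interpreted as an errno code.
//
// Intended use (assumes qp, attr and mask are already set up):
//   int rc = ibv_modify_qp(qp, attr, mask);
//   if (rc) LOG(ERROR) << describe_verbs_result("ibv_modify_qp", rc);
std::string describe_verbs_result(const char *call, int rc) {
    if (rc == 0) {
        return std::string(call) + ": ok";
    }
    return std::string(call) + " failed: " + std::strerror(rc) + " [" +
           std::to_string(rc) + "]";
}
```

Using LOG rather than PLOG here avoids appending whatever value errno happens to hold at that moment, which may be unrelated to the verbs call.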
