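Issue: Are the MC_GID_INDEX=0 vs. 1 configuration and the errors after startup related to auto-detection?

I ran into some problems running p2p-store-example on an Alibaba Cloud eRDMA device:

`MC_GID_INDEX=1` does not work; only `MC_GID_INDEX=0` works. With the value 1, startup fails with `Device erdma_0 GID 1 not available: Invalid argument [22]`, even though the documentation says Alibaba Cloud eRDMA must use 1.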
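To see which GID indices the device actually exposes, one option is to read the kernel's sysfs GID table. The following is a minimal sketch, not part of the project: the helper name `list_populated_gids` is hypothetical, and it assumes the usual sysfs layout `/sys/class/infiniband/<device>/ports/<port>/gids/<index>`, with port 1 as a guess for `erdma_0`.

```cpp
#include <algorithm>
#include <exception>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns (index, gid) pairs for every populated entry in a
// sysfs "gids" directory, e.g. /sys/class/infiniband/erdma_0/ports/1/gids
// (the port number 1 is an assumption).
std::vector<std::pair<int, std::string>> list_populated_gids(const fs::path &gids_dir) {
    const std::string kEmptyGid = "0000:0000:0000:0000:0000:0000:0000:0000";
    std::vector<std::pair<int, std::string>> result;
    std::error_code ec;
    // With the error_code overload, a missing directory yields an empty range.
    for (const auto &entry : fs::directory_iterator(gids_dir, ec)) {
        int index = 0;
        try {
            index = std::stoi(entry.path().filename().string());
        } catch (const std::exception &) {
            continue;  // not a numeric GID slot
        }
        std::ifstream in(entry.path());
        std::string gid;
        if (!(in >> gid)) continue;       // unreadable slot: treat as unused
        if (gid == kEmptyGid) continue;   // all-zero GID: slot is not populated
        result.emplace_back(index, gid);
    }
    std::sort(result.begin(), result.end());
    return result;
}
```

If index 1 is missing from the result for `erdma_0`, that would be consistent with the `GID 1 not available` error above, and the populated indices show which values of `MC_GID_INDEX` could work on this machine.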
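On the trainer side, when the Queue Pair (QP) is moved to the RTS state, the following error floods the log: `Failed to modify QP to RTS: Invalid argument [22]`. This message comes from the `ibv_modify_qp` call, so I added a call to `ibv_query_qp` before the modify. The change is as follows: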
```cpp
int RdmaEndPoint::modify_qp_to_rts(struct ibv_qp *qp, struct ibv_qp_attr *attr) {
    int current_state = 0;
    struct ibv_qp_init_attr init_attr;
    int ret = ibv_query_qp(qp, attr, sizeof(*attr), &init_attr);
    if (ret) {
        std::string message = "Failed to query QP state: " + std::string(strerror(errno));
        PLOG(ERROR) << message;
        return -1;
    }
    PLOG(INFO) << "Current QP state: " << current_state;
    PLOG(INFO) << "Setting QP to RTS with the following attributes:" << std::endl
               << "Timeout: " << attr->timeout << std::endl
               << "Retry Count: " << attr->retry_cnt << std::endl
               << "RNR Retry: " << attr->rnr_retry << std::endl
               << "SQ PSN: " << attr->sq_psn << std::endl
               << "Max QP RD Atomic: " << attr->max_rd_atomic;
    if (ibv_modify_qp(qp, attr,
                      IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                          IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC)) {
        std::string message = "Failed to modify QP to RTS: " + std::string(strerror(errno));
        PLOG(ERROR) << message;
        return -1;
    }
    PLOG(INFO) << "QP successfully modified to RTS.";
    return 0;
}
```
However, the problem persists. The error message is now `Resource temporarily unavailable [11]` (i.e. EAGAIN):

```
I1226 17:03:17.293787 1807026 rdma_endpoint.cpp:411] Current QP state: 0: Resource temporarily unavailable [11]
I1226 17:03:17.293788 1807026 rdma_endpoint.cpp:414] Setting QP to RTS with the following attributes:
Timeout:
Retry Count:
RNR Retry:
SQ PSN: 0
Max QP RD Atomic: : Resource temporarily unavailable [11]
I1226 17:03:17.293792 1807026 rdma_endpoint.cpp:428] QP successfully modified to RTS.: Resource temporarily unavailable [11]
```
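A note on reading this log, offered as a possible explanation rather than a confirmed diagnosis. glog's `PLOG` appends the current `errno` text to every message, including `INFO` ones, so the trailing `Resource temporarily unavailable [11]` may be a stale `errno` and not a failure of the call just made. The blank values after `Timeout:`, `Retry Count:`, `RNR Retry:` and `Max QP RD Atomic:` are consistent with those `ibv_qp_attr` fields being `uint8_t`, which a C++ stream writes as a character. Also, the third argument of `ibv_query_qp` is an attribute mask such as `IBV_QP_STATE`, not a size, and `current_state` is never assigned from the query result. The sketch below shows only the printing fix; `QpAttrFields` and `format_qp_attr` are hypothetical stand-ins so the example compiles without libibverbs.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for the ibv_qp_attr fields logged above. In
// <infiniband/verbs.h>, timeout, retry_cnt, rnr_retry and max_rd_atomic are
// uint8_t, while sq_psn is uint32_t.
struct QpAttrFields {
    std::uint8_t timeout;
    std::uint8_t retry_cnt;
    std::uint8_t rnr_retry;
    std::uint32_t sq_psn;
    std::uint8_t max_rd_atomic;
};

// Formats the fields as numbers. Streaming a uint8_t directly writes it as a
// single character, so small values come out as invisible control characters;
// casting to int prints the numeric value instead.
std::string format_qp_attr(const QpAttrFields &a) {
    std::ostringstream os;
    os << "Timeout: " << static_cast<int>(a.timeout) << '\n'
       << "Retry Count: " << static_cast<int>(a.retry_cnt) << '\n'
       << "RNR Retry: " << static_cast<int>(a.rnr_retry) << '\n'
       << "SQ PSN: " << a.sq_psn << '\n'
       << "Max QP RD Atomic: " << static_cast<int>(a.max_rd_atomic) << '\n';
    return os.str();
}
```

Applying the same `static_cast<int>` to the `attr->` fields in the logging statement above, and using `LOG(INFO)` instead of `PLOG(INFO)` for messages that are not errors, should make the output easier to interpret.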
`libtransfer_engine.so` was built with g++, linking the required dependency libraries, without using `cmake`.

To sum up, I have two questions: