While running a particular margo test
https://github.com/Parallel-NetCDF/PnetCDF/blob/master/test/largefile/high_dim_var.c
with 4 ranks on 2 nodes, a read from rank 2 triggers a failure on the server, which produces the following log output:
023-07-06T16:00:56 tid=872735 @ signal_new_requests() [unifyfs_request_manager.c:269] signaling new requests
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1802] RM[1511587981:1] got work
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1631] processing 1 client requests
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1324] processing mread[0] with 1 requests
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:252] handling read request (1 extents)
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:179] margo_bulk_transfer(buf_offset=0, len=1572864) failed
2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:197] failed bulk transfer - transferred 0 of 1572864 bytes
2023-07-06T16:00:56 tid=873012 @ unifyfs_invoke_find_extents_rpc() [unifyfs_p2p_rpc.c:665] failed to get bulk chunk locations
2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:279] failed to find extent locations
2023-07-06T16:00:56 tid=873012 @ process_read_rpc() [unifyfs_request_manager.c:1333] unifyfs_fops_read() failed
2023-07-06T16:00:56 tid=873012 @ rm_process_client_requests() [unifyfs_request_manager.c:1690] client rpc request 0 failed ("Mercury/Argobots operation error")
2023-07-06T16:00:56 tid=873012 @ request_manager_thread() [unifyfs_request_manager.c:1768] failed to process client rpc requests
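For anyone trying to isolate the failing calls in a longer server log, a small filter may help. This is a minimal Python sketch and not part of UnifyFS: the regular expression is inferred only from the log lines in this report, and `failed_entries` is a hypothetical helper name.

```python
import re

# One log entry: timestamp, thread id, function name, source file, line number, message.
# The layout is an assumption based on the log lines quoted in this issue.
LOG_ENTRY = re.compile(
    r"(?P<ts>\S+) tid=(?P<tid>\d+) @ (?P<func>\w+)\(\) "
    r"\[(?P<src>[^\]]+):(?P<line>\d+)\] (?P<msg>.*)"
)

def failed_entries(lines):
    """Return (function, source file, line number, message) for entries whose message mentions a failure."""
    hits = []
    for raw in lines:
        m = LOG_ENTRY.match(raw.strip())
        if m and "fail" in m.group("msg"):
            hits.append((m.group("func"), m.group("src"), int(m.group("line")), m.group("msg")))
    return hits

# Two entries copied from the log in this report.
sample = [
    "2023-07-06T16:00:56 tid=873012 @ submit_read_request() [unifyfs_fops_rpc.c:252] handling read request (1 extents)",
    "2023-07-06T16:00:56 tid=873012 @ pull_margo_bulk_buffer() [../../common/src/unifyfs_rpc_util.c:179] margo_bulk_transfer(buf_offset=0, len=1572864) failed",
]

for func, src, line, msg in failed_entries(sample):
    print(f"{func} ({src}:{line}): {msg}")
```

Run against the log in this report, the earliest entry such a filter reports is the `margo_bulk_transfer` failure in `pull_margo_bulk_buffer()`; the later failures are that error propagating up the call chain.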
The error code returned to the client for the read is 1004. That probably corresponds to one of these:
https://github.com/mercury-hpc/mercury/blob/55b95f72714bb0e4e0deeedf4fd78d116ea9476a/src/mercury_core_types.h#L102-L108
The read error happens during PMI_File_read_at_all which then leads to a deadlock in ROMIO: pmodels/mpich#6585