i/o error when a single replica is missing #42

Open
anderbubble opened this issue Sep 20, 2019 · 4 comments
Labels
bug Something isn't working

@anderbubble
Contributor

Recently I got an i/o error from NFSRODS when trying to access a file.

[janderson@fox1 ~]$ sha1sum /mnt/nfsrods/home/janderson/spore.bb
sha1sum: /mnt/nfsrods/home/janderson/spore.bb: Input/output error

This file has three replicas under a replResc.

[janderson@fox1 ~]$ ilsresc
rootResc:passthru
└── replResc:replication
    ├── fox1Resc:unixfilesystem
    ├── mybook:unixfilesystem
    └── rsync_net:unixfilesystem
www:passthru
└── ln1:unixfilesystem

[janderson@fox1 ~]$ ils -AL spore.bb
  janderson         0 rootResc;replResc;fox1Resc        11101 2018-12-03.12:32 & spore.bb
    sha2:ZVhrwYvtAvQDdhspTCxz1z8XO9u6YI90bxrkZOWYLHI=    generic    /srv/civilfritz/irods/Vault/home/janderson/spore.bb
        ACL - janderson#civilfritz.net:own   
  janderson         1 rootResc;replResc;mybook        11101 2018-12-12.15:08 & spore.bb
    sha2:ZVhrwYvtAvQDdhspTCxz1z8XO9u6YI90bxrkZOWYLHI=    generic    /media/mybook/Vault/home/janderson/spore.bb
        ACL - janderson#civilfritz.net:own   
  janderson         2 rootResc;replResc;rsync_net        11101 2019-03-25.21:13 & spore.bb
    sha2:ZVhrwYvtAvQDdhspTCxz1z8XO9u6YI90bxrkZOWYLHI=    generic    /media/rsync.net/Vault/home/janderson/spore.bb
        ACL - janderson#civilfritz.net:own 

One of these resources (rsync_net) was unmounted at the time. After mounting it, the read works.

[janderson@fox1 ~]$ sudo -u irods sshfs -o idmap=user [email protected]: /media/rsync.net
[janderson@fox1 ~]$ sha1sum /mnt/nfsrods/home/janderson/spore.bb
93b18b58ba1aa3cccdd0be0dfde67ecd73290e58  /mnt/nfsrods/home/janderson/spore.bb

I understand that this is ultimately a limitation in iRODS itself, but if other replicas are available, they should all be consulted before an i/o error is returned.
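
The behavior requested here can be sketched roughly as follows. This is only an illustration in Python, not how iRODS or NFSRODS is implemented; `read_first_available` and the idea of handing it a list of replica paths are hypothetical.

```python
import errno
import os
import tempfile


def read_first_available(replica_paths):
    """Return the bytes of the first replica that can be opened and read.

    Hypothetical helper: tries every path in order and raises EIO only
    when none of them is readable.
    """
    failures = []
    for path in replica_paths:
        try:
            with open(path, "rb") as handle:
                return handle.read()
        except OSError as exc:
            # Remember why this replica failed and move on to the next one.
            failures.append(f"{path}: {exc.strerror}")
    raise OSError(errno.EIO, "no readable replica (" + "; ".join(failures) + ")")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as vault:
        good = os.path.join(vault, "spore.bb")
        with open(good, "wb") as handle:
            handle.write(b"example data")
        # Stand-in for a replica whose backing storage is not mounted.
        missing = os.path.join(vault, "unmounted", "spore.bb")
        # The missing replica is listed first, yet the read still succeeds.
        print(read_first_available([missing, good]))
```

With this shape, an i/o error reaches the caller only after every replica has failed, and the message records why each one failed.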

@anderbubble
Contributor Author

rodsLog.txt

@anderbubble
Contributor Author

nfsrods-log.txt

@anderbubble
Contributor Author

Looking at nfsrods-log.txt, it also looks like this is leading to unhandled exceptions, so those should probably be caught and handled in any case. Maybe the result is still an i/o error, but it should be returned intentionally rather than as a side effect of an uncaught exception.

@trel
Member

trel commented Sep 21, 2019

Yes, both logs show a -510002 UNIX_FILE_OPEN_ERR because the rsync.net replica (which happened to be unmounted) apparently won the voting and was offered up as the replica to be retrieved and sent to the client.
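
For anyone reading the logs: iRODS error codes of this kind appear to combine a base code with the operating system errno in the low three digits, so -510002 reads as UNIX_FILE_OPEN_ERR (-510000) plus errno 2 (ENOENT). Here is a minimal Python sketch of that arithmetic, assuming the convention holds; `split_irods_error` is a hypothetical helper, not part of any iRODS client library.

```python
import errno
import os


def split_irods_error(code):
    """Split a negative iRODS-style code into (base_code, os_errno).

    Assumes the convention that the low three digits carry the errno.
    """
    magnitude = abs(code)
    os_errno = magnitude % 1000
    base_code = -(magnitude - os_errno)
    return base_code, os_errno


base, sub = split_irods_error(-510002)
# Print the base code, the embedded errno, its symbolic name, and the
# operating system's description of it.
print(base, sub, errno.errorcode.get(sub), os.strerror(sub))
```

The embedded errno is consistent with the replica's path not existing while the resource was unmounted.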

We will handle that exception more clearly in NFSRODS.

Separately, we are working on adding a retry mechanism to the API calls themselves, so that if this occurs, the whole system can try again. It is not clear that it would actually help here, as this isn't an 'intermittent' failure: the disk is just not there. See irods/irods#3480.

A monitoring system that marked the rsync_net resource as 'down' would also make it vote 0. This can be done manually with iadmin modresc rsync_net status down.

In the meantime, you can add another passthru resource under your replication resource, as the parent of rsync_net, and lower its 'read' weight so the rsync.net replica doesn't win the round of voting when the other replicas are available.

https://docs.irods.org/4.2.6/plugins/composable_resources/#passthru
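
To illustrate why the two workarounds above change the outcome, here is a toy model of the voting in Python. It is a sketch under stated assumptions, not the iRODS implementation: the dictionary keys, the base votes of 1.0, and the first-wins tie-breaking are all made up for illustration. It assumes only what is described above, namely that a resource marked down votes 0 and that a passthru scales its child's read vote by the 'read' weight.

```python
def pick_replica(children):
    """Return the name of the child with the highest read vote, or None.

    Each child is a dict with hypothetical keys:
      name        - resource name
      base_vote   - vote the storage resource would cast on its own
      read_weight - multiplier applied by a passthru parent (1.0 if none)
      status      - "up" or "down"; a down resource votes 0
    """
    best_name, best_vote = None, 0.0
    for child in children:
        if child["status"] == "down":
            vote = 0.0
        else:
            vote = child["base_vote"] * child["read_weight"]
        # Strict comparison: on a tie, the earlier child keeps the win.
        if vote > best_vote:
            best_name, best_vote = child["name"], vote
    return best_name


children = [
    {"name": "fox1Resc", "base_vote": 1.0, "read_weight": 1.0, "status": "up"},
    {"name": "mybook", "base_vote": 1.0, "read_weight": 1.0, "status": "up"},
    # rsync_net sits behind a passthru with a lowered read weight.
    {"name": "rsync_net", "base_vote": 1.0, "read_weight": 0.5, "status": "up"},
]
print(pick_replica(children))
```

In this model, either lowering `read_weight` for rsync_net or setting its `status` to "down" keeps it from being selected while another replica is up. Neither helps if rsync_net is unmounted and also the only resource still voting.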

@trel trel added the bug Something isn't working label Jan 9, 2020
@korydraughn korydraughn added this to the 2.3.0 milestone Feb 1, 2024
@korydraughn korydraughn modified the milestones: 2.3.0, 2.4.0 May 20, 2024