Session corruption self-healing mechanism needed #207

Closed
tlaurion opened this issue Jun 4, 2024 · 7 comments

@tlaurion
Contributor

tlaurion commented Jun 4, 2024

This is the second time I have hit a problem that I guess should not happen under normal circumstances. I was able to work around it when archives were unencrypted, but I can't anymore.

When sending a new session in my PoC, the archive server backend is short on resources, and sometimes the mdadm backend and its md0 array (RAID 5) go offline, along with the mounted partition where the archives are supposed to be sent. When that happens, oddly, the info files end up smaller than they should be: they are actually empty files.

Of course, manipulating the content on the encrypted archive side is more or less impossible. In these circumstances, removing dom0's /var/lib/wyng/* does not help, and neither does deleting the new session directory, whose location is reported in debug mode (the debug output shows the mapping between the volume and its encrypted volume dir, as well as the new session dir).

I wish arch-check could work and help under these circumstances, reporting the problem and letting the end user delete the corrupted session, or that delete would work for the whole volume, but wyng fails too early for either.

Up to now, I have had to redo a whole arch-init and wipe the old archives. I'm not sure this behavior is desired, since a backup session can well be interrupted midway by a network failure or a power outage. In my case, it seems to be a corner case, not yet isolated, where something uses more RAM than is available and the OOM killer kills processes at random, which has nothing to do with wyng. But the fact that wyng cannot recover from this situation and requires the whole archive to be sent again seems to be a problem; it should have some kind of recovery or self-healing capability.

Ideally, wyng would not fail here but would mark the last session as bad, and maybe delete it, so that the end user can just send another backup session to the archive as if nothing had happened.

Sorry, I do not have traces this time. I got impatient after moving my system from LVM to Btrfs twice now, but if it happens again, I will report back if more details are needed in this issue.

@tasket
Owner

tasket commented Jun 4, 2024

@tlaurion Issue #201 was posted to address the way Wyng loads metadata so it can be more fault-tolerant (although I will say... that remote fs is certainly not helping with whatever it considers to be fault tolerance). Wyng is careful to record data first and metadata last (as .tmp files). This is important because it doesn't even try to finalize a change (by renaming the .tmp files) until everything has transferred, so the archive would be blind to any changes until the last instant, when 4 mv ops are executed. BTW, 0-byte files sound like the problems I used to have 20 years ago running my systems on XFS (which back then had default tuning for data center use, which I believe used write-caching).
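
The ordering described above (data first, metadata staged as .tmp files, renames last) can be sketched roughly as follows. This is only an illustration of the general pattern, not Wyng's actual code: `commit_session`, `is_committed`, the file names and the directory layout are all hypothetical.

```python
import os


def _write_synced(path, payload):
    """Write bytes to path and flush them to stable storage."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def commit_session(session_dir, data_chunks, metadata):
    """Hypothetical commit: data first, metadata as .tmp, renames last."""
    os.makedirs(session_dir, exist_ok=True)

    # 1. Data chunks are written first; without metadata nothing refers to them.
    for name, payload in data_chunks.items():
        _write_synced(os.path.join(session_dir, name), payload)

    # 2. Metadata is staged under .tmp names, still invisible to readers.
    staged = []
    for name, payload in metadata.items():
        final = os.path.join(session_dir, name)
        _write_synced(final + ".tmp", payload)
        staged.append((final + ".tmp", final))

    # 3. Only now is the change published, one rename per metadata file.
    for tmp, final in staged:
        os.replace(tmp, final)
    return [final for _, final in staged]


def is_committed(session_dir, metadata_names):
    """True only if every metadata file exists under its final name and is non-empty."""
    for name in metadata_names:
        path = os.path.join(session_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            return False
    return True
```

In this sketch, an interruption before step 3 leaves only data and .tmp files behind, so a reader that looks for the final names never sees a half-written session. A zero-byte file under a final name, like the empty info files reported in this issue, would fail the `is_committed` check.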

If recovering that specific archive interests you, I could add some features to the debug branch in the next day or two that should remove the offending session for you. It would help a lot though to know the error you're getting after you clear /var/lib/wyng and try to access the archive again.

@tasket
Owner

tasket commented Jun 5, 2024

Incidentally, Wyng has a --maxsync option which calls sync more often, including on the remote fs. This might help avoid the problem (but so could turning off any write-caching or delayed-allocation features that are intended for expensive hardware).
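
For context, the general idea behind syncing more often can be sketched in a few lines. This is not how --maxsync is implemented (the thread does not show that); `write_durably` is a hypothetical helper that only illustrates the underlying technique of forcing a file, and the directory entry that names it, to stable storage before continuing.

```python
import os


def write_durably(path, payload):
    """Write payload to path, then fsync the file and its parent directory."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push the file to the device

    # Syncing the parent directory makes the new directory entry durable too.
    parent = os.path.dirname(os.path.abspath(path))
    try:
        dir_fd = os.open(parent, os.O_RDONLY)
    except OSError:
        return  # some platforms cannot open a directory this way
    try:
        os.fsync(dir_fd)
    except OSError:
        pass  # some filesystems reject fsync on a directory
    finally:
        os.close(dir_fd)
```

Whether this helps depends on the storage stack honoring the flush; a write cache that acknowledges data before it reaches the disk can still lose it.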

@tlaurion
Contributor Author

tlaurion commented Jun 5, 2024

Will remove fs optimizations.

@tasket
Owner

tasket commented Jun 5, 2024

@tlaurion I did just write a mod that works in debug mode. It should fix an archive if (and only if) the last session 'info' or manifest file is corrupt (won't work if the 'volinfo' or 'archive.ini' are affected). Let me know if you want to try it and I'll push it to debug.

@tlaurion
Contributor Author

tlaurion commented Jun 5, 2024

> @tlaurion I did just write a mod that works in debug mode. It should fix an archive if (and only if) the last session 'info' or manifest file is corrupt (won't work if the 'volinfo' or 'archive.ini' are affected). Let me know if you want to try it and I'll push it to debug.

I no longer have the broken session. I'm backing everything up again, which still takes a long time since it's 1.7 TB.

I think it was linked to a misconfigured write cache in /etc/fstab, not your bug!

tasket added a commit that referenced this issue Jun 7, 2024
@tasket
Owner

tasket commented Jun 7, 2024

@tlaurion The 08wip branch now has the ability to remove corrupt sessions, simply using the arch-check command in attended mode. The integrity testing part of it has been tested, but I haven't yet tested removal of a corrupt session.

If a corrupt session is found, it invalidates any other session that comes after it. So the repair process is quite conventional in that it 'rewinds' a corrupted volume to the last known good state.
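
The 'rewind' behavior described here can be illustrated with a small sketch. It is a simplified model resting on assumptions, not the code on the 08wip branch: `rewind_to_last_good` and the `is_intact` callback are hypothetical names, and sessions are represented as plain strings in chronological order.

```python
def rewind_to_last_good(sessions, is_intact):
    """Split sessions at the first corrupt one.

    sessions:  session names, oldest first.
    is_intact: callable returning True if a session passes its integrity check.
    Returns (kept, dropped), where dropped holds the first corrupt session
    and every session that comes after it.
    """
    for index, session in enumerate(sessions):
        if not is_intact(session):
            return list(sessions[:index]), list(sessions[index:])
    return list(sessions), []


# Example: the third of four sessions is corrupt.
corrupt = {"S_0003"}
kept, dropped = rewind_to_last_good(
    ["S_0001", "S_0002", "S_0003", "S_0004"],
    lambda name: name not in corrupt,
)
print(kept)     # ['S_0001', 'S_0002']
print(dropped)  # ['S_0003', 'S_0004']
```

Note that S_0004 is dropped even though it passes its own check, which matches the statement that a corrupt session invalidates every session after it.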

tasket added a commit that referenced this issue Jun 8, 2024
tasket added a commit that referenced this issue Jun 8, 2024
@tasket
Owner

tasket commented Jun 8, 2024

Implemented and tested.

tasket closed this as completed Jun 8, 2024
tasket added a commit that referenced this issue Jun 8, 2024
tasket added a commit that referenced this issue Jun 9, 2024
Handle corner case if no sessions / no good sessions

Update storage vol list if empty

Retain sync proc reference