Add decent error recovery and Ctrl-C handling #14

jwodder · 2024-11-18T20:50:44Z

Prior discussion from dandi/dandi-utils#3:

jwodder:

Do you have any ideas or preferences for how the backup program should handle download failures? What about crashes or a Ctrl-C in the middle of a backup?

yarikoptic:

Ideally it should be robust to download failures, and be able to "track back", e.g. if it was under git like with backups2datalad -- git reset --hard; git clean -dfx . But here it is not feasible, hence we might want to explicitly "model" the "rollback" regime to get back to prior state -- remove fresh (complete or incomplete) downloads, undo mv's already done -- might be worth keeping a journal of operations or just being able to take prior state and "match" it.

Upon Ctrl-C, if using journaling mechanism, we might want to ensure having it stored. Otherwise -- not sure if worth implementing something as fancy as planned for Add "staggered"/gradual interruption for upload (and possibly download) dandi-cli#1523

jwodder:

Please elaborate on exactly what behavior you want.

yarikoptic:

rollback or match the prior state: add a function which would ensure that current tree is matching specific inventory one:

for a folder, go through the union of paths found in inventory, .dandi-s3-backup-versions.json, and on the drive (excluding .old. ones)

if file is on drive and/or .dandi-s3-backup-versions.json but not in inventory - remove file at {path} and from .dandi-s3-backup-versions.json

if file record in inventory not matching the one in .dandi-s3-backup-versions.json - remove from drive and the .dandi-s3-backup-versions.json

if file is not present on drive but in inventory - if there is a corresponding {path}.old.{versionid}.{etag} -- rename it to {path}, adjust .dandi-s3-backup-versions.json accordingly

if there is no .old. file -- fail, shouldn't happen

But while thinking about it, I realized that overall approach does not cover the case of key switching between being a file and directory.

When file becomes a directory -- all is easy, prior version gets renamed into {path}.old.{versionid}.{etag}, and for new {path}/ folder created.

When directory becomes a file -- just rename directory to {path}.old.dandi-s3-backup if such does not exist yet. If exists already -- nothing to be done.

need to add check for path to not end with .old.dandi-s3-backup into conflict detection above

to reconstruct some prior key for versionId we would need to inspect all parents to potentially carrying the .old.dandi-s3-backup suffix now

jwodder:

for a folder, go through the union of paths found in inventory,

Because each set of inventories lists every single item in the bucket, this won't scale well. Just a single CSV file from the manifest you showed in the original comment contains three million entries.

yarikoptic:

yes, there is a scalability concern as we are expecting hundreds of millions entries (e.g https://github.com/dandisets/000108 alone accounts for 300 million files across its zarrs). If those lists are sorted though -- might be quite easy since then all files in a folder would be a sequential batch and we would process that sequential batch from inventory + files on drive and in .dandi-s3-backup-versions.json only for that folder, which would be either tiny or some thousands -- not more at once.

jwodder:

But wouldn't the rollback have to be run against the entire backup tree, which would be huge? Doing that in response to an error or Ctrl-C seems absurdly time-consuming.

yarikoptic:

Indeed. FWIW, aiming for incremental changes, I think we can minimize the amount of time when interruption would lead to requiring such a full blown roll back. E.g. if we

do full analysis of what keys need to be downloaded, renamed, and removed without any changes to the data on drive. Interruption would result in non need for roll back or any other actions to cleanup

do all necessary downloads into temporary space, e.g. .dandi-s3-backup-downloads/{versionid}.{etag} at the top folder. If interrupted in this stage -- would just need to rm -rf .dandi-s3-backup-downloads/.

final stage: "expensive" to recover from if interrupted, hence interruption should be guarded (e.g. at least 3 CTrl-C's within 5 seconds, otherwise do not react): perform all planned rm, and mvs, and remove empty .dandi-s3-backup-downloads/ at the end (all files should be gone)

actually if we establish journal of those rm and mvs we could probably play them back as well leading to "lighter" way to recover, but I am not sure I would 100% trust it, thus running full fsck would still be desired for paranoid me.

WDYT? may be some better ways?

jwodder:

An alternative: Don't rollback on errors/Ctrl-C, just exit (after cleaning up any partially-downloaded files). Instead, make the command capable of both (a) running against a backup directory in which a previous backup run was interrupted (so that a failed command can be immediately rerun, hopefully this time to completion) and (b) reverting a backup directory to an earlier backup by specifying an earlier date (so that a failed command can be forcefully rolled back by running the command with the date that the backup was previously at).

yarikoptic:

Re Alternative: it reads like what i have suggested as "rollback or match the prior state" feature above and you had a legitimate concern

But wouldn't the rollback have to be run against the entire backup tree, which would be huge?

or did you think back then I was suggesting it as something to do right upon error/Ctrl-C?

Also -- how would you know if current state of the backup is "legit" and not some partial one? I think we better have an explicit annotation for that.

In either case, I think that separating out analysis, from "fetching" and then actual "doing" as I suggested above might benefit greatly in minimizing time for when we could leave the backup in some incomplete/broken state. Don't you think so?

The text was updated successfully, but these errors were encountered:

jwodder · 2024-11-25T16:08:01Z

@yarikoptic Is your suggested "analysis" step intended to be run over the entire file tree before doing anything? That would involve storing every single key in memory at once, which would likely be problematic.

yarikoptic · 2024-11-25T18:14:59Z

Do you think it would be too tricky to allow for two modes -- "staged" (the "analysis" based above) and "interleaved"? I guess if was in Python could be at the level of what_todo "generator" vs "list" decision making.

My thinking: we would need a full heavy list only in the initial backup. That is where we can proceed in "interleaved" mode, just getting a new key from the inventory, checking if present locally and if not -- yield the "action" to do to the next step. In "staged" -- it would establish the full list of actions to tackle first, and since changes should not be multitude if ran frequently -- should fit into memory easily IMHO.

jwodder · 2024-11-25T19:22:57Z

@yarikoptic That seems doable, though I really don't like the names "interleaved" and "staged". What about "immediate" and "pre-planned"?

I also have to point out that, based on my observations of previous runs of the program, the "staged"/"pre-planned" mode, when run on the dandiarchive inventories, will have to spend about 10 and a half hours just going over all the items before it gets around to actually doing anything.

In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not? (Just what is the fsck you're referring to? I don't see how the fsck(8) command is relevant here.)

How do you expect removal of files from the local backup that don't correspond to any items in the inventories (#33) to work? Determining whether a file should be deleted basically requires storing all the CSV entries in memory at once.

yarikoptic · 2024-11-27T21:46:28Z

I have no strong opinion on names. "immediate" and "pre-planned" sound good.

will have to spend about 10 and a half hours just going over all the items before it gets around to actually doing anything.

I wonder why and either could be avoided e.g. via parallelization .

In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not? (Just what is the fsck you're referring to?

sorry for likely using inappropriate terminology -- I often use fsck word for consistency-check. In the scope of this it would entail seeing local backup consistent with what is observed in inventory and the .s3invsync.versions.json (formerly .dandi-s3-backup-versions.json) files.

How do you expect removal of files from the local backup that don't correspond to any items in the inventories (#33) to work? Determining whether a file should be deleted basically requires storing all the CSV entries in memory at once.

Nothing 100% solid idea yet, but followed up on #33 (comment)

jwodder · 2024-12-09T13:30:52Z

@yarikoptic

I wonder why and either could be avoided e.g. via parallelization .

Because there are over 400 CSVs to process, each with 300,000 lines, and they're already being processed concurrently (see the --inventory-jobs option).

You didn't answer this question:

In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not?

yarikoptic · 2024-12-09T16:44:45Z

I wonder why and either could be avoided e.g. via parallelization .

Because there are over 400 CSVs to process, each with 300,000 lines, and they're already being processed concurrently (see the --inventory-jobs option).

sure -- lots of work... do we know what it spends most of the time on? I am thinking about the path-filter and "printing"/logging . As for path-filter -- if regex, likely is the cpu intensive part and if was replaced with --path-prefix-filter to use simpler analog of python's ".startwith" but might shorten time considerably?

In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not?

I guess so since we do need to be able to get into a "consistent state".

jwodder · 2025-01-14T14:36:58Z

@yarikoptic I'm going to repeat my suggestion from an earlier comment:

An alternative: Don't rollback on errors/Ctrl-C, just exit (after cleaning up any partially-downloaded files). Instead, make the command capable of both (a) running against a backup directory in which a previous backup run was interrupted (so that a failed command can be immediately rerun, hopefully this time to completion) and (b) reverting a backup directory to an earlier backup by specifying an earlier date (so that a failed command can be forcefully rolled back by running the command with the date that the backup was previously at).

We know the program can do (b), and (a) is possible as long as the program was allowed to clean up and wasn't killed by SIGTERM or SIGKILL.¹ Thus, I suggest that we don't implement any clever rollbacks on errors and instead just leave the error handling as-is.

If the program was killed by a signal, I believe the only consequence that could lead to errors on subsequent runs would be if the keys in an .s3invsync.versions.json file weren't updated to match the non-old filenames in the containing directory. It'd be possible to handle this scenario as well, likely by just redownloading whatever files weren't added to the JSON database. ↩

yarikoptic · 2025-01-14T16:26:59Z

I wonder if we could/should at least get some .s3invsync.state.json in the top directory which would reflect the state of the backup: latest invocation options, and whether was completed... or may be even a list of N last invocations? Then subsequent run on semi broken one could error out unless invoked for the same date state in attempt to mitigate/use (a)?

jwodder · 2025-01-14T17:23:22Z

@yarikoptic What exactly would the goal of that be? Why must an incomplete run only be followed by a run with the same options?

yarikoptic · 2025-01-14T18:28:51Z

I felt like for what you formulated as

so that a failed command can be immediately rerun, hopefully this time to completion

but I guess indeed no need to be the same -- it might then just "update" it to a new state. But overall, I would just like to have an opportunity to know that prior run was not completed ok.

jwodder · 2025-01-14T19:21:46Z

@yarikoptic How about something like this: We add an .s3invsync.state.json file to the root of the backup containing the following fields:

last_backup_started — start time of the last backup
last_successful_backup_finished — end time of the last successful backup

Then, if last_successful_backup_finished is less than last_backup_started, it means the last backup was unsuccessful (or is currently in progress).

jwodder · 2025-01-28T12:50:49Z

@yarikoptic Ping.

jwodder · 2025-02-03T12:09:00Z

@yarikoptic Ping.

yarikoptic · 2025-02-03T14:42:28Z

sounds good to me! It would also provide means for a solution along the lines I proposed for

Add safeguard for deletion of files #145 (comment)

since then the tool (by default) must not operate when target path is not empty and lacks .s3invsync.state.json

Store start & successful end times of program runs in `.s3invsync.state.json`

jwodder added needs design decisions UX Relating to the user experience labels Nov 18, 2024

jwodder added this to the v0.1.0 milestone Nov 18, 2024

This was referenced Nov 18, 2024

Inventory-based backup tool dandi/dandi-utils#3

Open

Ensure proper cleanup of temporary directory #25

Closed

jwodder self-assigned this Nov 26, 2024

jwodder mentioned this issue Feb 3, 2025

Store start & successful end times of program runs in .s3invsync.state.json #148

Merged

jwodder added a commit that referenced this issue Feb 3, 2025

Merge pull request #148 from dandi/gh-14

cb81e9a

Store start & successful end times of program runs in `.s3invsync.state.json`

jwodder closed this as completed in #148 Feb 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add decent error recovery and Ctrl-C handling #14

Add decent error recovery and Ctrl-C handling #14

jwodder commented Nov 18, 2024

jwodder commented Nov 25, 2024

yarikoptic commented Nov 25, 2024

jwodder commented Nov 25, 2024

yarikoptic commented Nov 27, 2024

jwodder commented Dec 9, 2024

yarikoptic commented Dec 9, 2024

jwodder commented Jan 14, 2025

yarikoptic commented Jan 14, 2025

jwodder commented Jan 14, 2025

yarikoptic commented Jan 14, 2025

jwodder commented Jan 14, 2025

jwodder commented Jan 28, 2025

jwodder commented Feb 3, 2025

yarikoptic commented Feb 3, 2025

Add decent error recovery and Ctrl-C handling #14

Add decent error recovery and Ctrl-C handling #14

Comments

jwodder commented Nov 18, 2024

jwodder commented Nov 25, 2024

yarikoptic commented Nov 25, 2024

jwodder commented Nov 25, 2024

yarikoptic commented Nov 27, 2024

jwodder commented Dec 9, 2024

yarikoptic commented Dec 9, 2024

jwodder commented Jan 14, 2025

Footnotes

yarikoptic commented Jan 14, 2025

jwodder commented Jan 14, 2025

yarikoptic commented Jan 14, 2025

jwodder commented Jan 14, 2025

jwodder commented Jan 28, 2025

jwodder commented Feb 3, 2025

yarikoptic commented Feb 3, 2025