Do you have any ideas or preferences for how the backup program should handle download failures? What about crashes or a Ctrl-C in the middle of a backup?
yarikoptic:
Ideally it should be robust to download failures and be able to "track back": e.g. if it were under git, as with backups2datalad, that would be `git reset --hard; git clean -dfx`. But here that is not feasible, so we might want to explicitly model a "rollback" regime to get back to the prior state -- remove fresh (complete or incomplete) downloads and undo `mv`s already done. It might be worth keeping a journal of operations, or just being able to take the prior state and "match" it.
Please elaborate on exactly what behavior you want.
yarikoptic:
rollback or match the prior state: add a function which would ensure that the current tree matches a specific inventory:

- for a folder, go through the union of paths found in the inventory, in `.dandi-s3-backup-versions.json`, and on the drive (excluding `.old.` ones):
  - if a file is on the drive and/or in `.dandi-s3-backup-versions.json` but not in the inventory -- remove the file at `{path}` and its record from `.dandi-s3-backup-versions.json`
  - if a file's record in the inventory does not match the one in `.dandi-s3-backup-versions.json` -- remove it from the drive and from `.dandi-s3-backup-versions.json`
  - if a file is not present on the drive but is in the inventory -- if there is a corresponding `{path}.old.{versionid}.{etag}`, rename it to `{path}` and adjust `.dandi-s3-backup-versions.json` accordingly
  - if there is no `.old.` file -- fail; that shouldn't happen
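The per-folder matching procedure above could be sketched roughly as follows. This is only a sketch: `match_prior_state`, the `{name: {"versionid": ..., "etag": ...}}` inventory shape, and the failure behavior are all assumptions for illustration, not the actual implementation.

```python
import json
from pathlib import Path

VERSIONS_FILE = ".dandi-s3-backup-versions.json"

def match_prior_state(folder: Path, inventory: dict) -> None:
    """Make *folder* match *inventory*, a hypothetical mapping of
    {name: {"versionid": ..., "etag": ...}} for this folder's target state."""
    versions_path = folder / VERSIONS_FILE
    versions = json.loads(versions_path.read_text()) if versions_path.exists() else {}
    on_drive = {p.name for p in folder.iterdir()
                if p.name != VERSIONS_FILE and ".old." not in p.name}
    for name in sorted(on_drive | set(versions) | set(inventory)):
        if name not in inventory:
            # on drive and/or in the versions file but not in the inventory:
            # remove the file and its record
            (folder / name).unlink(missing_ok=True)
            versions.pop(name, None)
        elif versions.get(name) != inventory[name]:
            # record mismatch (or file missing): restore from the .old. copy
            (folder / name).unlink(missing_ok=True)
            rec = inventory[name]
            old = folder / f"{name}.old.{rec['versionid']}.{rec['etag']}"
            if not old.exists():
                raise RuntimeError(f"no .old. copy for {name}; shouldn't happen")
            old.rename(folder / name)
            versions[name] = rec
    versions_path.write_text(json.dumps(versions))
```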
But while thinking about it, I realized that the overall approach does not cover the case of a key switching between being a file and a directory.

- When a file becomes a directory -- all is easy: the prior version gets renamed to `{path}.old.{versionid}.{etag}`, and a new `{path}/` folder is created.
- When a directory becomes a file -- just rename the directory to `{path}.old.dandi-s3-backup` if such a name does not exist yet. If it exists already -- nothing needs to be done.
- We need to add a check that a path does not end with `.old.dandi-s3-backup` to the conflict detection above.
- To reconstruct some prior key for a versionId, we would now need to inspect all parents for potentially carrying the `.old.dandi-s3-backup` suffix.
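The last point could look something like the sketch below; `resolve_prior_path` is a hypothetical helper name, assuming the `.old.dandi-s3-backup` rename convention described above.

```python
from pathlib import Path

OLD_DIR_SUFFIX = ".old.dandi-s3-backup"

def resolve_prior_path(root: Path, key: str) -> Path:
    """Map an S3 key like "a/b/f" to its on-disk location, allowing for any
    parent directory having been renamed away with the .old.dandi-s3-backup
    suffix after that directory's key turned into a file."""
    *parents, leaf = key.split("/")
    current = root
    for part in parents:
        plain = current / part
        # prefer the live directory; fall back to the renamed-away one
        current = plain if plain.is_dir() else current / (part + OLD_DIR_SUFFIX)
    return current / leaf
```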
jwodder:
> for a folder, go through the union of paths found in inventory,
Because each set of inventories lists every single item in the bucket, this won't scale well. Just a single CSV file from the manifest you showed in the original comment contains three million entries.
yarikoptic:
Yes, there is a scalability concern, as we are expecting hundreds of millions of entries (e.g. https://github.com/dandisets/000108 alone accounts for 300 million files across its zarrs). If those lists are sorted, though, it might be quite easy, since then all files in a folder would form a sequential batch, and we would process that batch from the inventory, plus the files on the drive and in `.dandi-s3-backup-versions.json`, only for that folder -- which would be either tiny or at most some thousands of files at once.
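Grouping a sorted key stream into per-folder runs could be sketched with `itertools.groupby` (hypothetical helper below). One caveat worth noting: under plain lexicographic ordering a subfolder's keys can sort between a folder's own files (e.g. `d/a` < `d/sub/x` < `d/z`), so the same folder may show up as more than one run; the reconciler would have to tolerate that, but memory stays bounded by the largest single run.

```python
from itertools import groupby
import posixpath

def folder_batches(sorted_keys):
    """Yield (folder, keys) runs from a lexicographically sorted key stream,
    so each folder can be reconciled against the drive and its
    .dandi-s3-backup-versions.json without loading the whole inventory."""
    for folder, run in groupby(sorted_keys, key=posixpath.dirname):
        yield folder, list(run)
```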
jwodder:
But wouldn't the rollback have to be run against the entire backup tree, which would be huge? Doing that in response to an error or Ctrl-C seems absurdly time-consuming.
yarikoptic:
Indeed. FWIW, aiming for incremental changes, I think we can minimize the window during which an interruption would require such a full-blown rollback. E.g. if we:

1. Do a full analysis of which keys need to be downloaded, renamed, and removed, without any changes to the data on the drive. Interruption at this stage would require no rollback or any other cleanup.
2. Do all necessary downloads into a temporary space, e.g. `.dandi-s3-backup-downloads/{versionid}.{etag}` at the top folder. If interrupted at this stage, we would just need to `rm -rf .dandi-s3-backup-downloads/`.
3. Final stage: "expensive" to recover from if interrupted, hence interruption should be guarded (e.g. require at least 3 Ctrl-C's within 5 seconds, otherwise do not react): perform all planned `rm`s and `mv`s, and remove the then-empty `.dandi-s3-backup-downloads/` at the end (all files should be gone from it).
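The guarded-interruption idea in the final stage could be sketched like this; `install_guarded_sigint` and its defaults are made up for illustration.

```python
import signal
import time

def install_guarded_sigint(presses: int = 3, window: float = 5.0):
    """Install a SIGINT handler that only aborts after *presses* Ctrl-C's
    arrive within *window* seconds; earlier presses are ignored with a
    warning. Returns the handler so it can also be exercised directly."""
    times = []

    def handler(signum, frame):
        now = time.monotonic()
        times.append(now)
        times[:] = [t for t in times if now - t <= window]
        if len(times) >= presses:
            raise KeyboardInterrupt
        print(f"ignoring Ctrl-C; {presses - len(times)} more within {window}s to abort")

    signal.signal(signal.SIGINT, handler)
    return handler
```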
Actually, if we keep a journal of those `rm`s and `mv`s, we could probably replay them as well, leading to a "lighter" way to recover, but I am not sure I would trust it 100%, so running a full fsck-like verification would still be desired for paranoid me.
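A minimal write-ahead journal for that final stage might look like the sketch below; the `("rm", path)` / `("mv", src, dst)` action format and the JSON-lines journal are assumptions, not an agreed design.

```python
import json
import shutil
from pathlib import Path

def apply_planned(actions, journal_path: Path) -> None:
    """Apply planned ("rm", path) / ("mv", src, dst) actions, appending each
    to a journal *before* performing it, so an interrupted run leaves a
    record of what may already have been done."""
    with journal_path.open("a") as journal:
        for action in actions:
            journal.write(json.dumps(action) + "\n")
            journal.flush()
            if action[0] == "rm":
                Path(action[1]).unlink()
            elif action[0] == "mv":
                shutil.move(action[1], action[2])
```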
WDYT? Maybe there are some better ways?
jwodder:
An alternative: don't roll back on errors/Ctrl-C; just exit (after cleaning up any partially-downloaded files). Instead, make the command capable of both (a) running against a backup directory in which a previous backup run was interrupted (so that a failed command can be immediately rerun, hopefully this time to completion) and (b) reverting a backup directory to an earlier backup by specifying an earlier date (so that a failed command can be forcefully rolled back by rerunning the command with the date that the backup was previously at).
yarikoptic:
Re the alternative: it reads like what I suggested as the "rollback or match the prior state" feature above, about which you had a legitimate concern:
> But wouldn't the rollback have to be run against the entire backup tree, which would be huge?
Or did you think back then that I was suggesting it as something to do right upon error/Ctrl-C?
Also -- how would we know whether the current state of the backup is "legit" and not some partial one? I think we'd better have an explicit annotation for that.
In either case, I think that separating the analysis from the "fetching" and then the actual "doing", as I suggested above, might help greatly in minimizing the time during which we could leave the backup in some incomplete/broken state. Don't you think so?
@yarikoptic Is your suggested "analysis" step intended to be run over the entire file tree before doing anything? That would involve storing every single key in memory at once, which would likely be problematic.
Do you think it would be too tricky to allow for two modes -- "staged" (the "analysis"-based one above) and "interleaved"? I guess if it were in Python, it could be at the level of a `what_todo` "generator" vs "list" decision.
My thinking: we would need a full, heavy list only in the initial backup. That is where we can proceed in "interleaved" mode: just get a new key from the inventory, check whether it is present locally, and if not, yield the "action" to do to the next step. In "staged" mode, it would establish the full list of actions to tackle first, and since changes should not be numerous if run frequently, they should fit into memory easily IMHO.
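The generator-vs-list distinction might be as simple as the sketch below; `what_todo`, the `("download", key)` action shape, and the mode names are placeholders, and a real planner would also yield renames and removals.

```python
from typing import Iterable, Iterator, Tuple

def what_todo(inventory_keys: Iterable[str], local_keys: set) -> Iterator[Tuple[str, str]]:
    """Lazily yield planned actions for inventory keys missing locally."""
    for key in inventory_keys:
        if key not in local_keys:
            yield ("download", key)

def run(actions, mode: str = "interleaved"):
    if mode == "staged":
        actions = list(actions)   # materialize the full plan up front
    performed = []
    for action in actions:        # "interleaved": consumed lazily as planned
        performed.append(action)  # stand-in for actually executing the action
    return performed
```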
@yarikoptic That seems doable, though I really don't like the names "interleaved" and "staged". What about "immediate" and "pre-planned"?
I also have to point out that, based on my observations of previous runs of the program, the "staged"/"pre-planned" mode, when run on the dandiarchive inventories, will have to spend about 10 and a half hours just going over all the items before it gets around to actually doing anything.
In "staged"/"pre-planned" mode, when cleaning up after an error or Ctrl-C, do you want the actions performed so far to be rolled back or not? (Just what is the fsck you're referring to? I don't see how the fsck(8) command is relevant here.)
How do you expect removal of files from the local backup that don't correspond to any items in the inventories (#33) to work? Determining whether a file should be deleted basically requires storing all the CSV entries in memory at once.
Prior discussion from dandi/dandi-infrastructure#197: