
Moving archives between disks #140

Closed
tlaurion opened this issue Feb 7, 2023 · 12 comments
@tlaurion (Contributor) commented Feb 7, 2023

@tasket : I'm looking into moving the fix03 (latest) pruned archive between disks.
After reading the documentation, I'm still not sure how to do this properly while preserving the hardlinks from the origin drive on the destination drive.

Obviously, mv won't do it.
rsync seems to handle it properly within the same local filesystem, but it's unclear whether doing so will duplicate content, and I'm tight on space.

What rsync command would you recommend for copying between the origin and destination drives?

@tlaurion (Contributor, Author) commented Feb 8, 2023

Would that simply be `-H`?

@tlaurion (Contributor, Author) commented Feb 8, 2023

Trying:

`rsync -az -H --progress --numeric-ids /path/to/source 192.168.1.5:/path/to/dest/ --info=progress2`

as instructed by https://www.cyberciti.biz/faq/linux-unix-apple-osx-bsd-rsync-copy-hard-links/ (without deleting the source).

@tasket (Owner) commented Feb 8, 2023

@tlaurion Good question. I'm fairly certain that rsync -H behaves as expected and there should be no surprises. However, you could monitor hardlink accumulation during the process to make sure it's working. On the destination, a command like `find dirname -printf '%n %p\n' | sort -rn -k1,1 | less` will show you the most highly-linked files at the top.

As for Wyng itself, you're probably aware there are two ways to do this: a normal import like `wyng --from=location-url arch-init`, or editing the destsys, destmountpoint, and destdir fields in 'archive.ini' directly and then copying the updated ini to the dest archive folder. (You cannot do this trick with v0.4 encrypted archives.)
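A rough sketch of the second route, for illustration only; the three field names are taken from the comment above, but the paths and the key = value syntax are assumptions to verify against your own archive.ini:

```
# Hypothetical sketch: retarget a copied (v0.3, unencrypted) archive.
# Field names come from the comment above; paths and exact ini syntax
# are assumptions -- inspect your archive.ini before scripting this.
sed -i -e 's|^destmountpoint.*|destmountpoint = /mnt/newdisk|' \
       -e 's|^destdir.*|destdir = wyng.backup|' \
       archive.ini                        # adjust destsys too if the host changed
cp archive.ini /mnt/newdisk/wyng.backup/  # copy the updated ini to the dest folder
```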

@tasket (Owner) commented Feb 8, 2023

Btw, if you think rsync -H didn't do as good a job as you'd like, you can use `wyng arch-deduplicate` after the fact and it will try to deduplicate everything.

@tlaurion (Contributor, Author) commented Feb 8, 2023

It worked as expected!

tlaurion closed this as completed Feb 8, 2023
@tlaurion (Contributor, Author) commented
Instead of the earlier

`rsync -az -H --progress --numeric-ids /path/to/source 192.168.1.5:/path/to/dest/ --info=progress2`

I now use

`rsync --archive --compress --numeric-ids --times --hard-links --progress --info=progress2 /path/to/source user@server:/path/to/dest/`

though compression taxes both the server and the client without significant gains (the archives are compressed, after all).

@tasket (Owner) commented Sep 22, 2023

Using --times is a good idea; I will add that to the Readme.

FWIW, if this is used beyond a fresh copy/move of the archive (i.e. an existing copy is being updated) then --delete should also be used.
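For instance, a refresh of an existing copy might look like the following; host and paths are placeholders, and note that --archive already implies --times:

```
# Hedged example combining the thread's recommendations for updating an
# existing copy: -a implies --times, --hard-links preserves hardlinks,
# --delete removes files that prune dropped from the source.
rsync --archive --hard-links --numeric-ids --delete --info=progress2 \
      /path/to/source user@server:/path/to/dest/
```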

@t-gh-ctrl commented Feb 8, 2025

[Hopefully it's OK to comment on closed issues]

[edited to add rsync's --no-inc-recursive option after losing a couple of hours trying to figure out why the approach below worked with small volumes/small numbers of files but didn't for large volumes]

I do daily wyng backups to a local ssh server and sync that local server's wyng archive weekly to a remote server at night, when my Qubes OS PC is powered off. So this isn't something that #199 could solve (unless wyng could run on the local server without any passphrase/key, but if I'm not mistaken that's not the case).

For that use case, the readme's `mv wyng wyng-syncing && rsync wyng-syncing ...` approach seems to be the only solution. However, this is very bandwidth-inefficient when using wyng's prune while synchronizing volumes that hold a large amount of data that doesn't change much: with those, rsync ends up transferring almost the whole volume to the remote host again and again despite zero changes in the volume. It has no knowledge that wyng's prune simply moved the files to a newer session folder, so the data is deleted and the same data is sent again to the updated path instead of being moved. Those large transfers are an issue in my case because the remote host has a metered, slow-ish internet plan.

I thought about how this could be solved within wyng, but it seemed too convoluted/fragile in the end. A simple workaround is to keep a hard-linked copy of the wyng archive that is synchronized with the current archive after each successful rsync to the remote host.

E.g. (sketched as a script below):

  • initial step: create a hard-linked copy of the wyng archive, named e.g. wyng.ln, on the server (either with cp -al or rsync --link-dest=)
  • wyng will update the archive as usual with VM changes and potentially prune sessions; that means wyng.ln will lag behind over time.
  • when it's time to back up the local archive to the remote host, rsync both the local archive and wyng.ln at the same time. That way, any file that was moved to a later session directory in the "current" wyng archive will be picked up by rsync on the remote host as a hard link from the "stale" wyng.ln copy, rather than being deleted and uploaded again. EDIT: this requires the --no-inc-recursive option to be bandwidth-efficient (see caveat below)
  • then, if the rsync above was successful, synchronize wyng.ln with the current archive (with rsync --delete --link-dest=, or recreate it with rm -fr ... ; cp -al ...).
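Put together, the whole cycle might look something like the sketch below; paths and the destination are hypothetical:

```
#!/bin/bash
# Sketch of the procedure above; paths and destination are placeholders.
set -e
SRC=/srv/wyng          # live archive, updated daily by wyng
LN=/srv/wyng.ln        # stale hard-linked copy
DEST=user@remote:/srv/backups/

# Initial step: create the hard-linked copy if it doesn't exist yet.
[ -d "$LN" ] || cp -al "$SRC" "$LN"

# Weekly sync: send both trees in one transfer so chunks that prune
# moved between session dirs get re-linked on the remote side (via the
# stale copy) instead of re-uploaded. --no-inc-recursive is required
# for this to be bandwidth-efficient (see caveats below).
rsync -aH --delete --numeric-ids --no-inc-recursive "$SRC" "$LN" "$DEST"

# Only reached if rsync succeeded (set -e): refresh the stale copy.
rm -rf "$LN"
cp -al "$SRC" "$LN"
```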

I tested this with a few volumes and it seems to be working pretty well. I'll report back if any issue pops up.

Caveats:

  • disk space that would have been freed by wyng's prune is reclaimed only during the last step above.
  • the number of inodes used on the hosts is doubled - but this is negligible (at least in my setup - the maximum number of inodes is absurdly high on a default install).
  • --no-inc-recursive takes a significant amount of time (and memory) to build the file list. Without this option, past a given number of files, rsync would transfer the files anyway and link them later (see rsync's -H option in the man page: "rsync may transfer a missing hard-linked file before it finds that another link for that contents exists elsewhere").

Hope this helps! (I thought about sending a PR to update the readme, but the content might be a bit too specific/unrelated to wyng.)

@tasket (Owner) commented Feb 8, 2025

@t-gh-ctrl That's an interesting solution, thanks for posting it.

> the number of inodes used on the hosts is doubled

I think if you check inode usage you'll see this isn't true. Hardlinks are directory references to inodes; they are not inodes themselves. If at the end of the procedure inode use has really doubled, then rsync has done something to de-link wyng.ln from the original tree (or the archive had a very high degree of change between sessions and there are few commonalities between old and new). FWIW, the rsync options --compare-dest and --fuzzy seem almost applicable to this use case, but I think not actually.

Using an alternative to rsync, one that remembers file hashes globally (instead of comparing only files that have matching paths, as rsync does), might be an option. The rclone tool with --track-renames and hashing enabled looks like a possibility.
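An untested sketch of that idea; --track-renames is a real rclone option, but the remote name below is a placeholder and the option needs a hash-capable backend:

```
# Hypothetical rclone alternative: rename/move detection via hashes,
# so chunks relocated by prune are moved server-side, not re-uploaded.
# 'remote:' stands for a configured, hash-capable rclone backend.
rclone sync /srv/wyng remote:wyng-archive --track-renames --progress
```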

> (except if wyng could run on the local server without any passphrase/key, but if I'm not mistaken that's not the case)

A simple log of Wyng's pruning actions might get us 99% there, even without authenticating. On the data-chunk level, pruning is extremely simple: there is a range of session dirs to be merged together, the oldest one becomes the target, and the files from the newer (to-be-deleted) sessions are moved into the target. I think that gives rsync all the help it needs to avoid wasting bandwidth. (Can this be done without a pruning log? Probably: if you compare the S_* entries between the origin and the duplicate, you could automatically find the 'missing' sessions on the origin, and it would work as long as there is a prior session present in both locations.) You don't need to worry about the metadata files, since they are small and rsync can duplicate them.

With the following difference between Src archive and Dest...

Src Vol_a12345/             Dest Vol_a12345/
   S_20250104-000001/           S_20250104-000001/
                                S_20250112-000001/
                                S_20250115-000001/
   S_20250123-000001/           S_20250123-000001/

...on Dest do something like:

cd Vol_a12345
cp -al S_20250112-000001/* S_20250104-000001
rm -r S_20250112-000001
cp -al S_20250115-000001/* S_20250104-000001
rm -r S_20250115-000001

I'm not sure cp -al would be appropriate, but I think you get the idea; a shell variant using mv is sketched below. Python might be a better way to script it, since you could then use mv/rename calls without creating extra links.
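For illustration, the same merge done with mv so that no extra links are created; the session names are taken from the example above, and a flat session-dir layout is assumed:

```
#!/bin/bash
# Hypothetical mv-based variant of the merge above. Assumes chunk files
# sit directly inside each session dir; mv keeps the same inodes, so
# nothing is copied or re-linked.
set -e
cd Vol_a12345
for s in S_20250112-000001 S_20250115-000001; do
    mv "$s"/* S_20250104-000001/
    rmdir "$s"
done
```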

@t-gh-ctrl commented Feb 9, 2025

@tasket, thank you for your detailed reply!

> the number of inodes used on the hosts is doubled

> I think if you check inode usage you'll see this isn't true

No idea why, but for some reason I had always assumed that hard-linking files used inodes; I now realize that was wrong. I stand corrected!

> FWIW, rsync options --compare-dest and --fuzzy seem almost applicable to this use case

It's funny: despite using rsync since a couple of years after it was created (I'm old...), I never paid attention to those options. I'll give them a try out of curiosity to see how they perform, but I expect the sheer number of files in wyng archives may not play well with fuzziness and/or performance...

> Using an alternative to rsync, [...]

Indeed, there might be other tools that work better. In hindsight I realize I got bent on using rsync because I'm used to it - all my other "data synchronization" tasks use rsync extensively, and until now there hasn't been anything rsync didn't manage to do well.

> A simple log of Wyng's pruning actions might get us 99% there

Yes - "fixing" the destination is something I initially planned to do (and which I actually did manually while testing yesterday). But even if the task of moving files to the oldest session is pretty trivial, the bulk of the code would be error handling for everything that can go wrong (somewhat mitigated by having rsync run afterwards to sync everything properly). So it seemed that "bending" rsync to my specific use case was easier. The approach I outlined does work but is definitely sub-optimal compared to a tool that knows what wyng is doing: for instance, rsync takes 5+ minutes to build the file list of a 160GB wyng archive on old-ish but still decent hardware. That is fine for a cron job at night, but not when you want to run it interactively.

@tasket (Owner) commented Feb 9, 2025

I wouldn't consider the error handling too critical. In bash you can set -e and make rsync the last command before renaming the archive dir back to its regular name.
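A minimal sketch of that pattern, reusing the readme's rename convention quoted earlier (destination is a placeholder):

```
#!/bin/bash
# If rsync fails, set -e aborts before the final rename, leaving the
# 'wyng-syncing' name in place as a marker that the sync is incomplete.
set -e
mv wyng wyng-syncing
rsync -aH --delete --numeric-ids wyng-syncing user@remote:/path/to/dest/
mv wyng-syncing wyng
```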

> rsync takes 5+ minutes to build the file list of a 160GB wyng archive

Just wondering if rsync has the benefit of its remote daemon in your case? That is supposed to accelerate some operations.

@t-gh-ctrl commented Feb 13, 2025

> In bash you can set -e

Indeed. One of these days I might give it a try. Rsync works for now - it's just not optimal...

> Just wondering if rsync has the benefit of its remote daemon in your case

It's already running in daemon mode to avoid an additional encapsulation layer (rsync over wireguard, instead of rsync over ssh over wireguard [edit: I can't ssh into the host directly, I have to use wireguard]); that was actually one of the reasons for sticking with rsync...
