Alternative cluster pinning option #48

Closed
teknomunk opened this issue Jan 22, 2021 · 5 comments

@teknomunk

This idea was based on the discussion at #42 about High I/O usage and the two ways that the cluster has distributed pin instructions to cluster follower nodes up to now:

  • Recursively pin the root folder hash -> causes high disk I/O utilization by ipfs during pinning
  • Recursively pin each file and non-recursively pin the folders -> causes high disk space utilization by ipfs-cluster-follow

As I run a cluster node, this affects the utilization of my hardware. I never noticed the high disk space utilization as a problem because of the amount of disk space I have (>10TB), but I have noticed the high disk I/O utilization. Because the slowdown affected other processes I am running, I have taken steps to mitigate it (an SSD cache for the logical volume the data resides on).

This is an attempt to describe an idea that should have neither the high disk I/O utilization of pinning the root folder hash nor the high disk space utilization of pinning each updated file.

Under this option, the folder structure under /ipns/x86-64.archlinux.pkg.pacman.store/ is not changed at all from its current state. Instead, we create a completely separate directory structure that contains the same package files, arranged so that the cluster members pin just the new packages without having to check all the other packages and directories in the repo.

As an example, consider an update with only the packages abiword and go-ipfs. You would create a directory like this:

/2021-01-22-001/
/2021-01-22-001/abiword-3.0.4-4-x86_64.pkg.tar.zst
/2021-01-22-001/go-ipfs-0.7.0-1-x86_64.pkg.tar.zst

in addition to updating /extra/ and /community/, then add the hash of the folder /2021-01-22-001/ to the cluster. This folder would exist only in the cluster, and only for the purpose of having the cluster members pin those two new packages. People not part of the cluster should never see these directories.
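The layout above can be sketched as a small helper. This is only an illustration in Python (the real tooling is the shell script rsync2cluster.sh); the function name `update_dir_listing` and its parameters are hypothetical, and the naming scheme (ISO date plus a three-digit sequence number) is inferred from the example.

```python
from datetime import date


def update_dir_listing(day: date, seq: int, packages: list[str]) -> list[str]:
    """Return the paths of one update folder and the packages linked into it.

    Hypothetical helper: the folder name is the ISO date plus a three-digit
    sequence number, matching the /2021-01-22-001/ example.
    """
    folder = f"/{day.isoformat()}-{seq:03d}/"
    # The folder itself comes first, then one entry per new package, by name.
    return [folder] + [folder + name for name in sorted(packages)]


listing = update_dir_listing(
    date(2021, 1, 22),
    1,
    ["go-ipfs-0.7.0-1-x86_64.pkg.tar.zst", "abiword-3.0.4-4-x86_64.pkg.tar.zst"],
)
print("\n".join(listing))
```

Only the new packages appear in the listing, so a follower that pins this folder has nothing else to traverse.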

If you then got another set of package updates, you would create another folder for only those additional packages:

/2021-01-22-002/
/2021-01-22-002/dbus-broker-26-1-x86_64.pkg.tar.zst
/2021-01-22-002/fftw-3.3.9-1-x86_64.pkg.tar.zst
/2021-01-22-002/xorg-docs-1.7.1-3-any.pkg.tar.zst
/2021-01-22-002/yasm-1.3.0-4-x86_64.pkg.tar.zst
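Choosing the name of the next folder could look roughly like the following. Again a sketch under assumptions: `next_update_dir` and the `YYYY-MM-DD-NNN` pattern are hypothetical and only mirror the two examples above.

```python
import re
from datetime import date

# Hypothetical naming scheme taken from the examples: YYYY-MM-DD-NNN.
UPDATE_DIR = re.compile(r"^(\d{4}-\d{2}-\d{2})-(\d{3})$")


def next_update_dir(existing: list[str], today: date) -> str:
    """Return the name of the next update folder for `today`."""
    highest = 0
    for name in existing:
        match = UPDATE_DIR.match(name.strip("/"))
        # Only folders created on the same day count towards the sequence.
        if match and match.group(1) == today.isoformat():
            highest = max(highest, int(match.group(2)))
    return f"{today.isoformat()}-{highest + 1:03d}"


existing_dirs = ["/2021-01-21-004/", "/2021-01-22-001/"]
print(next_update_dir(existing_dirs, date(2021, 1, 22)))
```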

There are a number of ways to decide when to remove these update directories from the cluster:

  • after all the packages in a directory have been removed from upstream
  • after some fixed expiration time
  • manually (not recommended)
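The first two rules can be compared with a short sketch. Everything here is hypothetical: the function `removable_dirs`, the bookkeeping of creation times and package names per folder, and the sample data (creation times and the upstream state are made up for illustration).

```python
from datetime import datetime, timedelta


def removable_dirs(update_dirs, upstream, now, max_age=None):
    """Pick update folders that may be removed from the cluster pin set.

    update_dirs: {folder name: (creation time, set of package file names)}
    upstream:    set of package file names still present upstream
    max_age:     optional timedelta; if given, the fixed-expiration rule is
                 used instead of the "all packages removed upstream" rule
    """
    removable = []
    for name in sorted(update_dirs):
        created, packages = update_dirs[name]
        if max_age is not None:
            if now - created > max_age:
                removable.append(name)
        elif not (packages & upstream):
            # No package of this folder is left upstream.
            removable.append(name)
    return removable


# Made-up sample state: abiword is still current, the other packages were replaced.
dirs = {
    "2021-01-22-001": (
        datetime(2021, 1, 22, 8),
        {"abiword-3.0.4-4-x86_64.pkg.tar.zst", "go-ipfs-0.7.0-1-x86_64.pkg.tar.zst"},
    ),
    "2021-01-22-002": (datetime(2021, 1, 22, 20), {"yasm-1.3.0-4-x86_64.pkg.tar.zst"}),
}
upstream_now = {"abiword-3.0.4-4-x86_64.pkg.tar.zst"}
now = datetime(2021, 3, 1)

print(removable_dirs(dirs, upstream_now, now))
print(removable_dirs(dirs, upstream_now, now, max_age=timedelta(days=30)))
```

With the first rule, a folder stays pinned as long as a single one of its packages is still upstream; with the expiration rule, the folder goes away regardless of what it contains.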

Looking at rsync2ipfs-cluster/bin/rsync2cluster.sh, I think implementing this idea only requires two changes: modify ipfs_mfs_add_file() to take a third parameter (the update folder path in MFS) and to add the file's CID to that update folder, and then add the update folder to the cluster pin set.
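As a rough illustration of the extra steps, the sketch below only builds the command lines and does not run them. It is not the actual body of ipfs_mfs_add_file(), which is not shown here; the Python function names and the `<...-cid>` placeholders are hypothetical, and it assumes the package has already been imported into IPFS so that its CID is known.

```python
import shlex


def update_folder_commands(file_cid: str, filename: str, update_dir: str) -> list[list[str]]:
    """Extra MFS commands for one new package (built only, not executed)."""
    update_dir = "/" + update_dir.strip("/")
    return [
        # Create the update folder on first use; -p tolerates an existing folder.
        ["ipfs", "files", "mkdir", "-p", update_dir],
        # Link the already-imported file by CID into the update folder.
        ["ipfs", "files", "cp", f"/ipfs/{file_cid}", f"{update_dir}/{filename}"],
    ]


def pin_update_folder_commands(update_dir: str, folder_cid: str) -> list[list[str]]:
    """Commands for the end of a run: look up the folder CID, then pin it."""
    update_dir = "/" + update_dir.strip("/")
    return [
        # Prints the CID of the update folder; its output is what gets pinned.
        ["ipfs", "files", "stat", "--hash", update_dir],
        ["ipfs-cluster-ctl", "pin", "add", folder_cid],
    ]


commands = update_folder_commands(
    "<file-cid>", "abiword-3.0.4-4-x86_64.pkg.tar.zst", "2021-01-22-001"
) + pin_update_folder_commands("2021-01-22-001", "<folder-cid>")
for command in commands:
    print(shlex.join(command))
```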

@RubenKelevra
Owner

Hey @teknomunk,

0.14 was released and includes batched pinning, which is basically what you wanted to do manually: merging multiple changes into one update to the cluster.

Sadly, it's still not a manual transaction, which I had asked to have implemented (the ticket tracking this request: ipfs-cluster/ipfs-cluster#1018 (comment)).

Nevertheless, the batching should in theory make the first mode of operation more viable again, as we produce far fewer commits.

The space cleanup issue was also tackled with 0.14 and the space used by the db dropped significantly (ipfs-cluster/ipfs-cluster#1320 (comment)).

This means the boilerplate that was previously necessary should no longer be needed. :)

@RubenKelevra
Owner

There are a number of ways to decide when to remove these update directories from the cluster:

  • after all the packages in a directory have been removed from upstream
  • after some fixed expiration time
  • manually (not recommended)

I don't see any way to do this cleanly without a lot of additional effort. Additionally, single packages that don't get updated for a year or two would unnecessarily block all the other packages from being cleaned up.

So we would have to traverse all folders and clean up already deleted packages every once in a while.

@teknomunk
Author

I recently updated ipfs-cluster-follow to 0.14.0. If it automatically does batching to only add the new files, that would be appreciated. If I remember, I'll look at this later today.

@RubenKelevra
Owner

@teknomunk well, it would do batching and combine many new pins into one operation. But since we currently use a recursive pin of a folder, this doesn't change anything.

We would need to switch back to pinning individual files to the cluster to take advantage of this. While the source code is still in the repo, I really don't like switching back. It feels pretty hacky tbh.

I would rather have the IPFS team investigate why traversing between two folder versions and fetching the changes is so hard on I/O.

@RubenKelevra
Owner

Maybe you could highlight our use-case in a bug report on https://github.com/ipfs/go-ipfs?
