Replies: 6 comments 5 replies
-
Funnily enough, I was actually wishing I had something like this a couple of weeks ago during a recovery job, where I had a heap of damaged directory ZAPs and was trying to rebuild the directory structure from references between objects. So it might be useful even if there's no standard filesystem API to get at it, though we could easily make one if we wanted.

I think the way to do it reasonably efficiently is something like this, which I thought about a little bit but am also making up as I go:

- Make a new variable-length system attribute, registered along the lines of `{"ZPL_LINKDIRS", 0, SA_UINT64_ARRAY, 0}`, to store an array of uint64s.
- When we make a new link to an existing object and increment its link count, also record the object number of the directory the new link was created in.
- To handle multiple hard links in the same dir, store a refcount alongside each directory object number rather than repeating the entry.

This lets us fix a low-key nit as well: right now, a hardlinked file's stored parent can go stale when the link in that directory is removed; with this list we can promote one of the other directories to be the new parent.
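The attribute sketched above can be modelled in miniature. This is a toy model in Python for brevity; the real thing would be a flat uint64 array managed by the ZFS SA layer in C, and packing (dir object, refcount) pairs into that array is my assumption about the layout, not settled design:

```python
# Toy model of the proposed ZPL_LINKDIRS system attribute: a flat list
# of uint64 pairs, (directory object number, refcount), recording which
# directories hold links to this object. Hypothetical sketch only.

def linkdirs_add(attr, dir_obj):
    """Record a new link created in directory dir_obj."""
    for i in range(0, len(attr), 2):
        if attr[i] == dir_obj:
            attr[i + 1] += 1       # another link in the same dir: bump refcount
            return
    attr.extend([dir_obj, 1])      # first link in this dir: append a new pair

def linkdirs_remove(attr, dir_obj):
    """Forget one link removed from directory dir_obj."""
    for i in range(0, len(attr), 2):
        if attr[i] == dir_obj:
            attr[i + 1] -= 1
            if attr[i + 1] == 0:
                del attr[i:i + 2]  # last link in this dir: drop the pair
            return
    # Entry not found: tolerate it, per the "partial list" discussion below.

attr = []
linkdirs_add(attr, 34)    # link in dir 34
linkdirs_add(attr, 34)    # second link in the same dir
linkdirs_add(attr, 56)    # link in a different dir
print(attr)               # [34, 2, 56, 1]
linkdirs_remove(attr, 34)
linkdirs_remove(attr, 34)
print(attr)               # [56, 1]
```

The object numbers (34, 56) are made up; the point is just that same-dir links share one pair via the refcount.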
Notice I haven't stored the name here. That's mostly because that's a lot more effort to manage, as it gets tied into the rename path, and also just a lot more stuff to store, for a function that probably isn't used all that often (I guess). The actual call that gets all the names can go and scan the listed dirs to find them, which isn't any different to finding a file by name.

That feels about right as the MVP, I think. It's a little like block cloning in shape, really. I think the overhead is low enough for a relatively rare event that we wouldn't need to complicate it further.

I have two questions that might change how I'd do stuff here. The first: what kind of programs use this feature, and how do they use it? If it's used infrequently, then the name lookup overhead is probably not a huge deal. But if it's all the time, maybe we'd need to do something more there (I have vague ideas, but I'm hoping it's not needed).

The other is: what is the right behaviour if the list isn't complete? I don't see any reason why it shouldn't be in normal operation, but I think it informs a migration path. If the system calls involved can handle a partial or incorrect answer, and the overheads are negligible, then we can just start adding link information the first time a hard link is created, and on remove, if it's not there, just ignore it. If they need complete information, then it's slightly different; we can't just ignore a missing attribute (or we add a dataset-specific feature flag, I guess, but that feels heavy-handed; system attributes are designed to be extended).

Yeah, this might even work 😆
-
> Make new variable-length system attribute to store an array of uint64_t.

That is an interesting idea, so a file/inode would have an array in its system attributes of the directories that reference it. Although it does sound like you would have to list each directory to look for the matching names.

> What kind of programs use this feature?

Only thing I have heard of is backup-style programs. But yeah, I have not seen anything actually break from returning an empty hardlink list; it'd just back up the same file data multiple times. Honestly, hardlinks do not appear to be used all that much from what I see, maybe mostly by those backup tools. So how much effort would one want to put into something like this? But it is nice to be "technically correct" :)
-
I'm not sure what the theoretical maximum size of a system attribute is. Once I saw that
Yeah, this model means a directory search to find the names, but that's no different to what already happens when finding a file by name. A possible in-between option is storing the ZAP hash/cursor (a uint64) alongside the directory, so we could jump closer to the entry instead of scanning the whole ZAP. (I'm not really optimising ahead of time here, just thinking about the shapes we might want.)

For the link id, I guess the question is how "immutable" it has to be. It's a little tricky to find information about it, but going by the source, I think really the only way to do that part is to keep some 1:1 association of link id to name. If I was building from scratch, I'd be inclined to put it in the dir ZAP itself, alongside the name; but if I was building from scratch, maybe I wouldn't do any of this this way heh. So yeah, maybe that's a macOS-specific extension to all this, unless it can be made extremely cheap.
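The "link id in the dir ZAP, alongside the name" idea can be sketched as a toy model. This is Python for brevity, with a plain dict standing in for a directory ZAP; the layout and all names here are hypothetical, not how ZFS dir ZAPs actually work today:

```python
# Toy model: each directory entry maps name -> (object number, link id),
# so resolving a link id to a name is a scan of one known directory
# rather than a global search. Hypothetical layout for illustration.

class Dir:
    def __init__(self):
        self.entries = {}           # name -> (obj, linkid)

    def add(self, name, obj, linkid):
        self.entries[name] = (obj, linkid)

    def name_for_linkid(self, linkid):
        for name, (_, lid) in self.entries.items():
            if lid == linkid:
                return name
        return None                 # unknown link id

d = Dir()
d.add("a.txt", obj=7, linkid=100)
d.add("b.txt", obj=7, linkid=101)   # hardlink to the same object
print(d.name_for_linkid(101))       # b.txt
```

The 1:1 association lives with the name, so renames and removes keep it consistent for free.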
Yeah, backup software is all I can think of; many things use hardlinks, but if they care about where they all are they do their own tracking. That's kinda good, we can do this how we like heh. Honestly, for me this is kind of a fun academic exercise, though like I said before, I would have liked to have it available under the hood.
Well maybe I'll prototype it this afternoon for fun heh :)
-
Here we go! https://github.com/robn/zfs/commits/zpl-linkdirs/

So, we make a file as normal, and get its parent and link count:
Adding a link sets up the new attribute:
As we add more, the refcount goes up:
Adding a new reference in a different dir sets up a new entry:
As we remove them, the refcount goes down:
Removing the last one from the listed parent dir will take a reference from the array and move it to the parent, a nice improvement on current releases.
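The lifecycle above can be modelled in miniature. This is a Python sketch, not the prototype's code: the parent-promotion rule follows the description above, and everything else (names, object numbers) is made up for illustration:

```python
# Toy model of the walkthrough: a file carries its primary parent (like
# the existing parent attribute) plus a hypothetical linkdirs map of
# {dir object: refcount} for the extra links. Illustrative only.

class File:
    def __init__(self, parent):
        self.parent = parent         # dir obj number of the first link
        self.links = 1
        self.linkdirs = {}           # dir obj number -> refcount

    def link(self, dir_obj):
        """A new hard link was created in dir_obj."""
        self.links += 1
        self.linkdirs[dir_obj] = self.linkdirs.get(dir_obj, 0) + 1

    def unlink(self, dir_obj):
        """A hard link in dir_obj was removed."""
        self.links -= 1
        if dir_obj in self.linkdirs:
            self.linkdirs[dir_obj] -= 1
            if self.linkdirs[dir_obj] == 0:
                del self.linkdirs[dir_obj]
        elif dir_obj == self.parent and self.linkdirs:
            # The parent dir's own link went away: take a reference from
            # the array and move it into the parent slot.
            new_parent = next(iter(self.linkdirs))
            self.linkdirs[new_parent] -= 1
            if self.linkdirs[new_parent] == 0:
                del self.linkdirs[new_parent]
            self.parent = new_parent

f = File(parent=34)
f.link(34)                   # second link, same dir: refcount goes up
f.link(56)                   # link in a different dir: new entry
print(f.parent, f.linkdirs)  # 34 {34: 1, 56: 1}
f.unlink(34)
f.unlink(34)
print(f.parent, f.linkdirs)  # 56 {}
```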
So if that's workable, the next thing would be to write a function to take that list and resolve names. And then it gets wired up to the platform APIs. That proves it out at least; pretty sure the overhead is low enough that it could be always-on, but I'd have to think a bit more about the finer details. If you think it's useful I can push on upstreaming it. And if not, it was still a pleasant diversion for a lazy day :)
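That "resolve names" step could look something like this toy model (Python; directories are modelled as name-to-object-number maps, and all names here are hypothetical):

```python
# Sketch of name resolution: given the linkdirs list for a file object,
# scan only the listed directories for entries that reference it, which
# is the same cost as an ordinary lookup per directory. Illustrative only.

def resolve_names(dirs, linkdirs, obj):
    """Yield (dir_obj, name) for every link to obj in the listed dirs."""
    for dir_obj in linkdirs:
        for name, entry_obj in dirs[dir_obj].items():
            if entry_obj == obj:
                yield (dir_obj, name)

dirs = {
    34: {"a.txt": 7, "other": 9},
    56: {"b.txt": 7},
}
print(list(resolve_names(dirs, [34, 56], 7)))
# [(34, 'a.txt'), (56, 'b.txt')]
```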
-
It might be worth revisiting this talk from the OpenZFS 2020 Dev Summit:
-
Testing the code, I added a simple iterator. I do not need to build `fullpath`, as I already have to do that anyway, but it has a test:
Comments (yes, it's just a POC):
Still worked well: https://github.com/openzfsonwindows/openzfs/tree/zpl_linkdirs
-
So here I am, on my second platform that has an API call to retrieve all sibling names pointing to the same hardlink.

In Windows, it is the `IRP_MJ_QUERY_INFORMATION` / `FileHardLinkInformation` call, returning ParentIDs, FileIDs, and names. And under macOS, `APFSIOC_NEXT_LINK` (`HFSIOC_NEXT_LINK`), returning `inode` and `linkid`. (`inode` IDs are the same with a hardlink, but `linkID`s are unique, so you can distinguish between them; you can `stat()` with either. The current implementation raises `linkID` up into the same range upstream uses for `.zfs` entries - macOS lets you `stat(Filename)` as well.)

So I am pondering solutions for what to do about hardlinks and the ability to look up siblings. Doing a big recursive search is not really going to be an option.
For macOS, we currently build an in-memory map whenever we come across a file with `n_links > 1`, and dynamically assign a `linkID` to the full path name. As it happens, `linkID`s do not need to survive reboot/remounts, just be unique during the mount. It does mean it works when you import a pool from Linux/FreeBSD which contains hardlinks (I mean, eventually, see below).

But it does have a pretty big drawback: if it has not yet "come across" a hardlink (via an earlier directory-listing request, or similar), then it does not know about its siblings (or knows only a subset of them).
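A miniature version of that in-memory map, in Python; the starting ID value and the paths are made up, and the real code keys things rather differently, so treat this purely as a shape sketch:

```python
# Toy model of the macOS approach: lazily assign a linkID the first time
# a file with n_links > 1 is encountered, keyed by full path. IDs only
# need to be unique for the life of the mount. Illustrative only.

class LinkIdMap:
    def __init__(self, first_id=0x10000):    # base value is arbitrary here
        self.next_id = first_id
        self.by_path = {}                    # full path -> linkID
        self.by_id = {}                      # linkID -> full path

    def observe(self, path, n_links):
        """Call whenever a listing (or similar) encounters a file."""
        if n_links <= 1 or path in self.by_path:
            return self.by_path.get(path)
        linkid = self.next_id
        self.next_id += 1
        self.by_path[path] = linkid
        self.by_id[linkid] = path
        return linkid

m = LinkIdMap()
a = m.observe("/tank/a", n_links=2)
b = m.observe("/tank/dir/b", n_links=2)   # sibling, only known once seen
print(a, b)                               # two distinct IDs
```

The drawback described above falls straight out of this shape: until `observe()` has seen a sibling, the map simply has no entry for it.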
Another option would be to build an xattr and store it with the file. Fairly easy to maintain: add ParentID/FileID/Name when increasing `n_links`, and remove it when decreasing. But alas, this does nothing when the pool was created by Linux/FreeBSD. Although, with upstream support, a dataset feature like "hardlink_tracking" could be PRed, should Linux/FreeBSD one day want to add API calls for sibling support. However unlikely that is.
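The xattr bookkeeping could look something like this toy model (Python, with a JSON-encoded record list; the xattr name and encoding are assumptions for illustration, not what any platform actually stores):

```python
# Sketch of the xattr option: keep a list of (parent_id, file_id, name)
# records in an xattr, appended when n_links goes up and removed when it
# goes down. A plain dict stands in for the file's xattr store.

import json

def xattr_link_added(xattrs, parent_id, file_id, name):
    recs = json.loads(xattrs.get("user.hardlinks", "[]"))
    recs.append([parent_id, file_id, name])
    xattrs["user.hardlinks"] = json.dumps(recs)

def xattr_link_removed(xattrs, parent_id, name):
    recs = json.loads(xattrs.get("user.hardlinks", "[]"))
    recs = [r for r in recs if not (r[0] == parent_id and r[2] == name)]
    xattrs["user.hardlinks"] = json.dumps(recs)

x = {}
xattr_link_added(x, 34, 7, "a.txt")
xattr_link_added(x, 56, 7, "b.txt")
xattr_link_removed(x, 34, "a.txt")
print(x["user.hardlinks"])   # [[56, 7, "b.txt"]]
```

Maintaining it inline with link/unlink is cheap; the gap, as noted, is files that gained their links before this code existed.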
Something in between? Leverage `scrub` to silently/softly track hardlinks, then update the xattr once completed? What about resumed scrubs? Too hacky?

Ultimately though, I have not found anything that breaks when returning empty sibling lists for hardlinks, so it is more a pursuit of perfection/compliance than any actual need.
Thoughts?