Uberallfs is a peer-to-peer distributed filesystem. Objects are cached on the accessing node. Mutation is implemented by passing tokens around: only the holder of the token is eligible to mutate the data. The token can be requested from the current token holder by any entity which has permission to mutate the object. The old token holder then keeps a reference to the entity which received the token. These references can be walked to find the authoritative node at the end of the trail. The actual implementation of this idea becomes a bit more complex and offers certain optimization opportunities.
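A minimal sketch of the trail walk, assuming a toy in-memory map of per-node token states; in the real system nodes are addressed by public keys and every hop is a network query, so all names here are hypothetical:

```rust
use std::collections::HashMap;

/// Hypothetical node identifier; real nodes are addressed by public keys.
type NodeId = u64;

/// What a node records about an object's token.
enum TokenState {
    Held,             // this node is authoritative
    PassedTo(NodeId), // trail pointer to the node that received the token
}

/// Walk the trail of references until the authoritative node is found.
/// A hop limit guards against broken or cyclic trails.
fn find_authoritative(start: NodeId, trails: &HashMap<NodeId, TokenState>) -> Option<NodeId> {
    let mut current = start;
    for _ in 0..64 {
        match trails.get(&current)? {
            TokenState::Held => return Some(current),
            TokenState::PassedTo(next) => current = *next,
        }
    }
    None // trail too long or cyclic: needs recovery
}

fn main() {
    // A(1) passed the token to B(2), B passed it to C(3); C is authoritative.
    let trails: HashMap<NodeId, TokenState> = HashMap::from([
        (1, TokenState::PassedTo(2)),
        (2, TokenState::PassedTo(3)),
        (3, TokenState::Held),
    ]);
    assert_eq!(find_authoritative(1, &trails), Some(3));
}
```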
Entities which don't have 'write' access to an object will never get the token. When read access is requested, they synchronize the object with the nodes that hold the original data. This synchronization is managed by the current token owner.
See below for examples of how to use it.
Objects become cached upon access. There will be tools to enforce this caching and to pin objects to a node.
The filesystem caches objects and can be used while a node is offline. Writes require either that the token is obtained locally beforehand, or the object can become 'detached' and changes must be merged when connectivity is restored (which can happen automatically if there are no conflicts).
Objects can either be instantiated by a 'creator', who defines security policies about who has access to them, or published as anonymous/immutable.
Objects may also be private and never shared.
Different levels of redundancy are planned: RAID-1-like redundancy, where sync/close calls only complete after the data is replicated, or lazy backup schemes, where data is synchronized at lower priority without blocking current access.
When space becomes scarce, unused objects can be evicted from the cache: either by deleting the object, if it is just a copy that is not needed for redundancy, or by offering it to other nodes within a configured realm of hosts.
If possible (later), object transfer and synchronization can be spread over multiple peers for better bandwidth utilization (BitTorrent-like).
The way objects are stored in the objectstore allows easy recovery of the data even when uberallfs itself is not available.
Uberallfs uses an 'opinionated' design. Protocols include a single version number which fully defines the properties, sizes and algorithms used. Future versions will be backward compatible with a few older versions, but eventually old versions will become unsupported (which may happen earlier when there are security-related problems).
Version 0 is always defined to be 'experimental'; it is used in closed environments for testing and development, never in production. Any version-0 protocol outside of such an environment is considered incompatible with itself.
What follows is a coarse overview of the components that make up uberallfs. Details are described in later chapters.
At the core is an object store where all filesystem objects are cached. Support for volatile objects is planned later, to allow for streaming data that is used only once. For details see below.
The objectstore exposes a virtual-filesystem API to provide a filesystem-like access layer.
User access to the underlying filesystem hierarchy. The primary goal is a Linux FUSE filesystem which maps the underlying uberallfs to an ordinary POSIX-conforming filesystem.
Other frontends are planned later, the Android storage framework for example.
As described in the introduction, the 'trail' pointer used to locate the node which is authoritative for a filesystem object is the main concept of uberallfs. Still, more is needed to make this functional. For example, objects need to be recovered when the trail is broken (lost node). Only nodes which have full access to an object are allowed to become authoritative.
When a node becomes authoritative this does not mean that the data is available there; it only manages the 'ownership'. The object metadata contains references to the nodes which actually hold the data. For reading, the data will be synchronized; writing only invalidates the old references and instantiates new data locally.
Nodes without full access to objects can synchronize data as far as they have permission to do so, and negotiate promises and leases with the authoritative node for race-free data access.
Once access to or authority over an object is granted, the data may be synchronized (for reads). For this, maps of byte ranges and version/generation counts are used. There is no need for rsync-like checksumming, since the authoritative node always knows which data has changed recently.
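A sketch of this idea, assuming byte ranges align exactly between peers (the real map format and alignment rules are not specified here; all names are illustrative):

```rust
/// One entry of the per-object map: a byte range and the generation
/// in which it was last written.
#[derive(Clone, Copy, Debug, PartialEq)]
struct RangeVersion {
    offset: u64,
    len: u64,
    generation: u64,
}

/// The authoritative node knows which ranges changed, so a reader only
/// fetches ranges whose generation is newer than what it already holds;
/// no rsync-like checksumming is needed.
fn ranges_to_fetch(local: &[RangeVersion], authoritative: &[RangeVersion]) -> Vec<RangeVersion> {
    authoritative
        .iter()
        .filter(|auth| {
            // fetch when we hold no matching range, or ours is older
            !local.iter().any(|l| {
                l.offset == auth.offset && l.len == auth.len && l.generation >= auth.generation
            })
        })
        .copied()
        .collect()
}
```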
Objects may become scattered across nodes when frequent random writes happen at different locations of an object. This is mitigated by low-priority object coalescing, which gathers fragments and merges them on single nodes.
Access control is implemented with public keys and signatures. The node which is authoritative over an object is responsible for enforcing the permissions. The access control metadata is self-contained, needing no additional information. Still, due to the distributed nature, there are some loopholes that cannot be closed (discussed below). Essentially, access that was ever granted cannot be reliably revoked at a later time.
A node establishes a session with another node on behalf of a user/key. Each session is then authenticated for this key, which is used for access control. Sessions keep state for some operations; as long as a session is alive these states are valid. When a session dies unexpectedly, these states and all associated data are cleaned up/rolled back.
Nodes are addressed by their public keys. The last seen addresses and names of other nodes are cached for fast lookup. If that fails, a discovery is initiated (details to be worked out).
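A sketch of the lookup path, with 32-byte keys, an in-memory cache and a stubbed-out discovery; none of this is the real API:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

type PubKey = [u8; 32]; // nodes are addressed by their public keys

struct NodeDirectory {
    /// last seen addresses, cached for fast lookup
    cache: HashMap<PubKey, SocketAddr>,
}

impl NodeDirectory {
    /// Resolve a node's address: try the cache first, fall back to discovery.
    fn resolve(&mut self, key: &PubKey) -> Option<SocketAddr> {
        if let Some(addr) = self.cache.get(key) {
            return Some(*addr);
        }
        let addr = self.discover(key)?; // discovery protocol: to be worked out
        self.cache.insert(*key, addr);
        Some(addr)
    }

    fn discover(&self, _key: &PubKey) -> Option<SocketAddr> {
        None // placeholder
    }
}
```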
Creates user and node keys, manages signatures/PKI and the key-agent process.
Future versions will include a distributed public key infrastructure. This augments the existing access control with more advanced features like:
- web of trust for confirming identity and credibility of other keys
- revoking signatures
- key aliasing/delegation
- key renewal.
While uberallfs looks like a hierarchical filesystem, the backend store is a flat key/value object store. The keys are derived from universally unique and secure identifiers. 'Secure' in this context means that no entity can create a collision that goes unnoticed. These identifiers resemble globally unique inode numbers.
There are different types of objects stored under a key, explained later in this document. The main types are 'tree' and 'blob'. A 'tree' is an object that holds named references to sub-object keys, much like a directory in a filesystem. Blob objects contain the file data. Other types contain metadata for security and distribution.
A mounted uberallfs uses a 'tree' object as the root of the mountpoint. From there on, a hierarchy like in any other filesystem is created.
The difference here is that all objects can be distributed over the network, and anyone (with permission to access the object) can reference them within their own hierarchy. This, for example, allows a complete home directory to be shared, as well as mounting the same object (directory) under different names at different positions in the hierarchy. For example, one instance may name a directory './Work/' while another refers to the same tree object as './Arbeit/'.
Eventually (if one is careless) this could lead to directory cycles, which is a major difference from traditional filesystems, where directory cycles are strictly avoided.
The most important difference from traditional filesystems is that directories in uberallfs do not have parents. Frontends keep track of the traversed directories to provide the parent directories.
An object is composed of several parts (see the sketch after this list):
- The Object Type
- Defines whether it is a plain file, a directory and so on (in the future a few more types will be supported).
- The Identifier
- A globally unique 264-bit number (44 flipbase64-encoded characters). There are different types of identifiers, which describe how the object is handled.
- Data
- The data of the object itself; could be directory or file contents, etc.
- Metadata
- Depending on the object type and identifier some extra metadata will be present; some of it is required (like ACLs for shared objects). Examples are maps showing which nodes hold which version of the object data, block hashes for torrent-like distribution, and more.
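Putting these parts together, a hypothetical in-memory representation could look like this (field names and types are illustrative only):

```rust
/// Illustrative composition of an object.
enum ObjectType {
    Tree, // directory-like: named references to other objects
    Blob, // file data
}

struct Object {
    object_type: ObjectType,
    /// Globally unique 264-bit identifier (44 flipbase64 characters).
    identifier: [u8; 33], // 33 bytes = 264 bits
    /// Directory entries or file contents, depending on the type.
    data: Vec<u8>,
    /// Optional extras: ACLs, node/version maps, block hashes, ...
    metadata: Vec<u8>,
}
```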
Stores references to other objects (trees, blobs, symlinks). May store Unix special files (FIFOs, sockets, device nodes); these are initially private, eventually network-transparent nodes may be implemented.
The actual file data. Can be sparse/incomplete when data is not yet synchronized.
PLANNED: parts of blobs with own identifiers.
Mutable objects are identified by a unique (random or hash) number, while immutable objects are identified by a hash over their content. For objects which are constrained by permissions, a digital signature is required to guarantee integrity (see below).
We can further deduce the necessity of three scopes in which these keys are valid:
- private objects that must never be shared but are accessible to the local instance
- public objects that have ownership and access permissions
- anonymous objects without any ownership and with public access
This leads to the following four types of identifiers:
| | private | public | anonymous |
|---|---|---|---|
| mutable | random | random + signature | ¹ |
| immutable | ² | hash + signature | hash |
Note the two unsupported combinations (a sketch of the four supported identifier types follows these notes):
- Anonymous mutable data would lead to security problems like denial-of-service attacks (¹).
- Immutable private objects have no security implications and may be supported at some point when the need arises, e.g. deduplication (²).
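The four supported identifier types from the table could be modeled like this (names and sizes are illustrative; the real encoding is fixed by the protocol version):

```rust
/// The four supported identifier types.
enum Identifier {
    PrivateMutable { random: [u8; 32] },
    PublicMutable { random: [u8; 32], signature: Vec<u8> },
    PublicImmutable { content_hash: [u8; 32], signature: Vec<u8> },
    AnonymousImmutable { content_hash: [u8; 32] },
    // anonymous mutable (¹) and private immutable (²) are intentionally absent
}
```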
Eventually more types might be supported; for example, hashing could be indirect, i.e. the hash over a BitTorrent-like list of hashes. This may even become the default for immutable objects at some point.
File encryption might be added later. This is not directly on topic for uberallfs, as objects are only distributed to nodes that are allowed to (at least) read them. File encryption would remove this requirement and allow proxying/caching on nodes which don't have access to the object.
Security manifest, access control and security-related metadata.
Extra metadata about authority/trail/generation/distribution.
Maps of the nodes holding the data for mutable files. Initially only complete objects, later byte ranges/multiple nodes.
Torrent-like hash list for immutable files.
When an object type changes, its identifier changes. This .link type is then a pointer to the new identifier.
- Size restrictions for files.
- Accepted filename patterns.
- dirs/files only.
- Change the properties/identifier of a file, e.g. when a '.mkv.part' file is renamed to '.mkv' its type is changed to 'public immutable'.
A simple rule engine is planned that automates policies on objects (mostly directories). For example:
Keep lazy stats (coarse granularity, infrequently written to disk, with the risk of losing data in a crash; see the sketch after this list):
- atime
- know when the object was last used
- afreq
- average frequency of use (rolling average?)
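As an illustration, 'afreq' could be kept as an exponentially weighted moving average; this is one plausible reading of 'rolling average', not a decided design:

```rust
/// Lazy stats: coarse granularity, written to disk infrequently.
struct LazyStats {
    atime: u64, // unix seconds of last use
    afreq: f64, // smoothed accesses per day
}

impl LazyStats {
    /// Update on access; `alpha` controls how fast old history fades.
    fn touch(&mut self, now: u64) {
        let days_since = now.saturating_sub(self.atime) as f64 / 86_400.0;
        let instant_freq = if days_since > 0.0 { 1.0 / days_since } else { 1.0 };
        let alpha = 0.1;
        self.afreq = alpha * instant_freq + (1.0 - alpha) * self.afreq;
        self.atime = now;
    }
}
```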
There are (so far) three main components which need to be visible on the host filesystem. These are designed to work in the same place (a shared directory) as well as in different places, with components shared over several uberallfs instances.
The basic use case is that all data resides in a single directory which also serves as the mountpoint for the FUSE filesystem, thus shadowing the underlying data.
The objectstore is freestanding/self-contained; no external configuration is needed.
- objects/
- used for the objectstore
- objects/??/
- any 2 character dir is used for the first level (4096 dirs, base64)
- objects/root/
- symlink to the root dir object
- objects/tmp/
- for safe tempfile handling
- objects/delete/
- deleted objects with some grace period
- objects/volatile/
- can be a tmpfs for temporary objects
- objects/volatile/??/
- any 2 character dir is used for the first level (4096 dirs)
- config/
- configuration files
- objectstore.version
- version identifier
Planned: links to other objectstores on local computer, possibly on slower media for archives.
Objects are stored within the first-level (2-character) directory under their flipbase64 identifier. Any associated metadata has the same name plus a filename extension per kind of metadata.
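A sketch of the resulting path construction; whether the filename keeps the full identifier (rather than stripping the two-character prefix) is an assumption here:

```rust
use std::path::{Path, PathBuf};

/// objects/<first 2 chars>/<identifier>[.<metadata kind>]
fn object_path(store: &Path, ident: &str, meta: Option<&str>) -> PathBuf {
    let mut path = store.join("objects").join(&ident[..2]);
    match meta {
        Some(ext) => path.push(format!("{ident}.{ext}")), // e.g. ".perm"
        None => path.push(ident),
    }
    path
}
```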
Directories in the objectstore refer to the contained objects. This is implemented with specially marked symlinks: the flipbase64 identifier prefixed with '.uberallfs.'. This leverages the underlying filesystem semantics for lookup and other operations.
The 'node' manages the data distribution between other nodes, forming a peer-to-peer network.
For that it keeps the network addresses of other nodes and manages network-related keys.
- config/
- configuration files
- nodes/??/
- information about other nodes
- keystore/
- some of the keys used to operate the node. Others may be in ~/.config/uberallfs and are loaded on startup. Private keys will be isolated, TBD.
- uberallfs.sock
- socket for local node control
- node.version
- version identifier
When FUSE gets mounted it may shadow all of the above and present a POSIX-compatible filesystem. Only files starting with '.uberallfs.' at the root are reserved (control socket etc.).
Local permissions are treated as 'voluntary' in the sense that a node which gathers access to data must not be able to compromise the global security of the filesystem. The objectstore itself runs as a single user and uses permissions only to enforce the basic requirements (immutable objects become read-only and so on). Actual permission/access checks are managed by the outward-facing VFS API. This ensures security across the global network.
The 'public' API of the objectstore is a virtual filesystem layer. Frontends like FUSE use this to access objects. For this a client has to authenticate against public keys, which are then used for permission checks.
The 'perm' object type contains all metadata necessary for access control of the associated object. Any node is obliged to validate access rights on queries.
- Identification
-
We must ensure that an Object Key and Identifier belong to the object in question, and all following security metadata must be derived from this in a provable way. All public keys can be constrained by an expiry date.
- Identifier
- A random number.
- Creator
- Public key of the Creator of this object, with expiry. This can be a once-used key which is deleted after initialization of the metadata. The expiry date here becomes part of the identifier; once passed, the object becomes invalid and can be purged.
- Key Expire
- Creation and expire parameters
- Identifier Signature
- The Identifier is signed with the Creator's key.
- Object Key
- The Identifier and its Signature are hashed together to give the key used in the objectstore. This is not stored in the 'perm' object, as it is the 'name' of that object itself (a sketch of this derivation follows this list).
- Administrative Lists
-
- Super Admins
-
An optional list of public-key/expiry tuples that are allowed to modify the per-permission admins below.
- Super Admins Signature
- The list of Super-Admins, together with a nonce and the Identifier, is signed by the Creator. This indirection makes it possible to dispose of the Creator key and delegate administrative tasks to multiple entities. Caveat: after the Creator key is disposed of, the Super-Admin list can not be changed anymore.
- Per Permission Admins
-
Optional lists, one for each possible permission (read, write, delete, append, …). Keys listed here are allowed to modify the respective ACLs below. (Idea: permission tags on the lists themselves: an admin may add/delete…)
- Per Permission Admins Signature
- Each of the lists above needs to be signed by the Creator or a Super-Admin. This signature covers a nonce and the Identifier as well.
- Access Control Lists
-
Optional lists, one for each possible permission (read, write, delete, append, …). Keys listed here are allowed to access the object in the requested way.
- ACL Signature
- Each of the lists above needs to be signed by the Creator, a Super-Admin or a matching Per-Permission Admin. This signature covers a nonce and the Identifier as well.
- Generation Count and Signature
- Whenever any of the data above changes, a generation counter is incremented and all list blocks plus this generation counter must be signed by one of the administrative keys above (usually the one that did the change).
TODO: creation date and expiry parameters are required; shall these be signed here?
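A sketch of the Object Key derivation described under 'Identification' above, using SHA-256 via the `sha2` crate as a stand-in; the actual hash algorithm is fixed by the protocol version:

```rust
use sha2::{Digest, Sha256};

/// The Object Key is the hash over the Identifier and its Creator signature.
/// It names the 'perm' object in the store, so it is not stored inside it.
fn object_key(identifier: &[u8], identifier_signature: &[u8]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(identifier);
    hasher.update(identifier_signature);
    hasher.finalize().into()
}
```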
- Quorum
- M of N admins must grant a permission for it to be effective.
- Key revocation
- A special tree object which holds revoked signatures; must be safe against DoS, needs some thinking.
- Serial Nonces
- A random u128, initially smaller than (u128::MAX - u64::MAX); it is incremented by adding rand(u32)+1 on each update. Thus the magnitude keeps growing, and any 'new' value must be larger than the last known one. This gives (weak) protection against replay attacks without leaking any information about how frequently the metadata got updated (sketch below).
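A sketch of the scheme; the random source here is only a placeholder, a real implementation would use a CSPRNG:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Placeholder randomness for the sketch, NOT cryptographically secure.
fn weak_rand() -> u128 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos()
}

/// Initial nonce: smaller than u128::MAX - u64::MAX, leaving ample headroom.
fn initial_nonce() -> u128 {
    weak_rand() % (u128::MAX - u64::MAX as u128)
}

/// Increment by rand(u32) + 1: strictly growing, step size unpredictable.
fn next_nonce(current: u128) -> u128 {
    current + (weak_rand() as u32 as u128) + 1
}

/// Replay check: any new value must be larger than the last known one.
fn is_fresh(last_known: u128, candidate: u128) -> bool {
    candidate > last_known
}
```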
TBD: in short, someone who once had (administrative) access to the object can replay an old version of the metadata under some conditions, since the 'trail' and generation count can be incomplete. (Write an example of how this can happen; is there any solution?) Example:
1. A creates a file with B and C as Admins
2. B takes the token from A (A->B)
3. C takes the token from B (A->B->C)
4. C removes B from an administrative list
5. B takes the token back from C (A->B<-C)
6. B replays the 'perm' metadata from step 2 (gains Admin back)
7. A takes the token from B but can not discover the tampering
The only 'weak' protection against this are the expiry dates. When these are short enough, they limit the time window in which such an attack can be mounted and constrain the necessary lifetime of signature revocations.
This can not happen, because the token will never be given to a node that does not have write access.
Uberallfs implements a set of concise permissions, unlike traditional 'rwx' Unix permissions with their overloaded meaning for directories.
These permissions are mapped onto the available permissions of the target operating system. Permissions are tied to (lists of) public keys; there are no users and groups otherwise. There is one special (all-zero?) key which means 'anyone'.
The local system/VFS layer maps Keys to local users to allow a straightforward view of the filesystem contents.
A permission which would allow full access (including deleting/overwriting all data) also allows a node to take authority over an object. Nodes which can't gain authority over an object must pass their mutations to the authoritative node, where they will be validated.
Access control is inclusive: when one could grant oneself access because one's key is listed in the respective Admin list, then one gets that permission implicitly.
Someone who gains knowledge of an identifier also has access to inspect its metadata. Thus there are no permission checks on identifiers themselves; only their lookup is validated.
File permissions are initially relatively simple; only 'append' is added over Unix permissions. They should be self-explanatory.
- read
- read object
- write
- This is the authoritative permission.
- append
WIP!
With directories things become more complicated.
- list
- Allow listing of the directory filenames only (excluding their identifiers).
- list-accessible
- Listing is filtered to content where one has (any) access to.
- list-authoritative
- Listing is filtered to content where one has authority for.
- read
- Allow listing of the directory content, including object identifiers.
- read-accessible
- Listing is filtered to content where one has (any) access to.
- read-authoritative
- Listing is filtered to content where one has authority for.
- add
- Add new objects. Implies ‘list’.
- add-authoritative
- Only add objects for which one is authoritative. Implies 'list-authoritative'.
- add-anonymous
- Add anonymous objects. Implies ‘list-accessible’.
- rename
- Rename an object within the same directory. Moving objects across directories is handled like add/delete on each directory. Implies 'list'.
- rename-authoritative
- Rename an object within the same directory for which one is authoritative. Implies 'list-authoritative'.
- rename-anonymous
- Rename an anonymous object within the same directory. Implies ‘list-accessible’.
- delete
- Delete any object. This is the authoritative permission.
- delete-authoritative
- Delete objects for which one is authoritative.
- delete-anonymous
- Delete anonymous objects.
Further rules can be defined for how objects are created and what extra permissions and keys apply (inherited from the directory, ...).
To prevent collisions, the 'add' and 'rename' permissions imply the necessary 'list' permissions that make the destination visible. To successfully add or rename a file onto an existing name, one also needs the permission to delete the old content.
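A sketch of these implication rules; only a subset of the permissions above is modeled:

```rust
#[derive(PartialEq, Clone, Copy)]
enum DirPerm {
    List,
    ListAccessible,
    ListAuthoritative,
    Add,
    AddAuthoritative,
    AddAnonymous,
    Rename, // rename-authoritative/-anonymous omitted for brevity
}

/// Which 'list' permission a grant implies.
fn implied(perm: DirPerm) -> Option<DirPerm> {
    match perm {
        DirPerm::Add | DirPerm::Rename => Some(DirPerm::List),
        DirPerm::AddAuthoritative => Some(DirPerm::ListAuthoritative),
        DirPerm::AddAnonymous => Some(DirPerm::ListAccessible),
        _ => None,
    }
}

/// A key holds a permission if it is granted directly or implied by a grant.
fn has_perm(granted: &[DirPerm], wanted: DirPerm) -> bool {
    granted.iter().any(|&g| g == wanted || implied(g) == Some(wanted))
}
```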
TBD: what permissions do objects inherit from the parent (dir) in addition to the ones the creator set up?
- leases
- expiry time for leases, default and per node pubkey. Leases are persistent (stored in the token trail).
- promises
- expiry time for promises, default and per node pubkey. Promises are volatile and expire with the session.
Any data sent around is encrypted starting from the first bit (with the target's pubkey). Without knowledge of the keys, not even protocol information is leaked. Incoming packets/connections are simply dropped when they can't be decrypted.
Envisioned usage, work in progress.
The examples here use defaults for most options. Defaults should always be the safe option.
These examples start with 'plumbing' commands to show the steps involved in setting something up. Where applicable, the 'porcelain' equivalent is shown next to them. In general porcelain commands simplify usage but depend on some preconditions, e.g. that the filesystem is already set up and mounted (except for the setup commands); plumbing commands, by contrast, need access to the objectstore or node data and may not work when these directories are hidden behind the mounted filesystem.
$ uberallfs objectstore ./DIR_A init
$ uberallfs node ./DIR_A init
$ uberallfs node ./DIR_A start
$ uberallfs fuse ./DIR_A mount
$ uberallfs init ./DIR_A
$ uberallfs start ./DIR_A

or

$ uberallfs insta ./DIR_A
This results in a uberallfs mounted on './DIR_A' with a (by default) private root directory.
We created a 'private' root directory in the previous step. To be used as a distributed directory, its type must be changed.
$ uberallfs objectstore ./DIR_A chtype public_mutable /
This changes the type and sets up a minimal ACL making the executing user the Creator of the object.
Porcelain will only work on a running (mounted) filesystem.
$ uberallfs chtype public_mutable /path/to/root
The root directory is nothing special and can be shared like any other object; the only difference is that the root directory must be present in the objectstore for almost all other operations (like mounting the filesystem). Thus objectstore initialization can already take care of setting up the root directory.
On the new filesystem the node must be initialized first, to export the (default generated) user's public key.
$ uberallfs node ./DIR_B init
$ uberallfs node ./DIR_B export-key
base64encodedpubkey
$ uberallfs node ./DIR_B init
$ uberallfs export_key ./DIR_B
base64encodedpubkey
- By exported Directory
Give the new user/key access to the root directory in './DIR_A' and export it into an archive. This thin export only contains the minimum metadata necessary to reconstruct the content by querying the original node.
$ uberallfs objectstore ./DIR_A chacl +super_admin base64encodedpubkey /
$ uberallfs objectstore ./DIR_A send --thin / >ARCHIVE
$ uberallfs chacl +super_admin base64encodedpubkey ./DIR_A
$ uberallfs export ./DIR_A ARCHIVE
Now we can import that archive as new root directory and go on.
$ uberallfs objectstore ./DIR_B init --import ARCHIVE
$ uberallfs node ./DIR_B start
$ uberallfs fuse ./DIR_B mount
$ uberallfs import --root ARCHIVE ./DIR_B
$ uberallfs start ./DIR_B
- By URL
Instead of importing an ARCHIVE, one can also supply a URL; the root dir will then be fetched over the network.
The URL has the form 'uberallfs://host:port/identifier' and can be shown by:
$ uberallfs node ./DIR_A show --url /
uberallfs://localhost:port/base64encodedidentifier
$ uberallfs show-url ./DIR_A
uberallfs://localhost:port/base64encodedidentifier
This URL can then be used to bootstrap the new objectstore:
$ uberallfs objectstore ./DIR_B init --no-root
$ uberallfs node ./DIR_B start
$ uberallfs node ./DIR_B fetch uberallfs://localhost:port/base64encodedidentifier
$ uberallfs objectstore ./DIR_B root --set base64encodedidentifier
$ uberallfs fuse ./DIR_B mount
'insta' does all the DWIM magic to get a uberallfs running: initialization, starting the node and mounting the filesystem. It possibly asks some interactive questions (for deploying keys). An existing dir is reused if no data would be overwritten (same root again). By default an 'insta'-created uberallfs is private, but this can be overridden by the '--from' and '--shared' flags.
$ uberallfs insta ./DIR_B --from uberallfs://localhost:port/base64encodedidentifier
- authoritative: Pins an object to be locally available, possibly with short lease times to allow others to mutate it without proxying.
- non-authoritative: Registers with the current token holder that one wants a notification when the object is changed (or moved). This has only session persistence.
Objects can hold a small list of peers where the data must be replicated. There are different modes of operation: N of M operations must succeed before returning, the remaining ones are synced lazily. The affected operations are write, fsync and close. The N of M can be required to span N different realms.
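A sketch of the blocking N-of-M mode, written sequentially for clarity; a real implementation would replicate concurrently, and `replicate` merely stands in for the network call:

```rust
/// Block until `n_required` of the listed peers acknowledged the data;
/// the remaining peers are synced lazily in the background.
fn write_with_redundancy(
    peers: &[&str],
    n_required: usize,
    replicate: impl Fn(&str) -> bool,
) -> Result<(), &'static str> {
    let mut acked = 0;
    for peer in peers {
        if replicate(peer) {
            acked += 1;
            if acked >= n_required {
                return Ok(()); // rest completes lazily
            }
        }
    }
    Err("fewer than N of M replicas reachable")
}
```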
Frees space by dropping unused (lazy atime), non-owned objects. May move owned objects away (asking some other node about taking over).
Fetches and synchronizes all data (before going offline).
Turns the node into offline mode (with '--timeout'?); it won't try to access other nodes even when the internet is up. Normally unnecessary, because reachability is determined automatically on a peer-by-peer basis with some backoff mechanism.
Explicitly detach objects so that they can be changed locally even when offline; they may be merged later.
Merge detached objects back. May need manual conflict resolution in case changes happened on both sides.
What happens when offline and not owning an object (see the sketch after this list):
- On Read:
- old version available, just can't sync
- return stale data
- block (with timeout, then one of the next)
- EIO
- EACCES
- data isn’t locally available (or incomplete)
- block (with timeout, then one of the next)
- EIO
- EACCES
- On Write:
- block (with timeout, then one of the next)
- auto detach
- EIO
- EACCES
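A sketch of the read-side decision; the errno constant and the policy knob are illustrative:

```rust
/// Possible outcomes; which branch applies would be configurable.
enum Outcome {
    Serve(Vec<u8>), // return (possibly stale) data
    Block,          // block, a timeout then falls back to an errno
    Errno(i32),
}

const EIO: i32 = 5;

/// Offline read of a non-owned object.
fn offline_read(local_copy: Option<Vec<u8>>, allow_stale: bool) -> Outcome {
    match local_copy {
        // old version available, just can't sync
        Some(data) if allow_stale => Outcome::Serve(data),
        Some(_) => Outcome::Block,
        // data isn't locally available (or is incomplete)
        None => Outcome::Errno(EIO), // or EACCES, depending on policy
    }
}
```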
Since normal directory objects can be linked at any position in a filesystem tree and have no implicit parent, symlinks that reference parents with '/..' become unreliable and even dangerous. For normal directory objects this is forbidden. The same is true for absolute symlinks.
This restriction is removed for private entries. Note, however, that such directories can not be changed into PublicAcl shared directories while such symlinks are present.
Later, an alternative directory type 'DirectoryWithParent' may be introduced. Such directories have some restrictions: they can only be linked to the parent defined there, and thus can not be root nodes where the filesystem is mounted. Symlinks with parent refs '/..' are allowed to cross into these directories.
Objects may be referenced from different locations all over the network. Deleting an object from a directory is as simple as removing it from there when one has authority over the directory. But this does not mean the object itself can be removed from the objectstore, since other nodes may still refer to it.
- Solutions
-
- When no parts of the object are locally authoritative (no data!) then it can be removed.
- Every object has a 'grace' time for which it will be kept with a 'deleted' flag. Once this grace time has expired it can be deleted.
- Any other node which references this object should poll the object within this grace time. When the authoritative node responds that the object ought to be deleted, then:
- Node without full access may synchronize the object
- Nodes with full access are advised to adopt the object.
- Once adopted and all data is transferred, the data can be deleted. Metadata (trail) needs to stay alive until the grace time has expired.
- A 'discard' command may also be provided that really deletes an object without a grace time. Other nodes querying it will then get an 'EEXIST' and may decide how to go on (revive or discard).
Eventually objects may get lost, when a node takes ownership but is not reachable anymore.
Such an object can be revived by querying the trail, if it is possible to reconstruct the last known state of the object. It may then be revived as a 'detached' object, or brought alive again under a new identifier. This happens per parent directory, as the new identifier is inserted there.
In the event that the initially unreachable node comes alive again, data must be merged from there. The lost node is responsible for merging this, possibly reestablishing the old identifier with the new content.
Maybe mark new object metadata with a 'revived $oldidentifier'; is this necessary?
Directories may have a flag protecting them from 'careless' reviving because they are intended as mountpoints -> a list of nodes/expiry that may mount them (authoritatively only).
When mounting (authoritatively) one needs to check that the dir didn't get revived, by querying possible buddies:
- walk the trail/redundancy copies/authoritative mount list
limited in size and age
TBD
TBD, which logging lib?
what to log?