From b236f2d6c00078bb3fece6f9b7642150718e8340 Mon Sep 17 00:00:00 2001
From: Torsten Grote
Date: Thu, 15 Aug 2024 15:19:35 -0300
Subject: [PATCH] Improve parts of app backup docs after feedback

---
 doc/README.md | 115 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 69 insertions(+), 46 deletions(-)

diff --git a/doc/README.md b/doc/README.md
index 0a3c07aba..6309cbb9e 100644
--- a/doc/README.md
+++ b/doc/README.md
@@ -20,52 +20,39 @@ in a structured form.

 *Chunk*: Larger files are cut into re-usable chunks that are the unit of de-duplication.

-*Blob*: A blob is a chunk stored encrypted and compressed in the repository.
+*Blob*: A blob is a chunk stored compressed and encrypted as an individual file in the repository.

 *Snapshot*: A snapshot stands for the state of a collection of apps
 that have been backed up at some point in time.
 The state here means the app data as delivered by the system (may be incomplete or absent)
 and the apps themselves as device specific APK files as installed at time of backup.
+It is compressed and encrypted as an individual file in the repository.

 *Storage ID*: A storage ID is the SHA-256 hash of the content stored in the repository.
-This ID is required in order to load the file from the repository.
+This ID is required in order to load the file from the repository,
+because it is represented in the stored file name.

-## Repository Layout
+## Repository

-The name of all files in the repository starts with the lower case hexadecimal representation
-of the storage ID, which is the SHA-256 hash of the file's contents.
-This allows for easy verification of files for accidental modifications,
-like disk read errors, by simply running the program `sha256sum` on the file
-and comparing its output to the file name.
+All data is stored in a repository.
+Repositories consist of several directories and files to store blobs and snapshots.

-All files in a repository are only written once and never modified afterwards.
+All files in a repository are encrypted, only written once and never modified afterwards.

-```console
-.SeedVaultAndroidBackup
-└── f35860ee961789fb5f92f467455acf165120a319e9dc27044282982111546f26
-    ├── 00
-    │   └── 001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d
-    ├── 01
-    │   └── 01e61554a023c9c1053e026c8a70498fb4732c3ecaaad1bd44003185b493529b
-    ├── ...
-    ├── fe
-    │   └── fe94fd20382e76d0215a743f3f27879d9555947250504de5b0d45321f1f66c7a
-    ├── ff
-    │   └── ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98
-    ├── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec.snapshot
-    ├── 3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce.snapshot
-```
+### Repository Context

-Historically, all data that Seedvault saves to backup storage
+Historically, all data that Seedvault saves to external storage
 is below a `.SeedVaultAndroidBackup` directory.
-It can contain one or more repositories as users may use the same backup storage
+It can contain one or more repositories as users may use the same storage location
 for several devices or user accounts.
 As having to choose and remember a specific folder
 is considered bad UX for the regular Android user,
 Seedvault creates a repository for the user.

+### Repository ID
+
 The folder name is the ID of the repository.
-It is the result of applying HMAC-SHA256 with the "app backup repoId key"
+It is the result of applying HMAC-SHA256 with the string "app backup repoId key"
 (see [cryptography](#cryptography)) to the
 [`ANDROID_ID`](https://developer.android.com/reference/android/provider/Settings.Secure#ANDROID_ID)
 which is a 64-bit number (expressed as a hexadecimal string) provided by the operating system.
@@ -81,6 +68,35 @@ Hence, a restore may run on a second device while the first device is doing a ba
 Note: A repository used for file backup is stored in a folder of the format `[ANDROID_ID].sv`
 and thus completely separate.
+### Repository Layout
+
+The name of all files in the repository starts with the lower case hexadecimal representation
+of the storage ID, which is the SHA-256 hash of the file's contents.
+This allows for easy verification of files for accidental modifications,
+like disk read errors, by simply running the program `sha256sum` on the file
+and comparing its output to the file name.
+
+Blobs are stored in a directory named after the first two characters of their name.
+Snapshots are stored in the repository root and have a `.snapshot` extension.
+
+Example of a repository with two snapshots and several blobs:
+
+```console
+.SeedVaultAndroidBackup
+└── f35860ee961789fb5f92f467455acf165120a319e9dc27044282982111546f26
+    ├── 00
+    │   └── 001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d
+    ├── 01
+    │   └── 01e61554a023c9c1053e026c8a70498fb4732c3ecaaad1bd44003185b493529b
+    ├── ...
+    ├── fe
+    │   └── fe94fd20382e76d0215a743f3f27879d9555947250504de5b0d45321f1f66c7a
+    ├── ff
+    │   └── ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98
+    ├── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec.snapshot
+    ├── 3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce.snapshot
+```
+
 ## Data Format

 All files stored in the repository start with a version byte
@@ -89,8 +105,9 @@ followed by an encrypted and authenticated payload (see also [Cryptography](#cry
 The version (currently `0x02`) is used to be able to modify aspects of the design in the future
 and to provide backwards compatibility.

-Blobs include the raw bytes of the compressed chunks
-and snapshots their compressed protobuf encoding.
+Blob payloads include the raw bytes of the compressed chunks
+and snapshot payloads their compressed protobuf encoding.
+Compression is using the [zstd](http://www.zstd.net/) algorithm in its default configuration.

 ```console
 ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
@@ -101,7 +118,12 @@ and snapshots their compressed protobuf encoding.
 ## Snapshots

-Encoded [in protobuf format](../app/src/main/proto/snapshot.proto).
+Snapshots include information about the state of a collection of apps
+that have been backed up at some point in time.
+
+It is encoded [in protobuf format](../app/src/main/proto/snapshot.proto), compressed with zstd
+and encrypted.
+
 Example printed as JSON:

 ```json
@@ -185,8 +207,7 @@ to get information about all blobs that should be available in the repository.
 We may additionally retrieve a list of all blobs directly from the repository
 to ensure they are actually (still) present.

-Snapshot files start with the SHA-256 hash of the content of the file
-and use a `.snapshot` extension.
+Snapshot file names start with the SHA-256 hash of their content and use a `.snapshot` extension.

 ## Cryptography

@@ -198,12 +219,12 @@ This section is based on and thus very similar to encryption of
 Seedvault already uses [BIP39](https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki)
 to give users a mnemonic recovery code and for generating deterministic keys.
 The derived key has 512 bits
-and Seedvault used to use the first 256 bits as an AES key to encrypt app data.
+and Seedvault previously used the first 256 bits as an AES key to encrypt app data.
 Unfortunately, this key's usage is limited by Android's keystore to encryption and decryption.
-Therefore, the second 256 bits are be imported into Android's keystore for use with `HMAC-SHA256`,
-so that this key can act as a main key we can deterministically derive additional keys from
+Therefore, the second 256 bits get imported into Android's keystore for use with `HMAC-SHA256`,
+so that this key can act as a main key that we can deterministically derive additional keys from
 by using HKDF ([RFC5869](https://tools.ietf.org/html/rfc5869)).
-These second 256 bits must not be used for any other purpose in the future.
+These second 256 bits *must not* be used for any other purpose in the future.
 We use them for a main key to avoid users having to handle yet another secret.

 For deriving keys, we are only using the HKDF's second 'expand' step,
@@ -229,15 +250,15 @@ The derived seed key (512 bit size) gets split into two parts:

 ### Stream Encryption

-When a stream is written to backup storage,
+When a stream is written to the repository,
 it starts with a header consisting of a single byte indicating the backup format version
 (currently `0x02`) followed by the encrypted payload.

-All data written to backup storage will be encrypted with a fresh key
+All data written to the repository will be encrypted with a fresh key
 to prevent issues with nonce/IV re-use of a single key.

 We derive a stream key from the main key
-by using HKDF's expand step with the UTF-8 byte representation of "app backup stream key"
+by using HKDF's expand step with the UTF-8 byte representation of the string "app backup stream key"
 as info input.
 This stream key is then used to derive a new key for each stream.

@@ -250,13 +271,14 @@ This follows the OAE2 definition as proposed in the paper
 "Online Authenticated-Encryption and its Nonce-Reuse Misuse-Resistance"
 ([PDF](https://eprint.iacr.org/2015/189.pdf)).

-It adds its own 40 byte header consisting of header length (1 byte), salt and nonce prefix.
+It adds its own 40 byte header consisting of header length (1 byte), salt (32 bytes)
+and nonce prefix.
 Then it adds one or more segments, each up to 1 MB in size.
 All segments are encrypted with a fresh key that is derived by using HKDF
-on our stream key with another internal random salt (32 bytes) and associated data as info
+on our stream key, the salt and associated data as info
 ([documentation](https://github.com/google/tink/blob/v1.5.0/docs/WIRE-FORMAT.md#streaming-encryption)).
-All types of files written to backup storage have the following format:
+All types of files written to the repository have the following format:

 ```console
 ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
@@ -267,11 +289,11 @@ All types of files written to backup storage have the following format:
 ┗━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
 ```

-When writing app data or apps to backup storage,
+When writing app data or apps to the repository,
 the authenticated associated data (AAD) contains the backup version as the only byte
 (to prevent downgrade attacks) as it would otherwise not be authenticated.
 Other data is not included as renaming and swapping out files is made impossible
-by their file name being the content hash.
+by their file name starting with the content hash (which must be checked when reading).

 ## Content-defined chunking

@@ -314,8 +336,6 @@ TODO: Is it safe to compute the gear table like this? We do derive a dedicated k
 * chunks already in the repository are not uploaded again, only their hash recorded
 * new chunks get compressed, encrypted and hashed to determine their storage ID, then uploaded
 * remember ordered list of chunk IDs for the app (and its APKs)
-* add newly packed chunks to local index
-* upload consolidated index, remove old index
 * add all apps, their chunk IDs (and related metadata) to new snapshot and upload that
 * at the end prune old snapshots based on retention rules

@@ -413,7 +433,10 @@ that result in the following differences:

 The following individuals have reviewed this document and provided helpful feedback.

+* Aayush Gupta
+* Michael Rogers
 * Thomas Waldmann
+* Tommy Webb
 * Alexander Weiss

 As they have reviewed different parts and different versions at different times,