Improve parts of app backup docs after feedback

seedvault-app · Aug 19, 2024 · 8114025 · 8114025
1 parent 68cf15a
commit 8114025
Showing 1 changed file with 107 additions and 61 deletions.
diff --git a/doc/README.md b/doc/README.md
@@ -20,52 +20,39 @@ in a structured form.
 
 *Chunk*: Larger files are cut into re-usable chunks that are the unit of de-duplication.
 
-*Blob*: A blob is a chunk stored encrypted and compressed in the repository.
+*Blob*: A blob is a chunk stored compressed and encrypted as an individual file in the repository.
 
 *Snapshot*: A snapshot stands for the state of a collection of apps
 that have been backed up at some point in time.
 The state here means the app data as delivered by the system (may be incomplete or absent)
 and the apps themselves as device specific APK files as installed at time of backup.
+It is compressed and encrypted as an individual file in the repository.
 
 *Storage ID*: A storage ID is the SHA-256 hash of the content stored in the repository.
-This ID is required in order to load the file from the repository.
+This ID is required in order to load the file from the repository,
+because it is represented in the stored file name.
 
-## Repository Layout
+## Repository
 
-The name of all files in the repository starts with the lower case hexadecimal representation
-of the storage ID, which is the SHA-256 hash of the file's contents.
-This allows for easy verification of files for accidental modifications,
-like disk read errors, by simply running the program `sha256sum` on the file
-and comparing its output to the file name.
+All data is stored in a repository.
+Repositories consist of several directories and files to store blobs and snapshots.
 
-All files in a repository are only written once and never modified afterwards.
+All files in a repository are encrypted, only written once and never modified afterwards.
 
-```console
-.SeedVaultAndroidBackup
-└── f35860ee961789fb5f92f467455acf165120a319e9dc27044282982111546f26
-    ├── 00
-    │   └── 001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d
-    ├── 01
-    │   └── 01e61554a023c9c1053e026c8a70498fb4732c3ecaaad1bd44003185b493529b
-    ├── ...
-    ├── fe
-    │   └── fe94fd20382e76d0215a743f3f27879d9555947250504de5b0d45321f1f66c7a
-    ├── ff
-    │   └── ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98
-    ├── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec.snapshot
-    ├── 3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce.snapshot
-```
+### Repository Context
 
-Historically, all data that Seedvault saves to backup storage
+Historically, all data that Seedvault saves to external storage
 is below a `.SeedVaultAndroidBackup` directory.
-It can contain one or more repositories as users may use the same backup storage
+It can contain one or more repositories as users may use the same storage location
 for several devices or user accounts.
 As having to choose and remember a specific folder is considered bad UX
 for the regular Android user,
 Seedvault creates a repository for the user.
 
+### Repository ID
+
 The folder name is the ID of the repository.
-It is the result of applying HMAC-SHA256 with the "app backup repoId key"
+It is the result of applying HMAC-SHA256 with the string "app backup repoId key"
 (see [cryptography](#cryptography)) to the
 [`ANDROID_ID`](https://developer.android.com/reference/android/provider/Settings.Secure#ANDROID_ID)
 which is a 64-bit number (expressed as a hexadecimal string) provided by the operating system.
@@ -81,6 +68,35 @@ Hence, a restore may run on a second device while the first device is doing a ba
 Note: A repository used for file backup is stored in a folder of the format `[ANDROID_ID].sv`
 and thus completely separate.
 
+### Repository Layout
+
+The name of all files in the repository starts with the lower case hexadecimal representation
+of the storage ID, which is the SHA-256 hash of the file's contents.
+This allows for easy verification of files for accidental modifications,
+like disk read errors, by simply running the program `sha256sum` on the file
+and comparing its output to the file name.
+
+Blobs are stored in a directory named after the first two characters of their name.
+Snapshots are stored in the repository root and have a `.snapshot` extension.
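
As a rough illustration of the naming scheme described above, the following sketch computes where a file would live in the repository and re-checks its name against its contents. The helper names `storage_id`, `repo_path` and `verify` are hypothetical and not taken from the Seedvault code base; the sketch only assumes what the text states (SHA-256 of the stored file, two-character blob directories, `.snapshot` extension).

```python
import hashlib
from pathlib import PurePosixPath


def storage_id(file_bytes: bytes) -> str:
    """Lower case hexadecimal SHA-256 of the file as stored in the repository."""
    return hashlib.sha256(file_bytes).hexdigest()


def repo_path(file_bytes: bytes, is_snapshot: bool) -> PurePosixPath:
    """Path relative to the repository folder (hypothetical helper)."""
    sid = storage_id(file_bytes)
    if is_snapshot:
        # snapshots sit in the repository root with a .snapshot extension
        return PurePosixPath(sid + ".snapshot")
    # blobs sit in a directory named after the first two characters
    return PurePosixPath(sid[:2]) / sid


def verify(path: PurePosixPath, file_bytes: bytes) -> bool:
    """True if the file name still starts with the hash of the contents."""
    return path.name.startswith(storage_id(file_bytes))
```

This performs the same comparison as running `sha256sum` on a file and checking the output against the file name.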
+
+Example of a repository with two snapshots and several blobs:
+
+```console
+.SeedVaultAndroidBackup
+└── f35860ee961789fb5f92f467455acf165120a319e9dc27044282982111546f26
+    ├── 00
+    │   └── 001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d
+    ├── 01
+    │   └── 01e61554a023c9c1053e026c8a70498fb4732c3ecaaad1bd44003185b493529b
+    ├── ...
+    ├── fe
+    │   └── fe94fd20382e76d0215a743f3f27879d9555947250504de5b0d45321f1f66c7a
+    ├── ff
+    │   └── ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98
+    ├── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec.snapshot
+    ├── 3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce.snapshot
+```
+
 ## Data Format
 
 All files stored in the repository start with a version byte
@@ -89,8 +105,9 @@ followed by an encrypted and authenticated payload (see also [Cryptography](#cry
 The version (currently `0x02`) is used to be able to modify aspects of the design in the future
 and to provide backwards compatibility.
 
-Blobs include the raw bytes of the compressed chunks
-and snapshots their compressed protobuf encoding.
+Blob payloads include the raw bytes of the compressed chunks
+and snapshot payloads their compressed protobuf encoding.
+Compression uses the [zstd](http://www.zstd.net/) algorithm in its default configuration.
 
 ```console
 ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
@@ -99,9 +116,17 @@ and snapshots their compressed protobuf encoding.
 ┗━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┛
 ```
 
+The structure of the encrypted tink payload is explored further
+in [Stream Encryption](#stream-encryption).
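
A minimal sketch of how a reader might split a stored file into its version byte and the encrypted payload, assuming only the layout described above. The function name `split_file` is hypothetical, and decryption of the payload is deliberately left out.

```python
from typing import Tuple

SUPPORTED_VERSION = 0x02  # current version according to this document


def split_file(file_bytes: bytes) -> Tuple[int, bytes]:
    """Return (version, encrypted payload) or raise on unknown versions."""
    if not file_bytes:
        raise ValueError("empty file")
    version = file_bytes[0]
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported version: {version:#04x}")
    # everything after the first byte is the encrypted tink payload
    return version, file_bytes[1:]
```

Rejecting unknown versions before touching the payload keeps the version check separate from decryption.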
+
 ## Snapshots
 
-Encoded [in protobuf format](../app/src/main/proto/snapshot.proto).
+Snapshots include information about the state of a collection of apps
+that have been backed up at some point in time.
+
+They are encoded [in protobuf format](../app/src/main/proto/snapshot.proto), compressed with zstd
+and encrypted.
+
 Example printed as JSON:
 
 ```json
@@ -176,17 +201,20 @@ Example printed as JSON:
 }
 ```
 
-The `chunkIds` and `iconChunkIds` fields contain a list with plain text SHA-256 hashes
+The `chunkIds` and `iconChunkIds` fields contain an ordered list with plain text SHA-256 hashes
 which can be found in the main `blobs` dictionary.
 This contains a mapping from plain text SHA-256 hashes to storage IDs and size information.
+The decrypted and uncompressed chunks concatenated in order result in the original plaintext data.
+The chunks referenced by the `iconChunkIds` field assemble to a ZIP file where each entry is a
+WebP encoded image, one icon for each app in the backup.
+The entry name is the package name of the app.
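
The reassembly described above can be sketched as follows. This assumes the chunks have already been fetched, decrypted and decompressed into a dictionary keyed by chunk ID; `reassemble` and `read_icons` are hypothetical names used only for illustration.

```python
import io
import zipfile
from typing import Dict, List


def reassemble(chunk_ids: List[str], chunks: Dict[str, bytes]) -> bytes:
    """Concatenate plaintext chunks in the order given by the ID list."""
    return b"".join(chunks[chunk_id] for chunk_id in chunk_ids)


def read_icons(icon_chunk_ids: List[str], chunks: Dict[str, bytes]) -> Dict[str, bytes]:
    """Map package name to the WebP bytes of its icon."""
    zip_bytes = reassemble(icon_chunk_ids, chunks)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # each entry name is the package name of an app
        return {name: zf.read(name) for name in zf.namelist()}
```

The order of the ID list matters: concatenating the same chunks in a different order would not yield the original data.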
 
 At the beginning of most operations, we download all available snapshots
 to get information about all blobs that should be available in the repository.
 We may additionally retrieve a list of all blobs directly from the repository
 to ensure they are actually (still) present.
 
-Snapshot files start with the SHA-256 hash of the content of the file
-and use a `.snapshot` extension.
+Snapshot file names start with the SHA-256 hash of their content and use a `.snapshot` extension.
 
 ## Cryptography
 
@@ -198,12 +226,12 @@ This section is based on and thus very similar to encryption of
 Seedvault already uses [BIP39](https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki)
 to give users a mnemonic recovery code and for generating deterministic keys.
 The derived key has 512 bits
-and Seedvault used to use the first 256 bits as an AES key to encrypt app data.
+and Seedvault previously used the first 256 bits as an AES key to encrypt app data.
 Unfortunately, this key's usage is limited by Android's keystore to encryption and decryption.
-Therefore, the second 256 bits are be imported into Android's keystore for use with `HMAC-SHA256`,
-so that this key can act as a main key we can deterministically derive additional keys from
+Therefore, the second 256 bits get imported into Android's keystore for use with `HMAC-SHA256`,
+so that this key can act as a main key that we can deterministically derive additional keys from
 by using HKDF ([RFC5869](https://tools.ietf.org/html/rfc5869)).
-These second 256 bits must not be used for any other purpose in the future.
+These second 256 bits *must not* be used for any other purpose in the future.
 We use them for a main key to avoid users having to handle yet another secret.
 
 For deriving keys, we are only using the HKDF's second 'expand' step,
@@ -216,7 +244,8 @@ This should be fine as the input key material is already a cryptographically str
 
 The original entropy comes from a BIP39 seed (12 words = 128 bit size)
 obtained from Java's `SecureRandom`.
-A PBKDF SHA512 based derivation defined in BIP39 turns this into a 512 bit seed key.
+A PBKDF2 (HMAC-SHA512) based derivation defined in BIP39 turns this into a 512 bit seed key
+as described above.
 
 The derived seed key (512 bit size) gets split into two parts:
 1. legacy app data encryption key (unused) - 256 bit - first half of seed key
@@ -229,34 +258,43 @@ The derived seed key (512 bit size) gets split into two parts:
 
 ### Stream Encryption
 
-When a stream is written to backup storage,
+When a stream is written to the repository,
 it starts with a header consisting of a single byte indicating the backup format version
 (currently `0x02`) followed by the encrypted payload.
 
-All data written to backup storage will be encrypted with a fresh key
+All data written to the repository will be encrypted with a fresh key
 to prevent issues with nonce/IV re-use of a single key.
 
 We derive a stream key from the main key
-by using HKDF's expand step with the UTF-8 byte representation of "app backup stream key"
+by using HKDF's expand step with the UTF-8 byte representation of the string "app backup stream key"
 as info input.
 This stream key is then used to derive a new key for each stream.
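
The derivation above could look roughly like the following sketch of the HKDF expand step from RFC 5869 with HMAC-SHA256. The all-zero `main_key` is a placeholder (the real main key lives in Android's keystore), and the 32 byte output length is an assumption based on the 256 bit keys mentioned in this document. The per-stream key derivation performed by the tink library is not shown.

```python
import hashlib
import hmac


def hkdf_expand(prk: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF 'expand' step (RFC 5869) using HMAC-SHA256."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        # T(n) = HMAC(PRK, T(n-1) | info | n)
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


main_key = bytes(32)  # placeholder only, not a real key
stream_key = hkdf_expand(main_key, "app backup stream key".encode("utf-8"))
```

Because only the info string differs between derivations, each purpose gets an independent key from the same main key.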
 
 Instead of encrypting, authenticating and segmenting a cleartext stream ourselves,
 we have chosen to employ the [tink library](https://github.com/tink-crypto/tink-java) for that task.
-Since it does not allow us to work with imported or derived keys,
-we are only using its [AesGcmHkdfStreaming](https://developers.google.com/tink/streaming-aead/aes_gcm_hkdf_streaming)
-to delegate encryption and decryption of byte streams.
+Since its recommended
+[high-level API](https://developers.google.com/tink/encrypt-large-files-or-data-streams)
+does not allow us to work
+with imported or derived keys,
+we are directly using its
+[AesGcmHkdfStreaming](https://developers.google.com/tink/streaming-aead/aes_gcm_hkdf_streaming)
+primitive to delegate encryption and decryption of byte streams.
 This follows the OAE2 definition as proposed in the paper
 "Online Authenticated-Encryption and its Nonce-Reuse Misuse-Resistance"
 ([PDF](https://eprint.iacr.org/2015/189.pdf)).
 
-It adds its own 40 byte header consisting of header length (1 byte), salt and nonce prefix.
+It adds its own 40 byte header consisting of header length (1 byte), salt (32 bytes)
+and nonce prefix (7 bytes).
 Then it adds one or more segments, each up to 1 MB in size.
 All segments are encrypted with a fresh key that is derived by using HKDF
-on our stream key with another internal random salt (32 bytes) and associated data as info
+on our stream key, the salt and associated data as info
 ([documentation](https://github.com/google/tink/blob/v1.5.0/docs/WIRE-FORMAT.md#streaming-encryption)).
 
-All types of files written to backup storage have the following format:
+Note that the tink documentation (currently) recommends 128 bit keys,
+while we use 256 bit keys.
+Otherwise, we stick to the recommended defaults.
+
+All types of files written to the repository have the following format:
 
 ```console
     ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
@@ -267,11 +305,11 @@ All types of files written to backup storage have the following format:
     ┗━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
 ```
 
-When writing app data or apps to backup storage,
-the authenticated associated data (AAD) contains the backup version as the only byte
-(to prevent downgrade attacks) as it would otherwise not be authenticated.
+When writing to the repository,
+the authenticated associated data (AAD) of each file contains the backup version as the only byte
+(to prevent downgrade attacks) to ensure it is also authenticated.
 Other data is not included as renaming and swapping out files is made impossible
-by their file name being the content hash.
+by their file name starting with the content hash (which must be checked when reading).
 
 ## Content-defined chunking
 
@@ -282,20 +320,27 @@ because in our tests it presented a good trade-off
 between deduplication ratio and number of small chunks.
 Data smaller than 1.5 MiB will not be chunked further and be left as a single chunk.
 
-TODO: Can we do something to prevent watermarking attacks against single chunk files,
-      like padding somewhat them after compression, but before encryption?
-
 The FastCDC algorithm uses a gear table containing 256 random integers with 31 bits.
 When this table changes, the resulting chunks will be different.
 Hence, every repository always uses the same gear table.
 However, to make watermarking attacks harder,
 we use the "app backup gear table key" that gets derived from our main key
 to deterministically compute a gear table using AES CTR to cipher 32 null bytes with a null IV.
 
-TODO: Is it safe to compute the gear table like this? We do derive a dedicated key for this
-      that we don't use for anything else.
-      Still, the IV as well as the encrypted bytes are known to the attacker.
-      Also, it may be be possible to reverse the gear table with chosen plaintext attacks.
+**TODO**: Is it safe to compute the gear table like this? We do derive a dedicated key for this
+          that we don't use for anything else.
+          Still, the IV as well as the encrypted bytes are known to the attacker.
+          Also, it may be possible to reverse the gear table with chosen plaintext attacks.
+
+Since a random gear table computed like this may not be sufficient against attackers
+able to control plaintext, e.g. by sending a file in a messaging app,
+and due to the presence of lots of data consisting of only a single chunk,
+we apply additional random padding to all chunks.
+The plaintext gets padded with `0` to `1024` null bytes.
+This then gets compressed and encrypted.
+The appended null bytes should compress well.
+When decrypting, we use the `uncompressedLength` field in the blobs map from the snapshots
+to discard the padded bytes.
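
A minimal sketch of the padding round trip described above. It uses `zlib` from the Python standard library as a stand-in for zstd and leaves out encryption entirely. The function names are hypothetical, and `uncompressed_length` is assumed to be the plaintext length before padding, as the `uncompressedLength` field suggests.

```python
import secrets
import zlib

MAX_PADDING = 1024  # upper bound on appended null bytes


def pad_and_compress(chunk: bytes) -> bytes:
    """Append 0 to 1024 random-count null bytes, then compress."""
    padding = bytes(secrets.randbelow(MAX_PADDING + 1))
    return zlib.compress(chunk + padding)  # zlib stands in for zstd here


def decompress_and_unpad(compressed: bytes, uncompressed_length: int) -> bytes:
    """Decompress and cut the result back to the recorded plaintext length."""
    padded = zlib.decompress(compressed)
    if len(padded) < uncompressed_length:
        raise ValueError("data shorter than recorded length")
    return padded[:uncompressed_length]
```

Since the padding consists only of null bytes, it costs little after compression while still varying the stored size.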
 
 ## Operations
 
@@ -314,10 +359,8 @@ TODO: Is it safe to compute the gear table like this? We do derive a dedicated k
     * chunks already in the repository are not uploaded again, only their hash recorded
     * new chunks get compressed, encrypted and hashed to determine their storage ID, then uploaded
   * remember ordered list of chunk IDs for the app (and its APKs)
-* add newly packed chunks to local index
-* upload consolidated index, remove old index
 * add all apps, their chunk IDs (and related metadata) to new snapshot and upload that
-* at the end prune old snapshots based on retention rules
+* at the end, delete old snapshots based on retention rules, then do [pruning](#pruning)
 
 ### Resume interrupted backup
 
@@ -413,7 +456,10 @@ that result in the following differences:
 
 The following individuals have reviewed this document and provided helpful feedback.
 
+* Aayush Gupta
+* Michael Rogers
 * Thomas Waldmann
+* Tommy Webb
 * Alexander Weiss
 
 As they have reviewed different parts and different versions at different times,