From 87489ab7ddc8b75f890cd0441cc96433d4ed6295 Mon Sep 17 00:00:00 2001 From: Torsten Grote Date: Thu, 27 Jun 2024 16:33:22 -0300 Subject: [PATCH 1/2] add readme for app dedup research --- doc/README.md | 421 ++++++++++++++++++++++++++++++++++++++++++ storage/doc/design.md | 2 +- 2 files changed, 422 insertions(+), 1 deletion(-) create mode 100644 doc/README.md diff --git a/doc/README.md b/doc/README.md new file mode 100644 index 000000000..0a3c07aba --- /dev/null +++ b/doc/README.md @@ -0,0 +1,421 @@ +# Seedvault App Backup Repository Documentation + +This document is a technical description of how Seedvault backs up apps +installed on an Android device. + +It is inspired by the design of the [Restic backup program](https://restic.net/). +Portions of this document are blatant copies of +[its documentation](https://restic.readthedocs.io/en/latest/100_references.html). +However, due to the different nature of Android backups, +the design [was heavily simplified](#differences-to-restic). + +Note that the backup of files is described [in a different document](../storage/doc/design.md). + +## Terminology + +This section introduces terminology used in this document. + +*Repository*: All data produced during a backup is sent to and stored in a repository +in a structured form. + +*Chunk*: Larger files are cut into re-usable chunks that are the unit of de-duplication. + +*Blob*: A blob is a chunk stored encrypted and compressed in the repository. + +*Snapshot*: A snapshot stands for the state of a collection of apps +that have been backed up at some point in time. +The state here means the app data as delivered by the system (may be incomplete or absent) +and the apps themselves as device specific APK files as installed at time of backup. + +*Storage ID*: A storage ID is the SHA-256 hash of the content stored in the repository. +This ID is required in order to load the file from the repository. + +## Repository Layout + +The name of all files in the repository starts with the lower case hexadecimal representation +of the storage ID, which is the SHA-256 hash of the file's contents. +This allows for easy verification of files for accidental modifications, +like disk read errors, by simply running the program `sha256sum` on the file +and comparing its output to the file name. + +All files in a repository are only written once and never modified afterwards. + +```console +.SeedVaultAndroidBackup +└── f35860ee961789fb5f92f467455acf165120a319e9dc27044282982111546f26 + ├── 00 + │ └── 001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d + ├── 01 + │ └── 01e61554a023c9c1053e026c8a70498fb4732c3ecaaad1bd44003185b493529b + ├── ... + ├── fe + │ └── fe94fd20382e76d0215a743f3f27879d9555947250504de5b0d45321f1f66c7a + ├── ff + │ └── ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98 + ├── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec.snapshot + ├── 3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce.snapshot +``` + +Historically, all data that Seedvault saves to backup storage +is below a `.SeedVaultAndroidBackup` directory. +It can contain one or more repositories as users may use the same backup storage +for several devices or user accounts. +As having to choose and remember a specific folder is considered bad UX +for the regular Android user, +Seedvault creates a repository for the user. + +The folder name is the ID of the repository. +It is the result of applying HMAC-SHA256 with the "app backup repoId key" +(see [cryptography](#cryptography)) to the +[`ANDROID_ID`](https://developer.android.com/reference/android/provider/Settings.Secure#ANDROID_ID) +which is a 64-bit number (expressed as a hexadecimal string) provided by the operating system. +It is unique to each combination of app-signing key, user, and device. +This design should guarantee that only one single app ever backs up to or prunes the repository. +Also, if the user generates a new recovery key and thus all keys change, +Seedvault will not attempt to back up into the repository tied to the old key. + +For restore, we iterate through all repositories and only offer snapshots for restore +that can be decrypted with the key provided by the user. +Hence, a restore may run on a second device while the first device is doing a backup. + +Note: A repository used for file backup is stored in a folder of the format `[ANDROID_ID].sv` +and thus completely separate. + +## Data Format + +All files stored in the repository start with a version byte +followed by an encrypted and authenticated payload (see also [Cryptography](#cryptography)). + +The version (currently `0x02`) is used to be able to modify aspects of the design in the future +and to provide backwards compatibility. + +Blobs include the raw bytes of the compressed chunks +and snapshots their compressed protobuf encoding. + +```console +┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓ +┃ Version ┃ encrypted tink payload ┃ +┃ 0x02 ┃ (with 40 bytes header) ┃ +┗━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┛ +``` + +## Snapshots + +Encoded [in protobuf format](../app/src/main/proto/snapshot.proto). +Example printed as JSON: + +```json +{ + "token": 171993168400, + "name": "Google Pixel 7a", + "androidId": "bbcc5909347d0b83", + "sdkInt": 34, + "androidIncremental": "eng.user.20240618.130805", + "d2d": true, + "apps": { + "org.example.app": { + "time": 171993268400, + "state": "apkAndData", + "type": "full", + "name": "Example app", + "system": false, + "launchableSystemApp": false, + "chunkIds": [ + "001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d", + "ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98" + ], + "apk": { + "versionCode": 2342, + "installer": "org.fdroid.basic", + "signatures": [ + "abd81091c552574506bbb143e3bd856504382666dd495811a539188957522ab2" + ], + "splits": [ + { + "name": "base", + "size": 1337, + "chunkIds": [ + "4f34352be7c1f4d1597d27f5824b42bad2222cf029cc888e9d7318d5e8bc2ad5", + "16759b52e8ef8c3b080f0a8cdb8ac21474cb41070f272a02e28fd1163762336b" + ] + } + ] + } + } + }, + "iconChunkIds": [ + "c11c3cf1c326375b177e768de908e6e423c3d8866715c8ec7159156b6ea82497" + ], + "blobs": { + "001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d": { + "id": "259bb78b72b7b7f7ac1a9613c3c4936b06342471b39b1c41a328acaa2d3ccca5", + "length": 645312, + "uncompressedLength": 695312 + }, + "ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98": { + "id": "5e6a8789571183a97e84f74fe13c4e95b6c5645e2d53fe5cc7aa122897c246b4", + "length": 9455314, + "uncompressedLength": 9855314 + }, + "4f34352be7c1f4d1597d27f5824b42bad2222cf029cc888e9d7318d5e8bc2ad5": { + "id": "a79d2b9e7afa051782cff6424ede1337c8ce05c34d9c2e565e2ed47e7ae80ea4", + "length": 8430346, + "uncompressedLength": 9430346 + }, + "16759b52e8ef8c3b080f0a8cdb8ac21474cb41070f272a02e28fd1163762336b": { + "id": "31aa8e88c8b73df565fd8bbcadfe7e2e79303eb5c0223253353d3003273e3140", + "length": 26549052, + "uncompressedLength": 28549052 + }, + "c11c3cf1c326375b177e768de908e6e423c3d8866715c8ec7159156b6ea82497": { + "id": "5421895dbead0ba83f0993f17b5f72f6c6618c0e9ea849c6c53ac240d1802d0b", + "length": 523496, + "uncompressedLength": 645314 + } + } +} +``` + +The `chunkIds` and `iconChunkIds` fields contain a list with plain text SHA-256 hashes +which can be found in the main `blobs` dictionary. +This contains a mapping from plain text SHA-256 hashes to storage IDs and size information. + +At the beginning of most operations, we download all available snapshots +to get information about all blobs that should be available in the repository. +We may additionally retrieve a list of all blobs directly from the repository +to ensure they are actually (still) present. + +Snapshot files start with the SHA-256 hash of the content of the file +and use a `.snapshot` extension. + +## Cryptography + +This section is based on and thus very similar to encryption of +[files backup](../storage/doc/design.md). + +### Main Key + +Seedvault already uses [BIP39](https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki) +to give users a mnemonic recovery code and for generating deterministic keys. +The derived key has 512 bits +and Seedvault used to use the first 256 bits as an AES key to encrypt app data. +Unfortunately, this key's usage is limited by Android's keystore to encryption and decryption. +Therefore, the second 256 bits are be imported into Android's keystore for use with `HMAC-SHA256`, +so that this key can act as a main key we can deterministically derive additional keys from +by using HKDF ([RFC5869](https://tools.ietf.org/html/rfc5869)). +These second 256 bits must not be used for any other purpose in the future. +We use them for a main key to avoid users having to handle yet another secret. + +For deriving keys, we are only using the HKDF's second 'expand' step, +because the Android Keystore does not give us access +to the key's byte representation (required for first 'extract' step) after importing it. +This should be fine as the input key material is already a cryptographically strong key +(see section 3.3 of RFC 5869 above). + +### Key derivation overview + +The original entropy comes from a BIP39 seed (12 words = 128 bit size) +obtained from Java's `SecureRandom`. +A PBKDF SHA512 based derivation defined in BIP39 turns this into a 512 bit seed key. + +The derived seed key (512 bit size) gets split into two parts: +1. legacy app data encryption key (unused) - 256 bit - first half of seed key +2. main key - 256 bit - second half of seed key used to derive application specific keys: + 1. App Backup: HKDF with info "app backup repoId key" + 2. App Backup: HKDF with info "app backup gear table key" + 3. App Backup: HKDF with info "app backup stream key" + 4. File Backup: HKDF with info "stream key" + 5. File Backup: HKDF with info "Chunk ID calculation" + +### Stream Encryption + +When a stream is written to backup storage, +it starts with a header consisting of a single byte indicating the backup format version +(currently `0x02`) followed by the encrypted payload. + +All data written to backup storage will be encrypted with a fresh key +to prevent issues with nonce/IV re-use of a single key. + +We derive a stream key from the main key +by using HKDF's expand step with the UTF-8 byte representation of "app backup stream key" +as info input. +This stream key is then used to derive a new key for each stream. + +Instead of encrypting, authenticating and segmenting a cleartext stream ourselves, +we have chosen to employ the [tink library](https://github.com/tink-crypto/tink-java) for that task. +Since it does not allow us to work with imported or derived keys, +we are only using its [AesGcmHkdfStreaming](https://developers.google.com/tink/streaming-aead/aes_gcm_hkdf_streaming) +to delegate encryption and decryption of byte streams. +This follows the OAE2 definition as proposed in the paper +"Online Authenticated-Encryption and its Nonce-Reuse Misuse-Resistance" +([PDF](https://eprint.iacr.org/2015/189.pdf)). + +It adds its own 40 byte header consisting of header length (1 byte), salt and nonce prefix. +Then it adds one or more segments, each up to 1 MB in size. +All segments are encrypted with a fresh key that is derived by using HKDF +on our stream key with another internal random salt (32 bytes) and associated data as info +([documentation](https://github.com/google/tink/blob/v1.5.0/docs/WIRE-FORMAT.md#streaming-encryption)). + +All types of files written to backup storage have the following format: + +```console + ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ + ┃ version ┃ tink payload (with 40 bytes header) ┃ + ┃ byte ┃ ┏━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓ ┃ + ┃ ┃ ┃ header length ┃ salt ┃ nonce prefix ┃ encrypted segments ┃ ┃ + ┃ (0x02) ┃ ┗━━━━━━━━━━━━━━━┻━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━┛ ┃ + ┗━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ +``` + +When writing app data or apps to backup storage, +the authenticated associated data (AAD) contains the backup version as the only byte +(to prevent downgrade attacks) as it would otherwise not be authenticated. +Other data is not included as renaming and swapping out files is made impossible +by their file name being the content hash. + +## Content-defined chunking + +We use FastCDC ([PDF](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf)) +to split ZIP files and TAR streams we receive from the system into re-usable chunks. +We aim for an average chunk size of 3 MiB, +because in our tests it presented a good trade-off +between deduplication ratio and number of small chunks. +Data smaller than 1.5 MiB will not be chunked further and be left as a single chunk. + +TODO: Can we do something to prevent watermarking attacks against single chunk files, + like padding somewhat them after compression, but before encryption? + +The FastCDC algorithm uses a gear table containing 256 random integers with 31 bits. +When this table changes, the resulting chunks will be different. +Hence, every repository always uses the same gear table. +However, to make watermarking attacks harder, +we use the "app backup gear table key" that gets derived from our main key +to deterministically compute a gear table using AES CTR to cipher 32 null bytes with a null IV. + +TODO: Is it safe to compute the gear table like this? We do derive a dedicated key for this + that we don't use for anything else. + Still, the IV as well as the encrypted bytes are known to the attacker. + Also, it may be be possible to reverse the gear table with chosen plaintext attacks. + +## Operations + +### Backup + +* download all snapshots to get a mapping of all plaintext hashes (chunk IDs) to storage IDs +* optionally: get a list of all blobs and their storage IDs to confirm they (still) exist +* if APK backup wasn't disabled, for each app: + * get version code and check if we already have an APK with that version code in a snapshot + * if APK was backed up already, re-use list of plaintext hashes for current snapshot + * if APK was not yet backed up, put it (and its splits) through the chunker + * chunks already in the repository are not uploaded again, only their hash recorded + * new chunks get compressed, encrypted and hashed to determine their storage ID, then uploaded +* for each app (the system allows us to back up): + * app data we receive as a tar stream from the system gets chunked + * chunks already in the repository are not uploaded again, only their hash recorded + * new chunks get compressed, encrypted and hashed to determine their storage ID, then uploaded + * remember ordered list of chunk IDs for the app (and its APKs) +* add newly packed chunks to local index +* upload consolidated index, remove old index +* add all apps, their chunk IDs (and related metadata) to new snapshot and upload that +* at the end prune old snapshots based on retention rules + +### Resume interrupted backup + +* the mapping from chunk ID to storage ID is only stored in snapshots +* if backup gets interrupted (e.g. power or network outage), no snapshot gets written +* due to encryption, storage IDs are not deterministic, + i.e. the same chunk stored two times will have two different storage IDs +* therefore, we keep a local cache for the mapping of chunk ID to storage ID +* before starting a backup, we retrieve all uploaded storage IDs + and remove all IDs from our local cache that don't exist +* when processing new chunks, we check if the chunk ID exists in our local cache + and re-use the existing blob from the mapped storage ID +* all mappings will be added to snapshot that gets uploaded at the end of backup + +### Restore + +* download all snapshots and let the user choose one for restore +* for all apps that were selected for restore: + * reinstall app from APKs assembled from fetched chunks (see below for details) + * look up chunk IDs for app from snapshot + * download blobs for the chunk IDs as specified in snapshot + * give decrypted and decompressed chunk data back to system + +Repository re-use: + +* if restore happened on a new device, offer user to tie repository to new device +* the same main key is being used on new device due to entry of BIP39 recovery code +* the repository ID depends on the main key and the `ANDROID_ID` + which is different on the new device +* to tie repository to new device, the repository folder will be renamed to the repository ID + specific to the new device +* this will prevent the old device from doing further writes into the repository + and ensure the device will be exclusively writing to the repository + +### Pruning + +* download all snapshots +* assemble list of blobs still referenced by existing snapshots +* delete all blobs that are not referenced anymore + +### Repository data verification + +There are two types of checks that can be performed: +* Structural consistency and integrity, e.g. snapshots and blobs + * download all snapshots and a list of all blobs and their size + * check that all blobs referenced in snapshots exist on storage and have expected size +* Integrity of the actual data that you backed up + * do the structural checking as above + * offer full checking and random sampling as options as full check will be slow + * also download all/some blobs check their hash, decrypt and check chunk ID + +## Read and Write Ordering + +New snapshots only get added after all blobs they reference have been uploaded. + +Blobs only get deleted after all snapshots that reference them have been deleted first. + +## Differences to restic + +Even though the design was initially inspired by restic, +the specific nature of Android app backups allowed us to make many simplifications +that result in the following differences: + +* no config file needed: + * repository ID is in the folder name + * version is in first byte of all files + * chunker differentiator is based on key +* no keys files + * we re-use the existing BIP39 recovery code whose derived key is stored in the device key store +* blobs are not being combined into pack files, but saved directly + * tests with real-world Android backups have shown 11% of files smaller than 10 KiB + and 39% of files smaller than 500 KiB + * a typical backup has around 500 files, so the number of small files seemed manageable enough + to not justify the increased complexity of pack files and indexes +* no indexes + * since we have no pack files, we don't need an index for blobs in pack files + * a mapping of chunk ID to storage ID of blobs is stored in the snapshot +* no tree blobs + * Android app backups are flat and not in a tree structure, just a list of apps and a list of APKs + (one APK can have several APK splits) + * metadata needed for each app is directly stored in the snapshot which still has manageable size +* no locks + * we only allow a single app to use a repository by binding it to the app/user/device + * restore while another app is backing up is possible, but this doesn't require locks, + if we follow the proper read/write ordering +* different encryption and MACs + * we were already employing AES GCM via Google's tink library + for legacy app backup and files backup, so we stuck with it +* snapshot metadata is Android specific and snapshots are not stored in a dedicated folder, + but have a `.snapshot` extension. So their name isn't the content hash, but starts with it. + +## Acknowledgements + +The following individuals have reviewed this document and provided helpful feedback. + +* Thomas Waldmann +* Alexander Weiss + +As they have reviewed different parts and different versions at different times, +this acknowledgement should not be mistaken for their endorsement of the current design +nor the final implementation. diff --git a/storage/doc/design.md b/storage/doc/design.md index 88cec2e39..0600df856 100644 --- a/storage/doc/design.md +++ b/storage/doc/design.md @@ -195,7 +195,7 @@ by using HKDF's expand step with the UTF-8 byte representation of "stream key" a This stream key is then used to derive a new key for each stream. Instead of encrypting, authenticating and segmenting a cleartext stream ourselves, -we have chosen to employ the [tink library](https://github.com/google/tink) for that task. +we have chosen to employ the [tink library](https://github.com/tink-crypto/tink-java) for that task. Since it does not allow us to work with imported or derived keys, we are only using its [AesGcmHkdfStreaming](https://google.github.io/tink/javadoc/tink-android/1.5.0/index.html?com/google/crypto/tink/subtle/AesGcmHkdfStreaming.html) to delegate encryption and decryption of byte streams. From 18403bbfe499a3734e4192de323a2e2d0eb179bb Mon Sep 17 00:00:00 2001 From: Torsten Grote Date: Thu, 15 Aug 2024 15:19:35 -0300 Subject: [PATCH 2/2] Improve parts of app backup docs after feedback --- doc/README.md | 235 +++++++++++++++++++++++++++++++++++++------------- 1 file changed, 174 insertions(+), 61 deletions(-) diff --git a/doc/README.md b/doc/README.md index 0a3c07aba..36abec0d6 100644 --- a/doc/README.md +++ b/doc/README.md @@ -20,52 +20,39 @@ in a structured form. *Chunk*: Larger files are cut into re-usable chunks that are the unit of de-duplication. -*Blob*: A blob is a chunk stored encrypted and compressed in the repository. +*Blob*: A blob is a chunk stored compressed and encrypted as an individual file in the repository. *Snapshot*: A snapshot stands for the state of a collection of apps that have been backed up at some point in time. The state here means the app data as delivered by the system (may be incomplete or absent) and the apps themselves as device specific APK files as installed at time of backup. +It is compressed and encrypted as an individual file in the repository. *Storage ID*: A storage ID is the SHA-256 hash of the content stored in the repository. -This ID is required in order to load the file from the repository. +This ID is required in order to load the file from the repository, +because it is represented in the stored file name. -## Repository Layout +## Repository -The name of all files in the repository starts with the lower case hexadecimal representation -of the storage ID, which is the SHA-256 hash of the file's contents. -This allows for easy verification of files for accidental modifications, -like disk read errors, by simply running the program `sha256sum` on the file -and comparing its output to the file name. +All data is stored in a repository. +Repositories consist of several directories and files to store blobs and snapshots. -All files in a repository are only written once and never modified afterwards. +All files in a repository are encrypted, only written once and never modified afterwards. -```console -.SeedVaultAndroidBackup -└── f35860ee961789fb5f92f467455acf165120a319e9dc27044282982111546f26 - ├── 00 - │ └── 001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d - ├── 01 - │ └── 01e61554a023c9c1053e026c8a70498fb4732c3ecaaad1bd44003185b493529b - ├── ... - ├── fe - │ └── fe94fd20382e76d0215a743f3f27879d9555947250504de5b0d45321f1f66c7a - ├── ff - │ └── ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98 - ├── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec.snapshot - ├── 3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce.snapshot -``` +### Repository Context -Historically, all data that Seedvault saves to backup storage +Historically, all data that Seedvault saves to external storage is below a `.SeedVaultAndroidBackup` directory. -It can contain one or more repositories as users may use the same backup storage +It can contain one or more repositories as users may use the same storage location for several devices or user accounts. As having to choose and remember a specific folder is considered bad UX for the regular Android user, Seedvault creates a repository for the user. +### Repository ID + The folder name is the ID of the repository. -It is the result of applying HMAC-SHA256 with the "app backup repoId key" +It is the result of applying HMAC-SHA256 with the string "app backup repoId key" (see [cryptography](#cryptography)) to the [`ANDROID_ID`](https://developer.android.com/reference/android/provider/Settings.Secure#ANDROID_ID) which is a 64-bit number (expressed as a hexadecimal string) provided by the operating system. @@ -81,6 +68,35 @@ Hence, a restore may run on a second device while the first device is doing a ba Note: A repository used for file backup is stored in a folder of the format `[ANDROID_ID].sv` and thus completely separate. +### Repository Layout + +The name of all files in the repository starts with the lower case hexadecimal representation +of the storage ID, which is the SHA-256 hash of the file's contents. +This allows for easy verification of files for accidental modifications, +like disk read errors, by simply running the program `sha256sum` on the file +and comparing its output to the file name. + +Blobs are stored in a directory named after the first two characters of their name. +Snapshots are stored in the repository root and have a `.snapshot` extension. + +Example of a repository with two snapshots and several blobs: + +```console +.SeedVaultAndroidBackup +└── f35860ee961789fb5f92f467455acf165120a319e9dc27044282982111546f26 + ├── 00 + │ └── 001b527ebb5eb57f4934bafeb998cb08595ed7ced603d9d25bd3c50b338f939d + ├── 01 + │ └── 01e61554a023c9c1053e026c8a70498fb4732c3ecaaad1bd44003185b493529b + ├── ... + ├── fe + │ └── fe94fd20382e76d0215a743f3f27879d9555947250504de5b0d45321f1f66c7a + ├── ff + │ └── ff2e9f9c75d211602c5a68f0471aa549e2499a3fa9496255b26678f4aad75a98 + ├── 22a5af1bdc6e616f8a29579458c49627e01b32210d09adb288d1ecda7c5711ec.snapshot + ├── 3ec79977ef0cf5de7b08cd12b874cd0f62bbaf7f07f3497a5b1bbcc8cb39b1ce.snapshot +``` + ## Data Format All files stored in the repository start with a version byte @@ -89,8 +105,9 @@ followed by an encrypted and authenticated payload (see also [Cryptography](#cry The version (currently `0x02`) is used to be able to modify aspects of the design in the future and to provide backwards compatibility. -Blobs include the raw bytes of the compressed chunks -and snapshots their compressed protobuf encoding. +Blob payloads include the raw bytes of the compressed chunks +and snapshot payloads their compressed protobuf encoding. +Compression is using the [zstd](http://www.zstd.net/) algorithm in its default configuration. ```console ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓ @@ -99,9 +116,17 @@ and snapshots their compressed protobuf encoding. ┗━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━┛ ``` +The structure of the encrypted tink payload is explored further +in [Stream Encryption](#stream-encryption). + ## Snapshots -Encoded [in protobuf format](../app/src/main/proto/snapshot.proto). +Snapshots include information about the state of a collection of apps +that have been backed up at some point in time. + +It is encoded [in protobuf format](../app/src/main/proto/snapshot.proto), compressed with zstd +and encrypted. + Example printed as JSON: ```json @@ -176,17 +201,20 @@ Example printed as JSON: } ``` -The `chunkIds` and `iconChunkIds` fields contain a list with plain text SHA-256 hashes +The `chunkIds` and `iconChunkIds` fields contain an ordered list with plain text SHA-256 hashes which can be found in the main `blobs` dictionary. This contains a mapping from plain text SHA-256 hashes to storage IDs and size information. +The decrypted and uncompressed chunks concatenated in order result in the original plaintext data. +The `iconChunkIds` field assemble to a ZIP file where each entry is a +WebP encoded image, one icon for each app in the backup. +The entry name is the package name of the app. At the beginning of most operations, we download all available snapshots to get information about all blobs that should be available in the repository. We may additionally retrieve a list of all blobs directly from the repository to ensure they are actually (still) present. -Snapshot files start with the SHA-256 hash of the content of the file -and use a `.snapshot` extension. +Snapshot file names start with the SHA-256 hash of their content and use a `.snapshot` extension. ## Cryptography @@ -198,12 +226,12 @@ This section is based on and thus very similar to encryption of Seedvault already uses [BIP39](https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki) to give users a mnemonic recovery code and for generating deterministic keys. The derived key has 512 bits -and Seedvault used to use the first 256 bits as an AES key to encrypt app data. +and Seedvault previously used the first 256 bits as an AES key to encrypt app data. Unfortunately, this key's usage is limited by Android's keystore to encryption and decryption. -Therefore, the second 256 bits are be imported into Android's keystore for use with `HMAC-SHA256`, -so that this key can act as a main key we can deterministically derive additional keys from +Therefore, the second 256 bits get imported into Android's keystore for use with `HMAC-SHA256`, +so that this key can act as a main key that we can deterministically derive additional keys from by using HKDF ([RFC5869](https://tools.ietf.org/html/rfc5869)). -These second 256 bits must not be used for any other purpose in the future. +These second 256 bits *must not* be used for any other purpose in the future. We use them for a main key to avoid users having to handle yet another secret. For deriving keys, we are only using the HKDF's second 'expand' step, @@ -216,7 +244,8 @@ This should be fine as the input key material is already a cryptographically str The original entropy comes from a BIP39 seed (12 words = 128 bit size) obtained from Java's `SecureRandom`. -A PBKDF SHA512 based derivation defined in BIP39 turns this into a 512 bit seed key. +A PBKDF SHA512 based derivation defined in BIP39 turns this into a 512 bit seed key +as described above. The derived seed key (512 bit size) gets split into two parts: 1. legacy app data encryption key (unused) - 256 bit - first half of seed key @@ -229,34 +258,43 @@ The derived seed key (512 bit size) gets split into two parts: ### Stream Encryption -When a stream is written to backup storage, +When a stream is written to the repository, it starts with a header consisting of a single byte indicating the backup format version (currently `0x02`) followed by the encrypted payload. -All data written to backup storage will be encrypted with a fresh key +All data written to the repository will be encrypted with a fresh key to prevent issues with nonce/IV re-use of a single key. We derive a stream key from the main key -by using HKDF's expand step with the UTF-8 byte representation of "app backup stream key" +by using HKDF's expand step with the UTF-8 byte representation of the string "app backup stream key" as info input. This stream key is then used to derive a new key for each stream. Instead of encrypting, authenticating and segmenting a cleartext stream ourselves, we have chosen to employ the [tink library](https://github.com/tink-crypto/tink-java) for that task. -Since it does not allow us to work with imported or derived keys, -we are only using its [AesGcmHkdfStreaming](https://developers.google.com/tink/streaming-aead/aes_gcm_hkdf_streaming) -to delegate encryption and decryption of byte streams. +Since it does not allow us to work with imported or derived keys +and its recommended +[high-level API](https://developers.google.com/tink/encrypt-large-files-or-data-streams) +requires this, +we are directly using its +[AesGcmHkdfStreaming](https://developers.google.com/tink/streaming-aead/aes_gcm_hkdf_streaming) +primitive to delegate encryption and decryption of byte streams. This follows the OAE2 definition as proposed in the paper "Online Authenticated-Encryption and its Nonce-Reuse Misuse-Resistance" ([PDF](https://eprint.iacr.org/2015/189.pdf)). -It adds its own 40 byte header consisting of header length (1 byte), salt and nonce prefix. +It adds its own 40 byte header consisting of header length (1 byte), salt (32 bytes) +and nonce prefix. Then it adds one or more segments, each up to 1 MB in size. All segments are encrypted with a fresh key that is derived by using HKDF -on our stream key with another internal random salt (32 bytes) and associated data as info +on our stream key, the salt and associated data as info ([documentation](https://github.com/google/tink/blob/v1.5.0/docs/WIRE-FORMAT.md#streaming-encryption)). -All types of files written to backup storage have the following format: +Note that the tink documentation (currently) recommends 128 bit keys, +while we use 256 bit keys. +Otherwise, we stick to the recommended defaults. + +All types of files written to the repository have the following format: ```console ┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ @@ -267,11 +305,11 @@ All types of files written to backup storage have the following format: ┗━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ ``` -When writing app data or apps to backup storage, -the authenticated associated data (AAD) contains the backup version as the only byte -(to prevent downgrade attacks) as it would otherwise not be authenticated. +When writing to the repository, +the authenticated associated data (AAD) of each file contains the backup version as the only byte +(to prevent downgrade attacks) to ensure it is also authenticated. Other data is not included as renaming and swapping out files is made impossible -by their file name being the content hash. +by their file name starting with the content hash (which must be checked when reading). ## Content-defined chunking @@ -282,20 +320,22 @@ because in our tests it presented a good trade-off between deduplication ratio and number of small chunks. Data smaller than 1.5 MiB will not be chunked further and be left as a single chunk. -TODO: Can we do something to prevent watermarking attacks against single chunk files, - like padding somewhat them after compression, but before encryption? - The FastCDC algorithm uses a gear table containing 256 random integers with 31 bits. When this table changes, the resulting chunks will be different. Hence, every repository always uses the same gear table. However, to make watermarking attacks harder, we use the "app backup gear table key" that gets derived from our main key to deterministically compute a gear table using AES CTR to cipher 32 null bytes with a null IV. +If an attacker is somehow able to reverse engineer the gear table *and* the derived key, +they know how we chunk larger files, but they should be unable to retrieve our main key. -TODO: Is it safe to compute the gear table like this? We do derive a dedicated key for this - that we don't use for anything else. - Still, the IV as well as the encrypted bytes are known to the attacker. - Also, it may be be possible to reverse the gear table with chosen plaintext attacks. +Since a random gear table computed like this may not be sufficient for attackers +able to control (part of the) plaintext, e.g. sending a file in a messaging app, +and due to the presence of lots of data consisting of only a single chunk, +we apply additional random padding to all chunks. +The plaintext gets padded with random bytes after compression and before encryption. + +**TODO** determine the details of the padding. ## Operations @@ -314,10 +354,8 @@ TODO: Is it safe to compute the gear table like this? We do derive a dedicated k * chunks already in the repository are not uploaded again, only their hash recorded * new chunks get compressed, encrypted and hashed to determine their storage ID, then uploaded * remember ordered list of chunk IDs for the app (and its APKs) -* add newly packed chunks to local index -* upload consolidated index, remove old index * add all apps, their chunk IDs (and related metadata) to new snapshot and upload that -* at the end prune old snapshots based on retention rules +* at the end, delete old snapshots based on retention rules, then do [pruning](#pruning). ### Resume interrupted backup @@ -375,6 +413,77 @@ New snapshots only get added after all blobs they reference have been uploaded. Blobs only get deleted after all snapshots that reference them have been deleted first. +## Threat Model + +It is a design goal to be able to securely store backups +in a location that is not completely trusted +(e.g. a shared system where others can potentially access the files). + +### General assumptions + + * The device a backup is created on is trusted. + This is the most basic requirement, and it is essential for creating trustworthy backups. + * The user does not share the recovery code with an attacker. + * There is no protection against attackers deleting files at the storage location. + Nothing can be done about this. + If this needs to be guaranteed, get a secure location without any access from third parties, + e.g. a flash drive. + * Advances in cryptography attacks against the cryptographic primitives used + (i.e. AES-GCM-256 and SHA-256) have not occurred. + Such advances could render the confidentiality or integrity protections useless. + * Sufficient advances in computing have not occurred to make brute-force attacks + against cryptographic protections feasible. + +### Guarantees + + * Unencrypted content of stored files and metadata (other than size and creation time of files) + cannot be accessed without the recovery code for the repository. + Everything is encrypted and authenticated. + * Modifications to data stored in the repository (due to bad RAM, broken hard-disk, etc.) + can be detected. + * Data that has been tampered with will not be decrypted. + +### Possible attacks + +With the aforementioned assumptions and guarantees in mind, +the following are examples of things an attacker could achieve in various circumstances. + +An adversary with read access to your backup storage location could: + + * Attempt a brute force recovery code guessing attack against a copy of the repository. + * Infer the size of backups by using creation timestamps of repository files. + * Make guesses if a user is using a specific app + by comparing the size of newly appearing files in the repository + with the (split) APKs of that app and its update cycle. + Applied padding and key-based chunking may reduce attacker's confidence in guesses. + +An adversary with network access could: + + * Attempt to DoS the server storing the backup repository + or the network connection between client and server. + * Determine from where you create your backups (i.e. the location where the requests originate). + * Determine where you store your backups (i.e. which provider/target system). + * If the backend uses an encrypted connection, + infer the size of backups by observing network traffic. + If no encrypted connection is used, everything an attacker with read access (above) can do. + +### Attacks if assumptions are violated + +The following are examples of the implications associated +with violating some of the aforementioned assumptions. + +An adversary who compromises (via malware, physical access, etc.) the device making backups could: + + * Render the entire backup process untrustworthy + (e.g., intercept recovery code, copy files, manipulate data). + * Recover the encryption key from memory, thus decrypt backups from past and future. + * Create snapshots (containing garbage data) and eventually remove all correct snapshots. + +An adversary with write access to your files at the storage location could: + + * Delete or manipulate your backups, + thereby impairing your ability to restore from the compromised storage location. + ## Differences to restic Even though the design was initially inspired by restic, @@ -413,8 +522,12 @@ that result in the following differences: The following individuals have reviewed this document and provided helpful feedback. +* Aayush Gupta +* Michael Rogers * Thomas Waldmann +* Tommy Webb * Alexander Weiss +* Opal Wright As they have reviewed different parts and different versions at different times, this acknowledgement should not be mistaken for their endorsement of the current design