From 9ec6a4228eb500380212004206800a9746171d0e Mon Sep 17 00:00:00 2001 From: eskimor Date: Sat, 13 Jul 2024 19:42:37 +0200 Subject: [PATCH 1/6] Offchain runtime upgrades --- ...102-offchain-parachain-runtime-upgrades.md | 297 ++++++++++++++++++ 1 file changed, 297 insertions(+) create mode 100644 text/0102-offchain-parachain-runtime-upgrades.md diff --git a/text/0102-offchain-parachain-runtime-upgrades.md b/text/0102-offchain-parachain-runtime-upgrades.md new file mode 100644 index 000000000..dd5977aa8 --- /dev/null +++ b/text/0102-offchain-parachain-runtime-upgrades.md @@ -0,0 +1,297 @@ +# RFC-0000: Feature Name Here + +| | | +| --------------- | ------------------------------------------------------------------------------------------- | +| **Start Date** | 13 July 2024 | +| **Description** | Implement off-chain parachain runtime upgrades | +| **Authors** | eskimor | + +## Summary + +Change the upgrade process of a parachain runtime upgrade to become an off-chain +process with regard to the relay chain. Upgrades are still contained in +parachain blocks, but will no longer need to end up in relay chain blocks nor in +relay chain state. + +## Motivation + +Having parachain runtime upgrades go through the relay chain has always been +seen as a scalability concern. Due to optimizations in statement +distribution and asynchronous backing it became less crucial and got +de-prioritized; the original issue can be found +[here](https://github.com/paritytech/polkadot-sdk/issues/971). + +With the introduction of Agile Coretime and our general efforts to further +lower the barrier to entry for Polkadot, the issue becomes more relevant again: +We would like to reduce the required storage deposit for PVF registration, with +the aim to not only make it cheaper to run a parachain (bulk + on-demand +coretime), but also reduce the amount of capital required for the deposit. 
With +this we would hope for far more parachains to get registered, thousands, +potentially even tens of thousands. With so many PVFs registered, updates are +expected to become more frequent and even attacks on service quality for other +parachains would become a higher risk. + +## Stakeholders + +- Parachain Teams +- Relay Chain Node implementation teams +- Relay Chain runtime developers + +## Explanation + +The issues with on-chain runtime upgrades are: + +1. Needlessly costly. +2. A single runtime upgrade more or less occupies an entire relay chain block, thus it + might also affect other parachains, especially if their candidates are also + not negligible, due to messages for example, or if they want to upgrade their + runtime at the same time. +3. The signalling of the parachain to notify the relay chain of an upcoming + runtime upgrade already contains the upgrade. Therefore the only way to rate + limit upgrades is to drop an already distributed update in the size of + megabytes, with the result that the parachain misses a block and, more + importantly, tries again with the very next block, until it finally + succeeds. If we imagine reducing the capacity for runtime upgrades to, let's say, 1 + every 100 relay chain blocks, this results in lots of wasted effort and lost + blocks. + +We discussed introducing a separate signalling before submitting the actual +runtime, but I think we should just go one step further and make upgrades fully +off-chain. 
+ +### Introduce a new UMP message type `RequestCodeUpgrade` + +As part of elastic scaling we are already planning to increase flexibility of [UMP +messages](https://github.com/polkadot-fellows/RFCs/issues/92#issuecomment-2144538974); we can now use this to our advantage and introduce another UMP message: + +```rust +enum UMPSignal { + // For elastic scaling + OnCore(CoreIndex), + // For off-chain upgrades + RequestCodeUpgrade(Hash), +} +``` + +We could also make that new message a regular XCM, calling an extrinsic on the +relay chain, but we will want to look into that message right after validation +on the backers on the node side, making a straightforward semantic message more +apt for the purpose. + + +### Handle `RequestCodeUpgrade` on backers + +We will introduce a new request/response protocol for both collators and +validators, with the following request/response: + +```rust +struct RequestCode { + para_id: ParaId, + code_hash: Hash, +} + +struct CodeResponse(Vec<u8>) +``` + +This protocol will be used by backers to request the PVF from collators in the +following conditions: + +1. They received a collation sending `RequestCodeUpgrade`. +2. They received a collation, but they don't yet have the code that was + previously registered on the relay chain. (E.g. disk pruned, new validator) + +In case they received the collation via PoV distribution instead of from the +collator itself, they will use the exact same message to fetch from the validator +they got the PoV from. + +### Get the new code to all validators + +Once the candidate issuing `RequestCodeUpgrade` got backed on chain, validators +will start fetching the code from the backers as part of availability +distribution. + +To mitigate attack vectors we should make sure that serving requests for code +can be treated as low priority requests. 
Thus I am suggesting the following +scheme: + +Validators will notice via a runtime API (TODO: Define) that a new code has been requested; the +API will return the `Hash` and a counter, which starts at some configurable +value, e.g. 10. The validators are now aware of the new hash and start fetching, +but they don't have to wait for the fetch to succeed to sign their bitfield. + +Then on each further candidate from that chain that counter gets decremented. +Validators which have not yet succeeded fetching will now try again. This game +continues until the counter reaches `0`. Now it is mandatory to have the code in +order to sign a `1` in the bitfield. + +PVF pre-checking will happen after the candidate which brought the counter to +`0` has been successfully included and thus is also able to assume that 2/3 of +the validators have the code. + +This scheme serves two purposes: + +1. Fetching can happen over a longer period of time with low priority. E.g. if + we waited for the PVF at the very first availability distribution, this might + actually affect liveness of other chains on the same core. Distributing + megabytes of data to a thousand validators might take a bit. Thus this helps + isolate parachains from each other. +2. By configuring the initial counter value we can affect how much an upgrade + costs. E.g. forcing the parachain to produce 10 blocks means 10x the cost + for issuing an update. If too frequent upgrades ever become a problem for the + system, we have a knob to make them more costly. + +### On-chain code upgrade process + +First when a candidate is backed we need to make the new hash available +(together with a counter) via a +runtime API so validators in availability distribution can check for it and +fetch it if changed (see previous section). 
For performance reasons, I think we +should not do an additional call, but replace the [existing one](https://github.com/paritytech/polkadot-sdk/blob/d2fd53645654d3b8e12cbf735b67b93078d70113/polkadot/node/subsystem-util/src/runtime/mod.rs#L355) with one containing the new additional information (Option<(Hash, Counter)>). + +Once the candidate gets included (counter 0), the hash is given to pre-checking +and only after pre-checking succeeded (and a full session passed) it is finally +enacted and the parachain can switch to the new code. (Same process as it used +to be.) + +### Handling new validators +#### Backers + +If a backer receives a collation for a parachain for which it does not yet have +the code as enacted on chain (see "On-chain code upgrade process"), it will use the +above request/response protocol to fetch it from the peer that sent the collation. + +#### Availability Distribution + +Validators in availability distribution will be changed to only sign a `1` in +the bitfield of a candidate if they not only have the chunk, but also the +currently active PVF. They will fetch it from backers in case they don't have it +yet. + +### How do other parties get hold of the PVF? + +Two ways: + +1. Discover collators via [relay chain DHT](https://github.com/polkadot-fellows/RFCs/pull/8) and request from them: Preferred way, + as it is less load on validators. +2. Request from validators, which will serve on a best effort basis. + +### Pruning + +We covered how validators get hold of new code, but when can they prune old ones? +In principle it is not an issue if some validators prune code, because: + +1. We changed it so that a candidate is not deemed available if validators were + not able to fetch the PVF. +2. Backers can always fetch the PVF from collators as part of the collation + fetching. + +But the majority of validators should always keep the latest code of any +parachain and only prune the previous one once the first candidate using the +new code got finalized. 
This ensures that disputes can always be +resolved. + +## Drawbacks + +The major drawback of this solution is the same as for any solution that moves work +off-chain: it adds complexity to the node. E.g. nodes needing the PVF need to +store it separately, with its own pruning strategy. + +## Testing, Security, and Privacy + +Implementations adhering to this RFC will respond to PVF requests with the +actual PVF, if they have it. Requesters will persist received PVFs on disk +until they are replaced by a new one. Implementations must not be lazy +here: if validators only fetched the PVF when needed, they could be prevented from +participating in disputes. + +Validators should treat incoming requests for PVFs in general with rather low +priority, but should prefer fetches from other validators over requests from +random peers. + +Given that we are altering what set bits in the availability bitfields mean (not +only chunk, but also PVF available), it is important to have enough validators +upgraded before we allow collators to make use of the new runtime upgrade +mechanism. Otherwise we would risk disputes not being able to conclude. + +This RFC has no impact on privacy. + +## Performance, Ergonomics, and Compatibility + +### Performance + +This proposal lightens the load on the relay chain and is thus in general +beneficial for the performance of the network. This is achieved by the +following: + +1. Code upgrades are still propagated to all validators, but only once, not + twice (first via statements, then via the containing relay chain block). +2. Code upgrades are only communicated to validators and other nodes which are + interested, not to every full node as before. +3. Relay chain block space is preserved. Previously we could only do one runtime + upgrade per relay chain block, occupying almost all of the blockspace. +4. 
Signalling an upgrade no longer contains the upgrade, hence if we need to + push back on an upgrade for whatever reason, no network bandwidth or core + time gets wasted because of this. + +### Ergonomics + +End users are only affected by better performance and more stable block times. +Parachains will need to implement the introduced request/response protocol and +adapt to the new signalling mechanism via an `UMP` message, instead of sending +the code upgrade directly. + +### Compatibility + +We will continue to support the old mechanism for code upgrades for a while, but +will start to impose stricter limits over time, with the number of registered +parachains going up. With those limits in place parachains not migrating to the +new scheme might have a harder time upgrading and will miss more blocks. I +guess we can be lenient for a while still, so the upgrade path for +parachains should be rather smooth. + +In total the protocol changes we need are: + +For validators and collators: +1. New request/response protocol for fetching PVF data from collators and + validators. +2. New UMP message type for signalling a runtime upgrade. + +Only for validators: + +1. New runtime API for determining to-be-enacted code upgrades. +2. Different behaviour of bitfields (only sign a 1 bit, if validator has chunk + + "hot" PVF). +3. Altered behaviour in availability-distribution: Fetch missing PVFs. + +## Prior Art and References + +Off-chain runtime upgrades have been discussed before; the architecture +described here is simpler, though, as it piggybacks on already existing features, +namely: + +1. availability-distribution: No separate `I have code` messages anymore. +2. Existing pre-checking. + +https://github.com/paritytech/polkadot-sdk/issues/971 + +## Unresolved Questions + +None at this time. 
+ +## Future Directions and Related Material + +By no longer having code upgrade go through the relay chain, occupying a full relay +chain block, the impact on other parachains is already greatly reduced, if we +make distribution and PVF pre-checking low-priority processes on validators. The +only thing attackers might be able to do is delay upgrades of other parachains. + +Which seems like a problem to be solved once we actually see it as a problem in +the wild (and can already be mitigated by adjusting the counter). The good thing +is that we have all the ingredients to go further if need be. Signalling no +longer actually includes the code, hence there is no need to reject the +candidate: The parachain can make progress even if we choose not to immediately +act on the request and no relay chain resources are wasted either. + +We could for example introduce another UMP Signalling message +`RequestCodeUpgradeWithPriority` which not just requests a code upgrade, but +also offers some DOT to get ranked up in a queue. From b5098e14ebbe3e5404b32808ed3c6b4b7e1de991 Mon Sep 17 00:00:00 2001 From: eskimor Date: Sat, 13 Jul 2024 23:47:30 +0200 Subject: [PATCH 2/6] Add note about storage deposits and future extensibility. --- .../0102-offchain-parachain-runtime-upgrades.md | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/text/0102-offchain-parachain-runtime-upgrades.md b/text/0102-offchain-parachain-runtime-upgrades.md index dd5977aa8..f9744b5f7 100644 --- a/text/0102-offchain-parachain-runtime-upgrades.md +++ b/text/0102-offchain-parachain-runtime-upgrades.md @@ -276,10 +276,16 @@ https://github.com/paritytech/polkadot-sdk/issues/971 ## Unresolved Questions -None at this time. +1. What about the initial runtime, shall we make that off-chain as well? +2. Good news, at least after the first upgrade, no code will be stored on chain + any more, this means that we also have to redefine the storage deposit now. 
+ We no longer charge for chain storage, but validator disk storage -> Should + be cheaper. ## Future Directions and Related Material +### Further Hardening + By no longer having code upgrade go through the relay chain, occupying a full relay chain block, the impact on other parachains is already greatly reduced, if we make distribution and PVF pre-checking low-priority processes on validators. The @@ -295,3 +301,12 @@ act on the request and no relay chain resources are wasted either. We could for example introduce another UMP Signalling message `RequestCodeUpgradeWithPriority` which not just requests a code upgrade, but also offers some DOT to get ranked up in a queue. + +### Generalize this off-chain storage mechanism? + +Making this storage mechanism more general purpose is worth thinking about. E.g. +by resolving above "fee" question, we might also be able to resolve the pruning +question in a more generic way and thus could indeed open this storage facility +for other purposes as well. E.g. smart contracts, so the PoV would only need to +reference contracts by hash and the actual PoV is stored on validators and +collators and thus no longer needs to be part of the PoV. 
From ce9d5f392674832d8cab5bebe10369931e02fd26 Mon Sep 17 00:00:00 2001 From: eskimor Date: Sat, 13 Jul 2024 23:54:56 +0200 Subject: [PATCH 3/6] Generalize req/res protocol --- text/0102-offchain-parachain-runtime-upgrades.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/text/0102-offchain-parachain-runtime-upgrades.md b/text/0102-offchain-parachain-runtime-upgrades.md index f9744b5f7..e85aab310 100644 --- a/text/0102-offchain-parachain-runtime-upgrades.md +++ b/text/0102-offchain-parachain-runtime-upgrades.md @@ -85,12 +85,11 @@ We will introduce a new request/response protocol for both collators and validators, with the following request/response: ```rust -struct RequestCode { - para_id: ParaId, - code_hash: Hash, +struct RequestBlob { + blob_hash: Hash, } -struct CodeResponse(Vec<u8>) +struct BlobResponse(Vec<u8>) ``` This protocol will be used by backers to request the PVF from collators in the From c8cd560735c7620ca9e9ba1bcec11ca9885c7230 Mon Sep 17 00:00:00 2001 From: eskimor Date: Sun, 14 Jul 2024 12:44:00 +0200 Subject: [PATCH 4/6] Clarifications. --- text/0102-offchain-parachain-runtime-upgrades.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/text/0102-offchain-parachain-runtime-upgrades.md b/text/0102-offchain-parachain-runtime-upgrades.md index e85aab310..d32e2a369 100644 --- a/text/0102-offchain-parachain-runtime-upgrades.md +++ b/text/0102-offchain-parachain-runtime-upgrades.md @@ -57,7 +57,8 @@ The issues with on-chain runtime upgrades are: We discussed introducing a separate signalling before submitting the actual runtime, but I think we should just go one step further and make upgrades fully -off-chain. +off-chain. This also helps bring down deposit costs in a secure way, as we +are actually also reducing costs for the network. 
### Introduce a new UMP message type `RequestCodeUpgrade` From 127d3ec3933643a8a47468c7657c51d9572d8615 Mon Sep 17 00:00:00 2001 From: eskimor Date: Sun, 14 Jul 2024 21:12:05 +0200 Subject: [PATCH 5/6] Some refinements on future directions. --- ...102-offchain-parachain-runtime-upgrades.md | 39 ++++++++++++++++++- 1 file changed, 37 insertions(+), 2 deletions(-) diff --git a/text/0102-offchain-parachain-runtime-upgrades.md b/text/0102-offchain-parachain-runtime-upgrades.md index d32e2a369..bcf2fd64a 100644 --- a/text/0102-offchain-parachain-runtime-upgrades.md +++ b/text/0102-offchain-parachain-runtime-upgrades.md @@ -90,7 +90,9 @@ struct RequestBlob { blob_hash: Hash, } -struct BlobResponse(Vec<u8>) +struct BlobResponse { + blob: Vec<u8> +} ``` This protocol will be used by backers to request the PVF from collators in the @@ -280,7 +282,14 @@ https://github.com/paritytech/polkadot-sdk/issues/971 2. Good news, at least after the first upgrade, no code will be stored on chain any more, this means that we also have to redefine the storage deposit now. We no longer charge for chain storage, but validator disk storage -> Should - be cheaper. + be cheaper. Solution to this: Not only store the hash on chain, but also the + size of the data. Then define a price per byte and charge that, but: + - how do we charge - I guess deposit has to be provided via other means, + runtime upgrade fails if not provided. + - how do we signal to the chain that the code is too large for it to reject + the upgrade? Easy: Make available and vote nay in pre-checking. + +TODO: Fully resolve these questions and incorporate in RFC text. ## Future Directions and Related Material @@ -310,3 +319,29 @@ question in a more generic way and thus could indeed open this storage facility for other purposes as well. E.g. smart contracts, so the PoV would only need to reference contracts by hash and the actual PoV is stored on validators and collators and thus no longer needs to be part of the PoV. 
+ +A possible avenue would be to change the response to: + +```rust +enum BlobResponse { + Blob(Vec<u8>), + Blobs(MerkleTree), +} +``` + +With this the hash specified in the request can also be a merkle root and the +responder will respond with the entire merkle tree (only hashes, no payload). +Then the requester can traverse the leaf hashes and use the same request +response protocol to request any locally missing blobs in that tree. + +One leaf would for example be the PVF, others could be smart contracts. With a +properly specified format (e.g. which leaf is the PVF?), what we get here is +that a parachain can not only update its PVF, but also additional data, +incrementally. E.g. adding another smart contract does not require resubmitting +the entire PVF to validators: only the root hash on the relay chain gets +updated, then validators fetch the merkle tree and only fetch any missing +leaves. That additional data could be made available to the PVF via a host +function yet to be added. The nice thing about this approach is that, while we can +upgrade incrementally, lifetime is still tied to the PVF and we get all the same +guarantees. Assuming the validators store blobs by hash, we even get disk +sharing if multiple parachains use the same data (e.g. same smart contracts). 
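Editorial sketch (not part of the patch): the incremental fetch described in the last hunk above can be illustrated with a small content-addressed store. This is a minimal sketch under stated assumptions: `BlobStore`, `insert_verified`, `missing_leaves` and `hash_blob` are hypothetical names, the merkle tree is reduced to its list of leaf hashes, and the standard library's `DefaultHasher` stands in for the cryptographic hash a real node would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash as StdHash, Hasher};

/// Stand-in for a cryptographic hash (assumption: a real node would use a
/// 32-byte hash, not the 64-bit output of `DefaultHasher`).
type BlobHash = u64;

fn hash_blob(blob: &[u8]) -> BlobHash {
    let mut hasher = DefaultHasher::new();
    blob.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical content-addressed store; blobs are keyed by hash, so identical
/// data used by several parachains is stored only once.
#[derive(Default)]
struct BlobStore {
    blobs: HashMap<BlobHash, Vec<u8>>,
}

impl BlobStore {
    /// Store a fetched blob only if it matches the hash that was requested.
    fn insert_verified(&mut self, expected: BlobHash, blob: Vec<u8>) -> bool {
        if hash_blob(&blob) != expected {
            return false;
        }
        self.blobs.insert(expected, blob);
        true
    }

    /// Leaf hashes of an announced tree that are not yet available locally.
    fn missing_leaves(&self, leaves: &[BlobHash]) -> Vec<BlobHash> {
        leaves
            .iter()
            .copied()
            .filter(|h| !self.blobs.contains_key(h))
            .collect()
    }
}

fn main() {
    let pvf = b"pvf code v1".to_vec();
    let contract_a = b"contract A".to_vec();
    let contract_b = b"contract B".to_vec();

    // The validator already holds the PVF and one contract.
    let mut store = BlobStore::default();
    store.insert_verified(hash_blob(&pvf), pvf.clone());
    store.insert_verified(hash_blob(&contract_a), contract_a.clone());

    // The parachain adds one contract: only the new leaf has to be fetched.
    let leaves = vec![
        hash_blob(&pvf),
        hash_blob(&contract_a),
        hash_blob(&contract_b),
    ];
    let missing = store.missing_leaves(&leaves);
    println!("blobs to fetch: {}", missing.len());
}
```

Keying the store by hash is also what would give the disk sharing mentioned above; verifying each blob against the requested hash is an assumption about how a requester would guard against wrong responses.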
From 95f976311f92ba4138a2f31db2a2aebe86662105 Mon Sep 17 00:00:00 2001 From: eskimor Date: Mon, 15 Jul 2024 15:44:09 +0200 Subject: [PATCH 6/6] More clarifications --- text/0102-offchain-parachain-runtime-upgrades.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/text/0102-offchain-parachain-runtime-upgrades.md b/text/0102-offchain-parachain-runtime-upgrades.md index bcf2fd64a..20b45c34c 100644 --- a/text/0102-offchain-parachain-runtime-upgrades.md +++ b/text/0102-offchain-parachain-runtime-upgrades.md @@ -242,6 +242,13 @@ Parachains will need to implement the introduced request/response protocol and adapt to the new signalling mechanism via an `UMP` message, instead of sending the code upgrade directly. +For parachain operators we should emit events when a runtime upgrade is initiated and +on each block, reporting the current counter and how many blocks to go until the +upgrade gets passed to pre-checking. This is especially important for on-demand +chains or bulk users not occupying a full core. Furthermore, the behaviour of +requiring multiple blocks to fully initiate a runtime upgrade needs to be well +documented. + ### Compatibility We will continue to support the old mechanism for code upgrades for a while, but
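Editorial sketch (not part of the patch): the counter scheme and the operator events proposed in the hunk above could look roughly as follows. All names here (`UpgradeEvent`, `PendingUpgrade`, `on_candidate`, `code_required_for_bitfield`) are hypothetical, the runtime API is still marked TODO in the RFC, and the sketch treats backing and inclusion of a candidate as a single step for brevity.

```rust
/// Hypothetical events a relay chain runtime could emit for parachain operators.
#[derive(Debug, PartialEq)]
enum UpgradeEvent {
    /// A candidate signalled `RequestCodeUpgrade`.
    Initiated { code_hash: [u8; 32], blocks_to_go: u32 },
    /// A further candidate of the parachain decremented the counter.
    Progress { blocks_to_go: u32 },
    /// The counter reached 0; the hash can be handed to pre-checking.
    ReadyForPreChecking { code_hash: [u8; 32] },
}

/// Pending upgrade as validators would see it: the new hash plus the counter.
struct PendingUpgrade {
    code_hash: [u8; 32],
    counter: u32,
}

impl PendingUpgrade {
    /// Start a pending upgrade with a configurable initial counter (e.g. 10).
    fn request(code_hash: [u8; 32], initial_counter: u32) -> (Self, UpgradeEvent) {
        let event = UpgradeEvent::Initiated { code_hash, blocks_to_go: initial_counter };
        (Self { code_hash, counter: initial_counter }, event)
    }

    /// Called for each further candidate of the same parachain.
    fn on_candidate(&mut self) -> UpgradeEvent {
        self.counter = self.counter.saturating_sub(1);
        if self.counter == 0 {
            UpgradeEvent::ReadyForPreChecking { code_hash: self.code_hash }
        } else {
            UpgradeEvent::Progress { blocks_to_go: self.counter }
        }
    }

    /// While the counter is above 0 validators may sign their bitfield without
    /// the code; at 0 the code is mandatory for signing a `1`.
    fn code_required_for_bitfield(&self) -> bool {
        self.counter == 0
    }
}

fn main() {
    let (mut pending, event) = PendingUpgrade::request([0u8; 32], 3);
    println!("{:?}", event);
    while !pending.code_required_for_bitfield() {
        println!("{:?}", pending.on_candidate());
    }
}
```

With an initial counter of 3 the parachain has to produce three further candidates before pre-checking can start, which is the cost knob the RFC describes; an on-demand chain would see one `Progress` event per purchased block.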