Offchain runtime upgrades #102
Changes from all commits
9ec6a42
b5098e1
ce9d5f3
c8cd560
127d3ec
95f9763
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,354 @@ | ||
# RFC-0000: Feature Name Here | ||
|
||
| | | | ||
| --------------- | ------------------------------------------------------------------------------------------- | | ||
| **Start Date** | 13 July 2024 | | ||
| **Description** | Implement off-chain parachain runtime upgrades | | ||
| **Authors** | eskimor | | ||
|
||
## Summary | ||
|
||
Change the upgrade process of a parachain runtime upgrade to become an off-chain | ||
process with regards to the relay chain. Upgrades are still contained in | ||
parachain blocks, but will no longer need to end up in relay chain blocks nor in | ||
relay chain state. | ||
|
||
## Motivation | ||
|
||
Having parachain runtime upgrades go through the relay chain has always been | ||
seen as a scalability concern. Due to optimizations in statement | ||
distribution and asynchronous backing, it became less crucial and got | ||
de-prioritized. The original issue can be found | ||
[here](https://github.com/paritytech/polkadot-sdk/issues/971). | ||
|
||
With the introduction of Agile Coretime and, in general, our efforts to further | ||
reduce the barrier to entry for Polkadot, the issue becomes more relevant again: | ||
We would like to reduce the required storage deposit for PVF registration, with | ||
the aim to not only make it cheaper to run a parachain (bulk + on-demand | ||
coretime), but also reduce the amount of capital required for the deposit. With | ||
this we would hope for far more parachains to get registered: thousands, | ||
potentially even tens of thousands. With so many PVFs registered, updates are | ||
expected to become more frequent and even attacks on service quality for other | ||
parachains would become a higher risk. | ||
|
||
## Stakeholders | ||
|
||
- Parachain Teams | ||
- Relay Chain Node implementation teams | ||
- Relay Chain runtime developers | ||
|
||
## Explanation | ||
|
||
The issues with on-chain runtime upgrades are: | ||
|
||
1. Needlessly costly. | ||
2. A single runtime upgrade more or less occupies an entire relay chain block, thus it | ||
might also affect other parachains, especially if their candidates are also | ||
not negligible (due to messages, for example) or they want to upgrade their | ||
runtime at the same time. | ||
3. The signalling of the parachain to notify the relay chain of an upcoming | ||
runtime upgrade already contains the upgrade. Therefore the only way to rate | ||
limit upgrades is to drop an already distributed update in the size of | ||
megabytes, with the result that the parachain misses a block and, more | ||
importantly, it will try again with the very next block, until it finally | ||
succeeds. If we imagine reducing the capacity for runtime upgrades to, let's say, 1 | ||
every 100 relay chain blocks, this results in lots of wasted effort and lost | ||
blocks. | ||
|
||
We discussed introducing a separate signalling before submitting the actual | ||
runtime, but I think we should just go one step further and make upgrades fully | ||
off-chain. This also helps bring down deposit costs in a secure way, as we | ||
are actually reducing costs for the network. | ||
|
||
### Introduce a new UMP message type `RequestCodeUpgrade` | ||
|
||
As part of elastic scaling we are already planning to increase the flexibility of [UMP | ||
messages](https://github.com/polkadot-fellows/RFCs/issues/92#issuecomment-2144538974). We can now use this to our advantage and introduce another UMP message: | ||
Review comment: "We just need this hack for one thing and will not use it for anything else" ;)
Review comment: It does indeed feel like it is creeping into other features/changes like this one, but it offers a lot of advantages in the short term. I would not call it a hack, but more of a generalisation of the UMP queue. The alternative is PVF versioning, which I believe is the long-term solution that we'll likely develop in 2025.
Review comment: I mean for
Review comment: We don't want to add it to XCM, instead we will have a UMP queue separator between regular XCM messages and the possible additional ones for
Review comment: I know what the plan was/is. However, this doesn't really invalidate what I said above.
Review comment: Ah, I see. I would prefer to make the UMP messages more generic in this case, having two variants, one wrapping XCM and the other UMPSignal as defined here. Sounds much better than using a separator. If we agree to this I will update it also in #103 |
||
|
||
```rust | ||
enum UMPSignal { | ||
    // For elastic scaling | ||
    OnCore(CoreIndex), | ||
    // For off-chain upgrades | ||
    RequestCodeUpgrade(Hash), | ||
} | ||
``` | ||
|
||
We could also make that new message a regular XCM, calling an extrinsic on the | ||
relay chain, but we will want to look into that message right after validation | ||
on the backers on the node side, making a straightforward semantic message more | ||
apt for the purpose. | ||
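
To illustrate why a plain semantic message is convenient on the node side, here is a minimal sketch of how a backer might scan the signals of a validated candidate for an upgrade request. The `Hash` alias, the `CoreIndex` newtype and the helper `requested_code_upgrade` are assumptions made for this example only; the RFC does not define them.

```rust
// Hypothetical stand-ins for the real primitive types.
type Hash = [u8; 32];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CoreIndex(u32);

#[derive(Debug, Clone, PartialEq, Eq)]
enum UMPSignal {
    // For elastic scaling
    OnCore(CoreIndex),
    // For off-chain upgrades
    RequestCodeUpgrade(Hash),
}

/// Returns the hash of the requested code, if the candidate sent a
/// `RequestCodeUpgrade` signal. Only the first such signal is considered
/// (an assumption of this sketch).
fn requested_code_upgrade(signals: &[UMPSignal]) -> Option<Hash> {
    signals.iter().find_map(|signal| match signal {
        UMPSignal::RequestCodeUpgrade(hash) => Some(*hash),
        UMPSignal::OnCore(_) => None,
    })
}
```

No XCM decoding or execution is needed for this check, which is the point made above.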
|
||
|
||
### Handle `RequestCodeUpgrade` on backers | ||
|
||
We will introduce a new request/response protocol for both collators and | ||
validators, with the following request/response: | ||
|
||
```rust | ||
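// Request for a blob (e.g. a PVF), identified by its hash. | ||
struct RequestBlob { | ||
    blob_hash: Hash, | ||
} | ||
|
||
// Response carrying the raw bytes of the requested blob. | ||
struct BlobResponse { | ||
    blob: Vec<u8>, | ||
} | ||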
``` | ||
|
||
This protocol will be used by backers to request the PVF from collators in the | ||
following conditions: | ||
|
||
1. They received a collation sending `RequestCodeUpgrade`. | ||
2. They received a collation, but they don't yet have the code that was | ||
previously registered on the relay chain (e.g. disk pruned, new validator). | ||
Review comment: Is it still feasible to prepare PVFs in advance (when node becomes a validator in next session)? |
||
|
||
In case they received the collation via PoV distribution instead of from the | ||
collator itself, they will use the exact same message to fetch from the validator | ||
they got the PoV from. | ||
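
The two conditions above can be sketched as a small decision function. This is only an illustration: the `Collation` struct, its field names and `blobs_to_fetch` are hypothetical, and the sketch assumes that a backer skips the request when it already holds the blob locally.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the real hash type.
type Hash = [u8; 32];

/// Hypothetical summary of what a backer learns from a validated collation.
struct Collation {
    /// Code hash currently registered on the relay chain for this parachain.
    current_code_hash: Hash,
    /// Set if the collation sent `RequestCodeUpgrade`.
    requested_upgrade: Option<Hash>,
}

/// Blob hashes the backer should request from whoever sent the collation
/// (the collator, or the validator it got the PoV from).
fn blobs_to_fetch(collation: &Collation, local: &HashSet<Hash>) -> Vec<Hash> {
    let mut wanted = Vec::new();
    // Condition 2: the registered code is missing locally.
    if !local.contains(&collation.current_code_hash) {
        wanted.push(collation.current_code_hash);
    }
    // Condition 1: an upgrade was requested and the new code is missing.
    if let Some(new_code) = collation.requested_upgrade {
        if !local.contains(&new_code) && !wanted.contains(&new_code) {
            wanted.push(new_code);
        }
    }
    wanted
}
```

Each returned hash would become the `blob_hash` of one `RequestBlob`.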
Review comment: Why not make the code upgrade simply be the parachain block? Isn't that how substrate worked from the beginning? If the code were bigger than a block, then you could incrementally build the PVF in parachain state, and incrementally hash it. Or do some special larger code block type. |
||
|
||
### Get the new code to all validators | ||
|
||
Once the candidate issuing `RequestCodeUpgrade` got backed on chain, validators | ||
will start fetching the code from the backers as part of availability | ||
distribution. | ||
|
||
To mitigate attack vectors we should make sure that serving requests for code | ||
can be treated as low priority requests. Thus I am suggesting the following | ||
scheme: | ||
|
||
Validators will notice via a runtime API (TODO: Define) that a new code has been requested, the | ||
API will return the `Hash` and a counter, which starts at some configurable | ||
value e.g. 10. The validators are now aware of the new hash and start fetching, | ||
but they don't have to wait for the fetch to succeed to sign their bitfield. | ||
|
||
Then on each further candidate from that chain that counter gets decremented. | ||
Validators which have not yet succeeded in fetching will now try again. This game | ||
continues until the counter reaches `0`. Now it is mandatory to have the code in | ||
order to sign a `1` in the bitfield. | ||
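
The signing rule described above can be written down as a small predicate. This is a sketch under assumptions: the runtime API is still to be defined, so `pending_counter` is simply a placeholder for the counter it would return, and the function names are made up for this example.

```rust
/// Decide whether a validator may sign a `1` for a candidate.
///
/// `pending_counter` is `Some(n)` while a requested code upgrade is being
/// distributed (hypothetical shape of the runtime API result) and `None`
/// if no upgrade is pending.
fn may_sign_available(has_chunk: bool, has_new_code: bool, pending_counter: Option<u32>) -> bool {
    match pending_counter {
        // Counter reached zero: having the new code is now mandatory.
        Some(0) => has_chunk && has_new_code,
        // Still counting down, or no upgrade pending: the chunk suffices.
        _ => has_chunk,
    }
}

/// Counter update applied on each further candidate of the parachain.
fn next_counter(counter: u32) -> u32 {
    counter.saturating_sub(1)
}
```

With an initial value of 10, a validator that is slow to fetch the code can keep signing for the intermediate candidates and only has to hold the code once the counter is at `0`.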
Review comment: You've just pushed the availability into the last of these fake blocks here. I guess this works, but I'm not convinced this is better than doing some big block availability variant: We'd process the code availability in a single big parachain block, which only provides data but never gets executed. This takes as long as it takes, maybe running at some lower priority. It occupies the availability code for that whole time, exactly like this scheme does. After that runs, we have code available on chain so everyone must fetch it and build the artifact. We must delay the PVF upgrade being usable until those builds succeed, which could be done either by a second fake parablock type, or else by some message of the sort discussed here. |
||
|
||
PVF pre-checking will happen after the candidate which brought the counter to | ||
Review comment: Question: Do we need to use availability bitfields here or can we rely on pre-checking only?
Review comment: Bitfields offer the advantage that we have an incentive for backers (at least for the last one) and it avoids having to impose the work of pre-checking without the "attacker" having paid their bill (produced enough blocks).
Review comment: Other things to consider:
|
||
`0` has been successfully included and thus is also able to assume that 2/3 of | ||
Review comment: Is there an expiry date for when the parachain needs to reach 0, otherwise the code upgrade is dropped?
Review comment: Good point. Will add a section. |
||
the validators have the code. | ||
|
||
This scheme serves two purposes: | ||
|
||
1. Fetching can happen over a longer period of time with low priority. E.g. if | ||
we waited for the PVF at the very first availability distribution, this might | ||
actually affect liveness of other chains on the same core. Distributing | ||
Review comment: Don't we still starve the next parachain if the inclusion is delayed until the code was fetched by 2/3 of the validators? I mean, if we treat these as low priority this can be an issue.
Review comment: That's why we have a configurable amount of parachain blocks to do the fetching. If we ever run into availability problems we can:
Note however that right now we do distribute those upgrades within a single relay chain slot twice, once via statement distribution then via the relay chain block. In the new scheme, if we set the number of required parachain blocks to 10, we reduce pressure 20 times. Thus I doubt it will be a problem in practice and if it ever were, we have means to fix it. |
||
megabytes of data to a thousand validators might take a bit. Thus this helps | ||
isolating parachains from each other. | ||
2. By configuring the initial counter value we can affect how much an upgrade | ||
costs. E.g. forcing the parachain to produce 10 blocks, means 10x the cost | ||
for issuing an update. If too frequent upgrades ever become a problem for the | ||
system, we have a knob to make them more costly. | ||
|
||
### On-chain code upgrade process | ||
|
||
First when a candidate is backed we need to make the new hash available | ||
(together with a counter) via a | ||
runtime API so validators in availability distribution can check for it and | ||
fetch it if changed (see previous section). For performance reasons, I think we | ||
should not do an additional call, but replace the [existing one](https://github.com/paritytech/polkadot-sdk/blob/d2fd53645654d3b8e12cbf735b67b93078d70113/polkadot/node/subsystem-util/src/runtime/mod.rs#L355) with one containing the new additional information (Option<(Hash, Counter)>). | ||
|
||
Once the candidate gets included (counter 0), the hash is given to pre-checking | ||
and only after pre-checking succeeded (and a full session passed) it is finally | ||
enacted and the parachain can switch to the new code. (Same process as it used | ||
to be.) | ||
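
As a rough illustration of the on-chain process, the lifecycle of one upgrade can be modelled as a state machine. Everything here is an assumption for illustration: the state and event names are invented, the counter is decremented at inclusion for simplicity, and "a full session passed" is reduced to a single `SessionChanged` event.

```rust
// Hypothetical stand-in for the real hash type.
type Hash = [u8; 32];

#[derive(Debug, Clone, PartialEq, Eq)]
enum UpgradeState {
    /// New hash is exposed via the runtime API, validators are fetching.
    Fetching { code_hash: Hash, counter: u32 },
    /// The candidate that brought the counter to 0 was included.
    PreChecking { code_hash: Hash },
    /// Pre-checking succeeded, waiting for the session to pass.
    AwaitingSession { code_hash: Hash },
    /// The parachain has switched to the new code.
    Enacted { code_hash: Hash },
    /// Pre-checking failed.
    Rejected,
}

enum Event {
    CandidateIncluded,
    PreCheckPassed,
    PreCheckFailed,
    SessionChanged,
}

/// Advance the upgrade by one event. Events that do not apply are ignored.
fn step(state: UpgradeState, event: Event) -> UpgradeState {
    match (state, event) {
        (UpgradeState::Fetching { code_hash, counter }, Event::CandidateIncluded) => {
            let next = counter.saturating_sub(1);
            if next == 0 {
                UpgradeState::PreChecking { code_hash }
            } else {
                UpgradeState::Fetching { code_hash, counter: next }
            }
        }
        (UpgradeState::PreChecking { code_hash }, Event::PreCheckPassed) => {
            UpgradeState::AwaitingSession { code_hash }
        }
        (UpgradeState::PreChecking { .. }, Event::PreCheckFailed) => UpgradeState::Rejected,
        (UpgradeState::AwaitingSession { code_hash }, Event::SessionChanged) => {
            UpgradeState::Enacted { code_hash }
        }
        (state, _) => state,
    }
}
```

The part after `PreChecking` mirrors the existing process; only the `Fetching` phase is new.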
|
||
### Handling new validators | ||
#### Backers | ||
|
||
If a backer receives a collation for a parachain whose code, as enacted on | ||
chain (see "On-chain code upgrade process"), it does not yet have, it will use the above | ||
request/response protocol to fetch it from whoever it received the collation from. | ||
|
||
#### Availability Distribution | ||
|
||
Validators in availability distribution will be changed to only sign a `1` in | ||
the bitfield of a candidate if they not only have the chunk, but also the | ||
currently active PVF. They will fetch it from backers in case they don't have it | ||
yet. | ||
Review comment: Yeah this makes sense regardless. |
||
|
||
### How do other parties get hold of the PVF? | ||
|
||
Two ways: | ||
|
||
1. Discover collators via [relay chain DHT](https://github.com/polkadot-fellows/RFCs/pull/8) and request from them: Preferred way, | ||
as it is less load on validators. | ||
2. Request from validators, which will serve on a best effort basis. | ||
|
||
### Pruning | ||
|
||
We covered how validators get hold of new code, but when can they prune old ones? | ||
In principle it is not an issue if some validators prune code, because: | ||
|
||
1. We changed it so that a candidate is not deemed available if validators were | ||
not able to fetch the PVF. | ||
2. Backers can always fetch the PVF from collators as part of the collation | ||
fetching. | ||
|
||
But the majority of validators should always keep the latest code of any | ||
parachain and only prune the previous one, once the first candidate using the | ||
new code got finalized. This ensures that disputes will always be able to | ||
resolve. | ||
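
The pruning rule from the last paragraph could look roughly like this on a validator. `ParaCode` and its methods are hypothetical names, and the sketch tracks at most one previous code per parachain, which is a simplification.

```rust
// Hypothetical stand-in for the real hash type.
type Hash = [u8; 32];

/// Code hashes a validator keeps on disk for one parachain.
struct ParaCode {
    current: Hash,
    /// Previous code, kept until a candidate using `current` is finalized.
    previous: Option<Hash>,
}

impl ParaCode {
    /// Called when a code upgrade gets enacted on chain.
    fn on_upgrade_enacted(&mut self, new_code: Hash) {
        self.previous = Some(std::mem::replace(&mut self.current, new_code));
    }

    /// Called for each finalized candidate of this parachain. Returns the
    /// hash of the code that may now be pruned, if any.
    fn on_candidate_finalized(&mut self, used_code: Hash) -> Option<Hash> {
        if used_code == self.current {
            self.previous.take()
        } else {
            None
        }
    }
}
```

Until a candidate using the new code is finalized, both blobs stay on disk, so disputes about candidates validated with the old code can still be resolved.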
Review comment: Yeah 1 is an improvement here, previously I'd envisioned parachains doing code reuploads once per day, just so the code stays in availability |
||
|
||
## Drawbacks | ||
|
||
The major drawback of this solution is the same as for any solution that moves work | ||
off-chain: it adds complexity to the node. E.g. nodes needing the PVF need to | ||
store it separately, together with their own pruning strategy. | ||
|
||
## Testing, Security, and Privacy | ||
|
||
Implementations adhering to this RFC, will respond to PVF requests with the | ||
actual PVF, if they have it. Requesters will persist received PVFs on disk | ||
until they are replaced by a new one. Implementations must not be lazy | ||
here: if validators only fetched the PVF when needed, they could be prevented from | ||
participating in disputes. | ||
|
||
Validators should treat incoming requests for PVFs in general with rather low | ||
priority, but should prefer fetches from other validators over requests from | ||
random peers. | ||
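
One possible way to express "validators before random peers" is a priority queue for incoming PVF requests. The sketch below is an assumption about how an implementation might do it, not part of the proposal; `PeerKind` and `BlobRequestQueue` are made-up names, and requests of the same kind are served in arrival order.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical stand-in for the real hash type.
type Hash = [u8; 32];

/// Who is asking. The derived `Ord` makes later variants compare greater,
/// so `Validator` outranks `Other`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum PeerKind {
    Other,
    Validator,
}

/// Queue of pending blob requests, served when the node has spare capacity.
#[derive(Default)]
struct BlobRequestQueue {
    // Max-heap on (peer kind, earliest sequence number first).
    heap: BinaryHeap<(PeerKind, Reverse<u64>, Hash)>,
    next_seq: u64,
}

impl BlobRequestQueue {
    fn push(&mut self, peer: PeerKind, blob_hash: Hash) {
        self.heap.push((peer, Reverse(self.next_seq), blob_hash));
        self.next_seq += 1;
    }

    /// Next request to serve: validators first, then FIFO within a kind.
    fn pop(&mut self) -> Option<(PeerKind, Hash)> {
        self.heap.pop().map(|(peer, _, hash)| (peer, hash))
    }
}
```

Serving from such a queue only when no higher-priority work is pending keeps PVF distribution a low-priority process overall.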
|
||
Given that we are altering what set bits in the availability bitfields mean (not | ||
only chunk, but also PVF available), it is important to have enough validators | ||
upgraded, before we allow collators to make use of the new runtime upgrade | ||
mechanism. Otherwise we would risk disputes not being able to succeed. | ||
|
||
This RFC has no impact on privacy. | ||
|
||
## Performance, Ergonomics, and Compatibility | ||
|
||
### Performance | ||
|
||
This proposal lightens the load on the relay chain and is thus in general | ||
beneficial for the performance of the network. This is achieved by the | ||
following: | ||
|
||
1. Code upgrades are still propagated to all validators, but only once, not | ||
twice (First statements, then via the containing relay chain block). | ||
2. Code upgrades are only communicated to validators and other nodes which are | ||
interested, not any full node as it has been before. | ||
3. Relay chain block space is preserved. Previously we could only do one runtime | ||
upgrade per relay chain block, occupying almost all of the blockspace. | ||
4. Signalling an upgrade no longer contains the upgrade, hence if we need to | ||
push back on an upgrade for whatever reason, no network bandwidth and core | ||
time gets wasted because of this. | ||
|
||
### Ergonomics | ||
|
||
End users are only affected by better performance and more stable block times. | ||
Parachains will need to implement the introduced request/response protocol and | ||
adapt to the new signalling mechanism via an `UMP` message, instead of sending | ||
the code upgrade directly. | ||
|
||
For parachain operators we should emit events on initiated runtime upgrade and | ||
each block reporting the current counter and how many blocks to go until the | ||
upgrade gets passed to pre-checking. This is especially important for on-demand | ||
chains or bulk users not occupying a full core. Furthermore, the behaviour of | ||
requiring multiple blocks to fully initiate a runtime upgrade needs to be well | ||
documented. | ||
|
||
### Compatibility | ||
|
||
We will continue to support the old mechanism for code upgrades for a while, but | ||
will start to impose stricter limits over time, with the number of registered | ||
parachains going up. With those limits in place parachains not migrating to the | ||
new scheme might be having a harder time upgrading and will miss more blocks. I | ||
guess we can be lenient for a while still, so the upgrade path for | ||
parachains should be rather smooth. | ||
|
||
In total the protocol changes we need are: | ||
|
||
For validators and collators: | ||
1. New request/response protocol for fetching PVF data from collators and | ||
validators. | ||
2. New UMP message type for signalling a runtime upgrade. | ||
|
||
Only for validators: | ||
|
||
1. New runtime API for determining to-be-enacted code upgrades. | ||
2. Different behaviour of bitfields (only sign a 1 bit, if validator has chunk + | ||
"hot" PVF). | ||
3. Altered behaviour in availability-distribution: Fetch missing PVFs. | ||
|
||
## Prior Art and References | ||
|
||
Off-chain runtime upgrades have been discussed before. The architecture | ||
described here is simpler though, as it piggybacks on already existing features, | ||
namely: | ||
|
||
1. availability-distribution: No separate `I have code` messages anymore. | ||
2. Existing pre-checking. | ||
|
||
https://github.com/paritytech/polkadot-sdk/issues/971 | ||
|
||
## Unresolved Questions | ||
|
||
1. What about the initial runtime, shall we make that off-chain as well? | ||
2. Good news, at least after the first upgrade, no code will be stored on chain | ||
any more, this means that we also have to redefine the storage deposit now. | ||
We no longer charge for chain storage, but validator disk storage -> Should | ||
be cheaper. Solution to this: Not only store the hash on chain, but also the | ||
size of the data. Then define a price per byte and charge that, but: | ||
- how do we charge - I guess deposit has to be provided via other means, | ||
runtime upgrade fails if not provided. | ||
- how do we signal to the chain that the code is too large for it to reject | ||
the upgrade? Easy: Make available and vote nay in pre-checking. | ||
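
Purely as an illustration of the "price per byte" idea in question 2, the deposit check could be as simple as the following. The function names, the integer types and the rule that an insufficient deposit fails the upgrade are assumptions; the RFC leaves all of this open.

```rust
/// Deposit required for code of the given size, or `None` on overflow.
fn required_deposit(code_size_bytes: u32, price_per_byte: u128) -> Option<u128> {
    u128::from(code_size_bytes).checked_mul(price_per_byte)
}

/// Whether a provided deposit is enough for the upgrade to proceed.
fn deposit_covers_upgrade(deposit: u128, code_size_bytes: u32, price_per_byte: u128) -> bool {
    matches!(
        required_deposit(code_size_bytes, price_per_byte),
        Some(required) if deposit >= required
    )
}
```

This only works if the size is stored on chain next to the hash, as suggested above.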
|
||
TODO: Fully resolve these questions and incorporate in RFC text. | ||
|
||
## Future Directions and Related Material | ||
|
||
### Further Hardening | ||
|
||
By no longer having code upgrade go through the relay chain, occupying a full relay | ||
chain block, the impact on other parachains is already greatly reduced, if we | ||
make distribution and PVF pre-checking low-priority processes on validators. The | ||
only thing attackers might be able to do is delay upgrades of other parachains. | ||
|
||
This seems like a problem to be solved once we actually see it as a problem in | ||
the wild (and it can already be mitigated by adjusting the counter). The good thing | ||
is that we have all the ingredients to go further if need be. Signalling no | ||
longer actually includes the code, hence there is no need to reject the | ||
candidate: The parachain can make progress even if we choose not to immediately | ||
act on the request and no relay chain resources are wasted either. | ||
|
||
We could for example introduce another UMP Signalling message | ||
`RequestCodeUpgradeWithPriority` which not just requests a code upgrade, but | ||
also offers some DOT to get ranked up in a queue. | ||
|
||
### Generalize this off-chain storage mechanism? | ||
|
||
Making this storage mechanism more general purpose is worth thinking about. E.g. | ||
by resolving above "fee" question, we might also be able to resolve the pruning | ||
question in a more generic way and thus could indeed open this storage facility | ||
for other purposes as well. E.g. smart contracts, so the PoV would only need to | ||
reference contracts by hash and the actual contract code is stored on validators and | ||
collators and thus no longer needs to be part of the PoV. | ||
|
||
A possible avenue would be to change the response to: | ||
|
||
```rust | ||
enum BlobResponse { | ||
    Blob(Vec<u8>), | ||
    Blobs(MerkleTree), | ||
} | ||
``` | ||
|
||
With this the hash specified in the request can also be a merkle root and the | ||
responder will respond with the entire merkle tree (only hashes, no payload). | ||
Then the requester can traverse the leaf hashes and use the same request | ||
response protocol to request any locally missing blobs in that tree. | ||
|
||
One leaf would for example be the PVF, others could be smart contracts. With a | ||
properly specified format (e.g. which leaf is the PVF?), what we get here is | ||
that a parachain can not only update its PVF, but also additional data, | ||
incrementally. E.g. adding another smart contract, does not require resubmitting | ||
the entire PVF to validators, only the root hash on the relay chain gets | ||
updated, then validators fetch the merkle tree and only fetch any missing | ||
leaves. That additional data could be made available to the PVF via a to be | ||
added host function. The nice thing about this approach is, that while we can | ||
upgrade incrementally, lifetime is still tied to the PVF and we get all the same | ||
guarantees. Assuming the validators store blobs by hash, we even get disk | ||
sharing if multiple parachains use the same data (e.g. same smart contracts). |
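
The incremental fetch described above boils down to a set difference over leaf hashes. The sketch below assumes a flattened `MerkleTree` that only exposes its leaf hashes; the real structure is not specified in this RFC, and verification of the tree against the requested root hash is left out.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the real hash type.
type Hash = [u8; 32];

/// Hypothetical shape of a `Blobs` response: hashes only, no payload.
struct MerkleTree {
    leaves: Vec<Hash>,
}

/// Leaf hashes that are not yet stored locally, in tree order and without
/// duplicates. Each of them would be fetched with a regular `RequestBlob`.
fn missing_leaves(tree: &MerkleTree, local: &HashSet<Hash>) -> Vec<Hash> {
    let mut seen = HashSet::new();
    tree.leaves
        .iter()
        .copied()
        .filter(|hash| !local.contains(hash) && seen.insert(*hash))
        .collect()
}
```

If a parachain adds one smart contract, only that one leaf shows up as missing; the PVF and all other leaves are already on disk.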
Review comment: Yes, off-chain upgrades make sense: I mildly pushed for PVF upgrades to live in parablocks early on, but we decided on upgrades on the relay chain since all validators need the data eventually anyway. It's true however that (a) validator set churn makes off-chain an optimization, and being on-chain incurs extra costs, like repeated downloads.