From 26dd08522297f31aa9fd6c744bdbca2c69a729e0 Mon Sep 17 00:00:00 2001
From: protolambda
Date: Thu, 29 Aug 2024 11:35:41 -0600
Subject: [PATCH] protocol: block safety index

---
 protocol/block-safety-index.md | 224 +++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)
 create mode 100644 protocol/block-safety-index.md

diff --git a/protocol/block-safety-index.md b/protocol/block-safety-index.md
new file mode 100644
index 0000000..38b79a4
--- /dev/null
+++ b/protocol/block-safety-index.md
@@ -0,0 +1,224 @@
+# Block safety index
+
+# Summary
+
+Unify the mapping between L2 blocks and the L1 blocks they were derived from,
+to address cross-protocol safety-validation challenges.
+
+This mapping conceptually exists in different forms in the OP-Stack,
+but should be described as a feature of the rollup-node, for other components to be built against.
+
+This does not directly affect the op-program (or Kona), but does affect how we source the data to enter a dispute game.
+
+# Problem
+
+The following problems are all *the same*:
+- The op-node warmup UX problem
+- The span-batch recovery problem
+- The fault-proof pre-state problem
+- The interop reorg problem
+- The L2 finalization problem
+- The op-proposer consistency problem
+- The safety indexing problem
+
+At the core, all these issues stem from a single problem:
+determining which L2 block corresponds to which L1 block.
+
+Yet we currently have *different solutions* for each case.
+
+The reason why this is hard: the L2 chain, by design, does not pre-commit to how/where data gets confirmed on L1.
+This reduces reorg risk and DA cost by deferring optionality to the batcher, but introduces "offchain" state.
+
+Can we fix this, and simplify the OP-Stack?
+
+## Sub-problems
+
+### The op-node warmup UX problem
+
+The majority of op-node warmup is derivation-pipeline reset time.
+It traverses to an old L2 block guaranteed to still be canonical,
+then traverses to an old L1 block that is guaranteed to produce
+the inputs needed to continue derivation of the next L2 blocks.
+
+This traversal can be lengthy:
+- The L2 chain is traversed back, because we cannot say with certainty whether it was derived from a recent L1 block.
+  Often this means it "bottoms out" on the finalized L2 block, at least ~12 minutes old.
+- The L1 chain is traversed back a full channel-timeout: 300 blocks, i.e. an hour of L1 data!
+  (reduced in Granite to 50 blocks = 10 minutes).
+
+This delay slows down the process of making nodes fully operational after a restart,
+particularly with regard to derivation.
+
+But it can be so much faster, if we introduce state:
+- L2 traversal can go back to the last L2 block derived from a still-canonical L1 block.
+  Seconds, not minutes, of traversed time-span.
+- L1 traversal can quickly find the relevant L1 data within a much shorter search space,
+  if we had stricter batch-ordering guarantees (being introduced with Holocene!) and the derived-from mapping
+  (see the sketch below).
+  - The last derived-from L1 block is an upper bound.
+  - The previous-to-last derived-from L1 block is a lower bound.
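+
+As a rough sketch of that second point, the lookup the reset path would need could look like the
+Go interface below. All names here (`DerivedFromIndex`, `LastDerivedFrom`, `PreviousDerivedFrom`,
+`resetWindow`) are hypothetical, not existing op-node APIs:
+
+```go
+package safetyindex
+
+// All names in this sketch are hypothetical, to illustrate the shape of the
+// lookup; they do not refer to existing op-node types or APIs.
+
+// BlockID identifies a block by hash and number.
+type BlockID struct {
+    Hash   [32]byte
+    Number uint64
+}
+
+// DerivedFromIndex is the lookup a derivation-pipeline reset would need.
+type DerivedFromIndex interface {
+    // LastDerivedFrom returns the most recent L1 block the given L2 block was
+    // derived from: the upper bound of the L1 search space.
+    LastDerivedFrom(l2 BlockID) (BlockID, error)
+    // PreviousDerivedFrom returns the derived-from L1 block of the entry before
+    // that: the lower bound of the L1 search space.
+    PreviousDerivedFrom(l2 BlockID) (BlockID, error)
+}
+
+// resetWindow bounds the L1 traversal on reset with two index reads,
+// instead of walking back a full channel-timeout worth of L1 blocks.
+func resetWindow(idx DerivedFromIndex, l2Head BlockID) (lo, hi BlockID, err error) {
+    if hi, err = idx.LastDerivedFrom(l2Head); err != nil {
+        return lo, hi, err
+    }
+    lo, err = idx.PreviousDerivedFrom(l2Head)
+    return lo, hi, err
+}
+```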
+
+### The span-batch recovery problem
+
+Currently, span-batch recovery (pre-Holocene) relies on 'pending-safe',
+where a span-batch may generate pending blocks that can be reorged out
+if later content of the span-batch is found invalid.
+
+This is changing with Holocene (steady batch derivation, aka strict ordering):
+we do not want the complexity of having to revert data that was tentatively accepted.
+For interop fault-proving, the L2 assumptions should not be larger than 1 block,
+since we want to consolidate cross-L2 after confirming each L2 block optimistically.
+
+If we find a pre-existing L2 chain (simply on node restart, or on reorg),
+we currently cannot be sure it was derived from a certain L1; we have to recreate the derivation state to verify.
+And if the tip was part of a span-batch, we need to find the start of said span-batch.
+So while we can do away with the multi-block pending-safe reorg, we still have to "find" the start of a span-batch.
+If we had an index of L2-block to L1-derived-from, with span-batch bounds info,
+then finding that start would be much faster.
+
+### The fault-proof pre-state problem
+
+The current fault-proof prestate-agreement depends on having an optional "safedb" feature enabled in the op-node.
+This essentially serves as an index, mapping which L2 block has been derived from which L1 block.
+
+That index is good, and close to what this doc proposes, but the implementation is fragile:
+- It's optional, with no clear way of backfilling the data when it is enabled on an existing node.
+- It wraps a key-value DB with an arbitrary key / update scheme: easier to corrupt, and noisy (level/table changes) to update.
+- It depends on the temporary synchronous derivation state changes for consistency;
+  it is not yet attached to clear derived-from events.
+
+Deprecating/removing the "safedb" and its updating, in favor of a new unified stable derived-from indexing solution,
+would thus be a good improvement.
+
+### The interop reorg problem
+
+Interop cross-L2 reorgs are not supported yet, but we will need them for the following devnets / production.
+To do so, as a verifier node we need to answer: "Is my L2 A view invalidating your L2 B view, or vice-versa?"
+
+The key to answering that is the ordering of L2 safety increments based on L1:
+we can then say whether A or B was first, and which chain has to reorg to fix its dependencies in favor of the other chain.
+
+Unless there is some other novel way around the above, the op-supervisor needs to determine that order.
+To support that, we again need some form of a DB index of when an L2 block became safely derived from an L1 block.
+
+If the op-node maintains this index already, then it can upstream its single-chain view of this to the op-supervisor,
+which can maintain the multi-L2 view of L1 safety.
+
+The index should contain two things to support interop (see the sketch below):
+- per L1 block, tell up to which L2 block was considered "local-safe"
+- per L1 block, tell up to which L2 block was considered "cross-safe"
+
+The "local-safe" entries are not monotonic; if external dependencies change, we can see the label rewind.
+But the key is that the label rewinding per L2 has a global ordering,
+helping us resolve what to do about an L2 chain that may or may not be local-safe.
+
+The "cross-safe" entries are monotonic; these can only change on L1 reorg,
+at which point the whole derived-from index would reorg.
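+
+A minimal Go sketch of what such an index entry could look like follows; the type and field names
+are assumptions for illustration, not an existing op-supervisor DB format:
+
+```go
+package safetyindex
+
+// BlockID identifies a block by hash and number (hypothetical local type).
+type BlockID struct {
+    Hash   [32]byte
+    Number uint64
+}
+
+// SafetyEntry records, for one L1 block, up to which L2 block was considered
+// local-safe and cross-safe when that L1 block was processed.
+type SafetyEntry struct {
+    DerivedFrom BlockID // the L1 block this entry is keyed on
+    LocalSafe   BlockID // highest local-safe L2 block; may rewind if dependencies change
+    CrossSafe   BlockID // highest cross-safe L2 block; monotonic, only reorgs with L1
+}
+
+// derivedEarlier sketches the cross-chain ordering question: between the entries
+// where two chains each advanced, the one keyed on the lower L1 number came first.
+func derivedEarlier(a, b SafetyEntry) bool {
+    return a.DerivedFrom.Number < b.DerivedFrom.Number
+}
+```
+
+Ordering two chains' safety changes then reduces to comparing the L1 blocks their entries are keyed on.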
+
+### The L2 finalization problem
+
+Current L2 finalization caches recent L2-to-L1 derivation relations.
+The L1 beacon chain updates, if it finalizes at all, with a delay of ~2 beacon epochs from the tip of the L1 chain.
+
+By keeping the last 2 epochs of derived-from mappings in memory,
+the op-node can tell up to which L2 block was derived from an L1 block, and, as soon as L1 finalizes,
+trigger the finality of the L2 blocks that are now irreversibly derived from finalized L1 inputs.
+
+All of this caching logic would become unnecessary
+if we had a reliable lookup method for this L2-derived-from-L1 mapping data, to act on for any L1 finalization signal.
+
+On top of that, the finality deriver sub-system of the op-node can potentially be removed completely,
+if finality is checked through the op-supervisor API.
+The op-supervisor already knows which L2 block is safe,
+and what L1 block it was derived from, and can thus answer when an L2 block is finalized, given a finalized L1 view.
+
+### The op-proposer consistency problem
+
+The op-proposer has two modes:
+- Wait for a finalized L2 block, which we know we can propose, regardless of L1, as it's already final.
+- Wait for a safe L2 block, and make the op-node stop-the-world to get consistent derived-from and safe-L2 data.
+
+For safety/stability the finalized-L2 mode is generally used in production.
+But the latter would be faster, and is preferred in testing.
+It could work better if we didn't rely on the stop-the-world state, but instead had an index to query,
+to determine which block proposal has to be bound to which L1 blockhash, for reorg safety of L2 proposals.
+
+### The safety indexing problem
+
+Safety indexing is a feature of:
+- explorers: show that an L2 tx became safe at a specific point in L1, possibly even linking the batch data.
+- bridges: determine which L2 blocks to accept for bridging, given some coupling to L1 batch confirmation.
+
+An index of which L2 block was derived from which L1 block would help standardize this data,
+and enable every bridge/explorer to interact safely with cross-L2 block data.
+
+# Proposed Solution
+
+Interop, or technically "Isthmus", is an upgrade meant to apply to all OP-Stack chains.
+Chains with a dependency-set of 1 (themselves) should not have to run the op-supervisor,
+but should maintain a reliable view of the events and safety of their local chain.
+
+Like the op-supervisor events DB, the history is append-only, except for reorgs.
+This means we could optimize, and store things in an append-only log-style system.
+
+The "safedb" in the op-node can be replaced with a DB like the above, sharing code with the op-supervisor.
+Once that is done, we can remove the finality mechanism: the op-supervisor signals are used for finalization.
+
+The op-supervisor is written in Go, and modular enough that we can embed the single-DB solution into the op-node,
+to serve those less-standard chains with an interop-set of 1.
+This prevents these chains from entering a divergent hardfork path:
+very important to keep the code-base unified and tech-debt low.
+
+The main remaining question is how we bootstrap the data:
+- Chain-sync provides on-chain information only, since it needs to verify retrieved data against authenticated commitments.
+  We cannot start from snap-sync if we still have to backfill the history data.
+- Nodes don't currently have this state (if they have not been running the safe-db),
+  so we need to roll it out gradually.
+
+This feature could either be a "soft-rollout" type of thing,
+or a Holocene feature (if it turns out to be tightly integrated into the steady batch derivation work).
+
+## Resource Usage
+
+The op-node "safeDB" is the closest thing we have today
+to an index that maps between L2 blocks and the L1 blocks they were derived from.
+
+This index is missing the fields needed for interop compatibility (storing both local-safe and cross-safe increments),
+but gives an idea of the resource usage.
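+
+To make the estimate below concrete, a hypothetical fixed-size entry could look roughly like this
+(the field names and layout are assumptions, not an existing on-disk format):
+
+```go
+package safetyindex
+
+// diskEntry is a hypothetical fixed-size entry layout, only to make the disk
+// estimate below concrete; the field choices are assumptions, not an existing
+// on-disk format.
+type diskEntry struct {
+    L1Hash          [32]byte // derived-from L1 block hash
+    L1Number        uint64
+    L1Timestamp     uint64
+    LocalSafeHash   [32]byte // highest local-safe L2 block at this L1 block
+    LocalSafeNumber uint64
+    LocalSafeTime   uint64
+    CrossSafeHash   [32]byte // highest cross-safe L2 block at this L1 block
+    CrossSafeNumber uint64
+    CrossSafeTime   uint64
+}
+```
+
+The fixed fields above add up to 144 bytes; extra metadata (e.g. parent hashes or span-batch bounds)
+plausibly brings an entry to the roughly 200 bytes assumed in the estimate below.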
+
+In terms of compute it's very minimal to maintain:
+the op-node verifies L1 anyway, and writing the incremental updates is not expensive.
+
+In terms of disk-space it is a little more significant:
+at 32 bytes per L1 and L2 block hash, some L1 metadata, some L2 metadata, and both local-safe and cross-safe entries,
+each entry may be about 200 bytes. It does not compress well, as it largely contains blockhash data.
+
+Storing a week of this data, at 2 second blocks, would thus be `7 * 24 * 60 * 60 / 2 * 200 = 60,480,000` bytes, or about 60 MB.
+Storing a year of this data would be around 3 GB.
+
+The main resource-risk might be the reconstruction work: this requires resyncing the L2 chain.
+Easy snapshots, robust handling to avoid data-corruption, and simple distribution would avoid the need to re-sync.
+Potentially this can be merged with the work around op-supervisor events DB snapshot handling.
+
+# Alternatives Considered
+
+## Reusing the safedb
+
+The safeDB in the op-node could be re-used, but there are several things to fix, with breaking changes:
+- the way it is updated needs to be less fragile.
+- the way it is searched for an L2/L1 mapping likely needs to improve.
+- the format needs to change to support interop cross-safe data.
+
+Since the scope of the index itself is relatively small,
+I believe redoing it, while designing for all the known problems, not just fault-proofs, produces a better product.
+
+# Risks & Uncertainties
+
+## Scope
+
+While the number of problems is large, we do already have (suboptimal) solutions in place for each of them.
+We could make a draft implementation,
+and merge it one problem at a time, in order of priority, to remove the legacy solutions incrementally.
+
+In particular, this work might be a key part of Interop devnet 2, to handle reorgs. This should be prioritized.
+The potential benefit to Holocene steady-batch-derivation should also be reviewed early,
+so we can reduce Holocene work by de-duplicating it with Interop.
+