diff --git a/flips/component-interface.md b/flips/component-interface.md deleted file mode 100644 index fed4129eb82..00000000000 --- a/flips/component-interface.md +++ /dev/null @@ -1,446 +0,0 @@ -# Component Interface (Core Protocol) - -| Status | Proposed | -:-------------- |:--------------------------------------------------------- | -| **FLIP #** | [1167](https://github.com/onflow/flow-go/pull/1167) | -| **Author(s)** | Simon Zhu (simon.zhu@dapperlabs.com) | -| **Sponsor** | Simon Zhu (simon.zhu@dapperlabs.com) | -| **Updated** | 9/16/2021 | - -## Objective - -FLIP to separate the API through which components are started from the API through which they expose their status. - -## Current Implementation - -The [`ReadyDoneAware`](https://github.com/onflow/flow-go/blob/7763000ba5724bb03f522380e513b784b4597d46/module/common.go#L6) interface provides an interface through which components / modules can be started and stopped. Calling the `Ready` method should start the component and return a channel that will close when startup has completed, and `Done` should be the corresponding method to shut down the component. - -### Potential problems - -The current `ReadyDoneAware` interface is misleading, as by the name one might expect that it is only used to check the state of a component. However, in almost all current implementations the `Ready` method is used to both start the component *and* check when it has started up, and similarly for the `Done` method. - -This introduces issues of concurrency safety / idempotency, as most implementations do not properly handle the case where the `Ready` or `Done` methods are called more than once. See [this example](https://github.com/onflow/flow-go/pull/1026). - -[Clearer documentation](https://github.com/onflow/flow-go/pull/1032) and a new [`LifecycleManager`](https://github.com/onflow/flow-go/pull/1031) component were introduced as a step towards fixing this by providing concurrency-safety for components implementing `ReadyDoneAware`, but this still does not provide a clear separation between the ability to start / stop a component and the ability to check its state. A component usually only needs to be started once, whereas multiple other components may wish to check its state. - -## Proposal - -Moving forward, we will add a new `Startable` interface in addition to the existing `ReadyDoneAware`: -```golang -// Startable provides an interface to start a component. Once started, the component -// can be stopped by cancelling the given context. -type Startable interface { - // Start starts the component. Any errors encountered during startup should be returned - // directly, whereas irrecoverable errors encountered while the component is running - // should be thrown with the given SignalerContext. - // This method should only be called once, and subsequent calls should return ErrMultipleStartup. - Start(irrecoverable.SignalerContext) error -} -``` -Components which implement this interface are passed in a `SignalerContext` upon startup, which they can use to propagate any irrecoverable errors they encounter up to their parent via `SignalerContext.Throw`. The parent can then choose to handle these errors however they like, including restarting the component, logging the error, propagating the error to their own parent, etc. - -```golang -// We define a constrained interface to provide a drop-in replacement for context.Context -// including in interfaces that compose it. 
-type SignalerContext interface { - context.Context - Throw(err error) // delegates to the signaler - sealed() // private, to constrain builder to using WithSignaler -} - -// private, to force context derivation / WithSignaler -type signalerCtx struct { - context.Context - *Signaler -} - -func (sc signalerCtx) sealed() {} - -// the One True Way of getting a SignalerContext -func WithSignaler(parent context.Context) (SignalerContext, <-chan error) { - sig, errChan := NewSignaler() - return &signalerCtx{parent, sig}, errChan -} - -// Signaler sends the error out. -type Signaler struct { - errChan chan error - errThrown *atomic.Bool -} - -func NewSignaler() (*Signaler, <-chan error) { - errChan := make(chan error, 1) - return &Signaler{ - errChan: errChan, - errThrown: atomic.NewBool(false), - }, errChan -} - -// Throw is a narrow drop-in replacement for panic, log.Fatal, log.Panic, etc -// anywhere there's something connected to the error channel. It only sends -// the first error it is called with to the error channel; there are various -// options as to how subsequent errors can be handled. -func (s *Signaler) Throw(err error) { - defer runtime.Goexit() - if s.errThrown.CAS(false, true) { - s.errChan <- err - close(s.errChan) - } else { - // Another thread, possibly from the same component, has already thrown - // an irrecoverable error to this Signaler. Any subsequent irrecoverable - // errors can either be logged or ignored, as the parent will already - // be taking steps to remediate the first error. - } -} -``` - -> For more details about `SignalerContext` and `ErrMultipleStartup`, see [#1275](https://github.com/onflow/flow-go/pull/1275) and [#1355](https://github.com/onflow/flow-go/pull/1355/). - -To start a component, a `SignalerContext` must first be created: - -```golang -var parentCtx context.Context // this is the context for the routine which manages the component -var childComponent component.Component - -ctx, cancel := context.WithCancel(parentCtx) - -// create a SignalerContext and return an error channel which can be used to receive -// any irrecoverable errors thrown with the Signaler -signalerCtx, errChan := irrecoverable.WithSignaler(ctx) - -// start the child component -childComponent.Start(signalerCtx) - -// launch goroutine to handle errors thrown from the child component -go func() { - select { - case err := <-errChan: // error thrown by child component - cancel() - // handle the error... - case <-parentCtx.Done(): // canceled by parent - // perform any necessary cleanup... - } -}() -``` - -With all of this in place, the semantics of `ReadyDoneAware` can be redefined to only be used to check a component's state (i.e. wait for startup / shutdown to complete): -```golang -type ReadyDoneAware interface { - // Ready returns a channel that will close when component startup has completed. - Ready() <-chan struct{} - // Done returns a channel that will close when component shutdown has completed. - Done() <-chan struct{} -} -``` - -Finally, we can define a `Component` interface which combines both of these interfaces: -```golang -type Component interface { - Startable - ReadyDoneAware -} -``` - -A component will now be started by passing a `SignalerContext` to its `Start` method, and can be stopped by cancelling the `Context`. If a component needs to start up subcomponents, it can create child `Context`s from this `Context` and pass those to the subcomponents.
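To illustrate this supervision pattern, here is a minimal, hypothetical sketch of a parent component that starts a single subcomponent. `ParentComponent` and its `child` field are assumptions introduced only for this example, the `ReadyDoneAware` bookkeeping is omitted, and the error handling is deliberately simplified:

```golang
// ParentComponent starts and supervises a single subcomponent, following the
// pattern described above. Illustration only, not part of the proposal.
type ParentComponent struct {
	child Component
}

func (p *ParentComponent) Start(ctx irrecoverable.SignalerContext) error {
	// derive a child context so that cancelling the parent also shuts down the subcomponent
	childCtx, cancel := context.WithCancel(ctx)

	// equip the subcomponent with a fresh signaler supervised by this component
	signalerCtx, errChan := irrecoverable.WithSignaler(childCtx)

	if err := p.child.Start(signalerCtx); err != nil {
		cancel()
		return err
	}

	go func() {
		defer cancel()
		select {
		case err := <-errChan:
			// the subcomponent threw an irrecoverable error; here we simply
			// escalate it to our own supervisor
			ctx.Throw(err)
		case <-ctx.Done():
			// the parent is shutting down; the deferred cancel stops the subcomponent
		}
	}()

	return nil
}
```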
-### Motivations -- `Context`s are the standard way of doing goroutine lifecycle management in Go, and adhering to standards helps eliminate confusion and ambiguity for anyone interacting with the `flow-go` codebase. This is especially true now that we are beginning to provide APIs and interfaces for third parties to interact with the codebase (e.g. DPS). - - Even to someone unfamiliar with our codebase (but familiar with Go idioms), it is clear how a method signature like `Start(context.Context) error` will behave. A method signature like `Ready()` is not so clear. -- This promotes a hierarchical supervision paradigm, where each `Component` is equipped with a fresh signaler to its parent at launch, and is thus supervised by its parent for any irrecoverable errors it may encounter (the call to `WithSignaler` replaces the signaler in a parent context). As a consequence, sub-components started by a component have that component as their supervisor, which handles their irrecoverable failures, and so on. - - If context propagation is done properly, there is no need to worry about any cleanup code in the `Done` method. Cancelling the context for a component will automatically cancel all subcomponents / child routines in the component tree, and we do not have to explicitly call `Done` on each and every subcomponent to trigger their shutdown. - - This allows us to separate the capability to check a component's state from the capability to start / stop it. We may want to give multiple other components the capability to check its state, without giving them the capability to start or stop it. Here is an [example](https://github.com/onflow/flow-go/blob/b50f0ffe054103a82e4aa9e0c9e4610c2cbf2cc9/engine/common/splitter/network/network.go#L112) of where this would be useful. - - This provides a clearer way of defining ownership of components, and hence may potentially eliminate the need to deal with concurrency-safety altogether. Whoever creates a component should be responsible for starting it, and therefore they should be the only one with access to its `Startable` interface. If each component only has a single parent that is capable of starting it, then we should never run into concurrency issues. - -## Implementation (WIP) -* Lifecycle management logic for components can be further abstracted into a `RunComponent` helper function: - - ```golang - type ComponentFactory func() (Component, error) - - // OnError reacts to an irrecoverable error - // It is meant to inspect the error, determining its type and seeing if e.g. a restart or some other measure is suitable, - // and then return an ErrorHandlingResult indicating how RunComponent should proceed. - // Before returning, it could also: - // - panic (in sandboxnet / benchmark) - // - log in various Error channels and / or send telemetry ... - type OnError = func(err error) ErrorHandlingResult - - type ErrorHandlingResult int - - const ( - ErrorHandlingRestart ErrorHandlingResult = iota - ErrorHandlingStop - ) - - // RunComponent repeatedly starts components returned from the given ComponentFactory, shutting them - // down when they encounter irrecoverable errors and passing those errors to the given error handler. - // If the given context is cancelled, it will wait for the current component instance to shut down - // before returning.
- // The returned error is either: - // - The context error if the context was canceled - // - The last error handled if the error handler returns ErrorHandlingStop - // - An error returned from componentFactory while generating an instance of component - func RunComponent(ctx context.Context, componentFactory ComponentFactory, handler OnError) error { - // reference to per-run signals for the component - var component Component - var cancel context.CancelFunc - var done <-chan struct{} - var irrecoverableErr <-chan error - - start := func() error { - var err error - - component, err = componentFactory() - if err != nil { - return err // failure to generate the component, should be handled out-of-band because a restart won't help - } - - // context used to run the component - var runCtx context.Context - runCtx, cancel = context.WithCancel(ctx) - - // signaler context used for irrecoverables - var signalCtx irrecoverable.SignalerContext - signalCtx, irrecoverableErr = irrecoverable.WithSignaler(runCtx) - - component.Start(signalCtx) - - done = component.Done() - - return nil - } - - stop := func() { - // shutdown the component and wait until it's done - cancel() - <-done - } - - for { - select { - case <-ctx.Done(): - return ctx.Err() - default: - } - - if err := start(); err != nil { - return err // failure to start - } - - select { - case <-ctx.Done(): - stop() - return ctx.Err() - case err := <-irrecoverableErr: - stop() - - // send error to the handler - switch result := handler(err); result { - case ErrorHandlingRestart: - continue - case ErrorHandlingStop: - return err - default: - panic(fmt.Sprintf("invalid error handling result: %v", result)) - } - case <-done: - // Without this additional select, there is a race condition here where the done channel - // could have been closed as a result of an irrecoverable error being thrown, so that when - // the scheduler yields control back to this goroutine, both channels are available to read - // from. If this last case happens to be chosen at random to proceed instead of the one - // above, then we would return as if the component shutdown gracefully, when in fact it - // encountered an irrecoverable error. - select { - case err := <-irrecoverableErr: - switch result := handler(err); result { - case ErrorHandlingRestart: - continue - case ErrorHandlingStop: - return err - default: - panic(fmt.Sprintf("invalid error handling result: %v", result)) - } - default: - } - - // Similarly, the done channel could have closed as a result of the context being canceled. - select { - case <-ctx.Done(): - return ctx.Err() - default: - } - - // clean completion - return nil - } - } - } - ``` - - > Note: this is now implemented in [#1275](https://github.com/onflow/flow-go/pull/1275) and [#1355](https://github.com/onflow/flow-go/pull/1355), and an example can be found [here](https://github.com/onflow/flow-go/blob/24406ed3fde7661cb1df84a25755cedf041a1c50/module/irrecoverable/irrecoverable_example_test.go). 
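  As a rough inline illustration of how a caller might drive `RunComponent` (the linked example above is authoritative; the `fooType` value, the hypothetical `NewFooComponent` constructor, and the always-restart policy below are assumptions for this sketch):

  ```golang
  var ctx context.Context // context controlling the lifetime of the managed component
  var foo fooType

  componentFactory := func() (Component, error) {
      // construct a fresh instance for every (re)start; NewFooComponent is a hypothetical constructor
      return NewFooComponent(foo), nil
  }

  onError := func(err error) ErrorHandlingResult {
      // inspect the error and decide how RunComponent should proceed;
      // this sketch always restarts the component
      return ErrorHandlingRestart
  }

  if err := RunComponent(ctx, componentFactory, onError); err != nil {
      // the context was cancelled, the factory failed, or the handler chose ErrorHandlingStop
      // handle the terminal error...
  }
  ```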
-* We may be able to encapsulate a lot of the boilerplate code involved in handling startup / shutdown of worker routines into a single `ComponentManager` struct: - - ```golang - type ReadyFunc func() - - // ComponentWorker represents a worker routine of a component - type ComponentWorker func(ctx irrecoverable.SignalerContext, ready ReadyFunc) - - // ComponentManagerBuilder provides a mechanism for building a ComponentManager - type ComponentManagerBuilder interface { - // AddWorker adds a worker routine for the ComponentManager - AddWorker(ComponentWorker) ComponentManagerBuilder - - // Build builds and returns a new ComponentManager instance - Build() *ComponentManager - } - - // ComponentManager is used to manage the worker routines of a Component - type ComponentManager struct { - started *atomic.Bool - ready chan struct{} - done chan struct{} - shutdownSignal <-chan struct{} - - workers []ComponentWorker - } - - // Start initiates the ComponentManager by launching all worker routines. - func (c *ComponentManager) Start(parent irrecoverable.SignalerContext) { - // only start once - if c.started.CAS(false, true) { - ctx, cancel := context.WithCancel(parent) - signalerCtx, errChan := irrecoverable.WithSignaler(ctx) - c.shutdownSignal = ctx.Done() - - // launch goroutine to propagate irrecoverable error - go func() { - select { - case err := <-errChan: - cancel() // shutdown all workers - - // we propagate the error directly to the parent because a failure in a - // worker routine is considered irrecoverable - parent.Throw(err) - case <-c.done: - // Without this additional select, there is a race condition here where the done channel - // could be closed right after an irrecoverable error is thrown, so that when the scheduler - // yields control back to this goroutine, both channels are available to read from. If this - // second case happens to be chosen at random to proceed, then we would return and silently - // ignore the error. - select { - case err := <-errChan: - cancel() - parent.Throw(err) - default: - } - } - }() - - var workersReady sync.WaitGroup - var workersDone sync.WaitGroup - workersReady.Add(len(c.workers)) - workersDone.Add(len(c.workers)) - - // launch workers - for _, worker := range c.workers { - worker := worker - go func() { - defer workersDone.Done() - var readyOnce sync.Once - worker(signalerCtx, func() { - readyOnce.Do(func() { - workersReady.Done() - }) - }) - }() - } - - // launch goroutine to close ready channel - go c.waitForReady(&workersReady) - - // launch goroutine to close done channel - go c.waitForDone(&workersDone) - } else { - panic(module.ErrMultipleStartup) - } - } - - func (c *ComponentManager) waitForReady(workersReady *sync.WaitGroup) { - workersReady.Wait() - close(c.ready) - } - - func (c *ComponentManager) waitForDone(workersDone *sync.WaitGroup) { - workersDone.Wait() - close(c.done) - } - - // Ready returns a channel which is closed once all the worker routines have been launched and are ready. - // If any worker routines exit before they indicate that they are ready, the channel returned from Ready will never close. - func (c *ComponentManager) Ready() <-chan struct{} { - return c.ready - } - - // Done returns a channel which is closed once the ComponentManager has shut down. - // This happens when all worker routines have shut down (either gracefully or by throwing an error). - func (c *ComponentManager) Done() <-chan struct{} { - return c.done - } - - // ShutdownSignal returns a channel that is closed when shutdown has commenced. 
- // This can happen either if the ComponentManager's context is canceled, or a worker routine encounters - // an irrecoverable error. - // If this is called before Start, a nil channel will be returned. - func (c *ComponentManager) ShutdownSignal() <-chan struct{} { - return c.shutdownSignal - } - ``` - - Components that want to implement `Component` can use this `ComponentManager` to simplify implementation: - - ```golang - type FooComponent struct { - *component.ComponentManager - } - - func NewFooComponent(foo fooType) *FooComponent { - f := &FooComponent{} - - cmb := component.NewComponentManagerBuilder(). - AddWorker(f.childRoutine). - AddWorker(f.childRoutineWithFooParameter(foo)) - - f.ComponentManager = cmb.Build() - - return f - } - - func (f *FooComponent) childRoutine(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) { - // signal that this worker has started up - ready() - - for { - select { - case <-ctx.Done(): - return - default: - // do work... - } - } - } - - func (f *FooComponent) childRoutineWithFooParameter(foo fooType) component.ComponentWorker { - return func(ctx irrecoverable.SignalerContext, ready component.ReadyFunc) { - // signal that this worker has started up - ready() - - for { - select { - case <-ctx.Done(): - return - default: - // do work with foo... - - // encounter irrecoverable error - ctx.Throw(errors.New("fatal error!")) - } - } - } - } - ``` - - > Note: this is now implemented in [#1355](https://github.com/onflow/flow-go/pull/1355) diff --git a/flips/network-api.md b/flips/network-api.md deleted file mode 100644 index a2caa57f54d..00000000000 --- a/flips/network-api.md +++ /dev/null @@ -1,93 +0,0 @@ -# Network Layer API (Core Protocol) - -| Status | Proposed | -:-------------- |:--------------------------------------------------------- | -| **FLIP #** | [1306](https://github.com/onflow/flow-go/pull/1306) | -| **Author(s)** | Simon Zhu (simon.zhu@dapperlabs.com) | -| **Sponsor** | Simon Zhu (simon.zhu@dapperlabs.com) | -| **Updated** | 9/16/2021 | - -## Objective - -Refactor the networking layer to split it into separate APIs for the public and private network, allowing us to implement a strict separation in the code between these two networks. - -Enable registering a custom message ID function for the gossip layer. - -## Current Implementation - -When the network layer receives a message, it will pass the message to the [`Engine`](https://github.com/onflow/flow-go/blob/7763000ba5724bb03f522380e513b784b4597d46/network/engine.go) registered on -the corresponding channel by [calling the engine's `Process` method](https://github.com/onflow/flow-go/blob/d31fd63eb651ed9faf0f677e9934baef6c4d9792/network/p2p/network.go#L406), passing it the Flow ID of the message sender. - -[`Multicast`](https://github.com/onflow/flow-go/blob/4ddc17d1bee25c2ab12ceabcf814b702980fdebe/network/conduit.go#L82) is implemented by including a [`TargetIDs`](https://github.com/onflow/flow-go/blob/4ddc17d1bee25c2ab12ceabcf814b702980fdebe/network/message/message.proto#L12) field inside the message, which is published to a specific topic on the underlying gossip network. Upon receiving a new message on the gossip network, nodes must first [validate](https://github.com/onflow/flow-go/blob/4ddc17d1bee25c2ab12ceabcf814b702980fdebe/network/validator/targetValiator.go) that they are one of the intended recipients of the message before processing it. - -### Potential problems - -The current network layer API was designed with the assumption that all messages sent and received either target or originate from staked Flow nodes.
This is why an engine's [`Process`](https://github.com/onflow/flow-go/blob/master/network/engine.go#L28) method accepts a Flow ID identifying the message sender, and outgoing messages [must specify Flow ID(s)](https://github.com/onflow/flow-go/blob/master/network/conduit.go#L62) as targets. - -This assumption is no longer true today. The access node, for example, may communicate with multiple (unstaked) consensus followers. It's perceivable that in the future there will be even more cases where communication with unstaked parties may happen (for example, execution nodes talking to DPS). - -Currently, a [`Message`](https://github.com/onflow/flow-go/blob/698c77460bc33d1a8ee8a154f7fe4877bc518a02/network/message/message.proto) which is sent over the network contains many unnecessary fields which can be deduced by the receiver of the message. The only exceptions to this are the `Payload` field (which contains the actual message data) and the `TargetIDs` field (which is used by `Multicast`). - -However, all of the existing calls to `Multicast` only target a very small number of recipients (3 to be exact), which means that there is a lot of noise on the network causing nodes to waste CPU cycles processing messages only to ignore them once they realize they are not one of the intended recipients. - -## Proposal - -We should split the existing network layer API into two distinct APIs / packages for the public and private network, and the `Engine` API should be modified so that the [`Process`](https://github.com/onflow/flow-go/blob/master/network/engine.go#L28) and [`Submit`](https://github.com/onflow/flow-go/blob/master/network/engine.go#L20) methods receive a `Context` as the first argument: - -* Private network - ```golang - type Engine interface { - Submit(ctx context.Context, channel Channel, originID flow.Identifier, event interface{}) - Process(ctx context.Context, channel Channel, originID flow.Identifier, event interface{}) error - } - - type Conduit interface { - Publish(event interface{}, targetIDs ...flow.Identifier) error - Unicast(event interface{}, targetID flow.Identifier) error - Multicast(event interface{}, num uint, targetIDs ...flow.Identifier) error - } - ``` -* Public network - ```golang - type Engine interface { - Submit(ctx context.Context, channel Channel, senderPeerID peer.ID, event interface{}) - Process(ctx context.Context, channel Channel, senderPeerID peer.ID, event interface{}) error - } - - type Conduit interface { - Publish(event interface{}, targetIDs ...peer.ID) error - Unicast(event interface{}, targetID peer.ID) error - Multicast(event interface{}, num uint, targetIDs ...peer.ID) error - } - ``` - -Various types of request-scoped data may be included in the `Context` as [values](https://pkg.go.dev/context#WithValue). For example, if a message sent on the public network originates from a staked node, that node's Flow ID may be included as a value. Once engine-side message queues are standardized as described in [FLIP 343](https://github.com/onflow/flow/pull/343), the given `Context` can be placed in the message queue along with the message itself in a wrapper struct: - -```golang -type Message struct { - ctx context.Context - event interface{} -} -``` - -> While this may seem to break the general rule of not storing `Context`s in structs, storing `Context`s in structs which are being passed like parameters is one of the exceptions to this rule. See [this](https://github.com/golang/go/issues/22602#:~:text=While%20we%27ve%20told,documentation%20and%20examples.) 
and [this](https://medium.com/@cep21/how-to-correctly-use-context-context-in-go-1-7-8f2c0fafdf39#:~:text=The%20one%20exception%20to%20not%20storing%20a%20context%20is%20when%20you%20need%20to%20put%20it%20in%20a%20struct%20that%20is%20used%20purely%20as%20a%20message%20that%20is%20passed%20across%20a%20channel.%20This%20is%20shown%20in%20the%20example%20below.). The idea is that `Context`s should not be **stored** but should **flow** through the program, which is what they do in this usecase. - -When the message is dequeued, the engine should check the `Context` to see whether the message might already be obsolete before processing it. At this point, we will have two distinct `Context`s in scope: -* The message `Context` -* The `Context` of the goroutine which is dequeing / processing the message - -These can be combined into a [single context](https://github.com/teivah/onecontext) which can be used by the message processing business logic, so that the processing can be cancelled either by the network or by the engine. This will allow us to deprecate [`engine.Unit`](https://github.com/onflow/flow-go/blob/master/engine/unit.go), which uses a single `Context` for the entire engine. - -There are certain types of messages (e.g block proposals) which may transit between the private and public networks via relay nodes (e.g Access Nodes). Libp2p's [default message ID function](https://github.com/libp2p/go-libp2p-pubsub/blob/0c7092d1f50091ae88407ba93103ac5868da3d0a/pubsub.go#L1040-L1043) will treat a message originating from one network, relayed to the other network by `n` distinct relay nodes, as `n` distinct messages, causing unacceptable message duplification / traffic amplification. In order to prevent this, we will need to define a [custom message ID function](https://pkg.go.dev/github.com/libp2p/go-libp2p-pubsub#WithMessageIdFn) which returns the hash of the message [`Payload`](https://github.com/onflow/flow-go/blob/698c77460bc33d1a8ee8a154f7fe4877bc518a02/network/message/message.proto#L13). - -In order to avoid making the message ID function deserialize the `Message` to access the `Payload`, we need to remove all other fields from the `Message` protobuf so that the message ID function can simply take the hash of the pubsub [`Data`](https://github.com/libp2p/go-libp2p-pubsub/blob/0c7092d1f50091ae88407ba93103ac5868da3d0a/pb/rpc.pb.go#L145) field without needing to do any deserialization. - -The `Multicast` implementation will need to be changed to make direct connections to the target peers instead of sending messages with a `TargetIDs` field via gossip. - -### Motivations -- Having a strict separation between the public and private networks provides better safety by preventing unintended passage of messages between the two networks, and makes it easier to implement mechanisms for message prioritization / rate-limiting on staked nodes which participate in both. -- Passing `Context`s gives the network layer the ability to cancel the processing of a network message. This can be leveraged to implement [timeouts](https://pkg.go.dev/context#WithTimeout), but may also be useful for other situations. For example, if the network layer becomes aware that a certain peer has become unreachable, it can cancel the processing of any sync requests from that peer. -- Since existing calls to `Multicast` only target 3 peers, changing the implementation to use direct connections instead of gossip will reduce traffic on the network and make it more efficient. 
-- While `engine.Unit` provides some useful functionalities, it also uses the anti-pattern of [storing a `Context` inside a struct](https://github.com/onflow/flow-go/blob/b50f0ffe054103a82e4aa9e0c9e4610c2cbf2cc9/engine/unit.go#L117), something which is [specifically advised against](https://pkg.go.dev/context#:~:text=Do%20not%20store%20Contexts%20inside%20a%20struct%20type%3B%20instead%2C%20pass%20a%20Context%20explicitly%20to%20each%20function%20that%20needs%20it.%20The%20Context%20should%20be%20the%20first%20parameter%2C%20typically%20named%20ctx%3A) by [the developers of Go](https://go.dev/blog/context-and-structs#TOC_2.). Here is an [example](https://go.dev/blog/context-and-structs#:~:text=Storing%20context%20in%20structs%20leads%20to%20confusion) illustrating some of the problems with this approach. - -## Implementation (TODO) diff --git a/flips/sync-protocol.md b/flips/sync-protocol.md deleted file mode 100644 index c4ca019eb21..00000000000 --- a/flips/sync-protocol.md +++ /dev/null @@ -1,109 +0,0 @@ -# Sync Engine (Core Protocol) - -| Status | Proposed | -:-------------- |:--------------------------------------------------------- | -| **FLIP #** | 1697 | -| **Author(s)** | Simon Zhu (simon.zhu@dapperlabs.com) | -| **Sponsor** | Simon Zhu (simon.zhu@dapperlabs.com) | -| **Updated** | 11/29/2021 | - -## Objective - -Redesign the synchronization protocol to improve efficiency, robustness, and Byzantine fault tolerance. - -## Current Implementation - -The current synchronization protocol implementation consists of two main pieces: -* The [Sync Engine](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/engine/common/synchronization/engine.go) interfaces with the network layer and handles sending synchronization requests to other nodes and processing responses. -* The [Sync Core](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go) implements the core logic, configuration, and state management of the synchronization protocol. - -There are three types of synchronization requests: -* A [Sync Height Request](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/model/messages/synchronization.go#L8-L14) is sent to share the local finalized height while requesting the same information from the recipient. It is replied to with a [Sync Height Response](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/model/messages/synchronization.go#L16-L22). -* A [Batch Request](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/model/messages/synchronization.go#L34-L40) requests a list of blocks by ID. It is replied to with a [Block Response](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/model/messages/synchronization.go#L42-L48). -* A [Range Request](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/model/messages/synchronization.go#L24-L32) requests a range of finalized blocks by height. It is replied to with a [Block Response](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/model/messages/synchronization.go#L42-L48). - -The Sync Core uses two data structures to track the statuses of requestable items: -* [`Heights`](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L53) tracks the set of requestable finalized block heights. 
It is used to generate Ranges for the Sync Engine to request. -* [`BlockIDs`](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L54) tracks the set of requestable block IDs. It is used to generate Batches for the Sync Engine to request. - -The Sync Engine periodically picks a small number of random nodes to send Sync Height Requests to. It also periodically calls [`ScanPending`](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L148-L166) to get a list of requestable Ranges and Batches from the Sync Core, and picks some random nodes to send those requests to. - -Each time the Compliance Engine processes a new block proposal, it finds the first ancestor which has not yet been received (if one exists) and calls [`RequestBlock`](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L114-L126) to request the missing block ID. `RequestBlock` updates `BlockIDs` by queueing the block ID. - -Each time the Sync Engine receives a Sync Height Response, it calls [`HandleHeight`](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L95-L112) to pass the received height to the Sync Core, which updates `Heights` by queueing all heights between the local finalized height and the received height. - -Each time the Sync Engine receives a Block Response, it calls [`HandleBlock`](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L67-L93) to pass each of the received blocks to the Sync Core, which updates the tracked statuses in `Heights` and `BlockIDs`. - -### Potential Problems - -* Items in `BlockIDs` do not contain the block height, which means that they cannot be pruned until the corresponding block has actually been received. If a malicious block proposal causes a non-existent parent ID to be queued by the Compliance Engine, the item will not be pruned until the maximum number of attempts is reached. -* After a block corresponding to an item in `BlockIDs` is received, the item is not pruned until the local finalized height surpasses the height of the block. If a node is very far behind, `BlockIDs` could grow very large before the local finalization catches up. -* When the Sync Engine calls `ScanPending`, it passes in the local finalized height, which the Sync Core uses to prune requestable items. Since `Heights` and `BlockIDs` are both implemented using Go maps, pruning them involves iterating through all items to find the ones for which the associated block height is lower than the local finalized height, which is inefficient. Furthermore, pruning is triggered on every call to `ScanPending`, even if the local finalized height has not changed. -* The implementation of `ScanPending` is split into three steps: - * Iterate through `Heights` and `BlockIDs` and [find all requestable heights and block IDs](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L264-L326). - * Group these requestable items into [Ranges](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L360-L415) and [Batches](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L417-L439). 
- * [Select a subset of these Ranges and Batches to return](https://github.com/onflow/flow-go/blob/39c455da40c8f0aa6f9962c48f4cd34a5cbacfc0/module/synchronization/core.go#L441-L454) based on a configurable limit on the maximum number of in-flight requests. - - While conceptually easy to understand, this implementation is inefficient and performs many more loop iterations than necessary. -* `HandleHeight` iterates over the entire range from the local finalized height to the received height, queueing all new heights and requeueing heights which have already been received. This can be expensive if the node is very far behind. -* The Sync Core optimistically sets the status of an item in `Heights` as Received as soon as *any* block with the corresponding height is received, even though it has no way of knowing whether the received block has actually been finalized by consensus. This could cause the height to stop being requested before the finalized block has actually been received. It's also possible that this could cause `Heights` to become fragmented (smaller requestable ranges). -* Processing a Range Request response may or may not advance the local finalized height. There are two reasons why it may not progress: - * The response contains blocks that are not actually finalized (e.g. received from a malicious node). - * More blocks are needed to form a Three-Chain and advance the local finalized height. Although this case becomes increasingly unlikely with larger ranges, it is theoretically still possible under the event-driven version of the [HotStuff](https://arxiv.org/abs/1803.05069) algorithm. - - The Sync Core does not account for the second case, and so it is possible that the Sync Engine gets stuck requesting the same range over and over. -* There is no way to determine whether a Sync Height Response is honest or not. If `HandleHeight` is called for every received Sync Height Response, an attacker could cause `Heights` to grow unboundedly large by sending a Sync Height Response with an absurdly high height. - -## Proposal - -The `RequestBlock` API should be updated to accept a block height, which should be stored with the queued item in `BlockIDs`. This will allow items in `BlockIDs` to be pruned as soon as the local finalized height surpasses their associated block heights. This also allows the Sync Core to ignore calls to `RequestBlock` for block heights which exceed the local finalized height by more than a configurable threshold. This helps to reduce the amount of resources spent tracking and requesting blocks which cannot immediately be finalized anyways. - -Instead of pruning on every call to `ScanPending`, the Sync Core should keep track of the local finalized height from the latest call to `ScanPending`, and only trigger pruning if the height has actually changed. If necessary, it's possible to optimize the performance of pruning and avoid iterating through every item in `BlockIDs` by maintaining an additional mapping from block heights to the set of requestable block IDs at each height. - -Instead of sending synchronization requests via gossip, we should directly create a new stream to another node for each request and validate the response we receive: -* A Range Request response should contain a single chain of blocks which begins at the start height of the requested range and is no longer than the size of the requested range. -* A Batch Request response should contain a subset of the requested block IDs. 
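As an illustration, the Range Request check described above might look roughly like the following sketch (assuming `chainsync.Range` exposes `From` / `To` heights and that block headers expose `Height`, `ParentID`, and `ID()`; the function name is hypothetical):

```golang
// validateRangeResponse is a hypothetical sketch of the validation described above:
// the response must be a single chain that starts at the requested start height and
// is no longer than the requested range.
func validateRangeResponse(req chainsync.Range, blocks []*flow.Block) error {
	if len(blocks) == 0 {
		return fmt.Errorf("empty range response")
	}
	if uint64(len(blocks)) > req.To-req.From+1 {
		return fmt.Errorf("response contains more blocks than requested")
	}
	if blocks[0].Header.Height != req.From {
		return fmt.Errorf("response does not start at requested height %d", req.From)
	}
	for i := 1; i < len(blocks); i++ {
		prev, cur := blocks[i-1].Header, blocks[i].Header
		if cur.Height != prev.Height+1 || cur.ParentID != prev.ID() {
			return fmt.Errorf("response is not a single connected chain at index %d", i)
		}
	}
	return nil
}
```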
- -This eliminates any ambiguity about whether a response corresponds to a Batch or Range Request, so we can avoid optimistically setting the statuses of heights as Received for responses to Batch Requests. - -At any time, there is a single range of heights that the Sync Engine actively requests, which is tracked by the Sync Core. We call this the Active Range. The Active Range is parameterized by two variables `RangeStart` and `RangeEnd`, which effectively replace the `Heights` map from the existing implementation, but it can be broken up and requested by the Sync Engine in multiple segments. `RangeStart` should be greater than the local finalized block height, and `RangeEnd` should be less than or equal to the target finalized block height (more details below). The logic for updating the Active Range can be abstracted with an interface: - -```golang -type ActiveRange interface { - // Update processes a range of blocks received from a Range Request - // response and updates the requestable height range. - Update(headers []flow.Header, originID flow.Identifier) - - // LocalFinalizedHeight is called to notify a change in the local finalized height. - LocalFinalizedHeight(height uint64) - - // TargetFinalizedHeight is called to notify a change in the target finalized height. - TargetFinalizedHeight(height uint64) - - // Get returns the range of requestable block heights. - Get() chainsync.Range -} -``` - -There are many ways to implement this interface, but one possible approach is as follows: -* Select values for parameters `DefaultRangeSize` and `MinResponses` -* Let `PendingStart` be the first height greater than `LocalFinalizedHeight` that has been received less than `MinResponses` times -* Let `RangeStart` be equal to `LocalFinalizedHeight + 1` -* Let `RangeEnd` be the smaller of `TargetFinalizedHeight` and `PendingStart + DefaultRangeSize` - -The reason we keep track of `PendingStart` is to ensure that `RangeEnd` eventually increases even if the local finalized height doesn't. This is needed to address the second last item in [Potential Problems](#potential-problems). - -The target finalized height represents the speculated finalized block height of the overall chain, and should reflect the Sync Height Responses that have been received while accounting for the possibility that some of these responses are malicious. Therefore, the Sync Height Response processing logic should incorporate some sort of expiration / filtering mechanism. The details of this logic can be abstracted with an interface: - -```golang -type TargetFinalizedHeight interface { - // Update processes a height received from a Sync Height Response - // and updates the finalized height estimate. - Update(height uint64, originID flow.Identifier) - - // Get returns the estimated finalized height of the overall chain. - Get() uint64 -} -``` - -One possible approach is to maintain a sliding window of the most recent Sync Height Responses, and take the median of these values. This implies that the target finalized height will always lag slightly behind the true finalized height, which may or may not be a problem depending on the block finalization rate.
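To make the sliding-window / median idea concrete, a minimal hypothetical implementation of `TargetFinalizedHeight` might look like the sketch below (the window size, the zero value returned for an empty window, and the lack of per-node de-duplication are simplifying assumptions):

```golang
// medianTargetHeight estimates the target finalized height as the median of a
// sliding window of recently received Sync Height Responses. The median is robust
// against a minority of maliciously high or low reports.
type medianTargetHeight struct {
	window     []uint64
	windowSize int
}

func newMedianTargetHeight(windowSize int) *medianTargetHeight {
	return &medianTargetHeight{windowSize: windowSize}
}

func (m *medianTargetHeight) Update(height uint64, originID flow.Identifier) {
	// keep only the most recent windowSize responses
	m.window = append(m.window, height)
	if len(m.window) > m.windowSize {
		m.window = m.window[1:]
	}
}

func (m *medianTargetHeight) Get() uint64 {
	if len(m.window) == 0 {
		return 0
	}
	sorted := make([]uint64, len(m.window))
	copy(sorted, m.window)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[len(sorted)/2]
}
```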
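Similarly, the possible `ActiveRange` approach outlined above could be sketched as follows (again purely illustrative; the `chainsync.Range` field names, the pruning strategy, and the constructor are assumptions):

```golang
// activeRange tracks how many responses have covered each height so that PendingStart
// (and therefore RangeEnd) keeps advancing even when local finalization does not.
type activeRange struct {
	defaultRangeSize uint64
	minResponses     uint
	localFinalized   uint64
	targetFinalized  uint64
	responseCounts   map[uint64]uint // height -> number of responses containing it
}

func newActiveRange(defaultRangeSize uint64, minResponses uint) *activeRange {
	return &activeRange{
		defaultRangeSize: defaultRangeSize,
		minResponses:     minResponses,
		responseCounts:   make(map[uint64]uint),
	}
}

func (a *activeRange) Update(headers []flow.Header, originID flow.Identifier) {
	for _, header := range headers {
		a.responseCounts[header.Height]++
	}
}

func (a *activeRange) LocalFinalizedHeight(height uint64) {
	a.localFinalized = height
	// prune counts for heights that are already finalized locally
	for h := range a.responseCounts {
		if h <= height {
			delete(a.responseCounts, h)
		}
	}
}

func (a *activeRange) TargetFinalizedHeight(height uint64) {
	a.targetFinalized = height
}

func (a *activeRange) Get() chainsync.Range {
	// PendingStart: first height above the local finalized height that has been
	// received fewer than minResponses times
	pendingStart := a.localFinalized + 1
	for a.responseCounts[pendingStart] >= a.minResponses {
		pendingStart++
	}

	rangeStart := a.localFinalized + 1
	rangeEnd := pendingStart + a.defaultRangeSize
	if a.targetFinalized < rangeEnd {
		rangeEnd = a.targetFinalized
	}
	// (a real implementation would also handle the case rangeEnd < rangeStart)
	return chainsync.Range{From: rangeStart, To: rangeEnd}
}
```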