
Gossip refactoring #1811

Merged
gavofyork merged 13 commits into master from a-gossip-refactor on Feb 20, 2019

Conversation

arkpar
Member

@arkpar arkpar commented Feb 15, 2019

  • Gossip messages are now validated by the consensus layer before propagation

The is_expired function should be reviewed carefully.

@arkpar arkpar added the A3-in_progress Pull request is in progress. No review needed at this stage. label Feb 15, 2019
Contributor

@andresilva andresilva left a comment

brief review, overall the design lgtm

/// Message topic.
topic: H,
/// Flag that indicates if the message should be broadcasted.
broadcast: bool
Contributor

Since we're breaking the protocol I think we should remove this. It's a feature I added, but it's currently unused because it's a horrible idea and a vector for DoS.

Member Author

It is not part of the message anymore; rather, the grandpa validator decides whether it should be broadcast. Anyway, I've noticed that in current master this is always set to false, so if it is not needed for any future messages I'll remove it.

@@ -239,7 +241,7 @@ pub enum ProtocolMsg<B: BlockT, S: NetworkSpecialization<B>,> {
/// Execute a closure with the consensus gossip.
ExecuteWithGossip(Box<GossipTask<B> + Send + 'static>),
/// Incoming gossip consensus message.
- GossipConsensusMessage(B::Hash, Vec<u8>, bool)
+ GossipConsensusMessage(B::Hash, u32, Vec<u8>, bool)
Contributor

Let's remove the broadcast flag. I think we can also remove topic now.

Member Author

This is used for sending out our own messages. The topic is passed because we don't want to validate them.

pub struct FullRoundMessage<Block: BlockT> {
pub round: u64,
pub set_id: u64,
pub message: SignedMessage<Block>,
Contributor

The signature currently doesn't include the round or set_id; we should probably extend it in the future.

Member Author

It does, I believe. The signing scheme was not changed and still includes round and set_id.

Contributor

yes, the signed payload includes set_id and round.
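For reference, the signed payload can be built roughly like this — a sketch assuming a parity-codec-style Encode derive; the helper name and exact field order are assumptions, not confirmed by this PR:

use parity_codec::Encode;

// The signature covers the message together with round and set_id,
// which is why both stay authenticated even without extra plaintext fields.
fn localized_payload<E: Encode>(round: u64, set_id: u64, message: &E) -> Vec<u8> {
    (message, round, set_id).encode()
}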

Contributor

Putting round and set_id in every message will use a lot of extra bandwidth. Why not embed this data in the topic somehow?

Member Author

@arkpar arkpar Feb 15, 2019

The topic, which is 32 bytes, is no longer included in the message, so bandwidth is actually saved.

return true;
} else if set_id == rounds.set_id + 1 {
// allow a few first rounds of future set.
if round > MESSAGE_ROUND_TOLERANCE {
Contributor

What's the strategy for when we catch up? If messages are dropped, consensus will halt.

Member Author

After discussing it with @rphmeier and @andresilva, here's what the issue is:
GRANDPA starts round N+1 when it observes a commit for round N, and it requires all messages for round N+1 to complete that round. Currently there's no guarantee that the commit message will be the first in the set of messages for N+1, so some messages might be skipped if the current round is far behind N.

There are a few solutions:

  1. Request missing messages from peers when we observe commit N and it is more than a few rounds in the future. This would require substantial network protocol changes.

  2. GRANDPA begins round N+2 when it observes the commit for N. This would guarantee that all messages for N+2 are accepted, as they arrive only after the commit for N. This appears to be an involved change in the grandpa crate.

  3. Future messages are kept around with time-based expiration. This will be implemented in this PR (a minimal sketch follows below).

An additional DoS protection scheme will be implemented in the future.
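A minimal sketch of option 3's time-based expiration, with illustrative names; MESSAGE_LIFETIME and the per-entry timestamp mirror the expiration check quoted later in this review:

use std::time::{Duration, Instant};

// Illustrative lifetime; the real constant may differ.
const MESSAGE_LIFETIME: Duration = Duration::from_secs(120);

struct MessageEntry {
    // When the message was first registered.
    timestamp: Instant,
    // ... message hash, topic and raw payload would live here.
}

impl MessageEntry {
    // A message is kept while it is younger than MESSAGE_LIFETIME.
    fn is_expired(&self, now: Instant) -> bool {
        self.timestamp + MESSAGE_LIFETIME < now
    }
}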

Member

Option 1 seems like a sensible thing to implement longer-term.

@@ -136,6 +137,11 @@ const LAST_COMPLETED_KEY: &[u8] = b"grandpa_completed_round";
const AUTHORITY_SET_KEY: &[u8] = b"grandpa_voters";
const CONSENSUS_CHANGES_KEY: &[u8] = b"grandpa_consensus_changes";

+ const GRANDPA_ROUND_MESSAGE: u32 = 0x1001;
+ const GRANDPA_COMMIT_MESSAGE: u32 = 0x1002;
Contributor

these constants seem like they could easily conflict with other modules

Member Author

They are effectively part of the network message. Any suggestions here?

Contributor

Making sure these consts don't conflict (especially when people bring in their own consensus algorithms) sounds like a nightmare :( ...

Contributor

@rphmeier rphmeier Feb 18, 2019

@arkpar as part of #1304 I've proposed that we make digest items and other substrate primitives relevant to consensus start with a 4-byte "magic number" assigned to consensus engines, sort of like port numbers. We'll reserve some numbers and an early range for some consensus engines, and custom implementors should do the same.

In light of that we could change the gossip packet structure to

{ consensus_engine: [u8; 4], message_kind: u8, data: Vec<u8> } instead, and this would keep things consistent with planned future architecture changes.
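A sketch of that packet layout, assuming parity-codec-style derives; the field names come from the comment above, and nothing here is the final wire format:

use parity_codec::{Decode, Encode};

// Gossip packet grouped by consensus engine, as proposed above.
#[derive(Encode, Decode)]
struct GossipPacket {
    // Four-byte engine identifier, e.g. b"afg1" for GRANDPA.
    consensus_engine: [u8; 4],
    // Engine-specific message kind.
    message_kind: u8,
    // Opaque, engine-encoded payload.
    data: Vec<u8>,
}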

Contributor

Alternatively we could group gossip validators by consensus_engine as opposed to message kind, and have the message kind be part of an encoded enum in grandpa::communication
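For illustration, such an encoded enum might look like this — a sketch only, reusing the message types named elsewhere in this review, not the exact type in grandpa::communication:

use parity_codec::{Decode, Encode};
use runtime_primitives::traits::Block as BlockT;

// The encoded variant index plays the role of the message kind,
// so no separate kind field is needed on the wire.
#[derive(Encode, Decode)]
enum GossipMessage<Block: BlockT> {
    // A vote or precommit for a round, with its round and set id.
    VoteOrPrecommit(VoteOrPrecommitMessage<Block>),
    // A full commit for a round.
    Commit(FullCommitMessage<Block>),
}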

Member Author

Implemented the second option.

@arkpar arkpar added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Feb 15, 2019
Contributor

@gnunicorn gnunicorn left a comment

I do not quite understand the design decision to add a new kind attribute that the consensus-gossip knows about. I feel like a more generic and cleaner approach is easily possible without forcing all consensus messages to have this (GRANDPA-specific) attribute.

Having both kind and topic also makes the entire approach a lot more confusing and muddled, as it becomes unclear what each is for, and I am fearful this will be used totally wrong in the future...

- message_times.get(&(*topic, *message_hash))
-     .map(|instant| *instant + (5 * MESSAGE_LIFETIME) >= now)
-     .unwrap_or(false)
+ entry.timestamp + MESSAGE_LIFETIME >= now
Contributor

I feel that having both the validator and MESSAGE_LIFETIME is odd. What if my gossip wants or needs a longer message lifetime? I'd prefer to have MESSAGE_LIFETIME enforced through the same Validator construct as well, so runtime developers can tweak it to their needs. (see above)

Member Author

MESSAGE_LIFETIME is a temporary measure. The first version of this PR did not have it, and I'd prefer to get rid of it as well. See this comment on why it is needed: #1811 (comment)

Contributor

(to reiterate, removing it isn't as simple as it sounds and we will want a general utility for DOS protection before doing so -- and it's a bit beyond the scope of this PR)

Contributor

If we want a hard lifetime, there is also the lru_time_cache crate we could use. It would keep our code base cleaner...

@arkpar
Member Author

arkpar commented Feb 18, 2019

I do not quite understand the design decision to add a new kind attribute that the consensus-gossip knows about. I feel like a more generic and cleaner approach is easily possible without forcing all consensus messages to have this (GRANDPA-specific) attribute.

How is it GRANDPA specific? It is generic enough to be transparent to the network layer.

Having both kind and topic also makes the entire approach a lot more confusing and muddled, as it becomes unclear what each is for, and I am fearful this will be used totally wrong in the future...

The topic is not part of the message anymore. If it were, it would have to be plain, i.e. include round and set id, and that would be less generic for sure. The topic is something derived from the message as a result of validation. kind simply identifies the message structure.
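For illustration, a round topic can be derived by hashing the round and set id recovered from the decoded message — a sketch assuming the era's runtime_primitives hashing traits; the function name and exact derivation in this PR are assumptions:

use runtime_primitives::traits::{Block as BlockT, Hash as HashT, Header as HeaderT};

// The topic never travels on the wire; it is recomputed from the
// decoded message during validation.
fn round_topic<B: BlockT>(round: u64, set_id: u64) -> B::Hash {
    <<B::Header as HeaderT>::Hashing as HashT>::hash(
        format!("{}-{}", set_id, round).as_bytes(),
    )
}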

@gnunicorn
Contributor

@arkpar so from a network view, topic is entirely replaced by the kind? And all other topical information is read by validate to create the topic out of the (internally) decoded message? So we store the raw message under that topic, and once the sink asks for that topic, it decodes the message again?

Don't get me wrong, I think it is a good idea to have the consensus engine figure out whether a message should be dropped - and it potentially needs more information than we can put into the topic. However, replacing the rather flexible topic with a (somehow globally defined?) u32 doesn't really feel sound to me. I mean, GRANDPA could also just use the topic the exact same way, if the Validators were executed per topic (or on all messages regardless), could it not?

I need to think about this a bit more...

@rphmeier
Contributor

rphmeier commented Feb 18, 2019

I put a comment or two about this in one of the threads, but to reiterate here:
I think consensus-gossip messages should be grouped at top-level by consensus service (GRANDPA, parachain-attestation, etc.), with globally respected 4-byte identifiers, and message kinds should be determined by that consensus engine's handler.

This is because there are other planned consensus changes (to make hot-swapping possible) that will take a similar approach and having a unified architecture seems reasonable.

@arkpar
Member Author

arkpar commented Feb 18, 2019

@arkpar so from a network view, topic is entirely replaced by the kind? And all other topical information is read by validate to create the topic out of the (internally) decoded message? So we store the raw message under that topic, and once the sink asks for that topic, it decodes the message again?

Yes. Decoding is not that expensive, though; at least it does not check signatures the second time.

Don't get me wrong, I think it is a good idea to have the consensus engine figure out whether a message should be dropped - and it potentially needs more information than we can put into the topic. However, replacing the rather flexible topic with a (somehow globally defined?) u32 doesn't really feel sound to me. I mean, GRANDPA could also just use the topic the exact same way, if the Validators were executed per topic (or on all messages regardless), could it not?

I need to think about this a bit more...

In the current version it is not possible to tell whether a message is valid just from the topic. It is also not possible to tell which validator can handle a message just from the topic. And we need to be able to assert message validity before propagating it further. If the consensus engine could list all possible valid topics at any moment in time, that would solve the problem. Unfortunately, that's not always possible.

@arkpar arkpar added A3-in_progress Pull request is in progress. No review needed at this stage. and removed A0-please_review Pull request needs code review. labels Feb 18, 2019
@gnunicorn
Contributor

gnunicorn commented Feb 19, 2019

And we need to be able to assert message validity before propagating it further. If the consensus engine could list all possible valid topics at any moment in time, that would solve the problem. Unfortunately, that's not always possible.

I didn't mean that. I meant the validators could be decided based on topics - you'd still have validators. But whatever.


I think consensus-gossip messages should be grouped at top-level by consensus service (GRANDPA, parachain-attestation, etc.), with globally respected 4-byte identifiers, and message kinds should be determined by that consensus engine's handler.

The main problem I have with that is the "globally respected" aspect. Just like the ports comparison you made, I am afraid this will become an (unnecessary) source of conflict and that, unless governed properly, it will cause incompatibilities between engines that are hard to spot, debug, and fix once deployed. I'd prefer not to have to govern this, but to have a more flexible system in place (e.g. [u8; 8], like we do for key-value-store key constants?). But I won't insist.


Secondly, I still prefer a more generic approach of just having a list of validators, each given the entire message (including kind, who, and the like), and having each specific engine react only to the kinds it is interested in. That would let us put in generic behavior guards (e.g. against bursting) more easily. But I understand if you consider that out of scope for this PR.

@arkpar arkpar added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Feb 19, 2019
@gnunicorn gnunicorn self-assigned this Feb 19, 2019
Contributor

@gnunicorn gnunicorn left a comment

Eventually got around to giving this a full review. Added a bunch of suggestions; I have some minor remarks and one or two bigger suggestions (but they might simply be out of scope). Overall this looks good to me :).

@@ -106,6 +108,10 @@ const LAST_COMPLETED_KEY: &[u8] = b"grandpa_completed_round";
const AUTHORITY_SET_KEY: &[u8] = b"grandpa_voters";
const CONSENSUS_CHANGES_KEY: &[u8] = b"grandpa_consensus_changes";

+ const GRANDPA_ENGINE_ID: network::ConsensusEngineId = [b'a', b'f', b'g', b'1'];
Contributor

thanks 😊 !

false
}

fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
Contributor

please wrap.

Suggested change
fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>)
-> network_gossip::ValidationResult<Block::Hash>
{

network_gossip::ValidationResult::Valid(topic)
}

fn validate_commit_message(&self, full: FullCommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
Contributor

here, too

Suggested change
fn validate_commit_message(&self, full: FullCommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
fn validate_commit_message(&self, full: FullCommitMessage<Block>)
-> network_gossip::ValidationResult<Block::Hash>
{

let validator = Arc::new(GossipValidator::new());
let v = validator.clone();
service.with_gossip(move |gossip, _| {
gossip.register_validator(GRANDPA_ENGINE_ID, validator.clone());
Contributor

Am I missing something, or why are you not using v here and then using validator directly in NetworkBridge { service, validator }? That saves you one clone...

Member Author

A leftover from when there were 2 validators here.

pub fn register_validator(&mut self, engine_id: ConsensusEngineId, validator: Arc<Validator<B::Hash>>) {
self.validators.insert(engine_id, validator);
}

/// Handle new connected peer.
pub fn new_peer(&mut self, protocol: &mut Context<B>, who: NodeIndex, roles: Roles) {
if roles.intersects(Roles::AUTHORITY) {
trace!(target:"gossip", "Registering {:?} {}", roles, who);
// Send out all known messages to authorities.
let mut known_messages = HashSet::new();
for entry in self.messages.iter() {
Contributor

We are still doing a burst on connect here - which we have observed to cause a reconnect cycle on lower-bandwidth systems: they get dropped because they react too slowly, reconnect, get burst at and dropped again... rinse, repeat. The patch in my next comment would reduce the set of messages sent to all non-stale ones, and the entire PR will reduce the number of messages that are valid at any point in time. However, I am not sure this saves us entirely, and I am wondering if we want to cap the number of messages we send regardless (like just the last n entries, as sketched below)?
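A sketch of that cap, with an illustrative constant and assuming self.messages keeps entries in insertion order (neither is taken from the PR):

// Hypothetical cap on the connect-time burst: send only the newest
// MAX_BURST_MESSAGES entries instead of every known message.
const MAX_BURST_MESSAGES: usize = 128;

for entry in self.messages.iter().rev().take(MAX_BURST_MESSAGES) {
    // ... send `entry` to the newly connected authority as before
}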

@@ -216,6 +225,8 @@ pub mod generic {
pub struct Status<Hash, Number> {
/// Protocol version.
pub version: u32,
+ /// Minimum supported version.
+ pub min_supported_version: u32,
Contributor

@gnunicorn gnunicorn Feb 19, 2019

As discussed in the chat: this change alone breaks compatibility between this code base and (non-authority) full-nodes in the existing network. Maybe we can isolate this change to make it easier to migrate the other changes onto live networks?

Member Author

Changes to the gossip message are breaking as well. Maintaining backwards compatibility is currently a pain because peers just disconnect if the other party has a higher protocol version. This exact change is there to address that.
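A sketch of the compatibility check this field enables, with hypothetical names: instead of disconnecting on any version mismatch, a node accepts a newer peer whose supported range still covers its own version.

// Hypothetical handshake check built on min_supported_version.
fn is_compatible(our_version: u32, peer_version: u32, peer_min_supported_version: u32) -> bool {
    peer_version == our_version
        || (peer_version > our_version && peer_min_supported_version <= our_version)
}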

Contributor

@gnunicorn gnunicorn left a comment

👍

@gavofyork gavofyork merged commit 4127a93 into master Feb 20, 2019
@gavofyork gavofyork deleted the a-gossip-refactor branch February 20, 2019 11:40
rphmeier pushed a commit that referenced this pull request Feb 26, 2019
* First part of gossip protocol refactoring

* Message validation in GRANDPA

* Reverted to time-based expiration for future round messages

* Removed collect_garbage_for_topic

* Use consensus engine id instead of kind

* Use try_init

Co-Authored-By: arkpar <[email protected]>

* Comment

Co-Authored-By: arkpar <[email protected]>

* Added expiration check on broadcast

Co-Authored-By: arkpar <[email protected]>

* Apply suggestions from code review

Co-Authored-By: arkpar <[email protected]>

* Style

* Style
gavofyork pushed a commit that referenced this pull request Feb 27, 2019

* Gossip refactoring (#1811)

* Fixed sending commit message (#1834)

* Fix #1825 (#1828)

* Fixed finalizing multiple blocks at once (#1837)
MTDK1 pushed a commit to bdevux/substrate that referenced this pull request Apr 12, 2019