
Gossip refactoring #1811

Merged
gavofyork merged 13 commits into master from a-gossip-refactor on Feb 20, 2019

Conversation

arkpar
Member

@arkpar arkpar commented Feb 15, 2019

  • Gossip messages are now validated by the consensus layer before propagation

The is_expired function should be reviewed carefully.

@arkpar arkpar added the A3-in_progress Pull request is in progress. No review needed at this stage. label Feb 15, 2019
Contributor

@andresilva andresilva left a comment

brief review, overall the design lgtm

/// Message topic.
topic: H,
/// Flag that indicates if the message should be broadcasted.
broadcast: bool
Contributor

Since we're breaking the protocol I think we should remove this. It's a feature I added, but it's currently unused because it's a horrible idea and a vector for DoS.

Member Author

It is not part of the message anymore; rather, the grandpa validator decides whether it should be broadcast. Anyway, I've noticed that in current master this is always set to false, so if it is not needed for any future messages I'll remove it.

@@ -239,7 +241,7 @@ pub enum ProtocolMsg<B: BlockT, S: NetworkSpecialization<B>,> {
/// Execute a closure with the consensus gossip.
ExecuteWithGossip(Box<GossipTask<B> + Send + 'static>),
/// Incoming gossip consensus message.
- GossipConsensusMessage(B::Hash, Vec<u8>, bool)
+ GossipConsensusMessage(B::Hash, u32, Vec<u8>, bool)
Contributor

Let's remove the broadcast flag. I think we can also remove topic now.

Member Author

This is used for sending out our own messages. The topic is passed because we don't want to validate them.

pub struct FullRoundMessage<Block: BlockT> {
pub round: u64,
pub set_id: u64,
pub message: SignedMessage<Block>,
Contributor

The signature currently doesn't include the round or set_id; we should probably extend it in the future.

Member Author

It does, I believe. The signing scheme was not changed and still includes round and set_id.

Contributor

yes, the signed payload includes set_id and round.
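For reference, the signed payload can be built roughly like this — a sketch assuming a parity-codec-style Encode derive; the helper name and exact field order are assumptions, not confirmed by this PR:

use parity_codec::Encode;

// The signature covers the message together with round and set_id,
// which is why both stay authenticated even without extra plaintext fields.
fn localized_payload<E: Encode>(round: u64, set_id: u64, message: &E) -> Vec<u8> {
    (message, round, set_id).encode()
}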

Contributor

Putting round and set_id in every message will use a lot of extra bandwidth. Why not embed this data in the topic somehow?

Member Author

@arkpar arkpar Feb 15, 2019

The topic, which is 32 bytes, is no longer included in the message, so bandwidth is actually saved.

return true;
} else if set_id == rounds.set_id + 1 {
// allow a few first rounds of future set.
if round > MESSAGE_ROUND_TOLERANCE {
Contributor

What's the strategy for when we catch up? If messages are dropped, consensus will halt.

Member Author

After discussing it with @rphmeier and @andresilva, here's what the issue is:
GRANDPA starts round N+1 when it observes a commit for round N, and it requires all messages for round N+1 to complete that round. Currently there's no guarantee that the commit message will be the first in the set of messages for N+1, so some messages might be skipped if the current round is far behind N.

There are a few solutions:

  1. Request missing messages from peers when we observe commit N and it is more than a few rounds in the future. This would require substantial network protocol changes.

  2. GRANDPA begins round N+2 when it observes the commit for N. This would guarantee that all messages for N+2 are accepted, as they arrive only after the commit for N. This appears to be an involved change in the grandpa crate.

  3. Future messages are kept around with time-based expiration. This will be implemented in this PR (a minimal sketch follows below).

An additional DoS protection scheme will be implemented in the future.
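A minimal sketch of option 3's time-based expiration, with illustrative names; MESSAGE_LIFETIME and the per-entry timestamp mirror the expiration check quoted later in this review:

use std::time::{Duration, Instant};

// Illustrative lifetime; the real constant may differ.
const MESSAGE_LIFETIME: Duration = Duration::from_secs(120);

struct MessageEntry {
    // When the message was first registered.
    timestamp: Instant,
    // ... message hash, topic and raw payload would live here.
}

impl MessageEntry {
    // A message is kept while it is younger than MESSAGE_LIFETIME.
    fn is_expired(&self, now: Instant) -> bool {
        self.timestamp + MESSAGE_LIFETIME < now
    }
}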

Member

Option 1 seems like a sensible thing to implement longer-term.

@@ -136,6 +137,11 @@ const LAST_COMPLETED_KEY: &[u8] = b"grandpa_completed_round";
const AUTHORITY_SET_KEY: &[u8] = b"grandpa_voters";
const CONSENSUS_CHANGES_KEY: &[u8] = b"grandpa_consensus_changes";

+ const GRANDPA_ROUND_MESSAGE: u32 = 0x1001;
+ const GRANDPA_COMMIT_MESSAGE: u32 = 0x1002;
Contributor

these constants seem like they could easily conflict with other modules

Member Author

They are effectively part of the network message. Any suggestions here?

Contributor

Making sure these consts don't conflict (especially when people bring in their own consensus algorithms) sounds like a nightmare :( ...

Contributor

@rphmeier rphmeier Feb 18, 2019

@arkpar as part of #1304 I've proposed that we make digest items and other substrate primitives relevant to consensus start with a 4-byte "magic number" assigned to consensus engines, sort of like port numbers. We'll reserve some numbers and an early range for some consensus engines, and custom implementors should do the same.

In light of that we could change the gossip packet structure to

{ consensus_engine: [u8; 4], message_kind: u8, data: Vec<u8> } instead, and this would keep things consistent with planned future architecture changes.
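A sketch of that packet layout, assuming parity-codec-style derives; the field names come from the comment above, and nothing here is the final wire format:

use parity_codec::{Decode, Encode};

// Gossip packet grouped by consensus engine, as proposed above.
#[derive(Encode, Decode)]
struct GossipPacket {
    // Four-byte engine identifier, e.g. b"afg1" for GRANDPA.
    consensus_engine: [u8; 4],
    // Engine-specific message kind.
    message_kind: u8,
    // Opaque, engine-encoded payload.
    data: Vec<u8>,
}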

Contributor

Alternatively we could group gossip validators by consensus_engine as opposed to message kind, and have the message kind be part of an encoded enum in grandpa::communication
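For illustration, such an encoded enum might look like this — a sketch only, reusing the message types named elsewhere in this review, not the exact type in grandpa::communication:

use parity_codec::{Decode, Encode};
use runtime_primitives::traits::Block as BlockT;

// The encoded variant index plays the role of the message kind,
// so no separate kind field is needed on the wire.
#[derive(Encode, Decode)]
enum GossipMessage<Block: BlockT> {
    // A vote or precommit for a round, with its round and set id.
    VoteOrPrecommit(VoteOrPrecommitMessage<Block>),
    // A full commit for a round.
    Commit(FullCommitMessage<Block>),
}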

Member Author

Implemented the second option.

@arkpar arkpar added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Feb 15, 2019
Contributor

@gnunicorn gnunicorn left a comment

I do not quite understand the design decision to add a new kind attribute that the consensus-gossip knows about. I feel like a more generic and cleaner approach is easily possible without forcing all consensus messages to have this (GRANDPA-specific) attribute.

Having both kind and topic also makes the entire approach a lot more confusing and muddled, as it becomes unclear what each is for, and I am fearful this will be used totally wrong in the future...

- message_times.get(&(*topic, *message_hash))
-     .map(|instant| *instant + (5 * MESSAGE_LIFETIME) >= now)
-     .unwrap_or(false)
+ entry.timestamp + MESSAGE_LIFETIME >= now
Contributor

I feel that having both the validator and MESSAGE_LIFETIME is odd. What if my gossip wants or needs a longer message lifetime? I'd prefer to have MESSAGE_LIFETIME enforced through the same Validator construct as well, so runtime developers can tweak it to their needs. (see above)

Member Author

MESSAGE_LIFETIME is a temporary measure. The first version of this PR did not have it, and I'd prefer to get rid of it as well. See this comment on why it is needed: #1811 (comment)

Contributor

(to reiterate, removing it isn't as simple as it sounds and we will want a general utility for DOS protection before doing so -- and it's a bit beyond the scope of this PR)

Contributor

If we want a hard lifetime, there is also the lru_time_cache crate we could use. It would keep our code base cleaner...

@arkpar
Member Author

arkpar commented Feb 18, 2019

I do not quite understand the design decision to add a new kind attribute that the consensus-gossip knows about. I feel like a more generic and cleaner approach is easily possible without forcing all consensus messages to have this (GRANDPA-specific) attribute.

How is it GRANDPA specific? It is generic enough to be transparent to the network layer.

Having both kind and topic also makes the entire approach a lot more confusing and muddled, as it becomes unclear what each is for, and I am fearful this will be used totally wrong in the future...

The topic is not part of the message anymore. If it were, it would have to be plain, i.e. include round and set id, and that would be less generic for sure. The topic is something derived from the message as a result of validation. kind simply identifies the message structure.
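For illustration, a round topic can be derived by hashing the round and set id recovered from the decoded message — a sketch assuming the era's runtime_primitives hashing traits; the function name and exact derivation in this PR are assumptions:

use runtime_primitives::traits::{Block as BlockT, Hash as HashT, Header as HeaderT};

// The topic never travels on the wire; it is recomputed from the
// decoded message during validation.
fn round_topic<B: BlockT>(round: u64, set_id: u64) -> B::Hash {
    <<B::Header as HeaderT>::Hashing as HashT>::hash(
        format!("{}-{}", set_id, round).as_bytes(),
    )
}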

@gnunicorn
Contributor

@arkpar so from a network view, topic is entirely replaced by the kind? And all other topical information is read by validate to create the topic out of the (internally) decoded message? So we store the raw message under that topic, and once the sink asks for that topic, it decodes the message again?

Don't get me wrong, I think it is a good idea to have the consensus engine figure out whether a message should be dropped - and it potentially needs more information than we can put into the topic. However, replacing the rather flexible topic with a (somehow globally defined?) u32 doesn't really feel sound to me. I mean, GRANDPA could also just use the topic the exact same way, if the Validators were executed per topic (or on all messages regardless), could it not?

I need to think about this a bit more...

@rphmeier
Contributor

rphmeier commented Feb 18, 2019

I put a comment or two about this in one of the threads, but to reiterate here:
I think consensus-gossip messages should be grouped at top-level by consensus service (GRANDPA, parachain-attestation, etc.), with globally respected 4-byte identifiers, and message kinds should be determined by that consensus engine's handler.

This is because there are other planned consensus changes (to make hot-swapping possible) that will take a similar approach and having a unified architecture seems reasonable.

@arkpar
Member Author

arkpar commented Feb 18, 2019

@arkpar so from a network view, topic is entirely replaced by the kind? And all other topical information is read by validate to create the topic out of the (internally) decoded message? So we store the raw message under that topic, and once the sink asks for that topic, it decodes the message again?

Yes. Decoding is not that expensive, though; at least it does not check signatures the second time.

Don't get me wrong, I think it is a good idea to have the consensus engine figure out whether a message should be dropped - and it potentially needs more information than we can put into the topic. However, replacing the rather flexible topic with a (somehow globally defined?) u32 doesn't really feel sound to me. I mean, GRANDPA could also just use the topic the exact same way, if the Validators were executed per topic (or on all messages regardless), could it not?

I need to think about this a bit more...

In the current version it is not possible to tell whether a message is valid just from the topic. It is also not possible to tell which validator can handle a message just from the topic. And we need to be able to assert message validity before propagating it further. If the consensus engine could list all possible valid topics at any moment in time, that would solve the problem. Unfortunately, that's not always possible.

@arkpar arkpar added A3-in_progress Pull request is in progress. No review needed at this stage. and removed A0-please_review Pull request needs code review. labels Feb 18, 2019
@gnunicorn
Contributor

gnunicorn commented Feb 19, 2019

And we need to be able to assert message validity before propagating it further. If the consensus engine could list all possible valid topics at any moment in time, that would solve the problem. Unfortunately, that's not always possible.

I didn't mean that. I meant the validators could be decided based on topics - you'd still have validators. But whatever.


I think consensus-gossip messages should be grouped at top-level by consensus service (GRANDPA, parachain-attestation, etc.), with globally respected 4-byte identifiers, and message kinds should be determined by that consensus engine's handler.

The main problem I have with that is the "globally respected" aspect. Just like the ports comparison you made, I am afraid this will become an (unnecessary) source of conflict and that, unless governed properly, it will cause incompatibilities between engines that are hard to spot, debug, and fix once deployed. I'd prefer not to have to govern this, but to have a more flexible system in place (e.g. [u8; 8], like we do for key-value-store key constants?). But I won't insist.


Secondly, I still prefer a more generic approach of just having a list of validators, each given the entire message (including kind, who, and the like), and having each specific engine react only to the kinds it is interested in. That would let us put in generic behavior guards (e.g. against bursting) more easily. But I understand if you consider that out of scope for this PR.

@arkpar arkpar added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Feb 19, 2019
@gnunicorn gnunicorn self-assigned this Feb 19, 2019
Contributor

@gnunicorn gnunicorn left a comment

Eventually got around to giving this a full review. Added a bunch of suggestions; I have some minor remarks and one or two bigger suggestions (but they might simply be out of scope). Overall this looks good to me :).

@@ -106,6 +108,10 @@ const LAST_COMPLETED_KEY: &[u8] = b"grandpa_completed_round";
const AUTHORITY_SET_KEY: &[u8] = b"grandpa_voters";
const CONSENSUS_CHANGES_KEY: &[u8] = b"grandpa_consensus_changes";

+ const GRANDPA_ENGINE_ID: network::ConsensusEngineId = [b'a', b'f', b'g', b'1'];
Contributor

thanks 😊 !

false
}

fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
Contributor

please wrap.

Suggested change
fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>)
-> network_gossip::ValidationResult<Block::Hash>
{

network_gossip::ValidationResult::Valid(topic)
}

fn validate_commit_message(&self, full: FullCommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
Contributor

here, too

Suggested change
fn validate_commit_message(&self, full: FullCommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
fn validate_commit_message(&self, full: FullCommitMessage<Block>)
-> network_gossip::ValidationResult<Block::Hash>
{

let validator = Arc::new(GossipValidator::new());
let v = validator.clone();
service.with_gossip(move |gossip, _| {
gossip.register_validator(GRANDPA_ENGINE_ID, validator.clone());
Contributor

Am I missing something, or why are you not using v here and then using validator directly in NetworkBridge { service, validator }? That saves you one clone...

Member Author

A leftover from when there were 2 validators here.

pub fn register_validator(&mut self, engine_id: ConsensusEngineId, validator: Arc<Validator<B::Hash>>) {
self.validators.insert(engine_id, validator);
}

/// Handle new connected peer.
pub fn new_peer(&mut self, protocol: &mut Context<B>, who: NodeIndex, roles: Roles) {
if roles.intersects(Roles::AUTHORITY) {
trace!(target:"gossip", "Registering {:?} {}", roles, who);
// Send out all known messages to authorities.
let mut known_messages = HashSet::new();
for entry in self.messages.iter() {
Contributor

We are still doing a burst on connect here - which we have observed to cause a reconnect cycle on lower-bandwidth systems: they get dropped because they react too slowly, reconnect, get burst at and dropped again... rinse, repeat. The patch in my next comment would reduce the set of messages sent to all non-stale ones, and the entire PR will reduce the number of messages that are valid at any point in time. However, I am not sure this saves us entirely, and I am wondering if we want to cap the number of messages we send regardless (like just the last n entries, as sketched below)?
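A sketch of that cap, with an illustrative constant and assuming self.messages keeps entries in insertion order (neither is taken from the PR):

// Hypothetical cap on the connect-time burst: send only the newest
// MAX_BURST_MESSAGES entries instead of every known message.
const MAX_BURST_MESSAGES: usize = 128;

for entry in self.messages.iter().rev().take(MAX_BURST_MESSAGES) {
    // ... send `entry` to the newly connected authority as before
}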

@@ -216,6 +225,8 @@ pub mod generic {
pub struct Status<Hash, Number> {
/// Protocol version.
pub version: u32,
+ /// Minimum supported version.
+ pub min_supported_version: u32,
Contributor

@gnunicorn gnunicorn Feb 19, 2019

As discussed in the chat: this change alone breaks compatibility between this code base and (non-authority) full-nodes in the existing network. Maybe we can isolate this change to make it easier to migrate the other changes onto live networks?

Member Author

Changes to the gossip message are breaking as well. Maintaining backwards compatibility is currently a pain because peers just disconnect if the other party has a higher protocol version. This exact change is there to address that.
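A sketch of the compatibility check this field enables, with hypothetical names: instead of disconnecting on any version mismatch, a node accepts a newer peer whose supported range still covers its own version.

// Hypothetical handshake check built on min_supported_version.
fn is_compatible(our_version: u32, peer_version: u32, peer_min_supported_version: u32) -> bool {
    peer_version == our_version
        || (peer_version > our_version && peer_min_supported_version <= our_version)
}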

Contributor

@gnunicorn gnunicorn left a comment

👍

@gavofyork gavofyork merged commit 4127a93 into master Feb 20, 2019
@gavofyork gavofyork deleted the a-gossip-refactor branch February 20, 2019 11:40
rphmeier pushed a commit that referenced this pull request Feb 26, 2019
* First part of gossip protocol refactoring

* Message validation in GRANDPA

* Reverted to time-based expiration for future round messages

* Removed collect_garbage_for_topic

* Use consensus engine id instead of kind

* Use try_init

Co-Authored-By: arkpar <[email protected]>

* Comment

Co-Authored-By: arkpar <[email protected]>

* Added expiration check on broadcast

Co-Authored-By: arkpar <[email protected]>

* Apply suggestions from code review

Co-Authored-By: arkpar <[email protected]>

* Style

* Style
gavofyork pushed a commit that referenced this pull request Feb 27, 2019

* Gossip refactoring (#1811)

* Fixed sending commit message (#1834)

* Fix #1825 (#1828)

* Fixed finalizing multiple blocks at once (#1837)
MTDK1 pushed a commit to bdevux/substrate that referenced this pull request Apr 12, 2019