Conversation
brief review, overall the design lgtm
core/network/src/consensus_gossip.rs (outdated)

        /// Message topic.
        topic: H,
        /// Flag that indicates if the message should be broadcasted.
        broadcast: bool
Since we're breaking the protocol I think we should remove this. It's a feature I added, but it's currently unused because it's a horrible idea and a vector for DoS.
It is not part of the message anymore; rather, the grandpa validator decides whether it should be broadcast. Anyway, I've noticed that currently in master this is always set to false, so if it is not needed for any future messages I'll remove it.
core/network/src/protocol.rs (outdated)

    @@ -239,7 +241,7 @@ pub enum ProtocolMsg<B: BlockT, S: NetworkSpecialization<B>,> {
        /// Execute a closure with the consensus gossip.
        ExecuteWithGossip(Box<GossipTask<B> + Send + 'static>),
        /// Incoming gossip consensus message.
    -   GossipConsensusMessage(B::Hash, Vec<u8>, bool),
    +   GossipConsensusMessage(B::Hash, u32, Vec<u8>, bool),
Let's remove the broadcast flag. I think we can also remove topic now.
This is used for sending out our own messages. topic is passed because we don't want to validate them.
    pub struct FullRoundMessage<Block: BlockT> {
        pub round: u64,
        pub set_id: u64,
        pub message: SignedMessage<Block>,
The signature currently doesn't include the round or set_id; we should probably extend it in the future.
It does, I believe. The signing scheme was not changed and still includes round and set_id.
yes, the signed payload includes set_id and round.
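For context, a minimal sketch of that idea: the signed payload covers round and set_id as well as the vote itself, so a vote signed for one round/set cannot be replayed in another. The function name and byte layout below are illustrative, not the actual signing code.

    // Illustrative only: the signature is produced over this payload,
    // e.g. keypair.sign(&localized_payload(round, set_id, &encoded_vote)),
    // so changing either round or set_id invalidates it.
    fn localized_payload(round: u64, set_id: u64, encoded_message: &[u8]) -> Vec<u8> {
        let mut payload = Vec::with_capacity(encoded_message.len() + 16);
        payload.extend_from_slice(encoded_message);
        payload.extend_from_slice(&round.to_le_bytes());
        payload.extend_from_slice(&set_id.to_le_bytes());
        payload
    }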
The extra bandwidth we will use for putting round and set_id in every message is a lot. Why not embed this data in the topic somehow?
The topic, which is 32 bytes, is not included in the message anymore, so the bandwidth is actually saved.
            return true;
        } else if set_id == rounds.set_id + 1 {
            // allow a few first rounds of future set.
            if round > MESSAGE_ROUND_TOLERANCE {
what's the strategy for when we catch up? if messages are dropped consensus will halt.
After discussing it with @rphmeier and @andresilva, here's what the issue is:

GRANDPA starts round N+1 when it observes a commit for round N, and it requires all messages for round N+1 to complete it. Currently there's no guarantee that the commit message will be the first in the set of messages for N+1, so some messages might be skipped if the current round is far behind N.

There are a few solutions:

1. Request missing messages from peers when we observe commit N and it is more than a few rounds in the future. Would require substantial network protocol changes.
2. GRANDPA begins round N+2 when it observes the commit for N. This would guarantee that all messages for N+2 are accepted, as they arrive only after the commit for N. Appears to be an involved change in the grandpa crate.
3. Future messages are kept around with time-based expiration (see the sketch below). Will be implemented in this PR.

An additional DoS protection scheme will be implemented in the future.
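A minimal sketch of option 3, keeping messages for future rounds with a timestamp and expiring them after a fixed lifetime. The structure, names, and lifetime value are illustrative, not the PR's actual code.

    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    // Hypothetical lifetime; the PR has its own MESSAGE_LIFETIME constant.
    const MESSAGE_LIFETIME: Duration = Duration::from_secs(120);

    struct MessageEntry {
        data: Vec<u8>,
        received_at: Instant,
    }

    // Messages for rounds we have not reached yet, grouped by topic.
    struct FutureMessages<Topic: std::hash::Hash + Eq> {
        by_topic: HashMap<Topic, Vec<MessageEntry>>,
    }

    impl<Topic: std::hash::Hash + Eq> FutureMessages<Topic> {
        fn new() -> Self {
            FutureMessages { by_topic: HashMap::new() }
        }

        // Keep the message instead of dropping it outright.
        fn insert(&mut self, topic: Topic, data: Vec<u8>) {
            self.by_topic.entry(topic).or_default().push(MessageEntry {
                data,
                received_at: Instant::now(),
            });
        }

        // Called periodically: drop everything older than MESSAGE_LIFETIME.
        fn collect_garbage(&mut self) {
            let now = Instant::now();
            for entries in self.by_topic.values_mut() {
                entries.retain(|e| e.received_at + MESSAGE_LIFETIME >= now);
            }
            self.by_topic.retain(|_, entries| !entries.is_empty());
        }
    }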
Option 1 seems like a sensible thing to implement longer-term.
core/finality-grandpa/src/lib.rs (outdated)

    @@ -136,6 +137,11 @@ const LAST_COMPLETED_KEY: &[u8] = b"grandpa_completed_round";
    const AUTHORITY_SET_KEY: &[u8] = b"grandpa_voters";
    const CONSENSUS_CHANGES_KEY: &[u8] = b"grandpa_consensus_changes";

    +const GRANDPA_ROUND_MESSAGE: u32 = 0x1001;
    +const GRANDPA_COMMIT_MESSAGE: u32 = 0x1002;
these constants seem like they could easily conflict with other modules
They are part of the network message effectively. Any suggestions here?
Making sure these consts don't conflict (especially when people bring in their own consensus algorithms) sounds like a nightmare :( ...
@arkpar as part of #1304 I've proposed that we make digest items and other substrate primitives relevant to consensus start with a 4-byte "magic number" assigned to consensus engines, sort of like port numbers. We'll reserve some numbers and an early range for some consensus engines, and custom implementors should do the same.
In light of that we could change the gossip packet structure to { consensus_engine: [u8; 4], message_kind: u8, data: Vec<u8> } instead, and this would keep things consistent with planned future architecture changes.
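For illustration, such a packet could look roughly like this. This is a sketch only; the field names follow the comment above, and the example engine id mirrors the GRANDPA_ENGINE_ID that appears later in this thread.

    /// Hypothetical wire format following the proposal above: a 4-byte
    /// consensus engine id (handed out like port numbers), an engine-specific
    /// message kind, and an opaque payload.
    struct ConsensusPacket {
        /// Identifies the consensus engine, e.g. [b'a', b'f', b'g', b'1'] for GRANDPA.
        consensus_engine: [u8; 4],
        /// Message kind, interpreted only by the owning engine.
        message_kind: u8,
        /// Opaque, engine-specific payload.
        data: Vec<u8>,
    }

    /// The network layer only inspects `consensus_engine` to pick a handler;
    /// everything else stays opaque to it.
    fn is_grandpa(packet: &ConsensusPacket) -> bool {
        packet.consensus_engine == *b"afg1"
    }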
Alternatively, we could group gossip validators by consensus_engine as opposed to message kind, and have the message kind be part of an encoded enum in grandpa::communication.
Implemented the second option
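Roughly, that second option looks like this: the network side keys validators by engine id only, and the message kind becomes a variant of an encoded enum owned by grandpa::communication. Types and names below are illustrative, not the actual module contents.

    use std::collections::HashMap;
    use std::sync::Arc;

    /// 4-byte engine id, as used when registering a validator.
    type ConsensusEngineId = [u8; 4];

    /// Minimal stand-in for a per-engine gossip validator.
    trait GossipValidator<Hash>: Send + Sync {
        /// Validate an incoming message and return its topic if it is valid.
        fn validate(&self, data: &[u8]) -> Option<Hash>;
    }

    /// The gossip layer only knows about engines, not message kinds.
    struct ConsensusGossip<Hash> {
        validators: HashMap<ConsensusEngineId, Arc<dyn GossipValidator<Hash>>>,
    }

    /// Inside the GRANDPA crate the "kind" is simply which enum variant was
    /// decoded from the opaque payload.
    enum GossipMessage {
        /// A signed prevote or precommit for some round.
        VoteOrPrecommit(Vec<u8>),
        /// A commit announcing finality for a round.
        Commit(Vec<u8>),
    }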
I do not quite understand the design decision to put in a new kind attribute that the consensus-gossip knows about. I feel like there is easily a more generic and cleaner approach possible without forcing all consensus messages to have this (GRANDPA-specific) attribute.
Having both kind and topic also makes the entire approach a lot more confusing and mixed, as it becomes unclear what they are about, and I am fearful this will be used totally wrong in the future...
    -   message_times.get(&(*topic, *message_hash))
    -       .map(|instant| *instant + (5 * MESSAGE_LIFETIME) >= now)
    -       .unwrap_or(false)
    +   entry.timestamp + MESSAGE_LIFETIME >= now
I feel that having both the validator and MESSAGE_LIFETIME is odd. What if my gossip wants/needs a higher message lifetime? I'd prefer to have MESSAGE_LIFETIME enforced through the same Validator construct as well, so runtime developers can tweak it to their needs. (see above)
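One shape that idea could take, purely as a hypothetical sketch (this is not the trait as it exists in this PR): the gossip Validator trait grows an expiry hook with a default, so each engine can keep messages for as long as it needs.

    use std::time::{Duration, Instant};

    /// Hypothetical extension of a gossip validator trait: the engine both
    /// validates messages and decides when a stored message has expired.
    trait Validator<Hash>: Send + Sync {
        /// Validate an incoming message, returning its topic if valid.
        fn validate(&self, data: &[u8]) -> Option<Hash>;

        /// Engine-specific expiry decision. The default mimics a fixed
        /// lifetime; an engine with longer rounds can simply override it.
        fn is_expired(&self, _topic: &Hash, received_at: Instant) -> bool {
            received_at + Duration::from_secs(120) <= Instant::now()
        }
    }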
MESSAGE_LIFETIME is a temporary measure. The first version of this PR did not have it, and I'd prefer to get rid of it as well. See this comment on why it is needed: #1811 (comment)
(to reiterate, removing it isn't as simple as it sounds and we will want a general utility for DOS protection before doing so -- and it's a bit beyond the scope of this PR)
If we want a hard lifetime, there is also the lru_time_cache crate we could use. Would keep our code base cleaner...
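For what it's worth, a small sketch of what that could look like, assuming the crate's with_expiry_duration constructor; the key and value types here are hypothetical placeholders.

    use std::time::Duration;
    use lru_time_cache::LruCache;

    // Entries fall out of the cache automatically once older than the TTL,
    // so no separate garbage-collection pass is needed.
    fn known_messages() -> LruCache<[u8; 32], Vec<u8>> {
        LruCache::with_expiry_duration(Duration::from_secs(120))
    }

    fn example() {
        let mut cache = known_messages();
        cache.insert([0u8; 32], b"encoded gossip message".to_vec());
        assert!(cache.get(&[0u8; 32]).is_some());
    }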
How is it GRANDPA specific? It is generic enough to be transparent to the network layer.
The topic is not part of the message anymore. If it were, it would have to be plain, i.e. include round and set id, and that would be less generic for sure. The topic is something derived from the message as a result of validation.
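To make that concrete, the topic can be computed as a pure function of (set_id, round) after the message has been validated, instead of travelling on the wire. The hash used here is only a stand-in for the chain's hash type.

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Illustrative only: any peer can recompute the same topic locally from
    // the validated message's set_id and round, so it never has to be sent.
    fn round_topic(set_id: u64, round: u64) -> u64 {
        let mut hasher = DefaultHasher::new();
        (set_id, round).hash(&mut hasher);
        hasher.finish()
    }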
@arkpar so from a network view... Don't get me wrong, I think it is a good idea to have the consensus engine figure out whether a message should be dropped - and it potentially needs more information than we can put into the topic. I need to think about this a bit more...
I put a comment or two about this in one of the threads, but to reiterate here: this is because there are other planned consensus changes (to make hot-swapping possible) that will take a similar approach, and having a unified architecture seems reasonable.
Yes. Decoding is not that expensive though. It does not check signatures the second time at least.
In the current version it is not possible to tell if the message is valid or not just from the topic. Also, it is not possible to tell which validator can handle the message just from the topic. And we need to be able to assert message validity before propagating it further. If the consensus engine could list all possible valid topics at any moment in time, that would solve the problem. Unfortunately, that's not always possible.
I didn't mean that. I meant the validators could be decided based on topics - you'd still have validators. But whatever.
The main problem I have with that is the "globally respected" aspect of things. Just like the ports comparison you made, I am afraid this will become an (unnecessary) source of conflict and, unless governed properly, will cause incompatibilities between engines that are hard to spot, debug, and fix once deployed. I'd prefer to not have to govern this, but to have a more flexible system in place (e.g. ...).
Secondly, I still prefer a more generic approach of just having a list of validators, each given the entire message, including ...
Eventually got around to giving this a full review. Added a bunch of suggestions, and have some minor remarks and one or two bigger suggestions (but they might simply be out of scope). Overall this looks good to me :) .
    @@ -106,6 +108,10 @@ const LAST_COMPLETED_KEY: &[u8] = b"grandpa_completed_round";
    const AUTHORITY_SET_KEY: &[u8] = b"grandpa_voters";
    const CONSENSUS_CHANGES_KEY: &[u8] = b"grandpa_consensus_changes";

    +const GRANDPA_ENGINE_ID: network::ConsensusEngineId = [b'a', b'f', b'g', b'1'];
thanks 😊 !
core/finality-grandpa/src/lib.rs (outdated)

        false
    }

    fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
please wrap.
Suggested change:

    -   fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
    +   fn validate_round_message(&self, full: VoteOrPrecommitMessage<Block>)
    +       -> network_gossip::ValidationResult<Block::Hash>
    +   {
core/finality-grandpa/src/lib.rs (outdated)

        network_gossip::ValidationResult::Valid(topic)
    }

    fn validate_commit_message(&self, full: FullCommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
here, too
Suggested change:

    -   fn validate_commit_message(&self, full: FullCommitMessage<Block>) -> network_gossip::ValidationResult<Block::Hash> {
    +   fn validate_commit_message(&self, full: FullCommitMessage<Block>)
    +       -> network_gossip::ValidationResult<Block::Hash>
    +   {
core/finality-grandpa/src/lib.rs (outdated)

    let validator = Arc::new(GossipValidator::new());
    let v = validator.clone();
    service.with_gossip(move |gossip, _| {
        gossip.register_validator(GRANDPA_ENGINE_ID, validator.clone());
Am I missing something, or why are you not using v here and then using validator in NetworkBridge { service, validator } directly? Saves you one clone...
A leftover from when there were 2 validators here.
    pub fn register_validator(&mut self, engine_id: ConsensusEngineId, validator: Arc<Validator<B::Hash>>) {
        self.validators.insert(engine_id, validator);
    }

    /// Handle new connected peer.
    pub fn new_peer(&mut self, protocol: &mut Context<B>, who: NodeIndex, roles: Roles) {
        if roles.intersects(Roles::AUTHORITY) {
            trace!(target:"gossip", "Registering {:?} {}", roles, who);
            // Send out all known messages to authorities.
            let mut known_messages = HashSet::new();
            for entry in self.messages.iter() {
We are still doing a burst on connect here - which we have observed to cause a cycle of reconnects on lower-bandwidth systems: being dropped because they react too slowly, reconnecting, being burst and dropped again... rinse, repeat. The patch in my next comment would reduce the messages sent to all non-stale ones, and the entire PR will reduce the amount of messages that are valid at any point in time; however, I am not sure this saves us entirely, and I am wondering if we want to reduce the amount of messages we are sending regardless (like, just the last n entries)?
    @@ -216,6 +225,8 @@ pub mod generic {
    pub struct Status<Hash, Number> {
        /// Protocol version.
        pub version: u32,
    +   /// Minimum supported version.
    +   pub min_supported_version: u32,
As discussed in the chat: this change alone breaks compatibility between this code base and (non-authority) full-nodes in the existing network. Maybe we can isolate this change to make it easier to migrate the other changes onto live networks?
Changes to the gossip message are breaking as well. Maintaining backwards compatibility is currently a pain because peers would just disconnect if the other party has a higher protocol version. This exact change is there to address that.
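A rough sketch of the compatibility rule this field enables: instead of disconnecting on any version mismatch, a peer is kept as long as the advertised version ranges overlap. Constant values and names here are illustrative.

    // Mirrors the Status fields from the diff above (other fields omitted).
    struct Status {
        version: u32,
        min_supported_version: u32,
    }

    // Hypothetical local protocol range.
    const CURRENT_VERSION: u32 = 3;
    const MIN_SUPPORTED_VERSION: u32 = 2;

    // Keep the peer if each side supports some version the other can speak.
    fn compatible(peer: &Status) -> bool {
        peer.version >= MIN_SUPPORTED_VERSION
            && peer.min_supported_version <= CURRENT_VERSION
    }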
👍
* First part of gossip protocol refactoring
* Message validation in GRANDPA
* Reverted to time-based expiration for future round messages
* Removed collect_garbage_for_topic
* Use consensus engine id instead of kind
* Use try_init
  Co-Authored-By: arkpar <[email protected]>
* Comment
  Co-Authored-By: arkpar <[email protected]>
* Added expiration check on broadcast
  Co-Authored-By: arkpar <[email protected]>
* Apply suggestions from code review
  Co-Authored-By: arkpar <[email protected]>
* Style
* Style
* Gossip refactoring (#1811)
  * First part of gossip protocol refactoring
  * Message validation in GRANDPA
  * Reverted to time-based expiration for future round messages
  * Removed collect_garbage_for_topic
  * Use consensus engine id instead of kind
  * Use try_init
    Co-Authored-By: arkpar <[email protected]>
  * Comment
    Co-Authored-By: arkpar <[email protected]>
  * Added expiration check on broadcast
    Co-Authored-By: arkpar <[email protected]>
  * Apply suggestions from code review
    Co-Authored-By: arkpar <[email protected]>
  * Style
  * Style
* Fixed sending commit message (#1834)
* Fix #1825 (#1828)
  * Fix #1825
  * Add comment
* Fixed finalizing multiple blocks at once (#1837)
  * Fixed finalizing multiple blocks at once
  * Added a test
  * Added a comment
The is_expired function should be reviewed carefully.