Add new message type MsgGroupBroadcast and corresponding handler #485
LintianShi wants to merge 21 commits into tikv:master
Conversation
@Connor1996 PTAL
BusyJay left a comment:
When will MsgGroupBroadcast be set?
    m_append.commit_term = m.get_commit_term();
    self.r.send(m_append, &mut self.msgs);
}
Err(_) => {
Consider the temporarily-unavailable case.
Is the processing like the way in fn send_append_aggressively, looping to fetch until it succeeds?
Nope, call it again in on_entries_fetched.
Async fetch for forwarding is committed. PTAL.
/// This enables data replication from a follower to other servers in the same availability zone.
///
/// Enable this to reduce cross-AZ traffic in cloud deployments.
pub follower_repl: bool,
}
// For a broadcast, append entries to own log and forward MsgAppend to other destinations.
fn handle_group_broadcast(&mut self, m: &Message) {
When is the broadcast message sent?
In raftstore, several MsgAppend messages will be merged into one MsgGroupBroadcast in fn build_raft_messages.
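The merge step described here could be sketched roughly as follows. All types and names below are simplified, hypothetical stand-ins; the real raftstore/raft-rs code differs:

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for a raft message.
#[derive(Clone, Debug, PartialEq)]
struct Msg {
    to: u64,
    group_id: u64, // broadcast group (e.g. AZ) of the receiver
}

// Group MsgAppend-like messages by the receiver's broadcast group, so that
// each group can later be collapsed into one MsgGroupBroadcast sent to a
// single agent peer, which then forwards to the rest of its group.
fn group_by_broadcast_group(msgs: Vec<Msg>) -> HashMap<u64, Vec<Msg>> {
    let mut groups: HashMap<u64, Vec<Msg>> = HashMap::new();
    for m in msgs {
        groups.entry(m.group_id).or_default().push(m);
    }
    groups
}

fn main() {
    let msgs = vec![
        Msg { to: 2, group_id: 1 },
        Msg { to: 3, group_id: 2 },
        Msg { to: 4, group_id: 2 },
    ];
    let groups = group_by_broadcast_group(msgs);
    // Peers 3 and 4 share a group: one of them can act as the agent.
    assert_eq!(groups[&2].len(), 2);
    assert_eq!(groups[&1].len(), 1);
}
```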
Test case is missing.
pub fn can_async(&self) -> bool {
    match self.0 {
        GetEntriesFor::SendAppend { .. } => true,
        GetEntriesFor::SendForward { .. } => true,
I think it can be false to avoid the async fetch's potential latency in follower replication.
// Group messages
if let Some(group) = msg_group.get_mut(&group_id) {
msg_group.entry().or_default().and_modify()
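The suggestion is to replace the get_mut-plus-insert pattern with the standard HashMap entry API. A minimal self-contained sketch (string messages stand in for raft Messages):

```rust
use std::collections::HashMap;

fn main() {
    let mut msg_group: HashMap<u64, Vec<&str>> = HashMap::new();
    let group_id = 7u64;

    // Instead of:
    //   if let Some(group) = msg_group.get_mut(&group_id) {
    //       group.push("msg1");
    //   } else {
    //       msg_group.insert(group_id, vec!["msg1"]);
    //   }
    // the entry API covers both branches in one call:
    msg_group.entry(group_id).or_default().push("msg1");
    msg_group.entry(group_id).or_default().push("msg2");

    assert_eq!(msg_group[&group_id], vec!["msg1", "msg2"]);
}
```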
self.prs
    .iter_mut()
    .filter(|&(id, _)| *id != self_id)
    .for_each(|(id, pr)| core.send_append(*id, pr, &mut msgs));
We can form the msg group from here directly.
let mut idx: usize = 0;
for msg in msgs {
    if !skip[idx] {
How about grouping the messages instead of the positions? And merge_append_group could modify or filter the messages directly instead of passing skip.
In the new commit, I have modified it. PTAL.
    self.prs().group_commit()
}

/// Checks whether the raft group is using group commit and consistent
It was a mis-operation. I will fix it.
/// according to the progress recorded in r.prs().
pub fn bcast_append(&mut self) {
    let self_id = self.id;
    let leader_group_id = self
Can this be queried only when follower replication is enabled?
There are two conditions under which we should use leader replication: one is that follower replication is disabled; the other is that follower replication is enabled but the leader's group id is unknown. So I want to reuse the code for iterating the progress and sending appends.
let mut use_leader_replication = self.is_leader();
let mut leader_group_id = 0;
if !use_leader_replication {
    leader_group_id = query;
    use_leader_replication = leader_group_id == 0;
}
if use_leader_replication {
    return;
}
// Messages that need to be forwarded are stored in a hashmap temporarily,
// grouped by the broadcast_group_id of the progress.
// Messages in msg_group will be pushed to the message queue later.
let mut msg_group: HashMap<u64, Vec<Message>> = HashMap::default();
Better cache msg_group to avoid allocation.
// Merge messages in the same broadcast group and send them.
for (group_id, mut group) in msg_group.drain() {
    // Double check: no need to forward messages in the leader's broadcast group.
}
// Attach forward information to the MsgGroupBroadcast and send it.
group[mark].set_forwards(forwards.into());
for msg in group {
You mean self.msgs.push(group[mark])?
// Find an appropriate agent.
// If an appropriate one is found, mark the corresponding message's type as
// MsgGroupBroadcast and return true. If not found, return false.
fn select_agent_for_bcast_group(&self, msgs: &mut [Message]) -> bool {
Why not return the index of the message? Changing the type in this function is quite confusing.
I think it is a better choice.
let peer_id = msg.to;
let is_voter = self.prs().conf().voters().contains(peer_id);
// Agent must be voter and recently active.
if !is_voter || !self.is_recent_active(peer_id) {
This information can be obtained when the message is generated.
The voter and recent_active information is only used for agent selection. When the message is generated, we only determine whether it should be replicated by a follower. If we wanted to obtain this information at message generation time, another cache would be needed to record it.
No, what you need to do is merge bcast_append, merge_msg_group and select_agent_for_bcast_group.
merge_msg_group and select_agent_for_bcast_group have been merged into bcast_append. PTAL.
// No need to merge if the group size is less than two, or if there is no appropriate agent.
if !need_merge {
    msgs.append(&mut group.into_iter().map(|(msg, _)| msg).collect());
    return;
Thanks very much. I think this causes the jitter in the number of sent messages.
You should improve the test to cover this.
let mut agent_msg = group.swap_remove(agent_msg_idx.unwrap()).0;
agent_msg.set_msg_type(MessageType::MsgGroupBroadcast);
agent_msg.set_forwards(forwards.into());
msgs.push(agent_msg);
No need to send other msgs for the agent?
Other messages have been pushed to self.msgs when generated.
Oh, I thought there would be multiple MsgAppend messages for one peer.
    index: msg.index,
    ..Default::default()
};
forwards.push(forward);
What if too much forward info makes one message too large?
Considering that there are usually 3 or 5 replicas deployed across 2 or 3 AZs, I think one message won't carry too much forward information.
@@ -888,12 +1002,106 @@ impl<T: Storage> Raft<T> {
/// according to the progress recorded in r.prs().
pub fn bcast_append(&mut self) {
Follower replication only covers bcast_append, but misses the places that call send_append and send_append_aggressively directly.
When a cluster replicates data via bcast_append, the cluster is in a stable state with high probability.
If send_append is called, it is usually due to a reject in MsgAppendResp, which means the follower's raft log conflicts with the leader's. In this situation, we'd better use leader replication to fix the follower's raft log as quickly as possible.
In previous tests, we found that follower replication can reduce a lot of cross-AZ traffic, so I think we should sacrifice some coverage of follower replication to achieve less performance regression. Even though we only use follower replication on the main path of log replication, it can still reduce considerable cross-AZ traffic.
Maybe we can expand follower replication in the future; for example, we could fetch snapshots from followers.
pub(crate) max_committed_size_per_ready: u64,

// Message group cache for follower replication.
msg_group: HashMap<u64, Vec<(Message, bool)>>,
Vec is more efficient than HashMap.
Since the number of groups is usually no more than 3, finding an element in a Vec should be more efficient than in a HashMap. Am I right?
Not just query, but also update and memory footprint.
Done. Use Vec<(u64, Vec<(Message, bool)>)> instead of HashMap<u64, Vec<(Message, bool)>>.
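The trade-off discussed here (flat Vec with linear search vs. HashMap for a handful of groups) can be illustrated with a small self-contained sketch; the types are simplified stand-ins, not the PR's actual definitions:

```rust
// With at most a handful of broadcast groups, a flat Vec with linear
// search avoids per-lookup hashing cost and HashMap memory overhead.
// (group_id, messages) — a String stands in for a raft Message here.
type Group = (u64, Vec<(String, bool)>);

// Append a message to its group, creating the group on first use.
fn push_msg(groups: &mut Vec<Group>, group_id: u64, msg: String, flag: bool) {
    match groups.iter_mut().find(|(id, _)| *id == group_id) {
        Some((_, msgs)) => msgs.push((msg, flag)),
        None => groups.push((group_id, vec![(msg, flag)])),
    }
}

fn main() {
    let mut groups: Vec<Group> = Vec::new();
    push_msg(&mut groups, 1, "a".into(), true);
    push_msg(&mut groups, 1, "b".into(), false);
    push_msg(&mut groups, 2, "c".into(), true);
    assert_eq!(groups.len(), 2);    // two distinct groups
    assert_eq!(groups[0].1.len(), 2); // group 1 holds two messages
}
```

For two or three keys, the linear scan is typically faster in practice and the Vec can be drained and reused across calls without rehashing.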
/// The current configuration state of the cluster.
#[get = "pub"]
conf: Configuration,
pub(crate) conf: Configuration,
Doesn't the above line state it's publicly accessible already?
If we use conf(), there will be a conflict between the mutable borrow of self.prs.progress and the immutable borrow of self.prs.conf.
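The borrow conflict being described can be reproduced in miniature. The types below are illustrative stand-ins, not the actual raft-rs definitions: a getter method like conf() borrows all of self, so it cannot coexist with a live &mut borrow obtained through another method, while direct field access splits the borrow into disjoint pieces:

```rust
struct Configuration {
    voters: Vec<u64>,
}

struct ProgressTracker {
    // Made pub(crate) in the PR so callers can split the borrow.
    pub(crate) conf: Configuration,
    progress: Vec<u64>,
}

impl ProgressTracker {
    // A &self getter borrows the whole struct immutably.
    fn conf(&self) -> &Configuration {
        &self.conf
    }
    // A &mut self getter borrows the whole struct mutably.
    fn progress_mut(&mut self) -> &mut Vec<u64> {
        &mut self.progress
    }
}

fn main() {
    let mut prs = ProgressTracker {
        conf: Configuration { voters: vec![1, 2, 3] },
        progress: vec![0; 3],
    };
    // This would NOT compile: both getters borrow all of `prs`.
    //   let p = prs.progress_mut();
    //   let c = prs.conf(); // error[E0502]: cannot borrow `prs` as immutable
    //   p.push(0);
    //
    // Borrowing the fields directly is accepted, because the borrows
    // cover disjoint fields:
    let p = &mut prs.progress;
    let c = &prs.conf;
    p.push(0);
    assert_eq!(c.voters.len(), 3);
    assert_eq!(prs.progress.len(), 4);
}
```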
});

// Merge messages in the same broadcast group and send them.
for (_, mut group) in core.msg_group.drain() {
I think this might be incorrect, because the broadcast group is only fixed after all progresses are iterated. The merge should start after the broadcast group is fixed.
If we group the progresses by broadcast_group_id first and then iterate them, it is feasible.
let mut tmp_msgs = Vec::default();
// Let messages be pushed into tmp_msgs first.
core.send_append(*id, pr, &mut tmp_msgs);
for msg in tmp_msgs {
How about

if pr.broadcast_group_id == leader_group_id || !pr.is_replicating() {
    msgs.extend(tmp_msgs);
    return;
}
let is_voter = is_voter(pr);
for msg in tmp_msgs {
    if msg.get_msg_type() != MsgAppend {
    } else {
    }
}
Signed-off-by: LintianShi lintian.shi@pingcap.com

Modify raft-rs to support follower replication in TiKV.
Main changes:
- An option to enable follower replication in RawNode and Config.
- A new message type MsgGroupBroadcast and an extra field forwards in Message, which contains information that needs to be forwarded.
- A handler for MsgGroupBroadcast, which appends entries to the local log and forwards MsgAppend to other peers.