
TiKV pushes forward the resolved ts with `CheckLeader` requests, whose traffic grows linearly with the number of regions. In a large cluster that requires frequent safe ts pushes, this can produce a significant amount of traffic. Moreover, it is probably cross-AZ traffic, which is not free.

By optimizing inactive regions, it is possible to reduce traffic significantly. In a large cluster, many regions are not accessed and remain in a hibernated state.

Let’s review the safe ts push mechanism.

1. The leader sends a `check_leader` request to the followers with a timestamp fetched from PD; this request carries the resolved ts pushed in step 3 of the previous round.
2. Each follower responds with whether the leader matches its view.
3. If a quorum of the voters are still in the current leader's term, the leader's `safe_ts` is updated.

In step 1, the leader generates the safe ts, but the followers only apply it in the next round. There is therefore an `advance-ts-interval` gap between two consecutive step 1s, which shows up as safe ts lag on the followers.
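To make the lag concrete, here is a toy simulation (not TiKV code; the round model is an assumption drawn from the steps above), where the safe ts a follower applies in round N is the one the leader generated in round N-1:

```rust
// Each round, the leader fetches a fresh ts from PD but ships the resolved
// ts from the previous round, so followers trail by one interval.
fn simulate_rounds(advance_ts_interval_ms: u64, rounds: u64) {
    let mut resolved_ts = 0; // resolved ts produced in the previous round
    for round in 1..=rounds {
        let pd_ts = round * advance_ts_interval_ms; // stand-in for a PD TSO
        println!("round {round}: follower applies safe_ts={resolved_ts}, leader generates {pd_ts}");
        resolved_ts = pd_ts;
    }
}

fn main() {
    // With a 2.5s interval, the applied safe ts trails by up to ~5s.
    simulate_rounds(2500, 4);
}
```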

In our test of the hit rate at 5-second staleness, we reached a nearly 100% hit rate by setting `resolved-ts.advance-ts-interval` to 2.5 seconds; in other words, the observed staleness is about twice the interval. This means that to achieve 100-millisecond staleness with the current mechanism, we would have to set `resolved-ts.advance-ts-interval` to 50 milliseconds, which doubles the traffic compared with the naive 100-millisecond setting.

**We need a solution that applies the safe TS more efficiently.**

## Detailed design

The design is described in three parts: the protobuf changes, the improvement for active regions, and a new mechanism for inactive regions.

### Protobuf

We still need the `CheckLeader` request to confirm leadership, but with some additional fields.

```diff
message CheckLeaderRequest {
    repeated LeaderInfo regions = 1;
    uint64 ts = 2;
+   uint64 store_id = 3;
+   repeated uint64 hibernated_regions = 4;
}

message CheckLeaderResponse {
    repeated uint64 regions = 1;
    uint64 ts = 2;
+   repeated uint64 failed_regions = 3;
}
```
To apply the safe ts for valid leaders as soon as possible, instead of waiting for the next round of advancing resolved timestamps, we send a separate `ApplySafeTS` request. The `ApplySafeTS` request is usually small, so the traffic it causes can be ignored.

```protobuf
message CheckedLeader {
    uint64 region_id = 1;
    ReadState read_state = 2;
}

message ApplySafeTsRequest {
    uint64 ts = 1;
    uint64 store_id = 2;
    repeated uint64 unsafe_regions = 3;
    repeated CheckedLeader checked_leaders = 4;
}

message ApplySafeTsResponse {
    uint64 ts = 1;
}
```
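To illustrate the two-phase flow, here is a sketch of how a leader might build the `ApplySafeTS` request from a `CheckLeaderResponse`. The Rust types loosely mirror the messages above (with `ReadState` flattened to a bare `safe_ts` for brevity), and `build_apply_request` is a hypothetical helper, not TiKV's actual API:

```rust
struct CheckLeaderResponse {
    ts: u64,
    failed_regions: Vec<u64>,
}

struct CheckedLeader {
    region_id: u64,
    safe_ts: u64,
}

struct ApplySafeTsRequest {
    ts: u64,
    store_id: u64,
    unsafe_regions: Vec<u64>,
    checked_leaders: Vec<CheckedLeader>,
}

// Build the second-phase request from the first-phase response: regions that
// failed the leadership check are reported back as unsafe so followers skip them.
fn build_apply_request(
    store_id: u64,
    resp: &CheckLeaderResponse,
    checked_leaders: Vec<CheckedLeader>,
) -> ApplySafeTsRequest {
    ApplySafeTsRequest {
        ts: resp.ts,
        store_id,
        unsafe_regions: resp.failed_regions.clone(),
        checked_leaders,
    }
}
```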

### Active Regions

The `CheckLeaderResponse` will respond with the regions that pass the Raft safety check. The leader can then push its `safe_ts` for those regions. Since most regions will pass the safety check, it is not necessary to respond with the IDs of all passing regions. Instead, we can respond with only the IDs of regions that fail the safety check.
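For illustration, a minimal sketch of the leader-side set difference this implies (hypothetical helper, not TiKV code):

```rust
use std::collections::HashSet;

// Derive the regions whose safe ts can be advanced from the failing IDs,
// since only regions that fail the check are carried in the response.
fn passed_regions(sent: &[u64], failed_regions: &[u64]) -> Vec<u64> {
    let failed: HashSet<u64> = failed_regions.iter().copied().collect();
    sent.iter().copied().filter(|id| !failed.contains(id)).collect()
}
```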
Another optimization is that we can confirm leadership while the leader lease is held by calling Raft's read-index command. However, this involves the Raftstore thread pool, so more CPU will be consumed.

### Inactive Regions

Here we refer to regions without writes as inactive regions. In the future, TiKV will deprecate hibernated regions and merge small regions into dynamic ones. If that happens, inactive regions will no longer be a problem. However, for users who do not enable dynamic regions, this optimization is still required.

To save traffic, we can push the safe timestamps of inactive regions together without sending the region information list. The `ts` field in `CheckLeaderRequest` is only used to match a response to its request, even though it is fetched from PD. Ideally, we can push the safe timestamps of inactive regions using this `ts` value. Additionally, we can remove inactive regions from `CheckLeaderRequest.regions`; their IDs are carried in the `hibernated_regions` field added in the Protobuf section above.

We only send the IDs of inactive regions. In the ideal case, both the `LeaderInfo` and the region ID in the response are skipped, reducing the traffic from 64 bytes to 8 bytes per region.
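A sketch of how a leader might assemble the request under this scheme; the `Region` type and the `build_check_leader` helper are hypothetical, and the field layout follows the protobuf above:

```rust
struct LeaderInfo {
    region_id: u64,
    // ... peer_id, term, region_epoch, read_state in the real message
}

struct CheckLeaderRequest {
    regions: Vec<LeaderInfo>,
    ts: u64,
    store_id: u64,
    hibernated_regions: Vec<u64>,
}

struct Region {
    id: u64,
    has_recent_writes: bool,
}

fn build_check_leader(all: &[Region], ts: u64, store_id: u64) -> CheckLeaderRequest {
    let mut req = CheckLeaderRequest {
        regions: Vec::new(),
        ts,
        store_id,
        hibernated_regions: Vec::new(),
    };
    for r in all {
        if r.has_recent_writes {
            // Active region: full leader info (~64 bytes per region).
            req.regions.push(LeaderInfo { region_id: r.id });
        } else {
            // Inactive region: ID only (8 bytes per region).
            req.hibernated_regions.push(r.id);
        }
    }
    req
}
```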

One more phase is required to apply the safe ts: during the check-leader process, the follower cannot decide whether the request comes from a valid leader, so it keeps the safe ts in memory and waits for the apply-safe-ts request.

```diff
#[derive(Clone)]
pub struct RegionReadProgressRegistry {
    registry: Arc<Mutex<HashMap<u64, Arc<RegionReadProgress>>>>,
+   checked_states: Arc<Mutex<HashMap<u64, CheckedState>>>,
}

+struct CheckedState {
+    safe_ts: u64,
+    valid_regions: Vec<u64>,
+}
```
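Below is a hedged sketch of the follower-side apply phase; the stub types and the `handle_apply_safe_ts` method are assumptions for illustration, not TiKV's actual API. The follower applies the buffered safe ts only to regions the leader confirmed, skipping those reported as unsafe:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stub standing in for TiKV's per-region read progress tracker.
struct RegionReadProgress {
    safe_ts: Mutex<u64>,
}

struct CheckedState {
    safe_ts: u64,
    valid_regions: Vec<u64>,
}

struct RegionReadProgressRegistry {
    registry: Arc<Mutex<HashMap<u64, Arc<RegionReadProgress>>>>,
    checked_states: Arc<Mutex<HashMap<u64, CheckedState>>>,
}

impl RegionReadProgressRegistry {
    // On ApplySafeTs: apply the safe ts buffered in the check-leader phase
    // to every confirmed region, skipping the ones reported as unsafe.
    fn handle_apply_safe_ts(&self, store_id: u64, unsafe_regions: &[u64]) {
        if let Some(state) = self.checked_states.lock().unwrap().remove(&store_id) {
            let registry = self.registry.lock().unwrap();
            for region_id in state.valid_regions {
                if unsafe_regions.contains(&region_id) {
                    continue;
                }
                if let Some(progress) = registry.get(&region_id) {
                    let mut safe_ts = progress.safe_ts.lock().unwrap();
                    *safe_ts = (*safe_ts).max(state.safe_ts);
                }
            }
        }
    }
}
```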

To save traffic, `ApplySafeTsRequest.unsafe_regions` contains only the regions whose leader may have changed. In the ideal case, this request is small because there are almost no unsafe regions.

#### Safety

Safety is guaranteed as long as the safe ts is generated from a **valid leader**. By decoupling the pushing of safe ts into two phases, the leader peer can ensure that the leadership is still unchanged before the safe ts is generated.

For inactive regions, only the region IDs are sent. If the term has changed since the last active-region check, the follower responds with a check failure. When the leader receives the `CheckLeaderResponse` and an inactive region's ID is not in `CheckLeaderResponse.failed_regions`, the terms of the leader and the follower are both unchanged. After leadership is checked for inactive regions, it is safe to push their safe timestamps from the leader. Additionally, since there are no subsequent writes in inactive regions, the `applied_index` check is unnecessary.

#### Implementation

To implement safety checks for inactive regions, we record the term when we receive active-region checks. We then compare the recorded term with the latest term when we receive a region ID without leader information, which indicates that the region is inactive on the leader. If the terms do not match, the leader must treat it as an active region next time.

## Drawbacks

This RFC makes resolved ts management more complex.

If there are too many active regions, more CPU will be consumed.

## Unresolved questions
