introduce client circuit breaker #115
text/0115-circuit-breaker.md
Outdated
# Title

- RFC PR: https://github.com/tikv/rfcs/pull/115
- Tracking Issue: https://github.com/tikv/repo/issues/0000
We can use this issue: tikv/pd#8678
Signed-off-by: artem_danilov <[email protected]>
text/0115-circuit-breaker.md
Outdated
```go
type Settings struct {
	Type string
```
How about adding more details for `Type` here?
This is the type of circuit breaker, e.g. "pd_get_regions", "pd_tso", "tikv_copr". Maybe we can call it `Name`?
Both `Type` and `Name` are OK with me, and we can add some explanation in the RFC.
* Default: 10
* Unit: seconds
---
* `min_qps_to_open`
Do we need a config for it or is it necessary?
I think so. Imagine we had only one request within `error_rate_window`, and it failed. At the end of the window it would be evaluated as a 100% error rate and open the circuit.
If "had only 1 request within error_rate_window", you do not have to enable the cb?
A circuit breaker usually tracks the number of failures within an error window. I found it problematic to set an absolute value, given that it can vary a lot depending on QPS, so I chose to go with an error rate instead. But in that case we do need protection against low-QPS cases or quiet times of day, hence I propose `minQPS`. If we want to limit the number of easy configurations to one session variable, then maybe we should consider setting it as an absolute number of errors within the window.
lgtm
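To make the low-traffic concern in this thread concrete, here is a minimal sketch of an end-of-window check that combines an error rate threshold with a minimum-QPS guard. The names `windowStats` and `shouldOpen`, the `>=` comparison, and the sample numbers are assumptions for illustration, not taken from the RFC.

```go
package main

import "fmt"

// windowStats holds the counters for one error-rate window.
// Hypothetical type: it only mirrors the settings discussed in this thread.
type windowStats struct {
	total  int // requests observed in the window
	failed int // requests counted as failures
}

// shouldOpen reports whether the circuit should open at the end of a window.
// windowSeconds is the window length, minQPS is the low-traffic guard, and
// thresholdPct is the error rate threshold as an integer percentage.
func shouldOpen(s windowStats, windowSeconds, minQPS, thresholdPct int) bool {
	if windowSeconds <= 0 || s.total == 0 {
		return false
	}
	// Low-traffic guard: too few requests to trust the error rate.
	if s.total < minQPS*windowSeconds {
		return false
	}
	// Integer form of failed/total >= thresholdPct/100 (assumed comparison).
	return s.failed*100 >= s.total*thresholdPct
}

func main() {
	// A single failed request: 100% error rate, but far below the QPS guard.
	fmt.Println(shouldOpen(windowStats{total: 1, failed: 1}, 10, 10, 10))
	// 200 requests with 30 failures in 10s: 20 QPS at a 15% error rate.
	fmt.Println(shouldOpen(windowStats{total: 200, failed: 30}, 10, 10, 10))
}
```

With the guard in place, the single-request case no longer opens the circuit, while sustained traffic above the threshold still does.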
text/0115-circuit-breaker.md
Outdated
* Unit: integer
---
* `cooldown_interval`
* Defines how long to wait after the circuit breaker opens before moving to the half-open state to send a probe request. This interval is always equally jittered within `[value/2, value]`.
Did I miss something? Why is it a range?
It is not a range. I just meant that the cooldown time will be jittered, with the actual time falling within that interval, the same way we jitter backoff time. Let me capture it more clearly here.
#### System variables

`tidb_cb_pd_metadata_error_rate_threshold_pct`:
It seems that:
- `pd_metadata` is difficult to understand
- `error_rate` is redundant with `pct`

How about renaming it to `pd_client_cb_error_rate_threshold`?
Do we expect to have a single circuit breaker covering all PD calls? I envisioned a dedicated circuit breaker for get-regions calls: `pd_client_get_regions_cb_error_rate_threshold`?
`pct` is meant to indicate that the value is an integer representing a percentage, not a float representing a ratio.
If I remember correctly, the conclusion of the last meeting was that a single switch should control all interfaces.
If we have a dedicated circuit breaker for get regions calls, do we need some variables to control other calls?
@okJiang I believe at the last meeting we discussed which variables need to be configurable in real time by the customer and which ones could be hardcoded or configured at startup with a toml file. My understanding was that we want to allow the error rate to be configured dynamically, with everything else hardcoded or configured statically.
@rleungx I believe we need a separate circuit breaker instance for different calls: e.g. the circuit breaker for region metadata should be separate from the one for the TiKV coprocessor. They also need to be enabled/disabled individually, hence we need separate variables for each call.
How many different circuit breakers we need can be decided later, and we can follow the same naming and granularity for configurations.
Do you mean we are only developing the region metadata part this time?🤔 @Tema
I'm open to adding more. Let's brainstorm which ones we want and see whether they would need different default settings. I think I mentioned before that some of the other APIs requiring circuit breakers could be:
- TiKV to PD heartbeat (`heartbeat_cb_error_rate_threshold`; this will probably be in tikv.toml and not a sysvar)
- TiDB to each TiKV request (`tikv_client_cb_error_rate_threshold`)

Thoughts?
IMO, it is better to manage an interface with a variable. For the circuit breaker itself, different interfaces can share the same implementation and only use a different setting. Right now we can focus on the GetRegion call. /cc @okJiang
@rleungx Thanks for the review. I've asked some clarifying questions.
All configs described above will be encapsulated in a `Settings` struct with the ability to change the error rate threshold dynamically:

```go
type Settings struct {
```
Will this configuration be exported for the user? And what is the relationship between this and 'system variables'?
The error rate threshold will be a system variable; everything else is hardcoded or exposed in an undocumented client section of the toml file.
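One way to picture that split is a settings struct whose static fields are set once, while the error rate threshold can be swapped at runtime when the system variable changes. This is only a sketch under assumptions: the Go field names are hypothetical mappings of `error_rate_window`, `min_qps_to_open` and `cooldown_interval`, and the use of `atomic.Uint32` (Go 1.19+) and clamping to 100 are illustrative choices, not the RFC's design.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Settings is a hypothetical shape for the configuration discussed above.
type Settings struct {
	Name string // e.g. "pd_get_regions"

	// Static values: hardcoded or read once from the toml file (assumption).
	ErrorRateWindow  time.Duration
	MinQPSToOpen     uint32
	CooldownInterval time.Duration

	// Dynamic value: updated when the system variable changes.
	errorRateThresholdPct atomic.Uint32
}

// NewSettings returns settings with the given initial threshold percentage.
func NewSettings(name string, thresholdPct uint32) *Settings {
	s := &Settings{Name: name}
	s.SetErrorRateThresholdPct(thresholdPct)
	return s
}

// SetErrorRateThresholdPct stores a new threshold, clamped to 100 percent.
func (s *Settings) SetErrorRateThresholdPct(pct uint32) {
	if pct > 100 {
		pct = 100
	}
	s.errorRateThresholdPct.Store(pct)
}

// ErrorRateThresholdPct returns the current threshold as an integer percent.
func (s *Settings) ErrorRateThresholdPct() uint32 {
	return s.errorRateThresholdPct.Load()
}

func main() {
	s := NewSettings("pd_get_regions", 10)
	fmt.Println(s.ErrorRateThresholdPct())
	s.SetErrorRateThresholdPct(25) // e.g. after the sysvar is changed
	fmt.Println(s.ErrorRateThresholdPct())
}
```

Readers on the request path only call `ErrorRateThresholdPct()`, so a threshold change takes effect without locking or recreating the breaker.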
```go
func (cb *CircuitBreaker[T]) Execute(run func() (T, error, bool)) (T, error)
```

The function passed to `Execute` returns a third, boolean value for cases where the provided function doesn't return an error but the caller still wants it counted as a failure by the circuit breaker (e.g. an empty or corrupted result).
Perhaps we can run a test where only the gRPC and timeout errors are treated as actual errors, which might be sufficient.
This is up to the place where we wire the circuit breaker in; the boolean parameter allows configuring it. It makes sense to start with timeouts, but having the ability to tweak it in the API lets us tune it further.
Signed-off-by: artem_danilov <[email protected]>
I think I've addressed all open comments in the last commit.
RFC for tikv/pd#8678
Rendered view: https://github.com/Tema/rfcs/blob/0115-circuit-breaker/text/0115-circuit-breaker.md