Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce client circuit breaker #115

Merged
merged 2 commits into from
Dec 12, 2024
Merged

Conversation

Tema
Copy link
Contributor

@Tema Tema commented Oct 31, 2024

@Tema
Copy link
Contributor Author

Tema commented Oct 31, 2024

@rleungx @niubell PTAL

@Tema
Copy link
Contributor Author

Tema commented Oct 31, 2024

cc: @mittalrishabh @HaoW30

# Title

- RFC PR: https://github.com/tikv/rfcs/pull/115
- Tracking Issue: https://github.com/tikv/repo/issues/0000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use this issue: tikv/pd#8678

Signed-off-by: artem_danilov <[email protected]>
@rleungx
Copy link
Member

rleungx commented Nov 5, 2024

@nolouch @okJiang Please also take a look.


```go
type Settings struct {
Type string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding more details for Type here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a type of circuit breaker e.g. "pd_get_regions", "pd_tso", "tikv_copr". Maybe we can call it a Name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Type and Name are ok to me and we can add some explanations in RFC.

* Default: 10
* Unit: seconds
---
* `min_qps_to_open`
Copy link
Member

@rleungx rleungx Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a config for it or is it necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Imagine we had only 1 request within error_rate_window which fails. At the end of window it will be evaluated at 100% error rate and open the circuit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "had only 1 request within error_rate_window", you do not have to enable the cb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually circuit breaker tracks the number of failures within error window. I found it problematic to set absolute value given that it could vary a lot depending on qps hence I chose to go with error rate instead. But in this case we do need a protection from low qps cases or time of the day hence I propose minQPS. If we want to limit number of easy configurations to one session variables, then maybe we should consider to set it in absolute number of errors within window.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

* Unit: integer
---
* `cooldown_interval`
* Defines how long to wait after circuit breaker is open before go to half-open state to send a probe request. This interval always equally jittered in the `[value/2, value]` interval.
Copy link
Member

@okJiang okJiang Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I miss something? Why is it a range?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a range. I just meant that the cooldown time will be jittered and with the actual time falling in the interval. Same as we jitter backoff time. Let me capture it more clearly here.


#### System variables

`tidb_cb_pd_metadata_error_rate_threshold_pct`:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems

  1. pd_metadata is difficult to understand
  2. error_rate is redundant with pct

How about renaming it to pd_client_cb_error_rate_threshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect to have a single circuit breaker covering all PD calls. I envisioned to have a dedicated circuit breaker for get regions calls. pd_client_get_regions_cb_error_rate_threshold?

pct meant to inidicate that the value is integer representing percent and not float representing the ratio.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, the conclusion of the last meeting was that a single switch should control all interfaces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any suggestions about this? @rleungx @nolouch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a dedicated circuit breaker for get regions calls, do we need some variables to control other calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@okJiang I believe on last meeting we discussed which variables needs to be configured in real time by customer and which one could be hardcoded or configured at startup with toml file. My perception was that we want to allow configure error rate dynamically and everything else hardcoded or configured statically.

@rleungx I believe we need a separate circuit breaker instance for different calls: e.g. circuit breaker for region metdata should be separate to tikv coprocessor. They also needs to be enabled/disabled individually hence we need a separate variables for each call.

How many different circuit breakers we need could be decided later and we can follow same naming and granularity for configurations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we are only developing the region metadata part this time?🤔 @Tema

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to add more. Let's brainstorm which one we want and see if they would need any different default settings. I think I mentioned before that some of other APIs requiring circuit breakers could be:

  • tikv to PD heartbeat (heatrbeat_cb_error_rate_threshold this prolly will be tikv.toml and not s sysvar)
  • tidb to each tikv requests ((tikv_client_cb_error_rate_threshold ))
    Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, it is better to manage an interface with a variable. For the circuit breaker itself, different interfaces can share the same implementation and only use a different setting. Right now we can focus on the GetRegion call. /cc @okJiang

Copy link
Contributor Author

@Tema Tema left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rleungx Thanks for the review. I've asked some clarifying questions.

* Unit: integer
---
* `cooldown_interval`
* Defines how long to wait after circuit breaker is open before go to half-open state to send a probe request. This interval always equally jittered in the `[value/2, value]` interval.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a range. I just meant that the cooldown time will be jittered and with the actual time falling in the interval. Same as we jitter backoff time. Let me capture it more clearly here.

* Default: 10
* Unit: seconds
---
* `min_qps_to_open`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Imagine we had only 1 request within error_rate_window which fails. At the end of window it will be evaluated at 100% error rate and open the circuit.


#### System variables

`tidb_cb_pd_metadata_error_rate_threshold_pct`:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect to have a single circuit breaker covering all PD calls. I envisioned to have a dedicated circuit breaker for get regions calls. pd_client_get_regions_cb_error_rate_threshold?

pct meant to inidicate that the value is integer representing percent and not float representing the ratio.


```go
type Settings struct {
Type string
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a type of circuit breaker e.g. "pd_get_regions", "pd_tso", "tikv_copr". Maybe we can call it a Name?

All configs described above will be encapsulated in a `Settings` struct with ability to change error rate threshold dynamically:

```go
type Settings struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this configuration be exported for the user? And what is the relationship between this and 'system variables'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error rate threshold will be a system variable, and everything else is hardcoded or exposed in not documented toml client section.

func (cb *CircuitBreaker[T]) Execute(run func() (T, error, boolean)) (T, error)
```

There is a third boolean parameter in the function argument above, in case the provided function doesn't return an error, but the caller still wants to treat it as a failure for the circuit breaker counting purposes (e.g. empty or corrupted result).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can run a test where only the gRPC and timeout errors are treated as actual errors, which might be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is up to the place where we wire the circuit breaker in. Boolean param allows to configure it. It make sense to start with timeouts, but having ability to tweak it in the API let us tune it further.

@rleungx rleungx requested review from okJiang and nolouch November 18, 2024 03:12
@rleungx
Copy link
Member

rleungx commented Nov 18, 2024

If there is no more comment, we can move on. /cc @niubell @nolouch @okJiang

Signed-off-by: artem_danilov <[email protected]>
@Tema
Copy link
Contributor Author

Tema commented Nov 19, 2024

I think I've addressed all opened comments in last commit

  • Renamed Settings.Type to Settings.Name
  • Clarified doc for cooldown_interval
  • Added missing Settings.MinQPS
  • Added docs for each Settings field

@rleungx
Copy link
Member

rleungx commented Dec 12, 2024

@nolouch @okJiang @niubell Do you have more comments? If not, I think we can merge this PR.

@rleungx rleungx merged commit 659ed1f into tikv:master Dec 12, 2024
2 checks passed
mittalrishabh pushed a commit to mittalrishabh/rfcs that referenced this pull request Dec 18, 2024
mittalrishabh pushed a commit to mittalrishabh/rfcs that referenced this pull request Dec 18, 2024
mittalrishabh pushed a commit to mittalrishabh/rfcs that referenced this pull request Dec 20, 2024
mittalrishabh pushed a commit to mittalrishabh/rfcs that referenced this pull request Dec 23, 2024
cfzjywxk added a commit that referenced this pull request Dec 27, 2024
* RFC:follower read cache

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* update review comments

Signed-off-by: rishabh_mittal <[email protected]>

* update review comments

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* add changes

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* incorporate review comments

Signed-off-by: rishabh_mittal <[email protected]>

* incorporate review comments

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* introduce client circuit breaker (#115)

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* Update text/0113-follower-read-cache.md

Co-authored-by: cfzjywxk <[email protected]>
Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* Update 0113-follower-read-cache.md

Signed-off-by: rishabh_mittal <[email protected]>

* review comments

Signed-off-by: rishabh_mittal <[email protected]>

---------

Signed-off-by: rishabh_mittal <[email protected]>
Co-authored-by: Artem Danilov <[email protected]>
Co-authored-by: cfzjywxk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants