-
Notifications
You must be signed in to change notification settings - Fork 728
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client: introduce circuit breaker for region calls #8856
Conversation
Signed-off-by: artem_danilov <[email protected]>
Hi @Tema. Thanks for your PR. I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Signed-off-by: artem_danilov <[email protected]>
/ok-to-test |
Signed-off-by: artem_danilov <[email protected]>
var ErrOpenState = errors.New("circuit breaker is open") | ||
|
||
// Overloading is a type describing service return value | ||
type Overloading int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just using bool?
client/http/interface.go
Outdated
@@ -541,6 +543,11 @@ func (c *client) GetReplicateConfig(ctx context.Context) (map[string]any, error) | |||
return config, nil | |||
} | |||
|
|||
// UpdateCircuitBreakerSettings updates config for circuit breaker |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's focus on gRPC calls in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I can stash http client changes for now
) | ||
|
||
// ErrOpenState is returned when the CircuitBreaker is open or half-open with pending requests. | ||
var ErrOpenState = errors.New("circuit breaker is open") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving it to errs/errno.go?
state *State[T] | ||
|
||
successCounter prometheus.Counter | ||
failureCounter prometheus.Counter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will it be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failureCounter
should be used next to successCounter
in onResult
instead of fastFailCounter
.
The idea of having these metrics is to see overall error rate over time to decide on ErrorRateThresholdPct
value to set/overrie
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find where it is used in the current codebase.
@rleungx can you help to take a look at https://github.com/tikv/pd/actions/runs/12043529055/job/33580779593?pr=8856 as I don't see why it marks it as FAILED while as there is no FAIL in the tests. |
|
Signed-off-by: artem_danilov <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8856 +/- ##
==========================================
- Coverage 76.21% 76.19% -0.03%
==========================================
Files 460 461 +1
Lines 70239 70431 +192
==========================================
+ Hits 53532 53662 +130
- Misses 13362 13409 +47
- Partials 3345 3360 +15
Flags with carried forward coverage won't be shown. Click here to find out more. |
@rleungx thanks for the comments. I think I've addressed all of those and also fixed all test and lint failures. |
cb := NewCircuitBreaker[int]("test_cb", settings) | ||
re.Equal(StateClosed, cb.state.stateType) | ||
driveQPS(cb, minCountToOpen, Yes, re) | ||
cb.advance(settings.ErrorRateWindow) | ||
assertFastFail(cb, re) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about wraping a function to reduce redundant code?
<-ended[i] | ||
} | ||
assertSucceeds(cb, re) | ||
re.Equal(StateClosed, cb.state.stateType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check will be failed. Is it expected?
re.Equal(StateClosed, cb.state.stateType) | |
re.Equal(StateClosed, cb.state.stateType) | |
re.Equal(uint32(2), cb.state.successCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test multi-thread scene more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check will be failed. Is it expected?
No, first two requests are counted inside HalfOpen state, then we sent only once request after closing circuit breaker, so successCount
is expected to be 1. I'm adding this assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test multi-thread scene more?
Multi-threaded testing are hard and verbose, so I tend to add them only to critical functionality like this one. I can add a few more tests to validate that we don't double or undercount, but I feel they are much less critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a case to test two threads call Execute
during one state(StateOpen)? Just like this case. I misunderstand them. I think we need them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. What invariant I would be testing/asserting on after holding two threads in the Execute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just test count
. Do not need to assert State
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a case to test two threads call Execute during one state(StateOpen)?
Open state does not execute calls, but short circuits them, so can't write a pending requests test.
However, I've added TestCircuitBreaker_Count_Only_Requests_In_Same_Window
. PTAL.
Signed-off-by: artem_danilov <[email protected]>
HalfOpenSuccessCount uint32 | ||
} | ||
|
||
// AlwaysOpenSettings is a configuration that never trips the circuit breaker. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never trips = AlwaysCloseSettings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, circuit breaker trips == opens (the current can't flow anymore), meaning never trips == always closed. Or I misunderstood your question? Feel free to suggest a better comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should it be named AlwaysCloseSettings
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you! Fixed.
Signed-off-by: artem_danilov <[email protected]>
Signed-off-by: artem_danilov <[email protected]>
/retest |
TestConnected failed. Does not seem related to this PR. Pass locally. |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except #8856 (comment). Rest lgtm
Signed-off-by: artem_danilov <[email protected]>
@nolouch @okJiang I've addressed your comments. PTAL. Also could you please stamp tikv/rfcs#115 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but wait for @nolouch last review
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: nolouch, rleungx The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What problem does this PR solve?
Issue Number: Ref #8678, tikv/rfcs#115
What is changed and how does it work?
This PR implements circuit breaker from tikv/rfcs#115 and integrates it into http and grpc PD clients for get region metadata calls, but disabled by default.
Check List
Tests
Code changes
Side effects
Related changes
pingcap/docs
/pingcap/docs-cn
:pingcap/tiup
:Release note