stats/opentelemetry: Track retry counts per-call instead of per-attempt #8715
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #8715 +/- ##
==========================================
- Coverage 83.21% 83.17% -0.05%
==========================================
Files 419 419
Lines 32427 32430 +3
==========================================
- Hits 26985 26973 -12
- Misses 4054 4067 +13
- Partials 1388 1390 +2
Hi @ulascansenturk,

Thank you for submitting this PR; we appreciate your contribution.

We already had an existing PR (#8342) that addressed this issue and was previously merged. However, we had to revert it because it introduced a flaky test: Test/TraceSpan_WithRetriesAndNameResolutionDelay. The failed run is linked here for context: [failed run link].

Our ideal path forward is to cherry-pick the original PR (#8342) and include a fix for the flaky test in a follow-up commit. This ensures the original author receives proper credit.

The main requirement now is to identify and fix the cause of the test flakiness. Could you please investigate this and submit the necessary fix? Your help in resolving this would be greatly appreciated.
Fixes: grpc#8299

RELEASE NOTES:
- stats/opentelemetry: Retry attempts (`grpc.previous-rpc-attempts`) are now recorded as span attributes for non-transparent client retries.
…tionDelay

This commit fixes the flaky test TestTraceSpan_WithRetriesAndNameResolutionDelay, which was introduced in the previous commit and caused PR grpc#8342 to be reverted.

Root Cause:
The test had race conditions related to timing:
1. The goroutine that updates resolver state could complete before or after the delayed resolution event was fully processed and recorded in spans.
2. Span export timing was not synchronized with test validation, so the test sometimes checked spans before they were fully exported.

Fix:
1. Added a 'stateUpdated' event to synchronize between the resolver state update completing and span validation beginning.
2. Added an explicit wait for the stateUpdated event before validating spans.
3. Added a 50ms sleep after RPC completion to give the span exporter time to process and export all spans before validation.

Testing:
- Test now passes consistently (10+ consecutive runs)
- Passes with race detector enabled (-race flag)
- No data races detected

Fixes grpc#8700
9b6371e to b1732c6
Hi @eshitachandwani, Thank you for the guidance! I've restructured the PR as requested:
Root cause: Timing race conditions between resolver state updates and span validation.
RELEASE NOTES: N/A
fixes #8700