
Conversation

@ulascansenturk
Contributor

@ulascansenturk ulascansenturk commented Nov 19, 2025

RELEASE NOTES: N/A
fixes #8700

@codecov

codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.17%. Comparing base (50c6321) to head (b1732c6).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| stats/opentelemetry/client_tracing.go | 78.57% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8715      +/-   ##
==========================================
- Coverage   83.21%   83.17%   -0.05%     
==========================================
  Files         419      419              
  Lines       32427    32430       +3     
==========================================
- Hits        26985    26973      -12     
- Misses       4054     4067      +13     
- Partials     1388     1390       +2     
| Files with missing lines | Coverage Δ |
|---|---|
| stats/opentelemetry/opentelemetry.go | 73.07% <ø> (ø) |
| stats/opentelemetry/trace.go | 90.32% <ø> (-1.99%) ⬇️ |
| stats/opentelemetry/client_tracing.go | 86.36% <78.57%> (-2.73%) ⬇️ |

... and 19 files with indirect coverage changes


@eshitachandwani
Member

Hi @ulascansenturk,

Thank you for submitting this PR; we appreciate your contribution.

We had an earlier PR (#8342) that addressed this issue and was merged, but we had to revert it because it introduced a flaky test: Test/TraceSpan_WithRetriesAndNameResolutionDelay. The failed run is linked here for context: [failed run link].

Our ideal path forward is to cherry-pick the original PR (#8342) and include a fix for the flaky test in a follow-up commit. This ensures the original author receives proper credit.

The main requirement now is to identify and fix the cause of the test flakiness. Could you please investigate this and submit the necessary fix? Your help in resolving this would be greatly appreciated.

vinothkumarr227 and others added 2 commits November 20, 2025 13:51
Fixes: grpc#8299

RELEASE NOTES:

- stats/opentelemetry: Retry attempts (`grpc.previous-rpc-attempts`) are
now recorded as span attributes for non-transparent client retries.
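
For context on the release note above, here is a minimal sketch of what recording that attribute looks like with the OpenTelemetry Go API. The helper name and the `previousAttempts` parameter are illustrative, not the actual client_tracing.go internals:

```go
package example

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordPreviousAttempts is an illustrative helper (not the actual gRPC-Go
// code): it records how many retries preceded the current attempt as an
// attribute on the attempt's span, so the retry count is visible per call
// in the exported trace rather than being lost between attempts.
func recordPreviousAttempts(span trace.Span, previousAttempts uint32) {
	span.SetAttributes(attribute.Int64("grpc.previous-rpc-attempts", int64(previousAttempts)))
}
```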
…tionDelay

This commit fixes the flaky test TestTraceSpan_WithRetriesAndNameResolutionDelay
which was introduced in the previous commit and caused PR grpc#8342 to be reverted.

Root Cause:
The test had race conditions related to timing:
1. The goroutine that updates resolver state could complete before or after
   the delayed resolution event was fully processed and recorded in spans
2. Span export timing was not synchronized with test validation, causing
   the test to sometimes check spans before they were fully exported

Fix:
1. Added 'stateUpdated' event to synchronize between the resolver state
   update completing and span validation beginning
2. Added explicit wait for the stateUpdated event before validating spans
3. Added a 50ms sleep after RPC completion to give the span exporter time
   to process and export all spans before validation

Testing:
- Test now passes consistently (10+ consecutive runs)
- Passes with race detector enabled (-race flag)
- No data races detected

Fixes grpc#8700
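
A hedged sketch of the synchronization pattern described in this commit message; the `stateUpdated` channel, the 5-second timeout, and the helper name are invented for illustration and are not the actual test code:

```go
package example

import (
	"testing"
	"time"
)

// waitForStateUpdate illustrates the fix: the goroutine that updates resolver
// state closes stateUpdated when it is done, and the test blocks on that
// signal before validating spans, instead of racing against the delayed
// resolution event.
func waitForStateUpdate(t *testing.T, stateUpdated <-chan struct{}) {
	t.Helper()
	select {
	case <-stateUpdated:
		// Resolver state update has completed; spans for the delayed
		// resolution event should now be recorded.
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting for resolver state update")
	}
	// Give the span exporter a moment to flush before validation, mirroring
	// the short sleep described above.
	time.Sleep(50 * time.Millisecond)
}
```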
@ulascansenturk ulascansenturk force-pushed the fix/8700-otel-retry-per-call-tracking branch from 9b6371e to b1732c6 on November 20, 2025 at 10:54
@ulascansenturk
Contributor Author

Hi @eshitachandwani,

Thank you for the guidance! I've restructured the PR as requested:

  1. Cherry-picked the original PR #8342 (stats/opentelemetry: record retry attempts from clientStream), preserving @vinothkumarr227's authorship
  2. Fixed the flaky test in a follow-up commit

Root cause: Timing race conditions between resolver state updates and span validation.
Verified with 10+ test runs and the race detector; no flakiness detected.


Development

Successfully merging this pull request may close these issues.

otel: retries must be tracked per-call and not per-attempt
