Flaky race test check profiler #2187
Conversation
@@ -159,7 +159,7 @@ public void checkProfilerWorking() {

   private static void doSleep() {
     try {
-      Thread.sleep(100);
+      Thread.sleep(300);
So this has to be long enough for the inferred span code to observe it? WDYT if instead of sleeping you did something like:
Span tx = tracer.spanBuilder("my-root").startSpan();
try (Scope scope = tx.makeCurrent()) {
  await().until(() -> !AutoConfiguredDataCapture.getSpans().isEmpty());
} finally {
  tx.end();
}
could that result in a more reliable test?
The inferred span would not be created before the span ends because that's how it infers the time. So that suggestion should timeout.
The current sequence should be
- profiler is started (we waited for that, or the assertion would have failed)
- we start a 100ms span
- the profiler is sampling at 5ms intervals so more than enough traces to infer a span (we only need 2, and we'd expect around 20 here)
- the span finishes and we generate the span first, then after a short delay the inferred span
- the await is waiting for both spans
Hmm, actually as I write this, it seems to me that the failure I am suggesting fixing would be a timeout failure. In fact, shouldn't this only fail with a timeout, with the "size of" assertion as a sub-reason why it timed out?
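For illustration, a minimal sketch of an await on both spans (assuming Awaitility plus AssertJ, and the AutoConfiguredDataCapture helper mentioned above), where a too-small span count would surface only as the cause of a timeout:

```java
import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;

import java.time.Duration;

// Wait for the ordinary span plus the inferred span; if the inferred span is
// never produced, Awaitility fails with a timeout and reports the last failed
// "size of" assertion as the underlying reason.
await()
    .atMost(Duration.ofSeconds(10))
    .untilAsserted(
        () -> assertThat(AutoConfiguredDataCapture.getSpans()).hasSizeGreaterThanOrEqualTo(2));
```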
> The inferred span would not be created before the span ends because that's how it infers the time. So that suggestion should timeout.

That is unfortunate. My experience is that short timeouts in tests can cause instability. In CI there may be things running in parallel which could also have a negative effect. I'd try with a 1s sleep.
> the profiler is sampling at 5ms intervals so more than enough traces to infer a span (we only need 2, and we'd expect around 20 here)

According to the output in https://scans.gradle.com/s/k77oi6bbrx2ew/tests/task/:inferred-spans:test/details/io.opentelemetry.contrib.inferredspans.InferredSpansAutoConfigTest/checkProfilerWorking()?page=eyJvdXRwdXQiOnsiMCI6MX19&top-execution=1 you got 0 stack traces:
FINE: Parsing JFR chunk at offset
Aug 20, 2025 8:51:13 PM io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler processTraces
FINE: Processing 0 stack traces
Aug 20, 2025 8:51:13 PM io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent handle
FINE: Handling event timestamp=3,215,835,057,958 root=true threadId=1 activation=true
Aug 20, 2025 8:51:14 PM io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent startProfiling
FINE: Create call tree (ImmutableSpanContext{traceId=7dcd14b33b36f375ace29d344a0a3118, spanId=824f6b362b41c0af, traceFlags=01, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}(clock-anchor: 1755719857533603501)) for thread 1
Aug 20, 2025 8:51:14 PM io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent handle
FINE: Handling event timestamp=3,216,133,279,166 root=true threadId=1 activation=false
Aug 20, 2025 8:51:14 PM io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent stopProfiling
FINE: End call tree (ImmutableSpanContext{traceId=7dcd14b33b36f375ace29d344a0a3118, spanId=824f6b362b41c0af, traceFlags=01, traceState=ArrayBasedTraceState{entries=[]}, remote=false, valid=true}(clock-anchor: 1755719857533603501)) for thread 1
Aug 20, 2025 8:51:14 PM io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler$ActivationEvent stopProfiling
FINE: Created no spans for thread 1 (count=0)
Aug 20, 2025 8:51:14 PM io.opentelemetry.contrib.inferredspans.internal.SamplingProfiler processTraces
FINE: Processing traces took 41,626us
Which implies that the profiler was suspended for the whole 100ms sleep. That's going to be rare but possible in the shared CI environment, and would explain why the failure is rare. I'll bump the numbers even more, and apply the same change to the other test failure as well.
Note that the failure mentioned in #2149 (comment) https://gradle.com/s/uevvy2bsaau5k is in a different test.
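For reference, a minimal sketch of the bumped sleep under discussion (the exact value and the interrupt handling here are assumptions, not the final change; 1s is just the number floated above):

```java
private static void doSleep() {
  try {
    // Long enough that the profiler's 5ms sampling interval observes the span
    // even if the process is briefly suspended on a busy CI host.
    Thread.sleep(1000);
  } catch (InterruptedException e) {
    // Assumed handling for the sketch: restore the interrupt flag and fail fast.
    Thread.currentThread().interrupt();
    throw new IllegalStateException(e);
  }
}
```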
closes #2149