-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add E2E Integration Test For Adaptive Sampling Processor #5951
base: main
Are you sure you want to change the base?
[WIP] Add E2E Integration Test For Adaptive Sampling Processor #5951
Conversation
Signed-off-by: Mahad Zaryab <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5951 +/- ##
==========================================
- Coverage 96.91% 96.89% -0.03%
==========================================
Files 349 349
Lines 16587 16598 +11
==========================================
+ Hits 16076 16082 +6
- Misses 328 333 +5
Partials 183 183
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
# Copyright (c) 2024 The Jaeger Authors. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
BINARY ?= all-in-one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all-in-one is a v1 style binary. I would prefer we test v2 version (or at least both, but v2 is higher priority and testing v1 at this point is wasted work since v1 will be EOLed in a year)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yurishkuro I was trying to test jaeger
binary but I can't seem to access port 14268. Do you know why that is? I left more details on the issue: #5717 (comment)
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
@@ -152,7 +152,7 @@ func (a *aggregator) HandleRootSpan(span *span_model.Span, logger *zap.Logger) { | |||
} | |||
samplerType, samplerParam := span.GetSamplerParams(logger) | |||
if samplerType == span_model.SamplerTypeUnrecognized { | |||
return | |||
samplerType = span_model.SamplerTypeProbabilistic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yurishkuro what kind of a config do we want to add to perform this override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like "do not check sampler tags"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yurishkuro should this config be exposed as part of the YAML configuration? or do we just want it to be internal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be user settable
// } | ||
// } | ||
// return false | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yurishkuro this causes the unit tests to fail and I believe its messing with the calculations as well. Any ideas on how we can get around this? If we don't hardcode this here however, the probability only gets calculated once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's very difficult to troubleshoot like this. I would suggest maybe altering tracegen and manually adding the sampler.type=probabilistic / sampler.param=0.5 (any value for now) attributes to the span to see how the system reacts to this. To my knowledge aside from this check the probability used by the sampler should not be affecting the calculations, but I may be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and another thing would help is to expose internal state via expvar so that we can actually monitor how that state changes.
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
- name: Setup Node.js version | ||
uses: ./.github/actions/setup-node.js |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure you need this, unless the test specifically checks that the UI is able to render the metrics
Signed-off-by: Mahad Zaryab <[email protected]>
@yurishkuro I added the expvar reporting to debug the first element in the service cache. Here is what I see after the first few intervals. Do these calculations make sense to you?
|
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
Signed-off-by: Mahad Zaryab <[email protected]>
@@ -398,7 +401,7 @@ func (p *PostAggregator) isUsingAdaptiveSampling( | |||
// before. | |||
if len(p.serviceCache) > 1 { | |||
if e := p.serviceCache[1].Get(service, operation); e != nil { | |||
return e.UsingAdaptive && !FloatEquals(e.Probability, p.InitialSamplingProbability) | |||
return !FloatEquals(e.Probability, p.InitialSamplingProbability) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yurishkuro with this patch, the numbers seem to make a bit more sense. here's the output i see now
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{1 true}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.20840949054166666 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.021478798306074933 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.012634530863339348 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.02105483853348014 false}]]"
….
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.08421935413392057 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.1403881259748575 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.08254900503950167 false}]]"
"post_aggregator_service_cache[0]": "map[tracegen:map[lets-go:{0.051602341629876265 false}]]"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me know if you have any thoughts on how to proceed here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
post_aggregator_service_cache
name is unclear, are these probabilities, or throughput?- the Boolean value at the end looks suspicious, what does it mean? If it's "using adaptive sampling" indicator then we need to know why it goes to false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- its the sampling cache entry where the first value is the probability and the second is the using adaptive sampling indicator
- i mostly reverted the old
isAdaptiveSamplingLogic
aside from the e.UsingAdaptive check. When I had it hardcoded to true, the values didn't make sense (see [WIP] Add E2E Integration Test For Adaptive Sampling Processor #5951 (comment))
@@ -346,6 +348,7 @@ func (p *PostAggregator) calculateProbability(service, operation string, qps flo | |||
Probability: oldProbability, | |||
UsingAdaptive: usingAdaptiveSampling, | |||
}) | |||
safeexpvar.SetString("post_aggregator_service_cache[0]", fmt.Sprintf("%v", p.serviceCache[0].ToValue())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tostring loses important information, we should use hierarchical expvar.Map
Which problem is this PR solving?
Description of the changes
How was this change tested?
Checklist
jaeger
:make lint test
jaeger-ui
:yarn lint
andyarn test